Opening Large CSV Files in R

Before heading home for the holidays, I had a large data set (1.6 GB, with over 1.25 million rows) of text and integer columns ripped out of the company (Kwelia) database and put into a .csv file, since I was going to be offline a lot over the break. I tried opening the .csv file in the usual way:

all <- read.csv("file.csv")

However, it never finished, even after letting it run all night. I also tried reading it into an SQLite database first and then reading it out of that, but the file was so messy it kept coming back with errors. I finally got it read into R by using the ff package and the following code:

library("ff")
x <- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE, first.rows=10000, next.rows=50000, colClasses=NA)

Because the file was so messy, I had to turn off column classes (colClasses=NA) so that the read wouldn't assign each column a class based on the first 10,000 rows. After reading the first 10,000 rows, the script reads in chunks of 50,000 so as not to completely overload the RAM in my laptop. I also turned on VERBOSE because it would have driven me nuts not to be able to follow the progress.
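
Once read.csv.ffdf finishes, the data live mostly on disk rather than in RAM. As a minimal sketch (my addition, not part of the original post; the object name first_chunk and the 1,000-row slice are arbitrary), you can check the result and pull a small piece back into an ordinary data frame like this:

dim(x)   # rows and columns, reported without pulling the data into RAM

# Indexing an ffdf returns a regular data.frame, so only extract what fits in memory
first_chunk <- x[1:1000, ]
summary(first_chunk)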

12 thoughts on “Opening Large CSV Files in R”

      1. stathack – confused by your comment. If you’re referring to the date issue in that linked thread, that was related to automatic separator detection if the _first_ column was a date, and it’s now fixed (sep is passed through now). fread hasn’t been released yet, still in beta, but it might be worth a shot. It can certainly cope with columns of mixed types.

  1. I just load large multi-gigabyte delimited files into an SQLite database and then use the sqldf package to make queries. This also lets you analyze very large files without worrying about the vector limits on Windows-based machines (if you happen to be tied to one for whatever reason), and it makes all the work so much faster. (A short sketch of this approach appears after the comments below.)

  2. I often read large CSV files of about this size using the read.csv() function, and for me it’s much faster. Last month I frequently read a 700MB .csv of about 2M rows x 100 columns, mostly numerics with a few factors, and according to my log it took less than 3 minutes on R 2.15.2 on Windows 7 64-bit, on a Dell Latitude laptop with an SSD, an Intel i7 CPU, and 8GB of RAM. I get similar performance on Amazon EC2 machines running Linux.

  3. The speed of read.table.ffdf depends on how you set the next.rows argument. If you set it to more than the number of rows in your data, read.table.ffdf will give similar performance to read.table/read.csv, since it will read all the data at once, but then you risk blowing past your RAM. In this example it reads in chunks of 50,000, which is rather small, but it will still work if you are on a 32-bit laptop with only 2GB of RAM. If you have more RAM and a 64-bit machine, you might consider using it by increasing next.rows, as in the sketch below.
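
To make comment 3 concrete, here is a minimal sketch (my addition; "file.csv" and the 500,000-row chunk size are just placeholders) of the same call tuned for a 64-bit machine with more RAM; larger chunks mean fewer passes over the file at the cost of more memory per chunk:

library("ff")
# 10x larger chunks than in the original post; keep an eye on memory use while it runs
x <- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE, first.rows=10000, next.rows=500000, colClasses=NA)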
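
And here is the short sqldf sketch promised above. Nothing in it comes from the original post: "file.csv", the column name price, and the filter value are placeholder assumptions. read.csv.sql stages the file in a temporary SQLite database and returns only the query result to R:

library("sqldf")
# dbname=tempfile() keeps the staging database on disk, so RAM only holds the query result;
# the data are referred to as "file" in the SQL statement
subset_df <- read.csv.sql("file.csv", sql="select * from file where price > 1000", header=TRUE, dbname=tempfile())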
