Opening Large CSV Files in R

Before heading home for the holidays, I had a large data set (1.6 GB with over 1.25 million rows) with columns of text and integers ripped out of the company (Kwelia) Database and put into a .csv file since I was going to be offline a lot over the break. I tried opening the csv file in the usual way:

all <- read.csv("file.csv")

However it never finished even after letting it go all night. I also tried reading it into a SQLlite database first and reading it out of that, but the file was so messy it kept coming back with errors. I finally got it read in to R by using the ff package and the following code:

library("ff")
x<- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE, first.rows=10000, next.rows=50000, colClasses=NA)

Because the file was so messy, I had to turn off column classes (colClasses=NA) to have the read ignore giving each column a class on the first 10,000. After reading the first 10,000 rows, the script then reads in chunks of 50,000 so as to not completely overload the ram in my laptop. I also turned Verbose because it would drive me nuts to not be able to follow the progress.

About these ads

12 thoughts on “Opening Large CSV Files in R

  1. Pingback: Momento R do dia « De Gustibus Non Est Disputandum

      • stathack – confused by your comment. If you’re referring to the date issue in that linked thread, that was related to automatic separator detection if the _first_ column was a date. And is now fixed (sep is passed through now). fread hasn’t been released yet, still in beta, but it might be worth a shot. It can certainly cope with columns of mixed types.

  2. I just load large multi-gigabyte delimited files into a sqlite database then use the sqldf package to make queries. This also allows you to do analysis of very large files without worrying about the vector limits on windows based machines, (if you happen to be tied to one for whatever reason), and makes all the work so much faster.

  3. I often read large CSV files about this size using the read.csv() function, but here it’s much faster. Last month I frequently read a 700MB .csv of size about 2M rows x 100 columns with mostly numerics and a few factors, and in my log, it took less than 3 minutes on R 2.15.2 on Windows 7 64-bit on a Dell Latitude laptop with an SSD and Intel i7 CPU and 8GB RAM. I get similar performance on Amazon EC2′s machines on Linux.

  4. The speed of read.table.ffdf depends on how you set the argument next.rows. If you set it to more than the number of rows in your data, read.table.ffdf will give similar performance as read.table/read.csv as it will read all data at once, but then you might risk overblowing your RAM. So in this example it reads in chunks of 50000 which is rather small but if you are working on a 32 bit laptop computer with only 2Gb of RAM, this will still work. If you have more RAM and 64bit you might consider using it by increasing next.rows

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s