column-store R or: how i learned to stop worrying and love monetdb


"Combining R's sophisticated calculations and MonetDB's excellent data access performance is a no-brainer. One gets the best of two (open source) worlds with minimal hassle." - Dr. Hannes Mühleisen


"oh wow that was fast like a cheetah with a jetpack or something" - anthony damico




why try monetdb + r


a speed test of four analysis commands on sixty-seven million physician visit records using my personal laptop --

 
# calculate the sum, mean, median, and count of a single variable with sql
system.time( dbGetQuery( db , 'select sum( car_hcpcs_pmt_amt ), avg( car_hcpcs_pmt_amt ), median( car_hcpcs_pmt_amt ), count(*) from carrier08' ) )
   user  system elapsed
   0.01    0.00   16.86
seconds

# calculate the same statistics, broken down by six age and two gender categories

system.time( dbGetQuery( db , 'select bene_sex_ident_cd, bene_age_cat_cd, sum( car_hcpcs_pmt_amt ), avg( car_hcpcs_pmt_amt ), median( car_hcpcs_pmt_amt ), count(*) from carrier08 group by bene_sex_ident_cd, bene_age_cat_cd' ) )
   user  system elapsed
   0.00    0.01   36.03
seconds

# calculate the same statistics, broken down by six age, two gender,

# and 924 icd-9 diagnosis code categories
system.time( dbGetQuery( db , 'select bene_sex_ident_cd, bene_age_cat_cd, car_line_icd9_dgns_cd, sum( car_hcpcs_pmt_amt ), avg( car_hcpcs_pmt_amt ), median( car_hcpcs_pmt_amt ), count(*) from carrier08 group by bene_sex_ident_cd, bene_age_cat_cd, car_line_icd9_dgns_cd' ) )
   user  system elapsed
   0.04    0.00   20.50
seconds


 -- you're not using computer hardware built in 1966, you shouldn't use software written for that era, either.





how to get started with monetdb + r


click here to install (free, open-source) monetdb on your windows computer in two count them two easy steps

..now that you have monetdb and a few r packages installed..

..once you've got your database set up..





a brief history of monetdb + r

like so many other legends, our story begins at the 2007 r statistical programming conference in iowa.  dr. thomas lumley presented his idea for big survey data to his contemporaries, who - in predictable contemporary form - failed to acknowledge its genius.  over the next half-decade, only ill-advised attempts were made at analyzing the big survey data beast.

for work (invention's mama), i needed to access the fifteen-million record, five year american community survey files, but since database-backed survey objects read all replicate weights into ram, my sixteen gigabyte desktop hissed, popped, crapped out.  so i e-mailed dr. lumley and asked for ideas.  next thing i know, he had developed..
..but the folks in amsterdam caught on, assigning codesmith dr. hannes mühleisen to the task of writing a java-free connector.  on cran since february of 2013, the MonetDB.R package has made barefoot many analysts were wearing socks just a second ago.





summary()

turns out, monetdb is lightning fast on any big data, not just surveys.  no reason for demographers to hog all the fun.  pro tip: if you're not interested in survey data and just want to use monetdb for its big data chops, start with the bsapuf example code - that's not survey data either.


for more detail about monetdb, visit:

notes:

there's a price for such fast data access.  importing a table into monetdb takes a while, so prepare to let your initial import code run overnight.  it's a good deal: leaving your computer on for one night sure beats running hour-long commands for every new analysis.

the MonetDB.R and sqlsurvey packages are experimental.  the more you use them, the sooner they won't be.  if you hit something you don't understand (especially a line of code that works without error in the survey package), read the documentation carefully before contacting the author.  sqlsurvey commands do not uniformly behave like survey commands.

remember, all scripts on this archive work on my 2009-era windows seven laptop (with four gigabytes of ram).  by default, r reads objects into memory.  when a data set is too big, the analysis scripts presented on this website work around memory limitations by connecting to either a monetdb (speedy) or sqlite (easy-to-use) database.

many government data sets are only available as fixed-width (flat) files accompanied by a sas import script, and the big data that necessitates MonetDB.R is no exception.  i've written a variant of the read.SAScii() function to import ascii files directly into a monet database all in one step.  you may notice it in the code for some of the large surveys.

confidential to sas, spss, stata, sudaan users: if you start analyzing big data with r + monetdb, you will no longer have to wait around long enough to take a coffee break after running each command.  for that, i apologize. :D

No comments:

Post a Comment