why and how to install monetdb with r on windows


warning: the instructions below are obsolete.  please check this page for the latest version.  <3 anthony


why

a speed test of three sql queries on sixty-seven million records using my personal computer --


# calculate the sum, mean, median, and standard deviation of a single variable
system.time( dbGetQuery( db , 'select sum( car_hcpcs_pmt_amt ), avg( car_hcpcs_pmt_amt ), median( car_hcpcs_pmt_amt ), stddev( car_hcpcs_pmt_amt ), count(*) from carrier08' ) )

   user  system elapsed
   0.00    0.03   25.96


# calculate the same statistics, broken down by six age and two gender categories
system.time( dbGetQuery( db , 'select bene_sex_ident_cd, bene_age_cat_cd, sum( car_hcpcs_pmt_amt ), avg( car_hcpcs_pmt_amt ), median( car_hcpcs_pmt_amt ), stddev( car_hcpcs_pmt_amt ), count(*) from carrier08 group by bene_sex_ident_cd, bene_age_cat_cd' ) )

   user  system elapsed
   0.00    0.02  121.56


# calculate the same statistics, broken down by six age, two gender, and 924 icd-9 diagnosis code categories
system.time( dbGetQuery( db , 'select bene_sex_ident_cd, bene_age_cat_cd, car_line_icd9_dgns_cd, sum( car_hcpcs_pmt_amt ), avg( car_hcpcs_pmt_amt ), median( car_hcpcs_pmt_amt ), stddev( car_hcpcs_pmt_amt ), count(*) from carrier08 group by bene_sex_ident_cd, bene_age_cat_cd, car_line_icd9_dgns_cd' ) )

   user  system elapsed
   0.30    0.03  125.16


-- you're not using computer hardware built in 1966, you shouldn't use software written for that era, either.



how

click here to install (free, open-source) monetdb on your windows computer in four easy steps

..now that you have monetdb and a few r packages installed..


..once you've got your database set up..





bonus section: backstory

like so many other legends, our story begins at the 2007 r statistical programming conference in iowa.  dr. thomas lumley presented his idea for big survey data to his contemporaries, who - in predictable contemporary form - failed to acknowledge its genius.  over the next half-decade, only ill-advised attempts were made at analyzing the big survey data beast.

for work (invention's mama), i needed to access the fifteen-million record, five year american community survey files, but since database-backed survey objects read all replicate weights into ram, my sixteen gigabyte desktop hissed, popped, crapped out.  so i e-mailed dr. lumley and asked for ideas.  next thing i know, he had developed:


turns out, monetdb is lightning fast on any big data, not just surveys.  no reason for demographers to hog all the fun.


for more detail about monetdb, visit:

notes:

there's a price for such fast data access.  importing a table into monetdb takes a while, so prepare to let your initial import code run overnight.  it's a good deal: leaving your computer on for one night beats running hour-long commands for every new analysis.

the RMonetDB and sqlsurvey packages are experimental.  the more you use them, the sooner they won't be.  if you hit something you don't understand (especially a line of code that works without error in the survey package), read the documentation carefully before contacting the author.  sqlsurvey commands do not uniformly behave like survey commands.

remember, all scripts on this archive work on my 2009-era windows seven laptop (with four gigabytes of ram).  by default, r reads objects into memory.  when a data set is too big, the analysis scripts presented on this website work around memory limitations by connecting to either a monetdb (speedy) or sqlite (easy-to-use) database.

the folks at monetdb have begun work on a direct monetdb-to-r connector.  until that's complete, dr. lumley's java-driven connector (the RMonetDB package) works just fine.

many government data sets are only available as fixed-width (flat) files accompanied by a sas import script, and the big data that necessitates RMonetDB is no exception.  i've written a variant of the read.SAScii() function to import ascii files directly into a monet database all in one step.  you may notice it in the code for some of the large surveys.

confidential to sas, spss, stata, sudaan users: if you start analyzing big data with r + monetdb, you will no longer have to wait around long enough to take a coffee break after running each command.  for that, i apologize. :D

6 comments:

  1. Hi Anthony,
    Great post ! Thank you so much for the step by step instructions on MonetDB. I'm definitely going to try this. I have one concern though. Will it be able to import multiple large excel/csv files ? I couldn't see any code on that. Looking forward to your reply.

    ReplyDelete
    Replies
    1. csv yes, excel no - i haven't seen the u.s. government publish any big data as .xls or .xlsx

      load RMonetDB and look at ?monet.read.csv

      my next few posts will also provide specific examples :)

      Delete
    2. Thanks Anthony !

      Really looking forward to the examples(when will you be posting?*eagerly awaiting*). Please do consider files which have been broken up into large multiple csv files and then their import into the MonetDB. Meanwhile I'm trying to make sense of monet.read.csv

      Delete
    3. if you're anxious, you can look around in the repository for what i'll describe on the blog soon. for example, line 269 imports multiple large csv files into a single table. notice the 'pde' object is a character vector containing multiple filepaths

      https://github.com/ajdamico/usgsd/blob/4b606c7edfc1a5d2c77671e1068cecd65a154353/Basic%20Stand%20Alone%20Medicare%20Claims%20Public%20Use%20Files/2008%20-%20import%20all%20csv%20files%20into%20monetdb.R#L269

      Delete
    4. Hi Anthony,

      I just ran through your code and got the jist. I tried using dbWriteTable with 'mtcars' and it was a success. However when I tried using the monet.read.csv code I'm getting errors !

      monet.read.csv(db,chk,'t',header=T,nrows=sapply(chk,countLines))
      Error in .local(conn, statement, ...) :
      execute JDBC update query failed in dbSendUpdate (Invalid identifier 'Row.Labels')

      Can you help me understand what the error could be ?
      chk=[1] "./GAS.csv"

      Thank you.

      Delete
    5. contact me directly if you need further help. but please do more research on your own before then ;)

      for example- your csv might have a row name that monetdb doesn't like, or it could have an invalid character somewhere in the data. break the csv file into smaller chunks (delete some rows, delete some columns, until you get it working.. then gradually put them back in until you figure out what monetdb doesn't like)

      Delete