analyze the basic stand alone medicare claims public use files (bsapufs) with r and monetdb

the centers for medicare and medicaid services (cms) took the plunge.  the famous medicare 5% sample has been released to the public, free of charge.  jfyi - medicare is the u.s. government program that provides health insurance to 50 million elderly and disabled americans.  the basic stand alone medicare claims public use files (bsapufs) contain either person- or event-level data on inpatient stays, durable medical equipment purchases, prescription drug fills, hospice users, doctor visits, home health provision, outpatient hospital procedures, skilled nursing facility short-term residents, as well as aggregated statistics for medicare beneficiaries with chronic conditions and medicare beneficiaries living in nursing homes.  oh sorry, there's one catch: they only provide sas scripts to analyze everything.  cue the villian music.  that bored old game of monopoly ends today.

the initial release of the 2008 bsapufs was accompanied by some major fanfare in the world of health policy, a big win for government transparency.  unfortunately, the final files that cleared the confidentiality hurdles are heavily de-identified and obfuscated.  prime examples:
  • none of the files can be linked to any other file.  not across years, not across expenditure categories
  • costs are rounded to the nearest fifth or tenth dollar at lower values, nearest thousandth at higher values
  • ages are categorized into five year bands
so these files are baldly inferior to the unsquelched, linkable data only available through an expensive formal application process.  any researcher with a budget flush enough to afford a sas license (the only statistical software mentioned in the cms official documentation) can probably also cough up the money to buy the identifiable data through resdac (resdac, btw, rocks).


cms released free public data sets that could only be analyzed with a software package costing thousands of dollars.  so even though the actual data sets were free, researchers still needed deep pockets to buy sas.  meanwhile, the unsquelched and therefore superior data sets are also available for many thousands of dollars.  researchers with funding would (reasonably) just buy the better data.  researchers without any financial resources - the target audience of free, public data - were left out in the cold.  no wonder these bsapufs haven't been used much.

that ends now.  using r, monetdb, and the personal computer you already own (mine cost $700 in 2009), researchers can, for the first time, seriously analyze these medicare public use files without spending another dime.  woah.  plus hey guess what all you researcher fat-cats with your federal grant streams and your proprietary software licenses: r + monetdb runs one heckuva lot faster than sas.  woah^2.  dump your sas license water wings and learn how to swim.  the scripts below require monetdb.  click here for step-by-step instructions of how to install it on windows and click here for speed tests.  vroom.

since the bsapufs comprise 5% of the medicare population, ya generally need to multiply any counts or sums by twenty.  although the individuals represented in these claims are randomly sampled, this data should not be treated like a complex survey sample, meaning that the creation of a survey object is unnecessary.  most bsapufs generalize to either the total or fee-for-service medicare population, but each file is different so give the documentation a hard stare before that eureka moment.  this new github repository contains three scripts:

2008 - download all csv files.R
  • loop through and download every zip file hosted by cms
  • unzip the contents of each zipped file to the working directory

2008 - import all csv files into monetdb.R
  • create the batch (.bat) file needed to initiate the monet database in the future
  • loop through each csv file in the current working directory and import them into the monet database
  • create a well-documented block of code to re-initiate the monetdb server in the future

2008 - replicate cms publications.R
  • initiate the same monetdb server instance, unsing the same well-documented block of code as above
  • replicate nine sets of statistics found in data tables provided by cms

click here to view these three scripts

for more detail about the basic stand alone medicare claims public use files (bsapufs), visit:


the replication script has oodles of easily-modified syntax and should be viewed for analysis examples.  if you know the name of the data table you want to examine, you can quickly modify these general monetdb analysis examples too.  just run sql queries - sas users, that's "proc sql;" for you.  never used sql? start fresh with this tutorial.  once you know the sql command you want to run on the data, you're almost done.  for operations that make changes to the data tables, use dbSendUpdate().  for operations that only read the data tables, use dbGetQuery().

don't ever use dbReadTable() on the outpatient, carrier, dme, or prescription drug event tables - they'll likely cause r to crash.

if you need the more advanced statistical functions described on the sqlsurvey homepage but not available in monetdb's flavor of sql, you could potentially create a taylor-series sqlsurvey() object with a weight column full of twenties and a strata+psu column with all ones.  the statistics should be correct, but if the columns in your analysis include any missing data, the variances might be wider (so more conservative) than those computed with monetdb's stddev() function.

confidential to sas, spss, stata, and sudaan users: why are you using software that's twenty years shy of medicare eligibility itself?  time to transition to r.  :D


  1. Hi Anthony,

    Still trying out RMonetDB! Love your posts. Thanks for so many details ! Is there a possibility of specifying the type of columns I am reading while importing a csv file ? Sometimes I have columns which have numbers/text but it fails to recognize it as varchar!

    1. if you are using, just increase the nrow.check parameter (it defaults to 500 but you can make it 1,000,000 if you have enough RAM)

      if you want to read it in manually - and specify the column types yourself - you can re-create the read.SAScii.monetdb() function, but it's a lot more work ;)

  2. really nice work. just curious, why monetdb and not something that would be available across OSs such as SQLite (available through sqldf)?

    1. monetdb does work on any os ( it shouldn't be hard to modify my work to get it running on something other than windows

      sas and sqlite are turtles compared to monetdb (review speed tests in the monetdb post - 67 million records processed in 125 seconds). the setup might be harder for monet, but i've done most of the heavy lifting already ;)

    2. interesting, will give it a shot over the next few days