analyze the american community survey (acs) with r and monetdb

think of the american community survey (acs) as the united states' census for off-years - the ones that don't end in zero.  every year, one percent of all americans respond, making it the largest complex sample administered by the u.s. government (the decennial census has a much broader reach, but since it attempts to contact 100% of the population, it's not a survey).  the acs asks how people live and although the questionnaire only includes about three hundred questions on demography, income, insurance, it's often accurate at sub-state geographies and - depending how many years pooled - down to small counties.  households are the sampling unit, and once a household gets selected for inclusion, all of its residents respond to the survey.  this allows household-level data (like home ownership) to be collected more efficiently and lets researchers examine family structure.  the census bureau runs and finances this behemoth, of course.

the downloadable american community survey ships as two distinct household-level and person-level comma-separated value (.csv) files.  merging the two just rectangulates the data, since each person in the person-file has exactly one matching record in the household-file.  for analyses of small, smaller, and microscopic geographic areas, choose one-, three-, or five-year pooled files.  use as few pooled years as you can, unless you like sentences that start with, "over the period of 2006 - 2010, the average american ... [insert yer findings here]."

the acs weights generalize to the whole united states population including individuals living in group quarters, but non-residential respondents get an abridged questionnaire, so most (not all) analysts exclude records with a relp variable of 16 or 17 right off the bat.  this new github repository contains three scripts:
download all microdata.R
  • initiate a monetdblite database in the current working directory
  • download, unzip, and import each file for every year and size specified by the user
  • create and save household- and merged/person-level replicate weight complex sample designs
fair warning: this full script takes a loooong time.  run it friday afternoon, commune with nature for the weekend, and if you've got a fast processor and speedy internet connection, monday morning it should be ready for action.  otherwise, either download only the years and sizes you need or - if you gotta have 'em all - run it, minimize it, and then don't disturb it for a week.

2011 single-year - analysis examples.R
  • run the well-documented block of code to re-initiate the monetdb server
  • load the r data file (.rda) containing the replicate weight designs for the single-year 2011 file
  • perform the standard repertoire of analysis examples, only this time using monetdblite
replicate census estimates - 2011.R
  • run the well-documented block of code to re-initiate the monetdb server
  • load the r data file (.rda) containing the replicate weight designs for the single-year 2011 file
  • match every nationwide statistic on the census bureau's estimates page

click here to view these three scripts

for more detail about the american community survey (acs), visit:


if you're just looking for a couple data points, you ought to give the census bureau's american factfinder a whirl.  it's a table creator (click here to watch me blab about table creators), so it's easy-to-use but inflexible.  here's a li'l tip: if you run a statistic using american factfinder and then the same statistic using these scripts, they will be close but won't match exactly.  it's not a mistake, and both are methodologically correct.

every now and then, grumpy lawmakers threaten to defund the acs because, well, it's expensive.  use it or lose it.

unless specified by the question's phrasing, most acs variables should be treated as point-in-time, as opposed to either annualized or ever during the year.  this distinction is particularly important for health insurance coverage.  think about these three statistics --
  • the number of americans who won't have health insurance at least once during this year
  • the number of americans without health insurance right now
  • the number of americans who won't ever have health insurance during this year
-- the number of americans without health insurance right now is the point-in-time variable, smaller than the at least once number but larger than the ever number.

although the automated ftp download program for this data set only retrieves files back as far as 2005, a nationwide version of the american community survey has been conducted since 2000.  i skipped those years for two reasons --
  • the sample size (the true strength of the modern acs) wasn't very large on the older files (the 2004 and 2011 single-year person-level files are 54mb and 580mb, respectively).  there's no reason to import these files into a monet database.
  • the replicate weighted design wasn't implemented until 2005, so the creation of a complex sample survey object isn't possible.  if you need to calculate standard errors for earlier years, you'll have to rely on a pita generalized variance formula instead.  evidence: the published estimates prior to 2005 don't include error columns.
-- but if it's critical for you to analyze this early data, those tables should be small enough to read into memory with read.csv. use wgtp for household-weighted and pwgtp for person-weighted statistics with wtd.mean and wtd.quantile functions in the Hmisc package.

confidential to sas, spss, stata, sudaan users: the decennial census is enshrined in our constitution.  your statistical software isn't.  time to transition to r.  :D