this blog announces obsessively-detailed instructions to analyze us government survey data with free tools: the r language, the survey package, and (for big data) sqlsurvey + monetdb.
the united states government spends billions of dollars each year surveying our population. if you have a computer and some energy, you should be able to unlock it for free, with transparent, open-source software, using reproducible techniques. we're in a golden era of public government data, but almost nobody knows how to mine it with technology designed for this millennium. i can change that, so i'm gonna. help. use it.
the computer code for each survey data set consists of three core components:
current analysis examples
- fully-commented, easy-to-modify examples of how to load, clean, configure, and analyze the most current data sets available.
massive ftp download automation
- no-changes-necessary programs to download every microdata file from every survey year as an r data file onto your local disk.
replication scripts
- match published numbers exactly to show that r produces the same results as other statistical languages. these are your rosetta stones, so you know the syntax has been translated into r properly.
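to make the load, configure, analyze sequence concrete, here is a minimal sketch. it is not one of the repository's scripts: it assumes the survey package is installed and uses the `api` example data that ships with that package rather than a government microdata file. the object names `api_design` and `api_mean` are made up for illustration.

```r
# assumes the survey package is installed: install.packages( "survey" )
library(survey)

# load: example data bundled with the survey package
# (a stand-in for a government microdata file)
data(api)

# configure: describe the complex sample design.
# apiclus1 is a one-stage cluster sample of school districts,
# with sampling weights in `pw` and population sizes in `fpc`
api_design <-
	svydesign(
		id = ~ dnum ,
		weights = ~ pw ,
		fpc = ~ fpc ,
		data = apiclus1
	)

# analyze: design-based mean of the `api00` score and its standard error
api_mean <- svymean( ~ api00 , api_design )

print( api_mean )
```

the point estimate is just the weighted mean of `api00`. what the design object adds is a standard error that accounts for the clustering, which is the number the replication scripts compare against published tables.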
want a more gentle introduction? read this flowchart, grab some popcorn, watch me talk at the dc r users group.
endorsements, citations, links, words on the street:
- the consumer expenditure survey microdata page, bureau of labor statistics
- the survey of consumer finances microdata page, federal reserve
- the health services research methods external resources page, academyhealth
- the r survey package homepage, r core contributor dr. thomas lumley
frequently asked questions
what if i would like to offer additional code for the repository, or can't figure something out, or find a mistake, or just want to say hi?
if it's related to a data set discussed in a blog post, please write it in the comments section so others might benefit from the response. otherwise, e-mail me directly. i love talking about this stuff, in case you hadn't noticed.
how do i get started with r?
either watch some of my two-minute tutorial videos or read this post at flowingdata.com.
r isn't that hard to learn, but you've gotta want it.
are you sure r matches other statistical software like sas, stata, and sudaan?
yes. i wrote this journal article outlining how r precisely matches these three languages with complex survey data.
but that journal article only provides comparisons across software for the medical expenditure panel survey. what about other data sets?
along with the download, importation, and analysis scripts, each data set in the repository contains at least one syntax example that exactly replicates the statistics and standard errors of some government publication, so you can be confident that the methods are sound.
does r have memory limits that prevent it from working with big survey data and big data in general?
sort of, but i've worked around them for you. all published analyses get tested on my clunky 2009-era windows 7 laptop (with four gigabytes of ram). larger data sets are imported and analyzed with sql, which keeps the data in an on-disk database instead of in ram, to accommodate analysts with limited computing resources.
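here is a rough sketch of the idea behind sql-backed analysis. it swaps in sqlite (through the DBI and RSQLite packages, assumed installed) for monetdb so that it runs without a database server, and the table `toy` with its columns `income` and `wgt` is hypothetical. the database does the arithmetic on disk and hands r a single row.

```r
# assumes the DBI and RSQLite packages are installed
library(DBI)

# an on-disk sqlite file stands in for monetdb in this sketch
db_file <- tempfile( fileext = ".sqlite" )
con <- dbConnect( RSQLite::SQLite() , db_file )

# hypothetical toy microdata: one analysis variable and one sampling weight
toy <- data.frame( income = c( 10 , 20 , 30 , 40 ) , wgt = c( 1 , 2 , 3 , 4 ) )
dbWriteTable( con , "toy" , toy )

# the weighted mean is computed inside the database,
# so only the one-row result comes back into r's memory
res <-
	dbGetQuery(
		con ,
		"SELECT SUM( income * wgt ) / SUM( wgt ) AS wmean FROM toy"
	)

print( res$wmean )

dbDisconnect( con )
```

this only produces a point estimate. design-based standard errors take more work than a single query.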
why does this blog use a github repository as a back-end?
github is designed to host computer syntax that gets updated frequently. blogs don't go there.
why does your github repository use this blog as a front-end?
most us government survey data sets become available on a regular basis (many are annual, but not all). if you use these scripts, you probably don't care about every little change that i make to the underlying computer code (which you can view by clicking here).
but you probably want to be alerted when new data become available (which you can follow with rss or by entering your e-mail address on the left of this page).
what is github?
a version control website.
what is version control?
it's like the track changes feature in microsoft word, only specially-designed for computer code.
what else do i need to analyze us government survey data?
all scripts get tested on 64-bit windows 7 with the latest version of r and the latest version of the survey package. you can probably get everything to work on a macintosh or unix-based machine with minimal tweaks, but if you run into problems or would like to contribute code to accommodate other platforms, please e-mail me about it.
what is SAScii?
(too) many data sets produced by the us government include only a fixed-width ascii file and a sas-readable importation script. r is expert at loading in csv, spss, stata, sas transport, even sas7bdat files, but (until SAScii) couldn't read the block of code written for sas to import fixed-width data. click here to see what others have to say about it.
a few of the importation scripts in the repository use a sql-based variant of SAScii to prevent overloading ram. but don't worry, everything gets loaded automagically when you run the program.
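to show what a fixed-width import involves, here is a base r sketch that does not use SAScii at all. the two-record file, the column names, and the widths are all hypothetical. in this sketch they are typed by hand, whereas a sas importation script already spells out that information for a real data set.

```r
# hypothetical fixed-width file: id in columns 1-3, sex in column 4, age in columns 5-6
fwf_file <- tempfile()
writeLines( c( "001M25" , "002F31" ) , fwf_file )

# the layout a sas input block would describe, for example:
#   INPUT id 1-3 sex $ 4 age 5-6 ;
# written out by hand here
col_names <- c( "id" , "sex" , "age" )
col_widths <- c( 3 , 1 , 2 )

# base r reads the file once it knows the widths
x <-
	read.fwf(
		fwf_file ,
		widths = col_widths ,
		col.names = col_names ,
		colClasses = c( "integer" , "character" , "integer" )
	)

print( x )
```

typing widths by hand is fine for three columns and miserable for three hundred, which is why reading the layout straight from the sas script matters.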
how many questions should a good faq answer?