analyze the demographic and health surveys (dhs) with r

professors of public health 101 probably cite the results of the demographic and health surveys (dhs) more than all other data sources combined.  funded by the united states agency for international development (usaid) and administered by the technically-savvy analysts at icf international, this collection of multinational surveys enters its third decade as the authoritative source of international development indicators.  want a sampler of what that all means?  load up the dhs homepage and watch the statistics fly by: 70% of beninese children younger than five sleep under an insecticide-treated bednet / more than a third of kyrgyz kids aged 6-59 months have anemia / only 35% of guinean households have a place to wash yer hands.  this is the front-and-center toolkit for professional epidemiologists who want to know who/what/when/where/why to target a public health intervention in any of these nations.

before you read any more about the microdata, look at statcompiler.  this online table creator might give you access to every statistic that you need, without the fuss, muss, or missing values of a person-level table.  (bonus: click here to watch me describe dhs-style online table creation from a teleprompter.)  why should you use statcompiler?  because it's quick, easy, and has aggregated statistics for every country at your fingertips.

if that doesn't dissuade you from digging into an actual data set, one more point of order: you'll likely only be given access to a small number of countries.  so when applying for access, it'd be smart to ask for whichever country you are interested in _and also_ for malawi 2004.  that way, you will be able to muck around with my example syntax using the data tables that it was written for.  if you have already registered, no fear: you can request that malawi be added to your existing project.

i tried requesting every data set.  i failed.  the data archivists do not grant access to more than a few countries unless you provide a legitimate research question that requires each dataset, and as i was only testing scripts, i received access to just a few countries.  also note that some surveys require permission from the implementing organization in the individual country, so access to those restricted countries is at that organization's discretion.  restricted surveys aside, these are generally public data:  so long as you have a legitimate research question, you'll be granted access to the majority of the datasets without cost.  this new github repository contains three scripts:

download and import.R

analysis examples.R

  • load the 2004 malawi individual recodes file into working memory
  • re-create some of the old school-style strata described in this forum
  • match a single row from pdf page 324 all the way across, deft and all.

click here to view these three scripts

for more detail about the demographic and health surveys (dhs), visit:


next to the main survey microdata set, you'll see some roman numerals ranging from one through six.  this number indicates which version of the survey manual that particular dataset corresponds to.  different versions have different questions, structures, and microdata files, so read the entire "general description" section (only about ten pages) of the manual before you even file your request for data access.

these microdata are complex, confusing, occasionally strangely-coded, and often difficult to reconcile with historical versions.  (century month codes? wowza.)  that's understandable, and the survey administrators deserve praise for keeping everything as coherent as they have after thirty years of six major questionnaire revisions of ninety countries of non-english-speaking respondents across this crazy planet of ours.  if you claw through the documentation and cannot find an explanation, you'll want to engage the user forum.  they are thoroughly responsive, impressively knowledgeable, and will help you get to the bottom of it - whatever `it` may be.  before you ask a question here, or really anywhere in life, have a solid answer to whathaveyoutried.  and for heavens' sakes,* prepare a reproducible example for them.

* my non-denominational way of saying heaven's sake.
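
about those century month codes: as far as i understand the standard definition, a cmc just counts months starting from january 1900 (which is month one), so cmc = ( year - 1900 ) * 12 + month.  here's a minimal sketch of converting back and forth under that assumption.  the three function names are my own placeholders, not part of any dhs software, and you should confirm the definition against the recode manual for your particular survey before trusting it.

```r
# hypothetical helper names - not part of any dhs software
# assumes the standard definition: january 1900 is cmc 1

# calendar year and month -> century month code
year_month_to_cmc <- function( year , month ) ( year - 1900 ) * 12 + month

# century month code -> calendar year
cmc_to_year <- function( cmc ) 1900 + ( cmc - 1 ) %/% 12

# century month code -> calendar month (1 through 12)
cmc_to_month <- function( cmc ) ( cmc - 1 ) %% 12 + 1

# round trip for october 2004
cmc <- year_month_to_cmc( 2004 , 10 )
cmc_to_year( cmc )
cmc_to_month( cmc )
```

subtracting one before the integer division is what keeps december from spilling into the following year.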

confidential to sas, spss, stata, and sudaan users: i would shake your hand but you've yet to adopt the statistical equivalent of coughing into your sleeve.  time to transition to r.  :D

analyze the american housing survey (ahs) with r

plenty of nationwide surveys collect information at the housing unit-level, but only the american housing survey (ahs) focuses on the physical structure rather than the inhabitants.  when asked to pick their favorite public-use file, urban planners, realty researchers, even data-driven squatters choose this one.  in action since 1973 (with available microdata dating back just as far), the survey is collected by our census bureau under contract with the united states department of housing and urban development (hud).  it follows a panel of both nationally- and metropolitan area-representative homes so that scientists (like you) can boldly answer questions about america's residential housing supply.

from 1973 until 1996, the survey administrators mushed all of the content from this survey into a single one-record-per-housing-unit consolidated table that they call a "flat file" - simple.  beginning in 1997, you have access to much more detailed information.  if you feel confused rather than empowered, walk through the various 2011 files with me.  background first:  the `control` column in the microdata is just the unique identifier for the housing unit.  in the 2011 release, it's appropriate to think of `tnewhouse` and `trepwgt` as the main files - those are the only files that have weights.  there are no person-level weights in this microdata.  you can make statements like, "the average american housing unit has x bathrooms" but not, "the average american lives in a housing unit with x bathrooms."  you cannot make a statement about average american anythings without sampling weights.  catch my drift?  alright, here's my description of each file using this structure:

  • tablename (number of records in 2011) [unique `control` numbers in 2011] - structure.  notes/description.

files, structures, descriptions of individual tables in the 2011 ahs public use file:

  • tnewhouse (186,448) [186,448] - one record per housing unit.  housing unit characteristics.  the main file.
  • trepwgt (186,448) [186,448] - one record per housing unit.  weight file.  needs to be merged onto the main file.
  • towner (60,572) [60,572] - one record per owner of rented unit.  not all homes have an outside owner, but the ones that do will merge onto `tnewhouse` by `control` one-to-one.
  • thomimp (147,329) [50,532] - one record per home improvement.  to uniquely identify each home improvement, use `control` plus `ras`.  some homes have multiple home improvements, others have none.
  • tmortg (56,507) [56,507] - one record per housing unit with a mortgage or home equity loan, maximum information captured: three of each.  not all homes have a mortgage or home equity loan (renters never do), but the ones that do will merge onto `tnewhouse` by `control` one-to-one.
  • tperson (339,453) [134,918] - one record per person.  to uniquely identify each person, use `control` plus `pline`.  some homes have multiple persons, others have none.
  • tratiov (8,166) [8,166] - one record per housing unit.  verification that the renter pays x amount when their reported income makes it seem implausible.
  • trmov (43,968) [39,464] - one record per movement group.  to uniquely identify each group of movers, use `control` plus `mvg`.  some homes have multiple movement groups, others have none.
  • ttypec (71,672) [71,672] - one record per housing unit available in prior years but not the current year.
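
to make the one-to-one versus one-to-many distinction in the list above concrete, here's a rough sketch using toy data frames.  every value is made up, and the `baths` and `weight` column names are placeholders rather than actual ahs variable names.  only `control` and `pline` come from the file descriptions above.

```r
# toy stand-ins for the real ahs tables - every value below is made up
# `baths` and `weight` are placeholder column names, not actual ahs variables
tnewhouse <- data.frame( control = c( "a01" , "a02" , "a03" ) , baths = c( 1 , 2 , 1 ) )
trepwgt <- data.frame( control = c( "a03" , "a01" , "a02" ) , weight = c( 150 , 100 , 250 ) )
tperson <- data.frame( control = c( "a01" , "a01" , "a03" ) , pline = c( 1 , 2 , 1 ) )

# one-record-per-housing-unit files should never repeat a `control` value
stopifnot( anyDuplicated( tnewhouse$control ) == 0 )
stopifnot( anyDuplicated( trepwgt$control ) == 0 )

# the person file repeats `control` but not the `control` + `pline` pair
stopifnot( anyDuplicated( tperson[ c( "control" , "pline" ) ] ) == 0 )

# merge the weights onto the main file, confirming that no records got lost
hu <- merge( tnewhouse , trepwgt , by = "control" )
stopifnot( nrow( hu ) == nrow( tnewhouse ) )

# a housing unit-level statement: weighted average bathrooms per housing unit
avg_baths <- weighted.mean( hu$baths , hu$weight )
```

checking for duplicated keys before every merge costs one line and catches the mistake of treating a one-to-many file like a one-to-one file.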

don't say i didn't warn you that this survey kicks ass.  ahh yes and if you are still perplexed by something, pdf page eleven of hud's documentation outlines what i've tried to summarize above in much more detail, using a mix of both capital and lowercase letters.  this new github repository contains four scripts:

download all microdata.R
  • download, import, save each and every american housing survey file onto your local computer
  • when a housing unit file and replicate weights file are both available, merge them.  you'll have to do it eventually, why not automate it from the start?
  • store all successfully-imported r data files (.rda) into a big fat sqlite database in case your computer isn't the newest edition

analysis examples.R
  • load a single housing unit-level data file, either into working memory or as a database-backed (ram-free) object
  • construct the complex sample survey object post-stratifying according to census bureau specifications
  • run example analyses that calculate perfect means, medians, quantiles, totals

merge and recode examples.R
  • recode some columns in the person-level table into other columns, inside the sql database
  • aggregate some person-level statistics into housing unit-level information just like hud's file flattener sas program
  • merge these aggregated person-level results onto the main housing unit-level file
  • re-construct a legitimate replicate-weighted database-backed survey design object, using the new person-level results you just created
  • repeat the four previous steps, but all in working memory rather than with a sql database - for more powerful computers only

  • fire up a sqlite-backed replicate-weighted survey design
  • match two separate statistics and standard errors in this census bureau publication
  • fire up the same design, sans sqlite-backing
  • repeat step two
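
the aggregate-then-merge steps described for the merge and recode script can be sketched in working memory with base r alone.  this is not the code from the github repository or hud's file flattener, just a toy illustration: the `age` column, the `n_persons` and `n_seniors` names, and every value are made up.

```r
# toy person-level and housing unit-level tables - all values are made up
tperson <- data.frame(
	control = c( "a01" , "a01" , "a03" , "a03" , "a03" ) ,
	pline = c( 1 , 2 , 1 , 2 , 3 ) ,
	age = c( 34 , 6 , 71 , 68 , 40 )
)
tnewhouse <- data.frame( control = c( "a01" , "a02" , "a03" ) )

# recode: flag every person, and flag persons aged 65 or older
tperson$n_persons <- 1
tperson$n_seniors <- as.numeric( tperson$age >= 65 )

# aggregate: collapse to one record per housing unit by summing the flags
persons <- aggregate(
	tperson[ c( "n_persons" , "n_seniors" ) ] ,
	by = tperson[ "control" ] ,
	FUN = sum
)

# merge: keep every housing unit, even the ones without any person records
hu <- merge( tnewhouse , persons , by = "control" , all.x = TRUE )

# housing units missing from the person file have zero persons, not missing persons
hu[ is.na( hu$n_persons ) , c( "n_persons" , "n_seniors" ) ] <- 0
```

the `all.x = TRUE` matters because some homes have no persons at all, and an inner merge would silently drop them from the housing unit-level file.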

click here to view these four scripts

for more detail about the american housing survey (ahs), visit:


it might not be perfectly clear from the documentation and they've yet to publish a core set of longitudinal weights for the various national periods and metropolitan samples, but the american housing survey is drawn from the same panel of housing units every other year.  when comparing the 2009 and 2011 unique identifiers (the `control` column), i found 55,065 matches.  you'd be smart to contact the (superhumanly responsive) quants who create this survey via their userlist to confirm your panel-based analysis strategy makes sense.
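
counting matches across two years boils down to intersecting the two `control` columns.  a minimal sketch with made-up identifiers, where the two vectors stand in for the `control` column of each year's housing unit file:

```r
# made-up identifiers standing in for the `control` column of two survey years
control_2009 <- c( "a01" , "a02" , "a03" , "a04" )
control_2011 <- c( "a02" , "a04" , "a05" )

# housing units that appear in both years
in_both <- intersect( control_2009 , control_2011 )

# number of matched housing units
length( in_both )
```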

when you think of a housing unit, you might informally refer to it as a place where you would expect people to have their own bathroom and kitchen for their excluuuusive use.  the american housing survey includes some assisted living settings, but excludes group quarters like dormitories, hospitals, military barracks, and most nursing homes.  for more detailed explanations, take a look at the methodology document and especially appendix b.

confidential to sas, spss, stata, and sudaan users:  knock knock.  who's there?  r.  r who?  aren't you glad you transitioned to r?

analyze the medicare current beneficiary survey (mcbs) with r

for over two decades now, researchers at cms have produced the definitive complex sample survey dataset of americans covered by medicare: the medicare current beneficiary survey (mcbs).  i bristle with righteous indignation when healthcare researchers tell me that medicare is boring because it's pushing fifty.  yeah listen close - in any nation, who gets sick the most?  older people and disabled people.  oh and who does medicare cover?  older people and disabled people.  your uncle leo might be a snore, but the dynamics of his government-provided health insurance are patently not.  in the world of american healthcare research, medicare is where the action is and mcbs is the richest tool for understanding that program.  here's what it's made of.

so why would this survey data with its measly fifteen thousand respondents be superior to the two-million-record chronic condition warehouse (ccw) or even the medicare public use files?  because those behemoths are just administrative claims, not substantive interviews with legit questionnaires.  and why does that matter?  well, as long as both are nationally-representative, i'd rather have a data set with ten thousand observations and one thousand variables (mcbs) than a data set with ten million observations and one hundred variables (ccw).  if the columns in your data are principally medical claims with a few basic identifiers, you'll be stuck with cool-sounding but goofy-looking variables like 'what is your race?' as deduced by algorithm from the person's last name.  in mcbs, they just ask everyone the actual question.  huzzah.

before you start licking your chops: these data are neither publicly available nor free (yet), so you'll need to submit a research application stating why you want the data and what you plan to use them for, sign some documents stating you'll comply with privacy laws and then cough up about six hundred dollars to receive an encrypted cd via fedex.  now i just gotta say: i learned everything i know about the medicare current beneficiary survey from my prolific co-worker juliette cubanski.  creating a consolidated file was plainly her invention, and though i steered this ship away from the sas iceberg and into the tropical port that is the r language, she did most of the heavy thinking.  the syntax to create an easy peasy annual dataset - with all record identification code (ric) files bound together - would not exist had i not been able to draw on her expert data stewardship.  this new github repository contains four scripts:

  • scan through each of the mcbs cost and use files that you own, assuming you own some.
  • load each ric file directly into memory using our very own sascii package
  • consolidate everything into a one-record-per-person flat file, save each year as an r data file (.rda)

analysis examples.R


multiyear variable crosswalk.R
  • cycle through all of the readme files of the mcbs cost and use files that you already possess
  • determine which variable names are available which years
  • aggregate all of this information into one delightful table that can be easily filtered, so you can quickly see which mcbs columns are trendable - and for how long.
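
the crosswalk idea fits in a few lines of base r.  this sketch skips the readme-parsing step entirely and starts from a hand-typed list of column names per year.  apart from `baseid`, the variable names are invented for illustration, as are the years.

```r
# hypothetical column names by year - aside from `baseid`, not real mcbs variables
vars_by_year <- list(
	"2008" = c( "baseid" , "var_a" , "var_b" ) ,
	"2009" = c( "baseid" , "var_a" , "var_c" ) ,
	"2010" = c( "baseid" , "var_a" , "var_b" )
)

# every variable name seen in any year
all_vars <- sort( unique( unlist( vars_by_year ) ) )

# one row per variable, one column per year, TRUE where the variable is available
crosswalk <- sapply( vars_by_year , function( v ) all_vars %in% v )
rownames( crosswalk ) <- all_vars

# variables available in every single year are trendable across the whole period
trendable <- rownames( crosswalk )[ rowSums( crosswalk ) == ncol( crosswalk ) ]
```

a logical matrix like this is easy to filter: `rowSums` tells you how many years each column survived.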

click here to view these four scripts

for more detail about the medicare current beneficiary survey (mcbs), visit:


although mcbs comes in two flavors - `access to care` and `cost and use` - these scripts only touch the latter.  the access to care data should be thought of as an early version of the cost and use files, so it's incomplete in some important ways: it does not contain medical utilization or spending, and it excludes anyone without 365 days of coverage (anyone who either gained eligibility or died mid-year).  for any given year, the access to care component gets released about eighteen months earlier than the final cost and use version.  if you're coveting slightly more recent data, you might find some utility in these files - just be forewarned that any population with a high death rate (like nursing home residents) will look a lot healthier in the `access to care` than they do in the final module.  if you're tight on cash, buy the cost and use.  but don't take my word for it, take resdac's.

while not well-publicized, this survey does track medicare beneficiaries over three full calendar years, allowing you to construct a neat little panel.  assuming you're the proud owner of two or three consecutive single-year modules already, send 'em an e-mail requesting the longitudinal weights.  then, instead of using `ricx` (as seen in my scripts), use `ricx3` or `ricx4` - and merge all other year-specific ric files on `baseid`.  since it's a rolling panel, longitudinal analyses necessitate a sample size hit of about one- or two-thirds for the two- and three-year panels, respectively.  oh.  and be sure to review this methodology document before you attempt anything with the multi-year weights.
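
here's a rough sketch of that merge-on-`baseid` step for a two-year panel.  the data frames are toys with made-up values, `long_weights` stands in for whichever longitudinal weight file you receive, and the `long_wgt` and `spend` column names are placeholders rather than actual mcbs variables.

```r
# toy data - made-up ids, weights, and spending amounts
# `long_weights` stands in for the longitudinal weight file
long_weights <- data.frame( baseid = c( "001" , "002" ) , long_wgt = c( 5000 , 7000 ) )
year_one <- data.frame( baseid = c( "001" , "002" , "003" ) , spend = c( 10 , 20 , 30 ) )
year_two <- data.frame( baseid = c( "001" , "002" , "004" ) , spend = c( 15 , 25 , 35 ) )

# start from the longitudinal weights so only panel members survive the merges
panel <- merge( long_weights , year_one , by = "baseid" )
panel <- merge( panel , year_two , by = "baseid" , suffixes = c( "_y1" , "_y2" ) )

# weighted average change in spending among people present in both years
avg_change <- weighted.mean( panel$spend_y2 - panel$spend_y1 , panel$long_wgt )
```

persons `003` and `004` fall out because they lack a longitudinal weight, which is the sample size hit in miniature.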

confidential to sas, spss, stata, sudaan users: minimize your netscape navigator and put down your crystal pepsi for a second because i have big news for you:  time to transition to r.  :D