analyze the national immunization survey (nis) with r

for twenty years now, the centers for disease control and prevention (cdc) has been random-digit-dialing american households to ask parents which vaccinations their little tykes have received.  and since vaccination history might not be at the forefront of every parent's mind, the cdc follows up with many of the respondents' pediatricians.  rigorous.  example abridged interview:

   cdc interviewer: hi are there any adorable children aged 19 months to 35 months in your household?

   proud papa: why yes, little tina turned two years old today.

   ci: oh that's swell.  has tina received her recommended dose of the diphtheria and tetanus toxoids vaccination?

   pp: she sure has.  we went and bought her ice cream afterwards since she didn't cry or nothing.

   ci: what an immunized little girl you have.  i am sure she will grow up to be one smart cookie just like her dad.
   say, do you mind if i call her pediatrician to confirm everything you've told me today?

   pp: well certainly, her pediatrician's name is dr. sergeant pepper and the office phone number is..and scene!

while the parental questionnaire is just a laundry list of did-your-kid-get-this-shot (and then the demographics of the mother), the follow-up provider study collects detailed information about exact vaccination dates and doses.  so you've got a trade-off: respondents with a provider-record-check will have more precise data, but not every respondent's pediatrician gets contacted, so that added accuracy lops off one-third of the unweighted sample size.

although the survey contains only about thirty thousand unweighted records, the cdc carefully calibrates the sample so that results can be reliably analyzed down to the state- and (for a select few) metropolitan area-level.  how else would public health officials be able to measure their battle against jenny mccarthyism?  (sidenote: jenny mccarthy and the we don't vaccinate our kids would make a phenomenal name for your cock rock band.)

beginning in 2008, the cdc launched a parallel teen-people version of the survey aimed at assessing the vaccination coverage rates of high school students.  the structure of these puberty-era data files isn't markedly different from the main survey, except that they generalize to thirteen-to-seventeen-year-old non-institutionalized americans instead of toddlers.  handy mnemonic to differentiate the two populations: diapers versus gripers.

joe walsh at alabama co-authored this blog post and the r scripts you'll soon come to know and love.  joe and i pinky-swore to write this post together after he had finished his impact evaluation of the nurse-family partnership at chicago's data science for social good.  this new github repository contains three scripts:

download and import.R
  • download the microdata years you've requested
  • either re-configure the r scripts to work as promised or just prepare the sas import script for sascii
  • save the r data files to your local disk
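
here's a rough base-r sketch of the import-then-save steps above, not the actual script.  it assumes the column names and widths have already been pulled out of the sas import script (the job sascii does in the real workflow).  the three-column layout, the toy records, the `year` value, and the file names are all made up for illustration.

```r
# toy fixed-width records standing in for one year of downloaded microdata.
# this layout (a 4-character id, a 2-character state code, a 6-character weight)
# is hypothetical and does not match any real nis file.
raw_lines <- c(
	"000101 152.5" ,
	"000236 087.0" ,
	"000348 210.2"
)

# write the toy records to a temporary file, as if it had just been downloaded
fwf_path <- tempfile( fileext = ".dat" )
writeLines( raw_lines , fwf_path )

# column names and widths.  in the real workflow these would come from
# the sas import script rather than being typed by hand.
layout <- data.frame(
	varname = c( "id" , "state" , "wt" ) ,
	width = c( 4 , 2 , 6 ) ,
	stringsAsFactors = FALSE
)

# read the fixed-width file into a data.frame
x <- read.fwf( fwf_path , widths = layout$width , col.names = layout$varname )

# save the r data file to local disk, one file per requested year
year <- 2011	# hypothetical choice of year
rds_path <- file.path( tempdir() , paste0( "nis" , year , ".rds" ) )
saveRDS( x , rds_path )

# load it back later without repeating the import
y <- readRDS( rds_path )
```

importing once and saving the result means every later analysis session can start from a quick `readRDS()` call instead of re-parsing the raw text file.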

analysis examples.R


click here to view these three scripts

for more detail about the national immunization survey (nis), visit:
  • the microdata's about page.  that's required reading.
  • the cdc's immunization homepage.  that's smart reading, just in case someone else has already run the number you're looking for.
  • frequently asked questions about both the parent and provider interviews.


starting in 2011, the national immunization survey followed the lead of the national health interview survey (nhis) and switched over to a mixed landline plus cell phone sampling strategy.  this likely doesn't mean much to you, except that the survey design variables (cluster, strata, weight) changed between 2010 and 2011.  focus your energy on the decision of whether to use analytic weight provwt_d (the two-thirds of the sample that include pediatrician confirmation) or analytic weight rddwt_d (the whole sample, regardless of physician component).  assuming you'd like the beefiest sample size possible, browse through the data user's guide for the words "provider-reported" before you dump a third of your sample in exchange for variables you might not need.  if you're in doubt, use the sub-sample with physician verification (provwt_d).
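
to make the weight decision concrete, here's a minimal base-r sketch on a toy data.frame.  `provwt_d` and `rddwt_d` are the weight names mentioned above; the other column names (`parent_says_vaccinated`, `has_provider_data`) and every number are invented for illustration and do not come from the real files.

```r
# toy respondent-level data: six children, four with a provider record check.
# only the two weight names are real; everything else here is hypothetical.
nis <- data.frame(
	parent_says_vaccinated = c( 1 , 1 , 0 , 1 , 0 , 1 ) ,
	has_provider_data = c( TRUE , TRUE , TRUE , TRUE , FALSE , FALSE ) ,
	rddwt_d = c( 100 , 150 , 120 , 130 , 110 , 90 ) ,
	provwt_d = c( 160 , 220 , 170 , 150 , NA , NA )
)

# option one: the whole sample, weighted with rddwt_d
full_estimate <- weighted.mean( nis$parent_says_vaccinated , nis$rddwt_d )

# option two: only the records with pediatrician confirmation, weighted with provwt_d
prov <- subset( nis , has_provider_data )
prov_estimate <- weighted.mean( prov$parent_says_vaccinated , prov$provwt_d )

# compare the unweighted sample sizes behind each estimate
c( full = nrow( nis ) , provider = nrow( prov ) )
```

this only produces point estimates.  for standard errors you'd also need the cluster and strata variables, for example in a complex sample design object built with `svydesign()` from the survey package.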

confidential to sas, spss, stata, and sudaan users: no reason eradication campaigns should be limited to infectious diseases.  time to transition to r.  :D