analyze the national longitudinal study of adolescent health (addhealth) with r

the national longitudinal study of adolescent health (addhealth) is to the health and retirement study what teen people is to aarp magazine.  both surveys have followed a cohort of respondents for almost twenty years now, asking them health behavior and social well-being questions.  this is the best data set to investigate the relationship between middle school math class grades and having a gambling problem in your twenties.  addhealth is a bit of a niche product: since it's longitudinal, you can look at a group of kids back in 1995 and then see where they ended up fifteen years later.  if that's not important to you, check cross-sectional surveys like nhis, meps, yrbss, even nhanes first - each of those other surveys represents a broader segment of the nation and gets fielded at more regular intervals.  still with me?  addhealth weights generalize to all americans who were enrolled in 7th through 12th grade in 1995, which makes interpretation a little tricky, or at least verbose.  if you want to look at longitudinal data for the whole united states population or if you don't particularly care about the topic of health and healthcare, take a look at the psid and the nlsy, respectively.  have none of my survey siren songs dissuaded you from the odyssean path to addhealth?  then read on, gentle analyst, read on.

the kids who first responded to this survey were selected and contacted for wave one in 1994-1995.  they were interviewed and given a standardized test, their parents were interviewed, their schools' principals were interviewed, then they were left alone for one short year, until... the summer of 1996, when the same kids were given a different questionnaire for wave two.  we'll call what happens next intermission.  respondents were left alone until young adulthood in 2001-2002, when they were administered wave three - with biomarkers.  finally, by the time wave four rolled around in 2008-2009, these twentysomethings had spent more than half of their lives as addhealth panelists.  when asked what the future holds, the researchers at the carolina population center told me they've already received funding to interview the parents of the respondents again.  and they're hard at work to win funding for a fifth wave.  the good folks at unc might be pushing this boulder up that hill forever, but hopefully they never have to watch it roll back down.

addhealth has been featured in more than four thousand publications, so you can expect a rich data set to play with.  there's some crazy-cool stuff like a twin oversample and retrospective adhd questions and a (restricted data only) biological specimen component.  serious journals will want you to obtain the restricted access data sets (fill this out, pay $850), which triple your sample size and include many more variables.  but if you're just browsing, these scripts should give you a good feel for what's possible - and whether it's sensible to pursue the restricted data.  yeah and i've created a consolidated file (one-record-per-respondent) for each survey wave, making your life a tad socrat-easier.  this new github repository contains three scripts:


download and consolidate.R
  • log into the university of michigan's website with the free login info you'll have to obtain beforehand
  • download every data file available for this study to the local disk
  • loop through each of the four waves, determining whether the file contains one-record-per-person or one-record-per-person-per-event or something-else, leaving those other tables free-standing
  • merge all of the one-record-per-person tables into a single wave-specific consolidated file
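
the sorting-and-merging logic in those last two bullets can be sketched roughly like this.  it's a minimal illustration on toy tables, not the actual download script: the `aid` respondent identifier column, the table names, and the helper `is_one_per_person` are all assumptions made up for the example.

```r
# toy stand-ins for the tables downloaded for a single wave (hypothetical data)
tables <- list(
  inhome  = data.frame(aid = c("01", "02", "03"), age = c(15, 16, 14)),
  weights = data.frame(aid = c("01", "02", "03"), wt = c(1200, 950, 1800)),
  births  = data.frame(aid = c("01", "01", "03"), birth_year = c(2001, 2004, 2003))
)

# hypothetical helper: a table counts as one-record-per-person
# when it has the id column and no respondent id appears more than once
is_one_per_person <- function(df, id = "aid") {
  id %in% names(df) && anyDuplicated(df[[id]]) == 0
}

# split the tables into person-level ones and everything else
person_level  <- Filter(is_one_per_person, tables)
free_standing <- tables[setdiff(names(tables), names(person_level))]

# merge every person-level table into one wave-specific consolidated file,
# keeping respondents even if they are missing from one of the tables
consolidated <- Reduce(
  function(x, y) merge(x, y, by = "aid", all = TRUE),
  person_level
)

print(consolidated)
print(names(free_standing))
```

the `births` table repeats a respondent, so it stays free-standing instead of getting merged.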

longitudinal analysis examples.R
  • load up the consolidated files from wave one and wave three
  • isolate the files to only the columns you need
  • merge 'em, then construct the complex design object using the wave one+three-specific weight
  • analyze 'em lotsa different ways
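
here's a rough sketch of that isolate, merge, and design workflow on toy data.  it is not the script itself: the column names (`aid`, `cluster2`, `gswgt3_2`, `math_grade`, `gambles`) are assumptions for illustration, so check the codebook and the weights documentation for the real identifier, cluster, and weight variables.  the design step only runs if the survey package is installed.

```r
# toy stand-ins for the wave one and wave three consolidated files (hypothetical data)
wave1 <- data.frame(
  aid        = c("01", "02", "03", "04", "05", "06"),
  cluster2   = c(1, 1, 2, 2, 3, 3),   # assumed name of the sampling cluster column
  math_grade = c(4, 3, 2, 4, 1, 3),
  unused_w1  = 0
)
wave3 <- data.frame(
  aid       = c("01", "02", "04", "05", "06"),
  gswgt3_2  = c(1500, 2200, 1800, 900, 1300),   # assumed name of the wave one+three weight
  gambles   = c(0, 1, 0, 1, 0),
  unused_w3 = 0
)

# isolate each file to only the columns needed
w1 <- wave1[, c("aid", "cluster2", "math_grade")]
w3 <- wave3[, c("aid", "gswgt3_2", "gambles")]

# inner merge: keep only respondents who appear in both waves
x <- merge(w1, w3, by = "aid")

# construct the complex design object and estimate a weighted mean
if (requireNamespace("survey", quietly = TRUE)) {
  des <- survey::svydesign(ids = ~cluster2, weights = ~gswgt3_2, data = x)
  print(survey::svymean(~gambles, des))
}
```

respondent `03` drops out of the merge because there's no wave three record, which is why the weight has to be the one built for people in both waves.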

replicate unc puf.R
  • load up the wave one consolidated file
  • recode the data.frame object according to the addhealth stata code construction of the variables of interest
  • precisely match each and every statistic provided to me by the friendly folks at unc's carolina population center



click here to view these three scripts



for more detail about addhealth, visit:

notes:

when choosing which weights to use for your specific analysis, consult the last few pages of the weights4.pdf document.  remember, if you're using the public use files, that those published numbers show the counts for the restricted data, so the public use files should clock in at around a third of the published unweighted record counts.

i've never said it before, i'll probably never say it again, but the r language might not be the perfect tool for analyzing addhealth.  according to dr. thomas lumley - author of the survey package and the major reference textbook - it's not yet possible to run design-adjusted multilevel models with r.  he's already given away a herculean amount of free software, so if you're knowledgeable about complex sample survey adjustments for multilevel models, write some r code, and please share it.

the download program creates a consolidated file for each of the four waves, but does so with a little bit of sneakery.  if a data table has one-record-per-respondent, i merged it on to the consolidated file.  otherwise, i didn't.  waves one and two consist of almost exclusively one-record-per-respondent data tables, so most of the data has been merged into a single data.frame.  waves three and four include a few tables that have multiple-records-per-respondent (for example, there's a table with one-record-per-respondent-per-live-birth), so you'll have to aggregate or tapply or group-with-sqldf these tables from the person-per-event-level to the person-level before you merge onto the consolidated file.
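
a minimal sketch of that aggregate-then-merge step, using toy data.  the table and column names (`births`, `aid`, `n_births`) are hypothetical, and counting events is just one possible way to collapse a person-per-event table down to the person level.

```r
# toy person-level consolidated file and a person-per-event table (hypothetical data)
consolidated <- data.frame(aid = c("01", "02", "03"), age = c(24, 25, 23))
births <- data.frame(
  aid        = c("01", "01", "03"),
  birth_year = c(2001, 2004, 2003)
)

# collapse to one record per respondent by counting live births
births$n_births <- 1
births_per_person <- aggregate(n_births ~ aid, data = births, FUN = sum)

# left merge so respondents without any events stay in the file
x <- merge(consolidated, births_per_person, by = "aid", all.x = TRUE)

# no matching event records means zero events, not a missing value
x$n_births[is.na(x$n_births)] <- 0

print(x)
```

the `all.x = TRUE` matters here: a plain inner merge would silently throw away everyone who never had an event.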

confidential to sas, spss, stata, and sudaan users: when thetis offered to dip me in the styx river, i told her, "no need, i have already transitioned to r."  :D

1 comment:

  1. I'm not an authority on surveys like Lumley, but if I had a particular analysis I'd like to do I'd probably ask R-sig-mixed-models group about what sorts of problems would be expected. Andrew Gelman and others have pointed out that in the simpler cross-sectional case, a lot of analyses can be done without using the survey weights. As I understand it, the idea is to include the variables used in the sampling procedure in one's model (including interaction terms), and then do post-stratification for predictions.

    http://andrewgelman.com/2011/07/01/weighting_and_p/
    http://andrewgelman.com/2011/05/10/some_interestin/
