analyze the fda adverse event reporting system (faers) with r

doctors prescribe medications for patients all the time.  all the time.  sometimes the results are beneficial, other times the drug has no discernible effect, but occasionally those substances actually cause harm.  since the drug is already on the market, there needs to be a post-approval mechanism for detecting health hazards that might've slipped past the clinical trials.  this is it.  if a side-effect alarms a physician or patient enough, either party can make a (voluntary) submission to the fda or the manufacturer (who then must report that event).  think of this as the central repository of skeletal xylophoning.

these public use files are the first in my experience to admit possessing yet fail to release a proper data dictionary.  the steps to learn about their contents: (1) read the full faers homepage, not too long.  (2) download and unzip one of the recent quarterly files by hand, for example 2012 quarter four.  (3) read yes read the faqs.doc and readme.doc files included in that microdata file.  once you're convinced these have what you need, let the download and import automation do the rest.  this new github repository contains two scripts:

download and import.R
  • figure out all zipped files containing quarterly microdata for both laers (legacy) and faers
  • loop through each available quarter, download and unzip onto the local disk
  • import each dollar-sign-delimited text file into an r data.frame object, cleaning up as you go
  • save each object as a fresh yet familiar rda file in a convenient pattern within the working directory
year stacks.R
  • find each quarterly data file for both laers (legacy) and faers on the local disk and sort them by year
  • stack all similar-system files into single-year files that nearly match the fda-published annual statistics.  but not exactly.  even though the individual quarterly files do match their control counts.  can't win 'em all.
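
the import-then-stack steps above can be sketched in a few lines of base r.  this is only a rough illustration, not the code in the repository: the two toy "quarters" are invented, the column names are placeholders, and `read_dollar_file` is a hypothetical helper.

```r
# toy stand-ins for two quarterly text files; every name and value here is invented
q1_lines <- c( "primaryid$caseid$age" , "1001$501$34" , "1002$502$61" )
q2_lines <- c( "primaryid$caseid$age" , "1003$503$47" )

# hypothetical helper: write the lines to a temporary file, then read them
# back the way a downloaded dollar-sign-delimited file might be read
read_dollar_file <-
	function( lines ){
		tf <- tempfile( fileext = ".txt" )
		writeLines( lines , tf )
		x <-
			read.table(
				tf ,
				sep = "$" ,
				header = TRUE ,
				quote = "" ,
				comment.char = "" ,
				stringsAsFactors = FALSE
			)
		# lowercase column names so every quarter lines up
		names( x ) <- tolower( names( x ) )
		x
	}

# import each quarter, then stack them into a single-year data.frame
quarters <- lapply( list( q1_lines , q2_lines ) , read_dollar_file )
year_stack <- do.call( rbind , quarters )
```

turning off `quote` and `comment.char` keeps stray apostrophes or pound signs inside free-text fields from swallowing the rest of a line.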


for more detail about the fda adverse event reporting system (faers), visit:


in pursuit of what's hip and stylish, the fda has set up an api where users might query this database for up-to-the-minute case reports.  but unless you're setting up a bot to tweet adverse events as they happen or researching something that cannot wait for the quarterly file to be released - like google flu trends - the api seems too sexy for anyone other than right said fred.  you probably ought to load the entire data set onto your computer and explore it on your own first.

confidential to sas, spss, stata, and sudaan users: heavy doses of those programs may cause statococcal infection.  time to transition to r.  :D

analyze the demographic and health surveys (dhs) with r

professors of public health 101 probably cite the results of the demographic and health surveys (dhs) more than all other data sources combined.  funded by the united states agency for international development (usaid) and administered by the technically-savvy analysts at icf international, this collection of multinational surveys enters its third decade as the authoritative source of international development indicators.  want a sampler of what that all means?  load up the dhs homepage and watch the statistics fly by: 70% of beninese children younger than five sleep under an insecticide-treated bednet / more than a third of kyrgyz kids aged 6-59 months have anemia / only 35% of guinean households have a place to wash yer hands.  this is the front-and-center toolkit for professional epidemiologists who want to know who/what/when/where/why to target a public health intervention in any of these nations.

before you read any more about the microdata, look at statcompiler.  this online table creator might give you access to every statistic that you need, and without the fuss, muss, or missing values of a person-level table.  (bonus: click here to watch me describe dhs-style online table creation from a teleprompter.)  why should you use statcompiler?  because it's quick, easy, and has aggregated statistics for every country at your fingertips.

if that doesn't dissuade you from digging into an actual data set, one more point of order: you'll likely only be given access to a small number of countries.  so when applying for access, it'd be smart to ask for whichever country you are interested in _and also_ for malawi 2004.  that way, you will be able to muck around with my example syntax using the data tables that they were intended for.  if you have already registered, no fear: you can request that malawi be added to your existing project.  i tried requesting every data set.  i failed.  the data archivists do not grant access to more than a few countries unless you provide a legitimate research question that requires each dataset, and as i was only testing scripts, i received access to just a few countries.  also note that some surveys require permission to be given by the implementing organization from the individual country - access to restricted countries is at the discretion of the implementing organization.  while some surveys are restricted, these are generally public data:  so long as you have a legitimate research question, you'll be granted access to the majority of the datasets without cost.  this new github repository contains three scripts:

download and import.R

analysis examples.R

  • load the 2004 malawi individual recodes file into working memory
  • re-create some of the old school-style strata described in this forum
  • match a single row from pdf page 324 all the way across, deft and all.

click here to view these three scripts

for more detail about the demographic and health surveys (dhs), visit:


next to the main survey microdata set, you'll see some roman numerals ranging from one through six.  this number indicates which version of the survey manual that particular dataset corresponds to.  different versions have different questions, structures, and microdata files: read the entire "general description" section (only about ten pages) of the manual before you even file your request for data access.

these microdata are complex, confusing, occasionally strangely-coded, and often difficult to reconcile with historical versions.  (century month codes? wowza.)  that's understandable, and the survey administrators deserve praise for keeping everything as coherent as they have after thirty years of six major questionnaire revisions of ninety countries of non-english-speaking respondents across this crazy planet of ours.  if you claw through the documentation and cannot find an explanation, you'll want to engage the user forum.  they are thoroughly responsive, impressively knowledgeable, and will help you get to the bottom of it - whatever `it` may be.  before you ask a question here, or really anywhere in life, have a solid answer to whathaveyoutried.  and for heavens' sakes,* prepare a reproducible example for them.

* my non-denominational way of saying heaven's sake.
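
about those century month codes: as i understand the dhs convention, a cmc counts months since the start of 1900, so january 1900 is 1.  here's a minimal sketch of the arithmetic, with hypothetical function names (`cmc_from_date`, `date_from_cmc`) that are mine and not part of any dhs tool.  check the recode manual for your survey before trusting it, especially where a different calendar is in play.

```r
# convert a calendar year and month into a century month code
cmc_from_date <-
	function( year , month ) ( year - 1900 ) * 12 + month

# convert a century month code back into a calendar year and month
date_from_cmc <-
	function( cmc ){
		year <- 1900 + ( cmc - 1 ) %/% 12
		month <- cmc - ( year - 1900 ) * 12
		data.frame( year = year , month = month )
	}

# january 2004 and december 2004
jan_2004 <- cmc_from_date( 2004 , 1 )
dec_2004 <- cmc_from_date( 2004 , 12 )

# round trip back to year and month
round_trip <- date_from_cmc( c( jan_2004 , dec_2004 ) )
```

subtracting two codes gives an interval in months, which is why ages and birth intervals are stored this way.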

confidential to sas, spss, stata, and sudaan users: i would shake your hand but you've yet to adopt the statistical equivalent of coughing into your sleeve.  time to transition to r.  :D

analyze the american housing survey (ahs) with r

plenty of nationwide surveys collect information at the housing unit-level, but only the american housing survey (ahs) focuses on the physical structure rather than the inhabitants.  when asked to pick their favorite public-use file, urban planners, realty researchers, even data-driven squatters choose this one.  in action since (and with available microdata dating back to) 1973, the united states department of housing and urban development (hud) contracts with our census bureau to collect information about a panel of both nationally- and metropolitan area-representative homes so that scientists (like you) can boldly answer questions about america's residential housing supply.

from 1973 until 1996, the survey administrators mushed all of the content from this survey into a single one-record-per-housing-unit consolidated table that they call a "flat file" - simple.  beginning in 1997, you have access to much more detailed information.  if you feel confused rather than empowered, walk through the various 2011 files with me.  background first:  the `control` column in the microdata is just the unique identifier for the housing unit.  in the 2011 release, it's appropriate to think of `tnewhouse` and `trepwgt` as the main files - those are the only files that have weights.  there are no person-level weights in this microdata.  you can make statements like, "the average american housing unit has x bathrooms" but not, "the average american lives in a housing unit with x bathrooms."  you cannot make a statement about average american anythings without sampling weights.  catch my drift?  alright, here's my description of each file using this structure:

  • tablename (number of records in 2011) [unique `control` numbers in 2011] - structure.  notes/description.

files, structures, descriptions of individual tables in the 2011 ahs public use file:

  • tnewhouse (186,448) [186,448] - one record per housing unit.  housing unit characteristics.  the main file.
  • trepwgt (186,448) [186,448] - one record per housing unit.  weight file.  needs to be merged onto the main file.
  • towner (60,572) [60,572] - one record per owner of rented unit.  not all homes have an outside owner, but the ones that do will merge onto `tnewhouse` by `control` one-to-one.
  • thomimp (147,329) [50,532] - one record per home improvement.  to uniquely identify each home improvement, use `control` plus `ras`.  some homes have multiple home improvements, others have none.
  • tmortg (56,507) [56,507] - one record per housing unit with a mortgage or home equity loan, maximum information captured: three of each.  not all homes have a mortgage or home equity loan (renters never do), but the ones that do will merge onto `tnewhouse` by `control` one-to-one.
  • tperson (339,453) [134,918] - one record per person.  to uniquely identify each person, use `control` plus `pline`.  some homes have multiple persons, others have none.
  • tratiov (8,166) [8,166] - one record per housing unit.  verification that the renter pays x amount when their reported income makes it seem implausible.
  • trmov (43,968) [39,464] - one record per movement group.  to uniquely identify each group of movers, use `control` plus `mvg`.  some homes have multiple movement groups, others have none.
  • ttypec (71,672) [71,672] - one record per housing unit available in prior years but not the current year.
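
to make the structure above concrete, here's a small sketch using toy tables.  the table names, `control`, and `pline` come from the descriptions above; everything else (the `baths` and `wgt` columns, every value) is invented for illustration and not taken from the actual files.

```r
# toy tables: only the table names, `control`, and `pline` mirror the real files
tnewhouse <- data.frame( control = c( "A1" , "A2" , "A3" ) , baths = c( 1 , 2 , 3 ) )
trepwgt <- data.frame( control = c( "A1" , "A2" , "A3" ) , wgt = c( 100 , 300 , 100 ) )
tperson <- data.frame( control = c( "A1" , "A1" , "A3" ) , pline = c( 1 , 2 , 1 ) )

# one-record-per-housing-unit tables should never repeat a `control` value
stopifnot( !anyDuplicated( tnewhouse$control ) , !anyDuplicated( trepwgt$control ) )

# the person table repeats `control`, so it takes `control` plus `pline`
# to pin down a single row
stopifnot( anyDuplicated( tperson$control ) > 0 )
stopifnot( !anyDuplicated( tperson[ , c( "control" , "pline" ) ] ) )

# merge the weights onto the main file and confirm no records went missing
ahs <- merge( tnewhouse , trepwgt , by = "control" )
stopifnot( nrow( ahs ) == nrow( tnewhouse ) )

# "the average housing unit has x bathrooms" is a weighted statement
avg_baths <- weighted.mean( ahs$baths , ahs$wgt )
```

running the duplicate checks before every merge is cheap insurance: a one-to-one merge that silently becomes one-to-many will inflate every total you calculate afterward.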

don't say i didn't warn you that this survey kicks ass.  ahh yes and if you are still perplexed by something, pdf page eleven of hud's documentation outlines what i've tried to summarize above in much more detail, using a mix of both capital and lowercase letters.  this new github repository contains four scripts:

download all microdata.R
  • download, import, save each and every american housing survey file onto your local computer
  • when a housing unit file and replicate weights file are both available, merge them.  you'll have to do it eventually, why not automate it from the start?
  • store all successfully-imported r data files (.rda) into a big fat sqlite database in case your computer isn't the newest edition

analysis examples.R
  • load a single housing unit-level data file, either into working memory or as a database-backed (ram-free) object
  • construct the complex sample survey object post-stratifying according to census bureau specifications
  • run example analyses that calculate perfect means, medians, quantiles, totals

merge and recode examples.R
  • recode some columns in the person-level table into other columns, inside the sql database
  • aggregate some person-level statistics into housing unit-level information just like hud's file flattener sas program
  • merge these aggregated person-level results onto the main housing unit-level file
  • re-construct a legitimate replicate-weighted database-backed survey design object, using the new person-level results you just created
  • repeat the four previous steps, but all in working memory rather than with a sql database - for more powerful computers only
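
the recode, aggregate, merge sequence above might look something like this in working memory.  it's a toy sketch rather than the script itself: the `age` column, the derived columns, and every value are invented, and only `control` and `pline` come from the file descriptions.

```r
# toy housing unit and person tables; the `age` column and all values are invented
tnewhouse <- data.frame( control = c( "A1" , "A2" , "A3" ) )
tperson <-
	data.frame(
		control = c( "A1" , "A1" , "A3" ) ,
		pline = c( 1 , 2 , 1 ) ,
		age = c( 40 , 8 , 71 )
	)

# recode: flag children, plus a column of ones for counting persons
tperson$child <- as.numeric( tperson$age < 18 )
tperson$one <- 1

# aggregate: collapse the person-level table to one record per housing unit
per_unit <-
	aggregate(
		tperson[ , c( "one" , "child" ) ] ,
		by = list( control = tperson$control ) ,
		FUN = sum
	)
names( per_unit ) <- c( "control" , "persons" , "children" )

# merge: keep every housing unit, even those without any person records
hh <- merge( tnewhouse , per_unit , by = "control" , all.x = TRUE )

# housing units absent from the person table have zero persons, not missing
hh[ is.na( hh$persons ) , c( "persons" , "children" ) ] <- 0
```

the `all.x = TRUE` matters: without it, every housing unit with no person records drops out of the file and takes its weight along with it.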

  • fire up a sqlite-backed replicate-weighted survey design
  • match two separate statistics and standard errors in this census bureau publication
  • fire up the same design, sans sqlite-backing
  • repeat step two

click here to view these four scripts

for more detail about the american housing survey (ahs), visit:


it might not be perfectly clear from the documentation and they've yet to publish a core set of longitudinal weights for the various national periods and metropolitan samples, but the american housing survey is drawn from the same panel of housing units every other year.  when comparing the 2009 and 2011 unique identifiers (the `control` column), i found 55,065 matches.  you'd be smart to contact the (superhumanly responsive) quants who create this survey via their userlist to confirm your panel-based analysis strategy makes sense.
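
counting matched identifiers across two survey years takes one line.  a toy sketch with invented `control` values (the real comparison would use the `control` columns from the 2009 and 2011 housing unit files):

```r
# invented identifiers standing in for the `control` column in two survey years
control_2009 <- c( "A1" , "A2" , "A3" , "A4" )
control_2011 <- c( "A3" , "A4" , "A5" )

# housing units that appear in both years
matched <- intersect( control_2009 , control_2011 )
n_matched <- length( matched )

# housing units that dropped out of, or newly entered, the sample
dropped <- setdiff( control_2009 , control_2011 )
entered <- setdiff( control_2011 , control_2009 )
```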

when you think of a housing unit, you might informally refer to it as a place where you would expect people to have their own bathroom and kitchen for their excluuuusive use.  the american housing survey includes some assisted living settings, but excludes group quarters like dormitories, hospitals, military barracks, and most nursing homes.  for more detailed explanations, take a look at the methodology document and especially appendix b.

confidential to sas, spss, stata, and sudaan users:  knock knock.  who's there?  r.  r who?  aren't you glad you transitioned to r?