analyze the programme for the international assessment of adult competencies (piaac) with r

heaven knows we've all been there: you're in a heated argument with some patriotic zealot who thinks (insert country here) has the best labor force on earth.  you know they're just spewing made-up-statistic after made-up-statistic, but you don't have hard examples of your own to counter their ignorance.  hit the pause button on that nation altercation, because now you do!  the organisation for economic co-operation and development (oecd) has released round one of the programme for the international assessment of adult competencies (piaac), a golden goose of cross-national comparison data regarding working-age adults.  they have a three minute intro, you should watch the three minute intro.  if you like what you see, read these four pages of key facts.  this is the appropriate microdata for serious study of advanced-economy labor markets. and also for debate winners.

following in the footsteps of its older cousin - the programme for international student assessment (pisa) - the piaac survey administrators at oecd and participating countries publish only a nightmarish tangle of custom sas (groan) and stata (gasp) macros for you to learn and implement for the sake of just one public-use survey.  why does bad software happen to good people?  rather than spending all your time translating ancient greek and all your money on proprietary statistical products, you can use the r survey package and buy me a drink.  this new github repository contains three scripts:

download import and design.R

analysis examples.R

  • load the survey design for austria and belgium into working memory.  because, you know, the alphabet.
  • match statistics and standard errors provided by oecd on not one but two tabs of this excel table
  • for loop through every country to hit the stats and standard errors provided by oecd on pdf page 48 of this pdf table
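
the country-by-country loop above follows a pattern like the sketch below.  this is not the repository's code: the data are simulated, and the column names `country`, `wt`, and `score` are invented for illustration.  a plain weighted mean only gets you a point estimate; the standard errors that oecd publishes need the full survey design, which is what the actual scripts lean on the survey package for.

```r
# simulated stand-in for a stacked multi-country file.
# every value and column name here is made up for this example.
set.seed( 2013 )

x <-
    data.frame(
        country = rep( c( "aut" , "bel" , "can" ) , each = 50 ) ,
        wt = runif( 150 , 0.5 , 2 ) ,
        score = rnorm( 150 , 270 , 45 )
    )

# start with an empty result and add one row per country
results <- NULL

for ( this_country in unique( as.character( x$country ) ) ){

    # keep only the records belonging to the current country
    this_df <- x[ x$country == this_country , ]

    # compute the weighted mean for the current country
    this_row <-
        data.frame(
            country = this_country ,
            mean_score = weighted.mean( this_df$score , this_df$wt )
        )

    results <- rbind( results , this_row )
}

print( results )
```

each row of `results` is what you would then line up against the matching row of the published table.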

click here to view these three scripts

for more detail about the programme for the international assessment of adult competencies (piaac), visit:


while preparing your own analysis, you'll surely need the (fantastic) codebook.  aside from that, idk what else to say.  oecd supports only lousy statistical languages to analyze their marvelous data; now you can use a powerful programming language to analyze the same rich data set.  i suppose if you're bored, you could take the piaac test yourself.

confidential to sas, spss, stata, and sudaan users: maybe it's time you join the subphylum vertebrata of the statistical software kingdom.  maybe it's time to transition to r. :D

analyze the fda adverse event reporting system (faers) with r

doctors prescribe medications for patients all the time.  all the time.  sometimes the results are beneficial, other times the drug has no discernible effect, but occasionally those substances actually cause harm.  since the drug is already on the market, there needs to be a post-approval mechanism for detecting health hazards that might've slipped past the clinical trials.  this is it.  if a side-effect alarms a physician or patient enough, either party can make a (voluntary) submission to the fda or the manufacturer (who then must report that event).  think of this as the central repository of skeletal xylophoning.

these public use files are the first in my experience to admit possessing yet fail to release a proper data dictionary.  the steps to learn about their contents: (1) read the full faers homepage, not too long.  (2) download and unzip one of the recent quarterly files by hand, for example 2012 quarter four.  (3) read, yes read, the faqs.doc and readme.doc files included in that microdata file.  once you're convinced these have what you need, let the download and import automation do the rest.  this new github repository contains two scripts:

download and import.R
  • figure out all zipped files containing quarterly microdata for both laers (legacy) and faers
  • loop through each available quarter, download and unzip onto the local disk
  • import each dollar-sign-delimited text file into an r data.frame object, cleaning up as you go
  • save each object as a fresh yet familiar rda file in a convenient pattern within the working directory
year stacks.R
  • find each quarterly data file for both laers (legacy) and faers on the local disk and sort them by year
  • stack all similar-system files into single-year files that nearly match the fda-published annual statistics.  but not exactly.  even though the individual quarterly files do match their control counts.  can't win 'em all.
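
the import and stacking steps above might look something like the sketch below.  it is not the repository's code: the two tiny files, their contents, and the column names `isr` and `drugname` are invented for illustration, and the real quarterly files have many more columns and quirks to clean up.

```r
# write two tiny dollar-sign-delimited files to a temporary folder.
# file names, contents, and column names are made up for this example.
tmp <- tempfile()
dir.create( tmp )

writeLines( c( "ISR$DRUGNAME" , "1$ASPIRIN" , "2$O'BRIEN #5" ) , file.path( tmp , "drug12q3.txt" ) )
writeLines( c( "ISR$DRUGNAME" , "3$IBUPROFEN" ) , file.path( tmp , "drug12q4.txt" ) )

# read one quarterly text file into a data.frame
import_quarter <-
    function( fn ){
        x <-
            read.table(
                fn ,
                sep = "$" ,
                header = TRUE ,
                quote = "" ,            # free-text fields may contain stray quote marks
                comment.char = "" ,     # ..and stray pound signs
                fill = TRUE ,           # tolerate rows that come up short
                stringsAsFactors = FALSE
            )
        names( x ) <- tolower( names( x ) )
        x
    }

# import every text file and save each one as an rda file next to it
txt_files <- sort( list.files( tmp , pattern = "\\.txt$" , full.names = TRUE ) )

for ( fn in txt_files ){
    x <- import_quarter( fn )
    save( x , file = sub( "\\.txt$" , ".rda" , fn ) )
}

# find the saved quarters on disk and stack them into a single-year object
rda_files <- sort( list.files( tmp , pattern = "\\.rda$" , full.names = TRUE ) )

drug2012 <-
    do.call(
        rbind ,
        lapply( rda_files , function( fn ){ load( fn ) ; x } )
    )

print( drug2012 )
```

turning off `quote` and `comment.char` is a defensive choice for free-text drug names; whether a given quarter needs it is something to check against the file itself.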


for more detail about the fda adverse event reporting system (faers), visit:


in pursuit of what's hip and stylish, the fda has set up an api where users might query this database for up-to-the-minute case reports.  but unless you're setting up a bot to tweet adverse events as they happen or researching something that cannot wait for the quarterly file to be released - like google flu trends - the api seems too sexy for anyone other than right said fred.  you probably ought to load the entire data set onto your computer and explore it on your own first.

confidential to sas, spss, stata, and sudaan users: heavy doses of those programs may cause statococcal infection.  time to transition to r.  :D

analyze the demographic and health surveys (dhs) with r

professors of public health 101 probably cite the results of the demographic and health surveys (dhs) more than all other data sources combined.  funded by the united states agency for international development (usaid) and administered by the technically-savvy analysts at icf international, this collection of multinational surveys enters its third decade as the authoritative source of international development indicators.  want a sampler of what that all means?  load up the dhs homepage and watch the statistics fly by: 70% of beninese children younger than five sleep under an insecticide-treated bednet / more than a third of kyrgyz kids aged 6-59 months have anemia / only 35% of guinean households have a place to wash yer hands.  this is the front-and-center toolkit for professional epidemiologists who want to know who/what/when/where/why to target a public health intervention in any of these nations.

before you read any more about the microdata, look at statcompiler.  this online table creator might give you access to every statistic that you need, and without the fuss, muss, or missing values of a person-level table.  (bonus: click here to watch me describe dhs-style online table creation from a teleprompter.)  why should you use statcompiler?  because it's quick, easy, and has aggregated statistics for every country at your fingertips.

if that doesn't dissuade you from digging into an actual data set, one more point of order: you'll likely only be given access to a small number of countries.  so when applying for access, it'd be smart to ask for whichever country you are interested in _and also_ for malawi 2004.  that way, you will be able to muck around with my example syntax using the data tables that they were intended for.  if you have already registered, no fear: you can request that malawi be added to your existing project.  i tried requesting every data set.  i failed.  the data archivists do not grant access to more than a few countries unless you provide a legitimate research question that requires each dataset, and as i was only testing scripts, i received access to just a few countries.  also note that some surveys require permission to be given by the implementing organization from the individual country - access to restricted countries is at the discretion of the implementing organization.  while some surveys are restricted, these are generally public data:  so long as you have a legitimate research question, you'll be granted access to the majority of the datasets without cost.  this new github repository contains three scripts:

download and import.R

analysis examples.R

  • load the 2004 malawi individual recodes file into working memory
  • re-create some of the old school-style strata described in this forum
  • match a single row from pdf page 324 all the way across, deft and all.
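
to give a feel for the design setup behind those bullets, here is a minimal sketch on simulated data.  `v005` (sampling weight, stored with six implied decimals), `v021` (primary sampling unit), `v024` (region), and `v025` (urban/rural) are standard dhs recode variable names, but the data below are fake.  crossing region with urban/rural is one common way to build strata and is an assumption here: it may or may not be what a particular survey's sample design calls for.  the outcome `y` is hypothetical.

```r
# simulated stand-in for an individual recode file.  all values are fake.
set.seed( 1 )

x <-
    data.frame(
        v005 = sample( c( 500000 , 1000000 , 1500000 ) , 200 , replace = TRUE ) ,
        v021 = rep( 1:20 , each = 10 ) ,
        v024 = rep( 1:2 , each = 100 ) ,
        v025 = rep( rep( 1:2 , each = 50 ) , 2 ) ,
        y = rbinom( 200 , 1 , 0.4 )       # hypothetical yes/no outcome
    )

# the weight variable needs to be divided by one million before use
x$weight <- x$v005 / 1000000

# assumption: strata are region crossed with urban/rural
x$strata <- interaction( x$v024 , x$v025 , drop = TRUE )

# the rest only runs if the survey package is installed
if ( requireNamespace( "survey" , quietly = TRUE ) ){

    des <-
        survey::svydesign(
            ids = ~ v021 ,
            strata = ~ strata ,
            weights = ~ weight ,
            data = x
        )

    est <- survey::svymean( ~ y , des , deff = TRUE )
    print( est )

    # deft is the square root of the design effect
    print( sqrt( survey::deff( est ) ) )
}
```

with real data, the estimate, standard error, and deft printed at the end are the numbers you would compare against the published row.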

click here to view these three scripts

for more detail about the demographic and health surveys (dhs), visit:


next to the main survey microdata set, you'll see some roman numerals ranging from one through six.  this number indicates which version of the survey manual that particular dataset corresponds to.  different versions have different questions, structures, microdata files: read the entire "general description" section (only about ten pages) of the manual before you even file your request for data access.

these microdata are complex, confusing, occasionally strangely-coded, and often difficult to reconcile with historical versions.  (century month codes? wowza.)  that's understandable, and the survey administrators deserve praise for keeping everything as coherent as they have after thirty years of six major questionnaire revisions of ninety countries of non-english-speaking respondents across this crazy planet of ours.  if you claw through the documentation and cannot find an explanation, you'll want to engage the user forum.  they are thoroughly responsive, impressively knowledgeable, and will help you get to the bottom of it - whatever `it` may be.  before you ask a question here, or really anywhere in life, have a solid answer to whathaveyoutried.  and for heavens' sakes,* prepare a reproducible example for them.

* my non-denominational way of saying heaven's sake.
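
about those century month codes: a cmc counts months from the start of 1900, with january 1900 as month one.  the arithmetic is small enough to sketch in a few lines.  the function names below are my own, not from any package, and surveys fielded on a different calendar may need extra care.

```r
# century month code = ( year - 1900 ) * 12 + month
ym_to_cmc <- function( year , month ) ( year - 1900 ) * 12 + month

# and back again
cmc_to_year <- function( cmc ) 1900 + ( cmc - 1 ) %/% 12
cmc_to_month <- function( cmc ) ( cmc - 1 ) %% 12 + 1

ym_to_cmc( 2004 , 1 )     # 1249
cmc_to_year( 1249 )       # 2004
cmc_to_month( 1249 )      # 1

ym_to_cmc( 2004 , 12 )    # 1260
cmc_to_month( 1260 )      # 12
```

subtracting one before the division keeps december in the correct year instead of rolling it into the next one.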

confidential to sas, spss, stata, and sudaan users: i would shake your hand but you've yet to adopt the statistical equivalent of coughing into your sleeve.  time to transition to r.  :D