analyze the world values survey (wvs) with r

a global barometer of public opinion, the world values survey (wvs) clocks in as your best source of cross-cultural moods and attitudes.  you might find its most famous product sweepingly general, but who among us has never ever swept a smidgen of nuance under the rug?  if you want to explore complex international patterns of belief, now's your chance.

though their scientific advisory committee (sac) sets the ground rules and dictates the core content, individual national samples should be viewed as something of a confederacy of surveys.  carefully read the technical reports for any nations you dare to compare.  the homepage struck me as more personality-driven than that of other public use data sets.  but, really, who am i to judge?  if you care about religious fervency, gender equality, democracy, or even being grossly nationally happy, then the world values survey is the best source there ever will be.  this github repository contains two scripts:

download all microdata.R
  • impersonate a thirteen year old ukrainian boy, convince the archive that a human's doing the downloading
  • for-loop through every wave, every study, every nation
  • save each file to your local hard disk according to an easy-to-peruse structure

analysis examples.R
  • load a country-specific data set
  • construct a fake survey design object.  statistics and coefficients will be calculated correctly, but standard errors and confidence intervals generated off of this complex sample design should be ignored.  read the user note within the script for more four one one
  • examine the bejesus out of that survey design object, calculating every descriptive statistic possible
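
a minimal sketch of what that fake survey design object might look like, assuming the r survey package is installed.  the data.frame and its column names (`happy`, `wt`) are made up for illustration and are not actual wvs variable names.

```r
library(survey)

# toy stand-in for one country's file; `happy` and `wt` are hypothetical columns
x <- data.frame(
  happy = c(1, 0, 1, 1, 0, 1),
  wt = c(0.8, 1.2, 1.0, 0.9, 1.1, 1.0)
)

# ids = ~ 1 declares no clustering and no strata are supplied,
# which is exactly why standard errors off of this object should be ignored
wvs_design <- svydesign(ids = ~ 1, weights = ~ wt, data = x)

# the point estimate is a plain weighted mean, so it agrees with base r
est <- svymean(~ happy, wvs_design)
coef(est)
weighted.mean(x$happy, x$wt)
```

the weights do all the work here: the statistic comes out right, the uncertainty around it does not.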

click here to view these two scripts

for more detail about the world values survey (wvs), visit:
  • geocities and myspace had a baby, and named it the world values survey homepage.  i half expected a midi track to start up
  • wikipedia for much of the same content, but structured in a format you know and love


the administrators have neglected to produce microdata files that permit users to calculate confidence intervals using either of the most common survey analysis methods.  in other words, these data will give you a best guess, but you'll be in the dark about whether that guess is any good.  since there are no correct confidence intervals to match, i have not provided my usual replication script.  if you look in the "results" pdf file (not the "sample design" or "methodology" pdf files) for any nation, you'll find an "estimated error" somewhere around the second page.  this is a crude, dataset-wide measure of variance, but it's your only option to use as the standard error for any statistical testing.  this is a one-size-fits-all substitute for other more precise sampling error calculations like taylor-series linearization or replicate weighting.  you could (politely!) request that they include clustering and strata variables on both future and historical files.  because awesome data can always get more awesome.
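
here's a rough sketch of plugging that dataset-wide "estimated error" in as the standard error, using only base r.  the responses, the weights, and the 0.031 figure are all invented for illustration; pull the real number from the "results" pdf for your nation.

```r
# toy respondents: 1 = agrees with some statement, 0 = does not
agree <- c(1, 0, 1, 1, 0, 1, 0, 1)
wt <- c(1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.0, 1.0)

# weighted share of respondents who agree
est <- weighted.mean(agree, wt)

# hypothetical value: pretend the "results" pdf lists 3.1 percentage points
estimated_error <- 0.031

# treat the published figure as the standard error for a rough 95% interval
ci <- est + c(-1, 1) * qnorm(0.975) * estimated_error
ci
```

the same interval width gets stamped onto every estimate from that file, which is precisely the crudeness described above.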

confidential to sas, spss, stata, and sudaan users: would you buy an imitation rolex if the real thing were free?  well look at your wrist because it's time to transition to r.  :D

analyze the programme for the international assessment of adult competencies (piaac) with r

heaven knows we've all been there: you're in a heated argument with some patriotic zealot who thinks (insert country here) has the best labor force on earth.  you know they're just spewing made-up-statistic after made-up-statistic, but you don't have hard examples of your own to counter their ignorance.  hit the pause button on that nation altercation, because now you do!  the organisation for economic co-operation and development (oecd) has released round one of the programme for the international assessment of adult competencies (piaac), a golden goose of cross-national comparison data regarding working-age adults.  they have a three minute intro, you should watch the three minute intro.  if you like what you see, read these four pages of key facts.  this is the appropriate microdata for serious study of advanced-economy labor markets. and also for debate winners.

following in the footsteps of its older cousin - the programme for international student assessment (pisa) - the piaac survey administrators at oecd and participating countries publish only a nightmarish tangle of custom sas (groan) and stata (gasp) macros for you to learn and implement for the sake of just one public-use survey.  why does bad software happen to good people?  rather than spending all your time translating ancient greek and all your money on proprietary statistical products, you can use the r survey package and buy me a drink.  this new github repository contains three scripts:

download import and design.R

analysis examples.R

  • load the survey design for austria and belgium into working memory.  because, you know, the alphabet.
  • match statistics and standard errors provided by oecd on not one but two tabs of this excel table
  • for loop through every country to hit the stats and standard errors provided by oecd on pdf page 48 of this pdf table
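
for a feel of how the r survey package produces standard errors from replicate weights, here's a toy sketch.  everything in it is an assumption for illustration: the column names (`score`, `w0`, `rw1` through `rw5`), the hand-built delete-one jackknife weights, and the choice of `type = "JK1"`.  check the oecd documentation for the replication method each country actually uses.

```r
library(survey)

# toy data: five adults with a score and a full-sample weight of one apiece
x <- data.frame(score = c(250, 270, 290, 310, 330), w0 = 1)

# build delete-one jackknife replicate weights by hand:
# replicate i zeroes out adult i and scales everyone else up by n / (n - 1)
n <- nrow(x)
for (i in seq_len(n)) {
  x[[paste0("rw", i)]] <- ifelse(seq_len(n) == i, 0, n / (n - 1))
}

# replicate-weighted design; the regular expression picks up rw1 through rw5
piaac_like_design <- svrepdesign(
  weights = ~ w0,
  repweights = "^rw[0-9]+$",
  type = "JK1",
  combined.weights = TRUE,
  data = x
)

# mean and its jackknife standard error
est <- svymean(~ score, piaac_like_design)
est
```

with weights this simple, the jackknife standard error of the mean collapses to the textbook `sd( x$score ) / sqrt( n )`, a handy sanity check before pointing the same machinery at real microdata.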

click here to view these three scripts

for more detail about the programme for the international assessment of adult competencies (piaac), visit:


while preparing your own analysis, you'll surely need the (fantastic) codebook.  aside from that, idk what else to say.  oecd supports only lousy statistical languages to analyze their marvelous data; now you can use a powerful programming language to analyze the same rich data set.  i suppose if you're bored, you could take the piaac test yourself.

confidential to sas, spss, stata, and sudaan users: maybe it's time you join the subphylum vertebrata of the statistical software kingdom.  maybe it's time to transition to r. :D

analyze the fda adverse event reporting system (faers) with r

doctors prescribe medications for patients all the time.  all the time.  sometimes the results are beneficial, other times the drug has no discernible effect, but occasionally those substances actually cause harm.  since the drug is already on the market, there needs to be a post-approval mechanism for detecting health hazards that might've slipped past the clinical trials.  this is it.  if a side-effect alarms a physician or patient enough, either party can make a (voluntary) submission to the fda or the manufacturer (who then must report that event).  think of this as the central repository of skeletal xylophoning.

these public use files are the first in my experience to admit possessing yet fail to release a proper data dictionary.  the steps to learn about their contents: (1) read the full faers homepage, not too long.  (2) download and unzip one of the recent quarterly files by hand, for example 2012 quarter four.  (3) read (yes, read) the faqs.doc and readme.doc files included in that microdata file.  once you're convinced these have what you need, let the download and import automation do the rest.  this new github repository contains two scripts:

download and import.R
  • figure out all zipped files containing quarterly microdata for both laers (legacy) and faers
  • loop through each available quarter, download and unzip onto the local disk
  • import each dollar-sign-delimited text file into an r data.frame object, cleaning up as you go
  • save each object as a fresh yet familiar rda file in a convenient pattern within the working directory

year stacks.R
  • find each quarterly data file for both laers (legacy) and faers on the local disk and sort them by year
  • stack all similar-system files into single-year files that nearly match the fda-published annual statistics.  but not exactly.  even though the individual quarterly files do match their control counts.  can't win 'em all.
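
the import-then-stack steps above might be sketched in base r like this.  the two tiny quarterly files, their column names, and the `read_dollar` helper are all made up for illustration; the real layouts are described in the readme.doc inside each quarterly zip.

```r
# write two tiny dollar-sign-delimited quarterly files, purely invented
q1 <- tempfile(fileext = ".txt")
q2 <- tempfile(fileext = ".txt")
writeLines(c("primaryid$drugname$role_cod", "1001$ASPIRIN$PS", "1002$IBUPROFEN$SS"), q1)
writeLines(c("primaryid$drugname$role_cod", "1003$DR. SMITH'S TONIC #2$PS"), q2)

# hypothetical helper: quote = "" and comment.char = "" keep stray
# apostrophes and pound signs in drug names from mangling the import
read_dollar <- function(fn) {
  read.table(
    fn,
    sep = "$",
    header = TRUE,
    quote = "",
    comment.char = "",
    fill = TRUE,
    stringsAsFactors = FALSE
  )
}

# import each quarter, then stack the quarters into a single-year data.frame
quarters <- lapply(c(q1, q2), read_dollar)
year_stack <- do.call(rbind, quarters)
year_stack
```

`rbind` only works when every quarter carries the same columns, which is why the stacking sticks to similar-system files.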


for more detail about the fda adverse event reporting system (faers), visit:


in pursuit of what's hip and stylish, the fda has set up an api where users might query this database for up-to-the-minute case reports.  but unless you're setting up a bot to tweet adverse events as they happen or researching something that cannot wait for the quarterly file to be released - like google flu trends - the api seems too sexy for anyone other than right said fred.  you probably ought to load the entire data set onto your computer and explore it on your own first.

confidential to sas, spss, stata, and sudaan users: heavy doses of those programs may cause statococcal infection.  time to transition to r.  :D