how to provide a variance calculation on your public-use survey data file without disclosing sampling clusters or violating respondent confidentiality

this post and accompanying syntax would not have been possible without dan oberski.  read on to find out why.  thanks, dan.

dear survey administrator: someone sent you this link because you work for an organization or a government agency that conducts a complex-sample survey and releases a public-use file, but does not disclose the sampling clusters on that file.

you had good reason to do this: malicious users lurk around every corner of the internet, determined to isolate and perhaps humiliate the good-natured respondents to your survey.  those sampling clusters are, more often than not, geographic locations.  you ran your survey, you promised respondents you wouldn't disclose enough information for data users to identify them, now you're keeping your word.  you need to maintain the trust of your respondents, perhaps you're even bound to do so by law.  i understand.  keep doing it.  but remember that confidentiality costs statistical precision.

you drew a sample, you fielded your survey, you probably analyzed the microdata yourself, then you blessedly documented everything and shared your data file with other researchers.  thank you.  that first step, when you drew that sample - was it a simple random sample?  because if you used a clustered sample design (like almost every survey run by the united states government), your microdata users will need those sampling units to compute a confidence interval, either through linearization or replication.

if you don't disclose those clusters, some of your data users will blindly calculate their confidence intervals under the faulty assumption of simple random sampling.  that is not right: the confidence intervals come out too tight.  if a survey data provider neglects to provide a defensible method to calculate the survey-adjusted variance, users will fall back on srs and occasionally declare statistically significant differences that aren't statistically significant.  nightmares are born, yada yada.

you cannot disclose your sampling units but you would like your users to calculate a more accurate (or at least more conservative) confidence interval around the statistics that they compute off of your survey data.  the alternative to linearization-based confidence interval calculations? a replication-based confidence interval calculation.  try this:
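the contrast is easy to see with the r survey package's built-in api example data - a sketch, not your survey, where dnum, pw, and api00 are that example's cluster id, weight, and outcome:

```r
library(survey)
data(api)  # example data shipped with the survey package

# the design you cannot publish: cluster ids (dnum) stay in-house
clus_des <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1)

# the faulty srs assumption a user is forced into without those clusters
srs_des <- svydesign(id = ~1, weights = ~pw, data = apiclus1)

# jackknife replicate weights encode the clustering without revealing it
rep_des <- as.svrepdesign(clus_des, type = "JK1")

svymean(~api00, clus_des)  # design-based standard error
svymean(~api00, srs_des)   # too-tight standard error under srs
svymean(~api00, rep_des)   # replication-based standard error, close to the design-based one
```

ship the replicate weights (the columns inside `rep_des`) on your public file instead of the cluster ids, and your users get defensible standard errors without ever seeing a sampling unit.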

click here to view a step-by-step tutorial to create obfuscated replicate weights for your complex-sample survey data

there aren't many people who i like more than dan oberski.  a survey methodologist at tilburg university, dr. oberski kindly reviewed my proposed solution and sketched out an argument in favor of the procedure.

read his arguments in this pdf file or in latex format.

even though he's convinced that the conclusion is true, he cautions that some of the design-unbiasedness proof steps are not wholly rigorous - especially (3) - and that in order for this method to gain wide acceptance, a research article submitted to a peer-reviewed journal would need more careful study, a formal justification of unbiased standard errors, and a small simulation.  so you have a green light from us, but give your own survey methodologists the final say.  glhf and use r

analyze the world values survey (wvs) with r

a global barometer of public opinion, the world values survey (wvs) clocks in as your best source of cross-cultural moods and attitudes.   you might find its most famous product sweepingly general, but who among us has never ever swept a smidgen of nuance under the rug?  if you want to explore complex international patterns of belief, now's your chance.

though their scientific advisory committee (sac) sets the ground rules and dictates the core content, individual national samples should be viewed as something of a confederacy of surveys.  carefully read the technical reports for any nations you dare to compare.  the homepage struck me as more personality-driven than that of other public use data sets.  but, really, who am i to judge?  if you care about religious fervency, gender equality, democracy, or even being grossly nationally happy, then the world values survey is the best source there ever will be.  this github repository contains two scripts:

download all microdata.R
  • impersonate a thirteen-year-old ukrainian boy, convince the archive that a human's doing the downloading
  • for-loop through every wave, every study, every nation
  • save each file to your local hard disk according to an easy-to-peruse structure

analysis examples.R
  • load a country-specific data set
  • construct a fake survey design object.  statistics and coefficients will be calculated correctly, but standard errors and confidence intervals generated off of this complex sample design should be ignored.  read the user note within the script for more four one one
  • examine the bejesus out of that survey design object, calculating every descriptive statistic possible
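a sketch of what that "fake" design object looks like, with hypothetical data frame and weight-column names (the actual wvs weight variable differs by file - see the user note within the script):

```r
library(survey)
set.seed(42)

# hypothetical country-specific data frame with a weight column
wvs_df <- data.frame(happiness = sample(1:4, 500, replace = TRUE),
                     weight = runif(500, 0.5, 2))

# no clusters (id = ~1) and no strata: point estimates are right,
# but standard errors are not design-based and should be ignored
fake_des <- svydesign(id = ~1, weights = ~weight, data = wvs_df)
svymean(~happiness, fake_des)
```

the weighted mean from this object equals the correct point estimate; only the variance around it is untrustworthy.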

click here to view these two scripts

for more detail about the world values survey (wvs), visit:
  • geocities and myspace had a baby, and named it the wvs homepage.  i half expected a midi track to start up
  • wikipedia for much of the same content, but structured in a format you know and love


the administrators have neglected to produce microdata files that permit users to calculate confidence intervals using either of the most common survey analysis methods.  in other words, these data will give you a best guess, but you'll be in the dark about whether that guess is any good.  since there are no correct confidence intervals to match, i have not provided my usual replication script.  if you look in the "results" pdf file (not the "sample design" or "methodology" pdf files) for any nation, you'll find an "estimated error" somewhere around the second page.  this is a crude, dataset-wide measure of variance, but it's your only option for the standard error in any statistical testing - a one-size-fits-all substitute for more precise sampling error calculations like taylor-series linearization or replicate weighting.  you could (politely!) request that they include clustering and strata variables on both future and historical files.  because awesome data can always get more awesome.
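if you do lean on that dataset-wide "estimated error," the crudest defensible test is a plain z-test.  a sketch with made-up numbers (the proportions and errors below are illustrative, not from any wvs results pdf):

```r
# hypothetical proportion and "estimated error" for country a
p1 <- 0.62; se1 <- 0.014
# hypothetical proportion and "estimated error" for country b
p2 <- 0.57; se2 <- 0.016

# two-sample z statistic, treating the published errors as standard errors
z <- (p1 - p2) / sqrt(se1^2 + se2^2)

# two-sided p-value
p_value <- 2 * pnorm(-abs(z))
p_value
```

crude, but it beats pretending the estimates have no sampling error at all.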

confidential to sas, spss, stata, and sudaan users: would you buy an imitation rolex if the real thing were free?  well look at your wrist because it's time to transition to r.  :D

analyze the programme for the international assessment of adult competencies (piaac) with r

heaven knows we've all been there: you're in a heated argument with some patriotic zealot who thinks (insert country here) has the best labor force on earth.  you know they're just spewing made-up-statistic after made-up-statistic, but you don't have hard examples of your own to counter their ignorance.  hit the pause button on that nation altercation, because now you do!  the organisation for economic co-operation and development (oecd) has released round one of the programme for the international assessment of adult competencies (piaac), a golden goose of cross-national comparison data regarding working-age adults.  they have a three minute intro, you should watch the three minute intro.  if you like what you see, read these four pages of key facts.  this is the appropriate microdata for serious study of advanced-economy labor markets. and also for debate winners.

following in the footsteps of its older cousin - the programme for international student assessment (pisa) - the piaac survey administrators at oecd and participating countries publish only a nightmarish tangle of custom sas (groan) and stata (gasp) macros for you to learn and implement for the sake of just one public-use survey.  why does bad software happen to good people?  rather than spending all your time translating ancient greek and all your money on proprietary statistical products, you can use the r survey package and buy me a drink.  this new github repository contains three scripts:

download import and design.R

analysis examples.R

  • load the survey design for austria and belgium into working memory.  because, you know, the alphabet.
  • match statistics and standard errors provided by oecd on not one but two tabs of this excel table
  • for-loop through every country to match the statistics and standard errors provided by oecd on page 48 of this pdf table
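the r survey package replaces those macros with a few lines.  a sketch built on synthetic data, assuming the piaac convention of a main weight spfwt0 plus eighty jackknife replicate weights spfwt1-spfwt80 (confirm the names against the codebook for your file):

```r
library(survey)
set.seed(1)
n <- 200

# synthetic stand-in for a piaac country file: one score, one main weight
fake <- data.frame(score = rnorm(n, 270, 45), SPFWT0 = runif(n, 50, 150))

# eighty replicate weights, here just perturbed copies of the main weight
for (i in 1:80) fake[[paste0("SPFWT", i)]] <- fake$SPFWT0 * runif(n, 0.9, 1.1)

# a replicate-weight design: no cluster or strata variables needed
piaac_des <- svrepdesign(weights = ~SPFWT0,
                         repweights = "SPFWT[1-9]",  # regex matching the 80 replicates
                         type = "JK2", data = fake)

svymean(~score, piaac_des)
```

note that real piaac proficiency scores arrive as multiple plausible values that must be combined across estimates; the scripts above handle that step.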

click here to view these three scripts

for more detail about the programme for the international assessment of adult competencies (piaac), visit:


while preparing your own analysis, you'll surely need the (fantastic) codebook.  aside from that, idk what else to say.  oecd supports only lousy statistical languages to analyze their marvelous data; now you can use a powerful programming language to analyze the same rich data set.  i suppose if you're bored, you could take the piaac test yourself.

confidential to sas, spss, stata, and sudaan users: maybe it's time you join the subphylum vertebrata of the statistical software kingdom.  maybe it's time to transition to r. :D