how to provide a variance calculation on your public-use survey data file without disclosing sampling clusters or violating respondent confidentiality

this post and accompanying syntax would not have been possible without dan oberski.  read more, find out why.  thanks dan.

dear survey administrator: someone sent you this link because you work for an organization or a government agency that conducts a complex-sample survey, releases a public-use file, but does not correspondingly disclose the sampling clusters.

you had good reason to do this: malicious users lurk around every corner of the internet, determined to isolate and perhaps humiliate the good-natured respondents to your survey.  those sampling clusters are, more often than not, geographic locations.  you ran your survey, you promised respondents you wouldn't disclose enough information for data users to identify them, now you're keeping your word.  you need to maintain the trust of your respondents, perhaps you're even bound to do so by law.  i understand.  keep doing it.  but remember that confidentiality costs statistical precision.

you drew a sample, you fielded your survey, you probably analyzed the microdata yourself, then you blessedly documented everything and shared your data file with other researchers.  thank you.  that first step, when you drew that sample, was it a simple random sample?  because if you used a clustered sample design (like almost every survey run by the united states government), your microdata users will need those sampling units to compute a confidence interval either through linearization or replication.

if you don't disclose those clusters, some of your data users will blindly calculate their confidence intervals under the faulty assumption of simple random sampling.  that is not right: under a clustered design, those confidence intervals are usually too tight.  if a survey data provider neglects to provide a defensible method to calculate the survey-adjusted variance, users will rely on srs and occasionally declare differences statistically significant when they are not.  nightmares are born, yada yada.
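
to see how much the srs assumption can matter, here is a minimal simulation sketch in python.  it is not part of the tutorial syntax, and everything in it is made up for illustration: 50 hypothetical clusters of 20 respondents each, where everyone in a cluster shares a cluster-level effect.  it compares the naive srs standard error of a mean against one that treats the cluster means as the independent units.

```python
import random
import statistics

random.seed(42)

# hypothetical design (an assumption for illustration):
# 50 sampled clusters, 20 respondents in each
n_clusters, per_cluster = 50, 20

sample = []
for cluster_id in range(n_clusters):
    cluster_effect = random.gauss(0, 1)  # shared by everyone in the cluster
    for _ in range(per_cluster):
        sample.append((cluster_id, cluster_effect + random.gauss(0, 1)))

values = [y for _, y in sample]
n = len(values)

# naive standard error of the mean, pretending the sample is srs
se_srs = statistics.stdev(values) / n ** 0.5

# cluster-aware standard error: with equal-sized clusters the overall mean
# is the mean of the cluster means, so the clusters are the independent units
cluster_means = [
    statistics.mean([y for c, y in sample if c == k])
    for k in range(n_clusters)
]
se_cluster = statistics.stdev(cluster_means) / n_clusters ** 0.5

# ratio of the two variances, a rough design effect for this toy sample
design_effect = (se_cluster / se_srs) ** 2

print("srs standard error:    ", round(se_srs, 4))
print("cluster standard error:", round(se_cluster, 4))
print("design effect:         ", round(design_effect, 2))
```

in this toy setup the cluster-aware standard error comes out well above the srs one, so an srs-based confidence interval would be too narrow.  real surveys will differ in how strong the clustering is.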

you cannot disclose your sampling units but you would like your users to calculate a more accurate (or at least more conservative) confidence interval around the statistics that they compute off of your survey data.  the alternative to linearization-based confidence interval calculations? a replication-based confidence interval calculation.  try this:

click here to view a step-by-step tutorial to create obfuscated replicate weights for your complex-sample survey data
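
for a sense of what your data users would do with replicate weights once they are on the file, here is a generic sketch in python.  it is not the obfuscation procedure from the tutorial.  the function names, the `scale` argument, and the toy delete-one jackknife weights are all assumptions for illustration; the correct scale factor depends on the replication scheme that the data provider documents.

```python
import statistics


def weighted_mean(values, weights):
    """weighted mean of values under one set of weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def replicate_se(values, full_weights, replicate_weights, scale):
    """standard error from replicate weights.

    re-computes the statistic once per replicate weight, then sums the
    squared deviations from the full-sample estimate.  `scale` depends on
    the replication method and is supplied by the caller.
    """
    theta = weighted_mean(values, full_weights)
    replicate_estimates = [weighted_mean(values, rw) for rw in replicate_weights]
    variance = scale * sum((r - theta) ** 2 for r in replicate_estimates)
    return theta, variance ** 0.5


# toy data (made up): five respondents, equal full-sample weights
values = [3.0, 5.0, 4.0, 8.0, 10.0]
n = len(values)
full_weights = [1.0] * n

# toy delete-one jackknife: each replicate zeroes out one respondent and
# scales the remaining weights up by n / (n - 1)
jackknife_weights = [
    [0.0 if i == dropped else n / (n - 1) for i in range(n)]
    for dropped in range(n)
]

estimate, se = replicate_se(
    values, full_weights, jackknife_weights, scale=(n - 1) / n
)

# normal-approximation interval; a real analysis would account for the
# degrees of freedom of the design
ci = (estimate - 1.96 * se, estimate + 1.96 * se)

# sanity check: for an unclustered mean, the delete-one jackknife
# reproduces the textbook standard error
textbook_se = statistics.stdev(values) / n ** 0.5
print(estimate, se, textbook_se, ci)
```

the point of the design is that the user never sees a cluster identifier: the weights carry the design information, and the same loop works for any statistic that can be re-computed under each set of replicate weights.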

there aren't many people who i like more than dan oberski.  a survey methodologist at tilburg university, dr. oberski kindly reviewed my proposed solution and sketched out an argument in favor of the procedure.

read his arguments in this pdf file or in latex format.

even though he's convinced that the conclusion is true, he cautions that some of the design-unbiasedness proof steps are not wholly rigorous - especially (3) - and that in order for this method to gain wide acceptance, a research article submitted to a peer-reviewed journal would need more careful study, a formal justification of unbiased standard errors, and a small simulation.  so you have a green light from us, but give your own survey methodologists the final say.  glhf and use r