analyze the european social survey (ess) with r

with more than a decade of microdata aimed at gauging the political mood across european nations, the european social survey (ess) allows scientists like you to examine socio-demographic shifts among broad groups all the way down to pirate party (piratpartiet) voters in sweden.  with much of the same scope as the united states' general social survey (gss), this biennial survey gives demographers the clearest window into political opinion and behavior across the continent.

run out of the city university london and six other centres, this survey sets its sample universe at all persons aged 15 and over resident within private households, regardless of nationality, citizenship, language or legal status in the participating countries.  however, it's smart - dare i say very smart - to check the documentation report (here's round five) and confirm that the statistics you're coming up with actually generalize to the resident populations that you think that they do.

after enduring a few spammy e-mails from me, daniel oberski agreed to co-author this post and all of the code.  dan spent a handful of years in catalonia at upf's ess competence centre, so in addition to being able to disentangle and simplify this survey's tricky methodology for us, he's also provided a wicked starter script on structural equation modeling (sem) with complex sample survey data, using his very own lavaan.survey package.  so tell him thanks.  this new github repository contains three scripts:

download all microdata.R
  • after you register for an account, plop `` at the top of this script and let 'er rip
  • automatically log in and determine which countries and rounds are currently available
  • for each available round, cycle through each available file: download, unzip, and import it
  • save everything on the local disk as a convenient data.frame object

analysis examples.R

structural equation modeling examples.R


click here to view these three scripts

for more detail about the european social survey (ess), visit:


some analysts blindly start with the integrated, multi-country data set for each round.  that file contains all countries stacked into a single data table and the appropriate within-country weights, so you'll get the correct point estimates (means, medians, percents).  unfortunately, the integrated file does not contain other sample design information such as clusters and strata, which influence standard errors and statistical tests.  so it's generally necessary to use the country-specific files and associated sample design data file (sddf) if you're itching to calculate a confidence interval, standard error, or any kind of honest statistical test.  a classical approximation to correct standard errors is to multiply the standard error you get without accounting for the survey design by the square root of the "design effect";  the norwegian social science data services have created this tutorial on how to calculate design effects for linear functions of the data such as means and totals, but if that's over your head or you want to estimate something other than means or totals, just use our scripts instead.
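as a quick illustration of that classical approximation - with a made-up point estimate, standard error, and design effect, since the real numbers depend on your country and your statistic - the arithmetic is just:

```r
# sketch of the design effect shortcut.  both numbers below are invented --
# in practice the design effect comes from the sddf's cluster and stratum info
naive_se <- 0.8                                # standard error ignoring the survey design
deff <- 1.9                                    # hypothetical design effect for this estimate

# inflate the naive standard error by the square root of the design effect
corrected_se <- naive_se * sqrt( deff )

# a 95% confidence interval around some (also invented) point estimate
estimate <- 42.5
ci <- estimate + c( -1.96 , 1.96 ) * corrected_se
ci
```

for anything fancier than means and totals, skip this shortcut and build a proper survey design object from the country-specific files and sddf instead.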

confidential to sas, spss, stata, and sudaan users: unless you're a paleontologist, forget those fossils and transition to r.  :D

analyze the national household travel survey (nhts) with r and monetdb

if you've ever gotten stuck in traffic and started wondering what data might be available to better design the network of roads and rail, rev your engines for the national household travel survey (nhts).  dating back to the same decade as eisenhower's interstate system, this random sample of americans contains most every event related to mobility, commuting, yes even national lampoon's vacation.  professional transportation planners and transit researchers: this is where you belong.  i somehow convinced my friend alex karner to author both this post and most all of the code, so if you like what you see, thank him not me.

this data set began life as the nationwide personal transportation survey (npts), so if you see that title somewhere, just think of it as nhts classic.  the latest main data files provide cross-sectional, nationally representative data on persons and households including their vehicles, all trips made in one assigned travel day, and their neighborhoods. (think of a trip as one-way travel between an origin - like home - and a destination - like work.)  in addition to the national sample, many state departments of transportation and regional transportation planning agencies fund add-on samples so that descriptive statistics can be calculated at finer geographies.  and since the person-level data contain detailed demographics, it's feasible to analyze travel behavior of the young, the elderly, people of color, and low-income folks, etc. etc.  good luck trying to do that with smaller-scale transit surveys.  that said, be cautious when generating estimates at the sub-national level; check out the weighting reports to get a sense of which geographies have sufficient sample size.

before you start editing our code and writing your own, take some time to familiarize yourself with the user guide and other relevant documents (such as their glossary of terms or how they create constructed variables) on their comprehensive publications table.  each single-year release comprises four files: person-level (age, sex, internet shopping behavior), household-level (size, number of licensed drivers), vehicle-level (make, model, fuel type), and travel day-level (trip distance, time starting/ending, means of transportation).  the download automation script merges each file with its appropriate replicate-weight file, so if you wish to attach household-level variables onto the person-level file, ctrl+f search through that script for examples of how to create additional _m_ (merged) files.  this new github repository contains four scripts:

download and import.R
  • create the batch (.bat) file needed to initiate the monetdb server in the future
  • download, unzip, and import each year specified by the user
  • merge on the weights wherever the weights need to be merged on
  • create and save the taylor-series linearization complex sample designs
  • create a well-documented block of code to re-initiate the monetdb server in the future
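in case the merging steps above sound mysterious, here's the core idea in base r with toy data - the column names mimic nhts conventions (`HOUSEID` is the household identifier) but the values are invented:

```r
# toy household- and person-level tables, keyed on the household identifier
household <- data.frame( HOUSEID = c( 1 , 2 ) , HHSIZE = c( 4 , 1 ) )
person <- data.frame( HOUSEID = c( 1 , 1 , 2 ) , R_AGE = c( 34 , 12 , 67 ) )

# attach the household-level variables onto every person record --
# the same pattern used to build the _m_ (merged) files
person_m <- merge( person , household , by = "HOUSEID" )
person_m
```

note that the result stays one-record-per-person; the household columns simply repeat for everyone living under the same roof.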

analysis examples.R
  • run the well-documented block of code to re-initiate the monetdb server
  • load the r data file (.rda) containing the replicate-weighted design for the person-level 2009 file
  • perform the standard repertoire of analysis examples, only this time using sqlsurvey functions

variable recode example.R
  • run the well-documented block of code to re-initiate the monetdb server
  • copy the 2009 person-level table to maintain the pristine original
  • add a new age category variable by hand
  • re-create then save the sqlsurvey replicate-weighted complex sample design on this new table
  • close everything, then load everything back up in a fresh instance of r
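the "add a new age category variable by hand" step works the same way outside of monetdb; here's a minimal base r sketch with invented ages and hypothetical category cutoffs, using cut() instead of the sql statements in the actual script:

```r
# a few invented ages
age <- c( 5 , 17 , 34 , 67 , 80 )

# bin the continuous age variable into three categories
agecat <-
	cut(
		age ,
		breaks = c( 0 , 17 , 64 , Inf ) ,
		labels = c( "0-17" , "18-64" , "65+" ) ,
		include.lowest = TRUE
	)

table( agecat )
```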

replicate ornl.R
  • run the well-documented block of code to re-initiate the monetdb server
  • load the r data file (.rda) containing the replicate-weighted design for the 2009 person-level file
  • replicate statistics from "table 1" of oak ridge national laboratory's example output document

click here to view these four scripts


data from the 1969 and 1977 national personal transportation survey (the nhts predecessor) are not available online.  replicate weights were added beginning with the 2001 release.  the 1983, 1990 and 1995 survey years contain only the overall weight and no way to accurately estimate the variance, so if you'd like to create a survey object that will give you incorrect standard errors, you might copy the taylor-series linearization object creation at the very bottom of the us decennial census public use microdata sample's helper functions, but don't for a second trust the confidence intervals it produces.  if you'd like either of those things to change, it can't hurt to ask.

confidential to sas, spss, stata, sudaan users:  honk if you love r :D

analyze the pesquisa mensal de emprego (pme) with r

whether or not you hit the snooze button, once every month your morning radio station probably announces the latest employment statistics for the nation.  in the united states, those headlines come from the bureau of labor statistics' current population survey.  meanwhile down in rio de janeiro, the brazilian institute of geography and statistics (ibge) releases a staggeringly similar pesquisa mensal de emprego.  simply translated: monthly employment survey.  my friend djalma pessoa at ibge co-authored this post and the dutifully-commented code, so let me loosely paraphrase beyonce and recommend, "if you like it then you shoulda [sent him a nice thank-you note]..wuh-uh-oh.."

the primary tool to assess the brazilian labor force, this household-level survey only covers the metropolitan areas of the six largest cities (about a quarter of the nationwide population).  beyond detailed information about the income and circumstances of active workers, this data set also describes what the unemployed are doing to land a job (employment agencies, entrance exams).  and, although this sample gets conducted as a panel, the individuals surveyed over multiple months cannot be connected across public-use files.  so there you have it.  when you're ready to fire up your formal research of the brazilian labor market, well, this new github repository contains three scripts:

download all microdata.R
  • download each monthly zipped file, plus documentation
  • import each individual microdata table directly into r, short and sweet
  • store quick-to-load copies of each microdata table for easy access later

analysis examples.R
  • load a single month of data into working memory
  • construct the complex sample survey object, post-stratifying according to ibge specifications
  • run example analyses that calculate perfect means, medians, quantiles, totals, even ratios
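if you're curious what post-stratifying actually does to the weights, here's the core idea in base r - rescale the starting weights so that, within each post-stratum, they sum to a known population total.  the totals below are invented for illustration; ibge publishes the real projections used for the pme:

```r
w <- c( 10 , 10 , 20 , 20 )                    # starting weights
stratum <- c( "a" , "a" , "b" , "b" )          # post-stratum of each record
pop_totals <- c( a = 30 , b = 50 )             # known (invented) population counts

# ratio-adjust the weights within each post-stratum
adjusted <- w * pop_totals[ stratum ] / tapply( w , stratum , sum )[ stratum ]

# the adjusted weights now sum exactly to the population totals
tapply( adjusted , stratum , sum )
```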

unemployment rate.R
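at its heart, the headline unemployment rate is just a ratio of two weighted totals - unemployed persons over the economically active population.  here's a toy sketch of the point estimate with invented weights and indicator variables; the actual script uses the survey package's ratio estimation on the full complex sample design, which also gets you a standard error:

```r
w <- c( 1.2 , 0.8 , 1.5 , 1.0 )                # post-stratified person weights (invented)
unemployed <- c( 1 , 0 , 0 , 1 )               # 1 = unemployed
active <- c( 1 , 1 , 1 , 1 )                   # 1 = in the labor force

# weighted total of the unemployed over the weighted total of the labor force
rate <- sum( w * unemployed ) / sum( w * active )
rate
```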


click here to view these three scripts

for more detail about the pesquisa mensal de emprego, visit:


starting in march of 2014, the pme will be re-weighted in order to hit population projections from their 2013 revision and maintain trendability over the past decade. ibge will also begin publishing weight variables that contain the same number of decimal places as are used internally to compute the published tables.  for reproducibility's sake: maneiro!

although the survey underwent a few revisions after its preliminary launch in the early eighties, the file structure of the pme has not changed since 2002, the first month of downloadable microdata.  therefore, code that worked on a mid-2002 microdata month should also work on a mid-2012 month.  and if you're a serious pme user, you likely also seriously speak portuguese.  in that case, you might as well get started on the portuguese-language version of their homepage.

confidential to sas, spss, stata, sudaan users: leave those bananas for the monkeys.  to r is human. :D