analyze the national household travel survey (nhts) with r and monetdb

if you've ever gotten stuck in traffic and started wondering what data might be available to better design the network of roads and rail, rev your engines for the national household travel survey (nhts).  dating back to the same decade as eisenhower's interstate system, this random sample of americans contains nearly every event related to mobility, commuting, yes, even national lampoon's vacation.  professional transportation planners and transit researchers: this is where you belong.  i somehow convinced my friend alex karner to author both this post and almost all of the code, so if you like what you see, thank him, not me.

this data set began life as the nationwide personal transportation survey (npts), so if you see that title somewhere, just think of it as nhts classic.  the latest main data files provide cross-sectional, nationally representative data on persons and households including their vehicles, all trips made in one assigned travel day, and their neighborhoods.  (think of a trip as one-way travel between an origin, like home, and a destination, like work.)  in addition to the national sample, many state departments of transportation and regional transportation planning agencies fund add-on samples so that descriptive statistics can be calculated at finer geographies.  and since the person-level data contain detailed demographics, it's feasible to analyze the travel behavior of the young, the elderly, people of color, low-income folks, and so on.  good luck trying to do that with smaller-scale transit surveys.  that said, be cautious when generating estimates at the sub-national level; check out the weighting reports to get a sense of which geographies have sufficient sample size.

before you start editing our code and writing your own, take some time to familiarize yourself with the user guide and other relevant documents (such as their glossary of terms or how they create constructed variables) on their comprehensive publications table.  each single-year release comprises four files: person-level (age, sex, internet shopping behavior), household-level (size, number of licensed drivers), vehicle-level (make, model, fuel type), and travel day-level (trip distance, start and end times, means of transportation).  the download automation script merges each file with its appropriate replicate-weight file, so if you wish to attach household-level variables onto the person-level file, ctrl+f through that script for examples of how to create additional _m_ (merged) files.  this new github repository contains three scripts:

download and import.R
  • initiate the monet database with the new monetdblite package
  • download, unzip, and import each year specified by the user
  • merge on the weights wherever the weights need to be merged on
  • create and save the replicate-weighted complex sample designs
  • create a well-documented block of code to re-initiate the monetdb server in the future
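
to give a feel for the "merge on the weights" step, here is a minimal in-memory sketch using base r's merge().  it is not the download script itself (which works against the monet database); the toy tables and every column name in them (houseid, personid, r_age, hhsize, drvrcnt, wtperfin, wtperfin1, wtperfin2) are assumptions made up for illustration.

```r
# toy stand-ins for three nhts tables; all values and column names are illustrative
household <- data.frame(
  houseid = c("h1", "h2"),
  hhsize  = c(2L, 1L),
  drvrcnt = c(2L, 1L)
)

person <- data.frame(
  houseid  = c("h1", "h1", "h2"),
  personid = c(1L, 2L, 1L),
  r_age    = c(34L, 36L, 71L)
)

# full-sample weight plus two replicate weights per person (hypothetical)
person_weights <- data.frame(
  houseid   = c("h1", "h1", "h2"),
  personid  = c(1L, 2L, 1L),
  wtperfin  = c(1000, 1200, 900),
  wtperfin1 = c(0, 2400, 1800),
  wtperfin2 = c(2000, 0, 1800)
)

# step 1: attach the replicate weights to the person-level records
person_m <- merge(person, person_weights, by = c("houseid", "personid"))

# a one-to-one merge should neither drop nor duplicate people
stopifnot(nrow(person_m) == nrow(person))

# step 2: attach household-level variables onto the merged person-level file
person_hh_m <- merge(person_m, household, by = "houseid")

# still one row per person after the many-to-one merge
stopifnot(nrow(person_hh_m) == nrow(person))

print(person_hh_m)
```

the row-count checks after each merge are the important habit here: a merge that silently adds or drops records will throw off every weighted estimate downstream.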

analysis examples.R
  • re-initiate the monetdb server
  • load the r data file (.rda) containing the replicate-weighted design for the person-level 2009 file
  • perform the standard repertoire of analysis examples
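
as a rough idea of what analysis against a replicate-weighted design looks like, here is a small sketch using the survey package on simulated data.  nothing here touches the real nhts files: the toy data frame, the column names (trips, r_sex, wtperfin, wtperfin1 through wtperfin10), the ten delete-a-group jackknife replicates, and the type/scale settings are all assumptions for illustration.  the correct settings for a real nhts year come from the weighting documentation and the download script.

```r
library(survey)

set.seed(1)

n      <- 200   # toy sample size
n_reps <- 10    # toy number of replicate weights (an assumption, not the nhts count)

# simulated person-level records with a full-sample weight
toy <- data.frame(
  r_sex    = sample(c("male", "female"), n, replace = TRUE),
  trips    = rpois(n, 4),
  wtperfin = runif(n, 500, 1500)
)

# build delete-a-group jackknife replicate weights:
# replicate r zeroes out group r and scales everyone else up by n_reps / (n_reps - 1)
jk_group <- rep(seq_len(n_reps), length.out = n)
for (r in seq_len(n_reps)) {
  toy[[paste0("wtperfin", r)]] <-
    ifelse(jk_group == r, 0, toy$wtperfin * n_reps / (n_reps - 1))
}
rep_cols <- paste0("wtperfin", seq_len(n_reps))

# construct the replicate-weighted design object
toy_design <- svrepdesign(
  weights          = ~wtperfin,
  repweights       = toy[, rep_cols],
  type             = "JK1",
  scale            = (n_reps - 1) / n_reps,
  rscales          = rep(1, n_reps),
  combined.weights = TRUE,
  data             = toy
)

# weighted mean of daily trips, with a replication-based standard error
overall <- svymean(~trips, toy_design)
print(overall)

# the same statistic broken out by sex
by_sex <- svyby(~trips, ~r_sex, toy_design, svymean)
print(by_sex)
```

the point estimate is just the weighted mean under the full-sample weight; the replicate weights only come into play for the standard error.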

replicate ornl.R
  • re-initiate the monetdb server
  • load the r data file (.rda) containing the replicate-weighted design for the 2009 person-level file
  • replicate statistics from "table 1" of oak ridge national laboratory's example output document

click here to view these three scripts


data from the 1969 and 1977 nationwide personal transportation survey (the nhts predecessor) are not available online.  replicate weights were added beginning with the 2001 release.  the 1983, 1990, and 1995 survey years contain only the overall weight and no way to accurately estimate the variance.  if you'd like to create a survey object that will give you incorrect standard errors, you might copy the taylor-series linearization object creation at the very bottom of the us decennial census public use microdata sample's helper functions, but don't for a second trust the confidence intervals that produces.  if you'd like either of those things to change, it can't hurt to ask.
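
for the curious, a weight-only design of the kind described above might look something like the sketch below.  this is not copied from the census helper functions; the toy data and the column names (trips, overall_weight) are assumptions, and the design pretends the data came from a simple weighted sample with no strata or clusters.  the point estimate is fine, but the standard error ignores the real sample design, so treat it as untrustworthy.

```r
library(survey)

# toy stand-in for a pre-2001 file that carries only an overall weight (illustrative values)
toy_old <- data.frame(
  trips          = c(2, 4, 3, 6, 1, 5),
  overall_weight = c(900, 1100, 1000, 1200, 800, 1000)
)

# taylor-series linearization design with no clustering or stratification information
naive_design <- svydesign(
  ids     = ~1,
  weights = ~overall_weight,
  data    = toy_old
)

# the weighted mean is usable; the standard error printed next to it is not
naive_mean <- svymean(~trips, naive_design)
print(naive_mean)
```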

confidential to sas, spss, stata, sudaan users:  honk if you love r :D