analyze the current population survey (cps) annual social and economic supplement (asec) with r

the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948.  wow.  the us census bureau and the bureau of labor statistics (bls) tag-team on this one.  until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents.  this provides enough sample to conduct state- and a few large metro area-level analyses.  your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years.  county-level is a no-no.

despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency.  aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics.  cps-asec generalizes to the united states non-institutional, non-active duty military population.

the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person).  to import these files into r, the parse.SAScii function (from the SAScii package) reads nber's sas code to determine the column layout of the fixed-width file, then RSQLite puts everything into a schnazzy database.  you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.  this new github repository contains three scripts:

2005-2012 asec - download all microdata.R
  • download the fixed-width file containing household, family, and person records
  • import by separating this file into three tables, then merge 'em together at the person-level
  • download the fixed-width file containing the person-level replicate weights
  • merge the rectangular person-level file with the replicate weights, then store it in a sql database
  • create a new variable - one - in the data table

2012 asec - analysis examples.R
  • connect to the sql database created by the 'download all microdata' program
  • create the complex sample survey object, using the replicate weights
  • perform a boatload of analysis examples

replicate census estimates - 2011.R
  • connect to the sql database created by the 'download all microdata' program
  • create the complex sample survey object, using the replicate weights
  • match the sas output shown in the png file below

2011 asec replicate weight sas output.png


click here to view these three scripts
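the import step in the first script can be sketched roughly like this.  a minimal sketch, assuming the SAScii and RSQLite packages are installed; the file paths "asec.sas" and "asec.dat" and the table name "asec12" are hypothetical placeholders, and the real script additionally splits the household, family, and person record types apart before merging:

```r
# a minimal sketch of the import approach, not the actual download script.
# parse.SAScii (from the SAScii package) reads the nber sas script to get
# column names and positions, then RSQLite stores the result in a database.
library(SAScii)
library(RSQLite)

# extract the column layout from the sas INPUT block
stru <- parse.SAScii( "asec.sas" )

# read the fixed-width data file
# (negative widths in the layout are gaps to skip)
x <-
	read.fwf(
		"asec.dat" ,
		widths = stru$width ,
		col.names = stru$varname[ !is.na( stru$varname ) ]
	)

# store the person-level table in a sql database on the local disk
db <- dbConnect( SQLite() , "cps.asec.db" )
dbWriteTable( db , "asec12" , x )
dbDisconnect( db )
```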



for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:

notes:

interviews are conducted in march about experiences during the previous year.  the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011.  when you use the current population survey to talk about america, subtract a year from the data file name.
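that subtract-a-year rule is trivial, but here's a tiny illustrative helper anyway (hypothetical, not part of the repository's scripts):

```r
# hypothetical helper: convert a cps-asec data file year to the
# calendar year the income/work/insurance questions actually describe
reference.year <- function( file.year ) file.year - 1

reference.year( 2012 )  # the file labeled 2012 describes calendar year 2011
```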

as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.


confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter?  time to transition to r.  :D

5 comments:

  1. # I created the following functions to analyze the ACS files
    # because I found that the survey package doesn't give perfect
    # standard errors.

    library(plyr)

    #-----------------------------------
    # Functions for calculation of ACS PUMS estimates and Standard Errors


    acs.se <- function( x ) {
      # formula found at: http://www.census.gov/acs/www/Downloads/handbooks/ACSPUMS.pdf
      # page 26
      # 4/80 in the formula = 0.05

      # sum of each of the replicate weight columns (pwgtp1 - pwgtp80)
      x.weights.sum <- sapply( x[ , grep( "pwgtp" , names( x ) ) ] , sum )

      # full-sample estimate from the main weight
      x.estimate <- sum( x$PWGTP )

      return( c( se = round( sqrt( sum( ( x.weights.sum - x.estimate )^2 ) * .05 ) , 0 ) ) )
    }

    acs.estimate <- function( x ) {
      return( c( estimate = sum( x$PWGTP ) ) )
    }

    acs.summary <- function( x ) {
      return( c( acs.estimate( x ) , acs.se( x ) ) )
    }

    #------------

    # Usage
    ddply(acs.data, .(SEX), acs.summary)

    Replies
    1. that's not true :) in order to get perfect standard errors with the acs, you just need to use the mse option, described here: http://tolstoy.newcastle.edu.au/R/packages/11/0785.html

      the acs code currently in my github repository uses the survey package and matches census numbers exactly; however, i haven't published it yet because it doesn't work on computers with very limited resources (less than 4GB RAM)

      permanent link to today's acs github:

      https://github.com/ajdamico/usgsd/tree/a0840308603cda8046be274a9193a8ac3dfa67af/American%20Community%20Survey
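      the mse option mentioned above can be set globally or per-design.  here's a quick self-contained illustration using the survey package's built-in api dataset (not cps or acs data):

      ```r
      library(survey)

      # example data shipped with the survey package
      data( api )

      # a one-stage cluster design, converted to a replicate-weight design
      dclus1 <- svydesign( id = ~dnum , weights = ~pw , data = apiclus1 , fpc = ~fpc )

      # mse = TRUE computes variances around the point estimate
      # (the census bureau's convention) instead of around the mean
      # of the replicate estimates
      rclus1 <- as.svrepdesign( dclus1 , mse = TRUE )

      # ..or set it globally before constructing any designs:
      # options( survey.replicates.mse = TRUE )

      svymean( ~api00 , rclus1 )
      ```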

  2. Anthony:

    Thank you for putting up this code. I am trying to work with the CPS for 2011 (that is, data from the 2012 March ASEC).

    I have downloaded all microdata from 2005 to 2013. I am following and modifying the example you provide in the "2012 asec - analysis examples.R" file, but cannot get my results to match up with Census Bureau published results, nor can I create (survey weighted) histograms.

    Here is my code:

    ```
    library(survey)   # load survey package (analyzes complex design surveys)
    library(RSQLite)  # load RSQLite package (creates database files in R)

    options( survey.lonely.psu = "adjust" )
    options( survey.replicates.mse = TRUE )

    y <-
      svrepdesign(
        weights = ~marsupwt ,
        repweights = "pwwgt[1-9]" ,
        type = "Fay" ,
        rho = ( 1 - 1 / sqrt( 4 ) ) ,
        data = "asec12" ,
        combined.weights = T ,
        dbtype = "SQLite" ,
        dbname = "cps.asec.db"
      )

    svyquantile( ~htotval , y , 0.5 )
    svyhist( ~htotval , y )
    ```

    The svyquantile command produces the following result:

    ```
    > svyquantile(~htotval, y, 0.5)
    Statistic:
    htotval
    q0.5 59505
    SE:
    htotval
    q0.5 300
    ```

    The published value from the US Census Bureau in unadjusted dollars is $50,054 (for example, see table H-8 here: http://www.census.gov/hhes/www/income/data/historical/household/)

    The svyhist command produces the following error:

    ```
    > svyhist(~htotval, y)
    Error in formula.default(object, env = baseenv()) : invalid formula
    ```

    Any help/explanation of what is going on with the data is appreciated!

    Replies
    1. awesome questions..

      to get median household income, i believe you'd need a table that's one record per household and uses the household weights and replicate weights. the asecXX tables created by the download automation script are one record per person, and the script actually deletes the household-level tables (search the script for "drop table household"). try `svyquantile(~ptotval,subset(y,a_age>14 & ptotval>0),0.5)` and you should get very close to the median personal income tables on the same website (note the census often uses a slightly-modified dataset from the public use file). and, of course, see my replication script for an exact match of both statistic and standard error.

      you've found a bug in the survey package's 3.29-5 version of the svyhist() function when used on replicate-weighted database-backed designs. it's been documented and sent to dr. lumley. assuming you have a moderately powerful computer, you can work around this bug by reading the data all into memory `db <- dbConnect( SQLite() , "cps.asec.db" ) ; asec12 <- dbReadTable( db , "asec12" ) ; y <- svrepdesign( weights = ~marsupwt, repweights = "pwwgt[1-9]", type = "Fay", rho = (1-1/sqrt(4)), data = asec12 , combined.weights = T )` and then `svyhist(~ptotval , y )` should work as normal

    2. Both pieces of code worked perfectly on my machine with 8 GB of RAM. Thanks for all your help!
