analyze the new york city housing and vacancy survey (nychvs) with r

for those interested in the real estate and rental markets of the big apple, the census bureau's nyc housing and vacancy survey might be your key to the city.  if you care about how many new york residents live more than one person per room (a lot), how many structures are dilapidated (a few, phew), or what rent prices run these days (cha-ching), start here.  way back in 1965, new york law began requiring the enumeration of the city's heavily-regulated rental market, establishing this complex sample survey of about twenty thousand households, both occupied and vacant.  nowadays it's triennial, it's publicly-downloadable, and it's free n easy to analyze with the r language.

although the census bureau employs the survey administrators and produces the main how-to documents (both faq and overviews), city government actually pays the bill and gets the glory: the preliminary 2011 report with all the fun facts and the older but more complete 2008 report.  the microdata include four exciting files: a person-level file for occupied units, a household-level file for occupied units, a household-level file for vacant units, and a household-level file for units that didn't yield an interview (solely for adjusting the vacant-unit statistics).  most urban planning and policy wonks line up the occupied and vacant household-level files to calculate a vacancy rate, but depending on your mission, you might need some person-level action as well.  by the way, the nyc.gov report is six months older than the latest 2011 microdata, so don't panic if your stats are off by a whisker.  this new github repository contains three scripts:


2002 - 2011 - download all microdata.R
  • download, import, and save each of the four data files into a single year-specific .rda file, for every survey year back to 2002
  • bumper sticker idea for nychvs data users: if you can read this, thank a furman center for the sas import scripts.

2011 analysis examples.R
  • load all available tables for a single year of data
  • construct the complex sample survey object, but it's fake - see note below.
  • run example analyses that calculate perfect means, medians, quantiles, totals

replicate contract items 2008.R
  • load all available tables for a single year of data
  • construct the complex sample survey object, but it's fake - see note below.
  • thoroughly explain a back-of-the-envelope calculation for standard errors, confidence intervals, variances
  • print statistics that match exactly - and confidence intervals more conservative than - the target replication table
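
a minimal sketch of the vacancy-rate idea mentioned earlier: stack the occupied and vacant household-level files, then take the weighted share of vacant units.  the data are made up, and the column names (`unit_id`, `hhweight`, `vacant`) are hypothetical stand-ins, not confirmed nychvs variable names.  published vacancy rates also restrict to particular unit types (rentals, for example), which this toy version skips.

```r
# toy stand-ins for the occupied and vacant household-level files.
# column names and values are hypothetical, not actual nychvs variables.
occ <- data.frame( unit_id = 1:6 , hhweight = c( 150 , 160 , 140 , 155 , 145 , 150 ) )
vac <- data.frame( unit_id = 7:8 , hhweight = c( 50 , 50 ) )

# flag each file, then stack them into one household-level table
occ$vacant <- 0
vac$vacant <- 1
all_units <- rbind( occ , vac )

# weighted total of vacant units
vacant_total <- sum( all_units$hhweight[ all_units$vacant == 1 ] )

# weighted share of all units that are vacant
vacancy_rate <- weighted.mean( all_units$vacant , all_units$hhweight )

print( vacant_total )
print( vacancy_rate )
```

the point estimates only need the weights.  the standard errors are the hard part, as the notes explain.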






for more detail about the new york city housing and vacancy survey, visit:


notes:

hint for statistical illiterates: if the data point you're looking for isn't in the nyc.gov grand report, check the census bureau's copious online tables too.

as described in detail in the comments of the replication script, it's impossible to exactly match the census-published confidence intervals.  here's one snippet of a longer conversation about how users cannot automate the computation of standard errors (discussed at footnote five) with the nychvs.  the `segment` variable (mentioned in the e-mail) does not get released due to confidentiality concerns.  either calculate the standard errors by hand with the infuriating generalized variance formula recommended in each year's source and accuracy statement (2008, 2011) or use the back-of-the-envelope method i invented that approximates census-published confidence intervals conservatively.  when i learned that users couldn't automate the matching of census-published numbers, i tried to be a bootstrapping young lad and come up with some fancy standard error computation methodology.  but it turns out that multiplying the unadjusted errors by two gets as close to the right answer as anything else.  if you're writing the final draft of a research product destined to get heavy exposure, you might have to calculate confidence intervals by hand or pay the census bureau for a custom run.  but for those of us who can live with an occasional false negative in our lives, try it my way.
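
a rough sketch of that doubling rule in base r, using made-up rents and weights.  the unadjusted standard error here comes from a simple with-replacement formula that ignores clustering; that formula is my assumption for illustration and not necessarily what the replication script computes.  the variable names and the 95% normal interval are likewise illustrative.

```r
# toy data: monthly rents and household weights (made-up values)
rent <- c( 900 , 1200 , 1500 , 2100 , 1100 , 1750 , 1300 , 2500 )
w <- c( 150 , 140 , 160 , 155 , 145 , 150 , 165 , 135 )

n <- length( rent )

# weighted mean rent
est <- weighted.mean( rent , w )

# unadjusted standard error of the weighted mean, treating every
# household as its own sampling unit (no clustering, no strata).
# this formula is an assumption made for the sketch.
z <- w * ( rent - est ) / sum( w )
se_unadjusted <- sqrt( n / ( n - 1 ) * sum( z^2 ) )

# back-of-the-envelope adjustment: double the unadjusted error
se_adjusted <- 2 * se_unadjusted

# conservative 95% confidence interval built from the doubled error
ci <- est + c( -1 , 1 ) * qnorm( 0.975 ) * se_adjusted

print( c( estimate = est , se = se_adjusted , lower = ci[ 1 ] , upper = ci[ 2 ] ) )
```

doubling the error doubles the width of the interval, which is why this approach leans toward false negatives and away from false positives.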

confidential to sas, spss, stata, and sudaan users: i look at you the way new yorkers look at jersey.  time to transition to r. :D

analyze the social security administration public use microdata files (ssapumf) with r

the social security administration (ssa) must be overflowing with quiet heroes, because their public-use microdata files are as inconspicuous as they are thorough.  sure, ssa publishes enough great statistical research of their own that we outside researchers rarely find ourselves wanting more and finer data that this agency can provide, but does that stop them from releasing detailed microdata as well?  why no.  no it does not.  if you wake up one morning with a hankerin' to study the person-level lifetime cash-flows of fdr's legacy, roll up your sleeves and start right here.

compared to the other data sets on asdfree.com, the social security administration public use microdata files (ssapumf) are as straightforward as it gets.  you won't find complex sample survey data here, so just review the short-and-to-the-point data descriptions then calculate your statistics the way you would with other non-survey data.  each of these files contains either one record per person or one record per person per year, and effortlessly generalizes to the entire population of either social security number holders (most of the country) or social security recipients (just beneficiaries).  the one-percent samples should be multiplied by 100 to get accurate nationwide count statistics and the five-percent samples by 20, but ykta (my new urban dictionary entry).  this new github repository contains one script:


download all microdata.R
  • download each zipped file directly onto your local computer
  • load each file into a data.frame using a mixture of both fancery and schmantzery
  • reproduce the overall count statistics provided in each respective data dictionary
  • save each file as an R data file (.rda) for ultra-fast future use
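
the scaling rule for count statistics can be sketched like this.  both data.frames are made-up toys, and the column names (`person_id`, `sex`) are hypothetical, not taken from any ssa data dictionary.

```r
# toy one-percent sample: one record per person (values are made up)
one_percent <- data.frame( person_id = 1:5 , sex = c( "f" , "m" , "f" , "f" , "m" ) )

# toy five-percent sample
five_percent <- data.frame( person_id = 1:8 )

# each record in a one-percent sample stands in for 100 people,
# each record in a five-percent sample stands in for 20
nationwide_from_one <- nrow( one_percent ) * 100
nationwide_from_five <- nrow( five_percent ) * 20

# the same multiplier applies to any count broken out by category
by_sex <- table( one_percent$sex ) * 100

print( nationwide_from_one )
print( nationwide_from_five )
print( by_sex )
```

means, medians, and percentages need no multiplier at all, since every record carries the same weight within a file.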




for more detail about the social security administration public use microdata files (ssapumf), visit:

notes:

i skipped importing these new beneficiary data system (nbds) files because i broadly distrust data older than i am and you probably want these easy-to-use, far more current files anyway. 


confidential to sas, spss, stata, and sudaan users: no doubt they were very impressive when they originally became available.  but so was the bone flute.  time to transition to r.  :D

analyze the medical large claims experience study (mlces) with r

not a survey, not even remotely current, the society of actuaries' medical large claims experience study (mlces) might be the best private health insurance claims data available to the public.  this data should be used to calibrate other data sets, and probably nothing more.

researchers interested in studying healthcare patterns among our elderly, disabled, or poor can go to the centers for medicare and medicaid services for all sorts of up-to-date utilization data.  but what if you want to study the behavior and spending of everyone else?  you could look at the medical expenditure panel survey (meps), the consumer expenditure survey (ce) or the national health interview survey (nhis), but there's an attrition problem with those - anyone who suddenly falls expensively-ill also starts slamming the door on follow-up survey interviews.  and that's understandable - who wants to respond to a government questionnaire when you're struggling with a serious health condition?  american healthcare surveys are biased at the tail - they don't capture our sickest very well.

think about it some more: we have single-payer healthcare for our elderly (medicare), disabled (medicare again), and poor (medicaid), meaning there's a government agency that's got all that data in one place.  every claim paid by the government is just hanging out in baltimore, waiting for you to come a knockin'.  and there's no non-response bias with government healthcare claims data - the united states government knows exactly how much the united states government paid on your behalf, whether or not you agreed to respond to somesuch survey.  doctors submit bills pretty consistently, after all.  so the utilization patterns of medicare and medicaid beneficiaries are stored in a central location, standardized, and available for purchase or (with limitations) for immediate download.  but in a heavily-privatized medical industry like ours, what do you do when you want to explore the purchasing patterns of everyone else?  well, you still probably look at meps or ce.  but if your research question is hyper-focused on the dist-ri-bu-tion of medical claims among the privately-insured, well hey, the distribution of medical claims in mlces is much more realistic than what you'll find in survey data.  yes, it's old.  yes, it's only composed of claims from seven insurers and not every private insurer covering every covered life in the united states.  and yes, it might even have a y2k bug or two.  but for publicly-available medical claims for the privately insured in the united states of america, well, take it or leave it.  this new github repository contains two scripts:

1997-1999 mlces - download.R
  • download each zipped year of data onto your local computer
  • load the entire table into RAM
  • save the condensed file as an R data file (.rda)
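
the load-then-save steps might look something like the sketch below.  the download step is left out because the source urls aren't given here, so a toy csv written to a temporary file stands in for the unzipped data.  the object and column names (`claims`, `claimant`, `paid`) are hypothetical.

```r
# temporary file paths standing in for the unzipped download and the saved output
csv_path <- tempfile( fileext = ".csv" )
rda_path <- tempfile( fileext = ".rda" )

# write a tiny made-up claims table so there is something to import
toy <- data.frame( claimant = 1:3 , paid = c( 120.5 , 98000 , 40 ) )
write.csv( toy , csv_path , row.names = FALSE )

# load the entire table into ram
claims <- read.csv( csv_path )

# save it as an r data file (.rda) for fast reloading later
save( claims , file = rda_path )

# confirm the round trip: remove the object, then restore it from disk
rm( claims )
load( rda_path )

print( claims )
```

`load()` restores the object under the name it was saved with, so future scripts can skip the import entirely.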

replicate soa publications.R




for more detail about the medical large claims experience study (mlces), visit:

notes:

this data set is not generalizable to any recent population of americans.  its chief value is its relationship to itself - the distribution of medical spending, especially at the extreme values.  in caveman speak: percentages good, totals bad.
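
here's a small illustration of "percentages good, totals bad" using a made-up vector of claim amounts (not mlces data).  the helper `top_share()` is a hypothetical function written for this sketch: it reports the share of all spending that comes from the costliest fraction of claimants.

```r
# made-up claim amounts: ninety small, nine medium, one catastrophic
claims <- c( rep( 100 , 90 ) , rep( 1000 , 9 ) , 100000 )

# where the distribution sits at the middle and at the tail
print( quantile( claims , c( 0.5 , 0.9 , 0.99 ) ) )

# share of total spending from the top `p` fraction of claimants.
# hypothetical helper, defined here only for illustration.
top_share <- function( x , p ) {
	cutoff <- quantile( x , 1 - p )
	sum( x[ x >= cutoff ] ) / sum( x )
}

# percentages like these are the statistics worth taking from old claims data
print( top_share( claims , 0.01 ) )
print( top_share( claims , 0.10 ) )

# a raw total, by contrast, only describes this particular set of claims
print( sum( claims ) )
```

the shares describe the shape of the distribution, which is the part that carries over.  the dollar totals belong to the late 1990s and to those seven insurers.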


confidential to sas, spss, stata, and sudaan users: the best things in life are free.  time to transition to r.  :D