analyze the surveillance epidemiology and end results (seer) with r and monetdb

the surveillance epidemiology and end results program is the aggregation of all cancer registry statistics in the united states.  created by congressional decree, seer has captured a nationally-representative quarter of american cancer incidence since 1973.  when acs, cdc, nci, and naaccr publish their collaborative annual report, they use seer.  when the aacr predicts that america will have 18 million cancer survivors by 2022, they use seer too.  you can use seer three.

the national cancer institute blessedly provides a bouquet of free statistical software to import and analyze this microdata.  obviously, my code won't compete with the legions of epidemiological software programmers at the largest of the nih institutes.  but plenty of other r users have written packages to work with this stuff, so maybe, just maybe, someone will find value in my automated importation syntax.  plus, the seer microdata include a sas import script - which triggers my fight or fight harder reflex.  list of things i hate, descending sort order: mosquitoes, cancer, then sas a very distant third.  but still.

aside from easing the importation of this data into the r language, i suppose i have contributed one tangible improvement to the seer-analyst community: these download and import scripts will put all eight million records into wickedly-fast monetdb.  so long as you can perform your analysis using sql, you can perform your analysis (on all eight million records) in basically one second.  haa-cha!  i've said it before, i'll say it again: the import takes forrrrrever (leave it overnight). but once it's loaded, it'll outrun lightning.  this new github repository contains four scripts:


download.R
  • after setting your username and password, download and unzip the seer text data file to some working directory

import all tables into rda.R
  • grep through the unzipped seer text folders to find individual- and population-level tables
  • import each individual-level table into an r data.frame with sascii, then save to disk for fast loading later
  • import each population-level table into an r data.frame with sascii, then save to disk for fast loading later
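
here's a minimal sketch of that find, import, and save loop, using nothing but base r.  the real script leans on the sascii package to parse the sas import script that ships with seer; this sketch swaps in `read.fwf` with a made-up three-column layout instead.  the folder, the file name, the column widths, and the column names are all hypothetical, invented only so the example runs on its own.

```r
# build a tiny fake fixed-width file so this sketch is self-contained
# (hypothetical folder, file name, and layout: not seer's real structure)
seer_dir <- file.path( tempdir() , "seer_sketch" , "yr1973_2010" )
dir.create( seer_dir , recursive = TRUE , showWarnings = FALSE )
writeLines(
	c( "0000000119731" , "0000000219802" , "0000000319951" ) ,
	file.path( seer_dir , "BREAST.TXT" )
)

# search through the unzipped folders to find every text file
txt_files <-
	list.files(
		file.path( tempdir() , "seer_sketch" ) ,
		pattern = "\\.TXT$" ,
		recursive = TRUE ,
		full.names = TRUE
	)

for ( this_file in txt_files ){

	# read the fixed-width file into a data.frame
	# (assumed layout: 8-character id, 4-character year, 1-character sex)
	x <-
		read.fwf(
			this_file ,
			widths = c( 8 , 4 , 1 ) ,
			col.names = c( "patient_id" , "year_dx" , "sex" ) ,
			colClasses = "character"
		)

	# save the data.frame next to the text file for fast loading later
	rda_file <- sub( "\\.TXT$" , ".rda" , this_file )
	save( x , file = rda_file )
}
```

once a table is saved this way, `load( rda_file )` brings the data.frame `x` back without re-parsing the text file.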

import individual-level tables into monetdb.R
  • grep through the unzipped seer text folders to find individual-level tables
  • initiate a monetdb server on the local disk, then import each individual-level table with read.sascii.monetdb
  • stack all of the imported individual-level tables into one, thereby replicating the total case count
  • create a well-documented block of code to re-initiate the monetdb server in the future
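
the stacking step can be sketched as a single sql statement assembled in base r.  this is only an illustration under assumptions: the three table names and the stacked table name `x` are hypothetical, and the statement only works if every individual-level table shares the same columns.  the `with data` ending is monetdb's syntax for creating a table from a query.

```r
# hypothetical names of individual-level tables already imported into the database
tables <- c( "breast" , "colrect" , "digothr" )

# one "select *" per table..
selects <- paste( "SELECT * FROM" , tables )

# ..glued together with union all, so no records get dropped as duplicates
stack_sql <-
	paste(
		"CREATE TABLE x AS" ,
		paste( selects , collapse = " UNION ALL " ) ,
		"WITH DATA"
	)

# the query to run afterward to check the total case count
count_sql <- "SELECT COUNT(*) FROM x"

print( stack_sql )
```

both strings would then get handed to an open database connection, for example with dbi's `dbSendQuery` and `dbGetQuery`.  plain `union` would silently discard identical rows, which is why the sketch uses `union all` when the goal is to match a published case count.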

replicate case counts table.R



click here to view these four scripts



for more detail about surveillance epidemiology and end results microdata, visit:


notes:

seer is publicly-available, you just gotta sign and e-mail in this form, then wait two business days for them to send you the login and password needed for the box that pops up when you click this download link.


confidential to sas, spss, stata, and sudaan users: it's black tie dinner night at the governor's mansion and you're still wearing a t-shirt.  ready to change into your tuxedo?  time to transition to r.  :D
