Posts in category SQL

Adding SEER Site Recode to HERON Tumor Registry integration

Our HERON tuttlecreek release a couple of months ago included initial integration of data on ~60,000 cancer cases from the KUMC tumor registry. We organized the NAACCR terms based on work by colleagues at the Kimmel Cancer Center in Philadelphia and Group Health Cooperative in Seattle:

NAACCR terms for tumor registry

But to find brain cancer cases, for example, you currently have to be an expert in codes for primary site, histology, etc., due to an outstanding issue (#733):

For our next release, based on work with John Keighley, we're providing query by SEER Site Recode, a state-of-the-art method for combining primary site and histology:

screenshot of SEER Site Recode term hierarchy

Under the hood: Using Python to convert the rules table to SQL

The SEER Site Recode ICD-O-3 (1/27/2003) Definition lays out the rules in a fairly convenient HTML table:

Converting that table to code manually might have been straightforward, but it would have been repetitive and error-prone; so, like so many geeks faced with a repetitive task, I wrote a script to automate it.

source:tumor_reg/ weighs in at about 200 lines, including whitespace and a handful of test cases. It reads the HTML page (well, I feed it through tidy first to clean up some table markup) and produces:

  1. A term hierarchy in CSV format (source:heron_load/curated_data/seer_recode_terms.csv)
  2. Rules to recode our ~60K cancer cases as a SQL case statement (source:heron_load/seer_recode.sql).
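
The script itself isn't reproduced here, but the first step (pulling rows out of the tidied HTML table) might look roughly like the sketch below. It uses only the standard library. The TableRows class, the table_rows helper, and any sample markup you feed it are made up for illustration; none of this is taken from the actual script or from the real layout of the SEER page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Hypothetical helper: collect cell text, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace left over from the markup.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html_text):
    """Return every table row in html_text as a list of cell strings."""
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

Each row comes back as a plain list of strings, which keeps the later steps (building terms and rules) independent of the markup.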

The resulting SQL weighs in at about 500 lines. Handling all the different kinds of rules in the table was fun; a lot more fun than writing this sort of SQL by hand:

/* Lip */ when (site between 'C000' and 'C009')
  and  not (histology between '9590' and '9989'
   or histology between '9050' and '9055'
   or histology = '9140') then '20010'


/* Melanoma of the Skin */ when (site between 'C440' and 'C449')
  and (histology between '8720' and '8790') then '25010'


/* Cranial Nerves Other Nervous System */ when (site between 'C710' and 'C719')
  and (histology between '9530' and '9539') then '31040'

/* ... */ when (site between 'C700' and 'C709'
   or site between 'C720' and 'C729')
  and  not (histology between '9590' and '9989'
   or histology between '9050' and '9055'
   or histology = '9140') then '31040'
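
For a flavor of how clauses like these can be generated, here is a minimal sketch; it is not the actual script. It assumes each rule arrives as comma-separated codes and ranges such as 'C000-C009', plus a flag for rows that exclude the listed histologies. The names range_expr and when_clause, and that input format, are all hypothetical.

```python
def range_expr(column, spec):
    """Turn a spec like 'C700-C709, C720-C729' into a SQL condition."""
    terms = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as 'C000-C009' becomes a BETWEEN test.
            lo, hi = (p.strip() for p in part.split("-", 1))
            terms.append("%s between '%s' and '%s'" % (column, lo, hi))
        else:
            # A single code becomes an equality test.
            terms.append("%s = '%s'" % (column, part))
    return "(" + "\n   or ".join(terms) + ")"


def when_clause(label, site_spec, histology_spec, recode, excluding=False):
    """Build one WHEN branch of the case statement for a rule."""
    histology = range_expr("histology", histology_spec)
    if excluding:
        histology = "not " + histology
    return "/* %s */ when %s\n  and %s then '%s'" % (
        label, range_expr("site", site_spec), histology, recode)
```

Calling when_clause('Lip', 'C000-C009', '9590-9989, 9050-9055, 9140', '20010', excluding=True) yields the Lip clause shown above, whitespace aside.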

Managing temporary tables with a Python context manager

Temporary tables can be a hassle to manage. When computing concept stats, my code at first did the obvious thing:

  1. create a temporary index
  2. create a temporary table
  3. use the table and the index
  4. truncate/drop the temporary table
  5. drop the index

But if the code fails in step 3, the temporary table and the index will still be there when you run it again, and you'll get name conflicts. An obvious solution starts like:

cursor.execute("create global temporary table ...")
try:
    pass  # use table
finally:
    # clean up even if using the table failed
    cursor.execute("truncate table ...")
    cursor.execute("drop table ...")

But it starts to get ugly when you add the try/finally for the temporary index. Isn't this a lot nicer?

    with transaction(conn) as work:
        with table_index(conn, 'metadata_by_path',
                         concept_schema, 'i2b2', ['c_dimcode']):
            with temp_table(work, stats, total_counts):
                exec_debug(work, update_labels, explain_plan=True)

This is where Python context managers come in handy. temp_table is implemented like this:

from contextlib import contextmanager

@contextmanager
def temp_table(cursor, name, create_ddl):
    exec_debug(cursor, create_ddl, explain_plan=True)
    try:
        yield cursor
    finally:
        # runs whether or not the body of the with block raised
        exec_debug(cursor, "truncate table %s" % name)
        exec_debug(cursor, "drop table %s" % name)

The table_index and transaction context managers are implemented along the same lines. Take a look at source:heron_load/ and source:heron_load/ for details.
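
To see the cleanup behavior without an Oracle database at hand, here is a rough, self-contained sketch of an index-managing context manager in the same spirit, run against an in-memory SQLite database. It is an illustration under assumptions, not the HERON code: this table_index takes a simplified argument list (no schema argument), and the leftover_indexes and demo helpers and the metadata table are invented for the example.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def table_index(conn, name, table, columns):
    """Create an index for the duration of the with block, then drop it."""
    cursor = conn.cursor()
    cursor.execute("create index %s on %s (%s)"
                   % (name, table, ", ".join(columns)))
    try:
        yield
    finally:
        # Runs on normal exit and when the block raises.
        cursor.execute("drop index %s" % name)


def leftover_indexes(conn):
    """List the names of all indexes currently in the database."""
    rows = conn.execute("select name from sqlite_master where type = 'index'")
    return [name for (name,) in rows]


def demo():
    # isolation_level=None puts the connection in autocommit mode, so the
    # DDL statements take effect immediately.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("create table metadata (c_dimcode text)")
    try:
        with table_index(conn, "metadata_by_path", "metadata", ["c_dimcode"]):
            raise RuntimeError("simulated failure while using the index")
    except RuntimeError:
        pass
    return leftover_indexes(conn)
```

Even though the body of the with block fails, demo() finds no leftover index, so a second run would not hit a name conflict.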