Posts for the month of December 2010

Managing temporary tables with a python context manager

Temporary tables can be a hassle to manage. In computing concept stats, at first, my code did the obvious:

  1. create a temporary index
  2. create a temporary table
  3. use the table and the index
  4. truncate/drop the temporary table
  5. drop the index

But if the code fails in step 3, the temporary table and the index will still be there when you run it again, and you'll get name conflicts. An obvious solution starts like:

cursor.execute("create global temporary table ...")
try:
    # use table
finally:
    cursor.execute("truncate table ...")
    cursor.execute("drop table ...")

But it starts to get ugly when you add the try/finally for the temporary index. Isn't this a lot nicer?

    with transaction(conn) as work:
        with table_index(conn, 'metadata_by_path',
                         concept_schema, 'i2b2', ['c_dimcode']):
            with temp_table(work, stats, total_counts):
                exec_debug(work, update_labels, explain_plan=True)

This is where python context managers come in handy. temp_table is implemented like this:

from contextlib import contextmanager


@contextmanager
def temp_table(cursor, name, create_ddl):
    exec_debug(cursor, create_ddl, explain_plan=True)
    try:
        yield cursor
    finally:
        exec_debug(cursor, "truncate table %s" % name)
        exec_debug(cursor, "drop table %s" % name)

The table_index and transaction context managers are implemented likewise. Take a look at source:heron_load/concepts_stats.py and source:heron_load/db_util.py for details.

Concept stats: how many needles are in which parts of our i2b2 haystack?

For the HERON research data repository we're building, we're Using I2B2, which involves navigating a huge vocabulary of medical terminology. We found ourselves wishing that the terms were tagged with some clues about how often they occur in our database. Now that the enhancement ticket is done, we find it really does help.

I2B2 lets researchers identify patient cohorts by querying a database of "observation facts" about patients. All the facts are collected in one table, indexed by concept. Concepts may be demographics (age, gender, etc.), diagnoses, labs, medications, or procedures, and the tool supports hierarchical navigation and text search:

About 14,000 concepts are included with the I2B2 software. As we extract concepts from health records at KU Medical center, some obviously match the i2b2 concepts but many do not. For example, we have not yet determined the relationship between local medication codes and the NDC system. This complicates the already daunting task of navigating the hierarchy of medical terminology. It's quite frustrating to poke around in the dark, running query after query, not knowing which concepts have real data attached to them.

We find that pre-computing the number of facts and the number of associated patients and including these statistics in the hierarchy makes navigation much more efficient:

We can see at a glance that we have no records of dispensing ANTI-OBESITY drugs, at least by that exact categorization.

We can also see that we only have rich data (diagnoses, labs, meds, ...) about roughly 10% of the 2 million patients in our database; this is due to rather recent deployment of an electronic medical record system compared to the long-standing use of computerized billing systems.

How we did it

In the observation_fact table, each of the facts carries a concept_cd; for example:

concept_cdencounter_numpatient_numstart_date...
DEM|AGE:51231233212005-01-01...

Then the concept_cd is related to a longer path in the concept_dimension table:

concept_cdconcept_path...
DEM|AGE:5\i2b2\Demographics\Age\0-10\5...

And these paths are used in the i2b2 table that is used to build the hierarchical navigation display:

c_dimcodec_name...
\i2b2\Demographics\Demographics...
\i2b2\Demographics\Age\Age...
\i2b2\Demographics\Age\0-10\Ages 0-10...
\i2b2\Demographics\Age\0-10\5Age 5...

So we join each i2b2 row with the concept_dimension rows under it, and then join with observation_fact on concept_cd, and store the stats in a temporary table:

create global temporary table stats
        on commit preserve rows as
        select c.super_path,
               count(*) facts,
               count(distinct f.patient_num) patients
         from
           (select distinct super.c_dimcode super_path,
                   sub.concept_cd sub_code
            from blueheronmetadata.i2b2 super
            join blueherondata.concept_dimension sub
             on sub.concept_path like (super.c_dimcode || '%')) c
           join blueherondata.observation_fact f
             on f.concept_cd = c.sub_code
        group by super_path

To compute this efficiently, we added an index on i2b2.c_dimcode; you can see that the whole thing runs in just under 22 minutes:

2010-12-09 15:20:17.357337 starting statement:  
      CREATE INDEX blueheronmetadata.metadata_by_path
        ON blueheronmetadata.i2b2 (c_dimcode)
    
2010-12-09 15:20:18.649760 finished after 0:00:01.292423; rows affected: -1
2010-12-09 15:20:18.649852 starting statement:  create global temporary table stats
        on commit preserve rows as
        select c.super_path,
               count(*) facts,
               count(distinct f.patient_num) patients
         from
           (select distinct super.c_dimcode super_path,
                   sub.concept_cd sub_code
            from blueheronmetadata.i2b2 super
            join blueherondata.concept_dimension sub
             on sub.concept_path like (super.c_dimcode || '%')) c
           join blueherondata.observation_fact f
             on f.concept_cd = c.sub_code
        group by super_path
explain plan result:
Plan hash value: 539913617
 
-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name                   | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------------------
|   0 | CREATE TABLE STATEMENT           |                        |   206 | 34196 |       |   175M  (2)|584:12:42 |
|   1 |  LOAD AS SELECT                  | STATS                  |       |       |       |            |          |
|   2 |   SORT GROUP BY                  |                        |   206 | 34196 |       |   131M  (2)|438:09:33 |
|*  3 |    HASH JOIN                     |                        |  8398M|  1298G|  2207M|   129M  (1)|433:18:21 |
|   4 |     INDEX FAST FULL SCAN         | FACT_CNPT_PAT_ENCT_IDX |    59M|  1528M|       | 78907   (2)| 00:15:47 |
|   5 |     VIEW                         |                        |    28M|  3780M|       |   129M  (1)|431:40:01 |
|   6 |      HASH UNIQUE                 |                        |    28M|  6501M|   444G|   116M  (1)|386:55:49 |
|   7 |       TABLE ACCESS BY INDEX ROWID| CONCEPT_DIMENSION      |  9826 |  1161K|       |   257   (0)| 00:00:04 |
|   8 |        NESTED LOOPS              |                        |  1837M|   409G|       |    48M  (1)|160:23:14 |
|   9 |         INDEX FAST FULL SCAN     | METADATA_BY_PATH       |   187K|    21M|       |   753   (1)| 00:00:10 |
|* 10 |         INDEX RANGE SCAN         | CONCEPT_DIMENSION_PK   |  1769 |       |       |    56   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("F"."CONCEPT_CD"="C"."SUB_CODE")
  10 - access("SUB"."CONCEPT_PATH" LIKE "SUPER"."C_DIMCODE"||'%')
       filter("SUB"."CONCEPT_PATH" LIKE "SUPER"."C_DIMCODE"||'%')
2010-12-09 15:42:12.129193 finished after 0:21:53.479341; rows affected: -1

We then used an expedient hack of updating the concept labels to include the stats:

update blueheronmetadata.i2b2 m
      set m.c_name = (
        select distinct
          case when tots.facts > 0 then
            case when instr(m.c_name, '[') > 0 then
              regexp_replace(m.c_name, '\[.*', '')
            else m.c_name || ' ' end
            || '['
            || (case when tots.facts < 10 then '<10'
                else to_char(tots.facts) end) || ' facts'
            || (case when tots.patients < 10 then ''
                  else '; ' || tots.patients || ' patients' end)
            || ']'
         else m.c_name end
         from stats tots
           where tots.super_path = m.c_dimcode)
      where m.c_dimcode in
        (select super_path from stats)

That code is complicated because it has to

  1. leave out stats regarding fewer than 10 patients (per HIPAA regulations, as part of our DeIdentificationStrategy)
  2. replace any stats that were already there

That's probably enough detail for others using i2b2 to do likewise, but if you're curious, the script that manages the process is source:heron_load/concepts_stats.py and the ticket (#211) tells the whole saga of approaches that don't work and such.

Informatics at KU Med Center

A person working in partnership with an information resource is "better" than that same person unassisted.

-- Friedman 2009

We referenced that "fundamental theorem" of biomedical informatics to introduce the discipline in his recent clinical research presentation:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2649317/bin/169.S1067502708002417.gr1.jpg

Friedman goes on to explain that informatics is more about people than technology:

NOT: information resource > person

One can view the whole EHR as one giant intervention where the healthcare system (providers, nurses, pharmacists, allied health) are the subjects. Will it be a chewable Flintstone vitamin or a barium enema?