Managing temporary tables with a python context manager
Temporary tables can be a hassle to manage. In computing concept stats, at first, my code did the obvious:
- create a temporary index
- create a temporary table
- use the table and the index
- truncate/drop the temporary table
- drop the index
But if the code fails in step 3, the temporary table and the index will still be there when you run it again, and you'll get name conflicts. An obvious solution starts like:
cursor.execute("create global temporary table ...") try: # use table finally: cursor.execute("truncate table ...") cursor.execute("drop table ...")
But it starts to get ugly when you add the try/finally for the temporary index. Isn't this a lot nicer?
with transaction(conn) as work: with table_index(conn, 'metadata_by_path', concept_schema, 'i2b2', ['c_dimcode']): with temp_table(work, stats, total_counts): exec_debug(work, update_labels, explain_plan=True)
This is where python context managers come in handy. temp_table
is implemented like this:
from contextlib import contextmanager @contextmanager def temp_table(cursor, name, create_ddl): exec_debug(cursor, create_ddl, explain_plan=True) try: yield cursor finally: exec_debug(cursor, "truncate table %s" % name) exec_debug(cursor, "drop table %s" % name)
The table_index
and transaction
context managers are implemented likewise. Take a look at source:heron_load/concepts_stats.py and source:heron_load/db_util.py for details.
Concept stats: how many needles are in which parts of our i2b2 haystack?
For the HERON research data repository we're building, we're Using I2B2, which involves navigating a huge vocabulary of medical terminology. We found ourselves wishing that the terms were tagged with some clues about how often they occur in our database. Now that the enhancement ticket is done, we find it really does help.
I2B2 lets researchers identify patient cohorts by querying a database of "observation facts" about patients. All the facts are collected in one table, indexed by concept. Concepts may be demographics (age, gender, etc.), diagnoses, labs, medications, or procedures, and the tool supports hierarchical navigation and text search:
About 14,000 concepts are included with the I2B2 software. As we extract concepts from health records at KU Medical center, some obviously match the i2b2 concepts but many do not. For example, we have not yet determined the relationship between local medication codes and the NDC system. This complicates the already daunting task of navigating the hierarchy of medical terminology. It's quite frustrating to poke around in the dark, running query after query, not knowing which concepts have real data attached to them.
We find that pre-computing the number of facts and the number of associated patients and including these statistics in the hierarchy makes navigation much more efficient:
We can see at a glance that we have no records of dispensing ANTI-OBESITY
drugs, at least by that exact categorization.
We can also see that we only have rich data (diagnoses, labs, meds, ...) about roughly 10% of the 2 million patients in our database; this is due to rather recent deployment of an electronic medical record system compared to the long-standing use of computerized billing systems.
How we did it
In the observation_fact
table, each of the facts carries a concept_cd
;
for example:
concept_cd | encounter_num | patient_num | start_date | ... |
DEM|AGE:5 | 123123 | 321 | 2005-01-01 | ... |
Then the concept_cd
is related to a longer path in the concept_dimension
table:
concept_cd | concept_path | ... |
DEM|AGE:5 | \i2b2\Demographics\Age\0-10\5 | ... |
And these paths are used in the i2b2
table that is used
to build the hierarchical navigation display:
c_dimcode | c_name | ... |
\i2b2\Demographics\ | Demographics | ... |
\i2b2\Demographics\Age\ | Age | ... |
\i2b2\Demographics\Age\0-10\ | Ages 0-10 | ... |
\i2b2\Demographics\Age\0-10\5 | Age 5 | ... |
So we join each i2b2
row with the concept_dimension
rows
under it, and then join with observation_fact
on concept_cd
,
and store the stats in a temporary table:
create global temporary table stats on commit preserve rows as select c.super_path, count(*) facts, count(distinct f.patient_num) patients from (select distinct super.c_dimcode super_path, sub.concept_cd sub_code from blueheronmetadata.i2b2 super join blueherondata.concept_dimension sub on sub.concept_path like (super.c_dimcode || '%')) c join blueherondata.observation_fact f on f.concept_cd = c.sub_code group by super_path
To compute this efficiently, we added an index on i2b2.c_dimcode
; you can see that the whole thing runs in just under 22 minutes:
2010-12-09 15:20:17.357337 starting statement: CREATE INDEX blueheronmetadata.metadata_by_path ON blueheronmetadata.i2b2 (c_dimcode) 2010-12-09 15:20:18.649760 finished after 0:00:01.292423; rows affected: -1 2010-12-09 15:20:18.649852 starting statement: create global temporary table stats on commit preserve rows as select c.super_path, count(*) facts, count(distinct f.patient_num) patients from (select distinct super.c_dimcode super_path, sub.concept_cd sub_code from blueheronmetadata.i2b2 super join blueherondata.concept_dimension sub on sub.concept_path like (super.c_dimcode || '%')) c join blueherondata.observation_fact f on f.concept_cd = c.sub_code group by super_path explain plan result: Plan hash value: 539913617 ------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------------------------- | 0 | CREATE TABLE STATEMENT | | 206 | 34196 | | 175M (2)|584:12:42 | | 1 | LOAD AS SELECT | STATS | | | | | | | 2 | SORT GROUP BY | | 206 | 34196 | | 131M (2)|438:09:33 | |* 3 | HASH JOIN | | 8398M| 1298G| 2207M| 129M (1)|433:18:21 | | 4 | INDEX FAST FULL SCAN | FACT_CNPT_PAT_ENCT_IDX | 59M| 1528M| | 78907 (2)| 00:15:47 | | 5 | VIEW | | 28M| 3780M| | 129M (1)|431:40:01 | | 6 | HASH UNIQUE | | 28M| 6501M| 444G| 116M (1)|386:55:49 | | 7 | TABLE ACCESS BY INDEX ROWID| CONCEPT_DIMENSION | 9826 | 1161K| | 257 (0)| 00:00:04 | | 8 | NESTED LOOPS | | 1837M| 409G| | 48M (1)|160:23:14 | | 9 | INDEX FAST FULL SCAN | METADATA_BY_PATH | 187K| 21M| | 753 (1)| 00:00:10 | |* 10 | INDEX RANGE SCAN | CONCEPT_DIMENSION_PK | 1769 | | | 56 (0)| 00:00:01 | ------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - access("F"."CONCEPT_CD"="C"."SUB_CODE") 10 - access("SUB"."CONCEPT_PATH" LIKE "SUPER"."C_DIMCODE"||'%') filter("SUB"."CONCEPT_PATH" LIKE "SUPER"."C_DIMCODE"||'%') 2010-12-09 15:42:12.129193 finished after 0:21:53.479341; rows affected: -1
We then used an expedient hack of updating the concept labels to include the stats:
update blueheronmetadata.i2b2 m set m.c_name = ( select distinct case when tots.facts > 0 then case when instr(m.c_name, '[') > 0 then regexp_replace(m.c_name, '\[.*', '') else m.c_name || ' ' end || '[' || (case when tots.facts < 10 then '<10' else to_char(tots.facts) end) || ' facts' || (case when tots.patients < 10 then '' else '; ' || tots.patients || ' patients' end) || ']' else m.c_name end from stats tots where tots.super_path = m.c_dimcode) where m.c_dimcode in (select super_path from stats)
That code is complicated because it has to
- leave out stats regarding fewer than 10 patients (per HIPAA regulations, as part of our DeIdentificationStrategy)
- replace any stats that were already there
That's probably enough detail for others using i2b2 to do likewise, but if you're curious, the script that manages the process is source:heron_load/concepts_stats.py and the ticket (#211) tells the whole saga of approaches that don't work and such.
Informatics at KU Med Center
A person working in partnership with an information resource is "better" than that same person unassisted.
We referenced that "fundamental theorem" of biomedical informatics to introduce the discipline in his recent clinical research presentation:
Friedman goes on to explain that informatics is more about people than technology:
One can view the whole EHR as one giant intervention where the healthcare system (providers, nurses, pharmacists, allied health) are the subjects. Will it be a chewable Flintstone vitamin or a barium enema?