Concept stats: how many needles are in which parts of our i2b2 haystack?

For the HERON research data repository we're building, we're Using I2B2, which involves navigating a huge vocabulary of medical terminology. We found ourselves wishing that the terms were tagged with some clues about how often they occur in our database. Now that the enhancement ticket is done, we find it really does help.

I2B2 lets researchers identify patient cohorts by querying a database of "observation facts" about patients. All the facts are collected in one table, indexed by concept. Concepts may be demographics (age, gender, etc.), diagnoses, labs, medications, or procedures, and the tool supports hierarchical navigation and text search:

No image "concepts.jpg" attached to Ticket #211

About 14,000 concepts are included with the I2B2 software. As we extract concepts from health records at KU Medical center, some obviously match the i2b2 concepts but many do not. For example, we have not yet determined the relationship between local medication codes and the NDC system. This complicates the already daunting task of navigating the hierarchy of medical terminology. It's quite frustrating to poke around in the dark, running query after query, not knowing which concepts have real data attached to them.

We find that pre-computing the number of facts and the number of associated patients and including these statistics in the hierarchy makes navigation much more efficient:

No image "concept-stats.png" attached to Ticket #211

We can see at a glance that we have no records of dispensing ANTI-OBESITY drugs, at least by that exact categorization.

We can also see that we only have rich data (diagnoses, labs, meds, ...) about roughly 10% of the 2 million patients in our database; this is due to rather recent deployment of an electronic medical record system compared to the long-standing use of computerized billing systems.

How we did it

In the observation_fact table, each of the facts carries a concept_cd; for example:

concept_cdencounter_numpatient_numstart_date...
DEM|AGE:51231233212005-01-01...

Then the concept_cd is related to a longer path in the concept_dimension table:

concept_cdconcept_path...
DEM|AGE:5\i2b2\Demographics\Age\0-10\5...

And these paths are used in the i2b2 table that is used to build the hierarchical navigation display:

c_dimcodec_name...
\i2b2\Demographics\Demographics...
\i2b2\Demographics\Age\Age...
\i2b2\Demographics\Age\0-10\Ages 0-10...
\i2b2\Demographics\Age\0-10\5Age 5...

So we join each i2b2 row with the concept_dimension rows under it, and then join with observation_fact on concept_cd, and store the stats in a temporary table:

create global temporary table stats
        on commit preserve rows as
        select c.super_path,
               count(*) facts,
               count(distinct f.patient_num) patients
         from
           (select distinct super.c_dimcode super_path,
                   sub.concept_cd sub_code
            from blueheronmetadata.i2b2 super
            join blueherondata.concept_dimension sub
             on sub.concept_path like (super.c_dimcode || '%')) c
           join blueherondata.observation_fact f
             on f.concept_cd = c.sub_code
        group by super_path

To compute this efficiently, we added an index on i2b2.c_dimcode; you can see that the whole thing runs in just under 22 minutes:

2010-12-09 15:20:17.357337 starting statement:  
      CREATE INDEX blueheronmetadata.metadata_by_path
        ON blueheronmetadata.i2b2 (c_dimcode)
    
2010-12-09 15:20:18.649760 finished after 0:00:01.292423; rows affected: -1
2010-12-09 15:20:18.649852 starting statement:  create global temporary table stats
        on commit preserve rows as
        select c.super_path,
               count(*) facts,
               count(distinct f.patient_num) patients
         from
           (select distinct super.c_dimcode super_path,
                   sub.concept_cd sub_code
            from blueheronmetadata.i2b2 super
            join blueherondata.concept_dimension sub
             on sub.concept_path like (super.c_dimcode || '%')) c
           join blueherondata.observation_fact f
             on f.concept_cd = c.sub_code
        group by super_path
explain plan result:
Plan hash value: 539913617
 
-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name                   | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------------------
|   0 | CREATE TABLE STATEMENT           |                        |   206 | 34196 |       |   175M  (2)|584:12:42 |
|   1 |  LOAD AS SELECT                  | STATS                  |       |       |       |            |          |
|   2 |   SORT GROUP BY                  |                        |   206 | 34196 |       |   131M  (2)|438:09:33 |
|*  3 |    HASH JOIN                     |                        |  8398M|  1298G|  2207M|   129M  (1)|433:18:21 |
|   4 |     INDEX FAST FULL SCAN         | FACT_CNPT_PAT_ENCT_IDX |    59M|  1528M|       | 78907   (2)| 00:15:47 |
|   5 |     VIEW                         |                        |    28M|  3780M|       |   129M  (1)|431:40:01 |
|   6 |      HASH UNIQUE                 |                        |    28M|  6501M|   444G|   116M  (1)|386:55:49 |
|   7 |       TABLE ACCESS BY INDEX ROWID| CONCEPT_DIMENSION      |  9826 |  1161K|       |   257   (0)| 00:00:04 |
|   8 |        NESTED LOOPS              |                        |  1837M|   409G|       |    48M  (1)|160:23:14 |
|   9 |         INDEX FAST FULL SCAN     | METADATA_BY_PATH       |   187K|    21M|       |   753   (1)| 00:00:10 |
|* 10 |         INDEX RANGE SCAN         | CONCEPT_DIMENSION_PK   |  1769 |       |       |    56   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("F"."CONCEPT_CD"="C"."SUB_CODE")
  10 - access("SUB"."CONCEPT_PATH" LIKE "SUPER"."C_DIMCODE"||'%')
       filter("SUB"."CONCEPT_PATH" LIKE "SUPER"."C_DIMCODE"||'%')
2010-12-09 15:42:12.129193 finished after 0:21:53.479341; rows affected: -1

We then used an expedient hack of updating the concept labels to include the stats:

update blueheronmetadata.i2b2 m
      set m.c_name = (
        select distinct
          case when tots.facts > 0 then
            case when instr(m.c_name, '[') > 0 then
              regexp_replace(m.c_name, '\[.*', '')
            else m.c_name || ' ' end
            || '['
            || (case when tots.facts < 10 then '<10'
                else to_char(tots.facts) end) || ' facts'
            || (case when tots.patients < 10 then ''
                  else '; ' || tots.patients || ' patients' end)
            || ']'
         else m.c_name end
         from stats tots
           where tots.super_path = m.c_dimcode)
      where m.c_dimcode in
        (select super_path from stats)

That code is complicated because it has to

  1. leave out stats regarding fewer than 10 patients (per HIPAA regulations, as part of our DeIdentificationStrategy)
  2. replace any stats that were already there

That's probably enough detail for others using i2b2 to do likewise, but if you're curious, the script that manages the process is source:heron_load/concepts_stats.py and the ticket (#211) tells the whole saga of approaches that don't work and such.

Comments

No comments.