Opened 6 years ago

Closed 6 years ago

Last modified 4 years ago

#547 closed enhancement (fixed)

intial integration of tumor registry, supporting query by Grade etc.

Reported by: rwaitman Owned by: rwaitman
Priority: major Milestone: heron-tuttlecreek-update
Component: data-repository Keywords: public-web
Cc: rwaitman, dconnolly Blocked By:
Blocking: Sensitive: no


We got approval in April; see DROCMeetingMinutes?.

Initial use case is: search by Grade.

See #733 for data that isn't completely integrated yet.

note: this is Dan, reporting for Russ

Change History (40)

comment:1 Changed 6 years ago by dconnolly

On Fri, 2011-08-12 at 12:58 +0000, Russ Waitman wrote:

Russ, Arvinder, and Dan need to think about how we represent observations about tumors and treatment in i2b2 which is usually thinking about representing clinical observations relative to a patient during at visit at a specific time. The simple issue is how to timestamp registry data. For example: the collaborative staging is based on multiple inputs taken at different time and done typically after the case is complete.

Russ, Arvinder and Dan need to see if there are ontologies in the UMLS for things like ICD-O.

comment:2 Changed 6 years ago by rwaitman

The NAACR file is very rich in data. We will decompose this ticket into sub-tickets for various observations after it's prioritized by Andy Godwin and others.

comment:3 Changed 6 years ago by rwaitman

  • Milestone changed from HERONv.Next to heron-perry-update

We would deploy very simple observations that don't require 1.6 modifier functionality first. Then, we would add elements with the visit and modifier capabilities after our 1.6 upgrade.

comment:4 Changed 6 years ago by rwaitman

Adding some good links from John Keighley

collaborative staging information which tells you how site specific factors are used:
and then specifically the first one which is used for estrogen receptor

comment:5 Changed 6 years ago by rwaitman

Also adding the location of an Access database copy of the NAACR formats

Specifically, our registry is on 12.1 and we'd want the following MDB file
12.1 Data Standards and Data Dictionary Database (12/2/2010) (MS Access)

Within the file, there are several tables and the main ones of interest appear to be the
tblItem - which has the item in the file like "Race". Also tells you the field length and position in the NAACR file.

tblCode - has the response codes for a given item.

Example, Race is itemid=18 and codenumber 160. It has several allowable response codes like codeid =657 for Thai.

comment:6 Changed 6 years ago by dconnolly

I bulk loaded the August tumor registry data:

  1. I grabbed the 12.1 PDF spec and converted it to .csv (see source:tumor_reg/Makefile for details)
  2. I wrote source:tumor_reg/ to turn the record layout into a CREATE TABLE statement (source:tumor_reg/naacr_extract.sql) and a sqlldr control file (source:tumor_reg/naacr_extract.ctl)
  3. I created a NAACR user.
  4. I ran the CREATE TABLE statement.
  5. I ran sqlldr using the .ctl file (I had to symlink ALL_PATIENTS_NAACCR to ALL_PATIENTS_NAACCR.dat)

kuh_tumor_registry/August242011NAACCRextract> ORACLE_SID=... sqlldr userid=... control=~/tumor_reg/naacr_extract.ctl data="'"`/bin/pwd`/ALL_PATIENTS_NAACCR"'"

comment:7 Changed 6 years ago by dconnolly

I imported the concepts from Dustin Key's NAACCR_Ontology.ZIP using sqldeveloper; it was a little messy;
I didn't get the column sizes quite right, so there were some import errors. It's good
enough for now, though, I hope...

The concepts are in the bmid database; I connected our test i2b2 instance to that.
so start at https://<non-prod-app-server>

For reference:

On Thu, 2011-08-18 at 16:16 -0700, Key, Dustin wrote:

Attached is my revision of our NAACCR ontology based on some work I had done today.

comment:8 Changed 6 years ago by dconnolly

  • Owner changed from rwaitman to dconnolly
  • Status changed from new to accepted

Looks like my conversion from .pdf to .csv (via .xls) was buggy.
It's missing key stuff such as:

Column # LengthItem # Item Name Section Note
3606-3616 11 2300 Medical Record Number Patient-Confidential

comment:9 Changed 6 years ago by dconnolly

  • Owner changed from dconnolly to rwaitman
  • Status changed from accepted to assigned


I'm putting this on the back burner, as we discussed.

In fact, I'm throwing it your way, in case you (or Arvinder) find time to download

  • Version 12.1 Data Standards and Data Dictionary Database (12/2/2010) (MS Access)

from the NAACCR data dictionary
and export the contents in CSV files.

comment:10 Changed 6 years ago by dconnolly

  • Cc achoudhary rwaitman dconnolly added

Oops... I forgot to copy, you Arvinder. If you find time to help convert the NAACCR
data dictionary from MS Access to CSV, see ticket:547#comment:9 for details.

comment:11 Changed 6 years ago by dconnolly

  • Blocking set to 673

thinking about dependencies for this release...

comment:12 Changed 6 years ago by dconnolly

  • Blocking 673 deleted

(In #673) We can probably start the long part of the HeronLoad build without the tumor registry (#547),
but we should get race tidied up (#637).

comment:13 Changed 6 years ago by dconnolly

  • Blocking set to 674

comment:14 Changed 6 years ago by dconnolly

  • Owner changed from rwaitman to dconnolly
  • Status changed from assigned to accepted

comment:15 Changed 6 years ago by dconnolly

  • Owner changed from dconnolly to achoudhary
  • Status changed from accepted to assigned

Let me know how it goes doing sqlldr with the version of
source:tumor_reg/naaccr_extract.ctl updated Sep 14 in [5436f83be44f].

comment:16 Changed 6 years ago by rwaitman

  • Milestone changed from heron-perry-update to heron-tuttlecreek-update

comment:17 Changed 6 years ago by dconnolly

  • Blocking 674 deleted

comment:18 Changed 6 years ago by dconnolly

See also shrine-hmorn stuff by Dustin Key, Group Health Research Institute in Seattle:

comment:19 Changed 6 years ago by achoudhary

Data load Log sits in the /d1/kuh_tumor_registry/August242011NAACCRextract folder.

Interesting part is that we have no records for the MRN

Select ne."Medical Record Number"
from NAACR.extract ne
where ne."Medical Record Number" is not null;

comment:20 follow-up: Changed 6 years ago by achoudhary

Well found out from Tim Metcalf that it is the "Patient ID Number" that is the MRN in the NAACCR file. Good news is that we have the data :)

Query to be run on the identified server

select cii.pat_id as patient_ide , cii.identity_id as MRN
  from CLARITY.identity_id@EPIC cii,
  where to_number(e."Patient ID Number") = to_number(cii.identity_id)
    and cii.identity_type_id=10;   

comment:21 Changed 6 years ago by achoudhary

  • Owner changed from achoudhary to dconnolly

comment:22 Changed 6 years ago by achoudhary

  • Blocking set to 721

comment:23 Changed 6 years ago by dconnolly

  • Blocking 721 deleted

(In #721) Arvinder asked if it makes sense to start the long part of the HeronLoad process now and layer the tumor registry (#547) on after. Yes, that seems reasonable.

The stats issues (#553, #701) can wait until after the heron_load step too.

comment:24 Changed 6 years ago by dconnolly

  • Owner changed from dconnolly to rwaitman


I think the next step is clinical analisys of the data in NAACR.EXTRACT@KUMC.

comment:25 Changed 6 years ago by dconnolly

Jack London <…> 8/18/2011 11:13 AM

Attached is a zip file with Oracle SQL insert statements generated from our metadata tables for our caTissue biospecimen data and our Impac tumor registry data.

I loaded the tumor registry metadata into I2B2METADATA.TUMOR_REGISTRY_ONTOLOGY@bmid.
It loaded cleanly.

For reference: the Kimmel Cancer Center is in Philadelphia.

comment:26 in reply to: ↑ 20 Changed 6 years ago by dconnolly

Replying to achoudhary:

where to_number(e."Patient ID Number") = to_number(cii.identity_id)

why the to_number() call? I'm having trouble matching without it, but adding it
seems likely to kill performance (indexes won't be used, will they?). I wonder
if we really need the to_number() on both sides, or if we can only adjust
the e."Patient ID Number" side somehow.

... answering my own question...

"MRNs are 6 or 7 digits" according to heron_load/epic_i2b2_transform.sql#L94
They seem to be 8 digits in the NAACCR file.

comment:27 Changed 6 years ago by dconnolly

Based on this week's discussion with John K., I've got a preliminary design[b021bd99e70f] coded up:

  • source:heron_load/naaccr_txform.sql handles extract/transformation for the following:
    • Grade, Site, and Histology following the Kimmel ontology
      • I haven't been able to figure out how treatment works.
    • Various other data using "the stupid approach":
      • 'Race 1', 'Sex', 'Laterality', 'Diagnostic Confirmation', 'Inpatient Status', 'Primary Payer at DX', 'Date of Birth', 'Date of 1st Contact', 'Date of Inpatient Adm', 'Date of Inpatient Disch'
        • 'Class of Case' is wierd; the values don't seem to match the NAACR code table
        • Now that I think about it, the way I'm handling dates isn't going to be usable from the i2b2 query interface. I have another idea we could try.
  • source:heron_load/naaccr_load.sql actually loads the stuff in

I'm working on the surrounding infrastructure to try out the design.

comment:28 Changed 6 years ago by dconnolly

I'm trying out the design in the 'B' instance. It's messy and a little risky:
I haven't taken the time to make test data, so I'm doing this in
the database we're preparing for production release.


You can see the London/Kimmel ontology via https://<non-prod-app_server>
It's a top-level folder, not under "Heron tuttlecreek".
I copied it to BHeronFB.I2B2METADATA2 so we can manage it like our
other external ontologies (i2b2, CPT). Hmm... perhaps that management
strategy should be captured in a design issue ticket.

I tried the new load_tumor_facts ETL task. It fell over right away the
first few times due to easily-fixed typos and such. Then I realized
I had used a very inefficient mechanism to count loaded records to
update the UPLOAD_STATUS table, so I killed it.

Upload id: 29 OK. 0:05:37.672789

---> kumc_etl.load_tumor_facts
---> kumc_etl.make_tumor_views
Running: naaccr_txform.sql

2011-11-16 16:21:35.280423: KUMC on localhost:8521
select "Accession Number--Hosp" from naacr.extract where 1=0
2011-11-16 16:21:35.281484 end. duration: 0:00:00.001061 rows: 0

That meant I had to clean up the records from that attempt.
I did it by doing a binary search for the first encounter_num
used for tumor facts. I got it wrong the 1st time, so it took
me 2 deletes for the encounter_mapping table and 2 deletes
for the observation_fact table:

select distinct encounter_ide_source
from NightHeronData.encounter_mapping
where encounter_num >= 18300000
-- where encounter_num >= 18436613

select encounter_num, concept_cd
from NightHeronData.observation_fact
where encounter_num >= 18436613;

from NightHeronData.observation_fact
where encounter_num >= 18300000; 

from NightHeronData.encounter_mapping
where encounter_num >= 18300000;

-- 18323902
751,287 rows deleted.
48,898 rows deleted.

60,947 rows deleted.
4,629 rows deleted.

select 60947 + 4629 from dual;


The 2nd try at load_tumor_facts did better (Upload id: 29 OK. 0:05:37.672789), but it fell over
at the de-id step because the zips_near_kumc data isn't loaded in my schema.

(do we have a ticket for moving all the ETL views and temp tables from the schema of the user
running the scripts to some common schema?

comment:29 Changed 6 years ago by dconnolly

Taking another try at the GHC ontology.

This TABLE_ACCESS stuff is a pain...

2011-11-17 11:24:20,454 DEBUG [org.springframework.jdbc.core.JdbcTemplate] Executing SQL query [select c_hlevel from BlueHeronmetadata.NAACCR_ONTOLOGY where c_fullname = ?  and c_synonym_cd = 'N']
2011-11-17 11:24:20,454 DEBUG [org.springframework.jdbc.datasource.DataSourceUtils] Fetching JDBC Connection from DataSource
2011-11-17 11:24:20,454 DEBUG [org.springframework.jdbc.core.StatementCreatorUtils] Setting SQL statement parameter value: column index 1, parameter value [\i2b2\], value class [java.lang.String], SQL type unknown
2011-11-17 11:24:20,469 DEBUG [org.springframework.jdbc.datasource.DataSourceUtils] Returning JDBC Connection to DataSource
2011-11-17 11:24:20,469 ERROR [edu.harvard.i2b2.ontology.dao.ConceptDao] Incorrect result size: expected 1, actual 0

comment:30 Changed 6 years ago by dconnolly

aha... got the GHC ontology in there.


comment:31 Changed 6 years ago by dconnolly

Bingo! one query works:

  • Finished Query: "Grade III - Poo@21:18:27"
    Patient Count - 9759 patients
    FINISHED [3.9 secs]


comment:32 Changed 6 years ago by dconnolly

Darn; Site/Histology query didn't work.
our concept codes are like TR|S:C050|H:81403
where as London's are like TR|S:C50|H:82113
Note the leading 0 in our concept codes, which are derived from codes in the NAACCR extract.

John K. explains that the 3rd digit of "Primary Site" is a sub-site.

Meanwhile, it looks like the Kimmel ontology is mixing in ICD-10 codes,
such as Hodgkin's disease C81.

IDC-O-3 goes from C000 to C809
IDC10 goes beyond.

John K. also explains that the state-of-the-art in site coding is to map
ICD-O-3 site and histology to SEER Site Recode.

The GHC ontology is starting to look more straightforward.

comment:33 Changed 6 years ago by dconnolly

The GHC ontology works for race now[2a15fca4e515]:

  • Finished Query: "01 White@23:09:47"
    Patient Count - 53804 patients
    FINISHED [4.6 secs]

This validates the "big flat bucket approach", though it seems that Dustin Key considered the
late-binding approach to concept_cd's somewhat late in his work:

However, after starting to do this I've realized that it's better to just treat these as 'regular facts' and store the storage code as the value. Then handle the stage presentation at the concept dimension and metadata level. That way you can more quickly change how the stages are presented without having to reload facts.

comment:34 Changed 6 years ago by dconnolly

After discussion with Russ, I'm going more toward the "big flat approach". I found a reasonable
broad-brush approach[400c06fbebe6] to de-identification:

  1. Eliminate whole sections that are uninteresting or problematic, such as "Text-Miscellaneous".
  2. Prune remaining items base on AllowValue and Format (zip codes, counties, identifiers)

But I'm starting to wonder how long it's going to take to get metadata for all the concept_cd's.
For items whose values join with the t_code table, it's likely that the we'll align with
the GHC ontology easily, and if not, it's straightforward to turn the t_item and t_code
tables into i2b2 metadata.

But there are others (e.g. cause of death) that use other coding systems, and it's not clear that
this "big flat approach" gives codes that agree with GHC.

After another chat with Russ, we'll use trivial metadata if we have to, but we'll continue
with the "big flat approach."

comment:35 Changed 6 years ago by dconnolly

  • Resolution set to fixed
  • Status changed from assigned to closed

I worked out the remaining duplicate key kinks in the "big flat approach"[91231eba2b38].

Some queries against the GHC ontology now work:

  • Finished Query: "2 Married (incl@16:20:44"
    Patient Count - 38415 patients FINISHED [3.3 secs]

The GHC ontology and the "big flat approach" agree on concept_cd's of the form NAACCR|item_num:codenbr
in many cases. What to do in the cases where they do not agree remains an open issue #733.

comment:36 Changed 6 years ago by dconnolly

  • Resolution fixed deleted
  • Status changed from closed to reopened

oops; I set end_date to null for non-date items, so they don't work in the timeline view.

comment:37 Changed 6 years ago by dconnolly

  • Resolution set to fixed
  • Status changed from reopened to closed

end_date problem fixed in [14ee3ecce5f2]

comment:38 Changed 6 years ago by dconnolly

comment:39 Changed 4 years ago by dconnolly

  • Cc achoudhary removed
  • Keywords public-web added

comment:40 Changed 4 years ago by kcrane2

Cleared by Infosec

Note: See TracTickets for help on using tickets.