Version 8 (modified by dconnolly, 7 years ago)


HERON integrates data from the KUH Tumor Registry, with 65,000 cases dating back to the 1950s.



Source code and development notes for NAACCR ETL

The data comes from the KUH Tumor Registry in the NAACCR format:

todo: update to the current data dictionary materials (v13 as of this writing), which include relational data and Python export tools.

Our NAACCR ETL SQL scripts are designed for use in the HeronLoad ETL process (see also source:heron_load/README.rst).

We gratefully acknowledge contributions from Dustin Key of GHC and Jack London of the Kimmel Cancer Center.

The code is not (yet) designed to run independently of the KUMC environment, but peers in the informatics community have managed to port these scripts to their own environments:

not yet released; stay tuned (#1254)

Design notes include:

initial integration of tumor registry, supporting query by Grade etc.
HERON uses ICD-O-2 labels for ICD-O-3 morphology/histology cancer tumor registry codes
tumor registry terminology hierarchy simplified for cancer study feasibility

We reviewed the data we receive, section by section, to exclude potentially sensitive data, including free text; the sections commented out with -- below are not loaded into HERON:

and ns.SectionID in (
   1 -- Cancer Identification
 , 2 -- Demographic
-- , 3 -- Edit Overrides/Conversion History/System Admin
 , 4 -- Follow-up/Recurrence/Death
-- , 5 -- Hospital-Confidential
 , 6 -- Hospital-Specific
-- , 7 -- Other-Confidential
-- , 8 -- Patient-Confidential
-- , 9 -- Record ID
-- , 10 -- Special Use
 , 11 -- Stage/Prognostic Factors -- TODO: numeric stuff
-- , 12 -- Text-Diagnosis
-- , 13 -- Text-Miscellaneous
-- , 14 -- Text-Treatment
-- , 15 -- Treatment-1st Course
 , 16 -- Treatment-Subsequent & Other
 , 17 -- Pathology
)

-- source:heron_load/naaccr_txform.sql#L67
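The same section allowlist can be written outside SQL, for example when checking which sections a port of these scripts would load. The following is a minimal Python sketch and not part of the HERON code: the section IDs and names are copied from the excerpt above, while `NAACCR_SECTIONS`, `LOADED_SECTION_IDS`, `is_loaded`, and `excluded_sections` are hypothetical names introduced for illustration.

```python
# Section IDs and names as listed in the SQL excerpt above.
NAACCR_SECTIONS = {
    1: "Cancer Identification",
    2: "Demographic",
    3: "Edit Overrides/Conversion History/System Admin",
    4: "Follow-up/Recurrence/Death",
    5: "Hospital-Confidential",
    6: "Hospital-Specific",
    7: "Other-Confidential",
    8: "Patient-Confidential",
    9: "Record ID",
    10: "Special Use",
    11: "Stage/Prognostic Factors",
    12: "Text-Diagnosis",
    13: "Text-Miscellaneous",
    14: "Text-Treatment",
    15: "Treatment-1st Course",
    16: "Treatment-Subsequent & Other",
    17: "Pathology",
}

# Sections that are not commented out in the excerpt (assumed to be the loaded set).
LOADED_SECTION_IDS = frozenset({1, 2, 4, 6, 11, 16, 17})


def is_loaded(section_id):
    """Return True if items from this section would pass the filter."""
    return section_id in LOADED_SECTION_IDS


def excluded_sections():
    """Return the names of sections that the filter leaves out, in ID order."""
    return [
        name
        for section_id, name in sorted(NAACCR_SECTIONS.items())
        if section_id not in LOADED_SECTION_IDS
    ]
```

Keeping the excluded sections visible in the mapping, rather than deleting them, mirrors the SQL, where each excluded section stays in place as a comment so the review decision remains documented.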

Requirements Gathering

December 1, 2011 Planning

Meeting with Tim Metcalf, Russ, Arvinder, Bhargav

Where exactly is the "RX Summary" info? It should be after the RX-Summ

Longer term, we want the site-specific items.

Note: some fields are not required.

Subsq RX 2nd Course and others of these fields all seem to be 00 or 0. Arvinder thinks this may be an error in the first part of the ETL involving VARCHARs. Arvinder will work with John and Tim to run frequency counts on those columns.
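A frequency count of this kind can be sketched in Python against an in-memory SQLite table. This is an illustration only: the table name `naaccr_extract`, the column name `subsq_rx_2nd_course`, the helper `value_frequencies`, and the sample values are all hypothetical, and the real check would run against the staged registry data.

```python
import sqlite3


def value_frequencies(conn, table, column):
    """Return (value, count) rows for one column, most frequent first."""
    # Identifiers cannot be bound as query parameters, so they are quoted here.
    # Only pass trusted table and column names to this helper.
    query = (
        'SELECT "{col}", COUNT(*) AS n FROM "{tbl}" '
        'GROUP BY "{col}" ORDER BY n DESC, "{col}"'
    ).format(col=column, tbl=table)
    return conn.execute(query).fetchall()


# Hypothetical sample data: a text column holding both "00" and "0".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE naaccr_extract (subsq_rx_2nd_course VARCHAR(2))")
conn.executemany(
    "INSERT INTO naaccr_extract VALUES (?)",
    [("00",), ("00",), ("0",), ("00",)],
)

freqs = value_frequencies(conn, "naaccr_extract", "subsq_rx_2nd_course")
for value, count in freqs:
    print(repr(value), count)
```

Printing the values with `repr` keeps "00" and "0" visibly distinct, which matters if the suspicion is that a string column lost its padding somewhere in the load.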

  • The SEER site recode: John asked whether Tim can check with his vendor to see if they already do that and have the data available for Dan. Tim suspects they do. This would really help with ontology creation.

Brainstorming with Tim on how this could help the registrar out of the box:

  • Death index and Death from hospital could save them time
  • Validate that his team is coding the data accurately. For example, are they using very old codes and radiation therapy technologies when they should be using a newer or more accurate term (beam radiation)? Class of case is another example, as is collaborative staging.
  • Integrating data from HERON, such as BMI, could add real value to their annual report.
  • HERON for investigators is a win because he doesn't have to run all their exploratory queries. For example, Steve Williamson asks routine questions every year about how many patients have a given histology and site combination. Tim wins as well because he can incorporate additional data.
  • Follow up report could be useful. What's changed since he last coded the case? Note: some of this kind of work is already provided by merged reports coming from KUH IT staff. We don't want to duplicate that work. Would also need to understand the operational commitment to fund this kind of work.
  • Long term, though it might take a lot of work, HERON could find or check for things automatically, such as CA19-9 over 1000, PSA over 7, or clinical recurrence. First level: present the clinical data to the registrar. Second level: auto-populate. Follow-up: has anything cropped up since the case was last coded? Of those, is there anything that needs recoding, or a recurrence that needs noting?
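The "first level" idea, presenting notable clinical data to the registrar, might look something like the Python sketch below. The thresholds (CA19-9 over 1000, PSA over 7) come from the notes above, which do not state units. The record layout, the field names, the `flag_for_review` function, and the sample results are hypothetical.

```python
from datetime import date

# Thresholds mentioned in the meeting notes; units are not stated there.
REVIEW_THRESHOLDS = {"CA19-9": 1000, "PSA": 7}


def flag_for_review(lab_results, last_coded):
    """Return results dated after last_coded whose value exceeds its threshold."""
    flagged = []
    for result in lab_results:
        limit = REVIEW_THRESHOLDS.get(result["test"])
        if limit is None:
            continue  # no threshold defined for this test
        if result["date"] > last_coded and result["value"] > limit:
            flagged.append(result)
    return flagged


# Hypothetical results for one patient.
sample_results = [
    {"test": "PSA", "value": 9.2, "date": date(2011, 11, 15)},
    {"test": "PSA", "value": 4.1, "date": date(2011, 11, 20)},
    {"test": "CA19-9", "value": 1500, "date": date(2010, 1, 5)},
    {"test": "BMI", "value": 31.0, "date": date(2011, 11, 1)},
]

flagged = flag_for_review(sample_results, last_coded=date(2011, 6, 1))
```

This only surfaces candidates for a registrar to look at. Deciding whether a flagged value means the case needs recoding remains a manual step, in line with the first-level versus second-level distinction in the notes.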