HERON integrates data from the KUH Tumor Registry, with 65,000 cases dating back to the 1950s.
Accomplishments
newest first...
Source code and development notes for NAACCR ETL
The data comes from the KUH Tumor Registry in the NAACCR format:
- Thornton M, (ed). DATA STANDARDS AND DATA DICTIONARY Standards for Cancer Registries Volume II: Data Standards and Data Dictionary, Record Layout Version 12.1, 15th ed. Springfield, Ill.: North American Association of Central Cancer Registries, June 2010.
todo: update to the current data dictionary materials (v13 as of this writing), which include relational data and Python export tools.
Our NAACCR ETL SQL scripts are designed for use in the HeronLoad ETL process (see also source:heron_load/README.rst).
We gratefully acknowledge contributions from Dustin Key of GHC and Jack London of the Kimmel Cancer Center.
The code is not (yet) designed to run independent of the KUMC environment, but peers in the informatics community have managed to port these scripts to their environment:
- source:heron_load/metadata_init.sql
- TODO: fix "see also: naacr_init.sql"
- source:heron_load/naaccr_txform.sql
- source:heron_load/seer_recode.sql
- source:heron_load/naaccr_load.sql
not yet released; stay tuned (#1254)
- source:heron_staging/tumor_reg: convert the NAACCR specification to a SQL view and an Oracle SQL*Loader control file
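As a rough illustration of what converting a record layout into a SQL view and a SQL*Loader control file can look like, the sketch below generates both from a small field list. It is not the code in heron_staging/tumor_reg. The table, view, and function names are hypothetical, and the column positions are placeholders; real positions must come from the NAACCR data dictionary.

```python
from typing import NamedTuple


class Field(NamedTuple):
    item: int    # NAACCR item number
    name: str    # item name
    start: int   # 1-based start column (placeholder values below)
    length: int  # field width in characters


# Illustrative layout only: the positions are made up, not taken from the spec.
LAYOUT = [
    Field(10, "Record Type", 1, 1),
    Field(220, "Sex", 2, 1),
    Field(400, "Primary Site", 3, 4),
]


def control_file(fields, table="naaccr_extract", infile="naaccr.dat"):
    """Build SQL*Loader control file text for fixed-width records."""
    cols = ",\n".join(
        f'  "{f.name}" POSITION({f.start}:{f.start + f.length - 1}) CHAR'
        for f in fields
    )
    return f"LOAD DATA\nINFILE '{infile}'\nINTO TABLE {table}\n(\n{cols}\n)\n"


def view_ddl(fields, view="naaccr_fields", source="naaccr_raw", column="line"):
    """Build a view that slices each field out of a raw text column."""
    cols = ",\n".join(
        f'  substr({column}, {f.start}, {f.length}) as "{f.name}"'
        for f in fields
    )
    return f"create or replace view {view} as\nselect\n{cols}\nfrom {source}\n"


ctl = control_file(LAYOUT)
ddl = view_ddl(LAYOUT)
print(ctl)
print(ddl)
```

Driving both outputs from one field list keeps the loader and the view in step when the record layout version changes.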
Design notes include:
We reviewed the data we get, section by section, to eliminate potentially sensitive data, including free text; the sections commented out with "--" below are not loaded into HERON:
    and ns.SectionID in (
      1 -- Cancer Identification
    , 2 -- Demographic
    -- , 3 -- Edit Overrides/Conversion History/System Admin
    , 4 -- Follow-up/Recurrence/Death
    -- , 5 -- Hospital-Confidential
    , 6 -- Hospital-Specific
    -- , 7 -- Other-Confidential
    -- , 8 -- Patient-Confidential
    -- , 9 -- Record ID
    -- , 10 -- Special Use
    , 11 -- Stage/Prognostic Factors -- TODO: numeric stuff
    -- , 12 -- Text-Diagnosis
    -- , 13 -- Text-Miscellaneous
    -- , 14 -- Text-Treatment
    -- , 15 -- Treatment-1st Course
    , 16 -- Treatment-Subsequent & Other
    , 17 -- Pathology
    )
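To make the effect of the section filter concrete, here is a minimal sketch using an in-memory SQLite database rather than the Oracle schema the ETL actually runs against. The table and column names other than SectionID, and the sample items, are assumptions made for illustration; only the list of section IDs is taken from the filter above.

```python
import sqlite3

# Section IDs left uncommented in the filter above.
LOADED_SECTIONS = (1, 2, 4, 6, 11, 16, 17)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table section (SectionID integer primary key, name text);
    create table item (ItemNbr integer primary key, ItemName text,
                       SectionID integer);
""")
conn.executemany("insert into section values (?, ?)", [
    (2, "Demographic"), (12, "Text-Diagnosis"), (17, "Pathology"),
])
# Hypothetical items, one per sample section.
conn.executemany("insert into item values (?, ?, ?)", [
    (1, "demo item", 2), (2, "free text item", 12), (3, "path item", 17),
])

# One bound parameter per section ID keeps the query parameterized.
placeholders = ", ".join("?" for _ in LOADED_SECTIONS)
rows = conn.execute(
    f"""select i.ItemName from item i
        join section ns on ns.SectionID = i.SectionID
        where ns.SectionID in ({placeholders})
        order by i.ItemNbr""",
    LOADED_SECTIONS,
).fetchall()
loaded_items = [name for (name,) in rows]
print(loaded_items)
```

In this sketch the item in section 12 (Text-Diagnosis) is dropped, which is the intended behavior for free-text sections.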
Requirements Gathering
December 1, 2011 Planning
Meeting with Tim Metcalf, Russ, Arvinder, Bhargav
Where exactly is the "RX Summary" info? It should be after the RX-Summ
Longer term, we want the site-specific items.
Note: some fields are not required.
Subsq RX 2nd Course and the other fields like it all seem to be 00 or 0. Arvinder thinks this may be an error in the first part of the ETL involving VARCHARs. Arvinder will work with John and Tim to run value frequencies on those columns.
- The SEER site recode: John asks whether Tim can check with his vendor to see if they already do that and have the data available for Dan. Tim suspects they do. This would really help with ontology creation.
Brainstorming with Tim on how this could help the registrar out of the box:
- Death index and Death from hospital could save them time
- Validate that his team is coding the data accurately. For example, are they using very old codes and rad therapy technologies when they should be using a newer or more accurate term (beam radiation)? Class of case and collaborative staging are other examples.
- Integrating data from HERON, such as BMI, could really add value to their annual report.
- HERON for investigators is a win because Tim doesn't have to run all their exploratory queries. For example, Steve Williamson asks routine questions every year about how many patients have particular histology and site combinations. Tim also wins because he can incorporate additional data.
- A follow-up report could be useful: what has changed since he last coded the case? Note: some of this kind of work is already provided by merged reports coming from KUH IT staff. We don't want to duplicate that work. We would also need to understand the operational commitment to fund this kind of work.
- Long term, though it might take a lot of work, HERON could find or check for things automatically, such as CA19-9 over 1000, PSA over 7, or clinical recurrence. First level: present the clinical data to the registrar. Second level: auto-populate. Follow-up: has anything cropped up since the case was last coded? Of those, does anything need recoding or a note that there is recurrence?
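The "first level" idea in the last item, presenting notable results that appeared after a case was last coded, might be sketched as below. This only illustrates the brainstormed rule and is not an existing HERON feature. The function name, the tuple layout, and the sample results are made up, and the two cutoffs are simply the figures mentioned in the notes, with no units or clinical review implied.

```python
from datetime import date

# Cutoffs quoted in the notes; units and exact values would need clinical review.
THRESHOLDS = {"CA19-9": 1000.0, "PSA": 7.0}


def flag_for_review(results, last_coded):
    """Return results that exceed a cutoff and postdate the last coding.

    results: iterable of (test_name, value, result_date) tuples
    last_coded: date the registrar last coded the case
    """
    flagged = []
    for name, value, when in results:
        limit = THRESHOLDS.get(name)
        if limit is not None and value > limit and when > last_coded:
            flagged.append((name, value, when))
    return flagged


# Made-up results for one hypothetical case last coded on 2011-06-01.
results = [
    ("PSA", 9.2, date(2011, 9, 15)),       # above cutoff, after coding
    ("PSA", 8.1, date(2011, 3, 1)),        # above cutoff, but before coding
    ("CA19-9", 450.0, date(2011, 10, 2)),  # after coding, below cutoff
]
flagged = flag_for_review(results, last_coded=date(2011, 6, 1))
print(flagged)
```

Only the first result is flagged here; the registrar would still decide whether it warrants recoding or a recurrence note.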