Loading HERON involves extract/transform/load (ETL) work from each of the participating clinical organizations; so far:
- HospitalEpicSource
- ClinicIdxSource
- BiospecimenRepository
- TumorRegistry
- UHCSource
- REDCap (ticket:1107)
as well as terminology sources such as:
- RxNorm Terms for MedMapping (ticket:1048)
We continue to expand and refine our data integration; see HeronLoadDev for development details.
Iteration Process Overview
Our monthly production ETL iteration process involves storage for two iterations:
- In-production, i.e. what users see
- Next build, using new ETL code on new data
Each month...
- We BulkTransfer a copy of the source data periodically; e.g. the hospital's clarity store (#249), and we stage the bulk data in Oracle.
- We extract and transform the data to the I2B2 star schema:
- The original copy is on the identified staging server.
- We make a de-identified copy that is used live in i2b2 (see DeIdentificationStrategy).
- We update the I2B2 middle tier to refer to the new data.
- We validate the data and then release it to our customers.

Key:

Other overviews:
- HERON ETL presentation from GPC:HackathonOne
- Tasks and Dependencies in GPC:MultiSiteDev
Manual Work-arounds
When building the 4 ticket pipeline, it helps to add the open heron-release tickets on the data repository as blockers:
No results
todo: double-check w.r.t. query:keywords=~technical-debt&component=data-repository&status!=closed
Stage Data from Sources
See BulkTransfer. Be sure to document the resulting row counts (#814, #2223).
Note that we have cases where we've stored away old hierarchies to preserve old user queries. We plan to back them up at some point (see #3614). Examples include:
- Old medications hierarchy (see gpc:ticket:78#comment:6):
i2b2metadata2.old_kuh_medication_tree
- Old Diagnoses hierarchy:
i2b2metadata2.old_kuh_dx_terms
Run ETL
This is typically represented by a ticket such as #2212 (which has links to previous ETL tickets. When making a new one, the data keyword can help find dependencies.
- Create a release branch; see ticket:4108#comment:17
- Check disk space and tablespaces (e.g. ticket:1535#comment:17, #347)
- Check monitor sessions to make sure there are no connections to blueherondata or blueheronmetadata in non-production instance.
Process Overview
We use Jenkins to manage the ETL process. See the "HERON ETL" tab in Jenkins on the production ID server. The first few ETL jobs are:
- etl_config_PROD creates the configuration file to be used for subsequent jobs.
- heron_init_PROD (heron_init) initializes ID and DEID databases resulting in the deletion of all ID and DEID data in the configured database.
- heron_etl_PROD (heron_load, load_concepts) runs fact and concept loads.
Note that the check_config_PROD job verifies that we're not running ETL against the production database.
diagram is from "HERON System and Servers" presentation, Oct 2014, #2936
heron_config_PROD
Run etl_config_PROD with required parameters (database instance, dump files, source code branch, etc). heron-prod.ini will be generated and used in subsequent jobs.
heron_init_PROD
The heron_init_PROD Jenkins job requires DBA privileges to be entered when executing. WARNING: Any existing data in the NightHeronData
and BlueHeronData
schemas will be wiped out. So, to be safe, one can use different passwords for different instances. However, there is a layer of safety built in - the check_config_PROD job is a build step of the initialization and will fail if one tries to run the initialization on the instance that is currently being used for production (as determined by the configuration files on the middle-tier JBoss server).
Creates identified, de-identified schemas and datamart tables.
- The build log is archived by the job.
- source:heron_load/create_datamart_schema.sql is used to create schemas. Note use of
&&name
substitution variables.crc_create_datamart_oracle.sql
andcrc_create_uploader_oracle.sql
from i2b2 sources are used to create datamart tables. actually, a slightly tweaked version. see #1176.
heron_etl_PROD
Jenkins parameters (paver commands):
- heron_load
Load, de-identify the data, concepts.
Note: Make sure we have the I2B2METADATA2 schema, which has the original i2b2 concepts and the CPT concepts, in the instance.
See source:heron_load/, especially source:heron_load/heron_build.py for details.
Watch out for etl-performance
issues:
No results
The following timings from our Feb 2014 release (#2528; see also ticket:3063#comment:12, ticket/4300#comment:25) give a rough outline of what it does:
Order | Duration | Task | o(Duration) |
8 | 00:01:40 | make_epic_visit_views | oooooooo |
21 | 00:18:59 | make_epic_lab_views | ooooooooooo |
25 | 00:13:20 | make_epic_meds_views | ooooooooooo |
31 | 00:05:21 | make_epic_notes_views | ooooooooo |
33 | 00:08:51 | make_epic_order_views | oooooooooo |
36 | 00:00:22 | make_epic_social_history_views | ooooo |
38 | 00:00:24 | make_epic_enc_vitals_views | oooooo |
44 | 00:54:48 | epic_dimensions | ooooooooooooo |
51 | 00:02:08 | make_idx_views | oooooooo |
53 | 00:31:56 | idx_dimensions | oooooooooooo |
59 | 00:22:07 | epic_demographics | ooooooooooo |
70 | 00:02:01 | idx_demographics | oooooooo |
83 | 00:01:01 | ssdmf_load | ooooooo |
94 | 00:00:55 | epic_allergy | ooooooo |
107 | 00:01:47 | make_tumor_views | oooooooo |
110 | 00:12:34 | tumor_facts | ooooooooooo |
121 | 00:48:51 | idx_clinical_facts | ooooooooooooo |
132 | 00:12:30 | epic_diagnosis | ooooooooooo |
153 | 00:00:54 | make_epic_flowsheet_views | ooooooo |
156 | 00:13:49 | summarize_epic_flowsheets | ooooooooooo |
158 | 15:23:01 | epic_flowsheets | ooooooooooooooooo |
289 | 00:58:53 | epic_medical_history | ooooooooooooo |
320 | 02:03:13 | epic_social_history | oooooooooooooo |
401 | 02:28:39 | epic_labs | oooooooooooooo |
412 | 04:25:29 | epic_med_facts | ooooooooooooooo |
483 | 01:05:22 | epic_microbiology | ooooooooooooo |
504 | 02:28:59 | epic_orders | oooooooooooooo |
515 | 00:00:19 | make_redcap_views | ooooo |
517 | 00:07:22 | redcap_clinical_facts | oooooooooo |
528 | 02:02:17 | epic_notes | oooooooooooooo |
549 | 00:07:40 | bsr_dimensions | oooooooooo |
555 | 00:06:46 | bsr_clinical_facts | oooooooooo |
568 | 00:12:41 | uhc_clinical_facts | ooooooooooo |
581 | 00:30:39 | epic_services | oooooooooooo |
592 | 00:59:26 | epic_enc_vitals | ooooooooooooo |
603 | 01:36:57 | i2b2_facts_index.sql | oooooooooooooo |
604 | 00:00:10 | epic_i2b2_deid_verify.sql | oooo |
605 | 00:12:13 | count_facts_by_concept.sql | ooooooooooo |
We can see that flowsheets dominate the time.
See also ticket/1852/flowsheet_trends.svg
Load concepts (part of the heron_etl_PROD job)
Jenkins parameters (paver commands):
- load_all_concepts
Add Patient, Fact Counts and Tooltips (part of the heron_etl_PROD job)
Jenkins parameters (paver commands):
- enhance_concepts
Note: if the Oracle table statistics have already been migrated (to achieve acceptable end-user performance - see wiki:HeronLoad#Migrate_Oracle_Table_Statistics) then the tasks to add patient and fact counts may not finish in a reasonable amount of time - see ticket:2727.
Prepare new data for testing
Validate and Release
See GroupOnly/UpdateValidation
Milestone Planning
See GroupOnly/UpdateValidation
ACT-SHRINE overview.
Attachments (3)
- medford_perf_pie.png (79.5 KB) - added by 7 years ago.
- beaver_timing_pie.png (109.7 KB) - added by 7 years ago.
- HERON_ETL_Data_Flows.png (49.1 KB) - added by 6 years ago.
Download all attachments as: .zip