wiki:HeronLoad

Loading HERON involves extract/transform/load (ETL) work from each of the participating clinical organizations; so far:

as well as terminology sources such as:

We continue to expand and refine our data integration; see HeronLoadDev for development details.

Iteration Process Overview

Our monthly production ETL iteration process involves storage for two iterations:

  1. In-production, i.e. what users see
  2. Next build, using new ETL code on new data

Each month...

  1. We BulkTransfer a copy of the source data periodically; e.g. the hospital's clarity store (#249), and we stage the bulk data in Oracle.
  2. We extract and transform the data to the I2B2 star schema:
    1. The original copy is on the identified staging server.
    2. We make a de-identified copy that is used live in i2b2 (see DeIdentificationStrategy).
  3. We update the I2B2 middle tier to refer to the new data.
  4. We validate the data and then release it to our customers.
GraphViz image

Key:

GraphViz image

Other overviews:

Manual Work-arounds

When building the 4 ticket pipeline, it helps to add the open heron-release tickets on the data repository as blockers:

No results

todo: double-check w.r.t. query:keywords=~technical-debt&component=data-repository&status!=closed

Stage Data from Sources

See BulkTransfer. Be sure to document the resulting row counts (#814, #2223).

Note that we have cases where we've stored away old hierarchies to preserve old user queries. We plan to back them up at some point (see #3614). Examples include:

  • Old medications hierarchy (see gpc:ticket:78#comment:6):
    • i2b2metadata2.old_kuh_medication_tree
  • Old Diagnoses hierarchy:
    • i2b2metadata2.old_kuh_dx_terms

Run ETL

This is typically represented by a ticket such as #2212 (which has links to previous ETL tickets. When making a new one, the data keyword can help find dependencies.

Watch out for etl-performance issues: No results

Process Overview

We use Jenkins to manage the ETL process. See the "HERON ETL" tab in Jenkins on the production ID server. The first few ETL jobs are:

  1. etl_config_PROD creates the configuration file to be used for subsequent jobs.
  2. heron_init_PROD (heron_init) initializes ID and DEID databases resulting in the deletion of all ID and DEID data in the configured database.
  3. heron_etl_PROD (heron_load, load_concepts) runs fact and concept loads.

Note that the check_config_PROD job verifies that we're not running ETL against the production database.

diagram is from "HERON System and Servers" presentation, Oct 2014, #2936

heron_config_PROD

Run etl_config_PROD with required parameters (database instance, dump files, source code branch, etc). heron-prod.ini will be generated and used in subsequent jobs.

heron_init_PROD

The heron_init_PROD Jenkins job requires DBA privileges to be entered when executing. WARNING: Any existing data in the NightHeronData and BlueHeronData schemas will be wiped out. So, to be safe, one can use different passwords for different instances. However, there is a layer of safety built in - the check_config_PROD job is a build step of the initialization and will fail if one tries to run the initialization on the instance that is currently being used for production (as determined by the configuration files on the middle-tier JBoss server).

Creates identified, de-identified schemas and datamart tables.

  • The build log is archived by the job.
  • source:heron_load/create_datamart_schema.sql is used to create schemas. Note use of &&name substitution variables.
    • crc_create_datamart_oracle.sql and crc_create_uploader_oracle.sql from i2b2 sources are used to create datamart tables. actually, a slightly tweaked version. see #1176.

heron_etl_PROD

Jenkins parameters (paver commands):

  • heron_load

Load, de-identify the data, concepts.

Note: Make sure we have the I2B2METADATA2 schema, which has the original i2b2 concepts and the CPT concepts, in the instance.

See source:heron_load/, especially source:heron_load/heron_build.py for details. The following timings from our Feb 2014 release (#2528; see also ticket:3063#comment:12) give a rough outline of what it does:

OrderDurationTasko(Duration)
800:01:40make_epic_visit_viewsoooooooo
2100:18:59make_epic_lab_viewsooooooooooo
2500:13:20make_epic_meds_viewsooooooooooo
3100:05:21make_epic_notes_viewsooooooooo
3300:08:51make_epic_order_viewsoooooooooo
3600:00:22make_epic_social_history_viewsooooo
3800:00:24make_epic_enc_vitals_viewsoooooo
4400:54:48epic_dimensionsooooooooooooo
5100:02:08make_idx_viewsoooooooo
5300:31:56idx_dimensionsoooooooooooo
5900:22:07epic_demographicsooooooooooo
7000:02:01idx_demographicsoooooooo
8300:01:01ssdmf_loadooooooo
9400:00:55epic_allergyooooooo
10700:01:47make_tumor_viewsoooooooo
11000:12:34tumor_factsooooooooooo
12100:48:51idx_clinical_factsooooooooooooo
13200:12:30epic_diagnosisooooooooooo
15300:00:54make_epic_flowsheet_viewsooooooo
15600:13:49summarize_epic_flowsheetsooooooooooo
15815:23:01epic_flowsheetsooooooooooooooooo
28900:58:53epic_medical_historyooooooooooooo
32002:03:13epic_social_historyoooooooooooooo
40102:28:39epic_labsoooooooooooooo
41204:25:29epic_med_factsooooooooooooooo
48301:05:22epic_microbiologyooooooooooooo
50402:28:59epic_ordersoooooooooooooo
51500:00:19make_redcap_viewsooooo
51700:07:22redcap_clinical_factsoooooooooo
52802:02:17epic_notesoooooooooooooo
54900:07:40bsr_dimensionsoooooooooo
55500:06:46bsr_clinical_factsoooooooooo
56800:12:41uhc_clinical_factsooooooooooo
58100:30:39epic_servicesoooooooooooo
59200:59:26epic_enc_vitalsooooooooooooo
60301:36:57i2b2_facts_index.sqloooooooooooooo
60400:00:10epic_i2b2_deid_verify.sqloooo
60500:12:13count_facts_by_concept.sqlooooooooooo

We can see that flowsheets dominate the time.

See also ticket/1852/flowsheet_trends.svg

Load concepts (part of the heron_etl_PROD job)

Jenkins parameters (paver commands):

  • load_all_concepts

Add Patient, Fact Counts and Tooltips (part of the heron_etl_PROD job)

Jenkins parameters (paver commands):

  • enhance_concepts

Note: if the Oracle table statistics have already been migrated (to achieve acceptable end-user performance - see wiki:HeronLoad#Migrate_Oracle_Table_Statistics) then the tasks to add patient and fact counts may not finish in a reasonable amount of time - see ticket:2727.

Prepare new data for testing

Validate and Release

See GroupOnly/UpdateValidation

Milestone Planning

See GroupOnly/UpdateValidation

Last modified 5 weeks ago Last modified on 05/17/18 10:58:33

Attachments (3)

Download all attachments as: .zip