wiki:HeronLoad

Loading HERON involves extract/transform/load (ETL) work from each of the participating clinical organizations, as well as terminology sources.

We continue to expand and refine our data integration; see HeronLoadDev for development details.

Iteration Process Overview

Our monthly production ETL iteration process involves storage for two iterations:

  1. In-production, i.e. what users see
  2. Next build, using new ETL code on new data

Each month...

  1. We BulkTransfer? a copy of the source data periodically (e.g. the hospital's Clarity store, #249) and stage the bulk data in Oracle.
  2. We extract and transform the data to the I2B2 star schema:
    1. The original copy is on the identified staging server.
    2. We make a de-identified copy that is used live in i2b2 (see DeIdentificationStrategy).
  3. We update the I2B2 middle tier to refer to the new data.
  4. We validate the data and then release it to our customers.
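The two stored iterations swap roles each month; a minimal sketch of the flip-flop, assuming illustrative "A"/"B" instance names (the same A/B naming appears in the database URLs at cutover):

```python
# Sketch of the two-iteration flip-flop (hypothetical helper; "A"/"B"
# are illustrative instance names). Each month the next-build instance
# becomes production, and the old production instance is reused for
# the following build.
def next_build_instance(current_production):
    """Return the instance to build into: the one not in production."""
    instances = {"A", "B"}
    assert current_production in instances
    return (instances - {current_production}).pop()

print(next_build_instance("A"))  # → B
```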
[GraphViz image: monthly iteration process diagram]

Key:

[GraphViz image: diagram key]

Manual Work-arounds

When building the four-ticket pipeline, it helps to add the relevant tickets from the list below as blockers:

todo: prune this list using a heron-release keyword

  • #2138: use SCILHS ontology for HERON ICD9, CPT, HCPCS? (sync with UMLS?)

Stage Data from Sources

See BulkTransfer?. Be sure to document the resulting row counts (#814, #2223).

Note that we have cases where we've stored away old hierarchies to preserve old user queries. We plan to back them up at some point (see #3614). Examples include:

  • Old medications hierarchy (see gpc:ticket:78#comment:6):
    • i2b2metadata2.old_kuh_medication_tree
  • Old Diagnoses hierarchy:
    • i2b2metadata2.old_kuh_dx_terms

Run ETL

This is typically represented by a ticket such as #2212 (which has links to previous ETL tickets). When making a new one, the data keyword can help find dependencies.

Process Overview

We use Jenkins to manage the ETL process. See the "HERON ETL" tab in Jenkins on the production ID server. The first few ETL jobs are:

  1. etl_config_PROD creates the configuration file to be used for subsequent jobs.
  2. heron_init_PROD (heron_init) initializes the ID and DEID databases; this deletes all existing ID and DEID data in the configured instance.
  3. heron_etl_PROD (heron_load, load_concepts) runs fact and concept loads.

Note that the check_config_PROD job verifies that we're not running ETL against the production database.
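The safety check amounts to comparing the ETL target against the instance the production middle tier points at; a hypothetical sketch (function and variable names are illustrative, not the actual job implementation):

```python
# Hypothetical sketch of the check_config_PROD safety check: fail, like
# a Jenkins build step, if the ETL target is the instance that the
# production middle tier currently uses. Names are illustrative.
def check_config(target_instance, production_instance):
    if target_instance == production_instance:
        raise RuntimeError("refusing to run ETL against production "
                           "instance %s" % target_instance)
    return True

check_config("B", production_instance="A")  # OK: B is the next-build side
```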

The diagram is from the "HERON System and Servers" presentation, Oct 2014 (#2936).

etl_config_PROD

Run etl_config_PROD with the required parameters (database instance, dump files, source code branch, etc.). heron-prod.ini will be generated and used in subsequent jobs.

heron_init_PROD

The heron_init_PROD Jenkins job requires DBA credentials to be entered when it is executed. WARNING: any existing data in the NightHeronData and BlueHeronData schemas will be wiped out, so, to be safe, use different passwords for different instances. There is also a built-in layer of safety: the check_config_PROD job runs as a build step of the initialization and fails if one tries to run the initialization against the instance currently used for production (as determined by the configuration files on the middle-tier JBoss server).

The job creates the identified and de-identified schemas and the datamart tables.

  • The build log is archived by the job.
  • source:heron_load/create_datamart_schema.sql is used to create schemas. Note use of &&name substitution variables.
    • crc_create_datamart_oracle.sql and crc_create_uploader_oracle.sql from the i2b2 sources (actually slightly tweaked versions; see #1176) are used to create the datamart tables.
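SQL*Plus &&name substitution just splices a value into the script text, so the same script can create either schema; a minimal Python sketch of the idea (the template text here is illustrative, not the actual contents of create_datamart_schema.sql):

```python
# Minimal sketch of SQL*Plus-style &&name substitution: the same script
# text creates either the identified or de-identified schema depending
# on the values supplied. The template below is illustrative only.
def substitute(template, **values):
    for name, value in values.items():
        template = template.replace("&&" + name, value)
    return template

template = "CREATE USER &&data_schema IDENTIFIED BY &&data_password"
print(substitute(template, data_schema="BlueHeronData",
                 data_password="example"))
# → CREATE USER BlueHeronData IDENTIFIED BY example
```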

heron_etl_PROD

Jenkins parameters (paver commands):

  • heron_load

This loads and de-identifies the data and concepts.

Note: make sure the I2B2METADATA2 schema, which has the original i2b2 concepts and the CPT concepts, is present in the instance.

See source:heron_load/, especially source:heron_load/heron_build.py for details. The following timings from our Feb 2014 release (#2528; see also ticket:3063#comment:12) give a rough outline of what it does:

  Order  Duration  Task
      8  00:01:40  make_epic_visit_views
     21  00:18:59  make_epic_lab_views
     25  00:13:20  make_epic_meds_views
     31  00:05:21  make_epic_notes_views
     33  00:08:51  make_epic_order_views
     36  00:00:22  make_epic_social_history_views
     38  00:00:24  make_epic_enc_vitals_views
     44  00:54:48  epic_dimensions
     51  00:02:08  make_idx_views
     53  00:31:56  idx_dimensions
     59  00:22:07  epic_demographics
     70  00:02:01  idx_demographics
     83  00:01:01  ssdmf_load
     94  00:00:55  epic_allergy
    107  00:01:47  make_tumor_views
    110  00:12:34  tumor_facts
    121  00:48:51  idx_clinical_facts
    132  00:12:30  epic_diagnosis
    153  00:00:54  make_epic_flowsheet_views
    156  00:13:49  summarize_epic_flowsheets
    158  15:23:01  epic_flowsheets
    289  00:58:53  epic_medical_history
    320  02:03:13  epic_social_history
    401  02:28:39  epic_labs
    412  04:25:29  epic_med_facts
    483  01:05:22  epic_microbiology
    504  02:28:59  epic_orders
    515  00:00:19  make_redcap_views
    517  00:07:22  redcap_clinical_facts
    528  02:02:17  epic_notes
    549  00:07:40  bsr_dimensions
    555  00:06:46  bsr_clinical_facts
    568  00:12:41  uhc_clinical_facts
    581  00:30:39  epic_services
    592  00:59:26  epic_enc_vitals
    603  01:36:57  i2b2_facts_index.sql
    604  00:00:10  epic_i2b2_deid_verify.sql
    605  00:12:13  count_facts_by_concept.sql

We can see that flowsheets dominate the time.
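A back-of-the-envelope check of that observation, using a few of the longer durations copied from the table above:

```python
# Back-of-the-envelope check that flowsheets dominate, using a few of
# the longer durations (HH:MM:SS) copied from the table above.
def seconds(hms):
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

timings = {
    "epic_flowsheets": "15:23:01",
    "epic_med_facts": "04:25:29",
    "epic_orders": "02:28:59",
    "epic_labs": "02:28:39",
}
total = sum(seconds(t) for t in timings.values())
share = seconds(timings["epic_flowsheets"]) / total
print("epic_flowsheets share of these four tasks: %.0f%%" % (share * 100))
# → epic_flowsheets share of these four tasks: 62%
```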

Load concepts (part of the heron_etl_PROD job)

Jenkins parameters (paver commands):

  • load_all_concepts

Add Patient, Fact Counts and Tooltips (part of the heron_etl_PROD job)

Jenkins parameters (paver commands):

  • enhance_concepts

Note: if the Oracle table statistics have already been migrated (to achieve acceptable end-user performance; see wiki:HeronLoad#Migrate_Oracle_Table_Statistics), then the tasks that add patient and fact counts may not finish in a reasonable amount of time; see ticket:2727.

Prepare new data for testing

heron_config_datasources_PROD

This Jenkins job changes the passwords for the i2b2 accounts (i2b2pm, BlueHeronData, etc.) and creates two JBoss datasource files, one for the ID database and one for the DEID database (e.g. heron-prod-deid-ds.xml). These two configuration files are used to point the front-end application server at the new data.

  • Note that the job also updates the i2b2 OBFSC_SERVICE_ACCOUNT account password to match the BlueHeronData (NightHeronData for ID) Oracle passwords.

Update test i2b2 configuration (somewhat manual process)

  • Copy the heron-prod-ds.xml file to the JBoss deployments directory on the test JBoss application server (/opt/jboss-as-7.1.1.Final/standalone/deployments).
    • Note: subsequent scripts (the cutover script) expect the .xml configuration file to be named heron-prod1-ds.xml, heron-prod2-ds.xml, or simply heron-prod-ds.xml.
  • Update passwords in CRC application configuration for i2b2 on JBoss application server (/opt/jboss-as-7.1.1.Final/standalone/configuration/crcapp). The updated passwords are contained in the heron-prod-ds.xml file generated in steps noted previously.
    • CRCLoaderApplicationContext.xml
      1. password for i2b2hive user
      2. database url (A or B)
    • crc.properties
      1. OBFSC_SERVICE_ACCOUNT password to match the BlueHeronData password
  • Restart Apache so that heron_admin loads the new passwords:
    • /etc/init.d/apache2 restart
  • Update build name label:
    • UPDATE (or INSERT into) I2B2PM.PM_PROJECT_DATA, setting PROJECT_NAME to the new build name (e.g. Pamona, Wilson). Do this for BOTH ID and DEID.
  • try a query; e.g. count frontiers patients
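The build-name step above amounts to a single UPDATE run against both instances; a hypothetical sketch (the table and column names come from the step itself, but the helper function and bind-variable style are illustrative):

```python
# Hypothetical sketch of the build-name label update. The table and
# column names (I2B2PM.PM_PROJECT_DATA, PROJECT_NAME) are from the step
# above; the bind style is illustrative. Run the statement against BOTH
# the ID and DEID instances.
def build_name_update(new_name):
    sql = "UPDATE I2B2PM.PM_PROJECT_DATA SET PROJECT_NAME = :name"
    return sql, {"name": new_name}

sql, params = build_name_update("Wilson")
print(sql, params)
```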

heron_content_summary

  • This job creates report_stats.txt which is a summary of patient/fact counts formatted to make a nice table in Trac when pasted into a ticket comment.

Validate and Release

UpdateValidation includes various techniques, such as reviewing content stats while drafting the HeronReleaseNotes blog item.

heron_update_oracle_stats_PROD

Migrate table statistics from the previous month (see ticket:1625#comment:11), since they've been shown to provide better performance than the statistics that Oracle generates automatically.

Check Performance

Jenkins jobs:

  • heron_performance_check_plans
    • In the production Jenkins environment, select the heron_performance_check_plans job and click Build with Parameters.
      • Select the appropriate SID from the drop-down menu SID.
  • heron_performance
    • In the production Jenkins environment, select the heron_performance job and click Build with Parameters.
      • Select the appropriate SID from the drop-down menu SID.
      • Select the ./test_queries/timing_tests_default.txt file.

This should produce an HTML report that compares query times with the previous test run. Make sure test durations haven't increased dramatically.
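The kind of comparison the report supports can be sketched as follows (the 50% threshold and query names are illustrative, not part of the actual report):

```python
# Hypothetical sketch of the performance comparison: flag any test query
# whose duration grew dramatically (here, more than 50%) versus the
# previous run. The threshold and query names are illustrative.
def regressions(previous, current, max_growth=0.5):
    return [name for name, secs in current.items()
            if name in previous and secs > previous[name] * (1 + max_growth)]

previous = {"count frontiers patients": 12.0, "demographics by decade": 30.0}
current = {"count frontiers patients": 13.1, "demographics by decade": 55.0}
print(regressions(previous, current))  # → ['demographics by decade']
```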

Automated Query Tests

Run source:heron_load/test_heron_query.py like:

python test_heron_query.py https://<test app server name>.kumc.edu/heron

All tests should pass. Attach the resulting output to the validate and release ticket.

Production cut-over

Notify users and/or verify that nobody is currently logged in to HERON before proceeding.

  • Run a query against production to find users from the last two weeks:
  • Note that some HERON users don't always check their KUMC e-mail accounts, such as some of our GPC collaborators from Iowa and Texas.
    select distinct user_id||'@kumc.edu' from i2b2pm.pm_user_session where trunc(entry_date) >= trunc(sysdate-14);
    
  • Send an email notice to users at least an hour before beginning the down time. In the past, the following format has been used:

The HERON/i2b2 application is scheduled for downtime today (Thursday, October 23, 2014). The system will be unavailable from 2:00 PM until 4:00 PM. During this downtime, we will be performing the monthly data update which will include clinical data through September, 2014.

Thanks!

Regards,

  1. Point the production instance of i2b2 to the new data (using heron-prod-deid-ds.xml etc., as was done for the test server above)
    • Stop JBoss and apache on the production application server.
      /etc/init.d/jboss stop
      /etc/init.d/apache2 stop
      
    • Copy the heron-prod-deid-ds.xml config file from the test server
    • Update passwords in CRC Application file as was done before on the test application server
    • Start JBoss/Apache again:
      /etc/init.d/jboss start
      /etc/init.d/apache2 start
      
    • Run at least one query with a patient set and at least test the following plugins:
      • Timeline
      • Kaplan Meier
      • R Data Builder
        • Run the create_databuilder_db_user Jenkins job to create the databuilder user and run the appropriate grants.
        • you'll need to run builder-extract-config as well as builder-deploy-jboss on middle tiers to tell them about the A1 vs. B2 flip/flop
          • todo: make it a downstream job from one of the other cutover jobs?
            • where are those cutover jobs?
  1. Publish the announcement/disclaimer
  1. Migrate user history:
    • Stop JBoss/Apache as above
    • Jenkins DEID_PROD_Cutover
    • Restart JBoss/Apache as above
  1. Backup patient and encounter mappings and update previous patient sets. Note: it's very important that the build name has been updated before this step, since the backup step gets the build name from the PROJECT_NAME field noted above. Run the Jenkins job update_qt_patient_num with:
    1. Parameter: backup_pat_enc_mappings
    2. Parameter: update_qt_patient_num

Kill/Restart the periodic query monitor

The periodic query monitor runs on the production application server. See also #1415 and the HERON_periodic_query Jenkins job.

  • Before running the periodic query monitor, first kill any currently running instances; use the command below to find the PID to kill:
    ps aux | grep test_heron_query_periodic.py 
    
  • To run the periodic query monitor, use the command below (where the first argument is the location of the bmidev/heron_load repository):
    nohup python ~/bmi_ops/heron_monitor/test_heron_query_periodic.py 
    
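The two manual steps above could be combined into a small helper; a hypothetical sketch (the parsing helper and its name are illustrative):

```python
# Hypothetical helper for the steps above: parse `ps aux` output for the
# monitor's PIDs (skipping the grep line itself) so they can be killed
# before relaunching. The function name is illustrative.
import subprocess

def monitor_pids(ps_output, script="test_heron_query_periodic.py"):
    return [int(line.split()[1]) for line in ps_output.splitlines()
            if script in line and "grep" not in line]

ps = subprocess.run(["ps", "aux"], capture_output=True, text=True)
print(monitor_pids(ps.stdout))
```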

NightHeron

Perform the same production user history migration steps for ID HERON, though using Jenkins job ID_PROD_Cutover.

Review and Close Milestone

We agreed on November 21, 2013 that the process for closing HERON release milestones is for most/all of us to get together and discuss it; in particular:

  • Discuss any (major) open tickets that might not be in the release dependency graph.
  • Review schedule slips: what were the causes? are there cost-effective ways to address those causes?

Weekly status meeting

Each week, the team meets to discuss progress, schedule risks, etc. and to prioritize remaining work for the current release. The heron-weekly keyword is used to specifically add a ticket to the list to discuss.

Milestone Planning

We do a larger HeronProjectTimeline exercise every few months, but in between, we aim to have our release goals aligned with our major tickets. When these lists don't match, there are three choices:

  1. Promote the ticket to a team release goal.
  2. Postpone the ticket to a later release.
  3. Demote the ticket to minor priority; i.e. agree that it can be moved from milestone to milestone without discussion.

Coordinate with IR to apply any needed OS patches for the ID server

In our IR meeting (2015.01.06), we (Russ, Nathan, Bhargav, Brandon) agreed to e-mail Eric, Rob, John, and Jeff about when they can have a downtime to apply server patches to the ID server (aiming for the following Sunday). This requires special effort since the machine uses FusionIO cards, and any kernel updates require recompilation of the driver. Nathan owes a link...

Last modified on 09/29/15 08:01:01