[[PageOutline]]

Loading [[HERON]] involves extract/transform/load (ETL) work on data
from each of the participating clinical organizations; so far:
 - HospitalEpicSource
 - ClinicIdxSource
 - BiospecimenRepository
 - TumorRegistry
 - [[UHCSource]]
 - REDCap (ticket:1107)

as well as terminology sources such as:
 - !RxNorm Terms for MedMapping (ticket:1048)

We continue to expand and refine our data integration; see HeronLoadDev for development details.

== Iteration Process Overview ==

Our monthly production ETL iteration process involves storage for two iterations:
  1. In-production, i.e. what users see
  2. Next build, using new ETL code on new data

Each month...

1. We BulkTransfer a copy of the source data periodically; e.g. the hospital's [HospitalEpicSource#clarity clarity store] (#249), and we stage the bulk data in Oracle.
2. We extract and transform the data to the I2B2 star schema:
  a. The original copy is on the identified staging server.
  b. We make a de-identified copy that is used live in i2b2 (see DeIdentificationStrategy).
3. We update the I2B2 middle tier to refer to the new data.
4. We validate the data and then release it to our customers.

{{{
#!graphviz
digraph epic_alpha {
graph [rankdir=TB];

EPIC -> epic_file [label="1.a dump, scp", constraint=false];
epic_file [label="clarity.dmp"];

subgraph clusterID {
 label="Id Server";
 epic_file [ shape="folder"];
 epic_file -> Clarity1:CLARITY [label="1.b oracle restore"];
 Clarity1 [color=red, shape=record, label="Clarity1|<CLARITY>CLARITY.*"];
 nheron [color=red, shape=record, label="nheron|<NightHeron>NightHeronData.*"];

 Clarity1:CLARITY -> nheron:NightHeron [label="2.a ETL"];
}

nheron:NightHeron -> bheron1:blue_heron [label="2.b de-identify", constraint=false];

subgraph clusterDID {
 label="De-id Server";
 bheron1 [color=orange, shape=record, label="bheron1|<blue_heron>BlueHeronData.*"];
}

}
}}}

Key:

{{{
#!graphviz
digraph key {
graph [rankdir=TB];

subgraph clusterbox {
 label="Machine";
 sid1 [color=red, shape=record, label="sid\n(identified)|<blue_hictr>schema.*"];
 sid2 [color=orange, shape=record, label="sid\n(limited)|<blue_hictr>schema.*"];
}

}
}}}

== Manual Work-arounds == #technical-debt

When building the 4-ticket pipeline, it helps to add the relevant tickets from the list below as blockers:

''todo: prune this list using a `heron-release` keyword''
[[TicketQuery(keywords=~technical-debt&component=data-repository&status!=closed)]]

== Stage Data from Sources ==

See BulkTransfer. Be sure to document the resulting row counts (#814, #2223).

Note that we have cases where we've stored away old hierarchies to preserve old user queries.  We plan to back them up at some point (see #3614).  Examples include:
* Old medications hierarchy (see gpc:ticket:78#comment:6):
 - `i2b2metadata2.old_kuh_medication_tree`
* Old Diagnoses hierarchy:
 - `i2b2metadata2.old_kuh_dx_terms`

== Run ETL == #heron_ETL

''This is typically represented by a ticket such as #2212 (which has links to previous ETL tickets). When making a new one, the [query:keywords=~data data keyword] can help find dependencies.''
 * Create a release branch; see wiki:HeronLoad#ReleaseBranches.
 * Check disk space and tablespaces (e.g. ticket:1535#comment:17, #347); a sample query is sketched below.
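
A minimal free-space check might look like the sketch below (it assumes DBA access to the usual Oracle data dictionary views; compare the free space against what the coming load is expected to need):

{{{
#!sql
-- Rough free-space check per tablespace (a sketch, not the official procedure).
select tablespace_name,
       round(sum(bytes) / 1024 / 1024 / 1024, 1) as free_gb
  from dba_free_space
 group by tablespace_name
 order by free_gb;
}}}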

==== Process Overview ====
We use [http://jenkins-ci.org/ Jenkins] to manage the ETL process.  See the "HERON ETL" tab in Jenkins on the production ID server. The first few ETL jobs are:
  1. '''etl_config_PROD''' creates the configuration file to be used for subsequent jobs.
  2. '''heron_init_PROD''' ([wiki:HeronLoad#heron_init heron_init]) initializes the ID and DEID databases, deleting all ID and DEID data in the configured database.
  3. '''heron_etl_PROD''' ([wiki:HeronLoad#heron_load heron_load], [wiki:HeronLoad#load_concepts load_concepts]) runs the fact and concept loads.
Note that the '''check_config_PROD''' job verifies that we're not running ETL against the production database.

[[Image(HERON_ETL_Data_Flows.png)]]

''diagram is from the "HERON System and Servers" presentation, Oct 2014, #2936''

=== etl_config_PROD === #heron_config
Run etl_config_PROD with the required parameters (database instance, dump files, source code branch, etc.).  The generated heron-prod.ini is used in subsequent jobs.

=== heron_init_PROD === #heron_init
The heron_init_PROD Jenkins job requires DBA credentials to be entered when it is run.  '''WARNING''': any existing data in the `NightHeronData` and `BlueHeronData` schemas will be '''wiped out'''. To be safe, use different passwords for different instances.  There is also a layer of safety built in: the check_config_PROD job is a build step of the initialization and will fail if one tries to run the initialization on the instance that is currently being used for production (as determined by the configuration files on the middle-tier JBoss server).

The job creates the identified and de-identified schemas and the datamart tables.
* The build log is archived by the job.
* source:heron_load/create_datamart_schema.sql is used to create the schemas; note the use of `&&name` substitution variables (a sketch follows below).
  * `crc_create_datamart_oracle.sql` and `crc_create_uploader_oracle.sql` from the i2b2 sources are used to create the datamart tables. ''Actually, a slightly tweaked version; see #1176.''
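
For readers unfamiliar with SQL*Plus substitution variables, a hypothetical illustration of the `&&name` style follows (the schema name and grants here are made up; the real statements live in create_datamart_schema.sql):

{{{
#!sql
-- Hypothetical example only: &&data_schema prompts once on first use,
-- then the same value is substituted everywhere it appears.
create user &&data_schema identified by "&&data_schema_password";
grant create session, create table, create view to &&data_schema;
alter user &&data_schema quota unlimited on users;
}}}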

=== heron_etl_PROD === #heron_load
Jenkins parameters (paver commands):
* heron_load

This loads and de-identifies the data and concepts.

**Note:** Make sure the instance has the I2B2METADATA2 schema, which holds the original i2b2 concepts and the CPT concepts. A quick check is sketched below.
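
A minimal check, assuming ordinary access to the Oracle data dictionary:

{{{
#!sql
-- Does the I2B2METADATA2 schema exist, and does it hold any tables?
select owner, count(*) as table_count
  from all_tables
 where owner = 'I2B2METADATA2'
 group by owner;
}}}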

See source:heron_load/, especially source:heron_load/heron_build.py for details. The following timings from our Feb 2014 release (#2528; see also ticket:3063#comment:12) give a rough outline of what it does:

{{{
#!CSV
"Order","Duration","Task","o(Duration)"
"8","00:01:40","make_epic_visit_views","oooooooo"
"21","00:18:59","make_epic_lab_views","ooooooooooo"
"25","00:13:20","make_epic_meds_views","ooooooooooo"
"31","00:05:21","make_epic_notes_views","ooooooooo"
"33","00:08:51","make_epic_order_views","oooooooooo"
"36","00:00:22","make_epic_social_history_views","ooooo"
"38","00:00:24","make_epic_enc_vitals_views","oooooo"
"44","00:54:48","epic_dimensions","ooooooooooooo"
"51","00:02:08","make_idx_views","oooooooo"
"53","00:31:56","idx_dimensions","oooooooooooo"
"59","00:22:07","epic_demographics","ooooooooooo"
"70","00:02:01","idx_demographics","oooooooo"
"83","00:01:01","ssdmf_load","ooooooo"
"94","00:00:55","epic_allergy","ooooooo"
"107","00:01:47","make_tumor_views","oooooooo"
"110","00:12:34","tumor_facts","ooooooooooo"
"121","00:48:51","idx_clinical_facts","ooooooooooooo"
"132","00:12:30","epic_diagnosis","ooooooooooo"
"153","00:00:54","make_epic_flowsheet_views","ooooooo"
"156","00:13:49","summarize_epic_flowsheets","ooooooooooo"
"158","15:23:01","epic_flowsheets","ooooooooooooooooo"
"289","00:58:53","epic_medical_history","ooooooooooooo"
"320","02:03:13","epic_social_history","oooooooooooooo"
"401","02:28:39","epic_labs","oooooooooooooo"
"412","04:25:29","epic_med_facts","ooooooooooooooo"
"483","01:05:22","epic_microbiology","ooooooooooooo"
"504","02:28:59","epic_orders","oooooooooooooo"
"515","00:00:19","make_redcap_views","ooooo"
"517","00:07:22","redcap_clinical_facts","oooooooooo"
"528","02:02:17","epic_notes","oooooooooooooo"
"549","00:07:40","bsr_dimensions","oooooooooo"
"555","00:06:46","bsr_clinical_facts","oooooooooo"
"568","00:12:41","uhc_clinical_facts","ooooooooooo"
"581","00:30:39","epic_services","oooooooooooo"
"592","00:59:26","epic_enc_vitals","ooooooooooooo"
"603","01:36:57","i2b2_facts_index.sql","oooooooooooooo"
"604","00:00:10","epic_i2b2_deid_verify.sql","oooo"
"605","00:12:13","count_facts_by_concept.sql","ooooooooooo"
}}}

We can see that flowsheets dominate the time.

[[Image(beaver_timing_pie.png)]]

==== Load concepts (part of the heron_etl_PROD job) ==== #load_concepts
Jenkins parameters (paver commands):
* load_all_concepts

==== Add Patient, Fact Counts and Tooltips (part of the heron_etl_PROD job) ====
Jenkins parameters (paver commands):
* enhance_concepts

'''Note: if the Oracle table statistics have already been migrated (to achieve acceptable end-user performance; see wiki:HeronLoad#Migrate_Oracle_Table_Statistics), then the tasks to add patient and fact counts may not finish in a reasonable amount of time; see ticket:2727.'''
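
One way to see whether statistics are already in place before running enhance_concepts is sketched below; it assumes the de-identified schema is named BLUEHERONDATA, as elsewhere on this page:

{{{
#!sql
-- If LAST_ANALYZED is recent and NUM_ROWS is populated, statistics have
-- probably already been gathered or migrated for these tables.
select table_name, num_rows, last_analyzed
  from all_tables
 where owner = 'BLUEHERONDATA'
   and table_name in ('OBSERVATION_FACT', 'PATIENT_DIMENSION')
 order by table_name;
}}}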

== Prepare new data for testing == #test_query
=== heron_config_datasources_PROD === #heron_config_datasources
This Jenkins job changes the passwords for the i2b2 accounts (`i2b2pm`, `BlueHeronData`, etc.) and creates two JBoss datasource files (`heron-prod-deid-ds.xml` and `heron-prod-deid-ds.xml`).  These two configuration files are used to configure the front-end application server to use the new data.
 * Note that the job also updates the i2b2 OBFSC_SERVICE_ACCOUNT account password to match the `BlueHeronData` (`NightHeronData` for ID) Oracle password.
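
Under the hood this amounts to Oracle password changes of roughly the following shape (an illustrative sketch only; the job chooses the actual password values and writes them into the datasource files):

{{{
#!sql
-- Illustrative only: the Jenkins job picks the real password values.
alter user BlueHeronData identified by "example_password_1";
alter user i2b2pm identified by "example_password_2";
}}}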

=== Update test i2b2 configuration (somewhat manual process) === #prepare_for_testing
   * Copy the heron-prod-ds.xml file to the JBoss deployments directory on the test JBoss application server (`/opt/jboss-as-7.1.1.Final/standalone/deployments`).
     - Note: subsequent scripts (the cutover script) expect the .xml configuration file to be named `heron-prod1-ds.xml`, `heron-prod1-ds.xml`, or simply `heron-prod-ds.xml`.
   * Update the passwords in the CRC application configuration for i2b2 on the JBoss application server (`/opt/jboss-as-7.1.1.Final/standalone/configuration/crcapp`).  The updated passwords are contained in the `heron-prod-ds.xml` file generated in the steps above.
     - `CRCLoaderApplicationContext.xml`
       1. password for the i2b2hive user
       2. database url (A or B)
     - `crc.properties`
       1. `OBFSC_SERVICE_ACCOUNT` password to match the `BlueHeronData` password
   * Restart Apache so that `heron_admin` loads the new passwords:
     - `/etc/init.d/apache2 restart`
   * Update the build name label:
     - UPDATE/INSERT I2B2PM.PM_PROJECT_DATA SET PROJECT_NAME with the new build name (e.g. Pamona, Wilson).  '''Do this for BOTH ID and DEID'''; a sketch follows this list.
   * Try a query; e.g. count frontiers patients.
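
A hedged sketch of the build-name update (the `project_id` value here is an assumption; check the existing row first):

{{{
#!sql
-- Check what is there, then set the new build name (run on BOTH ID and DEID).
select project_id, project_name from i2b2pm.pm_project_data;

update i2b2pm.pm_project_data
   set project_name = 'Wilson'      -- the new build name
 where project_id = 'BlueHeron';    -- assumed project id; verify with the query above
commit;
}}}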

=== heron_content_summary === #heron_content_summary
 * This job creates `report_stats.txt`, a summary of patient/fact counts formatted to make a nice table in Trac when pasted into a ticket comment. A rough idea of the underlying counts is sketched below.
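
For orientation, the counts involved look roughly like the sketch below (assuming the standard i2b2 star schema in the BlueHeronData schema; the authoritative numbers come from the Jenkins job's `report_stats.txt`):

{{{
#!sql
-- Facts and distinct patients per source system (illustrative only).
select sourcesystem_cd,
       count(*)                    as fact_count,
       count(distinct patient_num) as patient_count
  from blueherondata.observation_fact
 group by sourcesystem_cd
 order by fact_count desc;
}}}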

== Validate and Release == #release

UpdateValidation includes various techniques, including reviewing content stats while drafting the HeronReleaseNotes blog item.

=== heron_update_oracle_stats_PROD === #heron_update_oracle_stats_PROD
Migrate the table statistics from the previous month (see ticket:1625#comment:11), since they have been shown to provide better performance than the statistics that Oracle generates automatically.
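
The Jenkins job handles the migration, but for orientation, moving statistics with `DBMS_STATS` looks roughly like the sketch below (the schema and statistics-table names are assumptions):

{{{
#!sql
-- On the previous month's instance: export schema statistics to a staging table.
begin
  dbms_stats.create_stat_table(ownname => 'BLUEHERONDATA', stattab => 'HERON_STATS');
  dbms_stats.export_schema_stats(ownname => 'BLUEHERONDATA', stattab => 'HERON_STATS');
end;
/
-- After copying the HERON_STATS table to the new instance: import the statistics.
begin
  dbms_stats.import_schema_stats(ownname => 'BLUEHERONDATA', stattab => 'HERON_STATS');
end;
/
}}}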

=== Check Performance === #Check_Performance
Jenkins jobs:
* heron_performance_check_plans
 - In the production Jenkins environment, select the `heron_performance_check_plans` job and click `Build with Parameters`.
  - Select the appropriate SID from the SID drop-down menu.
* heron_performance
 - In the production Jenkins environment, select the `heron_performance` job and click `Build with Parameters`.
  - Select the appropriate SID from the SID drop-down menu.
  - Select the `./test_queries/timing_tests_default.txt` file.
This should produce an HTML report that compares query times with the previous test run.  Make sure test durations haven't increased dramatically.

=== Automated Query Tests === #Automated_Tests
Run source:heron_load/test_heron_query.py like:
{{{
python test_heron_query.py https://<test app server name>.kumc.edu/heron
}}}
All tests should pass.  Attach the resulting output to the validate and release ticket.

=== Production cut-over ===
Notify users and/or verify that nobody is currently logged in to HERON before proceeding.
- Run a query against production to find the users from the last two weeks.
- Note that some HERON users don't always check their KUMC e-mail accounts, such as some of our GPC collaborators from Iowa and Texas.
{{{
select distinct user_id||'@kumc.edu' from i2b2pm.pm_user_session where trunc(entry_date) >= trunc(sysdate-14);
}}}
- Send an email notice to users at least an hour before beginning the downtime.  In the past, the following format has been used:

  > The HERON/i2b2 application is scheduled for downtime today (Thursday, October 23, 2014). The system will be unavailable from 2:00 PM until 4:00 PM. During this downtime, we will be performing the monthly data update which will include clinical data through September, 2014.
  >
  > Thanks!
  >
  > Regards,

A. Point the production instance of i2b2 to the new data (using `heron-prod-deid-ds.xml` etc., as was done above for the test server).
 - Stop JBoss and Apache on the production application server:
{{{
/etc/init.d/jboss stop
/etc/init.d/apache2 stop
}}}
 - Copy the `heron-prod-deid-ds.xml` config file from the test server.
 - Update the passwords in the CRC application file as was done before on the test application server.
 - Start JBoss/Apache again:
{{{
/etc/init.d/jboss start
/etc/init.d/apache2 start
}}}
 - Run at least one query with a patient set and test at least the following plugins:
  - Timeline
  - Kaplan Meier
  - R Data Builder
    - Run the '''create_databuilder_db_user''' Jenkins job to create the databuilder user and run the appropriate grants.
    - You'll need to run **builder-extract-config** as well as **builder-deploy-jboss** on the middle tiers to tell them about the A1 vs. B2 flip/flop.
       - todo: make it a downstream job from one of the other cutover jobs?
          - where are those cutover jobs?

B. Publish the announcement/disclaimer:
  * Remove the `draft` keyword.
  * Add a record in the [https://redcap.kumc.edu/redcap_v4.1.0/setup.php?pid=398 HERON Disclaimers REDCap project] with current=1 and set current=0 on the previous disclaimer.

C. Migrate user history:
 * Stop JBoss/Apache as above.
 * Run the Jenkins DEID_PROD_Cutover job.
 * Restart JBoss/Apache as above.

D. Backup patient and encounter mappings and update previous patient sets. '''Note: it's very important that the build name has been updated before this step, since the backup step gets the build name from the `PROJECT_NAME` field noted above.''' A quick check is sketched after the parameter list below.
  '''Jenkins job update_qt_patient_num''':
   1. Parameter: backup_pat_enc_mappings
   2. Parameter: update_qt_patient_num
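
A quick way to confirm the build name before running the backup (the same hedged query as in the test-configuration step above):

{{{
#!sql
-- PROJECT_NAME should already show the new build name on BOTH ID and DEID.
select project_id, project_name from i2b2pm.pm_project_data;
}}}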

=== !Kill/Restart the periodic query monitor === #QueryMonitor
The periodic query monitor runs on the production application server. See also #1415 and the HERON_periodic_query Jenkins job.

- Before running the periodic query monitor, first kill any currently running instances; use the command below to find the PID to kill:
  {{{
  ps aux | grep test_heron_query_periodic.py
  }}}

- To run the periodic query monitor, use the command below (the first argument is the location of the bmidev/heron_load repository):
  {{{
  nohup python ~/bmi_ops/heron_monitor/test_heron_query_periodic.py
  }}}


=== !NightHeron === #NightHeron
Perform the same production user-history migration steps for ID HERON, but use the Jenkins job ID_PROD_Cutover.

=== Review and Close Milestone === #milestone-close

We agreed on November 21, 2013 that the process for closing HERON release milestones is for most or all of us to get together and discuss it, especially to:
 - Discuss any (major) open tickets that might not be in the release dependency graph.
 - Review schedule slips: what were the causes? Are there cost-effective ways to address those causes?

=== Weekly status meeting === #heron-weekly
Each week, the team meets to discuss progress, schedule risks, etc. and to prioritize remaining work for the current release.  The [query:status=!closed&keywords=~heron-weekly heron-weekly] keyword is used to add a ticket to the list of items to discuss.

=== Milestone Planning === #milestone-plan
We do a larger HeronProjectTimeline exercise every few months, but in between, we aim to have our release goals aligned with our major tickets. When these lists don't match, there are three choices:
  a. Promote the ticket to a team release goal.
  b. Postpone the ticket to a later release.
  c. Demote the ticket to minor priority; i.e. agree that it can be moved from milestone to milestone without discussion.

=== Coordinate with IR to apply any needed OS patches for the ID server === #id-server-patching
In our IR meeting (2015.01.06), we (Russ, Nathan, Bhargav, Brandon) agreed to e-mail Eric, Rob, John, and Jeff about when they can have a downtime to apply server patches to the ID server (aiming for the following Sunday).  This requires special effort since the machine uses FusionIO cards and any kernel update requires recompilation of the driver.  Nathan owes a link...