wiki:HeronLoadDev

Development Notes, Dependencies

The development process is as follows:

Each release, we aim to integrate new data into HERON. Sometimes this is a whole new source, such as the TumorRegistry, but most often it's a new part of the HospitalEpicSource.

See also HERON ETL presentation from HackathonOne

At a high level, such an enhancement involves:

  1. Use case development
  2. Analysis and Design
  3. Development
  4. QA

Story telling: Use cases and requirements

Analysis

  • Analysis: Browse the source data (using Oracle SQLDeveloper) and review HospitalEpicSource#clarity or other relevant documentation and sketch SQL queries sufficient to answer queries from the use cases.

Development

@@todo: consolidate with ETL Developmetn section below

QA

ETL Development

The production process is described above. It presumes some details inherited from our ETL approach of using python and paver to manage execution of SQL scripts. (See also #1 for exploration of less ad-hoc tools.)

To begin development, set yourself up for WritingQualityCode, including borrowing python community packages. Then get paver and cx_Oracle. (cx_Oracle has some platform-specific gotchas; see PythonOnWindows? or PythonVirtualEnvironment for linux). To test that these dependencies are installed correctly, start python and import paver and cx_Oracle:

$ python
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) 
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import paver
>>> import cx_Oracle

If you get an error message, see if it's covered in PythonOnWindows? or PythonVirtualEnvironment.

Then check out a copy of source:heron_load (in the bmidev repository) UsingVersionControl. The repository address is http://informatics.kumc.edu/changes/bmidev/ , as in:

$ pip install mercurial
...
$ hg clone http://informatics.kumc.edu/changes/bmidev/
$ cd bmidev/heron_load

We follow the python community norms for testing and documentation integrated into the code; for example heron_load: ETL for HERON is generated from the source code.

Running paver tasks

Remember that you need to activate your PythonVirtualEnvironment each time you start a new Windows command session:

C:\Documents and Settings\achoudhary>C:\Python-Dev\heron\Scripts\activate.bat
(heron) C:\Documents and Settings\achoudhary>

Make sure the current directory is your heron_load directory:

(heron) C:\Documents and Settings\achoudhary>cd "C:\Documents and Settings\achoudhary\Desktop\bmidev\heron_load"
(heron) C:\Documents and Settings\achoudhary\Desktop\bmidev\heron_load> paver help

The code follows the paver conventions:

To use Paver, you run paver <taskname> and the paver command will look for a pavement.py file in the current directory.

For example:

$ paver help

should list all of the available tasks.

Set up a heron-test.ini based on source:heron_load/heron-test.ini.example . (note new format #2221)

Note HERONsecuritynotes? has some notes about the test system.

Note emerging use of Jenkins CI tools.

Try this one to see that most things are working:

$ paver load_epic_dimensions

To run the main task of the ETL process:

$ paver heron_load

todo: document the failure modes and remedies;

File Header: purpose, copyright, license

IOU some docs; meanwhile, see #265

Note: python convention is evolving since then.

SQL Header: which database does this script run in?

e.g. bsr_i2b2_transform.sql around line 14:

-- Test that we're in the KUMC sid.
select MRN from BSR.bsr_table where 1=0;

Release Branches

By convention, we create a branch for the release in the source code repository (see UsingVersionControl for more information on our version control system). We name this branch after the release name (such as heron-marion for the "Marion" release).

There are a couple of reasons for creating a release branch:

  • Developers can continue to work on new features on the "default" branch (mainline branch) without bringing these changes in to the release in progress. Changes to the "release" branch are limited to bug fixes needed to get the release working.
    • After the release is complete, any bug fixes on the release branch are merged back in to the default branch.
  • Having a release branch makes it much easier to look back and determine exactly what version of code was used for a particular release.

Automated testing in SQL

Tests can be represented as queries that result in errors (for example division by zero errors) when the test condition fails.

See also #2781 re logging errors and continuing on.

For example, source:heron_load/epic_demographics_load.sql#L13 documents a run context assumption by way of an automated test:

13	/* We're wasting our time if an upload_status record isn't in place. */
14	select case count(*) when 1 then 1 else 1/0 end as test_result from (
15	select * from NightHeronData.upload_status up
16	where up.upload_id = :upload_id
17	);

Note the use of sub-select where the inner query is handy for interactive use.

A basic feature test for "after loading demographics data, we know the age of all patients" follows the insert at source:heron_load/epic_demographics_load.sql#L45 :

45	insert into NightHerondata.observation_fact(
46	  patient_num, encounter_num, ...
74	from observation_fact_demo@epic f ...
88	
89	/* We should know the age of almost all patients.
90	  TODO: find out why we have ~22K patient records with no birthdate.
91	*/
92	select case when coverage > (95/100) then 1 else 1/0 end test_result from (
93	  select (
94	     select count(*)
95	     from NightHerondata.observation_fact nof
96	     where nof.concept_cd like 'DEM|AGE:%' ) / (
97	     select count(*)
98	     from NightHerondata.patient_dimension npd) as coverage
99	     from dual
100	);

The pattern works for regression tests as well. Before fixing a bug with missing start_date in the observation_fact_demo view at source:heron_load/epic_i2b2_transform.sql#L175, we add a test case to confirm the fix:

175	create or replace view observation_fact_demo as select ...
194	
195	select case count(*) when 0 then 1 else 1/0 end as test_result from (
196	  -- test for missing start_date
197	  select * from observation_fact_demo where start_date is null
198	);
199

Breaking large transactions into chunks

todo: add link to our notes on the problem of large inserts filling up log/temp space.

Any transaction that mentions the :part bind variable gets run multiple times, with :part bound to 1, 2, 3, ... 10 (see run_script around source:heron_load/db_util.py#L380 ). For example:

source:heron_load/epic_dimensions_load.sql#L62

62	insert into NightHeronData.patient_mapping ...
67	(select ...
83	       from NightHeronData.upload_status up,
84	         patient_dimension@epic pd
85	         left join epic_test_patients etp
86	           on pd.patient_ide = etp.pat_id
87	       where up.upload_id = :upload_id
88	        and pd.part = :part
89	      );

The mod() function seems to be a particularly handy way to split data into chunks; for example, from source:heron_load/epic_i2b2_transform.sql#L30:

30	create or replace view patient_dimension
31	as Select
32	  p.pat_id PATIENT_IDE,
33	  mod(ora_hash(p.pat_id), &&heron_etl_chunks)+1 as part, ...

Note the use of ora_hash() to make a number out of a string.

NTILE() was previously recommended, but it is overly complex and performs poorly.

todo: look into using partitioned tables and parallel DML instead of this chunking approach.

Operations, Administration, Monitoring

See GroupOnly/Operations for some operations/administration/monitoring details.

Old stuff to clean up

Our production HeronLoad ETL process was originally based on discussion in HeronProjectTimeline#Follow-upMarch18; it has evolved since then.

ETL Architectural Changes

Last modified 2 years ago Last modified on 06/12/15 10:25:07