Opened 6 years ago

Closed 4 years ago

Last modified 4 years ago

#554 closed design-issue (fixed)

Performance issue in the IDX load

Reported by: achoudhary Owned by: mhoag
Priority: major Milestone: heron-milford-update
Component: data-repository Keywords: performance, single-sid, public-web
Cc: dconnolly, ngraham, bhamlin Blocked By: 513, 1385
Blocking: Sensitive: no

Description

Monthly load process is taking longer than expected. We will investigate and see if we can add index to the IDX table to make it faster

Change History (20)

comment:1 Changed 6 years ago by dconnolly

  • Keywords performance added

comment:2 Changed 5 years ago by dconnolly

  • Type changed from task to design-issue

Indeed, we're getting just

  • 52K rows per minute inserting from observation_fact_dx@idx vs
  • 1.5M rows/min inserting from observation_fact_demo_vital@epic.

Neither the disk nor the CPU on the machine is maxed out. I'm not sure
what is the rate-limiting-factor.

2012-05-11 06:09:48.854904: id_db: [identified]FB on localhost
insert into NightHerondata.observation_fact
...
from observation_fact_dx@idx f
2012-05-11 07:05:14.433299 end. duration: 0:55:25.578395 rows: 2878647

vs

2012-05-09 15:09:34.338939: id_db: [identified]FB on localhost
insert into NightHerondata.observation_fact
...
from observation_fact_demo_vital@epic f
2012-05-09 15:11:57.670482 end. duration: 0:02:23.331543 rows: 3555253

comment:3 Changed 5 years ago by achoudhary

  • Owner changed from achoudhary to mnair
  • Status changed from new to assigned

Mani:

Merge this ticket with one you have created to fine tune the ETL process

comment:4 Changed 5 years ago by achoudhary

After we made changes [512d9a773d69] of using patient_mapping table instead of view in the IDX load...the idx_clinical_facts_load.sql scripts takes over 9 hours to finish.

comment:5 Changed 5 years ago by dconnolly

Regarding the meeting agenda from ticket:1118#comment:22, I collected some stuff to look at:

Issues in HERON ETL

I didn't touch on "order of load"; I can imagine there are issues, but I'm not clear on what they are.

comment:6 Changed 5 years ago by dconnolly

  • Cc ngraham bhamlin added

a note from our June 18 meeting:

Mani pointed out that going across a database link to get to the patient mapping
table is likely to have terrible performance. The index on the patient mapping table
probably doesn't get used. (And normal explain plan tools don't work across database links.)

comment:7 Changed 5 years ago by dconnolly

  • Milestone changed from HERONv.Next to ea-dutch-guilder
  • Owner changed from mnair to mhoag

comment:8 Changed 5 years ago by dconnolly

  • Milestone changed from ea-dutch-guilder to heron-council-grove-update

comment:9 Changed 5 years ago by dconnolly

  • Blocked By set to 1385
  • Keywords heron-sample added
  • Milestone changed from heron-council-grove-update to heron-neosho-update

I think several weeks ago we talked about using the 10% sample to facilitate work on this.

comment:10 Changed 4 years ago by mhoag

  • Milestone changed from heron-smoky-hill-update to heron-sappa-update

comment:11 Changed 4 years ago by mhoag

  • Milestone changed from heron-sappa-update to heron-big-blue-update

Since the 10% work has not been completed and is in the critical stage of being pushed out the door, I don't think that working on this issue for the Sappa Release would be a good idea. Going to push this to big blue to think about it for the next release.

comment:12 Changed 4 years ago by mhoag

  • Blocked By changed from 1385 to 513, 1385
  • Milestone changed from heron-big-blue-update to heron-crooked-creek-update

The architecture redesign (#513) will likely alleviate this problem since the need to join over links will be removed comment:2 as a part of that design. This should be looked at after the redesign has been completed.

comment:13 Changed 4 years ago by mhoag

  • Milestone changed from heron-crooked-creek-update to heron-chikaskia-update

Moving to chikaskia along with #513

comment:14 Changed 4 years ago by mhoag

  • Keywords single-sid added; heron-sample removed
  • Milestone changed from heron-chikaskia-update to heron-solomon-update

Moving to solomon along with #513. Like #513 this will be waiting on the new hardware.

comment:15 Changed 4 years ago by dconnolly

Ouch... 14 hours in Chikaskia (#2273)

2013-09-16 17:38:20 for 14:39:01.699365   IDX_clinical_facts
2013-09-16 17:38:20 for 0:00:00.051242       $source_cd_lookup
2013-09-16 17:38:20 for 14:13:39.216927       idx_clinical_facts_load.sql
2013-09-17 07:52:00 for 0:00:00.003158       $ensure_index:observation_fact
2013-09-17 07:52:00 for 0:25:22.391949       i2b2_facts_deid.sql

comment:16 Changed 4 years ago by mhoag

  • Milestone changed from heron-solomon-update to heron-milford-update

comment:17 Changed 4 years ago by mhoag

Comparisons on the sample data with standard ETL vs single-sid ETL look promising:

Standard ETL

...
2013-10-11 23:26:10 for 0:15:31.948448   IDX_clinical_facts
2013-10-11 23:26:10 for 0:00:00.036290       $source_cd_lookup
2013-10-11 23:26:11 for 0:15:13.099079       idx_clinical_facts_load.sql
2013-10-11 23:41:24 for 0:00:00.002950       $ensure_index:observation_fact
2013-10-11 23:41:24 for 0:00:18.775095       i2b2_facts_deid.sql
...

Single-sid ETL

...
2013-10-11 08:45:53 for 0:00:41.690605   IDX_clinical_facts
2013-10-11 08:45:53 for 0:00:00.025748       $source_cd_lookup
2013-10-11 08:45:53 for 0:00:25.144181       idx_clinical_facts_load.sql
2013-10-11 08:46:18 for 0:00:00.003369       $ensure_index:observation_fact
2013-10-11 08:46:18 for 0:00:16.482637       i2b2_facts_deid.sql

comment:18 Changed 4 years ago by mhoag

IDX clinical facts have been loaded on both single-sid and standard ETL on Milford:

Standard ETL

...
2013-11-08 20:05:52     for     14:52:17.490321 1         load_idx_clinical_facts
2013-11-08 20:05:52     for     0:00:00.025185  2           is transform loaded? (no file)
2013-11-08 20:05:52     for     14:52:17.464096 2           IDX_clinical_facts
2013-11-08 20:05:52     for     0:00:00.033116  4               $source_cd_lookup
2013-11-08 20:05:52     for     14:25:34.424181 4               idx_clinical_facts_load.sql
2013-11-09 10:31:27     for     0:00:00.004529  4               $ensure_index:observation_fact
2013-11-09 10:31:27     for     0:26:42.961776  4               i2b2_facts_deid.sql
...

Single-sid ETL

...
2013-11-13 10:54:35     for     0:30:36.911448  1         load_idx_clinical_facts
2013-11-13 10:54:35     for     0:00:00.043071  2           is transform loaded? (no file)
2013-11-13 10:54:35     for     0:30:36.867148  2           IDX_clinical_facts
2013-11-13 10:54:35     for     0:00:00.044243  4               $source_cd_lookup
2013-11-13 10:54:35     for     0:18:59.074159  4               idx_clinical_facts_load.sql
2013-11-13 11:13:34     for     0:00:00.047309  4               load_zip_data
2013-11-13 11:13:34     for     0:00:00.046248  5                 curated_data_zips_near_66160_csv
2013-11-13 11:13:34     for     0:00:00.092021  4               load_school_districts
2013-11-13 11:13:34     for     0:00:00.091279  5                 curated_data_school_districts_csv
2013-11-13 11:13:34     for     0:00:00.004034  4               $ensure_index:observation_fact
2013-11-13 11:13:34     for     0:11:37.547665  4               i2b2_facts_deid.sql
...

The load time went from 15 hours (Standard ETL) to 30 minutes(Single-sid ETL). Closing since loading IDX clinical facts now takes a reasonable amount of time.

comment:19 Changed 4 years ago by mhoag

  • Resolution set to fixed
  • Status changed from assigned to closed

comment:20 Changed 4 years ago by mhoag

  • Keywords public-web added
Note: See TracTickets for help on using tickets.