Opened 9 years ago

Closed 9 years ago

Last modified 7 years ago

#834 closed enhancement (fixed)

query by MSDRG, length of stay, service line from UHC 1-year snapshot

Reported by: rwaitman Owned by: achoudhary
Priority: major Milestone: heron-eldorado-update
Component: data-repository Keywords: public-web
Cc: tmcmahon, achoudhary, badagarla, mnair, ngraham Blocked By:
Blocking: #907 Sensitive: no

Description

Arvinder and Tamara to take the lead but with Bhargav and Dan looped in.

Dan to review his notes

One query might be
prevalence of a MSDRG

82|31|2006
Traumatic stupor & coma, coma >1 hr w MCC|Traumatic stupor & coma, coma >1 hr w MCC|17|1|medical

joined with our EMR data such as flowsheet documentation of ventilator settings, o2sat, resp rate, and death

Change History (30)

comment:1 Changed 9 years ago by rwaitman

Milestone: heron-clinton-updateheron-eldorado-update

comment:2 Changed 9 years ago by dconnolly

Owner: changed from dconnolly to achoudhary
Status: newassigned

The way the ETL code is currently designed, The first goal is to come
up with one or more queries that transform the source data to look like
the i2b2 observation_fact table (see figure 4 in
the i2b2 paper).

For example, create or replace view observation_fact_bsr as ...
at source:heron_load/bsr_i2b2_transform.sql#L107

In most cases, this involves a visit transformation too to get an encounter_ide.
(see bsr_visit etc. above). As long as the source data has MRNs,
you can just use the MRN as the patient_ide and take advantage of the existing
code to map MRNs.

After those queries are captured as views in the source db, separate scripts
run in the identified db to load the data; in the case of BSR:

Another example is the tumor registry; the transformation and load scripts are:


I thought we had a similar situation a while ago where parallelized
development of ETL work, but reviewing those tickets (#399, #400)
shows the situation was a little different and there isn't much in
the way of notes anyway.

comment:3 Changed 9 years ago by dconnolly

Sensitive: unset

comment:4 Changed 9 years ago by dconnolly

In the informatics meeting yesterday, Mani mentioned doing some exploration into the data.
Russ said that he wants to take a "rip the whole thing" approach, much like we did for NAACCR (ticket:547#comment:34).

comment:5 Changed 9 years ago by rwaitman

Cc: mnair added

comment:6 Changed 9 years ago by rwaitman

Owner: changed from achoudhary to mnair

comment:7 Changed 9 years ago by dconnolly

Test data helps immensely in developing ETL code. (see #910) So we'll want the same
DB structure in the test DB.

If the staging of the UHC is separated between

  1. building tables
  2. loading public metadata (i.e. lookup tables etc.)
  3. loading the PHI

then we can re-use 1 and 2 in the test environment, and we only
need to manually mock up data for part 3.

comment:8 Changed 9 years ago by dconnolly

Arvinder, Russ, Dan, Tamara, and Mani met today; we put some notes in UHCSource

comment:9 Changed 9 years ago by dconnolly

Type: enhancementtask

At some point, we should phrase this or some other ticket as an end-user enhancement, but for now, it's pretty clearly a task.

comment:10 Changed 9 years ago by dconnolly

Mani, it sounds like you're making progress that's worth checking in. Bummer there are so
many meetings today; here's hoping we can chat about this soon.

comment:11 Changed 9 years ago by dconnolly

Cc: ngraham added

In order to get Mani, Tamara, and Nathan going on ETL, I cleaned up HeronLoad#dev-notes.
Bonus points to the first one of you who gets paver heron_load to run in your dev environment!
I'll look for chances to help you out with it.

comment:12 Changed 9 years ago by dconnolly

I added paver tasks for make_uhc_views and load_uhc_clinical_facts [1b71785cbc84],
though I haven't tested them. Test data for UHC (#910) would be ideal, but I suppose
production data and "B" instance would work for testing.

p.s. This morning, Tamara collected the bonus points for getting through HeronLoad#dev-notes
and getting paver heron_load to run.

comment:13 Changed 9 years ago by mnair

Transform and load scripts are completed and checked in. Following four concepts are loaded in this eldorado release

'UHC|Gender:'

'UHC|ServiceLine:'

'UHC|BaseMSDRG:'

'UHC|LOS:'

comment:14 Changed 9 years ago by dconnolly

Blocking: 907 added

(In #907) Current plan is to include UHC (#834) in this release, so don't start the HeronLoad ETL process until it's ready.

comment:15 Changed 9 years ago by dconnolly

Blocked By: 910, 916 added
Type: taskenhancement

I suggest:

  1. in parallel
    • build test data (#910) and ETL it to bmid using paver heron_load
    • write code to build concepts (#916) and run the relevant paver tasks
  2. point non-prod application server to bmid using paver configure_i2b2_datasources and such
  3. Check that queries work, and provided they do, close this ticket as fixed.
  4. proceed with production ETL (#907)
  5. As part of validation (#906), check the use cases in the description of this ticket to see that the results make sense from a clinical point of view
    • for bonus points, check the results of the i2b2 queries against the source staged data using SQL
      • for double-bonus points, develop an automated test wrapper so that we can run that test again next month with the push of a button

As a fallback, we could test with production data using BHeronFB instead of bmid.

comment:16 Changed 9 years ago by dconnolly

re Nathan's patch_for_dan_review_20120314.txt

UHC data has default values in the code as well, but put it here too...

yeah, I haven't made up my mind about the best way to keep those source codes DRY.

How about using datetime.now() in the py code if the dump file is set to None?

That would mean that the date that goes into the database no longer reliably represents the
date of the bulk transfer. If we don't need to keep that audit info, we should just get rid
of the code that handles it altogether.

It's probably worth catching the exception and producing a comprehensible diagnostic, though.

comment:17 Changed 9 years ago by dconnolly

Owner: changed from mnair to achoudhary

Arvinder, it looks like the next step is to point the non-prod application server to bmid using paver configure_i2b2_datasources and such.

I think you and I are the only ones who have experience with that. Go ahead and set it up, OK?
Bonus points for explaining it to Nathan, Mani, etc. while you're at it.

comment:18 Changed 9 years ago by dconnolly

Blocked By: 910 removed

(In #910) OK, then it's no longer blocking #834.

comment:19 Changed 9 years ago by dconnolly

I got a query to work, but it's not finding any data.

Do the MRNs in the UHC test data match the Epic/Clarity? MRNs?

  • Finished Query: "Gender@13:05:44"
    Number of patients for "Gender@13:05:44"

ah... the problem wasn't MRNs but propagating UHC concepts to the concept_dimension.

win!

  • Finished Query: "Gender@13:11:14"
    [13.6 secs]
    Compute Time: 11 secs
    Number of patients for "Gender@13:11:14"
    patient_count: 13

comment:20 Changed 9 years ago by dconnolly

Blocked By: 916 removed

comment:21 Changed 9 years ago by dconnolly

Resolution: fixed
Status: assignedclosed

I think Mani meant to close this ticket: ticket:916#comment:13

comment:22 Changed 9 years ago by dconnolly

Resolution: fixed
Status: closedreopened

oops! duplicate key in the facts. ticket:907#comment:11

Time to add a test case etc. ...

comment:23 Changed 9 years ago by dconnolly

Resolution: fixed
Status: reopenedclosed

it was easy enough to use recordid to ensure uniqueness. [d59a708bcedf]

comment:24 Changed 9 years ago by dconnolly

After a couple tweaks[18b0096cc635], query for length of stay greater than 2 days works!

  • Finished Query: "Hospital LOS [2@17:08:12" [4.1 secs]
    Compute Time: 2 secs
    Number of patients for "Hospital LOS [2@17:08:12"
    patient_count: 12382

comment:25 Changed 9 years ago by ngraham

From the ticket description, it looks like we need to query the following for verification:

  • 'UHC|Gender:'
  • 'UHC|ServiceLine:'
  • 'UHC|BaseMSDRG:'
  • 'UHC|LOS:'

For verification, it looks like Dan has validated some of this using the non-prod application server already.

comment:26 Changed 9 years ago by ngraham

Results noted below:

Gender:
Number of patients for "Gender [27,197 @16:05:32"
patient_count: 19735

ServiceLine:
Number of patients for "Service Line [2@16:06:32"
patient_count: 19735

Base MS DRG:
Number of patients for "Base MS DRG [27@16:08:31"
patient_count: 19735

LOS:
Number of patients for "Length of Stay @16:04:00"
patient_count: 19735

comment:27 Changed 9 years ago by dconnolly

Querying by Gender, LOS, etc. independently is one thing, but the use cases in the
ticket description and in UHCSource are about correlating UHC data with data from HospitalEpicSource, SSA, etc.

We did a certain amount of that in Russ's office just now, ad-hoc.
Here's hoping we manage to automate more of it in due course.

comment:28 Changed 9 years ago by dconnolly

comment:29 Changed 7 years ago by dconnolly

Keywords: public-web added

comment:30 Changed 7 years ago by kcrane2

Cleared by Infosec

Note: See TracTickets for help on using tickets.