Opened 3 years ago

Closed 2 years ago

Last modified 2 years ago

#3224 closed enhancement (fixed)

Bring NCDR(v4) data into HERON - Phase I

Reported by: rwaitman Owned by: schandaka
Priority: major Milestone: heron-riverwalk-update
Component: data-repository Keywords: NCDR public-web
Cc: rwaitman, dconnolly, tmcmahon, bzschoche, ssuman, ngraham Blocked By: 3479, 3605, 3661
Blocking: 3563, 3676 Sensitive: no

Description

Sravani reporting to Russ.
Load data from NCDR CathPCI Registry v4.4
(Diagnostic Catheterization and Percutaneous Coronary Intervention Registry)
https://www.ncdr.com/WebNCDR/cathpci/home/datacollection
Bring the below NCDR data into HERON.

  • Demographics
  • PCI Procedure
  • Diagnostic Cath
  • History and Risk Factors
  • Coronary Anatomy

Change History (85)

comment:1 Changed 3 years ago by schandaka

  • Component changed from dev-framework to data-repository

comment:2 Changed 3 years ago by bzschoche

From Russ:
Terry Rusconi suggested reaching out to Ann Powell as NCDR contact to provide best leads from her team and IT to attend meeting with SC and RW to sort out how to incorporate the NCDR data into HERON.

It may resemble the tumor registry where we pick off the uploaded data from their PATS system as opposed to UHC where we pull down cleaned files from UHC after we upload.

comment:3 Changed 3 years ago by schandaka

Ann supervises the abstractors
Bilal is joining the team in the IT role
Both report to Chris

Get registries
CathPCI
ACTION
ICD
STS TVT - check with the surgeons (Kamal and Chris to check and verify)

start on the ETL with CathPCI but grab the whole set.

  1. What’s the process versus Tumor Registry, Cystic Fibrosis, and UHC

Is it like the tumor registry in that it’s contained in a local software system that then dumps files that are uploaded.

They use PATS software.
It then gets uploaded to NCDR.

  • They update it to NCDR quarterly.

Gets uploaded and then they get a report back of what needs to be cleaned up back at KUMC
They then fix the data in PATS and re-upload again.

  • The process of uploading, correcting, and then doing the final upload is managed by Ann and Chris’ team.

Medical Informatics would then get the data after its cleaned up.

Goal: see if we can get the file that gets uploaded to NCDR so we are picking off a standard format even if there are different vendors of NCDR software besides PATS

Sravani to walk over with Ann and Bilal to look at PATS and the file formats that can be dumped.

  1. Initial update

Get dump as far back as possible.

How far back does the data go for each registry of NCDR? Has been used since 2002 and everything we’ve reported has come through PATS,

PATS has everything we’ve abstracted. It goes back more than 10 years.

There could be older cases but may all be in PATS.

  • check and see
  1. How fresh is it?
  • it lags by roughly 4 to 5 months.
  1. Going forward update approach

two basic options:

  • Since the data is small, ideally re-dump full set every month like we do tumor registry from Tim Metcalf that gets transferred to our secure staging server. Or email us the file.
  • or, grant EA/MI access to pull down each month like we do with Epic.

Who are the key customers:

  • Kamal Gupta, James Vacek, DJ Lakkraddy, Tadros.
  • Sravani and Tamara work with Kamal and John Chen to make sure the data is retrievable in a nice format

We will have KU then set the pace nationally for cardiology outcomes research similar to what KU has done for leading work to integrate tumor registries

  1. Sravani will get the file securely from Bilal and then stage it for loading into HERON

Secondary issues:

  1. Chris sees need as well for smart phrases for notes. Issues around fields not built discretely
  1. Short term need for Dr. Gupta with Dr. Chen:
  • need a definition of what they require and submit a HERON request;
  • we can fulfill even if we don’t yet have the data loaded in HERON. They need to scope the population to what they require for the request.

comment:4 Changed 3 years ago by schandaka

Meeting notes from 03/02/2015 (RW,SC)

Liz Drew(Cardio Vascular Data Coordinator) and Bilal Muhammad sent sample files to Sravani.

Rough NCDR ontology
Use the headers in their dictionary
http://cvquality.acc.org/~/media/QII/NCDR/Data%20Collection%20Forms/CathPCI%20Registry_DataDictionary.ashx

Demographics is A
Episode of care is B

Past history (medical dx, surgical history, prior cath)
Encounter where the Cath happened

  • Urgent, emergent
  • Date of cath

Cath Details D

Diagnostic Cath E

  • Has details about who did the cath

o Only store things like their NPI number on night heron

But de-id it to blue heron

Coronary anatomy F

PCI Procedure G

Lesion and Devices H

Labs I

  • Bonus points to map to LOINC

Intra and post procedural events J

Discharge K

Medications L

  • May need to ask how to translate from MedID for find the meds.

Be wary of the XML file getting split into multiple rows for the PCI procedure cases.

Also for the diagnostic caths; may be getting split into 2 or 3 rows

comment:5 Changed 3 years ago by bzschoche

Russ had a meeting this morning and based on that discussion, we can proceed with this.

Sravani: please follow up with Chris and Kamal about getting the files.

comment:6 Changed 3 years ago by schandaka

Brittany,
I have reached out to Bilal Muhammad(Data Architect). Looks like he wants to sit with Chris Wittkopp before he can start working again on this. I will keep you posted.

comment:7 Changed 3 years ago by schandaka

Brittany, I am meeting with Bilal on 04/20. When he gave the Cath/PCI file(sample file) the last time, Russ and I noticed that a lot of columns have external references to some mapping tables which the CardioVascular? folks were not sure how to obtain. So I will discuss it in the upcoming meeting as they are important for the ontology design.

comment:8 Changed 3 years ago by schandaka

I met with Bilal this morning. I should have the files by today.

[4/20/2015 12:28 PM] Bilal Muhammad: 
Liz sent the email
[4/20/2015 12:29 PM] Bilal Muhammad: 
Told CHris and she said that she will sign off as soon as she is done with the meeting
[4/20/2015 12:29 PM] Bilal Muhammad: 
so should be sending you the files shortly

comment:9 Changed 3 years ago by ngraham

  • Blocking set to 3469

comment:10 Changed 3 years ago by schandaka

  • Keywords heron-weekly added

comment:11 Changed 3 years ago by dconnolly

What is it you'd like to discuss at heron-weekly, Sravani?

comment:12 Changed 3 years ago by schandaka

I have looked at the NCDR files that Liz Drew sent me. There are three different versions of NCDR files.

  • PATS V2 04/2002 - 11/2004
  • PATS V3 12/2004 - 6/2009
  • PATS V4 7/2009 - PRESENT

Dr. Waitman and I looked at the latest versions Data Dictionary http://cvquality.acc.org/~/media/QII/NCDR/Data%20Collection%20Forms/CathPCI%20Registry_DataDictionary.ashx. We thought that this hierarchy
( A. Demographics, B. Episode of Care, C. History and Risk Factors etc.,) might be a good place to start the work. However, there are certain data consistency issues with the files the way they are now. The column names in the older versions are different than every older version.

For example,Prior MI in the new version is Previous MI in the older versions. I also noticed that some of the data collection elements are represented differently. One example for that is, in the new version Currently on Dialysis accepts Yes/No?. In the older versions, Renal failure accepts Yes/No/Yes?-Yes Dialysis/No?-Yes Dialysis etc.,

Every version has a lot of new data collection elements(new columns) added.

And also, I am not quite sure what all data from the NCDR data registry that Dr.Gupta wants to see in HERON. I am not sure about the usecase. I was thinking of reaching out to him to see if there is a resident who has the bandwidth to work with me to answer my initial questions. I would like to run it by Russ to see what he thinks about it.

I am afraid this might not go in heron-quivira.

comment:13 Changed 3 years ago by schandaka

Russ and myself are meeting on 05/07 to discuss this.

comment:14 Changed 3 years ago by rwaitman

For the first release, let's focus on loading the PATS V4. Then deliver to Dr. Gupta and get the team's feedback. Then visit with them the prioritization to incorporate V3 and have a follow up discussion (Sravani and Russ) to discuss representation similar to the Tumor Registry.

comment:15 Changed 3 years ago by schandaka

NCDR files given by the hospital need a lot of clean up. Oracle is not liking the column names the way they are.(special characters in the column names like NCDR Patient ID (DDS) {2040}, VAPatient?)Brandon helped me with a python code to add quotes around the columns.That didn't work either. I discussed this with Nathan and reached out to Bilal asking him to clean up the file.Bilal is working on it.Hoping to get the files back by tomorrow.

comment:16 Changed 3 years ago by dconnolly

  • Blocking changed from 3469 to 3468, 3469
  • Type changed from task to enhancement

heron-weekly:
SC would like 'till 19May for dev (provided staging completes by tomorrow)
RW endorses slipping the release a bit for this

if we don't get the file by Fri, punt to next release

comment:17 Changed 3 years ago by dconnolly

  • Keywords heron-weekly removed

comment:18 Changed 3 years ago by ngraham

  • Blocked By set to 3479

comment:19 Changed 3 years ago by dconnolly

  • Blocking changed from 3468, 3469 to 3469

comment:20 Changed 3 years ago by schandaka

I haven't heard from Bilal after our last interaction on 05/14(Thursday morning). His outlook status showed me that he was out of office both Thursday and Friday last week. I am inclined to move this to the next release unless Dr.Waitman thinks otherwise.

comment:21 Changed 3 years ago by schandaka

  • Blocked By 3479 deleted
  • Blocking 3469 deleted
  • Milestone changed from heron-quivira-update to heron-jamestown-update

Russ dropped by. He concurs with comment:16. Moving NCDR work to the next release.

comment:22 Changed 3 years ago by dconnolly

  • Blocked By set to 3479

I suspect dropping the dependency on #3479 was a typo.

comment:23 Changed 2 years ago by schandaka

  • Keywords heron-weekly added

comment:24 Changed 2 years ago by dconnolly

What is it you'd like to discuss at heron-weekly, Sravani?

comment:25 Changed 2 years ago by schandaka

I haven't really made any progress with Chris Witktkopp's team regarding the NCDR files as Bilal is still on sick leave. Just added heron-weekly to update Russ about it.

comment:26 Changed 2 years ago by ngraham

  • Keywords heron-weekly removed

comment:27 Changed 2 years ago by dconnolly

  • Milestone changed from heron-jamestown-update to dconnolly

Milestone renamed

comment:28 Changed 2 years ago by dconnolly

  • Milestone changed from dconnolly to heron-jamestown-update

Milestone renamed

comment:29 Changed 2 years ago by ngraham

  • Blocking set to 3536

comment:30 Changed 2 years ago by schandaka

Bilal sent me only 2014 data as he was encountering some file corruption errors for the prior year data files.I was able to successfully stage 2014 data in NHERON14. Hoping to receive the rest of the files by today as Bilal has been working on it.He is also made aware of our ETL kick off deadline. I will work with suman on staging once I hear back from him. For now, I have what I need to proceed ahead with the development.For the hierarchy, I am following Dr.Waitman's instructions from comment:4.

comment:31 Changed 2 years ago by ssuman

  • Cc ssuman added

comment:32 Changed 2 years ago by schandaka

NCDR work is more complicated than I anticipated. NCDR(csv) file has 314 columns. Based on the comment:4, these are the below folders into which these columns will be categorized into.

  • Demographics
  • Episode of Care
  • Cath Lab Visit
  • Labs
  • Diagnostic Cath
  • History and Risk Factors
  • Coronary Anatomy
  • PCI Procedure
  • Lesion and Devices
  • Intra and Post Procedure Events
  • Discharge
  • Medications

Every folder has a lot of columns that would go into it, For example Cath Lab Visit folder has close to 40 columns that would go into it. Most of them are variables that document the Cath/PCI procedure. Coding all this requires more development time than just 2 days.

Also, looks like I need inputs from Dr.Gupta/ Dr.Waitman to narrow down the list of the items that they want to see in HERON as there are different flavors of data available.

I spoke to Matt Hoag earlier this morning. Tackling Demographics piece is a good place to start which is what I did.I am planning on tackling Cath Lab Visit next as that will have a lot of the actual PCI procedure information. I have also reached out to Dr.Gupta asking him about his preferences(what he would like to see first) I think it will be a good idea to break this ticket down into multiple tickets. I will do that once I hear back from him.

I don't see this going in heron-jamestown. Aiming this for heron-calhoun.

comment:33 Changed 2 years ago by schandaka

  • Blocking 3536 deleted

comment:34 Changed 2 years ago by bzschoche

  • Keywords heron-weekly added
  • Milestone changed from heron-jamestown-update to heron-calhoun-update

Moving to the next release.

Adding heron-weekly so we can discuss with Russ if he attends on 6/17, and to determine other tickets competing for the team's time.

comment:35 Changed 2 years ago by schandaka

Dr.Gupta is unavailable as he is out of the country.I spoke to Dr.Waitman this morning regarding the prioritization of the NCDR items. He concurs with the idea that bringing the below items in the first go is good.

  • Demographics
  • Cath Lab Visit
  • Diagnostic Cath
  • History and Risk Factors
  • Coronary Anatomy

I will break this ticket down into two. The above mentioned items will be tracked as part #3224 and the remaining items will be tracked as part of #3563.

comment:36 Changed 2 years ago by dconnolly

  • Keywords heron-weekly removed

Looks like we got the scope info we wanted from Russ.

I'd still like to see it written in use case form, but Russ I suppose Sravani can do that without Russ.

comment:37 Changed 2 years ago by ngraham

  • Blocking set to 3592

comment:38 Changed 2 years ago by schandaka

This week, I made some progress on NCDR work. Hopefully, I will start committing code from this afternoon.

comment:39 Changed 2 years ago by bzschoche

Hi, Sravani, checking in on status before HERON-Weekly tomorrow. Is Calhoun still realistic?

comment:40 Changed 2 years ago by schandaka

Hi Brittany,
Yes. It is realistic for heron-calhoun.

comment:41 Changed 2 years ago by schandaka

  • Owner changed from schandaka to mhoag
  • Status changed from new to assigned

Matt,
As per our conversation earlier, I tried to split the ncdr_facts_load.sql into multiple sqls(transform,visit_dimension and facts_load).I tried to manipulate the python script(kumc_etl.py). I was unsuccessful with that approach. As I am remotely working tomorrow and not to put the ETL kick off at risk,I opted to revert back to the changes where I had a successful build.

I will be working on the remaining portion of the NCDR load for heron-riverwalk and I will certainly make sure to incorporate that if I cannot get to it tomorrow.Here is the changeset for code review (cc0646fe2006).

FYI,I have also spoken to Nathan(during the fishing trip) regarding parsing the NCDR data dictionary. Nathan mentioned that he could help me(if his priorities permit) parse the NCDR data dictionary file(PDF file)and create a hierarchy off that.So for the heron-riverwalk the manual code inserts that I did to obtain the hierarchy like the folder and prefix will no longer be necessary if the pdf parsing works.

comment:42 Changed 2 years ago by schandaka

comment:43 follow-ups: Changed 2 years ago by mhoag

Code Review

First off the critical issues (if the issue is addressed in some it is in "(" ")" ):

  1. (probably not an issue) Nothing is added to the visit dimension w.r.t NCDR encounters. Queries might work, but I have no idea what kind of errors can occur downstream (i2b2, plugins, databuilder) if we are missing part of out visit dimension.
  2. (addressed in 23d3e87115e5, c9fc77b31927) No facts will be uploaded because the join with the patient_mapping table is wrong in ncdr_facts_load.sqlL744. Specifically it is joining with the Epic@ source to get MRNs when it should be SMS@.
  3. (missed the insert over a link) kumc_etl.py loads concepts into DEID not ID so none of the NCDR concepts will show up in i2b2. Notably, ncdr_concepts_load.sql is run in ID instead of DEID.
  4. ncdr_concepts_load.sql is coupled with ncdr_facts_load.sql in that the former requires a view from the later. This is generally a very bad idea since we should try to construct concepts independent of facts.
  5. There is no documentation in GroupOnly/BulkTransfer on how to stage NCDR. Thus, we have no strong convention for referencing staged NCDR data during ETL. On an equally important note, we should not have to change schema names to run a test build (34dcfa3d7ad0 -> 44b9601c8a67).
  6. It seems like test ETL is being run with the totality of production NCDR. This is a very bad idea for a number of reasons, but the most important being that we should always avoid putting data we know to be real patient data into the test system. I realize that rich data is key to effectively making the concept dimension, which is why we should make rich test data for NCDR.
  7. From the code, Concept leafs appear to be using internal codes as the leaf names. These codes are probably of no practical use to the researcher, which again gives credence to parsing the PDF for the code mapping for each column.

Feature Checklist

Item Ready? Detail
Use Case Clear? Yes ticket description has some, also some in various comments
Automated Feature Tests? No We definitely need an automated test query
Usable by Peer Developers? No No clear instructions on how to stage, or what to do if you want to run ETL but not NCDR load.
Code Secure? Yes Uses existing ETL framework and does not leverage a link from DEID to ID
Performance OK? Not Checked Be nice to see some performance notes from Sravani as she was doing investigation. MAybe some explain plans that look good.
Deployment OK? No No instructions in GroupOnly/BulkTransfer on how to stage NCDR
Copyright and Acknowledgements? No See by file review
Happy to Maintain? No Lots of code duplication, few notes on what we are ETLing, and no quality/invariant checks on the NCDR data. Also numerous items in the By File review.

By File Review

ncdr_age_terms.csv

This looks like a copy and paste of source:heron_load/curated_data/age_terms.csv and I am not comfortable maintaining that kind of code duplication. Seems like it would be easy enough to add an extra column (NCDR_CONCEPT_CD) to age_terms.csv. Though I may go a step further and remove DEM| from the items in the concept_cd column and prepend DEM| and NCDR| to epic demographics and NCDR demographics, respectively. Worth a small design discussion at the very least.

source_master.csv

Is KU data sent to NCDR and then NCDR sends it back? If that is the case then @kumed is appropriate. Just wanted to make sure.

epic_etl.py

277: Seems to have an erroneous [tab] inserted on the line.

heron_build.py

1044: This (load_ncdr_age_terms) is an unnecessary dependency here since load_ncdr_concepts already has load_ncdr_age_terms as a dependency. Additionally, the task load_ncdr_age_terms may disappear entirely depending on how you address my other code review comments.

kumc_etl.py

202-209: Things as they are, what in load_ncdr_facts requires the age terms curated data?

ncdr_concepts_load.sql

1: KUMC and Epic are not properly cited
9-87: Another copy and paste job from source:heron_load/dem_concepts_load.sql. Immediate observation is refactor so they can use most of the same code, but I question that we even want that complicated age hierarchy. I t seems to me that a single age concept with metadata to deal with the age as an nval_num is more appropriate.
116: What is the point of excluding on prefix here when we are just at the folder level?
116, 125, 134: I am really uncomfortable with loading the concept form the ncdr facts table. First, there is the issue of coupling facts and concepts load, which is undesirable (since the dependent view cannot be regenerated without attempting to re-inserting loading concept cannot be effectively re-run). Further there is a strong implication that the NCDR concept dimension will be dynamic to a certain extent, which will break past queries from month to month. Was there a reason good reason that this dependency was necessary?

ncdr_dimensions_load.sql

1: KUMC and Epic are not properly cited
There is clearly a script for loading dimensions, but no associated paver task; there should be one. Also there appears to still be duplicate code for loading the encounter_mapping in heron_load/ncdr_facts_load.sql this needs to be removed. Otherwise one of the inserts into the encounter mapping will fail due to duplicate keys.

ncdr_transform_load.sql

1: KUMC and Epic are not properly cited
It appears that an effort was made to break up facts load and the transform, which greatly reduces the current coupling between NCDR facts and concepts load. However, there is no associate paver task yet. I will review the duplicate code in heron_load/ncdr_facts_load.sql since it appears to be what is actively developed.

heron_load/ncdr_facts_load.sql

1: KUMC and Epic are not properly cited
1-232: We have a data dictionary of sorts in PDF form, we should use that instead of doing this manual mapping.
246-290:There is no limiter (where) here and this will likely cause facts load to fail due to duplicate facts if there was more than one row per patient (and I see no spot check to assure us otherwise). As it is written for each row in NCDRV4_HERON you will get a fact w.r.t to White, Black, Asian regardless of whether the patient has positive corresponding code. Even if this did work, it would be very misleading from an i2b2 query perspective, because there would be a folder "White", "Black", etc with leafs such as "0", "1" that mean nothing to researcher.

  • There is also appears to be some needless duplication and magic strings here. You should reserve sometime with either Nathan or myself to help design a more appropriate solution.

291-635:Same as my previous comment, but duplication seems to be getting rather serious and hugely difficult to maintain.

  • Loading each of these facts should be using some kind of code mapping table (preferably from the the parsed PDF) and magic string should be isolated to one occurrence)
  • each 5 line chunk looks almost identical to previous with repetitive magic strings with would be a nightmare to maintain and should be refactored somehow. Maybe pivoting NCDRV4_HERON would be a good idea, but I would have to think about it more.

653-655: This date transformation should only be done once, maybe in NCDRV4_HERON?
656: What is going on with the entry number being multiplied and added to itself for the instance_num. Doing something like this could really hide a problem with duplicate facts.
661, 669: What happens to facts that have no associated entry number? We just throw them away? What kind of facts have no associated entry number? Demographics?

comment:44 in reply to: ↑ 43 Changed 2 years ago by mhoag

Replying to mhoag:

First off the critical issues:

  1. Nothing is added to the visit dimension w.r.t NCDR encounters. Queries might work, but I have no idea what kind of errors can occur downstream (i2b2, plugins, databuilder) if we are missing part of out visit dimension.

Sravani points out that NAACCR doesn't have a visit dimension and everything seems to be working fine there. So I guess this is not a critical issue.

comment:45 in reply to: ↑ 43 Changed 2 years ago by mhoag

Replying to mhoag:

  1. kumc_etl.py loads concepts into DEID not ID so none of the NCDR concepts will show up in i2b2. Notably, ncdr_concepts_load.sql is run in ID instead of DEID.

I guess this will work since I missed the fact that we are inserting over a link insert into BLUEHERONMETADATA.epic_terms@deid.

comment:46 in reply to: ↑ 43 Changed 2 years ago by mhoag

Replying to mhoag:

  1. ncdr_concepts_load.sql is coupled with ncdr_facts_load.sql in that the former requires a view from the later. This is generally a very bad idea since we should try to construct concepts independent of facts.

The major problem with this would be if the we were creating a concept for each fact, but it looks like distinct is used to filter out duplicate concepts.

comment:47 Changed 2 years ago by ngraham

  • Blocking 3592 deleted
  • Milestone changed from heron-calhoun-update to heron-riverwalk-update

After running heron_etl_tiny_no_DBA build 884, we don't get any facts loaded. Given that we're out of time for the Calhoun schedule (and comments above from partial code review from Matt) I'm pushing this off the Calhoun release for now and starting ETL.

If on Monday we can get things working, I believe we can load NCDR separately. I don't think we'll be any any worse off than if we pushed the ETL schedule in order to get this working.

comment:48 Changed 2 years ago by ngraham

  • Cc ngraham added

comment:49 Changed 2 years ago by schandaka

NCDR data did not have any matching test patients in NHERON14.I inserted a test record and was able to successfully run the query.As per Nathan's comment in
ticket:3592#comment:10, We could load NCDR facts/concepts on Monday.

comment:50 Changed 2 years ago by mhoag

  • Owner changed from mhoag to schandaka

Finished my review: comment:43. Giving this back to Sravani to address my comments.

comment:51 Changed 2 years ago by schandaka

Added a test query in (9352e9ab49e3) to show at least one fact with Age=62 years old.

test_heron_query.py http://.../i2b2/webclient/ --demo
INFO:__main__:opening Heron homepage...
INFO:__main__:Found page with title: i2b2 Web Client
INFO:__main__:Getting user configuration...
INFO:__main__:Got it.
INFO:__main__:
Testing: NCDR Query #3224
INFO:__main__:Found: 1
INFO:__main__:timing: 2015-07-20 15:39:33.207000, 2015-07-20 15:39:33.628000, 0:00:00.421000
INFO:__main__:

comment:52 Changed 2 years ago by mhoag

example of using unpivot to reduce the code in ncdr_facts_load.sql (lines 234 - 635)

select EntryNumber, something_value cd, 
       something_name as prefix, EpicMRN,
'NCDR|'||something_name||':'||something_value as concept_cd,
'@' as modifier_cd
from (
 select distinct EPICMRN, EntryNumber, something_value, something_name 
 from heron_etl_14.NCDRV4_HERON
 unpivot
 (
   something_value
   for something_name in (White as 'White', 
                          BlackorAfricanAmerican as 'Black')
 )
 order by something_name
);

comment:53 Changed 2 years ago by dconnolly

Ryan and I looked into unpivot for CMS (GPC:ticket:287) - it aggregates data and just gives counts. I'd be surprised if it would help here.

We did find a way to factor out much of the redundancy in loading 10 diagnosis columns: 6a036f91e8c9 (that changeset only applies to 3 columns; the rest are TODO).

comment:54 Changed 2 years ago by ngraham

  • Blocking set to 3592

Sravani,

Since we have to reload UHC anyway (see #3645) let's give this a shot for Calhoun after all. As per comment:49, it looks like you believe it'll produce the desired results from a end-user perspective. And, there's a test case in [9352e9ab49e3].

comment:55 Changed 2 years ago by ngraham

  • Milestone changed from heron-riverwalk-update to heron-calhoun-update

comment:56 Changed 2 years ago by ngraham

  • Blocking changed from 3592 to 3563, 3592
  • Keywords public-web added
  • Resolution set to fixed
  • Sensitive unset
  • Status changed from assigned to closed

Dan and I think the community would be interested in this and it would be good engineering documentation to publish.

comment:57 Changed 2 years ago by dconnolly

  • Resolution fixed deleted
  • Status changed from closed to reopened

Darn; we (Nathan, Dan, Suman) can't get it to work: unique constraint (NIGHTHERONDATA.ENCOUNTER_MAPPING_PK) violated. See ticket:3592#comment:24 for further detail.

comment:58 Changed 2 years ago by schandaka

I am looking into it.

comment:59 Changed 2 years ago by ssuman

  • Blocking changed from 3563, 3592 to 3563

We got through ETL ticket:3592#comment:25.

comment:60 Changed 2 years ago by schandaka

NCDR test data on nheron14 only has 2014 data in it unlike the production database. Looks like there is some issue with ArrivalDateTime? column(date formatting) from the prior years. I spoke to Dan briefly this afternoon and I agree with his recommendation to load all the NCDR data into nheron14 for more further testing. I am working on it now.

comment:61 Changed 2 years ago by dconnolly

all the data? I guess I wasn't clear. My recommendation is to manually add to our test data items that exhibit the problems you're running into.

something like: source:heron_staging/Clarity_data_load/zero_pad_mrn_2451.sql

comment:62 Changed 2 years ago by ngraham

  • Milestone changed from heron-calhoun-update to heron-riverwalk-update

We removed this (again) from Calhoun (blocking removed in comment:59) but we neglected to change the milestone.

comment:63 Changed 2 years ago by ngraham

  • Blocked By changed from 3479 to 3479, 3605

comment:64 Changed 2 years ago by schandaka

Dan, I think I misunderstood you. Since I had insert scripts from the staging on ID server, it wasn't that difficult for me to stage all the data. NCDR data is not that huge either.

select count(*) from ncdrv4.ncdrv4;
--16070 rows

Let me know if you think otherwise.

  • As per comment:57, there were some duplicate encounters. Here is the explanation for that. When a patient walks in for a PCI procedure,he/she could be getting multiple PCI procedures done during the same encounter. As in, multiple stents could be placed during the same encounter. I thought EntryNumber? in NCDR is unique as the test data lacked patients with multiple stents placed during the same encounter. I was able to fix this error by appending LesionNumber? to the EntryNumber? to obtain an unique encounterid. LesionNumber? is unique within the same encounter. It uniquely identifies the region where a stent is placed.
  • I also made changes to the ncdr_facts_load script as per comment:52. I used unpivot function 5 times to get 5 different folders. I believe it could still be shortened once Nathan can finish parsing the NCDR data dictionary(PDF file). It doesn't stop my work in anyway though.
  • I also added a test query for PriorPCI condition.
python test_heron_query.py http://.../i2b2/webclient/ --demo
INFO:__main__:opening Heron homepage...
INFO:__main__:Found page with title: i2b2 Web Client
INFO:__main__:Getting user configuration...
INFO:__main__:Got it.
INFO:__main__:
Testing: NCDR PriorPCI Query #3224
INFO:__main__:Found: 6
INFO:__main__:timing: 2015-07-30 12:42:06.861000, 2015-07-30 12:42:07.262000, 0:00:00.401000
INFO:__main__:
Testing: NCDR Query #3224
INFO:__main__:Found: 1
INFO:__main__:timing: 2015-07-30 12:42:07.276000, 2015-07-30 12:42:09.210000, 0:00:01.934000
INFO:__main__:

All the above mentioned code modifications are committed in (7286ffeac727).

I am proceeding ahead to work on #3563 to bring other NCDR data in.

comment:65 Changed 2 years ago by schandaka

Dan, assigning this ticket to you to review the work on test. Please take a look at Coronary Anatomy folder. The numbers that you see are the actual result values. I believe they could be made nval_num. Matt Hoag suggested similar thing with Age folder before and wanted me to check with Russ. I spoke to Russ about it and he was ok with having it as is for now and suggested we could come back to it after reviewing it with the client(Dr.Gupta). I will make sure to bring this discussion during our Tuesday(August 4th) meeting.

comment:66 Changed 2 years ago by schandaka

  • Owner changed from schandaka to dconnolly
  • Status changed from reopened to assigned

comment:67 Changed 2 years ago by schandaka

  • Blocked By changed from 3479, 3605 to 3479, 3605, 3661

comment:68 Changed 2 years ago by ngraham

Sravani,

I finished up the script/Jenkins jobs to parse/stage the NCDR data dictionary (see ticket:3661#comment:2). Please let me know if you see any issues with the results and/or you want to work together to build the NCDR hierarchy.

comment:69 Changed 2 years ago by mhoag

  • Blocking changed from 3563 to 3563, 3676

comment:70 Changed 2 years ago by dconnolly

  • Owner changed from dconnolly to schandaka

Russ answered Sravani's question (comment:65)

comment:71 follow-up: Changed 2 years ago by schandaka

Discussion about NCDR 'Age' column came up during a meeting with Russ,Dan,Matt,Nathan and Tamara. Nathan and I talked about Age column in NCDR Demographics folder. It is the age at the Cath/PCI Procedure visit. Since we cant obtain that from the Visit Details->Age at Visit(because we couldn't map NCDR encounters),we have decided to retain the NCDR Age terms. However, instead of duplicating the code we will be using something like this from the existing age_terms(from the curated data).

select folder_name, leaf_name, 
case when concept_cd is null then null 
else 'NCDR' || substr(concept_cd, instr(concept_cd, '|')) end concept_cd, 
age, a_min, a_max  from age_terms; 

comment:72 in reply to: ↑ 71 Changed 2 years ago by dconnolly

Replying to schandaka:

... because we couldn't map NCDR encounters

Why is mapping NCDR encounters necessary for adding age at visit facts? Age at visit is a coherent concept whether it's an Epic visit or not, no?

comment:73 follow-up: Changed 2 years ago by schandaka

Nathan, I have modified the ncdr_facts_load script to point to the newly staged column names. While working on that I have noticed that some of the data in the new table is bad. As in, the selections from the ncdrv4.ncdrv4 do not the match the selections from ncdr_concepts table.

Fox example,

  • Some of the EntryNumbers? are 'No' instead of numbers.
  • One of the selections for seq_5000 and seq_5020 is 'No symptom; no angina'(with semi-colon). In the ncdr_concepts view I see it as 'No symptom, no angina'(with comma). The join on ncdr_concepts is failing because of the resulting in null concept_cd.
  • Race values for some of the rows is numeric. It should be Yes/No?.
    SELECT Entry_number FROM  NCDRV4.NCDRV4 WHERE SEQ_2070='180.0' OR SEQ_2070='154.9';
    
  • You can see the rows with such anamolies here.
    select * from schandaka.ncdrall_nullconcepts;
    

For now i included concept_cd is not null in the script so that it wont error out on observation_fact load due to the null concept_cd.

I have pushed the changes that I made in the facts load script so far.

comment:74 in reply to: ↑ 73 Changed 2 years ago by ngraham

Sravani,

Unfortunately, the data seems a bit messy - numbers where there should be text field in places, etc. Also, it appears the data doesn't match the data dictionary.

In some cases at least, I think we should make special-case exceptions for the dictionary differences (for example, 'No symptom, no angina' in the dictionary vs. 'No symptom; no angina' in the data).

For cases where we have numbers instead of "Yes" and "No", etc., I don't know that we have a choice other than to just not load the rows (assuming of course that it's not a staging problem - preliminary investigation says that what's staged does match the spreadsheet).

I'll come chat with you before proceeding to make sure I'm understanding the mismatched concepts report.

details

Replying to schandaka:

  • Some of the EntryNumbers are 'No' instead of numbers.

Indeed, but it looks like there are only 2 rows:

select count(*) from (
  select ltrim(translate(entry_number,'0123456789', ' ')) entry_num FROM  ncdrv4.ncdrv4
  )
where entry_num is not null
;

I double-checked the spreadsheets and indeed, the very last row of "HERON PATS v4 1-2012 to 9-2014.xlsx` (7772) has "No" in the Entry Num field. Same for row 720 of "HERON PATS v4 7-2009 to 12-2011.xlsx". I think we should just ignore these rows in SQL - it's only 2 out of 1000's.

  • One of the selections for seq_5000 and seq_5020 is 'No symptom; no angina'(with semi-colon). In the ncdr_concepts view I see it as 'No symptom, no angina'(with comma). The join on ncdr_concepts is failing because of the resulting in null concept_cd.

The data dictionary has a comma "No symptom, no angina" I plan to make some special case exceptions in the PDF parser such that the resulting dictionary matches the data.

  • Race values for some of the rows is numeric. It should be Yes/No?.
    SELECT Entry_number FROM  NCDRV4.NCDRV4 WHERE SEQ_2070='180.0' OR SEQ_2070='154.9';
    
  • You can see the rows with such anamolies here.
    select * from schandaka.ncdrall_nullconcepts;
    

I recreated your view (schandaka.ncdrall_nullconcepts) in my schema and took out the "to_number" calls as I kept getting "invalid number" errors when looking at the concepts.

The following produces 692 rows:

select distinct something_name, concept_cd, something_value from ncdrall_nullconcepts order by something_name; --692

comment:75 Changed 2 years ago by ngraham

After chatting with Sravani, it looks like a large number of missing concepts (in the ncdrall_nullconcepts view) are due to the fact that the code is trying to join on both the sequence number _and_ the selection name (except for Labs for which Sravani has made manual exceptions).

I'll reopen #3661 to add a column in the dictionary that is a flag to indicate if there are selections for a given sequence number so we'll know whether to join on just sequence number or sequence number and selection.

comment:76 Changed 2 years ago by ngraham

Sravani, it looks like the "has selections" flag was achievable in SQL. I've added it to the ncdr_concepts view. See [bc1e79d90915].

Also, as we discussed, there's one column per medication - they should be constants (selectable to the user in the webclient) but they aren't "selections" in the NCDR dictionary.

So, I added SQL to build the medication hierarchy based on the dictionary column translation table in [f59e48a8628f].

Also, as discussed, I added support for manually curated NCDR selections (specifically, for sequence number 7105) in [2ee6069c79f5].

comment:77 Changed 2 years ago by ngraham

Sravani,

I fixed a few bugs with the staging and the ncdr_concepts view (see ticket:3605#comment:7, [235d2c5cb17b]) that caused concept code mismatch for medications and for 7105 Name: Segment Number.

I also made a quick tweak in [3500e38766d4] to avoid hard-coded strings.

After that, a quick test run seems to indicate that the only facts without concepts we have are NCDR|CURRENT_AGE:2009 (2 facts - I assume it's due to messy data) and NCDR|CURRENT_AGE:Dead (126 facts).

comment:78 Changed 2 years ago by schandaka

Created a ticket on how to handle NCDR dead patients.(#3694)
For heron-riverwalk, added exception to ignore the dead patient facts without concepts in the revision (9a0f59df37f8).

comment:79 Changed 2 years ago by schandaka

  • Owner changed from schandaka to mhoag

Modified test queries for the new paths in (e3bdf0d7a96c).
Assigning to Matt to code-review.

comment:80 follow-up: Changed 2 years ago by mhoag

Code reviewing changes between b71217e9f53c (default) and e3bdf0d7a96c (latest NCDR):

Overall, I didn't see anything that worries me. I think this is okay to merge.

By File Review

concepts_merge.sql

Nathan, I remember you showing me on test that some of the hidden concepts actually had facts associated with them. Did that get fixed? Otherwise looks good to me.

code_sync_ex.csv

No comment.

ncdr_manual_selections.csv

Any documentation on where these come from, why they were created? Pointer to a ticket comment would be nice.

source_master.csv

No comment.

epic_etl.py

279: Erroneous tab

heron_build.py

517: Not sure why an NCDR specific curated data is not in kumc_etl.py instead. I don't think we are intending to use this curated data in any other ETL piece (i.e. epic)

kumc_etl.py

Looks good.

ncdr_concepts_load.sql

9-26: there is an implicit dependency on this part of ncdr concept load that Age at Visit has also been loaded. I guess that is captured well enough in heron_build.py. But I think it is important to note that we are sacrificing code portability for and edge on code reuse.

Otherwise this looks good.

ncdr_constants.sql

21: s/dasta/data
Looks good. I love that this file exists! So that there is a source of truth for concept code generation.

ncdr_facts_load.sql

19: Pivot looks as I described... maybe though you don't need a seq num for each line and instead kind of group them by paragraph bases on their category (e.g. /*Cath Lab Visit*/)
225-227: Code duplication here is mildly annoying, but is isolated to a single place (unlike before) and DRY SQL code might actually add more lines than it removes...
341: Why not commit after the update of upload status?

test_heron_query.py

18: Not sure why the typo was added here...

ncdr_dictionary.ctl

ncdr_dictionary.sql

Looks good.

ncdr_stage_prep.py

ncdr_to_i2b2_paths.py

Looks pretty good to me. I know that it does what it is supposed to do and and has good unit test coverage.

sample_ncdr_dict

yeah for rich test data

comment:81 Changed 2 years ago by schandaka

Matt, Hidden concepts having facts issue has been fixed.

comment:82 Changed 2 years ago by mhoag

  • Owner changed from mhoag to ngraham

Code Review done (comment:80). I think this is ready for merge, but you might want to address some of the non-critical comments.

comment:83 in reply to: ↑ 80 Changed 2 years ago by ngraham

  • Owner changed from ngraham to schandaka

Thanks for the review, Matt. I addressed some of your comments in [e428a2813bf8]. Merged to default in [3cfa2468a4be].

Credit to Sravani for all her hard work to add the ability to search NCDR facts in HERON.

details

Replying to mhoag:

concepts_merge.sql

Nathan, I remember you showing me on test that some of the hidden concepts actually had facts associated with them. Did that get fixed? Otherwise looks good to me.

...answered by Sravani in comment:81

ncdr_manual_selections.csv

Any documentation on where these come from, why they were created? Pointer to a ticket comment would be nice.

I tried to explain this better in the comment.

epic_etl.py

279: Erroneous tab

Fixed.

heron_build.py

517: Not sure why an NCDR specific curated data is not in kumc_etl.py instead. I don't think we are intending to use this curated data in any other ETL piece (i.e. epic)

Hmm, I guess I'm not sure which is better - I guess the rest of the NCDR stuff is in the KUMC file so maybe it does make sense. However, I'm not inspired to move it at this point.

ncdr_concepts_load.sql

9-26: there is an implicit dependency on this part of ncdr concept load that Age at Visit has also been loaded. I guess that is captured well enough in heron_build.py. But I think it is important to note that we are sacrificing code portability for and edge on code reuse.

I agree - I went back and forth on this for a while. In the end, though, I decided to make the portability/order dependency sacrifice to avoid copy/pasting a bunch of code.

ncdr_constants.sql

21: s/dasta/data

I de-dastafied the comment.

ncdr_facts_load.sql

19: Pivot looks as I described... maybe though you don't need a seq num for each line and instead kind of group them by paragraph bases on their category (e.g. /*Cath Lab Visit*/)

Yeah, we could have saved some screen real estate, but I'm not inclined to change it at this point.

225-227: Code duplication here is mildly annoying, but is isolated to a single place (unlike before) and DRY SQL code might actually add more lines than it removes...

Oops - I agree - seems we could have factored them out. Don't think I'll do it now though - I don't feel like making too many changes this close to ETL after Sravani has already done quite a bit of validateion.

341: Why not commit after the update of upload status?

Hmm, maybe that would be a good idea, but I think db_util will commit once it's done with running the file anyway. The commit right after the insert into observation_fact_upload is there (I think) so that if the following statement (primary key constraint) fails, we can easily tell which rows were duplicated without having to re-run the insert.

test_heron_query.py

18: Not sure why the typo was added here...

Not sure either - I fixed it.

comment:84 Changed 2 years ago by schandaka

  • Resolution set to fixed
  • Status changed from assigned to closed

comment:85 Changed 2 years ago by schandaka

Dr.Gupta finally got a chance to look at the NCDR hierarchy in HERON. He provided me with the below use cases that I can use to validate NCDR on HERON.

  • All patients with > 70% stenosis in atleast one artery(CIRC/Mid Distal LAD/Proximal). This information could be found in NCDR\Coronary Anatomy.
  • All patients with > 70% stenosis in two arteries(CIRC/Mid Distal LAD/Proximal).
  • All patients with LVEF<=35%, no acute STEMI,>70% stenosis in 3 arteries and no prior CABG.
Note: See TracTickets for help on using tickets.