Opened 8 years ago

Closed 8 years ago

Last modified 7 years ago

#1655 closed defect (fixed)

realistic date-of-death exposed details about ages > 90

Reported by: dconnolly Owned by: ngraham
Priority: major Milestone: heron-council-grove-update
Component: data-repository Keywords: deid, vital-statistics, public-web
Cc: dconnolly, rwaitman Blocked By:
Blocking: Sensitive: no

Description

When we made date-of-death work in the timeline (#639 [fbfa9eecb02a]) we broke the age > 90 HIPAA rule.

See ticket:76#comment:6 and following for details.

Attachments (3)

death.html (117.3 KB) - added by ngraham 8 years ago.
Analysis of death date differences (hospital versus SSA, NAACCR and UHC)
death_all_sources_log.html (65.4 KB) - added by ngraham 8 years ago.
Death deltas on log10 scale comparing all sources.
death_all_sources_log_fixed_uhc.html (60.1 KB) - added by ngraham 8 years ago.
Used end date for UHC death fact


Change History (31)

comment:1 Changed 8 years ago by dconnolly

Resolution: fixed
Status: new → closed

comment:2 Changed 8 years ago by dconnolly

Blocking: 1623 added

comment:3 Changed 8 years ago by ngraham

Blocking: 1623 removed

(In #1623) This morning, we're on chunk 4 of observation_fact_dx on idx:

chunk #4

2012-12-18 07:56:07.762783: id_db: SID on localhost
insert into NightHerondata.observation_fact(
...
from observation_fact_dx@idx f

From November:

chunk #4

2012-11-15 00:28:27.779589: id_db: SID on localhost
insert into NightHerondata.observation_fact(
...
from observation_fact_dx@idx f

November build was done at 2012-11-16 03:05:18 (from comment:11). So, in November, we were 1 day, 02 hours 36 minutes from being done. That puts us on track to be done on Wednesday at 10:30 (2012-12-19 10:32)
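The estimate above can be re-derived from the logged timestamps. This is only a sketch of the arithmetic, assuming the December build takes as long from chunk 4 onward as the November build did; the variable names are made up for illustration.

```python
from datetime import datetime

LOG_FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the log excerpts and the November completion time.
nov_chunk4 = datetime.strptime("2012-11-15 00:28:27.779589", LOG_FMT)
nov_done = datetime.strptime("2012-11-16 03:05:18", "%Y-%m-%d %H:%M:%S")
dec_chunk4 = datetime.strptime("2012-12-18 07:56:07.762783", LOG_FMT)

# Time that remained in November once chunk 4 had started.
remaining = nov_done - nov_chunk4

# Assumption: December needs the same remaining time from the same point.
estimate = dec_chunk4 + remaining

print(remaining)
print(estimate.strftime("%Y-%m-%d %H:%M"))
```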

comment:4 Changed 8 years ago by ngraham

Resolution: fixed
Status: closed → reopened

We found during the milestone:heron-elk-city-update that the patient dimension can still reveal ages > 90. See ticket:1623#comment:22.

comment:5 Changed 8 years ago by ngraham

Blocking: 1625 added

comment:6 Changed 8 years ago by ngraham

Blocking: 1625 removed
Cc: dconnolly rwaitman added
Milestone: heron-elk-city-update → heron-council-grove-update

After talking with Russ today, we'll address this next release.

comment:7 Changed 8 years ago by ngraham

I've proven that the SSDMF code doesn't properly adjust ages/birthdates. I found a production case where a patient was thought to be dead in 198x (as per Epic), but SSDMF corrected to 200x thus making their age much older (closed #1650 as a duplicate).

As I work on failing test cases, I'm keeping notes here.

  • What about the case where we have a patient who was born in 1840 but we have no death record (we don't know if they are alive or dead)? Should we move the birth date forward to 1922 so they appear 90? It seems like the more we adjust, the more we increase the risk of pushing the birth date later than other medical records.
    • This is a risk anyway - if we have facts dating earlier than our adjusted birth date, then we're giving away the fact that this patient was over 90, right?
  • Is it worth considering round() versus floor() when calculating ages?
    • We used months_between() - if 89 and 11.9 months you're still 89, right? Maybe it doesn't matter.
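To make the round() versus floor() question concrete, here is a small sketch. It assumes the age is derived from a fractional month count such as the one Oracle's months_between() returns; the helper names are hypothetical and this is not the HERON code.

```python
import math

def age_floor(months):
    """Whole years completed: 89 years 11.9 months is still 89."""
    return math.floor(months / 12)

def age_round(months):
    """Nearest whole year: 89 years 11.9 months becomes 90."""
    return round(months / 12)

almost_90 = 89 * 12 + 11.9   # 89 years and 11.9 months
past_90 = 90 * 12 + 7        # 90 years and 7 months

print(age_floor(almost_90), age_round(almost_90))
print(age_floor(past_90), age_round(past_90))
```

With floor(), a patient only counts as 91 after completing 91 years; with round(), the same patient can look a year older up to six months early, so the choice does matter near the 90 boundary.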

comment:8 Changed 8 years ago by ngraham

Russ e-mailed the i2b2 user community:

From: Russ 
Subject: HIPAA Safe Harbor and implications on age over 90 in i2b2
Date: Fri, 28 Dec 2012 16:53:06 -0600

We got good input from the group - I've taken a few notes from the responses below. Note that all quotations are from e-mails that were sent to the AUG and therefore will be publicly available in the AUG archives:

Some comments / notes from Luke Rasmussen's Response Below

From: Luke Rasmussen 
Sent: Saturday, December 29, 2012 10:34 PM

Interesting reference:

slides from an AMIA panel on honest brokers & de-identification

Some comments / notes from Bradley Malin's Response

From: Malin, Bradley 
Sent: Monday, December 31, 2012 11:03 AM
To: members@i2b2aug.org; Nathan Graham; Russ Waitman
Subject: RE: HIPAA Safe Harbor and implications on age over 90 in i2b2

A short excerpt from Brad's comments:

it sounds like your problem could be solved by:

i) running a check on if the individual is *known* to be over 90.
ii) if they are over 90, then you either:

a) generalize all events before a certain date early in life (which might preclude the analysis of pediatric-related disorders)
b) generalize all events after a certain date later in life (which might preclude the analysis of geriatric-related disorders) However, this is why de-identified data is not necessarily conducive to all types of biomedical research...

Comments / notes from Matvey Palchuk's Response

From: Matvey Palchuk [MPalchuk@recomdata.com]
Sent: Monday, December 31, 2012 8:48 AM
Subject: Re: HIPAA Safe Harbor and implications on age over 90 in i2b2

if you set out to build de-identified data set (per HIPAA), all 18 PHI elements have to be excluded. These 18 include dates and ZIP codes. Dates (all dates – DOB, DOD, start and end dates, etc.) must be stripped down to a year (as in, everything happens on 1st of January, for example).

Perhaps we could consider adjusting all birth dates to January 1st of the year? This may simplify our age calculations / adjustments without reducing the quality of the data from a research perspective.
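A minimal sketch of the January 1st idea, for illustration only. The function name and the sample date are made up; nothing here reflects how HERON actually stores birth dates.

```python
from datetime import date

def truncate_to_year(d):
    """Keep only the year of a date by moving it to January 1st."""
    return date(d.year, 1, 1)

# Hypothetical birth date, not taken from any patient record.
print(truncate_to_year(date(1921, 7, 14)))
```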

comment:9 Changed 8 years ago by ngraham

Another age question - consider the following:

  • We have a patient who was born in 1921
  • Epic says he died in 2012
  • So, we adjust his birth date forward to 1923
  • SSDMF has a record that he actually died in 1933.
    • Since we take SSDMF as truth for our patient dimension, we update the deid patient dimension with this real birth date since he died when he was only 12 years old.

Now we're left with a vital fact in the deid observation_fact table that is > 90 years later than the birth date (since we use Epic death dates for the Epic death fact).

Maybe this is ok - I don't see that we can update the Epic death fact as it's explicitly an Epic fact (Epic said it, so we can't edit the data).

I think I'm fixing our previous design in [10602a0d5e13]. It also addresses age at visit, since tests failed (see #1680).

I'll leave this open as this has become somewhat of a design issue as well (though, also a defect it seems...).

I plan to talk with Russ about how to proceed given input from the AUG and the questions noted above.

comment:10 Changed 8 years ago by ngraham

I met with Russ this morning - here are my notes from the meeting including some tasks / questions to answer.

Questions to answer

We have various sources of death including:

  • Epic
  • SSDMF
  • NAACCR
  • UHC

Task / Question to answer: Determine how each affects the data in HERON - which ones are just "facts" and which ones update the patient dimension.

  • Epic: yes
  • SSDMF: yes (replaces Epic observation)
  • NAACCR: no - it looks like we load the "death" as a fact (see source:/heron_load/naaccr_txform.sql).
    • \i2b2\naaccr\S:4 Follow-up/Recurrence/Death\1760 Vital Status\0 Dead (CoC)\
    • NAACCR|1760:0
  • UHC: no - death (expired) as a fact (see source:/heron_load/uhc_i2b2_transform.sql)
    • \i2b2\UHC\Visit Details\Discharge Status\20\ and \i2b2\UHC\Visit Details\Discharge Status\41\
    • UHC|DISCHSTCODE:20 and UHC|DISCHSTCODE:41

Task / Question to answer: If possible, gather statistics on how much the dates from the various sources actually differ per patient. Are they almost always within a week? A year? The demographics plugin uses the patient dimension, right? (verify) Given the stats on death differences, we hope to answer: Which death source do we trust? Does it really matter? Maybe not if death dates are usually close together from the various sources.

Obfuscation Design for HERON

Given input from the AUG and our discussion today, we plan to approach the following design for HERON at this time - once we answer the questions noted above, it may affect how we handle observation_facts.

  • For both alive and dead people: Shift their birth date forward in the patient dimension such that they don't appear > 90 years old.
    • Which death source to trust for the patient dimension? I believe that SSDMF is what we trust now (over Epic).
  • For facts that then appear to be prior to birth, move them forward to be the new birth date.
    • We considered shifting facts forward to be relative to birth, but that compresses them and may be a misrepresentation we don't want to make.
    • Investigate how to do this? During the fact load I assume - load all demographics stuff first then reference as we load facts.
  • What about facts after our trusted death date? We didn't explicitly discuss this but for symmetry, consider shifting them to death date.
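The design in the list above could be sketched roughly as follows. This is not the HERON implementation: the function names, the 90-year constant, and the choice of "reference" date (trusted death date for the deceased, load date for the living) are assumptions made for illustration, and the clamp to death date is only the symmetric option floated in the last bullet.

```python
from datetime import date

MAX_AGE_YEARS = 90  # assumed cap, per "don't appear > 90 years old"

def years_before(d, years):
    """Date `years` years before `d`; Feb 29 falls back to Feb 28."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        return d.replace(year=d.year - years, day=28)

def shifted_birth_date(birth, reference):
    """Move birth forward so the age at `reference` is at most MAX_AGE_YEARS."""
    return max(birth, years_before(reference, MAX_AGE_YEARS))

def clamp_fact_date(fact_date, new_birth, death=None):
    """Move facts before the shifted birth date up to it.

    Facts after `death` are moved back to it only if `death` is given.
    """
    if fact_date < new_birth:
        return new_birth
    if death is not None and fact_date > death:
        return death
    return fact_date

# Hypothetical patient: born in 1840, no death record, loaded 2012-12-18.
new_birth = shifted_birth_date(date(1840, 5, 1), date(2012, 12, 18))
print(new_birth)
print(clamp_fact_date(date(1900, 1, 1), new_birth))
```

Clamping rather than shifting relative to birth keeps later facts at their real distance from each other, which matches the concern about compression noted above.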

comment:11 Changed 8 years ago by ngraham

Blocking: 1696 added

Changed 8 years ago by ngraham

Attachment: death.html added

Analysis of death date differences (hospital versus SSA, NAACCR and UHC)

comment:12 Changed 8 years ago by ngraham

Russ,

After some guidance from Dan on Friday with respect to R (thanks Dan) I added some more analysis to source:heron_analysis/death.Rmd to show histograms of death date differences between the hospital and SSA, NAACCR, and UHC.

Results are attached (attachment:death.html).

To easily view, it looks like you can click on the attachment link and then at the bottom of the resulting page, click on "Original Format" (under "Download in other formats:"). Then, select to "open" with your browser.

comment:13 Changed 8 years ago by ngraham

Russ and I met today to look at the histograms noted in comment:12. Below are notes from the discussion:

  • For differences between Epic and NAACCR, generate a de-identified file as an example and securely e-mail Tim Metcalf and Theresa Jackson with the findings to see if they have an explanation as to why there are discrepancies. The file is sorted by difference between death dates (NAACCR and Epic) and contains:
    • Patient num
    • Encounter num
    • DOB
    • Death dates by source (Epic, SSA, NAACCR)
  • We believe that the death dates from UHC should line up with the dates from Epic given that they both originate from the same source. The discrepancy, we believe, is from the fact that we were using the "admission" date rather than the "discharge" date for UHC death facts. See #1695.
  • Modify the plots to "zoom in" on the histograms to remove the huge peak where dates align by, say, 2 days. Add histograms for:
    • SSA vs NAACCR
    • SSA vs UHC
    • NAACCR vs UHC
  • From comment:10: What about facts after our trusted death date? We didn't explicitly discuss this.... Until we determine what to use as "truth" for death date, don't worry about these facts that appear to occur after death.

Changed 8 years ago by ngraham

Attachment: death_all_sources_log.html added

Death deltas on log10 scale comparing all sources.

comment:14 Changed 8 years ago by ngraham

In order to better visualize the data (as discussed in comment:13), I created a custom histogram function in R with a logarithmic y axis. R has almost what I wanted but not quite - it didn't draw nicely with a logarithmic scale when histogram bins were empty (log(0)).

The logarithmic scale helps reduce the huge peak near 0 and show what the "tails" look like. See [29b1bbd0b401] (accidentally checked in under Dan's name).

The results are attached (attachment:death_all_sources_log.html).
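The empty-bin problem can be illustrated outside R. The sketch below is in Python rather than R and is not the function checked in above; it only shows one way to compute log10 bar heights while leaving empty bins undrawn. The function name and sample deltas are made up.

```python
import math

def log10_histogram(values, bin_width):
    """Return (bin_start, count, log10_height) rows for every bin in range.

    Empty bins get a height of None instead of log10(0), so a plotting
    routine can skip them rather than fail.
    """
    if not values:
        return []
    counts = {}
    for v in values:
        b = math.floor(v / bin_width)
        counts[b] = counts.get(b, 0) + 1
    rows = []
    for b in range(min(counts), max(counts) + 1):
        n = counts.get(b, 0)
        height = math.log10(n) if n > 0 else None
        rows.append((b * bin_width, n, height))
    return rows

# Made-up deltas in days: a big peak near 0 and one straggler.
deltas = [0] * 100 + [1, 30]
for row in log10_histogram(deltas, 10):
    print(row)
```

Note that a bin holding exactly one value has height 0.0, so a drawing routine would still need some way to tell it apart from an empty bin.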

Changed 8 years ago by ngraham

Attachment: death_all_sources_log_fixed_uhc.html added

Used end date for UHC death fact

comment:15 Changed 8 years ago by ngraham

In [a91c554837df], I modified the script to use end date instead of start date for UHC death facts only. The result then shows UHC matches Epic much better (as expected). See attachment:death_all_sources_log_fixed_uhc.html.

As for providing the file for the NAACCR folks, here's the SQL I used:

  • Create a table of deltas:
    create table death_source_deltas as(
      select * from (
        with 
        epic as (
          select obs.patient_num, obs.encounter_num, obs.start_date death_date, pdim.birth_date
          from blueherondata.observation_fact obs
          join blueherondata.patient_dimension pdim on pdim.patient_num = obs.patient_num
          where concept_cd = 'DEM|VITAL:y'
          ),
        ssa as (
          select obs.patient_num, obs.encounter_num, obs.start_date death_date, pdim.birth_date
          from blueherondata.observation_fact obs
          join blueherondata.patient_dimension pdim on pdim.patient_num = obs.patient_num
          where concept_cd = 'DEM|VITAL|SSA:y'
          ),
        naaccr as (
          select obs.patient_num, obs.encounter_num, obs.start_date death_date, pdim.birth_date
          from blueherondata.observation_fact obs
          join blueherondata.patient_dimension pdim on pdim.patient_num = obs.patient_num
          where concept_cd =  'NAACCR|1760:0'
          ),
        uhc as (
          --Note:  Use end date for UHC death dates 
          select obs.patient_num, obs.encounter_num, obs.end_date death_date, pdim.birth_date
          from blueherondata.observation_fact obs
          join blueherondata.patient_dimension pdim on pdim.patient_num = obs.patient_num
          where concept_cd in ('UHC|DISCHSTCODE:20', 'UHC|DISCHSTCODE:41')
          )
        select 
        epic.patient_num, epic.encounter_num, epic.birth_date, epic.death_date epic_death_date,  
        ssa.death_date ssa_death_date, epic.death_date - ssa.death_date epic_ssa_delta_days,
        naaccr.death_date naaccr_death_date, epic.death_date - naaccr.death_date epic_naaccr_delta_days,
        uhc.death_date uhc_death_date, epic.death_date - uhc.death_date epic_uhc_delta_days
        from epic
        left join ssa on ssa.patient_num = epic.patient_num
        left join naaccr on naaccr.patient_num = epic.patient_num
        left join uhc on uhc.patient_num = epic.patient_num
        where epic.death_date is not null
        )
      )
    ;
    
  • Find where NAACCR disagrees:
    select * from death_source_deltas 
    where epic_naaccr_delta_days is not null
    order by abs(epic_naaccr_delta_days) desc
    
  • Double-check against the results from the R script to make sure we get the same patients. We do! R code (inside of our death analysis script):
    # Merge the two on patient number
    case <- merge(hospital, naaccr, by=c('patient.num'), suffixes = c(".hosp",".naaccr"))
    # create a column of diffs
    case$diff <- abs(case$when.hosp - case$when.naaccr)
    # sort by diff
    sorted_case <- case[order(case$diff, decreasing=TRUE),]
    sorted <- sorted_case[1:10,] # look at top 10
    

comment:16 Changed 8 years ago by ngraham

Owner: changed from ngraham to mhoag
Status: reopened → assigned

Matt said he'd review the changes on the "limit_age_1655" branch, so giving him this ticket for now.

comment:17 Changed 8 years ago by mhoag

Status: assigned → accepted

comment:18 Changed 8 years ago by mhoag

Owner: changed from mhoag to ngraham
Status: accepted → assigned

CODE REVIEW

Code Review Key

Annotation  Description
*           Style issue, can ignore
?           Question
!           Critical or functional issue

epic_i2b2_deid_verify.sql
================
9?: Do you need the commented out '--alter session set NLS_DATE_FORMAT = 'YYYY-MM-DD HH24:MI:SS' ;'?
116?: What facts in particular and what is the suggested remediation? (i.e. turn this into a TODO)

epic_i2b2_transform.sql
epic_i2b2_facts_deid.sql
ssdmf_load.sql
===============
No Comment

Everything looks good here. Ready for merge.

comment:19 Changed 8 years ago by ngraham

With regard to the death analysis script, Russ noted that the 0-peak was being removed from the graphs. I fixed this in [b1ea8cce8e36].

comment:20 in reply to:  18 Changed 8 years ago by ngraham

Replying to mhoag:

Matt, thanks for reviewing. Changes implemented in [0ea9a43cc51a].

CODE REVIEW

Code Review Key

Annotation  Description
*           Style issue, can ignore
?           Question
!           Critical or functional issue

epic_i2b2_deid_verify.sql
================
9?: Do you need the commented out '--alter session set NLS_DATE_FORMAT = 'YYYY-MM-DD HH24:MI:SS' ;'?

I don't need it so I removed it. The default Oracle time format is annoying to me so I had it there for testing manually.

116?: What facts in particular and what is the suggested remediation? (i.e. turn this into a TODO)

Ah - my wording was confusing and I changed it. The sentences below were answering the question. We don't test all facts because facts can be tagged with a date that is long after death (because that's when the hospital updated the record) and therefore can legitimately be > 90 years after birth.

epic_i2b2_transform.sql
epic_i2b2_facts_deid.sql
ssdmf_load.sql
===============
No Comment

Everything looks good here. Ready for merge.

comment:21 Changed 8 years ago by ngraham

Resolution: fixed
Status: assigned → closed

Merged in [e988930f5fde]. After a discussion with Russ (see comment:13) this is done enough for this release as tests show we've obfuscated the age appropriately.

However, we're still working to find what to use as "truth" for death date. Then, we'll decide what to do for non-age specific facts that still remain outside the obfuscated lifetime.

Russ e-mailed Tim and Theresa (comment:13) for comment.

I've created #1713 for continuing the investigation into the source of truth for death facts.

comment:22 Changed 8 years ago by rwaitman

Theresa indicated they are working on cleaning this up and will email us when there's a change. We will then want to rerun our statistics and data sets to help them see what stragglers remain.

comment:23 Changed 8 years ago by ngraham

Blocking: 1696 removed
Milestone: heron-council-grove-update → heron-neosho-update
Priority: critical → major
Resolution: fixed
Status: closed → reopened

Oops... 2 patients (only 2!) have a calculated age of 91 (when subtracting the birth date from the SSA death date). We found this during the milestone:heron-council-grove-update (see ticket:1696#comment:25).

It appears that these cases are "fence post" cases - these two patients appear to have died on their birthday.

Before "fixing", let's be sure to create a failing test case in the test environment that mimics this issue.
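As a starting point for such a test, here is a sketch of the boundary cases it would need to cover. It does not reproduce the actual defect; the age function and the dates are hypothetical and only show how the calculated age changes on the birthday itself.

```python
from datetime import date

def age_in_years(birth, on):
    """Whole years completed between `birth` and `on`.

    The birthday itself counts as a completed year.
    """
    years = on.year - birth.year
    if (on.month, on.day) < (birth.month, birth.day):
        years -= 1
    return years

# Hypothetical patient; death dates around the 91st birthday.
birth = date(1921, 3, 10)
day_before = age_in_years(birth, date(2012, 3, 9))
on_birthday = age_in_years(birth, date(2012, 3, 10))
day_after = age_in_years(birth, date(2012, 3, 11))
print(day_before, on_birthday, day_after)
```

A test fixture with one patient per case (day before, on, and day after the birthday) should show whether the adjustment is off by one at the boundary.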

comment:24 Changed 8 years ago by dconnolly

Maybe make a separate ticket for the fence-post bug? It seems to me that this belongs in the list of things we fixed in milestone:heron-council-grove-update.

comment:25 Changed 8 years ago by ngraham

Milestone: heron-neosho-update → heron-council-grove-update
Resolution: fixed
Status: reopened → closed

As per Dan's suggestion in comment:24, I opened a new ticket (#1745) to address the two fence post cases. I'm closing this one and moving the milestone back to milestone:heron-council-grove-update.

comment:26 Changed 8 years ago by dconnolly

Keywords: vital-statistics added

comment:27 Changed 7 years ago by dconnolly

Keywords: public-web added

comment:28 Changed 7 years ago by kcrane2

Approved for public release.
