Version 23 (modified by dconnolly, 7 years ago)


HIPAA Identifiers section of HERON IRB Protocol

This is an excerpt from HERON Repository IRB Protocol v2.1 of 2012. See HERON#governance for full text.

We will be transforming identified data into a form that addresses all of the 18 de-identification criteria. We will shift all dates in the EMR 1–365 days into the past; the shift is different across records but constant within the records of each patient, thereby allowing temporal analyses such as the development of adverse effects after a drug. We have listed the identifiers specified by HIPAA and whether they will be included in our data sources and the general i2b2 repository. Although the data are de-identified, we will request that investigators treat released data with the same sensitivity as a limited data set.

Included in Source Data | Included in de-identified i2b2 repository | Identifier
Yes | No | 1. Names
Yes | No | 2. Postal address information. Zip code has been requested as the predominant method for bundling cohorts of patients (e.g., all zip codes in the Kansas City Metropolitan Area) but we will bundle search criteria into regions defining populations greater than 20,000. Example: we will allow users to search for patients within a 5 mile radius of KUMC but not the zip code 64111.
Yes | No | 3. Social security numbers
Yes | No | 4. Account numbers
Yes | No | 5. Telephone & fax numbers
Yes | No | 6. Elements of dates for dates directly related to an individual, including birth date, admission date, discharge date, date of death. We will preserve the relationship between care encounters, but randomly shifted dates, not actual dates, will be stored in the de-identified repository. The date stored may be up to 365 days before the actual date of service.
Yes | No | 7. Medical record numbers
No * | No | 8. Certificate/license numbers
No * | No | 9. Electronic mail addresses
Yes | No | 10. Ages over 89 and all elements of dates indicative of such age
Yes | No | 11. Health plan beneficiary numbers
No * | No | 12. Vehicle identifiers & serial numbers, including license plate numbers
No * | No | 13. Device identifiers & serial numbers
No * | No | 14. Web Universal Resource Locators (URLs)
No * | No | 15. Internet Protocol (IP) address numbers
No (see note) | No | 16. Biometric identifiers, including finger and voice prints. Note: clinical molecular diagnostic results may be present in clinical laboratory results. We do not intend to incorporate large scale microarray expression data or full genome sequencing in HERON. If that were requested, we would submit a separate IRB application.
No * | No | 17. Full face photographic images & any comparable images
No * | No | 18. Any other unique identifying number, characteristic or code that is derived from or related to information about the individual

Identifiers marked with a ‘*’ are not believed to be captured in any of our data sources, but they may be added without our knowledge.

Since 2012

Our source systems now provide

  • email
  • device ids
  • order numbers, which also seem to qualify as "unique identifying numbers"

Misc Design Notes

Our objective is to hash data so that it cannot be traced back to the identified source. If HERON results prompt investigators to go back to the source, they will need to request access to identified data.

We will want to distinguish between:

  • elements or keys which may index things on the de-id server but are hidden to the user
  • what is viewable but not retrievable (such as timeline patient "descriptors")
  • what is distributed in datasets after a DUA

Which identifiers or "things" do we de-identify by removing, and which by obfuscating (either by text replacement or by hashing)?

  • Example of removal: patient name
  • Examples of hashing: MRN, case number, provider number
  • For discussion: clinic codes? Nursing units? (service line: #834)
  • See the visit dimension and Location_Cd, location Path (#201)
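The remove-versus-hash distinction above can be sketched as a per-field policy. This is only an illustration under stated assumptions: the field names (`patient_name`, `mrn`, `case_number`, `provider_number`), the `deidentify_record` helper, and the choice of a keyed SHA-256 are all hypothetical and are not taken from the HERON code.

```python
import hashlib
import hmac

# Hypothetical field names, for illustration only.
REMOVE_FIELDS = {"patient_name"}
HASH_FIELDS = {"mrn", "case_number", "provider_number"}

def deidentify_record(record, key):
    """Return a copy of record with some fields dropped and some hashed.

    key is a secret byte string held only on the identified side (assumption).
    """
    out = {}
    for field, value in record.items():
        if field in REMOVE_FIELDS:
            continue  # removed outright
        if field in HASH_FIELDS:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            out[field] = digest.hexdigest()  # one-way, but stable per value
        else:
            out[field] = value  # passed through unchanged
    return out

row = {"patient_name": "Jane Doe", "mrn": "12345", "dx_code": "250.00"}
clean = deidentify_record(row, key=b"example-key-not-for-production")
```

Because the hash is deterministic for a given key, two rows with the same MRN still join on the hashed value, which is what lets hidden keys index records on the de-identified server.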

Note deid tickets:

Ticket | Type | Summary | Resolution | Priority | Status | Owner
#1655 | defect | realistic date-of-death exposed details about ages > 90 | fixed | major | closed | ngraham
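One common way to handle identifier 10 (ages over 89) is to top-code ages into a single "90 or older" category. The sketch below illustrates that idea only; it is not the fix that was applied in #1655, and the function names are hypothetical.

```python
import datetime as dt

AGE_CAP = 90  # every age above 89 is reported as this one category

def age_in_years(birth, on):
    """Whole years elapsed between birth and the date `on`."""
    years = on.year - birth.year
    if (on.month, on.day) < (birth.month, birth.day):
        years -= 1  # birthday has not yet occurred in that year
    return years

def reported_age(birth, on):
    """Age as it would be released: exact up to 89, then capped."""
    return min(age_in_years(birth, on), AGE_CAP)
```

Note that capping the age alone is not enough if a birth date and a death date are both released, since their difference reveals the true age; that is the kind of leak the ticket summary describes.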

NIH Guidance on HIPAA

The NIH outlines general guidance on privacy concerns.


November 26, 2012

Today, OCR released guidance regarding methods for de-identification of protected health information in accordance with the HIPAA Privacy Rule. This guidance fulfills the American Recovery and Reinvestment Act of 2009 (ARRA) mandate that HHS issue such guidance. In response to this mandate, OCR collected research and views regarding de-identification approaches, best practices for implementation and management of the current de-identification standard and potential changes to address policy concerns. OCR solicited stakeholder input from experts with practical technical and policy experience to inform the creation of guidance materials by organizing an in-person workshop consisting of multiple panel sessions, each addressing a specific topic related to de-identification methodologies and policies. The workshop was open to the public and was held March 8-9, 2010 in Washington, DC. The guidance synthesizes these diverse perspectives. It can be found at

Obscuring Dates by shifting

main article: dateshifting

Offset: Russ proposed an offset of -365 to 0 days, based on work by Vanderbilt.

  • Let's double-check best practice.

Russ's general sense is that we offset dates but apply the same offset consistently across all of a patient's records.

Relative dates are important for research purposes...
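A minimal sketch of per-patient date shifting follows, assuming the 1 to 365 day range from the protocol excerpt. The helper names and sample data are hypothetical, and the seeded generator is used only to make the illustration repeatable; a real deployment would want an unpredictable source and would keep the offset table on the identified side.

```python
import datetime as dt
import random

def assign_offsets(patient_ids, rng):
    """Pick one shift of 1..365 days into the past for each patient."""
    return {pid: dt.timedelta(days=-rng.randint(1, 365)) for pid in patient_ids}

def shift_events(events, offsets):
    """events is a list of (patient_id, date); returns shifted copies."""
    return [(pid, day + offsets[pid]) for pid, day in events]

rng = random.Random(0)  # fixed seed: illustration only
events = [
    ("p1", dt.date(2010, 3, 1)),   # e.g. drug started
    ("p1", dt.date(2010, 3, 15)),  # e.g. adverse event, 14 days later
    ("p2", dt.date(2010, 3, 1)),
]
offsets = assign_offsets({pid for pid, _ in events}, rng)
shifted = shift_events(events, offsets)

# The interval within one patient's record survives the shift.
assert shifted[1][1] - shifted[0][1] == dt.timedelta(days=14)
```

Intervals within a patient are preserved exactly, while comparisons of absolute dates across patients are blurred by up to a year.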

Peer Approaches

UC Davis approach:

  • maintain a map table with the original patient id
  • use the oracle sequence generation to get a new sequence number as the fake patient id
  • generate a random number from -14 to +14 and use that offset for all dates relating to the patient
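The three steps above can be sketched with an in-memory stand-in for the map table. Everything here is an assumption made for illustration: the `PatientMap` class is hypothetical, `itertools.count` stands in for the Oracle sequence, and a dictionary stands in for the table.

```python
import datetime as dt
import itertools
import random

class PatientMap:
    """Maps an original patient id to a (fake id, day offset) pair."""

    def __init__(self, rng=None):
        self._seq = itertools.count(1)  # stand-in for a database sequence
        self._rng = rng or random.SystemRandom()
        self._rows = {}                 # stand-in for the map table

    def lookup(self, original_id):
        """Return the existing row, creating it on first sight."""
        if original_id not in self._rows:
            fake_id = next(self._seq)
            offset_days = self._rng.randint(-14, 14)
            self._rows[original_id] = (fake_id, offset_days)
        return self._rows[original_id]

    def deidentify(self, original_id, day):
        """Replace the id and shift the date by the patient's offset."""
        fake_id, offset_days = self.lookup(original_id)
        return fake_id, day + dt.timedelta(days=offset_days)
```

Unlike a hash, the sequence number carries no information about the original id, but the map table itself then becomes the sensitive artifact that must stay on the identified side.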


All dates in the EMR are shifted 1–364 days into the past

-- Roden et al 2008

Obscuring pseudo-identifiers with one-way hashing functions


Hash or sequence number

Potentially relevant approaches to hashing for medical data: Zero-Check: A Zero-Knowledge Protocol for Reconciling Patient Identities Across Institutions, Berman, 2004

in which we find, from HHS Regulations Re-Identification - § 164.514(c):

Since the HMAC allows identification of individuals by the recipient, disclosure of the HMAC violates the Rule.

Oracle support


-- hash type 3 is DBMS_CRYPTO.HASH_SH1 (SHA-1)
SQL> select dbms_crypto.hash(utl_raw.cast_to_raw('foo'), 3) from dual;

Note this assumes SYS has done:

SQL> grant execute on DBMS_CRYPTO to dconnolly;
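For cross-checking values outside the database, a rough Python counterpart of the query is sketched below. It assumes that hash type 3 selects SHA-1 (DBMS_CRYPTO.HASH_SH1) and that the database character set encodes the input the same way UTF-8 does, which holds for plain ASCII such as 'foo'. The `sha1_hex` helper is hypothetical.

```python
import hashlib

def sha1_hex(text):
    """SHA-1 of the UTF-8 bytes of text, as uppercase hex."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest().upper()

print(sha1_hex("foo"))
```

Uppercase hex is used because that is how RAW values are typically displayed by SQL*Plus.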

blog post by Berman: One-way hash: Perl, Python, Ruby, January 30, 2010

Peer approaches: Vanderbilt

In order to accomplish the goal of linking the clinical and DNA information in a de-identified fashion, the medical record number that labels each sample and each entry in the EMR is replaced with a research unique identifier (RUI) generated by the secure hash algorithm (SHA-512)

-- Roden et al 2008
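A minimal sketch of deriving such an identifier with SHA-512 follows. The quoted passage does not say whether a salt or key is involved, so the plain variant is the literal reading and the keyed variant is an addition of this sketch; both function names are hypothetical.

```python
import hashlib
import hmac

def rui_plain(mrn):
    """SHA-512 of the MRN alone, as 128 hex characters."""
    return hashlib.sha512(mrn.encode("utf-8")).hexdigest()

def rui_keyed(mrn, key):
    """HMAC-SHA-512 of the MRN under a secret key (assumption, see above)."""
    return hmac.new(key, mrn.encode("utf-8"), hashlib.sha512).hexdigest()
```

MRNs come from a small space, so an unkeyed hash can be reversed by hashing every candidate MRN and comparing; a key that stays on the identified side is one way to prevent that.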

Free Text

De-identifying information from free-text data sources is beyond the scope of milestone:HERON1.0

More recent discussion on 9/28/2010:

We will engage the MITRE team developing the MITRE Identification Scrubber Toolkit.

Literature Review

Peter Szolovits from MIT CSAIL is in the i2b2 contact list; he supervised a thesis that seems relevant. I (Dan) don't see much in the way of specific outcomes in the abstract; I wonder if it's worth reading:

Another good paper is Loukides and Malin PNAS.

Another Vanderbilt pub, re bioview:

Nominated by Frank J. Manion Chief Information Officer | University of Michigan Comprehensive Cancer Center to list.kfc.informatics.idr@… December 10, 2012 after AMIA:
