Changes between Initial Version and Version 23 of DeIdentificationStrategy

Remember: No patient names, identifiers, or other PHI

Oct 24, 2014 11:36:57 AM (7 years ago)



     5== HIPAA Identifiers section of HERON IRB Protocol ==
     7''This is an excerpt from **HERON Repository IRB Protocol v2.1** of 2012. See [[HERON#governance]] for full text.''
     9We will be transforming identified data into a form that addresses all of the 18 de-identification criteria. We will shift all dates in the EMR 1–365 days into the past; the shift is different across records but constant within the records of each patient, thereby allowing temporal analyses such as the development of adverse effects after a drug. We have listed the identifiers specified by HIPAA and whether they will be included in our data sources and the general i2b2 repository. While de-identified, we will be requesting that investigators treat released data with the same sensitivity as a limited data set.
     11|| Included in Source Data ||  Included in de-identified i2b2 repository || Identifier ||
     12|| Yes || No || 1. Names ||
     13|| Yes || No || 2. Postal address information. Zipcode has been requested as the predominant method for bundling cohorts of patients (ex: all zipcodes in Kansas City Metropolitan Area) but we will bundle search criteria into regions defining populations greater than 20,000. Example: we will allow users to search for patients within a 5 mile radius of KUMC but not the zip code 64111 ||
     14|| Yes || No || 3. Social security numbers ||
     15|| Yes || No || 4. Account numbers ||
     16|| Yes || No || 5. Telephone & fax numbers ||
     17|| Yes || No || 6. Elements of dates for dates directly related to an individual, including birth date, admission date, discharge date, date of death. We will preserve the relationship between care encounters but randomly shifted dates, not actual dates, will be stored in the de-identified repository. The data stored may be up to 365 days before the actual date of service. ||
     18|| Yes || No || 7. Medical record numbers ||
     19|| No * || No || 8. Certificate/license numbers ||
     20|| No * || No || 9. Electronic mail addresses ||
     21|| Yes || No ||  10. Ages over 89 and all elements of dates indicative of such age ||
     22|| Yes || No || 11. Health plan beneficiary numbers ||
     23|| No * || No || 12. Vehicle identifiers & serial numbers, including license plate numbers ||
     24|| No * || No || 13. Device identifiers & serial numbers ||
     25|| No * || No || 14. Web Universal Resource Locators (URLs) ||
     26|| No * || No || 15. Internet Protocol (IP) address numbers ||
     27|| No (see note) || No || 16. Biometric identifiers, including fingers and voice prints. Clinical molecular diagnostic results may be present in clinical laboratory results. We do not intend to incorporate large scale microarray expression data or full genome sequencing in HERON. If that was requested, we would submit a separate IRB application. ||
     28|| No * || No || 17. Full face photographic images & any comparable images ||
     29|| No * || No || 18. Any other unique identifying number, characteristic or code that is derived from or related to information about the individual ||
     31Identifiers marked with a ‘*’ are not believed to be captured in any of our data sources, but they may be added without our knowledge.
     33=== Since 2012 ===
     35Our source systems now provide
     37 - email
     38 - device ids
     39 - order numbers also seem to qualify as "unique identifying numbers"
     41== Misc Design Notes ==
     43Our objective is to hash data so that it cannot be traced back to the identified source.  If [[HERON]] results prompt investigators to go back to the source, they will need to request access to identified data.
     45Will want to distinguish between
     47 * elements or keys which may index things on the de-id server but are hidden to the user
     48 * what is viewable but not retrievable (such as timeline patient "descriptors")
     49 * what is distributed in datasets after a DUA
     52Which identifiers or "things" do we de-identify by removing, and which by obfuscating (either by text replacement or by hash)?
     53- Example of removal: patient name.
     54- Examples of hashing: MRN, case number, provider number
     55- For discussion: clinic codes?  Nursing units?  (service line: #834)
     56- See the visit dimension and Location_Cd, location Path (#201)
     59Note deid tickets:
     63== NIH Guidance on HIPAA ==
     65The NIH outline [ General guidance on privacy concerns].
     69  November 26, 2012
     71  Today, OCR released guidance regarding methods for de-identification of protected health information in accordance with the HIPAA Privacy Rule.  This guidance fulfills the American Recovery and Reinvestment Act of 2009 (ARRA) mandate that HHS issue such guidance. In response to this mandate, OCR collected research and views regarding de-identification approaches, best practices for implementation and management of the current de-identification standard and potential changes to address policy concerns.  OCR solicited stakeholder input from experts with practical technical and policy experience to inform the creation of guidance materials by organizing an in-person workshop consisting of multiple panel sessions, each addressing a specific topic related to de-identification methodologies and policies. The workshop was open to the public and was held March 8-9, 2010 in Washington, DC.  The guidance synthesizes these diverse perspectives.  It can be found at
     75== Obscuring Dates by shifting == #dateshifting
     77main article: [[dateshifting]]
     80note: Russ wrote -365 to 0 as how we would do an offset based on work by Vanderbilt.
     81- let's just double check best practice.
     82Russ' general sense would be we offset but apply the same offset consistently across the patient
     84Relative dates are important for research purposes...
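A minimal sketch of the per-patient shift described above, assuming the 1–365 day range from the IRB excerpt. The names `assign_offsets` and `shift_date` are hypothetical, and `random.Random` is used only so the example is repeatable; it is not meant as a recommendation for how production offsets are drawn or stored.

```python
import random
from datetime import date, timedelta

def assign_offsets(patient_ids, seed=None):
    """Pick one offset per patient, 1..365 days (range taken from the IRB excerpt)."""
    rng = random.Random(seed)  # assumption: illustration only, not a vetted source of randomness
    return {pid: rng.randint(1, 365) for pid in patient_ids}

def shift_date(actual, offset_days):
    """Move a date into the past by the patient's offset."""
    return actual - timedelta(days=offset_days)

# Hypothetical patient with two encounter dates.
offsets = assign_offsets(["patient-a", "patient-b"], seed=0)
admit, discharge = date(2010, 3, 1), date(2010, 3, 5)
shifted_admit = shift_date(admit, offsets["patient-a"])
shifted_discharge = shift_date(discharge, offsets["patient-a"])

# The same offset is applied to every date of the patient,
# so the interval between encounters is unchanged.
print(shifted_discharge - shifted_admit)
```

Because the offset is constant within a patient, relative dates survive the shift, which is the property the notes above call out as important for research.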
     86=== Peer Approaches ===
     88UC Davis approach:
     89- maintain a map table with the original patient id
     90- use the oracle sequence generation to get a new sequence number as the fake patient id
     91- generate a random number from -14 to +14 and use that offset for all dates relative to the patient.
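The three UC Davis steps listed above can be sketched roughly as follows. This is an illustration under assumptions, not their implementation: `PatientMap` is a hypothetical name, `itertools.count` stands in for the Oracle sequence, and an in-memory dict stands in for the map table.

```python
import itertools
import random
from datetime import date, timedelta

class PatientMap:
    """Map table: original patient id -> (fake sequence id, date offset in days)."""

    def __init__(self, seed=None):
        self._sequence = itertools.count(1)   # stand-in for an Oracle sequence
        self._rng = random.Random(seed)       # assumption: illustration only
        self._rows = {}

    def lookup(self, original_id):
        # Create the row on first sight; reuse it afterwards so the
        # fake id and the offset stay constant for the patient.
        if original_id not in self._rows:
            self._rows[original_id] = (next(self._sequence),
                                       self._rng.randint(-14, 14))
        return self._rows[original_id]

    def deidentify(self, original_id, event_date):
        fake_id, offset = self.lookup(original_id)
        return fake_id, event_date + timedelta(days=offset)

pm = PatientMap(seed=1)
print(pm.deidentify("hypothetical-id-A", date(2010, 1, 1)))
print(pm.deidentify("hypothetical-id-A", date(2010, 1, 10)))
```

The map table itself links real and fake ids, so in any design like this it would have to stay on the identified side.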
     95  All dates in the EMR are shifted 1–364 days into the past
     97  -- [ Roden et al 2008]
     99== Obscuring pseudo-identifiers with one-way hashing functions ==
     104Hash or sequence number
     105- Russ said hash in the proposal.  Arvinder and UC Davis used a sequence number approach.
     106- Russ: double check if Vandy uses a hash.
     107- need to decide how big of hash and what to do about collisions
     108  - odds of md5 collisions are around 10^-18^ for billions of documents; that's lower than the odds of a bit error on a disk. (see [ wikipedia birthday problem article])
     109- ''[ Threshold protocol for the exchange of confidential medical data]'', Berman, 2002
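As a rough check on the MD5 collision estimate in the list above, the usual birthday approximation p ≈ 1 − exp(−n(n−1) / (2·2^b)) can be evaluated directly. The function name `collision_probability` is hypothetical, and the item counts are illustrative choices rather than HERON figures. The estimate covers accidental collisions only.

```python
import math

def collision_probability(n_items, hash_bits):
    """Birthday approximation for the chance that any two of n_items
    uniformly random hash values of hash_bits bits are equal."""
    exponent = n_items * (n_items - 1) / (2 * 2 ** hash_bits)
    # expm1 keeps precision when the probability is far below 1e-16.
    return -math.expm1(-exponent)

# MD5 produces 128-bit digests.
for n in (10 ** 9, 26 * 10 ** 9):
    print(f"{n:.1e} items: p = {collision_probability(n, 128):.2e}")
```

Under this approximation the probability reaches the order of 10^-18^ at a few tens of billions of items and is smaller still for one billion.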
     112Potentially relevant approaches to hashing for medical data:
     113[ Zero-Check: A Zero-Knowledge Protocol for Reconciling Patient Identities Across Institutions], Berman, 2004
     115in which we find, from [ HHS Regulations Re-Identification - § 164.514(c)]:
     117  Since the HMAC allows identification of individuals by the recipient, disclosure of the HMAC violates the Rule.
     120=== Oracle support ===
     122Using [ DBMS_CRYPTO]:
     125SQL> select dbms_crypto.hash( utl_raw.cast_to_raw('foo'), 3) from dual;
     129Note this assumes SYS has done:
     132SQL> grant execute on DBMS_CRYPTO to dconnolly;
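To cross-check the Oracle result outside the database, the same digest can be computed with Python's standard `hashlib`. This sketch assumes that hash type `3` in the query above is `DBMS_CRYPTO.HASH_SH1` and that SQL*Plus displays the RAW result as uppercase hex; `oracle_style_sha1` is a hypothetical helper name.

```python
import hashlib

def oracle_style_sha1(text):
    """SHA-1 of the text's bytes, formatted as uppercase hex like a RAW value."""
    # Assumption: the database character set encodes this text the same way as UTF-8,
    # which holds for plain ASCII input such as 'foo'.
    return hashlib.sha1(text.encode("utf-8")).hexdigest().upper()

print(oracle_style_sha1("foo"))
# 0BEEC7B5EA3F0FDBC95D0DD47F3C5BC275DA8A33
```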
     135blog post by Berman: [ One-way hash: Perl, Python, Ruby], January 30, 2010
     138=== Peer approaches: Vanderbilt ===
     140  In order to accomplish the goal of linking the clinical and
     141  DNA information in a de-identified fashion, the medical record number
     142  that labels each sample and each entry in the EMR is replaced with a
     143  research unique identifier (RUI) generated by the secure hash algorithm (SHA-512)
     145  -- [ Roden et al 2008]
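A minimal sketch of deriving a research identifier from a medical record number with SHA-512, as the quoted passage describes. The excerpt does not say whether anything besides the MRN goes into the hash, so this version hashes the MRN alone; that choice, the function name `research_unique_id`, and the sample values are assumptions for illustration.

```python
import hashlib

def research_unique_id(mrn):
    """Return the SHA-512 digest of the MRN as a 128-character hex string."""
    return hashlib.sha512(mrn.encode("utf-8")).hexdigest()

# Made-up record numbers, not real identifiers.
rui_1 = research_unique_id("0000001")
rui_2 = research_unique_id("0000002")

# The same MRN always yields the same RUI, which is what lets samples
# and EMR entries be linked without storing the MRN itself.
print(rui_1 == research_unique_id("0000001"))
print(len(rui_1))
```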
     148== Free Text ==
     149De-identifying information from free-text data sources is beyond the scope of milestone:HERON1.0
     151More recent discussion on 9/28/2010:
     153We will engage MITRE team developing the [ MITRE Identification Scrubber Toolkit].
     156== Literature Review ==
     158[ Peter Szolovits] from MIT CSAIL is in the [ i2b2 contact list]; he supervised a thesis that seems relevant. I (Dan) don't see much in the way of specific outcomes in the abstract; I wonder if it's worth reading:
     160 * [ Privacy and identifiability in clinical research, personalized medicine, and public health surveillance]
     162Another good paper is [attachment:PNASLoukidesMalinAnonGWAS.pdf Loukides and Malin PNAS].
     164Another Vanderbilt pub, re bioview:
     166 * [ Assessing the accuracy of observer-reported ancestry in a biorepository linked to electronic medical records.]
     167   Genet Med. 2010 Oct;12(10):648-50.
     169Nominated by Frank J. Manion, Chief Information Officer, [ University of Michigan Comprehensive Cancer Center], on December 10, 2012, after AMIA:
     171 * Kushida, C. A., Nichols, D. A., Jadrnicek, R., Miller, R., Walsh, J. K., & Griffin, K. (2012). [ Strategies for De-identification and Anonymization of Electronic Health Record Data for Use in Multicenter Research Studies]. Medical Care, 50.
     172 * Ferrandez, O., South, B., Shen, S., Friedlin, F., Samore, M., & Meystre, S. (2012). ''Evaluating current automatic de-identification methods with Veteran’s health administration clinical documents''. BMC Medical Research Methodology, 12(1), 109.
     175 * Kathleen Benitez, Bradley Malin [ Evaluating re-identification risks with respect to the HIPAA privacy rule]
     176   \\J Am Med Inform Assoc 2010;17:169-177 doi:10.1136/jamia.2009.000026