Changes between Initial Version and Version 23 of DeIdentificationStrategy

Timestamp:
Oct 24, 2014 11:36:57 AM (7 years ago)
Author:
dconnolly
     1[[PageOutline]]
     2
     3
     4
     5== HIPAA Identifiers section of HERON IRB Protocol ==
     6
     7''This is an excerpt from **HERON Repository IRB Protocol v2.1** of 2012. See [[HERON#governance]] for full text.''
     8
     9We will be transforming identified data into a form that addresses all of the 18 de-identification criteria. We will shift all dates in the EMR 1–365 days into the past; the shift is different across records but constant within the records of each patient, thereby allowing temporal analyses such as the development of adverse effects after a drug. We have listed the identifiers specified by HIPAA and whether they will be included in our data sources and the general i2b2 repository. Although the data are de-identified, we will request that investigators treat released data with the same sensitivity as a limited data set.
     10
     11|| Included in Source Data ||  Included in de-identified i2b2 repository || Identifier ||
     12|| Yes || No || 1. Names ||
     13|| Yes || No || 2. Postal address information. Zipcode has been requested as the predominant method for bundling cohorts of patients (ex: all zipcodes in Kansas City Metropolitan Area) but we will bundle search criteria into regions defining populations greater than 20,000. Example: we will allow users to search for patients within a 5 mile radius of KUMC but not the zip code 64111 ||
     14|| Yes || No || 3. Social security numbers ||
     15|| Yes || No || 4. Account numbers ||
     16|| Yes || No || 5. Telephone & fax numbers ||
     17|| Yes || No || 6. Elements of dates for dates directly related to an individual, including birth date, admission date, discharge date, date of death. We will preserve the relationship between care encounters but randomly shifted dates, not actual dates, will be stored in the de-identified repository. The data stored may be up to 365 days before the actual date of service. ||
     18|| Yes || No || 7. Medical record numbers ||
     19|| No * || No || 8. Certificate/license numbers ||
     20|| No * || No || 9. Electronic mail addresses ||
     21|| Yes || No ||  10. Ages over 89 and all elements of dates indicative of such age ||
     22|| Yes || No || 11. Health plan beneficiary numbers ||
     23|| No * || No || 12. Vehicle identifiers & serial numbers, including license plate numbers ||
     24|| No * || No || 13. Device identifiers & serial numbers ||
     25|| No * || No || 14. Web Universal Resource Locators (URLs) ||
     26|| No * || No || 15. Internet Protocol (IP) address numbers ||
     27|| No (see note) || No || 16. Biometric identifiers, including finger and voice prints. Clinical molecular diagnostic results may be present in clinical laboratory results. We do not intend to incorporate large scale microarray expression data or full genome sequencing in HERON. If that were requested, we would submit a separate IRB application. ||
     28|| No * || No || 17. Full face photographic images & any comparable images ||
     29|| No * || No || 18. Any other unique identifying number, characteristic or code that is derived from or related to information about the individual ||
     30
     31Identifiers marked with a ‘*’ are not believed to be captured in any of our data sources, but they may be added without our knowledge.
     32
     33=== Since 2012 ===
     34
     35Our source systems now provide:
     36
     37 - email addresses
     38 - device ids
     39 - order numbers, which also seem to qualify as "unique identifying numbers"
     40
     41== Misc Design Notes ==
     42
     43Our objective is to hash data so that it cannot be traced back to the identified source.  If results from [[HERON]] prompt investigators to go back to the source, they will need to request access to identified data.
     44
     45We will want to distinguish between:
     46
     47 * elements or keys which may index things on the de-id server but are hidden to the user
     48 * what is viewable but not retrievable (such as timeline patient "descriptors")
     49 * what is distributed in datasets after a DUA
     50
     51
     52Which identifiers or "things" do we de-identify by removing them, and which by obfuscating them, either by text replacement or by hashing?
     53- Example of removal: patient name.
     54- Examples of hashing: MRN, case number, provider number.
     55- For discussion: clinic codes?  Nursing units?  (service line: #834)
     56- See the visit dimension and Location_Cd, location Path (#201)
     57
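The remove-versus-hash distinction above can be sketched in Python as a per-field policy. This is only an illustration: the field names, the `POLICY` table, and the use of SHA-1 are assumptions made for the example, not the HERON implementation. Fields that are not listed are dropped, so a column that appears in a source system without our knowledge is not released by accident.

```python
import hashlib

# Hypothetical field names and actions, for illustration only.
POLICY = {
    "patient_name": "remove",
    "mrn": "hash",
    "case_number": "hash",
    "provider_number": "hash",
    "dx_code": "keep",
}


def one_way(value):
    """Return a hex digest of the value (SHA-1 chosen only as a placeholder)."""
    return hashlib.sha1(str(value).encode("utf-8")).hexdigest()


def deidentify(record, policy=POLICY):
    """Apply the policy to one record; unlisted fields are removed."""
    out = {}
    for field, value in record.items():
        action = policy.get(field, "remove")
        if action == "hash":
            out[field] = one_way(value)
        elif action == "keep":
            out[field] = value
        # "remove" (or any unknown action): leave the field out entirely
    return out
```

For example, `deidentify({"patient_name": "...", "mrn": "123", "dx_code": "250.00"})` returns a record with no name, a 40-character digest in place of the MRN, and the diagnosis code unchanged.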
     58
     59Note deid tickets:
     60
     61[[TicketQuery(order=status,col=type|summary|resolution|priority|status|owner,format=table,keywords=~deid,component=data-repository)]]
     62
     63== NIH Guidance on HIPAA ==
     64
     65The NIH outlines [http://privacyruleandresearch.nih.gov/pr_02.asp general guidance on privacy concerns].
     66
     67Update:
     68
     69  November 26, 2012
     70 
     71  Today, OCR released guidance regarding methods for de-identification of protected health information in accordance with the HIPAA Privacy Rule.  This guidance fulfills the American Recovery and Reinvestment Act of 2009 (ARRA) mandate that HHS issue such guidance. In response to this mandate, OCR collected research and views regarding de-identification approaches, best practices for implementation and management of the current de-identification standard and potential changes to address policy concerns.  OCR solicited stakeholder input from experts with practical technical and policy experience to inform the creation of guidance materials by organizing an in-person workshop consisting of multiple panel sessions, each addressing a specific topic related to de-identification methodologies and policies. The workshop was open to the public and was held March 8-9, 2010 in Washington, DC.  The guidance synthesizes these diverse perspectives.  It can be found at http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html.
     72
     73
     74
     75== Obscuring Dates by shifting == #dateshifting
     76
     77Main article: [[dateshifting]]
     78
     79Offset:
     80- Russ wrote -365 to 0 days as our offset range, based on work by Vanderbilt.
     81- Let's double-check best practice.
     82- Russ' general sense is that we shift dates, applying the same offset consistently to all of a patient's records.
     83
     84Relative dates are important for research purposes...
     85
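A minimal Python sketch of per-patient date shifting, assuming a backward shift of 1 to 365 days as in the protocol excerpt. The function names and the in-memory `offsets` dict are hypothetical; the point is only that one offset per patient keeps the intervals between that patient's events intact.

```python
import random
from datetime import date, timedelta


def assign_offsets(patient_ids, rng, max_days=365):
    """Pick one backward shift, in days, for each patient."""
    return {pid: rng.randint(1, max_days) for pid in patient_ids}


def shift_date(d, patient_id, offsets):
    """Move a date into the past by the patient's own offset."""
    return d - timedelta(days=offsets[patient_id])


# Seeded only so the demo is repeatable; a real run would want an
# unpredictable source such as random.SystemRandom().
rng = random.Random(0)
offsets = assign_offsets(["p1", "p2"], rng)

admit = date(2010, 3, 1)
discharge = date(2010, 3, 5)
shifted_admit = shift_date(admit, "p1", offsets)
shifted_discharge = shift_date(discharge, "p1", offsets)

# The length of stay survives the shift even though both dates moved.
assert shifted_discharge - shifted_admit == discharge - admit
```

Because the offset is constant within a patient, any analysis based on relative dates (time from drug order to adverse event, length of stay) gives the same answer on shifted data.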
     86=== Peer Approaches ===
     87
     88UC Davis approach:
     89- maintain a map table with the original patient id
     90- use the oracle sequence generation to get a new sequence number as the fake patient id
     91- generate a random number from -14 to +14 and use that offset for all dates relative to the patient. 
     92
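The UC Davis steps above can be sketched as follows. This is a guess at the shape of the approach from the three bullets, not their code: `PatientMap` is a hypothetical in-memory stand-in for the map table, and `itertools.count` plays the role of the Oracle sequence.

```python
import itertools
import random
from datetime import timedelta


class PatientMap:
    """Stand-in for a map table: original id -> (fake id, day offset)."""

    def __init__(self, rng=None):
        self._rng = rng if rng is not None else random.SystemRandom()
        self._seq = itertools.count(1)  # plays the role of a DB sequence
        self._rows = {}

    def lookup(self, original_id):
        """Return the (fake id, offset) row, creating it on first use."""
        if original_id not in self._rows:
            fake_id = next(self._seq)
            offset = self._rng.randint(-14, 14)  # inclusive on both ends
            self._rows[original_id] = (fake_id, offset)
        return self._rows[original_id]

    def deidentify(self, original_id, event_date):
        """Return the fake id and the event date moved by the patient's offset."""
        fake_id, offset = self.lookup(original_id)
        return fake_id, event_date + timedelta(days=offset)
```

Two things to note about this design: the map table itself is identifying and has to stay on the identified side, and a range of -14 to +14 includes 0, so some patients would not be shifted at all unless 0 is excluded.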
     93Vanderbilt:
     94
     95  All dates in the EMR are shifted 1–364 days into the past
     96
     97  -- [http://www.ncbi.nlm.nih.gov/pubmed/18500243 Roden et al 2008]
     98
     99== Obscuring pseudo-identifiers with one-way hashing functions ==
     100
     101#207
     102
     103
     104Hash or sequence number
     105- Russ said hash in the proposal.  Arvinder and UC Davis used a sequence number approach.
     106- Russ: double check if Vandy uses a hash.
     107- need to decide how big a hash to use and what to do about collisions
     108  - odds of accidental MD5 collisions are around 10^-18^ for billions of documents; that's lower than the odds of a bit error on a disk. (see [http://en.wikipedia.org/wiki/Birthday_problem Wikipedia birthday problem article])
     109- ''[http://www.biomedcentral.com/1471-2288/2/12/ Threshold protocol for the exchange of confidential medical data]'', Berman, 2002
     110
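The collision estimate can be checked with the usual birthday approximation, p ≈ 1 - exp(-n(n-1) / (2·2^b)) for n items and a b-bit hash. The sketch below is a back-of-the-envelope calculation that assumes hash outputs behave like uniform random values; it says nothing about deliberately constructed collisions, which are a known weakness of MD5.

```python
import math


def collision_probability(n_items, hash_bits):
    """Birthday approximation for at least one accidental collision."""
    space = 2 ** hash_bits
    exponent = n_items * (n_items - 1) / (2 * space)
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-exponent)


p_md5_billion = collision_probability(10**9, 128)       # MD5 is 128 bits
p_md5_30_billion = collision_probability(3 * 10**10, 128)
```

Under this approximation one billion items gives a probability on the order of 10^-21, and the 10^-18 figure is reached at a few tens of billions of items, so the number quoted above is a reasonable upper estimate for "billions of documents".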
     111
     112Potentially relevant approaches to hashing for medical data:
     113[http://www.archivesofpathology.org/doi/pdf/10.1043/1543-2165%282004%29128%3C344%3AZAZPFR%3E2.0.CO%3B2 Zero-Check: A Zero-Knowledge Protocol for Reconciling Patient Identities Across Institutions], Berman, 2004
     114
     115in which we find, from [http://www.bricker.com/services/resource-details.aspx?resourceid=368 HHS Regulations Re-Identification - § 164.514(c)]:
     116
     117  Since the HMAC allows identification of individuals by the recipient, disclosure of the HMAC violates the Rule.
     118
     119
     120=== Oracle support ===
     121
     122Using [http://download-uk.oracle.com/docs/cd/B19306_01/appdev.102/b14258/d_crypto.htm#i1004371 DBMS_CRYPTO]:
     123
     124{{{
     125SQL> select dbms_crypto.hash( utl_raw.cast_to_raw('foo'), 3) from dual;
     1260BEEC7B5EA3F0FDBC95D0DD47F3C5BC275DA8A33
     127}}}
     128
     129Note that this assumes SYS has run:
     130
     131{{{
     132SQL> grant execute on DBMS_CRYPTO to dconnolly;
     133}}}
     134
     135blog post by Berman: [http://julesberman.blogspot.com/2010/01/one-way-hash-perl-python-ruby.html One-way hash: Perl, Python, Ruby], January 30, 2010
     136
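The Oracle result above can be cross-checked outside the database. Hash type 3 in DBMS_CRYPTO is SHA-1, so Python's `hashlib` should give the same digest for the same bytes. The helper name `sha1_hex` is made up for this example, and UTF-8 encoding is an assumption; it matches `utl_raw.cast_to_raw` for plain ASCII input such as 'foo', but may differ for other characters depending on the database character set.

```python
import hashlib


def sha1_hex(text):
    """Upper-case hex SHA-1 digest, formatted the way SQL*Plus prints RAW values."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest().upper()


# Same value as the DBMS_CRYPTO example above.
assert sha1_hex("foo") == "0BEEC7B5EA3F0FDBC95D0DD47F3C5BC275DA8A33"
```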
     137
     138=== Peer approaches: Vanderbilt ===
     139
     140  In order to accomplish the goal of linking the clinical and
     141  DNA information in a de-identified fashion, the medical record number
     142  that labels each sample and each entry in the EMR is replaced with a
     143  research unique identifier (RUI) generated by the secure hash algorithm (SHA-512)
     144
     145  -- [http://www.ncbi.nlm.nih.gov/pubmed/18500243 Roden et al 2008]
     146
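A sketch of how a hashed identifier lets two sources be linked without carrying the MRN, in the spirit of the quote above. The function names and row layout are hypothetical, and the excerpt does not say how the input to SHA-512 is prepared (for example, whether a secret value is mixed in), so this plain hash of the MRN is an assumption.

```python
import hashlib


def research_unique_id(mrn):
    """Replace an MRN with its SHA-512 hex digest (128 hex characters)."""
    return hashlib.sha512(mrn.encode("utf-8")).hexdigest()


def link_by_rui(emr_rows, sample_rows):
    """Group (mrn, payload) rows from both sources under the RUI; the MRN is dropped."""
    linked = {}
    for mrn, entry in emr_rows:
        rui = research_unique_id(mrn)
        linked.setdefault(rui, {"emr": [], "samples": []})["emr"].append(entry)
    for mrn, sample in sample_rows:
        rui = research_unique_id(mrn)
        linked.setdefault(rui, {"emr": [], "samples": []})["samples"].append(sample)
    return linked
```

One caution for our own design: an unkeyed hash of a short numeric identifier can be reversed by hashing every possible value and comparing, so the hash alone does not make an MRN safe to release.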
     147
     148== Free Text ==
     149De-identifying information from free-text data sources is beyond the scope of milestone:HERON1.0.
     150
     151More recent discussion on 9/28/2010:
     152
     153We will engage the MITRE team developing the [http://sourceforge.net/projects/mist-deid/ MITRE Identification Scrubber Toolkit].
     154
     155
     156== Literature Review ==
     157
     158[http://groups.csail.mit.edu/medg/people/psz/home/Pete_MEDG_site/Home.html Peter Szolovits] from MIT CSAIL is in the [https://www.i2b2.org/about/contact.html i2b2 contact list]; he supervised a thesis that seems relevant. I (Dan) don't see much in the way of specific outcomes in the abstract; I wonder if it's worth reading:
     159
     160 * [http://dspace.mit.edu/handle/1721.1/45624 Privacy and identifiability in clinical research, personalized medicine, and public health surveillance]
     161
     162Another good paper is [attachment:PNASLoukidesMalinAnonGWAS.pdf Loukides and Malin PNAS].
     163
     164Another Vanderbilt publication, regarding BioVU:
     165
     166 * [http://www.ncbi.nlm.nih.gov/pubmed/20733501 Assessing the accuracy of observer-reported ancestry in a biorepository linked to electronic medical records.]
     167   Genet Med. 2010 Oct;12(10):648-50.
     168
     169Nominated by Frank J. Manion, Chief Information Officer, [http://www.cancer.med.umich.edu/ University of Michigan Comprehensive Cancer Center], to list.kfc.informatics.idr@ctsacentral.org on December 10, 2012, after AMIA:
     170
     171 * Kushida, C. A., Nichols, D. A., Jadrnicek, R., Miller, R., Walsh, J. K., & Griffin, K. (2012). [http://journals.lww.com/lww-medicalcare/Fulltext/2012/07001/Strategies_for_De_identification_and_Anonymization.17.aspx Strategies for De-identification and Anonymization of Electronic Health Record Data for Use in Multicenter Research Studies]. Medical Care, 50.
     172 * Ferrandez, O., South, B., Shen, S., Friedlin, F., Samore, M., & Meystre, S. (2012). ''Evaluating current automatic de-identification methods with Veteran’s health administration clinical documents''. BMC Medical Research Methodology, 12(1), 109.
     173
     174
     175 * Kathleen Benitez, Bradley Malin [http://jamia.bmj.com/content/17/2/169 Evaluating re-identification risks with respect to the HIPAA privacy rule]
     176   \\J Am Med Inform Assoc 2010;17:169-177 doi:10.1136/jamia.2009.000026
     177