A medical informatics perspective on the role of metadata in the data lifecycle

Our group has been invited to a panel discussion:

  • Metadata Forum
    A discussion of the role of metadata in the data lifecycle
    Friday April 13, 2012
    11:30am - 1:00pm
    Watson Library, 503A and 503B

The panel questions have inspired this bit of thinking out loud:

What is your research area or discipline?

Our discipline is medical informatics. We're involved in two kinds of research:

  1. informatics services to support KUMC researchers in areas such as the cancer center, health of the public, etc.
  2. research in medical informatics per se; that is, looking at the electronic medical record (EMR) as a medical intervention and studying its impact

What do your data look like?

To our customers, we present a large and growing set of medical observations -- currently over 630 million observations -- using a tool called i2b2, developed at Harvard/Partners with NIH funding. It presents a hierarchy of terms:

  • under demographics it has age, gender, etc.;
  • diagnoses are organized using the ICD9 terminology;
  • there are terms for medications, lab results, procedures, etc.

This allows cohort identification queries such as "how many patients does the University of Kansas Hospital (KUH) see each year who are over the age of 35, have been diagnosed with diabetes, and have had an abnormal glucose lab result?"
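A query of this shape can be sketched in SQL. What follows is only a minimal illustration using Python's sqlite3 module and toy tables; the table and column names, the concept codes (`ICD9:250.00`, `LAB:GLUCOSE`), and the `'A'` flag for an abnormal result are assumptions made for the example, and the per-year breakdown is left out for brevity.

```python
import sqlite3

# Toy, in-memory stand-in for a clinical data repository.
# All names and codes here are hypothetical, chosen for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_num INTEGER PRIMARY KEY, age INTEGER);
CREATE TABLE observation (patient_num INTEGER, concept_cd TEXT, valueflag_cd TEXT);
""")
conn.executemany("INSERT INTO patient VALUES (?, ?)",
                 [(1, 52), (2, 40), (3, 29), (4, 61)])
conn.executemany("INSERT INTO observation VALUES (?, ?, ?)", [
    (1, "ICD9:250.00", None), (1, "LAB:GLUCOSE", "A"),  # meets all criteria
    (2, "ICD9:250.00", None), (2, "LAB:GLUCOSE", "N"),  # glucose result is normal
    (3, "ICD9:250.00", None), (3, "LAB:GLUCOSE", "A"),  # under the age cutoff
    (4, "LAB:GLUCOSE", "A"),                            # no diabetes diagnosis
])

def count_cohort(conn, min_age=35):
    """Count patients older than min_age with a diabetes diagnosis
    (ICD-9 250.xx) and at least one abnormal glucose lab result."""
    sql = """
    SELECT COUNT(*) FROM patient p
    WHERE p.age > ?
      AND EXISTS (SELECT 1 FROM observation o
                  WHERE o.patient_num = p.patient_num
                    AND o.concept_cd LIKE 'ICD9:250%')
      AND EXISTS (SELECT 1 FROM observation o
                  WHERE o.patient_num = p.patient_num
                    AND o.concept_cd = 'LAB:GLUCOSE'
                    AND o.valueflag_cd = 'A')
    """
    return conn.execute(sql, (min_age,)).fetchone()[0]

print(count_cohort(conn))  # 1 (only patient 1 meets every criterion)
```

Each criterion becomes its own EXISTS clause, so adding a further refinement means adding one more clause rather than restructuring the query.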

The data is not necessarily “ours” in that we take data from multiple sources, aggregate it, and provide a tool for knowledge discovery. For example, we integrate vital statistics from the U.S. Social Security Administration, so that the query above can be refined a la "... and how many of them are dead, according to the SSA?"

Are they structured or unstructured?

So far, we have our hands full with structured data (pulled from EMR, billing system, tumor registry, etc.).

A lot of work in our field is concerned with natural language processing of physicians' notes.

We haven't begun work in that direction, but we are among the first to use i2b2 to explore nursing observations. They dominate our database (over 400 million observations) and quite likely dominate EMR usage in the hospital. Plus, they contain basic information such as height and weight that is essential to screening for many studies.

Are they typically represented in tables or some other form (audio, video, transcripts)?

Integrating medical imaging with i2b2 has been done elsewhere, but we haven't gone beyond brainstorming about it. We were tangentially involved in a project to collect video samples from patients for one study.

But the vast majority of our work is with data stored in tables.

How are your data typically documented - in the form of a document, or in some structured form?

The bulk of our data comes from the KUH EMR. Much of our data is documented by the EMR vendor, and following long-standing billing practice, standards for diagnoses (ICD9, soon to be ICD10) and procedures (CPT) are used for much of the data in the EMR. But the hospital heavily customizes the installation as well. For example, the formulary of medicines and the list of labs are curated by the hospital.

Moving nursing flowsheets from paper to the EMR initially involved a huge number of design decisions made in very short order; many of those decisions are being reconsidered as the hospital gains experience. There is some overlap between the terms used in KUH flowsheets and standards such as SNOMED-CT and LOINC, but we have only scratched the surface of the work of mapping these terminologies.

Sources other than the EMR also vary as to the level of standardization of terminology. Our integration of the KUH tumor registry makes fairly straightforward use of the national standard for cancer registries, NAACCR. But our biospecimen repository uses a locally-curated terminology.

The bulk of this documentation is in tables and spreadsheets, with some documents and diagrams mixed in.

If your metadata are structured please describe that structure. Is it defined by something like a formal XML schema?

One way or another, we fit all of our metadata into i2b2's database schema. As a byproduct, i2b2 can produce an XML form of the metadata, following one of its XML schemas.
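To give a feel for how a hierarchy of terms can live in a flat table, here is a rough sketch in Python. It assumes each term is stored with a backslash-delimited full path, in the general style of i2b2 ontology tables; the specific rows, paths, and codes below are made up for the example and are not taken from our metadata.

```python
# Hypothetical metadata rows: (full path, display name, code).
# Folder-only terms carry no code of their own.
TERMS = [
    ("\\Demographics\\", "Demographics", None),
    ("\\Demographics\\Age\\", "Age", "DEM:AGE"),
    ("\\Diagnoses\\", "Diagnoses", None),
    ("\\Diagnoses\\250\\", "Diabetes mellitus", "ICD9:250"),
    ("\\Diagnoses\\250\\250.0\\",
     "Diabetes mellitus without mention of complication", "ICD9:250.0"),
]

def level(path):
    """Depth of a term: top-level folders are level 0."""
    return len(path.strip("\\").split("\\")) - 1

def children(path, terms=TERMS):
    """Display names of the terms exactly one level below `path`."""
    return [name for p, name, _ in terms
            if p.startswith(path) and level(p) == level(path) + 1]

print(children("\\Diagnoses\\"))  # ['Diabetes mellitus']
```

Because the parent/child relationship is implied by the path prefix, no separate parent pointer is needed; the trade-off is that moving a subtree means rewriting every path beneath it.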

Is it common in your area to think in terms of a data lifecycle?

If so, what does that view include (concepts and measures shared across studies, data reuse)?

We reload our data repository from the source systems monthly. This is something of a compromise between real-time updates from the EMR and one-time data gathering exercises such as chart reviews.

Our process for updating metadata is something of a patchwork. For flowsheets, we update it monthly along with the data. For ICD9 and CPT, we plan to update as they are republished annually, but we haven't tackled that just yet.

Are there tools available which help manage lifecycle metadata?

Various tools are under development in the i2b2 community; e.g. Health Ontology Mapper (HOM) by Rob Wynden et al. at UCSF. We haven't investigated them in much depth yet.

Can the metadata be expressed in Resource Description Framework (RDF) format as part of Linked Open Data?

NCBO is developing ontology services that integrate with i2b2 and provide RDF mappings. Again, we haven't investigated them in much depth yet.

Is there an archive offering ongoing curation of your data available to you?

How does that operate? (Are there issues with privacy, data size, financing, etc.?)

Are there requirements from that archive for how data and metadata are represented?

We interact with varying sorts of metadata curation, as discussed under documentation above.

Setting up a governance structure was a major task that took several months in the start-up phase of our clinical data repository project. We have a data request oversight committee (DROC) with representation from

  • the hospital (which provides the bulk of the EMR data),
  • the clinics (which originally provided diagnosis and procedure information from billing systems, but are increasingly adopting the EMR), and
  • KU medical center itself (which manages the biospecimen repository etc.).

To address HIPAA requirements for dealing with protected health information, not to mention institutional liability, we have technical approaches to de-identification, network security, etc.

Sources such as the tumor registry and biospecimen repository are curated data in their own right. The hospital is an institution of long standing that has robust systems for long-term EMR storage, though perhaps recording vital signs wouldn't normally be called curation.

The governance policies include being able to trace all data in our system back to its source. The i2b2 database schema includes auditing fields (import_date, update_date, sourcesystem_cd, ...) that make this reasonably straightforward.
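As a rough sketch of what such tracing might look like, the example below summarizes observations by source system using the auditing fields named above. The single-table layout, the source system labels, and the sample rows are hypothetical and far simpler than the real schema.

```python
import sqlite3

# Toy fact table carrying the auditing fields; sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE observation_fact (
    patient_num INTEGER, concept_cd TEXT,
    import_date TEXT, update_date TEXT, sourcesystem_cd TEXT)
""")
conn.executemany("INSERT INTO observation_fact VALUES (?, ?, ?, ?, ?)", [
    (1, "ICD9:250.00", "2012-03-01", "2012-02-14", "EMR"),
    (1, "FLOW:WEIGHT", "2012-03-01", "2012-02-20", "EMR"),
    (2, "TR:SITE",     "2012-03-01", "2012-01-30", "TUMOR_REGISTRY"),
    (2, "SPEC:1",      "2012-02-01", "2012-01-15", "BIOSPECIMEN"),
])

def provenance_summary(conn):
    """Per source system: number of facts and the most recent import date."""
    return conn.execute("""
        SELECT sourcesystem_cd, COUNT(*), MAX(import_date)
        FROM observation_fact
        GROUP BY sourcesystem_cd
        ORDER BY sourcesystem_cd
    """).fetchall()

def sources_for_patient(conn, patient_num):
    """Distinct source systems that contributed facts about one patient."""
    rows = conn.execute("""
        SELECT DISTINCT sourcesystem_cd FROM observation_fact
        WHERE patient_num = ? ORDER BY sourcesystem_cd
    """, (patient_num,)).fetchall()
    return [r[0] for r in rows]

for row in provenance_summary(conn):
    print(row)
```

Dates are stored as ISO 8601 strings here so that MAX() orders them chronologically; a production database would more likely use a native date type.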

Moving forward – Would it be useful for us to have more sessions?

A number of i2b2 sites participate in federated query networks which allow researchers to broaden their cohort identification queries and validate their findings more widely. In the medium to long term, we're interested in the sort of terminology alignment that it takes to participate in these networks, but it's not yet high on our list of priorities.

Another motivation for terminology alignment is health information exchange. We're monitoring HIE efforts in Kansas, but again, it's not yet high on our list of priorities.

As we complete other projects and make room for more work on terminology alignment and data interchange, we hope to be able to participate more actively.

