Posts for the month of April 2012

HERON: Big Hill release adds home medications, other medication orders, and expands UHC data

The Big Hill release adds support for home medications and expands UHC coverage. Previously, we only had medications that are dispensed by the hospital pharmacy; now, you'll be able to see

  • medications that the patient reports taking at home (so-called "historical medications"),
  • all other orders, including prescriptions, inpatient medication orders, and discharge medication orders, and
  • data for an additional 100K patients.

The UHC data, although primarily administrative, provides a new view on a patient's interaction at the hospital. New UHC concepts include:

  • ICU length of stay: Search on time spent in ICU.
  • Admission and Discharge status concepts: Search where patients came from prior to admission or went to upon discharge.
  • Readmission concept: Search for patients readmitted after discharge to their home, within a number of days that you specify.
  • Clinical Classification Software (CCS) ICD-9: These codes provided by AHRQ collapse ICD-9 diagnosis and procedure codes into a smaller number of categories useful in analyzing data. See AHRQ web site.
  • All Patient Refined Diagnosis Related Groups (APR DRGs): DRG codes expanded to include 4 subclasses in severity of illness and mortality subgroups for each code. See web site.
  • Major Diagnostic Categories (MDCs): Search these diagnosis categories created by combining ICD-9 diagnosis codes into 25 MDCs. These codes, which are used primarily for administrative and billing purposes, provide another view on patient data.
  • Comorbidity: Search 29 comorbidity categories.
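
To make the readmission concept concrete, here is a minimal sketch of the underlying idea: flag patients whose next admission falls within a chosen number of days after a discharge to home. The record layout, the `ENCOUNTERS` sample data, and the `readmitted_within` helper are all hypothetical, made up for illustration; this is not how the UHC concept is computed in HERON.

```python
from datetime import date

# Hypothetical encounter records (not real data):
# (patient_id, admit_date, discharge_date, discharge_status)
ENCOUNTERS = [
    ("p1", date(2011, 1, 3), date(2011, 1, 7), "home"),
    ("p1", date(2011, 1, 20), date(2011, 1, 22), "home"),
    ("p2", date(2011, 2, 1), date(2011, 2, 5), "home"),
    ("p2", date(2011, 6, 1), date(2011, 6, 3), "home"),
    ("p3", date(2011, 3, 1), date(2011, 3, 4), "skilled nursing"),
    ("p3", date(2011, 3, 10), date(2011, 3, 12), "home"),
]


def readmitted_within(encounters, max_days):
    """Return patients readmitted within max_days of a discharge to home."""
    by_patient = {}
    for patient_id, admit, discharge, status in encounters:
        by_patient.setdefault(patient_id, []).append((admit, discharge, status))

    readmitted = set()
    for patient_id, stays in by_patient.items():
        stays.sort()  # chronological order by admit date
        # Compare each stay with the same patient's next stay.
        for (_, discharge, status), (next_admit, _, _) in zip(stays, stays[1:]):
            gap = (next_admit - discharge).days
            if status == "home" and 0 <= gap <= max_days:
                readmitted.add(patient_id)
    return readmitted


print(readmitted_within(ENCOUNTERS, 30))
```

With the sample data, only the first patient qualifies at 30 days: the second patient returns months later, and the third patient's first discharge was not to home.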

Currently the UHC data is limited to a 1-year time period (Nov. 2010-Nov. 2011). Look for additional years in future releases. This data is limited to hospital encounters and lacks clinic data.

HERON Big Hill Contents Summary

This month, our tour of rivers and lakes in Kansas honors Big Hill Lake.

The HERON repository contains approximately 630 million real observations from the hospital, clinics, and research systems:

Observation               Count  Patients  Source                                    Go-Live   Snapshot  Issues
Demographics              18.0M  1.90M
                                           KUH Billing (O2 via SMS)                  1980s     Feb 2012  various*
                                           UKP Billing                               2000      Feb 2012
                          9.5K   9.5K      Frontiers participant registry            Jun 2009  Feb 2012
                          183K   183K      Social Security Death Index               1962      Feb 2012
Diagnoses (ICD9)          18.7M  602K
                                           KUH/O2/Epic                               Nov 2007  Feb 2012  various*
                                           UKP Billing                               2000      Feb 2012
Medications               78.1M  245K
                                           KUH/O2/Epic                               Nov 2007  Feb 2012  various*
Nursing Observations      463M   ?
                                           KUH/O2/Epic                               Nov 2007  Feb 2012  various*
Lab Results               72.6M  257K
                                           KUH/O2/Epic                               2003      Feb 2012  various*
Procedures (CPT)          9.6M   542K
                                           UKP Billing                               2000      Feb 2012
Specimens                 27.8K  2.80K
                                           KUMC Biospecimen Repository               ?         Jan 2012
Cancer Cases              9.1M   62.8K
                                           KUH Cancer Registry                       1950s     Jan 2012  labels*
Hospital Quality Metrics  0.97M  19.7K
                                           University HealthSystem Consortium (UHC)  N/A       Nov 2011  #997
All                       630M

Beta Disclaimer

We are providing this early access to obtain feedback from you, the research community. While we are actively working on validating the data loaded into the system with hospital and clinic technical staff, there may be problems with our translation of data from our source systems (HospitalEpicSource and ClinicIdxSource) into HERON.

Please email us if you discover information you believe may be erroneous.

We are actively working on enhancing the types of data included. Stay tuned to our roadmap to track progress toward upcoming releases.

Various Issues Still Apply

Keep in mind the issues noted in the original HERON beta notice.

Enhancements and Problems/Defects/Issues Addressed in this Release

No results

Outstanding Problems/Defects/Issues

A medical informatics perspective on the role of metadata in the data lifecycle

Our group has been invited to a panel discussion:

  • Metadata Forum
    A discussion of the role of metadata in the data lifecycle
    Friday April 13, 2012
    11:30am - 1:00pm
    Watson Library, 503A and 503B

The panel questions have inspired this bit of thinking out loud:

What is your research area or discipline?

Our discipline is medical informatics. We're involved in two kinds of research:

  1. informatics services to support KUMC researchers in areas such as the cancer center, health of the public, etc.
  2. research in medical informatics per se; that is, looking at the electronic medical record (EMR) as a medical intervention and studying its impact

What do your data look like?

To our customers, we present a large and growing set of medical observations -- currently over 630 million observations -- using a tool called i2b2, developed at Harvard/Partners with NIH funding. It presents a hierarchy of terms:

  • under demographics it has age, gender, etc.;
  • diagnoses are organized using the ICD9 terminology;
  • there are terms for medications, lab results, procedures, etc.

This allows cohort identification queries such as "how many patients does the University of Kansas Hospital (KUH) see each year who are over the age of 35, diagnosed with diabetes, and had an abnormal glucose lab result?"

The data is not necessarily “ours” in that we take data from multiple sources, aggregate it, and provide a tool for knowledge discovery. For example, we integrate vital statistics from the U.S. Social Security Administration, so that the query above can be refined a la "... and how many of them are dead, according to the SSA?"
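
As a rough illustration of what such a cohort query amounts to, here is a minimal sketch that intersects sets of patients matching each criterion. It is not i2b2's actual query engine; the concept codes, the sample `OBSERVATIONS`, and the `patients_with` helper are hypothetical, and the sketch ignores the "each year" part of the question.

```python
# Hypothetical flat observations: (patient_id, concept_code, value).
# The concept codes are made up for illustration, not real i2b2 terms.
OBSERVATIONS = [
    ("p1", "DEM|AGE", 52),
    ("p1", "DX|DIABETES", None),
    ("p1", "LAB|GLUCOSE_ABNORMAL", None),
    ("p2", "DEM|AGE", 29),
    ("p2", "DX|DIABETES", None),
    ("p2", "LAB|GLUCOSE_ABNORMAL", None),
    ("p3", "DEM|AGE", 67),
    ("p3", "DX|DIABETES", None),
    ("p4", "DEM|AGE", 71),
    ("p4", "DX|DIABETES", None),
    ("p4", "LAB|GLUCOSE_ABNORMAL", None),
    ("p4", "DEM|DECEASED_SSA", None),
]


def patients_with(observations, concept, predicate=lambda value: True):
    """Return the set of patients having an observation for the given concept."""
    return {
        pid
        for pid, code, value in observations
        if code == concept and predicate(value)
    }


def cohort(observations):
    """Patients over 35, with diabetes, and with an abnormal glucose result."""
    over_35 = patients_with(observations, "DEM|AGE", lambda age: age > 35)
    diabetic = patients_with(observations, "DX|DIABETES")
    abnormal = patients_with(observations, "LAB|GLUCOSE_ABNORMAL")
    return over_35 & diabetic & abnormal


matched = cohort(OBSERVATIONS)
# Refine with the (hypothetical) SSA death flag.
deceased = matched & patients_with(OBSERVATIONS, "DEM|DECEASED_SSA")
print(len(matched), len(deceased))
```

Each criterion yields a set of patients, and refining the query (for example, with the SSA death data) is just one more intersection.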

Are they structured or unstructured?

So far, we have our hands full with structured data (pulled from EMR, billing system, tumor registry, etc.).

A lot of work in our field is concerned with natural language processing of physicians' notes.

We haven't begun work in that direction, but we are among the first to make use of i2b2 to explore nursing observations. They dominate our database (over 400 million observations) and quite likely they dominate EHR usage in the hospital. Plus, they contain basic information such as height and weight that is essential to screening for many studies.

Are they typically represented in tables or some other form (audio, video, transcripts)?

Integrating medical imaging with i2b2 has been done elsewhere, but we haven't gone beyond brainstorming about it. We were tangentially involved in a project to collect video samples from patients for one study.

But the vast majority of our work is with data stored in tables.

How are your data typically documented - in the form of a document, or in some structured form?

The bulk of our data comes from the KUH EMR. Much of our data is documented by the EMR vendor, and following long-standing billing practice, standards for diagnoses (ICD9, soon to be ICD10) and procedures (CPT) are used for much of the data in the EMR. But the hospital heavily customizes the installation as well. For example, the formulary of medicines and the list of labs are curated by the hospital.

Moving nursing flowsheets from paper to the EMR initially involved a huge number of design decisions made in very short order; many of those decisions are reconsidered as the hospital gains experience. There is some overlap between the terms used in KUH flowsheets and standards such as SNOMED-CT and LOINC, but we have only scratched the surface of the work of mapping these terminologies.

Sources other than the EMR also vary as to the level of standardization of terminology. Our integration of the KUH tumor registry makes fairly straightforward use of the national standard for cancer registries, NAACCR. But our biospecimen repository uses a locally-curated terminology.

The bulk of this documentation is in tables and spreadsheets, with some documents and diagrams mixed in.

If your metadata are structured please describe that structure. Is it defined by something like a formal XML schema?

One way or another, we fit all of our metadata into i2b2's database schema. As a byproduct, i2b2 can produce an XML form of the metadata, following one of its XML schemas.

Is it common in your area to think in terms of a data lifecycle?

If so, what does that view include – (concepts and measures shared across studies?, data reuse?)

We reload our data repository from the source systems monthly. This is something of a compromise between real-time updates from the EMR and one-time data gathering exercises such as chart reviews.

Our process for updating metadata is something of a patchwork. For flowsheets, we update it monthly along with the data. For ICD9 and CPT, we plan to update as they are republished annually, but we haven't tackled that just yet.

Are there tools available which help manage lifecycle metadata?

Various tools are under development in the i2b2 community; e.g. Health Ontology Mapper (HOM) by Rob Wynden et al. at UCSF. We haven't investigated them in much depth, yet.

Can the metadata be expressed in Resource Description Framework (RDF) format as part of Linked Open Data?

NCBO is developing ontology services that integrate with i2b2 and provide RDF mappings. Again, we haven't investigated them in much depth, yet.

Is there an archive offering ongoing curation of your data available to you?

How does that operate? Are there issues with privacy, data size, financing, etc.?

Are there requirements from that archive for how data and metadata are represented?

We interact with varying sorts of metadata curation, as discussed under documentation above.

Setting up a governance structure was a major task that took several months in the start-up phase of our clinical data repository project. We have a data request oversight committee (DROC) with representation from

  • the hospital (which provides the bulk of the EMR data),
  • the clinics (which originally provided diagnosis and procedure information from billing systems, but are increasingly adopting the EMR), and
  • KU medical center itself (which manages the biospecimen repository etc.).

To address HIPAA requirements for dealing with protected health information, not to mention institutional liability, we have technical approaches to de-identification, network security, etc.

Sources such as the tumor registry and biospecimen repository are curated data as such. The hospital is an institution of long standing that has robust systems for long-term EMR storage, though perhaps recording vital signs wouldn't normally be called curation.

The governance policies include being able to trace all data in our system back to its source. The i2b2 database schema includes auditing fields (import_date, update_date, sourcesystem_cd, ...) that make this reasonably straightforward.
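
Here is a minimal sketch of how those auditing fields support tracing, using an in-memory SQLite table. The column names import_date, update_date, and sourcesystem_cd come from the text above; the table name `observation_fact`, the other columns, the source system codes, and the sample rows are assumptions for illustration, not our production schema or data.

```python
import sqlite3

# In-memory stand-in for a fact table with auditing columns (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE observation_fact (
        patient_num INTEGER,
        concept_cd TEXT,
        import_date TEXT,
        update_date TEXT,
        sourcesystem_cd TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO observation_fact VALUES (?, ?, ?, ?, ?)",
    [
        # Hypothetical rows; the source system codes are made up.
        (1, "ICD9:250.00", "2012-02-15", "2012-02-10", "EPIC"),
        (1, "LAB:GLUCOSE", "2012-02-15", "2012-02-11", "EPIC"),
        (2, "CPT:99213", "2012-02-16", "2012-02-01", "UKP_BILLING"),
    ],
)


def provenance_summary(connection):
    """Count observations per source system, with the latest import date."""
    rows = connection.execute(
        """
        SELECT sourcesystem_cd, COUNT(*), MAX(import_date)
        FROM observation_fact
        GROUP BY sourcesystem_cd
        ORDER BY sourcesystem_cd
        """
    )
    return rows.fetchall()


for source, count, last_import in provenance_summary(conn):
    print(source, count, last_import)
```

Because every row carries its source system and load dates, a question like "where did this observation come from, and when was it loaded?" reduces to a simple filter or grouping on those columns.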

Moving forward – Would it be useful for us to have more sessions?

A number of i2b2 sites participate in federated query networks which allow researchers to broaden their cohort identification queries and validate their findings more widely. In the medium to long term, we're interested in the sort of terminology alignment that it takes to participate in these networks, but it's not yet high on our list of priorities.

Another motivation for terminology alignment is health information exchange. We're monitoring HIE efforts in Kansas, but again, it's not yet high on our list of priorities.

As we complete other projects and make room for more work on terminology alignment and data interchange, we hope to be able to participate more actively.

The Great Blue Herons at Cornell are very prolific

We are very excited, as one of our mascots, the Great Blue Heron, is currently in nesting season. Cornell has a terrific web cam.

They now have 5 eggs in their nest, which is more than usual.