Posts by author dconnolly

HERON Team offers support for complex searching, data access

Having trouble creating complex searches in HERON? Seeking support for your grant or proposal? Have funding?

Medical Informatics Services include searching and data access consultation with the HERON team on an hourly fee-for-service basis.

Request a Medical Informatics consultation or contact MISupport@… to learn more.

As always, the Frontiers Lunch and Learn meets the 1st and 3rd Tuesdays of each month from 12:00 - 1:30 PM in room 3001D Student Center. Register in advance if possible but drop-ins are welcome. See the KUMC Calendar for the date of the next clinic.

HERON Saline Release integrates 1.6 billion observations from 8 data sources

The HERON Saline release includes data from the KU Hospital EMR, tumor registry etc. through September, 2014.

Having trouble creating complex searches in HERON? Seeking support for your grant or proposal? Note that the HERON team offers support for complex searching and data access on an hourly fee-for-service basis.

Any publication that results from a project utilizing HERON should cite grant support (CTSA Award # UL1TR000001).

HERON Saline Contents Summary

This month, our tour of rivers and lakes in Kansas honors the Saline River.

The HERON repository contains approximately 1.6 billion real observations from the hospital, clinics, and research systems:

Observation Patients Source Go-Live Snapshot Issues
Allergy 537K 246K
KUH/O2/Epic Nov 2007 Sept 2014 various*
Cancer Cases 11.1M 75K
KUH Cancer Registry 1950s Sept 2014 labels*
Cardiology Labs 22.2M 333K
KUH/O2/Epic Nov 2007 Sept 2014
Demographics 19.8M 2.07M
KUH Billing (O2 via SMS) 1980s Sept 2014 various*
UKP Billing 2000 Sept 2014
30.2K 30.2K Frontiers participant registry Jun 2009 Sept 2014 #2451
194.5K 194.5K Social Security Death Index 1962 Sept 2014
Diagnoses (ICD9) 67.7M 781K
KUH/O2/Epic Nov 2007 Sept 2014 various*
UKP Billing 2000 Sept 2014
University HealthSystem Consortium (UHC) Q4 2008 June 2014
History 30.8M 417K
KUH/O2/Epic Nov 2007 Sept 2014
Lab Results 149M 364K
KUH/O2/Epic 2003 Sept 2014 various*
Medications 126M 425K
KUH/O2/Epic (Organized by VA Class) Nov 2007 Sept 2014 various*
Microbiology 3.31M 29.1K
KUH/O2/Epic 2003 (?) Sept 2014 various*
782K 121K Negative 2003 Sept 2014
10.3K 7.68K Positive 2003 Sept 2014
Nursing Observations 852M ?
KUH/O2/Epic Nov 2007 Sept 2014 various*
Procedure Orders 169M 556K
KUH/O2/Epic 2003 (?) Sept 2014 various*
Procedures (CPT) 13.6M 678K
UKP Billing 2000 Sept 2014 #26
REDCap 1.04M 48.7K
REDCap July 2011 Sept 2014
Reports/Notes 67.6M 404K
KUH/O2/Epic ? Sept 2014
Specimens 121K 8.83K
KUMC Biospecimen Repository ? Sept 2014
Visit Details 67M 940K
KUH/O2/Epic Nov 2007 Sept 2014
Hospital Quality Metrics 6.9M 93K
University HealthSystem Consortium (UHC) Q4 2008 June 2014 #2911
All 1.6B

Notice

Some material in the UMLS Metathesaurus is from copyrighted sources of the respective copyright holders. Users of the UMLS Metathesaurus are solely responsible for compliance with any copyright, patent or trademark restrictions and are referred to the copyright, patent or trademark notices appearing in the original sources, all of which are hereby incorporated by reference.

Misuse of the Limited Access Death Master File, NTIS, U.S. Department of Commerce data is subject to penalties under provisions of 15 CFR § 1110.200.

Beta Disclaimer

We are providing this early access to obtain feedback from you, the research community. While we are actively working on validating the data loaded into the system with hospital and clinic technical staff, there may be problems with our translation of data from our source systems (HospitalEpicSource and ClinicIdxSource) into HERON.

Please email us at heron-admin@kumc.edu if you discover information you believe may be erroneous.

We are actively working on enhancing the types of data included. Stay tuned to our roadmap to track progress toward upcoming releases.

Various Issues Still Apply

Keep in mind the issues noted in the original HERON beta notice.

Enhancements and Problems/Defects/Issues Addressed in this Release

Due to in-progress trac access policy issues, this section may be blank for a time.

#3034
portable Data Builder

Outstanding Problems/Defects/Issues

Learning Object Capability Security with the Online Python Tutor

In Everything Is Broken, Quinn Norton presents an alarming, though witty, case that Heartbleed is really just the tip of the iceberg when it comes to computer security problems.

The best weapons I've seen are (a) certified programming with dependent types, and (b) Robust Composition with capabilities.

And on that front, there's great news: seL4, a formally verified, capability-based microkernel written in optimized C, is going open source. That's the very lowest layer. At the other end, Secure ECMAScript lets us use JavaScript as an object capability language. Distributed Electronic Rights in JavaScript tells the story at a high level, including which bits are available and which are still in progress. Stuff like the SES node package seems to work pretty well.

Meanwhile, we're a mostly Python shop. I've been playing with some Python capability idioms for a while. Some of them are a bit obscure, and I've been wondering how to explain our CodeReviewNotes about explicit authority to new developers.

Then I discovered the Online Python Tutor. Perfect!

I hope that trying it out on encap.py will provide enlightenment on the encapsulation aspect of capabilities discussed in From Functions To Objects. Copy and paste encap.py into the editor and add something like this at the end:

# test
s1 = makeSlot('apple')
print(s1.get())
s1.put('orange')
print(s1.get())
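
For readers without encap.py at hand, here is a minimal sketch of what a makeSlot in this style might look like. It is an assumption for illustration, not the actual contents of encap.py: the state lives in a closure, and the returned record exposes only get and put. The Slot name is hypothetical.

```python
from collections import namedtuple

# Hypothetical record type for illustration; not taken from encap.py.
Slot = namedtuple('Slot', ['get', 'put'])


def makeSlot(value):
    """Make a slot whose contents are reachable only through get and put."""
    cell = [value]  # mutable cell captured by the closures below

    def get():
        return cell[0]

    def put(new_value):
        cell[0] = new_value

    return Slot(get, put)
```

Appending the test lines above to this sketch exercises get and put. In plain Python the captured cell is still reachable through introspection, so this shows the discipline rather than an enforced boundary.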

Then try walking through sealing.py to see how rights amplification works, though the motivating example, money.py, should be integrated for it to really make sense.
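
Rights amplification can be illustrated with a sealer/unsealer pair. The sketch below is an assumption for illustration, not the actual sealing.py; the name makeBrandPair and the hand-off list are hypothetical. A sealed box is opaque on its own, and an unsealer is useless without a box from its own pair; only the two together yield the contents.

```python
def makeBrandPair():
    """Return a (seal, unseal) pair that share a private hand-off cell."""
    shared = []  # visible only to the boxes and the unsealer of this pair

    def seal(contents):
        def box():
            # Calling the box deposits the contents in the private cell.
            shared.append(contents)
        return box

    def unseal(box):
        del shared[:]  # discard anything left over from earlier calls
        box()          # a box from this pair fills the cell
        if not shared:
            raise ValueError('box was not sealed by the matching sealer')
        return shared.pop()

    return seal, unseal


seal, unseal = makeBrandPair()
box = seal('secret')
print(unseal(box))
```

A box made by a different pair deposits its contents in that pair's cell, so this pair's unsealer finds nothing and raises ValueError.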

KUMC receives PCORI award to lead Greater Plains Collaborative

The Greater Plains Collaborative award from the Patient-Centered Outcomes Research Institute (PCORI) provides $7 million for a project that will establish a new network of nine medical centers in seven states committed to building a data set from electronic medical records that will be used to contribute to new research in the fields of breast cancer, obesity and amyotrophic lateral sclerosis (also known as ALS, or Lou Gehrig's disease).

The principal investigator of the project is Russ Waitman, director of medical informatics at KU Medical Center. The HERON technology developed by the medical informatics team plays a prominent role.

HERON Medicine Lodge introduces Microbiology Lab results

Microbiology laboratory results are searchable by source, organism, antibiotic, or sensitivity levels. Patient statistics for diagnoses are corrected to display the correct number, and patients no longer appear in ICU Length of Stay results unless they have an ICU stay of 1 day or more.

HERON Medicine Lodge Contents Summary

This month, our tour of rivers and lakes in Kansas honors the Medicine Lodge River.

The HERON repository contains approximately 1 billion real observations from the hospital, clinics, and research systems:

Observation Patients Source Go-Live Snapshot Issues
Cancer Cases 10M 67.9K
KUH Cancer Registry 1950s April 2013 labels*
Demographics 18.7M 1.96M
KUH Billing (O2 via SMS) 1980s April 2013 various*
UKP Billing 2000 April 2013
16.7K 16.7K Frontiers participant registry Jun 2009 April 2013
188.5K 188.5K Social Security Death Index 1962 April 2013
Diagnoses (ICD9) 45.9M 677K
KUH/O2/Epic Nov 2007 April 2013 various*
UKP Billing 2000 April 2013
University HealthSystem Consortium (UHC) Q4 2008 Dec 2012
History 16.4M 314K
KUH/O2/Epic Nov 2007 April 2013
Lab Results 87.6M 298K
KUH/O2/Epic 2003 April 2013 various*
Medications 94.6M 324K
KUH/O2/Epic (Organized by VA Class) Nov 2007 April 2013 various*
Microbiology 2.5M 22.7K
KUH/O2/Epic 2003 (?) April 2013 various*
Nursing Observations 609M ?
KUH/O2/Epic Nov 2007 April 2013 various*
Order Sets 303K 39.6K
KUH/O2/Epic 2003 (?) April 2013 various*
Procedure Orders 122M 460K
KUH/O2/Epic 2003 (?) April 2013 various*
Procedures (CPT) 11M 593K
UKP Billing 2000 April 2013
REDCap 37.4K 328
REDCap July 2011 March 2013 #2000
Reports/Notes 41M 299K
KUH/O2/Epic ? April 2013 #1955
Specimens 52K 4.32K
KUMC Biospecimen Repository ? April 2013
Visit Details 30.1M 551K
KUH/O2/Epic Nov 2007 April 2013
Hospital Quality Metrics 5M 69.2K
University HealthSystem Consortium (UHC) Q4 2008 Dec 2012
All 1.13B

Enhancements and Problems/Defects/Issues Addressed in this Release

Due to in-progress trac access policy issues, this section may be blank for a time.

Outstanding Problems/Defects/Issues

HERON Cow Creek release includes order sets and analysis tool enhancements

Previously, the Multi-Cohort Survival Analysis tool had the option to limit by 5 years, 10 years or All. This release expands that option to weeks, days, or hours in addition to years. The Timeline analysis tool increases granularity to display by the hour as opposed to day as in previous releases.

The Cow Creek release introduces order sets as a new searchable data type. This release also kicks off the ability to limit MAR by dose.

HERON Cow Creek Contents Summary

This month, our tour of rivers and lakes in Kansas honors Cow Creek.

The HERON repository contains approximately 1 billion real observations from the hospital, clinics, and research systems:

Observation Patients Source Go-Live Snapshot Issues
Demographics 18.6M 1.96M
KUH Billing (O2 via SMS) 1980s Feb 2013 various*
UKP Billing 2000 Feb 2013
15.8K 15.8K Frontiers participant registry Jun 2009 Feb 2013
187.8K 187.8K Social Security Death Index 1962 Feb 2013
Diagnoses (ICD9) 44.1M 667K
KUH/O2/Epic Nov 2007 Feb 2013 various*
UKP Billing 2000 Feb 2013
University HealthSystem Consortium (UHC) Q4 2008 Dec 2012
History 15M 302K
KUH/O2/Epic Nov 2007 Feb 2013
Medications 91M 313K
KUH/O2/Epic (Organized by VA Class) Nov 2007 Feb 2013 various*
Nursing Observations 585M ?
KUH/O2/Epic Nov 2007 Feb 2013 various*
Lab Results 85.2M 295K
KUH/O2/Epic 2003 Feb 2013 various*
Order Sets 275K
KUH/O2/Epic 2003 (?) Feb 2013 various*
Procedure Orders 117M 450K
KUH/O2/Epic 2003 (?) Feb 2013 various*
Procedures (CPT) 10.8M 585K
UKP Billing 2000 Feb 2013
Reports/Notes 38M 288K
KUH/O2/Epic ? Feb 2013
Specimens 47.4K 4K
KUMC Biospecimen Repository ? Feb 2013
Visit Details 28.5M 539K
KUH/O2/Epic Nov 2007 Feb 2013
Cancer Cases 9.8M 67.1K
KUH Cancer Registry 1950s Feb 2013 labels*
Hospital Quality Metrics 5.10M 69.2K
University HealthSystem Consortium (UHC) Q4 2008 Dec 2012
Triple Negative Breast Cancer Registry (BRCA) 27.6K 161
REDCap July 2011 Jan 2013 #1741
All 1.09B

Enhancements and Problems/Defects/Issues Addressed in this Release

Due to in-progress trac access policy issues, this section may be blank for a time.

#1526
Load Orderset Usage into HERON
#1685
Add sponsor, title, and description to HERON Usage Audit Report
#1742
multi-cohort survival plugin fails to initialize after the 1st time
#1835
HERON uses ICD-O-2 labels for ICD-O-3 morphology/histology cancer tumor registry codes
#1849
flexible time windows for HERON Multi-cohort survival analysis tool
#1891
handle times, not just dates, in i2b2 timeline in HERON

Outstanding Problems/Defects/Issues

Quick-and-dirty usage documentation for python integration tests

In addition to my addiction to Python's doctest for WritingQualityCode, I'm developing a habit of letting my modules serve as their own integration tests when run as scripts. Python comes batteries-included with argparse, and Aargh is pretty cool, but for quick-and-dirty stuff like this, I tend not to bother at all. I just let Python's stack traces serve as documentation.

For example, if it's been a while since I used source:raven-j/heron_wsgi/admin_lib/i2b2pm.py, I just run it, and I get:

(haenv):admin_lib$ python i2b2pm.py
Traceback (most recent call last):
  File "i2b2pm.py", line 319, in <module>
    _test_main()
  File "i2b2pm.py", line 289, in _test_main
    user_id, full_name = sys.argv[1:3]
ValueError: need more than 0 values to unpack

The user_id, full_name = sys.argv[1:3] line is typically enough of a clue to remind me what arguments are needed:

(haenv):admin_lib$ python i2b2pm.py dconnolly 'Dan Connolly'
DEBUG:__main__:generate authorization for: ('dconnolly', 'Dan Connolly')
INFO:sqlalchemy.engine.base.Engine:SELECT USER FROM DUAL
...
INFO:sqlalchemy.engine.base.Engine:{'param_1': u'dconnolly'}
('1f8c885f-43e3-4ace-8512-88e65494d59a', <User(dconnolly, Dan Connolly)>)

Mark Medovich on Software Defined Networks

Mark Medovich, Juniper's Chief Architect for the Public Sector, gave a very interesting talk at the Kansas City Software Defined Networking Luncheon, hosted by FishNet Security.

It directly addressed a gap in my knowledge regarding our new NetworkInnovation project, where we plan to "evaluate the applicability of GENI software defined networking and OpenFlow/Openstack technologies to support the secure transmission and storage of personal health information with the Google Fiber network."

I have enough experience as a customer of VM clusters to have a vague notion of what OpenStack is about, but I'm brand new to OpenFlow and software defined networks. My first thought on exposure to them was: so what happened to The Rise of the Stupid Network?

Medovich explained that OpenFlow development was driven by high performance computing (HPC). Researchers are trying to reduce the latency for moving data between compute nodes.

The term "switch fabric" flew by... one of many buzzwords that I'm slowly picking up. "If we are going All L2..." I recognized L2 as a reference to a layer in the OSI model, but I didn't remember much about it, and I didn't have enough connectivity to look it up. Afterward, I reminded myself that it's where switches live, as opposed to hubs below and routers above.

"Networks within networks" was another phrase that caught my attention. It appealed to me as being like scale-free design in Web Architecture. It reminded me of heated discussions in the IETF about the evils of NAT vs. the end-to-end purity of IPv6. The people I trust were on the IPv6 side (and IPv6 is great for lots of other reasons) but as I reflected later, the idea of one big flat IPv6 network seems like a monoculture, not scale-free.

He talked about multi-tenancy data centers:

photo of "Multi-tenant flows within an end site" slide by Medovich

He used an example from when he was at Sun, visiting CVS Caremark: they had to provision for the monthly Medicaid Monday burst, which left a lot of excess capacity for most of the month. My understanding is that Amazon's cloud services came about roughly the same way: they have to provision for Christmas, which left them with a lot of spare capacity most of the time.

Traditional three-level networks are OK provided capacity is reasonably predictable, he said, but they don't deal with dynamic demand.

photo of "SCALING Multi-tenant SERVICES" slide by Medovich

You can't make service level agreements (SLA) for dynamic demand with traditional networks; the best you can do is a service level probability (SLoP).

This brings us to OpenFlow. "Open flow is all about the data center and making virtualization better."

He introduced it using a slide from the OpenFlow Presentation:

"Here's the problem: that step 2. Encapsulate and forward to controller. What controller?" The controller isn't specified, he said.

Variability of OpenFlow devices (switches from this vendor or that) introduces too many variables. The only way the Juniper engineering team could see to make the scaling work was an any-to-any switch fabric. They had to collapse the network, from 3 tiers to 2 to 1.

People are building this sort of scalable multi-tenancy network, he said. But it's not OpenFlow. Cisco UCS, Juniper fabric. 10000 ports. Software programmable.

"Don't get me wrong; I'm not here to knock OpenFlow. Juniper does support OpenFlow." Just don't expect OpenFlow to be the whole solution. I gather Juniper has filled in all the gaps implicit in OpenFlow use cases.

He threw out "... close to the lambda ..." as a goal the audience would be familiar with. This audience member was not.

Lambda switching uses small amounts of fiber-optic cable and differing light wavelengths (called lambdas) to transport many high-speed datastreams to their destinations -- Network World research center

His discussion of trust, interfaces, and economics reminded me of studying Miller's work on object capability security, the principle of least authority, and patterns of cooperation without vulnerability. More on that in another item. Meanwhile...

I felt more on solid ground when he started to discuss software architecture.

Way back, Juniper decided to put an XML RPC server in every switch. This was adopted by the IETF as NETCONF. Juniper has a rich SDK layered on top. This is how they rapidly ported OpenFlow to their devices. In answer to criticism that they don't have OpenFlow implemented in firmware, he compared it with developing a new platform from scratch, an argument with obvious appeal to me. Who's going to implement "legacy" protocols like PPPoE, baked into various purchasing specs?

While working toward Junosphere, the software team saw that they couldn't wait until the hardware was finished; they virtualized the whole thing. The result is now used by major telecom providers (Comcast, Telecom Italia) as a test lab.

At the other extreme, he explained how their architecture scales down to multi-tenancy embedded applications to meet military needs.

He mentioned in passing that their architecture includes a JBoss application server. I have a bit of a bad taste from using JBoss as the platform for HERON and hence i2b2, but I gather the version we use is ages out of date. So this nod encourages me to keep an open mind.

He mentioned a "single pane of glass" user interface with roles and permissions and templates. Again, I wondered to what extent this architecture employs the principle of least authority.

The Q&A that followed the talk quickly went over my head with "top of rack architectures" and such. But I did pick up a few more details about virtualization:

Medovich: Which are you using, Xen or KVM?

Audience member: KVM

Medovich: Good for you.

Medovich brought up SAN storage architectures and noted a trend... the aggregate bandwidth of racks is approaching a TB...

Medovich: Are you using a SAN or local storage?

Audience member: local storage

Medovich: That's the right answer.

He brought up the big Amazon outage and explained the causes in some detail. A big compute job could generate a bunch of data in one zone and then decommission the nodes. Then the customer wants to re-instantiate the 200 nodes, but that much compute is only available in another zone. So Amazon would have to migrate the data. It's like de-fragmenting a disk. And eventually, the aggregate bandwidth brought the whole thing down.

The next generation architecture will have to continuously de-fragment the data center.

p.s. A capsule subset of the slides he used is available:

01/25/2012 Winter 2012 ESCC/Internet2 Joint Techs Software Defined Networks - Juniper Networks

HERON Walnut update introduces O2 smart phrases, medication by dose and age at visit searching

In O2, smart phrases, commonly used words or phrases, are used to expedite patient documentation. Find these phrases under Reports->Visit Notes->Note Concepts.

Search medications by dose for inpatient medication orders. When constructing a search, drag "Cumulative Daily Dose of Single Inpatient Order" and you will be prompted to enter desired dose criteria.

Age at visit is searchable for patient visits. Please note this is based on de-identified dates. See HERON training for more information on how dates are shifted.

Need help using HERON? Learn how to Sponsor a HERON user, request data, and perform a search on the Informatics training video page.

HERON Walnut Contents Summary

This month, our tour of rivers and lakes in Kansas honors the Walnut River.

The HERON repository contains approximately 856 million real observations from the hospital, clinics, and research systems:

Observation Patients Source Go-Live Snapshot Issues
Demographics 18.3M 1.93M
KUH Billing (O2 via SMS) 1980s Sept 2012 various*
UKP Billing 2000 Sept 2012
13.2K 13.2K Frontiers participant registry Jun 2009 Sept 2012
185.9K 185.9K Social Security Death Index 1962 Sept 2012
Diagnoses (ICD9) 33.6M 638K
KUH/O2/Epic Nov 2007 Sept 2012 various*
UKP Billing 2000 Sept 2012
University HealthSystem Consortium (UHC) Q4 2008 June 2012
Medications 71.9M 283K
Organized by VA Class Nov 2007 Sept 2012
KUH/O2/Epic Nov 2007 Sept 2012 various*
Nursing Observations 524M ?
KUH/O2/Epic Nov 2007 Sept 2012 various*
Lab Results 79.2M 278K
KUH/O2/Epic 2003 Sept 2012 various*
Procedure Orders 52.7M 425K
KUH/O2/Epic 2003 (?) Sept 2012 various*
Procedures (CPT) 10.4M 566K
UKP Billing 2000 Sept 2012
Reports/Notes 24M 214K
KUH/O2/Epic ? Sept 2012
Specimens 34.6K 3.22K
KUMC Biospecimen Repository ? Sept 2012
Visit Details ? ?
KUH/O2/Epic Nov 2007 Sept 2012 #1514
Cancer Cases 9.6M 65.2K
KUH Cancer Registry 1950s Sept 2012 labels*
Hospital Quality Metrics 4.12M 60.9K
University HealthSystem Consortium (UHC) Q4 2008 June 2012
Triple Negative Breast Cancer Registry (BRCA) 17.8K 133
REDCap July 2011 Sept 2012
All 816M

Enhancements and Problems/Defects/Issues Addressed in this Release

#1246
Approximately 2% of Medication Facts are Not Covered by our VA Med Hierarchy

Outstanding Problems/Defects/Issues

On language complexity as authority and new hope for secure systems

Why is the overwhelming majority of common networked software still not secure, despite all effort to the contrary? Why is it almost certain to get exploited so long as attackers can craft its inputs? Why is it the case that no amount of effort seems to be enough to fix software that must speak certain protocols?

The video of The Science of Insecurity by Meredith Patterson crossed my radar several times last year, but I just recently found time to watch it. She offers hope:

In this talk we'll draw a direct connection between this ubiquitous insecurity and basic computer science concepts of Turing completeness and theory of languages. We will show how well-meant protocol designs are doomed to their implementations becoming clusters of 0-days, and will show where to look for these 0-days. We will also discuss simple principles of how to avoid designing such protocols.

In memory of Len Sassaman

In discussion of Postel's Principle, she argues:

  • Treat input-handling computational power [aka input language complexity] as privilege, and reduce it whenever possible.

This is essentially the principle of least privilege, which is the cornerstone of capability systems.
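
One small-scale illustration of reducing input-handling power, in Python: the sketch below (an illustrative example, not taken from the talk) parses untrusted text with ast.literal_eval, which accepts only literal syntax, rather than eval, which hands the input the full power of the language. The function name parse_setting is hypothetical.

```python
import ast


def parse_setting(text):
    """Parse untrusted text as a Python literal, never as executable code."""
    # literal_eval accepts only literals (numbers, strings, tuples, lists,
    # dicts, sets, booleans, None); anything else raises an exception.
    return ast.literal_eval(text)


print(parse_setting("[1, 2, 3]"))

try:
    parse_setting("__import__('os').getcwd()")
except ValueError:
    print("rejected: not a literal")
```

Even this restricted parser is not free of risk: the Python documentation cautions that very large or deeply nested input can still exhaust memory or the stack, so bounding input size remains worthwhile.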

I have been arguing for keeping web language complexity down since I started working on HTML. The official version is the 2006 W3C Technical Architecture Group finding on The Rule of Least Power, but as far back as my 1994 essay, On Formally Unconvertable Document Formats, I wrote:

The RTF, TeX, nroff, etc. document formats provide very sophisticated automated techniques for authors of documents to express their ideas. It seems strange at first to see that plain text is still so widely used. It would seem that PostScript is the ultimate document format, in that its expressive capabilities include essentially anything that the human eye is capable of perceiving, and yet it is device-independent.

And yet if we take a look at the task of interpreting data back into the ideas that they represent, we find that plain text is much to be preferred, since reading plain text is so much easier to automate than reading GIF files (optical character recognition) or PostScript documents (halting problem). In the end, while the source to various TeX or troff documents may correspond closely to the structure of the ideas of the author, and while PostScript allows the author very precise control and tremendous expressive capability, all these documents ultimately capture an image of a document for presentation to the human eye. They don't capture the original information as symbols that can be processed by machine.

To put it another way, rendering ideas in PostScript is not going to help solve the problem of information overload -- it will only compound the situation.

But as recently as my Dec 2008 post on Web Applications security designs, I didn't see the connection between language complexity and privilege, and had little hope of things getting better:

The E system, which is a fascinating model of secure multi-party communication (not to mention lockless concurrency), [...] seems an impossibly high bar to reach, given the worse-is-better tendency in software deployment.

On the other hand, after wrestling with the patchwork of javascript security policies in browsers in the past few weeks, the capability approach in adsafe looks simple and elegant by comparison. Is there any chance we can move the state-of-the-art that far?

After all, who would be crazy enough to essentially throw out all the computing platforms we use and start over?

I've been studying CapROS: The Capability-based Reliable Operating System. Its heritage goes back through EROS in 1999 and KeyKOS in 1988 to GNOSIS in 1979. After a few hours of study, I started to wonder where the pull would come from to provide energy to complete the project. Then this headline crossed my radar:

I saw some comments encouraging them to look at EROS. I hope they do. Meanwhile, Capsicum: practical capabilities for UNIX lets capability approaches co-exist with traditional unix security.

These days, the browser is the biggest threat vector, and Turing-complete data, i.e. mobile code, remains notoriously difficult to secure:

The sort of thing that gives me hope is chromium-capsicum - a version of Google's Chromium web browser that uses capability mode and capabilities to provide effective sandboxing of high-risk web page rendering.

Another is Servo, Mozilla's exploration into a new browser architecture built on Rust. Rust is a new systems programming language designed toward concerns of “programming in the large”, that is, of creating and maintaining boundaries – both abstract and operational – that preserve large-system integrity, availability and concurrency.

It took me several hours, but the other night I managed to build Rust and Servo. While Servo is clearly in its infancy, passing a few dozen tests but not bearing much resemblance to an actual web browser, Rust is starting to feel quite mature.

I'd like to see more of a least-authority approach in the Rust standard library. Here's hoping for time to participate.

"In-Home Monitoring in Support of Caregivers for Patients with Dementia" obtains NSF US-Ignite grant

The U.S. National Science Foundation (NSF) awarded us an exploratory research (EAGER) grant for In-Home Monitoring in Support of Caregivers for Patients with Dementia. The investigator team is:

  • Dr. Russ Waitman, Principal Investigator, is Director of Biomedical Informatics at KU Medical Center.
  • Dr. Kristine Williams, Co-Investigator, is Associate Professor of Nursing and Associate Scientist of Gerontology at the University of Kansas.
  • Dr. James Sterbenz, Co-Investigator, is the lead PI of an NSF GENI project: The Great Plains Environment for Network Innovation (GpENI).

This project develops, integrates, and tests advanced video and networking technologies to support family caregivers in managing behavioral symptoms of individuals with dementia, a growing public health problem that adds to caregiver stress, increases morbidity and mortality, and accelerates nursing home placement. The project builds upon a recent University of Kansas Medical Center (KUMC) clinical pilot study that tested the application of video monitoring in the home to support family caregivers of persons with Alzheimer’s disease who exhibited disruptive behaviors. The proposed project focuses on expanding the in-home technological tools available to strengthen the linkage between patients and caregivers with their healthcare team via multi-camera full-motion/high definition video monitoring. Google’s deployment this year of a 1 Gbps fiber network throughout Kansas City provides the ideal environment for measuring the impact that ultra-high speed networking will have on health care.

fig 2 from US Ignite_FINAL_EAGERv_14.docx from Russ 30 Aug 2012

In a January NSF press release, the National Science Foundation (NSF) "announced that it will serve as the lead federal agency for a White House Initiative called US Ignite, which aims to realize the potential of fast, open, next-generation networks."

Our new connection with US Ignite provides access to resources in that community such as Mozilla Ignite and the GENI network lab. If you'd like to get involved, email Dan Connolly and Russ Waitman.

Web Security Best Practices in Medical Informatics: OWASP Top 10

A lot of what the Medical Informatics division does for Frontiers, the KUMC CTSA program, is install, configure, maintain, support, enhance, or--in a few cases--build from scratch systems to manage data, facilitate work-flow, and enforce policy in clinical and translational research.

As our development team grows, it's increasingly important that everybody is up to speed on best practices in secure web application development.

A few months ago, I picked up a copy of The Tangled Web by Michal Zalewski because while I was a long-time participant in the development of the architecture and standards for the Web, I didn't follow a lot of the nitty gritty details as they developed. Who knew that Internet Explorer would take back-ticks (`) around attribute values in HTML? I do now, thanks to Zalewski.

I was chatting with a couple of teammates about the risks around Drupal customization, and I suggested that they should read this book too. That seemed daunting, but we agreed that a reading group around the book looked like fun.

When I got out the calendar to plan the first meeting, I looked at the first few chapters and realized that the tour of the foundations of the Web provided there would be great if we had started a couple months ago. Plus, the book is much more browser-focused, while a lot of what we do is back-end integration with databases and such.

The OWASP Top 10 Web Application Security Risks looks like a better fit where we are right now:

  1. Injection
  2. Cross-Site Scripting (XSS)
  3. Broken Authentication and Session Management
  4. Insecure Direct Object References
  5. Cross-Site Request Forgery (CSRF)
  6. Security Misconfiguration
  7. Insecure Cryptographic Storage
  8. Failure to Restrict URL Access
  9. Insufficient Transport Layer Protection
  10. Unvalidated Redirects and Forwards
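
To make the first item concrete for the kind of database back-end work we do, here is a minimal sketch of SQL injection and the usual fix, bind parameters. It uses Python's standard sqlite3 module; the patient table, its columns, and the function names are made up for illustration and are not taken from any of our systems.

```python
import sqlite3


def find_patient_unsafe(conn, last_name):
    # DON'T: pasting user input into the SQL text lets the input rewrite the query.
    sql = ("select id from patient where last_name = '"
           + last_name + "' order by id")
    return [row[0] for row in conn.execute(sql)]


def find_patient_safe(conn, last_name):
    # DO: pass the input as a bind parameter; it is treated as data, never as SQL.
    sql = "select id from patient where last_name = ? order by id"
    return [row[0] for row in conn.execute(sql, (last_name,))]


def demo():
    # Hypothetical toy table with two rows.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table patient (id integer primary key, last_name text)")
    conn.executemany("insert into patient (last_name) values (?)",
                     [("Smith",), ("Jones",)])
    attack = "x' or '1'='1"
    return find_patient_unsafe(conn, attack), find_patient_safe(conn, attack)
```

With the attack string above, the unsafe version matches every row in the table, while the safe version matches none, since no patient has that literal last name.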

I expect we'll follow up with The Tangled Web in due course.

Meanwhile, I notice there's an OWASP Kansas City chapter that meets Wed. Sept 12, 2012 6:30 PM at McCoys Foundry in Westport. That reminds me... we have an open position for a Biomedical Informatics Software Engineer.

HERON Cedar Bluff release incorporates procedure orders, note types, ICD9CM update

The Cedar Bluff release includes bug fixes and introduces new data types to the repository.

  • Procedure orders are now searchable. This includes all procedure orders in O2, not just those ordered and fulfilled.
  • Note types allow you to limit searches to only records with a particular type of note.
  • Expected length of stay is a new quality measure added to the Length of Stay folder.
  • Thanks to Oregon Health & Science University for help with Diagnosis Mapping from ICD9-CM in UMLS. (#441)

HERON Cedar Bluff Contents Summary

This month, our tour of rivers and lakes in Kansas honors Cedar Bluff Reservoir.

The HERON repository contains approximately 850 million real observations from the hospital, clinics, and research systems:

Observation Patients Source Go-Live Snapshot Issues
Demographics 18.2M 1.92M
KUH Billing (O2 via SMS) 1980s July 2012 various*
UKP Billing 2000 July 2012
12.3K 12.3K Frontiers participant registry Jun 2009 July 2012
185K 185k Social Security Death Index 1962 July 2012
Diagnoses (ICD9) 32.4M 626K
KUH/O2/Epic Nov 2007 July 2012 various*
UKP Billing 2000 July 2012
University HealthSystem Consortium (UHC) Q4 2008 June 2012
Medications 120M 269K
KUH/O2/Epic Nov 2007 July 2012 various*
Nursing Observations 502M ?
KUH/O2/Epic Nov 2007 July 2012 various*
Lab Results 76.8M 271K
KUH/O2/Epic 2003 July 2012 various*
Procedure Orders 50.5M ?
KUH/O2/Epic 2003 (?) July 2012 #1363, various*
Procedures (CPT) 10.2M 559K
UKP Billing 2000 July 2012
Reports/Notes 2.32M ?
KUH/O2/Epic ? July 2012 #1363
Specimens 33.4K 3.13K
KUMC Biospecimen Repository ? July 2012
Visit Details 2.28M
KUH/O2/Epic Nov 2007 July 2012
Cancer Cases 9.4M 64.3K
KUH Cancer Registry 1950s July 2012 labels*
Hospital Quality Metrics 3.64M 56.9K
University HealthSystem Consortium (UHC) Q4 2008 June 2012 #1359, #1364
Triple Negative Breast Cancer Registry (BRCA) 16.6K 126
REDCap July 2011 July 2012
All 851M

Notice

Some material in the UMLS Metathesaurus is from copyrighted sources of the respective copyright holders. Users of the UMLS Metathesaurus are solely responsible for compliance with any copyright, patent or trademark restrictions and are referred to the copyright, patent or trademark notices appearing in the original sources, all of which are hereby incorporated by reference.

Beta Disclaimer

We are providing this early access to obtain feedback from you, the research community. While we are actively working on validating the data loaded into the system with hospital and clinic technical staff, there may be problems with our translation of data from our source systems (HospitalEpicSource and ClinicIdxSource) into HERON.

Please email us at heron-admin@kumc.edu if you discover information you believe may be erroneous.

We are actively working on enhancing the types of data included. Stay tuned to our roadmap to track progress toward upcoming releases.

Various Issues Still Apply

Keep in mind the issues noted in the original HERON beta notice.

UHC Core measures, Standard Medication Terms, and REDCap integration added to HERON in Cimarron release

Cimarron release includes a new structure for browsing data (medications), addition of a new data repository (REDCap), and new data concepts (UHC) plus bug fixes.

  • We can now import data from selected REDCap projects into HERON. This integration effort was piloted with a Cancer Registry.
  • Medication navigation is improved: medications are now organized by RxNorm based on GCN or NDC codes rather than at the clinical drug level (for example, pill size). 89% of medications are mapped to RxNorm, and we continue to work on the remaining 11%.
  • Visit details for inpatient visits provide insight into what services were assigned to a patient during their hospital stay.
  • Three important new types of UHC data are available:
    • Physician role and specialty are available to search in the UHC Visit Details folder.
    • Core Measures are standardized, evidence-based measures used by UHC to evaluate hospitals. These measures were designed by the Joint Commission to permit more rigorous comparisons of hospitals.
    • Numerator or denominator only case flag is added to AHRQ. See more information on AHRQ.

HERON Cimarron Contents Summary

This month, our tour of rivers and lakes in Kansas honors Cimarron River.

The HERON repository contains approximately 755 million real observations from the hospital, clinics, and research systems:

Observation Patients Source Go-Live Snapshot Issues
Demographics 18.2M 1.92M
KUH Billing (O2 via SMS) 1980s June 2012 various*
UKP Billing 2000 June 2012
11.6K 11.6K Frontiers participant registry Jun 2009 June 2012
184.6K 184.6k Social Security Death Index 1962 June 2012
Diagnoses (ICD9) 20.5M 618K
KUH/O2/Epic Nov 2007 June 2012 various*
UKP Billing 2000 June 2012
Medications 100M 263K
KUH/O2/Epic Nov 2007 June 2012 various*
89.8M 250K Medications by VA Class/Clinical Dose Form (DRAFT) Nov 2007 June 2012 various*
Nursing Observations 491M ?
KUH/O2/Epic Nov 2007 June 2012 various*
Lab Results 75.8M 268K
KUH/O2/Epic 2003 June 2012 various*
Procedures (CPT) 10.1M 555K
UKP Billing 2000 June 2012
Specimens 31.3K 3.06K
KUMC Biospecimen Repository ? June 2012
Visit Details 2.24M
KUH/O2/Epic Nov 2007 June 2012
Cancer Cases 9.3M 63.8K
KUH Cancer Registry 1950s June 2012 labels*
Hospital Quality Metrics 3.90M 56.9K
University HealthSystem Consortium (UHC) Q4 2008 June 2012
Triple Negative Breast Cancer Registry (BRCA) 15.9K 123
REDCap July 2011 June 2012
All 755M

Enhancements and Problems/Defects/Issues Addressed in this Release

#1048
Navigate Medications by National Standard hierarchy (RxNorm, NDFRT, VA)
#1243
Load UHC Core measures observations into HERON
#1244
Load UHC physician specialty and roles

HERON: Hillsdale release incorporates usage statistics and bug fixes

The Hillsdale release makes usage statistics available to HERON users. By clicking on 'Usage Report' you can view monthly query counts and queries by users over time. Average query time improved and we will continue our work on optimization. Marital status now displays correct patient count.

HERON Hillsdale Contents Summary

This month, our tour of rivers and lakes in Kansas honors Hillsdale Lake.

The HERON repository contains approximately 725 million real observations from the hospital, clinics, and research systems:

Observation Patients Source Go-Live Snapshot Issues
Demographics 18.1M 1.91M
KUH Billing (O2 via SMS) 1980s May 2012 various*
UKP Billing 2000 May 2012
10.6K 10.6K Frontiers participant registry Jun 2009 May 2012
184K 184k Social Security Death Index 1962 May 2012
Diagnoses (ICD9) 20.1M 612K
KUH/O2/Epic Nov 2007 May 2012 various*
UKP Billing 2000 May 2012
Medications 100M 256K
KUH/O2/Epic Nov 2007 May 2012 various*
Nursing Observations 479M ?
KUH/O2/Epic Nov 2007 May 2012 various*
Lab Results 74.6M 264K
KUH/O2/Epic 2003 May 2012 various*
Procedures (CPT) 10M 551K
UKP Billing 2000 May 2012
Specimens 30.9K 3.03K
KUMC Biospecimen Repository ? May 2012
Cancer Cases 9.3M 63.5K
KUH Cancer Registry 1950s May 2012 labels*
Hospital Quality Metrics 3.23M 56.9K
University HealthSystem Consortium (UHC) N/A May 2012
All 725M

Hello to the R User Community

Russ and I are at the R users' conference, useR! 2012, at Vanderbilt in Nashville.

We enjoyed the short course by Bill Venables, the guy who literally wrote the book on modern applied statistics.

I'm about to give my talk on HeronStatsPlugins. Wish me luck!

And stay tuned to KUBMIPresentations for slides and such.

If you'd like our contact info, see the biostatistics faculty and staff.

HERON: Big Hill release adds home medications, other medication orders, and expands UHC data

The Big Hill release adds support for home medications and expands UHC coverage. Previously, we only had medications that are dispensed by the hospital pharmacy; now, you'll be able to see

  • medications that the patient reports taking at home (so called "historical medications"), as well as
  • all other orders, which includes prescriptions, inpatient medication orders, and discharge medication orders.
  • data for an additional 100K patients

The UHC data, although primarily administrative, provides a new view on a patient's interaction at the hospital. New UHC concepts include:

  • ICU length of stay: Search on time spent in ICU.
  • Admission and Discharge status concepts: Search where patients came from prior to admission or went to upon discharge.
  • Readmission concept: Helps you search for patients readmitted after discharge to their home, allowing specification to a desired number of days.
  • Clinical Classification Software (CCS) ICD-9: These codes provided by AHRQ collapse ICD-9 diagnosis and procedure codes into a smaller number of categories useful in analyzing data. See AHRQ web site.
  • All Patient Refined Diagnosis Related Groups (APR DRGs): DRG codes expanded to include 4 subclasses in severity of illness and mortality subgroups for each code. See web site.
  • Major Diagnostic Categories (MDCs): Search these diagnosis categories created by combining ICD-9 diagnosis codes into 25 MDCs. These codes, which are used primarily for administrative and billing purposes, provide another view on patient data.
  • Comorbidity: Search 29 comorbidity categories.

Currently the UHC data is limited to a 1-year time period (Nov. 2010-Nov. 2011). Look for additional years in future releases. This data is limited to hospital encounters and lacks clinic data.

HERON Big Hill Contents Summary

This month, our tour of rivers and lakes in Kansas honors lake Big Hill.

The HERON repository contains approximately 630 million real observations from the hospital, clinics, and research systems:

Observation Patients Source Go-Live Snapshot Issues
Demographics 18.0M 1.90M
KUH Billing (O2 via SMS) 1980s Feb 2012 various*
UKP Billing 2000 Feb 2012
9.5K 9.5K Frontiers participant registry Jun 2009 Feb 2012
183K 183k Social Security Death Index 1962 Feb 2012
Diagnoses (ICD9) 18.7M 602K
KUH/O2/Epic Nov 2007 Feb 2012 various*
UKP Billing 2000 Feb 2012
Medications 78.1M 245K
KUH/O2/Epic Nov 2007 Feb 2012 various*
Nursing Observations 463M ?
KUH/O2/Epic Nov 2007 Feb 2012 various*
Lab Results 72.6M 257K
KUH/O2/Epic 2003 Feb 2012 various*
Procedures (CPT) 9.6M 542K
UKP Billing 2000 Feb 2012
Specimens 27.8K 2.80K
KUMC Biospecimen Repository ? Jan 2012
Cancer Cases 9.1M 62.8K
KUH Cancer Registry 1950s Jan 2012 labels*
Hospital Quality Metrics 0.97M 19.7K
University HealthSystem Consortium (UHC) N/A Nov 2011 #997
All 630M

A medical informatics perspective on the role of metadata in the data lifecycle

Our group has been invited to a panel discussion:

  • Metadata Forum
    A discussion of the role of metadata in the data lifecycle
    Friday April 13, 2012
    11:30am - 1:00pm
    Watson Library, 503A and 503B

The panel questions have inspired this bit of thinking out loud:

What is your research area or discipline?

Our discipline is medical informatics. We're involved in two kinds of research:

  1. informatics services to support KUMC researchers, including areas such as cancer center, health of the public, etc.
  2. research in medical informatics per se; that is: looking at the electronic medical record (EMR) as a medical intervention and studying its impact

What do your data look like?

To our customers, we present a large and growing set of medical observations -- currently over 630 million observations -- using a tool called i2b2, developed at Harvard/Partners with NIH funding. It presents a hierarchy of terms:

  • under demographics it has age, gender, etc.;
  • diagnoses are organized using the ICD9 terminology;
  • there are terms for medications, lab results, procedures, etc.

This allows cohort identification queries such as "how many patients does the University of Kansas Hospital (KUH) see each year that are over the age of 35, diagnosed with diabetes, and had an abnormal glucose lab result?"

The data is not necessarily “ours” in that we take data from multiple sources, aggregate it, and provide a tool for knowledge discovery. For example, we integrate vital statistics from the U.S. Social Security Administration, so that the query above can be refined a la "... and how many of them are dead, according to the SSA?"
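
To give a feel for what such a cohort query boils down to, here is a minimal sketch against a toy, heavily simplified version of i2b2's star schema, using Python's sqlite3 module. The table and column names follow the i2b2 data model, but the concept codes (ICD9:250..., LAB:GLUCOSE), the use of H/L flags for abnormal results, and the sample rows are assumptions for illustration. This is not the SQL that i2b2 itself generates, and the "each year" part of the question is left out.

```python
import sqlite3

# Count patients over 35 with a diabetes diagnosis and an abnormal glucose lab.
COHORT_SQL = """
select count(*)
from patient_dimension p
where p.age_in_years_num > 35
  and exists (select 1 from observation_fact f
              where f.patient_num = p.patient_num
                and f.concept_cd like 'ICD9:250%')
  and exists (select 1 from observation_fact f
              where f.patient_num = p.patient_num
                and f.concept_cd = 'LAB:GLUCOSE'
                and f.valueflag_cd in ('H', 'L'))
"""


def make_toy_repository():
    # Hypothetical data: three patients, each with a diagnosis and a lab fact.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table patient_dimension (
            patient_num integer primary key, age_in_years_num integer);
        create table observation_fact (
            patient_num integer, concept_cd text, valueflag_cd text);
    """)
    conn.executemany("insert into patient_dimension values (?, ?)",
                     [(1, 50), (2, 30), (3, 60)])
    conn.executemany("insert into observation_fact values (?, ?, ?)", [
        (1, 'ICD9:250.00', None), (1, 'LAB:GLUCOSE', 'H'),   # qualifies
        (2, 'ICD9:250.00', None), (2, 'LAB:GLUCOSE', 'H'),   # too young
        (3, 'ICD9:250.02', None), (3, 'LAB:GLUCOSE', None),  # normal lab
    ])
    return conn


def count_cohort(conn):
    return conn.execute(COHORT_SQL).fetchone()[0]
```

Each additional criterion, such as vital status from the SSA, becomes one more condition on the patient or one more `exists` clause over the facts.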

Are they structured or unstructured?

So far, we have our hands full with structured data (pulled from EMR, billing system, tumor registry, etc.).

A lot of work in our field is concerned with natural language processing of physicians' notes.

We haven't begun work in that direction, but we are among the first to make use of i2b2 to explore nursing observations. They dominate our database (over 400 million observations), and quite likely they dominate EHR usage in the hospital. Plus, they contain basic information such as height and weight that is essential to screening for many studies.

Are they typically represented in tables or some other form (audio, video, transcripts)?

Integrating medical imaging with i2b2 has been done elsewhere, but we haven't gone beyond brainstorming about it. We were tangentially involved in a project to collect video samples from patients for one study.

But the vast majority of our work is with data stored in tables.

How are your data typically documented - in the form of a document, or in some structured form?

The bulk of our data comes from the KUH EMR. Much of our data is documented by the EMR vendor, and following long-standing billing practice, standards for diagnoses (ICD9, soon to be ICD10) and procedures (CPT) are used for much of the data in the EMR. But the hospital heavily customizes the installation as well. For example, the formulary of medicines and the list of labs are curated by the hospital.

Moving nursing flowsheets from paper to the EMR initially involved a huge number of design decisions made in very short order; many of those decisions are being reconsidered as the hospital gains experience. There is some overlap between the terms used in KUH flowsheets and standards such as SNOMED-CT and LOINC, but we have only scratched the surface of the work of mapping these terminologies.

Sources other than the EMR also vary as to the level of standardization of terminology. Our integration of the KUH tumor registry makes fairly straightforward use of the national standard for cancer registries, NAACCR. But our biospecimen repository uses a locally-curated terminology.

The bulk of this documentation is in tables and spreadsheets, with some documents and diagrams mixed in.

If your metadata are structured please describe that structure. Is it defined by something like a formal XML schema?

One way or another, we fit all of our metadata into i2b2's database schema. As a byproduct, i2b2 can produce an XML form of the metadata, following one of its XML schemas.

Is it common in your area to think in terms of a data lifecycle?

If so, what does that view include – (concepts and measures shared across studies?, data reuse?)

We reload our data repository from the source systems monthly. This is something of a compromise between real-time updates from the EMR and one-time data gathering exercises such as chart reviews.

Our process for updating metadata is something of a patchwork. For flowsheets, we update it monthly along with the data. For ICD9 and CPT, we plan to update as they are republished annually, but we haven't tackled that just yet.

Are there tools available which help manage lifecycle metadata?

Various tools are under development in the i2b2 community; e.g. Health Ontology Mapper (HOM) by Rob Wynden et al. at UCSF. We haven't investigated them in much depth yet.

Can the metadata be expressed in Resource Description Framework (RDF) format as part of Linked Open Data?

NCBO is developing ontology services that integrate with i2b2 and provide RDF mappings. Again, we haven't investigated them in much depth, yet.

Is there an archive offering ongoing curation of your data available to you?

How does that operate? Are there issues with privacy, data size, financing, etc.?

Are there requirements from that archive for how data and metadata are represented?

We interact with varying sorts of metadata curation, as discussed under documentation above.

Setting up a governance structure was a major task that took several months in the start-up phase of our clinical data repository project. We have a data request oversight committee (DROC) with representation from

  • the hospital (which provides the bulk of the EMR data),
  • the clinics (which originally provided diagnosis and procedure information from billing systems, but are increasingly adopting the EMR), and
  • KU medical center itself (which manages the biospecimen repository etc.).

To address HIPAA requirements for dealing with protected health information, not to mention institutional liability, we have technical approaches to de-identification, network security, etc.

Sources such as the tumor registry and biospecimen repository are curated data as such. The hospital is an institution of long standing that has robust systems for long-term EMR storage, though perhaps recording vital signs wouldn't normally be called curation.

The governance policies include being able to trace all data in our system back to its source. The i2b2 database schema includes auditing fields (import_date, update_date, sourcesystem_cd, ...) that make this reasonably straightforward.
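
As a rough illustration of how those auditing fields support traceability, the sketch below summarizes facts by source system and most recent import date. It uses Python's sqlite3 module on a cut-down observation_fact table; the column names come from the i2b2 schema, but the source system codes ('Epic', 'TumorRegistry'), the concept codes, and the rows are invented for the example.

```python
import sqlite3


def facts_by_source(conn):
    # One row per source system: how many facts it contributed, and when
    # the most recent of them was imported.
    sql = """
        select sourcesystem_cd, count(*), max(import_date)
        from observation_fact
        group by sourcesystem_cd
        order by sourcesystem_cd
    """
    return conn.execute(sql).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        create table observation_fact (
            patient_num integer, concept_cd text,
            import_date text, sourcesystem_cd text)
    """)
    # Hypothetical facts from two hypothetical source systems.
    conn.executemany("insert into observation_fact values (?, ?, ?, ?)", [
        (1, 'ICD9:250.00', '2012-02-01', 'Epic'),
        (1, 'LAB:GLUCOSE', '2012-02-01', 'Epic'),
        (2, 'NAACCR:400', '2012-01-15', 'TumorRegistry'),
    ])
    return facts_by_source(conn)
```

Because every fact row carries its own source and import date, the same grouping works for a single patient or a single concept by adding a `where` clause.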

Moving forward – Would it be useful for us to have more sessions?

A number of i2b2 sites participate in federated query networks which allow researchers to broaden their cohort identification queries and validate their findings more widely. In the medium to long term, we're interested in the sort of terminology alignment that it takes to participate in these networks, but it's not yet high on our list of priorities.

Another motivation for terminology alignment is health information exchange. We're monitoring HIE efforts in Kansas, but again, it's not yet high on our list of priorities.

As we complete other projects and make room for more work on terminology alignment and data interchange, we hope to be able to participate more actively.

HERON El Dorado Release incorporates searching by MSDRG and LOS

This release includes hospital quality measures (UHC) data integration, which enhances search capabilities by introducing new concepts, such as length of stay and MSDRG. 109K UHC observations from November 2010 - November 2011 are included in this release. This is our first step in making UHC data available. Look forward to additional UHC data in future releases.

HERON El Dorado Contents Summary

This month, our tour of rivers and lakes in Kansas honors lake El Dorado.

The HERON repository contains approximately 630 million real observations from the hospital, clinics, and research systems:

Observation Patients Source Go-Live Snapshot Issues
Demographics 18.0M 1.90M
KUH Billing (O2 via SMS) 1980s Feb 2012 various*
UKP Billing 2000 Feb 2012
9.1K 9.1K Frontiers participant registry Jun 2009 Feb 2012
183K 183k Social Security Death Index 1962 Feb 2012
Diagnoses (ICD9) 18.4M 596K
KUH/O2/Epic Nov 2007 Feb 2012 various*
UKP Billing 2000 Feb 2012
Medications 29.1M 107K
KUH/O2/Epic Nov 2007 Feb 2012 various*
Nursing Observations 452M ?
KUH/O2/Epic Nov 2007 Feb 2012 various*
Lab Results 71.5M 253K
KUH/O2/Epic 2003 Feb 2012 various*
Procedures (CPT) 9.5M 539K
UKP Billing 2000 Feb 2012
Specimens 27.3K 2.77K
KUMC Biospecimen Repository ? Jan 2012
Cancer Cases 9.1M 62.6K
KUH Cancer Registry 1950s Jan 2012 labels*
Hospital Quality Metrics 109K 19.7K
University HealthSystem Consortium (UHC) N/A Nov 2011 #997
All 630M

Enhancements and Problems/Defects/Issues Addressed in this Release

#834
query by MSDRG, length of stay, service line from UHC 1-year snapshot

HERON Bow Creek Release brings Cancer Survival Analysis, R integration

The highlights for this release include:

Our cancer survival analysis plug-in is based on work by Segagni et al.:

HERON Bow Creek Contents Summary

This month, our tour of rivers and lakes in Kansas honors Bow Creek river.

The HERON repository contains approximately 600 million real observations from the hospital, clinics, and research systems:

Observation Patients Source Go-Live Snapshot Issues
Demographics 16.0M 1.89M
KUH Billing (O2 via SMS) 1980s Dec 2011 various*
UKP Billing 2000 Dec 2011
8.18K 8.18K Frontiers participant registry Jun 2009 Dec 2011
182K 182k Social Security Death Index 1962 Dec 2011
Diagnoses (ICD9) 17.9M 586K
KUH/O2/Epic Nov 2007 Dec 2011 various*
UKP Billing 2000 Dec 2011
Medications 28.0M 103K
KUH/O2/Epic Nov 2007 Dec 2011 various*
Nursing Observations 434M ?
KUH/O2/Epic Nov 2007 Dec 2011 various*
Lab Results 69.8M 248K
KUH/O2/Epic 2003 Dec 2011 various*
Procedures (CPT) 9.4M 531K
UKP Billing 2000 Dec 2011
Specimens 25.5K 2.66K
KUMC Biospecimen Repository ? Dec 2011
Cancer Cases 8.96M 61.9K
KUH Cancer Registry 1950s Dec 2011 labels*
All 606M

Adding SEER Site Recode to HERON Tumor Registry integration

Our HERON Tuttlecreek release a couple of months ago included initial integration of data on ~60,000 cancer cases from the KUMC tumor registry. We organized the NAACCR terms based on work by colleagues at the Kimmel Cancer Center in Philadelphia and Group Health Cooperative in Seattle:

NAACCR terms for tumor registry

But if you want to find, for example, brain cancer cases, due to an outstanding issue (#733), you have to be an expert in codes for primary site, histology, etc.:

For our next release, based on work with John Keighley, we're providing query by SEER Site Recode, a state of the art method for combining primary site and histology:

screenshot of SEER Site Recode term hierarchy

Under the hood: Using python to convert the rules table to SQL

The SEER Site Recode ICD-O-3 (1/27/2003) Definition lays out the rules in a fairly convenient HTML table:

Converting that table to code manually might have been straightforward, but it would have been repetitive and error-prone; so, like so many geeks faced with repetitive tasks, I wrote a script to automate it.

source:tumor_reg/seer_recode.py weighs in at about 200 lines, including whitespace and a handful of test cases. It reads the HTML page (well, I feed it through tidy first to clean up some table markup) and produces

  1. A term hierarchy in CSV format (source:heron_load/curated_data/seer_recode_terms.csv)
  2. Rules to recode our ~60K cancer cases as a SQL case statement (source:heron_load/seer_recode.sql).

The resulting SQL weighs in at about 500 lines. Handling all the different kinds of rules in the table was fun; a lot more fun than writing this sort of SQL by hand:

case
/* Lip */ when (site between 'C000' and 'C009')
  and  not (histology between '9590' and '9989'
   or histology between '9050' and '9055'
   or histology = '9140') then '20010'

...

/* Melanoma of the Skin */ when (site between 'C440' and 'C449')
  and (histology between '8720' and '8790') then '25010'

...

/* Cranial Nerves Other Nervous System */ when (site between 'C710' and 'C719')
  and (histology between '9530' and '9539') then '31040'

/* ... */ when (site between 'C700' and 'C709'
   or site between 'C720' and 'C729')
  and  not (histology between '9590' and '9989'
   or histology between '9050' and '9055'
   or histology = '9140') then '31040'
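
The core of that conversion can be sketched in a few lines of Python. This is not the code in seer_recode.py; it is a simplified illustration that assumes each rule has already been pulled out of the HTML as a site spec, an optional histology spec, and a flag for "excluding", with specs written as comma-separated codes and ranges like 'C000-C009'. The function names are made up for the example.

```python
import re

# Accept only things that look like site or histology codes (C000, 9590),
# since they are pasted into the SQL text as literals.
CODE = re.compile(r'^[C0-9][0-9]{3}$')


def range_expr(column, spec):
    """Turn a spec like '9590-9989, 9140' into a SQL boolean expression."""
    parts = []
    for item in spec.split(','):
        codes = [code.strip() for code in item.split('-')]
        if len(codes) > 2 or not all(CODE.match(code) for code in codes):
            raise ValueError('unexpected code spec: %r' % item)
        if len(codes) == 2:
            parts.append(f"{column} between '{codes[0]}' and '{codes[1]}'")
        else:
            parts.append(f"{column} = '{codes[0]}'")
    return '(' + '\n   or '.join(parts) + ')'


def when_clause(label, site_spec, recode, histology_spec=None,
                excluding=False):
    """Build one 'when ... then ...' branch of the recode case statement."""
    condition = range_expr('site', site_spec)
    if histology_spec:
        negate = 'not ' if excluding else ''
        condition += '\n  and ' + negate + range_expr('histology',
                                                      histology_spec)
    return f"/* {label} */ when {condition} then '{recode}'"


if __name__ == '__main__':
    print(when_clause('Lip', 'C000-C009', '20010',
                      histology_spec='9590-9989, 9050-9055, 9140',
                      excluding=True))
```

Run as a script, it prints a branch shaped like the Lip rule above. Checking each code against a pattern before pasting it into the SQL text is a small nod to the injection risk that comes with generating SQL from scraped input.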

AMIA 2011 Highlight: Dr. Bill Tierney's 10 year story on health care in Africa

Tierney's inspiring closing keynote was truly a highlight of #amia2011. Standing ovations for a great guy and great speaker.
-- Gunther Eysenbach, Oct 26

That's one tweet among a chorus of #amia2011 tweets about Tierney, including:

  • Death by HIPAA: shouldn't sacrifice care on altar of privacy #AMIA2011 keynote by Tierney
  • LIVE: #AMIA2011 Bill Tierney uses Clem McDonald's 1998 JAMA "Canopy Computing" paper; great metaphor for connected health data, no silos!

AMIA 2011 keynote recordings are now available:

"Dr. Tierney’s work has taken him far afield—to Kenya, Africa—to use electronic health records and to gather information from patients, applying the data to critical points in the patient–provider relationship to improve the quality and cost-effectiveness of health care. He led the effort to develop the first ambulatory electronic medical record system in sub-Saharan Africa, which has evolved into a comprehensive, open-source electronic medical record system that has been implemented in more than a dozen developing countries."

The video editing is a little rough, with quite a bit of conference administrivia at the beginning. But by the time he gets to "a Case" at 9:40, I'm sure you'll be hooked. Even if you're not an informatics geek, I'm sure you'll find the "10 year story" (starting at 37:30 into the video) inspiring.

HERON Pawnee release paves the way to a richer data repository

Our Pawnee release is the first to use i2b2 1.6. This allows us to take advantage of i2b2 1.6 features in future releases:

  • visit information, such as length of stay, age at visit
  • provider information
  • primary diagnosis vs secondary; billing vs clinical
  • medication routes, frequency

We are still in the process of re-verifying data integrity and functionality after the upgrade, so consider these statistics provisional:

CATEGORY FACTS PATIENTS
i2b2 464M
Demographics 14.1M 1.88M
HICTR Participant 7.65K 7.65K
Diagnoses 17.5M 578K
Flowsheet 328M
Labtests 68.3M 244K
Medications 27.0M 99K
Procedures 9.2M 527K
Specimens 24.1K 2.56K

Note: our count of flowsheet observations dropped significantly, from 470 million to 328 million. We are investigating the cause, as we think the data is intact (see ticket #789). Please give us feedback if you notice discrepancies with any prior queries where the counts are down significantly relative to last month's build.

Notice in the right-hand side of the user interface that i2b2 1.6 allows you to constrain observations to the same encounter. For example, you might require that patients have drug exposure (furosemide) and laboratory monitoring (serum potassium) during the same encounter, while another observation like "diagnosis of diabetes" can be treated independently of encounter. Right now, we are not sure we have this working (see defect #790), but we are looking into it.
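To make the same-encounter idea concrete, here is a minimal sketch in plain Python. It is not how i2b2 implements the constraint; the tuple layout and the concept codes (MED:furosemide, LAB:potassium) are made up for illustration.

```python
from collections import defaultdict


def same_encounter_patients(facts, concept_a, concept_b):
    """Return patients with both concepts recorded in one encounter.

    facts is an iterable of (patient_num, encounter_num, concept_cd)
    tuples; this layout is an assumption for the sketch.
    """
    concepts_by_encounter = defaultdict(set)
    for patient_num, encounter_num, concept_cd in facts:
        concepts_by_encounter[(patient_num, encounter_num)].add(concept_cd)
    return {
        patient_num
        for (patient_num, _encounter), concepts in concepts_by_encounter.items()
        if concept_a in concepts and concept_b in concepts
    }


# Hypothetical facts: patient 1 has both in encounter 10;
# patient 2 has them in two different encounters.
FACTS = [
    (1, 10, "MED:furosemide"),
    (1, 10, "LAB:potassium"),
    (2, 20, "MED:furosemide"),
    (2, 21, "LAB:potassium"),
]
```

With these toy facts, only patient 1 satisfies the same-encounter constraint; without the constraint, both patients would match.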

This release also fixed several outstanding bugs:

No results

Automatically populating REDCap fields from earlier forms

In our work on the Alzheimer's Disease Core Center, we had information entered into one REDCap form that we wanted to see in another.

REDCap doesn't offer this out of the box, so we added a little code (attachment:calc_text.patch ; #569). The way it works is a little quirky:

  1. In the usual REDCap fashion,
    1. Make a new field
    2. Choose Calculated Field for Field Type.
    3. Put the name of the source field in square brackets in the Calculation Equation. For example, [last_name]
  2. Now for the quirk: start the Variable Name for this new automatically populated field with
    • text_ for a single-line text field; for example: text_display_last_name
    • textarea_ for a multi-line text area.

OK, so using the variable name like this is sort of cheating, but hey... it seems to work for now.
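For the curious, the naming convention boils down to something like the following. This is only a Python illustration of the rule described above, not the code in calc_text.patch; the function names are hypothetical.

```python
import re

# Variable-name prefixes and the widget each one selects,
# per the convention described in the steps above.
PREFIX_TO_WIDGET = (("textarea_", "textarea"), ("text_", "text"))


def widget_for(variable_name):
    """Return 'text' or 'textarea' for an auto-populated field, else None."""
    for prefix, widget in PREFIX_TO_WIDGET:
        if variable_name.startswith(prefix):
            return widget
    return None


def source_field(equation):
    """Return the field name from an equation like '[last_name]', else None."""
    match = re.fullmatch(r"\s*\[(\w+)\]\s*", equation)
    return match.group(1) if match else None
```

So a field named text_display_last_name with the equation [last_name] would be shown as a single-line text field filled from last_name.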

If you would like us to show it to you in person, feel free to come to our FrontiersInformaticsClinic, which meets today at 4pm in Dykes 410. If today doesn't work, we're there every other Tuesday. Check the KUMC calendar.

HERON Tuttlecreek release brings initial Tumor Registry integration

This month's HERON release integrates data from the KUH Tumor Registry, with 65,000 cases dating back to the 1950s (#547). We have also added support for finding patients in KCK county school districts (#531).

Russ regularly gives presentations on our work, describing the integration of various sources into HERON. Since this diagram from How Medical Informatics and HERON Can Help Your Research?, given on November 17, we have integrated the Social Security death master file as well as the tumor registry:

HERON Tuttlecreek Contents Summary

This month, our tour of rivers and lakes in Kansas honors Tuttle Creek Lake.

The HERON repository contains approximately 570 million real observations from the hospital, clinics, and research systems:

Observation Patients Source Go-Live Snapshot Issues
Demographics 15.9M 1.88M
KUH Billing (O2 via SMS) 1980s Oct 2011 various*
UKP Billing 2000 Oct 2011
6.64K 6.64K Frontiers participant registry Jun 2009 Oct 2011
Diagnoses (ICD9) 17.2M 571K
KUH/O2/Epic Nov 2007 Oct 2011 various*
UKP Billing 2000 Oct 2011
Medications 26.3M 96K
KUH/O2/Epic Nov 2007 Oct 2011 various*
Nursing Observations 407M ?
KUH/O2/Epic Nov 2007 Oct 2011 various*
Lab Results 67.2M 240K
KUH/O2/Epic Nov 2007 Oct 2011 various*
Procedures (CPT) 9.1M 523K
UKP Billing 2000 Oct 2011
Specimens 23.1K 2.48K
KUMC Biospecimen Repository ? Oct 2011
Cancer Cases 6.21M 60.4K
KUH Cancer Registry 1950s Aug 2011 labels*
All 570M

Beta Disclaimer

We are providing this early access to obtain feedback from you, the research community. While we are actively working on validating the data loaded into the system with hospital and clinic technical staff, there may be problems with our translation of data from our source systems (HospitalEpicSource and ClinicIdxSource) into HERON.

Please email us at heron-admin@kumc.edu if you discover information you believe may be erroneous.

We are actively working on enhancing the types of data included. Stay tuned to our roadmap to track progress toward upcoming releases.

Various Issues Still Apply

Keep in mind the issues noted in the original HERON beta notice, including:

Problems/Defects/Issues Addressed in this Release

No major issues addressed.

Outstanding Problems/Defects/Issues

No results

AMIA 2011: Nursing Flowsheets data and the wild west of terminology

Like a number of other CTSA sites, we're using I2B2 in our HERON research data repository. The original domain of I2B2 was genome/phenome integration and personalized medicine. Genomic stuff is on our long-term radar, but one of our earliest wins has been mining the vast amount of data (~400M observations as of our latest release) recorded by nurses and other practitioners in flowsheets in our EMR.

Height, weight, and BMI are quite common inclusion/exclusion criteria for clinical trials, and those aren't available in the term hierarchy that comes with I2B2 out of the box.

It's been a big challenge because unlike diagnoses with ICD9 codes and procedures with CPT codes, the MedicalTerminologyMarketplace has no widespread norms for flowsheets.

We're presenting some results in Washington D.C. on Wednesday at AMIA 2011:

Stay tuned to KUBMIPresentations for presentation materials.

p.s. I'm new to AMIA. As a long-time Web guy, there's a bit of culture shock: the conference program only seems to be available as a big hunk of PDF or a goofy mobile flash thing, and the "Join the conversation on twitter" box inside gives the hash tag as AMIA2011#, with the hash at the end. Chuckle. And no open-access to the full text of the article. Sigh.

Analysis tools, Social Security Death Index, 50% more Labs arrive as HERON visits Cheney Lake

This month, the release of our HERON research data repository honors Cheney Lake. In addition to our regular monthly refresh of the data, we're excited about the three big enhancements in this release:

No results

Unfortunately, we broke something, so we no longer know race for our population (#637). That data was suspect, so we will be exploring how to incorporate race and ethnicity with greater fidelity.

Join us at the Frontiers Clinical/Translational Informatics Clinic for full demonstration and discussion. The next one is Tuesday, Oct 4 at 4pm in 1040 Dykes Library.

Meanwhile, see:

Contents Summary: 50% more Labs

The HERON repository contains approximately 516 million real observations from the hospital and clinics:

Observation Patients Source Go-Live Snapshot Issues
Demographics 14.7M 1.87M
KUH Billing (O2 via SMS) 1980s Aug 2011 various*
UKP Billing 2000 Aug 2011
6.09K 6.09K Frontiers participant registry Jun 2009 Aug 2011
Diagnoses (ICD9) 16.7M 558K
KUH/O2/Epic Nov 2007 Aug 2011 various*
UKP Billing 2000 Aug 2011
Medications 24.5M 91K
KUH/O2/Epic Nov 2007 Aug 2011 various*
Nursing Observations 386M ?
KUH/O2/Epic Nov 2007 Aug 2011 various*
Lab Results 65.2M 234K
KUH/O2/Epic Nov 2007 Aug 2011 various*
Procedures (CPT) 8.94M 515K
UKP Billing 2000 Aug 2011
All 516M

We have loaded what was referred to in O2/Epic as "conversion labs". These were lab results recorded prior to Epic system implementation, from 2004 through 2007. Note that because we shift dates back in time (0 to 365 days) for de-identification, some of these results in HERON are noted as resulted in 2003. The table below provides a comparison of laboratory results by year for our current Cheney release versus last month's release:

Year Cottonwood (last month) Cheney (current)
2011- release date 13,905 16,886
2010- 2011 51,722 53,173
2009- 2010 50,405 50,449
2008- 2009 45,864 45,830
2007- 2008 33,265 38,757
2006- 2007 3,484 35,358
2005- 2006 0 32,447
2004- 2005 0 30,018
2003- 2004 0 16,886
2002- 2003 0 3
2001- 2002 0 0
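The handful of results in the 2002-2003 and 2003-2004 rows follow from the date shifting. Here is a minimal sketch of the arithmetic; how HERON actually chooses the offset is not described here, so the sketch simply takes it as a parameter.

```python
from datetime import date, timedelta


def shift_back(observed, offset_days):
    """Shift a date back in time by offset_days (0 to 365)."""
    if not 0 <= offset_days <= 365:
        raise ValueError("offset must be between 0 and 365 days")
    return observed - timedelta(days=offset_days)
```

For example, shift_back(date(2004, 1, 15), 30) gives date(2003, 12, 16): a result recorded in early 2004 shows up in HERON as resulted in 2003.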

Beta Disclaimer

We are providing this early access to obtain feedback from you, the research community. While we are actively working on validating the data loaded into the system with hospital and clinic technical staff, there may be problems with our translation of data from our source systems (HospitalEpicSource and ClinicIdxSource) into HERON. Please email us at heron-admin@kumc.edu if you discover information you believe may be erroneous. We are actively working on enhancing the types of data included. Stay tuned to our roadmap to track progress toward upcoming releases.

Various Issues Still Apply

Keep in mind the issues noted in the original HERON beta notice, including:

Problems/Defects/Issues Addressed in this Release

no major issues.

Outstanding Problems/Defects/Issues

No results

HERON Cottonwood release includes early Biospecimen Repository integration

In addition to a monthly data refresh to include data from July, this Cottonwood release of HERON brings an early integration of the KUMC/KUCC Biospecimen repository.

No results

HERON Cottonwood Contents Summary

The HERON repository contains approximately 480 million real observations from the hospital and clinics:

Observation Patients Source Go-Live Snapshot Issues
Demographics 15.4M 1.86M
KUH Billing (O2 via SMS) 1980s July 2011 various*
UKP Billing 2000 July 2011
5.94K 5.94K HICTR participant registry Jun 2009 July 2011
Diagnoses (ICD9) 16.4M 551K
KUH/O2/Epic Nov 2007 July 2011 various*
UKP Billing 2000 July 2011
Medications 24.4M 88.1K
KUH/O2/Epic Nov 2007 July 2011 various*
Nursing Observations 375M ?
KUH/O2/Epic Nov 2007 July 2011 various*
Lab Results 40.0M 158K
KUH/O2/Epic Nov 2007 July 2011 various*
Procedures (CPT) 8.86M 511K
UKP Billing 2000 July 2011
All 480M

Beta Disclaimer

We are providing this early access to obtain feedback from you, the research community. While we are actively working on validating the data loaded into the system with hospital and clinic technical staff, there may be problems with our translation of data from our source systems (HospitalEpicSource and ClinicIdxSource) into HERON.

Please email us at heron-admin@kumc.edu if you discover information you believe may be erroneous.

We are actively working on enhancing the types of data included. Stay tuned to our roadmap to track progress toward upcoming releases.

Various Issues Still Apply

Keep in mind the issues noted in the original HERON beta notice, including:

Problems/Defects/Issues Addressed in this Release

No results

Outstanding Problems/Defects/Issues

No results

First Frontiers Clinical/Translational Informatics Clinic

We'd like to thank everyone who came to the first Frontiers Clinical/Translational Informatics Clinic yesterday.

The clinic is scheduled for 4pm every other Tuesday in Library Room 1040. As the announcement said:

The Frontiers Clinical/Translational Informatics Clinic exists for investigators needing assistance using informatics tools for data exploration, cohort discovery, and data collection. The clinic is a brainstorming session and an education forum for investigators and informatics to share knowledge, provide feedback, and optimize the use of systems such as:

  • REDCap for data collection
  • the HERON clinical data repository and its i2b2 application
  • and other systems developed as part of Frontiers: Kansas City's Clinical and Translational Science Award.

It was interesting to chat with everybody, from investigators with proposals in development to research nurses, study coordinators, etc. We introduced our systems and talked about how they can be used to support research.

CRIS is KUMC's custom installation of Velos eResearch, the highly-regarded clinical research information system chosen by 20 medical institutions with top 25 ratings.

For those doing a clinical trial, especially for cancer or an investigational drug where adverse event reporting is required, CRIS is the way to go:

REDCap allows you to build and manage online surveys and databases quickly and securely. It's used by thousands of projects at hundreds of institutions.

If you are building a patient registry or a lightweight trial at KUMC, try REDCap:

HERON allows you to explore de-identified data from Epic/O2 (the hospital electronic medical records) and IDX (the clinical billing system). We talked about capabilities and access rules for HERON, using some HERONTrainingMaterials for reference.

Access to de-identified data is non-human subjects research and does not require approval by the HSC/IRB. For qualified faculty who want view-only access to do patient count queries, executing a system access agreement is the only requirement.

A Data Request Oversight Committee (DROC) oversees requests to:

  • sponsor collaborators and research team members
  • extract de-identified data sets

For access to identified data, you will need an IRB approved protocol.

We look forward to meeting with our user community again in two weeks.

Dan, Kahlia, Arvinder, and Russ

Approaches to data integration for health care research: i2b2, SHRINE, SPARQL and OWL

A couple of us are off to Boston this week for the i2b2 Academic Users' Group First Annual Conference to present a poster and soak up all kinds of good stuff.

i2b2 is the basis of HERON, our health care research data repository, which stores clinical observations into a fairly traditional datamart. I've done database application development of various kinds for decades, but the scale and operational challenges are new to me. I'm particularly happy that the extract/transform/load (ETL) process for our last release ran for 37 hours, lights-out, loading 450 million clinical observations from various sources, primarily, a copy of Epic's Clarity store from the KU Hospital.

While i2b2 includes a modern Ajax web front-end, it makes no use of web-style linked data, let alone OWL or realist ontology, for medical terminology alignment, which is a big part of the challenge in making HERON an effective platform for research and for data integration beyond the KUMC enterprise.

While I'm taking a break from heads-down development mode for this conference, and while I'm in Boston, I hope to take time to look more closely at developments such as:

Our Oracle database on SAS drives handles simple user queries in a few seconds, but in some cases it takes a minute or two or fails altogether. After evaluating fusion-io's solid state storage, we ordered 4 fairly large units. We're still working through the operational details of setting it up, but we hope to see considerable performance improvements for both ETL and end-user queries.

I'm also interested to catch up on some W3C stuff: RDF/SQL mapping (no RIF?! darn.), and the Semantic Web Health Care and Life Sciences (HCLS) Interest Group. It should be interesting to compare SPARQL 1.1 federated query with the corresponding i2b2 approach: SHRINE.

Announcing CTSA funding from NIH to KUMC

We're proud of our role in this week's announcement:

KUMC receives $20 million Clinical and Translational Science Award

June 14, 2011

Kansas City, Kan. — Patients will gain faster access to the benefits of health research throughout the region thanks to a grant announced today.

The University of Kansas Medical Center has received a $19,794,046 Clinical and Translational Science Award from the National Institutes of Health (NIH). The five-year grant puts the medical center among an elite, 60-member group of universities collaborating on clinical and translational research, which transforms laboratory discoveries into treatments and cures.

Launched by the NIH in 2006, the Clinical and Translational Science Awards (CTSA) program goals are to speed laboratory discoveries into treatments for patients, to work with communities in clinical research efforts, and to train a new generation of researchers to bring cures and treatments to patients faster. With its new grant, KU Medical Center will create a program called Frontiers, greatly expanding the reach of its existing Heartland Institute for Clinical and Translational Research, which has been the center of clinical and translational research for Kansas and the greater Kansas City region.

Scientists at KU have been doing translational research for years. For example, clinical trials are now being held for an ovarian cancer drug that KU researchers have reformulated so that it can be delivered in a patient's abdomen instead of intravenously, which caused negative side effects. Other scientists have discovered that DHA, the omega-3 fatty acid common in fish oil, may help infants develop better attention skills. In part, as a result of this research, DHA is now added to many infant formulas. Other researchers are studying whether exercise can slow the progression of Alzheimer's disease.

...

In fact, a big part of what's new about translational research at KUMC this year is our very own biomedical informatics division:

Biomedical Informatics accelerates scientific discovery and improves patient care by converting data into actionable information. Pharmacologists and biologists use informatics to understand how drugs and cells interact at a molecular level; scientists use software to determine what kind of patients may most benefit from a clinical trial; doctors view risk models to help individualize therapies for patients.

The specific aims from our section of the grant are:

  1. Provide a HICTR portal for investigators to access clinical and translation research resources, track usage and outcomes, and provide informatics consultation services.
  2. Create a platform, HERON (Healthcare Enterprise Repository for Ontological Narration), to integrate clinical and biomedical data for translational research.
  3. Advance medical innovation by linking biological tissues to clinical phenotype and pharmacokinetic and pharmacodynamic data generated by research in phase I and II clinical trials (address T1 translational research).
  4. Leverage an active, engaged statewide telemedicine and Health Information Exchange (HIE) to enable community based translational research (Addressing T2 translational research).

Presentation materials from Dr. Waitman's talk from last September, Developing Clinical and Translational Informatics Capabilities for Kansas University, go into more detail on those aims.

The focus of our development work for the past year or so has been on the HERON data repository, but starting with milestone:RavenCTSA, the plan is to broaden the portal from just informatics tools for use within KUMC to a variety of tools for investigators in our community.

Want to join the fun? We're hiring.

HERON April update brings revised Lab terminology, performance increase

Since the HERON Feb 2011 snapshot release, the major enhancements are:

No results

HERON Contents Summary

The HERON repository contains approximately 445 million real observations from the hospital and clinics:

Facts Patients Source Go-Live Snapshot Issues
Demographics 10.6M 1.84M
KUH Billing (O2 via SMS) 1980s Apr 2011 various*
UKP Billing 2000 Apr 2011 #406
5622 HICTR participant registry Jun 2009 Apr 2011
Diagnoses (ICD9) 15.7M 540K
KUH/O2/Epic Nov 2007 Apr 2011 various*
UKP Billing 2000 Apr 2011
Medications 22.8M 81.1K
KUH/O2/Epic Nov 2007 Apr 2011 various*
Nursing Observations 349M ?
KUH/O2/Epic Nov 2007 Apr 2011 various*
Lab Results 37.6M 150K
KUH/O2/Epic Nov 2007 Apr 2011 various*
Procedures (CPT) 8.62M 501K
UKP Billing 2000 Apr 2011
All 445M

Beta Disclaimer

We are providing this early access to obtain feedback from you, the research community. While we are actively working on validating the data loaded into the system with hospital and clinic technical staff, there may be problems with our translation of data from our source systems (HospitalEpicSource and ClinicIdxSource) into HERON.

Please email us at heron-admin@kumc.edu if you discover information you believe may be erroneous.

We are actively working on enhancing the types of data included. Stay tuned to our roadmap to track progress toward upcoming releases.

Various Issues Still Apply

Keep in mind the issues noted in the original HERON beta notice, including:

Status of Problems/Bugs

Problems/Bugs addressed in this milestone

No results

Outstanding Problems/Defects/Issues

No results

Completed tasks, Resolved Design Issues

For a more detailed account of the development of this release, see milestone:heron-apr-update.

We have a great opportunity for a Biomedical Informatics Software Engineer!

The division of medical informatics seeks highly motivated individuals with a passion for software development, scientific discovery, and improving healthcare. This position is responsible for developing and maintaining medical informatics applications to support Kansas University Medical Center. This includes developing/interacting with clinical systems (Ex. EPIC, Cerner), data warehouses and analytics, national terminology vocabularies (UMLS, RxNorm, LOINC, FDB), clinical research systems (Ex. VELOS), and external registries and state/national datasets.

-- Job J0084846

HERON Update Integrates UKP Clinic data with KUH Hospital Data

This second beta release of the HERON repository includes billing diagnoses and procedures from the UKP clinics (ClinicIdxSource), integrated with data from the hospital (HospitalEpicSource) (#306). The data is current through February 2011 for both sources.

This release also supports querying by whether a patient is in the HICTR participant registry (#155).

HERON Contents Summary

The repository contains approximately 300 million real observations from the hospital and clinics.

This is down from 470 million due to a defect in the way nursing flowsheet data was loaded (#456). We will fix this defect in our next release, which will incorporate April data.

Facts/Patients Source Go-Live Snapshot Issues
Demographics 10M/1.8M
KUH Billing (O2 via SMS) 1980s Feb 2011 various*
UKP Billing 2000 Feb 2011
Diagnoses (ICD9) 14M/521K
KUH/O2/Epic Nov 2007 Feb 2011 various*
UKP Billing 2000 Feb 2011
Medications 20M/70K
KUH/O2/Epic Nov 2007 Feb 2011 various*
Nursing Observations 217M*/?
KUH/O2/Epic Nov 2007 Feb 2011 various*
Lab Results 28M/132K
KUH/O2/Epic Nov 2007 Feb 2011 various*
Procedures (CPT) 8.4M/491K
UKP Billing 2000 Feb 2011

Beta Disclaimer

We are providing this early access to obtain feedback from you, the research community. While we are actively working on validating the data loaded into the system with hospital and clinic technical staff, there may be problems with our translation of data from our source systems (HospitalEpicSource and ClinicIdxSource) into HERON.

Please email us at heron-admin@kumc.edu if you discover information you believe may be erroneous.

We are actively working on enhancing the types of data included. Stay tuned to our roadmap to track progress toward upcoming releases.

Various Issues Still Apply

Keep in mind the issues noted in the previous HERON update notice.

Status of Problems/Bugs

Problems/Bugs addressed in this milestone

No results

Outstanding Problems/Defects/Issues

#2301
constraining birth-date by date doesn't work - HERON uses sysdate for start_date in demographics
#2617
Saved searches not updated in REDCap projects after Arkansas release

Completed tasks, Resolved Design Issues

For a more detailed account of the development of this release, see milestone:heron-update.

Getting ahead of the ball with service monitoring

In the earliest phases of development of our HERON clinical research repository, the only users were us developers and a handful of friendly alpha testers, so it was fine to discover problems as we used the system.

But one of the features included in milestone:EpicBetai2b2 is more proactive monitoring (#150), using the popular open source opsview toolset, built on nagios.

Once it was in place for HERON, I showed it to a guy who supports CRIS, and he figured out how to get nagios working on Windows servers etc.

CRIS is a long-standing production service. Its user community consists of clinical researchers. When CRIS acts up, they don't see it as an interesting technical puzzle to solve; they just see it as the darned computer getting between them and their research goals again.

The database under the CRIS service acted up over the weekend, but this time, instead of a call from a frustrated researcher on Monday morning, the CRIS support team got an automated notice right when the problem started and had it cleaned up before any users noticed.

It's so much nicer to be ahead of the ball in the customer support game.

Missing link in high performance bulk transfer between Oracle databases [DRAFT]

To build HERON, our clinical research data repository, we move a lot of data (hundreds of gigabytes) between Oracle databases:

GraphViz image

When I discovered Oracle's support for distributed transactions using database links, it seemed to be a perfect match. For example, from one of our early ETL tickets, #126:

Executing:  create database link idxp_link USING 'idxp' 

Executing:
                create global temporary table hictr_table_clone
                on commit preserve rows
                as select * from HICTR.hictr_table@idxp_link

That initial work was based on a tiny (4000 patient) slice of data, so performance was not much of an issue. But when we got into millions of records, run times ran into the tens of hours.

A long-time Oracle user here told me that sqlldr is the right tool for bulk loading data, but like sqlplus, it's a command-line tool, and Oracle seems to turn off normal access to them by default and issues a scary disclaimer when we enable access.

As I read up on sqlldr, I discovered Inserting Data Into Tables Using Direct-Path INSERT; i.e. the performance magic behind sqlldr is available using SQL too. Great! But... not over a database link:

A transaction containing a direct-path INSERT statement cannot be or become distributed.

http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_9014.htm#i2163698

As for sqlldr itself, I got the impression it's a command-line tool, and I wanted to stick with the python DB API (or JDBC or the like) rather than tackle the administrative/security issues of passwords on the command line and such.
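For reference, here is roughly what a direct-path INSERT looks like when driven from the python DB API. This is a sketch under assumptions, not our ETL code: the table names in the usage note are hypothetical, and the source query has to be local (no @link), per the restriction quoted above. Oracle requests a direct-path load via the APPEND hint on INSERT ... SELECT, and the session has to commit before it can read the loaded table again.

```python
def direct_path_insert_sql(target_table, source_query):
    """Build an INSERT ... SELECT that asks Oracle for a direct-path load."""
    # The APPEND hint is an ordinary SQL comment to any other database.
    return "insert /*+ append */ into %s %s" % (target_table, source_query)


def bulk_copy(conn, target_table, source_query):
    """Run the direct-path insert on a DB API connection and commit."""
    cursor = conn.cursor()
    try:
        cursor.execute(direct_path_insert_sql(target_table, source_query))
        # Commit right away: after a direct-path insert, Oracle will not
        # let the same transaction query the table until it commits.
        conn.commit()
    finally:
        cursor.close()
```

A call might look like bulk_copy(conn, "lab_stage", "select * from lab_source"), where both table names are placeholders.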

Managing temporary tables with a python context manager

Temporary tables can be a hassle to manage. In computing concept stats, at first, my code did the obvious:

  1. create a temporary index
  2. create a temporary table
  3. use the table and the index
  4. truncate/drop the temporary table
  5. drop the index

But if the code fails in step 3, the temporary table and the index will still be there when you run it again, and you'll get name conflicts. An obvious solution starts like:

cursor.execute("create global temporary table ...")
try:
    pass  # use the table here
finally:
    cursor.execute("truncate table ...")
    cursor.execute("drop table ...")

But it starts to get ugly when you add the try/finally for the temporary index. Isn't this a lot nicer?

    with transaction(conn) as work:
        with table_index(conn, 'metadata_by_path',
                         concept_schema, 'i2b2', ['c_dimcode']):
            with temp_table(work, stats, total_counts):
                exec_debug(work, update_labels, explain_plan=True)

This is where python context managers come in handy. temp_table is implemented like this:

from contextlib import contextmanager


@contextmanager
def temp_table(cursor, name, create_ddl):
    exec_debug(cursor, create_ddl, explain_plan=True)
    try:
        yield cursor
    finally:
        exec_debug(cursor, "truncate table %s" % name)
        exec_debug(cursor, "drop table %s" % name)

The table_index and transaction context managers are implemented likewise. Take a look at source:heron_load/concepts_stats.py and source:heron_load/db_util.py for details.
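If you don't have the source handy, here is a minimal sketch of what the other two context managers might look like. It is not the code in db_util.py: the signature of table_index is simplified (no schema argument), and it is written against the standard sqlite3 module so it can be run anywhere.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction(conn):
    """Yield a cursor; commit if the block succeeds, roll back if not."""
    cursor = conn.cursor()
    try:
        yield cursor
    except Exception:
        conn.rollback()
        raise
    else:
        conn.commit()
    finally:
        cursor.close()


@contextmanager
def table_index(conn, index_name, table, columns):
    """Create an index for the duration of the block, then drop it."""
    cursor = conn.cursor()
    cursor.execute("create index %s on %s (%s)"
                   % (index_name, table, ", ".join(columns)))
    try:
        yield index_name
    finally:
        # Runs even if the block raised, so a re-run finds no leftovers.
        cursor.execute("drop index %s" % index_name)
        cursor.close()
```

The point of the pattern is that cleanup lives next to setup, so nesting three of these reads top to bottom instead of as three levels of try/finally.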

Concept stats: how many needles are in which parts of our i2b2 haystack?

For the HERON research data repository we're building, we're using I2B2, which involves navigating a huge vocabulary of medical terminology. We found ourselves wishing that the terms were tagged with some clues about how often they occur in our database. Now that the enhancement ticket is done, we find it really does help.

I2B2 lets researchers identify patient cohorts by querying a database of "observation facts" about patients. All the facts are collected in one table, indexed by concept. Concepts may be demographics (age, gender, etc.), diagnoses, labs, medications, or procedures, and the tool supports hierarchical navigation and text search:

About 14,000 concepts are included with the I2B2 software. As we extract concepts from health records at KU Medical Center, some obviously match the i2b2 concepts but many do not. For example, we have not yet determined the relationship between local medication codes and the NDC system. This complicates the already daunting task of navigating the hierarchy of medical terminology. It's quite frustrating to poke around in the dark, running query after query, not knowing which concepts have real data attached to them.

We find that pre-computing the number of facts and the number of associated patients and including these statistics in the hierarchy makes navigation much more efficient:

We can see at a glance that we have no records of dispensing ANTI-OBESITY drugs, at least by that exact categorization.

We can also see that we only have rich data (diagnoses, labs, meds, ...) about roughly 10% of the 2 million patients in our database; this is due to rather recent deployment of an electronic medical record system compared to the long-standing use of computerized billing systems.

How we did it

In the observation_fact table, each of the facts carries a concept_cd; for example:

concept_cd encounter_num patient_num start_date ...
DEM|AGE:51231233212005-01-01...

Then the concept_cd is related to a longer path in the concept_dimension table:

concept_cd   concept_path                     ...
DEM|AGE:5    \i2b2\Demographics\Age\0-10\5    ...

And these paths are used in the i2b2 table that is used to build the hierarchical navigation display:

c_dimcode                         c_name          ...
\i2b2\Demographics\               Demographics    ...
\i2b2\Demographics\Age\           Age             ...
\i2b2\Demographics\Age\0-10\      Ages 0-10       ...
\i2b2\Demographics\Age\0-10\5     Age 5           ...

So we join each i2b2 row with the concept_dimension rows under it, and then join with observation_fact on concept_cd, and store the stats in a temporary table:

create global temporary table stats
        on commit preserve rows as
        select c.super_path,
               count(*) facts,
               count(distinct f.patient_num) patients
         from
           (select distinct super.c_dimcode super_path,
                   sub.concept_cd sub_code
            from blueheronmetadata.i2b2 super
            join blueherondata.concept_dimension sub
             on sub.concept_path like (super.c_dimcode || '%')) c
           join blueherondata.observation_fact f
             on f.concept_cd = c.sub_code
        group by super_path

To compute this efficiently, we added an index on i2b2.c_dimcode; you can see that the whole thing runs in just under 22 minutes:

2010-12-09 15:20:17.357337 starting statement:  
      CREATE INDEX blueheronmetadata.metadata_by_path
        ON blueheronmetadata.i2b2 (c_dimcode)
    
2010-12-09 15:20:18.649760 finished after 0:00:01.292423; rows affected: -1
2010-12-09 15:20:18.649852 starting statement:  create global temporary table stats
        on commit preserve rows as
        select c.super_path,
               count(*) facts,
               count(distinct f.patient_num) patients
         from
           (select distinct super.c_dimcode super_path,
                   sub.concept_cd sub_code
            from blueheronmetadata.i2b2 super
            join blueherondata.concept_dimension sub
             on sub.concept_path like (super.c_dimcode || '%')) c
           join blueherondata.observation_fact f
             on f.concept_cd = c.sub_code
        group by super_path
explain plan result:
Plan hash value: 539913617
 
-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name                   | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------------------
|   0 | CREATE TABLE STATEMENT           |                        |   206 | 34196 |       |   175M  (2)|584:12:42 |
|   1 |  LOAD AS SELECT                  | STATS                  |       |       |       |            |          |
|   2 |   SORT GROUP BY                  |                        |   206 | 34196 |       |   131M  (2)|438:09:33 |
|*  3 |    HASH JOIN                     |                        |  8398M|  1298G|  2207M|   129M  (1)|433:18:21 |
|   4 |     INDEX FAST FULL SCAN         | FACT_CNPT_PAT_ENCT_IDX |    59M|  1528M|       | 78907   (2)| 00:15:47 |
|   5 |     VIEW                         |                        |    28M|  3780M|       |   129M  (1)|431:40:01 |
|   6 |      HASH UNIQUE                 |                        |    28M|  6501M|   444G|   116M  (1)|386:55:49 |
|   7 |       TABLE ACCESS BY INDEX ROWID| CONCEPT_DIMENSION      |  9826 |  1161K|       |   257   (0)| 00:00:04 |
|   8 |        NESTED LOOPS              |                        |  1837M|   409G|       |    48M  (1)|160:23:14 |
|   9 |         INDEX FAST FULL SCAN     | METADATA_BY_PATH       |   187K|    21M|       |   753   (1)| 00:00:10 |
|* 10 |         INDEX RANGE SCAN         | CONCEPT_DIMENSION_PK   |  1769 |       |       |    56   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("F"."CONCEPT_CD"="C"."SUB_CODE")
  10 - access("SUB"."CONCEPT_PATH" LIKE "SUPER"."C_DIMCODE"||'%')
       filter("SUB"."CONCEPT_PATH" LIKE "SUPER"."C_DIMCODE"||'%')
2010-12-09 15:42:12.129193 finished after 0:21:53.479341; rows affected: -1

As an expedient hack, we then updated the concept labels to include the stats:

update blueheronmetadata.i2b2 m
      set m.c_name = (
        select distinct
          case when tots.facts > 0 then
            case when instr(m.c_name, '[') > 0 then
              regexp_replace(m.c_name, '\[.*', '')
            else m.c_name || ' ' end
            || '['
            || (case when tots.facts < 10 then '<10'
                else to_char(tots.facts) end) || ' facts'
            || (case when tots.patients < 10 then ''
                  else '; ' || tots.patients || ' patients' end)
            || ']'
         else m.c_name end
         from stats tots
           where tots.super_path = m.c_dimcode)
      where m.c_dimcode in
        (select super_path from stats)

That code is complicated because it has to

  1. mask small counts: fact counts below 10 are shown as '<10' and patient counts below 10 are omitted (per HIPAA regulations, as part of our DeIdentificationStrategy)
  2. replace any stats that were already in the label from an earlier run
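
To make the two rules concrete, here is a small Python sketch that mirrors the CASE logic of the update for a single label. It is only an illustration of the SQL above, not code from our script; the function name `label_with_stats` and the sample labels are hypothetical.

```python
import re


def label_with_stats(name, facts, patients):
    """Return a concept label with '[N facts; M patients]' appended."""
    if facts <= 0:
        # No facts: leave the label untouched.
        return name
    if "[" in name:
        # Rule 2: strip stats left by an earlier run. As in the SQL,
        # the space that preceded '[' stays in place.
        base = re.sub(r"\[.*", "", name)
    else:
        base = name + " "
    # Rule 1: mask small fact counts and omit small patient counts.
    fact_part = "<10" if facts < 10 else str(facts)
    patient_part = "" if patients < 10 else "; %d patients" % patients
    return "%s[%s facts%s]" % (base, fact_part, patient_part)


fresh = label_with_stats("Diabetes", 1234, 567)
rerun = label_with_stats(fresh, 2000, 600)
small = label_with_stats("Rare condition", 5, 3)
```

Running the function on an already-labeled name replaces the old bracketed stats instead of appending a second set, which is what lets the update run after every load.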

That's probably enough detail for others using i2b2 to do likewise. If you're curious, the script that manages the process is source:heron_load/concepts_stats.py, and the ticket (#211) tells the whole saga, including the approaches that didn't work.