Opened 3 years ago

Closed 2 years ago

Last modified 20 months ago

#1246 closed problem (fixed)

Approximately 2% of Medication Facts are Not Covered by our VA Med Hierarchy

Reported by: ngraham Owned by: ngraham
Priority: major Milestone: heron-walnut-update
Component: data-repository Keywords: public-web
Cc: dconnolly, rwaitman, tmcmahon Blocked By:
Blocking: Sensitive: no



Around 12% of all medication facts from Clarity don't map to a medication concept in the new VA hierarchy. This means that the end-user cannot see these medications in the ontology and therefore cannot perform any queries to access 12% of the medication facts.

The ideal goal would be to match 100% of the medications found in Clarity to a location in the VA hierarchy. That way, all medications (and subsequently all medication-related facts) would be available to the end-user. 100% coverage is likely unrealistic, but we believe there are ways to improve coverage such as:


As per #1048, we have implemented a medication ontology based on VA Drug Class (NDF-RT) and the RxNorm Semantic Clinical Dose Form. This approach is based on this 2010 paper on mashing up RxNorm and NDF-RT hierarchy.

Other TODOS:

  • Some of the SCDFs have very long text strings (see ticket:1048#comment:41). Right now we truncate, but there may be a better way to abbreviate).

Useful links:
BioPortal NDF
RxNorm technical documentation

Attachments (9)

epic_med_mapping.sql (15.6 KB) - added by ngraham 3 years ago.
Hacked-up version of epic_med_mapping.sql used to test out NDC normalization. (1.1 KB) - added by ngraham 3 years ago.
Pre-process script (just a hack for testing)
meds_with_no_rxcui.xlsx (29.8 KB) - added by ngraham 3 years ago.
Excel sheet with med facts that don't have an rxcui.
meds_with_no_rxcui_2.xlsx (38.2 KB) - added by ngraham 3 years ago.
Meds with no RxCUI after MedEx? NLP: Fixed for improper NULL handling. (1.9 KB) - added by ngraham 3 years ago.
Post-proc script (hack for testing).
Clarity Medication Mapping.pptx (340.4 KB) - added by ngraham 3 years ago.
For the meeting today (few last-minute tweaks)
epic_med_mapping.2.sql (33.0 KB) - added by dconnolly 3 years ago.
clarity_under_va.png (13.2 KB) - added by ngraham 2 years ago.
Clarity medication directly under VA class example.
unmapped.png (27.9 KB) - added by ngraham 2 years ago.
Unmapped example.

Download all attachments as: .zip

Change History (48)

comment:1 Changed 3 years ago by ngraham

  • Cc tmcmahon added

comment:2 Changed 3 years ago by dconnolly

Let's try to make it clear from ticket summaries what it means to be done.
I expect there will always be ways to "improve coverage...".


  • change this to a problem ticket, with an explanation that (a) helps users understand the limitation, and (b) helps us judge the priority of fixing the problem:
    • "cannot search for [some important med] among others" or
    • "23% of Hospital med orders are not covered by our VA med hierarchy" (I made up 23%, of course)
  • make separate tickets for Medex, one for "Normalization"

Having thought through it a bit, the separate tickets approach doesn't seem worthwhile in this

comment:3 Changed 3 years ago by ngraham

  • Type changed from enhancement to problem

comment:4 Changed 3 years ago by ngraham


Normalized NDCs:
Loading the normalized NDCs does give us an advantage - we match 5,595 more distinct medication ids (from 13,200 to 18,795).

However, with the 2012.06 data (Cimarron), that only gets us 11,960 more facts (though a lot more meds map to the hierarchy). This is only .01% or so :)

Attached is a hacked-up version of the epic_med_mapping.sql file I used to experiment with.

Med Ex
Spot-checking the results of scanning Clarity medication names shows seemingly good matches (eye-balling the Clarity name with the RxNorm? name).

I picked the top 3 missing medications (totally 1,393,758 facts) and checked against the results from the Med Ex - got hits for all of them (including NDFRT sources).

Med Ex NLP Details

DB2828D7CFFCADB8D514AC2D1AF563A9 from Vanderbilt site.

Extracted the Medication_id and name fields from the clarity.clarity_medication table. Ran as per readme.

It appears to take a non-trivial amount of time (30 minutes on my PC).

Interestingly, it appears that the RxNorm? data is included in the download. I suspect it wouldn't be too swap out with newer data from RxNorm?.

Anyway, from the output, spot-checking shows seemingly good matches (eye-balling the Clarity name with the RxNorm? name). The Med Ex code does appear to reject quite a few things, which is probably good (avoiding false-positives) but I haven't inspected the code to know any details on the matching.

I picked the top 3 missing medications (totally 1,393,758 facts) and checked against the results from the Med Ex - got hits for all of them (including NDFRT sources):

select * from clarity.clarity_medication@epic where medication_id in (210319, 210264, 81928);
select * from rxnorm.rxnconso@kumc where rxcui in ('2180', '5966', '11124') and tty='PT';

NDC Details

A glance at the breakdown of values in clarity.clarity_ndc_codes and how many we can convert to the RxNorm?/HIPAA 11 digit format (using Sample normalization code from nlm).

150,312 lines processed, 30,439 could not be converted.

Can convert

format (digits-digits...)Count

Cannot convert (unknown format)

/* for testing NDC normalization.  manually loaded CSV for testing. */
create table norm_ndc (
  LINE	NUMBER(38,0),
--TOTAL number of entries (multiple NDCs for a single med id)
select count(*) from clarity.clarity_ndc_codes@epic cnc;
--150311 (same lines processed by script - note header)

--TOTAL in source table
select count(distinct cnc.medication_id) from clarity.clarity_ndc_codes@epic cnc;

--TOTAL med ids
select count(distinct nndc.medication_id) from norm_ndc nndc;

select count(distinct nndc.medication_id) from norm_ndc nndc
join rxnorm.rxnsat@kumc rxs 
on nndc.clarity_ndc_code=rxs.atv
where rxs.atn='NDC';
-- 13200

select count(distinct nndc.medication_id) from norm_ndc nndc
join rxnorm.rxnsat@kumc rxs 
on nndc.norm_ndc_code=rxs.atv
where rxs.atn='NDC';

Changed 3 years ago by ngraham

Hacked-up version of epic_med_mapping.sql used to test out NDC normalization.

comment:5 Changed 3 years ago by dconnolly

How many names are left to match?

Have you see the Approximate Match String Search in the RxNorm API? (That API is a web service.)

comment:6 Changed 3 years ago by ngraham


I've loaded the results from the MedEx NLP on both the and the clarity_medication.generic_name. The number of RxCUI matches we get looks promising (approx 800K facts are left without RxCUIs from the original 10.6M facts that weren't mapped to the hierarchy before).

Work is still in progress to determine how many facts we actually get mapped to the hierarchy using MedEx (saving now to preserve my notes).

  • We have 100,395,527 medication facts total
  • 10,636,780 facts don't yet fit in the VA hierarchy (missing NDC/GCN)
    • This is 89.4% mapped.
  • If we use MedEx on the field, we don't have RxCUIs for 1,382,749 of those missing facts. If we include the clarity_medication.generic_name the missing go down to 826,445 facts.


  • As per suggestions from the MedEx authors at Vanderbilt, I pre-processed the table output from the SQL dump (clarity_medication table) to add a period and an extra new-line to each medication name.
    • I seem to always get this error, though I do get a lot of output
      MedParser_V1.exe -i input_dir -o output_dir
      Exception KeyError: KeyError(7520,) in <module 'threading' from '<frozen>'> ignored
  • Post-process result to extract "CLARITY_NAME","UMLSCUI","RXCUI".
    • MedEx appears to escape quotes with double quotes - so replace '"' with "
    • Also, remove the '.' from the name in the post-process
    • 81,040 lines
    • Max medication name length: 116 characters
  • Import into KUMC.RXNORM.clarity_name_to_rxcui_medex
    • format: csv
    • skip header
    • left/right enclosure: None
    • 81,040 rows successfully imported
  • I found that some med names have a period in them (ie "A.E.R. WITCH HAZEL 12.5-50 % TP PADM"). So, MedEx strips that out in the output file so it becomes "WITCH HAZEL 12.5-50 % TP PADM" so we can't match back to clarity names (can't match 2,472 clarity names).
    • To combat this, I also ran MedEx? on the generic_name saved in the clarity_medication table.
    • 78,634 rows imported into kumc.rxnorm.g_clarity_name_to_rx_cui_medex.
  • Preliminary results show that we now have RxCUIs for all but 1,382,749 facts (as opposed to 10,636,780 before) without generic name matching. Including generic name matching takes us to 826,445. This doesn't mean they map to the VA hierarchy yet...that's still to come...but at least we have an RxCUI

Query used for testing these matches:

--Facts we're missing now (from above): 10636780
--Total facts (from above): 100395527
with med_facts as(
  --Distinct counts by concept wrt medications
  select distinct concept_cd, facts
  from blueheronmetadata.counts_by_concept@deid cbc
  where cbc.concept_cd like 'KUH|MEDICATION_ID:%'
rxnorm_med_concepts as(
  --Distinct concepts that link to RxNorm
  select distinct(c_basecode) concept_cd
  from BLUEHERONMETADATA.rxnorm_terms@deid 
  where c_visualattributes = 'LA'
joined_med_concepts as(
  -- Find facts where we don't map to RxNorm
  select distinct mfct.concept_cd facts_concept_cd, mfct.facts, rxnc.concept_cd rxn_concept_cd
  from med_facts mfct
  left join rxnorm_med_concepts rxnc
  on mfct.concept_cd = rxnc.concept_cd
  where rxnc.concept_cd is null
medex_concepts as(
    distinct(ccm.medication_id), rxcui
    rxnorm.clarity_name_to_rxcui_medex@kumc medex
  join --skip no-matches
    clarity.clarity_medication@epic ccm
  on = medex.clarity_name
g_medex_concepts as(
    distinct(ccm.medication_id), rxcui
    rxnorm.g_clarity_name_to_rxcui_medex@kumc medex
  join --skip no-matches
    clarity.clarity_medication@epic ccm
    ccm.generic_name = medex.clarity_name
select sum(facts) from(
  select * from joined_med_concepts
  join medex_concepts
  on 'KUH|MEDICATION_ID:' || medex_concepts.medication_id = joined_med_concepts.facts_concept_cd
  join g_medex_concepts
  on 'KUH|MEDICATION_ID:' || g_medex_concepts.medication_id = joined_med_concepts.facts_concept_cd
  where medex_concepts.rxcui is null
); --1,382,749 without generic, 826,445 including generic.

Changed 3 years ago by ngraham

Pre-process script (just a hack for testing)

comment:7 Changed 3 years ago by rwaitman

Can you send me a file of what's missing with the different names and counts?

comment:8 Changed 3 years ago by ngraham


I counted incorrectly in the last comment. It appears that only 645,308 facts (370 concepts) don't have an RxCUI after using MedEx.

Attached is the requested file: concept_cd,fact_count,clarity_name,clarity_generic_name.

Next Step

I will now attempt to map all of these medications to the VA hierarchy.


NOTE missing_meds table is a a table containing CONCEPT_CD,FACTS,MEDEX_RXCUI,GENERIC_MEDEX_RXCUI from the counts_by_concept table where we don't have an RxCUI from matching the name.

The error I made last night was that the table will have 2 entries when MedEx doesn't match the brand name but does match the generic name. The below query should filter for those that we can't find a match on either one.

select sum(fact_count) from(

  mm1.facts_concept_cd concept_cd, mm1.facts fact_count, clarity_name, 
  ccm.generic_name clarity_generic_name, mm1.medex_rxcui medex_rxcui, 
  mm1.generic_medex_rxcui generic_medex_rxcui  
  missing_meds mm1
  clarity.clarity_medication@epic ccm
  'KUH|MEDICATION_ID:' || ccm.medication_id = mm1.facts_concept_cd
   mm1.medex_rxcui is null and mm1.generic_medex_rxcui is null
order by 

); --count is: 645,308

Note that in some cases, I've seen that we have a UMLS CUI when we don't have an RX CUI. This may help us if we bring in the UMLS tables and look for synonyms.


From clarity:

concept_cdfact countclarity nameclarity generic name

From Medex matching:

clarity nameUMLS CUIRxNORM CUI

I'll attach a dump file of the inner query which has the concept_cd,fact_count,clarity_name,clarity_generic_name.

Changed 3 years ago by ngraham

Excel sheet with med facts that don't have an rxcui.

Changed 3 years ago by ngraham

Meds with no RxCUI after MedEx? NLP: Fixed for improper NULL handling.

comment:9 Changed 3 years ago by ngraham

Oops. I was improperly handling some of the null values from the output of MedEx. attachment:meds_with_no_rxcui_2.xlsx is what I believe to be correct now. The counts increased some but not by a huge amount (671,160 facts, 513 concepts).

Changed 3 years ago by ngraham

Post-proc script (hack for testing).

comment:10 Changed 3 years ago by ngraham

comment:11 Changed 3 years ago by ngraham

The only non-obsolete terms in RxNorm for a med-id is seemingly unrelated?

While working on medication mapping, I found some oddities in the data. In the following example, the clarity medication_id has been linked to RxNorm via GCN sequence number. Looking at the result, we can see that the only term that is not suppressed is one that is labeled as an "ingredient" (TTY=IN) and doesn't appear to be related.

Note: N=Not suppressable, O=Obsolete as per RxNorm Documentation.

Clarity Med IDClarity NameRxCUISABTTYRxNorm NameSuppress

The query:

   ccm.medication_id,, rxc.rxcui, rxc.sab, rxc.tty, rxc.str, rxc.suppress
  clarity.clarity_medication@epic ccm
  clarity.rx_med_gcnseqno@epic crmg 
  rxnorm.rxnconso@kumc rxc
  lpad(crmg.gcn_seqno, 6, '0') = rxc.code
  rxc.sab = 'NDDF' and ccm.medication_id = '85352';

comment:12 Changed 3 years ago by ngraham

I've been working on this for the past few days but I've not been able to make much improvement in the linking. Additionally, I've found some questionable data (see comment:11 for an example).

I don't believe that I'll be able to provide improvement to the end-user for this next release (milestone:heron-cedarbluff-update).

Note that isn't dependent on ETL - we could update the terms separately as the facts will be loaded with the current code.

Dustin Key (AUG member) has provided some information (SAS code) on how he mapped the medications to NDF-RT, so I plan to study that next.

comment:13 Changed 3 years ago by ngraham

Similar to comment:11, I found this:

008039259436INDandelion RootN

I've matched clarity medications to RxCUIs via GCN but this results in "ingredients" as well. So, as before, I get Dandelion Root as an RxCUI for the clarity name of 12 HOUR NASAL SPRAY 0.05 % NA SPRA.

comment:14 Changed 3 years ago by ngraham

I'm dumping some notes here that I've accumulated during this investigation.

Other Tables

Here is a list of tables that may help our medication mapping as per Clarity Data Dictionary:

Tables that don't exist in the full extract

RX_MED_ONE (table not found)

  • NDDF_PLUS_MEDID "The NDDF plus MED Medication ID."
  • GCN_GMED_FLAG_C "FDB GCN_SEQNO and generic MED ID assignment indicator"

RX_MED_TWO (table not found)
RX_MED_THREE (table found, but no data)

Tables to consider bringing in next extract

RX_SIMPLE_MED_IDS "This table extracts the list of the medications that are linked to simple generic medications"



Other Notes

We do have AHFS codes via RX_MED_AHFS: (65% of medication ids) but we don't have AHFS as a source in RxNorm?.

comment:15 Changed 3 years ago by ngraham

After using MedEx to match the missing medications to dose form via ingredient, it looks like we have 94% of our facts matched. At first glance, it appears as though many of them are vitamins, containers/bags, etc. It might be fairly easy to pick off quite a few more facts.

Note that missing_meds_20120822.csv is an extract of the missing medications. Please note that will be more than one row for a given clarity concept if we found more than one RxCUI for that concept.

Code is here [400ffc4a0684]. Probably not production ready, but the med mapping does run via paver if you take the next version [b49abf833db1] which removes some code I accidentally included in the branch.

comment:16 Changed 3 years ago by rwaitman

Great work Nathan. I'll check it out but you might skim it with Tamara who can no doubt use her Informatics/Information? Science skills to map those 12000 rows ;) Seriously, could you just run it one more time but also print out a column that shows the description of the RxCUI that was mapped?

comment:17 Changed 3 years ago by ngraham


As per your request in comment:16, I've attached the new .csv that contains the RxNorm name (using "preferred term" where available). unmapped_meds_20120823.csv.

comment:18 Changed 3 years ago by ngraham

I've attached the presentation I've prepared for the meeting today: Clarity Medication Mapping. I'll probably review it again before the meeting and perhaps tweak it slightly.

comment:19 Changed 3 years ago by ngraham

  • Status changed from new to accepted

comment:20 Changed 3 years ago by ngraham

I queried the UMLS list serve regarding ingredients and GCN mapping issues (comment:11 and comment:13). The reply is here: UMLS list serve archives

John Kilbourne, MD:

The CODE value in RXNCONSO for SAB=NDDF is not a GCNSEQNO when the TTY is IN; instead it is the First Data Bank code for ingredients. To find only GCNSEQNOs, you will need to filter out TTY= IN.

Now filtering out ingredients [44f13fed6795].

comment:21 Changed 3 years ago by dconnolly

In the medex issues, I see

  • Not integrated into our ETL (“manual technical-debt”)

I'm not sure that's much of an issue, really. I look at it as more of a manual curation tool. I don't see that much value in running a fuzzy-matching tool in lights-out mode. I expect we'd just run it manually every six months or so and keep the results in source:heron_load/curated_data.

Changed 3 years ago by ngraham

For the meeting today (few last-minute tweaks)

comment:22 Changed 3 years ago by ngraham

Scribbled notes from the meeting today:

  • Look at whether or not we can map SDCF/SBDF directly to the VA class
  • Consider "helping MedEx?" out with manual string replacement (ie IJP = Injectable Product)
  • "NO HOME MEDICATIONS" is a clarity medication. Is a concept that would be useful - saying patient doesn't have documented medications.

Next Steps:

  • Highlight what MedEx? chose for the missing meds in spreadsheet
    • Give MedEx? a chance when we don't map to the hierarchy
  • Look at Multivitamin PO (152255 med id) - one GCN or multiple? Why so many RxCUIs?
    • clarity GCN always a 1:1? Or, can it be 1:many?
  • Once past these steps, Brian Barns.
  • Create "unspecified" bucket eventually.

Looking at the code with Dan:

  • Top-Down: building hierarchy first
  • Do SCDFs/SBDFs have any relationships to the VA classes? Thought no...but...
  • Consider putting modifiers in folders for medications.
  • Ask about tables in comment:14

Changed 3 years ago by dconnolly


comment:23 Changed 3 years ago by dconnolly

attachment:epic_med_mapping.2.sql is a scratchpad based on [44f13fed6795]... a form of code review, I suppose.

One of the more interesting results:


comment:24 Changed 3 years ago by dconnolly

FWIW, I managed to take our blackboard drawing and render it using graphviz on MedMapping. This is perhaps more relevant to #1048, as it covers only the initial, more straightforward techniques.

comment:25 Changed 3 years ago by rwaitman

  • Milestone changed from heron-cedarbluff-update to heron-waconda-update

comment:26 Changed 2 years ago by ngraham

  • Blocking set to 1391

I talked with Russ today - he asked that I add an ontology folder for the meds that don't fit anywhere else for now. Also, he'd like to get this change into the milestone:heron-waconda-update.

This change is only to the ontology and doesn't affect the ETL.

comment:27 Changed 2 years ago by ngraham

We now get 98.6% of the medication facts somewhere in the hierarchy whereas before we had 96.4% that were accessible with therapeutic/pharmaceutical classes. Changeset [e0005ea88aec].

The following changes were made to achieve this:

  • Added medications that map directly to the VA Hierarchy but not to a semantic clinical dose and form (SCDF).
  • Added an "unmapped" bucket in the VA Hierarchy for the top 500 medications (by order frequency) that aren't mapped anywhere else.

100% of the clarity medications existed in the hierarchy before (using therapeutic/pharmaceutical classes) but 24,000 of them fell under the "No Theraputic Class Available" category. Therefore, these medications aren't selectable by the i2b2 end-user since there are so many (see related ticket:1082).

Coverage based on fact count for the milestone:heron-cedarbluff-update:

Fact Percent in VA HierarchyTotal FactsFact Percent thera/pharma * (see note)

Note that with the thera/pharma classification method, all medications were selectable using the "find terms" tab, but modifiers were not selectable from this interface.

Consider bringing in some of the there/pharma classes for some things we can't map to VA? Things that may be KU specific like "MISCELLANEOUS MEDICAL SUPPLIES, DEVICES, NON-DRUG [248,444 facts; 21,236 patients]"?

Clarity medication directly under VA class example.

Unmapped example.

Changed 2 years ago by ngraham

Clarity medication directly under VA class example.

Changed 2 years ago by ngraham

Unmapped example.

comment:28 Changed 2 years ago by ngraham

If I grab the top 500 medications based on fact count rather than order popularity, we get 99.2% coverage with our "other medication" folder. The counts_by_concept table isn't built until the end of the ETL. I created ticket #1411 to possibly change the design to built the counts_by_concept table earlier in the ETL.

For milestone:heron-waconda-update, we'll stick with order popularity as in comment:27 (41c44654d6a6).

comment:29 Changed 2 years ago by ngraham

  • Blocking changed from 1391 to 1373

comment:30 Changed 2 years ago by ngraham

  • Blocking 1373 deleted
  • Milestone changed from heron-waconda-update to heron-walnut-update

With the "other concepts" folder based on order popularity, we now have around 98% of the medication facts somewhere in the hierarchy. There is more work to do but this is what we have for milestone:heron-waconda-update so continue this work for the next milestone (milestone:heron-walnut-update).

comment:31 Changed 2 years ago by ngraham

As per comment:30, changing summary to reflect that we have about 2% of the facts remaining unmapped.

comment:32 Changed 2 years ago by ngraham

Russ / Tamara,

Could you comment on a criteria for closing this ticket? I think I can get 99.2% with a design change (#1411). This of course includes the "other" medication bucket.

My thoughts are:

  • Work with Dan on #1411 - try to get 99%
    • This includes the "other" medication bucket.
  • Retire the "old" medication hierarchy for milestone:heron-walnut-update.
    • Though technically 100% of med facts exist in the "old" hierarchy, many were under the "No Theraputic Class Available" category and weren't selectable by the i2b2 end-user (see comment:27) except through the search interface which does not show modifiers.
      • Excluding the "No Theraputic Class Available" category, the "old" hierarchy covered 96.4% of the facts.
    • With the new hierarchy, the medications in the "other" category are searchable and can then be selected via the i2b2 ontology (including modifiers).
  • Close this ticket and open a new ticket for investigating ways of moving more medications from the "Other" category to more specific categories.

comment:33 Changed 2 years ago by tmcmahon

Here are my thoughts...

  • Close the ticket and link to #1411.
  • Retire the old medication hierarchy in walnut.
  • I defer to Russ re: opening a new ticket.

comment:34 Changed 2 years ago by ngraham

Note: Changing the design so that counts_by_concepts is built before medication mapping (#1411) also helps with automating coverage tests (to make sure we get 99%+ of the facts covered in the hierarchy).

comment:35 Changed 2 years ago by ngraham

As per ticket:1411#comment:2, I created a branch for building the concept count earlier in the ETL. Then, I used the concept count to determine which top 500 meds go in the "other" folder [09e16cc62ef6].

As per data from milestone:heron-waconda-update, we get 99.12% fact coverage. Also added a test to make sure we have at least 99% coverage in the future.

comment:36 Changed 2 years ago by ngraham

On the same MEDS_CONCEPTS_COUNT_1411_1246 branch, I removed the old therapeutic / pharmaceutical class medication tree and put the NDF-RT tree under "Medications" in anticipation that we'll decide to deploy it this way for milestone:heron-walnut-update. See also comment:33.


comment:37 Changed 2 years ago by ngraham

  • Resolution set to fixed
  • Status changed from accepted to closed

I talked with Russ - agreed that we can remove the old hierarchy, close this ticket, and create a new one to map the remaining medications in the "other" folder. Changes noted comment:36 merged to default in [a24dd707e4d1].

comment:38 Changed 20 months ago by kcrane2

Approved for public release

comment:39 Changed 20 months ago by ngraham

  • Keywords public-web added
Note: See TracTickets for help on using tickets.