We have an implementation of the March 2013 design for HERON to REDCap, but scalability is becoming an issue and the technical debt incurred in building it is becoming unmanageable (cf. ticket:2082#comment:11).

Data Flow Diagram

GraphViz image


  • BlueHeronData is an Oracle schema/user
    • NightHeronData contains identified data
  • deid.R is an R source code file, part of source:rgate
  • patients is an R class (interface)
  • KaplanMeierStat_ctrlr.js is an i2b2 plugin controller, part of source:kmstat
    • an image is actually what km_analysis.R produces; the relevant dependency is more on the arguments that the controller supplies to the analysis script. But as it's a minor part of the discussion, I'm not inclined to fix the diagram just now.
  • kgerard is a grad student working with HERON data in this .Rda format.
  • blue signifies that the item is part of source:heron_extract
    • see also dua2redcap Jenkins job re REDCap API key and such.
    • I/O is actually in dua_io.R, not "DUA_deid_redcap.R", but the latter is more relevant to these design issues.
  • orange signifies that uploading "job_name_dict.csv" into REDCap is done manually.

Note: the Jenkins config.xml files are in source:bmi_ops.

External Constraints

  • i2b2 star schema
  • REDCap dictionary and data import formats
  • is delivered to customers, so its format is documented in HERONTrainingMaterials/HeronDataSet. The honest broker (occasionally) cites this document when delivering data.


See #2365, #1539, #2352 for progress.


Not (yet) captured in tickets:

  • Fix Completion Signal Race Condition
  • When we blow away the star schemas with heron_init, we need to re-grant Jenkins access to the databases; see ticket:2201#comment:6 .
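
The re-grant step above could be scripted rather than done by hand. The following is only a minimal sketch: the table list, the `SELECT` privilege, and the grantee name `jenkins_user` are assumptions for illustration, not taken from ticket:2201 or from the actual Jenkins configuration.

```python
# Sketch: generate the GRANT statements to re-run after heron_init
# rebuilds the star schema. The table list, privilege, and grantee
# below are illustrative assumptions, not the real configuration.
STAR_TABLES = (
    "observation_fact",
    "patient_dimension",
    "visit_dimension",
    "concept_dimension",
)


def grant_statements(schema, grantee, tables=STAR_TABLES):
    """Return one 'GRANT SELECT' statement per table in the schema."""
    return [
        f"GRANT SELECT ON {schema}.{table} TO {grantee}"
        for table in tables
    ]


if __name__ == "__main__":
    # 'jenkins_user' is a hypothetical Oracle account name.
    for stmt in grant_statements("BlueHeronData", "jenkins_user"):
        print(stmt + ";")
```

The printed statements would be reviewed and run by a DBA, or fed to whatever SQL client the rebuild job already uses.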

Maybe TODO, customer visible:

  • Option to deliver only raw data?
    • Current design always provides cooked and raw data
  • dfbuilder.R might as well produce something pretty close to what's in, plus one more "job info" file with the patient set id and label a la job_info.csv.
    • closer in what way, exactly? @'s? other?

Maybe TODO, code quality:

  • use unshift.dates() from dataset_raw.R throughout rather than the approach in simplify.datetimes() in DUA_deid_redcap.R
  • in docs, give an example and refer to .Rmd section ## Relevant facts and cohort summaries
  • test the cross-product of (by-patient vs. by-encounter) × (de-identified vs. identified), i.e. all four combinations
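
The cross-product in the last item is small enough to enumerate explicitly. A minimal sketch follows, in Python for illustration only (the real tests would presumably live alongside the R code); the names `GRANULARITIES`, `IDENTIFICATION`, and `test_matrix` are hypothetical.

```python
from itertools import product

# The two axes named in the TODO item above.
GRANULARITIES = ("by-patient", "by-encounter")
IDENTIFICATION = ("de-identified", "identified")


def test_matrix():
    """Return every (granularity, identification) pair to be tested."""
    return list(product(GRANULARITIES, IDENTIFICATION))


if __name__ == "__main__":
    # Each pair would correspond to one end-to-end extract test.
    for granularity, identification in test_matrix():
        print(f"{granularity} x {identification}")
```

Driving the tests from one list like this makes it obvious when a combination has no coverage.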

Acceptable warts:

  • The name job_name.csv suggests it's an alternative serialization of job_name.Rda, but actually it only has a small subset of what's in job_name.Rda.
  • DUA_deid_redcap.R is now a misnomer; it's used in identified data requests too.
  • "job_name-data.csv" and "job_name_data.csv" differ only in - vs _.
  • km_analysis.R and cj_analysis.R are, conceptually, separate from the rgate platform, but for deployment reasons, they're maintained as part of source:rgate rather than along with their respective plug-ins in source:kmstat.

Test/Dev Notes

moved to WritingQualityCode, #2365, etc.

