wiki:RGateDataFlows

We have an implementation of the March 2013 design for HERON to REDCap, but scalability is becoming an issue and the technical debt incurred in building it is becoming unmanageable (cf. ticket:2082#comment:11).

Data Flow Diagram

GraphViz image

Key

  • BlueHeronData is an Oracle schema/user
    • NightHeronData contains identified data
  • deid.R is an R source code file, part of source:rgate
  • patients is an R class (interface)
  • KaplanMeierStat_ctrlr.js is an i2b2 plugin controller, part of source:kmstat
    • an image is actually what km_analysis.R produces; the relevant dependency is more on the arguments that the controller supplies to the analysis script. But as it's a minor part of the discussion, I'm not inclined to fix the diagram just now.
  • kgerard is a grad student working with HERON data in this .Rda format.
  • blue signifies redcap_upload.py is part of source:heron_extract
    • see also dua2redcap Jenkins job re REDCap API key and such.
    • I/O is actually in dua_io.R, not "DUA_deid_redcap.R", but the latter is more relevant to these design issues.
  • orange signifies that uploading "job_name_dict.csv" into REDCap is done manually.

Note: the Jenkins config.xml files are in source:bmi_ops.

External Constraints

  • i2b2 star schema
  • redcap dictionary, data import formats
  • job_name-raw.zip is delivered to customers, so its format is documented in HERONTrainingMaterials/HeronDataSet. The honest broker (occasionally) cites this document when delivering data.

Issues

See #2365, #1539, #2352 for progress.

TODO

not (yet) captured in tickets:

  • Fix Completion Signal Race Condition
  • When we blow away the star schemas with heron_init, we need to re-grant Jenkins access to the databases; see ticket:2201#comment:6.

Maybe TODO, customer visible:

  • Option to deliver only raw data?
    • Current design always provides cooked and raw data
  • dfbuilder.R might as well produce something pretty close to what's in job_name-raw.zip, plus one more "job info" file with the patient set id and label a la job_info.csv.
    • closer in what way, exactly? @'s? other?

Maybe TODO, code quality:

  • use unshift.dates() from dataset_raw.R throughout rather than the approach in simplify.datetimes() in DUA_deid_redcap.R
  • in relevant.to docs, give an example and refer to .Rmd section ## Relevant facts and cohort summaries
  • test the cross-product of by-patient vs by-encounter X de-identified vs. identified
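
The cross-product in the last item above comes to four extract configurations. A minimal sketch of enumerating them, in Python only because redcap_upload.py is; the names GRAINS, IDENTIFICATION, and extract_test_cases are hypothetical and do not come from source:rgate or source:heron_extract:

```python
from itertools import product

# Hypothetical labels for the two dimensions named in the TODO item.
GRAINS = ("by-patient", "by-encounter")
IDENTIFICATION = ("de-identified", "identified")


def extract_test_cases():
    """Return every (grain, identification) pair that should get a test run."""
    return list(product(GRAINS, IDENTIFICATION))


if __name__ == "__main__":
    for grain, ident in extract_test_cases():
        # A real test would run the extract in this configuration and
        # check its output; here we only list the combinations.
        print(f"{grain} x {ident}")
```

Listing the combinations explicitly makes it harder to silently skip one, e.g. by-encounter with identified data.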

Acceptable warts:

  • The name job_name.csv suggests it's an alternative serialization of job_name.Rda, but actually it only has a small subset of what's in job_name.Rda.
  • DUA_deid_redcap.R is now a misnomer; it's used in identified data requests too.
  • "job_name-data.csv" and "job_name_data.csv" differ only in - vs _.
  • km_analysis.R and cj_analysis.R are, conceptually, separate from the rgate platform, but for deployment reasons, they're maintained as part of source:rgate rather than along with their respective plug-ins in source:kmstat.

Test/Dev Notes

moved to WritingQualityCode, #2365, etc.

Last modified on 12/04/13 17:13:29