A tour of the rgate API

by Dan Connolly, KUMC Informatics

For context, see HeronStatsPlugins.

The rgate back-end integrates R scripts such as km_analysis.R with i2b2 plug-ins. The km_analysis_test.R script simulates the rgate calling environment:

source("km_analysis.R")
source("km_analysis_test.R")

Querying Patient Sets

The core of the interface is an R object that provides access to data about a patient set. Suppose we have a patient set with 30 patients. Since one patient may have multiple cases or no cases in the tumor registry, let's suppose these patients have a total of 35 cancer cases.

pset <- mock.patients(n.patient = 30, n.case = 35)
class(pset)
## [1] "patients"
pset$id
## [1] 123

The I2B2 Star Schema

I2B2 integrates all data about patients into an observation_fact table. Each observation has a concept_cd that is related to any number of concept_paths. For example, the code SEER_SITE:32010 is related to

The observations method on a patient set takes a vector of concept paths and returns the matching facts as an R data frame:

survival.paths <- mock.paths()

obs.db <- observations(pset, c(survival.paths$event))
markdown.df(obs.db)
| PATIENT_NUM | START_DATE | CONCEPT_CD | NAME_CHAR | PANEL |
|---|---|---|---|---|
| 22 | 1980-04-21 11:41:31 | MOCK-VITAL:y | Deceased | \i2b2\naaccr\S:4 Follow-up/Recurrence/Death\1760 Vital Status\0 Dead (CoC)\ |
| 26 | 1980-04-02 21:59:54 | MOCK-VITAL:y | Deceased | \i2b2\naaccr\S:4 Follow-up/Recurrence/Death\1760 Vital Status\0 Dead (CoC)\ |
| 18 | 1980-02-04 22:45:44 | MOCK-VITAL:y | Deceased | \i2b2\naaccr\S:4 Follow-up/Recurrence/Death\1760 Vital Status\0 Dead (CoC)\ |
| 11 | 1981-02-20 22:41:56 | MOCK-VITAL:y | Deceased | \i2b2\naaccr\S:4 Follow-up/Recurrence/Death\1760 Vital Status\0 Dead (CoC)\ |
| 15 | 1980-06-03 01:13:07 | MOCK-VITAL:y | Deceased | \i2b2\naaccr\S:4 Follow-up/Recurrence/Death\1760 Vital Status\0 Dead (CoC)\ |
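
To make the link between concept paths and facts more concrete, here is a minimal, self-contained sketch of how an observations method could be written as an S3 method over in-memory tables. Everything in it is hypothetical: toy.patients, the facts and concepts fields, and the prefix match on paths are assumptions for illustration, not the actual rgate or km_analysis_test.R code.

```r
# Hypothetical in-memory stand-ins for the i2b2 tables (not real rgate code).
concepts <- data.frame(
  CONCEPT_CD = c("MOCK-VITAL:y", "MOCK:start"),
  CONCEPT_PATH = c(
    "\\i2b2\\naaccr\\S:4 Follow-up/Recurrence/Death\\1760 Vital Status\\0 Dead (CoC)\\",
    "\\i2b2\\naaccr\\S:1 Cancer Identification\\0390 Date of Diagnosis\\"),
  NAME_CHAR = c("Deceased", "Diagnosed"),
  stringsAsFactors = FALSE
)
facts <- data.frame(
  PATIENT_NUM = c(22, 22, 26),
  START_DATE = as.POSIXct(c("1980-02-04 22:45:44", "1980-04-21 11:41:31",
                            "1981-10-16 11:43:38"), tz = "UTC"),
  CONCEPT_CD = c("MOCK:start", "MOCK-VITAL:y", "MOCK:start"),
  stringsAsFactors = FALSE
)

# Assumed shape of a patient set: an S3 object carrying an id plus its data.
toy.patients <- function(id, facts, concepts) {
  structure(list(id = id, facts = facts, concepts = concepts), class = "patients")
}

# S3 generic and a method for class "patients".
observations <- function(pset, paths) UseMethod("observations")

observations.patients <- function(pset, paths) {
  concepts <- pset$concepts
  # For each concept, record the first requested path that prefixes its path.
  concepts$PANEL <- vapply(concepts$CONCEPT_PATH, function(p) {
    m <- paths[startsWith(p, paths)]
    if (length(m) > 0) m[1] else NA_character_
  }, character(1), USE.NAMES = FALSE)
  matched <- concepts[!is.na(concepts$PANEL), c("CONCEPT_CD", "NAME_CHAR", "PANEL")]
  # Inner join: keep only facts whose concept code matched a requested path.
  out <- merge(pset$facts, matched, by = "CONCEPT_CD")
  out[, c("PATIENT_NUM", "START_DATE", "CONCEPT_CD", "NAME_CHAR", "PANEL")]
}

pset.toy <- toy.patients(123, facts, concepts)
obs.toy <- observations(pset.toy, concepts$CONCEPT_PATH[1])
print(obs.toy)
```

Dispatching on the class of the patient set keeps the calling script unaware of where the facts come from, which is what lets a test script substitute mock data for a database.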

I2B2 Data Set for Survival Analysis

Those survival.paths are a mock-up of the paths we need for survival analysis:

survival.paths
## $start
## [1] "\\i2b2\\naaccr\\S:1 Cancer Identification\\0390 Date of Diagnosis\\"
## 
## $end
## [1] "\\i2b2\\naaccr\\S:4 Follow-up/Recurrence/Death\\1750 Date of Last Contact\\"
## 
## $event
## [1] "\\i2b2\\naaccr\\S:4 Follow-up/Recurrence/Death\\1760 Vital Status\\0 Dead (CoC)\\"
## 
## $stratum
## [1] "\\i2b2\\naaccr\\S:1 Cancer Identification\\0440 Grade\\"
## 

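To plot a survival curve, we need, for each patient/subject, one value per path in survival.paths: a start time (date of diagnosis), an end time (date of last contact), an event indicator (vital status: dead), and a stratum (grade).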

So the data set from the database looks like this:

obs.db <- observations(pset, unlist(survival.paths))
markdown.df(head(obs.db, 20))
| PATIENT_NUM | START_DATE | CONCEPT_CD | NAME_CHAR | PANEL |
|---|---|---|---|---|
| 3 | 1982-07-17 19:46:08 | MOCK-GRADE:2 | 2 | \i2b2\naaccr\S:1 Cancer Identification\0440 Grade\ |
| 26 | 1981-10-16 11:43:38 | MOCK-GRADE:3 | 3 | \i2b2\naaccr\S:1 Cancer Identification\0440 Grade\ |
| 15 | 1982-06-06 18:20:56 | MOCK-GRADE:4 | 4 | \i2b2\naaccr\S:1 Cancer Identification\0440 Grade\ |
| 17 | 1983-09-09 07:34:35 | MOCK:end | Last Contact | \i2b2\naaccr\S:4 Follow-up/Recurrence/Death\1750 Date of Last Contact\ |
| 26 | 1981-10-16 11:43:38 | MOCK:start | Diagnosed | \i2b2\naaccr\S:1 Cancer Identification\0390 Date of Diagnosis\ |
| 3 | 1985-06-30 01:09:07 | MOCK:end | Last Contact | \i2b2\naaccr\S:4 Follow-up/Recurrence/Death\1750 Date of Last Contact\ |
| 20 | 1985-06-26 13:54:18 | MOCK:end | Last Contact | \i2b2\naaccr\S:4 Follow-up/Recurrence/Death\1750 Date of Last Contact\ |
| 6 | 1982-05-04 07:15:49 | MOCK-GRADE:3 | 3 | \i2b2\naaccr\S:1 Cancer Identification\0440 Grade\ |
| 23 | 1980-04-21 11:41:31 | MOCK:start | Diagnosed | \i2b2\naaccr\S:1 Cancer Identification\0390 Date of Diagnosis\ |
| 23 | 1980-04-21 11:41:31 | MOCK-GRADE:4 | 4 | \i2b2\naaccr\S:1 Cancer Identification\0440 Grade\ |
| 21 | 1982-10-23 18:22:51 | MOCK:start | Diagnosed | \i2b2\naaccr\S:1 Cancer Identification\0390 Date of Diagnosis\ |
| 19 | 1980-04-02 21:59:54 | MOCK:start | Diagnosed | \i2b2\naaccr\S:1 Cancer Identification\0390 Date of Diagnosis\ |
| 4 | 1983-06-18 22:04:55 | MOCK:end | Last Contact | \i2b2\naaccr\S:4 Follow-up/Recurrence/Death\1750 Date of Last Contact\ |
| 16 | 1981-06-28 04:33:57 | MOCK-GRADE:2 | 2 | \i2b2\naaccr\S:1 Cancer Identification\0440 Grade\ |
| 22 | 1980-02-04 22:45:44 | MOCK:start | Diagnosed | \i2b2\naaccr\S:1 Cancer Identification\0390 Date of Diagnosis\ |
| 21 | 1984-05-07 18:40:37 | MOCK:end | Last Contact | \i2b2\naaccr\S:4 Follow-up/Recurrence/Death\1750 Date of Last Contact\ |
| 18 | 1981-08-12 01:27:31 | MOCK-GRADE:4 | 4 | \i2b2\naaccr\S:1 Cancer Identification\0440 Grade\ |
| 22 | 1980-04-21 11:41:31 | MOCK-VITAL:y | Deceased | \i2b2\naaccr\S:4 Follow-up/Recurrence/Death\1760 Vital Status\0 Dead (CoC)\ |
| 8 | 1982-03-31 17:53:58 | MOCK:start | Diagnosed | \i2b2\naaccr\S:1 Cancer Identification\0390 Date of Diagnosis\ |
| 9 | 1980-06-03 01:13:07 | MOCK-GRADE:3 | 3 | \i2b2\naaccr\S:1 Cancer Identification\0440 Grade\ |

It starts to make sense as a whole if we sort it and prune dates and paths:

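obs.display <- function(obs) {
    # Sort by patient, then by path, then by concept code.
    sorted <- obs[order(obs$PATIENT_NUM, obs$PANEL, obs$CONCEPT_CD), ]
    # Keep only the date part of START_DATE and make sure PANEL is character.
    fix.types <- transform(sorted, START_DATE = substr(START_DATE, 1, 10), PANEL = as.character(PANEL))
    # Abbreviate each path to its last 10 characters.
    transform(fix.types, PANEL = paste("...", substr(PANEL, nchar(PANEL) - 9, nchar(PANEL))))
}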
markdown.df(head(obs.display(obs.db), 30))
| PATIENT_NUM | START_DATE | CONCEPT_CD | NAME_CHAR | PANEL |
|---|---|---|---|---|
| 3 | 1982-07-17 | MOCK:start | Diagnosed | … Diagnosis\ |
| 3 | 1981-09-02 | MOCK:start | Diagnosed | … Diagnosis\ |
| 3 | 1982-07-17 | MOCK-GRADE:2 | 2 | … 440 Grade\ |
| 3 | 1981-09-02 | MOCK-GRADE:4 | 4 | … 440 Grade\ |
| 3 | 1985-06-30 | MOCK:end | Last Contact | … t Contact\ |
| 3 | 1984-02-10 | MOCK:end | Last Contact | … t Contact\ |
| 4 | 1981-12-07 | MOCK:start | Diagnosed | … Diagnosis\ |
| 4 | 1981-02-20 | MOCK:start | Diagnosed | … Diagnosis\ |
| 4 | 1981-12-07 | MOCK-GRADE:1 | 1 | … 440 Grade\ |
| 4 | 1981-02-20 | MOCK-GRADE:4 | 4 | … 440 Grade\ |
| 4 | 1983-06-18 | MOCK:end | Last Contact | … t Contact\ |
| 4 | 1984-01-04 | MOCK:end | Last Contact | … t Contact\ |
| 5 | 1982-07-01 | MOCK:start | Diagnosed | … Diagnosis\ |
| 5 | 1982-07-01 | MOCK-GRADE:4 | 4 | … 440 Grade\ |
| 5 | 1985-05-16 | MOCK:end | Last Contact | … t Contact\ |
| 6 | 1982-05-04 | MOCK:start | Diagnosed | … Diagnosis\ |
| 6 | 1982-05-04 | MOCK-GRADE:3 | 3 | … 440 Grade\ |
| 6 | 1984-10-01 | MOCK:end | Last Contact | … t Contact\ |
| 8 | 1982-03-31 | MOCK:start | Diagnosed | … Diagnosis\ |
| 8 | 1982-03-31 | MOCK-GRADE:1 | 1 | … 440 Grade\ |
| 8 | 1985-11-23 | MOCK:end | Last Contact | … t Contact\ |
| 9 | 1981-06-20 | MOCK:start | Diagnosed | … Diagnosis\ |
| 9 | 1980-06-03 | MOCK:start | Diagnosed | … Diagnosis\ |
| 9 | 1981-06-12 | MOCK:start | Diagnosed | … Diagnosis\ |
| 9 | 1980-11-06 | MOCK:start | Diagnosed | … Diagnosis\ |
| 9 | 1980-11-06 | MOCK-GRADE:2 | 2 | … 440 Grade\ |
| 9 | 1980-06-03 | MOCK-GRADE:3 | 3 | … 440 Grade\ |
| 9 | 1981-06-12 | MOCK-GRADE:3 | 3 | … 440 Grade\ |
| 9 | 1981-06-20 | MOCK-GRADE:4 | 4 | … 440 Grade\ |
| 9 | 1985-01-11 | MOCK:end | Last Contact | … t Contact\ |

Pivoting Entity-Attribute-Value (EAV) Data

This entity-attribute-value (EAV) structure is a bit awkward to deal with, so our km_analysis.R script includes some routines to pivot columns:

obs.t0 <- obs.pivot.date(obs.db, survival.paths$start, "t0", first.t = TRUE)
head(obs.t0)
##    patient                  t0
## 5       26 1981-10-16 11:43:38
## 9       23 1980-04-21 11:41:31
## 12      19 1980-04-02 21:59:54
## 15      22 1980-02-04 22:45:44
## 19       8 1982-03-31 17:53:58
## 23      15 1980-02-16 12:18:43
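obs.tend <- obs.pivot.date(obs.db, survival.paths$end, "tend", first.t = TRUE)
# Name the columns in full rather than relying on partial matching of $t.
# This assumes obs.t0 and obs.tend list patients in the same order.
obs.t <- data.frame(patient = obs.t0$patient, t = difftime(obs.tend$tend, obs.t0$t0))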
head(obs.t)
##   patient           t
## 1      26  692.8 days
## 2      23 1892.1 days
## 3      19 1172.0 days
## 4      22 1531.1 days
## 5       8 1010.1 days
## 6      15 1454.8 days
obs.outcome <- obs.pivot.logical(obs.db, survival.paths$event, "outcome", first.t = TRUE)
head(obs.outcome)
##    patient outcome
## 18      22    TRUE
## 36      26    TRUE
## 57      18    TRUE
## 91      11    TRUE
## 95      15    TRUE
obs.stratum <- obs.pivot.name(obs.db, survival.paths$stratum, "stratum", first.t = TRUE)
head(obs.stratum)
##    patient stratum
## 2       26       3
## 8        6       3
## 10      23       4
## 14      16       2
## 17      18       4
## 20       9       3

Note the first.t argument, set to TRUE here: when a patient has multiple cases, it selects the earliest observation.
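
The actual obs.pivot.date routine lives in km_analysis.R and is not reproduced here. As a rough illustration of the idea, the sketch below pivots one attribute of an EAV table into a column and keeps the earliest observation per patient. The helper name pivot.first.date, the short PANEL labels, and the toy dates are all hypothetical; the real routine may work differently.

```r
# Toy EAV observations (made-up values; patient 3 has two diagnoses).
obs <- data.frame(
  PATIENT_NUM = c(3, 3, 5, 3, 5),
  START_DATE = as.Date(c("1982-07-17", "1981-09-02", "1982-07-01",
                         "1985-06-30", "1985-05-16")),
  PANEL = c("start", "start", "start", "end", "end"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per patient with the earliest date on a path.
pivot.first.date <- function(obs, path, name) {
  rows <- obs[obs$PANEL == path, ]
  # Sort so that each patient's earliest date comes first ...
  rows <- rows[order(rows$PATIENT_NUM, rows$START_DATE), ]
  # ... then keep only the first row seen for each patient.
  first <- rows[!duplicated(rows$PATIENT_NUM), c("PATIENT_NUM", "START_DATE")]
  names(first) <- c("patient", name)
  first
}

t0 <- pivot.first.date(obs, "start", "t0")
tend <- pivot.first.date(obs, "end", "tend")

# Join on patient rather than relying on row order, then take the difference.
both <- merge(t0, tend, by = "patient")
both$t <- difftime(both$tend, both$t0, units = "days")
print(both)
```

Joining on the patient column before subtracting is a design choice of this sketch: it stays correct even if the two pivoted tables list patients in different orders or cover different patients.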

A Survival Data Set

We can merge all the observations into one data.frame. Passing all.x=TRUE makes each merge behave like a SQL left join, so patients with no recorded outcome or stratum are kept:

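# Left joins: keep every patient in obs.t, even without an outcome or stratum.
data <- merge(merge(obs.t, obs.outcome, all.x = TRUE), obs.stratum, all.x = TRUE)
# A patient with no recorded event gets outcome FALSE.
data$outcome <- ifelse(is.na(data$outcome), FALSE, data$outcome)
# Convert days to years, approximating a year as 365 days.
data$t <- as.numeric(data$t)/365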
data
##    patient      t outcome stratum
## 1        3 3.0824   FALSE       4
## 2        4 4.4925   FALSE       4
## 3        5 0.6104   FALSE       4
## 4        6 0.6618   FALSE       3
## 5        8 2.7675   FALSE       1
## 6        9 4.3296   FALSE       3
## 7       10 1.1639   FALSE       3
## 8       11 2.7948    TRUE       1
## 9       12 2.5302   FALSE       4
## 10      13 2.0049   FALSE       1
## 11      14 1.8095   FALSE       2
## 12      15 3.9857    TRUE       3
## 13      16 1.7948   FALSE       2
## 14      17 4.6085   FALSE       2
## 15      18 2.3304    TRUE       4
## 16      19 3.2109   FALSE       3
## 17      20 5.3016   FALSE       2
## 18      21 2.5090   FALSE       2
## 19      22 4.1948    TRUE       3
## 20      23 5.1837   FALSE       4
## 21      26 1.8982    TRUE       3
## 22      27 2.5882   FALSE       3
## 23      28 5.0354   FALSE       2
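
To see why all.x = TRUE matters, consider a tiny example. Only patients with a recorded death appear in the outcome table, so a plain merge would silently drop everyone else. The values below are made up for illustration.

```r
# Durations exist for every patient; outcomes only for those with an event.
dur <- data.frame(patient = c(3, 11, 15), t = c(1125.1, 1020.1, 1454.8))
died <- data.frame(patient = c(11, 15), outcome = c(TRUE, TRUE))

inner <- merge(dur, died)               # inner join: drops patient 3
left <- merge(dur, died, all.x = TRUE)  # left join: patient 3 kept, outcome NA

# Treat a missing event as "no event observed" (a censored observation).
left$outcome <- !is.na(left$outcome) & left$outcome

# Convert days to years; 365 days per year is an approximation.
left$t <- left$t / 365
print(left)
```

Here the NA values are replaced with a logical expression instead of ifelse; both approaches yield FALSE for patients without an event.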

Survival Plot

The R survival package provides plotting support:

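library(survival)
# Fit one survival curve per stratum; columns are looked up in `data`.
fit <- survfit(Surv(t, outcome) ~ stratum, data = data)
# One color per stratum, in sorted order.
labels <- sort(unique(data$stratum))
colors <- rainbow(length(labels))
plot(fit, col = colors)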

[Figure: survival curves by stratum (plot of chunk survival_plot)]
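
The labels vector computed above is a natural input for a legend. The following self-contained sketch uses made-up data in the same shape as data and adds a legend with base graphics. The data values, axis labels, and legend placement are assumptions for illustration.

```r
library(survival)

# Made-up survival data in the same shape as `data` above.
toy <- data.frame(
  t = c(3.1, 4.5, 0.6, 2.8, 4.0, 2.3, 1.9, 5.2),
  outcome = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE),
  stratum = c(4, 4, 3, 1, 3, 4, 3, 1)
)

# Fit one curve per stratum; a logical event column means TRUE = event.
fit <- survfit(Surv(t, outcome) ~ stratum, data = toy)

# Strata are fitted in sorted order, so sorted labels line up with the colors.
labels <- sort(unique(toy$stratum))
colors <- rainbow(length(labels))

plot(fit, col = colors, xlab = "Years since diagnosis", ylab = "Survival probability")
legend("bottomleft", legend = paste("Grade", labels), col = colors, lty = 1)
```

When run non-interactively, R writes the plot to its default graphics device (typically a PDF file) instead of opening a window.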