Opened 7 years ago

Closed 4 years ago

#1 closed task (fixed)

Evaluate ETL tools and techniques for data flow visualization and structured logging

Reported by: dconnolly Owned by: dconnolly
Priority: minor Milestone: gpc-ts-1
Component: data-repository Keywords: datadeps, public-web
Cc: ngraham, rwaitman, mhoag, bhamlin Blocked By: 1337
Blocking: Sensitive: no

Description

Use talend? or Pentaho? or just homebrew ETL?

Attachments (18)

heron_etl_tasks.dot (7.9 KB) - added by dconnolly 4 years ago.
heron_etl_tasks.dot.svg (86.1 KB) - added by dconnolly 4 years ago.
heron_etl_tasks.dot.pdf (28.4 KB) - added by dconnolly 4 years ago.
epic_i2b2_transform.dot (4.1 KB) - added by dconnolly 4 years ago.
epic_i2b2_transform.dot.svg (34.0 KB) - added by dconnolly 4 years ago.
epic_i2b2_transform.dot.pdf (18.9 KB) - added by dconnolly 4 years ago.
heron_load.csv (9.8 KB) - added by dconnolly 4 years ago.
etl-deps.csv (22.8 KB) - added by dconnolly 4 years ago.
,etl-deps-build.log (8.5 KB) - added by dconnolly 4 years ago.
etl-deps.dot.svg (219.4 KB) - added by dconnolly 4 years ago.
,deps.csv (25.2 KB) - added by dconnolly 4 years ago.
,err (56.5 KB) - added by dconnolly 4 years ago.
,dep.dot (24.7 KB) - added by dconnolly 4 years ago.
,dep.dot.svg (229.2 KB) - added by dconnolly 4 years ago.
task-deps.dot (6.9 KB) - added by dconnolly 4 years ago.
task-deps.dot.svg (76.4 KB) - added by dconnolly 4 years ago.
task-deps.dot.pdf (34.3 KB) - added by dconnolly 4 years ago.
task deps with dimensions_load_2224
task-deps.csv (9.0 KB) - added by dconnolly 4 years ago.
task deps with dimensions_load_2224

Download all attachments as: .zip

Change History (59)

comment:1 Changed 7 years ago by dconnolly

  • Status changed from new to accepted

comment:2 Changed 7 years ago by dconnolly

talend has an HL7 module. interesting.

comment:3 Changed 7 years ago by dconnolly

comment:4 Changed 7 years ago by dconnolly

comment:5 Changed 7 years ago by dconnolly

discussed with Russ just now... he asked whether it would save time for data cleaning tasks; no, I'd do those much more quickly with python or the like. The value of this tool seems to be in:

  • monitoring execution of ETL tasks: running things regularly or in response to events, logging, problem reporting, and the like
  • providing a graphical view of the process that doesn't get out of sync like some documentation does

comment:6 Changed 7 years ago by dconnolly

  • Component changed from i2b2-integration to dev-framework

comment:7 Changed 7 years ago by dconnolly

Hmm... how to configure development/test/production settings?

Aha! This looks like an important clue, from Talend Open Studio 4.X User Guide
(in documentation downloads):

4.13 Defining Contexts and variables

Depending on the circumstances the Job is being used in, you might want to manage it differently for
various execution types (Prod and Test in the example given below). For instance, there might be various
testing stages you want to perform and validate before a Job is ready to go live for production use.

comment:8 Changed 7 years ago by dconnolly

when trying to do a hictr_load.ctl work-alike job, I'm getting:

Exception in component tOracleOutput_1
java.sql.SQLException: ORA-00942: table or view does not exist

	at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:112)
	at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:331)
	at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:288)
	at oracle.jdbc.driver.T4C8Oall.receive(T4C8Oall.java:745)
	at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:207)
	at oracle.jdbc.driver.T4CStatement.executeForRows(T4CStatement.java:957)
	at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1170)
	at oracle.jdbc.driver.OracleStatement.executeInternal(OracleStatement.java:1696)
	at oracle.jdbc.driver.OracleStatement.execute(OracleStatement.java:1662)
	at heron_etl_eval.idx_load_0_1.idx_load.tFileInputDelimited_1Process(idx_load.java:2627)
	at heron_etl_eval.idx_load_0_1.idx_load.runJobInTOS(idx_load.java:3840)
	at heron_etl_eval.idx_load_0_1.idx_load.main(idx_load.java:3718)

It seems to be a known problem: 0012995: An error with TOracleOuput when we want to create a table with "Drop if exist and create" action

comment:9 Changed 7 years ago by dconnolly

talend can export a jobs as a JBoss ESB. I'm not really sure what that is, but since we're already using JBoss for i2b2, it might be something to look into.

comment:10 Changed 7 years ago by dconnolly

tOracleBulkExec assumes write access to the directory containing the input file.
It sorta assumes the ETL process and the server are on the same host... or at least share a filesystem.

comment:11 Changed 7 years ago by dconnolly

Russ asked about extensibility w.r.t. components; yes, it does support that. In fact, there's an exchange, a sort of marketplace of components and such.

comment:12 Changed 7 years ago by dconnolly

Using talend's normal tMap component results in processing each row with Java. The way to use talend to get the Oracle db to do the heavy lifting is to use the ELT components. But I ran into trouble getting those to work; in particular, when I put a context variable in the Default Schema Name field of tELTOracleOutput, the name of the variable ended up in the computed SQL, rather than the value of the variable.

I hope to report the issue to the talend community, but I'm not sure what priority that should get.

comment:13 Changed 7 years ago by dconnolly

  • Resolution set to wontfix
  • Status changed from accepted to closed

Though exploring the tool has been useful for understanding data warehousing norms (#12), it's pretty clear that it doesn't insulate job builders from SQL-level knowledge of ETL tasks. More traditional tools such as Oracle developer or mysql workbench work well as aids for developing SQL data manipulation instructions; capturing those instructions with talend is tedious by comparison.

So we're inclined to go with homebrew ETL scripts for now.

Talend does provide nice visual documentation of jobs once the tedious capture process is done, and we acknowledge that comparable documentation of homebrew ETL scripts is not without cost.

comment:14 Changed 6 years ago by dconnolly

  • Cc mnair added
  • Component changed from dev-framework to CDR
  • Milestone set to ringed-teal
  • Resolution wontfix deleted
  • Sensitive unset
  • Status changed from closed to reopened

as I think about TEAL? and even validation/testing of HeronLoad, I think
maybe pentaho kettle is worth a closer look.

235MB!
7b90c29565df1049bb458d5a8a4d5312 pdi-ce-4.2.0-stable.zip

apt-get install xulrunner-2.0

Spoon is a graphical user interface that allows you to design transformations and jobs that can be run with the Kettle tools — Pan and Kitchen. Pan is a data transformation engine that performs a multitude of functions such as reading, manipulating, and writing data to and from various data sources. Kitchen is a program that executes jobs designed by Spoon in XML or in a database repository. Jobs are usually scheduled in batch mode to be run automatically at regular intervals.
-- Introduction to Spoon

comment:15 Changed 6 years ago by dconnolly

  • Cc rwaitman added
  • Status changed from reopened to accepted

After playing with it a bit, I'm getting the impression that this would be a good toolset
for TEAL? work and perhaps, in the long run, HeronLoad.

The pentaho community is well represented on FreeNode; I got just the sort of help I hoped
for in setting up a trivial job to read from our database.

Next up... consider the wider BI toolset, i.e. reporting, dashboards, etc. ...

Among the books on Pentaho, Pentaho Soultions looks like it covers some gaps in my education as well as the pentaho tools themselves.

Meanwhile, I looked around to see how Pentaho compares to QV, and found that while
they compete in some ways, they also play nicely together:

comment:16 Changed 6 years ago by dconnolly

Pentaho's .kdb files store database connection info. The passwords are stored in some sort
of encrypted form, but... I need to figure out how they're encrypted. Is it just with
some well-known key?

comment:17 Changed 6 years ago by dconnolly

Also, I want to develop a job using one database connection but run it using another.
What are the norms/best practices for doing that sort of thing?

I suppose it's some combination of properties files and variables, but I should
figure out the details.

comment:18 Changed 6 years ago by dconnolly

putting this on the back burner while I work on milestone:FrontiersWebsiteUserCentered etc.

comment:19 Changed 6 years ago by dconnolly

  • Status changed from accepted to assigned

comment:20 Changed 6 years ago by dconnolly

  • Status changed from assigned to accepted

comment:21 Changed 6 years ago by fvisconti

We installed PDI and kicked the tires.

Dan recommended a Pentaho Kitchen tutorial.

comment:22 Changed 6 years ago by dconnolly

  • Resolution set to wontfix
  • Status changed from accepted to closed

Russ, Mani, Nathan, Frank and I came to the conclusion that PDI doesn't offer sufficient benefit over our existing SQL+python style (especially with nicer python DevTools... see #1046).

comment:23 Changed 6 years ago by dconnolly

Re so-called "encryption": as I suspected, it's just using a published seed
to jumble up the bits a little bit; it's not actually difficult to recover
the plaintext from the cyphertext.

Details:

In working on #1001, sqlldr was rejecting records without much in the way of diagnostics;
I thought maybe the pentaho tools could help with debugging.

I ran into a Java stacktrace, which prompted me to browse the source, where I found
Encr.java.

comment:24 Changed 5 years ago by dconnolly

  • Cc ngraham mhoag added; mnair removed
  • Priority changed from major to minor
  • Resolution wontfix deleted
  • Status changed from closed to reopened

Matt,

Russ just sent me a pointer to DataExpress BETA A Simple, Scala-based, cross-database ETL tooolkit.

Here's hoping we find time to look into it.

comment:25 follow-up: Changed 5 years ago by dconnolly

  • Cc bhamlin added
  • Component changed from CDR to dev-framework
  • Milestone changed from ringed-teal to HERONv.Next

Speaking of scala...

I gather one of the normal ways to run large batch jobs in Oracle (ETL or other wise) is with stored procedures. My tendancy has been to shy away from them (see OracleTips#PLSQL:WhenandHow), but I've read a few things that are making me reconsider; e.g.

Rule 2, the only reliable way to avoid SQL injection attacks is to only allow access to the database via stored type safe procedures.
-- Quora database security discussion

Oracle supports stored procedures in Java. For those purposes, can it tell the difference between Java and scala?

comment:26 in reply to: ↑ 25 ; follow-up: Changed 5 years ago by dconnolly

  • Resolution set to wontfix
  • Status changed from reopened to closed

Replying to dconnolly:

Oracle supports stored procedures in Java. For those purposes, can it tell the difference between Java and scala?

Yes, scala requires a newer Java version than oracle supports. :-/

Details...

Matt and I looked into this a little bit 19 Feb 13 and found:

I get this error when the version of Java I used to compile the class file isn't the same as the version of Java in Oracle (1.5.0 in 11g). To be sure you match perfectly, let Oracle compile the class for you.
-- Bill Dolan on stack overflow Mar 10 '11

Oracle 10g docs show Java version 1.4.2, though they also say "or later". Hmm.

The Scala software distribution ... requires the Java runtime version 1.6 or later.
-- Scala distribution, Created 2008-07-24. Updated: 2013-03-29

comment:27 in reply to: ↑ 26 Changed 5 years ago by mhoag

Replying to dconnolly:

Matt and I looked into this a little bit 19 Feb 13 and found:

I get this error when the version of Java I used to compile the class file isn't the same as the version of Java in Oracle (1.5.0 in 11g). To be sure you match perfectly, let Oracle compile the class for you.
-- Bill Dolan on stack overflow Mar 10 '11

Oracle 10g docs show Java version 1.4.2, though they also say "or later". Hmm.

The Scala software distribution ... requires the Java runtime version 1.6 or later.
-- Scala distribution, Created 2008-07-24. Updated: 2013-03-29

All hope for Scala may not be lost. It appears that Scala 2.9 should work on the Java 1.5 JVM (Scala 2.10 moving to Java 1.6) and Oracle 11g supposedly supports Java 1.5. Since migration to Oracle 11g is on the horizon (#1943), maybe we can try this out again.

comment:28 Changed 4 years ago by dconnolly

  • Resolution wontfix deleted
  • Status changed from closed to reopened

At the I2B2Community conference today, an Integrated Data Repository Toolkit (IDRT) was presented. It used talend. In response, several in the audience noted they use talend too.

Did I look at this one before? Ah... yes... 3 years ago:

Using talend's normal tMap component results in processing each row with Java. The way to use talend to get the Oracle db to do the heavy lifting is to use the ELT components. But I ran into trouble getting those to work ...

As we ramp up collaboration with peer sites, and we have started looking at tools like this that provide a visual representation of the ETL process, maybe it would be worth taking another look at talend among other architecture and HeronQA issues.

comment:29 Changed 4 years ago by dconnolly

  • Status changed from reopened to accepted

Changed 4 years ago by dconnolly

Changed 4 years ago by dconnolly

Changed 4 years ago by dconnolly

Changed 4 years ago by dconnolly

Changed 4 years ago by dconnolly

Changed 4 years ago by dconnolly

comment:30 Changed 4 years ago by dconnolly

ETL tools tend to let you manipulate graphical elements and then generate code.
Our approach is to write the code directly, but that shouldn't preclude nice graphical documentation.

I've made some progress on a form of static analysis of our code:

Code:

Changed 4 years ago by dconnolly

comment:31 Changed 4 years ago by dconnolly

  • Blocked By set to 1337
  • Component changed from dev-framework to data-repository
  • Milestone changed from HERONv.Next to heron-big-blue-update
  • Priority changed from minor to major

putting more priority on this, motivated by collaboration with other dev teams

comment:32 Changed 4 years ago by dconnolly

more complete diagram: attachment:etl-deps.dot.svg from attachment:etl-deps.csv
[0bf7857d2ad4]

You can see which files got successfully parsed and which were skipped in attachment:",etl-deps-build.log"

Changed 4 years ago by dconnolly

Changed 4 years ago by dconnolly

Changed 4 years ago by dconnolly

comment:33 Changed 4 years ago by dconnolly

  • Priority changed from major to minor

Changed 4 years ago by dconnolly

Changed 4 years ago by dconnolly

Changed 4 years ago by dconnolly

Changed 4 years ago by dconnolly

Changed 4 years ago by dconnolly

Changed 4 years ago by dconnolly

comment:34 Changed 4 years ago by dconnolly

  • Keywords datadeps added
  • Milestone changed from heron-crooked-creek-update to heron-solomon-update

Batch update from file ticket-triage.xls

Changed 4 years ago by dconnolly

task deps with dimensions_load_2224

comment:35 Changed 4 years ago by dconnolly

  1. [b741f2ff2cd9] refine task dependencies from data_dimensions_load_2224 in light of dep diagram

also updated attachment:task-deps.dot.pdf and attachment:task-deps.csv

Changed 4 years ago by dconnolly

task deps with dimensions_load_2224

comment:36 Changed 4 years ago by dconnolly

  • Keywords public-web added

comment:37 Changed 4 years ago by kcrane2

Cleared by Infosec

comment:38 Changed 4 years ago by dconnolly

todo (#2372): get Jenkins to re-build devdoc, including diagrams, on each commit.

comment:39 Changed 4 years ago by dconnolly

  • Milestone changed from heron-toronto-update to heron-lovewell-update

comment:40 Changed 4 years ago by dconnolly

  • Milestone changed from heron-lovewell-update to gpc-ts-1

idea: "finish" these diagrams and present them at the GPC hackathon

another idea: trac interwiki

comment:41 Changed 4 years ago by dconnolly

  • Resolution set to fixed
  • Status changed from accepted to closed

The pictures-from-code approach (comment:30) looks workable.

So much for evaluating the technique; for follow up execution, see:

No results

as well as GPC:ticket:71 on portable ETL for Epic.

Note: See TracTickets for help on using tickets.