Posts for the month of January 2011

Missing link in high performance bulk transfer between Oracle databases [DRAFT]

To build HERON, our clinical research data repository, we move a lot of data (hundreds of gigabytes) between Oracle databases:

GraphViz image

When I discover Oracle's support for distributed transactions using database links, it seemed to be a perfect match. For example, from one of our early ETL tickets, #126:

Executing:  create database link idxp_link USING 'idxp' 

Executing:
                create global temporary table hictr_table_clone
                on commit preserve rows
                as select * from HICTR.hictr_table@idxp_link

That initial work was based on a tiny (4000 patient) slice of data, so performance was not much of an issue. But when we got into millions of records, run times ran into the tens of hours.

A long-time Oracle user here told me that sqlldr is the right tool for bulk loading data, but like sqlplus, it's a command-line tool, and Oracle seems to turn off normal access to them by default and issues a scary disclaimer when we enable access.

As I read up on sqlldr, I discovered Inserting Data Into Tables Using Direct-Path INSERT; i.e. the performance magic behind sqlldr is available using SQL too. Great! But... not

A transaction containing a direct-path INSERT statement cannot be or become distributed.

http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_9014.htm#i2163698

but I also got the impression

it's a command-line tool, and I wanted to stick with the python DB API (or JDBC or the like) rather than tackle the administrative/security issues of passwords on the command line and such.