Posts in category python

Learning Object Capability Security with the Online Python Tutor

In Everything Is Broken, Quinn Norton presents an alarming, though witty, case that Heartbleed is really just the tip of the iceberg when it comes to computer security problems.

The best weapons I've seen are (a) certified programming with dependent types, and (b) robust composition with capabilities.

And on that front, there's great news: seL4, a formally verified, capability-based microkernel written in optimized C, is going open source. That's the very lowest layer. At the other end, Secure EcmaScript lets us use JavaScript as an object capability language. Distributed Electronic Rights in JavaScript tells the story at a high level, including which bits are available and which are still in progress. Stuff like the SES node package seems to work pretty well.

Meanwhile, we're a mostly python shop. I've been playing with some python capability idioms for a while. Some of them are a bit obscure, and I've been wondering how to explain our CodeReviewNotes about explicit authority to new developers.

Then I discovered the Online Python Tutor. Perfect!

I hope that trying it out on encap.py will provide enlightenment on the encapsulation aspect of capabilities discussed in From Functions To Objects. Copy and paste encap.py into the editor and add something like this at the end:

# test
s1 = makeSlot('apple')
print s1.get()
s1.put('orange')
print s1.get()
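For context, a slot like this can be built from closures alone, with no class hierarchy granting access to the state. Here's a minimal sketch (the real encap.py may well differ; `SimpleNamespace` is just a convenient record here):

```python
from types import SimpleNamespace


def makeSlot(value):
    # All state lives in this closure; callers get only the
    # two methods, so holding the slot *is* the authority to
    # read and write it -- nothing else can reach `state`.
    state = {'value': value}
    return SimpleNamespace(
        get=lambda: state['value'],
        put=lambda new: state.update(value=new))
```

Walking this through the tutor shows each call frame capturing the same closure state, which is exactly the encapsulation picture.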

Then try walking through sealing.py to see how rights amplification works, though the motivating example, money.py, should be integrated for it to really make sense.
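To give the flavor of rights amplification: a sealer/unsealer pair lets you hand a boxed payload around in the open, while only the holder of the matching unsealer can extract it. A sketch in the same closure style (this is the classic pattern, not necessarily how sealing.py does it):

```python
def makeBrandPair():
    # A private list shared only by this seal/unseal pair acts
    # as a covert channel between them.
    shared = []

    def seal(payload):
        class Box(object):
            def _shake(self):
                shared.append(payload)
        return Box()

    def unseal(box):
        del shared[:]
        box._shake()           # the box deposits its payload...
        return shared.pop()    # ...where only this unseal can see it

    return seal, unseal
```

A box sealed by one pair deposits into its own channel, so another pair's unsealer finds its channel empty and fails: possession of both the box and the right unsealer amplifies into access.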

Quick-and-dirty usage documentation for python integration tests

In addition to my addiction to python's doctest for WritingQualityCode, I'm developing a habit of letting my modules serve as their own integration tests when run as scripts. Python comes batteries-included with argparse, and Aargh is pretty cool, but for quick-and-dirty stuff like this, I tend not to bother at all. I just let python's stack traces serve as documentation.

For example, if it's been a while since I used source:raven-j/heron_wsgi/admin_lib/i2b2pm.py, I just run it, and I get:

(haenv):admin_lib$ python i2b2pm.py
Traceback (most recent call last):
  File "i2b2pm.py", line 319, in <module>
    _test_main()
  File "i2b2pm.py", line 289, in _test_main
    user_id, full_name = sys.argv[1:3]
ValueError: need more than 0 values to unpack

The user_id, full_name = sys.argv[1:3] line is typically enough of a clue to remind me what arguments are needed:

(haenv):admin_lib$ python i2b2pm.py dconnolly 'Dan Connolly'
DEBUG:__main__:generate authorization for: ('dconnolly', 'Dan Connolly')
INFO:sqlalchemy.engine.base.Engine:SELECT USER FROM DUAL
...
INFO:sqlalchemy.engine.base.Engine:{'param_1': u'dconnolly'}
('1f8c885f-43e3-4ace-8512-88e65494d59a', <User(dconnolly, Dan Connolly)>)

Adding SEER Site Recode to HERON Tumor Registry integration

Our HERON tuttlecreek release a couple months ago included initial integration of data on ~60,000 cancer cases from the KUMC tumor registry. We organized the NAACCR terms based on work by colleagues at the Kimmel Cancer Center in Philadelphia and Group Health Cooperative in Seattle:

NAACCR terms for tumor registry

But due to an outstanding issue (#733), if you want to find, for example, brain cancer cases, you have to be an expert in the codes for primary site, histology, etc.:

For our next release, based on work with John Keighley, we're providing query by SEER Site Recode, a state-of-the-art method for combining primary site and histology:

screenshot of SEER Site Recode term hierarchy

Under the hood: Using python to convert the rules table to SQL

The SEER Site Recode ICD-O-3 (1/27/2003) Definition lays out the rules in a fairly convenient HTML table:

Converting that table to code manually might have been straightforward, but it would have been repetitive and error-prone; so, like so many geeks faced with a repetitive task, I wrote a script to automate it.

source:tumor_reg/seer_recode.py weighs in at about 200 lines, including whitespace and a handful of test cases. It reads the HTML page (well, I feed it through tidy first to clean up some table markup) and produces:

  1. A term hierarchy in CSV format (source:heron_load/curated_data/seer_recode_terms.csv)
  2. Rules to recode our ~60K cancer cases as a SQL case statement (source:heron_load/seer_recode.sql).
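The core of the translation is rendering each rule (site ranges, optional histology ranges, sometimes negated) as one arm of the case statement. A hypothetical sketch of the kind of helper involved (all names here are made up for illustration; the real seer_recode.py differs):

```python
def case_arm(label, site_ranges, hist_ranges, negate_hist, recode):
    # Render one recode rule as an arm of a SQL CASE statement.
    def ors(col, pairs):
        # A (lo, hi) pair becomes a BETWEEN test, or an equality
        # test when the range is a single code.
        return '\n   or '.join(
            "%s = '%s'" % (col, lo) if lo == hi else
            "%s between '%s' and '%s'" % (col, lo, hi)
            for lo, hi in pairs)

    test = "(%s)" % ors('site', site_ranges)
    if hist_ranges:
        neg = 'not ' if negate_hist else ''
        test += "\n  and %s(%s)" % (neg, ors('histology', hist_ranges))
    return "/* %s */ when %s then '%s'" % (label, test, recode)
```

Fed the melanoma rule, this emits an arm shaped like the excerpt below.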

The resulting SQL weighs in at about 500 lines. Handling all the different kinds of rules in the table was fun; a lot more fun than writing this sort of SQL by hand:

case
/* Lip */ when (site between 'C000' and 'C009')
  and  not (histology between '9590' and '9989'
   or histology between '9050' and '9055'
   or histology = '9140') then '20010'

...

/* Melanoma of the Skin */ when (site between 'C440' and 'C449')
  and (histology between '8720' and '8790') then '25010'

...

/* Cranial Nerves Other Nervous System */ when (site between 'C710' and 'C719')
  and (histology between '9530' and '9539') then '31040'

/* ... */ when (site between 'C700' and 'C709'
   or site between 'C720' and 'C729')
  and  not (histology between '9590' and '9989'
   or histology between '9050' and '9055'
   or histology = '9140') then '31040'

Missing link in high performance bulk transfer between Oracle databases [DRAFT]

To build HERON, our clinical research data repository, we move a lot of data (hundreds of gigabytes) between Oracle databases:

GraphViz image

When I discovered Oracle's support for distributed transactions using database links, it seemed to be a perfect match. For example, from one of our early ETL tickets, #126:

Executing:  create database link idxp_link USING 'idxp' 

Executing:
                create global temporary table hictr_table_clone
                on commit preserve rows
                as select * from HICTR.hictr_table@idxp_link

That initial work was based on a tiny (4000 patient) slice of data, so performance was not much of an issue. But when we got into millions of records, run times ran into the tens of hours.

A long-time Oracle user here told me that sqlldr is the right tool for bulk loading data, but like sqlplus, it's a command-line tool, and Oracle seems to turn off normal access to them by default and issues a scary disclaimer when we enable access.

As I read up on sqlldr, I discovered Inserting Data Into Tables Using Direct-Path INSERT; i.e. the performance magic behind sqlldr is available using SQL too. Great! But not quite:

A transaction containing a direct-path INSERT statement cannot be or become distributed.

http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_9014.htm#i2163698

Besides, sqlldr is a command-line tool, and I wanted to stick with the python DB API (or JDBC or the like) rather than tackle the administrative/security issues of passwords on the command line and such.
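For the record, the direct-path variant is just an ordinary INSERT ... SELECT carrying Oracle's APPEND hint. A hypothetical helper to compose one (table names made up; and per the documentation quoted above, the enclosing transaction must not be distributed, so the source can't be reached over a database link):

```python
def direct_path_copy(dest, src):
    # Build a direct-path INSERT ... SELECT; the APPEND hint
    # asks Oracle for a direct-path load.
    return ("insert /*+ APPEND */ into %s\n"
            "select * from %s" % (dest, src))
```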

Managing temporary tables with a python context manager

Temporary tables can be a hassle to manage. When computing concept stats, my code at first did the obvious:

  1. create a temporary index
  2. create a temporary table
  3. use the table and the index
  4. truncate/drop the temporary table
  5. drop the index

But if the code fails in step 3, the temporary table and the index will still be there when you run it again, and you'll get name conflicts. An obvious solution starts like:

cursor.execute("create global temporary table ...")
try:
    # use table
finally:
    cursor.execute("truncate table ...")
    cursor.execute("drop table ...")

But it starts to get ugly when you add the try/finally for the temporary index. Isn't this a lot nicer?

    with transaction(conn) as work:
        with table_index(conn, 'metadata_by_path',
                         concept_schema, 'i2b2', ['c_dimcode']):
            with temp_table(work, stats, total_counts):
                exec_debug(work, update_labels, explain_plan=True)

This is where python context managers come in handy. temp_table is implemented like this:

from contextlib import contextmanager


@contextmanager
def temp_table(cursor, name, create_ddl):
    exec_debug(cursor, create_ddl, explain_plan=True)
    try:
        yield cursor
    finally:
        exec_debug(cursor, "truncate table %s" % name)
        exec_debug(cursor, "drop table %s" % name)

The table_index and transaction context managers are implemented likewise. Take a look at source:heron_load/concepts_stats.py and source:heron_load/db_util.py for details.