Reproducible Scientific Computing: Preserve the Mess or Encourage Cleanliness?

As part of the DASPOS project, my research group is developing a variety of prototype tools for software preservation. Haiyan Meng and Peter Ivie have been doing the hard work of software development while I write the blog posts. This recent paper gives an overview of what we are up to:

Douglas Thain, Peter Ivie, and Haiyan Meng, Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness?, 12th International Conference on Digital Preservation (iPres), November, 2015.

Our basic idea is that preservation is fundamentally about understanding dependencies. If you can identify (and name) all of the components needed to run a piece of code (the data, the app, the libraries, the OS, the network, everything!) then you can reproduce it precisely. The problem is that the dependencies are often invisible or irrelevant to the end user, so we need a way to make them explicit.

Given that, there are two basic approaches, and a continuum between them:

Preserve the Mess. Allow the user to do whatever they want using whatever programs, machines, and environments that they like. As things run, observe the dependencies and record them for later use. Once known, the dependencies can be collected for preservation.

Encourage Cleanliness. Require the user to run in a restricted space, in which all objects (data, programs, operating systems) must first be explicitly imported and named. Once imported, the user may combine them together, and the system will accurately record the combination.

An example of preserving the mess is to use the Parrot virtual file system to run an arbitrary application. When run in preservation mode, Parrot will observe all of the system calls and then record all the files and network resources that you accessed. This list can be used to package up all the dependencies and re-run them using a virtual machine, container, or Parrot itself. For a given application, the list can be surprisingly long and complex:

file:/etc/passwd
file:/usr/lib/libboost.so.1
file:/home/fred/mydata.dat
file:/usr/local/bin/mysim.exe
http://www.example.com/calibration.dat
git://github.com/myproject/code.git
...

This approach is easy to apply to any arbitrary program, but it only supports precise reproduction. The resulting package is only guaranteed to run the exact command and input files given in the first place. (Although one could edit the package by hand to change the inputs or make it more generally applicable.) More importantly, it loses the context of what the user was attempting to do: it does not capture the source or version number of each component, which makes it challenging to upgrade or modify components.
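To make the limitation concrete, here is a minimal Python sketch of one plausible first step in packaging: sorting the observed entries by scheme, so that local files can be copied and remote items fetched. It is not part of Parrot; the function name group_by_scheme and the OBSERVED list are illustrative assumptions, with the entries copied from the example list above.

```python
from collections import defaultdict

# Entries copied from the example dependency list above.
OBSERVED = [
    "file:/etc/passwd",
    "file:/usr/lib/libboost.so.1",
    "file:/home/fred/mydata.dat",
    "file:/usr/local/bin/mysim.exe",
    "http://www.example.com/calibration.dat",
    "git://github.com/myproject/code.git",
]


def group_by_scheme(entries):
    """Map each scheme (file, http, git, ...) to the locations seen under it."""
    groups = defaultdict(list)
    for entry in entries:
        scheme, sep, location = entry.strip().partition(":")
        if not sep:
            continue  # skip anything that is not in scheme:location form
        groups[scheme].append(location)
    return dict(groups)


if __name__ == "__main__":
    for scheme, locations in sorted(group_by_scheme(OBSERVED).items()):
        print(f"{scheme}: {len(locations)} item(s)")
        for location in locations:
            print(f"  {location}")
```

Notice what the grouped output still cannot tell you: each entry is only a location. Nothing records which release of a library or which commit of a repository was actually used, which is exactly the missing context described above.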

If you prefer encouraging cleanliness, then Umbrella is an alternate approach to preservation. Using Umbrella, the end user writes a specification that states the hardware, operating system, software packages, and input data necessary to invoke the desired program. Umbrella picks the packages out of fixed repositories, and then assembles them all together using your preference of cloud services (Amazon), container technology (Docker), or a ptrace jail (Parrot). An Umbrella file looks like this:

{
    "hardware": {
        "arch": "x86_64",
        "cores": "2",
        "memory": "2GB",
        "disk": "3GB"
    },
    "kernel": {
        "name": "linux",
        "version": ">=2.6.32"
    },
    "os": {
        "name": "Redhat",
        "version": "6.5"
    },
    "software": {
        "cmssw-5.2.5-slc5_amd64": {
            "mountpoint": "/cvmfs/cms.cern.ch"
        }
    },
    "data": {
        "final_events_2381.lhe": {
            "mountpoint": "/tmp/final_events_2381.lhe",
            "action": "none"
        }
    },
    "environ": {
        "CMS_VERSION": "CMSSW_5_2_5",
        "SCRAM_ARCH": "slc5_amd64_gcc462"
    }
}

When using Umbrella, nothing goes into the computation unless it is identified in advance by the user. In principle, one can share a computation by simply passing the JSON file along to another person, checking it into version control, or archiving it and assigning a DOI. Of course, that assumes that all the elements pointed to by the spec are also accurately archived!
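Because the spec is plain JSON, it is easy to ask it which named things would have to be archived alongside it. The sketch below uses only Python's standard json module and does not call Umbrella itself; the helper name items_to_archive is a hypothetical one chosen for illustration, and the spec text is a compact copy of the example above.

```python
import json

# Compact copy of the example spec shown above.
SPEC_TEXT = """
{
  "hardware": {"arch": "x86_64", "cores": "2", "memory": "2GB", "disk": "3GB"},
  "kernel": {"name": "linux", "version": ">=2.6.32"},
  "os": {"name": "Redhat", "version": "6.5"},
  "software": {
    "cmssw-5.2.5-slc5_amd64": {"mountpoint": "/cvmfs/cms.cern.ch"}
  },
  "data": {
    "final_events_2381.lhe": {
      "mountpoint": "/tmp/final_events_2381.lhe",
      "action": "none"
    }
  },
  "environ": {"CMS_VERSION": "CMSSW_5_2_5", "SCRAM_ARCH": "slc5_amd64_gcc462"}
}
"""


def items_to_archive(spec):
    """List (kind, name) pairs for everything the spec refers to by name."""
    items = [("os", f'{spec["os"]["name"]} {spec["os"]["version"]}')]
    for kind in ("software", "data"):
        for name in spec.get(kind, {}):
            items.append((kind, name))
    return items


if __name__ == "__main__":
    for kind, name in items_to_archive(json.loads(SPEC_TEXT)):
        print(f"{kind}: {name}")
```

Each line of output is a promise that somebody has to keep: the spec is only as durable as the repositories holding the items it names.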

Obviously, requiring cleanliness places a greater burden on the user to get organized and name everything. They also have to learn yet another tool that isn't directly related to their work. On the other hand, it makes it much easier to perform modifications, like trying out a new version of an OS, or the latest calibration data.
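As a rough illustration of why such modifications become easy, a variant of a spec can be derived with a one-field change. This is a sketch using plain Python dictionaries, not an Umbrella feature; with_os_version is a hypothetical helper, and the version "6.6" is only a placeholder (whether a matching OS image exists in any repository is an assumption).

```python
import copy


def with_os_version(spec, new_version):
    """Return a copy of spec with a different OS version; spec is left untouched."""
    variant = copy.deepcopy(spec)
    variant["os"]["version"] = new_version
    return variant


# A trimmed-down spec containing only the fields needed for this example.
base = {
    "os": {"name": "Redhat", "version": "6.5"},
    "software": {
        "cmssw-5.2.5-slc5_amd64": {"mountpoint": "/cvmfs/cms.cern.ch"}
    },
}

# "6.6" is a placeholder version chosen for illustration.
trial = with_os_version(base, "6.6")

if __name__ == "__main__":
    print("base: ", base["os"])
    print("trial:", trial["os"])
```

The original spec stays intact, so both can be kept in version control and compared side by side. Doing the same with a package captured by observation would mean editing its contents by hand.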

So, what's the right approach for scientific software: Preserving the mess, or encouraging cleanliness? Or something else entirely?

P.S. The third piece described in the paper is PRUNE, which I'll describe in a later post.

2 comments:

Anonymous, August 10, 2015 at 10:54 AM
I'm reminded of 'implicit none' from FORTRAN - scary...
Douglas Thain, August 10, 2015 at 11:33 AM
I had to look it up, but I wholly approve of the idea!

In a similar way, the Go language is very particular about source code dependencies. As in most languages, you can't use a library unless you do "import X". But just as importantly, you aren't allowed to "import X" unless you actually use something from X! This cuts down on the problem of the ever-growing set of include statements found in many C and C++ programs.


Monday, August 10, 2015
