Monday, May 18, 2015

Introduction: Portability is Reproducibility

I see that everyone is a bit shy to get started, so I will break the ice:

My name is Douglas Thain, and I am an Associate Professor in Computer Science and Engineering at the University of Notre Dame.  My background is in large-scale distributed systems for scientific computing.  I lead a computer science team that collaborates with researchers in physics, biology, chemistry, and other fields of experimental science.  We work to design systems that can reliably run simulations or data analysis codes on thousands of machines to enable new discoveries.  And we publish our software in open source form as the Cooperative Computing Tools.

I have sort of fallen into the reproducibility area sideways, as a result of working in systems at large scale.  For the last two years, this work has been part of the Data and Software Preservation for Open Science (DASPOS) project funded by the NSF.

And here is a tale of non-reproducibility:

A common problem that we encounter is that a student works hard to get some particular code running on their laptop or workstation, only to find that it doesn't work at all on the other machines that they can get access to.  We often find this happening in the realm of bioinformatics, where tools like BLAST, BOWTIE, and BWA are distributed in source form with some rough instructions on how to get things going.  To make it work, the student must download the software, compile it just so, install a bunch of shared libraries, and then repeat the process for each tool on which the first one depends.  This process can take weeks or months to get just right.

Now, the student is able to run the code and gets some sample results.  He or she now wishes to run the tool at scale, and uses our software to distribute the job to thousands of machines spread around campus.  Success, right?

Of course not!  None of the machines to which the student has access has been set up in the same painstaking way as the original machine, so the student ends up working backwards to debug all the different ways in which the target differs from the original.
  • Code works with Python 2 and not Python 3, check.
  • Missing config file in home directory, check.
  • found in /usr/lib, check.
  • (bang head on keyboard)

I have described this as a problem of portability across machines, but of course you can also look at it as reproducibility across time.  If an important result has been generated and now a different person wishes to reproduce that run a year later on a different machine, you can bet that they will encounter the same set of headaches.
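
To make that checklist a bit more concrete, here is a minimal sketch (not part of our tools) of how one might record a few facts about the original machine and compare them against a target machine.  The function names `environment_fingerprint` and `diff_fingerprints`, the config file `.examplerc`, and the choice of facts to record are all hypothetical; a real environment has many more moving parts than this captures.

```python
import json
import os
import platform
import shutil
import sys


def environment_fingerprint(config_files=(), tools=()):
    """Collect a few facts about this machine that commonly differ between hosts.

    Hypothetical helper: the set of facts recorded here is illustrative only.
    """
    home = os.path.expanduser("~")
    return {
        "python_version": platform.python_version(),
        "python_major": sys.version_info[0],
        "os": platform.system(),
        "machine": platform.machine(),
        # Does each expected config file exist in the home directory?
        "config_files": {
            name: os.path.exists(os.path.join(home, name))
            for name in config_files
        },
        # Where (if anywhere) each tool is found on PATH; None if missing.
        "tools": {name: shutil.which(name) for name in tools},
    }


def diff_fingerprints(original, target):
    """Return {key: (original_value, target_value)} for every key that differs."""
    keys = set(original) | set(target)
    return {
        key: (original.get(key), target.get(key))
        for key in sorted(keys)
        if original.get(key) != target.get(key)
    }


if __name__ == "__main__":
    # ".examplerc" and "python3" are placeholder names for illustration.
    here = environment_fingerprint(config_files=[".examplerc"], tools=["python3"])
    print(json.dumps(here, indent=2, sort_keys=True))
```

The idea would be to save the JSON output on the machine where the code works, run the same script on a target machine, and pass both records to `diff_fingerprints` to see what differs before the job fails in some less obvious way.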

Can we make things better?

I think we can, but it is going to take several different kinds of developments:

- Better tools for specifying and standing up environments in a precise way.
- Better habits by researchers to track and specify what they use.
- Better repositories that can track and store environments, code, and data and relate them together.
- Higher expectations by the community that work should be made reproducible.

OK, that's my introduction.  Would the other panelists like to introduce themselves?


  1. I've had some experience with this in a variety of contexts.
    I don't think the problem is tooling, because apt and conda cover a wide variety of situations for deploying software.
    Sometimes data is a problem, but for most computer science papers I read, the data sets aren't that large.
    For many of the papers (and some textbooks) I've read, reproducibility could be solved with a virtual machine snapshot. The fact that this isn't done very often is telling. The real problem is the lack of incentive to make work reproducible.

  2. There are certainly cases in CS where the datasets are small, and so dumping everything into one VM is a perfectly good solution. And you are quite right about incentives.

    But if you look at the real sciences, the data situation is different. In bioinformatics, you frequently have thousands of analyses performed on a common dataset of tens to hundreds of gigabytes. You can squeeze 100GB into a VM, but you wouldn't want to duplicate it once for each analysis code, so you need a way to archive the data independently of the code, refer to it, mount it at runtime, and verify that you have the right thing. The same is true in astronomy at larger scales (100s of TB), in high energy physics (100s of PB), and in many other fields as well.
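
    As a rough illustration of the "verify that you have the right thing" step, here is a minimal sketch that checks a dataset file against a SHA-256 checksum recorded when the data was archived. The function names `sha256_of_file` and `verify_dataset` are hypothetical, and this assumes the dataset is a single file; it is one simple way to do the check, not a description of any particular archive.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks.

    Reading in chunks keeps memory use flat even for very large files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(path, expected_sha256):
    """Raise ValueError if the file at `path` does not match the expected digest.

    Hypothetical helper: `expected_sha256` is assumed to have been recorded
    alongside the analysis code when the dataset was archived.
    """
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            "checksum mismatch for %s: expected %s, got %s"
            % (path, expected_sha256, actual)
        )
    return actual
```

    With something like this, each analysis can carry only a small reference (a name plus a checksum) to the shared dataset, and check at runtime that whatever was mounted is the same data the original run used.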