My name is Douglas Thain, and I am an Associate Professor in Computer Science and Engineering at the University of Notre Dame. My background is in large-scale distributed systems for scientific computing. I lead a computer science team that collaborates with researchers in physics, biology, chemistry, and other fields of experimental science. We work to design systems that can reliably run simulations or data analysis codes on thousands of machines to enable new discoveries. And we publish our software in open source form as the Cooperative Computing Tools.
I have sort of fallen into the reproducibility area sideways, as a result of working on systems at large scale. In the last two years, this has been part of the Data and Software Preservation for Open Science (DASPOS) project, funded by the NSF.
And here is a tale of non-reproducibility:
A common problem that we encounter is that a student works hard to get some particular code running on their laptop or workstation, only to find that it doesn't work at all on the other machines that they can get access to. We often see this in the realm of bioinformatics, where tools like BLAST, BOWTIE, and BWA are distributed in source form with some rough instructions on how to get things going. To make it work, the student must download the software, compile it just so, install a bunch of shared libraries, and then repeat the process for each tool on which the first tool depends. This process can take weeks or months to get just right.
Now, the student is able to run the code and gets some sample results. He or she now wishes to run the tool at scale, and uses our software to distribute the job to thousands of machines spread around campus. Success, right?
Of course not! None of the machines to which the student has access has been set up in the same painstaking way as the original machine, so the student ends up working backwards to debug all the different ways in which the target differs from the original.
- Code works with Python 2 and not Python 3, check.
- Missing config file in home directory, check.
- libobscure.so.3 missing from /usr/lib, check.
- (bang head on keyboard)
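Each of those checks could, in principle, be automated before the job ever leaves the original machine. Here is a minimal sketch of such a preflight script; the config file name `~/.toolrc` and the library name `libobscure` are hypothetical stand-ins for whatever a given tool actually requires:

```python
import os
import sys
from ctypes.util import find_library

def preflight():
    """Report common environment mismatches before shipping a job to another machine."""
    problems = []
    # The tool was written for Python 2; a Python 3 interpreter will break it.
    if sys.version_info[0] != 2:
        problems.append("expected Python 2, found Python %d" % sys.version_info[0])
    # A per-user config file the tool silently requires (hypothetical name).
    config = os.path.expanduser("~/.toolrc")
    if not os.path.exists(config):
        problems.append("missing config file: %s" % config)
    # A shared library the binary was linked against (hypothetical name).
    if find_library("obscure") is None:
        problems.append("missing shared library: libobscure.so")
    return problems

if __name__ == "__main__":
    for p in preflight():
        print("PROBLEM:", p)
```

Of course, the hard part is knowing what to check: every tool has a different list, and the student usually discovers it one failure at a time.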
Can we make things better?
I think we can, but it is going to take several different kinds of developments:
- Better tools for specifying and standing up environments in a precise way.
- Better habits by researchers to track and specify what they use.
- Better repositories that can track and store environments, code, and data and relate them together.
- Higher expectations by the community that work should be made reproducible.
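As one small example of the second point, a researcher could record a snapshot of the environment alongside each result, so that "what did I run this with?" has an answer later. A minimal sketch, assuming only the Python standard library (the file name `environment.json` is arbitrary):

```python
import json
import platform
import sys
from importlib import metadata

def snapshot_environment(path="environment.json"):
    """Write the interpreter version, OS, and installed packages to a manifest file."""
    packages = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"] or "unknown"
        packages.append("%s==%s" % (name, dist.version))
    info = {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": sorted(packages),
    }
    with open(path, "w") as f:
        json.dump(info, f, indent=2)
    return info
```

A manifest like this does not stand up the environment by itself, but it gives a repository (or a future student) something precise to work from.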
Ok, that's my introduction. Would the other panelists like to introduce themselves?