Tuesday, March 1, 2016

On "Repeatability in Computer Systems Research"

Kudos to University of Arizona's Christian Collberg and Todd A. Proebsting for their article on Repeatability in computer systems research, appearing in this month's Communications of the ACM! It is wonderful to see a prominent call for our field to "fund repeatability engineering and reward commitments to sharing research artifacts."

This article recounts an interaction with authors of a research publication during an attempt to reproduce and build upon that published work; it is an interaction that is far too familiar! The article also provides a nice discussion of the benefits of reproducible research, repeatability, and benefaction. This sentiment inspired the authors' quest to review 601 recent publications for source code availability. The article provides a very nice visual summary of the authors' findings as well as a good overview of related studies to date and is well worth a read.

As one might suspect, the article reports depressing discoveries about the level of reproducibility in scientific research. In response to these results, the article makes a somewhat controversial recommendation that is a great starting point for further discussion on reproducible research: to require authors to label their publications with "sharing contracts" that specify the level of repeatability of the work.

I find this recommendation an interesting point for discussion for several reasons:

*) The contract commits peer reviewers to take the promised resources into account when evaluating the contributions made by the paper. How will they be taken into account? We have already seen evidence that separate artifact evaluation committees are a good plan, rather than asking paper reviewers to do double duty.

*) What is the incentive for authors to include a sharing contract, or to include one that states they will share artifacts rather than not sharing? The article includes many strong arguments authors have expressed for not sharing in the past, so how does this contract address those problems?

*) Where would the implementation of sharing contracts start? At a particular conference, as with Artifact Evaluation Committees? As a journal publication standard? Elsewhere?

The authors also make a recommendation that I wholeheartedly support and would like to publicly endorse on this blog: "Funding agencies should encourage researchers to request additional funds for 'repeatability engineering,' including hiring programming staff to document and maintain code, do release management, and assist other research groups wanting to repeat published experiments." Well said, Christian and Todd! Please count me in on your efforts to make this happen!

Friday, August 14, 2015

Reproducibility Workshop Report

Final report of the reproducibility@xsede at XSEDE 2014:

Monday, August 10, 2015

Preserve the Mess or Encourage Cleanliness?

As part of the DASPOS project, my research group is developing a variety of prototype tools for software preservation.  Haiyan Meng and Peter Ivie have been doing the hard work of software development while I write the blog posts.  This recent paper gives an overview of what we are up to:
Douglas Thain, Peter Ivie, and Haiyan Meng, Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness?, 12th International Conference on Digital Preservation (iPres), November, 2015.

Our basic idea is that preservation is fundamentally about understanding dependencies.  If you can identify (and name) all of the components necessary to running a bit of code (the data, the app, the libraries, the OS, the network, everything!) then you can reproduce something precisely.  The problem is that the dependencies are often invisible or irrelevant to the end user, so we need a way to make them explicit.

Given that, there are two basic approaches, and a continuum between them: 

  • Preserve the Mess.  Allow the user to do whatever they want using whatever programs, machines, and environments that they like.  As things run, observe the dependencies and record them for later use.  Once known, the dependencies can be collected for preservation.
  • Encourage Cleanliness. Require the user to run in a restricted space, in which all objects (data, programs, operating systems) must first be explicitly imported and named.  Once imported, the user may combine them together, and the system will accurately record the combination.

An example of preserving the mess is to use the Parrot virtual file system to run an arbitrary application.  When run in preservation mode, Parrot will observe all of the system calls and then record all the files and network resources that you accessed.  This list can be used to package up all the dependencies and re-run them using a virtual machine, container, or Parrot itself.  For a given application, the list can be surprisingly long and complex.


This approach is easy to apply to any arbitrary program, but only for the sake of precise reproduction.  The resulting package is only guaranteed to run the exact command and input files given in the first place.  (Although one could edit the package by hand to change the inputs or make it more generally applicable.)  More importantly, it loses the context of what the user was attempting to do: it does not capture the source or version number of all the components, making it challenging to upgrade or modify components.
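
To make the packaging step concrete, here is a minimal Python sketch of turning a recorded list of accessed files into a package.  It is not Parrot's own packaging tool or output format: the function names, the plain list of path strings, and the layout of the package directory are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_dependencies(accessed_paths, package_dir):
    """Copy each recorded file into package_dir and return a manifest.

    accessed_paths is a hypothetical stand-in for a trace: an iterable of
    path strings. The manifest maps each original path to the checksum of
    its copy, or to None when the path is missing or not a regular file.
    """
    package_root = Path(package_dir)
    manifest = {}
    for name in sorted(set(accessed_paths)):
        source = Path(name)
        if not source.is_file():
            manifest[name] = None
            continue
        # Mirror the absolute path of the file underneath the package root.
        resolved = source.resolve()
        target = package_root / resolved.relative_to(resolved.anchor)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        manifest[name] = sha256_of(target)
    return manifest
```

Note what the manifest does not contain: nothing in it says which software release a file came from, which is exactly the lost context described above.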

If you prefer encouraging cleanliness, then Umbrella is an alternate approach to preservation.  Using Umbrella, the end user writes a specification that states the hardware, operating system, software packages, and input data necessary to invoke the desired program.  Umbrella picks the packages out of fixed repositories, and then assembles them all together using your preference of cloud services (Amazon), container technology (Docker), or a ptrace jail (Parrot).  An Umbrella file looks like this:

  {
    "hardware": {
      "arch": "x86_64",
      "cores": "2",
      "memory": "2GB",
      "disk": "3GB"
    },
    "kernel": {
      "name": "linux",
      "version": ">=2.6.32"
    },
    "os": {
      "name": "Redhat",
      "version": "6.5"
    },
    "software": {
      "cmssw-5.2.5-slc5_amd64": {
        "mountpoint": "/cvmfs/cms.cern.ch"
      }
    },
    "data": {
      "final_events_2381.lhe": {
        "mountpoint": "/tmp/final_events_2381.lhe",
        "action": "none"
      }
    },
    "environ": {
      "CMS_VERSION": "CMSSW_5_2_5",
      "SCRAM_ARCH": "slc5_amd64_gcc462"
    }
  }

When using Umbrella, nothing goes into the computation unless it is identified in advance by the user.  In principle, one can share a computation by simply passing the JSON file along to another person, checking it into version control, or archiving it and assigning a DOI.  Of course, that assumes that all the elements pointed to by the spec are also accurately archived!
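
Because the specification is plain JSON, the person receiving it can inspect it with a few lines of code before trying to run anything.  The Python sketch below is only an illustration and is not part of Umbrella: the helper names are made up, and treating the six sections from the example as the expected set is an assumption, not a statement about what Umbrella itself requires.

```python
import json

# Section names taken from the example specification above. Whether Umbrella
# itself requires all of them is an assumption made for this sketch.
EXPECTED_SECTIONS = ("hardware", "kernel", "os", "software", "data", "environ")


def missing_sections(spec_text):
    """Parse a JSON specification and list the expected sections it lacks."""
    spec = json.loads(spec_text)
    return [name for name in EXPECTED_SECTIONS if name not in spec]


def mountpoints(spec_text):
    """Map each named software or data item to its declared mountpoint.

    Items that declare no mountpoint map to None.
    """
    spec = json.loads(spec_text)
    found = {}
    for section in ("software", "data"):
        for item, details in spec.get(section, {}).items():
            found[item] = details.get("mountpoint")
    return found
```

A check like this only confirms that the names are present; it says nothing about whether the objects they point to are still archived and retrievable.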

Obviously, requiring cleanliness places a greater burden on the user to get organized and name everything.  And they have to learn yet another tool that isn't directly related to their work.  On the other hand, it makes it much easier to perform modifications, like trying out a new version of an OS, or the latest calibration data.

So, what's the right approach for scientific software: Preserving the mess, or encouraging cleanliness?  Or something else entirely?

 P.S. The third piece described in the paper is PRUNE, which I'll describe in a later post.

Monday, August 3, 2015

I had some pretty extensive notes for this post, but I lost them in the upgrade to Windows 10: I foolishly forgot to sync my notebook before I hit "ok" to the upgrade! So be warned: I am reconstructing this from memory!

In the context of particle physics I have a very hard time talking about tools. After thinking about this for a while I realized why: this is part technical and part sociological. It is technical because either the tools aren't all there yet, or we are asking them to do too much. It is sociological because given the current state of the technology we are almost certainly going to have to ask people to do particle physics analysis in a different way in order to fully preserve the analysis steps.

I can explain a bit more.

First, the sociological part. The field of particle physics is an old one, and very individualistic. Organizing a large, 3000-person experiment like Atlas is often referred to as herding cats for a reason. ‘Cats’ have many strengths: people with an idea pursue it, and sometimes that pays off big, even if frowned upon by the management of the experiment. However, there is also a downside: it is particularly hard to organize things. The only way to do it is if you show there are clear benefits to doing it way A vs. way B. Data preservation falls smack dab in the middle of this. Currently tools do not cover all operations, as I will briefly discuss below. However, if physicists change how they perform the steps of the analysis, we'd be most of the way to preserving the analysis.

So they will move to a new system only if there are significant advantages. I am nervous those aren’t there yet. To talk about that, we should probably first classify the tools that are currently out there for data preservation, and the domain.

A large physics experiment that runs at the LHC, like Atlas, manipulates the data many times over to get it from raw data to published data. Much of that manipulation is automated by necessity – there is so much data there is no other way to deal with it. This part of the data processing pipeline is relatively easily preserved. The most interesting unsolved problem in data preservation is the "last mile" – where the individual analyst or analysis group performs their work. This work is unique. It can include making selection cuts on the data, feeding data into a neural network or boosted decision tree, using some custom software to enhance the signal to background, or just about anything else the physicist thinks might enhance the sensitivity of their analysis. You can already see the problem.

Almost none of this last mile is currently automated. The physicist will write scripts that automate small sections. But some portions will be submitted to a batch farm, other parts need hand tuning, etc. While a work flow tool might help, the reality is that people tend to re-run a single step hundreds of times while developing it, and then they don’t re-run it for months after that. Worse, these steps take place across a large range of machines and architectures. As a result people don’t perceive much value in a work flow system.

Preserving these steps is a daunting task (to me). One approach might be to require all physicists to use a single tool, and every single step would have to be done in that tool. The tool could then record everything. I believe technically this could work. But this tool would have to be a true kitchen sink, and an evolving one at that. It would need to understand an individual user's computer, it would need to understand clusters, it would need to understand GRID computing (and perhaps cloud computing soon too), it would have to incorporate specialized analysis tools (like neural networks, etc). In short, I just don’t see how restricting one to a single tool can make this happen. Further, I very much doubt the users – physicists – would be willing to adopt it.

The alternative is to build tools that can record each step. These exist in various forms now – you could run a command and have all the files it touches and network connections it makes archived and thus preserved. However, this is now an extra step that has to be run. Each time. And the analyzer never really knows the last time they will run a step. Sometimes during the final review a whole section needs to be re-run. Often not. Any performance impact will be noticeable. It will be hard to get the analyzer to commit to something like this as well.
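
To give a feel for what "recording each step" costs the analyzer, here is a minimal Python sketch of a wrapper that runs one command and appends a record of it to a log. It is not any existing preservation tool: the function names, the record fields, and the requirement that the user list the input and output files by hand are all assumptions made for illustration. A real tool would discover the files by tracing the process.

```python
import hashlib
import json
import subprocess
import time
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, or None if it is absent."""
    target = Path(path)
    if not target.is_file():
        return None
    return hashlib.sha256(target.read_bytes()).hexdigest()


def run_and_record(command, inputs, outputs, log_path):
    """Run command (a list of arguments) and append one JSON line to log_path.

    inputs and outputs are file names declared by the user (an assumption of
    this sketch). Inputs are hashed before the run, outputs after it.
    """
    input_digests = {name: file_digest(name) for name in inputs}
    started = time.time()
    completed = subprocess.run(command)
    record = {
        "command": list(command),
        "started": started,
        "duration_seconds": time.time() - started,
        "exit_code": completed.returncode,
        "inputs": input_digests,
        "outputs": {name: file_digest(name) for name in outputs},
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record
```

Even this toy version shows the problem: every invocation has to go through the wrapper, and hashing large inputs on each of those hundreds of development runs is exactly the kind of overhead people will notice.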

Is there a happy middle ground? Perhaps, but I’ve not seen it yet. In my own work I’ve been doing my best to move towards a continuous integration approach. Once I have a step “down”, I attempt to encode it into a Jenkins build script. This isn’t perfect – Jenkins doesn’t really understand the GRID, for example, but it is a start. And if everything can be written as a build script, perhaps that is a first-order preservation. Remaining, of course, is preserving the environment around it.

A more fruitful approach might be to work with much smaller experiments, and integrate the data preservation into the data handling layer. Perhaps if all data processing is always run as a batch job of some sort (be it on the GRID or in the cloud), that will provide a point-of-control that is still flexible enough for the user and enough of a known for the tool authors. As long as the metadata system is sufficient to track the inter-job operations… which translates to as long as the physicist does nothing outside of the data handling system. Hmmm… :-)

Friday, July 24, 2015

Artifact Evaluation Committees!

It would seem we can all agree that reproducibility is a worthwhile goal, even that it is a sort of bar that all peer-reviewed publications should clear. However, there seems to be a significant variance in the greater research community with regard to the willingness to demonstrate or require (or even suggest!) reproducibility of peer-reviewed publications. Part of that variance certainly stems from the difficulty of precisely defining what reproducibility means for any given work, i.e. a specific research paper. Another part stems from the belief that certainly the authors would have both a) taken the time to determine specifically what reproducibility means for their submission and b) ensured that all necessary artifacts for reproducibility were both publicly available and in good working order before publication.

While I applaud the level of optimism about authors' commitments to reproducibility that seems to pervade the research community today, I also find it highly unrealistic.

Authors are frequently over-worked researchers, academics, and students working long hours right up until the paper submission deadline, pushing for the best submission possible. I find it highly unrealistic to expect that this crowd of tired authors focused tightly on the goal of a submission that passes a rigorous peer review will also always take the time to pontificate on what it really means for their work to be reproducible, something that is nearly never influential in the peer-review process.

Therefore, I argue that if we truly place any value on reproducibility as a reason for or goal of peer-reviewed research, we need to make it a required element in peer-review. Until we do that we remain "all talk and no action." Sadly, we have yet to see a top peer-reviewed computer science or engineering venue willing to include reproducibility explicitly as a criterion for publication at the same level as evaluations of the novelty of or interest in the work.

However, there is (a little) hope. One of the best moves in this direction has been in the research area of software engineering. While there are certainly exceptions, for most publications in this area, reproducibility of the published work will involve some level of a) being able to access the software artifacts that comprise the novel contributions of the paper, and b) being able to execute them. This led to a new movement called Artifact Evaluation, the motivation for which is quite compelling.

Here is an excerpt:
"We find it especially ironic that areas that are so centered on software, models, and specifications would not want to evaluate them as part of the paper review process, as well as archive them with the final paper. Not examining artifacts enables everything from mere sloppiness to, in extreme cases, dishonesty. More subtly, it also imposes a subtle penalty on people who take the trouble to vigorously implement and test their ideas."

Very importantly, artifact evaluation was assigned to a special committee of PhD students and new postdocs, not to the regular PC. This strategy worked brilliantly.

The evaluation was extremely conservative, for three major reasons:

  • Evaluation of artifacts was conducted strictly after papers were unconditionally accepted for publication so that it did not affect peer review in any way.
  • Artifact evaluation was strictly limited to whether the available materials matched what was described in the paper. As Shriram Krishnamurthi notes, "We were only charged with deciding whether the artifact met the expectations set by the paper, no matter how low or incorrect those might be! This is of course a difficult emotional barrier to ignore."
  • The results of the evaluation were not publicly released. They were sent only to the authors of the paper, who could choose to describe the results in public as they wished, or not at all.

So, hats off to Shriram Krishnamurthi, Matthias Hauswirth, Steve Blackburn, Jan Vitek, Andreas Zeller, Carlo Ghezzi, and the others who have taken a stand on reproducibility of published research in software engineering! Andreas Zeller even secured a cash prize of USD 1000 from Microsoft Research for a Distinguished Artifact at ESEC/FSE 2011! AECs have now been included in several major conferences, most recently CAV 2015!

Now that we have a system that works, it would seem it is time to make it stick: make artifact evaluation standard, take it into account for paper acceptance, and make the results public.

So, how do we do that?

Wednesday, July 15, 2015

The Need for Notebooks

In response to Doug Thain's question:
What currently-available tools do you recommend for enabling reproducible scientific computing?  Is there a tool that we ought to have, but do not?

P.S. I am using "reproducibility" as an easy shorthand for re-usability, re-creation, verification, and related tasks that have already seen some discussion.  Please interpret the question broadly.
I think the most important thing we need is an electronic lab notebook that really allows us to go back and understand exactly what we did, repeat it, modify it, record it, etc.  If you accept this, it leads to a number of points:

  1. Why don't more people (including me) do this now?
  2. What tool(s) should we use?
  3. How should this integrate into the publication process (for papers, software, data, etc.)
It's a bit hypocritical for me to write this post, since I've never really been a good notebook user or note taker, whether in classes, meetings, or labs.  And I think this is true of a lot of us.  Part of it is probably making it a matter of habit, and it's hard to break a bad habit or start a new one.  And part of it is the lack of good tools with good interfaces.

Tools I've seen that I've liked include VisTrails for data exploration and visualization, and Project Jupyter for a lot of other things.  And software version control (e.g., Git) is also part of the answer, most likely.  While these independent tools have their strengths and weaknesses, I don't think they really fit the bill.

I would really like something that's much more integrated into my computer, probably at a runtime or OS layer, that helps me understand all of my work, not just my visualization or coding.

If we did have something that really was more an automated work-tracking notebook, we could use it to help us with publications as well.  For example, in my idea of transitive credit, if we are going to decide what products contributed to a new product, a starting point is the list of products we used during the creation of the new product, which could be created by such a notebook.  Or we could track our reading list in the period leading towards a new paper as a starting point for the papers we should reference.

And, this potentially leads us beyond the PDF on the path towards executable papers, which would be a key step towards reproducible science in the large.

(Note: this is crossposted to https://danielskatzblog.wordpress.com/2015/07/15/the-need-for-notebooks/)

Disclaimer: Some work by the author was supported by the National Science Foundation (NSF) while working at the Foundation; any opinion, finding, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF.

Monday, June 29, 2015

Q: Tools for Reproducible Computing?

Thank you to all of the panelists for your introductions and initial comments.  In my first posting, I suggested that we need better tools, better habits, better repositories, and higher expectations in order to achieve reproducibility.  That seems like a useful way to structure the next few questions, so I will start with tools:

What currently-available tools do you recommend for enabling reproducible scientific computing?  Is there a tool that we ought to have, but do not?

P.S. I am using "reproducibility" as an easy shorthand for re-usability, re-creation, verification, and related tasks that have already seen some discussion.  Please interpret the question broadly.