Friday, July 24, 2015

Artifact Evaluation Committees!

It would seem we can all agree that reproducibility is a worthwhile goal, even that it is a sort of bar that all peer-reviewed publications should clear. However, there seems to be significant variance in the greater research community in the willingness to demonstrate or require (or even suggest!) reproducibility of peer-reviewed publications. Part of that variance certainly stems from the difficulty of precisely defining what reproducibility means for any given work, i.e. a specific research paper. Another part stems from the belief that the authors will surely have both a) taken the time to determine specifically what reproducibility means for their submission and b) ensured that all necessary artifacts for reproducibility were both publicly available and in good working order before publication.

While I applaud the level of optimism about authors' commitments to reproducibility that seems to pervade the research community today, I also find it highly unrealistic.

Authors are frequently overworked researchers, academics, and students working long hours right up until the paper submission deadline, pushing for the best submission possible. I find it highly unrealistic to expect that this crowd of tired authors, focused tightly on producing a submission that passes rigorous peer review, will also always take the time to ponder what it really means for their work to be reproducible, something that almost never influences the peer-review process.

Therefore, I argue that if we truly place any value on reproducibility as a reason for or goal of peer-reviewed research, we need to make it a required element of peer review. Until we do that, we remain "all talk and no action." Sadly, we have yet to see a top peer-reviewed computer science or engineering venue willing to include reproducibility explicitly as a criterion for publication at the same level as the novelty of or interest in the work.

However, there is (a little) hope. One of the best moves in this direction has been in the research area of software engineering. While there are certainly exceptions, for most publications in this area, reproducing the published work involves some level of a) being able to access the software artifacts that comprise the novel contributions of the paper, and b) being able to execute them. This led to a new movement called Artifact Evaluation, the motivation for which is quite compelling.

Here is an excerpt:
"We find it especially ironic that areas that are so centered on software, models, and specifications would not want to evaluate them as part of the paper review process, as well as archive them with the final paper. Not examining artifacts enables everything from mere sloppiness to, in extreme cases, dishonesty. More subtly, it also imposes a subtle penalty on people who take the trouble to vigorously implement and test their ideas."

Very importantly, artifact evaluation was assigned to a special committee of PhD students and new postdocs, not to the regular PC. This strategy worked brilliantly.

The evaluation was extremely conservative, in three major ways:

  • Evaluation of artifacts was conducted strictly after papers were unconditionally accepted for publication so that it did not affect peer review in any way.
  • Artifact evaluation was strictly limited to whether the available materials matched what was described in the paper. As Shriram Krishnamurthi notes, "We were only charged with deciding whether the artifact met the expectations set by the paper, no matter how low or incorrect those might be! This is of course a difficult emotional barrier to ignore."
  • The results of the evaluation were not publicly released. They were sent only to the authors of the paper, who could choose to describe the results in public as they wished, or not at all.

So, hats off to Shriram Krishnamurthi, Matthias Hauswirth, Steve Blackburn, Jan Vitek, Andreas Zeller, Carlo Ghezzi, and the others who have taken a stand on reproducibility of published research in software engineering! Andreas Zeller even secured a cash prize of USD 1000 from Microsoft Research for a Distinguished Artifact at ESEC/FSE 2011! AECs have now been included in several major conferences, most recently CAV 2015!

Now that we have a system that works, it would seem it is time to make it stick: make artifact evaluation standard, take it into account for paper acceptance, and make the results public.

So, how do we do that?

Wednesday, July 15, 2015

The Need for Notebooks

In response to Doug Thain's question:
What currently-available tools do you recommend for enabling reproducible scientific computing?  Is there a tool that we ought to have, but do not?

P.S. I am using "reproducibility" as an easy shorthand for re-usability, re-creation, verification, and related tasks that have already seen some discussion.  Please interpret the question broadly.

I think the most important thing we need is an electronic lab notebook that really allows us to go back and understand exactly what we did, repeat it, modify it, record it, etc.  If you accept this, it leads to a number of questions:

  1. Why don't more people (including me) do this now?
  2. What tool(s) should we use?
  3. How should this integrate into the publication process (for papers, software, data, etc.)?

It's a bit hypocritical for me to write this post, since I've never really been a good notebook user or note taker, whether in classes, meetings, or labs.  And I think this is true of a lot of us.  Part of it is probably a matter of habit, and it's hard to break a bad habit or start a new one.  And part of it is the lack of good tools with good interfaces.

Tools I've seen that I've liked include VisTrails for data exploration and visualization, and Project Jupyter for a lot of other things.  And software version control (e.g., Git) is also part of the answer, most likely.  While these independent tools have their strengths and weaknesses, I don't think they really fit the bill.

I would really like something that's much more integrated into my computer, probably at a runtime or OS layer, that helps me understand all of my work, not just my visualization or coding.

If we did have something that really was more of an automated work-tracking notebook, we could use it to help us with publications as well.  For example, in my idea of transitive credit, if we are going to decide what products contributed to a new product, a starting point is the list of products we used during the creation of the new product, which such a notebook could generate.  Or we could track our reading list in the period leading towards a new paper as a starting point for the papers we should reference.
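
To make the idea a bit more concrete, here is a minimal sketch in Python of what the record-keeping core of such a notebook might look like.  It is purely illustrative: the names `WorkLog`, `Step`, `record`, and `external_products` are hypothetical and do not come from any existing tool, and a real system would capture steps automatically at the runtime or OS layer rather than through explicit calls.  Each step stores content hashes of what went in and what came out, and the log can then list the products that were used but not created along the way, which is the starting point for credit described above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def digest(data: bytes) -> str:
    """Return a SHA-256 hex digest so a recorded product can be verified later."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Step:
    """One recorded unit of work (hypothetical structure)."""
    action: str    # free-text description or command line
    inputs: dict   # product name -> content digest
    outputs: dict  # product name -> content digest
    when: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class WorkLog:
    """An append-only list of steps, in the order they happened."""
    steps: list = field(default_factory=list)

    def record(self, action, inputs, outputs):
        """Append one step; inputs and outputs map product names to raw bytes."""
        step = Step(
            action=action,
            inputs={name: digest(data) for name, data in inputs.items()},
            outputs={name: digest(data) for name, data in outputs.items()},
        )
        self.steps.append(step)
        return step

    def external_products(self):
        """Names of inputs that no earlier step in this log produced."""
        produced, external = set(), []
        for step in self.steps:
            for name in step.inputs:
                if name not in produced and name not in external:
                    external.append(name)
            produced.update(step.outputs)
        return external

    def to_json(self):
        """Serialize the whole log so it can be archived alongside a paper."""
        return json.dumps([asdict(step) for step in self.steps], indent=2)


# Example with made-up product names and contents.
log = WorkLog()
log.record(
    "clean the raw survey data",
    inputs={"survey.csv": b"a,b\n1,2\n", "cleaning-library": b"v1.2"},
    outputs={"clean.csv": b"a,b\n1,2\n"},
)
log.record(
    "plot the cleaned data",
    inputs={"clean.csv": b"a,b\n1,2\n", "plotting-library": b"v3.0"},
    outputs={"figure.png": b"not really a png"},
)
candidates = log.external_products()
```

In this example `candidates` holds `survey.csv`, `cleaning-library`, and `plotting-library`, but not `clean.csv`, since that was produced within the logged work itself.  The same log, kept over weeks rather than two steps, is the kind of record that could seed a credit list or a reference list.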

And, this potentially leads us beyond the PDF on the path towards executable papers, which would be a key step towards reproducible science in the large.

(Note: this is crossposted to https://danielskatzblog.wordpress.com/2015/07/15/the-need-for-notebooks/)

Disclaimer: Some work by the author was supported by the National Science Foundation (NSF) while working at the Foundation; any opinion, finding, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF.