Friday, August 14, 2015

Reproducibility Workshop Report


Final report of the reproducibility@xsede at XSEDE 2014:



Monday, August 10, 2015

Preserve the Mess or Encourage Cleanliness?

As part of the DASPOS project, my research group is developing a variety of prototype tools for software preservation.  Haiyan Meng and Peter Ivie have been doing the hard work of software development while I write the blog posts.  This recent paper gives an overview of what we are up to:
 
Douglas Thain, Peter Ivie, and Haiyan Meng, Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness?, 12th International Conference on Digital Preservation (iPres), November, 2015.


Our basic idea is that preservation is fundamentally about understanding dependencies.  If you can identify (and name) all of the components necessary to running a bit of code (the data, the app, the libraries, the OS, the network, everything!) then you can reproduce something precisely.  The problem is that the dependencies are often invisible or irrelevant to the end user, so we need a way to make them explicit.

Given that, there are two basic approaches, and a continuum between them: 

  • Preserve the Mess.  Allow the user to do whatever they want using whatever programs, machines, and environments that they like.  As things run, observe the dependencies and record them for later use.  Once known, the dependencies can be collected for preservation.
  • Encourage Cleanliness. Require the user to run in a restricted space, in which all objects (data, programs, operating systems) must first be explicitly imported and named.  Once imported, the user may combine them together, and the system will accurately record the combination.
An example of preserving the mess is to use the Parrot virtual file system to run an arbitrary application.  When run in preservation mode, Parrot will observe all of the system calls and then record all the files and network resources that you accessed.  This list can be used to package up all the dependencies and re-run them using a virtual machine, container, or Parrot itself.  For a given application, the list can be surprisingly long and complex:

file:/etc/passwd
file:/usr/lib/libboost.so.1
file:/home/fred/mydata.dat
file:/usr/local/bin/mysim.exe
http://www.example.com/calibration.dat
git://github.com/myproject/code.git
...


This approach is easy to apply to any arbitrary program, but only for the sake of precise reproduction.  The resulting package is only guaranteed to run the exact command and input files given in the first place.  (Although one could edit the package by hand to change the inputs or make it more generally applicable.)  More importantly, loses the context of what the user was attempting to do: it does not capture the source or version number of all the components, making it challenging to upgrade or modify components.

If you prefer encouraging cleanliness, then Umbrella is an alternate approach to preservation.  Using Umbrella, the end user writes a specification that states the hardware, operating system, software packages, and input data necessary to invoke the desired program.  Umbrella picks the packages out of fixed repositories, and then assembles them all together using your preference of cloud services (Amazon), container technology (Docker), or a ptrace jail (Parrot).  An Umbrella file looks like this:

{
  "hardware": {
    "arch": "x86_64",
    "cores": "2",
    "memory": "2GB",
    "disk": "3GB"
  },
  "kernel" : {
    "name": "linux",
    "version": ">=2.6.32"
  },
  "os": {
    "name": "Redhat",
    "version": "6.5"
  },
  "software": {
    "cmssw-5.2.5-slc5_amd64": {
      "mountpoint": "/cvmfs/cms.cern.ch"
    }
  },
  "data": {
    "final_events_2381.lhe": {
      "mountpoint": "/tmp/final_events_2381.lhe",
      "action":  "none"
    }
  },
  "environ": {
    "CMS_VERSION": "CMSSW_5_2_5",
    "SCRAM_ARCH": "slc5_amd64_gcc462"
  }
}

When using Umbrella, nothing goes into the computation unless it is identified in advance by the user.  In principle, one can share a computation by simply passing the JSON file along to another person, checking it into version control, or archiving it and assigning a DOI.  Of course, that assumes that all the elements pointed to by the spec are also accurately archived!

Obviously, requiring cleanliness places a greater burden on the user to get organized and name everything.  And they to learn yet another tool that isn't directly related to their work.  On the other hand, it makes it much easier to perform modifications, like trying out a new version of an OS, or the latest calibration data.

So, what's the right approach for scientific software: Preserving the mess, or encouraging cleanliness?  Or something else entirely?

 P.S. The third piece described in the paper is PRUNE, which I'll describe in a later post.


Monday, August 3, 2015

I had some pretty extensive notes for this post, but I lost them in the upgrade to Windows 10: I foolishly forgot to sync my notebook before I hit "ok" to the upgrade! So be warned: I am reconstructing this from memory!

In the context of particle physics I have a very hard time talking about tools. After thinking about this for a while I realized why: this is part technical and part sociological. It is technical because either the tools aren't all there yet, or we are asking them to do too much. It is sociological because given the current state of the technology we are almost certainly going to have to ask people to do particle physics analysis in a different way in order to fully preserve the analysis steps.

I can explain a bit more.

First, the sociological part. The field of particle physics is an old one, and very individualistic. Organizing a large, 3000 person, experiment like Atlas is often referred to as hearing cats for a reason. ‘Cats’ have many strengths: people with an ideal pursue it, and sometimes that pays off big, even if frowned upon by the management of the experiment. However, there is also a downside: it is particularly hard to organize things. The only way to do it is if you show there are clear benefits to doing it way A vs. way B. Data preservation falls smack dab in the middle of this. Currently tools do not cover all operations, as I will briefly discuss below. However, if physicists change how they perform the steps of the analysis, we'd be most of the way to preserving the analysis.

So to convince them to move to a new system will happen only if there are significant advantages. I am nervous those aren’t there yet. To talk about that, we should probably first classify the tools that are currently out there for data preservation, and the domain.

A large physics experiment that runs at the LHC, like Atlas, manipulates the data many times over to get it from raw data to published data. Much of that manipulation is automated by necessity – there is so much data there is no other way to deal with it. This part of the data processing pipeline is relatively easily preserved. The most interesting unsolved problem in data preservation is "last mile" – where the individual analyst or analysis group performs their work. This work is unique. This work can include making selection cuts on the data, feeding data into a neural network or boosted decision tree, using some custom software to enhance the signal to background, or just about anything else the physicist thinks might enhance the sensitivity of their analysis. You can already see the problem.

Almost none of this last mile is currently automated. The physicist will write scripts will automate small sections. But some portions will be submitted to a batch farm, other parts need hand tuning, etc. While a work flow tool might help, the reality is that people tend to re-run a single step 100’s of times while developing it, and then they don’t re-run it for months after that. Worse, these steps take place across a large range of machines and architectures. As a result people don’t perceive much value in a work flow system.

Preserving these steps is a daunting task (to me). One approach might be to require all physicist to use a single tool, and every single step would have to be done in that tool. The tool could then record everything. I believe technically this could work. But this tool would have to be a true kitchen sink, and an evolving one at that. It would need to understand an individual users computer, it would need to understand clusters, it would need to understand GRID computing (and perhaps cloud computing soon too), it would have to incorporate specialized analysis tools (like neural networks, etc). In short, I just don’t see how restricting one to a single tool can make this happen. Further, I very much doubt the users – physicists – would be willing to adopt it.

The alternative is to build tools that can record each step. These exist in various forms now – you could run a command and have all files and network connections made be archived and thus preserved. However, this is now an extra step that has to be run. Each time. And the analyzer never really knows the last time they will run a step. Sometimes during the final review a whole section needs to be re-run. Often not. Any performance impact will be noticeable. It will be hard to get the analyzer to commit to something like this as well.

Is there a happy middle ground? Perhaps, but I’ve not seen it yet. In my own work I’ve been doing my best to move towards a continuous integration approach. Once I have a step “down”, I attempt to encode it into a Jenkins build script. This isn’t perfect – Jenkins doesn’t really understand the GRID, for example, but it is a start. And if everything can be written as a build script, perhaps that is a first-order preservation. Remaining, of course, is preserving the environment around it.

A more fruitful approach might be be to work with much smaller experiments, and integrate the data preservation into the data handling layer. Perhaps if all data processing is always run as a batch job of some sort (be it on the GRID or in the cloud), that will provide a point-of-control that is still flexible enough for the user and enough of a known for the tool authors. As long as the metadata system is sufficient enough to track the inter-job operations… which translates to as long as the physicist does nothing outside of the data handling system. Hmmm… :-)

Friday, July 24, 2015

Artifact Evaluation Committees!

It would seem we can all agree that reproducibility is a worthwhile goal, even that it is a sort of bar that all peer-reviewed publications should clear. However, there seems to be a significant variance in the greater research community with regard to the willingness to demonstrate or require (or even suggest!) reproducibility of peer-reviewed publications. Part of that variance certainly stems from the difficulty of precisely defining what reproducibility means for any given work, i.e. a specific research paper. Another part stems from the belief that certainly the authors would have both a) taken the time to determine specifically what reproducibility means for their submission and b) ensured that all necessary artifacts for reproducibility were both publicly available and in good working order before publication.

While I applaud the level of optimism about authors' commitments to reproducibility that seems to pervade the research community today, I also find it highly unrealistic.

Authors are frequently over-worked researchers, academics, and students working long hours right up until the paper submission deadline, pushing for the best submission possible. I find it highly unrealistic to expect that this crowd of tired authors focused tightly on the goal of a submission that passes a rigorous peer review will also always take the time to pontificate on what it really means for their work to be reproducible, something that is nearly never influential in the peer-review process.

Therefore, I argue that if we truly place any value on reproducibility as reason for or goal of peer-reviewed research, we need to make it a required element in peer-review. Until we do that we remain "all talk and no action." Sadly, we have yet to see a top peer-reviewed computer science or engineering venue willing to include reproducibility explicitly as a criteria for publication at the same level as evaluations of the novelty of or interest in the work.

However, there is (a little) hope. One of the best moves in this direction has been in the research area of software engineering. While there are certainly exceptions, for most publications in this area, reproducibility of the published work will involve some level of a) being able to access the software artifacts that comprise the novel contributions of the paper, b) being able to execute them. This led to a new movement called Artifact Evaluation, the motivation for which is quite compelling.

Here is an excerpt:
"We find it especially ironic that areas that are so centered on software, models, and specifications would not want to evaluate them as part of the paper review process, as well as archive them with the final paper. Not examining artifacts enables everything from mere sloppiness to, in extreme cases, dishonesty. More subtly, it also imposes a subtle penalty on people who take the trouble to vigorously implement and test their ideas."

Very importantly artifact evaluation was assigned to a special committee of PhD students and new postdocs, not to the regular PC. This strategy worked brilliantly.

The evaluation was extremely conservative, for three major reasons:

  • Evaluation of artifacts was conducted strictly after papers were unconditionally accepted for publication so that it did not affect peer review in any way.
  • Artifact evaluation was strictly limited to whether the available materials matched what was described in the paper. As Shriram Krishnamurthi notes, "We were only charged with deciding whether the artifact met the expectations set by the paper, no matter how low or incorrect those might be! This is of course a difficult emotional barrier to ignore."
  • The results of the evaluation were not publicly released. They were sent only to the authors of the paper, who could choose to describe the results in public as they wished, or not at all.

So, hats off to Shriram Krishnamurthi, Matthias Hauswirth, Steve Blackburn, Jan Vitek, Andreas Zeller, Carlo Ghezzi, and the others who have taken a stand on reproducibility of published research in software engineering! Andreas Zeller even secured a cash prize of USD 1000 from Microsoft Research for a Distinguished Artifact at ESEC/FSE 2011! AECs have now been included in several major conferences, most recently CAV 2015!

Now that we have a system that works, it would seem it is time to make it stick: make artifact evaluation standard, take it into account for paper acceptance, and make the results public.

So, how do we do that?

Wednesday, July 15, 2015

The Need for Notebooks

In response to Doug Thain's question:
What currently-available tools do you recommend for enabling reproducible scientific computing?  Is there a tool that we ought to have, but do not?

P.S. I am using "reproducibility" as an easy shorthand for re-usability, re-creation, verification, and related tasks that have already seen some discussion.  Please interpret the question broadly.
I think the most important thing we need is an electronic lab notebook that really allows us to go back and understand exactly what we did, repeat it, modify it, record it, etc.  If you accept this, it leads to a number of points:

  1. Why don't more people (including me) do this now?
  2. What tool(s) should we use?
  3. How should this integrate into the publication process (for papers, software, data, etc.)
It's a bit hypocritical for me to write this post, since I've never really been a good notebook user or note taker, whether in classes, meetings, or labs.  And I think this is true of a lot of us.  Part of it is probably making it a matter of habit, and it's hard to break a bad habit or start a new one.  And part of it is the lack of good tools with good interfaces.

Tools I've seen that I've liked include VizTrails for data exploration and visualization, and Project Jupyter for a lot of other things.  And software version control (e.g., Git) is also part of the answer, most likely.  While these independent tools have their strengths and weaknesses, I don't think they really fit the bill.

I would really like something that's much more integrated into my computer, probably at a runtime or OS layer, that helps me understand all of my work, not just my visualization or coding.

If we did have something that really was more an automated work-tracking notebook, we could use it to help us with publications as well.  For example, in my idea of transitive credit, if we are going to decide what products contributed to a new product, a starting point is the the products we used during the creation of the new product, which could be created by such a notebook.  Or we could track our reading list in the period leading towards a new paper as a starting point for the papers we should reference.

And, this potentially leads us beyond the PDF on the path towards executable papers, which would be a key step towards reproducible science in the large.

(Note: this is crossposted to https://danielskatzblog.wordpress.com/2015/07/15/the-need-for-notebooks/)

Disclaimer: Some work by the author was supported by the National Science Foundation (NSF) while working at the Foundation; any opinion, finding, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF.


Monday, June 29, 2015

Q: Tools for Reproducible Computing?

Thank you to all of the panelists for your introductions and initial comments.  In my first posting, I suggested that we need better tools, better habits, better repositories, and higher expectations in order to achieve reproducibility.  That seems like a useful way to structure the next few questions, so I will start with tools:

What currently-available tools do you recommend for enabling reproducible scientific computing?  Is there a tool that we ought to have, but do not?

P.S. I am using "reproducibility" as an easy shorthand for re-usability, re-creation, verification, and related tasks that have already seen some discussion.  Please interpret the question broadly.

Monday, June 22, 2015

Reproducibility, Correctness, and Buildability: the Three Principles for Ethical Public Dissemination of Computer Science and Engineering Research

Thank you to Douglas Thain for starting this blog on reproducible research! It is my hope that by bringing light to this topic, sharing ideas and success stories, documenting lessons learned, and stimulating on-going discussion on this topic we can help to pave the way for increasing reproducibility in published engineering and computer science research.

My name is Kristin Yvonne Rozier and I am an Assistant Professor of Aerospace Engineering and Engineering Mechanics in the University of Cincinnati's College of Engineering and Applied Science. Before moving to UC in January, 2015, I spent 14 years as a Research Scientist at NASA, first NASA Langley and then most recently NASA Ames. My primary research area is formal methods, a class of mathematically rigorous techniques for the specification, design, validation, and verification of critical systems. In our lab, the Laboratory for Temporal Logic in Aerospace Engineering, we publish a website accompanying each research paper and containing the artifacts required to reproduce the published results, wherever applicable.

When reading research papers, and especially when serving on Programme Committees to peer-review them, I am too-often astonished by the low level of dedication to reproducibility in published research, not just among authors but peer-reviewers as well. Certainly there are many barriers to publishing in a truly reproducible way, but I believe that most of these are surmountable and for the rest, one has to ask if it they are also ethical barriers to publishing at all.

I recently examined this topic in detail with another blogger on this site, Eric Rozier; our paper on Reproducibility, Correctness, and Buildability: the Three Principles for Ethical Public Dissemination of Computer Science and Engineering Research was published in IEEE International Symposium on Ethics in Engineering, Science, and Technology, Ethics’2014.

Together, we proposed a system of three principles that underlie public dissemination of computer science and engineering research, in essence the raison d'être of scientific publication. In addition to reproducibility, research is also published to show correctness, and enable buildability. We argue that consideration of these principles is a necessary step when publicly disseminating results in any evidence-based scientific or engineering endeavor. In our recent paper, we examined how these principles apply to the release and disclosure of the four elements associated with computational research: theory, algorithms, code, and data.

Precisely, reproducibility refers to the capability to reproduce fundamental results from released details. Correctness refers to the ability of an independent reviewer to verify and validate the results of a paper. Buildability indicates the ability of other researchers to use the published research as a foundation for their own new work. Buildability is more broad than extensibility, as it requires that the published results have reached a level of completeness that the research can be used for its stated purpose, and has progressed beyond the level of a preliminary idea. We argue that these three principles are not being sufficiently met by current publications and proposals in computer science and engineering, and represent a goal for which publishing should continue to aim.

In order to address the barriers to upholding these principles, we introduced standards for the evaluation of reproducibility, correctness, and buildability in relation to the varied elements of computer science and engineering research and discuss how they apply to proposals, workshops, conferences, and journal publications, making arguments for appropriate standards of each principle in these settings. We address modern issues including big data, data confidentiality, privacy, security, and privilege.

Given that it is possible for all publicly disseminated research to uphold these three principles and deviation from them is determined by factors that restrict public dissemination, such as intellectual property, secure/classified information, and time, it is unethical to publicly disseminate work that does not adhere to at least one of the principles of reproducibility, correctness, and buildability.

Friday, June 5, 2015

Interesting Replicable "Badge" for journal articles

I am writing to point out an interesting experiment that's starting up in the ACM Transactions on Mathematical Software (TOMS) computational journal.  There's a preprint of an editorial by Mike Heroux available that explains the process.

In brief, TOMS has created a "Replicated Computational Results (RCR)" review process, which is really a designation saying that the computational results published in an article are replicable.  As you might expect, this adds an extra burden on the authors to submit their work in a way that enables this RCR review, an extra burden on the editor to work with this part of the submission, and a completely new burden on the RCR reviewer.  If this process is successfully completed, the paper gets an RCR designation on its cover page, the journal editors hopefully get higher quality work, and the reviewer gets satisfaction in a job well done and a credited contribution to the field.

This seems a really interesting experiment; I hope it succeeds.

p.s.  I'm also personally happy that the first paper that will receive this designation is an output of an NSF-funded SI2 project: Field G. Van Zee and Robert A. van de Geijn. 2015. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Trans. Math. Softw. 41, 3 (May 2015), with James Willenbring as the first RCR reviewer.

Disclaimer

Some work by the author was supported by the National Science Foundation (NSF) while working at the Foundation; any opinion, finding, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF.

Tuesday, May 26, 2015

Introduction: Hi, I'm Gordon...

I'm Gordon Watts, a professor at the University of Washington in the Physics Department. My main focus of physics research is particle physics: I'm a member of the ATLAS experiment at the LHC, located at CERN. I'm also a member of the DZERO experiment at the Tevatron, located at FNAL. These days I mostly do research into exotic decays - looking of a hint of what particles might be dark matter or what particles might fill in some of the missing pieces of the Standard Model of particle physics... a model that should describe all matter and forces and interactions in our universe if it was complete.

Reproducibility is something I've fallen into - there was very little career planning that sent me in this direction: mostly just interest. I originally got into particle physics because it seemed to exist at the intersection of hardware, software, and physics. As my career has moved forward I've moved away from hardware and more towards software and physics. I'm (a little) embarrassed to admit that I first started thinking about this because of a new policy at the National Science Foundation (NSF). They are one of the major funding agencies in our field, and, more to the point, fund our group at the University of Washington (the Department of Energy's Office of Science is the other, larger, funding agency). In 2011 they start requiring a Data Management and Sharing plan along with every grant proposal. The basic idea: if the government funded scientific research, then the data derived from that research should be public. "What's this?" I thought... and it has been downhill since then. :-)

What is missing?

Reproducibility, unfortunately, means different things to different people (see the post just the other day on this blog which lays out an entirely sensible taxonomy). Lets take an experiment in particle physics. This is funny: I don't mean ATLAS, though that is the traditional definition of an experiment. Rather, in the jargon of my field, lets take an analysis. This example analysis starts with the data taking in 2012. The raw data is reconstructed, analyzed, and finally the plots and conclusions are used to write a paper. That paper represents about 2 years of three people's work. Could I give you the TB's worth of raw data, whatever the software was that you needed, and then could you re-make the plots or re-calculate the numbers in that paper? In particular, if you were not a member of ATLAS?

No. You could not. I'm not sure I could even do it if you were a member of ATLAS.

The reasons? Humans. Those pesky creatures at the end of the keyboard. The ones that do the physics and drive the field forward.

There are a number of steps that one has to go through to transform the raw data into final plots and physics results. First some very general reconstruction code must be run. It takes the raw electronic signals from the detector and finds the trajectory of an electron, or a muon, or a jet of particles, etc. If everything went well, then these are the raw particles that are left in the detector.

Once the reconstruction is done, the next step is producing a small summary data file from that reconstruction. Often some calibration is applied at this point. While the reconstruction is expensive in terms of CPU and disk space, the summary data file production is cheap and quick - and often re-run many times.

The final step is to take the summary data and produce the final plots. This involves running over multiple summary data files, producing plots, fitting, scaling, calculating ratios, and statistical calculations to determine if you have discovered something or you are going to rule out the existence of what you haven't discovered.

This is a simplification, but it is a good enough model for this discussion. The first two steps, the reconstruction and production of the summary data files, are automated. Large CPU farms run the jobs, a huge database system tracks all the files in and out, including log files from the jobs that produce each step. I don't know the best way to actually give you the code and the data, but the problem is solvable. Computers run and manage every step of the way. Capturing that information is a hard problem, but it is all recorded in databases, in script files in well known places, in source files in source file repositories (like git or svn).

The last step, however, is a mess! Take figure 3 of the paper I referenced above. We had to make some 50 versions of that plot, and fit each one. We never found a single function that would fit everything. And the fit often would not converge without a little help. I call this sort of computer-human interaction hand-art: one cannot automate the process and direct intervention by a "hand" is required. Even without that something like fitting-by-hand, this last step is typically a mess of scripts and jobs that are run by hand.

This is an issue that spans both tools and sociology in the field. What if the tools to preserve the tasks are very different from what we do now? How do we convince people to switch? And given such an uncontrolled environment where analysis takes place, how do we even capture the steps in the first place?

This is one of many things lacking when it comes to delivering a particle physics analysis in a reproducible fashion. Some of the other blog posts should be read to see more!

Monday, May 25, 2015

re****bility

Since we often forget what reproducibility means, particularly in the context of repeatability and replicability, I think it is worth highlighting this slide, taken from a talk a few days ago:



Sunday, May 24, 2015

Reproducibility, Data Curation, and Cost-Constrained Domains

My name is Eric Rozier, and I'm an Assistant Professor in Electrical Engineering and Computing Systems at the University of Cincinnati, and a two time faculty mentor at the University of Chicago's Data Science for Social Good Summer Fellowship (DSSG). DSSG is a summer program through the Computation Institute which seeks to tackle problems in Big Data with social relevance.  Last summer I advised eight student fellows on projects with the Chicago Department of Public Health and the World Bank's Integrity Vice Presidency.  During the normal year I'm the director of the Trustworthy Data Engineering Laboratory at the University of Cincinnati where we work on problems at the intersection of Big Data, Data Privacy and Cybersecurity, Data Dependability, and Analytics.

My interest in reproducibility stems from my research into long-term data curation and dependability, and issues focusing on reproducibility with data that includes personally identifiable information (PII).  Reproducibility is a topic complicated by many facets, scientific, technical, and ethical in nature.  In an effort to differentiate the perspective of this post from those of my colleagues I want to discuss the challenges of long-term reproducibility from a data curation stand point.

Last year at the 2014 International Conference on Massive Storage Systems and Technology (MSST) Daniel Duffy and John Schnase of the NASA Center for Climate Simulation discussed how data infrastructure was one of the primary challenges facing climate science.  The problem becomes one of reproducibility when the temporal scale of climate analytics becomes evident.  In discussions with the University of Miami Rosenstiel School of Marine and Atmospheric Science (RSMAS) and representatives of the National Oceanic and Atmospheric Administration (NOAA) researchers have told me that to conduct the sort of climate research that would be necessary to identify important decadal patterns would require data curation without significant loss for at least 50, if not 100, years.  Facing these challenges with an ever shrinking budget and an exponential growth of the digital universe is an incredibly challenging task.  Even if we somehow alter the culture of research science (a goal that the NSF is making admirable progress towards), we still face the double edged sword of scale in the era of Big Data.

While the exponentially increasing volume and variety of data being ingested means our predictive models and resulting understanding of complex processes are becoming better; we are also faced with a world of changing fault landscapes.  Once rare and inconsequential faults now pose serious threats; and can be orthogonal to traditional reliability mechanisms such as RAID.  How can we hope to achieve 50 to 100 years of reliability when many important domains are shifting away from more traditional enterprise hardware in favor of commercial off-the-shelf (COTS) components to deal with decreasing budgets and increasing data volume?

One of the biggest concerns I have moving forward on issues such as reproducibility is the continued focus of our field on technologies aimed at the best funded portion of the market, while socially important applications that relate to basic science, sustainability, civic governance, and public health struggle with decreasing budgets and an increasing reliance on insufficiently reliable, dependable, and secure, COTS-based architectures.  To achieve true reproducibility solutions need to scale down as well as they scale up.  Mechanisms need to be affordable and available for cost-constrained domains like public clinics, the developing world, and city governments.  Data that cannot be preserved, cannot enable reproducibility in science.

Introduction: On a related note, Repeatability (in Computer Science)

Hello, world!

The idea of using standardized Digital Object Identifiers (DOIs) to reference electronic artifacts has been around for many moons. However, for better or worse there does not seem to be an analogous standard for Digital Personal Identifiers (DPIs). Consequently, I must introduce myself with a sobriquet that I used to think was unique -- my name, Ashish Gehani. (Facebook has proved to me the folly of my thought -- there are many others who go by the same label!) I first heard the case for using DPIs from my colleague, Natarajan Shankar, in the Computer Science Laboratory at SRI (where we work). I’m sure the motivation is much older but I believe we first spoke about it in the context of disambiguating authors whose publications were incorrectly being conflated on Google Scholar.

I arrived at the topic of scientific reproducibility via a different path than others on this blog (I think). I’d been studying systems’ security and observed (as had many others) that the knowledge of the antecedents of a piece of information could be quite helpful in deciding how much to trust it. Indeed, a similar approach has been used to judge the authenticity of valuables, from art to wine, for many centuries. As I delved into ways to reliably collect and generate metadata about the history of digital artifacts, I found an entire community of like-minded souls, who have banded together with the common interest of data provenance. Many of these researchers meet annually at the USENIX Workshop on the Theory and Practice of Provenance and in alternate years at the International Provenance and Annotation Workshop. In 2014, these were co-located as part of Provenance Week. A significant focus is on the use of provenance to facilitate scientific reproducibility.

Doug asked for "an example of where reproducibility is not happening (or not working) in computational science today". I am going to take the liberty of considering repeatability instead of reproducibility. (Repeatability is of particular interest when reproducibility is questioned.) As the old adage goes, "there’s no place like home". Assuming one is willing to accept computer science as a computational science, we can consider the findings of Collberg et al. at the University of Arizona. They downloaded 601 papers from ACM conferences and journals ("with a practical orientation"), and focused on two questions: "is the source code available, and does it build?" Even the most basic next step, of trying to execute the code, was eliminated. Of course, this meant they could not check the correctness of the output. Unfortunately, even with this low bar, the results were not encouraging. It’s worth taking a look at their report. Their findings are currently being reviewed by Krishnamurthi et al. at Brown University.

The Burden of Reproducibility

The Burden of Reproducibility

Hi, I am Tanu Malik, a research scientist and Fellow at the Computation Institute, University of Chicago and Argonne National Laboratory. I am also a Lecturer in the Department of Computer Science, University of Chicago. I work in scientific data management, with emphasis on distributed data management, and metadata management, and regularly collaborate with astronomers, chemists, geoscientists, and cosmologists.

For the past few years, I have had an active interest in data provenance, which has direct implications on computational reproducibility.  My student Quan Pham (co-advised with Ian Foster) recently graduated from the Department of Computer Science, University of Chicago with a thesis on the topic of computational reproducibility. In his thesis, Quan investigated the three orthogonal issues of efficiency, usability and cost of computational reproducibility.  He proposed a novel system for computational reproducibility and showed how it can be optimized for several use cases.


Quan’s novel work lead to our NSF-sponsored project ‘GeoDataspace’ in which we are exploring the ideas of sharing and reproducibility in the geoscience community, introducing the geoscientists to efficient and usable tools so that they are ultimately in-charge of their ‘reproducible’ pipelines, publications, and projects.


So what challenges have we faced so far, and in particular to answer Doug’s question “where in computational science is reproducibility not happening or not working”?


As I perceive, reproducibility is indeed happening at an individual/collaboration level but not at the community level. Again, I witness reproducibility happening at micro time scales but not at macro time scales. Let me explain myself.


Publications, which serve as the primary medium of dissemination for computational science, are not reproducible. Typically, to produce a publication, the authors conceive of a research idea, experiment with it, and produce some results. Those results are reproducible in that the authors are able to re-confirm themselves that their research idea and experiments are plausible. However, reproducibility stops soon thereafter. The authors take in some effort to describe their idea, in the form of a publication, to the community, and since it doesn’t pay any further to be part of the community, reproducibility stops.


In this process, reproducibility was happening for a short time scale, i.e., when the authors were investigating the ideas. They were reconfirming it many ways. But at larger time scales, such as taking a publication that was published five years ago, it is incredibly hard even for the author to reproduce it.


So why is reproducibility not happening at that scale? There are several reasons, and Dan described some of them in his post. Let me also give my take:


When the end result of the scientific method was safely proclaimed to be a text publication (in late 16th century after the printing press was in considerable use), we made some assumptions. In particular, that (i) we shall always remember all the details of the chosen scientific method including the know-how of the tools that were used, and that (ii) research is a singular activity done with a focused mind, possibly in the confines of an attic or dungeons.
Four centuries down, those assumptions no longer hold true. Computers, Internet, search engines, Internet-based services, and social networks have changed much of that: We can access, discover, and connect knowledge and people much faster. But can we use these new inventions to verify faster?

GitHub + TravisCI is an excellent example of improving the speed of verification by using open-source, re-executable source codes and bringing in the social network. For the varieties of scientific methods, systems, tools, and communities, this example is just a start. There is still a significant burden of verification that science has to bear.  Humans cannot tackle or reduce this burden. We need highly-efficient computational methods and tools that verify fast, encourage good behavior, and provide instantaneous feedback of being left out from the community so that scientists ensure reproducibility at all steps of the scientific method.

Tuesday, May 19, 2015

Introduction: Mark Twain on Reproducibility, more or less

Having now figured out Blogspot's somewhat painful user interface, I will take up Doug's challenge to introduce myself and talk a little about reproducibility in the context of computational science:

I'm Daniel S. Katz (or less formally, Dan), and I'm have two relevant roles in the context of this blog. First, I'm a researcher, specifically a Senior Fellow in the Computation Institute at the University of Chicago and Argonne National Laboratory, where I work on the development and use of advanced cyberinfrastructure to solve challenging problems at multiple scales. My technical research interests are in applications, algorithms, fault tolerance, and programming in parallel and distributed computing, including HPC, Grid, Cloud, etc. I am also interested in policy issues, including citation and credit mechanisms and practices associated with software and data, organization and community practices for collaboration, and career paths for computing researchers. (I also have been blogging a bit at another site, and I will cross-post part of this post there.)

Second, I'm currently a program officer at the National Science Foundation in the Division of Advanced Cyberinfrastructure (ACI.) At NSF, I am managing $25m-$35m of software programs, including leading NSF's Software Infrastructure for Sustained Innovation (SI2) program, and I currently lead ACI's participation in Computational and Data-enabled Science & Engineering (CDS&E), Designing Materials to Revolutionize and Engineer our Future (DMREF), and Computational and Data-Enabled Science & Engineering in Mathematical and Statistical Sciences (CDS&E-MSS). I previously led ACI's participation in Faculty Early Career Development (CAREER) and Exploiting Parallelism and Scalability (XPS). I also co-led writing of "A Vision and Strategy for Software for Science, Engineering, and Education: Cyberinfrastructure Framework for the 21st Century," and led writing of "Implementation of NSF Software Vision." [This leads me to add: "Some work by the author was supported by the National Science Foundation (NSF) while working at the Foundation; any opinion, finding, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF."]

Given these two roles, a researcher and a funder, it's clear to me that reproducibility in science is increasingly seen as a concern, at least a high level. And thus, making science more reproducible is a challenge that many people want to solve. But it's quite hard to do this, in general. In my opinion, there are a variety of factors responsible, including:

  1. Our scientific culture thinks reproducibility is important at a high level, but not in specific cases. This reminds me of Mark Twain's definition of classic books: those that people praise but don't read. We don't have incentives or practices in place that translate the high level concept of reproducibility into actions that support actual reproducibility.
  2. In many cases, reproducibility is difficult in practice, due to some unique situation. For example, data can be taken with a unique instrument, such as the LHC or a telescope, or the data may be transient, such as seismometer that measured a specific signal, though on the other hand, in many cases, data taken in one period should be statistically the same as data taken in another period.
  3. Given limited resources, reproducibility is less important than new research. As an example, perhaps a computer run that took months has been completed. This is unlikely to be repeated, because generating a new result is seen as a better use of the computing resources than reproducing the old result.

We can't easily change culture, but we can try to change practice, with the idea that a change in practice will eventually turn into a change in culture. And we can start by working on the easier parts of the problem, not the difficult ones. One way we can do this is by formalizing the need for reproducibility. This could be done at multiple levels, such as by publishers, funders, and faculty.

Publishers could require that reviewers actually try to reproduce submitted work as a review criterion. Funders could require the final project reports contain a reproducibility statement, a demonstration that an unrelated group had reproduced specific portions of the reported work, with funders funding these independent groups to do this. And faculty could require students to reproduce the work of other students, benefitting the reproducer with training and the reproducee with knowledge that their work has been proven to be reproducible.

What do we do about work that cannot be reproduced due to a unique situation? Perhaps try to isolate that situation and reproduce the parts of the work that can be reproduced. Or reproduce the work as a thought experiment rather than in practice. In either case, if we can't reproduce something, then we have to accept that we can't reproduce it and we need to decide how close we can come and if this is good enough.

In all of these cases, there's an implied cost-benefit tradeoff. Do we think the benefit of reproducibility is worth the cost, in reviewers' time, funders' funds, or students' time? This gets to the third factor I mentioned previously, the comparative value of reproducibility versus new research. We can try to reduce the cost using automation, tools, etc., but it will always be there and we will have to choose if it is sufficiently important to pursue.

Let me close by going back to Twain's definition, and asking, will reproducibility become one of the classic books of the 21st Century, praised but not carried out? Or will we choose to make the effort to read it?

Disclaimer

Some work by the author was supported by the National Science Foundation (NSF) while working at the Foundation; any opinion, finding, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF.

Monday, May 18, 2015

Introduction: Portability is Reproducibility

I see that everyone is a bit shy to get started, so I will break the ice -

My name is Douglas Thain, and I am an Associate Professor in Computer Science and Engineering at the University of Notre Dame.  My background is in large scale distributed systems for scientific computing.  I lead a computer science team that collaborates with researchers in physics, biology, chemistry, and other fields of experimental science.  We work to design systems that are able to reliably run simulations or data analysis codes on thousands of machines to enable new discoveries.  And we publish our software in open source form as the  Cooperative Computing Tools.


I have sort of fallen into reproducibility area sideways, as a result of working in systems at large scale.  In the last two years, this has been part of the Data and Software Preservation for Open Science (DASPOS) project funded by the NSF.

And here is a tale of non-reproducibility:

A common problem that we encounter is that a student works hard to get some particular code running on their laptop or workstation, only to find that it doesn't work at all on the other machines that they can get access to.  We often find this happening in the realm of bioinformatics, where tools like BLAST, BOWTIE, and BWA are distributed in source form with some rough instructions on how to get things going.  To make it work, the student must download the software, compile it just so, install a bunch of shared libraries, and then repeat the process for each tool on which the first tool depends upon.  This process can take weeks or months to get just right.

Now, the student is able to run the code and gets some sample results.  He or she now wishes to run the tool at scale, and uses our software to distribute the job to thousands of machines spread around campus.  Success, right?

Of course not!  None of the machines to which the student has access has been set up in the same painstaking way as the original machine, so the student ends up working backwards to debug all the different ways in which the target differs from the original.
  • Code works with Python 2 and not Python 3, check.
  • Missing config file in home directory, check.
  • libobscure.so.3 found in /usr/lib, check.
  • (bang head on keyboard)
I have described this as a problem of portability across machines, but of course you can also look at it as reproducibility across time.  If an important result has been generated and now a different person wishes to reproduce that run a year later on a different machine, you can bet that they will encounter the same set of headaches.

Can we make things better?

I think we can, but it is going to take several different kinds of developments:

- Better tools for specifying and standing up environments in a precise way.
- Better habits by researchers to track and specify what they use.
- Better repositories that can track and store environments, code, and data and relate them together.
- Higher expectations by the community that work should be made reproducible.

Ok, that's my introduction.  Would the other panelists like to introduce themselves?

Tuesday, May 5, 2015

Q: Example of Non-Reproducibility?

To get things going, I'll post some questions and ask each panelist to respond individually:

1. Please introduce yourself and your research interests.

2. Give us an example of where reproducibility is not happening (or not working) in computational science today.


Monday, May 4, 2015

Welcome

Reproducibility is central to the conduct of science.  In order to verify, refute, or build upon someone else's work, it must first be possible to reproduce what they have done.  In years past, this was accomplished by the maintenance of a written lab notebook, in which all the steps of an experiment were painstakingly entered, so that others could follow along later.

(DaVinci's Lab Notebook)

Today, the computer has replaced the scientific notebook for measuring, documenting, and (in some cases) experimentation itself.  In principle, computers should make reproducibility simpler, because a program should execute identically every time it runs.  The reality is much more complicated, because computing environments are dynamic, chaotic, and poorly described.

As a result, a scientific code written to run correctly on a computer owned by person A has a surprisingly low chance of running correctly on a computer owned by person B.  Or even on person A's computer one year later!


In this blog, we will explore what it means to achieve reproducibility in the context of scientific computing.  Your hosts are a panel of experts with experience in computer science and scientific computing:

This will be a sort of virtual panel discussion where we discuss what causes these problems, what can be done to improve the situation, and share pointers to ideas, technologies, and meetings.

Ok, let's get started!