Reproducible Scientific Computing: May 2015

Tuesday, May 26, 2015

Introduction: Hi, I'm Gordon...

I'm Gordon Watts, a professor at the University of Washington in the Physics Department. My main focus of physics research is particle physics: I'm a member of the ATLAS experiment at the LHC, located at CERN. I'm also a member of the DZERO experiment at the Tevatron, located at FNAL. These days I mostly do research into exotic decays - looking for a hint of what particles might be dark matter or what particles might fill in some of the missing pieces of the Standard Model of particle physics... a model that should describe all matter and forces and interactions in our universe if it were complete.

Reproducibility is something I've fallen into - there was very little career planning that sent me in this direction: mostly just interest. I originally got into particle physics because it seemed to exist at the intersection of hardware, software, and physics. As my career has moved forward I've moved away from hardware and more towards software and physics. I'm (a little) embarrassed to admit that I first started thinking about this because of a new policy at the National Science Foundation (NSF). They are one of the major funding agencies in our field, and, more to the point, fund our group at the University of Washington (the Department of Energy's Office of Science is the other, larger, funding agency). In 2011 they started requiring a Data Management and Sharing plan along with every grant proposal. The basic idea: if the government funded scientific research, then the data derived from that research should be public. "What's this?" I thought... and it has been downhill since then. :-)

What is missing?

Reproducibility, unfortunately, means different things to different people (see the post just the other day on this blog which lays out an entirely sensible taxonomy). Let's take an experiment in particle physics. This is funny: I don't mean ATLAS, though that is the traditional definition of an experiment. Rather, in the jargon of my field, let's take an analysis. This example analysis starts with the data taking in 2012. The raw data is reconstructed, analyzed, and finally the plots and conclusions are used to write a paper. That paper represents about 2 years of three people's work. Could I give you the TB's worth of raw data, whatever the software was that you needed, and then could you re-make the plots or re-calculate the numbers in that paper? In particular, if you were not a member of ATLAS?

No. You could not. I'm not sure I could even do it if you were a member of ATLAS.

The reasons? Humans. Those pesky creatures at the end of the keyboard. The ones that do the physics and drive the field forward.

There are a number of steps that one has to go through to transform the raw data into final plots and physics results. First some very general reconstruction code must be run. It takes the raw electronic signals from the detector and finds the trajectory of an electron, or a muon, or a jet of particles, etc. If everything went well, then these are the raw particles that are left in the detector.

Once the reconstruction is done, the next step is producing a small summary data file from that reconstruction. Often some calibration is applied at this point. While the reconstruction is expensive in terms of CPU and disk space, the summary data file production is cheap and quick - and often re-run many times.

The final step is to take the summary data and produce the final plots. This involves running over multiple summary data files, producing plots, fitting, scaling, calculating ratios, and statistical calculations to determine if you have discovered something or you are going to rule out the existence of what you haven't discovered.

This is a simplification, but it is a good enough model for this discussion. The first two steps, the reconstruction and production of the summary data files, are automated. Large CPU farms run the jobs, a huge database system tracks all the files in and out, including log files from the jobs that produce each step. I don't know the best way to actually give you the code and the data, but the problem is solvable. Computers run and manage every step of the way. Capturing that information is a hard problem, but it is all recorded in databases, in script files in well known places, in source files in source file repositories (like git or svn).

The last step, however, is a mess! Take figure 3 of the paper I referenced above. We had to make some 50 versions of that plot, and fit each one. We never found a single function that would fit everything. And the fit often would not converge without a little help. I call this sort of computer-human interaction hand-art: one cannot automate the process and direct intervention by a "hand" is required. Even without something like fitting-by-hand, this last step is typically a mess of scripts and jobs that are run by hand.

This is an issue that spans both tools and sociology in the field. What if the tools to preserve the tasks are very different from what we do now? How do we convince people to switch? And given such an uncontrolled environment where analysis takes place, how do we even capture the steps in the first place?
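As one possible starting point for capturing hand-run steps, here is a minimal sketch (not anything ATLAS actually uses) of a wrapper that runs a command and appends a record of it to a log. The function names, the log file name `analysis_log.jsonl`, and the record layout are all assumptions made for illustration.

```python
import hashlib
import json
import subprocess
import time


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_and_record(command, inputs, log_path="analysis_log.jsonl"):
    """Run one analysis step and append a record of it to a JSON-lines log.

    `command` is a list of strings and `inputs` is a list of input file
    paths. The log name and record fields are hypothetical choices.
    """
    record = {
        "started": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "command": list(command),
        # Hash the inputs so a later reader can tell which files were used.
        "inputs": {str(p): sha256_of(p) for p in inputs},
    }
    result = subprocess.run(command, capture_output=True, text=True)
    record["returncode"] = result.returncode
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")
    return record
```

A log like this only records which command ran on which inputs. It does nothing for the hand-art itself: a fit that was nudged interactively would still have to be written down in a script before a wrapper of this kind could see it.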

This is one of many things lacking when it comes to delivering a particle physics analysis in a reproducible fashion. Some of the other blog posts should be read to see more!

Monday, May 25, 2015

re****bility

Since we often forget what reproducibility means, particularly in the context of repeatability and replicability, I think it is worth highlighting this slide, taken from a talk a few days ago:

Neil Chue Hong, "Open Software for Open Science," Young Alliance Against Cancer, 22-23 May 2015, Copenhagen, http://dx.doi.org/10.6084/m9.figshare.1424440

Sunday, May 24, 2015

Reproducibility, Data Curation, and Cost-Constrained Domains

My name is Eric Rozier, and I'm an Assistant Professor in Electrical Engineering and Computing Systems at the University of Cincinnati, and a two time faculty mentor at the University of Chicago's Data Science for Social Good Summer Fellowship (DSSG). DSSG is a summer program through the Computation Institute which seeks to tackle problems in Big Data with social relevance. Last summer I advised eight student fellows on projects with the Chicago Department of Public Health and the World Bank's Integrity Vice Presidency. During the normal year I'm the director of the Trustworthy Data Engineering Laboratory at the University of Cincinnati where we work on problems at the intersection of Big Data, Data Privacy and Cybersecurity, Data Dependability, and Analytics.

My interest in reproducibility stems from my research into long-term data curation and dependability, and issues focusing on reproducibility with data that includes personally identifiable information (PII). Reproducibility is a topic complicated by many facets, scientific, technical, and ethical in nature. In an effort to differentiate the perspective of this post from those of my colleagues I want to discuss the challenges of long-term reproducibility from a data curation standpoint.

Last year at the 2014 International Conference on Massive Storage Systems and Technology (MSST) Daniel Duffy and John Schnase of the NASA Center for Climate Simulation discussed how data infrastructure was one of the primary challenges facing climate science. The problem becomes one of reproducibility when the temporal scale of climate analytics becomes evident. In discussions with the University of Miami Rosenstiel School of Marine and Atmospheric Science (RSMAS) and representatives of the National Oceanic and Atmospheric Administration (NOAA) researchers have told me that to conduct the sort of climate research that would be necessary to identify important decadal patterns would require data curation without significant loss for at least 50, if not 100, years. Facing these challenges with an ever shrinking budget and an exponential growth of the digital universe is an incredibly challenging task. Even if we somehow alter the culture of research science (a goal that the NSF is making admirable progress towards), we still face the double edged sword of scale in the era of Big Data.

While the exponentially increasing volume and variety of data being ingested means our predictive models and resulting understanding of complex processes are becoming better, we are also faced with a world of changing fault landscapes. Once rare and inconsequential faults now pose serious threats, and can be orthogonal to traditional reliability mechanisms such as RAID. How can we hope to achieve 50 to 100 years of reliability when many important domains are shifting away from more traditional enterprise hardware in favor of commercial off-the-shelf (COTS) components to deal with decreasing budgets and increasing data volume?

One of the biggest concerns I have moving forward on issues such as reproducibility is the continued focus of our field on technologies aimed at the best funded portion of the market, while socially important applications that relate to basic science, sustainability, civic governance, and public health struggle with decreasing budgets and an increasing reliance on insufficiently reliable, dependable, and secure COTS-based architectures. To achieve true reproducibility, solutions need to scale down as well as they scale up. Mechanisms need to be affordable and available for cost-constrained domains like public clinics, the developing world, and city governments. Data that cannot be preserved cannot enable reproducibility in science.

Introduction: On a related note, Repeatability (in Computer Science)

Hello, world!

The idea of using standardized Digital Object Identifiers (DOIs) to reference electronic artifacts has been around for many moons. However, for better or worse there does not seem to be an analogous standard for Digital Personal Identifiers (DPIs). Consequently, I must introduce myself with a sobriquet that I used to think was unique -- my name, Ashish Gehani. (Facebook has proved to me the folly of my thought -- there are many others who go by the same label!) I first heard the case for using DPIs from my colleague, Natarajan Shankar, in the Computer Science Laboratory at SRI (where we work). I’m sure the motivation is much older but I believe we first spoke about it in the context of disambiguating authors whose publications were incorrectly being conflated on Google Scholar.

I arrived at the topic of scientific reproducibility via a different path than others on this blog (I think). I’d been studying systems’ security and observed (as had many others) that the knowledge of the antecedents of a piece of information could be quite helpful in deciding how much to trust it. Indeed, a similar approach has been used to judge the authenticity of valuables, from art to wine, for many centuries. As I delved into ways to reliably collect and generate metadata about the history of digital artifacts, I found an entire community of like-minded souls, who have banded together with the common interest of data provenance. Many of these researchers meet annually at the USENIX Workshop on the Theory and Practice of Provenance and in alternate years at the International Provenance and Annotation Workshop. In 2014, these were co-located as part of Provenance Week. A significant focus is on the use of provenance to facilitate scientific reproducibility.

Doug asked for "an example of where reproducibility is not happening (or not working) in computational science today". I am going to take the liberty of considering repeatability instead of reproducibility. (Repeatability is of particular interest when reproducibility is questioned.) As the old adage goes, "there’s no place like home". Assuming one is willing to accept computer science as a computational science, we can consider the findings of Collberg et al. at the University of Arizona. They downloaded 601 papers from ACM conferences and journals ("with a practical orientation"), and focused on two questions: "is the source code available, and does it build?" Even the most basic next step, of trying to execute the code, was eliminated. Of course, this meant they could not check the correctness of the output. Unfortunately, even with this low bar, the results were not encouraging. It’s worth taking a look at their report. Their findings are currently being reviewed by Krishnamurthi et al. at Brown University.

The Burden of Reproducibility

Hi, I am Tanu Malik, a research scientist and Fellow at the Computation Institute, University of Chicago and Argonne National Laboratory. I am also a Lecturer in the Department of Computer Science, University of Chicago. I work in scientific data management, with emphasis on distributed data management, and metadata management, and regularly collaborate with astronomers, chemists, geoscientists, and cosmologists.

For the past few years, I have had an active interest in data provenance, which has direct implications on computational reproducibility. My student Quan Pham (co-advised with Ian Foster) recently graduated from the Department of Computer Science, University of Chicago with a thesis on the topic of computational reproducibility. In his thesis, Quan investigated the three orthogonal issues of efficiency, usability and cost of computational reproducibility. He proposed a novel system for computational reproducibility and showed how it can be optimized for several use cases.

Quan’s novel work led to our NSF-sponsored project ‘GeoDataspace’ in which we are exploring the ideas of sharing and reproducibility in the geoscience community, introducing the geoscientists to efficient and usable tools so that they are ultimately in charge of their ‘reproducible’ pipelines, publications, and projects.

So what challenges have we faced so far, and in particular to answer Doug’s question “where in computational science is reproducibility not happening or not working”?

As I perceive it, reproducibility is indeed happening at an individual/collaboration level but not at the community level. Again, I witness reproducibility happening at micro time scales but not at macro time scales. Let me explain myself.

Publications, which serve as the primary medium of dissemination for computational science, are not reproducible. Typically, to produce a publication, the authors conceive of a research idea, experiment with it, and produce some results. Those results are reproducible in that the authors are able to re-confirm for themselves that their research idea and experiments are plausible. However, reproducibility stops soon thereafter. The authors put in some effort to describe their idea, in the form of a publication, to the community, and since it doesn’t pay any further to be part of the community, reproducibility stops.

In this process, reproducibility was happening for a short time scale, i.e., when the authors were investigating the ideas. They were reconfirming it in many ways. But at larger time scales, such as taking a publication that was published five years ago, it is incredibly hard even for the author to reproduce it.

So why is reproducibility not happening at that scale? There are several reasons, and Dan described some of them in his post. Let me also give my take:

When the end result of the scientific method was safely proclaimed to be a text publication (in the late 16th century, after the printing press was in considerable use), we made some assumptions. In particular, that (i) we shall always remember all the details of the chosen scientific method including the know-how of the tools that were used, and that (ii) research is a singular activity done with a focused mind, possibly in the confines of an attic or dungeons.

Four centuries down, those assumptions no longer hold true. Computers, Internet, search engines, Internet-based services, and social networks have changed much of that: We can access, discover, and connect knowledge and people much faster. But can we use these new inventions to verify faster?

GitHub + TravisCI is an excellent example of improving the speed of verification by using open-source, re-executable source codes and bringing in the social network. For the varieties of scientific methods, systems, tools, and communities, this example is just a start. There is still a significant burden of verification that science has to bear. Humans cannot tackle or reduce this burden. We need highly-efficient computational methods and tools that verify fast, encourage good behavior, and provide instantaneous feedback of being left out from the community so that scientists ensure reproducibility at all steps of the scientific method.

Tuesday, May 19, 2015

Introduction: Mark Twain on Reproducibility, more or less

Having now figured out Blogspot's somewhat painful user interface, I will take up Doug's challenge to introduce myself and talk a little about reproducibility in the context of computational science:

I'm Daniel S. Katz (or less formally, Dan), and I have two relevant roles in the context of this blog. First, I'm a researcher, specifically a Senior Fellow in the Computation Institute at the University of Chicago and Argonne National Laboratory, where I work on the development and use of advanced cyberinfrastructure to solve challenging problems at multiple scales. My technical research interests are in applications, algorithms, fault tolerance, and programming in parallel and distributed computing, including HPC, Grid, Cloud, etc. I am also interested in policy issues, including citation and credit mechanisms and practices associated with software and data, organization and community practices for collaboration, and career paths for computing researchers. (I also have been blogging a bit at another site, and I will cross-post part of this post there.)

Second, I'm currently a program officer at the National Science Foundation in the Division of Advanced Cyberinfrastructure (ACI.) At NSF, I am managing $25m-$35m of software programs, including leading NSF's Software Infrastructure for Sustained Innovation (SI2) program, and I currently lead ACI's participation in Computational and Data-enabled Science & Engineering (CDS&E), Designing Materials to Revolutionize and Engineer our Future (DMREF), and Computational and Data-Enabled Science & Engineering in Mathematical and Statistical Sciences (CDS&E-MSS). I previously led ACI's participation in Faculty Early Career Development (CAREER) and Exploiting Parallelism and Scalability (XPS). I also co-led writing of "A Vision and Strategy for Software for Science, Engineering, and Education: Cyberinfrastructure Framework for the 21st Century," and led writing of "Implementation of NSF Software Vision." [This leads me to add: "Some work by the author was supported by the National Science Foundation (NSF) while working at the Foundation; any opinion, finding, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF."]

Given these two roles, a researcher and a funder, it's clear to me that reproducibility in science is increasingly seen as a concern, at least at a high level. And thus, making science more reproducible is a challenge that many people want to solve. But it's quite hard to do this, in general. In my opinion, there are a variety of factors responsible, including:

1. Our scientific culture thinks reproducibility is important at a high level, but not in specific cases. This reminds me of Mark Twain's definition of classic books: those that people praise but don't read. We don't have incentives or practices in place that translate the high level concept of reproducibility into actions that support actual reproducibility.
2. In many cases, reproducibility is difficult in practice, due to some unique situation. For example, data can be taken with a unique instrument, such as the LHC or a telescope, or the data may be transient, such as from a seismometer that measured a specific signal, though on the other hand, in many cases, data taken in one period should be statistically the same as data taken in another period.
3. Given limited resources, reproducibility is less important than new research. As an example, perhaps a computer run that took months has been completed. This is unlikely to be repeated, because generating a new result is seen as a better use of the computing resources than reproducing the old result.

We can't easily change culture, but we can try to change practice, with the idea that a change in practice will eventually turn into a change in culture. And we can start by working on the easier parts of the problem, not the difficult ones. One way we can do this is by formalizing the need for reproducibility. This could be done at multiple levels, such as by publishers, funders, and faculty.

Publishers could require that reviewers actually try to reproduce submitted work as a review criterion. Funders could require the final project reports contain a reproducibility statement, a demonstration that an unrelated group had reproduced specific portions of the reported work, with funders funding these independent groups to do this. And faculty could require students to reproduce the work of other students, benefitting the reproducer with training and the reproducee with knowledge that their work has been proven to be reproducible.

What do we do about work that cannot be reproduced due to a unique situation? Perhaps try to isolate that situation and reproduce the parts of the work that can be reproduced. Or reproduce the work as a thought experiment rather than in practice. In either case, if we can't reproduce something, then we have to accept that we can't reproduce it and we need to decide how close we can come and if this is good enough.

In all of these cases, there's an implied cost-benefit tradeoff. Do we think the benefit of reproducibility is worth the cost, in reviewers' time, funders' funds, or students' time? This gets to the third factor I mentioned previously, the comparative value of reproducibility versus new research. We can try to reduce the cost using automation, tools, etc., but it will always be there and we will have to choose if it is sufficiently important to pursue.

Let me close by going back to Twain's definition, and asking, will reproducibility become one of the classic books of the 21st Century, praised but not carried out? Or will we choose to make the effort to read it?

Disclaimer

Some work by the author was supported by the National Science Foundation (NSF) while working at the Foundation; any opinion, finding, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF.

Monday, May 18, 2015

Introduction: Portability is Reproducibility

I see that everyone is a bit shy to get started, so I will break the ice -

My name is Douglas Thain, and I am an Associate Professor in Computer Science and Engineering at the University of Notre Dame. My background is in large scale distributed systems for scientific computing. I lead a computer science team that collaborates with researchers in physics, biology, chemistry, and other fields of experimental science. We work to design systems that are able to reliably run simulations or data analysis codes on thousands of machines to enable new discoveries. And we publish our software in open source form as the Cooperative Computing Tools.

I have sort of fallen into the reproducibility area sideways, as a result of working in systems at large scale. In the last two years, this has been part of the Data and Software Preservation for Open Science (DASPOS) project funded by the NSF.

And here is a tale of non-reproducibility:

A common problem that we encounter is that a student works hard to get some particular code running on their laptop or workstation, only to find that it doesn't work at all on the other machines that they can get access to. We often find this happening in the realm of bioinformatics, where tools like BLAST, BOWTIE, and BWA are distributed in source form with some rough instructions on how to get things going. To make it work, the student must download the software, compile it just so, install a bunch of shared libraries, and then repeat the process for each tool on which the first tool depends. This process can take weeks or months to get just right.

Now, the student is able to run the code and gets some sample results. He or she now wishes to run the tool at scale, and uses our software to distribute the job to thousands of machines spread around campus. Success, right?

Of course not! None of the machines to which the student has access has been set up in the same painstaking way as the original machine, so the student ends up working backwards to debug all the different ways in which the target differs from the original.

Code works with Python 2 and not Python 3, check.
Missing config file in home directory, check.
libobscure.so.3 found in /usr/lib, check.
(bang head on keyboard)
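
The three mismatches in that list can at least be detected up front rather than one failed job at a time. Here is a minimal sketch, assuming the job can run a small Python script on the target machine before starting; `check_environment` and its parameters are hypothetical names, not part of the Cooperative Computing Tools.

```python
import ctypes.util
import sys
from pathlib import Path


def check_environment(required_major=None, config_file=None, library=None):
    """Return a list of problems found; an empty list means every check passed.

    All three checks are optional, and all parameter names are illustrative.
    """
    problems = []
    # Interpreter check: does the major version match what the code needs?
    if required_major is not None and sys.version_info.major != required_major:
        problems.append(
            "need Python %d, found Python %d"
            % (required_major, sys.version_info.major)
        )
    # Config check: is the expected file present (with ~ expanded)?
    if config_file is not None and not Path(config_file).expanduser().is_file():
        problems.append("missing config file: %s" % config_file)
    # Library check: can the system's usual lookup find the shared library?
    if library is not None and ctypes.util.find_library(library) is None:
        problems.append("shared library not found: %s" % library)
    return problems
```

For the example above, a call might look like `check_environment(required_major=2, config_file="~/.toolrc", library="obscure")`, where `~/.toolrc` is a made-up path. A check like this only reports the differences; it does not fix them, and it covers only the mismatches someone already thought to list.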

I have described this as a problem of portability across machines, but of course you can also look at it as reproducibility across time. If an important result has been generated and now a different person wishes to reproduce that run a year later on a different machine, you can bet that they will encounter the same set of headaches.

Can we make things better?

I think we can, but it is going to take several different kinds of developments:

- Better tools for specifying and standing up environments in a precise way.
- Better habits by researchers to track and specify what they use.
- Better repositories that can track and store environments, code, and data and relate them together.
- Higher expectations by the community that work should be made reproducible.
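
As a small illustration of the "track and specify what they use" habit, here is a sketch that writes down the Python side of an environment next to a result. It uses only the standard library; the function names and the `environment.json` file name are assumptions for the example.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment():
    """Collect the interpreter version, platform, and installed Python packages."""
    packages = sorted(
        "%s==%s" % (dist.metadata["Name"], dist.version)
        for dist in metadata.distributions()
    )
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": packages,
    }


def write_manifest(path="environment.json"):
    """Save the captured environment as JSON so it can be stored with the results."""
    with open(path, "w") as handle:
        json.dump(capture_environment(), handle, indent=2)
```

This records only what Python itself can see. Compilers, shared libraries, and config files in a home directory are not captured, which is part of why better tools and repositories are on the list as well.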

Ok, that's my introduction. Would the other panelists like to introduce themselves?

Tuesday, May 5, 2015

Q: Example of Non-Reproducibility?

To get things going, I'll post some questions and ask each panelist to respond individually:

1. Please introduce yourself and your research interests.

2. Give us an example of where reproducibility is not happening (or not working) in computational science today.

Monday, May 4, 2015

Welcome

Reproducibility is central to the conduct of science. In order to verify, refute, or build upon someone else's work, it must first be possible to reproduce what they have done. In years past, this was accomplished by the maintenance of a written lab notebook, in which all the steps of an experiment were painstakingly entered, so that others could follow along later.

(DaVinci's Lab Notebook)

Today, the computer has replaced the scientific notebook for measuring, documenting, and (in some cases) experimentation itself. In principle, computers should make reproducibility simpler, because a program should execute identically every time it runs. The reality is much more complicated, because computing environments are dynamic, chaotic, and poorly described.

As a result, a scientific code written to run correctly on a computer owned by person A has a surprisingly low chance of running correctly on a computer owned by person B. Or even on person A's computer one year later!

In this blog, we will explore what it means to achieve reproducibility in the context of scientific computing. Your hosts are a panel of experts with experience in computer science and scientific computing:

Ashish Gehani - SRI International
Daniel Katz - Computation Institute, University of Chicago
Tanu Malik - Computation Institute, University of Chicago
Eric Rozier - University of Cincinnati
Kristin Rozier - University of Cincinnati
Douglas Thain - University of Notre Dame
Gordon Watts - University of Washington

This will be a sort of virtual panel discussion where we discuss what causes these problems, what can be done to improve the situation, and share pointers to ideas, technologies, and meetings.

Ok, let's get started!
