Reproducible Scientific Computing: Introduction: Hi, I'm Gordon...

I'm Gordon Watts, a professor at the University of Washington in the Physics Department. My main focus of physics research is particle physics: I'm a member of the ATLAS experiment at the LHC, located at CERN. I'm also a member of the DZERO experiment at the Tevatron, located at FNAL. These days I mostly do research into exotic decays - looking for a hint of what particles might be dark matter or what particles might fill in some of the missing pieces of the Standard Model of particle physics... a model that should describe all matter and forces and interactions in our universe if it were complete.

Reproducibility is something I've fallen into - there was very little career planning that sent me in this direction: mostly just interest. I originally got into particle physics because it seemed to exist at the intersection of hardware, software, and physics. As my career has moved forward I've moved away from hardware and more towards software and physics. I'm (a little) embarrassed to admit that I first started thinking about this because of a new policy at the National Science Foundation (NSF). They are one of the major funding agencies in our field, and, more to the point, fund our group at the University of Washington (the Department of Energy's Office of Science is the other, larger, funding agency). In 2011 they started requiring a Data Management and Sharing plan along with every grant proposal. The basic idea: if the government funded scientific research, then the data derived from that research should be public. "What's this?" I thought... and it has been downhill since then. :-)

What is missing?

Reproducibility, unfortunately, means different things to different people (see the post just the other day on this blog which lays out an entirely sensible taxonomy). Let's take an experiment in particle physics. This is funny: I don't mean ATLAS, though that is the traditional definition of an experiment. Rather, in the jargon of my field, let's take an analysis. This example analysis starts with the data taking in 2012. The raw data is reconstructed, analyzed, and finally the plots and conclusions are used to write a paper. That paper represents about 2 years of three people's work. If I gave you the TBs of raw data and whatever software you needed, could you re-make the plots or re-calculate the numbers in that paper? In particular, could you do it if you were not a member of ATLAS?

No. You could not. I'm not sure I could even do it if you were a member of ATLAS.

The reasons? Humans. Those pesky creatures at the end of the keyboard. The ones that do the physics and drive the field forward.

There are a number of steps that one has to go through to transform the raw data into final plots and physics results. First some very general reconstruction code must be run. It takes the raw electronic signals from the detector and finds the trajectory of an electron, or a muon, or a jet of particles, etc. If everything went well, then these are the raw particles that are left in the detector.

Once the reconstruction is done, the next step is producing a small summary data file from that reconstruction. Often some calibration is applied at this point. While the reconstruction is expensive in terms of CPU and disk space, the summary data file production is cheap and quick - and often re-run many times.

The final step is to take the summary data and produce the final plots. This involves running over multiple summary data files, producing plots, fitting, scaling, calculating ratios, and statistical calculations to determine if you have discovered something or you are going to rule out the existence of what you haven't discovered.

This is a simplification, but it is a good enough model for this discussion. The first two steps, the reconstruction and production of the summary data files, are automated. Large CPU farms run the jobs, and a huge database system tracks all the files in and out, including the log files from the jobs at each step. I don't know the best way to actually give you the code and the data, but the problem is solvable. Computers run and manage every step of the way. Capturing that information is a hard problem, but it is all recorded somewhere: in databases, in script files in well known places, and in source files in source code repositories (like git or svn).
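To make "capturing that information" a little more concrete, here is a minimal sketch of what a provenance record for one automated job might look like. It is only an illustration: the function names, the field names, and the version tag are all hypothetical, and it does not describe how ATLAS's production system actually stores this information.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Return a content hash, so a file is identified by its bytes, not its name."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(step, code_version, inputs, output):
    """Describe one processing job: which step, which code, which bytes in and out.

    All field names here are assumptions made for this sketch.
    """
    return {
        "step": step,
        "code_version": code_version,
        "inputs": sorted(sha256_of(b) for b in inputs),
        "output": sha256_of(output),
    }


# Toy stand-ins for a raw data file and the reconstructed file made from it.
raw = b"raw detector signals"
reco = b"reconstructed tracks"

# "hypothetical-tag-1.0" is a made-up placeholder for a repository tag.
record = provenance_record("reconstruction", "hypothetical-tag-1.0", [raw], reco)
print(json.dumps(record, indent=2))
```

Chaining records like this, where the output hash of one step appears among the input hashes of the next, is one way a chain from raw data to summary file could be traced after the fact.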

The last step, however, is a mess! Take figure 3 of the paper I referenced above. We had to make some 50 versions of that plot, and fit each one. We never found a single function that would fit everything. And the fit often would not converge without a little help. I call this sort of computer-human interaction hand-art: one cannot automate the process and direct intervention by a "hand" is required. Even without something like fitting-by-hand, this last step is typically a mess of scripts and jobs that are run by hand.
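Hand-art cannot be automated, but the choices the "hand" made could at least be written down. The sketch below shows one possible shape for that: a small log of which function and which starting values a person picked for each plot. The class, the plot name, the parameter names, and the numbers are all invented for illustration; nothing here comes from the actual analysis.

```python
import json


class FitLog:
    """Append-only record of the choices a person made while fitting by hand."""

    def __init__(self):
        self.entries = []

    def record(self, plot, function, start_values, note=""):
        """Store one manual intervention for one plot."""
        self.entries.append({
            "plot": plot,
            "function": function,
            "start_values": dict(start_values),
            "note": note,
        })

    def dumps(self):
        """Serialize the log so it can be kept alongside the analysis code."""
        return json.dumps(self.entries, indent=2, sort_keys=True)

    @classmethod
    def loads(cls, text):
        """Rebuild a log from its serialized form, e.g. to replay the fits."""
        log = cls()
        log.entries = json.loads(text)
        return log


# Hypothetical example entry: the names and numbers are made up.
log = FitLog()
log.record(
    "plot_07",
    "exponential",
    {"norm": 120.0, "slope": -0.3},
    note="default starting values did not converge",
)

restored = FitLog.loads(log.dumps())
print(restored.entries[0]["start_values"])
```

A log like this does not make the fit automatic. It only means that someone else could start from the same place the original analyst did.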

This is an issue that spans both tools and sociology in the field. What if the tools to preserve the tasks are very different from what we do now? How do we convince people to switch? And given such an uncontrolled environment where analysis takes place, how do we even capture the steps in the first place?

This is one of many things lacking when it comes to delivering a particle physics analysis in a reproducible fashion. See some of the other posts on this blog for more!

1 comment:

Daniel S. Katz, May 26, 2015 at 7:04 AM
I wonder why you decided to start with reconstruction. One might equally well choose to start with a new run of the same detector, or even with a new detector. The question of where reproducibility starts seems like something that should always be asked.

Tuesday, May 26, 2015
