I had some pretty extensive notes for this post, but I lost them in the upgrade to Windows 10: I foolishly forgot to sync my notebook before I hit "ok" to the upgrade! So be warned: I am reconstructing this from memory!
In the context of particle physics I have a very hard time talking about tools. After thinking about this for a while I realized why: this is part technical and part sociological. It is technical because either the tools aren't all there yet, or we are asking them to do too much. It is sociological because given the current state of the technology we are almost certainly going to have to ask people to do particle physics analysis in a different way in order to fully preserve the analysis steps.
I can explain a bit more.
First, the sociological part. The field of particle physics is an old one, and very individualistic. Organizing a large, 3000-person experiment like Atlas is often referred to as herding cats for a reason. ‘Cats’ have many strengths: people with an idea pursue it, and sometimes that pays off big, even if frowned upon by the management of the experiment. However, there is also a downside: it is particularly hard to organize things. The only way to do it is to show there are clear benefits to doing it way A vs. way B. Data preservation falls smack dab in the middle of this. Current tools do not cover all operations, as I will briefly discuss below. However, if physicists were to change how they perform the steps of their analysis, we'd be most of the way to preserving it.
So convincing them to move to a new system will happen only if there are significant advantages, and I am nervous those aren't there yet. To talk about that, we should probably first classify the tools that are currently out there for data preservation, and the domain they have to cover.
A large physics experiment that runs at the LHC, like Atlas, manipulates the data many times over to get it from raw data to published data. Much of that manipulation is automated by necessity – there is so much data there is no other way to deal with it. This part of the data processing pipeline is relatively easily preserved. The most interesting unsolved problem in data preservation is the "last mile" – where the individual analyst or analysis group performs their work. This work is unique to each analysis. It can include making selection cuts on the data, feeding data into a neural network or boosted decision tree, using some custom software to enhance the signal-to-background ratio, or just about anything else the physicist thinks might enhance the sensitivity of their analysis. You can already see the problem.
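To make the "last mile" concrete, here is a toy sketch of what one such step might look like – a hand-written selection cut. Every file name, branch name, and cut value below is invented for illustration; the point is just that the step is a few lines of the analyst's own judgment, not something produced by the central production system.

```python
# Toy sketch of a "last mile" step: a hand-written selection cut.
# All names and numbers are placeholders, not code from a real analysis.
import numpy as np

events = np.load("skim.npz")   # hypothetical skimmed dataset
pt, eta, met = events["lep_pt"], events["lep_eta"], events["met"]

# The cuts themselves are the analyst's judgment calls, tuned by hand
# and re-run many times while the analysis is being developed.
mask = (pt > 25.0) & (np.abs(eta) < 2.5) & (met > 40.0)

np.savez("selected.npz", **{k: events[k][mask] for k in events.files})
print(f"kept {mask.sum()} of {mask.size} events")
```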
Almost none of this last mile is currently automated. The physicist will write scripts that automate small sections, but some portions will be submitted to a batch farm, other parts need hand tuning, and so on. While a workflow tool might help, the reality is that people tend to re-run a single step hundreds of times while developing it, and then not re-run it for months afterwards. Worse, these steps take place across a large range of machines and architectures. As a result, people don't perceive much value in a workflow system.
Preserving these steps is a daunting task (to me). One approach might be to require all physicists to use a single tool, with every single step done in that tool. The tool could then record everything. I believe this could technically work, but the tool would have to be a true kitchen sink, and an evolving one at that. It would need to understand an individual user's computer, it would need to understand clusters, it would need to understand GRID computing (and perhaps cloud computing soon too), and it would have to incorporate specialized analysis tools (like neural networks, etc.). In short, I just don't see how restricting everyone to a single tool can make this happen. Further, I very much doubt the users – physicists – would be willing to adopt it.
The alternative is to build tools that can record each step. These exist in various forms now – you could run a command under a wrapper that archives all the files it touches and the network connections it makes, and thus preserves them. However, this is now an extra step that has to be run. Each time. And the analyzer never really knows when they are running a step for the last time. Sometimes during the final review a whole section needs to be re-run; often not. Any performance impact will be noticeable. It will be hard to get the analyzer to commit to something like this as well.
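To make the idea concrete, here is a minimal sketch – not an existing tool, just an illustration under my own assumptions about the manifest layout – of what "recording a step" could look like: a wrapper that runs a command and writes out the command line plus hashes of declared input and output files. Even this trivial version shows the burden: someone has to remember to invoke it, and to declare the inputs and outputs, every single time.

```python
# Minimal sketch of a step-recording wrapper (hypothetical, not an
# existing preservation tool). Declaring inputs/outputs by hand is an
# assumption made for illustration.
import hashlib, json, subprocess, time

def sha1(path):
    """Hash a file so the manifest can say exactly what was used."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def run_and_record(cmd, inputs, outputs, manifest="manifest.json"):
    record = {
        "cmd": cmd,
        "when": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "inputs": {p: sha1(p) for p in inputs},
    }
    subprocess.check_call(cmd)          # the actual analysis step
    record["outputs"] = {p: sha1(p) for p in outputs}
    with open(manifest, "w") as f:
        json.dump(record, f, indent=2)

# Usage (script and file names are placeholders):
# run_and_record(["./make_histograms", "input.root"],
#                inputs=["input.root"], outputs=["hists.root"])
```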
Is there a happy middle ground? Perhaps, but I've not seen it yet. In my own work I've been doing my best to move towards a continuous integration approach. Once I have a step “down”, I attempt to encode it into a Jenkins build script. This isn't perfect – Jenkins doesn't really understand the GRID, for example – but it is a start. And if everything can be written as a build script, perhaps that is a first-order preservation. What remains, of course, is preserving the environment around it.
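To give a flavor of what "encoding a step" means here, a Jenkins job in this scheme is essentially a scripted, parameterized invocation of the step – something along the lines of the hypothetical driver below (the dataset names and the wrapped script are placeholders, not my actual setup), which exits non-zero on failure so the build goes red and the step is re-runnable unattended.

```python
#!/usr/bin/env python
# Hypothetical driver a Jenkins job could invoke: one analysis step,
# fully parameterized, exiting non-zero on failure so CI notices.
import argparse, subprocess, sys

def main():
    p = argparse.ArgumentParser(description="Run one analysis step under CI")
    p.add_argument("--dataset", required=True, help="input dataset name")
    p.add_argument("--output", required=True, help="where results land")
    args = p.parse_args()

    # The step itself is still whatever the analyst wrote; the point is
    # that Jenkins now knows exactly how to re-run it, every time.
    cmd = ["./run_selection.sh", args.dataset, args.output]
    return subprocess.call(cmd)

if __name__ == "__main__":
    sys.exit(main())
```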
A more fruitful approach might be to work with much smaller experiments, and integrate the data preservation into the data handling layer. Perhaps if all data processing is always run as a batch job of some sort (be it on the GRID or in the cloud), that will provide a point-of-control that is still flexible enough for the user and enough of a known quantity for the tool authors. As long as the metadata system is sufficient to track the inter-job operations… which translates to: as long as the physicist does nothing outside of the data handling system. Hmmm… :-)
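A sketch of what interposing at the data handling layer might look like – assuming a submit() entry point and a catalogue file that the real system would have to provide – is below. Every job gets registered with its inputs, outputs, and code version at submission time, so lineage is captured as a side effect of submitting rather than as an extra step.

```python
# Hypothetical sketch of interposing on a submission layer. The
# backend_submit callable and the catalogue file stand in for whatever
# the real data handling system provides.
import json, subprocess, uuid

CATALOGUE = "job_catalogue.jsonl"

def submit(executable, input_datasets, output_dataset, backend_submit):
    job_id = str(uuid.uuid4())
    # Assumes the analysis code lives in a git checkout.
    code_version = subprocess.check_output(
        ["git", "rev-parse", "HEAD"]).decode().strip()

    # Record lineage as part of submission itself, so preservation is
    # not an optional extra step the analyst has to remember.
    with open(CATALOGUE, "a") as f:
        f.write(json.dumps({
            "job_id": job_id,
            "executable": executable,
            "inputs": input_datasets,
            "output": output_dataset,
            "code_version": code_version,
        }) + "\n")

    backend_submit(executable, input_datasets, output_dataset, job_id)
    return job_id
```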
Gordon, I think you hit upon something really important: the preservation tools need to be part of the work from the beginning. If you only start to think about preservation at the end, once the publication is made, then everyone has moved on and there will be little interest (time, funding) to do preservation tasks.
It follows that the preservation tools have to be used every single time, and they better not put up roadblocks to the work that you actually want to do.
I like the approach of interposing on the grid submission system, which is an example of "preserving the mess" that I posted today. It also has the advantage that, in order to run remotely, you need to already specify all the dependencies necessary to create the environment.
If every important task gets run in a batch system instead of locally (does it?) then that could be the right solution for HEP...