Reproducible Scientific Computing: The Need for Notebooks

In response to Doug Thain's question:

What currently-available tools do you recommend for enabling reproducible scientific computing? Is there a tool that we ought to have, but do not?

P.S. I am using "reproducibility" as an easy shorthand for re-usability, re-creation, verification, and related tasks that have already seen some discussion. Please interpret the question broadly.

I think the most important thing we need is an electronic lab notebook that really allows us to go back and understand exactly what we did, repeat it, modify it, record it, etc. If you accept this, it leads to a number of points:

Why don't more people (including me) do this now?
What tool(s) should we use?
How should this integrate into the publication process (for papers, software, data, etc.)?

It's a bit hypocritical for me to write this post, since I've never really been a good notebook user or note taker, whether in classes, meetings, or labs. And I think this is true of a lot of us. Part of it is probably habit: it's hard to break a bad habit or start a new one. And part of it is the lack of good tools with good interfaces.

Tools I've seen and liked include VisTrails for data exploration and visualization, and Project Jupyter for a lot of other things. Software version control (e.g., Git) is most likely also part of the answer. While these independent tools each have strengths and weaknesses, I don't think any of them really fits the bill.

I would really like something that's much more integrated into my computer, probably at a runtime or OS layer, that helps me understand all of my work, not just my visualization or coding.

If we did have something that really was more of an automated work-tracking notebook, we could use it to help us with publications as well. For example, in my idea of transitive credit, if we are going to decide what products contributed to a new product, a starting point is the list of products we used during the creation of the new product, a list that such a notebook could generate. Or we could track our reading list in the period leading up to a new paper as a starting point for the papers we should reference.
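As a rough illustration of the kind of record such a notebook might keep, here is a minimal Python sketch. It is not an existing tool: the `WorkLog` class, its `record` and `products_used` methods, and the file names are all hypothetical, and a real system would capture this information automatically at the runtime or OS layer rather than through explicit calls.

```python
import json
import time
from dataclasses import dataclass, field, asdict


@dataclass
class Step:
    """One recorded action: what ran, what it read, what it wrote."""
    command: str
    inputs: list
    outputs: list
    timestamp: float = field(default_factory=time.time)


class WorkLog:
    """Hypothetical append-only log of work steps (illustration only)."""

    def __init__(self):
        self.steps = []

    def record(self, command, inputs=(), outputs=()):
        """Append one step to the log and return it."""
        step = Step(command, list(inputs), list(outputs))
        self.steps.append(step)
        return step

    def products_used(self, product):
        """Return every recorded input the product depends on, directly or indirectly."""
        used = set()
        frontier = [product]
        while frontier:
            target = frontier.pop()
            for step in self.steps:
                if target in step.outputs:
                    for item in step.inputs:
                        if item not in used:
                            used.add(item)
                            frontier.append(item)
        return sorted(used)

    def to_json(self):
        """Serialize the whole log so it can be archived with a publication."""
        return json.dumps([asdict(s) for s in self.steps], indent=2)


# Hypothetical usage: two steps that lead from raw data to a figure.
log = WorkLog()
log.record("python clean.py", inputs=["raw.csv", "clean.py"], outputs=["clean.csv"])
log.record("python plot.py", inputs=["clean.csv", "plot.py"], outputs=["figure1.png"])
print(log.products_used("figure1.png"))
```

Asking the log what went into `figure1.png` returns the scripts and data files behind it, which is the sort of starting list that a transitive credit assignment would need.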

This potentially leads us beyond the PDF and onto the path towards executable papers, which would be a key step towards reproducible science in the large.

(Note: this is crossposted to https://danielskatzblog.wordpress.com/2015/07/15/the-need-for-notebooks/)

Disclaimer: Some work by the author was supported by the National Science Foundation (NSF) while working at the Foundation; any opinion, finding, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF.

Wednesday, July 15, 2015
