Sunday, May 24, 2015

Introduction: On a related note, Repeatability (in Computer Science)

Hello, world!

The idea of using standardized Digital Object Identifiers (DOIs) to reference electronic artifacts has been around for many moons. However, for better or worse, there does not seem to be an analogous standard for Digital Personal Identifiers (DPIs). Consequently, I must introduce myself with a sobriquet that I used to think was unique -- my name, Ashish Gehani. (Facebook has shown me the folly of that thought -- there are many others who go by the same label!) I first heard the case for using DPIs from my colleague, Natarajan Shankar, in the Computer Science Laboratory at SRI (where we work). I’m sure the motivation is much older, but I believe we first spoke about it in the context of disambiguating authors whose publications were incorrectly being conflated on Google Scholar.

I arrived at the topic of scientific reproducibility via a different path from others on this blog (I think). I’d been studying systems security and observed (as had many others) that knowing the antecedents of a piece of information could be quite helpful in deciding how much to trust it. Indeed, a similar approach has been used for many centuries to judge the authenticity of valuables, from art to wine. As I delved into ways to reliably collect and generate metadata about the history of digital artifacts, I found an entire community of like-minded souls who have banded together around a common interest in data provenance. Many of these researchers meet annually at the USENIX Workshop on the Theory and Practice of Provenance and in alternate years at the International Provenance and Annotation Workshop. In 2014, these were co-located as part of Provenance Week. A significant focus is on the use of provenance to facilitate scientific reproducibility.

Doug asked for "an example of where reproducibility is not happening (or not working) in computational science today". I am going to take the liberty of considering repeatability instead of reproducibility. (Repeatability is of particular interest when reproducibility is questioned.) As the old adage goes, "there’s no place like home". Assuming one is willing to accept computer science as a computational science, we can consider the findings of Collberg et al. at the University of Arizona. They downloaded 601 papers from ACM conferences and journals ("with a practical orientation"), and focused on two questions: "is the source code available, and does it build?" They left out even the most basic next step, trying to execute the code, which of course meant they could not check the correctness of the output. Unfortunately, even with this low bar, the results were not encouraging. It’s worth taking a look at their report. Their findings are currently being reviewed by Krishnamurthi et al. at Brown University.

5 comments:

  1. This comment has been removed by the author.

  2. You should look at ORCID - http://orcid.org

  3. ORCID is a good idea, and they have a nice infrastructure, but I was surprised that after creating a profile (http://orcid.org/0000-0001-5218-1956) it became *my* job to upload papers to their system and associate them with my ID. (This is now the fourth or fifth website where I'm expected to maintain my publications!) The ecosystem will work better when publishers collect ORCIDs at submission time and take on the role of associating ORCIDs with DOIs.

  4. You can think of ORCID just as a unique person ID. Or, you can use it to store a profile. You don't have to do the latter to take advantage of the former.

  5. Dan,

    Thanks for the pointer to ORCID.

    I see that NSF's Fastlane includes the option to list one's ORCID identity.

    Do you know if this has existed for a while (and I have just noticed it) or whether this is a new feature?

    If it has been around for an extended period, has there been any analysis of its uptake / impact?
