My interest in reproducibility stems from my research into long-term data curation and dependability, and in particular from the reproducibility challenges posed by data that includes personally identifiable information (PII). Reproducibility is complicated by many facets: scientific, technical, and ethical. To differentiate the perspective of this post from those of my colleagues, I want to discuss the challenges of long-term reproducibility from a data curation standpoint.
Last year, at the 2014 International Conference on Massive Storage Systems and Technology (MSST), Daniel Duffy and John Schnase of the NASA Center for Climate Simulation discussed how data infrastructure is one of the primary challenges facing climate science. The problem becomes one of reproducibility once the temporal scale of climate analytics is taken into account. In discussions with the University of Miami Rosenstiel School of Marine and Atmospheric Science (RSMAS) and with representatives of the National Oceanic and Atmospheric Administration (NOAA), researchers have told me that identifying important decadal climate patterns would require data curation without significant loss for at least 50, if not 100, years. Meeting that requirement with ever-shrinking budgets and an exponentially growing digital universe is a daunting task. Even if we somehow alter the culture of research science (a goal toward which the NSF is making admirable progress), we still face the double-edged sword of scale in the era of Big Data.
While the exponentially increasing volume and variety of data being ingested means our predictive models, and our resulting understanding of complex processes, keep improving, we are also faced with a changing fault landscape. Faults that were once rare and inconsequential now pose serious threats, and can be orthogonal to traditional reliability mechanisms such as RAID. How can we hope to achieve 50 to 100 years of reliability when many important domains are shifting away from traditional enterprise hardware toward commercial off-the-shelf (COTS) components in order to cope with decreasing budgets and increasing data volume?
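To make the RAID point concrete: parity protects against whole-disk failure, but silent corruption (bit rot) can pass through undetected unless the curator independently verifies content. A standard safeguard in data curation is periodic fixity checking against a manifest of cryptographic digests. Below is a minimal sketch of that idea; the function names and the dict-based manifest format are illustrative assumptions, not any particular repository's API.

```python
# Minimal fixity-checking sketch: detect silent corruption that RAID
# parity alone will not catch. Names and manifest format are illustrative.
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large objects never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest: dict[str, str], root: Path) -> list[str]:
    """Return names of files that are missing or whose digest no longer matches."""
    damaged = []
    for name, expected in manifest.items():
        p = root / name
        if not p.exists() or sha256_of(p) != expected:
            damaged.append(name)
    return damaged
```

Run on COTS hardware, an audit like this turns invisible decay into an actionable repair list; the hard part over a 50-year horizon is institutional (who runs the audit, and who pays for it), not the code.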
One of my biggest concerns moving forward on issues such as reproducibility is our field's continued focus on technologies aimed at the best-funded portion of the market, while socially important applications relating to basic science, sustainability, civic governance, and public health struggle with decreasing budgets and an increasing reliance on insufficiently reliable, dependable, and secure COTS-based architectures. To achieve true reproducibility, solutions need to scale down as well as they scale up. Mechanisms need to be affordable and available to cost-constrained domains such as public clinics, the developing world, and city governments. Data that cannot be preserved cannot enable reproducibility in science.