Sunday, May 24, 2015

Reproducibility, Data Curation, and Cost-Constrained Domains

My name is Eric Rozier, and I'm an Assistant Professor in Electrical Engineering and Computing Systems at the University of Cincinnati, and a two time faculty mentor at the University of Chicago's Data Science for Social Good Summer Fellowship (DSSG). DSSG is a summer program through the Computation Institute which seeks to tackle problems in Big Data with social relevance.  Last summer I advised eight student fellows on projects with the Chicago Department of Public Health and the World Bank's Integrity Vice Presidency.  During the normal year I'm the director of the Trustworthy Data Engineering Laboratory at the University of Cincinnati where we work on problems at the intersection of Big Data, Data Privacy and Cybersecurity, Data Dependability, and Analytics.

My interest in reproducibility stems from my research into long-term data curation and dependability, and issues focusing on reproducibility with data that includes personally identifiable information (PII).  Reproducibility is a topic complicated by many facets, scientific, technical, and ethical in nature.  In an effort to differentiate the perspective of this post from those of my colleagues I want to discuss the challenges of long-term reproducibility from a data curation stand point.

Last year at the 2014 International Conference on Massive Storage Systems and Technology (MSST) Daniel Duffy and John Schnase of the NASA Center for Climate Simulation discussed how data infrastructure was one of the primary challenges facing climate science.  The problem becomes one of reproducibility when the temporal scale of climate analytics becomes evident.  In discussions with the University of Miami Rosenstiel School of Marine and Atmospheric Science (RSMAS) and representatives of the National Oceanic and Atmospheric Administration (NOAA) researchers have told me that to conduct the sort of climate research that would be necessary to identify important decadal patterns would require data curation without significant loss for at least 50, if not 100, years.  Facing these challenges with an ever shrinking budget and an exponential growth of the digital universe is an incredibly challenging task.  Even if we somehow alter the culture of research science (a goal that the NSF is making admirable progress towards), we still face the double edged sword of scale in the era of Big Data.

While the exponentially increasing volume and variety of data being ingested means our predictive models and resulting understanding of complex processes are becoming better; we are also faced with a world of changing fault landscapes.  Once rare and inconsequential faults now pose serious threats; and can be orthogonal to traditional reliability mechanisms such as RAID.  How can we hope to achieve 50 to 100 years of reliability when many important domains are shifting away from more traditional enterprise hardware in favor of commercial off-the-shelf (COTS) components to deal with decreasing budgets and increasing data volume?

One of the biggest concerns I have moving forward on issues such as reproducibility is the continued focus of our field on technologies aimed at the best funded portion of the market, while socially important applications that relate to basic science, sustainability, civic governance, and public health struggle with decreasing budgets and an increasing reliance on insufficiently reliable, dependable, and secure, COTS-based architectures.  To achieve true reproducibility solutions need to scale down as well as they scale up.  Mechanisms need to be affordable and available for cost-constrained domains like public clinics, the developing world, and city governments.  Data that cannot be preserved, cannot enable reproducibility in science.

1 comment:

  1. Eric, I share your concern about the focus of our profession. Computer science as an academic discipline spends a lot of effort producing technology, training students, and burning coal that ultimately supports nothing more than the optimization of advertising within social networks. Our educational systems could do a better job of demonstrating more important applications of computing.

    But there is the converse problem: we have seen many times that specialized areas of computing cannot take advantage of economies of scale, and are better off riding the commodity wave. Specialized supercomputers have largely been supplanted by commodity clusters. GPU technology driven by video games has far better energy/performance density than any bespoke parallel hardware. Soldiers in the gulf war preferred using personal smartphones to military-issue GPS units. More examples abound.

    So, can we advance science, sustainability, governance, and public health using commodity technologies? Or do they need something fundamentally different?