10:30 am
Precise Data Identification Services for Long Tail Research Data
Stefan Proell | SBA Research | Austria
Authors:
Stefan Proell | SBA Research | Austria
Kristof Meixner | Vienna University of Technology | Austria
Prof. Andreas Rauber | Vienna University of Technology | Austria
While sophisticated research infrastructures assist scientists in managing massive volumes of data, the so-called long tail of research data frequently suffers from a lack of such services. This is mostly due to the complexity caused by the variety of data to be managed and a lack of easily standardisable procedures in highly diverse research settings. Yet, as even domains in this long tail of research data are increasingly data-driven, scientists need efficient means to communicate precisely which version and subset of data was used in a particular study, in order to enable reproducibility and comparability of results and to foster data re-use.
This paper presents three implementations of systems supporting such data identification services for comma-separated value (CSV) files, a dominant format for data exchange in these settings. The implementations are based on the recommendations of the Working Group on Dynamic Data Citation of the Research Data Alliance (RDA). They provide implicit change tracking of all data modifications, while precise subsets are identified via the respective subsetting process. This enhances the reproducibility of experiments and allows efficient sharing of specific subsets of data even in highly dynamic data settings.
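The abstract does not describe how the three implementations work internally. As a rough illustration of the general idea only (versioned data, plus a timestamped subsetting query, plus a hash of the result for verification), a minimal Python sketch might look as follows. All names here (VersionedTable, cite_subset, resolve) are hypothetical and are not taken from the authors' systems; a real service would also assign a persistent identifier to each stored query.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class VersionedTable:
    """Hypothetical append-only store: rows are never overwritten, only closed."""
    # Each entry is [key, record, added_at, removed_at].
    rows: list = field(default_factory=list)
    clock: int = 0  # logical timestamp, incremented on every change

    def _close(self, key):
        # Mark the currently valid version of `key` (if any) as removed.
        for row in self.rows:
            if row[0] == key and row[3] is None:
                row[3] = self.clock

    def upsert(self, key, record):
        """Insert a new row, or record a new version of an existing one."""
        self.clock += 1
        self._close(key)
        self.rows.append([key, dict(record), self.clock, None])
        return self.clock

    def delete(self, key):
        """Mark a row as deleted without discarding its history."""
        self.clock += 1
        self._close(key)
        return self.clock

    def as_of(self, ts):
        """Return the (key, record) pairs that were valid at timestamp `ts`."""
        return [(k, rec) for k, rec, added, removed in self.rows
                if added <= ts and (removed is None or removed > ts)]


def result_hash(subset):
    """Hash a subset in a canonical order so that re-execution can be verified."""
    canonical = repr(sorted((k, sorted(rec.items())) for k, rec in subset))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cite_subset(table, predicate):
    """Record the query, its execution timestamp and a hash of its result."""
    ts = table.clock
    subset = [(k, r) for k, r in table.as_of(ts) if predicate(r)]
    return {"timestamp": ts, "predicate": predicate, "hash": result_hash(subset)}


def resolve(table, citation):
    """Re-run the stored query against the data as it was at citation time."""
    subset = [(k, r) for k, r in table.as_of(citation["timestamp"])
              if citation["predicate"](r)]
    if result_hash(subset) != citation["hash"]:
        raise ValueError("re-executed subset does not match the cited subset")
    return subset
```

Because earlier versions are only closed and never discarded, a subset cited before later edits can still be resolved to exactly the rows it originally contained.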
11:00 am
CERN Services for Long Term Data Preservation
Dr. Jamie Shiers | CERN | Switzerland
Authors:
Dr. Jamie Shiers | CERN | Switzerland
Frank Berghaus | CERN | Switzerland
German Cancio Melia | CERN | Switzerland
Suenje Dallmeier Tiessen | CERN | Switzerland
Gerardo Ganis | CERN | Switzerland
Tibor Simko | CERN | Switzerland
Jakob Blomer | CERN | Switzerland
In this paper we describe the services that are offered by CERN [3] for long-term preservation of High Energy Physics (HEP) data, with the Large Hadron Collider (LHC) as a key use case.
Data preservation is a strategic goal for European High Energy Physics (HEP) [9], as well as for the HEP community worldwide, and we position our work in this global context. Specifically, we target the preservation of the scientific data, together with the software, documentation and computing environment needed to process, (re-)analyse or otherwise (re-)use the data. The target data volumes range from hundreds of petabytes (PB = 10^15 bytes) to hundreds of exabytes (EB = 10^18 bytes) for a target duration of several decades.
The use cases driving data preservation are presented together with metrics that allow us to measure how close we are to meeting our goals, including the possibility of formal certification for at least part of this work. Almost all of the services that we describe are fully generic; the exception is Analysis Preservation, which has some domain-specific aspects (although the basic technology could nonetheless be adapted).