13:00
Designing Scalable Cyberinfrastructure for Metadata Extraction in Billion-Record Archives
Gregory Jansen | University of Maryland | United States
Authors:
Gregory Jansen | University of Maryland | United States
Richard Marciano | University of Maryland | United States
Smruti Padhy | UIUC / NCSA | United States
Kenton McHenry | UIUC / NCSA | United States
We present a model and testbed for a curation and preservation infrastructure, “Brown Dog”, that applies to heterogeneous and legacy data formats. “Brown Dog” is funded through a National Science Foundation DIBBs (Data Infrastructure Building Blocks) grant and is a partnership between the National Center for Supercomputing Applications at the University of Illinois and the College of Information Studies at the University of Maryland at College Park. In this paper we design and validate a “computational archives” model that uses the Brown Dog data services framework to orchestrate data enrichment activities at petabyte scale on a 100-million-record archival collection. We show how this data services framework can provide customizable workflows through a single point of software integration. We also show how Brown Dog makes it straightforward for organizations to contribute new and legacy data extraction tools that become part of their own archival workflows and those of the larger community of Brown Dog users. We illustrate one such data extraction tool, a file characterization utility called Siegfried, from its development as an extractor through to its use on archival data.
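As a rough illustration of the kind of extractor the abstract describes, the sketch below wraps the real Siegfried command-line tool ("sf") in a small Python function that returns PRONOM format identifications. The "-json" flag and the report fields shown reflect Siegfried's documented JSON output, but the wrapper itself (function name, error handling, field selection) is a hypothetical sketch and is not taken from the Brown Dog codebase.

    import json
    import subprocess

    def identify_format(path):
        """Run Siegfried ('sf') on a file and return its PRONOM matches.

        Assumes the 'sf' binary is on PATH; '-json' makes Siegfried
        emit a JSON report with one entry per scanned file.
        """
        result = subprocess.run(
            ["sf", "-json", path],
            capture_output=True, text=True, check=True,
        )
        report = json.loads(result.stdout)
        # Each file entry carries a list of matches; for the PRONOM
        # namespace a match includes the PUID ('id') and format name.
        return [
            {"puid": m.get("id"), "format": m.get("format"), "mime": m.get("mime")}
            for f in report.get("files", [])
            for m in f.get("matches", [])
        ]

    # Example (illustrative output):
    # identify_format("records/memo.doc")
    # -> [{"puid": "fmt/40", "format": "Microsoft Word Document", ...}]

In a Brown Dog-style deployment, a function like this would sit behind the framework's extractor interface so that format identification runs as one enrichment step in a larger archival workflow.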
13:30
Navigating through 200 Years of Historical Newspapers
Yannick Rochat | DHLAB, EPFL | Switzerland
Authors:
Yannick Rochat | DHLAB, EPFL | Switzerland
Maud Ehrmann | DHLAB, EPFL | Switzerland
Vincent Buntinx | DHLAB, EPFL | Switzerland
Cyril Bornet | DHLAB, EPFL | Switzerland
Frédéric Kaplan | DHLAB, EPFL | Switzerland
This paper describes the processes that led to the creation of an innovative interface for accessing a digital archive composed of two Swiss newspapers, the Gazette de Lausanne (1798–1998) and the Journal de Genève (1826–1998). Based on several text-processing steps, including lexical indexing, n-gram computation, and named entity recognition, a general-purpose web-based application was designed and implemented; it allows a wide variety of users (e.g. historians, journalists, linguists, and the general public) to explore different facets of about 4 million press articles spanning a period of almost 200 years.
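To make the n-gram computation step concrete, here is a minimal sketch of how per-corpus n-gram counts might be accumulated over OCR'd article text. The tokenizer and the shape of the input are illustrative assumptions; the abstract does not specify the actual pipeline used by the authors.

    import re
    from collections import Counter

    def ngrams(tokens, n):
        """Yield successive n-grams (as tuples) from a token list."""
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

    def count_ngrams(articles, n=2):
        """Accumulate n-gram counts over an iterable of article texts.

        'articles' is assumed to yield plain OCR text, one string per
        article; the regex tokenizer below is a stand-in for whatever
        tokenization the real pipeline performs.
        """
        counts = Counter()
        for text in articles:
            tokens = re.findall(r"\w+", text.lower())
            counts.update(ngrams(tokens, n))
        return counts

    # Example: the most frequent bigrams in a toy corpus
    corpus = [
        "La Gazette de Lausanne publie ce jour...",
        "Le Journal de Geneve rapporte que...",
    ]
    for gram, freq in count_ngrams(corpus, n=2).most_common(5):
        print(" ".join(gram), freq)

Counts of this kind, computed once over the full 4 million articles, are what let the interface answer frequency and trend queries interactively rather than rescanning the corpus per request.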