Note
This document is a literal conversion of the DataONE Preservation Strategy document developed by the “Preservation and Metadata Working Group” as an outcome of the December 2010 meeting held in Chicago. The original document may be retrieved from the DataONE document library.
To meet the objective of “easy, secure, and persistent storage of data”, DataONE adopts a simple three-tiered approach.
Fundamentally, DataONE’s preservation goal is to protect the content, meaning, and behavior of data sets registered in its global network of heterogeneous data repositories. This is a complex undertaking that warrants a layered, prioritized approach. To get started on a solid footing, our first objective was to build a platform that immediately provides a significant degree of preservation assurance and makes it easy to add more sophisticated preservation functions over time. Initially, DataONE will focus on preventing loss due to non-malicious causes, such as,
While malicious threats do exist, many of them are addressed as a side-effect of DataONE protocols and information technology (IT) management standards in place at member nodes (MNs). By design, DataONE protocols limit the ability of any MN or Coordinating Node (CN) to directly alter content on another node, which in turn limits the havoc that an intruder could wreak. Moreover, MN guidelines call for the same strong local IT management standards that are widespread in financial services, manufacturing, and large technology organizations, and these are typically already in force at well-established data repositories.
An ancillary objective is to help inform an overall NSF DataNet preservation strategy, and to that end this strategy was prepared at a DataONE workshop (Chicago, 2010) with direct input from the Data Conservancy project.
The initial DataONE approach to preservation is described in the following three sections.
Retaining the actual bits comprising the data and metadata is paramount, as all other preservation and access questions are moot if the bits are damaged or lost. The direct role played by MNs in bit-level preservation is addressed in the third section describing organizational sustainability.
Persistent identifiers (PIDs) are required for stable reference to all content stored in DataONE. Without them, reliable data citation and long-term access would not be possible. Because there are many legacy identifiers to be accommodated, some of them dating from before the advent of the world-wide web, DataONE uses PIDs from a variety of schemes and support systems, such as purl.org and handle.net. Remaining agnostic about identifier syntax, DataONE will also rely on scheme-agnostic support systems such as n2t.net/ezid, which can deal with ARKs, DOIs, and traditionally non-actionable identifiers such as PMIDs (PubMed Identifiers).
To protect against the possible loss of a MN, or a bit-level failure at a single MN, DataONE replicates both data and metadata. Two replicas of the raw bits representing each dataset are created upon registration of a dataset by a MN; the two replicas and the original dataset held at the MN result in a total of three instances. The instances are kept at three different MNs, which creates safety through copies that are “de-correlated” by geographic, administrative, and financial domain. In this way, the instances are not vulnerable to the same power failure, same earthquake, same funding loss, etc. As for metadata, three replicas of all metadata are created and held at the CNs, resulting in a total of four de-correlated metadata instances.
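As an illustration only, the de-correlation rule can be sketched in Python. The `Node` fields, the function names, and the greedy selection order are all assumptions made for this example; they are not part of any DataONE protocol or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical description of a member node (all fields are assumed)."""
    name: str
    region: str   # geographic domain
    admin: str    # administrative domain
    funder: str   # financial domain


def decorrelated(a: Node, b: Node) -> bool:
    """True when two nodes share no geographic, administrative, or financial domain."""
    return a.region != b.region and a.admin != b.admin and a.funder != b.funder


def choose_replica_targets(origin: Node, candidates: list, copies: int = 2) -> list:
    """Greedily pick up to `copies` nodes that are de-correlated from the
    originating node and from each other. May return fewer than requested."""
    chosen = [origin]
    for node in candidates:
        if len(chosen) == copies + 1:
            break
        if node.name != origin.name and all(decorrelated(node, c) for c in chosen):
            chosen.append(node)
    return chosen[1:]  # exclude the origin, which already holds the original
```

With the default `copies=2`, the originating MN plus the two selected targets give the three instances described above. A greedy pass is the simplest possible choice; a real selection would also need to weigh capacity and node replication policy.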
Depending on its data replication policy, a MN may fall into one of several classes:
Replication will be triggered automatically by content registration. While it is desirable to maintain three instances of each data object, over time there may arise practical limits to replication due to a number of changeable factors:
MN guidelines also call for the common-sense and usual practice of periodic “media refresh”, which is the copying of data from old physical recording devices to new physical recording devices to avoid errors due to media degradation and vendor de-support.
Damage or corruption in those copies is detected by periodically re-computing checksums (e.g., SHA-256 digests) for randomly selected datasets and comparing them with checksums securely stored at the CNs. Any bit-level changes detected can be repaired by copying from an unchanged copy. This kind of “pop quiz” cannot be cheated by simply reporting back a previously computed checksum; the actual MN replica data is requested and the checksum recomputed. It is appropriate that this entails sampling only a subset of the data as it is not computationally feasible to keep the MNs and CNs constantly busy exhaustively checking the amount of content that DataONE anticipates holding.
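The sampling audit described above can be sketched as follows. This is a minimal illustration, assuming in-memory dictionaries in place of real CN and MN storage; the function names and data layout are hypothetical and do not reflect the DataONE service interfaces.

```python
import hashlib
import random


def sha256_of(data: bytes) -> str:
    """Recompute the digest from the actual bytes, never from a cached value."""
    return hashlib.sha256(data).hexdigest()


def audit_sample(trusted_checksums, replicas, sample_size, rng=None):
    """Audit a random sample of objects and repair damaged replicas.

    trusted_checksums: identifier -> SHA-256 hex digest (assumed held at the CNs)
    replicas: node name -> {identifier: bytes} (assumed MN holdings)
    Returns a list of (identifier, node) pairs that were repaired.
    """
    if rng is None:
        rng = random.Random()
    pids = sorted(trusted_checksums)
    sampled = rng.sample(pids, min(sample_size, len(pids)))
    repairs = []
    for pid in sampled:
        expected = trusted_checksums[pid]
        good_copy = None
        damaged_nodes = []
        for node, store in replicas.items():
            if pid not in store:
                continue
            if sha256_of(store[pid]) == expected:
                if good_copy is None:
                    good_copy = store[pid]
            else:
                damaged_nodes.append(node)
        # Repair is only possible when at least one unchanged copy survives.
        if good_copy is not None:
            for node in damaged_nodes:
                replicas[node][pid] = good_copy
                repairs.append((pid, node))
    return repairs
```

Because the digest is always recomputed from the replica bytes, a node cannot pass the audit by returning a stored checksum. The `sample_size` parameter reflects the trade-off noted above between coverage and computational cost.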
Digital content has three related aspects that must be considered when planning and performing preservation functions: content has a specific digital form; that form encapsulates a given abstract meaning; and that meaning is recovered for use through appropriate behaviors applied against the form. Preservation of form ensures that the low-level structure of the content is preserved; preservation of meaning ensures that the semantics of the content are recoverable, at least in theory; and preservation of behavior ensures that the semantics are recoverable in practice.
For example, consider a dataset of environmental samples. At the structural level these numeric data are organized in a tabular fashion. But their full meaning is only recoverable by knowing the variables and units of measure associated with each column in the table. If the data are represented in binary, rather than textual form, then use of the data also depends on an appropriate software application that can expose the information in a directly human-useable form. DataONE metadata standards should incorporate schemas to document and describe data in terms enabling preservation of form, meaning, and behavior.
Preservation activities in this tier fall into four conceptual categories:
Ultimately, having the bits and their meaning is useless if one doesn’t also have the legal right (a) to hold the data, (b) to make copies and derivatives in performance of preservation management (such as replication and migration), and (c) to transfer those same rights to a successor archive. Just as important is to know specifically who owns the original data and whether those rights have been granted.
As a start we strongly encourage providers to assign “Creative Commons Zero” (CC0) licenses to all contributed data, which facilitates preservation and does not prevent data providers from requiring an attribution statement as a condition of re-use.
Migration and emulation are sub-strategies that DataONE will use in the event that formats become obsolete. At some time in the future, one may expect that available contemporary hardware and software will be unable to render or otherwise use bits saved in some formats.
Migration is used to convert from older to newer formats; all converted content is subject to “before” and “after” characterization to ensure semantic invariance. Emulation effectively preserves older computing environments in order to retain the experience of rendering older formats; once considered a specialized intervention, emulation has become a more viable technique with recent developments in consumer and enterprise server virtualization solutions.
Migration workflows need only be available on a subset of DataONE member nodes, which can function as service utilities to the greater network. A successful migration strategy requires versioning of the content, where all versions are retained. The versioning of managed content that results from migration will be reflected in that content’s system metadata. All migrated content will be subject to “before” and “after” characterization to ensure the semantic invariance of the transformation.
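A toy sketch of a migration with retained versions and “before” and “after” characterization is shown here. It assumes a CSV-to-TSV conversion as a stand-in for a real format migration, and it treats characterization as a comparison of column names, row count, and cell values. The class and function names are hypothetical, and real characterization tools capture far more than this.

```python
import csv
import io
from dataclasses import dataclass, field


def characterize(text: str, delimiter: str) -> dict:
    """Summarize the table-level properties that a migration must not change."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    return {"columns": rows[0], "row_count": len(rows) - 1, "cells": rows[1:]}


def migrate_csv_to_tsv(text: str) -> str:
    """Stand-in migration: rewrite comma-separated text as tab-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(csv.reader(io.StringIO(text)))
    return out.getvalue()


@dataclass
class VersionedObject:
    """Holds every version as a (format, content) pair; nothing is overwritten."""
    versions: list = field(default_factory=list)

    def migrate(self) -> None:
        _, content = self.versions[-1]  # assumes the latest version is CSV
        before = characterize(content, ",")
        migrated = migrate_csv_to_tsv(content)
        after = characterize(migrated, "\t")
        if before != after:
            raise ValueError("migration changed the characterized content")
        self.versions.append(("tsv", migrated))
```

The new version is appended only when the two characterizations agree, and the original version stays in place, so a faulty migration can always be redone from the source.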
While emulation has become a more viable technique, it is important to understand the technological dependencies of the component parts of the workflow underlying the use of a particular resource. Emulation may become difficult if various workflow components require multiple levels of emulation support.
Preservation action plans for mitigating preservation risks will be developed ahead of the need for their application. Protection against obsolescence requires an understanding of the technological dependencies underlying that use. While some resources are easily manipulated using relatively standard and long-lived desktop tools, others require highly specialized applications and complex workflows.
DataONE will rely on existing notification services, such as AONS II (Pearson 2007), CRiB (Ferreira 2006), and PLATO (http://www.ifs.tuwien.ac.at/dp/plato/intro.html). These services themselves depend on other external technical registries such as PRONOM. Existing coverage by these services may be limited to formats and tools geared towards general applicability in the cultural heritage realm. DataONE will encourage community effort to expand the scope of these services to understand technical components specific to scientific disciplines. It is preferable to enhance these existing frameworks and services so that obsolescence detection can take place centrally or in a consistent federated manner, rather than in an ad hoc and parallel manner.
Responses to obsolescence should be planned in advance of need and captured in action plans (cf. the Planets template at http://www.ifs.tuwien.ac.at/dp/plato/docs/plan-template.pdf and the FCLA template at http://www.fcla.edu/digitalArchive/formatInfo.htm).
If the DataONE federated network and its member nodes were to disappear, that would amount to total data loss.
By design, the DataONE network provides resiliency against the occasional loss of nodes. While departure of a MN (or even a CN) from DataONE should not be frequent, it is also not an unexpected occurrence. It is a feature of networks that they can sustain such events by redistributing the assets and workload among the remaining nodes. The arrival of a new node will be less disruptive. The software infrastructure of the DataONE network, including the Investigator Toolkit and the cyberinfrastructure protocol stacks, is open-source to help it outlive the end of DataONE funding; open-source community ownership improves not only buy-in and adoption, but also long-term external support for the DataONE network.
The Sustainability and Governance Working Group is investigating a range of issues in protecting the DataONE organization. These include CN and MN succession planning, an analysis of the costs of preservation, the possibility of services that offer accreditation for repositories, realtime monitoring, and external auditing.
Risks to the DataONE federated network are different from risks to individual nodes. Some risks are reduced by the redundancy and geographic distribution that the network provides. As for malicious threats that might increase due to federation, these are addressed by the authentication and authorization strategies that DataONE is developing with the participation of TeraGrid security experts. It is a core requirement of MNs that they conform to DataONE authentication and authorization policies and protocols.
Not every data repository can qualify as a DataONE MN. Guidelines call for organizations to be on a sound technical and financial footing and to adhere to important standards. The DataONE network is in some respects only as secure as its weakest link, so local Information Technology (IT) standards at the MNs are critical.
MNs conform to IT management practices typically found in federal agencies and higher education, which in turn are based on Payment Card Industry data security standards (PCI DSS) and the widely adopted ITIL (IT Infrastructure Library) best practices for such things as physical protection, electronic perimeter control (firewalls), account management, and event logging for forensic analysis. Adopters include financial service organizations, and large technology, pharmaceutical, and manufacturing companies.
These standards call for common-sense practices such as periodic “media refresh”, which is the copying of data from older to newer physical devices and media with the aim of avoiding errors due to media degradation and vendor de-support.