DataONE Preservation Strategy

Note

This document is a literal conversion of the DataONE Preservation Strategy document developed by the “Preservation and Metadata Working Group” as an outcome of the December 2010 meeting held in Chicago. The original document may be retrieved from the DataONE document library.

Summary

To meet the objective of “easy, secure, and persistent storage of data”, DataONE adopts a simple three-tiered approach.

  1. Keep the bits safe. Retaining the actual bits that comprise the data is paramount, as all other preservation and access questions are moot if the bits are lost. Key sub-strategies for this tier are (a) persistent identification, (b) replication of data and metadata, (c) periodic verification that stored content remains uncorrupted, and (d) reliance on member nodes to adhere to DataONE protocols and guidelines consistent with widely adopted public and private sector standards for IT infrastructure management.
  2. Protect the form, meaning, and behavior of the bits. Assuming the bits are kept undamaged, users must also be able to make sense of them into the future, so protecting their form, meaning, and behavior is critical. In this tier we rely on collecting characterization metadata, encouraging use of standardized formats, and securing legal rights appropriate to long-term archival management, all of which supports future access and, as needed to preserve meaning and behavior, format migration and emulation.
  3. Safeguard the guardians. If the DataONE organization and its member nodes were to disappear, that would be equivalent to 100 percent data loss. The DataONE network itself provides resiliency against the occasional loss of member nodes, and this will be shored up by succession planning, ongoing investigations into preservation cost models, and open-source software tools that can sustained by external developer communities.

Preservation Objectives

Fundamentally, DataONE’s preservation goal is to protect the content, meaning, and behavior of data sets registered in its global network of heterogeneous data repositories. This a complex undertaking that warrants a layered, prioritized approach. To get started on a solid footing, our first objective was to build a platform that immediately provides a significant degree of preservation assurance and makes it easy to add more sophisticated preservation function over time. Initially, DataONE will focus on preventing loss due to non-malicious causes, such as,

  • Technological obsolescence (e.g., loss of support for rendering software and hardware),
  • Accidental loss (human error, natural disaster, etc.), and
  • Financial instability (loss of funding).

While malicious threats do exist, many of them are addressed as a side-effect of DataONE protocols and information technology (IT) management standards in place at member nodes (MNs). By design, DataONE protocols limit the ability of any MN or Coordinating Node (CN) to directly alter content on another node, which in turn limits the havoc that an intruder could wreak. Moreover, MN guidelines call for the same strong local IT management standards that are widespread in financial services, manufacturing, and large technology organizations, and these are typically already in force at well-established data repositories.

An ancillary objective is to help inform an overall NSF DataNet preservation strategy, and to that end this strategy was prepared at a DataONE workshop (Chicago, 2010) with direct input from the Data Conservancy project.

Three DataONE preservation tiers

The initial DataONE approach to preservation is described in the following three sections.

1. Keep the bits safe

Retaining the actual bits comprising the data and metadata is paramount, as all other preservation and access questions are moot if the bits are damaged or lost. The direct role played by MNs in bit-level preservation is addressed in the third section describing organizational sustainability.

Identify data persistently

Persistent identifiers (PIDs) are required for stable reference to all content stored in DataONE. Without them, reliable data citation and long-term access would not be possible. Because there are many legacy identifiers to be accommodated, some of them dating from before the advent of the world-wide web, DataONE uses PIDs from a variety of schemes and support systems, such as purl.org and handle.net. Remaining agnostic about identifier syntax, DataONE will also rely on scheme-agnostic support systems such as n2t.net/ezid, which can deal with ARKs, DOIs, and traditionally non-actionable identifiers such as PMIDs (PubMed Identifiers).

Make lots of copies

To protect against the possible loss of a MN, or a bit-level failure at a single MN,DataONE replicates both data and metadata. Two replicas of the raw bits representing each dataset are created upon registration of a dataset by a MN; the two replicas and the original dataset held at the MN result in a total of three instances. The instances are kept at three different MNs, which creates safety through copies that are “de-correlated” by geographic, administrative, and financial domain. In this way, the instances are not vulnerable to the same power failure, same earthquake, same funding loss, etc. As for metadata, three replicas of all metadata are created and held at the CNs, resulting in a total of four de-correlated metadata instances.

Depending on its data replication policy, a MN may fall into one of several classes:

  1. Read-Only: MNs that are unwilling or unable to hold replicas (whether such MNs are admitted by DataONE is up to the Governance Working Group).
  2. Replication-Open: MNs that are willing to accept whatever content a CN tells them (up to capacity). These nodes will try to honor any access control rules that are applied to the data objects they replicate. If a MN is unable to honor a given access rule, the associated data object will be “darkived”, namely, stored within a dark archive that is not accessible to general users.
  3. Replication-Only: MNs that are set up by DataONE specifically to provide replication services and that do not provide original content of their own. They are designed for capacity and capability, and are willing to accept whatever content a CN tells them to replicate (up to capacity). They are able to honor any access restriction rules defined within the DataONE system.

Replication will be triggered automatically by content registration. While it is desirable to maintain three instances of each data object, over time there may arise practical limits to replication due to a number of changeable factors:

  • MN capacities and/or MN capabilities
  • willingness of MNs to be targets of replication (to hold copies)
  • willingness of MNs to permit outbound copies
  • perceived value of data relative to its size
  • changes in access control restrictions

Refresh and verify the copies

MN guidelines also call for the common-sense and usual practice of periodic “media refresh”, which is the copying of data from old physical recording devices to new physical recording devices to avoid errors due to media degradation and vendor de-support.

Damage or corruption in those copies is detected by periodically re-computing checksums (e.g., SHA-256 digests) for randomly selected datasets and comparing them with checksums securely stored at the CNs. Any bit-level changes detected can be repaired by copying from an unchanged copy. This kind of “pop quiz” cannot be cheated by simply reporting back a previously computed checksum; the actual MN replica data is requested and the checksum recomputed. It is appropriate that this entails sampling only a subset of the data as it is not computationally feasible to keep the MNs and CNs constantly busy exhaustively checking the amount of content that DataONE anticipates holding.

2. Protect the form, meaning, and behavior of the bits

Digital content has three related aspects that must be considered when planning and performing preservation functions: content has a specific digital form; that form encapsulates a given abstract meaning; and that meaning is recovered for use through appropriate behaviors applied against the form. Preservation of form ensures that the low-level structure of the content is preserved; preservation of meaning ensures that the semantics of the content are recoverable, at least in theory; and preservation of behavior ensures that the semantics are recoverable in practice.

For example, consider a dataset of environmental samples. At the structural level these numeric data are organized in a tabular fashion. But their full meaning is only recoverable by knowing the variables and units of measure associated with each column in the table. If the data are represented in binary, rather than textual form, then use of the data also depends on an appropriate software application that can expose the information in a directly human-useable form. DataONE metadata standards should incorporate schemas to document and describe data in terms enabling preservation of form, meaning, and behavior.

Preservation activities in this tier fall into four conceptual categories:

  • Know your rights
  • Know what you have and share that knowledge
  • Cope with obsolescence
  • Watch the copies, yourself, and the world

Know your rights

Ultimately, having the bits and their meaning is useless if one doesn’t also have the legal right (a) to hold the data, (b) to make copies and derivatives in performance of preservation management (such as replication and migration), and (c) to transfer those same rights to a successor archive. Just as important is to know specifically who owns the original data and whether those rights have been granted.

As a start we strongly encourage providers to assign “Creative Commons Zero” (CC0) licenses to all contributed data, which facilitates preservation and does not prevent data providers from requiring an attribution statement as a condition of re-use.

Know what you have and share that knowledge

Understanding the content being preserved in the DataONE network is a function of metadata that describes various aspects of the content. This metadata comes from a variety of sources, including automated characterization of content at the point of acquisition or submission to the DataONE network, and direct contribution from content creators, curators, and consumers. Characterization is important in order to determine the significant properties of digital resources that must be preserved over time, and for purposes of classification, since many preservation actions will take place in an automated fashion on groups of similar resources. DataONE will encourage use of data formats that are open, transparent, widely used, and non-encrypted – formats that are more inherently amenable to long-term preservation.

Use of automated characterization tools, such as DROID (http://droid.sourceforge.net/) and JHOVE2 (https://bitbucket.org/jhove2/main/wiki/Home) will be strongly recommended of data providers. While most existing characterization tools focus on general formats for cultural heritage content, the DataONE community can contribute to development that will result in increased coverage of scientific formats.

Rich preservation metadata will be managed at varying levels of granularity, including individual files, aggregations of files that collectively form a single coherent resource, and meaningful subsets of files. In addition to describing primary data sets, metadata is also necessary for associated toolkit workflows, which are themselves first-class data objects needing DataONE preservation. In some cases, characterization metadata will consist of references to information managed in external technical registries such as UDFR (http://udfr.org/) and PRONOM (http://www.nationalarchives.gov.uk/PRONOM/Default.aspx). DataONE is fortunate to have partnered with the primary maintainers of JHOVE2 and UDFR.

Preservation metadata will be expressed and managed within the DataONE network using a variety of schemas in the science metadata. In addition to providing a means for accurate description and citation of resources managed by DataONE, this information also will be exposed for search and retrieval by external indexing, search, and notification services.

Cope with obsolescence

Migration and emulation are sub-strategies that DataONE will use in the event that formats become obsolete. At some time in the future, one may expect that available contemporary hardware and software will be unable to render or otherwise use bits saved in some formats.

Migration is used to convert from older to newer formats; all converted content is subject to “before” and “after” characterization to ensure semantic invariance. Emulation effectively preserves older computing environments in order to retain the experience of rendering older formats; once considered a specialized intervention, emulation has become a more viable technique with recent developments in consumer and enterprise server virtualization solutions.

Migration workflows need only be available on a subset of DataONE member nodes, which can function as service utilities to the greater network. A successful migration strategy requires versioning of the content, where all versions are retained. The versioning of managed content that results from migration will be reflected in that content’s system metadata. All migrated content will be subject to “before” and “after” characterization to ensure the semantic invariance of the transformation.

While emulation has become a more viable technique, it is important to understand the technological dependencies of the component parts of the workflow underlying the use of a particular resource. Emulation may become difficult if various workflow components require multiple levels of emulation support.

Watch the copies, yourself, and the world

Preservation action plans for mitigating preservation risks will be developed ahead of the need for their application. Protection against obsolescence requires an understanding of the technological dependencies underlying that use. While some resources are easily manipulated using relatively standard and long-lived desktop tools, others require highly specialized applications and complex workflows.

DataONE will rely on existing notification services, such as AONS II (Pearson 2007), CRiB (Ferreira 2006), and PLATO (http://www.ifs.tuwien.ac.at/dp/plato/intro.html). These services themselves depend on other external technical registries such as PRONOM. Existing coverage by these services may be limited to formats and tools geared towards general applicability in the cultural heritage realm. DataONE will encourage community effort to expand the scope of these services to understand technical components specific to scientific disciplines. It is preferable to enhance these existing frameworks and services so that obsolescence detection can take place centrally or in a consistent federated manner, rather than in an ad hoc and parallel manner.

Responses to obsolescence should be planned in advance of need and captured in action plans (cf. the Planets template at http://www.ifs.tuwien.ac.at/dp/plato/docs/plan-template.pdf and the FCLA template at http://www.fcla.edu/digitalArchive/formatInfo.htm).

3. Safeguard the guardians

If the DataONE federated network and its member nodes were to disappear, that would amount to total data loss.

Safeguard the federation

By design, the DataONE network provides resiliency against the occasional loss of nodes. While departure of a MN (or even a CN) from DataONE should not be frequent, it is also not an unexpected occurrence. It is a feature of networks that they can sustain such events by redistributing the assets and workload among the remaining nodes. The arrival of a new node will be less disruptive. The software infrastructure of the DataONE network, including the Investigator Toolkit the cyberinfrastructure protocol stacks, are open-source in order to help it have a life beyond the end of DataONE funding; open-source community ownership improves not only buy-in and adoption, but also long-term external support for the DataONE network.

The Sustainability and Governance Working Group is investigating a range of issues in protecting the DataONE organization. These include CN and MN succession planning, an analysis of the costs of preservation, the possibility of services that offer accreditation for repositories, realtime monitoring, and external auditing.

Safeguard the member nodes

Risks to the DataONE federated network are different from risks to individual nodes. Some risks are reduced by the redundancy and geographic distribution that the network provides. As for malicious threats that might increase due to federation, these are addressed by the authentication and authorization strategies that DataONE is developing with participation of Teragrid security experts. It is a core requirement of MNs that they conform to DataONE authentication and authorization policies and protocols.

Not any data repository can qualify as a DataONE MN. Guidelines call for organizations to be on a sound technical and financial footing and to adhere to important standards. The DataONE network is in some respects only as secure as its weakest link, so local Information Technology (IT) standards at the MNs are critical.

MNs conform to IT management practices typically found in federal agencies and higher education, which in turn are based on Payment Card Industry data security standards (PCI DSS) and the widely adopted ITIL (IT Infrastructure Library) best practices for such things as physical protection, electronic perimeter control (firewalls), account management, and event logging for forensic analysis. Adopters include financial service organizations, and large technology, pharmaceutical, and manufacturing companies.

These standards call for common-sense practices such as periodic “media refresh”, which is the copying of data from older to newer physical devices and media with the aim of avoiding errors due to media degradation and vendor de-support.