Status: Early Draft / notes
DataONE requires storage, search, and retrieval of information (data and metadata) from a wide variety of data services (e.g. Mercury, Metacat, and OpenDAP). These systems have different data service interfaces, support different metadata standards, and implement different query mechanisms and syntaxes. Data must be replicated between service instances (Member Nodes, MNs), and metadata must be replicated between all nodes (Coordinating Nodes, CNs, and Member Nodes). Replication ensures that multiple copies exist, which avoids data loss if a node fails and improves access through geographic proximity.
A few general approaches to the problem include:
In this approach, translations are implemented between all metadata formats and data service interfaces. Metadata is translated to the native metadata format supported by an MN (or, where multiple formats are supported, to the most appropriate one) and stored using the native API of the service. A common API integrates all MNs and provides the basic operations necessary for managing and retrieving content. Perhaps the most difficult part of this approach is translating metadata to the format supported internally by the service.
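The translation step can be sketched as a registry of translators keyed by source and target format, consulted whenever a document arrives in a format other than the MN's native one. This is a minimal sketch, not a DataONE API: the names `TRANSLATORS`, `register`, `MemberNode.create`, and `MemberNode.get`, and the field names inside the toy documents, are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: (source_format, target_format) -> translator function.
Translator = Callable[[dict], dict]
TRANSLATORS: Dict[Tuple[str, str], Translator] = {}


def register(src: str, dst: str) -> Callable[[Translator], Translator]:
    """Decorator that records a translator for one (src, dst) format pair."""
    def wrap(fn: Translator) -> Translator:
        TRANSLATORS[(src, dst)] = fn
        return fn
    return wrap


@register("EML", "FGDC")
def eml_to_fgdc(doc: dict) -> dict:
    # Illustrative field mapping only; the keys are invented for this sketch
    # and a real crosswalk between these standards is far more involved.
    return {"title": doc["dataset_title"], "origin": doc["creator"]}


class MemberNode:
    """Toy MN that stores metadata only in its native format (assumption)."""

    def __init__(self, name: str, native_format: str) -> None:
        self.name = name
        self.native_format = native_format
        self._store: Dict[str, dict] = {}

    def create(self, pid: str, fmt: str, doc: dict) -> None:
        # Translate to the native format before storing, if needed.
        if fmt != self.native_format:
            key = (fmt, self.native_format)
            if key not in TRANSLATORS:
                raise ValueError(f"no translator for {fmt} -> {self.native_format}")
            doc = TRANSLATORS[key](doc)
        self._store[pid] = doc

    def get(self, pid: str) -> dict:
        return self._store[pid]
```

In this sketch, a node whose native format is FGDC accepts an EML-like document through `create("doc1", "EML", ...)` and stores the translated form. A format pair with no registered translator is rejected with `ValueError`.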
Problems:
Advantages:
Implement a common service API on all nodes that treats data and metadata as discrete units that can be read from and written to any node. The set of all nodes then becomes a large storage device. The CNs implement the processes that distribute content between all nodes (like a file system driver) to provide basic system-level functionality. The actual metadata documents are opaque to the underlying storage system.
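The "large storage device" idea can be sketched with objects held as opaque bytes and a CN process that copies them between nodes. This is a minimal in-memory sketch under stated assumptions: the class and method names (`MemberNode.create`, `get`, `has`, `CoordinatingNode.replicate`) and the `copies` parameter are hypothetical, not the DataONE service API.

```python
from typing import Dict, List


class MemberNode:
    """Toy MN: stores objects as opaque bytes keyed by PID (assumption)."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self._objects: Dict[str, bytes] = {}

    def create(self, pid: str, content: bytes) -> None:
        if pid in self._objects:
            raise KeyError(f"{pid} already exists on {self.node_id}")
        self._objects[pid] = content

    def get(self, pid: str) -> bytes:
        return self._objects[pid]

    def has(self, pid: str) -> bool:
        return pid in self._objects


class CoordinatingNode:
    """Toy CN: copies objects between MNs until enough replicas exist."""

    def __init__(self, member_nodes: List[MemberNode], copies: int = 2) -> None:
        self.member_nodes = member_nodes
        self.copies = copies  # hypothetical replication target

    def replicate(self, pid: str, source: MemberNode) -> List[str]:
        # The CN never inspects the content; it only moves bytes around.
        content = source.get(pid)
        holders = [mn for mn in self.member_nodes if mn.has(pid)]
        for mn in self.member_nodes:
            if len(holders) >= self.copies:
                break
            if not mn.has(pid):
                mn.create(pid, content)
                holders.append(mn)
        return [mn.node_id for mn in holders]
```

Because the storage layer only sees bytes, the same `create` and `get` calls serve data objects and metadata documents alike. Anything that needs to understand a document's content has to live in a separate layer.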
Metadata is not searched directly but is indexed by extracting content that matches semantically equivalent search terms. A trivial example is the use of the Dublin Core (DC) terms to search across all types of metadata. In this case, a "Dublin Core metadata extractor" extracts term values from a metadata document and updates an index of DC fields with those values and the document PID. Searches on the index return document PIDs, and the documents are then retrieved using the MN API.
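The extract-and-index flow described above can be sketched as follows. This is a minimal sketch under stated assumptions: the format names, the element paths in `DC_PATHS`, and the `DCIndex` class are hypothetical, the XML is a simplified namespace-free stand-in for real metadata, and matching is exact (case-insensitive) rather than full-text.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List

# Hypothetical mapping per metadata format: DC term -> element path
# relative to the document root. Real schemas use namespaces and
# richer structures than shown here.
DC_PATHS: Dict[str, Dict[str, str]] = {
    "eml-like": {
        "title": "dataset/title",
        "creator": "dataset/creator/surName",
    },
    "fgdc-like": {
        "title": "idinfo/citation/citeinfo/title",
        "creator": "idinfo/citation/citeinfo/origin",
    },
}


def extract_dc(fmt: str, xml_text: str) -> Dict[str, List[str]]:
    """Extract Dublin Core term values from one metadata document."""
    root = ET.fromstring(xml_text)
    terms: Dict[str, List[str]] = {}
    for term, path in DC_PATHS[fmt].items():
        values = [el.text.strip() for el in root.findall(path) if el.text]
        if values:
            terms[term] = values
    return terms


class DCIndex:
    """Toy index: DC term -> normalized value -> set of PIDs."""

    def __init__(self) -> None:
        self._index = defaultdict(lambda: defaultdict(set))

    def update(self, pid: str, terms: Dict[str, List[str]]) -> None:
        for term, values in terms.items():
            for value in values:
                self._index[term][value.lower()].add(pid)

    def search(self, term: str, value: str) -> List[str]:
        # Returns PIDs only; documents are fetched separately from an MN.
        return sorted(self._index[term].get(value.lower(), set()))
```

A search for a creator then returns the PIDs of matching documents regardless of their original metadata format. The caller would pass each PID to the MN retrieval call to obtain the document itself.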
Problems:
Advantages:
This approach is similar to the indexing approach, but in addition to the lowest common denominator format, objects may make more detailed metadata and data available by advertising that they exhibit specific content models. These content models may be dictated by the central DataONE community or agreed upon by a small group of Member Nodes.
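One way to picture content-model advertisement is an object description that always carries the common-denominator fields and optionally lists the content models it exhibits. This is a rough sketch: the `ObjectDescription` type, the `find_by_model` helper, and the model identifier `example:timeseries` are hypothetical and not part of any DataONE specification.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass
class ObjectDescription:
    pid: str
    # Lowest common denominator metadata, available for every object.
    dc: Dict[str, str]
    # Content models this object advertises (may be empty).
    content_models: FrozenSet[str] = frozenset()


def find_by_model(objects: List[ObjectDescription], model: str) -> List[str]:
    """Return PIDs of objects that advertise the given content model."""
    return [obj.pid for obj in objects if model in obj.content_models]


catalog = [
    ObjectDescription("pid:1", {"title": "Lake Temperatures"}),
    ObjectDescription(
        "pid:2",
        {"title": "Stream Gauge Readings"},
        frozenset({"example:timeseries"}),
    ),
]

# Every object is discoverable through the common fields; only some
# expose the richer, model-specific view.
timeseries_pids = find_by_model(catalog, "example:timeseries")
```

A client that understands a given model can ask for the objects advertising it and use the richer view. Other clients fall back to the common fields, which every object provides.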
Problems:
Advantages: