Revisions: |
|
---|
This document describes the generation of system metadata associated with a data object and its associated science metadata document as they are added to a DataONE Member Node and subsequently propagated through the DataONE infrastructure.
There are two scenarios examined here: A) the system metadata is stored on all nodes (MN and CN) and B) system metadata is stored only on the CNs. The general sequence of operations is the same for both scenarios with the exclusion of a couple of steps from scenario B.
Scientist using an ITK tool generates some science data and some science metadata describing the science data.
The ITK populates the required fields of the system metadata documents.
Not necessarily in this order:
On receipt, the MN updates the system metadata fields for which it is responsible.
The CN detects new content is available on the MN using the listObjects() method.
Copies of the science metadata and the two (incomplete) system metadata documents are retrieved by the CN.
A: | The CN updates the system metadata replication information, and indicates to the MN that it should update its copy of the system metadata objects. |
---|---|
B: | The CN updates the system metadata replication information. |
The CN issues a replication order to an MN.
A: | The recipient MN retrieves copies of the system metadata and associated objects (science data and metadata) |
---|---|
B: | The recipient MN retrieves copies of the various objects indicated by the CN |
The recipient MN requests content from the origin MN, which in turn verifies the request with the CN.
A: | The origin MN and the CN update copies of system metadata |
---|---|
B: | The CN updates system metadata |
After verifying the successful transfer of content, the recipient MN indicates completion to the CN.
A: | The CN updates the system metadata for the objects and ensures that the MN copies of the system metadata are accurate. |
---|---|
B: | The CN updates the system metadata for the objects. |
The change in system metadata state is shown in the following process detail sections. In each case, the identifiers used for the various elements are indicated in table 1.
Identifier | Description |
---|---|
sciD.1 | Science Data |
sciM.1 | Science metadata describing sciD.1 |
sysD.1 | System metadata for object sciD.1 |
sysM.1 | System metadata for object sciM.1 |
cn1.dataone.org | A Coordinating Node |
mn1.dataone.org | Member Node that is initial recipient of data |
mn2.dataone.org | Member Node that receives a copy of the objects |
The following fictitious script provides an example of how uploading content to a Member Node using a DataONE client library might occur.
import sys
import logging
from pyd1.client import *
from pyd1.exceptions import *
d1client = pyd1.client(identity="/home/vieglais/.d1/credentials",
defaults="/home/vieglais/.d1/defaults.ini")
d1client.login()
# The science data and metadata as files on disk
data_file = "/home/vieglais/data/mydata.csv"
meta_file = "/home/vieglais/data/mymeta.xml"
# Reserve a couple of identifiers. Note that this reservation will
# be system-wide.
try:
data_id = d1client.reserve(u"sysD.1")
meta_id = d1client.reserve(u"sysM.1")
except IdentifierInUseException, e:
logging.fatal(repr(e))
sys.exit()
# Generate stub system metadata documents. Default values
# for access rules etc are loaded from defaults.ini
sysm_data = d1client.generateSysMeta(id=data_id,
format="text/csv",
describedBy=meta_id)
sysm_meta = d1client.generateSysMeta(id=meta_id,
format="eml://ecoinformatics.org/eml-2.1.0",
describedBy=data_id)
# Open file handles for data and metadata
fdata = file(data_file, "r")
fmetadata = file(meta_file, "r")
# Send the data to the Member Node
target_node = 'mn1.dataone.org'
data_sysid = d1client.create(data_id, fdata, sys_data,
target=target_node)
meta_sysid = d1client.create(meta_id, fmeta, sys_meta,
target=target_node)
fdata.close()
fmeta.close()
The scientist generates a data set and associated science metadata.
Using tools in the ITK, the scientist generates two initial system metadata documents, one describing the science metadata (sysm_M.1), one describing the science data (sysm_D.1).
Field | sysD.1 | sysM.1 |
---|---|---|
identifier | sciD.1 | sciM.1 |
objectFormat | text/csv | eml://ecoinformatics.org/eml-2.1.0 |
size | 1450238 | 8132 |
submitter | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org |
rightsHolder | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org |
obsoletes | ||
obsoletedBy | ||
derivedFrom | ||
describes | sciD.1 | |
describedBy | sciM.1 | |
checksum | 2e01e17467891f7c933dbaa00e1459d23db3fe4f | 9967467891f7c933dbaa00e1459d23db3f342 |
checksumAlgorithm | SHA-1 | SHA-1 |
accessRule[1] | ||
.rule | Allow | Allow |
.service | Read | Read |
.principal | * | * |
accessRule[2] | ||
.rule | Allow | Allow |
.service | Write | Write |
.principal | o=NCEAS,dc=ecoinformatics,dc=org | o=NCEAS,dc=ecoinformatics,dc=org |
replicationPolicy | ||
.preferredMemberNode | mn2.dataone.org | mn2.dataone.org |
.blockedMemberNode | ||
.replicationAllowed | true | true |
.numberReplicas | 1 | 1 |
dateUploaded | ||
dateSysMetadataModified | ||
originMemberNode | ||
authoritativeMemberNode | ||
replica[1] | ||
.replicaMemberNode | ||
.replicationStatus | ||
.replicaVerified |
The client tool transmits the data, science metadata and two stub system metadata documents to a Member Node. The Member Node updates values of the system metadata for which it is responsible.
Field | sysD.1 | sysM.1 |
---|---|---|
identifier | sciD.1 | sciM.1 |
objectFormat | text/csv | eml://ecoinformatics.org/eml-2.1.0 |
size | 1450238 | 8132 |
submitter | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org |
rightsHolder | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org |
obsoletes | ||
obsoletedBy | ||
derivedFrom | ||
describes | sciD.1 | |
describedBy | sciM.1 | |
checksum | 2e01e17467891f7c933dbaa00e1459d23db3fe4f | 9967467891f7c933dbaa00e1459d23db3f342 |
checksumAlgorithm | SHA-1 | SHA-1 |
accessRule[1] | ||
.rule | Allow | Allow |
.service | Read | Read |
.principal | * | * |
accessRule[2] | ||
.rule | Allow | Allow |
.service | Write | Write |
.principal | o=NCEAS,dc=ecoinformatics,dc=org | o=NCEAS,dc=ecoinformatics,dc=org |
replicationPolicy | ||
.preferredMemberNode | mn2.dataone.org | mn2.dataone.org |
.blockedMemberNode | ||
.replicationAllowed | true | true |
.numberReplicas | 1 | 1 |
dateUploaded | 2010-03-04T18:13:51.0Z | 2010-03-04T18:14:45.0Z |
dateSysMetadataModified | 2010-03-04T18:13:51.0Z | 2010-03-04T18:14:45.0Z |
originMemberNode | mn1.dataone.org | mn1.dataone.org |
authoritativeMemberNode | mn1.dataone.org | mn1.dataone.org |
replica[1] | ||
.replicaMemberNode | mn1.dataone.org | mn1.dataone.org |
.replicationStatus | Queued | Queued |
.replicaVerified |
Since there are many copies of system metadata from this point onwards, only the replication properties of the sysD.1 document are considered, summarized in a table as follows. MN1, CN, MN2 = copy of system metadata on mn1.dataone.org, cn1.dataone.org, mn2.dataone.org respectively.
Property | MN1 | CN | MN2 |
---|---|---|---|
R[1].replicaMemberNode | mn1.dataone.org | ||
R[1].replicationStatus | Queued | ||
R[1].replicaVerified |
A CN detects new content available on mn1.dataone.org through the listObjects() method.
There is no change of state in system metadata resulting from this operation.
Copies of sciM.1, sysM.1, and sysD.1 are retrieved to the CN, and the system metadata objects are updated. There are three steps to this process:
Property | MN1 | CN | MN2 |
---|---|---|---|
R[1].replicaMemberNode | mn1.dataone.org | ||
R[1].replicationStatus | Requested | ||
R[1].replicaVerified |
Property | MN1 | CN | MN2 |
---|---|---|---|
R[1].replicaMemberNode | mn1.dataone.org | mn1.dataone.org | |
R[1].replicationStatus | Requested | Requested | |
R[1].replicaVerified |
Property | MN1 | CN | MN2 |
---|---|---|---|
R[1].replicaMemberNode | mn1.dataone.org | mn1.dataone.org | |
R[1].replicationStatus | Completed | Completed | |
R[1].replicaVerified | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:51.0Z |
The CN initiates replication of content from mn1.dataone.org to mn2.dataone.org.
Property | MN1 | CN | MN2 |
---|---|---|---|
R[1].replicaMemberNode | mn1.dataone.org | mn1.dataone.org | |
R[1].replicationStatus | Completed | Completed | |
R[1].replicaVerified | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:51.0Z | |
R[2].replicaMemberNode | mn2.dataone.org | ||
R[2].replicationStatus | Queued | ||
R[2].replicaVerified |
Property | MN1 | CN | MN2 |
---|---|---|---|
R[1].replicaMemberNode | mn1.dataone.org | mn1.dataone.org | mn1.dataone.org |
R[1].replicationStatus | Completed | Completed | Completed |
R[1].replicaVerified | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:51.0Z |
R[2].replicaMemberNode | mn2.dataone.org | mn2.dataone.org | mn2.dataone.org |
R[2].replicationStatus | Requested | Queued | Requested |
R[2].replicaVerified |
Property | MN1 | CN | MN2 |
---|---|---|---|
R[1].replicaMemberNode | mn1.dataone.org | mn1.dataone.org | mn1.dataone.org |
R[1].replicationStatus | Completed | Completed | Completed |
R[1].replicaVerified | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:51.0Z |
R[2].replicaMemberNode | mn2.dataone.org | mn2.dataone.org | mn2.dataone.org |
R[2].replicationStatus | Requested | Requested | Requested |
R[2].replicaVerified |
Property | MN1 | CN | MN2 |
---|---|---|---|
R[1].replicaMemberNode | mn1.dataone.org | mn1.dataone.org | mn1.dataone.org |
R[1].replicationStatus | Completed | Completed | Completed |
R[1].replicaVerified | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:51.0Z |
R[2].replicaMemberNode | mn2.dataone.org | mn2.dataone.org | mn2.dataone.org |
R[2].replicationStatus | Completed | Requested | Completed |
R[2].replicaVerified | 2010-03-04T20:00:00.0Z | 2010-03-04T20:00:00.0Z |
MN2 informs the Coordinating Node that it has retrieved a valid copy of the object from MN1, forwarding the time stamp for when the verification was made.
Property | MN1 | CN | MN2 |
---|---|---|---|
R[1].replicaMemberNode | mn1.dataone.org | mn1.dataone.org | mn1.dataone.org |
R[1].replicationStatus | Completed | Completed | Completed |
R[1].replicaVerified | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:51.0Z |
R[2].replicaMemberNode | mn2.dataone.org | mn2.dataone.org | mn2.dataone.org |
R[2].replicationStatus | Completed | Completed | Completed |
R[2].replicaVerified | 2010-03-04T20:00:00.0Z | 2010-03-04T20:00:00.0Z | 2010-03-04T20:00:00.0Z |
This scenario uses the notion that system metadata is stored on the CNs only.
The scientist generates a data set and associated science metadata.
Using tools in the ITK, the scientist generates two initial system metadata documents, one describing the science metadata (sysm_M.1), one describing the science data (sysm_D.1).
Field | sysD.1 | sysM.1 |
---|---|---|
identifier | sciD.1 | sciM.1 |
objectFormat | text/csv | eml://ecoinformatics.org/eml-2.1.0 |
size | 1450238 | 8132 |
submitter | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org |
rightsHolder | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org |
obsoletes | ||
obsoletedBy | ||
derivedFrom | ||
describes | sciD.1 | |
describedBy | sciM.1 | |
checksum | 2e01e17467891f7c933dbaa00e1459d23db3fe4f | 9967467891f7c933dbaa00e1459d23db3f342 |
checksumAlgorithm | SHA-1 | SHA-1 |
accessRule[1] | ||
.rule | Allow | Allow |
.service | Read | Read |
.principal | * | * |
accessRule[2] | ||
.rule | Allow | Allow |
.service | Write | Write |
.principal | o=NCEAS,dc=ecoinformatics,dc=org | o=NCEAS,dc=ecoinformatics,dc=org |
replicationPolicy | ||
.preferredMemberNode | mn2.dataone.org | mn2.dataone.org |
.blockedMemberNode | ||
.replicationAllowed | true | true |
.numberReplicas | 1 | 1 |
dateUploaded | ||
dateSysMetadataModified | ||
originMemberNode | ||
authoritativeMemberNode | ||
replica[1] | ||
.replicaMemberNode | ||
.replicationStatus | ||
.replicaVerified |
The client tool transmits the data, science metadata and two stub system metadata documents to a Member Node (mn1.dataone.org). On receipt, the Member Node sets the values of several fields as indicted in the table below.
Field | sysD.1 | sysM.1 |
---|---|---|
identifier | sciD.1 | sciM.1 |
objectFormat | text/csv | eml://ecoinformatics.org/eml-2.1.0 |
size | 1450238 | 8132 |
submitter | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org |
rightsHolder | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org |
obsoletes | ||
obsoletedBy | ||
derivedFrom | ||
describes | sciD.1 | |
describedBy | sciM.1 | |
checksum | 2e01e17467891f7c933dbaa00e1459d23db3fe4f | 9967467891f7c933dbaa00e1459d23db3f342 |
checksumAlgorithm | SHA-1 | SHA-1 |
accessRule[1] | ||
.ruleType | Allow | Allow |
.service | Read | Read |
.principal | * | * |
accessRule[2] | ||
.ruleType | Allow | Allow |
.service | Write | Write |
.principal | o=NCEAS,dc=ecoinformatics,dc=org | o=NCEAS,dc=ecoinformatics,dc=org |
replicationPolicy | ||
.preferredMemberNode | mn2.dataone.org | mn2.dataone.org |
.blockedMemberNode | ||
.replicationAllowed | true | true |
.numberReplicas | 1 | 1 |
dateUploaded | 2010-03-04T18:13:51.0Z | 2010-03-04T18:14:45.0Z |
dateSysMetadataModified | 2010-03-04T18:13:51.0Z | 2010-03-04T18:14:45.0Z |
originMemberNode | mn1.dataone.org | mn1.dataone.org |
authoritativeMemberNode | mn1.dataone.org | mn1.dataone.org |
replica[1] | ||
.replicaMemberNode | mn1.dataone.org | mn1.dataone.org |
.replicationStatus | Queued | Queued |
.replicaVerified |
A CN detects new content available on mn1.dataone.org through MN_replication.listObjects().
There is no change of state in system metadata resulting from this operation.
Copies of sciM.1, sysM.1, and sysD.1 are retrieved to the Coordinating Node, and the system metadata objects are updated on the Coordinating Node. For sake of brevity, only the elements from ReplicationPolicy onwards are shown in the remaining state tables since the remainder of the content does not change.
Field | sysD.1 | sysM.1 |
---|---|---|
identifier | sciD.1 | sciM.1 |
... | ||
replicationPolicy | ||
.preferredMemberNode | mn2.dataone.org | mn2.dataone.org |
.blockedMemberNode | ||
.replicationAllowed | true | true |
.numberReplicas | 1 | 1 |
dateUploaded | 2010-03-04T18:13:51.0Z | 2010-03-04T18:14:45.0Z |
dateSysMetadataModified | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:52.0Z |
originMemberNode | mn1.dataone.org | mn1.dataone.org |
authoritativeMemberNode | mn1.dataone.org | mn1.dataone.org |
replica[1] | ||
.replicaMemberNode | mn1.dataone.org | mn1.dataone.org |
.replicationStatus | Completed | Completed |
.replicaVerified | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:52.0Z |
The Coordinating Node initiates replication of the objects sciM.1 and sciD.1 from mn1.dataone.org to mn2.dataone.org. It does this by calling the “replicate()” method on mn2.dataone.org.
Field | sysD.1 | sysM.1 |
---|---|---|
identifier | sciD.1 | sciM.1 |
... | ||
replicationPolicy | ||
.preferredMemberNode | mn2.dataone.org | mn2.dataone.org |
.blockedMemberNode | ||
.replicationAllowed | true | true |
.numberReplicas | 1 | 1 |
dateUploaded | 2010-03-04T18:13:51.0Z | 2010-03-04T18:14:45.0Z |
dateSysMetadataModified | 2010-03-04T20:58:00.0Z | 2010-03-04T20:58:00.0Z |
originMemberNode | mn1.dataone.org | mn1.dataone.org |
authoritativeMemberNode | mn1.dataone.org | mn1.dataone.org |
replica[1] | ||
.replicaMemberNode | mn1.dataone.org | mn1.dataone.org |
.replicationStatus | Completed | Completed |
.replicaVerified | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:52.0Z |
replica[2] | ||
.replicaMemberNode | mn2.dataone.org | mn2.dataone.org |
.replicationStatus | Queued | Queued |
.replicaVerified |
The node mn2.dataone.org requests the objects sciM.1 and sciD.1 from mn1.dataone.org. After verifying the request with a Coordinating Node (which has a side effect of setting the system metadata ReplicationStatus to Requested), mn1.dataone.org returns the objects. (Note that in actuality, each object is requested individually.)
Field | sysD.1 | sysM.1 |
---|---|---|
identifier | sciD.1 | sciM.1 |
... | ||
replicationPolicy | ||
.preferredMemberNode | mn2.dataone.org | mn2.dataone.org |
.blockedMemberNode | ||
.replicationAllowed | true | true |
.numberReplicas | 1 | 1 |
dateUploaded | 2010-03-04T18:13:51.0Z | 2010-03-04T18:14:45.0Z |
dateSysMetadataModified | 2010-03-04T20:59:00.0Z | 2010-03-04T20:59:00.0Z |
originMemberNode | mn1.dataone.org | mn1.dataone.org |
authoritativeMemberNode | mn1.dataone.org | mn1.dataone.org |
replica[1] | ||
.replicaMemberNode | mn1.dataone.org | mn1.dataone.org |
.replicationStatus | Completed | Completed |
.replicaVerified | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:52.0Z |
replica[2] | ||
.replicaMemberNode | mn2.dataone.org | mn2.dataone.org |
.replicationStatus | Requested | Requested |
.replicaVerified |
The node mn2.dataone.org verifies the retrieved objects against their checksums and indicates to the Coordinating Node that transfer was successful.
Field | sysD.1 | sysM.1 |
---|---|---|
identifier | sciD.1 | sciM.1 |
... | ||
replicationPolicy | ||
.preferredMemberNode | mn2.dataone.org | mn2.dataone.org |
.blockedMemberNode | ||
.replicationAllowed | true | true |
.numberReplicas | 1 | 1 |
dateUploaded | 2010-03-04T18:13:51.0Z | 2010-03-04T18:14:45.0Z |
dateSysMetadataModified | 2010-03-04T21:00:00.0Z | 2010-03-04T21:00:00.0Z |
originMemberNode | mn1.dataone.org | mn1.dataone.org |
authoritativeMemberNode | mn1.dataone.org | mn1.dataone.org |
replica[1] | ||
.replicaMemberNode | mn1.dataone.org | mn1.dataone.org |
.replicationStatus | Completed | Completed |
.replicaVerified | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:52.0Z |
replica[2] | ||
.replicaMemberNode | mn2.dataone.org | mn2.dataone.org |
.replicationStatus | Completed | Completed |
.replicaVerified | 2010-03-04T21:00:00.0Z | 2010-03-04T21:00:00.0Z |
Assuming the replication policy was for one replica to be created for sciM.1 and sciD.1, then the eventual end state of the associated system metadata documents (sysM.1 and sysD.1 respectively) after synchronization of the Coordinating Nodes, would be something like the following.
Note that synchronization between Coordinating Nodes does not trigger an update to the DataSysMetadataModified field. This is necessary to avoid a situation where the state of the metadata is never consistent between Coordinating Nodes.
Note also that even though the replication policy of sciD.1 indicates that NumberReplicas is one, there are actually three additional replicas stored on the Coordinating Nodes.
Field | sysD.1 | sysM.1 |
---|---|---|
identifier | sciD.1 | sciM.1 |
objectFormat | text/csv | eml://ecoinformatics.org/eml-2.1.0 |
size | 1450238 | 8132 |
submitter | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org |
rightsHolder | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org | uid=jones,o=NCEAS,dc=ecoinformatics,dc=org |
obsoletes | ||
obsoletedBy | ||
derivedFrom | ||
describes | sciD.1 | |
describedBy | sciM.1 | |
checksum | 2e01e17467891f7c933dbaa00e1459d23db3fe4f | 9967467891f7c933dbaa00e1459d23db3f342 |
checksumAlgorithm | SHA-1 | SHA-1 |
accessRule[1] | ||
.ruleType | Allow | Allow |
.service | Read | Read |
.principal | * | * |
accessRule[2] | ||
.ruleType | Allow | Allow |
.service | Write | Write |
.principal | o=NCEAS,dc=ecoinformatics,dc=org | o=NCEAS,dc=ecoinformatics,dc=org |
replicationPolicy | ||
.preferredMemberNode | mn2.dataone.org | mn2.dataone.org |
.blockedMemberNode | ||
.replicationAllowed | true | true |
.numberReplicas | 1 | 1 |
dateUploaded | 2010-03-04T18:13:51.0Z | 2010-03-04T18:14:45.0Z |
dateSysMetadataModified | 2010-03-04T18:13:51.0Z | 2010-03-04T18:14:45.0Z |
originMemberNode | mn1.dataone.org | mn1.dataone.org |
authoritativeMemberNode | mn1.dataone.org | mn1.dataone.org |
replica[1] | ||
.replicaMemberNode | mn1.dataone.org | mn1.dataone.org |
.replicationStatus | Completed | Completed |
.replicaVerified | 2010-03-04T19:13:51.0Z | 2010-03-04T19:13:52.0Z |
replica[2] | ||
.replicaMemberNode | mn2.dataone.org | mn2.dataone.org |
.replicationStatus | Completed | Completed |
.replicaVerified | 2010-03-04T21:00:00.0Z | 2010-03-04T21:00:00.0Z |
Maintaining complete copies of system metadata on all nodes is cumbersome and seems to be overly complicated. However, it is still necessary for Member Nodes to provide accurate information for some elements of the system metadata (to assist with transfer verification, to validate a data object is intact).
As such, it seems that scenario B above is the most tractable and provides the desired functionality for tracking content in the DataONE infrastructure. The role of system metadata on Member Nodes and Coordinating Nodes is further clarified below.
The recommendations are: