DataONE API - 2.0
System metadata (often referred to as sysmeta) is the information used by DataONE to track and manage objects across the distributed Coordinating and Member Nodes of the network. System metadata documents contain low level information (e.g. size, type, owner, access control rules) about managed content such as science data, science metadata, and resource map objects and obsolescence chains (obsoletes and obsoletedBy).
System Metadata is maintained dynamically by each Member Node. The Member Node’s copy is authoritative, while Coordinating Node copies are are authoritative for the Replica portion of system metadata. Mirror copies of system metadata are maintained at each of the Coordinating Nodes. Some properties are mutable to reflect the current state of an object in the system. Initial properties of system metadata are generated by clients and Member Nodes.
System metadata is considered operational information needed to run DataONE, and can be read by all Coordinating Nodes and Member Nodes in the course of service provision. In order to reduce issues with third-party tracking of data status information, users can read system metadata for an object if they have the access rights to read the corresponding object which a system metadata record describes.
System Metadata elements are partitioned into two classes: metadata elements that must be provided by client software to the DataONE system, and elements that are generated by DataONE itself in the course of managing objects.
The mutability of system metadata elements is described in Table 1.
Table 1. Mutability of system metadata. Values are Initialized By different components during creation, and those values are vetted by (Controlled By) downstream, authoritative components. Mutable properties are edited through Edit Method.
Property | Mutable? | Initialized By | Controlled By | Edit Method |
---|---|---|---|---|
serialVersion | Yes | MN | CN | Various |
identifier | No | Client | Client | |
formatId | Yes | Client | Client | |
size | No | Client | MN | |
checksum | No | Client | MN | |
submitter | No | Client | MN | |
rightsHolder | Yes | Client | client | MNAuthorization.setRightsHolder() |
accessPolicy | Yes | Client | client | manual (Tier 1), MNAuthorization.setAccessPolicy(), MNStorage.update ()(all must call CNCore.systemMetadataChanged()) |
replicationPolicy | Yes | Client | client | CNReplication.setReplicationPolicy() |
obsoletes | Yes | Client | client | MNStorage.update() |
obsoletedBy | Yes | Client | client | MNCore.setObsoletedBy(), MNStorage.update() |
archived | Yes | MN | MN | MNCore.archive() |
dateUploaded | No | MN | MN | |
dateSysMetadataModified | Yes | MN | MN | Various - any change to system metadata |
originMemberNode | No | MN | MN | |
authoritativeMemberNode | Yes | MN | MN/CN | Manual update process |
replica | Yes | CN | CN | CNReplication.updateReplicationMetadata() |
seriesId | No | Client | Client |
The system metadata schema is expressed in XMLSchema and the development version is available at:
The most recent release version is available from:
This document refers to the current target for release, version 1 which is located at:
If there are discrepancies between this document and the schema, then the schema shall be considered authoritative.
The current development version of the system metadata schema document (version 1.0) is included here for reference. It is located in the source control repository at:
The example instance document included here was auto-generated so does not include useful values. It is included here to provide a general indication as to the structure of a populated system metadata document.
The following outline describes the policy and technical steps needed to shift the majority of control of system metadata attributes to Member Nodes such that client operations are more responsive. The changes would require a new DataONE API v2 that involve changes to the DataONE Types schema, changes to the Member Node APIs, changes to the Coordinating Node APIs, and changes to the various software stacks that implement these APIs. It will also involve a release and deployment schedule that allows both v1 and v2 of the APIs to be in operation simultaneously.
In transferring control to the Member Nodes, they also adopt the responsibility of consistently managing the versions of the documents in a serial manner. Use of the serialVersion attribute ensures that previous values are not overwritten by new values out of order (e.g. AccessPolicy)
The main use case involves access control. When a scientist using an ITK client creates an object through MN.create(), control of the system metadata is currently transferred to the CN once synchronization happens. After that point, the ITK client (and scientist) has to make CN.setAccessPolicy() calls to make any changes. If the MN is set to sync once a week, this is problematic, since the scientist would naturally expect that the access control changes should take effect immediately.
Example sequence diagrams show the differnce follows:
— ..
@startuml images/sysmeta_mn_control.png skinparam notebordercolor #AAAAAA skinparam notefontcolor #222222 title Proposed Set Access Policy Sequencenn participant “Client” as Client <<D1Client>> participant “MN” as MN <<MNode>> participant “CN” as CN <<CNode>> Client -> MN : MNStorage.create(pid, object, sysmeta) activate MN #D74F57 MN –> Client : pid deactivate MN Client -> MN : MNStorage.updateSystemMetadata(pid, SystemMetadata) note left
SystemMetadata includes AccessPolicyend note activate MN #D74F57 MN –> Client : true note right
Success. The client calls directly to the MN without delay.
- end note
- deactivate MN
MN -> CN : CNCore.updateSystemMetadata(pid, sysmeta) activate CN #D74F57 CN –> MN : [true | false] deactivate CN note right
The MN calls the CN; best-faith effort to keep it synchronized, but fine if the object has not been harvested.end note
... <b>Potentially long delay due to sync schedule</b> ... CN -> MN : MNRead.listObjects(fromDate, toDate) activate MN #D74F57 MN –> CN : objectList deactivate MN CN -> MN : MNread.getSystemMetadata(pid) activate MN #D74F57 MN –> CN : sysmeta deactivate MN CN -> CN : CNCore.registerSystemMetadata(pid, sysmeta) activate CN #D74F57 deactivate CN note left
Now the CN finally gets the SystemMetadata, with most recent values already set on the MNend note
@enduml
This document describes the management of system metadata across nodes, and has been updated to reflect control of system metadata attributes by the MN rather than the CN, except for the Replicas listed per object. Text changes are highlighted here.
The Types Schema could be changed in two ways:
2.1 Modify the Replica Type
By adding an optional version attribute to the Replica Type, the Coordinating Nodes would no longer need to rely on the serialVersion attribute of the entire system metadata document to manage versions. A Replica example, with the version line highlighted, would be:
<replica version="1">
<replicaMemberNode>urn:node:PISCO</replicaMemberNode>
<replicationStatus>completed</replicationStatus>
<replicaVerified>2012-07-10T00:00:00.000+00:00</replicaVerified>
</replica>
By making the version attribute optional, this approach would be backwards-compatible with existing system metadata documents in the system. However, the Replica list in System Metadata documents on the MN may be out of sync with the list on the CN during times of rapid change such as MN-to-MN replication operations.
2.2 Remove the Replica
Another approach is to remove the Replica entry from the SystemMetadata Type entirely, and manage replicas separately. This approach would be backwards-incompatible with existing system metadata documents, but once upgraded, all Replica information would be obtained through the CN services.
Todo
Needs discussion.
2.3 Leave the data types as is and let the CN have control over both the replica list and the serialVersion as it currently does. We always hope and intend that the MN and CN will have the same consistent SystemMetadata eventually. In this scenario, the CN would ignore any values the MN provided for SM.serialVersion and SM.replica and the MN would accept those values as provided by the CN’s copy of SystemMetadata. This allows much of our processing on the CN to remain as is and the different types of nodes then choose which parts of SM to manage/ignore. BRL: I believe we decided to pursue this course for now.
Changes would be required for both the Member Node and Coordinating Node APIs, in both the architecture documentation and the d1_common_java and d1_common_python libraries:
3.1 MN and CN API changes
Action | Method | Notes |
Add | MNStorage.updateSystemMetadata() | Instead of multiple methods |
Change | MNRead.systemMetadataChanged() | Move from MNAuthorization |
Reject | MNAuthorization.setRightsHolder() | |
Reject | MNAuthorization.setAccessPolicy() | |
Reject | CNCore.sytemMetadataChanged() | Required to push notify CNs |
Add | CNCore.updateSytemMetadata() | Keeps the CN copy in sync |
Deprecate | CNCore.archive() | |
Deprecate | CNCore.setObsoletedBy() | |
Deprecate | CNAuthorization.setRightsHolder() | |
Reject | CNReplication.getReplicaVersion() | |
Reject | CNReplication.setReplicaVersion() |
As an alternative to individual MN APIs above, we might want to consider using a single MN call to update system metadata documents:
Action | Method | Notes |
Add | MNCore.updateSytemMetadata() | Using this method now |
Note
CJ and BRL discussed this and decided the single updateSystemMetadata method would suffice and implementations could determine which mutable fields from the SystemMetadata it would update. TBD: do we reject updates if an immutable field differs from the original value even if we never intend to save that new value anyway?
The DataONE Client Libraries (d1_libclient_java and d1_libclient_python) will need to be changed to support the above API changes in v2, as well as the existing v1 APIs. This will help multiple MN software stacks in supporting both APIs.
5.1 New CN Rest Service calls
The CN REST Service will need to be modified to add and deprecate the methods listed above. Likewise, the CN REST Proxy will also need to be adjusted.
5.2 MN to CN Synchronization
With these changes, d1_synchronization classes will need to consult the node registry to determine if an MN implements v1 or v2 of the API, and act accordingly. As the synchronization code adds in replica entries, it should notify the authoritative Member Node and all replica Member Nodes of the change using MNRead.systemMetadataChanged() calls. It will also need to call CNReplication.setReplicaVersion() for new entries.
5.3 MN to MN Replication
The CN ReplicationManager code will need to be adjusted to 1) Get authoritative copies of system metadata from the MN, 2) use CNReplication.getReplicaVersion() and CNReplication.setReplicaVersion() when processing replica tasks rather than setting the serialVersion of the system metadata document.
5.4 Metacat CNodeService and schema
The MetacatCNodeService class will need to be modified to implement the above CN API calls. Likewise, the database schema will need to change to store a new version column in the smreplicationstatus SQL table. This will also affect other classes that manage the persistence of system metadata, namely IndetifierManager. Upgrade classes and scripts will need to be written for existing installations.
Member node software stacks will need to implement the API methods listed above, and will need to ensure that other calls that affect system Metadata entries also update Coordinating Node system metadata copy. For instance, a call to MNStorage.update() should also call CNCore.updateSystemMetadata() so that the CNs remain in sync with the MNs with regard to system metadata.
We will need to establish a release schedule and deploy software stacks, likely in this order:
Note that we plan on introducing other changes into the DataONE types schema to accommodate mutable content and other features. Changes to the type schema should be consolidated to reduce the impact on software that depend on the types.