With the release of version 2 APIs, DataONE introduces support for Member Nodes whose primary repositories implement a mutable content storage model. Under such a model, updates to content overwrite previous versions instead of being persisted side by side with them. From a retrieval point of view, this means that a new set of bytes is returned when the content is requested using its primary identifier or, more importantly, that there is no guarantee that the bytes returned in the past are the same as those returned in the future.
DataONE accomplishes this with the addition of a secondary identifier called the Series Identifier (SID) that has the semantics of retrieving the latest version of the series through the DataONE Read APIs (MNRead and CNRead). Member Nodes that employ a mutable content storage model would assign their repository’s primary identifier to this SID, and populate DataONE’s primary identifier field (systemMetadata.identifier) with a different unique value specific to that version. This identifier is forever associated with that specific persisted version, so it is also referred to as the Persistent Identifier (or PID).
The use of the PID or SID for either citation or analysis workflows is up to the user and is context dependent. In general, DataONE anticipates DATA and RESOURCE_MAP objects will be referenced by PID, to ensure reproducibility; and in general, METADATA documents will be referenced by SID, to take advantage of any data curation / correction efforts that would not otherwise affect scientific reproducibility. Additionally, clues for the content submitter’s preference can be found in the format of the identifiers themselves. For example, DOIs and EZIDs take a recognizable format, and are often encouraged in scientific communities for citations, so an end-user might take that into consideration when deciding which identifier to choose.
[TODO: guidance on RESOURCE_MAPS - initial thoughts: depends on references to DATA objects, whether they be SIDs or PIDs]
Depending on the Member Node used as the primary repository, content originators may have some choice in assigning identifiers. For those that do, it is advised that they assign PIDs and SIDs according to the typical usage pattern described above.
Some Member Nodes may not preserve past versions of content, in which case the PID is likely to be automatically generated, and the submitter only has to determine the SID, and may not need to know the difference between the SID and PID. Other Member Nodes may still be at v1 of the DataONE APIs and only allow assignment of the PID.
The SID is used to conceptually represent an object that may vary modestly over time, but remains conceptually the same. Content contributors should be careful to apply reasonable limits on the scope of documents such that an entity does not deviate too much from the original item. When it does, a new / different series should be initiated.
For Member Nodes that employ a mutable content storage model, the only additional DataONE requirement is that the Member Node generate a SystemMetadata document for the updated content, containing:
- unique PID in systemMetadata.identifier field
- new checksum
- the previous PID in the systemMetadata.obsoletes field
Ideally, the SystemMetadata of now unavailable versions will be maintained, with the ‘obsoletedBy’ field populated with the PID of the version that replaced it.
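The bookkeeping described above can be sketched as follows. This is a minimal illustration, not a Member Node implementation: system metadata is modelled as a plain dictionary, the function name next_version_sysmeta is hypothetical, and SHA-256 is an assumed checksum algorithm chosen only for the example.

```python
import hashlib


def next_version_sysmeta(previous, new_pid, new_bytes):
    """Return (updated_previous, new) system metadata dicts for an overwrite.

    `previous` is the system metadata of the version being overwritten,
    modelled here as a plain dict (an assumption made for illustration).
    """
    if new_pid == previous["identifier"]:
        raise ValueError("each version needs its own unique PID")
    new = {
        "identifier": new_pid,                   # unique PID for this version
        "seriesId": previous.get("seriesId"),    # the SID carries over unchanged
        "checksum": {
            "algorithm": "SHA-256",              # assumed algorithm
            "value": hashlib.sha256(new_bytes).hexdigest(),
        },
        "obsoletes": previous["identifier"],     # points back at the old PID
        "obsoletedBy": None,
    }
    # Ideally the old record is kept and marked as replaced.
    updated_previous = dict(previous, obsoletedBy=new_pid)
    return updated_previous, new
```

A node that discards old system metadata would simply drop the first element of the returned pair.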
Some Member Nodes may opt to preserve recent back-versions to aid the complete capture of versions by the DataONE network via synchronization.
to be determined
DataONE will attempt to synchronize all versions it is made aware of through the synchronization process, but may miss short-lived versions that exist only between two consecutive synchronization runs against the Member Node. Please note, also, that the synchronization schedule is not guaranteed. Periods of DataONE maintenance may suspend synchronization, or high CN load could prolong the synchronization interval.
Member Nodes keen to make sure versions have the highest chance of synchronization can choose to issue a v2.CNCore.synchronize(pid) command that will put the item on the synchronization queue instead of waiting for the harvest interval.
Conversely, if the Member Node expressly doesn’t want DataONE to preserve back-versions, it can set the systemMetadata.replicationPolicy.numberReplicas field to 0.
At its core, DataONE is in the business of preserving definite versions of content through centrally coordinated peer-to-peer replication. That is, DataONE Coordinating Nodes direct certain Member Nodes to replicate newly synchronized objects from the originating Member Node to better preserve them. New versions of objects appear as first class immutable objects with unique PIDs, even if originating from mutable Member Nodes.
From the DataONE perspective the only difference between objects from mutable Member Nodes and immutable Member Nodes is the completeness of the series of versions it is able to synchronize and replicate.
Current DataONE replication processes and fixity checks depend on content identified by a PID that does not change. If this were not enforced, mutable content from a member node would not be differentiated from corrupt copies of the object and our replication and recovery features would attempt to correct the byte inconsistency. The immutability requirement helps to ensure reproducible results of any use of an object. Any analysis on a data set repeated sometime in the future should yield identical results (within the limits of precision of the analytical tools) and this is one of the major guiding principles in creating DataONE as a long term data repository federation. By simply overwriting existing content using the same identifier, nodes cannot be relied upon for repeatable retrieval of content.
The proposal for supporting “mutable” content is to allow a series identifier (SID) to facilitate the semantics of citing an object at the conceptual level, instead of the version level. As content changes over time, new identifiers (PIDs) will still be used to mark each change, but the conceptual object can continue to be referred to with an unchanging identifier (SID). The member node will be responsible for creating each version and assigning a unique PID to it and these objects will be synchronized and replicated to other DataONE member nodes as they are today. So instead of allowing content to be directly modified, we are allowing strongly-versioned chains to be referenced by an identifier; and relaxing the requirement that all revisions be resolvable forever.
The proposed solution is to model and implement a “series identifier” (SID) along with modified services that would work with both SIDs and PIDs. From a DataONE perspective, the series identifiers would be assigned to all versions of an object, be unique in DataONE (assigned to only one version chain), and would be reserved just as PIDs are, from the same namespace. The series identifier, once assigned to the version chain, would similarly be immutable, and could apply to all new versions of the item. It is also assumed that, in order to steer users toward one identifier for citations, the cardinality for the citation identifier would be 0..1. The semantics for making API calls with a SID would, in general, be to return responses as if the call were made with the most current PID.
Member Nodes that only maintain the latest version of an item would be required to use a new PID for any updated content, and modify the System Metadata appropriately so that the new version can be synchronized with the network. The same SID would typically be used for the updated object, although we would allow the revision chain to shift to a new SID as desired by the client and/or member node.
It cannot be assumed that a user with an identifier in hand knows whether it is a SID or a PID, so DataONE expects the user to refer to the System Metadata once it has the item to determine if the identifier used in the call matches the PID or the SID. Similarly, they could interrogate search results for the same information. For high-level interfaces, like D1Client.getD1Object(id), the PID of the object returned may or may not match the passed in ‘id’. So, high-level functions or applications that use resolve will have to make sure they handle the new resolving semantics.
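The check described above amounts to comparing the identifier used in the call against two fields of the returned System Metadata. A minimal sketch, assuming the System Metadata has already been fetched and is represented as a dictionary with "identifier" and "seriesId" keys (the function name classify_identifier is hypothetical):

```python
def classify_identifier(requested_id, sysmeta):
    """Report whether `requested_id` was the PID or the SID of the object.

    `sysmeta` is the system metadata returned for the request, modelled
    as a plain dict (an assumption made for illustration).
    """
    if requested_id == sysmeta.get("identifier"):
        return "PID"
    if requested_id == sysmeta.get("seriesId"):
        return "SID"
    raise ValueError(
        f"{requested_id!r} matches neither the PID nor the SID of this object"
    )
```

Because SIDs and PIDs are reserved from the same namespace, an identifier should never match both fields of the same record.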
It is recommended that search indexes include a search field for the series identifier that can also be returned in the results.
Type 1: An object on the SID chain doesn’t have the “obsoletedBy” field. Example: P1(S1) <-> P2(S1). P2 is a type 1 end.
Type 2: An object on the SID chain does have the “obsoletedBy” field, but the PID in the “obsoletedBy” field has a different SID (including no SID value). Examples: P1(S1) <-> P2(S2), P1(S1) <-> P2(). P1 is a type 2 end on both chains.
It is tricky to determine a type 2 end if the object in the “obsoletedBy” field is missing. For example, P1(S1) <-> P2(S1) -> ??. We don’t know the series id of the object ”??”. So we generally consider it a type 2 end unless we are sure it is not an end, that is, unless there is another object in the chain (with the same series id) that obsoletes the missing object.
In previous example [P1(S1) <-> P2(S1) -> ??], P2 is a type 2 end (case 12).
However, P1(S1) <-> P2(S1) -> ?? <- P4(S1), P2 is not an end (case 8) since ”??” is in the obsoletes field of P4 that has the same series id - S1 (We are sure that the ”??” has the series id S1 as well, so P2 is not an end).
In P1(S1) <-> P2(S1) -> ?? <- P4(S2), P2 is a type 2 end even though ”??” is in the obsoletes field of P4. But P4 has a different series id - S2 (so we are not sure ”??” has the S1 or S2).
Ideally, there is one and only one end on a SID chain, and this end is the HEAD (current) version. Such chains are called ideal chains.
Use cases ( “->” means “obsoletedBy”, “<-” means “obsoletes”. The t1, t2 and t3 are the time stamps. ??: object was not synchronized):
case 1. P1(S1, t1) <-> P2(S1, t2), t1<t2. S1 = P2 (Type 1, an ideal chain)
case 2. P1(S1, t1) ? P2(S1, t2), t1<t2. S1 = P2, Error condition, P2 not allowed (should not exist) (P1 and P2 are type 1 ends, not an ideal chain. Choose P2 as the temporary HEAD. Nothing obsoletes P2 and P2 is the HEAD)
case 3. P1(S1, t1) <- P2(S1, t2), t1<t2. S1 = P2, Discouraged, but not error condition, S1 = P2 (P1 and P2 are type 1 ends, not an ideal chain. Choose P2 as the temporary HEAD. Nothing obsoletes P2 on the chain and P2 is the HEAD)
case 4. P1(S1, t1) <-> P2(S1, t2) <-> P3(S2, t3), t1<t2<t3. S1 = P2 (Type 2), S2 = P3 (Type 1, an ideal chain)
case 5. P1(S1, t1) <- P2(S1, t2) <- P3(S2, t3), t1<t2<t3. S1 = P2 (P1 and P2 are type 1 ends, not an ideal chain. Choose P2 as temporary head. Nothing with the same sid obsoletes P2 and P2 is the HEAD). S2 = P3 (Type 1 end, an ideal chain)
case 6. P1(S1, t1) <-> P2(S1, t2) <-> P3(, t3) t1<t2<t3. S1 = P2 (Type 2, an ideal chain)
case 7. P1(S1, t1) <-> P2(S1, t2) <-> P3(,t3) <-> P4(S2, t4), t1<t2<t3<t4. S1 = P2 (Type 2 end, an ideal chain), S2 = P4 (Type 1, an ideal chain)
case 8. P1(S1, t1) <-> P2(S1, t2) -> ?? <- P4(S1, t4), t1<t2<t4. S1 = P4, (Type 1, an ideal chain) (Error, but will happen)
case 9. P1(S1, t1) <-> P2(S1, t2) ?? <- P4(S1, t4), t1<t2<t4. S1 = P4 (P2 and P4 are type 1 ends, not an ideal chain. Choose P4 as the temporary HEAD. Nothing obsoletes P4 on the chain and P4 is the HEAD)
case 10: P1(S1, t1) <-> P2(S1, t2) -> XX <- P4(S1, t4), t1<t2<t4. S1 = P4, (Type 1 end, an ideal chain) (XX: object P3 was deleted)
case 11: P1(S1, t1) <-> P2(S1, t2) <-> [archived:P3(S1, t3)], t1<t2<t3. S1 = P3, (Type 1 end, an ideal chain)
case 12. P1(S1, t1) <-> P2(S1, t2) -> ??, t1<t2. S1 = P2, (Type 2 end, an ideal chain) (Error, but will happen)
case 13. P1(S1, t1) <- P2(S1, t2) -> ??, t1<t2. S1 = P2 (P1 is a type 1 end and P2 is a type 2 end, not an ideal chain. Choose P2 as the temporary HEAD. Nothing obsoletes P2 on the chain and P2 is the HEAD)
case 14. P1(S1, t1) <- P2(S1, t2) -> P3(S2, t3), t1<t2<t3. S1 = P2 (P1 is a type 1 end and P2 is a type 2 end, not an ideal chain. Choose P2 as the temporary HEAD. Nothing obsoletes P2 on the chain and P2 is the HEAD)
case 15. P1(S1, t1) <-> P2(S1, t2) ?? <- P4(S1, t4) <-> P5(S2, t5), t1<t2<t4<t5. S1 = P4 (P2 is a type 1 end and P4 is a type 2 end, not an ideal chain. Choose P4 as the temporary HEAD. Nothing obsoletes P4 with the same sid on the chain and P4 is the HEAD)
case 16. P1(S1, t1) <- P2(S1, t2) -> ?? <- P4(S2, t4), t1<t2<t4. S1 = P2 (P1 is a type 1 end and P2 is a type 2 end, not an ideal chain. Choose P2 as the temporary HEAD. Nothing obsoletes P2 on the chain and P2 is the HEAD), S2 = P4 (Type 1 end, an ideal chain)
case 17. P1(S1, t1) <- P2(S1, t2) -> ?? <-P4(S1, t4), t1<t2<t4. S1 = P4 (P1 and P4 are type 1 ends, not an ideal chain. Choose P4 as the temporary HEAD. Nothing obsoletes P4 on the chain and P4 is the HEAD)
case 18. P1(S1, t1) <-> P2(S1, t2) -> ?? ??? <- P5(S1, t5), t1<t2<t5. S1 = P5 (P2 is a type 2 end and P5 is a type 1 end, not an ideal chain. Choose P5 as the temporary HEAD. Nothing obsoletes P5 on the chain and P5 is the HEAD)
case 19. P1(S1, t1) <- P2(S1, t2) <- P3(S1, t3), t1>t2>t3. S1 = P3 (The time stamps don’t agree with the obsolescence chain. P1, P2 and P3 are all type 1 ends, not an ideal chain. Choose P1 as the temporary HEAD since t1 is the latest upload date. However, P2 obsoletes P1 and P3 obsoletes P2. P3 is the HEAD)
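The rules behind these cases can be sketched in code. The following is one reading of the type 1 / type 2 definitions and of the "temporary HEAD" procedure used in the cases, not the Coordinating Node implementation. The Version class, its field names, and the integer uploaded value (standing in for the time stamps t1, t2, ...) are assumptions made for illustration; objects that were never synchronized or were deleted are simply absent from the input.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    pid: str
    sid: Optional[str]
    uploaded: int                      # stand-in for the upload time stamp
    obsoletes: Optional[str] = None
    obsoleted_by: Optional[str] = None


def is_end(obj, known):
    """True if `obj` is a type 1 or type 2 end of its SID chain."""
    if obj.obsoleted_by is None:
        return True                    # type 1: nothing obsoletes it
    successor = known.get(obj.obsoleted_by)
    if successor is not None:
        return successor.sid != obj.sid  # type 2 if the series changes
    # Successor is missing: an end unless a same-series object obsoletes it.
    return not any(
        o.pid != obj.pid and o.sid == obj.sid and o.obsoletes == obj.obsoleted_by
        for o in known.values()
    )


def resolve_head(sid, versions):
    """Return the PID that `sid` resolves to, or None if no end is found."""
    known = {v.pid: v for v in versions}
    chain = [v for v in versions if v.sid == sid]
    ends = [v for v in chain if is_end(v, known)]
    if not ends:
        return None
    # With several ends, start from the most recently uploaded one ...
    head = max(ends, key=lambda v: v.uploaded)
    seen = {head.pid}
    # ... then follow same-series objects that obsolete it (see case 19).
    while True:
        nxt = next(
            (v for v in chain if v.obsoletes == head.pid and v.pid not in seen),
            None,
        )
        if nxt is None:
            return head.pid
        head = nxt
        seen.add(head.pid)
```

For an ideal chain there is exactly one end, so the time stamp comparison and the follow-up loop have no effect; they only matter for the non-ideal cases.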
Mutable content implies that back-versions of content may not be readily available on the nodes that originally produce the content. For metadata and resource maps, the coordinating nodes will store previous versions of objects during the synchronization process, but any data updates will result in only the latest version being available at the originating node. If the data objects were replicated (as is the hope), it is likely that previous versions of the data can still be resolved from replica target nodes, though this is dependent on replication policies, synchronization schedules and the availability of replica storage across the federation.
The current DataONE storage model, through the MN_Storage.update method, places responsibility for storing versions squarely on the submitter. Each update to the object requires a new unique identifier (PID) and must state which PID the new version is obsoleting. We will continue to require that unique PIDs are provided for each and every version of an object, but the member node will not be required to maintain a copy of previous revisions if it chooses not to. An optional series identifier (SID) can be provided with object SystemMetadata to group revisions together and to provide a convenient way to refer to the latest version of the object.
As is currently the case, the member node should maintain all versions of content using unique identifiers (PID) and synchronization will harvest each new revision to the network. While there will be no requirement that the Member node continue to make available the object identified by the obsoleted PID, the hope is that they will persist the data history as best they can. If the objects in the revision chain have a SID assigned, the new PID will be considered the latest version of this series.
The member node can allow access to the current version of the object using MN_Read.get(sid) as a convenience and any reference to the SID would resolve to the latest version of the object with a potentially different checksum and PID from what was originally present when the citation was distributed.
The member node must [minimally] maintain system metadata for the current revision of the object. Any updated object is still required to be identified by a new unique PID, but would include the same SID used in the previous version. The obsoletes field should indicate that the new PID replaces the previous PID. The coordinating node learns about the updated content during synchronization because there is:
- a new PID
- an updated dateSystemMetadataUpdated timestamp
- an updated checksum (other fields may also be updated).
N.B. Multiple revisions between synchronization periods would not result in multiple versions recorded in the federation - just the revision[s] that happened to be synchronized would be persisted in DataONE. This leaves open the possibility of an end user retrieving a version from the MN that will ultimately not be persisted in perpetuity.
DataONE essentially considers member nodes as the originators of selected versions of content. That is, not every intermediate revision on the way to a final product should necessarily be saved for future reference. Organizations following the mutable content model for storage may wish to limit the objects returned by listObjects() to those that are considered to be in their publishable form. Certainly these objects can later be updated as needed, but minimizing draft-status objects will reduce the amount of [possibly irretrievable] draft content floating around the federated network.
As illustrated in the optional use cases, the rate and regularity of change of objects can be widely variable. The more frequent the change, the less likely that all versions would need to be reproduced, and the utility of complete version history diminishes. One can imagine a member node serving up an unrecorded data stream, such as a web-cam, delaying creating a version until a user calls MN.get() on the item, by tee’ing the output stream to file while returning the object.
Additionally the need to keep past versions may be less important for metadata objects (correcting typos that do not change the meaning or interpretation of the data) than data objects or resource maps.
The use case of mutable data objects that grow with new records appended to the end of a table, for example, was given as a common practice for some groups, and one that would produce progressively redundant information with each persisted version. The motivation for rolling up records accumulated over time instead of new data files for each is the ease of use for end users. Using a SID to access the data object will always give the latest snapshot of the data records where old revisions may or may not also be accessible.
Objects like NetCDF files that include both metadata and data in the same object will be managed with the same PID and SID considerations. If only the metadata portion of the file is modified, the SID may remain the same, but a new PID and checksum must be created and made available for synchronization. The old revision may immediately become inaccessible using the PID and that is allowable under the proposal.
Implicit in the support for versioned content is support for retrieval of, or possibly just resolution to, the current object bytes by the identifier assigned in the originating system. At a minimum CNs will be required to support calculating which is the current version of a series of versions and returning it or its identifier. This will be accomplished using the series identifier (SID) associated with object[s] in a revision chain. The “current” version of an object is defined as the non-obsoleted object with a SID that matches the requested identifier. Objects that are marked as “archived” may be returned as the most current version, but they should not be seen in default search interfaces. Since DataONE identifiers have no special formatting semantics, those following a citation will not know by looking at the identifier whether it is referring to a specific version (PID) or the latest version of the item (SID), so services may be provided to easily investigate an entire version series. Existing services allow clients to deduce this information by inspecting the system metadata for the identifier and following any obsolescence properties as needed.
Because the content of an object is retrieved in a separate call from its system metadata, use of the SID for MN Read API calls is troublesome because the content may be updated between the two calls. It would be impossible to tell if the bytes retrieved were incorrect (bit rot) or correct (newer version) when comparing checksums in this case. If data consistency is important to the caller, the PID should be used to guarantee that only the expected bytes (or a NotFound exception) are returned by any MN.get calls.
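One way for a caller to avoid this ambiguity is to fetch the system metadata first, pin the request to the PID it reports, and only then retrieve and verify the bytes. The sketch below assumes two caller-supplied functions, get_sysmeta and get_bytes, which are hypothetical stand-ins for the system metadata and object retrieval calls; system metadata is again modelled as a dictionary.

```python
import hashlib


def checksum_matches(content, sysmeta):
    """Compare `content` against the checksum recorded in `sysmeta`."""
    # Map names such as "SHA-256", "SHA-1" or "MD5" to hashlib's spelling.
    algorithm = sysmeta["checksum"]["algorithm"].replace("-", "").lower()
    digest = hashlib.new(algorithm, content).hexdigest()
    return digest == sysmeta["checksum"]["value"].lower()


def get_consistent(identifier, get_sysmeta, get_bytes):
    """Return (pid, content) for a PID or SID, verified against one version.

    `get_sysmeta` and `get_bytes` are hypothetical callables supplied by
    the caller; neither is part of any DataONE client library.
    """
    sysmeta = get_sysmeta(identifier)   # identifier may be a SID or a PID
    pid = sysmeta["identifier"]         # pin to the specific version
    content = get_bytes(pid)            # by PID: expected bytes or not found
    if not checksum_matches(content, sysmeta):
        raise IOError(f"checksum mismatch for {pid}")
    return pid, content
```

With the request pinned to a PID, a checksum mismatch can only mean damaged bytes, never a newer version.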
Those making a citation may wish to cite a specific version, or the latest current version. Followers of citations may wish to, if given an identifier representing a specific version (PID), find out what is the latest version (another, newer PID, or the SID). Conversely, if given a series identifier that navigates to the latest version, they may wish to find out what the content was at some previous point in time (e.g., the time of the citation) by following the obsolescence chain backward.
DataONE will be providing CN services for navigating to the latest version of an object, since the only way to do it currently is for clients to serially retrieve the system metadata for versions in the chain until they reach the head version, which can be inefficient. A new method to retrieve the entire version history is also under consideration.
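The serial, client-side approach mentioned above can be sketched as a simple loop over the obsoletedBy field. The get_sysmeta parameter is a hypothetical callable returning system metadata as a dictionary and raising a LookupError subclass when an object cannot be found; the sketch ignores series boundaries and simply follows the obsolescence chain as far as it can be resolved.

```python
def find_latest_pid(pid, get_sysmeta):
    """Follow obsoletedBy links from `pid`; return the last resolvable PID.

    Costs one system metadata request per version in the chain, which is
    why this approach can be inefficient for long chains.
    """
    sysmeta = get_sysmeta(pid)          # a missing starting point propagates
    seen = {pid}
    while True:
        successor = sysmeta.get("obsoletedBy")
        if successor is None or successor in seen:
            return sysmeta["identifier"]    # head reached (or a loop detected)
        try:
            sysmeta = get_sysmeta(successor)
        except LookupError:
            return sysmeta["identifier"]    # successor was never synchronized
        seen.add(successor)
```

Walking in the opposite direction, via the obsoletes field, gives the history behind a given version in the same way.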
The use cases below organize the identified requirements related to mutable content, with the most relevant use cases listed first.
Defined as activities that help ensure continued discoverability and usefulness and usually in reference to metadata, not data.
For institutions following a mutable content model:
What is the best way to version mutable data that changes frequently but may or may not be used? For example a “current time” object, replaced every minute, or “current weather radar” that’s replaced every 3 hours.
The underlying dynamic here is the rate of mutation vs. the rate of synchronization.
This means supporting data objects that add records over time, either:
Some formats combine data with metadata, for example netCDF, so allowing the metadata to change without impacting the consistency assessment of the data itself.
but may be referenced using a seriesId
Mutable content can theoretically include things that are live feeds from sensors, but are otherwise not captured.
This proposal does not accommodate streams unless they have discrete snapshots that can be referenced as part of a seriesId.