Immutability of Content in DataONE
==================================

.. index:: immutability


.. contents::

Overview
------------
To support the goals of `preservation and scientific reproducibility <PreservationStrategy.html>`_, all registered 
objects in DataONE are considered immutable, with each object representing a 
published snapshot of data or metadata associated with a specific time. DataONE 
manages the registration, indexing, and replication of these snapshots throughout 
the DataONE network of Member Nodes.  Upon this foundation, DataONE can guarantee 
that the exact byte array returned through the DataONE Read APIs (MNRead and CNRead) 
is the one submitted and registered.

Any repository that assigns unique identifiers to snapshots (also referred to as versions) 
can participate as a DataONE Member Node, irrespective of whether it retains past snapshots. 
This is accomplished by the use of two identifiers, one representing the version, and the 
other representing the changing (or mutable) entity. For those Member Nodes only managing the 
mutable entity, as long as a unique version-level identifier is generated upon each
update to the entity, DataONE will not reject the update. In situations where the
rate of change is faster than DataONE's Member Node synchronization, it is possible
that some snapshots will fail to be registered.  However, since the unique identifier 
of such a version is never indexed or otherwise made available, the chance of needing to 
retrieve that snapshot (and not finding it) in the future is very small.


The two identifiers are known as:

**Persistent Identifier (PID)** 
  declared in the ``systemMetadata.identifier`` field. This identifier represents 
  the snapshot or version DataONE replicates among the DataONE federation. 

**Series Identifier (SID)**
  declared in the ``systemMetadata.seriesId`` field.  This identifier represents 
  the mutable content, and resolves to the latest version among all registered versions 
  when used in the DataONE Read APIs.

DataONE relies on content originators to generate the identifiers for each 
snapshot being registered (the series identifier is optional), and to determine 
which field will hold the "citable" identifier.


Changes constituting a new snapshot
------------------------------------
DataONE considers any change that results in a different byte array of content 
to be a new snapshot, and thus a new object to be registered. Subtle changes, 
such as whitespace differences, although potentially meaningless, do
therefore constitute a new object. If not properly identified with a new PID, 
the content held on that Member Node is invalid. Member Nodes that periodically 
regenerate their stored content or manipulate it upon retrieval will need to take 
extra care to validate checksums after regeneration or manipulation and resolve 
any discrepancies in content they may encounter.
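
As a minimal sketch of why byte-level differences matter, the following Python
example compares checksums of two byte arrays that differ only by one space.
The helper name ``checksum``, the sample content, and the choice of SHA-256 are
assumptions made for illustration; they are not part of the DataONE API.

.. code-block:: python

   import hashlib


   def checksum(content: bytes, algorithm: str = "sha256") -> str:
       """Return the hex digest of content using the named hashlib algorithm."""
       return hashlib.new(algorithm, content).hexdigest()


   # Hypothetical content: the regenerated copy differs by a single space.
   original = b"site,temp\nA,12.5\n"
   regenerated = b"site,temp\nA,12.5 \n"

   # Different bytes give a different checksum, so the regenerated copy
   # would have to be registered as a new snapshot with a new PID.
   same_bytes = checksum(original) == checksum(regenerated)
   print(same_bytes)  # False

A Member Node that regenerates content could run a comparison of this kind
against the checksum recorded at registration before serving the object.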


Changes constituting a new series
---------------------------------
The SID is provided expressly to group the snapshots of a single entity that is 
stable over time.  It was not intended to represent highly volatile entities, or
those that significantly "drift" over time.  Member Nodes are encouraged to instead 
use registered services for volatile content, and create new entities when 
significant change in the scope of an entity occurs.

It is not clear that individual contributors will always have the means to 
register a service, so they may have entities organized for input, such 
as a file that accumulates observations. If items such as these are registered as 
objects, the rightsHolders should be mindful to apply reasonable temporal bounds.

When the scope of an item has changed significantly, it is permissible to supply
a new seriesId to the next snapshot while still relating the two items by 
``obsoletes`` and ``obsoletedBy``. However, this is not necessary; it may be more 
straightforward to save the new entity without relating the snapshots through 
system metadata, and to relate them through provenance mechanisms instead.


Usage Conventions
-----------------
DataONE anticipates data consumers using other contributors' data will prefer to cite 
using the PID, for the certainty that provides, and will prefer using the SID when
citing indirectly (when citing metadata).  Similarly, we anticipate content originators 
will wish to promote one identifier for citation. For those content providers using 
both identifiers, it is recommended to assign the preferred identifier according 
to anticipated data consumer preference.


Aggregating Download Statistics
-------------------------------
To aid in aggregating download statistics for data providers, DataONE provides the
``cn/v2/query/logsolr`` query endpoint. Both the PID and SID fields are included in the index records 
to allow straightforward retrieval of download statistics by either field.


Identifier resolution in DataONE APIs
-------------------------------------
All DataONE APIs accepting an Identifier must treat a PID as a request for the exact 
snapshot, and a SID as a request for the latest snapshot that the DataONE Node has 
knowledge of.  Due to the distributed nature of snapshot replication, it is possible 
that a replica Member Node does not know about the latest snapshot, in which case a
request by SID to that node should return a previous snapshot.  For all nodes, even 
the authoritative Member Node, a request by PID for a snapshot that the node doesn't host 
must return a NotFound exception.
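
The node-level rule can be sketched as follows. This is a simplified model, not
DataONE code: the ``Snapshot`` and ``Node`` classes, the local ``NotFound``
exception, and the sample dates are all hypothetical, and the sketch picks the
latest local snapshot by upload date only.

.. code-block:: python

   from dataclasses import dataclass
   from datetime import datetime
   from typing import Iterable, Optional


   class NotFound(Exception):
       """Stand-in for the DataONE NotFound exception."""


   @dataclass(frozen=True)
   class Snapshot:
       pid: str
       series_id: Optional[str]
       date_uploaded: datetime
       content: bytes


   class Node:
       """Hypothetical in-memory node holding a subset of snapshots."""

       def __init__(self, snapshots: Iterable[Snapshot]) -> None:
           self._by_pid = {s.pid: s for s in snapshots}

       def get(self, identifier: str) -> bytes:
           # A PID names exactly one snapshot: return it if hosted here.
           if identifier in self._by_pid:
               return self._by_pid[identifier].content
           # Otherwise treat the identifier as a SID and return the latest
           # snapshot of that series that this node knows about.
           series = [s for s in self._by_pid.values()
                     if s.series_id == identifier]
           if not series:
               raise NotFound(identifier)
           return max(series, key=lambda s: s.date_uploaded).content


   p1 = Snapshot("P1", "S", datetime(2015, 1, 1), b"first")
   p2 = Snapshot("P2", "S", datetime(2015, 2, 1), b"second")
   authoritative = Node([p1, p2])
   replica = Node([p1])  # this replica never received P2

   print(authoritative.get("S"))  # b'second'
   print(replica.get("S"))        # b'first'

In this model ``replica.get("P2")`` raises ``NotFound``, while the same replica
still answers a request by SID with the older snapshot it holds.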

End users should therefore rely on ``v2.cn.resolve`` for object retrieval. If the
Identifier used for the resolve is a SID, this method will return an ObjectLocationList 
for the latest known snapshot.


Series Identifier resolution to the head version
-------------------------------------------------
The primary way of determining the head of a series is via the ``obsoletedBy`` field 
in the system metadata, which links to the next snapshot in the chain. With all of 
the snapshots synchronized, exactly one snapshot in the series should not be 
obsoleted, and that is the head.  With incomplete synchronization, there may be 
more than one snapshot that is not obsoleted, and in these cases, the 
one with the latest ``dateUploaded`` value will be chosen as the head.

Use of the obsoleted fields as the primary indicator for the head of the series
is preferred because it is a direct reflection of the rightsHolder's intentions, 
whereas the ``dateUploaded`` value is only a reflection of the order in which the
Member Node processes uploaded content.
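
A minimal sketch of this selection rule is shown below. The ``SysMeta`` class
and ``head_of_series`` function are hypothetical names, not part of any DataONE
library. The sketch also assumes that a snapshot whose ``obsoletedBy`` points to
an unknown object, or to an object in a different series, remains a candidate.

.. code-block:: python

   from dataclasses import dataclass
   from datetime import datetime
   from typing import Dict, Optional


   @dataclass(frozen=True)
   class SysMeta:
       identifier: str
       series_id: Optional[str]
       date_uploaded: datetime
       obsoleted_by: Optional[str] = None


   def head_of_series(sid: str,
                      known: Dict[str, SysMeta]) -> Optional[SysMeta]:
       """Pick the head of series sid from the system metadata known locally."""

       def superseded_within_series(meta: SysMeta) -> bool:
           if meta.obsoleted_by is None:
               return False
           successor = known.get(meta.obsoleted_by)
           # Assumption: only a known successor in the same series
           # disqualifies a snapshot from being the head.
           return successor is not None and successor.series_id == sid

       candidates = [m for m in known.values()
                     if m.series_id == sid and not superseded_within_series(m)]
       if not candidates:
           return None
       # Tie-break among remaining candidates by the latest upload date.
       return max(candidates, key=lambda m: m.date_uploaded)


   # Two snapshots of series S; obsoletedBy was never filled in on P1.
   known = {
       "P1": SysMeta("P1", "S", datetime(2015, 1, 1)),
       "P2": SysMeta("P2", "S", datetime(2015, 2, 1)),
   }
   head = head_of_series("S", known)
   print(head.identifier)  # P2

When ``obsoletedBy`` is populated, the date comparison is only needed to break
ties, which is why the obsolescence chain is treated as the primary signal.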

Importance of the obsolete fields
---------------------------------
Member Nodes that manage by mutable entity (that is, do not preserve prior snapshots) 
should still populate the ``obsoletes`` and ``obsoletedBy`` fields. Replica nodes 
and the DataONE Coordinating Nodes can use these
fields to optimize queries for finding the head of the series.

*Question: should mutable Member Nodes keep systemMetadata documents for snapshots they 
no longer have?  It would allow the obsoletedBy fields to be synchronized, but 
would it conflict with the behavior of deleted items (and would it matter)?*

Mutable Member Node example
---------------------------
To illustrate by way of example, author **A** uploads an item with an identifier **S** 
to Member Node **M**, using **M**'s primary API rather than the DataONE API.  **M** 
builds a systemMetadata document for **S**, generating a PID, **P1**, to uniquely identify
the initial snapshot, assigning an upload date of **D1**, and using the identifier **S**
for the seriesId.  DataONE synchronizes the object and replicates the snapshot
**P1** to one other Member Node, **R1**. The author, **A**, then saves changes to the item, 
whereupon **M** generates another PID, **P2**, to uniquely identify this newer
snapshot, uses **S** in the seriesId field, puts **P1** in the obsoletes field, and puts **D2**
in the dateUploaded field.  (The size and checksum are also calculated for the new
snapshot.)  This snapshot is synchronized and replicated to a different Member Node, **R2**.

When ``v2.cn.resolve(S)`` is called, an ObjectLocationList for **P2** is returned, listing
Member Nodes **M** and **R2** as locations for retrieval.  A call to ``M.get(P2)`` or 
``R2.get(P2)`` will return the latest snapshot, as will the same call using **S** as the identifier
instead.  However, a call to ``R1.get(P2)`` will return NotFound, because **R1** was not
a replication target for that snapshot, and a call to ``R1.get(S)`` will return the 
initial snapshot, because **R1** has snapshot **P1** with the associated seriesId **S**.

Notice, too, that ``v2.cn.resolve(P1)`` will return an ObjectLocationList containing
both **M** and **R1**, although retrieval from **M** is no longer possible, since **M** doesn't 
preserve past snapshots.  ``M.get(P1)`` should return a NotFound, and the client will 
move on to **R1**, and be able to retrieve **P1** with ``R1.get(P1)``.
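
The client-side fallback described above might look like the following sketch.
The ``retrieve`` function, the per-node getter functions, and the local
``NotFound`` exception are hypothetical stand-ins for a real client and real
Member Node calls.

.. code-block:: python

   from typing import Callable, Dict, List


   class NotFound(Exception):
       """Stand-in for the DataONE NotFound exception."""


   def retrieve(pid: str, locations: List[str],
                getters: Dict[str, Callable[[str], bytes]]) -> bytes:
       """Try each resolved location in order until one returns the object."""
       for node_id in locations:
           try:
               return getters[node_id](pid)
           except NotFound:
               continue  # this node no longer hosts the snapshot
       raise NotFound(pid)


   def m_get(pid: str) -> bytes:
       # M manages by mutable entity and only keeps the current snapshot.
       if pid == "P2":
           return b"second"
       raise NotFound(pid)


   def r1_get(pid: str) -> bytes:
       # R1 was the replication target for P1 only.
       if pid == "P1":
           return b"first"
       raise NotFound(pid)


   getters = {"M": m_get, "R1": r1_get}
   # Locations as they would be listed for P1 in this example.
   content = retrieve("P1", ["M", "R1"], getters)
   print(content)  # b'first'

If every listed location fails, the sketch re-raises ``NotFound`` so the caller
can tell that the snapshot is registered but currently unavailable.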

When ``v2.cn.resolve(S)`` is called, the CN determines the head of the series by first
finding all of the snapshots where the seriesId is **S** and obsoletedBy is either null or
refers to an object with a different seriesId. In this case, since the **P1** 
systemMetadata is never updated to fill in the obsoletedBy field, the algorithm 
finds both **P1** and **P2**.  It then notices that **P2** has the later date, **D2**, 
and so chooses **P2** as the head of the series.

Suppose now that **A** spawns two more snapshots in quick succession, and DataONE 
synchronizes afterwards.  It misses **P3(S)** but picks up **P4(S)**.  ``cn.resolve(S)`` will
return an ObjectLocationList for the **P4** snapshot, since it is the latest of all 
non-obsoleted snapshots.

Later, **A** makes some changes and realizes that the content is significantly different
from previous versions, and so renames it **S2**.  The system treats it as related, and 
links it to the **P4** snapshot with the obsoletes field.  **P5(S2)** is now hosted on **M**, but
**P4(S)** is gone.  ``cn.resolve(S)`` will return an ObjectLocationList for **P4**, but ``M.get(P4)``
will return NotFound, and the client will have to retrieve from a replica Member Node, 
if possible.  ``cn.resolve(S2)`` will return an ObjectLocationList for **P5**.  Note also that ``M.get(S)``
will not be able to resolve the SID to any PID, since **M** doesn't host any of the
snapshots of **S**.


Summary
=======
- the PID represents the snapshot, and snapshots are immutable
- the SID represents the entity, and can be applied to several connected snapshots.
- A particular SID cannot be used if it is either reserved by or in use by someone else.

  * Specifically, the CN checks that the submitter has CHANGE_PERMISSION on 
    the current head of the series
  * if not in use, checks submitter against ``cn.hasReservation(SID)``
 
- a SID cannot be used in the ``obsoletes`` and ``obsoletedBy`` fields
- SID resolution is "latest upload" among the set of snapshots not obsoleted by 
  another member of the series. 

- not all versions are guaranteed to be synchronized or replicated

  * this is dependent on synchronization frequency and CN availability for sync
  * missing versions, if synced but not replicated, will appear as NotFound
    exceptions on forwarded resolve requests

- unsynchronized versions might appear in obsoletes/obsoletedBy fields of existing versions
- cn.listObjects(idFilter=sid)?? retrieves all synchronized versions 

- once the seriesId is set, it cannot be changed, because changing it would break the 
  trust that the identifier always gets the user to a conceptually equivalent object.