Identifiers in DataONE

Identifiers (PIDs, Persistent IDentifiers) are handles that uniquely identify objects within the DataONE system.

  • All data, metadata, and resource map objects in DataONE have a unique identifier.
  • PIDs will always refer to the same set of bytes accessed through the DataONE API methods such as MNRead.get().
  • The location of content identified by a PID is determined by calling the CNCore.resolve() method.
  • PIDs are persistent. Once content is registered with DataONE, the identifier for that content will remain in the DataONE system.
  • PIDs are unique, and cannot be reused once assigned.
  • PIDs are generally controlled by Member Nodes; however, their uniqueness and immutability are enforced primarily by the Coordinating Nodes.

Uniqueness

Generation of identifiers in DataONE is largely under the control of the Member Nodes (i.e. the data providers), with the requirement that an existing identifier (i.e. one that is already registered in the DataONE system) cannot be reused. This rule is enforced for new content by checking the uniqueness of a proposed identifier in the MNStorage.create() method, and for existing content by ignoring content with identifiers that are already in use. The CNCore.reserveIdentifier() method may be used to reserve an identifier so that a client may, for example, assemble a composite object before committing the new content to storage on the Member Node. Similarly, Member Nodes at Tier 3 and above may support the MNStorage.generateIdentifier() method, which will typically delegate to a third party persistent identifier service such as EZID [1] to return an identifier guaranteed to be unique within the DataONE system.

Authority

DataONE treats the original identifier (i.e. the first assignment of the identifier to an object that becomes known to DataONE) as the authoritative identifier for an object. Although generally not encouraged, multiple identifiers may refer to a particular object and in such cases, DataONE will attempt to utilize the original identifier for all communications about the object.

Opacity

Identifiers utilized by Member Nodes can take many different forms, from automatically generated sequential or random character strings to strings that conform to schemes such as the LSID [2] and DOI [3] specifications. DataONE does not directly utilize implied functionality and services that might be available for some of the identifier schemes. This is not to say that mechanisms such as metadata retrieval for LSIDs are not used by any components of the DataONE infrastructure, but rather that the DataONE infrastructure and services have no functional dependency on such external services.

Identifiers are treated as opaque strings in the DataONE system, with no meaning inferred from any structure or pattern that may be present in them. The rules for identifier construction in DataONE are minimal and intended to ensure the practical utility of identifiers. Two restrictions apply: certain characters cannot be used within an identifier string (non-printing and whitespace characters), and the string has a maximum length (800 characters, #577). Leading and trailing white space is not allowed.

Immutability

Once assigned and registered in the DataONE infrastructure, an identifier will always refer to the same sequence of bytes. Generation of other representations of objects may be supported by services (e.g. an image may be transformed from TIFF to JPEG), but the identifier will always refer to the original form.

Resolvability

A fundamental goal of DataONE is to ensure that any identifier utilized in the system is resolvable, that is, DataONE provides a mechanism that will enable the location of the object to be determined. Resolution is handled by the Coordinating Nodes through the CNCore.resolve() method, which returns a list of nodes from which the object may be retrieved.

A guarantee of identifier resolvability is an important, core function of the DataONE infrastructure upon which many other services may be constructed, both within DataONE and by third party systems.
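
As an illustration of how a client might use the list of nodes returned by resolution, the following is a minimal sketch rather than a DataONE client implementation. The function name retrieve(), the fetch callable, and the example node URLs are all hypothetical; the sketch only assumes that resolution yields an ordered list of locations and that any of them can serve the same bytes.

```python
def retrieve(pid, locations, fetch):
    """Return the object bytes from the first location that answers.

    pid       -- the identifier being retrieved
    locations -- node base URLs, e.g. as returned by a resolve call
    fetch     -- hypothetical callable (base_url, pid) -> bytes that
                 raises OSError (or a subclass) when a node is unreachable
    """
    failures = []
    for base_url in locations:
        try:
            return fetch(base_url, pid)
        except OSError as exc:
            # Remember the failure and fall through to the next node.
            failures.append((base_url, exc))
    raise LookupError(
        "%r could not be retrieved from any of %d location(s)"
        % (pid, len(failures))
    )


def fake_fetch(base_url, pid):
    # Stand-in for a real HTTP request: the first node is "down".
    if base_url == "http://mn1.example.com/mn":
        raise ConnectionError("node unreachable")
    return b"object bytes for " + pid.encode("utf-8")


nodes = ["http://mn1.example.com/mn", "http://mn2.example.com/mn"]
print(retrieve("10.1000/182", nodes, fake_fetch))
```

Because an identifier always refers to the same sequence of bytes, it does not matter which of the listed nodes ends up serving the request.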

Granularity

Identifiers refer to managed objects in DataONE. Initially data, science metadata documents, and resource maps have identifiers. The definition of “data” is somewhat arbitrary though, and a single data object may be a single record within some larger collection, or may refer to an entire set of records contained within some package.

Structure

The characters that may appear in an identifier string acceptable to the DataONE system are constrained by the XML Schema definition (Types.Identifier), which is essentially a string of at least one and at most 800 characters with no whitespace (spaces, tabs, non-printing characters, carriage returns, new lines). Identifiers may contain Unicode characters provided they conform to the fairly liberal restrictions imposed by the XML specification [4]. Examples of valid identifiers in DataONE are shown in the section Serializing below.
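
The constraints described here can be checked before an identifier is submitted. The following is a minimal sketch, not the schema validation itself: the function name identifier_problems() is hypothetical, length is counted in Unicode code points, and "non-printing" is read as Unicode control characters (category Cc), both of which are assumptions.

```python
import unicodedata

MAX_LENGTH = 800  # maximum length quoted in this document


def identifier_problems(pid):
    """Return a list of rule violations; an empty list means the string passes."""
    problems = []
    if not pid:
        problems.append("identifier is empty")
    if len(pid) > MAX_LENGTH:
        problems.append("identifier is longer than %d characters" % MAX_LENGTH)
    for ch in pid:
        # Assumption: "non-printing" is interpreted as category Cc (controls).
        if ch.isspace() or unicodedata.category(ch) == "Cc":
            problems.append("forbidden character U+%04X" % ord(ch))
            break
    return problems


print(identifier_problems("urn:lsid:ubio.org:namebank:11815"))  # []
print(identifier_problems(" padded "))
```

Nothing in the check looks at the structure of the string, which is consistent with identifiers being treated as opaque.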

Serializing

When identifiers appear in text, the full identifier should be presented unmodified.

Identifiers appearing in URLs or other representations that have reserved characters should be escaped according to the rules of the targeted serialization format. For example, the identifiers:

10.1000/182
urn:lsid:ubio.org:namebank:11815
http://example.com/data/mydata?row=24
ldap://ldap1.example.net:6666/o=University%20of%20Michigan,c=US??sub?(cn=Babs%20Jensen)
ฉันกินกระจกได้
Is_féidir_liom_ithe_gloine

would be serialized in DataONE MNRead.get() URLs (or any other URL path) according to the RFC3986 [5] encoding guidelines for URI path segments:

http://mn.example.com/mn/object/10.1000%2F182
http://mn.example.com/mn/object/urn:lsid:ubio.org:namebank:11815
http://mn.example.com/mn/object/http:%2F%2Fexample.com%2Fdata%2Fmydata%3Frow=24
http://mn.example.com/mn/object/ldap:%2F%2Fldap1.example.net:6666%2Fo=University%2520of%2520Michigan,c=US%3F%3Fsub%3F(cn=Babs%2520Jensen)
http://mn.example.com/mn/object/%E0%B8%89%E0%B8%B1%E0%B8%99%E0%B8%81%E0%B8%B4%E0%B8%99%E0%B8%81%E0%B8%A3%E0%B8%B0%E0%B8%88%E0%B8%81%E0%B9%84%E0%B8%94%E0%B9%89
http://mn.example.com/mn/object/Is_f%C3%A9idir_liom_ithe_gloine

Note

The “+” (plus) character is a special case: it was once treated as a space character in URLs, but RFC3986 [5] changed this so that “+” is no longer treated as a space. To minimize confusion when the plus character appears in an identifier, DataONE recommends that it be percent escaped (%2B) when it appears in DataONE service URLs. All DataONE libraries and services operate in this manner.
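
The ambiguity can be seen with the two decoding functions in the Python 3 standard library, one of which follows the older plus-as-space convention. This is only an illustration of why escaping the plus character is the safer choice; the sample identifier "a+b" is made up.

```python
from urllib.parse import quote, unquote, unquote_plus

pid = "a+b"

# An unescaped '+' decodes differently depending on the convention used.
plain = unquote(pid)          # RFC3986 style: '+' stays a plus
legacy = unquote_plus(pid)    # form style: '+' becomes a space
print(repr(plain), repr(legacy))   # 'a+b' 'a b'

# Percent escaping the '+' removes the ambiguity: both decoders agree.
escaped = quote(pid, safe="")
print(escaped)                     # a%2Bb
print(unquote(escaped) == unquote_plus(escaped) == pid)   # True
```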

The necessary encoding of URLs can usually be achieved through standard libraries available in many languages, provided that the encoding follows the RFC3986 rules. Many packages over-escape, leaving only the unreserved character set unescaped. For its client libraries, DataONE takes a minimal escaping approach within the latitude RFC3986 allows. Specifically, it uses [pchar] - [‘+’] as the set of unescaped characters for identifiers in path segments, and [pchar] - [‘+’, ‘&’, ‘=’] + [‘/’, ‘?’] for identifiers in query segments (a segment in both cases meaning the characters between delimiters). For example:

example-location-dependent-__/__?__&__=__
example-common-unescaped-;:@$-_.!*()',~

will be encoded in paths to:

example-location-dependent-__%2F__%3F__&__=__
example-common-unescaped-;:@$-_.!*()',~

and encoded in the query section to:

example-location-dependent-__/__?__%26__%3D__
example-common-unescaped-;:@$-_.!*()',~

Note that RFC3986 [5] treats the query section of the URI as a black box, so ‘&’ and ‘=’ are left unescaped there (to be used as sub-delimiters). Because DataONE encodes content at the segment level, those characters need to be escaped when they are part of an identifier. Implementations that rely on standard encoding routines should check how the chosen package treats these characters.
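
The two sets of unescaped characters can be written down directly with the Python 3 standard library. This is a minimal sketch under stated assumptions, not the DataONE client library code: the names PATH_SAFE, QUERY_SAFE, encode_path_segment() and encode_query_segment() are chosen here for illustration, and the character classes are transcribed from RFC3986.

```python
import string
from urllib.parse import quote

# RFC3986 character classes.
UNRESERVED = string.ascii_letters + string.digits + "-._~"
SUB_DELIMS = "!$&'()*+,;="
PCHAR = UNRESERVED + SUB_DELIMS + ":@"   # '%' is absent, so it gets escaped

# [pchar] - ['+']
PATH_SAFE = "".join(c for c in PCHAR if c != "+")
# [pchar] - ['+', '&', '='] + ['/', '?']
QUERY_SAFE = "".join(c for c in PATH_SAFE if c not in "&=") + "/?"


def encode_path_segment(pid):
    """Percent encode the UTF-8 bytes of pid for use as a URL path segment."""
    return quote(pid, safe=PATH_SAFE, encoding="utf-8")


def encode_query_segment(pid):
    """Percent encode the UTF-8 bytes of pid for use inside a query string."""
    return quote(pid, safe=QUERY_SAFE, encoding="utf-8")


sample = "example-location-dependent-__/__?__&__=__"
print(encode_path_segment(sample))    # example-location-dependent-__%2F__%3F__&__=__
print(encode_query_segment(sample))   # example-location-dependent-__/__?__%26__%3D__
```

Passing an explicit safe set is what keeps a general-purpose routine such as quote() from over-escaping or from leaving the plus character untouched.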

The following examples illustrate percent encoding of data, such as an identifier, so that it can be appended to a URL. The Python and Java programs each read UTF-8 encoded input from stdin and write the percent encoded or decoded result to stdout. The general process is outlined first in Java-like pseudo-code.

// pseudo-code: this will not compile!

CharacterSet PATH_SAFE = RFC3986_PCHAR and not ['+'];
CharacterSet QUERY_SAFE = PATH_SAFE and not ['&','='] or ['?','/'];

String encodeUtf8_pathSegment(identifier) {
    // convert to UTF-8, then escape everything outside PATH_SAFE
    String utf8ID = identifier.translate("UTF-8");
    return percentEscape(utf8ID, PATH_SAFE);
}

String encodeUtf8_querySegment(identifier) {
    // convert to UTF-8, then escape everything outside QUERY_SAFE
    String utf8ID = identifier.translate("UTF-8");
    return percentEscape(utf8ID, QUERY_SAFE);
}

String decodeString(string) {
    // Decoders that follow the older convention read '+' as a space.
    // Escape any literal '+' first so that it decodes to a plus
    // character rather than a space.

    String correctedString = string.replace("+","%2B");
    return decodePercentEscaped(correctedString);
}

Python 3 (PctEncode.py):

import sys
from urllib.parse import quote, unquote


def pctEncode(data):
  '''Encode the string data as UTF-8, then percent encode the result
  ready for appending as a path element to a URL.
  '''
  return quote(data, safe=":", encoding="utf-8")


def pctDecode(data):
  '''Decode a percent encoded string and return the decoded string,
  first escaping any literal '+' so it is never read as a space.
  '''
  data = data.replace("+", "%2B")
  return unquote(data, encoding="utf-8")


if __name__ == "__main__":
  # Read UTF-8 encoded input from stdin and percent encode or
  # decode (with command line argument -d).
  #
  # e.g. given test_ids.txt, a UTF-8 encoded file with identifiers
  # appearing one per line:
  #   cat test_ids.txt | python PctEncode.py | python PctEncode.py -d
  #
  # should output equivalent to:
  #   cat test_ids.txt
  doEncode = not (len(sys.argv) > 1 and sys.argv[1] == "-d")
  # Use UTF-8 on both streams regardless of the locale (Python 3.7+).
  sys.stdin.reconfigure(encoding="utf-8")
  sys.stdout.reconfigure(encoding="utf-8")
  for line in sys.stdin:
    pid = line.strip()
    if doEncode:
      print(pctEncode(pid))
    else:
      print(pctDecode(pid))

Java (class PctEncode):

import java.io.*;
import java.net.*;

class PctEncode
{
  /**
  Simple example of URL path encoding of UTF-8 strings for including as
  path elements in URLs as per RFC3986.

  e.g. given test_ids.txt, a UTF-8 encoded file with identifiers
  appearing one per line:
    cat test_ids.txt | java PctEncode | java PctEncode -d

  should output equivalent to:
    cat test_ids.txt
  */

  public static String pctDecode(String data) {
    /**
    Decode a percent encoded string, returning a Java Unicode string
    */
    String response = null;
    try {
      data = data.replace("+","%2B");
      response = URLDecoder.decode( data, "UTF-8");
    } catch (java.io.UnsupportedEncodingException e) {
      System.out.println("Error pctDecode : " + e.getMessage());
    }
    return response;
  }


  public static String pctEncodePathSegment(String data) {
    /**
    Encode a Java string according to the path encoding rules in
    RFC3986. Note that this does not encode properly for data that
    is to be the root of the path, it is assumed that the data will
    be appended to the end of a URL path.
    */
    String response = null;
    try {
      response = URLEncoder.encode( data, "UTF-8" );
      // fix outdated space-to-+ convention
      response = response.replace("+","%20");
      // now un-escape for minimally escaped result
      response = response.replace("%3A",":").replace("%28","(");
      response = response.replace("%3B",";").replace("%29",")");
      response = response.replace("%40","@").replace("%27","'");
      response = response.replace("%24","$").replace("%2C",",");
      response = response.replace("%21","!").replace("%7E","~");

    } catch (java.io.UnsupportedEncodingException e) {
      System.out.println("Error  pctEncode: " + e.getMessage());
    }
    return response;
  }


  public static void main( String[] args ) {
    try {
      boolean doEncode = true;
      try {
        if (args[0].equals( "-d" ))
          doEncode = false;
      } catch(ArrayIndexOutOfBoundsException e) {
      }

      PrintStream outs = new PrintStream( System.out, true, "UTF-8" );
      InputStreamReader isr = new InputStreamReader( System.in, "UTF-8" );
      BufferedReader reader = new BufferedReader( isr );
      String id = null;
      String data = null;
      while ( (id = reader.readLine()) != null ) {
        if (doEncode) {
          data = pctEncodePathSegment( id );
        } else {
          data = pctDecode( id );
        }
        outs.println( data );
      }
    } catch(java.io.IOException e) {
      System.out.println("Error main: " + e.getMessage());
    }
  }
}

Given this code and a utf-8 encoded source file test_ids.txt such as:

ö
10.1000/182
urn:lsid:ubio.org:namebank:11815
http://example.com/data/mydata?row=24
ldap://ldap1.example.net:6666/o=University%20of%20Michigan,%20c=US??sub?(cn=Babs%20Jensen)
ฉันกินกระจกได้
Is_féidir_liom_ithe_gloine

The following commands should output the same as cat test_ids.txt:

cat test_ids.txt | java PctEncode | python PctEncode.py -d
cat test_ids.txt | python PctEncode.py | java PctEncode -d

[1] http://n2t.net/ezid/
[2] http://lsids.sourceforge.net/
[3] http://www.doi.org/
[4] http://www.w3.org/TR/xml11/#charsets
[5] http://tools.ietf.org/html/rfc3986
