DataONE API - 2.0

Event Logging and Reporting

The DataONE system should log various interactions and operations in the system to provide operational status information about the entire system, to report on specific node operations, and to inform DataONE participants (users, contributors, administrators) about their specific domain of interest in the system. For example, a contributor might like to monitor use of their data and where it is being replicated to. The methods MNCore.getLogRecords() and CNCore.getLogRecords() provide outward facing services for retrieving log information from member and coordinating nodes respectively.

Logging is described in use cases 16, 17, 18, 20, and potentially 19.

Use Cases to be Supported

  • UC 16 specifies “all CRUD operations on metadata and data are logged at each node”
  • UC 17 indicates that all CRUD operation logs should be aggregated at the CNs
  • UC 18 indicates that a MN can retrieve aggregated logs about content that originated from that MN.
  • UC 20 indicates that a data owner (original contributor, delegated owner) can retrieve aggregated logs about objects they own.
  • UC 19 indicates that anyone can retrieve general use information for any object in DataONE.

Performance Metrics to be Reported

The performance metrics survey results from the Leadership Team specify (at least) the following metrics should be captured. Items that may be captured from the CI portion of the project are indicated by !!!.

Size and Diversity of DataONE Data, Metadata, and Investigator Toolkit Holdings

  1. !!! Data volume – total size of data holdings in DataONE Member Nodes and Coordinating Nodes
Recorded in:
  • Total (including replicas) data volume + unique object data volume
  • Sysmeta
  1. !!! Number of metadata records – quantity of metadata records held at a Coordinating Node (note: the concept of a record may vary across metadata standards)
    • science metadata only
  2. !!! Number of data sets held by Member Nodes – (note: may be less meaningful as an absolute value because of the immense variability in data set granularity, but probably still useful in tracking the shape of the data accumulation curve)
  • system metadata
  1. !!! Number and types of software tools included in the Investigator Toolkit
  • !!
  • Mechanism to register external uses / implementations
  • HTTP user-agent
  1. !!! Number of different metadata schemas supported – (note: more metadata schemas is not necessarily better)
  • object format from sysmeta

DataONE System Capacity

  1. !!! Number of Member Nodes
  • registry
  1. !!! Total storage capacity at Member Nodes
  • per member node
  • total
  • part of MN registration and capabilities
  1. !!! Geographic coverage of Member Nodes – continents, regions, and countries covered
    • could potentially be part of capabilities (record physical location)
  2. !!! Number of Coordinating Nodes
    • constant = 3
    • general expansion over five years to all countries.
  3. !!! Total storage capacity at Coordinating Nodes
  • disk space
  1. !!! Geographic coverage of Coordinating Nodes – continents, regions, and countries covered
  • capabilities / metadata

DataONE Usage Statistics

  1. !!! (CN LOG) Number of web hits on DataONE portal
  • standard web hits from logs
  • aggregated across CNs
  1. !!! (CN LOG) Number of DataONE users – (note: recording of individual IP addresses may be most readily implemented; requiring users to login to Member Nodes is not presently required)
    • Any users of MNs = D1 user
    • Eventually can use authn subsystem
  2. !!! (supporting web site logs) Number of downloads of tools from the Investigator Toolkit –> “Number of times tools are downloaded”
    • Log analysis
    • Text needs to be clarified. ITK isn’t a place
    • supporting information (videos, documents, ...)
  3. !!! (CN LOG) Number of metadata catalog searches completed – (note: over time it may also be desirable to assess precision and recall of incoming searches)
    • part of the Mercury log (in mysql)
  4. !!! (MN, CN LOG) Number of DataONE datasets downloaded (daily, weekly, monthly, annually) – (note: this may be straightforward if included in specifications for Member Node data, impossible otherwise)
    • Member node access logs - need to be aggregated
    • What was pulled through D1.get() vs the native mechanisms of the MN
  5. !!! (MN, CN LOG) Most frequently downloaded datasets
  • Same as 16

Reliability and System Performance

  1. !!! (CN heartbeat logs) Uptime (availability) of Coordinating Nodes
  • Need a monitoring service in addition to the CN service
  • also need to consider geographic accessibility (users)
  1. !!! (MN heartbeat logs) Uptime (availability) of Member Nodes – (note: tracked if heartbeat is fully implemented)
  • Same as 18
  1. !!! Server response time
  • REST service performance
  • Define a bunch of test queries that can be executed in parallel for load testing.
  1. !!! Response time of user interface
  • Time for “page load” vs. number of concurrent users
  • Time for specific operations (test queries, test renderings, ...)

Community Engagement

  1. Baseline assessment of scientists completed
  2. Number of repeat assessments of scientists completed
  3. Baseline assessment of other stakeholders completed
  4. Number of repeat assessments of other stakeholders completed
  5. Number of DataONE Partnership Agreements established

Education and Outreach

  1. Number of education modules developed and/or accessible through DataONE
  2. Number of times education modules are downloaded
  3. Number of best practices guides developed and/or accessible through DataONE
  4. Number of times best practices guides are downloaded
  5. Number of training sessions or workshops offered (e.g., at meetings)
  6. Number of workshop participants
  7. Number of people in DataONE International Users Group

Sustainability

  1. Amount of revenue (including in-kind support) generated annually to support DataONE
  2. Diversity of revenue streams – e.g., government, private foundations, commercial for-profit sector, etc.
  3. Number of projects and partners collaborating with DataONE or leveraging DataONE infrastructure

Union of Use Cases and Metrics

The following bullets represent the union of logging information indicated in the use cases and the metrics that can be reported from the logs. The information logged and suitable summarization and extraction procedures need to be identified to ensure the following items can be addressed:

  • all CRUD operations on metadata and data are logged at each node
  • all CRUD operation logs should be aggregated at the CNs
  • an MN can retrieve aggregated logs about content that originated from that MN.
  • A data owner (original contributor, delegated owner) can retrieve aggregated logs about objects they own.
  • Anyone can retrieve general use information for any object in DataONE.
  • (metric) Number of web hits on DataONE portal
  • (metric) Number of DataONE users
  • (metric) Number of downloads of tools from the Investigator Toolkit (From the download site logs)
  • (metric) Number of metadata catalog searches completed
  • (metric) Number of DataONE datasets downloaded (daily, weekly, monthly, annually)
  • (metric) Most frequently downloaded datasets