DataONE Usage Statistics

Overview

DataONE Member Nodes and Coordinating Nodes record access events that result from DataONE API calls. A list of access events and the API calls that logged these events is shown in Table 1.

Table 1. Access Events

Access event  DataONE MN API call          Metacat API call
create        MNStorage.create()           action=insert
delete        MNStorage.delete()           action=delete
read          MNRead.get()                 action=read
replicate     MNReplication.replicate()
update        MNStorage.update()           action=update

The content of the access event log records is described here: LoggingSchema.html.

The access event log records are harvested from each MN in the network and aggregated into a common search index by the Log Aggregation Facility, which is described here: LogAggregator.html. The Event Log Index is implemented as an Apache Solr instance and can be queried with standard Solr queries at the DataONE service endpoint https://cn.dataone.org/cn/v1/query/logsolr.

The Solr search platform provides query capabilities such as field faceting, range filtering, and numeric field statistics. Applied to the access events harvested from the MNs, these capabilities yield network-wide usage statistics from a single search index.

The section Statistics Service Usage gives several examples of usage information that can be obtained from the Event Log Index.
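
As a minimal sketch of how a client might assemble such a query, the Python fragment below builds a request URL for the endpoint named above using only the standard library. The helper name build_log_query and its arguments are illustrative assumptions, not part of any DataONE client library. The code only constructs the URL and does not contact the service.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint as given in the text above.
LOGSOLR_ENDPOINT = "https://cn.dataone.org/cn/v1/query/logsolr"


def build_log_query(q="*:*", filters=(), extra=None):
    """Return a URL for a Solr query against the Event Log Index.

    Hypothetical helper: each string in `filters` becomes one `fq`
    parameter, and `extra` holds any other Solr parameters by name.
    """
    pairs = [("q", q)]
    pairs += [("fq", f) for f in filters]
    pairs += list((extra or {}).items())
    # urlencode percent-encodes characters such as ':' and '*'.
    return LOGSOLR_ENDPOINT + "/?" + urlencode(pairs)


url = build_log_query(
    filters=["event:read"],
    extra={"facet": "true", "facet.field": "nodeId", "rows": 0},
)
print(url)
# Decoding the query string recovers the original parameters.
print(parse_qs(urlsplit(url).query))
```

Passing the parameters as a list of pairs lets fq appear more than once, which is how the example queries later in this document combine several filters.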

Event Log Index

Table 2. Solr index schema

Name                Solr Type  Comment
id                  string     added after harvest
dateAggregated      date       added after harvest
isPublic            boolean    added after harvest, obtained from systemmetadata
readPermission      string     added after harvest, obtained from systemmetadata, filtered during query
entryId             string     obtained from MN event log
pid                 string     added after harvest, obtained from systemmetadata
ipAddress           string     obtained from MN event log, filtered during query
userAgent           string     obtained from MN event log
subject             string     obtained from MN event log, filtered during query
event               string     obtained from MN event log
dateLogged          date       obtained from MN event log
nodeId              string     obtained from MN event log
rightsHolder        string     added after harvest, obtained from systemmetadata, filtered during query
formatId            string     added after harvest, obtained from systemmetadata
formatType          string     added after harvest, obtained from systemmetadata
size                slong      added after harvest, obtained from systemmetadata
country             string     added after harvest, determined from ipAddress
region              string     added after harvest, determined from ipAddress
city                string     added after harvest, determined from ipAddress
geohash_1           string     added after harvest, determined from ipAddress
geohash_2           string     added after harvest, determined from ipAddress
geohash_3           string     added after harvest, determined from ipAddress
geohash_4           string     added after harvest, determined from ipAddress
geohash_5           string     added after harvest, determined from ipAddress
geohash_6           string     added after harvest, determined from ipAddress
geohash_7           string     added after harvest, determined from ipAddress
geohash_8           string     added after harvest, determined from ipAddress
geohash_9           string     added after harvest, determined from ipAddress
location            location   added after harvest, determined from ipAddress
inFullRobotList     boolean    added after harvest, determined based on log processing for COUNTER compliance
inPartialRobotList  boolean    added after harvest, determined based on log processing for COUNTER compliance
isRepeatVisit       boolean    added after harvest, determined based on log processing for COUNTER compliance
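
The schema does not say how the geohash_1 through geohash_9 fields are derived. One plausible reading, offered here purely as an assumption, is that geohash_N holds the first N characters of the standard geohash of the location resolved from the IP address, so that faceting on a shorter field groups events into coarser cells. The sketch below implements the standard geohash encoding; the function names are hypothetical and the coordinates are an arbitrary test point.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat, lon, length=9):
    """Encode a latitude/longitude pair as a geohash of `length` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < length:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits map to one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def geohash_fields(lat, lon):
    """Assumed layout: geohash_N is the N-character prefix (N = 1..9)."""
    full = geohash(lat, lon, 9)
    return {f"geohash_{n}": full[:n] for n in range(1, 10)}


fields = geohash_fields(57.64911, 10.40744)
print(fields["geohash_5"])
```

Under this assumption, a facet on geohash_3 would aggregate events over much larger cells than a facet on geohash_7.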

Access to Event Log Index

Access to the Event Log Index adheres to the DataONE identity and authentication protocols described here: Authentication.html. The level of access allowed when querying the index is determined by your DataONE Authentication Session Identity.

CN Administrators

CN Administrators have full access to the index and can therefore select index entries based on any field and can view the entire contents of the index entries.

Authenticated session access

Clients (e.g. web browsers) that have established an authenticated session using a DataONE identity have access to information for any pids for which they are the rightsHolder, or pids for which they have an access policy granting write access. For example, if the authenticated subject is 'uid=smith,o=NCEAS,dc=ecoinformatics,dc=org' then the client can query index entries for pids that have access policies allowing write access to that subject. This level of access allows only summary information to be viewed; the full content of index entries cannot be viewed.

Public Access

All other access is considered non-privileged public access, in which case only index entries associated with pids that have an access policy granting public read can be queried. This level of access allows only summary information to be viewed; the full content of index entries cannot be viewed.

In addition to these access rules, certain fields are considered sensitive such that they cannot be included in Solr field queries (i.e. &fq=<field name>) or included in Solr facet queries (i.e. &facet.field=<field name>). The fields from the Event Log Index that are considered sensitive are rightsHolder, ipAddress, subject and readPermission.
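
How the service enforces this restriction is not described here. As a rough client-side illustration only, the sketch below inspects a query URL and reports which of the four sensitive fields it references in fq or facet.field parameters. The function name sensitive_fields_used is hypothetical, and the field-name parsing is deliberately simple (it looks only at the text before the first colon of each fq value).

```python
from urllib.parse import parse_qs, urlsplit

# Sensitive fields as listed in the text above.
SENSITIVE_FIELDS = {"rightsHolder", "ipAddress", "subject", "readPermission"}


def sensitive_fields_used(url):
    """Return the sensitive fields referenced by fq or facet.field parameters."""
    params = parse_qs(urlsplit(url).query)
    used = set()
    for fq in params.get("fq", []):
        # Take the field name before the first ':' and drop a leading +/-.
        field = fq.split(":", 1)[0].lstrip("+-")
        if field in SENSITIVE_FIELDS:
            used.add(field)
    for field in params.get("facet.field", []):
        if field in SENSITIVE_FIELDS:
            used.add(field)
    return used


example = (
    "https://cn.dataone.org/cn/v2/query/logsolr/"
    "?q=*:*&fq=subject:uid*smith*&fq=event:read&facet=true&facet.field=event"
)
print(sensitive_fields_used(example))
```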

COUNTER Compliance

While unfiltered log records are useful for some system monitoring and related activities, scientifically-meaningful analysis of log records requires that we correct log records for common events that would otherwise artificially inflate the statistics, such as access by web-indexing robots and multiple accesses from the same individual. Within the publishing community, the COUNTER standard has been used to provide a consistent set of guidelines as to how resource access statistics should be reported. To be COUNTER-compliant, DataONE provides three filters on log files:

  1. Only allow status 200 and 304 on READ requests

    This ensures that redirect requests (302) are only counted once, and that unsuccessful requests are ignored.

  2. Exclude robots

    This ensures that the myriad web-robots that constantly index web-accessible content do not artificially inflate results.

  3. Exclude repeat visits within a certain time window

    This ensures that accidental double-clicks on a link or repeated requests from a client tool in a short time period are only counted once.

Compliance with these three COUNTER requirements is implemented as two boolean index fields (isRepeatVisit and inFullRobotList) which, for each record, indicate whether the record adheres to the COUNTER standards outlined above. Client queries that should report only COUNTER-compliant results need only add filter expressions to the query (fq=isRepeatVisit:false and fq=inFullRobotList:false), and all non-compliant records will be excluded from the usage statistics reports.

The field inFullRobotList indicates whether or not the logged event originated from a request issued by a user agent found in the full list of web robots, with the value true indicating that the user agent is a web robot, and thus the event record is not COUNTER compliant.

DataONE will maintain a list of known Internet robots to be used for filtering, and this list will be updated periodically, at least annually, as new robots become known.

The field isRepeatVisit indicates whether or not a duplicate request has occurred for the same IP address and pid within a certain time window (currently 30 seconds), with a value of true indicating that an entry is a repeat request.
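
The repeat-visit rule can be sketched as follows. This is an illustration under stated assumptions, not the actual log processing code: it assumes the window is measured from the immediately preceding request for the same IP address and pid, and the function name, IP address, and pids are made up for the example.

```python
from datetime import datetime, timedelta

# Window length as stated in the text above.
REPEAT_WINDOW = timedelta(seconds=30)


def mark_repeat_visits(events, window=REPEAT_WINDOW):
    """Flag events that repeat an earlier (ip, pid) request within `window`.

    `events` is an iterable of (ip_address, pid, date_logged) tuples.
    Returns (ip_address, pid, date_logged, is_repeat) tuples in time order.
    """
    last_seen = {}
    marked = []
    for ip, pid, when in sorted(events, key=lambda e: e[2]):
        previous = last_seen.get((ip, pid))
        is_repeat = previous is not None and when - previous <= window
        marked.append((ip, pid, when, is_repeat))
        last_seen[(ip, pid)] = when
    return marked


t0 = datetime(2014, 3, 1, 12, 0, 0)
events = [
    ("192.0.2.1", "pid-A", t0),
    ("192.0.2.1", "pid-B", t0 + timedelta(seconds=5)),
    ("192.0.2.1", "pid-A", t0 + timedelta(seconds=10)),
    ("192.0.2.1", "pid-A", t0 + timedelta(seconds=50)),
]
flags = [is_repeat for _, _, _, is_repeat in mark_repeat_visits(events)]
print(flags)  # [False, False, True, False]
```

Only the third event is flagged: it follows a request for the same pid from the same address by 10 seconds, while the fourth arrives 40 seconds after the third.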

The following query will return the count of all read events that have passed the COUNTER compliance tests:

https://cn.dataone.org/cn/v2/query/logsolr/?q=event:read&fq=inFullRobotList:false&fq=isRepeatVisit:false

The event index is updated once a day with event entries from all active member nodes, with the most current information being from the previous day.

In addition to the COUNTER-related fields, the field inPartialRobotList indicates whether or not the user agent was found in a list that contains a subset of the full robots list. This list represents a less strict interpretation of which user agents are considered web robots and does not include user agents such as ‘java’, ‘libwww’, and ‘Wget’. A value of true indicates that a match was found in the less strict web robots list. This field is not used in COUNTER compliance filtering.
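
The relationship between the two robot lists can be illustrated with a small sketch. The patterns below are placeholders chosen for the example and are not the lists DataONE maintains; only ‘java’, ‘libwww’ and ‘Wget’ are taken from the text above, and the matching method (case-insensitive substring search) is an assumption.

```python
import re

# Hypothetical partial list: crawlers flagged under both interpretations.
PARTIAL_ROBOT_PATTERNS = [r"googlebot", r"bingbot"]
# The full list is a superset that also covers generic client tools.
FULL_ROBOT_PATTERNS = PARTIAL_ROBOT_PATTERNS + [r"java", r"libwww", r"wget"]


def matches_any(user_agent, patterns):
    """Return True if any pattern occurs in the user agent string."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)


def robot_flags(user_agent):
    """Compute both robot flags for one user agent string."""
    return {
        "inFullRobotList": matches_any(user_agent, FULL_ROBOT_PATTERNS),
        "inPartialRobotList": matches_any(user_agent, PARTIAL_ROBOT_PATTERNS),
    }


print(robot_flags("Wget/1.14 (linux-gnu)"))
print(robot_flags("Mozilla/5.0 (compatible; Googlebot/2.1)"))
```

With these placeholder lists, a Wget request is excluded by the COUNTER filter (inFullRobotList is true) but is not treated as a robot under the less strict list.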

Statistics Service Usage

The following sections show example queries that can be sent to the Event Log Solr index. Note: to make the examples easier to read, the output of some of the example queries has been edited, with removed lines replaced with ellipses, i.e. ‘...’.

Retrieve pids for a specified subject

The following example shows a query for download volume for pids read by subjects matching uid*smith*, with download size statistics aggregated by pid:

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=subject:uid*smith*&fq=event:read&stats=true&stats.field=size&rows=0&stats.facet=pid

The following result is returned:

<?xml version="1.0"?>
<response>
  ...
  <result name="response" numFound="96" start="0"/>
  <lst name="stats">
    <lst name="stats_fields">
      <lst name="size">
        <double name="min">135.0</double>
        <double name="max">1.5209072E8</double>
        <double name="sum">1.082767665E9</double>
        <long name="count">96</long>
        <long name="missing">0</long>
        <double name="sumOfSquares">1.13751276670495792E17</double>
        <double name="mean">1.127882984375E7</double>
        <double name="stddev">3.2692977584385287E7</double>
        <lst name="facets">
          <lst name="pid">
            <lst name="doi:10.6085/AA/pisco_intertidal.45.1">
              <double name="min">2.8738045E7</double>
              <double name="max">2.8738045E7</double>
              <double name="sum">2.8738045E7</double>
              <long name="count">1</long>
              <long name="missing">0</long>
              <double name="sumOfSquares">8.25875230422025E14</double>
              <double name="mean">2.8738045E7</double>
              <double name="stddev">0.0</double>
            </lst>
            <lst name="doi:10.6085/AA/MLPA_intertidal.30.10">
              <double name="min">2984.0</double>
              <double name="max">2984.0</double>
              <double name="sum">11936.0</double>
              <long name="count">4</long>
              <long name="missing">0</long>
              <double name="sumOfSquares">3.5617024E7</double>
              <double name="mean">2984.0</double>
              <double name="stddev">0.0</double>
            </lst>
            <lst name="doi:10.6085/AA/pisco_snbs.19.1">
              <double name="min">52335.0</double>
              <double name="max">52335.0</double>
              <double name="sum">104670.0</double>
              <long name="count">2</long>
              <long name="missing">0</long>
              <double name="sumOfSquares">5.47790445E9</double>
              <double name="mean">52335.0</double>
              <double name="stddev">0.0</double>
            </lst>
            ...
          </lst>
        </lst>
      </lst>
    </lst>
  </lst>
</response>
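
A client would typically reduce a response like this to per-pid totals. The following sketch parses a trimmed copy of the output above with Python's standard xml.etree.ElementTree module. The function name bytes_by_pid is hypothetical, and the sample keeps only the elements the function reads.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the stats response shown above.
SAMPLE_STATS = """<response>
  <result name="response" numFound="96" start="0"/>
  <lst name="stats">
    <lst name="stats_fields">
      <lst name="size">
        <double name="sum">1.082767665E9</double>
        <long name="count">96</long>
        <lst name="facets">
          <lst name="pid">
            <lst name="doi:10.6085/AA/MLPA_intertidal.30.10">
              <double name="sum">11936.0</double>
              <long name="count">4</long>
            </lst>
            <lst name="doi:10.6085/AA/pisco_snbs.19.1">
              <double name="sum">104670.0</double>
              <long name="count">2</long>
            </lst>
          </lst>
        </lst>
      </lst>
    </lst>
  </lst>
</response>"""


def bytes_by_pid(xml_text):
    """Map each pid to (event count, total bytes) from a stats.facet=pid response."""
    root = ET.fromstring(xml_text)
    pid_facets = root.find(
        "./lst[@name='stats']/lst[@name='stats_fields']"
        "/lst[@name='size']/lst[@name='facets']/lst[@name='pid']"
    )
    totals = {}
    if pid_facets is None:
        return totals
    for entry in pid_facets.findall("lst"):
        count = int(entry.find("long[@name='count']").text)
        total = float(entry.find("double[@name='sum']").text)
        totals[entry.get("name")] = (count, total)
    return totals


for pid, (count, total) in bytes_by_pid(SAMPLE_STATS).items():
    print(pid, count, total)
```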

The previous query can be constrained to a specific time period by adding a time range filter, e.g.:

&fq=dateLogged:[2013-01-01T00:00:00Z TO 2013-12-31T23:59:59Z]

or using Solr date range key words:

&fq=dateLogged:[NOW-1MONTH TO NOW]
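
When the range comes from program data rather than being typed by hand, the filter value can be generated. This is a small sketch using the standard library; the helper names are hypothetical, and it assumes timezone-aware datetimes so that the conversion to UTC is unambiguous.

```python
from datetime import datetime, timezone


def solr_date(dt):
    """Format a timezone-aware datetime as a Solr UTC timestamp."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def date_range_filter(start, end, field="dateLogged"):
    """Build the value of an fq parameter covering [start, end]."""
    return f"{field}:[{solr_date(start)} TO {solr_date(end)}]"


year_2013 = date_range_filter(
    datetime(2013, 1, 1, tzinfo=timezone.utc),
    datetime(2013, 12, 31, 23, 59, 59, tzinfo=timezone.utc),
)
print(year_2013)
```

The resulting string still needs URL encoding before it is appended to a query as an fq parameter.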

Data upload counts

The following query shows counts of data uploads by format type by a specified rightsHolder (PISCO):

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&facet=true&fq=rightsHolder:uid*PISCO*&fq=event:create&facet.field=formatId&facet.mincount=1
<?xml version="1.0"?>
<response>
  ...
  <result name="response" numFound="40928" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="formatId">
        <int name="eml://ecoinformatics.org/eml-2.0.1">32932</int>
        <int name="text/csv">5236</int>
        <int name="application/octet-stream">2570</int>
        <int name="eml://ecoinformatics.org/eml-2.0.0">100</int>
        <int name="eml://ecoinformatics.org/eml-2.1.0">28</int>
        <int name="-//ecoinformatics.org//eml-dataset-2.0.0beta6//EN">19</int>
        <int name="-//ecoinformatics.org//eml-entity-2.0.0beta6//EN">12</int>
        <int name="-//ecoinformatics.org//eml-attribute-2.0.0beta6//EN">11</int>
        <int name="-//ecoinformatics.org//eml-access-2.0.0beta6//EN">7</int>
        <int name="-//ecoinformatics.org//eml-physical-2.0.0beta6//EN">6</int>
        <int name="image/jpeg">3</int>
        <int name="text/plain">3</int>
        <int name="-//ecoinformatics.org//eml-project-2.0.0beta6//EN">1</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
</response>
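
The facet counts in this response can be read into a dictionary with a few lines of Python. The sketch below embeds the formatId counts shown above and uses the standard xml.etree.ElementTree module; the function name facet_counts is an illustrative choice. As a consistency check, the counts add up to the numFound value of the response.

```python
import xml.etree.ElementTree as ET

# Facet section of the response shown above.
SAMPLE_FACETS = """<response>
  <result name="response" numFound="40928" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="formatId">
        <int name="eml://ecoinformatics.org/eml-2.0.1">32932</int>
        <int name="text/csv">5236</int>
        <int name="application/octet-stream">2570</int>
        <int name="eml://ecoinformatics.org/eml-2.0.0">100</int>
        <int name="eml://ecoinformatics.org/eml-2.1.0">28</int>
        <int name="-//ecoinformatics.org//eml-dataset-2.0.0beta6//EN">19</int>
        <int name="-//ecoinformatics.org//eml-entity-2.0.0beta6//EN">12</int>
        <int name="-//ecoinformatics.org//eml-attribute-2.0.0beta6//EN">11</int>
        <int name="-//ecoinformatics.org//eml-access-2.0.0beta6//EN">7</int>
        <int name="-//ecoinformatics.org//eml-physical-2.0.0beta6//EN">6</int>
        <int name="image/jpeg">3</int>
        <int name="text/plain">3</int>
        <int name="-//ecoinformatics.org//eml-project-2.0.0beta6//EN">1</int>
      </lst>
    </lst>
  </lst>
</response>"""


def facet_counts(xml_text, field):
    """Return {facet value: count} for one facet.field in a Solr XML response."""
    root = ET.fromstring(xml_text)
    node = root.find(
        "./lst[@name='facet_counts']/lst[@name='facet_fields']"
        f"/lst[@name='{field}']"
    )
    if node is None:
        return {}
    return {item.get("name"): int(item.text) for item in node.findall("int")}


uploads = facet_counts(SAMPLE_FACETS, "formatId")
print(uploads["text/csv"], sum(uploads.values()))
```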

Data download counts by month

The following query shows download counts, for each month in 2013, of objects held by a specific rightsHolder:

https://cn.dataone.org/cn/v1/query/logsolr/select?q=*:*&fq=rightsHolder:uid*PISCO*&fq=event:read&facet=true&facet.range=dateLogged&facet.range.start=2013-01-01T01:01:01Z&facet.range.end=2013-12-31T23:59:59Z&facet.range.gap=%2B1MONTH
<?xml version="1.0"?>
<response>
   ...
  <result name="response" numFound="3623404" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields"/>
    <lst name="facet_dates"/>
    <lst name="facet_ranges">
      <lst name="dateLogged">
        <lst name="counts">
          <int name="2013-01-01T01:01:01Z">56962</int>
          <int name="2013-02-01T01:01:01Z">23656</int>
          <int name="2013-03-01T01:01:01Z">46167</int>
          <int name="2013-04-01T01:01:01Z">58562</int>
          <int name="2013-05-01T01:01:01Z">65192</int>
          <int name="2013-06-01T01:01:01Z">203082</int>
          <int name="2013-07-01T01:01:01Z">66013</int>
          <int name="2013-08-01T01:01:01Z">92320</int>
          <int name="2013-09-01T01:01:01Z">23059</int>
          <int name="2013-10-01T01:01:01Z">16135</int>
          <int name="2013-11-01T01:01:01Z">73831</int>
          <int name="2013-12-01T01:01:01Z">44968</int>
        </lst>
        <str name="gap">+1MONTH</str>
        <date name="start">2013-01-01T01:01:01Z</date>
        <date name="end">2014-01-01T01:01:01Z</date>
      </lst>
    </lst>
  </lst>
</response>
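
Range facets are nested one level deeper than field facets, under a counts element. As a sketch, the code below extracts the monthly buckets from a shortened copy of the output above and picks the busiest month; the function name range_counts is hypothetical, and only three of the twelve months are included to keep the sample short.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the range facet response shown above.
SAMPLE_RANGES = """<response>
  <result name="response" numFound="3623404" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_ranges">
      <lst name="dateLogged">
        <lst name="counts">
          <int name="2013-05-01T01:01:01Z">65192</int>
          <int name="2013-06-01T01:01:01Z">203082</int>
          <int name="2013-07-01T01:01:01Z">66013</int>
        </lst>
        <str name="gap">+1MONTH</str>
      </lst>
    </lst>
  </lst>
</response>"""


def range_counts(xml_text, field="dateLogged"):
    """Return [(bucket start, count), ...] for one facet.range field."""
    root = ET.fromstring(xml_text)
    counts = root.find(
        "./lst[@name='facet_counts']/lst[@name='facet_ranges']"
        f"/lst[@name='{field}']/lst[@name='counts']"
    )
    if counts is None:
        return []
    return [(item.get("name"), int(item.text)) for item in counts.findall("int")]


monthly = range_counts(SAMPLE_RANGES)
busiest = max(monthly, key=lambda pair: pair[1])
print(busiest)
```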

Read counts for format type EML

The following query shows all EML metadata activity, by event type and for each month in 2013, for objects held by a specific rightsHolder:

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=rightsHolder:uid*PISCO*&fq=formatId:eml*&facet=true&facet.field=event&facet.range=dateLogged&facet.range.start=2013-01-01T01:01:01Z&facet.range.end=2013-12-31T23:59:59Z&facet.range.gap=%2B1MONTH
<?xml version="1.0"?>
<response>
  ...
  <result name="response" numFound="3504705" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="event">
        <int name="read">3327009</int>
        <int name="delete">51249</int>
        <int name="update">47593</int>
        <int name="synchronization_failed">45752</int>
        <int name="create">33060</int>
        <int name="replicate">42</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges">
      <lst name="dateLogged">
        <lst name="counts">
          <int name="2013-01-01T01:01:01Z">54815</int>
          <int name="2013-02-01T01:01:01Z">18652</int>
          <int name="2013-03-01T01:01:01Z">45043</int>
          <int name="2013-04-01T01:01:01Z">58420</int>
          <int name="2013-05-01T01:01:01Z">64208</int>
          <int name="2013-06-01T01:01:01Z">136014</int>
          <int name="2013-07-01T01:01:01Z">65417</int>
          <int name="2013-08-01T01:01:01Z">92103</int>
          <int name="2013-09-01T01:01:01Z">22899</int>
          <int name="2013-10-01T01:01:01Z">15522</int>
          <int name="2013-11-01T01:01:01Z">73340</int>
          <int name="2013-12-01T01:01:01Z">44745</int>
        </lst>
        <str name="gap">+1MONTH</str>
        <date name="start">2013-01-01T01:01:01Z</date>
        <date name="end">2014-01-01T01:01:01Z</date>
      </lst>
    </lst>
  </lst>
</response>

Upload volume for pids

The following query shows upload size statistics, aggregated by formatId, for pids created by rightsHolder PISCO:

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=rightsHolder:uid=*PISCO*&fq=event:create&stats=true&stats.field=size&rows=0&stats.facet=formatId
<?xml version="1.0"?>
<response>
  <result name="response" numFound="14721" start="0"/>
  ...
        <lst name="facets">
          <lst name="formatId">
            <lst name="eml://ecoinformatics.org/eml-2.0.0">
              <double name="min">3582.0</double>
              <double name="max">29176.0</double>
              <double name="sum">604461.0</double>
              <long name="count">43</long>
              <long name="missing">0</long>
              <double name="sumOfSquares">1.1348783711E10</double>
              <double name="mean">14057.232558139534</double>
              <double name="stddev">8240.051522137841</double>
            </lst>
            <lst name="eml://ecoinformatics.org/eml-2.0.1">
              <double name="min">938.0</double>
              <double name="max">646484.0</double>
              <double name="sum">2.37265549E8</double>
              <long name="count">14668</long>
              <long name="missing">0</long>
              <double name="sumOfSquares">7.985322030167E12</double>
              <double name="mean">16175.72600218162</double>
              <double name="stddev">16815.75005078953</double>
            </lst>
            ...
          </lst>
        </lst>
      </lst>
    </lst>
  </lst>
</response>

Note

The examples that follow omit the result output to improve legibility. The reader is encouraged to copy and paste the sample queries into a web browser to view the resulting output.

Select events using time range based on date of access event

https://cn.dataone.org/cn/v2/query/logsolr/?q=dateLogged:[2014-03-01T00:00:01Z TO 2014-03-31T00:00:01Z]

Counts of event types

https://cn.dataone.org/cn/v2/query/logsolr/?q=dateLogged:[* TO NOW]&facet=true&facet.field=event

Wildcard search for pids

https://cn.dataone.org/cn/v2/query/logsolr/?q=pid:doi*&facet=true&facet.field=pid&facet.mincount=1

Spatial search for events within 10km of the latitude, longitude of Santa Barbara, CA

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq={!geofilt sfield=location pt=34.4329,-119.837 d=10}
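
The geofilt filter contains spaces and braces, which a browser encodes automatically but a script must encode itself. The sketch below builds the same query with Python's standard library; the helper name geofilt is hypothetical, and the code only constructs the URL without sending it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://cn.dataone.org/cn/v2/query/logsolr/"


def geofilt(lat, lon, km, sfield="location"):
    """Build a Solr geofilt filter: points within `km` kilometres of (lat, lon)."""
    return f"{{!geofilt sfield={sfield} pt={lat},{lon} d={km}}}"


santa_barbara = geofilt(34.4329, -119.837, 10)
spatial_url = BASE_URL + "?" + urlencode([("q", "*:*"), ("fq", santa_barbara)])
print(spatial_url)
# Decoding the query string gives back the unencoded filter.
print(parse_qs(urlsplit(spatial_url).query)["fq"][0])
```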

Search by city name for events occurring in Albuquerque

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=city:Albuquerque

Events aggregated by location name

https://cn.dataone.org/cn/v2/query/logsolr/?q=event:create&facet=true&facet.field=city

Download (read) counts by month for all data format types

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=event:read&formatType=DATA&facet=true&facet.range=dateLogged&facet.range.start=2014-01-01T00:00:00.000Z&facet.range.end=2015-01-01T00:00:00.000Z&facet.range.gap=%2B1MONTH

Download (read) counts by month for all format types, COUNTER-compliant

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=event:read&counterCompliant=true&facet=true&facet.range=dateLogged&facet.range.start=2014-01-01T00:00:00.000Z&facet.range.end=2015-01-01T00:00:00.000Z&facet.range.gap=%2B1MONTH

Metadata read counts by month for all metadata format types

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=event:read&formatType=METADATA&facet=true&facet.range=dateLogged&facet.range.start=2014-01-01T00:00:00.000Z&facet.range.end=2015-01-01T00:00:00.000Z&facet.range.gap=%2B1MONTH

Byte count for read events for May 2013

https://cn.dataone.org/cn/v2/query/logsolr/?q=event:read&fq=dateLogged:[2013-05-01T00:00:00.000Z TO 2013-05-31T23:59:59.999Z]&stats=true&stats.field=size&sort=size%20desc&rows=0

Bytes downloaded for subject=cjones aggregated by formatId

https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=subject:uid=*cjones*&fq=event:read&stats=true&stats.field=size&rows=0&stats.facet=formatId

Download (read) counts for node KNB, excluding web crawler accesses and duplicate (repeat) visits (within a short time interval, currently 30 seconds)

https://cn.dataone.org/cn/v2/query/logsolr/?q=event:read&fq=inFullRobotList:false&fq=isRepeatVisit:false&fq=nodeId:urn\:node\:KNB
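
The backslashes in the nodeId filter above escape the colons, which otherwise separate field names from values in Solr query syntax. Identifiers such as node ids and pids often contain such characters, so a small escaping helper is useful. The sketch below is one possible version; the function name is hypothetical, and it escapes each of Solr's special query characters individually.

```python
# Characters with special meaning in Solr query syntax.
SOLR_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_solr_term(value):
    """Backslash-escape Solr special characters in a literal term."""
    return "".join("\\" + ch if ch in SOLR_SPECIAL else ch for ch in value)


node_filter = "nodeId:" + escape_solr_term("urn:node:KNB")
print(node_filter)
```

Wildcard characters are escaped as well, so this helper suits exact-match terms only and should not be applied to patterns such as uid*smith*.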