Very Large Data Packages
========================

:Document Status:

  ======== ==================================================================
  Status   Comment
  ======== ==================================================================
  DRAFT    (rnahf) committed minor modifications shortly (1hr) after email
           to developers@dataone.org
  ======== ==================================================================

.. contents::

Synopsis
~~~~~~~~

While many data packages are of modest size (<100 objects), some large
studies generate upwards of 100,000 datasets that form a data package.
These very large data packages challenge performance limits in the DataONE
data ingest cycle and can present usability issues in user interfaces not
prepared for them.  Both memory and processor time increase dramatically
with the number of data objects and relationships expressed.

Potential submitters of packages containing large numbers of data objects
should be mindful that a package with that many objects is likely to be
unusable for the majority of interested parties, and should consider
consolidating and compressing the individual objects into fewer objects to
allow easier discovery, inspection, and download.  This is especially worth
considering if the objects in the package would not usefully be retrieved
individually.

Creation of large resource maps is potentially the most time-consuming
activity, depending on the tool used.  Deserialization is comparatively
quick, but the memory requirements are high, depending on the type of model
used during parsing.  At the indexing stage, the issues are the time needed
to process index record updates and the resulting number of items in certain
fields of the solr records.  Last, high-level client methods should be able
to perform whole-package downloads safely, but need to be able to detect
large data packages that could overwhelm their capacity to handle them.

Below are discussions and test results of the known issues related to very
large resource maps, presented in the order in which they are encountered in
the object lifecycle.

Identified Issues
~~~~~~~~~~~~~~~~~

Resource map creation
---------------------

Use of the foresite library for building resource maps includes many checks
to make sure that the map validates.  First the identifiers of the data and
metadata objects are added to a graph held in memory, then the graph is
serialized to RDF/XML format.  For small packages the overhead of building
the graph and performing consistency checks is minimal, but both memory and
build time appear to scale geometrically with the number of objects in the
package.

Test results for resource maps of different sizes are summarized below.  In
all cases there is one metadata object that documents all of the objects.

============== ================ =========== ===========
# of objects   time to build    memory      file size
============== ================ =========== ===========
10                                          7 KB
33                                          20 KB
100            2 seconds        45 MB       60 KB
330                                         192 KB
1000           6 seconds        20 MB       600 KB
3300           24 seconds       23 MB       2 MB
10000          4.5 minutes      30 MB       6 MB
33000          66 minutes       142 MB      20 MB
============== ================ =========== ===========

For very large resource maps, generation time with the Java foresite toolkit
is an issue.  Directly creating the serialized resource map is much faster:
using an existing resource map as a template and a short Perl script, a
100,000-member resource map was created in approximately 10 seconds, with
the only memory cost being that of holding the identifier array and any
output buffering.

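As an illustration of that template approach (not the actual Perl script
used in the test), the following minimal Python sketch streams a large
resource map directly to RDF/XML.  The pids, the CN resolve URI prefix, and
the set of statements written per member are placeholder assumptions; a
complete map copied from a foresite-generated template would include
additional statements (for example ore:describes, rdf:type, and identifier
fields)::

  import sys

  # Placeholder identifiers; a real script would read these from a manifest.
  resource_map_pid = "resource_map_example"
  metadata_pid = "metadata_example"
  data_pids = ["data_%06d" % i for i in range(100000)]

  RESOLVE = "https://cn.dataone.org/cn/v1/resolve/%s"   # assumed URI prefix

  def description(pid, body):
      return ('<rdf:Description rdf:about="%s">\n%s\n</rdf:Description>\n'
              % (RESOLVE % pid, body))

  out = sys.stdout
  out.write('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
            '         xmlns:ore="http://www.openarchives.org/ore/terms/"\n'
            '         xmlns:cito="http://purl.org/spar/cito/">\n')

  # The aggregation lists every member once; the only memory cost is the
  # pid list itself plus output buffering.
  members = "\n".join('  <ore:aggregates rdf:resource="%s"/>' % (RESOLVE % pid)
                      for pid in [metadata_pid] + data_pids)
  out.write(description(resource_map_pid + "#aggregation", members))

  # Fully express the inverse and documentation relationships per member so
  # that consumers do not need a reasoning model (see the next section).
  for pid in data_pids:
      out.write(description(pid,
          '  <ore:isAggregatedBy rdf:resource="%s#aggregation"/>\n'
          '  <cito:isDocumentedBy rdf:resource="%s"/>'
          % (RESOLVE % resource_map_pid, RESOLVE % metadata_pid)))

  out.write('</rdf:RDF>\n')
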
RDF Deserialization
-------------------

Deserialization happens both on the client side when downloading resource
maps, and on Coordinating Nodes, both when validating the resource map and
when indexing the relationships into the solr index.  Performance metrics
obtained from JUnit tests monitored with Java VisualVM are summarized below.
Fully expressed resource maps were deserialized using both the default
simple model and an OWL model loaded with the ORE schema so that it can do
semantic reasoning.  The reasoning model adds an additional 268 triples from
the ORE schema.

========== ======== ======== ======== ======== ======== ========
..         Default model              Reasoning model
---------- -------------------------- --------------------------
# objects  triples  time     memory   triples  time     memory
========== ======== ======== ======== ======== ======== ========
10         61       1 sec.   9 MB     329      2 sec.   13 MB
33         176      1 sec.   10 MB    444      2 sec.   13 MB
100        511      2 sec.   15 MB    779      2 sec.   17 MB
330        1661     2 sec.   20 MB    1929     3 sec.   17 MB
1000       5011     2 sec.   17 MB    5279     3 sec.   24 MB
3300       16511    3 sec.   20 MB    16779    4 sec.   40 MB
10000      50011    6 sec.   30 MB    50279    8 sec.   90 MB
33000      165011   7 sec.   51 MB    165279   10 sec.  264 MB
100000     500011   15 sec.  138 MB   500279   26 sec.  792 MB
========== ======== ======== ======== ======== ======== ========

The same information, listed by model size, shows that memory requirements
are not a simple function of the number of triples, but also depend on the
model type.  The reasoning model uses more memory per triple than the simple
model, and at very large sizes (in terms of number of triples) it uses
significantly more memory.

======== ======== ======== ==========
triples  time     memory   model type
======== ======== ======== ==========
61       1 sec.   9 MB     simple
176      1 sec.   10 MB    simple
329      2 sec.   13 MB    reasoning
444      2 sec.   13 MB    reasoning
511      2 sec.   15 MB    simple
779      2 sec.   17 MB    reasoning
1661     2 sec.   20 MB    simple
1929     3 sec.   17 MB    reasoning
5011     2 sec.   17 MB    simple
5279     3 sec.   24 MB    reasoning
16511    3 sec.   20 MB    simple
16779    4 sec.   40 MB    reasoning
50011    6 sec.   30 MB    simple
50279    8 sec.   90 MB    reasoning
165011   7 sec.   51 MB    simple
165279   10 sec.  264 MB   reasoning
500011   15 sec.  138 MB   simple
500279   26 sec.  792 MB   reasoning
======== ======== ======== ==========

The impact of this is that automated applications that deserialize RDF files
(such as the index processor) will especially need to detect when they are
dealing with a resource map that could exceed available system resources.
It also seems wise, given that memory issues weigh more heavily than RDF
file size, to specify that resource maps with more than 50,000 triples must
fully express their relationships instead of relying on reasoning models to
infer the semantically defined inverse relationships.  This implies that if
DataONE allows resource maps to sparsely populate their relationships, there
should also be tools to tell whether an RDF file fully expresses its
relationships or will rely on semantic reasoning; a minimal check of this
kind is sketched below.

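As one way to perform such a check, the sketch below (assuming Python with
the rdflib library and a locally downloaded copy of the resource map; the
file name is a placeholder) compares the counts of the forward statements
with their inverse counterparts::

  from rdflib import Graph, Namespace

  ORE = Namespace("http://www.openarchives.org/ore/terms/")
  CITO = Namespace("http://purl.org/spar/cito/")

  g = Graph()
  g.parse("resourcemap.xml", format="xml")     # placeholder local file

  aggregates = len(list(g.subject_objects(ORE.aggregates)))
  aggregated_by = len(list(g.subject_objects(ORE.isAggregatedBy)))
  documents = len(list(g.subject_objects(CITO.documents)))
  documented_by = len(list(g.subject_objects(CITO.isDocumentedBy)))

  # If the inverse statements are missing (or far fewer than their forward
  # counterparts), consumers would have to rely on a reasoning model.
  fully_expressed = (aggregated_by >= aggregates and documented_by >= documents)
  print("%d triples; fully expressed: %s" % (len(g), fully_expressed))
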
Indexing
--------

When resource maps are synchronized, the map is read and - once all of the
package members are indexed - the relationships in the map are added to the
index records of the data members.  A 10,000-member package will therefore
trigger the update of 10,000 index records, adding the metadata object pid
to the 'isDocumentedBy' field.  Additionally, both the 'contains' field of
the resource map record and the 'documents' field of the metadata record
will be updated with the pids of the 10,000 members.  Fields with that many
values are difficult or impossible to display, and are time-consuming to
search when queried.

Indexing is by necessity a single-threaded process, one that can update on
the order of 100 records per minute.  A package containing 100,000 members
will therefore take about 1,000 minutes, or roughly 17 hours, to index, and
during this time no other updates will be processed.  Working around this
issue requires a redesign of the index processor so that a large resource
map does not delay other items in the indexing queue.  Ultimately, the
solution would be to implement a different search engine for tracking
package relationships and to expose another search endpoint using SPARQL
(http://en.wikipedia.org/wiki/SPARQL), probably hiding the query details
behind new DataONE API methods to spare end users from having to learn
another query language to interact with DataONE.

Whole-Package Download
----------------------

The high-level DataPackage.download(packageID) method in d1_libclient
implementations by default downloads the entire collection of data package
objects for local use.  For these very large data packages, the total
package size is likely to be gigabytes of information.  In order to better
support such convenience features, there need to be ways to determine the
number of members of a package prior to download.  This would also help in
situations where the number of package members is small but the individual
data objects are large.

Mitigations
~~~~~~~~~~~

It is useful for applications to know when a given data package is too large
for them to work with, or will require special handling.  Ideally, this
could be determined before deserializing the XML and, for some clients, even
prior to downloading the resource map itself.  Indexing performance is a
function of member count, deserialization performance is a function of the
number of triples, and download performance is a function of total file
size.

Determining Member Count
------------------------

For indexed resource maps, the easiest way to get the member count is with
the query::

  cn/v1/query/solr/?q=resourceMap:{pid}&rows=0

For unindexed resource maps, counting the occurrences of the term
"ore:isAggregatedBy" in the RDF file will suffice.

Determining total package size for download
-------------------------------------------

To get the total size of the package, the following solr queries can be
used::

  # returns only sizes of package members
  cn/v1/query/solr/?q=resourceMap:{pid}&fl=id,size

  # returns sizes for package members and the resource map itself
  cn/v1/query/solr/?q=resourceMap:{pid} OR id:{pid}&fl=id,size

from which the client can calculate the sum of the sizes returned (a sketch
of this calculation is given at the end of this subsection).  To get the
size of the resource map itself (useful for estimating memory
requirements)::

  # returns size of only the resource map
  cn/v1/query/solr/?q=id:{pid}&fl=id,size

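As a sketch of that client-side calculation, assuming Python with the
requests library, a placeholder CN base URL and pid, and a JSON response
requested via the standard solr 'wt' parameter::

  import requests

  # Placeholder values for illustration.
  CN_SOLR = "https://cn.dataone.org/cn/v1/query/solr/"
  pid = "resource_map_example"

  params = {
      "q": 'resourceMap:"%s" OR id:"%s"' % (pid, pid),
      "fl": "id,size",
      "rows": "200000",    # solr returns only 10 rows by default
      "wt": "json",        # ask for a JSON response
  }
  docs = requests.get(CN_SOLR, params=params).json()["response"]["docs"]

  total_bytes = sum(int(d["size"]) for d in docs)
  print("%d objects, %.1f MB to download" % (len(docs), total_bytes / 1.0e6))
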
Determining Memory Requirements for deserialization
---------------------------------------------------

It is the number of triples and the type of model used, more so than the
number of package members, that best determines the graph model's memory
requirement, so any additional triples expressed for each member multiply
the model size.  The use of ORE proxies, for example, or the inclusion of
provenance information are situations where this would be the case.  DataONE
*is* planning for the inclusion of provenance statements in resource maps,
so users and developers alike should take this into consideration.

The number of triples in an RDF/XML file can be determined either by parsing
the XML or by estimating from the resource map's byte count.

To parse the XML, use an XML parser of choice to count all of the
sub-elements of all of the "rdf:Description" elements.  For example, in
Python using the standard library's ElementTree (the file name is a
placeholder)::

  import xml.etree.ElementTree as ET

  RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

  # each rdf:Description element is one subject; each of its sub-elements
  # is one predicate-object pair, i.e. one triple
  tree = ET.parse("resourcemap.xml")
  triple_count = sum(len(desc) for desc in tree.iter(RDF + "Description"))

To estimate from the file size, an upper limit on the number of triples can
be deduced.  RDF/XML organizes triples as predicate-object sub-elements
under an rdf:Description element for each subject.  If the ratio of subjects
to triples is low, then the number of bytes per triple is determined mostly
by the length of the predicate-object sub-element.  For a 30-character
identifier, that sub-element is about 100 characters, and so::

  upper limit on the number of triples = file size (bytes) / 100 bytes-per-triple

For example, a 5 MB resource map has at most about 50,000 triples, assuming
an average identifier length of 30 characters (URL encoded).  For a point of
reference, a resource map for 1 metadata object documenting 1000 objects,
expressing the 'ore:aggregates', 'ore:isAggregatedBy', 'cito:documents',
'cito:isDocumentedBy', and 'cito:identifier' predicates, creates 5005
triples using 1003 subjects, and was measured to produce a 600 KB file.
Applying the upper-limit approximation (600 KB / 100 = 6 K) gives 6000
triples, an over-estimate of roughly the number of subjects.  Also note that
long identifiers, and identifiers dominated by non-ASCII characters that are
percent-encoded in the file (3 bytes per character), can lead to an even
higher upper limit than expected; conversely, short identifiers in the
resource map make the estimate less reliable as an upper limit, since the
true triple count may exceed it.

Determining the memory requirement from the number of triples can be done
either by interpolating from the tables above or by equation.  Curve fits of
the deserialization performance tests using polynomial equations gave the
following::

  simple model     memory(MB) ~ 2.6E-15 * triples^3 - 1.7E-09 * triples^2 + 0.00044 * triples + 12.7   (R2 = 0.99466)

  reasoning model  memory(MB) ~ 1.25E-10 * triples^2 + 0.0015 * triples + 14.3   (R2 = 0.99997)

Note that the simple model required (rightly or wrongly) a third-order
equation to get a curve fit with R2 > 0.9, whereas the reasoning model data
could be highly correlated with a second-order equation.  Expressed as a
function of file size (bytes)::

  simple model     memory(MB) ~ 2.6E-21 * size^3 - 1.7E-13 * size^2 + 4.4E-06 * size + 12.7

  reasoning model  memory(MB) ~ 1.25E-14 * size^2 + 1.5E-05 * size + 14.3

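Putting the pieces together, the following Python sketch estimates the
deserialization memory requirement from a resource map's file size; the
coefficients are transcribed from the curve fits above, and the 100
bytes-per-triple figure is the approximation described earlier::

  BYTES_PER_TRIPLE = 100    # assumes ~30-character, URL-encoded identifiers

  def estimated_triples(file_size_bytes):
      """Upper limit on triple count, from the resource map's byte count."""
      return file_size_bytes / BYTES_PER_TRIPLE

  def simple_model_memory_mb(t):
      return 2.6e-15 * t**3 - 1.7e-09 * t**2 + 0.00044 * t + 12.7

  def reasoning_model_memory_mb(t):
      return 1.25e-10 * t**2 + 0.0015 * t + 14.3

  size = 5 * 1000 * 1000                  # e.g. a 5 MB resource map
  t = estimated_triples(size)             # ~50,000 triples
  print("estimated triples:      %d" % t)
  print("simple model memory:    ~%.0f MB" % simple_model_memory_mb(t))
  print("reasoning model memory: ~%.0f MB" % reasoning_model_memory_mb(t))

For the 5 MB example this gives roughly 30 MB for the simple model and
90 MB for the reasoning model, consistent with the measured values in the
tables above.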