Prototype Protocol for Statistical Harvesting (PSH)
Introduction
It's my fault for not paying attention. I was under the impression that ORE (Object Reuse and Exchange) would be the successor to OAI-PMH, and that all my harvesting troubles would be sorted out. However, I now realise that ORE does not address my main OAI-PMH issues, of which there are two: counting items, and obtaining 'identify' information. ORE focuses on the reuse and exchange of the deposited objects themselves (the clue was in the name), which it does very effectively, but statistics and repository identification were not within its remit.
Counting Items with OAI-PMH (and ORE)
In OpenDOAR we try to display the number of items in a given archive - ideally, just the number of full items, but in practice the number of full items and metadata-only records combined. ROAR also gives totals, but additionally provides repository growth charts. There is no easy way to obtain this numeric data using OAI-PMH. It is usually a matter of sending a number of ListIdentifiers requests, and iteratively counting the number of identifiers returned. The situation with ORE is similar, involving iterative processing of resource maps. This approach is fraught with difficulties, and counts often fail. For instance, with large repositories or with slow response times, harvesting scripts may time out. Also, some hosts, such as Bepress, limit the rate at which OAI-PMH requests can be submitted so that the server is not overloaded and is less vulnerable to abuse.
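To make the cost concrete, here is a minimal Python sketch of the iterative counting described above. It assumes a hypothetical caller-supplied function, fetch_page, that takes a dict of OAI-PMH query parameters and returns the response XML as text; the names count_identifiers and fetch_page are illustrative only and do not belong to any library. The loop follows the standard OAI-PMH resumptionToken flow, so in real use every iteration is a separate HTTP round trip.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def count_identifiers(fetch_page, metadata_prefix="oai_dc"):
    """Count record headers by paging through ListIdentifiers responses.

    fetch_page is a hypothetical caller-supplied function: it receives a
    dict of query parameters and returns the response XML as a string.
    Returns (number of headers seen, number of requests made).
    """
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    total = 0
    requests_made = 0
    while True:
        root = ET.fromstring(fetch_page(params))
        requests_made += 1
        # One <header> per record; this sketch also counts deleted records.
        total += len(root.findall(f".//{OAI_NS}header"))
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one, marks the final page
        # A resumptionToken must be sent without the other arguments.
        params = {"verb": "ListIdentifiers", "resumptionToken": token.text.strip()}
    return total, requests_made
```

The sketch has no handling for OAI-PMH error responses, timeouts or rate limits, which are exactly the points at which real counts tend to fail.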
This situation is particularly frustrating given that in most cases the data could be obtained using much simpler SQL statements. For instance, the number of public records in an EPrints V.3 archive is returned in one step using:
SELECT COUNT(*)
FROM eprint
WHERE eprint_status = "archive";
and for a breakdown by month:
SELECT CONCAT(datestamp_year, "-", LPAD(datestamp_month,2,"0")),COUNT(*)
FROM eprint
WHERE eprint_status = "archive"
GROUP BY datestamp_year,datestamp_month;
A more complex but still fast command can count the number of EPrints that have attached files (and are therefore likely to be full items):
SELECT COUNT(*)
FROM eprint
WHERE eprint_status = "archive" AND eprintid IN
(SELECT DISTINCT e.eprintid
FROM eprint AS e LEFT JOIN document AS d ON e.eprintid=d.eprintid
WHERE e.eprint_status = "archive" AND d.docid IS NOT NULL);
If only OAI-PMH had a Count verb! What would be nicer still would be if it were possible to return the number of records present for each OAI-PMH set. That way a specialist subject harvester could conceivably check how many or what proportion of items are available for its subject, and decide whether or not to harvest the archive accordingly. Similarly, it would be useful to be able to determine what proportion of a repository's content is made up of full-text items.
An extension to OAI-PMH or a new Protocol for Statistical Harvesting (PSH) ought easily to be able to satisfy these requirements. A prototype for such a protocol is outlined below.
'Identify' Problems
In theory, data from an OAI-PMH verb=Identify request should be a good source of information about an archive. There are, however, some surprising omissions, such as the URL for human users, and the parent organisation's name. Also, only single names and URLs are returned, whereas an archive may have more than one name (e.g. a full name and an acronym), and multilingual archives may have different names and URLs for each language. It would be useful to be able to harvest these alternatives, although of course it might be necessary to flag which is the preferred option.
In addition to OAI-PMH, an archive may have other machine-to-machine interfaces, such as Z39.50 or SRU-CQL. It would be useful to retrieve information about these and the relevant URLs, as well as the OAI Base URL.
An initial draft alternative Identify schema is outlined after the prototype PSH Technical Reference.
Peter Millington
About the Prototype
This harvesting protocol is modelled on the OAI Protocol for Metadata Harvesting (OAI-PMH). As with OAI-PMH, data is harvested by making HTTP requests to the PSH base URL, with the addition of various parameters that specify what information is required. These include an obligatory verb parameter. The results are returned as an XML stream.
To illustrate how the protocol works, a prototype PSH interface has been created for the OpenDOAR database itself. (Note that the protocol could work with any type of database, not just open access repositories.)
The prototype PSH Base URL for OpenDOAR is:
http://www.opendoar.org/demos/psh
The simplest meaningful request is to count the total number of items in the database:
http://www.opendoar.org/demos/psh?verb=Count
which returns the following XML:
<?xml version="1.0" encoding="UTF-8" ?>
<psh>
<responseDate>2010-09-02T15:48:13Z</responseDate>
<request verb="Count">http://www.opendoar.org/demos/psh.php</request>
<Count>
<header>
<setType />
<setSpec />
<setName />
<datestamp />
<numItems>1641</numItems>
</header>
</Count>
</psh>
The key element of the XML is the <header> element, and in particular the sub-element <numItems> which gives the total number of records in the database. The other four sub-elements within <header> are empty because they are not relevant to this particular PSH request.
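As an illustration of how a harvester might read this response, the following Python sketch extracts the total from the single <header> element. The function name total_items is an assumption made for this example, and the sample string simply repeats the response shown above; a real client would read the XML from the PSH base URL instead.

```python
import xml.etree.ElementTree as ET

# The unqualified Count response shown above, repeated as a test fixture.
SAMPLE_RESPONSE = """<?xml version="1.0" encoding="UTF-8" ?>
<psh>
<responseDate>2010-09-02T15:48:13Z</responseDate>
<request verb="Count">http://www.opendoar.org/demos/psh.php</request>
<Count>
<header>
<setType />
<setSpec />
<setName />
<datestamp />
<numItems>1641</numItems>
</header>
</Count>
</psh>"""


def total_items(response_xml):
    """Return numItems from a Count response that has exactly one header."""
    root = ET.fromstring(response_xml)
    headers = root.findall("./Count/header")
    if len(headers) != 1:
        # A breakdown by date or set returns several headers instead.
        raise ValueError(f"expected one <header>, found {len(headers)}")
    return int(headers[0].findtext("numItems"))
```

Compared with paging through ListIdentifiers, the whole count arrives in one request and one small document.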
By adding a suitable dateUnit argument, it is possible to break down this total by year (or month or day):
http://www.opendoar.org/demos/psh?verb=Count&dateUnit=year
which yields the XML:
<?xml version="1.0" encoding="UTF-8" ?>
<psh>
<responseDate>2010-09-02T15:48:13Z</responseDate>
<request verb="Count">http://www.opendoar.org/demos/psh.php</request>
<Count>
<header>
<setType />
<setSpec />
<setName />
<datestamp>2005</datestamp>
<numItems>105</numItems>
</header>
<header>
<setType />
<setSpec />
<setName />
<datestamp>2006</datestamp>
<numItems>653</numItems>
</header>
<header>
<setType />
<setSpec />
<setName />
<datestamp>2007</datestamp>
<numItems>176</numItems>
</header>
<header>
<setType />
<setSpec />
<setName />
<datestamp>2008</datestamp>
<numItems>286</numItems>
</header>
<header>
<setType />
<setSpec />
<setName />
<datestamp>2009</datestamp>
<numItems>269</numItems>
</header>
<header>
<setType />
<setSpec />
<setName />
<datestamp>2010</datestamp>
<numItems>152</numItems>
</header>
</Count>
</psh>
Here, the <datestamp> sub-element gives the date for the corresponding <numItems>.
Similarly, it is possible to break down the numbers according to various categories within the database. With an open access repository, these would typically be OAI-PMH sets. To get a list of the available set types for a particular PSH instance, the ListSetTypes verb is used:
http://www.opendoar.org/demos/psh?verb=ListSetTypes
returning:
<?xml version="1.0" encoding="UTF-8" ?>
<psh>
<responseDate>2010-09-02T15:48:13Z</responseDate>
<request verb="ListSetTypes">http://www.opendoar.org/demos/psh.php</request>
<ListSetTypes>
<setType>
<setTypeSpec>repositoryType</setTypeSpec>
<setTypeName>Type of repository</setTypeName>
</setType>
<setType>
<setTypeSpec>operationalStatus</setTypeSpec>
<setTypeName>Repository's operational status</setTypeName>
</setType>
<setType>
<setTypeSpec>country</setTypeSpec>
<setTypeName>Repository's home country</setTypeName>
</setType>
<setType>
<setTypeSpec>continent</setTypeSpec>
<setTypeName>Repository's home continent</setTypeName>
</setType>
<setType>
<setTypeSpec>subject</setTypeSpec>
<setTypeName>Repository's Subject coverage</setTypeName>
</setType>
</ListSetTypes>
</psh>
The relevant setTypeSpec can then be assigned to the PSH setType argument to get the corresponding breakdown of item numbers - in this example by repository type:
http://www.opendoar.org/demos/psh?verb=Count&setType=repositoryType
returning:
<?xml version="1.0" encoding="UTF-8" ?>
<psh>
<responseDate>2010-09-02T15:48:14Z</responseDate>
<request verb="Count">http://www.opendoar.org/demos/psh.php</request>
<Count>
<header>
<setType>repositoryType</setType>
<setSpec>2</setSpec>
<setName>Institutional</setName>
<datestamp />
<numItems>1331</numItems>
</header>
<header>
<setType>repositoryType</setType>
<setSpec>3</setSpec>
<setName>Disciplinary</setName>
<datestamp />
<numItems>206</numItems>
</header>
<header>
<setType>repositoryType</setType>
<setSpec>4</setSpec>
<setName>Aggregating</setName>
<datestamp />
<numItems>68</numItems>
</header>
<header>
<setType>repositoryType</setType>
<setSpec>5</setSpec>
<setName>Governmental</setName>
<datestamp />
<numItems>36</numItems>
</header>
</Count>
</psh>
Here, the <datestamp> sub-element is void, but there is data for the <setType>, <setSpec> and <setName> sub-elements. These hold the following:
<setType> - The value of the setType argument specified in the PSH request.
<setSpec> - The specification of the returned set - equivalent to an OAI-PMH setSpec. This is typically some sort of classification code - e.g. a Library of Congress subject code.
<setName> - The name or description of the returned set - equivalent to an OAI-PMH setName. This is usually a human-readable heading or description that corresponds to the setSpec code.
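A breakdown like this is enough to work out what proportion of the records falls into each set. The Python sketch below shows one way a harvester might do that; the function names counts_by_set and shares are assumptions for this example, and the sample string is an abbreviated copy of the repositoryType response above (the responseDate and request elements are omitted).

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the repositoryType breakdown shown above.
SAMPLE_BREAKDOWN = """<psh><Count>
<header><setType>repositoryType</setType><setSpec>2</setSpec>
<setName>Institutional</setName><datestamp /><numItems>1331</numItems></header>
<header><setType>repositoryType</setType><setSpec>3</setSpec>
<setName>Disciplinary</setName><datestamp /><numItems>206</numItems></header>
<header><setType>repositoryType</setType><setSpec>4</setSpec>
<setName>Aggregating</setName><datestamp /><numItems>68</numItems></header>
<header><setType>repositoryType</setType><setSpec>5</setSpec>
<setName>Governmental</setName><datestamp /><numItems>36</numItems></header>
</Count></psh>"""


def counts_by_set(response_xml):
    """Map each setName to its numItems in a PSH Count response."""
    root = ET.fromstring(response_xml)
    counts = {}
    for header in root.iter("header"):
        # findtext returns "" for an empty element such as <setName />.
        name = header.findtext("setName", default="")
        counts[name] = int(header.findtext("numItems"))
    return counts


def shares(counts):
    """Convert absolute counts into fractions of the overall total."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: n / total for name, n in counts.items()}
```

With the sample data the four counts sum to 1641, the same total as the unqualified Count request, and Institutional repositories account for about 81% of the records.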
It is also possible to limit the categories that are reported by adding a setQuery argument, specifying a query string that sets must match. It is also necessary to specify whether the setQuery should be applied to the <setSpec> or <setName> fields, which is done using the obligatory setQueryType argument (permitted values: spec or name). The following example gives a monthly breakdown of OpenDOAR records for United States repositories:
http://www.opendoar.org/demos/psh?verb=Count&dateUnit=month&setType=country&setQuery=US&setQueryType=spec
Furthermore, it is possible to specify where the setQuery string should occur in the data field, using the operator argument. This has four permitted values: 'equals' (the default), 'starts', 'ends', and 'contains'.
&operator=starts is especially useful for setQueryType=spec queries, where the right-hand wildcarding can, for instance, retrieve all the sub-class codes of a main subject in a classification scheme. For example:
http://www.opendoar.org/demos/psh?verb=Count&setType=subject&setQuery=Ci&setQueryType=spec&operator=starts
&operator=contains is similarly useful for setQueryType=name queries, where word stems can be retrieved to good effect, e.g. to find the Americas:
http://www.opendoar.org/demos/psh?verb=Count&setType=continent&setQuery=america&setQueryType=name&operator=contains
Date ranges can be specified using the from and/or until arguments. For example:
http://www.opendoar.org/demos/psh?verb=Count&dateUnit=month&from=2007-01-01
The dateUnit and setType arguments can be used together to get data suitable for a date-by-category grid of item counts:
http://www.opendoar.org/demos/psh?verb=Count&dateUnit=year&setType=repositoryType
Further details of the arguments are given in the Technical Reference section.
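The request URLs in this section can be assembled mechanically. The following Python sketch shows one possible client-side helper that builds a Count URL and enforces the argument dependencies described above (setQuery needs both setType and setQueryType). The function name build_count_url and the decision to validate on the client are assumptions of this example, not part of the prototype. Arguments are passed as a dict because 'from' is a reserved word in Python.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.opendoar.org/demos/psh"
OPERATORS = {"equals", "starts", "ends", "contains"}
QUERY_TYPES = {"spec", "name"}


def build_count_url(args, base_url=BASE_URL):
    """Build a PSH Count request URL from a dict of arguments."""
    if "setQuery" in args:
        missing = [a for a in ("setType", "setQueryType") if a not in args]
        if missing:
            raise ValueError("setQuery also requires: " + ", ".join(missing))
    if "setQueryType" in args and args["setQueryType"] not in QUERY_TYPES:
        raise ValueError("setQueryType must be 'spec' or 'name'")
    if "operator" in args and args["operator"] not in OPERATORS:
        raise ValueError("unknown operator: " + str(args["operator"]))
    # urlencode keeps the dict's order and URL-encodes the setQuery string.
    return base_url + "?" + urlencode({"verb": "Count", **args})
```

For example, build_count_url({"dateUnit": "month", "from": "2007-01-01"}) reproduces the date-range request shown above.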
Appeal for Feedback
The prototype protocol has been prepared in order to stimulate discussion. I fully expect that it may not fulfil everyone's requirements, and that people may be able to devise additional functions or better approaches. I would welcome comments and suggestions.
If there is sufficient interest, we may organise a working seminar to move things forward. Ultimately, it would be beneficial for the final protocol to be incorporated as standard into the popular repository software packages - EPrints, DSpace, Bepress, Fedora, etc., working with existing configurations and database schemata.
Please email me at: peter.millington@nottingham.ac.uk, or phone me on +44 (0)115 8468481.
Technical Reference
All parameters, verbs, arguments, and their values are case-sensitive, except for setQuery strings. The case-sensitivity of setQuery strings - which should be URL-encoded - could vary from implementation to implementation of PSH. (It is recommended, however, that set query strings should be implemented as case-insensitive.)
Verbs
All requests must have a verb. Some verbs require one or
more additional arguments. Some arguments are optional.
Count
Returns the number of records, optionally broken down by set and/or date.
Arguments
countType - Type of records to be counted. Use verb=ListCountTypes to see which options are available in a given implementation of PSH. Typical examples are:
- partialItems - For open access repositories, these would be metadata-only records
- fullItems - For open access repositories, these would be records having attached files (assumed to be the full-text PDF or similar document).
Other examples, not implemented in this prototype, are:
- withdrawnItems - For open access repositories, these would be deleted full items for which the metadata has been retained.
- openAccess - Records where any user can access the full item free of charge or subscription, and without the need for username & password.
- gatedAccess - Records where users must pay a fee or have a subscription, and/or have a username & password in order to access the full item.
If omitted, counts all records.
There may be repositories or resources where a breakdown by record type can be achieved using the setType argument. However, in most cases, it seems likely that the required SQL select statements will be complex and need to be run singly for each countType in turn. This argument could also be used for other categories of similar complexity.
dateUnit - Available date units by which to break down item counts - e.g. 'year', 'month', 'day'. Use verb=ListDateUnits to see which options are available in a given implementation of PSH. If omitted, counts all records.
from - Inclusive date from which records are to be counted. Date format: yyyy-mm-dd - e.g. 2008-05-01
operator - Available operators for setQuery strings - 'equals', 'starts', 'ends', 'contains'. This optional argument defines where a query sub-string 'setQuery' appears in the setSpec or setName value. The default operator is 'equals' - i.e. an exact match.
setQuery - Query string for set queries, which may be based on either the setSpec or the setName field. Must be used with the setType argument to define the set being queried. Must be used with the setQueryType argument to specify whether to query setSpec or setName. May be used with the operator argument to define where a query sub-string appears in the setSpec or setName value. The default operator is 'equals' - i.e. an exact match.
setQueryType - Defines whether setQuery searches setSpec or setName. Permissible values (no default):
- spec - Searches the setSpec field (often a classification code).
- name - Searches the setName field (often a human-readable description).
This argument is required whenever setQuery is specified.
setType - The name of a set or category type by which record counts may be broken down. Sets are categories that can usually be displayed as lists using SQL statements such as:
SELECT category, COUNT(*) FROM resource GROUP BY category;
Use verb=ListSetTypes to see which setType options are available in a given implementation of PSH. If omitted, counts all records.
until - Inclusive date until which records are to be counted. Date format: yyyy-mm-dd - e.g. 2007-12-31
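The four operator values can be read as simple string tests against the setSpec or setName value. The Python sketch below is one possible interpretation, not a description of how the prototype is implemented; the function name set_matches and the case_sensitive flag are assumptions of this example. Matching defaults to case-insensitive, following the recommendation at the start of this Technical Reference.

```python
def set_matches(value, query, operator="equals", case_sensitive=False):
    """Test a setSpec or setName value against a setQuery string."""
    if not case_sensitive:
        # casefold() is a more thorough lower() for caseless comparison.
        value, query = value.casefold(), query.casefold()
    if operator == "equals":
        return value == query
    if operator == "starts":
        return value.startswith(query)
    if operator == "ends":
        return value.endswith(query)
    if operator == "contains":
        return query in value
    raise ValueError("unknown operator: " + operator)
```

A server backed by SQL would more likely translate the same four cases into = and LIKE conditions, but the matching rules would be the same.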
Help
Returns this Technical Reference.
Identify
Returns information describing the repository or resource.
ListCountTypes
Lists the available 'countType' argument options for use with the 'Count' verb.
ListDateUnits
Lists the available 'dateUnit' argument options for use with the 'Count' verb.
ListSetTypes
Lists the available 'setType' argument options for use with the 'Count' verb.
Draft 'Identify' Schema
This draft schema is presented in order to stimulate discussion and as a starting point for further development.
<Identify>
<archiveName lang="en" titleType="preferred">OpenDOAR</archiveName>
<archiveName lang="en" titleType="alternative">Open Directory of Open Access Repositories</archiveName>
<archiveURL lang="en" urlType="preferred">http://www.opendoar.org/</archiveURL>
<archiveURL lang="en" urlType="alternative">http://opendoar.org/</archiveURL>
<org orgType="parent">
<orgName lang="en" nameType="preferred">University of Nottingham</orgName>
<orgName lang="en" nameType="alternative">UoN</orgName>
<orgURL lang="en" urlType="preferred">http://www.nottingham.ac.uk/</orgURL>
<orgURL lang="zh" urlType="preferred">http://www.nottingham.edu.cn/index.php?changelang=zh</orgURL>
</org>
<org orgType="team">
<orgName lang="en" nameType="preferred">SHERPA</orgName>
<orgURL lang="en" urlType="preferred">http://www.sherpa.ac.uk/</orgURL>
</org>
<machineInterfaces>
<protocol>
<protocolName>PSH</protocolName>
<protocolURL urlType="preferred">http://www.opendoar.org/demos/psh_prototype.php</protocolURL>
<protocolVersion>1.0</protocolVersion>
<protocolHelp>http://www.opendoar.org/demos/psh_prototype.php</protocolHelp>
</protocol>
<protocol>
<protocolName>OpenDOAR API</protocolName>
<protocolURL urlType="preferred">http://www.opendoar.org/api.php</protocolURL>
<protocolURL urlType="alternative">http://www.opendoar.org/api13.php</protocolURL>
<protocolVersion>1.3</protocolVersion>
<protocolHelp>http://www.opendoar.org/tools/api.html</protocolHelp>
</protocol>
</machineInterfaces>
<earliestDatestamp>2005-12-09T11:44:56Z</earliestDatestamp>
<dateEstablished>2005-12-01</dateEstablished>
<description>
<![CDATA[
<p>This is one of the principal worldwide lists of open access repositories.</p>
]]>
</description>
</Identify>
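As a sketch of how a harvester might use the preferred/alternative flags and the lang attributes in this draft, the following Python function picks the preferred value of a repeated element for a given language. The function name preferred_text is an assumption of this example, and the sample string is an abbreviated copy of the draft above. The name of the flagging attribute is passed in because the draft uses titleType, nameType and urlType on different elements.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the draft Identify response shown above.
SAMPLE_IDENTIFY = """<Identify>
<archiveName lang="en" titleType="preferred">OpenDOAR</archiveName>
<archiveName lang="en" titleType="alternative">Open Directory of Open Access Repositories</archiveName>
<org orgType="parent">
<orgName lang="en" nameType="preferred">University of Nottingham</orgName>
<orgName lang="en" nameType="alternative">UoN</orgName>
<orgURL lang="en" urlType="preferred">http://www.nottingham.ac.uk/</orgURL>
<orgURL lang="zh" urlType="preferred">http://www.nottingham.edu.cn/index.php?changelang=zh</orgURL>
</org>
</Identify>"""


def preferred_text(parent, tag, type_attr, lang="en"):
    """Return the preferred value of a repeated child element, or None.

    Falls back to the first entry in the requested language when none of
    them is flagged as preferred.
    """
    matches = [el for el in parent.findall(tag) if el.get("lang") == lang]
    for el in matches:
        if el.get(type_attr) == "preferred":
            return el.text
    return matches[0].text if matches else None
```

For instance, calling preferred_text on the parent <org> element with tag "orgURL", type_attr "urlType" and lang "zh" returns the Chinese-language URL from the sample.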