Harvester Documentation

This documentation includes instructions or information for:

Overview
Harvester setup
Harvest test files
Registered data providers
The Java Harvester API

Overview

The jOAI harvester retrieves metadata records from remote OAI data providers and saves them to the local file system, one record per file. In addition, harvested records are packaged into zip archives that can be downloaded and opened through the harvester's web-based interface. The harvester can be configured to harvest automatically at regular intervals, effectively maintaining a mirror of the remote repository on the local file system.

The jOAI harvester supports OAI protocol versions 1.1 and 2.0, including data providers that use resumption tokens for flow control, selective harvesting by date or set, gzip response compression, and other protocol features.
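To make the protocol features above concrete, the following is a minimal sketch of selective harvesting with resumption tokens. It is not the jOAI implementation. The function names (list_records_url, harvest, http_fetch) are hypothetical and exist only for this illustration; the request arguments (verb, metadataPrefix, set, from, until, resumptionToken) are the ones defined by the OAI protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen


def list_records_url(base_url, metadata_prefix=None, set_spec=None,
                     from_date=None, until_date=None, resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token is not None:
        # In OAI-PMH 2.0, resumptionToken is an exclusive argument:
        # it is sent with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec        # selective harvesting by set
        if from_date:
            params["from"] = from_date      # selective harvesting by date
        if until_date:
            params["until"] = until_date
    return base_url + "?" + urlencode(params)


def http_fetch(url):
    """Fetch a response body over HTTP (assumed transport for this sketch)."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix, fetch, **selectors):
    """Yield every record element, following resumption tokens until done."""
    url = list_records_url(base_url, metadata_prefix, **selectors)
    while url:
        root = ET.fromstring(fetch(url))
        token = None
        for element in root.iter():
            # Compare local names so the XML namespace does not matter.
            name = element.tag.rsplit("}", 1)[-1]
            if name == "record":
                yield element
            elif name == "resumptionToken":
                token = (element.text or "").strip()
        # An absent or empty token means the list is complete.
        url = list_records_url(base_url, resumption_token=token) if token else None
```

For example, list(harvest("http://some.provider.org/base/url", "oai_dc", http_fetch, set_spec="some_set")) would collect the records of one set, where the base URL and set name are placeholders. Passing the fetch function as an argument keeps the loop testable without a network connection.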

See the Harvester FAQ for additional information.


Harvester setup

1. Install the jOAI software on a system in a servlet container such as Apache Tomcat.

See INSTALL.txt for installation instructions. If you are reading this page, this step has most likely been completed already.

2. Complete the harvester setup. Add a new harvest and do the following:

  • Enter a repository name (required)
  • Provide a repository base URL that starts with http:// (required)
  • Include a setSpec (optional)
  • Provide the metadata format being harvested (required)
  • Indicate if the harvest should occur at regular intervals (optional)
  • Indicate where metadata files should be saved (required)
  • Indicate how metadata files should be saved (split by set or not)

The repository name is a name to describe the data provider being harvested. The harvester status table is organized as an alphabetical listing of repository names.

The base URL is the access point of a data provider: a web address that starts with http://

The harvested metadata format can be any format offered by the provider being harvested. Use the OAI ListMetadataFormats request to find the metadata formats available at the provider. A ListMetadataFormats request looks like:

http://some.provider.org/base/url?verb=ListMetadataFormats

That is, append ?verb=ListMetadataFormats to the base URL.

The OAI ListMetadataFormats request returns an XML document in which each metadataPrefix element names a metadata format available from the provider.
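The request and response described above can be sketched as follows. The helper names (list_metadata_formats_url, metadata_prefixes) are hypothetical, and SAMPLE_RESPONSE is a trimmed, illustrative response rather than output captured from a real provider; real responses also carry schema and metadataNamespace elements for each format.

```python
import xml.etree.ElementTree as ET


def list_metadata_formats_url(base_url):
    """Append the ListMetadataFormats verb to a provider's base URL."""
    return base_url + "?verb=ListMetadataFormats"


def metadata_prefixes(response_xml):
    """Return the text of every metadataPrefix element in a response."""
    root = ET.fromstring(response_xml)
    return [
        (element.text or "").strip()
        for element in root.iter()
        # Compare local names so the XML namespace does not matter.
        if element.tag.rsplit("}", 1)[-1] == "metadataPrefix"
    ]


# Trimmed, illustrative response (an assumption for this sketch).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>adn</metadataPrefix>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""

print(list_metadata_formats_url("http://some.provider.org/base/url"))
print(metadata_prefixes(SAMPLE_RESPONSE))
```

Any value returned by metadata_prefixes for a real provider can be entered as the metadata format in the harvester setup.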

Harvesting automatically at regular intervals means that a time interval (in days, hours, minutes, or seconds) is specified that tells the jOAI harvester how often to perform an automatic harvest, which checks for new records and updates the harvested files accordingly.

Saving files to the default harvest location means that metadata files are saved within the OAI application's context directory, generally at a path of the form "~oai/WEB-INF/harvested_records/". To view the full path of this directory, click the save files help button (the question mark).

Saving files to a non-default harvest location means that metadata files are saved either to a user-specified location, given as a full directory path, or to a recently used location.

If a setSpec is specified, metadata files are saved as a single group. If no setSpec is specified, metadata files can be saved into one large group (the 'do not split by set' option) or into multiple groups (the 'split by set' option), depending on how the provider being harvested is organized. The default save option is 'do not split by set'.


Harvest test files

To conduct a test harvest, follow the harvester setup steps above using the following information:

  • Repository name: DLESE
  • Repository base URL: http://dlese.org/oai/provider
  • Metadata format: adn

Leave all other fields blank and save the entry.

On the Harvester Setup and Status page, click 'View harvest history' to see the harvest being performed. Click 'Refresh page' to see the number of metadata files increase. The entire harvest may take several minutes to complete.

The test harvest is successful if the metadata files can be viewed using either of these methods, starting from the Harvester Setup and Status page:

  • Under 'Download zipped harvest', click on 'Most recent'. Save the zip file to your Desktop, unzip it, and view the harvested records.
  • Locate and go to the 'Harvested to' directory on the server and view the files.
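As an alternative to unzipping by hand, the downloaded archive can be inspected with a short script. This is a sketch under two assumptions: that the harvested records are stored in the archive as files ending in .xml, and that the archive was saved under the hypothetical name harvest.zip. The function name list_harvested_records is also hypothetical.

```python
import zipfile


def list_harvested_records(zip_path):
    """Return the sorted names of the .xml files inside a harvest zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(".xml")  # assumption: one .xml file per record
        )
```

Calling len(list_harvested_records("harvest.zip")) gives a record count that can be compared with the number shown in the harvest history.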

Registered data providers

The Open Archives Initiative maintains a list of registered data providers that can be harvested. Known data providers of interest:

  • Digital Library for Earth System Education (DLESE) base URL: http://www.dlese.org/oai/provider
  • National Science Digital Library (NSDL) base URL: http://services.nsdl.org:8080/nsdloai/OAI

The Java Harvester API

The Java Harvester API that is part of jOAI may be used programmatically to harvest from OAI data providers. To use the API in a Java application, include the DLESETools.jar Java library, found in the lib directory of this distribution, in the classpath. Use of the API assumes familiarity with the Java programming language.

University Corporation for Atmospheric Research (UCAR) National Science Foundation (NSF) Digital Library for Earth System Education (DLESE)