DLESE Tools
v1.6.0

org.dlese.dpc.oai.harvester
Class Harvester

java.lang.Object
  extended by org.dlese.dpc.oai.harvester.Harvester
All Implemented Interfaces:
ErrorHandler

public class Harvester
extends Object
implements ErrorHandler

Harvests metadata from an OAI data provider, saving the results to file or returning the raw XML as an array of Strings. Supports data providers that use resumption tokens for flow control, selective harvesting by date or set, gzip response compression, and other protocol features. Supports OAI protocol versions 1.1 and 2.0.

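To perform a harvest, use one of the following methods: the static harvest method, the doHarvest method on a Harvester instance, or the command line interface provided by main.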

Use of this API assumes familiarity with the OAI protocol.
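
For readers less familiar with the protocol, the following is a minimal sketch of how a selective ListRecords request and its resumption-token follow-up are formed in OAI-PMH 2.0. It is not the Harvester's own implementation: the class OaiRequestSketch and its methods are hypothetical names used only for illustration, and the sketch assumes Java 10 or later for URLEncoder.encode(String, Charset).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

/** Hypothetical helper, not part of the DLESE API: builds OAI-PMH 2.0 ListRecords URLs. */
final class OaiRequestSketch {

    /** Formats a Date as a UTC datestamp such as 2003-12-31T23:59:59Z. */
    static String toUtcDatestamp(Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(date);
    }

    /** Builds the request URL; null for set, from, until or resumptionToken means "not given". */
    static String listRecordsUrl(String baseURL, String metadataPrefix, String set,
                                 Date from, Date until, String resumptionToken) {
        StringBuilder url = new StringBuilder(baseURL).append("?verb=ListRecords");
        if (resumptionToken != null) {
            // In OAI-PMH 2.0 resumptionToken is an exclusive argument:
            // a follow-up request carries the token and nothing else.
            return url.append("&resumptionToken=").append(encode(resumptionToken)).toString();
        }
        url.append("&metadataPrefix=").append(encode(metadataPrefix));
        if (set != null) {
            url.append("&set=").append(encode(set));
        }
        if (from != null) {
            url.append("&from=").append(encode(toUtcDatestamp(from)));
        }
        if (until != null) {
            url.append("&until=").append(encode(toUtcDatestamp(until)));
        }
        return url.toString();
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

A harvester would issue the first URL, read the resumptionToken from the response if one is present, and repeat with the token-only form until the provider returns no further token.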

Version:
$Id: Harvester.java,v 1.52 2009/03/20 23:33:53 jweather Exp $
Author:
Steve Sullivan, John Weatherley
See Also:
HarvestMessageHandler

Constructor Summary
Harvester()
          Creates a Harvester that uses no HarvestMessageHandler.
Harvester(HarvestMessageHandler msgHandler, int timeOutMilliseconds)
          Creates a Harvester that uses the given HarvestMessageHandler.
 
Method Summary
 String[][] doHarvest(String baseURL, String metadataPrefix, String set, Date from, Date until, String outdir, boolean splitBySet, String zipName, String zDir, boolean writeHeaders, boolean harvestAll, boolean harvestAllIfNoDeletedRecord)
          Performs the harvest.
 void error(SAXParseException exc)
          Handles errors.
 void fatalError(SAXParseException exc)
          Handles fatal errors.
 long getEndTime()
          Gets the endTime when the harvest completed, either because of an error or at the end of a successful harvest.
 String getHarvestedRecordsDir()
          Gets the harvestedRecordsDir attribute of the Harvester object
 long getHarvestUid()
          Returns a unique ID for this harvest.
 int getNumRecordsHarvested()
          Gets the current number of records that have been harvested by this harvester.
 int getNumResumptionTokensIssued()
          Gets the number of resumption tokens that have currently been issued by the data provider.
 long getStartTime()
          Gets the startTime when the harvest began, or 0 if it has not begun yet.
static String[][] harvest(String baseURL, String metadataPrefix, String set, Date from, Date until, String outdir, boolean splitBySet, HarvestMessageHandler msgHandler, boolean writeHeaders, boolean harvestAll, boolean harvestAllIfNoDeletedRecord, int timeOutMilliseconds)
          Harvest the given provider, saving the resulting metadata to file or returning the results as an array of Strings.
 boolean isRunning()
          Determines whether this Harvester is currently running or not.
 void kill()
          Gracefully kills the harvest after the current record is finished being harvested.
static void main(String[] args)
          Command line interface for the harvester.
static void setDebug(boolean db)
          Sets the debug attribute.
 void setNumRecordsForNotification(int numRecords)
          Sets the number of records harvested before statusMessage notifications to the HarvestMessageHandler are made.
 void warning(SAXParseException exc)
          Handles warnings.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Harvester

public Harvester()
Creates a Harvester that uses no HarvestMessageHandler.


Harvester

public Harvester(HarvestMessageHandler msgHandler,
                 int timeOutMilliseconds)
Creates a Harvester that uses the given HarvestMessageHandler.

Parameters:
msgHandler - The HarvestMessageHandler that will receive messages as the harvest progresses, or null if none.
timeOutMilliseconds - Number of milliseconds the harvester will wait for a response from the data provider before timing out
Method Detail

main

public static void main(String[] args)
Command line interface for the harvester. Harvest status messages are output to standard out.

Arguments (required arguments must be in this order, optional arguments may be in any order):

Parameters:
args - The command line arguments

harvest

public static String[][] harvest(String baseURL,
                                 String metadataPrefix,
                                 String set,
                                 Date from,
                                 Date until,
                                 String outdir,
                                 boolean splitBySet,
                                 HarvestMessageHandler msgHandler,
                                 boolean writeHeaders,
                                 boolean harvestAll,
                                 boolean harvestAllIfNoDeletedRecord,
                                 int timeOutMilliseconds)
                          throws Hexception,
                                 OAIErrorException
Harvest the given provider, saving the resulting metadata to file or returning the results as an array of Strings. A HarvestMessageHandler may be specified to capture harvest progress messages. Use a SimpleHarvestMessageHandler to have harvest messages sent to standard out.

Parameters:
baseURL - The baseURL of the data provider, for example "http://www.dlese.org/oai/provider"
metadataPrefix - The metadataPrefix, for example "oai_dc"
set - The set to harvest, for example "testset", or null to harvest all sets
from - The from date, for example "2003-12-31T23:59:59Z", or null for none
until - The until date, for example "2003-12-31T23:59:59Z", or null for none
outdir - The path of output dir. If null or "", we return the String[][] array; if specified we return null
msgHandler - A handler for status messages that occur during the harvest, or null to ignore messages
writeHeaders - True to have OAI headers written to the output, false not to
harvestAll - True to delete previous harvested record files and harvest all records again from scratch; false to preserve previous record files and replace or delete only those that have changed
harvestAllIfNoDeletedRecord - True to harvest all record files from scratch if deleted records are not supported
splitBySet - True to save each record in separate directories split by set inside outdir, false to save all records to the root of outdir
timeOutMilliseconds - Number of milliseconds the harvester will wait for a response from the data provider before timing out
Returns:
If outdir is specified returns null; if outdir is null or "", returns one row for each record harvested. Each row has two elements:
  • identifier, encoded
  • content xml record, or the String deleted if status=deleted.

Throws:
Hexception - If serious error
OAIErrorException - If OAI error
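
The returned array can be walked row by row, as in the sketch below. It assumes the two-element row layout described under Returns and that deleted records carry the literal String "deleted" as their content; the class HarvestResultSketch and the sample identifiers are hypothetical and are not part of this API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the DLESE API: inspects a String[][] harvest result. */
final class HarvestResultSketch {

    /** Returns the identifiers of all rows whose content is the marker String "deleted". */
    static List<String> deletedIdentifiers(String[][] rows) {
        List<String> deleted = new ArrayList<>();
        if (rows == null) {
            // null is documented as the result when outdir was specified,
            // because the records were written to file instead.
            return deleted;
        }
        for (String[] row : rows) {
            String identifier = row[0]; // encoded identifier
            String content = row[1];    // XML record, or "deleted" (assumed marker)
            if ("deleted".equals(content)) {
                deleted.add(identifier);
            }
        }
        return deleted;
    }
}
```

Checking for null first keeps the same caller code usable whether or not an output directory was given.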

kill

public void kill()
Gracefully kills the harvest after the current record is finished being harvested.
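
A graceful stop of this kind is commonly built on a flag that the working loop checks between records. The sketch below illustrates that general pattern only; it makes no claim about how Harvester implements kill(), and the class GracefulStopSketch is a hypothetical name.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical illustration of cooperative cancellation, not the Harvester's internals. */
final class GracefulStopSketch {
    // volatile so that a kill() issued from another thread is seen by the working thread
    private volatile boolean killed = false;
    private volatile int processed = 0;

    /** Requests a stop; the record currently being handled is allowed to finish. */
    void kill() {
        killed = true;
    }

    int getProcessed() {
        return processed;
    }

    /** Handles records in order, checking the kill flag only between records. */
    void run(List<String> records, Consumer<String> handler) {
        for (String record : records) {
            if (killed) {
                break;
            }
            handler.accept(record); // the current record always completes
            processed++;
        }
    }
}
```

Because the flag is checked only at record boundaries, no record is left half written when the loop exits.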


setNumRecordsForNotification

public void setNumRecordsForNotification(int numRecords)
Sets the number of records harvested before statusMessage notifications to the HarvestMessageHandler are made.

Parameters:
numRecords - The new numRecordsForNotification value

getStartTime

public long getStartTime()
Gets the startTime when the harvest began, or 0 if it has not begun yet.

Returns:
The startTime, or 0 if not started yet.

getHarvestedRecordsDir

public String getHarvestedRecordsDir()
Gets the harvestedRecordsDir attribute of the Harvester object

Returns:
The harvestedRecordsDir value

getHarvestUid

public long getHarvestUid()
Returns a unique ID for this harvest.

Returns:
The harvestId value

getEndTime

public long getEndTime()
Gets the endTime when the harvest completed, either because of an error or at the end of a successful harvest. Returns 0 if the harvest is still in progress.

Returns:
The endTime, or 0 if the harvest is still in progress.
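
The zero conventions of getStartTime() and getEndTime() can be combined to report how long a harvest has run. The sketch below is one way to do so, assuming both values and the supplied current time use the same unit (the unit is not stated here); HarvestTimingSketch is a hypothetical name.

```java
/** Hypothetical helper, not part of the DLESE API: derives elapsed time from start/end values. */
final class HarvestTimingSketch {

    /**
     * @param startTime value of getStartTime(), 0 if the harvest has not begun
     * @param endTime   value of getEndTime(), 0 if the harvest is still in progress
     * @param now       the current time, in the same unit as the other two values
     */
    static long elapsed(long startTime, long endTime, long now) {
        if (startTime == 0) {
            return 0;               // not started yet
        }
        if (endTime == 0) {
            return now - startTime; // still running
        }
        return endTime - startTime; // finished, by error or success
    }
}
```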

getNumRecordsHarvested

public int getNumRecordsHarvested()
Gets the current number of records that have been harvested by this harvester. This number increases as the harvest progresses.

Returns:
The numRecordsHarvested value

getNumResumptionTokensIssued

public int getNumResumptionTokensIssued()
Gets the number of resumption tokens that have currently been issued by the data provider. This number increases as the harvest progresses. This number gives a rough indication of the progression and duration of the harvest.

Returns:
The numResumptionTokensIssued value.

isRunning

public boolean isRunning()
Determines whether this Harvester is currently running or not.

Returns:
True if the harvest is in progress, false otherwise.

doHarvest

public String[][] doHarvest(String baseURL,
                            String metadataPrefix,
                            String set,
                            Date from,
                            Date until,
                            String outdir,
                            boolean splitBySet,
                            String zipName,
                            String zDir,
                            boolean writeHeaders,
                            boolean harvestAll,
                            boolean harvestAllIfNoDeletedRecord)
                     throws Hexception,
                            OAIErrorException
Performs the harvest. Note that this method is not safe for multiple harvests: a separate Harvester instance should be created for each harvest performed.

Parameters:
metadataPrefix - metadataPrefix. e.g., "oai_dc", or null to harvest all formats
set - set. e.g., "testset" or null for none.
from - from date. May be null.
until - until date. May be null.
outdir - path of output dir. If null or "", we return the String[][] array; if specified we return null.
writeHeaders - True to have OAI headers written to file, false not to. The directory structure under outdir is:
  outdir/set/subset/subset/metadataPrefix/oaiId_hdr.xml (OAI header)
  outdir/set/subset/subset/metadataPrefix/oaiId_data.xml (OAI contents)
baseURL - The baseURL of the data provider.
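splitBySet - True to save each record in separate directories split by set inside outdir, false to save all records to the root of outdir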
zipName - Name of the zip file to save to
zDir - Directory of the zipfile
harvestAll - True to delete previous harvested records and harvest all records again from scratch
harvestAllIfNoDeletedRecord - True to harvest all records from scratch if deleted records are not supported
Returns:
If outdir is specified returns null; if outdir is null or "", returns one row for each record harvested. Each row has two elements:
  • identifier, encoded
  • content xml record.

Throws:
Hexception - If serious error.
OAIErrorException - If OAI error was returned by the data provider.
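
The directory structure given for outdir can be pictured with the sketch below. It rests on assumptions that are not stated above: that the set/subset/subset levels come from a colon-separated OAI setSpec, and that the file name stem is the already encoded identifier. HarvestLayoutSketch and its methods are hypothetical names, and the real harvester may derive paths differently.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, not part of the DLESE API: mirrors the documented outdir layout. */
final class HarvestLayoutSketch {

    /** Builds outdir/set/subset/.../metadataPrefix; a null or empty setSpec adds no set levels. */
    static Path recordDir(String outdir, String setSpec, String metadataPrefix) {
        Path dir = Paths.get(outdir);
        if (setSpec != null && !setSpec.isEmpty()) {
            // Assumption: each colon-separated setSpec component becomes one directory level.
            for (String part : setSpec.split(":")) {
                dir = dir.resolve(part);
            }
        }
        return dir.resolve(metadataPrefix);
    }

    /** Path of the OAI header file for an already encoded identifier. */
    static Path headerFile(Path dir, String encodedId) {
        return dir.resolve(encodedId + "_hdr.xml");
    }

    /** Path of the OAI contents file for an already encoded identifier. */
    static Path dataFile(Path dir, String encodedId) {
        return dir.resolve(encodedId + "_data.xml");
    }
}
```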

setDebug

public static void setDebug(boolean db)
Sets the debug attribute.

Parameters:
db - The new debug value

fatalError

public void fatalError(SAXParseException exc)
Handles fatal errors. Part of ErrorHandler interface.

Specified by:
fatalError in interface ErrorHandler
Parameters:
exc - The Exception thrown

error

public void error(SAXParseException exc)
Handles errors. Part of ErrorHandler interface.

Specified by:
error in interface ErrorHandler
Parameters:
exc - The Exception thrown

warning

public void warning(SAXParseException exc)
Handles warnings. Part of ErrorHandler interface.

Specified by:
warning in interface ErrorHandler
Parameters:
exc - The Exception thrown
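
The three callbacks above belong to the standard org.xml.sax.ErrorHandler interface. The sketch below shows how a handler of that interface receives parser events from the JDK's XML parser; it is a generic illustration, not the Harvester's own handling, and CollectingErrorHandlerSketch is a hypothetical class.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

/** Hypothetical ErrorHandler that records parser messages instead of printing them. */
final class CollectingErrorHandlerSketch implements ErrorHandler {
    final List<String> warnings = new ArrayList<>();
    final List<String> errors = new ArrayList<>();
    final List<String> fatalErrors = new ArrayList<>();

    @Override
    public void warning(SAXParseException exc) {
        warnings.add(describe(exc));
    }

    @Override
    public void error(SAXParseException exc) {
        errors.add(describe(exc));
    }

    @Override
    public void fatalError(SAXParseException exc) {
        fatalErrors.add(describe(exc));
    }

    private static String describe(SAXParseException exc) {
        return "line " + exc.getLineNumber() + ": " + exc.getMessage();
    }

    /** Parses the XML with the given handler attached; returns true if parsing succeeded. */
    static boolean parse(String xml, CollectingErrorHandlerSketch handler) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(handler);
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            // The parser still throws after reporting a fatal error to the handler.
            return false;
        }
    }
}
```

Collecting the messages rather than rethrowing lets the caller decide afterwards whether a malformed response should abort the whole operation.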
