Metadata Collections & QA

Collections management process


The goals of the DLESE collections management process are to ensure:

  • Access to resources - maximize appropriate search returns to the user and minimize returns with invalid URLs
  • Growth of collections when they expand or are updated
  • Provision of collections to the NSDL (National Science Digital Library)


To accomplish the goals of collections management (access, growth and provision), the DPC (DLESE Program Center) is responsible for five primary management functions that ensure data integrity at all levels of metadata creation, manipulation and updating. These functions are performed on all collections and all metadata records within collections regardless of the DLESE metadata framework used. The functions are defined as follows:

  1. Quality assurance - reviewing metadata records and collections for completeness and readiness to be part of the library
  2. Maintenance - testing the vitality of records and collections through link checking, duplicate checking and syntax checking
  3. Updating - harvesting new and changed metadata records in collections and making them available in the library
  4. Redistribution - providing metadata records in various formats so they can be harvested by service providers (external parties), and providing DLESE collections to NSDL as separate and discrete collections
  5. Deaccessioning - removing problematic metadata records or collections from the DLESE in accordance with the DLESE Accessioning and Deaccessioning Policy

Each of these management functions is described in detail next.


1. Quality assurance

Quality assurance is performed for each collection by its quality assurance person(s); every DLESE collection has at least one. These person(s) are DPC staff, external collection builders, or the two working together. Often the DPC is contracted to perform quality assurance for collections. No matter who performs it, quality assurance has the following goals when accessioning metadata records into collections:

  • URL is confirmed to be active and point to the resource described in the metadata record
  • The content of the resource is within the scope of DLESE per the DLESE Collection Scope and Policy Statement
  • All required metadata is present and appropriate; any missing or inappropriate data is added or amended per the DLESE Metadata Quality Guidelines
  • Coverage/geospatial formatting is correct
  • Technical aspects are functional (links, images, applets work)
  • Editing for spelling, grammar and completeness
  • Non-required metadata fields are used appropriately and correctly
  • The use of the particular metadata framework is correct
  • Prevent the usage of illegal XML (eXtensible Markup Language) characters (detection is done by a separate Python script)

While quality assurance of metadata records is primarily the responsibility of external collection builders, DPC staff work with collection builders to ensure the above checks are performed. This is done by reviewing individual metadata records (could be the whole collection or just a sampling) with the collection builder upon accessioning and periodically during updating.
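The illegal XML character check listed among the goals above can be illustrated with a short sketch. This is not the DPC's actual script, which is not reproduced in this document; it only assumes the check follows the character ranges permitted by XML 1.0. The function name find_illegal_xml_chars is hypothetical.

```python
import re

# XML 1.0 permits only: #x9 | #xA | #xD | [#x20-#xD7FF] |
# [#xE000-#xFFFD] | [#x10000-#x10FFFF]. Anything else is illegal.
ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text):
    """Return (offset, code point) pairs for each illegal character in text.

    Hypothetical helper for illustration; not the DPC script itself.
    """
    return [(m.start(), ord(m.group())) for m in ILLEGAL_XML_CHARS.finditer(text)]


if __name__ == "__main__":
    sample = "Plate tectonics\x0b overview"  # contains a vertical tab (0x0B)
    for offset, code_point in find_illegal_xml_chars(sample):
        print(f"illegal character U+{code_point:04X} at offset {offset}")
```

Reporting offsets rather than silently stripping characters keeps the decision about how to repair a record with the quality assurance person(s).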

2. Maintenance

Since maintenance involves link checking, duplicate checking, syntax checking and status checking, the Idmapper tool was designed to facilitate these checks. The checks, performed over all collections, are defined as follows:

  • Link checking - determine the vitality of URLs within metadata records: whether the URLs are accessible, broken, redirected, etc.
  • Duplication checking - detect the presence of metadata records that reference the same URL within a collection
  • Syntax checking - detect the presence of ill-formed data for certain fields (e.g. email addresses)
  • Status checking - determine if the record is active or not and whether the record is new to the collection (data from this check is not completely accessible)

Each of these checks is performed twice a day, and a history over time is maintained in a database. Generated reports are emailed to DPC staff twice a week and passed to DLESE Discovery via an API twice a day. DPC staff take action on the issues or, if need be, share them with the quality assurance person(s) of the collection that has errors. Each of these maintenance checks is described in more detail below. Please remember that a collection's quality assurance person(s) could be DPC staff or external collection builders.


For link checking, the DPC-built Idmapper tool tests URL vitality on all important URLs in each particular DLESE metadata framework. Errors are tracked and organized by type. URL not found, vitality too low (less than 50% accessibility over time), connection refused, and permanent redirect errors take the highest priority for action, especially for URLs that have been inaccessible for 3 days. The goal within DLESE is to have less than 5% inaccessible URLs. When DPC staff cannot resolve the URL errors, the collection quality assurance person(s) are contacted to take action regarding the metadata record.
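The prioritization rules above can be sketched in a few lines. This is an illustration only and not how the Idmapper is implemented: the error labels, the function names, and the representation of a URL's history as a list of booleans (one per check) are all assumptions made for the example.

```python
# Hypothetical labels for the error types the text names as highest priority.
HIGH_PRIORITY_ERRORS = {"not_found", "connection_refused", "permanent_redirect"}


def vitality(history):
    """Fraction of checks in which the URL was accessible.

    history is a non-empty list of booleans, one entry per link check.
    """
    return sum(history) / len(history)


def needs_action(history, last_error=None, days_inaccessible=0):
    """True when a URL falls into one of the high-priority cases."""
    return (
        vitality(history) < 0.5                # vitality too low
        or last_error in HIGH_PRIORITY_ERRORS  # high-priority error type
        or days_inaccessible >= 3              # inaccessible for 3 days
    )


def meets_library_goal(latest_results):
    """True when less than 5% of URLs were inaccessible in the latest check."""
    inaccessible = latest_results.count(False)
    return inaccessible / len(latest_results) < 0.05


if __name__ == "__main__":
    print(needs_action([True, False, False, False]))       # low vitality
    print(needs_action([True] * 4, last_error="not_found"))
    print(meets_library_goal([True] * 96 + [False] * 4))
```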

When the collection quality assurance person is a DPC staff member, the resource creator of a broken URL is contacted. A standard message is sent out via email. If a response is received, the appropriate changes are made (which may include removal if the resource has been taken offline). If no response is received within approximately 3 days, the metadata record for the resource is removed from DLESE Discovery.

When the collection quality assurance person is an external collection builder, the collection builder is emailed and asked either to remove the record from their collection or to contact the resource creator to fix the URL. If the collection builder responds and provides information on what to do with the record, DPC staff take that action. If the collection builder does not respond within 3 days, DPC staff do not contact the resource creator directly but rather remove the metadata record from DLESE Discovery until a decision is made about what to do with it.

Metadata records are not deleted but rather moved to a non-discoverable location. These changes are propagated forward (but this takes some time - see the Updating section of this document).

For duplication checking, the Idmapper uses both URLs and resource content to determine duplications. Content checking compares content in order to find duplicate content or mirror site content that does not match primary URL content. Duplicate URL checking is performed on ADN and Annotation metadata records only. Errors are generated when the same URL appears twice in the primary URL field across metadata records within the same collection. When this happens, the quality assurance person for the collection is notified to resolve the duplicate metadata records. Duplicate records within a collection are not allowed but duplicate records across collections are allowed. To determine if duplication of catalog record numbers and collection keys occurs, a Python script is used. Ideally, the duplication of cataloging record numbers and collection keys would become part of the Idmapper checks.
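The duplicate URL rule (not allowed within a collection, allowed across collections) can be sketched as follows. This is a minimal illustration, not the Idmapper's logic: the tuple layout, the function name, and the sample identifiers are assumptions, and URLs are compared as exact strings with no normalization.

```python
from collections import defaultdict


def find_duplicate_urls(records):
    """Group record IDs that share a primary URL within the same collection.

    records is an iterable of (collection_key, record_id, primary_url)
    tuples; this layout is assumed for the example.
    """
    seen = defaultdict(list)
    for collection_key, record_id, url in records:
        # Keying on (collection, URL) means the same URL in two different
        # collections is never reported, matching the rule in the text.
        seen[(collection_key, url)].append(record_id)
    return {key: ids for key, ids in seen.items() if len(ids) > 1}


if __name__ == "__main__":
    sample = [
        ("dcc", "REC-001", "http://example.org/volcano"),
        ("dcc", "REC-002", "http://example.org/volcano"),    # duplicate in dcc
        ("other", "REC-900", "http://example.org/volcano"),  # allowed
    ]
    for (collection, url), ids in find_duplicate_urls(sample).items():
        print(collection, url, ids)
```

The same grouping approach would apply to catalog record numbers or collection keys by changing what goes into the dictionary key.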


Syntax checking ensures data integrity of certain fields, particularly email addresses of resource creators. This check is important and came into being because the Community Review System (CRS) collection uses resource creator email addresses to contact resource creators to see if they would like to submit their resource for review and potential representation in the DLESE Reviewed Collection (DRC). When errors are discovered, the same resolution procedures as described in the link checking section are enacted here.
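A syntax check on email addresses might look like the sketch below. The actual rules the Idmapper applies are not given in this document, so the pattern here is an assumption: a deliberately loose check for one "@" followed by a dotted domain, not full address validation.

```python
import re

# Loose pattern: some non-space text, one "@", then a domain containing a dot.
# Assumed for illustration; it does not implement the full email grammar.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_plausible_email(value):
    """Return True when value has the basic shape of an email address."""
    return EMAIL_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["creator@example.org", "creator@example", "no at sign.org"]:
        print(candidate, is_plausible_email(candidate))
```

A loose check of this kind catches ill-formed entries without rejecting unusual but valid addresses.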

Status checking is used to sense the presence of new metadata records in a collection for both DPC staff and DLESE Discovery. More importantly for DLESE Discovery, it indicates whether the record is active or inactive. Currently, by default, all records are active and there is no user interface or easy manipulation method, except to disable an entire collection, to make individual metadata records within a collection inactive. Generally, to make individual records inactive they have to be physically moved by hand into a 'holding' directory so that they are not discoverable.
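The manual move into a 'holding' directory could be scripted along the following lines. This is a sketch under assumptions: the directory layout, file names, and function name are hypothetical, and it reflects only the fact stated above that records are moved rather than deleted.

```python
import shutil
from pathlib import Path


def deactivate_record(record_path, holding_dir):
    """Move one metadata record file into a holding directory.

    The record is moved, not deleted, so it can be restored later.
    Paths are hypothetical; the real DLESE layout is not described here.
    """
    record_path = Path(record_path)
    holding_dir = Path(holding_dir)
    holding_dir.mkdir(parents=True, exist_ok=True)
    target = holding_dir / record_path.name
    if target.exists():
        # Refuse to overwrite a record that is already being held.
        raise FileExistsError(f"{target} already exists in holding")
    shutil.move(str(record_path), str(target))
    return target
```

Restoring a record would be the same move in the opposite direction.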

3. Updating

Updating of metadata records and collections occurs for the following reasons:

  • New metadata records
  • Updated metadata records
  • DLESE metadata framework changes, in structure or vocabularies

To accomplish these updates, the updating process includes:

  • Harvesting - how metadata records arrive at the DPC
  • Quality checks - confirming what was received, plus the same checks as described in the quality assurance section, performed on new or updated records
  • Transforming - put metadata records into the proper format for DLESE Discovery
  • Validating - ensuring the data integrity
  • Maintenance checks - the same checks as described in the maintenance section
  • Indexing - add the new and changed metadata records to DLESE Discovery


For harvesting, in the broadest sense, metadata records within collections arrive at the DPC by any of the following methods (and sometimes multiple methods):

  • Cataloged directly at the DPC using a DCS (records land on Bolide)
  • Cataloged directly at the DPC using an XML metadata template in XML Spy (records land on Bolide)
  • Cataloged directly at the DPC by transforming an existing collection of metadata records from the collection builder's native metadata format to a DLESE metadata format (script manipulations that use XSLT) (records land on Bolide)
  • Email (records land on Flood)
  • OAI (Open Archives Initiative) 2.0 harvesting from OAI 2.0 or OAI 1.1 providers (records land on Bolide)
  • Initiating an FTP session to get to a collection builder server (records land on Flood)

The comments in parentheses above indicate the name of the DLESE server that the metadata records arrive on via the particular delivery method. To process newly arrived records, the first task is to gather them in a single area on a single DLESE server.
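For the OAI delivery method, a harvest begins with an HTTP request to the collection builder's OAI provider. The sketch below only builds an OAI-PMH 2.0 ListRecords request URL and does not fetch anything. The base URL and set name are hypothetical, and the DPC's actual harvester is not shown in this document.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH 2.0 ListRecords request URL.

    verb, metadataPrefix, set and from are standard OAI-PMH 2.0 request
    arguments; base_url and the values passed in are illustrative.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec    # restrict the harvest to one set
    if from_date:
        params["from"] = from_date  # only records changed since this date
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical provider address and set name.
    print(list_records_url("http://example.org/oai", set_spec="dcc",
                           from_date="2005-06-01"))
```

A date-limited request returns only new and changed records, so it cannot by itself reveal deletions unless the provider reports them.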

The first quality check is to determine whether what was harvested is truly the complete collection. This is tricky because collection providers often do not indicate whether they have deleted records. Their OAI providers might supply deletion information, but more often than not they do not, and records arriving by email carry no such indication either. So generally a collection is completely deleted in order to process any new or updated records; that is, the whole collection is worked with rather than just the new or updated records. This is time consuming and CPU intensive for large collections. The quality checks described in section 1 (Quality assurance) are completed when updating records as well.

In terms of transforming, only collections that do not arrive in a format that DLESE Discovery can handle are transformed. In order for this to happen, a semantic crosswalk and technical XSLT transform must exist to change metadata from the collection builder format to a DLESE metadata format. Currently, the DPC supports the following metadata transformations:

  • DLESE-IMS 1.2 to ADN 0.6.50 (as of June 2005, DLESE-IMS has been retired)
  • ADN 0.6.50 to Annotation 0.1.01 in order to generate annotation records for a collection
  • ADN 0.6.50 to OAI-DC
  • ADN 0.6.50 to NSDL-DC 1.02
  • Individual collections formats to ADN 0.6.50

All other transforms must be performed locally by the collection builder prior to delivery to the DPC. The DPC can assist in writing such transforms. Any errors generated during the transform process are shared with the collection quality assurance person(s). Transformation errors on greater than 5% of records will prevent the collection from being updated.


When the updating process is a result of a change in a metadata framework, the DPC creates the semantic and technical crosswalk to go from the currently acceptable DLESE Discovery metadata format to the new metadata format. Such changes greatly impact collection builders and DLESE systems. DPC staff works together and with collection builders to minimize such impacts. Such changes take time and are carefully coordinated transitions.

All metadata records are validated during the update process. Any errors generated by validating are shared with collection quality assurance person(s). Validation errors on greater than 5% of records will prevent a collection from being updated.

In terms of maintenance, a collection must undergo all the previously described maintenance checks before it can be updated in DLESE Discovery. This is because the Idmapper used in maintenance checking provides direct information about records to DLESE Discovery. If the Idmapper does not know about the metadata record (status), then DLESE Discovery does not know about the metadata record either and an indexing error is generated.

Finally, records are indexed into DLESE Discovery. This process uses the Collection Manager to index records. Any errors generated by indexing are shared with the collection quality assurance person(s). Indexing errors on greater than 5% of records will prevent the collection from being updated.
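The 5% rule stated for the transforming, validating, and indexing steps can be expressed as one small gate. This is a sketch of the rule as written above, not DPC code; the function names and the stage labels are assumptions.

```python
def blocks_update(total_records, error_count, threshold_percent=5):
    """True when errors affect more than threshold_percent of records."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    # Integer arithmetic avoids floating-point edge cases at exactly 5%.
    return error_count * 100 > total_records * threshold_percent


def first_blocking_stage(total_records, errors_by_stage):
    """Return the first stage whose error rate blocks the update, else None.

    errors_by_stage is an ordered list of (stage_name, error_count) pairs.
    """
    for stage, error_count in errors_by_stage:
        if blocks_update(total_records, error_count):
            return stage
    return None


if __name__ == "__main__":
    stages = [("transform", 2), ("validate", 12), ("index", 0)]
    print(first_blocking_stage(200, stages))  # 12 of 200 is 6%
```

Exactly 5% does not block the update, since the rule applies only to errors on greater than 5% of records.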

The overall updating process of collections is ongoing and varies considerably in the amount of time required to perform the actual collection updates. The process is also very manual in nature, since all components are started and stopped by hand rather than automatically. As such, the entire library is updated only once a month, and an update may take a week or so to complete. Individual collections like the DLESE Community Collection (DCC), which makes up 50% of the library, are updated once a week.


4. Redistribution

This is the process of making metadata records available in various formats to service providers through OAI. The redistribution process includes:

  • Determination - determine if the collection can be redistributed and, if so, how it is to be redistributed: as a separate collection or under the DLESE umbrella. If the collection is to be separate, verify that all required extra metadata is obtained
  • NSDL - determine if the collection will be sent onto NSDL; if so complete (by hand) the NSDL web form that creates a collection-level metadata record
  • Transformation - if not already done, create an XSLT transform to support OAI providing of metadata records in the desired metadata format; put the XSL stylesheet in the proper location in the OAI setup so that it can be used
  • Organization - put metadata records into the OAI provider by creating the proper set information

5. Deaccessioning

Below is a summary of the deaccessioning process. For a complete explanation, please see the Collections Deaccessioning Process document. Deaccessioning is the process of removing problematic collections or individual metadata records from DLESE Discovery in accordance with the DLESE Accessioning and Deaccessioning Policy. When issues arise, the DLESE Collection Committee is consulted. The process of removal means physically moving records to a holding directory so that they are not discoverable. Records are not deleted. Periodically, these records or collections are checked to see if they can become active again. Collection builders are informed when deaccessioning actions occur and are generally aware of such actions prior to occurrence because the DPC staff has generally been working with the collection builder to keep records and collections active.


Visuals displaying the process

Technologies used

  • Xerces and Xalan for validating and transforming metadata records
  • Python scripts for bad character checking
  • Shell scripts as wrappers around Java tools
  • XML as the container for metadata records
  • XSLT as the stylesheets for transformations
  • XML schema for validating against the metadata frameworks
  • Database (MySQL) for use in the Idmapper (familiarity with command line MySQL commands is required)
  • Email as the communication conduit with collection builders

DLESE tools used

  • DLESE Collection System
  • Idmapper for URL, duplicate checking and status information of metadata records
  • Metadata frameworks for making actual metadata records
  • Collection manager for making metadata records discoverable in DLESE
  • OAI software for redistributing metadata records
  • DLESE Discovery for indexing and searching records


Outstanding issues

These are actions within the process that need automation in order for this process to scale more effectively:

  • Prevent duplication of collection keys (a script checks this but is not integrated into all DLESE systems)
  • Prevent duplication of catalog record numbers within a collection (a script checks this but is not integrated into all DLESE systems)
  • Prevent duplication of catalog record numbers across collections in order to have successful OAI sets (currently no support, relies on human memory)
  • Allow the ability to change individual metadata record statuses, rather than whole collections (currently no support; requires hand intervention of physically changing directories of metadata records)
  • Allow for tracking of link checking activities (currently no support, relies on human tracking notes)


Last updated: 2005-06-23