|
Collections management process

Purpose

The goals of the DLESE collections management process are to ensure:
Workflow

To accomplish the goals of collections management (access, growth and provision), the DPC (DLESE Program Center) is responsible for five primary management functions that ensure data integrity at all levels of metadata creation, manipulation and updating. These functions are performed on all collections and all metadata records within collections regardless of the DLESE metadata framework used. The functions are defined as follows:
Each of these management functions is described in detail next.

1. Quality assurance

Quality assurance is performed by the quality assurance person(s) for each collection. Every DLESE collection has such person(s). They are DPC staff, external collection builders, or the two working together. Often the DPC is contracted to perform quality assurance for collections. No matter who performs it, quality assurance has the following goals in order to accession metadata records into collections:
While quality assurance of metadata records is primarily the responsibility of external collection builders, DPC staff work with collection builders to ensure the above checks are performed. This is done by reviewing individual metadata records (either the whole collection or just a sample) with the collection builder upon accessioning and periodically during updating.

2. Maintenance

Maintenance involves link checking, duplicate checking, syntax checking and status checking, and the Idmapper tool was designed to facilitate these checks. The checks, performed over all collections, are defined as follows:
Each of these checks is performed twice a day, and a history over time is maintained in a database. Generated reports are emailed to DPC staff twice a week and passed to DLESE Discovery via an API twice a day. DPC staff take action on the issues or, if need be, share the issues with the quality assurance person(s) of the collection that has errors. Each of these maintenance checks is described in more detail below. Please remember that a collection's quality assurance person(s) could be DPC staff or external collection builders.

For link checking, the DPC-built Idmapper tool tests URL vitality on all important URLs in each particular DLESE metadata framework. Errors are tracked and organized by type. URL not found, vitality too low (less than 50% accessibility over time), connection refused, and permanent redirect errors take the highest priority for action, especially for URLs that have been inaccessible for 3 days. The goal within DLESE is to have fewer than 5% of URLs inaccessible. When DPC staff cannot resolve the URL errors, the collection quality assurance person is contacted to take action regarding the metadata record.

When the collection quality assurance person is a DPC staff member, the resource creator of a broken URL is contacted. A standard message is sent out via email. If a response is received, the appropriate changes are made (which may include removal if the resource has been taken offline). If no response is received in approximately 3 days, the metadata record for the resource is removed from DLESE Discovery.

When the collection quality assurance person is an external collection builder, the collection builder is emailed and asked either to remove the record from their collection or to contact the resource creator to fix the URL. If the collection builder responds and provides information on what to do with the record, DPC staff take that action.
If the collection builder does not respond in 3 days, DPC staff do not contact the resource creator directly but rather remove the metadata record from DLESE Discovery until a decision is made about what to do. Metadata records are not deleted but rather moved to a non-discoverable location. These changes are propagated forward (but this takes some time; see the Updating section of this document).

For duplicate checking, the Idmapper uses both URLs and resource content to detect duplicates. Content checking compares content in order to find duplicate content or mirror site content that does not match the primary URL content. Duplicate URL checking is performed on ADN and Annotation metadata records only. Errors are generated when the same URL appears more than once in the primary URL field across metadata records within the same collection. When this happens, the quality assurance person for the collection is notified to resolve the duplicate metadata records. Duplicate records within a collection are not allowed, but duplicate records across collections are allowed. A Python script is used to determine whether catalog record numbers and collection keys are duplicated. Ideally, this check would become part of the Idmapper checks.

Syntax checking ensures data integrity of certain fields, particularly email addresses of resource creators. This check is important and came into being because the Community Review System (CRS) collection uses resource creator email addresses to contact resource creators and ask whether they would like to submit their resource for review and potential representation in the DLESE Reviewed Collection (DRC). When errors are discovered, the same resolution procedures described for link checking are followed.

Status checking is used to sense the presence of new metadata records in a collection for both DPC staff and DLESE Discovery.
More importantly for DLESE Discovery, it indicates whether the record is active or inactive. Currently, all records are active by default, and apart from disabling an entire collection there is no user interface or easy method for making individual metadata records within a collection inactive. Generally, to make individual records inactive, they have to be moved by hand into a 'holding' directory so that they are not discoverable.

3. Updating

Updating of metadata records and collections occurs for the following reasons:
To accomplish these updates, the updating process includes:
For harvesting, in the broadest sense, metadata records within collections arrive at the DPC by any of the following methods (and sometimes multiple methods):
The comments in parentheses above indicate the name of the DLESE server on which the metadata records arrive via the particular delivery method. To process newly arrived records, the first task is to gather them in a single area on a single DLESE server. The first quality check is to determine whether what was harvested is truly the whole collection. This is a rather tricky question because collection providers do not often indicate whether they have deleted records. Their OAI providers might supply deletion information, but more often than not they do not, and records arriving by email carry no such indication either. So generally a collection is completely deleted in order to process any new or updated records. That is, the whole collection is worked with rather than just new or updated records. This is time consuming and CPU intensive for large collections. The quality checks described in section 2 for generating metadata records are completed when updating records as well.

In terms of transforming, only collections that do not arrive in a format that DLESE Discovery can handle are transformed. For this to happen, a semantic crosswalk and a technical XSLT transform must exist to change metadata from the collection builder format to a DLESE metadata format. Currently, the DPC supports the following metadata transformations:
All other transforms must be performed locally by the collection builder prior to delivery to the DPC. The DPC can assist in writing such transforms. Any errors generated during the transform process are shared with the collection quality assurance person(s). Transformation errors on more than 5% of records will prevent the collection from being updated.

When the updating process results from a change in a metadata framework, the DPC creates the semantic and technical crosswalk to go from the currently accepted DLESE Discovery metadata format to the new metadata format. Such changes greatly impact collection builders and DLESE systems. DPC staff work together and with collection builders to minimize such impacts. Such changes take time and are carefully coordinated transitions.

All metadata records are validated during the update process. Any errors generated by validating are shared with the collection quality assurance person(s). Validation errors on more than 5% of records will prevent a collection from being updated.

In terms of maintenance, a collection must undergo all the previously described maintenance checks before it can be updated in DLESE Discovery. This is because the Idmapper used in maintenance checking provides direct information about records to DLESE Discovery. If the Idmapper does not know about a metadata record (its status), then DLESE Discovery does not know about the record either and an indexing error is generated.

Finally, records are indexed into DLESE Discovery. This process uses the Collection Manager to index records. Any errors generated by indexing are shared with the collection quality assurance person(s). Indexing errors on more than 5% of records will prevent the collection from being updated.

The overall updating process of collections is ongoing and varies considerably in the amount of time required to perform the actual collection updates.
The process is also very manual in nature since all components are started and stopped by hand rather than automatically. As such, the entire library is updated only once a month, and an update may take a week or so to complete. Individual collections like the DLESE Community Collection (DCC), which is 50% of the library, are updated once a week. |
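
The 5% rule that the updating section applies to transformation, validation and indexing errors can be illustrated with a short sketch. This is not DPC code: the function names, the stage labels and the shape of the error counts are assumptions made for illustration only. The only detail taken from the text is that errors on more than 5% of records block an update.

```python
# Hypothetical sketch of the 5% update gate; all names here are assumptions.
MAX_ERROR_PERCENT = 5  # errors on more than 5% of records block the update

# Stage labels are illustrative; they mirror the stages named in the text.
STAGES = ("transform", "validate", "index")


def blocking_stages(total_records, errors_by_stage):
    """Return the stages whose error count exceeds 5% of the records."""
    if total_records <= 0:
        raise ValueError("collection has no records to update")
    blocked = []
    for stage in STAGES:
        failed = errors_by_stage.get(stage, 0)
        # Integer comparison avoids floating-point rounding at the boundary.
        if failed * 100 > MAX_ERROR_PERCENT * total_records:
            blocked.append(stage)
    return blocked


def can_update(total_records, errors_by_stage):
    """A collection may be updated only if no stage exceeds the threshold."""
    return not blocking_stages(total_records, errors_by_stage)
```

For a collection of 200 records, 10 failed records at a stage (exactly 5%) would still allow the update under this reading, while 11 would block it. Whether the real process treats exactly 5% as passing is not stated in the text.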