Metadata Collections & QA

Quality WG3: Metadata Structures Whitepaper

Summary from Quality Report:

  1. Learning resources in DLESE require description in terms of both pedagogy/education and scientific concepts
  2. An Earth System Science (ESS) vocabulary is missing for search and browse structures
  3. A vocabulary focus on teaching and learning about science is needed
  4. Current vocabularies (especially for ESS) are not concept-specific but are rather subject-specific

Charge:

Recommend a framework for new vocabulary development [and strategies for integrating it with the Discovery system]

Topics considered:

The metadata structures group concentrated on developing strategies for the following topic areas:

  1. Development, management and maintenance of controlled vocabularies/terminologies in DLESE metadata frameworks
  2. Integration of collection specific (local) controlled vocabularies/terminologies that collection builders want to catalog to, search by and browse through.
  3. Methods and processes for changing terms (keeping up with changes in pedagogic and scientific concepts)
  4. Impacts of controlled vocabularies/terminologies on library stakeholders (primarily library end users but impacts are also felt by collection builders, catalogers and the library as a whole in terms of assessing the collection)

Background:

The discussion of the topics above started with understanding background information about metadata and the DLESE metadata frameworks. This background information is provided here to help readers understand the starting points of the topic discussions.

Why use metadata: Per a discussion by quality working group 5, the following suggestions emerged:

  • Supports user comprehension of the relevance of resources; for some resources, like animations, metadata may be the only description the user has in understanding the resource
  • Supports collection assessment and development
  • Acts as a foundation for other library services (e.g. the metadata can capture National Science Education Standards; these can be used to facilitate matching to state standards)
  • Improves user search results when combined with resource content (see Merging Metadata and Content-Based Retrieval)

What are controlled vocabularies/terminologies? What types are there? Why use controlled vocabularies/terminologies: Controlled terminologies are useful metadata structures. For an explanation of what they are, see the controlled vocabulary/terminology concepts page.

What are the DLESE metadata frameworks: DLESE supports six metadata frameworks: ADN, annotation, news & opps., SMS, objects and collection. The frameworks were developed similarly; therefore any proposed strategies or recommendations should be carried out or considered across all the frameworks to avoid the possibility of maintaining two systems. Metadata is held as individual XML documents.

What fields in DLESE metadata frameworks have vocabularies: The vocabularies used in DLESE frameworks are accessible from the following pages for the various frameworks: ADN, annotation, news & opps., SMS, objects, collection.

Does DLESE have a vocabulary development/management process now: Yes. Please see the DLESE vocabulary management process page. Additions to this process are suggested later in this paper.

How do controlled vocabularies/terminologies relate to quality in DLESE: The use of controlled vocabularies/terminologies adds value and quality to the library user experience. For example, DLESE resources often do not contain explicit information about appropriate grade level. Since grade range is a piece of DLESE required metadata, the assignment of grade level information to a resource using the DLESE grade range controlled vocabulary can be thought of as a pedagogical service. If the controlled vocabulary is assigned consistently, then there is also a measure of quality across resources because DLESE metadata would be attempting to indicate that this certain group of resources is appropriate for such and such a grade range. This provides a better quality experience to the library user in terms of finding appropriate materials.

The next sections discuss the topics considered.

 

Topic 1: Development, management and maintenance of controlled vocabularies/terminologies in DLESE metadata frameworks

This discussion revolved around answering the following questions.

When should controlled vocabularies/terminologies be used? The strategy is to identify needs or conditions that may trigger the use of controlled vocabularies/terminologies within DLESE metadata frameworks. The current DLESE vocabulary management process identifies the following conditions as possibilities:

  • Consistent data is required for library browsing or searching
  • New frameworks or new framework fields are developed
  • A significant number (50% or greater) of metadata records across DLESE collections would benefit from a proposed specialized vocabulary
  • A vocabulary is not meeting the original need intent and therefore needs modification (i.e. the current vocabulary needs adjustment)

The following considerations are needed in the decision making process as well:

  • Cataloging cost (will it go up or down or stay the same)
  • Classification of a concept would add value to library stakeholder experiences including collections assessment of the library as a whole
  • The library would be able to develop new or enhance existing services
  • A controlled vocabulary/terminology already exists for a concept (e.g. NSDL has a vocabulary or some professional organization has a vocabulary)
  • A controlled vocabulary/terminology can be readily developed by using resource content indexing to help determine meaningful terms
  • Requiring a controlled vocabulary/terminology at a later date is harder to implement than requiring one early on and then changing it over to free text

Not all of these conditions need to be met for adoption of a controlled vocabulary/terminology. Rather, these are conditions to weigh when deciding whether it is prudent to use a controlled vocabulary/terminology. The decision makers should involve the DLESE Program Center (DPC) metadata group and other knowledgeable and appropriate metadata/content experts as necessary. Additionally, the decision process needs to account for evolutions of the digital library landscape in terms of information technology retrieval (ITR).
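The 50% condition listed above is the one criterion that can be checked numerically. The following is a minimal sketch, assuming each metadata record has already been flagged by a reviewer as benefiting (or not) from the proposed vocabulary. The record structure, the "would_benefit" flag and the function names are hypothetical and are not part of any DLESE system.

```python
from typing import Iterable, Mapping

# Threshold from the vocabulary management process: 50% or greater.
BENEFIT_THRESHOLD = 0.5


def benefit_share(records: Iterable[Mapping[str, object]]) -> float:
    """Return the fraction of records flagged as benefiting from the vocabulary."""
    records = list(records)
    if not records:
        return 0.0
    benefiting = sum(1 for record in records if record.get("would_benefit"))
    return benefiting / len(records)


def meets_threshold(records: Iterable[Mapping[str, object]],
                    threshold: float = BENEFIT_THRESHOLD) -> bool:
    """True when the benefiting share is at or above the threshold."""
    return benefit_share(records) >= threshold


# Hypothetical sample: two of four records would benefit.
sample_records = [
    {"id": "REC-1", "would_benefit": True},
    {"id": "REC-2", "would_benefit": True},
    {"id": "REC-3", "would_benefit": False},
    {"id": "REC-4", "would_benefit": False},
]

print(benefit_share(sample_records))    # 0.5
print(meets_threshold(sample_records))  # True
```

Meeting the threshold is only one input to the decision; the other conditions and considerations still apply.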

What controlled vocabulary/terminology type works best for each of the appropriate metadata fields across the DLESE metadata frameworks? This table (analysis instrument) is meant as a starting point for determining the complexity of a metadata field that uses, or may use, a controlled vocabulary/terminology. It lists the metadata field for a particular DLESE metadata framework, its complexity (high, medium or low) and the type of controlled vocabulary/terminology that is recommended. The table below is only a sample that diagrams some fields in the ADN metadata framework. The ratings of high, medium and low for element complexity refer to the task of applying a controlled vocabulary/terminology to the content of the resource, not to the structure of the controlled vocabulary/terminology.

Table 1: Controlled Vocabulary (CV) Information table for ADN metadata

| Some ADN Metadata Elements | Element Complexity | Existing CV Info. | Adoption and Development Notes | Current DLESE CV Type | CV Type Recommended |
| subject | high | | | Flat list | Ontology? |
| gradeRange | low | | | Flat list | Synonym ring |
| toolFor | medium | GEM has a list | Use appropriate GEM terms; add missing terms like media specialists, evaluators, Earth data specialists | Flat list | |
| beneficiary | medium | | Use appropriate GEM terms; add missing terms like media specialists, evaluators, Earth data specialists | Flat list | |
| contentStandard | high | | | Hierarchical list | |
| (tech) requirements | medium | | | | |
| cost | low | IMS uses yes/no | Add the term unknown because for most web-based resources cost could be difficult or impossible for a cataloger to determine | Flat list | Flat list |
| teachingMethod | medium | GEM has a list | | Flat list | |
| resourceType | high | Dublin Core uses 8 resource types but these are of such large granularity as not to be very useful. | | Hierarchical list | |
| (relationships) kind | high | Dublin Core uses ... | | Flat list | |
| language of the resource/metadata | low | ISO 639 (language) standard | The W3C uses the ISO standard as an integral part of the language definition of XML | W3C XML Schema | W3C XML Schema |

Which type of controlled vocabulary/terminology should be used? This question asks when a synonym ring, an authority file or an ontology is the appropriate choice. The answer depends heavily on the inherent complexity of the metadata field, on what controlled vocabularies/terminologies already exist to support the field, and on whether and how other significant metadata frameworks support it. Again, the analysis table in the preceding section helps answer this question.
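To make the difference between two of these types concrete, the sketch below contrasts a flat list with a synonym ring. A flat list only records which terms are allowed, while a synonym ring groups terms that should be treated as equivalent during search. All terms shown are hypothetical examples for illustration and are not the actual DLESE gradeRange vocabulary.

```python
from typing import List, Set

# Hypothetical terms for illustration; not the actual DLESE vocabulary.
# A flat list: the allowed terms, with no relationships between them.
FLAT_LIST: Set[str] = {"middle school", "high school"}

# A synonym ring: each set groups terms treated as equivalent in search.
SYNONYM_RINGS: List[Set[str]] = [
    {"middle school", "junior high", "grades 6-8"},
    {"high school", "secondary school", "grades 9-12"},
]


def expand_query(term: str) -> Set[str]:
    """Return every term in the same ring as the query term.

    A term that belongs to no ring is returned on its own.
    """
    normalized = " ".join(term.lower().split())
    for ring in SYNONYM_RINGS:
        if normalized in ring:
            return set(ring)
    return {normalized}


# With only the flat list, "junior high" is simply not a valid term.
print("junior high" in FLAT_LIST)           # False
# With the ring, the same query reaches the equivalent terms.
print(sorted(expand_query("Junior High")))
```

An ontology would go further by recording typed relationships between concepts, which is why it is suggested only for high-complexity fields such as subject.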

What controlled vocabularies/terminologies should be used? This question goes hand-in-hand with what type of controlled vocabulary/terminology should be used. Answering it helps to complete the information in the table above, with the main goal of determining whether existing vocabularies can be used or whether some development work needs to occur and, if so, how. The following questions should be considered:

  • Is the expected vocabulary so simple (like yes, no, unknown for DLESE cost) that the vocabulary can be developed directly?
  • Do other significant metadata frameworks like Dublin Core, GEM, NSDL and IMS have this metadata concept and possibly a vocabulary?
  • Do organizations dealing with the content area (e.g. scientific or pedagogic) of the vocabulary have existing vocabularies? If so, a group should review it for appropriateness for the DLESE community and whether the vocabularies can be accessed through a programmatic service or need to be adopted by some other method.
  • What is an appropriate size (number of terms) for the controlled vocabulary/terminology to make it the most accessible in searching, browsing and cataloging?

How are controlled vocabularies/terminologies incorporated into DLESE systems? A vocabulary manager is the mechanism by which DLESE systems and services (including web services) know about controlled vocabularies/terminologies. Data is entered into the manager through a series of XML files. Any recommendations from this paper that use different structures than the current manager model will require software development to incorporate. For example, software development would be required to incorporate an existing OWL or RDF ontology.
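As an illustration of loading a vocabulary from an XML file, here is a minimal sketch using Python's standard library. The XML layout shown (a "vocabulary" element with "term" children) is an assumption made for this example; the actual file format used by the DLESE vocabulary manager is not described in this paper and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical file layout; the real vocabulary manager files may differ.
SAMPLE_XML = """
<vocabulary framework="adn" field="cost">
  <term id="yes">Yes</term>
  <term id="no">No</term>
  <term id="unknown">Unknown</term>
</vocabulary>
"""


def load_vocabulary(xml_text: str) -> dict:
    """Parse one vocabulary document into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    terms = {
        term.get("id"): (term.text or "").strip()
        for term in root.findall("term")
    }
    return {
        "framework": root.get("framework"),
        "field": root.get("field"),
        "terms": terms,
    }


vocabulary = load_vocabulary(SAMPLE_XML)
print(vocabulary["field"])          # cost
print(sorted(vocabulary["terms"]))  # ['no', 'unknown', 'yes']
```

A flat structure like this maps directly onto a dictionary. An OWL or RDF ontology would not, which is one reason incorporating one would require additional software development.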

 

Topic 2: Integration of collection specific (local) controlled vocabularies/terminologies that collection builders want to catalog to, search by and browse through

Many DLESE collection builders ask to have their own specific subject controlled vocabularies/terminologies be searchable and browseable. The challenge is that collection builder controlled vocabularies/terminologies generally benefit only a single collection and do not have wide application across many DLESE collections. This topic was discussed at the DLESE Metadata Workshop in March 2004 and those discussions and recommendations are summarized next.

The first part of the discussion centered on when it is appropriate to use a collection-specific controlled vocabulary/terminology. It is appropriate if collection builders have terms and phrases that describe the collection better than the DLESE metadata structures do (free text descriptions for general characteristics, educational, technical, geospatial and temporal information; annotation metadata records; and existing DLESE controlled vocabularies/terminologies). If this is the case, the controlled vocabulary/terminology is expected to have a small number of terms (15-30). The terms can be used in optional metadata fields and should avoid terms already in use in existing DLESE controlled vocabularies/terminologies. The workshop participants then discussed three methods to incorporate terms.

  1. Keyword method
  2. Collection builder controlled vocabulary method
  3. XML schema method

Method 1: Keyword method: The collection builder develops a list of terms or phrases and uses them consistently in the keyword field of the metadata record. These terms are also entered in the collection-level metadata record in order to give library users hints about terms to use while searching the collection. The collection builder has complete control and no definitions are needed. No browse is provided but free text searching yields good results. There is no integration into DLESE systems, services or metadata frameworks and no extra work is required by the DLESE Program Center (DPC). Currently, the Global Change Master Directory (GCMD) takes a similar approach by allowing data providers to suggest terms for their data sets.
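Because the keyword method relies on the collection builder using terms consistently, a simple check can catch variants that differ only in capitalization or spacing. The sketch below is one possible approach, not a DLESE tool; the term list and function name are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical collection term list, for illustration only.
COLLECTION_TERMS = ["remote sensing", "simulations", "modeling"]


def _normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def inconsistent_keywords(
    keywords: Sequence[str],
    collection_terms: Sequence[str] = COLLECTION_TERMS,
) -> List[Tuple[str, str]]:
    """Return (keyword, expected term) pairs for near-miss spellings.

    A keyword is reported when it matches a collection term only after
    normalization, meaning it was entered inconsistently.
    """
    canonical = {_normalize(term): term for term in collection_terms}
    problems = []
    for keyword in keywords:
        expected = canonical.get(_normalize(keyword))
        if expected is not None and keyword != expected:
            problems.append((keyword, expected))
    return problems


record_keywords = ["Remote  Sensing", "modeling", "volcanoes"]
print(inconsistent_keywords(record_keywords))
# [('Remote  Sensing', 'remote sensing')]
```

Keywords outside the term list, such as "volcanoes" here, are left alone because the keyword field remains free text.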

Method 2: Collection builder controlled vocabulary method: The collection builder develops a list of terms or phrases and provides a code or URL to identify the terms. The collection builder catalogs to these terms (with consistent spelling and use) and indicates the URL or code in the metadata records. (Aside: the ADN metadata field subjectOther was built for this purpose.) Because the term list and code are known, any metadata records using them are searchable and available to DLESE web services.

An example of this approach is DLESE resources for meteorology (this link may be intermittently unavailable). On that page, the listed terms (remote sensing, simulations, modeling, etc.) are the collection builder terms. Resources that use this specific vocabulary are then searchable. These resources are available through DLESE web services but not as browseable histograms. This method requires moderate work from both the collection builder and the DPC for functionality.
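The idea behind method 2, that a known term list identified by a code or URL makes records searchable, can be sketched as follows. The record structure, the identifier and the function name are hypothetical; the real layout of the subjectOther field in ADN records is not shown in this paper.

```python
from typing import List, Mapping, Optional, Sequence

# Hypothetical identifier for a collection builder's term list.
VOCAB_ID = "urn:example:meteorology-terms"

# Hypothetical records; each may carry locally controlled terms together
# with the identifier of the vocabulary those terms come from.
records = [
    {"id": "REC-1",
     "subjectOther": {"vocabulary": VOCAB_ID, "terms": ["remote sensing"]}},
    {"id": "REC-2",
     "subjectOther": {"vocabulary": VOCAB_ID,
                      "terms": ["modeling", "simulations"]}},
    {"id": "REC-3", "subjectOther": None},
]


def find_by_local_term(
    records: Sequence[Mapping[str, Optional[object]]],
    vocabulary: str,
    term: str,
) -> List[str]:
    """Return ids of records cataloged with the term from the vocabulary."""
    matches = []
    for record in records:
        local = record.get("subjectOther")
        if local and local["vocabulary"] == vocabulary and term in local["terms"]:
            matches.append(record["id"])
    return matches


print(find_by_local_term(records, VOCAB_ID, "modeling"))  # ['REC-2']
```

Matching on the vocabulary identifier as well as the term keeps one collection's terms from colliding with identically spelled terms from another collection.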

Method 3: XML schema method: This approach incorporates the collection builder terms or phrases completely into DLESE systems, services and metadata frameworks. It requires substantial work from the collection builder and the DPC. This method impacts every existing metadata record in the library.

Additionally, the collection builder needs to provide definitions, attribution of the definition and best practices for using their controlled vocabulary/terminology. The collection builder must also create XML schema and instance documents. This method allows the collection builder to have precise search results and browseable histograms as long as their terms do not overlap with existing DLESE controlled vocabularies/terminologies.

The overriding question in using collection-specific vocabularies is whether it will be beneficial. Is the potential metadata development, cataloging, software development and interface design worth it? How will DLESE systems and services keep up if many collection builders wish to use methods 2 or 3? Either way, the collection builder should consider the questions that were raised in Topic 1 above when developing their list of terms or phrases.

The Metadata Workshop recommendation concluded that browse structures were of benefit to the user and that DLESE should allow the collection builders some flexibility in employment of controlled vocabularies/terminologies. DLESE should investigate implementation of collection-specific vocabularies in a way that permits the collection builder to choose between method 1 and method 2 above. This quality group on metadata structures supports these recommendations.

 

Topic 3: Methods and processes for changing terms (keeping up with changes in pedagogic and scientific concepts)

Keeping DLESE controlled vocabularies/terminologies in line with the accepted scientific, pedagogic and technical terms is absolutely necessary to maintain library quality. However, once a controlled vocabulary/terminology is developed and in use, it can be a daunting task to update terms or phrases without disruption to end users, services and developers. Therefore, this group looked to the International Organization for Standardization (ISO) to see how often it reviews its standards. Per ISO procedures, standards are reviewed at least once every five years. A majority of reviewers decides whether the standard should be confirmed, revised or withdrawn.

A recommendation to DLESE is to review controlled vocabularies/terminologies at least once every five years as well. Some questions to ask during the review are:

  • Are there terms that are not being used and can be removed?
  • Are there terms that have caused confusion and could be improved by using another word or phrase?
  • Are there terms that are out of date or out of favor scientifically, pedagogically or technically?
  • Have new concepts emerged such that new terms need to be added?
  • How difficult would it be to make changes? Can things be done better technically?
  • Does the metadata concept still need a controlled vocabulary/terminology?
  • Would the metadata concept benefit from a different type of controlled vocabulary/terminology (e.g. synonym ring versus ontology)?

If possible, the following groups and items should be included in the review:

  • Metadata specialists
  • Library end users
  • Log analysis of queries and click streams
  • Collection builders
  • Content specialists
  • Vocabulary registry organizations
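Two of the review steps above lend themselves to simple automation: deciding whether a vocabulary is due for its five-year review, and listing terms that are never used. The sketch below assumes term usage counts have already been gathered (for example from metadata records or query logs); the function names and sample data are hypothetical.

```python
from datetime import date
from typing import Iterable, List, Mapping

# Review interval recommended in this paper: at least once every five years.
REVIEW_INTERVAL_YEARS = 5


def review_due(last_review: date, today: date) -> bool:
    """True when at least five years have passed since the last review."""
    target_year = last_review.year + REVIEW_INTERVAL_YEARS
    try:
        due = last_review.replace(year=target_year)
    except ValueError:
        # Last review fell on 29 February and the target year is not a leap year.
        due = last_review.replace(year=target_year, day=28)
    return today >= due


def unused_terms(vocabulary: Iterable[str],
                 usage_counts: Mapping[str, int]) -> List[str]:
    """Return vocabulary terms with no recorded use, sorted alphabetically."""
    return sorted(term for term in vocabulary if usage_counts.get(term, 0) == 0)


# Hypothetical data for illustration.
print(review_due(date(2004, 9, 24), date(2009, 9, 24)))               # True
print(unused_terms(["yes", "no", "unknown"], {"yes": 12, "no": 40}))  # ['unknown']
```

An unused term is only a candidate for removal; the reviewers listed above would still need to judge whether it should stay for future use.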

 

Topic 4: Impacts of controlled vocabularies/terminologies on library stakeholders

The list of library stakeholders includes library end users, collection builders, catalogers, library developers, service providers and library evaluators. Because this is a broad list, the group narrowed this discussion primarily to controlled vocabulary/terminology impacts on library end users. The group acknowledged the following points in the use of controlled vocabularies/terminologies:

  • Precision of library user search results may improve but recall of resources may decrease. Therefore, a balance between precision and recall should be sought.
  • User interfaces incorporating controlled vocabularies/terminologies may become more complex. This complexity may strongly influence whether a controlled vocabulary/terminology is used in search or browse.
  • Much research needs to be done in this area.
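The precision and recall trade-off mentioned above can be shown with a small worked example. Precision is the share of retrieved resources that are relevant; recall is the share of relevant resources that were retrieved. The resource identifiers and the two result sets below are invented for illustration and do not come from DLESE search data.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str],
                     relevant: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one search result set."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: four resources are truly relevant to the query.
relevant = {"r1", "r2", "r3", "r4"}

# A free text search returns more results, some of them irrelevant.
free_text_results = {"r1", "r2", "r3", "r5", "r6", "r7"}

# A controlled vocabulary search returns fewer, more exact results.
controlled_results = {"r1", "r2"}

print(precision_recall(free_text_results, relevant))   # (0.5, 0.75)
print(precision_recall(controlled_results, relevant))  # (1.0, 0.5)
```

In this invented case the controlled vocabulary search doubles precision but misses half of the relevant resources, which is the balance the first point describes.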

The National Science Digital Library (NSDL) evaluation group may be developing the idea of a digital learning testbed. One focus of this testbed could be to examine the impacts of controlled vocabularies/terminologies on end users. If this plan goes forward, it is recommended that DLESE participate in this research in order to get the most benefit from it.



Last updated: 9-24-04
Maintained by: Katy Ginger (support@dlese.org), DLESE Metadata