DDS Search
Fields
Digital Library for Earth System Education

Configuring Search Fields for XML Frameworks

 

This page describes how to configure standard and custom search fields for any XML framework that is made available through the Search Service API. It is intended for system administrators who are installing or managing a DDS repository system, which includes the Digital Discovery System (DDS) and the NSDL Catalog System (NCS). A framework does not need to be configured in order to be used effectively in the repository, but configuring it adds search functionality that may be useful.

This document assumes familiarity with Apache Tomcat, Lucene, servlet configurations, and XML.

How search fields are generated

At index creation time, each record is inserted in the repository in its native XML format. The indexer extracts standard, custom, and XPath search fields from the contents of the XML, generates a single entry containing each of the fields, and inserts that entry into the index. All records are guaranteed to contain certain fields, such as the default and stems fields, as well as XPath fields for the native XML format, which are created automatically.

For detailed information about search fields and the content within them, see the Search Service documentation (Search fields section).

How to configure search fields

The standard and custom search fields for a given XML framework can be defined using an XML configuration file, which is described below. Standard search fields include title, description, ID, URL and geospatial bounding box coordinates. Custom search fields can be defined for any content extracted from the XML document.

 

To configure search fields for a specific XML framework, follow these steps:

  1. Add the given XML framework to the search fields configuration index file
  2. Create a configuration file for the XML framework and define the standard and custom search fields as needed
  3. Start or restart Tomcat for the changes to take effect

1. Add XML frameworks to the configuration index file

Add the given XML framework to the search fields configuration index file, which lists the individual configuration files for each XML framework. Entries in the index may contain relative or absolute URIs to the individual framework configuration files, which may be located on the local file system (file://) or anywhere on the Web (http://).

The index file is named xmlIndexerFieldsConfigIndex.xml and in a typical DDS installation it can be found in the tomcat context at $tomcat/$context/WEB-INF/conf/xmlIndexerFieldsConfigIndex.xml. The exact location of the index file is indicated by the DDS/DCS/NCS web application's context-param repositoryConfigDir (found in web.xml or server.xml).
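As a rough sketch, a context parameter of this kind is declared in web.xml as shown here. The parameter name repositoryConfigDir comes from the text above; the value is an assumption based on the typical location and should be replaced with the directory your installation actually uses.

```xml
<!-- Fragment of web.xml (sketch only) -->
<context-param>
	<param-name>repositoryConfigDir</param-name>
	<!-- Assumed value: the directory that holds xmlIndexerFieldsConfigIndex.xml -->
	<param-value>WEB-INF/conf</param-value>
</context-param>
```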

Example index file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<XMLIndexerFieldsConfigIndex>
	<!-- List the location of each framework-specific configuration file -->
	<configurationFiles>
		<configurationFile>xmlIndexerFieldsConfigs/oai_dc_search_fields.xml</configurationFile>
		<configurationFile>xmlIndexerFieldsConfigs/my_framework_search_fields.xml</configurationFile>
	</configurationFiles>		
</XMLIndexerFieldsConfigIndex>
	

Each configurationFile element indicates a relative or absolute URI to the individual configuration for the XML framework. The above example points to two framework configuration files, oai_dc_search_fields.xml and my_framework_search_fields.xml, which reside in the directory xmlIndexerFieldsConfigs relative to the index configuration file.
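Because entries may also be absolute URIs, an index file could mix relative, file:// and http:// locations. The following is a sketch only; the file system path and the host name are hypothetical placeholders.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<XMLIndexerFieldsConfigIndex>
	<configurationFiles>
		<!-- Relative URI, resolved relative to this index file -->
		<configurationFile>xmlIndexerFieldsConfigs/oai_dc_search_fields.xml</configurationFile>
		<!-- Absolute URI on the local file system (hypothetical path) -->
		<configurationFile>file:///opt/dds/conf/my_framework_search_fields.xml</configurationFile>
		<!-- Absolute URI on the Web (hypothetical host) -->
		<configurationFile>http://www.example.org/dds/conf/other_framework_search_fields.xml</configurationFile>
	</configurationFiles>
</XMLIndexerFieldsConfigIndex>
```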

2. Define the search fields for each XML framework

Each configuration file describes the standard and/or custom search fields for an XML framework and where the content for those fields resides in the XML instance documents. For the following discussion, see the example configuration file below.

The xmlFormat or schema attribute of the XMLIndexerFieldsConfig element defines which framework the configuration applies to; only one of the two may be used. The xmlFormat corresponds to the XML format key that the repository system is indexing, for example oai_dc, nsdl_dc, adn, etc. To provide a schema-specific configuration, for example if a given repository is working with two versions of the same framework, indicate the schema location in the schema attribute. If two configurations apply to the same framework, one identified by xmlFormat and the other by schema, the schema-based configuration takes precedence.
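A schema-specific configuration might look like the sketch below, which uses the schema attribute in place of xmlFormat. The schema location and the XPath are hypothetical placeholders; use the schema location that your records actually declare.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Uses schema instead of xmlFormat; the two must not be combined -->
<XMLIndexerFieldsConfig schema="http://www.example.org/schemas/my_framework/2.0/record.xsd">
	<standardFields>
		<standardField name="title">
			<xpaths>
				<!-- Hypothetical location of the title in version 2.0 records -->
				<xpath>/record/title</xpath>
			</xpaths>
		</standardField>
	</standardFields>
</XMLIndexerFieldsConfig>
```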

Standard fields

Standard fields are processed by the indexer in a uniform manner, allowing clients to search the fields in a consistent manner across frameworks.

The standard fields are the following:

Standard Field | Description | Index Fields Generated
---------------|-------------|-----------------------
id | Contains the ID for the record. If not defined, the ID is derived automatically by the indexer. | idvalue
url | Contains the URL for the resource described by the XML metadata. | url
title | Contains the title text for the item. | title, titlestems
description | Contains the description text for the item. | description, descriptionstems
geoBBNorth, geoBBSouth, geoBBWest, geoBBEast | Contains the north and south latitudes [-90, 90] and the west and east longitudes [-180, 180] for the geographic bounding box footprint that represents this item. | n/a - Handled internally by the Search request.

 

To configure a standard field for a framework, add a standardField element to the configuration file as shown in the example below. The attribute name defines the standard field name (id, url, title, etc.). Inside standardField, nested xpath elements should contain XPaths that select the desired content. The xpath element can be repeated, and the contents of all repeated elements in the instance documents will be included in the content for that field. The exception is the geographic bounding box fields, which must contain a single element only.
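As an illustration of repeated xpath elements and of the bounding box fields, here is a sketch for a hypothetical framework. The format key my_framework and every XPath in it are assumptions made for the example, not part of any real framework.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<XMLIndexerFieldsConfig xmlFormat="my_framework">
	<standardFields>
		<!-- Repeated xpath elements: content from both locations goes into the title field -->
		<standardField name="title">
			<xpaths>
				<xpath>/record/title</xpath>
				<xpath>/record/alternateTitle</xpath>
			</xpaths>
		</standardField>
		<!-- Bounding box fields: one xpath each, assumed to match one element per record -->
		<standardField name="geoBBNorth">
			<xpaths>
				<xpath>/record/boundingBox/north</xpath>
			</xpaths>
		</standardField>
		<standardField name="geoBBSouth">
			<xpaths>
				<xpath>/record/boundingBox/south</xpath>
			</xpaths>
		</standardField>
		<standardField name="geoBBWest">
			<xpaths>
				<xpath>/record/boundingBox/west</xpath>
			</xpaths>
		</standardField>
		<standardField name="geoBBEast">
			<xpaths>
				<xpath>/record/boundingBox/east</xpath>
			</xpaths>
		</standardField>
	</standardFields>
</XMLIndexerFieldsConfig>
```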

Custom fields

Custom fields can be defined for any content extracted from the XML document.

To define a custom field, add a customField element to the configuration file as shown in the example below.

The customField element must contain the name and store attributes, plus either type or analyzer (but not both):

Attribute Name | Description | Valid Values
---------------|-------------|-------------
name | Indicates the name of the search field. | Anything. Alphanumeric recommended.
store | Indicates whether to store the content in the index. Stored fields are visible in the admin pages of the DDS/DCS/NCS repository system web application. | yes, no
type | Indicates the type of field this should be. If type is used, analyzer must not be. | text, stems, key (described below)
analyzer | Indicates the specific Lucene Analyzer to use when processing this field. If analyzer is used, type must not be. | The fully qualified name of a Java class that implements a Lucene Analyzer. The class must be in the classpath of the DDS web application.

Valid values for type:

  • text - Text is processed using the Lucene StandardAnalyzer.
  • stems - Text is processed using the Lucene SnowballAnalyzer for the English language.
  • key - Text is processed using the Lucene KeywordAnalyzer, which is case-sensitive and includes the entire element or attribute as a single token.

 

Note that the Lucene Analyzer that is defined for a given field is automatically applied both in the indexer and the searcher.

The attribute name defines the name of the field and can contain any value. Inside customField, nested xpath elements should contain XPaths to the content. The xpath element can be repeated and the contents of all repeated elements in the instance documents will be included in the content for that field.
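The fragment below sketches one custom field that uses type and one that uses analyzer with repeated xpath elements. The field names are hypothetical, and the analyzer class is an assumption: WhitespaceAnalyzer lives in different packages in different Lucene releases, so check the Lucene version bundled with your installation before using it.

```xml
<customFields>
	<!-- type="stems": stemmed text, not stored in the index -->
	<customField name="dcSubjectStems" store="no" type="stems">
		<xpaths>
			<xpath>/dc/subject</xpath>
		</xpaths>
	</customField>
	<!-- analyzer replaces type; the class name is an assumption and must be on the classpath -->
	<customField name="dcPeople" store="yes" analyzer="org.apache.lucene.analysis.WhitespaceAnalyzer">
		<xpaths>
			<!-- Content from both elements is combined into the dcPeople field -->
			<xpath>/dc/creator</xpath>
			<xpath>/dc/contributor</xpath>
		</xpaths>
	</customField>
</customFields>
```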

XPaths

As the indexer processes the XML records, it first removes namespaces from the documents. This simplifies the XPath notation necessary to select the desired elements and attributes within. Do not include namespaces in your XPath notation.

For example, these XPaths select given elements in an oai_dc Dublin Core record:

  • /dc/title - Selects all title elements that are children of the dc element
  • /dc/title[1] - Selects the first title element that is a child of the dc element
  • //title - Selects all title elements anywhere in the XML document

For more information about the XPath language, see XPath Language 1.0 and the ZVON XPath Tutorial.
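To make the effect of namespace removal concrete, the sketch below shows how an oai_dc record might look from the point of view of the XPaths above. The comments note the original prefixed element names; the record content is made up for illustration, and the exact form the indexer produces internally is an assumption.

```xml
<!-- Original root element:
     <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"> -->
<dc>
	<!-- Originally <dc:title>. Matched by /dc/title, /dc/title[1] and //title -->
	<title>First example title</title>
	<!-- Originally <dc:title>. Matched by /dc/title and //title, but not /dc/title[1] -->
	<title>Second example title</title>
	<!-- Originally <dc:identifier>. Matched by /dc/identifier -->
	<identifier>http://www.example.org/resource</identifier>
</dc>
```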

 

Example search configuration for the oai_dc Dublin Core framework:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- XMLIndexerFieldsConfig attributes: [xmlFormat OR schema] -->
<XMLIndexerFieldsConfig xmlFormat="oai_dc">
	<standardFields>
		<!-- standardField attributes: name=[id|url|title|description|geoBBNorth|geoBBSouth|geoBBWest|geoBBEast] -->
		<standardField name="url">
			<xpaths>
				<xpath>/dc/identifier</xpath>
			</xpaths>		
		</standardField>
		<standardField name="title">
			<xpaths>
				<xpath>/dc/title</xpath>
			</xpaths>		
		</standardField>
		<standardField name="description">
			<xpaths>
				<xpath>/dc/description</xpath>
			</xpaths>		
		</standardField>	
	</standardFields>
	<customFields>
		<!-- customField attributes: name, store, [type OR analyzer] -->
		<customField name="dcIdentifier" store="yes" type="key">
			<xpaths>
				<xpath>/dc/identifier</xpath>
			</xpaths>
		</customField>		
		<customField name="dcType" store="yes" type="text">
			<xpaths>
				<xpath>/dc/type</xpath>
			</xpaths>
		</customField>
		<customField name="dcPublisher" store="yes" type="text">
			<xpaths>
				<xpath>/dc/publisher</xpath>
			</xpaths>
		</customField>	
	</customFields>
</XMLIndexerFieldsConfig>	
	

How to verify it's working

Follow these steps to verify that the desired content is being indexed for search as expected:

  1. Place the configuration files in the repository system and make sure Tomcat has been restarted.
  2. Index or re-index the files.
  3. View the fields and indexed content in the admin pages of the DDS, DCS or NCS repository system Web application. These pages reside in different places depending on the system and version you are working with. Note that in many places in the admin UI, fields and content are displayed only for those fields that have been stored in the index (see above).
  4. Perform searches using the Search Service API or admin search pages to verify that the expected results are returned for specific queries against known data in one or more records.

 

Last revised: $Date: 2010/02/19 01:02:54 $

University Corporation for Atmospheric Research (UCAR) National Science Foundation (NSF) National Science Digital Library (NSDL)