mRFC 0024: Full text indexing in Midgard

  1. Revision history
  2. Background
  3. Requirements
  4. The full text index
  5. The parent cache
  6. The indexer process
  7. The query implementation
  8. Related work

This document proposes a way to support full text indexing in the Midgard core, covering both the indexing and querying mechanisms.

This mRFC has been submitted to the Midgard Community for discussion and approval under the Creative Commons Attribution-ShareAlike license.

Revision history

2006-01-02 Created by Jukka Zitting

2006-01-03 Revised by Jukka Zitting to incorporate changes and omissions noticed by Torben Nehmer.

Background

Search functionality has traditionally not been a strong part of the Midgard framework. For a long time Midgard only supported the mgd_list_... functions as the means for querying the database. More flexibility was gradually introduced by adding new query functions and introducing various optional query parameters in the existing functions. Even with these additions the standard Midgard programming model for more advanced queries was to start from a general object listing and to filter the results programmatically.

This model performed poorly and was cumbersome to use, so various alternatives were introduced along the way. As Midgard contained no easy and efficient way to make general queries, various external search engines like ht://Dig were often used as an alternative. While this solution produced decent end user search interfaces, it lacked integration with the Midgard object model and caused extra setup and administration overhead.

Finally a more general solution was introduced by the MidCOM component framework version 2.4 in the form of the MidCOM indexer. The MidCOM indexer uses the Apache Lucene search engine as a backend for indexing and querying content using the MidCOM component model. The MidCOM indexer is the most powerful Midgard search feature to date, but it is limited to the MidCOM data model and does not detect content changes made by tools other than MidCOM.

The Midgard Query Builder was introduced in Midgard version 1.7 as an alternative advanced search feature. The Midgard Query Builder replaces the mgd_list functions as the default query mechanism in the Midgard core. The Midgard Query Builder is a general purpose tool for searching any Midgard content, but it does not support full text queries and understands only the MgdSchema data model.

The current state of affairs is that the MidCOM indexer and the Midgard Query Builder together provide a fairly complete set of search features, but as they are not integrated and each has its own drawbacks, there is still no ideal solution to Midgard search needs. This document proposes adding full text indexing and other improvements to the Midgard Query Builder in order to overcome some of the limitations and to provide an easier path to an eventual integration with the MidCOM indexer.

Requirements

The requirements of the full text improvements proposed in this document are listed below:

Integrated API
  The full text query API should be integrated with the existing Midgard Query Builder. It should be possible to combine traditional and full text constraints in a single query.

The MgdSchema data model
  The indexer should fully support the MgdSchema data model. All data types and properties should be included in the index unless explicitly excluded, and it should be possible to query for content regardless of the object type or property name. The full text index should also be aware of the content tree structure.

Indexing binary attachments
  The full text indexer should know how to index the common binary attachment formats like MS Word and PDF.

Automatic indexing
  The full text index should automatically be kept up to date with the changes in the underlying Midgard database. Indexing should be transparent to all applications that use the standard Midgard API to modify the Midgard database. It should also be possible to manually recreate the entire full text index.

The full text index

The proposed full text index shall be a Lucene index directory residing on the host that runs the Midgard applications. If there are multiple web servers accessing a shared backend database, then each frontend server shall have its own full text index. Each Midgard database shall have its own full text index that shall index the objects in all the sitegroups within the database. The index directories shall be located like the current midgard blob directories, i.e. the full text index of a database whose blobs are located in /path/to/midgard/blobs/database shall be in /path/to/midgard/index/database.
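The directory convention above can be sketched in a few lines. This is only an illustration of the path mapping: the function name index_dir_for is hypothetical, and it assumes the blob directory always ends in ".../blobs/<database>" as in the example path.

```python
from pathlib import PurePosixPath


def index_dir_for(blob_dir: str) -> str:
    """Map a blob directory to the sibling full text index directory.

    Assumes the layout <midgard root>/blobs/<database>, as in the text.
    """
    blob_path = PurePosixPath(blob_dir)
    database = blob_path.name                # last component: the database name
    midgard_root = blob_path.parent.parent   # strip "blobs/<database>"
    return str(midgard_root / "index" / database)


print(index_dir_for("/path/to/midgard/blobs/database"))
# prints /path/to/midgard/index/database
```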

The full text index shall contain information about all the MgdSchema objects in a database. All object metadata (including type and GUID) is indexed, as are all object properties by default. The special index="no" property attribute can be used in the schema declaration to exclude a property from the index entirely. Properties are indexed both as separate index fields and as part of a general node content field. A property can be kept out of the node content field, while still being indexed as a separate field, by adding the attribute index="separate" to the property definition.
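The effect of the index attribute can be illustrated with a small sketch. It is not the proposed implementation: the function build_index_fields, the plain dictionaries standing in for an object and its schema, and the sample property names are all assumptions made for the example.

```python
def build_index_fields(properties, index_attrs):
    """Return (separate_fields, node_content) for one object.

    properties:  dict of property name -> text value
    index_attrs: dict of property name -> "no" or "separate";
                 a missing entry means default indexing
    """
    fields = {}
    content_parts = []
    for name, value in properties.items():
        mode = index_attrs.get(name)
        if mode == "no":
            continue                      # excluded from the index entirely
        fields[name] = value              # always indexed as its own field
        if mode != "separate":
            content_parts.append(value)   # also part of the node content field
    return fields, " ".join(content_parts)


fields, content = build_index_fields(
    {"title": "Hello", "abstract": "World", "secret": "hidden", "code": "X1"},
    {"secret": "no", "code": "separate"},
)
print(fields)   # {'title': 'Hello', 'abstract': 'World', 'code': 'X1'}
print(content)  # Hello World
```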

Parameters and attachments are treated as object properties in the full text index. Parameters are included as virtual properties named "parameter:domain:name", and attachments (whose content can be extracted) as properties named "attachment:name". All link fields are stored as GUID entries in the full text index.

MultiLang content shall be handled by prefixing all the MultiLang property names with the language code. A German version of a "title" property would be indexed as "de:title" while the default "title" property (MultiLang zero) would be indexed simply as "title". The general node content field will also be versioned by language.
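The field naming rules for parameters, attachments and MultiLang properties can be summarised as follows. The helper names are hypothetical, as are the sample domain, parameter and file names; only the resulting name patterns come from the text above.

```python
def parameter_field(domain, name):
    """Virtual property name for a parameter."""
    return f"parameter:{domain}:{name}"


def attachment_field(name):
    """Property name for an attachment whose content can be extracted."""
    return f"attachment:{name}"


def multilang_field(prop, lang=None):
    """Prefix a MultiLang property with its language code.

    lang=None stands for MultiLang zero (the default language),
    which is indexed under the plain property name.
    """
    return f"{lang}:{prop}" if lang else prop


print(parameter_field("mydomain", "myname"))  # parameter:mydomain:myname
print(attachment_field("report.pdf"))         # attachment:report.pdf
print(multilang_field("title", "de"))         # de:title
print(multilang_field("title"))               # title
```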

The index shall normally be created and queried using the standard analyzer to avoid problems with content in multiple different languages. It is possible that a future extension will add support for semi-automated language analyzers based on the Midgard MultiLang feature.

The index directory shall be read directly by the midgard-core component in all Midgard application processes, but written by a separate indexer process. The indexer process shall monitor the Midgard database directly for record changes instead of relying on notifications from the Midgard applications. This makes the full text system robust in the face of indexer failures and reduces the number of dependencies across different functions. The drawback of this approach is a slight delay before content changes are reflected in the full text index.

The parent cache

In addition to maintaining the full text index, the indexer shall also maintain a cache of the Midgard content tree structure.

The content tree is derived from the parent link properties declared in the MgdSchema configuration. Each object can have a single parent link that points to another object. The parent links are required to form a forest structure, i.e. there can be multiple roots and no self-references or cycles are allowed in the parent links.
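The forest requirement can be checked mechanically. The following sketch assumes the parent links have already been read into a dictionary mapping each GUID to its parent GUID (or None for a root); the function name is_forest is hypothetical.

```python
def is_forest(parent_of):
    """Return True if the parent links contain no self-references or cycles.

    parent_of maps guid -> parent guid, or None for a root.
    A parent GUID that is missing from the mapping is treated as a root.
    """
    for start in parent_of:
        seen = set()
        node = start
        while node is not None:
            if node in seen:
                return False          # walked back onto the same path: a cycle
            seen.add(node)
            node = parent_of.get(node)
    return True


print(is_forest({"root1": None, "root2": None, "child": "root1"}))  # True
print(is_forest({"a": "a"}))                                        # False
print(is_forest({"a": "b", "b": "a"}))                              # False
```

The check walks up from every object, so it is quadratic in the worst case; it is meant to show the constraint, not to be run on a large database.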

The parent links are normally stored as fields in the content tables, which makes it difficult to efficiently traverse the content tree. A more efficient representation of the parent links shall therefore be added to the repligard table to overcome this problem. This parent cache is a combination of the adjacency list and nested set data models to allow efficient querying of both immediate parent/child and more distant ancestor/descendant relationships. The additions to the repligard table are:

parent_guid VARCHAR(80),
tree_left   INTEGER NOT NULL,
tree_right  INTEGER NOT NULL,
INDEX parent_index (parent_guid),
INDEX tree_index (tree_left, tree_right)

The parent_guid field will contain the GUID of the parent object and can be NULL if the object has no parent. The tree_left and tree_right fields contain special depth-first traversal indexes of the object within the global content tree. This allows for very quick determination of ancestor relations using predicates like:

A is an ancestor of B :-
    B.tree_left BETWEEN A.tree_left AND A.tree_right.
A is a descendant of B :-
    A.tree_left BETWEEN B.tree_left AND B.tree_right.
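As an illustration of how such traversal indexes can be assigned, the sketch below numbers a small forest depth-first and then evaluates the ancestor predicate. The function names and the in-memory dictionaries are assumptions for the example; the proposal itself stores the values in the repligard table.

```python
def number_forest(children, roots):
    """Assign (tree_left, tree_right) to every object by depth-first traversal.

    children maps guid -> list of child guids; roots lists the root guids.
    """
    bounds = {}
    counter = 0

    def visit(guid):
        nonlocal counter
        counter += 1
        left = counter                    # number taken when entering the node
        for child in children.get(guid, []):
            visit(child)
        counter += 1
        bounds[guid] = (left, counter)    # number taken when leaving the node

    for root in roots:
        visit(root)
    return bounds


def is_ancestor(bounds, a, b):
    """A is an ancestor of B: B.tree_left BETWEEN A.tree_left AND A.tree_right."""
    a_left, a_right = bounds[a]
    b_left, _ = bounds[b]
    return a_left <= b_left <= a_right


bounds = number_forest({"root": ["a", "b"], "a": ["a1"]}, ["root", "other"])
print(bounds["root"], bounds["a"], bounds["a1"])  # (1, 8) (2, 5) (3, 4)
print(is_ancestor(bounds, "root", "a1"))          # True
print(is_ancestor(bounds, "a", "b"))              # False
```

Because BETWEEN is inclusive, the predicate as written also holds when A and B are the same object.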

The full text indexer is responsible for setting up the parent cache and updating it whenever objects are created, moved, or deleted. The parent cache adds extra overhead whenever an object is created, moved, or deleted, but does not affect normal object updates. As the parent cache is managed by a separate indexer process, this performance hit is amortized at the cost of small periods of cache inconsistency.
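The overhead on object creation comes from renumbering: adding one object shifts the traversal indexes of every object that follows it in the tree. The sketch below shows the usual nested set insertion of a new last child. It operates on an in-memory dictionary with made-up GUIDs, and insert_leaf is a hypothetical name.

```python
def insert_leaf(bounds, parent, new_guid):
    """Insert new_guid as the last child of parent, updating bounds in place."""
    _, parent_right = bounds[parent]
    for guid, (left, right) in list(bounds.items()):
        # Everything at or after the parent's right edge moves up by two
        # to make room for the new object's (left, right) pair.
        if left >= parent_right:
            left += 2
        if right >= parent_right:
            right += 2
        bounds[guid] = (left, right)
    bounds[new_guid] = (parent_right, parent_right + 1)


bounds = {"root": (1, 8), "a": (2, 5), "a1": (3, 4), "b": (6, 7)}
insert_leaf(bounds, "a", "a2")
print(bounds)
# {'root': (1, 10), 'a': (2, 7), 'a1': (3, 4), 'b': (8, 9), 'a2': (5, 6)}
```

A plain property update touches none of these values, which is why normal object updates are unaffected.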

The indexer process

The indexer process shall be a Java application that listens to all the configured Midgard databases and indexes all record changes in the respective full text indexes. Depending on the capabilities of the underlying Midgard database, the indexer can either use triggers, a dedicated change log, or the record change timestamps to locate the changed records.

Whenever a record creation or change is detected, the indexer process will read the changed record and write the latest version to the full text index. Record deletions are handled by removing the respective index entry.

The indexer process should keep a persistent timestamp of the last processed record change per Midgard database. This timestamp can be used to avoid having to fully reindex the entire database in case the indexer or the database is restarted. Full reindexing can be triggered simply by clearing this timestamp while restarting the indexer process.
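The timestamp mechanism can be sketched as a single processing pass. The proposal calls for a Java process; the Python below only illustrates the logic. The change records, the dictionary standing in for the full text index, and the name process_changes are assumptions, and a real indexer would have to store the returned timestamp durably and decide how to treat changes that share a timestamp.

```python
def process_changes(records, index, last_seen):
    """Apply all changes newer than last_seen and return the new timestamp.

    records:   change entries with "guid", "changed", "deleted", "content"
    index:     dict standing in for the full text index (guid -> content)
    last_seen: timestamp of the last processed change (0 forces a full reindex)
    """
    pending = sorted(
        (r for r in records if r["changed"] > last_seen),
        key=lambda r: r["changed"],
    )
    for record in pending:
        if record["deleted"]:
            index.pop(record["guid"], None)            # remove the index entry
        else:
            index[record["guid"]] = record["content"]  # write the latest version
        last_seen = record["changed"]
    return last_seen


records = [
    {"guid": "g1", "changed": 10, "deleted": False, "content": "first article"},
    {"guid": "g2", "changed": 20, "deleted": False, "content": "second article"},
    {"guid": "g1", "changed": 30, "deleted": True, "content": None},
]
index = {}
checkpoint = process_changes(records, index, last_seen=0)
print(checkpoint, index)  # 30 {'g2': 'second article'}
```

Calling the function again with the returned checkpoint does nothing, while passing 0 with an empty index rebuilds everything.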

If possible the indexer process should contain both the MidCOM indexer and the indexing service described in this document. In addition to reducing the runtime and administration overhead, this would support the eventual goal of merging the two indexing mechanisms.

The query implementation

The full text index can be accessed by applications through the Midgard Query Builder API. A new constraint operator, CONTAINS, is used to add full text constraints to a query. The CONTAINS operator specifies a Lucene full text query for the specified property:

$qb = new MidgardQueryBuilder("midgard_article");
$qb->add_constraint("title", "CONTAINS", "foo");
$qb->add_constraint("abstract", "CONTAINS", "bar baz");
$qb->execute();

The above query would result in the execution of the Lucene search: title:foo AND abstract:"bar baz". The special property name * (a star) is used to add a full text constraint for the general node content field. The Lucene wildcard characters ? and * are also supported. The query builder will automatically use the correct MultiLang settings when building the Lucene query. Clients can explicitly specify the query language using the $qb->set_lang() method.
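The translation from CONTAINS constraints to a Lucene query string might look roughly like this. It is a simplified sketch: the name of the general node content field ("content" here) is not specified in this document and is an assumption, MultiLang prefixes are left out, and no escaping of Lucene special characters is attempted.

```python
def contains_to_lucene(constraints, content_field="content"):
    """Build a Lucene query string from (property, text) CONTAINS constraints.

    content_field is a hypothetical name for the general node content field,
    which is what the special property name "*" refers to.
    """
    clauses = []
    for prop, text in constraints:
        field = content_field if prop == "*" else prop
        # Multi-word values become phrase queries; single words stay bare,
        # so the wildcard characters ? and * pass through unchanged.
        term = f'"{text}"' if " " in text else text
        clauses.append(f"{field}:{term}")
    return " AND ".join(clauses)


print(contains_to_lucene([("title", "foo"), ("abstract", "bar baz")]))
# title:foo AND abstract:"bar baz"
print(contains_to_lucene([("*", "mid*")]))
# content:mid*
```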

The full Lucene query syntax is available for special cases using the operator LUCENE and the property name *:

$qb->add_constraint("*", "LUCENE", "title:roam~");

This special syntax should only be used when the normal constraint mechanisms are not enough. The query syntax is not guaranteed to remain the same across Midgard releases and MultiLang settings are not applied to the constraint.

It is possible to freely mix normal and full text constraints in one query. Full text constraints can however only be added to the top-level constraint group. A future extension may relax this limitation.

The parent cache can be used in constraints like this:

$qb->add_constraint("metadata:parent", "=", $parent_guid);
$qb->add_constraint("metadata:ancestor", "=", $ancestor_guid);
$qb->add_constraint("metadata:child", "=", $child_guid);
$qb->add_constraint("metadata:descendant", "=", $descendant_guid);

Like the full text constraints, the parent constraints can at the moment only be added to the top-level constraint group.
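One possible reading of these constraints in terms of the parent cache columns is sketched below using an in-memory SQLite database. The table is reduced to the columns discussed in this document, the GUIDs are made up, and the helper names are hypothetical; how the query builder would actually generate SQL is not specified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repligard ("
    " guid TEXT PRIMARY KEY, parent_guid TEXT,"
    " tree_left INTEGER NOT NULL, tree_right INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO repligard VALUES (?, ?, ?, ?)",
    [("root", None, 1, 8), ("a", "root", 2, 5), ("a1", "a", 3, 4), ("b", "root", 6, 7)],
)


def with_parent(conn, parent_guid):
    """Objects matching metadata:parent = parent_guid (adjacency list lookup)."""
    rows = conn.execute(
        "SELECT guid FROM repligard WHERE parent_guid = ? ORDER BY tree_left",
        (parent_guid,),
    )
    return [row[0] for row in rows]


def with_ancestor(conn, ancestor_guid):
    """Objects matching metadata:ancestor = ancestor_guid (nested set lookup)."""
    rows = conn.execute(
        "SELECT d.guid FROM repligard AS d"
        " JOIN repligard AS a"
        "   ON d.tree_left BETWEEN a.tree_left AND a.tree_right"
        " WHERE a.guid = ? AND d.guid <> a.guid"
        " ORDER BY d.tree_left",
        (ancestor_guid,),
    )
    return [row[0] for row in rows]


print(with_parent(conn, "root"))    # ['a', 'b']
print(with_ancestor(conn, "root"))  # ['a', 'a1', 'b']
```

The child and descendant constraints would be the same lookups read in the opposite direction.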

Both the full text index and the content tree metadata constraints are available also for making generic midgard_object queries:

$qb = new MidgardQueryBuilder("midgard_object");
$qb->add_constraint("metadata:ancestor", "=", $root_guid);
$qb->add_constraint("*", "CONTAINS", "foo");
$qb->execute();

The above query would return all descendants of $root_guid that contain the word "foo". The returned objects are full instances of the matching midgard_object subclasses. Note that none of the normal object properties are available for use in constraints, as no properties are registered for the midgard_object class.
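Conceptually, such a query intersects two result sets: the full text hits and the objects inside the root's traversal range. A minimal in-memory sketch follows, with made-up GUIDs and a hypothetical function name. It uses a strict comparison so that the root itself is not returned, whereas an inclusive BETWEEN would also match the root.

```python
def descendants_containing(bounds, text_hits, root_guid):
    """Keep only the full text hits that lie below root_guid in the tree.

    bounds maps guid -> (tree_left, tree_right); text_hits is a list of GUIDs
    returned by the full text search.
    """
    root_left, root_right = bounds[root_guid]
    return [
        guid for guid in text_hits
        if root_left < bounds[guid][0] < root_right
    ]


bounds = {"root": (1, 8), "a": (2, 5), "a1": (3, 4), "b": (6, 7), "other": (9, 10)}
text_hits = ["a1", "other", "root"]
print(descendants_containing(bounds, text_hits, "root"))  # ['a1']
```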

Related work

The long term goal of the full text indexing work is to be able to merge the MidCOM indexer functionality with the approach presented here. One essential step on this path is to be able to associate the search results with a specific MidCOM component.

The parent cache could possibly be changed from a cache to the authoritative representation of parent relations in Midgard 2.0.
