mRFC 0009: MidCOM Indexing Service

  1. Introduction
  2. Indexing Backends
    1. Required Document Fields
    2. NAP integration
    3. Legacy application interface
    4. Search result post processing
  3. midcom.helper.search
    1. Reindexing documents
  4. Indexer backend proposals
    1. Plucene
    2. Recommendation

This document outlines a full-text search subsystem for MidCOM. It will require external tools for indexing, which is why the subsystem will be pluggable. At least the indexer interface must also be usable from outside MidCOM.

Introduction

The basic requirements, as outlined by Henri Bergius, can be found in Bug #107 on Tigris. Please read that first.

The subsystem will be split into the following parts, which will be considered individually:

  • Search engine backend
  • PHP search engine interface module
  • Main search component providing:
    • Search frontend
    • Indexing frontend
    • Configuration UI

Some other parts of MidCOM need to be adapted as well:

  • NAP GUID lookup interface
  • MidCOM-independent interfaces to the indexer
    Note: External access to the search engine is currently not considered, as running the UI within MidCOM is far easier.
  • Integration into the Datamanager that makes automatic indexing of datamanager-driven objects possible.

This will lead to a system like this:

Indexer backend plugin --[relay commands]--> 3rd party indexer backend
        ^                                    (e.g. Plucene)
        | Abstract indexer access
        |
Indexer service <----[Use search interface]---- midcom.helper.search
        ^                                               |
        | Use indexer interface          Filter search results
        | to register documents          according to the usual
        | for indexing                   rules          |
        |                                               V
Any component including the DM;          Web site user
non-MidCOM applications use a
"subinterface" of this

The important point here is the distinction between the indexing service and the actual search engine component. The heart of the system is the indexer service. It will provide access to the full-text index through a backend-independent interface. A reduced version of the indexer service is available for non-MidCOM applications using the same basic API (but see below).

The component midcom.helper.search provides an easy way of making the indexing service available to the website end user.

Note that throughout this document, MidCOM-unaware applications (such as OpenPSA) are referred to as legacy applications.

Indexing Backends

An indexing backend usable by MidCOM is required to store, basically, key/value pairs. The values represent the actual content, along with a few additional fields (like the GUID) that should not be indexed.

The service will pass an associative array with plain-text information as both keys and values to the backend plugins. The plugin is then responsible for passing this information to the actual indexer backend in use. Any conversion between the storage record passed by the indexer service and a format suitable for the backend has to be done here.

The indexer backend should support UTF-8 encoded strings. If it does not, the search backend has to ensure that the data is correctly converted from UTF-8 to a charset configured for that backend (or, alternatively, work with the encoded UTF-8 strings if the sorting and comparison problems that might arise are acceptable).
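As an illustration of the charset fallback described above, here is a minimal Python sketch (not MidCOM code). The function name convert_record and the default target charset are assumptions made for this example only; the mRFC does not prescribe either.

```python
def convert_record(record, target_charset="iso-8859-1"):
    """Encode all keys and values of a text record for a non-UTF-8 backend.

    Hypothetical helper: the name and the default charset are assumptions.
    """
    converted = {}
    for key, value in record.items():
        # Characters that do not exist in the target charset become "?".
        converted[key.encode(target_charset, errors="replace")] = value.encode(
            target_charset, errors="replace"
        )
    return converted


record = {"title": "Caf\u00e9 menu", "content": "Price: 5 \u20ac"}
converted = convert_record(record)
```

The lossy replacement of unsupported characters is exactly the kind of trade-off a plugin would have to accept when the backend cannot store UTF-8.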

Indexer Service

This is the main access point for all components that use the indexer. Both searching and indexing are controlled through this interface.

The heart of the indexing is the so-called "document", which will form its own class hierarchy (see below). It roughly resembles the documents stored in (P)Lucene and contains a number of key/value fields of various types (controlling tokenizing, indexing and storage). The Document class hierarchy will contain interfaces to all necessary metadata fields.

The Resource ID (see below) uniquely identifies any document in the index. It is contained in the document and extracted appropriately when saving the record. Depending on the document that is indexed, this value is either deduced automatically or has to be supplied within the document.

A first API sketch (pseudo code):

class midcom_service_indexer {
    midcom_service_indexer($backend);

    Array (int key) get_capabilities();

    bool index(object $document);
    bool delete(string $resource_id);
    bool delete_all();
    Array(documents) search(string $query, object $filter);
}

The Indexer Service will be managed by MidCOM, ensuring usage similar to the Singleton design pattern.

The get_capabilities() call is relayed to the indexer backend plugin and will show its capabilities. At this time, there is no further definition of the available capabilities, but the call is there for later usage in case the need arises.

The filter argument to the search request may contain any one filter defined and supported by the search engine. At this time, there will only be the date filter, which allows filtering a query by an arbitrary date field of the stored documents.
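The API sketched above can be illustrated with a small Python model. This is only a sketch under assumptions: InMemoryBackend, DateFilter and their attributes are hypothetical stand-ins, and a real backend would query a full-text index instead of scanning a dictionary.

```python
from datetime import datetime


class DateFilter:
    """Keeps only documents whose date field lies within [start, end]."""

    def __init__(self, field, start, end):
        self.field, self.start, self.end = field, start, end

    def matches(self, document):
        value = document.get(self.field)
        return value is not None and self.start <= value <= self.end


class InMemoryBackend:
    """Hypothetical stand-in for a backend plugin; stores documents by RI."""

    def __init__(self):
        self._docs = {}

    def index(self, document):
        # Indexing an already known resource ID overwrites the old document.
        self._docs[document["__RI"]] = document
        return True

    def delete(self, resource_id):
        return self._docs.pop(resource_id, None) is not None

    def delete_all(self):
        self._docs.clear()
        return True

    def search(self, query, date_filter=None):
        # Naive substring match instead of a real full-text query.
        hits = [d for d in self._docs.values()
                if query.lower() in d.get("content", "").lower()]
        if date_filter is not None:
            hits = [d for d in hits if date_filter.matches(d)]
        return hits


class IndexerService:
    """Thin facade: every call is relayed to the configured backend."""

    def __init__(self, backend):
        self._backend = backend

    def index(self, document):
        return self._backend.index(document)

    def delete(self, resource_id):
        return self._backend.delete(resource_id)

    def delete_all(self):
        return self._backend.delete_all()

    def search(self, query, date_filter=None):
        return self._backend.search(query, date_filter)


indexer = IndexerService(InMemoryBackend())
indexer.index({"__RI": "guid-1", "content": "MidCOM news",
               "__CREATED": datetime(2004, 5, 1)})
indexer.index({"__RI": "guid-2", "content": "Older MidCOM news",
               "__CREATED": datetime(2003, 1, 1)})
recent = indexer.search(
    "news", DateFilter("__CREATED", datetime(2004, 1, 1), datetime(2004, 12, 31)))
```

Because the service only relays calls, swapping the backend should not require any change in the components that use the indexer.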

Indexer backend plugins

The indexer backend plugins make the integration of different search engines possible. They represent a Strategy design pattern.

A first rough interface sketch (pseudo code):

class midcom_service_indexer__backend {
    midcom_service_indexer__backend($configuration);

    Array (int key) get_capabilities();

    bool index($document);
    bool delete(string $resource_id);
    bool delete_all();
    Array(documents) search(string $query, object $filter);
}

The plugin must assume that all data passed to it is UTF-8 encoded. The search result has to list all documents with all fields that have been returned by the indexer.

The method get_capabilities indicates what the backend can and cannot do, for example RETRIEVE_METADATA, SEARCH_METADATA or SEARCH_SUBSTRING.
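How a frontend might adapt to the reported capabilities can be sketched in Python as follows. The capability names are the examples from the text; representing them as a set of strings, as well as the class PlainBackend and the function search_form_options, are assumptions of this sketch.

```python
class PlainBackend:
    """Hypothetical backend that cannot search for substrings."""

    def get_capabilities(self):
        return {"RETRIEVE_METADATA", "SEARCH_METADATA"}


def search_form_options(backend):
    """Offer only those search form elements the backend can serve."""
    capabilities = backend.get_capabilities()
    options = ["query"]
    if "SEARCH_METADATA" in capabilities:
        options.append("field selector")
    if "SEARCH_SUBSTRING" in capabilities:
        options.append("substring match")
    return options


options = search_form_options(PlainBackend())
```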

Documents

A document, in general, is a collection of fields that are to be indexed. The type of these fields is distinguished on indexing, but not on retrieval. These fields are available:

Stored fields are fields whose content is available in the resultset. Indexed fields are fields which can be queried. Tokenized fields are broken apart into words (or whatever the tokenizer actually does) to allow searching for substrings.

  • Date: Used for valid dates which can only be used for filtering. These can be retrieved upon queries, but their value can be anything, which is why they must not be relied upon during search result processing. They are stored as Keywords.
  • Keyword: This field is indexed, stored, but not tokenized. This should only be used for the Resource Identifier and for Date filter fields, as its contents are not easily queried.
  • Unindexed: This field is stored, but neither indexed nor tokenized. It should be used for fields that are required to access the document but that must not be available for searching.
  • Unstored: This field is indexed and tokenized, but not stored. It should be used for all values that should be queryable but that need not be in the resultset.
  • Text: This field is indexed, tokenized and stored. Used only for fields whose content needs to be in the result set.

It is strongly recommended to use unstored fields and to avoid keyword fields wherever possible. Additionally, fields whose content is stored in the index (Keyword, Unindexed and Text) should be used only where absolutely necessary. See below for the list of recommended and required fields.
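The field types listed above differ only in three properties, which can be summarized in a small Python table. The names FieldType, FIELD_TYPES and stored_fields are hypothetical and used for illustration only.

```python
from collections import namedtuple

FieldType = namedtuple("FieldType", ["stored", "indexed", "tokenized"])

# One entry per field type described in the text.
FIELD_TYPES = {
    "date": FieldType(stored=True, indexed=True, tokenized=False),  # kept as keyword
    "keyword": FieldType(stored=True, indexed=True, tokenized=False),
    "unindexed": FieldType(stored=True, indexed=False, tokenized=False),
    "unstored": FieldType(stored=False, indexed=True, tokenized=True),
    "text": FieldType(stored=True, indexed=True, tokenized=True),
}


def stored_fields(document_fields):
    """Return the names of all fields that will appear in the result set."""
    return sorted(name for name, type_name in document_fields.items()
                  if FIELD_TYPES[type_name].stored)


visible = stored_fields({
    "__RI": "keyword",
    "title": "text",
    "content": "unstored",
    "__DOCUMENT_URL": "unindexed",
})
```

In this example the content field is searchable but does not show up among the stored fields, which is the behavior recommended above for most values.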

Some of the document fields start with a double-underscore and are all-caps. These mark internal fields that should normally not be queried by the end-user. Most of the time they will be added programmatically to the query instead.

Any text data passed to the Document is transparently converted into UTF-8 on the basis of the encoding currently selected in the I18n service. For all accesses from legacy applications the charset has to be specified explicitly. Files or attachments passed to the indexer (if readable) will need an explicitly specified charset or they will be treated as-is.

Document Types

Documents must store a type keyword in the index. It will also be used by the search engine to select the correct document subclass.

Resource Identifier (RI)

A resource identifier uniquely identifies a document within the index. It is required so that documents can be deleted from the index. There is no further usage of RIs on the side of the indexer.

The caller must ensure the uniqueness of the RI values. If a single key is used twice, the second document will overwrite the existing one upon indexing. The GUIDs of Midgard objects, or the URLs of external objects, should be used here.

Required Document Fields

Each type of document, independent of its actual implementation, must define a set of fields. The Document base class has a built-in interface for this, so there is no need to add the fields manually.

  • __RI (keyword): The Resource Identifier, usually added automatically by subclasses.
  • title (text): The title of the resource, sometimes added automatically by subclasses.
  • __TOPIC_GUID (text): The GUID of the topic the document is assigned to. This can be used to limit the search to a specific topic, which can be done by components showing search forms for themselves only. Might be empty for external resources.
  • __COMPONENT (text): The component that originally invoked the indexing. Might be empty for external resources.
  • __DOCUMENT_URL (unindexed): The fully qualified URL to access this document (this should be a PermaLink for MidCOM items or a fully qualified URL for other resources).
  • __CREATED (date): Document creation timestamp.
  • __REVISED (date): Document last-modified timestamp.
  • __INDEXED (date): The timestamp of indexing, could be used for intelligent index update.
  • content (unstored): This is the main content field, where all of the page's content has to be added. It is the field that is searched by default when the user does not specify any explicit field. The Document class will automatically add the document's title field to the beginning of the content before indexing. Additional content available for direct search should be added to this field. (It might be sensible to additionally add the content to a separate field though, to facilitate more elaborate queries.)
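A document base class that fills in the required fields could look roughly like the following Python sketch. The class layout, the constructor arguments and the method name to_record are assumptions; only the field names and the rule that the title is prepended to the content come from the list above.

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = ("__RI", "title", "__TOPIC_GUID", "__COMPONENT",
                   "__DOCUMENT_URL", "__CREATED", "__REVISED", "__INDEXED",
                   "content")


class Document:
    """Hypothetical base class collecting the required fields."""

    def __init__(self, ri, title, content, url, created, revised,
                 topic_guid="", component=""):
        self.ri, self.title, self.content, self.url = ri, title, content, url
        self.created, self.revised = created, revised
        # Both may stay empty for external resources.
        self.topic_guid, self.component = topic_guid, component

    def to_record(self, indexed_at=None):
        indexed_at = indexed_at or datetime.now(timezone.utc)
        return {
            "__RI": self.ri,
            "title": self.title,
            "__TOPIC_GUID": self.topic_guid,
            "__COMPONENT": self.component,
            "__DOCUMENT_URL": self.url,
            "__CREATED": self.created,
            "__REVISED": self.revised,
            "__INDEXED": indexed_at,
            # The title is prepended so that it is found by default searches.
            "content": self.title + "\n" + self.content,
        }


stamp = datetime(2004, 5, 1, tzinfo=timezone.utc)
record = Document("guid-1", "Welcome", "Hello world",
                  "https://example.org/midcom-permalink-guid-1",
                  created=stamp, revised=stamp).to_record(indexed_at=stamp)
```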

Recommended Document Fields

These fields should be added to the Index wherever possible, but can be left out, in which case the indicated default is used. Again, these values can be modified without actually adding the fields.

  • __SOURCE (unindexed): The source object "handle", used to retrieve the actual object. Usable only by the indexer client to get a "direct" reference to the object that was indexed originally.
  • __CREATOR (text): Creator (MidgardPerson GUID).
  • __REVISOR (text): Last modifier (MidgardPerson GUID).
  • author (text): Plain-text author name, which may be different from __CREATOR, which represents the actual object creator (who is not necessarily the author). If omitted upon indexing, the indexer service will set it to the __CREATOR's name field.
  • abstract (text): This is the abstract to be shown on the search result page. May include HTML inline tags. (When dealing with MidCOM data, again only PermaLinks should be used.)

Datamanager integration

A specialized document type for indexing datamanager-driven objects will be available. It will be aware of the various types stored at the object, and can, for example, handle attachments and the like. Additional datamanager schema directives will be needed to control indexing.

All unspecified fields will have their string representations concatenated and appended to the content field automatically. The creator, revisor and author fields are populated by using the according Midgard values.

For compatibility reasons, a schema field named abstract will automatically be used as the default abstract field value, unless specified otherwise.
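The default behavior for datamanager objects can be approximated by the following Python sketch. It assumes that title and abstract are the only explicitly handled schema fields; the function name build_index_fields and the parameter explicit are hypothetical.

```python
def build_index_fields(values, explicit=("title", "abstract")):
    """Map datamanager field values to index fields.

    Fields named in `explicit` keep their own index field; all other
    non-empty values are converted to strings and appended to content.
    """
    fields = {name: values[name] for name in explicit if name in values}
    rest = [str(value) for name, value in values.items()
            if name not in explicit and value not in (None, "")]
    fields["content"] = "\n".join(rest)
    return fields


fields = build_index_fields({
    "title": "News",
    "abstract": "Short",
    "body": "Long text",
    "location": "Helsinki",
    "count": 3,
})
```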

NAP integration

NAP will need some additional features to make an integration like this clean.

First and most important of all, NAP needs a way to uniquely identify each element within the Tree. These links need to be stable so that even in a case of a moved article it can be found at any time.

Using the GUID of the object will give the best permanence, but it also has a few drawbacks. First and foremost, a complete scan of the NAP tree will be required to find anything that is not within the regular content tree (MidgardEvents, for example). It might also lead to ambiguous results if content topics are symlinked using the taviewer or newsticker.

Unfortunately, I currently do not have a better idea, so we will stick with this for the moment.

Upon a permalink request, the system will translate the permalink into a real URL and create an HTTP relocate response, which can be cached by the MidCOM caching engine to speed up repeated accesses.
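The permalink translation can be pictured with the Python sketch below. It is not the MidCOM caching engine: the class PermalinkResolver, the in-process dictionary cache and the status codes 302 and 404 are all assumptions chosen for the example.

```python
class PermalinkResolver:
    """Hypothetical GUID-to-URL resolver with a simple lookup cache."""

    def __init__(self, lookup):
        self._lookup = lookup  # callable: GUID -> URL or None (e.g. a NAP scan)
        self._cache = {}
        self.lookups = 0       # counts the expensive lookups actually made

    def resolve(self, guid):
        if guid not in self._cache:
            self.lookups += 1
            self._cache[guid] = self._lookup(guid)
        return self._cache[guid]

    def response(self, guid):
        """Return (status, headers) for a permalink request."""
        url = self.resolve(guid)
        if url is None:
            return (404, {})
        return (302, {"Location": url})


resolver = PermalinkResolver({"guid-1": "/news/article.html"}.get)
first = resolver.response("guid-1")
second = resolver.response("guid-1")
```

The second request is answered from the cache, so the costly tree scan runs only once per GUID.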

Note:

To improve the performance of these lookups, a NAP cache would eventually be advisable, as the "live" NAP requests can get quite time consuming. These NAP changes should also be combined with the Arbitrary-NAP-Metadata-Storage-Interface-Project(tm). Ultimately this will mean a more-or-less complete rewrite of the NAP core. (This should actually get its own mRFC.)

File indexing mode

It should be possible to index external files (via path or URL) and Midgard Attachments, as they can contain valuable information too.

Some kind of filter plugin interface (again a "Strategy") will be required here, which takes some kind of file handle and parses it. The various filters can register themselves with a list of MIME types that they can handle. The result will again be an indexer-compatible content array. The metadata will come from the caller but can be overridden in the import run.

Possible importers might be: HTML (skipping tags), Plain Text, XML (using the XML schema for indexing), JPEG (EXIF extract), PDF (which has metadata too).
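The filter registration described above could be sketched in Python like this. The registry, the decorator register_filter and the function parse_file are hypothetical names, and the regular expression is only a crude stand-in for a real HTML parser.

```python
import re

_FILTERS = {}  # MIME type -> filter function


def register_filter(mime_types):
    """Register a filter function for a list of MIME types."""
    def decorator(func):
        for mime_type in mime_types:
            _FILTERS[mime_type] = func
        return func
    return decorator


@register_filter(["text/plain"])
def plain_text_filter(data, charset):
    return {"content": data.decode(charset)}


@register_filter(["text/html", "application/xhtml+xml"])
def html_filter(data, charset):
    # Crude tag stripping, sufficient only for this illustration.
    text = re.sub(r"<[^>]+>", " ", data.decode(charset))
    return {"content": " ".join(text.split())}


def parse_file(data, mime_type, charset="utf-8", metadata=None):
    """Return an indexer-compatible record, or None if no filter matches."""
    handler = _FILTERS.get(mime_type)
    if handler is None:
        return None
    record = dict(metadata or {})          # metadata supplied by the caller
    record.update(handler(data, charset))  # the import run may override it
    return record


page = parse_file(b"<h1>Hi</h1><p>there</p>", "text/html",
                  metadata={"title": "Page"})
```

Files with an unregistered MIME type are simply skipped, which keeps the set of importers open for later additions.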

Legacy application interface

This will be a simple interface that can index by-data, by-file, by-attachment or by-URL. It will be able to run outside MidCOM and will need explicit specification of all metadata usually set by MidCOM (like the character encoding). The legacy resources will have to be indexed by URL.

An external search interface is not yet planned.

Search result post processing

All results need to be post-processed by MidCOM for various permission checks. At the moment only the ViewerGroups checks will be made here, hiding search results that are inaccessible by the user.
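The post-processing step can be sketched in Python as below. It assumes, purely for illustration, that each hit carries a viewer_groups list and that an empty or missing list means the hit is public; how MidCOM actually obtains the ViewerGroups of a hit is not specified here.

```python
def filter_results(results, user_groups):
    """Drop all hits that the current user is not allowed to view."""
    user_groups = set(user_groups)
    visible = []
    for hit in results:
        viewer_groups = hit.get("viewer_groups")  # hypothetical field
        if not viewer_groups or user_groups & set(viewer_groups):
            visible.append(hit)
    return visible


results = [
    {"title": "Public page"},
    {"title": "Staff only", "viewer_groups": ["staff"]},
]
for_member = filter_results(results, ["members"])
for_staff = filter_results(results, ["staff", "members"])
```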

midcom.helper.search

This is the default search frontend. It will use the indexer and will render a default search interface honoring the current backend's capabilities. Styling should be doable with a bit of intelligent CSS.

Two search modes are planned: simple (default, suitable for dynamic_loading) and advanced.

Hits will be shown using the "abstract" field, if possible, or the first few lines of the "content" field. If both are missing, no content preview is shown.
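The preview rule translates into a few lines of Python. The function name preview and the default of three lines are assumptions; the text only says "the first few lines".

```python
def preview(hit, max_lines=3):
    """Pick the text shown below a search hit, or None if nothing fits."""
    abstract = hit.get("abstract")
    if abstract:
        return abstract
    content = hit.get("content")
    if content:
        return "\n".join(content.splitlines()[:max_lines])
    return None


with_abstract = preview({"abstract": "<b>Short</b> summary", "content": "ignored"})
from_content = preview({"content": "one\ntwo\nthree\nfour"})
```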

Indexing Problems

As you can imagine, indexing is a quite complicated thing. Several points come to my mind here:

Reindexing documents

The push technology deployed means that a periodic check of the complete index is difficult at best. Within MidCOM this can be done by using the component interface, which allows a reindex to be triggered uniformly. Legacy applications will not have this luxury; a "callback interface" there is theoretically possible, but will probably be difficult in practice due to the variety of applications involved.

Therefore, reindexing can only be done by dropping the complete index, telling the MidCOM content tree to reindex itself, and telling the user that all legacy applications must now be told to reindex themselves.

Searching only a part of the content tree

This is another quite difficult thing. The RIs are used to uniquely identify an object with as few dependencies on the actual site structure as possible. Therefore, the metadata for a given object must include a URL to the topic it is assigned to (that way attachments and files can be "linked" to a topic as well). This field is then queried for a given prefix (the backend should support this feature).

If the backend does not support prefix searches for a given field, this feature will of course be unavailable.
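A prefix restriction of this kind might be sketched in Python as follows. The field name __TOPIC_URL and the capability name SEARCH_PREFIX are hypothetical; neither is defined in this mRFC.

```python
def search_in_topic(documents, topic_prefix, capabilities):
    """Restrict hits to documents assigned to a topic below topic_prefix."""
    if "SEARCH_PREFIX" not in capabilities:
        # Without prefix searches the feature is simply unavailable.
        raise NotImplementedError("backend cannot run prefix searches")
    return [d for d in documents
            if d.get("__TOPIC_URL", "").startswith(topic_prefix)]


documents = [
    {"__RI": "guid-1", "__TOPIC_URL": "/news/2004/"},
    {"__RI": "guid-2", "__TOPIC_URL": "/newsletter/"},
    {"__RI": "guid-3", "__TOPIC_URL": "/about/"},
]
news_hits = search_in_topic(documents, "/news/", {"SEARCH_PREFIX"})
```

Note the trailing slash in the prefix: without it, "/news" would also match the unrelated "/newsletter/" topic.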

Index consistency

The push model will not guarantee a current and complete index at all times, as bugs in the various components might cause errors in the index. Compared to a traditional indexing system, the chance of an incomplete index is significantly higher, perhaps by a full order of magnitude.

Dynamically loaded components

A component will only be in the index at its "primary" location. Any place where dynamic_load displays content within another component, that part will not be indexed. The corresponding location will not be available through the search engine therefore.

Usually this is no problem, as long as the dynamically loaded content is regularly available at another place. You will get into trouble, though, if the loaded topic is hidden from the user (for example a discussion forum below a page). The search result URL will lead you to the actual (hidden) topic, not to the page to which it is assigned.

This is very difficult to circumvent at the moment, as a dynamically loaded component does not know that it was dynamically loaded in the first place. Especially not within AIS, which makes backtracking dynamic_load links virtually impossible without larger changes at the moment.

Indexer backend proposals

First for some general considerations:

As indexing is quite time consuming, it is suggested to have some kind of indexing daemon instead of a simple command line interface in the long run. The independent indexer interfaces provided should allow adding such a daemon at a later time without any changes to the rest of the system (with the exception of the interface driver itself).

In any case the runtime considerations are significant: in a really simple environment a new process will be spawned for each index query, and it will not be possible to do batch queries if a simple shell script interface is used as outlined above. I do not think that this will be a problem on smaller sites, but for anything larger an indexing daemon will be a must. This will also make it easier to access the index from other applications.

Which backend is used can ultimately only be decided by checking the features that are required on the specific site.

Plucene

Plucene supports the outlined features quite well as far as I can tell from a first look. Of course, I have no idea about stability and maturity of Plucene. As Perl is readily available everywhere, the dependency problems should not be that hard (on the other hand, Perl repeatedly had problems with its various versions that are out there).

Plucene has two major drawbacks. First, it does not support range or wildcard searches (though a range search over a date field seems possible). Second, its documentation is rather incomplete, though the author says that the Lucene docs are valid for Plucene as well. I am not happy with this nevertheless.

Installation difficulty unknown, may be as easy as pulling the module out of CPAN.

Lucene

This is the original Java implementation, which is, obviously, the most powerful one. Also, its documentation seems to be quite extensive and there are lots of articles covering Lucene on the Web. The fact that Lucene is written in Java shouldn't be much of a problem nowadays, as most distributions ship with Java on board.

As the Java-based approach has considerably longer startup times than the Perl one, some kind of indexing daemon is far more important here than it would be for the Perl solution.

Installation difficulty unknown; when a working Java VM is available, it should be rather trivial though.

Note that when using Lucene it is quite important to have a search index daemon available, as Java is a language that gains speed while it is executing. Starting up a new Java VM on each and every search request is not an option in this case. (Writing daemons in Java isn't that difficult, on the other hand.)

Recommendation

This depends on the actual requirements. From a purely technological point of view, Lucene is almost always the better solution. It is more powerful, the original implementation, well documented and has a wide user base.

The disadvantage here is the increased requirements during installation and setup (Java is some kind of a beast, always was). Benchmarks vary widely, though Java will not have the dependency hell. Lucene requires JDK 1.3.1 or later; if this is satisfied, the rest is a breeze (due to Java's huge amount of standardization). This is especially interesting for the interface stuff, which requires XML processing and the like.

Bottom line: Both Plucene and Lucene have a large number of unknown variables, and both have their own advantages and disadvantages. I personally would prefer Lucene, but this is more a matter of personal taste (I just find Java more suitable for server applications than Perl (or PHP, for that matter)).
