mRFC 0009: MidCOM Indexing Service
This document outlines a full-text search subsystem for MidCOM. It
will require external tools for indexing, which is why the subsystem
will be pluggable. At least the indexer interface must be usable from
outside MidCOM.
Introduction
The basic requirements as outlined by Henri Bergius can be found in Bug #107 on Tigris. Please read this first.
The subsystem will be split into the following parts, which will be considered individually:
- Search engine backend
- PHP search engine interface module
- Main search component providing:
  - Search frontend
  - Indexing frontend
  - Configuration UI
Some other parts of MidCOM need to be adapted, too:
- NAP GUID lookup interface
- MidCOM independent interfaces to the indexer
- Integration into the Datamanager that makes automatic indexing of datamanager driven objects possible.
Note: External access to the search engine is currently not considered, as running the UI within MidCOM is far easier.
This will lead to a system like this:
Indexer backend plugin --[relay commands]--> 3rd party indexer backend
        ^                                    (e.g. Plucene)
        | Abstract indexer access
        |
Indexer service <----[Use search interface]---- midcom.helper.search
        ^                                               |
        | Use indexer interface   Filter search results |
        | to register documents   according to the usual|
        | for indexing            rules                 |
        |                                               V
Any component including the DM;                   Web site user
non-MidCOM applications use a
"subinterface" of this
The important point here is the distinction between the indexing service
and the actual search engine component. The heart of the system
is the indexer service. It will provide access to the full-text index
through a backend-independent interface. A reduced version of the indexer
service is available for non-MidCOM applications using the same basic
API (but see below).
The component midcom.helper.search provides an easy way of making the indexing service available to the website end user.
Note that throughout this document, MidCOM-unaware applications (such as
OpenPSA) are referred to as legacy applications.
Indexing Backends
An indexing backend usable by MidCOM is required to store, basically,
key/value pairs. The values represent the actual content, along with a
few additional fields (like the GUID) that should not be indexed.
The service will pass an associative array with plain-text information as
both keys and values to the backend plugins. The plugin is then
responsible for passing this information to the actual indexer backend
in use. Any conversion of the storage record passed by the indexer
service to and from a format suitable for the backend has to be done
here.
The indexer backend should support UTF-8 encoded strings. If it does not,
the search backend plugin has to ensure that a correct conversion from
UTF-8 to a configured charset is done (or, alternatively, work with the
encoded UTF-8 strings if you can live with the sorting and comparison
problems that might arise).
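The conversion duty of a backend plugin could look roughly like the following. This is only a Python sketch for illustration (the real plugins would be written in PHP); the function name convert_record and the backend_charset parameter are assumptions and not part of any MidCOM API.

```python
def convert_record(record, backend_charset="utf-8"):
    """Encode a str -> str record into bytes for the backend.

    Characters the target charset cannot represent are replaced by "?".
    This is lossy, which is one of the problems a non-UTF-8 backend causes.
    """
    return {
        key.encode(backend_charset, errors="replace"):
            value.encode(backend_charset, errors="replace")
        for key, value in record.items()
    }


record = {"title": "Größe", "content": "Price: 5 €"}
latin1 = convert_record(record, "iso-8859-1")
```

The Euro sign has no ISO-8859-1 representation, so it is replaced in the converted record; a UTF-8 capable backend would receive the record unchanged in content.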
Indexer Service
This is the main access point for all components that use the indexer.
Both searching and indexing are controlled through this interface.
The heart of the indexing is the so-called "document", which will
form its own class hierarchy (see below). It roughly resembles the
documents stored in (P)Lucene and contains a number of key/value fields
of various types (controlling tokenizing, indexing and storage). The
Document class hierarchy will contain interfaces to all necessary
metadata fields.
The Resource ID (see below) uniquely identifies any document in the
index. It is contained in the document and extracted appropriately when
saving the record. Depending on the document that is indexed, this
value is either deduced automatically or has to be supplied within the
document.
A first API sketch (pseudo code):
class midcom_service_indexer {
    midcom_service_indexer($backend);    // constructor, takes the backend plugin
    Array (int key) get_capabilities();
    bool index(object $document);
    bool delete(string $resource_id);
    bool delete_all();
    Array(documents) search(string $query, object $filter);
}
The Indexer Service will be managed by MidCOM, ensuring usage similar to the Singleton design pattern.
The get_capabilities() call is relayed to the indexer backend plugin and
will show its capabilities. At this time, there is no further
definition of the available capabilities, but the call is there for later
use in case the need arises.
The filter argument to the search request may
contain any one filter defined and supported by the search engine. At
this time, there will only be the date filter, which allows filtering
a query by an arbitrary date field of the stored documents.
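To make the relay idea and the date filter more concrete, here is a small Python sketch (the real service would be PHP). Everything in it is an assumption made for illustration: the class names IndexerService, InMemoryBackend and DateFilter, the representation of documents as plain dicts, and the naive substring matching that stands in for a real search engine.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DateFilter:
    """Hypothetical date filter: keep documents whose date field is in range."""
    field: str
    start: Optional[datetime] = None
    end: Optional[datetime] = None

    def matches(self, document):
        value = document.get(self.field)
        if value is None:
            return False
        if self.start is not None and value < self.start:
            return False
        if self.end is not None and value > self.end:
            return False
        return True


class InMemoryBackend:
    """Toy stand-in for a backend plugin; documents are plain dicts."""

    def __init__(self):
        self._docs = {}

    def get_capabilities(self):
        return []

    def index(self, document):
        # Indexing with an already known RI overwrites the old document.
        self._docs[document["__RI"]] = document
        return True

    def delete(self, resource_id):
        return self._docs.pop(resource_id, None) is not None

    def delete_all(self):
        self._docs.clear()
        return True

    def search(self, query, date_filter=None):
        # Naive matching: every query word must occur as a substring.
        words = query.lower().split()
        hits = []
        for doc in self._docs.values():
            text = doc.get("content", "").lower()
            if all(word in text for word in words) and (
                date_filter is None or date_filter.matches(doc)
            ):
                hits.append(doc)
        return hits


class IndexerService:
    """Relays every call to whichever backend it was constructed with."""

    def __init__(self, backend):
        self._backend = backend

    def get_capabilities(self):
        return self._backend.get_capabilities()

    def index(self, document):
        return self._backend.index(document)

    def delete(self, resource_id):
        return self._backend.delete(resource_id)

    def delete_all(self):
        return self._backend.delete_all()

    def search(self, query, date_filter=None):
        return self._backend.search(query, date_filter)


indexer = IndexerService(InMemoryBackend())
indexer.index({"__RI": "guid-1", "content": "hello world", "__CREATED": datetime(2004, 5, 1)})
indexer.index({"__RI": "guid-2", "content": "hello again", "__CREATED": datetime(2005, 1, 1)})
recent = indexer.search("hello", DateFilter("__CREATED", start=datetime(2004, 12, 31)))
```

Because the service only talks to the backend through these five calls, swapping the toy backend for a real one would not affect the callers.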
Indexer backend plugins
The indexer backend plugins make the integration of different search engines possible. They represent a Strategy design pattern.
A first rough interface sketch (pseudo code):
class midcom_service_indexer__backend {
    midcom_service_indexer__backend($configuration);    // constructor
    Array (int key) get_capabilities();
    bool index(object $document);
    bool delete(string $resource_id);
    bool delete_all();
    Array(documents) search(string $query, object $filter);
}
The plugin must assume that all data passed to it is UTF-8 encoded. The search result has to list all documents with all fields that have been returned by the indexer.
The method get_capabilities gives an indication about what the
backend can do, and what not. For example RETRIEVE_METADATA,
SEARCH_METADATA or SEARCH_SUBSTRING.
Documents
A document, in general, is a collection of fields that are to be
indexed. The type of these fields is distinguished on indexing, but not
on retrieval.
Stored fields are fields whose content is available in the result set. Indexed fields are fields which can be queried. Tokenized fields are broken apart into words (or whatever the tokenizer actually does) to allow searching for substrings.
These field types are available:
- Date: Used for valid dates which can only be used for filtering. These can be retrieved upon queries, but their value can be anything, which is why they must not be relied upon during search result processing. They are stored as Keywords.
- Keyword: This field is indexed, stored, but not tokenized. This should only be used for the Resource Identifier and for Date filter fields, as its contents are not easily queried.
- Unindexed: This field is stored, but neither indexed nor tokenized. It should be used for fields that are required to access the document but that must not be available for searching.
- Unstored: This field is indexed and tokenized, but not stored. It should be used for all values that should be queryable but that need not be in the resultset.
- Text: This field is indexed, tokenized and stored. It should be used only for fields whose content needs to be in the result set.
It is strongly recommended to use unstored fields and to avoid keyword fields wherever possible. Additionally, fields whose content is stored in the index (Keyword, Unindexed and Text) should be used only where absolutely necessary. See below for the list of recommended and required fields.
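The field types above differ only in three yes/no properties, which can be written down as a small lookup table. The Python sketch below takes the flags from the descriptions above; the names FieldType, FIELD_TYPES and stored_fields are made up for this illustration.

```python
from typing import NamedTuple


class FieldType(NamedTuple):
    indexed: bool
    stored: bool
    tokenized: bool


# Flags as described in the list above; Date fields are stored as Keywords.
FIELD_TYPES = {
    "keyword":   FieldType(indexed=True,  stored=True,  tokenized=False),
    "date":      FieldType(indexed=True,  stored=True,  tokenized=False),
    "unindexed": FieldType(indexed=False, stored=True,  tokenized=False),
    "unstored":  FieldType(indexed=True,  stored=False, tokenized=True),
    "text":      FieldType(indexed=True,  stored=True,  tokenized=True),
}


def stored_fields(field_kinds):
    """Return the sorted names of fields that will appear in the result set."""
    return sorted(
        name for name, kind in field_kinds.items() if FIELD_TYPES[kind].stored
    )


in_results = stored_fields({"__RI": "keyword", "title": "text", "content": "unstored"})
```

In this example the content field can be searched but is not returned with the hits, which is the reason for recommending unstored fields.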
Some of the document fields start with a double-underscore and are
all-caps. These mark internal fields that should normally not be
queried by the end-user. Most of the time they will be added
programmatically to the query instead.
Any text data passed to the Document is transparently converted into
UTF-8 on the basis of the encoding currently selected in the I18n
service. For all accesses from legacy applications the charset has to
be specified explicitly. Files or attachments passed to the indexer (if
readable) will need an explicitly specified charset or they will be
treated as-is.
Document Types
Documents must store a type keyword in the index. It will also be used by the search engine to select the correct document subclass.
Resource Identifier (RI)
A resource identifier uniquely identifies a document within the
index. It is required so that documents can be deleted from the index.
There is no further usage of RIs on the side of the indexer.
The caller must ensure the uniqueness of the RI values. If a single
RI is used twice, the newly indexed document will overwrite the existing one.
The GUID of Midgard objects or the URL of external objects should be used here.
Required Document Fields
Each type of document, independent of its actual implementation, must define a set of fields. The Document base class has a built-in interface for this, so there is no need to add the fields manually.
- __RI (keyword): The Resource Identifier, usually added automatically by subclasses.
- title (text): The title of the resource, sometimes added automatically by subclasses.
- __TOPIC_GUID (text): The GUID of the topic the document is assigned to. This can be used to limit the search to a specific topic, which can be done by components showing search forms for themselves only. Might be empty for external resources.
- __COMPONENT (text): The component that originally invoked the indexing. Might be empty for external resources.
- __DOCUMENT_URL (unindexed): The fully qualified URL to access this document (this should be a PermaLink for MidCOM items or a fully qualified URL for other resources).
- __CREATED (date): Document creation timestamp.
- __REVISED (date): Document last-modified timestamp.
- __INDEXED (date): The timestamp of indexing, could be used for intelligent index update.
- content (unstored): This is the main content field, to which all of the page's content has to be added. It is the field that is searched by default when the user does not specify any explicit field. The Document class will automatically add the document's title field to the beginning of the content before indexing. Additional content available for direct search should be added to this field. (It might be sensible to additionally add the content to a separate field though, to facilitate more elaborate queries.)
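A base class covering the required fields might be sketched as follows. This is a Python illustration under assumptions, not the planned PHP class: the name Document matches the text, but the constructor signature, the fields dict and the prepare_for_indexing method are hypothetical.

```python
from datetime import datetime, timezone


class Document:
    """Hypothetical base document carrying the required fields as a dict."""

    def __init__(self, ri, title, content, url, topic_guid="", component=""):
        now = datetime.now(timezone.utc)
        self.fields = {
            "__RI": ri,
            "title": title,
            "__TOPIC_GUID": topic_guid,   # may stay empty for external resources
            "__COMPONENT": component,     # may stay empty for external resources
            "__DOCUMENT_URL": url,
            "__CREATED": now,
            "__REVISED": now,
            "content": content,
        }

    def prepare_for_indexing(self, indexed_at=None):
        """Return the record handed to the indexer.

        Stamps __INDEXED and prepends the title to the content field,
        leaving the document's own fields untouched.
        """
        record = dict(self.fields)
        record["__INDEXED"] = indexed_at or datetime.now(timezone.utc)
        record["content"] = f"{record['title']}\n{record['content']}"
        return record


doc = Document("guid-1", "Welcome", "Body text.", "https://example.com/permalink/guid-1")
record = doc.prepare_for_indexing()
```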
Recommended Document Fields
These fields should be added to the index wherever possible, but can be
left out, in which case the indicated default is used. Again, these
values can be modified without actually adding the fields.
- __SOURCE (unindexed): The source object "handle", used to retrieve the actual object. Usable only by the indexer client to get a "direct" reference which object was indexed originally.
- __CREATOR (text): Creator (MidgardPerson GUID).
- __REVISOR (text): Last modifier (MidgardPerson GUID).
- author (text): Plain-text author name, may be different from __CREATOR, which represents the actual object creator (which is not necessarily the author). If omitted upon indexing, the indexer service will set it to the __CREATOR's name field.
- abstract (text): This is the abstract to be shown on the search result page. May include HTML inline tags. (When dealing with MidCOM data, again only PermaLinks should be used.)
Datamanager integration
A specialized document type for indexing datamanager driven objects
will be available. It will be aware of the various types stored in the
object and can, for example, handle attachments and the like.
Additional datamanager schema directives will be needed to control
indexing.
All unspecified fields will have their string representations
concatenated and appended to the content field automatically. The
creator, revisor and author fields are populated from the corresponding
Midgard values.
For compatibility reasons, a schema field named abstract will automatically be used as the default abstract field value, unless specified otherwise.
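The default mapping described here could be sketched like this in Python. The function build_datamanager_record and its explicit_fields parameter are hypothetical, and the schema directives themselves are not defined yet. Whether the abstract text should also end up in the content field is left open by the text; this sketch includes it.

```python
def build_datamanager_record(values, explicit_fields=()):
    """Map datamanager field values (name -> value) to an indexer record.

    Fields named in explicit_fields are copied under their own name. All
    other fields are stringified and appended to "content".
    """
    record = {}
    content_parts = []
    for name, value in values.items():
        if name in explicit_fields:
            record[name] = str(value)
        else:
            content_parts.append(str(value))
    record["content"] = "\n".join(content_parts)
    # A schema field called "abstract" becomes the default abstract unless
    # it was already mapped explicitly.
    if "abstract" in values and "abstract" not in record:
        record["abstract"] = str(values["abstract"])
    return record


dm_record = build_datamanager_record(
    {"title": "News", "abstract": "Short summary", "body": "Long text", "views": 3},
    explicit_fields=("title",),
)
```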
NAP integration
NAP will need some additional features to make an integration like this clean.
First and most important of all, NAP needs a way to uniquely
identify each element within the tree. These links need to be stable so
that an article can still be found after it has been moved.
Using the GUID of the object will give the best permanence, but it also has
a few drawbacks. First and foremost, a complete scan of the NAP tree
will be required to find anything that is not within the regular
content tree (MidgardEvents, for example). It might also lead to
ambiguous results when content topics are symlinked using
the taviewer or newsticker.
Unfortunately, I currently do not have a better idea, so we stick with this for the moment.
Upon a permalink request, the system will translate the permalink
into a real URL and create an HTTP relocate answer, which can be cached
by the MidCOM caching engine to speed up repeated accesses.
Note: To improve the performance of these lookups, a NAP cache would
eventually be advisable, as the "live" NAP requests can get quite time
consuming. These NAP changes should also be combined with the
Arbitrary-NAP-Metadata-Storage-Interface-Project(tm). In the end this will
mean a more-or-less complete rewrite of the NAP core. (This should actually
get its own mRFC.)
File indexing mode
It should be possible to index external files (via path or URL) and
Midgard attachments, as they can contain valuable information too.
Some kind of filter plugin interface (again a "Strategy") will be
required here, which takes some kind of file handle and parses it. The
various filters can register themselves with a list of MIME types that they
can handle. The result will again be an indexer-compatible
content array. The metadata will come from the caller but can be
overridden in the import run.
Possible importers might be: HTML (skipping tags), Plain Text, XML (using the XML schema for indexing), JPEG (EXIF extract), PDF (... does have metadata too).
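The registration of filters by MIME type could work along these lines. Again this is a Python sketch under assumptions: register_filter, extract_content and the two example filters are invented for illustration, filters receive raw bytes rather than a file handle to keep the example short, and the tag stripping is deliberately crude.

```python
import re

# Registry mapping a MIME type to the filter function that handles it.
_FILTERS = {}


def register_filter(*mime_types):
    """Decorator registering a filter function for the given MIME types."""
    def decorator(func):
        for mime_type in mime_types:
            _FILTERS[mime_type] = func
        return func
    return decorator


@register_filter("text/plain")
def plain_text_filter(data, charset):
    return data.decode(charset, errors="replace")


@register_filter("text/html", "application/xhtml+xml")
def html_filter(data, charset):
    # Crude tag stripping for illustration; a real filter would use an HTML parser.
    text = re.sub(r"<[^>]+>", " ", data.decode(charset, errors="replace"))
    return " ".join(text.split())


def extract_content(data, mime_type, charset="utf-8"):
    """Return indexable text, or None when no filter handles the MIME type."""
    handler = _FILTERS.get(mime_type)
    return handler(data, charset) if handler else None


html_text = extract_content(b"<p>Hello <b>world</b></p>", "text/html")
```

Files with an unregistered MIME type simply yield no content, so the caller can decide whether to index their metadata only.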
Legacy application interface
This will be a simple interface that can index by-data, by-file,
by-attachment or by-URL. It will be able to run outside MidCOM and will
need explicit specification of all metadata usually set by MidCOM
(like the character encoding). The legacy resources will have to be
indexed by URL.
An external search interface is not yet planned.
Search result post processing
All results need to be post-processed by MidCOM for various
permission checks. At the moment only the ViewerGroups checks will be
made here, hiding search results that are inaccessible to the user.
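The post-processing step amounts to a filter over the raw result list. The following Python sketch assumes a simplified permission model that is not taken from MidCOM: each hit carries a hypothetical "viewer_groups" list, and a missing or empty list means the hit is public.

```python
def filter_visible(results, user_groups):
    """Drop hits the user may not see.

    Each hit is a dict. The hypothetical key "viewer_groups" holds the groups
    allowed to view it; an empty or missing value means "public".
    """
    user_groups = set(user_groups)
    visible = []
    for hit in results:
        allowed = set(hit.get("viewer_groups", ()))
        if not allowed or allowed & user_groups:
            visible.append(hit)
    return visible


hits = [
    {"__RI": "guid-1"},
    {"__RI": "guid-2", "viewer_groups": ["staff"]},
    {"__RI": "guid-3", "viewer_groups": ["board", "staff"]},
]
for_guest = filter_visible(hits, [])
for_staff = filter_visible(hits, ["staff"])
```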
midcom.helper.search
This is the default search frontend. It will use the indexer and
will render a default search interface honoring the current backend's
capabilities. Styling should be doable with a bit of intelligent CSS.
Two search modes are planned: simple (default, suitable for dynamic_loading) and advanced.
Hits will be shown using the "abstract" field, if possible, or the
first few lines of the "content" field. If both are missing, no content
preview is shown.
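The preview rule can be written as a short fallback chain. This Python sketch assumes that a hit is a dict and that the content text is actually available in it; hit_preview and max_lines are hypothetical names.

```python
def hit_preview(hit, max_lines=3):
    """Return the preview text for a search hit, or None if nothing is available."""
    abstract = hit.get("abstract")
    if abstract:
        return abstract
    content = hit.get("content") or ""
    # Fall back to the first few non-empty lines of the content.
    lines = [line for line in content.splitlines() if line.strip()]
    preview = "\n".join(lines[:max_lines])
    return preview or None


with_abstract = hit_preview({"abstract": "Summary", "content": "ignored"})
from_content = hit_preview({"content": "one\n\ntwo\nthree\nfour"})
no_preview = hit_preview({"title": "Only a title"})
```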
Indexing Problems
As you can imagine, indexing is quite a complicated matter. Several points come to mind here:
Reindexing documents
The push technology deployed means that a periodic check of the
complete index is difficult at best. Within MidCOM this can be done by
using the component interface, which allows a reindex to be triggered
uniformly. Legacy applications will not have this luxury. Having a
"callback interface" there is theoretically possible, but will probably
be difficult in reality due to the variety of applications involved.
Therefore, reindexing can only be done by dropping the complete
index, telling the MidCOM content tree to reindex itself and telling
the user that all legacy applications must now be told to reindex
themselves.
Searching only a part of the content tree
This is another quite difficult thing. The RIs are used to uniquely
identify an object with as few dependencies on the actual site
structure as possible. Therefore, the metadata for a given object must
include a URL to the topic it is assigned to (that way attachments and
files can be "linked" to a topic as well). This field is then queried
for a given prefix (the backend should support this feature).
If the backend does not support prefix searches for a given field, this feature will of course be unavailable.
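What such a prefix restriction has to do can be shown on an already retrieved result list. This Python sketch is not a backend query: the field name __TOPIC_URL is hypothetical (no such field is defined in this document), and a real backend would evaluate the prefix inside the index.

```python
def restrict_to_subtree(hits, topic_url_prefix, field="__TOPIC_URL"):
    """Keep hits whose topic URL lies at or below the given prefix."""
    # Normalise trailing slashes so that "/news" does not match "/newsletter/".
    prefix = topic_url_prefix.rstrip("/") + "/"
    return [
        hit for hit in hits
        if (hit.get(field, "").rstrip("/") + "/").startswith(prefix)
    ]


all_hits = [
    {"__RI": "a", "__TOPIC_URL": "/news/"},
    {"__RI": "b", "__TOPIC_URL": "/news/2004/"},
    {"__RI": "c", "__TOPIC_URL": "/newsletter/"},
    {"__RI": "d"},
]
news_only = restrict_to_subtree(all_hits, "/news")
```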
Index consistency
The push model will not guarantee a current and complete index at all
times, as bugs in the various components might lead to errors in the
index. Compared to a traditional indexing system, the chance of an
incomplete index is significantly higher, perhaps by a full order of
magnitude.
Dynamically loaded components
A component will only be in the index at its "primary" location. Any
place where dynamic_load displays content within another component
will not be indexed. The corresponding location will therefore not be
available through the search engine.
Usually this is no problem, as long as the dynamically loaded
content is regularly available at another place. You will get into
trouble though if the loaded topic is hidden from the user (for example
a discussion forum below a page). The search result URL will lead you
to the actual (hidden) topic, not the page which it is assigned to.
This is very difficult to circumvent at the moment, as a dynamically
loaded component does not know that it was dynamically loaded in the
first place. Especially not within AIS, which makes backtracking
dynamic_load links virtually impossible without larger changes at the
moment.
Indexer backend proposals
First for some general considerations:
As indexing is quite time consuming, it is suggested to have
some kind of indexing daemon instead of a simple command line
interface in the long run. The independent indexer interfaces provided
should allow adding such a daemon at a later time without any changes
to the rest of the system (with the exception of the interface driver
itself).
In any case the runtime considerations are significant, as in a
really simple environment a new process will be spawned for each index
query; it will not be possible to do batch queries if a simple shell
script interface is used as outlined above. I do not think that this
will be a problem on smaller sites, but when you get down to the guts,
an indexing daemon will be a must. This will also make it easier to
access the index from other applications.
Which backend is used can ultimately only be decided by checking the features that are required on the specific site.
Plucene
Plucene supports the outlined features quite well as far as I
can tell from a first look. Of course, I have no idea about stability
and maturity of Plucene. As Perl is readily available everywhere, the
dependency problems should not be that hard (on the other hand, Perl
repeatedly had problems with its various versions that are out there).
Plucene has two major drawbacks. First, it does not support
range or wildcard searches (though a range search over a date field
seems possible). Second, its documentation is rather incomplete, though
the author says that the Lucene docs are valid for Plucene as well.
Well, I'm not happy with this fact nevertheless.
Installation difficulty is unknown; it may be as easy as pulling the module out of CPAN.
Lucene
This is the original Java implementation, which is, obviously, the
most powerful one. Also, its documentation seems to be quite extensive
and there are lots of articles covering Lucene on the Web. The fact
that Lucene is Java shouldn't be much of a problem nowadays, as most
distributions ship with Java on board.
As the Java-based approach has considerably larger startup times
than the Perl one, some kind of indexing daemon is far more important
here than it would be for the Perl solution.
Installation difficulty is unknown; when a working Java VM is available, it should be rather trivial though.
Note, that when using Lucene it is quite important to have a Search
Index Daemon available, as Java is a language that gains speed while it
is executing. Starting up a new Java VM on each and every search
request is not an option in this case. (Writing Daemons in Java isn't
that difficult, on the other hand.)
Recommendation
This depends on the actual requirements.
From a purely technological point of view, Lucene is almost always the
better solution. It is more powerful, the original implementation, well
documented and has a wide user base.
The disadvantage here is the increased requirements during
installation and setup (Java is some kind of a beast, always was).
Benchmarks vary widely, though Java will not have the dependency hell.
Lucene requires JDK 1.3.1 or later. If this is satisfied, the rest is a
breeze (due to Java's huge amount of standardization). This is
especially interesting for the interface stuff, which requires XML
processing and the like.
Bottom line: Both Plucene and Lucene have a large number of unknown
variables, and both have their own advantages and disadvantages. I
personally would prefer Lucene, but this is more a matter of personal
taste (I just find Java more suitable for server applications than Perl
(or PHP, for that matter)).
