mRFC 0014: MidCOM Remote Indexing Service: XML communication protocol

  1. Dropping an entire Index for recreation
  2. Using XMLComm over TCP
  3. TODO
  4. Download the DTDs

This communication protocol is used in conjunction with all remote access operations regarding the MidCOM indexing service. It supports all operations required to query and maintain the index.

Recommended Reading:

Main operational targets:

  • Communication between the MidCOM Indexing Service and out-of-process indexer services (like (P)Lucene)
  • Programmatic access to the MidCOM index from external applications

Author's Note: I am not yet very experienced at writing XML DTDs, so please forgive me if the descriptions below are not state of the art. The full DTD will be made available for download when it is finished.

Important Note: Because HTMLArea cannot correctly handle preformatted text containing < and > entities, I have replaced all occurrences of these two characters with [ and ]. Unfortunately, this is the only way this text can be edited again without everything breaking.
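For readers who want to feed the examples below to an XML parser, the substitution can be undone mechanically. The following is a minimal Python sketch; the function name is made up for illustration, and it assumes that the text contains no literal [ or ] characters of its own.

```python
def brackets_to_xml(text: str) -> str:
    """Turn the [ ... ] notation used in this document back into < ... > markup.

    Naive character replacement: any literal bracket in element content
    or attribute values would be converted as well.
    """
    return text.replace("[", "<").replace("]", ">")


example = '[delete id="3" documentid="id" /]'
print(brackets_to_xml(example))
```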

General Communication Protocol

Communication is done in batches. A batch contains an unlimited number of requests, which are processed in the order they are given. The batch itself may contain authentication information (where required) and must contain the name of the index.

Each request carries an identifier that is unique within its batch, so that the client can associate the responses accordingly. Not every request produces a response.

If the server encounters a non-critical problem that prevents execution of a request without endangering further requests (like deleting a nonexistent item), it emits a Warning but continues processing.

Upon any critical error, processing stops immediately and an Error response is put into the Response Batch, associated with the corresponding Request ID. No transactional processing is guaranteed, so one is to assume that all requests up to but not including the failing one were processed successfully.

The server returns a response which does not need to include all items from the request (for example, the auth request will not produce any message unless the credentials are wrong).

If the Server encounters any parsing errors, that is, a document not matching the DTD, it will generate an error with the ID 0 and stop processing immediately.

This yields the following general structure:

DTD and XML version

Both request and response transmissions must carry the DTD declaration in the header so that the Server can validate the requests for syntactic correctness. At a minimum, the server should be expected to validate documents before actually processing the request.

The communication protocol is XML 1.0.

Date/Time format

All date fields in the XML documents must follow the ISO-8601 standard with a precision of one second: YYYY-MM-DDThh:mm:ss. Example: 2005-02-01T08:56:20. No timezone processing is done on the server side at this time; timestamps are stored as-is.
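As an illustration, a client could produce and read timestamps in this format as follows. This is a small Python sketch; the constant and function names are invented here and are not part of the protocol.

```python
from datetime import datetime

# strftime/strptime pattern for YYYY-MM-DDThh:mm:ss (one-second precision)
PROTOCOL_FORMAT = "%Y-%m-%dT%H:%M:%S"


def format_timestamp(value: datetime) -> str:
    """Render a datetime in the protocol format; sub-second parts are dropped."""
    return value.strftime(PROTOCOL_FORMAT)


def parse_timestamp(text: str) -> datetime:
    """Parse a protocol timestamp; raises ValueError if the format does not match."""
    return datetime.strptime(text, PROTOCOL_FORMAT)


print(format_timestamp(datetime(2005, 2, 1, 8, 56, 20)))
```

Since the server does no timezone processing, the sketch does not attach or convert timezones either.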

Example Request

[request index="name"]
[auth id="1" type="auth_type"][!-- Authentication Data --][/auth]
[query id="2"][!-- Query Data --][/query]
[delete id="3" documentid="id" /]
[index id="4"][/index]
[/request]
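A client might assemble such a batch programmatically. The following Python sketch uses the standard xml.etree.ElementTree module and real angle brackets; the RequestBatch class is a hypothetical helper, and the sketch leaves out the DTD declaration because the location of the DTD is not given here.

```python
import xml.etree.ElementTree as ET


class RequestBatch:
    """Collects requests for one index and numbers them sequentially."""

    def __init__(self, index_name):
        self.root = ET.Element("request", {"index": index_name})
        self._next_id = 1

    def add(self, tag, **attributes):
        """Append a request element with the next free ID and return it."""
        element = ET.SubElement(
            self.root, tag, {"id": str(self._next_id), **attributes}
        )
        self._next_id += 1
        return element

    def to_xml(self):
        """Serialize the batch (without XML or DOCTYPE declaration)."""
        return ET.tostring(self.root, encoding="unicode")


batch = RequestBatch("name")
batch.add("auth", type="plain").text = "username=user;password=pass"
batch.add("delete", documentid="id")
print(batch.to_xml())
```

Numbering the requests in the order they are added keeps the IDs unique within the batch, which is all the protocol asks for.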

Example Response

[response]
[resultset id="2"][!-- Query Result --][/resultset]
[warning id="3"][!-- Warning Message --][/warning]
[error id="4"][!-- Error Message --][/error]
[/response]
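On the client side, the response items can be matched to the original requests through their IDs. A minimal Python sketch of that lookup follows; the function name is an assumption, and the sample is written with real angle brackets rather than the [ ] notation.

```python
import xml.etree.ElementTree as ET


def group_by_request_id(response_xml):
    """Map each request ID found in the response to its response element."""
    root = ET.fromstring(response_xml)
    return {child.get("id"): child for child in root}


sample = """<response>
<resultset id="2"></resultset>
<warning id="3">Warning Message</warning>
<error id="4">Error Message</error>
</response>"""

answers = group_by_request_id(sample)
for request_id in ("1", "2", "3", "4"):
    element = answers.get(request_id)
    # Requests that succeeded silently (like auth) have no entry at all.
    print(request_id, element.tag if element is not None else "no response")
```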

Requests

A request must specify which index it is for. Any number of the following blocks may appear in it in any order:

[!ELEMENT request (auth|query|index|delete|deleteall)+]
[!ATTLIST request index CDATA #REQUIRED]

Authentication Data

An authentication block can occur at any point in a request and sets the user credentials to those given in the data block. The type attribute of the element names the type of authentication used. The authentication data is evaluated by the server and is not structured further:

[auth id="n" type="plain"]
username=user;password=pass
[/auth]

At this time, only plain-text authentication is specified; more types will be added on an as-needed basis, so no further restrictions apply:

[!ELEMENT auth (#PCDATA)]
[!ATTLIST auth id CDATA #REQUIRED]
[!ATTLIST auth type (plain) #REQUIRED]

An auth element will either succeed silently, authenticating all following elements within this request, or produce an error.
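The payload of a plain auth element, as used in the example above, could be built like this. It is only a sketch in Python: the function name is hypothetical, and rejecting semicolons is an assumption made here because no escaping rule is defined for the data block.

```python
def plain_auth_payload(username: str, password: str) -> str:
    """Build the text content for an auth element of type "plain"."""
    for value in (username, password):
        # Assumption: ";" separates the pairs and cannot be escaped.
        if ";" in value:
            raise ValueError("';' cannot be represented in plain auth data")
    return f"username={username};password={password}"


print(plain_auth_payload("user", "pass"))
```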

Queries

This queries the database for a given query string. Optionally, a date filter for one of the date fields in the index may be specified:

[query id="n"]
[string]query keyword:something[/string]
[filter]
[datefilter field="created"]
[from]2004-05-01T12:00:00[/from]
[to]2004-05-31T12:00:00[/to]
[/datefilter]
[/filter]
[/query]

The filter element is optional; if present, it must contain exactly one of the allowed elements. At this time, multiple filters are not supported.

At least one of the from and to fields of the datefilter must be specified. The omitted one is interpreted as an open bound in the respective direction. Specifying an empty datefilter is not valid, as this would be a null operation. The from and to values must be ISO-8601 timestamps in the format described above.

[!ELEMENT query (string,filter?)]
[!ATTLIST query id CDATA #REQUIRED]
[!ELEMENT string (#PCDATA)]
[!ELEMENT filter (datefilter)]
[!ELEMENT datefilter (from|to)+]
[!ATTLIST datefilter field CDATA #REQUIRED]
[!ELEMENT from (#PCDATA)]
[!ELEMENT to (#PCDATA)]

A query will produce a resultset, which might be empty. A warning is issued if the query was invalid, and an error in case of anything critical.
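The rules for the optional date filter can be enforced on the client before a query is sent. The Python sketch below builds a query element with xml.etree.ElementTree; build_query and its parameter names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_query(request_id, query_string, date_field=None,
                date_from=None, date_to=None):
    """Build a query element, optionally with a single datefilter."""
    query = ET.Element("query", {"id": str(request_id)})
    ET.SubElement(query, "string").text = query_string
    if date_field is not None:
        # An empty datefilter is not valid: at least one bound is required.
        if date_from is None and date_to is None:
            raise ValueError("a datefilter needs at least one of from/to")
        filter_element = ET.SubElement(query, "filter")
        date_filter = ET.SubElement(
            filter_element, "datefilter", {"field": date_field}
        )
        if date_from is not None:
            ET.SubElement(date_filter, "from").text = date_from
        if date_to is not None:
            ET.SubElement(date_filter, "to").text = date_to
    return query


# Open upper bound: everything created from 2004-05-01 12:00:00 onwards.
query = build_query(2, "keyword:something", "created",
                    date_from="2004-05-01T12:00:00")
print(ET.tostring(query, encoding="unicode"))
```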

Adding and Updating elements in the Index

The Index element is used to add a new document to the Index. It has to contain a unique resource identifier (used for deletion) and a document, which consists of a set of fields that should be added to or updated in the Index:

[index id="n"]
[document id="some_unique_document_identifier"]
[date name="created"]2005-01-27T15:50:27[/date]
[keyword name="guid"]13ab489798cef59867de856[/keyword]
[unindexed name="internal"]some internal stuff[/unindexed]
[unstored name="content"]The actual content...[/unstored]
[text name="abstract"]The abstract[/text]
[/document]
[/index]

In general, five types of fields will be supported by the indexer; I follow (P)Lucene here:

  • date is a date-wrapped field suitable for use with the Date Filter.
  • keyword is stored and indexed, but not tokenized.
  • unindexed is stored but neither indexed nor tokenized.
  • unstored is not stored, but indexed and tokenized.
  • text is stored, indexed and tokenized.

A special warning about keyword fields: both Plucene and Lucene have serious trouble when querying these fields, making them mostly unusable. See this posting for details.

[!ELEMENT index (document)]
[!ATTLIST index id CDATA #REQUIRED]
[!ELEMENT document (date*,keyword*,unindexed*,unstored*,text*)]
[!ATTLIST document id CDATA #REQUIRED]
[!ELEMENT date (#PCDATA)]
[!ATTLIST date name CDATA #REQUIRED]
[!ELEMENT keyword (#PCDATA)]
[!ATTLIST keyword name CDATA #REQUIRED]
[!ELEMENT unindexed (#PCDATA)]
[!ATTLIST unindexed name CDATA #REQUIRED]
[!ELEMENT unstored (#PCDATA)]
[!ATTLIST unstored name CDATA #REQUIRED]
[!ELEMENT text (#PCDATA)]
[!ATTLIST text name CDATA #REQUIRED]

An index request will either succeed silently or throw an error in case of anything critical. If a document with the same document id already exists in the index, it has to be overwritten silently, essentially having an implicit delete operation preceding the index call.
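One detail of the DTD above is easy to miss: the content model of document is a sequence, so all date fields must come before all keyword fields, and so on. The Python sketch below sorts the fields accordingly while building an index request; the function name and the tuple layout of its input are assumptions.

```python
import xml.etree.ElementTree as ET

# Order of field types as required by the document content model.
FIELD_TYPES = ("date", "keyword", "unindexed", "unstored", "text")


def build_index_request(request_id, document_id, fields):
    """Build an index element from (field_type, name, value) tuples.

    An unknown field type raises ValueError (from tuple.index).
    """
    index = ET.Element("index", {"id": str(request_id)})
    document = ET.SubElement(index, "document", {"id": document_id})
    ordered = sorted(fields, key=lambda field: FIELD_TYPES.index(field[0]))
    for field_type, name, value in ordered:
        ET.SubElement(document, field_type, {"name": name}).text = value
    return index


request = build_index_request(4, "some_unique_document_identifier", [
    ("text", "abstract", "The abstract"),
    ("date", "created", "2005-01-27T15:50:27"),
    ("keyword", "guid", "13ab489798cef59867de856"),
])
print(ET.tostring(request, encoding="unicode"))
```

Python's sort is stable, so fields of the same type keep the order in which they were passed in.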

Deleting Elements from the Index

This is actually rather trivial:

[delete id="n" documentid="some_unique_document_identifier" /]

The document ID must match the one given in the index call. Note that if you have several documents in your index which all share the same document ID, a delete request for one of them will delete all siblings as well.

[!ELEMENT delete EMPTY]
[!ATTLIST delete id CDATA #REQUIRED]
[!ATTLIST delete documentid CDATA #REQUIRED]

A delete request will either succeed silently or throw an error in case of anything critical. If the document does not exist, this is not an error condition; it is silently ignored instead.

Dropping an entire Index for recreation

Sometimes you want to completely recreate an entire index, and deleting every single document from it is not really an option. The following command will drop the current index and create a new, empty one:

[deleteall id="n" /]
 

Be aware that it is the responsibility of the client to ensure proper protection against unauthorized calls of this function, unless the indexer service provides some means of authorization, in which case the auth method could be used.

Responses

Naturally, the responses are significantly less complex. Note that not all request elements will produce a corresponding response.

[!ELEMENT response (resultset*,warning*,error?)]

Resultset

This is a list of documents that match a given query, ordered by match quality, delivering the best match first. The document in the response is a slightly altered document compared to the one of the index request. Its main difference is that a score attribute is added to the element itself and all fields are treated equally in the search result (distinction is not possible anymore at this stage).

[resultset id="n"]
[document id="some_unique_document_identifier" score="12.7"]
[field name="guid"]13ab489798cef59867de856[/field]
[field name="internal"]some internal stuff[/field]
[field name="abstract"]The abstract[/field]
[/document]
[/resultset]

The score of a document is a percentage value indicating the relevance of the document. This should, of course, never exceed 100%. Lucene guarantees this by keeping scores between 0.0 and 1.0; Plucene, unfortunately, will require some messing around with the scores to achieve this (Plucene scores may exceed 1.0).
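How the Plucene scores are brought into range is not specified here. One possible approach, shown purely as an assumption in the Python sketch below, is to divide all scores of a resultset by the highest one whenever that value exceeds 1.0.

```python
def normalize_scores(scores):
    """Rescale scores into the range 0.0 to 1.0, assuming non-negative input.

    Scores already within range are returned unchanged; otherwise every
    score is divided by the largest one, so the best match becomes 1.0.
    """
    if not scores:
        return []
    top = max(scores)
    if top <= 1.0:
        return list(scores)
    return [score / top for score in scores]


print(normalize_scores([4.0, 2.0, 1.0]))
```

The relative ranking is preserved, but the rescaled values are only comparable within one resultset, not across queries.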

[!ELEMENT resultset (document*)]
[!ATTLIST resultset id CDATA #REQUIRED]
[!ELEMENT document (field*)]
[!ATTLIST document id CDATA #REQUIRED]
[!ATTLIST document score CDATA #REQUIRED]
[!ELEMENT field (#PCDATA)]
[!ATTLIST field name CDATA #REQUIRED]

Note that Plucene-style date fields cannot be reverted into their original state automatically (it is not known whether a plain-text field contains a date). It is suggested to index them unstored and to store a clear-text representation unindexed.

Warnings

Warnings carry non-critical information for the client. They indicate a problem with the request referred to by their ID, but one that was not critical enough to justify aborting the current batch of requests.

[warning id="n"]Query invalid[/warning]

[!ELEMENT warning (#PCDATA)]
[!ATTLIST warning id CDATA #REQUIRED]

Errors

An error is an unrecoverable situation which aborts the processing of a batch immediately.

It is possible to receive errors for nonexistent request IDs, for example when parsing or IO errors occur on the server side.

[error id="n"]Username or Password invalid[/error]

[!ELEMENT error (#PCDATA)]
[!ATTLIST error id CDATA #REQUIRED]
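A client will typically want to treat warnings and errors differently: collect the former, abort on the latter. The following Python sketch shows one way to do this; BatchError and check_response are hypothetical names, and the sample uses real angle brackets.

```python
import xml.etree.ElementTree as ET


class BatchError(Exception):
    """Raised when the response batch contains an error element."""

    def __init__(self, request_id, message, warnings):
        super().__init__(f"request {request_id}: {message}")
        self.request_id = request_id
        self.message = message
        self.warnings = warnings  # warnings issued before the error


def check_response(response_xml):
    """Return all warnings as (id, text) pairs; raise BatchError on an error."""
    root = ET.fromstring(response_xml)
    warnings = [(w.get("id"), w.text or "") for w in root.findall("warning")]
    error = root.find("error")
    if error is not None:
        # The ID may not match any request, e.g. "0" for parsing errors.
        raise BatchError(error.get("id"), error.text or "", warnings)
    return warnings


try:
    check_response('<response><error id="0">Parse error</error></response>')
except BatchError as exc:
    print(exc.request_id, exc.message)
```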

Using XMLComm over TCP

One inconvenient side effect arises when using this protocol over TCP: PHP does not allow you to close a TCP socket connection in one direction only. In particular, this means that you cannot close the connection to the server after sending the request. Detecting the end of the XML document is not supported by the DOM parsers I have used so far (SAX could do it). Since I did not want to rewrite the entire XML Comm driver I have written in Java and Perl, I decided to use a simple workaround.

The System detects the end of the input stream by a line consisting exactly of "". In case you have lines like this in your content (CDATA sections do allow this), you have to mask them as "_".

TODO

  • Maintain operation (index optimization) => currently implicit with write_close.
  • Delete type (delete all records with a given keyword field, perhaps __TYPE?).
  • Must-Have fields
    • __RI: Resource Identifier, generated automatically using the document ID, keyword field.
    • __TYPE: Document Type (used for delete type mainly), defaults to "default", keyword field.
    • content: Default search field, should be unstored, but may be text.

Download the DTDs

The current DTDs can be downloaded from the CVS at Tigris:

 
