mRFC 0014: MidCOM Remote Indexing Service: XML communication protocol
- General Communication Protocol
- Requests
- Authentication Data
- Queries
- Adding and Updating elements in the Index
- Deleting Elements from the Index
- Dropping an entire Index for recreation
- Responses
- Using XMLComm over TCP
- TODO
- Download the DTDs
This communication protocol is used in conjunction with all remote
access operations regarding the MidCOM indexing service. It supports
all operations required to query and maintain the index.
Recommended Reading:
Main operational targets:
- Communication between the MidCOM Indexing Service and out-of-process indexer services (like (P)Lucene)
- Programmatic access to the MidCOM index from external applications
Author's Note: I am not yet
well versed in writing XML DTDs, so please forgive me if the
descriptions below are not state of the art. The full DTD will be made
available for download when it is finished.
Important Note: Due
to HTMLArea's inability to correctly handle preformatted text with <
and > entities in it, I have decided to replace all occurrences of
these two characters with [ and ]. This is unfortunately the only way
this text can be edited again without everything breaking.
General Communication Protocol
Communication is done in batches. A batch contains an unlimited
number of requests, which are processed in the order they are given. The
batch itself may contain authentication information (where required)
and must contain the name of the index.
Each request is identified by its own unique identifier within the
batch, so that the client can associate the responses accordingly. Not
every request produces a response.
If the server encounters a non-critical problem that prevents
execution of a request without endangering further requests (like
deleting a nonexistent item), it emits a warning but continues
processing.
Upon any critical error, processing stops immediately, and an error
response is put into the response batch, associated with the
corresponding request ID. No transactional processing is guaranteed, so
one has to assume that all requests up to but not including the failing
one were processed successfully.
The server returns a response which does not need to include all
items from the request (for example, the auth request will not produce
any message unless the credentials are wrong).
If the Server encounters any parsing errors, that is, a document not
matching the DTD, it will generate an error with the ID 0 and stop
processing immediately.
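The processing rules above can be illustrated with a small server-side sketch in Python. This is not the actual indexer code: the names `process_batch` and `RequestWarning`, and the idea of a single `handler` callable, are assumptions made only for this illustration.

```python
class RequestWarning(Exception):
    """Hypothetical: raised by a handler for a non-critical problem."""


def process_batch(requests, handler):
    """Process (request_id, payload) pairs in the order given.

    Returns a list of (kind, request_id, message) tuples, where kind is
    "result", "warning" or "error". Processing stops at the first error.
    """
    responses = []
    for request_id, payload in requests:
        try:
            result = handler(payload)
        except RequestWarning as warning:
            # Non-critical: report it and continue with the next request.
            responses.append(("warning", request_id, str(warning)))
            continue
        except Exception as error:
            # Critical: report it and abort; later requests are not run.
            responses.append(("error", request_id, str(error)))
            break
        if result is not None:
            # Requests that succeed silently add nothing to the response.
            responses.append(("result", request_id, result))
    return responses
```

Because nothing is rolled back when the loop breaks, a client has to treat every request before the failing ID as already applied.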
This yields the following general structure:
DTD and XML version
Both request and response transmissions must have the DTD
declaration in the header so that the server can validate the requests
for syntactic correctness. Expect that at least the server validates
documents before actually processing a request.
The communication protocol is XML 1.0.
Date/Time format
All date fields in the XML documents must follow the ISO-8601 standard with a precision of one second: YYYY-MM-DDThh:mm:ss. Example: 2005-02-01T08:56:20. No timezone processing is done on the server side at this time; timestamps are stored as-is.
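A client can produce and check this format with Python's standard `datetime` module. The helper names `format_timestamp` and `parse_timestamp` are made up for this sketch.

```python
from datetime import datetime

# strftime/strptime pattern for YYYY-MM-DDThh:mm:ss
PROTOCOL_DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"


def format_timestamp(moment):
    """Render a datetime in the protocol's second-precision format."""
    # The pattern has no offset directive, so any tzinfo is dropped,
    # matching the "stored as-is" behaviour described above.
    return moment.strftime(PROTOCOL_DATE_FORMAT)


def parse_timestamp(text):
    """Parse a protocol timestamp; raises ValueError if it is malformed."""
    return datetime.strptime(text, PROTOCOL_DATE_FORMAT)
```

Since the server does no timezone processing, a client would probably want to convert all values to one agreed timezone before formatting them.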
Example Request
[request index="name"]
[auth id="1" type="auth_type"][!-- Authentication Data --][/auth]
[query id="2"][!-- Query Data --][/query]
[delete id="3" documentid="id" /]
[index id="4"][/index]
[/request]
Example Response
[response]
[resultset id="2"][!-- Query Result --][/resultset]
[warning id="3"][!-- Warning Message --][/warning]
[error id="4"][!-- Error Message --][/error]
[/response]
Requests
A request must specify which index it is for. One or more of the following blocks may appear in it, in any order:
[!ELEMENT request (auth|query|index|delete|deleteall)+]
[!ATTLIST request index CDATA #REQUIRED]
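As a rough client-side sketch, a request batch can be assembled with Python's `xml.etree.ElementTree`. The function name `build_request` is hypothetical, and the sketch leaves out the DTD declaration that a real transmission has to carry. Note that the serialized output uses real angle brackets, not the [ and ] substitutes shown in this text.

```python
import xml.etree.ElementTree as ET


def build_request(index_name, children):
    """Wrap already-built child elements in a request batch."""
    if not children:
        # The content model (...)+ requires at least one child element.
        raise ValueError("a request batch needs at least one element")
    request = ET.Element("request", {"index": index_name})
    for child in children:
        # Children keep their order; the server processes them in sequence.
        request.append(child)
    return request


delete = ET.Element("delete", {"id": "3", "documentid": "id"})
batch = build_request("name", [delete])
xml_text = ET.tostring(batch, encoding="unicode")
```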
Authentication Data
An authentication block can occur at any point in the request
and sets the user credentials to those given in the data
block. The type attribute of the element names the type of
authentication to use. The authentication data is evaluated by
the server and not structured further:
[auth id="n" type="plain"]
username=user;password=pass
[/auth]
At this time, only plain-text authentication is specified; more types will be added on an as-needed basis. No further restrictions apply:
[!ELEMENT auth (#PCDATA)]
[!ATTLIST auth id CDATA #REQUIRED]
[!ATTLIST auth type (plain) #REQUIRED]
An auth element will either succeed silently, authenticating all following elements within this request, or produce an error.
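A minimal sketch of building the plain auth element in Python follows. The function name `build_auth` is hypothetical. Rejecting ';' and '=' in the values is an assumption of this sketch, made because the protocol text defines no escaping for the payload; a real server may be more lenient.

```python
import xml.etree.ElementTree as ET


def build_auth(request_id, username, password):
    """Build a plain-type auth element for the given credentials."""
    for value in (username, password):
        # Assumption: ';' and '=' delimit the pairs, so they are refused
        # here; the protocol text itself does not define any escaping.
        if ";" in value or "=" in value:
            raise ValueError("credentials must not contain ';' or '='")
    auth = ET.Element("auth", {"id": str(request_id), "type": "plain"})
    auth.text = "username={};password={}".format(username, password)
    return auth
```

Placing this element first in a batch authenticates every request that follows it.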
Queries
A query element searches the index for a given query string. Optionally, a
date filter for one of the date fields in the index may be specified:
[query id="n"]
[string]query keyword:something[/string]
[filter]
[datefilter field="created"]
[from]2004-05-01T12:00:00[/from]
[to]2004-05-31T12:00:00[/to]
[/datefilter]
[/filter]
[/query]
The filter element is optional. If present, it must contain exactly one
of the allowed elements; at this time, multiple filters are not
supported.
At least one of the from and to fields of the datefilter must be
specified. The omitted one is interpreted as an open bound in the
respective direction. Specifying an
empty datefilter is not valid, as this would be a null operation. Each
datefilter bound must be an ISO-compatible date/timestamp.
[!ELEMENT query (string,filter?)]
[!ATTLIST query id CDATA #REQUIRED]
[!ELEMENT string (#PCDATA)]
[!ELEMENT filter (datefilter)]
[!ELEMENT datefilter (from|to)+]
[!ATTLIST datefilter field CDATA #REQUIRED]
[!ELEMENT from (#PCDATA)]
[!ELEMENT to (#PCDATA)]
A query produces a resultset, which might be empty. A warning is issued if the query was invalid, or an error in case of anything critical.
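The query rules can be sketched as a small Python builder. The name `build_query` and its parameters are hypothetical; the bounds are expected to be timestamp strings that are already in the protocol's date format.

```python
import xml.etree.ElementTree as ET


def build_query(request_id, query_string, field=None, start=None, end=None):
    """Build a query element with an optional date filter.

    start and end are timestamp strings already in protocol format.
    """
    query = ET.Element("query", {"id": str(request_id)})
    ET.SubElement(query, "string").text = query_string
    if field is not None:
        if start is None and end is None:
            # An empty datefilter would be a null operation and is invalid.
            raise ValueError("datefilter needs a from or to bound")
        date_filter = ET.SubElement(
            ET.SubElement(query, "filter"), "datefilter", {"field": field}
        )
        if start is not None:
            ET.SubElement(date_filter, "from").text = start
        if end is not None:
            # Leaving out one bound keeps that direction open.
            ET.SubElement(date_filter, "to").text = end
    return query
```

Calling `build_query(2, "keyword:something", "created", start="2004-05-01T12:00:00")` yields a filter with only a lower bound.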
Adding and Updating elements in the Index
The Index element is used to add a new document to the Index. It has
to contain a unique resource identifier (used for deletion) and a document,
which consists of a set of fields that should be added to or updated in
the Index:
[index id="n"]
[document id="some_unique_document_identifier"]
[date name="created"]2005-01-27T15:50:27[/date]
[keyword name="guid"]13ab489798cef59867de856[/keyword]
[unindexed name="internal"]some internal stuff[/unindexed]
[unstored name="content"]The actual content...[/unstored]
[text name="abstract"]The abstract[/text]
[/document]
[/index]
In general, five types of fields are supported by the indexer; they follow the (P)Lucene field types:
- date is a date-wrapped field suitable for use with the Date Filter.
- keyword is stored and indexed, but not tokenized.
- unindexed is stored but neither indexed nor tokenized.
- unstored is not stored, but indexed and tokenized.
- text is stored, indexed and tokenized.
A special warning about keyword fields: Both Plucene and Lucene have serious trouble when querying these fields, making them mostly unusable. See this posting for details.
[!ELEMENT index (document)]
[!ATTLIST index id CDATA #REQUIRED]
[!ELEMENT document (date*,keyword*,unindexed*,unstored*,text*)]
[!ATTLIST document id CDATA #REQUIRED]
[!ELEMENT date (#PCDATA)]
[!ATTLIST date name CDATA #REQUIRED]
[!ELEMENT keyword (#PCDATA)]
[!ATTLIST keyword name CDATA #REQUIRED]
[!ELEMENT unindexed (#PCDATA)]
[!ATTLIST unindexed name CDATA #REQUIRED]
[!ELEMENT unstored (#PCDATA)]
[!ATTLIST unstored name CDATA #REQUIRED]
[!ELEMENT text (#PCDATA)]
[!ATTLIST text name CDATA #REQUIRED]
An index request will either succeed silently or throw an error
in case of anything critical. If a document with the same document id
already exists in the index, it has to be overwritten silently,
essentially having an implicit delete operation preceding the index
call.
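A sketch of an index request builder in Python follows. The name `build_index` and the tuple-based input are assumptions of this example. Because the DTD lists the field types as a fixed sequence, the sketch sorts the fields into that order before emitting them.

```python
import xml.etree.ElementTree as ET

# Order taken from the document content model in the DTD above.
FIELD_TYPES = ("date", "keyword", "unindexed", "unstored", "text")


def build_index(request_id, document_id, fields):
    """Build an index element from (field_type, name, value) tuples."""
    unknown = [kind for kind, _, _ in fields if kind not in FIELD_TYPES]
    if unknown:
        raise ValueError("unknown field types: {}".format(unknown))
    index = ET.Element("index", {"id": str(request_id)})
    document = ET.SubElement(index, "document", {"id": document_id})
    # Emit the fields grouped by type, in DTD order, whatever the input order.
    for field_type in FIELD_TYPES:
        for kind, name, value in fields:
            if kind == field_type:
                ET.SubElement(document, kind, {"name": name}).text = value
    return index
```

Sending the same document ID again replaces the stored document, so no separate delete request is needed for an update.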
Deleting Elements from the Index
This is actually rather trivial:
[delete id="n" documentid="some_unique_document_identifier" /]
The document ID must match the one given in the index call. Note
that if you have several documents in your index which all share the
same document ID, a delete request for one of them will delete all siblings as well.
[!ELEMENT delete EMPTY]
[!ATTLIST delete id CDATA #REQUIRED]
[!ATTLIST delete documentid CDATA #REQUIRED]
A delete request will either succeed silently or throw an error
in case of anything critical. If a document does not exist, this is not an error condition; it is silently ignored instead.
Dropping an entire Index for recreation
Sometimes you want to completely recreate an entire index, and deleting every single document from it is not really an option. The following command drops the current index and creates a new, empty one:
[deleteall id="n" /]
Be aware that it is the responsibility of the client to ensure proper protection against unauthorized calls of this function, unless the indexer service provides some means of authorization, in which case the auth method could be used.
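Both removal requests, delete from the previous section and deleteall, are empty elements that carry only attributes. A minimal Python sketch, with the hypothetical helper names `build_delete` and `build_deleteall`:

```python
import xml.etree.ElementTree as ET


def build_delete(request_id, document_id):
    """Build the empty delete element for one document ID."""
    attributes = {"id": str(request_id), "documentid": document_id}
    return ET.Element("delete", attributes)


def build_deleteall(request_id):
    """Build the empty deleteall element that drops the whole index."""
    return ET.Element("deleteall", {"id": str(request_id)})


# A full rebuild would start its batch with deleteall and then add
# one index element per document.
request = ET.Element("request", {"index": "name"})
request.append(build_deleteall(1))
```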
Responses
The complexity of the responses is significantly lower, naturally.
Note, that not all request elements will produce a corresponding
response.
[!ELEMENT response (resultset*,warning*,error?)]
Resultset
This is a list of documents that match a given query, ordered by
match quality, delivering the best match first. The document in the
response is a slightly altered document compared to the one of the
index request. Its main difference is that a score attribute is added
to the element itself and all fields are treated equally in the search
result (distinction is not possible anymore at this stage).
[resultset id="n"]
[document id="some_unique_document_identifier" score="12.7"]
[field name="guid"]13ab489798cef59867de856[/field]
[field name="internal"]some internal stuff[/field]
[field name="abstract"]The abstract[/field]
[/document]
[/resultset]
The score of a document is a percentage value indicating the
relevance of the document. This should, of course, never exceed 100%.
Lucene guarantees this by keeping scores between 0.0 and 1.0; Plucene,
unfortunately, requires some adjustment of the scores to achieve
this (Plucene scores may exceed 1.0).
[!ELEMENT resultset (document*)]
[!ATTLIST resultset id CDATA #REQUIRED]
[!ELEMENT document (field*)]
[!ATTLIST document id CDATA #REQUIRED]
[!ATTLIST document score CDATA #REQUIRED]
[!ELEMENT field (#PCDATA)]
[!ATTLIST field name CDATA #REQUIRED]
Note that Plucene-style date fields cannot be converted back into
their original state automatically (it is not known whether a plain-text
field contains a date). It is suggested to index them unstored
and to store a clear-text representation in an unindexed field.
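On the client side, a resultset can be turned into plain Python data roughly as follows. The names `parse_resultset` and `normalize_scores` are hypothetical. Dividing by the top score is only one possible way to keep Plucene-style scores at or below 1.0; the protocol text does not prescribe a method.

```python
import xml.etree.ElementTree as ET


def parse_resultset(resultset):
    """Turn a resultset element into a list of plain dictionaries."""
    hits = []
    for document in resultset.findall("document"):
        hits.append({
            "id": document.get("id"),
            "score": float(document.get("score")),
            # All field types arrive as generic field elements.
            "fields": {
                field.get("name"): field.text
                for field in document.findall("field")
            },
        })
    return hits


def normalize_scores(hits):
    """Assumption: rescale in place so the best hit never exceeds 1.0."""
    top = max((hit["score"] for hit in hits), default=0.0)
    if top > 1.0:
        for hit in hits:
            hit["score"] = hit["score"] / top
    return hits
```

The hits keep the order delivered by the server, which is best match first.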
Warnings
Warnings are non-critical messages to the client. They
indicate a problem with the request identified by the ID, but one that was not
critical enough to justify aborting the current batch of requests.
[warning id="n"]Query invalid[/warning]
[!ELEMENT warning (#PCDATA)]
[!ATTLIST warning id CDATA #REQUIRED]
Errors
An error is an unrecoverable situation which aborts the processing of a batch immediately.
It is possible to receive errors for nonexistent request IDs, for example when parsing or I/O errors occur on the server side.
[error id="n"]Username or Password invalid[/error]
[!ELEMENT error (#PCDATA)]
[!ATTLIST error id CDATA #REQUIRED]
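A client could separate warnings from the single possible error along these lines. `BatchError` and `check_response` are hypothetical names used only in this sketch.

```python
import xml.etree.ElementTree as ET


class BatchError(Exception):
    """Hypothetical: raised when the response contains an error element."""

    def __init__(self, request_id, message):
        super().__init__("request {}: {}".format(request_id, message))
        self.request_id = request_id


def check_response(response):
    """Return {request_id: warning text}; raise BatchError on an error."""
    warnings = {w.get("id"): w.text for w in response.findall("warning")}
    error = response.find("error")
    if error is not None:
        # ID 0, or any ID the client never sent, points at a parse or
        # I/O failure rather than at one specific request.
        raise BatchError(error.get("id"), error.text)
    return warnings
```

Because an error aborts the batch, the caller should treat every request after the reported ID as not executed.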
Using XMLComm over TCP
A single but inconvenient side effect arises when using this protocol over TCP. PHP does not allow you to close a TCP socket connection in one direction only. In particular, this means that you cannot close the connection to the server after sending the request. Detecting the end of the XML document is not supported by the DOM parsers I have used so far (SAX could do it). Since I did not want to rewrite the entire XMLComm driver I have written in Java and Perl, I decided to use a simple workaround.
The System detects the end of the input stream by a line consisting exactly of "". In case you have lines like this in your content (CDATA sections do allow this), you have to mask them as "_".
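The end-of-input detection can be sketched in Python as a line-based reader. The actual terminator line is not legible in this text, so the sketch takes it as a parameter; the value "END" in the usage lines is purely a placeholder and not the protocol's real marker. The function name `read_message` is hypothetical as well.

```python
import io


def read_message(stream, terminator):
    """Collect lines from stream until a line equal to terminator."""
    lines = []
    for line in stream:
        # Compare without the line ending so "\n" and "\r\n" both match.
        if line.rstrip("\r\n") == terminator:
            return "".join(lines)
        lines.append(line)
    raise EOFError("stream ended before the terminator line")


# Placeholder terminator, for illustration only.
stream = io.StringIO("[request]\n[/request]\nEND\nleft unread\n")
message = read_message(stream, "END")
```

Anything after the terminator line stays in the stream, so the same connection can carry the response or a further batch.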
TODO
- Maintain operation (index optimization) => currently implicit with write_close.
- Delete type (delete all records with a given keyword field, perhaps __TYPE?).
- Must-Have fields:
  - __RI: Resource Identifier, generated automatically using the document ID, keyword field.
  - __TYPE: Document Type (used for delete type mainly), defaults to "default", keyword field.
  - content: Default search field, should be unstored, but may be text.
Download the DTDs
The current DTDs can be downloaded from the CVS at Tigris:
