Class midcom_services_indexer_document

Description

This class encapsulates a single indexer document. It is used for both indexing and retrieval.

A document consists of a number of fields. Each field has different properties when handled by the indexer (the exact behavior depends, as always, on the indexer backend in use). On retrieval, this field information is lost, and all fields are naturally of the same type. The core indexer backend supports these field types:

  • date is a date-wrapped field suitable for use with the Date Filter.
  • keyword is stored and indexed, but not tokenized.
  • unindexed is stored but neither indexed nor tokenized.
  • unstored is not stored, but indexed and tokenized.
  • text is stored, indexed and tokenized.
This class should not be instantiated directly. A new instance can be obtained using the midcom_service_indexer class.

A number of predefined fields are available using member fields. These fields are all meta-fields. See their individual documentation for details. All fields are mandatory unless mentioned otherwise explicitly and, as always, assumed to be in the local charset.

Remember that neither date nor unstored fields are available on retrieval. For the core fields, all timestamps are therefore stored twice: once as a searchable field and once as a readable timestamp.
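
The field type properties listed above can be summarized as a small lookup table. The following is only a sketch in plain PHP, not part of MidCOM: the array layout and the helper name sketch_retrievable_types are hypothetical, and the date type is left out of the table because this page does not state its stored/indexed/tokenized flags (it is only documented as unavailable on retrieval).

```php
<?php
// Properties of the core field types as listed in the class description.
// 'date' is omitted: its flags are not stated on this page, only that it
// cannot be read back on retrieval.
$field_types = array(
    'keyword'   => array('stored' => true,  'indexed' => true,  'tokenized' => false),
    'unindexed' => array('stored' => true,  'indexed' => false, 'tokenized' => false),
    'unstored'  => array('stored' => false, 'indexed' => true,  'tokenized' => true),
    'text'      => array('stored' => true,  'indexed' => true,  'tokenized' => true),
);

// Hypothetical helper: list the types whose content can be read back
// from a search result, i.e. the ones that are stored.
function sketch_retrievable_types(array $field_types)
{
    $result = array();
    foreach ($field_types as $type => $flags) {
        if ($flags['stored']) {
            $result[] = $type;
        }
    }
    return $result;
}

print_r(sketch_retrievable_types($field_types));
```

In this sketch, unstored is filtered out because its content never reaches the index in readable form, which matches the retrieval note above.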

The class automatically passes all data to the i18n charset conversion functions, so you can work in your site's charset as usual. UTF-8 conversion is done implicitly.

Located in /midcom/services/indexer/document.php (line 48)

Direct descendents
Class Description
 class midcom_services_indexer_document_attachment This is a class geared at indexing attachments. It requires you to "assign" the attachment to a topic, which is used as TOPIC_URL for permission purposes. In addition, you may set another MidgardObject as source object; its GUID is stored in the __SOURCE field of the index.
 class midcom_services_indexer_document_midcom This is a base class which is targeted at MidCOM content object indexing. It should be used whenever MidCOM documents are indexed, either directly or as a base class.
Variable Summary
 string $abstract
 string $author
 string $component
 string $content
 int $created
 MidgardPerson $creator
 string $document_url
 int $edited
 MidgardPerson $editor
 int $indexed
 string $RI
 double $score
 string $security
 string $source
 string $title
 GUID $topic_guid
 string $topic_url
 string $type
 Array $_fields
 midcom_service_i18n $_i18n
Method Summary
 midcom_services_indexer_document midcom_services_indexer_document ()
 void add_date (string $name, int $timestamp)
 void add_date_pair (string $name, int $timestamp)
 void add_keyword (string $name, string $content)
 void add_result (string $name, string $content)
 void add_text (string $name, string $content)
 void add_unindexed (string $name, string $content)
 void add_unstored (string $name, string $content)
 string datamanager_get_text_representation (mixed &$datamanager, string $name)
 void dump (string $message, [int $loglevel = MIDCOM_LOG_DEBUG])
 void fields_to_members ()
 mixed get_field (string $name)
 Array get_field_record (string $name)
 string html2text (string $text)
 bool is_a (string $document_type)
 Array list_fields ()
 void members_to_fields ()
 void remove_field (string $name)
 void _add_field (string $name, string $type, string $content, [bool $is_utf8 = false])
 void _set_type (string $type)
Variables
string $abstract = '' (line 190)

The abstract of the document

This is optional.

string $author = '' (line 199)

The author of the document

This is optional.

string $component = '' (line 107)

The name of the component responsible for the document. May be empty for non-midgard resources.

This field is mandatory.

string $content = '' (line 181)

The content of the document

This is mandatory.

This field is empty on documents retrieved from the index.

int $created = 0 (line 125)

The time of document creation as a UNIX timestamp.

This field is mandatory.

MidgardPerson $creator = null (line 152)

The MidgardPerson who created the object.

This is optional.

string $document_url = '' (line 116)

The fully qualified URL to the document. This should be a PermaLink.

This field is mandatory.

int $edited = 0 (line 134)

The time of the last document modification as a UNIX timestamp.

This field is mandatory.

MidgardPerson $editor = null (line 161)

The MidgardPerson who last modified the object.

This is optional.

int $indexed = 0 (line 143)

The timestamp of indexing.

This field is added automatically and is to be considered read-only.

string $RI = '' (line 87)

The Resource Identifier of this document. Must already be UTF-8 on assignment.

This field is mandatory.

double $score = 0.0 (line 75)

This is the score of this document. Only populated on resultset documents, of course.

string $security = 'default' (line 259)

Security mechanism used to determine the availability of a search result.

Can be one of:

  • 'default': Use only built-in processing (topic and metadata visibility checks). This is, as you might have guessed, the default.
  • 'component': Invoke the _on_check_document_visible component interface method of the component after doing default checks. This security class absolutely requires the document to contain a valid topic GUID, otherwise access control will fail anyway.
  • 'function:$function_name': Invoke the globally available function $function_name. Its signature is bool $function_name ($document, $topic). If you don't change the document during the check, you don't need to pass by-reference, so this is up to you. The topic passed is the Return true if the document is visible, false otherwise.
  • 'class:$class_name': Like above, but using a class instead. The class must provide a statically callable get_instance() method, which returns a usable instance of the class (mostly, this should be a singleton, for performance reasons). The instance returned is assigned by-reference. On that object, the method check_document_permissions is invoked; its signature must be identical to that of the function callback.
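
The two callback styles above could look roughly like the following sketch. It is written against stand-in objects in plain PHP and does not use MidCOM itself: the names sketch_check_visible and sketch_visibility_checker, the draft property, and the visibility rule are all hypothetical; only the signatures, get_instance() and check_document_permissions come from the description above.

```php
<?php
// Hypothetical function callback, as it might be referenced by
// $security = 'function:sketch_check_visible'.
// $document and $topic are stand-ins here, not real MidCOM objects.
function sketch_check_visible($document, $topic)
{
    // Illustrative rule only: hide documents flagged as drafts.
    return empty($document->draft);
}

// Hypothetical class callback, as it might be referenced by
// $security = 'class:sketch_visibility_checker'.
class sketch_visibility_checker
{
    private static $instance = null;

    // Statically callable accessor returning a shared (singleton) instance.
    public static function get_instance()
    {
        if (self::$instance === null) {
            self::$instance = new self();
        }
        return self::$instance;
    }

    // Same signature as the function callback.
    public function check_document_permissions($document, $topic)
    {
        return sketch_check_visible($document, $topic);
    }
}

$document = new stdClass();
$document->draft = true;
$topic = new stdClass();

$checker = sketch_visibility_checker::get_instance();
var_dump($checker->check_document_permissions($document, $topic)); // bool(false)
```

Returning the same instance from get_instance() on every call follows the performance advice given for the class variant.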

string $source = '' (line 210)

An additional tag indicating the source of the document for use by the component doing the indexing. This value is not indexed and should not be used by anybody except the component doing the indexing.

This is optional.

string $title = '' (line 170)

The title of the document

This is mandatory.

GUID $topic_guid = '' (line 97)

The GUID of the topic the document is assigned to. May be empty for non-midgard resources.

This field is mandatory.

string $topic_url = '' (line 225)

The full path to the topic that houses the document. For external resources, this should be either a MidCOM topic with which the resource is associated or some "directory" by which you could filter. You may also leave it empty, which prevents the document from appearing in any topic-specific search.

The value should be fully qualified, as returned by MIDCOM_NAV_FULLURL, including a trailing slash, e.g. https://host/path/to/topic/

This is optional.

string $type = '' (line 239)

The type of the document, set by subclasses and added to the index automatically.

The type *must* reflect the original type hierarchy. It is to be set using the $this->_set_type call after initializing the base class.

Array $_fields = array() (line 59)

An associative array containing all fields of the current document.

Each field is indexed by its name (a string). The value is another array containing the fields "name", "type" and "content".

  • access: private
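
As an illustration of the record layout described for $_fields, here is a minimal sketch using a plain array. The helper names sketch_add_field and sketch_get_field are hypothetical stand-ins and not the class's own methods; charset conversion is deliberately left out.

```php
<?php
// Hypothetical stand-in for the $_fields array; not the MidCOM class itself.
$fields = array();

// Store a record keyed by field name. Because the array is keyed by name,
// an existing field of the same name is overwritten.
function sketch_add_field(array &$fields, $name, $type, $content)
{
    $fields[$name] = array(
        'name'    => $name,
        'type'    => $type,
        'content' => $content,
    );
}

// Return the content of a field, or false if it does not exist.
function sketch_get_field(array $fields, $name)
{
    return isset($fields[$name]) ? $fields[$name]['content'] : false;
}

sketch_add_field($fields, 'title', 'text', 'Hello world');
sketch_add_field($fields, 'title', 'text', 'Replaced title');

echo sketch_get_field($fields, 'title'), "\n";
var_dump(sketch_get_field($fields, 'missing'));
```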
midcom_service_i18n $_i18n = null (line 67)

A reference to the i18n service, used for charset conversion.

  • access: protected
Methods
Constructor midcom_services_indexer_document (line 267)

Initialize the object, nothing fancy here.

midcom_services_indexer_document midcom_services_indexer_document ()
add_date (line 346)

Add a date field. A timestamp is expected, which is automatically converted to a suitable ISO timestamp before storage.

Direct specification of the ISO timestamp is not yet possible due to lacking validation outside the timestamp range.

If a field of the same name is already present, it is overwritten silently.

void add_date (string $name, int $timestamp)
  • string $name: The field's name.
  • int $timestamp: The timestamp to store.
add_date_pair (line 364)

This is a small helper which creates a normal date field and an unindexed _TS-postfixed timestamp field at the same time.

This is useful because date fields are not stored in a readable format; it cannot even be determined that they were a date in the first place. The _TS field is therefore quite useful if you need the original value of the timestamp.

void add_date_pair (string $name, int $timestamp)
  • string $name: The field's name, "_TS" is appended for the plain-timestamp field.
  • int $timestamp: The timestamp to store.
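
The pairing described for add_date_pair can be sketched with a plain array as follows. This is an illustration only: the function name sketch_add_date_pair is hypothetical, and the ISO format produced by gmdate() here is an assumption, since this page does not state the exact format the indexer uses.

```php
<?php
// Hypothetical sketch of the add_date_pair idea using a plain array.
// The ISO format below is an assumption made for illustration.
function sketch_add_date_pair(array &$fields, $name, $timestamp)
{
    // Searchable date field, converted from the UNIX timestamp.
    $fields[$name] = array(
        'name'    => $name,
        'type'    => 'date',
        'content' => gmdate('Y-m-d\TH:i:s\Z', $timestamp),
    );
    // Readable companion field holding the plain timestamp.
    $fields[$name . '_TS'] = array(
        'name'    => $name . '_TS',
        'type'    => 'unindexed',
        'content' => (string) $timestamp,
    );
}

$fields = array();
sketch_add_date_pair($fields, 'created', 0);
echo $fields['created']['content'], "\n";    // 1970-01-01T00:00:00Z
echo $fields['created_TS']['content'], "\n"; // 0
```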
add_keyword (line 376)

Add a keyword field.

void add_keyword (string $name, string $content)
  • string $name: The field's name.
  • string $content: The field's content.
add_result (line 422)

Add a search result field. This should normally not be done manually; the indexer calls this function when creating a document out of a search result.

void add_result (string $name, string $content)
  • string $name: The field's name.
  • string $content: The field's content, which is assumed to be UTF-8 already.
add_text (line 409)

Add a text field.

void add_text (string $name, string $content)
  • string $name: The field's name.
  • string $content: The field's content.
add_unindexed (line 387)

Add an unindexed field.

void add_unindexed (string $name, string $content)
  • string $name: The field's name.
  • string $content: The field's content.
add_unstored (line 398)

Add an unstored field.

void add_unstored (string $name, string $content)
  • string $name: The field's name.
  • string $content: The field's content.
datamanager_get_text_representation (line 635)

Returns a textual representation of the specified datamanager field.

Actual behavior depends on the datatype. Text fields are accessed directly; for other fields, the CSV representation is used.

Text fields run through the html2text converter of the document base class.

Attention: This function accesses originally private datamanager members. It is the only possible way to access the CSV interface of individual fields.

string datamanager_get_text_representation (mixed &$datamanager, string $name)
  • string $name: The name of the field that should be queried
dump (line 552)

Debugging helper which dumps the document's contents to the log file using the indicated log level. It checks the log level explicitly for performance reasons.

Note: print_r'ing the entire document might not be an option, as subclasses contain references to non-dumpable objects like the datamanager.

void dump (string $message, [int $loglevel = MIDCOM_LOG_DEBUG])
  • string $message: The log message for the dump
  • int $loglevel: The log level
fields_to_members (line 492)

This function should be called after retrieving a document from the index. It will populate all relevant members with the corresponding values.

void fields_to_members ()
get_field (line 279)

Returns the contents of the named field, or false on failure.

  • return: The content of the field or false on failure.
mixed get_field (string $name)
  • string $name: The name of the field.
get_field_record (line 301)

Returns the complete internal field record, including type and UTF-8 encoded content.

This should normally not be used from the outside, it is geared towards the indexer backends, which need the full field information on indexing.

  • return: The full content record.
Array get_field_record (string $name)
  • string $name: The name of the field.
html2text (line 600)

This is a small helper that converts HTML to plain text (relatively simple).

Basically, JavaScript blocks and HTML tags are stripped, and all HTML entities are converted to their native equivalents.

Tags are replaced with a space rather than an empty string, so that constructs like <li>torben</li><li>nehmer</li> are recognized correctly. While this might result in double spaces between words, this is better than losing the word boundaries entirely.

  • return: The converted text.
string html2text (string $text)
  • string $text: The HTML to convert to text
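
The conversion described for html2text can be approximated as shown below. This is a hedged sketch, not the class's actual implementation: the function name sketch_html2text, the regular expressions, and the final trim() are assumptions chosen to illustrate the three steps (strip JavaScript, replace tags with a space, decode entities).

```php
<?php
// Hypothetical sketch; the actual html2text implementation may differ.
function sketch_html2text($html)
{
    // Drop JavaScript blocks including their content.
    $text = preg_replace('/<script\b[^>]*>.*?<\/script>/is', ' ', $html);
    // Replace every remaining tag with a space to keep word boundaries.
    $text = preg_replace('/<[^>]*>/', ' ', $text);
    // Convert HTML entities to their native characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Assumption: trim surrounding whitespace left over from the tags.
    return trim($text);
}

echo sketch_html2text('<li>torben</li><li>nehmer</li>'), "\n";
```

With the list example from the description, the two words stay separated (by a double space in this sketch) instead of being merged into one token.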
is_a (line 667)

Checks whether the given document is an instance of the given document type.

This is equivalent to the is_a object hierarchy check, except that it works with MidCOM documents.

bool is_a (string $document_type)
  • string $document_type: The base type to search for.
list_fields (line 318)

Returns a list of all defined fields.

  • return: Fieldname list.
Array list_fields ()
members_to_fields (line 436)

This translates all member variables into appropriate field records. The backend should call this immediately before indexing.

This call will automatically populate indexed with time() and author with the name of the creator (if set).

void members_to_fields ()
remove_field (line 328)

Remove a field from the list. Nonexistent fields are ignored silently.

void remove_field (string $name)
  • string $name: The name of the field.
_add_field (line 531)

Internal helper which actually stores a field.

  • access: protected
void _add_field (string $name, string $type, string $content, [bool $is_utf8 = false])
  • string $name: The field's name.
  • string $type: The field's type.
  • string $content: The field's content.
  • bool $is_utf8: Set this to true explicitly, to override charset conversion and assume $content is UTF-8 already.
_set_type (line 680)

Sets the type of the object, reflecting the inheritance hierarchy.

void _set_type (string $type)
  • string $type: The name of this document type

Documentation generated on Mon, 21 Nov 2005 18:14:53 +0100 by phpDocumentor 1.3.0RC3