Texis Search Help - Some Technical Details
The Morpheme Stripping Routine
Morpheme stripping is done by the Texis search engine as a preliminary
step before actually executing any search, using the content of the Prefix and
Suffix Lists. The user interface only uses this routine to get words from the
Equivalence File, and only does the suffix stripping part, using the content
of the Equiv-Suffix List.
1. When it is time to execute a search, the suffix and prefix lists as
   entered in the UI are each sorted by descending length and, within each
   length, ascending alphabetical order. Descending length matters because
   it lets suffixes and prefixes be stripped longest first. The alphabetical
   order serves only to make the sequence predictable.
2. Get the word to be checked from the search query line; e.g.,
   ``antidisestablishmentarianism''.
3. Check whether the word's length is greater than or equal to the minimum
   word length setting (default is 5). If so, carry on. If not, the word is
   not morpheme stripped; it is simply searched for as is; e.g., ``red''.
4. Check the word against the list of suffixes to see if there is a match,
   starting from the longest suffix on the list.
5. If a match is found, strip it from the word. Note: This is why ordering
   by size is so important: removing the longest suffix (or prefix) first
   ensures that multiple suffixes are not missed where one suffix is a
   subset of another.
6. Continue checking against the list for the next match. Repeat Steps 4-5
   until no more matches are found. In our example above, based on the
   default suffix list, we would be left with the following morpheme before
   prefix processing: ``antidisestablishmentarian''. Note the following
   things:
   - The suffix ``ism'' was on the list and was stripped.
   - None of ``an'', ``ian'', or ``arian'' was on the suffix list, so none
     of them was stripped.
   - The suffix ``ment'' is on the suffix list, but it was never at the end
     of the word at any point, and therefore was not removed.
   - If ``arian'' and ``ian'' were both on the suffix list, ``arian'' would
     be removed first. Removing ``ian'' first would leave ``ar'' at the end
     of the word, which could not be stripped.
7. For suffix checking only, remove any trailing vowels, or one of any pair
   of doubled trailing consonants. For example, ``strive'' is stripped down
   to ``striv'' so that it will not miss matches for ``striving'', etc.
   (trailing vowel). Likewise, ``travelling'' is first stripped to
   ``travell'', and the second `l' is then removed so that the word
   ``travel'' is not missed (trailing double consonant). This is not done
   for prefix checking.
8. Now repeat Steps 4-6 for prefix stripping against the prefix list. In
   our example, ``antidisestablishmentarian'' is stripped down to
   ``establishmentarian''. This is what is left, and it is what goes to the
   pattern matcher.
9. When something is found, the pattern matcher builds it back up again to
   make sure it truly matches what you were looking for. This prevents
   things like retrieving ``pressure'' when you were really looking for
   ``president'', ``restive'' when you were really looking for
   ``restaurant'', and other such oddities.
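
The stripping steps described above can be sketched roughly as follows. This
is only an illustration of the described behavior, not the Texis
implementation: the function names, the SUFFIXES and PREFIXES lists, the
choice of a, e, i, o, u as vowels, and the guard against stripping a whole
word are all assumptions made for the example. The final rebuild-and-verify
step done by the pattern matcher is not covered.

```python
VOWELS = set("aeiou")  # assumption: the source does not define "vowel"

# Hypothetical lists for illustration only; not the Texis defaults.
SUFFIXES = ["ing", "ism", "ment", "ness"]
PREFIXES = ["anti", "dis", "pre", "re", "un"]


def sort_affixes(affixes):
    """Sort longest first, ties broken alphabetically."""
    return sorted(affixes, key=lambda a: (-len(a), a))


def strip_affixes(word, affixes, from_end):
    """Strip the longest matching affix, then rescan until nothing matches."""
    changed = True
    while changed:
        changed = False
        for affix in affixes:  # expected to be sorted longest first
            hit = word.endswith(affix) if from_end else word.startswith(affix)
            # Assumption: never strip an affix that is the entire word.
            if hit and len(word) > len(affix):
                word = word[:-len(affix)] if from_end else word[len(affix):]
                changed = True
                break  # restart from the longest affix
    return word


def trim_tail(word):
    """Drop trailing vowels, or one of a doubled trailing consonant."""
    if not word:
        return word
    if word[-1] in VOWELS:
        while len(word) > 1 and word[-1] in VOWELS:
            word = word[:-1]
    elif len(word) > 1 and word[-1] == word[-2]:
        word = word[:-1]
    return word


def morpheme_strip(word, suffixes, prefixes, min_len=5):
    """Return the root that would be handed to the pattern matcher."""
    word = word.lower()
    if len(word) < min_len:
        return word  # too short: searched for as is
    word = strip_affixes(word, sort_affixes(suffixes), from_end=True)
    word = trim_tail(word)  # suffix pass only
    return strip_affixes(word, sort_affixes(prefixes), from_end=False)


if __name__ == "__main__":
    for query in ["antidisestablishmentarianism", "red", "strive", "travelling"]:
        print(query, "->", morpheme_strip(query, SUFFIXES, PREFIXES))
```

With these illustrative lists, ``antidisestablishmentarianism'' comes out as
``establishmentarian'', ``red'' is left untouched, ``strive'' becomes
``striv'', and ``travelling'' becomes ``travel'', matching the examples in
the steps.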
Searching for Approximations
In any search environment there is always a fine line between relevance and irrelevance. Any configuration aims to allow just enough abstraction to find what one is looking for, but not so much that unwanted hits become distracting. Speed is also an important consideration; one does not want to look for so many possibilities that the search is overly burdened and therefore too slow in response time.
If a spelling checker were run as part of every search, not only would general search time increase greatly, but a lot of what can be referred to as ``noise'' would reduce the accuracy, or relevancy, of the search results. The aim is to allow maximum user control and direction of the search. Since there is no requirement to conform to any spelling standard, the Texis search engine is able to accept completely unknown words and process them correctly. This includes slang, acronyms, code, or technical nomenclature. Even so, this does not deal with the issue of misspellings or typos.
The Texis search engine handles this problem through the use of the Approximate Pattern Matcher (XPM). XPM is intended for the case where you have not found what you believe you should have found, and are therefore willing to accept patterns that deviate from your entered pattern by a specified percentage. The percentage entered on the query line is the percentage of proximity to the entered pattern (rather than the percentage of deviation), so a higher value demands a closer match.
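
To make the idea of a proximity percentage concrete, here is a small sketch. It does not reproduce XPM: how XPM scores closeness is not described here, so the example substitutes the similarity ratio from Python's standard difflib module as a stand-in, and the function name approx_matches is hypothetical.

```python
from difflib import SequenceMatcher


def approx_matches(pattern, candidates, percent):
    """Keep candidates whose similarity to pattern is at least percent.

    Stand-in scoring only: difflib's ratio (0.0 to 1.0) is scaled to a
    percentage. This is not the scoring that XPM uses.
    """
    kept = []
    for word in candidates:
        score = 100.0 * SequenceMatcher(None, pattern, word).ratio()
        if score >= percent:
            kept.append(word)
    return kept


if __name__ == "__main__":
    words = ["recieve", "receive", "reception", "banana"]
    print(approx_matches("receive", words, 80))
    print(approx_matches("receive", words, 100))
```

Lowering the percentage admits looser matches such as the misspelling ``recieve''; at 100 only the exact pattern is accepted.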