Texis Search Help - Some Technical Details

The Morpheme Stripping Routine

Morpheme stripping is done by the Texis search engine as a preliminary step before actually executing any search, using the content of the Prefix and Suffix Lists. The user interface only uses this routine to get words from the Equivalence File, and only does the suffix stripping part, using the content of the Equiv-Suffix List.

  1. When it is time to execute a search, the suffix and prefix lists as entered in the UI are each sorted by descending size and, within each size, by ascending alphabetical order. Descending size matters because suffixes and prefixes must be stripped largest to smallest. The alphabetical ordering serves only to make the sequence predictable.

    Get the word to be checked from the search query line; e.g., ``antidisestablishmentarianism''.

  2. Check whether the word's length is greater than or equal to the minimum word length setting (default is 5). If so, carry on. If not, there is no need to morpheme strip the word; it is simply searched for as is; e.g., ``red''.

  3. Check the word against the list of suffixes to see if there is a match, starting from the largest suffix on the list.

  4. If a match is found, strip it from the word. Note: This is why ordering by size is so important: the largest suffixes (or prefixes) must be removed first, so as not to miss multiple suffixes where one suffix is contained in another.

  5. Continue checking against the list for the next match. Repeat steps 3-4 until no more matches are found. In the case of our example above, based on the default suffix list, we would be left with the following morpheme before prefix processing: ``antidisestablishmentarian''.

  6. For suffix checking only, remove any trailing vowels, or one of any pair of doubled trailing consonants. This handles words like ``strive'', which is correctly stripped down to ``striv'' so that it won't miss matches for ``striving'', etc. (trailing vowel). Likewise, ``travelling'' is first stripped to ``travell''; the second `l' must then be removed so that the word ``travel'' is not missed (trailing double consonant). Note: this is only done for suffix checking, not prefix checking.

  7. Now repeat Steps 3-5 for prefix stripping against the prefix list. In our example, ``antidisestablishmentarian'' would get stripped down to ``establishmentarian''. This is what you have left and is what goes to the pattern matcher.

  8. When something is found, the pattern matcher builds it back up again to make sure it is truly a match to what you were looking for. This prevents things like taking ``pressure'' when you were really looking for ``president'', ``restive'' when you were really looking for ``restaurant'', and other such oddities.
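
The stripping steps above can be sketched in a few lines of Python. This is only an illustration of the described procedure, not the Texis implementation: the function names, the short example suffix and prefix lists (which are not the Texis defaults), the choice of a, e, i, o, u as vowels, and the rule that a word is never stripped below the minimum word length are all assumptions made for the sketch. Step 8, where the pattern matcher rebuilds and verifies a hit, is not covered.

```python
VOWELS = "aeiou"  # assumption: these five letters count as vowels

# Hypothetical example lists, NOT the Texis default lists.
EXAMPLE_SUFFIXES = ["s", "ism", "ing", "es", "ed"]
EXAMPLE_PREFIXES = ["re", "anti", "un", "dis", "pre"]


def sort_affixes(affixes):
    """Step 1: longest first, then alphabetical for a predictable order."""
    return sorted({a for a in affixes if a}, key=lambda a: (-len(a), a))


def strip_repeatedly(word, ordered, min_len, from_end):
    """Steps 3-5: strip the longest matching affix until none matches."""
    while True:
        for affix in ordered:
            hit = word.endswith(affix) if from_end else word.startswith(affix)
            # Assumption: never shrink the word below the minimum length.
            if hit and len(word) - len(affix) >= min_len:
                word = word[:-len(affix)] if from_end else word[len(affix):]
                break  # restart from the largest affix
        else:
            return word  # a full pass found nothing left to strip


def tidy_suffix_stem(word):
    """Step 6: drop trailing vowels, then one of a doubled final consonant."""
    while len(word) > 1 and word[-1] in VOWELS:
        word = word[:-1]
    if len(word) > 1 and word[-1] == word[-2] and word[-1] not in VOWELS:
        word = word[:-1]
    return word


def morpheme_strip(word, suffixes, prefixes, min_len=5):
    """Reduce a lowercase query word to the morpheme that is searched for."""
    if len(word) < min_len:  # step 2: short words are searched as is
        return word
    word = strip_repeatedly(word, sort_affixes(suffixes), min_len, True)
    word = tidy_suffix_stem(word)
    # Step 7: the same matching loop, applied to prefixes.
    return strip_repeatedly(word, sort_affixes(prefixes), min_len, False)


if __name__ == "__main__":
    for query in ["antidisestablishmentarianism", "travelling", "strive", "red"]:
        print(query, "->", morpheme_strip(query, EXAMPLE_SUFFIXES, EXAMPLE_PREFIXES))
```

With these example lists, ``strive'' and ``striving'' both reduce to the same morpheme, which is the point of the trailing-vowel rule in step 6.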


Searching for Approximations

In any search environment there is always a fine line between relevance and irrelevance. Any configuration aims to allow just enough abstraction to find what one is looking for, but not so much that unwanted hits become distracting. Speed is also an important consideration; one does not want to look for so many possibilities that the search is overly burdened and therefore too slow in response time.

If a spelling checker were run on every search, not only would the general search time be greatly impeded, but a lot of what can be referred to as ``noise'' would reduce the accuracy, or relevance, of the search results. The aim is to allow maximum user control and direction of the search. Since there is no requirement to conform to any spelling standard, the Texis search engine is able to accept completely unknown words and process them correctly. This includes slang, acronyms, code, or technical nomenclature. Even so, this does not deal with the issue of misspellings or typos.

The Texis search engine handles this problem through the use of the Approximate Pattern Matcher (XPM). XPM is intended for cases where you have not found what you believe you should have found, and are therefore willing to accept patterns which deviate from your entered pattern by a specified percentage. The percentage entered on the query line is the percentage of proximity to the entered pattern (rather than the percentage of deviation).
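
To make the idea of a proximity percentage concrete, here is a minimal Python sketch. It does not reproduce XPM's scoring, which is not described here; it substitutes the similarity ratio from Python's standard `difflib.SequenceMatcher` as a stand-in measure. The function names and the 80 percent threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def proximity_percent(pattern, candidate):
    """Similarity of candidate to pattern, as a percentage (100 = identical).

    Stand-in measure: difflib's ratio, not the actual XPM score.
    """
    ratio = SequenceMatcher(None, pattern.lower(), candidate.lower()).ratio()
    return 100.0 * ratio


def approximate_matches(pattern, words, min_percent=80):
    """Return (word, percent) pairs at or above min_percent, best first."""
    scored = [(w, proximity_percent(pattern, w)) for w in words]
    hits = [(w, p) for w, p in scored if p >= min_percent]
    return sorted(hits, key=lambda hit: -hit[1])


if __name__ == "__main__":
    text_words = ["presidant", "pressure", "president"]
    for word, percent in approximate_matches("president", text_words):
        print(f"{word}: {percent:.0f}%")
```

Note that the threshold expresses closeness, as in the text above: raising it toward 100 accepts fewer, closer patterns, while lowering it admits more deviation.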

