|
Stemmers are used in various information retrieval, indexing, and web scraping applications. For example, a search engine for a documentation set might want to collapse the words "compile", "compiler", "compiling", and "compilation" to the same root "compil" and index them all under it. Assuming uses of these words are related to the same concept, a search for any of them (in practice, a search for their common stem) would find references to all of them.
Stemmers do sometimes collapse unrelated words into the same root. This causes false positives in searches. Stemmers work best where false positives do not greatly harm usability of the results.
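The indexing idea and the false-positive problem can both be seen in a small Python sketch. Note the assumptions: `toy_stem`, `TOY_SUFFIXES`, `build_index`, `search`, and the sample documents are all hypothetical names and data invented for illustration, and the suffix list is a deliberately crude stand-in, not the Porter rule set.

```python
from collections import defaultdict

# Hypothetical toy suffix list, longest first. This is NOT the Porter rule set.
TOY_SUFFIXES = ("ation", "ing", "er", "e")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index


def search(index, query):
    """Look up the stem of the query word rather than the word itself."""
    return sorted(index.get(toy_stem(query), set()))


# Made-up sample documents.
docs = {
    "a": "how to compile a file",
    "b": "compiler options",
    "c": "compiling large systems",
    "d": "incremental compilation",
    "e": "a number of cases",
    "f": "numbing cold",
}
index = build_index(docs)

print(search(index, "compiling"))  # ['a', 'b', 'c', 'd']
print(search(index, "number"))     # ['e', 'f']  ("numbing" is a false positive)
```

The first query finds all four "compil" documents, as intended. The second shows the cost: the toy rules reduce both "number" and "numbing" to "numb", so a search for one returns the other.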
Stemming a language like English, whose morphology is very irregular, is particularly difficult: the algorithm must handle many special cases without collapsing too many unrelated words into the same stem. The stemming algorithm is a marvelous collection of transformational twists and tricks that must be executed in exactly the right way and in exactly the right order.
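To see why order matters, consider only the first plural-handling step of Porter's published algorithm (Step 1a), which rewrites SSES to SS, IES to I, SS to SS, and S to nothing. The Python sketch below covers that one step alone, not the whole stemmer. The names `STEP_1A_RULES`, `WRONG_ORDER`, and `apply_rules` are hypothetical, and taking the first match from a longest-first list is an assumed simplification of Porter's "longest matching suffix wins" convention.

```python
# Step 1a suffix rules, listed longest first.
STEP_1A_RULES = (
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
)

# The same rules tried in the opposite order, for comparison.
WRONG_ORDER = tuple(reversed(STEP_1A_RULES))


def apply_rules(word, rules):
    """Apply the first rule whose suffix matches; later rules are not tried."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", apply_rules(w, STEP_1A_RULES), "/", apply_rules(w, WRONG_ORDER))
```

In the intended order, "caresses" and "caress" both become "caress". With the general S rule tried first, they become "caresse" and "cares" and no longer share a stem.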
Martin Porter devised this particular stemmer in 1980, and he and others have since created versions of it in many programming languages. Steve Haflich coded the Common Lisp version a few years ago and offered it back to Porter, who maintains an official site for the stemmer. Good descriptions of the algorithm, as well as a test suite, are available on Porter's site.
Don't expect to understand the stemming algorithm without some serious study, but it is quite fast and works really well. The Lisp implementation passes Porter's test suite and was tested by creating a useful word-root index of the entire Allegro CL documentation set.
references:
Martin Porter maintains a web page for the Porter Stemming Algorithm with sources, documentation, and links to related resources.
home url:
http://www.tartarus.org/~martin/PorterStemmer
books:
None.
|