IT:AD:Lucene.NET:HowTo:Create Analysers

Summary

In Lucene.NET there are the following analyzers to choose from:

  • StandardAnalyzer
    • The tokenizer uses spaces/punctuation as split points.
    • Pipeline: StandardTokenizer, then StandardFilter, LowerCaseFilter, and StopFilter (using English stop words).
  • StopAnalyzer
  • SimpleAnalyzer
  • WhitespaceAnalyzer
  • KeywordAnalyzer
    • “Tokenizes” the entire stream as a single token. Useful for zip codes, IDs, product codes, etc.
  • PerFieldAnalyzerWrapper
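As a rough illustration of how these analyzers differ, the following Python sketch simulates the *behaviour* of three of them on the same input (this is not the Lucene.NET API, just an approximation of what each produces):

```python
import re

def whitespace_style(text):
    # WhitespaceAnalyzer-style: split on whitespace only;
    # case and punctuation are kept.
    return text.split()

def simple_style(text):
    # SimpleAnalyzer-style: split on anything that is not a letter,
    # then lowercase each token.
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

def keyword_style(text):
    # KeywordAnalyzer-style: the entire input becomes a single token.
    return [text]

phrase = "Order #42, SHIPPED!"
print(whitespace_style(phrase))  # ['Order', '#42,', 'SHIPPED!']
print(simple_style(phrase))      # ['order', 'shipped']
print(keyword_style(phrase))     # ['Order #42, SHIPPED!']
```

Note how SimpleAnalyzer-style processing loses the "42" entirely, while KeywordAnalyzer-style processing keeps the field intact as one searchable unit.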

…but which one?

  • In general, an Analyzer is a tokenizer + stemmer + stop-words filter.
  • When choosing a tokenizer specific to a language, choose a stemmer specific to the same language.

A tokenizer splits your text into chunks, and since different analyzers may use different tokenizers, you can get different output token streams, i.e. different sequences of chunks of text.

StandardAnalyzer (and most other analyzers) uses spaces and punctuation as split points. For example, for the phrase “I am very happy” it will produce the list [“i”, “am”, “very”, “happy”] (or something like that). For more information on a specific analyzer/tokenizer, see its API documentation.
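A minimal sketch of that tokenize-and-lowercase step (a Python simulation, not the actual StandardTokenizer implementation):

```python
import re

def standard_style(text):
    # Rough simulation of StandardTokenizer + LowerCaseFilter:
    # split on spaces/punctuation, drop empties, lowercase each token.
    return [t.lower() for t in re.split(r"\W+", text) if t]

print(standard_style("I am very happy."))  # ['i', 'am', 'very', 'happy']
```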

  • A stemmer reduces each word to its base form.
    • Stemming rules depend on the language used.
      • e.g. English: “I am very happy” is reduced to [“i”, “be”, “veri”, “happi”],
      • e.g. French: for “Je suis très heureux” (“I am very happy”), a French analyzer (such as SnowballAnalyzer initialized with “French”) will produce [“je”, “être”, “tre”, “heur”].
    • Of course, if you use an analyzer of one language to stem text in another, the rules of the wrong language are applied, and the stemmer may produce incorrect results. This doesn't break the whole system, but search results may then be less accurate.

    KeywordAnalyzer does not use any stemmer; it passes the whole field through unmodified. So if you are going to search for individual words in English text, it is not a good analyzer to use.
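To show why “very” becomes “veri” and “happy” becomes “happi”, here is a toy suffix-stripping stemmer in the spirit of the Porter/Snowball family. Real stemmers apply many more rules; this sketch implements only one of them and is purely illustrative:

```python
def toy_stem(word):
    # One Porter-style rule: a trailing "y" after a consonant becomes "i",
    # so that "happy"/"happi-ness"/"happi-ly" all share the stem "happi".
    word = word.lower()
    if word.endswith("y") and len(word) > 2 and word[-2] not in "aeiou":
        return word[:-1] + "i"
    return word

print([toy_stem(w) for w in ["very", "happy", "day"]])
# ['veri', 'happi', 'day']  -- "day" is untouched: vowel before the "y"
```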

Stop words are the most frequent and almost useless words. Again, this depends heavily on the language. For English these words are “a”, “the”, “I”, “be”, “have”, etc. Stop-word filters remove them from the token stream to lower noise in search results, so our phrase “I'm very happy” with StandardAnalyzer is finally transformed into the list [“veri”, “happi”]. KeywordAnalyzer, again, does nothing. That is why KeywordAnalyzer is used for things like IDs or phone numbers, but not for ordinary text.
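A stop-word filter is essentially a set lookup over the token stream. A minimal sketch (the stop list here is a toy subset for illustration, not Lucene's actual English list):

```python
STOP_WORDS = {"a", "an", "the", "i", "am", "be", "have"}  # toy subset

def stop_filter(tokens):
    # Drop high-frequency, low-information tokens.
    return [t for t in tokens if t not in STOP_WORDS]

# Stemmed tokens for "I'm very happy", as in the example above:
print(stop_filter(["i", "be", "veri", "happi"]))  # ['veri', 'happi']
```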

As for a maxClauseCount exception: you most likely get it while searching, and most probably because the search query is too complex, i.e. it expands into more boolean clauses than Lucene's limit allows. Try splitting it into several smaller queries, or use more low-level search functions.
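One way to split such a query is to batch the terms and merge the results. The sketch below shows only the batching logic; `search` is a hypothetical placeholder for whatever runs a single OR-query in your application (Lucene's default clause limit is 1024):

```python
def chunks(terms, size):
    # Split a long term list into batches small enough to stay
    # under the clause limit.
    return [terms[i:i + size] for i in range(0, len(terms), size)]

def search_in_batches(terms, search, size=1024):
    # `search` is a hypothetical callable that runs one OR-query over a
    # batch of terms; results from all batches are merged (deduplicated).
    results = set()
    for batch in chunks(terms, size):
        results |= set(search(batch))
    return results
```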

  • Last modified: 2023/11/04 01:48
  • by 127.0.0.1