Class PersianAnalyzer

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public final class PersianAnalyzer
    extends org.apache.lucene.analysis.StopwordAnalyzerBase
    Analyzer for Persian.

    This Analyzer uses ArabicLetterTokenizer, which tokenizes around the zero-width non-joiner in addition to whitespace. Some Persian-specific variant forms (such as farsi yeh and keheh) are standardized. "Stemming" is accomplished via stopwords.
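To make the description above concrete, here is a minimal plain-JDK sketch of the two ideas it names: splitting on the zero-width non-joiner (U+200C) as well as whitespace, and standardizing farsi yeh and keheh. It is not Lucene's tokenizer or filter code. The class name `PersianTokenSketch` is hypothetical, and the target code points (Arabic yeh U+064A and Arabic kaf U+0643) are an assumption about what "standardized" means here.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-JDK illustration only; this is not Lucene's tokenizer or filter code.
final class PersianTokenSketch {
    private static final char ZWNJ = '\u200C';        // zero-width non-joiner
    private static final char FARSI_YEH = '\u06CC';
    private static final char ARABIC_YEH = '\u064A';  // assumed standard form
    private static final char KEHEH = '\u06A9';
    private static final char ARABIC_KAF = '\u0643';  // assumed standard form

    // Maps the two variant letters named in the description to one form each.
    static char normalize(char c) {
        switch (c) {
            case FARSI_YEH: return ARABIC_YEH;
            case KEHEH:     return ARABIC_KAF;
            default:        return c;
        }
    }

    // Splits on whitespace and on ZWNJ, normalizing every character it keeps.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c) || c == ZWNJ) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(normalize(c));
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two words; the first contains a ZWNJ between its prefix and stem.
        String text = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645 \u06A9\u062A\u0627\u0628";
        System.out.println(tokenize(text));
    }
}
```

Because the non-joiner is treated as a separator, a prefix written with it becomes its own token, which is what allows a stopword list to stand in for stemming.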

    • Field Detail

      • DEFAULT_STOPWORD_FILE

        public static final String DEFAULT_STOPWORD_FILE
        File containing default Persian stopwords. The default stopword list comes from http://members.unine.ch/jacques.savoy/clef/index.html and is BSD-licensed.
        See Also:
        Constant Field Values
      • STOPWORDS_COMMENT

        public static final String STOPWORDS_COMMENT
        The comment character in the stopwords file. All lines prefixed with it are ignored.
        See Also:
        Constant Field Values
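As an illustration of how a stopword file with comment lines might be read, the sketch below parses one word per line and skips lines that start with the comment marker. It uses only the JDK and is not Lucene's loader. The class `StopwordFileSketch`, its methods, and the sample contents are hypothetical. The marker `#` is an assumption (the real value is the one listed under Constant Field Values), and so is the choice to skip blank lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;

// Plain-JDK illustration of comment-aware stopword loading; not Lucene code.
final class StopwordFileSketch {
    // Assumed comment marker; check Constant Field Values for the real one.
    static final String COMMENT = "#";

    // Reads one word per line, ignoring comment lines and (by assumption) blank lines.
    static Set<String> load(Reader in, String comment) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith(comment)) {
                    continue; // whole line is a comment
                }
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    // Convenience wrapper for in-memory text, using the assumed marker.
    static Set<String> loadString(String contents) {
        try {
            return load(new StringReader(contents), COMMENT);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical file contents, not the bundled stopword file.
        String contents = "# sample header\n\u0648\n\n\u062F\u0631\n#\u0628\u0647\n";
        System.out.println(loadString(contents).size() + " stopwords loaded");
    }
}
```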
    • Constructor Detail

      • PersianAnalyzer

        public PersianAnalyzer(org.apache.lucene.util.Version matchVersion)
        Builds an analyzer with the default stop words: DEFAULT_STOPWORD_FILE.
      • PersianAnalyzer

        public PersianAnalyzer(org.apache.lucene.util.Version matchVersion,
                               Set<?> stopwords)
        Builds an analyzer with the given stop words.
        Parameters:
        matchVersion - Lucene compatibility version
        stopwords - a stopword set
    • Method Detail

      • getDefaultStopSet

        public static Set<?> getDefaultStopSet()
        Returns an unmodifiable instance of the default stop-words set.
        Returns:
        an unmodifiable instance of the default stop-words set.
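Since the returned set is unmodifiable, callers that want to extend it need to copy it first. The sketch below shows that pattern with JDK collections only. `DefaultStopSetSketch`, `withExtra`, and the three placeholder words are hypothetical and do not reflect the contents of the bundled stopword file. The lazy holder class is one common way to build such a set once and is an assumption, not a statement about how this class does it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Plain-JDK illustration of an unmodifiable default set; not Lucene code.
final class DefaultStopSetSketch {
    // Lazy holder: the set is built once, on first access to the holder.
    private static final class Holder {
        static final Set<String> DEFAULT = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("\u0648", "\u062F\u0631", "\u0628\u0647"))); // placeholder words
    }

    // Mirrors the documented contract: callers get a read-only view.
    static Set<String> getDefaultStopSet() {
        return Holder.DEFAULT;
    }

    // To customize, copy the defaults into a fresh mutable set and add to the copy.
    static Set<String> withExtra(String... extra) {
        Set<String> copy = new HashSet<>(getDefaultStopSet());
        copy.addAll(Arrays.asList(extra));
        return copy;
    }

    public static void main(String[] args) {
        Set<String> custom = withExtra("\u0627\u0632");
        System.out.println(custom.size() + " words in the customized copy");
        try {
            getDefaultStopSet().add("\u0627\u0632");
        } catch (UnsupportedOperationException e) {
            System.out.println("default set rejected the modification");
        }
    }
}
```

A copy built this way could then be handed to the two-argument constructor in place of the defaults.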
      • createComponents

        protected org.apache.lucene.analysis.ReusableAnalyzerBase.TokenStreamComponents createComponents(String fieldName,
                                                                                                         Reader reader)
        Creates ReusableAnalyzerBase.TokenStreamComponents used to tokenize all the text in the provided Reader.
        Specified by:
        createComponents in class org.apache.lucene.analysis.ReusableAnalyzerBase
        Returns:
        ReusableAnalyzerBase.TokenStreamComponents built from a StandardTokenizer filtered with LowerCaseFilter, ArabicNormalizationFilter, PersianNormalizationFilter and Persian stop words.
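The return description lists the filters with stop-word removal last. Assuming that listing reflects execution order, stop words are matched against tokens that have already been lowercased and normalized, so the stop set has to hold normalized forms. The plain-JDK sketch below models that ordering with string functions. It does not use or reproduce Lucene's filters; `FilterChainSketch` and its normalizers are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Plain-JDK model of "normalize first, remove stop words last"; not Lucene code.
final class FilterChainSketch {
    // Stand-ins for the lowercase and normalization steps (assumed mappings).
    static List<UnaryOperator<String>> defaultNormalizers() {
        List<UnaryOperator<String>> chain = new ArrayList<>();
        chain.add(s -> s.toLowerCase(Locale.ROOT));
        chain.add(s -> s.replace('\u06CC', '\u064A')); // farsi yeh -> Arabic yeh
        chain.add(s -> s.replace('\u06A9', '\u0643')); // keheh -> Arabic kaf
        return chain;
    }

    // Applies every normalizer in order, then drops tokens found in the stop set.
    static List<String> analyze(List<String> tokens,
                                List<UnaryOperator<String>> normalizers,
                                Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String t = token;
            for (UnaryOperator<String> n : normalizers) {
                t = n.apply(t);
            }
            if (!stopwords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The stop word is stored in its normalized spelling (kaf, not keheh).
        Set<String> stop = new HashSet<>(Arrays.asList("\u0643\u0647"));
        List<String> tokens = Arrays.asList("Lucene", "\u06A9\u0647", "\u06A9\u062A\u0627\u0628");
        System.out.println(analyze(tokens, defaultNormalizers(), stop));
    }
}
```

If the stop set held the unnormalized spelling instead, the second token would pass through, which is why a custom stopword set is best prepared in the same normalized form.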
      • initReader

        protected Reader initReader(Reader reader)
        Wraps the Reader with PersianCharFilter.
        Overrides:
        initReader in class org.apache.lucene.analysis.ReusableAnalyzerBase
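To show what wrapping a Reader at this stage can achieve, the sketch below implements a `java.io.FilterReader` that replaces each zero-width non-joiner (U+200C) with an ordinary space before any tokenizer sees the text. That behavior is an assumption about the purpose of PersianCharFilter, and the code is a plain-JDK stand-in rather than the Lucene class. `ZwnjToSpaceReader` and `replaceZwnj` are hypothetical names.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Plain-JDK stand-in for a character filter; not Lucene's PersianCharFilter.
final class ZwnjToSpaceReader extends FilterReader {
    private static final char ZWNJ = '\u200C';

    ZwnjToSpaceReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = in.read();
        return c == ZWNJ ? ' ' : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (buf[i] == ZWNJ) {
                buf[i] = ' ';
            }
        }
        return n;
    }

    // Helper that drains a wrapped reader over the given text.
    static String replaceZwnj(String text) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[64];
        try (Reader r = new ZwnjToSpaceReader(new StringReader(text))) {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645";
        System.out.println(replaceZwnj(text).length() == text.length());
    }
}
```

The replacement is one character for one character, so character offsets in the filtered stream line up with the original text.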