Class QueryAutoStopWordAnalyzer

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public final class QueryAutoStopWordAnalyzer
    extends org.apache.lucene.analysis.Analyzer
    An Analyzer, used primarily at query time, that wraps another analyzer and prevents very common words from being passed into queries.

    For very large indexes the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38 million document index in which one term appeared in around 50% of documents, causing TermQueries for that term to take 2 seconds.

    Stop words are identified automatically from an already existing index. Prefer the constructors that take an IndexReader; the various "addStopWords" methods in this class do the same job but are deprecated.
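The protection described above amounts to dropping, per field, any query token that has been marked as a stop word. The following is a minimal plain-Java sketch of that idea and not Lucene's implementation: the class QueryStopFilterSketch, its methods, and the field names are all hypothetical, and tokens are modelled as plain strings rather than a TokenStream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; this is not part of Lucene.
final class QueryStopFilterSketch {
    // Stop words are kept per field, because a term that is common in one
    // field may be rare (and therefore useful) in another.
    private final Map<String, Set<String>> stopWordsByField = new HashMap<>();

    void setStopWords(String field, Set<String> words) {
        stopWordsByField.put(field, new HashSet<>(words));
    }

    // Returns the tokens that survive filtering for the given field.
    // Fields without any registered stop words pass through unchanged.
    List<String> filter(String field, List<String> tokens) {
        Set<String> stop = stopWordsByField.getOrDefault(field, Collections.<String>emptySet());
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stop.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        QueryStopFilterSketch sketch = new QueryStopFilterSketch();
        sketch.setStopWords("body", new HashSet<>(Arrays.asList("the", "of")));
        // prints [history, lucene]
        System.out.println(sketch.filter("body", Arrays.asList("the", "history", "of", "lucene")));
        // prints [the, history] because "title" has no stop words registered
        System.out.println(sketch.filter("title", Arrays.asList("the", "history")));
    }
}
```

In the real class the filtering is applied to the delegate analyzer's TokenStream; the sketch only shows the per-field lookup that decides which tokens are dropped.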

    • Field Detail

      • defaultMaxDocFreqPercent

        public static final float defaultMaxDocFreqPercent
        See Also:
        Constant Field Values
    • Constructor Detail

      • QueryAutoStopWordAnalyzer

        @Deprecated
        public QueryAutoStopWordAnalyzer​(org.apache.lucene.util.Version matchVersion,
                                         org.apache.lucene.analysis.Analyzer delegate)
        Deprecated.
        Stopwords should be calculated at instantiation using one of the other constructors
        Initializes this analyzer with the Analyzer object that actually produces the tokens
        Parameters:
        matchVersion - Version to be used in StopFilter
        delegate - The choice of Analyzer that is used to produce the token stream which needs filtering
      • QueryAutoStopWordAnalyzer

        public QueryAutoStopWordAnalyzer​(org.apache.lucene.util.Version matchVersion,
                                         org.apache.lucene.analysis.Analyzer delegate,
                                         org.apache.lucene.index.IndexReader indexReader)
                                  throws IOException
        Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent
        Parameters:
        matchVersion - Version to be used in StopFilter
        delegate - Analyzer whose TokenStream will be filtered
        indexReader - IndexReader to identify the stopwords from
        Throws:
        IOException - Can be thrown while reading from the IndexReader
      • QueryAutoStopWordAnalyzer

        public QueryAutoStopWordAnalyzer​(org.apache.lucene.util.Version matchVersion,
                                         org.apache.lucene.analysis.Analyzer delegate,
                                         org.apache.lucene.index.IndexReader indexReader,
                                         int maxDocFreq)
                                  throws IOException
        Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency greater than the given maxDocFreq
        Parameters:
        matchVersion - Version to be used in StopFilter
        delegate - Analyzer whose TokenStream will be filtered
        indexReader - IndexReader to identify the stopwords from
        maxDocFreq - Document frequency that a term must exceed in order to be treated as a stopword
        Throws:
        IOException - Can be thrown while reading from the IndexReader
      • QueryAutoStopWordAnalyzer

        public QueryAutoStopWordAnalyzer​(org.apache.lucene.util.Version matchVersion,
                                         org.apache.lucene.analysis.Analyzer delegate,
                                         org.apache.lucene.index.IndexReader indexReader,
                                         float maxPercentDocs)
                                  throws IOException
        Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given maxPercentDocs
        Parameters:
        matchVersion - Version to be used in StopFilter
        delegate - Analyzer whose TokenStream will be filtered
        indexReader - IndexReader to identify the stopwords from
        maxPercentDocs - The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word
        Throws:
        IOException - Can be thrown while reading from the IndexReader
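To make the relationship between maxPercentDocs and maxDocFreq concrete, here is a hedged plain-Java sketch. It assumes that the fraction is converted to an absolute document count by multiplying it with the number of documents and truncating, and that a term qualifies only when its document frequency is strictly greater than that count. Both points are assumptions about the behaviour, and DocFreqStopWordsSketch with its methods is a hypothetical name, not a Lucene API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; this is not part of Lucene.
final class DocFreqStopWordsSketch {

    // Converts a fraction of the index (0.0 to 1.0) into an absolute
    // document count. Truncation is an assumption of this sketch.
    static int toMaxDocFreq(int numDocs, float maxPercentDocs) {
        return (int) (numDocs * maxPercentDocs);
    }

    // Returns, in sorted order, every term whose document frequency is
    // strictly greater than maxDocFreq.
    static Set<String> selectStopWords(Map<String, Integer> docFreqByTerm, int maxDocFreq) {
        Set<String> stopWords = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreqByTerm.entrySet()) {
            if (entry.getValue() > maxDocFreq) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqByTerm = new HashMap<>();
        docFreqByTerm.put("the", 900);
        docFreqByTerm.put("of", 501);
        docFreqByTerm.put("and", 500);
        docFreqByTerm.put("lucene", 120);

        int maxDocFreq = toMaxDocFreq(1000, 0.5f); // 500 documents
        // prints [of, the]; "and" sits exactly on the threshold and is kept
        System.out.println(selectStopWords(docFreqByTerm, maxDocFreq));
    }
}
```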
      • QueryAutoStopWordAnalyzer

        public QueryAutoStopWordAnalyzer​(org.apache.lucene.util.Version matchVersion,
                                         org.apache.lucene.analysis.Analyzer delegate,
                                         org.apache.lucene.index.IndexReader indexReader,
                                         Collection<String> fields,
                                         float maxPercentDocs)
                                  throws IOException
        Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given maxPercentDocs
        Parameters:
        matchVersion - Version to be used in StopFilter
        delegate - Analyzer whose TokenStream will be filtered
        indexReader - IndexReader to identify the stopwords from
        fields - Selection of fields to calculate stopwords for
        maxPercentDocs - The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word
        Throws:
        IOException - Can be thrown while reading from the IndexReader
      • QueryAutoStopWordAnalyzer

        public QueryAutoStopWordAnalyzer​(org.apache.lucene.util.Version matchVersion,
                                         org.apache.lucene.analysis.Analyzer delegate,
                                         org.apache.lucene.index.IndexReader indexReader,
                                         Collection<String> fields,
                                         int maxDocFreq)
                                  throws IOException
        Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given maxDocFreq
        Parameters:
        matchVersion - Version to be used in StopFilter
        delegate - Analyzer whose TokenStream will be filtered
        indexReader - IndexReader to identify the stopwords from
        fields - Selection of fields to calculate stopwords for
        maxDocFreq - Document frequency that a term must exceed in order to be treated as a stopword
        Throws:
        IOException - Can be thrown while reading from the IndexReader
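The constructors that accept a collection of fields restrict stopword calculation to those fields. A rough plain-Java sketch of that selection follows. The nested map stands in for the per-field term statistics an IndexReader would supply, and PerFieldStopWordsSketch, selectForFields, and the field names are hypothetical. Skipping requested fields that are missing from the index is a choice made for this sketch, not documented behaviour.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; this is not part of Lucene.
final class PerFieldStopWordsSketch {

    // For each requested field, collects the terms whose document
    // frequency is strictly greater than maxDocFreq.
    static Map<String, Set<String>> selectForFields(
            Map<String, Map<String, Integer>> docFreqByField,
            Collection<String> fields,
            int maxDocFreq) {
        Map<String, Set<String>> stopWordsByField = new TreeMap<>();
        for (String field : fields) {
            Map<String, Integer> docFreqByTerm = docFreqByField.get(field);
            if (docFreqByTerm == null) {
                continue; // requested field is not in the index (sketch choice)
            }
            Set<String> stopWords = new TreeSet<>();
            for (Map.Entry<String, Integer> entry : docFreqByTerm.entrySet()) {
                if (entry.getValue() > maxDocFreq) {
                    stopWords.add(entry.getKey());
                }
            }
            stopWordsByField.put(field, stopWords);
        }
        return stopWordsByField;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> index = new HashMap<>();
        Map<String, Integer> body = new HashMap<>();
        body.put("the", 900);
        body.put("lucene", 50);
        Map<String, Integer> title = new HashMap<>();
        title.put("the", 300);
        title.put("guide", 10);
        Map<String, Integer> tags = new HashMap<>();
        tags.put("java", 950);
        index.put("body", body);
        index.put("title", title);
        index.put("tags", tags);

        // "tags" is not requested, so "java" is never considered.
        // prints {body=[the], title=[the]}
        System.out.println(selectForFields(index, Arrays.asList("body", "title"), 250));
    }
}
```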
    • Method Detail

      • addStopWords

        @Deprecated
        public int addStopWords​(org.apache.lucene.index.IndexReader reader)
                         throws IOException
        Deprecated.
        Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader)
        Automatically adds stop words for all fields with terms exceeding the defaultMaxDocFreqPercent
        Parameters:
        reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
        Returns:
        The number of stop words identified.
        Throws:
        IOException - Can be thrown while reading from the IndexReader
      • addStopWords

        @Deprecated
        public int addStopWords​(org.apache.lucene.index.IndexReader reader,
                                int maxDocFreq)
                         throws IOException
        Deprecated.
        Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, int)
        Automatically adds stop words for all fields with terms exceeding the given maxDocFreq
        Parameters:
        reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
        maxDocFreq - The maximum number of index documents which can contain a term, after which the term is considered to be a stop word
        Returns:
        The number of stop words identified.
        Throws:
        IOException - Can be thrown while reading from the IndexReader
      • addStopWords

        @Deprecated
        public int addStopWords​(org.apache.lucene.index.IndexReader reader,
                                float maxPercentDocs)
                         throws IOException
        Deprecated.
        Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, float)
        Automatically adds stop words for all fields with terms exceeding the given maxPercentDocs
        Parameters:
        reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
        maxPercentDocs - The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word.
        Returns:
        The number of stop words identified.
        Throws:
        IOException - Can be thrown while reading from the IndexReader
      • addStopWords

        @Deprecated
        public int addStopWords​(org.apache.lucene.index.IndexReader reader,
                                String fieldName,
                                float maxPercentDocs)
                         throws IOException
        Deprecated.
        Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, Collection, float)
        Automatically adds stop words for the given field with terms exceeding the maxPercentDocs
        Parameters:
        reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
        fieldName - The field for which stopwords will be added
        maxPercentDocs - The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word.
        Returns:
        The number of stop words identified.
        Throws:
        IOException - Can be thrown while reading from the IndexReader
      • addStopWords

        @Deprecated
        public int addStopWords​(org.apache.lucene.index.IndexReader reader,
                                String fieldName,
                                int maxDocFreq)
                         throws IOException
        Deprecated.
        Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, Collection, int)
        Automatically adds stop words for the given field with terms exceeding the given maxDocFreq
        Parameters:
        reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
        fieldName - The field for which stopwords will be added
        maxDocFreq - The maximum number of index documents which can contain a term, after which the term is considered to be a stop word.
        Returns:
        The number of stop words identified.
        Throws:
        IOException - Can be thrown while reading from the IndexReader
      • tokenStream

        public org.apache.lucene.analysis.TokenStream tokenStream​(String fieldName,
                                                                  Reader reader)
        Specified by:
        tokenStream in class org.apache.lucene.analysis.Analyzer
      • reusableTokenStream

        public org.apache.lucene.analysis.TokenStream reusableTokenStream​(String fieldName,
                                                                          Reader reader)
                                                                   throws IOException
        Overrides:
        reusableTokenStream in class org.apache.lucene.analysis.Analyzer
        Throws:
        IOException
      • getStopWords

        public String[] getStopWords​(String fieldName)
        Provides information on which stop words have been identified for a field
        Parameters:
        fieldName - The field whose identified stop words will be returned, whether they were calculated at instantiation or through "addStopWords" method calls
        Returns:
        the stop words identified for a field
      • getStopWords

        public org.apache.lucene.index.Term[] getStopWords()
        Provides information on which stop words have been identified for all fields
        Returns:
        the stop words (as terms)
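The two getStopWords overloads can be read as two views over one per-field collection: one field's words as strings, or every field's words as (field, text) pairs. The plain-Java sketch below illustrates that reading under stated assumptions. StopWordLookupSketch and its methods are hypothetical, the "field:text" strings merely stand in for Term objects, and returning an empty array for an unknown field is an assumption rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; this is not part of Lucene.
final class StopWordLookupSketch {

    // View 1: the stop words of a single field. An unknown field yields
    // an empty array (an assumption of this sketch).
    static String[] stopWordsFor(Map<String, Set<String>> stopWordsByField, String field) {
        Set<String> words = stopWordsByField.get(field);
        return words == null ? new String[0] : words.toArray(new String[0]);
    }

    // View 2: the stop words of all fields, flattened into "field:text"
    // strings that stand in for Term objects.
    static List<String> allStopWords(Map<String, Set<String>> stopWordsByField) {
        List<String> all = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : stopWordsByField.entrySet()) {
            for (String word : entry.getValue()) {
                all.add(entry.getKey() + ":" + word);
            }
        }
        return all;
    }

    public static void main(String[] args) {
        // Sorted collections keep the printed order deterministic.
        Map<String, Set<String>> stopWordsByField = new TreeMap<>();
        stopWordsByField.put("body", new TreeSet<>(Arrays.asList("the", "of")));
        stopWordsByField.put("title", new TreeSet<>(Arrays.asList("the")));

        // prints [of, the]
        System.out.println(Arrays.toString(stopWordsFor(stopWordsByField, "body")));
        // prints [body:of, body:the, title:the]
        System.out.println(allStopWords(stopWordsByField));
    }
}
```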