Class ShingleAnalyzerWrapper

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public final class ShingleAnalyzerWrapper
    extends org.apache.lucene.analysis.Analyzer
    A ShingleAnalyzerWrapper wraps a ShingleFilter around another Analyzer.

    A shingle is another name for a token-based n-gram.
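To make the definition concrete, here is a minimal sketch in plain Java, independent of Lucene, that builds token n-grams from a list of tokens. The class name `ShingleSketch` and the method `shingles` are hypothetical and exist only for this illustration; the sketch does not claim to reproduce the exact output order or position attributes of ShingleFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class ShingleSketch {

    // Returns every run of `size` adjacent tokens, joined by `separator`.
    static List<String> shingles(List<String> tokens, int size, String separator) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start + size <= tokens.size(); start++) {
            out.add(String.join(separator, tokens.subList(start, start + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("please", "divide", "this", "sentence");
        System.out.println(shingles(tokens, 2, " "));
        // prints [please divide, divide this, this sentence]
    }
}
```

Each shingle is built from whole tokens rather than characters, which is what distinguishes it from a character n-gram.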

    • Constructor Detail

      • ShingleAnalyzerWrapper

        public ShingleAnalyzerWrapper​(org.apache.lucene.analysis.Analyzer defaultAnalyzer)
      • ShingleAnalyzerWrapper

        public ShingleAnalyzerWrapper​(org.apache.lucene.analysis.Analyzer defaultAnalyzer,
                                      int maxShingleSize)
      • ShingleAnalyzerWrapper

        public ShingleAnalyzerWrapper​(org.apache.lucene.analysis.Analyzer defaultAnalyzer,
                                      int minShingleSize,
                                      int maxShingleSize)
      • ShingleAnalyzerWrapper

        public ShingleAnalyzerWrapper​(org.apache.lucene.analysis.Analyzer defaultAnalyzer,
                                      int minShingleSize,
                                      int maxShingleSize,
                                      String tokenSeparator,
                                      boolean outputUnigrams,
                                      boolean outputUnigramsIfNoShingles)
        Creates a new ShingleAnalyzerWrapper
        Parameters:
        defaultAnalyzer - Analyzer whose TokenStream is to be filtered
        minShingleSize - Minimum shingle (token n-gram) size
        maxShingleSize - Maximum shingle (token n-gram) size
        tokenSeparator - String used to separate input stream tokens in output shingles
        outputUnigrams - Whether the filter passes the original tokens (the "unigrams") to the output stream
        outputUnigramsIfNoShingles - Whether to override the behavior of outputUnigrams==false when no shingles are available (because there are fewer than minShingleSize tokens in the input stream). If outputUnigrams==true, unigrams are always output, regardless of whether any shingles are available.
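The interaction between these parameters can be sketched in plain Java without Lucene. The class `ShingleOptionsSketch` and its `analyze` method are hypothetical names for this illustration. The sketch follows only the behavior described in the parameter list above; the order in which it emits unigrams and shingles is a choice made here and is not a claim about ShingleFilter's actual output order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class ShingleOptionsSketch {

    static List<String> analyze(List<String> tokens, int minShingleSize, int maxShingleSize,
                                String tokenSeparator, boolean outputUnigrams,
                                boolean outputUnigramsIfNoShingles) {
        // No shingle can be formed when the input has fewer than minShingleSize tokens.
        boolean noShingles = tokens.size() < minShingleSize;
        // outputUnigrams==true always wins; otherwise the fallback applies only without shingles.
        boolean emitUnigrams = outputUnigrams || (outputUnigramsIfNoShingles && noShingles);

        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            if (emitUnigrams) {
                out.add(tokens.get(start));
            }
            for (int size = minShingleSize;
                 size <= maxShingleSize && start + size <= tokens.size(); size++) {
                out.add(String.join(tokenSeparator, tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> three = Arrays.asList("a", "b", "c");
        List<String> one = Arrays.asList("a");
        System.out.println(analyze(three, 2, 3, "_", true, false));  // [a, a_b, a_b_c, b, b_c, c]
        System.out.println(analyze(three, 2, 3, "_", false, false)); // [a_b, a_b_c, b_c]
        System.out.println(analyze(one, 2, 3, "_", false, false));   // []
        System.out.println(analyze(one, 2, 3, "_", false, true));    // [a]
    }
}
```

The last two calls show the purpose of outputUnigramsIfNoShingles: without it, an input shorter than minShingleSize produces no output at all.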
      • ShingleAnalyzerWrapper

        public ShingleAnalyzerWrapper​(org.apache.lucene.util.Version matchVersion)
        Wraps StandardAnalyzer.
      • ShingleAnalyzerWrapper

        public ShingleAnalyzerWrapper​(org.apache.lucene.util.Version matchVersion,
                                      int minShingleSize,
                                      int maxShingleSize)
        Wraps StandardAnalyzer.
    • Method Detail

      • getMaxShingleSize

        public int getMaxShingleSize()
        The max shingle (token ngram) size
        Returns:
        The max shingle (token ngram) size
      • setMaxShingleSize

        @Deprecated
        public void setMaxShingleSize​(int maxShingleSize)
        Deprecated.
        Setting maxShingleSize after Analyzer instantiation prevents reuse. Configure maxShingleSize during construction.
        Set the maximum size of output shingles (default: 2)
        Parameters:
        maxShingleSize - max shingle size
      • getMinShingleSize

        public int getMinShingleSize()
        The min shingle (token ngram) size
        Returns:
        The min shingle (token ngram) size
      • setMinShingleSize

        @Deprecated
        public void setMinShingleSize​(int minShingleSize)
        Deprecated.
        Setting minShingleSize after Analyzer instantiation prevents reuse. Configure minShingleSize during construction.

        Set the min shingle size (default: 2).

        This method requires that the given minShingleSize not be greater than maxShingleSize, so set maxShingleSize before calling this method.

        Parameters:
        minShingleSize - min size of output shingles
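The ordering requirement can be illustrated with a small stand-alone sketch. `ShingleSizeSketch` is a hypothetical class, not Lucene's implementation; it assumes the constraint is enforced by throwing an IllegalArgumentException, which is an assumption of this sketch rather than something stated above. Both sizes start at the documented default of 2.

```java
// Hypothetical illustration only; not Lucene's implementation.
final class ShingleSizeSketch {
    private int minShingleSize = 2; // documented default
    private int maxShingleSize = 2; // documented default

    void setMaxShingleSize(int maxShingleSize) {
        this.maxShingleSize = maxShingleSize;
    }

    void setMinShingleSize(int minShingleSize) {
        // Assumed enforcement: reject a minimum above the current maximum.
        if (minShingleSize > maxShingleSize) {
            throw new IllegalArgumentException(
                "minShingleSize must not be greater than maxShingleSize");
        }
        this.minShingleSize = minShingleSize;
    }

    int getMinShingleSize() { return minShingleSize; }
    int getMaxShingleSize() { return maxShingleSize; }

    public static void main(String[] args) {
        ShingleSizeSketch sizes = new ShingleSizeSketch();
        sizes.setMaxShingleSize(4); // raise the maximum first
        sizes.setMinShingleSize(3); // accepted: 3 <= 4
        System.out.println(sizes.getMinShingleSize() + ".." + sizes.getMaxShingleSize()); // prints 3..4

        ShingleSizeSketch fresh = new ShingleSizeSketch();
        try {
            fresh.setMinShingleSize(3); // rejected: the maximum is still 2
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Passing both sizes to a constructor avoids the ordering problem entirely, which is consistent with the deprecation note's advice to configure sizes during construction.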
      • getTokenSeparator

        public String getTokenSeparator()
      • setTokenSeparator

        @Deprecated
        public void setTokenSeparator​(String tokenSeparator)
        Deprecated.
        Setting tokenSeparator after Analyzer instantiation prevents reuse. Configure tokenSeparator during construction.
        Sets the string to use when joining adjacent tokens to form a shingle
        Parameters:
        tokenSeparator - used to separate input stream tokens in output shingles
      • isOutputUnigrams

        public boolean isOutputUnigrams()
      • setOutputUnigrams

        @Deprecated
        public void setOutputUnigrams​(boolean outputUnigrams)
        Deprecated.
        Setting outputUnigrams after Analyzer instantiation prevents reuse. Configure outputUnigrams during construction.
        Shall the filter pass the original tokens (the "unigrams") to the output stream?
        Parameters:
        outputUnigrams - Whether or not the filter shall pass the original tokens to the output stream
      • isOutputUnigramsIfNoShingles

        public boolean isOutputUnigramsIfNoShingles()
      • setOutputUnigramsIfNoShingles

        @Deprecated
        public void setOutputUnigramsIfNoShingles​(boolean outputUnigramsIfNoShingles)
        Deprecated.
        Setting outputUnigramsIfNoShingles after Analyzer instantiation prevents reuse. Configure outputUnigramsIfNoShingles during construction.

        Shall we override the behavior of outputUnigrams==false for those times when no shingles are available (because there are fewer than minShingleSize tokens in the input stream)? (default: false.)

        Note that if outputUnigrams==true, then unigrams are always output, regardless of whether any shingles are available.

        Parameters:
        outputUnigramsIfNoShingles - Whether or not to output a single unigram when no shingles are available.
      • tokenStream

        public org.apache.lucene.analysis.TokenStream tokenStream​(String fieldName,
                                                                  Reader reader)
        Specified by:
        tokenStream in class org.apache.lucene.analysis.Analyzer
      • reusableTokenStream

        public org.apache.lucene.analysis.TokenStream reusableTokenStream​(String fieldName,
                                                                          Reader reader)
                                                                   throws IOException
        Overrides:
        reusableTokenStream in class org.apache.lucene.analysis.Analyzer
        Throws:
        IOException