Class DictionaryCompoundWordTokenFilter
- java.lang.Object
  - org.apache.lucene.util.AttributeSource
    - org.apache.lucene.analysis.TokenStream
      - org.apache.lucene.analysis.TokenFilter
        - org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
          - org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter
- All Implemented Interfaces:
  Closeable, AutoCloseable
public class DictionaryCompoundWordTokenFilter extends CompoundWordTokenFilterBase
A TokenFilter that decomposes compound words found in many Germanic languages. For example, "Donaudampfschiff" becomes Donau, dampf, schiff, so that you can find "Donaudampfschiff" even when you search only for "schiff". It uses a brute-force algorithm to achieve this.
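The brute-force idea can be illustrated without Lucene: try every start offset and every allowed length, and keep each substring that the dictionary contains. The following is a minimal sketch, not the filter's actual source. The class name BruteForceDecompounder is hypothetical, a plain Set<String> of lowercase entries stands in for the dictionary, and the size bounds are treated as inclusive, which is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: a plain-Java sketch of brute-force decompounding.
final class BruteForceDecompounder {

    /**
     * Returns the subwords of {@code word} that occur in {@code dictionary}.
     * Assumptions: the dictionary holds lowercase entries, and all size
     * bounds are inclusive.
     */
    static List<String> decompose(String word, Set<String> dictionary,
                                  int minWordSize, int minSubwordSize,
                                  int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> subwords = new ArrayList<>();
        if (word.length() < minWordSize) {
            return subwords; // too short to be treated as a compound
        }
        // Try every start offset that still leaves room for a minimal subword.
        for (int start = 0; start <= word.length() - minSubwordSize; start++) {
            String longest = null;
            // Try every allowed length at this offset.
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= word.length(); len++) {
                String candidate = word.substring(start, start + len);
                if (!dictionary.contains(candidate.toLowerCase(Locale.ROOT))) {
                    continue;
                }
                if (onlyLongestMatch) {
                    longest = candidate; // len only grows, so the last hit is the longest
                } else {
                    subwords.add(candidate);
                }
            }
            if (longest != null) {
                subwords.add(longest);
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        // Prints [Donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", dictionary, 5, 2, 15, false));
    }
}
```

Every candidate substring costs one dictionary lookup, which is why the choice of dictionary implementation matters for performance.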
You must specify the required Version compatibility when creating CompoundWordTokenFilterBase:
- As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries.
If you pass in a CharArraySet as dictionary, it should be case-insensitive unless it contains only lowercased entries and you have LowerCaseFilter before this filter in your analysis chain. For optimal performance you should use the latter combination (LowerCaseFilter in the analysis chain plus a lowercase-only CharArraySet), because this filter does lots of lookups in the dictionary. Be aware: if you supply arbitrary Sets or String[] dictionaries to the constructors, they are automatically transformed to case-insensitive!
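The two dictionary setups described above differ only in where case folding happens. The sketch below uses plain java.util collections as stand-ins for CharArraySet (it does not call any Lucene API, and the class and method names are hypothetical) to show why a lowercase-only dictionary works only when the input has already been lowercased.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-ins for the two dictionary setups; not Lucene classes.
final class DictionaryCaseSketch {

    /** Case-insensitive dictionary: every lookup ignores case. */
    static Set<String> caseInsensitive(String... entries) {
        Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(Arrays.asList(entries));
        return set;
    }

    /** Case-sensitive dictionary that holds only lowercased entries. */
    static Set<String> lowercaseOnly(String... entries) {
        Set<String> set = new HashSet<>();
        for (String entry : entries) {
            set.add(entry.toLowerCase(Locale.ROOT));
        }
        return set;
    }

    public static void main(String[] args) {
        Set<String> insensitive = caseInsensitive("Donau", "Dampf", "Schiff");
        Set<String> lowercase = lowercaseOnly("Donau", "Dampf", "Schiff");

        System.out.println(insensitive.contains("SCHIFF")); // true: case is ignored
        System.out.println(lowercase.contains("Schiff"));   // false: input was not lowercased first
        System.out.println(lowercase.contains("schiff"));   // true: input already lowercased
    }
}
```

In the second setup the lowercasing is done once per token upstream (the role LowerCaseFilter plays in the analysis chain) rather than on every dictionary lookup.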
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
CompoundWordTokenFilterBase.CompoundToken
-
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
-
-
Field Summary
-
Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, offsetAtt, onlyLongestMatch, termAtt, tokens
-
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
-
-
Constructor Summary
Constructors (each constructor is followed by its description)
DictionaryCompoundWordTokenFilter(TokenStream input, String[] dictionary)
Deprecated.
DictionaryCompoundWordTokenFilter(TokenStream input, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Deprecated.
DictionaryCompoundWordTokenFilter(TokenStream input, Set dictionary)
Deprecated.
DictionaryCompoundWordTokenFilter(TokenStream input, Set dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Deprecated.
DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, String[] dictionary)
Deprecated. Use the constructors taking Set.
DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Deprecated. Use the constructors taking Set.
DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, Set<?> dictionary)
Creates a new DictionaryCompoundWordTokenFilter.
DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, Set<?> dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Creates a new DictionaryCompoundWordTokenFilter.
-
Method Summary
Modifier and Type, Method, Description
protected void decompose()
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list.
-
Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
incrementToken, makeDictionary, reset
-
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
-
-
-
-
Constructor Detail
-
DictionaryCompoundWordTokenFilter
@Deprecated public DictionaryCompoundWordTokenFilter(TokenStream input, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Deprecated. Creates a new DictionaryCompoundWordTokenFilter.
- Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
-
DictionaryCompoundWordTokenFilter
@Deprecated public DictionaryCompoundWordTokenFilter(TokenStream input, String[] dictionary)
Deprecated. Creates a new DictionaryCompoundWordTokenFilter.
- Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against
-
DictionaryCompoundWordTokenFilter
@Deprecated public DictionaryCompoundWordTokenFilter(TokenStream input, Set dictionary)
Deprecated. Creates a new DictionaryCompoundWordTokenFilter.
- Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and must contain only lowercase strings.
-
DictionaryCompoundWordTokenFilter
@Deprecated public DictionaryCompoundWordTokenFilter(TokenStream input, Set dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Deprecated. Creates a new DictionaryCompoundWordTokenFilter.
- Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and must contain only lowercase strings.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
-
DictionaryCompoundWordTokenFilter
@Deprecated public DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Deprecated. Use the constructors taking Set.
Creates a new DictionaryCompoundWordTokenFilter.
- Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
-
DictionaryCompoundWordTokenFilter
@Deprecated public DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, String[] dictionary)
Deprecated. Use the constructors taking Set.
Creates a new DictionaryCompoundWordTokenFilter.
- Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against
-
DictionaryCompoundWordTokenFilter
public DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, Set<?> dictionary)
Creates a new DictionaryCompoundWordTokenFilter.
- Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against
-
DictionaryCompoundWordTokenFilter
public DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, Set<?> dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Creates a new DictionaryCompoundWordTokenFilter.
- Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
-
-
Method Detail
-
decompose
protected void decompose()
Description copied from class: CompoundWordTokenFilterBase
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. The original token must not be placed in the list, because it is passed through this filter automatically.
- Specified by:
decompose in class CompoundWordTokenFilterBase
-
-