public class DictionaryCompoundWordTokenFilter extends CompoundWordTokenFilterBase
A TokenFilter that decomposes compound words found in many Germanic languages. "Donaudampfschiff" becomes "Donau", "dampf", and "schiff", so that "Donaudampfschiff" can be found even when the query contains only "schiff". It uses a brute-force algorithm to achieve this.
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, onlyLongestMatch, tokens
| Constructor and Description |
|---|
| DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, java.util.Set dictionary) |
| DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, java.util.Set dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch) |
| DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, java.lang.String[] dictionary) |
| DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, java.lang.String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch) |
| Modifier and Type | Method and Description |
|---|---|
| protected void | decomposeInternal(org.apache.lucene.analysis.Token token) |
addAllLowerCase, createToken, decompose, incrementToken, makeDictionary, makeLowerCaseCopy, next, next, reset
getOnlyUseNewAPI, setOnlyUseNewAPI
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
public DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, java.lang.String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)

Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream

public DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, java.lang.String[] dictionary)

Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against

public DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, java.util.Set dictionary)

Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have ignoreCase=false and contain only lower-case strings.

public DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, java.util.Set dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)

Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have ignoreCase=false and contain only lower-case strings.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream

protected void decomposeInternal(org.apache.lucene.analysis.Token token)

decomposeInternal in class CompoundWordTokenFilterBase
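The effect of the onlyLongestMatch flag can be illustrated with a self-contained sketch. This is hypothetical code, not the Lucene implementation (the class name, signature, and boundary handling are assumptions): at each start offset, either every dictionary match is emitted, or only the longest one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the onlyLongestMatch flag: per start offset,
// emit every dictionary match, or collect only the longest match there.
public class LongestMatchSketch {
    static List<String> decompose(String word, Set<String> dictionary,
                                  int minSubwordSize, int maxSubwordSize,
                                  boolean onlyLongestMatch) {
        List<String> subwords = new ArrayList<>();
        String lower = word.toLowerCase();
        for (int start = 0; start < lower.length(); start++) {
            String longest = null;
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= lower.length(); len++) {
                String candidate = lower.substring(start, start + len);
                if (dictionary.contains(candidate)) {
                    if (onlyLongestMatch) {
                        longest = candidate;     // remember only the longest match
                    } else {
                        subwords.add(candidate); // emit every match
                    }
                }
            }
            if (longest != null) {
                subwords.add(longest);
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("schiff", "schifffahrt", "fahrt");
        // prints [schiff, schifffahrt, fahrt]
        System.out.println(decompose("Schifffahrt", dict, 4, 15, false));
        // prints [schifffahrt, fahrt]
        System.out.println(decompose("Schifffahrt", dict, 4, 15, true));
    }
}
```

With the flag set, the shorter overlapping match "schiff" is suppressed in favour of "schifffahrt", which reduces index noise from nested dictionary entries.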
Copyright © 2000-2016 Apache Software Foundation. All Rights Reserved.