Class DefaultICUTokenizerConfig


  • public class DefaultICUTokenizerConfig
    extends ICUTokenizerConfig
    Default ICUTokenizerConfig that is generally applicable to many languages.

    Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:

    • Thai text is broken into words with a DictionaryBasedBreakIterator.
    • Lao, Myanmar, and Khmer text is broken into syllables based on custom BreakIterator rules.
    • Hebrew text has custom tailorings to handle special cases involving punctuation.
    WARNING: This API is experimental and might change in incompatible ways in the next release.
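    The default word-boundary behaviour described above can be approximated without any Lucene or ICU4J dependency. The sketch below is only an illustration: it assumes the JDK's java.text.BreakIterator as a stand-in for com.ibm.icu.text.BreakIterator, and the class name WordBreakSketch and method words are hypothetical. It applies none of the Thai, Lao, Myanmar, Khmer, or Hebrew tailorings.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: uses the JDK word BreakIterator
// as a dependency-free stand-in for ICU's BreakIterator.getWordInstance(ULocale.ROOT).
final class WordBreakSketch {

    /** Splits text at word boundaries and keeps segments that contain a letter or digit. */
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Drop segments made only of whitespace or punctuation.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42"));
    }
}
```

    The JDK iterator and ICU's iterator may place boundaries differently for some scripts, so treat this purely as a picture of the boundary-walking loop, not as a substitute for this class.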
    • Field Detail

      • WORD_IDEO

        public static final String WORD_IDEO
        Token type for words containing ideographic characters
      • WORD_HIRAGANA

        public static final String WORD_HIRAGANA
        Token type for words containing Japanese hiragana
      • WORD_KATAKANA

        public static final String WORD_KATAKANA
        Token type for words containing Japanese katakana
      • WORD_HANGUL

        public static final String WORD_HANGUL
        Token type for words containing Korean hangul
      • WORD_LETTER

        public static final String WORD_LETTER
        Token type for words that contain letters
      • WORD_NUMBER

        public static final String WORD_NUMBER
        Token type for words that appear to be numbers
    • Constructor Detail

      • DefaultICUTokenizerConfig

        public DefaultICUTokenizerConfig()
    • Method Detail

      • getBreakIterator

        public com.ibm.icu.text.BreakIterator getBreakIterator(int script)
        Description copied from class: ICUTokenizerConfig
        Return a BreakIterator capable of processing a given script.
        Specified by:
        getBreakIterator in class ICUTokenizerConfig
      • getType

        public String getType(int script,
                              int ruleStatus)
        Description copied from class: ICUTokenizerConfig
        Return a token type value for a given script and BreakIterator rule status.
        Specified by:
        getType in class ICUTokenizerConfig
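    One way to picture how a (script, rule status) pair could map onto the token types listed under Field Detail is sketched below. This is an assumption-laden illustration, not the class's actual implementation: the Status enum is a hypothetical stand-in for the BreakIterator rule-status integers, java.lang.Character.UnicodeScript stands in for ICU's integer script codes, and the returned strings are just the field names, since this page does not give the constants' values.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical sketch, not Lucene code: maps a script plus a coarse
// rule-status category to one of the token type names documented above.
final class TokenTypeSketch {

    // Assumed stand-in for BreakIterator rule-status categories.
    enum Status { NONE, NUMBER, LETTER, KANA, IDEO }

    static String type(UnicodeScript script, Status status) {
        switch (status) {
            case IDEO:
                return "WORD_IDEO";
            case KANA:
                // Assumed: the script distinguishes hiragana from katakana.
                return script == UnicodeScript.HIRAGANA ? "WORD_HIRAGANA" : "WORD_KATAKANA";
            case LETTER:
                // Assumed: hangul gets its own type; other letters share one.
                return script == UnicodeScript.HANGUL ? "WORD_HANGUL" : "WORD_LETTER";
            case NUMBER:
                return "WORD_NUMBER";
            default:
                return "OTHER"; // placeholder for anything unclassified
        }
    }

    public static void main(String[] args) {
        System.out.println(type(UnicodeScript.HANGUL, Status.LETTER));
        System.out.println(type(UnicodeScript.LATIN, Status.NUMBER));
    }
}
```

    A custom ICUTokenizerConfig subclass would implement getType with the real int arguments shown in the signature above; the enum here only keeps the sketch self-contained.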