Class TextTokenizer
The tokenizer knows about four different character classes: regular word characters, whitespace characters, sentence delimiters and separator characters. Tokens can consist of
- sequences of word characters and sentence delimiters where the last character is a word character,
- sentence delimiter characters (if they do not precede a word character),
- sequences of whitespace characters,
- and individual separator characters.
The character classes are completely user-definable. By default, the whitespace characters are the Unicode whitespace characters, and all other characters are word characters. The sentence delimiter and separator classes are empty by default. The classes may have non-empty intersections. When determining the class of a character, the user-defined classes are considered in the following order: sentence delimiters, then other separators, then whitespace, then word characters. That is, if a character is defined to be both a separator and a whitespace character, it is treated as a separator.
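The precedence rule above can be illustrated with a minimal sketch. This is not the TextTokenizer source: the class name `CharClassSketch`, the `classify` helper, and the numeric type codes are hypothetical, and Java's `Character.isWhitespace` is assumed as a stand-in for "Unicode whitespace". User-defined word characters are not modeled.

```java
// Hypothetical sketch of the class-precedence rule; not the TextTokenizer source.
final class CharClassSketch {
    // Illustrative type codes only; the actual values of EOS/SEP/WSP/WCH are not given here.
    static final int EOS = 0, SEP = 1, WSP = 2, WCH = 3;

    // Checks the classes in the documented order:
    // sentence delimiter, then separator, then whitespace, then word character.
    static int classify(char c, String eosChars, String sepChars, String extraWhitespace) {
        if (eosChars.indexOf(c) >= 0) return EOS;
        if (sepChars.indexOf(c) >= 0) return SEP;
        // Assumption: Character.isWhitespace approximates "Unicode whitespace".
        if (extraWhitespace.indexOf(c) >= 0 || Character.isWhitespace(c)) return WSP;
        return WCH; // everything else is a word character
    }

    public static void main(String[] args) {
        // ';' is listed both as a separator and as extra whitespace; the separator class wins.
        System.out.println(classify(';', ".", ";", ";") == SEP);
    }
}
```

Checking the classes in a fixed order is what makes overlapping definitions unambiguous: the first class that contains the character decides its type.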
By default, the tokenizer returns all tokens, including whitespace. That is, concatenating the sequence of tokens recovers the original input text. This behavior can be changed so that whitespace and/or separator tokens are skipped.
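The token rules and the round-trip property can be sketched as follows. This is one reading of the description, not the actual implementation: `TokenRulesSketch`, `tokenize`, and the type codes are hypothetical, `Character.isWhitespace` is assumed for whitespace, and a run of sentence delimiters that is not followed by a word character is assumed to yield one token per delimiter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the token rules; not the TextTokenizer source.
final class TokenRulesSketch {
    static final int EOS = 0, SEP = 1, WSP = 2, WCH = 3; // illustrative codes

    static int classify(char c, String eos, String sep) {
        if (eos.indexOf(c) >= 0) return EOS;
        if (sep.indexOf(c) >= 0) return SEP;
        return Character.isWhitespace(c) ? WSP : WCH;
    }

    static List<String> tokenize(String text, String eos, String sep,
                                 boolean showWhitespace, boolean showSeparators) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int start = i;
            int type = classify(text.charAt(i), eos, sep);
            if (type == SEP) {
                i++; // each separator is a token of its own
            } else if (type == WSP) {
                // a whitespace token is a maximal run of whitespace
                while (i < text.length() && classify(text.charAt(i), eos, sep) == WSP) i++;
            } else {
                // scan word characters and sentence delimiters; remember where the last word char ends
                int lastWordEnd = -1;
                while (i < text.length()) {
                    int t = classify(text.charAt(i), eos, sep);
                    if (t == WCH) lastWordEnd = i + 1;
                    else if (t != EOS) break;
                    i++;
                }
                // a word token must end in a word character; otherwise emit a single delimiter
                i = (lastWordEnd >= 0) ? lastWordEnd : start + 1;
            }
            boolean show = (type == WSP) ? showWhitespace : (type != SEP || showSeparators);
            if (show) tokens.add(text.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("U.S.A. is big.", ".", "", true, true);
        System.out.println(tokens);
        System.out.println(String.join("", tokens).equals("U.S.A. is big."));
    }
}
```

With `.` as the only sentence delimiter, the input `U.S.A. is big.` splits into `U.S.A`, `.`, a space, `is`, a space, `big`, and `.` under this reading, and joining the tokens gives back the input. Passing `false` for either flag drops the corresponding tokens, at which point the round trip no longer holds.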
A tokenizer provides a standard iterator interface similar to StringTokenizer. Whether another token is available can be queried with hasNext(), and the next token can be retrieved with nextToken(). In addition, getNextTokenType() returns the type of the next token as an integer. Note that you need to call getNextTokenType() before calling nextToken(), since calling nextToken() advances the iterator.
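The calling order matters, so a short sketch of the loop may help. To keep the example self-contained it does not use TextTokenizer itself: `TokenSource` is a hypothetical interface that mirrors the three iterator methods, `stub` is a trivial stand-in over pre-split tokens, the `String` return type of `nextToken()` is an assumption, and the numeric type codes are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the iteration protocol; not the TextTokenizer source.
final class TokenLoopSketch {
    // Stand-in mirroring hasNext(), getNextTokenType() and nextToken().
    interface TokenSource {
        boolean hasNext();
        int getNextTokenType(); // -1 when there is no next token
        String nextToken();     // assumed to return a String; advances the iterator
    }

    // Trivial stub over pre-split tokens, used only to exercise the loop.
    static TokenSource stub(final String[] tokens, final int[] types) {
        return new TokenSource() {
            private int pos = 0;
            public boolean hasNext() { return pos < tokens.length; }
            public int getNextTokenType() { return hasNext() ? types[pos] : -1; }
            public String nextToken() { return tokens[pos++]; }
        };
    }

    static List<String> describe(TokenSource source) {
        List<String> out = new ArrayList<>();
        while (source.hasNext()) {
            int type = source.getNextTokenType(); // query the type first ...
            String token = source.nextToken();    // ... then consume the token
            out.add(type + ":" + token);
        }
        return out;
    }

    public static void main(String[] args) {
        TokenSource s = stub(new String[] {"Hi", " ", "there"}, new int[] {3, 2, 3});
        System.out.println(describe(s));
    }
}
```

Swapping the two calls inside the loop would pair each token with the type of the token that follows it, and the last token would be paired with -1.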
- Version:
- $Id: TextTokenizer.java,v 1.1 2002/09/30 19:09:09 goetz Exp $
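
Field Summary
- static final int EOS: Sentence delimiter character/word type.
- static final int SEP: Separator character/word type.
- static final int WSP: Whitespace character/word type.
- static final int WCH: Word character/word type.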

Constructor Summary
- TextTokenizer(String string): Construct a tokenizer from a Java string.
- TextTokenizer(CharArrayString string): Construct a tokenizer from a CharArrayString.

Method Summary
- void addSeparators(String chars): Add to the set of separator characters.
- void addToEndOfSentenceChars(String chars): Add to the set of sentence delimiters.
- void addWhitespaceChars(String chars): Add to the set of whitespace characters.
- void addWordChars(String chars): Add to the set of word characters.
- int getCharType(char c): Get the type of an individual character.
- int getNextTokenType(): Get the type of the token returned by the next call to nextToken().
- boolean hasNext()
- void setEndOfSentenceChars(String chars): Set the set of sentence delimiters.
- void setSeparators(String chars): Set the set of separator characters.
- void setShowSeparators(boolean b): Set the flag for showing separator tokens.
- void setShowWhitespace(boolean b): Set the flag for showing whitespace tokens.
- void setWhitespaceChars(String chars): Set the set of whitespace characters (in addition to the Unicode whitespace chars).
- void setWordChars(String chars): Set the set of word characters.

Field Details

EOS
public static final int EOS
Sentence delimiter character/word type.

SEP
public static final int SEP
Separator character/word type.

WSP
public static final int WSP
Whitespace character/word type.

WCH
public static final int WCH
Word character/word type.

Constructor Details

TextTokenizer(CharArrayString string)
Construct a tokenizer from a CharArrayString.
- Parameters:
- string - The string to tokenize.

TextTokenizer(String string)
Construct a tokenizer from a Java string.
- Parameters:
- string - The string to tokenize.

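Method Details

setShowWhitespace
public void setShowWhitespace(boolean b)
Set the flag for showing whitespace tokens.
- Parameters:
- b - If true, whitespace tokens are returned; if false, they are skipped.

setShowSeparators
public void setShowSeparators(boolean b)
Set the flag for showing separator tokens.
- Parameters:
- b - If true, separator tokens are returned; if false, they are skipped.
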
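setEndOfSentenceChars
public void setEndOfSentenceChars(String chars)
Set the set of sentence delimiters.
- Parameters:
- chars - The characters to use as sentence delimiters.

addToEndOfSentenceChars
public void addToEndOfSentenceChars(String chars)
Add to the set of sentence delimiters.
- Parameters:
- chars - The characters to add as sentence delimiters.

setSeparators
public void setSeparators(String chars)
Set the set of separator characters.
- Parameters:
- chars - The characters to use as separators.

addSeparators
public void addSeparators(String chars)
Add to the set of separator characters.
- Parameters:
- chars - The characters to add as separators.

setWhitespaceChars
public void setWhitespaceChars(String chars)
Set the set of whitespace characters (in addition to the Unicode whitespace chars).
- Parameters:
- chars - The additional characters to treat as whitespace.

addWhitespaceChars
public void addWhitespaceChars(String chars)
Add to the set of whitespace characters.
- Parameters:
- chars - The characters to add as whitespace.

setWordChars
public void setWordChars(String chars)
Set the set of word characters.
- Parameters:
- chars - The characters to use as word characters.

addWordChars
public void addWordChars(String chars)
Add to the set of word characters.
- Parameters:
- chars - The characters to add as word characters.
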
getNextTokenType
public int getNextTokenType()
Get the type of the token returned by the next call to nextToken().
- Returns:
- The token type, or -1 if there is no next token.

hasNext
public boolean hasNext()
- Returns:
- true if and only if there is a next token.

nextToken
- Returns:
- The next token.

getCharType
public int getCharType(char c)
Get the type of an individual character.
- Parameters:
- c - The character to classify.
- Returns:
- The type of the character: one of EOS, SEP, WSP, or WCH.