Class TextStringTokenizer
The tokenizer knows about four different character classes: regular word characters, whitespace characters, sentence delimiters and separator characters. Tokens can consist of
- sequences of word characters and sentence delimiters where the last character is a word character,
- sentence delimiter characters (if they do not precede a word character),
- sequences of whitespace characters,
- and individual separator characters.
The character classes are completely user-definable. By default, the whitespace characters are the Unicode whitespace characters, all other characters are word characters, and the sentence delimiter and separator classes are empty. The classes may overlap. When determining the class of a character, the user-defined classes are considered in the following order: sentence delimiters, then other separators, then whitespace, then word characters. For example, a character defined as both a separator and a whitespace character is treated as a separator.
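The lookup order described above can be sketched as follows. This is a minimal illustration, not the library's code: the class name CharClassSketch, its method names, and the numeric type codes are all hypothetical, and Character.isWhitespace stands in for "Unicode whitespace".

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the class lookup order; not the library's actual code.
class CharClassSketch {
    // Illustrative type codes; the real values of EOS/SEP/WSP/WCH are not given here.
    static final int EOS = 0, SEP = 1, WSP = 2, WCH = 3;

    private final Set<Character> eosChars = new HashSet<>();
    private final Set<Character> sepChars = new HashSet<>();
    private final Set<Character> wspChars = new HashSet<>();
    private final Set<Character> wordChars = new HashSet<>();

    private static void addAll(Set<Character> set, String chars) {
        for (char c : chars.toCharArray()) {
            set.add(c);
        }
    }

    void addEndOfSentence(String chars) { addAll(eosChars, chars); }
    void addSeparators(String chars) { addAll(sepChars, chars); }
    void addWhitespace(String chars) { addAll(wspChars, chars); }
    void addWordChars(String chars) { addAll(wordChars, chars); }

    int classify(char c) {
        // User-defined classes are checked in priority order.
        if (eosChars.contains(c)) return EOS;
        if (sepChars.contains(c)) return SEP;
        if (wspChars.contains(c)) return WSP;
        if (wordChars.contains(c)) return WCH;
        // Defaults (assumption): Java's whitespace test, otherwise a word character.
        return Character.isWhitespace(c) ? WSP : WCH;
    }

    public static void main(String[] args) {
        CharClassSketch s = new CharClassSketch();
        s.addSeparators("-");
        s.addWhitespace("-_");
        System.out.println(s.classify('-') == SEP); // separator wins over whitespace
        System.out.println(s.classify('_') == WSP); // user-defined whitespace
        System.out.println(s.classify('a') == WCH); // default word character
    }
}
```

Checking the sets in a fixed order is what resolves overlapping definitions: the first class that claims a character decides its type.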
By default, the tokenizer returns all tokens, including whitespace, so concatenating the tokens in order recovers the original input text. This behavior can be changed so that whitespace and/or separator tokens are skipped.
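To make the token rules and the round-trip property concrete, here is a small self-contained sketch. It is not the library's implementation: TokenRulesSketch, typeOf, and tokenize are hypothetical names, the character classes are passed in as plain strings, and two points the text leaves open are decided by assumption (each stand-alone sentence delimiter becomes its own token, and the separator flag skips only separator tokens, not sentence delimiters).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the token rules; not the library's actual code.
class TokenRulesSketch {
    static final int EOS = 0, SEP = 1, WSP = 2, WCH = 3; // illustrative codes

    static int typeOf(char c, String eos, String sep) {
        if (eos.indexOf(c) >= 0) return EOS;
        if (sep.indexOf(c) >= 0) return SEP;
        return Character.isWhitespace(c) ? WSP : WCH;
    }

    static List<String> tokenize(String text, String eos, String sep,
                                 boolean showWhitespace, boolean showSeparators) {
        List<String> tokens = new ArrayList<>();
        int n = text.length();
        int i = 0;
        while (i < n) {
            int start = i;
            int type = typeOf(text.charAt(i), eos, sep);
            if (type == SEP) {
                i++; // separators are always single-character tokens
            } else if (type == WSP) {
                while (i < n && typeOf(text.charAt(i), eos, sep) == WSP) i++;
            } else {
                // Scan the run of word and EOS characters, remembering where
                // the last word character ends.
                int lastWordEnd = -1;
                for (int j = i; j < n; j++) {
                    int t = typeOf(text.charAt(j), eos, sep);
                    if (t != WCH && t != EOS) break;
                    if (t == WCH) lastWordEnd = j + 1;
                }
                // A word token ends at its last word character; a delimiter that
                // does not precede a word character is emitted on its own.
                i = (lastWordEnd != -1) ? lastWordEnd : start + 1;
            }
            boolean skip = (type == WSP && !showWhitespace)
                    || (type == SEP && !showSeparators);
            if (!skip) tokens.add(text.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hi, e.g. you.";
        List<String> all = tokenize(text, ".", ",", true, true);
        System.out.println(all);
        // With every token shown, joining them gives back the input.
        System.out.println(String.join("", all).equals(text));
    }
}
```

In this sketch "e.g." splits into the word token "e.g" and a final ".", because the inner delimiter is followed by a word character while the last one is not.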
A tokenizer provides a standard iterator interface similar to StringTokenizer. The validity of the iterator can be queried with hasNext(), and the next token can be queried with nextToken(). In addition, getNextTokenType() returns the type of the token as an integer. Note that you need to call getNextTokenType() before calling nextToken(), since calling nextToken() will advance the iterator.
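The call order matters, as the following sketch illustrates. It is a stand-in written for this example, not the class documented here: IteratorSketch only distinguishes whitespace from word characters, its type codes are hypothetical, and it borrows the method names hasNext(), getNextTokenType(), and nextToken() from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for the iterator protocol; not the library's actual code.
class IteratorSketch {
    static final int WSP = 0, WCH = 1; // illustrative type codes

    private final String text;
    private int pos = 0;

    IteratorSketch(String text) { this.text = text; }

    private static int typeOf(char c) {
        return Character.isWhitespace(c) ? WSP : WCH;
    }

    boolean hasNext() { return pos < text.length(); }

    // Type of the token that the next call to nextToken() will return, or -1.
    int getNextTokenType() {
        return hasNext() ? typeOf(text.charAt(pos)) : -1;
    }

    // Returns the next token and advances past it.
    String nextToken() {
        if (!hasNext()) throw new NoSuchElementException();
        int type = getNextTokenType();
        int start = pos;
        while (pos < text.length() && typeOf(text.charAt(pos)) == type) pos++;
        return text.substring(start, pos);
    }

    static List<String> describe(String text) {
        IteratorSketch it = new IteratorSketch(text);
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            int type = it.getNextTokenType(); // query the type first...
            String token = it.nextToken();    // ...because this call advances
            out.add((type == WSP ? "WSP" : "WCH") + ":" + token);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("ab  c"));
    }
}
```

Swapping the two calls inside the loop would pair each token with the type of the token that follows it.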
Version:
$Id: TextStringTokenizer.java,v 1.6 2003/04/07 14:50:11 goetz Exp $
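Field Summary
- static final int EOS: Sentence delimiter character/word type.
- static final int SEP: Separator character/word type.
- static final int WSP: Whitespace character/word type.
- static final int WCH: Word character/word type.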
Constructor Summary
- TextStringTokenizer(String string): Construct a tokenizer from a Java string.
Method Summary
- void addSeparators(String chars): Add to the set of separator characters.
- void addToEndOfSentenceChars(String chars): Add to the set of sentence delimiters.
- void addWhitespaceChars(String chars): Add to the set of whitespace characters.
- void addWordChars(String chars): Add to the set of word characters.
- int getCharType(char c): Get the type of an individual character.
- getToken(): Return the next token.
- int getTokenEnd(): Get the end of the token.
- int getTokenStart(): Get the start of the token.
- int getTokenType(): Get the type of the token returned by the next call to nextToken().
- boolean isValid(): Return true iff there is a next token.
- void setEndOfSentenceChars(String chars): Set the set of sentence delimiters.
- void setSeparators(String chars): Set the set of separator characters.
- void setShowSeparators(boolean b): Set the flag for showing separator tokens.
- void setShowWhitespace(boolean b): Set the flag for showing whitespace tokens.
- void setToFirst(): Reset the tokenizer at any time.
- void setToNext(): Compute the next token.
- void setWhitespaceChars(String chars): Set the set of whitespace characters (in addition to the Unicode whitespace chars).
- void setWordChars(String chars): Set the set of word characters.
Field Details
EOS
public static final int EOS
Sentence delimiter character/word type.
SEP
public static final int SEP
Separator character/word type.
WSP
public static final int WSP
Whitespace character/word type.
WCH
public static final int WCH
Word character/word type.
Constructor Details
TextStringTokenizer
Construct a tokenizer from a Java string.
Parameters:
string - The string to tokenize.
Method Details
setShowWhitespace
public void setShowWhitespace(boolean b)
Set the flag for showing whitespace tokens.
Parameters:
b - The whitespace flag.
setShowSeparators
public void setShowSeparators(boolean b)
Set the flag for showing separator tokens.
Parameters:
b - The flag.
setEndOfSentenceChars
Set the set of sentence delimiters.
Parameters:
chars - A string containing EOS chars.
addToEndOfSentenceChars
Add to the set of sentence delimiters.
Parameters:
chars - A string containing EOS chars.
setSeparators
Set the set of separator characters.
Parameters:
chars - The separator chars.
addSeparators
Add to the set of separator characters.
Parameters:
chars - Separator chars.
setWhitespaceChars
Set the set of whitespace characters (in addition to the Unicode whitespace chars).
Parameters:
chars - Whitespace chars.
addWhitespaceChars
Add to the set of whitespace characters.
Parameters:
chars - Whitespace chars.
setWordChars
Set the set of word characters.
Parameters:
chars - Word chars.
addWordChars
Add to the set of word characters.
Parameters:
chars - Word chars.
getTokenType
public int getTokenType()
Get the type of the token returned by the next call to nextToken().
Returns:
The token type, or -1 if there is no next token.
isValid
public boolean isValid()
Return true iff there is a next token.
Returns:
true iff there is a next token.
setToFirst
public void setToFirst()
Reset the tokenizer at any time.
getToken
Return the next token.
Returns:
The next token.
getTokenStart
public int getTokenStart()
Get the start of the token.
Returns:
The start of the token.
getTokenEnd
public int getTokenEnd()
Get the end of the token.
Returns:
The token end.
setToNext
public void setToNext()
Compute the next token.
getCharType
public int getCharType(char c)
Get the type of an individual character.
Parameters:
c
Returns:
The char type.