Package org.apache.uima.util.impl
Class OptimizeStrings
java.lang.Object
org.apache.uima.util.impl.OptimizeStrings
Share common underlying char[] among strings: Optimize sets of strings for efficient storage
Apply it to a set of strings.
Lifecycle:
1) make an instance of this class
2) call .add (String or String[]) for all strings in the set
or: skip 1 and 2 and pass in a String[] that can be modified (sorted)
3) call .optimize()
4) call any of:
   .getString(String) or .getStringArray(String[]) - returns new string objects
   .updateStringArray(String[]) - replaces the original Strings
   .getOffset(String) - returns int offset in some common string
   .getCommonStringIndex(String)
5) call getCommonStrings to get all the common strings
6) Let the GC collect the instance of this class
The strings are first added to a big list, instead of directly to a
StringBuilder, in order to improve reuse by sorting the longest strings first.
The commonStrings are kept in an array. There may be more than one, to support
very large amounts of string data, in excess of 2GB.
Nulls, passed in as strings, are mostly ignored, but handled appropriately.
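The idea of keeping several common strings can be sketched as follows. This is a minimal illustration, not the class's implementation: SegmentedPacker, Location, and the reading of a split size as "maximum characters per segment" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: packs strings into several bounded "common string" segments. */
final class SegmentedPacker {

    /** Where one packed string lives: which segment, and at what offset inside it. */
    static final class Location {
        final int segmentIndex;
        final int offset;

        Location(int segmentIndex, int offset) {
            this.segmentIndex = segmentIndex;
            this.offset = offset;
        }
    }

    // Assumption: the split size caps the number of chars held by one segment.
    private final int maxSegmentChars;
    private final List<StringBuilder> segments = new ArrayList<>();

    SegmentedPacker(int maxSegmentChars) {
        this.maxSegmentChars = maxSegmentChars;
        segments.add(new StringBuilder());
    }

    /** Appends s to the current segment, opening a new segment if it would overflow. */
    Location append(String s) {
        StringBuilder current = segments.get(segments.size() - 1);
        if (current.length() > 0 && current.length() + s.length() > maxSegmentChars) {
            current = new StringBuilder();
            segments.add(current);
        }
        int offset = current.length();
        current.append(s);
        return new Location(segments.size() - 1, offset);
    }

    /** Snapshot of all segments, analogous to an array of common strings. */
    String[] commonStrings() {
        String[] result = new String[segments.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = segments.get(i).toString();
        }
        return result;
    }

    public static void main(String[] args) {
        SegmentedPacker packer = new SegmentedPacker(8);
        packer.append("hello");
        packer.append("abc");
        Location third = packer.append("xy"); // does not fit, so a new segment starts
        System.out.println(third.segmentIndex + " @ " + third.offset);
        System.out.println(String.join(" | ", packer.commonStrings()));
    }
}
```

With this layout each string is identified by a pair (segment index, offset within that segment), which is the kind of information the offset and common-string-index accessors described on this page report.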
Constructor Summary
Constructor | Description
OptimizeStrings(boolean doMeasurement) |
OptimizeStrings(boolean doMeasurement, int splitSize) |
Method Summary
Modifier and Type | Method | Description
void | add | null strings are not added; 0-length strings are added
void | add |
int | getCommonStringIndex(int index) |
String[] | getCommonStrings | The list of common strings
int | getIndexOrSeqIndex |
int | getOffset(int i) |
long | getOffset |
long | getSavedCharsExact() | The number of characters saved - for statistical reporting only
long | getSavedCharsSubstr() |
String | getString | Returns a string which is made as a substring of the common string
String[] | getStringArray(String[] sai) |
int | getStringIndex |
void | optimize() | Fully checking indexOf for every new string is prohibitively expensive. We do a partial check: only checking whether a string is a substring of the previous one
void | updateStringArray(String[] sa) |
Constructor Details
OptimizeStrings
public OptimizeStrings(boolean doMeasurement)
OptimizeStrings
public OptimizeStrings(boolean doMeasurement, int splitSize)
Method Details
getSavedCharsExact
public long getSavedCharsExact()
The number of characters saved - for statistical reporting only
Returns:
the number of characters saved
getSavedCharsSubstr
public long getSavedCharsSubstr()
getCommonStrings
The list of common strings
Returns:
the list of common strings
add
null strings are not added; 0-length strings are added
Parameters:
s -
add
getStringIndex
getIndexOrSeqIndex
Parameters:
s - must not be null
Returns:
a non-negative or a negative number. If non-negative, it is the offset in the common string. If negative (call the value v), -v is an index, starting at 1, that increases sequentially for each new unique string fetched using this method.
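The sign convention of that return value can be sketched as below. This is a hedged illustration only: OffsetOrSequence, its maps, and the choice to hand a repeated unknown string the same negative number are assumptions for the example, not details confirmed by this page.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of an "offset, or negative sequence index" return value. */
final class OffsetOrSequence {

    // Strings that already have a place in the common string, with their offsets.
    private final Map<String, Integer> knownOffsets;
    // Strings that do not; each unique one gets the next index 1, 2, 3, ...
    private final Map<String, Integer> sequenceIndexes = new HashMap<>();

    OffsetOrSequence(Map<String, Integer> knownOffsets) {
        this.knownOffsets = new HashMap<>(knownOffsets);
    }

    /** Returns an offset (>= 0), or -(sequence index) for a string with no offset. */
    int indexOrSeqIndex(String s) {
        Integer offset = knownOffsets.get(s);
        if (offset != null) {
            return offset;
        }
        Integer seq = sequenceIndexes.get(s);
        if (seq == null) {
            seq = sequenceIndexes.size() + 1; // sequence indexes start at 1
            sequenceIndexes.put(s, seq);
        }
        return -seq; // assumption: a repeated string gets the same negative value
    }

    /** True when the value is an offset into the common string. */
    static boolean isOffset(int value) {
        return value >= 0;
    }

    public static void main(String[] args) {
        Map<String, Integer> offsets = new HashMap<>();
        offsets.put("alpha", 0);
        offsets.put("beta", 5);
        OffsetOrSequence lookup = new OffsetOrSequence(offsets);
        System.out.println(lookup.indexOrSeqIndex("beta"));
        System.out.println(lookup.indexOrSeqIndex("gamma"));
    }
}
```

Starting the sequence at 1 rather than 0 keeps the two cases distinguishable, because offset 0 is a valid non-negative result and -0 would collide with it.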
getString
Returns a string which is made as a substring of the common string.
Parameters:
s -
Returns:
an equal string, made as a substring of the common string instance; equal results return the same string
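A minimal sketch of that behavior follows, under stated assumptions: CommonSubstrings and its fields are hypothetical names, and "equal results return the same string" is read here as "equal inputs yield the same String instance", which this page does not spell out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: hands out substrings of one common string, one instance per value. */
final class CommonSubstrings {

    private final String common;                 // the packed common string
    private final Map<String, Integer> offsets;  // where each packed string starts
    private final Map<String, String> canonical = new HashMap<>();

    CommonSubstrings(String common, Map<String, Integer> offsets) {
        this.common = common;
        this.offsets = new HashMap<>(offsets);
    }

    /** Returns a string equal to s, cut out of the common string and cached. */
    String get(String s) {
        if (s == null) {
            return null; // nulls pass through untouched in this sketch
        }
        String cached = canonical.get(s);
        if (cached != null) {
            return cached;
        }
        Integer offset = offsets.get(s);
        if (offset == null) {
            throw new IllegalArgumentException("string was not packed: " + s);
        }
        String made = common.substring(offset, offset + s.length());
        canonical.put(s, made);
        return made;
    }

    public static void main(String[] args) {
        Map<String, Integer> offsets = new HashMap<>();
        offsets.put("hello", 0);
        offsets.put("world", 5);
        CommonSubstrings strings = new CommonSubstrings("helloworld", offsets);
        System.out.println(strings.get("world"));
    }
}
```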
getOffset
getOffset
public int getOffset(int i)
getCommonStringIndex
public int getCommonStringIndex(int index)
Parameters:
index - an index (not offset) to the sorted strings
Returns:
the index of the segment it belongs to
getStringArray
updateStringArray
optimize
public void optimize()
Fully checking indexOf for every new string is prohibitively expensive. We do a partial check: only checking whether a string is a substring of the previous one.
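The partial check can be sketched as follows. This is an illustration under assumptions, not the actual optimize() code: PartialCheckPacker is a hypothetical name, the sort is by descending length with ties broken alphabetically, and "the previous one" is taken to mean the string most recently appended to the common buffer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: longest-first sort plus a substring check against the previous string. */
final class PartialCheckPacker {

    /** Result of packing: the common string and the offset of each input string in it. */
    static final class Packed {
        final String common;
        final Map<String, Integer> offsets;

        Packed(String common, Map<String, Integer> offsets) {
            this.common = common;
            this.offsets = offsets;
        }
    }

    static Packed pack(List<String> input) {
        List<String> sorted = new ArrayList<>(input);
        sorted.sort((a, b) -> {
            int byLength = Integer.compare(b.length(), a.length()); // longer strings first
            return byLength != 0 ? byLength : a.compareTo(b);
        });

        StringBuilder common = new StringBuilder();
        Map<String, Integer> offsets = new LinkedHashMap<>();
        String previous = null;   // assumption: the last string actually appended
        int previousOffset = 0;

        for (String s : sorted) {
            if (offsets.containsKey(s)) {
                continue; // exact duplicate, already placed
            }
            // Partial check: one indexOf against the previous string only,
            // never a scan of the whole common buffer.
            int within = (previous == null) ? -1 : previous.indexOf(s);
            if (within >= 0) {
                offsets.put(s, previousOffset + within);
            } else {
                previousOffset = common.length();
                common.append(s);
                offsets.put(s, previousOffset);
                previous = s;
            }
        }
        return new Packed(common.toString(), offsets);
    }

    public static void main(String[] args) {
        Packed packed = pack(List.of("abc", "abcdef", "def", "xyz", "abc"));
        System.out.println(packed.common);
        System.out.println(packed.offsets);
    }
}
```

Each string costs one indexOf call against a single short string, so the work stays roughly proportional to the input size, at the price of missing reuse opportunities that a full search of the common buffer would find.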