Class OptimizeStrings

java.lang.Object
org.apache.uima.util.impl.OptimizeStrings

public class OptimizeStrings extends Object
Shares a common underlying char[] among strings, optimizing a set of strings for efficient storage. Apply it to a set of strings.

Lifecycle:
1) Make an instance of this class.
2) Call .add(String or String[]) for all strings in the set.
   Or: skip 1 and 2 and pass in a String[] that can be modified (sorted).
3) Call .optimize().
4) Call any of:
   .getString(String) or .getStringArray(String[]) - returns new string objects
   .updateStringArray(String[]) - replaces the original Strings
   .getOffset(String) - returns the offset in some common string
   .getCommonStringIndex(int)
5) Call getCommonStrings to get all the common strings.
6) Let the GC collect the instance of this class.

The strings are first added to a big list, instead of directly to a StringBuilder, in order to improve reuse by sorting the longest strings first.

The common strings are kept in an array. There can be more than one, to support very large amounts of data, in excess of 2GB.

Nulls passed in as strings are mostly ignored, but handled appropriately.
  • Constructor Details

    • OptimizeStrings

      public OptimizeStrings(boolean doMeasurement)
    • OptimizeStrings

      public OptimizeStrings(boolean doMeasurement, int splitSize)
  • Method Details

    • getSavedCharsExact

      public long getSavedCharsExact()
      The number of characters saved - for statistical reporting only
      Returns:
      the number of characters saved
    • getSavedCharsSubstr

      public long getSavedCharsSubstr()
    • getCommonStrings

      public String[] getCommonStrings()
      The list of common strings
      Returns:
      the list of common strings
    • add

      public void add(String s)
      Null strings are not added; zero-length strings are added.
      Parameters:
      s - the string to add
    • add

      public void add(String[] sa)
    • getStringIndex

      public int getStringIndex(String s)
    • getIndexOrSeqIndex

      public int getIndexOrSeqIndex(String s)
      Parameters:
      s - must not be null
      Returns:
      a non-negative or a negative number. If non-negative, it is the offset in the common string. If negative (call it v), -v is a 1-based index that increases sequentially for each new unique string fetched using this method.
    • getString

      public String getString(String s)
      Returns a string that is made as a substring of the common string.
      Parameters:
      s - the string to look up
      Returns:
      an equal string, made as a substring of the common string instance; equal results return the same string
    • getOffset

      public long getOffset(String s)
    • getOffset

      public int getOffset(int i)
    • getCommonStringIndex

      public int getCommonStringIndex(int index)
      Parameters:
      index - an index (not an offset) into the sorted strings
      Returns:
      the index of the segment it belongs to
    • getStringArray

      public String[] getStringArray(String[] sai)
    • updateStringArray

      public void updateStringArray(String[] sa)
    • optimize

      public void optimize()
      Fully checking indexOf for every new string is prohibitively expensive. We do a partial check instead, only checking whether a string is a substring of the previous one.
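
The partial check described above can be illustrated with a small, self-contained sketch. This is not the OptimizeStrings implementation: the class SharedCharsSketch, its nested Packed type, and the pack method are hypothetical names, and the sort order (reverse natural ordering, which places a string ahead of its own prefixes) is an assumption made for the illustration. It also builds a single common string and ignores the splitting into multiple segments.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the UIMA class: packs strings into one common string.
final class SharedCharsSketch {

    /** One common string plus the offset of each distinct input string in it. */
    static final class Packed {
        final String common;
        final Map<String, Integer> offsets;

        Packed(String common, Map<String, Integer> offsets) {
            this.common = common;
            this.offsets = offsets;
        }

        /** Rebuilds an equal string as a substring of the common string. */
        String get(String s) {
            int start = offsets.get(s);
            return common.substring(start, start + s.length());
        }
    }

    static Packed pack(List<String> input) {
        List<String> sorted = new ArrayList<>(input);
        sorted.removeIf(s -> s == null);            // nulls are not added
        // Assumption: reverse natural order, so "abcdef" precedes its prefix "abc".
        sorted.sort(Collections.reverseOrder());

        StringBuilder sb = new StringBuilder();
        Map<String, Integer> offsets = new LinkedHashMap<>();
        String prev = null;
        int prevOffset = 0;

        for (String s : sorted) {
            if (offsets.containsKey(s)) {
                continue;                           // exact duplicate, already placed
            }
            // Partial check: look only inside the previous string, not the whole buffer.
            int within = (prev == null) ? -1 : prev.indexOf(s);
            int offset;
            if (within >= 0) {
                offset = prevOffset + within;       // reuse chars of the previous string
            } else {
                offset = sb.length();               // no reuse possible, append
                sb.append(s);
            }
            offsets.put(s, offset);
            prev = s;
            prevOffset = offset;
        }
        return new Packed(sb.toString(), offsets);
    }
}
```

For the input "abc", "abcdef", "bcd", "xyz", "abc", this sketch produces the common string "xyzbcdabcdef": "abc" reuses the first three characters of "abcdef", while "bcd" is appended separately even though it also occurs inside "abcdef", because only the immediately preceding string is checked. That missed reuse is the cost of avoiding a full indexOf scan for every string.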