Class CasIOUtils

java.lang.Object
org.apache.uima.util.CasIOUtils

public class CasIOUtils extends Object

A collection of static methods aimed at making it easy to

  • save and load CASes, and to
  • optionally include the CAS's Type System (abbreviated TS; only available for Compressed Form 6), and optionally also the CAS's indexes definition.
  • The combination of Type System and Indexes definition is called TSI.
    • The TSI's purpose: to replace the CAS's existing type system and index definition.
    • The TS's purpose: to specify the type system used in the serialized data for format Compressed Form 6, in order to allow deserializing into some other type system in the CAS, leniently.

TSI information can be

  • embedded
  • externally supplied (via another input source to the load)
  • both embedded and externally supplied. In this case the embedded TSI takes precedence.

TS information is available in embedded form for the COMPRESSED_FILTERED_TS format, and can also be taken from embedded or external TSI information (since the TSI contains the type system information).

When an external TSI is supplied while loading Compressed Form 6,

  • for COMPRESSED_FILTERED_TS
    • it uses the embedded TS for decoding
    • it uses the external TSI to replace the CAS's existing type system and index definition if CasLoadMode == REINIT.
  • for COMPRESSED_FILTERED_TSI
    • the external TSI is ignored because the embedded one overrides it; otherwise, loading operates as above.
  • for COMPRESSED_FILTERED
    • the external TSI's type system part is used for decoding.
    • if CasLoadMode == REINIT, the external TSI is also used to replace the CAS's existing type system and index definition.
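As a rough illustration, the three cases above can be modeled as a small lookup. This is a sketch only: the class ExternalTsiForm6Model, its enums, and its methods are hypothetical names introduced here and are not part of UIMA. It assumes an external TSI was supplied and that no type system was passed in explicitly.

```java
// Hypothetical model of the rules listed above; not UIMA code.
class ExternalTsiForm6Model {
    enum Form6 { COMPRESSED_FILTERED, COMPRESSED_FILTERED_TS, COMPRESSED_FILTERED_TSI }
    enum Origin { EMBEDDED_TS, EMBEDDED_TSI, EXTERNAL_TSI, NONE }

    // Which type system information is used to decode the serialized data.
    static Origin decodingSource(Form6 format) {
        switch (format) {
            case COMPRESSED_FILTERED_TS:
                return Origin.EMBEDDED_TS;
            case COMPRESSED_FILTERED_TSI:
                return Origin.EMBEDDED_TSI;
            default:
                return Origin.EXTERNAL_TSI; // plain COMPRESSED_FILTERED
        }
    }

    // Which TSI replaces the CAS's type system and index definition.
    // Without REINIT, nothing is replaced for these three formats.
    static Origin replacementSource(Form6 format, boolean reinit) {
        if (!reinit) {
            return Origin.NONE;
        }
        // The embedded TSI overrides the external one when both exist.
        return format == Form6.COMPRESSED_FILTERED_TSI
                ? Origin.EMBEDDED_TSI
                : Origin.EXTERNAL_TSI;
    }

    public static void main(String[] args) {
        for (Form6 f : Form6.values()) {
            System.out.println(f + ": decode with " + decodingSource(f)
                    + ", REINIT replaces from " + replacementSource(f, true));
        }
    }
}
```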

When loading Compressed Form 6, the type system used for decoding is picked from the first available of these sources, in this order:

  • a passed in type system
  • an embedded TS or TSI
  • an external TSI
  • the CAS's type system
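The precedence order above amounts to a first-match lookup. The following is a minimal sketch of that ordering, not the library's implementation; the class DecodingTypeSystemChoice, the Source enum, and the pick method are hypothetical names used only for illustration.

```java
// Hypothetical model of the precedence order listed above; not UIMA code.
class DecodingTypeSystemChoice {
    enum Source { PASSED_IN, EMBEDDED, EXTERNAL_TSI, CAS }

    // Each flag says whether that source is available for the current load.
    static Source pick(boolean passedIn, boolean embedded, boolean externalTsi) {
        if (passedIn) {
            return Source.PASSED_IN;    // a passed in type system
        }
        if (embedded) {
            return Source.EMBEDDED;     // an embedded TS or TSI
        }
        if (externalTsi) {
            return Source.EXTERNAL_TSI; // an external TSI
        }
        return Source.CAS;              // fall back to the CAS's type system
    }

    public static void main(String[] args) {
        // Embedded and external both present: the embedded one is chosen.
        System.out.println(pick(false, true, true));
        // Nothing supplied: the CAS's own type system is used.
        System.out.println(pick(false, false, false));
    }
}
```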

The serialization formats supported here are specified in the SerialFormat enum.

The load APIs automatically use the appropriate deserializers, based on the input data format.

Loading inputs may be supplied as URLs or as an appropriately buffered InputStream.

Note: you can use Files or Paths by converting these to URLs:

  • URL url = a_path.toUri().toURL();
  • URL url = a_file.toURI().toURL();
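A small self-contained sketch of both conversions, using only the JDK. The class name ToUrlExample and the file name example.xmi are placeholders chosen for illustration. Note that java.nio.file.Path spells the method toUri(), while java.io.File spells it toURI().

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative helper; the class and file names are placeholders.
class ToUrlExample {
    // Path -> URI -> URL
    static URL fromPath(Path path) throws MalformedURLException {
        return path.toUri().toURL();
    }

    // File -> URI -> URL
    static URL fromFile(File file) throws MalformedURLException {
        return file.toURI().toURL();
    }

    public static void main(String[] args) throws MalformedURLException {
        // The file does not need to exist for the conversion to succeed.
        System.out.println(fromPath(Paths.get("example.xmi")));
        System.out.println(fromFile(new File("example.xmi")));
    }
}
```

Either resulting URL can then be passed as the casUrl or tsiUrl argument of the URL-based load methods.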

When loading, an optional CasLoadMode enum value may be specified to indicate

  • LENIENT loading - used with XCas and XMI data sources to silently ignore types and features present in the serialized form, but not in the receiving type system.
  • REINIT - used with Compressed Form 6 loading to indicate that if no embedded TSI information is available, the external TSI is to be used to replace the CAS's existing type system and index definition.

For more details, see the Javadocs for CasLoadMode.

When TS or TSI information is saved, it is either saved in the same destination (e.g. file or stream), or in a separate one.

  • The serialization formats ending in _TSI and _TS support saving the TSI (or TS) in the same destination.
  • The save APIs for other formats can optionally also save the TSI into a separate (second) OutputStream.
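One way to read the two rules above is as a decision on the format name. The sketch below models that reading only; the class TsiDestinationModel, the Destination enum, and the method tsiDestination are hypothetical and not part of UIMA, and the suffix check is an assumption drawn from the wording "ending in _TSI and _TS".

```java
// Hypothetical model of where type information ends up on save; not UIMA code.
class TsiDestinationModel {
    enum Destination { SAME_AS_CAS, SEPARATE_STREAM, NOT_SAVED }

    // formatName is the name of a SerialFormat value, e.g. "SERIALIZED_TSI".
    // separateStreamGiven says whether a second (TSI) output stream was passed.
    static Destination tsiDestination(String formatName, boolean separateStreamGiven) {
        if (formatName.endsWith("_TSI") || formatName.endsWith("_TS")) {
            // The format stores the TSI (or TS) alongside the CAS data.
            return Destination.SAME_AS_CAS;
        }
        // Other formats: type information is written only if a second stream exists.
        return separateStreamGiven ? Destination.SEPARATE_STREAM : Destination.NOT_SAVED;
    }

    public static void main(String[] args) {
        System.out.println(tsiDestination("SERIALIZED_TSI", false));
        System.out.println(tsiDestination("COMPRESSED_FILTERED", true));
        System.out.println(tsiDestination("COMPRESSED_FILTERED", false));
    }
}
```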

Summary of APIs for saving:

   save(aCAS, outputStream, aSerialFormat)
   save(aCAS, outputStream, tsiOutputStream, aSerialFormat)
 

Summary of APIs for loading:

   load(aURL       , aCas)
   load(inputStream, aCas)
   load(inputStream, aCas, typeSystem) // typeSystem used for decoding Compressed Form 6
   load(inputStream, tsiInputStream, aCas)
 
   load(aURL       , tsiURL        , aCAS, casLoadMode)   - the second URL is for loading a separately-stored TSI
   load(inputStream, tsiInputStream, aCAS, aCasLoadMode)
   load(aURL       , tsiURL        , aCAS, lenient)       - lenient is used to set the CasLoadMode to LENIENT or DEFAULT
   load(inputStream, tsiInputStream, aCAS, lenient)
 
  • Constructor Details

    • CasIOUtils

      public CasIOUtils()
  • Method Details

    • load

      public static SerialFormat load(URL casUrl, CAS aCAS) throws IOException
      Loads a CAS from a URL source. For SerialFormats ending with _TSI, except for COMPRESSED_FILTERED_TSI, the CAS's type system and indexes definition are replaced. CasLoadMode is DEFAULT.
      Parameters:
      casUrl - The url containing the CAS
      aCAS - The CAS that should be filled
      Returns:
      the SerialFormat of the loaded CAS
      Throws:
      IOException - Problem loading from given URL
    • load

      public static SerialFormat load(URL casUrl, URL tsiUrl, CAS aCAS, CasLoadMode casLoadMode) throws IOException
      Loads a CAS from a URL source. The format is determined from the content. If the value of tsiUrl is null it is ignored.
      Parameters:
      casUrl - The url to deserialize the CAS from
      tsiUrl - null or an optional url to deserialize the type system and index definitions from
      aCAS - The CAS that should be filled
      casLoadMode - specifies how to handle reinitialization and lenient loading; see the Javadocs for CasLoadMode
      Returns:
      the SerialFormat of the loaded CAS
      Throws:
      IOException - Problem loading
    • load

      public static SerialFormat load(URL casUrl, URL tsiUrl, CAS aCAS, boolean leniently) throws IOException
      Loads a CAS from a URL source. The format is determined from the content. For SerialFormats ending with _TSI except for COMPRESSED_FILTERED_TSI, the CAS's type system and indexes definition are replaced. CasLoadMode is set according to the leniently flag.
      Parameters:
      casUrl - The url to deserialize the CAS from
      tsiUrl - The optional url to deserialize the type system and index definitions from
      aCAS - The CAS that should be filled
      leniently - true means do lenient loading
      Returns:
      the SerialFormat of the loaded CAS
      Throws:
      IOException - Problem loading
    • load

      public static SerialFormat load(InputStream casInputStream, CAS aCAS) throws IOException
      Loads a CAS from an Input Stream. The format is determined from the content. For SerialFormats ending with _TSI, except for COMPRESSED_FILTERED_TSI, the CAS's type system and indexes definition are replaced. CasLoadMode is DEFAULT.
      Parameters:
      casInputStream - The input stream containing the CAS. Caller should buffer this appropriately.
      aCAS - The CAS that should be filled
      Returns:
      the SerialFormat of the loaded CAS
      Throws:
      IOException - Problem loading from given InputStream
    • load

      public static SerialFormat load(InputStream casInputStream, InputStream tsiInputStream, CAS aCAS) throws IOException
      Loads a CAS from an Input Stream. The format is determined from the content. For SerialFormats ending with _TSI the embedded value is used instead of any supplied external TSI information. TSI information is available either via embedded value, or if a non-null input is passed for tsiInputStream. If TSI information is available, the CAS's type system and indexes definition are replaced, except for SerialFormats COMPRESSED_FILTERED, COMPRESSED_FILTERED_TS, and COMPRESSED_FILTERED_TSI. The CasLoadMode is DEFAULT.
      Parameters:
      casInputStream - The input stream containing the CAS. Caller should buffer this appropriately.
      tsiInputStream - an optional stream to read the type system and index information from
      aCAS - The CAS that should be filled
      Returns:
      the SerialFormat of the loaded CAS
      Throws:
      IOException - if the data could not be loaded
    • load

      public static SerialFormat load(InputStream casInputStream, InputStream tsiInputStream, CAS aCAS, boolean leniently) throws IOException
      Loads a CAS from an Input Stream. The format is determined from the content. For SerialFormats ending with _TSI the embedded value is used instead of any supplied external TSI information. TSI information is available either via embedded value, or if a non-null input is passed for tsiInputStream. If TSI information is available, the CAS's type system and indexes definition are replaced, except for SerialFormats SerialFormat.COMPRESSED_FILTERED, SerialFormat.COMPRESSED_FILTERED_TS, and SerialFormat.COMPRESSED_FILTERED_TSI. The CasLoadMode is set to CasLoadMode.LENIENT if the leniently flag is true; otherwise it is set to CasLoadMode.DEFAULT.
      Parameters:
      casInputStream - the stream to load the CAS from
      tsiInputStream - an optional stream to read the type system and index information from
      aCAS - the target CAS
      leniently - if true, missing types in the target CAS will be ignored instead of leading to an exception
      Returns:
      the format that was detected in the CAS input stream
      Throws:
      IOException - if the data could not be loaded
    • load

      public static SerialFormat load(InputStream casInputStream, InputStream tsiInputStream, CAS aCAS, CasLoadMode casLoadMode) throws IOException
      Loads a CAS from an Input Stream. The format is determined from the content. For formats ending in _TSI (SERIALIZED_TSI or COMPRESSED_FILTERED_TSI), the type system and index definitions are read from the CAS input source; the value of tsiInputStream is ignored. For other formats, if the tsiInputStream is not null, type system and index definitions are read from that source. If TSI information is available, the CAS's type system and indexes definition are replaced, except for SerialFormats SerialFormat.COMPRESSED_FILTERED, SerialFormat.COMPRESSED_FILTERED_TS, and SerialFormat.COMPRESSED_FILTERED_TSI. If the CasLoadMode == REINIT, then the TSI information is also used for these 3 formats to replace the CAS's definitions.
      Parameters:
      casInputStream - The input stream containing the CAS, appropriately buffered.
      tsiInputStream - The optional input stream containing the type system, appropriately buffered. It is only used if it is non-null and either the casInputStream does not already come with an embedded CAS Type System and Index Definition, or the serial format is COMPRESSED_FILTERED_TSI.
      aCAS - The CAS that should be filled
      casLoadMode - specifies loading alternative like lenient and reinit, see CasLoadMode.
      Returns:
      the SerialFormat of the loaded CAS
      Throws:
      IOException - if the data could not be loaded
    • load

      public static SerialFormat load(InputStream casInputStream, CAS aCAS, TypeSystem typeSystem) throws IOException
      Loads a CAS from an Input Stream. This load variant can be used for loading Form 6 compressed CASes where the type system to use for deserialization is provided as an argument. It can also load other formats, where its behavior is identical to load(casInputStream, aCas). The format is determined from the content. For SerialFormats ending in _TSI (SERIALIZED_TSI or COMPRESSED_FILTERED_TSI), the type system and index definitions are read from the CAS input source; the value of typeSystem is ignored. For COMPRESSED_FILTERED_xxx formats, if the typeSystem is not null, the typeSystem is used for decoding. If TSI information is available, the CAS's type system and indexes definition are replaced, except for SerialFormats SerialFormat.COMPRESSED_FILTERED, SerialFormat.COMPRESSED_FILTERED_TS, and SerialFormat.COMPRESSED_FILTERED_TSI. To replace the CAS's type system and indexes definition for these, use a load form which has the CasLoadMode argument, and set this to CasLoadMode.REINIT.
      Parameters:
      casInputStream - The input stream containing the CAS, appropriately buffered.
      aCAS - The CAS that should be filled
      typeSystem - the type system to use for decoding the serialized form, must be non-null
      Returns:
      the SerialFormat of the loaded CAS
      Throws:
      IOException - Problem loading from given InputStream
    • save

      public static void save(CAS aCas, OutputStream docOS, SerialFormat format) throws IOException
      Write the CAS in the specified format.
      Parameters:
      aCas - The CAS that should be serialized and stored
      docOS - The output stream for the CAS
      format - The SerialFormat in which the CAS should be stored.
      Throws:
      IOException - Problem saving to the given OutputStream
    • save

      public static void save(CAS aCas, OutputStream docOS, OutputStream tsiOS, SerialFormat format) throws IOException
      Write the CAS in the specified format. If the format does not include type system information and the optional output stream for the type system is specified, then the type system information is written there.
      Parameters:
      aCas - The CAS that should be serialized and stored
      docOS - The output stream for the CAS, with appropriate buffering
      tsiOS - Optional output stream for type system information. Only used if the format does not support storing type system information directly in the main output file.
      format - The SerialFormat in which the CAS should be stored.
      Throws:
      IOException - Problem saving to the given OutputStream
    • writeTypeSystem

      public static void writeTypeSystem(CAS aCas, OutputStream aOS, boolean includeIndexDefs) throws IOException
      Throws:
      IOException