Cookbook: addressing some typical use-cases
Apache UIMA

 Working with Feature Structures

These techniques work with all kinds of Feature Structures, both Annotations and non-Annotations.

Remove all Feature Structures of a particular type

There are built-in methods to do this, over all indexes in a particular view. There are 2 variations:

  • remove all including the subtypes of the type
    myJCasView.removeAllIncludingSubtypes(Foo.type)
  • remove all excluding the subtypes of the type
    myJCasView.removeAllExcludingSubtypes(Foo.type)

Both of these are much faster than iterating over the Feature Structures; they directly clear the associated indexes.

General suggestions: working with iterators

Code often iterates over all instances of a type but only acts on a subset. Frequently, the iteration can be cut short by starting near the spot of interest and stopping as soon as it can be determined that further iteration will find no more Annotations of interest.

Example: Let's say you have a "token" annotation, and want to find the "sentence" that contains it. You could write an iterator over all sentences.

Stop early

When you find the first sentence that overlaps the token, you can use extra knowledge you might have (for example, that each token belongs to only one sentence) to conclude that no further iteration is needed, and stop.

Furthermore, if the token appears outside of any sentence, you can similarly stop the iteration, and return an "empty" result, as soon as the test sentence begins after the token's "begin". This is because, at that point, due to the sorting of the returned values, no future sentences could start before or equal to the token's begin.
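The two stopping rules above can be sketched in plain Java. This is only an illustration under stated assumptions: `Span` is a hypothetical stand-in for an annotation (it is not a UIMA class), and the list is assumed to be sorted by ascending begin, as an annotation index would return its entries.

```java
import java.util.List;

final class StopEarlySketch {
    /** Hypothetical stand-in for an annotation: just begin/end offsets. */
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /**
     * Returns the first sentence covering the token, or null if none does.
     * Assumes the sentences are sorted by ascending begin.
     */
    static Span findCoveringSentence(List<Span> sentences, Span token) {
        for (Span s : sentences) {
            if (s.begin > token.begin) {
                // Sorted order: no later sentence can start at or before the token.
                return null;
            }
            if (s.end >= token.end) {
                // Starts at or before the token and ends at or after it: stop here.
                return s;
            }
        }
        return null;
    }
}
```

The loop returns as soon as either rule fires, so it never visits sentences that start after the token.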

Begin closer to the right spot, maybe iterate backwards

But you can do better.

Instead of starting the iteration at the beginning, you can start at the position of the token and iterate backwards. Iterators have a moveTo() method which takes a feature structure argument, so you can moveTo(the-token) and then, perhaps with some edge adjustment for equality, iterate backwards, looking for the sentence at that position that covers the token.

If you are iterating backwards looking for a "covering" annotation, and you know the largest span for that covering type, then you can stop iterating as soon as the start position you reach, plus the largest span, is less than the start of the annotation you're trying to cover.

This is used internally in version 3's select framework to speed up the covering kind of iteration.
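A minimal sketch of the backward search with the largest-span cutoff, in plain Java rather than the UIMA API. The names are hypothetical: `Span` stands in for an annotation, `lastStartingAtOrBefore` plays the role of positioning the iterator with moveTo(), and `maxSpan` is the largest span assumed to be known for the covering type.

```java
import java.util.List;

final class BackwardSearchSketch {
    /** Hypothetical stand-in for an annotation: just begin/end offsets. */
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Index of the last span whose begin is <= offset, or -1 if there is none. */
    static int lastStartingAtOrBefore(List<Span> sorted, int offset) {
        int lo = 0, hi = sorted.size() - 1, result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted.get(mid).begin <= offset) {
                result = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return result;
    }

    /** Walks backwards from the target's position; the list must be sorted by ascending begin. */
    static Span findCovering(List<Span> sorted, Span target, int maxSpan) {
        for (int i = lastStartingAtOrBefore(sorted, target.begin); i >= 0; i--) {
            Span s = sorted.get(i);
            if (s.begin + maxSpan < target.begin) {
                return null; // nothing this far back (or further) can reach the target
            }
            if (s.end >= target.end) {
                return s;    // begins at or before the target and ends at or after it
            }
        }
        return null;
    }
}
```

The binary search stands in for jumping straight to the target's position, and the cutoff bounds how far back the loop can go.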

There are many other examples, but the principle is the same: start the iteration "close to" the right spot, perhaps moving backwards instead of forwards, and end the iteration as soon as you can logically say that no more suitable feature structures would be found.

Use UIMA Version 3's select framework

The select framework packages many of the popular iteration use cases we've seen into a Java-friendly approach that automatically uses optimized iterators and can also produce Java Streams.

 Working with Annotations

The CAS holds Feature Structures (FSs). There is special support for FSs which are a subtype of Annotation; these have an associated Subject of Analysis (Sofa) and begin and end offsets.

Annotations are not required in all cases

If your application deals with a different kind of unstructured data (images, for instance), then Annotation may not be the appropriate supertype for your types, because it is designed for things that have meaningful linear begin / end boundaries.

You can have your feature structures inherit from TOP, or from some other appropriate supertype, other than Annotation.

  • For example, if you want to define a new kind of annotation (e.g. a rectangular region if your subject of analysis is an image), you should write a new type which inherits from AnnotationBase. Types which inherit from AnnotationBase are bound to a particular subject of analysis (aka view).

  • On the other hand, if you have information which is not directly related to a subject of analysis (e.g. a Date type with day/month/year fields which would be used as a value rather than as an annotation) then consider inheriting from TOP instead.

  • It is also not necessary to add all feature structures or annotations to the indexes. For example, if the Date type just described is used as a feature value, it may well be sufficient to be able to reach it through the feature.

Making use of the built-in Annotation index

Annotations are special in UIMA in that there is a "built-in" index, the AnnotationIndex, which can be used to rapidly access these in a sorted order. The ordering is by begin (ascending), then by end (descending), and then by type-priorities.

This is really a set of indexes, one for each subtype of Annotation.

Although the index has type-priorities, in UIMA v3, the select-framework by default ignores these; this behavior can be overridden on an as-needed basis.
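To make the begin-ascending, end-descending ordering concrete, here is a small plain-Java sketch. It is not the UIMA index itself: `Span` is a hypothetical stand-in for an annotation, and the type-priority tie-breaker is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class AnnotationOrderSketch {
    /** Hypothetical stand-in for an annotation: just begin/end offsets. */
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Begin ascending, then end descending; type priorities are not modelled here. */
    static final Comparator<Span> BEGIN_ASC_END_DESC =
        Comparator.comparingInt((Span s) -> s.begin)
                  .thenComparing(Comparator.comparingInt((Span s) -> s.end).reversed());

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<Span> sorted(List<Span> spans) {
        List<Span> copy = new ArrayList<>(spans);
        copy.sort(BEGIN_ASC_END_DESC);
        return copy;
    }
}
```

With this ordering, a longer span that starts at the same offset as a shorter one comes first, so enclosing annotations tend to be visited before the annotations they contain.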

Watch out for type priorities

When two annotations have the same begin and end but different types, one comes before the other according to type priorities. The intent is that if you have a Sentence annotation and a Foo annotation covering the same span, you can declare that the Sentence logically contains the Foo, and not the other way around.

To make this work, you need to specify the type priorities. This is a global setting for your application. See type priorities (scroll down to find it) for how to specify this.

Avoiding type priorities

Often, the use of type priorities gets in the way. With UIMA Version 3, the select framework by default ignores type priorities when doing its operations, but this can be overridden as needed.

Annotation containment

a contains b

  • Ignoring type priorities:
a != null && b != null &&       // null check
a.getBegin() <= b.getBegin() && // a starts before (or equal to) b 
a.getEnd() >= b.getEnd()        // a ends after (or equal to) b

a and b overlap (have at least one char position in common)

                                    // ((omitted) check for non-null)
if (a.getBegin() <= b.getBegin()) { // if a starts before (or equal to) b
  return a.getEnd() > b.getBegin(); // then it overlaps if a's end is after b's begin
} else {                            // otherwise, b's begin is before a's begin
  return b.getEnd() > a.getBegin(); // so it overlaps if b's end is after a's begin.
}

An alternative, where overlap includes the edge case when the annotations just touch each other, but have no char position in common:

                                     // ((omitted) check for non-null)
if (a.getBegin() <= b.getBegin()) {  // if a starts before (or equal to) b
  return a.getEnd() >= b.getBegin(); // then it overlaps or abuts if a's end is after or equal to b's begin
} else {                             // otherwise, b's begin is before a's begin
  return b.getEnd() >= a.getBegin(); // so it overlaps or abuts if b's end is after or equal to a's begin.
}
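The three checks above are shown as fragments. A self-contained version, with the null checks filled in, might look like the sketch below. `Span` is a hypothetical stand-in for an annotation (only begin and end offsets), so this is plain Java rather than UIMA code, and type priorities are ignored.

```java
final class SpanRelationsSketch {
    /** Hypothetical stand-in for an annotation: just begin/end offsets. */
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** True if a starts at or before b and ends at or after b. */
    static boolean contains(Span a, Span b) {
        return a != null && b != null && a.begin <= b.begin && a.end >= b.end;
    }

    /** True if a and b share at least one character position. */
    static boolean overlaps(Span a, Span b) {
        if (a == null || b == null) {
            return false;
        }
        if (a.begin <= b.begin) {
            return a.end > b.begin; // a starts first: overlap if a ends after b begins
        }
        return b.end > a.begin;     // b starts first: overlap if b ends after a begins
    }

    /** Like overlaps, but also true when the two spans merely touch. */
    static boolean overlapsOrAbuts(Span a, Span b) {
        if (a == null || b == null) {
            return false;
        }
        if (a.begin <= b.begin) {
            return a.end >= b.begin;
        }
        return b.end >= a.begin;
    }
}
```

The only difference between the last two methods is the strictness of the comparison, which decides whether spans that touch at a single offset count as related.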

Adjusting an existing annotation's begin and end

Sometimes, your code may want to adjust an annotation's begin and end values. If the annotation is not indexed, there's no issue: just change the values. But if it is indexed, its position in each index is determined by its begin and end values, so if you change these, the item needs to be reindexed (in all the indexes holding it). Typically, only one index (the Annotation Index for a particular CAS View) is involved, but in general, there could be multiple indexes involved.

If you are using UIMA version 2.7.0 or later, the UIMA framework detects updates that would need this re-indexing, and automatically removes the Feature Structure from all involved index(es), updates the Feature, and then adds the Feature Structure back to the index(es).

You can improve the efficiency of this if you are updating, say, both the begin and end values of an annotation, by doing the following yourself in your code:

  • Removing the item from the index(es)
  • Doing both updates
  • Adding the item back into the index(es)

More details here.

Example: if you know a particular annotation is only indexed in one view, then you can update its begin and end features using

a.removeFsFromIndexes();

a.setBegin(new_value_begin);
a.setEnd(new_value_end);

a.addToIndexes();
This is the most efficient way to do this.
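The remove, update, add-back pattern applies to any sorted collection whose ordering depends on mutable fields. The sketch below illustrates it with a java.util.TreeSet standing in for a sorted index. `MutableSpan` and `update` are hypothetical names, not UIMA API, and unlike a UIMA index a TreeSet keeps only one element per distinct begin/end pair.

```java
import java.util.Comparator;
import java.util.TreeSet;

final class ReindexSketch {
    /** Hypothetical stand-in for an annotation whose key fields can change. */
    static final class MutableSpan {
        int begin;
        int end;
        MutableSpan(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Sort keys of the stand-in index: begin ascending, then end descending. */
    static final Comparator<MutableSpan> ORDER =
        Comparator.comparingInt((MutableSpan s) -> s.begin)
                  .thenComparing(Comparator.comparingInt((MutableSpan s) -> s.end).reversed());

    /** Removes once, applies both key updates, then adds back once. */
    static void update(TreeSet<MutableSpan> index, MutableSpan a, int newBegin, int newEnd) {
        boolean wasIndexed = index.remove(a); // must happen while the old keys are still in place
        a.begin = newBegin;
        a.end = newEnd;
        if (wasIndexed) {
            index.add(a);                     // re-inserted at the position the new keys dictate
        }
    }
}
```

Changing the keys while the element is still in the set would leave it filed under its old position, which is the situation the re-indexing step avoids.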

There are a couple of special forms you can use to protect indexes while you're updating features used as keys. This is useful when you're not sure which features might be used as keys in some index.

try (AutoCloseable ac = my_cas.protectIndexes()) {
   // ...  arbitrary user code which updates features 
   //      which may be "keys" in one or more indexes, e.g.
   
   a.setBegin(new_value_begin);
   a.setEnd(new_value_end); 
}
or
my_cas.protectIndexes(() -> {
   // ... arbitrary user code updating "key" features, 
   //     but no checked exceptions are permitted
   //     (because inside a lambda)
   
   a.setBegin(new_value_begin);
   a.setEnd(new_value_end);
   });
These use the framework's automatic detection mechanism and remove Feature Structures from all involved indexes if needed, but delay adding them back until the end of the protected section.

Avoid copying sets of Feature Structures where possible

Operations which iterate over Feature Structures, and put them into a Collection or List, and then iterate over that list to do some other operations, can often be done directly on the Feature Structures in the CAS, omitting the first copying of them into a list.

A speedup is often possible when the logic can detect that no further items in a (sorted) index are needed, so that the iteration can be stopped early.

For example, you might have code which iterates over all feature structures of a particular type and puts these into a list, then goes through the list, picks out certain ones, and puts those into another list, which is then returned.

The first copying can be omitted, by moving the logic of what to include into the first iteration, and producing the second list directly.

In UIMA Version 3, you can make use of the select framework. It already accounts for many of the use cases where you might want to start an iteration at a particular spot or exit it early. You can also use its ability to produce streams, and combine that with Java's takeWhile method (available on streams since Java 9), to exit a stream early.
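As a rough illustration of producing the result list directly and exiting early with takeWhile, here is a plain-Java sketch over a begin-sorted list. It does not use the select framework: `Span`, `longSpansBefore`, `limit` and `minLength` are hypothetical names chosen for the example.

```java
import java.util.List;
import java.util.stream.Collectors;

final class TakeWhileSketch {
    /** Hypothetical stand-in for an annotation: just begin/end offsets. */
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /**
     * Collects spans of at least minLength that begin before limit.
     * Assumes the input is sorted by ascending begin, so the stream can stop
     * at the first span that begins at or after limit.
     */
    static List<Span> longSpansBefore(List<Span> sorted, int limit, int minLength) {
        return sorted.stream()
                     .takeWhile(s -> s.begin < limit)             // early exit on sorted input
                     .filter(s -> s.end - s.begin >= minLength)   // selection done in the same pass
                     .collect(Collectors.toList());               // only the final list is built
    }
}
```

No intermediate list of all spans is created; the selection logic runs during the single pass, and the pass ends as soon as the sorted order guarantees nothing further can qualify.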