|
Working with Feature Structures
|
These tips apply to all kinds of Feature Structures, both Annotations and non-Annotations.
Remove all Feature Structures of a particular type
|
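There are built-in methods to do this, over all indexes in a particular view. There are two variations: one that also removes instances of subtypes, and one that removes only instances of the exact type.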
Both of these are much faster than iterating over the Feature Structures; they directly clear the associated indexes.
|
General suggestions: working with iterators
|
Many times code will iterate over all instances of a type, and only do something with a subset.
Frequently, the iteration can be cut short, by starting near the spot of interest and stopping as soon
as it can be determined that no further iteration will find interesting Annotations.
Example: Let's say you have a "token" annotation, and want to find the "sentence" that contains it.
You could write an iterator over all sentences.
Stop early
When you find the first sentence that covers the token, you can use extra knowledge you might have,
such as the fact that each token belongs to only one sentence, to conclude that no further
iteration is needed, and stop.
Furthermore, if the token appears outside of any sentence, you can similarly stop the iteration, and return
an "empty" result, as soon as the sentence being tested begins after the token's "begin".
This is because, at that point, due to the sorting of the returned values, no later sentence could
start at or before the token's begin.
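The early-exit logic described above can be sketched in plain Java. This is only an illustration under stated assumptions: `Span` and `StopEarly` are hypothetical stand-ins, not UIMA classes, and the list is assumed to be sorted by begin (ascending), as an annotation index would return it.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for an annotation; not a UIMA class. */
record Span(int begin, int end) {}

final class StopEarly {
    /**
     * Finds the sentence covering the token.
     * Assumes sentences are sorted by begin (ascending) and that
     * at most one sentence covers any given token.
     */
    static Optional<Span> coveringSentence(List<Span> sentences, Span token) {
        for (Span s : sentences) {
            if (s.begin() > token.begin()) {
                // Sorted order: no later sentence can start at or before the token.
                return Optional.empty();
            }
            if (s.end() >= token.end()) {
                // Found it; at most one exists, so stop iterating.
                return Optional.of(s);
            }
        }
        return Optional.empty();
    }
}
```

Both exits rely on the sort order; on an unsorted collection the first `return` would be wrong.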
Begin closer to the right spot, maybe iterate backwards
But you can do better.
Instead of starting the iteration at the beginning, you can start at the position of the token and iterate backwards.
Iterators have a moveTo() method which takes a feature structure argument, so you can moveTo(the-token)
and then, perhaps after some edge adjustment for equality, iterate backwards, looking for the sentence
that covers the token.
If you are iterating backwards, and looking for a "covering" annotation, and know the largest span for that
covering type, then you can stop iterating as soon as the start position you reach, + the largest span, is less than
the start of the annotation you're trying to cover.
This is used internally in version 3's select framework to speed up the covering kind of iteration.
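A minimal plain-Java sketch of the backward search with the largest-span cutoff follows. It is not the select framework's actual implementation: `Span`, `BackwardSearch`, and the `maxSpan` parameter are hypothetical, and a binary search over a sorted list stands in for moveTo() plus the edge adjustment.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for an annotation; not a UIMA class. */
record Span(int begin, int end) {}

final class BackwardSearch {
    /**
     * Searches backwards from the token's position for a covering span.
     * Assumes sorted is ordered by begin (ascending) and that no span
     * in it is longer than maxSpan.
     */
    static Optional<Span> covering(List<Span> sorted, Span token, int maxSpan) {
        // Find the first index whose begin is greater than the token's begin.
        int lo = 0, hi = sorted.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted.get(mid).begin() <= token.begin()) lo = mid + 1; else hi = mid;
        }
        // Walk backwards over spans that start at or before the token.
        for (int i = lo - 1; i >= 0; i--) {
            Span s = sorted.get(i);
            if (s.begin() + maxSpan < token.begin()) {
                break; // this span and all earlier ones end before the token starts
            }
            if (s.end() >= token.end()) {
                return Optional.of(s); // starts at or before, ends at or after
            }
        }
        return Optional.empty();
    }
}
```

The cutoff is only safe if `maxSpan` really is an upper bound on span length; an underestimate can cause a covering span to be missed.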
There are many other examples, but the principle is the same: start the iteration "close to" the right spot,
perhaps moving backwards instead of forwards, and end the iteration as soon as you can logically say that
no more suitable feature structures would be found.
Use UIMA Version 3's select framework
The select framework incorporates many of the popular use cases for iteration that we've seen into a Java-friendly approach that
automatically uses optimized iterators and can also produce Java Streams.
|
|
Working with Annotations
|
The CAS holds Feature Structures (FSs). There is special support for FSs which are a subtype of Annotation;
these have an associated Subject of Analysis (Sofa) and begin and end offsets.
Annotations are not required in all cases
If your application deals with a different kind of unstructured data, for instance images, then
Annotation may not be the appropriate supertype for your types, because it is designed for
things that are meaningfully delimited by linear begin and end offsets.
You can have your feature structures inherit from TOP, or from some other appropriate supertype, other
than Annotation.
For example, if you want to define a new kind of annotation (e.g. a rectangular
region if your subject of analysis is an image),
you should write a new type which inherits from AnnotationBase. Types which
inherit from AnnotationBase are bound to a particular subject of analysis (aka view).
On the other hand, if you have information which is not directly related to a subject of analysis
(e.g. a Date type with day/month/year fields which would be used as a value rather
than as an annotation) then consider inheriting from TOP instead.
It is also not necessary to add all feature structures or annotations to the indexes. For example, if the
Date type just described is used as a feature value, it may well be sufficient to be
able to reach it through the feature.
Making use of the built-in Annotation index
Annotations are special in UIMA in that there is a "built-in" index, the AnnotationIndex, which can be used
to rapidly access these in a sorted order. The ordering is by begin (ascending), then by
end (descending), and then by type-priorities.
This is really a set of indexes, one for each subtype of Annotation.
Although the index has type-priorities, in UIMA v3, the select-framework
by default ignores these; this behavior can be overridden on an as-needed basis.
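The begin/end part of this ordering can be written as a plain-Java comparator. This is a sketch only: `Span` and `AnnotationOrder` are hypothetical names, not UIMA classes, and type priorities are deliberately left out, so two spans with equal offsets compare as equal here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for an annotation; not a UIMA class. */
record Span(String type, int begin, int end) {}

final class AnnotationOrder {
    /** begin ascending, then end descending; type priorities are not modeled. */
    static final Comparator<Span> ORDER = (a, b) -> {
        int c = Integer.compare(a.begin(), b.begin());
        return c != 0 ? c : Integer.compare(b.end(), a.end());
    };

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<Span> sorted(List<Span> spans) {
        List<Span> copy = new ArrayList<>(spans);
        copy.sort(ORDER);
        return copy;
    }
}
```

With this ordering, a longer span that starts at the same offset as a shorter one comes first, so a Sentence at 0..20 precedes a Token at 0..5.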
Watch out for type priorities
|
When two annotations have the same begin and end but different types, one comes before the other
according to type priorities. This lets you declare, for example, that when a Sentence annotation and a
Foo annotation cover the same span, the Sentence logically contains the Foo, and not the
other way around.
To make this work, you need to specify the type priorities. This is a global setting for your application.
See type priorities (scroll down to find it) for how to specify this.
Avoiding type priorities
Often, the use of type priorities gets in the way. With UIMA Version 3, the select framework
by default ignores type priorities when doing its operations, but this can be overridden as needed.
|
Annotation containment
|
a contains b
- Ignoring type priorities:
a != null && b != null && // null check
a.getBegin() <= b.getBegin() && // a starts before (or equal to) b
a.getEnd() >= b.getEnd() // a ends after (or equal to) b
a and b overlap (have at least one char position in common)
// (null checks omitted)
if (a.getBegin() <= b.getBegin()) {   // if a starts before (or equal to) b
  return a.getEnd() > b.getBegin();   // then it overlaps if a's end is after b's begin
} else {                              // otherwise, b's begin is before a's begin
  return b.getEnd() > a.getBegin();   // so it overlaps if b's end is after a's begin
}
An alternative, where overlap includes the edge case when the annotations just touch each other, but have no char position in common:
// (null checks omitted)
if (a.getBegin() <= b.getBegin()) {   // if a starts before (or equal to) b
  return a.getEnd() >= b.getBegin();  // then it overlaps or abuts if a's end is after or equal to b's begin
} else {                              // otherwise, b's begin is before a's begin
  return b.getEnd() >= a.getBegin();  // so it overlaps or abuts if b's end is after or equal to a's begin
}
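The three tests above can be collected into small helper methods. The sketch below mirrors the snippets on a hypothetical `Span` record rather than on real UIMA annotations; `SpanRelations` and its method names are illustrative only.

```java
/** Hypothetical stand-in for an annotation; not a UIMA class. */
record Span(int begin, int end) {}

final class SpanRelations {
    /** a contains b, ignoring type priorities. */
    static boolean contains(Span a, Span b) {
        return a != null && b != null
            && a.begin() <= b.begin()   // a starts before (or with) b
            && a.end() >= b.end();      // a ends after (or with) b
    }

    /** a and b have at least one character position in common. */
    static boolean overlaps(Span a, Span b) {
        if (a == null || b == null) return false;
        return a.begin() <= b.begin()
            ? a.end() > b.begin()
            : b.end() > a.begin();
    }

    /** Like overlaps, but also true when the two spans merely touch. */
    static boolean overlapsOrAbuts(Span a, Span b) {
        if (a == null || b == null) return false;
        return a.begin() <= b.begin()
            ? a.end() >= b.begin()
            : b.end() >= a.begin();
    }
}
```

For the spans 0..5 and 5..8, which touch but share no character, `overlaps` is false and `overlapsOrAbuts` is true.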
|
Adjusting an existing annotation's begin and end
|
Sometimes, your code may want to adjust an annotation's begin and end values.
If the annotation is not indexed, there's no issue - just change the value.
But if it is indexed, it's in index(es) in a position determined by its begin and end position, so if you
change these, the item needs to be reindexed (in all the indexes holding it). Typically, only one index
(the Annotation Index for a particular CAS View) is involved, but in general, there could be multiple
indexes involved.
If you are using UIMA version 2.7.0 or later, the UIMA framework detects updates that would need this re-indexing, and
automatically removes the Feature Structure from all involved index(es), updates the Feature, and then adds the Feature Structure back to the index(es).
You can improve the efficiency of this when you are updating, say, both the begin and end values of an annotation, by
doing the following yourself in your code:
- Removing the item from the index(es)
- Doing both updates
- Adding the item back into the index(es)
More details here.
Example: if you know a particular annotation is only indexed in one view,
then you can update its begin and end features using
a.removeFsFromIndexes();
a.setBegin(new_value_begin);
a.setEnd(new_value_end);
a.addToIndexes();
This is the most efficient way to do this.
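To see why the remove, update, add order matters, here is a small plain-Java analogy in which a `TreeSet` plays the role of a sorted index. None of this is UIMA code: `ReindexSketch`, `Span`, and `move` are hypothetical names, and the comparator only approximates the annotation index ordering.

```java
import java.util.Comparator;
import java.util.TreeSet;

final class ReindexSketch {
    /** Mutable stand-in for an annotation; not a UIMA class. */
    static final class Span {
        int begin, end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Stand-in for a sorted index key: begin ascending, then end descending. */
    static final Comparator<Span> ORDER = (a, b) -> {
        int c = Integer.compare(a.begin, b.begin);
        return c != 0 ? c : Integer.compare(b.end, a.end);
    };

    /** Removes once, applies both updates, and adds back once. */
    static void move(TreeSet<Span> index, Span s, int newBegin, int newEnd) {
        // The removal must happen while the old key values are still in place,
        // otherwise the sorted structure cannot locate the element.
        boolean wasIndexed = index.remove(s);
        s.begin = newBegin;
        s.end = newEnd;
        if (wasIndexed) {
            index.add(s); // re-inserted at the position for the new keys
        }
    }
}
```

Changing `begin` or `end` while the element is still inside the sorted set would leave it at a stale position, which is the situation the framework's automatic detection guards against.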
There are a couple of special forms you can use to protect indexes while you're updating features used as keys.
These are useful when you're not sure which features might be used as keys in some index.
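try (AutoCloseable ac = my_cas.protectIndexes()) {
  // ... arbitrary user code which updates features
  // which may be "keys" in one or more indexes, e.g.
  a.setBegin(new_value_begin);
  a.setEnd(new_value_end);
} catch (Exception e) {
  // handle or rethrow; needed because AutoCloseable.close() is declared to throw Exception
}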
or
my_cas.protectIndexes(() -> {
// ... arbitrary user code updating "key" features,
// but no checked exceptions are permitted
// (because inside a lambda)
a.setBegin(new_value_begin);
a.setEnd(new_value_end);
});
These use the framework's automatic detection mechanism and remove Feature Structures from all involved indexes
if needed, but delay adding them back until the end of the protected section.
|
Avoid copying sets of Feature Structures where possible
|
Operations which iterate over Feature Structures, and put them into a Collection or List, and then
iterate over that list to do some other operations, can often be done directly on the Feature Structures in the CAS,
omitting the first copying of them into a list.
A frequent speedup can happen when the particular logic can detect when no further items in a (sorted) index
are needed, and the iteration can be stopped early.
For example, you might have code which iterates over all feature structures of a particular type and puts them into a list,
then goes through that list, picks out certain ones, and puts those into another list, which is then returned.
The first copy can be omitted by moving the logic of what to include into the first iteration, and producing the second
list directly.
In UIMA Version 3, you can make use of the select framework.
It already accounts for many of the use cases where you might want to start an iteration at a particular spot or exit it early.
You can also use its ability to produce streams, and combine that with the takeWhile method of Java streams (Java 9 and later), to exit a stream early.
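The single-pass, early-exit idea can be illustrated with an ordinary Java stream. This sketch does not use the select framework; `Span`, `EarlyExitStream`, and `within` are hypothetical names, and the input list is assumed to be sorted by begin (ascending), as a sorted index would supply it.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical stand-in for an annotation; not a UIMA class. */
record Span(int begin, int end) {}

final class EarlyExitStream {
    /**
     * Collects the spans lying inside [windowBegin, windowEnd] in one pass,
     * without first copying everything into an intermediate list.
     * Assumes sorted is ordered by begin (ascending).
     */
    static List<Span> within(List<Span> sorted, int windowBegin, int windowEnd) {
        return sorted.stream()
            // Sorted order: once a span begins at or past the window end, none after it can qualify.
            .takeWhile(s -> s.begin() < windowEnd)
            // Keep only the spans that fit entirely inside the window.
            .filter(s -> s.begin() >= windowBegin && s.end() <= windowEnd)
            .collect(Collectors.toList());
    }
}
```

The `takeWhile` step is what ends the traversal early; `filter` alone would still visit every remaining element.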
|
|
|