Apache UIMA Addons and Sandbox

The Apache UIMA™ Sandbox is a workspace that is open to all UIMA committers and developers who would like to contribute code and join the UIMA developer community.

Components often start in the Sandbox and, once ready for release, migrate over time to the Addons or other parts of the site as they are integrated by the Apache community.

The Addons and Sandbox currently host analysis components and tooling around UIMA. All the components are free to use and licensed under the Apache Software License. A list of proposed analysis components and tooling for UIMA is available at the UIMA wiki and can be discussed there.

You can access the UIMA Addons in the SVN repository at https://svn.apache.org/repos/asf/uima/addons/trunk/. Likewise, you can access the UIMA sandbox in the SVN repository at https://svn.apache.org/repos/asf/uima/sandbox/trunk/.

The list below shows the currently available components of the UIMA Addons. Many of these components are annotators. The Addons projects have been released; see the download page.

UIMA Addons components

Annotators and Consumers

Whitespace Tokenizer Annotator

Snowball Annotator

Regular Expression Annotator

Dictionary Annotator

Hidden Markov Model Tagger Annotator

BSF Annotator

OpenCalais Annotator

Concept Mapper Annotator

Configurable Feature Extractor Annotator

Tika Annotator

Lucene CAS indexer (Lucas)

AlchemyAPI Annotator

Solr CAS Consumer (Solrcas)

Servers

Simple Server (UIMA REST service)

Packaging tools

PEAR Packaging ANT Task

PEAR Packaging Maven Plugin

Miscellaneous

Feature Structure Variables

These are described in more detail below.

Whitespace Tokenizer Annotator

The Whitespace Tokenizer annotator component provides a UIMA annotator implementation that tokenizes text documents using simple whitespace segmentation. During tokenization, the annotator creates token and sentence annotations. The Java source of the annotator can be accessed in the SVN repository at https://svn.apache.org/repos/asf/uima/addons/trunk/WhitespaceTokenizer.
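
As a rough illustration of whitespace segmentation, the following plain-Java sketch splits a text into token spans with begin and end offsets. It does not use the UIMA API and is not the annotator's actual source; the class and method names (WhitespaceSketch, tokenize) are hypothetical, and sentence detection is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the WhitespaceTokenizer source.
class WhitespaceSketch {

    // Returns the [begin, end) character offsets of each whitespace-delimited token.
    static List<int[]> tokenize(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            if (!ws && start < 0) {
                start = i;                        // a token begins here
            } else if (ws && start >= 0) {
                spans.add(new int[] {start, i});  // the token ends before the whitespace
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new int[] {start, text.length()}); // token running to end of text
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "UIMA  annotators\tcreate tokens";
        for (int[] s : tokenize(text)) {
            System.out.println(s[0] + "-" + s[1] + ": " + text.substring(s[0], s[1]));
        }
    }
}
```

In a real annotator, each span would presumably become a token annotation with the same begin and end offsets in the CAS.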

Snowball Annotator

The Snowball annotator is a UIMA annotator component that wraps the Snowball stemming algorithm. The annotator iterates over the available token annotations in the CAS and, for each token, creates a feature containing the stem. The stemming algorithm is available for several languages. For details about Snowball, please see https://snowball.tartarus.org/. The Java source of the annotator can be accessed in the SVN repository at https://svn.apache.org/repos/asf/uima/addons/trunk/SnowballAnnotator.

Note: the Snowball stemming implementation used by the annotator is licensed under the BSD license.

Regular Expression Annotator

The Regular Expression Annotator (RegexAnnotator) is an Apache UIMA analysis engine that detects entities such as email addresses, URLs, phone numbers, zip codes, or any other entity based on regular expressions and concepts. For each detected entity, an annotation can be created or an existing annotation can be updated with feature values. Click here to access the user documentation. The Java source of the annotator can be accessed in the SVN repository at https://svn.apache.org/repos/asf/uima/addons/trunk/RegularExpressionAnnotator.
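
To give a feel for regex-based entity detection, here is a minimal sketch using only java.util.regex. It is not the RegexAnnotator's implementation and does not reflect its configuration format; the class name, the method name, and the deliberately simplified email pattern are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from the RegularExpressionAnnotator source.
class RegexEntitySketch {

    // Deliberately simplified email pattern (an assumption, not a complete email grammar).
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Returns each match as "begin-end:text", mirroring the offsets an annotation would carry.
    static List<String> findEmails(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            hits.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findEmails("Mail dev@uima.apache.org now"));
    }
}
```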

PEAR Packaging ANT Task

The PEAR packaging ANT task component is a project to create UIMA PEAR packages automatically during a component build using a custom Apache ANT task. With this task, users can build their components from source and then package them automatically as a UIMA PEAR package. Click here to access the user documentation. The Java source of the PEAR packaging task can be accessed in the SVN repository at https://svn.apache.org/repos/asf/uima/addons/trunk/PearPackagingAntTask.

PEAR Packaging Maven Plugin

Note: The PEAR Packaging Maven Plugin has been moved to the main UIMA Java Framework and SDK package.

The PEAR packaging Maven plugin component is a project to create UIMA PEAR packages automatically during a component build using a custom Maven plugin. With this plugin, users can build their components from source and then package them automatically as a UIMA PEAR package. Click here to access the user documentation. The Java source of the PEAR packaging Maven plugin can be accessed in the SVN repository at https://svn.apache.org/repos/asf/uima/uimaj/trunk/PearPackagingMavenPlugin.

Dictionary Annotator

The Dictionary Annotator is an Apache UIMA analysis engine that creates annotations based on word lists that are compiled to simple dictionaries. The output annotation type for the created annotations and the input annotation type on which the dictionary lookup is executed can each be specified individually. Click here to access the user documentation. The Java source of the annotator can be accessed in the SVN repository at https://svn.apache.org/repos/asf/uima/addons/trunk/DictionaryAnnotator.
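
The basic idea of a word-list lookup can be sketched in plain Java as follows. This is not the Dictionary Annotator's code or dictionary format: the class and method names are hypothetical, the dictionary is a simple in-memory set, and the case-insensitive matching is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not taken from the DictionaryAnnotator source.
class DictionaryLookupSketch {

    // Returns the input tokens that occur in the word list.
    // Assumption: the dictionary holds lower-case entries and matching ignores case.
    static List<String> lookup(Set<String> dictionary, List<String> tokens) {
        List<String> matches = new ArrayList<>();
        for (String token : tokens) {
            if (dictionary.contains(token.toLowerCase(Locale.ROOT))) {
                matches.add(token);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Set<String> cities = new HashSet<>(Arrays.asList("paris", "berlin"));
        List<String> tokens = Arrays.asList("Flights", "from", "Paris", "to", "Berlin");
        System.out.println(lookup(cities, tokens));
    }
}
```

In annotator terms, the tokens would stand in for the input annotation type and each match for an instance of the output annotation type.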

Feature Structure Variables

The Feature Structure variables project allows you to create named feature structure instances. It further allows you to refer to individual feature structures or annotations across annotators, without creating a special index. Click here to access the user documentation. The Java source of the project can be accessed in the SVN repository at https://svn.apache.org/repos/asf/uima/addons/trunk/FsVariables.

Hidden Markov Model Tagger Annotator

The Tagger Annotator component implements a Hidden Markov Model (HMM) tagger. The tagger assumes that sentences and tokens have already been annotated in the CAS with sentence and token annotations. It iterates over the sentences and, within each sentence, over the tokens to accumulate a list of words, and then invokes the tagger on this list. The HMM tagger employs the Viterbi algorithm to calculate the most probable tag sequence. For each token, it updates the posTag field with the part-of-speech tag. Model training happens outside of UIMA; the tagger only receives statistical information from a model file, which is passed to the tagger along with some further parameters through a properties file. Click here to access the user documentation. The Java source of the annotator can be accessed in the SVN repository at https://svn.apache.org/repos/asf/uima/addons/trunk/Tagger.
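
For readers unfamiliar with Viterbi decoding, the following self-contained sketch computes the most probable state sequence for a toy two-tag HMM. It is a generic textbook formulation in log space, not the Tagger's implementation: the class name, the two tags, and all probabilities are invented for illustration, and no model file is read.

```java
import java.util.Arrays;

// Hypothetical illustration only; not taken from the Tagger source.
class ViterbiSketch {

    // Toy model (assumed values): states 0 = NOUN, 1 = VERB; observations 0 = "dogs", 1 = "bark".
    static final double[] START = {0.8, 0.2};
    static final double[][] TRANS = {{0.3, 0.7}, {0.6, 0.4}}; // TRANS[from][to]
    static final double[][] EMIT = {{0.9, 0.1}, {0.2, 0.8}};  // EMIT[state][observation]

    // Returns the most probable state sequence for the given observation indices.
    static int[] decode(int[] obs, double[] start, double[][] trans, double[][] emit) {
        int n = obs.length;
        int k = start.length;
        if (n == 0) {
            return new int[0];
        }
        double[][] score = new double[n][k]; // best log-probability of a path ending in state s at step t
        int[][] back = new int[n][k];        // predecessor state on that best path
        for (int s = 0; s < k; s++) {
            score[0][s] = Math.log(start[s]) + Math.log(emit[s][obs[0]]);
        }
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < k; s++) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int p = 0; p < k; p++) {
                    double cand = score[t - 1][p] + Math.log(trans[p][s]);
                    if (cand > best) {
                        best = cand;
                        arg = p;
                    }
                }
                score[t][s] = best + Math.log(emit[s][obs[t]]);
                back[t][s] = arg;
            }
        }
        // Pick the best final state, then follow the back-pointers to recover the path.
        int last = 0;
        for (int s = 1; s < k; s++) {
            if (score[n - 1][s] > score[n - 1][last]) {
                last = s;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        int[] tags = decode(new int[] {0, 1}, START, TRANS, EMIT);
        System.out.println(Arrays.toString(tags));
    }
}
```

Working in log space avoids numeric underflow on long sentences, which is why the sketch sums logarithms instead of multiplying probabilities.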

BSF Annotator

The Bean Scripting Framework (BSF) Annotator is an Apache UIMA analysis engine that provides a link between the UIMA framework and the scripting languages that are supported by Apache BSF (https://jakarta.apache.org/bsf). The current implementation comes with examples in Beanshell (https://www.beanshell.org) and Rhino Javascript (https://www.mozilla.org/rhino). Simple tests have also been conducted successfully with Jython (https://jython.sourceforge.net/Project/index.html) and JRuby (https://jruby.codehaus.org). The annotator takes the source file containing the script as a parameter. The script is expected to implement the initialize and process functions of the analysis engine. Using a scripting language can be very handy for quick prototyping, pre/post-processing, CAS cleaning tasks, or type system conversion/adaptation. The Java source of the annotator can be accessed from the SVN repository at https://svn.apache.org/repos/asf/uima/addons/trunk/BSFAnnotator.

Tika Annotator

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. The TikaAnnotator uses Tika to generate annotations representing the original markup of a document and to extract its text and metadata. It consists of three resources:

FileSystemCollectionReader

similar to the one in the UIMA examples, but uses Tika to extract the text from binary documents and generates annotations to represent the markup

MarkupAnnotator

takes the original content from a view and generates a new view containing the extracted text with markup annotations

TikaWrapper

utility class that populates a CAS from a binary document; used by the FileSystemCollectionReader

Lucene CAS indexer (Lucas)

The Lucene CAS indexer (Lucas) is a UIMA CAS consumer that stores CAS data in a Lucene index. The consumer transforms annotation objects of a CAS into Lucene token streams, which are stored in a Lucene document. Token streams can be processed further by token filters. Lucas comes with a set of its own token filters and integrations for some Lucene token filters. Furthermore, you can deploy your own token filters. The mapping between UIMA annotations and Lucene tokens, as well as the token filtering, is configured by an XML mapping file.

Click here to access the user documentation. The Java source of the consumer can be accessed in the SVN repository.

Simple Server (UIMA REST Service)

The UIMA Simple Server makes the results of UIMA processing available in a simple, XML-based format. The intended use of the Simple Server is to provide UIMA analysis as a REST service. The Simple Server is implemented as a Java Servlet and can be deployed into any Servlet container (such as Apache Tomcat or Jetty).

Click here to access the user documentation. The Java source of the server can be accessed from the SVN repository at https://svn.apache.org/repos/asf/uima/addons/trunk/SimpleServer.

OpenCalais Annotator

The OpenCalais Annotator component wraps the OpenCalais web service and makes the OpenCalais analysis results available in UIMA. OpenCalais can detect a large variety of entities, facts, and events, such as Persons, Companies, Acquisitions, and Mergers. For details about the OpenCalais analytics and the license to use the service, please refer to the OpenCalais website. The Java source of the annotator can be accessed in the SVN repository at https://svn.apache.org/repos/asf/uima/addons/trunk/OpenCalaisAnnotator.

Concept Mapper Annotator

ConceptMapper is a powerful, highly configurable, UIMA-based dictionary annotator.

Numerous parameters can be used to specify various aspects of the lookup algorithm, input processing, and output options. The dictionary structure is flexible, allowing any number of synonyms to be associated with an entry, and any number of attributes to be associated with entries or synonyms.

ConceptMapper is separately released, and available on the downloads page.

Lookup and matching against dictionary entries can be performed against contiguous or non-contiguous blocks of text, and token-order-independent lookup is also allowed (for example, the tokens "A" "B" would be considered a match against the dictionary entry "B" "A").
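
One simple way to picture token-order-independent matching is to compare sorted copies of the token lists, as in the sketch below. This is only an illustration of the idea, not ConceptMapper's lookup algorithm; the class and method names are hypothetical, and non-contiguous matching is not covered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not taken from the ConceptMapper source.
class OrderIndependentLookupSketch {

    // Returns a sorted copy so that token order no longer matters in comparisons.
    static List<String> canonical(List<String> tokens) {
        List<String> sorted = new ArrayList<>(tokens);
        Collections.sort(sorted);
        return sorted;
    }

    // True when both lists contain the same tokens with the same counts, in any order.
    static boolean matches(List<String> entry, List<String> tokens) {
        return canonical(entry).equals(canonical(tokens));
    }

    public static void main(String[] args) {
        List<String> entry = Arrays.asList("B", "A");
        System.out.println(matches(entry, Arrays.asList("A", "B")));
        System.out.println(matches(entry, Arrays.asList("A", "C")));
    }
}
```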

Additionally, ConceptMapper can be configured to use any tokenizer annotator, enabling tokenization of the dictionary identically with the input text.

Click here to access the user documentation.

Configurable Feature Extractor Annotator

The Configurable Feature Extractor (CFE) Annotator is a multipurpose tool that enables feature extraction from a UIMA CAS in a generalized, application-independent way.

The extraction process is performed according to rules expressed using the Feature Extraction Specification Language (FESL) that are stored in configuration files.

Using CFE eliminates the need for creating customized CAS consumers and writing Java code for every application. Instead, by using FESL rules in XML format, users can customize the information extraction process to suit their application. FESL's rule semantics allow precise identification of the information to be extracted by means of multi-parameter criteria.

Click here to access the user documentation.

AlchemyAPI Annotator

The AlchemyAPI Annotator is a wrapper for the AlchemyAPI web services, which provide text enrichment facilities such as categorization, entity extraction, language identification, keyword extraction, and concept tagging.

Click here to access the user documentation.

Solr CAS Consumer (Solrcas)

The Solr CAS Consumer (Solrcas) consumes CAS objects and transforms them into Solr documents, which it writes to a remote or local Solr instance. This provides search capabilities on top of UIMA pipelines using the Apache Solr search server.

Click here to access the user documentation.

UIMA Sandbox components

Some of these components are currently available only in SVN.

Annotators and Consumers

RDF CAS Consumer

Miscellaneous

GALE Multi-Modal Example

These are described in more detail below.

RDF CAS Consumer

The RDF CAS Consumer takes a CAS view and writes it to a file in an RDF format; this is useful for connecting UIMA pipelines to RDF-backed systems (using ontologies, reasoners, etc.).

GALE Multi-Modal Example

The GALE Multi-Modal Example contains a type system and sample code based on a rich multimodal application developed under the DARPA GALE project to demonstrate how to combine analytics from multiple sources and modalities. The GALE Type System (GTS) has been designed for applications that combine such analytics, including speech recognition, language translation, entity detection, topic detection, and speech synthesis.

The sample code will illustrate how to wrap NLP analytics as UIMA annotators using appropriate GTS types, as well as data-reorganization components that convert the output of each analytic into a form suitable for the following analytics, and add cross-reference links back to the original data.

The type system descriptors can be accessed from the SVN repository at https://svn.apache.org/repos/asf/uima/sandbox/trunk/GaleMultiModalExample.