DUCC is a Linux cluster controller designed to scale out any UIMA pipeline,
both for high-throughput collection processing jobs and for low-latency applications.
Building on UIMA-AS, DUCC is particularly well suited to running large-memory Java
analytics in multiple threads in order to fully utilize multicore machines.
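The scale-out pattern this describes, one large in-memory model loaded once per JVM and shared by several analysis threads, can be sketched in plain Java. The `Scaleout` class, its `MODEL` map, and the per-document lookup below are hypothetical stand-ins for an annotator and its model, not UIMA or DUCC API:

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class Scaleout {
    // Hypothetical stand-in for a large read-only model, loaded once per process
    // and shared by every analysis thread in the JVM.
    static final Map<String, Integer> MODEL = Map.of("uima", 1, "ducc", 2);

    // Runs one pipeline instance per thread against the shared model, mirroring
    // one analysis-engine instance per thread inside a single DUCC job process.
    public static int run(int threads, int docsPerThread) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger processed = new AtomicInteger();
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int d = 0; d < docsPerThread; d++) {
                    // "Annotate" a document: a read-only lookup in the shared model.
                    MODEL.getOrDefault("uima", 0);
                    processed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return processed.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(4, 100)); // 4 threads x 100 documents
    }
}
```

Because the model is shared and read-only, adding threads scales throughput without multiplying the large memory footprint, which is the point of running multi-threaded analytics in one big JVM rather than many small ones.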
DUCC manages the life cycle of all processes deployed across the cluster, including
non-UIMA processes such as tomcat servers or VNC sessions.
DUCC has an extensive web interface providing details on all user activity.
Because DUCC is built for UIMA-based analytics from the ground up, it automatically
surfaces details such as which annotators are currently initializing and the timing
breakdown for each primitive annotator in a pipeline.
DUCC's resource manager uses the combination of specified memory requirement and
process class to find space to run each new managed process instance.
The resource manager never overcommits a machine's RAM, and DUCC uses Linux
cgroups to prevent managed processes from interfering with one another.
Only processes that exceed their requested memory allocation are subject to paging.
The resource manager employs process preemption to dynamically rebalance compute
resources between collection processing jobs.
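A minimal sketch of the placement rule above, assuming the resource manager accounts for memory in whole share quanta as DUCC's scheduler does; the 15 GB quantum value and the method names here are made-up illustrations, not DUCC configuration:

```java
public class Shares {
    // DUCC's resource manager sizes allocations in whole "share quanta";
    // the quantum size is a cluster configuration choice (this value is made up).
    static final int QUANTUM_GB = 15;

    // A process asking for memGb is charged the smallest whole number of
    // quanta that covers its declared memory requirement.
    public static int sharesFor(int memGb) {
        return (memGb + QUANTUM_GB - 1) / QUANTUM_GB;
    }

    // A machine can host the process only if it has enough free quanta;
    // memory is never overcommitted, so a full machine forces the scheduler
    // to look elsewhere or preempt a lower-priority allocation.
    public static boolean fits(int freeGb, int memGb) {
        return freeGb / QUANTUM_GB >= sharesFor(memGb);
    }

    public static void main(String[] args) {
        System.out.println(sharesFor(28)); // a 28 GB request rounds up to 2 quanta
        System.out.println(fits(45, 40));  // 3 free quanta cover a 3-quantum request
    }
}
```

Rounding requests up to whole quanta is what lets the scheduler treat the cluster as a pool of interchangeable shares, and preemption simply reclaims some of a job's shares when a competing job is entitled to them.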
DUCC is primarily intended to be used for research and development activities where
multiple users need to efficiently share cluster resources for a wide variety of
computational activities. All processes run with
the credentials of the submitting user. Process logfiles and DUCC collected performance
data are stored in user filesystem space. This allows/forces each
user to decide how to manage the metadata associated with work submitted to DUCC.
Visit the UIMA-DUCC live demo and the UIMA-DUCC documentation for more information.
The following sections describe each of the three types of DUCC-managed processes
(collection processing jobs, services, and arbitrary processes) and contrast some
differences between DUCC and Hadoop for scaling out UIMA applications.
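As a taste of the first type, a collection processing job is submitted to DUCC as a small specification file in Java properties format. The property names below are recalled from DUCC's ducc_submit interface and may not match a given DUCC release exactly; the class names are placeholders, so treat this as an illustrative sketch rather than a working specification:

```properties
# Illustrative DUCC job specification (Java properties format).
description           = Sample scale-out job
driver_descriptor_CR  = org.example.MyCollectionReader
process_descriptor_AE = org.example.MyAggregateAE
# Declared per-process memory, used by the resource manager for placement.
process_memory_size   = 28
# Scheduling class, which governs priority and preemptability.
scheduling_class      = normal
```

The memory size and scheduling class are exactly the two inputs the resource manager combines when deciding where each process instance of the job may run.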