uima::Language Class Reference

Detailed Description

The class Language models languages in UIMACPP.

A language is specified as a string holding a 2-character language and an optional 2-character territory, i.e. as "ll-cc" or "ll".

String representation of the simple language sub-part is according to ISO standard 639 "Codes for the representation of the names of languages".

String representation of the territory sub-part is according to ISO standard 3166 "Codes for the Representation of Names of Countries".

String representation of the full language object is according to IETF RFC 1766 "Tags for the Identification of Languages": <LANG>-<SUBTAG>

There is also an internal technical numeric representation of a language as a 4 byte number (32 bit, high-word language value and low-word territory value). Conversion to or from numeric representation is provided via a constructor or conversion operator.
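
As an illustration of the numeric layout described above, the following sketch packs two 2-character codes into a 32-bit value. The helper name packLanguage, the use of lowercase ASCII bytes, and the zero low word for a missing territory are assumptions made for this example; they are not taken from the UIMACPP sources.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of the UIMACPP API: packs two 2-character
// codes into one 32-bit value, language in the high word, territory in the
// low word. An empty territory leaves the low word at zero (an assumption).
inline std::uint32_t packLanguage(const std::string& lang,
                                  const std::string& terr = "") {
    auto word = [](const std::string& code) -> std::uint32_t {
        if (code.size() != 2) return 0;  // anything else counts as "no value"
        return (static_cast<std::uint32_t>(static_cast<unsigned char>(code[0])) << 8)
             |  static_cast<std::uint32_t>(static_cast<unsigned char>(code[1]));
    };
    return (word(lang) << 16) | word(terr);
}

// Usage check under the assumptions above.
inline void packLanguageExample() {
    assert(packLanguage("en", "us") == 0x656e7573u);  // bytes 'e' 'n' 'u' 's'
    assert(packLanguage("en") == 0x656e0000u);        // territory absent
}
```

With this layout, shifting the value right by 16 bits recovers the language word, and masking with 0xFFFF recovers the territory word.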

The class distinguishes between unspecified and invalid languages and territories. For example, in "en" the territory (or sub-language) is valid but unspecified, whereas in "en-US" it is specified as "US". In "en-FOO", however, the territory is invalid because there is no such territory code.
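
The distinction between unspecified and invalid can be pictured as a three-state value per sub-part. The sketch below is a simplified model, not the UIMACPP implementation: the names PartState, classifyPart and territoryState are hypothetical, and validity is judged only by shape (exactly two letters) rather than by lookup in the ISO 3166 code list.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical three-state model; these names are illustrative and are not
// taken from the UIMACPP headers.
enum class PartState { Unspecified, Specified, Invalid };

// Classifies one sub-part of a tag. Only the shape is checked (exactly two
// letters); a real implementation would also need the ISO code lists.
inline PartState classifyPart(const std::string& part) {
    if (part.empty()) return PartState::Unspecified;
    if (part.size() != 2) return PartState::Invalid;
    for (unsigned char c : part) {
        if (!std::isalpha(c)) return PartState::Invalid;
    }
    return PartState::Specified;
}

// Splits "ll" or "ll-cc" at the first hyphen and classifies the territory.
inline PartState territoryState(const std::string& tag) {
    const std::string::size_type dash = tag.find('-');
    if (dash == std::string::npos) return PartState::Unspecified;
    return classifyPart(tag.substr(dash + 1));
}

// Usage check mirroring the three examples in the text.
inline void territoryStateExample() {
    assert(territoryState("en") == PartState::Unspecified);
    assert(territoryState("en-US") == PartState::Specified);
    assert(territoryState("en-FOO") == PartState::Invalid);
}
```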

Because of this, there is more than one way for two language objects to be compatible with each other:

  1. Full identity: language and sub-language/territory are identical (for example, en-US == en-us).

  2. Match: the language part is identical and the sub-language/territory of at least one of the objects is unspecified (for example, en matches en-US).

  3. Match ignoring territory: the language part is identical, both sub-languages are specified but they are different from each other (for example, en-US matches-without-territory en-AU).

Match type 2 is used if an annotator specifies that it can deal with any kind of English text and is not limited to, or specialized for, US English.

Match type 3 is not supported.

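     // Build a language object from the first command-line argument.
     Language clLanguage(argv[1]);
     if (!clLanguage.isValid()) {
        // abort with error: the argument is not a valid language code
        //...
     }
     // Accept only English or German, with or without a territory.
     if (!(clLanguage.matches("en") || clLanguage.matches("de"))) {
        // abort with error: unsupported language
        //...
     }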
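
To make the difference between match types 1 and 2 concrete, here is a minimal sketch of the comparison rules as described in this section. LangTag, identical and matches are hypothetical stand-ins written for this example, with an empty string standing for an unspecified part; the actual class may store and compare its parts differently.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a language object: lowercase codes, with an
// empty string meaning "unspecified". Not the UIMACPP class itself.
struct LangTag {
    std::string language;
    std::string territory;
};

// Match type 1: both parts are identical.
inline bool identical(const LangTag& a, const LangTag& b) {
    return a.language == b.language && a.territory == b.territory;
}

// Match type 2: same language, and the territories are equal or at least
// one of them is unspecified. An unspecified language matches anything.
inline bool matches(const LangTag& a, const LangTag& b) {
    if (a.language.empty() || b.language.empty()) return true;
    if (a.language != b.language) return false;
    return a.territory.empty() || b.territory.empty()
        || a.territory == b.territory;
}

// Usage check for the cases named in the list of match types.
inline void matchExample() {
    const LangTag en{"en", ""}, enUS{"en", "us"}, enAU{"en", "au"};
    assert(matches(en, enUS));     // type 2: one territory is unspecified
    assert(!identical(en, enUS));  // but the objects are not identical
    assert(!matches(enUS, enAU));  // type 3 is not supported
}
```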

Note: As the class is simple, the compiler-generated copy constructor and assignment operator can be used.


Language constants and types

typedef long TyLanguageAsNumber
 A typedef for representing a language as a numeric value.
char const * INVALID
 Special constants for the invalid & unspecified languages.
char const * UNSPECIFIED

Public Member Functions

Constructors
 Language (void)
 Default Constructor: Language::UNSPECIFIED.
 Language (const TCHAR *cpszLanguageCode)
 Constructor from a C string.
 Language (const std::string &languageCode)
 Constructor from a single-byte string (std::string).
 Language (const icu::UnicodeString &languageCode)
 Constructor from an ICU Unicode string.
 Language (const UnicodeStringRef &languageCode)
 Constructor from a UnicodeStringRef.
 Language (TyLanguageAsNumber ulLanguageAsLong)
 Constructor from a 32 bit representation of a language (see asNumber).
Match functions
bool operator== (const Language &crclObject) const
 Returns TRUE, if this language is identical to the specified language.
bool operator!= (const Language &crclObject) const
 Returns TRUE, if this language is not identical to the specified language.
bool operator< (const Language &crclOther) const
 Returns TRUE, if this language code sorts before the specified language.
bool matches (const Language &crclCompareLang) const
 Returns TRUE, if the languages are identical and either the territories are equal or one is unspecified, or if one of the languages is unspecified.
Miscellaneous
bool isValid (void) const
 Returns TRUE if language is valid (territory may be missing).
const char * getLanguageCode (void) const
 Get just the 2-character language part, or an empty string if unspecified.
TyLanguageAsNumber getLanguage (void) const
 Get a numeric form of just the language (2-characters in top 2-bytes).
bool hasLanguage (void) const
 Returns TRUE if language has been specified.
const char * getTerritoryCode (void) const
 Get just the 2-character territory part, or an empty string if unspecified.
TyLanguageAsNumber getTerritory (void) const
 Get a numeric form of just the territory (2-characters in bottom 2-bytes).
bool hasTerritory (void) const
 Returns TRUE if territory has been specified.
void setValue (const std::string &crclString)
 Sets the value according to string crclString.
std::string asString (void) const
 Returns the object in the form <LANGUAGE_CODE>-<TERRITORY_CODE>.
icu::UnicodeString asUnicodeString (void) const
 Returns the object in the form <LANGUAGE_CODE>-<TERRITORY_CODE>.
TyLanguageAsNumber asNumber (void) const
 Returns the object as a 4-byte "number" (actually just the 4 character bytes, e.g. 0x656e7573 for 'enus').


Member Typedef Documentation

typedef long uima::Language::TyLanguageAsNumber
 

A typedef for representing a language as a numeric value.


Constructor & Destructor Documentation

uima::Language::Language (void) [inline]

Default Constructor: Language::UNSPECIFIED.

uima::Language::Language (const TCHAR *cpszLanguageCode) [inline]

Constructor from a C string.

The string must have the form language-territory, for example "en-US", or just the language, for example "en".

uima::Language::Language (const std::string &languageCode) [inline]

Constructor from a single-byte string (std::string).

The string must have the form language-territory, for example "en-US", or just the language, for example "en".

uima::Language::Language (const icu::UnicodeString &languageCode) [inline]

Constructor from an ICU Unicode string.

The string must have the form language-territory, for example "en-US", or just the language, for example "en".

uima::Language::Language (const UnicodeStringRef &languageCode) [inline]

Constructor from a UnicodeStringRef.

The string must have the form language-territory, for example "en-US", or just the language, for example "en".

uima::Language::Language (TyLanguageAsNumber ulLanguageAsLong)

Constructor from a 32 bit representation of a language (see asNumber).


Member Function Documentation

bool uima::Language::operator== (const Language &crclObject) const [inline]

Returns TRUE, if this language is identical to the specified language.

bool uima::Language::operator!= (const Language &crclObject) const [inline]

Returns TRUE, if this language is not identical to the specified language.

bool uima::Language::operator< (const Language &crclOther) const [inline]

Returns TRUE, if this language code sorts before the specified language.

bool uima::Language::matches (const Language &crclCompareLang) const

Returns TRUE, if the languages are identical and either the territories are equal or one is unspecified, or if one of the languages is unspecified.

bool uima::Language::isValid (void) const [inline]

Returns TRUE if language is valid (territory may be missing).

const char * uima::Language::getLanguageCode (void) const [inline]

Get just the 2-character language part, or an empty string if unspecified.

Language::TyLanguageAsNumber uima::Language::getLanguage (void) const [inline]

Get a numeric form of just the language (2-characters in top 2-bytes).

bool uima::Language::hasLanguage (void) const [inline]

Returns TRUE if language has been specified.

const char * uima::Language::getTerritoryCode (void) const [inline]

Get just the 2-character territory part, or an empty string if unspecified.

Language::TyLanguageAsNumber uima::Language::getTerritory (void) const [inline]

Get a numeric form of just the territory (2-characters in bottom 2-bytes).

bool uima::Language::hasTerritory (void) const [inline]

Returns TRUE if territory has been specified.

void uima::Language::setValue (const std::string &crclString) [inline]

Sets the value according to string crclString.

crclString must have the form <LANG_ID>[-<TERR_ID>].

std::string uima::Language::asString (void) const [inline]

Returns the object in the form <LANGUAGE_CODE>-<TERRITORY_CODE>.

For example, en-US.

icu::UnicodeString uima::Language::asUnicodeString (void) const [inline]

Returns the object in the form <LANGUAGE_CODE>-<TERRITORY_CODE>.

For example, en-US.

Language::TyLanguageAsNumber uima::Language::asNumber (void) const [inline]

Returns the object as a 4-byte "number" (actually just the 4 character bytes, e.g. 0x656e7573 for 'enus').
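
Reading the number back byte by byte shows why it is "just the 4 character bytes". The sketch below assumes the byte order described for the numeric representation (language in the high word, territory in the low word) and that an absent part is stored as zero bytes; unpackLanguage is a hypothetical helper, not a UIMACPP function.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of the UIMACPP API: reads the four bytes of
// the packed value back as characters, most significant byte first.
inline std::string unpackLanguage(std::uint32_t number) {
    std::string chars;
    for (int shift = 24; shift >= 0; shift -= 8) {
        const char c = static_cast<char>((number >> shift) & 0xFFu);
        if (c != '\0') chars.push_back(c);  // skip the bytes of an absent part
    }
    return chars;
}

// Usage check: 0x65 'e', 0x6e 'n', 0x75 'u', 0x73 's'.
inline void unpackLanguageExample() {
    assert(unpackLanguage(0x656e7573u) == "enus");
    assert(unpackLanguage(0x656e0000u) == "en");
}
```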


Member Data Documentation

char const* uima::Language::INVALID [static]
 

Special constants for the invalid & unspecified languages.

char const* uima::Language::UNSPECIFIED [static]
 


The documentation for this class was generated from the following file:
Generated on Mon Oct 1 16:04:13 2012 for UIMACPP API by  doxygen 1.3.9.1