Operator Lemmatizer

Toolkit: com.ibm.streamsx.nlp 1.9.2, namespace: com.ibm.streamsx.nlp

This operator derives the dictionary form (lemma) of each word in a text, for example box for boxes, and determines the part-of-speech tag of each word, for example NNS for a plural noun.

The operator is based on GPoSTTL (http://gposttl.sourceforge.net/), a part-of-speech tagger with a built-in tokenizer and lemmatizer. GPoSTTL uses an enhanced Penn tagset so that its output is compatible with the output of TreeTagger. In particular, the second letter of each verb tag distinguishes between "be" verbs (B), "have" verbs (H) and other verbs (V). Because the GPoSTTL lexicon contains the original Penn tagset, this enhancement is applied in the last step of the tagging procedure.

Alphabetical list of part-of-speech tags:

Tag	Description
CC	Coordinating conjunction
CD	Cardinal number
DT	Determiner
EX	Existential there
FW	Foreign word
IN	Preposition or subordinating conjunction
JJ	Adjective
JJR	Adjective, comparative
JJS	Adjective, superlative
LS	List item marker
MD	Modal
NN	Noun, singular or mass
NNS	Noun, plural
NNP	Proper noun, singular
NNPS	Proper noun, plural
PDT	Predeterminer
POS	Possessive ending
PRP	Personal pronoun
PRP$	Possessive pronoun
RB	Adverb
RBR	Adverb, comparative
RBS	Adverb, superlative
RP	Particle
SYM	Symbol
TO	to
UH	Interjection
VB	"be" Verb, base form
VH	"have" Verb, base form
VV	"other" Verb, base form
VBD	"be" Verb, past tense
VHD	"have" Verb, past tense
VVD	"other" Verb, past tense
VBG	"be" Verb, gerund or present participle
VHG	"have" Verb, gerund or present participle
VVG	"other" Verb, gerund or present participle
VBN	"be" Verb, past participle
VHN	"have" Verb, past participle
VVN	"other" Verb, past participle
VBP	"be" Verb, non-3rd person singular present
VHP	"have" Verb, non-3rd person singular present
VVP	"other" Verb, non-3rd person singular present
VBZ	"be" Verb, 3rd person singular present
VHZ	"have" Verb, 3rd person singular present
VVZ	"other" Verb, 3rd person singular present
WDT	Wh-determiner
WP	Wh-pronoun
WP$	Possessive wh-pronoun
WRB	Wh-adverb
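
The verb tags in the table follow a regular pattern: the first letter is V and the second letter names the verb class. The following is a minimal sketch of how an application might decode that second letter. The function verbKind is a hypothetical helper written for illustration; it is not part of the com.ibm.streamsx.nlp toolkit.

```spl
// Hypothetical helper, not part of com.ibm.streamsx.nlp.
// Returns "be", "have" or "other" for a verb tag of the enhanced Penn tagset,
// and an empty string for any tag that is not a verb tag.
rstring verbKind(rstring pos) {
    // Verb tags have at least two letters and start with V.
    if (length(pos) < 2 || substring(pos, 0, 1) != "V") {
        return "";
    }
    // The second letter selects the verb class.
    rstring second = substring(pos, 1, 1);
    if (second == "B") { return "be"; }
    if (second == "H") { return "have"; }
    if (second == "V") { return "other"; }
    return "";
}
```

With this sketch, verbKind("VHD") returns "have" and verbKind("NNS") returns an empty string. Any remaining letters of a verb tag (D, G, N, P, Z) carry the tense or form listed in the table.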

Summary

Ports

This operator has 0 or more input ports and 1 output port.

Windowing

This operator does not accept any windowing configurations.

Parameters

This operator supports 2 parameters.

Required: textAttribute

Optional: initializationDirectory

Metrics

This operator does not report any metrics.

Properties

Implementation: C++
Threading: Always - Operator always provides a single threaded execution context.

Input Ports

Ports (0...)

Properties

Output Ports

Assignments: This operator allows any SPL expression of the correct type to be assigned to output attributes.

Output Functions

Functions

<any T> T AsIs(T)

The original argument expression is submitted.

public list<tuple<rstring word, rstring pos, rstring lemma>> TagWords()

This function generates a list of words and corresponding part-of-speech tags and lemmas.

Example use:

stream<rstring text,list<tuple<rstring word, rstring pos, rstring lemma>> result> A = Lemmatizer() {
   param ...
   output A : result = TagWords();
}
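
The fragment above leaves out the input stream and the parameters. A complete application might look like the following sketch. The composite name, the stream names, the Beacon source and the sample sentence are assumptions chosen for illustration; only the Lemmatizer operator, its textAttribute parameter and the TagWords() function come from this page.

```spl
use com.ibm.streamsx.nlp::Lemmatizer;

// Illustrative composite; all names except Lemmatizer, textAttribute
// and TagWords are assumptions.
composite LemmatizerTagWordsSample {
    graph
        // Emit a single sample sentence.
        stream<rstring text> Sentences = Beacon() {
            param
                iterations : 1u;
            output
                Sentences : text = "The boxes were heavy";
        }

        // Tag every word of the text attribute.
        stream<rstring text, list<tuple<rstring word, rstring pos, rstring lemma>> result> Tagged = Lemmatizer(Sentences) {
            param
                textAttribute : text;
            output
                Tagged : result = TagWords();
        }

        // Print one line per word: word, tag and lemma separated by tabs.
        () as Printer = Custom(Tagged) {
            logic
                onTuple Tagged : {
                    for (tuple<rstring word, rstring pos, rstring lemma> entry in result) {
                        println(entry.word + "\t" + entry.pos + "\t" + entry.lemma);
                    }
                }
        }
}
```

Each list element carries the word together with its tag and lemma, so the consumer does not have to correlate separate lists.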

public rstring NormalizedText()

This function generates the normalized form of the input text, in which words are replaced by their lemmas.

Example use:

stream<rstring text,rstring normalizedtext> A = Lemmatizer() {
   param ...
   output A : normalizedtext = NormalizedText();
}

public list<rstring> Lemmas()

This function generates a list of lemmas.

Example use:

stream<rstring text, list<rstring> lemmas> A = Lemmatizer() {
   param ...
   output A : lemmas = Lemmas();
}

public list<rstring> PosTags()

This function generates a list of part-of-speech tags.

Example use:

stream<rstring text, list<rstring> pos> A = Lemmatizer() {
   param ...
   output A : pos = PosTags();
}

public list<rstring> Words()

This function generates a list of words.

Example use:

stream<rstring text, list<rstring> words> A = Lemmatizer() {
   param ...
   output A : words = Words();
}
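
The output functions above can presumably be combined in a single invocation, one per output attribute. The following sketch assumes that this is allowed, as it usually is for SPL operators with custom output functions. The composite name, the stream names and the sample sentence are illustrative assumptions.

```spl
use com.ibm.streamsx.nlp::Lemmatizer;

// Illustrative composite; it assumes that several output functions
// may be used in the same output clause.
composite LemmatizerListsSample {
    graph
        // Emit a single sample sentence.
        stream<rstring text> Sentences = Beacon() {
            param
                iterations : 1u;
            output
                Sentences : text = "The boxes were heavy";
        }

        // Fill each output attribute with a different output function.
        stream<rstring text, list<rstring> words, list<rstring> pos,
               list<rstring> lemmas, rstring normalizedtext> Analyzed = Lemmatizer(Sentences) {
            param
                textAttribute : text;
            output
                Analyzed : words = Words(),
                           pos = PosTags(),
                           lemmas = Lemmas(),
                           normalizedtext = NormalizedText();
        }

        // Print each result on its own line.
        () as Printer = Custom(Analyzed) {
            logic
                onTuple Analyzed : {
                    println(words);
                    println(pos);
                    println(lemmas);
                    println(normalizedtext);
                }
        }
}
```

When the word, tag and lemma of each token are needed together, TagWords() is the safer choice, because this page does not state that the separate lists are index-aligned.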

Ports (0)

This mandatory output port submits the tuples produced by the operator.

Properties

Optional: false

TupleMutationAllowed: true
WindowPunctuationOutputMode: Preserving

Parameters

This operator supports 2 parameters.

Required: textAttribute

Optional: initializationDirectory

initializationDirectory

This optional parameter specifies the path to the directory containing the lexicon files that are read during the initialization of the operator. The recommended location for storing these files is the etc directory in the toolkit. If a relative path is specified, the path is relative to the application directory.

Properties

Type: rstring
Cardinality: 1
Optional: true
ExpressionMode: AttributeFree
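
As a sketch of how initializationDirectory might be set, the following application points the operator at a lexicon directory. The path etc/lexicon is a hypothetical example, not a location defined by the toolkit; the composite name, the stream names and the sample sentence are also assumptions.

```spl
use com.ibm.streamsx.nlp::Lemmatizer;

// Illustrative composite; the lexicon path is hypothetical.
composite LemmatizerCustomLexicon {
    graph
        // Emit a single sample sentence.
        stream<rstring text> Sentences = Beacon() {
            param
                iterations : 1u;
            output
                Sentences : text = "The boxes were heavy";
        }

        stream<rstring text, rstring normalizedtext> Normalized = Lemmatizer(Sentences) {
            param
                textAttribute : text;
                // Hypothetical relative path, resolved against the application directory.
                initializationDirectory : "etc/lexicon";
            output
                Normalized : normalizedtext = NormalizedText();
        }

        // Print the normalized text.
        () as Printer = Custom(Normalized) {
            logic
                onTuple Normalized : {
                    println(normalizedtext);
                }
        }
}
```

Because the expression mode is AttributeFree, the value cannot refer to input attributes; a string literal as shown here satisfies that restriction.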

textAttribute

This mandatory parameter specifies the rstring attribute of the input port that holds the text to be lemmatized.

Properties

Type: rstring
Cardinality: 1
Optional: false
ExpressionMode: Attribute

Libraries

GPoSTTL: Library Name: GPoSTTL; Library Path: ../../impl/lib; Include Path: ../../impl/include