Operator Lemmatizer

Toolkit: com.ibm.streamsx.nlp 1.9.2, namespace: com.ibm.streamsx.nlp

This operator derives the dictionary form (lemma) of each word in a text, for example box for boxes, and determines each word's part-of-speech tag, for example NNS for a plural noun.

The operator is based on the GPoSTTL part-of-speech tagger (http://gposttl.sourceforge.net/), which has a built-in tokenizer and lemmatizer. GPoSTTL uses an enhanced Penn tagset to make its output compatible with the output of TreeTagger. In particular, the second letter of each verb tag distinguishes between "be" verbs (B), "have" verbs (H), and other verbs (V). The enhancement is applied in the last step of the tagging procedure, because the GPoSTTL lexicon contains the original Penn tagset.

Alphabetical list of part-of-speech tags:

Tag     Description
CC      Coordinating conjunction
CD      Cardinal number
DT      Determiner
EX      Existential there
FW      Foreign word
IN      Preposition or subordinating conjunction
JJ      Adjective
JJR     Adjective, comparative
JJS     Adjective, superlative
LS      List item marker
MD      Modal
NN      Noun, singular or mass
NNS     Noun, plural
NNP     Proper noun, singular
NNPS    Proper noun, plural
PDT     Predeterminer
POS     Possessive ending
PRP     Personal pronoun
PRP$    Possessive pronoun
RB      Adverb
RBR     Adverb, comparative
RBS     Adverb, superlative
RP      Particle
SYM     Symbol
TO      to
UH      Interjection
VB      "be" Verb, base form
VH      "have" Verb, base form
VV      "other" Verb, base form
VBD     "be" Verb, past tense
VHD     "have" Verb, past tense
VVD     "other" Verb, past tense
VBG     "be" Verb, gerund or present participle
VHG     "have" Verb, gerund or present participle
VVG     "other" Verb, gerund or present participle
VBN     "be" Verb, past participle
VHN     "have" Verb, past participle
VVN     "other" Verb, past participle
VBP     "be" Verb, non-3rd person singular present
VHP     "have" Verb, non-3rd person singular present
VVP     "other" Verb, non-3rd person singular present
VBZ     "be" Verb, 3rd person singular present
VHZ     "have" Verb, 3rd person singular present
VVZ     "other" Verb, 3rd person singular present
WDT     Wh-determiner
WP      Wh-pronoun
WP$     Possessive wh-pronoun
WRB     Wh-adverb
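
Downstream code that expects the original Penn verb tags can undo the enhancement, because only the second letter of a verb tag differs. The following is a minimal sketch, assuming a helper function named toPennTag that is hypothetical and not part of the toolkit. It relies on the verb tags being the only tags in the list above that start with "V".

```spl
// Hypothetical helper, not part of com.ibm.streamsx.nlp.
// Maps an enhanced verb tag (VB*, VH*, VV*) back to the original Penn verb tag (VB*).
public rstring toPennTag(rstring tag)
{
    // Verb tags are the only tags in the list above that start with "V".
    if (length(tag) >= 2 && substring(tag, 0, 1) == "V")
    {
        // Keep the suffix (D, G, N, P, Z, or empty) and restore the Penn "VB" prefix.
        return "VB" + substring(tag, 2, length(tag) - 2);
    }
    // All other tags are already plain Penn tags.
    return tag;
}
```

With this sketch, VHZ and VVZ both map to VBZ, while non-verb tags such as NNS are returned unchanged.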

Summary

Ports
This operator has 0 or more input ports and 1 output port.
Windowing
This operator does not accept any windowing configurations.
Parameters
This operator supports 2 parameters.

Required: textAttribute

Optional: initializationDirectory

Metrics
This operator does not report any metrics.

Properties

Implementation
C++
Threading
Always - Operator always provides a single threaded execution context.

Input Ports

Ports (0...)
Properties

Output Ports

Assignments
This operator allows any SPL expression of the correct type to be assigned to output attributes.
Output Functions
Functions
<any T> T AsIs(T)

The original argument expression is submitted.

public list<tuple<rstring word, rstring pos, rstring lemma>> TagWords()

This function generates a list of words and corresponding part-of-speech tags and lemmas.

Example use:

stream<rstring text,list<tuple<rstring word, rstring pos, rstring lemma>> result> A = Lemmatizer() {
   param ...
   output A : result = TagWords();
}
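
The list returned by TagWords() can be processed by a downstream operator. The following fragment is a sketch only: it assumes the stream A from the example above, belongs in the graph clause of the same composite, and uses a hypothetical stream name Nouns. It submits the lemma of every word whose tag starts with "NN".

```spl
// Hypothetical downstream operator; A is the stream produced by the example above.
stream<rstring noun> Nouns = Custom(A)
{
    logic
        onTuple A :
        {
            for (tuple<rstring word, rstring pos, rstring lemma> entry in A.result)
            {
                // NN, NNS, NNP, and NNPS all start with "NN".
                if (findFirst(entry.pos, "NN") == 0)
                {
                    submit({ noun = entry.lemma }, Nouns);
                }
            }
        }
}
```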

public rstring NormalizedText()

This function generates the normalized input text, in which each word is replaced by its lemma.

Example use:

stream<rstring text,rstring normalizedtext> A = Lemmatizer() {
   param ...
   output A : normalizedtext = NormalizedText();
}

public list<rstring> Lemmas()

This function generates a list of lemmas.

Example use:

stream<rstring text, list<rstring> lemmas> A = Lemmatizer() {
   param ...
   output A : lemmas = Lemmas();
}

public list<rstring> PosTags()

This function generates a list of part-of-speech tags.

Example use:

stream<rstring text, list<rstring> pos> A = Lemmatizer() {
   param ...
   output A : pos = PosTags();
}

public list<rstring> Words()

This function generates a list of words.

Example use:

stream<rstring text, list<rstring> words> A = Lemmatizer() {
   param ...
   output A : words = Words();
}

Ports (0)

This mandatory output port sends the tuples.

Properties

Parameters

initializationDirectory

This optional parameter specifies the path to the lexicon files that are read during the initialization of the operator. The recommended location for storing these files is the etc directory in the toolkit. If a relative path is specified, the path is relative to the application directory.

Properties

textAttribute

Specifies the rstring attribute of the input port that holds the text to be lemmatized.

Properties
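
The parameters and output functions can be combined in one invocation. The following is a minimal sketch and not an official sample. The composite name, the stream names, the input sentence, and the commented-out directory path are hypothetical. The sketch also assumes that textAttribute takes the attribute name directly, and that the operator is invoked under the com.ibm.streamsx.nlp namespace.

```spl
use com.ibm.streamsx.nlp::Lemmatizer;

composite LemmatizerSketch
{
    graph
        // Hypothetical source that emits a single sentence.
        stream<rstring text> Sentences = Beacon()
        {
            param
                iterations : 1u;
            output
                Sentences : text = "The boxes were opened";
        }

        stream<rstring text, rstring normalizedtext, list<rstring> lemmas> Lemmatized = Lemmatizer(Sentences)
        {
            param
                // Required: the input attribute that holds the text (assumed attribute syntax).
                textAttribute : text;
                // Optional; the path below is hypothetical:
                // initializationDirectory : "etc/lexicon";
            output
                Lemmatized : normalizedtext = NormalizedText(), lemmas = Lemmas();
        }

        // Print each result tuple to standard output.
        () as Printer = Custom(Lemmatized)
        {
            logic
                onTuple Lemmatized : println(Lemmatized);
        }
}
```

The text attribute is not assigned explicitly in this sketch, on the assumption that an output attribute with the same name as an input attribute is forwarded automatically.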

Libraries

GPoSTTL
Library Name: GPoSTTL
Library Path: ../../impl/lib
Include Path: ../../impl/include