Operator TfIdfWeight

Toolkit: com.ibm.streamsx.nlp 1.9.2. Namespace: com.ibm.streamsx.nlp.

This operator computes how significant each term (word) in a text is relative to a previously trained model (corpus).

Summary

Ports
This operator has 2 input ports and 1 output port.
Windowing
This operator does not accept any windowing configurations.
Parameters
This operator supports 5 parameters.

Required: corpusFile, defaultIDF, documentAttribute

Optional: nTopWeightedTerms, termAttribute

Metrics
This operator reports 2 metrics.

Properties

Implementation
C++
Threading
Always - Operator always provides a single-threaded execution context.

Input Ports

Ports (0)

Tuples for TF-IDF calculation

Ports (1)

Control port. Each tuple received on this port names a corpus file to load; the same tuple can also change the defaultIDF value. Supported tuple attributes are rstring corpusFile and/or float64 defaultIDF.
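
As a minimal sketch of how the two input ports might be wired, the following composite sends one document to port 0 and one control tuple to port 1. The stream names, the Beacon sources, the file names etc/corpus.csv and etc/corpus_v2.csv, and all literal values are illustrative assumptions, not taken from the toolkit. How a relative path in a control tuple is resolved is also not stated here.

```spl
use com.ibm.streamsx.nlp::TfIdfWeight;

composite ControlPortSketch {
    graph
        // Port 0 source: one sample document (hypothetical text).
        stream<rstring text> Docs = Beacon() {
            param
                iterations: 1u;
                initDelay: 10.0; // delay so the control tuple is likely processed first
            output
                Docs: text = "streams process data in motion";
        }

        // Port 1 source: one control tuple naming a new corpus file
        // and a new default IDF (both values are assumptions).
        stream<rstring corpusFile, float64 defaultIDF> Control = Beacon() {
            param
                iterations: 1u;
            output
                Control: corpusFile = "etc/corpus_v2.csv", defaultIDF = 2.0;
        }

        stream<rstring text, list<tuple<rstring term, float64 tfidf>> result>
            Weighted = TfIdfWeight(Docs; Control) {
            param
                corpusFile: "etc/corpus.csv"; // read at operator initialization
                defaultIDF: 1.0;
                documentAttribute: text;
            output
                Weighted: result = WeightedTerms();
        }

        // Print each result tuple for inspection.
        () as Sink = Custom(Weighted) {
            logic
                onTuple Weighted: println(Weighted);
        }
}
```

A control tuple may carry only one of the two attributes, for example a stream of type stream<float64 defaultIDF> to adjust the default without reloading the corpus.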

Output Ports

Assignments
This operator allows any SPL expression of the correct type to be assigned to output attributes.
Output Functions
Functions
<any T> T AsIs(T)

The original argument expression is submitted.

public list<tuple<rstring term, float64 tfidf>> WeightedTerms()

This function returns a list of terms with their corresponding TF-IDF values.

Example use:

stream<rstring text, list<tuple<rstring term, float64 tfidf>> result> A = TfIdfWeight() {
   param ...
   output A : result = WeightedTerms();
}

public list<tuple<rstring term, float64 tfidf>> TopWeightedTerms()

This function returns a list of terms with their corresponding TF-IDF values, limited to the number of terms specified by the parameter nTopWeightedTerms.

Example use:

stream<rstring text, list<tuple<rstring term, float64 tfidf>> result> A = TfIdfWeight() {
   param ...
   output A : result = TopWeightedTerms();
}

Ports (0)

This mandatory output port sends the TF-IDF results for the documents received on input port 0.

Parameters

corpusFile

Filename of the corpus file, which is read at operator initialization. If a relative path is used, it is resolved relative to the application directory. It is recommended to store the file in the etc directory.

defaultIDF

The IDF value used for a term that is not in the corpus.
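
For orientation, the usual TF-IDF weight of a term t in a document d is the product of its term frequency and its inverse document frequency. Assuming the operator follows this standard form (the exact TF and IDF normalization it applies is not stated here), defaultIDF takes the place of the corpus IDF for unseen terms:

```latex
\[
\operatorname{tfidf}(t, d) = \operatorname{tf}(t, d) \cdot
\begin{cases}
\operatorname{idf}_{\text{corpus}}(t) & \text{if } t \text{ is in the corpus} \\
\texttt{defaultIDF} & \text{otherwise}
\end{cases}
\]
```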

documentAttribute

The input stream attribute containing the document. It must be of type SPL::rstring, SPL::list<rstring>, or SPL::list<tuple<rstring term>>.
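
The following sketch illustrates the list<rstring> variant, in which the document arrives already split into terms. The stream names, the sample text, the file name etc/corpus.csv, and the defaultIDF value are assumptions for illustration. Port 1 is wired only because the operator declares two input ports.

```spl
use com.ibm.streamsx.nlp::TfIdfWeight;

composite TokenListSketch {
    graph
        // Hypothetical source producing one line of text.
        stream<rstring line> Lines = Beacon() {
            param iterations: 1u;
            output Lines: line = "streams process data in motion";
        }

        // Split on spaces; tokenize() is an SPL standard library function.
        stream<list<rstring> words> Docs = Functor(Lines) {
            output Docs: words = tokenize(line, " ", false);
        }

        // Control tuple that names the same hypothetical corpus file again.
        stream<rstring corpusFile> Control = Beacon() {
            param iterations: 1u;
            output Control: corpusFile = "etc/corpus.csv";
        }

        stream<list<rstring> words, list<tuple<rstring term, float64 tfidf>> result>
            Weighted = TfIdfWeight(Docs; Control) {
            param
                corpusFile: "etc/corpus.csv";
                defaultIDF: 1.0;
                documentAttribute: words; // list<rstring> instead of rstring
            output
                Weighted: result = WeightedTerms();
        }
}
```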

nTopWeightedTerms

Limits the number of terms in the output list. If this parameter is not specified, then all terms are in the output list. This parameter is relevant for the custom output function TopWeightedTerms() only.
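
A minimal sketch of limiting the output to three terms follows. The uint32 literal 3u is an assumption, since the parameter type is not shown here. The stream names, sample text, file name, and defaultIDF value are likewise illustrative.

```spl
use com.ibm.streamsx.nlp::TfIdfWeight;

composite TopTermsSketch {
    graph
        // Hypothetical document source.
        stream<rstring text> Docs = Beacon() {
            param iterations: 1u;
            output Docs: text = "streams process data in motion";
        }

        // Control tuple that names the same hypothetical corpus file again.
        stream<rstring corpusFile> Control = Beacon() {
            param iterations: 1u;
            output Control: corpusFile = "etc/corpus.csv";
        }

        stream<rstring text, list<tuple<rstring term, float64 tfidf>> top3>
            Top = TfIdfWeight(Docs; Control) {
            param
                corpusFile: "etc/corpus.csv";
                defaultIDF: 1.0;
                documentAttribute: text;
                nTopWeightedTerms: 3u; // assumed uint32; applies to TopWeightedTerms() only
            output
                Top: top3 = TopWeightedTerms();
        }
}
```

Replacing TopWeightedTerms() with WeightedTerms() in the output clause would return every term, because the limit is relevant only to TopWeightedTerms().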

termAttribute

The attribute containing the term if documentAttribute is a list of tuples. If this parameter is not specified, the tuple type in the list must contain an attribute named term, or its first attribute must be of type rstring.
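
The sketch below relies on the documented default instead of setting termAttribute: each tuple in the list carries an attribute named term. The stream names, sample terms, file name, and defaultIDF value are assumptions for illustration.

```spl
use com.ibm.streamsx.nlp::TfIdfWeight;

composite TermTupleSketch {
    graph
        // Hypothetical source: one document given as a list of term tuples.
        stream<list<tuple<rstring term>> tokens> Docs = Beacon() {
            param iterations: 1u;
            output Docs: tokens = [{term = "streams"}, {term = "data"}, {term = "motion"}];
        }

        // Control tuple that names the same hypothetical corpus file again.
        stream<rstring corpusFile> Control = Beacon() {
            param iterations: 1u;
            output Control: corpusFile = "etc/corpus.csv";
        }

        stream<list<tuple<rstring term>> tokens,
               list<tuple<rstring term, float64 tfidf>> result>
            Weighted = TfIdfWeight(Docs; Control) {
            param
                corpusFile: "etc/corpus.csv";
                defaultIDF: 1.0;
                documentAttribute: tokens; // termAttribute omitted: tuples contain "term"
            output
                Weighted: result = WeightedTerms();
        }
}
```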

Metrics

nDocuments - Counter

The number of documents used to train the corpus.

nTerms - Counter

The number of terms used to train the corpus.