Operator WatsonSTT

Gateway to the IBM Speech To Text (STT) cloud service > com.ibm.streamsx.sttgateway 1.0.1 > com.ibm.streamsx.sttgateway.watson > WatsonSTT

The WatsonSTT operator is designed to ingest audio data, either as a file (.wav, .mp3 etc.) or as RAW audio, and then transcribe that audio into text via the IBM Watson STT (Speech To Text) cloud service. It does that by sending the audio data to the configured Watson STT service running in the IBM public cloud or in IBM Cloud Private (ICP) via the Websocket interface. It then outputs transcriptions of speech in the form of utterances or as full text, depending on its configuration. An utterance is a group of transcribed words meant to approximate a sentence.

Audio data must be in 16-bit little endian, mono format. For the Telephony model and its configurations, the audio must have an 8 kHz sampling rate; for the Broadband model and its configurations, a 16 kHz sampling rate. The data can be provided as a .wav file or as RAW uncompressed PCM audio. Here is a sample ffmpeg command to convert a .wav file to the correct telephony format (use -ar 16000 for broadband):

$ ffmpeg -i MyFile.wav -ac 1 -ar 8000 MyNewFile.wav
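If there is any doubt about an audio file's format, it can be checked programmatically before feeding it to the operator. Below is a minimal, illustrative Python sketch (not part of this toolkit) that verifies a .wav file against the requirements stated above:

```python
# Illustrative sketch: verify that a .wav file meets the operator's audio
# requirements before sending it for transcription -- mono, 16-bit samples,
# and an 8 kHz (Telephony) or 16 kHz (Broadband) sampling rate.
import wave

def check_wav_format(path, broadband=False):
    """Return a list of problems; an empty list means the file is usable."""
    expected_rate = 16000 if broadband else 8000
    problems = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            problems.append("expected mono, got %d channels" % w.getnchannels())
        if w.getsampwidth() != 2:  # 2 bytes per sample == 16-bit audio
            problems.append("expected 16-bit samples, got %d-bit" % (8 * w.getsampwidth()))
        if w.getframerate() != expected_rate:
            problems.append("expected %d Hz, got %d Hz" % (expected_rate, w.getframerate()))
    return problems
```

For RAW PCM input there is no header to inspect, so the format must be guaranteed by the upstream audio capture pipeline.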

This operator must be configured with a Websocket URL, a Watson STT authentication token and a base language model (see the Parameters section). This operator may also be customized with many other optional parameters, including custom language and acoustic models along with an appropriate customization weight.

Requirements:
  • Intel hosts running RHEL6 or RHEL7 with IBM Streams installed.

Note: Multiple invocations of this operator can be fused to make efficient use of the available pool of CPU cores.

See the samples folder inside this toolkit for working examples that show how to use this operator.

For detailed documentation about the operator's design, usage patterns and in-depth technical details, please refer to the official STT Gateway toolkit documentation available at this URL:

https://ibmstreams.github.io/streamsx.sttgateway

Summary

Ports
This operator has 1 input port and 1 output port.
Windowing
This operator does not accept any windowing configurations.
Parameters
This operator supports 25 parameters.

Required: authToken, baseLanguageModel, uri

Optional: acousticCustomizationId, baseModelVersion, contentType, cpuYieldTimeInAudioSenderThread, customizationId, customizationWeight, filterProfanity, identifySpeakers, keywordsSpottingThreshold, keywordsToBeSpotted, maxAllowedConnectionAttempts, maxUtteranceAlternatives, smartFormattingNeeded, sttJsonResponseDebugging, sttLiveMetricsUpdateNeeded, sttRequestLogging, sttResultMode, waitTimeBeforeSTTServiceConnectionRetry, websocketLoggingNeeded, wordAlternativesThreshold, wordConfidenceNeeded, wordTimestampNeeded

Metrics
This operator can report metrics.

Implementation
C++
Threading
Never - Operator never provides a single threaded execution context.

Input Ports

Ports (0)

This port brings the audio data into this operator for transcription.

Attributes on this input port:
  • speech (required, rstring/blob) - In the case of file based input (.wav, .mp3 etc. for batch workload), the expected value will be an absolute path of a file as an rstring. In the case of RAW audio data (received from a network switch for real-time workload), the expected input is of type blob.
  • conversationId (optional, rstring) - An rstring conversationId field for identifying the origin of the audio data that is being sent for transcription (either an audio filename or a call center specific call identifier).

All the extra input attributes will be forwarded if matching output attributes are found.

Output Ports

Assignments
This operator allows any SPL expression of the correct type to be assigned to output attributes.
Output Functions
STTGatewayFunctions
<any T> T AsIs(T)

The default function for output attributes. This function assigns the output attribute to the value of the input attribute with the same name.

int32 getUtteranceNumber()

Returns an int32 number indicating the utterance number.

rstring getUtteranceText()

Returns the transcription of audio in the form of a single utterance.

boolean isFinalizedUtterance()

Returns a boolean value to indicate if this is an interim partial utterance or a finalized utterance.

float32 getConfidence()

Returns a float32 confidence value for an interim partial utterance or for a finalized utterance or for the full text.

rstring getFullTranscriptionText()

Returns the transcription of audio in the form of full text after completing the entire transcription.

rstring getSTTErrorMessage()

Returns the Watson STT error message if any.

boolean isTranscriptionCompleted()

Returns a boolean value to indicate whether the full transcription is completed.

list<rstring> getUtteranceAlternatives()

Returns a list of n-best alternative hypotheses for an utterance result. The list contains the very best guess first, followed by the next best ones in descending order.

list<list<rstring>> getWordAlternatives()

Returns a nested list of word alternatives (Confusion Networks).

list<list<float64>> getWordAlternativesConfidences()

Returns a nested list of word alternatives confidences (Confusion Networks).

list<float64> getWordAlternativesStartTimes()

Returns a list of word alternatives start times (Confusion Networks).

list<float64> getWordAlternativesEndTimes()

Returns a list of word alternatives end times (Confusion Networks).

list<rstring> getUtteranceWords()

Returns a list of words in an utterance result.

list<float64> getUtteranceWordsConfidences()

Returns a list of confidences of the words in an utterance result.

list<float64> getUtteranceWordsStartTimes()

Returns a list of start times of the words in an utterance result relative to the start of the audio.

list<float64> getUtteranceWordsEndTimes()

Returns a list of end times of the words in an utterance result relative to the start of the audio.

float64 getUtteranceStartTime()

Returns the start time of an utterance relative to the start of the audio.

float64 getUtteranceEndTime()

Returns the end time of an utterance relative to the start of the audio.

list<int32> getUtteranceWordsSpeakers()

Returns a list of speaker ids for the individual words in an utterance result.

list<float64> getUtteranceWordsSpeakersConfidences()

Returns a list of confidences in identifying the speakers of the individual words in an utterance result.

map<rstring, list<map<rstring, float64>>> getKeywordsSpottingResults()

Returns the STT keywords spotting results as a map of key/value pairs. Read this toolkit's documentation to learn about the map contents.
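The word-level output functions listed above return parallel lists that are index-aligned: element i of each list describes word i of the utterance result, as their descriptions suggest. A small illustrative Python sketch (with hypothetical values, outside of SPL) of combining them into per-word records:

```python
# Illustrative sketch: the word-level output functions return parallel,
# index-aligned lists. The values below are hypothetical; in SPL they
# would come from the corresponding output functions named in the comments.
words       = ["hello", "world"]   # getUtteranceWords()
confidences = [0.91, 0.87]         # getUtteranceWordsConfidences()
start_times = [0.02, 0.55]         # getUtteranceWordsStartTimes()
end_times   = [0.48, 1.01]         # getUtteranceWordsEndTimes()
speakers    = [0, 1]               # getUtteranceWordsSpeakers()

# Zip the parallel lists into one record per transcribed word.
word_records = [
    {"word": w, "confidence": c, "start": s, "end": e, "speaker": sp}
    for w, c, s, e, sp in zip(words, confidences, start_times, end_times, speakers)
]
```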

Ports (0)

This port produces the output tuples that carry the result of the speech to text transcription.

An output tuple is created for every utterance observed in the incoming audio data. An utterance is a group of transcribed words meant to approximate a sentence. This means there is a one-to-many relationship between an incoming tuple and outgoing tuples (i.e. a single .wav file may result in 30 output utterances).

Intermediate utterances are sent out on this output port only when the sttResultMode operator parameter is set to 1 or 2. If it is set to 3, then only the fully transcribed text for the entire audio data is sent on this output port, after the given audio is completely transcribed. There are multiple available output functions, and output attributes can also be assigned values with any SPL expression that evaluates to the proper type.
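For example, a downstream consumer of this output port could assemble a full transcript from finalized utterances by checking the flags provided by isFinalizedUtterance() and isTranscriptionCompleted(). The sketch below is illustrative Python, with plain dicts standing in for SPL tuples:

```python
# Illustrative sketch: stitch a full transcript together from finalized
# utterances. The dict keys model hypothetical output attributes assigned
# from isFinalizedUtterance(), getUtteranceText() and
# isTranscriptionCompleted().
def assemble_transcript(result_tuples):
    finalized = []
    for t in result_tuples:
        if t["finalized"]:
            # Keep only finalized utterances; skip interim partial results.
            finalized.append(t["utteranceText"])
        if t["transcriptionCompleted"]:
            # The operator signaled the end of this audio's transcription.
            break
    return " ".join(finalized)
```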

Parameters

This operator supports 25 parameters.

Required: authToken, baseLanguageModel, uri

Optional: acousticCustomizationId, baseModelVersion, contentType, cpuYieldTimeInAudioSenderThread, customizationId, customizationWeight, filterProfanity, identifySpeakers, keywordsSpottingThreshold, keywordsToBeSpotted, maxAllowedConnectionAttempts, maxUtteranceAlternatives, smartFormattingNeeded, sttJsonResponseDebugging, sttLiveMetricsUpdateNeeded, sttRequestLogging, sttResultMode, waitTimeBeforeSTTServiceConnectionRetry, websocketLoggingNeeded, wordAlternativesThreshold, wordConfidenceNeeded, wordTimestampNeeded

acousticCustomizationId

This parameter specifies a custom acoustic model to be used for transcription. (Default is an empty string)

authToken

This parameter specifies the auth token needed to access the Watson STT service.

baseLanguageModel

This parameter specifies the name of the Watson STT base language model that should be used.

baseModelVersion

This parameter specifies a particular base model version to be used for transcription. (Default is an empty string)

contentType

This parameter specifies the content type to be used for transcription. (Default is audio/wav)

cpuYieldTimeInAudioSenderThread

This parameter specifies the CPU yield time (in seconds) to be used inside the audio sender thread's tight loop, which spins looking for new audio data to send to the STT service. It should be >= 0.0. (Default is 0.001, i.e. 1 millisecond)

customizationId

This parameter specifies a custom language model to be used for transcription. (Default is an empty string)

customizationWeight

This parameter specifies a relative weight for a custom language model as a float64 between 0.0 and 1.0. (Default is 0.0)

filterProfanity

This parameter indicates whether profanity should be filtered from a transcript. (Default is false)

identifySpeakers

This parameter indicates whether the speakers of the individual words in an utterance result should be identified. (Default is false)

keywordsSpottingThreshold

This parameter specifies the minimum confidence level that the STT service must have for an utterance word to match a given keyword. A value of 0.0 disables this feature. Valid values must be less than 1.0. (Default is 0.0)

keywordsToBeSpotted

This parameter specifies a list (array) of strings to be spotted. (Default is an empty list)

maxAllowedConnectionAttempts

This parameter specifies the maximum number of attempts to make a Websocket connection to the STT service. It should be >= 1 (Default is 10)

maxUtteranceAlternatives

This parameter indicates the required number of n-best alternative hypotheses for the transcription results. (Default is 1)

smartFormattingNeeded

This parameter indicates whether to convert date, time, phone numbers, currency values, email and URLs into conventional representations. (Default is false)

sttJsonResponseDebugging

This parameter is used for debugging the STT JSON response message. Mostly for IBM internal use. (Default is false)

sttLiveMetricsUpdateNeeded

This parameter specifies whether live update for this operator's custom metrics is needed. (Default is true)

sttRequestLogging

This parameter specifies whether request logging should be done for every STT audio transcription request. (Default is false)

sttResultMode

This parameter specifies what type of STT result is needed: 1 to get partial utterances, 2 to get completed utterances, 3 (default) to get the full text after transcribing the entire audio.

uri

This parameter specifies the Watson STT Websocket service URI.

waitTimeBeforeSTTServiceConnectionRetry

This parameter specifies the time (in seconds) to wait before retrying a connection attempt to the Watson STT service. It should be >= 1.0 (Default is 3.0)

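The reconnection behavior described by the waitTimeBeforeSTTServiceConnectionRetry and maxAllowedConnectionAttempts parameters can be sketched as follows. This is illustrative Python, not the operator's actual C++ implementation:

```python
# Illustrative sketch of the connection-retry behavior implied by the
# maxAllowedConnectionAttempts (default 10) and
# waitTimeBeforeSTTServiceConnectionRetry (default 3.0 seconds) parameters.
import time

def connect_with_retry(connect, max_attempts=10, wait_seconds=3.0):
    """Call connect() until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up after the configured number of attempts
            time.sleep(wait_seconds)  # wait before the next retry
```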
websocketLoggingNeeded

This parameter specifies whether logging is needed from the Websocket library. (Default is false)

wordAlternativesThreshold

This parameter controls the density of the word alternatives results (a.k.a. Confusion Networks). A value of 0.0 disables this feature. Valid values must be less than 1.0. (Default is 0.0)

wordConfidenceNeeded

This parameter indicates whether the transcription result should include individual words and their confidences or not. (Default is false)

wordTimestampNeeded

This parameter indicates whether the transcription result should include individual words and their timestamps or not. (Default is false)

Metrics

A few custom metrics are available for the WatsonSTT operator. The Counter kind metrics listed below are set when the operator starts. The Gauge kind metrics are updated live during transcription, but only when the sttLiveMetricsUpdateNeeded operator parameter is set to true.

nFullAudioConversationsReceived - Gauge

Number of full audio conversations received for transcription by this operator instance.

nFullAudioConversationsTranscribed - Gauge

Number of full audio conversations transcribed by this operator instance.

nSTTResultMode - Counter

STT result mode currently in effect for a given operator instance.

nWebsocketConnectionAttempts - Counter

Number of STT service Websocket connection attempts made by this operator instance.

Libraries

Boost Libraries (Library Path for all: ../../impl/lib):
  • boost_system
  • boost_chrono
  • boost_random
  • boost_thread