Operator WatsonSTT

Toolkit: com.ibm.streamsx.sttgateway 2.2.3 (Gateway to the IBM Speech To Text (STT) cloud service)
Namespace: com.ibm.streamsx.sttgateway.watson

The WatsonSTT operator ingests audio data, either as a file (.wav, .mp3 etc.) or as RAW audio, and transcribes it into text via the IBM Watson STT (Speech To Text) cloud service. It sends the audio data over the Websocket interface to the configured Watson STT service running in the IBM public cloud or in IBM Cloud Pak for Data (CP4D). It then outputs the transcription either as utterances or as full text, as configured. An utterance is a group of transcribed words meant to approximate a sentence.

Audio data must be in 16-bit little endian, mono format. For the Telephony model and configurations, the audio must have an 8 kHz sampling rate. For the Broadband model and configurations, the audio must have a 16 kHz sampling rate. The data can be provided as a .wav file or as RAW uncompressed PCM audio. Here is a sample ffmpeg command to convert a .wav file to the correct telephony format (use -ar 16000 for broadband):

$ ffmpeg -i MyFile.wav -ac 1 -ar 8000 MyNewFile.wav
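
Before feeding a converted file to the operator, it can help to confirm that the file really has the required layout. The following Python sketch uses only the standard wave module. The function names check_wav_format and write_silence are illustrative and not part of the toolkit, and write_silence exists only to produce a file for trying out the check.

```python
import os
import tempfile
import wave


def check_wav_format(path, expected_rate):
    """Return a list of problems; an empty list means the file looks usable."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("expected mono, found %d channels" % wav.getnchannels())
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append("expected 16-bit samples, found %d-bit" % (wav.getsampwidth() * 8))
        if wav.getframerate() != expected_rate:
            problems.append("expected %d Hz, found %d Hz" % (expected_rate, wav.getframerate()))
    return problems


def write_silence(path, rate, channels=1, sample_width=2, seconds=1):
    """Write a silent PCM WAV file, only used here to exercise the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (rate * channels * sample_width * seconds))


if __name__ == "__main__":
    sample = os.path.join(tempfile.mkdtemp(), "telephony.wav")
    write_silence(sample, rate=8000)
    print(check_wav_format(sample, expected_rate=8000))   # telephony check
    print(check_wav_format(sample, expected_rate=16000))  # broadband check reports the rate mismatch
```

Pass 8000 as the expected rate for the Telephony model and 16000 for the Broadband model.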

This operator must be configured with a Websocket URL, a Watson STT authentication token and a base language model (see the Parameters section). It may also be customized with many other optional parameters, including custom patch files and appropriate custom patch weights.

The operator parameter sttResultMode specifies what type of STT result is produced:
  • partial: to get partial utterances,
  • complete: (default) to get the full text after transcribing the entire audio.
The setting of this parameter influences the validity of output functions and parameters. In sttResultMode partial, the parameter nonFinalUtterancesNeeded controls the output of non-final utterances.

Note: Multiple invocations of this operator can be fused to make efficient use of the available pool of CPU cores.

See the samples folder inside this toolkit for working examples that show how to use this operator.

For detailed documentation on the operator design, usage patterns and in-depth technical details, please refer to the official STT Gateway toolkit documentation available at this URL:

https://ibmstreams.github.io/streamsx.sttgateway

Summary

Ports
This operator has 2 input ports and 1 output port.
Windowing
This operator does not accept any windowing configurations.
Parameters
This operator supports 20 parameters.

Required: baseLanguageModel, uri

Optional: acousticCustomizationId, baseModelVersion, contentType, cpuYieldTimeInAudioSenderThread, customizationId, customizationWeight, filterProfanity, keywordsSpottingThreshold, keywordsToBeSpotted, maxConnectionRetryDelay, maxUtteranceAlternatives, nonFinalUtterancesNeeded, smartFormattingNeeded, sttLiveMetricsUpdateNeeded, sttRequestLogging, sttResultMode, websocketLoggingNeeded, wordAlternativesThreshold

Metrics
This operator reports 9 metrics.

Properties

Implementation
C++
Threading
Never - Operator never provides a single threaded execution context.

Input Ports

Ports (0)

This port brings the audio data into this operator for transcription.

Attributes on this input port:
  • speech (required, rstring/blob) - In the case of file-based input (.wav, .mp3 etc. for batch workload), the expected value will be an absolute path of a file as an rstring. In the case of RAW audio data (received from a network switch for real-time workload), the expected input is of type blob.

A window punctuation marker or an empty speech blob may be used to mark the end of a conversation. Thus a conversation can be a composite of multiple audio files. When the end of a conversation is encountered, the STT engine delivers all results of the current conversation and flushes all buffers.
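
The end-of-conversation convention for RAW audio can be sketched as follows. This Python generator is only a model of the sequence of speech values a feeding application would send on port 0. The name speech_blobs and the chunk size are illustrative assumptions; a real application would send these values as SPL blob attributes.

```python
def speech_blobs(conversation_parts, chunk_size=4096):
    """Yield speech blobs for one conversation, then an empty blob as end marker.

    conversation_parts: iterable of bytes objects, e.g. the PCM data of several
    audio segments that together form one conversation.
    """
    for part in conversation_parts:
        for offset in range(0, len(part), chunk_size):
            yield part[offset:offset + chunk_size]
    # An empty speech blob marks the end of the conversation.
    yield b""


if __name__ == "__main__":
    parts = [b"\x00" * 10000, b"\x01" * 3000]
    sizes = [len(blob) for blob in speech_blobs(parts)]
    print(sizes)  # [4096, 4096, 1808, 3000, 0]
```

Only the final zero-length blob ends the conversation, so several audio segments can be sent back to back as one conversation.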

All the extra input attributes will be forwarded if matching output attributes are found.

Properties

Ports (1)

This port delivers an unexpired IAM access token (generated by using your service instance's API key), which the operator needs to access the Watson STT service. This input port should be fed from a different thread than port 0.

Attributes on this input port:
  • access_token (required, rstring) - An rstring access token required for securing the access to the STT service.

All the extra attributes found in this input port will be ignored.
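
As a hedged illustration of where such an access token can come from: IBM Cloud IAM issues access tokens in exchange for an API key via a form-encoded POST to its token endpoint. The Python sketch below only builds that request and parses a response body. The helper names are hypothetical, the endpoint shown is the public IBM Cloud one (a Cloud Pak for Data installation may issue tokens differently), and the network call itself is left to the caller.

```python
import json
import urllib.parse
import urllib.request

IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"


def build_iam_token_request(api_key, url=IAM_TOKEN_URL):
    """Build (but do not send) the POST request that exchanges an API key for a token."""
    body = urllib.parse.urlencode({
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key,
    }).encode("ascii")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def extract_access_token(response_text):
    """Return (access_token, expires_in_seconds) from the JSON response body."""
    payload = json.loads(response_text)
    return payload["access_token"], payload.get("expires_in")


if __name__ == "__main__":
    request = build_iam_token_request("MY_API_KEY")  # placeholder key
    print(request.get_method(), request.full_url)
    # Sending is left out on purpose; with a real key it would be:
    #   with urllib.request.urlopen(request) as response:
    #       token, ttl = extract_access_token(response.read().decode("utf-8"))
```

Because the operator expects an unexpired token, a feeding application would need to fetch a new token and send it on this port before the current one expires.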

Properties

Output Ports

Assignments
This operator allows any SPL expression of the correct type to be assigned to output attributes.
Output Functions
STTGatewayFunctions
<any T> T AsIs(T)

The default function for output attributes. This function assigns the output attribute to the value of the input attribute with the same name.

rstring getSTTErrorMessage()

Returns the Watson STT error message if any. Default attribute name sttErrorMessage

boolean isTranscriptionCompleted()

Returns a boolean value to indicate whether the full transcription/conversation is completed. Default attribute name transcriptionCompleted

Note: If this function is requested, the operator emits a tuple with this flag set to true at the end of a conversation. All other output functions deliver their default values for this final tuple of the conversation. If this function is not requested, the operator does not send this final tuple. In either case, the operator sends a window punctuation marker at the end of a conversation.

rstring getUtteranceText()

Returns the transcription of audio in the form of a single utterance. Default attribute name utteranceText

float64 getConfidence()

Returns a float64 confidence value for the utterance. Default attribute name confidence

Note: This function does not deliver results if sttResultMode equals complete.

float64 getUtteranceStartTime()

Returns the start time of an utterance relative to the start of the audio. Default attribute name utteranceStartTime

float64 getUtteranceEndTime()

Returns the end time of an utterance relative to the start of the audio. Default attribute name utteranceEndTime

int32 getUtteranceNumber()

Returns an int32 number indicating the utterance number. Default attribute name utteranceNumber

Note: This function does not deliver results if sttResultMode equals complete.

boolean isFinalizedUtterance()

Returns a boolean value to indicate if this is an interim partial utterance or a finalized utterance. Default attribute name finalizedUtterance

Note: This function does not deliver results if sttResultMode equals complete.

list<rstring> getUtteranceAlternatives()

Returns a list of n-best alternative hypotheses for an utterance result. List will have the very best guess first followed by the next best ones in that order. Default attribute name utteranceAlternatives

Note: This function does not deliver results if sttResultMode equals complete.

Note: n-best alternative hypotheses are available only for final utterances.

list<rstring> getUtteranceWords()

Returns a list of words in an utterance result. Default attribute name utteranceWords

list<float64> getUtteranceWordsConfidences()

Returns a list of confidences of the words in an utterance result. Default attribute name utteranceWordsConfidences

Note: word confidences are available only for final utterances.

list<float64> getUtteranceWordsStartTimes()

Returns a list of start times of the words in an utterance result relative to the start of the audio. Default attribute name utteranceWordsStartTimes

list<float64> getUtteranceWordsEndTimes()

Returns a list of end times of the words in an utterance result relative to the start of the audio. Default attribute name utteranceWordsEndTimes

list<list<rstring>> getWordAlternatives()

Returns a nested list of word alternatives (Confusion Networks). Default attribute name wordAlternatives

Note: word alternatives are available only for final utterances.

list<list<float64>> getWordAlternativesConfidences()

Returns a nested list of word alternatives confidences (Confusion Networks). Default attribute name wordAlternativesConfidences

Note: word alternative confidences are available only for final utterances.

list<float64> getWordAlternativesStartTimes()

Returns a list of word alternatives start times (Confusion Networks). Default attribute name wordAlternativesStartTimes

Note: word alternative start times are available only for final utterances.

list<float64> getWordAlternativesEndTimes()

Returns a list of word alternatives end times (Confusion Networks). Default attribute name wordAlternativesEndTimes

Note: word alternative end times are available only for final utterances.

list<int32> getUtteranceWordsSpeakers()

Returns a list of speaker ids for the individual words in an utterance result. Each entry in this list corresponds to the appropriate entry in the utterance words list delivered with function getUtteranceWords. In rare cases the operator delivers a value of -1 for a speaker label. In this case there was no speaker label delivered from STT service. Default attribute name utteranceWordsSpeakers

Note: word speaker labels are available only for final utterances.

list<float64> getUtteranceWordsSpeakersConfidences()

Returns a list of confidences in identifying the speakers of the individual words in an utterance result. Default attribute name utteranceWordsSpeakersConfidences

Note: word speaker label confidences are available only for final utterances.

<tuple T> list<T> getUtteranceWordsSpeakerUpdates()

Returns the speaker identifier updates for all emitted utterance words in the current conversation. This function returns an update to previously sent word speaker identifiers without adding any new results. Default attribute name utteranceWordsSpeakerUpdates

Speaker numbering starts at 0. The invalid index -1 identifies an unidentified speaker. It typically takes 5-10 seconds of speech for diarization results to become reliable, and the results continue to improve over the course of an audio conversation.

The return type of this function is
list<tuple<float64 startTime, int32 speaker, float64 confidence>>
The speaker label is given with respect to the start time of the utterance word. No other tuple types are accepted as result type.
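
Since each update carries the start time of the utterance word it refers to, a downstream consumer can match updates to previously received words by start time. The Python sketch below illustrates that idea with plain dictionaries standing in for SPL tuples. The function name and the word layout are assumptions for illustration, and matching on exact float equality is a simplification.

```python
def apply_speaker_updates(words, updates):
    """Apply speaker updates to previously emitted words, matched by start time.

    words:   list of dicts with keys "word", "startTime", "speaker"
    updates: list of dicts with keys "startTime", "speaker", "confidence"
    Returns the number of words whose speaker label changed.
    """
    by_start = {w["startTime"]: w for w in words}
    changed = 0
    for update in updates:
        word = by_start.get(update["startTime"])
        if word is not None and word["speaker"] != update["speaker"]:
            word["speaker"] = update["speaker"]
            changed += 1
    return changed


if __name__ == "__main__":
    words = [
        {"word": "hello", "startTime": 0.5, "speaker": -1},  # -1: no speaker identified yet
        {"word": "there", "startTime": 0.9, "speaker": 0},
        {"word": "hi", "startTime": 1.6, "speaker": 0},
    ]
    updates = [
        {"startTime": 0.5, "speaker": 0, "confidence": 0.6},
        {"startTime": 1.6, "speaker": 1, "confidence": 0.7},
    ]
    print(apply_speaker_updates(words, updates))  # 2
    print([w["speaker"] for w in words])          # [0, 0, 1]
```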

<any T> T getKeywordsSpottingResults()

Returns the STT keywords spotting results as a map. The keys of the map are the spotted keywords. The values of the map are lists of keyword occurrences. Each keyword occurrence may be either

tuple<float64 startTime, float64 endTime, float64 confidence>
or
map<rstring, float64>

In the latter case the inner map has the entries: start_time, end_time, confidence. Thus the complete attribute type may be either:

map<rstring, list<tuple<float64 startTime, float64 endTime, float64 confidence>>>
or
map<rstring, list<map<rstring, float64>>>

Default attribute name keywordsSpottingResults

Note: keyword spotting results are available only for final utterances.
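
To make the map-based result shape concrete, the Python sketch below flattens a structure that mirrors map<rstring, list<map<rstring, float64>>> into rows and filters them by confidence. The function name, the sample keywords and the sample values are made up for illustration; only the entry names start_time, end_time and confidence come from the description above.

```python
def keyword_hits(results, min_confidence=0.0):
    """Flatten keyword spotting results into (keyword, start, end, confidence) rows.

    results mirrors map<rstring, list<map<rstring, float64>>>: each occurrence
    is a dict with the entries start_time, end_time and confidence.
    """
    rows = []
    for keyword, occurrences in results.items():
        for occ in occurrences:
            if occ["confidence"] >= min_confidence:
                rows.append((keyword, occ["start_time"], occ["end_time"], occ["confidence"]))
    return sorted(rows, key=lambda row: row[1])  # order by start time


if __name__ == "__main__":
    sample = {
        "refund": [
            {"start_time": 12.4, "end_time": 12.9, "confidence": 0.91},
            {"start_time": 3.1, "end_time": 3.6, "confidence": 0.42},
        ],
        "invoice": [{"start_time": 7.0, "end_time": 7.5, "confidence": 0.77}],
    }
    for row in keyword_hits(sample, min_confidence=0.5):
        print(row)
```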

Ports (0)

This port produces the output tuples that carry the result of the speech to text transcription.

An output tuple is created for every utterance that is observed from the incoming audio data. An utterance is a group of transcribed words meant to approximate a sentence. This means there is a one to many relationship between an incoming tuple and outgoing tuples (e.g. a single .wav file may result in 30 output utterances). If the input stream feeds audio samples, there may also be a many to one relationship between incoming tuples and an outgoing tuple: several input tuples can be consumed before an output tuple is generated.

Additionally, an error during processing (connection error, invalid data, server error) triggers an outgoing tuple. The custom output function getSTTErrorMessage returns the error message.

Intermediate utterances are sent out on this output port only when the sttResultMode equals partial and operator parameter nonFinalUtterancesNeeded equals true. If sttResultMode equals complete, then only the fully transcribed text for the entire audio data will be sent on this output port after the given audio is completely transcribed.

The port emits a window punctuation marker when a conversation has finished. If sttResultMode equals partial, the operator sends out the results immediately without any further delay. If the output function isTranscriptionCompleted is requested, the operator additionally sends a tuple with empty/default utterance attributes and the transcription completed flag when a conversation has finished.

There are multiple available output functions, and output attributes can also be assigned values with any SPL expression that evaluates to the proper type.

Every custom output function (COF) has an appropriate default attribute. The operator automatically assigns the result of the COF to the default attribute if:
  • the output stream has an attribute with the default name,
  • the type of the attribute is the return type of the COF,
  • this attribute has no assignment in the operator output clause,
  • this attribute has no auto assignment from input port 0.
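
The four conditions above must all hold at the same time. As a rough model, they can be written as a single predicate. This Python sketch is not how the operator is implemented; the function and argument names are hypothetical, and the auto assignment from input port 0 is simplified to a name lookup.

```python
def gets_default_assignment(attr_name, attr_type, cof_default_name, cof_return_type,
                            explicitly_assigned, input_port0_attrs):
    """Return True if all four conditions for automatic COF assignment hold."""
    return (
        attr_name == cof_default_name              # attribute has the default name
        and attr_type == cof_return_type           # attribute type equals the COF return type
        and attr_name not in explicitly_assigned   # no assignment in the output clause
        and attr_name not in input_port0_attrs     # no auto assignment from input port 0
    )


if __name__ == "__main__":
    # getUtteranceText() has the default attribute name utteranceText and returns rstring.
    print(gets_default_assignment("utteranceText", "rstring", "utteranceText", "rstring",
                                  explicitly_assigned=set(), input_port0_attrs={"speech"}))
    # A type mismatch prevents the automatic assignment.
    print(gets_default_assignment("utteranceText", "ustring", "utteranceText", "rstring",
                                  explicitly_assigned=set(), input_port0_attrs={"speech"}))
```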

Properties

Parameters

This operator supports 20 parameters.

Required: baseLanguageModel, uri

Optional: acousticCustomizationId, baseModelVersion, contentType, cpuYieldTimeInAudioSenderThread, customizationId, customizationWeight, filterProfanity, keywordsSpottingThreshold, keywordsToBeSpotted, maxConnectionRetryDelay, maxUtteranceAlternatives, nonFinalUtterancesNeeded, smartFormattingNeeded, sttLiveMetricsUpdateNeeded, sttRequestLogging, sttResultMode, websocketLoggingNeeded, wordAlternativesThreshold

acousticCustomizationId

This parameter specifies a custom acoustic model to be used for transcription. (Default is an empty string)

Properties

baseLanguageModel

This parameter specifies the name of the Watson STT base language model that should be used. see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-input#models

Properties

baseModelVersion

This parameter specifies a particular base model version to be used for transcription. (Default is an empty string)

The parameter may point to a specific version of the base model if needed, e.g. "en-US_NarrowbandModel.v07-06082016.06202016" see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-input#version

Properties

contentType

This parameter specifies the content type to be used for transcription. (Default is audio/wav)

Properties

cpuYieldTimeInAudioSenderThread

This parameter specifies the CPU yield time (in seconds) used inside the audio sender thread's tight loop, which spins while looking for new audio data to be sent to the STT service. It must be >= 0.0. (Default is 0.001, i.e. 1 millisecond)

Properties

customizationId

This parameter specifies a custom language model to be used for transcription. see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-input#custom (Default is an empty string)

Properties

customizationWeight

This parameter specifies a relative weight for a custom language model as a float64 between 0.0 and 1.0. (Default is 0.0)

Properties

filterProfanity

This parameter indicates whether profanity should be filtered from a transcript. (Default is false) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-output#profanity_filter

Properties

keywordsSpottingThreshold

This parameter specifies the minimum confidence level that the STT service must have for an utterance word to match a given keyword. A value of 0.0 disables this feature. Valid value must be less than 1.0. (Default is 0.3) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-output#keyword_spotting

Properties

keywordsToBeSpotted

This parameter specifies a list (array) of strings to be spotted. (Default is an empty list)

Properties

maxConnectionRetryDelay

The maximum wait time in seconds before a connection re-try is made. The re-try delay of connection to the STT service increases exponentially starting from 2 seconds but not exceeding 'maxConnectionRetryDelay'. It must be greater than 1.0. (Default is 60.0)
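
The capped exponential growth of the re-try delay can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not the operator's code: the text above only says that the delay grows exponentially from 2 seconds, so the doubling factor used here is an assumption, and the function name is hypothetical.

```python
def retry_delays(max_connection_retry_delay=60.0, attempts=8, first_delay=2.0, factor=2.0):
    """Return the delay (in seconds) before each of the given number of re-tries.

    first_delay=2.0 follows the parameter description; factor=2.0 is an assumption.
    """
    if max_connection_retry_delay <= 1.0:
        raise ValueError("maxConnectionRetryDelay must be greater than 1.0")
    delays = []
    delay = first_delay
    for _ in range(attempts):
        # The delay never exceeds the configured maximum.
        delays.append(min(delay, max_connection_retry_delay))
        delay *= factor
    return delays


if __name__ == "__main__":
    print(retry_delays(60.0, attempts=7))  # [2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Under the doubling assumption, the default maximum of 60.0 is reached at the sixth re-try and then stays constant.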

Properties

maxUtteranceAlternatives

This parameter indicates the required number of n-best alternative hypotheses for the transcription results. (Default is 3) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-output#max_alternatives

Note: This parameter is ignored if sttResultMode equals complete.

Properties

nonFinalUtterancesNeeded

If sttResultMode equals partial this parameter controls the output of non final utterances. If sttResultMode equals complete this parameter is ignored. (Default is false.)

Properties

smartFormattingNeeded

This parameter indicates whether to convert date, time, phone numbers, currency values, email and URLs into conventional representations. (Default is false) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-output#smart_formatting

Properties

sttLiveMetricsUpdateNeeded

This parameter specifies whether live update for this operator's metrics nFullAudioConversationsReceived and nFullAudioConversationsTranscribed is needed. (Default is true)

Properties

sttRequestLogging

Indicates whether IBM can use data that is sent over the connection to improve the service for future users. Specify false to prevent IBM from accessing the logged data. (Default is false) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-input#logging

Properties

sttResultMode

This parameter specifies what type of STT result is needed: partial: to get partial utterances, complete: (default) to get the full text after transcribing the entire audio. The setting of this parameter influences the validity of output functions.

In sttResultMode complete parameter maxUtteranceAlternatives is not allowed.

In sttResultMode complete the following output functions are not allowed: getUtteranceNumber(), isFinalizedUtterance(), getConfidence(), getUtteranceAlternatives().
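
The restrictions for sttResultMode complete can be summarized as a small check. The Python sketch below is a model of the rules stated above, useful for reviewing a planned configuration. It is not the check performed by the operator or the compiler, and the function name is hypothetical.

```python
NOT_ALLOWED_IN_COMPLETE_MODE = {
    "getUtteranceNumber",
    "isFinalizedUtterance",
    "getConfidence",
    "getUtteranceAlternatives",
}


def complete_mode_violations(stt_result_mode, output_functions, parameters):
    """List the output functions and parameters that conflict with complete mode."""
    if stt_result_mode != "complete":
        return []
    used = sorted(NOT_ALLOWED_IN_COMPLETE_MODE.intersection(output_functions))
    violations = [name + "()" for name in used]
    if "maxUtteranceAlternatives" in parameters:
        violations.append("maxUtteranceAlternatives")
    return violations


if __name__ == "__main__":
    print(complete_mode_violations(
        "complete",
        output_functions=["getUtteranceText", "getConfidence", "getUtteranceNumber"],
        parameters={"maxUtteranceAlternatives": 3},
    ))
    # ['getConfidence()', 'getUtteranceNumber()', 'maxUtteranceAlternatives']
```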

Properties

uri

This parameter specifies the Watson STT Websocket service URI. see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-websockets#WSopen

Properties

websocketLoggingNeeded

This parameter specifies whether logging is needed from the Websocket library. (Default is false)

Properties

wordAlternativesThreshold

This parameter controls the density of the word alternatives results (a.k.a. Confusion Networks). A value of 0.0 disables this feature. Valid value must be less than 1.0. (Default is 0.0) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-output#word_alternatives

Properties

Metrics

nAudioBytesSend - Counter

The amount of audio samples sent to the stt service in bytes.

NOTE: This metric is only updated if parameter sttLiveMetricsUpdateNeeded is true.

nFullAudioConversationsFailed - Counter

Number of full audio conversation failures in this operator instance.

NOTE: This metric is only updated if parameter sttLiveMetricsUpdateNeeded is true.

nFullAudioConversationsReceived - Counter

Number of full audio conversations received for transcription by this operator instance.

NOTE: This metric is only updated if parameter sttLiveMetricsUpdateNeeded is true.

nFullAudioConversationsTranscribed - Counter

Number of full audio conversations transcribed by this operator instance.

NOTE: This metric is only updated if parameter sttLiveMetricsUpdateNeeded is true.

nWebsocketConnectionAttempts - Counter

The cumulative number of STT service Websocket connection attempts made by this operator instance.

nWebsocketConnectionAttemptsCurrent - Gauge

Number of STT service Websocket connection attempts made by this operator instance for the current connection. The value is reset to zero when a connection attempt succeeds.

nWebsocketConnectionAttemptsFailed - Counter

The cumulative number of failed STT service Websocket connection attempts made by this operator instance.

sttOutputResultMode - Gauge

STT result mode currently in effect for a given operator instance.

  • 1 sttResultMode == partial and nonFinalUtterancesNeeded == true
  • 2 sttResultMode == partial and nonFinalUtterancesNeeded == false
  • 3 sttResultMode == complete
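
The gauge value follows directly from the two parameters, as the list above shows. A minimal Python rendering of that mapping, with a hypothetical function name, looks like this:

```python
def stt_output_result_mode(stt_result_mode, non_final_utterances_needed=False):
    """Map sttResultMode and nonFinalUtterancesNeeded to the gauge value 1, 2 or 3."""
    if stt_result_mode == "complete":
        return 3  # nonFinalUtterancesNeeded is ignored in complete mode
    if stt_result_mode == "partial":
        return 1 if non_final_utterances_needed else 2
    raise ValueError("unknown sttResultMode: %r" % (stt_result_mode,))


if __name__ == "__main__":
    print(stt_output_result_mode("partial", True))   # 1
    print(stt_output_result_mode("partial", False))  # 2
    print(stt_output_result_mode("complete"))        # 3
```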

wsConnectionState - Gauge

The state of the websocket connection:

  • 0=idle: initial idle state
  • 1=start: start a new connection
  • 2=connecting: trying to connect to stt
  • 3=open: connection is opened
  • 4=listening: listening received
  • 5=closing: transcription is finalized - close the connection
  • 6=error: error received during transcription
  • 7=closed: connection has closed
  • 8=failed: connection has failed
  • 9=crashed: Error was caught in ws_init thread
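
A monitoring script that reads this gauge may want to turn the numeric value back into a name. The following Python sketch encodes the table above as an IntEnum. The class and function names are illustrative, and how the metric value is obtained is outside the scope of this example.

```python
from enum import IntEnum


class WsConnectionState(IntEnum):
    """Numeric values of the wsConnectionState gauge."""
    IDLE = 0
    START = 1
    CONNECTING = 2
    OPEN = 3
    LISTENING = 4
    CLOSING = 5
    ERROR = 6
    CLOSED = 7
    FAILED = 8
    CRASHED = 9


def describe_state(value):
    """Return the lower-case state name for a gauge value, or a marker if unknown."""
    try:
        return WsConnectionState(value).name.lower()
    except ValueError:
        return "unknown(%d)" % value


if __name__ == "__main__":
    for gauge_value in (0, 4, 8, 42):
        print(gauge_value, describe_state(gauge_value))
```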

Libraries

Implementation library
Include Path: ../../impl/include
Boost Library common/Websocket include and Boost system
Library Name: boost_system
Library Path: ../../lib
Include Path: ../../include
Boost Library chrono
Library Name: boost_chrono
Library Path: ../../lib
Boost Library random
Library Name: boost_random
Library Path: ../../lib