Gateway to the IBM Speech To Text (STT) cloud service > com.ibm.streamsx.sttgateway 2.0.0 > com.ibm.streamsx.sttgateway.watson > WatsonSTT
The WatsonSTT operator is designed to ingest audio data in the form of a file (.wav, .mp3 etc.) or RAW audio and then transcribe that audio into text via the IBM Watson STT (Speech To Text) cloud service. It does that by sending the audio data to the configured Watson STT service running in the IBM public cloud or in the IBM Cloud Pak for Data (CP4D) via the Websocket interface. It then outputs transcriptions of speech in the form of utterances or in full text as configured. An utterance is a group of transcribed words meant to approximate a sentence.
Audio data must be in 16-bit little endian, mono format. For the Telephony model and configurations, the audio must have an 8 kHz sampling rate. For the Broadband model and configurations, the audio must have a 16 kHz sampling rate. The data can be provided as a .wav file or as RAW uncompressed PCM audio.
Here is a sample ffmpeg command to convert a .wav file to the correct telephony format (use -ar 16000 for broadband):
$ ffmpeg -i MyFile.wav -ac 1 -ar 8000 MyNewFile.wav
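The format requirements above can be checked before a file is handed to the operator. The following is a minimal sketch using Python's standard wave module; the function name check_wav_format and the model_kind labels are hypothetical and are not part of this toolkit.

```python
import wave

# Sampling rates (Hz) taken from the operator description above.
EXPECTED_RATE = {"telephony": 8000, "broadband": 16000}


def check_wav_format(wav_file, model_kind="telephony"):
    """Return a list of format problems; an empty list means the file looks suitable.

    wav_file may be a file path or a binary file-like object.
    model_kind is a hypothetical label: "telephony" or "broadband".
    """
    problems = []
    with wave.open(wav_file, "rb") as w:
        if w.getnchannels() != 1:
            problems.append(f"expected mono, found {w.getnchannels()} channels")
        if w.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, found {8 * w.getsampwidth()}-bit")
        expected = EXPECTED_RATE[model_kind]
        if w.getframerate() != expected:
            problems.append(f"expected {expected} Hz, found {w.getframerate()} Hz")
    return problems
```

A file that fails the check can be converted with the ffmpeg command shown above. PCM data in a .wav container is little endian by definition, so the sketch does not test byte order.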
This operator must be configured with a Websocket URL, a Watson STT authentication token and a base language model (see the parameter section). This operator may also be customized with many other optional parameters including custom patch files and appropriate custom patch weights.
Note: Multiple invocations of this operator can be fused to make efficient use of the available pool of CPU cores.
See the samples folder inside this toolkit for working examples that show how to use this operator.
For detailed documentation about the operator design, usage patterns and in-depth technical details, please refer to the official STT Gateway toolkit documentation available at this URL:
Required: baseLanguageModel, uri
Optional: acousticCustomizationId, baseModelVersion, contentType, cpuYieldTimeInAudioSenderThread, customizationId, customizationWeight, filterProfanity, keywordsSpottingThreshold, keywordsToBeSpotted, maxConnectionRetryDelay, maxUtteranceAlternatives, nonFinalUtterancesNeeded, smartFormattingNeeded, sttJsonResponseDebugging, sttLiveMetricsUpdateNeeded, sttRequestLogging, sttResultMode, websocketLoggingNeeded, wordAlternativesThreshold
This port brings the audio data into this operator for transcription.
A window punctuation marker or an empty speech blob may be used to mark the end of a conversation. When the end of conversation is encountered, the STT engine delivers all results of the current conversation and flushes all buffers. All the extra input attributes will be forwarded if matching output attributes are found.
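The empty-blob convention can be illustrated outside of SPL. The sketch below assumes the audio of one conversation is fed in several chunks; the function name speech_blobs and the chunk_size default are hypothetical choices for illustration only.

```python
def speech_blobs(pcm_bytes, chunk_size=4096):
    """Yield the audio of one conversation in chunks, then an empty end marker.

    chunk_size is an arbitrary illustrative value, not a toolkit setting.
    """
    for offset in range(0, len(pcm_bytes), chunk_size):
        yield pcm_bytes[offset:offset + chunk_size]
    # An empty blob signals the end of the conversation.
    yield b""
```

In an SPL application the same effect can be achieved with a window punctuation marker instead of the trailing empty blob, as described above.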
This port brings into this operator an unexpired IAM access token (generated by using your service instance's API key), which is needed to access the Watson STT service. This input port should be used in a different thread than port 0.
All the extra attributes found in this input port will be ignored.
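For the IBM public cloud, an IAM access token is typically obtained by posting the API key to the IBM Cloud IAM token endpoint. The sketch below only builds that HTTP request and does not send it; the function name is hypothetical, and it makes no claim about how this toolkit or its samples obtain the token. Cloud Pak for Data deployments use a different procedure that is not covered here.

```python
import urllib.parse
import urllib.request

# Public IBM Cloud IAM token endpoint (not applicable to CP4D).
IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"


def build_iam_token_request(api_key):
    """Build (but do not send) the POST request that exchanges an API key for a token."""
    body = urllib.parse.urlencode({
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key,
    }).encode("ascii")
    return urllib.request.Request(
        IAM_TOKEN_URL,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
        method="POST",
    )
```

Sending the request with urllib.request.urlopen should return a JSON document whose access_token field holds the token. Because tokens expire, an application would need to repeat this before expiry and send the fresh token into this port.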
The default function for output attributes. This function assigns the output attribute to the value of the input attribute with the same name.
Returns an int32 number indicating the utterance number.
Note: This function is not available if sttResultMode equals complete.
Returns the transcription of audio in the form of a single utterance.
Returns a boolean value to indicate if this is an interim partial utterance or a finalized utterance.
Note: This function is not available if sttResultMode equals complete.
Returns a float64 confidence value for an interim partial utterance or for a finalized utterance or for the full text.
Note: This function is not available if sttResultMode equals complete.
Returns the Watson STT error message if any.
Returns a boolean value to indicate whether the full transcription/conversation is completed.
Note: This function is not available if sttResultMode equals complete.
Returns a list of n-best alternative hypotheses for an utterance result. The list has the very best guess first, followed by the next best ones in that order.
Note: This function is not available if sttResultMode equals complete.
Returns a nested list of word alternatives (Confusion Networks).
Note: This function is not available if sttResultMode equals complete.
Returns a nested list of word alternatives confidences (Confusion Networks).
Note: This function is not available if sttResultMode equals complete.
Returns a list of word alternatives start times (Confusion Networks).
Note: This function is not available if sttResultMode equals complete.
Returns a list of word alternatives end times (Confusion Networks).
Note: This function is not available if sttResultMode equals complete.
Returns a list of words in an utterance result.
Note: This function is not available if sttResultMode equals complete.
Returns a list of confidences of the words in an utterance result.
Note: This function is not available if sttResultMode equals complete.
Returns a list of start times of the words in an utterance result relative to the start of the audio.
Note: This function is not available if sttResultMode equals complete.
Returns a list of end times of the words in an utterance result relative to the start of the audio.
Note: This function is not available if sttResultMode equals complete.
Returns the start time of an utterance relative to the start of the audio.
Note: This function is not available if sttResultMode equals complete.
Returns the end time of an utterance relative to the start of the audio.
Note: This function is not available if sttResultMode equals complete.
Returns a list of speaker ids for the individual words in an utterance result.
Note: This function is not available if sttResultMode equals complete.
Returns a list of confidences in identifying the speakers of the individual words in an utterance result.
Note: This function is not available if sttResultMode equals complete.
Returns the STT keywords spotting results as a map of key/value pairs. Read this toolkit's documentation to learn about the map contents.
Note: This function is not available if sttResultMode equals complete.
This port produces the output tuples that carry the result of the speech to text transcription.
An output tuple is created for every utterance that is observed from the incoming audio data. An utterance is a group of transcribed words meant to approximate a sentence. This means there is a one-to-many relationship between an incoming tuple and outgoing tuples (e.g. a single .wav file may result in 30 output utterances). Utterances are sent out on this output port only when the sttResultMode operator parameter is set to partial; non-final utterances are included only if the nonFinalUtterancesNeeded parameter is also set to true. If sttResultMode is set to complete, then only the fully transcribed text for the entire audio data will be sent on this output port after the given audio is completely transcribed. The port emits a window punctuation marker when a conversation has finished. There are multiple available output functions, and output attributes can also be assigned values with any SPL expression that evaluates to the proper type.
This parameter specifies a custom acoustic model to be used for transcription. (Default is an empty string)
This parameter specifies the name of the Watson STT base language model that should be used. see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-input#models
This parameter specifies a particular base model version to be used for transcription. (Default is an empty string)
The parameter may point to a specific version of the base model if needed, e.g. "en-US_NarrowbandModel.v07-06082016.06202016" see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-input#version
This parameter specifies the content type to be used for transcription. (Default is audio/wav)
This parameter specifies the CPU yield time (in seconds) needed inside the audio sender thread's tight loop spinning to look for new audio data to be sent to the STT service. It should be >= 0.0 (Default is 0.001 i.e. 1 millisecond)
This parameter specifies a custom language model to be used for transcription. see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-input#custom (Default is an empty string)
This parameter specifies a relative weight for a custom language model as a float64 between 0.0 and 1.0 (Default is 0.0)
This parameter indicates whether profanity should be filtered from a transcript. (Default is false) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-output#profanity_filter
This parameter specifies the minimum confidence level that the STT service must have for an utterance word to match a given keyword. A value of 0.0 disables this feature. Valid value must be less than 1.0. (Default is 0.0) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-output#keyword_spotting
Note: This parameter is not valid if sttResultMode equals complete.
This parameter specifies a list (array) of strings to be spotted. (Default is an empty list)
Note: This parameter is not valid if sttResultMode equals complete.
The maximum wait time in seconds before a connection retry is made. The retry delay for connecting to the STT service increases exponentially, starting from 2 seconds but not exceeding 'maxConnectionRetryDelay'. It must be greater than 1.0 (Default is 60.0)
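The capped exponential delay described for this parameter can be sketched as follows. The source states only that the delay starts at 2 seconds, grows exponentially and never exceeds maxConnectionRetryDelay; the doubling factor and the function name retry_delays are assumptions made for illustration.

```python
def retry_delays(max_connection_retry_delay=60.0, initial_delay=2.0, factor=2.0):
    """Yield successive retry delays in seconds, capped at the configured maximum.

    factor=2.0 (doubling) is an assumption; the operator's real growth
    factor is not stated in this document.
    """
    delay = initial_delay
    while True:
        yield min(delay, max_connection_retry_delay)
        delay = min(delay * factor, max_connection_retry_delay)
```

Under these assumptions and the default maximum of 60.0, the first delays would be 2, 4, 8, 16 and 32 seconds, after which every further retry waits 60 seconds.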
This parameter indicates the required number of n-best alternative hypotheses for the transcription results. (Default is 1) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-output#max_alternatives
Note: This parameter is not valid if sttResultMode equals complete.
If sttResultMode equals partial, this parameter controls the output of non-final utterances. If sttResultMode equals complete, this parameter is ignored. (Default is false.)
This parameter indicates whether to convert date, time, phone numbers, currency values, email and URLs into conventional representations. (Default is false) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-output#smart_formatting
This parameter is used for debugging the STT JSON response message. Mostly for IBM internal use. (Default is false)
This parameter specifies whether live update for this operator's metrics nFullAudioConversationsReceived and nFullAudioConversationsTranscribed is needed. (Default is true)
Indicates whether IBM can use data that is sent over the connection to improve the service for future users. Specify false to prevent IBM from accessing the logged data. (Default is false) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-input#logging
This parameter specifies what type of STT result is needed: partial to get partial utterances, or complete (default) to get the full text after transcribing the entire audio. The setting of this parameter influences the validity of output functions.
In sttResultMode complete the following parameters are not allowed: maxUtteranceAlternatives, wordAlternativesThreshold, keywordsSpottingThreshold, keywordsToBeSpotted.
In sttResultMode complete the following output functions are not allowed: getUtteranceNumber(), isFinalizedUtterance(), getConfidence(), isTranscriptionCompleted(), getUtteranceAlternatives(), getWordAlternatives(), getWordAlternativesConfidences(), getWordAlternativesStartTimes(), getWordAlternativesEndTimes(), getUtteranceWords(), getUtteranceWordsConfidences(), getUtteranceWordsStartTimes(), getUtteranceWordsEndTimes(), getUtteranceStartTime(), getUtteranceEndTime(), getUtteranceWordsSpeakers(), getUtteranceWordsSpeakersConfidences(), getKeywordsSpottingResults()
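The restrictions listed for sttResultMode complete can be expressed as a simple pre-flight check. The sketch below is not part of the toolkit (the SPL compiler and the operator perform their own validation); the function name check_result_mode is hypothetical, and the two sets simply restate the lists above.

```python
# Parameters and output functions that are only valid when sttResultMode is partial.
PARTIAL_ONLY_PARAMETERS = {
    "maxUtteranceAlternatives", "wordAlternativesThreshold",
    "keywordsSpottingThreshold", "keywordsToBeSpotted",
}
PARTIAL_ONLY_OUTPUT_FUNCTIONS = {
    "getUtteranceNumber", "isFinalizedUtterance", "getConfidence",
    "isTranscriptionCompleted", "getUtteranceAlternatives", "getWordAlternatives",
    "getWordAlternativesConfidences", "getWordAlternativesStartTimes",
    "getWordAlternativesEndTimes", "getUtteranceWords",
    "getUtteranceWordsConfidences", "getUtteranceWordsStartTimes",
    "getUtteranceWordsEndTimes", "getUtteranceStartTime", "getUtteranceEndTime",
    "getUtteranceWordsSpeakers", "getUtteranceWordsSpeakersConfidences",
    "getKeywordsSpottingResults",
}


def check_result_mode(stt_result_mode, parameters, output_functions):
    """Return the names that conflict with the chosen sttResultMode."""
    if stt_result_mode not in ("partial", "complete"):
        raise ValueError(f"unknown sttResultMode: {stt_result_mode!r}")
    if stt_result_mode == "partial":
        return []
    conflicts = sorted(PARTIAL_ONLY_PARAMETERS & set(parameters))
    conflicts += sorted(PARTIAL_ONLY_OUTPUT_FUNCTIONS & set(output_functions))
    return conflicts
```

An empty result means the chosen parameters and output functions are compatible with the selected mode according to the lists above.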
This parameter specifies the Watson STT Websocket service URI. see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-websockets#WSopen
This parameter specifies whether logging is needed from the Websocket library. (Default is false)
This parameter controls the density of the word alternatives results (a.k.a. Confusion Networks). A value of 0.0 disables this feature. Valid value must be less than 1.0 (Default is 0.0) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-output#word_alternatives
Note: This parameter is not valid if sttResultMode equals complete.
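The numeric limits stated in the parameter descriptions above can be collected into one range check. This is a sketch under stated assumptions: the lower bound of 0.0 for the two thresholds, the inclusive bounds for customizationWeight and a strict lower bound of 1.0 for maxConnectionRetryDelay are interpretations of the text, and the function name out_of_range is hypothetical.

```python
# One predicate per numeric parameter, restating the limits in the descriptions.
RANGE_CHECKS = {
    "cpuYieldTimeInAudioSenderThread": lambda v: v >= 0.0,
    "customizationWeight": lambda v: 0.0 <= v <= 1.0,        # inclusive bounds assumed
    "keywordsSpottingThreshold": lambda v: 0.0 <= v < 1.0,   # lower bound assumed
    "maxConnectionRetryDelay": lambda v: v > 1.0,            # strict bound assumed
    "wordAlternativesThreshold": lambda v: 0.0 <= v < 1.0,   # lower bound assumed
}


def out_of_range(parameters):
    """Return the sorted names of parameters whose values violate a stated limit."""
    return sorted(
        name
        for name, value in parameters.items()
        if name in RANGE_CHECKS and not RANGE_CHECKS[name](value)
    )
```

Parameters without a numeric limit in this document, such as uri, are ignored by the check.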
The amount of audio data, in bytes, sent to the STT service.
NOTE: This metric is only updated if parameter sttLiveMetricsUpdateNeeded is true.
Number of full audio conversation failures by this operator instance.
NOTE: This metric is only updated if parameter sttLiveMetricsUpdateNeeded is true.
Number of full audio conversations received for transcription by this operator instance.
NOTE: This metric is only updated if parameter sttLiveMetricsUpdateNeeded is true.
Number of full audio conversations transcribed by this operator instance.
NOTE: This metric is only updated if parameter sttLiveMetricsUpdateNeeded is true.
The cumulative number of STT service Websocket connection attempts made by this operator instance.
Number of STT service Websocket connection attempts made by this operator instance for the current connection. The value is reset to zero when a connection attempt succeeds.
The cumulative number of failed STT service Websocket connection attempts made by this operator instance.
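The relationship between the three connection-attempt metrics above can be modelled in a few lines. This is an illustrative model of the described behaviour, not the operator's code; the class and attribute names are hypothetical.

```python
class ConnectionAttemptCounters:
    """Toy model of the cumulative, current and failed connection-attempt metrics."""

    def __init__(self):
        self.total_attempts = 0    # cumulative attempts, never reset
        self.current_attempts = 0  # attempts for the current connection
        self.failed_attempts = 0   # cumulative failed attempts, never reset

    def record_attempt(self, succeeded):
        """Account for one connection attempt and its outcome."""
        self.total_attempts += 1
        self.current_attempts += 1
        if succeeded:
            # A successful attempt resets the per-connection counter to zero.
            self.current_attempts = 0
        else:
            self.failed_attempts += 1
```

In this model, a per-connection counter that keeps growing while the cumulative failure counter grows with it indicates that the operator is still retrying the same connection.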
STT result mode currently in effect for a given operator instance.
The state of the websocket connection: