Gateway to the IBM Speech To Text (STT) cloud service > com.ibm.streamsx.sttgateway 2.2.3 > com.ibm.streamsx.sttgateway.watson > WatsonSTT
The WatsonSTT operator ingests audio data, either as a file (.wav, .mp3, etc.) or as RAW audio, and transcribes it into text via the IBM Watson STT (Speech To Text) cloud service. It sends the audio data over the Websocket interface to the configured Watson STT service running in the IBM public cloud or in IBM Cloud Pak for Data (CP4D). It then outputs the transcription either as individual utterances or as full text, as configured. An utterance is a group of transcribed words meant to approximate a sentence.
Audio data must be in 16-bit little-endian, mono format. For the Telephony model and configurations, the audio must have an 8 kHz sampling rate. For the Broadband model and configurations, the audio must have a 16 kHz sampling rate. The data can be provided as a .wav file or as RAW uncompressed PCM audio. Here is a sample ffmpeg command to convert a .wav file to the correct telephony format (use -ar 16000 for broadband):
$ ffmpeg -i MyFile.wav -ac 1 -ar 8000 MyNewFile.wav
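The format requirements above (mono, 16-bit, 8 kHz or 16 kHz) can be checked before a file is sent to the operator. The following is a minimal sketch that uses Python's standard `wave` module; it is not part of the toolkit, and the names `check_wav_format`, `make_silent_wav` and the `model_family` labels are hypothetical helpers introduced only for this illustration.

```python
import io
import wave

# Sampling rates named in the text above: 8 kHz (telephony), 16 kHz (broadband).
EXPECTED_RATES = {"telephony": 8000, "broadband": 16000}


def check_wav_format(source, model_family="telephony"):
    """Return a list of problems; an empty list means the header looks usable.

    `source` is a file path or a binary file object. `model_family` is a
    hypothetical label used only by this sketch.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, found {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, found {8 * wav.getsampwidth()}-bit")
        expected = EXPECTED_RATES[model_family]
        if wav.getframerate() != expected:
            problems.append(f"expected {expected} Hz, found {wav.getframerate()} Hz")
    return problems


def make_silent_wav(rate, channels=1, seconds=0.1):
    """Build a small silent 16-bit PCM WAV in memory (demonstration input)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(check_wav_format(make_silent_wav(8000)))               # no problems expected
    print(check_wav_format(make_silent_wav(44100, channels=2)))  # channel and rate problems
```

The check only inspects the WAV header; it does not verify that the audio content itself is intelligible speech.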
This operator must be configured with a Websocket URL and a base language model (see the parameter section), and it needs a Watson STT authentication token, which arrives on the second input port. This operator may also be customized with many other optional parameters including custom patch files and appropriate custom patch weights.
Note: Multiple invocations of this operator can be fused to make efficient use of the available pool of CPU cores.
See the samples folder inside this toolkit for working examples that show how to use this operator.
For detailed documentation about the operator design, usage patterns and in-depth technical details, refer to the official STT Gateway toolkit documentation available at this URL:
Required: baseLanguageModel, uri
Optional: acousticCustomizationId, baseModelVersion, contentType, cpuYieldTimeInAudioSenderThread, customizationId, customizationWeight, filterProfanity, keywordsSpottingThreshold, keywordsToBeSpotted, maxConnectionRetryDelay, maxUtteranceAlternatives, nonFinalUtterancesNeeded, smartFormattingNeeded, sttLiveMetricsUpdateNeeded, sttRequestLogging, sttResultMode, websocketLoggingNeeded, wordAlternativesThreshold
This port brings the audio data into this operator for transcription.
A window punctuation marker or an empty speech blob may be used to mark the end of a conversation. Thus a conversation can be a composite of multiple audio files. When the end of a conversation is encountered, the STT engine delivers all results of the current conversation and flushes all buffers.
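The end-of-conversation convention described above can be pictured from the sender's side. This is a hedged sketch, not toolkit code: `conversation_blobs` and `chunk_size` are hypothetical names, and the sketch only shows the order in which blobs would be produced for one conversation made of several audio parts, ending with the empty blob.

```python
def conversation_blobs(audio_parts, chunk_size=4096):
    """Yield the blobs for one conversation, ending with an empty blob.

    `audio_parts` is an iterable of bytes objects (for example the contents
    of several audio files that belong to the same conversation).
    Empty parts produce no blob, so they cannot end the conversation early.
    """
    for part in audio_parts:
        for offset in range(0, len(part), chunk_size):
            yield part[offset:offset + chunk_size]
    # The empty blob marks the end of the conversation.
    yield b""


if __name__ == "__main__":
    parts = [b"a" * 5000, b"b" * 100]
    print([len(blob) for blob in conversation_blobs(parts)])
```

A window punctuation marker could be used in place of the final empty blob; the sketch shows only the blob variant.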
All the extra input attributes will be forwarded if matching output attributes are found.
This port brings an unexpired IAM access token (generated by using your service instance's API key) into this operator that is needed to access the Watson STT service. This input port should be used in a different thread than port 0.
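As background for the token that this port expects, the sketch below shows how a request for an IAM access token could be built from an API key, and how a caller might decide whether a token is still unexpired. It is an illustration under assumptions: the endpoint URL and grant type are those of the public IBM Cloud IAM service and are not stated on this page, the helper names are hypothetical, and the request is only constructed, never sent.

```python
import time
import urllib.parse
import urllib.request

# Assumption: public IBM Cloud IAM token endpoint (not named on this page).
IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"


def build_iam_token_request(api_key):
    """Build (but do not send) a POST request that exchanges an API key for a token."""
    body = urllib.parse.urlencode({
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key,
    }).encode("ascii")
    return urllib.request.Request(
        IAM_TOKEN_URL,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
        method="POST",
    )


def token_is_unexpired(expiration_epoch_seconds, now=None, safety_margin=60.0):
    """Return True if the token stays valid for at least `safety_margin` more seconds."""
    now = time.time() if now is None else now
    return now + safety_margin < expiration_epoch_seconds


if __name__ == "__main__":
    request = build_iam_token_request("my-api-key")  # placeholder key
    print(request.get_method(), request.full_url)
```

Refreshing the token some time before it expires, as the safety margin suggests, is one way to make sure the operator always receives an unexpired token.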
All the extra attributes found in this input port will be ignored.
The default function for output attributes. This function assigns the output attribute to the value of the input attribute with the same name.
Returns the Watson STT error message if any. Default attribute name sttErrorMessage
Returns a boolean value to indicate whether the full transcription/conversation is completed. Default attribute name transcriptionCompleted
Note: If this function is requested, the operator emits a tuple with this flag equal to true at the end of a conversation. All other output functions deliver the default value for the final tuple of a conversation. If this function is not requested, the operator does not send this final tuple for the conversation. In either case, the operator sends a window punctuation marker at the end of a conversation.
Returns the transcription of audio in the form of a single utterance. Default attribute name utteranceText
Returns a float64 confidence value for the utterance. Default attribute name confidence
Note: This function does not deliver results if sttResultMode equals complete.
Returns the start time of an utterance relative to the start of the audio. Default attribute name utteranceStartTime
Returns the end time of an utterance relative to the start of the audio. Default attribute name utteranceEndTime
Returns an int32 number indicating the utterance number. Default attribute name utteranceNumber
Note: This function does not deliver results if sttResultMode equals complete.
Returns a boolean value to indicate if this is an interim partial utterance or a finalized utterance. Default attribute name finalizedUtterance
Note: This function does not deliver results if sttResultMode equals complete.
Returns a list of n-best alternative hypotheses for an utterance result. List will have the very best guess first followed by the next best ones in that order. Default attribute name utteranceAlternatives
Note: This function does not deliver results if sttResultMode equals complete.
Note: n-best alternative hypotheses are available only for final utterances.
Returns a list of words in an utterance result. Default attribute name utteranceWords
Returns a list of confidences of the words in an utterance result. Default attribute name utteranceWordsConfidences
Note: word confidences are available only for final utterances.
Returns a list of start times of the words in an utterance result relative to the start of the audio. Default attribute name utteranceWordsStartTimes
Returns a list of end times of the words in an utterance result relative to the start of the audio. Default attribute name utteranceWordsEndTimes
Returns a nested list of word alternatives (Confusion Networks). Default attribute name wordAlternatives
Note: word alternatives are available only for final utterances.
Returns a nested list of word alternatives confidences (Confusion Networks). Default attribute name wordAlternativesConfidences
Note: word alternative confidences are available only for final utterances.
Returns a list of word alternatives start times (Confusion Networks). Default attribute name wordAlternativesStartTimes
Note: word alternative start times are available only for final utterances.
Returns a list of word alternatives end times (Confusion Networks). Default attribute name wordAlternativesEndTimes
Note: word alternative end times are available only for final utterances.
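The four word-alternative attributes described above are easier to use once they are combined. The sketch below assumes that the four lists are index-aligned (entry i of each list describes the same time slot); that alignment is an assumption of this example, and `confusion_network` is a hypothetical helper name.

```python
def confusion_network(alternatives, confidences, start_times, end_times):
    """Combine the four word-alternative lists into one list of time slots.

    alternatives: nested list of candidate words per slot
    confidences:  nested list of confidences, parallel to `alternatives`
    start_times / end_times: one value per slot
    Assumption: all four lists are aligned by index.
    """
    slots = []
    for words, scores, start, end in zip(alternatives, confidences, start_times, end_times):
        # Pair each candidate with its confidence, best candidate first.
        candidates = sorted(zip(words, scores), key=lambda pair: pair[1], reverse=True)
        slots.append({"start": start, "end": end, "candidates": candidates})
    return slots


if __name__ == "__main__":
    for slot in confusion_network(
        [["to", "two"], ["go"]],
        [[0.4, 0.6], [0.9]],
        [0.0, 0.5],
        [0.5, 1.0],
    ):
        print(slot)
```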
Returns a list of speaker ids for the individual words in an utterance result. Each entry in this list corresponds to the appropriate entry in the utterance words list delivered with function getUtteranceWords. In rare cases the operator delivers a speaker label of -1, which means that the STT service delivered no speaker label for that word. Default attribute name utteranceWordsSpeakers
Note: word speaker labels are available only for final utterances.
Returns a list of confidences in identifying the speakers of the individual words in an utterance result. Default attribute name utteranceWordsSpeakersConfidences
Note: word speaker label confidences are available only for final utterances.
Returns the speaker identifier updates for all emitted utterance words in the current conversation. This function returns an update to previously sent word speaker identifiers without adding any new results. Default attribute name utteranceWordsSpeakerUpdates
Speaker numbering starts at 0. The invalid index -1 identifies an unidentified speaker. Diarization results typically need 5-10 seconds of speech to become reliable, and they continue to improve over the course of an audio conversation.
The result type is list<tuple<float64 startTime, int32 speaker, float64 confidence>>. The speaker label is given with respect to the start time of the utterance word. No other tuple types are accepted as result type.
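Because speaker updates refer to words by their start time, a consumer has to match each update against the words it has already received. The following sketch shows one way to do that. It assumes that a word's start time is repeated exactly in the update, so that it can serve as a lookup key; the dictionary layout of a word and the name `apply_speaker_updates` are hypothetical.

```python
def apply_speaker_updates(words, updates):
    """Return a copy of `words` with speaker labels replaced where an update exists.

    words:   list of dicts with at least "startTime" and "speaker"
    updates: list of (startTime, speaker, confidence) tuples
    Assumption: start times in updates match the words' start times exactly.
    """
    latest = {start: (speaker, confidence) for start, speaker, confidence in updates}
    updated = []
    for word in words:
        entry = dict(word)  # copy, so the caller's list is left untouched
        if entry["startTime"] in latest:
            entry["speaker"], entry["speakerConfidence"] = latest[entry["startTime"]]
        updated.append(entry)
    return updated


if __name__ == "__main__":
    received = [
        {"word": "hi", "startTime": 0.5, "speaker": -1},
        {"word": "there", "startTime": 0.9, "speaker": 0},
    ]
    print(apply_speaker_updates(received, [(0.5, 1, 0.8)]))
```

Words without a matching update keep their earlier label, which fits the statement that an update adds no new results.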
Returns the STT keywords spotting results as a map. The keys of the map are the spotted keywords. The values of the map are lists of keyword occurrences. Each keyword occurrence may be either
tuple<float64 startTime, float64 endTime, float64 confidence>
or
map<rstring, float64>
In the latter case the inner map has the entries: start_time, end_time, confidence. Thus the complete attribute type may be either:
map<rstring, list<tuple<float64 startTime, float64 endTime, float64 confidence>>>
or
map<rstring, list<map<rstring, float64>>>
Default attribute name keywordsSpottingResults
Note: keyword spotting results are available only for final utterances.
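Since the keyword spotting results can arrive in two shapes, downstream code may want to normalize them first. The sketch below models the SPL tuple form as a Python 3-tuple and the SPL map form as a dict with the keys start_time, end_time and confidence; that modeling and the name `normalize_keyword_results` are assumptions of this example.

```python
def normalize_keyword_results(results):
    """Convert either keyword-result shape to {keyword: [(start, end, confidence), ...]}."""
    normalized = {}
    for keyword, occurrences in results.items():
        rows = []
        for occurrence in occurrences:
            if isinstance(occurrence, dict):
                # Map form: entries start_time, end_time, confidence.
                rows.append((occurrence["start_time"],
                             occurrence["end_time"],
                             occurrence["confidence"]))
            else:
                # Tuple form: (startTime, endTime, confidence).
                start, end, confidence = occurrence
                rows.append((start, end, confidence))
        normalized[keyword] = rows
    return normalized


if __name__ == "__main__":
    as_tuples = {"invoice": [(1.2, 1.8, 0.91)]}
    as_maps = {"invoice": [{"start_time": 1.2, "end_time": 1.8, "confidence": 0.91}]}
    print(normalize_keyword_results(as_tuples) == normalize_keyword_results(as_maps))
```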
This port produces the output tuples that carry the result of the speech to text transcription.
An output tuple is created for every utterance that is observed from the incoming audio data. An utterance is a group of transcribed words meant to approximate a sentence. This means there is a one-to-many relationship between an incoming tuple and outgoing tuples (i.e. a single .wav file may result in 30 output utterances). If the input stream feeds audio samples, there may also be a many-to-one relationship between incoming tuples and an outgoing tuple, because several input tuples can be consumed before an output tuple is generated.
Additionally, an error during processing (connection error, invalid data, server error) triggers an outgoing tuple. The custom output function getSTTErrorMessage returns the error message.
Intermediate utterances are sent out on this output port only when the sttResultMode equals partial and operator parameter nonFinalUtterancesNeeded equals true. If sttResultMode equals complete, then only the fully transcribed text for the entire audio data will be sent on this output port after the given audio is completely transcribed.
The port emits a window punctuation marker when a conversation has finished. If sttResultMode equals partial, the operator sends out the results immediately without any further delay. If the output function isTranscriptionCompleted is requested, the operator additionally sends a tuple with empty/default utterance attributes and the transcription completed flag set when a conversation has finished.
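A consumer of this port can use the final tuple to split the output into conversations. The sketch below models output tuples as Python dicts that use the default attribute names utteranceText and transcriptionCompleted; the dict modeling and the name `group_by_conversation` are assumptions of this example, and error tuples are not handled.

```python
def group_by_conversation(output_tuples):
    """Split a flat sequence of output tuples into per-conversation utterance lists.

    Returns (completed, pending): `completed` holds one list of utterance texts
    per finished conversation, `pending` holds the utterances of a conversation
    whose final tuple has not arrived yet.
    """
    completed = []
    pending = []
    for tup in output_tuples:
        if tup.get("transcriptionCompleted", False):
            # The final tuple carries only default values, so it is not stored.
            completed.append(pending)
            pending = []
        else:
            pending.append(tup["utteranceText"])
    return completed, pending


if __name__ == "__main__":
    stream = [
        {"utteranceText": "hello", "transcriptionCompleted": False},
        {"utteranceText": "world", "transcriptionCompleted": False},
        {"utteranceText": "", "transcriptionCompleted": True},
        {"utteranceText": "next call", "transcriptionCompleted": False},
    ]
    print(group_by_conversation(stream))
```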
There are multiple available output functions, and output attributes can also be assigned values with any SPL expression that evaluates to the proper type.
This parameter specifies a custom acoustic model to be used for transcription. (Default is an empty string)
This parameter specifies the name of the Watson STT base language model that should be used. see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-input#models
This parameter specifies a particular base model version to be used for transcription. (Default is an empty string)
The parameter may point to a specific version of the base model if needed, e.g. "en-US_NarrowbandModel.v07-06082016.06202016". see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-input#version
This parameter specifies the content type to be used for transcription. (Default is audio/wav)
This parameter specifies the time (in seconds) for which the audio sender thread yields the CPU in each iteration of its tight loop while it looks for new audio data to send to the STT service. It should be >= 0.0 (Default is 0.001 i.e. 1 millisecond)
This parameter specifies a custom language model to be used for transcription. see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-input#custom (Default is an empty string)
This parameter specifies a relative weight for a custom language model as a float64 between 0.0 and 1.0 (Default is 0.0)
This parameter indicates whether profanity should be filtered from a transcript. (Default is false) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-output#profanity_filter
This parameter specifies the minimum confidence level that the STT service must have for an utterance word to match a given keyword. A value of 0.0 disables this feature. Valid value must be less than 1.0. (Default is 0.3) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-output#keyword_spotting
This parameter specifies a list (array) of strings to be spotted. (Default is an empty list)
The maximum wait time in seconds before a connection retry is made. The retry delay for connecting to the STT service increases exponentially, starting from 2 seconds but never exceeding 'maxConnectionRetryDelay'. The value must be greater than 1.0 (Default is 60.0)
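The retry schedule described above can be illustrated with a short generator. The text states only that the delay grows exponentially from 2 seconds up to the configured maximum; the doubling factor used here is an assumption, and `retry_delays` and `first_delays` are hypothetical names, not toolkit functions.

```python
import itertools


def retry_delays(max_delay=60.0, first_delay=2.0, factor=2.0):
    """Yield successive retry delays in seconds, capped at `max_delay`.

    Assumption: the delay is multiplied by `factor` (2.0) after each attempt.
    """
    if max_delay <= 1.0:
        raise ValueError("maxConnectionRetryDelay must be greater than 1.0")
    delay = first_delay
    while True:
        yield min(delay, max_delay)
        delay = min(delay * factor, max_delay)


def first_delays(count, max_delay=60.0):
    """Return the first `count` delays as a list."""
    return list(itertools.islice(retry_delays(max_delay), count))


if __name__ == "__main__":
    print(first_delays(7))  # [2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With the default maximum of 60.0 and the assumed doubling, the cap is reached on the sixth attempt.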
This parameter indicates the required number of n-best alternative hypotheses for the transcription results. (Default is 3) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-output#max_alternatives
Note: This parameter is ignored if sttResultMode equals complete.
If sttResultMode equals partial this parameter controls the output of non final utterances. If sttResultMode equals complete this parameter is ignored. (Default is false.)
This parameter indicates whether to convert date, time, phone numbers, currency values, email and URLs into conventional representations. (Default is false) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-output#smart_formatting
This parameter specifies whether live update for this operator's metrics nFullAudioConversationsReceived and nFullAudioConversationsTranscribed is needed. (Default is true)
Indicates whether IBM can use data that is sent over the connection to improve the service for future users. Specify false to prevent IBM from accessing the logged data. (Default is false) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-input#logging
This parameter specifies what type of STT result is needed: partial: to get partial utterances, complete: (default) to get the full text after transcribing the entire audio. The setting of this parameter influences the validity of output functions.
In sttResultMode complete, the parameter maxUtteranceAlternatives is not allowed.
In sttResultMode complete, the following output functions are not allowed: getUtteranceNumber(), isFinalizedUtterance(), getConfidence(), getUtteranceAlternatives().
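The restrictions for sttResultMode complete can be expressed as a small consistency check on a planned operator configuration. This is only an illustrative sketch of the rules listed above, not the operator's own validation logic; `check_result_mode` and its argument layout are hypothetical.

```python
# Output functions listed above as not allowed in complete mode.
FUNCTIONS_NOT_ALLOWED_IN_COMPLETE = frozenset({
    "getUtteranceNumber",
    "isFinalizedUtterance",
    "getConfidence",
    "getUtteranceAlternatives",
})


def check_result_mode(stt_result_mode, output_functions, parameters=()):
    """Return a list of conflicts between the result mode and the configuration.

    output_functions: names of the custom output functions in use
    parameters:       names of the operator parameters that are set
    """
    if stt_result_mode not in ("partial", "complete"):
        raise ValueError(f"unknown sttResultMode: {stt_result_mode!r}")
    if stt_result_mode == "partial":
        return []
    conflicts = [
        f"output function {name}() is not allowed in complete mode"
        for name in sorted(set(output_functions) & FUNCTIONS_NOT_ALLOWED_IN_COMPLETE)
    ]
    if "maxUtteranceAlternatives" in parameters:
        conflicts.append("parameter maxUtteranceAlternatives is not allowed in complete mode")
    return conflicts


if __name__ == "__main__":
    print(check_result_mode("complete",
                            ["getUtteranceText", "getConfidence"],
                            ["uri", "maxUtteranceAlternatives"]))
```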
This parameter specifies the Watson STT Websocket service URI. see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-websockets#WSopen
This parameter specifies whether logging is needed from the Websocket library. (Default is false)
This parameter controls the density of the word alternatives results (a.k.a. Confusion Networks). A value of 0.0 disables this feature. Valid value must be less than 1.0 (Default is 0.0) see: https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-output#word_alternatives
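The numeric limits stated in the parameter descriptions can be collected into one range check. The sketch below is an illustration, not the operator's validation code: `check_parameter_ranges` is a hypothetical name, and the lower bound of 0.0 for the two thresholds is an assumption, since the text only states that values must be less than 1.0.

```python
def check_parameter_ranges(params):
    """Return the sorted names of parameters whose values fall outside the stated ranges.

    `params` maps parameter names to numeric values; parameters that are
    not present are not checked.
    """
    rules = {
        "cpuYieldTimeInAudioSenderThread": lambda v: v >= 0.0,
        "customizationWeight": lambda v: 0.0 <= v <= 1.0,
        # Assumption: thresholds may not be negative.
        "keywordsSpottingThreshold": lambda v: 0.0 <= v < 1.0,
        "maxConnectionRetryDelay": lambda v: v > 1.0,
        "wordAlternativesThreshold": lambda v: 0.0 <= v < 1.0,
    }
    return sorted(
        name for name, is_valid in rules.items()
        if name in params and not is_valid(params[name])
    )


if __name__ == "__main__":
    print(check_parameter_ranges({
        "customizationWeight": 0.5,
        "wordAlternativesThreshold": 1.0,
        "maxConnectionRetryDelay": 1.0,
    }))
```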
The amount of audio data, in bytes, sent to the STT service.
NOTE: This metric is only updated if parameter sttLiveMetricsUpdateNeeded is true.
Number of full audio conversation failures encountered by this operator instance.
NOTE: This metric is only updated if parameter sttLiveMetricsUpdateNeeded is true.
Number of full audio conversations received for transcription by this operator instance.
NOTE: This metric is only updated if parameter sttLiveMetricsUpdateNeeded is true.
Number of full audio conversations transcribed by this operator instance.
NOTE: This metric is only updated if parameter sttLiveMetricsUpdateNeeded is true.
The cumulative number of STT service Websocket connection attempts made by this operator instance.
Number of STT service Websocket connection attempts made by this operator instance for the current connection. The value is reset to zero when a connection attempt succeeds.
The cumulative number of failed STT service Websocket connection attempts made by this operator instance.
STT result mode currently in effect for a given operator instance.
The state of the websocket connection: