Operator WatsonSTT (namespace com.ibm.streamsx.sttgateway.watson, toolkit com.ibm.streamsx.sttgateway 1.0.1: Gateway to the IBM Speech To Text (STT) cloud service)
The WatsonSTT operator ingests audio data, either as a file (.wav, .mp3, etc.) or as RAW audio, and transcribes it into text via the IBM Watson STT (Speech To Text) cloud service. It sends the audio data over the Websocket interface to the configured Watson STT service running in the IBM public cloud or in IBM Cloud Private (ICP). It then outputs the transcribed speech either as utterances or as full text, as configured. An utterance is a group of transcribed words meant to approximate a sentence.
Audio data must be in 16-bit little-endian, mono format. For the Telephony model and configurations, the audio must have an 8 kHz sampling rate. For the Broadband model and configurations, the audio must have a 16 kHz sampling rate. The data can be provided as a .wav file or as RAW uncompressed PCM audio. Here is a sample ffmpeg command to convert a .wav file to the correct telephony format (use -ar 16000 for broadband):
$ ffmpeg -i MyFile.wav -ac 1 -ar 8000 MyNewFile.wav
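Before sending a converted file to the operator, it can help to confirm that it really meets the format rules stated above (mono, 16-bit samples, 8 kHz or 16 kHz). The following is a minimal sketch that uses only Python's standard wave module. The function name check_wav_format and the "telephony"/"broadband" keys are illustrative assumptions and are not part of this toolkit.

```python
import wave

# Sampling rates required by the text above for each model family.
REQUIRED_RATE_HZ = {"telephony": 8000, "broadband": 16000}


def check_wav_format(path, model_family="telephony"):
    """Return a list of problems found; an empty list means the file looks usable.

    Hypothetical helper: it only inspects the WAV header fields that the
    operator description mentions (channels, sample width, sampling rate).
    """
    problems = []
    expected_rate = REQUIRED_RATE_HZ[model_family]
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, found {wav.getnchannels()} channels")
        # getsampwidth() is in bytes, so 2 bytes means 16-bit samples.
        if wav.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, found {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, found {wav.getframerate()} Hz")
    return problems
```

For example, check_wav_format("MyNewFile.wav") returns an empty list for a file produced by the ffmpeg command above, while the same file checked with model_family="broadband" reports a sampling rate mismatch.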
This operator must be configured with a Websocket URL, a Watson STT authentication token and a base language model (see the parameter section). This operator may also be customized with many other optional parameters including custom patch files and appropriate custom patch weights.
Note: Multiple invocations of this operator can be fused to make efficient use of the available pool of CPU cores.
See the samples folder inside this toolkit for working examples that show how to use this operator.
For detailed documentation on the operator design, usage patterns and in-depth technical details, please refer to the official STT Gateway toolkit documentation available at this URL:
Required: authToken, baseLanguageModel, uri
Optional: acousticCustomizationId, baseModelVersion, contentType, cpuYieldTimeInAudioSenderThread, customizationId, customizationWeight, filterProfanity, identifySpeakers, keywordsSpottingThreshold, keywordsToBeSpotted, maxAllowedConnectionAttempts, maxUtteranceAlternatives, smartFormattingNeeded, sttJsonResponseDebugging, sttLiveMetricsUpdateNeeded, sttRequestLogging, sttResultMode, waitTimeBeforeSTTServiceConnectionRetry, websocketLoggingNeeded, wordAlternativesThreshold, wordConfidenceNeeded, wordTimestampNeeded
This port brings the audio data into this operator for transcription.
All the extra input attributes will be forwarded if matching output attributes are found.
The default function for output attributes. This function assigns to the output attribute the value of the input attribute with the same name.
Returns an int32 number indicating the utterance number.
Returns the transcription of audio in the form of a single utterance.
Returns a boolean value to indicate if this is an interim partial utterance or a finalized utterance.
Returns a float32 confidence value for an interim partial utterance or for a finalized utterance or for the full text.
Returns the transcription of audio in the form of full text after completing the entire transcription.
Returns the Watson STT error message if any.
Returns a boolean value to indicate whether the full transcription is completed.
Returns a list of n-best alternative hypotheses for an utterance result. The list has the best guess first, followed by the next best ones in that order.
Returns a nested list of word alternatives (Confusion Networks).
Returns a nested list of word alternatives confidences (Confusion Networks).
Returns a list of word alternatives start times (Confusion Networks).
Returns a list of word alternatives end times (Confusion Networks).
Returns a list of words in an utterance result.
Returns a list of confidences of the words in an utterance result.
Returns a list of start times of the words in an utterance result relative to the start of the audio.
Returns a list of end times of the words in an utterance result relative to the start of the audio.
Returns the start time of an utterance relative to the start of the audio.
Returns the end time of an utterance relative to the start of the audio.
Returns a list of speaker ids for the individual words in an utterance result.
Returns a list of confidences in identifying the speakers of the individual words in an utterance result.
Returns the STT keywords spotting results as a map of key/value pairs. Read this toolkit's documentation to learn about the map contents.
This port produces the output tuples that carry the result of the speech to text transcription.
An output tuple is created for every utterance that is observed in the incoming audio data. An utterance is a group of transcribed words meant to approximate a sentence. This means there is a one-to-many relationship between an incoming tuple and outgoing tuples (e.g. a single .wav file may result in 30 output utterances). Intermediate utterances are sent out on this output port only when the sttResultMode operator parameter is set to either 1 or 2. If it is set to 3, only the fully transcribed text for the entire audio data is sent on this output port, after the given audio is completely transcribed.
There are multiple available output functions, and output attributes can also be assigned values with any SPL expression that evaluates to the proper type.
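The effect of sttResultMode on what appears at the output port can be pictured with a small model. The Python sketch below is not the operator's implementation. It assumes that mode 1 delivers both interim and finalized utterances, that mode 2 delivers only finalized utterances, and that the mode 3 full text is the finalized utterances joined by spaces. The Utterance type and the select_results function are hypothetical names introduced for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class Utterance(NamedTuple):
    """Hypothetical stand-in for one STT result for a piece of audio."""
    number: int
    text: str
    is_final: bool


def select_results(utterances: Iterable[Utterance], stt_result_mode: int) -> Iterator[str]:
    """Yield the texts that would be emitted for the given result mode.

    Assumed semantics (see the lead-in): 1 = interim and finalized
    utterances, 2 = finalized utterances only, 3 = one full text at the end.
    """
    if stt_result_mode not in (1, 2, 3):
        raise ValueError("sttResultMode must be 1, 2 or 3")
    finalized = []
    for utterance in utterances:
        if utterance.is_final:
            finalized.append(utterance.text)
        if stt_result_mode == 1 or (stt_result_mode == 2 and utterance.is_final):
            yield utterance.text
    if stt_result_mode == 3:
        # Single result after the whole audio has been processed.
        yield " ".join(finalized)
```

With two utterances that each arrive first as an interim and then as a finalized result, this model yields four texts in mode 1, two in mode 2 and one in mode 3, which mirrors the one-to-many relationship described above.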
acousticCustomizationId: This parameter specifies a custom acoustic model to be used for transcription. (Default is an empty string)
authToken: This parameter specifies the auth token needed to access the Watson STT service.
baseLanguageModel: This parameter specifies the name of the Watson STT base language model that should be used.
baseModelVersion: This parameter specifies a particular base model version to be used for transcription. (Default is an empty string)
contentType: This parameter specifies the content type to be used for transcription. (Default is audio/wav)
cpuYieldTimeInAudioSenderThread: This parameter specifies the CPU yield time (in seconds) used inside the audio sender thread's tight loop, which spins looking for new audio data to send to the STT service. It should be >= 0.0. (Default is 0.001, i.e. 1 millisecond)
customizationId: This parameter specifies a custom language model to be used for transcription. (Default is an empty string)
customizationWeight: This parameter specifies a relative weight for a custom language model as a float64 between 0.0 and 1.0. (Default is 0.0)
filterProfanity: This parameter indicates whether profanity should be filtered from a transcript. (Default is false)
identifySpeakers: This parameter indicates whether the speakers of the individual words in an utterance result should be identified. (Default is false)
keywordsSpottingThreshold: This parameter specifies the minimum confidence level that the STT service must have for an utterance word to match a given keyword. A value of 0.0 disables this feature. A valid value must be less than 1.0. (Default is 0.0)
keywordsToBeSpotted: This parameter specifies a list (array) of strings to be spotted. (Default is an empty list)
maxAllowedConnectionAttempts: This parameter specifies the maximum number of attempts to make a Websocket connection to the STT service. It should be >= 1. (Default is 10)
maxUtteranceAlternatives: This parameter indicates the required number of n-best alternative hypotheses for the transcription results. (Default is 1)
smartFormattingNeeded: This parameter indicates whether to convert dates, times, phone numbers, currency values, email addresses and URLs into conventional representations. (Default is false)
sttJsonResponseDebugging: This parameter is used for debugging the STT JSON response message. Mostly for IBM internal use. (Default is false)
sttLiveMetricsUpdateNeeded: This parameter specifies whether live update for this operator's custom metrics is needed. (Default is true)
sttRequestLogging: This parameter specifies whether request logging should be done for every STT audio transcription request. (Default is false)
sttResultMode: This parameter specifies what type of STT result is needed: 1 to get partial utterances, 2 to get completed utterances, 3 (default) to get the full text after transcribing the entire audio.
uri: This parameter specifies the Watson STT Websocket service URI.
waitTimeBeforeSTTServiceConnectionRetry: This parameter specifies the time (in seconds) to wait before retrying a connection attempt to the Watson STT service. It should be >= 1.0. (Default is 3.0)
websocketLoggingNeeded: This parameter specifies whether logging is needed from the Websocket library. (Default is false)
wordAlternativesThreshold: This parameter controls the density of the word alternatives results (a.k.a. Confusion Networks). A value of 0.0 disables this feature. A valid value must be less than 1.0. (Default is 0.0)
wordConfidenceNeeded: This parameter indicates whether the transcription result should include individual words and their confidences or not. (Default is false)
wordTimestampNeeded: This parameter indicates whether the transcription result should include individual words and their timestamps or not. (Default is false)
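The numeric constraints and defaults stated in the parameter descriptions above can be collected into a small pre-flight check that runs before an application is submitted. The Python sketch below is illustrative only: validate_params is a hypothetical helper, the parameters are passed as a plain dictionary keyed by the parameter names from the Required and Optional lists, and the lower bound of 0.0 for the two threshold parameters is an assumption, since the descriptions only state that the value must be less than 1.0.

```python
REQUIRED = ("authToken", "baseLanguageModel", "uri")


def validate_params(params):
    """Return a list of problems; an empty list means no documented rule is violated.

    Hypothetical helper. Omitted optional parameters fall back to the
    defaults given in the parameter descriptions.
    """
    errors = [f"{name} is required" for name in REQUIRED if not params.get(name)]

    if params.get("cpuYieldTimeInAudioSenderThread", 0.001) < 0.0:
        errors.append("cpuYieldTimeInAudioSenderThread should be >= 0.0")

    if not 0.0 <= params.get("customizationWeight", 0.0) <= 1.0:
        errors.append("customizationWeight should be between 0.0 and 1.0")

    for name in ("keywordsSpottingThreshold", "wordAlternativesThreshold"):
        # Assumption: negative values are invalid; 0.0 disables the feature.
        if not 0.0 <= params.get(name, 0.0) < 1.0:
            errors.append(f"{name} should be >= 0.0 and less than 1.0")

    if params.get("maxAllowedConnectionAttempts", 10) < 1:
        errors.append("maxAllowedConnectionAttempts should be >= 1")

    if params.get("waitTimeBeforeSTTServiceConnectionRetry", 3.0) < 1.0:
        errors.append("waitTimeBeforeSTTServiceConnectionRetry should be >= 1.0")

    if params.get("sttResultMode", 3) not in (1, 2, 3):
        errors.append("sttResultMode should be 1, 2 or 3")

    return errors
```

A dictionary that supplies only the three required parameters passes the check, because every optional parameter then takes its documented default.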
A few custom metrics are available for the WatsonSTT operator. The Counter kind metrics listed below are updated when the operator starts. The Gauge kind metrics, however, are updated live during transcription only when the sttLiveMetricsUpdateNeeded operator parameter is set to true.
Number of full audio conversations received for transcription by this operator instance.
Number of full audio conversations transcribed by this operator instance.
STT result mode currently in effect for a given operator instance.
Number of STT service Websocket connection attempts made by this operator instance.
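The connection attempts counted by the last metric are governed by the maxAllowedConnectionAttempts and waitTimeBeforeSTTServiceConnectionRetry parameters. The Python sketch below shows one plausible shape of such a bounded retry loop. It is an assumption about the general pattern, not the operator's actual code. The connect callable is a hypothetical stand-in for opening the Websocket connection, and the sleep argument exists only so that the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def connect_with_retry(
    connect: Callable[[], bool],
    max_attempts: int = 10,        # documented default of maxAllowedConnectionAttempts
    wait_seconds: float = 3.0,     # documented default of waitTimeBeforeSTTServiceConnectionRetry
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Try to connect up to max_attempts times.

    Returns the number of attempts that were made when a connection
    succeeds, and raises ConnectionError when every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if connect():
            return attempt
        # Wait only if another attempt will follow.
        if attempt < max_attempts:
            sleep(wait_seconds)
    raise ConnectionError(f"gave up after {max_attempts} connection attempts")
```

In this sketch the returned attempt count plays the role that the connection attempts metric plays for an operator instance.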