Operator WatsonSTT

Toolkit: com.ibm.streamsx.sttgateway 2.0.0 (Gateway to the IBM Speech To Text (STT) cloud service)
Namespace: com.ibm.streamsx.sttgateway.watson

The WatsonSTT operator ingests audio data, either as a file (.wav, .mp3 etc.) or as RAW audio, and transcribes it into text via the IBM Watson STT (Speech To Text) cloud service. It sends the audio data over the Websocket interface to the configured Watson STT service running in the IBM public cloud or in IBM Cloud Pak for Data (CP4D). It then outputs the transcription either as utterances or as full text, as configured. An utterance is a group of transcribed words meant to approximate a sentence.

Audio data must be in 16-bit little endian, mono format. For the Telephony model and configurations, the audio must have an 8 kHz sampling rate. For the Broadband model and configurations, the audio must have a 16 kHz sampling rate. The data can be provided as a .wav file or as RAW uncompressed PCM audio. Here is a sample ffmpeg command to convert a .wav file to the correct telephony format (use -ar 16000 for broadband):

$ ffmpeg -i MyFile.wav -ac 1 -ar 8000 MyNewFile.wav
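Before handing a converted file to the operator, it can be useful to confirm that it really meets the format stated above. The following is a minimal sketch using Python's standard wave module; the function name check_wav_format and its expected_rate argument are illustrative and not part of the toolkit.

```python
import wave


def check_wav_format(path, expected_rate=8000):
    """Return a list of format problems; an empty list means the file matches.

    expected_rate is 8000 for the Telephony model and 16000 for Broadband.
    This helper is hypothetical and only mirrors the requirements stated in
    this document (mono, 16-bit samples, fixed sampling rate).
    """
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        rate = wav.getframerate()
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {8 * sample_width}-bit")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    return problems
```

Calling check_wav_format("MyNewFile.wav") on the output of the ffmpeg command above should return an empty list; pass expected_rate=16000 for broadband audio.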

This operator must be configured with a Websocket URL, a Watson STT authentication token and a base language model (see the Parameters section; the token is supplied through the second input port). It can also be customized with many other optional parameters, including custom patch files and appropriate custom patch weights.

The operator parameter sttResultMode specifies what type of STT result is needed:

partial: to get partial utterances.
complete (default): to get the full text after transcribing the entire audio.

The setting of this parameter influences the validity of output functions and parameters. In sttResultMode partial, the parameter nonFinalUtterancesNeeded controls the output of non-final utterances.

Requirements:

Intel RHEL6 or RHEL7 hosts with IBM Streams installed.

Note: Multiple invocations of this operator can be fused to make efficient use of the available pool of CPU cores.

See the samples folder inside this toolkit for working examples that show how to use this operator.

For detailed documentation on the operator design, usage patterns and in-depth technical details, please refer to the official STT Gateway toolkit documentation available at this URL:

https://ibmstreams.github.io/streamsx.sttgateway

Summary

Ports

This operator has 2 input ports and 1 output port.

Windowing

This operator does not accept any windowing configurations.

Parameters

This operator supports 21 parameters.

Required: baseLanguageModel, uri

Optional: acousticCustomizationId, baseModelVersion, contentType, cpuYieldTimeInAudioSenderThread, customizationId, customizationWeight, filterProfanity, keywordsSpottingThreshold, keywordsToBeSpotted, maxConnectionRetryDelay, maxUtteranceAlternatives, nonFinalUtterancesNeeded, smartFormattingNeeded, sttJsonResponseDebugging, sttLiveMetricsUpdateNeeded, sttRequestLogging, sttResultMode, websocketLoggingNeeded, wordAlternativesThreshold

Metrics

This operator reports 9 metrics.

Properties

Implementation: C++
Threading: Never - Operator never provides a single threaded execution context.

Input Ports

Ports (0)

This port brings the audio data into this operator for transcription.

Attributes on this input port:

speech (required, rstring/blob) - For file-based input (.wav, .mp3 etc. for batch workload), the expected value is the absolute path of a file, as an rstring. For RAW audio data (received from a network switch for real-time workload), the expected input is of type blob.

A window punctuation marker or an empty speech blob may be used to mark the end of a conversation. When the end of conversation is encountered, the STT engine delivers all results of the current conversation and flushes all buffers. All the extra input attributes will be forwarded if matching output attributes are found.
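The end-of-conversation convention described above can be illustrated from the point of view of an upstream audio feeder. The following is a minimal sketch in Python, not code from this toolkit: the function name conversation_blobs and the default chunk size are assumptions chosen for illustration (3200 bytes is 200 ms of 16-bit mono audio at 8 kHz).

```python
def conversation_blobs(pcm, chunk_bytes=3200):
    """Yield RAW PCM chunks for one conversation, then an empty end marker.

    pcm is the whole conversation as bytes (16-bit little endian, mono).
    The chunk size is an arbitrary assumption; it only has to be a whole
    number of 16-bit samples.
    """
    if chunk_bytes <= 0 or chunk_bytes % 2:
        raise ValueError("chunk_bytes must be a positive multiple of 2")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
    # An empty blob marks the end of the conversation.
    yield b""
```

Each yielded value would become the speech blob of one input tuple; the final empty blob plays the role that a window punctuation marker could play instead.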

Properties

Optional: false

Ports (1)

This port brings an unexpired IAM access token into this operator. The token is generated by using your service instance's API key and is needed to access the Watson STT service. This input port should be driven by a different thread than port 0.

Attributes on this input port:

access_token (required, rstring) - The access token, as an rstring, required to secure access to the STT service.

All the extra attributes found on this input port will be ignored.

Properties

Optional: false
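For readers who want to see where such an access token comes from, the following is a minimal Python sketch of exchanging an API key for an IAM access token. It assumes the IBM public cloud IAM token endpoint and its apikey grant type; a CP4D deployment obtains tokens differently. The function names are illustrative and are not part of this toolkit.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the IBM public cloud IAM token endpoint.
IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"


def build_token_request(api_key):
    """Build the form-encoded POST request that exchanges an API key for a token."""
    body = urllib.parse.urlencode({
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key,
    }).encode("ascii")
    return urllib.request.Request(
        IAM_TOKEN_URL,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
        method="POST",
    )


def extract_access_token(response_body):
    """Pull the access_token field out of the JSON response body."""
    return json.loads(response_body)["access_token"]


def fetch_access_token(api_key, timeout=30):
    """Perform the network call; requires connectivity and a valid API key."""
    request = build_token_request(api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_access_token(response.read())
```

The string returned by fetch_access_token is what would be sent as the access_token attribute on this port. Because IAM tokens expire, a feeder would have to repeat the exchange periodically so that the operator always holds an unexpired token.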

Output Ports

Assignments: This operator allows any SPL expression of the correct type to be assigned to output attributes.

Output Functions

STTGatewayFunctions

<any T> T AsIs(T)

The default function for output attributes. This function assigns to the output attribute the value of the input attribute with the same name.

int32 getUtteranceNumber()

Returns an int32 number indicating the utterance number.