Reference information for IBM Streams Runner for Apache Beam

Learn about the package contents and pipeline options for IBM® Streams Runner for Apache Beam.

Package contents for Streams Runner

The Streams Runner package contains the following directories:

  • com.ibm.streams.beam: The IBM Streams Runner for Apache Beam toolkit, which you can use to submit Apache Beam 2.4 applications to the IBM Streams runtime environment.

  • examples: Toolkit sample applications. For information about the samples, see the README file in the examples directory.

Pipeline Options

General pipeline options

Parameter Description Default value
runner The pipeline runner to use. Use this option to determine the pipeline runner at run time. Set this option to StreamsRunner to run with IBM Streams.
streaming A flag to indicate whether streaming mode is enabled (true).

Note: IBM Streams is a pure streaming engine and does not have a discrete batch-processing mode. For this reason, this parameter is ignored and is automatically set to true.
true
jobName The name of the job. Defaults to a Beam-generated string.
appName The name of the app for display purposes. Defaults to the class name of the PipelineOptions creator.

Streams Runner pipeline options

Parameter Description Default value
contextType The mode to run the application in:
  • STREAMING_ANALYTICS_SERVICE: Compile an application remotely and submit the translated application to a Streaming Analytics service on IBM Cloud (formerly IBM Bluemix).
  • DISTRIBUTED: Submit the application to a Streams instance. The domain and instance are configured by the STREAMS_DOMAIN_ID and STREAMS_INSTANCE_ID environment variables.
  • BUNDLE: Create a Streams application bundle (SAB) file for submission at a later time.
STREAMING_ANALYTICS_SERVICE
jarsToStage A list of JAR files (separated by colons) that are required to run the Apache Beam application. Include the JAR files that contain your program and any dependencies. (You don’t need to include Beam Google IO SDK or core Beam JAR files.) The listed JAR files are added to the SAB file. Globs (wildcards) may be used to specify files (e.g., `foo/bar/*.jar`). However, globs in directory paths are not supported (e.g., `**/*.jar`).

Note: Using fat or uber JAR files can reduce the number of JAR files that must be specified, but take care not to include JAR files that are already provided by Streams Runner. Including redundant dependencies can increase the size of the application archive and can negatively impact submission times to IBM Cloud.
[null]
filesToStage A JSON string that maps local file paths (keys) to destination paths (values). The local files are included in the SAB file and are accessible to the application through the destination paths. For example:
--filesToStage="{\"/path/to/local/file\": \"input/file1.in\", \"path/to/another/local/file\": \"env.conf\"}"

In this example, the two local files are added into the bundle at the specified destination paths, which can then be accessed by using "streams://input/file1.in" and "streams://env.conf" paths.
[null]
beamToolkitDir The location of the Streams Runner toolkit. Use this option to explicitly specify the Streams Runner toolkit location. Defaults to the path of the com.ibm.streams.beam.translation.jar file in the Java™ class path.
tracingLevel Set the tracing and logging level of StreamsRunner translation and runtime. If specified, overrides all other tracing options. Levels: ERROR, WARN, INFO, DEBUG, TRACE [null]
traceTranslation Set the translation trace level for the application. Specify a single level (ERROR, WARN, INFO, DEBUG, or TRACE) to set all loggers to that level, set levels for individual components, or both. Components include Runner and Streams. For example, to set Streams Runner to WARN and all other loggers to INFO, specify the following:

INFO,Runner=WARN
Runner=INFO,Streams=WARN
traceRuntime Set the runtime trace level for the application. See traceTranslation for more information. WARN
bundleSize Controls the maximum number of data tuples in every bundle. Applications should make sure that each bundle does not exceed 2 GB. 1
bundleMillis Controls the maximum time delay, in milliseconds, of every bundle. Applications should make sure that each bundle does not exceed 2 GB. 100
parallelWidths Experimental

Sets the parallelism for the entire Beam pipeline or for individual transforms via transform step names. A default parallel width for the pipeline is specified as a plain number with no step name, and a list of step paths with widths gives the widths for matching steps. If not specified, the parallel width for all transforms is 1. Step matching is done the same way step names are matched in BeamMetricsFilter.

If a step name matches multiple paths, the first match is used. Likewise, if a default is given multiple times, the first one is used.

For example, the following configuration sets the default (entire pipeline parallelism) to width 2, but steps (transforms) that contain the subpath Device_3/Map will have width 3.

--parallelWidths=2,Device_3/Map=3

Because the first match is taken, the parallel width for Device_3/Map is still 3 in the following configuration.

--parallelWidths=2,Device_3/Map=3,Map=4

For Source transforms, the runner attempts to match the configured parallel width by calling the split API. For UnboundedSource, the runner uses the specified parallel width as the desiredNumSplits argument in split. For BoundedSource, the runner uses the specified parallel width to calculate the desiredBundleSizeBytes. However, the split API does not guarantee that the desired width is respected. Some sources might be unsplittable, and some split only into a certain number of sub-sources. The source parallel width is set to either the specified width or the number of sub-sources, whichever is smaller. If there are more sub-sources than the specified width, one channel can contain multiple sub-source instances.

For KV PCollections, it is guaranteed that the same key is always processed in the same parallel instance, correctly preserving per-key states.

Dynamically changing the parallel width at run time (a capability available since IBM Streams release 4.3.0) breaks Beam pipelines.
[null]
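
The first-match rule described for parallelWidths can be illustrated with a small sketch. The `ParallelWidthsSketch` class below is hypothetical and is not part of Streams Runner. It assumes that a step matches when the configured path appears in the step name on `/` boundaries, which only approximates the BeamMetricsFilter-style matching that the runner actually uses.

```java
/** Hypothetical helper (not a Streams Runner API): resolves a step's width from a --parallelWidths value. */
final class ParallelWidthsSketch {
    private ParallelWidthsSketch() {}

    /** Returns the width for stepName using first-match semantics, or 1 if nothing applies. */
    static int widthFor(String spec, String stepName) {
        Integer defaultWidth = null;
        for (String entry : spec.split(",")) {
            String item = entry.trim();
            if (item.isEmpty()) {
                continue;
            }
            int eq = item.lastIndexOf('=');
            if (eq < 0) {
                // A plain number is the pipeline-wide default; only the first one is kept.
                if (defaultWidth == null) {
                    defaultWidth = Integer.parseInt(item);
                }
            } else {
                String path = item.substring(0, eq).trim();
                int width = Integer.parseInt(item.substring(eq + 1).trim());
                // The first matching step path wins, so return immediately.
                if (matchesSubPath(stepName, path)) {
                    return width;
                }
            }
        }
        return defaultWidth != null ? defaultWidth : 1;
    }

    /** Assumed matching rule: subPath occurs in stepName, aligned on '/' boundaries. */
    static boolean matchesSubPath(String stepName, String subPath) {
        if (subPath.isEmpty()) {
            return false;
        }
        int from = 0;
        while (true) {
            int start = stepName.indexOf(subPath, from);
            if (start < 0) {
                return false;
            }
            int end = start + subPath.length();
            boolean startsAtBoundary = start == 0 || stepName.charAt(start - 1) == '/';
            boolean endsAtBoundary = end == stepName.length() || stepName.charAt(end) == '/';
            if (startsAtBoundary && endsAtBoundary) {
                return true;
            }
            from = start + 1;
        }
    }
}
```

Under these assumptions, `ParallelWidthsSketch.widthFor("2,Device_3/Map=3,Map=4", "Device_3/Map")` resolves to the first matching entry, while a step that matches no path falls back to the default width.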

For the full list of pipeline options, enter --help=StreamsPipelineOptions on the Beam application command line.
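
Hand-escaping the JSON value for the filesToStage option described above is error-prone, so it can help to build the string in code. The following is a minimal sketch; the `FilesToStageSketch` class and its method names are hypothetical and not part of Streams Runner, and the escaping covers only backslashes and double quotes.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper (not a Streams Runner API): builds a --filesToStage argument. */
final class FilesToStageSketch {
    private FilesToStageSketch() {}

    /** Serializes a local-path to destination-path mapping as a JSON object string. */
    static String toJson(Map<String, String> localToDestination) {
        StringJoiner json = new StringJoiner(", ", "{", "}");
        for (Map.Entry<String, String> entry : localToDestination.entrySet()) {
            json.add(quote(entry.getKey()) + ": " + quote(entry.getValue()));
        }
        return json.toString();
    }

    /** Minimal JSON string quoting: escapes backslashes first, then double quotes. */
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Formats the complete command-line argument. */
    static String toArgument(Map<String, String> localToDestination) {
        return "--filesToStage=" + toJson(localToDestination);
    }

    /** Path that the application uses at run time to reach a staged file. */
    static String streamsPath(String destination) {
        return "streams://" + destination;
    }
}
```

Passing the result of `toArgument` as one element of the argument array avoids the shell-level quoting shown in the command-line form. Use an insertion-ordered map such as `LinkedHashMap` if the order of entries matters to you.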

STREAMING_ANALYTICS_SERVICE context-specific pipeline options

Parameter Description Default value
vcapServices The location of the Streaming Analytics VCAP file. This parameter is required when you use the STREAMING_ANALYTICS_SERVICE context type. This parameter can be omitted if the $VCAP_SERVICES environment variable is set to the path of the file. [null]
serviceName The name of the Streaming Analytics service on IBM Cloud. This parameter is required when you use the STREAMING_ANALYTICS_SERVICE context type. [null]
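
The relationship between the vcapServices option and the VCAP_SERVICES environment variable can be pictured as a simple fallback. The sketch below is illustrative only: `ServiceOptionSketch` is a hypothetical class, and the precedence it applies (an explicit option value wins over the environment variable) is an assumption, because the text above states only that the option can be omitted when the variable is set.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper (not a Streams Runner API): option value with environment fallback. */
final class ServiceOptionSketch {
    private ServiceOptionSketch() {}

    /**
     * Returns the explicit option value if present, otherwise the value of the
     * named environment variable, otherwise an empty Optional.
     * Assumption: an explicit option takes precedence over the environment.
     */
    static Optional<String> resolve(String optionValue, String envName, Map<String, String> env) {
        if (optionValue != null && !optionValue.isEmpty()) {
            return Optional.of(optionValue);
        }
        String fromEnv = env.get(envName);
        if (fromEnv == null || fromEnv.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(fromEnv);
    }
}
```

For example, `ServiceOptionSketch.resolve(null, "VCAP_SERVICES", System.getenv())` yields the path from the environment when no option value is supplied. An empty result corresponds to the case where the required parameter is missing.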

DISTRIBUTED context-specific pipeline options

Parameter Description Default value
restUrl The URL for the REST API when you use the DISTRIBUTED context type. This parameter is required if you implement metrics in the application. The return value of the streamtool geturl command
userName The user name for basic authentication with the REST API when you use the DISTRIBUTED context type. The user.name system property
userPassword A path to a file containing the user password (recommended) or a string containing the user password for basic authentication for REST API when you use the DISTRIBUTED context type. If the option value is a path to a readable file, the first line of the file will be used as the password. [null]
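
The file-or-literal behavior described for userPassword can be sketched as follows. This is not the runner's implementation: `UserPasswordSketch` is a hypothetical class, and the UTF-8 encoding and the handling of an empty file are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not a Streams Runner API): resolves a userPassword option value. */
final class UserPasswordSketch {
    private UserPasswordSketch() {}

    /**
     * If optionValue names a readable regular file, returns the first line of that file.
     * Otherwise the value itself is treated as the password.
     */
    static String resolve(String optionValue) {
        if (optionValue == null) {
            return null;
        }
        Path candidate;
        try {
            candidate = Paths.get(optionValue);
        } catch (InvalidPathException e) {
            // Not a valid path on this platform, so it can only be a literal password.
            return optionValue;
        }
        if (Files.isRegularFile(candidate) && Files.isReadable(candidate)) {
            // Assumption: the password file is UTF-8 encoded.
            try (BufferedReader reader = Files.newBufferedReader(candidate, StandardCharsets.UTF_8)) {
                String firstLine = reader.readLine();
                // Assumption: an empty file yields an empty password.
                return firstLine == null ? "" : firstLine;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return optionValue;
    }
}
```

Keeping the password in a file with restrictive permissions avoids exposing it in the process list or shell history, which is why the file form is the recommended one.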

Environment Variables

These environment variables are not required for Streams Runner to work; however, they can be used for convenience when you launch your Beam application.

Name Description Notes
STREAMS_RUNNER_HOME The absolute path to the extraction location of the `com.ibm.streams.beam-<version>` directory, where `<version>` is the version of Streams Runner. Set by using one of the following methods:
  • Source the $STREAMS_RUNNER_HOME/examples/bin/streams-runner-env.sh file.
  • Use the export command.
STREAMS_BEAM_TOOLKIT The path to the Streams Runner toolkit ($STREAMS_RUNNER_HOME/com.ibm.streams.beam) Set by using one of the following methods:
  • Source the $STREAMS_RUNNER_HOME/examples/bin/streams-runner-env.sh file.
  • Use the export command.
VCAP_SERVICES The path to the IBM Cloud credentials file. If this environment variable is set, the --vcapServices parameter does not need to be specified on the command line.

For more information about the credentials file, see Creating a credentials file for your Streaming Analytics service.
Set by using the export command.
STREAMING_ANALYTICS_SERVICE_NAME The name of the Streaming Analytics service in the IBM Cloud credentials file to use. If this environment variable is set, the --serviceName parameter does not need to be specified on the command line. Set by using the export command.
STREAMS_INSTALL The path to the local IBM Streams installation on your system. Only set if submitting an application to a local Streams environment. Important: If this variable exists, you must use the unset command to unset it before you can submit an application to the Streaming Analytics service.

Apache Beam SDK for Java

See Beam's [Java API Reference](https://beam.apache.org/documentation/sdks/javadoc/2.4.0/) for information on application APIs.

Streams Runner SDK API Reference

See the [javadoc](../release/1.2/javadoc/index.html) for more information.