Performance Considerations for IBM Streams Runner for Apache Beam

Edit me

The performance of your IBM® Streams Runner for Apache Beam application is affected by options and configuration parameters given to both the runner and the IBM Streams instance where your application is deployed. Understanding how these work will help you get the best performance.

Bundling elements

The DoFn instances that make up an Apache Beam application process elements in arbitrary bundles created by the runners. Bundling elements together can increase overall throughput but can increase latency of individual elements.

The IBM Streams Runner for Apache Beam pipeline options --bundleSize and --bundleMillis control how the runner handles bundles by either directing the runner to create bundles of a maximum number of elements, or after a maximum time delay, whichever limit is reached first.

For more information on these options, see Streams Runner pipeline options.

Parallelism

Parallelism is a key concept in Apache Beam and many transforms are able to process elements in parallel. The IBM Streams Runner for Apache Beam has experimental support for parallelism using the --parallelWidths pipeline option. Using this option allows the runner to take advantage of IBM Streams user-defined parallelism when translating and running your application.

The simplest way to enable parallelism for your application is to set a default width for the entire application, for example --parallelWidth=4 will cause the runner to use parallelism of 4 wherever it can.

In some cases finer control can be useful if, for example, some transforms in the application will benefit more or less from parallelism. In that case, the option can supply widths for matching steps.

As an example, suppose the application has a step VeryParallel that could benefit from more parallelism than most of the application as well as one named Serial that should not be parallelized at all. Using the pipeline option --parallelWidths=4,Serial=1,VeryParallel=8 would disable parallelism for Serial while using higher parallelism for VeryParallel than the rest of the application.

PCollections of KV elements are partitioned based on keys. The same key will always go into the same parallel channel, guaranteeing that keyed states are correctly handled. PCollections of other types are partitioned using the round-robin scheme. It is applications’ responsibility to balance workloads across keys. Skewed workloads can overwhelm hot keys and hamper performance. Since the value of keys are determined by applications, the runner cannot easily detect or predict keys. If a PCollection contains fewer keys than the parallel width, some parallel channels will stay idle, wasting system resources. One special case is a PCollection using the Void key. In this situation, the runner can identify that there is only one global key and run the transform in a non-parallelized form. The following code snippet demonstrates an example. As the AddKey transform adds the same Void key to all elements, the downstream GBK cannot run in parallel. If a user specifies a parallel width (greater than 1) for the entire pipeline, the runner can silently ignore the parallel width for the section of the pipeline that uses the Void keys. If a user explicitly specifies a parallel width for the GBK in the example, the runner, again, can ignore the specified width.

Pipeline p = Pipeline.create(options);
p.apply("Source", GenerateSequence.from(0))
 .apply("AddKey", WithKeys.of((Void) null))
 .setCoder(KvCoder.of(VoidCoder.of(), VarLongCoder.of()))
 .apply("GBK", GroupByKey.create());

For Source transforms, the runner attempts to match the configured parallel width by splitting the source into sub-sources using the UnboundedSource#split or BoundedSource#split method. Applications need to make sure that the source implementation supports split properly. If the number of sub-sources returned by split disagrees with the parallel width, the source parallel width is set to the smaller value of the specified width and the number of sub-sources. If there are more sub-sources than the specified width, the runner will allocate sub-sources to parallel channels in a round-robin fashion and one parallel channel can contain multiple sub-source instances. To get the best performance, the number of sub-sources should match the expected parallel width.

For more information on this option, see Streams Runner pipeline options.

When the runner launches the application it will set the IBM Streams operator fusionType to channelIsolation. To change the fusion type use the pipeline option --contextType=BUNDLE to create your application bundle and set the fusion type when submitting the bundle. See The BUNDLE Context for more information about --contextType=BUNDLE and Operator fusion in parallel regions FIX LINK for more information on fusion configuration.

Note that IBM Streams Runner for Apache Beam does not allow changing the degree of parallelism at submission time or at run time.

Tracing

Tracing can give information about the translation and execution of your IBM Streams Runner for Apache Beam application and can help debug problems, but tracing can have a serious performance impact.

If you set tracing levels other than the defaults with any of the --traceTranslation, --traceRuntime, or --tracingLevel pipeline options while you are developing and debugging your application, remove them or set them back to the defaults when deploying your application in production.

More documentation about tracing options and levels can be found in Streams Runner pipeline options.

IBM Streams performance

Once your IBM Streams Runner for Apache Beam application is running on an IBM Streams instance, it is subject to the same performance considerations as any other IBM Streams application. See the IBM Streams documentation for more information on performance. (TBD: LINK)