Operator S3ObjectStorageSink

IBMStreams com.ibm.streamsx.objectstorage Toolkit > com.ibm.streamsx.objectstorage 2.2.5 > com.ibm.streamsx.objectstorage.s3 > S3ObjectStorageSink

Operator writes objects to S3 compliant object storage.

This operator writes tuples that arrive on its input port to the output object that is named by the objectName parameter. You can optionally control whether the operator closes the current output object and creates a new object for writing based on the sizeof the object in bytes, the number of tuples that are written to the object, or the time in seconds that the object is open for writing, or when the operator receives a punctuation marker.

Behavior in a consistent region
Buffering mechanism
Supported Storage Formats
Rolling Policy
Examples

Summary

Ports
This operator has 1 input port and 1 output port.
Windowing
This operator does not accept any windowing configurations.
Parameters
This operator supports 32 parameters.

Required: accessKeyID, bucket, secretAccessKey

Optional: bytesPerObject, closeOnPunct, dataAttribute, encoding, endpoint, headerRow, maxAttempts, nullPartitionDefaultValue, objectName, objectNameAttribute, parquetBlockSize, parquetCompression, parquetDictPageSize, parquetEnableDict, parquetEnableSchemaValidation, parquetPageSize, parquetWriterVersion, partitionValueAttributes, protocol, s3aFastUploadActiveBlocks, s3aFastUploadBuffer, s3aMultipartSize, skipPartitionAttributes, sslEnabled, storageFormat, timeFormat, timePerObject, tuplesPerObject, uploadWorkersNum

Metrics
This operator reports 12 metrics.

Properties

Implementation
Java

Input Ports

Ports (0)

The S3ObjectStorageSink operator has one input port, which writes the contents of the input stream to the object that you specified. The S3ObjectStorageSink supports writing data into object storage in two formats. For line format, the schema of the input port is tuple<rstring line>, which specifies a single rstring attribute that represents a line to be written to the object. For binary format, the schema of the input port is tuple<blob data>, which specifies a block of data to be written to the object.

Properties

Output Ports

Assignments
Java operators do not support output assignments.
Ports (0)

The S3ObjectStorageSink operator is configurable with an optional output port. The schema of the output port is <rstring objectName, uint64 objectSize>, which specifies the name and size of objects that are written to object storage. Note, that the tuple is generated on the object upload completion.

Properties

Parameters

This operator supports 32 parameters.

Required: accessKeyID, bucket, secretAccessKey

Optional: bytesPerObject, closeOnPunct, dataAttribute, encoding, endpoint, headerRow, maxAttempts, nullPartitionDefaultValue, objectName, objectNameAttribute, parquetBlockSize, parquetCompression, parquetDictPageSize, parquetEnableDict, parquetEnableSchemaValidation, parquetPageSize, parquetWriterVersion, partitionValueAttributes, protocol, s3aFastUploadActiveBlocks, s3aFastUploadBuffer, s3aMultipartSize, skipPartitionAttributes, sslEnabled, storageFormat, timeFormat, timePerObject, tuplesPerObject, uploadWorkersNum

accessKeyID

Specifies the Access Key ID for S3 account.

Properties
bucket

Specifies a bucket to use for writing objects. The bucket must exist. The operator does not create a bucket.

Properties
bytesPerObject

Specifies the approximate size of the output object, in bytes. When the object size exceeds the specified number of bytes, the current output object is closed and a new object is opened.

Properties
closeOnPunct

Specifies whether the operator closes the current output object and creates a new object when a punctuation marker is received. The default value is true if parameters timePerObject, tuplesPerObject and bytesPerObject are not set.

Properties
dataAttribute

The name of the attribute containing the data to be written to the object storage.

Properties
encoding

Specifies the character set encoding that is used in the output object.

Properties
endpoint

Specifies endpoint for connection to object storage. For example, for S3 the endpoint might be 's3.amazonaws.com'. The default value is the IBM Cloud Object Storage (COS) public endpoint 's3.us.cloud-object-storage.appdomain.cloud'.

Properties
headerRow

Specifies if the operator should add header row when starting to write object. By default no header row generated.

Properties
maxAttempts

Number of times we should retry errors. Default value is 20.

Properties
nullPartitionDefaultValue

Specifies default for partitions with null values.

Properties
objectName

Specifies the name of the object that the operator writes to.

Properties
objectNameAttribute

The name of the attribute containing the object name.

Properties
parquetBlockSize

Specifies the block size which is the size of a row group being buffered in memory. The default is 128M.

Properties
parquetCompression

Enum specifying support compressions for parquet storage format. Supported compression types are 'UNCOMPRESSED','SNAPPY','GZIP'

Properties
parquetDictPageSize

There is one dictionary page per column per row group when dictionary encoding is used. The dictionary page size works like the page size but for dictionary.

Properties
parquetEnableDict

Specifies if parquet dictionary should be enabled.

Properties
parquetEnableSchemaValidation

Specifies of schema validation should be enabled.

Properties
parquetPageSize

Specifies the page size is for compression. A block is composed of pages. The page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. The default is 1M.

Properties
parquetWriterVersion

Specifies parquet writer version.

Properties
partitionValueAttributes

Specifies the list of attributes to be used for partition column values.

Properties
protocol

Specifies the protocol to use for communication with object storage. Supported values are s3a and cos. The default value is s3a.

Properties
s3aFastUploadActiveBlocks

The parameter is valid for protocol s3a only. Defines the maximum number of blocks a single output stream can have active uploading. If not set, then the default value 8 is used.

Properties
s3aFastUploadBuffer

The parameter is valid for protocol s3a only. The parameter determines the buffering mechanism to use for s3a multipart upload. Allowed values are: disk, array, bytebuffer (default): "disk" will use local file system directories as the location(s) to save data prior to being uploaded. "array" uses arrays in the JVM heap. "bytebuffer" uses off-heap memory within the JVM. Both "array" and "bytebuffer" will consume memory in a single stream up to the number of blocks set by: s3aMultipartSize * s3aFastUploadActiveBlocks. If using either of these mechanisms, keep this value low.

Properties
s3aMultipartSize

The parameter is valid for protocol s3a only. Defines the size (in bytes) of the chunks into which the upload will be split up. If not set, then the default value 5242880 is used.

Properties
secretAccessKey

Specifies the Secret Access Key for S3 account.

Properties
skipPartitionAttributes

Avoids writing of attributes used as partition columns in data files. Default value is false.

Properties
sslEnabled

Enables or disables SSL connections to S3, default is true.

Properties
storageFormat

Specifies storage format operator uses. The default is raw, i.e. the data is stored in the same format as received.

Properties
timeFormat

Specifies the time format to use when the objectName parameter value contains %TIME. The parameter value must contain conversion specifications that are supported by the java.text.SimpleDateFormat. The default format is yyyyMMdd_HHmmss.

Properties
timePerObject

Specifies the approximate time, in seconds, after which the current output object is closed and a new object is opened for writing.

Properties
tuplesPerObject

Specifies the maximum number of tuples that can be received for each output object. When the specified number of tuples are received, the current output object is closed and a new object is opened for writing.

Properties
uploadWorkersNum

Specifies number of upload workers.

Properties

Metrics

cachedData - Counter

Data stored in cache in bytes.

cachedDataMax - Gauge

Maximal size of data stored in cache in bytes.

closeTimeAvg - Time

Average time for closing objects on COS in milliseconds.

closeTimeMax - Time

Maximal duration for closing an object on COS in milliseconds.

closeTimeMin - Time

Minimal duration for closing an object on COS in milliseconds.

nActiveObjects - Counter

Number of active (open) objects

nCloseFailures - Counter

Number of close failures

nClosedObjects - Counter

Number of closed objects

nWriteFailures - Counter

Number of failures during create object or write to object

objectSizeMax - Gauge

Maximal size of an object uploaded to COS in bytes.

objectSizeMin - Gauge

Minimal size of an object uploaded to COS in bytes.

startupTimeMillisecs - Time

Operator startup time in milliseconds

Libraries

Operator class library
Library Path: ../../impl/lib/com.ibm.streamsx.objectstorage.jar, ../../opt/*, ../../opt/downloaded/*