Supported Storage Formats

The operator supports two storage formats:

parquet - the output object is generated in the parquet format
raw - the output object is generated in the raw format

The storage format is configured with the storageFormat parameter, which accepts two values: parquet and raw.

Parquet Storage Format

The Parquet output schema is derived from the tuple structure. Note that the parquet format is supported only for tuples with a flat SPL schema.

The following table summarizes the mapping of primitive SPL types to Parquet types:

SPL Type	Parquet Type
BOOLEAN	boolean
INT8, UINT8, INT16, UINT16, INT32, UINT32	int32
INT64, UINT64	int64
FLOAT32	float
FLOAT64	double
RSTRING, USTRING, BLOB	binary
TIMESTAMP	int64
ALL OTHER SPL PRIMITIVE TYPES	binary
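
The lookup described by the table above can be sketched in Python as follows. This is only an illustration of the mapping, not toolkit code: the names SPL_TO_PARQUET, parquet_type_for and parquet_columns are hypothetical, and representing a flat schema as a dictionary of attribute name to SPL type name is an assumption made for the example.

```python
# Illustrative lookup mirroring the primitive type table; not toolkit code.
SPL_TO_PARQUET = {
    "BOOLEAN": "boolean",
    "INT8": "int32", "UINT8": "int32",
    "INT16": "int32", "UINT16": "int32",
    "INT32": "int32", "UINT32": "int32",
    "INT64": "int64", "UINT64": "int64",
    "FLOAT32": "float",
    "FLOAT64": "double",
    "RSTRING": "binary", "USTRING": "binary", "BLOB": "binary",
    "TIMESTAMP": "int64",
}


def parquet_type_for(spl_type: str) -> str:
    """Return the Parquet type for a primitive SPL type name.

    Any primitive type that is not listed in the table maps to binary.
    """
    return SPL_TO_PARQUET.get(spl_type.upper(), "binary")


def parquet_columns(schema: dict) -> dict:
    """Map a flat schema (attribute name -> SPL type name) to Parquet types."""
    return {name: parquet_type_for(spl_type) for name, spl_type in schema.items()}
```

For example, parquet_columns({"id": "uint32", "ts": "timestamp", "name": "rstring"}) yields int32, int64 and binary for the three attributes.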

The following table summarizes the mapping of SPL collection types to Parquet types:

SPL Type	Parquet Type
LIST, SET	optional group my_list (LIST) (repeated group of list/set elements)
MAP	repeated group of key/value

Parameters relevant for the parquet storage format:

nullPartitionDefaultValue - Specifies the default value for partitions with null values.
parquetBlockSize - Specifies the block size which is the size of a row group being buffered in memory. The default is 128M.
parquetCompression - Enum specifying the compression type for the parquet storage format. Supported compression types are 'UNCOMPRESSED', 'SNAPPY' and 'GZIP'.
parquetDictPageSize - There is one dictionary page per column per row group when dictionary encoding is used. The dictionary page size works like the page size, but for the dictionary.
parquetEnableDict - Specifies whether the parquet dictionary should be enabled.
parquetEnableSchemaValidation - Specifies whether schema validation should be enabled.
parquetPageSize - Specifies the page size used for compression. A block is composed of pages. The page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. The default is 1M.
parquetWriterVersion - Specifies the parquet writer version. Supported versions are 1.0 and 2.0.
skipPartitionAttributes - Avoids writing the attributes used as partition columns to the data files.
partitionValueAttributes - Specifies the list of attributes to be used for partition column values. It is strongly recommended not to use attributes that take continuous values within a rolling policy unit of measure, because this degrades operator performance. The following examples demonstrate recommended and non-recommended partitioning approaches. Recommended: /YEAR=YYYY/MONTH=MM/DAY=DD/HOUR=HH Non-recommended: /latitude=DD.DDDD/longitude=DD.DDDD/
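
The recommended layout can be illustrated with a short Python sketch that derives time-based partition values from a timestamp and joins them into a NAME=value path. The sketch rests on assumptions: the helper names are hypothetical, the operator's actual path construction is not shown in this document, and "UNKNOWN" is only a stand-in for whatever value is configured through nullPartitionDefaultValue.

```python
from datetime import datetime

# Hypothetical stand-in for the value configured via nullPartitionDefaultValue.
NULL_PARTITION_DEFAULT = "UNKNOWN"


def time_partitions(ts: datetime) -> dict:
    """Derive low-cardinality partition values from a timestamp."""
    return {
        "YEAR": f"{ts.year:04d}",
        "MONTH": f"{ts.month:02d}",
        "DAY": f"{ts.day:02d}",
        "HOUR": f"{ts.hour:02d}",
    }


def partition_path(values: dict) -> str:
    """Join partition attributes, in order, into a /NAME=value/... path."""
    parts = []
    for name, value in values.items():
        if value is None:
            # A null partition value is replaced by the configured default.
            value = NULL_PARTITION_DEFAULT
        parts.append(f"{name}={value}")
    return "/" + "/".join(parts)
```

With these helpers, partition_path(time_partitions(datetime(2019, 3, 7, 14, 5))) gives /YEAR=2019/MONTH=03/DAY=07/HOUR=14, so all tuples that arrive within the same hour share one partition.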

Parquet storage format - preferred practices for partitions design

Think about what kind of queries you will need. For example, you might need to build monthly reports or sales by product line.
Do not partition on an attribute with high cardinality per rolling policy window, because you end up with too many simultaneously active partitions. Reducing the number of simultaneously active partitions can greatly improve performance and reduce the operator's resource consumption.
Do not partition on an attribute with high cardinality per rolling policy window, because you end up with many small objects.
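
One way to check a partition design against these practices is to count the distinct partition value combinations in a sample of tuples from one rolling policy window. The Python sketch below does this. It is a rough estimate under stated assumptions: the function names are hypothetical, tuples are modelled as dictionaries, each active partition is assumed to buffer up to one row group of parquetBlockSize, and 128M is read as 128 MiB.

```python
def active_partitions(tuples, partition_attrs):
    """Count distinct partition-value combinations among the given tuples."""
    return len({tuple(t[a] for a in partition_attrs) for t in tuples})


def buffered_bytes_upper_bound(n_partitions, block_size=128 * 1024 * 1024):
    """Rough upper bound on buffered bytes, assuming one row group per partition."""
    return n_partitions * block_size


# Sample window: 1000 tuples from the same hour, each with a distinct latitude.
window = [
    {"HOUR": "14", "latitude": f"40.{i:04d}", "longitude": "-74.0000"}
    for i in range(1000)
]

by_hour = active_partitions(window, ["HOUR"])
by_position = active_partitions(window, ["latitude", "longitude"])
```

In this sample, partitioning by HOUR keeps a single partition active, while partitioning by latitude and longitude opens one partition per tuple, which also means one small object per tuple.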

Raw Storage Format

If the input tuple schema for the raw storage format has more than one attribute, the operator expects the dataAttribute parameter to be specified. The attribute specified as the dataAttribute value must be of rstring or blob type.

Parameters relevant for the raw storage format:

dataAttribute - Required when the input tuple has more than one attribute and the storage format is set to raw. Specifies the name of the attribute whose content is written to the output object. The attribute must be of rstring or blob SPL type.
objectNameAttribute - If set, it points to the attribute containing an object name. The operator closes the current object when the value of this attribute changes and opens a new object with the updated name.
encoding - Specifies the character encoding that is used in the output object.
headerRow - If specified, a header line with the parameter content is generated in each output object.
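
How these parameters interact can be sketched in Python. This is a simplified model, not the operator's implementation: the function name raw_objects is hypothetical, tuples are modelled as dictionaries, the header row is assumed to be followed by a newline, and the data attribute content is assumed to be written as-is without added separators.

```python
from itertools import groupby


def raw_objects(tuples, data_attribute, object_name_attribute,
                header_row=None, encoding="UTF-8"):
    """Yield (object_name, content_bytes) pairs for a stream of tuples.

    A new object starts whenever the object name attribute changes value.
    """
    for name, group in groupby(tuples, key=lambda t: t[object_name_attribute]):
        chunks = []
        if header_row is not None:
            # Assumption: the header line is terminated with a newline.
            chunks.append((header_row + "\n").encode(encoding))
        for t in group:
            data = t[data_attribute]
            # Text content is encoded; binary content is written unchanged.
            chunks.append(data if isinstance(data, bytes) else data.encode(encoding))
        yield name, b"".join(chunks)
```

Because groupby only groups consecutive tuples, a name that reappears after a different name starts a new object, which matches the close-and-reopen behaviour described for objectNameAttribute.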