Supported Storage Formats

The S3ObjectStorageSink operator supports two storage formats:

  • parquet - the output object is written in Parquet format
  • raw - the output object is written in raw format

The storage format is selected with the storageFormat parameter, which accepts two values: parquet and raw.

Parquet Storage Format

The Parquet output schema is derived from the tuple structure. Note that the Parquet format is supported only for tuples with a flat SPL schema.

The following table summarizes the mapping from primitive SPL types to Parquet types:

SPL Type                                    Parquet Type
------------------------------------------  ------------
BOOLEAN                                     boolean
INT8, UINT8, INT16, UINT16, INT32, UINT32   int32
INT64, UINT64                               int64
FLOAT32                                     float
FLOAT64                                     double
RSTRING, USTRING, BLOB                      binary
TIMESTAMP                                   int64
All other SPL primitive types               binary
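
The primitive mapping can be read as a simple lookup with a fallback. The sketch below only transcribes the table into Python for illustration; the names SPL_TO_PARQUET and parquet_type_for are hypothetical and are not part of the toolkit.

```python
# Lookup table transcribed from the primitive-type mapping table.
SPL_TO_PARQUET = {
    "BOOLEAN": "boolean",
    "INT8": "int32", "UINT8": "int32",
    "INT16": "int32", "UINT16": "int32",
    "INT32": "int32", "UINT32": "int32",
    "INT64": "int64", "UINT64": "int64",
    "FLOAT32": "float",
    "FLOAT64": "double",
    "RSTRING": "binary", "USTRING": "binary", "BLOB": "binary",
    "TIMESTAMP": "int64",
}


def parquet_type_for(spl_type: str) -> str:
    """Return the Parquet type for a primitive SPL type name (hypothetical helper)."""
    # Any primitive type not listed explicitly falls back to binary,
    # matching the last row of the table.
    return SPL_TO_PARQUET.get(spl_type.upper(), "binary")
```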

The following table summarizes the mapping from SPL collection types to Parquet types:

SPL Type    Parquet Type
----------  -----------------------------------------------------------------------
LIST, SET   optional group my_list (LIST) (repeated group of list/set elements)
MAP         repeated group of key/value

Parameters relevant for parquet storage format

  • nullPartitionDefaultValue - Specifies the default value for partitions with null values.
  • parquetBlockSize - Specifies the block size which is the size of a row group being buffered in memory. The default is 128M.

  • parquetCompression - Enum specifying the compression type for the Parquet storage format. Supported compression types are 'UNCOMPRESSED', 'SNAPPY', and 'GZIP'.

  • parquetDictPageSize - Specifies the dictionary page size. There is one dictionary page per column per row group when dictionary encoding is used. The dictionary page size works like the page size, but for the dictionary.

  • parquetEnableDict - Specifies whether Parquet dictionary encoding should be enabled.

  • parquetEnableSchemaValidation - Specifies whether schema validation should be enabled.

  • parquetPageSize - Specifies the page size used for compression. A block is composed of pages. The page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. The default is 1M.

  • parquetWriterVersion - Specifies the Parquet writer version. Supported versions are 1.0 and 2.0.

  • skipPartitionAttributes - If enabled, the attributes used as partition columns are not written to the data files.

  • partitionValueAttributes - Specifies the list of attributes to be used for partition column values. Note that it is strongly recommended not to use attributes with continuous values per rolling policy unit of measure, to avoid operator performance degradation. The following examples demonstrate recommended and non-recommended partitioning approaches. Recommended: /YEAR=YYYY/MONTH=MM/DAY=DD/HOUR=HH. Non-recommended: /latitude=DD.DDDD/longitude=DD.DDDD/
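
To make the two partitioning patterns concrete, the following sketch formats both path shapes from sample values. It is illustrative only: the helper names are hypothetical, and it assumes the operator lays out partitions exactly as in the example patterns given for partitionValueAttributes.

```python
from datetime import datetime


def hourly_partition_path(ts: datetime) -> str:
    """Recommended shape: one partition per hour (hypothetical helper)."""
    return f"/YEAR={ts:%Y}/MONTH={ts:%m}/DAY={ts:%d}/HOUR={ts:%H}"


def coordinate_partition_path(latitude: float, longitude: float) -> str:
    """Non-recommended shape: nearly every tuple gets its own partition."""
    return f"/latitude={latitude:.4f}/longitude={longitude:.4f}/"
```

With the hourly pattern, all tuples that arrive within the same hour share one path, whereas the coordinate pattern produces a new path for almost every distinct reading.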

Parquet storage format - preferred practices for partitions design

  1. Think about what kind of queries you will need. For example, you might need to build monthly reports or sales by product line.
  2. Do not partition on an attribute whose cardinality per rolling policy window is so high that you end up with too many simultaneously active partitions. Reducing the number of simultaneously active partitions can greatly improve performance and reduce the operator's resource consumption.

  3. Do not partition on an attribute whose cardinality per rolling policy window is so high that you end up with many small objects.
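
One rough way to check a candidate partitioning scheme against practices 2 and 3 is to count the distinct partition values in a representative sample of tuples from one rolling policy window. The sketch below assumes tuples are represented as Python dictionaries; active_partitions and the sample data are hypothetical and only illustrate the cardinality argument.

```python
def active_partitions(tuples, partition_attrs):
    """Count distinct partition-value combinations in one window of tuples."""
    return len({tuple(t[a] for a in partition_attrs) for t in tuples})


# Sample window: 1000 tuples from the same hour, each with a distinct location.
window = [
    {
        "year": 2024, "month": 3, "day": 7, "hour": 9,
        "latitude": f"40.{i:04d}", "longitude": f"-74.{i:04d}",
    }
    for i in range(1000)
]

by_time = active_partitions(window, ["year", "month", "day", "hour"])
by_location = active_partitions(window, ["latitude", "longitude"])
```

For this sample, partitioning by time keeps a single partition active, while partitioning by location opens one partition per tuple, which also means each resulting object holds a single record.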

Raw Storage Format

If the input tuple schema for the raw storage format has more than one attribute, the operator expects the dataAttribute parameter to be specified. The attribute named by dataAttribute must be of type rstring or blob.

Parameters relevant for the raw storage format:

  • dataAttribute - Mandatory when the input tuple has more than one attribute and the storage format is set to raw. Specifies the name of the attribute whose content is written to the output object. The attribute must have the rstring or blob SPL type.
  • objectNameAttribute - If set, it points to the attribute containing an object name. The operator closes the current object when the value of this attribute changes and opens a new object with the updated name.
  • encoding - Specifies the character encoding that is used in the output object.

  • headerRow - If specified, a header line with the parameter's content is generated in each output object.
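
The interaction of dataAttribute, objectNameAttribute, and headerRow can be pictured with a small simulation. This is a sketch of the described semantics under stated assumptions, not the operator's implementation: tuples are modeled as Python dictionaries, split_into_objects is a hypothetical helper, and the behavior when an earlier object name reappears later in the stream is not covered by the description and is not modeled here.

```python
from itertools import groupby


def split_into_objects(tuples, data_attr, name_attr, header_row=None):
    """Yield (object_name, lines) for each run of consecutive tuples
    that share the same value of the object name attribute."""
    for name, run in groupby(tuples, key=lambda t: t[name_attr]):
        # Each new object starts with the header line, if one is configured.
        lines = [] if header_row is None else [header_row]
        # Only the content of the data attribute is written to the object.
        lines.extend(t[data_attr] for t in run)
        yield name, lines
```

For example, three tuples whose name attribute holds "a.csv", "a.csv", "b.csv" would produce two objects, each beginning with the header line when header_row is given.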