Toolkit Development overview

Edit me

Prerequisite

This toolkit uses Apache Ant 1.8 (or later) to build.

Internally Apache Maven 3.2 (or later) and Make are used.

Download and setup directions for Apache Maven can be found here: http://maven.apache.org/download.cgi#Installation

Set environment variable M2_HOME to the path of maven home directory.

export M2_HOME=/opt/apache-maven-3.5.0
export PATH=$PATH:$M2_HOME/bin

Build the toolkit

Run the following command in the streamsx.objectstorage directory:

ant all

Build the samples

Run the following command in the streamsx.objectstorage directory to build all projects in the samples directory:

ant build-all-samples

Toolkit Architecture Overview

Following diagram represents the toolkit high-level architecture: Import As might be seen, the toolkit implemented on top of org.apache.hadoop.fs.FileSystem. This allowed to inherit some of the HDSF Toolkit logic. Yet, as in many aspects Object Storage behaves differently from HDFS, the current toolkit version contains significant number of adoptions and optimizations which are unique for COS.

Also, the diagram represents two different clients which utilized by the toolkit:

hadoop-aws is an open-source S3 connector implementing org.apache.hadoop.fs.FileSystem. hadoop-aws has a unique and very powerful S3A Fast Upload feature heavily utilized by the ObjectStorageSink operator. The feature allows uploading of large files as blocks with the size set by fs.s3a.multipart.size, buffering of object multipart blocks to the local disk as well as upload the mutlipart blocks in parallel in background thread.

stocator is an open-source connector to object storage. In this toolkit the stocator connector interacts directly with object stores using connectors optimized for IBM COS.

Note, that parquet storage format is implemented with parquet-mr. To be more specific, the parquet-hadoop module is used.

The toolkit user may easily switch from hadoop-aws to stocator client by specifying appropriate protocol in the objectStorageURI parameter of the ObjectStorageSink operator or in the protocol parameter of the S3ObjectStorageSink operator. Concretely, when s3a protocol is specified the toolkit uses hadoop-aws client. When cos protocol is specified the toolkit uses stocator client.

Toolkit Class Diagram

The following class diagram represents main toolkit classes and packages: Import As might be seen from the diagram above there are three level of abstraction in the toolkit implementation:

  • AbstractObjectStorageOperator class contains the logic that is common for all toolkit operators, such as COS connection establishment according to the specific protocol (client).
  • BaseObjectStorageScan\Source\Sink class contains the implementation which is common for the generic and S3-specific Scan\Source\Sink operators.
  • S3ObjectStorageXXX\ObjectStorageXXX class contains parameters specific for S3 and generic operators respectively. Note, that this layer almost doesn’t contain business logic.

Sink Operators Implementation Details

The following diagram represents the ObjectStorageSink and S3ObjectStorageSink operators architecture:

Import

Note, the operator implementation is based on EHCache3. EHCache3 is used for the active objects asynchronous rolling policy management. At any given point of time, EHCache contains map with partition as a key and corresponding OSWritableObject as a value. See OSObjectRegistry class for concrete details related to EHCache utilization by the operator.

For each tuple the operator main thread is used to write tuples to the to the OutputStream managed by the specific (hadoop-aws or stocator) client. On the rolling policy expiration (either by time, data size or tuple count), the new object has been created by the main thread for the upcoming tuples, the old object has been removed from the EHCache and closed synchronously on a separate thread, and the new object has been inserted into EHCache.

Toolkit Java-based Tests

Current java-based test suite covers ObjectStorageSink operator only. Yet, its important to mention, that the test suite might be easily extended with additional tests for the ObjectStorageSink and other operators. For more details about the tests and functionality they cover see the Object Storage Toolkit Java-based Tests

ObjectStorageSink Tests Class Diagrams

ObjectStorageSink tests can be divided to the two groups by the storage format:

  • tests for the raw storage format The following class diagram represents classes involved in the raw storage format testing. Import BaseObjectStorageTestSink is the base class for all Sink operator tests. Each specific test implements/overwrites methods specific for it only.

  • tests for the parquet storage format The following class diagram represents classes involved in the parquet storage format testing. Import BaseObjectStorageTestSink is the base class for all Sink operator tests. Each specific test implements/overwrites methods specific for it only.

Updated: