Toolkit Development Overview
Prerequisites
This toolkit uses Apache Ant 1.8 (or later) to build. Internally, Apache Maven 3.2 (or later) and Make are used.
Download and setup directions for Apache Maven can be found here: http://maven.apache.org/download.cgi#Installation
Set the M2_HOME environment variable to the path of the Maven home directory, for example:
export M2_HOME=/opt/apache-maven-3.5.0
export PATH=$PATH:$M2_HOME/bin
Build the toolkit
Run the following command in the streamsx.objectstorage directory:
ant all
Build the samples
Run the following command in the streamsx.objectstorage directory to build all projects in the samples directory:
ant build-all-samples
Toolkit Architecture Overview
The following diagram represents the toolkit's high-level architecture:
As the diagram shows, the toolkit is implemented on top of org.apache.hadoop.fs.FileSystem, which made it possible to reuse some of the HDFS Toolkit logic. However, because Object Storage behaves differently from HDFS in many respects, the current toolkit version contains a significant number of adaptations and optimizations that are unique to COS.
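The minimal sketch below illustrates why this matters; it is not the toolkit's actual code, and the method and object names are hypothetical. Because all writes go through the generic FileSystem API, the same write path works whether the underlying implementation is HDFS, hadoop-aws, or stocator.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemWriteSketch {
    // Writes one object through the generic FileSystem API. The same code runs
    // unchanged whether "fs" is backed by HDFS, hadoop-aws (s3a://) or
    // stocator (cos://), which is what allows HDFS Toolkit logic to be reused.
    public static void writeObject(FileSystem fs, String objectName, String payload) throws Exception {
        try (FSDataOutputStream out = fs.create(new Path(objectName))) {
            out.write(payload.getBytes(StandardCharsets.UTF_8));
        }
    }
}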
The diagram also shows the two different clients used by the toolkit:
hadoop-aws is an open-source S3 connector implementing org.apache.hadoop.fs.FileSystem. hadoop-aws has a unique and very powerful S3A Fast Upload feature that is heavily used by the ObjectStorageSink operator. The feature allows large objects to be uploaded as blocks whose size is set by fs.s3a.multipart.size, buffers the multipart blocks on the local disk, and uploads the multipart blocks in parallel in background threads.
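As an illustration only, the fast-upload behaviour is controlled through standard Hadoop configuration properties; the values below are hypothetical examples, not the settings the toolkit actually applies.

import org.apache.hadoop.conf.Configuration;

public class S3AFastUploadConfigSketch {
    // Builds a Hadoop Configuration with example S3A fast-upload settings.
    public static Configuration fastUploadConf() {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.fast.upload", "true");                    // enable fast upload (pre-Hadoop 3 releases)
        conf.set("fs.s3a.fast.upload.buffer", "disk");             // buffer multipart blocks on local disk
        conf.setLong("fs.s3a.multipart.size", 64L * 1024 * 1024);  // 64 MB multipart block size
        conf.setInt("fs.s3a.fast.upload.active.blocks", 4);        // blocks uploaded in parallel per stream
        return conf;
    }
}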
stocator is an open-source connector to object storage. In this toolkit the stocator connector interacts directly with object stores using connectors optimized for IBM COS.
Note that the parquet storage format is implemented with parquet-mr; more specifically, the parquet-hadoop module is used.
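For illustration, a minimal parquet-mr write path might look like the sketch below. It uses the parquet-avro binding on top of parquet-hadoop; the schema and target path are hypothetical, and the toolkit's actual parquet wiring may differ.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema with a single string field.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[{\"name\":\"msg\",\"type\":\"string\"}]}");

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("s3a://my-bucket/events.parquet")) // hypothetical target
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("msg", "hello parquet");
            writer.write(record);
        }
    }
}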
The toolkit user may easily switch from the hadoop-aws client to the stocator client by specifying the appropriate protocol in the objectStorageURI parameter of the ObjectStorageSink operator or in the protocol parameter of the S3ObjectStorageSink operator. Concretely, when the s3a protocol is specified, the toolkit uses the hadoop-aws client; when the cos protocol is specified, the toolkit uses the stocator client.
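In Hadoop configuration terms, the URI scheme is what selects the client, as sketched below. The stocator property names follow its public documentation and are shown here as an assumption about the wiring, not as the toolkit's exact code; the bucket names are hypothetical.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ClientSelectionSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // s3a:// URIs resolve to the hadoop-aws client (S3AFileSystem).
        conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

        // cos:// URIs resolve to the stocator client
        // (assumed stocator configuration keys, per its documentation).
        conf.set("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem");
        conf.set("fs.stocator.scheme.list", "cos");
        conf.set("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient");
        conf.set("fs.stocator.cos.scheme", "cos");

        // Hypothetical buckets; the URI scheme alone decides which client is instantiated.
        FileSystem s3aClient = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
        FileSystem stocatorClient = FileSystem.get(URI.create("cos://my-bucket.myservice/"), conf);
        System.out.println(s3aClient.getClass().getName());
        System.out.println(stocatorClient.getClass().getName());
    }
}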
Toolkit Class Diagram
The following class diagram represents the main toolkit classes and packages. As the diagram shows, there are three levels of abstraction in the toolkit implementation:
- AbstractObjectStorageOperator contains the logic that is common to all toolkit operators, such as establishing the COS connection according to the specific protocol (client).
- BaseObjectStorageScan/Source/Sink contains the implementation that is common to the generic and S3-specific Scan/Source/Sink operators.
- S3ObjectStorageXXX/ObjectStorageXXX contain the parameters specific to the S3 and generic operators, respectively. Note that this layer contains almost no business logic.
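A skeleton of this three-level hierarchy might look like the sketch below; only the class names come from the diagram, while the method names and comments are hypothetical placeholders.

// Hypothetical skeleton of the three abstraction levels; the methods are
// illustrative, not the toolkit's actual API.
abstract class AbstractObjectStorageOperator {
    // Level 1: logic common to all operators, e.g. establishing the COS
    // connection for the configured protocol (client).
    protected void connect(String objectStorageURI) { /* ... */ }
}

abstract class BaseObjectStorageSink extends AbstractObjectStorageOperator {
    // Level 2: sink logic shared by the generic and S3-specific operators,
    // e.g. rolling-policy handling and object writing.
    protected void writeTuple(Object tuple) { /* ... */ }
}

class ObjectStorageSink extends BaseObjectStorageSink {
    // Level 3 (generic flavor): parameter handling only, e.g. objectStorageURI.
}

class S3ObjectStorageSink extends BaseObjectStorageSink {
    // Level 3 (S3 flavor): S3-specific parameters only, e.g. protocol and bucket.
}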
Sink Operators Implementation Details
The following diagram represents the architecture of the ObjectStorageSink and S3ObjectStorageSink operators:
Note that the operator implementation is based on EHCache3, which is used for asynchronous rolling-policy management of the active objects. At any given point in time, EHCache contains a map with the partition as the key and the corresponding OSWritableObject as the value. See the OSObjectRegistry class for the concrete details of how the operator uses EHCache.
For each tuple, the operator's main thread writes the tuple to the OutputStream managed by the specific client (hadoop-aws or stocator). When the rolling policy expires (by time, data size, or tuple count), the main thread creates a new object for the upcoming tuples, the old object is removed from EHCache and closed on a separate thread, and the new object is inserted into EHCache.
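A minimal sketch of this pattern with EHCache3 is shown below; the cache name, sizes, expiry value, and the OSWritableObject placeholder are assumptions for illustration and do not reproduce the actual OSObjectRegistry code.

import java.time.Duration;
import org.ehcache.Cache;
import org.ehcache.CacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ExpiryPolicyBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;

public class OSObjectRegistrySketch {

    // Placeholder standing in for the toolkit's OSWritableObject.
    static class OSWritableObject {
        final String objectName;
        OSWritableObject(String objectName) { this.objectName = objectName; }
        void close() { /* complete the upload / close the OutputStream */ }
    }

    public static void main(String[] args) {
        // Cache keyed by partition, holding the currently active object per partition.
        // Time-to-idle expiry models a time-based rolling policy (value is hypothetical).
        CacheManager cacheManager = CacheManagerBuilder.newCacheManagerBuilder()
            .withCache("activeObjects",
                CacheConfigurationBuilder
                    .newCacheConfigurationBuilder(String.class, OSWritableObject.class,
                        ResourcePoolsBuilder.heap(1000))
                    .withExpiry(ExpiryPolicyBuilder.timeToIdleExpiration(Duration.ofSeconds(10))))
            .build(true);

        Cache<String, OSWritableObject> registry =
            cacheManager.getCache("activeObjects", String.class, OSWritableObject.class);

        // Main thread: look up the active object for the tuple's partition and write to it.
        String partition = "partition-0";                       // hypothetical partition key
        registry.put(partition, new OSWritableObject("object_0"));

        // On rolling-policy expiration the real operator removes the old object
        // (closing it on a separate thread via a cache event listener) and
        // registers a new one for upcoming tuples.
        OSWritableObject old = registry.get(partition);
        registry.remove(partition);
        old.close();
        registry.put(partition, new OSWritableObject("object_1"));

        cacheManager.close();
    }
}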
Toolkit Java-based Tests
The current Java-based test suite covers the ObjectStorageSink operator only. However, it is worth mentioning that the test suite can easily be extended with additional tests for the ObjectStorageSink and other operators.
For more details about the tests and the functionality they cover, see the Object Storage Toolkit Java-based Tests.
ObjectStorageSink Tests Class Diagrams
ObjectStorageSink tests can be divided into two groups by storage format:
- Tests for the raw storage format. The following class diagram represents the classes involved in raw storage format testing. BaseObjectStorageTestSink is the base class for all Sink operator tests. Each specific test implements or overrides only the methods specific to it.
- Tests for the parquet storage format. The following class diagram represents the classes involved in parquet storage format testing. BaseObjectStorageTestSink is the base class for all Sink operator tests. Each specific test implements or overrides only the methods specific to it.
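As a hedged illustration of this pattern, a new test might look like the sketch below; only BaseObjectStorageTestSink comes from the toolkit, and the method shown is a hypothetical placeholder rather than the real base-class API.

// Hypothetical sketch of a new Sink test extending BaseObjectStorageTestSink.
public class TestObjectStorageSinkCustomRollingPolicy extends BaseObjectStorageTestSink {

    // Implement or override only what is specific to this test, for example the
    // operator parameterization (hypothetical method name).
    protected void configureSinkParams() {
        // e.g. set tuplesPerObject / timePerObject to exercise the rolling policy
    }
}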