Clustering with Pentaho Data Integration

The basics

The basics of clustering in Pentaho Data Integration are very simple. The user configures one or more steps as being clustered:

TODO: Insert screenshot of simple clustering transformation.

In other words, clustering information is data integration metadata, just like the rest of the transformation. Nothing is done with it until the transformation is executed. The most important part of this metadata is the definition of the slave server.

Slave servers

A slave server is essentially a small embedded web server. We use this web server to control the slave server. The slave server is started with the Carte program found in your PDI distribution.

The arguments used to start a slave server with Carte are discussed elsewhere, but the minimum is always an IP address or hostname and an HTTP port to listen on. For example:

sh carte.sh localhost 8080
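
Because a slave server is a web server, one way to confirm that Carte has started is to request its status page over HTTP. The sketch below is only an illustration: it assumes the status page is served under /kettle/status/ and that the instance still uses the default cluster/cluster credentials. Both are assumptions to verify against your own installation, and the function names are made up for this example.

```python
import base64
import urllib.request


def status_url(hostname: str, port: int) -> str:
    """Build the URL of the Carte status page (the path is an assumption)."""
    return f"http://{hostname}:{port}/kettle/status/?xml=Y"


def fetch_status(hostname: str, port: int,
                 user: str = "cluster", password: str = "cluster") -> str:
    """Return the raw status document, using HTTP basic authentication.

    The default user and password are assumed defaults; replace them
    with whatever your Carte instance is configured to accept.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        status_url(hostname, port),
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read().decode("utf-8")
```

With the Carte instance from the example above running, calling fetch_status("localhost", 8080) should return the status document, or raise an error if the server cannot be reached.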

The slave server metadata can be entered in the Spoon GUI: give your Carte instance a name and enter the same hostname (or IP address) and port that you used to start it.
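
To make the relationship between the Carte command line and the metadata concrete, here is a minimal sketch of a slave server definition. The SlaveServer class and its field names are hypothetical and exist only for this illustration; they are not the classes PDI itself uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaveServer:
    """Hypothetical stand-in for the slave server metadata entered in Spoon."""
    name: str      # free-form name given to the Carte instance
    hostname: str  # same hostname or IP address passed to carte.sh
    port: int      # same HTTP port passed to carte.sh

    def carte_command(self) -> str:
        """Command line that would start the matching Carte instance."""
        return f"sh carte.sh {self.hostname} {self.port}"


local = SlaveServer(name="local-8080", hostname="localhost", port=8080)
print(local.carte_command())  # prints: sh carte.sh localhost 8080
```

The point of the sketch is that the metadata carries nothing beyond a name plus the values already given to Carte.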

Cluster schema

A cluster schema is essentially a collection of slave servers. In each collection you need to pick at least one slave server that we will call the master slave server, or master. The master is also just a Carte instance, but it takes care of all sorts of management tasks across the cluster schema.

In the Spoon GUI you can enter this metadata as well once you have started a couple of slave servers.
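
The rule that every cluster schema needs at least one master can be expressed as a small validation check. This is a sketch under stated assumptions: ClusterSchema, SlaveServer, the master flag and the port numbers are hypothetical names and values chosen for the example, not PDI's own API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SlaveServer:
    """Hypothetical slave server entry; 'master' marks the master instance."""
    name: str
    hostname: str
    port: int
    master: bool = False


@dataclass
class ClusterSchema:
    """Hypothetical cluster schema: a named collection of slave servers."""
    name: str
    slave_servers: List[SlaveServer] = field(default_factory=list)

    def masters(self) -> List[SlaveServer]:
        """Return the servers flagged as master."""
        return [s for s in self.slave_servers if s.master]

    def validate(self) -> None:
        """Raise ValueError unless at least one server is marked as master."""
        if not self.masters():
            raise ValueError(f"cluster schema '{self.name}' has no master")


schema = ClusterSchema("demo", [
    SlaveServer("master", "localhost", 8080, master=True),
    SlaveServer("slave1", "localhost", 8081),
    SlaveServer("slave2", "localhost", 8082),
])
schema.validate()  # passes: one server is flagged as master
```

Each entry corresponds to one running Carte instance, which is why the slave servers have to be started before the schema is of any use.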

Execution process

What happens when you execute a clustered transformation is this: all the steps that have clustering configured will run on the configured slave servers. The steps that don't have any clustering metadata attached to them run on the single master slave server. The following then happens:

  1. The user's transformation is split up into different parts: 1 part becomes a "master transformation" and N parts become "slave transformations".
  2. The master and slave transformations are sent in XML format down to the slave servers (Carte instances) over HTTP.
  3. All the master and slave transformations are initialized (server socket ports are opened, as well as files) across the cluster.
  4. All the master and slave transformations are started.
  5. The parent application (Spoon, Kitchen or the Transformation job entry) monitors the master and slave transformations until they have finished.
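
The steps above can be sketched as a toy model. It is not how PDI implements the split: the function names and step names are invented for illustration, and the sketch assumes that every slave receives its own copy of the clustered steps. The data exchange between master and slaves over sockets is not modelled at all.

```python
from typing import Dict, List, Tuple

# Each step is (name, clustered?). All names are made up for illustration.
Step = Tuple[str, bool]


def split_transformation(
    steps: List[Step], slaves: List[str]
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return (master_steps, {slave_name: slave_steps})."""
    master_steps = [name for name, clustered in steps if not clustered]
    clustered_steps = [name for name, clustered in steps if clustered]
    # Assumption: every slave gets its own copy of the clustered steps.
    slave_parts = {slave: list(clustered_steps) for slave in slaves}
    return master_steps, slave_parts


def execution_plan(steps: List[Step], master: str, slaves: List[str]) -> List[str]:
    """List the phases in order: split, send, initialize, start, monitor."""
    _, slave_parts = split_transformation(steps, slaves)
    servers = [master] + list(slave_parts)
    plan = [f"split: 1 master transformation, {len(slave_parts)} slave transformations"]
    # Each phase is completed on every server before the next phase begins.
    for phase in ("send XML over HTTP to", "initialize on", "start on"):
        plan += [f"{phase} {server}" for server in servers]
    plan.append("monitor until finished")
    return plan


steps = [("Read input", False), ("Sort rows", True), ("Write output", False)]
master_steps, slave_parts = split_transformation(steps, ["slave1", "slave2"])
plan = execution_plan(steps, "master", ["slave1", "slave2"])
for line in plan:
    print(line)
```

In this model the unclustered steps "Read input" and "Write output" stay in the master transformation, while "Sort rows" appears once in each slave transformation.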