The current Kettle Clustering capability, which is addressed in http://infocenter.pentaho.com/help/topic/pdi_admin_guide/topic_carte_setup.html, can now be implemented using the resources and data nodes of a Hadoop cluster.
This article explains how to set up your Kettle environment and how to build a job that uses this new capability. There are two new job entries that make it possible to execute Kettle transformations in parallel using YARN: Start a YARN Kettle Cluster and Stop a YARN Kettle Cluster.
This functionality requires that you set the active Hadoop distribution to a version of Hadoop 2 that fully supports YARN. As of February 2014, HDP 2.0 and CDH 5.0 are the only Hadoop 2 distributions that Pentaho has tested. See http://wiki.pentaho.com/display/BAD/Configuring+Pentaho+for+your+Hadoop+Distro+and+Version for an updated list of tested and supported Hadoop distributions.
<value>yarncluster.mycompany.com</value>
5. Check the value for the yarn.resourcemanager.address property and change the port to match your environment if necessary, like this:
<value>${yarn.resourcemanager.hostname}:9000</value>
6. If you are using CDH 5.x, there are a few more things that you need to do.
a. Navigate to the folder that contains the shim, then open the hive-site.xml file in a text editor.
b. Modify the hive.metastore.uris property so that it points to the location of your hive metastore.
c. Save and close the hive-site.xml file.
d. Navigate to the folder that contains the shim, then open the mapred-site.xml file in a text editor.
e. Modify the mapreduce.jobhistory.address property so that it points to the place where the job history logs are stored.
Note: Not all shim properties can be set in the Spoon user interface, nor are instructions for modifying them listed here. If you need to set additional properties that are not addressed, you will need to set them manually in the *-site.xml files that are in the shim directory. Consult your Hadoop distribution's vendor for details about the properties you want to set.
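Putting the yarn-site.xml edits from the steps above together, the relevant properties might look like the following sketch. The hostname and port shown are example placeholders; substitute the values for your own environment:

```xml
<!-- yarn-site.xml in the shim directory (example values only) -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>yarncluster.mycompany.com</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>${yarn.resourcemanager.hostname}:9000</value>
</property>
```

The `${yarn.resourcemanager.hostname}` reference lets the address property pick up the hostname you set in the first property, so you only need to change the port here.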
Spoon has two YARN entries that can be added to a job: Start a YARN Kettle Cluster and Stop a YARN Kettle Cluster.
Start a YARN Kettle Cluster is used to start a cluster of carte servers on Hadoop nodes, assign them ports, and pass cluster access credentials. When this step is run and a cluster is created, the metadata for that cluster is stored in the shared.xml file or, if you are using the enterprise repository, in the DI Repository.
The carte servers in the cluster will continue to run until a Stop a YARN Kettle Cluster step is executed, or you manually stop the cluster.
NOTE: If you assign the cluster a name that has not been used before, you will need to create a cluster schema in Spoon. You only need to specify the cluster name when you create the cluster schema. For information on how to do this, see the Create a Cluster Schema in Spoon topic in the Pentaho Infocenter.
Start a YARN Kettle Cluster Fields
Field | Description
---|---
Name | Name of the entry. You can customize this, or leave it as the default.
Number of nodes to start | Number of nodes in your carte cluster.
Cluster port range start port | The number of the port that the master node will be assigned. Slave nodes are assigned port numbers sequentially, using this port number as the starting point. For example, if the start port is 40000 and there are 4 nodes in the cluster, the master node is assigned port 40000, slave node #1 is assigned port 40001, slave node #2 is assigned port 40002, and slave node #3 is assigned port 40003.
Cluster data range start port | The number of the data port that the master node will be assigned. Slave nodes are assigned data port numbers sequentially, using this data port number as the starting point.
Node memory | Amount of memory assigned to each node, in megabytes.
Virtual cores per node | Number of virtual cores per node.
Carte user name | User name needed to access the carte cluster.
Carte password | Password needed to access the carte cluster.
Name of cluster schema to update | Name of the cluster schema.
Default FS | URL for the default HDFS file system.
PDI Client Zip | Path to the location of the PDI Client (Spoon) on the DI Server. When the Start a YARN Kettle Cluster entry is in a job, it can be executed in one of three places: 1) locally, on the same host on which you build the job and transformation; 2) on a Carte server; or 3) on a remote DI Server. If you plan to execute this entry only locally or on a Carte server, leave this field blank. If you want to run the entry remotely on a DI Server, you must enter the path to the PDI Client (Spoon) installed on the DI Server host; otherwise, the Kettle cluster will not start properly. When this field has a value and the job that contains this entry is run on the DI Server, the entry finds the directory (or zip file) and places a copy of it on HDFS.
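The sequential port assignment described in the table can be sketched as follows. This is illustrative arithmetic only, mirroring the table's example; the function name and data-port start value are assumptions, not Pentaho code:

```python
def carte_node_ports(start_port, data_start_port, num_nodes):
    """Sketch of sequential port assignment: the master node gets the
    start port, and each slave node gets the next port in order."""
    nodes = []
    for i in range(num_nodes):
        role = "master" if i == 0 else f"slave #{i}"
        nodes.append({"role": role,
                      "port": start_port + i,        # Cluster port range
                      "data_port": data_start_port + i})  # Cluster data range
    return nodes

# Table's example: start port 40000, 4 nodes (data start port 41000 assumed)
for node in carte_node_ports(40000, 41000, 4):
    print(node)
```

Running this prints the master node on port 40000 and slave nodes #1 through #3 on ports 40001 through 40003, matching the example in the table.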
Stop a YARN Kettle Cluster stops a YARN-based carte cluster from running. This entry is usually used downstream of a Start a YARN Kettle Cluster entry in the same Kettle job, but it also works when executed in a separate Kettle job, as long as the cluster schema name matches the cluster you want to stop.
Stop a YARN Kettle Cluster Fields
Field | Description
---|---
Name | Name of the entry. You can customize this, or leave it as the default.
Name of cluster schema to stop | Name of the cluster schema to stop.