The current Kettle Clustering capability, which is addressed in http://infocenter.pentaho.com/help/topic/pdi_admin_guide/topic_carte_setup.html, can now be implemented using the resources and data nodes of a Hadoop cluster.
This article explains how to set up your Kettle environment and how to build a job that uses this new capability. There are two new job entries that make it possible to execute Kettle transformations in parallel using YARN: Start a YARN Kettle Cluster and Stop a YARN Kettle Cluster.
This functionality requires that you set the active Hadoop distribution to a version of Hadoop 2 that fully supports YARN. As of February 2014, HDP 2.0 and CDH 5.0 are the only Hadoop 2 distributions that Pentaho has tested. See http://wiki.pentaho.com/display/BAD/Configuring+Pentaho+for+your+Hadoop+Distro+and+Version for an updated list of tested and supported Hadoop distributions.
<value>yarncluster.mycompany.com</value>
5. Check the value for the yarn.resourcemanager.address property and change the port to match your environment if necessary, like this:
<value>${yarn.resourcemanager.hostname}:9000</value>
6. If you are using CDH 5.x, there are a few more things that you need to do.
a. Navigate to the folder that contains the shim, then open the hive-site.xml file in a text editor.
b. Modify the hive.metastore.uris property so that it points to the location of your hive metastore.
c. Save and close the hive-site.xml file.
d. Navigate to the folder that contains the shim, then open the mapred-site.xml file in a text editor.
e. Modify the mapreduce.jobhistory.address property so that it points to the place where the job history logs are stored.
Note: Not all shim properties can be set in the Spoon user interface, nor are instructions for modifying them listed here. If you need to set additional properties that are not addressed, you will need to set them manually in the *-site.xml files that are in the shim directory. Consult your Hadoop distribution's vendor for details about the properties you want to set.
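Putting the yarn-site.xml edits from the steps above together, the relevant properties might look like the following sketch. The hostname and port shown are example placeholders; substitute the values for your own environment:

```xml
<!-- yarn-site.xml in the shim directory (example values only) -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>yarncluster.mycompany.com</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>${yarn.resourcemanager.hostname}:9000</value>
</property>
```

The `${yarn.resourcemanager.hostname}` reference lets the address property pick up the hostname you set in the first property, so you only need to change the port here.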
Spoon has two YARN entries that can be added to a job: Start a YARN Kettle Cluster and Stop a YARN Kettle Cluster.
Start a YARN Kettle Cluster is used to start a cluster of carte servers on Hadoop nodes, assign them ports, and pass cluster access credentials. When this step is run and a cluster is created, the metadata for that cluster is stored in the shared.xml file or, if you are using the enterprise repository, in the DI Repository.
The carte servers in the cluster will continue to run until a Stop a YARN Kettle Cluster step is executed, or you manually stop the cluster.
NOTE: If you assign the cluster a name that has not been used before, you will need to create a cluster schema in Spoon. You only need to specify the cluster name when you create the cluster schema. For information on how to do this, see the Create a Cluster Schema in Spoon topic in the Pentaho Infocenter.
Start a YARN Kettle Cluster Fields
Field | Description
---|---
Name | Name of the entry. You can customize this, or leave it as the default.
Number of nodes to start | Number of nodes in your carte cluster.
Cluster port range start port | The number of the port that the master node will be assigned. Slave nodes are assigned port numbers sequentially, using this port number as the starting point. For example, if the start port is 40000 and there are 4 nodes in the cluster, the master node is assigned port 40000, slave node #1 is assigned port 40001, slave node #2 is assigned port 40002, and slave node #3 is assigned port 40003.
Cluster data range start port | The number of the data port that the master node will be assigned. Slave nodes are assigned data port numbers sequentially, using this data port number as the starting point.
Node memory | Amount of memory assigned to each node, in megabytes.
Virtual cores per node | Number of virtual cores per node.
Carte user name | User name needed to access the carte cluster.
Carte password | Password needed to access the carte cluster.
Name of cluster schema to update | Name of the cluster schema.
Default FS | URL for the default HDFS file system.
PDI Client Zip | Path to the location of the PDI Client (Spoon) on the DI Server. When the Start a YARN Kettle Cluster entry is in a job, it can be executed in one of three places: 1) locally, on the same host on which you build the job and transformation; 2) on a Carte server; or 3) on a remote DI Server. If you plan to execute this entry only locally or on a Carte server, leave this field blank. If you want to run the entry remotely on a DI Server, you must enter the path to the PDI Client (Spoon) installed on the DI Server host; otherwise, the Kettle cluster will not start properly. When this field has a value and the job that contains this entry is run on the DI Server, the entry finds the directory (or zip file) and places a copy of it on HDFS.
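The sequential port assignment described in the table can be sketched as follows. This is illustrative arithmetic only, mirroring the table's example; the function name and data-port start value are assumptions, not Pentaho code:

```python
def carte_node_ports(start_port, data_start_port, num_nodes):
    """Sketch of sequential port assignment: the master node gets the
    start port, and each slave node gets the next port in order."""
    nodes = []
    for i in range(num_nodes):
        role = "master" if i == 0 else f"slave #{i}"
        nodes.append({"role": role,
                      "port": start_port + i,        # Cluster port range
                      "data_port": data_start_port + i})  # Cluster data range
    return nodes

# Table's example: start port 40000, 4 nodes (data start port 41000 assumed)
for node in carte_node_ports(40000, 41000, 4):
    print(node)
```

Running this prints the master node on port 40000 and slave nodes #1 through #3 on ports 40001 through 40003, matching the example in the table.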
Stop a YARN Kettle Cluster stops a YARN-based carte cluster from running. This entry is usually used downstream of a Start a YARN Kettle Cluster entry in the same Kettle job, but it also works when executed in a separate Kettle job, as long as the cluster schema name matches the cluster you want to stop.
Stop a YARN Kettle Cluster Fields
Field | Description
---|---
Name | Name of the entry. You can customize this, or leave it as the default.
Name of cluster schema to stop | Name of the cluster schema to stop.