Kettle Clustering With Yarn

...

Set Active Hadoop Distribution

  1. If you have not done so already, stop the components (Spoon, DI Server or BA Server) that you want to configure for YARN transform execution.
  2. Set the correct Hadoop Configuration (shim).
  3. Navigate to the directory that contains the plugin.properties file for the component.
    * DI Server - data-integration-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations
    * BA Server - biserver-ee/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations
    * Spoon - data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations
  4. Open the plugin.properties file.
  5. Set the active.hadoop.configuration property to match the name of the shim you want to make active. For example, if the name of the shim is cdh42, then the line would look like this: active.hadoop.configuration=cdh42.
  6. Save and close the plugin.properties file.
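For reference, here is a minimal sketch of the relevant line in plugin.properties with the cdh42 shim made active; the other properties and comments in your file are omitted and may differ.

Code Block
# plugin.properties (excerpt, illustrative)
# The value must match the name of a folder under hadoop-configurations.
active.hadoop.configuration=cdh42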

Additional Configuration for YARN Shims

  1. Configure the cluster settings.
  2. Navigate to the folder that contains the shim, then open the yarn-site.xml file in a text editor.
  3. Check the value of yarn.application.classpath. This property specifies the classpath used on the cluster when executing YARN applications. If the paths for your environment are not listed, add them. Paths are separated by commas.
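For example, the yarn.application.classpath property in yarn-site.xml might look like the following sketch. The paths shown are a common default and are illustrative only; adjust them to match your distribution and environment.

Code Block
<property>
  <name>yarn.application.classpath</name>
  <!-- Comma-separated classpath entries used by YARN applications on the cluster -->
  <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
</property>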
  4. Check the value for yarn.resourcemanager.hostname.
    * For HDP 2, the default value is sandbox.hortonworks.com.
    * For CDH 5, the default value is clouderamanager.cdh5.test.
    * Since your hostname will probably be different, change the value to match your environment. Here is an example.
Code Block
<value>yarncluster.mycompany.com</value>

  5. Check the value for yarn.resourcemanager.address and, if necessary, change the port to match your environment, like this:

Code Block
<value>${yarn.resourcemanager.hostname}:9000</value>
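Putting steps 4 and 5 together, the corresponding yarn-site.xml entries might look like the following sketch; the hostname yarncluster.mycompany.com and port 9000 are placeholders for your environment.

Code Block
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>yarncluster.mycompany.com</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <!-- The port must match the ResourceManager port in your environment -->
  <value>${yarn.resourcemanager.hostname}:9000</value>
</property>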

  6. If you are using CDH 5.x, there are a few more things that you need to do (see the example after these steps).
    a. Navigate to the folder that contains the shim, then open the hive-site.xml file in a text editor.
    b. Modify the hive.metastore.uris property so that it points to the location of your Hive metastore.
    c. Save and close the hive-site.xml file.
    d. Navigate to the folder that contains the shim, then open the mapred-site.xml file in a text editor.
    e. Modify the mapreduce.jobhistory.address property so that it points to the place where the job history logs are stored.
    f. Save and close the mapred-site.xml file.
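Here is a minimal sketch of those two properties. The hostnames are placeholders, and the ports shown (9083 for the Hive metastore, 10020 for the MapReduce JobHistory Server) are the usual defaults; use the values from your cluster.

Code Block
<!-- hive-site.xml -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore.mycompany.com:9083</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>jobhistory.mycompany.com:10020</value>
</property>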

Note: Not all shim properties can be set in the Spoon user interface, nor are instructions for modifying them listed here. If you need to set additional properties that are not addressed, you will need to set them manually in the *-site.xml files that are in the shim directory. Consult your Hadoop distribution's vendor for details about the properties you want to set.

PDI Client (Spoon) YARN Job Entries

Spoon has two YARN entries that can be added to a job: Start a YARN Kettle Cluster and Stop a YARN Kettle Cluster.

Start a YARN Kettle Cluster

Description

Start a YARN Kettle Cluster is used to start a cluster of Carte servers on Hadoop nodes, assign them ports, and pass cluster access credentials. When this entry is run and a cluster is created, the metadata for that cluster is stored in the shared.xml file or, if you are using the enterprise repository, in the DI Repository.
The Carte servers in the cluster continue to run until a Stop a YARN Kettle Cluster entry is executed or you manually stop the cluster.
NOTE: If you assign the cluster a name that has not been used before, you will need to create a cluster schema in Spoon. You only need to specify the cluster name when you create the cluster schema. For information on how to do this, see the Create a Cluster Schema in Spoon topic in the Pentaho Infocenter.

Options

Start a YARN Kettle Cluster Fields

Name

Name of the entry. You can customize this, or leave it as the default.

Number of nodes to start

Number of nodes in your Carte cluster.

Cluster port range start port

The number of the port that the master node will be assigned. Slave nodes are assigned port numbers sequentially, using this port number as the starting point. For example, if the start port is 40000 and there are 4 nodes in the cluster, the master node is assigned port 40000, slave node #1 is assigned port 40001, slave node #2 is assigned port 40002, and slave node #3 is assigned port 40003.

Before you assign a port number, ensure that the entire sequence of ports is available. If it is not, this entry will fail when it runs.

Cluster data range start port

The number of the data port that the master node will be assigned. Slave nodes are assigned data port numbers sequentially, using this data port number as the starting point.

Before you assign a port number, ensure that the entire sequence of ports is available. If it is not, this entry will fail when it runs.

Node memory

Amount of memory assigned to each node. Memory is in megabytes.

Virtual cores per node

Number of virtual cores per node.

Carte user name

User name needed to access the Carte cluster.

Carte password

Password needed to access the Carte cluster.

Name of cluster schema to update

Name of the cluster schema.

Default FS

URL for the default HDFS file system.

PDI Client Zip

Indicates the path to the PDI Client (Spoon) installation on the DI Server. When the Start a YARN Kettle Cluster entry is in a job, the entry can be executed in one of three places: 1) locally, on the same host on which you built the job and transformation, 2) on a Carte server, or 3) on a remote DI Server. If you plan to execute this entry only locally or on a Carte server, leave this field blank. But if you want to run the entry remotely on a DI Server, you need to indicate the path to the PDI Client (Spoon) installation on the DI Server host. If you do not, the Kettle cluster will not start properly. If you enter a value in this field, when the job that contains this entry is run on the DI Server, the job finds the directory (or zip file) and places a copy of it in HDFS.

NOTE: If you have a zip file, the root directory must be data-integration.

You can either point to the data-integration directory itself or to a zip file that contains the data-integration directory and its subdirectories. Here are some examples of entries.

  • PDI Client directory: C:\Program Files\pentaho\design-tools\data-integration
  • PDI Client (Spoon) zip file: C:\Program Files\Zips\data-integration.zip

...

Stop a YARN Kettle Cluster

Description

Stop a YARN Kettle Cluster stops a YARN-based Carte cluster from running. This entry is usually used downstream of a Start a YARN Kettle Cluster entry in the same Kettle job. It also works if executed in a separate Kettle job, as long as the cluster schema name matches the name of the cluster you want to stop.

Options

Stop a YARN Kettle Cluster Fields

Name

Name of the entry. You can customize this, or leave it as the default.

Name of cluster schema to stop

Name of the cluster schema to stop.

...