Start a PDI Cluster on YARN

Description

Start a PDI Cluster on YARN is used to start a cluster of Carte servers on Hadoop nodes, assign them ports, and pass cluster access credentials. When this entry runs and a cluster is created, the metadata for that cluster is stored in the shared.xml file or, if you are using the enterprise repository, in the DI Repository. For more information on Carte clusters, see Use Carte Clusters in the Pentaho Help documentation.

In earlier versions of Spoon, this step was labeled Start a YARN Kettle Cluster.

Context

Use this entry to start a cluster of Carte servers. The Carte servers in the cluster continue to run until a Stop a PDI Cluster on YARN entry is executed, or until you manually stop the cluster.

Prerequisites

If you assign the cluster a name that has not been used before, you must first create a cluster schema in Spoon. Only the cluster name needs to be specified when you create the schema; see the Create a Cluster Schema in Spoon topic in the Pentaho Help documentation for more information. A YARN Hadoop configuration should already be set up. Information on configuring a YARN Hadoop configuration appears in Additional Configuration for YARN Shims.
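
For reference, a cluster schema created in Spoon is saved as a clusterschema entry in shared.xml (or in the DI Repository). The sketch below shows the general shape of such an entry; the schema name and base port are placeholder values, and the exact set of tags Spoon writes may differ by version:

  <clusterschema>
    <!-- Name entered when the schema was created in Spoon -->
    <name>yarn-cluster</name>
    <!-- Starting port for node assignments -->
    <base_port>40000</base_port>
    <sockets_buffer_size>2000</sockets_buffer_size>
    <sockets_flush_interval>5000</sockets_flush_interval>
    <sockets_compressed>Y</sockets_compressed>
    <dynamic>Y</dynamic>
    <!-- Slave server list (left empty in this sketch) -->
    <slaveservers/>
  </clusterschema>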

Options

You can configure the cluster through the Start Kettle Cluster on YARN dialog, which appears when you double-click the entry in your job. The dialog contains a Step Name field and two tabs, Cluster and Files. The Step Name field holds the entry name, which you can customize or leave as the default.

Cluster

The items in the Cluster tab contain cluster configuration details:

  • Name Cluster Schema - Name of the cluster schema.
  • Carte User Name - User name needed to access the Carte cluster.
  • Carte Password - Password needed to access the Carte cluster.
  • Number of Nodes - Number of nodes in the cluster.
  • Virtual Cores Per Node - Number of virtual cores per node.
  • Carte Port Range Start - The port that the master node is assigned. Slave nodes are assigned port numbers sequentially, using this port as the starting point. For example, if the start port is 40000 and there are 4 nodes in the cluster, the master node is assigned port 40000, slave node #1 is assigned port 40001, slave node #2 is assigned port 40002, and slave node #3 is assigned port 40003. Before you assign a port number, ensure that the entire sequence of ports is available; if it is not, this entry fails when it runs.
  • Cluster Data Range Start - The data port that the master node is assigned. Slave nodes are assigned data port numbers sequentially, using this port as the starting point. As with the Carte ports, ensure that the entire sequence of ports is available before you assign a starting port; if it is not, this entry fails when it runs.
  • Application Master Memory - Amount of memory, in megabytes, assigned to the YARN application master.
  • Nodes Memory - Amount of memory, in megabytes, assigned to each node.

Files

The items in the Files tab contain file configuration details:

File System Path

URL for the default HDFS file system. Make sure that the Default FS setting matches the configured hostname for the HDFS NameNode. If it does not, an error message displays and the Kettle cluster will not start.
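
The Default FS setting referred to here is the fs.defaultFS property in the cluster's core-site.xml. A minimal excerpt, with an illustrative NameNode hostname and port:

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>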

PDI Client Archive

Indicates the path to the PDI client (Spoon) installation on the DI Server. When the Start a PDI Cluster on YARN entry is in a job, the job can be executed in one of three places: locally on the host where you build the job and transformation, on a Carte server, or on a remote DI Server. If you plan to execute this entry only locally or on a Carte server, leave this field blank. If you want to run the entry remotely on a DI Server, you must indicate the path to the PDI client (Spoon) installed on the DI Server host; otherwise, the Kettle cluster will not start properly. If you enter a value in this field, when the job containing this entry runs on the DI Server, it finds the directory (or zip file) and places a copy of it on HDFS.

NOTE: If you have a zip file, the root directory must be data-integration.

You can either specify the path to the data-integration directory or point to a zip file that contains the data-integration directory and its subdirectories (see the layout sketch after these examples). Here are some examples of entries:

  • PDI Client directory: C:\Program Files\pentaho\design-tools\data-integration 
  • PDI Client (Spoon) zip file: C:\Program Files\Zips\data-integration.zip 
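
If you build the zip file yourself, its entries must be rooted at a data-integration directory rather than at the installation's individual files. The layout below is a sketch; the contents shown are an abbreviated, illustrative subset of a PDI client installation:

  data-integration.zip
    data-integration/
      classes/
      lib/
      plugins/
      Spoon.bat
      spoon.sh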

Copy local resource files to YARN

Copies the contents of the current user's kettle.properties, repositories.xml, and shared.xml files to the YARN Workspace folder. When a job containing this entry runs, the contents of the YARN Workspace folder are copied to the cluster; this happens even when the checkboxes are deselected. If you do not want the contents of the YARN Workspace folder to be copied, you must remove them manually. For more information, see the Using the YARN Workspace Folder article in the Pentaho Help documentation.

  • kettle.properties - Provides the cluster with any parameters and variables that have been set, such as the path to a log file (see the sketch after this list).
  • shared.xml - Provides access to database connections stored in the file, such as connection details for a MongoDB instance.
  • repositories.xml - Allows the job or transformation running on the cluster access to the DI Repository.
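
As a sketch of how the first of these files is used: kettle.properties is a plain Java properties file of VARIABLE=value pairs, so an entry like the one below (the variable name and path are hypothetical) becomes available to jobs and transformations on every cluster node once the file is copied over:

  # kettle.properties -- hypothetical variable consumed by a transformation
  LOG_FILE_PATH=/var/log/pdi/cluster.log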

When a Kettle Cluster is started on YARN, the configuration files (kettle.properties, shared.xml, repositories.xml) and any additional resource files it might need are deployed from the workspace folder in the shim plugin (pentaho-big-data-plugin/plugin/pentaho-kettle-yarn-plugin). Files can be placed there manually, but the three primary configuration files can be copied to the workspace at runtime if their corresponding checkboxes in the Copy local resource files to YARN section of the Files tab are selected.

If you run the job from a user’s PDI installation, the config files from that user’s KETTLE_HOME directory are used. If the job is scheduled or otherwise runs on a Pentaho DI Server, the config files from that server’s configured KETTLE_HOME are copied when the job starts.
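
In both cases, Kettle reads these files from the .kettle directory under the home directory of the user running the process; setting the KETTLE_HOME environment variable relocates that directory. An illustrative layout:

  $KETTLE_HOME/.kettle/
    kettle.properties
    repositories.xml
    shared.xml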

If you want to use configuration files different from those in your and the server's KETTLE_HOME directories, copy those files manually into the YARN Workspace folder and ensure the corresponding checkboxes in the Copy local resource files to YARN section of the Files tab are not selected.

If you have configuration files appropriate for development, testing, or staging in your KETTLE_HOME directory and the Pentaho DI Server has production configuration files in its KETTLE_HOME, select the corresponding checkboxes to ensure the Kettle cluster deployed by YARN uses the configuration files appropriate for the environment from which it is run.
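
For example, your KETTLE_HOME might hold a development kettle.properties while the DI Server's KETTLE_HOME holds the production version (the hostnames and variable name are hypothetical):

  # Developer workstation: $KETTLE_HOME/.kettle/kettle.properties
  DB_HOST=dev-db.example.com

  # DI Server: $KETTLE_HOME/.kettle/kettle.properties
  DB_HOST=prod-db.example.com

With the checkboxes selected, whichever host starts the job supplies its own copies of the files, so the cluster resolves DB_HOST to the value appropriate for that environment.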