
PLEASE NOTE: This documentation applies to Pentaho 7.0 and earlier versions. For Pentaho 7.1 and later, see Pentaho MapReduce on the Pentaho Enterprise Edition documentation site.

Pentaho MapReduce

Note: This entry was formerly known as Hadoop Transformation Job Executor.

This job entry executes transformations as part of a Hadoop MapReduce job. It is frequently used to run transformations that act as mappers and reducers in lieu of a traditional hand-written Hadoop Java class. The User Defined tab holds Hadoop option name/value pairs that are not covered by the Job Setup and Cluster tabs; any properties defined there are set in the MapReduce job configuration.
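For context, here is a minimal sketch of the kind of hand-written Hadoop mapper class that a mapper transformation takes the place of. The word-count logic and the class name are illustrative assumptions, not Pentaho code; the API calls are the standard org.apache.hadoop.mapreduce ones.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count mapper; in a Pentaho MapReduce job this logic
// would instead live in a KTR built around MapReduce Input/Output steps.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each input line into words and emit (word, 1) pairs.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

In a Pentaho MapReduce job, the MapReduce Input step plays the role of the map() arguments and the MapReduce Output step plays the role of the context.write() call.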

General

Name: The name of this Pentaho MapReduce entry instance.

Hadoop Job Name: The name of the Hadoop job you are executing.

Mapper

Look in: Sets the context for the Browse button. Options are: Local (the local filesystem), Repository by Name (a PDI database or solution repository), or Repository by Reference (a link to a transformation no matter which repository it is in).

Mapper Transformation: The KTR that will perform the mapping functions for this job.

Mapper Input Step Name: The name of the step that receives mapping data from Hadoop. This must be a MapReduce Input step.

Mapper Output Step Name: The name of the step that passes mapping output back to Hadoop. This must be a MapReduce Output step.

Combiner

Look in: Sets the context for the Browse button. Options are: Local (the local filesystem), Repository by Name (a PDI database or solution repository), or Repository by Reference (a link to a transformation no matter which repository it is in).

Combiner Transformation: The KTR that will perform the combiner functions for this job.

Combiner Input Step Name: The name of the step that receives combiner data from Hadoop. This must be a MapReduce Input step.

Combiner Output Step Name: The name of the step that passes combiner output back to Hadoop. This must be a MapReduce Output step.

Combine single threaded: Indicates if the Single Threaded transformation execution engine should be used to execute the combiner transformation. If false, the normal multi-threaded transformation engine will be used. The Single Threaded transformation execution engine reduces overhead when processing many small groups of output.

Reducer

Look in: Sets the context for the Browse button. Options are: Local (the local filesystem), Repository by Name (a PDI database or solution repository), or Repository by Reference (a link to a transformation no matter which repository it is in).

Reducer Transformation: The KTR that will perform the reducer functions for this job.

Reducer Input Step Name: The name of the step that receives reducing data from Hadoop. This must be a MapReduce Input step.

Reducer Output Step Name: The name of the step that passes reducing output back to Hadoop. This must be a MapReduce Output step.

Reduce single threaded: Indicates if the Single Threaded transformation execution engine should be used to execute the reducer transformation. If false, the normal multi-threaded transformation engine will be used. The Single Threaded transformation execution engine reduces overhead when processing many small groups of output.
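For comparison with the mapper sketch earlier, this is the hand-written Hadoop reducer that a reducer transformation stands in for; a combiner transformation plays the same role on the map side, and in plain Hadoop the same class is often registered as both. Again, the word-count logic is an illustrative assumption.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative word-count reducer; a reducer transformation receives the
// same grouped key/value pairs through its MapReduce Input step.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted by the mappers (or combiners) for each word.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}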

Job Setup

Suppress Output of Map Key: If selected, the key output from the Mapper transformation will be ignored and replaced with NullWritable (see the sketch following this section).

Suppress Output of Map Value: If selected, the value output from the Mapper transformation will be ignored and replaced with NullWritable.

Suppress Output of Reduce Key: If selected, the key output from the Combiner and/or Reducer transformations will be ignored and replaced with NullWritable. This requires a Reducer transformation to be used, not the "Identity Reducer".

Suppress Output of Reduce Value: If selected, the value output from the Combiner and/or Reducer transformations will be ignored and replaced with NullWritable. This requires a Reducer transformation to be used, not the "Identity Reducer".

Input Path: A comma-separated list of input directories, such as /wordcount/input, on your Hadoop cluster where the source data for the MapReduce job is stored.

Output Path: The directory on your Hadoop cluster where you want the output from the MapReduce job to be stored, such as /wordcount/output. The output directory cannot exist prior to running the MapReduce job.

Input Format: The Apache Hadoop class name that describes the input specification for the MapReduce job. See InputFormat for more information.

Output Format: The Apache Hadoop class name that describes the output specification for the MapReduce job. See OutputFormat for more information.

Clean output path before execution: If enabled, the specified output path is removed before the MapReduce job is scheduled.
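The Job Setup fields map closely onto a plain Hadoop driver. The sketch below is a hedged illustration of that mapping: TextInputFormat and TextOutputFormat are the stock Hadoop classes you would typically enter in the Input Format and Output Format fields, the paths reuse the /wordcount examples above, and the NullWritable and delete-before-run lines are assumptions about what the suppression and clean-output options amount to internally.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class JobSetupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount"); // Hadoop Job Name field

        // Input Path and Output Path fields.
        Path in = new Path("/wordcount/input");
        Path out = new Path("/wordcount/output");
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);

        // Input Format and Output Format fields take fully qualified class names.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // "Suppress Output of Map Key" amounts to emitting NullWritable in
        // place of the real map key (assumption about the internal mechanics).
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(IntWritable.class);

        // "Clean output path before execution" is equivalent to deleting the
        // output directory first; Hadoop refuses to write into an existing one.
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(out)) {
            fs.delete(out, true); // recursive delete
        }

        // "Enable Blocking" corresponds to waiting for completion like this
        // rather than submitting the job and moving on.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}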

Cluster

Hadoop Cluster: Allows you to create, edit, and select a Hadoop cluster configuration. The Hadoop Cluster section below defines the options for creating or editing an entry. The Edit button lets you modify Hadoop cluster configuration information; the New button lets you add a new Hadoop cluster configuration. More information on Hadoop clusters can be found in Pentaho Help.

Number of Mapper Tasks: The number of mapper tasks you want to assign to this job. The size of the inputs should determine the number of mapper tasks. Typically there should be 10 to 100 maps per node, though you can specify a higher number for mapper tasks that are not CPU-intensive.

Number of Reducer Tasks: The number of reducer tasks you want to assign to this job. With lower numbers, the reduce operations can launch immediately and start transferring map outputs as the maps finish. With higher numbers, the nodes finish their first round of reduces sooner and launch a second round. Increasing the number of reduce operations increases the Hadoop framework overhead but improves load balancing. If this is set to 0, no reduce operation is performed and the output of the mapper becomes the output of the entire job; combiner operations are also not performed (see the sketch following this section).

Enable Blocking: Forces the job to wait until each step completes before continuing to the next step. This is the only way for PDI to be aware of a Hadoop job's status. If left unchecked, the Hadoop job is blindly executed and PDI moves on to the next job entry. Error handling/routing will not work unless this option is checked.

Logging Interval: Number of seconds between log messages.
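A hedged illustration of the reducer-count behavior described above, using the standard Hadoop setter; the value 0 is what produces a map-only job, and the nonzero value is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer-count-demo");

        // Number of Reducer Tasks = 0: no reduce (and no combine) runs, and
        // the mapper output is written directly as the job output.
        job.setNumReduceTasks(0);

        // A nonzero count (illustrative value) trades extra framework
        // overhead for better load balancing across the cluster:
        // job.setNumReduceTasks(8);
    }
}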

Hadoop Cluster

The Hadoop cluster configuration dialog allows you to specify configuration details such as host names and ports for HDFS, the JobTracker, and other big data cluster components, which can be reused in transformation steps and job entries that support this feature.

Cluster Name: Name that you assign to the cluster configuration.

Use MapR Client: Indicates that this configuration is for a MapR cluster. If this box is checked, the fields in the HDFS and JobTracker sections are disabled because those parameters are not needed to configure MapR.

Hostname (in HDFS section): Hostname for the HDFS node in your Hadoop cluster.

Port (in HDFS section): Port for the HDFS node in your Hadoop cluster.

Username (in HDFS section): Username for the HDFS node.

Password (in HDFS section): Password for the HDFS node.

Hostname (in JobTracker section): Hostname for the JobTracker node in your Hadoop cluster. If you have a separate JobTracker node, type its hostname here; otherwise use the HDFS hostname.

Port (in JobTracker section): Port for the JobTracker node in your Hadoop cluster. This cannot be the same as the HDFS port number.

Hostname (in ZooKeeper section): Hostname for the ZooKeeper node in your Hadoop cluster.

Port (in ZooKeeper section): Port for the ZooKeeper node in your Hadoop cluster.

URL (in Oozie section): Field to enter an Oozie URL. This must be a valid Oozie location.
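The hostname and port fields above ultimately describe standard Hadoop connection properties. The sketch below sets them directly; the property names follow the Hadoop 1.x conventions implied by the JobTracker terminology in this dialog, and the hostnames and ports are placeholder assumptions, not defaults Pentaho supplies.

import org.apache.hadoop.conf.Configuration;

public class ClusterConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // HDFS hostname and port (the NameNode).
        conf.set("fs.default.name", "hdfs://namenode.example.com:8020");

        // JobTracker hostname and port; note it must differ from the HDFS port.
        conf.set("mapred.job.tracker", "jobtracker.example.com:8021");

        System.out.println("HDFS: " + conf.get("fs.default.name"));
        System.out.println("JobTracker: " + conf.get("mapred.job.tracker"));
    }
}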

User Defined

Name: Name of the user-defined parameter or variable that you want to set. To set a Java system variable, preface the variable name with java.system, like this: java.system.SAMPLE_VARIABLE. Kettle variables set here override the Kettle variables set in the kettle.properties file. For more information on how to set a Kettle variable, see the Set Kettle Variables topic in the Pentaho Help documentation. A minimal example follows this section.

Value: Value of the user-defined parameter or variable that you want to set.
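A minimal sketch of how a java.system entry would surface at runtime, assuming the java.system prefix sets an ordinary Java system property as described above. The variable name reuses the SAMPLE_VARIABLE example from the documentation; the value is a placeholder.

public class UserDefinedSketch {
    public static void main(String[] args) {
        // Simulates a User Defined entry of
        //   Name  = java.system.SAMPLE_VARIABLE
        //   Value = some-value
        System.setProperty("SAMPLE_VARIABLE", "some-value");

        // Code running inside the job could then read it back:
        System.out.println(System.getProperty("SAMPLE_VARIABLE"));
    }
}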