Amazon Hive Job Executor

Please note: This documentation applies to an earlier version. For the most recent documentation, visit the Pentaho Enterprise Edition documentation site.

This job entry executes Hive jobs on an Amazon Elastic MapReduce (EMR) account. To use it, you must have an Amazon Web Services (AWS) account configured for EMR and a pre-made Hive script, stored in Amazon S3, to run on the remote job flow.
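
The Hive script referenced by this job entry lives in Amazon S3 (see the Hive Script and S3 Log Directory options below). The following is a minimal sketch, assuming boto3 and placeholder bucket, key, and table names, of what such a script might contain and how it could be staged from outside PDI; it is illustrative only and not part of the job entry itself.

```python
# Illustrative only: stage a small HiveQL script in S3 so the job entry's
# "Hive Script" option can point at it. Bucket, key, and paths are placeholders.
import boto3

HIVE_SCRIPT = """\
-- Count lines in the input data and write the result back to S3
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (line STRING)
LOCATION 's3://my-bucket/input/';

INSERT OVERWRITE DIRECTORY 's3://my-bucket/output/'
SELECT COUNT(*) FROM weblogs;
"""

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-bucket",           # placeholder bucket
    Key="scripts/count-lines.q",  # Hive Script option would then be s3://my-bucket/scripts/count-lines.q
    Body=HIVE_SCRIPT.encode("utf-8"),
)
```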

The following options are available:

Name: The name of this job entry as it appears in the workspace.

Hive Job Flow Name: The name of the Hive job flow to execute.

Existing JobFlow Id (optional): The ID of an existing EMR job flow to which this Hive job should be added.

AWS Access Key: Your Amazon Web Services access key.

AWS Secret Key: Your Amazon Web Services secret key.

Bootstrap Actions: References to scripts to invoke before the node begins processing data. See http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/Bootstrap.html for more information.

S3 Log Directory (optional): The URL of the Amazon S3 bucket in which your job flow logs will be stored. Artifacts required for execution (for example, the Hive script) will also be stored here before execution.

Hive Script: The Amazon S3 URL of the Hive script to execute.

Command Line Arguments: A list of arguments (space-separated strings) to pass to Hive.

Number of Instances: The number of Amazon EC2 instances used to execute the job flow.

Master Instance Type: The Amazon EC2 instance type that will act as the Hadoop "master" in the cluster, which handles MapReduce task distribution.

Slave Instance Type: The Amazon EC2 instance type used for the Hadoop "slave" nodes in the cluster. Slaves are assigned tasks from the master. This option is only valid if Number of Instances is greater than 1.

Keep Job Flow Alive: Specifies whether the job flow should remain active (rather than terminate) after completing all steps.

Enable Blocking: Specifies whether this job entry should block until the EMR Hive job completes.

Logging Interval in Seconds: If blocking is enabled, the number of seconds between status messages written to the log.
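
The options above correspond closely to the parameters of an EMR "run job flow" request. The sketch below shows how the same settings (job flow name, log directory, instance count and types, keep-alive, bootstrap actions, Hive script and arguments) map onto a raw EMR API call. It assumes boto3 purely for illustration; it is not what the job entry calls internally, and all names, URLs, keys, and instance types are placeholders.

```python
# Illustrative mapping of the dialog options onto a raw EMR request.
# Assumes boto3; all names, URLs, keys, and instance types are placeholders.
import boto3

emr = boto3.client(
    "emr",
    region_name="us-east-1",
    aws_access_key_id="AKIA...",              # AWS Access Key (placeholder; avoid hardcoding)
    aws_secret_access_key="...",              # AWS Secret Key (placeholder)
)

response = emr.run_job_flow(
    Name="my-hive-job-flow",                  # Hive Job Flow Name
    LogUri="s3://my-bucket/logs/",            # S3 Log Directory
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",     # Master Instance Type
        "SlaveInstanceType": "m5.xlarge",      # Slave Instance Type
        "InstanceCount": 3,                    # Number of Instances
        "KeepJobFlowAliveWhenNoSteps": False,  # Keep Job Flow Alive
    },
    BootstrapActions=[{                        # Bootstrap Actions
        "Name": "configure-node",
        "ScriptBootstrapAction": {
            "Path": "s3://my-bucket/bootstrap/configure-node.sh",
            "Args": [],
        },
    }],
    Steps=[{
        "Name": "run-hive-script",
        "ActionOnFailure": "TERMINATE_JOB_FLOW",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [                          # Hive Script and Command Line Arguments
                "hive-script", "--run-hive-script",
                "--args", "-f", "s3://my-bucket/scripts/count-lines.q",
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

# With Enable Blocking, the job entry waits for the Hive job to finish and logs
# status at the configured interval; the rough equivalent here is polling
# describe_cluster / list_steps on the returned JobFlowId.
print("Started job flow:", response["JobFlowId"])
```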