Apache Spark is an open-source cluster computing framework that is an alternative to the Hadoop MapReduce paradigm. The Spark Submit entry allows you to submit Spark jobs to CDH 5.3 and later, HDP 2.3 and later, MapR 5.1 and later, and EMR 3.10 and later clusters.
Before you use this entry, you will need to install and configure a Spark client on any node from which you will run Spark jobs.
You will need to configure the Spark client to work with the cluster on every machine from which Spark jobs can be run. Complete these steps:
1. Navigate to the <SPARK_HOME>/conf folder and create the spark-defaults.conf file using the instructions here: https://spark.apache.org/docs/latest/configuration.html
2. In the spark-defaults.conf file, add the spark.yarn.jar line. (If necessary, adjust the HDFS name and location to match the path to the spark-assembly.jar in your environment.) A couple of illustrative examples appear after these steps.
3. If you are connecting to an HDP cluster, create a text file named java-opts in the <SPARK_HOME>/conf folder and add your HDP version to that file. For example: -Dhdp.version=2.3.0.0-2557
You can determine your HDP version by running hdp-select status hadoop-client.
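The following lines are illustrative only; the host names and jar paths are placeholders, not values from this guide. In Spark 1.x, spark.yarn.jar points at the Spark assembly jar that has been uploaded to the cluster's file system, for example:

spark.yarn.jar hdfs://namenode:8020/user/spark/lib/spark-assembly.jar
spark.yarn.jar hdfs:///user/spark/share/lib/spark-assembly-1.6.0-hadoop2.6.0.jar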
If you are connecting to a CDH 5.7 cluster while using Apache Spark 1.6.0 on your client node, an error may occur when you try to run a job containing a Spark Submit entry in yarn-client mode. Perform one of the following tasks to resolve this error:
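One quick way to confirm which Spark build the client machine is actually using (a general diagnostic, not one of the documented resolution steps):

spark-submit --version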
This section explains how to set up the Spark client to connect to unsecured MapR clusters.
1. Copy the hive-site.xml file from the /opt/mapr/spark/spark-1.6.1/conf folder on the MapR cluster to the client machine's MapR configuration folder.
2. Install the Spark client package. For example, on Ubuntu: sudo apt-get install mapr-spark
3. Navigate to the <SPARK_HOME>/conf folder and create the spark-defaults.conf file using the instructions at the following link: https://spark.apache.org/docs/latest/configuration.html
4. Edit the spark-defaults.conf file to add the following code using your HDFS name and spark-assembly.jar file path:
spark.yarn.jar maprfs:///user/spark/lib/spark-assembly-1.6.1-mapr-1609-hadoop2.7.0-mapr-1607.jar
Note: If necessary, adjust the HDFS name and location to match the path to the spark-assembly.jar in your environment.
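To verify the client setup, you can submit the SparkPi example that ships with Spark (a common smoke test, not a step from this guide); adjust the jar path to match your installation:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client /opt/mapr/spark/spark-1.6.1/lib/spark-examples*.jar 10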
This section explains how to set up the Spark client to connect to secured MapR clusters.
1. Copy the hive-site.xml file from the /opt/mapr/spark/spark-1.6.1/conf folder on the MapR cluster to the client machine's MapR configuration folder.
2. Install the Spark client package. For example, on Ubuntu: sudo apt-get install mapr-spark
3. Navigate to the <SPARK_HOME>/conf folder and create the spark-defaults.conf file using the instructions at the following link: https://spark.apache.org/docs/latest/configuration.html
4. Edit the spark-defaults.conf file to add the following code using your HDFS name and spark-assembly.jar file path:
spark.yarn.jar maprfs:///user/spark/lib/spark-assembly-1.6.1-mapr-1609-hadoop2.7.0-mapr-1607.jar
Note: If necessary, adjust the HDFS name and location to match the path to the spark-assembly.jar in your environment.
Note: When Spark runs on YARN, the MapR client nodes require the hadoop-yarn-server-web-proxy.jar file to run Spark applications. The mapr-client package does not include this jar file. You must copy the /opt/mapr/hadoop/hadoop-2.x.x/share/hadoop/yarn/hadoop-yarn-server-web-proxy-<version>.jar file from a MapR cluster node to the same location on the MapR client node where you want to run the Spark application.
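For example, the copy might be done over SSH as follows; the host name and the 2.7.0 version are placeholders, so substitute the values from your cluster:

scp user@cluster-node:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.7.0.jar /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/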
Note that we support the yarn-cluster and yarn-client modes. Descriptions of the modes can be found in the Spark documentation on running on YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
Note: If you have configured your Hadoop Cluster and Spark for Kerberos, a valid Kerberos ticket must already be in the ticket cache area on your client machine before you launch and submit the Spark Submit job.
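A ticket is typically obtained and checked with the standard Kerberos tools; the principal below is a placeholder:

kinit user@EXAMPLE.COM
klist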
Option | Description |
---|---|
Entry Name | Name of the entry. You can customize this, or leave it as the default. |
Spark-Submit Utility | Script that launches the Spark job. |
Spark Master URL | The master URL for the cluster. Two options are supported: yarn-cluster, which runs the driver program as a thread of the YARN application master on the cluster, and yarn-client, which runs the driver program on the client machine while tasks still execute on the cluster's node managers. |
Jar | Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes. |
Class Name | The entry point for your application. |
Arguments | Arguments passed to the main method of your main class, if any. |
Executor Memory | Amount of memory to use per executor process. Use the JVM format (e.g. 512m, 2g). |
Driver Memory | Amount of memory to use for the driver process. Use the JVM format (e.g. 512m, 2g). |
Block Execution | This option is enabled by default. If this option is selected, the job entry waits until the Spark job finishes running. If it is not, the job proceeds with its execution once the Spark job is submitted. |
Help | Displays documentation on this entry. |
OK | Saves the information and closes the window. |
Cancel | Closes the window without saving changes. |
For more information on Spark parameters, including memory parameters, review this documentation: https://spark.apache.org/docs/latest/configuration.html.
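In a plain spark-submit invocation, these memory settings correspond to the standard --driver-memory and --executor-memory options. A minimal sketch, in which the class name, jar path, and sizes are placeholders:

spark-submit --master yarn-client --class com.example.MyApp --driver-memory 1g --executor-memory 2g hdfs:///user/someuser/myapp.jar arg1 arg2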
Option | Description |
---|---|
Entry Name | Name of the entry. You can customize this, or leave it as the default. |
# | Number of the parameter. |
Name | Name of the parameter. |
Value | Value of the parameter. |