Sqoop Import

The Sqoop Import job allows you to import data from a relational database into the Hadoop Distributed File System (HDFS) using Apache Sqoop. This job has two setup modes:

  • Quick Mode provides the minimum options necessary to perform a successful Sqoop import.
  • Advanced Mode's default view provides additional options for finer control over your Sqoop import. Advanced Mode also has a command line view into which you can paste an existing Sqoop command line.

For additional information about Apache Sqoop, visit http://sqoop.apache.org/.
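
For reference, a Sqoop import command line of the kind you could paste into the Advanced Mode command line view looks something like the sketch below. The JDBC connection string, credentials, table name, and target directory are placeholder values for illustration only; replace them with your own.

    sqoop import --connect jdbc:mysql://dbserver:3306/sales --username sqoopuser --password sqooppwd --table ORDERS --target-dir /user/pentaho/orders --num-mappers 4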

Quick Setup

  • Name: The name of this job as it appears in the transformation workspace.
  • Database Connection: Select the database connection to import from. Click Edit... to modify an existing connection, or click New... to create a new connection from this dialog.
  • Table: Source table to import from. If the source database requires it, a schema may be supplied in the format SCHEMA.YOUR_TABLE_NAME.
  • Hadoop Cluster: Allows you to create, edit, and select a Hadoop cluster configuration for use. Hadoop cluster configuration settings can be reused in transformation steps and job entries that support this feature. In a Hadoop cluster configuration, you can specify information such as host names and ports for HDFS, Job Tracker, and other big data cluster components. The Edit button allows you to edit Hadoop cluster configuration information. The New button allows you to add a new Hadoop cluster configuration. Information on Hadoop clusters can be found in Pentaho Help.
  • Target Directory: Path of the directory to import into.
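
As an illustration of how the Quick Setup fields relate to the standard Sqoop import arguments (the schema, table, and directory names below are made up), a Table value of SALES.CUSTOMER_ORDERS and a Target Directory of /user/pentaho/customer_orders roughly correspond to:

    --table SALES.CUSTOMER_ORDERS --target-dir /user/pentaho/customer_orders

The Database Connection supplies the equivalent of the --connect, --username, and --password arguments.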

Open File

  • Open from Folder: Indicates the path and name of the MapRFS or HDFS directory you want to browse. This directory becomes the active directory.
  • Up One Level: Displays the parent directory of the active directory shown in the Open from Folder field.
  • Delete: Deletes a folder from the active directory.
  • Create Folder: Creates a new folder in the active directory.
  • Active Directory Contents (no label): Displays the contents of the active directory, which is the directory listed in the Open from Folder field.
  • Filter: Applies a filter to the results displayed in the active directory contents.

Advanced Setup

  • Default/List view: List of property and value pair settings that can be modified to suit your needs, including options to configure an import to Hive or HBase.
  • Command line view: Field that accepts command line arguments, typically used to paste an existing Sqoop command line.
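
The Hive and HBase options mentioned above map onto standard Sqoop arguments, so a sketch of what you might set, either as properties in the Default/List view or pasted into the command line view, could look like the following. The Hive table, HBase table, column family, and row key names are placeholders; use the first line for a Hive import or the second for an HBase import.

    --hive-import --hive-table default.orders
    --hbase-table orders --column-family cf1 --hbase-row-key order_id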

Additional Instructions

This section contains additional instructions for using this job entry.

Import Sqoop Table to an HBase Table (MapR 3.1)

If you want to run a job that uses Sqoop to import data to an HBase table on a MapR 3.1 (or higher) secured cluster, you will need to specify the path to the MapR security JAR in the Sqoop Import job entry.

  1. Add the Sqoop Import entry for your job.
  2. Select the Advanced Options link, then click the List View icon. 
  3. Set the HBase and other arguments needed for your job.
  4. Click the Command Line View icon.
  5. In the Command Line field, set the -libjars parameter to the path of the MapR security JAR. The path you enter depends on whether you plan to run the job locally on the Spoon node or remotely on the DI Server node.
  • Local (Spoon) Node Path: -libjars plugins/pentaho-big-data-plugin/hadoop-configurations/mapr31/lib/pentaho-hadoop-shims-mapr31-security-<version number>.jar 
  • Remote (DI Server) Node Path:  -libjars ../../pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/mapr31/lib/pentaho-hadoop-shims-mapr31-security-<version number>.jar

Note: Replace <version number> with the version number of the pentaho-hadoop-shims-mapr31-security jar file that is on your system. 

  6. Click OK to close the Sqoop Import entry.
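
Putting these steps together, a command line view entry for a local (Spoon) run might look something like the sketch below. The connection string, credentials, table name, column family, and row key are placeholder values, and <version number> still needs to be replaced as described in the note above.

    sqoop import -libjars plugins/pentaho-big-data-plugin/hadoop-configurations/mapr31/lib/pentaho-hadoop-shims-mapr31-security-<version number>.jar --connect jdbc:mysql://dbserver:3306/sales --username sqoopuser --password sqooppwd --table ORDERS --hbase-table orders --column-family cf1 --hbase-row-key order_id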

Import Sqoop Table to Hadoop Cluster

If you want to run a job that uses Sqoop to import data to an HBase or Hive table on a Hadoop cluster, you may need to copy the required Kite libraries to HDFS and reference them in the Sqoop Import job entry.

  1. Add the Sqoop Import entry for your job.
  2. Select the Advanced Options link, then click the List View icon. 
  3. Set the arguments needed for your job.
  4. If needed, copy these Kite libraries to an HDFS location that the job has read access to:

    If you are using Parquet:
    • kite-data-core.jar
    • kite-data-mapreduce.jar
    • kite-hadoop-compatibility.jar

    If you are using Avro:
    • nameofavrolibrary.jar
    • kite-data-core.jar (if necessary)
    • kite-data-mapreduce.jar (if necessary)
    • kite-hadoop-compatibility.jar (if necessary)

    If you are using Cloudera, Hortonworks, or MapR:
    • kite-data-core.jar (if necessary)
    • kite-data-mapreduce.jar (if necessary)
    • kite-hadoop-compatibility.jar (if necessary)
  5. In the Sqoop Import entry's Default/List view, set the libjars property to the location of the Kite libraries (see the example after these steps).
    For example, for Cloudera, it would look something like this:

    hdfs://dbmaster:8020/tmp/sqoop/libs/kite-data-core-1.0.0-cdh<version number>.jar,hdfs://dbmaster:8020/tmp/sqoop/libs/kite-data-mapreduce-1.0.0-cdh<version number>.jar,hdfs://dbmaster:8020/tmp/sqoop/libs/kite-hadoop-compatibility-1.0.0-cdh<version number>.jar

Note: Replace <version number> with the version number of the CDH jar file that is on your system.

  6. Click OK to close the Sqoop Import entry.
  7. Test the job.
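
For example, a command line view equivalent of the Cloudera/Parquet case above might look like the sketch below. The connection details and table name are placeholder values, the Kite JAR paths are the HDFS locations from step 5, and <version number> still needs to be replaced with the version on your system.

    sqoop import -libjars hdfs://dbmaster:8020/tmp/sqoop/libs/kite-data-core-1.0.0-cdh<version number>.jar,hdfs://dbmaster:8020/tmp/sqoop/libs/kite-data-mapreduce-1.0.0-cdh<version number>.jar,hdfs://dbmaster:8020/tmp/sqoop/libs/kite-hadoop-compatibility-1.0.0-cdh<version number>.jar --connect jdbc:mysql://dbserver:3306/sales --username sqoopuser --password sqooppwd --table ORDERS --target-dir /user/pentaho/orders --as-parquetfile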