Sqoop Import

The Sqoop Import job allows you to import data from a relational database into the Hadoop Distributed File System (HDFS) using Apache Sqoop. This job has two setup modes:

  • Quick Mode provides the minimum options necessary to perform a successful Sqoop import.
  • Advanced Mode's default view provides additional options for finer control over your Sqoop import. Advanced Mode also has a command line view into which you can paste an existing Sqoop command line.

For additional information about Apache Sqoop, visit http://sqoop.apache.org/.
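
For reference, a Sqoop import command line of the kind you could paste into the Advanced Mode command line view looks something like the sketch below. The JDBC connection string, credentials, table name, and target directory are placeholder values for illustration only; replace them with your own.

    sqoop import --connect jdbc:mysql://dbserver:3306/sales --username sqoopuser --password sqooppwd --table ORDERS --target-dir /user/pentaho/orders --num-mappers 4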

Quick Setup

  • Name: The name of this job as it appears in the transformation workspace.
  • Database Connection: Select the database connection to import from. Click Edit... to modify an existing connection, or click New... to create a new connection from this dialog.
  • Table: Source table to import from. If the source database requires it, a schema may be supplied in the format SCHEMA.YOUR_TABLE_NAME.
  • Hadoop Cluster: Allows you to create, edit, and select a Hadoop cluster configuration for use. Hadoop cluster configuration settings can be reused in transformation steps and job entries that support this feature. In a Hadoop cluster configuration, you can specify information such as host names and ports for HDFS, Job Tracker, and other big data cluster components. The Edit button allows you to edit Hadoop cluster configuration information. The New button allows you to add a new Hadoop cluster configuration. Information on Hadoop clusters can be found in Pentaho Help.
  • Target Directory: Path of the directory to import into.
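
As an illustration of how the Quick Setup fields relate to the standard Sqoop import arguments (the schema, table, and directory names below are made up), a Table value of SALES.CUSTOMER_ORDERS and a Target Directory of /user/pentaho/customer_orders roughly correspond to:

    --table SALES.CUSTOMER_ORDERS --target-dir /user/pentaho/customer_orders

The Database Connection supplies the equivalent of the --connect, --username, and --password arguments.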

Open File

  • Open from Folder: Indicates the path and name of the MapRFS or HDFS directory you want to browse. This directory becomes the active directory.
  • Up One Level: Displays the parent directory of the active directory shown in the Open from Folder field.
  • Delete: Deletes a folder from the active directory.
  • Create Folder: Creates a new folder in the active directory.
  • Active Directory Contents (no label): Displays the contents of the active directory, which is the directory listed in the Open from Folder field.
  • Filter: Applies a filter to the results displayed in the active directory contents.

Advanced Setup

  • Default/List view: List of property and value pair settings that can be modified to suit your needs, including options to configure an import to Hive or HBase.
  • Command line view: Field that accepts command line arguments, typically used to paste an existing Sqoop command line.
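
The Hive and HBase options mentioned above map onto standard Sqoop arguments, so a sketch of what you might set, either as properties in the Default/List view or pasted into the command line view, could look like the following. The Hive table, HBase table, column family, and row key names are placeholders; use the first line for a Hive import or the second for an HBase import.

    --hive-import --hive-table default.orders
    --hbase-table orders --column-family cf1 --hbase-row-key order_id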

Additional Instructions

This section contains additional instructions for using this job entry.

Import Sqoop Table to an HBase Table (MapR 3.1)

If you want to run a job that uses Sqoop to import data to an HBase table on a MapR 3.1 (or higher) secured cluster, you will need to specify the path to the MapR security JAR in the Sqoop Import job entry.

  1. Add the Sqoop Import entry for your job.
  2. Select the Advanced Options link, then click the List View icon. 
  3. Set the HBase and other arguments needed for your job.
  4. Click the Command Line View icon.
  5. In the Command Line field, set the -libjars parameter to the path of the MapR security JAR. The path you enter depends on whether you plan to run the job locally on the Spoon node or remotely on the DI Server node.
  • Local (Spoon) Node Path: -libjars plugins/pentaho-big-data-plugin/hadoop-configurations/mapr31/lib/pentaho-hadoop-shims-mapr31-security-<version number>.jar 
  • Remote (DI Server) Node Path:  -libjars ../../pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/mapr31/lib/pentaho-hadoop-shims-mapr31-security-<version number>.jar

Note: Replace <version number> with the version number of the pentaho-hadoop-shims-mapr31-security jar file that is on your system. 

  6. Click OK to close the Sqoop Import entry.
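
Putting these steps together, a command line view entry for a local (Spoon) run might look something like the sketch below. The connection string, credentials, table name, column family, and row key are placeholder values, and <version number> still needs to be replaced as described in the note above.

    sqoop import -libjars plugins/pentaho-big-data-plugin/hadoop-configurations/mapr31/lib/pentaho-hadoop-shims-mapr31-security-<version number>.jar --connect jdbc:mysql://dbserver:3306/sales --username sqoopuser --password sqooppwd --table ORDERS --hbase-table orders --column-family cf1 --hbase-row-key order_id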

Import Sqoop Table to Hadoop Cluster

If you want to run a job that uses Sqoop to import data to an HBase or Hive table on a Hadoop cluster, you may need to copy the required Kite libraries to HDFS and reference them in the Sqoop Import job entry.

  1. Add the Sqoop Import entry for your job.
  2. Select the Advanced Options link, then click the List View icon. 
  3. Set the arguments needed for your job.
  4. If needed, copy these Kite libraries to an HDFS location that the job has read access to:

    If you are using Parquet:
    • kite-data-core.jar
    • kite-data-mapreduce.jar
    • kite-hadoop-compatibility.jar

    If you are using Avro:
    • nameofavrolibrary.jar
    • kite-data-core.jar (if necessary)
    • kite-data-mapreduce.jar (if necessary)
    • kite-hadoop-compatibility.jar (if necessary)

    If you are using Cloudera, Hortonworks, or MapR:
    • kite-data-core.jar (if necessary)
    • kite-data-mapreduce.jar (if necessary)
    • kite-hadoop-compatibility.jar (if necessary)
  5. In the Sqoop Import entry's Default/List view, set the libjars property to the location of the Kite libraries (see the example after these steps).
    For example, for Cloudera, it would look something like this:

    hdfs://dbmaster:8020/tmp/sqoop/libs/kite-data-core-1.0.0-cdh<version number>.jar,hdfs://dbmaster:8020/tmp/sqoop/libs/kite-data-mapreduce-1.0.0-cdh<version number>.jar,hdfs://dbmaster:8020/tmp/sqoop/libs/kite-hadoop-compatibility-1.0.0-cdh<version number>.jar

Note: Replace <version number> with the version number of the CDH jar file that is on your system.

  6. Click OK to close the Sqoop Import entry.
  7. Test the job.
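
For example, a command line view equivalent of the Cloudera/Parquet case above might look like the sketch below. The connection details and table name are placeholder values, the Kite JAR paths are the HDFS locations from step 5, and <version number> still needs to be replaced with the version on your system.

    sqoop import -libjars hdfs://dbmaster:8020/tmp/sqoop/libs/kite-data-core-1.0.0-cdh<version number>.jar,hdfs://dbmaster:8020/tmp/sqoop/libs/kite-data-mapreduce-1.0.0-cdh<version number>.jar,hdfs://dbmaster:8020/tmp/sqoop/libs/kite-hadoop-compatibility-1.0.0-cdh<version number>.jar --connect jdbc:mysql://dbserver:3306/sales --username sqoopuser --password sqooppwd --table ORDERS --target-dir /user/pentaho/orders --as-parquetfile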