Hadoop Copy Files

PLEASE NOTE: This documentation applies to an earlier version. For the most recent documentation, visit the Pentaho Enterprise Edition documentation site.

This job entry copies files in a Hadoop cluster from one location to another.

General

| Option | Definition |
|---|---|
| Include Subfolders | If selected, all subdirectories within the chosen directory are copied as well. |
| Destination is a file | Determines whether the destination is a file or a directory. |
| Copy empty folders | If selected, all directories are copied, even if they are empty. The Include Subfolders option must be selected for this option to take effect. |
| Create destination folder | If selected, the specified destination directory is created if it does not already exist. |
| Replace existing files | If selected, files that already exist in the destination directory are overwritten. |
| Remove source files | If selected, the source files are removed after the copy, which turns the copy into a move. |
| Copy previous results to args | If selected, the results of the previous step are used as the sources and destinations. |
| File/folder source | The file or directory to copy from. Click Browse and select Hadoop to enter your Hadoop cluster connection details. |
| File/folder destination | The file or directory to copy to. Click Browse and select Hadoop to enter your Hadoop cluster connection details. |
| Wildcard (RegExp) | Defines the files to copy as a regular expression instead of a static file name. For instance, `.*\.txt` matches any file with a .txt extension. |
| Files/folders | A list of the selected sources and destinations. |
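
The Wildcard (RegExp) pattern can be tried against sample file names before it is entered in the dialog. The following is a minimal shell sketch, not the job entry's own matching code. It assumes that the expression must match the entire file name, and it uses grep's extended regular expressions, which can differ from the Java regular expression dialect in some details. The helper name `matches_wildcard` and the sample file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the whole file name matches the pattern.
# grep -x requires the expression to match the entire line (assumed here to
# mirror a full-name match); -E selects extended regular expressions.
matches_wildcard() {
  # $1 = regular expression, $2 = file name
  printf '%s\n' "$2" | grep -Eqx -- "$1"
}

pattern='.*\.txt'

# Sample names are made up for illustration.
for name in report.txt notes.txt.bak data.csv; do
  if matches_wildcard "$pattern" "$name"; then
    echo "$name: would be copied"
  else
    echo "$name: would be skipped"
  fi
done
```

With this pattern only report.txt is selected: notes.txt.bak does not end in .txt, and data.csv has a different extension.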

Result files name

| Option | Definition |
|---|---|
| Add files to result files name | If selected, every file that is copied is added to the result of this step, so the result lists the files that were copied in this step. |

Notes

When Kerberos security is not in use, the Hadoop API used by this step sends the username of the logged-in user when it copies the files, regardless of the username entered in the connect field. To change the user, you must set HADOOP_USER_NAME. You can do this in spoon.bat or spoon.sh by adding it to the OPT variable:

OPT="$OPT .... -DHADOOP_USER_NAME=HadoopNameToSpoof"
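
If the launcher script is edited more than once, the override could end up in OPT twice. The sketch below shows one way to append it only when it is missing. It is an illustration for spoon.sh only, not part of the shipped script: the function name `add_hadoop_user`, the user name `etl_user`, and the starting value of OPT are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append -DHADOOP_USER_NAME=<user> to an option string,
# unless an override is already present in it.
add_hadoop_user() {
  # $1 = existing OPT value, $2 = Hadoop user name to send
  case "$1" in
    *-DHADOOP_USER_NAME=*) printf '%s\n' "$1" ;;
    *) printf '%s\n' "$1 -DHADOOP_USER_NAME=$2" ;;
  esac
}

# Placeholder starting value; a real spoon.sh builds a longer OPT string.
OPT="-Xmx512m"
OPT=$(add_hadoop_user "$OPT" "etl_user")
echo "$OPT"
```

Calling the helper a second time leaves OPT unchanged, so the edit is safe to repeat.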