Hadoop File Output

PLEASE NOTE: This documentation applies to an earlier version. For the most recent documentation, visit the Pentaho Enterprise Edition documentation site.

Description

The Hadoop File Output step exports data to text files stored on a Hadoop cluster. It is commonly used to generate comma-separated values (CSV) files that can be read by spreadsheet applications. You can also generate fixed-width files by setting lengths on the fields in the Fields tab.
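As a rough illustration of the two layouts mentioned above, the following Python sketch renders the same row as a delimited line and as a fixed-width line. The row, the field lengths, and both helper functions are hypothetical and are not part of the step. Left-justified space padding is an assumption, so the step's actual padding may differ.

```python
# Hypothetical row and field lengths, for illustration only.
row = {"id": 7, "name": "Ada", "city": "Paris"}
lengths = {"id": 5, "name": 10, "city": 8}


def delimited_line(row, separator=","):
    """Join the field values with a separator (CSV-style)."""
    return separator.join(str(value) for value in row.values())


def fixed_width_line(row, lengths):
    """Pad or cut each value to its configured length.

    Left-justified space padding is an assumption of this sketch.
    """
    parts = []
    for name, length in lengths.items():
        text = str(row[name])
        parts.append(text.ljust(length)[:length])
    return "".join(parts)


print(delimited_line(row))
print(repr(fixed_width_line(row, lengths)))
```

In the fixed-width form, every line has the same total length (here 5 + 10 + 8 = 23 characters), so a reader can locate fields by position rather than by separator.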

Options

These tables describe all available Hadoop File Output options. 

File Tab

The File tab is where you define basic properties of the file being created.

Option

Description

Step name

Optionally, you can change the name of this step to fit your needs. Every step in a transformation must have a unique name.

Hadoop Cluster

Allows you to create, edit, and select a Hadoop cluster configuration for use. Hadoop cluster configuration settings can be reused in transformation steps and job entries that support this feature. In a Hadoop cluster configuration, you can specify information such as host names and ports for HDFS, Job Tracker, and other big data cluster components. The Edit button allows you to edit Hadoop cluster configuration information. The New button allows you to add a new Hadoop cluster configuration. Information on Hadoop clusters can be found in Pentaho Help.

Folder/File

Specifies the location and/or name of the text file to which to write. Click Browse to launch the Open File window and to navigate to the file or folder.

Create Parent Folder

Indicates whether a parent folder should be created for the file when it is copied.

Do not create file at start

Enable to avoid creating empty files when no rows are processed.

Accept file name from field?

Enables you to specify the file name(s) in a field in the input stream.

File name field

When the previous option is enabled, you can specify the field that contains the filename(s) at runtime.

Extension

Adds a period and the extension to the end of the file name (for example, .txt).

Include stepnr in filename

If you run the step in multiple copies (see Launching several copies of a step), the copy number is included in the file name before the extension (for example, _0).

Include partition nr in file name?

Includes the data partition number in the file name.

Include date in file name

Includes the system date in the file name (for example, _20101231).

Include time in file name

Includes the system time in the file name (for example, _235959).

Specify Date time format

Allows you to select the date time format from the Date time format dropdown list.

Date time format

Dropdown list of date format options.

Show file name(s)

Displays a list of the files that are generated. This is a simulation and depends on the number of rows that go into each file.

Add filenames to result

This adds the filename to the internal file result set.
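To make the file name options above more concrete, here is a minimal Python sketch of how the suffixes might combine into a final name. The function `build_file_name`, its parameters, and the example path are hypothetical. The order of the suffixes (date, time, copy number, partition number) is an assumption of this sketch and not a statement about the step's implementation.

```python
from datetime import datetime


def build_file_name(base, extension="txt", copy_nr=None, partition_nr=None,
                    now=None, add_date=False, add_time=False):
    """Compose an output file name from a base name and optional suffixes.

    The suffix order used here is assumed for illustration only.
    """
    name = base
    if add_date and now is not None:
        name += "_" + now.strftime("%Y%m%d")   # e.g. _20101231
    if add_time and now is not None:
        name += "_" + now.strftime("%H%M%S")   # e.g. _235959
    if copy_nr is not None:
        name += "_" + str(copy_nr)             # step copy number, e.g. _0
    if partition_nr is not None:
        name += "_" + str(partition_nr)        # data partition number
    if extension:
        name += "." + extension                # period plus extension
    return name


stamp = datetime(2010, 12, 31, 23, 59, 59)
print(build_file_name("/user/pentaho/output/sales", copy_nr=0,
                      now=stamp, add_date=True, add_time=True))
# prints /user/pentaho/output/sales_20101231_235959_0.txt
```

Use Show file name(s) in the step dialog to check the names the step will actually generate.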

Open File

Option

Definition

Open from Folder

Indicates the path and name of the directory you want to browse.  This directory becomes the active directory.

Up One Level

Displays the parent directory of the active directory shown in the Open from Folder field.

Delete

Deletes a folder from the active directory.

Create Folder

Creates a new folder in the active directory.

Name

Displays the active directory, which is the one that is listed in the Open from Folder field.

Filter

Applies a filter to the results displayed in the active directory contents.

Content Tab

The Content tab contains these options for describing the content being written.

Option

Description

Append

Enable to append lines to the end of the specified file.

Separator

Specifies the character that separates the fields in a single line of text. Typically this is semicolon ( ; ) or a tab.

Enclosure

A pair of strings can enclose some fields. This allows separator characters in fields. The enclosure string is optional.

Force the enclosure around fields?

Forces all fields to be enclosed with the character specified in the Enclosure option above.

Header

Enable this option if you want the text file to have a header row (first line in the file).

Footer

Enable this option if you want the text file to have a footer row (last line in the file).

Format

Can be either DOS or UNIX. UNIX files have lines separated by line feeds; DOS files have lines separated by carriage returns and line feeds.

Encoding

Specify the text file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, Spoon searches your system for available encodings.

Compression

Specify the type of compression (.zip or .gzip) to use when compressing the output. Only one file is placed in a single archive.

Fast data dump (no formatting)

Improves the performance when dumping large amounts of data to a text file by not including any formatting information.

Split every ... rows

If the number N is greater than zero, splits the resulting text file into multiple parts of N rows each.

Add Ending line of file

Allows you to specify an alternate ending row to the output file.
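The interaction between the Separator, Enclosure, Force the enclosure around fields?, Header, and Format options can be sketched in Python as follows. The functions `format_field` and `render_lines` are hypothetical. This sketch treats every value as text, encloses a field automatically when it contains the separator, and doubles embedded enclosure characters. All three behaviors are assumptions and may not match the step exactly.

```python
def format_field(value, separator=";", enclosure='"', force_enclosure=False):
    """Render one field, enclosing it when forced or when it needs protection."""
    text = str(value)
    if not enclosure:
        return text  # the enclosure string is optional
    contains_special = (bool(separator) and separator in text) or enclosure in text
    if force_enclosure or contains_special:
        # Doubling embedded enclosure characters is an assumption of this sketch.
        return enclosure + text.replace(enclosure, enclosure * 2) + enclosure
    return text


def render_lines(names, rows, separator=";", enclosure='"',
                 force_enclosure=False, header=True, file_format="UNIX"):
    """Build the file content as one string."""
    # DOS lines end with carriage return + line feed, UNIX lines with line feed.
    newline = "\r\n" if file_format == "DOS" else "\n"

    def line(values):
        return separator.join(
            format_field(v, separator, enclosure, force_enclosure) for v in values
        )

    lines = [line(names)] if header else []
    lines.extend(line(row) for row in rows)
    return "".join(text + newline for text in lines)


content = render_lines(["id", "note"], [[1, "a;b"], [2, "plain"]])
print(repr(content))
```

Here the value `a;b` is enclosed because it contains the separator, while `plain` is written as-is unless `force_enclosure` is set.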

Fields Tab

The Fields tab is where you define properties for the fields being exported. The table below describes each of the options for configuring the field properties:

Option

Description

Name

The name of the field

Type

The type of the field, which can be String, Date, or Number.

Format

The format mask to convert with. See Number Formats for a complete description of format symbols.

Length

The length option depends on the field type as follows:
Number - total number of significant figures in a number
String - total length of the string
Date - length of the printed output of the string (for example, 4 returns the year)

Precision

The precision option depends on the field type as follows:
Number - number of digits after the decimal point
String - unused
Date - unused

Currency

Symbol used to represent currencies, like $10,000.00 or €5.000,00.

Decimal

A decimal point can be a "." (10,000.00) or "," (5.000,00)

Group

A grouping can be a "," (10,000.00) or "." (5.000,00)

Trim type

The trimming method to apply to the string. Trimming only works when no field length is given.

Null

If the value of the field is null, insert this string into the text file.

Get

Click to retrieve the list of fields from the input stream(s).

Minimal width

Changes the options in the Fields tab so that the resulting width of lines in the text file is minimal. For example, instead of writing 0000001, the step writes 1. String fields are no longer padded to their specified length.
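The effect of the Precision, Decimal, Group, and Null options on a Number field can be illustrated with a short Python sketch. The function `format_number` and its parameters are hypothetical. It uses Python's own format specification rather than the format masks described under Number Formats, so it only approximates what the step produces.

```python
def format_number(value, precision=2, decimal=".", group=",", null_text=""):
    """Format a number with configurable decimal and grouping symbols.

    Returns null_text when the value is missing (None).
    """
    if value is None:
        return null_text
    # Python always emits "," for grouping and "." for the decimal point here.
    text = f"{value:,.{precision}f}"
    # Swap the symbols through a placeholder so they do not overwrite each other.
    return text.replace(",", "\x00").replace(".", decimal).replace("\x00", group)


print(format_number(10000))                           # prints 10,000.00
print(format_number(5000, decimal=",", group="."))    # prints 5.000,00
print(format_number(None, null_text="NULL"))          # prints NULL
```

The placeholder step matters because replacing "," with "." directly would make the grouping and decimal symbols indistinguishable before the second replacement runs.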

Metadata Injection Support (7.x and later)

All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.