Advanced Users FAQ

Are there other steps/job entries available?

Q: Are there are other steps/job entries available than the default ones that come with PDI?

A: Yes, many of the publicly available ones are listed here, or are available via the Pentaho Marketplace.

Plugins may have different licenses or pricing, please review each plugin for details.

If you've created a plug-in and you want to put it in the list as well, consider reviewing How to Contribute via the Pentaho Marketplace.

Strings bigger then defined String length

Q: I'm trying to use the Select values step to set the max length for a field. But I'm still seeing longer strings getting through, and this leads to SQL errors further on.

Isn't the Select Values step supposed to truncate strings to the correct size? Or must I truncate the string in some other way?

If so, what's the 'best' other way?

A: A transformation is all about metadata really. Early in the game the original Kettle developer decided not to annoy users with data type "errors" like these. "Select Values" too is all about metadata, not the data itself.

However, if you want to "truncate" a field, you are going to have to use the "Strings cut" step or implement some Javascript logic to force it to the correct length.

var string = originalfield.getString();
if ( string.length() > 50 )
{
    string = string.substring(0,50);
}

Decimal point doesn't show in .csv output

Q: In a database table I have a single decimal column containing the following contents:

100.23
100.20
100.00
100.00

I output this table via a Text File output step (with extension .csv) and open it with Excel, I get the following:

100.23
100.20
100
100

How do I fix this and add the ".00" to the last 2 rows?

A: In the Text File Output step you have to use a format of "#.00" for the column. This will get you the a precision after the decimal point of 2 decimals on all of the values in the output file, if you open it with Notepad.

If you open the .csv file with Excel the ".00" will still not be there, the problem is that .csv output when opened in Excel uses a default formatting which hides the ".00". There is an Excel Writer step that can do some basic Excel formatting.

Function call returning boolean fails in Oracle

Q: The return type Boolean for a db function gives an error "Invalid column type" in Pentaho Data Integration. My is as follows:

FUNCTION test(id number) RETURN BOOLEAN IS
BEGIN
	return true;
END

A: It's an Oracle thing. That sort of return value is only valid in a PL/SQL block. This is because an Oracle table can't contain a boolean data type (boolean is not part of the SQL standard). It's suggested that you return a varchar with 'true' / 'false' in it (or 'Y' / 'N'). If you then set the convert the data type to boolean you might find that you will get a boolean.

For a reference from the Oracle manuals on this behaviour:

It is not feasible for Oracle JDBC drivers to support calling arguments or return values of the PL/SQL RECORD, BOOLEAN, or table with non-scalar element types. However, Oracle JDBC drivers support PL/SQL index-by table of scalar element types. For a complete description of this, see Chapter 11, "Accessing PL/SQL Index-by Tables"

As a workaround to PL/SQL RECORD, BOOLEAN, or non-scalar table types, create wrapper procedures that handle the data as types supported by JDBC. For example, to wrap a stored procedure that uses PL/SQL booleans, create a stored procedure that takes a character or number from JDBC and passes it to the original procedure as BOOLEANor, for an output parameter, accepts a BOOLEAN argument from the original procedure and passes it as a CHAR or NUMBER to JDBC. Similarly, to wrap a stored procedure that uses PL/SQL records, create a stored procedure that handles a record in its individual components (such as CHAR and NUMBER) or in a structured object type. To wrap a stored procedure that uses PL/SQL tables, break the data into components or perhaps use Oracle collection types.

Difference between variables/arguments in launcher

Q: When running a Transformation, the dialog has two tables, one for Arguments and one for Variables. What is the difference between the two?

A: Arguments are command line arguments that you would normally specify during batch processing (via Pan). Variables are environment or PDI variables that you would normally set in a previous transformation in a job or in the Operating System.

How to use database connections from repository

Q: If you create a new database connection directly in the repository explorer, how can use it in a new or existing transformation?

A: Create a new transformation (or job) or close and re-open the ones you have loaded in Spoon.

How to systematically optimize a slow transformation

Q: I've got a transformation that doesn't run fast enough, but it is hard to tell in what order to optimize the steps. What should I do?

A: Transformations stream data through their steps.
That means that the slowest step is going to determine the speed of a transformation.

So you optimize the slowest steps first. How can you tell which step is the slowest: look at the size of the input buffer in the Step Metrics view. A slow step will have consistently large input buffer sizes. A fast step will consistently have low input buffer sizes.  In Spoon, a running transformation will also highlight the step icon's border for any steps that may be considered bottlenecks.

How to use JNDI?

Q: We will be using PDI integrated in a web application deployed on an application server. We've created a JNDI datasource in our application server. The Spoon application doesn't run in an application server, so how can we use the JNDI data source in PDI while developing jobs and transformation?

A: If you look in the PDI main directory you will see a sub-directory "simple-jndi", which contains a file called "jdbc.properties". You should change this file so that the JNDI information matches the one you use in your application server.

After that you set in the connection tab of Spoon the "Method of access" to JNDI, the "Connection type" to the type of database you're using. And "Connection name" to the name of the JDNI datasource (as used in "jdbc.properties").

The Simple-JNDI layer does not create true JNDI connections, so the behavior of connections in Spoon may differ slightly compared to an Application Server running the same PDI job or transformation.

About inserting booleans into a MySQL database

Q: How do you insert booleans into a MySql database, PDI encodes a boolean as 'Y' or 'N' and this can't be insert into a BIT(1) column in MySql.

A: BIT is not a standard SQL data type. It's not even standard on MySQL as the meaning (core definition) changed from MySQL version 4 to 5.

Also a BIT uses 2 bytes on MySQL. That's why in PDI we made the safe choice and went for a char(1) to store a boolean.

There is a simple workaround available: change the data type with a Select Values step to "Integer" in the metadata tab. This converts it to 1 for "true" and 0 for "false", just like MySQL expects.

How do Length and Precision affect numbers? (Was: Calculator ignores result type on division)

Q: I made a transformation using A/B in a calculator step and it rounded wrong: the 2 input fields are integer but my result type was Number(6, 4) so I would expect the integers to be cast to Number before executing the division.

If I wanted to execute e.g. 28/222, I got 0.0 instead of 0.1261 which I expected. So it seems the result type is ignored. If I change the input types both to Number(6, 4) I get as result 0.12612612612612611 which still ignores the result type (4 places after the comma).

Why is this?

A: Length & Precision are just metadata pieces.

If you want to round to the specified precision, you should do this in another step. However: please keep in mind that rounding double point precision values is futile anyway. A floating point number is stored as an approximation (it floats) so 0.1261 (your desired output) could (would probably) end up being stored as 0.126099999999 or 0.1261000000001 (Note: this is not the case for BigNumbers)

So in the end we round using BigDecimals once we store the numbers in the output table, but NOT during the transformation. The same is true for the Text File Output step. If you would have specified Integer as result type, the internal number format would have been retained, you would press "Get Fields" and it the required Integer type would be filled in. The required conversion would take place there and then.

In short: we convert to the required metadata type when we land the data somewhere, NOT BEFORE.

How do I use the Apache VFS capability to read different types of compressed files?

Q: The Text File Input step has a Compression option that allows you to select Zip or Gzip, but it will only read the first file in Zip. How can I use Apache VFS support to handle tarballs or multi-file zips?

A: The catch is to specifically restrict the file list to the files inside the compressed collection.  Some examples:

You have a file with the following structure:

access.logs.tar.gz
    access.log.1
    access.log.2
    access.log.3

To read each of these files in a File Input step:

File/Directory

Wildcard

tar:gz:/path/to/access.logs.tar.gz!/access.logs.tar!

.+

 Note: If you only wanted certain files in the tarball, you could certainly use a wildcard like access.log..* or something.  .+ is the magic if you don't want to specify the children filenames.  .* will not work because it will include the folder (i.e. tar:gz:/path/to/access.logs.tar.gz!/access.logs.tar!/ )

You have a simpler file, fat-access.log.gz.  You could use the Compression option of the File Input step to deal with this simple case, but if you wanted to use VFS instead, you would use the following specification:

File/Directory

Wildcard

gz:file://c:/path/to/fat-access.log.gz!

.+

Finally, if you have a zip file with the following structure:
access.logs.zip/
    a-root-access.log
    subdirectory1/
        subdirectory-access.log.1
        subdirectory-access.log.2
    subdirectory2/
        subdirectory-access.log.1
        subdirectory-access.log.2

You might want to access all the files, in which case you'd use:

File/Directory

Wildcard

zip:file://c:/path/to/access.logs.zip!

a-root-access.log

zip:file://c:/path/to/access.logs.zip!/subdirectory1 (file://c:/path/to/access.logs.zip\!/subdirectory1)

subdirectory-access.log.*

zip:file://c:/path/to/access.logs.zip!/subdirectory2 (file://c:/path/to/access.logs.zip\!/subdirectory2)

subdirectory-access.log.*

Note: For some reason, the .+ doesn't work in the subdirectories, they still show the directory entries. :/