What's new in PDI version 3.2

Introduction


When you compare this release to the previous one, you will see that the changes are more evolutionary than revolutionary.  Even so, there have been a large number of changes for a minor version increase.  The main focus of the release is once again stability and usability.  We went through a long list of pet peeves and common misunderstandings and simply solved them, either through new features or by modifying existing ones.  On top of that, we worked on the clustering side to make that mode "cloud-ready" with dynamic clustering, making it more solid and adding new features along the way.

Once again, many many thanks go to our large community of Kettle enthusiasts for all the help they provided to make this release another success.

General changes

  • Visual changes
    • Hop color scheme with mini-icons and tooltips (note: tooltips are currently not available on OS X)
    • Error icons shown on steps after a run with errors (note: the error details tooltip is currently not available on OS X)
    • Visual feedback when reading from info steps
    • Visual feedback when writing to target steps that run in multiple copies
    • New step categories
    • Step filter in the step tree toolbar
    • Long-standing bug attack
    • Long-standing wish list attack
    • Resource Exporter to export transformations or complete jobs, including the resources they use (sub-transformations and sub-jobs), to a single ZIP file.
    • Named Parameters
      • Jobs and transformations can now define parameters with default values that are available at runtime as variables.  This makes it easy to configure a job or transformation dynamically from the command line, for example specifying a date range to process with the default being yesterday.  (See the command-line sketch after this list.)
    • Dynamic clustering
      • Instead of having to configure all of the slaves that a transformation will be executed on in clustered mode, you can run Carte slaves in dynamic mode, configuring them to register with one or more masters when they start up.  The clustered transformation is configured with a list of the masters it can run on.  When the transformation is executed, it goes down that list, attempting to submit the work to each master until one accepts it.  That master then executes the transformation using all of the currently available slaves that are registered with it.  (A configuration sketch follows this list.)
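
To make the named parameters concrete, here is a minimal command-line sketch using Kitchen.  The job file and the START_DATE/END_DATE parameters are hypothetical examples; a parameter that is not specified on the command line falls back to its default value:

sh kitchen.sh -file=process_sales.kjb "-param:START_DATE=2009-05-10" "-param:END_DATE=2009-05-11"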

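In the same spirit, here is a minimal sketch of what a slave configuration file for dynamic mode can look like.  All names, host names, ports and credentials below are illustrative example values, not a reference:

<slave_config>
  <!-- the master this slave registers with on startup -->
  <masters>
    <slaveserver>
      <name>master1</name>
      <hostname>master.example.com</hostname>
      <port>8080</port>
      <username>cluster</username>
      <password>cluster</password>
      <master>Y</master>
    </slaveserver>
  </masters>
  <report_to_masters>Y</report_to_masters>
  <!-- the definition of this slave server itself -->
  <slaveserver>
    <name>dyn-slave-1</name>
    <hostname>slave1.example.com</hostname>
    <port>8081</port>
    <username>cluster</username>
    <password>cluster</password>
  </slaveserver>
</slave_config>

The slave would then be started with something like: sh carte.sh slave-config.xml
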
Step changes

New steps

  • Analytic Query : get information from previous/following rows
  • User Defined Java Expression : evaluate Java expressions, in-line compiled for maximum performance (see the examples after this list)
  • Formula step : promoted from a plug-in to a native step
  • Synchronize after merge : performs updates, inserts or deletes in a database depending on a flag in the incoming data
  • SalesForce Input : reads information from SalesForce (promoted from a plug-in to a native step)
  • Replace in string : replace values in strings
  • Strings cut : cut strings down to size
  • If field value is null : ... then set default values per type or per field
  • Mail : send e-mails all over the globe
  • Process files : copy, move or delete files
  • Identify last row in a stream : sets a flag when the last row in a stream was reached
  • Credit card validator : validates a credit card number, extracts information
  • Mail validator : validates an e-mail address
  • Reservoir sampling : promoted from a plug-in to a native step
  • Univariate statistics : promoted from a plug-in to a native step + upgrade
  • LucidDB Bulk loader : high performance bulk loader for the LucidDB column database
  • Unique Rows by HashSet : allows de-duping a stream without having to sort it first.  Requires enough memory to store the set of unique keys.
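
To give an idea of what the User Defined Java Expression step accepts, here are a few illustrative expressions.  The field names (firstname, quantity, and so on) are made up for the example; any expression that is valid as the right-hand side of a Java assignment can be used, and it is compiled once when the transformation initializes:

firstname + " " + lastname                      // concatenate two string fields
quantity * unitprice * (1 - discount)           // numeric calculation over three fields
total > 1000 ? "large order" : "regular order"  // conditional expression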

Updated steps

  • Table Output: ability to specify the fields to insert
  • Calculator : all sorts of new calculations, string manipulations, etc.
  • Java script values : ability to replace values + improved script testing
  • Database lookup : pre-load cache option (load all values in memory)
  • Dimension Lookup/Update:
    • Cache pre-load (load all dimension entries in memory)
    • Support for alternative start of date range scenarios
    • Support for timestamp columns (last update/insert/both)
    • Support for current version column
  • Combination Lookup/Update : support for last update timestamp column
  • Data validator:
    • New option to report all errors, not only the first
    • Ability to read data from another step
  • Group By: added support for cumulative sum and average (a small worked example follows this list)
  • Text File Input: New option to pass through additional fields from previous step (removing the need to do a Cartesian join)
  • Mapping:
    • Inherit all variables from parent transformation
    • Allow setting of variables in mapping
    • Allow preview of mapping output
    • Improved logging
    • New "Open mapping" option in transformation graph right click

Job entry changes

New job entries

  • Simple evaluation : Compare a value from previous result or a variable with a given value.
  • Evaluate rows number in a table : Check whether the number of rows returned by a query is identical to, different from, greater than or smaller than a given value.  Can also store those rows in the result for later use.
  • Check if connected to the repository : See if there is a live connection
  • Export repository to XML file : exports the entire repository to an XML file

Databases

  • Boolean support
  • Default schema name support
  • Support for the InfoBright database

Repository

  • Preview repository creation/upgrade SQL before execution
  • Schema name support via the database dialog
  • Code revision and code hardening

Internationalization


In the i18n department, the overall winners were again the French (super Samatar Hassan) and Italian (the great Nico Ben) teams.  Several contributors have complimented them for translating keys (or complete steps that got added) in less than a day!  The Korean (Kim YoungWoo) and Japanese (Hiroyuki Kawaguch) translators continue to do an excellent job of catching up.

Here is an overview of the translation status:

Language   % Complete
en_US      100%
it_IT      100%
fr_FR      100%
ko_KR       78%
es_AR       55%
ja_JP       53%
zh_CN       49%
de_DE       42%
es_ES       36%
nl_NL       13%
pt_BR       11%
pt_PT       11%

Community and codebase

Codebase

Even though we try our best to refactor and simplify the codebase all the time, there is no denying that it keeps growing.
Right before every release we run the following command:

find . -name "*.java" -exec wc -l {} \; | awk '{ sum+=$1 } END { print sum }'

This is what that command gave us over the last few releases:

Version   Lines of code   Increase
2.1.4           160,000
2.2.2           177,450     17,450
2.3.0           213,489     36,039
2.4.0           256,030     42,541
2.5.0           292,241     36,211
3.0.0           348,575     56,334
3.1.0           456,772    108,197
3.2.0           529,277     72,505


Libraries

As with every release since version 3.0, we have split off a new library, this time kettle-db.jar.  The total library portfolio of Pentaho Data Integration now includes:

Filename            Description                                                              Dependencies
kettle-core.jar     A small set of core classes and utilities for the Kettle environments   none
kettle-db.jar       Database related classes                                                 kettle-core
kettle-engine.jar   The transformation and job runtime engines                               kettle-core, kettle-db
kettle-ui-swt.jar   The UI classes: Spoon, dialogs, etc.                                     kettle-core, kettle-db, kettle-engine



Matt Casters - Okegem/Belgium - May 11th 2009