Development Guidelines

Note: More extensive and up-to-date information about plug-in development can be found in the PDI SDK under "Embedding and Extending PDI Functionality".

Priorities in development

Correctness/Consistency

If a tool is not correct, it is not going to be trusted, however fast it may be. It can't be that the same input produces output A in one case and output B in another.

Backwards compatibility

Everyone likes upgrades to go smoothly. Installing the new binaries and being able to run without testing is the ideal. Of course, in some cases compatibility has to be broken for the greater good in the long term, but this should then be clearly documented (for upgrades).

Speed

There is a need for speed. No one wants to wait 30 minutes to insert 100,000 rows.

User friendliness

It should not be a torment to use a tool. It should allow both novice and expert users to get their job done. As an example: any XML or configuration file should have a GUI element to manage it and should never need to be edited manually.

Create JIRA cases

For all your bug fixes, feature implementations and translation efforts, please create a JIRA case: http://jira.pentaho.org/browse/PDI

Then please mention the case number (for example PDI-9999) in your commit message (PDI in uppercase) to help us keep track of the changes. JIRA automatically links the Subversion commits to the JIRA case this way.

Use English in the source code

Since PDI is developed by an international group of people, use English in the source code for everything: identifiers, comments, etc.

Division of functionality in steps and job entries

One of the ideas of Pentaho Data Integration is to make simple steps and job entries which have a single purpose, and be able to make complex transformations by linking them together, much like UNIX utilities.

Putting too much (diverse) functionality in one step/job entry makes it less intuitive for people to use, and since most people only start reading manuals when they get into problems, we need all the intuitiveness we can get.

Rows on a single hop have to be of the same structure

As described in the user section of this document: all rows that flow over a single hop have to be of the same structure. You shouldn't try to build things that circumvent this; doing so is harder as of v2.5.0 because of the design-time check on the structure of the rows.
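As an illustration, the meta class of a step declares the structure of the rows it sends out. Below is a minimal sketch, assuming a hypothetical step that adds a single Number field called "score" (the exact getFields() signature differs between PDI versions):

public void getFields(RowMetaInterface row, String origin, RowMetaInterface[] info,
                      StepMeta nextStep, VariableSpace space) throws KettleStepException
{
    // Declare the added field once; every row leaving this step then has exactly this layout
    ValueMetaInterface score = new ValueMeta("score", ValueMetaInterface.TYPE_NUMBER);
    score.setOrigin(origin);
    row.addValueMeta(score);
}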

Null and "" are the same in PDI

As in Oracle, the empty string "" and NULL should be considered the same by all steps. This keeps the behavior consistent across the rest of PDI.
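In practice a step should never have separate branches for NULL and "". A minimal sketch, assuming a hypothetical String field customerName and using the Const.isEmpty() helper, which covers both cases:

// Treat NULL and the empty string "" identically when testing for a missing value
if (Const.isEmpty(customerName))
{
    // handle the missing value here; do not write separate logic for null and ""
}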

About converting data to fit the corresponding Metadata

Length & Precision are just metadata pieces.

If you want to round to the specified precision, you should do this in another step. However, please keep in mind that rounding double-precision floating point values is futile anyway. A floating point number is stored as an approximation (it floats), so 0.1261 (your desired output) could, and probably would, end up being stored as 0.126099999999 or 0.1261000000001. (Note: this is not the case for BigNumbers.)

So in the end we round using BigDecimals once we store the numbers in the output table, but NOT during the transformation. The same is true for the Text File Output step. If you had specified Integer as the result type, the internal number format would have been retained; you would press "Get Fields" and the required Integer type would be filled in. The required conversion takes place there and then.

In short: we convert to the required metadata type when we land the data somewhere, NOT BEFORE.

The strategy of keeping the same datatype as long as possible has saved us from many a conversion error like the one described above.
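The floating point effect is easy to demonstrate with plain Java. A minimal, self-contained sketch:

import java.math.BigDecimal;

public class PrecisionDemo
{
    public static void main(String[] args)
    {
        // The double literal is stored as a binary approximation,
        // so this prints a value close to, but not exactly, 0.1261
        System.out.println(new BigDecimal(0.1261));

        // A BigDecimal (a BigNumber in PDI terms) built from the decimal text is exact
        System.out.println(new BigDecimal("0.1261")); // prints 0.1261
    }
}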

About logging in steps in Pentaho Data Integration

Q: How do I use logging in PDI steps? This applies to the standard PDI steps as well as to any PDI steps you develop yourself.

A: All detailed, debug, and rowlevel logging statements need to be preceded by the following checks:

if (log.isDetailed())
{
    logDetailed(...);
}

if (log.isDebug())
{
    logDebug(...);
}

if (log.isRowlevel())
{
    logRowlevel(...);
}

That is because otherwise the string you send to the log is always constructed, whether it is logged or not. For the Basic and Minimal logging levels this doesn't matter, as those are normally always "on", but it does matter for Debug and Rowlevel.
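For example, inside a step's processRow() the concatenation below would otherwise be performed for every single row, even when row-level logging is switched off. A minimal sketch; the counter and name helpers come from BaseStep:

if (log.isRowlevel())
{
    // The string below is only built when row-level logging is actually enabled
    logRowlevel("Processing row " + getLinesRead() + " in step " + getStepname());
}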

Sometimes it is helpful to have the stack trace and more details (e.g. of the row, when applicable) in the log in case of errors. Do not log this to the console with e.printStackTrace(); instead use logError(Const.getStackTracker(e)).

To get the row into the log, use rowMeta.getString(row), but be aware that rowMeta and row must be correctly defined in case of an error; otherwise you will get an error while logging an error.

Example:

catch(Exception e)
{
    String message = Messages.getString("FilterRows.Exception.UnexpectedErrorFoundInEvaluationFuction");  //$NON-NLS-1$
    logError(message);
    logError(Messages.getString("FilterRows.Log.ErrorOccurredForRow")+rowMeta.getString(row)); //$NON-NLS-1$
    logError(Const.getStackTracker(e));
    throw new KettleException(message, e);
}

About using XML in Pentaho Data Integration

Q: What's the idea of using XML in PDI?

A: XML is for machines, not for humans. So unless the functionality is in fact about processing XML itself (XML Input step/XML Output step), the XML should be kept hidden. Behind the scenes XML will be used, but users of PDI should not be required to know this or to manipulate XML in any way. Every XML configuration/setup file should be managed through a GUI element in PDI.

About dropdown boxes and storing values

Don't use the index of the dropdown box as a value in the XML export file or database repository.

Suppose you currently have 3 possible values in a dropdown box, and when someone chooses the first value you put "1" in the XML export file to indicate it. This would work fine, except that:

  • if someone wants to add extra values in the future, they must keep the order you defined first;
  • it makes the XML output very hard to read.

It's better to convert the Locale-dependent string in the GUI to some English equivalent which is then stored. As an example:

  • Suppose the GUI has a dropdown box with the values "Date mask" and "Date time mask";
  • Instead of using 1 in the output for "Date mask" and 2 for "Date time mask", it would be better to put "DATE_MASK" in the output for "Date mask" and "DATE_TIME_MASK" for "Date time mask";
  • Also note that DATE_MASK/DATE_TIME_MASK must then not be subject to I18N translation (which is fine for transformation/job files).
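A minimal sketch of that convention (the array, key, and tag names are hypothetical): the code array is what gets persisted, the description array is what the user sees and may be translated.

// Codes: persisted in the XML/repository, never translated
public static final String[] maskTypeCodes = { "DATE_MASK", "DATE_TIME_MASK" };

// Descriptions: shown in the dropdown box, may be translated
public static final String[] maskTypeDescriptions = new String[] {
    Messages.getString("MyStepDialog.DateMask.Label"),     //$NON-NLS-1$
    Messages.getString("MyStepDialog.DateTimeMask.Label")  //$NON-NLS-1$
};

// When saving: look up the code that belongs to the selected description and persist that
int index = Const.indexOfString(selectedDescription, maskTypeDescriptions);
xml.append(XMLHandler.addTagValue("mask_type", maskTypeCodes[index]));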

About using I18N in PDI

Q: Some more details on using I18N

A:

  • Only translate what a normal user will see; it doesn't make sense to translate all debug messages in PDI. Some performance improvements were achieved in PDI just by removing some of the translations for debug messages;
  • Make sure you don't translate strings used in the control logic of PDI (see the sketch after this list):
  • If you would e.g. make the default name of a new step language dependent, jobs/transformations would still be usable across different locales;
  • If you would e.g. make the tags used in the XML generated for the step language dependent, there would be a problem when a user switches locale;
  • If you would translate non-tag strings used in the control logic, you will also have a problem. E.g. in the repository manager "Administrator" is used to indicate which user is the administrator (and this is used in the PDI control logic). If you would translate "Administrator" to a certain language, this would only work as long as you didn't switch locales.
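A short sketch of the difference, assuming a hypothetical step dialog label and a hypothetical "threshold" attribute:

// What the user sees: translate via the message bundle
wlThreshold.setText(Messages.getString("MyStepDialog.Threshold.Label")); //$NON-NLS-1$

// What the engine parses back: the XML tag stays fixed, whatever the locale
xml.append("    ").append(XMLHandler.addTagValue("threshold", threshold));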

About using Locales in PDI

PDI should always use the default Locale; the Locale should not be hardcoded to English or anything else. Some steps may choose to allow overriding the default Locale, but this is step specific and it should always be possible to select the Locale via the GUI of the step.
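A minimal sketch of that pattern, assuming a hypothetical meta attribute localeString that the step's dialog lets the user fill in (empty meaning "use the default"):

// Fall back to the platform default Locale unless the user picked one in the step dialog
Locale locale = Const.isEmpty(meta.getLocaleString())
    ? Locale.getDefault()
    : new Locale(meta.getLocaleString());

SimpleDateFormat formatter = new SimpleDateFormat("MMMM d, yyyy", locale);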

About reformatting source code

Try to keep reformatting of code to a minimum, especially for things like whether '{' goes at the end of the line or at the start of the next line; not everyone likes the same style, and there is no reason why your specific preference should be used. When changing code, try to use the same formatting as the surrounding code, even if it's not your preference.

If you really feel a need for some reformatting, do it in a separate SVN check-in; DON'T mix reformatting with real changes. It's VERY annoying not being able to easily see what changed because someone decided to reformat the source code at the same time. "My tool does automatic formatting" is a very lame excuse, as all known IDEs allow you to switch it off.

About checking persistence correctness

All steps and job entries need to implement persistence to XML format AND to the repository. Since saving to XML format and saving to the repository is done using separate methods, it sometimes happened that the meta-data would be saved properly in one format but not in the other. The following procedure does a basic check of whether data is properly saved. It's not a 100% check, but it will detect the most obvious mistakes.

Procedure part 1:

  • Make a job or a transformation in Spoon with only the step/job entry which is new or was changed. Fill in all values that you can;
  • Save it to XML format, close it in Spoon;
  • Reload it in Spoon;
  • Save it to XML format using a new name;
  • Take your favorite text compare tool and compare both XML files, all changes should be explainable.

If the latter test succeeds, it shows that saving/loading of XML has no obvious mistakes. If the meta-data can contain multiple mutually exclusive values, this diminishes the value of the test a bit, and you should maybe try some extra test cases.

Procedure part 2:

  • Load the original XML format of part 1 in Spoon;
  • Connect to a repository, and save the job/transformation in there;
  • Close it in Spoon;
  • Reload the job/transformation from the repository;
  • Save it to XML format;
  • Take your favorite text compare tool and compare both XML files, all changes should be explainable.

Part 1 shows that loading/saving to XML works. If part 1 succeeds but part 2 fails, chances are very high that something is wrong in the methods that load/save to the repository.
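Most of these mismatches come from adding an attribute to one persistence path and forgetting the other. Below is a minimal sketch of keeping getXML() and saveRep() in sync for one hypothetical attribute "separator" (method signatures vary between PDI versions):

public String getXML()
{
    StringBuilder xml = new StringBuilder();
    xml.append("    ").append(XMLHandler.addTagValue("separator", separator));
    return xml.toString();
}

public void saveRep(Repository rep, ObjectId idTransformation, ObjectId idStep) throws KettleException
{
    // Save exactly the same attributes, under the same names, as getXML() writes out
    rep.saveStepAttribute(idTransformation, idStep, "separator", separator);
}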

About using non-temporary storage

Don't implement functionality that stores data in non-temporary storage in a PDI-specific/internal format. By non-temporary we mean storage that survives a job or transformation execution. The reason for this is to avoid having to deal with format conversions.

Reason: suppose e.g. a step would serialize rows into a file to be read by a next transformation. If you upgrade PDI between runs, the rows may no longer de-serialize correctly. To solve this, conversion applications would be required (or some old/new format logic, if that is even possible). As long as you don't save data in an internal format that survives a job/transformation, you're always fine.

How do you start developing your own plug-in step

Q: I need some complex calculation in my ETL. I have created my own logic in Java classes. How can I make my own step and integrate it into PDI?

A: see Writing your own Pentaho Data Integration Plug-In

Checklist for end of step/job entry development

The following is a list of things you need to think of before considering a change to a step/job entry, or a complete new step/job entry, to be finished.

  • Does the change break anything compared to the previous release(s)? If possible nothing should break, but sometimes there's no other way. If something breaks and there's no way around it, at least inform the PDI tech lead;
  • Is the source code completely in English? Especially check that the key fields that will be saved in the XML file/repository are in English; it's hard to change these later on;
  • Does loading/saving work correctly with both XML format and a repository? This mostly matters for new steps/job entries or for changes to existing attributes. A check for this is described in the section "About checking persistence correctness" above;
  • Is the documentation up to date with the changes?

About using Subversion

  1. Always make sure that whatever you put in SVN keeps PDI buildable and startable. Nothing is more annoying than not even being able to start PDI because someone checked in half-working code. If the change is too big to do at once, work in small steps towards the full change (but at all times keep PDI buildable/runnable).
  2. Always comment your commit. Best is to add the PDI-xx number from the bug/feature tracker and a short description (so you don't have to look up the description from the PDI-xx number). See the other developers' commits as examples.
  3. To keep track of the changes and to follow the software development process, you HAVE to add a JIRA PDI-xx number to your commit. The only exceptions are things like a cosmetic change or the fix of a spelling mistake. This also helps users not involved in the development process to keep track of the changes.

About Serializable and Binary

Q: If I need to select a type I can choose between Serializable and Binary. Are they the same, or what's the difference?

A: Binary is used for images, sounds, GIS data, BLOBs, etc. You can e.g. read a BLOB from one database and insert it into another database. Binary uses getByte() and setByte() to get and set its value.

Serializable is used by some proprietary plugins, built by a company that uses it to pass Java objects from one step to another. Unless you're developing your own steps/plugins, Serializable is not something you would use. The way to read/write the data depends on the objects stored in a Serializable.
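For completeness, a short hedged sketch of how a Binary-typed output field might be declared in a step's row metadata (exact classes and helpers differ between PDI versions; outputRowMeta, photoIndex and photoBytes are hypothetical names):

// Declare a Binary output field, e.g. to carry a BLOB read from one database into another
ValueMetaInterface photo = new ValueMeta("photo", ValueMetaInterface.TYPE_BINARY);
outputRowMeta.addValueMeta(photo);

// Later, the step puts the raw bytes into the output row at that field's position
row[photoIndex] = photoBytes;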

Success factors of PDI

Modular design

Pentaho Data Integration as a runtime engine consists of a "row engine" taking care of transporting data from one step to the next. The steps are separate and can even use a plug-in architecture.

No generation of code

The biggest advantage of not generating code is that the job is always in the correct state: the object code and the "source" code of the jobs can never get out of sync, since no explicit object code is generated. An extra advantage is that jobs always use the latest version of the components after the core is upgraded; there is no need to re-compile/regenerate your own jobs.

The biggest disadvantage would be speed, but comparing the speed of PDI to other ETL tools the disadvantage doesn't seem that big (in a lot of cases PDI is even faster than similar jobs in other ETL tools).

More developer information

More developer information can be found here: PDI Developer information