Configuring log tables for concurrent access

It can be surprisingly tricky to use a database to generate a unique ID.  The problem is mostly caused by inadequate or incorrect locking in the various databases.  To make matters worse, every database behaves differently, and even within a single database you can use different storage engines (as in MySQL) that behave differently when it comes to locking tables.

Note: We are working on an easier solution for future releases; see PDI-5501

The problem

All that Kettle wants to do is generate a unique integer ID at the very start of the transformation or job.  That "batch" ID is then used in various places throughout the life of the transformation or job.  Unfortunately, generating a unique ID is not handled in any standard fashion by the various database vendors.  Some databases don't offer any support for it, some offer "Sequences" to do it, and others have special identity or auto-increment columns that are automatically assigned a value upon insertion of a record.

Whatever the case, Kettle opted early on to stay on the safe path and simply look in a log table for the maximum ID, then add one to it.  This works fine until more than one transformation or job runs at the same time.  At that point, two different database connections can read the very same maximum ID value, which in turn gives you duplicate values in the log tables, as the sketch below illustrates.
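
To make the race concrete, this is roughly what the legacy lookup amounts to.  This is a sketch only: the actual log table and column names depend on your logging configuration, and TRANS_LOG and ID_BATCH are used here as placeholders:

-- Two connections that run this before either has written its new log row
-- both compute the same value and both claim the same batch ID.
SELECT MAX(ID_BATCH) + 1 FROM TRANS_LOG;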

The solution

Over the years, many attempts were made to properly lock the logging tables, but for certain databases this has proven quite tricky to get right.  To finally arrive at a working solution that is also backward compatible, the Pentaho engineers came up with the following options...

Use an identity column

You can use an identity or auto-increment column if your database supports this.  To enable support, create a new table like this (MySQL example):

CREATE TABLE LOG_COUNTER (ID INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY, NAME VARCHAR(100)) ENGINE=InnoDB;

Insert a record:

INSERT INTO LOG_COUNTER(NAME) VALUES ('LOG_TABLES');
COMMIT;

In the logging database connection in Pentaho Data Integration (Spoon), add the following line in the Options pane:

Parameter: AUTOINCREMENT_SQL_FOR_BATCH_ID
Value: UPDATE LOG_COUNTER SET ID=LAST_INSERT_ID(ID+1)

This tells PDI to take a new value from the LOG_COUNTER table every time a batch ID needs to be generated for a transformation or job log table.
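
If you want to verify the mechanism by hand, you can run the configured statement in two separate MySQL sessions.  The following is a sketch of such a manual check, not something PDI requires you to run:

-- LAST_INSERT_ID(expr) stores expr as this connection's last-insert-id,
-- so the increment and the read behave atomically per session.
UPDATE LOG_COUNTER SET ID=LAST_INSERT_ID(ID+1);
SELECT LAST_INSERT_ID();  -- returns the freshly incremented ID

Because the UPDATE takes a row lock and LAST_INSERT_ID() is scoped to the connection, concurrent sessions can never read the same value.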

Use a SEQUENCE column

You can use a database sequence if your database supports this.  To enable it, create a sequence like this (Oracle example):

CREATE SEQUENCE LOGGINGSEQ START WITH 10000;
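
You can sanity-check the sequence by hand before wiring it into PDI.  Each call to NEXTVAL returns a new, never-repeated value, even across sessions, although gaps are possible when sequence caching is enabled:

SELECT LOGGINGSEQ.NEXTVAL FROM DUAL;
SELECT LOGGINGSEQ.NEXTVAL FROM DUAL;  -- a higher value than the first call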

In the logging database connection in Pentaho Data Integration (Spoon), add the following line in the Options pane:

Parameter: SEQUENCE_FOR_BATCH_ID
Value: LOGGINGSEQ

This tells PDI to take the next value from the LOGGINGSEQ sequence every time a batch ID needs to be generated for a transformation or job log table.