Change Data Capture (CDC)

Status: At this time this page collects all kind of information around CDC and Kettle and will be extended over time with examples.

Introduction

CDC is used to load only new or changed data from a source system. There are no dedicated CDC functions in PDI but there are many ways to achieve CDC functionality within PDI.

Source Data-Based CDC

In this case you use the time stamp or sequenced IDs to identify the last loaded rows and store this information in a status table. This can even be combined with transactions: This status table holds for all jobs/transformations all tables that need to be in a consistent state. For all tables the last processed keys (source/target) and the status is saved. Some tables might need compound keys depending on the ER-design. It is also possible to combine this approach with the own Kettle transformation log tables and the Dates and Dependencies functionality. There is an extended example in the Pentaho Data Integration for Database Developers (PDI2000C) course in module ETL patterns (Patterns: Batching, Transaction V - Status Table)

Snapshot-Based CDC

When no suitable time stamps or IDs are available or when some records might have been updated, you need the snapshot-based approach. Store a copy of the loaded data (in a table or even a file) and compare record by record. It is possible to create a SQL statement that queries the delta or use a transformation. Kettle supports this very comfortable by the Merge rows (diff) step. There is an extended example in the Pentaho Data Integration for Database Developers (PDI2000C) course in module ETL patterns (Pattern: Change Data Capture)

Trigger-Based CDC

Kettle does not create triggers in a database system and some (or most?) people don't like the trigger-based CDC approach because it introduces a further layer of complexity into another system. Over time it is hard to maintain and keep in sync with the overall architecture. But at the end, it depends on the use case and might be needed in some projects. There are two main options:

Create a trigger and write the changed data to a separate table

This table has a time stamp or sequenced ID that can be used to select the changed data rows.

Create a trigger and call Kettle directly via the Kettle API

This scenario might be needed in real-time CDC needs, so a Kettle transformation might be called directly from the trigger. Some databases support Java calls from a trigger (e.g. PL/Java for PostgreSQL or Oracle, see References below).

If you are using Hibernate to communicate with the database, you can use Hibernate event listeners as triggers (package summary). That way it would work with every database when you use standard SQL queries or HQL queries in the triggers.

Database Log-Based CDC

Some databases allow own CDC logs that can be analyzed.

Real-time CDC

So in case you need Real-time CDC, some of the above solutions will solve this need. Depending on the timing (how real-time or near-time) your needs are, you may choose the best suitable option. The trigger based call of Kettle is the most real-time solution. It is also possible to combine all of the above solutions with a continuously executed transformation (e.g. every 15 minutes) that collects the changed data.

References

Table Input step to start the data load from a given keys (see the Insert data from step option)

Merge rows (Diff) step to compare two data streams

Pentaho Data Integration for Database Developers (PDI2000C) course, module "ETL patterns"

Diethard Steiner (Jan 26, 2011): Kettle Transformation Logging and Change Data Capture

Matt Casters, Roland Bouman, Jos van Dongen: Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration, chapter 6 CDC: Change Data Capture

Oracle8i Java Stored Procedures Developer's Guide: Calling Java from Database Triggers

And a lot of forum posts, e.g.