Merge rows

(warning) PLEASE NOTE: This documentation applies to an earlier version. For the most recent documentation, visit the Pentaho Enterprise Edition documentation site.


Description

Merge rows allows you to compare two streams of rows. This is useful for comparing data from two different times. It is often used in situations where the source system of a data warehouse does not contain a date of last update.

The two streams of rows, a reference stream (the old data) and a compare stream (the new data), are merged. Only the last version of a row is passed to the next steps each time. The row is marked as follows:

  • "identical" - The key was found in both streams and the values to compare are identical;
  • "changed" - The key was found in both streams but one or more values is different;
  • "new" - The key was not found in the reference stream;
  • "deleted" - The key was not found in the compare stream.

When the rows are flagged as "identical" or "deleted", the output row will be created based upon the "Reference rows origin" stream.  For "new" or "changed" rows, the output row will be created based upon the "Compare rows origin" stream.  Subsequent steps can be used, such as the Synchronize after merge or Switch-Case step, to handle the various flagfield values.

Important: Both streams must be sorted on the specified key(s). When using the Sort step, this works fine. If the data is sorted outside of PDI (e.g. in a SQL query), you may run into issues with the internal case sensitive/insensitive flag or other collations. Further information can be found on PDI-11440 (illustrating an issue with the Merge join step, but the issue is valid for this step as well).

The following sample illustrates the usage:

samples/transformations/Merge rows - mergs 2 streams of data and add a flag.ktr.ktr

This step can also be used in conjunction with the Synchronize after merge step. The flag column is then used to execute updates/inserts/deletes on a target table.

Options

Option

Description

Step name

Name of the step;this name has to be unique in a single transformation.

Reference rows origin

Specify the step origin for the reference rows <- Stream with original rows, or rows you want to compare the new rows to.

Compare rows origin

Specify the step origin for the compare rows.<- Stream with new rows

Flag fieldname

Specify the name of the flag field on the output stream.

Keys to match

Specify fields containing the keys on which to match;click Get key fields to insert all of the fields originating from the reference rows step

Values to compare

Specify fields contaning the values to compare; click Get value fields to insert all of the fields from the originating value rows step.  Key fields do not need to be specified here.

Metadata Injection Support (7.x and later)

All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.