Extraction Patterns

Extraction of data from source systems

The complexity of an extraction batch depends very much on the environment:

  • Is the source system mission critical?
  • Can the source system sustain a long-running query?
  • Is the source system located on the local LAN or in the cloud?
  • Is the source system continuously being accessed?

There are two main scenarios:

  • Push - the Kettle batch is located on the source system and pushes data to the ETL staging area
  • Pull - the Kettle batch is located on the ETL server and pulls data into the ETL staging area

Pull

Pattern 1: Full extract with output truncate

The Kettle transformation consists of an input step and an output step.

The output step is set to truncate the target table.
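
Outside Kettle, the truncate-and-reload logic of this pattern looks roughly like the sketch below. It is a minimal illustration, not the actual transformation: the orders table, its columns, and the use of SQLite for both the source and the staging area are all assumptions (SQLite has no TRUNCATE statement, so DELETE stands in for it).

  import sqlite3

  # Hypothetical source and staging databases; SQLite stands in for both.
  source = sqlite3.connect("source.db")
  staging = sqlite3.connect("staging.db")
  for conn in (source, staging):
      conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, customer TEXT, amount REAL)")
      conn.commit()

  # The "truncate table" option on the output step: empty the target
  # before every load, so the extract is always a full refresh.
  staging.execute("DELETE FROM orders")

  # Input step reads everything; output step writes everything.
  rows = source.execute("SELECT id, customer, amount FROM orders").fetchall()
  staging.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
  staging.commit()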

Pattern 2: Full extract with SQL script

The Kettle transformation consists of an input step and an output step, plus an SQL script step that is not connected to the other steps.

The stand-alone SQL script is a special step that executes before any of the transformation's steps run.

When the table is small, this is fine.

If you are doing a major extract, do not place the SQL script in the transformation.

The SQL script can cause locking when it takes a long time to execute, or it can itself be blocked from reading or writing the table.
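
The hazard is easy to demonstrate outside Kettle. In this sketch (same hypothetical orders table and SQLite staging database as in Pattern 1), the stand-alone script runs before any rows flow, and if another session holds a lock on the table the whole extract stalls right there:

  import sqlite3

  # Hypothetical staging database; wait at most 5 seconds on locks.
  staging = sqlite3.connect("staging.db", timeout=5)
  staging.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, customer TEXT, amount REAL)")
  staging.commit()

  # The stand-alone SQL script runs once, before any rows flow.
  # A concurrent writer can block it, and it can block concurrent readers.
  try:
      staging.executescript("DELETE FROM orders;")
  except sqlite3.OperationalError as exc:  # e.g. "database is locked"
      raise SystemExit(f"pre-extract SQL blocked: {exc}")

  # ...the input/output steps of Pattern 1 would follow here...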

Pattern 3: Full extract job with SQL script in the job

The Kettle script comprises a job and an extract transformation.

Inside the job is an SQL script job entry.

The difference from Pattern 2 is that the SQL script now runs in the job, not in the transformation.

I have encountered problems with SQL scripts placed inside transformations.
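
The job-level ordering amounts to two units run strictly in sequence: the SQL script entry finishes and commits before the extract transformation even starts, instead of racing with it inside one transformation. A rough sketch, with the same hypothetical tables as above:

  import sqlite3

  def run_sql_script(staging: sqlite3.Connection) -> None:
      # Job entry 1: the SQL script, now at job level.
      staging.execute("DELETE FROM orders")
      staging.commit()  # done and committed before the extract starts

  def run_extract(source: sqlite3.Connection, staging: sqlite3.Connection) -> None:
      # Job entry 2: the extract transformation (input step -> output step).
      rows = source.execute("SELECT id, customer, amount FROM orders").fetchall()
      staging.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
      staging.commit()

  def job() -> None:
      source = sqlite3.connect("source.db")
      staging = sqlite3.connect("staging.db")
      for conn in (source, staging):
          conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, customer TEXT, amount REAL)")
          conn.commit()
      run_sql_script(staging)       # entry 1 completes...
      run_extract(source, staging)  # ...before entry 2 begins

  if __name__ == "__main__":
      job()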

Pattern 4: Extraction transformation with an error step after the output step

With error handling turned on and a text file output attached to the error step, the output step exits gracefully, dumping the failed rows with their error codes to the error output text file instead of aborting the whole extract.
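
In plain code the idea looks roughly like this. The orders table and its primary key (which some incoming rows are assumed to violate) are hypothetical; failed rows are diverted to a text file with their error message while the good rows keep flowing:

  import csv
  import sqlite3

  source = sqlite3.connect("source.db")
  staging = sqlite3.connect("staging.db")
  source.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, customer TEXT, amount REAL)")
  staging.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

  rows = source.execute("SELECT id, customer, amount FROM orders").fetchall()

  # Error step: rows that fail on the output step go to an error file
  # with their error message, instead of killing the whole extract.
  with open("orders_errors.txt", "w", newline="") as err_file:
      errors = csv.writer(err_file)
      errors.writerow(["id", "customer", "amount", "error"])
      for row in rows:
          try:
              staging.execute("INSERT INTO orders VALUES (?, ?, ?)", row)
          except sqlite3.Error as exc:  # e.g. duplicate primary key
              errors.writerow([*row, str(exc)])
  staging.commit()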

Pattern 5: Incremental extraction
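
A common way to implement this pattern (an assumption here, since designs vary) is a high-water mark: remember where the last run stopped and pull only rows changed since then. The sketch below assumes a hypothetical last_modified column on the source table and a watermark table in staging:

  import sqlite3

  source = sqlite3.connect("source.db")
  staging = sqlite3.connect("staging.db")
  source.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, customer TEXT, amount REAL, last_modified TEXT)")
  staging.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, customer TEXT, amount REAL, last_modified TEXT)")
  staging.execute("CREATE TABLE IF NOT EXISTS watermark (last_run TEXT)")

  # Read the high-water mark left by the previous run (epoch start if none).
  last_run = staging.execute("SELECT MAX(last_run) FROM watermark").fetchone()[0] or "1970-01-01 00:00:00"

  # Pull only the rows changed since the previous extract.
  rows = source.execute(
      "SELECT id, customer, amount, last_modified FROM orders WHERE last_modified > ?",
      (last_run,),
  ).fetchall()
  staging.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

  # Advance the watermark for the next run.
  staging.execute("INSERT INTO watermark VALUES (datetime('now'))")
  staging.commit()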