About Kettle and Big Data


Pentaho's Big Data story revolves around Pentaho Data Integration, also known as Kettle. Kettle is a powerful Extraction, Transformation and Loading (ETL) engine that uses a metadata-driven approach. The Kettle engine provides data services for, and is embedded in, most of the applications within the Pentaho BI suite. This "Kettle Everywhere" approach is very powerful, but it can also create confusion because Kettle comes in many forms and packages.

List of Kettle Packages and their proper names:

  • Kettle - The name of the open source project and also the name of the ETL engine. When the name Kettle is used, it usually refers to the engine that executes jobs and transformations. Unfortunately, many longtime Kettle users also say "Kettle" when they mean Spoon, the Kettle graphical designer UI, which adds to the confusion. The Kettle community home is here.
  • Pentaho Data Integration (PDI) - When Pentaho created the commercial or Enterprise Edition of Kettle, it chose PDI as the branded name to distinguish the commercial version from the open source project. Unfortunately, the two names are used interchangeably, which may have created even more confusion.
  • Spoon - The Kettle desktop visual design tool used to create and edit ETL transformations and jobs. Spoon also has perspectives for running and debugging, visualizing and generating data models that can be used by the rest of the Pentaho Suite.
  • Pan - A program that can execute transformations from the command line, usually via a scheduler.
  • Kitchen - A program that can execute jobs from the command line, usually via a scheduler.
  • Carte - A simple web server that allows you to execute transformations and jobs remotely. It does so by accepting XML (using a small servlet) that contains the transformation to execute and the execution configuration. It also allows you to remotely monitor, start and stop the transformations and jobs that run on the Carte server.
  • Pentaho Report Designer (PRD) - The Kettle engine is embedded in the Pentaho Report Designer, which enables PRD to generate reports from a Kettle transformation without having to stage the data. It also gives PRD access to all of the database connectors within Kettle, including the NoSQL databases. The Pentaho Reporting community home is here.
  • Pentaho BI Platform - The Kettle engine is embedded in the BI Platform, which enables reports created with PRD that rely on transformations to be published to the web. The Kettle community home is here.
  • Pentaho Data Integration Server (DI Server) EE - Standalone server for running Kettle jobs and transformations. It has a CMS repository for storing and versioning jobs and transformations. It also has a scheduler and performance monitor. The DI Server is part of Pentaho Enterprise Edition and is not available in open source.
  • Pentaho Hadoop Distribution (PHD) - No longer required
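
As a rough illustration of how the command-line tools in the list above are typically driven from a scheduler, the sketch below builds argument lists for Pan and Kitchen and a status URL for a Carte server. It is a minimal sketch under assumptions, not an official client: the helper names, file paths, parameter name, host and port are hypothetical, and the `-file=`, `-level=` and `-param:` option spellings and the `/kettle/status/?xml=Y` path follow common PDI usage on Linux and should be checked against the installed version.

```python
import shlex
from typing import Dict, List, Optional


def kettle_command(script: str, file_path: str, level: str = "Basic",
                   params: Optional[Dict[str, str]] = None) -> List[str]:
    """Build (but do not run) an argument list for pan.sh or kitchen.sh.

    script    -- path to the launcher, e.g. "./pan.sh" or "./kitchen.sh"
    file_path -- a .ktr transformation (Pan) or a .kjb job (Kitchen)
    level     -- logging level passed as -level= (assumed option spelling)
    params    -- named parameters passed as -param:NAME=value (assumed)
    """
    argv = [script, f"-file={file_path}", f"-level={level}"]
    for name, value in (params or {}).items():
        argv.append(f"-param:{name}={value}")
    return argv


def carte_status_url(host: str, port: int) -> str:
    """Return the URL assumed to expose the Carte server status as XML."""
    return f"http://{host}:{port}/kettle/status/?xml=Y"


if __name__ == "__main__":
    # Hypothetical paths and parameter names, for illustration only.
    pan = kettle_command("./pan.sh", "/etl/load_sales.ktr",
                         params={"RUN_DATE": "2024-01-31"})
    kitchen = kettle_command("./kitchen.sh", "/etl/nightly.kjb", level="Error")

    # Print shell-safe command lines that a scheduler entry could use.
    print(shlex.join(pan))
    print(shlex.join(kitchen))
    print(carte_status_url("localhost", 8081))
```

The argument lists can be handed to `subprocess.run` on a machine where PDI is installed; building them as lists rather than strings avoids shell quoting problems with paths and parameter values.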

Screenshots (images not preserved): Spoon UI, Pentaho Report Designer, Pentaho BI Server.