Pentaho Data Integration (aka Kettle) Concepts, Best Practices and Solutions

About this page

Note: This Wiki space is in active development

This space is dedicated to Pentaho Data Integration (aka Kettle) topics covering concepts, best practices and solutions.

It is more practically oriented, whereas the basic reference documentation is more detailed and descriptive.

Main categories

  • Planning (e.g. Sizing questions, Multi-Tenancy)
  • Administration (e.g. Installation, Configuration, Multi-Tenancy)
  • Operations (Lifecycle Management, Monitoring, Logging, Exception Handling, Restartability)
  • Documentation (Auto-Documentation, Data-Lineage, Process Documentation, References, Dependencies)
  • Connecting to Third-Party Applications (e.g. web services, ERP and CRM systems)
  • Special Database Issues and Experiences
  • Big Data (e.g. Hadoop)
  • Clustering (Basic clustering, failover, load balancing, recoverability)
  • Performance Considerations
  • Change Data Capture (CDC)
  • Real-Time Concepts
  • Data Quality, Data Profiling, Deduplication (e.g. Master Data Management (MDM), Customer Data Integration (CDI))
  • Special File Processing (e.g. EDI(FACT), ASC X12, HL7 healthcare messages, large and complex XML files, files with hierarchical structures and multiple field formats)
  • Dynamic ETL (metadata-driven ETL: how to change the ETL process and fields dynamically depending on the processed content)
  • QA, Automated Testing
  • Special Job Topics (e.g. launching job entries in parallel, looping)
  • Special Transformation Topics (e.g. Error handling, tricky row and column handling)