Understanding How Pentaho Works with Hadoop


For a complete overview of using Pentaho and Hadoop, visit PentahoBigData.com/ecosystem/platforms/hadoop.

Pentaho is integrated with Hadoop at many levels:

  • Traditional ETL - Graphical designer to visually build transformations that read and write data between Hadoop and virtually any other source or target, transforming it along the way. No coding required - unless you want to. Transformation steps include...
    • HDFS File Read and Write
    • HBase Read/Write
    • Hive and Hive2 SQL Query and Write (a JDBC sketch follows this list)
    • Impala SQL Query and Write
    • Support for the Avro file format and Snappy compression
  • Data Orchestration - Graphical designer to visually build and schedule jobs that orchestrate processing, data movement and most aspects of operationalizing your data preparation. No coding required - unless you want to. Job steps include...
    • HDFS Copy files
    • Map Reduce Job Execution
    • Pig Script Execution
    • Amazon EMR Job Execution
    • Oozie integration
    • Sqoop Import/Export
    • Pentaho MapReduce Execution
    • PDI Clustering via YARN
  • Pentaho MapReduce - Graphical designer to visually build MapReduce jobs and run them in the cluster. As a simple, point-and-click alternative to writing Hadoop MapReduce programs in Java or Pig, Pentaho exposes a familiar ETL-style user interface, making Hadoop usable by IT and data scientists, not just developers with specialized MapReduce and Pig coding skills. As always, no coding required - unless you want to.
  • Traditional Reporting - All of the data sources supported above can be used directly, or blended with other data, to drive our pixel-perfect reporting engine. Reports can be secured, parameterized and published to the web to provide guided ad hoc capabilities to end users, and they can be mashed up with other Pentaho visualizations to create dashboards.
  • Web-Based Interactive Reporting - Pentaho's metadata layer leverages data stored in Hive, Hive2 and Impala for WYSIWYG, interactive, self-service reporting.
  • Pentaho Analyzer - Leverage your data stored in Impala or Hive2 (Stinger) for interactive visual analysis with drill-through, lasso filtering, zooming, and attribute highlighting for greater insight.
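Several of the steps above (the Hive, Hive2 and Impala query and write steps) ultimately speak SQL over JDBC. Purely to illustrate what PDI handles for you, here is a minimal sketch of that query path in plain Java; the host, port, credentials and the weblogs table are placeholders, not anything defined by Pentaho.

```java
// Minimal sketch of querying HiveServer2 over JDBC -- the same interface the
// Hive/Impala steps use underneath. Host, port, user and table are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Explicit driver load; newer Hive JDBC jars register themselves automatically.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 URL; Impala accepts the same driver on its own port (typically 21050).
        String url = "jdbc:hive2://hadoop-node:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "pentaho", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```

In PDI the equivalent work is a Table Input step pointed at a Hive or Impala connection; the sketch only shows what happens underneath.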

In-cluster ETL


 1) Pentaho MapReduce with Kettle


The first three videos compare using Pentaho Kettle to create and execute a simple MapReduce job with writing Java code to solve the same problem. The Kettle transformation shown here runs as the Mapper and Reducer inside the cluster.




 2) Straight Java


What would the same task as "1) Pentaho MapReduce with Kettle" look like if you coded it in Java? At half an hour long, you may not want to watch the entire video...
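The video's exact task isn't reproduced here, but the canonical word-count job below gives a sense of the boilerplate the Java route requires (this is the standard Hadoop example, not the code from the video):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The equivalent Pentaho MapReduce job replaces all of this with two small transformations (mapper and reducer) wired into a Pentaho MapReduce job entry.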




 3) Comparing Kettle to Java


This is a quick summary of the previous two videos, "1) Pentaho MapReduce with Kettle" and "2) Straight Java", and of why Pentaho Kettle boosts productivity and maintainability.




 Loading Data into Hadoop


A quick example of loading data into the Hadoop Distributed File System (HDFS) using Pentaho Kettle.
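Outside of Kettle, the same load boils down to the Hadoop FileSystem API. A minimal sketch, assuming a NameNode at hadoop-namenode:8020 and placeholder local/HDFS paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode URI is a placeholder -- point this at your cluster
        conf.set("fs.defaultFS", "hdfs://hadoop-namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Copy a local file into HDFS (keep the source, overwrite the target)
            fs.copyFromLocalFile(false, true,
                    new Path("/tmp/weblogs.txt"),
                    new Path("/user/pentaho/weblogs/weblogs.txt"));
        }
    }
}
```

The Kettle job in the video does the same thing with an HDFS Copy Files job step, plus the scheduling and error handling around it.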




 Extracting Data from Hadoop


A quick example of extracting data from the Hadoop Distributed File System (HDFS) using Pentaho Kettle.
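Again for comparison only, reading the same data back out of HDFS with the raw Java API might look like the sketch below; the NameNode URI and file path are placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExtractSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode URI is a placeholder -- point this at your cluster
        conf.set("fs.defaultFS", "hdfs://hadoop-namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/user/pentaho/weblogs/part-00000"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Hand each record off to the target: a local file, a database, etc.
                System.out.println(line);
            }
        }
    }
}
```

In Kettle, the same extract is a Hadoop File Input step feeding any output step, with no code at all.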

