Extracting Data from HBase to Load an RDBMS in MapR

This guide shows how to use a PDI transformation to extract data from HBase and load it into an RDBMS table. The new RDBMS table will contain the count of page views by IP address and month.

Note

For brevity, this transformation contains only three steps: HBase Input, Split Fields, and Table Output. In practice, the full expressiveness of PDI transformation semantics is available. PDI also supports bulk loading for many RDBMSs, which is a viable and common alternative to the Table Output approach.

Prerequisites

In order to follow along with this how-to guide you will need the following:

Sample Files

There are no sample files for this guide. The Loading Data into MapR HBase guide must be completed prior to starting this guide as it loads the sample data.

Step-By-Step Instructions

Setup

Start MapR if it is not already running.

Create a RDBMS Connection

In this task you will create a connection to an RDBMS that you will use throughout this guide. This task uses a MySQL connection, but you may use any database that has a JDBC driver.

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.

  2. Create a Database Connection: You need to create a database connection to your RDBMS, so right click on the 'Database connections' in the View palette and select New. If you already have a database connection for your RDBMS database you may skip this step.
    The Database Connection window will appear. Enter the following information:
    1. Connection Name: Enter 'RDBMS'
    2. Connection Type: Select 'MySQL'
    3. Host Name and Port Number: Your connection information for the MySQL Server. For a local MySQL database Host Name is 'localhost' and Port Number is '3306'.
    4. Database Name: Enter your database name. For a local MySQL database use 'test'.  If you have not already created a database named 'test' on MySQL, please do so now.
    5. User Name and Password: Your database username and password.
      When you are done your window should look like:

      Notice that there are lots of connection types that you could have used.
      Click 'Test' to verify your connection is working properly. If the test fails verify your RDBMS server is running and you have entered the correct connection information.

      Click 'OK' to close the Database Connection window.

  3. Share the RDBMS Connection: You will want to use your RDBMS connection in future transformations, so share the connection by expanding 'Database Connections' in the View Palette, right clicking on 'RDBMS', and selecting 'Share'.
    Sharing the connection will prevent you from having to recreate the connection every time you want to access the RDBMS in a transformation.
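
The fields entered in the Database Connection window correspond to the parts of a standard MySQL JDBC URL. The following is a minimal Python sketch that only assembles that URL string, which can help when checking the same settings with another JDBC client. The function name mysql_jdbc_url is hypothetical, and the defaults are the local example values from this guide.

```python
def mysql_jdbc_url(host="localhost", port=3306, database="test"):
    """Build a MySQL JDBC URL from the host, port, and database name fields."""
    return f"jdbc:mysql://{host}:{port}/{database}"


# Local example values from this guide.
print(mysql_jdbc_url())  # jdbc:mysql://localhost:3306/test
```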

Create a Transformation to Extract Data from HBase

In this task you will create a transformation to extract data from HBase and load it into an RDBMS table.

Speed Tip

You can download the completed Kettle transformation, hbase_to_rdbms.ktr.

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.

  2. Add an HBase Input Step: You are going to extract data from HBase, so expand the 'Big Data' section of the Design palette and drag an 'HBase Input' node onto the transformation canvas. Your transformation should look like:


  3. Edit the HBase Input Step: Double-click on the 'HBase Input' node to edit its properties. Do the following:
    1. Zookeeper host(s) and Zookeeper port: Your HBase Zookeeper connection information. For a local single node MapR cluster your host is 'localhost' and your port is '5181'.
    2. HBase table name: Click 'Get mapped table names' and select 'weblogs'.
    3. Mapping name: Click 'Get mappings for the specified table' and select 'pageviews'.
    4. Click the 'Get Key/Fields Info' button to populate the grid.
      When you are done your window should look like:

      Click 'OK' to close the window.

  4. Add a Split Fields Step: You need to split the key field, which has the form client_ip|year, into two fields, so expand the 'Transform' section of the Design palette and drag a 'Split Fields' node onto the transformation canvas. Your transformation should look like:

  5. Connect the Input and Split Fields steps: Hover the mouse over the 'HBase Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Split Fields' node. Your canvas should look like this:


  6. Edit the Split Fields Step: Double-click on the 'Split Fields' node to edit its properties. Do the following:
    1. Field to split: Select 'key'.
    2. Delimiter: Enter '|'.
    3. Fields: Add the following:

      New field    Type
      client_ip    String
      year         Integer

      When you are done your window should look like:

      Click 'OK' to close the window.

  7. Add a Table Output Step: You want to write the values to an RDBMS, so expand the 'Output' section of the Design palette and drag a 'Table Output' node onto the transformation canvas. Your transformation should look like:


  8. Connect the Split Fields and Table Output steps: Hover the mouse over the 'Split Fields' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Table Output' node. Your canvas should look like this:


  9. Edit the Table Output Step: Double-click on the 'Table Output' node to edit its properties. Do the following:
    1. Connection: Select 'RDBMS'
    2. Target Table: Enter 'aggregate_hbase'
    3. Check 'Truncate table' so you can re-run this transformation.
    4. Click the 'SQL' button to create the table in your target database.
    5. Click the 'Execute' button to run the SQL.

      The 'Results of the SQL statements' window will appear telling you if the SQL was successfully executed or give you troubleshooting information if the SQL failed.
    6. Click 'OK' to close the window.
    7. Click 'Close' to close the 'Simple SQL editor' window.
      When you are done your window should look like:

      Click 'OK' to close the window.

  10. Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'hbase_to_rdbms.ktr' into a folder of your choice.

  11. Run the Transformation: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the transformation toolbar. An 'Execute a transformation' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and it will show you the progress of the transformation as it runs. After several seconds the transformation should finish successfully:
    If any errors occurred, the step that failed will be highlighted in red and you can use the 'Logging' tab to view error messages.
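
To make the data flow of the three steps concrete, the following is a rough Python sketch of the same logic outside PDI. It is not how PDI runs the transformation. It uses an in-memory SQLite database as a stand-in for MySQL, and the sample rows, the single pageviews column, and the function names split_key and load are all assumptions made for illustration. The real column layout comes from the 'pageviews' mapping.

```python
import sqlite3

# Hypothetical rows as they might arrive from the HBase Input step:
# a composite key plus one assumed page-view count column.
hbase_rows = [
    {"key": "10.0.0.1|2011", "pageviews": 12},
    {"key": "10.0.0.2|2012", "pageviews": 7},
]


def split_key(row, delimiter="|"):
    """Mimic Split Fields: turn 'key' into client_ip (str) and year (int)."""
    client_ip, year = row["key"].split(delimiter)
    rest = {k: v for k, v in row.items() if k != "key"}
    return {"client_ip": client_ip, "year": int(year), **rest}


def load(conn, rows):
    """Mimic Table Output with 'Truncate table' checked."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS aggregate_hbase "
        "(client_ip TEXT, year INTEGER, pageviews INTEGER)"
    )
    # SQLite has no TRUNCATE statement, so DELETE empties the table instead.
    conn.execute("DELETE FROM aggregate_hbase")
    conn.executemany(
        "INSERT INTO aggregate_hbase VALUES (:client_ip, :year, :pageviews)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
split_rows = [split_key(r) for r in hbase_rows]
load(conn, split_rows)
load(conn, split_rows)  # a second run replaces the rows instead of duplicating them
row_count = conn.execute("SELECT COUNT(*) FROM aggregate_hbase").fetchone()[0]
print(row_count)  # 2
```

Emptying the table before each load is what makes the run repeatable, which is the reason the guide checks 'Truncate table'.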

Check RDBMS for HBase Aggregated Table

  1. Explore the Database: Choose 'Tools' -> 'Database' -> 'Explore' from the menu system.

  2. Select the Connection: In the 'Make your selection' window select 'RDBMS' and click 'OK'.


  3. Preview the Table: Expand RDBMS -> Tables. Right-click on 'aggregate_hbase' and select 'Preview first 100'.
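
The same check can be made from a script with a row-limited query. The sketch below is an approximation of 'Preview first 100', not PDI's own implementation. It again uses an in-memory SQLite database as a stand-in for MySQL, the helper name preview_first is hypothetical, and the table contents are made-up sample rows.

```python
import sqlite3


def preview_first(conn, table, limit=100):
    """Return at most `limit` rows from `table`."""
    # Table names cannot be bound as query parameters, so pass only trusted names.
    query = f'SELECT * FROM "{table}" LIMIT ?'
    return conn.execute(query, (limit,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aggregate_hbase (client_ip TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO aggregate_hbase VALUES (?, ?)",
    [(f"10.0.0.{i}", 2012) for i in range(150)],  # made-up sample rows
)

preview = preview_first(conn, "aggregate_hbase")
print(len(preview))  # 100
```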

Summary

In this guide you learned how to use a PDI transformation to extract data from HBase and load it into an RDBMS table.
