Kettle Data Profiling with DataCleaner

Example for Data Profiling (DataCleaner) with Kettle

Data Profiling (DataCleaner) is fully integrated within Pentaho Kettle / PDI and you can profile your data directly within Spoon.

  • Download and install the plug-in as documented in the Data Profiling (DataCleaner) section of the Human Inference page.
  • Within Spoon, open one of your existing transformations or use the attached sample
  • Right-click the step whose data you want to profile and select Profile from the context menu

  • Execute the Transformation with Launch and wait until DataCleaner starts up
  • When the DataCleaner logo is shown, press Continue
  • You will see that DataCleaner has automatically parsed the metadata from Kettle and created a Number Analyzer and a String Analyzer
  • Press Execute to start the profiling

After the progress information, you get the results for your Number and String Analyzers. In this sample, the Number Analyzer reports null values:

When you click the details for the 8 null rows, you get the following information:

For the String Analyzer, you get the following results:

Look at the diacritic characters and open the details:

Depending on the character set of your target database or file, you may need to replace these special characters. This can be done within the Kettle transformation.
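
To make the idea of replacing diacritic characters concrete, here is a minimal Python sketch of the underlying logic. It is not Kettle or DataCleaner code; the function names strip_diacritics and has_diacritics are hypothetical and only illustrate what such a replacement does. Within Kettle you would build the equivalent with transformation steps.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed, e.g. 'é' becomes 'e'."""
    # NFD splits each accented character into a base character
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn"), keep the rest.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left into the usual composed form.
    return unicodedata.normalize("NFC", stripped)


def has_diacritics(text: str) -> bool:
    """Report whether text contains at least one combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.category(ch) == "Mn" for ch in decomposed)
```

Characters that have no Unicode decomposition, such as "ø" or "ß", pass through unchanged, so a target character set that cannot store them still needs an explicit mapping.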

  • Another useful pattern is to combine per-step profiling with cleansing in the same Kettle transformation: clean the data, e.g. replace null values with other values or look them up, then profile the data stream again after the cleansing. You see immediately whether your changes have corrected the issue.
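
The profile, clean, profile again loop can be sketched in a few lines of Python. This is an illustration only, not the Kettle or DataCleaner API: the helper names count_nulls and replace_nulls, the field name "amount" and the sample rows are all hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def count_nulls(rows: List[Row], field: str) -> int:
    """Count the rows in which the field is missing or None."""
    return sum(1 for row in rows if row.get(field) is None)


def replace_nulls(rows: List[Row], field: str, default: Any) -> List[Row]:
    """Return new rows in which None values of the field become default."""
    return [
        {**row, field: default} if row.get(field) is None else row
        for row in rows
    ]


# Hypothetical data stream with two null values in "amount".
rows = [{"amount": 10}, {"amount": None}, {"amount": 7}, {"amount": None}]

nulls_before = count_nulls(rows, "amount")   # profile before cleansing
cleaned = replace_nulls(rows, "amount", 0)   # cleansing step
nulls_after = count_nulls(cleaned, "amount") # profile again
```

Comparing nulls_before with nulls_after plays the role of the second profiling run: if the count has not dropped to the expected value, the cleansing step did not do what was intended.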

FAQ

Q: I want to avoid exporting all the data of the transformation to DataCleaner. How can I profile a sample of my data?
A: You can use the Reservoir Sampling step (see the Statistics category) within Kettle.
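
For readers unfamiliar with the technique, the following Python sketch shows the classic reservoir sampling algorithm (often called Algorithm R), which draws a uniform random sample of k rows from a stream of unknown length in a single pass. It illustrates the general idea only; it makes no claim about how the Kettle step is implemented internally, and the function name reservoir_sample and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    rows: Iterable[T], k: int, seed: Optional[int] = None
) -> List[T]:
    """Return up to k rows chosen uniformly at random from rows."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Pick a slot in 0..i (inclusive); the new row replaces an
            # existing one only if the slot falls inside the reservoir,
            # which happens with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


sample = reservoir_sample(range(1000), k=10, seed=42)
```

Because only k rows are held in memory at any time, the sample size stays fixed regardless of how many rows flow through the transformation.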

Q: Where can I get support?
A: The DataCleaner integration is community supported, and both Pentaho and Human Inference have invested in it. See the Kettle and DataCleaner forums for community support.
Pentaho Customer Support responds to all questions directly associated with the Pentaho products. Full production support (severity levels one through four) is not provided for the Data Profiling capabilities supplied by DataCleaner, although general questions are welcome. For further information, contact Pentaho Support.
DataCleaner support, including the Pentaho integration, is provided by Human Inference according to the support and maintenance matrix.

Q: Where can I find more information?
A: Further information about DataCleaner can be found here: http://datacleaner.eobjects.org/