Extracting Data from Snappy Compressed Files


This guide explains how to configure client-side PDI so that files compressed with the Snappy codec can be decompressed by the Hadoop file input or Text file input step.

Prerequisites

  • Pentaho Data Integration
  • Snappy compressed source data (either inside or outside of HDFS)

Step-By-Step Instructions

Configure PDI to Access Snappy Native Libraries

To use client-side PDI to decompress files encoded by hadoop-snappy (the Snappy implementation used in Hadoop), you must build and install both the hadoop-snappy JNI interface and the Snappy native libraries for your platform. Instructions can be found at:

http://code.google.com/p/hadoop-snappy/

In particular, follow the instructions under "Build Hadoop Snappy". Follow the "Install Hadoop Snappy in Hadoop" instructions only if both of the following apply:

  1. You want to decompress Snappy-encoded files within a Pentaho MapReduce job (see Using Compression with Pentaho MapReduce for more information), and
  2. Your Hadoop installation does not already have hadoop-snappy installed (recent Hadoop distributions from Cloudera etc. are configured with hadoop-snappy out of the box)

Once you have built hadoop-snappy:

  1. Extract the hadoop-snappy-x.y.z-SNAPSHOT.tar.gz archive created by the build process to a location on your PDI client machine
  2. Copy hadoop-snappy-x.y.z-SNAPSHOT/lib/hadoop-snappy-x.y.z-SNAPSHOT.jar to libext/bigdata in your client PDI installation
  3. Set the java.library.path property to point to the subdirectory of hadoop-snappy-x.y.z-SNAPSHOT/lib/native that corresponds to your platform
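
The platform subdirectory named in Step 3 can be hard to pick out by eye. The following Java sketch prints the directory name the local JVM would be expected to match, assuming hadoop-snappy follows the Hadoop convention of joining the os.name, os.arch, and sun.arch.data.model system properties with hyphens and replacing spaces with underscores (for example Linux-amd64-64). The class and method names are illustrative only, and the printed value should be checked against the directories actually present under hadoop-snappy-x.y.z-SNAPSHOT/lib/native.

```java
class PlatformDirName {
    // Assumption: the native subdirectory is named
    // <os.name>-<os.arch>-<sun.arch.data.model>, with spaces replaced
    // by underscores, as in Hadoop's own native library layout.
    static String platformDirName(String osName, String osArch, String dataModel) {
        String raw = osName + "-" + osArch + "-" + dataModel;
        return raw.replace(' ', '_');
    }

    public static void main(String[] args) {
        // sun.arch.data.model is not defined on every JVM, so fall back
        // to a visible placeholder instead of printing "null".
        String dataModel = System.getProperty("sun.arch.data.model", "unknown");
        System.out.println(platformDirName(
                System.getProperty("os.name"),
                System.getProperty("os.arch"),
                dataModel));
    }
}
```

If no directory with the printed name exists, use whichever subdirectory the build actually produced for your machine.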

Where you set java.library.path in Step 3 depends on your platform:

  • Under Linux, edit "spoon.sh" in your PDI installation directory and add an entry to the LIBPATH variable
  • Under Windows, edit "Spoon.bat" and add an entry to the LIBSPATH variable
  • Under Mac OS X, edit "Data Integration 64-bit.app/Contents/Info.plist" and add "-Djava.library.path=<path to the subdirectory in Step 3>" to the string entry under the key "VMOptions"

Verifying that Snappy Decompression is Available to PDI

After following the instructions in the previous section, restart PDI. If hadoop-snappy and the Snappy native libraries have been installed correctly on the PDI client machine, a "Hadoop-snappy" option will be available in the "Compression" drop-down box on the "Content" tab of the Hadoop file input and Text file input steps.
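
If the option does not appear, it can help to confirm that the native libraries are actually present on the path passed to the JVM. The following Java sketch lists which directories on java.library.path contain each library file. It assumes the native libraries are named "snappy" and "hadoopsnappy" (an assumption about the hadoop-snappy build output, not something PDI documents), and the class name NativeLibraryCheck is illustrative only.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class NativeLibraryCheck {
    // Returns every directory on the given library path that contains
    // a regular file called fileName.
    static List<Path> directoriesContaining(String libraryPath, String fileName) {
        List<Path> hits = new ArrayList<>();
        if (libraryPath == null) {
            return hits;
        }
        for (String entry : libraryPath.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue;
            }
            Path dir = Paths.get(entry);
            if (Files.isRegularFile(dir.resolve(fileName))) {
                hits.add(dir);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String libraryPath = System.getProperty("java.library.path");
        // Assumed library names; adjust to match the files in lib/native.
        for (String name : new String[] {"snappy", "hadoopsnappy"}) {
            // mapLibraryName yields the platform file name,
            // e.g. libsnappy.so on Linux.
            String fileName = System.mapLibraryName(name);
            System.out.println(fileName + " found in: "
                    + directoriesContaining(libraryPath, fileName));
        }
    }
}
```

Run it with the same -Djava.library.path value that was configured for PDI. An empty list for either file suggests the path points at the wrong directory.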
