
This guide explains how to configure client-side PDI so that files compressed with the Snappy codec can be decompressed by the Hadoop file input or Text file input step.

Prerequisites

Pentaho Data Integration
Snappy compressed source data (either inside or outside of HDFS)

Step-By-Step Instructions

Configure PDI to Access Snappy Native Libraries

To use client-side PDI to decompress files encoded by hadoop-snappy (the Snappy implementation used in Hadoop), you must build and install both the hadoop-snappy JNI interface and the Snappy native libraries for your platform. Instructions can be found at:

http://code.google.com/p/hadoop-snappy/

In particular, follow the instructions under "Build Hadoop Snappy". Follow the "Install Hadoop Snappy in Hadoop" instructions only if both of the following apply:

You want to decompress Snappy-encoded files within a Pentaho MapReduce job (see Using Compression with Pentaho MapReduce for more information), and
Your Hadoop installation does not already have hadoop-snappy installed (recent Hadoop distributions from Cloudera etc. are configured with hadoop-snappy out of the box)

Once you have built hadoop-snappy:

1. Uncompress the hadoop-snappy-x.y.z-SNAPSHOT.tar.gz archive that the build process creates somewhere on your client PDI machine
2. Copy hadoop-snappy-x.y.z-SNAPSHOT/lib/hadoop-snappy-x.y.z-SNAPSHOT.jar to libext/bigdata in your client PDI installation
3. Set the java.library.path property to point to the subdirectory of hadoop-snappy-x.y.z-SNAPSHOT/lib/native that corresponds to your platform
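
The file operations in the steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and its parameters are hypothetical, and it assumes the archive unpacks into a single top-level directory named after the archive, as the paths in the steps suggest. It does not set java.library.path itself; it only returns the platform subdirectories found under lib/native so that you can pick the right one.

```python
import shutil
import tarfile
from pathlib import Path


def install_hadoop_snappy(archive, extract_to, pdi_home):
    """Unpack the build archive, copy the jar into PDI, list native dirs.

    All three arguments are pathlib.Path objects chosen by the caller.
    """
    # Uncompress the archive somewhere on the client machine.
    extract_to.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(extract_to)

    # Assumption: the top-level directory is the archive name minus ".tar.gz".
    root = extract_to / archive.name[: -len(".tar.gz")]

    # Copy the hadoop-snappy jar to libext/bigdata in the PDI installation.
    jar = root / "lib" / (root.name + ".jar")
    target = pdi_home / "libext" / "bigdata"
    target.mkdir(parents=True, exist_ok=True)
    shutil.copy2(jar, target / jar.name)

    # Return the candidate platform subdirectories of lib/native.
    native = root / "lib" / "native"
    return sorted(p for p in native.iterdir() if p.is_dir())
```

A call might look like `install_hadoop_snappy(Path("hadoop-snappy-x.y.z-SNAPSHOT.tar.gz"), Path("/opt/snappy"), Path("/opt/pdi"))`, where both target directories are placeholders for your own locations.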

Where you set java.library.path in Step 3 depends on your platform:

Under Linux, edit "spoon.sh" in your PDI installation directory and add an entry to the LIBPATH variable
Under Windows, edit "Spoon.bat" and add an entry to the LIBSPATH variable
Under Mac OS X, edit "Data Integration 64-bit.app/Contents/Info.plist" and add "-Djava.library.path=<path to the subdirectory in Step 3>" to the string entry under the key "VMOptions"
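
As a rough sketch of what "adding an entry" means in each case: java.library.path is a list of directories joined by the platform path separator (":" on Linux and Mac OS X, ";" on Windows). The helper names below are hypothetical and only compute the strings; they do not edit spoon.sh, Spoon.bat, or Info.plist for you.

```python
import os


def append_library_dir(current, native_dir, sep=os.pathsep):
    """Return `current` with `native_dir` appended, skipping duplicates."""
    # Split the existing value and drop empty entries.
    entries = [e for e in current.split(sep) if e]
    if native_dir not in entries:
        entries.append(native_dir)
    return sep.join(entries)


def vm_option(native_dir):
    """Format the JVM option for the Mac OS X "VMOptions" entry."""
    return "-Djava.library.path=" + native_dir
```

The value returned by `append_library_dir` is what the LIBPATH or LIBSPATH variable would hold after the edit; `vm_option` produces the string to place in the plist entry.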

Verifying that Snappy Decompression is Available to PDI

After following the instructions in the previous section, restart PDI. If hadoop-snappy and the Snappy native libraries have been installed correctly on the PDI client machine, a "Hadoop-snappy" option will be available in the "Compression" drop-down box on the "Content" tab of the Hadoop file input and Text file input steps.
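
If the option does not appear, a quick check that the files are where PDI expects them can narrow the problem down. The following is a minimal sketch, not a substitute for the drop-down check above: the function name is hypothetical, the set of native library suffixes is an assumption, and it only inspects the file system rather than confirming that the JVM can actually load the libraries.

```python
from pathlib import Path

# Assumed file suffixes for native libraries across platforms.
NATIVE_SUFFIXES = {".so", ".dylib", ".dll", ".jnilib"}


def check_snappy_setup(pdi_home, native_dir):
    """Return a list of problems; an empty list means the files are in place."""
    problems = []

    # The hadoop-snappy jar should have been copied to libext/bigdata.
    jars = list((Path(pdi_home) / "libext" / "bigdata").glob("hadoop-snappy-*.jar"))
    if not jars:
        problems.append("no hadoop-snappy jar in libext/bigdata")

    # The directory on java.library.path should contain native libraries.
    native_dir = Path(native_dir)
    if not native_dir.is_dir():
        problems.append("native directory not found: %s" % native_dir)
    elif not any(set(p.suffixes) & NATIVE_SUFFIXES for p in native_dir.iterdir()):
        problems.append("no native libraries in %s" % native_dir)

    return problems
```

Pass your PDI installation directory and the platform subdirectory from Step 3; any returned message points at the step that likely needs to be repeated.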
