
1 Introduction

The reservoir sampling plugin samples a fixed number of rows from an incoming Kettle data stream when the total number of incoming rows is not known in advance. Every row has an equal chance of being selected (uniform sampling). The step is particularly useful in conjunction with the ARFF output step, where it can generate a suitably sized data set for use by WEKA. The reservoir sampling step uses algorithm "R" by Vitter (Vitter 1985).
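
To make the idea concrete, here is a minimal sketch of the textbook form of algorithm "R" in Python. It is an illustration only, not the step's actual source code; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Uniformly sample up to k rows from a stream of unknown length.

    Hypothetical helper for illustration; assumes k is a positive integer.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Pick a slot in [0, i]; the new row replaces a reservoir
            # entry only if the slot falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


sample = reservoir_sample(range(1000), k=10, seed=1)
```

Each row is seen exactly once and only `k` rows are held in memory at any time, which is why the total row count does not need to be known up front.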

2 Getting Started

As of Kettle 3.2, this is a standard step and is available in the Statistics category of Spoon.

Before Kettle 3.2: The reservoir sampling plugin must first be installed in Kettle. Unpack the plugin archive and copy all files in the ReservoirSamplingDeploy directory to a sub-directory of the plugins/steps directory of your Kettle installation, then start Spoon. To confirm that Kettle has recognized the plugin, expand the "Transform" list under "Core Objects" in Spoon. Reservoir Sampling should be listed in bold near the bottom of the list.
 

3 Configuring the Reservoir Sampling Step

The reservoir sampling step is very simple to configure. Double-click its icon to open the configuration dialog. The dialog shows a text field for naming the step and two fields for editing the sampling parameters.
 
The steps to complete the configuration are:

  1. Select how many rows to sample from the incoming stream (a value of 0 causes all rows to be sampled; a negative value blocks all rows).
  2. Choose a seed for the random number generator (repeating a transformation with a different value for the seed will result in a different random sample being chosen).
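
The effect of the two parameters can be sketched in Python as follows. This is a hedged illustration of the semantics described above, not the step's implementation: `sample_step` is a hypothetical name, and for brevity it materializes the rows and uses `random.Random.sample` as a stand-in for the streaming reservoir. Returning all rows when fewer rows arrive than the requested sample size is an assumption.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def sample_step(rows: Iterable[T], sample_size: int, seed: int) -> List[T]:
    """Mimic the documented parameter semantics (hypothetical helper)."""
    if sample_size < 0:
        # A negative sample size blocks all rows.
        return []
    materialized = list(rows)
    if sample_size == 0:
        # A sample size of 0 passes every row through.
        return materialized
    # Stand-in for the reservoir: a seeded uniform sample without replacement.
    # Assumption: if fewer rows arrive than requested, all of them are kept.
    rng = random.Random(seed)
    return rng.sample(materialized, min(sample_size, len(materialized)))


rows = list(range(100))
first = sample_step(rows, sample_size=5, seed=42)
again = sample_step(rows, sample_size=5, seed=42)   # same seed, same sample
other = sample_step(rows, sample_size=5, seed=43)   # different seed
```

Re-running with the same seed reproduces the same sample, which is useful when a transformation needs to be repeatable.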

4 References

...

Incorporated this page into the Step documentation: Reservoir Sampling