Reservoir Sampling

Description

The reservoir sampling step samples a fixed number of rows from an incoming data stream when the total number of incoming rows is not known in advance. The step uses uniform sampling: every incoming row has an equal chance of being selected. It is particularly useful in conjunction with the ARFF output step, where it can generate a suitably sized data set for use by WEKA. The step uses Algorithm R, as described by Jeffrey Vitter.
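
For illustration, Algorithm R can be sketched in a few lines of Python. This is a minimal sketch, not the step's actual source code: the function name `reservoir_sample` and its parameters `rows`, `k` and `seed` are hypothetical, and it covers only a positive sample size.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 1) -> List[T]:
    """Return a uniform random sample of up to k rows from a stream.

    Hypothetical helper for illustration; assumes k is positive.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# The stream is consumed once and its length is never needed up front.
sample = reservoir_sample((n * n for n in range(10_000)), k=5, seed=42)
```

Only the `k` rows in the reservoir are held in memory, which is why the technique suits streams of unknown length. If the stream has fewer than `k` rows, the sketch simply returns all of them.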

Options

| Option | Description |
| --- | --- |
| Step name | The name of this step as it appears in the transformation workspace. |
| Sample size | Select how many rows to sample from an incoming stream. Setting a value of 0 will cause all rows to be sampled; setting a negative value will block all rows. |
| Random seed | Choose a seed for the random number generator. Repeating a transformation with a different value for the seed will result in a different random sample being chosen. |
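
The effect of the Sample size and Random seed options can be illustrated with a small Python sketch. It is not the step's implementation: `sample_rows` and its parameters are hypothetical names, and the handling of 0 and negative sizes simply mirrors the option description above. It also assumes a seeded generator that is deterministic, as Python's `random.Random` is.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def sample_rows(rows: Iterable[T], sample_size: int, seed: int) -> List[T]:
    """Hypothetical model of the Sample size and Random seed options."""
    if sample_size == 0:
        return list(rows)  # 0: every row is passed through
    if sample_size < 0:
        return []          # negative: all rows are blocked
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < sample_size:
            reservoir.append(row)
        else:
            j = rng.randint(0, i)  # uniform over 0..i inclusive
            if j < sample_size:
                reservoir[j] = row
    return reservoir


all_rows = sample_rows(range(20), sample_size=0, seed=1)
no_rows = sample_rows(range(20), sample_size=-1, seed=1)
first_run = sample_rows(range(1000), sample_size=10, seed=1)
repeat_run = sample_rows(range(1000), sample_size=10, seed=1)
other_seed = sample_rows(range(1000), sample_size=10, seed=2)
```

In this sketch, `first_run` and `repeat_run` are identical because they share a seed, while `other_seed` draws a different set of rows from the same stream.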

References

Vitter, J. S. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11, No. 1, March 1985. Pages 37-57.