Handling Large Data Sets with Weka

Recent releases of Weka 3.7 include new packages that add support for running Weka on Hadoop and Spark. See this series of blog entries (Hadoop) and this one (Spark) for details.

Weka has a large collection of learning algorithms, most of which are batch-based and operate on data held in main memory. A 64-bit JVM extends the amount of data that these methods can operate on (within the limits of computational complexity), but available RAM still limits the amount of data that can be processed. There are a number of approaches to consider when data is too large to fit into main memory.

Data characteristics

The first thing to consider is the nature of the data itself. If there are columns that can be safely eliminated a priori (i.e. identifiers, dates, constant fields, etc.) then this should be done in a preprocessing step. If the data contains many zeros then Weka has a sparse data format that can save a lot of memory. All algorithms in Weka can take advantage of the memory savings afforded by the sparse format; some algorithms can also exploit the sparsity to speed up computation.
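As a rough sketch of such a preprocessing step (the file name data.arff and the attribute indices 1 and 2 are hypothetical placeholders), the Remove filter can drop an identifier and a date column and the NonSparseToSparse filter can then convert the result to Weka's sparse representation:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;
    import weka.filters.unsupervised.instance.NonSparseToSparse;

    public class PreprocessExample {
      public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff"); // hypothetical input file

        // Drop columns that carry no predictive information (e.g. an identifier
        // in column 1 and a date in column 2 -- adjust the indices to your data).
        Remove remove = new Remove();
        remove.setAttributeIndices("1,2");
        remove.setInputFormat(data);
        data = Filter.useFilter(data, remove);

        // Convert to the sparse representation so zero values are not stored explicitly.
        NonSparseToSparse sparse = new NonSparseToSparse();
        sparse.setInputFormat(data);
        data = Filter.useFilter(data, sparse);
      }
    }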

Streaming incremental learners

Quite a few learning algorithms in Weka can be trained incrementally, one data row at a time. These methods generally have a runtime that is linear in the number of rows and fields in the data, and they only require the current data row to be present in main memory. Because of this, they can be used to process a (potentially) infinite number of rows. Furthermore, due to their incremental nature, they are "anytime" algorithms that can be used for prediction at any stage; their current state can be saved and training resumed at a later time. Algorithms included in Weka that fall into this category are listed below (a short code sketch of incremental training follows the list):

  • naive Bayes
  • naive Bayes multinomial (naive Bayes for text categorization)
  • DMNBtext (discriminative multinomial naive Bayes for text categorization)
  • AODE and AODEsr (averaged one dependence estimators)
  • SPegasos (the Pegasos stochastic gradient descent algorithm for learning linear support vector machines and logistic regression for binary class problems)
  • SGD (stochastic gradient descent for linear regression, binary class logistic regression and linear support vector machines)
  • IB1, IBk and KStar (nearest neighbor learners for classification and regression using a sliding window on the data)
  • locally weighted learning (locally weighted models using a sliding window on the data)
  • RacedIncrementalLogitBoost (ensembles of boosted base learners applied to data "chunks")
  • Cobweb (incremental clustering)
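
A minimal sketch of incremental training (assuming a hypothetical file train.arff whose last attribute is the class) streams rows from disk with an ArffLoader and updates a NaiveBayesUpdateable model one instance at a time:

    import weka.classifiers.bayes.NaiveBayesUpdateable;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ArffLoader;

    import java.io.File;

    public class IncrementalTraining {
      public static void main(String[] args) throws Exception {
        // Read the header only; the data rows are streamed one at a time.
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("train.arff")); // hypothetical training file
        Instances structure = loader.getStructure();
        structure.setClassIndex(structure.numAttributes() - 1);

        // Initialise the model with the header only.
        NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
        nb.buildClassifier(structure);

        // Feed the rows to the model one by one; only the current row is in memory.
        Instance current;
        while ((current = loader.getNextInstance(structure)) != null) {
          nb.updateClassifier(current);
        }
      }
    }

The same pattern applies to any classifier that implements the UpdateableClassifier interface.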

Furthermore, Weka >=3.7.2 includes a package that makes the MOA (Massive Online Analysis) data stream learners available in Weka. This adds Hoeffding incremental decision trees, Hoeffding trees with naive Bayes models at the leaves, Hoeffding option trees, bagging, boosting and adaptive variants thereof. MOA also includes routines for automatic memory management that can prevent a learned model from exceeding a user-specified maximum memory constraint.

Weka's "Explorer" environment is batch-based (regardless of whether the algorithm used is incremental or not) and loads the data into main memory. To take advantage of incremental learning either the command line interface or the graphical Knowledge Flow interface (using "instance" connections) can be used.

Although not actually a streaming algorithm, the FPGrowth association rule learner can (as of Weka 3.7.2) operate on data stored on disk rather than loaded into main memory. It requires only two passes over the data in order to build its extended prefix tree structure. This prefix tree is held in main memory, however, so despite the compression it usually achieves (which is partly dependent on data characteristics such as sparseness), there will be a limit to the number of rows/transactions it can handle.

Sampling

Quite often a carefully selected subsample of the data can be used to train a model whose performance is close to what would be achieved if the full data set could be processed. How is this possible? When plotting the accuracy of a model as a function of the size of the data used to train it, the curve for a given scheme will flatten out after a certain amount of data has been processed. Learning schemes with high bias, such as naive Bayes and linear models, will achieve most of their accuracy with a very small amount of data; processing more data is usually not worth the effort (diminishing returns). So, selecting a suitably sized subsample of the data is a viable approach in these cases. Weka includes a pre-processing filter that can perform reservoir sampling (uniformly random sampling from streamed data when the number of rows is not known in advance).
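The following rough sketch shows the ReservoirSample filter used in this streaming fashion (the file name big.arff and the sample size of 10,000 are placeholders, and the setSampleSize setter is assumed to correspond to the filter's sample-size option):

    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ArffLoader;
    import weka.filters.unsupervised.instance.ReservoirSample;

    import java.io.File;

    public class ReservoirSamplingExample {
      public static void main(String[] args) throws Exception {
        // Stream the rows from disk; only the current row is held in memory.
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("big.arff")); // hypothetical data file
        Instances structure = loader.getStructure();

        // Keep a uniformly random sample of (at most) 10,000 rows.
        ReservoirSample reservoir = new ReservoirSample();
        reservoir.setSampleSize(10000); // assumed setter for the sample size
        reservoir.setInputFormat(structure);

        // Push the rows through the filter; the reservoir is maintained as they arrive.
        Instance current;
        while ((current = loader.getNextInstance(structure)) != null) {
          reservoir.input(current);
        }
        reservoir.batchFinished();

        // Collect the sampled rows into a data set for a batch learner.
        Instances sample = new Instances(structure, 0);
        Instance out;
        while ((out = reservoir.output()) != null) {
          sample.add(out);
        }
      }
    }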

Voted ensembles of classifiers

One simple approach to handling large data sets is to create an ensemble of classifiers by splitting the data into a number of chunks, where each chunk is small enough to be loaded and processed in main memory. A separate classifier can be learned and then saved for each data chunk. This process can naturally be distributed across multiple machines (if available). The resulting ensemble can be used for prediction by combining the outputs of the individual classifiers: simple voting (classification) or averaging (probabilities for classification, or predicted target values for regression) works well.
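
The sketch below illustrates the idea (the chunk and test file names are placeholders, and J48 stands in for whatever base learner is appropriate): one classifier is trained per chunk and predictions are combined by averaging the class probability distributions. Weka's Vote meta-classifier offers similar combination rules.

    import weka.classifiers.Classifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    import java.util.ArrayList;
    import java.util.List;

    public class ChunkedEnsemble {
      public static void main(String[] args) throws Exception {
        // Hypothetical chunk files, each small enough to fit in memory.
        String[] chunks = { "chunk1.arff", "chunk2.arff", "chunk3.arff" };

        // Train one classifier per chunk (this loop could run on separate machines).
        List<Classifier> ensemble = new ArrayList<Classifier>();
        for (String chunk : chunks) {
          Instances data = DataSource.read(chunk);
          data.setClassIndex(data.numAttributes() - 1);
          Classifier c = new J48(); // any batch learner could be used here
          c.buildClassifier(data);
          ensemble.add(c);
        }

        // Predict by averaging the class probability distributions of the members.
        Instances test = DataSource.read("test.arff"); // hypothetical test set
        test.setClassIndex(test.numAttributes() - 1);
        Instance inst = test.instance(0);
        double[] avg = new double[test.numClasses()];
        for (Classifier c : ensemble) {
          double[] dist = c.distributionForInstance(inst);
          for (int i = 0; i < avg.length; i++) {
            avg[i] += dist[i] / ensemble.size();
          }
        }
        System.out.println("Predicted class: "
            + test.classAttribute().value(weka.core.Utils.maxIndex(avg)));
      }
    }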