What's new or improved in Weka 3.7.1

Classifiers

SPegasos

This is a fast algorithm for learning linear support vector machines and logistic regression via stochastic gradient descent. It is also an incremental classifier, so it can be trained in an online setting (weka.classifiers.functions.SPegasos). See:

S. Shalev-Shwartz, Y. Singer, N. Srebro: Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In: 24th International Conference on Machine Learning, 807-814, 2007.
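
As an illustration of the stochastic sub-gradient update that Pegasos is built on, here is a minimal, dependency-free sketch for the hinge-loss (SVM) case. It is not the code of weka.classifiers.functions.SPegasos: the class name PegasosSketch, its method names, the absence of a bias term and the omission of the optional projection step are all assumptions made to keep the example short.

```java
import java.util.Random;

// Hypothetical sketch of the Pegasos update for a linear SVM (hinge loss, no bias term).
class PegasosSketch {

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int j = 0; j < a.length; j++) {
            s += a[j] * b[j];
        }
        return s;
    }

    // x: training instances, y: labels in {+1, -1}, lambda: regularization constant.
    static double[] train(double[][] x, int[] y, double lambda, int iterations, long seed) {
        double[] w = new double[x[0].length];
        Random rnd = new Random(seed);
        for (int t = 1; t <= iterations; t++) {
            int i = rnd.nextInt(x.length);        // pick one instance at random
            double eta = 1.0 / (lambda * t);      // decreasing step size
            double margin = y[i] * dot(w, x[i]);  // margin under the current weights
            double shrink = 1.0 - eta * lambda;   // effect of the regularization term
            for (int j = 0; j < w.length; j++) {
                w[j] *= shrink;
                if (margin < 1.0) {               // hinge loss is active for this instance
                    w[j] += eta * y[i] * x[i][j];
                }
            }
        }
        return w;
    }

    static int predict(double[] w, double[] x) {
        return dot(w, x) >= 0 ? 1 : -1;
    }
}
```

Because each step touches a single instance, the same loop body can be applied to instances as they arrive, which is what makes this style of learner suitable for an online setting.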

Friedman's RealAdaBoost algorithm

Algorithm for boosting a 2-class classifier using the Real AdaBoost method (weka.classifiers.meta.RealAdaBoost). See:

J. Friedman, T. Hastie, R. Tibshirani (2000). Additive Logistic Regression: a Statistical View of Boosting. Annals of Statistics. 28(2):337-407.
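
In the Real AdaBoost procedure described in the paper above, each base classifier contributes half the log-odds of its class probability estimate, and the instance weights are then multiplied by exp(-y * f(x)) and renormalized. The sketch below shows just that one step. The class and method names are hypothetical, the clamping constant is an assumption added to avoid infinite log-odds, and nothing here is taken from the Weka source.

```java
// Hypothetical sketch of one Real AdaBoost reweighting step (labels in {+1, -1}).
class RealAdaBoostStep {

    // Contribution of a base classifier that predicts P(y = +1 | x) = p.
    static double halfLogit(double p) {
        double eps = 1e-6;                        // assumed guard against p = 0 or p = 1
        double q = Math.min(1.0 - eps, Math.max(eps, p));
        return 0.5 * Math.log(q / (1.0 - q));
    }

    // Multiplies each weight by exp(-y_i * f_i) and renormalizes to sum to 1.
    static double[] reweight(double[] weights, int[] y, double[] f) {
        double[] out = new double[weights.length];
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            out[i] = weights[i] * Math.exp(-y[i] * f[i]);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }
}
```

Instances that the base classifier gets confidently wrong receive a larger weight in the next round, while confidently correct ones are down-weighted.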

FURIA rule learner

Fuzzy Unordered Rule Induction Algorithm. A fuzzy rule learner based on the well-known RIPPER algorithm (weka.classifiers.rules.FURIA). See:

Jens Christian Huehn, Eyke Huellermeier (2009). FURIA: An Algorithm for Unordered Fuzzy Rule Induction. Data Mining and Knowledge Discovery.

Thanks to Jens Christian Huehn for this contribution.

One class classifier

A classifier for one-class problems (aka outlier/novelty detection) that combines density and class probability estimation (weka.classifiers.meta.OneClassClassifier). See:

Kathryn Hempstalk, Eibe Frank, Ian H. Witten: One-Class Classification by Combining Density and Class Probability Estimation. In: Proceedings of the 12th European Conference on Principles and Practice of Knowledge Discovery in Databases and 19th European Conference on Machine Learning, ECMLPKDD2008, Berlin, 505-519, 2008.

Parallel ensemble learning

Some meta classifiers in Weka now support multiple CPUs/cores and are able to construct ensemble members in parallel. See Support for parallelism in ensemble learning for details.
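
The general idea of building ensemble members in parallel can be sketched with a fixed-size thread pool, as below. This is an illustration only and does not show how Weka's meta classifiers are implemented or configured. ParallelEnsembleSketch and its methods are hypothetical, and the "base learner" is a placeholder that simply returns the mean of a bootstrap sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: each ensemble member is trained as an independent task.
class ParallelEnsembleSketch {

    // Placeholder base learner: the mean of a bootstrap sample of the data.
    static double trainMember(double[] data, long seed) {
        Random rnd = new Random(seed);
        double sum = 0.0;
        for (int i = 0; i < data.length; i++) {
            sum += data[rnd.nextInt(data.length)];
        }
        return sum / data.length;
    }

    static List<Double> buildEnsemble(double[] data, int members, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> pending = new ArrayList<>();
            for (int m = 0; m < members; m++) {
                final long seed = m;              // one seed per member keeps results reproducible
                Callable<Double> task = () -> trainMember(data, seed);
                pending.add(pool.submit(task));
            }
            List<Double> ensemble = new ArrayList<>();
            for (Future<Double> f : pending) {
                ensemble.add(f.get());            // collect in submission order
            }
            return ensemble;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Since every member depends only on the data and its own seed, the resulting ensemble is the same whether one thread or several are used; only the wall-clock time changes.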

Miscellaneous

Gaussian process regression is now improved and faster. J48 now includes options to turn off subtree collapsing and the MDL correction for the information gain of splits on numeric attributes.

Association rules

FP-Growth

FPGrowth is a fast method for learning association rules on market basket data. It requires only two passes over the data and constructs a compressed tree-based representation in main memory. Rather than generating candidate frequent item sets and then counting their occurrence in the data, it "grows" frequent item sets by recursively processing the tree structure. This avoids the combinatorial explosion of generate-and-test methods when there are many items (weka.associations.FPGrowth). See:

J. Han, J. Pei, Y. Yin: Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 1-12, 2000.
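
The two passes over the data can be sketched as follows: the first pass counts how often each item occurs, and the second inserts each transaction, restricted to frequent items and sorted by descending frequency, into a prefix tree so that shared prefixes are stored once. The sketch covers only tree construction, not the recursive mining step, and it is not Weka's implementation. FpTreeSketch, Node and the alphabetical tie-breaking rule are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of FP-tree construction (the two data passes only).
class FpTreeSketch {

    static final class Node {
        final String item;
        int count;
        final Map<String, Node> children = new HashMap<>();

        Node(String item) {
            this.item = item;
        }
    }

    static Node build(List<List<String>> transactions, int minSupport) {
        // Pass 1: count the number of transactions containing each item.
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> t : transactions) {
            for (String item : new HashSet<>(t)) {
                freq.merge(item, 1, Integer::sum);
            }
        }
        // Pass 2: insert each transaction as a path of frequent items.
        Node root = new Node(null);
        for (List<String> t : transactions) {
            List<String> kept = new ArrayList<>();
            for (String item : new TreeSet<>(t)) {      // alphabetical order, duplicates dropped
                if (freq.get(item) >= minSupport) {
                    kept.add(item);
                }
            }
            // Stable sort: most frequent first, ties stay alphabetical.
            kept.sort((a, b) -> Integer.compare(freq.get(b), freq.get(a)));
            Node cur = root;
            for (String item : kept) {
                cur = cur.children.computeIfAbsent(item, Node::new);
                cur.count++;
            }
        }
        return root;
    }

    // Number of item nodes in the tree (the root is not counted).
    static int size(Node n) {
        int s = 0;
        for (Node c : n.children.values()) {
            s += 1 + size(c);
        }
        return s;
    }
}
```

For example, the four baskets {bread, milk}, {bread, milk, eggs}, {bread, eggs} and {milk} contain eight item occurrences in total but, with a minimum support of 2, fit into a tree of five nodes.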

Clusterers

Hierarchical clusterer

Implements a number of classic agglomerative (i.e. bottom-up) hierarchical clustering methods (weka.clusterers.HierarchicalClusterer).
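
To show what "bottom-up" means in practice, the sketch below starts with every point in its own cluster and repeatedly merges the two closest clusters until the requested number remains. It uses single-link distance on one-dimensional points for brevity. This is an assumed simplification and not a description of weka.clusterers.HierarchicalClusterer, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of single-link agglomerative clustering on 1-D points.
class AgglomerativeSketch {

    // Single-link distance: the smallest gap between any pair of members.
    static double singleLink(List<Double> a, List<Double> b) {
        double best = Double.POSITIVE_INFINITY;
        for (double p : a) {
            for (double q : b) {
                best = Math.min(best, Math.abs(p - q));
            }
        }
        return best;
    }

    static List<List<Double>> cluster(double[] points, int k) {
        List<List<Double>> clusters = new ArrayList<>();
        for (double p : points) {                 // start with one cluster per point
            List<Double> single = new ArrayList<>();
            single.add(p);
            clusters.add(single);
        }
        int target = Math.max(k, 1);
        while (clusters.size() > target) {
            int bestA = 0;
            int bestB = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = singleLink(clusters.get(a), clusters.get(b));
                    if (d < best) {
                        best = d;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            // Merge the closest pair; bestB > bestA, so index bestA stays valid.
            List<Double> merged = clusters.remove(bestB);
            clusters.get(bestA).addAll(merged);
        }
        return clusters;
    }
}
```

Other classic linkage criteria differ only in how the distance between two clusters is defined, so they would replace the singleLink method while the merge loop stays the same.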

Filters

MILES propositionalization filter

Implements the MILES transformation that maps multiple-instance bags into a high-dimensional single-instance feature space (weka.filters.unsupervised.attribute.MILESFilter). See:

Y. Chen, J. Bi, J.Z. Wang (2006). MILES: Multiple-instance learning via embedded instance selection. IEEE PAMI. 28(12):1931-1947.

Rename attribute

A simple filter for renaming attributes (weka.filters.unsupervised.attribute.RenameAttribute).

Remove by name

Removes attributes based on a regular expression matched against their names (weka.filters.unsupervised.attribute.RemoveByName).
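
The selection logic of such a filter can be sketched with java.util.regex alone. The example below works on a plain list of attribute names rather than on Weka data structures. The class name, the invert parameter and the choice of whole-name matching (Matcher.matches rather than Matcher.find) are assumptions for illustration, not statements about the filter's actual options.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: drop the attribute names that match a regular expression.
class RemoveByNameSketch {

    // Returns the names that survive the filter.
    // invert = false: matching names are removed; invert = true: only matching names are kept.
    static List<String> apply(List<String> names, String regex, boolean invert) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            boolean matches = pattern.matcher(name).matches();  // whole name must match
            if (matches == invert) {
                kept.add(name);
            }
        }
        return kept;
    }
}
```

With the names id, age, customer_id and income and the expression .*id, the sketch keeps age and income; with the selection inverted it keeps id and customer_id instead.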

Merge many values

Merges many values of a nominal attribute into one value (weka.filters.unsupervised.attribute.MergeManyValues).

PMML import

Import of PMML RuleSet is now supported.

Attribute selection

Single attribute evaluation by a classifier

An attribute evaluator similar to OneRAttributeEval, except that it uses a user-specified classifier to evaluate (either on the training data or by cross-validation) each attribute individually (weka.attributeSelection.ClassifierAttributeEval).

Cost/Benefit analysis component

A graphical tool for exploring various cost/benefit tradeoffs by interactively selecting different population sizes from the ranked list of prospects or by varying the threshold on the predicted probability of the positive class. More information can be seen on this Wiki page.
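
The effect of varying the threshold on the predicted probability of the positive class can be illustrated with a small calculation: each threshold yields a different mix of false positives and false negatives, and hence a different total cost. The sketch below is not part of the component. The class name, the method signature and the example costs are hypothetical.

```java
// Hypothetical sketch: total misclassification cost at a given probability threshold.
class CostBenefitSketch {

    // probPositive[i]: predicted probability of the positive class for instance i.
    // isPositive[i]:   true if instance i really belongs to the positive class.
    static double totalCost(double[] probPositive, boolean[] isPositive,
                            double threshold, double costFalsePositive, double costFalseNegative) {
        double cost = 0.0;
        for (int i = 0; i < probPositive.length; i++) {
            boolean predictedPositive = probPositive[i] >= threshold;
            if (predictedPositive && !isPositive[i]) {
                cost += costFalsePositive;        // contacted a prospect who was not positive
            } else if (!predictedPositive && isPositive[i]) {
                cost += costFalseNegative;        // missed a prospect who was positive
            }
        }
        return cost;
    }
}
```

When a false negative costs more than a false positive, lowering the threshold can reduce the total cost even though more instances are predicted positive, which is the kind of tradeoff the tool lets you explore interactively.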