What's new in Weka 3.5.8
Classifiers
Bayesian logistic regression and discriminative multinomial naive Bayes for text classification
Two fast and powerful techniques for text classification problems that often outperform SMO (Weka's support vector machine implementation) without parameter tuning (weka.classifiers.bayes.BayesianLogisticRegression, weka.classifiers.bayes.DMNBtext). See:
Alexander Genkin, David D. Lewis, David Madigan (2004). Large-scale Bayesian logistic regression for text categorization (http://www.stat.rutgers.edu/~madigan/PAPERS/shortFat-v3a.pdf)
Jiang Su, Harry Zhang, Charles X. Ling, Stan Matwin (2008). Discriminative Parameter Learning for Bayesian Networks. In: ICML 2008.
Functional trees
João Gama's tree learner that incorporates oblique splits and functions at the leaves (weka.classifiers.trees.FT). See:
João Gama (2004). Functional Trees. Machine Learning, Vol. 55(3), Kluwer Academic Press.
Decision table/naive Bayes hybrid classifier
A semi-naive Bayesian ranking method that combines decision tables with naive Bayes (weka.classifiers.rules.DTNB).
Clusterers
CLOPE
A clustering algorithm for transactional data (weka.clusterers.CLOPE). See:
Yiling Yang, Xudong Guan, Jinyuan You (2002). CLOPE: a fast and effective clustering algorithm for transactional data. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 682-687.
sIB
Clustering using the sequential information bottleneck algorithm (weka.clusterers.sIB). See:
Noam Slonim, Nir Friedman, Naftali Tishby (2002). Unsupervised document classification using sequential information maximization. In: Proceedings of the 25th International ACM SIGIR Conference on Research and Development in Information Retrieval, 129-136.
Attribute selection
Cost-sensitive attribute and subset evaluation
Via re-weighting/sampling of the input data according to a supplied cost matrix (weka.attributeSelection.CostSensitiveAttributeEval, weka.attributeSelection.CostSensitiveSubsetEval).
Filtered attribute and subset evaluation
Apply a filter (or set of filters) to the input data before applying attribute selection (weka.attributeSelection.FilteredAttributeEval, weka.attributeSelection.FilteredSubsetEval).
Latent semantic analysis
Perform SVD-based latent semantic analysis via the attribute selection interface, or transform data using LSA via the AttributeSelection filter (weka.attributeSelection.LatentSemanticAnalysis, weka.filters.supervised.attribute.AttributeSelection).
Improved output
Output has been improved for naive Bayes, logistic regression and k-means clustering.
KnowledgeFlow
Plugin support
The KnowledgeFlow now offers the ability to easily add new components via a plugin mechanism. Plugins are installed in a directory called .knowledgeFlow/plugins in the user's home directory and are dynamically loaded by the KnowledgeFlow at runtime.
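Installing a plugin is just a matter of placing its jar file in that directory. A minimal sketch (the jar name MyVisualizePlugin.jar is a hypothetical example, not a real plugin shipped with Weka):

```shell
# Create the KnowledgeFlow plugin directory in the user's home directory.
mkdir -p "$HOME/.knowledgeFlow/plugins"

# Copy your plugin jar into it; it will be picked up the next time
# the KnowledgeFlow starts. (MyVisualizePlugin.jar is a placeholder name.)
# cp MyVisualizePlugin.jar "$HOME/.knowledgeFlow/plugins/"
```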
Headless execution
Flows can now be executed outside of the KnowledgeFlow GUI environment. weka.gui.beans.FlowRunner can be executed from the command line, or used programmatically, to run multiple flows in parallel.
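A command-line invocation might look like the following sketch, where weka.jar stands for the path to your Weka installation and myFlow.kf is a placeholder for a flow file previously saved from the KnowledgeFlow GUI:

```shell
# Run a saved KnowledgeFlow layout headlessly; FlowRunner loads the
# flow file given as its argument and executes it without the GUI.
java -cp weka.jar weka.gui.beans.FlowRunner myFlow.kf
```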
Instance weights
While instance weights have long been used internally by meta classifiers (e.g. boosting methods), it has previously only been possible to specify them in a data file by using the XML-based XRFF (eXtensible attribute-Relation File Format). Now it is possible to specify instance weights in standard ARFF files as well.
A weight can be associated with an instance in a standard ARFF file by appending it to the end of the line for that instance and enclosing the value in curly braces. For example:

@data
0, X, 0, Y, "class A", {5}
For a sparse instance, this example would look like:
{1 X, 3 Y, 4 "class A"}, {5}
Any instance without a weight value specified is assumed to have a weight of 1 for backwards compatibility.
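The weight can be recovered by looking for a trailing value in curly braces. A minimal sketch (not part of Weka; it simply illustrates the syntax on the example line above, and falls back to the default weight of 1 when none is given):

```shell
# Extract the optional trailing "{w}" weight from an ARFF data line.
line='0, X, 0, Y, "class A", {5}'
weight=$(printf '%s\n' "$line" | sed -n 's/.*{\([0-9.]*\)}[[:space:]]*$/\1/p')

# Instances without an explicit weight default to 1.
weight=${weight:-1}
echo "$weight"
```

The same pattern works for sparse instances, because only the final curly-brace group at the end of the line is treated as the weight.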
Running an experiment using clusterers
Using the advanced mode of the Experimenter it is now possible to run experiments on clustering algorithms as well as classifiers. The main evaluation metric for this type of experiment is the log likelihood of the clusters found by each clusterer.