PMML Support in Weka

News

04/22/10 - SupportVectorMachineModel is now supported!

06/22/09 - RuleSetModel is now supported.

02/26/09 - TreeModel is now supported.

02/08/09 - Feedback from the PMML testing web page has resulted in some bug fixes and improvements (e.g. derived fields can now reference other derived fields as long as the referred field is declared before the referring field). Get these latest improvements via the download link above.

09/15/08 - NeuralNetwork, TransformationDictionary, LocalTransformations and DerivedField are now supported.

Overview 

What is PMML? 

The Predictive Model Markup Language (PMML) is a vendor-agnostic, XML-based standard for expressing statistical and data mining models. Applications can produce and consume PMML models, so a model created in one application can be used for scoring (prediction) in another. The PMML standard is maintained by the Data Mining Group (DMG).

What PMML model types are supported? 

Support for importing PMML models into Weka is under development. Implementation of the PMML (v 3.2) model types Regression, GeneralRegression, NeuralNetwork, TreeModel, RuleSetModel and SupportVectorMachineModel is complete. Support for other model types will follow in the future. The current plan is to implement support for (in order): naive Bayes, association rules and clustering models. This wiki page will be updated with new information and new download archives as more features are implemented.

What are the current limitations of Weka's PMML support?

Only PMML Regression, GeneralRegression, NeuralNetwork, TreeModel, RuleSetModel and SupportVectorMachineModel are implemented so far. GeneralRegression supports a single Predictor-to-Parameter matrix (i.e. in the case of classification, each target class value shares the same PPMatrix). Aggregate and MapValues expressions are not supported yet. The first six of the eleven PMML built-in functions are supported so far. There is no support for exporting PMML models from Weka yet.

How can I use PMML models with Pentaho?

PMML models can be used in two contexts:

1) In the Weka GUIs (Explorer and KnowledgeFlow) or from the command line, a PMML model can be loaded and applied to test data to score it. Since Weka's implementation of PMML import renders a PMML model as a standard (albeit immutable) Weka Classifier, all the standard Weka evaluation metrics are available for evaluating performance on the test set (if it contains reference target values).

2) Using the Weka scoring plugin for Pentaho Data Integration (Kettle), PMML models can be deployed for scoring as part of an ETL job.

Integration of PMML support into the Weka scoring plugin and a new PMML classifier scoring plugin for the Weka KnowledgeFlow have been completed (see below for example usage and screenshots). From Weka 3.6.0, PMML models can be run from the Classify panel in Weka's Explorer user interface and from the command line.

Example Output

Below is some example output from Weka's implementation of PMML GeneralRegression (multinomial logistic regression in this case), followed by the first few predictions (probability distributions over the class values) for some test data from the famous Iris dataset:

PMML version 3.2
PMML Model: multinomialLogistic

Mining schema:

@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}
        usage: predicted
        outlier treatment: asIs
        missing value treatment: asIs
@attribute sepal_length numeric
        usage: active
        outlier treatment: asIs
        missing value treatment: asIs
@attribute sepal_width numeric
        usage: active
        outlier treatment: asIs
        missing value treatment: asIs
@attribute petal_length numeric
        usage: active
        outlier treatment: asIs
        missing value treatment: asIs
@attribute petal_width numeric
        usage: active
        outlier treatment: asIs
        missing value treatment: asIs


Covariates:
        sepal_length
        sepal_width
        petal_length
        petal_width

Predictor-to-Parameter matrix:
                              Predictor
Parameter     sepal_length  sepal_width petal_length  petal_width
Intercept
sepal_length             1
sepal_width                           1
petal_length                                       1
petal_width                                                     1

Parameter estimates:
class                            Coeff.      df
Iris-setosa
                     Intercept  33.1503       1
                  sepal_length  11.8531       1
                   sepal_width  13.2994       1
                  petal_length -26.9143       1
                   petal_width -37.9972       1
Iris-versicolor
                     Intercept  42.6378       1
                  sepal_length   2.4652       1
                   sepal_width   6.6809       1
                  petal_length  -9.4294       1
                   petal_width -18.2861       1



Found class class in test data.
Actual: Iris-setosa  Predicted: 0.999999999999996 4.0732051602909886E-15 6.290640809163842E-42
Actual: Iris-setosa  Predicted: 0.9999999999992712 7.287039008004535E-13 5.20202200281695E-38
Actual: Iris-setosa  Predicted: 0.9999999999997793 2.2066439218458243E-13 2.640414975828411E-39
Actual: Iris-setosa  Predicted: 0.9999999999638924 3.610752987621063E-11 7.108436456076591E-36

...
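The predicted distributions above can be reproduced by hand from the parameter estimates. The following is a minimal sketch, not Weka's GeneralRegression code. It assumes that Iris-virginica, which has no rows in the parameter table, is the reference class with a linear predictor of 0, and that the class probabilities are obtained with a softmax over the three linear predictors. The class name, method names and the test instance (5.1, 3.5, 1.4, 0.2) are illustrative choices, not taken from the output above.

```java
// Hand-rolled illustration; not Weka's GeneralRegression implementation.
class IrisMultinomialLogisticSketch {

    // Coefficients copied from the "Parameter estimates" listing above, in the order
    // intercept, sepal_length, sepal_width, petal_length, petal_width.
    static final double[] SETOSA = {33.1503, 11.8531, 13.2994, -26.9143, -37.9972};
    static final double[] VERSICOLOR = {42.6378, 2.4652, 6.6809, -9.4294, -18.2861};

    // Linear predictor: intercept plus the weighted sum of the covariates.
    static double linearPredictor(double[] coeffs, double[] x) {
        double sum = coeffs[0];
        for (int i = 0; i < x.length; i++) {
            sum += coeffs[i + 1] * x[i];
        }
        return sum;
    }

    // Returns {P(setosa), P(versicolor), P(virginica)}.
    // Assumption: Iris-virginica is the reference class with linear predictor 0.
    static double[] distribution(double[] x) {
        double[] eta = {linearPredictor(SETOSA, x), linearPredictor(VERSICOLOR, x), 0.0};
        double max = Math.max(eta[0], Math.max(eta[1], eta[2]));
        double[] p = new double[3];
        double total = 0.0;
        for (int i = 0; i < 3; i++) {
            p[i] = Math.exp(eta[i] - max); // subtract the max for numerical stability
            total += p[i];
        }
        for (int i = 0; i < 3; i++) {
            p[i] /= total;
        }
        return p;
    }

    public static void main(String[] args) {
        // Assumed test instance: sepal_length, sepal_width, petal_length, petal_width.
        double[] instance = {5.1, 3.5, 1.4, 0.2};
        for (double p : distribution(instance)) {
            System.out.println(p);
        }
    }
}
```

For this instance the sketch gives a versicolor probability of roughly 4.07E-15 and a virginica probability of roughly 6.29E-42, in line with the first prediction shown above. Small differences in the trailing digits are expected because the printed coefficients are rounded to four decimal places.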

Here is another example. This shows the output from Weka's implementation of PMML Regression (polynomial regression in this case) and the first few predictions for some test data from the El Nino dataset:

PMML version 3.0
PMML Model: polynomialRegression

Mining schema:

@attribute buoy numeric
        usage: active
        outlier treatment: asIs
        missing value treatment: asValue (replacementValue = 37.5481)
@attribute day numeric
        usage: active
        outlier treatment: asIs
        missing value treatment: asValue (replacementValue = 8.8283)
@attribute latitude numeric
        usage: active
        outlier treatment: asIs
        missing value treatment: asValue (replacementValue = 5.0354)
@attribute longitude numeric
        usage: active
        outlier treatment: asIs
        missing value treatment: asValue (replacementValue = -106.1912)
@attribute zon_winds numeric
        usage: active
        outlier treatment: asIs
        missing value treatment: asValue (replacementValue = -4.8239)
@attribute mer_winds numeric
        usage: active
        outlier treatment: asIs
        missing value treatment: asValue (replacementValue = 2.6773)
@attribute humidity numeric
        usage: active
        outlier treatment: asIs
        missing value treatment: asValue (replacementValue = 84.5448)
@attribute airtemp numeric
        usage: predicted
        outlier treatment: asIs
        missing value treatment: asIs
@attribute s_s_temp numeric
        usage: active
        outlier treatment: asIs
        missing value treatment: asValue (replacementValue = 28.2222)

Regression table:
airtemp =

      0.0894 * buoy +
     -0.0107 * day +
      0.0178 * latitude +
      0.002  * longitude +
      0.0389 * zon_winds +
     -0.0643 * mer_winds +
     -0.0345 * humidity +
      0.7101 * s_s_temp +
     -0.0031 * buoy^2 +
     -0.0061 * day^2 +
      0.0038 * latitude^2 +
      0.0186 * zon_winds^2 +
     -0.0134 * mer_winds^2 +
      0      * buoy^3 +
      0.0004 * day^3 +
      0      * longitude^3 +
      0.0013 * zon_winds^3 +
     10.0055

Found class airtemp in test data.
Actual: 27.32  Predicted: 27.15551895992993
Actual: 26.7  Predicted: 27.23171331397675
Actual: 27.36  Predicted: 27.24651843122208
Actual: 27.32  Predicted: 27.300579679426757
Actual: 27.09  Predicted: 27.03963237958885
Actual: 26.82  Predicted: 27.12223258541705

...
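To make the regression table and the "asValue" missing value treatment concrete, here is a small sketch that evaluates the polynomial by hand. It is not Weka's Regression implementation. It assumes that a missing input is represented as NaN and is replaced by the field's replacementValue before the terms are evaluated. The class and method names are made up for illustration, and because the printed coefficients are rounded to four decimal places (the cubic terms shown as 0 contribute nothing here), the results will not match Weka's predictions digit for digit.

```java
// Hand-rolled illustration; not Weka's Regression implementation.
class ElNinoPolynomialRegressionSketch {

    // Field order: buoy, day, latitude, longitude, zon_winds, mer_winds, humidity, s_s_temp.
    // Replacement values copied from the mining schema listing above.
    static final double[] REPLACEMENT =
        {37.5481, 8.8283, 5.0354, -106.1912, -4.8239, 2.6773, 84.5448, 28.2222};

    // Each term is {coefficient, field index, exponent}, copied from the regression table.
    static final double[][] TERMS = {
        {0.0894, 0, 1}, {-0.0107, 1, 1}, {0.0178, 2, 1}, {0.002, 3, 1},
        {0.0389, 4, 1}, {-0.0643, 5, 1}, {-0.0345, 6, 1}, {0.7101, 7, 1},
        {-0.0031, 0, 2}, {-0.0061, 1, 2}, {0.0038, 2, 2}, {0.0186, 4, 2},
        {-0.0134, 5, 2}, {0.0, 0, 3}, {0.0004, 1, 3}, {0.0, 3, 3}, {0.0013, 4, 3}
    };

    static final double INTERCEPT = 10.0055;

    // Predicts airtemp. Assumption: NaN marks a missing input, which is
    // replaced by the field's replacementValue before scoring.
    static double predict(double[] x) {
        double y = INTERCEPT;
        for (double[] term : TERMS) {
            int field = (int) term[1];
            double value = Double.isNaN(x[field]) ? REPLACEMENT[field] : x[field];
            y += term[0] * Math.pow(value, term[2]);
        }
        return y;
    }

    public static void main(String[] args) {
        // Score an instance in which every input is missing.
        double[] allMissing = new double[REPLACEMENT.length];
        java.util.Arrays.fill(allMissing, Double.NaN);
        System.out.println(predict(allMissing));
    }
}
```

Scoring an instance with every input missing gives the same result as scoring the replacement values directly, which is the behaviour the "asValue" treatment describes.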

Weka's implementation of TreeModel for classification and regression trees implements Weka's Drawable interface, which allows the tree to be output in the Dot language used by the excellent Graphviz graph visualization software from AT&T Research. This enables the tree to be visualized by Weka's built-in TreeVisualizer or by other tools that support the Dot language. Here is a visualization of a PMML tree generated by SPSS Clementine from the Cleveland heart disease data.
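As a rough illustration of what output in the Dot language looks like for a tree, the sketch below serializes a tiny hand-built tree. It does not use Weka's Drawable interface or any Weka class: the Node type, the toDot method and the split on "age" are all hypothetical and are not taken from the Clementine tree mentioned above. The exact text that Weka emits may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-rolled illustration of Dot output for a tree; not Weka code.
class DotTreeSketch {

    // Hypothetical tree node: a label plus labelled edges to child nodes.
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        final List<String> edgeLabels = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }

        Node add(String edgeLabel, Node child) {
            edgeLabels.add(edgeLabel);
            children.add(child);
            return this;
        }
    }

    // Serializes the tree as a Dot digraph.
    // Labels are assumed not to contain double quotes (no escaping is done).
    static String toDot(Node root) {
        StringBuilder sb = new StringBuilder("digraph Tree {\n");
        emit(root, new int[] {0}, sb);
        return sb.append("}\n").toString();
    }

    // Writes this node, then each subtree and the edge leading to it; returns this node's id.
    private static int emit(Node node, int[] nextId, StringBuilder sb) {
        int id = nextId[0]++;
        sb.append("N").append(id).append(" [label=\"").append(node.label).append("\"]\n");
        for (int i = 0; i < node.children.size(); i++) {
            int childId = emit(node.children.get(i), nextId, sb);
            sb.append("N").append(id).append("->N").append(childId)
              .append(" [label=\"").append(node.edgeLabels.get(i)).append("\"]\n");
        }
        return id;
    }

    public static void main(String[] args) {
        // Made-up two-leaf tree, for illustration only.
        Node root = new Node("age")
            .add("<= 50", new Node("absent"))
            .add("> 50", new Node("present"));
        System.out.print(toDot(root));
    }
}
```

The resulting text can be saved to a file and rendered with Graphviz, for example with the dot command-line tool.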

 
Using PMML Models in the Weka Scoring Kettle Plugin

Once the Weka PMML library is installed in the same directory as the Weka scoring plugin in your Kettle plugins directory, using PMML models is simple and follows the same procedure as using a standard serialized Weka model (for more information on using the Weka scoring plugin, see the documentation provided with the distribution).

The following screenshot shows browsing for PMML model files from the WekaScoring file browser.

 
The next screenshot shows the "HEART_NOMREG" PMML GeneralRegression model loaded into the Weka scoring plugin.
 
Scoring Data using the PMML Classifier Scoring KnowledgeFlow Plugin

 
The PMML classifier scoring plugin for the KnowledgeFlow allows PMML classification and regression models to be loaded and used to score incoming batches of instances or instance streams in the KnowledgeFlow. Below are some example screenshots showing the PMML classifier scoring plugin, with a PMML binomial logistic regression model loaded, accepting an instance stream from the UCI Cleveland heart disease dataset. Evaluation metrics are computed by the incremental classifier evaluator component and displayed in a text viewer. Predictions for the data are appended and saved to a new ARFF file via the prediction appender and the ARFF saver components.
 
Using the PMML Library Programmatically


import weka.core.pmml.PMMLFactory;
import weka.core.pmml.PMMLModel;
import weka.classifiers.pmml.consumer.PMMLClassifier;

...

PMMLModel model = PMMLFactory.getPMMLModel("<path to PMML xml file>");
System.out.println(model);

if (model instanceof PMMLClassifier) {
   PMMLClassifier classifier = (PMMLClassifier)model;

   // Since PMMLClassifier is a subclass of weka.classifiers.Classifier,
   // you can use it just like any other Weka Classifier. The only
   // exception is that calling buildClassifier() will raise an
   // Exception because PMML models are pre-built.
}