R script executor

Description

This step executes an R script from within a PDI transformation. R is a freely-available language and environment for statistical computing and graphics that provides many statistical and graphical techniques such as linear and nonlinear modeling, statistical tests, time series analysis, classification, and clustering.  Consult the R project homepage for further information.

This step pushes streaming rows into the R environment as a data frame where your R script manipulates the data.  Then the step converts the contents of the data frame to output rows or a kettle string.

NOTE: This step will be available in the 5.1 GA version of PDI.  

Installation Instructions

The R Script Executor step is already installed in PDI.  You need to install R and configure your environment before you use it.

Note:  Install and configure R on each node where the R Script Executor will be run.  This includes the Spoon, DI Server, and cluster nodes.    

Install R

To install R, visit the CRAN site, then install the rJava package.  Install R on the Spoon and DI Server nodes, as well as on each node on which this step will be run, including cluster nodes.  

1. Download R version 3.1.0 for Windows (http://cran.r-project.org/bin/windows/base/old/), Linux (http://cran.r-project.org/bin/linux/), or Mac (http://cran.r-project.org/bin/macosx/old/).  Choose to download either a 32 or 64 bit version.  The version you choose depends on your whether your Java installation is 32 or 64 bit.  
2. Install R. See R documentation if you need assistance.
3. If you are running Linux or Mac, enter this command in a shelltool.  (The <path to your java installation> is the path to your Java JDK or JRE.)

sudo R CMD javareconf JAVA_HOME=<path to your java installation>

4. Start R. In the R Console, install the rJava package by typing:

install.packages('rJava')

NOTE:  Watch for messages that indicate rJava is installing. If you do not see these messages, add the jvm.dll to your Path environment variable,  then try to install rJava again.  By default, jvm.dll is in <your java installation directory>\jre\bin\server OR <your java installation directory>\jre\bin\client.

5. If prompted with "Would you like to use a personal library instead?" click Yes.
6. If prompted with the path of the library, click Yes.
7. When prompted for the CRAN mirror, choose a country then click OK.
8. After rJava has successfully been installed, type q() to quit the R console.
9. Click Yes to close the workplace image.
10. Close R. 

Set R Environment Variables

Set environment variables to indicate where R is installed.  

Note:  You must configure the environment for each node on which this step will be run.  This includes the Spoon node, DI Server node, as well as cluster nodes.

1. Set the following R environment variables to the paths indicated. 

Environment Variable

Value

Additional Notes

R_HOME

Path to the root directory of your R installation.

Enter Sys.getenv("R_HOME") in the R console to get the path.

R_LIBS_USER

Path to the directory where R installs your packages.

Enter Sys.getenv("R_LIBS_USER") in the R console to get the path.

PATH

Append the PATH variable with the directory that contains the R executable.

 

2. Stop and restart the DI Server and Spoon to ensure that the environment variables take effect.

Configure Spoon with rJava

In the rJava directory there is a jri.dll file that needs to be copied into the libswt directory of Spoon.

  1. Stop Spoon, if it's running
  2. Find %R_LIBS_USER%/rJava/jri/x64/jri.dll
  3. Windows: Copy jri.dll to [Pentaho directory]/client-tools/data-integration/libswt/win64
    Linux: Copy jri.dll to [Pentaho directory]/client-tools/data-integration/libswt/linux

Verify Your Installation

Once you have installed and configured R, verify your installation.

  1. Start Spoon.
  2. Select File > New > Transformation to start a new transformation.
  3. Open the Statistics folder in the Design tab, then drag the R Script Executor step to the transformation canvas.
  4. Open the R Script Executor step and click the Output Fields tab.
  5. Add an input step, such as Generate Rows, and an output step, such as Text file output.
  6. Compose a simple R script. 
  7. Click Test script.  A window should appear that displays iris data.  The Output Fields table populates with iris data.  Iris is a dataset that is included with R.

Note: The first time you use the step there might be a short delay as the step instructs R to install one more package JavaGD.  JavaGD is a Java-based graphics device for R.

Options

Configure tab

Option

Description

Step name

Name of the step.

Number of Rows to Process

The number of rows from the data stream to process at one time.  Options are All, Row By Row, and Batch.  With All, the script processes all of the rows in the data stream at once.  For Row By Row, the script processes one row of data at a time.  For Batch, a specified number of rows are processed at a time.  The number of rows in a batch are set in the Size field.

Size

Maximum number of rows in a batch.  This field becomes active if you set the Number of Rows to Process field to Batch.

Reservoir Sampling

If selected, randomly samples rows from an incoming data stream.  Use this option to sample a fixed number of rows from an incoming data stream when the total number of incoming rows is not known in advance. Using this option limits the number of processed rows by the R script to the number you enter in the Size field.  Note that if Reservoir Sampling is active and the Number of Rows to Process is set to Row By Row, only the first data frame is used.

Size

The number of rows to sample from an incoming stream.  Setting a value of -1 will sample 100,000 rows.  This field becomes active if Reservoir Sampling is selected.

Random Seed

The value to use for seeding the random number generator.  Repeating a transformation with a different value for the seed will result in a different random sample being chosen.

Strings as factors in R

Enable to have incoming String fields represented as factors in the data frame(s) in R. A factor is a different level of an explanatory variable. When disabled, String fields are kept as strings in R.

Output from script is text

Outputs the results of the script (as would be seen when running in the R console) as a text string instead of in a data frame.

Include Input Fields as Output Fields

Passes through the input metadata to the output metadata.  This allows you to see input field names to your result set.  If you check this option, then click the Get Fields button in the Output Fields tab, the Get Fields button prepends the input metadata to the output metadata.  If you do not select this option, your input fields do not appear with your results unless you manually add code to pass through the fields.  See the Pass-Through Incoming Fields in the Sample section of this page for a code sample.

Logging level for R messages

Defines the PDI logging level at which to report messages from the R environment.  

Step name

Define the step name and the mapped R Data Frame name. You can have multiple input streams coming into this step. When there are multiple input streams, just one data frame is pushed into R from each and, in this case, reservoir sampling is always used (unless there are fewer rows than the batch/reservoir size).

R Frame name

Define the R frame name that is mapped to the step.

OK

Click to save the information and close the window.

Cancel

Do not save changes made to the window.  Close the window.

Help

Provides a link to the help documentation.

R Script tab

Option

Description

Load script from file at runtime

If enabled, loads the script from a file instead using from the script from the Manual R Script field.

Script file

Define the file path to the script file.

Manual R Script

Paste your R Script here if you are not loading it from a file.

Test Script

Validates the script.   This button generates random data, based on incoming row metadata and the results of the test are based on this data.  

OK

Click to save the information and close the window.

Cancel

Do not save changes made to the window.  Close the window.

Help

Provides a link to the help documentation.

Output Fields tab

Option

Description

Name

Name of the field.

Type

Defines the field type.  When you click the Get Fields button, the fields should automatically populate.  You can adjust the values as needed.  Valid values in this field are BIG NUMBER, BINARY, BOOLEAN, DATE, INTEGER, INTERNET ADDRESS, NUMBER, STRING, and TIMESTAMP.

Indexed values

Use an optimized storage type of indexed values for saving memory.

Get Fields

Populates the Output Fields table. Runs the script to determine the field values.  When this button is clicked, and the Include Input Fields as Output Fields option is selected in the Configure Tab, the input metadata is prepended to the output metadata.

OK

Click to save the information and close the window.

Cancel

Do not save changes made to the window.  Close the window.

Help

Provides a link to the help documentation.

Note: If clicking the Test Script button causes issues with the script, manually specify the output field meta data in the step. This prevents the automatic output field metadata discovery method from executing at runtime.

Samples

Pass-Through Incoming Fields 

By default, the R Script Executor step returns result set data without the associated input field names unless you select Include Input Fields as Output Fields in the Configure tab.  If you do not choose this option, you can choose to manually pass-through field names, by adding code to the end of your R script that associates these elements in an output data frame.  

NOTE: For more information on any R command in these instructions, see http://cran.r-project.org/doc/contrib/Short-refcard.pdf.

1. In the R Script tab, add a line to create a data frame at the end of your existing script.  This example creates a variable called polarity and converts it to a data frame called output.

output=as.data.frame(polarity)

2. If you want to introduce new fields and associate them with new field names, add that code next.  In this example, the positive (posn), negative (negn), and total score (score) of each tweet is specified as variables that will be generated by the R function and sent to the output of the step.

output$posn <- eval(parse(text=polarity[1,1]))
output$negn <- eval(parse(text=polarity[1,2]))
output$score <- eval(parse(text=polarity[1,3]))

3. Specify the incoming field names and associate them with the data.   If you plan to reuse this code, change tweets to the variable name in your incoming R data frame.

output$followersCount <- c(tweets#followersCount)
output$retweetCount <- c(tweets#retweetCount)
output$userName <- c(tweets#userName )
output$messageId <- c(tweets#messageId)
output$text <- c(tweets#text)
output$createdAt <- c(tweets$createdAt)

4. At the end of the codeblock, place the name of the data frame that contains the output data that you would like to pass to the next step.  When complete, the sample looks like this.

output=as.data.frame(polarity)
output$posn <- eval(parse(text=polarity[1,1]))
output$negn <- eval(parse(text=polarity[1,2]))
output$score <- eval(parse(text=polarity[1,3]))
output$followersCount <- c(tweets#followersCount)
output$retweetCount <- c(tweets#retweetCount)
output$userName <- c(tweets#userName)
output$messageId <- c(tweets#messageId)
output$text <- c(tweets#text)
output$createdAt <- c(tweets$createdAt)
output

5. In the Output Fields tab, click the Test script button to test the script.  The output table populates with the data you specified in the previous steps and the script is tested for validity.

6. When complete, click the OK button.

Sample Transformations

Three example transformations are included in the attached archive.  

  • The first loads the iris data in R, removes the class column, performs a PCA analysis and then outputs the data frame containing the rotated data.
  • The second processes a single incoming data stream (pendigits data).  Batches of 1000 rows are pushed into R and are assigned to the variable "frame1."  A PCA is computed and the rotated data is output.
  • The third processes multiple input streams as named data frames. Two copies of the iris data are concatenated using rbind. Note that when there are multiple input streams, just one data frame is pushed into R from each. Reservoir sampling is always used unless there are fewer rows than the batch/reservoir size.

Sample to Create and Test a Model

Another example to create and test a model is included in create_test_model.zip which has been provided by Dave Reinke of Inquidia Consulting.  See his blog post about Data Science and NCAA Bracketology for more details.

This job uses the R script executor step create a predictive (regression) model on a training dataset, saves the model to an rdata file and then executes the model on a test dataset. The job also generates an R plot and saves it to the file system.  The output of the job is a hexbin plot that shows the model performance for the test dataset.  It is saved as a pdf in your working directory.  See the notes within the jobs and transformations for setup instructions.

Create Model Transformation

The Create Model Transformation does several things.

  1. Randomly splits the diamonds dataset into training (80%) and test (20%) datasets.
  2. Builds a regression model on the training set to predict the price of the diamond.
  3. Saves the regression model and a hexbin plot function in a diamonds.rdata file to be used in the Test Model transform.
  4. Outputs the test dataset as a csv file to be input in the Test Model transform.

Test Model Transformation

Here is what the Test Model transformation does.

  1. Inputs the test dataset from the csv file.
  2. Loads the saved diamonds.rdata file containing the model object and the hexbin plot function.
  3. Invokes the model on the test dataset.
  4. Plots the results of the model using the hexbin plot and generates a pdf in your working directory.  The results compare the predicted versus actual prices.

The output of the RScript step includes the test data with the additional predicted price field. Note that a log function to some numeric variables has been applied to reduce the skew in their distributions, (e.g. normalize).   The predicted price is the log of the price.

Demo Video:  Supervised Learning Using PDI and R

Interested in seeing R Script Executor in action?  This video shows you how to use PDI and R scripts to apply the supervised learning statistical method.  Supervised Learning predicts an outcome or dependent variable with a set of input or feature attributes.  The example shows how to use the R Script Executor step apply the Random Forest Classification model, which is used to predict a marketing response.  Random Forest is a powerful black box recursive partitioning algorithm that is a staple for R predictive modelers.