02. Pentaho Metadata Formulas

Please see the Pentaho InfoCenter for the latest version of this document: http://infocenter.pentaho.com/help/topic/pme_user_guide/concept_pme_metadata_formulas.html

Formula Overview

Formulas have multiple uses in Pentaho Metadata. 

The first use of formulas within Pentaho Metadata is in the constraint definition of a Metadata Query, also known as MQL.  A constraint function references business table columns and uses various comparison operators to determine which subset of data the business user is interested in.

The second use is in the definition of Physical Table Columns.  In addition to Physical table columns mapping directly to a database table column, physical table columns defined in Pentaho Metadata may also be defined as a formula. This allows for combining of multiple columns into a single column, and also for doing more advanced aggregate calculations within aggregate table definitions.

The third use is in the definition of complex joins within business model relationships.  This allows for multiple key joins as well as other logic when joining tables.

The fourth use is row level security.

Under the covers, Pentaho Metadata uses JFreeReport's libFormula package for interpreting formulas.  The goal is to support OpenFormula syntax within the Metadata environment.  Formulas are first interpreted by libFormula, and then within the Metadata system are converted to native SQL depending on the type of database used.

First Use: MQL Constraints

Here is an example of an MQL Constraint formula:

OR([BT_CUSTOMERS.BC_CUSTOMERS_CUSTOMERNAME] = "EuroCars"; (([BT_CUSTOMERS.BC_CUSTOMERS_CREDITLIMIT] * 2) / 3 > 1000))

We'll walk through this example to help explain the core components of MQL formulas.  First note the OR function.  This is a boolean function which has two parameters, separated by semi-colons.  These parameters are boolean expressions.

The first boolean expression first references a business column from our Metadata model.   All references appear with brackets around them [].  This reference first refers to the business table, and then to the business column.   This boolean expression first does some arithmetic and checks to see if the final value us larger than 1000.

In the second expression, we compare the business column BT_CUSTOMERS.BC_CUSTOMERS_CUSTOMERNAME to EuroCars.  Note that we use double quotes when referring to text.  Double quotes are required.

Second Use: Physical Table Column Formulas 

Here is an example of a Physical Table Column Formula:

[QUANTITYORDERED]*[PRICEEACH]

The references here specifically refer to the database column, not derived physical column definitions.  All operators and functions may be used in the definition of the physical table column.  One special note, in order for this formula to be recognized, the "isExact" property of the physical table column must be set to true. Also note, the referenced physical column must be explicitly defined in the metadata model.

Recent Changes to Physical Table Column Formulas

In earlier versions of Pentaho Metadata Editor, (prior to the CITRUS release), aggregation functions had to be specified explicitly and the aggregation rule had to be selected. This is no longer necessary; the query that is generated will use the selected aggregation rule during execution. See Defining the Physical Column Aggregations for more information.

Multi-table expressions: Formulas can use any business column in the model

Since the latest versions (after 2008/03/14) it is possible to define formulas that use business columns from anywhere in the business model.

For example suppose we have two business tables:

  • Orders (fact table), ID=BT_ORDER_FACT
  • Product (dimension), ID=BT_PRODUCT

Suppose we want to calculate the turnover based on:

  • the number of products sold, from the Orders table, ID=BC_FACT_ORDER_NRPRODUCTS
  • the price of the product, from the Product table, ID=BC_DIM_PRODUCT_PRICE

To arrive there, we define a new business column, say in the Orders business table (although you could take Product too): 

  • Table: Orders (BT_ORDER_FACT)
  • ID = BC_FACT_ORDER_TURNOVER
  • Name = Turnover
  • Formula = [BT_ORDER_FACT.BC_FACT_ORDER_NRPRODUCTS] * [BT_PRODUCT.BC_DIM_PRODUCT_PRICE]
  • Exact = Yes
  • Aggregation Rule = SUM

The SQL generator is now going to replace the 2 business columns by their respective SQL variants.  As such, we have to make sure that the business columns on which we base ourselves are resolving correctly. In this specific case, this means we want the 2 columns to be non-aggregated.  If we now select the single business column BT_FACT_ORDER_TURNOVER, this is the SQL that is generated:

SELECT
           SUM( BT_ORDER_FACT.NRPRODUCTS  *  BT_PRODUCT.PRICE ) AS COL0
FROM
          FACT_ORDER BT_ORDER_FACT
         ,DIM_PRODUCT BT_PRODUCT
WHERE
          ( BT_ORDER_FACT.PRODUCT_TK = BT_PRODUCT.PRODUCT_TK )

Now, suppose we want to generate the multiplication of the 2 sums (different use-case), we define the formula as  "[BT_ORDER_FACT.BC_FACT_ORDER_NRPRODUCTS] * [BT_PRODUCT.BC_DIM_PRODUCT_PRICE]" (without the SUM) and specify an aggregation for the 2 used business columns.  The generated SQL will then be:

SELECT
           SUM( BT_ORDER_FACT.NRPRODUCTS )  *  SUM( BT_PRODUCT.PRICE ) AS COL0
FROM
          FACT_ORDER BT_ORDER_FACT
         ,DIM_PRODUCT BT_PRODUCT
WHERE
          ( BT_ORDER_FACT.PRODUCT_TK = BT_PRODUCT.PRODUCT_TK )

It is obviously possible to create 2 versions of the used business columns, one aggregated (exposed to the users) and one non-aggregated (hidden from the users) for example.

The SQL generator works recursively.  That means that it is possible to create a formula that calculates 7% (taxes for example) of the turnover:

  • ID = BC_FACT_ORDER_TURNOVER_TAXES
  • Name = Turnover Taxes
  • Formula = [BT_ORDER_FACT.BC_FACT_ORDER_TURNOVER] * 7 / 100
  • Exact = Yes

 If we add that column to the selection, we get one extra column like this:

(  SUM( BT_ORDER_FACT.NRPRODUCTS  *  BT_PRODUCT.PRICE )  * 7 / 100) AS COL1

Appendix

Formula Syntax

Function syntax

FUNCTION_NAME ( PARAM ; PARAM )

Text (requires double quotes)

"TEXT"

Parenthesis are used for formula precedence:

( 1 + 2) * 3

Metadata References 

Business Column References:

[<BUSINESS_TABLE_ID>.<BUSINESS_COLUMN_ID>]

Physical Column References (only used in physical column formula definitons):

[<PHYSICAL_COLUMN_NAME>]

MQL Parameter References:

[param:PARAM_NAME]

Supported Functions

Function Name

Parameters

Description

Example

OR

two or more boolean expression parameters

Returns true if one or more parameters are true

OR(
[BT_CUSTOMERS.BC_CUSTOMERS_CUSTOMERNAME] = "EuroCars";
[BT_CUSTOMERS.BC_CUSTOMERS_CREDITLIMIT] > 1000
)

AND

two or more boolean expression parameters

Returns true if all parameters are true

AND(
[BT_CUSTOMERS.BC_CUSTOMERS_CUSTOMERNAME] = "EuroCars";
[BT_CUSTOMERS.BC_CUSTOMERS_CREDITLIMIT] > 1000
)

LIKE

two parameters

Compares a column to a regular expression, using "%" as wild cards

LIKE([BT_CUSTOMERS.BC_CUSTOMERS_CUSTOMERNAME]; "%SMITH%")

CONTAINS

two parameters

Determines if a column contains a string.

CONTAINS([BT_CUSTOMERS.BC_CUSTOMERS_CUSTOMERNAME]; "SMITH")

BEGINSWITH

two parameters

Determines if a column begins with a string.

BEGINSWITH([BT_CUSTOMERS.BC_CUSTOMERS_CUSTOMERNAME]; "JOE")

ENDSWITH

two parameters

Determines if a column ends with a string.

ENDSWITH([BT_CUSTOMERS.BC_CUSTOMERS_CUSTOMERNAME]; "SMITH")

IN

two or more parameters

Checks to see if the first parameter is in the following list of parameters

IN([BT_CUSTOMERS.BC_CUSTOMERS_CUSTOMERNAME]; "Adam Smith"; "Brian Jones")

NOW

none

The current date

NOW()

DATE

three numeric parameters, year, month, and day

A specified date

DATE(2008;4;15)

DATEVALUE

one text parameter "year-month-day"

A specified date

DATEVALUE("2008-04-15")

CASE

two or more parameters

Evaluates the first, third, etc parameter, and returns the second, fourth, etc parameter value
if there are an odd number of parameters, the last parameter is returned if no other parameter evaluates to true.
Note that when using this function, the formula needs to be set on a new column, not on the BT_CUSTOMER.BC_CUSTOMER_CUSTOMERNAME (using the example to the right)

CASE(
[BT_CUSTOMERS.BC_CUSTOMERS_CUSTOMERNAME] = "EuroCars";
"European Cars";
[BT_CUSTOMERS.BC_CUSTOMERS_CUSTOMERNAME] = "AsiaCars";
"Asian Cars";
"Unknown Cars"
)

COALESCE

one or more parameters

returns the first non null parameter

COALESCE(
[BT_CUSTOMERS.BC_CUSTOMERS_CUSTOMERNAME];
[BT_CUSTOMERS.BC_CUSTOMERS_CUSTOMERID];
"Customer is Null"
)

DATEMATH

one expression parameter

returns a date based on an expression. Important note - this does NOT return a timestamp irrespective of the implementation details mentioned in the description to the right.DateMath Javadoc for full syntax

DATEMATH("0:ME -1:DS") - 00:00:00.000 of the day before the last day of the current month
DATEMATH("0:MS  0:WE") - 23:59:59.999 the last day of the first week of the month
DATEMATH("0:ME") - 23:59:59.999 of the last day of the current month
DATEMATH("5:Y") -  the current month, day and time 5 years in the future
DATEMATH("5:YS") - 00:00:00.000 of the first day of the years 5 years in the future

ISNA

one parameter

returns true if the value is null

ISNA([BT_CUSTOMERS.BC_CUSTOMERS_CUSTOMERID])

NULL

none

returns the null value

NULL()

TRUE

none

returns true

TRUE()

FALSE

none

returns false

FALSE()

  • see below for aggregate functions

Supported Operators

Operator

Description

=

returns true if two expressions are equal

>

returns true if first expression is larger than the second

<

returns true if first expression is smaller than the second

>=

returns true if first expression is larger than or equal to the second

<=

returns true if first expression is smaller than or equal to the second

<>

returns true if two expressions are not equal

+

adds two values

-

subtracts two values

*

multiplies two values

/

divides two values

Supported Aggregate Functions

Aggregate functions may only be used in physical column definitions. In more recent versions of metadata editor, these functions are no longer required.  Instead, the query generator uses the Aggregation rule specified by the user.

Function Name

Description

SUM

sums a specific columns values determined by grouping

COUNT

counts a specific columns values determined by grouping

AVG

averages a specific columns values determined by grouping

MIN

selects the minimum column value determined by grouping

MAX

selects the maximum column value determined by grouping