Aggregation functions in crosstabs

1. Technical background

In Pentaho Reporting, aggregation is performed by report functions. All report functions are stateful classes that see all data during their life-time.

Aggregation happens in the items-advanced method, which is called before the details/itemband is printed. This method sees the data at its finest granularity.

Crosstabs are a two dimensional structure, and therefore a linear approach to aggregation does not always work. Linear aggregation works for horizontal aggregations and for crosstab-totals (the sum of all details printed in a crosstab).

Vertical aggregations (column sums), which are printed in summary rows, require changes to the aggregation functions themselves to work properly.

When aggregating vertically, we need to divide the data stream into separate buckets. Each bucket holds the accumulated data for one column-dimension instance. In the simple case of a crosstab without summary columns, the number of buckets will equal the number of data columns printed in the crosstab table.

The reporting engine maintains an index to the current bucket inside the "ReportState" object, to which all functions have access when the ReportEventListener-methods are invoked. We call this index the 'crosstabColumnSequence'. The crosstabColumnSequence is a linear index counting from zero upwards.

The crosstabColumnSequence uniquely identifies the current instance of a given crosstab-column-group within the innermost crosstab-row-group.

Hint: Add a CrosstabColumnSequenceFunction to your report to print this crosstabColumnSequence number in a report. This function is for debugging and teaching purposes, do not use it in production reports.

The numbering is an internal identifier, independent of the existence of other groups and you should only use it as an index to a bucket in an aggregation calculation.

An aggregation function should only acces the crosstabColumnSequence if it has a valid column-dimension to filter on. Passing an group index of any other kind of group into the getter is undefined and the result of this call is also undefined.

Function implementations should hold a copy of the last bucket index so that they can query their collected buckets on 'getValue()'.

The crosstabColumnSequence only changes when a new group starts, so it is safe to update that bucket number during the 'groupStarted' method.

When printing summary rows, the engine has to replay the structure of the last row along with all column-dimension instances encountered when printing the last row of detail data.

While the engine performs the replay, it will fire 'summaryRowSelection' events before the layouter prints any summary cell.

All aggregation function must select the corresponding result bucket for printing the aggregation result during the 'summaryRowSelection' call.

2. Implementing an aggregate function

This example shows how to implement aggregation buckets on an existing aggregation function. I assume that you are familiar with the normal function processing.
This example shows how to implement aggregation buckets on an existing aggregation function. I assume that you are familiar with the normal function processing.

--- ItemSumFunction.java	(revision before crosstabs)
+++ ItemSumFunction.java	(revision with crosstabs)
==========================================================================================================================
 package org.pentaho.reporting.engine.classic.core.function;

 import java.math.BigDecimal;
 
 import org.pentaho.reporting.engine.classic.core.event.ReportEvent;
+import org.pentaho.reporting.engine.classic.core.util.Sequence;
 
 /** 
  * A report function that calculates the sum of one field (column) from the data-row. This function produces a running
  * total, no global total. For a global sum, use the TotalGroupSumFunction function. The function can be used in two
  * ways: <ul> <li>to calculate a sum for the entire report;</li> <li>to calculate a sum within a particular group;</li>
  * </ul> This function expects its input values to be either java.lang.Number instances or Strings that can be parsed to
  * java.lang.Number instances using a java.text.DecimalFormat.
  * <p/>
  * The function understands two parameters, the field parameter is required and denotes the name of an
  * ItemBand-field which gets summed up.
  * <p/>
  * The parameter group denotes the name of a group. When this group is started, the counter gets reset to
  * null.
  *
  * @author Thomas Morgner
  */
 public class ItemSumFunction extends AbstractFunction implements FieldAggregationFunction
 {
   /**
    * A useful constant representing zero.
    */
   protected static final BigDecimal ZERO = new BigDecimal(0.0);

   /**
    * The item sum.
    */
-  private transient BigDecimal sum;
+  private Sequence<BigDecimal> sum;
+  private transient int lastGroupSequenceNumber;
+

Instead of a single 'sum' value, a crosstab aware function needs a sequence of values. This sequence represents the buckets for the per-column aggregation.

   /**
    * The name of the group on which to reset the count. This can be set to null to compute the sum for the whole
    * report.
    */
   private String group;
   /**
    * The name of the field from where to read the values.
    */
   private String field;
 
+  private String crosstabFilterGroup;
+

The crosstabFilterGroup property holds the name of the column dimension that will be used to select an bucket. Leave it empty for horizontal or non-crosstab aggregations.

   /**
    * Constructs an unnamed function. Make sure to set a Name or function initialisation will fail.
    */
   public ItemSumFunction()
   {
-    sum = ItemSumFunction.ZERO;
+    sum = new Sequence<BigDecimal>();
   }
 
   /**
    * Receives notification that a new report is about to start. <P> Does nothing.
    *
    * @param event Information about the event.
    */
   public void reportInitialized(final ReportEvent event)
   {
-    this.sum = ItemSumFunction.ZERO;
+    clear();
   }
 
+  protected void clear()
+  {
+    this.lastGroupSequenceNumber = 0;
+    this.sum.clear();
+  }
+

Resetting or clearing the value now involves clearing all contents from the buckets. Clearing happens whenever we hit a new reset-group instance.

   /**
    * Receives notification that a new group is about to start.  If this is the group defined for the function, then the
    * running total is reset to zero.
    *
    * @param event Information about the event.
    */
   public void groupStarted(final ReportEvent event)
   {
-    if (FunctionUtilities.isDefinedGroup(getGroup(), event))
-    {
-      this.sum = ItemSumFunction.ZERO;
-    }

+    if (FunctionUtilities.isDefinedGroup(getGroup(), event))
+    {
+      clear();
+    }
+
+    if (FunctionUtilities.isDefinedGroup(getCrosstabFilterGroup(), event))
+    {
+      final int groupIndex = event.getState().getCurrentGroupIndex();
+      this.lastGroupSequenceNumber = (int) event.getState().getCrosstabColumnSequenceCounter(groupIndex);
+    }
   }

During the groupStarted event we either reset the aggregation or select a new bucket.

As you can see, the clearing for the reset-group is still roughly the same as before. In addition to that, we now select the appropriate bucket whenever a matching column-dimension is started.

If there is no crosstabFilterGroup defined or we never find a match, then the lastGroupSequenceNumber will always be zero, and we just use the first bucket of the sequence for everything.

 
   /**
    * Returns the group name.
    *
    * @return The group name.
    */
   public String getGroup()
   {
     return group;
   }

   /**
    * Sets the group name. <P> If a group is defined, the running total is reset to zero at the start of every instance
    * of this group.
    *
    * @param name the group name (null permitted).
    */
   public void setGroup(final String name) 
   {
     this.group = name;
   }

   /**
    * Returns the field used by the function. The field name corresponds to a column name in the report's data-row.
    *
    * @return The field name.
    */
   public String getField()
   {
     return field;
   }

   /**
    * Sets the field name for the function. The field name corresponds to a column name in the report's data-row.
    *
    * @param field the field name.
    */
   public void setField(final String field)
   {
     this.field = field;
   }

   /**
    * Receives notification that a row of data is being processed.  Reads the data from the field defined for this
    * function and adds it to the running total. <P> This function assumes that it will find an instance of the Number
    * class in the column of the data-row specified by the field name.
    *
    * @param event Information about the event.
    */
   public void itemsAdvanced(final ReportEvent event)
   {
     final Object fieldValue = getDataRow().get(getField());
     if (fieldValue == null)
     {
       return;
     }
     if (fieldValue instanceof Number == false)
     {
       return;
     }
 
-    final Number n = (Number) fieldValue;
-    if (n instanceof BigDecimal)
-    {
-      sum = sum.add((BigDecimal) n);
-    }
-    else
-    {
-      sum = sum.add(new BigDecimal(n.toString()));
-    }

+    final Number numerValue = (Number) fieldValue;
+    final BigDecimal number = ExpressionUtilities.convertToBigDecimal(numerValue);
+    final BigDecimal oldValue = sum.get(lastGroupSequenceNumber);
+    if (oldValue == null)
+    {
+      sum.set(lastGroupSequenceNumber, number);
+    }
+    else
+    {
+      sum.set(lastGroupSequenceNumber, oldValue.add(number));
+    }
   }
 

The actual details processing is roughly the same as before, but instead of aggregating into a single value, we aggregate into the currently selected bucket.

If no bucket is ever selected, we aggregate into the first bucket (index = 0) and the function behaves exactly as before.

Note that we have to handle 'null' values from the sequence. If we started a new bucket, the sequence value will be null.


   /**
    * Returns the function value, in this case the running total of a specific column in the report's data-row.
    *
    * @return The function value.
    */
   public Object getValue()
   {
-    return sum;

+    final BigDecimal value = sum.get(lastGroupSequenceNumber);
+    if (value == null)
+    {
+      return ZERO;
+    }
+    return value;
   }

Return the value from the currently selected bucket. If there is no value, we return zero instead of null, so that we remain backward compatible with the old implementation.


+  public void summaryRowSelection(final ReportEvent event)
+  {
+    if (FunctionUtilities.isDefinedGroup(getCrosstabFilterGroup(), event))
+    {
+      final int groupIndex = event.getState().getCurrentGroupIndex();
+      this.lastGroupSequenceNumber = (int) event.getState().getCrosstabColumnSequenceCounter(groupIndex);
+    }
+  }
 

This is a new method for selecting an aggregation bucket while a summary row is printed. This selection code is the same as the one in the 'groupStarted' event, but we do not need any code to reset the sequence values here.

This method selects buckets to print results, it should not modify any data at that point.


   /**
    * Return a completly separated copy of this function. The copy does no longer share any changeable objects with the
    * original function.
    *
    * @return a copy of this function.
    */
   public Expression getInstance()
   {
     final ItemSumFunction function = (ItemSumFunction) super.getInstance();
-    function.sum = ItemSumFunction.ZERO;
+    function.sum = sum.clone();
+    function.lastGroupSequenceNumber = 0;
     return function;
   }
 

The factory method to create a new instance for the report processing must handle the sequence. The old version of this function had a BigDecimal, which is an immutable object and thus does not need to be cloned.

The sequence object is not immutable, and thus we need to clone it or the clone and the old instance work on the same object, causing calculation errors.

 
+  public Object clone()
+  {
+    try
+    {
+      final ItemSumFunction clone = (ItemSumFunction) super.clone();
+      clone.sum = sum.clone();
+      return clone;
+    }
+    catch (CloneNotSupportedException e)
+    {
+      throw new IllegalStateException();
+    }
+  }
+
+  public String getCrosstabFilterGroup()
+  {
+    return crosstabFilterGroup;
+  }
+
+  public void setCrosstabFilterGroup(final String crosstabFilterGroup)
+  {
+    this.crosstabFilterGroup = crosstabFilterGroup;
+  }
}
 

See above. Also getter and setter for the 'crosstabFilterGroup', so that the bean-inspector recognizes this property.

3. Using an crosstab aware aggregation function

Relational reports, Aggregating the full crosstab, a crosstab-other group or a crosstab-column group:

Set the group on which to reset in the 'group' property, leave the 'crosstabFilterGroup' empty.

Crosstab row groups:

Set the name of row-dimension on which you want to reset the aggregation in the 'group' property. Set the 'crosstabFilterGroup' to the crosstab-column-group for which this aggregation is calculated.

If you calculate a aggregation for a detail cell, set the 'crosstabFilterGroup' to the name of the innermost crosstab-column-dimension.