Classic Engine Algorithm

Report groups

Reports used in real-world business are more than simple lists of rows and columns. To make the presented data informative, the report data is sorted and condensed into separate sections that correspond to the logical structure of the report. Report engines therefore introduced the concept of groups to support the structuring of report data.

With groups, a set of rows, the report data set, is subdivided into an ordered collection of smaller subsets of rows. All rows of a group instance share a common attribute. In the domain of relational databases, the attribute is defined by the name of one or more columns. Within the group instance, all those columns have the same value. That set of attributes, which identifies a group, is called the group key.

Groups in reporting can usually be mapped to 'is-part-of' relations in the real world. Employees could be grouped by the department they work in, or by the first letter of their last name.

The Control Break Algorithm

Reporting can be a very resource-intensive process. Data has to be queried from databases and other sources, it must be processed and sorted, intermediate results must be computed, and finally the report output must be generated. Even with today's powerful computers, performance is key to customer satisfaction. To meet these performance requirements, the engine must not waste time building a global view of the report data unless absolutely necessary.

For performance reasons, group memberships are computed at runtime with an algorithm called the 'control-break algorithm'. Understanding this algorithm is the key to successful reporting, as it heavily influences how the report engine behaves and how groups are built up. The control-break algorithm does not build a global view over all group instances. To detect the end of one group and the start of a new one, it compares the current row with the previously read row of data. If the value of any of the group's key columns differs, the current group must be finished. If report processing has not reached the last row, a new group instance is opened immediately. This new group remains active until either the end of the report data has been reached or the group key values change.

Some Definitions

A group is a contiguous set of rows, where all rows of the group share the same group key.

A group instance is identified by a group key. The key consists of one or more attributes, where all entities in that group have the same values stored in the specified attributes. When using databases, entity sets are usually represented by tables and attributes are represented by the table's columns.

A sub group is a group whose key contains all attributes of its parent's key plus at least one more attribute. Therefore, a sub group's rowset is a subset of its parent's rowset: all rows of the sub group are fully contained in the parent's rowset.

A group with a group key without attributes defined is called the default group. This group has a single instance which always spans the complete report data set.
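The definitions above can be sketched in Java. This is only an illustration under stated assumptions, not code from the engine: the class name GroupKeySketch, the method keyOf and the column indices are hypothetical and were chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the reporting engine.
final class GroupKeySketch {
  // Assumed column indices for rows of the form {location, department, employee}.
  static final int LOCATION_COL = 0;
  static final int DEPARTMENT_COL = 1;

  // The group key of a row is the list of values found in the key columns.
  static List<Object> keyOf(Object[] row, int... keyColumns) {
    List<Object> key = new ArrayList<>();
    for (int column : keyColumns) {
      key.add(row[column]);
    }
    return key;
  }

  public static void main(String[] args) {
    Object[] john = {"Denver", "Sales", "John Doe"};
    Object[] arthur = {"Denver", "Marketing", "Arthur Dent"};

    // Parent group keyed by location: both rows share the same key.
    System.out.println(keyOf(john, LOCATION_COL).equals(keyOf(arthur, LOCATION_COL)));
    // Sub group keyed by location and department: the keys differ.
    System.out.println(keyOf(john, LOCATION_COL, DEPARTMENT_COL)
        .equals(keyOf(arthur, LOCATION_COL, DEPARTMENT_COL)));
    // Default group: the empty key is equal for every row.
    System.out.println(keyOf(john).equals(keyOf(arthur)));
  }
}
```

Because the sub group's key extends the parent's key, two rows that share a sub group key necessarily share the parent key as well. The empty key of the default group never changes, which is why that group has exactly one instance.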

The single level Control-Break-Algorithm as pseudo code

OPEN file.
READ record.
PRINT ReportHeader.
WHILE NOT END-OF-FILE
DO
   VAR groupKey := GET-CURRENT-GROUP-KEY.
   PRINT GroupHeader.
   WHILE (NOT END-OF-FILE AND
          groupKey IS-EQUAL-TO GET-CURRENT-GROUP-KEY)
   DO
      PRINT ItemBand.
      READ record.
   DONE.
   PRINT GroupFooter.
DONE.
PRINT ReportFooter.
CLOSE file.

And in Java (the complete example is in CVS):

public static void main (final String[] args)
{
  final Object[][] data = initData();

  // cursor at the first row ...
  // in this example, modifying the 'position' is equal to a 'read' operation
  int position = 0;

  printReportHeader(data, position);
  while (!isEndOfFile(data, position))
  {
    // remember the current group key, taken from the first row of the group;
    // if the group key changes, we have to do a control break ...
    final Object groupKey = data[position][CONTINENT_COL];
    printGroupHeader(data, position);

    while (!isEndOfFile(data, position) &&
           data[position][CONTINENT_COL].equals(groupKey))
    {
      // print the items ...
      printItems(data, position);

      // now 'read' the next row of data ...
      position++;
    }
    // the footer is printed for the last row of the group that just ended
    printGroupFooter(data, position - 1);
  }

  printReportFooter(data, position - 1);
}

From this algorithm, we can derive some basic principles about groups as they are used in Pentaho Reporting.

  1. Data used in group keys must be sorted according to the group hierarchy definition.
    As only neighbouring rows are compared with each other, the algorithm treats rows as members of the same group instance only if all rows sharing a particular group key are kept together as direct neighbours in the report data set.
  2. A group with no attributes defined will have a single instance, which spans the complete report data set.
  3. As long as there is data available, a new group instance will be opened immediately after the previous instance has been closed down. It is not possible to print the item band without having an open group.
  4. The order in which attributes are specified within a group definition is not important. For the algorithm, it only matters whether at least one attribute's value has changed, not which one.
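The sorting requirement is easy to demonstrate. The following sketch runs a single-level control break over the values of one key column and records every group instance it opens. The class name SortedInputSketch, the method groupInstances and the sample values are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Illustrative sketch only; names are hypothetical.
final class SortedInputSketch {
  // Runs a single-level control break over the values of one key column
  // and returns the key of every group instance that was opened.
  static List<String> groupInstances(String[] keyColumn) {
    List<String> instances = new ArrayList<>();
    int position = 0;
    while (position < keyColumn.length) {
      String groupKey = keyColumn[position];
      instances.add(groupKey); // a group header would be printed here
      while (position < keyColumn.length
          && Objects.equals(keyColumn[position], groupKey)) {
        position++; // an item band would be printed here
      }
      // a group footer would be printed here
    }
    return instances;
  }

  public static void main(String[] args) {
    String[] sorted = {"Denver", "Denver", "New York", "New York"};
    String[] unsorted = {"Denver", "New York", "Denver", "New York"};
    System.out.println(groupInstances(sorted));   // [Denver, New York]
    System.out.println(groupInstances(unsorted)); // [Denver, New York, Denver, New York]
  }
}
```

With unsorted input, the same key value opens a new group instance each time it reappears, because the algorithm never looks further back than the remembered key of the current group.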

By treating the rows of a group instance as a new data set, we can nest multiple control-break runs inside each other. These multi-level control breaks add some new behavioural constraints to the engine.

  1. A sub group can only be opened after its parent group has been opened.
    Subgroup processing starts as soon as the group header of the parent group has been printed. Immediately after the subgroup processing finishes, the parent group's footer is printed. The processing flow cannot reach the subgroup without passing through the parent group.
  2. A sub group must give up control as soon as one of the parent group's attributes changes.
    A control break in one of the parent groups closes all subgroups. As soon as the parent group has generated a new group instance, new instances of the sub groups are opened as well. (In Pentaho Reporting, sub groups must contain all the group attributes of their parent.)
  3. Adding an attribute that has a constant value to the group definition will not alter the number or order of the generated group instances.
    This allows you to insert artificial group levels into the report by referencing static or non-existing fields (which always evaluate to 'null') in addition to the real group fields. That little trick can be used to print more than one group header or footer.
  4. A group can only have one directly attached sub group. Building trees of groups is impossible with this algorithm.
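The nesting described above can be sketched as a recursive control break. This is a simplified illustration, not the engine's implementation: the class NestedControlBreakSketch, its run method and the event strings are hypothetical, each group level is assumed to add exactly one column to its parent's key, at least one group column is assumed, and the sample rows are taken from the example data set used in this section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Illustrative sketch only; class and method names are hypothetical.
final class NestedControlBreakSketch {
  private final String[][] data;
  private final int[] groupColumns; // groupColumns[0] belongs to the outermost group
  private final List<String> events = new ArrayList<>();
  private int position = 0;         // cursor; incrementing it 'reads' the next row

  private NestedControlBreakSketch(String[][] data, int[] groupColumns) {
    this.data = data;
    this.groupColumns = groupColumns;
  }

  // Assumes at least one group column. Returns start/item/end events in print order.
  static List<String> run(String[][] data, int... groupColumns) {
    NestedControlBreakSketch sketch = new NestedControlBreakSketch(data, groupColumns);
    sketch.processGroups(0, null);
    return sketch.events;
  }

  // The key of the group at 'level' consists of the columns of levels 0..level,
  // so a change in any parent attribute also breaks every sub group.
  private boolean sameKey(String[] keyRow, int level) {
    for (int i = 0; i <= level; i++) {
      int column = groupColumns[i];
      if (!Objects.equals(data[position][column], keyRow[column])) {
        return false;
      }
    }
    return true;
  }

  // Processes all instances of the group at 'level' inside the current parent instance.
  private void processGroups(int level, String[] parentKeyRow) {
    while (position < data.length && (level == 0 || sameKey(parentKeyRow, level - 1))) {
      String[] keyRow = data[position]; // the row that opens this group instance
      String name = level + ":" + keyRow[groupColumns[level]];
      events.add("start " + name);
      if (level + 1 < groupColumns.length) {
        processGroups(level + 1, keyRow);
      } else {
        while (position < data.length && sameKey(keyRow, level)) {
          events.add("item " + data[position][data[position].length - 1]);
          position++;
        }
      }
      events.add("end " + name);
    }
  }

  public static void main(String[] args) {
    String[][] rows = {
      {"Denver", "Sales", "John Doe"},
      {"Denver", "Sales", "Jane Doe"},
      {"Denver", "Marketing", "Arthur Dent"},
      {"New York", "Marketing", "Adam Johnson"},
      {"New York", "Marketing", "Eve Lynn"},
      {"New York", "Management", "J.D. Salinger"},
    };
    run(rows, 0, 1).forEach(System.out::println); // location, then department
  }
}
```

Calling run(rows, 1, 0) instead makes 'department' the outer group. In both cases a sub group's start event always follows its parent's start event, and its end event always precedes its parent's end event.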
     

Working with groups: Examples

The effects of the control break algorithm are best understood by looking at an example. Let's take the following table as the data source.

Table Basics of reporting.1. Initial data set

Location | Department | Employee      | Salary
---------+------------+---------------+--------
Denver   | Sales      | John Doe      | 100.000
Denver   | Sales      | Jane Doe      | 100.000
Denver   | Marketing  | Arthur Dent   | 125.000
New York | Marketing  | Adam Johnson  | 125.000
New York | Marketing  | Eve Lynn      | 145.000
New York | Management | J.D. Salinger | 500.000

When grouping the table by the 'department' column, we'll get the following three group instances.

Table Basics of reporting.2. Data set grouped by 'Department'

Location | Department | Employee      | Salary  | Notes
---------+------------+---------------+---------+-----------------------------------
         |            |               |         | group start for 'department group'
Denver   | Sales      | John Doe      | 100.000 |
Denver   | Sales      | Jane Doe      | 100.000 |
         |            |               |         | group end for 'department group'
         |            |               |         | group start for 'department group'
Denver   | Marketing  | Arthur Dent   | 125.000 |
New York | Marketing  | Adam Johnson  | 125.000 |
New York | Marketing  | Eve Lynn      | 145.000 |
         |            |               |         | group end for 'department group'
         |            |               |         | group start for 'department group'
New York | Management | J.D. Salinger | 500.000 |
         |            |               |         | group end for 'department group'

Within each group instance, the value of the 'department' column is the same for all rows of that group. We get one group instance for each department. Note that the data is ordered so that all rows of a department are adjacent.

Now, let's use the 'location' column as the group key. We now get two group instances, 'Denver' and 'New York'.

Table Basics of reporting.3. Data set grouped by 'Location'

Location | Department | Employee      | Salary  | Notes
---------+------------+---------------+---------+---------------------------------
         |            |               |         | group start for 'location group'
Denver   | Sales      | John Doe      | 100.000 |
Denver   | Sales      | Jane Doe      | 100.000 |
Denver   | Marketing  | Arthur Dent   | 125.000 |
         |            |               |         | group end for 'location group'
         |            |               |         | group start for 'location group'
New York | Marketing  | Adam Johnson  | 125.000 |
New York | Marketing  | Eve Lynn      | 145.000 |
New York | Management | J.D. Salinger | 500.000 |
         |            |               |         | group end for 'location group'

Of course, we can combine groups to create multi-level reports. The control break algorithm allows only one group definition per level. That means we cannot have a report that has top-level groupings for 'location' and 'department' at the same time. But we are able to subdivide groups in an ordered way.

For example, we can first group by 'location' and, in a second step, subdivide each location by its departments. This grouping produces the following layout:

Table Basics of reporting.4. Data set grouped by 'Location' and subgrouped by 'department'

Location | Department | Employee      | Salary  | Notes
---------+------------+---------------+---------+-----------------------------------
         |            |               |         | group start for 'location group'
         |            |               |         | group start for 'department group'
Denver   | Sales      | John Doe      | 100.000 |
Denver   | Sales      | Jane Doe      | 100.000 |
         |            |               |         | group end for 'department group'
         |            |               |         | group start for 'department group'
Denver   | Marketing  | Arthur Dent   | 125.000 |
         |            |               |         | group end for 'department group'
         |            |               |         | group end for 'location group'
         |            |               |         | group start for 'location group'
         |            |               |         | group start for 'department group'
New York | Marketing  | Adam Johnson  | 125.000 |
New York | Marketing  | Eve Lynn      | 145.000 |
         |            |               |         | group end for 'department group'
         |            |               |         | group start for 'department group'
New York | Management | J.D. Salinger | 500.000 |
         |            |               |         | group end for 'department group'
         |            |               |         | group end for 'location group'

The order of the groups is important. It makes a difference whether one groups first by location and then by department, or first by department and then by location. As you can see, a subgroup is always part of the parent group. When a subgroup's headers are printed, the parent group's headers have already been fully processed. For footers, we can see that the footer of the subgroup is always printed before the parent group's footer.

Now we switch the group order: the 'department' group is the top-level group, followed by the 'location' subgroup.

Table Basics of reporting.5. Data set grouped by 'department' and subgrouped by 'location'

Location | Department | Employee      | Salary  | Notes
---------+------------+---------------+---------+-----------------------------------
         |            |               |         | group start for 'department group'
         |            |               |         | group start for 'location group'
Denver   | Sales      | John Doe      | 100.000 |
Denver   | Sales      | Jane Doe      | 100.000 |
         |            |               |         | group end for 'location group'
         |            |               |         | group end for 'department group'
         |            |               |         | group start for 'department group'
         |            |               |         | group start for 'location group'
Denver   | Marketing  | Arthur Dent   | 125.000 |
         |            |               |         | group end for 'location group'
         |            |               |         | group start for 'location group'
New York | Marketing  | Adam Johnson  | 125.000 |
New York | Marketing  | Eve Lynn      | 145.000 |
         |            |               |         | group end for 'location group'
         |            |               |         | group end for 'department group'
         |            |               |         | group start for 'department group'
         |            |               |         | group start for 'location group'
New York | Management | J.D. Salinger | 500.000 |
         |            |               |         | group end for 'location group'
         |            |               |         | group end for 'department group'

As we can see, whenever the department changes, the location group is closed down and then reopened once the department group has generated a new instance.