Using PostgreSQL 8.x for Hibernated Data in the Platform

February 28, 2007
Submitted by Marc Batchelor, Pentaho Team Developer

Background

Pentaho uses Hibernate for persistence of operational state of the platform. The hibernated data can be categorized in the following areas:

  • Content - Generated content
  • Runtime - Used by the runtime engine to save cross-process state
  • Version Management - Handles hibernated change detection
  • audit - In the subscription edition, audit logging/tracking table structures for audit reporting
  • RDBMS Based solution - in 1.2.x subscription edition, and 1.6 open source, stores solution files (action sequences, resources, etc.)
  • Subscription - In 1.2.x subscription edition, and 1.6 open source, stores schedules, subscriptions and subscription parameters.

Cross-database issues

Just as a background follow-up, the following is a list (not necessarily all-inclusive) of the issues that can (and often are) encountered when providing cross-database support within a platform.

Areas of concern and Incompatibility

Some of the items listed here were encountered when doing this update to the platform:

Reserved Words

Each database has it's own list of reserved words. These reserved words cannot be used as table names, or column names. Examples of these are things like "SELECT" or "DELETE".

Blob handling

Each databases handles persistence of binary large objects differently. Whether limitations are imposed by the database engine, or by the JDBC driver, BLOB handling is a very tricky cross-database issue.

Column Limits

Each database has their own limits on lengths of columns, supported column types, etc. In most cases, the length of columns becomes a factor in whether a column can be indexed, or how many columns can be in a compound index. Databases also have differing length limits regarding object names. For example, on one platform, you'll find that a column name can be no more than 20 characters. On other platforms, column names are restricted to 50 characters.

SQL Syntax differences

Even though there are standards, the funny joke in BI is that each database conforms to it's own standard. There are so many standards, it's hard to keep up. Every database has it's own database-specific dialect for creating schema, for retrieving records, etc. Sure, many bits are are the same across databases. But these are superficial similarities. Yeah,

SELECT COUNT(*) FROM MYTABLE

will likely work in all databases, provided that the table MYTABLE exists, and you're in the correct context. In reality, for one database, it my be

SELECT COUNT(*) FROM DBO.MYTABLE

, or

SELECT COUNT(*) FROM myTable

or

SELECT COUNT(*) FROM DBSPACE.MYTABLE

, etc.

Database Driver Behavior

Some databases come with multiple, differently behaving drivers. For example, Oracle supplies an OCI driver (one that requires native libraries to be installed on the client machine), and a thin driver (one that doesn't require client libraries to be installed). The capabilities of each of these differs. DB2, for a time supplied three different types of UDB drivers. And, the JDBC API support level differs from one database driver to another. So, the DB2 JDBC driver may not be fully JDBC compliant, but another driver may.

Stored Functions/Procedures

Different databases handle these items differently. As it relates to interfacing to these from a BI platform, some databases (notably DB2, SQL Server, and others) allow result sets to be "returned" from stored procedures. On the other hand, some databases require one to use stored functions for this activity. Also, parameter passing to these functions/parameters can be different between database platforms. Notably, some allow IN/OUT/INOUT parameters, others only IN parameters. The list goes on.

Transaction management and ACID compliance

Some databases and their drivers are loose-and-flabby about transactions - others are very picky. It turns out that MySQL isn't so picky, but PostgreSQL is very picky. I'm not casting judgment over which approach is better or worse. But, it's a difference, and code that works just fine against one database, may be completely unacceptable to another.

How does Hibernate Help?

Hibernate provides a layer that mostly insulates an application against the issues listed above. It doesn't address all the issues, but it does a great job of hiding many of the pains of cross-database persistence. The dialect package is a wonderful example of the pains that this open source project has gone through to shield the average (and even advanced) developer from the pains listed above. But, hibernate does have it's limitations.

Overview of changes to support PostgreSQL

I didn't encounter all the issues listed above when providing support for PostgreSQL. But, I did encounter several of them. Here are the main bits:

Column Name Issues

In PostgreSQL, the following list of reserved words were used as column names, and had to be overridden for PostgreSQL support:

  • start became startdt
  • user became usernm
  • hour became hourcol
  • year became yearcol
  • month became monthcol
  • data became datacol

Transaction Issues

Under PostgreSQL, automatic schema generation tasks (DefinitionVersionManager, and PentahoSchemaUpdate) required several modifications related to proper transaction management. One of the biggest issues to resolve is that PostgreSQL requires a rollback on a connection if a query is performed against schema that doesn't yet exist. This isn't a problem with MySQL or HSQLDB. But, with PostgreSQL, attempting to discover that the tables needed to be created resulted in an exception that required a rollback. The fact that rollback was required wasn't apparent until further attempts to use the connection failed with a cryptic message. Rolling back the transaction after the exception was required to solve this problem. Also, in the area of schema generation, the code didn't properly commit the changes.

Telling Hibernate what the unsaved value is

This seems like it should have caused a problem with other RDBMS servers, but we hadn't see it until working with PostgreSQL. In several classes, the "unsaved value" which is used by hibernate to determine whether something is an insert or an update was not specified on the object. Examples were the AuditLogEntry (Management Services), and the DefinitionVersionManager. There were several examples of this that worked fine in MySQL and HSQLDB, but failed with PostgreSQL.

PostgreSQL and sequences/auto-increment fields

In MySQL and HSQLDB, sequence columns (properly defined in the Hibernate definition) will be filled in upon insert of a row if the column is not specified in the insert. An auto-increment field is one where the database assigns the next number in the series to the row of data. With PostgreSQL, the table would have to be defined with a default value for the column (something Hibernate doesn't do), or the column has to be manually assigned during the insert with a special database function call. This means that the .SQL for audit logging (currently defined in the pentaho.xml) has to be different for PostgreSQL than for MySQL and HSQLDB.

Final Analysis

When all was said and done, there were 8 Java class changes, 6 PostgreSQL-specific hibernate hbm.xml files created, 3 hibernate hbm.xml file changes that will be used by all three databases. Additionally, three files from the management services audit solution needed to be modified because of column name changes.