XML Input Stream (StAX)

PLEASE NOTE: This documentation applies to an earlier version. For the most recent documentation, visit the Pentaho Enterprise Edition documentation site.

Description

This step reads data from any type of XML file using the StAX parser. The existing Get Data from XML step is easier to use, but it relies on DOM parsers that process the file in memory; even purging parts of the file is not sufficient when those parts are very big.

The XML Input Stream (StAX) step takes a completely different approach to handle very big and complex data structures and the need for very fast data loads. Because Kettle already has many steps for processing data in different ways, the processing logic has been moved into the transformation, and the step itself provides the raw XML data stream together with additional, helpful processing information.
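To make the idea of a raw XML data stream more concrete, the following sketch flattens a small XML document into one row per parser event. It is written in Python with the standard `xml.parsers.expat` module purely for illustration; it is not the step's Java StAX code. The function name `xml_to_rows`, the sample document, and the choice to leave the name empty on CHARACTERS rows are assumptions of this sketch, and the rows are collected in a list only for brevity.

```python
import xml.parsers.expat


def xml_to_rows(xml_text):
    """Return a flat list of (data_type, name, value) rows for an XML string."""
    rows = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver each text run as a single callback

    def on_start(name, attrs):
        rows.append(("START_ELEMENT", name, None))
        # Attributes become rows of their own, right after their element.
        for attr_name, attr_value in attrs.items():
            rows.append(("ATTRIBUTE", attr_name, attr_value))

    def on_end(name):
        rows.append(("END_ELEMENT", name, None))

    def on_text(data):
        text = data.strip()  # roughly what the "Trim strings?" option does
        if text:
            # Assumption: the name is left empty for text rows in this sketch.
            rows.append(("CHARACTERS", None, text))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(xml_text, True)
    return rows


# Hypothetical sample document, not taken from the shipped samples.
SAMPLE = '<products><product id="1"><name>Widget</name></product></products>'
for row in xml_to_rows(SAMPLE):
    print(row)
```

Everything after this point (grouping rows into records, pivoting attributes into columns, and so on) is left to downstream steps, which is the division of labour described above.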

Since the processing logic of some XML files can sometimes be very tricky, a good knowledge of the existing Kettle steps is recommended to use this step. Please see the different samples at the end of this page for illustrations of the usage.

Note: Almost all use cases need some form of Set/Reset functionality. At this time it can be accomplished with the Modified Java Script Value step or the User Defined Java Class step; the latter is recommended and much faster. A dedicated Kettle step with Set/Reset functionality is on the road map to solve these and similar use cases; see PDI-6389 for more details.
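The page does not define Set/Reset precisely. One plausible reading is: remember a value when a certain row arrives (set), attach it to the following rows, and clear it when the enclosing block ends (reset). The sketch below illustrates that reading in Python; in a transformation the same logic would live in a User Defined Java Class or Modified Java Script Value step. The function name `tag_rows_with_product_id`, the row layout `(data_type, name, value)`, and the sample rows are all hypothetical.

```python
def tag_rows_with_product_id(rows):
    """Yield (product_id, data_type, name, value) for each input row.

    product_id is set when the id attribute of a <product> is seen
    and reset after the matching END_ELEMENT row.
    """
    current_id = None    # the set/reset variable
    open_element = None  # name of the most recently opened element
    for data_type, name, value in rows:
        if data_type == "START_ELEMENT":
            open_element = name
        elif data_type == "ATTRIBUTE" and open_element == "product" and name == "id":
            current_id = value  # SET
        yield (current_id, data_type, name, value)
        if data_type == "END_ELEMENT" and name == "product":
            current_id = None  # RESET, after the closing row was tagged


# Hypothetical rows in the shape (data_type, name, value).
ROWS = [
    ("START_ELEMENT", "products", None),
    ("START_ELEMENT", "product", None),
    ("ATTRIBUTE", "id", "1"),
    ("START_ELEMENT", "name", None),
    ("CHARACTERS", None, "Widget"),
    ("END_ELEMENT", "name", None),
    ("END_ELEMENT", "product", None),
    ("START_ELEMENT", "product", None),
    ("ATTRIBUTE", "id", "2"),
    ("END_ELEMENT", "product", None),
    ("END_ELEMENT", "products", None),
]

for tagged in tag_rows_with_product_id(ROWS):
    print(tagged)
```

Because the variable is carried from row to row, the rows must be processed in their original order.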

Choose this step whenever you run into limitations with other steps or need to parse XML under the following conditions:

Very fast and independent of memory regardless of the file size (GBs and more are possible due to the streaming approach)
Very flexible: different parts of the XML file can be read in different ways (which avoids parsing the file many times)

Options

Property	Description	Default (bold=selected)
Step name	Name of the step; the name has to be unique in a single transformation.
Filename	Specifies the file name of the input XML file.
Add filename to result?	Adds the processed XML filename to the result of this transformation. A unique list is being kept in memory that can be used in the next job entry in a job, for example in another transformation.	no
Skip (Elements/Attributes)	The number of Elements / Attributes that should be skipped. This can be used to start the processing at a specific location in a file. The file is still being loaded by the parser, but no rows are produced for the skipped part.	0
Limit (Elements/Attributes)	The processing stops after the given limit of Elements / Attributes. With the Skip and Limit properties it is possible to enable chunk loading that is defined in an outer loop.	0
Default String Length (Elements / Attributes)	This is the default string length for the XML data name and value fields.	1024
Encoding	The encoding of the XML file.	UTF-8
Add Namespace information?	When namespace information is selected, the XML data type NAMESPACE is added to the stream with an optional prefix (given in the XML data name) and URI information (given in the XML data value). In addition, a defined prefix in the ELEMENT data type is prepended to the XML data name, e.g. prefix:product. Performance considerations: Due to the extra namespace handling, this option slightly reduces the processing throughput.	no
Trim strings?	When selected, the step trims all name/value elements and attributes. This removes white space, tab, CR, and LF characters at the beginning and end of the string.	yes
Include filename in output? / Fieldname	When selected, the step adds the processed filename to the given fieldname.	xml_filename (String 256)
Row number in output? / Fieldname	When selected, the step adds the processed row number (starting with 1) to the given fieldname.	xml_row_number (Integer)
XML data type (numeric) in output? / Fieldname	When selected, the step adds the processed data type in numeric format to the given fieldname. The following data types are defined: 0 - "UNKNOWN" (not used, reserved); 1 - "START_ELEMENT"; 2 - "END_ELEMENT"; 3 - "PROCESSING_INSTRUCTION" (not used, reserved); 4 - "CHARACTERS"; 5 - "COMMENT" (not used, reserved); 6 - "SPACE" (not used, reserved); 7 - "START_DOCUMENT"; 8 - "END_DOCUMENT"; 9 - "ENTITY_REFERENCE" (not used, reserved); 10 - "ATTRIBUTE"; 11 - "DTD" (not used, reserved); 12 - "CDATA" (not used, reserved); 13 - "NAMESPACE" (when namespace information is selected); 14 - "NOTATION_DECLARATION" (not used, reserved); 15 - "ENTITY_DECLARATION" (not used, reserved)	xml_data_type_numeric (Integer)
XML data type (description) in output? / Fieldname	When selected, the step adds the processed data type in text format to the given fieldname. This should be used instead of the numeric data type for better readability of the transformation. See XML data type (numeric) for a list of values. Performance considerations: Due to slower processing of strings and the extra memory consumption, it is recommended to use the numeric data type format for big data loads.	xml_data_type_description (String 25)
XML location line in output? / Fieldname	When selected, the step adds the processed source XML location line to the given fieldname.	xml_location_line (Integer)
XML location column in output? / Fieldname	When selected, the step adds the processed source XML location column to the given fieldname.	xml_location_column (Integer)
XML element ID in output? / Fieldname	When selected, the step adds the processed element number (starting with 0) to the given fieldname. In contrast to the Row number, this field is incremented for each new element and not for each new row. The correct nesting between levels is ensured.	xml_element_id (Integer)
XML parent element ID in output? / Fieldname	When selected, the step adds the parent element number to the given fieldname. Note: By the use of the XML element ID in connection with the XML parent element ID, a complete XML element tree is available for later usage.	xml_parent_element_id (Integer)
XML element level in output? / Fieldname	When selected, the step adds the processed element level (starting with 0 for the root START_ and END_DOCUMENT) to the given fieldname.	xml_element_level (Integer)
XML path in output? / Fieldname	When selected, the step adds the processed XML path to the given fieldname.	xml_path (String 1024)
XML parent path in output? / Fieldname	When selected, the step adds the processed XML parent path to the given fieldname.	xml_parent_path (String 1024)
XML data name in output? / Fieldname	When selected, the step adds the processed data name of elements, attributes and optional namespace prefixes to the given fieldname.	xml_data_name (String 1024 or Default String Length)
XML data value in output? / Fieldname	When selected, the step adds the processed data value of elements, attributes and optional namespace URIs to the given fieldname.	xml_data_value (String 1024 or Default String Length)
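The element ID, parent element ID, element level, and path fields described in the table can all be derived from a stack of currently open elements. The following Python sketch shows one way to do that with the standard `xml.parsers.expat` module. It is an illustration only, not the step's implementation. It assumes that the document itself counts as element 0 at level 0, so the root element gets ID 1 and level 1, and that the parent path of the root element is an empty string; the real step may number or format these values differently. The function name `element_tree_rows` and the sample document are hypothetical.

```python
import xml.parsers.expat


def element_tree_rows(xml_text):
    """Return one dict per START_ELEMENT with ID, parent ID, level and paths."""
    rows = []
    # Stack of (element_id, path) for open ancestors.
    # Assumption: entry 0 stands for the document itself.
    stack = [(0, "")]
    next_id = 1

    def on_start(name, attrs):
        nonlocal next_id
        parent_id, parent_path = stack[-1]
        element_id = next_id
        next_id += 1
        path = parent_path + "/" + name
        rows.append({
            "xml_element_id": element_id,
            "xml_parent_element_id": parent_id,
            "xml_element_level": len(stack),  # depth below the document
            "xml_path": path,
            "xml_parent_path": parent_path,
        })
        stack.append((element_id, path))

    def on_end(name):
        stack.pop()  # the element is closed, return to its parent

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(xml_text, True)
    return rows


# Hypothetical sample document.
SAMPLE = "<products><product><name>A</name></product><product/></products>"
for row in element_tree_rows(SAMPLE):
    print(row)
```

With the element ID and parent element ID on every row, a later step can rebuild the full element tree by joining the stream to itself on those two fields.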

Samples

Sample transformations demonstrating the capabilities of this step are available in the samples folder of the distribution package:

samples/transformations/XML Input Stream (StAX) Test 1 - Basic Tests.ktr
samples/transformations/XML Input Stream (StAX) Test 2 - Element Blocks.ktr
samples/transformations/XML Input Stream (StAX) Test 3 - Attribute Groups.ktr
samples/transformations/XML Input Stream (StAX) Test 4 - Hierarchies.ktr
samples/transformations/XML Input Stream (StAX) Test 5 - Performance Test Data for Element Blocks.ktr
samples/transformations/XML Input Stream (StAX) Test 6 - Namespaces.ktr

Usage Sample for XML Input Stream (StAX) Test 2 - Element Blocks

This example parses the file XML Input Stream (StAX) Test 2 - Element Blocks.xml,
which contains two main sample data blocks (Analyzer Lists & Products).

The data blocks are separated by splitting the parent XML path into levels and routing the rows with Switch / Case steps.
The same separation can also be achieved with the 'string contains' option of the Switch / Case step or with other steps.
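The routing idea can be sketched outside of Kettle as follows: split the parent path into its levels, pick the level that names the data block, and send each row to the matching stream. This Python sketch only illustrates the logic of the split plus Switch / Case combination. The function name `route_by_block`, the `default` stream, and the element names in the sample paths (`data`, `analyzer_lists`, `products`) are hypothetical and are not taken from the sample XML file.

```python
def route_by_block(rows, level=2):
    """Group rows by the element name at the given depth of xml_parent_path."""
    streams = {}
    for row in rows:
        # "/data/products/product" -> ["data", "products", "product"]
        parts = row["xml_parent_path"].strip("/").split("/")
        if len(parts) >= level and parts[level - 1]:
            key = parts[level - 1]
        else:
            key = "default"  # rows above the block level, like a default target
        streams.setdefault(key, []).append(row)
    return streams


# Hypothetical rows; only the fields needed for routing are shown.
ROWS = [
    {"xml_parent_path": "/data", "xml_data_name": "analyzer_lists"},
    {"xml_parent_path": "/data/analyzer_lists/list", "xml_data_name": "name"},
    {"xml_parent_path": "/data/products/product", "xml_data_name": "price"},
    {"xml_parent_path": "/data/products/product", "xml_data_name": "name"},
]

for block, block_rows in route_by_block(ROWS).items():
    print(block, len(block_rows))
```

A 'string contains' variant would test something like `"products" in row["xml_parent_path"]` instead of splitting the path, which is simpler but can match unintended paths when element names overlap.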

For more complex processing, Mappings (sub-transformations) should be used for the different data blocks to keep the transformation clearly structured.

XML Sample with different element blocks:

A preview on the step may look like this (depending on the selected fields):

As the preview shows, the step delivers almost the original streaming information, with Elements and Attributes from the XML file, together with helpful additional fields such as the element level.

The transformation looks like this:

The end result for the Analyzer List block:

The end result for the Products block (split, for example, into two separate data streams for the end system):

The step offers many more options to help meet your needs:

More details of the various options can be found in the mentioned examples and the above step description.