MongoDB Input

(warning) PLEASE NOTE: This documentation applies to an earlier version. For the most recent documentation, visit the Pentaho Enterprise Edition documentation site.

Description

The MongoDB Input transformation step enables you to retrieve documents or records from a collection within MongoDB. For additional information about MongoDB, see the MongoDB documentation.

Options

Configure Connection Tab

The Configure connection tab enables you to specify the database and collection to query.

Option

Definition

Option

Definition

Step name

Name of the step as it appears in the transformation workspace.

Host name(s) or IP address(es)

Indicates the network name or address of the MongoDB instance or instances. You can input multiple host names or IP addresses, separated by a comma. You can also specify a different port number for each host name by separating the host name and port number with a colon, and separating each combination of host name and port number with a comma. For example, to include the host name and port number for two different MongoDB instances, you would input localhost1:27017,localhost2:27018 and leave the Port field empty.

Use all replica set members/mongos

Differentiates between a replica set containing one node and a stand-alone single Mongo host. If there is a replica set, and it contains more than one host, then the Java driver discovers all hosts automatically. It is good practice to list more than one replica set host in the hosts field so that the driver has a better chance of connecting successfully if one is down.

Port

Indicates the port number of the MongoDB instance or instances. Specify a default port to use if no port numbers are specified in the Host name(s) or IP address(es) field.

Username

Indicates the username required to access the database. If you want to use Kerberos authentication, enter the Kerberos principal in this field.  If you do not know the principal, contact your system administrator.  The principal is the unique identity to which Kerberos assigns tickets.  When you enter the principal as the username, it should be formatted like this: <primary>/<instance>@<KERBEROS_REALM> is typically the name of the user.  If the primary is a host, the primary is typically the word host.  <instance> qualifies the primary.  Sometimes if the primary is a user, the instance is the username of the database administrator.  <KERBEROS_REALM> is the Kerberos realm (domain name).  Note that the <KERBEROS_REALM> is case sensitive. Here is an example of a correctly-formatted Kerberos principal username: <joe/admin@CORPORATION.COM>.

Password

Indicates the password associated with the provided Username. If you are using Kerberos authentication, you do not need to enter the password.

Authenticate using Kerberos

Indicates whether to use the Kerberos service to manage the authentication process. If you check this, make sure that you enter the Kerberos principal as the Username. If you choose this option, read Use Kerberos Authentication to Provide Spoon Users Access to MongoDB for configuration information.

Connection timeout

Designates how long to wait for a connection to a database (in milliseconds) before terminating the connection attempt. Leave blank to never terminate the connection.

Socket timeout

Designates how long to wait for a write operation (in milliseconds) before terminating the operation. Leave blank to never terminate the operation.

Preview

Displays a first look of the data. Clicking Preview causes the Enter preview size window to appear. Enter the maximum number of records that you want to preview, then click OK. The preview data appears in the Examine preview data window.

Input Options Tab

The Input Options tab enables you to specify which database and collection you want to retrieve information from. You can also indicate the read preferences and tag sets in this tab. See Tag Sets for more information.

Option

Definition

Database

Name of the database to retrieve data from. Click Get DBs to populate the drop-down menu with a list of databases on the server.

Collection

Name of the collection to retrieve data from. Click Get collections to populate the drop-down menu with a list of collections within the database.

Read preference

Indicates which node to read first---Primary, Primary preferred, Secondary, Secondary preferred, or Nearest.

Tag set specification/#/Tag Set

Tags allow you to customize write concerns and read preferences for a replica set. The Tag set specification section of the window allows you to specify criteria for selecting replica set members. When you click Get tags, the Tag set specification populates with the tag sets that are available on the database, in order of execution. You can join, delete, copy, or paste tag sets, then click Test tag set to see which replica set members match the Tag set specification criteria you specified. The # field indicates the number of the tag set. The Tag set field displays the tag set criteria.

Get Tags

Retrieves a list of the tag sets that are in the database indicated in the Database field.

Join tags

Appends selected tag sets so that nodes that match the criteria are queried or written to simultaneously. If you select individual tag sets, then click Join tags, the tag sets are combined to create one tag set. Note that this change only occurs in the MongoDB Input window, not on the database.

Test tag set

Displays the set members that match the tags indicated in the tag set specification. Clicking Test tag set displays the id, host name, priority, and tags for each replica set member that matches the tag set specification criteria.

Query Tab

The Query tab enables you to refine your read request. This tab operates in two different modes. You can create a query using JSON Query expression or using the Aggregation Framework. By default, the Query tab is in JSON Query expression mode. You can enter a JSON Query expression when the Query is aggregation pipeline checkbox is deselected. MongoDB queries use a JSON-like query language that includes a variety of query operators. To place the Aggregation Framework mode Query is aggregation pipeline checkbox. You can then enter a query, using the Aggregation Framework, in the Aggregation pipeline specification field that appears. See MongoDB's Aggregation Framework for additional information, including code examples.

Option

Definition

Query expression (JSON)(Field is visible if Query is aggregation pipeline checkbox is not selected.)

JSON expression to limit the output. See the sub-section Query Examples (JSON Query Expressions) for additional details.

Aggregation pipeline specification (JSON)(Field is visible if Query is aggregation pipeline checkbox is selected.)

Use this field if you want to use the MongoDB Aggregation Framework to perform a simple or complex aggregations or selections such as totaling or averaging field values. Note that the method name (which includes the collection name of the database you selected in the Input Options tab), appears after the Aggregation pipeline specification (JSON) label for this field. See the sub-section Query Examples (JSON Aggregate Pipeline) for additional details.

Query is aggregation pipeline

Pipes multiple JSON expressions together to execute at once. An aggregation pipeline strings several JSON expressions together, with the output of the previous expression becoming the input for the next. When selected, the Aggregation pipeline specification (JSON) field appears. When deselected, the Query expression (JSON) field appears.

Execute for each row

Perform the query on each row of data.

Fields expression (JSON)(Field is visible if Query is aggregation pipeline check box is not selected.)

This field becomes active only if Query is aggregation pipelineis not selected. Controls the fields to return, or in MongoDB terms, the projection. If empty, all fields are returned. Enter true or false after the fields to indicate selected or not, respectively. See the MongoDB documentation [{+}http://docs.mongodb.org/manual/reference/method/db.collection.find/+] for more information about projections.

Fields Tab

The Fields tab enables you to define properties for the exported fields. The Fields tab operates in two different modes that impact how query results are formatted. You can indicate that you want the query result to be stored in a single JSON field. To do this, click the Output single JSON field check box. If you decide to do this and you want to parse the results of the field, you can apply a transformation step later in the process. Or, you can uncheck the Output single JSON field check box and instead, click the Get Fields button to apply Pentaho's Schema on Read functionality. This functionality parses fields, paths, and data types and displays them. You can then review and adjust this information, as needed.

Option

Definition

Output single JSON field

Indicates whether the JSON result of the query should be outputted to a single field that has the String data type. You can parse this JSON using the JSON Input transformation step, eval("{"jsonString"}") in JavaScript, or using a User Defined Java Class step.

Name of JSON output field(Field is active if Output single JSON field check box is selected.)

Designates the name of the field that contains the JSON output from the server.

Get fields(Field is active if Output single JSON field check box is not selected.)

Creates a sample set of documents, then displays the name and field for each record. Pentaho's Schema on Read functionality determines the field names, paths, and the data type for each field in the sample.

  1. (Field is active if Output single JSON field check box is not selected.)

The order of this entry in the list.

Name(Field is active if Output single JSON field check box is not selected.)

Displays a user-friendly name of the field that is based on the value in the Path field. The name that appears here maps the name of the field as it appears in the PDI transformation with the field that appears in the MongoDB database. You can edit the name as desired.

Path(Field is active if Output single JSON field check box is not selected.)

Indicates the JSON path of the field in MongoDB. If the path shown is an array, you can specify a specific element in the array by passing it the key value, which is contained in the bracketed part of the array. For example $.emails[0] indicates that you want the result to display the first value in the array. $.emails[1] indicates that you want the result to display the second value in the array and so forth. If you want to display all array values, use the asterisk as the key, like this $.email[]. If the array contains records, and not just strings, you can specify that you want to display the record like this: $.emails[].sender.

Type(Field is active if Output single JSON field check box is not selected.)

Indicates the data type.

Indexed Values(Field is active if Output single JSON field check box is not selected.)

Allows you to enter a comma-separated list of legal values for String fields. If you specify values in this field, the Kettle indexed data type is applied to the data. If not, the String data type is applied. Usually, you will only need to modify this field if you are using Weka metadata for nominal fields.

Sample: array min: max index(Field is active if Output single JSON field check box is not selected.)

Indicates minimum and maximum values for the index seen in the sampled documents.

Sample: #occur/#docs(Field is active if Output single JSON field check box is not selected.)

Indicates how often the field occurs as well as the number of documents processed.

Sample: disparate types(Field is active if Output single JSON field check box is not selected.)

If several documents are sampled, but the same field contain different data types, the Sample: disparate types field is populated with a "Y." The Type field displays the String data type. In this instance, the Kettle type for the field in question is set to the String data type, so it is able to output values of differing types.

Examples

Query Examples (JSON Query Expressions)

MongoDB has a rich query system that allows you to select and filter documents in a collection along specific fields and values. The MongoDB Extended JSON page in the MongoDB Wiki details how to use queries. Pentaho supports only the features discussed on this page. This table displays some examples of the syntax and structure of the queries you can use to request data from MongoDB.

Query expression

Description

{ name : "MongoDB" }

Queries all values where the name field has a value equal to MongoDB

{ name : { '$regex' : "m.*", '$options' : "i" } }

Uses a regular expression to find name fields starting with m, case insensitive

{ name : { '$gt' : "M" } }

Searches all strings greater than M

{ name : { '$lte' : "T" } }

Searches all strings less than or equal to T

{ name : { '$in' : [ "MongoDB", "MySQL" ] } }

Finds all names that are either MongoDB or MySQL (Reference)

{ name : { '$nin' : [ "MongoDB", "MySQL" ] } }

Finds all names that are not either MongoDB or MySQL, or where the field is not set (Reference)

{ created_at : { $gte : { $date : "2014-12-31T00:00:00.000Z" } } }

Finds all created_at documents that are greater than or equal to the specified UTC date

{ $where : "this.count == 1" }

Uses JavaScript to evaluate a condition

{ $query: {}, $orderby: { age : -1 } }

Returns all documents in the collection named collection sorted by the age field in descending order.

Query Examples (JSON Aggregate Pipeline)

MongoDB has a rich query system that allows you to select and filter documents using the aggregation pipeline framework. The Aggregation Framework Examples page in the MongoDB wiki provides additional examples of function calls. This table displays some examples of the syntax and structure of the queries you can use to request data from MongoDB.

Query expression

Description

{ $match : {state : "FL", city : "ORLANDO" } }, {$sort : {pop : -1 } }

Returns all fields from all documents where the state field has a value of FL and the city field has a value of ORLANDO.  The documents will be returned sorted by the pop field in descending order.

{ $group : { _id: "$state"} }, { $sort : { _id : 1 } }

Returns one field named _id containing the distinct values for state in ascending order.  Similar to the SQL: SELECT DISTINCT state AS _id FROM collection ORDER BY state ASC.

{ $match : {state : "FL" } }, { $group: {_id: "$city" , pop: { $sum: "$pop" }  }  }, { $sort: { pop: -1 } }, { $project: {_id : 0, city : "$_id" }  }

Gets all documents where thestate field has a value of FL, aggregates all values of pop for each city, sorts by population descending and returns one field named city.

{ $unwind : "$result" }</p>

Peels off the elements of an array individually, and returns one document for each element of the array.

Metadata Injection Support (7.x and later)

All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.