AEL Troubleshooting Guide

General tips

AEL issues can be tricky to isolate due to the complexity of distributed execution.  Errors can originate from many sources:

  • Spoon
  • Pentaho Server
  • the AEL daemon
  • the Spark driver process
  • Spark executors
  • YARN
  • Kerberos
  • SSL negotiation

Understanding where an issue originates is crucial to resolving it.

Inspect Errors

Look at the errors returned to Spoon by the transformation execution.  They will often give a strong clue about what is happening.  If not, also check the daemon logs for errors during execution.

Simplify

If the default error and logging information is not helpful, the next step is to simplify.  For example, if you are setting up AEL and no transformations seem to run, use the simplest possible configuration for the daemon first:

  • without Kerberos authentication into the daemon (i.e. leave the http.security.* and driver.security.* properties commented out in application.properties)
  • sparkMaster=local
  • ael.ssl.enabled=false

Such a configuration eliminates a number of possible error sources.  If transformations succeed with these simpler options, add the original settings back one at a time to isolate the component causing issues.  A minimal sketch of this baseline configuration is shown below.
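
As a rough sketch, the relevant lines of such a baseline application.properties might look like the following (only properties already mentioned above are shown; the rest of the file is left as-is):

  # Baseline for troubleshooting: local Spark, no Kerberos, no SSL.
  # All http.security.* and driver.security.* lines stay commented out.
  sparkMaster=local
  ael.ssl.enabled=false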

If some but not all transformations execute on AEL, simplify the failing transformation.  Remove steps until something runs.

Logging

Each component produces its own logs.

  • Kettle logs.  The logs displayed in Spoon should show many execution errors.  Log level settings are honored by AEL, and setting the level to Detailed or Debug is often a good way to drill into a problem.
  • Daemon logs.  Written to the stdout console if the daemon is running in the foreground.  These logs include both the entries from the AEL daemon and the logs from the Spark driver process launched by the daemon.  By default, the log level for the daemon entries is INFO.
  • Spark driver logs.  By default, the Spark driver uses the log4j.xml config from data-integration/classes/log4j.xml.  To increase logging or direct it to a file, update that config as needed.
  • Spark executor logs.  slf4j logging on the executors is harder to configure and view.  The default logging can be viewed in the YARN application logs.  If more information is needed, the easiest option is to run Spark in "local" mode, in which case both the driver and executor logs are configured with the same log4j.xml mentioned above.
  • YARN application logs.  Standard YARN logs, typically viewed via Hue; they can also be fetched from the command line (see the sketch after this list).
  • Kerberos logging.  To get additional logging of Kerberos negotiation on the daemon, set http.security.debug=true.  The output is included in the daemon console logging.
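
If Hue is not available, the aggregated YARN application logs can usually be fetched from the command line as well.  A minimal sketch, assuming log aggregation is enabled and using a placeholder application ID:

  # Fetch the aggregated logs for a finished application (placeholder ID).
  yarn logs -applicationId application_1508274000000_0042 > ael-app.log

  # List applications if the ID is not known.
  yarn application -list -appStates ALL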

F.A.Q.

I see a message about NoSuchFieldError "TOKEN_KIND"

For example,

 Exception in thread "main" java.lang.NoSuchFieldError: TOKEN_KIND
  at org.apache.hadoop.crypto.key.kms.KMSClientProvider$KMSTokenRenewer.handleKind(KMSClientProvider.java:162)
  at org.apache.hadoop.security.token.Token.getRenewer(Token.java:351)
  at org.apache.hadoop.security.token.Token.renew(Token.java:377)
  at org.apache.spark.deploy.yarn.security.HadoopFSCredentialProvider$$anonfun$getTokenRenewalInterval$1$$anonfun$5$$anonfun$apply$1.apply$mcJ$sp(HadoopFSCredentialProvider

This can happen if the versions of the Hadoop libraries picked up by the Spark process are inconsistent with one another.

When the daemon is running on a Hadoop cluster, the SPARK_DIST_CLASSPATH environment variable will be set to point to the Hadoop libraries installed on the cluster.  If the installed Spark distribution also ships its own Hadoop libraries, there can be a conflict.

If using Apache Spark, verify that the build in use is the one "with user-provided Apache Hadoop" (https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-without-hadoop.tgz).
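
A quick way to check for this kind of conflict on the daemon host is sketched below; it assumes the Hadoop client is installed locally and uses $SPARK_HOME as a placeholder for the Spark install directory:

  # The "without Hadoop" build should not ship hadoop-common/hadoop-hdfs jars of its own.
  ls "$SPARK_HOME"/jars | grep -i hadoop

  # With that build, Spark picks up the cluster's Hadoop libraries via SPARK_DIST_CLASSPATH.
  export SPARK_DIST_CLASSPATH=$(hadoop classpath)
  echo "$SPARK_DIST_CLASSPATH"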

I see a message with "Verify app properties" in the daemon:

 2017-10-17 14:28:50.810  INFO 15597 --- [launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    : Verify app properties.
 2017-10-17 14:28:50.810  INFO 15597 --- [launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :
 -- Args passed to Spark:  ArgHandler{driverSecurityPrincipal=HTTP/devcdh57n1.pentahoqa.com:53000,
 driverSecurityKeytab=/home/devuser/http-devcdh57n1.pentahoqa.com-53000.keytab,
 requestId='2706639b-2451-4782-b1f4-6540ce5629a7', daemonURL='ws://localhost:53000/execution',
 proxyingUser=devuser, proxyKeytab=/home/devuser/devuser.keytab}

The Spark main class helpfully shows the arguments it was invoked with, including a driverSecurityPrincipal.  Since that principal is defined, AEL is likely configured with a Kerberos-secured daemon.  The daemon URL, however, specifies localhost.  Connecting to a Kerberos-secured service requires using the FQDN for service authentication, so the daemon URL should reference the daemon host's fully qualified name instead of localhost.
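
A quick sanity check, using the host from the log above (devcdh57n1.pentahoqa.com) as an example:

  # On the daemon host: the FQDN that should appear in the HTTP/<host> service principal
  # and in the daemon URL (e.g. ws://devcdh57n1.pentahoqa.com:53000/execution).
  hostname -f

  # From the client: confirm the FQDN resolves consistently.
  nslookup devcdh57n1.pentahoqa.com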

When using SSL, I see "Handshake response not received." in Spoon.

Verify that the certificate is trusted by the client.  See http://wiki.pentaho.com/display/EPAM/Testing+SSL+with+AEL, specifically the sections on creating a trusted cert and importing it into the Java keystore.
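
As a hedged sketch, importing a daemon certificate into the JVM truststore used by Spoon might look like the following (the certificate file name, alias, and default changeit password are assumptions; adjust for your environment):

  # Import the daemon's certificate into the JRE truststore used by Spoon.
  keytool -importcert -trustcacerts -alias ael-daemon \
    -file ael-daemon.crt \
    -keystore "$JAVA_HOME/jre/lib/security/cacerts" \
    -storepass changeit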

ClassNotFound involving Jersey classes

See http://wiki.pentaho.com/display/COM/AEL+and+Spark+Library+Conflicts

I get "javax.websocket.DeploymentException: Connection failed". when I try to run a transformation on AEL Spark from PDI.

Make sure that the AEL daemon is running and that you have specified the correct hostname and port in the run configuration.

You can check this by opening the AEL daemon URL in a browser, e.g. https://localhost:53000.  This should display an error page saying something like "This application has no explicit mapping for /error, so you are seeing this as a fallback."
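
The same check can be scripted.  A minimal sketch, assuming the daemon runs on localhost:53000 (-k is only appropriate for self-signed test certificates):

  # A reachable daemon returns an HTTP response (the /error fallback page);
  # "connection refused" or a timeout means the daemon or the URL is wrong.
  curl -vk https://localhost:53000/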

I get "Yarn application has already ended!  It might have been killed or unable to launch application master."

One of the root causes of this error is the AEL daemon being unable to launch the Spark application.  Double-check the configuration files.  If running in a secured environment, make sure that the user submitting the Spark application is a user on the cluster.  For example, if you run as `suzy`, then `suzy` must be a user on each node in the cluster.  LDAP is one mechanism that can help keep accounts in sync across systems.
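
A simple hedged check for the account requirement, assuming SSH access to the worker nodes (the node names below are placeholders):

  # Verify the submitting user exists on every node in the cluster.
  for node in node1.example.com node2.example.com; do
    ssh "$node" id suzy
  done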

I get "Broken pipe exception".

Make sure that the output was created correctly.  Further investigation is needed; the error points to a problem in the communication between PDI and the daemon.

I see no values in the Input and Output columns in Step Metrics in PDI.

This is not yet implemented.  AEL Spark only populates the Read and Written columns.  The difference between Read/Written and Input/Output is:

  • Read/Written is what comes in and out of hops
  • Input/Output is what is read from or written to external sources (flat files, HDFS, data sources)

I get an error when trying to compile the AEL daemon: DaemonMainTest.testGreetingEndpoint » IllegalState Failed to load ApplicationC...

Port 53000 may already be in use.  You can define a different port for the tests so they do not use 53000; that way, you can still run the tests while the daemon is running.

Try adding `ael.unencrypted.port=52000` to `application-test.properties`.  The test server will then start on 52000, while the application keeps using 53000.
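
To confirm whether a running daemon (or anything else) is already holding port 53000, a quick check such as the following can help (command availability varies by OS):

  # Show the process currently listening on port 53000, if any.
  lsof -i :53000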