Tuesday, May 3, 2016

How to resolve spark-cassandra-connector's Guava version conflict in spark-shell

In my post How to resolve spark-cassandra-connector Guava version conflicts in Yarn cluster mode, I explained how to resolve the Guava version issue in Yarn cluster mode. This post covers how to do it in spark-shell.
The first thing to know is that when you start spark-shell with --master yarn, you actually run in yarn-client mode, so unfortunately my method for Yarn cluster mode won’t work. You may still get an exception like this:
Caused by: java.lang.NoSuchMethodError: com.google.common.util.concurrent.Futures.withFallback(Lcom/google/common/util/concurrent/ListenableFuture;Lcom/google/common/util/concurrent/FutureFallback;Ljava/util/concurrent/Executor;)Lcom/google/common/util/concurrent/ListenableFuture;
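For reference, here is the kind of code that trips it. This is a minimal sketch, assuming a hypothetical keyspace ks with a table kv; any read through the connector exercises the Cassandra Java driver, which needs Guava 16’s Futures.withFallback:
import com.datastax.spark.connector._

// Hypothetical keyspace/table; reading through the connector goes via the
// Cassandra Java driver, which calls Guava 16's Futures.withFallback.
val rdd = sc.cassandraTable("ks", "kv")
rdd.first()  // fails with the NoSuchMethodError when an older Guava wins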
What’s wrong? If you log on to the data node and check the launch_container.sh for your Yarn application, you will find that guava-16.0.1.jar is the first entry in the classpath:
export CLASSPATH="$PWD/guava-16.0.1.jar:$PWD:$PWD/__spark__.jar:$HADOOP_CLIENT_CONF_DIR:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*:$HADOOP_MAPRED_HOME/*:$HADOOP_MAPRED_HOME/lib/*:$MR2_CLASSPATH"
Here is the trick: you need to add the Guava jar to --files in your command:
spark-shell \
  --master yarn \
  --driver-class-path <local path of guava-16.0.1.jar> \
  --conf spark.executor.extraClassPath=./guava-16.0.1.jar \
  --jars <local path of guava-16.0.1.jar> \
  --files <local path of guava-16.0.1.jar> \
  ...
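With the extra --files entry in place, the same read goes through. Reusing the hypothetical ks.kv table from the sketch above:
import com.datastax.spark.connector._

// Same hypothetical table as above; with guava-16.0.1.jar localized up front,
// the connector's Futures.withFallback call now resolves.
sc.cassandraTable("ks", "kv").first()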
Sounds weird? Try this test and you will understand why. Run the spark-shell command, and when you see the prompt, don’t do anything; instead, log on to a data node where one of your application’s Spark executors is running and look in the application’s cache. What will you find?
# ls /grid/0/yarn/nm/usercache/bwang/appcache/application_1459869234031_5503/container_e45_1459869234031_5503_01_000004/
container_tokens                       launch_container.sh
default_container_executor_session.sh  __spark__.jar
default_container_executor.sh          tmp
Where are the jars listed in --jars? The answer is that those jars are not copied until you run an action on an RDD or DataFrame in the spark-shell. By then the executor’s JVM has already started, and it may have loaded the older version of Guava already.
If you add the Guava jar to --files, the jar is copied to the executor’s container before the JVM starts, and guava-16.0.1.jar will be chosen over the older version of Guava.
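You can verify this from the shell itself. The snippet below is a quick sanity check, not part of the fix: it asks each executor which jar its Guava Futures class was loaded from, and every answer should end in guava-16.0.1.jar:
import com.google.common.util.concurrent.Futures

// Runs on the executors and reports where the Futures class came from.
// getCodeSource can be null for bootstrap classes, hence the Option.
sc.parallelize(1 to 4, 4).map { _ =>
  Option(classOf[Futures].getProtectionDomain.getCodeSource)
    .map(_.getLocation.toString).getOrElse("unknown")
}.collect().distinct.foreach(println)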
Updates
Adding --files is not necessary in Spark 2.0.1, and in Spark 2.1 you are not able to start the Spark shell at all if you keep it. In Spark 2, all of the jars are already distributed when each executor’s JVM starts.
