In my blog post "How to resolve spark-cassandra-connector Guava version conflicts in Yarn cluster mode", I explained how to resolve the Guava version conflict in Yarn cluster mode. This post covers how to do the same in spark-shell.
The first thing to know is that when you start spark-shell with --master yarn, you actually run in yarn-client mode. Unfortunately, my method for Yarn cluster mode won't work here, and you may still get an exception like this:

Caused by: java.lang.NoSuchMethodError: com.google.common.util.concurrent.Futures.withFallback(Lcom/google/common/util/concurrent/ListenableFuture;Lcom/google/common/util/concurrent/FutureFallback;Ljava/util/concurrent/Executor;)Lcom/google/common/util/concurrent/ListenableFuture;
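If you want to reproduce the error, here is a minimal sketch to paste at the scala> prompt, assuming the spark-cassandra-connector is on the classpath (the keyspace and table names are hypothetical):

// Any action that opens a Cassandra connection through the connector
// exercises Futures.withFallback, so this fails with the NoSuchMethodError
// above whenever an older Guava wins on the executor classpath.
import com.datastax.spark.connector._

sc.cassandraTable("my_keyspace", "my_table").count()  // hypothetical keyspace/table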
What's wrong? If you log on to the data node and check the launch_container.sh for your Yarn application, you will find that guava-16.0.1.jar is the first entry in the classpath:

export CLASSPATH="$PWD/guava-16.0.1.jar:$PWD:$PWD/__spark__.jar:$HADOOP_CLIENT_CONF_DIR:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*:$HADOOP_MAPRED_HOME/*:$HADOOP_MAPRED_HOME/lib/*:$MR2_CLASSPATH"
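The CLASSPATH export looks right, so why the error? One way to check what the executor JVMs actually see is to print their classpath from the scala> prompt. A diagnostic sketch, assuming one task per unit of default parallelism is enough to sample every executor:

// Ask the executors for the classpath their JVMs were launched with.
// If $PWD/guava-16.0.1.jar was not in the container when the JVM started,
// the older Guava from the Hadoop directories wins.
sc.parallelize(1 to sc.defaultParallelism).map { _ =>
  System.getProperty("java.class.path")
}.distinct().collect().foreach(println)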
Here is the trick: you need to add the Guava jar to --files in your command:

spark-shell \
  --master yarn \
  --driver-class-path <local path of guava-16.0.1.jar> \
  --conf spark.executor.extraClassPath=./guava-16.0.1.jar \
  --jars <local path of guava-16.0.1.jar> \
  --files <local path of guava-16.0.1.jar> \
  ...
Does that sound weird? You can run this test to understand why. Start the spark-shell command, and when you see the prompt, don't do anything yet. Log on to a data node where one of your application's Spark executors is running and find the application's cache. What will you find?

# ls /grid/0/yarn/nm/usercache/bwang/appcache/application_1459869234031_5503/container_e45_1459869234031_5503_01_000004/
container_tokens                         launch_container.sh
default_container_executor_session.sh    __spark__.jar
default_container_executor.sh            tmp
Where are the jars listed in --jars? The answer: those jars are not copied to the executors until you run some action on an RDD or DataFrame in the spark-shell. Unfortunately, by then the executor's JVM has already started, and it may already have loaded the older version of Guava.
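You can watch this happen from the shell itself. A sketch, assuming each executor's working directory is its container directory as shown above:

// List each executor's working directory from a task. Run it as the first
// action after startup and compare with the ls output above: the files from
// --jars only show up around the time the first tasks run, long after the
// executor JVM was launched.
sc.parallelize(1 to sc.defaultParallelism).map { _ =>
  new java.io.File(".").listFiles.map(_.getName).sorted.mkString(" ")
}.distinct().collect().foreach(println)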
If you add the Guava jar to --files, the jar will be copied into the executor's container before the JVM starts, and guava-16.0.1.jar will be chosen over the older version of Guava.
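To confirm the fix took effect, you can ask each executor where its Guava Futures class was actually loaded from. A diagnostic sketch:

// Print the jar that provides com.google.common.util.concurrent.Futures
// on every executor; after the --files trick it should be guava-16.0.1.jar
// in the container's working directory.
sc.parallelize(1 to sc.defaultParallelism).map { _ =>
  classOf[com.google.common.util.concurrent.Futures]
    .getProtectionDomain.getCodeSource.getLocation.toString
}.distinct().collect().foreach(println)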
Updates

Adding --files is not necessary for Spark 2.0.1, and in Spark 2.1 you are not able to start the Spark shell at all if you keep it. In Spark 2, all of the jars are already distributed when each executor's JVM starts.