When you use spark-cassandra-connector, you will run into the "Guava version conflicts" problem when you submit your job in YARN cluster mode. spark-cassandra-connector usually uses a recent Guava version, 16.0.1, which has new methods that cannot be found in older versions of Guava, e.g., 11.0.2. It is a BIG headache when you try to resolve this problem.
Here is how you can resolve it without building anything special.
I think everyone already has the idea: put guava-16.0.1.jar before guava-11.0.2.jar in the classpath. But how can we achieve this when running in YARN cluster mode?
Your Hadoop cluster might already have the Guava jar. If you use CDH, try find -L /opt/cloudera/parcels/CDH -name "guava*.jar". If you would like to use that jar, you can resolve this problem by adding:
spark-submit \
  --master yarn-cluster \
  --conf spark.driver.extraClassPath=<path of guava-16.0.1.jar> \
  --conf spark.executor.extraClassPath=<path of guava-16.0.1.jar> \
  ...
extraClassPath allows you to prepend jars to the classpath.
If you cannot find that version of Guava in your cluster, you can include the jar yourself:
spark-submit \
  --master yarn-cluster \
  --conf spark.driver.extraClassPath=./guava-16.0.1.jar \
  --conf spark.executor.extraClassPath=./guava-16.0.1.jar \
  --jars <path of guava-16.0.1.jar> \
  ...
With --jars, you tell Spark where to find the jar, so you need to provide its full path. When Spark starts in YARN cluster mode, the jar is shipped to the container on the NodeManager, and everything lands in the current working directory where the executor starts, so in extraClassPath you only need to point at the current working directory (./guava-16.0.1.jar).
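To see why the relative ./guava-16.0.1.jar path works, here is a small simulation of what YARN's localization does (the temp directory stands in for the container's working directory; no real jar is involved):

```shell
# Simulate YARN localizing a --jars file into the container's working
# directory: the executor starts there, so a relative ./ path resolves.
workdir=$(mktemp -d)                # stand-in for the YARN container dir
touch "$workdir/guava-16.0.1.jar"   # stand-in for the shipped jar
cd "$workdir"
ls ./guava-16.0.1.jar               # prints ./guava-16.0.1.jar
```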
If you use CDH, all Hadoop jars are automatically added when you run a YARN application. Take a look at launch_container.sh while your job is running, and you will see something like this:
export CLASSPATH="$PWD:$PWD/__spark__.jar:$HADOOP_CLIENT_CONF_DIR:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*:$HADOOP_MAPRED_HOME/*:$HADOOP_MAPRED_HOME/lib/*:$MR2_CLASSPATH"
Here is how you can find launch_container.sh:
- Find the host where one of the executors is running
- Run this command: find -L /yarn -path "*<app_id>*" -name "launch*"
There is a YARN configuration, yarn.application.classpath. If you like, you can prepend an entry for Guava.
<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
</property>
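For example, a prepended entry might look like the sketch below (the /opt/jars/guava-16.0.1.jar path is a hypothetical location; use wherever your jar actually lives):

<property>
  <name>yarn.application.classpath</name>
  <value>/opt/jars/guava-16.0.1.jar,$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
</property>

Note this is a cluster-wide change that affects every YARN application, so the per-job extraClassPath approach is usually safer.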
You know spark-submit has some messy conventions:
- --jars is separated by comma ","
- extraClassPath is separated by colon ":"
- --driver-class-path is separated by comma ","
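If you keep your jars in the usual colon-separated classpath form, a tiny helper can produce the comma-separated list that --jars expects (a sketch; it assumes the paths contain no commas, and the jar paths are hypothetical):

```shell
# Convert a colon-separated classpath into the comma-separated
# list that --jars expects.
jars_cp="/tmp/a.jar:/tmp/b.jar:/tmp/guava-16.0.1.jar"
spark_jars=$(printf '%s' "$jars_cp" | tr ':' ',')
echo "$spark_jars"   # /tmp/a.jar,/tmp/b.jar,/tmp/guava-16.0.1.jar
```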
I was in a hurry when I wrote this post, so I might have missed something or assumed you know a lot. If something is not clear, let me know and I will fix it.