How to fix the classpath in Spark – Rebuilding Spark to get rid of old jars
After fixing an earlier classpath problem, a new and more difficult problem showed up. Again the evil-doer is an ancient version of Jackson.
Unfortunately, with Spark/Hadoop/Yarn all classes are loaded from one big assembly jar, which causes a lot of problems. The spark.files.userClassPathFirst option is still ‘experimental’, which in our case meant it just didn’t work. But again we found a solution. Our system engineers also wanted a reproducible solution, so in the end it’s just a small recipe.
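For reference, this is how that experimental flag would normally be switched on. The class name and application jar below are placeholders, not our actual job; in our setup the flag simply did not put our jars first on the classpath.

```shell
# Hypothetical invocation (Spark 1.1.0 era); --class and my-app.jar are
# placeholders. In our case this flag did not isolate user jars as hoped.
spark-submit \
  --conf spark.files.userClassPathFirst=true \
  --class com.example.Main \
  my-app.jar
```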
The first suspect was Hadoop, but after upgrading Hadoop to version 2.5.1 the problem still occurred. After some digging I found out that Spark was the culprit; a library named Parquet, to be precise. Parquet has already updated its Jackson dependency, but Spark uses an old Parquet. So let’s rebuild Spark with the right Jackson version.
Rebuilding Spark isn’t trivial: Spark has to be built for a specific version of Hadoop, and with or without Yarn support. The Spark documentation shows how to do this, but the latest version of Hadoop in the docs is 2.4 and we want 2.5. Upgrading the version numbers would seem the logical step, but doing so produces the following warning (not an error, so I overlooked it the first time):
[WARNING] The requested profile "hadoop-2.5" could not be activated because it does not exist.
After a while I found a post on the Spark mailing list with the solution: use the 2.4 profile, since it actually means 2.4+. The Hadoop version itself can stay 2.5.1, since that is the direct Maven dependency. The final build string:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.1 -DskipTests clean package
Fixing the Jackson-version
This still doesn’t fix everything. When running the following command
mvn dependency:tree -Dincludes=org.codehaus.jackson
we still see an old version of Jackson:
[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ spark-assembly_2.10 ---
[INFO] org.apache.spark:spark-assembly_2.10:pom:1.1.0
[INFO] \- org.apache.spark:spark-sql_2.10:jar:1.1.0:compile
[INFO]    \- com.twitter:parquet-hadoop:jar:1.4.3:compile
[INFO]       +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
[INFO]       \- org.codehaus.jackson:jackson-core-asl:jar:1.9.11:compile
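A check like this could even be scripted, so a build job fails fast when an old Jackson sneaks back in. The sketch below greps a captured sample of the output above; in a real build you would pipe the live mvn output into the grep instead of the here-doc.

```shell
# Grep a captured dependency:tree sample for pre-1.9 org.codehaus.jackson jars.
# In a real build, pipe `mvn dependency:tree -Dincludes=org.codehaus.jackson`
# into the grep instead of this here-doc.
grep -E 'jackson-(core|mapper)-asl:jar:1\.[0-8]\.' <<'EOF' && echo "old Jackson found"
[INFO] +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
[INFO] \- org.codehaus.jackson:jackson-core-asl:jar:1.9.11:compile
EOF
```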
I wanted to somehow give Maven a command-line property to add a hardcoded dependency on Jackson 1.9.13.
Manually fixing your Spark installation
After some Googlin’ I found a very hopeful URL that ended with “add-a-maven-dependency-from-the-command-line/”. It wasn’t exactly a command-line property, but good enough for me. The solution is to match the opening dependencies tag in the assembly pom with a regex and insert explicit dependencies on the right version of Jackson. It’s a bit ugly, but since this issue has already been reported to Spark it’s only temporary.
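As a toy illustration of what that substitution does, here it is applied to a minimal pom rather than the real Spark assembly pom: the first opening dependencies tag is replaced by itself plus a pinned Jackson entry. Note that the `0,/pattern/` address range is a GNU sed feature.

```shell
# Write a minimal toy pom (not the real Spark pom) and apply the same style
# of substitution: the first <dependencies> tag is replaced by itself plus a
# pinned Jackson 1.9.13 dependency. Requires GNU sed for the 0,/pat/ range.
cat > pom.orig <<'EOF'
<project>
  <dependencies>
    <dependency><groupId>example</groupId></dependency>
  </dependencies>
</project>
EOF
sed "0,/<dependencies>/s//<dependencies><dependency><groupId>org.codehaus.jackson<\/groupId><artifactId>jackson-core-asl<\/artifactId><version>1.9.13<\/version><\/dependency>/" pom.orig
```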
The final recipe
So the final recipe (which can nicely be used as a Jenkins project, to keep the system engineers happy regarding reproducibility):
git clone git://github.com/apache/spark.git
cd spark
#git checkout 2f9b2bd7844ee8393dc9c319f4fefedf95f5e460
mv assembly/pom.xml assembly/pom.orig
sed "0,/<dependencies>/s//<dependencies><dependency><groupId>org.codehaus.jackson<\/groupId><artifactId>jackson-core-asl<\/artifactId><version>1.9.13<\/version><\/dependency><dependency><groupId>org.codehaus.jackson<\/groupId><artifactId>jackson-mapper-asl<\/artifactId><version>1.9.13<\/version><\/dependency>/" assembly/pom.orig > assembly/pom.xml
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.1 -DskipTests clean package
Running this recipe might take a while (about 20 minutes on our Jenkins machine and about an hour on my laptop).
When you want to build a specific git revision, uncomment the line with ‘git checkout’. The hash shown above is for version 1.1.0.
What to do with the assembly jar?
When the build is finished you have to copy assembly/target/scala-2.10/spark-assembly-1.2.0-SNAPSHOT-hadoop2.5.1.jar to your Spark lib dir. Make sure you rename the old file (spark-assembly-[version]-hadoop[hadoop-version].jar) to something with another extension, since Spark looks for spark-assembly*hadoop*.jar and throws an error when not exactly one file is found.
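The swap can be scripted roughly like this. It is a sketch with dummy files: the lib path and version numbers are assumptions from our setup, and in reality the jars come from your Spark installation and from assembly/target/scala-2.10/ respectively.

```shell
# Demo of the jar swap with empty dummy files; adjust SPARK_LIB and the
# version numbers to your own installation.
SPARK_LIB=spark-lib-demo
mkdir -p "$SPARK_LIB"
touch "$SPARK_LIB/spark-assembly-1.1.0-hadoop2.5.1.jar"   # stand-in for the old assembly
# Rename the old jar to another extension: Spark globs for
# spark-assembly*hadoop*.jar and errors out unless exactly one file matches.
mv "$SPARK_LIB/spark-assembly-1.1.0-hadoop2.5.1.jar" \
   "$SPARK_LIB/spark-assembly-1.1.0-hadoop2.5.1.jar.orig"
# Stand-in for copying the rebuilt jar from assembly/target/scala-2.10/
touch "$SPARK_LIB/spark-assembly-1.2.0-SNAPSHOT-hadoop2.5.1.jar"
ls "$SPARK_LIB"/spark-assembly-*hadoop*.jar
```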
In the end this is just a quick fix that hopefully won’t be necessary in the future. At the time of writing, the latest stable version of Spark was 1.1.0 and of Hadoop 2.5.1.