
How to fix the classpath in Spark – Rebuilding Spark to get rid of old jars

After fixing an earlier classpath problem, a new and more difficult problem showed up. Again the evil-doer is an ancient version of Jackson.
Unfortunately, with Spark/Hadoop/Yarn all classes are loaded from one big container, which causes a lot of problems. The spark.files.userClassPathFirst option is still ‘experimental’, which, in our case, meant it just didn’t work. But again, we found a solution. Our system engineers also wanted a reproducible solution, so in the end it’s just a small recipe.
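
For reference, this is roughly how that option would be passed to spark-submit (just a sketch; the class and jar names are placeholders, not from our actual job):

# Enabling the experimental option on spark-submit (this did not help in our case).
# com.example.MyJob and my-job-assembly.jar are placeholders.
spark-submit \
  --master yarn-cluster \
  --conf spark.files.userClassPathFirst=true \
  --class com.example.MyJob \
  my-job-assembly.jar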

The first suspect was Hadoop, but after upgrading Hadoop to version 2.5.1 the problem still occurred. After some digging I found out that Spark was the culprit. A library named Parquet, to be precise. Parquet already updated its Jackson dependency, but Spark uses an old Parquet. So let’s rebuild Spark with the right Jackson version.

Rebuilding Spark

Rebuilding Spark isn’t trivial. Spark has to be built for a specific version of Hadoop and with or without Yarn support. The Spark documentation shows how to do this, but the latest version of Hadoop in the docs is 2.4 and we want 2.5, so upgrading the version numbers seemed a logical step. When doing this you will get the following warning (not an error, so I overlooked it the first time):

[WARNING] The requested profile "hadoop-2.5" could not be activated because it does not exist.
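
If you want to see which hadoop-* profiles a Spark checkout actually defines, the Maven help plugin can list them (a side note, not something from the original hunt):

# List all profiles defined in the build and filter for the Hadoop ones.
mvn help:all-profiles | grep -i hadoop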

After a while I found a post on the Spark mailing list with the solution: use the hadoop-2.4 profile, since it means 2.4+. The hadoop.version property can stay at 2.5.1, since that is the direct Maven dependency. The final build string:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.1 -DskipTests clean package

Fixing the Jackson-version

This still doesn’t fix everything. When we run the following command

mvn dependency:tree -Dincludes=org.codehaus.jackson

we still see an old version of Jackson:

[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ spark-assembly_2.10 ---
[INFO] org.apache.spark:spark-assembly_2.10:pom:1.1.0
[INFO] \- org.apache.spark:spark-sql_2.10:jar:1.1.0:compile
[INFO]    \- com.twitter:parquet-hadoop:jar:1.4.3:compile
[INFO]       +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
[INFO]       \- org.codehaus.jackson:jackson-core-asl:jar:1.9.11:compile

I wanted some way to give Maven a command line property that adds a hard-coded dependency on Jackson 1.9.13.

Manually fixing your Spark installation

After some Googlin’ I found a very hopeful URL that ended with “add-a-maven-dependency-from-the-command-line/”. It wasn’t exactly a command line property, but good enough for me. The solution is a regex replacement on the <dependencies> tag in assembly/pom.xml that inserts the right version of Jackson. It’s a bit ugly, but since this issue is already reported at Spark it’s only temporary.
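
For readability, this is the fragment that the sed command in the recipe below injects right after the opening <dependencies> tag of assembly/pom.xml:

<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-core-asl</artifactId>
  <version>1.9.13</version>
</dependency>
<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-mapper-asl</artifactId>
  <version>1.9.13</version>
</dependency>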

The final recipe

So here is the final recipe (which can nicely be used as a Jenkins project to keep the system engineers happy regarding reproducibility):

git clone git://github.com/apache/spark.git
cd spark
#git checkout 2f9b2bd7844ee8393dc9c319f4fefedf95f5e460
mv assembly/pom.xml assembly/pom.orig
sed "0,/<dependencies>/s//<dependencies><dependency><groupId>org.codehaus.jackson<\/groupId><artifactId>jackson-core-asl<\/artifactId><version>1.9.13<\/version><\/dependency><dependency><groupId>org.codehaus.jackson<\/groupId><artifactId>jackson-mapper-asl<\/artifactId><version>1.9.13<\/version><\/dependency>/" assembly/pom.orig > assembly/pom.xml


mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.1 -DskipTests clean package

Running this command might take a while (about 20 minutes on our Jenkins machine and about an hour on my laptop).

When you want to build for a specific git revision, uncomment the line with ‘git checkout’. The hash shown above is for version 1.1.0.
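
To check that the override actually took effect you can rerun the dependency check from earlier (a quick sanity check, not part of the recipe itself):

mvn dependency:tree -Dincludes=org.codehaus.jackson

Both jackson-core-asl and jackson-mapper-asl should now resolve to 1.9.13 in the spark-assembly module.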

What to do with the assembly jar?

When the command is finished you have to copy assembly/target/scala-2.10/spark-assembly-1.2.0-SNAPSHOT-hadoop2.5.1.jar to your Spark lib dir. Make sure you rename the old file (spark-assembly-[version]-hadoop[hadoop-version].jar) to something with another extension, since Spark looks for spark-assembly*hadoop*.jar and throws an error when not exactly one file is found.
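
A minimal sketch of that copy step, assuming Spark is installed in /opt/spark-1.1.0 (the exact file names and paths will differ per installation):

# Park the old assembly jar under a different extension so Spark no longer picks it up.
cd /opt/spark-1.1.0/lib
mv spark-assembly-1.1.0-hadoop2.4.0.jar spark-assembly-1.1.0-hadoop2.4.0.jar.orig
# Copy in the freshly built assembly (path relative to the Spark checkout from the recipe).
cp /path/to/spark/assembly/target/scala-2.10/spark-assembly-1.2.0-SNAPSHOT-hadoop2.5.1.jar .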

Conclusion

In the end this is just a quick fix that hopefully won’t be necessary in the future. At the time of writing the latest stable versions were Spark 1.1.0 and Hadoop 2.5.1.


  1. 3 November 2014 at 21:41

    This was reblogged on HadoopEssentials.

  2. 9 February 2015 at 23:57

    hi,
    where exactly should i copy the jar?
    i added it as an external library in my intelliJ project, but when i try to run sbt assembly i get:
    “object spark is not a member of package org.apache
    [error] import org.apache.spark.storage.StorageLevel”

    also, can i build it with scala 2.11 support?
    and if so, is the following command ok?
    mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -Dhadoop.version=2.5.1 -DskipTests clean package

    • Jeroen van Wilgenburg
      18 February 2015 at 10:17

      That depends on your Spark installation, on my server it’s located in /opt/spark-1.1.0/lib
      The error is probably because the Spark jar is not found.

      I’m not sure about the build command, you should open the pom.xml files in the Spark distribution to check if those parameters are right. It’s always possible to try (with a little downside that it might take a long time).

