Overriding Hadoop jars with MapReduce v2 – How to fix the classpath

We ran into a nasty problem with Hadoop when we wanted to use a ‘new’ version of Jackson: Hadoop decided to throw a NoSuchMethodError. It turned out Hadoop ships with an ancient version of Jackson. Some terrible memories of classloading and JBoss EAP came back to me, but after I was calmed down by some colleagues I ultimately found a solution.

Searching for answers is a hard job with Hadoop; it looks like every version solves things differently. There are several configuration parameters that basically do the same thing, and then there is Yarn, MR2, MR v2, MapReduce v2, etc. This blog should work for versions 2.2.x and up. At the time of writing, version 2.5.0 had just been released.

The right nomenclature is explained in the documentation of an old version of Spring Hadoop.

We’re using MapReduce Version 2 running on Yarn.

The solution

In the end the solution is the right combination of two steps:

  • package your jar the right way
  • tell Hadoop you want your own jars first.

All the code samples in this blog are small snippets. When you want to try things out for yourself and see the complete files I created a project on github.

Step 1: Package your jar

For the complete explanation I refer to "Maven: Building a Self-Contained Hadoop Job"; here I'll just give a summary. The first step is including the Maven Assembly Plugin. Add the following section to your pom.xml:

<build>
    <plugins>
        <plugin>
            <!-- source: http://blog.mafr.de/2010/07/24/maven-hadoop-job/ -->
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.4.1</version>
            <configuration>
                <descriptors>
                    <descriptor>src/main/assembly/hadoop-job.xml</descriptor>
                </descriptors>
                <archive>
                    <manifest>
                        <mainClass>nl.jpoint.hadoop.sandbox.SandboxJob</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

When the plugin is enabled you have to create an assembly file (i.e. src/main/assembly/hadoop-job.xml, matching the descriptor in the pom) with the following contents:

<!-- source: http://blog.mafr.de/2010/07/24/maven-hadoop-job/ -->
<assembly>
    <id>job</id>
    <formats>
        <format>jar</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
        <dependencySet>
            <unpack>false</unpack>
            <scope>runtime</scope>
            <outputDirectory>lib</outputDirectory>
            <excludes>
                <exclude>${groupId}:${artifactId}</exclude>
                <exclude>org.slf4j:log4j-over-slf4j</exclude>
                <exclude>org.slf4j:slf4j-api</exclude>
                <exclude>ch.qos.logback:logback-classic</exclude>
                <exclude>ch.qos.logback:logback-core</exclude>
            </excludes>
        </dependencySet>
        <dependencySet>
            <unpack>true</unpack>
            <includes>
                <include>${groupId}:${artifactId}</include>
            </includes>
        </dependencySet>
    </dependencySets>
</assembly>

Note that there are some exclusions for Logback and SLF4J. Depending on the logging framework you're using, you might need to adjust these.
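When you now run mvn clean package, the assembly plugin produces an extra artifact with the job classifier next to the regular jar (classloading-1.0-SNAPSHOT-job.jar in the github project); that is the jar you submit to Hadoop.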

Step 2: Tell Hadoop you don’t want their ancient jars

Now you have to set the configuration parameter "mapreduce.job.user.classpath.first" to true in the run method of your job:

job.getConfiguration().set("mapreduce.job.user.classpath.first", "true");
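For context, this is roughly where that line ends up in a driver class. The sketch below is illustrative and not the exact SandboxJob from the github project; the job name and the hard-coded input/output paths are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SandboxJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "sandbox");
        job.setJarByClass(SandboxJob.class);
        job.setMapperClass(SandboxMapper.class); // the mapper that shows up in the stack trace below
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("input.txt"));
        FileOutputFormat.setOutputPath(job, new Path("output"));

        // Make the jars in the job jar's lib/ directory win over Hadoop's own copies
        job.getConfiguration().set("mapreduce.job.user.classpath.first", "true");

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new SandboxJob(), args));
    }
}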

When you comment out the line that sets the configuration parameter, you will see the following error:

  2014-09-05 08:32:53,493 FATAL [main] org.apache.hadoop.mapred.YarnChild: 
Error running child : java.lang.NoSuchMethodError: org.codehaus.jackson.map.ObjectMapper.setSerializationInclusion(Lorg/codehaus/jackson/map/annotate/JsonSerialize$Inclusion;)Lorg/codehaus/jackson/map/ObjectMapper;
  at nl.jpoint.hadoop.sandbox.SandboxMapper.map(SandboxMapper.java:20)
  at nl.jpoint.hadoop.sandbox.SandboxMapper.map(SandboxMapper.java:11)
  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

So don’t do that, I just wanted to prove my point 😉
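For reference, the failing line in the mapper (SandboxMapper.java:20 in the stack trace) is a call that only exists with this signature in recent Jackson 1.x versions; Hadoop's bundled Jackson has an older variant, hence the NoSuchMethodError. A rough sketch of such a mapper, not the exact project code:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.map.annotate.JsonSerialize;

public class SandboxMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        ObjectMapper mapper = new ObjectMapper();
        // In Jackson 1.9.x setSerializationInclusion returns the ObjectMapper (fluent style);
        // older Jackson versions have a different signature, so this line blows up at runtime
        // when Hadoop's own Jackson wins on the classpath
        mapper.setSerializationInclusion(JsonSerialize.Inclusion.NON_NULL);
        context.write(new Text(key.toString()), new Text(mapper.writeValueAsString(value.toString())));
    }
}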

Try it yourself

When you clone the project on github (git clone https://github.com/jvwilge/hadoop-classpath.git) you have to take the following steps to run the example:

  • Add input.txt (in the src/main/resources folder) to HDFS: hadoop fs -put input.txt
  • Copy the assembled jar (classloading-1.0-SNAPSHOT-job.jar) to the system running Hadoop
  • Run the job (on the system running Hadoop): hadoop jar classloading-1.0-SNAPSHOT-job.jar

Play around by commenting out the line I mentioned before to see the effect.

UPDATE (11-sept-14) – when using jars outside Mappers or Reducers

It turned out there was still a problem with jars: we tried to use a newer jar in an InputFormat and hit the same issue. Presumably this is because the splits are calculated during job submission, outside the task JVMs, so the setting from step 2 doesn't apply there. The solution is to copy the jar to the Hadoop master and do the following exports:

export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CLASSPATH=/home/cloudera/jars/jackson-mapper-asl-1.9.13.jar

I’m using the Cloudera image and copied the jars to ~/jars.

Alternative solution

Another solution is to work with libjars.

This post [http://grepalex.com/2013/02/25/hadoop-libjars/] explains how. The author considers it the more elegant solution. It might be, but it is also a more dangerous one: with one ‘fat’ jar you have all the libraries in one place and no dependency on whether the right libjars are passed along. Still, it might help you, so I included the link in this blog.
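One thing to keep in mind if you do try libjars: the -libjars option is only picked up when your driver runs through ToolRunner (as the driver sketch under step 2 does), because GenericOptionsParser is what processes it. The invocation would then look something like hadoop jar classloading-1.0-SNAPSHOT-job.jar -libjars /path/to/jackson-mapper-asl-1.9.13.jar (the path is illustrative).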

Conclusion

Of course Jackson is not the only library with this problem; it's just the jar we had trouble with. I still don't get why Hadoop ships such ancient jars, but at least this is how you fix it.

Big thanks to Jethro and Gijs for giving me a firm push in the right direction.

Sources

Hadoop sample project (on which my git project is based)
http://stackoverflow.com/questions/12707726/run-a-hadoop-job-without-output-file
http://stackoverflow.com/questions/11685949/overriding-default-hadoop-jars-in-class-path

Comments

  1. 23 April 2015 at 15:25

    Instead of: job.getConfiguration().set("mapreduce.job.user.classpath.first", "true");

    Use this: job.setUserClassesTakesPrecedence(true);

    That way the correct config string for your version is used; it is stored in JobContext.MAPREDUCE_TASK_CLASSPATH_PRECEDENCE. On CDH 4.7, that is "mapreduce.task.classpath.user.precedence".

  2. Jeroen van Wilgenburg
    10 May 2015 at 07:40

    Thanks for the tip! I’ll try it out and update the article.
