
Using a Spring Context within Hadoop Mappers and Reducers with environment specific property files

Bootstrapping a Spring Context from the classpath is quite simple, even from Mappers/Reducers in Hadoop. It gets a bit harder when you want to use environment specific property files. In our Mappers and Reducers we rely heavily on Spring beans that perform the magic. Besides that we load different properties per environment (e.g. a MongoDB host). After several iterations we came up with a nice solution. You can also leave out the Spring part and use this solution to broadcast Properties to all your Mappers/Reducers.

For this article I created a sample project on GitHub.
The repository contains all the code examples mentioned here in a monkey proof way.
The latest Hadoop version at the moment of writing this article is 2.5.1, and that’s the version we’re using (with MR2).

The problem

Let’s assume a Hadoop Mapper has to look up some information in a database. This database has a different host (and maybe even URL) in production than on your development Hadoop cluster. What you want is to load a property file when a Job is started and then somehow broadcast this property file to all the Mappers and Reducers. The Mappers and Reducers have access to a Spring context that is loaded with the broadcast property file.
A simple solution would be to package the different properties in your jar. This is actually a bad idea, because Murphy’s law will take care of using development properties in production (or worse, the other way around). Besides that, your system engineers really hate it when they have to give away their precious production passwords.

Step 1 – Loading the Property files

To point to the right location of the properties file I’m using a System property, and I read the property file as a String. We have to read the property file directly because the local file system (or the file system of your Hadoop master) isn’t available in the Mappers/Reducers.
Below is the simplified version of the code (the full version is available in SandboxJob.java).

String propertiesFolder = System.getProperty("properties.folder");
String sourcePath = propertiesFolder + "/job.properties";
String propertiesAsString = new String(Files.readAllBytes(Paths.get(sourcePath)));

To pass system properties to Hadoop you can’t just do it while submitting your job; you have to change an environment variable:

export HADOOP_OPTS="$HADOOP_OPTS -Dproperties.folder=/home/hadoop/conf/dev"

In this sample I use /home/hadoop/conf/dev. In that conf folder I created a file job.properties which contains a ‘name’ property with the value ‘Harrie’:

name=Harrie

The properties file is loaded in SandboxJob.java as a String and passed as the parameter nl.jpoint.properties:

job.getConfiguration().set("nl.jpoint.properties", loadPropertiesAsString());
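Putting the pieces of step 1 together, the driver roughly boils down to the sketch below. The class layout and job setup are my own simplification; the real SandboxJob.java on GitHub differs in details such as error handling and the Mapper/Reducer wiring.

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SandboxJob {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sandbox");
        // Broadcast the environment specific properties to all Mappers/Reducers
        job.getConfiguration().set("nl.jpoint.properties", loadPropertiesAsString());
        // ... configure Mapper/Reducer classes and input/output paths here ...
        job.waitForCompletion(true);
    }

    private static String loadPropertiesAsString() throws Exception {
        // The folder is passed through HADOOP_OPTS as -Dproperties.folder=...
        String propertiesFolder = System.getProperty("properties.folder");
        return new String(Files.readAllBytes(Paths.get(propertiesFolder, "job.properties")));
    }
}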

Step 2 – Loading the Spring Context

Now that the properties are stored on the Configuration object we have to point the Mappers/Reducers to the properties and initialize the Spring context. Since we don’t want to start a Spring context more often than necessary, it’s stored in a static variable in the class LazySpringContext. A synchronized check is done to see if it has been initialized yet.
There is probably room for a performance optimisation around the synchronized check, but I went for safety today, and since the bean is only autowired once the performance hit is negligible.

// Create the context without refreshing it, so the property source can be added first
springContext = new ClassPathXmlApplicationContext(new String[]{"applicationContext.xml"}, false);
Properties properties = new Properties();
properties.load(new StringReader(propertiesString));
PropertySource<?> propertySource = new PropertiesPropertySource("configProperties", properties);
springContext.getEnvironment().getPropertySources().addFirst(propertySource);
springContext.refresh();

This again is the simplified version; the full LazySpringContext.java can be found on GitHub.
The final step is to create a method that autowires your Mapper/Reducer:

public static synchronized void autowireBean(Configuration conf, Object bean) {
    String propertiesString = conf.get("nl.jpoint.properties");

    // Lazily initialize the Spring context the first time a Mapper/Reducer asks for it
    if (springContext == null) {
        initContext(propertiesString);
    }

    springContext.getAutowireCapableBeanFactory().autowireBean(bean);
}

The bean parameter is the *this* of your Mapper/Reducer.

Step 3 – Using the properties that just got broadcast

In your Mapper it would look like this:

LazySpringContext.autowireBean(context.getConfiguration(), this);

It wires all the properties annotated with either @Inject or @Autowired in your Mapper. The full file is available on GitHub.
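
To make this concrete, a Mapper could look roughly like the sketch below. The GreetingService bean and the field names are something I made up for this example; the actual Mapper of the sample project is on GitHub.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.springframework.beans.factory.annotation.Autowired;

public class SandboxMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Hypothetical Spring bean that uses the 'name' property from job.properties
    @Autowired
    private GreetingService greetingService;

    @Override
    protected void setup(Context context) {
        // Bootstraps (or reuses) the Spring context and autowires this Mapper instance
        LazySpringContext.autowireBean(context.getConfiguration(), this);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        context.write(new Text(greetingService.greet()), value);
    }
}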

Step 4 – Running the example

To run the example project, please consult the README.MD file.

You can play around with the property folder (dev vs production). On the production environment Hans Gruber is greeted instead of Harrie.
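The production folder would then contain a job.properties along these lines:

name=Hans Gruber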

The initial solution

The initial solution was writing a property file to HDFS and loading that property file in the Mappers and Reducers. This is still a viable solution if you don’t like the approach I proposed above. My colleague Cliff pointed me to the Configuration solution; I think it’s cleaner than going via HDFS, but I’ll let you decide for yourself.
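
For completeness, a minimal sketch of that HDFS based alternative could look like this: the driver copies job.properties to HDFS and every Mapper/Reducer loads it from there. The class name and the HDFS path are made up for the example.

import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPropertiesLoader {

    // Reads the property file that the driver previously copied to HDFS
    public static Properties load(Configuration conf) throws IOException {
        Properties properties = new Properties();
        try (FSDataInputStream in = FileSystem.get(conf).open(new Path("/config/dev/job.properties"))) {
            properties.load(in);
        }
        return properties;
    }
}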

Conclusion

It still is a bit hard to get used to the Hadoop programming model and to find good documentation. This solution needed quite some iterations before we were happy. Don’t hesitate to give me feedback; I will update the article, mention your name and you will probably help some people struggling with Hadoop 🙂

Sources

Spring ClassPathXmlApplicationContext
Packaging of your Hadoop project
