A few days ago my eyes fell on a new release of Akka Stream Kafka. Since I’m doing a lot with Kafka currently and I really wanted to get my hands dirty with Akka this sounded very good. Also a good opportunity to see if an upgrade to Kafka 0.10.0.1 (from 0.8.2.2) is worth while (since older versions of Kafka are not supported in Akka Stream Kafka 0.11).
The default behaviour of a Spark-job is losing data when it explodes. This isn’t a bad choice per se, but at my current project we need higher reliability. In this article I’ll talk about at least once delivery with the Spark write ahead log. This article is focussed on Kafka, but can also be applied to other Receivers.
One of my complaints about Spark was that it wasn’t possible to set a dynamic maximum rate. This is a problem in many jobs since the maximum throughput isn’t always linear with the output rate. Another issue is with local testing. You have to set the rate to extremely low values and experiment a lot to make a Spark job usable on a local machine.
But all these problems are in the past with the introduction of backpressure (I believe it’s spelled as back pressure, but I’ll stick to the Spark notation).
After using Spark for a few months we thought we had a pretty good grip on how to use it. The documentation of Spark appeared pretty decent and we had acceptable performance on most jobs. On one Job we kept hitting limits which were much lower than with that Jobs predecessor (Storm). When we did some research we found out we didn’t understand Spark as good as we thought.
My colleague Jethro pointed me to an article by Gerard Maas and I found another great article by Michael Noll. Combined with the Spark docs and some Googlin’ I wrote this article to help you tune your Spark Job. We improved our throughput by 600% (and then the elasticsearch cluster became the new bottle neck)