
Lessons learned after serving thousands of concurrent users in a devops team for a year

I just celebrated a year at my customer. When I arrived, the project was in good shape with a few rough edges. There was a solid code review process, a stable Jenkins server, motivated people, a reasonable amount of integration tests, and a product owner who is very involved and often nearby. We serve thousands of users, and our analyst has the confidence to do a release at the busiest time of day.

Clean up your logging and keep it clean

In my first week I was amazed by the thousands of daily error messages in the production logging and by the fact that nobody was screaming in panic about it. Most people were used to this volume, which had grown historically (not because people didn’t care; it just happened). The problem with this phenomenon is that small warnings are lost in the sea of errors. New features were introduced, logged some errors, and nobody noticed. The customer noticed eventually. When your feedback loop is this long, errors become harder to fix: the features you created are no longer fresh in your memory and could be ‘contaminated’ with code from other features.

My goal was to reduce the error logging to zero and make sure the team is on the alert when error messages show up on the dashboard. Our logging is collected in a central place, searchable through Kibana and displayed on a large monitor in the team room. This made things very easy. (If you are not collecting your logs in a central place, it really is time to start now. Products like Logstash and Graylog are great tools to make this happen.)
Now you have to create some structure in the error messages. Start with a top 10 of the most frequent error messages and create tickets for them. This usually solves 80% of the problems. I also created a page of ‘known errors’ for these log messages, each with a ticket number (or an explanation of the solution). A lot of errors recur in the long term (even if you’re pretty sure they won’t; yes, on your project too), so this is a good investment.
Repeat this cycle until the amount of errors is approaching zero.
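The ‘top 10’ step above can be sketched in a few lines of Java. This is a hypothetical, simplified tally (the `ERROR` substring match stands in for real log parsing of timestamps, levels and loggers):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: tally the most frequent error messages in a log dump
// so you know which tickets to create first.
public class TopErrors {
    public static List<Map.Entry<String, Long>> topN(List<String> logLines, int n) {
        return logLines.stream()
                .filter(line -> line.contains("ERROR"))
                // crude normalisation: keep everything from "ERROR" onward
                .map(line -> line.substring(line.indexOf("ERROR")))
                .collect(Collectors.groupingBy(msg -> msg, Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "2019-01-01 ERROR Connection refused",
                "2019-01-02 ERROR Connection refused",
                "2019-01-02 ERROR NullPointerException in OrderService",
                "2019-01-03 INFO Startup complete");
        topN(lines, 10).forEach(e ->
                System.out.println(e.getValue() + "x " + e.getKey()));
    }
}
```

In practice a Kibana terms aggregation on the message field gives you the same ranking without writing any code; the point is simply to attack the most frequent messages first.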
Now it is time to prevent this from happening again by appointing a ‘developer of the day’.

Appoint a developer of the day

At my previous customer, a ‘developer of the day’ was introduced. At first it might seem like a waste of resources, but it has several advantages:

  • The other developers can focus on the sprint and have fewer context switches.
  • Knowledge is shared automatically.
  • You keep your logging clean.

Context switches are expensive: when you’re deeply concentrated, it can take 15 minutes to ‘restore’ that context after an interruption (and interruptions happen often, since we’re in an open-plan office).
Since everybody sees all the error messages, all parts of the application are touched. When you’re researching a problem, you automatically learn about the code (or can ask the developer who ‘owns’ that part).
Because error messages are investigated every day, the logging stays clean. Somehow it’s easier to clean up small bits than a big mess.

When we started with this concept it took almost the whole day; now it’s usually one or two hours a day.

Fix your flaky integration tests and increase integration coverage

Our integration tests were very flaky. When the build failed, a retry was usually sufficient to make it succeed. I did some experiments with Karate and really loved it. The whole team liked it when I proposed it, so we decided to replace all the flaky tests with Karate. This process can take a while, but it is really worth it in the end. An advantage of Karate is that our test team can also read the Karate tests and is involved in reviewing them. It even saves them some time on their end-to-end tests, because they know which cases are covered already. Once your tests are replaced, run some coverage tools and check where integration tests are missing. Note that this isn’t watertight, but it gives you a good indication of where to focus.
After a few months we had so much confidence in the code that scary refactorings were possible again. The Karate tests also saved us on numerous occasions.
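For readers who haven’t seen Karate, a minimal test reads roughly like this (the endpoint and response fields below are invented for illustration; they are not from our project):

```gherkin
Feature: user lookup

  Scenario: fetch an existing user
    # hypothetical endpoint, for illustration only
    Given url 'https://example.com/api/users/42'
    When method get
    Then status 200
    And match response.id == 42
```

Because the tests are plain Gherkin-style text rather than Java, non-developers can read and review them, which is exactly why our test team could get involved.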

Improve your dashboard

When I started on the project, there was a custom dashboard application that was constructed by taking screenshots of graphs (which were sometimes ‘shot’ too early). It wasn’t great anymore, but it worked at the time. One day it broke so badly that the operations team decided not to fix it and switched to Grafana.
This immediately improved the quality of our application. Grafana has data sources for CloudWatch, web server logging and Kibana. We were now able to detect problems before they became incidents; increasing response times and high application load are the canaries in the coal mine.

When you have a great dashboard, your team members will scream when something is red. Make sure that there really is a problem when things are red; otherwise the dashboard becomes the boy who cried wolf and you end up back in the sea of log messages.

‘Meten is weten’

There is a Dutch saying, ‘Meten is weten’, which means ‘to measure is to know’.
I already wrote an article about the memory problems we had: Introduction to java heap tuning.
This rule also applies to other things you can measure (like the MongoDB slow query log). Don’t blindly change parameters; prove that your gut feeling is right.
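The ‘measure first’ habit can be as simple as timing the code path you suspect before and after a change, instead of tuning on gut feeling. A minimal sketch (the workload below is a stand-in, not our real code):

```java
// Minimal 'measure, then tune' sketch: time a code path so a tuning change
// can be judged by numbers instead of gut feeling.
public class MeasureFirst {
    static long timeMillis(Runnable workload) {
        long start = System.nanoTime();
        workload.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> {
            // stand-in for the operation you suspect is slow
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
        });
        System.out.println("took " + elapsed + " ms");
    }
}
```

For anything you plan to act on, prefer a proper load test or profiler over a one-off timing; a single measurement is noisy, but it already beats no measurement at all.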

One caveat is that tuning can also hurt. We added a feature that needed a lot more memory than before. Since we still need to improve our load testing, this problem appeared very late (after we had already released) and slowed down the application. So don’t be too aggressive, and have some safeguards in place (like a proper load test).

If it hurts, do it more often

A year ago the team was doing about one release every 8-10 days. This isn’t really bad, but we were having some QA issues, and the rest of our infrastructure is in such good shape that it shouldn’t be too hard to reach continuous deployment and tackle those QA issues.
There is a saying, ‘if it hurts, do it more often’, so that’s what I’m pushing for.
Release frequency has improved slightly and there is a correlation with the QA issues, but I still think not all the pain is visible. This is still something I’m fighting for, but I probably need some more ammo to show that it really helps.

Read books, watch talks and drink lots of coffee

So how did I come up with all these ideas? Most of them came from books I read and talks I watched. But all those ideas are theoretical; the proof of the pudding is in practice. You can’t try every idea, but you can pitch them at the coffee machine, and when you get a lot of positive feedback, give it a shot. Drinking coffee is a surprisingly good way to exchange information with other teams and developers. Since the applications of the customer have a lot of common ground (and most of them share a platform), you’ll probably run into the same problems.

Books:

Talks:


I hope this article helps you improve the quality of your application.
At our customer the conditions are pretty good: we have an involved product owner, and management leaves room for suggestions and improvement. I do realise that we’re lucky and that this makes improvements a lot easier. As you can read between the lines, there is still room for improvement, and once those things are improved, new things can be improved in turn.
Take baby steps, and don’t try to build a pyramid while your shed still has a leaky roof.
I’d like to conclude with a saying I recently heard: “Don’t let perfect be the enemy of good”. Stop improving when a certain area is good enough; going further will only annoy people and waste resources.
