Home > English, java, work > When you tried high available logging and ended up with ‘mostly-available’ – How to setup Logback with GELF with TCP and HAProxy

When you tried high available logging and ended up with ‘mostly-available’ – How to setup Logback with GELF with TCP and HAProxy

Recently we ran into a problem where our logging disappeared sometimes. This happened after switching to an HAProxy load balancer. After many hours and some wireshark sessions we found an incorrect setting in HAProxy and apparently left out some important configuration in Logback. Idle connections were killed by HAProxy and we didn’t know there was a way to see that this happened.
This article may also be interesting when you run into disappearing or stale connections with HAProxy.

How to reproduce the problem

In our setup there is one HAProxy load balancer with three Graylog nodes. Our microservices spit out their logging in the GELF (Graylog Extended Log Format) with SLF4j and Logback as a logging facility.

To reproduce the problem yourself install HAProxy locally and use it with the following config :

defaults
	timeout connect 10000ms
	timeout client 13000ms
	timeout server 13000ms

frontend gelf
    bind :12201
	mode tcp
	default_backend nodes
	
backend nodes 
	server log01 harrie.log.com:12201

Point your Java GELF configuration to localhost:12201 and do some logging from a java application every 19 seconds. For example :

while(true) {
    log.debug("Test... @ {}", Instant.now());
    Thread.sleep(19000);
}

You will notice that after three log statements only the first and third log statement will appear in Graylog. The second log statement only showed when we are connected directly to the nodes behind the load balancer.
What happens here is that HAProxy disconnects the connections with Graylog and your local machine after 13 seconds (the timeout we just set in the config). To figure this out you need a packet sniffer like Wireshark.
With these results you can dive into the GELF-library and end up in the GelfTcpAppender on line 141 (version 1.0.4). In a debugging session you will notice the GELF library registers a broken pipe.
GELF can’t log it to Graylog because logging is broken.

This issue only occurs when logging is idle for longer than the HAProxy timeout. In this example 13 seconds (on our servers it is of course much higher). The timeout was probably introduced to prevent denial of service attacks, when connections never timeout it is very easy to claim all available sockets and effectively block the machine.

Logback StatusManager

When things go wrong with logging you can’t log it because there might be no logging anymore. Fortunately Logback saves these kind of messages in an object called StatusManager. These messages can be made visible in the console (System.out in Java) when you add the following line to your logback.xml :

<statusListener class="ch.qos.logback.core.status.OnConsoleStatusListener" />  

In our environment this logging ends up in /var/log/syslog which needs to be monitored for these kind of failures.

This is of course no ideal situation, but this only happens in very exceptional cases. When you’re desperate you will probably look there eventually.

There also is a servlet called ViewStatusMessagesServlet to display these statements in your browser, check the docs on how to configure the Servlet.

Different logging libraries

There are a few libraries that support GELF with Logback. We initially started with ‘Pukka One’ (/pukkaone/logback-gelf on github) but ditched it because there was no way to reconnect on failures.
After some research we found another library, also called logback-gelf developed by Oliver Siegmar. This library has many more configuration options, has recent commits and the code looks very clean.

Solution to our problem

Since the ‘Siegmar’ library has an option reconnectInterval with a default value of 5 minutes we decided to stick to this default and increase the HAProxy timeout to 6 minutes (should be more than the reconnectInterval). Our applications run in a relatively safe network, not accessible from the internet. When you run your application on the internet these values should shorter (to prevent the DoS mentioned earlier).

I also tried changing the retryDelay but it seems the appender thinks the second log statement succeeds and retries the third statement. The socket of the GELF appender is constructed with socket.shutdownInput(); which effectively ignores the FIN/ACK the server sends to the local machine.

update 12-jun-2017 20:15 And a few hours after publishing, the shutdownInput is removed by the creator of the library: diff

Conclusion

Logging seems easy, but as I proved earlier it is easy to unknowingly discard logging without even noticing. Luckily we expected a certain kind of logging and it didn’t appear, but when you don’t expect anything you won’t miss it.
What I learned is that after finding a solution you’re not finished. With these kind of problems try to answer questions on Stackoverflow or write an article about it. This will give you new insights, solutions to hidden bugs and you might even help a few people who have the same problem.

Sources

GELF Documentation
HAProxy website
‘Siegmar’ logback-gelf appender

Advertisements
Categories: English, java, work Tags: , , , , ,
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: