Devil in the Defaults

Default settings are the worst. Every system has them, and they’re great until they’re not.

For whatever reason, my predecessors decided to purchase a bunch of bare-bones HP servers and install Check Point’s firewall software on them. The HP servers were significantly cheaper than Check Point’s branded appliances, but they come with a different set of risks. For example, you have to estimate maximum throughput yourself, rather than knowing exactly what the appliance is rated for.

Over the past few weeks, we had been lightly troubleshooting an issue between a VMware vCenter server and the ESX hosts it manages. ESX hosts were randomly showing up as disconnected for a brief moment, then reconnecting. It was nothing hugely impactful, but it was a mild annoyance for the server guys. A couple of people on my team had taken a quick look on the network side and come up empty-handed. Because of some upcoming maintenance work the server team needed to perform, I was asked to spend some time isolating the root cause.

The first step was digging through the logs from the two sets of firewalls sitting between these systems. The first firewall set showed traffic passing normally, as I would expect. However, I started seeing some unexpected logs on the second set, a Check Point cluster. The logs showed that vCenter would open connections out to the ESX hosts for a short while, and then the Check Point would log a “TCP Packet out of state” error. The details of that log showed vCenter sending a non-SYN packet (usually a PSH-ACK) to the ESX host.

An error like that indicates that something is killing the TCP connection before vCenter is finished with it. vCenter still believes the connection is open, which is why it keeps sending mid-stream packets like PSH-ACKs. Since we were already aware that this particular Check Point cluster had some issues, we examined it first. Sure enough, the IPS logs showed that the cluster was often exceeding 80% of its maximum concurrent connections and then enabling the “Aggressive Aging” feature.
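To get a feel for how widespread the drops were, it helps to tally the out-of-state entries by source and destination. A minimal sketch, assuming the logs have been exported to plain text; the line layout below is purely illustrative, since real SmartLog/fw log exports will look different depending on version and export options:

```python
import re
from collections import Counter

# Illustrative exported log lines; real Check Point log exports will differ.
SAMPLE_LOG = """\
12:01:03 drop vcenter01 -> esx-host-07 TCP packet out of state: First packet isn't SYN; tcp_flags: PUSH-ACK
12:01:05 accept vcenter01 -> esx-host-07 https
12:02:11 drop vcenter01 -> esx-host-07 TCP packet out of state: First packet isn't SYN; tcp_flags: PUSH-ACK
12:03:42 drop vcenter01 -> esx-host-12 TCP packet out of state: First packet isn't SYN; tcp_flags: PUSH-ACK
"""

def count_out_of_state(log_text):
    """Count 'out of state' drops per (source, destination) pair."""
    pattern = re.compile(r"(\S+) -> (\S+) TCP packet out of state")
    return Counter(m.groups() for m in pattern.finditer(log_text))

counts = count_out_of_state(SAMPLE_LOG)
for (src, dst), n in counts.most_common():
    print(f"{src} -> {dst}: {n}")
```

Sorting by count quickly showed that the drops weren’t tied to one particular host, which pointed away from a host-specific problem and toward something on the firewall itself.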

Aggressive Aging is a Check Point protection that prevents the cluster from running out of memory and potentially crashing. By default, it takes effect whenever the cluster exceeds 80% of its available memory or concurrent connections, and it stays enabled until the cluster drops back below a lower threshold, 78% by default. Seems like a helpful feature to have, right? Yes, but there are some considerations in how this protection works. When Aggressive Aging is active, the cluster significantly reduces all of the normal TCP timeout values. For example, Check Point’s documentation shows that new TCP sessions are given only 5 seconds to establish instead of the normal 25, and the maximum lifetime of an established TCP session drops from 1 hour to 10 minutes. To help fall back below the 78% threshold, Aggressive Aging will also evaluate and terminate 10 existing connections for every new connection that is established.
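The two thresholds form a simple hysteresis loop: the protection switches on above 80% and doesn’t switch off until utilization falls below 78%. A tiny sketch of that behavior, using the documented default percentages (the numbers are the point here, not the code):

```python
ENABLE_PCT = 0.80   # Aggressive Aging activates above this utilization
DISABLE_PCT = 0.78  # ...and deactivates only once utilization drops below this

def aggressive_aging_active(utilization, currently_active):
    """Hysteresis: turn on above 80%, stay on until utilization falls below 78%."""
    if utilization > ENABLE_PCT:
        return True
    if utilization < DISABLE_PCT:
        return False
    return currently_active  # between the thresholds: keep the previous state

# Walk a utilization curve across the thresholds.
state = False
history = []
for util in (0.75, 0.81, 0.79, 0.77):
    state = aggressive_aging_active(util, state)
    history.append(state)
print(history)  # note it stays on at 79% (between thresholds), off again at 77%
```

The gap between the two thresholds is why a busy cluster can sit in the aggressive state for a long stretch: utilization hovering at 79% keeps the shortened timeouts in effect even though the trigger condition is no longer met.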

As I stated previously, this cluster was already pretty busy, most often bumping into CPU limits. However, from the brief research I did, increasing the connections table mostly affects RAM utilization more than anything else. This system has over 20 GB of RAM and typically uses only around 4 GB. I was still concerned that an increase in total concurrent connections could mean more CPU usage, because more connections means more work for the IPS. Unfortunately, Check Point has no publicly available utility to help calculate what to set your maximum concurrent connection limit to. In fact, when I opened a support ticket with them, I was told to “just keep increasing it, until you hit a point where the cluster is no longer triggering Aggressive Aging. Then add about 10-20k above that to set the new maximum concurrent connection limit”. That’s not really an acceptable answer to me, but I wasn’t able to get anything more out of them.
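For what it’s worth, support’s “keep increasing it” answer does at least imply a rule of thumb: take the observed connection peak once Aggressive Aging stops triggering, then add 10-20k of headroom. A sketch of that calculation; the rounding step is my own assumption for tidiness, not anything Check Point documents:

```python
def suggested_limit(observed_peak, headroom=20_000, round_to=5_000):
    """Peak concurrent connections plus headroom, rounded up to a tidy multiple."""
    raw = observed_peak + headroom
    return -(-raw // round_to) * round_to  # ceiling division, then scale back up

print(suggested_limit(35_000))  # 55000: the ~35k peak I saw early on, plus 20k
```

It’s crude, but it at least turns “just keep increasing it” into a repeatable calculation you can re-run each time you observe a new peak.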

To change the maximum concurrent connections (using R77.xx), open SmartDashboard, edit the cluster object, and find Optimizations in the left-hand menu. Here you can set a new manually defined limit, or allow the cluster to automatically scale the maximum connections. If this cluster were significantly less busy, I might be tempted to enable the automatic limit for a while to get a baseline. However, I would rather not risk crashing the cluster, so I manually increased the limit from 25,000 to 50,000. Install the policy for the change to take effect. You can see the current concurrent connection count either by looking at the Overview page in SmartDashboard, or by logging into the cluster CLI and using the cpview utility.
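If you’d rather trend the table over time than watch cpview, `fw tab -t connections -s` on the cluster CLI prints a one-line summary of the connections table. A sketch that parses that summary and flags when you’re approaching the Aggressive Aging trigger; the column layout (#VALS for current entries, #PEAK for the high-water mark) matches what I’ve seen on R77, but treat it as an assumption since output formats can vary by version:

```python
# Example summary output from `fw tab -t connections -s` (layout may vary by version).
SAMPLE_OUTPUT = """\
HOST                  NAME                                 ID #VALS #PEAK #SLINKS
localhost             connections                        8158 34210 41873   68420
"""

def parse_connections_summary(output):
    """Return (current, peak) connection counts from the summary table."""
    data_line = output.strip().splitlines()[1].split()
    return int(data_line[3]), int(data_line[4])  # the #VALS and #PEAK columns

LIMIT = 50_000  # the manually configured maximum from SmartDashboard
current, peak = parse_connections_summary(SAMPLE_OUTPUT)
for label, value in (("current", current), ("peak", peak)):
    pct = value / LIMIT
    warn = "  <-- nearing the 80% Aggressive Aging trigger" if pct > 0.75 else ""
    print(f"{label}: {value} ({pct:.0%} of {LIMIT}){warn}")
```

Run periodically (cron, or a monitoring poller over SSH), this gives you the peak figure you need for the sizing rule above without having to sit in an interactive session.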

In my case, the connection count almost immediately ramped up to ~35,000. Within a day we were hitting the Aggressive Aging protection again, but significantly less often than before. The change also resolved our ESX host disconnection problem, which confirmed my theory that the Aggressive Aging feature was the cause. I’ve been monitoring and slowly increasing the concurrent connections limit since, and I think we have finally stabilized around 90,000. Just think of how many connections were denied or terminated early because the old limit was in place!

Moral of the story: understand the systems you own. This firewall cluster had been in place for years before I was hired, with all of its settings left at their defaults. Default settings probably work for most cases, but they come with their own problems. This one setting had likely caused multiple problems in the past, but no one understood the system well enough to find out what was happening.

Ever have a scenario where a default setting caused problems? Share it in the comments!