Checkpoint on 0x2142 | Networking Nonsense

Tips for Working with Vendor Support

Wed, 09 May 2018 12:00:24 +0000

This post has been on my mind for a while now. I’ve worked as a network admin long enough to have opened more technical support cases with vendors than I want to think about. Over the years I’ve developed my own process for how I handle those support cases in an effort to get a quick and efficient resolution. Some of this stems from starting off in a NOC, where calling vendor support was practically step 1 of any troubleshooting procedure. A lot of this is based on my own experiences or things I’ve been taught by former co-workers.

On the other side of things, I’ve worked with people over the years who haven’t had quite the same experiences as I have. Some of them don’t typically call vendors for support - or maybe they’ve just never been lucky enough to be the first responder on a Severity-1/Production-down case. This has occasionally resulted in a number of vendor support cases being closed with highly unsatisfactory results. In fact, there have been times where this has been bad enough that it gives management the impression that we received bad support from the vendor - but what if it was just our own inability to work effectively with that vendor? Maybe we didn’t push hard enough on an issue or stress the importance.

I feel that in a majority of cases, it shouldn’t be difficult to get the results you want out of a vendor support case. So I’ve put together a list of my own personal tips and guidelines for working with technical support.

The Vendor is your partner - Not the enemy

I’ve seen a lot of people call vendor support for help, then treat the support engineer as the bad guy. If you want to get a quick resolution, then you need to look at them as your partner. They’re here to help you figure out your issue and get everything fixed. When you open a new case, give them a concise summary of your issue with any relevant details you think are important. Don’t hide information, as this will just prolong finding a resolution - Give them everything they need. If you know your vendor always asks for the same diagnostics information every time you open a case, then have that information ready before you contact them.

Always remember how difficult the job of a technical support rep can be. They’re likely sitting in a call center, just waiting for the next case. They may have in-depth knowledge of their product, but they’re walking into your network completely blind. They won’t know your traffic flows, or that some systems are redirected through a proxy, or that one band-aid fix that Joe put in a year ago and never documented. Your support rep is going to do the best job they can to get a handle on what your network looks like, but be prepared to guide them. How would you like to troubleshoot a completely different and unknown network every time your phone rings?

Have realistic expectations

Especially when you’re new to a career in IT, it can be hard to gauge what you can and can’t ask of your tech support rep. One of my former jobs had a policy that for every ticket with AT&T we opened, we would be required to call back every hour for a status update. If there wasn’t one, then we were supposed to demand escalation to a manager. That policy might make sense if it’s a critical issue, but what about something that’s not? Have realistic expectations about what your vendor can do.

Troubleshooting takes time. If your support engineer grabs a bunch of logs and says they’ll need to get back to you - then you’ll need to give them the time they need. Feel free to ask for a time estimate - but if they say they’ll have something to you by the following day, don’t start bugging them every hour.

Remember that your support engineer’s job is to help you. If you don’t feel like that’s happening, you have a few options. You can ask for a case escalation or ask for the case to be reassigned. You never know the skillset of the person receiving your case, and you might get someone who isn’t super familiar with the problem you’ve raised. As long as there is no urgency, I will usually give the person time to work the issue - but be prepared to request a case transfer if it becomes apparent that they’re not getting anywhere. For example, I once had a case for a web-based firewall management system. The engineer I got was very good with the GUI side of things, but wasn’t very knowledgeable when troubleshooting took a turn toward the underlying linux system. A quick request to transfer the case to someone more experienced in this side of the system and we had the case solved within an hour. If an escalation or case transfer doesn’t help, you can also usually reach out to your local account representative and ask them to help push the issue for you.

It’s also very helpful to have an idea of your vendor’s support policies. Have a question about how to set up a new feature? Some vendors don’t permit you to open a case for a new configuration, and will refer you to their professional services team. On the other hand, some vendors are perfectly okay with helping you figure out how to set up something. Even better, some support teams are willing to stand by during migrations and upgrades, just in case you need their help. In my experience, if you’re not 100% confident in your changes, then it’s better to open a proactive case beforehand.

Be clear about the impact

If your entire datacenter is offline because of an issue, make sure that you immediately stress the importance. Again, your support engineer is jumping into your environment blind. Does this firewall performance issue impact twenty people, where it is just a minor inconvenience? Or is this issue prohibiting 50,000 customers from using the services you provide? The last thing you want is a misunderstanding of impact when it’s a high priority issue for your business.

Usually when I open a high severity case, I’ll let the engineer know: “Just so we’re on the same page, this issue affects a large datacenter serving 600+ customers. We need to get this back into a stable state as soon as possible”. High severity cases can be stressful for both sides, and I try to be clear about the impact without making that worse.

On the other side of things, if there is a low severity issue - don’t blow it out of proportion. I’ve worked with too many engineers who open up a Sev 1/Prod-Down case for every issue, even if the issue is just a minor inconvenience. Categorize your issues appropriately when you open them - and do your best to be realistic. A slow download for three users probably doesn’t warrant getting half a dozen TAC engineers on a conference bridge.

In case of emergency

Emergency situations are a completely different subject - so I want to spend a bit of time covering them separately. It’s really important to know what constitutes a true emergency in your environment. Is it when an office (or datacenter) goes offline? Or maybe even a single extremely critical business application? Things break - so have a plan and be prepared.

Step one - Always call into your vendor’s support line. You don’t want to open a web/email case and wait around for a technician. This might seem obvious, but I’ve known a lot of people who complain about the vendor’s response on a critical issue when they opened a case via a support portal. Find the vendor’s support number (or have it saved somewhere) and call them.

Step two - Ask for a warm handoff. In most cases, the person answering the support line is just creating a case and routing it to the appropriate team/ticket queue. They may just give you a ticket number and tell you to expect a call back shortly. If the issue is truly critical, ask them for a warm handoff to a technician. Most vendors I have worked with have had no problem doing this, and it helps you get to troubleshooting faster.

Step three - Clarify your issue and set expectations. You may be in a rush to get the issue fixed, but take a minute to explain your issue thoroughly and clearly. The more information you give to your support technician, the more easily they can dive into troubleshooting. And as I mentioned earlier, be sure to set expectations and be extremely clear about the impact of your issue.

Step four - Keep troubleshooting on track. As I’ve stated before, you know your environment/network better than your vendor does. If they start looking at something you don’t believe is related, you need to guide them back to the main problem.

In addition, if you feel after a bit that the troubleshooting isn’t making progress - then request an escalation or a second set of eyes. There is no harm in asking for more eyes on the problem. I’ve even had situations before where the technician said “Well, I think we need to do X to fix it”, and I’ve just asked them to see what their peers think. You would rather be sure about a change, than make the issue worse.

Step five - See the issue through to resolution. Make sure you get your environment back to a stable state before ending the call. If the technician wants to drop off and call you back after reviewing logs, let them know you’re willing to just wait on hold. Once they’re off the phone, it’s easy for your technician to get dragged into another issue.

If the call ends with everything in a temporary state - then write down your next steps and make sure you accomplish them! Maybe you were able to restore connectivity, but need to wait for a maintenance window to make a change that is more service impacting than the original issue. Or maybe your support technician needs you to gather additional logs that they can forward to development. Whatever it ends up being, make sure you take note of it and follow through.

These are just some of my own personal tips that have worked for me. Support calls with vendors don’t always need to be a massive pain to deal with. Sure, sometimes you might have bad luck and get an inexperienced technician - but I find most issues with vendors can be solved easily enough once you know how much you can push them, and what you can and cannot ask for.

I hope these are useful - Let me know in the comments if you have any additional tips!

Devil in the Defaults

Tue, 03 Oct 2017 08:00:28 +0000

Default settings are the worst. Every system has them, and they’re great until they’re not. For whatever reason, my predecessors decided to purchase a bunch of bare-bones HP servers and install Check Point’s firewall software on them. The HP servers were significantly cheaper than buying Check Point’s branded appliances, but unfortunately they come with a different set of risks. For example, you have to work on estimating max throughput yourself, rather than knowing exactly what the appliance is rated for.

Over the past few weeks, we have been lightly troubleshooting an issue between a VMware vCenter server and the ESX hosts that it manages. ESX hosts were randomly showing up as disconnected for a brief moment, then would reconnect. It was nothing extremely impacting, but a mild annoyance for the server guys. A couple of people on my team had taken a quick look on the network side, and turned up empty handed. Due to some upcoming maintenance work the server team needed to perform, I was asked to spend some time trying to isolate the root cause of this issue.

First thing was digging through the logs from the two different sets of firewalls between these systems. The first firewall set showed that traffic was passing normally as I would expect. However, I started seeing some unexpected logs for the second firewall set, a CheckPoint cluster. The logs showed that vCenter was opening connections out to the ESX hosts for a short while, then the CheckPoint would log a “TCP Packet out of state” error. The details of this log would show that vCenter sent a non-SYN packet to the ESX host (usually a PSH ACK).

Seeing an error like that indicates that something is killing the TCP connection before vCenter is finished using it. vCenter still believes that the connection is open, which is why it sends packets with incorrect flags. Since we were already aware that this particular CheckPoint cluster has some issues, we began examining this cluster first. Sure enough, the IPS logs on the device showed that the cluster was often reaching >80% of its maximum concurrent connections and then enabling the “Aggressive Aging” feature.

Aggressive Aging is a CheckPoint protection which prevents the cluster from running out of memory and potentially crashing. By default, this is set to take effect whenever the cluster exceeds 80% of its available memory or concurrent connections. This protection stays enabled until the cluster drops back below a second threshold, which is 78% by default. Seems like a helpful feature to have, right? Yeah - but there are some considerations with how this protection works. When Aggressive Aging is activated, the cluster significantly reduces all of the normal TCP timeout values. For example, CheckPoint’s documentation shows that new TCP sessions are given only 5 seconds to establish, instead of the normal 25 seconds. This also changes how long a TCP session can be open from 1 hour to 10 minutes. In order to help drop below the 78% threshold, Aggressive Aging will evaluate and terminate 10 connections for every individual new connection that is established.
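To make the two thresholds a bit more concrete, here is a minimal Python sketch of the on/off behaviour described above. It is only a toy model built from the numbers quoted in this post (80% to activate, 78% to deactivate, and the two timeout changes). The function names are made up for illustration, it only looks at concurrent connections (not memory), and it is not how Check Point actually implements the feature.

```python
# Defaults quoted in the post, as integer percentages of the connection limit.
ACTIVATE_PCT = 80    # protection turns on when utilization exceeds this
DEACTIVATE_PCT = 78  # protection turns off once utilization drops below this

# Timeouts quoted in the post, in seconds: (normal, while Aggressive Aging is on).
TCP_START_TIMEOUT = (25, 5)
TCP_SESSION_TIMEOUT = (3600, 600)


def aging_states(samples, limit, active=False):
    """Return whether Aggressive Aging is on after each connection-count sample.

    Hypothetical model of the hysteresis only: `samples` is a sequence of
    concurrent-connection counts and `limit` is the configured maximum.
    """
    states = []
    for count in samples:
        # Integer math avoids float rounding right at the thresholds.
        if not active and count * 100 > limit * ACTIVATE_PCT:
            active = True
        elif active and count * 100 < limit * DEACTIVATE_PCT:
            active = False
        states.append(active)
    return states


def session_timeout(active):
    """Maximum TCP session lifetime in seconds for the given state."""
    normal, aggressive = TCP_SESSION_TIMEOUT
    return aggressive if active else normal
```

With a limit of 25,000 connections, this model switches the protection on once the count goes above 20,000 and only switches it off again once the count falls below 19,500. The gap between the two thresholds is why a cluster hovering near 80% can sit in the aggressive state for long stretches.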

As I stated previously, this cluster was already pretty busy - often hitting its CPU limits. However, from the brief research I did, it looks like increasing the concurrent connections table mainly affects RAM utilization. This system has over 20 GB of RAM and is typically only using around 4 GB. I was still concerned that an increase in total concurrent connections could mean more CPU usage, because that means more connections for the IPS to process. Unfortunately, CheckPoint has no publicly available utilities to help calculate what to set your max concurrent connection limit to. In fact, when I opened a support ticket with them, I was told to “just keep increasing it, until you hit a point where the cluster is no longer triggering Aggressive Aging. Then add about 10-20k above that to set the new maximum concurrent connection limit”. That’s not really an acceptable answer to me, but I wasn’t able to get anything more out of them.

So in order to change the maximum concurrent connections (using R77.xx), you need to open SmartDashboard and open the cluster object. Then find Optimizations in the left-hand menu. Here you can set a new manually-defined limit, or allow the cluster to automatically scale the maximum connections. If this cluster were significantly less busy, I might be tempted to enable the automatic limit for a bit and try to get a baseline. However, I would rather not open myself up to the chance of crashing the cluster - so I manually increased the limit from 25,000 to 50,000. Install the policy for the configuration to take effect. You can see the current concurrent connections by either looking at the Overview page in SmartDashboard, or logging into the cluster CLI and using the cpview utility.

In my case - the concurrent connection count almost immediately started ramping up to ~35,000. Within a day we started encountering the Aggressive Aging protection again, but it was happening significantly less often than before. This also resolved our ESX host disconnection problem, which supported my theory that the Aggressive Aging feature was causing our problem. I’ve been monitoring and gradually increasing the concurrent connections limit since, and I think we have finally stabilized around 90,000. Just think of how many connections were denied or terminated early because this limit was in place!
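The "keep increasing it, then add 10-20k" advice from support can at least be turned into some rough arithmetic. The sketch below is my own back-of-the-envelope helper, not a Check Point tool: it assumes the default 80% activation threshold, and the function names and the headroom default are made up for illustration.

```python
def smallest_limit_for_peak(peak_connections, activate_pct=80):
    """Smallest connection limit at which the peak stays at or under activate_pct."""
    # Integer ceiling division: limit >= peak * 100 / activate_pct
    return -(-peak_connections * 100 // activate_pct)


def suggested_limit(peak_connections, headroom=10_000, activate_pct=80):
    """Apply the 'add 10-20k on top' rule of thumb (headroom is an assumption)."""
    return smallest_limit_for_peak(peak_connections, activate_pct) + headroom
```

For a peak of 35,000 connections this gives a floor of 43,750 and a suggestion of 53,750 with 10k of headroom. The catch is the one I ran into above: any peak measured while the limit is too low is artificially held down by Aggressive Aging, so the number has to be re-measured after every increase.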

Moral of the story here: Understand the systems that you own. This firewall cluster had been in place years before I was hired, and all of the settings were left at their defaults. Default settings probably work for most cases, but they also come with their own problems. This setting had likely been the cause of multiple problems in the past, however no one truly understood the system enough to find out what was happening. Ever have a scenario where a default setting caused problems? Share it in the comments!

Migrating IP Addressing Schemes

Wed, 24 May 2017 08:00:27 +0000

Back a few months ago, I wrote a bit about why it is important to have a good design for IP addressing schemes (part 1 and part 2). As a brief refresher, the situation I found myself in was an environment where practically everything was assigned a 10.x.x.x/16 subnet - even if we only needed a handful of hosts. When I arrived at the company, we were already down to less than 1/3 of the 10.x.x.x range remaining unallocated (with multiple new locations already being discussed).

The IP addressing design that I came up with limited our typical data center deployment from 4-6 /16 blocks to a single /16 block for each location. For all new locations since then, this new design has been used and it has proved to be extremely beneficial. The ability to use proper address summarization has made firewall rules, routing, and VPN tunnel configuration much simpler. But what about all the old locations which still had several /16 blocks? None of these needed more than a single /16, but we have thousands of systems that would need to be re-addressed. Not something that was going to happen overnight. So let’s take a look at some of the methods we employed for migrating from one IP addressing scheme to another.

Have a plan - The first step is to have a good handle on the overall situation and how to get from point A to point B. You’re likely going to need buy-in from other teams to help get there, and this could easily be a multi-year project depending on the number of systems. When you meet to discuss the re-addressing project, you need to be pretty strong when describing the benefits of the new system - otherwise no one will want to help.
Enforce the standard for anything new - The easy target for any transition is to hit new stuff first. For example, we started using the new range in a brand new location first. Anything being deployed that requires a new IP address allocation needs to be using the new scheme. We don’t want to perpetuate the scheme we are trying to get rid of.
Transition (Network Config) - This can be a difficult step that requires a bit of planning. For any existing sites, we need to configure the ability to use both IP address schemes side-by-side until the transition is completed. There are two primary ways to accomplish this that I’ve used - either build out a new segmented (VLANed) network, or overlay the existing using secondary IP addresses. Don’t forget to propagate routes to the new subnets and ensure that firewall rules match the existing functionality.
Transition (Infrastructure/Servers) - Once the underlying networking pieces are done, the next step is to begin transitioning services. Again, make sure any new systems getting deployed are now using the new ranges. Then we can take either an active or passive approach. In the passive approach, we are going to essentially just build new systems in the new scheme and wait until the older systems are eventually removed from service. This probably isn’t the ideal way to do this - but it’s certainly an option. In a more active approach, we would start identifying the older systems to move and making plans to do so (likely in a phased manner). Either method is going to require a serious investment of time, depending on the size of your network.
Long-Term - This process is never going to be quick or easy, but the end result should be a much better state than where we began. In the meantime, maintaining both IP addressing schemes can be quite painful. Make sure that everyone on the team understands the goals of the new scheme, the plan for getting there, and how everything is configured to make it happen. The last thing you want is for someone to try and back out of the move, just because they’re not confident in what’s going on.
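To illustrate why a single /16 per location summarizes so nicely, and how you might keep track of progress during the transition, here is a small sketch using Python's standard ipaddress module. All of the address ranges are hypothetical placeholders (none of them are our real allocations), and the helper names are made up for this example.

```python
import ipaddress

# Hypothetical ranges for one location; none of these come from a real network.
# Old scheme: several scattered /16 blocks. New scheme: one /16 for the site.
OLD_BLOCKS = [
    ipaddress.ip_network(n)
    for n in ("10.12.0.0/16", "10.37.0.0/16", "10.58.0.0/16", "10.91.0.0/16")
]
NEW_BLOCK = ipaddress.ip_network("10.100.0.0/16")


def summarize(networks):
    """Collapse a list of networks into the fewest covering prefixes."""
    return list(ipaddress.collapse_addresses(networks))


def scheme_of(address):
    """Classify a host address as belonging to the 'new' or 'old' scheme."""
    ip = ipaddress.ip_address(address)
    if ip in NEW_BLOCK:
        return "new"
    if any(ip in block for block in OLD_BLOCKS):
        return "old"
    return "unknown"


def migration_progress(addresses):
    """Count how many hosts sit in each scheme, e.g. from an inventory export."""
    counts = {"new": 0, "old": 0, "unknown": 0}
    for address in addresses:
        counts[scheme_of(address)] += 1
    return counts
```

Every /24 carved out of the new block collapses back into one /16, so a firewall rule, route, or VPN proxy ID only needs a single prefix for the whole site. The scattered old blocks stay as four separate entries. Running something like migration_progress against a host inventory is one simple way to see how far along a site is.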

I also wanted to stress the importance of research throughout this whole process. It’s important to try and understand why the original IP addressing was designed the way it was, and what goals they had in mind at the time. It’s also important to check the technologies you’re using to understand how everything will work. For example, Juniper’s SSG (ScreenOS) platform doesn’t support utilizing a secondary IP address on an ‘untrust’ interface (KB5527) - but it works if you use a custom zone name. And Check Point doesn’t support secondary IP addresses at all when you are using their ClusterXL protocol (SK89980). Instead, they recommend that you deploy a new VLAN and tagged sub-interface. However, they do support it if you are using VRRP instead.

This is in no way a definitive guide on the various ways you might accomplish this - but I wanted to give a bit of background on how we tackled the problem. Unfortunately in my case, most of the older locations have several thousand systems - so I’ll be working on this migration for quite a while.

Ever had to migrate to a new IP addressing scheme? What methods did you use? How large was the network? Run into any big problems? Comment below!

Juniper SRX VPN Issues

Tue, 18 Apr 2017 08:00:02 +0000

Last year we began migrating from our old Juniper SSG firewalls to the new SRX line. After a few months, I’ve honestly really started to enjoy working with them - so much that we’ve decided to start standardizing our firewall platforms by ditching everything else. So far I’ve had the opportunity to install ten SRX 1500s, six SRX 345s, and one SRX 340. Some have been completely new installs for a new location and some have been migrations from other devices. But while most of the process has been surprisingly smooth - there is one thing that keeps coming back up: VPN issues. (Oh, and the fact that pre-15.1X49-D60 doesn’t support In-service-upgrades - but don’t get me started on that one…)

We run multiple locations around the world, and unfortunately have to keep full mesh VPN connectivity due to the way our systems have been deployed. Today each SRX cluster has around 15 different VPN peers, which are made up of other SRXs, older SSGs, CheckPoint firewalls, Cisco ASAs, and Watchguard firewalls. This is still an on-going process - but I wanted to throw out some of the issues I’ve run into so far, and what I’ve been able to do to fix them or work around them.

Issue #1 - VPN is up, but no traffic is flowing across it

This one initially took me a minute to figure out. All of our tunnels are route-based, using secure tunnel interfaces. So each VPN is configured with a set security ipsec vpn vpn_name bind-interface st0.x command. I had a set of VPN tunnels between two locations that were not passing traffic, even though a show security ipsec sa showed the tunnels as established. For reference, here is what the config looked like:

root@SRX-SITE-A> show configuration security ike
respond-bad-spi 1;
proposal ike-aes256 {
    authentication-method pre-shared-keys;
    dh-group group2;
    authentication-algorithm sha-256;
    encryption-algorithm aes-256-cbc;
    lifetime-seconds 28800;
}
policy ikepolAES256 {
    mode main;
    proposals ike-aes256;
    pre-shared-key ascii-text xxxxxxxxx; ## SECRET-DATA
}
gateway gateway-siteB {
    ike-policy ikepolAES256;
    address XXX.XXX.XXX.XXX;
    no-nat-traversal;
    external-interface reth0.0;
}

root@SRX-SITE-A> show configuration security ipsec
proposal ipsec-aes256 {
    protocol esp;
    authentication-algorithm hmac-sha1-96;
    encryption-algorithm aes-256-cbc;
    lifetime-seconds 28800;
}
policy ipsecpolAES256 {
    perfect-forward-secrecy {
        keys group2;
    }
    proposals ipsec-aes256;
}
vpn vpn-to-SITE-B {
    bind-interface st0.1;
    df-bit clear;
    ike {
        gateway gateway-siteB;
        ipsec-policy ipsecpolAES256;
    }
    establish-tunnels immediately;
}

root@SRX-SITE-A> show configuration interfaces st0
unit 1 {
    description vpn-to-SITE-B;
}

The config on both sides practically matched, but there was one thing missing that was preventing the tunnel from passing traffic. Under the st0 configuration, unit 1 (or whichever tunnel interface you might be using) needs to have family inet configured - in this case, set interfaces st0 unit 1 family inet. Even though I’m using an unnumbered tunnel interface, this command still needs to exist to tell the SRX that the interface is used for IPv4 traffic. Quick fix, but it’s easy to miss.
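Since this is so easy to miss when you have a dozen or more tunnels, a quick audit script can help. The following Python sketch scans the text output of show configuration | display set and reports any VPN bound to an st0 unit that has no family inet statement. It is only a rough check under a few assumptions: the function name is made up, it does plain line matching, and it ignores things like configuration groups, logical systems, and IPv6-only tunnels.

```python
import re

# Matches e.g. "set security ipsec vpn vpn-to-SITE-B bind-interface st0.1"
BIND_RE = re.compile(r"^set security ipsec vpn (\S+) bind-interface st0\.(\d+)\s*$")
# Matches e.g. "set interfaces st0 unit 1 family inet" (but not "family inet6")
INET_RE = re.compile(r"^set interfaces st0 unit (\d+) family inet\b")


def vpns_missing_family_inet(display_set_text):
    """Return (vpn_name, interface) pairs whose st0 unit lacks family inet."""
    bound = {}          # st0 unit number -> VPN name bound to it
    inet_units = set()  # st0 unit numbers that have family inet configured
    for line in display_set_text.splitlines():
        line = line.strip()
        match = BIND_RE.match(line)
        if match:
            bound[match.group(2)] = match.group(1)
            continue
        match = INET_RE.match(line)
        if match:
            inet_units.add(match.group(1))
    return sorted(
        (vpn, f"st0.{unit}") for unit, vpn in bound.items() if unit not in inet_units
    )
```

Fed the display set version of the broken config above, this would flag vpn-to-SITE-B on st0.1, since that unit only has a description.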

Issue #2 - VPN drops every 2-4 hours and doesn’t re-establish for another 2-4 hours (or manual SA clearing)

The original SRXs that I installed were running JunOS 15.1X49-D40.6. I had at least half a dozen of these devices interconnected with full mesh VPNs, and experienced no issues. However, when I picked up a new set of SRX 1500s a few months back, Juniper had just released 15.1X49-D70.3 - so I upgraded before these were put into production. Strangely enough, when I began migrating tunnels to the new cluster we started to see the VPNs to remote SRXs drop sporadically. The first remote sites to migrate were less of a priority to keep connectivity established, so I took this opportunity to spend a little time figuring out what was going on.

The initial issue seemed to be that the VPNs would establish, but only for about 2-4 hours. Then they would drop and not re-establish for 2-4 hours. This seemed a bit weird to me, because the re-key interval was set for 8 hours - which means that re-key wasn’t playing into this. Even weirder, whenever the issue occurred - one of the two SRX clusters would always still show the IPSec tunnel as up, while the peer SRX would just keep logging errors about bad SPIs. Clearing the stale IPSec security association would let the tunnels re-establish immediately.

In order to resolve this, I had to configure both Dead-Peer-Detection and Juniper’s VPN monitoring on both sides of the connection - so that each SRX would more actively monitor the tunnel status. Juniper’s documentation states that they enable DPD by default, but in an ‘optimized’ mode which only sends a DPD R-U-THERE message under certain conditions. I had to change this to force the SRX to send the DPD messages at regular intervals. Here are the changes I made to fix the issues:

root@SRX-SITE-A# set security ike gateway gateway-SITE-B dead-peer-detection always-send
root@SRX-SITE-A# set security ipsec vpn vpn-SITE-B vpn-monitor optimized

After these changes were in place, I stopped experiencing the issue. Again, these had to be implemented on BOTH sides of the connection. These weren’t necessary on the tunnels in-between the SRX clusters on the older firmware version - so there may be some sort of bug between those and the newer firmware.

Issue #3 - VPN between SRX and CheckPoint duplicates IPSec SA on re-key (sometimes causes tunnel to stop passing traffic)

This issue was a complete mess - mostly because of the effort involved in trying to coordinate two separate vendors to work on an issue. New SRX clusters (on 15.1X49-D40.6 at the time) had been deployed and all of them had to connect back into our existing CheckPoint locations via IPsec tunnels. All was great, until about two weeks after installation we started seeing some weird tunnel drops. After some troubleshooting on my end, I discovered that watching what happened during the regularly scheduled re-key interval was helpful to see what was going on. Right at the eight hour re-key, the tunnels would try to re-establish but couldn’t - and sometimes this led to uni-directional traffic flows across the VPN.

The SRX tries to start a soft reset process prior to the re-key interval, so that it can gracefully migrate traffic to the new SPIs. However, something was happening that was causing the SRX to never terminate the old SPIs - so after a while the SRX would try to begin the soft reset process and fail because it had already reached its maximum SPIs for a given peer. Once the re-key interval was reached, the SRX would initiate the hard reset process on the tunnel. The CheckPoint side typically wouldn’t notice that anything was going on, and would keep sending traffic down the bad (expired) SPIs. A quick clear security ipsec sa and clear security ike sa would bring the tunnels back up.

I worked with some great guys on the Advanced JTAC team - but ultimately the SRX configuration and behavior seemed to be exactly what was expected. The only thing we couldn’t figure out is why the SRX was holding onto the old IPSec SPIs. So we opened a support case with CheckPoint to see what they had to say. After a few troubleshooting sessions and running a bunch of debugs, the CheckPoint engineer seemed to believe that the issue was on their side. All of our CheckPoint clusters were running R77.10 at the time, but we also tried upgrading to R77.30 which still experienced the issue.

Ultimately, the CheckPoint engineer pointed to SK97746, which describes interoperability issues caused by the way CheckPoint handles tunnel renegotiation with other vendors. Essentially, as soon as the Phase 1 IKE tunnel re-negotiates, the CheckPoint deletes the Phase 2 tunnel immediately (even during a tunnel soft reset). The SRX, meanwhile, believed the old tunnel was still valid and kept using it until the hard re-key time. However, the CheckPoint had already deleted the old tunnel - which caused the traffic drops. The fix is to use CheckPoint’s GUI DB editor tool to make the modifications listed in that SK article.

While the CheckPoint side seemed to be responsible, it’s still odd that the SRX was never clearing the old SPIs. It might be that it kept them open because the old tunnels were never gracefully closed with the CheckPoint.

So there you have it - I hope that these might help someone out who is currently banging their head against an SRX VPN issue. If you’ve run into similar issues, drop a comment below!