Last year we began migrating from our old Juniper SSG firewalls to the new SRX line. After a few months, I’ve honestly really started to enjoy working with them, so much so that we’ve decided to standardize our firewall platforms by ditching everything else. So far I’ve had the opportunity to install ten SRX 1500s, six SRX 345s, and one SRX 340. Some have been completely new installs for new locations and some have been migrations from other devices. But while most of the process has been surprisingly smooth, there is one thing that keeps coming back up: VPN issues. (Oh, and the fact that anything pre-15.1X49-D60 doesn’t support in-service software upgrades, but don’t get me started on that one…)
We run multiple locations around the world, and unfortunately we have to keep full mesh VPN connectivity due to the way our systems have been deployed. Today each SRX cluster has around 15 different VPN peers, made up of other SRXs, older SSGs, CheckPoint firewalls, Cisco ASAs, and WatchGuard firewalls. This is still an ongoing process, but I wanted to throw out some of the issues I’ve run into so far and what I’ve been able to do to fix or work around them.
Issue #1 – VPN is up, but no traffic is flowing across it
This one initially took me a minute to figure out. All of our tunnels are route-based, using secure tunnel interfaces. So each VPN is configured with a “set security ipsec vpn vpn_name bind-interface st0.x” command. I had a set of VPN tunnels between two locations that were not passing traffic, even though a “show security ipsec sa” showed the tunnels as established. For reference, here is what the config looked like:
root@SRX-SITE-A> show configuration security ike
respond-bad-spi 1;
proposal ike-aes256 {
    authentication-method pre-shared-keys;
    dh-group group2;
    authentication-algorithm sha-256;
    encryption-algorithm aes-256-cbc;
    lifetime-seconds 28800;
}
policy ikepolAES256 {
    mode main;
    proposals ike-aes256;
    pre-shared-key ascii-text xxxxxxxxx; ## SECRET-DATA
}
gateway gateway-siteB {
    ike-policy ikepolAES256;
    address XXX.XXX.XXX.XXX;
    no-nat-traversal;
    external-interface reth0.0;
}
root@SRX-SITE-A> show configuration security ipsec
proposal ipsec-aes256 {
    protocol esp;
    authentication-algorithm hmac-sha1-96;
    encryption-algorithm aes-256-cbc;
    lifetime-seconds 28800;
}
policy ipsecpolAES256 {
    perfect-forward-secrecy {
        keys group2;
    }
    proposals ipsec-aes256;
}
vpn vpn-to-SITE-B {
    bind-interface st0.1;
    df-bit clear;
    ike {
        gateway gateway-siteB;
        ipsec-policy ipsecpolAES256;
    }
    establish-tunnels immediately;
}
root@SRX-SITE-A> show configuration interfaces st0
unit 1 {
    description vpn-to-SITE-B;
}
The config on both sides was practically identical, but one missing piece was preventing the tunnel from passing traffic: under the st0 configuration, unit 1 (or whichever tunnel unit you’re using) needs “family inet” configured. Even though I’m using an unnumbered tunnel interface, the family still has to be present to tell the SRX that the interface carries IPv4 traffic. Quick fix, but it’s easy to miss.
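For reference, here is the one-liner that got traffic flowing again. I’m using unit 1 to match the config above; substitute whichever unit your VPN is bound to:

root@SRX-SITE-A# set interfaces st0 unit 1 family inet
root@SRX-SITE-A# commit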
Issue #2 – VPN drops every 2-4 hours and doesn’t re-establish for another 2-4 hours (or manual SA clearing)
The original SRXs that I installed were running JunOS 15.1X49-D40.6. I had at least half a dozen of these devices interconnected with full mesh VPNs and experienced no issues. However, when I picked up a new set of SRX 1500s a few months back, Juniper had just released 15.1X49-D70.3, so I upgraded before these were put into production. Strangely enough, when I began migrating tunnels to the new cluster, the VPNs to the remote SRXs started dropping sporadically. The first remote sites to be migrated were lower priority in terms of keeping connectivity established, so I took the opportunity to spend a little time figuring out what was going on.
The initial symptom was that the VPNs would establish, but only for about 2-4 hours. Then they would drop and not re-establish for another 2-4 hours. This seemed a bit weird to me, because the re-key interval was set to 8 hours, which means re-keying wasn’t playing into this. Even weirder, whenever the issue occurred, one of the two SRX clusters would always still show the IPsec tunnel as up, while the peer SRX would just keep logging errors about bad SPIs. Clear the stale IPsec security association, and the tunnels re-establish immediately.
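Comparing the two peers side by side made the stale SA easy to spot: one cluster would still list an active SA for the gateway while the other had nothing. Nothing fancy here, just the standard show commands run on each box:

root@SRX-SITE-A> show security ike security-associations
root@SRX-SITE-A> show security ipsec security-associations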
To resolve this, I had to configure both dead peer detection (DPD) and Juniper’s VPN monitoring on both sides of the connection, so that each SRX would more actively monitor the tunnel status. Per Juniper’s documentation, when DPD is configured without any options it runs in an ‘optimized’ mode, which only sends a DPD R-U-THERE message under certain conditions (essentially, when there is outbound traffic but nothing coming back from the peer). I had to change this to force the SRX to send the DPD messages at regular intervals regardless of traffic. Here are the changes I made to fix the issue:
root@SRX-SITE-A# set security ike gateway gateway-siteB dead-peer-detection always-send
root@SRX-SITE-A# set security ipsec vpn vpn-to-SITE-B vpn-monitor optimized
After these changes were in place, I stopped seeing the issue. Again, they had to be implemented on BOTH sides of the connection. None of this was necessary on the tunnels between the SRX clusters running the older firmware version, so there may be some sort of bug in play with the newer release.
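If you want to tune how aggressively the peers probe each other, DPD also takes interval and threshold options. The values below are simply the Junos defaults (a probe every 10 seconds, peer declared dead after 5 missed responses), shown here only as a starting point:

root@SRX-SITE-A# set security ike gateway gateway-siteB dead-peer-detection interval 10
root@SRX-SITE-A# set security ike gateway gateway-siteB dead-peer-detection threshold 5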
Issue #3 – VPN between SRX and CheckPoint duplicates IPSec SA on re-key (sometimes causes tunnel to stop passing traffic)
This issue was a complete mess, mostly because of the effort involved in coordinating two separate vendors on one problem. New SRX clusters (on 15.1X49-D40.6 at the time) had been deployed, and all of them had to connect back into our existing CheckPoint locations via IPsec tunnels. All was great until, about two weeks after installation, we started seeing some weird tunnel drops. After some troubleshooting on my end, I found the most useful thing to do was watch what happened at the regularly scheduled re-key interval. Right at the eight-hour re-key, the tunnels would try to re-establish but couldn’t, and sometimes this led to uni-directional traffic flows across the VPN.
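If you want to watch a re-key negotiation yourself, IKE traceoptions are the usual tool on the SRX. A minimal setup looks something like this (the file name ike-debug is my own choice, and you’ll want to remove the traceoptions when you’re done, since the file grows quickly):

root@SRX-SITE-A# set security ike traceoptions file ike-debug
root@SRX-SITE-A# set security ike traceoptions file size 10m
root@SRX-SITE-A# set security ike traceoptions flag all

Then follow along with “show log ike-debug” while the re-key timer expires.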
The SRX tries to start a soft reset prior to the re-key interval, so that it can gracefully migrate traffic to the new SPIs. However, something was causing the SRX to never terminate the old SPIs, so after a while the soft reset would fail because the SRX had already reached its maximum number of SPIs for a given peer. Once the re-key interval was reached, the SRX would initiate a hard reset on the tunnel. The CheckPoint side typically wouldn’t notice that anything was going on and would keep sending traffic down the bad (expired) SPIs. A quick “clear security ipsec sa” and “clear security ike sa” would bring the tunnels back up.
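As a side note, both clear commands can be scoped so you don’t tear down every tunnel on the box: IKE SAs can be cleared per peer address, and IPsec SAs per index (taken from the show output). The address and index below are placeholders:

root@SRX-SITE-A> clear security ike security-associations XXX.XXX.XXX.XXX
root@SRX-SITE-A> clear security ipsec security-associations index 131073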
I worked with some great guys on the Advanced JTAC team, but ultimately the SRX configuration and behavior appeared to be exactly what was expected. The only thing we couldn’t figure out was why the SRX was holding onto the old IPsec SPIs. So we opened a support case with CheckPoint to see what they had to say. After a few troubleshooting sessions and a bunch of debugs, the CheckPoint engineer came to believe the issue was on their side. All of our CheckPoint clusters were running R77.10 at the time; we also tried upgrading to R77.30, but the issue persisted.
Ultimately, the CheckPoint engineer pointed to SK97746, which states that CheckPoint has interoperability issues with other vendors due to the way it handles tunnel renegotiation. Essentially, as soon as the Phase 1 IKE tunnel re-negotiates, the CheckPoint immediately deletes the Phase 2 tunnel (even in the middle of a tunnel soft reset). The SRX believed the old Phase 2 tunnel was still valid and kept using it until the hard re-key time, but the CheckPoint had already deleted it, which caused the traffic drops. The fix is made with CheckPoint’s GUI DB editor tool (GuiDBedit), applying the modifications listed in the support article referenced above.
While the CheckPoint side seemed to be responsible, it’s still odd that the SRX never cleared the old SPIs. My best guess is that it kept them open because the old tunnels were never gracefully torn down by the CheckPoint.
So there you have it. I hope these notes help someone out who is currently banging their head against an SRX VPN issue. If you’ve run into similar problems, drop a comment below!