Troubleshooting on 0x2142 | Networking Nonsense

Old Computer Fails to Discover SSID nor Connect to Home Network [Solved]

Thu, 04 Mar 2021 10:28:00 +0000

The post below was contributed by guest author: Nicole Henry

Hey folks. So, we got a new Comcast xFi Gateway 3rd Generation at my parents’ house, and all the devices were able to discover our SSIDs & connect to the internet at both 2.4GHz & 5GHz …except my personal computer. My personal computer was listing networks not associated with our house, yet it wasn’t showing OUR networks. What gives?!

First off, System Information about my computer…

OS: Windows 10 Home
Manufacturer & Model: Dell Inspiron 5558
Wireless adapter: Intel(R) Dual Band Wireless-AC 3160
xFi Gateway model numbers: CGM4331COM or TG4482A

Back to the goods…

I restarted my computer twice. Turned on & off the wi-fi on my computer. Played around with network settings on my computer. Still no luck.

So of course I headed to Twitter and asked for assistance. You can read the tweet and read the replies here. I got many different tips from people: update the wi-fi drivers, buy a USB Wi-fi Dongle, etc. But my friend Pat ( @Battle_Nerd_1 ) messaged me and suggested that I look into what wi-fi standards (802.11 b,a,g,n,ac) are associated with my wi-fi adapter.

Let’s walk through how to do that, why don’t we?

1. Look at the Advanced properties of the wifi adapter. (I have a Windows 10 device)

Go to Start -> Device Manager -> Network Adapters -> Right click the adapter name -> click Properties -> then Advanced

The Advanced network properties of the computer’s wi-fi adapter are what we need to review. Notice the “Property” field: the wi-fi standard displays 802.11b/g for my computer. This is an important detail.

2. Compare & Change the settings

(I use modem and router interchangeably. That’s probably not good practice, but oh well lol)

Log in to your router (*** I’ll add instructions below on how to do this) & start clicking around until you find the Wi-fi Mode settings. Here are the settings for our router at 2.4GHz.

From the same page, here are the Mode options of our router:

3. Compare & Change the settings

Remember, the wi-fi standard for my computer’s wi-fi adapter is 802.11b/g. Initially the modem’s Wi-fi Mode was set to 802.11 g/n/ax, then Pat told me to change it to 802.11 g/n because the new ax standard (Wi-Fi 6) isn’t supported by a lot of wireless drivers. Once I changed the Mode to 802.11 g/n, IMMEDIATELY my computer recognized the 2.4GHz network and connected!! Yay Pat!!

How to Log in to your Router/Modem

(These instructions are for Windows devices)

1. Find your Gateway/Router/Modem’s IP address

Go to Start and type cmd for Command Prompt.
Type ipconfig, then hit Enter
Scroll down to the Wi-fi section, & take note of the default gateway IP address.
This number will most likely look like 10.x.x.x or 192.168.x.x
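If you would rather not scroll through the output by hand, the lookup can be scripted. This is only a rough sketch and not part of the original walkthrough: it assumes the English-language ipconfig layout, and the sample text below is a made-up excerpt (the adapter name and addresses are hypothetical). It also only looks at the line that carries the "Default Gateway" label, so a gateway printed on a following line would be missed.

```python
import re

# Hypothetical excerpt of `ipconfig` output; adapter name and addresses are made up.
SAMPLE_IPCONFIG = """\
Windows IP Configuration

Ethernet adapter Ethernet:

   Media State . . . . . . . . . . . : Media disconnected

Wireless LAN adapter Wi-Fi:

   IPv4 Address. . . . . . . . . . . : 10.0.0.25
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 10.0.0.1
"""


def default_gateways(ipconfig_text):
    """Map each adapter heading to its IPv4 default gateway, if one is listed."""
    gateways = {}
    adapter = None
    for line in ipconfig_text.splitlines():
        # Adapter headings start in the first column and end with a colon.
        if line and not line[0].isspace() and line.rstrip().endswith(":"):
            adapter = line.rstrip().rstrip(":")
            continue
        match = re.search(r"Default Gateway[ .]*:\s*(\d+\.\d+\.\d+\.\d+)", line)
        if match and adapter:
            gateways[adapter] = match.group(1)
    return gateways


if __name__ == "__main__":
    print(default_gateways(SAMPLE_IPCONFIG))
```

On a real Windows machine you could feed it live output instead of the sample, for example with subprocess.run(["ipconfig"], capture_output=True, text=True).stdout.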

2. Log into your Router/Modem

Next open up a web browser and type in the aforementioned IP address, then hit Enter.

When prompted for a username and password, try Admin for the username & Password for the password. Also, it’s good practice to change the default password, so definitely do that :)

So now you know how to find the IP address, log into, & change the password of your Gateway/Router/Modem. Yay you! Now you can go back to Step 2 and learn how to view your Router/Modem’s wi-fi settings.

So thanks to my Twitter friend, Pat, for walking me through how to find info about my computer’s wi-fi adapter and how to change the modem/router’s wi-fi settings; also he saved me from having to spend money on a new computer! In the 5GHz Mode settings, there is no 802.11 b/g option, therefore my computer can only connect at the 2.4GHz frequency, which is not an issue for me. I can now connect to a home network & have internet access on my personal computer.

Autonegotiation issues on Nexus QSFP Ports

Tue, 20 Feb 2018 08:00:14 +0000

Over the past two years we have made a ton of progress shifting datacenter infrastructure from 1G to 10G+. A majority of this has been through a vendor migration back to Cisco for switching - and specifically using the Nexus 9372 line. These boxes come with 48 ports of 1G/10G SFP+ and another 6 QSFP ports that hit 40G.

Late last year we placed an order to expand our 10G+ coverage in one of our larger datacenters. After meeting with our local Cisco reps and talking through options, we settled on a pair of Nexus 93180YC-EX switches. The new toys offer additional flexibility, by providing 48 SFP+ ports capable of 1G/10G/25G and the 6 QSFP ports are 40/100G.

A week or two ago we worked during a planned maintenance window to try and bring the new 93180s online. The new switches are just directly connected back to the 9372s using four QSFP-40G-CR4 cables. The time comes, we turn up the ports - and they don’t come up. We know the cable types definitely work, since we’re using them for all of our current interconnects between the 9372s. Unfortunately, due to tight timelines on maintenance windows - we have to turn down the ports and move on to other tasks.

So we go down the normal line of troubleshooting. Reseat cables - still nothing. Remove port-channel/VPC configurations - nothing. Test the QSFP cables by cabling in between just the new 93180s - yeah, ports come up and the cables are good. One of my teammates, who is running with this task, is almost at the point of opening up a support case with TAC. I double checked the switch port configurations - but everything looks good. My first thought was that maybe there is a speed/autonegotiation issue - since the QSFP ports on the 9372s are fixed 40G, while the 93180s are 40/100G.

We scheduled another quick no-downtime maintenance window to test out the theory. Each of the ports on both sides of the connection gets the following configuration changes:

Switch(config)# interface x/x
Switch(config-if)# no negotiate auto
Switch(config-if)# duplex full
Switch(config-if)# speed 40000

The time comes - and sure enough the ports come online.

Just wanted to throw this out there in case anyone else runs across the same problem. The fix is surely easy enough, but you don’t always think of autonegotiation issues - especially in such a simplistic configuration as this.

I also wanted to say thanks to the great people in the #CiscoChampions DataCenter group. I was able to run the problem through them, and they suggested the same potential root cause. It’s always great to have a second opinion to provide some confidence, especially when there are strict time constraints for maintenance.

Devil in the Defaults

Tue, 03 Oct 2017 08:00:28 +0000

Default settings are the worst. Every system has them, and they’re great until they’re not. For whatever reasons in the past, my predecessors decided to purchase a bunch of bare-bones HP servers and install Check Point’s firewall software on them. The HP servers were significantly cheaper than buying Check Point’s branded appliances, but unfortunately they come with a different set of risks. For example, you have to work on estimating max throughput yourself, rather than knowing exactly what the appliance is rated for.

Over the past few weeks, we have been lightly troubleshooting an issue between a VMware vCenter server and the ESX hosts that it manages. ESX hosts were randomly showing up as disconnected for a brief moment, then would reconnect. It was nothing extremely impacting, but a mild annoyance for the server guys. A couple of people on my team had taken a quick look on the network side, and turned up empty handed. Due to some upcoming maintenance work the server team needed to perform, I was asked to spend some time trying to isolate the root cause of this issue.

First thing was digging through the logs from the two different sets of firewalls between these systems. The first firewall set showed that traffic was passing normally as I would expect. However, I started seeing some unexpected logs for the second firewall set, a CheckPoint cluster. The logs showed that vCenter was opening connections out to the ESX hosts for a short while, then the CheckPoint would log a “TCP Packet out of state” error. The details of this log would show that vCenter sent a non-SYN packet to the ESX host (usually a PSH ACK).

Seeing an error like that indicates that something is killing the TCP connection before vCenter is finished using it. vCenter still believes that the connection is open, which is why it sends packets with incorrect flags. Since we were already aware that this particular CheckPoint cluster has some issues, we began examining this cluster first. Sure enough, the IPS logs on the device showed that the cluster was often reaching >80% of its maximum concurrent connections and then enabling the “Aggressive Aging” feature.

Aggressive Aging is a CheckPoint protection which prevents the cluster from running out of memory and potentially crashing. By default, this is set to take effect whenever the cluster exceeds 80% of its available memory or concurrent connections. This protection will continue to be enabled until the cluster drops below a second threshold, which is 78% by default. Seems like a helpful feature to have, right? Yeah - but there are some considerations with how this protection works. When Aggressive Aging is activated, the cluster significantly reduces all of the normal TCP timeout values. For example, CheckPoint’s documentation shows that new TCP sessions are given only 5 seconds to establish, instead of the normal 25 seconds. This also changes how long a TCP session can be open from 1 hour to 10 minutes. In order to help drop below the 78% threshold, Aggressive Aging will evaluate and terminate 10 connections for every individual new connection that is established.
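To make the two thresholds a bit more concrete, here is a minimal sketch of the on/off behavior described above. It is a simplified model I put together for illustration, not CheckPoint's implementation: the function names are hypothetical, it only looks at concurrent connections (not memory), and the timeout table just restates the values quoted in this post.

```python
ENABLE_ABOVE = 0.80   # default activation threshold quoted above
DISABLE_BELOW = 0.78  # default deactivation threshold quoted above

# (normal, aggressive) timeouts in seconds, as quoted in this post.
TIMEOUTS = {
    "tcp_start": (25, 5),
    "tcp_session": (3600, 600),
}


def next_state(active, connections, limit):
    """Return whether Aggressive Aging is active after seeing this sample."""
    usage = connections / limit
    if not active and usage > ENABLE_ABOVE:
        return True
    if active and usage < DISABLE_BELOW:
        return False
    return active  # between the thresholds, the previous state is kept


def simulate(samples, limit):
    """Walk connection counts in order and record the state after each one."""
    states = []
    active = False
    for count in samples:
        active = next_state(active, count, limit)
        states.append(active)
    return states


if __name__ == "__main__":
    # With a 25,000 limit: 76%, 82%, 79.2%, 77.6%, 84% utilization.
    print(simulate([19000, 20500, 19800, 19400, 21000], limit=25000))
```

The gap between the two thresholds is what keeps the protection from flapping on and off when utilization hovers right around 80%.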

As I stated previously, this cluster was already pretty busy - mostly hitting CPU limits. However, through the brief research I completed, it looks like increasing the concurrent connections table affects RAM utilization more than anything else. This system has over 20GB of RAM and is typically only using around 4GB. I was still concerned that an increase in total concurrent connections could mean more CPU usage, because that means more connections for the IPS to process. Unfortunately, CheckPoint has no publicly available utilities to help calculate what to set your max concurrent connection limit to. In fact, when I opened a support ticket with them, I was told to “just keep increasing it, until you hit a point where the cluster is no longer triggering Aggressive Aging. Then add about 10-20k above that to set the new maximum concurrent connection limit”. That’s not really an acceptable answer to me, but I wasn’t able to get anything more out of them.

So in order to change the maximum concurrent connections (Using R77.xx), you need to open SmartDashboard and open the cluster object. Then find Optimizations in the left-hand menu. Here you can set a new manually-defined limit, or allow the cluster to automatically scale the maximum connections. If this cluster was significantly less busy, I might be tempted to enable the automatic limit for a bit and try to get a baseline. However, I would rather not open myself up to the chance of crashing the cluster - so I manually increased the limit from 25,000 to 50,000. Install the policy for the configuration to take effect. You can see the current concurrent connections by either looking at the Overview page in SmartDashboard, or logging into the cluster CLI and using the cpview utility.

In my case - the new connections almost immediately started ramping up to ~35,000. Within a day we started encountering the Aggressive Aging protection again, but it was happening significantly less often than before. This also resolved our ESX host disconnection problem, which proved my theory that the Aggressive Aging feature was causing our problem. I’ve been slowly monitoring and increasing the concurrent connections limit since, and I think we have finally stabilized around 90,000. Just think of how many connections were denied or terminated early because this limit was in place!
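The numbers above are easier to reason about with the 80% threshold applied to them. The sketch below is just arithmetic on the figures from this post, assuming the default 80% activation threshold; the function names are my own and it is not a sizing tool.

```python
ENABLE_AT_PERCENT = 80  # default Aggressive Aging activation threshold


def trigger_point(limit):
    """Connection count above which Aggressive Aging would activate."""
    return limit * ENABLE_AT_PERCENT // 100


def minimum_limit(peak_connections):
    """Smallest limit that keeps the given peak at or under the threshold."""
    # Ceiling division using integers only, to avoid float rounding surprises.
    return -(-peak_connections * 100 // ENABLE_AT_PERCENT)


if __name__ == "__main__":
    for limit in (25000, 50000, 90000):
        print(limit, "->", trigger_point(limit))
    print("limit needed for a 35,000 peak:", minimum_limit(35000))
```

So the original 25,000 limit meant the protection kicked in at only 20,000 connections, and even the 50,000 limit leaves just 5,000 connections of room above a ~35,000 baseline before the timeouts get cut.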

Moral of the story here: Understand the systems that you own. This firewall cluster had been in place years before I was hired, and all of the settings were left at their defaults. Default settings probably work for most cases, but they also come with their own problems. This setting had likely been the cause of multiple problems in the past, however no one truly understood the system enough to find out what was happening. Ever have a scenario where a default setting caused problems? Share it in the comments!

SRX High CPU: httpd

Tue, 05 Sep 2017 08:00:12 +0000

Over the past few years of my Juniper SRX adventures, I’ve run into a few cases where the Routing Engine (RE) CPU is pegged at 100%. From what I’ve seen so far, this is typically one of three causes: high traffic (spike in IPS inspection), logging using event mode, or a stuck web management session.

In a few cases, the CPU issue doesn’t resolve itself and someone needs to manually investigate the cause. Luckily, the httpd issue is pretty easy to spot and fix - so I wanted to cover that briefly today. This issue can crop up randomly after someone uses the JWeb GUI to administer an SRX firewall. You could avoid this issue entirely by disabling the web interface - but that’s not always possible.

So the first thing we want to do is log into our SRX firewall and check the current CPU utilization for our RE processor:

{primary:node0}
root@test-srx> show chassis routing-engine node 0 
node0:
--------------------------------------------------------------------------
Routing Engine status:
    Temperature                  41 degrees C / 105 degrees F
    CPU temperature              70 degrees C / 158 degrees F
    Total memory               4096 MB Max 1556 MB used ( 38 percent)
      Control plane memory     2976 MB Max 804 MB used ( 27 percent)
      Data plane memory        1120 MB Max 773 MB used ( 69 percent)
    5 sec CPU utilization:
      User                       41 percent
      Background                  0 percent
      Kernel                     59 percent
      Interrupt                   0 percent
      Idle                        0 percent
    Model                           RE-SRX345
    Serial ID                       XX1000XX0002
    Start time                      2016-09-01 02:49:50 UTC
    Uptime                          351 days, 13 hours, 28 minutes, 47 seconds
    Last reboot reason              0x1:power cycle/failure
    Load averages:                  1 minute   5 minute   15 minute
                                        1.29       1.27        1.10

So we can see that over the past 5 seconds, there is 0% idle CPU - It’s all split between User and Kernel. Some higher-end SRX models will also show utilization for 1 minute, 5 minutes, and 15 minutes.

Next, we want to confirm which process is consuming that CPU:

{primary:node0}
root@test-srx> show system processes extensive node 0
node0:
--------------------------------------------------------------------------
last pid: 25330;  load averages:  1.16,  1.24,  1.10  up 351+13:29:51    16:19:11
165 processes: 21 running, 132 sleeping, 12 waiting

Mem: 354M Active, 191M Inact, 1253M Wired, 585M Cache, 112M Buf, 1595M Free
Swap:


  PID USERNAME     THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 1635 root           7  76    0  1192M   113M RUN    0    ??? 281.93% flowd_octeon_hm
14607 nobody         3  76    0 14848K  6308K ucondt 0  25:03 83.45% httpd
   21 root           1 171   52     0K    16K RUN    0 6952.9  0.00% idle: cpu0
 1679 root           1  76    0 48580K 24476K select 0  90.2H  0.00% mib2d
 1715 root           1  76    0 35264K 19520K select 0  49.0H  0.00% snmpd
   23 root           1 -20 -139     0K    16K RUN    0  29.9H  0.00% swi7: clock
 1681 root           1   4    0   101M 68284K kqread 0  28.0H  0.00% rpd
   22 root           1 -40 -159     0K    16K WAIT   0  26.0H  0.00% swi2: netisr 0
  <-- Output Truncated -->

In this case it’s pretty clear that httpd is the top offender for CPU usage. You might also notice the process named ‘flowd_octeon_hm’. This is part of the firewall processes, so don’t be surprised if this process is also one of the top. It’s pretty normal for this process to show >100% CPU, so this is safe to ignore. If you see eventd as a top consumer, then you might have your logging configured to use event mode rather than stream mode - which I’ll cover in another post.
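If you capture this output regularly, picking out the offender can be automated. Here is a rough sketch, assuming the 12-column layout shown above; the rows are copied from that output, while the 50% cutoff and the idea of skipping flowd_octeon_hm are my own choices for illustration.

```python
# Header and process rows copied from the output above.
SAMPLE_ROWS = """\
  PID USERNAME     THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 1635 root           7  76    0  1192M   113M RUN    0    ??? 281.93% flowd_octeon_hm
14607 nobody         3  76    0 14848K  6308K ucondt 0  25:03 83.45% httpd
   21 root           1 171   52     0K    16K RUN    0 6952.9  0.00% idle: cpu0
 1679 root           1  76    0 48580K 24476K select 0  90.2H  0.00% mib2d
"""

# Data-plane process that normally shows more than 100% CPU.
EXPECTED_BUSY = {"flowd_octeon_hm"}


def top_offenders(rows, threshold=50.0):
    """Return (command, wcpu) pairs at or above the threshold, busiest first."""
    offenders = []
    for line in rows.splitlines():
        # Split into at most 12 fields so commands with spaces stay intact.
        fields = line.split(None, 11)
        if len(fields) < 12 or not fields[10].endswith("%"):
            continue  # header, blank, or unrecognized line
        command = fields[11].strip()
        wcpu = float(fields[10].rstrip("%"))
        if command not in EXPECTED_BUSY and wcpu >= threshold:
            offenders.append((command, wcpu))
    return sorted(offenders, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(top_offenders(SAMPLE_ROWS))
```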

So how do we fix the httpd problem? Reboot the SRX? Well, yeah that would probably fix it - but there is an easier way:

{primary:node0}
root@test-srx> restart web-management
Web management gatekeeper process started, pid 25343

One quick command and we’ve restarted all of the web management processes, including httpd. So now you’ll want to give the SRX a few seconds to recover itself - then run the show chassis routing-engine command again:

{primary:node0}
root@test-srx> show chassis routing-engine node 0
node0:
--------------------------------------------------------------------------
Routing Engine status:
    Temperature                 41 degrees C / 105 degrees F
    CPU temperature             69 degrees C / 156 degrees F
    Total memory              4096 MB Max  1556 MB used ( 38 percent)
      Control plane memory    2976 MB Max   804 MB used ( 27 percent)
      Data plane memory       1120 MB Max   773 MB used ( 69 percent)
    5 sec CPU utilization:
      User                       6 percent
      Background                 0 percent
      Kernel                     3 percent
      Interrupt                  0 percent
      Idle                      91 percent
    Model                          RE-SRX345
    Serial ID                      XX1000XX0002
    Start time                     2016-09-01 02:49:50 UTC
    Uptime                         351 days, 13 hours, 32 minutes, 52 seconds
    Last reboot reason             0x1:power cycle/failure
    Load averages:                 1 minute   5 minute  15 minute
                                       0.35       0.99       1.04

Looks much better, with 91% idle CPU!

Even though this issue can be annoying, it’s a quick fix. I recommend that you perform some sort of CPU monitoring/alerting on your SRX clusters (I use Observium for this). This will help you identify the issue quickly and get it resolved. If this issue is left unchecked, it can sometimes cause some latency and performance issues.
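If you don't have a monitoring platform handy, even a small script that checks the idle percentage can serve as a stopgap. This is a minimal sketch under a few assumptions: the text comes from show chassis routing-engine in the format shown above, only the first Idle line is read, and the 10% alert threshold is an arbitrary value I picked for the example.

```python
import re

# CPU sections copied from the before/after outputs in this post.
BEFORE = """\
    5 sec CPU utilization:
      User                       41 percent
      Background                  0 percent
      Kernel                     59 percent
      Interrupt                   0 percent
      Idle                        0 percent
"""

AFTER = """\
    5 sec CPU utilization:
      User                       6 percent
      Background                 0 percent
      Kernel                     3 percent
      Interrupt                  0 percent
      Idle                      91 percent
"""


def idle_percent(output):
    """Return the first reported idle percentage, or None if not found."""
    match = re.search(r"^\s*Idle\s+(\d+)\s+percent", output, re.MULTILINE)
    return int(match.group(1)) if match else None


def needs_attention(output, min_idle=10):
    """True when idle CPU was found and is below the chosen threshold."""
    idle = idle_percent(output)
    return idle is not None and idle < min_idle


if __name__ == "__main__":
    print(idle_percent(BEFORE), needs_attention(BEFORE))
    print(idle_percent(AFTER), needs_attention(AFTER))
```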

Hope this helps!

Odd Behavior of Protected Switchports

Tue, 22 Aug 2017 08:00:29 +0000

I ran into an interesting issue recently, which was caused by use of the switchport protected command. So I use a pair of Cisco 2960-8TC-L switches at home, for both my home network and lab. A few months back I ran a bunch of ethernet cabling within my house, which all terminated in a patch panel in the basement. I was able to migrate/consolidate enough of my ports so that I could dedicate one of the 2960s to the patch panel. I had eight ports on my switch, and eight ethernet drops in my house - one of which ran back to my lab network for internet.

Usually when I configure something like this, I want to try and take security into consideration as much as possible. I have a Synology NAS on my home network, which contains enough of my personal backups that I would want to keep this inaccessible from a typical house-guest. So by default, I made the following configuration standards on the ports connected to my patch panel:

Any unconnected ports were added to my guest VLAN (which only has internet access)
Any ports that needed to be in my home VLAN were configured with port security, sticky MAC, and maximum 1 MAC allowed
All ports were configured as switchport protected (except the uplink)

The concept of protected switchports should be fairly simple: Any port configured with switchport protected is not permitted to communicate with any other port configured with switchport protected. A protected switchport is only permitted to communicate with a non-protected port (in this case, my uplink/trunk to my other 2960). I added this mostly as a safeguard against a potentially malicious house-guest.
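The rule is simple enough to write down as a truth table. The sketch below only models the single-switch behavior described above, and the port names are hypothetical labels rather than real interface names.

```python
# Hypothetical port roles on the patch panel switch: True means
# the port is configured with "switchport protected".
PORTS = {
    "camera": True,
    "guest": True,
    "uplink": False,
}


def forwarding_allowed(src, dst, ports=PORTS):
    """Traffic is blocked only when both ports on the same switch are protected."""
    return not (ports[src] and ports[dst])


if __name__ == "__main__":
    for src, dst in [("camera", "guest"), ("camera", "uplink"), ("uplink", "guest")]:
        print(src, "->", dst, forwarding_allowed(src, dst))
```

Going by this rule alone, anything reached through the unprotected uplink should be allowed.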

However, once I actually began to use my patch panel ports, I began to experience a very interesting issue. For example, I purchased a home security camera which by default used a wireless connection. The location of the camera unfortunately made the wifi connection a bit more unreliable than I would like for a security camera. So I went ahead and ran a cable to the nearest ethernet drop.

The IP camera uses my Synology NAS as a backend storage for any recordings. It was able to connect and stream video on the wireless connection, but the video was choppy. Once I plugged in the ethernet cable the connection actually got worse than it already was. From my Synology - the camera would become unresponsive for a while, then you could reach it again for a few moments, then back to unresponsive (about 60-70% packet loss). If I disabled the wireless NIC entirely, the camera would be completely unresponsive. However, this whole time I was able to reach the camera with no issues from my laptop which was connected via wireless (my AP uses the same switch as the Synology).

The NAS is connected to the 2960 in my lab, which is connected to the patch-panel-2960 via a single trunk port. From the Synology, I could still see ARP entries for my wired camera - I just couldn’t reach it via ping or http. I spent a good hour or two trying a number of things: clearing ARP entries, double checking my trunk port configs, and I also upgraded the firmware on the IP camera. Nothing seemed to work. It made even less sense that anything else connected to the Synology-side switch could hit the camera with no problems.

It’s worth noting that no ports on the Synology-side 2960 were configured with switchport protected - only the ports on the patch panel side. So I finally tried removing the switchport protected command off of the IP camera port - and magically it all started working.

The protected switchport config worked exactly as I would have expected for traffic between ports on the same switch - however, it seemed to act against what I would have expected once it crossed a trunk to another switch. It was especially odd that it only seemed to crop up between the Synology and the IP camera. Oh well, I guess the only way you really learn something is by breaking it, right? I hope this might help someone who finds themselves in a similar situation.

Tracking Latency and Packet Loss with SmokePing

Tue, 25 Apr 2017 08:53:04 +0000

“The network is slow” - Sound like something you’ve heard before? What does ‘slow’ mean anyway? And is it different from yesterday? Sometimes tracking down network ‘slowness’ can be pretty difficult, especially when you don’t have a good baseline of what is normal. This kind of goes back to one of the tips I shared earlier in ‘A Little Bit of Magic’ - having a baseline and understanding of what is normal on your network will help you find issues much more quickly.

When I started working for a cloud service provider a few years ago, the first thing that started coming up extremely often was network latency and performance issues. These are things I never had to worry too much about previously, as most of my jobs had been with enterprise environments where everyone is on the same LAN (or at least within one state). However, when you get into hosting a Software-as-a-Service cloud on a global scale, then slight performance issues begin to mean big slowdowns for your customers.

I was amazed at the current network infrastructure monitoring that was in place when I began working for the SaaS provider: A few bare-bones Cacti instances, completely unmanaged by anyone, and not configured to monitor any relevant ports or data. Today that situation is vastly different - I have installed a few different applications that allow us to get alerted on network variances and quickly determine exactly where the issue is. One of the tools that has helped us get to this point is called SmokePing, which I would like to talk about today.

Setup and Installation

I won’t get into the details of installing SmokePing, as there are already a number of good tutorials out there (like this one or this one). If you have a decent familiarity with Linux, then the process should be fairly straightforward. Keep in mind that your SmokePing graphs will show latency and packet loss between the machine you have SmokePing installed on and the targets you define. So make sure that you plan out where you deploy your SmokePing machine(s) to provide beneficial information.

Once you have SmokePing installed and set up, it’s time to start defining targets to monitor. We have over a dozen points of presence globally, so I’ve installed SmokePing on a single machine in each location. Each instance has ping targets defined for every network segment within its own datacenter, network segments in every other datacenter, and some public IP space of every datacenter. So we accomplish latency and packet loss monitoring within the datacenter, across the site-to-site VPNs between each datacenter, and the general internet connections between each datacenter. For certain customers, particularly those who have dedicated MPLS circuits to us, we are also monitoring latency/packet loss to customer endpoints.

SmokePing also supports deployment in a controller/worker configuration, where you have a single primary configuration/management point and several workers to perform testing. I really want to test this out for our environment, but I haven’t quite had the time to dedicate to it. If you’re interested though, you can find the details on that here.

Interpreting the graphs

The graphs created by SmokePing might not seem clear the first time you see them. For example, take a look at this:

This graph is the result of a standard latency test - 20 pings every 5 minutes. So for every step on the graph, SmokePing draws out the range of responses in those 20 pings - shown by the gray ‘smoke’. The darker the gray area, the more pings came back with that response time - and similarly the lighter areas mean that fewer pings had that response time. The solid colored part of the line marks the median response across all 20 pings, and its color gives an indication of the percentage of packets lost.
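To show what goes into one step of the graph, here is a small sketch that summarizes a single round of 20 pings. It is my own illustration and not SmokePing's code: the sample values are invented, and it uses the median for the central value because, as far as I understand it, that is what the colored line plots.

```python
from statistics import median


def summarize_round(rtts_ms, probes=20):
    """Summarize one round; rtts_ms holds only the replies that came back."""
    received = len(rtts_ms)
    loss_pct = 100.0 * (probes - received) / probes
    return {
        "median_ms": median(rtts_ms) if rtts_ms else None,
        "min_ms": min(rtts_ms) if rtts_ms else None,  # bottom of the smoke
        "max_ms": max(rtts_ms) if rtts_ms else None,  # top of the smoke
        "loss_pct": loss_pct,
    }


if __name__ == "__main__":
    # Invented round: 18 of 20 pings answered, a few of them very slow.
    sample = [15] * 10 + [20] * 4 + [200] * 4
    print(summarize_round(sample))
```

In this invented round the median stays at 15ms even though four replies took 200ms, which is why the smoke above the line matters as much as the line itself.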

So the first thing I would notice about this graph is that the average response time is varying quite significantly between about 15ms and 200ms. In a normal healthy network, you should not expect to see such a drastic change in response times like that - some variation is normal, but not to this extreme. Two other things to note from this graph: The time of each latency jump seems to line up almost every 30 minutes, and towards the end we begin seeing some slight packet loss.

After being informed that there was a performance issue between a few different systems, I opened up SmokePing immediately to start looking for anything that jumped out - like the graph above. In this case, this was a 200Mb dedicated MPLS circuit used only for replication traffic between data centers. Every 30 minutes, a replication job was kicking off and saturating the line for a few minutes - which in turn was causing excessive jumps in latency and some minor packet loss.

As another example:

The first thing you probably notice about the graph above is the sudden stabilization of latency. This graph monitors traffic between two data centers over an IPsec VPN tunnel - and we happened to be suspecting that one of the two peer firewalls was having performance issues. We swapped out to new hardware on one side of the connection, and the latency immediately started flat-lining. A consistent 85ms is way better than averaging anywhere from 90-180ms. (And if you happened to notice the slight packet loss after the new device was implemented - that was actually due to an unrelated upstream provider issue). My point with this graph is really just to show how helpful it is to have the historical data available. It would have been extremely difficult to prove that the one firewall was the root cause of our problems if I didn’t have a way to track the issue.

So that’s a bit about SmokePing and how I’ve deployed it within a cloud provider’s environment. It’s only been up and running for a few months, but I’ve already found it to be extremely helpful in troubleshooting performance and latency issues. SmokePing is also extensible via scripting, which can help to collect additional data at the time of an issue. I’ve written a few quick scripts to run extended traceroutes during packet loss events, which I might post up here in the future.

Have you installed SmokePing in your environment? How do you use it? Has it helped you with performance issues?

Comment below!

Juniper SRX VPN Issues

Tue, 18 Apr 2017 08:00:02 +0000

Last year we began migrating from our old Juniper SSG firewalls to the new SRX line. After a few months, I’ve honestly really started to enjoy working with them - so much that we’ve decided to start standardizing our firewall platforms by ditching everything else. So far I’ve had the opportunity to install ten SRX 1500s, six SRX 345s, and one SRX 340. Some have been completely new installs for a new location and some have been migrations from other devices. But while most of the process has been surprisingly smooth - there is one thing that keeps coming back up: VPN issues. (Oh, and the fact that pre-15.1X49-D60 doesn’t support In-service-upgrades - but don’t get me started on that one…)

We run multiple locations around the world, and unfortunately have to keep full mesh VPN connectivity due to the way our systems have been deployed. Today each SRX cluster has around 15 different VPN peers, which are made up of other SRXs, older SSGs, CheckPoint firewalls, Cisco ASAs, and Watchguard firewalls. This is still an on-going process - but I wanted to throw out some of the issues I’ve run into so far, and what I’ve been able to do to fix them or work around them.

Issue #1 - VPN is up, but no traffic is flowing across it

This one initially took me a minute to figure out. All of our tunnels are route-based, using secure tunnel interfaces. So each VPN is configured with a set security ipsec vpn vpn_name bind-interface st0.x command. I had a set of VPN tunnels between two locations that were not passing traffic, even though a show security ipsec sa showed the tunnels as established. For reference, here is what the config looked like:

root@SRX-SITE-A> show configuration security ike
respond-bad-spi 1;
proposal ike-aes256 {
 authentication-method pre-shared-keys;
 dh-group group2;
 authentication-algorithm sha-256;
 encryption-algorithm aes-256-cbc;
 lifetime-seconds 28800;
}
policy ikepolAES256 {
 mode main;
 proposals ike-aes256;
 pre-shared-key ascii-text xxxxxxxxx; ## SECRET-DATA
}
gateway gateway-siteB {
 ike-policy ikepolAES256;
 address XXX.XXX.XXX.XXX;
 no-nat-traversal;
 external-interface reth0.0;
}

root@SRX-SITE-A> show configuration security ipsec
proposal ipsec-aes256 {
 protocol esp;
 authentication-algorithm hmac-sha1-96;
 encryption-algorithm aes-256-cbc;
 lifetime-seconds 28800;
}
policy ipsecpolAES256 {
 perfect-forward-secrecy {
 keys group2;
 }
 proposals ipsec-aes256;
}
vpn vpn-to-SITE-B {
 bind-interface st0.1;
 df-bit clear;
 ike {
 gateway gateway-siteB;
 ipsec-policy ipsecpolAES256;
 }
 establish-tunnels immediately;
}

root@SRX-SITE-A> show configuration interfaces st0
unit 1 {
 description vpn-to-SITE-B;
}

The config on both sides practically matched, but there was one thing missing that was preventing the tunnel from passing traffic. Under the st0 configuration, unit 1 (or whichever tunnel interface you might be using) needs to have family inet configured, for example with set interfaces st0 unit 1 family inet. Even though I’m using an unnumbered tunnel interface, this command still needs to exist to tell the SRX that the interface is used for IPv4 traffic. Quick fix, but it’s easy to miss.
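With a dozen or more tunnels per cluster, this is the kind of omission a quick script can catch. The sketch below assumes you have the configuration in set format (as produced by show configuration | display set); the sample lines are a hypothetical excerpt modeled on the config above, and vpn-to-SITE-C is made up purely to show a correctly configured unit.

```python
import re

# Hypothetical set-format excerpt; vpn-to-SITE-C is invented for contrast.
SAMPLE_SET_CONFIG = """\
set security ipsec vpn vpn-to-SITE-B bind-interface st0.1
set security ipsec vpn vpn-to-SITE-C bind-interface st0.2
set interfaces st0 unit 1 description vpn-to-SITE-B
set interfaces st0 unit 2 family inet
"""


def units_missing_family_inet(set_config):
    """Return {interface: vpn name} for bound st0 units without family inet."""
    bound = {}        # unit number -> VPN name
    has_inet = set()  # unit numbers that have family inet configured
    for line in set_config.splitlines():
        match = re.match(r"set security ipsec vpn (\S+) bind-interface st0\.(\d+)", line)
        if match:
            bound[match.group(2)] = match.group(1)
            continue
        # \b keeps "family inet6" from being counted as IPv4.
        match = re.match(r"set interfaces st0 unit (\d+) family inet\b", line)
        if match:
            has_inet.add(match.group(1))
    return {f"st0.{unit}": vpn for unit, vpn in bound.items() if unit not in has_inet}


if __name__ == "__main__":
    print(units_missing_family_inet(SAMPLE_SET_CONFIG))
```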

Issue #2 - VPN drops every 2-4 hours and doesn’t re-establish for another 2-4 hours (or manual SA clearing)

The original SRXs that I installed were running Junos OS 15.1X49-D40.6. I had at least half a dozen of these devices interconnected with full mesh VPNs, and experienced no issues. However, when I picked up a new set of SRX1500s a few months back, Juniper had just released 15.1X49-D70.3 - so I upgraded before these were put into production. Strangely enough, when I began migrating tunnels to the new cluster we started to see the VPNs to remote SRXs drop sporadically. The first remote sites to migrate were a lower priority for keeping connectivity established, so I took this opportunity to spend a little time figuring out what was going on.

The initial issue seemed to be that the VPNs would establish, but only for about 2-4 hours. Then they would drop and not re-establish for another 2-4 hours. This seemed a bit weird to me, because the re-key interval was set for 8 hours - which means that re-key wasn’t playing into this. Even weirder, whenever the issue occurred, one of the two SRX clusters would always still show the IPSec tunnel as up, while the peer SRX would just keep logging errors about bad SPIs. Clearing the stale IPSec security association made the tunnels re-establish immediately.

In order to resolve this, I had to configure both Dead-Peer-Detection and Juniper’s VPN monitoring on both sides of the connection - so that each SRX would more actively monitor the tunnel status. Juniper’s documentation states that they enable DPD by default, but in an ‘optimized’ method which only sends a DPD R-U-THERE message under certain conditions. I had to change this to force the SRX to send the DPD messages at regular intervals. Here are the changes I made to fix the issues:

root@SRX-SITE-A# set security ike gateway gateway-siteB dead-peer-detection always-send
root@SRX-SITE-A# set security ipsec vpn vpn-to-SITE-B vpn-monitor optimized

After these changes were in place, I stopped experiencing the issue. Again, these had to be implemented on BOTH sides of the connection. These weren’t necessary on the tunnels between the SRX clusters on the older firmware version - so there may be some sort of bug in how the older and newer firmware versions interact.
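
If you want to double-check that both statements actually made it into the committed config, one quick way is to filter the set-style output. This is just a sketch that reuses the gateway and VPN names from the Issue #1 config, so your names will differ:

root@SRX-SITE-A> show configuration security | display set | match "dead-peer|vpn-monitor"
set security ike gateway gateway-siteB dead-peer-detection always-send
set security ipsec vpn vpn-to-SITE-B vpn-monitor optimized

Run the same check on the peer SRX as well, since the fix only held once both sides had it.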

Issue #3 - VPN between SRX and CheckPoint duplicates IPSec SA on re-key (sometimes causes tunnel to stop passing traffic)

This issue was a complete mess - mostly because of the effort involved in trying to coordinate two separate vendors to work on an issue. New SRX clusters (on 15.1X49-D40.6 at the time) had been deployed and all of them had to connect back into our existing CheckPoint locations via IPsec tunnels. All was great until about two weeks after installation, when we started seeing some weird tunnel drops. After some troubleshooting on my end, I discovered that watching what happened during the regularly scheduled re-key interval was helpful to see what was going on. Right at the eight hour re-key, the tunnels would try to re-establish but couldn’t - and sometimes this led to uni-directional traffic flows across the VPN.

The SRX tries to start a soft reset process prior to the re-key interval, so that it can gracefully migrate traffic to the new SPIs. However, something was happening that was causing the SRX to never terminate the old SPIs - so after a while the SRX would try to begin the soft reset process and fail because it had already reached its maximum SPIs for a given peer. Once the re-key interval was reached, the SRX would initiate the hard reset process on the tunnel. The CheckPoint side typically wouldn’t notice that anything was going on, and would keep sending traffic down the bad (expired) SPIs. A quick clear security ipsec sa and clear security ike sa would bring the tunnels back up.
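
If you want to watch for this yourself, a rough approach is to check the SA tables around the re-key time, then clear both sets of SAs if the old SPIs are still hanging around. These are the same operational commands mentioned above, just written out in full (output left out, since it will vary by peer):

root@SRX-SITE-A> show security ike security-associations
root@SRX-SITE-A> show security ipsec security-associations
root@SRX-SITE-A> clear security ipsec security-associations
root@SRX-SITE-A> clear security ike security-associations

Keep in mind that running the clear commands without any further options clears every SA on the box, not just the ones for the problem peer.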

I worked with some great guys on the Advanced JTAC team - but ultimately the SRX configuration and behavior seemed to be exactly what was expected. The only thing we couldn’t figure out is why the SRX was holding onto the old IPSec SPIs. So we opened a support case with CheckPoint to see what they had to say. After a few troubleshooting sessions and running a bunch of debugs, the CheckPoint engineer seemed to believe that the issue was on their side. All of our CheckPoint clusters were running R77.10 at the time, but we also tried upgrading to R77.30 which still experienced the issue.

Ultimately, the CheckPoint guy pointed to SK97746, which states that CheckPoint has interoperability issues with other vendors due to the way it handles tunnel renegotiation. Essentially, as soon as the Phase 1 IKE tunnel re-negotiates, the CheckPoint deletes the Phase 2 tunnel immediately (even when we are working in a tunnel soft reset). This means that the SRX would believe the renegotiation had succeeded and keep using the old tunnel until the hard re-key time. However, the CheckPoint had already deleted the old tunnel - which caused the traffic drops. This is fixed using CheckPoint’s GUI DB editor tool and making the modifications listed in that support article.

While the CheckPoint side seemed to be responsible, it’s still odd that the SRX was never clearing the old SPIs. It might be that it kept them open because the old tunnels were never gracefully closed with the CheckPoint.

So there you have it - I hope that these might help someone out who is currently banging their head against an SRX VPN issue. If you’ve run into similar issues, drop a comment below!