Networking on 0x2142 | Networking Nonsense

SRX High CPU: httpd

Tue, 05 Sep 2017 08:00:12 +0000

Over the past few years of my Juniper SRX adventures, I’ve run into a few cases where the Routing Engine (RE) CPU is pegged at 100%. From what I’ve seen so far, this is typically one of three causes: high traffic (spike in IPS inspection), logging using event mode, or a stuck web management session.

In a few cases, the CPU issue doesn’t resolve itself and someone needs to manually investigate the cause. Luckily, the httpd issue is pretty easy to spot and fix - so I wanted to cover that briefly today. This issue can crop up randomly after someone uses the J-Web GUI to administer an SRX firewall. You could avoid it entirely by disabling the web interface - but that’s not always possible.

So the first thing we want to do is log into our SRX firewall and check the current CPU utilization for our RE processor:

{primary:node0}
root@test-srx> show chassis routing-engine node 0 
node0:
--------------------------------------------------------------------------
Routing Engine status:
    Temperature                  41 degrees C / 105 degrees F
    CPU temperature              70 degrees C / 158 degrees F
    Total memory               4096 MB Max 1556 MB used ( 38 percent)
      Control plane memory     2976 MB Max 804 MB used ( 27 percent)
      Data plane memory        1120 MB Max 773 MB used ( 69 percent)
    5 sec CPU utilization:
      User                       41 percent
      Background                  0 percent
      Kernel                     59 percent
      Interrupt                   0 percent
      Idle                        0 percent
    Model                           RE-SRX345
    Serial ID                       XX1000XX0002
    Start time                      2016-09-01 02:49:50 UTC
    Uptime                          351 days, 13 hours, 28 minutes, 47 seconds
    Last reboot reason              0x1:power cycle/failure
    Load averages:                  1 minute   5 minute   15 minute
                                        1.29       1.27        1.10

So we can see that over the past 5 seconds, there is 0% idle CPU - it’s all split between User and Kernel. Some higher-end SRX models will also show utilization for 1 minute, 5 minutes, and 15 minutes.

Next, we want to confirm which process is consuming that CPU:

{primary:node0}
root@test-srx> show system processes extensive node 0
node0:
--------------------------------------------------------------------------
last pid: 25330;  load averages:  1.16,  1.24,  1.10  up 351+13:29:51    16:19:11
165 processes: 21 running, 132 sleeping, 12 waiting

Mem: 354M Active, 191M Inact, 1253M Wired, 585M Cache, 112M Buf, 1595M Free
Swap:


  PID USERNAME     THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
1635 root           7  76    0  1192M   113M RUN    0    ??? 281.93% flowd_octeon_hm
14607 nobody         3  76    0 14848K  6308K ucondt 0  25:03 83.45% httpd
   21 root           1 171   52     0K    16K RUN    0 6952.9  0.00% idle: cpu0
 1679 root           1  76    0 48580K 24476K select 0  90.2H  0.00% mib2d
 1715 root           1  76    0 35264K 19520K select 0  49.0H  0.00% snmpd
   23 root           1 -20 -139     0K    16K RUN    0  29.9H  0.00% swi7: clock
 1681 root           1   4    0   101M 68284K kqread 0  28.0H  0.00% rpd
   22 root           1 -40 -159     0K    16K WAIT   0  26.0H  0.00% swi2: netisr 0
  <-- Output Truncated -->

In this case it’s pretty clear that httpd is the top offender for CPU usage. You might also notice the process named ‘flowd_octeon_hm’. This is part of the firewall processes, so don’t be surprised if it also shows up near the top of the list. It’s pretty normal for this process to show >100% CPU, so this is safe to ignore. If you see eventd as a top consumer, then you might have your logging configured to use event mode rather than stream mode - which I’ll cover in another post.
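If you want to automate that check, here is a minimal Python sketch that picks the unexpected CPU hogs out of the process listing. It assumes the column layout matches the sample output above, and the names `busy_processes` and `EXPECTED_BUSY` are made up for this example - treat it as a starting point rather than a finished tool.

```python
import re

# Matches one process row of "show system processes extensive".
# The column layout is assumed from the sample output in this post.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+.*?\s(\d+(?:\.\d+)?)%\s+(.+?)\s*$")

# Processes that normally sit near the top and are safe to ignore.
EXPECTED_BUSY = ("flowd", "idle")

def busy_processes(output, threshold=50.0):
    """Return (command, pid, wcpu) for unexpected processes at or above threshold."""
    hits = []
    for line in output.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, summary, or blank line
        pid, _user, wcpu, command = match.groups()
        if command.startswith(EXPECTED_BUSY):
            continue
        if float(wcpu) >= threshold:
            hits.append((command, int(pid), float(wcpu)))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)

sample = """\
  PID USERNAME     THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 1635 root           7  76    0  1192M   113M RUN    0    ??? 281.93% flowd_octeon_hm
14607 nobody         3  76    0 14848K  6308K ucondt 0  25:03 83.45% httpd
   21 root           1 171   52     0K    16K RUN    0 6952.9  0.00% idle: cpu0
 1679 root           1  76    0 48580K 24476K select 0  90.2H  0.00% mib2d
"""
print(busy_processes(sample))
```

The 50% threshold is arbitrary; pick whatever makes sense against your own baseline.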

So how do we fix the httpd problem? Reboot the SRX? Well, yeah that would probably fix it - but there is an easier way:

{primary:node0}
root@test-srx> restart web-management
Web management gatekeeper process started, pid 25343

One quick command and we’ve restarted all of the web management processes, including httpd. So now you’ll want to give the SRX a few seconds to recover itself - then run the show chassis routing-engine command again:

{primary:node0}
root@test-srx> show chassis routing-engine node 0
node0:
--------------------------------------------------------------------------
Routing Engine status:
    Temperature                 41 degrees C / 105 degrees F
    CPU temperature             69 degrees C / 156 degrees F
    Total memory              4096 MB Max  1556 MB used ( 38 percent)
      Control plane memory    2976 MB Max   804 MB used ( 27 percent)
      Data plane memory       1120 MB Max   773 MB used ( 69 percent)
    5 sec CPU utilization:
      User                       6 percent
      Background                 0 percent
      Kernel                     3 percent
      Interrupt                  0 percent
      Idle                      91 percent
    Model                          RE-SRX345
    Serial ID                      XX1000XX0002
    Start time                     2016-09-01 02:49:50 UTC
    Uptime                         351 days, 13 hours, 32 minutes, 52 seconds
    Last reboot reason             0x1:power cycle/failure
    Load averages:                 1 minute   5 minute  15 minute
                                       0.35       0.99       1.04

Looks much better, with 91% idle CPU!

Even though this issue can be annoying, it’s a quick fix. I recommend that you perform some sort of CPU monitoring/alerting on your SRX clusters (I use Observium for this), which will help you identify the issue and get it resolved quickly. If this issue is left unchecked, it can sometimes cause latency and performance issues.
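If you don’t have a monitoring platform handy, the alerting logic itself is simple enough to sketch in Python. This is only an illustration: the function names, the 10% idle floor, and the three-polls-in-a-row rule are all assumptions of mine, and how you collect the command output (SSH, NETCONF, SNMP) is left out entirely.

```python
import re

def idle_percent(output):
    """Extract the first 'Idle' value from 'show chassis routing-engine' text."""
    match = re.search(r"^\s*Idle\s+(\d+)\s+percent", output, re.MULTILINE)
    return int(match.group(1)) if match else None

def should_alert(idle_readings, min_idle=10, consecutive=3):
    """Alert only when idle CPU stays below min_idle for several polls in a row."""
    recent = idle_readings[-consecutive:]
    return len(recent) == consecutive and all(r < min_idle for r in recent)

sample = """\
    5 sec CPU utilization:
      User                       41 percent
      Kernel                     59 percent
      Idle                        0 percent
"""
print(idle_percent(sample))
print(should_alert([85, 4, 0, 2]))   # three low readings in a row
print(should_alert([0, 0, 91]))      # recovered on the last poll
```

Requiring a few consecutive low readings keeps a short, legitimate CPU spike from paging anyone.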

Hope this helps!

Odd Behavior of Protected Switchports

Tue, 22 Aug 2017 08:00:29 +0000

I ran into an interesting issue recently, which was caused by use of the switchport protected command. So I use a pair of Cisco 2960-8TC-L switches at home, for both my home network and lab. A few months back I ran a bunch of ethernet cabling within my house, which all terminated in a patch panel in the basement. I was able to migrate/consolidate enough of my ports so that I could dedicate one of the 2960s to the patch panel. I had eight ports on my switch, and eight ethernet drops in my house - one of which ran back to my lab network for internet.

Usually when I configure something like this, I want to try and take security into consideration as much as possible. I have a Synology NAS on my home network, which contains enough of my personal backups that I would want to keep this inaccessible from a typical house-guest. So by default, I made the following configuration standards on the ports connected to my patch panel:

Any unconnected ports were added to my guest VLAN (which only has internet access)
Any ports that needed to be in my home VLAN were configured with port security, sticky MAC, and maximum 1 MAC allowed
All ports were configured as switchport protected (except the uplink)

The concept of protected switchports should be fairly simple: Any port configured with switchport protected is not permitted to communicate with any other port configured with switchport protected. A protected switchport is only permitted to communicate with a non-protected port (in this case, my uplink/trunk to my other 2960). I added this mostly as a safeguard against a potentially malicious house-guest.
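As a quick mental model, the rule on a single switch boils down to one condition. The Python below is just a sketch of that rule with hypothetical port names; it only describes traffic between ports on the same switch and says nothing about what happens once a trunk is involved.

```python
def can_forward(src_port, dst_port, protected_ports):
    """Same-switch rule: traffic is blocked only when both ports are protected."""
    return not (src_port in protected_ports and dst_port in protected_ports)

protected = {"Fa0/1", "Fa0/2", "Fa0/3"}   # hypothetical patch panel ports
uplink = "Gi0/1"                           # hypothetical uplink, left unprotected

print(can_forward("Fa0/1", "Fa0/2", protected))  # protected to protected
print(can_forward("Fa0/1", uplink, protected))   # protected to uplink
```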

However, once I actually began to use my patch panel ports, I began to experience a very interesting issue. For example, I purchased a home security camera which by default used a wireless connection. The location of the camera unfortunately made the wifi connection a bit more unreliable than I would like for a security camera. So I went ahead and ran a cable to the nearest ethernet drop.

The IP camera uses my Synology NAS as a backend storage for any recordings. It was able to connect and stream video on the wireless connection, but the video was choppy. Once I plugged in the ethernet cable the connection actually got worse than it already was. From my Synology - the camera would become unresponsive for a while, then you could reach it again for a few moments, then back to unresponsive (about 60-70% packet loss). If I disabled the wireless NIC entirely, the camera would be completely unresponsive. However, this whole time I was able to reach the camera with no issues from my laptop which was connected via wireless (my AP uses the same switch as the Synology).

The NAS is connected to the 2960 in my lab, which is connected to the patch-panel-2960 via a single trunk port. From the Synology, I could still see ARP entries for my wired camera - I just couldn’t reach it via ping or http. I spent a good hour or two trying a number of things: clearing ARP entries, double checking my trunk port configs, and I also upgraded the firmware on the IP camera. Nothing seemed to work. It made even less sense that anything else connected to the Synology-side switch could hit the camera with no problems.

It’s worth noting that no ports on the Synology-side 2960 were configured with switchport protected - only the ports on the patch panel side. So I finally tried removing the switchport protected command off of the IP camera port - and magically it all started working.

The protected switchport config worked exactly as I would have expected for traffic between ports on the same switch - however, it seemed to act against what I would have expected once it crossed a trunk to another switch. It was especially odd that it only seemed to crop up between the Synology and the IP camera. Oh well, I guess the only way you really learn something is by breaking it, right? I hope this might help someone who finds themselves in a similar situation.

Migrating IP Addressing Schemes

Wed, 24 May 2017 08:00:27 +0000

Back a few months ago, I wrote a bit about why it is important to have a good design for IP addressing schemes (part 1 and part 2). As a brief refresher, the situation I found myself in was an environment where practically everything was assigned a 10.x.x.x/16 subnet - even if we only needed a handful of hosts. When I arrived at the company, we were already down to less than 1/3 of the 10.x.x.x range remaining unallocated (with multiple new locations already being discussed).

The IP addressing design that I came up with limited our typical data center deployment from 4-6 /16 blocks to a single /16 block for each location. For all new locations since then, this new design has been used and it has proved to be extremely beneficial. The ability to use proper address summarization has made firewall rules, routing, and VPN tunnel configuration much simpler. But what about all the old locations which still had several /16 blocks? None of these needed more than a single /16, but we have thousands of systems that would need to be re-addressed. Not something that was going to happen overnight. So let’s take a look at some of the methods we employed for migrating from one IP addressing scheme to another.

Have a plan - The first step is to have a good handle on the overall situation and how to get from point A to point B. You’re likely going to need buy-in from other teams to help get there, and this could easily be a multi-year project depending on the number of systems. When you meet to discuss the re-addressing project, you need to be pretty strong when describing the benefits of the new system - otherwise no one will want to help.
Enforce the standard for anything new - The easy target for any transition is to hit new stuff first. For example, we started using the new range in a brand new location first. Anything being deployed that requires a new IP address allocation needs to be using the new scheme. We don’t want to perpetuate the scheme we are trying to get rid of.
Transition (Network Config) - This can be a difficult step that requires a bit of planning. For any existing sites, we need to configure the ability to use both IP address schemes side-by-side until the transition is completed. There are two primary ways to accomplish this that I’ve used - either build out a new segmented (VLANed) network, or overlay the existing using secondary IP addresses. Don’t forget to propagate routes to the new subnets and ensure that firewall rules match the existing functionality.
Transition (Infrastructure/Servers) - Once the underlying networking pieces are done, the next step is to begin transitioning services. Again, make sure any new systems getting deployed are now using the new ranges. Then we can take either an active or passive approach. In the passive approach, we are going to essentially just build new systems in the new scheme and wait until the older systems are eventually removed from service. This probably isn’t the ideal way to do this - but it’s certainly an option. In a more active approach, we would start identifying the older systems to move and making plans to do so (likely in a phased manner). Either method is going to require a serious investment of time, depending on the size of your network.
Long-Term - This process is never going to be quick or easy, but the end result should be a much better state than we began. In the meantime, maintaining both IP addressing schemes can be quite painful. Make sure that everyone on the team understands the goals of the new scheme, the plan for getting there, and how everything is configured to make it happen. The last thing you want is for someone to try and back out of the move, just because they’re not confident in what’s going on.
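To make the summarization benefit and the migration tracking a bit more concrete, here is a small Python sketch using the standard ipaddress module. The 10.20.0.0/16 block and the host inventory are invented for the example - they are not our real allocations.

```python
import ipaddress

site_block = ipaddress.ip_network("10.20.0.0/16")  # hypothetical new per-site block

# Every /24 carved from the site block collapses back into one summary route,
# which is what keeps firewall rules and routing simple.
segments = list(site_block.subnets(new_prefix=24))
summary = list(ipaddress.collapse_addresses(segments))
print(len(segments), summary)

def still_legacy(hosts, new_block):
    """Return the hosts that have not yet moved into the new block."""
    return [h for h in hosts if ipaddress.ip_address(h) not in new_block]

# Hypothetical inventory mixing migrated and not-yet-migrated systems.
inventory = ["10.20.4.15", "10.61.0.9", "10.20.200.1", "10.62.3.40"]
print(still_legacy(inventory, site_block))
```

Running something like still_legacy against an inventory export is an easy way to report progress while both schemes live side by side.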

I also wanted to stress the importance of research throughout this whole process. It’s important to try and understand why the original IP addressing was designed the way it was, and what goals the designers had in mind at the time. It’s also important to check the technologies you’re using to understand how everything will work. For example, Juniper’s SSG (ScreenOS) platform doesn’t support utilizing a secondary IP address on an ‘untrust’ interface (KB5527) - but it works if you use a custom zone name. And Check Point doesn’t support secondary IP addresses at all when you are using their ClusterXL protocol (SK89980); instead, they recommend that you deploy a new VLAN and tagged sub-interface. However, they do support it if you are using VRRP instead.

This is in no way a definitive guide on the various ways you might accomplish this - but I wanted to give a bit of background on how we tackled the problem. Unfortunately in my case, most of the older locations have several thousand systems - so I’ll be working on this migration for quite a while.

Ever had to migrate to a new IP addressing scheme? What methods did you use? How large was the network? Run into any big problems? Comment below!

Tracking Latency and Packet Loss with SmokePing

Tue, 25 Apr 2017 08:53:04 +0000

“The network is slow” - Sound like something you’ve heard before? What does ‘slow’ mean anyway? And is it different from yesterday? Sometimes tracking down network ‘slowness’ can be pretty difficult, especially when you don’t have a good baseline of what is normal. This kind of goes back to one of the tips I shared earlier in ‘A Little Bit of Magic’ - having a baseline and understanding of what is normal on your network will help you find issues much more quickly.

When I started working for a cloud service provider a few years ago, the first things to start coming up extremely often were network latency and performance issues. These are things I never had to worry too much about previously, as most of my jobs had been with enterprise environments where everyone is on the same LAN (or at least within one state). However, when you get into hosting a Software-as-a-Service cloud on a global scale, slight performance issues begin to mean big slowdowns for your customers.

I was amazed at the current network infrastructure monitoring that was in place when I began working for the SaaS provider: A few bare-bones Cacti instances, completely unmanaged by anyone, and not configured to monitor any relevant ports or data. Today that situation is vastly different - I have installed a few different applications that allow us to get alerted on network variances and quickly determine exactly where the issue is. One of the tools that has helped us get to this point is called SmokePing, which I would like to talk about today.

Setup and Installation

I won’t get into the details of installing SmokePing, as there are already a number of good tutorials out there (like this one or this one). If you have a decent familiarity with Linux, then the process should be fairly straightforward. Keep in mind that your SmokePing graphs will show latency and packet loss between the machine you have SmokePing installed on and the targets you define. So make sure that you plan out where you deploy your SmokePing machine(s) to provide beneficial information.

Once you have SmokePing installed and set up, it’s time to start defining targets to monitor. We have over a dozen points of presence globally, so I’ve installed SmokePing on a single machine in each location. Each instance has ping targets defined for every network segment within its own datacenter, network segments in every other datacenter, and some public IP space of every datacenter. So we accomplish latency and packet loss monitoring within the datacenter, across the site-to-site VPNs between each datacenter, and the general internet connections between each datacenter. For certain customers, particularly those who have dedicated MPLS circuits to us, we are also monitoring latency/packet loss to customer endpoints.

SmokePing also supports deployment in a controller/worker configuration, where you have a single primary configuration/management point and several workers to perform testing. I really want to test this out for our environment, but I haven’t quite had the time to dedicate to it. If you’re interested though, you can find the details on that here.

Interpreting the graphs

The graphs created by SmokePing might not seem clear the first time you see them. For example, take a look at this:

This graph is the result of a standard latency test - 20 pings every 5 minutes. So for every step on the graph, SmokePing draws out the range of responses in those 20 pings - shown by the gray ‘smoke’. The darker the gray area, the more pings came back with that response time - and similarly the lighter areas mean that fewer pings had that response time. The solid colored part of the line marks the median response across all 20 pings, and its color gives an indication of the percentage of packets lost.
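To make that a little more tangible, here is a rough Python sketch of the numbers behind a single step on the graph. It is not SmokePing’s own code - the function name and sample values are invented, and it uses the median as the middle value, which is my understanding of what the colored line represents.

```python
from statistics import median

def summarize_step(rtts_ms, pings_sent=20):
    """Summarize one graph step; rtts_ms holds only the pings that were answered."""
    lost = pings_sent - len(rtts_ms)
    loss_pct = 100.0 * lost / pings_sent
    if not rtts_ms:
        # Nothing came back, so there is no smoke to draw for this step.
        return {"loss_pct": loss_pct}
    return {
        "min": min(rtts_ms),        # bottom of the smoke
        "max": max(rtts_ms),        # top of the smoke
        "median": median(rtts_ms),  # the colored line
        "loss_pct": loss_pct,       # drives the color of that line
    }

# 18 of 20 pings answered; values are made up for illustration.
answered = [15, 16, 15, 17, 40, 15, 16, 18, 15, 16, 120, 15, 16, 17, 15, 16, 15, 200]
print(summarize_step(answered))
```

Notice how a couple of slow replies stretch the smoke up to 200ms while the median barely moves - which is exactly why the smoke is worth looking at.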

So the first thing I would notice about this graph is that the average response time is varying quite significantly between about 15ms and 200ms. In a normal healthy network, you should not expect to see such a drastic change in response times like that - some variation is normal, but not to this extreme. Two other things to note from this graph: The time of each latency jump seems to line up almost every 30 minutes, and towards the end we begin seeing some slight packet loss.

After being informed that there was a performance issue between a few different systems, I opened up SmokePing immediately to start looking for anything that jumped out - like the graph above. In this case, this was a 200Mb dedicated MPLS circuit used only for replication traffic between data centers. Every 30 minutes, a replication job was kicking off and saturating the line for a few minutes - which in turn was causing excessive jumps in latency and some minor packet loss.
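Spotting that kind of regular pattern by eye is easy enough, but it can also be checked with a few lines of Python. This is only a sketch - the function name, the 100ms threshold, and the sample data are invented for illustration.

```python
def spike_intervals(samples, threshold_ms=100):
    """samples: list of (minute, latency_ms). Return minutes between spike onsets."""
    onsets = []
    in_spike = False
    for minute, latency in samples:
        if latency >= threshold_ms and not in_spike:
            onsets.append(minute)  # first step of a new spike
        in_spike = latency >= threshold_ms
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]

# One sample per 5-minute step; values are made up for illustration.
samples = [(0, 180), (5, 20), (10, 16), (15, 15), (20, 17), (25, 15),
           (30, 195), (35, 22), (40, 15), (45, 16), (50, 15), (55, 18),
           (60, 190), (65, 160), (70, 15)]
print(spike_intervals(samples))
```

A list of near-identical intervals is a strong hint that a scheduled job is behind the latency, rather than random congestion.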

As another example:

The first thing you probably notice about the graph above is the sudden stabilization of latency. This graph monitors traffic between two data centers over an IPsec VPN tunnel - and we happened to be suspecting that one of the two peer firewalls was having performance issues. We swapped out to new hardware on one side of the connection, and the latency immediately started flat-lining. A consistent 85ms is way better than averaging anywhere from 90-180ms. (And if you happened to notice the slight packet loss after the new device was implemented - that was actually due to an unrelated upstream provider issue). My point with this graph is really just to show how helpful it is to have the historical data available. It would have been extremely difficult to prove that the one firewall was the root cause of our problems if I didn’t have a way to track the issue.

So that’s a bit about SmokePing and how I’ve deployed it within a cloud provider’s environment. It’s only been up and running for a few months, but I’ve already found it to be extremely helpful in troubleshooting performance and latency issues. SmokePing is also extensible via scripting, which can help to collect additional data at the time of an issue. I’ve written a few quick scripts to run extended traceroutes during packet loss events, which I might post up here in the future.

Have you installed SmokePing in your environment? How do you use it? Has it helped you with performance issues?

Comment below!

Juniper SRX VPN Issues

Tue, 18 Apr 2017 08:00:02 +0000

Last year we began migrating from our old Juniper SSG firewalls to the new SRX line. After a few months, I’ve honestly really started to enjoy working with them - so much that we’ve decided to start standardizing our firewall platforms by ditching everything else. So far I’ve had the opportunity to install ten SRX 1500s, six SRX 345s, and one SRX 340. Some have been completely new installs for a new location and some have been migrations from other devices. But while most of the process has been surprisingly smooth - there is one thing that keeps coming back up: VPN issues. (Oh, and the fact that pre-15.1X49-D60 doesn’t support In-service-upgrades - but don’t get me started on that one…)

We run multiple locations around the world, and unfortunately have to keep full mesh VPN connectivity due to the way our systems have been deployed. Today each SRX cluster has around 15 different VPN peers, which are made up of other SRXs, older SSGs, CheckPoint firewalls, Cisco ASAs, and Watchguard firewalls. This is still an on-going process - but I wanted to throw out some of the issues I’ve run into so far, and what I’ve been able to do to fix them or work around them.

Issue #1 - VPN is up, but no traffic is flowing across it

This one initially took me a minute to figure out. All of our tunnels are route-based, using secure tunnel interfaces. So each VPN is configured with a set security ipsec vpn vpn_name bind-interface st0.x command. I had a set of VPN tunnels between two locations that were not passing traffic, even though a show security ipsec sa showed the tunnels as established. For reference, here is what the config looked like:

root@SRX-SITE-A> show configuration security ike
respond-bad-spi 1;
proposal ike-aes256 {
 authentication-method pre-shared-keys;
 dh-group group2;
 authentication-algorithm sha-256;
 encryption-algorithm aes-256-cbc;
 lifetime-seconds 28800;
}
policy ikepolAES256 {
 mode main;
 proposals ike-aes256;
 pre-shared-key ascii-text xxxxxxxxx; ## SECRET-DATA
}
gateway gateway-siteB {
 ike-policy ikepolAES256;
 address XXX.XXX.XXX.XXX;
 no-nat-traversal;
 external-interface reth0.0;
}

root@SRX-SITE-A> show configuration security ipsec
proposal ipsec-aes256 {
 protocol esp;
 authentication-algorithm hmac-sha1-96;
 encryption-algorithm aes-256-cbc;
 lifetime-seconds 28800;
}
policy ipsecpolAES256 {
 perfect-forward-secrecy {
 keys group2;
 }
 proposals ipsec-aes256;
}
vpn vpn-to-SITE-B {
 bind-interface st0.1;
 df-bit clear;
 ike {
 gateway gateway-siteB;
 ipsec-policy ipsecpolAES256;
 }
 establish-tunnels immediately;
}

root@SRX-SITE-A> show configuration interfaces st0
unit 1 {
 description vpn-to-SITE-B;
}

The config on both sides practically matched, but there was one thing missing that was preventing the tunnel from passing traffic. Under the st0 configuration, unit 1 (or whichever tunnel interface you might be using) needs to have family inet configured. Even though I’m using an unnumbered tunnel interface, this command still needs to exist to tell the SRX that the interface is used for IPv4 traffic. Quick fix, but it’s easy to miss.
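Since this one is so easy to miss, a quick sanity check over the st0 configuration can save some head-scratching. The Python below is a minimal sketch that assumes the flat ‘unit N { ... }’ layout shown above; the function name and the second tunnel (vpn-to-SITE-C) are hypothetical.

```python
import re

def units_missing_family_inet(st0_config):
    """Return st0 unit numbers whose block has no 'family inet' statement."""
    missing = []
    # Split on each 'unit N {' header; the capture group keeps the unit number.
    parts = re.split(r"^\s*unit\s+(\d+)\s*\{", st0_config, flags=re.MULTILINE)
    # parts looks like [preamble, unit_no, body, unit_no, body, ...]
    for unit, body in zip(parts[1::2], parts[2::2]):
        if not re.search(r"^\s*family\s+inet\b", body, re.MULTILINE):
            missing.append(int(unit))
    return missing

config = """\
unit 1 {
 description vpn-to-SITE-B;
}
unit 2 {
 description vpn-to-SITE-C;
 family inet;
}
"""
print(units_missing_family_inet(config))
```

Here unit 1 is reported because it only has a description, which matches the broken config from this issue.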

Issue #2 - VPN drops every 2-4 hours and doesn’t re-establish for another 2-4 hours (or manual SA clearing)

The original SRXs that I installed were running JunOS 15.1X49-D40.6. I had at least half a dozen of these devices interconnected with full mesh VPNs, and experienced no issues. However, when I picked up a new set of SRX 1500s a few months back, Juniper had just released 15.1X49-D70.3 - so I upgraded before these were put into production. Strangely enough, when I began migrating tunnels to the new cluster we started to see the VPNs to remote SRXs drop sporadically. The first remote sites to migrate were less of a priority to keep connectivity established, so I took this opportunity to spend a little time figuring out what was going on.

The initial issue seemed to be that the VPNs would establish, but only for about 2-4 hours. Then they would drop and not re-establish for 2-4 hours. This seemed a bit weird to me, because the re-key interval was set for 8 hours - which means that re-key wasn’t playing into this. Even more weird, whenever the issue occurred - one of the two SRX clusters would always still show the IPSec tunnel as up, while the peer SRX would just keep logging errors about bad SPIs. Clear the stale IPSec security association, and the tunnels re-establish immediately.

In order to resolve this, I had to configure both Dead-Peer-Detection and Juniper’s VPN monitoring on both sides of the connection - so that each SRX would more actively monitor the tunnel status. Juniper’s documentation states that they enable DPD by default, but in an ‘optimized’ method which only sends a DPD R-U-THERE message under certain conditions. I had to change this to force the SRX to send the DPD messages at regular intervals. Here are the changes I made to fix the issues:

root@SRX-SITE-A# set security ike gateway gateway-siteB dead-peer-detection always-send
root@SRX-SITE-A# set security ipsec vpn vpn-to-SITE-B vpn-monitor optimized

After these changes were in place, I stopped experiencing the issue. Again, these had to be implemented on BOTH sides of the connection. These weren’t necessary on the tunnels in-between the SRX clusters on the older firmware version - so there may be some sort of bug between those and the newer firmware.

Issue #3 - VPN between SRX and CheckPoint duplicates IPSec SA on re-key (sometimes causes tunnel to stop passing traffic)

This issue was a complete mess - mostly because of the effort involved in trying to coordinate two separate vendors to work on an issue. New SRX clusters (on 15.1X49-D40.6 at the time) had been deployed and all of them had to connect back into our existing CheckPoint locations via IPsec tunnels. All was great, until about two weeks after installation we started seeing some weird tunnel drops. After some troubleshooting on my end, I discovered that watching what happened during the regularly scheduled re-key interval was helpful to see what was going on. Right at the eight hour re-key, the tunnels would try to re-establish but couldn’t - and sometimes this led to uni-directional traffic flows across the VPN.

The SRX tries to start a soft reset process prior to the re-key interval, so that it can gracefully migrate traffic to the new SPIs. However, something was happening that was causing the SRX to never terminate the old SPIs - so after a while the SRX would try to begin the soft reset process and fail because it had already reached its maximum SPIs for a given peer. Once the re-key interval was reached, the SRX would initiate the hard reset process on the tunnel. The CheckPoint side typically wouldn’t notice that anything was going on, and would keep sending traffic down the bad (expired) SPIs. A quick clear security ipsec sa and clear security ike sa would bring the tunnels back up.

I worked with some great guys on the Advanced JTAC team - but ultimately the SRX configuration and behavior seemed to be exactly what was expected. The only thing we couldn’t figure out is why the SRX was holding onto the old IPSec SPIs. So we opened a support case with CheckPoint to see what they had to say. After a few troubleshooting sessions and running a bunch of debugs, the CheckPoint engineer seemed to believe that the issue was on their side. All of our CheckPoint clusters were running R77.10 at the time, but we also tried upgrading to R77.30 which still experienced the issue.

Ultimately, the CheckPoint guy pointed to SK97746, which states that CheckPoint has interoperability issues due to the way it handles tunnel renegotiation with other vendors. Essentially, as soon as the Phase 1 IKE tunnel re-negotiates, the CheckPoint deletes the Phase 2 tunnel immediately (even when we are working in a tunnel soft reset). This means that the SRX would believe the tunnel had re-established and keep using the old one until the hard re-key time. However, the CheckPoint had already deleted the old tunnel - which caused the traffic drops. This is fixed using CheckPoint’s GUI DB editor tool and making the modifications listed in the support article mentioned above.

While the CheckPoint side seemed to be responsible, it’s still odd that the SRX was never clearing the old SPIs. It might be that it kept them open because the old tunnels were never gracefully closed with the CheckPoint.

So there you have it - I hope that these might help someone out who is currently banging their head against an SRX VPN issue. If you’ve run into similar issues, drop a comment below!

Port Security: Worth the effort?

Tue, 14 Mar 2017 08:00:40 +0000

Port Security. Always seems like one of those things covered in Cisco exams, yet how many businesses actually use it? For those that aren’t implementing it, should they? Or is it too much of a headache?

So the concept of port security is fairly simple - We want to secure each individual switch port to a physical layer 2 MAC address, or at least limit how many unique MAC addresses might be learned on an individual port. The technology could be used to just limit the number of simultaneous devices on a port - by just setting a MAC threshold. Or we can also take it to the extreme and lock down each port to a hard-coded MAC address - which will never allow another device to connect. You might be thinking that the second method is absolutely ridiculous, but it really depends on the business needs.

First, let’s take a brief look at the typical port security configuration and some of the options available.

SecureSwitch(config)# interface x/x  ! Whichever interface we want to lock down
SecureSwitch(config-if)# switchport port-security maximum xx  ! Max number of MAC addresses that can be learned
SecureSwitch(config-if)# switchport port-security violation xxxxx  ! Choose protect, restrict, or shutdown (restrict and shutdown are described below)
SecureSwitch(config-if)# switchport port-security  ! This actually enables the port security config

Fairly straightforward, right? We choose a port (or you could do a range) and set a few options. The default number of MAC addresses able to be learned on a port is 1, so it’s likely you’re going to want to change this - unless 1 is all you need. Port security is most commonly enabled on access ports, so 1 MAC address works in most cases - except where you have a PC daisy-chained off of an IP phone (in which case this will need to be set to 2 or 3).

Next we set our preferred violation action. This step is pretty important because it defines what happens when the port exceeds its MAC count. Restrict is the passive approach. If we have two PCs plugged into a single access port (maybe using an unmanaged switch), then the second PC will just never be able to work as long as our max MAC limit is 1. The first PC to connect will be fine, and the switch will log a message and send an SNMP trap when the second MAC is picked up. Shutdown is the more forceful approach. Once that second MAC address turns up on the port, the switch puts the port in an err-disabled state - which shuts down the port to all traffic. This event is also logged and generates an SNMP trap - however the port will not come back online until an administrator manually re-enables it.
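If the difference between the two violation actions is still fuzzy, this toy Python model might help. It is only a sketch of the behavior described above - the class and its fields are invented for illustration and are not a description of how IOS implements the feature.

```python
class SecurePort:
    """Toy model of a port with port security enabled."""

    def __init__(self, maximum=1, violation="shutdown"):
        self.maximum = maximum
        self.violation = violation   # "restrict" or "shutdown"
        self.learned = set()         # MAC addresses learned so far
        self.err_disabled = False
        self.violations = 0          # stands in for the log message / SNMP trap

    def frame_from(self, mac):
        """Return True if a frame from this source MAC is forwarded."""
        if self.err_disabled:
            return False             # port is down until an admin re-enables it
        if mac in self.learned:
            return True
        if len(self.learned) < self.maximum:
            self.learned.add(mac)
            return True
        self.violations += 1
        if self.violation == "shutdown":
            self.err_disabled = True
        return False

restrict = SecurePort(maximum=1, violation="restrict")
print([restrict.frame_from(m) for m in ("aa:aa", "bb:bb", "aa:aa")])

shutdown = SecurePort(maximum=1, violation="shutdown")
print([shutdown.frame_from(m) for m in ("aa:aa", "bb:bb", "aa:aa")])
```

With restrict, the first PC keeps working after the violation; with shutdown, even the original PC is cut off once the second MAC shows up.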

Now that we see a basic config, let’s take a look at a few different use cases for this feature. In one of my previous jobs, I worked as a network admin for a local government organization. Port security configuration in that environment was extremely strict. Each switchport was configured to permit only one MAC address, shutdown upon violation, and the switchport port-security mac-address sticky command was also used. This command takes the first MAC address learned on the port and commits it to the running configuration, which means that this MAC is essentially hard-coded to be the only MAC permitted on the port. So in this environment, a single PC was tied to a single port - nothing else could ever be plugged into that port without either shutting down the port or administrator intervention. In a government office, this was absolutely necessary because every device on the network needed to be tracked and personal devices were not permitted to be connected. We needed to know if anything was ever plugged in that wasn’t an authorized device - so manual intervention and investigation were a requirement.
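A minimal sketch of that strict setup is below. The interface name is a placeholder, not the actual port numbering from that environment.

```
SecureSwitch(config)# interface GigabitEthernet1/0/10
SecureSwitch(config-if)# switchport mode access
! Only one MAC address is ever allowed on this port
SecureSwitch(config-if)# switchport port-security maximum 1
! Err-disable the port if any other MAC address shows up
SecureSwitch(config-if)# switchport port-security violation shutdown
! Write the first learned MAC address into the running configuration
SecureSwitch(config-if)# switchport port-security mac-address sticky
SecureSwitch(config-if)# switchport port-security
SecureSwitch(config-if)# end
! Sticky addresses live in the running config, so save it to survive a reload
SecureSwitch# copy running-config startup-config
```

Until the configuration is saved, a reload clears the sticky entries and the port will simply learn whichever device talks first.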

In a more typical office environment - port security configurations can just be a good security practice without going overboard with it. We never want a user to plug a rogue switch into our network without our knowledge, right? So maybe we assume each user has an IP phone and PC, and limit the port to 2 MAC addresses. In this case, we can go ahead and just set the port to restrict. We don’t want to prevent the user from working if a port violation occurs, nor do we want to spend time resetting the port for them - but we might still want to be notified, especially if it happens often. In addition, port security is an excellent way to secure ports in public areas. For example, maybe we have an IP phone or kiosk PC in our lobby. These need access to the network, but we don’t want anyone to be able to unplug that device and gain access into our network. In cases like this, it would actually make sense to have the switch only permit access from that single MAC address.
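Those two cases might look something like the sketch below. The interface names and the MAC address 0011.2233.4455 are made-up placeholders, and depending on the platform a phone plus PC port may need a maximum of 3 rather than 2.

```
! Office desk port: IP phone plus PC, notify on violation but keep the port up
SecureSwitch(config)# interface GigabitEthernet1/0/11
SecureSwitch(config-if)# switchport mode access
SecureSwitch(config-if)# switchport port-security maximum 2
SecureSwitch(config-if)# switchport port-security violation restrict
SecureSwitch(config-if)# switchport port-security
SecureSwitch(config-if)# exit

! Lobby kiosk port: only the kiosk's own MAC address is ever permitted
SecureSwitch(config)# interface GigabitEthernet1/0/12
SecureSwitch(config-if)# switchport mode access
SecureSwitch(config-if)# switchport port-security maximum 1
SecureSwitch(config-if)# switchport port-security mac-address 0011.2233.4455
SecureSwitch(config-if)# switchport port-security violation shutdown
SecureSwitch(config-if)# switchport port-security
```

Keep in mind that a MAC address can be spoofed, so the lobby configuration raises the bar for a casual visitor rather than stopping a determined attacker.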

Outside of the ‘practical’ use cases, there is also the strictly security side of things. I’ve touched on a few considerations already - but there are also certain types of attacks that can be defeated by port security. One of those would be exhausting the CAM table resources. A malicious person could use publicly available tools to spoof MAC addresses in the frames they send to the switch. Tools like this force the switch to learn hundreds of thousands of MAC addresses, which eventually will fill up the CAM table. When a switch CAM table becomes full, the switch begins flooding frames for unknown destinations out all interfaces. This is because the switch can no longer learn new mappings between MAC addresses and the ports they originate from - so the switch has no choice but to flood that traffic and hope the correct recipient receives the data. For the attacker, this means they can run a packet capture on the port and collect information they wouldn’t otherwise have been able to see. This scenario could be prevented by implementing port security, which could simply restrict the number of MAC addresses learned off of any individual interface.
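As a sketch of that defense, a MAC limit can be applied to a whole block of access ports at once and then checked with a couple of show commands. The port range and the limit of 2 are assumptions for illustration only.

```
! Cap every user-facing access port at 2 learned MAC addresses
SecureSwitch(config)# interface range GigabitEthernet1/0/1 - 48
SecureSwitch(config-if-range)# switchport mode access
SecureSwitch(config-if-range)# switchport port-security maximum 2
SecureSwitch(config-if-range)# switchport port-security violation restrict
SecureSwitch(config-if-range)# switchport port-security
SecureSwitch(config-if-range)# end

! Summary of secured ports, their limits, and violation counters
SecureSwitch# show port-security
! How full the MAC address (CAM) table currently is
SecureSwitch# show mac address-table count
```

With a per-port cap in place, a flooding tool on one port can only ever add a couple of entries to the table, no matter how many source addresses it spoofs.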

Port security configuration can be implemented in a few different ways depending on your use case. Overall though, it can prove to be a useful way to help implement security controls on your network. What do you think about port security? Extremely useful or does it just get in the way? Comment below and tell me how you have implemented it!

Virtual Networking Contexts

Tue, 21 Feb 2017 08:00:58 +0000

I really want to take a moment to talk about how wonderful VRFs/firewall contexts really are. Both technologies essentially allow a network administrator to spin up a virtualized, isolated instance of a network device. I’ll be honest and say that I hadn’t had the chance to play much with this stuff until just recently - but it makes life a lot easier in a cloud provider environment.

I’ve been looking for a good chance to use VRFs in the past, but in most cases it didn’t really make much sense. About a year ago, I had a great opportunity when we needed to build a new data center. The data center was aimed at being lower capacity than most of our other locations, so we had to cut some costs here and there. In all of our other locations we use two physically separate sets of firewalls, one for external traffic and one for internal traffic. In this new location, we opted to save some money by picking up only a single pair of Juniper SRX 345 firewalls.

I made the decision here to make use of Juniper’s virtual routing instances to keep logical separation of internal vs external firewalling, even though it was only a single physical cluster. For one, this would allow existing staff to maintain their current understanding of network architecture. Every data center has the same overall logical traffic flow, even if the physical devices are different. Second, this allowed us to split load across the two devices. Normally we have two physical clusters to handle the traffic load, but in this case we were essentially going to pump the same traffic through one pair of firewalls. Assigning each virtual routing instance into its own redundancy group allowed us to run each firewall instance on a separate device - yet still allow for both instances to run on one in the event of a failure.
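A heavily simplified sketch of that layout in Junos set-style configuration is below. The instance names, reth interface numbers, and priority values are placeholders rather than the production config, and the physical reth member links, security zones, and policies are left out entirely.

```
# One virtual router per logical firewall
set routing-instances EXTERNAL instance-type virtual-router
set routing-instances EXTERNAL interface reth0.0
set routing-instances INTERNAL instance-type virtual-router
set routing-instances INTERNAL interface reth1.0

# Two data-plane redundancy groups, each preferring a different node
set chassis cluster reth-count 2
set chassis cluster redundancy-group 1 node 0 priority 200
set chassis cluster redundancy-group 1 node 1 priority 100
set chassis cluster redundancy-group 2 node 0 priority 100
set chassis cluster redundancy-group 2 node 1 priority 200

# Tie each instance's interface to its own redundancy group
set interfaces reth0 redundant-ether-options redundancy-group 1
set interfaces reth1 redundant-ether-options redundancy-group 2
```

The binding is really between redundant Ethernet interfaces and redundancy groups, so each instance follows its group only as long as all of its interfaces sit in that group. If one node fails, both groups go active on the surviving node.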

Once we got that firewall cluster into production, there seemed to be a lot less fear regarding virtualized network contexts. I was able to prove that it worked, and worked well for what we needed at the time. Soon enough I was able to find a few additional places where we could make use of the same concepts. We recently procured quite a few Cisco Nexus 9372PX switches for both new deployments and hardware refreshes. By default these switches already come pre-configured with an out-of-band management VRF, which is already super useful to me. We run all of our device management traffic on a segregated network, so a management VRF allowed me to configure the IP/route information to make all that work - while not interfering with the normal layer 3 operations of the device.
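For reference, the management VRF setup on NX-OS looks roughly like this. The addresses come from documentation ranges and are placeholders, not our real management network.

```
! The management VRF exists by default; give it its own default route
vrf context management
  ip route 0.0.0.0/0 192.0.2.1

! mgmt0 belongs to the management VRF out of the box
interface mgmt0
  vrf member management
  ip address 192.0.2.10/24
```

Commands that source traffic from the switch need to be told which VRF to use, for example `ping 192.0.2.1 vrf management`. Otherwise they use the default VRF and its routing table.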

Being a cloud provider, most of our customers are completely abstracted from the hardware/software that runs their hosted applications. However, in a few cases a customer negotiates for a contract change to say otherwise. For example, a customer might have a special software integration they want to run and have the ability to control - or some customers want a dedicated point-to-point Ethernet connection into one of our data centers for increased reliability. A lot of the background networking work for this in the past was a bit of a pain - but it opened up another opportunity to make use of VRFs. I now have a dedicated customer VRF, which has its own routing configuration, separate from our normal production environment. Customer wants to stand up BGP peers across their direct connection to our data center? Sure, I can isolate that BGP instance in the customer VRF, so there is no conflict with our production routing tables.
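A rough idea of what a customer VRF with its own BGP peering could look like is below, assuming it lives on an NX-OS switch. The VRF name, interface, addresses, and AS numbers are all invented for the example.

```
feature bgp

! Separate routing table for the customer
vrf context CUSTOMER-A

! Customer-facing point-to-point link placed into the VRF
interface Ethernet1/10
  no switchport
  vrf member CUSTOMER-A
  ip address 203.0.113.1/30
  no shutdown

! The BGP session with the customer only exists inside the VRF
router bgp 64512
  vrf CUSTOMER-A
    address-family ipv4 unicast
    neighbor 203.0.113.2
      remote-as 64513
      address-family ipv4 unicast
```

Routes learned from that neighbor land in the CUSTOMER-A table only, and can be checked with `show ip route vrf CUSTOMER-A`. Nothing reaches the production routing table unless route leaking is deliberately configured.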

I’m sure that my current use cases are probably not the ideal implementations of virtual networking contexts - but they work for what we need and they make life a lot easier. I can see these becoming more and more common in our environment to logically segregate traffic. I am interested to hear how other companies have integrated this type of technology into their networks - so leave a comment below!