Networking on 0x2142 | Networking Nonsense

[How To] Protect Your Home Network with Mullvad VPN & OPNsense

Tue, 29 Aug 2023 15:01:15 +0000

In this post, we’ll walk through how to connect an OPNsense firewall to Mullvad’s WireGuard VPN. This can be used in various deployments to help protect your home network traffic.

As a quick note, this post is not sponsored by Mullvad - I just happen to really like their service & appreciate their approach to privacy and security. In fact, they currently don’t offer any incentives or paid promotions of their product. More info on their current policy can be found here.

Creating an Account

Mullvad is unique in that they don’t really require any sign up to get started with their service. Instead, when you sign up for new service, they randomly generate an account number. That’s it. They don’t require any additional information from you.

So first we can hit their new account page here - then click to generate a new account number.

Keep this account number somewhere safe. Mullvad has very limited ability to help you recover the account number.

Once we have that, we can click the button to Add time to our account.

Mullvad doesn’t support any recurring subscriptions, so that they are able to keep less data about their customers. Instead, we just add time increments in months for how long we would like to use the service. This can be anywhere from a single month up to a year.

In terms of payments, they do offer a number of options, including the usual PayPal/credit cards - or if you’re truly concerned about privacy, they accept a few cryptocurrencies or you can actually mail them a cash payment as well.

After a payment has been made & time added to the account, we can quickly jump into the configuration side of things.

Config: Mullvad Side

First we’ll take a quick look at the configuration required on the Mullvad VPN side of things.

On the left side of the account page, we’ll click on WireGuard Configuration.

First we’ll select Linux as our platform (In this context, what is selected here isn’t really important) - and then click the button to generate WireGuard keys:

Note: Alternatively, we could have generated our WireGuard keys on our OPNsense firewall - then applied them here. It’s up to you on which method you prefer!

Next, we’ll take a look at which server(s) we would like to connect to.

When selecting a server, we have the option to pick our desired country & location - as well as picking a specific server to connect to if we choose. There is also an option to select all servers - in which case the config generator will create a WireGuard configuration file for each server.

For the purpose of this walk through, we’ll keep things simple & only select a single server. So in the screenshot above, I’ve selected us-qas-wg-004.

We’ll also take a quick look at the advanced options, which give us a little flexibility if we need it. For most people, these additional options will not be necessary to change or modify.

We can enable Multihop functionality, so long as we only selected a single server to connect to. So if you selected the All Servers option, this won’t be available. This allows us to specify an entry & exit server for our VPN. In other words, our device would directly connect to the entry server we select - then Mullvad would tunnel our traffic across their network to the exit server, where our traffic would be decrypted & forwarded out to the internet. Depending on your privacy & security desires, this is a really nice option to have the ability to enable.

Next, we have options on what type of connection we would like & which traffic to forward.

Server connection protocol will specify whether we are using IPv4 or IPv6 between our device & our VPN server. Most likely you’ll want to leave this at IPv4, unless you have an IPv6-only internet connection (or if you just would prefer to use IPv6 anyways!).

Tunnel traffic is how we can specify whether we would like IPv4 or IPv6 client-side traffic to be forwarded over the VPN. So this would depend on whether the clients on our network have IPv4 vs IPv6 connectivity, or both - and whether we prefer to only forward certain types of traffic over the VPN. I’ll be leaving this setting as the default: Both.

Next we can specify a custom port if we would like. By default, WireGuard will use UDP port 51820 - and we probably won’t need to change this unless the port is being blocked upstream.

Lastly, we also have the option to enable content blocking across the VPN. Mullvad accomplishes content filtering through DNS-level blocking - and when we finish generating a configuration file, the file will include DNS servers to use. This is okay to use if you were connecting a single client to their VPN service. However, if you’re using a router or device like OPNsense, you would need to update the DNS on all clients on your network to make this work. This is possible by updating DHCP on our router with the new DNS server address - or configuring a DNS rewrite. We won’t get into either of those in this post - so for now I will leave the content blocking options unchecked.

Once we’re good with our configuration - we can click the Download File button. We’ll get a standard WireGuard config file, that looks like this:

At this point, we’re good to move onto the next part!

Config: OPNsense Side

Okay, so now that we have everything ready to go on the Mullvad side - we can configure our OPNsense device.

Make sure you already have WireGuard installed on OPNsense. This can be done by navigating to System > Firmware > Plugins then searching for wireguard & clicking the install button.

Next, we’ll enable Wireguard by navigating to VPN > Wireguard and checking the box to Enable WireGuard, then Apply.

Then we’ll hop over to the Endpoints tab & configure our Mullvad VPN peer.

For this part of the configuration, we’ll just copy our public key, allowed IPs, endpoint address, and endpoint port from our Mullvad config file. In the screenshot below, I also named my endpoint with the specific Mullvad server I’ll be connecting to:

Then we can click Save and Apply.
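If it helps to see which line of the downloaded file feeds which OPNsense field, here is a minimal Python sketch. The sample config uses placeholder keys and documentation addresses (not real Mullvad values), and the dictionary key names are just shorthand for the Endpoints and Local form fields, not anything OPNsense defines.

```python
# Placeholder config in the standard WireGuard layout. Keys and addresses
# are made up for illustration and are NOT real Mullvad values.
SAMPLE_CONFIG = """\
[Interface]
PrivateKey = PLACEHOLDER_PRIVATE_KEY_BASE64=
Address = 10.10.10.10/32,fd00::10/128
DNS = 192.0.2.53

[Peer]
PublicKey = PLACEHOLDER_SERVER_PUBLIC_KEY_BASE64=
AllowedIPs = 0.0.0.0/0,::0/0
Endpoint = 198.51.100.10:51820
"""


def parse_wireguard_config(text):
    """Parse a single-peer WireGuard config into {section: {key: value}}."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = {}
            continue
        # Split on the first "=" only, so base64 padding in keys survives.
        key, _, value = line.partition("=")
        sections[current][key.strip()] = value.strip()
    return sections


def opnsense_fields(config_text):
    """Map config values to the fields we type into OPNsense (names are hypothetical)."""
    cfg = parse_wireguard_config(config_text)
    host, _, port = cfg["Peer"]["Endpoint"].rpartition(":")
    return {
        # Endpoints tab
        "endpoint_public_key": cfg["Peer"]["PublicKey"],
        "endpoint_allowed_ips": [ip.strip() for ip in cfg["Peer"]["AllowedIPs"].split(",")],
        "endpoint_address": host,
        "endpoint_port": int(port),
        # Local tab
        "local_private_key": cfg["Interface"]["PrivateKey"],
        "local_tunnel_addresses": [ip.strip() for ip in cfg["Interface"]["Address"].split(",")],
    }


fields = opnsense_fields(SAMPLE_CONFIG)
for name, value in fields.items():
    print(f"{name}: {value}")
```

The [Peer] values go into the Endpoints tab, while the [Interface] values are used on the Local tab.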

If you wanted multiple Mullvad servers configured, just create a new endpoint for each one. Then, make sure that you select all of the Mullvad peers on the next step below.

After that, we can move over to the Local tab to define our OPNsense tunnel configuration.

Click the button to add a new peer, then we’ll fill in our private key and tunnel address(es) from the Mullvad config file. Under Peers, we’ll also select our Mullvad VPN peer that we configured just a moment ago:

UPDATE: Looks like with a recent OPNsense update, they now require you to enter both the WireGuard private & public key into the local config (shown above). In the screenshot, I only show entering the private key - since this was all that was required at the time. Two ways to get your public key:

Log into Mullvad & check the “Devices” tab under “Account Management”. This will show your device public key (They don’t keep your private key after generating it for you, only the public).

If you have WireGuard installed somewhere else, you can use the “wg pubkey” command to derive a public key from your private key. The command reads the private key on standard input: echo YOUR_PRIVATE_KEY | wg pubkey
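If you’re curious what that derivation actually does: a WireGuard public key is the X25519 scalar multiplication of the private key against the Curve25519 base point, base64-encoded. Below is a minimal pure-Python sketch that follows the Montgomery ladder described in RFC 7748. The function names are my own, and the code is for illustration only (it is not constant-time), so for real keys it’s safer to stick with wg pubkey or a vetted crypto library.

```python
import base64
import os

P = 2**255 - 19          # Curve25519 field prime
A24 = 121665             # (486662 - 2) / 4, constant from RFC 7748
BASE_POINT = (9).to_bytes(32, "little")


def x25519(scalar: bytes, u_coord: bytes) -> bytes:
    """X25519 per RFC 7748 section 5 (illustrative, not constant-time)."""
    k = bytearray(scalar)
    k[0] &= 248           # clamp the scalar as the RFC requires
    k[31] &= 127
    k[31] |= 64
    k_int = int.from_bytes(k, "little")
    x1 = int.from_bytes(u_coord, "little") & ((1 << 255) - 1)
    x2, z2, x3, z3 = 1, 0, x1, 1
    swap = 0
    for t in reversed(range(255)):
        bit = (k_int >> t) & 1
        swap ^= bit
        if swap:
            x2, x3 = x3, x2
            z2, z3 = z3, z2
        swap = bit
        a = (x2 + z2) % P
        aa = a * a % P
        b = (x2 - z2) % P
        bb = b * b % P
        e = (aa - bb) % P
        c = (x3 + z3) % P
        d = (x3 - z3) % P
        da = d * a % P
        cb = c * b % P
        x3 = (da + cb) ** 2 % P
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P
        z2 = e * (aa + A24 * e) % P
    if swap:
        x2, x3 = x3, x2
        z2, z3 = z3, z2
    return (x2 * pow(z2, P - 2, P) % P).to_bytes(32, "little")


def wg_public_key(private_key_b64: str) -> str:
    """Derive the base64 public key from a base64 WireGuard private key."""
    private = base64.b64decode(private_key_b64)
    if len(private) != 32:
        raise ValueError("WireGuard private keys are 32 bytes")
    return base64.b64encode(x25519(private, BASE_POINT)).decode()


# Throwaway random key, only to demonstrate the call.
demo_private = base64.b64encode(os.urandom(32)).decode()
print(wg_public_key(demo_private))
```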

Note: By default with the configuration we’ve applied so far, this VPN will forward ALL traffic on our network to Mullvad. If we would prefer to selectively choose which traffic to send over the VPN, we can check the box for Disable Routes - then use policy routing to forward specific things to Mullvad. We’ll take a look at how to do this later in the post - but for now just be aware that our current configuration will forward ALL traffic.

Okay, with that all done we can click Apply and Save here as well.

With any luck, we can check the Status tab & see that there is data being transmitted & successful WireGuard handshakes:

However, before our clients’ traffic can be forwarded over the VPN, we’ll need to create a firewall rule to permit traffic & a NAT rule to translate our client addresses to our Mullvad IP.

We’ll navigate to Firewall > Rules > WireGuard (Group). Then we’ll click to create a new rule.

Within this new rule, I’ll update Direction to Out and change TCP/IP Version to IPv4+IPv6. I’ll leave the source as Any:

We can also leave the Destination as Any, but I’ll update the rule to enable logging:

This rule will allow any clients behind our OPNsense firewall to reach anything on the internet.

Next, we’ll have to create a NAT rule. This ensures that our client addresses on our network get appropriately translated to the tunnel IP address that Mullvad has assigned us.

We’ll navigate to Firewall > NAT > Outbound. By default, OPNsense will be set to Automatic outbound NAT rule generation. We’ll need to update this to Hybrid outbound NAT rule generation to allow custom NAT rules. Then we can click Save and Apply.

Next, we’ll create a new Manual NAT rule, where we’ll update our Interface to WireGuard (Group):

Then we’ll make sure our Translation / target is set to Interface Address - and again, I’ll enable logging:

After we click save, we should have a NAT rule that looks like this:

At this point, we should be good to test our clients!

Testing

Of course, it’s easy enough to use one of our clients to check that we still have internet access - but how can we be sure that they’re using the VPN?

The easiest way might be to check mullvad.net, where they have a quick validation at the top of the page:

So according to Mullvad, it looks like we’re connected & they even show which server we’re connecting from.

We can also double check using a traceroute or tracepath command:

In this case, we can see that our traffic to Google hits our OPNsense gateway, then the Mullvad VPN gateway followed by another external address owned by Mullvad.

So based on some quick testing, it looks like we’re all good!

Policy Routing

So in the above walkthrough, we configured a Mullvad VPN from our OPNsense firewall - but it is forwarding ALL of our network clients over the VPN. What about if we only wanted certain clients to use the VPN? Or all clients to use it, but only for certain destinations?

We can accomplish this through policy routing.

So the first thing we’ll do is go back to our WireGuard config and open the Local tab. We’ll edit our configuration here, and check the box for Disable Routes.

By default, OPNsense / WireGuard will install routes for any IPs listed in the AllowedIPs field for each peer. In our set up, we configured 0.0.0.0/0 - which matches all traffic. By checking the box for Disable Routes, we prevent OPNsense from installing that default route - and instead we can manually specify our own.

If you’re curious to double check this, you can try hitting Mullvad’s website after changing this setting - and it should show that you’re no longer connected.
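To see why an AllowedIPs entry of 0.0.0.0/0 behaves like a default route, here is a quick sketch using Python’s ipaddress module. The narrower 10.64.0.0/24 list is a made-up comparison, not something from the Mullvad config.

```python
import ipaddress


def matches_allowed_ips(destination, allowed_ips):
    """Return True if the destination falls inside any AllowedIPs network."""
    addr = ipaddress.ip_address(destination)
    for cidr in allowed_ips:
        net = ipaddress.ip_network(cidr)
        # Only compare addresses and networks of the same IP version.
        if addr.version == net.version and addr in net:
            return True
    return False


catch_all = ["0.0.0.0/0", "::/0"]   # what we configured for the Mullvad peer
narrow = ["10.64.0.0/24"]           # hypothetical narrower list, for comparison

print(matches_allowed_ips("8.8.8.8", catch_all))      # True
print(matches_allowed_ips("2001:db8::1", catch_all))  # True
print(matches_allowed_ips("8.8.8.8", narrow))         # False
```

Every destination matches the catch-all entries, which is why a route for them pulls all traffic into the tunnel until Disable Routes is checked.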

Then, we’ll need to set up our WireGuard configuration to use a dedicated, named interface so that we can create a static gateway.

We’ll head to Interfaces > Assignments - and create a new interface. From the drop-down, we’ll select our WireGuard interface - in my case this was wg1. Then we can assign it a name:

Then click the + icon to add, and Save.

Then we can navigate to the interface name under the Interfaces menu - and enable the new interface:

Next we can create a gateway to route traffic through. Navigate to System > Gateways > Single.

Create a new gateway & give it a name. Then we’ll enter our Mullvad VPN gateway IP address, which in my case was 10.64.0.1. How did we find this? Well earlier when we tested our VPN connectivity - we performed a tracepath. In the output of this tracepath, our second hop (the one right after our OPNsense firewall) would be the Mullvad VPN gateway. So this is the address we’ll use for our gateway here:

Then click Save and Apply.

Note: You may get an error here, like “The gateway address “10.64.0.1” does not lie within one of the chosen interface’s IPv4 subnets.” To resolve this, we’ll temporarily change our interface address. Navigate to the WireGuard local config, and re-enter your tunnel address excluding the “/32”. For example, if your tunnel IP was 10.10.10.10/32, change this to just 10.10.10.10. You should be able to set the gateway address now. Be sure to change your tunnel IP back afterwards!
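The error makes sense once you look at what a /32 means: the interface’s subnet contains exactly one address, so no gateway can ever sit inside it. A small sketch of that check, using the example addresses from above (this only illustrates the subnet test the error message describes, not how OPNsense implements it):

```python
import ipaddress


def gateway_is_on_link(interface_cidr, gateway):
    """Return True if the gateway address lies inside the interface's subnet."""
    iface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(gateway) in iface.network


tunnel = ipaddress.ip_interface("10.10.10.10/32")
print(tunnel.network.num_addresses)                        # 1
print(gateway_is_on_link("10.10.10.10/32", "10.64.0.1"))   # False
```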

Next, we’ll create a firewall rule for each set of sources or destinations we would like to manipulate.

So we can navigate to Firewall > Rules > Floating (or select a specific interface for clients, like LAN).

In here, we’ll set whichever parameters we would like to match. So for this example, let’s say I have a client PC at 10.100.100.10 and I only want traffic to 8.8.8.8 to use Mullvad. All other traffic can use the normal internet connection & not use the VPN.

In that case, I’ll set my source address to 10.100.100.10/32 and my destination to 8.8.8.8/32:

Then all we need to do is update our Gateway to use the Mullvad gateway we just created:

Now, if we hop over to our client PC - Mullvad’s website will say that we’re not connected to the VPN. However, we can check that traffic to 8.8.8.8 is actually being sent through Mullvad using tracepath again:

In the screenshot above, I also included a tracepath to 8.8.4.4, just to show that it is going through my normal internet connection & not Mullvad.

This was just one example, but you could easily create multiple firewall rules for different sources and/or destinations to control where traffic is sent. Aliases can also be used to group sources or destinations, so that multiple can be added to a single firewall rule.
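Conceptually, the policy routing above boils down to "find the first rule that matches the source and destination, and use its gateway". Here is a rough Python sketch of that idea, assuming first-match evaluation. The gateway names MULLVAD_GW and WAN_DEFAULT are hypothetical labels, and each rule takes lists of networks to mimic what an alias would give us.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class PolicyRule:
    sources: list        # CIDR strings, like the members of an alias
    destinations: list   # CIDR strings
    gateway: str         # hypothetical gateway name


def pick_gateway(rules, src, dst, default="WAN_DEFAULT"):
    """Return the gateway of the first rule matching src and dst."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for rule in rules:
        src_ok = any(src_ip in ipaddress.ip_network(n) for n in rule.sources)
        dst_ok = any(dst_ip in ipaddress.ip_network(n) for n in rule.destinations)
        if src_ok and dst_ok:
            return rule.gateway
    # No policy rule matched, so the normal routing table decides.
    return default


RULES = [
    PolicyRule(sources=["10.100.100.10/32"],
               destinations=["8.8.8.8/32"],
               gateway="MULLVAD_GW"),
]

print(pick_gateway(RULES, "10.100.100.10", "8.8.8.8"))  # MULLVAD_GW
print(pick_gateway(RULES, "10.100.100.10", "8.8.4.4"))  # WAN_DEFAULT
```

This lines up with the tracepath results: 8.8.8.8 goes via Mullvad, while 8.8.4.4 falls through to the normal internet connection.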

Okay - I think that’s about all I wanted to cover in this post.

Hope it is helpful! Feel free to leave a comment below - or follow me on YouTube 😊

OPNsense on Qotom Q750G5 - Hardware Overview & Perf Testing

Sat, 18 Jun 2022 18:21:18 +0000

Note: I may receive commissions for purchases made through links in this post. This is to help support my blog and does not have any impact on my recommendations.

One day I would love to have symmetric internet speeds, but for the meantime I have to work with what I have - which is dismal upload speeds that don’t come anywhere close to the download speed.

So I’m currently at 200Mb/s down & 10Mb/s up - and I would prefer to have higher upload speeds. However, even just 20Mb/s upload requires me to purchase a 500Mb/s download plan. 🙄

I’ve been hesitant to upgrade, since my current Meraki MX64 tops out at ~250Mb/s throughput. I would have to upgrade my firewall or half of my new download speed would be unusable.

While Meraki does have faster firewalls, they’re a bit expensive & at the moment somewhat difficult to obtain. So instead I searched around for alternatives, and came across the Qotom Q750G5 - which is a barebones mini-PC that I could load pfSense or OPNsense onto.

So in this blog post - I wanted to talk about the device & some of the performance testing I did.

Qotom Q750G5 - Hardware

First let’s take a quick look at the hardware. What got my attention quickly was the five 2.5GbE ports, but this small device offers quite a lot for how inexpensive it was.

A quick look at the specs:

Intel Celeron J4125 Quad Core 2.0Ghz (Burst 2.7Ghz)
Five Intel I225-V 2.5Gb Ethernet ports
Optional WiFi slot
Optional 3G Cellular & SIM card slots
1 RAM slot
1 mSATA slot
1 2.5in SATA HDD slot
2x USB 2.0 & 3x USB 3.0 ports

By default it seems this is a barebones kit, however I did opt to buy a model that included 16GB RAM & a 256GB mSATA SSD. It arrived with a single 16GB TeamGroup DDR4 3200Mhz module & 256GB Hoodisk mSATA SSD.

Are the specs a bit overkill for an embedded firewall? Yeah definitely. But the whole kit was only around $230 USD - which seemed crazy to me.

The big thing that pulled me toward this model (besides price) was the 2.5GbE ports. Since a lot more places have gigabit residential internet these days, and some lucky places are starting to see 2Gb/s - I was hoping that this little box would offer me a bit of future-proofing.

Of course, I was a little curious about what performance this device could realistically achieve - but more on that later!

I wanted to provide a few photos of the unit as well. The casing itself is pretty minimal, but the box is definitely heavier than you would expect!

Front Photo:

On the front there isn’t a whole lot to see. Power button, reset button, status LEDs, and USB ports. Oh - and there is an HDMI port as well, which comes in handy when installing an operating system (or if you’re using this as an actual PC).

Back Photo:

The back shows off the five 2.5Gb Ethernet ports, as well as the power plug. While I didn’t get a WiFi-enabled unit, the back faceplate still has cutouts for wireless antennas. I could opt to add wireless to this later, but I do wish the faceplate came with plugs or something to cover the holes in the meantime.

Inside Photo:

Now here’s where things get interesting! This box surprisingly packs a lot in its tiny size.

In the upper right, there is a PCI slot for WiFi. There’s also a WiFi/3G PCI slot in the lower left.

Just above the 3G slot is the SSD mSATA slot. When a drive is installed, it does cover the SIM card slot - meaning that the SSD would need to be removed to change SIM cards.

Also on the left, mostly out of the photo, is a connector for a 2.5in HDD. While I didn’t snap a photo of the bottom of the unit, the drive would screw into the bottom plate.

Interestingly enough, the CPU is on the other side of the motherboard. While you might think that the overall unit kinda looks like a heatsink - I was surprised to see that’s exactly what it is. The CPU is located right under the four screws with black washers and has thermal paste pre-applied. It rests up against the top of the case & uses the top as a heatsink.

As a last note about the CPU - it does get quite toasty at times. During one of my performance tests, the CPU reached ~90°C - and the top of the unit was very hot to the touch. It may not act as the world’s best heatsink, so I would avoid placing anything on top of the unit - or resting your hand there for too long 🙂.

Performance Testing

As a quick note before we get to the good stuff: All of my tests were performed using iPerf3. This may not show us the best real-world throughput tests, but I had quite a difficult time getting dpdk drivers to compile correctly for the Intel 2.5GbE ports (so I could use something like TRex). I may attempt to go back to this later, but the iPerf data is what I’ll provide for now.

The Testbed

For the performance testing, I used the two Intel NUC11 PCs that I purchased recently for my VMware lab. These already come with a 2.5GbE port, which I am currently using for vMotion between the two. I disconnected the vMotion port temporarily, and instead enabled PCI-passthrough to connect the ports directly to a VM.

I built a VM on each NUC with the following specs:

Debian 11
8x vCPU
16GB RAM
PCI-passthrough to 2.5GbE adapter

iPerf 3.9 was used for all tests. The iPerf server was enabled with the iperf3 -s command, and the client tests were run with iperf3 -c SERVER_IP -P 8 -t 600 (where SERVER_IP is the address of the iPerf server). That’s 8 parallel streams, with each test run for 10 minutes.
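For anyone wanting to script runs like this, iperf3 can emit JSON with the -J flag. Below is a minimal sketch that builds the client command and pulls the received throughput out of a report. The server address and the trimmed-down sample report are placeholders for illustration, not actual output from these tests.

```python
import json


def build_iperf3_command(server, streams=8, seconds=600):
    """Client command matching the tests here, plus -J for JSON output."""
    return ["iperf3", "-c", server, "-P", str(streams), "-t", str(seconds), "-J"]


def received_gbps(iperf_json):
    """Extract the receiver-side average throughput in Gb/s from an iperf3 JSON report."""
    report = json.loads(iperf_json)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9


# Heavily trimmed, made-up report containing only the field we read.
SAMPLE_REPORT = '{"end": {"sum_received": {"bits_per_second": 2350000000.0}}}'

command = build_iperf3_command("203.0.113.25")  # placeholder server address
print(" ".join(command))
print(f"{received_gbps(SAMPLE_REPORT):.2f} Gb/s")
```

To actually run it, the command list could be handed to subprocess.run(command, capture_output=True, text=True) and the resulting stdout passed to received_gbps.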

The devices were connected & configured with the following topology:

Note: In the below paragraphs, I will talk about the configuration I used for each test. If you’re interested in more detail, please check out the video at the top of this blog post. In the video, I walk through the configuration & setup for most of the tests below.

Test 1: No Firewall

Avg: 2.35Gb/s

Okay, so expecting that I likely wouldn’t get the full 2.5Gb speeds once I added the firewall - I wanted to perform a baseline test with the VMs directly connected via the 2.5GbE passthrough port.

In each of these tests, I was able to reach an average speed of: 2.35Gb/s
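A likely explanation for landing on 2.35Gb/s rather than 2.5Gb/s is plain protocol overhead. The arithmetic below is a sketch under assumptions that were not measured here: a 1500-byte MTU, IPv4, and TCP timestamps enabled (the usual Linux default).

```python
LINK_GBPS = 2.5

# Per-packet byte counts (assumed: standard Ethernet, IPv4, TCP with timestamps)
MTU = 1500
IP_HEADER = 20
TCP_HEADER = 20
TCP_TIMESTAMPS = 12
ETH_HEADER = 14
ETH_FCS = 4
PREAMBLE_SFD = 8
INTERFRAME_GAP = 12

payload = MTU - IP_HEADER - TCP_HEADER - TCP_TIMESTAMPS               # 1448 bytes
on_wire = MTU + ETH_HEADER + ETH_FCS + PREAMBLE_SFD + INTERFRAME_GAP  # 1538 bytes

goodput_gbps = LINK_GBPS * payload / on_wire
print(f"{goodput_gbps:.2f} Gb/s")  # prints "2.35 Gb/s"
```

So under those assumptions, 2.35Gb/s of TCP payload is roughly what a fully saturated 2.5GbE link can carry.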

Test 2: Routing, NAT, Simple Firewall rules

Avg: 2.35Gb/s

In this test case, OPNsense was configured to match the diagram above. The iPerf server was located on the LAN segment, with an IP of 10.2.2.1 & a default route toward the LAN interface of the OPNsense box (10.2.2.2). The client is connected via the WAN interface, at 203.0.113.50.

I created a proxy-ARP virtual address for the 203.0.113.25 IP address, then created a NAT rule to forward traffic sent to that address to 10.2.2.1.

Next, there was a single firewall rule created to permit TCP port 5201 inbound from the WAN. This is the default port that iPerf will use.

During these tests, I averaged about 2.35Gb/s and relatively low CPU usage around 10-20%.

As an interesting side note to this: Originally I didn’t use the default LAN & WAN ports (ports 1 & 2), but instead used ports 3 & 4. During testing between those ports, I could only reach ~1.7Gb/s. Once I switched to ports 1 & 2, I could easily hit 2.35Gb/s. I’m curious to dig in later & see if this is a hardware issue, or possibly OPNsense is prioritizing the LAN/WAN traffic differently.

Test 3: Large Firewall Ruleset

Avg: 2.35Gb/s

While my home network will likely have a somewhat minimal firewall ruleset, I was curious to see how well the device would perform with a large ruleset.

So I wrote a script to auto-generate ~1,200 firewall rules with random IP addresses, ports, and a mix of permit/block actions. I applied the ruleset inbound on both the LAN & WAN ports.
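The script itself isn’t shown here, but the idea is simple enough to sketch. The version below is a hypothetical stand-in, not the one used for this test: it only builds a list of rule dictionaries with random addresses, ports, and actions, and leaves out the step of actually loading them into OPNsense.

```python
import ipaddress
import random


def generate_rules(count, seed=None):
    """Build `count` random firewall rules as plain dictionaries."""
    rng = random.Random(seed)  # a seed makes the ruleset reproducible
    rules = []
    for index in range(count):
        rules.append({
            "description": f"autogen-{index}",
            "action": rng.choice(["pass", "block"]),
            "protocol": rng.choice(["tcp", "udp"]),
            "source": str(ipaddress.IPv4Address(rng.getrandbits(32))),
            "destination": str(ipaddress.IPv4Address(rng.getrandbits(32))),
            "destination_port": rng.randint(1, 65535),
        })
    return rules


rules = generate_rules(1200, seed=42)
print(len(rules))
print(rules[0])
```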

Under this test, I was still able to reach the 2.35Gb/s speeds - but now the CPU was creeping up to 30-40%.

So what next? Well I decided to see if I could stress the box a little - and increased the auto-generated ruleset to just over 15,000 rules.

The increased ruleset took about 5-10 minutes to load into the system, and CPU was pegged at 100% the whole time. In fact, the CPU never really came back down. Even after the rules had loaded, the CPU was flat at 100% - and it took anywhere from 2-4 minutes to navigate between pages in the web GUI. (This is the part where my CPU temps went from <50°C to over 90°C 🙃)

I ended up having to re-image the box.

Test 4: Suricata IPS

Avg: 2.35Gb/s (But sometimes much, much less)

Next I loaded up the built-in IDS/IPS, which uses suricata. I downloaded all available free rulesets, and enabled all of them. Ideally, you wouldn’t necessarily enable everything - but I wanted to start off with a full load test.

My experience with these tests was highly variable. Most times, I was surprised to see that I could still push 2.35Gb/s through without any issue. During these tests, the CPU usually bounced between 50-80% usage.

Strangely, every so often I would test and get significantly less than that. Most times when this happened, I would instead only see somewhere between 400-700Mb/s - but on one test the box slowed to just over 200Mb/s. Each time this happened, the performance tests would remain degraded until I restarted the suricata service - then I would be able to reach the 2.35Gb/s again.

My best guess is that something is getting stuck. I know some older IPS systems I’ve worked with in the past had issues with CPU-pinning, and you might get garbage performance if your traffic hit the wrong CPU core. I didn’t spend a lot of time digging into this, but it feels a bit similar.

Test 5: Wireguard VPN

Avg: 800Mb/s, 650-700Mb/s with IPS also enabled

I also wanted to try VPN performance. There are a lot of options for VPN services within OPNsense, including standard IPSec and OpenVPN. However, I opted to try out wireguard - since it’s really easy to set up & get running.

In this case, I definitely hit a limitation with CPU-pinning. With my initial tests, I could only reach ~450-500Mb/s - but the Qotom CPU was only spiking up to 70%.

Seems like the wireguard client (at least for same source/destination/port traffic) does pin everything to a single CPU. So my client CPU, between encrypting the traffic & generating it, was actually maxing out long before the Qotom was.

I ended up spinning up a second VM on the client side (which required disabling PCI passthrough & adding both VMs to a shared vSwitch). With both of these running, I was able to max out the Qotom CPU at 100% - and hit a fairly consistent 800Mb/s VPN throughput.

As a final test, I enabled the suricata IPS on top of the VPN as well. Now the traffic would need to be decrypted & inspected before reaching the server. With a single client, I averaged ~400-450Mb/s - and with both clients it was around ~650-700Mb/s.

Overall I’m really pleased with how well this thing performs, especially for how relatively inexpensive it was. I also expected OPNsense to have a steep learning curve, but it was surprisingly easy to get up and running very quickly.

Next I’ll be working on getting this firewall up & running at home. I may be tempted to write up some more on OPNsense - so if there are any questions or specific configurations you might like to see, please leave a comment!

An Afternoon with ARIN

Tue, 15 Sep 2020 11:30:03 +0000

I had the opportunity to attend an ARIN on the Road event last week. It was an all-day event that focused on education: who ARIN is, what they do, and some things they are working on. As a network admin I’ve had to work with ARIN a handful of times to request network resources. I figured it would be a good experience to attend one of these events and see what ARIN has to say. I actually found out about a few things I wasn’t aware of previously, so this post is going to be a brief summary of what I learned.

About ARIN

If you haven’t already worked with them - ARIN is the American Registry for Internet Numbers. They are a non-profit organization and their purpose is to assign/manage Internet number resources for all of North America. This includes IPv4/IPv6 addresses and BGP Autonomous System Numbers (ASNs). ARIN is one of five Regional Internet Registries (RIRs) - each managing Internet resources for its own individual region. All of these report back to a top-level organization, the Internet Assigned Numbers Authority (IANA).

What I didn’t know: ARIN actually used to manage resources for all of South America and Africa as well. LACNIC formed and took ownership of South America in 2001, and AFRINIC took Africa in late 2004. ARIN itself has only been around since 1997, and will be celebrating its 20th anniversary this December.

Outside of assigning/managing number resources - ARIN manages a huge manual of numbering policies and standards (The Number Resource Policy Manual). A good note here is that these policies are heavily influenced by the community - so if any individual or group of network operators want to change/modify or add new policies, then they can submit proposals to do so.

IPv4 Depletion

I was very interested to hear about what’s going on with IPv4/IPv6 - mostly because I’ve been trying to push for IPv6 in many of the places I have worked. The ARIN group spent a little bit of time talking about how the depletion of IPv4 addresses has affected their workload. Overall, it seems like their work has remained about the same - but it has transitioned from mostly IPv4 allocations to more IPv4 transfer requests.

An interesting note from this discussion was that ARIN only performs the backend registration changes for IPv4 block transfers. They play no part in the actual negotiations between two organizations. However, they do perform their own investigations during transfers to ensure that the source organization legitimately owns the IP block, and the destination organization can justify the use of the space.

I had heard previously that ARIN kept a block of IPv4 addresses for transition to IPv6 - but I never researched it further. This was a topic ARIN touched on during the event. Essentially, they have kept ownership of a /10 block of addresses, which is split up into individual /24 blocks for assignment. Any organization can request one of the /24s when they request a block of IPv6 addresses. The organization must fill out a justification form, in which they demonstrate how the IPv4 blocks will be used to help transition to IPv6. Organizations can request one of these blocks every 6 months, provided they can still justify the need for them. This is all documented in NRPM section 4.10.
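To put the size of that reserved pool in perspective, here is a quick bit of arithmetic in Python. The 10.0.0.0/10 network is only a stand-in to get a /10 to count with, not ARIN’s actual reserved block, and the figure of 60 comes from the number quoted at the event.

```python
import ipaddress

# Stand-in /10, used purely for counting (NOT the block ARIN reserved).
reserved = ipaddress.ip_network("10.0.0.0/10")

# Each extra prefix bit doubles the number of subnets: 2^(24 - 10).
total_slash24s = 2 ** (24 - reserved.prefixlen)
print(total_slash24s)  # 16384

# Cross-check by actually enumerating the /24 subnets.
enumerated = sum(1 for _ in reserved.subnets(new_prefix=24))

assigned = 60  # approximate figure quoted at the event
print(f"{assigned / total_slash24s:.2%} of the pool assigned")
```

In other words, a /10 holds 16,384 /24 blocks, so ~60 assignments barely scratches the surface.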

The somewhat surprising thing here is that ARIN was actively encouraging people to take advantage of this. Probably because they need to push IPv6 adoption in any way they can. As of the date of the event, ARIN stated that only ~60 /24 blocks had been assigned so far.

IPv6 Adoption

This part of the event wasn’t quite everything I wanted it to be. Overall ARIN touched on statistics from Google and other organizations that show the trending uptake in IPv6 network access. They also spoke briefly about how the structure of IPv6 addresses makes life easier - because the last 64 bits can always be used for host-based MAC autoconfig, network operators only need to worry about subnetting above that.
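For reference, the MAC-based autoconfig mentioned here is the modified EUI-64 scheme: flip the universal/local bit in the first byte of the MAC address, and insert ff:fe in the middle to stretch 48 bits to 64. A minimal sketch, using a documentation prefix and a made-up MAC address (many modern hosts use randomized interface IDs instead, so treat this as the classic behavior only):

```python
import ipaddress


def eui64_address(prefix, mac):
    """Combine a /64 prefix with a modified EUI-64 interface ID built from a MAC."""
    network = ipaddress.ip_network(prefix)
    if network.version != 6 or network.prefixlen != 64:
        raise ValueError("expected an IPv6 /64 prefix")
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02  # flip the universal/local bit
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return network.network_address + int.from_bytes(interface_id, "big")


address = eui64_address("2001:db8::/64", "00:11:22:33:44:55")
print(address)  # 2001:db8::211:22ff:fe33:4455
```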

Interestingly enough, ARIN was advocating an ‘assign way more addresses than you’ll ever need’ mentality for IPv6. Another attendee asked the question ‘Won’t we run into the same thing as IPv4, if we just throw out v6 blocks like candy’? This actually led to hearing something I wasn’t aware of - IANA has currently only made 1/8th of IPv6 blocks publicly available for use. The current numbering scheme/standard will be used for this first block of addresses. If we run through them too quickly, then we can step back and re-evaluate best practices before handing out the next 1/8th block of addresses.
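The 1/8th figure is easy to sanity check, assuming it refers to 2000::/3 (the global unicast range that allocations currently come from). A /3 fixes the first three bits of the address, which leaves one eighth of the total space:

```python
import ipaddress
from fractions import Fraction

# Assumption: the "1/8th" refers to the global unicast range 2000::/3.
global_unicast = ipaddress.ip_network("2000::/3")

share = Fraction(global_unicast.num_addresses, 2 ** 128)
print(share)  # 1/8
```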

DNSSec

Initially I was a bit confused that DNSSec was on the topic list - but I figured maybe ARIN was just trying to push this for the betterment of the Internet. While they spoke a bit about DNSSec for forward DNS, their primary topic was how DNSSec for reverse DNS isn’t something people are normally thinking about. As it turns out, ARIN offers reverse-lookup DNSSec for any IP blocks that they assign out. This is good to know, since reverse DNS can be important for things like email security - and it’s certainly something I’ve never really considered in the past.

If you have purchased IPv4/v6 blocks directly from ARIN - I would recommend that you check this out.
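As a quick refresher on what reverse DNS names look like (these are the records that would be signed in the reverse zones), Python’s ipaddress module can generate them. The addresses below are documentation examples, not real assignments.

```python
import ipaddress

# Documentation addresses, used purely as examples.
v4_name = ipaddress.ip_address("192.0.2.25").reverse_pointer
v6_name = ipaddress.ip_address("2001:db8::1").reverse_pointer

print(v4_name)  # 25.2.0.192.in-addr.arpa
print(v6_name)  # 32 reversed nibbles, ending in .ip6.arpa
```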

RPKI

Resource Public Key Infrastructure (RPKI) is a way of cryptographically validating ownership of IP address space or routing objects. Since BGP is primarily a trust-based protocol between organizations, RPKI allows network operators to implement additional security by providing a certificate-based system of trust. The majority of this discussion was around how bad BGP security is, and that overall North America is far behind on implementing RPKI.

ARIN has a service available where they will act as your Certificate Authority (CA) for RPKI - so it only requires network operators to sign records then implement a few device changes.
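To make the validation side a bit more concrete, here is a rough sketch of route origin validation in the style of RFC 6811: a route is valid if some ROA covers its prefix, allows its prefix length, and names its origin AS; invalid if ROAs cover it but none match; and not-found if nothing covers it. The ROA data and AS numbers are made-up documentation values, and real validators do considerably more (fetching and verifying the signed objects, for a start).

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Roa:
    prefix: str       # prefix the holder authorized
    max_length: int   # longest prefix length allowed in BGP
    asn: int          # AS authorized to originate it


def validate_origin(roas, route_prefix, origin_asn):
    """Classify a BGP route as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if route.prefixlen <= roa.max_length and roa.asn == origin_asn:
            return "valid"
    return "invalid" if covered else "not-found"


# Hypothetical ROA: AS64500 may originate 192.0.2.0/24, and nothing more specific.
ROAS = [Roa(prefix="192.0.2.0/24", max_length=24, asn=64500)]

print(validate_origin(ROAS, "192.0.2.0/24", 64500))     # valid
print(validate_origin(ROAS, "192.0.2.0/24", 64501))     # invalid
print(validate_origin(ROAS, "198.51.100.0/24", 64500))  # not-found
```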

My Thoughts

Overall the event was fairly informative! It wasn’t quite everything I wanted it to be, but I did walk away with additional knowledge that I didn’t have before. I was really hoping to learn more about how other organizations are implementing IPv6, or even how other people are convincing their employers to take IPv6 adoption seriously. When I spoke with some other attendees, it seemed like not many people had IPv6 running in a production environment yet - only a few of them had even started testing. Surprisingly, even the ARIN reps were repeatedly asking people to contact them if they had an IPv6 success story to share.

One thing I found really interesting was surrounding DNSSec/RPKI. A few attendees asked about how many people are actually validating signed resources. It’s one thing to implement signing, but it won’t matter if no one validates the resources, right? Surprisingly, ARIN had no statistics about this - and stated the point that they cannot enforce adoption of these standards. It certainly makes sense, but it’s not something I gave much thought to previously. Since they’re just a registry, they can only make these services available - not enforce their usage. This is why they put on events such as this to raise awareness and provide education.

ARIN pushed the fact that all of their policies are community driven. There were quite a few examples throughout the event of how individual members of the community could impact changes to their policies. My primary concern is that it seemed like a majority of the individuals in attendance represented government or educational organizations - and not a lot who worked in similar network environments to what I manage. They raised their own concerns and questions, which were certainly valid for the types of infrastructure and designs that they maintain. However, a number of these things don’t really apply to my infrastructure in quite the same ways.

If I have to make one point here: If you’re a network operator, go subscribe to ARIN’s mailing lists and get involved. Maybe you don’t have any ideas for policy changes, but you never know what might come up that you could provide meaningful input on. The ARIN reps provided an example or two of when a smaller group of people suggested policy changes which drastically affected bigger companies - and almost no one opposed them until they took effect. Only you have the ability to voice your opinion and concerns about how a proposed policy could affect your network. If you don’t, the next time you try to request a block of IP addresses or a BGP ASN, you could potentially run into roadblocks because of a policy change proposed by someone with very different needs.

The staff at ARIN don’t live and work in the networks that we do. They try to work with network operators to understand use cases and the possible ramifications of policy changes - but ultimately they are a small non-profit. They can’t think of everything, nor can they force network operators to contribute their opinions. Get involved and make a difference.

As a final note, ARIN has a Fellowship Program where you can apply to attend one of their Public Policy meetings for free. Fill out an application and if you’re chosen they’ll provide a ticket, hotel room, and travel expenses. It’s a great opportunity to experience one of these meetings, especially if you might not have the financial means to otherwise.

The slide deck from the event is publicly available on ARIN’s website: here.

Where is all the Automation?

Tue, 24 Jul 2018 10:00:03 +0000

Note: I may receive commissions for purchases made through links in this post. This is to help support my blog and does not have any impact on my recommendations.

The future is APIs! SD-EVERYTHING! Automation! Orchestration! Artificial Intelligence and Machine Learning! Sound familiar? It’s all part of the messaging going around in just about everything IT-related. With as much as you keep hearing about it, you might think that it’s all anyone is doing anymore. Yet in my area, it still seems like not many people are really getting into it. Every vendor event I’ve gone to this year has asked attendees the same question: “How many of you are leveraging the APIs in your network hardware/software?” And every time it’s the same answer - maybe two or three people in a room of 40 raise their hands.

So where is the problem? Is all of this just marketing fluff or am I just talking to the wrong people?

Let’s think about this from a typical network admin’s perspective. Shifting from traditional CLI to automation and APIs can seem difficult or overwhelming. Let’s say I want to automate a new VLAN deployment. Oh, you’re telling me I need to stop and learn vendor APIs… but before that I need to understand how to write scripts. But I’ve never even programmed something before. There are dozens of languages - how do I pick one? How much fundamental programming knowledge do I really need to have before starting? I don’t want to be a developer!

Okay, okay - just stop there for a second. No one is asking you to drop networking and write code for a living. The end goal of all this programmability stuff isn’t to turn networkers into developers - It’s to enable network/systems admins to be more efficient at their jobs. Why copy/paste the same config change to 100+ devices, if you can mass-deploy the change via an API? That’s a lot of time savings that could be used toward educating yourself on new products, planning other projects, or thinking about your ideal network design.

I’ve heard a lot of the same things over the past few years:

“Programming is difficult” or “I don’t know where to start”

Try learning Python. It’s simple to get started and you can build from there.

“I don’t know what an API is or how to use it”

Don’t worry about that yet - start with learning the basics and APIs will make sense later.

“I’m not a developer”

No one is asking you to be one! But learning the basics of scripting and automation gives you a whole new toolset to solve problems.

For me personally - I would never want to be a developer. I can’t stand the thought of coming into work every day and just writing code. Some people might enjoy that, but for me it doesn’t sound like fun. However - I enjoy writing scripts to solve problems, especially when it ends up making my job easier. I think that’s the part where some people tend to get stuck though. A lot of the talk around automation makes it sound like you need to be able to develop a huge 10,000+ line application to pull data from 15 sources and aggregate it to make intelligent network changes. Ehhh… Nope, not really. But what about just a quick script that runs every 5 minutes to check an interface statistic, and emails you when a particular threshold is exceeded? Realistically that could be done in 50-100 lines of script and maybe 30 minutes’ worth of work.
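To give a rough idea of how small that kind of script can be, here is a minimal sketch of the threshold check and the alert email. The threshold value, the interface name, and the email addresses are all placeholders I made up for illustration, and how you actually fetch the counter (SNMP, an API, screen-scraping) is left out entirely:

```python
from email.message import EmailMessage

# Hypothetical threshold: alert when input errors grow by more than this per poll.
ERROR_THRESHOLD = 100


def build_alert(interface, previous, current, threshold=ERROR_THRESHOLD):
    """Return an EmailMessage if the counter grew past the threshold, else None."""
    delta = current - previous
    if delta <= threshold:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"{interface}: input errors up by {delta}"
    msg["From"] = "netmon@example.com"  # placeholder address
    msg["To"] = "admin@example.com"     # placeholder address
    msg.set_content(f"{interface} input errors went from {previous} to {current}.")
    return msg
```

Sending the message would be one more call to `smtplib.SMTP(...).send_message(msg)`, and the "every 5 minutes" part can be left to something like cron rather than written into the script.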

Still not interested? That’s okay too. Traditional networking isn’t going away any time soon, and over time the vendors will write all of that automation for you. They will package it up in a pretty GUI and sell it off to companies that want it. In fact, this is already happening and has been for quite some time. This isn’t a bad thing - vendors need to make money, and not all companies will have the time or skilled resources to automate all the things. However, a network admin who can write their own scripts/automation won’t be exclusively tied to a vendor to help them - and instead they will be empowered to solve more problems themselves.

Where do you get started? I already wrote a bit earlier this year on a few resources for learning Python - which you can find here. I also wanted to point out some other great resources that are a bit more specific to using those skills for network automation:

Python For Network Engineers - Don’t know anything about Python yet? Start here! This is a free course provided by Kirk Byers for anyone who is interested in using Python for network automation. Once a week you’ll get an email with all the great free content, but it will be up to you to spend time going through it. Go sign up, and set aside an hour or two each week to practice.
Cisco DevNet - There is a ton of great content here. While DevNet does offer some tutorials on basic Python fundamentals, the real value here is examples on how to use some network APIs (NX-OS, Meraki, UCS, etc). Also - one of the best parts about DevNet is the sandboxes they offer. Want to write scripts against the FirePower Management Center, but you don’t have one to test with? Well with DevNet you can get access to one! Get familiar with your Python basics, then come here to see where you can start using those skills with your existing infrastructure.
Network Programmability and Automation - This is a fantastic book. Not free, but it is well worth the ~$30. Once you have a good handle on how to write some basic network automation with Python, I highly recommend picking this up. While Python is covered here, the book does a great job of introducing you to all of the other toolsets available. Curious about how Linux or Ansible fit into network automation? You can find out here - and learn about APIs and source control systems too!

So - What are you waiting for? Go get started, and see what you can accomplish. Learn the basics - and keep an open mind for opportunities to use those skills.

Have suggestions on where else to learn? Comment below!

How Does Maintenance Scheduling Affect Your Network?

Wed, 07 Feb 2018 10:00:55 +0000

Last week I came across a thread on Reddit that asked the question: “What is your company’s policy on maintenance windows?”. This got me thinking about how maintenance windows have been handled at the various companies I’ve worked at, and how those schedules/restrictions impact project timelines, network design, etc.

Many of the places that I have worked at in the past have been typical 8a-5p/M-F shops. Outside of normal business hours, no one really cared if the network was available. Sure, we might have people who worked late - but a few hours notice via email was always enough. However, the company I work for currently has much tighter restrictions on when work can be performed. We have worldwide customers in over a dozen datacenters and some fairly strict uptime SLAs. What this comes down to is a once-a-month allowance for scheduled maintenance - where the timeframe is limited anywhere from 15 minutes to 4 hours.

Some of the immediate impacts of these differing maintenance window schedules are somewhat obvious. Network maintenance can be practically open to all nights and weekends with a lot of typical 8-5 businesses. This means changes can happen much more frequently - especially changes involving a full network outage. For example, at one of my previous jobs I needed to upgrade each floor of the building from individual Cisco Catalyst 3548 switches to new 2960X stacks. This required moving the cables for up to 200 ports per floor (while also trying to clean up cable management). I was able to complete the work by just coming into the office earlier every day to move the connections before anyone else arrived.

On the other hand, a cloud service provider can’t just decide one day to take an outage of a few hours to swap out network equipment. Instead, changes have to be carefully planned, scheduled, then executed within a short window. Customers have come to expect 100% uptime - and rightfully so. However, we still need some amount of time dedicated to performing upgrades, changes, or other maintenance activities. The simple switch migration from the last example suddenly becomes a multi-month ordeal in an environment such as this. You might be ready to jump on the work, but you need to wait for the next regularly scheduled window - and even then you may have only a handful of minutes to complete your task. If you don’t complete all of it in the time allocated? Well now your project gets pushed back another month.

So you might ask - as a business scales, does it always end up creating this maintenance monolith? It might - but it certainly doesn’t have to end that way. The effects of higher uptime requirements and shorter maintenance periods might seem like nothing but bad news. However, the change in mindset that comes along with that does bring some unique benefits.

The first major benefit comes in the form of planning. When you have 15 or 30 minutes to complete an entire migration or upgrade, it becomes extremely beneficial to plan out a complete play-by-play of every activity. The limited window means that simple mistakes can cost you valuable time. Of course, the tendency for maintenance windows to be scheduled for late nights also compounds the problem since you may be tired or less alert. For critical maintenance tasks that I need to accomplish, I take the time to create a step-by-step checklist of every command that must be run, every system that must be tested, and every step needed to roll back. Sufficient planning means fewer mistakes, which in turn increases chances of success during a tight work period.

Automation and efficiency start to become a necessity when you have only a few minutes to perform a task. Sure, I might create a very detailed checklist of what must be accomplished - but what happens if it’s simply too much for the time allocated? You can’t complete a 20-minute task in a 15-minute outage window, right? Sometimes we can schedule extended maintenance periods, but this certainly isn’t feasible every month. This is where we begin to try and identify inefficiencies and tasks that would benefit from automation. Over the years I have written a handful of scripts and utilities that allow for normal maintenance tasks to be completed quickly. These are things that might have otherwise continued to be done manually (and error-prone) without the timing restrictions.

A short maintenance period also encourages more careful network design. If you’re only permitted a half-hour of downtime, then you start looking for ways to minimize the impact. Could the network be designed in a way that allows for a no-downtime switch upgrade or replacement? If not, then how do we get there? In many smaller business networks you might plan for redundancy but never test it - but in a high-uptime environment you begin to rely on it. If you want to get to a point where work can be accomplished with minimal downtime (or even during normal hours), then you must be confident that your network can seamlessly absorb the impact.

I certainly wish some days that I could go back to a life where downtime is acceptable any time during off-hours. However, I’m sure that the desire for higher uptime and greater reliability is here to stay - and I believe that I’ve learned some valuable lessons in trying to meet those requirements. An extremely short maintenance period certainly complicates things, but it also forces us to look for process and design improvements. I believe that the end result is a better network for both the business and its customers.

What are your maintenance practices like? Do you have hours or minutes? Comment below!

L2 Basics: Configuring an EtherChannel

Tue, 30 Jan 2018 10:00:46 +0000

Today we’re going to take a look at how to configure an etherchannel between two Cisco Switches.

What is an etherchannel? It’s a way of taking multiple independent links and bundling them together, so that they appear as one logical connection between two devices. Etherchannels are commonly used between two switches, or between a switch and a host - which allows for both additional bandwidth and fault tolerance/redundancy. In the example today, we’ll be using an etherchannel protocol called Link Aggregation Control Protocol (LACP). LACP is an IEEE standard (802.3ad).

You might be thinking “Wait, wouldn’t multiple links cause a loop? Or trigger Spanning-tree to block ports?”. Not in this case! Etherchannel technologies work around those problems by creating a single logical interface for spanning-tree to worry about. The etherchannel protocol itself worries about loop prevention in between the two devices, so we get multiple ports of non-blocking bandwidth.

For everything we cover in this example, we’ll be using the following topology:

So we have two switches, which are connected together via Eth0/0 and Eth0/1. Each switch has three VLANs configured - 10, 20, and 30.

Configuring an Etherchannel

I’ll only be showing the configuration from the perspective of 0x2142-SW1 - but all configuration is replicated on 0x2142-SW2.

! We'll use the interface range command to apply the etherchannel configuration to
! both Eth0/0 and Eth0/1 at the same time:
0x2142-SW1(config)#int range Eth0/0 - 1

! We specify which etherchannel protocol to use by configuring 'channel-protocol'
! PAgP is a Cisco Proprietary protocol, but we'll be using LACP for this example:
0x2142-SW1(config-if-range)#channel-protocol ?
  lacp  Prepare interface for LACP protocol
  pagp  Prepare interface for PAgP protocol
0x2142-SW1(config-if-range)#channel-protocol lacp

! Next we need to specify a channel-group and mode:
0x2142-SW1(config-if-range)#channel-group 1 mode ?
  active     Enable LACP unconditionally
  auto       Enable PAgP only if a PAgP device is detected
  desirable  Enable PAgP unconditionally
  on         Enable Etherchannel only
  passive    Enable LACP only if a LACP device is detected

0x2142-SW1(config-if-range)#channel-group 1 mode active
Creating a port-channel interface Port-channel 1

0x2142-SW1(config-if-range)#
*Jan 26 01:03:04.532: %LINEPROTO-5-UPDOWN: Line protocol on Interface Port-channel1, changed state to up

Let’s talk through a few notes about the above configuration. In order to enable etherchannel, we only need to configure two commands: channel-protocol and channel-group. The channel-protocol command tells the switch which etherchannel protocol to use for negotiation (LACP in this case). The channel-group command provides two necessary components: the group number and mode. The group number is just a device-local identifier for which group to add these ports to. When we specified group 1, the switch adds both Eth0/0 and Eth0/1 into the new logical interface Port-Channel 1.

The etherchannel mode is also important. The two primary options we want to look at for LACP are active and passive. Active tells the switch to preemptively send out LACP negotiation packets. In this case, the switch really wants the ports to become a bundle and will ask its partner device for an etherchannel to be formed. Passive mode tells our switch to only negotiate if another device wants to. In this mode, our switch won’t send out any etherchannel negotiation packets unless its partner device does so first.

Generally speaking, the most common configuration is to set the mode on both devices to active. This ensures that both devices actively participate in trying to establish an etherchannel. Placing one device in active and one in passive will work as well. However, if both devices are placed into passive mode, an etherchannel will never form.
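If it helps to see the rule written down, the active/passive combinations above boil down to "at least one side has to initiate". Here is a tiny sketch of that logic; the function name is my own, and it only models the two LACP modes (not PAgP or mode on):

```python
def lacp_bundle_forms(local_mode, remote_mode):
    """True if two LACP ends would negotiate, per the active/passive rules above."""
    valid = {"active", "passive"}
    if local_mode not in valid or remote_mode not in valid:
        raise ValueError("this sketch only models LACP modes: active, passive")
    # At least one side must send negotiation packets for a bundle to form.
    return "active" in (local_mode, remote_mode)


for pair in [("active", "active"), ("active", "passive"), ("passive", "passive")]:
    print(pair, "->", lacp_bundle_forms(*pair))
```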

Validation

So how do we validate that the etherchannel has formed correctly? One way is using the show etherchannel summary command:

0x2142-SW1#show etherchannel summary
Flags:  D - down        P - bundled in port-channel
        I - stand-alone s - suspended
        H - Hot-standby (LACP only)
        R - Layer3      S - Layer2
        U - in use      N - not in use, no aggregation
        f - failed to allocate aggregator

        M - not in use, minimum links not met
        m - not in use, port not aggregated due to minimum links not met
        u - unsuitable for bundling
        w - waiting to be aggregated
        d - default port

        A - formed by Auto LAG

Number of channel-groups in use: 1
Number of aggregators:           1

Group  Port-channel  Protocol    Ports
------+-------------+-----------+-----------------------------------------------
1      Po1(SU)         LACP      Et0/0(P)    Et0/1(P)

From the output above, we see that there is one group configured with the group ID of 1. It shows that both Eth0/0 and Eth0/1 have been added into the Port-channel 1 interface. The (SU) next to the Port-channel interface indicates that the etherchannel is up (U) and configured for layer 2 (S). I mentioned earlier that spanning-tree only worries about the port-channel interface, not the individual member ports. We can also check that out by using the show spanning-tree command:

0x2142-SW1#sh spanning-tree vlan 20
VLAN0020
  Spanning tree enabled protocol rstp
  Root ID    Priority    32788
             Address     aabb.cc00.1000
             This bridge is the root
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec

  Bridge ID  Priority    32788  (priority 32768 sys-id-ext 20)
             Address     aabb.cc00.1000
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec
             Aging Time  300 sec

Interface           Role Sts Cost      Prio.Nbr Type
------------------- ---- --- --------- -------- --------------------------------
Et0/2               Desg FWD 100       128.3    Shr
Et0/3               Desg FWD 100       128.4    Shr
<-- Output omitted -->
Po1                 Desg FWD 56        128.65   Shr
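Going back to the show etherchannel summary output for a moment: if you wanted to check those flags from a script instead of by eye, a rough sketch might look like the following. The function names are hypothetical, the flag table is only a subset of the legend shown earlier, and it assumes the row layout matches the sample output from this lab (other platforms may format it differently):

```python
import re

# Subset of the flag legend printed by 'show etherchannel summary'.
FLAGS = {"D": "down", "P": "bundled in port-channel", "I": "stand-alone",
         "s": "suspended", "R": "Layer3", "S": "Layer2", "U": "in use"}

# Matches entries such as 'Po1(SU)' or 'Et0/0(P)'.
ENTRY = re.compile(r"(\S+?)\(([A-Za-z]+)\)")


def _decode(field):
    """Turn 'Po1(SU)' into ('Po1', ['Layer2', 'in use'])."""
    m = ENTRY.fullmatch(field)
    if m is None:
        raise ValueError(f"unexpected field: {field!r}")
    return m.group(1), [FLAGS.get(c, c) for c in m.group(2)]


def parse_summary_line(line):
    """Split one group row into group ID, port-channel, protocol and member ports."""
    fields = line.split()
    return {
        "group": int(fields[0]),
        "port_channel": _decode(fields[1]),
        "protocol": fields[2],
        "ports": [_decode(f) for f in fields[3:]],
    }


row = "1      Po1(SU)         LACP      Et0/0(P)    Et0/1(P)"
print(parse_summary_line(row))
```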

Making Configuration Changes to an Etherchannel

Now that we have a working etherchannel, there are a few things that need special attention. The individual port configurations, Eth0/0 and Eth0/1 in this case, need to match at all times! Port configuration mismatches are an easy way to inadvertently bring down the port-channel. The good thing is that we now have a convenient Port-Channel interface which we can use for configuration. This logical port will replicate any configuration changes to all member ports.

! Let's jump into our Port-Channel 1 interface and configure a trunk for VLAN 20
0x2142-SW1(config)#int po1
0x2142-SW1(config-if)#switchport mode trunk
0x2142-SW1(config-if)#switchport trunk allowed vlan 20
! Now we can check the individual port configs:
0x2142-SW1(config-if)#do sh run int e0/0
Building configuration...

Current configuration : 176 bytes
!
interface Ethernet0/0
 switchport trunk allowed vlan 20
 switchport mode trunk
 channel-protocol lacp
 channel-group 1 mode active
end

0x2142-SW1(config-if)#do sh run int e0/1
Building configuration...

Current configuration : 176 bytes
!
interface Ethernet0/1
 switchport trunk allowed vlan 20
 switchport mode trunk
 channel-protocol lacp
 channel-group 1 mode active
end

Easy enough, right? The configuration changes for the trunk are now on both Eth0/0 and Eth0/1.

Troubleshooting Etherchannels

There is always a possibility that something goes wrong - so let’s take a quick look at some common problems and how to fix them.

Remember how I said that the member port configurations had to match? Here’s what happens if we make a configuration change on only one of the two member ports:

0x2142-SW1(config)#int eth0/1
0x2142-SW1(config-if)#switchport trunk allowed vlan 30
0x2142-SW1(config-if)#
*Jan 28 20:43:55.458: %EC-5-CANNOT_BUNDLE2: Et0/1 is not compatible with Et0/0 and will be suspended (vlan mask is different)

Eth0/1 immediately gets put into a suspended state, and is no longer active in the port-channel interface. In this case the switch gives us a good hint as to what’s wrong - vlan mask is different. Error messages will vary slightly, but a suspended port is easy to fix by comparing individual port configurations and fixing the mismatch.
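Comparing two port configs by eye works fine for a handful of lines, but the same comparison is easy to script. Here is a minimal sketch, assuming you already have the show run int output for each member port as text (the function name is my own, and how you collect the output is up to you):

```python
def config_mismatch(port_a, port_b):
    """Return the config lines that appear on only one of two member ports."""
    def body(cfg):
        lines = (line.strip() for line in cfg.splitlines())
        # Skip blanks, '!' separators, 'end', and the 'interface ...' header,
        # since the header is expected to differ between ports.
        return {l for l in lines
                if l and l not in ("!", "end") and not l.startswith("interface ")}

    a, b = body(port_a), body(port_b)
    return sorted(a - b), sorted(b - a)


eth0 = """interface Ethernet0/0
 switchport trunk allowed vlan 20
 switchport mode trunk
 channel-group 1 mode active
end"""
eth1 = eth0.replace("Ethernet0/0", "Ethernet0/1").replace("vlan 20", "vlan 30")
print(config_mismatch(eth0, eth1))
```

Anything that shows up in either list is a candidate for the mismatch that got the port suspended.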

Here’s another one:

*Jan 28 21:06:07.346: %EC-5-L3DONTBNDL2: Et0/0 suspended: LACP currently not enabled on the remote port.
*Jan 28 21:06:08.009: %EC-5-L3DONTBNDL2: Et0/1 suspended: LACP currently not enabled on the remote port.

This error message can mean a few things - the common one being exactly what it states! Check both sides of the connection, and ensure that LACP is configured on each device. This error message can also occur on certain mismatches - like if one side is running as a Layer 2 etherchannel, but the other side is running as Layer 3.

One more:

Jan 28 20:83:55.458 %ETHPORT-5-IF_DOWN_PORT_CHANNEL_MEMBERS_DOWN: Interface port-channel1 is down (No operational members)

The above message is also somewhat self-explanatory. In this case, the switch is unable to bring up the port-channel interface, because none of the underlying member ports are coming online. Troubleshoot what might be wrong with those ports first, then the port-channel should come up.

Hope this was useful! In a later post, we’ll dig into more configuration and considerations - like packet hashing, layer 3 etherchannels, and how packets are weighted between interfaces.

Questions? Drop them in the comments below!

One Year Later

Tue, 02 Jan 2018 08:51:47 +0000

2017 is over! Now we’re on to whatever 2018 may bring. The past year has been very interesting for me. For one thing, it was the first full year of this blog which started in December of 2016. While I didn’t quite accomplish everything here that I had hoped for, I still managed to do a lot more than I realistically expected.

One of the things I’ve had problems with in the past is keeping a blog updated. Usually I would start, write an entry or two, then completely forget about it. I never thought I had good enough content to warrant sharing, or I was trying to keep to too narrow a topic. So when I started this blog, I said that I was going to focus on networking but leave it a bit more open-ended. I also wanted to try sharing some more generalized IT experience and career advice. I started off with a list of topics that I wanted to write about, and even began pre-writing a few of them so that I had a bit of content lined up ahead of time.

Even though I told myself originally that I was only going to post something whenever I had something good to share, I still ended up setting myself a goal of writing one thing a week. For a while this actually worked out, because I was forcing myself to think about it more often - but eventually I ran out of immediate ideas. I had to remind myself that it was more important for me to write/post content that was actually worth reading, not just having something available on a weekly basis. Even so, I’ve managed to post 44 items since I started, 40 of which were in 2017 - Much better than I had actually anticipated.

So here is to 2018 - I’m not going to try and set any strict goals for myself in terms of posting content (or at least I’ll tell myself that now). However, I’m also going to try and work on getting better at putting up content. I spend too much time waiting for that ‘great thing’ to write about, and not enough time on just writing something that might not be particularly fantastic - even though it might still benefit someone. I feel like I have a lot to share, and not everyone is an expert. Continuing to think that much of my content ‘isn’t good enough to post’ is just holding me back. I’m going to try and be better this year about this - and not keep waiting for only the ‘great things’ to share.

The other big thing I’ll be focusing on this year is studying for the CCIE R&S, which I wrote about in October. I bought a few books and found some training videos, which I’ve been slowly working through… and when I say slowly, I mean probably much slower than I should be. Now that the holidays are over and it’s a new year, I’ll be pushing myself a bit more to actually make progress. My current tentative goal for attempting the written exam is June - so I’m hopeful that I’ll be able to make it work.

The blog has been fun so far, and I’ve done a bit more than I thought I would with it. However, there was one thing over the past year that I wasn’t really expecting at all - getting to talk with a bunch of other people who are interested in networking/IT. I’ve mostly been on Twitter, and more recently on Reddit’s /r/networking and /r/cisco. There have been a ton of people I’ve gotten to talk to, get opinions from, or even a few people that I’ve been able to help out with some of their problems. A large portion of my career has been limited to working with just a small team of people, few of whom actually have much interest in networking. I’ve really enjoyed the experiences over the past year, and I’m really looking forward to what else might come. If you’re one of the people I’ve interacted with over the past year, thank you!

A new year comes with new challenges, problems, and complaints - but it also comes with new accomplishments and new things to look forward to. I hope that all of you reading this are able to set new goals for the year and surpass your expectations!

What's wrong with VLAN 1?

Tue, 05 Dec 2017 09:07:06 +0000

Earlier this year I was involved in a string of interviews for an open network engineer position. The questions and scenarios provided during the interviews were aimed at someone mid-level. One of the more basic-ish scenario questions I like to ask is the following:

Given a brand new switch, can you provide me the commands you would use to configure the first four ports for hosts in VLAN 15?

This question is always interesting because I get such a wide variety of responses. You can certainly filter out people quickly who have never touched a switch. Some people will start with conf t, while others just jump straight into setting the VLAN tag. Some people specify that they’ll use an interface range command, while others get confused and want to configure the ports as a trunk. In fact, during this year’s interviews I had quite a significant number of people who completed the scenario by providing the commands to configure a trunk port, instead of static access. One thing that struck me as noteworthy is that the people who did this also provided the commands they needed to change the default native VLAN.

For a vast majority of networking devices (maybe all of them), the default/native VLAN for a trunk port is VLAN 1. This is not the best configuration for reasons we’ll get into a bit later - but unfortunately it needs to be manually changed. In every interview where the candidate suggested this be changed, I followed up by asking why. I asked so that I could find out two things: how well the individual explains concepts, and whether this is something they just do because they were taught to or if they actually understand the logic behind it. Surprisingly, a vast majority of the candidates could only provide the reasoning that “Well, VLAN 1 is bad” - but they couldn’t elaborate on why.

So why is VLAN 1 bad to use?

Technically, VLAN 1 itself isn’t the problem. The real issue is the concept of a default native VLAN: an attacker can take advantage of how switches handle it. Since VLAN 1 is the default for most vendors, it becomes a well-known configuration for attackers to abuse. If every vendor had set the default native VLAN to 52, then we would still run into the exact same problem except the ‘bad’ VLAN would be 52. So let’s back up just a little here and explain a bit of background. The concept of a VLAN is a segmented logical network. Rather than requiring a different piece of physical hardware to keep hosts separate, we can assign them into different virtual LAN segments. This is accomplished by ‘tagging’ each Ethernet packet with a VLAN ID. Internally, the switch code does not allow packets that contain one VLAN ID tag to be sent to hosts configured with a different VLAN ID tag.

Let’s look at a few practical examples of how this works. When you configure a host port, you would configure the switch with the appropriate VLAN ID tag that the host should be assigned to. The actual server may know nothing about the VLAN it is in, but the switch knows to inject that VLAN tag into the Ethernet headers of every packet received from that server. For the rest of that packet’s life on the network, every switch reads that VLAN tag to assist in forwarding decisions. Trunk ports are commonly used between switches, and these ports are enabled to carry multiple VLAN ID tags. The problem here is that each switch is expecting that the packets it receives already contain a VLAN tag in the Ethernet header.

So what’s the problem with the default native VLAN? This is a special type of VLAN in which the switch never attaches a VLAN tag to the headers. Whenever we connect two switches via a trunk port, we are likely configuring multiple VLANs that need to cross that link. However, the switches themselves still need to communicate directly with each other (for protocol negotiations/spanning-tree), and each switch isn’t necessarily going to know what VLAN the other wants to use for this management traffic. This is where the concept of a native VLAN comes in. For every configured trunk link, each switch has a default native VLAN for which it expects to receive packets with no VLAN tag headers. Even though normal network traffic crossing a trunk link is going to require a VLAN tag in the headers, the switch-to-switch control-plane communication is sent with no header present.

This is where VLAN 1 becomes a problem, because of how native VLANs are processed/interpreted by the switch. Whenever a switch receives a packet whose VLAN ID is set to the native VLAN, it knows that packet doesn’t need to contain a VLAN header - so it removes the VLAN ID header.

How could this be exploited? Well for example, let’s say that we had a network where every developer PC was using a trunk port and permitted to use VLAN tags. This might be so they could run local VMs for testing that connect to both the DEV and QA networks. If we leave the default native VLAN as 1, then a malicious developer could exploit this to gain access to another segment. This is accomplished by using a software package to double-tag an Ethernet packet with two separate VLAN ID headers. The first VLAN tag header will be set to the native VLAN (VLAN 1), and the second header will be set to the target VLAN - let’s say we use VLAN 20 for the accounting network. When the switch receives the packet across the trunk link, it will read the Ethernet headers. When it processes the first VLAN tag and sees that it matches the default native VLAN, the switch strips this header - which then leaves the second VLAN tag header for VLAN 20. When the switch goes to forward this traffic to its destination, the remaining VLAN header will allow the packet to bypass any security measures and be directly forwarded on VLAN 20.
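To make the tag-stripping behavior a bit more concrete, here is a minimal Python sketch of the double-tagging attack. It is a deliberately simplified model, assuming two switches joined by a trunk, and the function names are made up for illustration - this is not how any real switch is implemented.

```python
from typing import List


def egress_first_switch(tags: List[int], native_vlan: int) -> List[int]:
    """Tags left on the frame after the first switch sends it over the trunk.

    Frames in the native VLAN cross the trunk untagged, so an outer tag
    that matches the native VLAN is stripped. Other tags are left alone.
    """
    if tags and tags[0] == native_vlan:
        return tags[1:]
    return list(tags)


def classify_second_switch(tags: List[int], native_vlan: int) -> int:
    """VLAN the second switch places the frame in.

    It only looks at the outermost remaining tag; an untagged frame
    is assumed to belong to the native VLAN.
    """
    return tags[0] if tags else native_vlan


def delivered_vlan(tags: List[int], native_vlan: int) -> int:
    """VLAN the frame ends up in after crossing the trunk."""
    remaining = egress_first_switch(tags, native_vlan)
    return classify_second_switch(remaining, native_vlan)


# Attacker sends outer tag 1 (native VLAN) and inner tag 20 (accounting).
hopped = delivered_vlan([1, 20], native_vlan=1)

# Same frame, but the trunk's native VLAN was moved to an unused VLAN 999.
contained = delivered_vlan([1, 20], native_vlan=999)
```

With the native VLAN left at 1, the frame lands in VLAN 20. With the native VLAN moved to an unused ID, the outer tag is never stripped and the frame stays in VLAN 1.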

Well what if we don’t provide end users with a trunk port? Could they still execute this type of attack? Remember that the default switchport configuration allows for dynamic negotiation - where the connecting computer can tell the switch whether or not it needs an access port or a trunk port.

If VLAN 1 is well-known as the default, what should the native VLAN be set to?

Anything you like! The key thing is that it should be a VLAN which has no access to any network resources - so it should be a VLAN with no hosts and no gateway. I usually don’t even create the VLAN on the switch itself, therefore immediately black-holing all traffic sent to that VLAN.

How else can this be prevented?

Always statically set your end user ports to switchport mode access, and enable switchport nonegotiate. This will prevent the switch from allowing port negotiations, which prevents a user from tricking the switch into assigning a trunk port.

Newer Cisco switches also support a global configuration command: vlan dot1q tag native. This command will force the switch to require VLAN tags to be present on packets in the native VLAN. See more detail here. This will need to be configured on all switches in your network to be effective.

Do you change the native VLAN in your network? Or is it not a big security concern for you? Comment below!

How to Improve: Stop Doing, Start Understanding

Tue, 28 Nov 2017 08:07:54 +0000

There is a key to being successful at just about any IT job: Stop just doing work, and start understanding what you’re doing. Might seem like an odd thing to say right? But this is something that I have seen confuse engineers at earlier points in their careers.

In a lot of jobs, the initial training you receive is fairly straightforward. You are usually taught how to respond to a task by following a series of steps to get an intended result. Training like this is great - It helps to achieve consistency and efficiency. You bring in any new person and give them the exact same troubleshooting steps, implementation steps, and/or validation steps - and you’re likely to get a similar result every time.

This is the point where I have seen far too many people stop though. They are happy with doing their job, and don’t necessarily want to progress their career or maybe they don’t know how. These engineers will continue to produce decent work at the quality that they were taught at. Even for those who try and progress further (maybe through certifications or otherwise), there is a difference between learning new technologies/concepts and truly understanding them. For some people out there, having this basic level of skill is all they really want - and if that’s their goal in life, then this type of task-based knowledge is perfect. But if you really want to master the domain of technology that you are interested in, then you need to put forth the time and effort into gaining that understanding.

For me, a true understanding of a technology means that you’re able to speak confidently about how something works, abstract concepts to apply to similar products, and mentally walk through how the technology might handle a given situation.

Let me provide an example or two that might help to frame this a bit better. Given a particular network, an engineer might know that for traffic to get from point A to point B, it travels through two firewalls. Every time there is a new request to permit a new traffic flow, that engineer knows that they must make a configuration change to one or both of those firewalls to allow that traffic through. However, to this engineer, the inner workings of that firewall are a complete mystery. The firewalls are complete black boxes which take in traffic through one interface and spit it out another. So when there is a technical issue within the firewall appliance, they may be extremely limited in their troubleshooting abilities - and they may have no choice but to call the vendor for support.

Another engineer who has a deeper understanding of how firewalls work might see the problem differently. This engineer knows that for every packet received, the firewall follows a specific flow of processing. That flow could include any number of things, including NAT, routing, firewalling, IPS, VPN, etc. This engineer knows which order those things get processed and what effects those processes can have on the traffic. So when we have a technical issue with this firewall appliance, this engineer may be able to mentally walk through the packet flow/processing and determine where the problem may be - sometimes without even looking further than general log files.

Another example is something that I see quite often. An engineer is asked to implement something - let’s say a new port configuration for a server. They follow their known process for implementing this change, but something doesn’t quite work right. So they change settings or maybe delete the entire port configuration and start over - but eventually they get it working. They don’t know why it didn’t work the first time, or what caused it to work the second time - but it works now, so they aren’t concerned with it. However, it’s possible this engineer ends up running into this same problem more than once. The ideal step here would be to step back and look at what was different between the original configuration and the working configuration. Maybe there is an additional command in the original configuration which seems suspect - a quick search of the internet could turn up an explanation behind why that command was preventing the port from working as expected. After that research, that engineer would not only know why their configuration didn’t work - but now they know what that command actually does, which could be beneficial in a future scenario.

As I briefly mentioned earlier, not all IT admins or engineers are concerned with gaining a significant level of understanding. There are those who want to come to work, get their job done, then go home to their families - and there is absolutely nothing wrong with that. For me personally, I can’t handle running into a problem and not knowing exactly what the cause was. An issue that “fixes itself” is never an acceptable answer to me, because if something caused the problem once then it can certainly happen again. I don’t enjoy having to blindly configure an option on a system without knowing what’s going on in the background. Some people might call me crazy, but this seems to be a skill/trait shared by many higher-level engineers I have worked with.

So how do you get to a point where you really understand a system? For me, it’s been a lot of playing in labs, reading vendor documentation, and not settling until I feel like I can speak confidently to how something works. I never feel truly comfortable in a new company until I can mentally walk through every device a packet touches from source to destination - and know which devices’ configs/routes may have an impact on that flow. Any time there is a problem with something, I spend time digging into it until I know what caused it - even if the problem is only momentary and goes away. Not only do I then understand why the problem happened, but I also learn how to quickly identify similar issues again.

Especially if you’re still in the beginning stages of your career, I can’t stress enough how important it is to understand the technologies you’re responsible for. Take the extra time and study it, play with it, break it and fix it again. Know how things work and what their behaviors are under different conditions. Don’t settle for “It just works because it does”. One of the key skills I’ve seen in engineers who truly understand their domain of technology is the ability to abstract concepts to apply to other systems. Someone who has a deep understanding of routing and switching technologies might prefer to work with a certain vendor, but given any router/switch they can make it work.

Have you worked with anyone who you think has a great understanding of what they do? What other skills or traits do they display that makes them successful? Comment below!

What's Going Out of Your Network?

Tue, 21 Nov 2017 08:00:53 +0000

Over this past weekend I purchased a few upgrades to my home network/lab. One of which was upgrading my older Ubiquiti 802.11n wireless access point to the newer 802.11ac model they have out. The other purchase was a new external firewall. I had previously been running on a Cisco ASA5505, but the device is older and doesn’t support some of the newer features I would like to play with. In addition, in my current job I no longer support Cisco firewalls. So I bought a Juniper SRX300 - which should allow me to play with some new features I want, plus it can be a playground for testing things I want to do at work.

Anyways - after I cut over to my new firewall, I’ve been digging through logs to make sure that I didn’t miss anything. I have all of my device/lab logs going into an instance of Splunk Light (their free product). It makes it easy to collect and search through logs, and it’s extremely easy to set up and use. A few quick queries and I came across one or two minor things that needed to be tweaked on my firewall - but I also saw some traffic that I wasn’t sure about.

So that brings me to my question of the day: Do you know what’s going out of your network?

A lot of people I know only use firewalls to block inbound access, both in homes and businesses. For homes it’s more understandable since most average people aren’t network admins. However, it still surprises me how many businesses are willing to add a ‘permit any any’ out to the internet. Yes, I block all traffic by default through my home firewall, both inbound and outbound. Yes, it’s a bit of a pain sometimes when something isn’t quite working right - but it’s usually a quick ACL change, and overall I would rather take the minor inconvenience for the security gains.

When I originally built the firewall policy for my network, I started off simple. I know we need DNS, HTTP, and HTTPS outbound - easy enough, right? Then I started watching logs for blocked traffic and trying to decipher what else was trying to communicate outbound using another port. Some things were very easy to determine - TCP 5228 out to a Google-owned IP? Yep that’s actually a known thing - a lot of Google services, like Chrome, will use this. Some other things were harder to figure out - like game consoles which use a very wide range of non-standard ports. Many of these weren’t really documented well by the console manufacturer, which meant that I spent a while between browsing forums and some trial and error.

This really gets interesting when you start digging past the stuff you know about. What about a PC in my home network that is trying (and getting blocked) to reach a few random IPs in Korea and Russia over a bunch of non-standard TCP ports? Yeah that doesn’t make me feel comfortable. Could it be a legitimate application, or is it malware? A few quick searches on the internet don’t turn up anything immediately helpful. For the time being, I’ll keep stuff like this blocked until I have time to spin up some packet captures to see what this traffic is actually doing.

For a business I feel like this type of thing is even more important than just what I’m doing at home. You certainly don’t want end users (or servers) possibly running strange applications, which might be transferring data to some unknown external party. It seems like larger companies have a better handle on restricting outbound access than most smaller companies, who likely don’t have the time or see the value. However, I’ve also worked with a few larger organizations who still permit all user and server traffic out to the internet with no filtering in place.

If you’re not already blocking outbound traffic - Get some good logging in place. Use something like Splunk Light and start collecting firewall logs for everything going out of your environment. Start with the basics - create a list of the software/ports you know you’ll need to open. After a few weeks, start digging through the logs to figure out what else might need to be added to your list. Once you feel comfortable that you’ve compiled a sufficient base ruleset, schedule a time to make the change and put it in place. Start blocking the unknown traffic - and only permit when necessary.
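As a rough illustration of the "dig through the logs" step, here is a small Python sketch that counts outbound connections by protocol and port, ignoring anything already on the known list. The log line format (`proto=... dst=... dport=...`) is an assumption made up for this example - real firewall logs will look different, so the regular expression would need to be adapted.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical log format - adjust the pattern to match your firewall.
LINE_RE = re.compile(r"proto=(?P<proto>\w+)\s+dst=(?P<dst>\S+)\s+dport=(?P<dport>\d+)")

# The basics we already know we need: DNS, HTTP, HTTPS.
KNOWN = {("udp", 53), ("tcp", 53), ("tcp", 80), ("tcp", 443)}


def unknown_outbound(lines: Iterable[str]) -> List[Tuple[Tuple[str, int], int]]:
    """Count outbound flows per (protocol, port) that are not on the known list."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that don't look like traffic logs
        key = (match.group("proto").lower(), int(match.group("dport")))
        if key not in KNOWN:
            counts[key] += 1
    # Most frequent first, so the busiest unknown ports get reviewed first.
    return counts.most_common()


sample_log = [
    "proto=tcp dst=198.51.100.10 dport=443",
    "proto=tcp dst=203.0.113.7 dport=5228",
    "proto=tcp dst=203.0.113.7 dport=5228",
    "proto=udp dst=192.0.2.53 dport=53",
    "proto=tcp dst=192.0.2.99 dport=6667",
]
review_list = unknown_outbound(sample_log)
```

Each entry in `review_list` is a candidate to research: either it gets added to the ruleset, or it stays blocked.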

How do you have your firewalls configured today? Do you permit everything or are you very restrictive? Comment below - I’m curious to see what other people are doing.

L2 Basics: Spanning-Tree Protocol

Tue, 14 Nov 2017 08:00:22 +0000

Spanning-tree protocol (STP) is one of those network technologies that is easy to forget about. It exists in the background of almost every network, and for the most part it does its job without any issues. However, there is still a huge benefit to understanding what STP does and how it works - because its default behaviors might not be the best for every network.

I’ve been making progress going through my CCIE books, and the earlier sections are focusing on layer 1 and layer 2 technologies. A lot of this is review from CCNP studies, but with STP the book starts to get into additional detail on the inner workings of the protocol - which I’m finding to be quite fascinating. In many of the companies where I’ve worked, I’ve found that a majority of the IT staff (whether sysadmins or network admins) don’t really have a good handle on how STP works and why it needs to be tuned. So this post is meant to cover spanning-tree at a very high level, and I’ll include some examples from issues I’ve seen in the past.

So what is spanning-tree protocol anyways?

At its most basic level, STP is a communications protocol used between switches to allow them to identify redundant paths in the network. The goal of STP is to figure out the most efficient L2 path through the network, then block all other paths to prevent loops. The best way I’ve heard STP explained is that it’s essentially a routing protocol for layer 2. Rather than routers communicating and exchanging routes to determine the best path through a network, all of the switches will talk to determine the best (loop-free) layer 2 path.

STP determines the best layer 2 path - but the best path to what?

When configuring a standard routing protocol (like EIGRP or OSPF), you might have a node that advertises a route for 10.10.10.0/24. All other routers in the network are going to select a best path to the router who originates this advertisement - but how does something like this work when we’re talking about layer 2?

Spanning-tree relies on the concept of having a single root bridge for each network. At the beginning of a spanning-tree process, all switches will hold a quick election to determine who the root bridge is - then each switch will figure out what its own best path is to that device. The election is decided by the priority set by the administrator, with the lowest value winning - but by default all switches are pre-configured with the same priority. In a tie, the switch with the lowest MAC address will win and become the root bridge.
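The election itself is simple enough to sketch in a few lines of Python. This is only an illustration: the switch names and MAC addresses below are invented, and 32768 is the standard default bridge priority. The lowest (priority, MAC) pair wins.

```python
from typing import List, NamedTuple, Tuple


class Bridge(NamedTuple):
    name: str
    priority: int  # lower is better; 32768 is the usual default
    mac: str       # e.g. "00:1e:aa:00:00:01"


def bridge_id(bridge: Bridge) -> Tuple[int, int]:
    """Comparable bridge ID: priority first, MAC address as the tie-breaker."""
    return (bridge.priority, int(bridge.mac.replace(":", ""), 16))


def elect_root(bridges: List[Bridge]) -> Bridge:
    """The bridge with the numerically lowest bridge ID becomes the root."""
    return min(bridges, key=bridge_id)


# Everything left at defaults: the lowest MAC (made-up values) wins.
defaults = [
    Bridge("SW1", 32768, "00:1e:aa:00:00:01"),
    Bridge("SW2", 32768, "00:1e:aa:00:00:02"),
    Bridge("SW4", 32768, "00:0d:aa:00:00:04"),
]
root_by_default = elect_root(defaults)

# Lower SW1's priority and it wins regardless of MAC address.
tuned = [Bridge("SW1", 4096, "00:1e:aa:00:00:01")] + defaults[1:]
root_after_tuning = elect_root(tuned)
```

With defaults, SW4 wins purely because its MAC address is the lowest. Once SW1 is given a lower priority, the MAC addresses no longer matter.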

What does that actually mean? More or less, one switch gets put in charge of defining the best path through the network. All other switches examine all of their redundant paths to the primary switch, then figure out which of those paths are more preferable than the others. An important note here, is that the “best path” selected is all from the specific viewpoint of whichever switch takes charge.

For an example, let’s use the following topology:

In this example, we have five switches and a firewall - which are used to provide connectivity to two network segments (NET1 and NET2). For each of the two network segments, there are a number of different paths that traffic could take to reach the firewall. Without spanning tree, NET1 might send traffic to SW4, which in turn would forward it to both SW2 and SW3. This sounds like a good thing, since we would use all available paths to try and reach the firewall - but in reality this causes serious problems: the firewall can receive packets out of order, and broadcast frames can loop between the switches indefinitely.

So for the example above, let’s assume that SW1 becomes our root bridge. SW1 is now in charge of determining what the best path through the network is. It does this by sending out messages on all ports connected to other switches, called Bridge Protocol Data Units (BPDU). In this message, SW1 asserts its role as the root bridge - and provides some information for other switches to use for path selection. Each switch will examine the message from SW1 to determine which of its uplinks is the most efficient path to SW1. Once each switch does this, it will forward on the message to downstream switches - this time adding in some of its own information (or path cost).

After all that is complete, we might be left with the following path below:

The green lines above show the final path that was selected. For NET1 to reach the firewall, it would use SW4, then SW2, then up to SW1. For NET2, it would use SW5 > SW2 > SW1. This leaves the orange links unused. In fact, spanning-tree will place these links into a blocking state. The switches might still listen on those links, just in case their neighbor starts advertising a better path - but they will not forward any data traffic on these connections. In the case of SW2 suddenly failing, SW4 and SW5 would still be aware of their connections through SW3 - and after a brief period would begin using those links to reach the firewall.

This is a very simplistic explanation, and there is a lot more in the background that actually happens during spanning-tree operation. There are a number of different STP standards that a switch can run, each with their own options for configuration and tuning. There are also methods of providing a loop-free path while still utilizing some redundant paths. I plan to cover some more detail on these topics in later posts.

So why should I care about STP?

Remember that part earlier when I said that if STP priority is not configured, then the switch with the lowest MAC becomes the root bridge? Well as it turns out, MAC addresses are the hardware addresses configured by the manufacturer - and these addresses increment as they produce new devices. So the lower MAC addresses are typically going to be the oldest equipment in your network. Unfortunately, this can have a dramatic effect on your network traffic if you’re not paying attention to STP.

From the earlier example, what would happen if SW4 became the root bridge? Maybe this was an old Cisco 2950 that someone forgot to replace and it’s just been left in the network. If the STP configuration went unmodified, then this switch would likely become the root bridge of our network. Let’s look at what that path might look like:

So in this case, SW4’s path to the firewall hasn’t changed. However, its best path to SW5 and NET2 is through SW3 - which means any traffic from NET2 to the firewall has to follow the path of SW5 > SW3 > SW4 > SW2 > SW1. Not only does that add more layer 2 hops that NET2 has to pass through, but it also adds more (unnecessary) load onto SW4. What if SW4 were so old that it still had 100M ports? It might get overwhelmed pretty quickly.

Now you might be thinking, “How often does this really happen?” Well, when I started at my last job they were experiencing a similar issue. The primary building had three floors, each with two Cisco 3548 switches to service users. Each of these switches linked back to a pair of Cisco 4500 core switches. All of the 3548 switches were purchased at the same time (far prior to the 4500s), and it turned out that one of them on the third floor had the lowest MAC address in the network. The entire layer 2 topology was then based on this switch as the central point of the network. This caused the interconnects between the core switches to be put into blocking mode - meaning that if a switch on the second floor needed to connect to the alternate core switch, then it would have to pass traffic through the third floor. A quick change to the spanning-tree priority (during a maintenance period) was all that was needed to put the core switches back in charge.

This doesn’t immediately make spanning-tree a bad technology. As with just about anything in IT, it’s something you need to understand and tune to fit your needs - otherwise you’ll just get less-than-ideal results. At another employer, I actually found out that the previous network administrator had manually disabled all of the redundant paths in the network - because he didn’t understand STP, and therefore thought that any redundant paths would cause a loop. Spanning-tree isn’t something we need to be afraid of - it just needs a little attention.

So next time you’re logged into one of the switches in your network, just run show spanning-tree and double-check that the switch you assume is your root bridge actually is.

Well I hope that this was helpful. As I mentioned earlier, I meant this as a fairly basic overview - but I intend on diving a bit deeper in later posts. The most fascinating part of networking to me is all the details on how things like spanning-tree actually work behind the scenes. Have any spanning-tree stories? Leave a comment below!

SRX Basics: Redundancy Groups and Failover

Tue, 18 Jul 2017 08:00:24 +0000

In last week’s post, we took a look at how to set up a chassis cluster on a Juniper SRX Firewall. So now that we have a basic cluster setup - let’s explore some of the additional options and configuration items.

Redundant Ethernet Interfaces

So first things first - once you have a cluster configured, you’ll probably want to configure a few sets of redundant ethernet interfaces. These interfaces are also often referred to as reth interfaces. This will create a shared interface between your SRX pair, where you can configure IP address and VLAN information to be shared between the two. Let’s say that we have a Juniper SRX 1500 cluster, and we want to create a redundant interface for one of our 10Gb ports. Here is how we would do that:

root@testsrx# set interfaces xe-0/0/16 gigether-options redundant-parent reth1
root@testsrx# set interfaces xe-7/0/16 gigether-options redundant-parent reth1
root@testsrx# set interfaces reth1 redundant-ether-options redundancy-group 1

In the config above, we first take both of our interfaces (xe-0/0/16 on node0, and xe-7/0/16 on node1) and tell them that they now belong to a redundant interface group (reth1). Next, we enter into the reth1 config, and associate it to a redundancy group.

You’re also going to need to keep in mind that the SRX requires you to specify how many redundant ethernet interfaces will be configured. This is likely a memory thing, since each SRX also has a different maximum number of reth interfaces that can be configured. For example, if you tell the SRX that you need 5 reth interfaces, then the SRX will allocate system resources to manage those interfaces. In order to set the number of available reth interfaces, we’ll use the following command:

root@testsrx# set chassis cluster reth-count 5

Redundancy Groups

A redundancy group, or RG, is used as a container for logically grouping redundant interfaces/virtual routers which must fail over together. A single RG can be configured as primary on one of the two active SRX firewalls in a cluster - with the ability to fail over to the other node. For example, we might be planning on only using one virtual routing instance on our SRX - so we would create RG1 and assign our interfaces to belong to it.

A quick note - all interfaces in a single virtual router must belong to the same RG. This way the virtual routing instance and all of its associated interfaces will always run on the same SRX node. In order to achieve an active/active firewall configuration, you would need to create two separate virtual routers, each with their own reth interfaces and different RGs. Then you would make RG1 primary on node0, and RG2 primary on node1.

In most configurations, dumping all of your reth interfaces into RG1 will be sufficient. You’re likely going to want to set up a priority for each RG - and maybe even preemptive fail-over. In order to do that - you’ll have to configure each cluster member with a priority:

root@testsrx# set chassis cluster redundancy-group 0 node 0 priority 200
root@testsrx# set chassis cluster redundancy-group 0 node 1 priority 50
root@testsrx# set chassis cluster redundancy-group 1 node 0 priority 200
root@testsrx# set chassis cluster redundancy-group 1 node 1 priority 50
root@testsrx# set chassis cluster redundancy-group 1 preempt

The higher priority wins here - so if you set node0 to a higher priority and preempt is enabled, then node0 will actively try to take ownership of RG1. I would rather not set preempt on RG0 for a few reasons - which we’ll cover in the next section. Priorities can also be modified using interface monitoring, so if a particular interface goes offline we can decrement the priority of that node (also covered below).
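To show how priority and preempt interact, here is a small Python sketch. It is a simplified model I put together for illustration (the function name is made up), and it ignores manual failover, interface monitoring, and node failures.

```python
from typing import Dict, Optional


def rg_primary(priorities: Dict[str, int], preempt: bool,
               current: Optional[str] = None) -> str:
    """Pick the primary node for one redundancy group (simplified model).

    - With no current primary (initial election), the highest priority wins.
    - With preempt enabled, the highest-priority node takes over.
    - Without preempt, whoever is already primary keeps the role.
    """
    best = max(priorities, key=lambda node: priorities[node])
    if current is None or preempt:
        return best
    return current


rg1 = {"node0": 200, "node1": 50}

# node1 is currently primary (say, after node0 rebooted and came back).
with_preempt = rg_primary(rg1, preempt=True, current="node1")
without_preempt = rg_primary(rg1, preempt=False, current="node1")
```

With preempt, node0 reclaims RG1 as soon as it is able to. Without it, node1 stays primary until something else forces a failover.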

A Note About RG0

You might notice from the last post that your output of show chassis cluster status already showed two redundancy groups: RG0 and RG1. RG0 is only used for management traffic and manages the routing engine for your SRX. Unfortunately, this can lead to some weird behaviors that you might not be expecting.

For example, whichever node is primary for RG0 is the only node that collects interface and monitoring statistics. If you’re using a monitoring tool that polls data from both of your SRXs, then the secondary for RG0 will report nothing about its interfaces, CPU, etc. This is also true if you log into the actual SRX itself - a show interfaces will actually return a bunch of default values, including showing that your ports are half-duplex. Don’t panic though, this is just an oddity of RG0. If you log back into the primary node for RG0, then it will show all of the proper statistics for both SRX firewalls.

Due to these weird things about RG0 - I prefer to always leave it on node0. Therefore I know which one to log into whenever I need to look at something, or which SRX to check in our monitoring tools. It’s also worth noting that whichever SRX is primary for RG0 is also the node you’re going to need to log into for configuration changes - even if all of your other redundancy groups are on the other SRX.

Weird, right?

Oh, and be warned that since RG0 controls the routing engine, a failover of this RG can cause brief outages. This is primarily because the routing table and firewall state information will be lost. The secondary node has to spin up new processes for the routing engine, and at least currently there isn’t a graceful sync of all of that data.

Interface Monitoring

I mentioned setting device priorities a bit earlier. Interface monitoring is going to be the primary method for dynamically triggering a failover based on link state. One example might be that you’re using an SRX cluster for your edge firewall, and you want it to automatically fail over if the primary loses its internet uplink.

Note that you must configure the physical interfaces here, not the redundant ethernet interfaces:

root@testsrx# set chassis cluster redundancy-group 1 interface-monitor xe-0/0/16 weight 255

Remember when we set the priorities of our firewalls earlier? Node0 was set to 200, and node1 at 50. Each redundancy group also starts with a monitoring threshold of 255, and every monitored interface is assigned a weight. When a monitored interface goes down, its weight is subtracted from that threshold. Once the threshold reaches 0, the node’s priority for that RG drops to 0 and the RG fails over to the other node. So here we are saying that xe-0/0/16 on node0 is worth the full 255 points - if it goes down, node1 takes over RG1. A lower weight (say 160) would not cause a failover on its own; it would take a second monitored interface failing to use up the rest of the threshold. The reverse is also true - when xe-0/0/16 comes back up, node0’s priority will go back up to 200. Since preempt is enabled on RG1, node0 will then take back ownership.

Manual Failover

There is a pretty good chance at some point you might need to perform a manual failover of your SRX redundancy groups. Maybe you need to do some maintenance or upgrades, or you just want to make sure failover works as you expect. In either case, the commands to do this are pretty straightforward:

root@testsrx> show chassis cluster status
Cluster ID: 5
Node       Priority       Status       Preempt       Manual failover
Redundancy group: 0 , Failover count: 0
node0      200            primary      no            no 
node1      50             secondary    no            no

Redundancy group: 1 , Failover count: 0
node0      200            primary      yes           no 
node1      50             secondary    yes           no

root@testsrx> request chassis cluster failover redundancy-group 1 node 1 

root@testsrx> show chassis cluster status 
Cluster ID: 5
Node       Priority       Status       Preempt       Manual failover
Redundancy group: 0 , Failover count: 0
node0     200             primary      no            no 
node1     50              secondary    no            no

Redundancy group: 1 , Failover count: 1
node0     200             secondary    yes           yes
node1     255             primary      yes           yes

Okay - so let’s talk about a few things that have happened here. I always recommend that you run a show chassis cluster status first, so you know where things already stand. Then we can proceed by requesting a failover. To do this, you have to specify which redundancy group you want to fail over, and which node you want to become the new primary. So in this case, we made node1 the new primary of RG1.

You might also notice that the priorities have changed, and the devices are marked as being in a manual failover state. This is important, because you cannot manually fail back until you reset this state. That’s right - if you tried to run the failover command again to move RG1 back to node0, it will not work. An automatic failover due to hardware failure or interface monitoring will still be permitted. In order to perform a manual fail-back to node0, we have to run the following reset command:

root@testsrx> request chassis cluster failover reset redundancy-group 1

Hopefully between last week’s post and this one, you should have a good handle on the basics of configuring a chassis cluster on your new pair of Juniper SRX firewalls. Let me know in the comments below if this helped you!

Migrating IP Addressing Schemes

Wed, 24 May 2017 08:00:27 +0000

Back a few months ago, I wrote a bit about why it is important to have a good design for IP addressing schemes (part 1 and part 2). As a brief refresher, the situation I found myself in was an environment where practically everything was assigned a 10.x.x.x/16 subnet - even if we only needed a handful of hosts. When I arrived at the company, we were already down to less than 1/3 of the 10.x.x.x range remaining unallocated (with multiple new locations already being discussed).

The IP addressing design that I came up with limited our typical data center deployment from 4-6 /16 blocks to a single /16 block for each location. For all new locations since then, this new design has been used and it has proved to be extremely beneficial. The ability to use proper address summarization has made firewall rules, routing, and VPN tunnel configuration much simpler. But what about all the old locations which still had several /16 blocks? None of these needed more than a single /16, but we have thousands of systems that would need to be re-addressed. Not something that was going to happen overnight. So let’s take a look at some of the methods we employed for migrating from one IP addressing scheme to another.

Have a plan - The first step is to have a good handle on the overall situation and how to get from point A to point B. You’re likely going to need buy-in from other teams to help get there, and this could easily be a multi-year project depending on the number of systems. When you meet to discuss the re-addressing project, you need to be pretty strong when describing the benefits of the new system - otherwise no one will want to help.
Enforce the standard for anything new - The easy target for any transition is to hit new stuff first. For example, we started using the new range in a brand new location first. Anything being deployed that requires a new IP address allocation needs to be using the new scheme. We don’t want to perpetuate the scheme we are trying to get rid of.
Transition (Network Config) - This can be a difficult step that requires a bit of planning. For any existing sites, we need to configure the ability to use both IP address schemes side-by-side until the transition is completed. There are two primary ways to accomplish this that I’ve used - either build out a new segmented (VLANed) network, or overlay the existing using secondary IP addresses. Don’t forget to propagate routes to the new subnets and ensure that firewall rules match the existing functionality.
Transition (Infrastructure/Servers) - Once the underlying networking pieces are done, the next step is to begin transitioning services. Again, make sure any new systems getting deployed are now using the new ranges. Then we can take either an active or passive approach. In the passive approach, we are going to essentially just build new systems in the new scheme and wait until the older systems are eventually removed from service. This probably isn’t the ideal way to do this - but it’s certainly an option. In a more active approach, we would start identifying the older systems to move and making plans to do so (likely in a phased manner). Either method is going to require a serious investment of time, depending on the size of your network.
Long-Term - This process is never going to be quick or easy, but the end result should be a much better state than where we began. In the meantime, maintaining both IP addressing schemes can be quite painful. Make sure that everyone on the team understands the goals of the new scheme, the plan for getting there, and how everything is configured to make it happen. The last thing you want is for someone to try and back out of the move, just because they’re not confident in what’s going on.
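
As a rough illustration of why a single block per location makes summarization easier, here is a minimal sketch using Python's standard ipaddress module. The prefixes and segment names below are made up for the example and are not the actual ranges from this migration.

```python
import ipaddress

# Hypothetical addressing: under the new scheme, one site owns a single /16.
site_block = ipaddress.ip_network("10.20.0.0/16")

# Carve the site block into /24 segments and hand out the first few.
segments = list(site_block.subnets(new_prefix=24))
plan = {
    "servers": segments[0],
    "storage": segments[1],
    "management": segments[2],
}

def summary_for(networks):
    """Collapse a list of networks into the fewest covering prefixes."""
    return list(ipaddress.collapse_addresses(networks))

# Every allocated segment is covered by the one site-wide summary route.
all_inside = all(net.subnet_of(site_block) for net in plan.values())
print(all_inside)

# Two adjacent segments collapse into one shorter prefix.
print(summary_for([segments[0], segments[1]]))

# Hypothetical "old style" blocks scattered across 10.x.x.x do not collapse,
# so each one needs its own route, firewall object, and VPN selector.
old_blocks = [ipaddress.ip_network(n) for n in ("10.5.0.0/16", "10.9.0.0/16", "10.40.0.0/16")]
print(summary_for(old_blocks))
```

A check like this can also be run against a proposed allocation plan before any re-addressing starts, to confirm that nothing falls outside the site's summary.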

I also wanted to stress the importance of research throughout this whole process. It’s important to try and understand why the original IP addressing was designed the way it was, and what goals they had in mind at the time. It’s also important to check the technologies you’re using to understand how everything will work. For example, Juniper’s SSG (ScreenOS) platform doesn’t support utilizing a secondary IP address on an ‘untrust’ interface (KB5527) - but it works if you use a custom zone name. And Check Point doesn’t support secondary IP addresses at all when you are using their ClusterXL protocol (SK89980). Instead, they recommend that you deploy a new VLAN and tagged sub-interface. However, they do support it if you are using VRRP instead.

This is in no way a definitive guide on the various ways you might accomplish this - but I wanted to give a bit of background on how we tackled the problem. Unfortunately in my case, most of the older locations have several thousand systems - so I’ll be working on this migration for quite a while.

Ever had to migrate to a new IP addressing scheme? What methods did you use? How large was the network? Run into any big problems? Comment below!

Tracking Latency and Packet Loss with SmokePing

Tue, 25 Apr 2017 08:53:04 +0000

“The network is slow” - Sound like something you’ve heard before? What does ‘slow’ mean anyway? And is it different from yesterday? Sometimes tracking down network ‘slowness’ can be pretty difficult, especially when you don’t have a good baseline of what is normal. This kind of goes back to one of the tips I shared earlier in ‘A Little Bit of Magic’ - having a baseline and understanding of what is normal on your network will help you find issues much more quickly.

When I started working for a cloud service provider a few years ago, the first things that started coming up extremely often were network latency and performance issues. These are things I never had to worry too much about previously, as most of my jobs had been in enterprise environments where everyone is on the same LAN (or at least within one state). However, when you get into hosting a Software-as-a-Service cloud on a global scale, slight performance issues begin to mean big slowdowns for your customers.

I was amazed at the current network infrastructure monitoring that was in place when I began working for the SaaS provider: A few bare-bones Cacti instances, completely unmanaged by anyone, and not configured to monitor any relevant ports or data. Today that situation is vastly different - I have installed a few different applications that allow us to get alerted on network variances and quickly determine exactly where the issue is. One of the tools that has helped us get to this point is called SmokePing, which I would like to talk about today.

Setup and Installation

I won’t get into the details of installing SmokePing, as there are already a number of good tutorials out there (like this one or this one). If you have a decent familiarity with Linux, then the process should be fairly straightforward. Keep in mind that your SmokePing graphs will show latency and packet loss between the machine you have SmokePing installed on and the targets you define. So make sure that you plan out where you deploy your SmokePing machine(s) to provide beneficial information.

Once you have SmokePing installed and set up, it’s time to start defining targets to monitor. We have over a dozen points of presence globally, so I’ve installed SmokePing on a single machine in each location. Each instance has ping targets defined for every network segment within its own datacenter, network segments in every other datacenter, and some public IP space of every datacenter. So we accomplish latency and packet loss monitoring within the datacenter, across the site-to-site VPNs between each datacenter, and the general internet connections between each datacenter. For certain customers, particularly those who have dedicated MPLS circuits to us, we are also monitoring latency/packet loss to customer endpoints.

SmokePing also supports deployment in a controller/worker configuration, where you have a single primary configuration/management point and several workers to perform testing. I really want to test this out for our environment, but I haven’t quite had the time to dedicate to it. If you’re interested though, you can find the details on that here.

Interpreting the graphs

The graphs created by SmokePing might not seem clear the first time you see them. For example, take a look at this:

This graph is the result of a standard latency test - 20 pings every 5 minutes. So for every step on the graph, SmokePing draws out the range of responses in those 20 pings - shown by the gray ‘smoke’. The darker the gray area, the more pings came back with that response time - and similarly the lighter areas mean that fewer pings had that response time. The solid colored line marks the median response time across all 20 pings, and its color gives an indication of the percentage of packets lost.
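
To make that a little more concrete, here is a minimal sketch of the numbers behind a single step on the graph. This is not SmokePing's own code - the function name and the sample values are made up for illustration - but it shows how 20 pings boil down to a center line, a spread, and a loss percentage.

```python
import statistics

def summarize_step(rtts_ms, pings_sent=20):
    """Summarize one graph step from the round-trip times that came back.

    rtts_ms holds one entry per reply; lost pings are simply missing.
    """
    received = len(rtts_ms)
    loss_pct = 100.0 * (pings_sent - received) / pings_sent
    if received == 0:
        return {"median_ms": None, "min_ms": None, "max_ms": None, "loss_pct": loss_pct}
    return {
        "median_ms": statistics.median(rtts_ms),  # the solid line
        "min_ms": min(rtts_ms),                   # bottom edge of the smoke
        "max_ms": max(rtts_ms),                   # top edge of the smoke
        "loss_pct": loss_pct,                     # drives the line color
    }

# Made-up sample: 18 replies out of 20, mostly near 15 ms with a few slow ones.
sample = [15.0] * 14 + [40.0, 90.0, 150.0, 200.0]
step = summarize_step(sample)
print(step)
```

In this sample the center line would sit at 15 ms even though a few replies took up to 200 ms, which is exactly why the smoke is worth reading alongside the line.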

So the first thing I would notice about this graph is that the average response time is varying quite significantly between about 15ms and 200ms. In a normal healthy network, you should not expect to see such a drastic change in response times like that - some variation is normal, but not to this extreme. Two other things to note from this graph: The time of each latency jump seems to line up almost every 30 minutes, and towards the end we begin seeing some slight packet loss.

After being informed that there was a performance issue between a few different systems, I opened up SmokePing immediately to start looking for anything that jumped out - like the graph above. In this case, this was a 200Mb dedicated MPLS circuit used only for replication traffic between data centers. Every 30 minutes, a replication job was kicking off and saturating the line for a few minutes - which in turn was causing excessive jumps in latency and some minor packet loss.

As another example:

The first thing you probably notice about the graph above is the sudden stabilization of latency. This graph monitors traffic between two data centers over an IPsec VPN tunnel - and we happened to be suspecting that one of the two peer firewalls was having performance issues. We swapped out to new hardware on one side of the connection, and the latency immediately started flat-lining. A consistent 85ms is way better than averaging anywhere from 90-180ms. (And if you happened to notice the slight packet loss after the new device was implemented - that was actually due to an unrelated upstream provider issue). My point with this graph is really just to show how helpful it is to have the historical data available. It would have been extremely difficult to prove that the one firewall was the root cause of our problems if I didn’t have a way to track the issue.

So that’s a bit about SmokePing and how I’ve deployed it within a cloud provider’s environment. It’s only been up and running for a few months, but I’ve already found it to be extremely helpful in troubleshooting performance and latency issues. SmokePing is also extensible via scripting, which can help to collect additional data at the time of an issue. I’ve written a few quick scripts to run extended traceroutes during packet loss events, which I might post up here in the future.

Have you installed SmokePing in your environment? How do you use it? Has it helped you with performance issues?

Comment below!

College vs Certification - Which is better?

Tue, 28 Mar 2017 08:00:58 +0000

As of the beginning of this month, I have officially completed my four years of trying to balance working full time and going back to school. I finished up my last college classes and now I can sit back and appreciate having some free time to myself again. I’ve never been really into the concept of school, but ultimately I went back because I was being pushed to by my previous employer. So I figured that now is just as good a time as any to tackle the topic of which is better - certs or college degrees?

I talked about this briefly in my initial background story posts, but I went straight from Cisco Networking Academy in high school out to working a full time job at a local IT consulting company. By the time I finished high school, I had already passed the Cisco Certified Network Associate (CCNA) exams and become certified. Having that certification is what got me in the door for a number of interviews, and eventually got me the job at the consulting company. At that point, I really didn’t have much else going for me - I didn’t have a college education nor any real-world experience. In my time working at this company, I spent a significant amount of time doing self-study and labs for my certification goals. When I got my CCNP certification, I used it along with the experience I had gathered to get my next job. This new employer was heavily focused on their IT staff needing to have a college education - so they pressured me for a while to go back until I eventually gave in.

I spent a while reviewing many colleges in the area and online, trying to figure out what would meet my needs. I ended up picking out a four-year degree in network security, and opted to go the online-only route because it benefited my schedule better. I packed my classes up to a full-time schedule, because I didn’t want a four-year degree to take any longer than four years. At this point, I also had the benefit that my employer was willing to reimburse 100% of the costs - which certainly helped convince me to go back.

Over the course of the past four years, I have taken many classes that include general IT, development, networking, and security (not including the normal required materials). I found that a significant portion of these classes didn’t directly benefit me. A lot of the material was much more focused toward beginners who haven’t already been working in the field for six years - which is completely understandable. The most I really got out of this was improving my abilities to push myself through work that I didn’t want to do. I did have a few interesting classes, like an Android development course, which I found to be extremely fun even if I probably won’t use the knowledge much.

Four years later and I’m done - did I benefit from it? On some level yes, I think I did. At the time of my degree completion, I have now been Cisco certified for ten years and I’ve been working in networking nearly the same amount of time. I’m already further in my career than I thought I would be at this point, and I’m happy with my position and pay (the degree isn’t going to change either of these things). At this point in time, finishing the degree is not much more than an accomplishment that I can add to my resume. Sure, having the degree on my resume may get me past HR screening for new jobs and opportunities - but it likely won’t actually play much into a company’s decision to hire me.

In the end I think that both certifications and college education are useful - they can both be great indications to an employer that you’ve been trained on certain technologies or fields. However, I think that the actual on-the-job experience is what really matters - and I experienced a direct benefit from getting in the field early and working while all of my friends were still in college. I would not be as far in my career as I am today if I had waited four more years to start working. Unfortunately, I think that we place a little too much importance on completing a formalized degree program, when equivalent experience and certifications may benefit a company more.

I understand that I had a bit of a unique situation, but I figured it would be worth sharing my experiences and how they have affected my view of college education. I’m still happy that I went through with it and completed the degree, but you won’t see me throwing a big celebration - except that I’m just super glad it’s all finished. At this point, I will take a few months to relax and spend time on hobbies - but I do plan on going back to certification studies (Juniper stuff and likely begin working on a CCIE).

Any thoughts? Comment below with your experiences - I’m interested to see if there are many people who have had similar experiences to me, or possibly even the complete opposite.

Port Security: Worth the effort?

Tue, 14 Mar 2017 08:00:40 +0000

Port Security. Always seems like one of those things covered in Cisco exams, yet how many businesses actually use it? For those that aren’t implementing it, should they? Or is it too much of a headache?

So the concept of port security is fairly simple - We want to secure each individual switch port to a physical layer 2 MAC address, or at least limit how many unique MAC addresses might be learned on an individual port. The technology could be used to just limit the number of simultaneous devices on a port - by just setting a MAC threshold. Or we can also take it to the extreme and lock down each port to a hard-coded MAC address - which will never allow another device to connect. You might be thinking that the second method is absolutely ridiculous, but it really depends on the business needs.

First, let’s take a brief look at the typical port security configuration and some of the options available.

SecureSwitch(config)# interface x/x  ! Whichever interface we want to lock down
SecureSwitch(config-if)# switchport port-security max xx  ! Max number of MAC addresses that can be learned
SecureSwitch(config-if)# switchport port-security violation xxxxx  ! Choose to either restrict or shutdown the port (description below)
SecureSwitch(config-if)# switchport port-security  ! This actually enables the port security config

Fairly straight-forward, right? We choose a port (or you could do a range) and set a few options. The default number of MAC addresses able to be learned on a port is 1, so it’s likely you’re going to want to change this - unless 1 is all you need. Port security can only be enabled on access ports, so 1 MAC address works in most cases - except where you have a PC daisy-chained off of an IP phone (in which case this will need to be set to 2 or 3).

Next we set our preferred violation action. This step is pretty important because it defines what happens when the port exceeds its MAC count. Restrict is the passive approach. If we have two PCs plugged into a single access port (maybe using an unmanaged switch), then the second PC will just never be able to work as long as our max MAC limit is 1. The first PC to connect will be fine, and the switch will log a message and send an SNMP trap when the second MAC is picked up. Shutdown is the more forceful approach. Once that second MAC address turns up on the port, the switch puts the port in an err-disabled state - which shuts down the port to all traffic. This event is also logged and generates an SNMP trap - however the port will not come back online until an administrator manually re-enables it.
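
The difference between the two violation actions can be sketched with a small toy model. This is only an illustration of the behavior described above, not how a real switch is implemented - the class and method names are hypothetical, and clearing the learned addresses on re-enable is an assumption of the model.

```python
from dataclasses import dataclass, field

@dataclass
class SecurePort:
    """Toy model of one access port with port security enabled."""
    max_macs: int = 1
    violation: str = "shutdown"          # "restrict" or "shutdown"
    learned: set = field(default_factory=set)
    err_disabled: bool = False
    log: list = field(default_factory=list)

    def receive(self, src_mac: str) -> bool:
        """Return True if a frame from src_mac would be forwarded."""
        if self.err_disabled:
            return False                 # port is down for everyone
        if src_mac in self.learned:
            return True
        if len(self.learned) < self.max_macs:
            self.learned.add(src_mac)    # still room, learn the address
            return True
        # Too many MAC addresses: log it, then apply the violation action.
        self.log.append(f"violation by {src_mac}")
        if self.violation == "shutdown":
            self.err_disabled = True
        return False

    def admin_reenable(self) -> None:
        """Manual recovery; this model also forgets learned MACs (assumption)."""
        self.err_disabled = False
        self.learned.clear()

# Restrict: the second PC never works, the first PC is unaffected.
restrict = SecurePort(max_macs=1, violation="restrict")
print([restrict.receive(m) for m in ("pc-1", "pc-2", "pc-1")])

# Shutdown: the second PC takes the whole port down until an admin steps in.
shutdown = SecurePort(max_macs=1, violation="shutdown")
print([shutdown.receive(m) for m in ("pc-1", "pc-2", "pc-1")])
```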

Now that we see a basic config, let’s take a look at a few different use cases for this feature. In one of my previous jobs, I worked as a network admin for a local government organization. Port security configuration in that environment was extremely strict. Each switchport was configured to permit only one MAC address, shutdown upon violation, and the switchport port-security mac-address sticky command was also used. This command takes the first MAC address learned on the port and commits it to the running configuration, which means that this MAC is essentially hard-coded to be the only MAC permitted on the port. So in this environment, a single PC was tied to a single port - nothing else could ever be plugged into that port without either shutting down the port or administrator intervention. In a government office, this was absolutely necessary because every device on the network needed to be tracked and personal devices were not permitted to be connected. We needed to know if anything was ever plugged in that wasn’t an authorized device - so manual intervention and investigation was a requirement.

In a more typical office environment - port security configurations can just be a good security practice without going overboard with it. We never want a user to plug a rogue switch into our network without our knowledge, right? So maybe we assume each user has an IP phone and PC, and limit the port to 2 MAC addresses. In this case, we can go ahead and just set the port to restrict. We don’t want to prevent the user from working if a port violation occurs, nor do we want to spend time resetting the port for them - but we might still want to be notified, especially if it happens often. In addition, port security is an excellent way to secure ports in public areas. For example, maybe we have an IP phone or kiosk PC in our lobby. These need access to the network, but we don’t want anyone to be able to unplug that device and gain access into our network. In cases like this, it would actually make sense to have the switch only permit access from that single MAC address.

Outside of the ‘practical’ use cases, there is also the strictly security side of things. I’ve touched on a few considerations already - but there are also certain types of attacks that can be defeated by port security. One of those would be exhausting the CAM table resources. A malicious person could use publicly available tools to spoof MAC addresses in the packets they send to the switch. Tools like this force the switch to learn hundreds of thousands of MAC addresses, which eventually will overload the CAM table. When a switch CAM table becomes full, the switch begins flooding packets out all interfaces. This is because the switch can no longer assign mappings between MAC addresses and the ports they originate from - so the switch has no choice but to flood everything and hope the correct recipient receives the data. For the attacker, this means they can run a packet capture on the port and collect information they wouldn’t otherwise have been able to see. This scenario could be prevented by implementing port security, which could simply restrict the number of MAC addresses learned off of any individual interface.
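
Here is a rough simulation of that attack, assuming a deliberately tiny CAM table and ignoring details like address aging. The class, the table size, and the MAC names are all hypothetical - the point is only to show why a per-port limit keeps the table from filling up.

```python
class ToySwitch:
    """Very simplified switch: a fixed-size CAM table and an optional per-port MAC limit."""

    def __init__(self, cam_capacity, per_port_limit=None):
        self.cam_capacity = cam_capacity
        self.per_port_limit = per_port_limit
        self.cam = {}  # MAC address -> port

    def _macs_on(self, port):
        return sum(1 for p in self.cam.values() if p == port)

    def learn(self, src_mac, port):
        """Try to learn src_mac on port; return True if it is in the table afterwards."""
        if src_mac in self.cam:
            return True
        if self.per_port_limit is not None and self._macs_on(port) >= self.per_port_limit:
            return False  # port security refuses the extra address
        if len(self.cam) >= self.cam_capacity:
            return False  # table is full, nothing new can be learned
        self.cam[src_mac] = port
        return True

    def forward(self, dst_mac):
        """Return the egress port, or 'flood' for an unknown destination."""
        return self.cam.get(dst_mac, "flood")


def simulate(per_port_limit):
    sw = ToySwitch(cam_capacity=100, per_port_limit=per_port_limit)
    # Attacker on port 1 sends frames from 1000 spoofed source addresses.
    for i in range(1000):
        sw.learn(f"spoofed-{i}", port=1)
    # A legitimate host then shows up on port 2.
    sw.learn("victim", port=2)
    return sw.forward("victim")

print(simulate(per_port_limit=None))  # no port security
print(simulate(per_port_limit=2))     # at most 2 MACs per port
```

Without the limit, traffic to the victim is flooded (and visible on the attacker's port). With the limit, the attacker can only ever occupy two table entries, so the victim is learned normally.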

Port security configuration can be implemented in a few different ways depending on your use case. Overall though, it can prove to be a useful way to help implement security controls on your network. What do you think about port security? Extremely useful or does it just get in the way? Comment below and tell me how you have implemented it!

Want Change? Make it happen!

Tue, 07 Mar 2017 08:00:01 +0000

Too often, it seems like a common component of office culture is complaining about the issues. “Why do we always do things this way? It’s not the right or best way.” Even when things are running smoothly to most, there will always be someone who believes that things are not being done right. Occasionally you might get lucky and someone will suggest a better option. However, in my experience many of those people only offer the better option as a suggestion, then complain when nothing changes. So let’s take a look at how not to fall into that trap.

1. Identify the problem

This can be both the easy part and the hard part. For example, let’s take a recent case from my job: Poor coordination across teams for project work. Awesome, we have identified the problem, right? Well, not quite - that may be a high-level summary of the problem, but why is there poor coordination? Maybe the teams aren’t meeting often enough with each other, or maybe those meetings aren’t structured in an effective way.

2. Identify the solution

So it’s easy for anyone to say “I FOUND A PROBLEM”, yet it’s more difficult to come up with a reasonable and effective solution to that problem. Sit back and evaluate the problem, but make sure you consider the perspective of both sides. Maybe Team A works better if they have all of the information up front, so they can design a proper solution - yet Team B likes to work as they go, and tackle things as they come up. In this case, we probably won’t get very far in asking Team B to schedule architectural/design meetings before starting a project, will we?

3. Propose the solution

This part is important, because we don’t want to start making changes without anyone understanding what we are doing or why. If changes come out of nowhere, people are more likely to reject them. So maybe we sit down with Team A and Team B and explain our solution: We will hold a quick, high-level design meeting at the beginning of a project - but Team B will be responsible for trying to notify Team A as soon as they identify a new requirement, and Team A will be responsible for identifying when those new requirements warrant a bigger meeting to gather requirements/details.

4. Make the change

This is probably going to be the most difficult step. If you want the change to happen, you cannot stand back and hope that someone else does it. You have to lead the change. For example, if you are on Team A and you propose this idea, then you must hold Team B accountable to sitting down for a requirements-gathering meeting when you think one is needed. Give it a good effort, because if other teammates see you working hard to make working-life better for everyone then they will be more likely to join in.

5. Re-evaluate and refine

No one is perfect, and no idea will ever be absolutely perfect on the first try. So give it a while, then make sure you sit down and take a look at how this change has impacted everything. Has Team A been more productive, since they can get requirements earlier in the process? Has Team B become less productive due to the increase of meetings? You might get lucky and have a fairly smooth transition into a better working environment - but chances are good that the overall change might still need some tweaks. Don’t let yourself fall into the ‘set it and forget it’ mentality - getting better means constant improvement.

This might seem like a lot of work just to make a simple change, but it doesn’t really have to be. As a real-life example, I recently worked with another team who was starting a new project to install an entirely new application. They began by submitting individual tickets to the network team with bits and pieces of what network changes they believed their project would need. Once I realized this was happening, I asked the person leading the project if they had 30 minutes to sit down and talk that afternoon. It was a very quick meeting where I asked about the application they were installing, what it did, how it was intended to be used, and what other applications/systems they believed it would need access to. I also provided a little insight into why this mattered to me from a network design perspective. From that I had enough of an understanding of their project to put together an effective design from the network side, which makes both of our lives easier - because we don’t have to piece it together bit by bit now, then realize later that it wasn’t the ideal configuration. After that meeting, the project lead said “Man, I always wished we had a lot more meetings like this - this was really helpful”.

That is the difference between wanting change and driving change. The bottom line is: If you want to inspire positive change - You have to be the catalyst.

I’m sure we have all identified areas of improvement with our workplace. Have you ever been the one to drive change? Leave a comment below, and tell me your story!

Virtual Networking Contexts

Tue, 21 Feb 2017 08:00:58 +0000

I really want to take a moment to talk about how wonderful VRFs/firewall contexts really are. Both technologies essentially allow a network administrator to spin up a virtualized, isolated instance of a network device. I’ll be honest and say that I hadn’t had the chance to play much with this stuff until just recently - but it makes life a lot easier in a cloud provider environment.

I’ve been looking for a good chance to use VRFs in the past, but in most cases it didn’t really make much sense. About a year ago, I had a great opportunity when we needed to build a new data center. The data center was aimed at being lower capacity than most of our other locations, so we had to cut some costs here and there. In all of our other locations we use two physically separate sets of firewalls, one for external traffic and one for internal traffic. In this new location, we opted to save some money by picking up only a single pair of Juniper SRX 345 firewalls.

I made the decision here to make use of Juniper’s virtual routing instances to keep logical separation of internal vs external firewalling, even though it was only a single physical cluster. For one, this would allow existing staff to maintain their current understanding of network architecture. Every data center has the same overall logical traffic flow, even if the physical devices are different. Second, this allowed us to split load across the two devices. Normally we have two physical clusters to handle the traffic load, but in this case we were essentially going to pump the same traffic through one pair of firewalls. Assigning each virtual routing instance into its own redundancy group allowed us to run each firewall instance on a separate device - yet still allow for both instances to run on one in the event of a failure.

Once we got that firewall cluster into production, there seemed to be a lot less fear regarding virtualized network contexts. I was able to prove that it worked, and worked well for what we needed at the time. Soon enough I was able to find a few additional places where we could make use of the same concepts. We recently procured quite a few Cisco Nexus 9372PX switches for both new deployments and hardware refreshes. By default these switches already come pre-configured with an out-of-band management VRF, which is already super useful to me. We run all of our device management traffic on a segregated network, so a management VRF allowed me to configure the IP/route information to make all that work - while not interfering with the normal layer 3 operations of the device.

Being a cloud provider, most of our customers are completely abstracted from the hardware/software that runs their hosted applications. However, in a few cases a customer negotiates a contract change to say otherwise. For example, a customer might have a special software integration they want to run and have the ability to control - or some customers want a dedicated point-to-point Ethernet connection into one of our data centers for increased reliability. A lot of the background networking work for this in the past was a bit of a pain - but it opened up another opportunity to make use of VRFs. I now have a dedicated customer VRF, which has separate routing configurations from our normal production environment. Customer wants to stand up BGP peers across their direct connection to our data center? Sure, I can isolate that BGP instance in the customer VRF, so there is no conflict with our production routing tables.
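
Conceptually, a VRF is just a separate routing table, so the same prefix can exist in two of them without conflict. Here is a minimal sketch of that idea - the class, the prefixes, and the next-hop names are all made up for illustration and don't reflect any vendor's implementation.

```python
import ipaddress

class Vrf:
    """A toy routing table: each VRF keeps its own independent set of routes."""

    def __init__(self, name):
        self.name = name
        self.routes = []  # list of (network, next_hop)

    def add_route(self, prefix, next_hop):
        self.routes.append((ipaddress.ip_network(prefix), next_hop))

    def lookup(self, address):
        """Longest-prefix match within this VRF only; None if nothing matches."""
        addr = ipaddress.ip_address(address)
        matches = [(net, hop) for net, hop in self.routes if addr in net]
        if not matches:
            return None
        return max(matches, key=lambda m: m[0].prefixlen)[1]


production = Vrf("production")
production.add_route("10.0.0.0/8", "core-router")
production.add_route("10.50.0.0/16", "dc-firewall")

customer = Vrf("customer")
# Same prefix as production, learned from a hypothetical customer BGP peer.
customer.add_route("10.50.0.0/16", "customer-bgp-peer")

print(production.lookup("10.50.1.1"))
print(customer.lookup("10.50.1.1"))
print(customer.lookup("10.60.1.1"))
```

The same destination resolves to a different next hop depending on which VRF the lookup happens in, and the customer VRF knows nothing about production's other routes.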

I’m sure that my current use cases are probably not the ideal implementations of virtual networking contexts - but they work for what we need and they make life a lot easier. I can see these becoming more and more common in our environment to logically segregate traffic. I am interested to hear how other companies have integrated this type of technology into their networks - so leave a comment below!

A Little Bit of Magic

Tue, 07 Feb 2017 08:00:03 +0000

I’ve lost track of the number of times in my career that someone has said “How did you do that?”, “Wow that’s amazing!” or “I would have never figured that out”. My answer is typically that it involved a little bit of magic - but then I follow it up with an actual explanation. For those less technical, some things really can seem like a bit of magic. Solved an outage in minutes? Knew about an impending issue before it happened? Yeah - that can be quite magical.

So I have a few posts that I plan on scattering here and there, which will cover some tips on how to become a networking magician. I will aim to provide detail behind some of the expert intuition and skills which can amaze and confuse others. Let’s get started with Magic Tip #1:

Monitor your network

No, really, just monitor your network. I don’t mean “Oh, there is a ping alert for that switch” - I mean pay attention. How will you ever know what an anomaly looks like, if you don’t have an internal baseline of what your network should look like? This tip really takes time, but I’ve found that it pays off in the long run.

I use a couple of open-source tools, like Observium and SmokePing, to track metrics on my networks. I spend 5-10 minutes each morning skimming through the pretty graphs to get an idea of how we are performing today. Every week or two, I will spend a bit more time on a deeper dive into the metrics. However, the important thing here is not the time spent, but the fact that I look at these. In the back of my mind, I keep a mental note of the general averages for bandwidth, latency, packet loss, etc.

Once in a while, I might look at a graph and notice that something is a bit off. Defining the word ‘off’ in this sense is difficult. Maybe a router interface that averages 20Mb/s spiked to over 40Mb/s through the night. Maybe traffic was actually far lower than the average. Sometimes I might see a slight increase or drop in latency between a pair of data centers. Some of these things could mean absolutely nothing - but in many cases they are an indicator of something else.

As an example of this - a few weeks ago, I noticed that the average latency between two data centers had increased slightly, and SmokePing was reporting occasional packet loss of up to 5%. I also track historical traceroute tests - so when I reviewed those, I found that the upstream carrier’s route had changed about 2-3 hops out. No big issues - but I made a note of these findings. A few days later, we began experiencing a spike in packet loss between those two data centers. Rather than being caught completely off-guard, I already had all of the information I needed to work with the upstream carrier. Issue resolved - quickly, simply, and without wasting time during a network degradation event.
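The "something is a bit off" check can be roughed out in a few lines of Python. This is only a sketch of the idea, not how Observium or SmokePing work internally: the sample values are made up, and the three-standard-deviation threshold is an assumption you would tune for your own network.

```python
from statistics import mean, stdev

def is_off_baseline(history, latest, threshold=3.0):
    """Return True when `latest` sits more than `threshold` standard
    deviations away from the historical mean, in either direction."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return latest != baseline
    return abs(latest - baseline) / spread > threshold

# Hypothetical nightly averages (Mb/s) for one router interface.
nightly_mbps = [19.5, 20.4, 21.0, 19.8, 20.2, 20.9, 19.7]

print(is_off_baseline(nightly_mbps, 40.0))  # spike well above the usual ~20 Mb/s
print(is_off_baseline(nightly_mbps, 20.5))  # an ordinary night
print(is_off_baseline(nightly_mbps, 5.0))   # traffic far lower than usual
```

A check like this only flags candidates. Deciding whether a deviation actually matters still takes the human baseline described above.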

Let me just reiterate that I don’t expect everyone out there to stare at bandwidth graphs all day long - that’s not going to get you anywhere. However, we do need to spend a little bit of time giving our network the attention that it deserves, even if it’s just a quick check-up every day. Once you have a good idea of how things typically operate, it can be much simpler to pinpoint issues and get ahead of them - which means they get resolved without wasting time.

Ever had someone claim you’ve performed magic? Tell me about your experiences in the comments!

The Argument for Standardized Configurations

Tue, 31 Jan 2017 08:00:45 +0000

There are quite a few things whose value you don’t realize until you don’t have them anymore. For me, one of those things was standard guidelines for device configurations. At my last job, documented standards were extremely important - we had them for everything. While some devices might ultimately be configured in a slightly different manner to accommodate their specific purpose, the underlying basics were all configured exactly the same. Fast forward to where I am now, and when I started there was no such thing. One device might be configured for management access only over the out of band interface, while a few others might allow management traffic over every interface. Some devices had SNMP configured, some didn’t, and yet others had default credentials still enabled.

The problem here stemmed from the fact that there were no documented standards in place. An engineer was given a device to configure, and it was configured depending on who did it and what they felt needed configuring. In a few cases, this actually led to unnecessary security risks being introduced into the environment because something was left enabled. In one instance, this included open root SSH logins via the Internet to a production firewall. Scary, huh?

So how do we go about changing this? Here is a quick little guide I threw together on my method for tackling the situation:

1. Define a standard

Begin creating a baseline document, whether it be a spreadsheet, Word doc, or a wiki page. Start small and choose a single system, like your external firewalls for example.

2. Research best practices

Check out the vendor’s website to see what they recommend. There are also some amazing free resources out there like the Center for Internet Security’s configuration benchmarks. Do your research - there is plenty available to help you.

3. Figure out what’s best for your network

Not all of the best practices or security hardening guides will be a perfect fit for your environment. So it will take a little manual review to see what actually fits. For example, many of these guides recommend disabling local authentication in exchange for something centralized like TACACS+ or RADIUS. But if you don’t have that available, then you’re going to stick with local authentication. This can still be a great time to find room for future improvement projects though.

4. Test

If you have a development or test environment available, then run a device or two through your checklist and make sure there are no big issues. If you don’t have a dedicated test area, then try and choose a low-impact device - where not much will be impacted if the changes go wrong.

5. Roll out the changes

Make sure you have a list of every device that needs to be touched, so that you have a way to validate. Then make the configuration changes to get each device into compliance with your new standards. Have a validation/testing checklist ready, so that you can quickly ensure that no production traffic was impacted.

6. Train your peers

Configuration standards only work well as long as everyone follows them. It only takes one person to ignore the checklist and potentially expose a vulnerability. So take an afternoon and schedule a training session with your team. Help them understand the importance of maintaining these standards, and train them on how to apply the changes (if necessary).

7. Automate

This part is optional, but highly recommended. If nothing else, spend the time to automate verification of the standards - which will make it easy to locate a device that falls out of compliance. If you or your team have the skill set, then automate the entire process from initial deployment to continuous validation. Why is this the last step, instead of being included with the roll out? I am a firm believer that you should completely understand how your device functions and reacts to changes before automating those changes.
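As a rough illustration of the "automate verification" idea, here is a minimal Python sketch that checks a saved device configuration against a list of required and forbidden lines. The rule lists and the sample configuration are hypothetical stand-ins for whatever your own standard says, and real tooling would also need to collect the configurations from the devices first.

```python
def audit_config(config_text, required, forbidden):
    """Return a list of findings for one device configuration.

    `required` lines must appear verbatim and `forbidden` lines must not.
    Matching is per line, ignoring leading and trailing whitespace.
    """
    lines = {line.strip() for line in config_text.splitlines()}
    findings = []
    for rule in required:
        if rule not in lines:
            findings.append(f"missing: {rule}")
    for rule in forbidden:
        if rule in lines:
            findings.append(f"forbidden: {rule}")
    return findings

# Hypothetical standard: illustrative rules only, not a recommendation.
REQUIRED = ["service password-encryption", "no ip http server"]
FORBIDDEN = ["snmp-server community public RO"]

# Hypothetical saved configuration for one device.
sample_config = """
hostname edge-fw-01
service password-encryption
snmp-server community public RO
"""

for finding in audit_config(sample_config, REQUIRED, FORBIDDEN):
    print(finding)
```

Running the same check across every saved configuration gives a quick list of devices that have drifted out of compliance.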

So that’s more or less how I worked to implement a standardized configuration at my current job. I began with a completely new device platform that we were integrating into our environment, then began to go back to older device platforms. It might be a lot of upfront work, but it certainly helps me sleep better at night not having to wonder if there might be one device out there that’s misconfigured (and will cause an issue later, due to that misconfiguration).

So let me know in the comments below - have you ever implemented something like this? If so, what did you do differently? If not, then let me know if you give this a try!

The Small Things (0x2142)

Tue, 17 Jan 2017 08:00:34 +0000

Even when you’re ten years or so into your career, you can always stand to learn something. It’s important that no matter how experienced you get, you always keep an open mind to other people’s ideas or opinions. As an example of this, I would like to share the story of this blog name.

Back when I worked at a local IT consulting company, they hired a network admin who had worked at several large service providers in the past. He was very experienced and intelligent, and was able to walk into the organization and immediately begin making positive changes. Exactly the type of person that you would want to hire, right?

Well after a few months in, he began checking through some of the equipment we had in our spare store-room. A bunch of Cisco routers and switches, some older than others. After a week, he began complaining about how the devices had sat on the shelves too long. It seemed as though the flash memory was degraded, which caused the devices to not retain their configuration settings. Almost every device he checked through seemed to be experiencing this issue. What else can you do at this point but throw out the bad hardware?

So I decided to pick up one of the devices to see what he was talking about. After all, I was still very early in my career - so if I could stand to learn something from how the devices were behaving, I wanted to see it. So I booted up an old Cisco 2610 router and made a few configuration changes. Save, reboot, and sure enough my changes were gone. However, I had also just been studying how to password reset these devices - since I had a pile of them that needed to be reset. Part of resetting the devices was booting into ROMMON mode and changing the configuration register value to a hex value of 0x2142.

So what is 0x2142? It’s a hex value that tells the router upon boot to ignore any saved configuration. Of course that easily explained the “degraded flash” issue that the experienced network admin had seen. So I changed the configuration register back to 0x2102, made a few more configuration changes, then rebooted. Sure enough, everything was still there. So I went and told the network admin what I had found. “Oh, checking up on me, huh?”
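The two register values differ by a single bit. A small Python sketch makes that visible. The helper name is mine, and the only fact it relies on is that bit 6 (0x0040) of the Cisco configuration register tells the router to ignore the saved configuration in NVRAM at boot.

```python
IGNORE_NVRAM_BIT = 0x0040  # bit 6: skip the saved startup configuration at boot

def ignores_saved_config(config_register):
    """Return True when the register value has the ignore-NVRAM bit set."""
    return bool(config_register & IGNORE_NVRAM_BIT)

for value in (0x2142, 0x2102):
    print(f"{value:#06x} ignores saved config: {ignores_saved_config(value)}")

# The only difference between the two values is that one bit.
print(f"difference: {0x2142 ^ 0x2102:#06x}")
```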

This story has been a bit of a running joke for a while. But really the importance is that even when you’re extremely intelligent and experienced, you can still overlook simple things. He had been password resetting the devices, but never reverting the configuration register values back to the defaults. Even when you think you might know everything, you should still keep an open mind - because even someone with no experience might have a different view on something. Sure, this wasn’t really a big “save the day” moment, but it helped to show that guy that I had some idea of what I was talking about. From then on, he actually began to work with me on understanding more networking concepts and started asking me to help out with some more of the work he was doing.

What was the most ridiculous simple mistake you’ve made? And how did you find out about it? Share in the comments!

BGP: Getting Started with Multi-homed Internet

Tue, 10 Jan 2017 08:00:17 +0000

A few years back I worked for an organization that had a single 100Mb Internet connection. Not bad for just typical corporate traffic, but we also hosted our production web site out of that location as well. An incident occurred where our website was down due to Internet issues during an extremely inconvenient time. So we decided to procure a second Internet uplink through a different provider. At the time, I had no practical experience doing something like this - yet I was put in charge of the project. Let’s go over some of what I learned…

The easy part of the whole process is the first step - ordering a second Internet connection. Our CIO at the time placed a few calls and had a quote back pretty quickly. A local carrier was willing to run new fiber cables to our building in less than a month. Depending on how important uptime is to your organization, this is the point where you might want to ask about a diverse path into the building. If both connections run through the same physical paths, then a single incident could still cause an outage. For example - I once worked somewhere where the redundant Internet connections shared the same telephone pole across the street. So even though the connections were redundant, a single accident involving that telephone pole would sever both connections.

Next - Ask about IP space. In terms of IPv4, the general rule for external BGP peering is that ISPs don’t like to accept any prefixes smaller than a /24 (that is, with a prefix length longer than 24 bits). In our case, we had a single /25 block already allocated by our current provider - which wasn’t going to work. Luckily, the new service provider offered to give us a free /24 block along with the installation costs. Unfortunately, this meant that we had to re-address all of our public-facing services, which is almost always a pain to do. I have a few tips for this, which helped us to minimize downtime - but that’s a story for another time.
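To make the /24 rule of thumb concrete, here is a tiny Python sketch using the standard `ipaddress` module. The cutoff is the convention described above rather than a protocol limit, individual ISPs may filter differently, and the prefixes are documentation addresses rather than our real allocations.

```python
import ipaddress

MAX_ACCEPTED_PREFIXLEN = 24  # the /24 rule of thumb, not a protocol limit

def likely_accepted(prefix):
    """Return True if an IPv4 prefix is a /24 or larger block."""
    network = ipaddress.ip_network(prefix)
    return network.version == 4 and network.prefixlen <= MAX_ACCEPTED_PREFIXLEN

print(likely_accepted("192.0.2.0/25"))    # a /25 is too small to advertise
print(likely_accepted("203.0.113.0/24"))  # a /24 meets the usual minimum
```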

Next, we need to obtain a globally unique Autonomous System (AS) number, which will be used to advertise our network to the world. Since we were located in North America, we went through ARIN for this process - which was fairly painless. Sign up for an account, prove that you’re associated with the business, fill out a few forms to justify your need, and then just wait for the approval. One thing to watch out for is 2-byte vs 4-byte AS numbers. 2-byte is the standard and has been around forever, but only allows for 65,536 values (0-65,535). A 4-byte ASN allows for significantly more unique IDs, but I have actually run into instances where an ISP doesn’t support these. I would hope that in most cases a 4-byte ASN will be just fine, but it might be worth asking your ISP just in case.
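If you are unsure which kind of ASN you have been assigned, the check is simple arithmetic: anything that fits in 16 bits is a 2-byte ASN. Here is a quick Python sketch with a helper name of my own, using example values from the ranges reserved for documentation.

```python
def asn_width(asn):
    """Return 2 if the ASN fits in two bytes, otherwise 4."""
    if not 0 <= asn <= 0xFFFFFFFF:
        raise ValueError(f"{asn} is not a valid AS number")
    return 2 if asn <= 0xFFFF else 4

print(asn_width(64500))  # fits in 16 bits
print(asn_width(65536))  # first value that needs a 4-byte ASN
```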

At this point, you should be ready to hit the ground running as soon as that second Internet uplink is installed. This is also assuming you already run a router or multilayer switch on the edge of your network, which also has BGP capabilities. So let’s get down to the fun stuff - an extremely basic configuration to peer with two ISPs. I’ll dedicate another post to additional recommended settings and configurations - but for now let’s focus on getting this running. The configuration sample below is aimed at Cisco devices, but the same concepts apply to most vendors:

EdgeRouter(config)# router bgp <local-asn>   ! The AS number provided by ARIN
EdgeRouter(config-router)# network <prefix>   ! The subnet we need to advertise out both ISPs
EdgeRouter(config-router)# neighbor <isp1-peer-ip> remote-as <isp1-asn>   ! Provided by the first ISP - Their remote peer IP and ASN
EdgeRouter(config-router)# neighbor <isp2-peer-ip> remote-as <isp2-asn>   ! Provided by the second ISP

As I mentioned, this config is very basic and will just accomplish what we need to get going. Follow up with a quick show ip bgp neighbors and hopefully you’ll see two peers in the established state. Any other state indicates a problem bringing up the peer connection. I won’t get into too much detail here - but check the physical connection, ping the peer, and make sure there are no firewalls blocking TCP port 179 between the peer addresses.
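If you end up building this configuration more than once, the handful of lines can be generated from the values your ISPs hand you. The Python sketch below is illustrative only. Every ASN and address in it comes from the ranges reserved for documentation, and the `mask` keyword on the network statement is my addition (without it, classic IOS assumes the classful mask for the network).

```python
import ipaddress

def render_bgp_config(local_asn, prefix, neighbors):
    """Build a minimal Cisco-style BGP stanza as a single string.

    `neighbors` is a list of (peer_ip, remote_asn) tuples, one per ISP.
    """
    network = ipaddress.ip_network(prefix)
    lines = [
        f"router bgp {local_asn}",
        f" network {network.network_address} mask {network.netmask}",
    ]
    for peer_ip, remote_asn in neighbors:
        lines.append(f" neighbor {peer_ip} remote-as {remote_asn}")
    return "\n".join(lines)

# Placeholder values from documentation ranges, not real peers.
config = render_bgp_config(
    local_asn=64500,
    prefix="203.0.113.0/24",
    neighbors=[("192.0.2.1", 64496), ("198.51.100.1", 64497)],
)
print(config)
```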

Hope this was helpful! Comment below and let me know how your experiences have gone with this type of setup - and look forward to a few more posts regarding BGP peering setup with multiple ISPs.

IP Address Design (Part 2)

Tue, 03 Jan 2017 08:00:02 +0000

Last week in IP Address Design (Part 1) we discussed an example of a bad design for IP allocations and the problems that it caused. This week we will continue by discussing the proposed solution and how it resolved those issues.

The problems with our IP Addressing scheme bothered me quite a lot - especially because IP Addressing design doesn’t really seem to be something you can easily go back and fix. We are in a somewhat unique case since we often open new locations, which is a perfect opportunity to make a positive change going forward. About a year ago, I heard that we would be opening four new data center locations in the near future. So I finally sat down and figured out a new scheme, which ultimately we deployed to all new locations.

My first goal was to start making more proper use of address space, while still making it somewhat easy to remember. As I stated in the last post, our largest data center was only using about 4,000 addresses. I began the design by trying to figure out a good starting point. A single /16 is probably still too large, but if I split up a /16 into two /17s then people will get confused about where a subnet lives. Remember that we were migrating from a very simple scheme in the past, where the second octet dictated the network location. So for the sake of simplicity, I started the design using a single /16 per data center.

Next, I needed to split up that /16 into classless subnets which could be routed in a somewhat meaningful fashion within the data center. In also trying to keep human usability in mind, I decided to split the main /16 assignment into two /17s. The top /17 subnet would be designated to all edge subnets, like the DMZ and Out of Band Management - both of which were directly terminated off of the external firewall set. The bottom /17 would be designated for all internal, protected subnets. This included anything behind the internal firewall set, like our primary internal network and some of the new isolated network segments we had built.

So here is the final scheme:

10.15.0.0/16 - Overall data center allocation

10.15.0.0/17 - Edge subnets
- 10.15.0.0/18 - Main DMZ (10.15.0.0-10.15.63.255)
- 10.15.64.0/21 - Out of band management (10.15.64.0-10.15.71.255)
- 10.15.72.0/21 - Misc DMZ VLAN (10.15.72.0-10.15.79.255)
- 10.15.80.0/20 - Unused (10.15.80.0-10.15.95.255)
10.15.128.0/17 - Internal subnets
- 10.15.128.0/18 - Main Internal subnet (10.15.128.0-10.15.191.255)
- 10.15.192.0/22 - Protected subnet 1 (10.15.192.0-10.15.195.255)
- 10.15.196.0/22 - Protected subnet 2 (10.15.196.0-10.15.199.255)
- 10.15.200.0/21 - Unused (10.15.200.0-10.15.207.255)
- 10.15.208.0/20 - Unused (10.15.208.0-10.15.223.255)
- 10.15.224.0/19 - Unused (10.15.224.0-10.15.255.255)
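A plan like this is easy to sanity-check with Python's standard `ipaddress` module. The sketch below takes the prefixes from the scheme above (including the blocks marked as unused), prints the address range each one covers, and then looks for overlaps and for any space inside the /16 that no block accounts for. The helper names are my own, and `unallocated` assumes the blocks already passed the overlap check.

```python
import ipaddress

PARENT = ipaddress.ip_network("10.15.0.0/16")

# Prefixes copied from the scheme above (blocks marked "Unused" included).
PLAN = [
    "10.15.0.0/18", "10.15.64.0/21", "10.15.72.0/21", "10.15.80.0/20",
    "10.15.128.0/18", "10.15.192.0/22", "10.15.196.0/22",
    "10.15.200.0/21", "10.15.208.0/20", "10.15.224.0/19",
]

def find_problems(parent, blocks):
    """Report blocks that fall outside the parent or overlap each other."""
    problems = []
    for block in blocks:
        if not block.subnet_of(parent):
            problems.append(f"{block} is outside {parent}")
    for i, first in enumerate(blocks):
        for second in blocks[i + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")
    return problems

def unallocated(parent, blocks):
    """Return the parts of `parent` that no block covers."""
    free = [parent]
    for block in blocks:
        for index, candidate in enumerate(free):
            if block.subnet_of(candidate):
                # Swap the free block for whatever remains after carving out `block`.
                free[index:index + 1] = list(candidate.address_exclude(block))
                break
    return sorted(ipaddress.collapse_addresses(free))

blocks = [ipaddress.ip_network(prefix) for prefix in PLAN]
for block in blocks:
    print(f"{str(block):<16} {block[0]} - {block[-1]}  ({block.num_addresses} addresses)")
print("problems:", find_problems(PARENT, blocks))
print("unallocated:", [str(net) for net in unallocated(PARENT, blocks)])
```

Run against the list above, it reports no overlaps and a single block that nothing accounts for, 10.15.96.0/19, which is the remainder of the edge /17.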

Now the first thing you may notice is that there is a large amount of unused IP space - but I’m accepting that as potential for future growth. Even the large /18 allocations will allow for over 16,000 hosts, which may be more than we will need in the foreseeable future. However, as I mentioned earlier I needed to balance conservation and efficiency with human readability.

So how does this help some of our problems? We’ve already addressed the problem of IP exhaustion by dropping each data center to a single /16 subnet rather than several /16s. Routing tables are immensely simplified now due to summarization. Oh, I need a route to that other data center? Sure, now it is only a single /16 route to the VPN peer for that location. Once the traffic gets over to that local network, then we can worry about trying to route the individual allocations within there. Even then, within the data center I only need a handful of small routes. The external firewall can point the whole 10.15.128.0/17 subnet to the internal firewall set and let it handle routing from there. And finally - that pesky problem of multiplying VPN tunnels. Now that each data center has a single /16, we only have to create a single tunnel between two locations, which saves us a ton of valuable CPU on the VPN gateways.
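To show why summarization keeps the routing tables so small, here is a small longest-prefix-match sketch in Python. The route tables and next-hop names are hypothetical and only mirror the routes described above. Real devices do this lookup in their own forwarding logic.

```python
import ipaddress

def lookup(routes, destination):
    """Return the next hop of the most specific route containing `destination`.

    `routes` is a list of (network, next_hop) tuples. Returns None if nothing matches.
    """
    address = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if address in net]
    if not matches:
        return None
    # The longest prefix (largest prefixlen) is the most specific match.
    return max(matches, key=lambda match: match[0].prefixlen)[1]

# A remote site needs only one route for the whole data center.
remote_site_routes = [
    (ipaddress.ip_network("0.0.0.0/0"), "internet"),
    (ipaddress.ip_network("10.15.0.0/16"), "vpn-peer-dc15"),
]

# Inside the data center, the external firewall hands the internal /17 onward.
external_fw_routes = [
    (ipaddress.ip_network("10.15.0.0/17"), "local-edge"),
    (ipaddress.ip_network("10.15.128.0/17"), "internal-firewall"),
]

print(lookup(remote_site_routes, "10.15.130.7"))  # via the single /16 route
print(lookup(external_fw_routes, "10.15.130.7"))  # handed to the internal firewalls
print(lookup(external_fw_routes, "10.15.64.10"))  # stays on the edge side
```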

Now, obviously these benefits only apply to locations where the new IP addressing scheme is the only addressing scheme. For connections back to a legacy data center, we would still have a single /16 on one side of the VPN while the other side had 4-6 /16 subnets. Even so, that configuration requires significantly fewer VPN tunnels than before. So to wrap this up, the design was proposed to the team and we decided to go with it for the four new data center builds. It is working quite well so far - and we are beginning to have conversations on back-porting this design to the legacy data centers (which will be another post for another time).

Have you ever had to re-design an IP addressing scheme? Or have you ever been bothered by the current design and wished you could change it? Comment with your thoughts!

IP Address Design (Part 1)

Tue, 27 Dec 2016 08:00:37 +0000

It’s funny when you think about basic networking concepts and wonder if they will ever actually prove to be useful. Kind of like asking “Do I really need to learn complex geometry? When am I ever going to use this?” What I’m here to talk about today is IP Addressing design. In many cases this will be something that is already in place and fairly solid, so there won’t be much to think about. This was the case at every company I worked at until the most recent one, which is a local cloud service provider. The type of architecture required for this environment is a bit different from what I’ve previously worked with.

So here is my first architecture tip:

No matter how small your organization is today, think about how your proposed design might look 5-10 years down the road.

The problem that I ran into here was that this cloud provider was still using an IP addressing design which was originally designed for a different set of needs. The design was intended to support the business back when we had two data centers and no one thought we would expand. Well, today we have over a dozen locations and there are constant discussions about adding more.

Let’s start with the original design, why it was a good idea, and why it doesn’t scale well today. Every data center location was assigned a few standard blocks of IP Addresses, where each block corresponded to a logical network location. The 10.0.0.0/8 space was used for this, and broken into the following blocks:

10.1.0.0/16 - Reserved
10.11.0.0/16 - DMZ
10.111.0.0/16 - Out of band Management
10.211.0.0/16 - Internal network

This was the bare minimum that each location received; in some cases another /16 or two might be allocated. So first, let’s cover the reasons why this was a good design for the time. All subnets were terminated on octet boundaries (every block was a /16), which means there was never confusion on a subnet mask. The association of the second octet to network region made the subnets easy to remember - it was quick for anyone to say “10.2xx? Oh yeah that’s an internal segment”. Also, with a minimum of four /16 blocks, we would practically never run out of IP space in each location (>260,000 usable addresses). All that being said, the addressing scheme was perfect for what it was designed for: Easy to be read and remembered by humans.

While that may have been great for two data center locations, it doesn’t really scale well about eight years later. So let’s take a look at why this design doesn’t work in the long run. After we reached the number of locations we have today, we are left with only ~40 /16 blocks unused in the 10.x.x.x block. That means we have room for ten or fewer new locations before we completely exhaust that IP space. Next, after some quick research it turns out that even our largest location was only consuming about 4,000 addresses - not even 2% of the total addresses allocated. Routing tables in each data center were a nightmare, because each location had to have several discontiguous /16 blocks routed back to it. And to top it all off - it turns out that our site-to-site VPN tunnel architecture between locations was configured to use subnet-pair tunnels. This meant that for each pair of data centers (4 /16s per site), there would be 16 VPN tunnels. While 16 isn’t a lot, the total grows quadratically as we add more locations which are all configured for full-mesh VPN connectivity.
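The arithmetic behind those numbers is worth spelling out. In the Python sketch below, the tunnel count is the number of site pairs multiplied by the number of subnet pairs per site pair. The twelve-site figure is only an illustrative stand-in for "over a dozen locations", and four /16 blocks per site is the minimum allocation described above.

```python
from math import comb

def full_mesh_tunnels(sites, blocks_per_site):
    """Subnet-pair tunnels needed for a full mesh between `sites` locations."""
    site_pairs = comb(sites, 2)
    tunnels_per_pair = blocks_per_site ** 2
    return site_pairs * tunnels_per_pair

def utilization(addresses_used, blocks, block_prefixlen=16):
    """Fraction of the allocated IPv4 space that is actually in use."""
    allocated = blocks * 2 ** (32 - block_prefixlen)
    return addresses_used / allocated

for sites in (2, 4, 8, 12):
    print(f"{sites:>2} sites: {full_mesh_tunnels(sites, 4):>5} tunnels with four /16s each, "
          f"{full_mesh_tunnels(sites, 1):>3} with a single block each")

print(f"largest site utilization: {utilization(4000, 4):.1%}")
```

With two sites this gives the 16 tunnels mentioned above. At twelve sites it is already over a thousand, compared with 66 if each site were a single block.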

I’m trying to keep these posts somewhat manageable - so look for a continuation of this post next week, where I’ll discuss the solution to this problem and how we implemented it.

Background Story (Continued)

Tue, 20 Dec 2016 08:00:09 +0000

This post is a continuation of last week’s “First, A Bit of Background”

So once I had that magical CCNP certification, I finally felt like I needed to move on. I had gained as much experience from that first job as I thought I would, which meant that I needed to start looking. I got some help from a co-worker of mine at the time, who gave me some wonderful resume tips (which I will share in a future post). Two months and a handful of interviews later, and I found myself jumping on a contract-to-hire position for a local government organization.

The three and a half years spent with this organization taught me so much. I had a great boss, to whom I owe many personal improvements that helped me get where I am today. I walked into the place in a role that was technically supposed to be a Junior Systems Administrator, but the position was much more widely focused than that. I did everything and anything, including managing an Avaya phone system, desktop support, networking, Windows administration, and even a bit of VMware ESX. Obviously, I began to lean more and more toward the networking side of the house, as the team was relatively well split in terms of specializations. One guy loved virtualization and storage, another loved application support, and I owned all things networking.

Another thing this job brought me was the push I needed to go back to school. The organization didn’t like to hire people without a college degree, but I managed to make it in under a very rare set of circumstances. Unfortunately, that meant that I was constantly told that I really need to go back to school and get a degree. After a short while, I gave in and picked up a four-year online degree program in Network Security.

This place was my first real experience in actually owning a network. Having complete control and being able to call it my own. I spent the first couple of months doing exploratory research - what did we have running and how was it configured. Then I built a list of recommendations for things I thought needed to be improved. After a few years, I had replaced almost every device (many were end of life) and made the network significantly more secure and resilient. I had many great learning opportunities in managing my own time and building project plans. I designed network upgrades and made detailed plans to make it all work - and it did, surprisingly.

While that job was an absolutely amazing experience for me in terms of personal and career growth, I eventually reached a point where those things slowed down. Soon the negative aspects of the job were starting to outweigh the positives, and so I began my job search once more. A friend of mine, who I had previously worked with at the consulting company, ended up referring me to a position with a company he worked for. The position was a Network Administrator for a local cloud Software as a Service provider.

I didn’t know it when I took the job, but I ended up walking into an environment where I had the most experience on the team. For having several datacenters around the world, the network architecture left much to be desired - A lot of designs built upon the need of the moment and not the future. At the time of this writing, I’m still with this company - and I’ve already gained quite a different set of skills and experiences: Being the senior team member, designing scalable network architecture, and learning the ability to lead others.

I’m going to stop here with my story for now - but hopefully this provides a bit of context around where my experiences and insight have come from. I have a lot of future post ideas which will build upon everything that I have learned over the past ten years. Thanks for reading!

First, A Bit of Background

Tue, 13 Dec 2016 08:00:30 +0000

I wanted to start off by providing a little background on myself. Hopefully this will put some context around my future posts.

In the beginning - I started off doing some minor PC repair for family and friends. Really quite minor stuff, like replacing power supplies, reinstalling the operating system, or troubleshooting application issues. The technical work really was fun for me, but at that point I had never considered the possibility of it becoming a career. It just seemed like a fun hobby that was great to do in my spare time.

After I completed my second year of high school, I found out that I would have to change schools. Luckily, my new high school offered this fun program called the Cisco Networking Academy. The program was three hours a day for two years, and taught all of the networking fundamentals necessary to pass the Cisco Certified Network Associate (CCNA) exam. I quickly found that this was something that I truly enjoyed doing and I was actually good at it. We had quite a few networking professionals come into the class over those two years and tell stories of how successful a career in computer networking could be. That was the point where I realized that this might actually be a career option - so I went with it.

Within two months of finishing high school, I took and passed the CCNA exam. Cisco certified at the age of eighteen, and now left wondering how to find a job. My next stroke of luck came in the form of a family member who had actively been working in IT for about 10-15 years already. She sat down with me and helped me build my first resume, then showed me where to post it online. Within a few weeks, I began receiving calls from recruiters in the area about a variety of positions. “Level 1 Help desk? No, I want to be a Network Engineer making ALL THE MONIES”. Of course at the time, I had no idea that jumping directly into a network engineer position was very unlikely - especially given that I had no real world experience yet.

A couple interviews and a few months later, and I happened upon a local IT consulting company. I remember interviewing with the manager at the time and mentioning how difficult it was to find a job, since everyone wants you to have experience but no one wants to help you get it. Well, he decided that he was willing to help out and offered me a job as a Level 1 Network Operations Center Engineer.

I spent nearly four long years at that job. I was new to the field so I took advantage of every opportunity they offered me. Certification training? Yes. Networking projects? Yes. Consulting for a variety of businesses? Yep! The company culture was heavily focused on making money quickly, which meant that they didn’t always take care of the employees very well - but there is something to be said about the amount of varied experience I gained, especially for my first real tech job. While I was working here, I also added onto my collection of Cisco certifications: CCNA Voice, CCNA Security, CCDA. I finally finished up by achieving one of my goals of becoming CCNP certified.

So this has been part one of my history, and to make this a bit more readable I’m going to split it into two postings. Continue the story in the next post!

A New Start

Tue, 06 Dec 2016 08:46:07 +0000

Over the years I have made several attempts at starting a blog. A few on networking, general IT, or whatever came to mind. They all ended up the same - I start off strong and fall off quickly. Finally, I believe I’ve realized what my problem is: I always assumed that a successful blog had to be purpose-built and constantly kept up to date with new and exciting content.

So here I am again, giving this another shot. This time I won’t be backing myself into a corner from the start. This blog is intended to be networking oriented but with a bit of a wider focus. I’ve already come up with quite a few ideas for content I would like to write here, so I’m more prepared. That being said - I’m not committing to regular updates or always exciting content. When I have something I feel is worth sharing, I will share it.

So to provide a general overview, here is the outline of topic ideas I have for this blog:

Education/Certification Studies - Every network admin has a blog to document their road to CCIE certification, right? This has certainly been one of my goals over the years, so I’ll be writing about the awesome things that I learn. This is also meant to include general networking education topics, since you can never stop learning.

Career - I’ve been Cisco certified and working in networking for nearly ten years, but that doesn’t mean everyone has. I’ve finally reached the point in my career where I’m meeting a lot of people new to the field and I’m able to help guide them. So I would like to share some of the career advice that I have, both from my own experiences and advice that I have received from others.

Network Design/Architecture - This stuff is really important, as I’ve run into more than enough situations where a network wasn’t originally designed for the type of workload it handles today. I want to cover both network design topics, as well as why it is important. I have some stories to share on how bad network architectures can have significant consequences.

A Little Bit of Magic - This is probably my favorite topic. You ever work with an IT professional who just somehow knows how to fix everything? The person who can pick up almost any technology and become an overnight expert in it? Well, I would like to share some of my insight into how this type of thing is accomplished, and why it just seems so magical.

This likely won’t be everything I cover here, but these are the primary topics. I’m going to give this site my best shot over the next few months - so let’s see how it goes.

Feel free to bug me in the comments with any questions!