Matt / February 7, 2018

Last week I came across a thread on Reddit that asked the question: “What is your company’s policy on maintenance windows?” This got me thinking about how maintenance windows have been handled at the various companies I’ve worked at, and how those schedules and restrictions impact project timelines, network design, and so on.

Many of the places I have worked in the past were typical 8am-5pm, Monday-through-Friday shops. Outside of normal business hours, no one really cared if the network was available. Sure, we might have people who worked late – but a few hours’ notice via email was always enough. However, the company I work for currently has much tighter restrictions on when work can be performed. We have worldwide customers in over a dozen datacenters and some fairly strict uptime SLAs. What this comes down to is a once-a-month allowance for scheduled maintenance, with each window lasting anywhere from 15 minutes to 4 hours.

Some of the immediate impacts of these differing maintenance window schedules are fairly obvious. At a typical 8-5 business, practically any night or weekend is open for network maintenance. This means changes can happen much more frequently – especially changes involving a full network outage. For example, at one of my previous jobs I needed to upgrade each floor of the building from individual Cisco Catalyst 3548 switches to new 2960X stacks. This required moving the cables for up to 200 ports per floor (while also trying to clean up cable management). I was able to complete the work by just coming into the office earlier every day to move the connections before anyone else arrived.

On the other hand, a cloud service provider can’t just decide one day to take an outage of a few hours to swap out network equipment. Instead, changes have to be carefully planned, scheduled, then executed within a short window. Customers have come to expect 100% uptime – and rightfully so. However, we still need some amount of time dedicated to performing upgrades, changes, or other maintenance activities. The simple switch migration from the last example suddenly becomes a multi-month ordeal in an environment such as this. You might be ready to jump on the work, but you need to wait for the next regularly scheduled window – and even then you may have only a handful of minutes to complete your task. If you don’t complete all of it in the time allocated? Well, now your project gets pushed back another month.

So you might ask – as a business scales, does it always end up creating this maintenance monolith? It might – but it certainly doesn’t have to end that way. The effects of higher uptime requirements and shorter maintenance periods might seem like nothing but bad news. However, the change in mindset that comes along with that does bring some unique benefits.

The first major benefit comes in the form of planning. When you have 15 or 30 minutes to complete an entire migration or upgrade, it becomes extremely beneficial to plan out a complete play-by-play of every activity. The limited window means that simple mistakes can cost you valuable time. Of course, the tendency for maintenance windows to be scheduled for late nights also compounds the problem, since you may be tired or less alert. For critical maintenance tasks that I need to accomplish, I take the time to create a step-by-step checklist of every command that must be run, every system that must be tested, and every step needed to roll back. Sufficient planning means fewer mistakes, which in turn increases the chances of success during a tight work period.
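As a rough illustration of that kind of checklist, here is a minimal Python sketch. It assumes each step records an action, a verification, and a rollback; the `Step` class, the function names, and the sample steps are all hypothetical and are not the format I actually use. The point is only that when every step carries its own undo, the rollback plan can be read straight off the checklist in reverse order.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One line of a maintenance checklist (illustrative fields only)."""
    action: str    # what to do, e.g. a command to run
    verify: str    # how to confirm the step worked
    rollback: str  # how to undo the step


def rollback_plan(steps: List[Step], completed: int) -> List[str]:
    """Return the undo actions for the first `completed` steps, newest first."""
    return [s.rollback for s in reversed(steps[:completed])]


def render_checklist(steps: List[Step]) -> str:
    """Format the plan as a numbered checklist that can be printed out."""
    lines = []
    for n, s in enumerate(steps, start=1):
        lines.append(f"{n}. DO: {s.action}")
        lines.append(f"   CHECK: {s.verify}")
        lines.append(f"   UNDO: {s.rollback}")
    return "\n".join(lines)


# Hypothetical plan for a switch replacement.
plan = [
    Step("save running config", "backup file exists", "nothing to undo"),
    Step("move uplink to new stack", "uplink is up", "move uplink back"),
    Step("move access ports", "users can reach gateway", "move ports back"),
]

print(render_checklist(plan))
# If the third step fails, undo only the two completed steps, last one first.
print(rollback_plan(plan, completed=2))
```

Keeping the rollback next to each action means that at 1 am nobody has to work out how to back out of a half-finished change.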

Automation and efficiency start to become a necessity when you have only a few minutes to perform a task. Sure, I might create a very detailed checklist of what must be accomplished – but what happens if it’s simply too much for the time allocated? You can’t complete a 20-minute task in a 15-minute outage window, right? Sometimes we can schedule extended maintenance periods, but this certainly isn’t feasible every month. This is where we begin to try to identify inefficiencies and tasks that would benefit from automation. Over the years I have written a handful of scripts and utilities that allow normal maintenance tasks to be completed quickly. Without the timing restrictions, these tasks might have continued to be done manually, with all the errors that come with manual work.
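The "20-minute task in a 15-minute window" problem is simple arithmetic, and it is worth checking before the window opens. The sketch below is a hypothetical helper, not one of my actual scripts: the function names and the time estimates are assumptions made up for illustration. It sums the estimated minutes per task, compares the total against the window (optionally holding some time back for rollback), and lists the longest tasks as the first candidates for automation.

```python
from typing import Dict, List, Tuple


def fits_window(task_minutes: Dict[str, float],
                window_minutes: float,
                rollback_reserve: float = 0.0) -> Tuple[bool, float]:
    """Return (fits, slack), where slack is the number of spare minutes."""
    total = sum(task_minutes.values())
    slack = window_minutes - rollback_reserve - total
    return slack >= 0, slack


def slowest_tasks(task_minutes: Dict[str, float],
                  n: int = 3) -> List[Tuple[str, float]]:
    """List the n longest tasks: the first candidates for automation."""
    return sorted(task_minutes.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical estimates, in minutes, for one maintenance window.
plan = {
    "back up configs": 3,
    "move cabling": 12,
    "verify connectivity": 5,
}

fits, slack = fits_window(plan, window_minutes=15)
print(f"fits={fits}, slack={slack} min")
print("longest tasks:", slowest_tasks(plan, n=2))
```

With these made-up numbers the plan needs 20 minutes and overruns the 15-minute window by 5, so either the longest task gets scripted or split across windows, or an extended window has to be requested.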

A short maintenance period also encourages more careful network design. If you’re only permitted a half-hour of downtime, then you start looking for ways to minimize the impact. Could the network be designed in a way that allows for a no-downtime switch upgrade or replacement? If not, then how do we get there? In many smaller business networks you might plan for redundancy but never test it – but in a high-uptime environment you begin to rely on it. If you want to get to a point where work can be accomplished with minimal downtime (or even during normal hours), then you must be confident that your network can seamlessly absorb the impact.

I certainly wish some days that I could go back to a life where downtime is acceptable any time during off-hours. However, I’m sure that the desire for higher uptime and greater reliability is here to stay – and I believe that I’ve learned some valuable lessons in trying to meet those requirements. An extremely short maintenance period certainly complicates things, but it also forces us to look for process and design improvements. I believe that the end result is a better network for both the business and its customers.


What are your maintenance practices like? Do you have hours or minutes? Comment below!

About Matt

Cisco certified since 2007 with a wide variety of IT and networking experiences. Just looking to share a bit of my own knowledge and experiences. All opinions are my own, and do not represent any vendor or current/former employer.

1 Comment

  1. Interesting article. My current job has more stringent requirements than past jobs but certainly not as stringent as yours. Usually major changes have to be made on Fridays, sometimes not until after 11 pm. When planning a change, I spend the prior days painstakingly going through the plan, making sure that I have all the commands/changes written out. I also come up with a list of things to check: IP routes, connectivity, etc.

    Invariably I forget something very simple and stupid. Case in point: recently I needed to change the data VLAN on our core switch in our primary data center. We were using VLAN 1, which I know is not best practice, but what primarily prompted the change was OTV and our DR site, which had its own VLAN 1 on a separate subnet.

    Simple change, right? Even for something simple I wrote a detailed plan of each port that needed to be on the new VLAN, etc. I thought I was set. I ran through the scenarios back and forth. We were also upgrading NX-OS, so I was at the data center. After the upgrade I moved the SVI of the old VLAN to the new VLAN. And a few seconds later I got kicked out of the Nexus switch! I forgot to check something very simple: ACS. ACS was on the data VLAN, and when I moved the SVI, I lost connectivity to ACS. Of course the Nexus falls back to the local account. Problem is, the local account wasn’t working and it was past midnight! We were digging for the local account through old documentation from when the switch was installed. I was in the cage about to just pull the plug on the switch when one of my coworkers called to let me know they had found the password!

    I felt kind of stupid. If I’m making a change that I think will cause me to lose connectivity to ACS, I make sure that I know the local account or I create an additional local account just in case. Also, it’s probably better to be consoled in when making such changes. I don’t think I would have gotten kicked out.

    I do feel that having maintenance windows only on Fridays has impeded progress. Some of the changes I’ve had to make on the network could have been made in a couple of weeks but instead took a few months to complete. But I’m also not the one who has to answer to the business side if something is down.

    I like your idea of making the most of a short time window, forcing you to be creative and use scripting and automation.
