Last week I came across a thread on Reddit that asked the question: “What is your company’s policy on maintenance windows?” This got me thinking about how maintenance windows have been handled at the various companies I’ve worked for, and how those schedules and restrictions impact project timelines, network design, etc.
Many of the places that I have worked at in the past have been typical 8a-5p/M-F shops. Outside of normal business hours, no one really cared if the network was available. Sure, we might have people who worked late – but a few hours’ notice via email was always enough. However, the company I work for currently has much tighter restrictions on when work can be performed. We have worldwide customers in over a dozen datacenters and some fairly strict uptime SLAs. What this comes down to is a once-a-month allowance for scheduled maintenance, with each window limited to anywhere from 15 minutes to 4 hours.
Some of the immediate impacts of these differing maintenance window schedules are fairly obvious. At a lot of typical 8-5 businesses, practically every night and weekend is open for network maintenance. This means changes can happen much more frequently – especially changes involving a full network outage. For example, at one of my previous jobs I needed to upgrade each floor of the building from individual Cisco Catalyst 3548 switches to new 2960X stacks. This required moving the cables for up to 200 ports per floor (while also trying to clean up cable management). I was able to complete the work by just coming into the office earlier every day to move the connections before anyone else arrived.
On the other hand, a cloud service provider can’t just decide one day to take an outage of a few hours to swap out network equipment. Instead, changes have to be carefully planned, scheduled, then executed within a short window. Customers have come to expect 100% uptime – and rightfully so. However, we still need some amount of time dedicated to performing upgrades, changes, or other maintenance activities. The simple switch migration from the last example suddenly becomes a multi-month ordeal in an environment such as this. You might be ready to jump on the work, but you need to wait for the next regularly scheduled window – and even then you may have only a handful of minutes to complete your task. If you don’t complete all of it in the time allocated? Well, now your project gets pushed back another month.
So you might ask – as a business scales, does it always end up creating this maintenance monolith? It might – but it certainly doesn’t have to end that way. The effects of higher uptime requirements and shorter maintenance periods might seem like nothing but bad news. However, the change in mindset that comes along with that does bring some unique benefits.
The first major benefit comes in the form of planning. When you have 15 or 30 minutes to complete an entire migration or upgrade, it becomes extremely beneficial to plan out a complete play-by-play of every activity. The limited window means that simple mistakes can cost you valuable time. Of course, the tendency for maintenance windows to be scheduled for late nights also compounds the problem since you may be tired or less alert. For critical maintenance tasks that I need to accomplish, I take the time to create a step-by-step checklist of every command that must be run, every system that must be tested, and every step needed to roll back. Sufficient planning means fewer mistakes, which in turn increases the chances of success during a tight work period.
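To make the checklist idea concrete, here is a minimal Python sketch of a runbook in which every step carries its own test and its own rollback. The `Step` and `run_checklist` names are hypothetical, invented for illustration, and the step actions are plain Python callables rather than real device commands. It is one possible way to structure such a checklist, not a description of any particular tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One checklist entry: what to do, how to test it, how to undo it."""
    name: str
    apply: Callable[[], None]      # the change itself (hypothetical callable)
    verify: Callable[[], bool]     # returns True if the change looks healthy
    rollback: Callable[[], None]   # undoes the change


def run_checklist(steps: List[Step]) -> bool:
    """Run steps in order; if a check fails, undo applied steps in reverse."""
    applied: List[Step] = []
    for step in steps:
        step.apply()
        applied.append(step)
        if not step.verify():
            # Roll back the failed step first, then everything before it.
            for finished in reversed(applied):
                finished.rollback()
            return False
    return True
```

Writing the rollback next to each step means the undo path is decided while you are still alert, not at 2 a.m. in the middle of the window.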
Automation and efficiency start to become a necessity when you have only a few minutes to perform a task. Sure, I might create a very detailed checklist of what must be accomplished – but what happens if it’s simply too much for the time allocated? You can’t complete a 20-minute task in a 15-minute outage window, right? Sometimes we can schedule extended maintenance periods, but this certainly isn’t feasible every month. This is where we begin to try to identify inefficiencies and tasks that would benefit from automation. Over the years I have written a handful of scripts and utilities that allow normal maintenance tasks to be completed quickly. Without the timing restrictions, these tasks might have continued to be done by hand, with all the errors that come with manual work.
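The "20-minute task in a 15-minute window" problem is simple arithmetic, and it can be checked before the window opens. The sketch below is a hypothetical helper, not one of the scripts mentioned above: it sums per-step time estimates, compares them against the window, and lists the slowest steps as the first candidates for automation. The function names, the `reserve_min` parameter (time held back for rollback), and the step names in the usage note are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def fits_window(estimates_min: Dict[str, float],
                window_min: float,
                reserve_min: float = 0.0) -> Tuple[bool, float]:
    """Return (fits, slack) where slack is the minutes left over.

    reserve_min is time deliberately held back, e.g. for a rollback.
    A negative slack means the plan overruns the window.
    """
    total = sum(estimates_min.values())
    slack = window_min - reserve_min - total
    return slack >= 0, slack


def slowest_steps(estimates_min: Dict[str, float], count: int = 3) -> List[str]:
    """Names of the longest steps, which are the first automation candidates."""
    ranked = sorted(estimates_min, key=estimates_min.get, reverse=True)
    return ranked[:count]
```

For example, with made-up estimates of 12 minutes to move cables, 5 to reload a switch, and 3 to verify uplinks, `fits_window(plan, 15)` reports that the plan does not fit and is 5 minutes over, and `slowest_steps(plan, 1)` points at the cable move as the step to shorten or pre-stage.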
A short maintenance period also encourages more careful network design. If you’re only permitted a half-hour of downtime, then you start looking for ways to minimize the impact. Could the network be designed in a way that allows for a no-downtime switch upgrade or replacement? If not, then how do we get there? In many smaller business networks you might plan for redundancy but never test it – but in a high-uptime environment you begin to rely on it. If you want to get to a point where work can be accomplished with minimal downtime (or even during normal hours), then you must be confident that your network can seamlessly absorb the impact.
I certainly wish some days that I could go back to a life where downtime is acceptable any time during off-hours. However, I’m sure that the desire for higher uptime and greater reliability is here to stay – and I believe that I’ve learned some valuable lessons in trying to meet those requirements. An extremely short maintenance period certainly complicates things, but it also forces us to look for process and design improvements. I believe that the end result is a better network for both the business and its customers.
What are your maintenance practices like? Do you have hours or minutes? Comment below!