SRX Basics: Redundancy Groups and Failover

In last week’s post, we took a look at how to set up a chassis cluster on a Juniper SRX firewall. Now that we have a basic cluster set up, let’s explore some of the additional options and configuration items.

Redundant Ethernet Interfaces

First things first: once you have a cluster configured, you’ll probably want to configure a few redundant ethernet interfaces, often referred to as reth interfaces. A reth is a shared interface between your SRX pair, where you configure the IP address and VLAN information once and both nodes share it. Let’s say that we have a Juniper SRX 1500 cluster, and we want to create a redundant interface for one of our 10Gb ports. Here is how we would do that:

root@testsrx# set interfaces xe-0/0/16 gigether-options redundant-parent reth1
root@testsrx# set interfaces xe-7/0/16 gigether-options redundant-parent reth1
root@testsrx# set interfaces reth1 redundant-ether-options redundancy-group 1

In the config above, we first take both of our interfaces (xe-0/0/16 on node0, and xe-7/0/16 on node1) and tell them that they now belong to a redundant interface (reth1). Next, we enter the reth1 config and associate it with a redundancy group.
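
As a rough sketch of what usually comes next, this is how you might put an address on the new reth and drop it into a security zone. The 192.0.2.1/24 address and the zone name untrust are placeholders I made up for illustration, so swap in your own:

root@testsrx# set interfaces reth1 unit 0 family inet address 192.0.2.1/24
root@testsrx# set security zones security-zone untrust interfaces reth1.0

If you need tagged VLANs instead of the untagged unit 0 above, it would look something like this (VLAN 100 and the 198.51.100.1/24 address are again just examples):

root@testsrx# set interfaces reth1 vlan-tagging
root@testsrx# set interfaces reth1 unit 100 vlan-id 100
root@testsrx# set interfaces reth1 unit 100 family inet address 198.51.100.1/24

Either way, the layer 3 config lives on the reth and never on the physical xe- member ports.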

You’ll also need to keep in mind that the SRX requires you to specify how many redundant ethernet interfaces will be configured. My guess is that this comes down to resource allocation, since each SRX model also has a different maximum number of reth interfaces. For example, if you tell the SRX that you need 5 reth interfaces, then the SRX will allocate system resources to manage those interfaces. To set the number of available reth interfaces, we’ll use the following command:

root@testsrx# set chassis cluster reth-count 5

Redundancy Groups

A redundancy group, or RG, is a container for logically grouping redundant interfaces and virtual routers that must fail over together. Each RG is primary on one of the two SRX firewalls in a cluster at any given time, with the ability to fail over to the other node. For example, if we only plan to use one virtual routing instance on our SRX, we would create RG1 and assign our interfaces to it.

A quick note: all interfaces in a single virtual router must belong to the same RG. This way the virtual routing instance and all of its associated interfaces will always run on the same SRX node. In order to achieve an active/active firewall configuration, you would need to create two separate virtual routers, each with their own reth interfaces and different RGs. Then you would make RG1 primary on node0, and RG2 primary on node1.
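
To make the active/active idea a bit more concrete, here is a minimal sketch. The names VR-A and VR-B, and the second interface reth2, are hypothetical and only there for illustration. reth2 would also need its own pair of member ports and a unit 0 configured, which I’ve left out. The priority commands are explained just below; the point here is only that the two RGs prefer opposite nodes:

root@testsrx# set chassis cluster redundancy-group 1 node 0 priority 200
root@testsrx# set chassis cluster redundancy-group 1 node 1 priority 50
root@testsrx# set chassis cluster redundancy-group 2 node 0 priority 50
root@testsrx# set chassis cluster redundancy-group 2 node 1 priority 200
root@testsrx# set interfaces reth1 redundant-ether-options redundancy-group 1
root@testsrx# set interfaces reth2 redundant-ether-options redundancy-group 2
root@testsrx# set routing-instances VR-A instance-type virtual-router
root@testsrx# set routing-instances VR-A interface reth1.0
root@testsrx# set routing-instances VR-B instance-type virtual-router
root@testsrx# set routing-instances VR-B interface reth2.0

With something like this in place, traffic for VR-A would normally flow through node0 and traffic for VR-B through node1, and either node can pick up both if the other one fails.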

In most configurations, dumping all of your reth interfaces into RG1 will be sufficient. You’re likely going to want to set up a priority for each RG, and maybe even preemptive failover. To do that, you’ll have to configure each cluster member with a priority:

root@testsrx# set chassis cluster redundancy-group 0 node 0 priority 200
root@testsrx# set chassis cluster redundancy-group 0 node 1 priority 50
root@testsrx# set chassis cluster redundancy-group 1 node 0 priority 200
root@testsrx# set chassis cluster redundancy-group 1 node 1 priority 50
root@testsrx# set chassis cluster redundancy-group 1 preempt

The higher priority wins here, so if you set node0 to a higher priority and preempt is enabled, then node0 will actively try to take ownership of RG1. I would rather not set preempt on RG0 for a few reasons, which we’ll cover in the next section. Interface monitoring can also change the outcome: if a monitored interface goes offline, the node can lose its claim on the RG (also covered below).

A Note About RG0

You might notice from the last post that your output of show chassis cluster status already showed two redundancy groups: RG0 and RG1. RG0 controls which node’s Routing Engine is primary, so it covers the control plane and management of the cluster rather than your transit interfaces. Unfortunately, this can lead to some weird behaviors that you might not be expecting.

For example, whichever node is primary for RG0 is the only node that collects interface and monitoring statistics. If you’re using a monitoring tool that polls data from both of your SRXs, then the secondary for RG0 will report nothing about its interfaces, CPU, etc. This is also true if you log into the secondary SRX itself: a show interfaces will return a bunch of default values, including showing that your ports are half-duplex. Don’t panic though, this is just an oddity of RG0. If you log back into the primary node for RG0, it will show all of the proper statistics for both SRX firewalls.

Due to these weird things about RG0, I prefer to always leave it on node0. That way I know which one to log into whenever I need to look at something, or which SRX to check in our monitoring tools. It’s also worth noting that whichever SRX is primary for RG0 is also the node you need to log into for configuration changes, even if all of your other redundancy groups are on the other SRX.

Weird, right?
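
One thing that takes some of the sting out of this: if you land on the wrong node, you can hop over to the other one from the CLI without opening a new session. As far as I know this works on any clustered SRX, but treat it as a sketch and check it on your platform. Here I’m assuming you are logged into node0 and want to reach node1:

root@testsrx> request routing-engine login node 1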

Oh, and be warned that since RG0 controls the routing engine, a failover of this RG can cause brief outages. This is primarily because the routing table and firewall state information will be lost. The secondary node has to spin up new processes for the routing engine, and at least currently there isn’t a graceful sync of all of that data.

Interface Monitoring

I mentioned setting device priorities a bit earlier. Setting interface weights is going to be the primary method for dynamically affecting those priorities, and therefore triggering an automatic failover. One example might be that you’re using an SRX cluster for your edge firewall, and you want it to automatically fail over if the primary loses its internet uplink.

Note that you must configure the physical interfaces here, not the redundant ethernet interfaces:

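root@testsrx# set chassis cluster redundancy-group 1 interface-monitor xe-0/0/16 weight 255

Remember when we set the priorities of our firewalls earlier? Node0 was set to 200, and node1 to 50. It’s tempting to assume that the weight gets subtracted from that priority, but that isn’t how it works. Each redundancy group starts with a threshold of 255, and when a monitored interface goes down, its weight is subtracted from that threshold. Once the threshold reaches 0, the node’s priority for that RG drops to 0 and the RG fails over to the other node. So with a weight of 255, losing xe-0/0/16 on its own is enough to move RG1 to node1. A weight of 160 would only bring the threshold down to 95, and nothing would fail over until more monitored interfaces went down. The reverse is also true: when xe-0/0/16 comes back up, node0’s priority goes back up to 200, and because preempt is enabled, node0 will take back ownership of RG1.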
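
To see what the cluster currently thinks of a monitored interface and of each node’s priority, these are the two operational commands I’d reach for. I’ve left the output out, since the exact layout varies by platform and release:

root@testsrx> show chassis cluster interfaces
root@testsrx> show chassis cluster status redundancy-group 1

The first lists the monitored interfaces along with their weight and up/down state, and the second shows the priority each node currently holds for RG1.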

Manual Failover

There is a pretty good chance at some point you might need to perform a manual failover of your SRX redundancy groups. Maybe you need to do some maintenance or upgrades, or you just want to make sure failover works as you expect. In either case, the commands to do this are pretty straightforward:

root@testsrx> show chassis cluster status 
Cluster ID: 5
Node       Priority       Status       Preempt       Manual failover
Redundancy group: 0 , Failover count: 0
node0      200            primary      no            no 
node1      50             secondary    no            no

Redundancy group: 1 , Failover count: 0
node0      200            primary      yes           no 
node1      50             secondary    yes           no

root@testsrx> request chassis cluster failover redundancy-group 1 node 1 

root@testsrx> show chassis cluster status 
Cluster ID: 5
Node       Priority       Status       Preempt       Manual failover
Redundancy group: 0 , Failover count: 0
node0      200            primary      no            no 
node1      50             secondary    no            no

Redundancy group: 1 , Failover count: 1
node0      200            secondary    yes           yes
node1      255            primary      yes           yes

Okay – so let’s talk about a few things that have happened here. I always recommend that you run a show chassis cluster status first, so you know where things already stand. Then we can proceed by requesting a failover. To do this, you have to specify which redundancy group you want to fail over, and which node you want to become the new primary. So in this case, we made node1 the new primary of RG1.

You might also notice that the priorities have changed, and the devices are marked as being in a manual failover state. This is important, because you cannot manually fail back until you reset this state. That’s right – if you tried to run the failover command again to move RG1 back to node0, it will not work. An automatic failover due to hardware failure or interface monitoring will still be permitted. In order to perform a manual fail-back to node0, we have to run the following reset command:

root@testsrx> request chassis cluster failover reset redundancy-group 1
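
Putting it together, a full manual fail-back might look like the sequence below. This is a sketch that assumes preempt is not doing the work for you. With preempt enabled on RG1 as configured earlier, node0 should reclaim the group on its own once the reset clears the manual state, in which case the second and third commands would be unnecessary:

root@testsrx> request chassis cluster failover reset redundancy-group 1
root@testsrx> request chassis cluster failover redundancy-group 1 node 0
root@testsrx> request chassis cluster failover reset redundancy-group 1
root@testsrx> show chassis cluster status redundancy-group 1

The final reset clears the manual failover flag again so the next manual failover isn’t blocked, and the status check confirms where RG1 ended up.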

Hopefully, between last week’s post and this one, you now have a good handle on the basics of configuring a chassis cluster on your new pair of Juniper SRX firewalls. Let me know in the comments below if this helped you!