SRX Basics: Clustering

So you just unboxed a brand new pair of Juniper SRX firewalls – now what? Well, the first thing you’re likely going to want to do is get the two devices hooked up and clustered together. That should be pretty simple, right? Yeah, mostly – though there are a few variations between device models, and a few fine-print steps that might keep everything from working on the first try.

So let’s take a look at what we need to do!

Physical configuration

First thing we need to do is get both devices unboxed and cabled appropriately. To build a working cluster, we need to connect two critical ports: the HA control port and the cluster fabric port. Technically only the HA control port is required for the cluster to form, but you’ll want the fabric port working as well – here is what both ports are used for:

HA Control Port – This is used for communication between the cluster members. It carries control-plane traffic only – things like keepalives/heartbeats and config sync between the two nodes.

Fabric Port – This port is used for data sync between the cluster members. All routing/firewall state information is synced over this port, and any cross-cluster traffic is also carried over it (for example, if one SRX is primary for a redundancy group but the secondary is the active BGP speaker for your upstream connection, traffic comes in through the secondary and crosses this link to reach the node that is primary for that redundancy group).
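
Once the cluster is up later on, you can sanity-check that keepalives are actually flowing across both of these links. I’ll leave the output out here since the exact layout varies between JunOS releases, but this operational-mode command shows per-link heartbeat and probe counters:

root@testsrx> show chassis cluster statistics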

The fabric port is the easier of the two to connect – you can use any port you like, then specify which port to use in the CLI. The control port, however, must be the specific port that Juniper assigns for this purpose, and unfortunately that port varies between device models. The two most common SRXs that I’ve deployed are the 345 and 1500. The SRX 1500 has a dedicated 10G HA control port, but the SRX 345 actually uses ge-0/0/1 on both nodes for this. Juniper lists what all those port assignments are over on this page.
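
One side note on those control ports: once the cluster is formed, the control link gets re-purposed internally (on the branch boxes I’ve worked with it shows up as fxp1; on other platforms you may see it as em0/em1 instead). If you ever want to confirm which interface got claimed, a quick check from operational mode is:

root@testsrx> show interfaces terse | match fxp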

Once those ports are connected, go ahead and power on both devices!

JunOS Config

Okay – Once the physical configuration has been completed, there are a few things that need to be configured on both devices before you can establish a cluster.

When you first boot each device, you’ll log in with root and no password. Then you’ll be dropped into the JunOS shell, and you’ll need to type cli to start the JunOS command-line interface. Then type configure to get into the configuration mode.

root@testsrx% cli
root@testsrx> configure
root@testsrx#

In configuration mode, we’ll need to set a root password before we can enable clustering. This password must match on both devices!

root@testsrx# set system root-authentication plain-text-password
New password: <type your root password here>
Retype new password: <and again...>

For the SRX 1500 series, where there is a dedicated HA Control port, this is enough to get the cluster working. But for some of the branch SRXs, like the 300 series, you’ll need to make a few additional changes. These devices ship with a factory-default config that puts IP addressing and security-zone assignments on certain interfaces – including ge-0/0/1, the very port the 300 series needs for the control link. That conflicts with your cluster config and will keep the cluster from reaching a healthy state.

In my config, I already plan on re-configuring all of the interfaces and security-zones to fit my needs – so I will just delete those entire config sections:

root@testsrx# delete interfaces
root@testsrx# delete security
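
If you want to double-check what’s in those sections before blowing them away, you can view them from configuration mode first – running show at a given hierarchy displays the candidate config there:

root@testsrx# show interfaces
root@testsrx# show security zones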

After all that is done, we need to commit our changes:

root@testsrx# commit
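
One aside here: if you ever want JunOS to validate a candidate config without actually applying it – handy after sweeping deletes like the ones above – commit check does exactly that:

root@testsrx# commit check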

Finally, we can go ahead and set up the cluster! This is actually done outside of configuration mode, so you will need to exit that first.

So one thing to note here – each cluster is configured with a cluster-id, and that ID MUST be unique within any layer 2 broadcast domain. The cluster ID is baked into the virtual MAC addresses the cluster uses for its redundant Ethernet (reth) interfaces, so two clusters sharing an ID on the same segment would end up with duplicate MACs. If we had multiple SRX clusters within a single broadcast domain, we would need to assign each one a different cluster ID. I’ll use cluster-id 5 in this example.

On whichever SRX you want to be the primary node:

root@testsrx# exit
root@testsrx> set chassis cluster cluster-id 5 node 0 reboot

I personally like to give the primary a minute or two into its boot process before I configure the secondary, which uses a similar command (just specifying node 1 instead of 0):

root@testsrx> set chassis cluster cluster-id 5 node 1 reboot

After both nodes come back online, log into node 0 and run the following command:

root@testsrx> show chassis cluster status 
Cluster ID: 5
Node       Priority      Status      Preempt      Manual failover

Redundancy group: 0 , Failover count: 0
node0      100           primary     no           no 
node1      100           secondary   no           no

Redundancy group: 1 , Failover count: 0
node0      100           primary     yes          no 
node1      100           secondary   yes          no

Perfect! Now let’s go configure our fabric ports! Interface fab0 will be configured as the fabric port on node0, and interface fab1 will be configured as the fabric port on node1.

root@testsrx> configure
root@testsrx# set interfaces fab0 fabric-options member-interfaces ge-0/0/10
root@testsrx# set interfaces fab1 fabric-options member-interfaces ge-5/0/10
root@testsrx# commit

Now that we’re in a cluster, all of this configuration can be done on node0 – but note that in this case the secondary device’s ports all start with ge-5/x/x. This is another oddity of JunOS – that numbering scheme isn’t always the same. On the SRX1500s, the node1 ports all start with ge-7/x/x – so this will vary depending on which devices you’re working with. If you ever need to check, you can run show interfaces terse to list all of the interfaces in the cluster.
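
That output can get pretty long, so I’ll usually pipe it through a match filter to just see the revenue ports (the interface names you see will, of course, depend on your platform):

root@testsrx> show interfaces terse | match ge-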

As a final verification that all our ports are up, drop out of config mode and run show chassis cluster interfaces:

root@testsrx> show chassis cluster interfaces
Control link 0 name: ge-0/0/1
Control link status: Up

Fabric interfaces:
Name    Child-interface    Status
fab0    ge-0/0/10          up
fab0
fab1    ge-5/0/10          up
fab1
Fabric link status: up

Hooray! We now have a functioning SRX cluster!

Sometimes if this doesn’t work, the output of show chassis cluster status will show the secondary node as disabled or lost. I’ve found that lost usually indicates a conflicting configuration on the cluster interfaces (like leaving the default IPs configured). If you see disabled, try rebooting the secondary node again – and if that doesn’t work, then you may need to disable clustering on both nodes and re-configure. This can be done using the set chassis cluster disable reboot command.
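
If you want more detail than the status output gives you, there’s also an operational command that dumps each node’s cluster state history, which can give a hint as to why a node ended up disabled (the exact fields vary a bit by JunOS release):

root@testsrx> show chassis cluster information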

Next week, we’ll look at redundancy-groups, performing manual failovers, and setting up interface monitoring for automatic failovers. Hope this was helpful!