Paired failover
When you’re designing an infrastructure for a service or application that requires high uptime, you have to plan for certain levels of resiliency and redundancy. There are a lot of considerations that come into play. Every decision is balanced against the cost, the benefit, and the uptime requirements. You have to ensure that you have high availability at every level of your tech stack, so there’s no point in buying two firewalls and two switches if your server only has a single NIC. Do you have separate internet connections, from different carriers, coming in over different physical paths? Does each host connect to redundant network fabrics? Are key services like DNS and authentication highly available?
There are a lot of different models for high availability. For example, Cisco ASA firewalls are typically deployed in an Active/Passive configuration. A trio of Juniper EX switches in a Virtual Chassis all pass traffic at the same time, with the higher-level roles consolidated on one master device at a time. Each device has its own failover scenarios and requirements. One of the things to avoid when combining all of these different mechanisms is a paired failover.
A paired failover, at least by my definition, is when one failover forces other dependent devices to fail over as well. A good example would be having two internet connections, with each one plugged into a different firewall. If the internet connection had to fail over, the firewall would have to fail over at the same time. The internet connection and the firewall form a paired failover group, where one’s failover requires the other to fail over too.
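The definition above can be turned into a small check. This is only a sketch under stated assumptions: the device names (isp-a, fw-1, and so on), the tier names, and the helper functions are hypothetical, and the model only looks at the tiers directly next to the one that failed.

```python
# Hypothetical topology from the example: each internet connection is
# cabled to exactly one firewall.
LINKS = {("isp-a", "fw-1"), ("isp-b", "fw-2")}

# Tiers listed in path order (upstream first); values are device names.
ACTIVE = {"internet": "isp-a", "firewall": "fw-1"}
STANDBY = {"internet": "isp-b", "firewall": "fw-2"}


def connected(links, a, b):
    """Return True if devices a and b share a direct link (either direction)."""
    return (a, b) in links or (b, a) in links


def forced_failovers(links, active, standby, failed_tier):
    """List the neighbouring tiers that must fail over along with failed_tier.

    A neighbour is forced to move when its active device has no link to the
    device that takes over in the failed tier.
    """
    tiers = list(active)
    i = tiers.index(failed_tier)
    replacement = standby[failed_tier]
    neighbours = tiers[max(i - 1, 0):i] + tiers[i + 1:i + 2]
    return [t for t in neighbours
            if not connected(links, replacement, active[t])]


# One cable per connection: the firewall is dragged along.
print(forced_failovers(LINKS, ACTIVE, STANDBY, "internet"))  # ['firewall']

# Cross-connect both connections to both firewalls: nothing else moves.
MESHED = LINKS | {("isp-a", "fw-2"), ("isp-b", "fw-1")}
print(forced_failovers(MESHED, ACTIVE, STANDBY, "internet"))  # []
```

A non-empty result is a paired failover group by the definition above; an empty one means the failover stays contained in its own tier.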
I hate paired failovers. In the ideal network design, each object in a path is redundant, and a failover is transparent to the devices both upstream and downstream of it. There are a few good ways to accomplish this:
Single-image systems (think Cisco’s VSS or “stacking” or Juniper’s Virtual Chassis)
LACP and VRRP in both directions (Cisco’s vPC handles this well)
A routing protocol with fast failover
Active/Passive failover with state replication (Cisco ASAs fit the bill, though you still need LACP for link redundancy)
The key feature is that the identity of the device (the IP address or the link) is transparent/virtual and can move between devices. While using a routing protocol to handle failover between devices might fit the bill, it depends on the protocol and the speed of the failover. BGP is great for this, since it can support dual-active scenarios. I also particularly like single-image solutions or configurations where the state information is shared between redundant devices to further minimize the impact of a failure.
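To make the idea of a floating identity concrete, here is a rough sketch of VRRP-style ownership of a virtual IP. It is not a VRRP implementation: the Router class, the owner function, and the addresses (taken from the 192.0.2.0/24 documentation range) are assumptions for illustration. The only protocol behavior borrowed is that the live router with the highest priority wins, with ties going to the higher address.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address

VIRTUAL_IP = IPv4Address("192.0.2.1")  # the address neighbours actually use


@dataclass
class Router:
    name: str
    priority: int          # higher priority wins the election
    address: IPv4Address   # the router's own address, used as a tie-breaker
    alive: bool = True


def owner(routers):
    """Return the router that currently answers for VIRTUAL_IP, or None."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.priority, r.address))


gw1 = Router("gw-1", 200, IPv4Address("192.0.2.2"))
gw2 = Router("gw-2", 100, IPv4Address("192.0.2.3"))

print(owner([gw1, gw2]).name)  # gw-1
gw1.alive = False              # gw-1 fails
print(owner([gw1, gw2]).name)  # gw-2
```

Upstream and downstream devices keep pointing at the virtual IP the whole time, which is why nothing else in the path has to fail over.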
How do you end up with a paired failover? It’s an easy trap to fall into.
Imagine you have two firewalls (let’s say Juniper SRX boxes) and a couple of Juniper switches. You obviously want networking redundancy, so you put the switches into a Virtual Chassis. Done, right? Nope, you still have to make sure everything is properly multihomed. Are your servers capable of LACP or another bonding protocol to the upstream switch? Both of your firewalls will need to connect to both of your switches (4 cables), otherwise a switch failover would also require the firewall to fail over at the same time. Every device in the chain needs to have redundant connections to the other devices, in both directions. But what about the internet connections to the SRX boxes? Do you just have a single WAN switch out in front? Are they separate links? How would your IP space fail over between the devices? You’ll probably want BGP to ensure your IPs can float. But what if your BGP addresses float to the non-active firewall, causing asymmetric routing? As you can see, there are a lot of considerations at play, and you have to be careful to address every possible failure scenario.
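The cabling part of that checklist is easy to audit mechanically. The sketch below assumes a simple model in which every device in one tier should be cabled to every device in the next tier; the tier layout, device names, and the missing_links helper are all hypothetical, and it says nothing about the BGP or asymmetric routing questions.

```python
from itertools import product

# Hypothetical tiers in path order, with the physical devices in each.
TIERS = [
    ("firewalls", ["srx-1", "srx-2"]),
    ("switches", ["ex-1", "ex-2"]),
    ("servers", ["web-1"]),
]

# Cables as actually installed: each firewall goes to only one switch.
CABLES = [
    ("srx-1", "ex-1"),
    ("srx-2", "ex-2"),
    ("web-1", "ex-1"),
    ("web-1", "ex-2"),
]


def missing_links(tiers, cables):
    """Return device pairs in adjacent tiers that have no cable between them."""
    cabled = {frozenset(cable) for cable in cables}  # direction does not matter
    missing = []
    for (_, upper), (_, lower) in zip(tiers, tiers[1:]):
        for a, b in product(upper, lower):
            if frozenset((a, b)) not in cabled:
                missing.append((a, b))
    return missing


print(missing_links(TIERS, CABLES))
# [('srx-1', 'ex-2'), ('srx-2', 'ex-1')]
```

Adding the two reported cables brings the firewall-to-switch tier up to the four cables mentioned above, at which point the function returns an empty list.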