From failovers to keepalived over vSwitch(es) with Hetzner

A few years ago, we were happy with Hetzner’s failover IPs and managed them manually. Or rather, we managed them with some custom scripts, but it still wasn’t enough. As with everything related to the Internet, data-center networks are evolving rapidly.

Since the beginning of our cooperation with Hetzner, I (and other DevOps folks I know) have been using Hetzner’s failover IPs to bring at least some level of fault tolerance, even though it was done manually with custom Python scripts. Now it’s 2019, and it’s time to automate things and let the system heal itself.

Fail[ed]over IPs experiment on Kubernetes NGINX ingress(es)

For the DTOne data platform, we build Kubernetes clusters on Hetzner’s bare metal for many different workloads (the question “Why bare metal and not cloud?” is a bit out of scope here, but believe me, there are reasons). We utilize the Kubernetes NGINX ingress to expose the workloads to the wild via the standard HTTP(S) ports 80 and 443. Without any failover or virtual IPs, the NGINX ingress ports are bound to the main host IP address. If the host goes down, clients face timeouts and failed connections.

As we were used to Hetzner’s failover IPs, the first idea was to let them be managed automatically by keepalived (an implementation of the VRRP protocol) that can migrate virtual IP address(es) among hosts on its own. Luckily, Cornelius Keller had already done some work on this, and there’s a Helm chart with a similar keepalived-vip implementation for Kubernetes.

We spent two hours figuring out that it doesn’t work out of the box. In fact, it required some investigation to find out why the virtual_ipaddress {} stanza was empty. Short answer: there was a lot more to be done to make it work the Kubernetes way.

We also had a question: “Do we really need to have it implemented as a Kubernetes workload?”

No!

We have Puppet, which can manage it automatically on the host itself. That’s how the keepalived.conf PoC with one failover IP for two machines came to life. Let’s call it a “crafted” version of what Cornelius Keller did, slightly modified to suit our needs.

Honestly, this kind of setup sucks, because you have to keep the communication authenticated with the older VRRPv2, use the default Hetzner machine network, and stick to IPv4 only. You can drop the authentication and go with the newest VRRPv3, which supports even the IPv6 stack, but no one wants to keep VRRP open on Hetzner’s default network. Just try running tcpdump in promiscuous mode on your default Hetzner IP.

vSwitch enters the game

In September 2018, Hetzner came up with an interesting piece of technology (I’m not saying it’s something brand new in the world of hosting) called vSwitch, which hides your machines in a separate L2 network (read: a separate VLAN). Basically, your personal network within Hetzner’s network.

For those familiar with the vSwitch setup, it’s likely a game changer, and you may already have a clue how the story ends. I was not one of those people, but you’ll read about that later.

I took the vSwitch, put the Kube Ingress machines into the vSwitch network, and reconfigured keepalived.conf to utilize the vSwitch network for VRRP communication. The failover IPs part remained the same, except for one additional failover IP address.

Later on you’ll see that it’s still highly suboptimal to let keepalived manage the IP address assignments even for failover IPs - a better approach is to bind the IPs manually in /etc/network/interfaces and let keepalived manage some random local IP address (e.g. 192.168.100.4). Then you can even manage the failover manually via Hetzner’s API.
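For illustration, such a manual failover via the Robot webservice boils down to a single authenticated request - a hedged sketch where the credentials and IP addresses are placeholders:

# point the failover IP 1.2.3.4 at the server whose main IP is 5.6.7.8
# (Robot webservice credentials; placeholders)
curl -u "robot-user:robot-password" \
  -d "active_server_ip=5.6.7.8" \
  https://robot-ws.your-server.de/failover/1.2.3.4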

Downsides of the Hetzner failover IPs

Hetzner failover IPs did their job in the past, and I can say they saved us many stressful moments. But they’re not perfect (nothing is).

One of the downsides of Hetzner’s failover IPs is the time it takes to apply a failover via the Hetzner Robot API - a long-term rule of thumb says ~30s per request, because the new settings need to be applied on all the routers in Hetzner’s infrastructure. There’s also a decent chance the request will fail for one reason or another - especially in case of an outage on Hetzner’s side. Marcin Onufry Hlybin noticed that and came up with a more sophisticated mechanism that keeps an eye on the failover IP state with his improved keepalived failover script logic (note the cron job).

Even the cron job can be the culprit behind other issues, like hitting Hetzner’s API limits - so you won’t be able to fail over when you really need to. Some may say you can overcome such an issue with custom hacks that check the failover destination via TCP, but honestly, it’s yet another hack that complicates things and usually means more items to maintain. No one likes that.

Failover IPs are not a great candidate for VRRP’s virtual IPs - something noted by a colleague of mine:

isn’t it too complicated setup

He was totally right! My lack of experience with vSwitch technology - combined with some level of ignorance and excitement about making it work - was the culprit here. I totally missed that Hetzner’s vSwitch offers additional IP subnets - for both IPv[46] stacks. “Game over!”

Hetzner vSwitch with additional IPv4/6 subnets

Now, the good stuff - where things really work!

The final PoC implementation is stored in the v3-final-config folder, which contains the complete keepalived and networking source for 3 physical nodes. The nodes communicate over a separate vSwitch network and shuffle virtual IP addresses from an additional vSwitch IP range that is separate from the nodes’ main IP addresses.

The whole setup is not complicated at all, but let’s break it down a bit to make it clearer.

The /etc/network/interfaces part to set up the additional VLAN 4000 interface

As per their official documentation, one needs to define a separate interface over the physical one that will handle 802.1q encapsulated traffic. In Debian’s /etc/network/interfaces language, it’s kinda simple:

auto eth0.4000
iface eth0.4000 inet static
    address 192.168.200.1
    netmask 255.255.255.0

The eth0.4000 stanza automatically creates a device that operates with VLAN ID 4000 on top of the physical eth0 device. The VLAN devices are assigned local IP addresses, so we can keep all the additional public IP addresses free for failover purposes.
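To verify that the VLAN device came up as expected, iproute2 can show the 802.1q details:

# should print a line like "vlan protocol 802.1Q id 4000"
ip -d link show eth0.4000
# ... and the local address we assigned to it
ip addr show eth0.4000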

Here’s where some kind of magic happens. To be able to use a separate IP network with a different gateway IP address, one needs to define separate rules and routes for the traffic that flows through the specified subnet. iproute2 tooling is our friend here.

# Define additional / custom routing table that will handle new rules / routes
echo "1 vswitch" >> /etc/iproute2/rt_tables
# Define rules that will tell kernel to use different routing table
ip rule add to 1.2.3.56/29 lookup vswitch
ip rule add from 1.2.3.56/29 lookup vswitch
# Define routes for additional subnet
#  - tell kernel where the default route is, because there's no IP address from the specified subnet assigned, yet
ip route add 1.2.3.57/32 dev eth0.4000 table vswitch
ip route add 0.0.0.0/0 via 1.2.3.57 dev eth0.4000 table vswitch

As long as we’re not assigning IP addresses from the additional subnet to any device, we need to tell the kernel where it should look for the additional default gateway. All the rules and routes are part of the Debian interfaces hook scripts, but they can be specified directly in the /etc/network/interfaces file too, via post-up and pre-down stanzas, as sketched below.
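A minimal sketch of that inline variant (assuming the vswitch routing table from rt_tables above is already defined):

auto eth0.4000
iface eth0.4000 inet static
    address 192.168.200.1
    netmask 255.255.255.0
    # the same rules / routes as above, applied when the interface comes up
    post-up ip rule add to 1.2.3.56/29 lookup vswitch
    post-up ip rule add from 1.2.3.56/29 lookup vswitch
    post-up ip route add 1.2.3.57/32 dev eth0.4000 table vswitch
    post-up ip route add 0.0.0.0/0 via 1.2.3.57 dev eth0.4000 table vswitch
    # ... and removed again on the way down
    pre-down ip rule del from 1.2.3.56/29 lookup vswitch
    pre-down ip rule del to 1.2.3.56/29 lookup vswitch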

The keepalived part

The keepalived setup consists of three VRRP instances with additional track_* checks. Each of the nodes is the MASTER for one of the VRRP IPv[46] addresses (and is also serving as an NGINX ingress), with the other two nodes acting as BACKUPs in case the MASTER goes down.
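As an illustration, one of the three instances could look roughly like this on its MASTER node - a minimal sketch where the instance name, virtual_router_id, and priorities are example values, and the addresses are the ones used throughout this post:

vrrp_instance VI_1 {
  state       MASTER          # BACKUP on the other two nodes
  interface   eth0.4000       # VRRP traffic stays inside the vSwitch VLAN
  virtual_router_id 51
  priority    150             # lower values (e.g. 100, 50) on the BACKUP nodes
  advert_int  1
  virtual_ipaddress {
    1.2.3.59/29 dev eth0.4000
  }
}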

The image below shows the VRRP instances in a healthy state and during a one-node outage:

keepalived setup PoC v3 scheme

Each machine holds “its own” virtual IP address in the healthy setup (the top image). In the bottom image (“Machine A” outage) you can see that the virtual IP address 1.2.3.59 was automatically moved to “Machine B” when “Machine A” stopped sending VRRP advertisements and/or was marked as a FAULTY node.

virtual_ipaddress and virtual_ipaddress_excluded stanzas

The main purpose of any VRRP implementation is to increase availability, which is achieved by automatic IP assignments via so-called virtual addresses. Unfortunately, a VRRP instance allows you to define only a single address family - trying to set up both stacks in the virtual_ipaddress stanza results in:

Nov 11 12:56:59 k8s01mc Keepalived_vrrp[22380]: (Line 40) (1.2.3.6): address family must match VRRP instance [2a03:4444:fff0:59:88::218/64] - ignoring

… duh. It took me a few more minutes to figure out exactly what was going on and how to overcome this issue, but I realized that IPv6 is simply not the subject of the negotiation here. That’s what the virtual_ipaddress_excluded stanza is good for:

virtual_ipaddress_excluded {
  2a03:4444:fff0:59:88::59/64 dev eth0.4000
}

You’re telling keepalived to handle the actual setup of the IP address on the given interface, but also that it shouldn’t send it within the VRRP advertisements. Clever! Alternatively, you may still define a separate vrrp_instance exclusively for IPv6 addresses, as sketched below.
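That alternative is just another instance - a minimal sketch reusing the example IPv6 address from above (the instance name and virtual_router_id are arbitrary example values):

vrrp_instance VI_1_v6 {
  state       MASTER
  interface   eth0.4000
  virtual_router_id 61        # must not clash with the IPv4 instances
  priority    150
  advert_int  1
  virtual_ipaddress {
    2a03:4444:fff0:59:88::59/64 dev eth0.4000
  }
}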

track_script and track_interface stanzas usage

There were thoughts about using a different (and, one could say, easier) tool to handle the virtual IP addresses (e.g. UCARP). The main reason to choose the keepalived implementation over UCARP was the ability to perform additional checks - script and interface status checks.

track_interface aka manual failovers

The simplest one - and the most useful. It checks the interface status and, in the case of interface failure, it will mark itself as FAULTy. In the case of the main interface (eth0), it’s a bit useless, because it means that the node will become unavailable anyway. The magic is buried in the dummy0 interface.

Sometimes you need to put a node into maintenance mode, but keepalived won’t let you do this easily (at least I don’t know how to do that with a single keepalived-whatever command). But with the ifdown dummy0 command, your job is done, thanks to keepalived’s track_interface feature.
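A minimal sketch of that trick - first the dummy0 interface definition in /etc/network/interfaces (so that ifdown/ifup know about it; the || true guard keeps re-runs idempotent):

auto dummy0
iface dummy0 inet manual
    pre-up ip link add dummy0 type dummy || true
    up ip link set dummy0 up
    down ip link set dummy0 down

… and then the corresponding stanza inside the vrrp_instance:

vrrp_instance VI_1 {
  # ... state, interface, virtual_ipaddress as above ...
  track_interface {
    dummy0    # "ifdown dummy0" puts this node into the FAULT state
  }
}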

track_script aka is NGINX ingress alive?

Does it make sense to hold the virtual IP address if there’s nowhere to send the traffic? NO!

One simple command specified in a vrrp_script stanza:

vrrp_script chk_nginx_ingress {
  # -f makes curl exit non-zero on HTTP errors, which keepalived treats as a failed check
  script    "/usr/bin/curl -f http://localhost:80/healthz"
  interval  2
  weight    2
  fall      2
  rise      2
}
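Note that the vrrp_script block on its own does nothing - it has to be referenced from the vrrp_instance via a track_script stanza. A minimal sketch of the wiring:

vrrp_instance VI_1 {
  # ... state, interface, virtual_ipaddress as above ...
  track_script {
    chk_nginx_ingress
  }
}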

This will make your setup robust and more HA than ever before. When the NGINX ingress on a given node doesn’t reply to the health check, the node will mark itself as FAULTy and the virtual IP address will appear somewhere else.

The rest of the keepalived configuration

The last piece of the keepalived puzzle is the global_defs stanza, where one may set up email notifications to track the status of the nodes and enable SNMP support for monitoring purposes (I personally prefer to use something that produces the Prometheus metrics format).
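For illustration, such a global_defs stanza could look roughly like this (the router_id and addresses are placeholders; the SNMP subsystem itself is typically switched on via the daemon’s command-line options, e.g. keepalived -x):

global_defs {
  router_id k8s01mc                        # unique identifier per node
  notification_email {
    ops@example.com                        # recipients of state-change emails
  }
  notification_email_from keepalived@example.com
  smtp_server 127.0.0.1
  smtp_connect_timeout 30
}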

Automate it …

At DTOne, we use the Puppet configuration management tool. No matter what technology you use, setting up dozens - or hundreds - of machines manually is a painful and time-consuming process. It took half a day to set everything up manually, and even though that was a first-time configuration, you don’t want to copy/paste/send it manually again and again.

There’s a Puppet code snippet included in the code folder that orchestrates the above-mentioned setup. It’s not complete code, but an illustration of how it can all be done. Using Puppet, you’re able to set it all up within seconds, again and again - each time you reinstall the node(s).

Things we didn’t cover here

You might notice that we didn’t cover all the implementation details. That’s simply because they’re out of the scope of this blog post, but keep in mind that with a real deployment you’ll need to take care of things like resource management (each node has to be capable of handling all the traffic), security (especially firewall rules - make sure you don’t block VRRP traffic), and many other details that make your deployment production ready.
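For example, with plain iptables, the allow-rules for VRRP (IP protocol 112) on the vSwitch interface could look like this - a sketch assuming the eth0.4000 interface from above:

# VRRP advertisements are sent to the 224.0.0.18 multicast group (protocol 112)
iptables -A INPUT -i eth0.4000 -p 112 -d 224.0.0.18 -j ACCEPT
# the IPv6 counterpart uses the ff02::12 multicast group
ip6tables -A INPUT -i eth0.4000 -p 112 -d ff02::12 -j ACCEPT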

Summary

Modern technologies are focused on full automation, so no human operator can screw things up. The vSwitch technology as-is is a powerful tool that can make dreams come true. Once you incorporate keepalived with virtual IP addresses (or any other tool that implements automatic handling of virtual IP addresses), your technology stack will receive the sort of high availability it deserves.

On top of that, your DevOps life will become a lot happier, and your sleep much deeper.
