We saw some stability issues earlier this week, as increased load impacted our Ceph cluster, which provides the backend for the application config folders, as well as for ElfStorage.
It turned out that the 1Gbps nodes which run our SSD-backed config storage were also running the Ceph metadata servers (MDSs), whose job is to coordinate your view of the filesystems on your volumes. The combination of these two roles (storage and metadata) was saturating the 1Gbps NICs, causing slowdowns and the occasional corruption as the fault cascaded.
In parallel, all the fun we've been having with Real-Debrid streaming was impacting our app nodes, in some cases creating so much incoming traffic that the nodes were unable to respond timeously to Ceph, again resulting in slowdowns and corruption.
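For the curious, here's a rough sketch of how you could spot that kind of role co-location yourself. It assumes a Rook-managed Ceph cluster with its default `app=rook-ceph-*` pod labels and a kubeconfig with access to the `rook-ceph` namespace - treat it as an illustration rather than our exact tooling:

```python
# Sketch: list which nodes run Ceph MDS and OSD pods, to spot role co-location.
# Assumes a Rook-managed cluster with its default "app=rook-ceph-*" labels.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

roles_by_node = defaultdict(set)
for selector, role in (("app=rook-ceph-mds", "mds"), ("app=rook-ceph-osd", "osd")):
    for pod in v1.list_namespaced_pod("rook-ceph", label_selector=selector).items:
        roles_by_node[pod.spec.node_name].add(role)

for node, roles in sorted(roles_by_node.items()):
    note = "  <-- metadata and storage share a NIC here" if roles == {"mds", "osd"} else ""
    print(f"{node}: {', '.join(sorted(roles))}{note}")
```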
Here are a few recent changes we've made to address growth:
Optimized routing to streamers
Most of our incoming HTTPS traffic arrives via a Hetzner network load balancer, and is then passed to a random app node (running Traefik) to be SSL-terminated, and then passed on again to the target pod. This offers excellent resilience and fault-tolerance, but can result in an inefficient traffic path, as illustrated below:
As you can see, based on the randomness of the load-balancing, a simple Plex / Debrid stream of a 50GB movie could pass through 3 different nodes, in some cases simply "passing through" on the way out. While we have some ability to rate-limit traffic, you can only really control the rate of outgoing traffic - you don't get to determine how fast incoming traffic arrives, and once it's arrived, it's too late to do anything about it!
DNS-based routing to streamers
The first change implemented this week works as follows - every time a streamer (Plex, Jellyfin, Emby) starts, a DNS record is created/updated for that streamer's URL (say, funkypenguin-plex.elfhosted.com), pointing to the precise host the streamer is running on.
This means that rather than having the incoming request routed via Hetzner's load balancer to a random node running Traefik, the request hits the Traefik pod on the precise node that Plex is running on, avoiding any overhead in delivering streaming traffic between Traefik and Plex:
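We won't go into the exact plumbing here, but as a rough illustration of the idea, a small controller could upsert that DNS record whenever the streamer's pod starts. The sketch below assumes a Cloudflare-managed zone and the kubernetes Python client; the namespace, label selector and credentials are hypothetical placeholders, not our actual setup:

```python
# Sketch: point a streamer's DNS record at the node its pod is running on.
# Zone, token, namespace and label selector below are hypothetical placeholders.
import os
import requests
from kubernetes import client, config

CF_TOKEN = os.environ["CF_API_TOKEN"]
ZONE_ID = os.environ["CF_ZONE_ID"]
HOSTNAME = "funkypenguin-plex.elfhosted.com"

config.load_incluster_config()  # or load_kube_config() outside the cluster
v1 = client.CoreV1Api()

# Find the node the streamer pod landed on (namespace/labels are assumptions).
pod = v1.list_namespaced_pod("tenants", label_selector="app=plex,user=funkypenguin").items[0]
node = v1.read_node(pod.spec.node_name)
node_ip = next(a.address for a in node.status.addresses if a.type == "ExternalIP")

# Upsert an A record for the streamer pointing at that node's IP.
api = "https://api.cloudflare.com/client/v4"
headers = {"Authorization": f"Bearer {CF_TOKEN}"}
existing = requests.get(f"{api}/zones/{ZONE_ID}/dns_records", headers=headers,
                        params={"name": HOSTNAME, "type": "A"}).json()["result"]
payload = {"type": "A", "name": HOSTNAME, "content": node_ip, "ttl": 60, "proxied": False}
if existing:
    requests.put(f"{api}/zones/{ZONE_ID}/dns_records/{existing[0]['id']}",
                 headers=headers, json=payload)
else:
    requests.post(f"{api}/zones/{ZONE_ID}/dns_records", headers=headers, json=payload)
```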
10Gbps for Zurg and incoming friends
The DNS-based routing improves the flow of traffic, but it doesn't solve the problem of incoming media saturating the 1Gbps link and making nodes briefly unresponsive.
Hetzner offers 10Gbps uplinks on its nodes, but at double the price, and you're charged for traffic (critically, egress traffic) over 20TB/month.
So we added some 10Gbps nodes, and moved all the Zurg instances onto them. Now our streaming traffic flow looks like this:
Now there's no saturation of the Zurg node, and we can control the rate of traffic sent from the Zurg node to the Plex nodes, such that we don't get massive traffic spikes causing issues.
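If you're wondering how that placement and rate control is typically expressed in Kubernetes, here's a hedged sketch. The node label, namespace and deployment name are made up, and it assumes your CNI has the bandwidth plugin enabled - it's an illustration of the technique, not necessarily how we apply it in production:

```python
# Sketch: pin a Zurg deployment to 10Gbps-labelled nodes and cap its egress
# via the CNI bandwidth plugin annotation. Names and labels are hypothetical.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    # Limits traffic leaving the pod (Zurg -> Plex nodes);
                    # requires the "bandwidth" CNI plugin.
                    "kubernetes.io/egress-bandwidth": "2G",
                }
            },
            "spec": {
                # Assumes the 10Gbps nodes carry this label.
                "nodeSelector": {"elfhosted.com/nic": "10gbps"},
            },
        }
    }
}

apps.patch_namespaced_deployment(name="funkypenguin-zurg", namespace="tenants", body=patch)
```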
To avoid having to pay extra egress charges, we've moved only the downloading apps which don't upload (SABnzbd, NZBGet, and RDTClient) onto the 10Gbps nodes, so you should now see much better performance from these apps!
10Gbps for Ceph MDSs
Remember those saturated MDSs? It turns out, they love a little extra room to stretch their legs, and are running much better on the 10Gbps nodes as well. Here's our current Ceph status:
They're our Ceph HDD (dwarves) and SSD (goblins) storage nodes. What you don't see in these metrics is their disk I/O and network usage, which are likely high.
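For reference, steering the MDSs onto particular nodes can be done declaratively. The sketch below assumes a Rook-managed cluster, a hypothetical node label, and a filesystem named elfstorage-fs - an illustration of the technique rather than our exact config:

```python
# Sketch: steer CephFS metadata servers onto 10Gbps nodes via the Rook
# CephFilesystem CRD's placement rules. Label and filesystem name are assumptions.
from kubernetes import client, config

config.load_kube_config()
crd = client.CustomObjectsApi()

patch = {
    "spec": {
        "metadataServer": {
            "placement": {
                "nodeAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [{
                            "matchExpressions": [{
                                "key": "elfhosted.com/nic",  # assumed label on 10Gbps nodes
                                "operator": "In",
                                "values": ["10gbps"],
                            }]
                        }]
                    }
                }
            }
        }
    }
}

crd.patch_namespaced_custom_object(
    group="ceph.rook.io", version="v1", namespace="rook-ceph",
    plural="cephfilesystems", name="elfstorage-fs", body=patch,
)
```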
Summary
As always, thanks for building with us - feel free to share suggestions and your own ideas for new apps to add!
The big increase here is due to all the Zurg mounts!
I've excluded some nodes which have been de-provisioned and are awaiting expiry at the end of their monthly Hetzner invoice cycle.