We're 9 months old now, and have passed the one-month mark since the "new" (April 2024) pricing, which normalized resource-heavy "infinite debrid" apps based on measured usage data.
We moved a little more slowly and broke a little less, testing and ultimately rolling out an update to RDTClient to bring our auto-updating image up-to-date with the upstream code.
(We then totally "cowboyed" an IPv6 upgrade to deal with a surprise policy change by RealDebrid towards non-residential IPv4 addresses!)
Additionally, for the first time ever, we broke even on cash expenses for the month!
To get us started, here are some shiny stats for Apr 2024, followed by a summary of some of the user-facing changes announced this month in the blog...
The stats below indicate CPU cores used (not a percentage), and illustrate that after tenant workloads, our highest CPU consumer is Traefik, which spikes significantly during the daily maintenance window when there's a lot of pod "churn". We run Traefik as a DaemonSet, meaning we have more than 22 pods, each of which gets "excited" when there's a lot of pod churn and clocks up some heavy CPU usage; this is still unexpected behaviour though. Work will be undertaken in April 2024 to upgrade Traefik to v3 (it's a major update and needs careful testing), after which we'll revisit the CPU usage.
Until recently, Flux's CPU usage was also high, with 26 shards (a-z) spreading out the load of reconciliations, but a recent refactor has significantly reduced this.
The "last" values on the chart are specific to when the snapshot was taken, but compared to the previous month, there's not a lot of change in overall tenant CPU usage (which is good, most of the resource pressure is on network and storage I/O).
This graph represents memory usage across the entire cluster. By far the largest consumer of RAM is rook-ceph, since Ceph will basically take all the RAM you give it, for caching / performance.
Interestingly, RAM usage for tenants has increased by ~25%, while CPU usage has remained relatively constant. This may indicate that there are more tenant apps (we know there are) which are idling most of the time, consuming RAM but not active CPU resources.
The 2 network graphs below represent the two points where throughput varies: the 10Gbps "giant" nodes (which are primarily used for ingress, as incoming Debrid content is passed to elves for streaming), and the 1Gbps "elf" nodes, which are our general workhorses.
Compared to March's graphs, April's show significantly more continuous traffic on both the Giants (still ~10% of capacity) and the Elves (a more balanced spread). It's worth noting that during April, we also migrated some of the torrent clients onto the Giants, keeping a careful eye on total egress bandwidth, since we're billed for monthly bandwidth exceeding 20TB per 10Gbps node.
Last month (Mar 2024)'s for comparison:
These are the traffic stats for egress from Hetzner. They exclude any traffic to/from Hetzner Storageboxes or ElfStorage, since this traffic is not classified as "external".
Ingress / Egress stats for Apr 2024
Last month (Mar 2024)'s for comparison:
Ingress / Egress stats for Mar 2024
Ceph provides our own storage ("[ElfStorage][elfstorage]"), typically used for long-term slow storage and seeding, as well as all the storage for our per-app /config volumes, now backed by 10Gbps nodes with NVMe disks.
Ceph storage grew only ~5% during April, but it should be noted that during this period we switched to ephemeral storage for the PhotoTranscoder directories, which significantly reduced Plex's storage consumption.
Last month (Mar 2024)'s for comparison:
Retrospective
100% more ElfVengers!
While producing this report, we graduated 2 more Elves (@chris and @erythro) to "Elf-Vengers", the elite team of heroes who help new elves with their setups, and respond to #elf-help tickets when available / applicable.
Elf-Vengers who respond to user tickets have been vetted to be "not catphishers"!
If you'd like to join our hallowed ranks, start here!
Cluster updated with IPv6 "in-flight"
Our Kubernetes cluster was built on IPv4 addressing only (no IPv6). Adding IPv6 after cluster initialization (making it "dual-stack") is not supported in K3s (but it turns out it works anyway)...
During April 2024, RealDebrid surprise-changed their policies, restricting non-residential IPv4 addresses, which impacted our "Infinite Streaming" users, since we mount your RealDebrid libraries from our Hetzner datacenter IP ranges. At the time of the change, even switching to a VPN for the Zurg->RD connection didn't help, since VPNs were also blocked (this may have been reversed by now).
To restore access for our users, we enabled IPv6 addressing in our cluster (technically, because we use Cilium rather than K3s-managed Flannel, this was possible, but... hacky), forcing the Zurg->RD connection via IPv6. This resolved the problem (for the time being!), and services were restored. It did take a few more days for the after-effects of this change to be fully resolved though, and we experienced connectivity issues across our apps as we dealt with IP address allocation overlaps and other consequences of making such a wide-reaching change with relatively little planning/testing!
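For the curious, here's a rough sketch of the kind of dual-stack settings involved. These are illustrative values only (the CIDRs, file path and Helm values are assumptions for the sketch, not our production config), for a K3s cluster running Cilium instead of the bundled Flannel:

```yaml
# /etc/rancher/k3s/config.yaml (example values only)
# Dual-stack pod/service CIDRs; K3s officially expects these at cluster-init time,
# not retro-fitted to a running cluster, which is what made this change "hacky".
cluster-cidr: "10.42.0.0/16,fd00:42::/56"
service-cidr: "10.43.0.0/16,fd00:43::/112"
flannel-backend: "none"          # CNI is provided by Cilium instead of Flannel
disable-network-policy: true

# Cilium Helm values (example): enable IPv6 alongside IPv4
# ipv6:
#   enabled: true
```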
We've noted no further issues in the week or so since, and our cluster is now fully IPv6-enabled and ready for the "future"!
Plex Performance Tweaks
Some analysis of daily network usage (thanks to the public ElfHosted cluster Grafana dashboard) revealed Plex to be doing multi-Gbit/s downloads from RealDebrid / ElfStorage every day, for the purposes of analysing media for chapter markers, credit/intro skipping, etc.
Since such activities are obviously not scalable when dealing with remote storage mounts, we've disabled these scheduled tasks, and carefully tuned the transcode killer to prevent unnecessary thrashing of CPU/GPU resources.
Our original KnightCrawler instance served us well from Feb 2024, and was my introduction to the wider StremioAddons community, but the rapid pace of the KnightCrawler devs outpaced our build, and so while fresh builds were prancing around with fancy parsers and speedy Redis caches, we ended up with a monster MongoDB instance (shared by the consumers and the public/private addon instances), which would occasionally eat more than its allocated 4 vCPUs and get into a bad state!
To migrate our hosted instance to the v2 code, we ran a parallel build, imported the 2M torrents we'd scraped/ingested, and re-ran these through KnightCrawler's v2 parsing/ingestion process. Look at how happily our v2 instance is purring along now!
For a few months, we'd been stuck on an older version of RDTClient, since our cluster-specific hardening (not running as root, for example) didn't work with the latest, "official" version.
With some careful testing by a handful of users, we were able to adjust our image to better align with the upstream code, and we now have an in-house image which is both compliant with our internal security standards and able to be auto-updated from upstream when new features are released.
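To give a flavour of the hardening involved (the UIDs and values below are illustrative assumptions, not our actual manifests), a "not running as root" workload typically carries a securityContext along these lines:

```yaml
# Illustrative pod-spec fragment for a hardened RDTClient-style workload
securityContext:
  runAsNonRoot: true               # refuse to start if the image insists on running as root
  runAsUser: 1000                  # example non-root UID
  runAsGroup: 1000
  fsGroup: 1000                    # make mounted volumes writable by the non-root user
containers:
  - name: rdtclient
    image: rdtclient:example-tag   # placeholder image reference
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
```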
With the repricing behind us, we had time to focus on a significant release: a service which runs an instance of Stremio Server, the transcoding / torrenting component of Stremio that's usually bundled with the client app. With the addition of a user-supplied VPN config, this service provides a user with a separate streaming/transcoding endpoint for Stremio, such that they can point their Stremio clients (like under-powered smart TVs) at their own streaming server and have the torrenting / transcoding offloaded, allowing users to:
- Safely stream direct torrents without worrying about a VPN
- Simultaneously stream Debrid downloads from supported clients (currently just https://web.strem.io or our hosted [Stremio Web][stremio-web] instance)
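Under the hood, the general pattern is a VPN sidecar (consuming the user-supplied VPN config) sharing a network namespace with the Stremio server container, so that all torrenting / streaming egress leaves via the VPN. The sketch below illustrates that pattern only; the images, secret name and port are assumptions, not our exact deployment:

```yaml
# Illustrative only: VPN sidecar + Stremio server sharing one pod network namespace
apiVersion: v1
kind: Pod
metadata:
  name: stremio-server-example
spec:
  containers:
    - name: vpn
      image: qmcgaw/gluetun            # example VPN sidecar; needs a tun device in practice
      securityContext:
        capabilities:
          add: ["NET_ADMIN"]
      envFrom:
        - secretRef:
            name: user-vpn-config      # hypothetical secret holding the user's VPN credentials
    - name: stremio-server
      image: stremio/server            # the streaming/transcoding component normally bundled with the client
      ports:
        - containerPort: 11470         # Stremio server's default HTTP port
```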
I've toyed with the idea of producing video guides for some of the more complex integrations, and during May I hope to try my hand at producing some screencasts to make these more accessible. If you've got an interest in screencasting / video production, and you'd like to help a noob out, hit me up!
TRaSH is TReaSUre
Informally, in the #elf-friends channel, we've been discussing implementing TRaSH Guides custom formats for several weeks now, and many users are using these to download the optimal content for direct-playing (always better than transcoding!). We've also recently published @LayZee's Radarr/Sonarr guides to manually adding TRaSH to your Aars.
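For those who prefer to run Recyclarr themselves, it syncs TRaSH custom formats into Radarr/Sonarr from a declarative config along these lines (the instance name, URL and IDs below are placeholders, and the exact schema varies between Recyclarr versions):

```yaml
# Illustrative recyclarr.yml fragment (placeholders only)
radarr:
  radarr-main:                          # arbitrary instance name
    base_url: http://radarr:7878        # example in-cluster URL
    api_key: "<your-radarr-api-key>"
    custom_formats:
      - trash_ids:
          - "<trash-id-from-the-guides>"   # copy IDs from the TRaSH Guides
        assign_scores_to:               # called 'quality_profiles' in older Recyclarr versions
          - name: HD Bluray + WEB       # example quality profile
```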
There's a little more work to do around the ElfBot implementation (`elfbot recyclarr sync` works, but some of the other commands don't), and then we'll publish a blog post with more details!
AirDC++ service
We've had an AirDC++ service in beta for a while, and with the support of a few invested Elves, this is now ready for "gen-pop". With a dusting of documentation and process, look for the AirDC++ service to become available shortly!
PostgreSQL support for the Aars
Radarr, Sonarr, etc. all now have built-in support for PostgreSQL as a replacement for the troublesome and easy-to-corrupt SQLite database they ship with by default. To support our KnightCrawler database, I've started using CloudNativePG (CNPG) for full-lifecycle database management. With a simple CR, CNPG will establish a highly-available PostgreSQL cluster, including automated failover and incremental, automatic backups to local snapshots and to an S3 bucket.
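As a rough sketch of what that looks like (the name, size and bucket path below are hypothetical, not our actual manifests), a CNPG Cluster CR is roughly this much YAML:

```yaml
# Hypothetical CNPG Cluster for a single Arr instance's database
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: radarr-db                      # placeholder name
spec:
  instances: 3                         # one primary + two replicas, with automated failover
  storage:
    size: 5Gi                          # example volume size
  backup:
    barmanObjectStore:                 # continuous WAL archiving and base backups to S3
      destinationPath: s3://example-bucket/radarr-db   # placeholder bucket
      s3Credentials:
        accessKeyId:
          name: backup-credentials     # placeholder Secret name
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-credentials
          key: SECRET_ACCESS_KEY
```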
This declarative approach to creating a PostgreSQL database could allow us to create individual databases for Arr instances in bulk, such that setup remains fully automatic, and users just "get" PostgreSQL-enabled instances. This was once again deferred by more pressing issues in April, but I hope to pay more attention to it in May!
Offering free demos
Nothing gets my attention on a new app like a live demo. I've considered reaching out to open source projects who don't have their own online demo, and offering to host a self-resetting demo for them.
This approach has been successful with the Stremio Addons, and I think it'd be an effective way to gain exposure in more niche communities.
A hosted demo would provide their potential users the opportunity to evaluate the app "live", and would also drive more traffic / recognition / SEO juice towards ElfHosted.
If you've got an open-source, self-hosted app and you'd like a free demo instance hosted, hit me up!
Got ideas for improvements? I'd love to hear them, post them in #elf-suggestions!
How to help
If you'd like to make a donation in recognition of our infrastructure costs, our open-source resources, or our friendly support, a simple donation product is available at https://store.elfhosted.com/product/elf-love/
Another effective way to help is to drive traffic / organic discovery, by mentioning ElfHosted in appropriate forums such as Reddit's r/selfhosted, r/seedboxes, r/realdebrid, and r/StremioAddons, which is where much of our target audience is to be found!
Join us!
Want to get involved? Join us in Discord and come and test-in-production!
Includes "opportunity cost" of deferring billable consulting work for ElfHosted development! โฉ
Yes! We finally broke even on our cash expenses. This was the first month we offered prepaid pricing though, and several users took advantage of yearly commitments, so next month may not be as profitable. We remain cautiously optimistic... โฉ
These are actually paid yearly in advance, but represented here monthly for consistency. Confirm my sponsorships hereโฉ
Total pod count dropped because we (a) improved pruning of un-subscribed services, and (b) removed rclonefm/ui by default, but made them free to add, in the store. โฉ