Pointing load at it

April 30, 2026 · Kyle Cronin

With the admin dashboard in place, I started load testing. I wanted to find the actual ceiling for FeedFilters on a 1 GB Linode Nanode — the smallest production target I cared about — and to understand what would break before I got there. What followed was the most concentrated stretch of work in the project. Every day surfaced something I hadn’t expected, and most of those “somethings” turned into commits that made the service substantially better.

This post tries to cover the full arc.

The harness

The load is synthetic, but it’s substantial. There are no real users involved — everything below is driven by a controlled test rig — but the rig is sized to push the service well past its comfortable operating range, and the lessons that came out of it are real.

The harness has three pieces. A mock upstream that serves deterministic RSS / Atom / JSON feeds, with knobs for body shape, delay, hang, errors, and gzip. A runner with a small web UI that drives the test and aggregates metrics. And the SUT — a debug build of FeedFilters running in a near-production container shape. The mock and runner live on a separate box from the SUT so the measurement isn’t competing with the workload for the only CPU the production target has.

The runner has two modes. Classic is the textbook shape: fixed user count, fixed ramp, stop when p95 latency crosses a threshold. Adaptive is more interesting — a controller that seeks the ceiling and tracks it, ramping load up while the SUT looks healthy and backing off when it doesn’t. Most of the useful tests I ran were in adaptive mode: the system tells you where the cliff is rather than you having to guess.
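A minimal sketch of what one step of such an adaptive controller could look like, assuming an additive-increase, multiplicative-decrease rule. The post doesn't describe the runner's actual control law, so nextUsers and its constants are hypothetical.

```go
package main

// Tuning constants are assumptions for illustration, not the runner's values.
const (
	rampStep   = 50 // virtual users added after a healthy tick
	backoffNum = 4  // after an unhealthy tick keep 4/5 of the load
	backoffDen = 5
	minUsers   = 1
)

// nextUsers returns the user count for the next tick: ramp up linearly while
// the SUT looks healthy, back off multiplicatively when it does not.
func nextUsers(current int, healthy bool) int {
	if healthy {
		return current + rampStep
	}
	next := current * backoffNum / backoffDen
	if next < minUsers {
		next = minUsers
	}
	return next
}
```

Called once per measurement tick, a rule like this climbs until the health check fails, drops back, and then oscillates in a band around the ceiling, which is the "seeks and tracks" behavior described above.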

OS limits, before anything else

The first thing the load test surfaced was that my host wasn’t configured for traffic. Two kernel sysctls turned out to matter right away.

net.core.somaxconn caps the size of the TCP accept queue. The long-standing default is 128 (Linux 5.4 raised it to 4096), and under any meaningful burst of new connections a queue that small fills up and connections get refused outright. Bumping it to 65535 stopped the spurious ECONNREFUSED errors immediately.

net.netfilter.nf_conntrack_max defaults to about 65K, which the service blew through during outbound-fetch storms. The symptom was confusing: DNS lookups started failing with write: operation not permitted. That turned out to be conntrack refusing to allocate a slot for a new outbound flow. Setting it to 524288 fixed it.

Both of these have to be applied per-container as well as on the host, because Docker gives each container its own network namespace and somaxconn is per-namespace. The host setting doesn’t propagate.
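Because the host value doesn't propagate, one way to catch a misconfigured container is to read the sysctl from inside the namespace at startup. This is a sketch built around a hypothetical checkSomaxconn helper. It is not something the post says FeedFilters does.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseSysctl parses the single integer that files under /proc/sys contain.
func parseSysctl(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// checkSomaxconn reads the value visible in the current network namespace
// and returns an error if it is below want.
func checkSomaxconn(want int) error {
	raw, err := os.ReadFile("/proc/sys/net/core/somaxconn")
	if err != nil {
		return err
	}
	got, err := parseSysctl(string(raw))
	if err != nil {
		return err
	}
	if got < want {
		return fmt.Errorf("net.core.somaxconn is %d in this namespace, want at least %d", got, want)
	}
	return nil
}
```

The per-container value itself is typically set with docker run --sysctl net.core.somaxconn=65535 or the sysctls key in a Compose file. A check like checkSomaxconn(65535) at startup only confirms that the setting arrived.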

The connection-pool cliff

Once the kernel was out of the way, the next thing the runner found was a ceiling that didn’t look like a ceiling at first. Throughput would climb steadily, then collapse — not just stop scaling, but actively drop into a death spiral with the box going OOM.

The cause was that I’d left sql.DB’s connection pool unbounded. SQLite serializes writers (one writer at a time, full stop), so under a write-heavy burst every blocked goroutine grabbed a fresh connection from the pool. A brief queue cascaded into thousands of open connections, each carrying its own buffers and goroutine stack and lock-manager bookkeeping. That memory pressure produced GC pressure, which produced more queueing, which produced more connections. The system tipped over.

The fix was a one-line change with a lot of reasoning behind it: cap the pool at 25. Twenty-five connections give WAL readers plenty of headroom (they don’t block each other) while pinning a hard ceiling on the writer queue. Requests beyond that wait as parked goroutines for a free connection instead of piling up in SQLite’s lock manager, which is dramatically cheaper. There’s no principled formula behind 25; it’s empirically derived from capacity sweeps on the 1 GB Nanode that’s the reference deployment. Bigger boxes can likely run higher, but no one’s swept that.
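With database/sql, a cap like this comes down to a single SetMaxOpenConns call. The sketch below assumes nothing about how FeedFilters opens its database. The stubConnector and newStubDB helpers are hypothetical and exist only so the example compiles and runs without a SQLite driver.

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
)

// maxDBConns is the cap that came out of the capacity sweeps on the 1 GB box.
const maxDBConns = 25

// capPool bounds the pool. Once 25 connections are in use, further callers
// block inside database/sql until one is returned instead of opening more.
func capPool(db *sql.DB) {
	db.SetMaxOpenConns(maxDBConns)
}

// stubConnector is a stand-in for a real driver (assumption: illustration only).
type stubConnector struct{}

func (stubConnector) Connect(context.Context) (driver.Conn, error) {
	return nil, errors.New("stub: no real database behind this pool")
}

func (stubConnector) Driver() driver.Driver { return nil }

// newStubDB returns a pool that never connects; a real program would use
// sql.Open with its SQLite driver instead.
func newStubDB() *sql.DB { return sql.OpenDB(stubConnector{}) }
```

db.Stats().MaxOpenConnections reports the configured cap, which makes it easy to confirm the limit took effect and to export it next to the pool's wait counters.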

Profiling and the hot-spot day

With the kernel and the pool fixed, the SUT ran clean enough that the next ceiling was about CPU and memory inside FeedFilters itself. I’d already wired in pprof for dev builds, so I pointed it at a sweep and looked at where the cycles were going.

It was instructive. The output path — where a feed reader fetches a filtered feed — was spending most of its time in two places I hadn’t expected: re-parsing the same upstream XML on every request, and re-normalizing every item’s text on every filter pass. The output endpoint was, in effect, paying full parse-and-filter cost for every cache hit.

Several changes landed in a single day:

  • Cache the parsed feed itself. Alongside the disk-backed HTTP cache for upstream fetches, an in-memory byte-bounded LRU holds the parsed gofeed.Feed plus per-item byte ranges into the source XML. Cache hits skip the parser entirely; misses pay full parse cost once and amortize across every reader.
  • Cache normalized item text. The filter engine compares case-folded, accent-stripped item text. Doing that work inside the filter loop meant doing it for every (item × filter) combination. Caching the normalized text per item, once, on parse, dropped filter-pass CPU substantially.
  • In-memory cache for Store.GetByID. The output handler was hitting SQLite for the feed row on every request. The row changes rarely; an in-memory cache keyed by feed ID eliminated the lookup from the hot path.
  • Slim the cache value type. The original FeedCache value carried more than it needed to. Trimming it let the same heap budget hold roughly five times as many entries, which cut the steady-state miss rate proportionally.

None of these are clever individually. Together they shifted the output endpoint from CPU-bound to nearly memory-bound, which is the regime I wanted.
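The first change in the list, a byte-bounded LRU, might be sketched as follows. This is an illustration under assumptions: the value is a plain []byte standing in for the parsed feed, the type and method names are invented, and there is no locking, which a cache shared between request goroutines would need.

```go
package main

import "container/list"

// entry is one cached item; value stands in for the parsed feed.
type entry struct {
	key   string
	value []byte
}

// byteLRU evicts least-recently-used entries once curBytes exceeds maxBytes.
type byteLRU struct {
	maxBytes int
	curBytes int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element holding *entry
}

func newByteLRU(maxBytes int) *byteLRU {
	return &byteLRU{maxBytes: maxBytes, order: list.New(), items: make(map[string]*list.Element)}
}

// get returns the cached value and marks it as recently used.
func (c *byteLRU) get(key string) ([]byte, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// put inserts or replaces a value, then evicts from the back until the cache
// fits its byte budget. A value larger than the budget is evicted immediately.
func (c *byteLRU) put(key string, value []byte) {
	if el, ok := c.items[key]; ok {
		e := el.Value.(*entry)
		c.curBytes += len(value) - len(e.value)
		e.value = value
		c.order.MoveToFront(el)
	} else {
		c.items[key] = c.order.PushFront(&entry{key: key, value: value})
		c.curBytes += len(value)
	}
	for c.curBytes > c.maxBytes && c.order.Len() > 0 {
		e := c.order.Remove(c.order.Back()).(*entry)
		delete(c.items, e.key)
		c.curBytes -= len(e.value)
	}
}
```

Bounding by bytes rather than by entry count is what ties the cache to a heap budget: slimming the value type lets more entries fit under the same maxBytes without touching the eviction logic.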

An upstream cache

Around the same time, I replaced the day-2 source cache with a new disk-backed HTTP cache (internal/httpcache). The original two-state-machine design — metadata in SQLite, bytes on disk — had race windows between the two halves that the load test exposed. The new cache is interesting enough to deserve its own post, so I’ll write that one separately. For the purposes of this post, the relevant fact is that upstream feed responses are cached on disk and concurrent requests for the same URL coalesce to a single fetch.

Debouncing the timestamp writes

At sustained cache-hit load — around 200 requests per second — every /feed/{id} request was firing a synchronous UPDATE on feeds.last_successful_fetch_at. With the connection pool capped at 25 and SQLite’s single-writer model, every UPDATE serialized through one writer lock. The runner started seeing SQLITE_BUSY errors and Caddy began returning 502s as the in-process queue grew.

The fix was a coalescing flusher: a FetchRecorder accumulates “feed N was just fetched successfully” events into a per-feed-id map, and flushes a snapshot of that map in a single transaction every two seconds. Latest-event-wins is fine for what those columns are: advisory timestamps and an icon URL that almost never changes. N writes become one writer-lock acquire. The hot path’s contribution drops to one map insert under a mutex.

Two seconds of staleness on last_successful_fetch_at is invisible to admins; the writer-contention cliff disappeared completely.
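A coalescing recorder along these lines can be sketched in a few dozen lines. The field names, the snapshot helper, and the flush callback are assumptions for illustration. Only the FetchRecorder name, the per-feed-id map, and the two-second interval come from the description above.

```go
package main

import (
	"sync"
	"time"
)

// fetchEvent is the latest known state for one feed (field names assumed).
type fetchEvent struct {
	fetchedAt time.Time
	iconURL   string
}

// FetchRecorder accumulates events per feed ID; the latest event wins.
type FetchRecorder struct {
	mu      sync.Mutex
	pending map[int64]fetchEvent
}

func NewFetchRecorder() *FetchRecorder {
	return &FetchRecorder{pending: make(map[int64]fetchEvent)}
}

// Record is the hot path: one map insert under a mutex.
func (r *FetchRecorder) Record(feedID int64, ev fetchEvent) {
	r.mu.Lock()
	r.pending[feedID] = ev
	r.mu.Unlock()
}

// snapshot swaps out the pending map so the flush runs without the lock held.
func (r *FetchRecorder) snapshot() map[int64]fetchEvent {
	r.mu.Lock()
	defer r.mu.Unlock()
	batch := r.pending
	r.pending = make(map[int64]fetchEvent, len(batch))
	return batch
}

// Run calls flush with each non-empty batch every interval until stop is
// closed. flush is expected to write the whole batch in one transaction.
func (r *FetchRecorder) Run(interval time.Duration, stop <-chan struct{}, flush func(map[int64]fetchEvent) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if batch := r.snapshot(); len(batch) > 0 {
				_ = flush(batch) // a real flusher would log or retry here
			}
		case <-stop:
			return
		}
	}
}
```

Started as go rec.Run(2*time.Second, stop, writeBatch), this turns any number of Record calls within a window into one writer-lock acquisition. A production version would probably also flush once more on shutdown so the last window isn't lost.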

Load shedding: from binary to graded

Once the box got close to its memory ceiling, even the optimized service would eventually fall over. So I added a load shedder — a middleware on the public feed endpoint that returns 503 + Retry-After when the system is under unsustainable pressure.

The first cut was binary. When MemAvailable (read from /proc/meminfo every 200 ms) dropped below a configured floor, flip a flag on; when it climbed back above floor × 1.5, flip the flag off. Hysteresis was supposed to prevent flapping. It didn’t.

The failure mode was destructive. The flag flipped on at the floor, GC freed memory, the flag flipped off, a flood of queued requests poured in, memory dropped back to the floor, the flag flipped on again. Each “off” cycle’s flood had to be processed — allocating buffers, spawning goroutines, growing in-flight count — before the shedder could catch up. Eventually a flood arrived that the box couldn’t keep up with, and we cascaded.

The replacement is a continuous probability instead of a flag. Each request is shed with probability p, where p is a linear interpolation between two thresholds: a soft floor (where shedding starts at 0%) and a hard floor (where shedding reaches 100%). As MemAvailable drops from 100 MiB toward 50 MiB, shed probability climbs from 0% to 100%. The system finds an equilibrium: shed rate matches the fraction by which incoming load exceeds sustainable load. No oscillation, no flood-cycles.

CPU got the same treatment using load1 / numCPU as the signal, with its own soft and hard thresholds. The shedder reports which signal is dominating, so an operator can tell why the system is shedding rather than just that it is.
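The graded decision reduces to a clamped linear interpolation per signal. In the sketch below the memory thresholds (100 MiB and 50 MiB) come from the text. The CPU thresholds, the rule of taking the larger of the two probabilities, and all identifiers are assumptions.

```go
package main

const (
	memSoftMiB = 100.0 // shedding starts below this much MemAvailable
	memHardMiB = 50.0  // shedding reaches 100% at this much MemAvailable
	cpuSoft    = 1.0   // load1/numCPU where shedding starts (assumed value)
	cpuHard    = 2.0   // load1/numCPU where shedding reaches 100% (assumed value)
)

// shedFraction maps a signal onto [0, 1]: 0 at soft, 1 at hard, linear in
// between. It works in both directions (memory falls toward hard, CPU rises
// toward hard). Callers must pass soft != hard.
func shedFraction(value, soft, hard float64) float64 {
	p := (soft - value) / (soft - hard)
	if p < 0 {
		return 0
	}
	if p > 1 {
		return 1
	}
	return p
}

// decision carries the shed probability and the signal that produced it.
type decision struct {
	probability float64
	reason      string // "memory", "cpu", or "" when not shedding
}

// decide takes the stronger of the two signals and names it.
func decide(memAvailMiB, loadPerCPU float64) decision {
	mem := shedFraction(memAvailMiB, memSoftMiB, memHardMiB)
	cpu := shedFraction(loadPerCPU, cpuSoft, cpuHard)
	switch {
	case mem == 0 && cpu == 0:
		return decision{}
	case mem >= cpu:
		return decision{probability: mem, reason: "memory"}
	default:
		return decision{probability: cpu, reason: "cpu"}
	}
}

// shouldShed applies the decision to one request; roll is a uniform random
// number in [0, 1), for example from rand.Float64.
func shouldShed(d decision, roll float64) bool {
	return roll < d.probability
}
```

At 75 MiB available, halfway between the two floors, half of the incoming requests are shed. There is no flag to flip, so there is no backlog to release all at once when pressure eases.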

This was the single piece of work where I most clearly felt the difference between “code that compiles and runs” and “code that behaves well under pressure.” The first version did the second thing badly. Getting it right took thinking about feedback loops, not just thresholds.

Recovery without intervention

The point of all this is that the system has to handle being overloaded without me being there. If I’m asleep and a feed reader client misbehaves and starts hammering the box, the service has to shed enough load to stay functional, ride out the storm, and return to its normal state without me logging in.

The hard part of that, on a small box, is staying out of swap. Once the kernel starts paging memory to disk, latencies blow up by orders of magnitude, the in-flight queue grows because requests aren’t clearing, the queue eats more memory, and you’re in a spiral the shedder can’t get you out of fast enough. The whole game is to shed before swap starts, not after.

That puts a constraint on the shedder’s signal. MemAvailable is the right thing to read — it’s the kernel’s accounting of “how much memory can you get without paging.” But there’s a complication: Go’s garbage collector, by default, has no soft heap cap. On a 768 MiB cgroup, a healthy steady-state load will let the heap drift up until MemAvailable is sitting around 100 MiB — not because anything is wrong, but because that’s where the GC decides it should be.

The first multi-phase capacity sweep tripped 100% shed on every test within forty seconds for exactly that reason. The shedder was reading “memory pressure” off natural GC headroom. The fix wasn’t to make the shedder less aggressive; it was to give the GC a target. Setting GOMEMLIMIT to about 70% of the cgroup limit (550 MiB on a 768 MiB container) tells Go’s GC to keep the heap under that, leaving real margin in the cgroup for genuine spikes. With the heap bounded, MemAvailable in healthy state sits at 200–300 MiB, and a floor of 50 MiB becomes a meaningful “about to swap” margin instead of a false-positive trigger.
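The same limit can also be set from code: debug.SetMemoryLimit (Go 1.19 and later) controls the same soft limit as the GOMEMLIMIT environment variable. The sketch below assumes the cgroup limit is already known in bytes, and heapLimitFor and applyHeapLimit are hypothetical helpers. It uses an exact 70%, which is about 537 MiB on a 768 MiB container and slightly below the rounded 550 MiB figure above.

```go
package main

import "runtime/debug"

// heapLimitFor returns 70% of the cgroup memory limit, in bytes.
func heapLimitFor(cgroupLimitBytes int64) int64 {
	return cgroupLimitBytes * 7 / 10
}

// applyHeapLimit sets the runtime's soft memory limit and returns the
// previously configured one.
func applyHeapLimit(cgroupLimitBytes int64) (previous int64) {
	return debug.SetMemoryLimit(heapLimitFor(cgroupLimitBytes))
}
```

Setting GOMEMLIMIT=550MiB in the container's environment achieves the same thing without a code change. The programmatic form is mainly useful when the limit should be derived from whatever cgroup limit the process finds at startup.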

Combined with the graded shedder, this gives the box a recovery shape I’m pleased with. Real memory pressure ramps shed probability up; serving load drops; GC catches up; pressure eases; shed probability ramps back down; serving resumes. Healthy state and overloaded state are the same code path, different sliders. There’s no “panic mode” the system has to manually exit.

The container itself runs with memswap_limit equal to mem_limit, which disables swap inside the container regardless of whether the host has any. If the cgroup limit is 768 MiB, 768 MiB is the actual ceiling — no slow-spiral page-out behavior is possible. That’s the belt to the shedder’s suspenders.

Race conditions

A couple of weeks after the bulk of the load-testing work, two audit passes through the cache code and a survey of about 7000 real-world feeds turned up bugs that hadn’t shown up in the synthetic runs.

In the HTTP cache:

  • The leader of a coalesced fetch was protected against the evictor by an in-flight registration held by the consumer’s Fetch defer. When every consumer cancelled before the leader finished, the refcount went to zero and the evictor was free to unlink the cache file out from under the leader’s “open the cache file” path. Fix: the leader registers its own protection.
  • A handful of WordPress installs returned 304 Not Modified to plain GET requests with no If-* headers and no prior cache entry. The 304 branch tried to open a file that never existed. Fix: gate the 304 branch on actually having a cache entry, and surface a clear error when we don’t.
  • The evictor’s filepath.Walk callback was returning walkErr unconditionally. Any concurrent rm between readdir and the callback’s stat produced an ENOENT that aborted the entire eviction tick.
  • snapshot() and reset() on the cache stats actor used unbuffered channels. Calling either after Close deadlocked forever.
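For the filepath.Walk item, one way to keep a vanished file from aborting the tick is to treat "not exist" errors as skippable and return everything else. This is a sketch of that idea, not the fix that actually landed. walkCache and its visit callback are hypothetical names.

```go
package main

import (
	"errors"
	"io/fs"
	"path/filepath"
)

// walkCache calls visit with the path and size of every regular file under
// root. Entries that disappear mid-walk are skipped; other errors abort.
func walkCache(root string, visit func(path string, size int64)) error {
	return filepath.Walk(root, func(path string, info fs.FileInfo, walkErr error) error {
		if walkErr != nil {
			if errors.Is(walkErr, fs.ErrNotExist) {
				return nil // removed concurrently; nothing left to evict here
			}
			return walkErr // real I/O errors still end the walk
		}
		if info.Mode().IsRegular() {
			visit(path, info.Size())
		}
		return nil
	})
}
```

Checking walkErr before touching info matters: when Walk reports an error for an entry, info may be nil.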

In SQLite-land, three call sites had a pattern that looked innocuous but turned out to be a landmine: a BeginTx that opens a DEFERRED transaction (SQLite’s default), a SELECT, and then an INSERT or UPDATE. SQLite returns SQLITE_BUSY immediately when a transaction tries to upgrade from SHARED to RESERVED while another writer is active, and (this is the cruel part) busy_timeout doesn’t retry transaction upgrades. The fetch-batch flusher I’d added to fix the timestamp-write contention was just frequent enough to clash with these read-then-write transactions and fail them outright. The fix was to restructure each into a single statement (UPDATE...RETURNING with every gate in the WHERE) or a write-only transaction.

These weren’t surfaced by the load test. They were surfaced by later, more careful auditing — with the load test having taught me what to look for.

The bottleneck

One thing the load test resolved that I hadn’t been certain about: where the box runs out. On the 1 GB Nanode at the optimized steady state, FeedFilters is CPU-bound. Memory and disk I/O have headroom; the single vCPU is what saturates first. That’s a useful answer because it tells me what scaling looks like — if FeedFilters needs more capacity than this box can deliver, the answer is more CPU, not more memory and not a different storage tier.

The rough numbers that came out of the capacity sweeps: at a workload of 25 feeds per user polling every five minutes (more aggressive than what most readers do in practice), the box stays under a 5% shed rate up through about 5,000 simultaneous active users. That’s roughly 125,000 active feed subscriptions, or about 417 requests per second sustained (125,000 polls spread over 300 seconds). Past that, the shedder kicks in to keep the box upright, but the system isn’t serving every request anymore. That’s plenty of headroom for FeedFilters to spend a long time at one box.

What I came out with

A couple of weeks ago, FeedFilters was a working app I’d shipped. Now it’s a working app I have rough numbers for: capacity, where it breaks, what breaks first, what to do when it does. That’s a different kind of confidence.

The next post is about the architectural decisions that came out of all this — the moves I made to give the service room to grow before any growing actually has to happen.

FeedFilters by Flat Six Software · © 2026