Pointing load at it
With the admin dashboard in place, I started load testing. I wanted to find the actual ceiling for FeedFilters on a 1 GB Linode Nanode — the smallest production target I cared about — and to understand what would break before I got there. What followed was the most concentrated stretch of work in the project. Every day surfaced something I hadn’t expected, and most of those “somethings” turned into commits that made the service substantially better.
This post tries to cover the full arc.
The harness
The load is synthetic, but it’s substantial. There are no real users involved — everything below is driven by a controlled test rig — but the rig is sized to push the service well past its comfortable operating range, and the lessons that came out of it are real.
The harness has three pieces. A mock upstream that serves deterministic RSS / Atom / JSON feeds, with knobs for body shape, delay, hang, errors, and gzip. A runner with a small web UI that drives the test and aggregates metrics. And the SUT — a debug build of FeedFilters running in a near-production container shape. The mock and runner live on a separate box from the SUT so the measurement isn’t competing with the workload for the only CPU the production target has.
The runner has two modes. Classic is the textbook shape: fixed user count, fixed ramp, stop when p95 latency crosses a threshold. Adaptive is more interesting — a controller that seeks the ceiling and tracks it, ramping load up while the SUT looks healthy and backing off when it doesn’t. Most of the useful tests I ran were in adaptive mode: the system tells you where the cliff is rather than you having to guess.
OS limits, before anything else
The first thing the load test surfaced was that my host wasn’t configured for traffic. Two kernel sysctls turned out to matter right away.
net.core.somaxconn caps the listen backlog, which bounds the TCP
accept queue. It defaults to 128 on kernels before Linux 5.4
(4096 since). Under any meaningful burst of new connections, the
queue fills up and connections get refused outright. Bumping it
to 65535 stopped the spurious ECONNREFUSED errors immediately.
net.netfilter.nf_conntrack_max defaults to about 65K, which the
service blew through during outbound-fetch storms. The symptom
was confusing: DNS lookups started failing with “write: operation not permitted”. That turned out to be conntrack refusing to
allocate a slot for a new outbound flow. Setting it to 524288
fixed it.
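The somaxconn change has to be applied per-container as well as
on the host, because Docker gives each container its own network
namespace and somaxconn is per-namespace. The host setting
doesn’t propagate. nf_conntrack_max is different: it is a single
kernel-wide limit, so it is set once on the host.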
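One way to catch a container that missed the setting is a startup check that reads the value from inside the container. The following is a minimal sketch, assuming the standard Linux procfs path; `parseSysctlInt` and `wantSomaxconn` are names invented for illustration and are not part of FeedFilters.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// wantSomaxconn is the value this post settled on. Treat it as an
// assumption for any other deployment.
const wantSomaxconn = 65535

// parseSysctlInt parses the contents of a single-integer sysctl file,
// which the kernel writes as digits followed by a newline.
func parseSysctlInt(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

func main() {
	// Read from inside the container: the value reflects the container's
	// own network namespace, not the host's.
	raw, err := os.ReadFile("/proc/sys/net/core/somaxconn")
	if err != nil {
		fmt.Println("cannot read somaxconn:", err)
		return
	}
	got, err := parseSysctlInt(string(raw))
	if err != nil {
		fmt.Println("unexpected somaxconn contents:", err)
		return
	}
	if got < wantSomaxconn {
		fmt.Printf("somaxconn is %d, want at least %d\n", got, wantSomaxconn)
	}
}
```

Run at service startup, a check like this turns a silent misconfiguration into a log line instead of refused connections under load.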
The connection-pool cliff
Once the kernel was out of the way, the next thing the runner found was a ceiling that didn’t look like a ceiling at first. Throughput would climb steadily, then collapse — not just stop scaling, but actively drop into a death spiral with the box going OOM.
The cause was that I’d left sql.DB’s connection pool unbounded.
SQLite serializes writers (one writer at a time, full stop), so
under a write-heavy burst every blocked goroutine grabbed a fresh
connection from the pool. A brief queue cascaded into thousands
of open connections, each carrying its own buffers and goroutine
stack and lock-manager bookkeeping. That memory pressure produced
GC pressure, which produced more queueing, which produced more
connections. The system tipped over.
The fix was a one-line change with a lot of reasoning behind it: cap the pool at 25. Twenty-five connections gives WAL readers plenty of headroom (they don’t block each other) while pinning a hard ceiling on the writer queue. Excess request rate above that queues in Go’s scheduler instead of in SQLite’s lock manager, which is dramatically cheaper. There’s no principled formula behind 25; it’s empirically derived from capacity sweeps on the 1 GB Nanode that’s the reference deployment. Bigger boxes can likely run higher, but no one’s swept that.
Profiling and the hot-spot day
With the kernel and the pool fixed, the SUT ran clean enough that
the next ceiling was about CPU and memory inside FeedFilters
itself. I’d already wired in pprof for dev builds, so I pointed
it at a sweep and looked at where the cycles were going.
It was instructive. The output path — where a feed reader fetches a filtered feed — was spending most of its time in two places I hadn’t expected: re-parsing the same upstream XML on every request, and re-normalizing every item’s text on every filter pass. The output endpoint was, in effect, paying full parse-and-filter cost for every cache hit.
Several changes landed in a single day:
- Cache the parsed feed itself. Alongside the disk-backed HTTP cache for upstream fetches, an in-memory byte-bounded LRU holds the parsed gofeed.Feed plus per-item byte ranges into the source XML. Cache hits skip the parser entirely; misses pay full parse cost once and amortize across every reader.
- Cache normalized item text. The filter engine compares case-folded, accent-stripped item text. Doing that work inside the filter loop meant doing it for every (item × filter) combination. Caching the normalized text per item, once, on parse, dropped filter-pass CPU substantially.
- In-memory cache for Store.GetByID. The output handler was hitting SQLite for the feed row on every request. The row changes rarely; an in-memory cache keyed by feed ID eliminated the lookup from the hot path.
- Slim the cache value type. The original FeedCache value carried more than it needed to. Trimming it let the same heap budget hold roughly five times as many entries, which cut the steady-state miss rate proportionally.
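To make “byte-bounded LRU” concrete, here is a minimal sketch of the idea using only the standard library. It is not the FeedFilters cache: the `ByteLRU` type and its methods are hypothetical, values are plain byte slices rather than parsed feeds, and there is no locking, so a real version would need a mutex around every method.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is what each list element holds.
type entry struct {
	key string
	val []byte
}

// ByteLRU evicts least-recently-used entries once the total size of the
// stored values exceeds maxBytes. It bounds bytes, not entry count.
type ByteLRU struct {
	maxBytes int
	curBytes int
	order    *list.List // front = most recently used
	items    map[string]*list.Element
}

func NewByteLRU(maxBytes int) *ByteLRU {
	return &ByteLRU{
		maxBytes: maxBytes,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Get returns the cached value and marks it as recently used.
func (c *ByteLRU) Get(key string) ([]byte, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).val, true
}

// Put stores val, then evicts from the cold end until the byte budget
// holds. A value larger than the whole budget is evicted immediately.
func (c *ByteLRU) Put(key string, val []byte) {
	if el, ok := c.items[key]; ok {
		e := el.Value.(*entry)
		c.curBytes += len(val) - len(e.val)
		e.val = val
		c.order.MoveToFront(el)
	} else {
		c.items[key] = c.order.PushFront(&entry{key: key, val: val})
		c.curBytes += len(val)
	}
	for c.curBytes > c.maxBytes && c.order.Len() > 0 {
		oldest := c.order.Back()
		e := oldest.Value.(*entry)
		c.order.Remove(oldest)
		delete(c.items, e.key)
		c.curBytes -= len(e.val)
	}
}

// Bytes reports the total size of the values currently held.
func (c *ByteLRU) Bytes() int { return c.curBytes }

func main() {
	c := NewByteLRU(10)
	c.Put("a", []byte("aaaa"))
	c.Put("b", []byte("bbbb"))
	c.Get("a")                 // touch "a" so "b" becomes the coldest entry
	c.Put("c", []byte("cccc")) // 12 bytes > 10, so "b" is evicted
	_, hasB := c.Get("b")
	fmt.Println("b cached:", hasB, "bytes held:", c.Bytes())
}
```

Bounding by bytes rather than by entry count is what makes slimming the value type pay off: smaller values mean more entries fit in the same budget.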
None of these are clever individually. Together they shifted the output endpoint from CPU-bound to nearly memory-bound, which is the regime I wanted.
An upstream cache
Around the same time, I replaced the day-2 source cache with a
new disk-backed HTTP cache (internal/httpcache). The original
two-state-machine design — metadata in SQLite, bytes on
disk — had race windows between the two halves that the
load test exposed. The new cache is interesting enough to
deserve its own post, so I’ll write that one separately. For
the purposes of this post, the relevant fact is that upstream
feed responses are cached on disk and concurrent requests for
the same URL coalesce to a single fetch.
Debouncing the timestamp writes
At sustained cache-hit load — around 200 requests per
second — every /feed/{id} request was firing a
synchronous UPDATE on feeds.last_successful_fetch_at. With
the connection pool capped at 25 and SQLite’s single-writer
model, every UPDATE serialized through one writer lock. The
runner started seeing SQLITE_BUSY errors and Caddy began
returning 502s as the in-process queue grew.
The fix was a coalescing flusher: a FetchRecorder accumulates
“feed N was just fetched successfully” events into a per-feed-id
map, and flushes a snapshot of that map in a single transaction
every two seconds. Latest-event-wins is fine for what those
columns are: advisory timestamps and an icon URL that almost
never changes. N writes become one writer-lock acquire. The hot
path’s contribution drops to one map insert under a mutex.
Two seconds of staleness on last_successful_fetch_at is
invisible to admins; the writer-contention cliff disappeared
completely.
Load shedding: from binary to graded
Once the box got close to its memory ceiling, even the optimized
service would eventually fall over. So I added a load shedder
— a middleware on the public feed endpoint that returns
503 + Retry-After when the system is under unsustainable
pressure.
The first cut was binary. When MemAvailable (read from
/proc/meminfo every 200 ms) dropped below a configured floor,
flip a flag on; when it climbed back above floor × 1.5, flip the
flag off. Hysteresis was supposed to prevent flapping. It didn’t.
The failure mode was destructive. The flag flipped on at the floor, GC freed memory, the flag flipped off, a flood of queued requests poured in, memory dropped back to the floor, the flag flipped on again. Each “off” cycle’s flood had to be processed — allocating buffers, spawning goroutines, growing in-flight count — before the shedder could catch up. Eventually a flood arrived that the box couldn’t keep up with, and we cascaded.
The replacement is a continuous probability instead of a flag.
Each request is shed with probability p, where p is a
linear interpolation between two thresholds: a soft floor
(where shedding starts at 0%) and a hard floor (where shedding
reaches 100%). As MemAvailable drops from 100 MiB toward 50
MiB, shed probability climbs from 0% to 100%. The system finds
an equilibrium: shed rate matches the fraction by which incoming
load exceeds sustainable load. No oscillation, no flood-cycles.
CPU got the same treatment using load1 / numCPU as the signal,
with its own soft and hard thresholds. The shedder reports
which signal is dominating, so an operator can tell why the
system is shedding rather than just that it is.
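The interpolation is small enough to write out. In this sketch the memory thresholds (100 MiB soft, 50 MiB hard) come from the text; the CPU thresholds, the function names, and the choice to combine the two signals by taking the larger probability are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// ramp maps a signal onto a shed probability in [0, 1]: 0 at the soft
// threshold, 1 at the hard threshold, linear in between. It works whether
// "worse" means lower (memory: soft > hard) or higher (CPU: soft < hard).
// soft and hard must differ.
func ramp(x, soft, hard float64) float64 {
	p := (soft - x) / (soft - hard)
	if p < 0 {
		return 0
	}
	if p > 1 {
		return 1
	}
	return p
}

// shedProbability takes the worse of the two signals and reports which
// one is dominating, so an operator can see why requests are being shed.
func shedProbability(memAvailMiB, loadPerCPU float64) (float64, string) {
	mem := ramp(memAvailMiB, 100, 50) // thresholds from the text
	cpu := ramp(loadPerCPU, 1.0, 2.0) // assumed thresholds
	switch {
	case mem == 0 && cpu == 0:
		return 0, "none"
	case cpu > mem:
		return cpu, "cpu"
	default:
		return mem, "memory"
	}
}

// shouldShed makes the per-request decision.
func shouldShed(p float64) bool {
	return rand.Float64() < p
}

func main() {
	for _, avail := range []float64{150, 100, 75, 50, 20} {
		p, why := shedProbability(avail, 0.5)
		fmt.Printf("MemAvailable=%3.0f MiB -> shed p=%.2f (%s)\n", avail, p, why)
	}
	fmt.Println("shed this request:", shouldShed(0.5))
}
```

Because the decision is made per request with a probability rather than per system with a flag, there is no moment at which a backlog of queued requests is released all at once.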
This was the single piece of work where I most clearly felt the difference between “code that compiles and runs” and “code that behaves well under pressure.” The first version managed the former and failed at the latter. Getting it right took thinking about feedback loops, not just thresholds.
Recovery without intervention
The point of all this is that the system has to handle being overloaded without me being there. If I’m asleep and a feed reader client misbehaves and starts hammering the box, the service has to shed enough load to stay functional, ride out the storm, and return to its normal state without me logging in.
The hard part of that, on a small box, is staying out of swap. Once the kernel starts paging memory to disk, latencies blow up by orders of magnitude, the in-flight queue grows because requests aren’t clearing, the queue eats more memory, and you’re in a spiral the shedder can’t get you out of fast enough. The whole game is to shed before swap starts, not after.
That puts a constraint on the shedder’s signal. MemAvailable is
the right thing to read — it’s the kernel’s accounting of
“how much memory can you get without paging.” But there’s a
complication: Go’s garbage collector, by default, has no soft
heap cap. On a 768 MiB cgroup, a healthy steady-state load will
let the heap drift up until MemAvailable is sitting around 100
MiB — not because anything is wrong, but because that’s
where the GC decides it should be.
The first multi-phase capacity sweep tripped 100% shed on every
test within forty seconds for exactly that reason. The shedder
was reading “memory pressure” off natural GC headroom. The fix
wasn’t to make the shedder less aggressive; it was to give the
GC a target. Setting GOMEMLIMIT to about 70% of the cgroup
limit (550 MiB on a 768 MiB container) tells Go’s GC to keep the
heap under that, leaving real margin in the cgroup for genuine
spikes. With the heap bounded, MemAvailable in healthy state
sits at 200–300 MiB, and a floor of 50 MiB becomes a
meaningful “about to swap” margin instead of a false-positive
trigger.
Combined with the graded shedder, this gives the box a recovery shape I’m pleased with. Real memory pressure ramps shed probability up; serving load drops; GC catches up; pressure eases; shed probability ramps back down; serving resumes. Healthy state and overloaded state are the same code path, different sliders. There’s no “panic mode” the system has to manually exit.
The container itself runs with memswap_limit equal to
mem_limit, which disables swap inside the container regardless
of whether the host has any. If the cgroup limit is 768 MiB,
768 MiB is the actual ceiling — no slow-spiral page-out
behavior is possible. That’s the belt to the shedder’s suspenders.
Race conditions
A couple of weeks after the bulk of the load-testing work, two audit passes through the cache code and a survey of about 7000 real-world feeds turned up bugs that hadn’t shown up in the synthetic runs.
In the HTTP cache:
- The leader of a coalesced fetch was protected against the evictor by an in-flight registration held by the consumer’s Fetch defer. When every consumer cancelled before the leader finished, the refcount went to zero and the evictor was free to unlink the cache file out from under the leader’s “open the cache file” path. Fix: the leader registers its own protection.
- A handful of WordPress installs returned 304 Not Modified to plain GET requests with no If-* headers and no prior cache entry. The 304 branch tried to open a file that never existed. Fix: gate the 304 branch on actually having a cache entry, and surface a clear error when we don’t.
- The evictor’s filepath.Walk callback was returning walkErr unconditionally. Any concurrent rm between readdir and the callback’s stat produced an ENOENT that aborted the entire eviction tick.
- snapshot() and reset() on the cache stats actor used unbuffered channels. Calling either after Close deadlocked forever.
In SQLite-land, three call sites had a pattern that looked
innocuous but turned out to be a landmine: a deferred BeginTx,
a SELECT, and then an INSERT or UPDATE. SQLite returns
SQLITE_BUSY immediately when a transaction tries to upgrade
from SHARED to RESERVED while another writer is active, and
— this is the cruel part — busy_timeout doesn’t
retry transaction upgrades. The fetch-batch flusher I’d added
to fix the timestamp-write contention was just frequent enough
to clash with these read-then-write transactions and fail them
outright. The fix was to restructure each into a single
statement (UPDATE...RETURNING with every gate in the WHERE)
or a write-only transaction.
These weren’t surfaced by the load test. They were surfaced by later, more careful auditing — with the load test having taught me what to look for.
The bottleneck
One thing the load test resolved that I hadn’t been certain about: where the box runs out. On the 1 GB Nanode at the optimized steady state, FeedFilters is CPU-bound. Memory and disk I/O have headroom; the single vCPU is what saturates first. That’s a useful answer because it tells me what scaling looks like — if FeedFilters needs more capacity than this box can deliver, the answer is more CPU, not more memory and not a different storage tier.
The rough numbers that came out of the capacity sweeps: at a workload of 25 feeds per user polling every five minutes — more aggressive than what most readers do in practice — the box stays under a 5% shed rate up through about 5,000 simultaneous active users. That’s roughly 125,000 active feed subscriptions and several hundred requests per second sustained. Past that, the shedder kicks in to keep the box upright, but the system isn’t serving every request anymore. That’s plenty of headroom for FeedFilters to spend a long time at one box.
What I came out with
A couple of weeks ago, FeedFilters was a working app I’d shipped. Now it’s a working app I have rough numbers for: capacity, where it breaks, what breaks first, what to do when it does. That’s a different kind of confidence.
The next post is about the architectural decisions that came out of all this — the moves I made to give the service room to grow before any growing actually has to happen.