Demo Type · 14

Research / feature explainer

Use this when you want to deep-dive one feature or mechanism — a focused page with an on-page table of contents, tabbed code (algorithm / config / usage), one annotated diagram, and an FAQ that answers the questions readers actually ask.

This is a copyable exemplar. Lift the .demo-card section below into a lesson built from assets/lesson-template.html — the design tokens, tech-toggle, tabs, and SVG patterns are already wired to match.

Feature deep-dive · Rate limiting with the token-bucket algorithm

1

What rate limiting does


A rate limiter decides how many requests a single caller is allowed to make in a given window of time. If a client stays under its allowance, every request goes through. If it suddenly floods the server, the limiter starts turning requests away — usually with an HTTP 429 Too Many Requests — so one noisy caller can't starve everyone else.

The token-bucket algorithm is one of the most common ways to do this. Picture a bucket that holds tokens. Every request must take one token to proceed. Tokens drip back in at a steady rate, and the bucket has a maximum size. When the bucket is empty, requests are denied until it refills. Because the bucket can hold a reserve, short bursts are allowed while the long-run average stays capped.

Think of it like… a coin-operated turnstile. Each entry costs one coin. A machine drops a fresh coin into the tray at a fixed pace (say, 5 coins a second), and the tray only holds 10 coins at most. A quick rush can spend the 10 saved-up coins all at once, but after that, people enter only as fast as new coins appear.

Under the hood

The bucket has two parameters: capacity B (max tokens, the burst ceiling) and refill rate r (tokens added per second, the sustained ceiling). State per key is just two numbers: tokens and last_refill (the timestamp of the most recent refill).

Refill is computed lazily on access rather than by a background timer: tokens = min(B, tokens + (now - last_refill) * r), after which last_refill is set to now. A request of cost c is admitted iff tokens >= c, decrementing tokens by c; otherwise it is rejected and the limiter can report Retry-After = (c - tokens) / r seconds.
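As a worked example of the two formulas above, the sketch below replays the turnstile numbers (B = 10, r = 5) with explicit timestamps instead of a real clock. The function name `step` and its argument order are illustrative assumptions, not part of the lesson's code.

```python
def step(tokens, last_refill, now, capacity, rate, cost=1):
    """Lazy refill, then admit or reject. Returns (allowed, tokens, retry_after)."""
    # tokens = min(B, tokens + (now - last_refill) * r)
    tokens = min(capacity, tokens + (now - last_refill) * rate)
    if tokens >= cost:
        return True, tokens - cost, 0.0
    # Not enough tokens: time until the shortfall has dripped back in.
    return False, tokens, (cost - tokens) / rate


B, r = 10, 5            # capacity and refill rate from the turnstile analogy
tokens, last = B, 0.0   # full bucket, last refilled at t = 0

# A burst of 11 requests, all at t = 0: the reserve covers exactly 10.
burst = []
for _ in range(11):
    allowed, tokens, wait = step(tokens, last, 0.0, B, r)
    burst.append(allowed)

# One second later 5 tokens have dripped back in; this request spends one.
allowed_later, tokens_later, _ = step(tokens, last, 1.0, B, r)
```

With these inputs the first ten requests are admitted, the eleventh is rejected with a wait of 1 / 5 = 0.2 seconds, and the request one second later leaves 4 tokens in the bucket.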

This differs from a fixed-window counter (cheaper, but allows bursts of up to 2× the limit across a window edge) and from a leaky bucket (which shapes output to a constant rate and does not permit bursts at all). Token bucket costs O(1) time and O(1) memory per key, which is one reason it is widely used in production limiters.
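To make the window-edge difference concrete, here is a small comparison under assumed numbers: a limit of 10 per 1-second window against a bucket with B = 10 and r = 10. The class names `FixedWindowCounter` and `TimedTokenBucket` are hypothetical, and both take the current time as an argument so the run is deterministic.

```python
class FixedWindowCounter:
    """Allow at most `limit` requests per `window` seconds, reset at each boundary."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.window_id = None
        self.count = 0

    def allow(self, now):
        window_id = int(now // self.window)
        if window_id != self.window_id:      # new window: the counter starts over
            self.window_id = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class TimedTokenBucket:
    """Token bucket that takes `now` explicitly instead of reading a clock."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.updated = 0.0

    def allow(self, now):
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# 10 requests just before the window boundary, 10 just after it.
arrivals = [0.99] * 10 + [1.01] * 10

fixed = FixedWindowCounter(limit=10, window=1.0)
bucket = TimedTokenBucket(capacity=10, refill_rate=10)

fixed_admitted = sum(fixed.allow(t) for t in arrivals)
bucket_admitted = sum(bucket.allow(t) for t in arrivals)
```

In this scenario the fixed window admits all 20 requests within 0.02 seconds, twice its nominal limit, while the bucket admits the first 10 and rejects the rest because only about 0.2 tokens have refilled.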

2

The bucket, drawn


Tokens drip in from the top at a steady rate. Each request you send drains one token. Spend the reserve and the next request gets denied — watch the counter.

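[Diagram: tokens refill into the bucket at rate r; bucket capacity B = 10; each request takes 1 token and receives 200 OK.]
Read left → right: a request takes 1 token from the bucket; a full bucket lets bursts through, an empty bucket returns 429.
[Live counter, initial state: tokens 10 / 10 · denied 0]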
3

The code


Three views of the same feature: the algorithm that decides allow/deny, the config you tune in production, and how you use it as middleware on a route.

limiter/token_bucket.py
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity    = capacity        # B — burst ceiling
        self.refill_rate = refill_rate     # r — tokens / second
        self.tokens      = capacity
        self.updated     = time.monotonic()

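    def allow(self, cost=1):
        """Return (allowed, retry_after_seconds) for a request costing `cost` tokens."""
        if cost > self.capacity:
            # the bucket can never hold this many tokens, so no wait will help
            return False, float("inf")
        now     = time.monotonic()
        elapsed = now - self.updated
        # lazy refill, capped at capacity
        self.tokens  = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0.0            # allowed
        # seconds until the shortfall has refilled
        retry_after = (cost - self.tokens) / self.refill_rate
        return False, retry_after       # denied → 429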

Find it yourself: grep -rn "class TokenBucket" limiter/
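To connect the algorithm to the middleware use mentioned above, here is a minimal sketch of one bucket per caller key. `KeyedLimiter`, `check`, and the `clock` parameter are hypothetical names introduced for illustration; a real framework's middleware hook will look different. The bucket is restated in compact form only so the block runs on its own.

```python
import math
import time


class TokenBucket:
    """Compact restatement of the class above, with an injectable clock."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self, cost=1):
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0.0
        return False, (cost - self.tokens) / self.refill_rate


class KeyedLimiter:
    """Hypothetical wrapper: one bucket per caller key (API key, IP, user id)."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.buckets = {}

    def check(self, key):
        """Return (status_code, headers) the way a middleware layer might."""
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_rate, self.clock)
            self.buckets[key] = bucket
        allowed, retry_after = bucket.allow()
        if allowed:
            return 200, {}
        # Retry-After carries whole seconds, so round the wait up.
        return 429, {"Retry-After": str(math.ceil(retry_after))}


# Demo with a frozen fake clock so the outcome does not depend on timing.
fake_now = [0.0]
limiter = KeyedLimiter(capacity=2, refill_rate=1, clock=lambda: fake_now[0])

responses = [limiter.check("alice") for _ in range(3)]  # third call exceeds the burst
other = limiter.check("bob")                            # separate key, separate bucket
```

With a capacity of 2, the third back-to-back call for "alice" gets a 429 with Retry-After: 1, while "bob" is unaffected because each key has its own bucket.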

4

FAQ


Why allow bursts at all instead of a flat cap?
Real traffic is bumpy. A page that fires six API calls on load would trip a perfectly steady cap even though the user is well-behaved. The bucket's capacity is a savings account that absorbs those legitimate bursts while refill_rate still bounds the long-run average. If you truly need a perfectly smooth output, that's the leaky-bucket variant instead.
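What should a client do when it gets a 429?
Read the Retry-After header and wait at least that many seconds before retrying; don't hammer immediately. A good client backs off exponentially with jitter on repeated 429s. The server computes the wait as (cost − tokens) / refill_rate, which is the earliest moment at which enough tokens can exist if no other request spends them first, and rounds it up because Retry-After carries whole seconds.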
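The retry advice above can be sketched on the client side as follows. This is one common variant ("full jitter"), not the only reasonable policy; the function name `next_delay` and the `base` and `cap` defaults are assumptions chosen for illustration.

```python
import random


def next_delay(attempt, retry_after=None, base=0.5, cap=30.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based) after a 429.

    Picks a random wait in [0, min(cap, base * 2**attempt)], but never
    less than the server's Retry-After value when one was sent.
    """
    ceiling = min(cap, base * (2 ** attempt))
    delay = rng.uniform(0.0, ceiling)
    if retry_after is not None:
        delay = max(delay, float(retry_after))
    return delay


rng = random.Random(42)  # seeded only so the demo is repeatable
delays = [next_delay(n, retry_after=1, rng=rng) for n in range(5)]
uncapped_check = next_delay(20, rng=rng)  # large attempt number, no Retry-After
```

The exponential ceiling spreads repeated retries out, the jitter keeps many clients from retrying in lockstep, and the `max` guarantees the client never retries sooner than the server asked.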
How does this work with more than one server?
Each server holding its own in-memory bucket would multiply the real limit by the number of servers. Put the bucket state in a shared store (the config sets storage: redis) and do the refill-and-decrement atomically, typically in a small Lua script, so all servers debit the same bucket. A bare INCR/EXPIRE pair is not enough here: it implements a fixed-window counter rather than a token bucket.
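The multiplication effect is easy to see in an in-process simulation. The sketch below makes no Redis calls: the class `Bucket` and helper `admitted` are hypothetical, refill is left out because the burst happens at a single instant, and a `threading.Lock` stands in for the atomicity a shared store's script would provide.

```python
import threading


class Bucket:
    """Token bucket state with refill omitted: the demo is one instant burst."""

    def __init__(self, capacity):
        self.tokens = capacity
        self.lock = threading.Lock()   # stand-in for the store's atomic script

    def take(self):
        # Check and decrement under one lock so two servers cannot both
        # spend the last token.
        with self.lock:
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False


def admitted(buckets_by_server, requests_per_server):
    """Count admitted requests when each server sends a burst to its bucket."""
    total = 0
    for bucket in buckets_by_server:
        total += sum(bucket.take() for _ in range(requests_per_server))
    return total


SERVERS, LIMIT = 3, 10

# Each server keeps its own in-memory bucket: the effective limit triples.
local_total = admitted([Bucket(LIMIT) for _ in range(SERVERS)], 10)

# All servers debit one shared bucket: the limit holds.
shared = Bucket(LIMIT)
shared_total = admitted([shared] * SERVERS, 10)
```

Three servers with private buckets admit 30 requests against a nominal limit of 10, while the shared bucket admits exactly 10.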
Why not just use a fixed-window counter?
A fixed-window counter is the cheapest to build but lets a caller send a full window's worth of requests at the very end of one window and again at the start of the next: a 2× burst at the seam. Token bucket smooths that out and gives you a clean burst/sustained split. Pick fixed-window only when approximate limiting is fine and simplicity wins.