Demo Type · 14

Research / feature explainer

Use this when you want to deep-dive one feature or mechanism — a focused page with an on-page table of contents, tabbed code (algorithm / config / usage), one annotated diagram, and an FAQ that answers the questions readers actually ask.

This is a copyable exemplar. Lift the .demo-card section below into a lesson built from assets/lesson-template.html — the design tokens, tech-toggle, tabs, and SVG patterns are already wired to match.

Feature deep-dive · Rate limiting with the token-bucket algorithm

What rate limiting does

A rate limiter decides how many requests a single caller is allowed to make in a given window of time. If a client stays under its allowance, every request goes through. If it suddenly floods the server, the limiter starts turning requests away — usually with an HTTP 429 Too Many Requests — so one noisy caller can't starve everyone else.

The token-bucket algorithm is one of the most common ways to do this. Picture a bucket that holds tokens. Every request must take one token to proceed. Tokens drip back in at a steady rate, and the bucket has a maximum size. When the bucket is empty, requests are denied until it refills. Because the bucket can hold a reserve, short bursts are allowed while the long-run average stays capped.

Think of it like… a coin-operated turnstile. Each entry costs one coin. A machine drops a fresh coin into the tray at a fixed pace (say, 5 coins a second), and the tray only holds 10 coins at most. A quick rush can spend the 10 saved-up coins all at once, but after that, people enter only as fast as new coins appear.

Under the hood

The bucket has two parameters: capacity B (max tokens, the burst ceiling) and refill rate r (tokens added per second, the sustained ceiling). State per key is just two numbers: tokens and last_refill_timestamp.

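Refill is computed lazily on access rather than by a background timer: tokens = min(B, tokens + (now - last_refill) * r). A request of cost c is admitted iff tokens >= c, decrementing by c; otherwise it is rejected and the limiter can report Retry-After = (c - tokens) / r. A request with c > B can never be admitted, because tokens never exceeds B, so it should be rejected outright rather than told to retry.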
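To make the lazy-refill arithmetic concrete, here is a minimal sketch that takes explicit timestamps instead of reading a clock. The function name refill_and_take and the sample numbers are illustrative assumptions, not part of any library.

```python
def refill_and_take(tokens, last_refill, now, capacity, rate, cost=1):
    """Return (allowed, tokens_left, retry_after_seconds).

    Hypothetical helper: a pure-function form of the refill formula,
    with the clock passed in so the arithmetic is easy to follow.
    """
    # Lazy refill: credit the tokens earned since the last access, capped at B.
    tokens = min(capacity, tokens + (now - last_refill) * rate)
    if tokens >= cost:
        return True, tokens - cost, 0.0
    # Not enough tokens: the shortfall divided by r is the wait time.
    return False, tokens, (cost - tokens) / rate


# B = 10, r = 5 tokens/s, 2 tokens left, 0.5 s since the last access.
# Refill: min(10, 2 + 0.5 * 5) = 4.5 tokens.
print(refill_and_take(2, 0.0, 0.5, 10, 5, cost=1))  # admitted, 3.5 tokens left
print(refill_and_take(2, 0.0, 0.5, 10, 5, cost=6))  # denied, about 0.3 s to wait
```

Passing now as an argument keeps the function deterministic, which makes the formula easy to unit-test; a production limiter would read a monotonic clock instead.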

This differs from a fixed-window counter (cheaper, but allows up to 2× the limit in a burst at window edges) and from a leaky bucket used as a queue (which shapes output to a constant rate and does not pass bursts through). Token-bucket is O(1) time and O(1) memory per key, which is one reason it is a common choice for production limiters.

The bucket, drawn

Tokens drip in from the top at a steady rate. Each request you send drains one token. Spend the reserve and the next request gets denied — watch the counter.

Read left → right: a request takes 1 token from the bucket; a full bucket lets bursts through, an empty bucket returns 429.

tokens 10 / 10 · denied 0

The code

Three views of the same feature: the algorithm that decides allow/deny, the config you tune in production, and how you use it as middleware on a route.

limiter/token_bucket.py

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity    = capacity        # B — burst ceiling
        self.refill_rate = refill_rate     # r — tokens / second
        self.tokens      = capacity
        self.updated     = time.monotonic()

    def allow(self, cost=1):
        now     = time.monotonic()
        elapsed = now - self.updated
        # lazy refill, capped at capacity
        self.tokens  = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0.0            # allowed
        retry_after = (cost - self.tokens) / self.refill_rate
        return False, retry_after       # denied → 429

config/ratelimit.yaml

# one bucket policy per route tier. key = client API token.
buckets:
  default:
    capacity:    10      # allow short bursts up to 10
    refill_rate: 5       # sustained 5 req/s long-run
  search:
    capacity:    30
    refill_rate: 10
  auth_login:
    capacity:    5       # brute-force guard: tight burst
    refill_rate: 1

storage:   redis        # shared state across app servers
key_by:    api_token    # or: ip, user_id
on_deny:   "429"         # + Retry-After header

app/routes.py

from limiter import limit          # decorator built on TokenBucket

@app.route("/search")
@limit(policy="search")            # 30 burst, 10/s sustained
def search(req):
    return do_search(req.query)

# when the bucket is empty the decorator short-circuits:
#   HTTP/1.1 429 Too Many Requests
#   Retry-After: 1                 # 0.1 s at 10 tokens/s, rounded up to whole seconds
#   X-RateLimit-Remaining: 0
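
The route above imports a limit decorator without showing it. The following is one possible sketch of such a decorator, not the actual limiter module: the POLICIES dict, the dict-shaped req, the (status, headers, body) return tuple, and the clock parameter are all assumptions made so the example runs without a web framework.

```python
import functools
import math
import time


class TokenBucket:
    """Compact copy of the bucket from limiter/token_bucket.py, with an injectable clock."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity, self.refill_rate = capacity, refill_rate
        self.tokens, self.clock = capacity, clock
        self.updated = clock()

    def allow(self, cost=1):
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0.0
        return False, (cost - self.tokens) / self.refill_rate


# Assumed: policies mirror config/ratelimit.yaml and are loaded elsewhere.
POLICIES = {"search": {"capacity": 30, "refill_rate": 10}}
_buckets = {}  # (policy, api_token) -> TokenBucket


def limit(policy, clock=time.monotonic):
    """Hypothetical decorator: one bucket per (policy, caller key)."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(req):
            key = (policy, req["api_token"])  # assumes req is a dict
            bucket = _buckets.get(key)
            if bucket is None:
                bucket = _buckets[key] = TokenBucket(clock=clock, **POLICIES[policy])
            allowed, retry_after = bucket.allow()
            if not allowed:
                headers = {
                    "Retry-After": str(math.ceil(retry_after)),  # whole seconds
                    "X-RateLimit-Remaining": "0",
                }
                return 429, headers, "Too Many Requests"
            headers = {"X-RateLimit-Remaining": str(int(bucket.tokens))}
            return 200, headers, handler(req)
        return wrapper
    return decorator


@limit(policy="search")
def search(req):
    return f"results for {req['query']}"


print(search({"api_token": "demo", "query": "tokens"}))
```

The in-memory _buckets dict only works for a single process; with several app servers the state would have to live in the shared store named in the config.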

Find it yourself: grep -rn "class TokenBucket" limiter/

FAQ

Why allow bursts at all — isn't a steady cap simpler?

Real traffic is bumpy. A page that fires six API calls on load would trip a perfectly steady cap even though the user is well-behaved. The bucket's capacity is a savings account that absorbs those legitimate bursts while refill_rate still bounds the long-run average. If you truly need a perfectly smooth output, that's the leaky-bucket variant instead.

What should a client do when it gets a 429?

Read the Retry-After header and wait that many seconds before retrying; don't hammer immediately. A good client backs off exponentially with jitter on repeated 429s. The server computes the wait as (cost − tokens) / refill_rate and rounds it up to whole seconds for the header, so it is the earliest time at which enough tokens can exist, assuming no other request on the same key spends them first.
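
As a rough sketch of that client behaviour: the helper names next_delay and call_with_retry, the base and cap values, and the (status, headers, body) shape of send are all assumptions for illustration. The sketch also assumes Retry-After carries a number of seconds rather than an HTTP date.

```python
import random
import time


def next_delay(attempt, retry_after=None, base=0.5, cap=30.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    Hypothetical helper. Treats the server's Retry-After as a floor and
    adds jittered exponential backoff on top for repeated 429s.
    """
    backoff = min(cap, base * (2 ** attempt))  # 0.5, 1, 2, 4, ... capped at `cap`
    jitter = rng() * backoff                   # uniform in [0, backoff)
    floor = float(retry_after) if retry_after is not None else 0.0
    return floor + jitter


def call_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it stops returning 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts - 1:
            return status, body
        sleep(next_delay(attempt, headers.get("Retry-After")))
```

Jitter matters because many clients denied at the same moment would otherwise all retry at the same moment too.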

How does this work across many app servers?

Each server holding its own in-memory bucket would multiply the real limit by the number of servers. Put the bucket state in a shared store (the config sets storage: redis) and do the refill-and-decrement atomically, typically with a small Lua script, so all servers debit the same bucket. A plain INCR/EXPIRE pair is not enough here: it implements a fixed-window counter, not a token bucket.
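
The sketch below illustrates why the refill-and-decrement has to be a single atomic step. It is an in-process stand-in, not Redis code: the class name SharedBucketStore is hypothetical, threads play the role of app servers, and a lock plays the role that an atomic server-side script would play in a real shared store.

```python
import threading


class SharedBucketStore:
    """In-process stand-in for a shared store (assumption, not a Redis client)."""

    def __init__(self, capacity, refill_rate):
        self.capacity, self.refill_rate = capacity, refill_rate
        self._state = {}               # key -> (tokens, last_refill)
        self._lock = threading.Lock()

    def take(self, key, now, cost=1):
        with self._lock:               # read-modify-write must not interleave
            tokens, last = self._state.get(key, (self.capacity, now))
            tokens = min(self.capacity, tokens + (now - last) * self.refill_rate)
            allowed = tokens >= cost
            if allowed:
                tokens -= cost
            self._state[key] = (tokens, now)
            return allowed


# Three "app servers" (threads) debit the same bucket at the same instant.
store = SharedBucketStore(capacity=10, refill_rate=5)
results = []


def server():
    for _ in range(10):
        results.append(store.take("client-a", now=0.0))


threads = [threading.Thread(target=server) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results), "allowed of", len(results))  # 10 allowed of 30
```

With three independent in-memory buckets the same traffic would have seen all 30 requests admitted. In a real deployment it is also worth taking now from the store's own clock, so that clock skew between servers does not distort the refill.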

Token bucket vs. fixed-window counter — which do I pick?

A fixed-window counter is the cheapest to build but lets a caller send a full window's worth of requests at the very end of one window and again at the start of the next — a 2× burst at the seam. Token bucket smooths that out and gives you a clean burst/sustained split. Pick fixed-window only when approximate limiting is fine and simplicity wins.
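
The seam effect is easy to reproduce. In this small simulation, both limiters are configured for roughly "10 requests per second" and receive 10 requests just before a window boundary and 10 just after. The FixedWindow class and the token_bucket_allowed helper are illustrative names, not part of the limiter package.

```python
class FixedWindow:
    """Hypothetical fixed-window counter: `limit` requests per `window` seconds."""

    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.window_id, self.count = None, 0

    def allow(self, now):
        window_id = int(now // self.window)
        if window_id != self.window_id:   # new window: the counter resets
            self.window_id, self.count = window_id, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


def token_bucket_allowed(times, capacity, rate):
    """Count admitted requests for a token bucket, given request timestamps."""
    tokens, last, allowed = capacity, times[0], 0
    for now in times:
        tokens = min(capacity, tokens + (now - last) * rate)
        last = now
        if tokens >= 1:
            tokens -= 1
            allowed += 1
    return allowed


# 10 requests just before the 1-second boundary, 10 just after.
times = [0.99] * 10 + [1.01] * 10
fw = FixedWindow(limit=10, window=1.0)
fixed_allowed = sum(fw.allow(t) for t in times)
bucket_allowed = token_bucket_allowed(times, capacity=10, rate=10)
print(fixed_allowed, bucket_allowed)  # 20 10
```

The fixed window admits all 20 requests within 20 ms, while the token bucket admits only the 10 its reserve covers, because just 0.2 tokens have refilled by the time the second batch arrives.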