Scaling, Caching and Background Workers
"Scale" is a word that gets thrown around a lot, often before it's actually needed. For a FastAPI app, the truth is that one well-sized server can comfortably serve millions of requests a day - provided you've avoided the cliffs. This page is about the cliffs, and the few specific moves (caching, workers, sharding the right things) that buy real headroom.
The order of operations
Before you scale out, scale up and scale smart. The order matters because the wrong move at the wrong time is expensive in both money and complexity.
1. Make a slow endpoint fast: one bad query can fake a need to scale
2. Add a cache where it pays back: orders of magnitude, sometimes
3. Move slow work off the request: background workers
4. Vertical scaling (bigger box): the cheapest hardware fix
5. Horizontal scaling (more pods): when one is no longer enough
6. Database scaling: the last one, usually the most complex

Most teams reach for step 5 when step 1 would have done it. Profile first.
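As a rough illustration of "profile first", here is a minimal sketch that records how long each async handler takes and ranks handlers by total time spent. The names timed, timings and slowest are hypothetical helpers made up for this example; a real project would more likely lean on an APM tool or its metrics stack.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process store: handler name -> list of durations in seconds.
timings: dict = defaultdict(list)


def timed(name: str):
    """Record the wall-clock duration of every call to an async function."""

    def decorator(fn):
        @wraps(fn)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return await fn(*args, **kwargs)
            finally:
                # Recorded even when the handler raises.
                timings[name].append(time.perf_counter() - start)

        return wrapper

    return decorator


def slowest(n: int = 5) -> list:
    """Return the n handlers with the largest total time, biggest first."""
    totals = {name: sum(durations) for name, durations in timings.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]
```

Ranking by total time rather than by mean is deliberate: a moderately slow endpoint that is called constantly often costs more than a very slow one that is called once a day.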
Vertical scaling: the underrated answer
A single $40/month VPS with 4 cores and 8 GB of RAM can serve a lot of traffic if your code isn't doing anything silly. Before adding instances:
- Did you tune the worker count? (2 × cores) + 1 is the gunicorn rule of thumb for sync workers; for async uvicorn workers, cores is closer to right.
- Are workers actually doing async I/O, or sitting on blocking calls?
- Is your database the bottleneck? Adding app workers won't help if Postgres is saturated.
A $400/month machine with 32 cores and 64 GB of RAM is often a better investment than a Kubernetes cluster with the same total resources spread across pods. Less complexity, less network latency, fewer moving parts. The disadvantage - single point of failure - is mitigated by having a second one as warm standby.
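The worker-count rule of thumb from the checklist above can be written down as a small helper. This is only a sketch of a starting point: suggested_workers is a hypothetical name, and the right number for a given app still comes from load testing.

```python
import os
from typing import Optional


def suggested_workers(cores: Optional[int] = None, async_workers: bool = True) -> int:
    """Starting-point worker count for a given number of CPU cores.

    Async (uvicorn) workers: roughly one per core.
    Sync (gunicorn) workers: (2 x cores) + 1.
    """
    if cores is None:
        # os.cpu_count() can return None on some platforms; fall back to 1.
        cores = os.cpu_count() or 1
    return cores if async_workers else 2 * cores + 1


# On the 4-core VPS mentioned above:
print(suggested_workers(4))                       # async uvicorn workers
print(suggested_workers(4, async_workers=False))  # sync gunicorn workers
```

The result is the value you would pass to gunicorn's -w flag.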
Horizontal scaling: when one isn't enough
The moment a single instance isn't enough, you add more behind a load balancer. The hard part isn't running multiple FastAPI processes - that's trivial. The hard part is making sure your code is prepared for there to be more than one.
A short checklist of things that break when you add a second instance:
| Pattern | Why it breaks |
|---|---|
| In-memory caches | Each instance has its own; data inconsistent across them |
| In-memory rate limits | Each instance counts separately; limits become "per instance" |
| In-process background tasks | Lost on instance shutdown; not balanced across instances |
| Sticky session assumptions | Same user routed to different instances |
| WebSocket registries | Each instance only knows its own connections (covered in the WS section) |
| Local file uploads | Files exist on one instance; other instances can't find them |
| app.state for shared data | Not shared. At all. Just per-process state with a misleading name. |
The fix for all of these is the same: move shared state to a place that's outside any single process. Usually that means Redis, Postgres, or object storage.
```
┌─ instance 1 ─┐      ┌─ instance 2 ─┐
│   FastAPI    │      │   FastAPI    │
└──────┬───────┘      └──────┬───────┘
       │                     │
       └──────────┬──────────┘
                  ▼
     ┌──────────────────────────┐
     │ Redis (cache, rate       │
     │        limits, queue)    │
     │ Postgres (data)          │
     │ S3 (files)               │
     └──────────────────────────┘
```

Two instances become twenty instances with no code change, once the shared-state assumption is right.
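To make the shared-state idea concrete, here is a minimal sketch of a fixed-window rate limiter whose counter lives in a shared store instead of in process memory. It only needs incr and expire, which the redis.asyncio client provides. FixedWindowLimiter and InMemoryStore are hypothetical names; InMemoryStore is a stand-in so the example runs without a Redis server, and it is exactly the per-process state you would not use in production.

```python
import time
from typing import Callable


class FixedWindowLimiter:
    """Allow at most `limit` calls per client per time window."""

    def __init__(self, store, limit: int, window_seconds: int,
                 clock: Callable[[], float] = time.time):
        self.store = store  # e.g. a redis.asyncio.Redis client
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock

    async def allow(self, client_id: str) -> bool:
        # All instances compute the same key for the same client and window.
        window = int(self.clock() // self.window_seconds)
        key = f"rate:{client_id}:{window}"
        count = await self.store.incr(key)
        if count == 1:
            # First hit in this window: let the key expire on its own.
            # Note: incr + expire is not atomic; a crash in between leaves
            # a key without a TTL.
            await self.store.expire(key, self.window_seconds)
        return count <= self.limit


class InMemoryStore:
    """Stand-in for Redis in this example only (ignores expiry)."""

    def __init__(self):
        self.counts = {}

    async def incr(self, key: str) -> int:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    async def expire(self, key: str, seconds: int) -> bool:
        return True
```

Because the count lives in the store, two limiter objects pointing at the same store (two app instances, in practice) enforce one combined limit rather than one limit each.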
Caching: order-of-magnitude wins
A well-placed cache can turn a 200ms endpoint into 5ms. It's also the thing most likely to be wrong - caches are famously hard. A few patterns that hold up.
What to cache
| Good candidates | Bad candidates |
|---|---|
| Slow read-only lookups that don't change often | Personalized data with high read-after-write requirements |
| External API responses | Anything that must always be the very latest |
| Expensive computed values | Tiny, fast values where the cache itself isn't faster than the original |
| Public catalog data, config | Anything containing secrets unless the cache is access-controlled |
A simple Redis cache wrapper
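```python
import json
from functools import wraps
from typing import Any, Callable

import redis.asyncio as redis

cache = redis.from_url("redis://localhost:6379")


def cached(prefix: str, ttl: int):
    """Cache an async function's JSON-serializable result in Redis for ttl seconds."""

    def decorator(fn: Callable[..., Any]):
        @wraps(fn)
        async def wrapper(*args, **kwargs):
            # Build the key from the prefix and the call arguments. Joining the
            # parts avoids a trailing ":" and sorting kwargs keeps keys stable.
            parts = [prefix, *map(str, args)]
            parts += [f"{k}={v}" for k, v in sorted(kwargs.items())]
            key = ":".join(parts)

            hit = await cache.get(key)
            if hit is not None:
                return json.loads(hit)

            value = await fn(*args, **kwargs)
            await cache.set(key, json.dumps(value), ex=ttl)
            return value

        return wrapper

    return decorator


@cached("user-profile", ttl=300)
async def fetch_user_profile(user_id: int) -> dict:
    # an expensive query
    ...
```

Five-minute cache. The first request hits the DB; every request for the same user in the next five minutes gets the cached value. Called positionally, fetch_user_profile(42) is stored under the key user-profile:42; passing user_id as a keyword produces a different key, so pick one calling style and stick to it. A real measurement on a typical app: 90%+ cache hit rate, response time down by 10-50x for cached endpoints.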
Cache invalidation
The famous hard problem. Two strategies that work in practice:
- TTL-only. Set a short TTL (60s, 5min) and accept that data can be stale for that long. Simplest. Right answer for catalog data, profile pages, anything where "a few minutes behind" is fine.
- Explicit invalidation on write. When data changes, delete the cache key. Harder because you have to track which keys to invalidate, but necessary for things that must update immediately.
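```python
async def update_user_profile(user_id: int, payload: dict):
    db.update(...)  # write to the database first
    # then drop the cached copy so the next read repopulates it
    await cache.delete(f"user-profile:{user_id}")
```

Stick to one strategy per cache key, document it, and be honest about the trade-off.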
Cache the response, not just the data
For pure read endpoints, you can cache the whole HTTP response at the proxy layer. nginx supports this. Cloudflare does it well. A Cache-Control: public, max-age=60 header on a GET /products endpoint lets the CDN serve it from edge locations without touching your server at all. For high-traffic public endpoints, this is the biggest possible win.
```python
from fastapi import FastAPI, Response

app = FastAPI()


@app.get("/products")
def list_products():
    return Response(
        content=...,  # the serialized JSON body
        media_type="application/json",
        headers={"Cache-Control": "public, max-age=60"},
    )
```

Background workers: the architecture, deployed
The earlier section covered the application code for background tasks and Arq workers. In production, the worker is its own deployment. Same Docker image, different command:
```yaml
# docker-compose.prod.yml (excerpt)
services:
  api:
    image: yourorg/yourapp:${TAG}
    command: gunicorn app.main:app -k uvicorn.workers.UvicornWorker -w 4 -b 0.0.0.0:8000
  worker:
    image: yourorg/yourapp:${TAG}  # same image
    command: arq app.worker.WorkerSettings
    deploy:
      replicas: 2  # scale workers independently of web
  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
```

Notes worth flagging:
- Same image, different command. This is the right shape. One artifact, multiple roles.
- Workers scale separately from web. A burst of background work shouldn't slow down API requests, and vice versa.
- Redis is now the queue too, not just a cache. Using one Redis for both is fine for moderate workloads; for serious volume, run separate Redis instances for cache (volatile) and queue (durable).
Worker health and graceful shutdown
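The same shutdown discipline applies to workers as to the web app. When SIGTERM arrives, an in-flight job should be allowed to finish before the process exits. Don't assume the library does this for you: as documented, arq cancels ongoing jobs when the worker shuts down and runs them again once a worker is back, so jobs should be safe to run twice. Recent arq releases add a job_completion_wait worker setting that delays cancellation so in-flight jobs can finish; check the documentation for the version you run. Either way, set a sensible terminationGracePeriodSeconds (or stop_grace_period in compose) that covers your longest job.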
Worker health checks are different from web health checks. A worker doesn't serve HTTP, so there's no /health to probe. Some platforms support TCP-level checks; others expect you to write a tiny health endpoint into the worker itself (a separate FastAPI app on a different port). The simpler answer is to monitor the worker's output - if the job-completion rate drops to zero while jobs are pending, the worker is dead.
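The "completion rate is zero while jobs are pending" signal can be sketched as a small check over periodic samples of queue depth and a completed-jobs counter. QueueSample and worker_stalled are hypothetical names, and how you collect the samples (metrics scrape, Redis query, log parsing) is left open.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueueSample:
    pending: int          # jobs waiting in the queue at sample time
    completed_total: int  # monotonically increasing count of finished jobs


def worker_stalled(samples: List[QueueSample]) -> bool:
    """True when jobs were pending for the whole window but none completed.

    `samples` is ordered oldest to newest and covers the observation window.
    """
    if len(samples) < 2:
        return False  # not enough data to judge
    no_progress = samples[-1].completed_total == samples[0].completed_total
    always_pending = all(sample.pending > 0 for sample in samples)
    return no_progress and always_pending
```

An empty queue with no completions is treated as healthy idleness, not as a stall, which is why both conditions are required.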
Database scaling
The hardest one, kept short here because it deserves its own series.
The usual progression, in order of typical adoption:
1. Indexes: fixes 80% of "slow database" problems
2. Connection pooling: PgBouncer in front of Postgres
3. Read replicas: route read-only queries elsewhere
4. Caching at the app: covered above
5. Vertical scaling: a bigger database server
6. Sharding: last resort; major architectural change

Most apps never need step 6. Many never need step 3. Step 1 is the one that gets skipped most.
A specific Postgres tip: install pg_stat_statements and look at the top 20 queries by total time. The biggest cost is almost never where you'd guess.
```sql
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```

That query alone has saved more weekends than any specific scaling architecture.
A small monitoring kit
Once you have multiple components (web, worker, cache, queue), you need to know they're all healthy. The minimal kit:
| Layer | What to watch | Where |
|---|---|---|
| Web | Request rate, error rate, p95 latency | Prometheus + Grafana |
| Worker | Job rate, failure rate, queue depth | Same |
| Redis | Memory used, hit rate, evictions | Redis exporter for Prometheus |
| Database | Connections, slow queries, replication lag | DB-specific exporter |
| Process | CPU, memory, restart count | Platform-native (k8s, ECS, etc.) |
Alert on: error rate > 1% for 5 minutes, p95 latency > 1s for 5 minutes, queue depth growing for 10 minutes, any worker restart loop. Tune the numbers to your app's tolerance.
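Rules of the form "metric above X for N minutes" are normally written in your alerting tool's own rule language; the sketch below only illustrates the logic behind them. sustained_above is a hypothetical helper, and the sample data is made up for the example.

```python
from typing import List, Tuple


def sustained_above(samples: List[Tuple[float, float]],
                    threshold: float,
                    min_duration_s: float) -> bool:
    """True if the metric has stayed above `threshold` for at least
    `min_duration_s` seconds, counting back from the newest sample.

    `samples` is a list of (unix_timestamp, value), oldest to newest.
    """
    if not samples:
        return False
    newest_ts = samples[-1][0]
    streak_start = None
    # Walk backwards until the first sample at or below the threshold.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        streak_start = ts
    return streak_start is not None and newest_ts - streak_start >= min_duration_s


# Error rate sampled once a minute; alert on > 1% for 5 minutes.
error_rate = [(t * 60.0, 0.02) for t in range(6)]
print(sustained_above(error_rate, threshold=0.01, min_duration_s=300))
```

Requiring the condition to hold for a full window is what keeps a single bad minute from paging anyone.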
The "scaling is a problem you want to have" reminder
Almost everything in this page is the kind of thing that doesn't matter until it does. Don't optimize for a million users before you have a hundred. The patterns matter; you don't need to deploy all of them on day one.
A reasonable progression for a real project:
| Stage | What's deployed |
|---|---|
| Day 1 | One container, one VPS, SQLite or small Postgres |
| First real users | Move to managed Postgres, add nginx for TLS |
| First real load | Add Redis for cache, run multiple uvicorn workers |
| First background features | Add a worker process (same image, different command) |
| Multiple instances needed | Move in-memory state to Redis, add load balancer |
| Genuine scale | Read replicas, CDN, dedicated workers per queue |
You will know when each step is needed because something will hurt. Don't preemptively build for the next stage; understand what each one solves so you recognize the symptoms when they arrive.
Closing the deployment section
We started this section with a production-readiness checklist and ended with the patterns that keep an app fast under load. The thread through all six pages: a real production deployment isn't a single decision - it's a stack of small, boring, correct choices. Pick the right hosting tier. Use a real config system. Ship a clean Docker image. Put a proxy in front. Cache the slow reads. Move slow work to workers. Watch what's happening.
None of it is hard. All of it adds up. A FastAPI service that does these six things well will handle far more traffic than most teams expect, with far less drama than most teams fear.
