
We’ve been using Django Rest Framework (DRF) for 5 years now at Photoroom. We leverage it to sync users’ projects across clients (iOS, Android, Web), let team members collaborate (comments, reactions), and manage billing. In this article, we’ll share our learnings from scaling it to 25 million monthly active users and ~500 queries per second.

As shown in the image above, the setup is dead-simple and quite standard: a reverse proxy (Traefik) in front of DRF. Like all of our servers, the VM sits behind Cloudflare’s network to protect it against DDoS and to monitor for attacks (more on that later).
The VM boasts a whopping 60 cores, and it is starting to get quite busy, as you can see:

Screenshot of htop opened during peak traffic time
For background tasks, we leverage Celery, with three priority levels for our workers.
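The article doesn’t detail the queues themselves, but as a rough sketch, a three-priority Celery setup is usually wired with dedicated queues and routing rules. The app, task, and queue names below are hypothetical, not Photoroom’s actual configuration:

```python
# Minimal sketch of a three-queue Celery setup (queue and task names are
# hypothetical examples, not Photoroom's real configuration).
from celery import Celery

app = Celery("photoroom")  # hypothetical app name

# Route tasks to dedicated queues so urgent work is never stuck behind
# long-running, low-priority jobs.
app.conf.task_routes = {
    "billing.tasks.*": {"queue": "high"},
    "projects.tasks.*": {"queue": "default"},
    "analytics.tasks.*": {"queue": "low"},
}

# Each worker pool then consumes only from its own queue, e.g.:
#   celery -A photoroom worker -Q high    --concurrency=8
#   celery -A photoroom worker -Q default --concurrency=16
#   celery -A photoroom worker -Q low     --concurrency=4
```

Separating worker pools per queue keeps a burst of low-priority jobs from delaying time-sensitive work.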
If you’ve played with Django, you probably already know to leverage select_related() / prefetch_related() and carefully add indexes before deploying. But all the carefulness in the world won’t protect you against unexpectedly slow queries. Responses might be fast when the first requests start hitting a route, then get much slower as the table grows.
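For reference, the classic N+1 fix looks like the sketch below in a DRF viewset; the model and field names (Project, user, comments) are illustrative, not Photoroom’s schema:

```python
# Sketch of avoiding N+1 queries in a DRF viewset; Project, user and comments
# are hypothetical names used purely for illustration.
from rest_framework import viewsets

from .models import Project
from .serializers import ProjectSerializer


class ProjectViewSet(viewsets.ModelViewSet):
    serializer_class = ProjectSerializer

    def get_queryset(self):
        return (
            Project.objects
            .select_related("user")        # single JOIN for the foreign key
            .prefetch_related("comments")  # one extra query for the reverse relation
            .filter(user=self.request.user)
        )
```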
To prevent those issues, we always carefully monitor for any slowness using an APM (Application Performance Monitoring) tool, for instance Datadog or Dynatrace. Each request is traced, and we can analyze the slow paths. Slowness stems from many factors: slow database queries, unoptimized libraries, or external services that misbehave.
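With Datadog, for example, custom spans can be added around suspicious code paths so they show up in the request’s flame graph. This is a minimal sketch assuming the ddtrace library and a running Datadog agent; the span and function names are illustrative:

```python
# Minimal sketch of a custom Datadog span around a hot code path,
# assuming ddtrace is installed and the Datadog agent is running.
# The span name, resource, and function are illustrative.
from ddtrace import tracer


def export_project(project):
    # Everything inside this block appears as a child span in the request
    # trace, which makes slow sections easy to spot in the APM.
    with tracer.trace("project.export", resource=str(project.pk)):
        ...  # actual export logic goes here
```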
We have alerts on latency, error rate, and throughput. The first two are threshold-based, while the latter leverages anomaly detection. We also have a strict no-error policy on the Stripe webhooks and get alerted if there’s a single error, regardless of the reason.
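For illustration, a Stripe webhook endpoint in Django is typically wired like the sketch below, using Stripe’s official Python library. The setting name STRIPE_WEBHOOK_SECRET and the event handling are assumptions, not Photoroom’s actual code:

```python
# Sketch of a Stripe webhook endpoint in Django. The STRIPE_WEBHOOK_SECRET
# setting and the handled event type are illustrative assumptions.
import stripe
from django.conf import settings
from django.http import HttpResponse, HttpResponseBadRequest
from django.views.decorators.csrf import csrf_exempt


@csrf_exempt
def stripe_webhook(request):
    payload = request.body
    sig_header = request.META.get("HTTP_STRIPE_SIGNATURE", "")
    try:
        event = stripe.Webhook.construct_event(
            payload, sig_header, settings.STRIPE_WEBHOOK_SECRET
        )
    except (ValueError, stripe.error.SignatureVerificationError):
        # Under a no-error policy, any failure here should trigger an alert:
        # a dropped billing event is never acceptable.
        return HttpResponseBadRequest("invalid payload or signature")

    if event["type"] == "invoice.payment_failed":
        ...  # handle the event

    return HttpResponse(status=200)
```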
The practice of always analyzing and monitoring results in rather decent performance, with the large majority of requests completing in under 100 ms:
