
We’ve been using Django Rest Framework (DRF) for 5 years now at Photoroom. We leverage it to sync users’ projects across clients (iOS, Android, Web), let team members collaborate (comments, reactions), and manage billing. In this article, we’ll share our learnings from scaling it to 25 million monthly active users and ~500 queries per second.

As shown in the image above, the setup is dead-simple and quite standard: a reverse proxy (Traefik) in front of DRF. Like all of our servers, the VM sits behind Cloudflare’s network to protect it against DDoS and to monitor for attacks (more on that later).
The VM boasts a whopping 60 cores, and it is starting to get quite busy, as you can see:

Screenshot of htop opened during peak traffic time
For background tasks, we leverage Celery, with three priority levels for our workers.
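The article doesn’t detail the queues themselves, but as a rough sketch, a three-priority Celery setup is usually wired with dedicated queues and routing rules. The app, task, and queue names below are hypothetical, not Photoroom’s actual configuration:

```python
# Minimal sketch of a three-queue Celery setup (queue and task names are
# hypothetical examples, not Photoroom's real configuration).
from celery import Celery

app = Celery("photoroom")  # hypothetical app name

# Route tasks to dedicated queues so urgent work is never stuck behind
# long-running, low-priority jobs.
app.conf.task_routes = {
    "billing.tasks.*": {"queue": "high"},
    "projects.tasks.*": {"queue": "default"},
    "analytics.tasks.*": {"queue": "low"},
}

# Each worker pool then consumes only from its own queue, e.g.:
#   celery -A photoroom worker -Q high    --concurrency=8
#   celery -A photoroom worker -Q default --concurrency=16
#   celery -A photoroom worker -Q low     --concurrency=4
```

Separating worker pools per queue keeps a burst of low-priority jobs from delaying time-sensitive work.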
If you’ve played with Django, you probably already know to leverage select_related() / prefetch_related() and carefully add indexes before deploying. But all the carefulness in the world won’t protect you against unexpectedly slow queries. Responses might be fast when the first requests start hitting a route, then get much slower as the table grows.
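For reference, the classic N+1 fix looks like the sketch below in a DRF viewset; the model and field names (Project, user, comments) are illustrative, not Photoroom’s schema:

```python
# Sketch of avoiding N+1 queries in a DRF viewset; Project, user and comments
# are hypothetical names used purely for illustration.
from rest_framework import viewsets

from .models import Project
from .serializers import ProjectSerializer


class ProjectViewSet(viewsets.ModelViewSet):
    serializer_class = ProjectSerializer

    def get_queryset(self):
        return (
            Project.objects
            .select_related("user")        # single JOIN for the foreign key
            .prefetch_related("comments")  # one extra query for the reverse relation
            .filter(user=self.request.user)
        )
```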
To prevent those issues, we always carefully monitor for any slowness using an APM (Application Performance Monitoring) tool, for instance Datadog or Dynatrace. Each request is traced, and we can analyze the slow paths. Slowness stems from many factors: slow database queries, unoptimized libraries, or external services that misbehave.
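With Datadog, for example, custom spans can be added around suspicious code paths so they show up in the request’s flame graph. This is a minimal sketch assuming the ddtrace library and a running Datadog agent; the span and function names are illustrative:

```python
# Minimal sketch of a custom Datadog span around a hot code path,
# assuming ddtrace is installed and the Datadog agent is running.
# The span name, resource, and function are illustrative.
from ddtrace import tracer


def export_project(project):
    # Everything inside this block appears as a child span in the request
    # trace, which makes slow sections easy to spot in the APM.
    with tracer.trace("project.export", resource=str(project.pk)):
        ...  # actual export logic goes here
```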
We have alerts on latency, error rate, and throughput. The first two are threshold-based, while the latter leverages anomaly detection. We also have a strict no-error policy on the Stripe webhooks and get alerted if there’s a single error, regardless of the reason.
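For illustration, a Stripe webhook endpoint in Django is typically wired like the sketch below, using Stripe’s official Python library. The setting name STRIPE_WEBHOOK_SECRET and the event handling are assumptions, not Photoroom’s actual code:

```python
# Sketch of a Stripe webhook endpoint in Django. The STRIPE_WEBHOOK_SECRET
# setting and the handled event type are illustrative assumptions.
import stripe
from django.conf import settings
from django.http import HttpResponse, HttpResponseBadRequest
from django.views.decorators.csrf import csrf_exempt


@csrf_exempt
def stripe_webhook(request):
    payload = request.body
    sig_header = request.META.get("HTTP_STRIPE_SIGNATURE", "")
    try:
        event = stripe.Webhook.construct_event(
            payload, sig_header, settings.STRIPE_WEBHOOK_SECRET
        )
    except (ValueError, stripe.error.SignatureVerificationError):
        # Under a no-error policy, any failure here should trigger an alert:
        # a dropped billing event is never acceptable.
        return HttpResponseBadRequest("invalid payload or signature")

    if event["type"] == "invoice.payment_failed":
        ...  # handle the event

    return HttpResponse(status=200)
```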
The practice of always analyzing and monitoring results in rather decent performance, with the large majority of requests completing in under 100 ms:
