Building Reliable Microservices with Django | Sajjad Fani

Introduction

Microservices architecture is often presented as a silver bullet for scaling. In practice, it introduces significant operational complexity that teams must be prepared to manage. Over two years of building and maintaining a production microservices system at Medical Toxicology, I have learned what works and what does not.

This article shares concrete lessons from building a Django-based microservices platform that serves real medical clients.

Defining Service Boundaries

The most critical decision in microservices architecture is where to draw the boundaries. The wrong boundaries create distributed monoliths — all the operational complexity of microservices with none of the benefits.

Align Boundaries with Business Capabilities

My system has five services: Auth, CoreLogic, Management, CMS, and Support. Each boundary maps to a distinct business capability with its own team ownership and release cadence.

The key question when defining a boundary is: can this service be deployed independently without coordinating with other teams? If the answer is no, the boundary is wrong.

Auth Is Always a Cross-Cutting Concern

Authentication and identity are referenced by every service. They belong in a dedicated Auth Service that all others call — not duplicated across services or embedded in a "user service" with mixed responsibilities.

Async Task Management with Celery

Celery is the backbone of our async processing layer. After two years in production, here is what I have learned.

Use Celery Beat Carefully

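Celery Beat, Celery's scheduler for periodic tasks, has to run as a singleton, and that is easy to violate in containerized environments. If more than one Beat instance is running, each scheduled task is enqueued once per instance. Always run Beat as a single process and monitor it explicitly: when Beat stops, nothing raises an error, scheduled work simply stops being enqueued.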

In Docker Compose, we run Beat as a dedicated service, separate from the workers, using the database-backed scheduler from django-celery-beat:

celery-beat:
  command: celery -A app beat -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler
  restart: unless-stopped
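
The service definition above does not by itself stop a second Beat container from firing the same schedule. One possible guard is a lock keyed by task name and schedule slot, so that only the first claimant runs. The sketch below is illustrative only: InMemoryLockStore and run_once_per_slot are hypothetical names, and the in-memory store stands in for a shared store with an atomic add-if-absent operation (for example Redis or a shared cache backend).

```python
import time


class InMemoryLockStore:
    """Stand-in for a shared lock store (hypothetical, for illustration).

    A real deployment needs a store that every container can reach and
    that offers an atomic "set if absent" operation.
    """

    def __init__(self):
        self._expiry = {}  # lock key -> timestamp at which the lock expires

    def add(self, key, ttl_seconds, now=None):
        """Claim the key if it is absent or expired. Return True on success."""
        now = time.time() if now is None else now
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False  # someone else holds a live lock
        self._expiry[key] = now + ttl_seconds
        return True


def run_once_per_slot(store, task_name, slot, func, ttl_seconds=300):
    """Run func only if this schedule slot has not been claimed yet."""
    key = f"beat-lock:{task_name}:{slot}"
    if not store.add(key, ttl_seconds):
        return False  # another scheduler instance already fired this slot
    func()
    return True


# Two scheduler instances try to fire the same slot; only the first runs.
store = InMemoryLockStore()
calls = []
first = run_once_per_slot(store, "nightly_report", "slot-0001", lambda: calls.append("ran"))
second = run_once_per_slot(store, "nightly_report", "slot-0001", lambda: calls.append("ran"))
```

The TTL matters: it should outlive the window in which a duplicate could fire, but expire before the next legitimate slot.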

Task Idempotency

Every Celery task must be idempotent. Network failures trigger retries, and a retry can re-run a task whose first attempt already took effect. Design tasks so that executing them twice produces the same result as executing them once.
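
A minimal sketch of one way to get this property: record completed work under a key derived from the task's business identifier, and return the recorded result on a repeat run instead of redoing the side effect. Everything here is hypothetical (ProcessedStore, send_result_notification, the report_id parameter), and the in-memory store stands in for a durable one such as a database table with a unique constraint.

```python
class ProcessedStore:
    """Stand-in for a durable record of finished work (hypothetical)."""

    def __init__(self):
        self._results = {}

    def get(self, key):
        return self._results.get(key)

    def save(self, key, result):
        self._results[key] = result


def send_result_notification(store, outbox, report_id):
    """Task body: notify once per report, however many times it is retried."""
    key = f"notify:{report_id}"
    previous = store.get(key)
    if previous is not None:
        return previous  # already done; skip the side effect

    outbox.append(f"report {report_id} ready")  # the side effect
    result = {"report_id": report_id, "status": "sent"}
    store.save(key, result)
    return result


# Simulate a retry: the task runs twice with the same arguments.
store = ProcessedStore()
outbox = []
first = send_result_notification(store, outbox, 42)
second = send_result_notification(store, outbox, 42)
```

In this sketch a crash between the side effect and the save would still allow a duplicate. A production version would need to record the key and perform the side effect atomically, or make the side effect itself safe to repeat.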

Shared Authentication State

All five services need to verify user identity. Our solution: a centralized Auth Service that issues JWTs, with each downstream service validating tokens against a shared public key. No service touches authentication state directly.

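The trade-off is that an Auth outage degrades every service: tokens that were already issued keep validating, but once they expire nobody can obtain new ones. We mitigate this with: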

High availability for the Auth Service (redundant replicas)
JWT expiry windows that tolerate short Auth outages
Graceful degradation for non-auth-critical operations
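
To illustrate the expiry-window point, here is a sketch of the local expiry check a downstream service might apply. It assumes the token's signature has already been verified against the shared public key by a JWT library, so the function only receives the decoded claims. The function name and the 30-second leeway are assumptions, not values from our system.

```python
import time


def token_is_current(claims, now=None, leeway_seconds=30):
    """Check the exp claim of an already signature-verified JWT payload.

    No call to the Auth Service is needed, so tokens issued before an
    outage remain usable until they expire (plus a small leeway for
    clock skew between services).
    """
    now = time.time() if now is None else now
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)):
        return False  # reject tokens without a usable expiry
    return now <= exp + leeway_seconds


# Illustrative claims; timestamps are plain numbers to keep the example fixed.
claims = {"sub": "user-1", "exp": 1000}
before_expiry = token_is_current(claims, now=900)
within_leeway = token_is_current(claims, now=1020)
after_leeway = token_is_current(claims, now=1031)
```

A longer token lifetime tolerates longer Auth outages, at the cost of a longer window in which a revoked or stolen token still validates.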

What I Would Do Differently

If I were starting from scratch, I would:

Start with a monolith. Extract services only when a genuine boundary emerges, not upfront.
Invest in distributed tracing earlier. Debugging across services without tracing is extremely painful.
Define service contracts with OpenAPI from day one. Undocumented internal APIs accumulate unexpected callers.
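
As a starting point for the tracing item, the sketch below shows correlation ID propagation, which is simpler than full distributed tracing but already lets you follow one request through the logs of several services. The header name X-Request-ID and both function names are assumptions chosen for illustration.

```python
import contextvars
import uuid

TRACE_HEADER = "X-Request-ID"  # assumed header name, not from our system

# Holds the ID of the request currently being handled in this context.
_trace_id = contextvars.ContextVar("trace_id", default=None)


def start_request(incoming_headers):
    """Reuse the caller's trace ID if present, otherwise mint a new one."""
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def outgoing_headers(extra=None):
    """Build headers for a downstream call, carrying the current trace ID."""
    headers = dict(extra or {})
    trace_id = _trace_id.get()
    if trace_id is not None:
        headers[TRACE_HEADER] = trace_id
    return headers


# A service receives a request with an ID and forwards it downstream.
received = start_request({"X-Request-ID": "abc123"})
forwarded = outgoing_headers({"Accept": "application/json"})
```

Each service would call start_request at the edge (for example in middleware), include the ID in every log line, and use outgoing_headers for calls to other services.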

Conclusion

Microservices are a tool, not a goal. The Medical Toxicology platform benefits from them because the team structure, release cadences, and business domains genuinely align with service boundaries. Before adopting microservices, ask whether your organization's structure supports them.

The technical challenges — async tasks, shared auth, service contracts — are solvable. The organizational challenges — clear ownership, independent deployability, operational maturity — are harder.