Building Reliable Microservices with Django
Lessons learned from architecting a production microservices system for a medical AI platform — covering service boundaries, async task management, and operational reliability.
Introduction
Microservices architecture is often presented as a silver bullet for scaling. In practice, it introduces significant operational complexity that teams must be prepared to manage. Over two years of building and maintaining a production microservices system at Medical Toxicology, I have learned what works and what does not.
This article shares concrete lessons from building a Django-based microservices platform that serves real medical clients.
Defining Service Boundaries
The most critical decision in microservices architecture is where to draw the boundaries. The wrong boundaries create distributed monoliths — all the operational complexity of microservices with none of the benefits.
Align Boundaries with Business Capabilities
My system has five services: Auth, CoreLogic, Management, CMS, and Support. Each boundary maps to a distinct business capability with its own team ownership and release cadence.
The key question when defining a boundary is: can this service be deployed independently without coordinating with other teams? If the answer is no, the boundary is wrong.
Auth is Always a Cross-Cutting Concern
Authentication and identity are referenced by every service. They belong in a dedicated Auth Service that all others call — not duplicated across services or embedded in a "user service" with mixed responsibilities.
Async Task Management with Celery
Celery is the backbone of our async processing layer. After two years in production, here is what I have learned.
Use Celery Beat Carefully
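Celery Beat, the scheduler for periodic tasks, has to run as a singleton, and that is easy to violate in containerized environments where services get scaled or restarted in parallel. If more than one Beat instance is running, every scheduled task fires once per instance. Always run Beat as a single process and monitor it explicitly.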
In Docker Compose, we run Beat as its own dedicated service, separate from the worker service, so that scaling workers never starts a second scheduler:
    celery-beat:
      command: celery -A app beat -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler
      restart: unless-stopped
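As an extra safety net against a second Beat instance sneaking in, a scheduled task can claim a short-lived lock before doing any work. The sketch below is illustrative only and is not taken from our codebase: `InMemoryLockStore` and `run_once_per_window` are hypothetical names, and the in-memory store merely stands in for a shared store with an atomic add-if-absent operation (for example Django's `cache.add` on a shared cache backend, or Redis `SET` with `NX`). An in-process dictionary gives no protection across containers.

```python
import time
from typing import Callable, Dict, Optional


class InMemoryLockStore:
    """Hypothetical stand-in for a shared store with atomic add-if-absent."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._expires_at: Dict[str, float] = {}
        self._clock = clock

    def add(self, key: str, ttl_seconds: float) -> bool:
        """Claim the key if it is absent or expired; return True if claimed."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # someone else holds the lock for this window
        self._expires_at[key] = now + ttl_seconds
        return True


def run_once_per_window(
    store: InMemoryLockStore,
    key: str,
    ttl_seconds: float,
    task: Callable[[], str],
) -> Optional[str]:
    """Run the task only if no other caller claimed the key within the TTL."""
    if not store.add(key, ttl_seconds):
        return None  # duplicate firing: skip the work entirely
    return task()


if __name__ == "__main__":
    store = InMemoryLockStore()
    # Two scheduler instances firing the same periodic task in the same window.
    print(run_once_per_window(store, "nightly-report", 300, lambda: "sent"))
    print(run_once_per_window(store, "nightly-report", 300, lambda: "sent"))
```

The TTL should be shorter than the task's schedule interval and longer than the worst clock drift between instances, otherwise the lock either blocks the next legitimate run or fails to catch the duplicate.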
Task Idempotency
Every Celery task must be idempotent. Network failures cause retries, and retries cause duplicate task execution. Design tasks so that executing them twice produces the same result as executing them once.
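One common way to get there is to key each unit of work on a stable identifier and record the result the first time it completes. The following is a minimal sketch under assumptions: `ResultStore` and `process_case` are hypothetical names, and the dictionary stands in for a database table with a unique constraint on the key. In a real Celery task the same logic would live inside the task function, with the uniqueness enforced by the database rather than by a Python dict.

```python
from typing import Dict


class ResultStore:
    """Hypothetical stand-in for a table with a unique constraint on case_id."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}
        self.notifications_sent = 0


def process_case(store: ResultStore, case_id: str, payload: dict) -> dict:
    """Idempotent task body: a repeated delivery returns the stored result."""
    existing = store.rows.get(case_id)
    if existing is not None:
        # Retry or duplicate delivery: return what we already did, do nothing new.
        return existing

    result = {"case_id": case_id, "status": "processed", "fields": len(payload)}
    store.rows[case_id] = result
    store.notifications_sent += 1  # the side effect happens once per case_id
    return result


if __name__ == "__main__":
    store = ResultStore()
    first = process_case(store, "case-42", {"substance": "x", "dose": "y"})
    second = process_case(store, "case-42", {"substance": "x", "dose": "y"})
    print(first == second, store.notifications_sent)
```

The check-then-write shown here is only safe in a single process. With concurrent workers, the insert itself should be the check (insert and catch the unique-constraint violation) so that two workers cannot both pass the lookup.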
Shared Authentication State
All five services need to verify user identity. Our solution: a centralized Auth Service that issues JWTs, with each downstream service validating tokens against a shared public key. No service touches authentication state directly.
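This means Auth outages degrade all services: tokens that are already issued keep validating locally, but nothing new can be issued once they expire. We mitigate this with: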
- High availability for the Auth Service (redundant replicas)
- JWT expiry windows that tolerate short Auth outages
- Graceful degradation for non-auth-critical operations
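To make the expiry window concrete, here is a rough sketch of the local check a downstream service might run on a token's claims. It assumes the signature has already been verified against the shared public key by a JWT library, which is not shown. `check_claims`, `TokenRejected`, and the 30-second leeway are hypothetical choices for illustration; `exp` and `sub` are the standard registered JWT claim names.

```python
import time
from typing import Mapping, Optional


class TokenRejected(Exception):
    """Raised when already-verified claims fail the local policy check."""


def check_claims(
    claims: Mapping[str, object],
    *,
    now: Optional[float] = None,
    leeway_seconds: float = 30.0,
) -> str:
    """Return the subject if the claims are still usable, otherwise raise.

    Assumes the token signature was verified before this function is called.
    """
    current = time.time() if now is None else now
    exp = claims.get("exp")
    sub = claims.get("sub")

    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(exp, bool) or not isinstance(exp, (int, float)):
        raise TokenRejected("missing or invalid 'exp' claim")
    if not isinstance(sub, str) or not sub:
        raise TokenRejected("missing 'sub' claim")
    # The leeway absorbs small clock differences between services.
    if current > exp + leeway_seconds:
        raise TokenRejected("token expired")
    return sub


if __name__ == "__main__":
    claims = {"sub": "user-1", "exp": time.time() + 900}
    print(check_claims(claims))
```

Because this check needs no network call, it keeps working while Auth is down. The token lifetime then sets how long an outage can last before users are locked out, which has to be weighed against how long a revoked token stays usable.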
What I Would Do Differently
If I were starting from scratch, I would:
- Start with a monolith. Extract services only when a genuine boundary emerges, not upfront.
- Invest in distributed tracing earlier. Debugging across services without tracing is extremely painful.
- Define service contracts with OpenAPI from day one. Undocumented internal APIs accumulate unexpected callers.
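On the tracing point, the smallest useful building block is a request ID that every service adopts from its caller and forwards on its own outgoing calls. The sketch below shows only that propagation step and is far less than a full tracing system. The header name `X-Request-ID` and the function names `begin_request` and `outgoing_headers` are assumptions for illustration, not something the platform described here is known to use.

```python
import contextvars
import uuid
from typing import Dict, Mapping

TRACE_HEADER = "X-Request-ID"  # assumed header name, chosen for illustration

# Holds the current request's ID for whatever code runs in this context.
_trace_id = contextvars.ContextVar("trace_id", default="")


def begin_request(headers: Mapping[str, str]) -> str:
    """Adopt the caller's ID, or mint a new one at the edge of the system."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def outgoing_headers() -> Dict[str, str]:
    """Headers to attach to any call this service makes to another service."""
    return {TRACE_HEADER: _trace_id.get()}


if __name__ == "__main__":
    # The first service in the chain receives no ID and mints one.
    begin_request({})
    forwarded = outgoing_headers()
    # A downstream service adopts the same ID from the incoming headers.
    print(begin_request(forwarded) == forwarded[TRACE_HEADER])
```

In a Django service, `begin_request` would typically be called from middleware and the ID added to every log record, so that a single search across all five services' logs reconstructs one request's path. Note that the plain mapping lookup here is case-sensitive, whereas real HTTP header access is not.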
Conclusion
Microservices are a tool, not a goal. The Medical Toxicology platform benefits from them because the team structure, release cadences, and business domains genuinely align with service boundaries. Before adopting microservices, ask whether your organization's structure supports them.
The technical challenges — async tasks, shared auth, service contracts — are solvable. The organizational challenges — clear ownership, independent deployability, operational maturity — are harder.