How to actually see, trust, and operate your Python systems in production - with Grafana, Loki, Promtail, Traefik, Docker Compose, and real-world tradeoffs

Who This Article Is For
This is written for CEOs, CTOs, and engineering leads running real businesses - not DevRel demos. If you have a Django/Python backend, you deploy with Docker Compose, and your team is small-to-mid-sized, this article is for you.
This is explicitly NOT for FAANG-scale infrastructure, teams with dedicated SRE orgs, or people who want to try Kubernetes because it sounds impressive. Observability is about reducing business risk. Full stop.
1. What "Observability" Actually Means
Observability is one of the most abused words in the industry. Let's cut through it.
Observability = Can I answer "what is broken, why, and how bad is it" in under 5 minutes?
Break it down into four simple primitives:
- Logs → What happened
- Metrics → How bad / how often
- Errors → What users are feeling
- Alerts → When a human must wake up

As a CEO or CTO, the real cost of downtime is never just the server bill. It's lost customer trust, missed revenue, and engineer burnout from firefighting instead of building.
2. The Non-Negotiables of a Robust Django Backend
2.1 Deterministic Deployments
Same code + same config must equal same behavior. "It works on my machine" is a cultural and engineering failure. Docker Compose enforces determinism at the service boundary level, making it the right tool for most teams.
2.2 Visibility Over Cleverness
Prefer boring tools that work at 2 AM. Avoid infrastructure that only one developer fully understands. If your most senior engineer gets hit by a bus, can the rest of the team keep the lights on?
2.3 Human-Readable Failure
When something breaks, the logs and errors need to be readable by the on-call engineer, the CTO, and occasionally even the CEO. If your stack requires a PhD to interpret a failure, it's the wrong stack.
3. Your Logging Stack: Promtail + Loki + Grafana

Most teams start with print() statements or ad-hoc logging. This fails at scale. The modern, lightweight answer for Docker Compose environments is the PLG stack: Promtail, Loki, and Grafana.
3.1 What Each Tool Does
Promtail - The Log Shipper
Promtail runs as a sidecar or Docker container and tails your application's log files or Docker log streams. It ships log lines to Loki with labels attached, like service name, environment, and container ID. Think of it as your log collector that runs silently in the background.
# promtail-config.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: django
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        target_label: container
Loki - The Log Aggregation Engine
Loki is Grafana Labs' answer to Elasticsearch - but designed specifically for logs, not full-text search. It indexes only the labels (metadata), not the full log content, making it dramatically cheaper and faster for small to mid-sized teams.
You do not need to run Elasticsearch, Logstash, or Kibana (the ELK stack). Loki does the same job for a fraction of the operational cost and complexity.
Key insight: Loki is to logs what Prometheus is to metrics. It speaks the same query language family (LogQL vs PromQL) and integrates natively into Grafana.
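To make the LogQL point concrete, here is a sketch of querying Loki's `query_range` HTTP API from Python. The base URL and the `container` label are assumptions that match the Promtail config shown earlier; adjust them to whatever labels your setup attaches.

```python
from urllib.parse import urlencode

def loki_error_query_url(base_url: str, container: str = "django") -> str:
    """Build a Loki query_range URL for recent log lines containing ERROR.

    The LogQL below selects streams by the "container" label and then
    filters lines with the |= operator.
    """
    logql = f'{{container="{container}"}} |= "ERROR"'
    params = urlencode({"query": logql, "limit": 100})
    return f"{base_url}/loki/api/v1/query_range?{params}"

# In a real script you would fetch this with requests.get(...):
# loki_error_query_url("http://localhost:3100")
```

The same query pasted into Grafana's Explore view works unchanged, which is exactly the "single query language family" benefit.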
Grafana - The Visualization Layer
Grafana is your single pane of glass. It connects to Loki for logs, Prometheus for metrics, and can even display Sentry error counts - all in one dashboard. For Django teams, this means you finally have one place to look when something goes wrong.
Example Grafana setup in Docker Compose:
services:
  loki:
    image: grafana/loki:2.9.0
    ports: ["3100:3100"]
    volumes:
      - ./loki-config.yaml:/etc/loki/config.yaml
  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./promtail-config.yaml:/etc/promtail/config.yaml
  grafana:
    image: grafana/grafana:10.0.0
    ports: ["3000:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=yourpassword
3.2 What to Log in Django
Use structured JSON logging. Random print() calls and unformatted strings become noise at scale. Every log entry should include at minimum:
- timestamp - ISO 8601 format
- request_id / correlation_id - for tracing a request across services
- service name - which Django service or worker emitted this
- environment - staging vs production
- severity - DEBUG, INFO, WARNING, ERROR, CRITICAL

Never log secrets, tokens, API keys, or PII without masking. This is both a security issue and a compliance issue.
3.3 Structured Logging Setup in Django
# settings.py
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "json": {
            "()": "pythonjsonlogger.jsonlogger.JsonFormatter",
            "format": "%(asctime)s %(name)s %(levelname)s %(message)s",
        }
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "json",
        }
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}
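The config above does not yet emit a request_id. One common pattern is a logging filter backed by a contextvar that middleware sets per request; the middleware function and variable names below are our own sketch, not a Django built-in.

```python
import logging
import uuid
from contextvars import ContextVar

# Holds the current request's correlation ID for the duration of the request.
request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIDFilter(logging.Filter):
    """Inject request_id into every record so the JSON formatter can emit it."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

# Django middleware sketch (register it in MIDDLEWARE):
# def request_id_middleware(get_response):
#     def middleware(request):
#         request_id_var.set(request.headers.get("X-Request-ID", uuid.uuid4().hex))
#         return get_response(request)
#     return middleware
```

Then add the filter to the console handler and include `%(request_id)s` in the formatter's format string, and every log line becomes traceable to one request.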
4. Reverse Proxying: Traefik vs Nginx

When you run multiple services in Docker Compose - a Django web app, a Celery worker dashboard, Grafana, etc. - you need a reverse proxy to route traffic and handle TLS. The two main options are Nginx and Traefik.
4.1 Nginx - The Reliable Veteran
Nginx has been the standard reverse proxy for over a decade. It's battle-tested, well-documented, and every developer has seen it before. For Django, it typically sits in front of Gunicorn and handles static files, SSL termination, and rate limiting.
The limitation: Nginx configuration is static. Every time you add a new service or change a port, you need to manually update the nginx.conf and reload. For small, stable setups this is fine.
# nginx.conf example for Django + Gunicorn
server {
    listen 80;
    server_name yourdomain.com;

    location /static/ {
        alias /app/static/;
    }

    location / {
        proxy_pass http://web:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
4.2 Traefik - The Container-Native Proxy
Traefik is designed specifically for dynamic containerized environments. Instead of a static config file, Traefik reads Docker labels from your containers and automatically configures routing. Add a new service, label it, and Traefik routes to it - no config reload required.
For teams using Docker Compose, Traefik offers three major advantages over Nginx:
- Automatic TLS via Let's Encrypt - zero manual certificate management
- Docker-native service discovery - routes update automatically as containers start and stop
- Built-in dashboard - a simple UI to see all active routes and health checks
# docker-compose.yml with Traefik
services:
  traefik:
    image: traefik:v2.10
    command:
      - "--providers.docker=true"
      - "--entrypoints.websecure.address=:443"
      # ACME needs a challenge type; TLS-ALPN is the simplest for a single host
      - "--certificatesresolvers.le.acme.tlschallenge=true"
      - "--certificatesresolvers.le.acme.email=you@company.com"
      - "--certificatesresolvers.le.acme.storage=/certs/acme.json"
    ports: ["80:80", "443:443"]
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./certs:/certs
  web:
    image: yourdjangoapp:latest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.web.rule=Host(`yourdomain.com`)"
      - "traefik.http.routers.web.entrypoints=websecure"
      - "traefik.http.routers.web.tls.certresolver=le"
4.3 Which Should You Choose?
Use Nginx if your service topology is stable and your team already knows it. The configuration is explicit and easy to audit.
Use Traefik if you run multiple services that change frequently, want zero-touch TLS certificate management, or want a dashboard showing your live routing configuration. For teams adopting a PLG stack (Promtail + Loki + Grafana), Traefik fits more naturally because the whole stack leans into container-native tooling.
Our recommendation: Traefik for new setups running Docker Compose with multiple services. Nginx for simple single-service Django apps where you want maximum control and familiarity.
5. Error Tracking That Engineers Actually Check
5.1 Why Silent Failures Kill Businesses
Errors that don't crash your server still kill revenue. A user who hits a broken checkout flow, a payment that fails silently, a background job that stops processing - none of these necessarily trigger an alert in a naive setup. "Users complained" is not observability.
5.2 Sentry or GlitchTip
Sentry is the industry standard for error tracking. It captures full stack traces, request context, user impact counts, and environment tags (staging vs production). GlitchTip is the open-source self-hosted alternative with a compatible API, making it a viable option for teams with data residency requirements.
For Django, setup is a few lines:
# pip install sentry-sdk
import sentry_sdk

sentry_sdk.init(
    dsn="https://your-dsn@sentry.io/project",
    environment="production",
    traces_sample_rate=0.1,
)
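Sentry's `before_send` hook lets you scrub sensitive data before events ever leave your servers, which pairs with the "never log secrets or PII" rule above. A minimal sketch; the field names are illustrative and real event payloads vary:

```python
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "cookie"}

def scrub_event(event, hint=None):
    """Mask sensitive request data before the event is sent to Sentry/GlitchTip."""
    request = event.get("request", {})
    for section in ("headers", "data"):
        payload = request.get(section)
        if isinstance(payload, dict):
            for key in payload:
                if key.lower() in SENSITIVE_KEYS:
                    payload[key] = "[Filtered]"
    return event

# Wire it up in sentry_sdk.init:
# sentry_sdk.init(dsn=..., environment=..., before_send=scrub_event)
```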
5.3 Alert Fatigue is Worse Than No Alerts
Most teams disable alerts within weeks because they fire too often. The solution is tuning, not silence. Set error rate thresholds rather than individual error counts. Alert on new errors that haven't been seen before. Build in regression detection so old bugs that re-emerge get flagged.
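To make "rate, not count" concrete, here is a sketch of a sliding-window error-rate check; the 5% threshold and 5-minute window are illustrative defaults, not recommendations for your traffic profile.

```python
from collections import deque
import time

class ErrorRateAlert:
    """Fire only when the error *rate* over a window exceeds a threshold,
    instead of paging on every individual error."""

    def __init__(self, threshold: float = 0.05, window_seconds: int = 300):
        self.threshold = threshold
        self.window = window_seconds
        self.events = deque()  # (timestamp, is_error) pairs

    def record(self, is_error: bool, now=None) -> bool:
        """Record one request outcome; return True if an alert should fire."""
        now = time.time() if now is None else now
        self.events.append((now, is_error))
        # Drop events that have aged out of the window
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()
        errors = sum(1 for _, e in self.events if e)
        return errors / len(self.events) > self.threshold
```

The same shape works for "alert on new errors": key a dict by error fingerprint and fire only on first sight.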
6. Metrics That Matter
6.1 The Only Metrics That Actually Count
Vanity dashboards are a morale boost, not a business tool. The metrics that matter to a CEO or CTO are simple:
- Error rate - percentage of requests returning 5xx
- Response time - p50, p95, p99 latency
- Uptime - are users able to reach the service
- Queue backlog - are background jobs keeping up
- Failed background jobs - Celery or RQ task failure rate
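If you are not running Prometheus yet, even a rough percentile computation over recent request timings beats guessing. A stdlib-only sketch:

```python
import statistics

def latency_percentiles(samples_ms: list) -> dict:
    """p50/p95/p99 from a list of request durations in milliseconds.

    statistics.quantiles with n=100 returns the 1st..99th percentile
    cut points, so we index the 50th, 95th, and 99th.
    """
    q = statistics.quantiles(samples_ms, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}
```

p95 and p99 matter because averages hide the slow tail; a healthy p50 with a terrible p99 means a meaningful slice of users is suffering.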
6.2 Django + Celery Visibility
One of the most common blind spots for Django teams is background worker health. A growing Celery queue is a red alert that often goes unnoticed until a customer complains. Add Flower (the Celery monitoring dashboard) behind Traefik or Nginx, and route it to Grafana via a Prometheus exporter for metrics.
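With Celery's default Redis broker settings, each queue is a Redis list keyed by the queue name, so queue depth is one LLEN call. A sketch with the client injected so it is testable; in production you would pass a real `redis.Redis(...)` from redis-py:

```python
def celery_queue_depth(redis_client, queue: str = "celery") -> int:
    """Number of tasks waiting in a Celery queue backed by Redis.

    Assumes Celery's default Redis broker layout, where the pending
    queue is a Redis list named after the queue ("celery" by default).
    """
    return redis_client.llen(queue)

# Production usage (assumes redis-py is installed):
# import redis
# depth = celery_queue_depth(redis.Redis(host="redis", port=6379))
```

Sample this every minute and alert when it trends upward; a queue that only ever grows is the clearest "workers are falling behind" signal you will get.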
7. Alerts: Waking Humans Only When It Matters

7.1 Alert Channels by Severity
Not every problem deserves to wake someone up at 3 AM. Build a tiered alert system:
- Low severity → logs only, review in the morning
- Medium severity → Slack notification to the engineering channel
- High severity → Slack + email to the on-call engineer
- Critical → Slack + email + phone call (Twilio or PagerDuty)
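That tiering can live in a tiny dispatch table. The channel names below are placeholders to be wired to your actual Slack, email, and paging senders:

```python
from enum import IntEnum

class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

# Placeholder channel names; map each to a real sender (Slack webhook,
# SMTP, PagerDuty/Twilio API call) in your alerting code.
ROUTES = {
    Severity.LOW: ["log"],
    Severity.MEDIUM: ["log", "slack"],
    Severity.HIGH: ["log", "slack", "email"],
    Severity.CRITICAL: ["log", "slack", "email", "pager"],
}

def channels_for(severity: Severity) -> list:
    """Which channels an alert of this severity should fan out to."""
    return ROUTES[severity]
```

Keeping the routing in one table makes the escalation policy auditable in a single glance, which matters when you tune it after the first noisy week.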
7.2 Every Alert Must Answer Three Questions
If an alert doesn't answer all three of the following, it shouldn't be an alert:
- What broke?
- How bad is it?
- What should I do right now?

If the answer to "what should I do right now" is "nothing", that's not an alert - it's a log entry.
8. Docker Compose: Enough for Most Companies
8.1 What Docker Compose Does Well
Predictable environments, easy developer onboarding, clear service boundaries, and minimal operational overhead. For a team running a Django web service, one or two Celery workers, a scheduler, Redis, and PostgreSQL - Docker Compose handles this elegantly.
8.2 Production-Grade Compose Structure
A well-structured production Compose file separates concerns cleanly:
services:
  web:
    image: yourapp:latest
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health/"]
      interval: 30s
      retries: 3
  worker:
    image: yourapp:latest
    command: celery -A yourapp worker -l info
    restart: always
  scheduler:
    image: yourapp:latest
    command: celery -A yourapp beat -l info
    restart: always

  # Tip: it's preferable to run the monitoring stack from a separate compose file.
  loki:
    image: grafana/loki:2.9.0
  promtail:
    image: grafana/promtail:2.9.0
  grafana:
    image: grafana/grafana:10.0.0
  traefik:
    image: traefik:v2.10
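The healthcheck above curls `/health/`, which your Django app must actually serve. A minimal sketch of such an endpoint, with the aggregation logic kept framework-free so it is easy to test; the check names and Django wiring are illustrative:

```python
def health_status(checks: dict) -> tuple:
    """Run named check callables; return (200, results) if all pass, else (503, results)."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results

# Django wiring (sketch):
# from django.http import JsonResponse
# from django.db import connection
#
# def health(request):
#     status, results = health_status({
#         "database": lambda: connection.cursor().execute("SELECT 1"),
#     })
#     return JsonResponse(results, status=status)
```

Returning 503 on any failed check is what makes the compose `healthcheck` (and Traefik, if you enable its health checks) stop routing traffic to a broken container.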
9. Why Most Companies Do NOT Need Kubernetes
This is the section DevRel engineers don't want you to read.
9.1 Kubernetes Solves Organizational Problems, Not Code Problems
Kubernetes was designed for large teams, multiple deployment pipelines, and dedicated infrastructure roles. If you have fewer than 15 engineers and no dedicated SRE, Kubernetes will cost you more than it saves - in hiring, cognitive load, and slower delivery velocity.
9.2 The Hidden Costs CEOs Never See
- Hiring cost - SREs who can operate Kubernetes are expensive and scarce
- Cognitive load - every developer on the team now needs to understand Kubernetes concepts to debug production issues
- Debugging complexity - simple issues become multi-layer detective work
- Slower delivery - more infra to manage means less time shipping features
9.3 Resume-Driven Architecture
Developer motivation and business outcomes are not always aligned. Kubernetes is exciting to work on and looks great on a CV. Docker Compose is boring and doesn't. Boring infrastructure that works at 2 AM is what your business actually needs.
9.4 When Kubernetes Actually Makes Sense
- Multiple teams deploying independently to the same infrastructure
- High and unpredictable traffic volatility requiring auto-scaling
- Compliance constraints requiring fine-grained workload isolation
- Platform-level companies whose product IS the infrastructure
10. Observability as Business Insurance
You don't buy fire insurance hoping your office burns down. You buy it because the cost of not having it is catastrophic when things go wrong - and things always go wrong.
A good observability stack reduces Mean Time To Recovery (MTTR), eliminates the panic that compounds outages, ends hero culture (where only one person knows how to fix things), and lets engineers sleep at night.
Calm systems build calm teams. Calm teams build better products.
11. Final Checklist: Is My Django Backend Actually Observable?
- Structured JSON logs with request_id, service name, and environment
- Centralized log access via Loki + Grafana (not SSH-ing into servers)
- Error tracking with Sentry or GlitchTip, properly tuned alerts
- Background job visibility - Celery/RQ queue depth and failure rate
- Real alerts that answer What/How Bad/What To Do - not noise
- Predictable deployments via Docker Compose with health checks
- Reverse proxy (Traefik or Nginx) handling TLS and routing
- No infrastructure that only one person fully understands
12. A Note on Fractional CTO Work
If your team has Django services in production and you're not 100% sure what happens when things go wrong - how long it takes to detect, who gets notified, and how quickly you recover - that uncertainty is costing you money.
This is exactly where fractional CTO work pays for itself. Not in writing code, but in making sure the system you've built is actually observable, resilient, and owned by the whole team rather than one heroic individual.
Good infrastructure is invisible. You only notice it when it's missing.
