Observability is the ability to understand a system's internal state from the signals it produces. It goes beyond monitoring: monitoring tells you that an issue happened, while observability helps explain why it happened.
The three pillars of observability
- Logs: event records with context for detailed investigation.
- Metrics: aggregated numbers such as latency, throughput, and error rate for trends and alerts.
- Traces: request journeys across services to locate bottlenecks and dependencies.
Start with what matters most
If you are new to observability, do not instrument everything at once. Choose a few critical endpoints or flows, such as login, submission, or approval. Then set simple SLOs for them, such as a target success rate and a latency threshold that most requests must stay under.
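As a rough illustration, a simple SLO check for one flow could look like the sketch below. The `Request` type, the function names, and the targets (99% success, 95% of requests within 500 ms) are assumptions chosen for the example, not recommended values.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request for a critical flow (hypothetical shape)."""
    ok: bool           # True if the request succeeded
    latency_ms: float  # observed latency in milliseconds


def success_rate(requests: list[Request]) -> float:
    """Fraction of requests that succeeded; 1.0 when there is no traffic."""
    if not requests:
        return 1.0
    return sum(r.ok for r in requests) / len(requests)


def fraction_within(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests at or under the latency threshold."""
    if not requests:
        return 1.0
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


def meets_slo(
    requests: list[Request],
    min_success: float = 0.99,    # assumed target: 99% of requests succeed
    threshold_ms: float = 500.0,  # assumed latency threshold
    min_fast: float = 0.95,       # assumed target: 95% under the threshold
) -> bool:
    """True when both the success-rate and the latency objective are met."""
    return (
        success_rate(requests) >= min_success
        and fraction_within(requests, threshold_ms) >= min_fast
    )
```

Starting with two objectives per flow keeps the definition small enough to review with the team before wiring it into alerts.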
Healthy alerting
- Actionable: a clear first step exists.
- Low noise: avoid notification spam.
- Based on SLOs: focus on what users experience, not on internal metrics they never notice.
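The three properties above can be combined in a small alert rule. The sketch below is one possible approach, not a standard algorithm: it fires only when the error rate breaches an assumed SLO limit for several consecutive windows, and every alert carries a first step. The names `Alert` and `evaluate_alert`, the 1% limit, and the first-step text are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Alert:
    title: str
    first_step: str  # actionable: what the responder should do first


def evaluate_alert(
    error_rates: Sequence[float],
    slo_max_error_rate: float = 0.01,  # assumed SLO: at most 1% errors
    consecutive: int = 3,              # windows that must breach before firing
) -> Optional[Alert]:
    """Return an Alert if the last `consecutive` windows all breach the SLO.

    Requiring several consecutive breaches trades a little detection speed
    for less noise from one-off spikes.
    """
    recent = list(error_rates[-consecutive:])
    if len(recent) < consecutive:
        return None  # not enough data to decide yet
    if all(rate > slo_max_error_rate for rate in recent):
        return Alert(
            title=f"Error rate above {slo_max_error_rate:.0%} "
                  f"for {consecutive} consecutive windows",
            # Hypothetical first step; replace with your own runbook entry.
            first_step="Check recent deploys for the affected endpoint",
        )
    return None
```

For example, `evaluate_alert([0.0, 0.05, 0.0])` returns `None` because the spike is isolated, while `evaluate_alert([0.02, 0.03, 0.04])` returns an `Alert`.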
Good practices often missed
- Add correlation IDs so logs and traces can be linked.
- Standardize log format and levels (info, warn, error).
- Keep runbooks near alerts so responders know what to do.
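The first two practices can be sketched with Python's standard `logging` and `contextvars` modules. This is a minimal example under assumptions: the JSON field names, the `build_logger` and `handle_request` helpers, and the log messages are hypothetical, and a real service would usually read the correlation ID from an incoming request header.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

# Map stdlib levels to the standardized names used in the log output.
LEVEL_NAMES = {logging.INFO: "info", logging.WARNING: "warn", logging.ERROR: "error"}


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with a fixed set of fields."""

    def format(self, record):
        return json.dumps({
            "level": LEVEL_NAMES.get(record.levelno, record.levelname.lower()),
            "logger": record.name,
            "correlation_id": getattr(record, "correlation_id", "-"),
            "message": record.getMessage(),
        })


def build_logger(name="app", stream=None):
    """Create a logger that writes standardized JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    handler.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, incoming_id=None):
    """Hypothetical handler: reuse the caller's ID or generate a new one."""
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("login started")
        logger.warning("login slow")
    finally:
        correlation_id.reset(token)  # do not leak the ID into other requests
```

Because the ID lives in a context variable, code deeper in the call stack can log without passing the ID around explicitly, and every line for one request shares the same `correlation_id` value.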
Example investigation flow
- An alert fires because the error rate for a specific endpoint has increased.
- Check traces to see which service is slow or failing.
- Review logs for the same correlation IDs to get context.
- Apply a mitigation, then write a postmortem and plan a permanent fix.
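The log review step in this flow can be sketched as follows, assuming logs are already parsed into dictionaries that carry a `correlation_id` field. The field names, the sample entries, and both helper functions are hypothetical; a real setup would typically run the equivalent query in a log backend.

```python
# Hypothetical parsed log entries from two services.
SAMPLE_LOGS = [
    {"correlation_id": "req-1", "service": "gateway", "endpoint": "/login",
     "level": "error", "message": "upstream returned 503"},
    {"correlation_id": "req-1", "service": "auth", "endpoint": None,
     "level": "warn", "message": "database pool exhausted"},
    {"correlation_id": "req-2", "service": "gateway", "endpoint": "/login",
     "level": "info", "message": "ok"},
]


def failing_correlation_ids(logs, endpoint):
    """Correlation IDs of requests that logged an error on the given endpoint."""
    return {
        entry["correlation_id"]
        for entry in logs
        if entry.get("endpoint") == endpoint and entry.get("level") == "error"
    }


def context_for(logs, ids):
    """Group every log line, from any service, that shares one of the IDs."""
    grouped = {}
    for entry in logs:
        cid = entry["correlation_id"]
        if cid in ids:
            grouped.setdefault(cid, []).append(entry)
    return grouped
```

With the sample data, the failing `/login` request `req-1` pulls in the `auth` service's warning as well, which is the kind of cross-service context that the correlation ID makes available.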
Good observability keeps teams calm: issues are detected faster, investigations are shorter, and fixes are more targeted.