What you will learn
- Define SLIs, SLOs, and error budgets that engineering and business can agree on
- Instrument applications with OpenTelemetry — traces, metrics, logs — without vendor lock-in
- Query Prometheus with PromQL; build actionable Grafana dashboards
- Search and correlate logs in Elasticsearch and Kibana
- Operate Zabbix for infrastructure and Splunk for log analytics at enterprise scale
- Design a monitoring architecture that spans both cloud-native and legacy systems
Weekend 1
Observability foundations & SRE mindset
Monitoring tells you something broke. Observability lets you ask new questions without redeploying. Start with reliability targets and alerting philosophy that avoids pager fatigue.
- Three pillars: metrics, logs, traces — and how they correlate
- SLI / SLO / SLA — practical examples for web APIs and batch jobs
- Alerting on symptoms vs causes; on-call runbook structure
- Golden signals: latency, traffic, errors, saturation
- USE and RED methods for resource and service health
Lab: Draft SLOs for a sample service; write one runbook for a hypothetical alert.
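To make the SLO and error-budget idea concrete, here is a minimal sketch of the arithmetic in Python. The function names and the 30-day window are illustrative assumptions, not part of the course material; real SLO tooling may define windows and budgets differently.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability an availability SLO permits in the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 250 failures out of 1,000,000 requests spends a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

Expressing the budget as a remaining fraction is one way to give engineering and business a shared number to negotiate release pace against.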
Weekend 2
OpenTelemetry instrumentation
Instrument once, export anywhere. SDKs, auto-instrumentation, the Collector pipeline, context propagation, and sampling strategies that balance cost and fidelity.
- Traces, metrics, and logs as unified signals
- OTel SDK in a sample app (Java or Python)
- Collector receivers, processors, exporters
- Baggage and trace context across service boundaries
- Head-based vs tail-based sampling trade-offs
Lab: Instrument a microservice; trace a request through two services via Collector.
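The context propagation and head-based sampling topics above can be sketched without any SDK. The code below is a plain-Python illustration, assuming a ratio-style sampler that decides from the trace ID alone; the helper names are hypothetical and this is not the OpenTelemetry SDK API. The header layout follows the W3C Trace Context `traceparent` format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags).

```python
import secrets

_LOW_64_BITS = (1 << 64) - 1


def new_trace_id() -> int:
    """Random 128-bit trace ID, the size W3C Trace Context uses."""
    return secrets.randbits(128)


def head_sample(trace_id: int, ratio: float) -> bool:
    """Keep/drop decision made at the start of a trace.

    The decision depends only on the trace ID, so every service that sees
    the same ID reaches the same answer without coordination.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = int(ratio * (1 << 64))
    return (trace_id & _LOW_64_BITS) < bound


def traceparent(trace_id: int, span_id: int, sampled: bool) -> str:
    """Build a traceparent header value to send to the next service."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


if __name__ == "__main__":
    tid = new_trace_id()
    keep = head_sample(tid, ratio=0.1)
    print(traceparent(tid, secrets.randbits(64), keep))
```

Head-based sampling like this is cheap because it decides before any spans exist; the trade-off is that it cannot favor traces that later turn out slow or failed, which is what tail-based sampling in the Collector addresses.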
Weekend 3
Metrics — Prometheus & Grafana
Pull-based metrics, PromQL queries, recording rules, and Grafana dashboards that on-call engineers actually use at 3 AM.
- Prometheus architecture: scrape, storage, federation
- Exporters for node, kube-state-metrics, and custom apps
- PromQL — rates, histograms, aggregations, alert expressions
- Grafana panels, variables, and dashboard design anti-patterns
- Alertmanager routing, silences, and escalation
Lab: Deploy Prometheus stack; build dashboard + alert for error rate spike.
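Before writing PromQL for an error-rate alert, it can help to see what a counter-based ratio computes. The sketch below is a simplified Python model, assuming ordered raw counter samples from one series; it treats any drop in value as a counter reset, as Prometheus does, but omits the extrapolation that the real `rate()` and `increase()` functions apply. The function names and sample values are illustrative.

```python
def counter_increase(samples: list[float]) -> float:
    """Total increase across ordered counter samples, tolerating resets."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # A counter only goes up; a lower value means it restarted from zero.
            increase += cur
    return increase


def error_ratio(error_samples: list[float], total_samples: list[float]) -> float:
    """Share of requests that failed over the sampled window."""
    total = counter_increase(total_samples)
    if total == 0:
        return 0.0
    return counter_increase(error_samples) / total


if __name__ == "__main__":
    errors = [0, 2, 5]
    requests = [100, 200, 300]
    ratio = error_ratio(errors, requests)
    # An alert rule would compare this ratio against a threshold tied to the SLO.
    print(f"{ratio:.3f}", ratio > 0.01)
```

Alerting on the ratio rather than the raw error count keeps the alert meaningful as traffic grows or shrinks.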
Weekend 4
Logs & traces — Elastic stack & APM
Structured logging, centralized search, and distributed tracing in Kibana. Service maps and latency breakdown for debugging production incidents.
- Structured JSON logs vs unstructured grep nightmares
- Elasticsearch indexing, ILM, and retention economics
- Kibana Discover, Lens, and log correlation with trace IDs
- Elastic APM: service maps, span analysis, anomaly detection concepts
- When to use Elastic vs Prometheus: complementary roles
Lab: Ship logs to Elastic; find root cause of injected latency using trace + log correlation.
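Trace and log correlation depends on every log line carrying the trace ID as a searchable field. A minimal sketch using only the Python standard library follows; the logger name `checkout`, the field names, and the example trace ID are assumptions for illustration, and a real setup would typically follow whatever field schema the log pipeline expects.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id arrives through the `extra` argument; None if absent.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("payment authorized",
             extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```

With the trace ID indexed as its own field, a search on that one value returns every log line the request produced across services.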
Weekend 5
Enterprise monitoring, Zabbix, Splunk & capstone
Not every system speaks Prometheus. Learn Zabbix for hosts and network gear, Splunk for high-volume log analytics, and how to route signals between stacks.
- Zabbix agents, templates, discovery, and escalation chains
- Splunk ingestion, parsing, indexes, and search language basics
- Bridging OTel → multiple backends; forwarders and collectors
- Incident simulation — timeline reconstruction from telemetry
- Postmortem writing and blameless culture basics
Capstone: Unified observability design doc for a hybrid estate (K8s + VMs + legacy).
Tools covered
OpenTelemetry
Prometheus
Grafana
Elasticsearch
Kibana
Elastic APM
Zabbix
Splunk