Observability & Monitoring

What you will learn

Define SLIs, SLOs, and error budgets that engineering and business can agree on
Instrument applications with OpenTelemetry — traces, metrics, logs — without vendor lock-in
Query Prometheus with PromQL; build actionable Grafana dashboards
Search and correlate logs in Elasticsearch and Kibana
Operate Zabbix for infrastructure and Splunk for log analytics at enterprise scale
Design a monitoring architecture that spans both cloud-native and legacy systems

Weekend 1

Observability foundations & SRE mindset

Monitoring tells you that something broke. Observability lets you ask new questions without redeploying. Start with reliability targets and an alerting philosophy that avoids pager fatigue.

Three pillars: metrics, logs, traces — and how they correlate
SLI / SLO / SLA — practical examples for web APIs and batch jobs
Alerting on symptoms vs causes; on-call runbook structure
Golden signals: latency, traffic, errors, saturation
USE and RED methods for resource and service health
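
To make the SLO and error-budget idea concrete, here is a minimal sketch in Python. It assumes an availability SLI defined as successful requests divided by total requests over the SLO window; the function name and the example figures are illustrative, not taken from any particular tool.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize an availability SLO for one window.

    slo_target is a fraction such as 0.999 (99.9%). All names here are
    hypothetical and chosen only for illustration.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    sli = 1.0 - failed_requests / total_requests
    # Fraction of the budget still unspent; negative means the SLO is breached.
    remaining = 1.0 - failed_requests / allowed_failures if allowed_failures > 0 else 0.0
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_remaining": remaining,
    }


# Example: a 99.9% target over one million requests tolerates about 1,000 failures.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"budget remaining: {report['budget_remaining']:.0%}")
```

A remaining budget near zero is usually the signal to slow feature releases and prioritize reliability work, which is what makes the number useful to both engineering and business.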

Lab: Draft SLOs for a sample service; write one runbook for a hypothetical alert.

Weekend 2

OpenTelemetry instrumentation

Instrument once, export anywhere. SDKs, auto-instrumentation, the Collector pipeline, context propagation, and sampling strategies that balance cost and fidelity.

Traces, metrics, and logs as unified signals
OTel SDK in a sample app (Java or Python)
Collector receivers, processors, exporters
Baggage and trace context across service boundaries
Head-based vs tail-based sampling trade-offs
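
The context-propagation and head-based sampling topics above can be sketched with the standard library alone. The example parses a W3C `traceparent` header and makes a deterministic keep-or-drop decision from the trace ID, so every service that sees the same trace reaches the same answer. This is a simplified illustration of the idea, not the OpenTelemetry SDK's implementation; the function names are hypothetical.

```python
import re

# W3C Trace Context: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or int(trace_id, 16) == 0 or int(parent_id, 16) == 0:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Head-based decision: keep the trace if its low 64 bits fall under the ratio."""
    lower_64 = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return lower_64 < int(ratio * (1 << 64))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"], head_sample(ctx["trace_id"], ratio=0.5))
```

Head-based sampling decides before the outcome of the request is known, which keeps it cheap. Tail-based sampling defers the decision until spans have been collected, so it can keep slow or failed traces at the cost of buffering them in the Collector.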

Lab: Instrument a microservice; trace a request through two services via the Collector.

Weekend 3

Metrics — Prometheus & Grafana

Pull-based metrics, PromQL queries, recording rules, and Grafana dashboards that on-call engineers actually use at 3 AM.

Prometheus architecture: scrape, storage, federation
Exporters for node, kube-state-metrics, and custom apps
PromQL — rates, histograms, aggregations, alert expressions
Grafana panels, variables, and dashboard design anti-patterns
Alertmanager routing, silences, and escalation
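
To build intuition for what a PromQL rate over a counter computes, the sketch below derives a per-second rate from raw counter samples and handles counter resets. It is deliberately simplified: Prometheus's own `rate()` also extrapolates to the edges of the query window, which is omitted here. The sample data and function names are assumptions for illustration.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    samples is a list of (timestamp_seconds, counter_value) sorted by time.
    Returns None when there are not enough points to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter reset (for example, a process restart),
        # so the new value is counted as growth from zero.
        increase += value if value < previous else value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


def error_ratio(error_samples, total_samples):
    """Ratio of the error rate to the total request rate, or None if undefined."""
    errors = simple_rate(error_samples)
    total = simple_rate(total_samples)
    if errors is None or not total:
        return None
    return errors / total


# Hypothetical scrape data: 6 errors out of 600 requests in one minute.
ratio = error_ratio([(0, 0), (60, 6)], [(0, 0), (60, 600)])
print(f"error ratio: {ratio:.2%}")
```

Alerting on a ratio like this one, rather than on a raw error count, keeps the alert meaningful as traffic grows or shrinks.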

Lab: Deploy a Prometheus stack; build a dashboard and an alert for an error-rate spike.

Weekend 4

Logs & traces — Elastic stack & APM

Structured logging, centralized search, and distributed tracing in Kibana. Service maps and latency breakdown for debugging production incidents.

Structured JSON logs vs unstructured grep nightmares
Elasticsearch indexing, ILM, and retention economics
Kibana Discover, Lens, and log correlation with trace IDs
Elastic APM — service maps, span analysis, anomaly concepts
When to use Elastic vs Prometheus: complementary roles
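
As a sketch of structured logging with trace correlation, the example below uses only Python's standard `logging` module to emit one JSON object per log line, including a trace ID. The field name `trace_id` and the logger name `checkout` are illustrative assumptions; a real deployment would follow whatever field schema its log pipeline expects.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object so fields are searchable."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via the `extra` argument; None when the caller has no trace.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.info("payment authorized", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
print(stream.getvalue().strip())
```

Because the trace ID is a discrete field rather than text buried in a message, a search on that one value can pull up every log line that belongs to a slow trace.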

Lab: Ship logs to Elastic; find the root cause of injected latency using trace and log correlation.

Weekend 5

Enterprise monitoring, Zabbix, Splunk & capstone

Not every system speaks Prometheus. Learn Zabbix for hosts and network gear, Splunk for high-volume log analytics, and how to route signals between stacks.

Zabbix agents, templates, discovery, and escalation chains
Splunk ingestion, parsing, indexes, and search language basics
Bridging OTel → multiple backends; forwarders and collectors
Incident simulation — timeline reconstruction from telemetry
Postmortem writing and blameless culture basics
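
Timeline reconstruction often comes down to merging events from several tools into one chronological view. The sketch below does this with the standard library, assuming each source already yields events sorted by time with ISO 8601 timestamps that include a UTC offset. The event data and source names are invented for illustration.

```python
import heapq
from datetime import datetime


def build_timeline(*sources):
    """Merge time-sorted event streams into one chronological list.

    Each event is a tuple of (iso_timestamp, origin, message).
    """
    return list(heapq.merge(*sources, key=lambda event: datetime.fromisoformat(event[0])))


# Hypothetical events from two monitoring stacks during one incident.
app_events = [
    ("2024-05-01T03:00:05+00:00", "app", "p99 latency above threshold"),
    ("2024-05-01T03:02:10+00:00", "app", "5xx responses begin"),
]
host_events = [
    ("2024-05-01T03:01:00+00:00", "zabbix", "disk I/O wait high"),
]

timeline = build_timeline(app_events, host_events)
for timestamp, origin, message in timeline:
    print(timestamp, origin, message)
```

Parsing the timestamps instead of comparing strings matters once sources report in different time zones, which is common in a hybrid estate.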

Capstone: Unified observability design doc for a hybrid estate (K8s + VMs + legacy).

Tools covered

OpenTelemetry, Prometheus, Grafana, Elasticsearch, Kibana, Elastic APM, Zabbix, Splunk
