Observability program

Observability & monitoring

See what your systems are doing: metrics, logs, and traces with OpenTelemetry, Prometheus, Grafana, and Elastic, plus enterprise monitoring with Zabbix and Splunk for mixed estates.

Weekends only · 5 weekends · 30 hours · SRE fundamentals

What you will learn

  • Define SLIs, SLOs, and error budgets that engineering and business can agree on
  • Instrument applications with OpenTelemetry — traces, metrics, logs — without vendor lock-in
  • Query Prometheus with PromQL; build actionable Grafana dashboards
  • Search and correlate logs in Elasticsearch and Kibana
  • Operate Zabbix for infrastructure and Splunk for log analytics at enterprise scale
  • Design monitoring architecture when you have both cloud-native and legacy systems

Weekend 1

Observability foundations & SRE mindset

Monitoring tells you something broke. Observability lets you ask new questions without redeploying. Start with reliability targets and alerting philosophy that avoids pager fatigue.

  • Three pillars: metrics, logs, traces — and how they correlate
  • SLI / SLO / SLA — practical examples for web APIs and batch jobs
  • Alerting on symptoms vs causes; on-call runbook structure
  • Golden signals: latency, traffic, errors, saturation
  • USE and RED methods for resource and service health
Lab: Draft SLOs for a sample service; write one runbook for a hypothetical alert.
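
To make the SLO and error budget idea concrete, here is a minimal Python sketch of the arithmetic. The `SLO` type, the function names, and the 99.9% over 30 days figures are illustrative assumptions, not course material; real SLO tooling usually adds burn-rate windows on top of this.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Availability target over a rolling window (hypothetical example type)."""
    target: float      # e.g. 0.999 means 99.9% of requests should succeed
    window_days: int   # length of the rolling window in days

    def __post_init__(self) -> None:
        # A target of exactly 1.0 leaves no budget at all, so reject it here
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def error_budget_ratio(slo: SLO) -> float:
    """Fraction of requests that may fail within the window."""
    return 1.0 - slo.target


def allowed_downtime_minutes(slo: SLO) -> float:
    """Minutes of full outage the budget permits over the window."""
    return error_budget_ratio(slo) * slo.window_days * 24 * 60


def budget_remaining(slo: SLO, total_requests: int, failed_requests: int) -> float:
    """Share of the error budget still unspent (negative once overspent)."""
    if total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = error_budget_ratio(slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    slo = SLO(target=0.999, window_days=30)
    print(f"Allowed downtime: {allowed_downtime_minutes(slo):.1f} minutes")
    print(f"Budget remaining: {budget_remaining(slo, 1_000_000, 250):.0%}")
```

A 99.9% target over 30 days works out to roughly 43 minutes of full outage, which is the kind of number engineering and business stakeholders can negotiate over.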

Weekend 2

OpenTelemetry instrumentation

Instrument once, export anywhere. SDKs, auto-instrumentation, the Collector pipeline, context propagation, and sampling strategies that balance cost and fidelity.

  • Traces, metrics, and logs as unified signals
  • OTel SDK in a sample app (Java or Python)
  • Collector receivers, processors, exporters
  • Baggage and trace context across service boundaries
  • Head-based vs tail-based sampling trade-offs
Lab: Instrument a microservice; trace a request through two services via Collector.
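
Trace context crossing a service boundary can be sketched with the standard library alone. The example below builds and parses a W3C `traceparent` header by hand so the mechanics are visible. In practice the OpenTelemetry SDK's propagators do this for you. The `SpanContext` type and the helper names are hypothetical, and this sketch only accepts header version `00`.

```python
import re
import secrets
from typing import Dict, NamedTuple, Optional

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str   # shared by every span in one trace
    span_id: str    # unique per span
    sampled: bool   # whether the trace was selected for recording


def new_root_context(sampled: bool = True) -> SpanContext:
    """Start a new trace at the edge of the system."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def child_of(parent: SpanContext) -> SpanContext:
    """New span in the same trace: keep trace_id, mint a fresh span_id."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def inject(ctx: SpanContext, headers: Dict[str, str]) -> None:
    """Write the context into outgoing request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Read the caller's context from incoming headers, or None if invalid."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid; other versions are out of scope for this sketch
    if version != "00" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


if __name__ == "__main__":
    outgoing: Dict[str, str] = {}
    inject(new_root_context(), outgoing)      # service A makes a call
    caller = extract(outgoing)                # service B receives it
    if caller is not None:
        print(child_of(caller))               # same trace_id, new span_id
```

Because the sampled flag travels with the header, a head-based sampling decision made in the first service is honored by every downstream service.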

Weekend 3

Metrics — Prometheus & Grafana

Pull-based metrics, PromQL queries, recording rules, and Grafana dashboards that on-call engineers actually use at 3 AM.

  • Prometheus architecture: scrape, storage, federation
  • Exporters for node, kube-state-metrics, and custom apps
  • PromQL — rates, histograms, aggregations, alert expressions
  • Grafana panels, variables, and dashboard design anti-patterns
  • Alertmanager routing, silences, and escalation
Lab: Deploy Prometheus stack; build dashboard + alert for error rate spike.
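
Rates over counters are the core of most PromQL alert expressions, so it helps to see what the calculation does. The Python sketch below is a simplified illustration, not Prometheus's actual algorithm: it handles counter resets but omits the extrapolation to window boundaries that PromQL's `rate()` performs. The sample data and function names are made up for the example.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_increase(samples: List[Sample]) -> float:
    """Total increase across scraped samples, treating any drop as a reset."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # Counters only go up; a lower value means the process restarted at zero
        total += curr - prev if curr >= prev else curr
    return total


def per_second_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


def error_ratio(errors: List[Sample], requests: List[Sample]) -> float:
    """Error rate divided by request rate over the same window."""
    request_rate = per_second_rate(requests)
    return per_second_rate(errors) / request_rate if request_rate > 0 else 0.0


if __name__ == "__main__":
    # Four scrapes 15 s apart; the service restarts before the third scrape
    requests = [(0.0, 100.0), (15.0, 160.0), (30.0, 20.0), (45.0, 80.0)]
    errors = [(0.0, 5.0), (15.0, 8.0), (30.0, 1.0), (45.0, 4.0)]
    ratio = error_ratio(errors, requests)
    print(f"error ratio: {ratio:.2%}")
    print("ALERT" if ratio > 0.02 else "ok")  # 2% threshold is an arbitrary example
```

Alerting on the ratio rather than the raw error count keeps the alert meaningful as traffic grows or shrinks, which fits the symptom-based alerting approach.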

Weekend 4

Logs & traces — Elastic stack & APM

Structured logging, centralized search, and distributed tracing in Kibana. Service maps and latency breakdown for debugging production incidents.

  • Structured JSON logs vs unstructured grep nightmares
  • Elasticsearch indexing, ILM, and retention economics
  • Kibana Discover, Lens, and log correlation with trace IDs
  • Elastic APM: service maps, span analysis, anomaly detection concepts
  • When to use Elastic vs Prometheus: complementary roles
Lab: Ship logs to Elastic; find root cause of injected latency using trace + log correlation.
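
Structured logs only correlate with traces if each line carries the trace ID as a field. The sketch below shows one way to emit JSON log lines with Python's standard `logging` module. The field names (`trace_id`, `timestamp`, and so on) and the `checkout` logger name are assumptions for illustration; a real pipeline would map them to whatever schema the log backend expects.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via `extra={"trace_id": ...}`; None when the caller omits it
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()                 # avoid duplicate handlers on re-run
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    log.info("payment authorised", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
    log.warning("slow downstream call", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
    for line in buffer.getvalue().splitlines():
        print(json.loads(line)["trace_id"], json.loads(line)["message"])
```

With the trace ID indexed as its own field, a single filter on that value pulls up every log line for one request, with no grep patterns across free text.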

Weekend 5

Enterprise monitoring, Zabbix, Splunk & capstone

Not every system speaks Prometheus. Learn Zabbix for hosts and network gear, Splunk for high-volume log analytics, and how to route signals between stacks.

  • Zabbix agents, templates, discovery, and escalation chains
  • Splunk ingestion, parsing, indexes, and search language basics
  • Bridging OTel → multiple backends; forwarders and collectors
  • Incident simulation — timeline reconstruction from telemetry
  • Postmortem writing and blameless culture basics
Capstone: Unified observability design doc for a hybrid estate (K8s + VMs + legacy).
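
Timeline reconstruction during an incident largely comes down to merging events from several tools into one time-ordered view. The sketch below assumes each source has already been exported as a time-sorted list. The `Event` type, the source labels, and the sample messages are hypothetical, and a real export would need clock-skew and time-zone handling first.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    timestamp: float  # unix seconds, assumed normalised to one clock
    source: str       # which tool reported the event
    message: str


def build_timeline(*streams: Iterable[Event]) -> Iterator[Event]:
    """Merge streams that are each already sorted by time into one timeline."""
    return heapq.merge(*streams, key=lambda event: event.timestamp)


if __name__ == "__main__":
    zabbix = [
        Event(100.0, "zabbix", "disk latency high on db-vm-1"),
        Event(160.0, "zabbix", "db-vm-1 unreachable"),
    ]
    prometheus = [
        Event(120.0, "prometheus", "p99 latency above SLO on checkout"),
        Event(150.0, "prometheus", "error ratio above 2% on checkout"),
    ]
    splunk = [
        Event(130.0, "splunk", "connection pool exhausted in legacy billing"),
    ]
    for event in build_timeline(zabbix, prometheus, splunk):
        print(f"{event.timestamp:>6.0f}  {event.source:<10}  {event.message}")
```

`heapq.merge` is lazy and never loads every stream into memory at once, which matters when the inputs are large log exports rather than short lists.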

Tools covered

OpenTelemetry · Prometheus · Grafana · Elasticsearch · Kibana · Elastic APM · Zabbix · Splunk

Enroll in Observability & monitoring

Submit your details for this program — we will confirm batch dates and next steps.
