MLE-440: ML Monitoring, Observability, and Reliability

Course Description

Keeping ML systems healthy in production. Data drift: covariate shift, label shift, and concept drift, with detection via population stability index (PSI), KL divergence, and maximum mean discrepancy (MMD). Model degradation: tracking performance metrics over time, shadow deployments, A/B testing. Logging: what to log, log aggregation (ELK stack, CloudWatch), structured logging for ML. Alerting strategies: avoiding alert fatigue while still catching real issues. Observability tools: Arize AI, Evidently AI, Fiddler, WhyLabs. LLM-specific monitoring: response quality, latency, token usage, hallucination detection.
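
As an illustration of the drift-detection topic, the following is a minimal sketch of the population stability index (PSI) for a single numeric feature. It is not course material: the function name population_stability_index, the quantile-based binning, the default of 10 bins, and the eps floor for empty bins are all assumptions made for this example. PSI here is computed as the sum over bins of (actual - expected) * ln(actual / expected), where expected and actual are the bin fractions of the reference and production samples.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a production sample (actual).

    Hypothetical helper for illustration; binning and eps are assumptions.
    """
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Derive bin edges from quantiles of the reference sample so that each
    # bin holds roughly the same share of reference data.
    edges = np.unique(np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1)))

    # Use interior edges only: np.digitize then places values outside the
    # reference range into the first or last bin instead of dropping them.
    interior = edges[1:-1]
    n = len(interior) + 1

    exp_counts = np.bincount(np.digitize(expected, interior), minlength=n)
    act_counts = np.bincount(np.digitize(actual, interior), minlength=n)

    # Convert counts to fractions and floor at eps to avoid log(0).
    exp_frac = np.clip(exp_counts / exp_counts.sum(), eps, None)
    act_frac = np.clip(act_counts / act_counts.sum(), eps, None)

    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(0.0, 1.0, 5000)   # e.g. training-time feature values
    production = rng.normal(1.0, 1.0, 5000)  # same feature, mean shifted by 1
    print("PSI, no shift:", population_stability_index(reference, reference))
    print("PSI, shifted: ", population_stability_index(reference, production))
```

A value near zero suggests the two samples are distributed similarly, and larger values suggest more shift. Any alerting threshold is a judgment call that depends on the feature and the bin count.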