Use Datadog's AI to Reduce Alert Noise
What This Does
Datadog's AI features (Watchdog, anomaly detection, and alert correlation) automatically surface genuine performance anomalies without requiring you to set static thresholds for every metric. Instead of reviewing, say, 150 alerts per day, you see the 10 or so that actually matter. This is the single highest-leverage change you can make if you're already a Datadog customer.
Before You Start
- You have a Datadog account on the Pro plan or higher (Watchdog is included from Pro upward)
- Agents are installed on your monitored hosts
- You're logged in to your Datadog account
Steps
1. Enable Watchdog for automatic anomaly detection
Watchdog is Datadog's core AI feature — it analyzes baseline behavior and surfaces unusual patterns.
- Go to Monitors → Watchdog in the left navigation
- If Watchdog isn't enabled, click Enable Watchdog
- Select which services and infrastructure you want Watchdog to monitor
- Set your notification channel (email, PagerDuty, Slack) for Watchdog alerts
What you should see: A Watchdog dashboard showing any currently detected anomalies. If your infrastructure is healthy, this should be mostly empty.
Troubleshooting: If you don't see the Watchdog option, check your Datadog plan — it requires Pro or higher. Contact your Datadog account rep if you're on an older plan.
2. Set up anomaly detection monitors instead of static threshold alerts
Static threshold alerts ("alert if CPU > 80%") generate false positives constantly. Anomaly detection alerts when behavior deviates from its normal pattern for that specific host and time of day.
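To make the difference concrete, here is a minimal sketch that compares a fixed threshold with a per-hour baseline check. It is only an illustration of the idea, not Datadog's algorithm; the function names, the sample data, and the 3-deviation cutoff are all assumptions made for this example.

```python
from statistics import mean, stdev

def static_alerts(samples, threshold=80.0):
    """Return indices of samples whose value exceeds a fixed threshold."""
    return [i for i, (_, value) in enumerate(samples) if value > threshold]

def baseline_alerts(history, samples, deviations=3.0):
    """Return indices of samples far from the historical values for the same hour."""
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)

    flagged = []
    for i, (hour, value) in enumerate(samples):
        past = by_hour.get(hour, [])
        if len(past) < 2:
            continue  # not enough history to estimate a baseline
        mu, sigma = mean(past), stdev(past)
        # Floor sigma at 1.0 so a very flat history does not flag tiny changes.
        if abs(value - mu) > deviations * max(sigma, 1.0):
            flagged.append(i)
    return flagged

# Hypothetical data: (hour_of_day, cpu_percent). 02:00 is a nightly batch job.
history = [(2, 88.0), (2, 90.0), (2, 86.0), (2, 89.0),
           (14, 30.0), (14, 32.0), (14, 29.0), (14, 31.0)]
today = [(2, 89.0), (14, 70.0)]

static_result = static_alerts(today)
baseline_result = baseline_alerts(history, today)
print(static_result)    # [0]: the routine 02:00 batch job trips the 80% threshold
print(baseline_result)  # [1]: only the unusual 14:00 reading is flagged
```

The fixed threshold fires on the routine nightly job and misses the unusual afternoon reading, while the baseline check does the opposite.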
- Go to Monitors → Create Monitor
- Select Anomaly Detection as the monitor type
- Choose your metric (e.g., system.cpu.user for CPU utilization)
- Select the scope (a specific host, service, or tag group)
- Set the algorithm to Agile (adapts quickly when the metric's level shifts) or Robust (more stable and less swayed by short spikes, but slower to adapt to level shifts)
- Configure the alert threshold: "Alert when behavior is X standard deviations from normal for Y minutes"
What you should see: A preview graph showing the "normal" band for that metric and where anomalies have occurred historically. The band is wider at expected peak usage times.
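If you prefer to define the same monitor as code, the sketch below builds a request body for Datadog's Monitors API using the anomalies() query function. Treat it as an assumption-laden starting point: the host tag web-01, the @slack-ops-alerts handle, the window sizes, and the helper names are hypothetical, and the option fields should be checked against the current API reference before use. The code only builds and prints the payload; it does not call the API.

```python
import json

def anomaly_query(metric, scope, algorithm="agile", deviations=2,
                  eval_window="last_4h", alert_window="last_15m"):
    """Build an anomaly monitor query string for one metric and scope."""
    inner = f"avg:{metric}{{{scope}}}"
    return (
        f"avg({eval_window}):anomalies({inner}, '{algorithm}', {deviations}, "
        f"direction='both', interval=60, alert_window='{alert_window}', "
        f"count_default_zero='true') >= 1"
    )

def anomaly_monitor_payload(name, metric, scope, notify, algorithm="agile"):
    """Build a monitor definition as a plain dict (assumed field layout)."""
    return {
        "name": name,
        "type": "query alert",
        "query": anomaly_query(metric, scope, algorithm=algorithm),
        "message": f"{metric} is outside its normal range on {scope}. {notify}",
        "tags": ["managed-by:script"],  # hypothetical tag
        "options": {
            "thresholds": {"critical": 1.0},
            "threshold_windows": {
                "trigger_window": "last_15m",
                "recovery_window": "last_15m",
            },
        },
    }

payload = anomaly_monitor_payload(
    name="CPU anomaly on web-01",
    metric="system.cpu.user",
    scope="host:web-01",           # hypothetical host tag
    notify="@slack-ops-alerts",    # hypothetical notification handle
)
print(json.dumps(payload, indent=2))
```

The printed JSON could then be submitted through the API client or infrastructure-as-code tool your team already uses.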
3. Enable alert correlation with Event Management
Alert correlation groups related alerts into a single correlated event — so 50 disk space alerts from the same maintenance operation become 1 grouped alert.
- Go to Events → Event Management
- Click Correlations or Correlation Rules
- Create a correlation rule that groups alerts by host tag, service, or infrastructure component
- Set a time window (alerts within 5-15 minutes are typically related)
What you should see: Related alerts automatically grouped into a single correlated event with a summary of the affected scope.
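The grouping logic behind a correlation rule can be pictured with a short sketch. This is a simplified illustration under stated assumptions, not how Datadog implements correlation: the alert records, the group_by field, and the 10-minute window are all hypothetical.

```python
from datetime import datetime, timedelta

def correlate(alerts, group_by="host", window=timedelta(minutes=10)):
    """Group alerts that share a field value and start within `window` of each other."""
    groups = []
    latest = {}  # field value -> most recent group for that value
    for alert in sorted(alerts, key=lambda a: a["time"]):
        value = alert[group_by]
        group = latest.get(value)
        if group is not None and alert["time"] - group["start"] <= window:
            group["alerts"].append(alert)
        else:
            # Too late for the previous group (or no group yet): start a new one.
            group = {"key": value, "start": alert["time"], "alerts": [alert]}
            latest[value] = group
            groups.append(group)
    return groups

t0 = datetime(2024, 1, 1, 2, 0)
alerts = (
    # Five disk alerts from the same host during one maintenance run.
    [{"host": "db-01", "title": "disk space high", "time": t0 + timedelta(minutes=m)}
     for m in (0, 2, 4, 6, 8)]
    + [{"host": "web-02", "title": "latency high", "time": t0 + timedelta(minutes=3)}]
    + [{"host": "db-01", "title": "disk space high", "time": t0 + timedelta(hours=2)}]
)

groups = correlate(alerts)
summary = [(g["key"], len(g["alerts"])) for g in groups]
print(summary)  # [('db-01', 5), ('web-02', 1), ('db-01', 1)]
```

Seven raw alerts collapse into three groups; the later db-01 alert stays separate because it falls outside the time window.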
4. Review and refine with the Alert Storm Protection feature
- Go to Monitors → Manage Monitors
- Click on any monitor that generates frequent alerts
- Review the alert history — if it's alerting more than 3-4 times per day, it's likely noisy
- Use Mute for known maintenance periods or Downtime scheduling for recurring windows
What you should see: Over 2-3 weeks, your alert volume should decrease as you refine thresholds and enable correlation.
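The "more than 3-4 times per day" rule of thumb is easy to apply to an exported alert history. The sketch below assumes you have a plain list of monitor names, one entry per alert, for a known number of days; the data and function name are hypothetical.

```python
from collections import Counter

def noisy_monitors(alert_log, days, max_per_day=3):
    """Return (monitor, alerts_per_day) pairs above the limit, noisiest first."""
    counts = Counter(alert_log)
    rates = ((name, count / days) for name, count in counts.items())
    return sorted(
        (item for item in rates if item[1] > max_per_day),
        key=lambda item: item[1],
        reverse=True,
    )

# Hypothetical one-week alert history: one list entry per alert fired.
alert_log = (["disk space high"] * 35
             + ["cpu high"] * 14
             + ["api latency"] * 70)

noisy = noisy_monitors(alert_log, days=7)
print(noisy)  # [('api latency', 10.0), ('disk space high', 5.0)]
```

Monitors on this list are the first candidates for conversion to anomaly detection, muting, or a scheduled downtime.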
Real Example
Scenario: Your disk space monitor alerts every time backup jobs run because disk usage temporarily spikes to 78% during the backup. That is enough to trip the monitor, yet it is well below the 85% level you actually treat as a problem.
Before: 14 "disk space high" alerts per week, all false positives.
What you do: Switch the disk space monitor from static threshold to anomaly detection. Set it to alert only when disk usage deviates significantly from the usual pattern for that time of day and shows an upward trend (not a temporary spike).
After: 0-1 alerts per week for disk space, and when one fires it's a real event worth investigating.
Tips
- Start with Watchdog enabled and review it for 1-2 weeks before creating additional anomaly monitors — Watchdog often catches things you'd never think to monitor
- Tag your infrastructure consistently (environment: production, team: network) — anomaly detection and correlation work much better when hosts are properly tagged
- Review your 10 highest-volume monitors monthly and ask: "Are these actionable or noise?" Convert the noisy ones to anomaly detection
Tool interfaces change. If a button has moved, look for similarly named options (for example, ones labeled AI, anomaly, or smart) in the same menu area.