Use Datadog's AI to Reduce Alert Noise

Tool: Datadog

AI Feature: AI-powered anomaly detection and Watchdog

Time: 10-15 minutes

Difficulty: Beginner

What This Does

Datadog's AI features (Watchdog, anomaly detection, and alert correlation) surface genuine performance anomalies automatically, so you don't need to set a static threshold for every metric. The goal is to go from reviewing something like 150 alerts per day to seeing the 10 that actually matter. If you're already a Datadog customer, this is one of the highest-leverage changes you can make to cut alert noise.

Before You Start

You have a Datadog account with at least the Pro plan (Watchdog is included in Pro+)
Agents are installed on your monitored hosts
You're logged in to your Datadog account

Steps

1. Enable Watchdog for automatic anomaly detection

Watchdog is Datadog's core AI feature — it analyzes baseline behavior and surfaces unusual patterns.

Go to Monitors → Watchdog in the left navigation
If Watchdog isn't enabled, click Enable Watchdog
Select which services and infrastructure you want Watchdog to monitor
Set your notification channel (email, PagerDuty, Slack) for Watchdog alerts

What you should see: A Watchdog dashboard showing any currently detected anomalies. If your infrastructure is healthy, this should be mostly empty.

Troubleshooting: If you don't see the Watchdog option, check your Datadog plan — it requires Pro or higher. Contact your Datadog account rep if you're on an older plan.

2. Set up anomaly detection monitors instead of static threshold alerts

Static threshold alerts ("alert if CPU > 80%") fire on every routine peak, which produces a steady stream of false positives. An anomaly detection monitor alerts only when a metric deviates from its own normal pattern for that specific host and time of day.

Go to Monitors → Create Monitor
Select Anomaly Detection as the monitor type
Choose your metric (e.g., system.cpu.user for CPU utilization)
Select the scope (a specific host, service, or tag group)
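Set the algorithm: Basic (metrics with no repeating seasonal pattern), Agile (seasonal metrics whose level is expected to shift; it adjusts quickly to the new level), or Robust (seasonal metrics expected to stay stable; it is less swayed by short spikes and treats slow level shifts as anomalies)
Configure the alert condition: set the width of the expected band in deviations and how long values must stay outside it, that is, "alert when the metric is more than X deviations from normal for Y minutes"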

What you should see: A preview graph showing the "normal" band for that metric and where anomalies have occurred historically. The band is wider at expected peak usage times.
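The same monitor can also be described as a query string rather than clicked together in the UI. The sketch below only builds that string; it follows the general shape of Datadog's anomaly monitor query syntax (`anomalies(<metric>, '<algorithm>', <bounds>)`), but the class name, function name, and the host `web-01` are hypothetical, and the exact syntax should be checked against Datadog's monitor API documentation before use.

```python
from dataclasses import dataclass

# Algorithm names as they appear in the monitor editor, lowercased for the query.
ALGORITHMS = {"basic", "agile", "robust"}


@dataclass
class AnomalyMonitorSpec:
    """Hypothetical container for the choices made in step 2."""
    metric: str     # e.g. "system.cpu.user"
    scope: str      # tag scope, e.g. "host:web-01" (hypothetical host)
    algorithm: str  # one of ALGORITHMS
    bounds: int     # width of the expected band, in deviations
    window: str     # evaluation window, e.g. "last_4h"


def build_anomaly_query(spec: AnomalyMonitorSpec) -> str:
    """Build a query string in the assumed anomaly-monitor format."""
    if spec.algorithm not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {spec.algorithm!r}")
    if spec.bounds < 1:
        raise ValueError("bounds must be at least 1")
    return (
        f"avg({spec.window}):anomalies(avg:{spec.metric}{{{spec.scope}}}, "
        f"'{spec.algorithm}', {spec.bounds}) >= 1"
    )


spec = AnomalyMonitorSpec(
    metric="system.cpu.user",
    scope="host:web-01",
    algorithm="agile",
    bounds=2,
    window="last_4h",
)
print(build_anomaly_query(spec))
# prints: avg(last_4h):anomalies(avg:system.cpu.user{host:web-01}, 'agile', 2) >= 1
```

Keeping the choices in one small structure makes it easier to review a monitor's scope, algorithm, and band width side by side before creating it.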

3. Enable alert correlation with Event Management

Alert correlation groups related alerts into a single correlated event — so 50 disk space alerts from the same maintenance operation become 1 grouped alert.

Go to Events → Event Management
Click Correlations or Correlation Rules
Create a correlation rule that groups alerts by host tag, service, or infrastructure component
Set a time window (alerts within 5-15 minutes are typically related)

What you should see: Related alerts automatically grouped into a single correlated event with a summary of the affected scope.
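To make the grouping idea concrete, here is a minimal sketch of time-window correlation. It is not Datadog's implementation; the `Alert` class, the `correlate` function, and the host names are hypothetical. It groups alerts that share a host when each one fires within the window of the previous alert in that group.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    host: str
    title: str
    fired_at: datetime


def correlate(alerts, window=timedelta(minutes=10)):
    """Group alerts per host; a gap longer than `window` starts a new group."""
    groups = []
    latest_group = {}  # host -> most recent group for that host
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        group = latest_group.get(alert.host)
        if group and alert.fired_at - group[-1].fired_at <= window:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            latest_group[alert.host] = group
    return groups


start = datetime(2024, 1, 1, 2, 0)
# 50 disk alerts from one maintenance operation, 10 seconds apart.
alerts = [Alert("db-01", "disk space high", start + timedelta(seconds=10 * i))
          for i in range(50)]
alerts.append(Alert("web-01", "cpu high", start))
alerts.append(Alert("db-01", "disk space high", start + timedelta(hours=1)))

for group in correlate(alerts):
    print(group[0].host, group[0].title, len(group))
```

With this input the 50 disk alerts collapse into one group, while the unrelated host and the alert an hour later stay separate. Widening the window merges more alerts per group at the cost of occasionally joining unrelated ones.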

4. Review and refine noisy monitors

Go to Monitors → Manage Monitors
Click on any monitor that generates frequent alerts
Review the alert history: a monitor that fires more than 3-4 times per day is likely noisy
Use Mute for known maintenance periods or Downtime scheduling for recurring windows

What you should see: Over 2-3 weeks, your alert volume should decrease as you refine thresholds and enable correlation.
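If you can export the alert history, the "more than 3-4 times per day" check is easy to script. The sketch below assumes a hypothetical export that yields one monitor name per alert; the function name and the sample monitor names are illustrative only.

```python
from collections import Counter


def noisy_monitors(alert_names, days_observed, max_per_day=3.0):
    """Return (monitor, alerts_per_day) pairs above the rate, noisiest first.

    alert_names: iterable with one monitor name per alert fired.
    days_observed: length of the review period in days.
    """
    if days_observed <= 0:
        raise ValueError("days_observed must be positive")
    counts = Counter(alert_names)
    rates = [(name, total / days_observed) for name, total in counts.items()]
    noisy = [pair for pair in rates if pair[1] > max_per_day]
    return sorted(noisy, key=lambda pair: pair[1], reverse=True)


# One week of history: 28 disk alerts and 7 CPU alerts.
log = ["disk space high"] * 28 + ["cpu high"] * 7
print(noisy_monitors(log, days_observed=7))
# prints: [('disk space high', 4.0)]
```

Monitors that show up in this list are the first candidates for muting during maintenance, scheduled downtime, or conversion to anomaly detection.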

Real Example

Scenario: Your disk space monitor alerts every time backup jobs run because disk usage temporarily spikes to 78% during the backup. That trips the monitor even though 78% is well below the 85% level at which you would actually need to act.

Before: 14 "disk space high" alerts per week, all false positives.

What you do: Switch the disk space monitor from a static threshold to anomaly detection. Set it to alert only when disk usage deviates significantly from the usual pattern for that time of day and shows a sustained upward trend, not a temporary spike.

After: 0-1 alerts per week for disk space, and when one fires it's a real event worth investigating.
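The difference between the two approaches can be shown with synthetic data. This is an illustration of the idea only, not Datadog's algorithm: the 02:00 backup hour, the 75% static threshold, and the band parameters are all assumptions (the scenario does not state them), and the sketch covers the time-of-day comparison but not the upward-trend check.

```python
from statistics import mean, pstdev

BACKUP_HOUR = 2          # assumption: backups run at 02:00
STATIC_THRESHOLD = 75.0  # assumption: the old monitor's static threshold


def build_history(days=14):
    """Synthetic hourly disk usage (%): about 60 normally, about 78 during backup."""
    history = {hour: [] for hour in range(24)}
    for day in range(days):
        for hour in range(24):
            base = 78.0 if hour == BACKUP_HOUR else 60.0
            history[hour].append(base + (day % 3))  # small day-to-day variation
    return history


def static_alert(value):
    """Old behavior: alert whenever usage exceeds a fixed threshold."""
    return value > STATIC_THRESHOLD


def baseline_alert(history, hour, value, deviations=3.0, min_band=2.0):
    """Alert only if value falls outside the usual band for this hour of day."""
    samples = history[hour]
    band = max(deviations * pstdev(samples), min_band)
    return abs(value - mean(samples)) > band


history = build_history()
# Routine backup spike: the static threshold fires, the baseline check does not.
print(static_alert(78.0), baseline_alert(history, BACKUP_HOUR, 78.0))
# Unexpected 80% at 14:00: both fire, and this one is worth investigating.
print(static_alert(80.0), baseline_alert(history, 14, 80.0))
```

Because the baseline is computed per hour of day, the nightly backup spike counts as normal behavior, while the same usage level at an unusual hour still raises an alert.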

Tips

Start with Watchdog enabled and review it for 1-2 weeks before creating additional anomaly monitors — Watchdog often catches things you'd never think to monitor
Tag your infrastructure consistently using key:value tags (for example, env:production and team:network). Anomaly detection and correlation work much better when hosts are properly tagged
Review your 10 highest-volume monitors monthly and ask: "Are these actionable or noise?" Convert the noisy ones to anomaly detection
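Tag consistency is also simple to audit with a script. The sketch below assumes you have already exported a mapping of hosts to their `key:value` tags; the required keys, the function name, and the host names are hypothetical and should be adapted to your own tagging policy.

```python
REQUIRED_TAG_KEYS = {"env", "team"}  # hypothetical tagging policy


def missing_tags(host_tags):
    """Map each host to the required tag keys it lacks (hosts with none are omitted).

    host_tags: dict of host name -> list of "key:value" tag strings.
    """
    report = {}
    for host, tags in host_tags.items():
        present = {tag.split(":", 1)[0] for tag in tags if ":" in tag}
        missing = REQUIRED_TAG_KEYS - present
        if missing:
            report[host] = sorted(missing)
    return report


hosts = {
    "web-01": ["env:production", "team:network"],
    "db-01": ["env:production"],
    "cache-01": ["production"],  # bare value with no key
}
print(missing_tags(hosts))
# prints: {'db-01': ['team'], 'cache-01': ['env', 'team']}
```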

Tool interfaces change. If a button has moved, look for a similarly named option (for example, one labeled AI or smart) in the same menu area.