Your server hits 85% CPU at 3 AM from a scheduled backup, and your phone buzzes. You check, see it's nothing urgent, and go back to bed annoyed.
What if your system could check the time, evaluate severity, and decide whether to wake you up? That's what we're building.
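// n8n Code node: pick a route from the time of day and the alert severity.
// `alert` is assumed to be the incoming alert object; it is not defined in this snippet.
const hour = new Date().getUTCHours();
const isBusinessHours = hour >= 9 && hour < 17; // 9 AM to 5 PM UTC
const severity = alert.labels.severity;
return {
  shouldWakeUp: severity === 'critical' && !isBusinessHours,
  route: isBusinessHours ? 'slack' : (severity === 'critical' ? 'sms' : 'log')
};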
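To try the routing rules outside n8n, the same logic can be wrapped in a plain function. This is only a sketch: `routeAlert` is a hypothetical name, the clock is passed in as a parameter so the result is reproducible, and the 3 AM backup alert is assumed to carry a `warning` severity label.

```javascript
// Hypothetical wrapper around the routing rules above, runnable in plain Node.js.
function routeAlert(alert, now = new Date()) {
  const hour = now.getUTCHours();
  const isBusinessHours = hour >= 9 && hour < 17; // 9 AM to 5 PM UTC
  const severity = alert.labels.severity;
  return {
    shouldWakeUp: severity === 'critical' && !isBusinessHours,
    route: isBusinessHours ? 'slack' : (severity === 'critical' ? 'sms' : 'log'),
  };
}

// The 3 AM backup alert from the intro, assumed here to be labelled "warning".
const backupAlert = { labels: { severity: 'warning' } };
console.log(routeAlert(backupAlert, new Date('2024-01-15T03:00:00Z')));
// prints { shouldWakeUp: false, route: 'log' }
```

With these rules the backup alert is only logged, while a `critical` alert at the same hour would go out by SMS.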
AI-agent decision prompt:
Analyze the following Prometheus alert to determine if it should be auto-resolved by restarting the EC2 instance to handle issues like high CPU usage, especially when the team is unavailable. The context is:
- Alert Name: {{ $node["Code"].json["alertname"] }}
- Severity: {{ $node["Code"].json["severity"] }}
- Duration: {{ $node["Code"].json.durationMinutes }} minutes
- Business Hours: {{ $node["Code"].json["isBusinessHours"] }} (true if 9 AM–5 PM UTC, false otherwise)
- Description: {{ $node["Code"].json["description"] }}
Extract the CPU usage (X%) from the description, formatted as: "On <instance> at <alertname>: CPU usage is X%, Memory available is Y%, Swap usage is Z%, Disk I/O is A s, Network received is B MB/s, Latency is C s".
Decide to auto-resolve (restart the EC2 instance) if:
1. CPU usage > 80% AND outside business hours (isBusinessHours is false).
2. CPU usage > 90% AND duration < 5 minutes.
3. Severity is "critical" AND outside business hours (isBusinessHours is false).
Return only the following JSON object, with no additional text, explanations, or markdown:
{
"shouldAutoResolve": boolean,
"reason": "Explanation of the reason why this action should or should not be auto-resolved, referencing CPU usage, duration, severity, and business hours if relevant."
}
- If shouldAutoResolve is true, a Lambda function will be triggered to restart the EC2 instance.
- If shouldAutoResolve is false, no restart will occur.
- Keep the reason concise and clear, referencing the specific criteria met or not met.
- If CPU usage cannot be extracted, assume 0% and include it in the reason.
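The three rules in the prompt are simple enough to evaluate in plain code too, which may be useful as a sanity check on the model's answer before the Lambda is triggered. The sketch below is not part of the published workflow: the function names are hypothetical, and the input fields are assumed to match the values the prompt interpolates (`severity`, `durationMinutes`, `isBusinessHours`, `description`).

```javascript
// Extract X from "CPU usage is X%"; fall back to 0 as the prompt instructs.
function extractCpuUsage(description) {
  const match = /CPU usage is\s+(\d+(?:\.\d+)?)%/.exec(description ?? '');
  return match ? Number(match[1]) : 0;
}

// Apply the prompt's three auto-resolve rules deterministically.
function decideAutoResolve({ severity, durationMinutes, isBusinessHours, description }) {
  const cpu = extractCpuUsage(description);
  const reasons = [];
  if (cpu > 80 && !isBusinessHours) {
    reasons.push(`CPU usage ${cpu}% > 80% outside business hours`);
  }
  if (cpu > 90 && durationMinutes < 5) {
    reasons.push(`CPU usage ${cpu}% > 90% for under 5 minutes`);
  }
  if (severity === 'critical' && !isBusinessHours) {
    reasons.push('severity is critical outside business hours');
  }
  return {
    shouldAutoResolve: reasons.length > 0,
    reason: reasons.length > 0
      ? reasons.join('; ')
      : `No rule matched (CPU usage ${cpu}%, severity ${severity})`,
  };
}

// Check that the model's reply has the JSON shape the prompt asks for.
function parseDecision(raw) {
  const parsed = JSON.parse(raw); // throws on non-JSON output such as markdown
  if (typeof parsed.shouldAutoResolve !== 'boolean' || typeof parsed.reason !== 'string') {
    throw new Error('Unexpected decision shape');
  }
  return parsed;
}
```

One possible use is to restart the instance only when `parseDecision` succeeds and its `shouldAutoResolve` agrees with `decideAutoResolve` for the same alert.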
For common issues, let the system auto-fix:
Total cost: $0
https://github.com/Bubobot-Team/automation-workflow-monitoring
https://github.com/Bubobot-Team/monitoring-stack
Check our blog for detailed setup: https://bubobot.com/blog/automated-incident-response-workflows-with-n8n-and-monitoring-tools
Feel free to share the workflows you'd like to see. We're still improving ours and will share more with the community!