Smart Incident Response with n8n, Prometheus and Lambda

Your server hits 85% CPU at 3 AM from a scheduled backup, and your phone buzzes. You check, see it's nothing urgent, and go back to bed annoyed.

What if your system could check the time, evaluate severity, and decide whether to wake you up? That's what we're building.

And for some common problems, it can apply the fix automatically.

The 3-Step Smart Response System

Receive alerts from monitoring
Analyze context: time, severity, business impact
Route intelligently: SMS for emergencies, Slack for business hours

Building It with n8n (Free Tool)

Step 1: Smart Alert Processing

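// n8n Code node. Assumes the Webhook node passes an Alertmanager payload in `body`.
const alert = $input.first().json.body.alerts[0];
const hour = new Date().getUTCHours();
const isBusinessHours = hour >= 9 && hour < 17; // 9 AM to 5 PM UTC
const severity = alert.labels.severity;

// Routing follows the rules listed in Step 2
let route = 'log';
if (severity === 'critical') route = isBusinessHours ? 'slack' : 'sms';
else if (severity === 'warning') route = 'slack';

return [{ json: {
  shouldWakeUp: severity === 'critical' && !isBusinessHours,
  route, severity, isBusinessHours
} }];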
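The AI prompt in Step 3 reads alertname, severity, durationMinutes, isBusinessHours and description from the Code node. Here is a minimal sketch of how those fields could be derived from one alert, assuming the Alertmanager-style fields `labels`, `annotations` and `startsAt` are present. `buildContext` is a hypothetical helper name, not something n8n provides.

```javascript
// Hypothetical helper: flatten one Alertmanager-style alert into the fields
// the later steps read. Assumes `labels`, `annotations` and `startsAt` exist.
function buildContext(alert, now = new Date()) {
  const hour = now.getUTCHours();
  const startedAt = new Date(alert.startsAt);
  return {
    alertname: alert.labels.alertname,
    severity: alert.labels.severity,
    description: (alert.annotations && alert.annotations.description) || '',
    // Whole minutes the alert has been firing
    durationMinutes: Math.max(0, Math.floor((now - startedAt) / 60000)),
    isBusinessHours: hour >= 9 && hour < 17, // 9 AM to 5 PM UTC
  };
}

// Made-up sample alert for illustration
const sample = {
  labels: { alertname: 'HighCPU', severity: 'critical' },
  annotations: { description: 'CPU usage is 92%' },
  startsAt: '2024-01-01T02:50:00Z',
};
console.log(buildContext(sample, new Date('2024-01-01T03:00:00Z')));
```

Passing `now` as a parameter keeps the function easy to test with fixed timestamps.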

Step 2: Smart Routing Rules

Critical + After Hours → SMS
Critical + Business Hours → Slack (urgent)
Warning → Slack (no urgency)
Info → Log for morning review
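The four rules above can be sketched as one small function. This is only an illustration: the channel names and the `urgent` flag are assumed labels for this example, not n8n node names.

```javascript
// Map (severity, isBusinessHours) to a channel, following the four rules above.
// Channel names and the `urgent` flag are illustrative labels.
function routeAlert(severity, isBusinessHours) {
  switch (severity) {
    case 'critical':
      return isBusinessHours
        ? { channel: 'slack', urgent: true }
        : { channel: 'sms', urgent: true };
    case 'warning':
      return { channel: 'slack', urgent: false };
    default:
      // 'info' and anything unrecognised waits for the morning review
      return { channel: 'log', urgent: false };
  }
}

console.log(routeAlert('critical', false));
console.log(routeAlert('warning', true));
```

Sending unknown severities to the log is a conservative default: nothing is dropped, and nobody is paged for a label the workflow does not understand.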

Step 3: Auto-Resolution with Lambda, Driven by an AI-Agent Decision

AI-agent decision prompt:

Analyze the following Prometheus alert to determine if it should be auto-resolved by restarting the EC2 instance to handle issues like high CPU usage, especially when the team is unavailable. The context is:

- Alert Name: {{ $node["Code"].json["alertname"] }}
- Severity: {{ $node["Code"].json["severity"] }}
- Duration: {{ $node["Code"].json.durationMinutes }} minutes
- Business Hours: {{ $node["Code"].json["isBusinessHours"] }} (true if 9 AM–5 PM UTC, false otherwise)
- Description: {{ $node["Code"].json["description"] }}

Extract the CPU usage (X%) from the description, which is formatted as: "On <instance> at <alertname>: CPU usage is X%, Memory available is Y%, Swap usage is Z%, Disk I/O is A s, Network received is B MB/s, Latency is C s".

Decide to auto-resolve (restart the EC2 instance) if:
1. CPU usage > 80% AND outside business hours (isBusinessHours is false).
2. CPU usage > 90% AND duration >= 5 minutes.
3. Severity is "critical" AND outside business hours (isBusinessHours is false).

Return only the following JSON object, with no additional text, explanations, or markdown:
{
  "shouldAutoResolve": boolean,
  "reason": "Explanation of the reason why this action should or should not be auto-resolved, referencing CPU usage, duration, severity, and business hours if relevant."
}

- If shouldAutoResolve is true, a Lambda function will be triggered to restart the EC2 instance.
- If shouldAutoResolve is false, no restart will occur.
- Keep the reason concise and clear, referencing the specific criteria met or not met.
- If CPU usage cannot be extracted, assume 0% and include it in the reason.
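Because the agent's reply decides whether an instance gets restarted, it is worth validating that reply before acting on it. The sketch below shows one way to extract the CPU value from a description in the format above, and to parse the reply defensively. Both function names are hypothetical, and treating any malformed reply as "do not restart" is an assumption of this sketch.

```javascript
// Pull the CPU percentage out of a description shaped like the template above.
// Returns 0 when nothing matches, mirroring the prompt's fallback rule.
function extractCpu(description) {
  const match = /CPU usage is (\d+(?:\.\d+)?)%/.exec(description || '');
  return match ? Number(match[1]) : 0;
}

// Parse the agent's reply defensively: anything that is not the expected
// JSON shape is treated as "do not restart".
function parseDecision(replyText) {
  try {
    const parsed = JSON.parse(replyText);
    if (typeof parsed.shouldAutoResolve === 'boolean') {
      return {
        shouldAutoResolve: parsed.shouldAutoResolve,
        reason: String(parsed.reason || ''),
      };
    }
  } catch (err) {
    // Invalid JSON or a null reply: fall through to the safe default
  }
  return { shouldAutoResolve: false, reason: 'Unparseable agent reply' };
}

console.log(extractCpu('On web-1 at HighCPU: CPU usage is 92.5%, Memory available is 10%'));
console.log(parseDecision('{"shouldAutoResolve": true, "reason": "CPU 92.5% outside business hours"}'));
```

Failing closed means a chatty or truncated model reply can never trigger a restart by accident.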

For common issues, let the system auto-fix:

CPU > 90% for 5+ minutes outside business hours → Auto-restart
Memory leak patterns → Clear cache automatically
Disk full → Clean temp files
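For the restart case, the Lambda side might look roughly like the sketch below. The event shape (`instanceId`, `shouldAutoResolve`) and the `createHandler` name are assumptions made for illustration. The reboot call is injected, so the guard logic can be run without AWS credentials. The cache and disk cases would need their own actions and are not covered here.

```javascript
// Build a Lambda-style handler around an injected `rebootInstance` function.
// The event shape ({ instanceId, shouldAutoResolve }) is an assumption.
function createHandler(rebootInstance) {
  return async function handler(event) {
    if (!event || event.shouldAutoResolve !== true) {
      return { restarted: false, reason: 'Auto-resolve not requested' };
    }
    if (typeof event.instanceId !== 'string' || !event.instanceId.startsWith('i-')) {
      return { restarted: false, reason: 'Missing or malformed instanceId' };
    }
    await rebootInstance(event.instanceId);
    return { restarted: true, instanceId: event.instanceId };
  };
}

// Wiring with the AWS SDK for JavaScript v3 inside Lambda might look like:
//   const { EC2Client, RebootInstancesCommand } = require('@aws-sdk/client-ec2');
//   const ec2 = new EC2Client({});
//   exports.handler = createHandler((id) =>
//     ec2.send(new RebootInstancesCommand({ InstanceIds: [id] })));
```

Keeping the AWS call behind a function argument also makes it straightforward to swap the reboot for a dry run while the workflow is still being tested.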

Your 15-Minute Setup

Install n8n (one Docker command)
Create webhook endpoint for alerts
Add time-based routing logic
Connect your monitoring
Test with controlled alerts
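For the last step, a controlled alert can be sent by hand instead of waiting for a real incident. The sketch below builds a minimal Alertmanager-style payload and posts it to a webhook. The alert names, the instance label and the example URL are placeholders, and `fetch` is assumed to be available globally (Node 18 or later).

```javascript
// Build a minimal Alertmanager-style webhook body for a controlled test.
// Label and annotation values are made up for the test.
function buildTestPayload(severity, cpuPercent, startsAt = new Date().toISOString()) {
  return {
    status: 'firing',
    alerts: [{
      status: 'firing',
      labels: { alertname: 'HighCPUTest', severity, instance: 'test-instance' },
      annotations: {
        description: `On test-instance at HighCPUTest: CPU usage is ${cpuPercent}%`,
      },
      startsAt,
    }],
  };
}

// POST the payload to an n8n webhook and return the HTTP status code.
async function sendTestAlert(webhookUrl, payload) {
  const res = await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  return res.status;
}

// Example (the URL is a placeholder for your own webhook):
// sendTestAlert('http://localhost:5678/webhook-test/alerts',
//   buildTestPayload('warning', 85)).then(console.log);
```

Sending one payload per severity, both inside and outside business hours, exercises every routing rule before real alerts are connected.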

Total cost: $0

Sample code

https://github.com/Bubobot-Team/automation-workflow-monitoring

https://github.com/Bubobot-Team/monitoring-stack

Check our blog for detailed setup: https://bubobot.com/blog/automated-incident-response-workflows-with-n8n-and-monitoring-tools