

title: Active Incident — Hand Off Monitor Failures to Your AI Agent
description: How UpMonitor's Active Incident panel ships a copy-pasteable AI prompt for every open issue — portable across Claude Code, Cursor, Claude Desktop, and any MCP-compatible agent.
keywords: incident response AI agent, AI uptime monitoring, MCP server uptime, fix DNSSEC failure with AI, handoff infrastructure alerts to AI, AI SRE
ogTitle: Active Incident — Hand Off Monitor Failures to Your AI Agent
ogDescription: One panel. One paste. Your AI agent already has the context.

Active Incident — Hand Off Monitor Failures to Your AI Agent in One Click

An Active Incident is the single open issue that UpMonitor surfaces at the top of every monitor when severity-aware health flips from green to warning or red. It bundles the AI-generated root cause, three-step remediation, estimated impact, and a copy-pasteable prompt — designed to be dispatched directly into the AI agent you already use (Claude Code, Cursor, Claude Desktop, ChatGPT, or any MCP-compatible client) so the agent can take over with zero per-incident setup.

The panel renders only while the incident is open. The moment a follow-up check returns health: green, the panel disappears — there is no extra acknowledge or close button to manage.

Why this exists: portable handoff vs. walled-garden agents

Existing AI-SRE platforms either own the agent or only route alerts to one. UpMonitor is the only platform where the incident itself produces the prompt, so the engineer's existing AI agent takes over without per-incident configuration. The contrast:

  • PagerDuty MCP plugins (Claude Code, Cursor) expose incident-management primitives — page on-call, log to timeline — but leave diagnosis to the human before the agent is invoked. (PagerDuty announcement)
  • Datadog Agent Builder lets you describe NL agents that "interpret observability data" — powerful but configuration-heavy, requiring per-agent goals, prompts, and tool grants. (Datadog Agent Builder)
  • Better Stack Agentic AI SRE investigates and writes post-mortems autonomously — but the loop runs inside their walled garden, not in the engineer's tool of choice. (Better Stack)

UpMonitor's Active Incident panel ships one copy-pasteable prompt per open incident. Paste it into any MCP-aware agent. The agent fetches structured context via the UpMonitor MCP server and walks the user through the fix.

What the panel contains

The panel renders in a fixed structure — read top-to-bottom for the fastest path from "alert received" to "fix in motion."

Severity icon, checker name, age

The header surfaces the failing checker (e.g. dnssec, ssl, http) and the time since the incident opened (e.g. Open 2h 14m). The icon turns red for red-tier failures (SSL, DNS, DNSSEC, HTTP) and amber for warning-tier (security headers, compression, content). Severity ceilings are declared per-checker in the checker registry, not derived from individual check pass/fail counts.
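The per-checker severity ceiling described above can be sketched as a small registry lookup. This is a minimal illustration, assuming hypothetical names; it is not UpMonitor's actual schema:

```typescript
// Hypothetical sketch of a per-checker severity registry; the names
// and shape here are illustrative, not UpMonitor's real code.
type Severity = "green" | "warning" | "red";

interface CheckerEntry {
  name: string;
  ceiling: Severity; // worst health a failure of this checker can cause
}

const CHECKER_REGISTRY: CheckerEntry[] = [
  { name: "ssl", ceiling: "red" },
  { name: "dns", ceiling: "red" },
  { name: "dnssec", ceiling: "red" },
  { name: "http", ceiling: "red" },
  { name: "security-headers", ceiling: "warning" },
  { name: "compression", ceiling: "warning" },
  { name: "content", ceiling: "warning" },
];

// The severity is declared per-checker, not derived from pass/fail counts:
// a failing checker can never push health past its registered ceiling.
function healthFor(checker: string, failed: boolean): Severity {
  if (!failed) return "green";
  const entry = CHECKER_REGISTRY.find((c) => c.name === checker);
  return entry ? entry.ceiling : "warning";
}
```

The point of the registry is that a security-headers regression, no matter how many checks it fails, stays amber, while a single DNSSEC failure is red.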

Root cause

A single human-readable paragraph explaining what is broken and why. Generated by Vertex AI Gemini 3.1 Pro from the failing checker's structured output, baseline regression context, and a server-side Lighthouse run. Cached for 24 hours per (monitorId, contributingChecker, regressionBucket) signature, so identical incidents do not re-burn quota.
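The 24-hour cache keyed by the (monitorId, contributingChecker, regressionBucket) signature could look like the following sketch; the field and function names are assumptions for illustration, not UpMonitor's implementation:

```typescript
// Illustrative sketch of the 24-hour diagnosis cache; names are assumed.
interface DiagnosisSignature {
  monitorId: string;
  contributingChecker: string;
  regressionBucket: string;
}

const TTL_MS = 24 * 60 * 60 * 1000; // 24-hour replay window
const cache = new Map<string, { rootCause: string; createdAt: number }>();

function signatureKey(sig: DiagnosisSignature): string {
  return `${sig.monitorId}|${sig.contributingChecker}|${sig.regressionBucket}`;
}

function cacheDiagnosis(sig: DiagnosisSignature, rootCause: string, now: number): void {
  cache.set(signatureKey(sig), { rootCause, createdAt: now });
}

// Identical-signature incidents within 24h replay the prior diagnosis
// instead of re-burning AI quota.
function getCachedDiagnosis(sig: DiagnosisSignature, now: number): string | null {
  const hit = cache.get(signatureKey(sig));
  return hit && now - hit.createdAt < TTL_MS ? hit.rootCause : null;
}
```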

Ask Your AI Agent To Fix It

The primary CTA. Two buttons:

  1. Copy Prompt — copies a structured prompt to the clipboard containing the incident ID, the failing URL, the contributing checker, the root cause, and a directive to walk the user through remediation and re-verify.
  2. upmonitor://monitor/<id> — opens the incident in the user's installed CLI handler (registered by upmonitor register after running npm i -g @upmonitor/cli).

Both paths produce the same outcome: the agent receives the full context object via MCP and acts on it.

Or fix it yourself

Collapsed by default. Expands to a numbered list of remediation steps, each with a one-line rationale so the engineer learns why each step matters. The estimated-impact line at the bottom states the user-visible outcome of completing the fix.

When does the incident open and close?

The incident opens when UpMonitor's evaluateRegionalHealth algorithm flips a monitor's persisted health from green to warning or red AND a successful AI diagnosis lands on the triggering result. The transition is edge-triggered — it fires once at the boundary, not on every subsequent failing check.

The incident closes when the next check returns health: green. The panel disappears, the alerts stop, the monitor.health field on the persisted document flips back. There is no manual acknowledgment workflow because the source of truth is the live signal, not a human-managed state machine.
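The edge-triggered lifecycle above can be sketched as a pure transition function. This is an illustration under the rules stated in the text (open fires once at the boundary, gated on a landed diagnosis; close fires on the first green), not UpMonitor's actual code:

```typescript
// Sketch of the edge-triggered open/close behavior; names are illustrative.
type Health = "green" | "warning" | "red";

function incidentEdge(
  prev: Health,
  next: Health,
  diagnosisReady: boolean
): "open" | "close" | "none" {
  if (prev === "green" && next !== "green") {
    // Open fires once, at the green → warning/red edge, and only
    // once an AI diagnosis has landed on the triggering result.
    return diagnosisReady ? "open" : "none";
  }
  if (prev !== "green" && next === "green") return "close";
  return "none"; // repeated failures (or repeated greens) do nothing
}
```

Because only the edges produce actions, a monitor that fails fifty checks in a row opens exactly one incident.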

Three real-world incidents that prove the model

The following three outages, drawn from public post-mortems published over the last 30 months, illustrate incidents the Active Incident handoff would have shortened.

Cloudflare DNSSEC root-zone signature expiry — Oct 4, 2023

DNSSEC signatures in the root zone expired with no fresh version available, and Cloudflare resolvers began returning SERVFAIL for validation-enabled queries. The trust chain failed silently — clients saw timeouts, resolvers saw SERVFAIL, and operators chased the wrong cause for hours. (IANIX DNSSEC outage log)

How the Active Incident model would have shortened it: a dnssec checker probing the resolver chain emits structured failure data (code: NO_DNSSEC_RECORDS_FOUND or code: ROOT_ZONE_SIGNATURE_EXPIRED) within seconds of the first failed validation. The AI Diagnostic ships the root cause and the agent has the full incident object — including the upstream resolver's exact SERVFAIL reason — without the operator opening a single dashboard.
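The structured payload such a checker might emit could look like the following sketch. Only the code values are quoted from the text above; every other field name and the timestamp are illustrative assumptions:

```json
{
  "checker": "dnssec",
  "code": "ROOT_ZONE_SIGNATURE_EXPIRED",
  "resolverResponse": "SERVFAIL",
  "observedAt": "2023-10-04T08:00:00Z",
  "detail": "RRSIG for the root zone expired; no fresh signature available upstream"
}
```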

Cloudflare Bot Management outage — Nov 18, 2025

A change to ClickHouse permissions silently altered how a query behaved, returning duplicate rows when generating the feature file used by the bot-scoring ML model. Cascading HTTP 500s spread across thousands of customer sites for 3+ hours before the team correlated the permission change to the corrupted file. (Cloudflare post-mortem index)

How the Active Incident model would have shortened it: an http checker emits 5xx status with the specific edge worker error code. The handoff prompt includes recent change context (via MCP), allowing the agent to query the change log, correlate the permission change, and propose the rollback before the operator finishes reading the alert.

Cloudflare dashboard/API outage — Sept 12, 2025

The Cloudflare dashboard and API tier went down, leaving SREs unable to access vendor tooling during the outage they needed to diagnose. (Cloudflare deep-dive)

How the Active Incident model would have shortened it: the upmonitor://monitor/<id> deep link drops the engineer straight into the local CLI, which fetches incident state from UpMonitor's independent infrastructure. The handoff prompt works without dependence on any vendor dashboard — including UpMonitor's own.

How the prompt is constructed

Each prompt follows the same fixed skeleton, deterministically built server-side from the incident object:

Use the upmonitor MCP server. Fetch incident <monitorId> on <monitorUrl>.
The <relatedChecker> checker is failing.

Root cause:
<rootCause>

Walk me through the remediation steps and apply them where you can.
Verify the fix by re-running the checker on the monitor when done.

The skeleton is intentionally tool-agnostic — it does not assume Claude, Cursor, or any specific runtime. The MCP-call directive is the only line that requires agent-side capability; any agent without MCP support can still act on the human-readable root cause and remediation context the prompt carries.
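A deterministic builder for the skeleton above might look like the following sketch. The incident field names are assumptions chosen to mirror the placeholders in the skeleton, not UpMonitor's real types:

```typescript
// Sketch of a server-side prompt builder matching the skeleton above;
// the Incident shape is assumed, not UpMonitor's actual type.
interface Incident {
  monitorId: string;
  monitorUrl: string;
  relatedChecker: string;
  rootCause: string;
}

function buildPrompt(i: Incident): string {
  return [
    `Use the upmonitor MCP server. Fetch incident ${i.monitorId} on ${i.monitorUrl}.`,
    `The ${i.relatedChecker} checker is failing.`,
    ``,
    `Root cause:`,
    i.rootCause,
    ``,
    `Walk me through the remediation steps and apply them where you can.`,
    `Verify the fix by re-running the checker on the monitor when done.`,
  ].join("\n");
}
```

Because the builder is a pure function of the incident object, the same incident always yields byte-identical prompts, which keeps the Copy Prompt button and the deep-link path in sync.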

Integration paths for your AI agent

Three install paths cover the common AI surfaces. The full per-tool walkthrough lives at /docs/agentic-setup.

  • Claude Code: claude mcp add upmonitor -e UPMONITOR_API_KEY=up_... -- npx -y @upmonitor/mcp-server
  • Cursor: Settings → MCP → Add Server with the standard MCP block
  • Claude Desktop: edit claude_desktop_config.json per the MCP integration guide
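For Claude Desktop, the entry in claude_desktop_config.json would follow the standard mcpServers shape. This sketch reuses the package and env var from the Claude Code command above; check the MCP integration guide for the authoritative block:

```json
{
  "mcpServers": {
    "upmonitor": {
      "command": "npx",
      "args": ["-y", "@upmonitor/mcp-server"],
      "env": { "UPMONITOR_API_KEY": "up_..." }
    }
  }
}
```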

After install, the very next Active Incident panel you encounter becomes a one-click handoff.

Required permissions and quotas

The Active Incident pipeline consumes one slot from your monthly AI Diagnostic quota per generated diagnosis (Free 5/mo, PRO 50/mo, Agency unlimited). Cached diagnoses do not consume quota — identical-signature incidents within a 24-hour window replay the prior diagnosis. The 15-minute cooldown per monitor prevents quota burn during flapping incidents.
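The metering rules above reduce to a small gate. The plan limits, cache exemption, and 15-minute cooldown come from the text; the function shape and names are illustrative assumptions:

```typescript
// Sketch of the diagnosis quota-and-cooldown gate; names are assumed.
const MONTHLY_QUOTA: Record<string, number> = {
  free: 5,
  pro: 50,
  agency: Infinity,
};
const COOLDOWN_MS = 15 * 60 * 1000; // per-monitor guard against flapping

function canGenerateDiagnosis(
  plan: "free" | "pro" | "agency",
  usedThisMonth: number,
  lastDiagnosisAt: number | null, // per-monitor timestamp, ms
  now: number,
  cacheHit: boolean
): boolean {
  if (cacheHit) return true; // cached replays never consume quota
  if (lastDiagnosisAt !== null && now - lastDiagnosisAt < COOLDOWN_MS) return false;
  return usedThisMonth < MONTHLY_QUOTA[plan];
}
```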

The handoff prompt and upmonitor:// deep link are unmetered — they ship with every successful diagnosis at no marginal cost.