AI Agent Observability and Incident Response Skill
Instrument AI agent systems with high-signal telemetry and runbook-driven incident response for reliability and safety.
Open the source and read safety notes before installing.
Prerequisites
- Runtime access where agent requests can be instrumented
- Centralized logging/metrics/tracing destination
- On-call or owner process for incident handling
Schema details
- Install type: package
- Reading time: 7 min
- Difficulty score: 76
- Troubleshooting: Yes
- Breaking changes: No
- Package verified: Yes
- SHA-256: be991706421c0df692082e6109da5ef9b3479fa7fdcdf95809c8eeaff3b892b4
- Skill type: general
- Skill level: advanced
- Verification: draft
- Verified at: 2026-04-10
| Platform | Support | Install path |
|---|---|---|
| claude-code | Native | .claude/skills/<skill-name>/SKILL.md |
| codex | Native | .agents/skills/<skill-name>/SKILL.md |
| windsurf | Native | .windsurf/skills/<skill-name>/SKILL.md |
| gemini | Native | .gemini/skills/<skill-name>/SKILL.md or .agents/skills/<skill-name>/SKILL.md |
| cursor | Adapter | .cursor/rules/<skill-name>.mdc |
| cli | Manual | AGENTS.md or tool-specific context file |
Full copyable content
# Trigger
"Apply AI agent observability and incident response skill to this stack."
# Required output
1) Telemetry schema (traces, metrics, logs, events)
2) Core SLOs and alert thresholds
3) Incident triage playbook
4) Post-incident review template
About this resource
Overview
This skill turns AI agent operations into an observable system with measurable reliability. It defines what to log, what to measure, and what to alert on, so that incidents can be resolved quickly and through a consistent process.
Compatibility
Native
- Claude Code / Claude: native skill usage via SKILL.md.
- Codex/OpenAI workflows: compatible with Agent Skills-style SKILL.md content as reusable workflow instructions.
Manual Adaptation
- Gemini CLI: native skill usage via .gemini/skills/<skill-name>/SKILL.md or .agents/skills/<skill-name>/SKILL.md where supported.
- Cursor: use the generated .cursor/rules/*.mdc adapter for project rules.
- OpenClaw and similar agents: use the same skill content as a reusable prompt/workflow file when native skill import is unavailable.
Prerequisites
- Access to request/response lifecycle in your agent runtime
- Structured logging support
- Ability to tag traces/events by workflow and model
What to Instrument
- Prompt and tool execution spans (with redaction-safe metadata)
- Latency percentiles by route/workflow/model
- Error classes: model timeout, tool failure, policy denial, parse failure
- Safety events: blocked actions, suspicious prompt patterns, auth failures
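The items above can be captured in a single structured span event. The following is a minimal sketch using only the Python standard library, not a schema defined by this skill: the names SpanEvent, ErrorClass, and redact, and every field name, are hypothetical. It assumes that "redaction-safe metadata" means storing a length and a digest instead of the raw prompt text.

```python
import hashlib
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Optional


class ErrorClass(Enum):
    # The four error classes named in the list above.
    MODEL_TIMEOUT = "model_timeout"
    TOOL_FAILURE = "tool_failure"
    POLICY_DENIAL = "policy_denial"
    PARSE_FAILURE = "parse_failure"


def redact(text: str) -> dict:
    """Replace raw text with its length and SHA-256 digest (assumed redaction policy)."""
    return {
        "length": len(text),
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }


@dataclass
class SpanEvent:
    # Hypothetical field set: tags that allow slicing by workflow and model.
    workflow_id: str
    model: str
    span_kind: str  # "prompt" or "tool"
    latency_ms: float
    prompt_meta: dict  # output of redact(), never the raw prompt
    tool_name: Optional[str] = None
    error_class: Optional[ErrorClass] = None
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize to one JSON log line, with the enum flattened to its string value."""
        payload = asdict(self)
        payload["error_class"] = self.error_class.value if self.error_class else None
        return json.dumps(payload, sort_keys=True)


event = SpanEvent(
    workflow_id="support-triage",
    model="example-model",
    span_kind="tool",
    latency_ms=812.0,
    prompt_meta=redact("raw user prompt goes here"),
    tool_name="search",
    error_class=ErrorClass.TOOL_FAILURE,
)
print(event.to_json())
```

Emitting one JSON object per span keeps the event easy to index by request ID, workflow, model, and error class in whichever logging backend is in use.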
How to Use This Skill
Prompt Pattern
Apply the AI agent observability and incident response skill.
Provide:
1) telemetry contract,
2) SLO definitions,
3) alert routing matrix,
4) incident playbook with triage steps.
Execution Flow
- Define critical user journeys and reliability targets.
- Add telemetry fields needed for fast diagnosis.
- Create alert thresholds aligned to user impact.
- Build runbooks with owner, severity, and escalation path.
- Run incident drills to validate the runbooks before production launch.
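One way to make the runbook step checkable before launch is a small gap check that confirms every alert has a runbook with an owner, a severity, and an escalation path. This is a sketch under assumptions: the Runbook type, the SEVERITIES scale, and find_runbook_gaps are hypothetical names, not part of this skill or of any alerting tool.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical severity scale; substitute your organization's own levels.
SEVERITIES = {"sev1", "sev2", "sev3"}


@dataclass(frozen=True)
class Runbook:
    alert_name: str
    owner: str
    severity: str
    escalation_path: List[str]
    runbook_url: str


def find_runbook_gaps(alert_names: List[str], runbooks: Dict[str, Runbook]) -> Dict[str, str]:
    """Return a map of alert name to the first problem found; empty means no gaps."""
    gaps: Dict[str, str] = {}
    for name in alert_names:
        runbook = runbooks.get(name)
        if runbook is None:
            gaps[name] = "no runbook"
        elif not runbook.owner:
            gaps[name] = "no owner"
        elif runbook.severity not in SEVERITIES:
            gaps[name] = "unknown severity"
        elif not runbook.escalation_path:
            gaps[name] = "no escalation path"
    return gaps
```

Running a check like this in CI, against the list of alerts actually configured, would be one way to catch an alert that pages someone with no documented next step.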
Troubleshooting
Issue: Alert fatigue from noisy thresholds
Fix: Alert on sustained error-budget burn, not on single spikes.
Issue: Logs are present but not useful
Fix: Standardize event schema (request ID, workflow ID, tool name, failure reason).
Issue: Incidents take too long to triage
Fix: Add direct links from alerts to trace dashboards and runbook sections.
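The first fix above can be sketched as a two-window burn-rate check, which pages only when both a short and a long window are consuming the error budget faster than a threshold. The function names, the 99% SLO target, and the 14.4 threshold are illustrative assumptions, not values prescribed by this skill; tune them to your own SLOs and window lengths.

```python
from typing import Tuple

Window = Tuple[int, int]  # (error_count, request_count) observed in a time window


def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error rate divided by the allowed error rate.

    A value of 1.0 means the budget is being spent exactly at the sustainable pace.
    """
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def should_page(
    short_window: Window,
    long_window: Window,
    slo_target: float = 0.99,  # assumed SLO: 99% of requests succeed
    threshold: float = 14.4,   # assumed burn-rate threshold
) -> bool:
    """Page only if both windows exceed the threshold, so a brief spike alone does not fire."""
    short_burn = burn_rate(short_window[0], short_window[1], slo_target)
    long_burn = burn_rate(long_window[0], long_window[1], slo_target)
    return short_burn >= threshold and long_burn >= threshold


# A brief spike: the short window is bad, the long window is healthy.
print(should_page(short_window=(50, 100), long_window=(50, 10_000)))
# A sustained problem: both windows are burning budget quickly.
print(should_page(short_window=(20, 100), long_window=(2_000, 10_000)))
```

Requiring the long window to agree is what filters out single spikes, while the short window lets the alert clear soon after the error rate recovers.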
Knowledge Freshness
Treat tooling details as time-sensitive. Re-validate APIs, limits, pricing, auth models, and deployment flags immediately before implementation. If docs conflict with prior memory, follow current official docs and release notes.
Output Contract
- Return a concrete plan with implementation order.
- Provide production-ready commands/config/code snippets (not placeholders).
- Include explicit assumptions and unresolved risks.
- Include a verification checklist with pass/fail criteria.
Quality Gates
- All commands are copy/paste ready.
- Security-sensitive steps call out secret handling and least privilege.
- Version-sensitive guidance cites current docs used.
- Rollback path is included for risky changes.
- Final output includes quick validation commands/tests.