
AI Agent Observability and Incident Response Skill

Instrument AI agent systems with high-signal telemetry and runbook-driven incident response for reliability and safety.

by JSONbored · added 2026-04-10

Harness: Claude Code, Codex, Windsurf, Gemini, Cursor, CLI

Level: advanced · Type: general · Verified: draft

Install

curl -L https://heyclau.de/downloads/skills/ai-agent-observability-incident-response.zip -o ai-agent-observability-incident-response.zip && unzip -o ai-agent-observability-incident-response.zip -d ./ai-agent-observability-incident-response

Readiness

Trust: Trusted
Source: first-party
Safety notes: Missing
Reviewed: Yes


Review first: open the source and read the safety notes before installing.

Prerequisites

Runtime access where agent requests can be instrumented
Centralized logging/metrics/tracing destination
On-call or owner process for incident handling

Schema details

Install type: package
Reading time: 7 min
Difficulty score: 76
Troubleshooting: Yes
Breaking changes: No

Package metadata

Download URL: /downloads/skills/ai-agent-observability-incident-response.zip
Package verified: Yes
SHA-256: be991706421c0df692082e6109da5ef9b3479fa7fdcdf95809c8eeaff3b892b4

Skill and platform metadata

Skill type: general
Skill level: advanced
Verification: draft
Verified at: 2026-04-10

Retrieval sources

https://opentelemetry.io/docs/

Tested platforms

Claude, Codex, OpenClaw, Cursor, Windsurf, Gemini

Platform	Support	Install path
claude-code	Native	.claude/skills/<skill-name>/SKILL.md
codex	Native	.agents/skills/<skill-name>/SKILL.md
windsurf	Native	.windsurf/skills/<skill-name>/SKILL.md
gemini	Native	.gemini/skills/<skill-name>/SKILL.md or .agents/skills/<skill-name>/SKILL.md
cursor	Adapter	.cursor/rules/<skill-name>.mdc
cli	Manual	AGENTS.md or tool-specific context file

Full copyable content

# Trigger
"Apply AI agent observability and incident response skill to this stack."

# Required output
1) Telemetry schema (traces, metrics, logs, events)
2) Core SLOs and alert thresholds
3) Incident triage playbook
4) Post-incident review template

About this resource

Overview

This skill turns AI agent operations into an observable system with measurable reliability. It defines what to log, what to measure, and what to alert on, so that incidents can be resolved quickly and with a consistent process.

Compatibility

Native

Claude Code / Claude: native skill usage via SKILL.md.
Codex/OpenAI workflows: compatible with Agent Skills-style SKILL.md content as reusable workflow instructions.

Manual Adaptation

Gemini CLI: use .gemini/skills/<skill-name>/SKILL.md or .agents/skills/<skill-name>/SKILL.md where native skills are supported; otherwise adapt the content manually.
Cursor: use the generated .cursor/rules/*.mdc adapter for project rules.
OpenClaw and similar agents: use the same skill content as a reusable prompt/workflow file when native skill import is unavailable.

Prerequisites

Access to request/response lifecycle in your agent runtime
Structured logging support
Ability to tag traces/events by workflow and model

What to Instrument

Prompt and tool execution spans (with redaction-safe metadata)
Latency percentiles by route/workflow/model
Error classes: model timeout, tool failure, policy denial, parse failure
Safety events: blocked actions, suspicious prompt patterns, auth failures
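
The items above can be sketched with the Python standard library alone. This is a minimal illustration, not the skill's own implementation: `SpanRecord`, `span`, `redacted`, and `classify` are hypothetical names, the mapping from exception types to error classes is an assumption, and a real deployment would more likely emit spans through an OpenTelemetry SDK than append them to a list.

```python
import hashlib
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from enum import Enum
from typing import Iterator, Optional


class ErrorClass(str, Enum):
    """The four error classes listed above."""
    MODEL_TIMEOUT = "model_timeout"
    TOOL_FAILURE = "tool_failure"
    POLICY_DENIAL = "policy_denial"
    PARSE_FAILURE = "parse_failure"


@dataclass
class SpanRecord:
    name: str
    workflow: str
    model: str
    attributes: dict = field(default_factory=dict)
    duration_ms: float = 0.0
    error_class: Optional[ErrorClass] = None


def classify(exc: BaseException) -> ErrorClass:
    # Assumed mapping from exception type to error class; adjust per runtime.
    if isinstance(exc, TimeoutError):
        return ErrorClass.MODEL_TIMEOUT
    if isinstance(exc, PermissionError):
        return ErrorClass.POLICY_DENIAL
    if isinstance(exc, ValueError):  # includes json.JSONDecodeError
        return ErrorClass.PARSE_FAILURE
    return ErrorClass.TOOL_FAILURE


def redacted(prompt: str) -> dict:
    # Store a digest and a length instead of the raw prompt text.
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return {"prompt.sha256": digest, "prompt.length": len(prompt)}


@contextmanager
def span(name: str, workflow: str, model: str, sink: list,
         attributes: Optional[dict] = None) -> Iterator[SpanRecord]:
    """Time a block, tag it by workflow and model, and record its error class."""
    record = SpanRecord(name, workflow, model, dict(attributes or {}))
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record.error_class = classify(exc)
        raise
    finally:
        record.duration_ms = (time.perf_counter() - start) * 1000.0
        sink.append(record)


# Usage: wrap a tool call; only redaction-safe metadata is attached.
spans: list = []
with span("tool.search", "support-triage", "example-model", spans,
          redacted("user prompt text")) as current:
    current.attributes["tool.name"] = "search"
```

Hashing the prompt keeps spans correlatable across systems without storing user text; whether a digest is sufficiently privacy-preserving depends on the data involved.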

How to Use This Skill

Prompt Pattern

Apply the AI agent observability and incident response skill.
Provide:
1) telemetry contract,
2) SLO definitions,
3) alert routing matrix,
4) incident playbook with triage steps.

Execution Flow

Define critical user journeys and reliability targets.
Add telemetry fields needed for fast diagnosis.
Create alert thresholds aligned to user impact.
Build runbooks with owner, severity, and escalation path.
Run incident drills to validate the runbooks before production launch.
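
As a rough illustration of the first and third steps, a reliability target for a user journey can be written as data and checked against observed latencies. The `LatencySLO` type, the journey name, and the 2000 ms threshold below are assumptions made for the example, not values from this skill.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Sequence


@dataclass(frozen=True)
class LatencySLO:
    journey: str         # critical user journey this target protects
    percentile: int      # e.g. 95 for p95
    threshold_ms: float  # latency the percentile must stay at or below


def percentile(samples: Sequence[float], pct: int) -> float:
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the pct-th one.
    return quantiles(samples, n=100, method="inclusive")[pct - 1]


def evaluate(slo: LatencySLO, samples: Sequence[float]) -> dict:
    """Compare the observed percentile with the target for one journey."""
    observed = percentile(samples, slo.percentile)
    return {
        "journey": slo.journey,
        "observed_ms": observed,
        "threshold_ms": slo.threshold_ms,
        "met": observed <= slo.threshold_ms,
    }


# Usage: a hypothetical journey with a p95 target of 2000 ms.
slo = LatencySLO(journey="answer-support-ticket", percentile=95, threshold_ms=2000.0)
latencies_ms = [420.0, 510.0, 640.0, 700.0, 880.0, 950.0, 1200.0, 1850.0]
result = evaluate(slo, latencies_ms)
```

Keeping the target as data makes it straightforward to evaluate the same definition per route, workflow, or model.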

Troubleshooting

Issue: Alert fatigue from noisy thresholds
Fix: Alert on sustained error-budget burn, not single spikes.
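
One common way to express "sustained" is a multi-window burn-rate check: page only when both a long and a short window are consuming error budget faster than a chosen multiple. The sketch below assumes request counts are already available per window; the 14.4 threshold is a frequently cited starting point for a one-hour window on a 30-day SLO and should be treated as an assumption to tune, not a value from this skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    total: int   # requests observed in the window
    errors: int  # failed requests in the window

    @property
    def error_rate(self) -> float:
        return self.errors / self.total if self.total else 0.0


def burn_rate(window: Window, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target  # e.g. roughly 0.001 for a 99.9% SLO
    return window.error_rate / budget


def should_page(long_window: Window, short_window: Window,
                slo_target: float, threshold: float = 14.4) -> bool:
    # Both windows must exceed the threshold: the long window shows the
    # problem is sustained, the short window shows it is still happening.
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)


# Usage: a brief spike that has already subsided does not page.
spike_over = should_page(Window(total=10_000, errors=200),
                         Window(total=800, errors=0),
                         slo_target=0.999)
```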

Issue: Logs are present but not useful
Fix: Standardize event schema (request ID, workflow ID, tool name, failure reason).
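
A minimal sketch of such a schema using Python's standard `logging` module follows. The field names mirror the four listed in the fix; the `AgentEventFormatter` class and the `agent.events` logger name are hypothetical choices for illustration.

```python
import json
import logging

# Fields every agent event should carry, per the fix above.
REQUIRED_FIELDS = ("request_id", "workflow_id", "tool_name", "failure_reason")


class AgentEventFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Missing fields are emitted as null so gaps are visible in queries.
        for name in REQUIRED_FIELDS:
            event[name] = getattr(record, name, None)
        return json.dumps(event, sort_keys=True)


def build_logger(name: str = "agent.events") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(AgentEventFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: pass the schema fields through the standard `extra` argument.
logger = build_logger()
logger.error("tool call failed", extra={
    "request_id": "req-123",
    "workflow_id": "support-triage",
    "tool_name": "search",
    "failure_reason": "tool_failure",
})
```

Emitting absent fields as `null` rather than omitting them makes incomplete events easy to find with a single query.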

Issue: Incidents take too long to triage
Fix: Add direct links from alerts to trace dashboards and runbook sections.
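
The idea can be sketched as an alert payload builder that attaches a pre-filtered trace link and the matching runbook section. Both base URLs and the section anchors below are placeholders invented for the example; substitute the addresses of your own dashboard and runbook.

```python
from urllib.parse import urlencode

# Hypothetical locations; replace with your real dashboard and runbook URLs.
TRACE_DASHBOARD = "https://observability.example.com/traces"
RUNBOOK_BASE = "https://runbooks.example.com/agent-incidents"

# Assumed mapping from error class to runbook section anchor.
RUNBOOK_SECTIONS = {
    "model_timeout": "model-timeout",
    "tool_failure": "tool-failure",
    "policy_denial": "policy-denial",
    "parse_failure": "parse-failure",
}


def build_alert(error_class: str, workflow: str, model: str, severity: str) -> dict:
    """Build an alert that links straight to filtered traces and a runbook section."""
    query = urlencode({"workflow": workflow, "model": model, "error_class": error_class})
    section = RUNBOOK_SECTIONS.get(error_class, "general-triage")
    return {
        "title": f"[{severity}] {error_class} in {workflow}",
        "severity": severity,
        "trace_url": f"{TRACE_DASHBOARD}?{query}",
        "runbook_url": f"{RUNBOOK_BASE}#{section}",
    }


# Usage: the responder lands on the relevant traces and triage steps directly.
alert = build_alert("tool_failure", "support-triage", "example-model", "sev2")
```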

Knowledge Freshness

Treat tooling details as time-sensitive. Re-validate APIs, limits, pricing, auth models, and deployment flags immediately before implementation. If docs conflict with prior memory, follow current official docs and release notes.

Retrieval Sources

https://opentelemetry.io/docs/

Output Contract

Return a concrete plan with implementation order.
Provide production-ready commands/config/code snippets (not placeholders).
Include explicit assumptions and unresolved risks.
Include a verification checklist with pass/fail criteria.

Quality Gates

All commands are copy/paste ready.
Security-sensitive steps call out secret handling and least privilege.
Version-sensitive guidance cites current docs used.
Rollback path is included for risky changes.
Final output includes quick validation commands/tests.

Content outline

Overview
Compatibility
Native
Manual Adaptation
Prerequisites
What to Instrument
How to Use This Skill
Prompt Pattern
Execution Flow
Troubleshooting
Knowledge Freshness
Retrieval Sources
Output Contract
Quality Gates

#observability #incident-response #reliability #telemetry #ai-agents

