
Agent Evals Regression Gate Skill

Build repeatable eval suites that catch quality regressions in AI agent behavior before merge or release.

by JSONbored · added 2026-04-10

Harness: Claude Code, Codex, Windsurf, Gemini, Cursor, CLI

Level: advanced · Type: general · Verified: draft

Install

curl -L https://heyclau.de/downloads/skills/agent-evals-regression-gate.zip -o agent-evals-regression-gate.zip && unzip -o agent-evals-regression-gate.zip -d ./agent-evals-regression-gate

Readiness

Trust: Trusted
Source: first-party
Safety notes: Missing
Reviewed: Yes


Review first — review before installing

Open the source and read safety notes before installing.

Prerequisites

Existing prompts, tools, or agent workflows to evaluate
A representative set of real user tasks or transcripts
CI or local runner where eval suites can be executed repeatedly

Schema details

Install type: package
Reading time: 6 min
Difficulty score: 74
Troubleshooting: Yes
Breaking changes: No

Package metadata

Download URL: /downloads/skills/agent-evals-regression-gate.zip
Package verified: Yes
SHA-256: c11d9b5eecae8fa09374644ee91dd77197ecbc11f5f60519b9b161f4754214d3
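
Since the listing publishes a checksum, the downloaded archive can be checked before it is unzipped. The following is a minimal sketch, assuming the archive was saved under the file name used by the install command above; the helper names `sha256_of` and `verify_package` are illustrative and not part of the package.

```python
import hashlib
from pathlib import Path

# Checksum published in the package metadata above.
EXPECTED_SHA256 = "c11d9b5eecae8fa09374644ee91dd77197ecbc11f5f60519b9b161f4754214d3"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_package(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's digest matches the expected checksum."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    # File name taken from the install command; adjust if saved elsewhere.
    archive = Path("agent-evals-regression-gate.zip")
    if not archive.exists():
        print(f"{archive} not found; download it first")
    elif verify_package(archive):
        print("checksum OK")
    else:
        print("checksum MISMATCH: do not unzip")
```

Running the check between the `curl` and `unzip` steps keeps a corrupted or tampered archive from being extracted.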

Skill and platform metadata

Skill type: general
Skill level: advanced
Verification: draft
Verified at: 2026-04-10

Retrieval sources

https://openai.com/index/new-tools-and-features-in-the-responses-api/

Tested platforms

Claude, Codex, OpenClaw, Cursor, Windsurf, Gemini

Platform	Support	Install path
claude-code	Native	.claude/skills/<skill-name>/SKILL.md
codex	Native	.agents/skills/<skill-name>/SKILL.md
windsurf	Native	.windsurf/skills/<skill-name>/SKILL.md
gemini	Native	.gemini/skills/<skill-name>/SKILL.md or .agents/skills/<skill-name>/SKILL.md
cursor	Adapter	.cursor/rules/<skill-name>.mdc
cli	Manual	AGENTS.md or tool-specific context file

Full copyable content

# Trigger
"Run the agent evals regression gate skill on this project and produce a
merge-blocking quality report."

# Required output
1) Eval dataset design (happy path + adversarial + edge cases)
2) Scoring rubric with explicit pass thresholds
3) Baseline vs candidate comparison with deltas
4) Merge decision with release risk summary

About this resource

Overview

This skill helps you turn subjective "looks good" checks into measurable, repeatable quality gates for AI agent behavior. It is built for teams that ship agent workflows to production and need confidence that model, prompt, or tool changes do not silently degrade outcomes.

Compatibility

Native

Claude Code / Claude: native skill usage via SKILL.md.
Codex/OpenAI workflows: compatible with Agent Skills-style SKILL.md content as reusable workflow instructions.
Gemini CLI: native skill usage via .gemini/skills/<skill-name>/SKILL.md or .agents/skills/<skill-name>/SKILL.md where supported.

Manual Adaptation

Cursor: use the generated .cursor/rules/*.mdc adapter for project rules.
OpenClaw and similar agents: use the same skill content as a reusable prompt/workflow file when native skill import is unavailable.

Prerequisites

Baseline behavior snapshot from current stable version
Candidate branch or prompt/tool change to evaluate
Evaluation tasks that reflect real production use

What This Skill Delivers

Task matrix across happy-path, edge-case, and failure scenarios
Deterministic scoring rubric and weighted pass/fail thresholds
Regression report with severity and root-cause candidates
Release recommendation for merge, patch, or rollback
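
One way to read "deterministic scoring rubric and weighted pass/fail thresholds" is sketched below. The dimension names follow the quality goals this skill lists (accuracy, safety, latency, format compliance); the weights, thresholds, and the names `Dimension`, `RUBRIC`, and `evaluate` are hypothetical placeholders, and every score is assumed to be normalized to the range 0 to 1 with higher meaning better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    weight: float     # relative importance; normalized in weighted_score
    threshold: float  # minimum acceptable score in [0, 1]


# Hypothetical rubric values; tune weights and thresholds per team.
RUBRIC = [
    Dimension("accuracy", weight=0.4, threshold=0.85),
    Dimension("safety", weight=0.3, threshold=0.95),
    Dimension("latency", weight=0.1, threshold=0.70),
    Dimension("format_compliance", weight=0.2, threshold=0.90),
]
OVERALL_THRESHOLD = 0.85  # hypothetical overall pass mark


def weighted_score(scores: dict[str, float], rubric: list[Dimension]) -> float:
    """Weighted average of per-dimension scores."""
    total_weight = sum(d.weight for d in rubric)
    return sum(scores[d.name] * d.weight for d in rubric) / total_weight


def evaluate(scores: dict[str, float],
             rubric: list[Dimension] = RUBRIC,
             overall_threshold: float = OVERALL_THRESHOLD) -> dict:
    """Pass only if every dimension and the weighted total clear their thresholds."""
    failed = [d.name for d in rubric if scores[d.name] < d.threshold]
    overall = weighted_score(scores, rubric)
    return {
        "overall": overall,
        "failed_dimensions": failed,
        "passed": not failed and overall >= overall_threshold,
    }
```

Checking each dimension separately, in addition to the weighted total, prevents a strong accuracy score from masking a safety failure.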

How to Use This Skill

Prompt Pattern

Apply the agent evals regression gate skill to this workflow.
Generate:
1) Eval set (minimum 30 cases),
2) Rubric and scoring model,
3) Baseline vs candidate score delta report,
4) Merge decision with blocking issues.
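
Before any scoring happens, the eval set itself can be checked against the constraints stated in this skill: at least 30 cases, covering happy-path, adversarial, and edge-case scenarios. The sketch below assumes a simple in-memory case record; the `EvalCase` fields and the category labels are illustrative choices, not a format the skill prescribes.

```python
from collections import Counter
from dataclasses import dataclass

MIN_CASES = 30  # minimum from the prompt pattern above
# Hypothetical labels for the three scenario types the skill names.
REQUIRED_CATEGORIES = {"happy_path", "adversarial", "edge_case"}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    category: str
    prompt: str
    expected: str


def validate_dataset(cases: list[EvalCase]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the set is usable."""
    problems = []
    if len(cases) < MIN_CASES:
        problems.append(f"only {len(cases)} cases, need at least {MIN_CASES}")

    categories = Counter(case.category for case in cases)
    for missing in sorted(REQUIRED_CATEGORIES - set(categories)):
        problems.append(f"no cases in category '{missing}'")

    ids = Counter(case.case_id for case in cases)
    for case_id, count in sorted(ids.items()):
        if count > 1:
            problems.append(f"duplicate case_id '{case_id}'")
    return problems
```

Failing fast on a thin or lopsided dataset avoids producing a score delta that looks precise but rests on too few cases.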

Execution Flow

Define scope and quality goals (accuracy, safety, latency, format compliance).
Build or refresh eval dataset from production-like tasks.
Run baseline and candidate under identical conditions.
Compute deltas and classify regressions by impact.
Block merge when thresholds fail; output remediation actions.
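
The last three steps of the flow above (run both versions, compute deltas, block on failures) can be sketched as follows. This assumes per-dimension scores on a 0 to 1 scale where higher is better; the severity cut-offs and the names `Finding`, `classify`, `compare`, and `merge_decision` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical severity cut-offs on a 0-1 score scale; tune per project.
WARNING_DROP = 0.02
BLOCKING_DROP = 0.05


@dataclass(frozen=True)
class Finding:
    dimension: str
    baseline: float
    candidate: float
    delta: float   # candidate minus baseline; negative means a regression
    severity: str  # "ok", "warning", or "blocking"


def classify(delta: float) -> str:
    """Map a score delta to a severity label based on the size of the drop."""
    drop = -delta
    if drop >= BLOCKING_DROP:
        return "blocking"
    if drop >= WARNING_DROP:
        return "warning"
    return "ok"


def compare(baseline: dict[str, float], candidate: dict[str, float]) -> list[Finding]:
    """Compute one finding per dimension, in a stable (sorted) order."""
    findings = []
    for dimension in sorted(baseline):
        delta = candidate[dimension] - baseline[dimension]
        findings.append(
            Finding(dimension, baseline[dimension], candidate[dimension],
                    delta, classify(delta))
        )
    return findings


def merge_decision(findings: list[Finding]) -> str:
    """Return "block" if any finding is blocking, otherwise "merge"."""
    return "block" if any(f.severity == "blocking" for f in findings) else "merge"
```

In CI, a wrapper script could exit with a nonzero status when `merge_decision` returns `"block"`, so the pipeline itself enforces the gate.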

Troubleshooting

Issue: Scores fluctuate heavily between runs
Fix: Reduce nondeterminism (temperature controls, fixed seeds where possible, stable tool mocks).
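
As one illustration of a stable tool mock, the sketch below replays recorded tool responses so that baseline and candidate runs see identical tool output. The class name `RecordedToolMock`, the `search` tool, and its arguments are hypothetical; how recordings are captured and how the mock is wired into a given agent framework is left open.

```python
import json


class RecordedToolMock:
    """Replays recorded tool responses so every eval run sees the same tool output."""

    def __init__(self, recordings: dict[str, str]):
        self._recordings = dict(recordings)

    @staticmethod
    def key(tool_name: str, arguments: dict) -> str:
        # sort_keys makes the lookup key independent of argument order.
        return json.dumps([tool_name, arguments], sort_keys=True)

    def call(self, tool_name: str, arguments: dict) -> str:
        """Return the recorded response, or fail loudly if none exists."""
        lookup = self.key(tool_name, arguments)
        if lookup not in self._recordings:
            # An unrecorded call usually means agent behavior changed.
            raise KeyError(f"no recording for {lookup}")
        return self._recordings[lookup]


# Example: a single recorded call to a hypothetical "search" tool.
recordings = {
    RecordedToolMock.key("search", {"query": "refund policy", "limit": 3}):
        "Refunds are accepted within 30 days.",
}
mock = RecordedToolMock(recordings)
```

Raising on an unrecorded call, rather than returning a default, turns a silent behavior change into a visible eval failure.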

Issue: Eval passes but production still degrades
Fix: Expand dataset with real production transcripts and failure exemplars.

Issue: Teams disagree on pass criteria
Fix: Move rubric to explicit weighted dimensions and threshold contracts.
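
A threshold contract is easier to agree on when it lives in a versioned file that is validated on load. The sketch below assumes a JSON layout invented for illustration (`version`, `overall_threshold`, `dimensions`); the field names and numeric values are hypothetical, and `load_contract` is not part of the skill package.

```python
import json
import math

# Hypothetical contract; in practice this would be a file under version control.
CONTRACT_JSON = """
{
  "version": 1,
  "overall_threshold": 0.85,
  "dimensions": {
    "accuracy": {"weight": 0.4, "threshold": 0.85},
    "safety": {"weight": 0.3, "threshold": 0.95},
    "latency": {"weight": 0.1, "threshold": 0.70},
    "format_compliance": {"weight": 0.2, "threshold": 0.90}
  }
}
"""


def load_contract(text: str) -> dict:
    """Parse a contract and reject it if weights or thresholds are inconsistent."""
    contract = json.loads(text)
    dimensions = contract["dimensions"]
    if not dimensions:
        raise ValueError("contract defines no dimensions")

    total_weight = sum(d["weight"] for d in dimensions.values())
    if not math.isclose(total_weight, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")

    thresholds = {name: d["threshold"] for name, d in dimensions.items()}
    thresholds["overall"] = contract["overall_threshold"]
    for name, value in thresholds.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"threshold for '{name}' is outside [0, 1]")
    return contract
```

With the contract in the repository, a disagreement about pass criteria becomes a reviewable change to one file instead of a debate over individual eval runs.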

Knowledge Freshness

Treat tooling details as time-sensitive. Re-validate APIs, limits, pricing, auth models, and deployment flags immediately before implementation. If docs conflict with prior memory, follow current official docs and release notes.

Retrieval Sources

https://openai.com/index/new-tools-and-features-in-the-responses-api/

Output Contract

Return a concrete plan with implementation order.
Provide production-ready commands/config/code snippets (not placeholders).
Include explicit assumptions and unresolved risks.
Include a verification checklist with pass/fail criteria.

Quality Gates

All commands are copy/paste ready.
Security-sensitive steps call out secret handling and least privilege.
Version-sensitive guidance cites current docs used.
Rollback path is included for risky changes.
Final output includes quick validation commands/tests.

Content outline

Overview
Compatibility
Native
Manual Adaptation
Prerequisites
What This Skill Delivers
How to Use This Skill
Prompt Pattern
Execution Flow
Troubleshooting
Knowledge Freshness
Retrieval Sources
Output Contract
Quality Gates

#evals #regression #ai-agents #qa #quality-gate
