Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs


TLDR

  • Open-source test runner that benchmarks Anthropic Agent Skills (SKILL.md) by running each eval with and without the skill, then judge-grades both outputs.

Key Takeaways

  • Runs every eval twice: a with_skill pass (SKILL.md in context) versus a without_skill baseline, using any OpenAI-compatible model as the target and the judge.
  • Produces portable JSON/JSONL artifacts plus a static HTML report in an iteration-N/ workspace layout; no infrastructure needed to publish results.
  • The CLI is a one-liner (npx agent-skills-eval ./skills --target gpt-4o-mini --judge gpt-4o-mini --baseline); a TypeScript SDK supports custom providers, CI pipelines, and JSONL streaming (see the sketch after this list).
  • Supports deterministic tool-call assertions alongside judge-graded text output, covering agentic workflows beyond plain text generation.
  • Fully implements the agentskills.io spec: SKILL.md frontmatter validation, the evals/evals.json schema, and the official artifact layout (the data shapes involved are sketched below).
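
A minimal sketch of what an SDK-driven run might look like, assuming a hypothetical runEvals entry point and option names (skillsDir, target, judge, baseline); the actual exports may differ, so consult the package docs before copying this.

```typescript
// Hypothetical SDK usage — the function and option names here are
// illustrative, not taken from the package; check the
// agent-skills-eval docs for the real API.
import { runEvals } from "agent-skills-eval";

async function main() {
  const results = await runEvals({
    skillsDir: "./skills",  // directory containing SKILL.md skills
    target: "gpt-4o-mini",  // model whose outputs are graded
    judge: "gpt-4o-mini",   // model that grades both outputs
    baseline: true,         // also run the without_skill pass
  });

  // Each result pairs the with_skill and without_skill outputs with
  // the judge's verdict, which could back a CI pass/fail gate.
  for (const r of results) {
    console.log(r.evalId, r.withSkill.score, r.withoutSkill.score);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```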

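To make the spec bullets concrete, here is a hedged sketch of the data shapes written as TypeScript types. The SKILL.md frontmatter fields name and description come from Anthropic's published Agent Skills format; the eval-entry fields (prompt, expected tool calls, grading rubric) are illustrative guesses at the agentskills.io evals.json schema, not a copy of it.

```typescript
// Hedged sketch — frontmatter fields `name` and `description` match
// Anthropic's published SKILL.md format; the eval-entry fields below
// are illustrative, not the exact agentskills.io evals.json schema.
interface SkillFrontmatter {
  name: string;        // skill identifier shown to the agent
  description: string; // when-to-use summary loaded into context
}

interface ToolCallAssertion {
  tool: string;                   // tool the target must invoke
  args?: Record<string, unknown>; // expected arguments, if asserted
}

interface EvalEntry {
  id: string;
  prompt: string;                           // task sent to the target
  expectedToolCalls?: ToolCallAssertion[];  // deterministic checks
  rubric?: string;                          // criteria for the judge
}

// evals/evals.json would then hold an array of such entries, e.g.:
const example: EvalEntry = {
  id: "pdf-extract-01",
  prompt: "Extract the invoice total from the attached PDF.",
  expectedToolCalls: [{ tool: "read_file" }],
  rubric: "Correct total, cited from the document.",
};
```
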
Hacker News Comment Review

  • No substantive HN discussion yet.

Original | Discuss on HN