@skill-graph/cli 0.5.6
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +247 -0
- package/LICENSE +200 -0
- package/NOTICE +62 -0
- package/README.md +398 -0
- package/SKILL_GRAPH.md +443 -0
- package/bin/skill-graph.js +374 -0
- package/docs/ADOPTION.md +117 -0
- package/docs/CONFORMANCE.md +66 -0
- package/docs/PRIMER.md +384 -0
- package/docs/QUICKSTART-30MIN.md +333 -0
- package/docs/ROUTING-METRICS.md +120 -0
- package/docs/SKILL-MD-FORMAT-COMPATIBILITY.md +127 -0
- package/docs/SKILL_AUDIT_CHECKLIST.md +199 -0
- package/docs/SKILL_AUDIT_LOOP.md +195 -0
- package/docs/SKILL_METADATA_PROTOCOL.md +609 -0
- package/docs/_archived/marketplace-publication-priority-2026-05-18.md +239 -0
- package/docs/adr/0001-predicate-set.md +69 -0
- package/docs/adr/0002-json-ld-context.md +82 -0
- package/docs/adr/0003-ontoclean-rigidity-tags.md +65 -0
- package/docs/adr/0004-persistent-identifiers.md +74 -0
- package/docs/adr/0005-freshness-consolidation.md +70 -0
- package/docs/adr/0006-revise-predicate-rename.md +105 -0
- package/docs/adr/0007-audit-loop-cadence.md +99 -0
- package/docs/adr/0008-skill-surface-split-and-curation-policy.md +93 -0
- package/docs/category-consumers.md +168 -0
- package/docs/concept-map.md +194 -0
- package/docs/diagrams/drift-states.mmd +21 -0
- package/docs/diagrams/manifest-pipeline.mmd +25 -0
- package/docs/diagrams/routing-harness.mmd +41 -0
- package/docs/diagrams/starter-graph.mmd +53 -0
- package/docs/field-decision-guide.md +315 -0
- package/docs/field-rationale.md +211 -0
- package/docs/field-reference.generated.md +624 -0
- package/docs/field-reference.md +1426 -0
- package/docs/glossary.md +190 -0
- package/docs/head-noun-glossary.md +63 -0
- package/docs/images/audit-phases.png +0 -0
- package/docs/images/drift-states.png +0 -0
- package/docs/images/graded-mode.png +0 -0
- package/docs/images/manifest-pipeline.png +0 -0
- package/docs/images/routing-harness.png +0 -0
- package/docs/images/skill-anatomy.png +0 -0
- package/docs/images/starter-graph.png +0 -0
- package/docs/images/system-model.png +0 -0
- package/docs/integrations/github-actions.md +155 -0
- package/docs/manifest-field-mapping.md +443 -0
- package/docs/marketplace-publication-queue.generated.md +240 -0
- package/docs/marketplace-release-agent-prompt.md +82 -0
- package/docs/marketplace-skill-candidate-list.md +272 -0
- package/docs/marketplace-syndication.md +222 -0
- package/docs/migration-sample-review.md +155 -0
- package/docs/migrations/v4-to-v5.md +168 -0
- package/docs/migrations/v5-to-v6.md +221 -0
- package/docs/name-exceptions.yaml +37 -0
- package/docs/plans/marketplace-p1-public-migration-plan.md +41 -0
- package/docs/plans/multi-root-workspace.md +148 -0
- package/docs/plans/scripts-roadmap.md +107 -0
- package/docs/plans/v4-schema-bump.md +160 -0
- package/docs/plans/wave-2-extraction.md +122 -0
- package/docs/positioning-vs-marketplaces.md +175 -0
- package/docs/proposals/skill-audit-loop-positioning.md +160 -0
- package/docs/quality-doctrine.md +138 -0
- package/docs/recommended-skills.md +150 -0
- package/docs/research/skill-comprehension-eval-research.md +1830 -0
- package/docs/research/skill-retrieval-evidence.md +66 -0
- package/docs/skill-metadata-protocol.md +471 -0
- package/docs/skills-sh-maintainer-cleanup-request.md +80 -0
- package/examples/audits/a11y/findings.md +52 -0
- package/examples/audits/a11y/scorecard.md +21 -0
- package/examples/audits/a11y/verdict.md +44 -0
- package/examples/audits/debugging/findings.md +59 -0
- package/examples/audits/debugging/scorecard.md +22 -0
- package/examples/audits/debugging/verdict.md +33 -0
- package/examples/audits/documentation/findings.md +59 -0
- package/examples/audits/documentation/scorecard.md +22 -0
- package/examples/audits/documentation/verdict.md +33 -0
- package/examples/evals/a11y.json +140 -0
- package/examples/evals/api-design.json +52 -0
- package/examples/evals/code-review.json +52 -0
- package/examples/evals/data-modeling.json +52 -0
- package/examples/evals/database-migration.json +52 -0
- package/examples/evals/debugging.json +118 -0
- package/examples/evals/dependency-architecture.json +52 -0
- package/examples/evals/design-system-architecture.json +52 -0
- package/examples/evals/error-tracking.json +52 -0
- package/examples/evals/event-contract-design.json +52 -0
- package/examples/evals/form-ux-architecture.json +52 -0
- package/examples/evals/framework-fit-analysis.json +52 -0
- package/examples/evals/graph-audit.json +139 -0
- package/examples/evals/information-architecture.json +52 -0
- package/examples/evals/interaction-feedback.json +52 -0
- package/examples/evals/interaction-patterns.json +52 -0
- package/examples/evals/layout-composition.json +52 -0
- package/examples/evals/lint-overlay.json +117 -0
- package/examples/evals/microcopy.json +52 -0
- package/examples/evals/observability-modeling.json +52 -0
- package/examples/evals/pattern-recognition.json +96 -0
- package/examples/evals/performance-engineering.json +52 -0
- package/examples/evals/refactor.json +128 -0
- package/examples/evals/semiotics.json +52 -0
- package/examples/evals/skill-infrastructure.json +96 -0
- package/examples/evals/skill-router.json +140 -0
- package/examples/evals/skill-router.routing.json +113 -0
- package/examples/evals/system-interface-contracts.json +52 -0
- package/examples/evals/task-analysis.json +52 -0
- package/examples/evals/testing-strategy.json +118 -0
- package/examples/evals/type-safety.json +249 -0
- package/examples/evals/visual-design-foundations.json +52 -0
- package/examples/evals/webhook-integration.json +52 -0
- package/examples/exports/a11y.skill-md.md +80 -0
- package/examples/exports/debugging.skill-md.md +80 -0
- package/examples/exports/refactor.skill-md.md +78 -0
- package/examples/exports/testing-strategy.skill-md.md +81 -0
- package/examples/projects/markdown-static-site/README.md +115 -0
- package/examples/projects/markdown-static-site/skills/content-source-router/SKILL.md +131 -0
- package/examples/projects/markdown-static-site/skills/image-optimization-pipeline-config/SKILL.md +132 -0
- package/examples/projects/markdown-static-site/skills/link-rot-detection/SKILL.md +103 -0
- package/examples/projects/markdown-static-site/skills/markdown-post-frontmatter-validation/SKILL.md +133 -0
- package/examples/projects/markdown-static-site/skills/migrate-posts-to-v2-frontmatter/SKILL.md +140 -0
- package/examples/projects/saas-stripe-postgres/README.md +208 -0
- package/examples/projects/saas-stripe-postgres/db/migrations/0004_canonicalize_orders.sql +37 -0
- package/examples/projects/saas-stripe-postgres/db/schema.sql +112 -0
- package/examples/projects/saas-stripe-postgres/skills/migrate-orders-to-canonical-schema/SKILL.md +149 -0
- package/examples/projects/saas-stripe-postgres/skills/nextjs-server-action-validation/SKILL.md +154 -0
- package/examples/projects/saas-stripe-postgres/skills/payment-provider-router/SKILL.md +153 -0
- package/examples/projects/saas-stripe-postgres/skills/postgres-rls-pattern/SKILL.md +163 -0
- package/examples/projects/saas-stripe-postgres/skills/stripe-webhook-signature-verification/SKILL.md +137 -0
- package/examples/protocol/skill-metadata-template.md +301 -0
- package/examples/protocol/skills.manifest.sample.json +13245 -0
- package/examples/skill-metadata-template.md +317 -0
- package/examples/skills.manifest.sample.json +13519 -0
- package/examples/tests/v3-1-skos-fixture/SKILL.md +93 -0
- package/marketplace/README.md +17 -0
- package/marketplace/skills/a11y/SKILL.md +66 -0
- package/marketplace/skills/acid-fundamentals/SKILL.md +106 -0
- package/marketplace/skills/agent-engineering/SKILL.md +386 -0
- package/marketplace/skills/agent-eval-design/SKILL.md +55 -0
- package/marketplace/skills/ai-native-development/SKILL.md +294 -0
- package/marketplace/skills/api-design/SKILL.md +60 -0
- package/marketplace/skills/architecture-decision-records/SKILL.md +55 -0
- package/marketplace/skills/background-jobs/SKILL.md +265 -0
- package/marketplace/skills/bounded-context-mapping/SKILL.md +55 -0
- package/marketplace/skills/cap-theorem-tradeoffs/SKILL.md +127 -0
- package/marketplace/skills/client-server-boundary/SKILL.md +187 -0
- package/marketplace/skills/code-review/SKILL.md +120 -0
- package/marketplace/skills/color-system-design/SKILL.md +43 -0
- package/marketplace/skills/component-architecture/SKILL.md +126 -0
- package/marketplace/skills/compression/SKILL.md +112 -0
- package/marketplace/skills/conceptual-modeling/SKILL.md +181 -0
- package/marketplace/skills/connection-pooling/SKILL.md +105 -0
- package/marketplace/skills/constraint-awareness/SKILL.md +287 -0
- package/marketplace/skills/content-monitor/SKILL.md +209 -0
- package/marketplace/skills/context-engineering/SKILL.md +320 -0
- package/marketplace/skills/context-graph/SKILL.md +174 -0
- package/marketplace/skills/context-management/SKILL.md +174 -0
- package/marketplace/skills/context-window/SKILL.md +239 -0
- package/marketplace/skills/contract-testing/SKILL.md +120 -0
- package/marketplace/skills/cron-scheduling/SKILL.md +223 -0
- package/marketplace/skills/dark-mode-implementation/SKILL.md +47 -0
- package/marketplace/skills/data-modeling/SKILL.md +59 -0
- package/marketplace/skills/data-modeling-fundamentals/SKILL.md +117 -0
- package/marketplace/skills/database-migration/SKILL.md +429 -0
- package/marketplace/skills/debugging/SKILL.md +67 -0
- package/marketplace/skills/dependency-architecture/SKILL.md +58 -0
- package/marketplace/skills/design-module-composition/SKILL.md +43 -0
- package/marketplace/skills/design-system-architecture/SKILL.md +61 -0
- package/marketplace/skills/design-thinking/SKILL.md +44 -0
- package/marketplace/skills/diagnosis/SKILL.md +296 -0
- package/marketplace/skills/diff-analysis/SKILL.md +188 -0
- package/marketplace/skills/e2e-test-design/SKILL.md +113 -0
- package/marketplace/skills/entity-relationship-modeling/SKILL.md +218 -0
- package/marketplace/skills/epistemic-grounding/SKILL.md +112 -0
- package/marketplace/skills/error-boundary/SKILL.md +235 -0
- package/marketplace/skills/error-tracking/SKILL.md +261 -0
- package/marketplace/skills/eval-driven-development/SKILL.md +147 -0
- package/marketplace/skills/evaluation/SKILL.md +113 -0
- package/marketplace/skills/event-contract-design/SKILL.md +60 -0
- package/marketplace/skills/event-storming/SKILL.md +56 -0
- package/marketplace/skills/form-ux-architecture/SKILL.md +60 -0
- package/marketplace/skills/framework-fit-analysis/SKILL.md +59 -0
- package/marketplace/skills/frontend-architecture/SKILL.md +43 -0
- package/marketplace/skills/generative-ui/SKILL.md +118 -0
- package/marketplace/skills/graph-audit/SKILL.md +81 -0
- package/marketplace/skills/guardrails/SKILL.md +118 -0
- package/marketplace/skills/hooks-patterns/SKILL.md +185 -0
- package/marketplace/skills/http-semantics/SKILL.md +136 -0
- package/marketplace/skills/ideation/SKILL.md +41 -0
- package/marketplace/skills/indexing-strategy/SKILL.md +108 -0
- package/marketplace/skills/information-architecture/SKILL.md +59 -0
- package/marketplace/skills/integration-test-design/SKILL.md +111 -0
- package/marketplace/skills/intent-recognition/SKILL.md +136 -0
- package/marketplace/skills/interaction-feedback/SKILL.md +59 -0
- package/marketplace/skills/interaction-patterns/SKILL.md +59 -0
- package/marketplace/skills/journey-mapping/SKILL.md +41 -0
- package/marketplace/skills/keywords/SKILL.md +213 -0
- package/marketplace/skills/knowledge-modeling/SKILL.md +232 -0
- package/marketplace/skills/layout-composition/SKILL.md +59 -0
- package/marketplace/skills/linguistics/SKILL.md +429 -0
- package/marketplace/skills/lint-overlay/SKILL.md +76 -0
- package/marketplace/skills/mental-models/SKILL.md +126 -0
- package/marketplace/skills/merge-queue/SKILL.md +94 -0
- package/marketplace/skills/methodology/SKILL.md +317 -0
- package/marketplace/skills/microcopy/SKILL.md +232 -0
- package/marketplace/skills/middleware-patterns/SKILL.md +363 -0
- package/marketplace/skills/mobile-responsive-ux/SKILL.md +287 -0
- package/marketplace/skills/mutation-testing/SKILL.md +112 -0
- package/marketplace/skills/naming-conventions/SKILL.md +112 -0
- package/marketplace/skills/observability-modeling/SKILL.md +59 -0
- package/marketplace/skills/ontology-modeling/SKILL.md +67 -0
- package/marketplace/skills/owasp-security/SKILL.md +153 -0
- package/marketplace/skills/pattern-recognition/SKILL.md +472 -0
- package/marketplace/skills/performance-budgets/SKILL.md +185 -0
- package/marketplace/skills/performance-engineering/SKILL.md +58 -0
- package/marketplace/skills/performance-testing/SKILL.md +125 -0
- package/marketplace/skills/printify/SKILL.md +42 -0
- package/marketplace/skills/prioritization/SKILL.md +118 -0
- package/marketplace/skills/problem-framing/SKILL.md +41 -0
- package/marketplace/skills/problem-locating-solving/SKILL.md +203 -0
- package/marketplace/skills/project-knowledge-extraction/SKILL.md +54 -0
- package/marketplace/skills/prompt-craft/SKILL.md +134 -0
- package/marketplace/skills/prompt-injection-defense/SKILL.md +132 -0
- package/marketplace/skills/property-based-testing/SKILL.md +100 -0
- package/marketplace/skills/prototyping/SKILL.md +43 -0
- package/marketplace/skills/query-optimization/SKILL.md +144 -0
- package/marketplace/skills/real-time-updates/SKILL.md +324 -0
- package/marketplace/skills/ref-patterns/SKILL.md +284 -0
- package/marketplace/skills/refactor/SKILL.md +65 -0
- package/marketplace/skills/rendering-models/SKILL.md +142 -0
- package/marketplace/skills/replication-patterns/SKILL.md +110 -0
- package/marketplace/skills/research-synthesis/SKILL.md +41 -0
- package/marketplace/skills/route-handler-design/SKILL.md +347 -0
- package/marketplace/skills/schema-evolution/SKILL.md +140 -0
- package/marketplace/skills/security-fundamentals/SKILL.md +139 -0
- package/marketplace/skills/semantic-center/SKILL.md +194 -0
- package/marketplace/skills/semantic-relations/SKILL.md +250 -0
- package/marketplace/skills/semantics/SKILL.md +366 -0
- package/marketplace/skills/semiotics/SKILL.md +230 -0
- package/marketplace/skills/seo-strategy/SKILL.md +260 -0
- package/marketplace/skills/server-actions-design/SKILL.md +243 -0
- package/marketplace/skills/server-components-design/SKILL.md +190 -0
- package/marketplace/skills/sharding-strategy/SKILL.md +123 -0
- package/marketplace/skills/shopify/SKILL.md +42 -0
- package/marketplace/skills/skill-infrastructure/SKILL.md +320 -0
- package/marketplace/skills/skill-router/SKILL.md +71 -0
- package/marketplace/skills/skill-scaffold/SKILL.md +105 -0
- package/marketplace/skills/snapshot-testing/SKILL.md +120 -0
- package/marketplace/skills/spec-driven-development/SKILL.md +148 -0
- package/marketplace/skills/state-machine-modeling/SKILL.md +56 -0
- package/marketplace/skills/state-management/SKILL.md +134 -0
- package/marketplace/skills/streaming-architecture/SKILL.md +194 -0
- package/marketplace/skills/summarization/SKILL.md +156 -0
- package/marketplace/skills/suspense-patterns/SKILL.md +265 -0
- package/marketplace/skills/system-interface-contracts/SKILL.md +59 -0
- package/marketplace/skills/task-analysis/SKILL.md +201 -0
- package/marketplace/skills/taxonomy-design/SKILL.md +66 -0
- package/marketplace/skills/test-coverage-strategy/SKILL.md +108 -0
- package/marketplace/skills/test-doubles-design/SKILL.md +98 -0
- package/marketplace/skills/test-driven-development/SKILL.md +96 -0
- package/marketplace/skills/testing-strategy/SKILL.md +67 -0
- package/marketplace/skills/theme-system-design/SKILL.md +43 -0
- package/marketplace/skills/tool-call-flow/SKILL.md +229 -0
- package/marketplace/skills/tool-call-strategy/SKILL.md +292 -0
- package/marketplace/skills/transaction-isolation/SKILL.md +98 -0
- package/marketplace/skills/type-safety/SKILL.md +177 -0
- package/marketplace/skills/typography-system/SKILL.md +43 -0
- package/marketplace/skills/usability-testing/SKILL.md +43 -0
- package/marketplace/skills/user-research/SKILL.md +43 -0
- package/marketplace/skills/vercel-composition-patterns/SKILL.md +157 -0
- package/marketplace/skills/version-control/SKILL.md +233 -0
- package/marketplace/skills/visual-design-foundations/SKILL.md +59 -0
- package/marketplace/skills/visual-hierarchy/SKILL.md +43 -0
- package/marketplace/skills/webhook-integration/SKILL.md +331 -0
- package/marketplace/skills/writing-humanizer/SKILL.md +380 -0
- package/package.json +67 -0
- package/schemas/manifest.schema.json +811 -0
- package/schemas/manifest.v2.schema.json +164 -0
- package/schemas/manifest.v3.schema.json +758 -0
- package/schemas/manifest.v4.schema.json +755 -0
- package/schemas/manifest.v5.schema.json +755 -0
- package/schemas/manifest.v6.schema.json +811 -0
- package/schemas/skill.context.jsonld +279 -0
- package/schemas/skill.schema.json +919 -0
- package/schemas/skill.v2.schema.json +201 -0
- package/schemas/skill.v3.schema.json +827 -0
- package/schemas/skill.v4.schema.json +822 -0
- package/schemas/skill.v5.schema.json +830 -0
- package/schemas/skill.v6.schema.json +946 -0
- package/schemas/vocabulary/keywords.json +180 -0
- package/schemas/vocabulary/workspace_tags.json +23 -0
- package/scripts/__tests__/migrate-skill-v2-to-v3.test.js +161 -0
- package/scripts/__tests__/migrate-skill-v3-to-v4.test.js +158 -0
- package/scripts/__tests__/test-export-parser-drift.js +149 -0
- package/scripts/__tests__/test-marketplace-export.js +114 -0
- package/scripts/__tests__/test-router-paths.js +82 -0
- package/scripts/__tests__/test-stability-promotion.js +244 -0
- package/scripts/__tests__/test-v3-1-alias-contract.js +109 -0
- package/scripts/__tests__/test-v3-1-skos-runtime.js +116 -0
- package/scripts/backfill-schema-version.js +198 -0
- package/scripts/build-field-reference.js +160 -0
- package/scripts/build-retrieval-baseline.js +511 -0
- package/scripts/check-markdown-links.js +211 -0
- package/scripts/check-protocol-consistency.js +979 -0
- package/scripts/export-marketplace-skills.js +610 -0
- package/scripts/export-skill.js +374 -0
- package/scripts/generate-manifest.js +787 -0
- package/scripts/lib/alias-contract.js +83 -0
- package/scripts/lib/audit-prompt-builder.js +771 -0
- package/scripts/lib/mock-grader.js +134 -0
- package/scripts/lib/parse-frontmatter.js +429 -0
- package/scripts/lib/roots.js +119 -0
- package/scripts/lint/check-archetype-sections.js +185 -0
- package/scripts/lint/check-category-enum.js +83 -0
- package/scripts/lint/check-routing-eval.js +146 -0
- package/scripts/lint/check-routing-quality.js +211 -0
- package/scripts/lint/check-stability-promotion.js +220 -0
- package/scripts/lint/format-code-frame.js +206 -0
- package/scripts/marketplace-install.js +125 -0
- package/scripts/migrate-category-to-enum.js +169 -0
- package/scripts/migrate-skill-v2-to-v3.js +424 -0
- package/scripts/migrate-skill-v3-to-v4.js +200 -0
- package/scripts/migrate-skill-v5-to-v6.js +304 -0
- package/scripts/restructure-by-category.js +85 -0
- package/scripts/seed-publication-classification.js +282 -0
- package/scripts/skill-audit.js +893 -0
- package/scripts/skill-graph-drift.js +483 -0
- package/scripts/skill-graph-route.js +766 -0
- package/scripts/skill-graph-routing-eval.js +393 -0
- package/scripts/skill-lint.js +1317 -0
- package/scripts/skill-overlap.js +213 -0
- package/scripts/verify-skill-md-export.js +201 -0
|
@@ -0,0 +1,1830 @@
# Skill Comprehension Evaluation: Research Report

> **Subject.** How to measure whether an AI agent has actually **learned the subject** from a Skill Graph `SKILL.md`, as opposed to whether it can route the skill correctly, conform to the schema, or pattern-match against the body.
>
> **Audience.** Skill Graph maintainers and contributors deciding the next minimum upgrade to the comprehension-grading surface.
>
> **Status.** Research deliverable. Not a protocol change. The proposed rubric and worked example are recommendations; final decisions belong to the maintainer.
>
> **Date.** 2026-05-16.
>
> **Scope.** Comprehension-quality evaluation only. Structural conformance (lint, manifest parity, drift, routing) is already owned by `scripts/skill-lint.js`, `scripts/skill-graph-routing-eval.js`, and `scripts/skill-graph-drift.js` and is **out of scope** for this report.

---
## Table of contents

1. [Executive summary](#1-executive-summary)
2. [What the repo currently evaluates](#2-what-the-repo-currently-evaluates)
3. [What the repo does NOT evaluate](#3-what-the-repo-does-not-evaluate)
4. [External research synthesis](#4-external-research-synthesis)
   - 4.7 [RLHF failure rates as the empirical case for forced-completeness graders](#47-rlhf-failure-rates-as-the-empirical-case-for-forced-completeness-graders)
5. [Proposed comprehension-quality rubric](#5-proposed-comprehension-quality-rubric)
   - 5.13 [Scoring discipline: the rubric as an application of `methodical`](#513-scoring-discipline--the-rubric-as-an-application-of-methodical)
6. [Worked example: `type-safety`](#6-worked-example--type-safety)
7. [Implementation recommendations](#7-implementation-recommendations)
   - 7.10 [R9: Grader prompt incorporates `methodical` forcing functions](#710-r9--grader-prompt-incorporates-methodical-forcing-functions)
8. [Risks and anti-patterns](#8-risks-and-anti-patterns)
   - 8.9 [`methodical` anti-patterns mapped to LLM-as-judge comprehension-grading failures](#89-methodical-anti-patterns-mapped-to-llm-as-judge-comprehension-grading-failures)
9. [Open questions](#9-open-questions)
10. [Completeness claim](#10-completeness-claim)

---
## 1. Executive summary

The Skill Graph evaluates four orthogonal surfaces: **schema conformance** (lint), **manifest parity** (generator round-trip), **drift** (truth-source hashes), and **routing** (`activation.examples` and `activation.anti_examples` against the router). All four are mature and largely deterministic.

A fifth surface, **comprehension quality**, is **declared but unimplemented**. The protocol defines `comprehension_state: present`, mandates the seven-field `concept` block when present (`definition`, `mental_model`, `purpose`, `boundary`, `taxonomy`, `analogy`, `misconception`), and the schema documents per-field grader weights (`mental_model` and `boundary` at 1.5, `definition` / `purpose` / `taxonomy` at 1.0, `analogy` at 0.5, `misconception` not directly graded; see [`schemas/skill.v4.schema.json` lines 169–211](/Users/jacobbalslev/Development/skill-graph/schemas/skill.v4.schema.json)). But the grader script the schema references, `scripts/skill/evaluate-skill.js --comprehension`, **does not exist** in the repo. No code reads the `concept` block for grading purposes; no eval format encodes "does the agent's answer match the `definition` field without copying it"; no rubric distinguishes "the agent retrieved a quote from the body" from "the agent reasoned about a fresh case using the skill's mental model."

The 31 eval files in `examples/evals/` use a rich per-case schema (`dimension`, `substance`, `calibration`, `truth_mode`, `skill_type`, `criticality`, `truth_sources`, optional `expected_reasoning`). But the dimension distribution skews heavily toward **routing-adjacent** measurement: 84 cases are tagged `application`, 62 are tagged `boundary`, and only 10, 11, 7, 2, and 1 cases measure `definition`, `mental_model`, `purpose`, `rule_conflict`, and `anti_pattern`, respectively. The concept-block grader weights are not represented in the data.

This report proposes a minimum upgrade path:

1. **An 8-dimension comprehension rubric**: one rubric dimension per concept-block field, plus two cross-cutting dimensions (`verification-application` and `negative-boundary-respect`) that the existing 7-field block does not cover but that the body Verification and Do-NOT-Use-When sections already encode.
2. **Pass/fail criteria with concrete examples** for each dimension, distinguishing six failure modes (verbatim copying, paraphrastic regurgitation, body-only retrieval, near-vs-far transfer failure, leading-prompt capitulation, scope-reduction softening).
3. **An eval-file extension** (additive only, no schema break) adding optional `comprehension_dimension`, `concept_field`, and `expected_behaviors` keys so existing files remain valid.
4. **A worked example** of ~10 scenarios mapped onto `skills/type-safety/SKILL.md` in valid JSON shape, demonstrating coverage across all 8 rubric dimensions.
5. **A concrete implementation order**, ranked by leverage: the smallest viable change is a 30-line grader prompt template plus an `evals/comprehension/<skill>.json` example, **not** a new script or schema bump.

The report does not propose a new schema version, does not redesign lint or drift, and does not recommend a commercial eval platform.

---
## 2. What the repo currently evaluates

The Skill Graph's evaluation discipline is described in [`AGENTS.MD` § Evaluation Discipline (lines 143–189)](/Users/jacobbalslev/Development/skill-graph/AGENTS.MD). It names four layers, each with its own definition of "good":

| Layer | Question | Surface | Deterministic? |
|---|---|---|---|
| Per-skill content | Schema valid, body sections present, eval coherent? | `scripts/skill-lint.js` | Yes |
| Routing | Right skill fires for the prompt? | `scripts/skill-graph-routing-eval.js` | Yes (against router) |
| Manifest / contract | Authored skills round-trip through the manifest generator? | `scripts/generate-manifest.js --validate-only` | Yes |
| Drift | Has the cited truth source changed since `last_verified`? | `scripts/skill-graph-drift.js` | Yes |

The audit loop ([`SKILL_AUDIT_LOOP.md`](https://github.com/jacob-balslev/skill-audit-loop/blob/main/SKILL_AUDIT_LOOP.md)) layers a five-phase wrapper on top: deterministic lint (Phase 1), optional model-graded review across seven dimensions (Phase 2), aggregate verdict (Phase 3), fix-or-defer (Phase 4), re-verify (Phase 5). The seven audit dimensions are defined in [`scripts/lib/audit-prompt-builder.js` lines 37–80](/Users/jacobbalslev/Development/skill-graph/scripts/lib/audit-prompt-builder.js): `metadata`, `activation`, `relation`, `grounding`, `content`, `eval`, `portability`.
### 2.1 The 31 eval files

The universe of comprehension-relevant data is `examples/evals/*.json`, 31 files at the time of this report. Each file is keyed by `skill_name` (the lint check `checkEvalCoherence` at [`scripts/skill-lint.js:414`](/Users/jacobbalslev/Development/skill-graph/scripts/skill-lint.js) enforces that this matches the skill's `name` field) and contains an `evals` array of scenario objects.

A scenario object has the following per-case fields, observed across the 31 files:

| Field | Required | Cardinality / allowed values | What it carries |
|---|---|---|---|
| `id` | yes | 1 per case | Sequence number within the file |
| `prompt` | yes | 1 | The realistic user input the grader will pose to the model |
| `dimension` | yes (de facto) | enum (values listed in § 2.2) | The comprehension axis being measured |
| `substance` | yes (de facto) | `domain` / `contradiction-check` | Whether the case probes positive correctness or a negative boundary |
| `calibration` | yes (de facto) | `semantic` / `process` | Whether the grader checks meaning or sequence |
| `truth_mode` | yes (de facto) | `code_verification` / `conceptual_correctness_plus_repo_application` / `process_correctness` | How a grader confirms the answer |
| `skill_type` | yes (de facto) | `concept` / `workflow` | Which archetype the scenario tests |
| `criticality` | yes (de facto) | `normal` / `high` / `critical` | Weight for aggregation |
| `truth_sources` | yes (de facto) | array of `path` or `path:start-end` or `path#anchor` | Where a grader reads to verify |
| `expected_reasoning` | optional | string | Sketch of the correct chain of reasoning |

Two files, `comprehension.json` (`documentation`) and `debugging.json`, also include `expected_reasoning` on at least one case (`examples/evals/comprehension.json:163` and `examples/evals/comprehension.json:179`; both `dimension: rule_conflict`). These two cases are the high-water mark of comprehension-quality measurement in the repo today.

The lint check `checkEvalTruthSourceRanges` (D2, at [`scripts/skill-lint.js:586`](/Users/jacobbalslev/Development/skill-graph/scripts/skill-lint.js)) validates that every `truth_sources` reference resolves: the file exists, line ranges are within bounds, anchors match an actual heading. This is the only contract the eval files enforce today; nothing checks dimension coverage, prompt quality, or rubric alignment.
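The per-case fields in the table can be read as a small shape contract. The following is a minimal sketch of such a check, not the repo's lint: `checkScenario` and `parseTruthSource` are hypothetical helpers, the `dimension` values are the ones observed in the dimension counts of this report, and the sample case (prompt, path, range) is made up for illustration. It only inspects the shape of a reference; confirming that a file exists or that a range is within bounds would need file access, which this sketch leaves out.

```javascript
// Enum values as listed in the field table and the dimension counts in this report.
const ENUMS = {
  dimension: ["application", "boundary", "mental_model", "definition", "purpose", "rule_conflict", "anti_pattern"],
  substance: ["domain", "contradiction-check"],
  calibration: ["semantic", "process"],
  truth_mode: ["code_verification", "conceptual_correctness_plus_repo_application", "process_correctness"],
  skill_type: ["concept", "workflow"],
  criticality: ["normal", "high", "critical"],
};

// Splits one truth_sources entry into its path plus an optional line range or anchor.
// Handles the three forms named in the table: `path`, `path:start-end`, `path#anchor`.
function parseTruthSource(ref) {
  const rangeMatch = ref.match(/^(.+):(\d+)-(\d+)$/);
  if (rangeMatch) {
    const range = { start: Number(rangeMatch[2]), end: Number(rangeMatch[3]) };
    return { path: rangeMatch[1], range, anchor: null };
  }
  const hashIndex = ref.indexOf("#");
  if (hashIndex !== -1) {
    return { path: ref.slice(0, hashIndex), range: null, anchor: ref.slice(hashIndex + 1) };
  }
  return { path: ref, range: null, anchor: null };
}

// Returns a list of problems; an empty list means the case has the expected shape.
function checkScenario(c) {
  const problems = [];
  if (c.id === undefined) problems.push("missing id");
  if (typeof c.prompt !== "string" || c.prompt === "") problems.push("missing prompt");
  for (const [field, allowed] of Object.entries(ENUMS)) {
    if (!allowed.includes(c[field])) problems.push(`unexpected ${field}: ${String(c[field])}`);
  }
  if (!Array.isArray(c.truth_sources) || c.truth_sources.length === 0) {
    problems.push("truth_sources must be a non-empty array");
  } else {
    for (const ref of c.truth_sources) {
      if (typeof ref !== "string") {
        problems.push("truth_sources entries must be strings");
        continue;
      }
      const { range } = parseTruthSource(ref);
      if (range && range.start > range.end) problems.push(`inverted line range: ${ref}`);
    }
  }
  if ("expected_reasoning" in c && typeof c.expected_reasoning !== "string") {
    problems.push("expected_reasoning must be a string");
  }
  return problems;
}

// Hypothetical case for illustration only.
const sample = {
  id: 1,
  prompt: "Is it safe to cast this parsed JSON with `as User`?",
  dimension: "application",
  substance: "domain",
  calibration: "semantic",
  truth_mode: "code_verification",
  skill_type: "concept",
  criticality: "high",
  truth_sources: ["skills/type-safety/SKILL.md:63-116"],
};

console.log(checkScenario(sample)); // []
console.log(checkScenario({ ...sample, dimension: "taxonomy" })); // [ 'unexpected dimension: taxonomy' ]
```

The second call shows the practical consequence of a closed `dimension` enum: a case tagged with a concept-block field that has no observed dimension value is reported rather than silently accepted.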
### 2.2 Dimension distribution

A grep across all 31 files (`grep -h '"dimension":' examples/evals/*.json | sort | uniq -c`) shows the current dimension distribution:

| Dimension | Cases | % of total |
|---|---|---|
| `application` | 84 | 47.5% |
| `boundary` | 62 | 35.0% |
| `mental_model` | 11 | 6.2% |
| `definition` | 10 | 5.6% |
| `purpose` | 7 | 4.0% |
| `rule_conflict` | 2 | 1.1% |
| `anti_pattern` | 1 | 0.6% |
| **Total** | **177** | 100% |

This distribution is heavily skewed toward two dimensions, `application` and `boundary`, which together capture **82.5%** of the data (146 of 177 cases). Of the seven concept-block fields (`definition`, `mental_model`, `purpose`, `boundary`, `taxonomy`, `analogy`, `misconception`), **only four** have any eval coverage (`definition`, `mental_model`, `purpose`, `boundary`); `taxonomy`, `analogy`, and `misconception` have **zero** eval cases tagged for them across the entire library.
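The same tally can be computed from parsed eval files rather than from a grep. The sketch below assumes each file has already been loaded into an object with an `evals` array, as described in § 2.1; `dimensionDistribution` is a hypothetical helper and the two in-memory files are stand-in data, not the repo's.

```javascript
// Counts cases per dimension across files and reports each dimension's share of the total.
function dimensionDistribution(evalFiles) {
  const counts = new Map();
  for (const file of evalFiles) {
    for (const c of file.evals) {
      counts.set(c.dimension, (counts.get(c.dimension) || 0) + 1);
    }
  }
  const total = [...counts.values()].reduce((a, b) => a + b, 0);
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1]) // largest dimension first
    .map(([dimension, cases]) => ({
      dimension,
      cases,
      percent: Number(((cases / total) * 100).toFixed(1)),
    }));
}

// Stand-in data for illustration only.
const files = [
  { skill_name: "a", evals: [{ dimension: "application" }, { dimension: "boundary" }] },
  { skill_name: "b", evals: [{ dimension: "application" }, { dimension: "definition" }] },
];

console.log(dimensionDistribution(files));
```

Working from parsed JSON rather than text matching makes it straightforward to extend the tally later, for example by grouping on a second key per case.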
### 2.3 The audit loop's content dimension

The audit loop's `content` dimension (one of the seven scorecard rows) does include comprehension-shaped checks. From [`SKILL_AUDIT_CHECKLIST.md` § 5 (lines 132–141)](https://github.com/jacob-balslev/skill-audit-loop/blob/main/SKILL_AUDIT_CHECKLIST.md):

- the skill has a clear Coverage section
- the skill has a clear Philosophy section
- the skill has a clear Verification section
- the skill has at least one concrete decision table, checklist, or routing rule
- the skill contains negative bounds (Do NOT Use When)
- the skill does not contain generic model-native filler
- the skill does not claim behavior it cannot verify

These are **structural** checks on the body, not behavioral checks on an agent. The audit grader reads the skill and judges the body; it does **not** pose scenarios to a fresh model that has the skill loaded as context. The distinction matters: a skill's body can be perfectly clear and still fail to cause comprehension, and a skill's body can be cryptic and still cause comprehension if its primitives transfer. The audit measures the prose; comprehension measures the **agent's behavior given the prose**.
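To make "structural" concrete, here is a minimal sketch of what a check on the body alone looks like. It is not the audit grader: `missingSections` is a hypothetical helper, the section names come from the checklist items above, and the assumption that sections appear as Markdown `#` headings with exactly these titles is mine.

```javascript
// Section names taken from the checklist items above; exact heading titles are an assumption.
const REQUIRED_SECTIONS = ["Coverage", "Philosophy", "Verification", "Do NOT Use When"];

// Returns the required sections that do not appear as a heading in the skill body.
function missingSections(body) {
  const headings = body
    .split("\n")
    .filter((line) => /^#{1,6}\s/.test(line))
    .map((line) => line.replace(/^#{1,6}\s+/, "").trim());
  return REQUIRED_SECTIONS.filter((name) => !headings.includes(name));
}

// Stand-in skill body for illustration only.
const body = ["## Coverage", "text", "## Philosophy", "text", "## Verification"].join("\n");

console.log(missingSections(body)); // [ 'Do NOT Use When' ]
```

Nothing in this check runs a model or poses a scenario. It can pass on a body that teaches nothing and fail on one that teaches well, which is the gap between measuring the prose and measuring behavior.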
|
|
120
|
+
|
|
121
|
+
### 2.4 The `concept` block today
|
|
122
|
+
|
|
123
|
+
The seven-field `concept` block is defined in the schema at [`schemas/skill.v4.schema.json` lines 169–211](/Users/jacobbalslev/Development/skill-graph/schemas/skill.v4.schema.json). Each field carries an explicit grader weight in its description string:
|
|
124
|
+
|
|
125
|
+
| Field | Schema weight | Schema description (summary) |
|
|
126
|
+
|---|---|---|
|
|
127
|
+
| `definition` | 1.0 | What the concept IS — primary category + what it does + who uses it |
|
|
128
|
+
| `mental_model` | 1.5 | Primitives and their relationships |
|
|
129
|
+
| `purpose` | 1.0 | Problem it solves and the alternative it replaced |
|
|
130
|
+
| `boundary` | 1.5 | Things commonly confused with the concept but that are NOT it |
|
|
131
|
+
| `taxonomy` | 1.0 | Nearby concepts with relationship type (subset / alternative / prerequisite / composition / specialization) |
|
|
132
|
+
| `analogy` | 0.5 | Analogy that preserves the core mechanism |
|
|
133
|
+
| `misconception` | not directly graded | Wrong mental model people bring; inoculation hint |
|
|
134
|
+
|
|
135
|
+
The schema description for `concept` explicitly names a script that does not exist in the repo: `scripts/skill/evaluate-skill.js --comprehension`. A `find` over the repo confirms no file at that path. This is the single largest gap: the protocol declares the surface, the schema documents the grader weights, the lint enforces the presence of the seven fields when `comprehension_state: present`, but **no code consumes the block for grading**.
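
Because no grader exists, the following is only a sketch of how the schema weights could be consumed. The weights come from the table above; the function name `weightedComprehensionScore`, the per-field boolean verdict shape, and the choice to normalise by total weight are assumptions, not the behavior of the missing `evaluate-skill.js --comprehension` script.

```javascript
// Grader weights copied from the schema table above.
// `misconception` is omitted because the schema marks it "not directly graded".
const FIELD_WEIGHTS = {
  definition: 1.0,
  mental_model: 1.5,
  purpose: 1.0,
  boundary: 1.5,
  taxonomy: 1.0,
  analogy: 0.5,
};

// Hypothetical aggregation: weighted share of graded fields that passed.
// `verdicts` maps a field name to true (pass) or false (fail).
function weightedComprehensionScore(verdicts) {
  let earned = 0;
  let possible = 0;
  for (const [field, weight] of Object.entries(FIELD_WEIGHTS)) {
    possible += weight;
    if (verdicts[field] === true) earned += weight;
  }
  return earned / possible;
}

// Illustrative verdicts: mental_model and taxonomy fail, the rest pass.
const exampleVerdicts = {
  definition: true, mental_model: false, purpose: true,
  boundary: true, taxonomy: false, analogy: true,
};
const exampleScore = weightedComprehensionScore(exampleVerdicts);
console.log(exampleScore.toFixed(3));
```

With these weights, failing `mental_model` or `boundary` costs three times as much as failing `analogy`, which is the asymmetry the schema descriptions imply.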
|
|
136
|
+
|
|
137
|
+
Two skills in the active library declare `comprehension_state: present` and provide a populated `concept` block: `skills/type-safety/SKILL.md` (lines 63–116) and `skills/acid-fundamentals/SKILL.md` (lines 66–181). Both are exemplary in depth: by the cited ranges, the `type-safety` concept block spans 54 lines of frontmatter and the `acid-fundamentals` block spans 116. Both also declare `eval_artifacts: planned` and `eval_state: unverified`; that is, they self-report that no comprehension eval has shipped yet, which is accurate.
|
|
138
|
+
|
|
139
|
+
The `examples/evals/comprehension.json` file is keyed to `skill_name: documentation`, **not** to either of the two skills with populated concept blocks. As of this report, no eval file exists for `type-safety` or `acid-fundamentals` — the two gold-standard comprehension targets are the two with zero eval coverage.
|
|
140
|
+
|
|
141
|
+
### 2.5 Summary of what's evaluated today
|
|
142
|
+
|
|
143
|
+
The repo today evaluates:
|
|
144
|
+
|
|
145
|
+
- **Whether `concept` exists** when `comprehension_state: present`, and whether each of the seven sub-fields is a string (lint enforces; see [`scripts/skill-lint.js:309`](/Users/jacobbalslev/Development/skill-graph/scripts/skill-lint.js)).
|
|
146
|
+
- **Whether eval files exist** when `eval_artifacts: present` and whether their `skill_name` matches a skill (lint enforces; see [`scripts/skill-lint.js:414`](/Users/jacobbalslev/Development/skill-graph/scripts/skill-lint.js)).
|
|
147
|
+
- **Whether eval `truth_sources` resolve** to real files, valid line ranges, and existing anchors (D2 check; see [`scripts/skill-lint.js:586`](/Users/jacobbalslev/Development/skill-graph/scripts/skill-lint.js)).
|
|
148
|
+
- **Whether `activation.examples` route to this skill** and `activation.anti_examples` route away from it (positive/negative routing; see [`scripts/skill-graph-routing-eval.js`](/Users/jacobbalslev/Development/skill-graph/scripts/skill-graph-routing-eval.js)).
|
|
149
|
+
- **Whether the body is structurally complete** against the archetype (see [`scripts/lint/check-archetype-sections.js`](/Users/jacobbalslev/Development/skill-graph/scripts/lint/check-archetype-sections.js)).
|
|
150
|
+
- **Whether the audit grader can judge each of seven scorecard dimensions** from the skill body + truth sources + neighbors + (for eval) attached eval artifact (see [`scripts/lib/audit-prompt-builder.js`](/Users/jacobbalslev/Development/skill-graph/scripts/lib/audit-prompt-builder.js)).
|
|
151
|
+
|
|
152
|
+
What the repo today **does not evaluate** — the focus of §3 — is **whether an agent actually learns the subject from the skill**.
|
|
153
|
+
|
|
154
|
+
---
|
|
155
|
+
|
|
156
|
+
## 3. What the repo does NOT evaluate
|
|
157
|
+
|
|
158
|
+
The gap is precise. It is not "we have no evals." It is: **no surface in the repo measures whether an agent given this `SKILL.md` as context, and given a fresh prompt that does not appear in the body, produces an answer that demonstrates the skill's concept primitives have been internalized.**
|
|
159
|
+
|
|
160
|
+
This section maps the gap onto the seven concept-block fields plus two cross-cutting dimensions the body already encodes (Verification, Do NOT Use When) but no comprehension eval probes.
|
|
161
|
+
|
|
162
|
+
### 3.1 No per-concept-field eval contract
|
|
163
|
+
|
|
164
|
+
The schema documents grader weights for each concept field. There is no eval-file convention that ties a scenario to a concept field. The closest signal is the `dimension` enum in eval cases — but `dimension` is a loose enum (`application`, `boundary`, `definition`, `mental_model`, `purpose`, `rule_conflict`, `anti_pattern`) that does not map 1:1 to the seven concept fields. Specifically:
|
|
165
|
+
|
|
166
|
+
- `taxonomy`, `analogy`, and `misconception` have no corresponding `dimension` value in any eval file in the library.
|
|
167
|
+
- `application` (the most common dimension, 84 cases) is not a concept field; it is closer to "task-level behavior".
|
|
168
|
+
- `boundary` is overloaded — it serves both as the concept-block field name AND as the eval dimension that means "routing handoff to a sibling skill". These are not the same idea: the concept-block `boundary` is about *which things look like this concept but aren't* (a discrimination test); the eval-dimension `boundary` is about *which queries belong to a sibling skill instead* (a routing test).
|
|
169
|
+
|
|
170
|
+
The result: a skill author who fills the `concept` block does not know what a comprehension eval for that block would look like, and a grader has no scaffolding to tell whether `mental_model` (weight 1.5) actually transferred.
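
One possible shape for such a convention is sketched below, under stated assumptions: the `concept_field` and `probe` keys do not exist in the current eval files, the probe vocabulary is invented for illustration, and `validateCase` is a hypothetical helper. The sketch keeps the concept-block sense of `boundary` (discrimination) apart from the routing sense by giving routing handoffs their own `probe` value.

```javascript
const CONCEPT_FIELDS = new Set([
  'definition', 'mental_model', 'purpose', 'boundary',
  'taxonomy', 'analogy', 'misconception',
]);

// Hypothetical probe vocabulary; not part of the current schema.
const PROBES = new Set(['discrimination', 'routing_handoff', 'transfer', 'inoculation']);

// Returns a list of problems; an empty list means the case is well-formed.
function validateCase(evalCase) {
  const errors = [];
  const { concept_field: field = null, probe } = evalCase;
  if (!PROBES.has(probe)) errors.push(`unknown probe: ${probe}`);
  if (probe === 'routing_handoff') {
    // A routing handoff tests the router, not a concept field.
    if (field !== null) errors.push('routing_handoff cases must not claim a concept_field');
  } else if (!CONCEPT_FIELDS.has(field)) {
    errors.push(`unknown concept_field: ${field}`);
  }
  return errors;
}

console.log(validateCase({ concept_field: 'boundary', probe: 'discrimination' }));
console.log(validateCase({ concept_field: 'boundary', probe: 'routing_handoff' }));
```

The second call reports an error because it mixes the two meanings of `boundary` that the list above separates.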
|
|
171
|
+
|
|
172
|
+
### 3.2 No verbatim-copy detector
|
|
173
|
+
|
|
174
|
+
A failure mode the protocol's quality doctrine warns against — "evals that paraphrase the skill body back to itself" ([`AGENTS.MD` line 186](/Users/jacobbalslev/Development/skill-graph/AGENTS.MD)) — has no test. The current eval files frequently use prompts like:
|
|
175
|
+
|
|
176
|
+
> "According to the X skill's Y section, what is the correct primitive…"
|
|
177
|
+
> (e.g. [`examples/evals/a11y.json:8-16`](/Users/jacobbalslev/Development/skill-graph/examples/evals/a11y.json))
|
|
178
|
+
|
|
179
|
+
These prompts effectively instruct the model to **quote** the skill, then check whether the quote is correct. They cannot distinguish a model that copies from a model that reasons. The cases marked `dimension: definition` (10 cases) are particularly vulnerable: by definition, the right answer is "what the skill says X is", and any model that can find the section heading wins.
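
A crude detector for this failure mode is word n-gram overlap between the agent's answer and the skill body. The sketch below is an assumption about how such a check could look, not an existing script: `copiedFraction` is a hypothetical name, the 4-gram window is arbitrary, and any pass/fail threshold would need calibration against real answers.

```javascript
// Lowercase word tokens; punctuation is dropped.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9_]+/g) || [];
}

// Set of all word n-grams in a token list.
function ngrams(tokens, n) {
  const grams = new Set();
  for (let i = 0; i + n <= tokens.length; i += 1) {
    grams.add(tokens.slice(i, i + n).join(' '));
  }
  return grams;
}

// Fraction of the answer's n-grams that also occur in the skill body.
function copiedFraction(answer, body, n = 4) {
  const answerGrams = ngrams(tokenize(answer), n);
  if (answerGrams.size === 0) return 0;
  const bodyGrams = ngrams(tokenize(body), n);
  let shared = 0;
  for (const gram of answerGrams) {
    if (bodyGrams.has(gram)) shared += 1;
  }
  return shared / answerGrams.size;
}

// Illustrative strings only; not taken from any real skill.
const body = 'Prefer unknown over any and narrow the value before use';
const quoted = 'Prefer unknown over any and narrow the value before use';
const reasoned = 'The parser should treat input as untrusted and check its shape first';

console.log(copiedFraction(quoted, body));
console.log(copiedFraction(reasoned, body));
```

A high fraction flags an answer as likely quotation. A low fraction does not prove reasoning, so the check works best as a filter ahead of rubric grading.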
|
|
180
|
+
|
|
181
|
+
### 3.3 No near-vs-far transfer separation
|
|
182
|
+
|
|
183
|
+
The educational psychology literature, especially [Barnett & Ceci (2002)](https://pubmed.ncbi.nlm.nih.gov/12081085/), distinguishes near transfer (similar context, similar surface) from far transfer (different context, same underlying primitives). Far transfer is the harder test of genuine comprehension and the one the literature finds most often fails.
|
|
184
|
+
|
|
185
|
+
The eval library has no near/far distinction. Most cases use scenarios extremely close to the body's worked examples — the `documentation` skill's eval asks about doc-type selection in a hypothetical that closely matches the body's doc-type-selection section. A model that has the body in context can pattern-match the prompt to the relevant body anchor without ever using the concept's primitives on a case the body did not cover.
|
|
186
|
+
|
|
187
|
+
A genuine `mental_model` test would force far transfer: pose a scenario the body does not enumerate, where the only path to a correct answer is to apply the concept's named primitives to the novel case. The repo today has no eval cases that explicitly forbid the model from quoting the body, no cases that use scenarios from a different domain than the body's examples, and no rubric for distinguishing "the primitives transferred" from "the surface matched".
|
|
188
|
+
|
|
189
|
+
### 3.4 No misconception inoculation test
|
|
190
|
+
|
|
191
|
+
The `concept.misconception` field is required when `comprehension_state: present`, and the schema notes it is "not directly graded; complements `boundary`" ([`schemas/skill.v4.schema.json:208`](/Users/jacobbalslev/Development/skill-graph/schemas/skill.v4.schema.json)). But "complements `boundary`" is not a measurement. A misconception eval would pose a prompt that **sounds like the misconception is correct** and check whether the agent corrects it unprompted. No eval case in the library does this. The closest is the comprehension.json file's `rule_conflict` dimension — two cases that pose a "should this rule bend?" question and check the agent's reasoning — but those probe rule-application under tension, not misconception-inoculation.
|
|
192
|
+
|
|
193
|
+
### 3.5 No analogy-reuse / analogy-overreach probe
|
|
194
|
+
|
|
195
|
+
The `concept.analogy` field has grader weight 0.5 (the lowest of the six graded weights) and zero eval cases tagged for it. The literature on analogical reasoning ([Gentner 1983 structure-mapping; Hofstadter & Sander 2013 *Surfaces and Essences*](https://www.basicbooks.com/titles/douglas-hofstadter/surfaces-and-essences/9780465018475/)) consistently finds that the test of analogy mastery is **knowing where the analogy breaks**. The current eval files have no scenario that asks "given this analogy from the skill, is this novel case structurally analogous or is the analogy stretched too far?" The result: a skill can have a beautiful analogy in its `concept.analogy` field and no test will catch a model that has rote-memorized the analogy but applies it where it shouldn't.
|
|
196
|
+
|
|
197
|
+
### 3.6 No taxonomy-navigation probe
|
|
198
|
+
|
|
199
|
+
The `concept.taxonomy` field is required to enumerate nearby concepts with their relationship type (subset / alternative / prerequisite / composition / specialization). The schema description is explicit about the relationship-type vocabulary ([`schemas/skill.v4.schema.json:200`](/Users/jacobbalslev/Development/skill-graph/schemas/skill.v4.schema.json)). No eval case tests whether the agent can correctly classify a novel instance into the right relationship-type bucket. A genuine taxonomy test would pose a scenario involving a concept not enumerated in `taxonomy` and ask the agent to place it (e.g., for `type-safety`'s taxonomy that lists "sound", "unsound/gradual", "structural", "nominal", "dependent", "refinement", "narrowing", "validation": pose a question about a Haskell **GADT** — neither dependent nor refinement, but it composes types with constraints — and verify the agent reaches "extension of refinement-typing in the direction of dependent typing" rather than mis-placing it).
|
|
200
|
+
|
|
201
|
+
### 3.7 No Verification-checklist application test
|
|
202
|
+
|
|
203
|
+
Every `capability` and `workflow` skill is required to have a `## Verification` section (enforced by [`scripts/lint/check-archetype-sections.js`](/Users/jacobbalslev/Development/skill-graph/scripts/lint/check-archetype-sections.js)). The Verification section is the skill's authored answer to "after applying this skill, what evidence confirms the skill was applied correctly?" Eval cases occasionally reference Verification (e.g. `examples/evals/comprehension.json:174-181`, where the dimension is `rule_conflict` and the prompt asks whether a doc passing Verification but failing the Philosophy test is acceptable). But no case poses a fresh artifact and asks the agent to run the Verification checklist against it unprompted. This is a comprehension test that the body already authors but the eval surface never exercises.
|
|
204
|
+
|
|
205
|
+
### 3.8 No negative-boundary refusal test
|
|
206
|
+
|
|
207
|
+
The `## Do NOT Use When` section is required for `capability` skills and is a primary anti-overreach mechanism. The lint check verifies the section exists; the routing eval verifies that `anti_examples` route away. But neither tests **the loaded model's refusal behavior**: when a user prompt asks the skill to do something in the Do-NOT list, does the agent refuse and route to the named owner skill, or does it overreach? The routing eval tests the router's decision before the skill is loaded; the comprehension test would probe the loaded skill's refusal discipline. They answer different questions.
|
|
208
|
+
|
|
209
|
+
### 3.9 Gap summary table
|
|
210
|
+
|
|
211
|
+
| Concept-block field | Schema weight | Eval cases today | Gap |
|
|
212
|
+
|---|---|---|---|
|
|
213
|
+
| `definition` | 1.0 | 10 (5.7%) | Cannot distinguish quote-retrieval from reasoning |
|
|
214
|
+
| `mental_model` | 1.5 | 11 (6.3%) | No far-transfer scenarios — body-side pattern match dominates |
|
|
215
|
+
| `purpose` | 1.0 | 7 (4.0%) | Few cases; tests "why does this exist" but not "what would replace it" |
|
|
216
|
+
| `boundary` | 1.5 | 62 (35.2%) — but these mostly test **routing** boundary, not concept-block discrimination | Concept-discrimination tests confused with routing-handoff tests |
|
|
217
|
+
| `taxonomy` | 1.0 | 0 | No test of relationship-type placement |
|
|
218
|
+
| `analogy` | 0.5 | 0 | No test of analogy reuse or analogy-stretch limits |
|
|
219
|
+
| `misconception` | not graded | 0 | No misconception-inoculation probe |
|
|
220
|
+
| (cross-cutting) Verification application | — | 0 unprompted | No fresh-artifact "apply checklist" cases |
|
|
221
|
+
| (cross-cutting) Do NOT refusal | — | 0 in-skill refusal | Only routing-time anti_examples — no in-skill refusal probe |
|
|
222
|
+
|
|
223
|
+
---
|
|
224
|
+
|
|
225
|
+
## 4. External research synthesis
|
|
226
|
+
|
|
227
|
+
Comprehension measurement for AI agents draws from three rough buckets of prior work: classical educational measurement, machine-learning evaluation infrastructure, and recent LLM-specific eval practice. Each contributes a different idea the proposed rubric uses.
|
|
228
|
+
|
|
229
|
+
### 4.1 Classical educational measurement
|
|
230
|
+
|
|
231
|
+
#### 4.1.1 Bloom's Taxonomy, revised (Anderson & Krathwohl 2001)
|
|
232
|
+
|
|
233
|
+
The 2001 revision of [Bloom's Taxonomy](https://www.researchgate.net/publication/242400296_A_Revision_of_Bloom's_Taxonomy_An_Overview) replaces nouns with verbs and reorders the top two levels:
|
|
234
|
+
|
|
235
|
+
> remember → understand → apply → analyze → evaluate → create
|
|
236
|
+
|
|
237
|
+
Each level corresponds to a different cognitive operation and, importantly, the revision introduces an **orthogonal knowledge dimension** (factual / conceptual / procedural / metacognitive). The two-dimensional framing (cognitive operation × knowledge type) is exactly the framing a comprehension grader needs: "did the agent **apply** **conceptual** knowledge?" is a different test from "did the agent **remember** **factual** knowledge?".
|
|
238
|
+
|
|
239
|
+
Implication for the Skill Graph: the existing `application` dimension (84 cases) is mostly testing the Apply level. The repo has no dedicated `Analyze` (decompose the skill's claim into primitives), `Evaluate` (judge a proposed answer against the skill's standards), or `Create` (compose new artifacts the skill governs) cases. The `definition` and `mental_model` dimensions are Understand-level. The `boundary` and `rule_conflict` dimensions come closest to Analyze-level. The six-level Bloom vocabulary maps cleanly onto a more discriminating rubric than the current loose enum.
|
|
240
|
+
|
|
241
|
+
Recent applied work — [BloomAPR (Tang et al., 2025)](https://arxiv.org/html/2509.25465v1) — uses Bloom's taxonomy explicitly to grade LLM capabilities at automatic program repair, with "higher-order questions primarily involving advanced cognitive processes such as applying, analyzing, evaluating, and creating, typically requiring multi-step reasoning, concept integration, or complex problem-solving". This confirms the framing transfers to LLM evaluation.
|
|
242
|
+
|
|
243
|
+
#### 4.1.2 Criterion-referenced vs norm-referenced assessment
|
|
244
|
+
|
|
245
|
+
The educational measurement literature distinguishes [criterion-referenced](https://www.edpsycinteractive.org/topics/measeval/crnmref.html) (you pass if you meet a fixed standard) from norm-referenced (you pass if you outperform the comparison group). The Skill Graph's existing evals are criterion-referenced: a case PASSes if the answer matches the truth source, not if it outperforms a baseline. This is the right design — comprehension is "did the agent learn the concept" not "did the agent score in the top quartile" — and the proposed rubric stays criterion-referenced.
|
|
246
|
+
|
|
247
|
+
The implication is that the rubric must encode **what counts as a pass on each dimension independently**, not "score 4/5 overall". Validity and reliability research ([Norm- vs Criterion-Referenced Ratings, Hagiwara 2023](https://pmc.ncbi.nlm.nih.gov/articles/PMC10498947/)) finds that "criterion-referenced scaling demonstrated higher reliability than norm-referenced scaling" — i.e., binary pass/fail per criterion is more reliable than a Likert score across criteria.
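
Per-criterion binary grading can be sketched as follows. The criterion names are hypothetical placeholders, and `gradeDimension` is an illustrative helper rather than part of the audit loop; the point is only that a dimension passes when every criterion passes, and that the failing criteria are reported by name instead of being averaged away.

```javascript
// Each criterion is judged independently as pass (true) or fail (anything else).
function gradeDimension(criteria) {
  const failed = Object.entries(criteria)
    .filter(([, ok]) => ok !== true)
    .map(([name]) => name);
  return { passed: failed.length === 0, failed };
}

// Hypothetical criteria for a `mental_model` case.
const mentalModelResult = gradeDimension({
  names_primitives: true,
  applies_to_novel_case: false,
  avoids_body_quote: true,
});

console.log(mentalModelResult);
```

An averaged score over the same three criteria would read as two-thirds, which hides that the one failing criterion is the transfer test.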
|
|
248
|
+
|
|
249
|
+
#### 4.1.3 Near vs far transfer (Barnett & Ceci 2002)
|
|
250
|
+
|
|
251
|
+
[Barnett & Ceci's "When and where do we apply what we learn? A taxonomy for far transfer"](https://pubmed.ncbi.nlm.nih.gov/12081085/) provides a 9-dimensional taxonomy distinguishing near from far transfer along context (knowledge domain, physical context, temporal context, functional context, social context, modality) and content (learned skill, performance change, memory demands) axes. The empirical conclusion is that **far transfer is rare and difficult**: most measured transfer effects are near transfer. The implication for skill comprehension: a comprehension eval that only uses scenarios near to the body's worked examples is measuring near transfer — a much weaker claim of learning. A far-transfer eval forces the agent to apply the primitives to a context the body did not enumerate.
|
|
252
|
+
|
|
253
|
+
The proposed rubric uses this directly: each dimension lists "near-pass" and "far-pass" criteria, and the rubric distinguishes them so the grader can tell which kind of pass occurred.
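
A minimal sketch of how a grader could report which kind of pass occurred, assuming each eval result carries a hypothetical `transfer` tag of `near` or `far` (no such tag exists in the eval files today). The rule that a far-pass requires every far case to pass is also an assumption made for illustration.

```javascript
// Summarise pass results separately for near- and far-transfer cases.
function transferSummary(results) {
  const summary = {
    near: { passed: 0, total: 0 },
    far: { passed: 0, total: 0 },
  };
  for (const r of results) {
    const bucket = summary[r.transfer];
    if (!bucket) throw new Error(`unknown transfer tag: ${r.transfer}`);
    bucket.total += 1;
    if (r.passed) bucket.passed += 1;
  }
  return summary;
}

// 'far-pass' when every far case passed, 'near-pass' when only the near cases did.
function transferVerdict(summary) {
  const allPassed = (b) => b.total > 0 && b.passed === b.total;
  if (allPassed(summary.far)) return 'far-pass';
  if (allPassed(summary.near)) return 'near-pass';
  return 'fail';
}

const summary = transferSummary([
  { transfer: 'near', passed: true },
  { transfer: 'near', passed: true },
  { transfer: 'far', passed: false },
]);
console.log(transferVerdict(summary));
```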
|
|
254
|
+
|
|
255
|
+
#### 4.1.4 The Feynman technique
|
|
256
|
+
|
|
257
|
+
The [Feynman technique](https://fs.blog/feynman-technique/) ("if you can't explain it simply, you don't understand it") is the practitioner-side counterpart to Bloom's Understand level. The four steps are: (1) write the concept, (2) explain it as if to a beginner, (3) identify gaps, (4) simplify. A [2023 study using the technique](https://ejournal.iainmadura.ac.id/index.php/panyonara/article/download/14936/4202/) found that scores nearly doubled from pre-test to post-test (34→66). [A 2025 study on English language learners](https://arxiv.org/pdf/2506.09055) (the "Feynman Bot" paper) recorded a 28% average score increase by applying the technique. The technique operationalizes the **Protégé effect**: teaching forces comprehension.
|
|
258
|
+
|
|
259
|
+
The Feynman technique provides one rubric dimension directly: ask the agent to re-explain the concept to a domain outsider, without quoting the body. The output should preserve the skill's primitives and structural relationships, in different surface words. A correct re-explanation is evidence of comprehension; a paraphrase of the body is not.
|
|
260
|
+
|
|
261
|
+
### 4.2 ML/LLM evaluation infrastructure
|
|
262
|
+
|
|
263
|
+
#### 4.2.1 HELM (Stanford CRFM)
|
|
264
|
+
|
|
265
|
+
[HELM](https://crfm.stanford.edu/helm/) is a multi-axis evaluation framework. The seven metrics are **accuracy, calibration, robustness, fairness, bias, toxicity, efficiency**. The framing — "many metrics on many scenarios" — is the design point: a single accuracy number hides the trade-offs. From the [paper](https://arxiv.org/abs/2211.09110):
|
|
266
|
+
|
|
267
|
+
> "metrics beyond accuracy don't fall to the wayside, and that trade-offs are clearly exposed"
|
|
268
|
+
|
|
269
|
+
The HELM contribution to a comprehension rubric is **the discipline of multi-metric reporting**: a skill that PASSes accuracy on definition cases but FAILs on calibration (the agent is overconfident on cases it gets wrong) is still a partial pass and the rubric should reflect that. Specifically, the HELM **calibration** metric — "alignment between model confidence and actual performance" — applies directly: a model that hedges appropriately on a far-transfer case and gets it right is comprehending better than a model that asserts the same answer with high confidence.
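
To report calibration next to accuracy, a grader needs a confidence value per answer. The sketch below assumes the agent is asked to state a confidence in [0, 1] with each answer, which the current eval files do not do; `calibrationReport` is a hypothetical helper. It computes the Brier score (mean squared gap between confidence and outcome) and a simple overconfidence gap (mean confidence minus accuracy).

```javascript
// `results` is a list of { confidence: number in [0, 1], correct: boolean }.
function calibrationReport(results) {
  const n = results.length;
  if (n === 0) throw new Error('no results to score');
  let squaredError = 0;
  let confidenceSum = 0;
  let correctCount = 0;
  for (const { confidence, correct } of results) {
    const outcome = correct ? 1 : 0;
    squaredError += (confidence - outcome) ** 2;
    confidenceSum += confidence;
    correctCount += outcome;
  }
  const accuracy = correctCount / n;
  const meanConfidence = confidenceSum / n;
  return {
    accuracy,
    meanConfidence,
    brier: squaredError / n, // 0 is perfect; lower is better
    overconfidence: meanConfidence - accuracy, // positive means overconfident
  };
}

// Illustrative numbers only.
const report = calibrationReport([
  { confidence: 0.9, correct: true },
  { confidence: 0.9, correct: false },
  { confidence: 0.6, correct: true },
  { confidence: 0.6, correct: false },
]);
console.log(report);
```

In this sample the agent is right half the time while claiming 75% confidence on average, so the report shows an overconfidence gap of about 0.25 alongside the accuracy figure.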
|
|
270
|
+
|
|
271
|
+
#### 4.2.2 BIG-bench
|
|
272
|
+
|
|
273
|
+
[BIG-bench (Srivastava et al. 2022)](https://arxiv.org/abs/2206.04615) — 204 tasks across 132 institutions — defines good task design principles: tasks should be **diverse**, **expert-authored**, **focus on capabilities current models fail at**, and have **standardized evaluation metrics** with human baselines. The "breakthroughness" concept measures how a task's success curve changes with model scale — a useful idea: a comprehension dimension where every model scores 0% or 100% is not discriminating; the useful dimensions are the ones where current-generation models partially succeed.
|
|
274
|
+
|
|
275
|
+
The implication for the Skill Graph: comprehension eval cases should be calibrated to discriminate among current models. A case that today's Sonnet/Opus answers correctly without any skill loaded is not measuring the skill's contribution. The smallest viable check: spot-test each comprehension case against the model **without the skill in context** and confirm the answer changes when the skill is added.
|
|
276
|
+
|
|
277
|
+
#### 4.2.3 TruthfulQA — imitative falsehoods
|
|
278
|
+
|
|
279
|
+
[TruthfulQA (Lin, Hilton, Evans 2021)](https://arxiv.org/abs/2109.07958) is the canonical benchmark for measuring whether a model truthfully answers questions where the most-likely-on-training-distribution answer is wrong. The central finding:
|
|
280
|
+
|
|
281
|
+
> "The best model was truthful on 58% of questions while human performance was 94%, with the largest models generally being the least truthful."
|
|
282
|
+
|
|
283
|
+
The construct of **imitative falsehood** — "a false answer is an imitative falsehood if it has high likelihood on the model's training distribution" — translates directly to skill comprehension. A skill exists to **correct** something the underlying model would otherwise get wrong (a misconception, a category error, an overreach). The comprehension test of a skill's `misconception` field is: pose a prompt the base model handles wrongly (with the wrong answer being high-likelihood on the training distribution); load the skill; verify the answer changes to the right one.
|
|
284
|
+
|
|
285
|
+
This gives a precise definition of "the skill is teaching something the model doesn't already know": the base-model answer is wrong AND the skill-loaded answer is right. A skill that produces no answer change is either redundant (the model already knows) or ineffective (the skill is loaded but ignored).
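
The base-versus-skill comparison reduces to a four-way classification per case. A minimal sketch, assuming each case has already been graded twice (once without the skill in context, once with it); `classifySkillEffect` and the label strings are hypothetical. The `regression` branch is not discussed in the text above and is included only so every combination has a label.

```javascript
// baseCorrect: graded answer without the skill loaded.
// skillCorrect: graded answer with the skill loaded.
function classifySkillEffect(baseCorrect, skillCorrect) {
  if (!baseCorrect && skillCorrect) return 'teaches'; // the skill changed a wrong answer to a right one
  if (baseCorrect && skillCorrect) return 'redundant'; // the model already knew
  if (!baseCorrect && !skillCorrect) return 'ineffective'; // the skill was loaded but did not help
  return 'regression'; // right without the skill, wrong with it
}

// Illustrative graded pairs.
const graded = [
  { id: 'case-1', base: false, skill: true },
  { id: 'case-2', base: true, skill: true },
  { id: 'case-3', base: false, skill: false },
];
const effects = graded.map((g) => classifySkillEffect(g.base, g.skill));
console.log(effects);
```

Only cases labelled `teaches` are evidence that the skill contributes something the base model lacks.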
|
|
286
|
+
|
|
287
|
+
#### 4.2.4 MMLU and its critiques
|
|
288
|
+
|
|
289
|
+
[MMLU (Hendrycks et al. 2020)](https://arxiv.org/abs/2009.03300) covers 57 subject areas in multiple-choice format. The design intent — quoted in [DataCamp's overview](https://www.datacamp.com/blog/what-is-mmlu) — was to "eliminate the efficacy of shallow statistical tricks", forcing models to "draw inferences, connect abstract ideas, and navigate complex conceptual relationships". The critique (e.g., [Chandak et al. 2025](https://intuitionlabs.ai/pdfs/mmlu-pro-explained-the-advanced-ai-benchmark-for-llms.pdf)) is that even on MMLU-Pro, models exploit multiple-choice shortcuts, and free-form answer matching is more robust.
|
|
290
|
+
|
|
291
|
+
The implication: comprehension cases should be **free-form**, not multiple choice. The current eval files are already mostly free-form (the prompt asks for a reasoned answer). The risk to avoid: introducing leading prompts that contain the answer phrasing. The Skill Graph's prompts mostly avoid this, but nothing enforces it today; a quality check could.
|
|
292
|
+
|
|
293
|
+
#### 4.2.5 Contrast sets (Gardner et al. 2020)
|
|
294
|
+
|
|
295
|
+
[Gardner et al.'s "Evaluating Models' Local Decision Boundaries via Contrast Sets"](https://aclanthology.org/2020.findings-emnlp.117.pdf) introduces the contrast-set technique: take a test instance, perturb it minimally so the gold label flips, and re-evaluate. The empirical result across 10 datasets: model performance drops up to 25% on contrast sets vs. the original test set. The diagnostic: high test-set accuracy with low contrast-set accuracy indicates **the model learned a dataset shortcut, not the underlying concept**.
|
|
296
|
+
|
|
297
|
+
This is directly usable: for each comprehension case, author a **paired contrast case** where one detail flips the right answer (e.g., for a `type-safety` case "the function accepts `unknown` and narrows it" → pair it with "the function accepts `any` and casts it"). If the model gets both right, the primitives transferred; if it gets only the first right, it's pattern-matching. The contrast-set discipline can be added incrementally to the current eval files.
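
Scoring paired cases is simple once each pair has been graded. In this sketch the pair shape (`originalCorrect`, `contrastCorrect`) and the helper name `scoreContrastPairs` are assumptions; the current eval files have no pairing field.

```javascript
// Each pair holds the graded result for an original case and its minimally
// perturbed contrast case, whose right answer differs from the original's.
function scoreContrastPairs(pairs) {
  let bothCorrect = 0;
  let originalOnly = 0;
  for (const p of pairs) {
    if (p.originalCorrect && p.contrastCorrect) bothCorrect += 1;
    else if (p.originalCorrect) originalOnly += 1;
  }
  return {
    pairs: pairs.length,
    bothCorrect, // evidence the primitives transferred
    originalOnly, // evidence of surface pattern-matching
    consistency: pairs.length === 0 ? 0 : bothCorrect / pairs.length,
  };
}

// Illustrative results for three pairs.
const contrastScore = scoreContrastPairs([
  { originalCorrect: true, contrastCorrect: true },
  { originalCorrect: true, contrastCorrect: false },
  { originalCorrect: true, contrastCorrect: false },
]);
console.log(contrastScore);
```

Here plain accuracy on the originals is 100%, while pair consistency is one in three, which is the shortcut signature Gardner et al. describe.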
|
|
298
|
+
|
|
299
|
+
#### 4.2.6 LLM-as-a-judge bias
|
|
300
|
+
|
|
301
|
+
The [Zheng et al. 2023 paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"](https://arxiv.org/abs/2306.05685) documents three judge biases that bear directly on comprehension grading:
|
|
302
|
+
|
|
303
|
+
| Bias | Effect | Mitigation |
|
|
304
|
+
|---|---|---|
|
|
305
|
+
| **Position bias** | "gpt-3.5 being biased 50% of the time and claude-v1 being biased 70% of the time toward the first position" in pairwise comparisons | Evaluate every pair in both orders and only count consistent verdicts |
|
|
306
|
+
| **Verbosity bias** | "claude-v1 and gpt-3.5 preferred the longer response more than 90% of the time" | Use direct scoring with a rubric rather than pairwise length-sensitive comparison |
|
|
307
|
+
| **Self-enhancement** | "claude-v1 favored itself with a 25% higher win rate" vs. human | Use a different model family as the judge than the one being evaluated |
|
|
308
|
+
|
|
309
|
+
Follow-up work — [Zhang et al. 2024 "Style Outweighs Substance"](https://arxiv.org/html/2409.15268v2) — shows LLM judges can prefer answers with confident style even when factually wrong. [Surveys on LLM-as-judge](https://arxiv.org/html/2411.15594v6) document 12 bias types.
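
The position-bias mitigation in the table above (judge both orders, keep only consistent verdicts) can be sketched as below. The `judge` callback is a stand-in: a real judge would be an asynchronous model call, and its `'first'` / `'second'` return convention is an assumption made for this example.

```javascript
// `judge(first, second)` returns 'first' or 'second' for the answer it prefers.
// The pair is judged in both orders; a verdict counts only if both orders agree.
function consistentPreference(judge, answerA, answerB) {
  const forward = judge(answerA, answerB) === 'first' ? 'A' : 'B';
  const reversed = judge(answerB, answerA) === 'first' ? 'B' : 'A';
  return forward === reversed ? forward : 'inconsistent';
}

// Stub judges for illustration only.
const alwaysFirst = () => 'first'; // maximally position-biased
const prefersLonger = (first, second) => (first.length >= second.length ? 'first' : 'second');

console.log(consistentPreference(alwaysFirst, 'short', 'a longer answer'));
console.log(consistentPreference(prefersLonger, 'short', 'a longer answer'));
```

The position-biased stub is filtered out as `inconsistent`. The length-preferring stub passes this filter, which shows that order-swapping addresses position bias only and leaves verbosity bias to the other mitigations.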
|
|
310
|
+
|
|
311
|
+
The implication for the proposed comprehension rubric:
|
|
312
|
+
|
|
313
|
+
- **Direct scoring on a fixed rubric**, not pairwise comparison, mitigates verbosity bias.
|
|
314
|
+
- **Different model family as grader** than as agent, mitigates self-enhancement.
|
|
315
|
+
- **Chain-of-thought in the grader prompt** improves reliability ([Wang et al. 2023](https://arxiv.org/html/2410.15393v1)).
|
|
316
|
+
- **Binary pass/fail per rubric criterion**, not Likert, improves reliability ([Hagiwara 2023](https://pmc.ncbi.nlm.nih.gov/articles/PMC10498947/) and Yan's writing — see §4.3.2).
|
|
317
|
+
|
|
318
|
+
The existing audit-prompt-builder ([`scripts/lib/audit-prompt-builder.js`](/Users/jacobbalslev/Development/skill-graph/scripts/lib/audit-prompt-builder.js)) already does most of this: it requires evidence quotes, constrains output to a `<verdict>` JSON block, and uses direct scoring on a 1–5 scale. The 1–5 scale should arguably move to binary-per-criterion + aggregate, but that's a separate decision.
|
|
319
|
+
|
|
320
|
+
### 4.3 Practitioner content (recent)
|
|
321
|
+
|
|
322
|
+
#### 4.3.1 Hamel Husain — "Your AI product needs evals"
|
|
323
|
+
|
|
324
|
+
[Husain's blog](https://hamel.dev/blog/posts/evals/) defines a three-tier evaluation hierarchy:
|
|
325
|
+
|
|
326
|
+
- **Level 1: Unit tests / assertions** — fast, deterministic, run on every change
|
|
327
|
+
- **Level 2: Human & model evaluation** — set cadence, more expensive
|
|
328
|
+
- **Level 3: A/B testing** — only for significant product changes
|
|
329
|
+
|
|
330
|
+
Importantly, Husain argues against treating evals as pure pass/fail unit tests: "unlike typical unit tests, you want to organize these assertions for use in places beyond unit tests, such as data cleaning and automatic retries." The same assertion can be a gate **and** a synthetic-data filter **and** a retry signal.
|
|
331
|
+
|
|
332
|
+
His [FAQ](https://hamel.dev/blog/posts/evals-faq/) takes a strong position on rubric design:
|
|
333
|
+
|
|
334
|
+
> "Binary evaluations force clearer thinking and more consistent labeling. Likert scales introduce significant challenges: the difference between adjacent points (like 3 vs 4) is subjective and inconsistent across annotators"
|
|
335
|
+
|
|
336
|
+
And on rubric construction:
|
|
337
|
+
|
|
338
|
+
> "Begin with a 'benevolent dictator' (single domain expert) reviewing 100+ traces. Use open coding to identify real failure patterns. Group observations into a failure taxonomy through 'axial coding'. Only build evaluators for failures that persist after fixing obvious prompt gaps."
|
|
339
|
+
|
|
340
|
+
This ground-up approach is the opposite of authoring rubric criteria from theory. The implication: a Skill Graph comprehension rubric should be derived from **observed comprehension failures** in current eval runs, not (only) from Bloom levels. The proposed rubric in §5 is theory-anchored on the concept-block fields, but the criteria for each dimension should be calibrated against actual model failures on existing eval cases.
|
|
341
|
+
|
|
342
|
+
#### 4.3.2 Eugene Yan — task-specific evals
|
|
343
|
+
|
|
344
|
+
[Yan's task-specific evaluation patterns](https://eugeneyan.com/writing/evals/) reject several common approaches:
|
|
345
|
+
|
|
346
|
+
- N-gram metrics (ROUGE, METEOR): "Poor distribution separation"
|
|
347
|
+
- Embedding-similarity metrics (BERTScore, MoverScore): "the similarity distributions of positive and negative instances are too close"
|
|
348
|
+
- Generic LLM-as-judge (G-Eval): "unreliable (low recall), costly, and has poor sensitivity"
|
|
349
|
+
|
|
350
|
+
Yan advocates **binary classification framing** — "Factually consistent vs. inconsistent / Relevant vs. irrelevant / Toxic vs. non-toxic" — because it enables ROC-AUC and PR-AUC analysis, distribution-separation diagnostics, and active-learning prioritization.
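
Binary labels make ROC-AUC straightforward to compute. The sketch below uses the pairwise definition (the probability that a randomly chosen positive example outscores a randomly chosen negative one, with ties counting half). The data is illustrative, and `rocAuc` is a hypothetical helper, not something shipped in the repo.

```javascript
// `examples` is a list of { score: number, label: 0 | 1 }.
// Quadratic in the number of examples, which is fine for small eval sets.
function rocAuc(examples) {
  const positives = examples.filter((e) => e.label === 1).map((e) => e.score);
  const negatives = examples.filter((e) => e.label === 0).map((e) => e.score);
  if (positives.length === 0 || negatives.length === 0) {
    throw new Error('ROC-AUC needs at least one example of each class');
  }
  let wins = 0;
  for (const p of positives) {
    for (const q of negatives) {
      if (p > q) wins += 1;
      else if (p === q) wins += 0.5;
    }
  }
  return wins / (positives.length * negatives.length);
}

// Illustrative judge scores against human pass (1) / fail (0) labels.
const auc = rocAuc([
  { score: 0.9, label: 1 },
  { score: 0.8, label: 1 },
  { score: 0.4, label: 1 },
  { score: 0.7, label: 0 },
  { score: 0.3, label: 0 },
  { score: 0.2, label: 0 },
]);
console.log(auc.toFixed(3));
```

An AUC near 0.5 would mean the judge's scores do not separate passing from failing answers, which is the "poor distribution separation" symptom Yan describes.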
|
|
351
|
+
|
|
352
|
+
His [LLM-evaluator analysis](https://eugeneyan.com/writing/llm-evaluators/) finds that LLM judges agree with humans on subjective tasks (85% on MT-Bench) but fail on objective ones (30–60% recall on hallucination detection). Implication: comprehension grading is closer to a **subjective task** (judging meaning preservation) than an **objective one** (string match), so LLM-judge has a reasonable chance, but the judge must be well-calibrated and the rubric must be discriminating.
|
|
353
|
+
|
|
354
|
+
#### 4.3.3 Anthropic agent-skills evaluation (2026)
|
|
355
|
+
|
|
356
|
+
In March 2026, Anthropic added evals to its `skill-creator` ([Anthropic engineering blog](https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills); [Tessl coverage](https://tessl.io/blog/anthropic-brings-evals-to-skill-creator-heres-why-thats-a-big-deal/)). The system uses **four sub-agents working in parallel**:


| Sub-agent | Role |
|---|---|
| Executor | Runs skills against eval prompts |
| Grader | Evaluates outputs against defined expectations |
| Comparator | Blind A/B comparisons between skill versions |
| Analyzer | Surfaces patterns aggregate stats might hide |


Eval cases are JSON test cases with `eval_id`, `eval_name`, `prompt`, and `assertions`. Assertions are tagged as either **quality** (specific content checks, like "review should flag forEach with async callback") or **format** (structural checks, like "Review uses severity levels critical/warning/suggestion"). The system renders **pass-rate comparisons of skill-enabled vs. baseline** — which is exactly the BIG-bench breakthroughness idea (§4.2.2) applied to skills.


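The case shape and the skill-vs-baseline comparison can be sketched as follows. Only the four top-level field names come from the description above; the per-assertion keys (`tag`, `text`), the sample outcomes, and the `pass_rates` helper are assumptions made for illustration.

```python
from collections import defaultdict

# One eval case using the four fields named above.
# The per-assertion keys ("tag", "text") are assumed for this sketch.
eval_case = {
    "eval_id": "review-001",
    "eval_name": "async forEach review",
    "prompt": "Review this pull request.",
    "assertions": [
        {"tag": "quality", "text": "review should flag forEach with async callback"},
        {"tag": "format", "text": "Review uses severity levels critical/warning/suggestion"},
    ],
}


def pass_rates(results):
    """Return {tag: fraction passed} from an iterable of (tag, passed) pairs."""
    passed, total = defaultdict(int), defaultdict(int)
    for tag, ok in results:
        total[tag] += 1
        passed[tag] += int(ok)
    return {tag: passed[tag] / total[tag] for tag in total}


# Hypothetical grader outcomes for the same cases, with and without the skill loaded.
with_skill = [("quality", True), ("quality", True), ("format", True), ("format", False)]
baseline = [("quality", False), ("quality", True), ("format", False), ("format", False)]

skill_rates, base_rates = pass_rates(with_skill), pass_rates(baseline)
lift = {tag: skill_rates[tag] - base_rates[tag] for tag in skill_rates}
print(lift)  # per-tag improvement attributable to the skill
```

Reporting the lift per tag keeps a gain on format assertions from masking a flat result on quality assertions.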


The Skill Graph eval files are structurally similar to Anthropic's shape (`evals` array, per-case `prompt`, per-case truth references). The gap is the **assertion tagging** (`quality` vs `format`) and the **baseline comparison**. The proposed rubric in §5 maps onto Anthropic's assertion vocabulary: each rubric dimension is either quality-style (does the agent's answer match the concept primitives) or format-style (does the agent invoke the Verification checklist).


#### 4.3.4 OpenAI agent-skills evals


[OpenAI's developer blog "Testing Agent Skills Systematically with Evals"](https://developers.openai.com/blog/eval-skills) recommends layering deterministic checks and rubric grading:


- **Deterministic checks**: "Did it run `npm install`? Did it create `package.json`?" — observable behavior
- **Rubric grading**: qualitative requirements like "component structure, styling conventions" with structured output schemas


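One possible way to layer the two kinds of check is sketched below: cheap deterministic checks run first, and the rubric grader is only called when they pass. The run-record shape, the function names, and the stub grader are hypothetical; the ordering is a design assumption of this sketch, not something the OpenAI post prescribes.

```python
def deterministic_checks(run):
    """Observable-behavior checks; `run` is a hypothetical record of an agent run."""
    return {
        "ran_npm_install": "npm install" in run["commands"],
        "created_package_json": "package.json" in run["files_created"],
    }


def grade(run, rubric_grader):
    """Run deterministic checks first; call the rubric grader only if they all pass."""
    checks = deterministic_checks(run)
    if not all(checks.values()):
        return {"passed": False, "checks": checks, "rubric": None}
    rubric = rubric_grader(run)  # in practice a model call returning {criterion: bool}
    return {"passed": all(rubric.values()), "checks": checks, "rubric": rubric}


def stub_rubric_grader(run):
    """Stand-in for a model-graded rubric with a structured output schema."""
    return {"component_structure": True, "styling_conventions": True}


good_run = {"commands": ["npm install", "npm test"], "files_created": ["package.json"]}
bad_run = {"commands": ["npm test"], "files_created": []}

print(grade(good_run, stub_rubric_grader)["passed"])
print(grade(bad_run, stub_rubric_grader)["rubric"])
```

Skipping the rubric call on a deterministic failure also speaks to the efficiency concern: no grader tokens are spent on runs that already failed an observable check.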


OpenAI frames skill mastery in four success categories: **outcome goals** (task completion), **process goals** (correct tool invocation), **style goals** (adherence to conventions), **efficiency goals** (avoiding excessive commands or token use). The four-category framing maps cleanly onto the audit loop's seven dimensions; the contribution to comprehension specifically is the explicit "process goals" vocabulary — for `workflow` skills, comprehension means executing the right steps in the right order, not just producing the right end-state.


[OpenAI Evals (the open-source framework)](https://github.com/openai/evals) supports both **ground-truth evals** (deterministic comparison to known answers) and **model-graded evals** (a stronger model judges qualitative output). The system uses rubric-based scoring, with healthcare rubrics in the wild containing 48,562 unique criteria. This suggests that rubrics can scale to very large criterion sets without losing reliability **as long as each criterion is binary**.


#### 4.3.5 agentskills.io specification


The community [Agent Skills specification](https://agentskills.io/specification) is the open standard the Skill Graph protocol is downstream of (the v4 frontmatter is a strict superset of the agentskills.io frontmatter). The spec defines two required frontmatter fields (`name`, `description`) and a body with no format restrictions. **It defines no evaluation surface.** Quality is checked only structurally via `skills-ref validate`. This means: there is no upstream comprehension contract for the Skill Graph to inherit; the Skill Graph is upstream of the eval discipline, not downstream.


### 4.4 Synthesis — what each source contributes to the rubric


| Source | Contributes |
|---|---|
| Bloom revised (2001) | 6-level cognitive hierarchy; orthogonal knowledge dimension; verbs for each level |
| Criterion-referenced assessment | Pass/fail per criterion; reliability advantage over Likert |
| Barnett & Ceci (2002) | Near-vs-far transfer distinction as a quality test |
| Feynman technique | Re-explanation as a comprehension probe |
| HELM | Multi-metric reporting discipline; calibration as a distinct metric |
| BIG-bench | Skill-vs-baseline comparison ("breakthroughness") |
| TruthfulQA | Imitative-falsehood inoculation as the misconception test |
| MMLU critiques | Free-form > multiple choice; shortcut detection |
| Contrast sets (Gardner 2020) | Paired-perturbation cases to detect pattern-match |
| LLM-as-judge bias (Zheng 2023, follow-ups) | Position / verbosity / self-enhancement mitigations |
| Hamel Husain | Binary criteria; failure-mode-driven rubric construction |
| Eugene Yan | Binary classification framing; LLM-judge fitness per task type |
| Anthropic skill-creator (2026) | 4 sub-agents; quality-vs-format assertions; baseline comparison |
| OpenAI agent-skills | 4 success categories (outcome/process/style/efficiency) |
| agentskills.io | No upstream contract; Skill Graph defines its own |


### 4.7 RLHF failure rates as the empirical case for forced-completeness graders


The comprehension grader proposed in §5 and §7.2 is itself an LLM — and the same training pressures that distort the agent-under-test distort the grader. The literature on LLM-as-judge bias (§4.2.6) names the *taxonomy* of failure (position, verbosity, self-enhancement, style-over-substance). The literature on RLHF-induced behavior names the *base rates*. Both matter: the taxonomy tells the rubric what to defend against; the base rates tell the maintainer how much defending is actually needed.


The Development workspace's [`skills/methodical/SKILL.md` § 5 — The Root Cause Model](/Users/jacobbalslev/Development/skills/methodical/SKILL.md) collates the most-cited measurements. The numbers below come from the cited primary sources, verified for this revision:


| Failure mode | Measured rate | Primary source | Verified |
|---|---|---|---|
| **Sycophancy** — model agrees with the user's framing even when wrong | 58.19% across frontier models (Gemini-1.5-Pro 62.47%, ChatGPT-4o 56.71%, Claude-Sonnet intermediate) | Fanous et al. (2025), *SycEval: Evaluating LLM Sycophancy*. [arXiv:2502.08177](https://arxiv.org/abs/2502.08177). Tested on AMPS (math) and MedQuad (medical). | Yes (WebSearch 2026-05-16) |
| **Summarization overgeneralization** — LLM summaries claim broader conclusions than the source supports | 26–73% per model across 4,900 summaries of 200 abstracts from *Nature, Science, Lancet, NEJM*; nearly 5× more frequent than human-written summaries (OR 4.85, 95% CI [3.06, 7.70]) | Peters & Chin-Yee (2025), *Generalization bias in large language model summarization of scientific research*. [Royal Society Open Science, vol. 12(4):241776](https://royalsocietypublishing.org/rsos/article/12/4/241776/235656/Generalization-bias-in-large-language-model). PMC mirror: [PMC12042776](https://pmc.ncbi.nlm.nih.gov/articles/PMC12042776/). | Yes (WebSearch 2026-05-16) |
| **Framing bias** — output changes the sentiment of the source context | 26.42% of cases across five LLM families on summarization + fact-checking | Aldigheri et al. (2025), *Quantifying Cognitive Bias Induction in LLM-Generated Content*. [arXiv:2507.03194](https://arxiv.org/abs/2507.03194); also [ACL Anthology 2025.ijcnlp-long.155](https://aclanthology.org/2025.ijcnlp-long.155/). | Yes (WebSearch 2026-05-16) |
| **Instruction skipping / attention dilution** — later or middle-of-context instructions receive less attention as prompt length grows | Performance degradation of 13.9–85% as input length grows, even with 100% retrieval of the target information; ~20% recall in the middle of long context vs. higher recall at start and end | As cited in [`skills/methodical/SKILL.md` § Key Sources](/Users/jacobbalslev/Development/skills/methodical/SKILL.md) (attributing Unite.AI 2025). The phenomenon is independently corroborated by the "lost in the middle" literature: Liu et al. (2023), *Lost in the Middle: How Language Models Use Long Contexts*. [arXiv:2307.03172](https://arxiv.org/abs/2307.03172). The specific Unite.AI article was not directly located in this revision; the underlying mechanism is well-established. | Mechanism verified; the specific Unite.AI 2025 attribution from `methodical` is reused as cited and qualified here. |
| **Multi-agent error amplification** — error rate of unstructured multi-agent chains relative to single-agent baseline | 17.2× in unstructured "bag of agents" networks; ~4.4× when central coordination acts as a circuit breaker | Towards Data Science (2025), *Why Your Multi-Agent System is Failing: Escaping the 17x Error Trap of the "Bag of Agents"*. [towardsdatascience.com](https://towardsdatascience.com/why-your-multi-agent-system-is-failing-escaping-the-17x-error-trap-of-the-bag-of-agents/). The underlying scaling principles are from Kim et al. cited in that article. | Yes (WebSearch 2026-05-16) |


**Why these numbers matter for the comprehension grader.** Each row is the failure mode the grader is most likely to exhibit *when grading* — not (only) the failure mode the agent-under-test exhibits.


- **Sycophancy in the grader** looks like: agreeing with the agent's answer because the answer is articulate and confident. The 58% base rate ([SycEval, 2025](https://arxiv.org/abs/2502.08177)) suggests that more than half of unconstrained verdicts could lean toward whatever framing the agent's response anchored. Mitigation: the rubric forces the grader to quote the agent's exact words before deciding (Appendix A § A.1 RULE), and to render a binary verdict per behavior, not an impressionistic 1–5.
- **Summarization overgeneralization in the grader** looks like: "the agent's answer broadly demonstrates understanding" instead of "the agent invoked the runtime-boundary primitive on line 3 of its answer." The 26–73% base rate ([Peters & Chin-Yee, 2025](https://royalsocietypublishing.org/rsos/article/12/4/241776/235656/Generalization-bias-in-large-language-model)) is the empirical case for the rubric's per-behavior decomposition: a summary verdict is not a verdict.
- **Framing bias in the grader** looks like: scoring an answer with positive framing as better than the same content with negative framing. The 26.42% base rate ([Aldigheri et al., 2025](https://arxiv.org/abs/2507.03194)) is the empirical case for output-schema constraint (the verdict JSON has no free-prose summary slot where positive framing could leak in unscored).
- **Instruction skipping in the grader** looks like: the grader's prompt enumerates 6 binary behaviors but the verdict only addresses 4 of them, with the missing two silently dropped. The attention-dilution mechanism (Liu et al. 2023, [arXiv:2307.03172](https://arxiv.org/abs/2307.03172); as discussed in [`methodical` § Key Sources](/Users/jacobbalslev/Development/skills/methodical/SKILL.md)) is the empirical case for the verdict-schema requiring an entry per `expected_behavior.id`, not a free array. A response missing an `id` is a parse error.
- **Multi-agent error amplification** matters because the comprehension surface is itself a chain: agent → grader → aggregator → audit-loop scorecard. The 17.2× amplification rate ([Towards Data Science, 2025](https://towardsdatascience.com/why-your-multi-agent-system-is-failing-escaping-the-17x-error-trap-of-the-bag-of-agents/)) is the empirical case for the grader being a single well-constrained step, not a chain — and for the audit-loop integration (R6, §7.6) to consume the grader's structured verdict, not its prose summary.


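The "entry per `expected_behavior.id`, missing id is a parse error" rule from the instruction-skipping bullet can be sketched as a small validator. The verdict JSON shape, the `VerdictParseError` name, and the sample ids are assumptions for illustration; the actual Appendix A schema may differ.

```python
import json


class VerdictParseError(ValueError):
    """Raised when a grader verdict does not cover exactly the expected ids."""


def parse_verdict(raw_json, expected_ids):
    """Parse a verdict and require one entry per expected behavior id.

    Assumed shape (hypothetical, for this sketch only):
    {"verdicts": {"<behavior id>": {"pass": true, "evidence": "<quoted span>"}}}
    """
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise VerdictParseError(f"invalid JSON: {exc}") from exc
    verdicts = data.get("verdicts") if isinstance(data, dict) else None
    if not isinstance(verdicts, dict):
        raise VerdictParseError("'verdicts' must be an object keyed by behavior id")
    missing = sorted(set(expected_ids) - set(verdicts))
    unexpected = sorted(set(verdicts) - set(expected_ids))
    if missing or unexpected:
        raise VerdictParseError(f"missing ids: {missing}; unexpected ids: {unexpected}")
    for behavior_id, entry in verdicts.items():
        if not isinstance(entry, dict) or not isinstance(entry.get("pass"), bool):
            raise VerdictParseError(f"{behavior_id}: 'pass' must be true or false")
        if not entry.get("evidence"):
            raise VerdictParseError(f"{behavior_id}: a quoted evidence span is required")
    return verdicts


expected = ["b1", "b2", "b3"]
complete = json.dumps({"verdicts": {i: {"pass": True, "evidence": "quoted text"} for i in expected}})
dropped = json.dumps({"verdicts": {"b1": {"pass": True, "evidence": "quoted text"}}})

print(len(parse_verdict(complete, expected)))
try:
    parse_verdict(dropped, expected)
except VerdictParseError as err:
    print("rejected:", err)
```

Keying the verdict by id rather than accepting a free array is what turns a silently dropped behavior into a loud failure.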


The implication is direct: the comprehension grader cannot be a prompt that says "judge whether the agent understood." That prompt would inherit every base rate above. The grader prompt must be a structurally constrained instrument that produces a verdict the failure modes above cannot easily corrupt — which is what the §5 rubric (binary per behavior + quoted evidence + verbatim-overlap check) and the Appendix A template (forced JSON output + per-behavior verdicts + RULE block) together implement.


**`methodical` as the explanatory layer.** The repo-internal [`skills/methodical/SKILL.md`](/Users/jacobbalslev/Development/skills/methodical/SKILL.md) frames these statistics as the *root cause model* for why behavioral rules like complete-reporting exist. Where this report's §5 rubric *measures* comprehension and §8 *defends* the measurement against bias, `methodical` provides the WHY: the grader (and the agent) are not careless; they are doing exactly what RLHF trained them to do, and structural countermeasures are the only reliable correction. Subsequent sections cite `methodical` rules directly where they constrain grader behavior.

---

## 5. Proposed comprehension-quality rubric


This section proposes 8 rubric dimensions (9 in the analytical table in §5.1, which lists C4 and C7 separately), each a distinct measurement of whether the agent learned the subject. Six dimensions map directly to concept-block fields once C4 and C7 are merged; two are cross-cutting on the body's `## Verification` and `## Do NOT Use When` sections, which every capability skill must have.


The rubric is **criterion-referenced**, **binary per criterion**, **free-form prompt**, and **direct-score** (no pairwise). It mirrors the existing audit-prompt-builder pattern of one `<verdict>` block per dimension but adds two new dimensions specific to comprehension (not currently in the seven audit dimensions). The intent is **additive**: adopt the rubric without breaking lint, manifest, drift, or routing surfaces.


### 5.1 Rubric overview


| # | Dimension | Concept field | Schema weight | Cross-cutting | Bloom level |
|---|---|---|---|---|---|
| C1 | Definitional precision | `concept.definition` | 1.0 | — | Understand |
| C2 | Mental-model fidelity | `concept.mental_model` | 1.5 | — | Apply / Analyze |
| C3 | Purpose articulation | `concept.purpose` | 1.0 | — | Understand / Evaluate |
| C4 | Boundary discrimination | `concept.boundary` | 1.5 | — | Analyze |
| C5 | Taxonomy navigation | `concept.taxonomy` | 1.0 | — | Analyze |
| C6 | Analogy reuse with limit | `concept.analogy` | 0.5 | — | Apply / Evaluate |
| C7 | Misconception inoculation | `concept.misconception` | — (complement) | — | Evaluate |
| C8 | Verification application | — | — | `## Verification` body section | Apply |
| C9 | Negative-boundary refusal | — | — | `## Do NOT Use When` body section | Evaluate |


(The numbering uses C-prefix to avoid colliding with the audit-loop's seven dimensions. A skill can be audited on the seven structural dimensions and on the eight or nine comprehension dimensions independently.)


Why 9 dimensions are listed when the executive summary said 8: dimension C7 (misconception) and C4 (boundary) overlap conceptually. Authors should expect to write a single rubric criterion set that covers both, weighted as the schema weights suggest. The 9-dimension table is the analytical view; the operational view collapses C4 and C7 into one author-side criterion set (see §5.10).


### 5.2 C1 — Definitional precision


**What it measures.** Does the agent produce a definition of the subject that matches the skill's `concept.definition` field on the **load-bearing elements** — primary category, what it does, who uses it — without verbatim copying, without dropping constraints, and without smuggling in unwarranted claims?


**Skill input.** The `concept.definition` field is the canonical answer the rubric is calibrated against.


**Probe template.** "Explain what [subject] is to a [domain-outsider role]. Do not use the body's exact words." Vary the domain-outsider role across cases (a junior engineer, a product manager, a security reviewer) so the case set exercises restatement.


**Pass criteria (criterion-referenced, binary per criterion).**


- [ ] The agent's definition names the same **primary category** as `concept.definition`.
- [ ] The agent's definition states what the concept **does** (its observable effect) in different surface words.
- [ ] The agent's definition does NOT contain a 6-word-or-longer verbatim span from the skill body's `## Coverage`, `## Philosophy`, or `concept.definition`. (Operationalization: tokenize, look for any 6-gram substring present in both. The 6-gram threshold is a calibrable hyperparameter; 6 is a starting point used in plagiarism-detection literature.)
- [ ] The agent's definition does NOT introduce constraints or claims not in `concept.definition` (no fabricated specificity).


A pass requires all four binary checks. A FAIL on the third criterion (verbatim span) is a hard fail — the case did not measure comprehension, only retrieval.


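The verbatim-span criterion (tokenize, then look for any shared 6-gram) can be sketched as below. The word-level tokenizer, the helper names, and the abridged definition string are assumptions of this sketch; a production check would also cover the `## Coverage` and `## Philosophy` sections and might normalize tokens differently.

```python
import re


def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_verbatim_span(answer, sources, n=6):
    """True if the answer shares any n-gram with any source text."""
    answer_grams = ngrams(answer, n)
    return any(answer_grams & ngrams(source, n) for source in sources)


# Abridged stand-in for `concept.definition`; not the full field text.
definition = (
    "Type safety is the property of a program in which type errors "
    "are detected before they cause incorrect behavior."
)
copied = "Type safety is the property of a program in which type errors are caught."
restated = "It's the property of a codebase where the compiler catches a category of mistakes."

print(has_verbatim_span(copied, [definition]))    # shares a 6-gram with the definition
print(has_verbatim_span(restated, [definition]))  # longest shared run is 4 words
```

Because `n` is a parameter, the threshold stays calibrable as the criterion requires.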


**Pass example (for `type-safety`).** Probe: "Explain type safety to a product manager." Pass answer: "It's the property of a codebase where the compiler catches a category of mistakes — like calling a function with the wrong shape of input — before users see them. The team pays a cost up-front in writing type annotations, and the benefit is that whole classes of runtime errors are ruled out at build time rather than caught in production." This names the primary category (compile-time error detection), the effect (rules out a class of errors), and the cost/benefit, without 6-gram overlap with the body.


**Fail example.** Probe: same. Fail answer: "Type safety is the property of a program in which type errors — operations applied to values of the wrong kind — are detected before they cause incorrect behavior." This is a verbatim copy of [`skills/type-safety/SKILL.md:64`](https://github.com/jacob-balslev/skills/blob/main/skills/type-safety/SKILL.md) (the first sentence of `concept.definition`). The agent retrieved; it did not internalize.


**Grader procedure.**


1. Embed `concept.definition`, the full skill body, and the agent's answer in the grader prompt.
2. Ask the grader to list, in JSON, each of the four binary criteria with a verdict and a quoted evidence span from the agent's answer.
3. Aggregate: pass iff all four are true.


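The aggregation step can be sketched as a small function over the grader's per-criterion booleans. The criterion ids and the returned `hard_fail` flag are hypothetical names chosen for this sketch; only the "pass iff all four are true" rule and the hard-fail status of the verbatim criterion come from the text above.

```python
# Hypothetical ids for the four C1 criteria, in the order they are listed.
C1_CRITERIA = (
    "primary_category",
    "states_effect",
    "no_verbatim_span",
    "no_fabricated_claims",
)


def aggregate_c1(verdicts):
    """Combine per-criterion booleans into a case-level C1 result."""
    missing = [c for c in C1_CRITERIA if c not in verdicts]
    if missing:
        raise KeyError(f"missing criteria: {missing}")
    return {
        "pass": all(verdicts[c] is True for c in C1_CRITERIA),
        # A verbatim span means the case measured retrieval, not comprehension.
        "hard_fail": verdicts["no_verbatim_span"] is not True,
    }


print(aggregate_c1({
    "primary_category": True,
    "states_effect": True,
    "no_verbatim_span": False,
    "no_fabricated_claims": True,
}))
```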


Bias mitigation: grader is a different model family than the agent under test; no Likert scale; output forced into a `<verdict>` JSON block per the existing audit-prompt-builder pattern.


### 5.3 C2 — Mental-model fidelity


**What it measures.** Given a scenario that the skill body does NOT enumerate, does the agent reason about it using the primitives named in `concept.mental_model`?


This is the **far-transfer** test — the central comprehension test, weighted 1.5 in the schema, and the dimension where current eval coverage is weakest.


**Skill input.** The `concept.mental_model` field — which by convention names primitives and their relationships. For `type-safety`, the primitives are: type, soundness, structural-vs-nominal, narrowing, runtime boundary. For `acid-fundamentals`: atomicity, consistency, isolation, durability, the contract.


**Probe template.** "A team is [novel scenario the body does not enumerate]. Walk through how the concept's primitives apply to this case." The novel scenario must be in the same conceptual domain but with surface details the body does not cover.


**Pass criteria.**


- [ ] The agent names **at least two of the skill's listed primitives** in the analysis.
- [ ] The agent applies the primitives **to the new case's specifics** (not just restating what each primitive means).
- [ ] The agent's analysis arrives at a conclusion that is **consistent with the skill's mental model**. If a competent expert with this skill loaded would reach a different conclusion, the case should be re-authored.
- [ ] The agent does NOT need the body to contain the scenario for the answer to be correct (operationalization: scenario uses surface details the grader can verify don't appear in the body via `grep`).


**Pass example (for `type-safety`).** Probe: "A team is migrating a Python codebase that uses `dict[str, Any]` to represent API responses to a TypeScript codebase. They want to know whether to use `Record<string, unknown>` or `Record<string, any>`. Apply the type-safety primitives." Pass answer: cites the **runtime boundary** primitive (API responses cross the boundary; their types are unverified until parsed), distinguishes `any` (escape hatch — anything is allowed) from `unknown` (forces narrowing), connects to the **narrowing** primitive (an `unknown` value must be narrowed before access), concludes that `Record<string, unknown>` plus a validator at the boundary preserves type safety while `Record<string, any>` silently disables it. This applies multiple primitives to a case the body does not enumerate (the body discusses `JSON.parse` but not Python→TypeScript migration).


**Fail example.** Probe: same. Fail answer: "Use `Record<string, unknown>` because the type-safety skill says to prefer `unknown` over `any` always." This restates a single rule without engaging the primitives. The runtime-boundary reasoning is missing; the narrowing primitive is unused. The answer is correct in conclusion but vacuous in reasoning; it would not transfer to a case where the rule doesn't directly apply.


**Grader procedure.**


1. Pre-flight: the grader (or eval author) confirms the scenario's surface details do not appear in the body. (Lint check that this constraint holds is described in §7.)
2. Embed `concept.mental_model`, the body, the agent's answer.
3. Ask the grader to verify each binary criterion with a quoted evidence span.
4. The grader explicitly identifies which primitives the agent invoked.


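The pre-flight step in the procedure above can be approximated with a plain substring search, as a stand-in for the `grep` check. The function name, the body excerpt, and the term lists are hypothetical; choosing which surface terms characterize a scenario remains an author judgment that this sketch does not automate.

```python
def preflight_novelty(surface_terms, body):
    """Return the surface terms that already appear in the skill body.

    An empty result means the scenario passes the pre-flight check.
    Matching is a case-insensitive substring search.
    """
    haystack = body.lower()
    return [term for term in surface_terms if term.lower() in haystack]


# Hypothetical body excerpt and scenario terms for the `type-safety` example.
body = "Values returned by JSON.parse cross the runtime boundary and must be narrowed."
novel_terms = ["dict[str, Any]", "Python", "migrating"]
leaky_terms = ["JSON.parse", "Python"]

print(preflight_novelty(novel_terms, body))  # nothing leaked: scenario is novel
print(preflight_novelty(leaky_terms, body))  # the body already covers JSON.parse
```

Returning the leaked terms rather than a bare boolean tells the case author which detail to change before re-running the check.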


This is the hardest dimension to author cases for and the most diagnostic. Recommend prioritizing 3+ cases per skill.


### 5.4 C3 — Purpose articulation


**What it measures.** Can the agent state what problem the concept solves AND what the alternative was that it replaces? Per the schema description: "Concrete pain point + prior alternative."


**Skill input.** The `concept.purpose` field.


**Probe template.** "Why does [subject] exist? What did practitioners do before [subject] existed, and what was broken about that?"


**Pass criteria.**


- [ ] The agent names the **concrete pain point** that motivated the concept's existence.
- [ ] The agent names the **prior alternative** (what people did before) — either by name or by description.
- [ ] The agent explains the **mechanism by which the concept improves on the alternative**, not just that it does.
- [ ] The agent does not introduce purposes not in `concept.purpose` (no fabricated rationale).


**Pass example (for `type-safety`).** Probe: "Why does type safety as a discipline exist? What did teams do before it, and what was broken?" Pass answer: cites the pain point (runtime bugs that compound into production incidents and silent data corruption); the prior alternative (dynamic languages without typing, using tests and documentation to communicate contracts); and the mechanism (types make contracts checkable at compile time and at call sites, whereas the alternative scales worse as team size and code size grow).


**Fail example.** Probe: same. Fail answer: "Type safety prevents bugs." True but vacuous; the pain point, alternative, and mechanism are all missing.


**Grader procedure.** Same shape as C1/C2 — embed the field, body, and agent answer; verify each criterion with quoted evidence.


### 5.5 C4 — Boundary discrimination


**What it measures.** Given a prompt that **looks like** the skill's domain but actually belongs to a different concept (named in the `concept.boundary` field), does the agent recognize the difference and route correctly?


This dimension is distinct from the routing eval. The routing eval tests the router's decision at activation time. This dimension tests the loaded model's behavior: given the skill is already loaded, does the model recognize when a sub-prompt has crossed into adjacent territory?


**Skill input.** The `concept.boundary` field plus the `## Do NOT Use When` body section. (The two are related but distinct — `concept.boundary` is universal concept-discrimination; `## Do NOT Use When` is skill-routing handoff. The dimension uses both as the rubric anchor.)


**Probe template.** "[Scenario that contains a feature looking like the skill's domain but is actually owned by an adjacent skill]. Apply [this skill] to it." The agent should refuse to overreach and name the correct owner.


**Pass criteria.**


- [ ] The agent identifies that the scenario is NOT in this skill's domain.
- [ ] The agent names the **correct sibling skill** (one of the `concept.boundary` entries or `## Do NOT Use When` rows).
- [ ] The agent explains the **mechanism of the difference** — different primitives, different scope, different layer — not just "use the other skill".
- [ ] The agent does not partially comply by providing a half-answer in this skill's voice before handing off.


**Pass example (for `type-safety`).** Probe: "We're designing the JSON shape of a new API endpoint for order webhooks. Apply type-safety to decide the field structure." Pass answer: identifies that **API surface design** is owned by `api-design`, not `type-safety`; explains the mechanism — `api-design` owns the external surface contract (what fields, what format, what versioning), `type-safety` owns the internal type discipline that consumes the API's output; hands off to `api-design`.


**Fail example.** Probe: same. Fail answer: "Use `Record<string, string>` and validate with Zod" — applies type-safety's rules to an API-design problem. The boundary was crossed and not noticed.


**Grader procedure.** The grader checks for the four binary criteria; one of the criteria (correct sibling skill named) requires the grader to know the skill's `boundary` entries — embed them in the prompt.


### 5.6 C5 — Taxonomy navigation


**What it measures.** Can the agent correctly classify a novel instance into the right `taxonomy` category? Per the schema description, taxonomy categories are subset / alternative / prerequisite / composition / specialization.


**Skill input.** The `concept.taxonomy` field.


**Probe template.** "Where does [novel instance not enumerated in the taxonomy] sit in [subject]'s taxonomy? What kind of relationship does it have to [a category that is enumerated]?"


**Pass criteria.**


- [ ] The agent places the novel instance in the **right category** (or correctly identifies that it doesn't fit any).
- [ ] The agent uses the skill's **taxonomy vocabulary** (subset/alternative/prerequisite/composition/specialization).
- [ ] The agent's placement is **consistent with the taxonomy entries that ARE in the field** — i.e., the new placement does not contradict the relationships among existing entries.
- [ ] If the novel instance does not fit, the agent says so explicitly rather than forcing a misclassification.


**Pass example (for `type-safety`).** Probe: "Where does Hindley-Milner type inference sit in type-safety's taxonomy?" Pass answer: identifies HM as a **technique** rather than a system, notes that the taxonomy's listed systems (sound, unsound/gradual, structural, nominal, dependent, refinement) are about the **strength and shape** of the type system, and HM is an **algorithm** for inferring types within (typically) a Hindley-Milner-style sound polymorphic system — closest taxonomy slot is "a technique used by sound systems with parametric polymorphism" or correctly notes "doesn't fit the existing categories cleanly; would fit if we added an inference-algorithm axis".


**Fail example.** Probe: same. Fail answer: "HM is a sound type system." This conflates the system with the algorithm; HM is an inference algorithm that works in certain sound systems but isn't itself the system.


**Grader procedure.** The grader verifies the agent's classification is consistent with the existing taxonomy entries; this requires the grader to read `concept.taxonomy` and check relationship consistency. The grader's verdict cites which existing entry the new placement should be near to.


### 5.7 C6 — Analogy reuse with limit


**What it measures.** Two distinct things, evaluated as a single rubric:
(a) Can the agent **apply the analogy** from `concept.analogy` to a new case in the same domain?
(b) Can the agent **identify where the analogy breaks** when stretched to a case it wasn't designed for?


This dimension's weight is 0.5 in the schema — the lowest among the seven. That weight reflects that analogy is a teaching aid, not a load-bearing primitive; the dimension is still worth measuring because abuse of analogy is a common comprehension failure.


**Skill input.** The `concept.analogy` field.


**Probe template.** Two-part cases:
- Part A: "Apply the [subject]'s analogy to [case in the same domain that the analogy was designed to cover]. What does the analogy predict?"
- Part B: "Where does the analogy break? Give one case where the analogy would mislead someone."


**Pass criteria.**


- [ ] The agent extends the analogy to Part A's case in a way that preserves the **structural relationships** named in the analogy.
- [ ] The agent identifies in Part B a **concrete case** where the analogy fails, AND explains the mechanism (which structural relationship is preserved in the analogy but missing in the new case).
- [ ] The agent does NOT over-apply the analogy in Part A (a sign it's pattern-matching the analogy phrasing without understanding its scope).
- [ ] The agent does NOT under-claim in Part B (a sign of refusal-to-engage rather than discrimination).


**Pass example (for `type-safety`).** The analogy from `skills/type-safety/SKILL.md:105` is: "A type system is to runtime errors what a building's structural engineering is to physical collapse." Part A: "Apply this to a `npm audit` security scan." Pass answer: identifies that `npm audit` is **not** a structural-engineering check — it's a periodic inspection for **threat-class** violations the engineer didn't design for (vulnerabilities discovered after the build). The structural-engineering analogy maps to type checking (build-time guarantees against the threat class the design accounts for); `npm audit` is more like a fire-marshal inspection, a different threat class.
|
|
602
|
+
|
|
603
|
+
Part B: "Where does the structural-engineering analogy break?" Pass answer: structural engineering is approved by a stamp once; type-checking runs on every change. The analogy obscures the iterative nature of type discipline. Also: structural engineering's escape hatches (engineer overrides) are heavily regulated; TypeScript's `any` / `as` have no equivalent gating discipline by default.
|
|
604
|
+
|
|
605
|
+
**Fail example.** Part A fail: "Type safety prevents bugs, just like structural engineering prevents collapse." Restating the analogy in flat form — no structural mapping. Part B fail: "The analogy doesn't break." (Refusal to engage.)
|
|
606
|
+
|
|
607
|
+
**Grader procedure.** Two-part case; both parts must pass. The grader needs the analogy field embedded and ideally a list of "known good extensions" + "known stretching points" authored alongside the case.
|
|
608
|
+
|
|
609
|
+
### 5.8 C7 — Misconception inoculation
|
|
610
|
+
|
|
611
|
+
**What it measures.** When a prompt contains the misconception phrasing as if it were correct, does the agent correct it unprompted, citing the skill's reasoning?
|
|
612
|
+
|
|
613
|
+
This is the [TruthfulQA-shape test (Lin et al. 2021)](https://arxiv.org/abs/2109.07958): pose a prompt where the imitative-falsehood answer is the high-probability response, and verify that the loaded skill corrects it.
|
|
614
|
+
|
|
615
|
+
**Skill input.** The `concept.misconception` field plus the body's correction.
|
|
616
|
+
|
|
617
|
+
**Probe template.** "I've heard that [misconception phrased plausibly]. Walk me through whether this is correct." The probe should NOT signal that the framing is the misconception — that defeats the test.
|
|
618
|
+
|
|
619
|
+
**Pass criteria.**
|
|
620
|
+
|
|
621
|
+
- [ ] The agent flags the misconception **without being prompted to**.
|
|
622
|
+
- [ ] The agent corrects to the right understanding using the skill's primitives.
|
|
623
|
+
- [ ] The agent explains the **mechanism by which the misconception misleads** — not just "that's wrong, the answer is X".
|
|
624
|
+
- [ ] The agent does NOT validate the misconception even partially in its response.
|
|
625
|
+
|
|
626
|
+
**Pass example (for `type-safety`).** Probe: "I've been told that TypeScript catches runtime errors before they happen. Is that right? We're trying to decide whether we need runtime validation in addition." Pass answer: flags that **TypeScript does not catch runtime errors** in general — it catches type errors at compile time, but it's **unsound** and makes no claims about values that cross the runtime boundary (`JSON.parse`, env vars, network). Explains that the misconception treats compile-time type guarantees as runtime guarantees; corrects by naming the runtime-boundary primitive; confirms that runtime validation is still required at every I/O boundary.
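
The distinction the pass answer relies on (a cast is a claim, a validator is a check) can be sketched in TypeScript. The `User` shape, the `castUser` and `parseUser` helpers, and the sample payload are hypothetical and exist only for this illustration; a real codebase would more likely use a schema validator at the boundary.

```typescript
// Hypothetical shape used only for this illustration.
interface User {
  id: number;
  email: string;
}

// A cast is a compile-time claim: nothing is checked when this runs.
function castUser(raw: string): User {
  return JSON.parse(raw) as User;
}

// A hand-written validator is a runtime check at the I/O boundary.
function parseUser(raw: string): User | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch (_err) {
    return null; // not even valid JSON
  }
  if (typeof value !== "object" || value === null) {
    return null;
  }
  const candidate = value as { id?: unknown; email?: unknown };
  if (typeof candidate.id !== "number" || typeof candidate.email !== "string") {
    return null;
  }
  return { id: candidate.id, email: candidate.email };
}

const malformed = '{"id": "42"}'; // id is a string, email is missing

// Compiles and runs without complaint, although the payload is not a User.
const claimed = castUser(malformed);
// Statically `claimed.email` is a string; at runtime the property is absent.
const emailIsActuallyString = typeof claimed.email === "string";

// The validator rejects the same payload instead of trusting it.
const checked = parseUser(malformed);
```

Under these assumptions the cast lets a malformed payload flow into typed code, while the validator stops it at the boundary. That is the mechanism a passing answer is expected to name.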
|
|
627
|
+
|
|
628
|
+
**Fail example.** Probe: same. Fail answer: "Yes, TypeScript catches type errors at compile time, but you also need runtime validation for untrusted input." Partial — the misconception's "catches runtime errors" framing was not corrected; the language "catches type errors at compile time" lets the user keep believing the original framing was approximately right.
|
|
629
|
+
|
|
630
|
+
**Grader procedure.** The grader needs the `concept.misconception` text embedded so it can identify whether the agent invoked the correction. The hardest case-authoring discipline: the probe must be a natural framing, not a leading question.
|
|
631
|
+
|
|
632
|
+
### 5.9 C8 — Verification application
|
|
633
|
+
|
|
634
|
+
**What it measures.** Given a fresh artifact (a piece of code, a doc, a design), does the agent **run the skill's `## Verification` checklist on it unprompted** as part of producing a recommendation?
|
|
635
|
+
|
|
636
|
+
**Skill input.** The `## Verification` section in the body. (Cross-cutting: not a concept-block field, but every capability/workflow skill is required to have it.)
|
|
637
|
+
|
|
638
|
+
**Probe template.** "Here is a piece of [artifact relevant to the skill's domain]. What's your assessment?" The probe does NOT say "run the Verification checklist" — the test is whether the agent invokes it spontaneously.
|
|
639
|
+
|
|
640
|
+
**Pass criteria.**
|
|
641
|
+
|
|
642
|
+
- [ ] The agent's response includes **at least 3 of the 5+ Verification checklist items** as observable checks against the artifact.
|
|
643
|
+
- [ ] The agent's response distinguishes **passing checks** from **failing checks** for the specific artifact.
|
|
644
|
+
- [ ] The agent does NOT provide a verdict on the artifact without anchoring it to the Verification criteria.
|
|
645
|
+
- [ ] The agent's checks reference **concrete features of the artifact** (line numbers, function names, observable behaviors) — not abstract restatements of the criteria.
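
Three of the criteria above (item count, pass/fail differentiation, artifact anchoring) can be approximated mechanically once a grader has extracted structured checks from the agent's answer. The following is a minimal sketch under stated assumptions: the `ExtractedCheck` shape, the `gradeC8` name, and the rule that only decided checks need an anchor are all hypothetical, and the extraction step itself is left to the grader. The third criterion (no unanchored verdict) requires reading the verdict text and is not covered here.

```typescript
type CheckOutcome = "pass" | "fail" | "unknown";

// Hypothetical structure a grader might extract from the agent's answer.
interface ExtractedCheck {
  checklistItemId: string; // which Verification item the agent applied
  outcome: CheckOutcome;   // what the agent concluded for this artifact
  artifactAnchor: string;  // e.g. "line 12" or a function name; "" if none
}

interface C8Result {
  enoughItems: boolean;
  distinguishesOutcomes: boolean;
  anchoredToArtifact: boolean;
  pass: boolean;
}

function gradeC8(checks: ExtractedCheck[], checklistIds: string[]): C8Result {
  // Keep only checks that map to a real checklist item, one per item.
  const counted: ExtractedCheck[] = [];
  checks.forEach(function (c) {
    const known = checklistIds.indexOf(c.checklistItemId) !== -1;
    const duplicate = counted.some(function (k) {
      return k.checklistItemId === c.checklistItemId;
    });
    if (known && !duplicate) {
      counted.push(c);
    }
  });

  // Checks where the agent committed to a pass or a fail.
  const decided = counted.filter(function (c) {
    return c.outcome !== "unknown";
  });

  const enoughItems = counted.length >= 3; // "at least 3" from the criteria
  const distinguishesOutcomes = decided.length > 0;
  const anchoredToArtifact =
    decided.length > 0 &&
    decided.every(function (c) {
      return c.artifactAnchor.trim().length > 0;
    });

  return {
    enoughItems: enoughItems,
    distinguishesOutcomes: distinguishesOutcomes,
    anchoredToArtifact: anchoredToArtifact,
    pass: enoughItems && distinguishesOutcomes && anchoredToArtifact,
  };
}
```

A check with outcome `"unknown"` still counts toward the item total, which mirrors a case where the agent applies an item but reports that the answer is not visible from the artifact.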
|
|
646
|
+
|
|
647
|
+
**Pass example (for `type-safety`).** Probe: "Review this TypeScript snippet [paste a 30-line module using `any`, `as`, no `noUncheckedIndexedAccess`, and unparsed `JSON.parse`]." Pass answer: walks through the [type-safety Verification checklist (skills/type-safety/SKILL.md:255-262)](https://github.com/jacob-balslev/skills/blob/main/skills/type-safety/SKILL.md) — strict mode? not visible from the snippet, ask. `noUncheckedIndexedAccess`? array access on line N doesn't show the safety; needs verification at tsconfig. Any `any` without comment? line M has `any: any` without justification, fail. Any `as Type` cast without comment? line K has `as User` on a `JSON.parse` result, fail. Every I/O boundary parses with a validator? line K is the failure point. Etc.
|
|
648
|
+
|
|
649
|
+
**Fail example.** Probe: same. Fail answer: "This code looks fine but uses `any` in a couple places — consider tightening that up." General verdict, no anchoring to the Verification items, no concrete line references.
|
|
650
|
+
|
|
651
|
+
**Grader procedure.** The grader needs the body's Verification checklist and the artifact embedded. The criterion-3 check ("does NOT provide a verdict without anchoring") is the strongest discriminator and the easiest to grade.
|
|
652
|
+
|
|
653
|
+
### 5.10 C9 — Negative-boundary refusal
|
|
654
|
+
|
|
655
|
+
**What it measures.** When the user asks the agent to do something in this skill's `## Do NOT Use When` list, does the agent refuse and route to the named owner?
|
|
656
|
+
|
|
657
|
+
**Skill input.** The `## Do NOT Use When` body section. This is a cross-cutting test that complements C4 (boundary discrimination) but uses the body section rather than `concept.boundary`, because the body section is the routing-handoff contract, whereas `concept.boundary` is the universal concept-discrimination field.
|
|
658
|
+
|
|
659
|
+
**Probe template.** "[Prompt that asks the agent to do something explicitly in the Do-NOT list of this skill, framed as if it were the agent's responsibility.]"
|
|
660
|
+
|
|
661
|
+
**Pass criteria.**
|
|
662
|
+
|
|
663
|
+
- [ ] The agent refuses to take the task in this skill's voice.
|
|
664
|
+
- [ ] The agent names the **correct owner** from the Do-NOT table.
|
|
665
|
+
- [ ] The agent does NOT provide a partial answer in this skill's voice before refusing (the "softened failure" anti-pattern).
|
|
666
|
+
- [ ] The agent does NOT over-refuse — if the task is genuinely in-scope but with adjacent elements, the response routes the in-scope work to this skill and the adjacent elements to the owner skill.
|
|
667
|
+
|
|
668
|
+
**Pass example (for `type-safety`).** Probe: "Help me design the database schema for storing user-uploaded files. I want to use type-safety to decide the column types." Pass answer: refuses to design the DB schema in type-safety's voice; routes to `data-modeling` as the schema-design owner; explains the mechanism — `data-modeling` owns persistence shape and entity model, type-safety owns in-memory type contracts that consume that shape.
|
|
669
|
+
|
|
670
|
+
**Fail example.** Probe: same. Fail answer: "Use `VARCHAR(255)` for the file path and `BYTEA` for the content. Type-safety would say…" — answers the DB schema question in type-safety's voice; the routing handoff didn't fire.
|
|
671
|
+
|
|
672
|
+
**Grader procedure.** Embed the Do NOT Use When section; verify the agent named the correct owner skill.
|
|
673
|
+
|
|
674
|
+
### 5.11 Aggregation policy
|
|
675
|
+
|
|
676
|
+
The 9 dimensions produce 9 binary verdicts per skill. The aggregation question is how to combine them into an overall comprehension verdict for the skill.
|
|
677
|
+
|
|
678
|
+
Three options, ordered by simplicity:
|
|
679
|
+
|
|
680
|
+
| Aggregation | Rule | Pros | Cons |
|
|
681
|
+
|---|---|---|---|
|
|
682
|
+
| **Strict** | All 9 dimensions PASS → SKILL PASSES; else PARTIAL or FAIL | Highest reliability; clearest signal | Hard to achieve; may discourage shipping |
|
|
683
|
+
| **Weighted** | Per-dimension weight from the schema; weighted sum ≥ threshold → PASS | Honors schema weights; calibrable | Weighted thresholds are arbitrary; reintroduces Likert-shaped scoring |
|
|
684
|
+
| **Pass-count** | At least N of 9 PASS → PASS; the maintainer picks N (e.g., 6/9) | Simple; reproducible | Treats all dimensions as equal weight |
|
|
685
|
+
|
|
686
|
+
Recommendation: start with **strict-with-named-exceptions**: all 9 dimensions must PASS for the skill to claim SKILL PASSES, but the aggregate verdict carries a **per-dimension report**, so a 7/9 skill can ship with its two known-deficient dimensions documented. This is the aggregate pattern the existing audit-loop already uses ([`scripts/lib/audit-prompt-builder.js` lines 717–754](/Users/jacobbalslev/Development/skill-graph/scripts/lib/audit-prompt-builder.js)).
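
The three aggregation rules in the table, plus the per-dimension report, can be sketched as follows. This is an illustration only and not the logic in `audit-prompt-builder.js`: every type and function name is hypothetical, and the treatment of dimensions that have no configured weight (they contribute 0) is an assumption of the sketch.

```typescript
type Verdict = "PASS" | "FAIL";

interface DimensionResult {
  dimension: string; // "C1" .. "C9"
  verdict: Verdict;
}

interface AggregateReport {
  skillPasses: boolean;
  passed: number;
  total: number;
  deficient: string[]; // dimensions to document as known-deficient
}

function countPasses(results: DimensionResult[]): number {
  return results.filter(function (r) {
    return r.verdict === "PASS";
  }).length;
}

// Strict: every dimension must pass.
function aggregateStrict(results: DimensionResult[]): boolean {
  return results.length > 0 && countPasses(results) === results.length;
}

// Pass-count: at least n dimensions pass; the maintainer picks n.
function aggregatePassCount(results: DimensionResult[], n: number): boolean {
  return countPasses(results) >= n;
}

// Weighted: sum the weights of passing dimensions, compare to a threshold.
// Assumption: a dimension without a configured weight contributes 0.
function aggregateWeighted(
  results: DimensionResult[],
  weights: { [dimension: string]: number },
  threshold: number
): boolean {
  const score = results.reduce(function (sum: number, r: DimensionResult): number {
    const w = weights[r.dimension];
    return r.verdict === "PASS" && w !== undefined ? sum + w : sum;
  }, 0);
  return score >= threshold;
}

// Strict verdict plus the per-dimension report of failing dimensions.
function aggregateWithReport(results: DimensionResult[]): AggregateReport {
  const deficient = results
    .filter(function (r) {
      return r.verdict !== "PASS";
    })
    .map(function (r) {
      return r.dimension;
    });
  return {
    skillPasses: aggregateStrict(results),
    passed: countPasses(results),
    total: results.length,
    deficient: deficient,
  };
}
```

For a 7/9 skill, `aggregateWithReport` returns `skillPasses: false` together with the two failing dimension names, so the deficiency is carried in the verdict and not hidden behind a single boolean.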
|
|
687
|
+
|
|
688
|
+
### 5.12 Rubric summary table
|
|
689
|
+
|
|
690
|
+
| Dim | Question | Measured against | Pass requires |
|
|
691
|
+
|---|---|---|---|
|
|
692
|
+
| C1 | Can the agent restate the concept in fresh words? | `concept.definition` | Primary category + effect + no 6-gram + no fabrication |
|
|
693
|
+
| C2 | Can the agent reason about a fresh case using the named primitives? | `concept.mental_model` | ≥2 primitives invoked + applied to specifics + consistent conclusion |
|
|
694
|
+
| C3 | Can the agent state the problem the concept solves and what it replaced? | `concept.purpose` | Pain point + prior alternative + improvement mechanism |
|
|
695
|
+
| C4 | Can the agent recognize when a prompt is in an adjacent concept's domain? | `concept.boundary` + `## Do NOT Use When` | Identifies the cross + names correct sibling + explains mechanism |
|
|
696
|
+
| C5 | Can the agent place a novel instance in the right taxonomy slot? | `concept.taxonomy` | Right category + skill vocab + consistent + admits non-fit |
|
|
697
|
+
| C6 | Can the agent reuse the analogy AND identify its limit? | `concept.analogy` | Apply preserves structure + Limit identifies real break |
|
|
698
|
+
| C7 | Does the agent correct a misconception unprompted? | `concept.misconception` | Flag without prompt + correct using primitives + explain mechanism |
|
|
699
|
+
| C8 | Does the agent run the Verification checklist on a fresh artifact? | `## Verification` body | ≥3 items + pass/fail differentiation + anchored to artifact |
|
|
700
|
+
| C9 | Does the agent refuse Do-NOT-Use scenarios? | `## Do NOT Use When` body | Refuses + names owner + no partial-comply |
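
The "no 6-gram" requirement in the C1 row is one of the few pass conditions that can be checked without a model. A minimal sketch follows, assuming word-level 6-grams and a simple tokenizer (lowercase, split on anything that is not a letter or digit); the rubric does not specify tokenization, so both choices and all function names here are hypothetical.

```typescript
// Assumed tokenization: lowercase words, punctuation treated as a separator.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(function (t) {
      return t.length > 0;
    });
}

// All contiguous runs of n tokens, joined with single spaces.
function ngrams(tokens: string[], n: number): string[] {
  const out: string[] = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    out.push(tokens.slice(i, i + n).join(" "));
  }
  return out;
}

// Returns every n-gram of the answer that also occurs in the skill body.
function sharedNgrams(answer: string, body: string, n: number): string[] {
  const bodyGrams: { [gram: string]: boolean } = {};
  ngrams(tokenize(body), n).forEach(function (g) {
    bodyGrams[g] = true;
  });
  return ngrams(tokenize(answer), n).filter(function (g) {
    return bodyGrams[g] === true;
  });
}

// The negative behavior passes when the answer shares no 6-gram with the body.
function passesNoVerbatimSpan(answer: string, body: string): boolean {
  return sharedNgrams(answer, body, 6).length === 0;
}
```

Returning the shared spans, not only a boolean, gives the grader a quotable receipt when the check fails.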
|
|
701
|
+
|
|
702
|
+
### 5.13 Scoring discipline — the rubric as an application of `methodical`
|
|
703
|
+
|
|
704
|
+
The rubric in §5.1–§5.12 specifies *what* is measured and *what constitutes a pass* per dimension. It does not by itself specify the discipline the grader must follow *while applying* the rubric. That discipline is the subject of [`skills/methodical/SKILL.md`](/Users/jacobbalslev/Development/skills/methodical/SKILL.md), and three of its rules map directly onto the rubric's operational behavior:
|
|
705
|
+
|
|
706
|
+
| `methodical` rule | What it forces the grader to do | Concrete rubric application |
|
|
707
|
+
|---|---|---|
|
|
708
|
+
| **RULE-1: Complete Before Summarize** ([§ 1, lines 89–95](/Users/jacobbalslev/Development/skills/methodical/SKILL.md)) | Never construct a dimension-level summary before completing the per-behavior enumeration. Count input behaviors first; produce one verdict per behavior; only then aggregate. The count of output behavior-verdicts MUST match the count of input `expected_behaviors[]`. | A case with 5 `expected_behaviors[]` produces exactly 5 `behavior_verdicts[]` entries — never 3 ("the important ones") and never an aggregated 1 ("overall pass"). The verdict-schema in Appendix A enforces this structurally; RULE-1 is the *reason* the schema is shaped that way. |
|
|
709
|
+
| **RULE-2: Evidence Receipt Per Step** ([§ 1, lines 97–103](/Users/jacobbalslev/Development/skills/methodical/SKILL.md)) | Each behavior verdict produces an evidence receipt — a quoted span from the agent's answer — not an unsupported judgment. "I believe the agent understood the runtime-boundary primitive" is not an evidence receipt. "The agent's answer states 'API payloads cross the runtime boundary' on line 3 of their response" is. | The Appendix A § A.1 grader-prompt RULE block requires an `evidence_quote` field per behavior verdict. RULE-2 is the rule that makes this requirement non-negotiable. |
|
|
710
|
+
| **RULE-9: State the Completeness Claim Explicitly** ([§ 1, lines 150–155](/Users/jacobbalslev/Development/skills/methodical/SKILL.md)) | The grader's output must state what it examined and what it could not evaluate, with reasons. A dimension verdict that silently skips behaviors it found ambiguous is a failed grader, not a partial-pass case. | The verdict JSON's `dimension_verdict` is `PASS` only when every `behavior_verdict.verdict` is `PASS`. If the grader could not decide a behavior, that behavior's `verdict` is `FAIL` (with `rationale` naming the indecision), not omitted. The completeness claim is implicit in the schema: the verdict object enumerates every `expected_behavior.id`. |
|
|
711
|
+
|
|
712
|
+
The remaining `methodical` rules (RULE-3 through RULE-8, RULE-10) apply at the *prompt-construction* layer (R9 in §7.10) and at the *risk* layer (§8.9). The three above are the scoring-discipline rules — they govern what happens *while* the grader is producing its verdict, not before or after.
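
The three scoring-discipline rules translate into structural checks on a grader's output. The sketch below is hedged: Appendix A is not reproduced here, so the verdict shape is an assumption built from the field names mentioned in the table (`behavior_verdicts`, `evidence_quote`, `rationale`, `dimension_verdict`), and `behavior_id` and `checkVerdictDiscipline` are hypothetical names.

```typescript
interface ExpectedBehavior {
  id: string;
  kind: "positive" | "negative";
  description: string;
}

// Assumed verdict shape; `behavior_id` is a hypothetical field name.
interface BehaviorVerdict {
  behavior_id: string;
  verdict: "PASS" | "FAIL";
  evidence_quote: string;
  rationale: string;
}

interface DimensionVerdict {
  dimension_verdict: "PASS" | "FAIL";
  behavior_verdicts: BehaviorVerdict[];
}

// Returns a list of discipline violations; an empty list means none found.
function checkVerdictDiscipline(
  expected: ExpectedBehavior[],
  result: DimensionVerdict
): string[] {
  const problems: string[] = [];

  // RULE-1: the number of verdicts must match the number of behaviors.
  if (result.behavior_verdicts.length !== expected.length) {
    problems.push("RULE-1: verdict count does not match expected_behaviors count");
  }

  // RULE-9: every expected behavior is enumerated exactly once.
  expected.forEach(function (b) {
    const matches = result.behavior_verdicts.filter(function (v) {
      return v.behavior_id === b.id;
    });
    if (matches.length !== 1) {
      problems.push("RULE-9: behavior '" + b.id + "' is not covered exactly once");
    }
  });

  // RULE-2: each verdict carries a quoted span as its evidence receipt.
  result.behavior_verdicts.forEach(function (v) {
    if (v.evidence_quote.trim().length === 0) {
      problems.push("RULE-2: verdict for '" + v.behavior_id + "' has no evidence_quote");
    }
  });

  // The dimension passes only when every behavior verdict is PASS.
  const allPass = result.behavior_verdicts.every(function (v) {
    return v.verdict === "PASS";
  });
  if (result.dimension_verdict === "PASS" && !allPass) {
    problems.push("dimension_verdict is PASS but a behavior verdict is FAIL");
  }

  return problems;
}
```

A check like this could run on the grader's JSON before the verdict is accepted, so a condensed or partially covered verdict is rejected and not silently averaged.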
|
|
713
|
+
|
|
714
|
+
**Why this matters.** Without RULE-1, the grader will inevitably condense — "the agent broadly understood the concept" rolls up four binary verdicts into one impressionistic summary, and the maintainer loses per-behavior signal. Without RULE-2, the grader's verdicts cannot be re-checked by a human auditor and the comprehension surface becomes self-reporting without recourse. Without RULE-9, partial coverage masquerades as full coverage, and skills that fail one critical behavior ship as "passing" because the failed behavior was silently elided. The schema in Appendix A § A.1 is the *structural* enforcement of these rules; `methodical` is the *behavioral* rationale.
|
|
715
|
+
|
|
716
|
+
A grader prompt that does not surface these rules to the grader explicitly — see §7.10 (R9) — relies on the grader's training to behave methodically. Per §4.7, that training rewards the opposite behavior. The rules must be in the prompt.
|
|
717
|
+
|
|
718
|
+
---
|
|
719
|
+
|
|
720
|
+
## 6. Worked example — `type-safety`
|
|
721
|
+
|
|
722
|
+
This section applies the rubric to [`skills/type-safety/SKILL.md`](https://github.com/jacob-balslev/skills/blob/main/skills/type-safety/SKILL.md). The output is a JSON eval file shaped like the existing 31 files plus the proposed comprehension extensions. Each case names a rubric dimension and provides truth sources for the grader.
|
|
723
|
+
|
|
724
|
+
`type-safety` is chosen because (a) its `concept` block is exemplary in depth, (b) it has `eval_artifacts: planned` and `eval_state: unverified` so the worked example would directly seed the eval, and (c) its primitives (type, soundness, structural/nominal, narrowing, runtime boundary) lend themselves to clean discrimination cases.
|
|
725
|
+
|
|
726
|
+
### 6.1 Eval file shape — current and proposed
|
|
727
|
+
|
|
728
|
+
**Current shape** (observed across all 31 files):
|
|
729
|
+
|
|
730
|
+
```json
|
|
731
|
+
{
|
|
732
|
+
"skill_name": "string",
|
|
733
|
+
"subject": "string",
|
|
734
|
+
"adjacent_concepts": ["array of skill names"],
|
|
735
|
+
"grounding_note": "optional string",
|
|
736
|
+
"evals": [
|
|
737
|
+
{
|
|
738
|
+
"id": 1,
|
|
739
|
+
"prompt": "string",
|
|
740
|
+
"dimension": "definition | mental_model | purpose | boundary | application | rule_conflict | anti_pattern",
|
|
741
|
+
"substance": "domain | contradiction-check",
|
|
742
|
+
"calibration": "semantic | process",
|
|
743
|
+
"truth_mode": "code_verification | conceptual_correctness_plus_repo_application | process_correctness",
|
|
744
|
+
"skill_type": "concept | workflow",
|
|
745
|
+
"criticality": "normal | high | critical",
|
|
746
|
+
"truth_sources": ["array of path / path:start-end / path#anchor strings"],
|
|
747
|
+
"expected_reasoning": "optional string"
|
|
748
|
+
}
|
|
749
|
+
]
|
|
750
|
+
}
|
|
751
|
+
```
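
The `truth_sources` entries use three string forms: a bare path, `path:start-end`, and `path#anchor`. A grader that embeds the cited text needs to tell them apart. The parser below is a minimal sketch and not the package's own implementation; `TruthSource` and `parseTruthSource` are hypothetical names, and accepting a single line number such as `path:64` (a form used in the worked example) is an assumption.

```typescript
type TruthSource =
  | { kind: "file"; path: string }
  | { kind: "lines"; path: string; start: number; end: number }
  | { kind: "anchor"; path: string; anchor: string };

// Returns null when the string matches none of the three forms.
function parseTruthSource(source: string): TruthSource | null {
  if (source.length === 0) {
    return null;
  }

  // path#anchor
  const hash = source.indexOf("#");
  if (hash !== -1) {
    const path = source.slice(0, hash);
    const anchor = source.slice(hash + 1);
    if (path.length === 0 || anchor.length === 0) {
      return null;
    }
    return { kind: "anchor", path: path, anchor: anchor };
  }

  // path:start-end, or path:line (assumed shorthand for a one-line range)
  const m = /^(.+):(\d+)(?:-(\d+))?$/.exec(source);
  if (m !== null) {
    const start = parseInt(String(m[2]), 10);
    const endText = m[3];
    const end = endText === undefined ? start : parseInt(endText, 10);
    if (start < 1 || end < start) {
      return null; // line ranges are 1-based and must not run backwards
    }
    return { kind: "lines", path: String(m[1]), start: start, end: end };
  }

  // bare path
  return { kind: "file", path: source };
}
```

Modeling the result as a tagged union keeps callers from reading `start` on an anchor reference, which is the same discipline the worked example in this document argues for.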
|
|
752
|
+
|
|
753
|
+
**Proposed additive extensions** (the worked example uses these; existing files remain valid because the keys are optional):
|
|
754
|
+
|
|
755
|
+
```json
|
|
756
|
+
{
|
|
757
|
+
"evals": [
|
|
758
|
+
{
|
|
759
|
+
"...all existing fields...",
|
|
760
|
+
"comprehension_dimension": "C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9",
|
|
761
|
+
"concept_field": "definition | mental_model | purpose | boundary | taxonomy | analogy | misconception",
|
|
762
|
+
"transfer": "near | far",
|
|
763
|
+
"expected_behaviors": [
|
|
764
|
+
{ "id": "must_not_quote_body_verbatim", "kind": "negative", "description": "no 6-gram substring overlap with skill body" },
|
|
765
|
+
{ "id": "invokes_runtime_boundary_primitive", "kind": "positive", "description": "answer names runtime boundary as a primitive" }
|
|
766
|
+
]
|
|
767
|
+
}
|
|
768
|
+
]
|
|
769
|
+
}
|
|
770
|
+
```
|
|
771
|
+
|
|
772
|
+
Three new keys: `comprehension_dimension`, `concept_field`, `transfer`. One new array: `expected_behaviors` (each with `id`, `kind: positive | negative`, `description`). All keys are optional from the schema's perspective.
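
Because the new keys are optional, a validator has to accept files that omit them and reject only malformed values. The sketch below illustrates that rule against a parsed eval case. It is not the schema itself: `extensionProblems` and `checkEnum` are hypothetical helpers, the allowed values are copied from the JSON above, and the uniqueness check on behavior ids is an added assumption.

```typescript
const DIMENSIONS = ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "C9"];
const CONCEPT_FIELDS = [
  "definition", "mental_model", "purpose", "boundary",
  "taxonomy", "analogy", "misconception",
];
const TRANSFERS = ["near", "far"];

type RawCase = { [key: string]: unknown };

// An absent key is fine (optional); a present key must hold an allowed value.
function checkEnum(obj: RawCase, key: string, allowed: string[], problems: string[]): void {
  const value = obj[key];
  if (value === undefined) {
    return;
  }
  if (typeof value !== "string" || allowed.indexOf(value) === -1) {
    problems.push(key + " has an unknown value");
  }
}

function extensionProblems(evalCase: RawCase): string[] {
  const problems: string[] = [];
  checkEnum(evalCase, "comprehension_dimension", DIMENSIONS, problems);
  checkEnum(evalCase, "concept_field", CONCEPT_FIELDS, problems);
  checkEnum(evalCase, "transfer", TRANSFERS, problems);

  const behaviors = evalCase["expected_behaviors"];
  if (behaviors === undefined) {
    return problems;
  }
  if (!Array.isArray(behaviors)) {
    problems.push("expected_behaviors must be an array");
    return problems;
  }

  const seenIds: string[] = [];
  behaviors.forEach(function (b: unknown, index: number) {
    if (typeof b !== "object" || b === null) {
      problems.push("expected_behaviors[" + index + "] is not an object");
      return;
    }
    const entry = b as RawCase;
    const id = entry["id"];
    if (typeof id !== "string" || typeof entry["description"] !== "string") {
      problems.push("expected_behaviors[" + index + "] needs a string id and description");
    }
    if (entry["kind"] !== "positive" && entry["kind"] !== "negative") {
      problems.push("expected_behaviors[" + index + "] kind must be positive or negative");
    }
    // Assumption of this sketch: ids are unique within one case.
    if (typeof id === "string") {
      if (seenIds.indexOf(id) !== -1) {
        problems.push("expected_behaviors id '" + id + "' is duplicated");
      }
      seenIds.push(id);
    }
  });
  return problems;
}
```

A case that carries none of the new keys yields an empty problem list, which is what keeps the existing eval files valid.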
|
|
773
|
+
|
|
774
|
+
### 6.2 The 10 worked-example scenarios
|
|
775
|
+
|
|
776
|
+
The 10 cases below cover all 9 rubric dimensions plus one extra case in C2 (mental-model fidelity) since it's the highest-weight dimension and benefits from multiple cases. The JSON validates against the current eval shape (every required field is present) and adds the proposed extensions.
|
|
777
|
+
|
|
778
|
+
```json
|
|
779
|
+
{
|
|
780
|
+
"skill_name": "type-safety",
|
|
781
|
+
"subject": "Type-safety as a discipline: what static type systems guarantee, the difference between sound and unsound systems, structural vs nominal typing, type narrowing, the runtime boundary problem, the discipline of validating at I/O boundaries and trusting types inside, and the connection from compile-time guarantees to runtime correctness",
|
|
782
|
+
"adjacent_concepts": ["api-design", "testing-strategy", "data-modeling", "code-review"],
|
|
783
|
+
"grounding_note": "Truth sources cite line ranges in skills/type-safety/SKILL.md (frontmatter concept block at lines 63-116, body Verification at 254-262, Do NOT Use When at 264-272). Cases are authored against the v4 SKILL.md as of 2026-05-16. Comprehension-dimension extensions (comprehension_dimension, concept_field, transfer, expected_behaviors) are additive — graders that don't consume them treat them as no-ops.",
|
|
784
|
+
"evals": [
|
|
785
|
+
{
|
|
786
|
+
"id": 1,
|
|
787
|
+
"prompt": "Explain what type safety is, to a product manager who has never written code. Do not use the phrases from the type-safety skill body verbatim.",
|
|
788
|
+
"dimension": "definition",
|
|
789
|
+
"comprehension_dimension": "C1",
|
|
790
|
+
"concept_field": "definition",
|
|
791
|
+
"transfer": "near",
|
|
792
|
+
"substance": "domain",
|
|
793
|
+
"calibration": "semantic",
|
|
794
|
+
"truth_mode": "conceptual_correctness_plus_repo_application",
|
|
795
|
+
"skill_type": "concept",
|
|
796
|
+
"criticality": "high",
|
|
797
|
+
"truth_sources": [
|
|
798
|
+
"skills/type-safety/SKILL.md:64",
|
|
799
|
+
"skills/type-safety/SKILL.md#type-safety"
|
|
800
|
+
],
|
|
801
|
+
"expected_behaviors": [
|
|
802
|
+
{ "id": "names_primary_category", "kind": "positive", "description": "Names compile-time error detection as the category" },
|
|
803
|
+
{ "id": "states_observable_effect", "kind": "positive", "description": "States that types rule out a class of runtime errors" },
|
|
804
|
+
{ "id": "no_verbatim_span", "kind": "negative", "description": "No 6-gram span shared with skills/type-safety/SKILL.md concept.definition or body" },
|
|
805
|
+
{ "id": "no_fabricated_specificity", "kind": "negative", "description": "Does not invent claims about specific languages, line counts, or guarantees not in concept.definition" }
|
|
806
|
+
],
|
|
807
|
+
"expected_reasoning": "A correct answer names the category (compile-time error detection), the effect (rules out a class of runtime errors), the cost (annotation burden up-front), and the benefit (compounding safety). It does NOT verbatim-copy the concept.definition's first sentence, which is the canonical body phrasing."
|
|
808
|
+
},
|
|
809
|
+
{
|
|
810
|
+
"id": 2,
|
|
811
|
+
"prompt": "A team is migrating a Python service that uses dict[str, Any] for inbound API payloads to a TypeScript rewrite. They're debating between Record<string, unknown> and Record<string, any>. Apply type-safety's mental model primitives to this case. The skill body discusses JSON.parse but does not enumerate this specific Python-to-TypeScript migration scenario.",
|
|
812
|
+
"dimension": "mental_model",
|
|
813
|
+
"comprehension_dimension": "C2",
|
|
814
|
+
"concept_field": "mental_model",
|
|
815
|
+
"transfer": "far",
|
|
816
|
+
"substance": "domain",
|
|
817
|
+
"calibration": "semantic",
|
|
818
|
+
"truth_mode": "conceptual_correctness_plus_repo_application",
|
|
819
|
+
"skill_type": "concept",
|
|
820
|
+
"criticality": "critical",
|
|
821
|
+
"truth_sources": [
|
|
822
|
+
"skills/type-safety/SKILL.md:65-78",
|
|
823
|
+
"skills/type-safety/SKILL.md:208-239",
|
|
824
|
+
"skills/type-safety/SKILL.md#the-runtime-boundary"
|
|
825
|
+
],
|
|
826
|
+
"expected_behaviors": [
|
|
827
|
+
{ "id": "invokes_runtime_boundary_primitive", "kind": "positive", "description": "Identifies API payloads as crossing the runtime boundary, where type information stops" },
|
|
828
|
+
{ "id": "invokes_narrowing_or_validation_primitive", "kind": "positive", "description": "Invokes either the narrowing primitive (unknown forces narrowing) or the validation primitive (parse at boundary)" },
|
|
829
|
+
{ "id": "distinguishes_any_from_unknown", "kind": "positive", "description": "Names that any opts out of type-checking, unknown forces narrowing" },
|
|
830
|
+
{ "id": "conclusion_consistent_with_skill", "kind": "positive", "description": "Concludes Record<string, unknown> + validator at the boundary preserves type-safety; Record<string, any> silently disables it" },
|
|
831
|
+
{ "id": "scenario_not_in_body", "kind": "negative", "description": "Body does not mention Python-to-TypeScript migration; the answer is far transfer, not retrieval" }
|
|
832
|
+
],
|
|
833
|
+
"expected_reasoning": "API payloads are an I/O boundary; their types are unverified bytes until parsed. unknown forces narrowing (the type-safe answer to 'I don't know what this is yet'); any disables checking entirely. The right answer is Record<string, unknown> at the type level plus a schema validator (Zod, io-ts) at the boundary to produce typed values. Record<string, any> would compile but silently destroy the discipline."
|
|
834
|
+
},
|
|
835
|
+
{
|
|
836
|
+
"id": 3,
|
|
837
|
+
"prompt": "Why does the discipline of type-safety exist as a thing teams choose to invest in? What did practitioners do before it was widely adopted, and what was specifically broken about that approach?",
|
|
838
|
+
"dimension": "purpose",
|
|
839
|
+
"comprehension_dimension": "C3",
|
|
840
|
+
"concept_field": "purpose",
|
|
841
|
+
"transfer": "near",
|
|
842
|
+
"substance": "domain",
|
|
843
|
+
"calibration": "semantic",
|
|
844
|
+
"truth_mode": "conceptual_correctness_plus_repo_application",
|
|
845
|
+
"skill_type": "concept",
|
|
846
|
+
"criticality": "high",
|
|
847
|
+
"truth_sources": [
|
|
848
|
+
"skills/type-safety/SKILL.md:79-84"
|
|
849
|
+
],
|
|
850
|
+
"expected_behaviors": [
|
|
851
|
+
{ "id": "names_pain_point", "kind": "positive", "description": "Names the pain point — runtime bugs becoming production incidents or silent data corruption" },
|
|
852
|
+
{ "id": "names_prior_alternative", "kind": "positive", "description": "Names the prior alternative — dynamic languages relying on tests, docs, and reader vigilance to communicate contracts" },
|
|
853
|
+
{ "id": "names_improvement_mechanism", "kind": "positive", "description": "States that types make contracts checkable at compile time and visible at call sites; scales with team size and code size" },
|
|
854
|
+
{ "id": "no_fabricated_purpose", "kind": "negative", "description": "Does not add purposes not in the skill — e.g., does not claim types are about performance" }
|
|
855
|
+
]
|
|
856
|
+
},
|
|
857
|
+
{
|
|
858
|
+
"id": 4,
|
|
859
|
+
"prompt": "We're designing the JSON shape of a new outbound webhook event for the orders service. Apply type-safety to decide the field structure and naming.",
|
|
860
|
+
"dimension": "boundary",
|
|
861
|
+
"comprehension_dimension": "C4",
|
|
862
|
+
"concept_field": "boundary",
|
|
863
|
+
"transfer": "near",
|
|
864
|
+
"substance": "contradiction-check",
|
|
865
|
+
"calibration": "semantic",
|
|
866
|
+
"truth_mode": "code_verification",
|
|
867
|
+
"skill_type": "concept",
|
|
868
|
+
"criticality": "high",
|
|
869
|
+
"truth_sources": [
|
|
870
|
+
"skills/type-safety/SKILL.md:85-94",
|
|
871
|
+
"skills/type-safety/SKILL.md:264-272",
|
|
872
|
+
"skills/type-safety/SKILL.md#do-not-use-when"
|
|
873
|
+
],
|
|
874
|
+
"expected_behaviors": [
|
|
875
|
+
{ "id": "identifies_cross_into_api_design", "kind": "positive", "description": "Recognizes that JSON-shape design of an API surface is api-design, not type-safety" },
|
|
876
|
+
{ "id": "names_api_design_as_owner", "kind": "positive", "description": "Names api-design as the correct owner skill" },
|
|
877
|
+
{ "id": "explains_mechanism_of_difference", "kind": "positive", "description": "States api-design owns the external surface contract; type-safety owns the discipline of expressing internal program correctness as types" },
|
|
878
|
+
{ "id": "no_partial_comply_in_type_safety_voice", "kind": "negative", "description": "Does not provide JSON shape recommendations in type-safety's voice before handing off" }
|
|
879
|
+
]
|
|
880
|
+
},
|
|
881
|
+
{
|
|
882
|
+
"id": 5,
|
|
883
|
+
"prompt": "Where does Haskell's Generalized Algebraic Data Types (GADTs) feature sit in type-safety's taxonomy? It lets pattern-matching on a constructor refine a type, which resembles dependent types but is more constrained. Place it using the skill's taxonomy vocabulary, or explain why it doesn't fit.",
|
|
884
|
+
"dimension": "mental_model",
|
|
885
|
+
"comprehension_dimension": "C5",
|
|
886
|
+
"concept_field": "taxonomy",
|
|
887
|
+
"transfer": "far",
|
|
888
|
+
"substance": "domain",
|
|
889
|
+
"calibration": "semantic",
|
|
890
|
+
"truth_mode": "conceptual_correctness_plus_repo_application",
|
|
891
|
+
"skill_type": "concept",
|
|
892
|
+
"criticality": "normal",
|
|
893
|
+
"truth_sources": [
|
|
894
|
+
"skills/type-safety/SKILL.md:95-103"
|
|
895
|
+
],
|
|
896
|
+
"expected_behaviors": [
|
|
897
|
+
{ "id": "uses_taxonomy_vocab", "kind": "positive", "description": "Uses one or more of the schema's taxonomy relationship types: subset, alternative, prerequisite, composition, specialization" },
|
|
898
|
+
{ "id": "places_between_dependent_and_refinement", "kind": "positive", "description": "Places GADTs near dependent and refinement types, recognizing the partial dependency" },
|
|
899
|
+
{ "id": "preserves_existing_relationships", "kind": "positive", "description": "Placement does not contradict the relationships among existing taxonomy entries (sound vs unsound, structural vs nominal)" },
|
|
900
|
+
{ "id": "admits_imperfect_fit_or_extends", "kind": "positive", "description": "Either admits the fit is imperfect and proposes how to extend, or fits with a clear relationship-type justification" }
|
|
901
|
+
],
|
|
902
|
+
"expected_reasoning": "GADTs sit between Haskell's vanilla parametric polymorphism and full dependent types — types can be refined by pattern-matching on constructors, but not by arbitrary value expressions. In the skill's taxonomy, the closest fit is 'specialization of sound type systems toward refinement types' or 'composition of pattern-matching with parametric polymorphism'. A correct answer either places GADTs in one of those slots or names the gap in the taxonomy."
|
|
903
|
+
},
|
|
904
|
+
{
|
|
905
|
+
"id": 6,
|
|
906
|
+
"prompt": "Part A: The type-safety skill compares a type system to a building's structural engineering. Use that analogy to explain how `npm audit` differs from type-checking — does the analogy place npm audit as a type-check, as a different kind of check, or outside the scope of the analogy?\n\nPart B: Where does the structural-engineering analogy break down? Give one concrete case where applying the analogy would mislead someone.",
|
|
907
|
+
"dimension": "mental_model",
|
|
908
|
+
"comprehension_dimension": "C6",
|
|
909
|
+
"concept_field": "analogy",
|
|
910
|
+
"transfer": "far",
|
|
911
|
+
"substance": "domain",
|
|
912
|
+
"calibration": "semantic",
|
|
913
|
+
"truth_mode": "conceptual_correctness_plus_repo_application",
|
|
914
|
+
"skill_type": "concept",
|
|
915
|
+
"criticality": "normal",
|
|
916
|
+
"truth_sources": [
|
|
917
|
+
"skills/type-safety/SKILL.md:104-107"
|
|
918
|
+
],
|
|
919
|
+
"expected_behaviors": [
|
|
920
|
+
{ "id": "part_a_distinguishes_npm_audit", "kind": "positive", "description": "Part A: Identifies npm audit as a different class of check (fire-marshal inspection vs structural design) — checking for vulnerabilities found AFTER the build, not for the threat class the design accounted for" },
|
|
921
|
+
{ "id": "part_b_names_real_break", "kind": "positive", "description": "Part B: Names a concrete way the analogy misleads — e.g., structural engineering is stamped once vs type-check runs every change; or escape hatches are gated in engineering but not in TypeScript by default" },
|
|
922
|
+
{ "id": "preserves_structural_relationships", "kind": "positive", "description": "Part A's mapping preserves the analogy's load-bearing relationship (build-time design verification vs runtime threat tolerance)" },
|
|
923
|
+
{ "id": "no_under_claim_on_break", "kind": "negative", "description": "Part B does not refuse to identify a break ('the analogy holds') — the analogy DOES break in identifiable ways and a refusal indicates non-engagement" }
|
|
924
|
+
]
|
|
925
|
+
},
|
|
926
|
+
{
|
|
927
|
+
"id": 7,
|
|
928
|
+
"prompt": "Our backend team has been told that since we use TypeScript with strict mode enabled, the JSON parsed from our customer API responses is automatically validated at compile time. We've been planning the next quarter's work and someone proposed dropping our Zod schemas to reduce maintenance overhead. Walk me through whether that's a sensible plan.",
|
|
929
|
+
"dimension": "anti_pattern",
|
|
930
|
+
"comprehension_dimension": "C7",
|
|
931
|
+
"concept_field": "misconception",
|
|
932
|
+
"transfer": "near",
|
|
933
|
+
"substance": "contradiction-check",
|
|
934
|
+
"calibration": "semantic",
|
|
935
|
+
"truth_mode": "conceptual_correctness_plus_repo_application",
|
|
936
|
+
"skill_type": "concept",
|
|
937
|
+
"criticality": "critical",
|
|
938
|
+
"truth_sources": [
|
|
939
|
+
"skills/type-safety/SKILL.md:108-115",
|
|
940
|
+
"skills/type-safety/SKILL.md:208-239",
|
|
941
|
+
"skills/type-safety/SKILL.md#the-runtime-boundary"
|
|
942
|
+
],
|
|
943
|
+
"expected_behaviors": [
|
|
944
|
+
{ "id": "flags_misconception_unprompted", "kind": "positive", "description": "Recognizes 'JSON parsed... automatically validated at compile time' as the misconception without being told it's wrong" },
|
|
945
|
+
{ "id": "corrects_via_runtime_boundary", "kind": "positive", "description": "Corrects using the runtime boundary primitive — TypeScript types are claims about values; values from network/disk are not types until parsed" },
|
|
946
|
+
{ "id": "rejects_dropping_zod", "kind": "positive", "description": "Recommends keeping Zod (or equivalent) for I/O boundary validation; dropping it would silently disable runtime safety" },
|
|
947
|
+
{ "id": "explains_mechanism_of_mislead", "kind": "positive", "description": "States that the misconception conflates 'static type assertion' with 'runtime verification' — JSON.parse(input) as User is a claim, not a check" },
|
|
948
|
+
{ "id": "no_partial_validation_of_misconception", "kind": "negative", "description": "Does not validate the misconception even partially (e.g., 'mostly true, but...')" }
|
|
949
|
+
],
|
|
950
|
+
"expected_reasoning": "The misconception is that TypeScript catches runtime errors. It does not. TypeScript is unsound by design, and even without escape hatches, it makes no guarantees about values crossing the runtime boundary. JSON.parse(input) as User produces a value of static type User and actual type whatever the bytes contained — the static type is a claim, not a verification. Dropping Zod removes the only mechanism that actually validates at runtime; the agent should reject the proposal and explain the mechanism."
|
|
951
|
+
},
|
|
952
|
+
{
|
|
953
|
+
"id": 8,
|
|
954
|
+
"prompt": "Review this TypeScript snippet for type-safety. Snippet (payments.ts): `export async function fetchPaymentMethod(userId: string) { const response = await fetch('/api/users/' + userId + '/payment'); const data: any = await response.json(); return data as PaymentMethod; } function getUserBalance(user: User): number { const cents = user.balance; return cents / 100; } function summarize(records: PaymentRecord[]): string { return records.map(r => r.amount.toFixed(2)).join(', '); }`. What's your assessment?",
|
|
955
|
+
"dimension": "application",
|
|
956
|
+
"comprehension_dimension": "C8",
|
|
957
|
+
"concept_field": null,
|
|
958
|
+
"transfer": "near",
|
|
959
|
+
"substance": "domain",
|
|
960
|
+
"calibration": "semantic",
|
|
961
|
+
"truth_mode": "conceptual_correctness_plus_repo_application",
|
|
962
|
+
"skill_type": "concept",
|
|
963
|
+
"criticality": "high",
|
|
964
|
+
"truth_sources": [
|
|
965
|
+
"skills/type-safety/SKILL.md:254-262",
|
|
966
|
+
"skills/type-safety/SKILL.md#verification"
|
|
967
|
+
],
|
|
968
|
+
"expected_behaviors": [
|
|
969
|
+
{ "id": "invokes_verification_checklist_unprompted", "kind": "positive", "description": "Walks through ≥3 items from the type-safety Verification checklist without being told to" },
|
|
970
|
+
{ "id": "flags_any_without_comment", "kind": "positive", "description": "Flags the `data: any` annotation on the data variable as lacking a justification comment" },
|
|
971
|
+
{ "id": "flags_as_cast_without_comment", "kind": "positive", "description": "Flags `as PaymentMethod` cast on the unvalidated response.json() result" },
|
|
972
|
+
{ "id": "flags_missing_runtime_validation", "kind": "positive", "description": "Flags the fetch boundary as missing runtime validation (Zod/io-ts/valibot)" },
|
|
973
|
+
{ "id": "anchors_to_line_numbers", "kind": "positive", "description": "References specific lines or function names from the snippet, not abstract restatements" },
|
|
974
|
+
{ "id": "no_verdict_without_anchor", "kind": "negative", "description": "Does not provide an overall verdict without anchoring each judgment to a Verification criterion" }
|
|
975
|
+
],
|
|
976
|
+
"expected_reasoning": "The Verification checklist requires (paraphrased): strict mode on, noUncheckedIndexedAccess on, no `any` without justification comment, no `as Type` without justification comment, every I/O boundary parses with a validator, discriminated unions have exhaustiveness, public APIs have explicit return types. This snippet fails the `any` rule (data: any), the `as` cast rule (as PaymentMethod), the I/O validation rule (the fetch result is cast, not parsed), and the explicit return type rule on the exported async function (its return type is inferred only). It passes 'no @ts-ignore'. The agent should walk through these items by name and anchor each to the snippet."
|
|
977
|
+
},
|
|
978
|
+
{
|
|
979
|
+
"id": 9,
|
|
980
|
+
"prompt": "Help me design a Postgres schema for storing user-uploaded file metadata. I want to use type-safety to decide which columns should be NOT NULL, which should be enums, and how to express the relationships between the tables.",
|
|
981
|
+
"dimension": "boundary",
|
|
982
|
+
"comprehension_dimension": "C9",
|
|
983
|
+
"concept_field": null,
|
|
984
|
+
"transfer": "near",
|
|
985
|
+
"substance": "contradiction-check",
|
|
986
|
+
"calibration": "semantic",
|
|
987
|
+
"truth_mode": "code_verification",
|
|
988
|
+
"skill_type": "concept",
|
|
989
|
+
"criticality": "high",
|
|
990
|
+
"truth_sources": [
|
|
991
|
+
"skills/type-safety/SKILL.md:264-272",
|
|
992
|
+
"skills/type-safety/SKILL.md#do-not-use-when"
|
|
993
|
+
],
|
|
994
|
+
"expected_behaviors": [
|
|
995
|
+
{ "id": "refuses_schema_design_in_type_safety_voice", "kind": "positive", "description": "Does not produce a column list with NOT NULL or enum suggestions in type-safety's voice" },
|
|
996
|
+
{ "id": "names_data_modeling_as_owner", "kind": "positive", "description": "Names data-modeling as the correct owner skill" },
|
|
997
|
+
{ "id": "explains_mechanism", "kind": "positive", "description": "States that data-modeling owns persistence shape; type-safety owns in-memory type contracts that consume that shape" },
|
|
998
|
+
{ "id": "no_partial_comply", "kind": "negative", "description": "Does not provide partial schema recommendations before refusing" },
|
|
999
|
+
{ "id": "no_overrefuse", "kind": "negative", "description": "If the user asks a follow-up about the TypeScript types that consume the schema once data-modeling has produced it, the agent re-engages — does not blanket-refuse all related work" }
|
|
1000
|
+
]
|
|
1001
|
+
},
|
|
1002
|
+
{
|
|
1003
|
+
"id": 10,
|
|
1004
|
+
"prompt": "A team has heard that 'type assertions in TypeScript are like runtime casts in C++ — they check the type at runtime and fail if it's wrong.' Walk me through whether this comparison is accurate.",
|
|
1005
|
+
"dimension": "anti_pattern",
|
|
1006
|
+
"comprehension_dimension": "C7",
|
|
1007
|
+
"concept_field": "misconception",
|
|
1008
|
+
"transfer": "far",
|
|
1009
|
+
"substance": "contradiction-check",
|
|
1010
|
+
"calibration": "semantic",
|
|
1011
|
+
"truth_mode": "conceptual_correctness_plus_repo_application",
|
|
1012
|
+
"skill_type": "concept",
|
|
1013
|
+
"criticality": "high",
|
|
1014
|
+
"truth_sources": [
|
|
1015
|
+
"skills/type-safety/SKILL.md:108-115",
|
|
1016
|
+
"skills/type-safety/SKILL.md#any-vs-unknown-vs-never"
|
|
1017
|
+
],
|
|
1018
|
+
"expected_behaviors": [
|
|
1019
|
+
{ "id": "flags_misconception_about_as", "kind": "positive", "description": "Recognizes that TypeScript `as` is NOT a runtime check; C++ dynamic_cast is. The comparison is wrong." },
|
|
1020
|
+
{ "id": "states_as_compiles_to_nothing", "kind": "positive", "description": "States that `as` compiles to nothing — it is a directive to the type checker only, no runtime code is emitted" },
|
|
1021
|
+
{ "id": "distinguishes_static_directive_from_runtime_check", "kind": "positive", "description": "Distinguishes a static directive ('trust me, this is the type') from a runtime check (`instanceof`, validator parse)" },
|
|
1022
|
+
{ "id": "no_partial_validation", "kind": "negative", "description": "Does not say 'yes, with some caveats' — the comparison is structurally wrong, not just imprecise" }
|
|
1023
|
+
],
|
|
1024
|
+
"expected_reasoning": "TypeScript `as` is a static directive, not a runtime check. It compiles to nothing. C++ `dynamic_cast` IS a runtime check (it returns nullptr for a failed pointer cast and throws std::bad_cast for a failed reference cast). C-style casts in C++ are closer to TypeScript `as`. A correct answer rejects the comparison: a misused `as` makes a silent, unchecked claim, so the C++ comparison should be to a C-style cast, not dynamic_cast. The body's misconception field explicitly names this trap."
|
|
1025
|
+
}
|
|
1026
|
+
]
|
|
1027
|
+
}
|
|
1028
|
+
```
|
|
1029
|
+
|
|
1030
|
+
### 6.3 Coverage report
|
|
1031
|
+
|
|
1032
|
+
The 10 cases cover all 9 rubric dimensions:
|
|
1033
|
+
|
|
1034
|
+
| Dimension | Case IDs | Count |
|
|
1035
|
+
|---|---|---|
|
|
1036
|
+
| C1 — Definitional precision | 1 | 1 |
|
|
1037
|
+
| C2 — Mental-model fidelity | 2 | 1 |
|
|
1038
|
+
| C3 — Purpose articulation | 3 | 1 |
|
|
1039
|
+
| C4 — Boundary discrimination | 4 | 1 |
|
|
1040
|
+
| C5 — Taxonomy navigation | 5 | 1 |
|
|
1041
|
+
| C6 — Analogy reuse with limit | 6 | 1 |
|
|
1042
|
+
| C7 — Misconception inoculation | 7, 10 | 2 |
|
|
1043
|
+
| C8 — Verification application | 8 | 1 |
|
|
1044
|
+
| C9 — Negative-boundary refusal | 9 | 1 |
|
|
1045
|
+
|
|
1046
|
+
Transfer mix: 6 near-transfer (C1, C3, C4, C7-#1, C8, C9), 4 far-transfer (C2, C5, C6, C7-#2). The schema's heaviest-weighted dimensions (mental_model 1.5, boundary 1.5) both have far-transfer coverage. Misconception has two cases — one near (the Zod question, a domain-internal misconception) and one far (the C++ comparison, a cross-language misconception), exercising the misconception primitive across surface variations.
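
The coverage table and transfer mix can be recomputed from the eval file instead of being maintained by hand. The following is a minimal sketch, not an existing script: `summarizeCoverage` is a hypothetical helper, and it takes the array of case objects directly because the eval file's top-level key is not shown in this section.

```javascript
// Hypothetical helper: tallies case IDs per comprehension dimension and counts
// cases per transfer tag. `cases` is the array of case objects from the eval file.
function summarizeCoverage(cases) {
  const byDimension = {};
  const byTransfer = { near: 0, far: 0 };
  for (const c of cases) {
    const dim = c.comprehension_dimension;
    if (!dim) continue; // the key is optional, so skip cases that lack it
    if (!byDimension[dim]) byDimension[dim] = [];
    byDimension[dim].push(c.id);
    if (c.transfer === "near" || c.transfer === "far") byTransfer[c.transfer] += 1;
  }
  return { byDimension, byTransfer };
}

// Cases 7-10 from the worked example, reduced to the keys this helper reads.
const sampleCases = [
  { id: 7, comprehension_dimension: "C7", transfer: "near" },
  { id: 8, comprehension_dimension: "C8", transfer: "near" },
  { id: 9, comprehension_dimension: "C9", transfer: "near" },
  { id: 10, comprehension_dimension: "C7", transfer: "far" },
];
const coverage = summarizeCoverage(sampleCases);
console.log(JSON.stringify(coverage.byDimension)); // {"C7":[7,10],"C8":[8],"C9":[9]}
console.log(JSON.stringify(coverage.byTransfer)); // {"near":3,"far":1}
```

Running the same helper over all 10 cases should reproduce the table rows and the 6 near / 4 far split stated above; a mismatch would indicate a mis-tagged case.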
|
|
1047
|
+
|
|
1048
|
+
The minimum threshold (`AGENTS.MD § Evaluation Discipline`: "≥7 realistic scenarios per skill") is exceeded; this file would also satisfy the existing audit-loop's Eval dimension if the grader is updated to accept the comprehension-extension keys.
|
|
1049
|
+
|
|
1050
|
+
### 6.4 What this worked example does NOT cover
|
|
1051
|
+
|
|
1052
|
+
The 10 cases focus on the concept block plus the cross-cutting Verification and Do-NOT sections. They do not test the body's specific tables (the soundness comparison table, the narrowing technique table, the runtime-boundary table). Those tables encode operational nuance — e.g., the differences between `typeof`, `instanceof`, `in`, discriminated union narrowing — that a separate eval surface could probe. Whether to fold those into the comprehension rubric (as a 10th "operational-fluency" dimension) or keep them in the existing `application`-tagged cases is an open question for §9.
|
|
1053
|
+
|
|
1054
|
+
The cases also do not exercise the comparison table at lines 134–143 (system-by-system soundness). A skill-specific operational-knowledge test would benefit `type-safety` because that table is the load-bearing distinguishing data. But adding it to the comprehension rubric risks coupling the rubric to one skill's body shape; the proposed rubric stays universal.
|
|
1055
|
+
|
|
1056
|
+
---
|
|
1057
|
+
|
|
1058
|
+
## 7. Implementation recommendations
|
|
1059
|
+
|
|
1060
|
+
Ordered by leverage. Each item names files that exist today, scripts that would be added, and the smallest viable change.
|
|
1061
|
+
|
|
1062
|
+
### 7.1 R1 (smallest viable change) — Author one comprehension eval file
|
|
1063
|
+
|
|
1064
|
+
**Effort.** ~2 hours, single skill.
|
|
1065
|
+
|
|
1066
|
+
**Change.** Three steps, all in the same commit:
|
|
1067
|
+
|
|
1068
|
+
1. Add `examples/evals/type-safety.json` using the worked example in §6.2.
|
|
1069
|
+
2. Flip `skills/type-safety/SKILL.md` frontmatter: `eval_artifacts: planned` → `present`.
|
|
1070
|
+
3. Add a `## Evals` body section to `skills/type-safety/SKILL.md` (between the last conceptual section and `## Verification`) that links to `examples/evals/type-safety.json`. This is required by the lint when `eval_artifacts: present` — the linter emits `missing required section "## Evals" (conditional: eval_artifacts is "present")` otherwise. Model the section on the canonical example at [`skills/documentation/SKILL.md:201`](https://github.com/jacob-balslev/skills/blob/main/_archived/documentation/SKILL.md).
|
|
1071
|
+
|
|
1072
|
+
Then run `node scripts/skill-lint.js skills/type-safety` to confirm all three changes pass the existing `checkEvalCoherence` and conditional-section checks. The pre-existing `[T3↔T5] examples/skills.manifest.sample.json` generator-parity drift is unrelated to R1 — regenerate the sample independently with `node scripts/generate-manifest.js --include-template --timestamp <ISO> --output examples/skills.manifest.sample.json`.
|
|
1073
|
+
|
|
1074
|
+
**Why this is leverage-1.** No script changes. No schema changes. No grader changes. It proves the eval-file shape works and gives a concrete worked-example specimen. Future eval authors copy from `type-safety.json`. The new keys (`comprehension_dimension`, `concept_field`, `transfer`, `expected_behaviors`) are optional and ignored by the existing audit-prompt-builder.
|
|
1075
|
+
|
|
1076
|
+
**Risk.** None — it's additive. If the maintainer later rejects the rubric, the file is one delete away from removal.
|
|
1077
|
+
|
|
1078
|
+
### 7.2 R2 — Add a grader prompt template for comprehension
|
|
1079
|
+
|
|
1080
|
+
**Effort.** ~1 day. ~150 lines of Node, no dependencies beyond what `scripts/lib/audit-prompt-builder.js` already uses.
|
|
1081
|
+
|
|
1082
|
+
**Change.** Create `scripts/lib/comprehension-prompt-builder.js` modeled on `scripts/lib/audit-prompt-builder.js`. The new module exports `COMPREHENSION_DIMENSIONS` (an array of 9 dimension records corresponding to C1–C9) and a `buildComprehensionPrompt` function that composes a per-case prompt embedding (a) the skill body, (b) the `concept.<field>` content for the case's `concept_field`, (c) the case's `prompt`, (d) the `expected_behaviors` rubric, and (e) the existing IDENTITY/STEPS/RULES/OUTPUT structure with a `<verdict>` JSON block per behavior.
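
To make the composition concrete, here is a rough sketch of what `buildComprehensionPrompt` could look like. It is an illustration under assumptions: the argument shape (`skillBody`, `concept`, `evalCase`) is hypothetical, the section wording is placeholder text rather than the wording in `scripts/lib/audit-prompt-builder.js`, and the STEPS and RULES sections are omitted for brevity.

```javascript
// Sketch only: argument shape and section wording are assumptions, not the
// existing audit-prompt-builder's contract.
function buildComprehensionPrompt({ skillBody, concept, evalCase }) {
  // (b) concept.<field> content; a null concept_field means no concept excerpt.
  const field = evalCase.concept_field;
  const conceptText = field && concept && concept[field] ? concept[field] : "(none)";
  // (d) one rubric line per expected behavior.
  const rubric = evalCase.expected_behaviors
    .map((b) => `- [${b.kind}] ${b.id}: ${b.description}`)
    .join("\n");
  return [
    "# IDENTITY",
    `You grade one answer for comprehension dimension ${evalCase.comprehension_dimension}.`,
    "# SKILL BODY",
    skillBody,
    `# CONCEPT FIELD (${field ?? "n/a"})`,
    conceptText,
    "# CASE PROMPT",
    evalCase.prompt,
    "# EXPECTED BEHAVIORS",
    rubric,
    "# OUTPUT",
    "Return a <verdict> JSON block with one entry per expected behavior id.",
  ].join("\n\n");
}

const prompt = buildComprehensionPrompt({
  skillBody: "(skill body text)",
  concept: { misconception: "(misconception text)" },
  evalCase: {
    comprehension_dimension: "C7",
    concept_field: "misconception",
    prompt: "Walk me through whether dropping Zod is a sensible plan.",
    expected_behaviors: [
      { id: "rejects_dropping_zod", kind: "positive", description: "Recommends keeping Zod" },
    ],
  },
});
console.log(prompt);
```

Keeping the rubric as one line per behavior id makes it straightforward for the calling script to check later that every id received a verdict.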
|
|
1083
|
+
|
|
1084
|
+
Reuse `parseDimensionResponse` and `aggregateVerdict` from the existing builder; both already handle a `<verdict>` JSON output with `findings[]` and per-finding `severity/surface/problem/evidence/required_action`.
|
|
1085
|
+
|
|
1086
|
+
**Why this is leverage-2.** The prompt-builder is the only piece the audit loop is missing. The existing `scripts/skill-audit.js --graded` would consume it directly. The audit loop's seven dimensions are unchanged; the comprehension grader is a parallel additional run.
|
|
1087
|
+
|
|
1088
|
+
**Risk.** Low. The prompt-builder pattern is already proven by the audit-prompt-builder; this is a parameterized variant.
|
|
1089
|
+
|
|
1090
|
+
### 7.3 R3 — Implement `scripts/skill/evaluate-skill.js --comprehension`
|
|
1091
|
+
|
|
1092
|
+
**Effort.** ~2 days.
|
|
1093
|
+
|
|
1094
|
+
**Change.** Create the script the schema documents but the repo doesn't have. Path: `scripts/skill/evaluate-skill.js` (matching the schema's literal string at [`schemas/skill.v4.schema.json:211`](/Users/jacobbalslev/Development/skill-graph/schemas/skill.v4.schema.json)).
|
|
1095
|
+
|
|
1096
|
+
The script's CLI shape:
|
|
1097
|
+
|
|
1098
|
+
```bash
|
|
1099
|
+
node scripts/skill/evaluate-skill.js --skill type-safety --comprehension
|
|
1100
|
+
node scripts/skill/evaluate-skill.js --skill type-safety --comprehension --grader-cli "codex exec"
|
|
1101
|
+
node scripts/skill/evaluate-skill.js --all --comprehension --json
|
|
1102
|
+
```
|
|
1103
|
+
|
|
1104
|
+
Behavior:
|
|
1105
|
+
1. Read the skill's `concept` block.
|
|
1106
|
+
2. Read the matching `examples/evals/<skill>.json` if `eval_artifacts: present`.
|
|
1107
|
+
3. For each case with `comprehension_dimension` set, call the grader CLI with the comprehension prompt template (R2).
|
|
1108
|
+
4. Aggregate per-dimension pass-counts.
|
|
1109
|
+
5. Emit a comprehension scorecard with one row per dimension and an overall verdict.
|
|
1110
|
+
6. Optionally write the receipt to `eval_last_run` in the skill frontmatter (the schema already supports `receipt` and `receipt_hash`).
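
Steps 4 and 5 above can be sketched as a small aggregation function. This is an assumption-laden illustration: `buildScorecard` and the shape of its input (one record per graded case, carrying `comprehension_dimension` and a `behavior_verdicts` array with `PASS` / `FAIL` values) are hypothetical, and the rule that a dimension passes only when every behavior passes is one possible policy, not one the proposal fixes.

```javascript
// Hypothetical input shape: [{ comprehension_dimension, behavior_verdicts: [{ id, verdict }] }]
function buildScorecard(gradedCases) {
  const rows = new Map();
  for (const g of gradedCases) {
    const row = rows.get(g.comprehension_dimension) ?? { passed: 0, total: 0 };
    for (const v of g.behavior_verdicts) {
      row.total += 1;
      if (v.verdict === "PASS") row.passed += 1;
    }
    rows.set(g.comprehension_dimension, row);
  }
  // One row per dimension, sorted by dimension id (C1..C9).
  const dimensions = [...rows.entries()]
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([dimension, r]) => ({
      dimension,
      passed: r.passed,
      total: r.total,
      // Assumed policy: a dimension passes only if all of its behaviors pass.
      verdict: r.passed === r.total ? "PASS" : "FAIL",
    }));
  const overall = dimensions.every((d) => d.verdict === "PASS") ? "PASS" : "FAIL";
  return { dimensions, overall };
}

const scorecard = buildScorecard([
  { comprehension_dimension: "C8", behavior_verdicts: [{ id: "a", verdict: "PASS" }] },
  { comprehension_dimension: "C7", behavior_verdicts: [{ id: "b", verdict: "PASS" }, { id: "c", verdict: "FAIL" }] },
  { comprehension_dimension: "C7", behavior_verdicts: [{ id: "d", verdict: "PASS" }] },
]);
console.log(JSON.stringify(scorecard, null, 2));
```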
|
|
1111
|
+
|
|
1112
|
+
**Why this is leverage-3.** It closes the gap between schema intent and implementation. The script the schema names will exist. The comprehension dimension becomes runnable.
|
|
1113
|
+
|
|
1114
|
+
**Risk.** Medium. The grader-CLI dependency is the same one `scripts/skill-audit.js --graded` already has, so the infrastructure pattern is proven. The script must not regress existing behavior; the `--comprehension` flag opts in.
|
|
1115
|
+
|
|
1116
|
+
### 7.4 R4 — Add a lint check for far-transfer scenarios
|
|
1117
|
+
|
|
1118
|
+
**Effort.** ~3 hours.
|
|
1119
|
+
|
|
1120
|
+
**Change.** Add a new lint check `checkComprehensionScenarioGrounding` to `scripts/lint/check-eval-quality.js` (or a new file). For each eval case where `transfer: far` is set, verify that the surface-keywords of the `prompt` do not appear in the skill body — i.e., the case's scenario was not lifted from the body.
|
|
1121
|
+
|
|
1122
|
+
Operationalization: tokenize the prompt, drop stopwords, drop tokens of length ≤ 3, look for any 4-gram or longer overlap with the body. If found, emit a warning that the scenario is near-transfer despite the `transfer: far` tag.
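
A minimal sketch of that operationalization follows. Several details are assumptions made for illustration: the stopword list is a short placeholder, tokens are lowercased ASCII alphanumerics, and the 4-grams are taken over the filtered token stream (after stopwords and short tokens are dropped). Any overlap longer than four tokens necessarily contains a shared 4-gram, so checking 4-grams covers the "4-gram or longer" condition.

```javascript
// Placeholder stopword list; a real check would use a fuller one.
const STOPWORDS = new Set(["this", "that", "with", "from", "have", "been", "which", "their", "about"]);

// Lowercase, split on non-alphanumerics, drop stopwords and tokens of length <= 3.
function contentTokens(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 3 && !STOPWORDS.has(t));
}

function ngrams(tokens, n) {
  const out = new Set();
  for (let i = 0; i + n <= tokens.length; i++) out.add(tokens.slice(i, i + n).join(" "));
  return out;
}

// Returns the first n-gram shared by prompt and body, or null if there is none.
function findBodyOverlap(prompt, body, n = 4) {
  const bodyGrams = ngrams(contentTokens(body), n);
  for (const gram of ngrams(contentTokens(prompt), n)) {
    if (bodyGrams.has(gram)) return gram;
  }
  return null;
}

// Emits warnings (never errors) and only inspects cases tagged transfer: far.
function checkComprehensionScenarioGrounding(evalCase, body) {
  if (evalCase.transfer !== "far") return [];
  const overlap = findBodyOverlap(evalCase.prompt, body);
  return overlap
    ? [`case ${evalCase.id}: tagged transfer: far but shares "${overlap}" with the skill body`]
    : [];
}

const body = "Type assertions compile away entirely because TypeScript erases annotations.";
const lifted = { id: 1, transfer: "far", prompt: "Explain how type assertions compile away entirely because of erasure." };
const fresh = { id: 2, transfer: "far", prompt: "A bridge inspector stamps structural drawings once before construction begins." };
console.log(checkComprehensionScenarioGrounding(lifted, body).length); // 1
console.log(checkComprehensionScenarioGrounding(fresh, body).length); // 0
```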
|
|
1123
|
+
|
|
1124
|
+
**Why this is leverage-4.** It prevents the far-transfer label from being applied to cases that are actually near transfer — a form of the same `eval_state: verified` honesty rule the protocol already enforces elsewhere.
|
|
1125
|
+
|
|
1126
|
+
**Risk.** Low — false positives are easy to suppress with an inline comment, and the warning is non-blocking.
|
|
1127
|
+
|
|
1128
|
+
### 7.5 R5 — Add a baseline-comparison harness
|
|
1129
|
+
|
|
1130
|
+
**Effort.** ~3 days.
|
|
1131
|
+
|
|
1132
|
+
**Change.** Mirror Anthropic's skill-creator pattern (§4.3.3): run the eval cases twice — once with the skill loaded, once without — and compute a per-dimension delta. This implements the BIG-bench "breakthroughness" idea (§4.2.2) for skills.
|
|
1133
|
+
|
|
1134
|
+
CLI shape:
|
|
1135
|
+
|
|
1136
|
+
```bash
|
|
1137
|
+
node scripts/skill/evaluate-skill.js --skill type-safety --comprehension --baseline
|
|
1138
|
+
```
|
|
1139
|
+
|
|
1140
|
+
Output: per-dimension table showing baseline pass-count, skill-loaded pass-count, and delta. A skill whose comprehension cases produce no answer change when loaded vs. not loaded is either redundant (the model already knows) or ineffective (the skill is loaded but ignored).
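
The delta computation itself is small. A sketch, assuming each run has already been reduced to per-dimension pass counts of the hypothetical shape `{ C2: { passed, total }, ... }`; `computeDeltas` and `zeroDeltaDimensions` are illustrative names, not existing functions.

```javascript
// baseline / loaded: { [dimension]: { passed, total } } for the run without and
// with the skill loaded. Both runs grade the same cases, so totals are shared.
function computeDeltas(baseline, loaded) {
  return Object.keys(loaded)
    .sort()
    .map((dimension) => {
      const withSkill = loaded[dimension];
      const without = baseline[dimension] ?? { passed: 0, total: withSkill.total };
      return {
        dimension,
        baseline: without.passed,
        loaded: withSkill.passed,
        total: withSkill.total,
        delta: withSkill.passed - without.passed,
      };
    });
}

// Dimensions where loading the skill changed nothing: candidates for the
// "redundant or ineffective" review described above.
function zeroDeltaDimensions(rows) {
  return rows.filter((r) => r.delta === 0).map((r) => r.dimension);
}

const rows = computeDeltas(
  { C2: { passed: 1, total: 5 }, C7: { passed: 4, total: 9 } },
  { C2: { passed: 4, total: 5 }, C7: { passed: 4, total: 9 } }
);
console.log(rows.map((r) => `${r.dimension}: ${r.baseline} -> ${r.loaded} (delta ${r.delta})`).join("\n"));
console.log(zeroDeltaDimensions(rows)); // [ 'C7' ]
```

Note that a pass-count delta of zero is a coarse signal: the answers may still differ in wording. Comparing per-behavior verdicts would be a finer check.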
|
|
1141
|
+
|
|
1142
|
+
**Why this is leverage-5.** It quantifies skill value. A skill with 0% delta on its own comprehension eval is a measurable problem. A skill with +60% on `C2` (mental-model fidelity) is teaching the model something it would not have produced unaided.
|
|
1143
|
+
|
|
1144
|
+
**Risk.** Medium — running the grader twice per case doubles cost, so this is only worth running on demand (releases, large refactors), not on every commit.
|
|
1145
|
+
|
|
1146
|
+
### 7.6 R6 — Extend the audit-loop scorecard with a comprehension row
|
|
1147
|
+
|
|
1148
|
+
**Effort.** ~half day.
|
|
1149
|
+
|
|
1150
|
+
**Change.** Add an 8th dimension to the audit-loop's seven scorecard rows: `Comprehension quality`. Its `appliesWhen` predicate returns true iff `comprehension_state: present`. The dimension reads the comprehension scorecard (R3 output) and reports the same per-dimension verdicts.
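
As a sketch of the gating, the new dimension record might look like the following. The record's shape (`id`, `label`, `appliesWhen` receiving the parsed frontmatter) is an assumption for illustration; the existing audit-loop dimension records may be structured differently.

```javascript
// Hypothetical dimension record; only the appliesWhen gating is taken from the text.
const comprehensionQuality = {
  id: "comprehension_quality",
  label: "Comprehension quality",
  // True iff the skill's frontmatter declares comprehension_state: present.
  appliesWhen: (frontmatter) => frontmatter.comprehension_state === "present",
};

console.log(comprehensionQuality.appliesWhen({ comprehension_state: "present" })); // true
console.log(comprehensionQuality.appliesWhen({ eval_artifacts: "present" })); // false
```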
|
|
1151
|
+
|
|
1152
|
+
Update [`SKILL_AUDIT_CHECKLIST.md`](https://github.com/jacob-balslev/skill-audit-loop/blob/main/SKILL_AUDIT_CHECKLIST.md) to add the new row to the artifact contract.
|
|
1153
|
+
|
|
1154
|
+
**Why this is leverage-6.** It surfaces the comprehension grade in the existing audit artifact flow. Audits become 8-dimensional when the skill claims comprehension; they remain 7-dimensional when it doesn't.
|
|
1155
|
+
|
|
1156
|
+
**Risk.** Low — additive, gated on `comprehension_state`.
|
|
1157
|
+
|
|
1158
|
+
### 7.7 R7 — Backfill comprehension evals to the existing `concept`-declaring skills
|
|
1159
|
+
|
|
1160
|
+
**Effort.** ~1 day per skill.
|
|
1161
|
+
|
|
1162
|
+
**Change.** Apply the worked-example pattern (§6) to all skills with `comprehension_state: present`. Today that's `type-safety` and `acid-fundamentals`. The `documentation` skill has a comprehension-shaped eval (`examples/evals/comprehension.json`) but lacks the `concept` block — either declare comprehension_state and add the block, or rename the eval file and don't.
|
|
1163
|
+
|
|
1164
|
+
**Why this is leverage-7.** The first set of comprehension-graded skills becomes the calibration corpus. Future skill authors point at these as worked examples.
|
|
1165
|
+
|
|
1166
|
+
**Risk.** Low.
|
|
1167
|
+
|
|
1168
|
+
### 7.8 R8 — Open documentation: protocol section on comprehension
|
|
1169
|
+
|
|
1170
|
+
**Effort.** ~half day.
|
|
1171
|
+
|
|
1172
|
+
**Change.** Add a new section to `docs/SKILL_METADATA_PROTOCOL.md` and `docs/field-reference.md § concept` explaining the rubric. Cross-reference from `AGENTS.MD § Evaluation Discipline` so the protocol-level discipline is discoverable. Add the new optional eval-file keys (`comprehension_dimension`, `concept_field`, `transfer`, `expected_behaviors`) to `docs/field-reference.md` with the same field-level treatment as existing keys.
|
|
1173
|
+
|
|
1174
|
+
**Why this is leverage-8.** Without documentation, the rubric is folklore. With documentation in the canonical reference doc, authors discover it during their normal skill-authoring flow.
|
|
1175
|
+
|
|
1176
|
+
**Risk.** None.
|
|
1177
|
+
|
|
1178
|
+
### 7.10 R9 — Grader prompt incorporates `methodical` forcing functions
|
|
1179
|
+
|
|
1180
|
+
**Effort.** ~half day, prompt-only change to the R2 builder.
|
|
1181
|
+
|
|
1182
|
+
**Change.** Add a `# METHODICAL FORCING FUNCTIONS` section to the comprehension-prompt-builder's output (the section is rendered into every grader prompt the builder produces). The section enumerates four rules from [`skills/methodical/SKILL.md`](/Users/jacobbalslev/Development/skills/methodical/SKILL.md) in the grader's own voice — i.e., the grader is the agent applying these rules to its own scoring output. Each rule has a one-line description of what it forces the grader to do.
|
|
1183
|
+
|
|
1184
|
+
The four rules and what each forces:
|
|
1185
|
+
|
|
1186
|
+
| Rule | Source | What it forces the grader to do |
|
|
1187
|
+
|---|---|---|
|
|
1188
|
+
| **RULE-1: Complete Before Summarize** | [`methodical` § 1, lines 89–95](/Users/jacobbalslev/Development/skills/methodical/SKILL.md) | Produce one verdict per `expected_behavior.id` BEFORE producing the `dimension_verdict`. Never aggregate before enumerating. The count of `behavior_verdicts[]` must equal the count of `expected_behaviors[]`. |
|
|
1189
|
+
| **RULE-3: Separate Generation from Criticism** | [`methodical` § 1, lines 105–114](/Users/jacobbalslev/Development/skills/methodical/SKILL.md) | After producing the verdict, perform a self-critique pass: "What behavior did I score as PASS where the evidence-quote does not actually support the criterion? Where did I soften a clear FAIL into hedged language?" Revise any verdict where the self-critique surfaces a problem. |
|
|
1190
|
+
| **RULE-6: Negative Findings Are Primary Data** | [`methodical` § 1, lines 130–136](/Users/jacobbalslev/Development/skills/methodical/SKILL.md) | A failed behavior is more diagnostic than a passed behavior. Do NOT soften failures with hedge words ("could be improved", "worth reviewing"). State `FAIL` directly and quote the gap. Use of hedge words on a behavior with `verdict: FAIL` is itself a grader malfunction. |
|
|
1191
|
+
| **RULE-9: State the Completeness Claim Explicitly** | [`methodical` § 1, lines 150–155](/Users/jacobbalslev/Development/skills/methodical/SKILL.md) | The verdict object enumerates every `expected_behavior.id` from the input. A behavior the grader could not decide is `verdict: FAIL` with `rationale: "could not evaluate because <reason>"`, not omitted. The grader states which dimensions it could not evaluate and why, never silently. |
|
|
1192
|
+
|
|
1193
|
+
These rules are forcing functions, not aspirations. RULE-1 prevents the §4.7 summarization-bias failure mode in the grader (the 26–73% base rate). RULE-3 implements the Generator/Critic separation that [Towards Data Science (2025)](https://towardsdatascience.com/why-your-multi-agent-system-is-failing-escaping-the-17x-error-trap-of-the-bag-of-agents/) documents as the dominant remediation for the 17.2× error amplification rate. RULE-6 prevents the §4.7 framing-bias failure mode (the 26.42% base rate) where the grader softens negative verdicts to match the agent's positive framing. RULE-9 prevents the §4.7 instruction-skipping failure mode where the grader silently omits behaviors the prompt enumerated.
|
|
1194
|
+
|
|
1195
|
+
The four rules are intentionally a subset of `methodical`'s 10. RULE-2 (evidence receipts) is already structurally required by the verdict schema's `evidence_quote` field (Appendix A § A.1) — naming it again in the FORCING FUNCTIONS section is redundant; RULE-4 (prioritization is reordering not filtering) does not apply at the grader layer; RULE-5 (Observe Before Act) is captured by the grader's STEP 1 (read the body first); RULE-7 (Verification is not trust) applies at the audit-loop integration layer (R6, §7.6) where the audit consumes the grader's verdict; RULE-8 (challenge scope framing) does not apply because the grader's scope is exactly the case's `expected_behaviors[]`; RULE-10 (deliberate pace on high-stakes steps) is implicit in the single-shot per-case grader invocation. The four selected are the minimum set that defends against the four failure modes with measured base rates in §4.7.
|
|
1196
|
+
|
|
1197
|
+
**Why this is leverage-9 (placed between R7 and R8 in priority).** Without the FORCING FUNCTIONS block in the grader prompt, R2 / R3 produce a grader that inherits the §4.7 base rates. The functional rubric still works probabilistically — a grader scoring 1,000 cases will be approximately right on average — but per-case verdicts are not individually defensible. Adding the FORCING FUNCTIONS block costs ~30 lines of prompt and reduces per-verdict variance.
|
|
1198
|
+
|
|
1199
|
+
**Risk.** Low — prompt-only change, opt-in via the comprehension grader, no schema impact. The risk is the inverse: a maintainer who skips R9 ships a grader that produces statistically defensible but per-verdict unreliable scores, and over-claims the surface's discriminating power.
|
|
1200
|
+
|
|
1201
|
+
**Verification.** Run R9 against the worked example (§6). The 10 cases collectively have ~45 `expected_behaviors[]`. The grader's output must contain exactly one `behavior_verdicts[]` entry per expected behavior. An output with fewer verdicts than expected behaviors is a RULE-1 / RULE-9 violation, detectable by a simple count assertion in the calling script; R9's compliance is therefore mechanically checkable, not LLM-graded.
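
The count assertion could be as small as the sketch below. It assumes each entry in `behavior_verdicts[]` carries the behavior's `id`; that field name and the helper `checkVerdictCompleteness` are hypothetical.

```javascript
// Compares the behavior ids the case enumerates against the ids the grader scored.
function checkVerdictCompleteness(evalCase, graderOutput) {
  const expected = evalCase.expected_behaviors.map((b) => b.id);
  const seen = new Set(graderOutput.behavior_verdicts.map((v) => v.id));
  const missing = expected.filter((id) => !seen.has(id));
  // Count must match as well, so duplicated or invented ids are also caught.
  const countMatches = graderOutput.behavior_verdicts.length === expected.length;
  return { ok: missing.length === 0 && countMatches, missing };
}

const case10 = {
  expected_behaviors: [
    { id: "flags_misconception_about_as" },
    { id: "states_as_compiles_to_nothing" },
    { id: "distinguishes_static_directive_from_runtime_check" },
    { id: "no_partial_validation" },
  ],
};
const partialOutput = {
  behavior_verdicts: [
    { id: "flags_misconception_about_as", verdict: "PASS" },
    { id: "states_as_compiles_to_nothing", verdict: "PASS" },
  ],
};
console.log(checkVerdictCompleteness(case10, partialOutput));
```

Because the check only compares ids and counts, it runs without any grader call and can fail the run before verdicts are aggregated.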
|
|
1202
|
+
|
|
1203
|
+
### 7.9 Recommended adoption order
|
|
1204
|
+
|
|
1205
|
+
```
|
|
1206
|
+
R1 (one eval file) ─► R2 (prompt builder) ─► R3 (evaluate-skill script)
|
|
1207
|
+
│ │
|
|
1208
|
+
└───► R9 (methodical FF) ──┘
|
|
1209
|
+
│
|
|
1210
|
+
R8 (documentation) ◄──── R6 (audit scorecard) ◄──── R7 (backfill) ◄──────┘
|
|
1211
|
+
│
|
|
1212
|
+
R4 (lint check) ────► R5 (baseline harness)
|
|
1213
|
+
```
|
|
1214
|
+
|
|
1215
|
+
R1 first — it costs the least and validates the eval-file shape. R2 and R3 are the core implementation. **R9 attaches to R2's prompt builder and ships in the same change** — without it, R3's grader inherits the §4.7 base rates. R4 is a cheap honesty check. R5 is the value-quantification piece, added once the basics work. R6 connects to the existing audit flow. R7 calibrates against the gold-standard skills. R8 makes the surface discoverable.
|
|
1216
|
+
|
|
1217
|
+
The maintainer can stop at any step: R1 alone delivers a worked specimen; R1+R2+R9+R3 delivers a runnable, per-verdict-defensible grader; the full set delivers a calibrated comprehension surface.
|
|
1218
|
+
|
|
1219
|
+
---
|
|
1220
|
+
|
|
1221
|
+
## 8. Risks and anti-patterns
|
|
1222
|
+
|
|
1223
|
+
### 8.1 LLM-judge bias propagation
|
|
1224
|
+
|
|
1225
|
+
The position / verbosity / self-enhancement biases documented by [Zheng et al. 2023](https://arxiv.org/abs/2306.05685) and refined in [later surveys](https://arxiv.org/html/2411.15594v6) apply to the proposed grader. Mitigations baked into the rubric:
|
|
1226
|
+
|
|
1227
|
+
- **Direct scoring, not pairwise.** The rubric scores each behavior in `expected_behaviors[]` as binary; no pairwise comparisons.
|
|
1228
|
+
- **Cross-family grader.** Use a grader from a different model family than the agent: if Sonnet is the agent, the grader is GPT-5.x or Gemini, and vice versa. Self-enhancement bias drops sharply when grader and agent come from different families.
|
|
1229
|
+
- **Chain-of-thought required.** The grader prompt template forces the grader to cite an evidence span before each behavior verdict. This improves reliability per [CalibraEval (2024)](https://arxiv.org/html/2410.15393v1).
|
|
1230
|
+
- **Output schema constrained.** The grader returns a `<verdict>` JSON block per the existing audit-prompt-builder pattern. Off-schema responses are parse errors, not silent passes.
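
The "parse errors, not silent passes" rule from the last mitigation can be illustrated with a strict parser. This is a sketch, not the existing `parseDimensionResponse`: the function name `parseVerdictBlock` and the assumption that every verdict entry carries `id`, `verdict`, and a non-empty `evidence_quote` are hypothetical.

```javascript
// Throws on any off-schema response instead of returning a default verdict.
function parseVerdictBlock(responseText) {
  const match = responseText.match(/<verdict>([\s\S]*?)<\/verdict>/);
  if (!match) throw new Error("no <verdict> block in grader response");
  const parsed = JSON.parse(match[1]); // malformed JSON throws SyntaxError
  if (!Array.isArray(parsed.behavior_verdicts)) {
    throw new Error("verdict is missing behavior_verdicts[]");
  }
  for (const v of parsed.behavior_verdicts) {
    if (v.verdict !== "PASS" && v.verdict !== "FAIL") {
      throw new Error(`behavior ${v.id}: verdict must be PASS or FAIL`);
    }
    if (typeof v.evidence_quote !== "string" || v.evidence_quote.trim() === "") {
      throw new Error(`behavior ${v.id}: evidence_quote is required`);
    }
  }
  return parsed;
}

const response =
  'Reasoning...\n<verdict>{"behavior_verdicts":[{"id":"rejects_dropping_zod",' +
  '"verdict":"PASS","evidence_quote":"keep Zod at the I/O boundary"}]}</verdict>';
console.log(parseVerdictBlock(response).behavior_verdicts.length); // 1
```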
|
|
1231
|
+
|
|
1232
|
+
A specific risk: **the grader's own training data may contain the skill body** if the skill is public. The Skill Graph's skills are public and may eventually appear in training corpora. Long-term mitigation: maintain a private fork of the eval files with perturbed surface details, used for true validation rather than self-reporting. This is the [Gardner contrast-set discipline (2020)](https://aclanthology.org/2020.findings-emnlp.117.pdf) at the eval-file level.
|
|
1233
|
+
|
|
1234
|
+
### 8.2 Near-vs-far transfer confusion
|
|
1235
|
+
|
|
1236
|
+
The rubric requires per-case `transfer: near | far` tagging. The risk is that authors will default to `near` (easier to author) and the `far` cases will be under-represented. The lint check R4 (§7.4) is the partial mitigation. A second mitigation: require at least one `transfer: far` case per concept-block dimension (C1, C2, C3, C4, C5, C6, C7) on any skill declaring `comprehension_state: present`.
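
The second mitigation is mechanically checkable. A minimal sketch, where `dimensionsMissingFarCase` is a hypothetical helper that takes the array of case objects from an eval file:

```javascript
// The concept-block dimensions named in the mitigation above.
const CONCEPT_BLOCK_DIMENSIONS = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"];

// Returns the concept-block dimensions that have no case tagged transfer: far.
function dimensionsMissingFarCase(cases) {
  const covered = new Set(
    cases.filter((c) => c.transfer === "far").map((c) => c.comprehension_dimension)
  );
  return CONCEPT_BLOCK_DIMENSIONS.filter((d) => !covered.has(d));
}

// Far-transfer dimensions reported for the worked example in §6.3: C2, C5, C6, C7.
const farCases = ["C2", "C5", "C6", "C7"].map((d) => ({ comprehension_dimension: d, transfer: "far" }));
console.log(dimensionsMissingFarCase(farCases)); // [ 'C1', 'C3', 'C4' ]
```

Applied to the worked example's own transfer mix, this stricter rule would report C1, C3, and C4 as lacking far-transfer coverage, which gives a sense of how much additional authoring the mitigation implies.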
|
|
1237
|
+
|
|
1238
|
+
The current dimension distribution (§2.2) is overwhelming evidence that without an explicit prompt-side incentive, authors will fill the easiest dimensions. The lint check makes the incentive visible.
|
|
1239
|
+
|
|
1240
|
+
### 8.3 Scoring drift
|
|
1241
|
+
|
|
1242
|
+
The rubric is criterion-referenced (binary per behavior). Over time, the grader model improves and standards may drift — a behavior that previously failed a strict grader might pass a more lenient one. Three mitigations:
|
|
1243
|
+
|
|
1244
|
+
- **Pin the grader model in `eval_last_run`.** The schema already supports `model` in `eval_last_run` ([`schemas/skill.v4.schema.json:213-232`](/Users/jacobbalslev/Development/skill-graph/schemas/skill.v4.schema.json)). Authors record which grader produced the verdict.
|
|
1245
|
+
- **Re-run on grader bumps.** When the grader CLI's underlying model changes (e.g., Sonnet 4.6 → 4.7), trigger a re-run on the comprehension surface and surface the delta.
|
|
1246
|
+
- **Calibration cases.** Maintain a small set of cases with known-correct gold answers (authored by the maintainer) to detect grader drift. If the grader scores a gold case differently across model versions, the grader, not the skill, drifted.
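
The calibration check in the last bullet can be sketched as a comparison between maintainer-authored gold verdicts and a grader run. The keying scheme (`"<caseId>:<behaviorId>"`) and the helper name `detectGraderDrift` are assumptions for illustration.

```javascript
// goldVerdicts / graderVerdicts: { "<caseId>:<behaviorId>": "PASS" | "FAIL" }
// Returns every gold entry the grader scored differently or did not score at all.
function detectGraderDrift(goldVerdicts, graderVerdicts) {
  const disagreements = [];
  for (const [key, expected] of Object.entries(goldVerdicts)) {
    const actual = graderVerdicts[key] ?? "MISSING";
    if (actual !== expected) disagreements.push({ key, expected, actual });
  }
  return disagreements;
}

const gold = { "7:rejects_dropping_zod": "PASS", "7:no_partial_validation_of_misconception": "FAIL" };
const newGraderRun = { "7:rejects_dropping_zod": "PASS", "7:no_partial_validation_of_misconception": "PASS" };
console.log(detectGraderDrift(gold, newGraderRun));
```

A non-empty result on gold cases after a grader model bump points at the grader rather than the skill, which is the distinction the bullet draws.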
|
|
1247
|
+
|
|
1248
|
+
### 8.4 Evaluator contamination from concept-block citation
|
|
1249
|
+
|
|
1250
|
+
The grader prompt embeds the `concept.<field>` content. This makes the grader's job easier (it has the canonical answer) but also lets the grader pattern-match: "did the agent's answer say what concept.X says?" If yes, pass; if no, fail. This is a verbatim-copy reward — exactly the failure mode C1 explicitly forbids in the agent's answer.
|
|
1251
|
+
|
|
1252
|
+
Mitigation: the `expected_behaviors[]` array must include explicit anti-quote criteria (`no_verbatim_span`, `no_fabricated_specificity`) and the grader must verify them. Case 1 in the worked example demonstrates the pattern.
|
|
1253
|
+
|
|
1254
|
+
A second-order risk: the grader could collude with the agent by accepting paraphrastic but vacuous answers ("type safety is about types being safe"). Mitigation: at least one positive behavior must demand a **specific primitive** be invoked, not just a general restatement. The worked example's `invokes_runtime_boundary_primitive` is an instance.
|
|
1255
|
+
|
|
1256
|
+
### 8.5 The "Goodhart's Law" failure mode
|
|
1257
|
+
|
|
1258
|
+
If the rubric becomes the target, skill authors will optimize for it directly — writing concept blocks that game the rubric rather than transferring concepts. Symptoms: every concept block starts using identical hedging language; misconception fields target only misconceptions the rubric tests; analogies are written to maximize the C6 "limit" criterion rather than to teach.
|
|
1259
|
+
|
|
1260
|
+
The mitigations are doctrinal, not technical:
|
|
1261
|
+
|
|
1262
|
+
- The rubric is **criterion-referenced, not norm-referenced**. There is no top quartile to game into.
|
|
1263
|
+
- The `transfer: far` cases force agent-side reasoning the body doesn't encode, which can't be gamed by skill-side polish.
|
|
1264
|
+
- The baseline comparison (R5) catches skills that score well on comprehension but produce no agent-behavior delta — a skill being optimized for the rubric is detectable by its zero delta.
|
|
1265
|
+
|
|
1266
|
+
### 8.6 Quality-of-grader-prompt risk
|
|
1267
|
+
|
|
1268
|
+
The proposed `scripts/lib/comprehension-prompt-builder.js` (R2) carries the same risk as any prompt template: a vague or under-constrained prompt produces unreliable verdicts. The existing audit-prompt-builder is well-constrained (forces JSON output, requires evidence citations, scopes evidence sources to embedded context). The comprehension builder should mirror this exactly — same `# IDENTITY / # STEPS / # RULES / # INPUT / # OUTPUT` structure, same `<verdict>…</verdict>` JSON discipline, same per-behavior evidence-quote requirement.
|
|
1269
|
+
|
|
1270
|
+
A specific risk: the comprehension rubric is more **semantic** than the audit's content rubric, so the grader is more likely to hallucinate a verdict if the prompt is loose. Two mitigations: (a) include the case's `expected_reasoning` field in the grader prompt so the grader has a baseline to compare against; (b) require the grader to quote the agent's exact words for each behavior verdict, not paraphrase.
|
|
1271
|
+
|
|
1272
|
+
### 8.7 Cost risk
|
|
1273
|
+
|
|
1274
|
+
Each comprehension run costs at most 9 dimensions × M cases × cost-per-grader-call, but a case that targets a single dimension needs only one grader call. For type-safety with 10 cases, that's roughly 10 grader calls per run (cases may test multiple dimensions in principle, but the worked example happens to be 1:1). The audit-loop's `--graded` mode already proves this is feasible at reasonable cost when using `codex exec` or `claude -p`. The baseline harness (R5) doubles the cost; recommended for releases, not every commit.
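
The arithmetic can be written down as a back-of-envelope estimator. It assumes one grader call per case (the 1:1 situation of the worked example) and treats the baseline harness as a flat doubling; the function and parameter names are hypothetical, and `costPerCall` is whatever unit the maintainer tracks.

```javascript
// Hypothetical estimator: one grader call per case, doubled when the
// baseline harness (R5) also runs.
function estimateRunCost({ caseCount, costPerCall, withBaseline = false }) {
  const graderCalls = caseCount * (withBaseline ? 2 : 1);
  return { graderCalls, totalCost: graderCalls * costPerCall };
}
```

For the 10-case type-safety example this gives 10 grader calls per run, or 20 when the baseline comparison is enabled.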
|
|
1275
|
+
|
|
1276
|
+
### 8.8 Anti-pattern catalog
|
|
1277
|
+
|
|
1278
|
+
| Anti-pattern | Why it fails | Detection |
|
|
1279
|
+
|---|---|---|
|
|
1280
|
+
| Author a single "comprehensive" prompt covering all 9 dimensions | Loses per-dimension signal; failures can't be diagnosed | Lint check: each case must declare exactly one `comprehension_dimension` |
|
|
1281
|
+
| Use the same grader model family as the agent | Self-enhancement bias inflates pass rates | Recommendation: the comprehension config names explicit grader and agent models |
|
|
1282
|
+
| Tag all cases `transfer: far` | Most authors can't author genuine far transfer; the tag becomes noise | Lint check R4 — verify scenario keywords don't appear in body |
|
|
1283
|
+
| Treat the rubric as a Likert score | Reintroduces the labeling-inconsistency problem Hagiwara 2023 documents | Rubric output is binary per behavior; aggregate is per-dimension counts |
|
|
1284
|
+
| Author cases that test multiple skills at once | Coupling defeats per-skill scoring | The case-level `skill_name` is the unit; cross-skill scenarios belong in the routing eval |
|
|
1285
|
+
| Set `eval_state: passing` without re-running the grader | Repeats the `verified-without-evidence` failure mode the protocol already names | The existing `eval_last_run` discipline catches this; the comprehension grader fills it in |
|
|
1286
|
+
|
|
1287
|
+
### 8.9 `methodical` anti-patterns mapped to LLM-as-judge comprehension-grading failures
|
|
1288
|
+
|
|
1289
|
+
The repo-internal [`skills/methodical/SKILL.md` § 3](/Users/jacobbalslev/Development/skills/methodical/SKILL.md) catalogs 9 anti-patterns that emerge when an LLM agent is asked to produce a complete, honest enumeration. Each anti-pattern was identified in the context of an agent *producing* findings; every one of them reappears when the LLM is asked to *grade* another LLM's comprehension. The grader is an agent producing findings about the agent-under-test, and inherits the same training pressures.
|
|
1290
|
+
|
|
1291
|
+
This section maps four of the nine anti-patterns — the ones with the highest base rates (per §4.7) and the most direct comprehension-eval consequence — onto concrete grader failure modes, and names a structural countermeasure for each. The remaining five anti-patterns (silent scope reduction, summary-first fabrication, step consolidation, deferral as completion, exception justification) apply analogously and are documented in the source.
|
|
1292
|
+
|
|
1293
|
+
#### Anti-pattern #3 — Severity-based filter
|
|
1294
|
+
|
|
1295
|
+
**`methodical` description.** "Showing only CRITICAL/HIGH items, omitting MEDIUM/LOW. Root cause: helpfulness training optimizes for 'important'. Detection signal: output count < input count." ([`methodical` § 3, line 209](/Users/jacobbalslev/Development/skills/methodical/SKILL.md))
|
|
1296
|
+
|
|
1297
|
+
**Comprehension-grading manifestation.** The grader scores only the cases where the agent's answer is *clearly* right or *clearly* wrong, and elides the borderline cases. But the borderline cases carry the most signal — a case the grader can't decide without close reading is exactly the case that discriminates a model that internalized the primitives from a model that pattern-matched. The grader skipping them produces a verdict distribution biased toward 0% or 100% per dimension and erases the comprehension surface's discriminating range.
|
|
1298
|
+
|
|
1299
|
+
**Concrete example.** For C2 (mental-model fidelity) on the worked example's case 2 (Python → TypeScript `Any` → `unknown` migration): the agent's answer recommends `Record<string, unknown>` but does not explicitly name the runtime-boundary primitive — only describes the mechanism. A borderline case. A filtering grader marks it PASS (the conclusion is right) or FAIL (the primitive name is missing) without quoting the relevant span. The discriminating fact — that the agent's reasoning *invokes* the runtime-boundary concept under a different name — is lost.
|
|
1300
|
+
|
|
1301
|
+
**Structural countermeasure.** The verdict schema (Appendix A § A.1) requires `evidence_quote` per behavior. The grader cannot resolve a borderline case by skipping it — the schema demands a quote. If no quote exists for a `kind: positive` behavior, the verdict is `FAIL` with rationale "no evidence span supports this behavior". This forces the grader to engage every behavior, every case, and converts "I couldn't decide" into "FAIL with rationale" instead of silent omission. Per the `methodical` RULE-9 application in §5.13: completeness is a hard floor, not a target.
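
A sketch of how the calling script might enforce this floor. The object shapes and the function name are assumptions; only `evidence_quote`, `kind: positive`, the `PASS`/`FAIL` literals, and the rationale string come from this report. Treating an entirely omitted behavior as `FAIL` is an additional assumption, not something Appendix A is known to specify.

```javascript
// Hypothetical shapes:
//   expectedBehaviors: [{ id, kind }]  with kind in {"positive", "negative"}
//   graderVerdicts:    [{ behavior, verdict, evidence_quote, rationale }]
function enforceEvidence(expectedBehaviors, graderVerdicts) {
  return expectedBehaviors.map((behavior) => {
    const found = graderVerdicts.find((v) => v.behavior === behavior.id);
    if (!found) {
      // Assumption: an omitted behavior fails instead of being skipped.
      return {
        behavior: behavior.id,
        verdict: "FAIL",
        evidence_quote: "",
        rationale: "grader returned no verdict for this behavior",
      };
    }
    const quote =
      typeof found.evidence_quote === "string" ? found.evidence_quote.trim() : "";
    if (behavior.kind === "positive" && quote === "") {
      // No quote for a positive behavior: forced FAIL with the fixed rationale.
      return {
        ...found,
        verdict: "FAIL",
        evidence_quote: "",
        rationale: "no evidence span supports this behavior",
      };
    }
    return found;
  });
}
```

The output always has one entry per expected behavior, so output count equals input count by construction.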
|
|
1302
|
+
|
|
1303
|
+
#### Anti-pattern #4 — Positive framing override
|
|
1304
|
+
|
|
1305
|
+
**`methodical` description.** "Describing a failure as a success or a gap as 'an area for improvement'. Root cause: sycophancy trained on positive human feedback. Detection signal: language mismatch between finding severity and framing." ([`methodical` § 3, line 210](/Users/jacobbalslev/Development/skills/methodical/SKILL.md))
|
|
1306
|
+
|
|
1307
|
+
**Comprehension-grading manifestation.** The grader awards partial credit to a response that should have failed because the response was articulate, confident, or stylistically polished. This is the [Zhang et al. 2024 "Style Outweighs Substance" finding](https://arxiv.org/html/2409.15268v2) and the [SycEval 58.19% base rate](https://arxiv.org/abs/2502.08177) playing out at the grader layer. The grader's verdict reads "partially demonstrates understanding" or "passes with minor gaps" on a response that, when read against the rubric, fails ≥1 binary behavior.
|
|
1308
|
+
|
|
1309
|
+
**Concrete example.** For C7 (misconception inoculation) on the worked example's case 7 (the "TypeScript catches runtime errors" misconception): the agent's answer is articulate, well-organized, and includes a recommendation to keep Zod. But the answer *partially validates* the misconception ("yes, TypeScript catches type errors at compile time, but you also need runtime validation for untrusted input"). The §5.8 fail criterion is unambiguous — `no_partial_validation_of_misconception`. A positive-framing-override grader marks this PASS because the *overall* recommendation matches the rubric; the failure was the unprompted misconception validation, which the polished framing obscured.
|
|
1310
|
+
|
|
1311
|
+
**Structural countermeasure.** Per-behavior binary scoring (the rubric's central design choice) plus the `kind: negative` behavior class. Each negative behavior is a forbidden pattern that, if present, fails the behavior regardless of the response's quality elsewhere. The grader cannot offset a `FAIL` on `no_partial_validation_of_misconception` with `PASS` on three positive behaviors — every behavior must independently PASS for the dimension to PASS. The Appendix A verdict schema's `dimension_verdict = PASS iff all behavior_verdicts == PASS` is the structural enforcement. RULE-6 in §7.10's FORCING FUNCTIONS block names the framing-override pattern explicitly so the grader self-recognizes it.
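
The aggregation rule is small enough to state as code. This is a sketch of the `dimension_verdict = PASS iff all behavior_verdicts == PASS` rule; the function name is hypothetical, and failing an empty behavior list is an assumption added here so that a dimension with no graded behaviors cannot pass vacuously.

```javascript
// behaviorVerdicts: [{ verdict: "PASS" | "FAIL", ... }]
function dimensionVerdict(behaviorVerdicts) {
  // Assumption: no graded behaviors means nothing was demonstrated.
  if (behaviorVerdicts.length === 0) return "FAIL";
  // A single FAIL cannot be offset by any number of PASS verdicts.
  return behaviorVerdicts.every((b) => b.verdict === "PASS") ? "PASS" : "FAIL";
}
```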
|
|
1312
|
+
|
|
1313
|
+
#### Anti-pattern #6 — Assumed verification
|
|
1314
|
+
|
|
1315
|
+
**`methodical` description.** "Reporting a step as verified without producing evidence. Root cause: speed optimization. Detection signal: 'Should work' / 'likely fixed' / 'probably correct'." ([`methodical` § 3, line 212](/Users/jacobbalslev/Development/skills/methodical/SKILL.md))
|
|
1316
|
+
|
|
1317
|
+
**Comprehension-grading manifestation.** The grader claims the agent's answer satisfies a behavior without quoting evidence from the answer. The verdict reads "the agent invoked the runtime-boundary primitive" without an `evidence_quote` field, or with a generic-sounding quote that does not actually contain the primitive ("the agent discussed validation"). This is the most common subtle grader failure — the verdict *form* is correct, but the evidence does not actually substantiate the verdict.
|
|
1318
|
+
|
|
1319
|
+
**Concrete example.** For C1 (definitional precision) on the worked example's case 1: the agent's answer is a polished restatement of the body's first sentence with minor word substitutions. A grader subject to assumed verification marks `no_verbatim_span: PASS` and quotes a different span as evidence — one that *does* differ from the body — without actually running the 6-gram overlap check on the full answer. The verbatim copy is in a different sentence than the one quoted as evidence; the verdict is wrong.
|
|
1320
|
+
|
|
1321
|
+
**Structural countermeasure.** The Appendix A § A.3 `checkVerbatimOverlap` function runs in the calling script (Node, deterministic, free) and emits a structured `verbatim_overlap_check` field that the grader cannot fabricate. The grader's prompt explicitly includes the precomputed result: `<precomputed-overlap-check passed="false" overlap_ngrams="[...]"/>`. The grader cannot mark `no_verbatim_span: PASS` if `passed: false` is in the input. For non-mechanical behaviors (e.g., `invokes_runtime_boundary_primitive`), the grader must produce an `evidence_quote` that is verifiable as a substring of `<agent-answer>` — the calling script can post-check this and flag any verdict whose evidence_quote is not a substring as a parse error. The verification is not the grader's word; it is the grader's structured output measured against deterministic checks.
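
To make the two deterministic checks concrete, here is a minimal sketch. It is not the Appendix A § A.3 implementation, which may tokenize or normalize differently; the function names, the tokenizer, and the default of 6 tokens per n-gram are assumptions chosen to match the "6-gram overlap check" described above.

```javascript
// Lowercase word tokenizer; punctuation is dropped. (Assumed normalization.)
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9_]+/g) || [];
}

// All contiguous n-token windows of a token list, joined by single spaces.
function ngrams(tokens, n) {
  const out = new Set();
  for (let i = 0; i + n <= tokens.length; i++) {
    out.add(tokens.slice(i, i + n).join(" "));
  }
  return out;
}

// Fails when any n-gram of the agent's answer also appears in the skill body.
function sketchVerbatimOverlap(answer, body, n = 6) {
  const bodyGrams = ngrams(tokenize(body), n);
  const overlap = [...ngrams(tokenize(answer), n)].filter((g) => bodyGrams.has(g));
  return { passed: overlap.length === 0, overlap_ngrams: overlap };
}

// Post-check: a grader's evidence quote must literally occur in the answer.
function evidenceIsSubstring(evidenceQuote, agentAnswer) {
  return evidenceQuote.length > 0 && agentAnswer.includes(evidenceQuote);
}
```

Because both functions run over the full answer in the calling script, a grader that quotes a clean sentence while a copied one sits elsewhere cannot change the result.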
|
|
1322
|
+
|
|
1323
|
+
#### Anti-pattern #8 — Softened negative
|
|
1324
|
+
|
|
1325
|
+
**`methodical` description.** "A failure presented as 'could be improved' or 'worth reviewing'. Root cause: RLHF negative feedback on blunt outputs. Detection signal: hedge words on findings that evidence shows are failures." ([`methodical` § 3, line 214](/Users/jacobbalslev/Development/skills/methodical/SKILL.md))
|
|
1326
|
+
|
|
1327
|
+
**Comprehension-grading manifestation.** The grader uses hedge words on a clear failure — `rationale: "this could be strengthened by"` instead of `rationale: "this behavior failed because the response does not contain X"`. The verdict ends up `PASS` because the soft language reframed the failure as an improvement opportunity. The 26.42% framing-bias base rate ([Aldigheri et al., 2025](https://arxiv.org/abs/2507.03194)) is the empirical source of this failure.
|
|
1328
|
+
|
|
1329
|
+
**Concrete example.** For C9 (negative-boundary refusal) on the worked example's case 9 (DB schema design in type-safety's voice): the agent's answer includes one partial recommendation in type-safety's voice ("use `string` for the file path") before naming `data-modeling` as the owner. The behavior `no_partial_comply` fails — partial compliance occurred. A softened-negative grader writes: `verdict: PASS, rationale: "the agent could be more explicit about handing off completely"`. The hedge word "could be more explicit" reframes a clear failure (partial-comply happened) as an improvement opportunity.
|
|
1330
|
+
|
|
1331
|
+
**Structural countermeasure.** The verdict schema constrains `verdict` to the literal strings `"PASS"` or `"FAIL"` — no third value, no probabilistic intermediate. The `rationale` field is constrained to "one sentence" per Appendix A § A.1. If the grader's prompt receives RULE-6 from §7.10 ("Use of hedge words on a behavior with `verdict: FAIL` is itself a grader malfunction"), the grader self-monitors for the pattern. A second-layer countermeasure: a deterministic post-check in the calling script that flags any `rationale` containing the substring list `["could be", "would benefit", "consider", "perhaps", "might be"]` paired with `verdict: PASS` — these combinations indicate the grader probably softened a fail and the verdict deserves human review.
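
The second-layer post-check might look like the following. The substring list is the one given above; the function name and the verdict-object shape are assumptions, and the check is case-insensitive by choice of this sketch.

```javascript
// Substring list taken from the text above.
const HEDGE_SUBSTRINGS = ["could be", "would benefit", "consider", "perhaps", "might be"];

// Returns the PASS verdicts whose rationale contains a hedge phrase.
// These are candidates for human review, not automatic failures.
function flagSoftenedPasses(behaviorVerdicts) {
  return behaviorVerdicts.filter((b) => {
    const rationale = (b.rationale || "").toLowerCase();
    return b.verdict === "PASS" && HEDGE_SUBSTRINGS.some((h) => rationale.includes(h));
  });
}
```

Applied to the case 9 example, the rationale "the agent could be more explicit about handing off completely" paired with `verdict: PASS` is flagged because it contains "could be".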
|
|
1332
|
+
|
|
1333
|
+
#### Why this matters at the rubric layer
|
|
1334
|
+
|
|
1335
|
+
The four anti-patterns above are not separate failure modes from the LLM-judge biases of §4.2.6. They are the same failure modes, named from inside the agent's reasoning rather than from the outside as a measured rate. `methodical` provides the inside-view vocabulary; §4.7 provides the outside-view base rates; §8.1 provides the position/verbosity/self-enhancement axis. All three converge on the same conclusion: **the comprehension grader cannot be trusted to behave methodically without structural forcing functions.** §5 specifies what is measured; §7.10's R9 specifies the prompt-side enforcement; §8.9 (this section) maps the residual risk to the prompt. The Appendix A verdict schema is the floor under all three.
|
|
1336
|
+
|
|
1337
|
+
---
|
|
1338
|
+
|
|
1339
|
+
## 9. Open questions
|
|
1340
|
+
|
|
1341
|
+
These decisions depend on maintainer judgment and cannot be resolved by research. Each is phrased as a binary or small-enum decision so adoption can proceed without resolving every question.
|
|
1342
|
+
|
|
1343
|
+
### 9.1 Q1 — Does the comprehension rubric add an 8th eval-dimension value, or fit within the existing seven-value `dimension` enum?
|
|
1344
|
+
|
|
1345
|
+
Today's eval files use `dimension: definition | mental_model | purpose | boundary | application | rule_conflict | anti_pattern`. The proposed `comprehension_dimension` is an additive key. The question: should the existing `dimension` enum be **deprecated** in favor of the new richer `comprehension_dimension`, or should both keys coexist with documented overlap?
|
|
1346
|
+
|
|
1347
|
+
Recommendation, not decision: keep both. The existing `dimension` measures **what kind of question is being asked**; `comprehension_dimension` measures **what rubric dimension the case grades against**. A case with `dimension: boundary, comprehension_dimension: C4` is a boundary-shape question graded against C4 — same idea. A case with `dimension: application, comprehension_dimension: C8` is an application-shape question graded against C8 — application is the question shape; C8 is the grader rubric. The two encode complementary information.
|
|
1348
|
+
|
|
1349
|
+
But if the maintainer prefers to collapse the two keys into one, the migration is mechanical: each existing `dimension` value maps to a comprehension dimension if applicable, or stays as an orthogonal "question shape" enum, or is dropped.
|
|
1350
|
+
|
|
1351
|
+
### 9.2 Q2 — Does `concept.misconception` warrant its own grader weight (not just "complements `boundary`")?
|
|
1352
|
+
|
|
1353
|
+
The schema says `misconception` is "not directly graded; complements `boundary`" ([`schemas/skill.v4.schema.json:208`](/Users/jacobbalslev/Development/skill-graph/schemas/skill.v4.schema.json)). The rubric proposes C7 (misconception inoculation) as a distinct dimension. Either:
|
|
1354
|
+
|
|
1355
|
+
- (a) The schema description is wrong; `misconception` does deserve its own weight. Update the schema description to give it a weight (recommend 1.0 for parity with `definition` and `purpose`).
|
|
1356
|
+
- (b) The schema is right; C7 should be merged into C4 (boundary). Adjust the rubric to a single combined C4 dimension.
|
|
1357
|
+
|
|
1358
|
+
Recommendation: (a). The TruthfulQA-shape test for misconception inoculation is a distinct measurement from boundary discrimination — a model can correctly route to a sibling skill (C4 pass) while still validating the misconception phrasing along the way (C7 fail). The two are independent.
|
|
1359
|
+
|
|
1360
|
+
### 9.3 Q3 — Should the comprehension rubric apply to `workflow` skills, or `capability` only?
|
|
1361
|
+
|
|
1362
|
+
The schema permits `comprehension_state: present` on any archetype. The rubric's concept-block dimensions (C1–C7) are natural for `capability` skills (knowledge teaching). The cross-cutting dimensions (C8 Verification, C9 Do-NOT) apply to `workflow` skills too. The question: should the rubric have a workflow-specific variant, or should `workflow` skills decline the comprehension rubric and use only the existing audit-loop eval dimension?
|
|
1363
|
+
|
|
1364
|
+
Recommendation, not decision: workflow skills should declare `comprehension_state: absent` by default. If a workflow skill teaches a concept (e.g., `code-review` teaches the discipline of code review beyond the workflow), it can opt in. The author judges per skill.
|
|
1365
|
+
|
|
1366
|
+
### 9.4 Q4 — Binary criteria vs Likert per behavior?
|
|
1367
|
+
|
|
1368
|
+
The proposed rubric is binary per `expected_behavior` (pass/fail). The existing audit-prompt-builder uses 1–5 ([`scripts/lib/audit-prompt-builder.js:414`](/Users/jacobbalslev/Development/skill-graph/scripts/lib/audit-prompt-builder.js)). The evidence — [Hagiwara 2023](https://pmc.ncbi.nlm.nih.gov/articles/PMC10498947/) on criterion-referenced reliability; [Hamel Husain on binary labels](https://hamel.dev/blog/posts/evals-faq/) — supports binary at the per-criterion level. The 1–5 score at the dimension level (aggregating across criteria) is still useful for the audit summary.
|
|
1369
|
+
|
|
1370
|
+
Decision needed: should the comprehension grader output binary per behavior + an aggregated 1–5 per dimension, or pure binary throughout? The proposed worked example assumes binary per behavior; the audit aggregation pattern would give a 1–5 derived from binary counts (e.g., 5/5 behaviors pass = score 5; 4/5 = score 4; etc.).
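
If the aggregated variant is chosen, the derivation could be as simple as the sketch below. The text only fixes the 5-behavior case (5/5 gives 5, 4/5 gives 4); scaling other behavior counts proportionally, rounding to the nearest integer, and clamping to a floor of 1 are all assumptions of this sketch, as is the function name.

```javascript
// Derives a 1-5 dimension score from binary behavior results.
// Assumptions: proportional scaling, nearest-integer rounding, floor of 1.
function derivedDimensionScore(passedCount, totalCount) {
  if (totalCount <= 0) throw new Error("totalCount must be positive");
  const scaled = Math.round((5 * passedCount) / totalCount);
  return Math.max(1, scaled);
}
```

The binary per-behavior verdicts stay the source of truth either way; the derived score only summarizes them for the audit report.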
|
|
1371
|
+
|
|
1372
|
+
### 9.5 Q5 — Should `expected_behaviors[]` be a required field on new comprehension cases?
|
|
1373
|
+
|
|
1374
|
+
The worked example uses it on every case. Authoring `expected_behaviors[]` is more work than authoring `expected_reasoning`. The lint check could require it iff `comprehension_dimension` is set.
|
|
1375
|
+
|
|
1376
|
+
Recommendation: yes, required when `comprehension_dimension` is set. The behavior list is what makes the grader run reproducible; the `expected_reasoning` is a human-readable summary. Both serve different consumers.
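
A sketch of the corresponding lint rule, under the recommendation above. The function name and error wording are hypothetical; the rule only fires when `comprehension_dimension` is set, so existing cases without that key are unaffected.

```javascript
// Hypothetical lint rule for a single eval case object.
function lintComprehensionCase(evalCase) {
  const errors = [];
  const dimension = evalCase.comprehension_dimension;
  if (dimension === undefined) return errors; // not a comprehension case

  // A case grades against exactly one rubric dimension.
  if (typeof dimension !== "string" || dimension.length === 0) {
    errors.push("comprehension_dimension must be a single dimension id");
  }
  // The behavior list is what makes the grader run reproducible.
  const behaviors = evalCase.expected_behaviors;
  if (!Array.isArray(behaviors) || behaviors.length === 0) {
    errors.push("expected_behaviors[] is required when comprehension_dimension is set");
  }
  return errors;
}
```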
|
|
1377
|
+
|
|
1378
|
+
### 9.6 Q6 — Does the rubric belong in `SKILL_AUDIT_CHECKLIST.md`, `docs/skill-metadata-protocol.md`, or a new file?
|
|
1379
|
+
|
|
1380
|
+
Options:
|
|
1381
|
+
|
|
1382
|
+
- (a) Extend `SKILL_AUDIT_CHECKLIST.md` with a new 8th section. Pro: discoverable in the same place as the existing seven dimensions. Con: the checklist file is currently audit-shaped, not comprehension-shaped, and adding comprehension dilutes the focus.
|
|
1383
|
+
- (b) Add a new section to `docs/skill-metadata-protocol.md`. Pro: matches the protocol-shape. Con: the protocol doc is normative and the rubric is operational — slight genre mismatch.
|
|
1384
|
+
- (c) Create a new file `docs/comprehension-rubric.md`. Pro: clean separation; the file is the rubric's canonical location. Con: yet another doc to discover.
|
|
1385
|
+
|
|
1386
|
+
Recommendation, not decision: (c) plus a one-paragraph pointer in (a) and (b). The rubric is operationally distinct from the audit and the protocol; a separate file lets it grow without polluting either.
|
|
1387
|
+
|
|
1388
|
+
### 9.7 Q7 — Should the rubric be marketplace-exportable, or kept internal?
|
|
1389
|
+
|
|
1390
|
+
The marketplace export (`scripts/export-marketplace-skills.js`) produces plain `SKILL.md` files. The comprehension rubric is internal to the Skill Graph protocol; it does not export. But the **eval files** could be exported as a parallel resource (`examples/evals/<skill>.json` → `marketplace/evals/<skill>.json`) so downstream consumers run the same comprehension grades.
|
|
1391
|
+
|
|
1392
|
+
Recommendation, not decision: defer until R3 ships. Once the grader runs and produces meaningful verdicts, the question of "do consumers see this" becomes concrete.
|
|
1393
|
+
|
|
1394
|
+
---
|
|
1395
|
+
|
|
1396
|
+
## 10. Completeness claim
|
|
1397
|
+
|
|
1398
|
+
This report examined the following **24 repo files** (23 inside `skill-graph/` plus 1 in the parent `Development/skills/` workspace, added in the 2026-05-16 revision):
|
|
1399
|
+
|
|
1400
|
+
1. [`/Users/jacobbalslev/Development/skill-graph/AGENTS.MD`](/Users/jacobbalslev/Development/skill-graph/AGENTS.MD) — full read; especially §§ Evaluation Discipline, Skill Audit Loop, What the Skill Graph Is
|
|
1401
|
+
2. [`https://github.com/jacob-balslev/skill-metadata-protocol`](https://github.com/jacob-balslev/skill-metadata-protocol) — full read
|
|
1402
|
+
3. [`https://github.com/jacob-balslev/skill-audit-loop/blob/main/SKILL_AUDIT_LOOP.md`](https://github.com/jacob-balslev/skill-audit-loop/blob/main/SKILL_AUDIT_LOOP.md) — full read
|
|
1403
|
+
4. [`https://github.com/jacob-balslev/skill-audit-loop/blob/main/SKILL_AUDIT_CHECKLIST.md`](https://github.com/jacob-balslev/skill-audit-loop/blob/main/SKILL_AUDIT_CHECKLIST.md) — full read
|
|
1404
|
+
5. [`/Users/jacobbalslev/Development/skill-graph/docs/quality-doctrine.md`](/Users/jacobbalslev/Development/skill-graph/docs/quality-doctrine.md) — full read
|
|
1405
|
+
6. [`/Users/jacobbalslev/Development/skill-graph/docs/field-reference.md`](/Users/jacobbalslev/Development/skill-graph/docs/field-reference.md) — read in full (1200 lines); focus on `comprehension_state`, `concept`, `eval_*`, `routing_eval` sections
|
|
1406
|
+
7. [`skills/type-safety/SKILL.md`](https://github.com/jacob-balslev/skills/blob/main/skills/type-safety/SKILL.md) — full read; used as worked example
|
|
1407
|
+
8. [`skills/acid-fundamentals/SKILL.md`](https://github.com/jacob-balslev/skills/blob/main/skills/acid-fundamentals/SKILL.md) — full read; confirmed gold-standard concept block
|
|
1408
|
+
9. [`skills/methodology/SKILL.md`](https://github.com/jacob-balslev/skills/blob/main/skills/methodology/SKILL.md) — full read; note: comprehension_state not declared
|
|
1409
|
+
10. [`skills/skill-scaffold/SKILL.md`](https://github.com/jacob-balslev/skills/blob/main/skills/skill-scaffold/SKILL.md) — full read; authoring contract for evals
|
|
1410
|
+
11. [`/Users/jacobbalslev/Development/skill-graph/scripts/skill-graph-routing-eval.js`](/Users/jacobbalslev/Development/skill-graph/scripts/skill-graph-routing-eval.js) — full read; confirmed routing-eval scope is positive/negative routing, not comprehension
|
|
1411
|
+
12. [`/Users/jacobbalslev/Development/skill-graph/scripts/skill-audit.js`](/Users/jacobbalslev/Development/skill-graph/scripts/skill-audit.js) — partial read (lines 1–350); confirmed seven-dimension audit shape
|
|
1412
|
+
13. [`/Users/jacobbalslev/Development/skill-graph/scripts/lib/audit-prompt-builder.js`](/Users/jacobbalslev/Development/skill-graph/scripts/lib/audit-prompt-builder.js) — partial read (lines 1–500 and 700–772); used as the template for the proposed comprehension-prompt-builder
|
|
1413
|
+
14. [`/Users/jacobbalslev/Development/skill-graph/scripts/skill-lint.js`](/Users/jacobbalslev/Development/skill-graph/scripts/skill-lint.js) — partial read (lines 414–620); confirmed `checkEvalCoherence` and `checkEvalTruthSourceRanges`
|
|
1414
|
+
15. [`/Users/jacobbalslev/Development/skill-graph/schemas/skill.v4.schema.json`](/Users/jacobbalslev/Development/skill-graph/schemas/skill.v4.schema.json) — read concept block section (lines 165–211); confirmed per-field grader weights
|
|
1415
|
+
16. `examples/evals/comprehension.json` — full read at research time; file has since been removed from the repo (the `documentation` skill's eval was renamed; comprehension cases now live in `examples/evals/type-safety.json`)
|
|
1416
|
+
17. [`/Users/jacobbalslev/Development/skill-graph/examples/evals/code-review.json`](/Users/jacobbalslev/Development/skill-graph/examples/evals/code-review.json) — full read
|
|
1417
|
+
18. [`/Users/jacobbalslev/Development/skill-graph/examples/evals/data-modeling.json`](/Users/jacobbalslev/Development/skill-graph/examples/evals/data-modeling.json) — full read
|
|
1418
|
+
19. [`/Users/jacobbalslev/Development/skill-graph/examples/evals/debugging.json`](/Users/jacobbalslev/Development/skill-graph/examples/evals/debugging.json) — full read
|
|
1419
|
+
20. [`/Users/jacobbalslev/Development/skill-graph/examples/evals/webhook-integration.json`](/Users/jacobbalslev/Development/skill-graph/examples/evals/webhook-integration.json) — full read
|
|
1420
|
+
21. [`/Users/jacobbalslev/Development/skill-graph/examples/evals/lint-overlay.json`](/Users/jacobbalslev/Development/skill-graph/examples/evals/lint-overlay.json) — full read
|
|
1421
|
+
22. [`/Users/jacobbalslev/Development/skill-graph/examples/evals/microcopy.json`](/Users/jacobbalslev/Development/skill-graph/examples/evals/microcopy.json) — full read
|
|
1422
|
+
23. [`/Users/jacobbalslev/Development/skill-graph/examples/evals/a11y.json`](/Users/jacobbalslev/Development/skill-graph/examples/evals/a11y.json) — partial read (first 100 lines); confirmed per-concept-field dimension distribution
|
|
1423
|
+
24. [`/Users/jacobbalslev/Development/skills/methodical/SKILL.md`](/Users/jacobbalslev/Development/skills/methodical/SKILL.md) — full read (307 lines); added in the 2026-05-16 revision. This file lives in the parent `Development/skills/` workspace, NOT in `skill-graph/skills/`. The previous research pass conflated it with `skill-graph/skills/methodology/SKILL.md` (item 9 above); the two skills are genuinely distinct: `methodology` covers methodology-method-process frameworks (Cleanroom, PSP/TSP, DMAIC, Gawande), while `methodical` covers the 10 behavioral rules, 4-layer execution architecture, 9 anti-patterns, and RLHF root-cause model that govern complete, honest agent output. The `methodical` skill is the explanatory layer beneath `complete-reporting.md`, `acceptance-criteria-gate.md`, and the rubric's per-behavior scoring discipline; it is integrated into this report at §4.7, §5.13, §7.10 (R9), §8.9, and Appendix A § A.4.
|
|
1424
|
+
|
|
1425
|
+
(Bonus: ran `grep -h '"dimension":' examples/evals/*.json | sort | uniq -c` across all 31 eval files to produce the dimension distribution table in §2.2.)
|
|
1426
|
+
|
|
1427
|
+
This report cited the following **27 external sources** with working links (22 from the original pass plus 5 added in the 2026-05-16 revision to verify the RLHF base-rate statistics integrated from `methodical` § 5 into §4.7, §7.10, §8.9, and Appendix A § A.4):
|
|
1428
|
+
|
|
1429
|
+
1. Anderson & Krathwohl (2001), [*A Revision of Bloom's Taxonomy: An Overview*](https://www.researchgate.net/publication/242400296_A_Revision_of_Bloom's_Taxonomy_An_Overview)
|
|
1430
|
+
2. Anderson & Krathwohl (2001), [*A Taxonomy for Learning, Teaching, and Assessing*](https://www.researchgate.net/publication/235465787_A_Taxonomy_for_Learning_Teaching_and_Assessing_A_Revision_of_Bloom's_Taxonomy_of_Educational_Objectives)
|
|
1431
|
+
3. Krathwohl (2002), [*A Revision of Bloom's Taxonomy: An Overview* (Theory Into Practice)](https://cmapspublic2.ihmc.us/rid=1Q2PTM7HL-26LTFBX-9YN8/Krathwohl%202002.pdf)
|
|
1432
|
+
4. Tang et al. (2025), [*BloomAPR: A Bloom's Taxonomy-based Framework for Assessing the Capabilities of LLM-Powered APR Solutions*](https://arxiv.org/html/2509.25465v1)
|
|
1433
|
+
5. Hagiwara (2023), [*Is It All About the Form? Norm- vs Criterion-Referenced Ratings and Faculty Inter-Rater Reliability*](https://pmc.ncbi.nlm.nih.gov/articles/PMC10498947/)
|
|
1434
|
+
6. Huitt (criterion-referenced vs norm-referenced), [*Educational Psychology Interactive*](https://www.edpsycinteractive.org/topics/measeval/crnmref.html)
|
|
1435
|
+
7. Barnett & Ceci (2002), [*When and where do we apply what we learn? A taxonomy for far transfer*](https://pubmed.ncbi.nlm.nih.gov/12081085/)
|
|
1436
|
+
8. Bondaroupte et al. (2023), [*Unleashing the Potential of the Feynman Technique*](https://ejournal.iainmadura.ac.id/index.php/panyonara/article/download/14936/4202/)
|
|
1437
|
+
9. Hu et al. (2025), [*Learn Like Feynman: Developing and Testing an AI-Driven Feynman Bot*](https://arxiv.org/pdf/2506.09055)
|
|
1438
|
+
10. Liang et al. (2022) HELM paper, [*Holistic Evaluation of Language Models*](https://arxiv.org/abs/2211.09110); HELM project: [crfm.stanford.edu/helm/](https://crfm.stanford.edu/helm/)
|
|
1439
|
+
11. Srivastava et al. (2022), [*Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models*](https://arxiv.org/abs/2206.04615)
|
|
1440
|
+
12. Lin, Hilton, Evans (2021), [*TruthfulQA: Measuring How Models Mimic Human Falsehoods*](https://arxiv.org/abs/2109.07958); GitHub: [github.com/sylinrl/TruthfulQA](https://github.com/sylinrl/TruthfulQA)
|
|
1441
|
+
13. Hendrycks et al. (2020), [*Measuring Massive Multitask Language Understanding (MMLU)*](https://arxiv.org/abs/2009.03300); MMLU-Pro analysis: [intuitionlabs.ai/pdfs/mmlu-pro-explained-the-advanced-ai-benchmark-for-llms.pdf](https://intuitionlabs.ai/pdfs/mmlu-pro-explained-the-advanced-ai-benchmark-for-llms.pdf)
|
|
1442
|
+
14. Gardner et al. (2020), [*Evaluating Models' Local Decision Boundaries via Contrast Sets*](https://aclanthology.org/2020.findings-emnlp.117.pdf)
|
|
1443
|
+
15. Zheng et al. (2023), [*Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena*](https://arxiv.org/abs/2306.05685)
|
|
1444
|
+
16. Li et al. (2024), [*A Survey on LLM-as-a-Judge*](https://arxiv.org/html/2411.15594v6)
|
|
1445
|
+
17. Wang et al. (2024), [*CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges*](https://arxiv.org/html/2410.15393v1)
|
|
1446
|
+
18. Husain, [*Your AI Product Needs Evals*](https://hamel.dev/blog/posts/evals/); [*LLM Evals: Everything You Need to Know (FAQ)*](https://hamel.dev/blog/posts/evals-faq/)
|
|
1447
|
+
19. Yan, [*Task-Specific LLM Evals That Do & Don't Work*](https://eugeneyan.com/writing/evals/); [*Evaluating the Effectiveness of LLM-Evaluators (LLM-as-Judge)*](https://eugeneyan.com/writing/llm-evaluators/)
|
|
1448
|
+
20. Anthropic (2026), [*Equipping agents for the real world with Agent Skills*](https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills); Tessl coverage: [tessl.io/blog/anthropic-brings-evals-to-skill-creator-heres-why-thats-a-big-deal/](https://tessl.io/blog/anthropic-brings-evals-to-skill-creator-heres-why-thats-a-big-deal/)
|
|
1449
|
+
21. OpenAI Developers (2026), [*Testing Agent Skills Systematically with Evals*](https://developers.openai.com/blog/eval-skills); OpenAI Evals framework: [github.com/openai/evals](https://github.com/openai/evals)
|
|
1450
|
+
22. Agent Skills community spec, [*agentskills.io specification*](https://agentskills.io/specification)
|
|
1451
|
+
23. Fanous et al. (2025), [*SycEval: Evaluating LLM Sycophancy*](https://arxiv.org/abs/2502.08177) — primary source for the 58.19% sycophancy rate across frontier models (Gemini-1.5-Pro, ChatGPT-4o, Claude-Sonnet) on AMPS and MedQuad. Cited in §4.7, §8.9, Appendix A § A.4.
|
|
1452
|
+
24. Peters & Chin-Yee (2025), [*Generalization bias in large language model summarization of scientific research*](https://royalsocietypublishing.org/rsos/article/12/4/241776/235656/Generalization-bias-in-large-language-model), Royal Society Open Science vol. 12(4):241776 — primary source for the 26–73% summarization-overgeneralization rate across DeepSeek, ChatGPT-4o, LLaMA 3.3 70B, on 4,900 summaries of *Nature, Science, Lancet, NEJM* abstracts; PMC mirror: [PMC12042776](https://pmc.ncbi.nlm.nih.gov/articles/PMC12042776/). Cited in §4.7, §5.13, Appendix A § A.4.
|
|
1453
|
+
25. Aldigheri et al. (2025), [*Quantifying Cognitive Bias Induction in LLM-Generated Content*](https://arxiv.org/abs/2507.03194), also [ACL Anthology 2025.ijcnlp-long.155](https://aclanthology.org/2025.ijcnlp-long.155/) — primary source for the 26.42% framing-bias rate across five LLM families. Cited in §4.7, §8.9, Appendix A § A.4.
|
|
1454
|
+
26. Towards Data Science (2025), [*Why Your Multi-Agent System is Failing: Escaping the 17x Error Trap of the "Bag of Agents"*](https://towardsdatascience.com/why-your-multi-agent-system-is-failing-escaping-the-17x-error-trap-of-the-bag-of-agents/) — source for the 17.2× error-amplification rate in unstructured multi-agent networks and the ~4.4× rate under centralized coordination. Cited in §4.7, §7.10.
|
|
1455
|
+
27. Liu et al. (2023), [*Lost in the Middle: How Language Models Use Long Contexts*](https://arxiv.org/abs/2307.03172) — corroborating primary source for the attention-dilution / instruction-skipping phenomenon. Used in §4.7 to substantiate the mechanism that `methodical` § Key Sources attributes to a Unite.AI 2025 article (the specific Unite.AI URL was not located in this revision; the underlying mechanism is well-established by the "lost in the middle" literature). Cited in §4.7.
|
|
1456
|
+
|
|
1457
|
+
**Items intentionally excluded:**
|
|
1458
|
+
|
|
1459
|
+
- **Eval files** beyond the 7 read in full plus the partial of `a11y.json`. The remaining 23 of 31 files were not read end-to-end; their dimension distribution was instead captured via grep across all 31, which gave the dimension-frequency data in §2.2. Reading every file end-to-end was not necessary because (a) the dimension-frequency data captured the structural picture, and (b) the 7 files read in full span the diverse archetypes (concept skill, workflow skill, overlay skill, comprehensive skill).
|
|
1460
|
+
- **The full text of the schemas** beyond the `concept` block at lines 165–211 of `schemas/skill.v4.schema.json`. The concept-block shape was load-bearing; the rest of the schema's role in this report is to host that block plus the optional `eval_last_run` shape, both of which were examined.
|
|
1461
|
+
- **The `examples/audits/*` worked-example artifacts**. Confirmed they exist (`a11y`, `debugging`, `documentation`) and that the artifact contract matches `SKILL_AUDIT_CHECKLIST.md`. The audit artifacts are output of the existing audit loop; they don't contain comprehension-shaped data and would not change the analysis.
|
|
1462
|
+
- **BIG-bench paper full text via WebFetch**. The PDF returned binary; the [search-result summary](https://arxiv.org/abs/2206.04615) and the [GitHub README](https://github.com/google/BIG-bench) carried the design-principle content this report needed.
|
|
1463
|
+
- **Closed commercial eval platforms** (Braintrust, LangSmith, Weights & Biases Weave). Per the task brief, these are noted as options but the proposed framework works with the repo's existing file-based eval surface; reading their marketing pages would not add to the framework's design.
|
|
1464
|
+
|
|
1465
|
+
**Correction in the 2026-05-16 revision.** The previous research pass examined only `skill-graph/skills/methodology/SKILL.md` (item 9 above) and treated it as the closest available analogue to the `methodical` discipline referenced in the original task brief. That conflation was incorrect: the canonical `methodical` skill exists at [`/Users/jacobbalslev/Development/skills/methodical/SKILL.md`](/Users/jacobbalslev/Development/skills/methodical/SKILL.md) in the parent `Development/skills/` workspace — a richer 307-line artifact covering the RLHF root-cause model, 10 behavioral rules, 4-layer execution architecture, and 9 anti-patterns — and is genuinely distinct from `methodology` (which covers methodology-method-process frameworks). The 2026-05-16 revision adds it as repo file #24 and integrates it at §4.7, §5.13, §7.10 (R9), §8.9, and Appendix A § A.4. The original report's citation of the AGENTS.md ≥7-scenario eval-minimum rule was already correct (the rule lives at `skill-graph/AGENTS.md:186` and `skill-graph/skills/skill-infrastructure/SKILL.md:273–278`, NOT in `methodical`); that attribution did not require correction.
|
|
1466
|
+
|
|
1467
|
+
**One residual unverified attribution.** [`skills/methodical/SKILL.md` § Key Sources](/Users/jacobbalslev/Development/skills/methodical/SKILL.md) attributes the "instruction skipping: attention dilution with prompt length" finding to "Unite.AI (2025)". A WebSearch on 2026-05-16 did not surface a single direct Unite.AI 2025 article matching that exact claim; multiple Unite.AI guides on LLM behavior exist but the specific 2025 attention-dilution attribution could not be pinned to one URL. The underlying mechanism is independently corroborated by the well-established "lost in the middle" literature (Liu et al. 2023, [arXiv:2307.03172](https://arxiv.org/abs/2307.03172) — added as external source #27), which is cited in §4.7 as the verifiable substantiation. The Unite.AI 2025 attribution is reused in this report only as quoted/paraphrased from `methodical` § Key Sources, with the qualification stated inline at §4.7.
|
|
1468
|
+
|
|
1469
|
+
This report covers all 24 repo files and all 27 external sources cited above.
|
|
1470
|
+
|
|
1471
|
+
---
|
|
1472
|
+
|
|
1473
|
+
## Appendix A — Grader prompt template (proposed)
|
|
1474
|
+
|
|
1475
|
+
This appendix specifies the exact prompt the comprehension grader receives for one case. It is the operationalization of §5 and §7.2; the implementation note in R2 says "mirror `scripts/lib/audit-prompt-builder.js`," and this template is what that means concretely.
|
|
1476
|
+
|
|
1477
|
+
The template uses the same `# IDENTITY / # STEPS / # RULES / # INPUT / # OUTPUT` structure as [`audit-prompt-builder.js` lines 565–613](/Users/jacobbalslev/Development/skill-graph/scripts/lib/audit-prompt-builder.js). The differences are: (a) the dimension comes from the rubric C1–C9, not the audit's seven; (b) the input embeds only the relevant concept-block field plus the case data, not the schema or neighbors; (c) the output is one verdict per `expected_behavior`, not one per checklist bullet.
|
|
1478
|
+
|
|
1479
|
+
### A.1 Prompt sections
|
|
1480
|
+
|
|
1481
|
+
```
|
|
1482
|
+
# IDENTITY
|
|
1483
|
+
|
|
1484
|
+
You are a Skill Graph comprehension grader. You evaluate one case against one
|
|
1485
|
+
rubric dimension at a time. Your verdict is binary per behavior. Default
|
|
1486
|
+
posture: evidence-first. Do not pass a behavior without quoting the agent's
|
|
1487
|
+
exact words; do not fail a behavior without quoting the gap.
|
|
1488
|
+
|
|
1489
|
+
# STEPS
|
|
1490
|
+
|
|
1491
|
+
1. Read the <skill-body> — this is the SKILL.md the agent loaded as context.
|
|
1492
|
+
2. Read the <concept-field name="..."> — this is the canonical answer for
|
|
1493
|
+
this rubric dimension.
|
|
1494
|
+
3. Read the <case> — the prompt the agent was given and the agent's answer.
|
|
1495
|
+
4. For each <expected-behavior>, decide PASS or FAIL based on the agent's
|
|
1496
|
+
answer. Quote the exact span of the agent's answer that is your evidence.
|
|
1497
|
+
5. Aggregate: the dimension PASSes iff every behavior PASSes.
|
|
1498
|
+
|
|
1499
|
+
# RULES
|
|
1500
|
+
|
|
1501
|
+
- Every behavior verdict MUST cite an exact substring of the agent's answer
|
|
1502
|
+
as evidence.
|
|
1503
|
+
- A negative-kind behavior (kind: "negative") PASSes when the unwanted thing
|
|
1504
|
+
is ABSENT from the agent's answer; quote the closest spot where the
|
|
1505
|
+
unwanted thing might have appeared but did not.
|
|
1506
|
+
- A positive-kind behavior (kind: "positive") PASSes when the required thing
|
|
1507
|
+
IS present; quote the span where it appears.
|
|
1508
|
+
- For the verbatim-overlap check: the agent's answer fails if any 6-gram
|
|
1509
|
+
substring (6 consecutive non-stopword tokens) is also present in the
|
|
1510
|
+
embedded <skill-body> or <concept-field>.
|
|
1511
|
+
- Do not invent failure modes outside the expected_behaviors list.
|
|
1512
|
+
- The output is a single <verdict>...</verdict> JSON block. No prose outside.
|
|
1513
|
+
|
|
1514
|
+
# INPUT
|
|
1515
|
+
|
|
1516
|
+
<case-id>{id}</case-id>
|
|
1517
|
+
<dimension>{C1|C2|...|C9}</dimension>
|
|
1518
|
+
<concept-field name="{definition|mental_model|...}">
|
|
1519
|
+
{the field's content, verbatim from the skill frontmatter}
|
|
1520
|
+
</concept-field>
|
|
1521
|
+
|
|
1522
|
+
<skill-body>
|
|
1523
|
+
{the SKILL.md body content, with frontmatter stripped}
|
|
1524
|
+
</skill-body>
|
|
1525
|
+
|
|
1526
|
+
<case>
|
|
1527
|
+
<prompt>{the case's prompt field}</prompt>
|
|
1528
|
+
<agent-answer>{the agent's actual answer to the prompt}</agent-answer>
|
|
1529
|
+
<expected-reasoning>{optional expected_reasoning field}</expected-reasoning>
|
|
1530
|
+
<expected-behaviors>
|
|
1531
|
+
<behavior id="..." kind="positive|negative">{description}</behavior>
|
|
1532
|
+
...
|
|
1533
|
+
</expected-behaviors>
|
|
1534
|
+
</case>
|
|
1535
|
+
|
|
1536
|
+
# OUTPUT
|
|
1537
|
+
|
|
1538
|
+
<verdict>
|
|
1539
|
+
{
|
|
1540
|
+
"case_id": <id>,
|
|
1541
|
+
"dimension": "<C1-C9>",
|
|
1542
|
+
"behavior_verdicts": [
|
|
1543
|
+
{
|
|
1544
|
+
"id": "<behavior_id>",
|
|
1545
|
+
"kind": "positive" | "negative",
|
|
1546
|
+
"verdict": "PASS" | "FAIL",
|
|
1547
|
+
"evidence_quote": "<exact substring of the agent's answer>",
|
|
1548
|
+
"rationale": "<one sentence>"
|
|
1549
|
+
}
|
|
1550
|
+
],
|
|
1551
|
+
"dimension_verdict": "PASS" | "FAIL",
|
|
1552
|
+
"verbatim_overlap_check": {
|
|
1553
|
+
"passed": true | false,
|
|
1554
|
+
"overlap_ngrams": [
|
|
1555
|
+
"<any 6-gram or longer overlap with skill body or concept field>"
|
|
1556
|
+
]
|
|
1557
|
+
}
|
|
1558
|
+
}
|
|
1559
|
+
</verdict>
|
|
1560
|
+
```
|
|
1561
|
+
|
|
1562
|
+
### A.2 Why this template shape
|
|
1563
|
+
|
|
1564
|
+
Each design decision is anchored:
|
|
1565
|
+
|
|
1566
|
+
- **`<concept-field name="...">` embedded explicitly**, separate from the body. This is so the grader knows which field is the canonical answer for this case's dimension. The audit-prompt-builder embeds the full body and schema; the comprehension grader needs less context but more specificity about which slot the answer should match.
|
|
1567
|
+
- **`<agent-answer>` is a required input.** The audit grader reads the static skill body. The comprehension grader needs the agent's runtime answer — the entire point of the dimension. The script that invokes the grader is responsible for running the agent first and feeding the answer in.
|
|
1568
|
+
- **Behavior-level verdicts, not dimension-level scores.** Per §4.2.6 and §4.3.1 (Husain), binary per criterion is more reliable than Likert per dimension. The dimension verdict is a deterministic aggregate (`all PASS → PASS`).
|
|
1569
|
+
- **`verbatim_overlap_check`** is a structured field, not a freeform observation. Because the verbatim-copy failure is the most common comprehension failure and the easiest to detect mechanically, it gets a dedicated output slot. A grader that reports `passed: false` with no `overlap_ngrams` is malfunctioning.
|
|
1570
|
+
- **Evidence quote required on every behavior.** The audit-prompt-builder already requires this; the comprehension grader needs it more, because the evidence quote is what makes "is the primitive invoked" decidable rather than impressionistic.
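
The structural guarantees listed above can be verified by the calling script without a second LLM call. The following is a minimal sketch, assuming the verdict has already been parsed from the `<verdict>` JSON block; `checkVerdictConsistency` is a hypothetical helper name, not an existing function in the repo.

```javascript
// Hypothetical helper (not part of the repo): cross-checks a parsed grader
// verdict against the agent's answer using the A.2 design decisions.
function checkVerdictConsistency(verdict, agentAnswer) {
  const problems = [];

  for (const b of verdict.behavior_verdicts) {
    // Evidence must be a non-empty, exact substring of the agent's answer.
    if (!b.evidence_quote || !agentAnswer.includes(b.evidence_quote)) {
      problems.push(`behavior ${b.id}: evidence_quote is not a substring of the answer`);
    }
  }

  // The dimension verdict is a deterministic aggregate: all PASS -> PASS.
  const allPass = verdict.behavior_verdicts.every((b) => b.verdict === 'PASS');
  const expected = allPass ? 'PASS' : 'FAIL';
  if (verdict.dimension_verdict !== expected) {
    problems.push(`dimension_verdict should be ${expected}`);
  }

  // passed: false with no overlap_ngrams signals a malfunctioning grader.
  const oc = verdict.verbatim_overlap_check;
  if (oc && oc.passed === false && (!oc.overlap_ngrams || oc.overlap_ngrams.length === 0)) {
    problems.push('verbatim_overlap_check failed without listing overlap_ngrams');
  }

  return problems;
}
```

An empty return value means the verdict is internally consistent; it says nothing about whether the grader's judgment was correct.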
|
|
1571
|
+
|
|
1572
|
+
### A.3 Implementing the verbatim-overlap check
|
|
1573
|
+
|
|
1574
|
+
The 6-gram overlap check is the C1 hard-fail criterion (§5.2) and is mechanical, not LLM-judged. Either the grader runs it inline, or the calling script runs it before invoking the grader and passes the result in. Recommended: run it in the calling script (Node, deterministic, free) and pass the result to the grader as `<precomputed-overlap-check passed="..."/>`, so the grader reports the result rather than hallucinating one.
|
|
1575
|
+
|
|
1576
|
+
Pseudocode:
|
|
1577
|
+
|
|
1578
|
+
```javascript
|
|
1579
|
+
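// Starting stopword list from the discussion below; tune alongside the n-gram size.
const STOPWORDS = new Set([
  'this', 'that', 'they', 'them', 'with', 'from', 'have', 'will', 'would',
  'could', 'should', 'their', 'there', 'where', 'when', 'what', 'which',
  'while', 'about', 'after', 'before', 'between', 'into', 'than', 'then',
]);

// Returns { passed, overlap_ngrams }: passed is false when any n-gram of
// distinctive tokens in the answer also appears in the skill body or concept field.
function checkVerbatimOverlap(agentAnswer, skillBody, conceptField, n = 6) {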
|
|
1580
|
+
const tokenize = (s) => s.toLowerCase()
|
|
1581
|
+
.replace(/[^a-z0-9\s]/g, ' ')
|
|
1582
|
+
.split(/\s+/)
|
|
1583
|
+
.filter(t => t.length > 3 && !STOPWORDS.has(t));
|
|
1584
|
+
|
|
1585
|
+
const ngrams = (tokens, n) => {
|
|
1586
|
+
const out = new Set();
|
|
1587
|
+
for (let i = 0; i <= tokens.length - n; i++) {
|
|
1588
|
+
out.add(tokens.slice(i, i + n).join(' '));
|
|
1589
|
+
}
|
|
1590
|
+
return out;
|
|
1591
|
+
};
|
|
1592
|
+
|
|
1593
|
+
const agentNgrams = ngrams(tokenize(agentAnswer), n);
|
|
1594
|
+
const bodyNgrams = ngrams(tokenize(skillBody + ' ' + conceptField), n);
|
|
1595
|
+
|
|
1596
|
+
const overlap = [...agentNgrams].filter(g => bodyNgrams.has(g));
|
|
1597
|
+
return { passed: overlap.length === 0, overlap_ngrams: overlap };
|
|
1598
|
+
}
|
|
1599
|
+
```
|
|
1600
|
+
|
|
1601
|
+
The 6-gram threshold and the stopword list are tunable. A starting stopword list: "this", "that", "they", "them", "with", "from", "have", "will", "would", "could", "should", "their", "there", "where", "when", "what", "which", "while", "about", "after", "before", "between", "into", "than", "then". This is a standard set of English function words; the tokenizer already drops tokens of three characters or fewer, so shorter function words need no entry. The 6 in "6-gram" follows plagiarism-detection practice, where roughly six consecutive distinctive tokens marks the threshold: below it, natural-language overlap is plausibly coincidental; above it, plausibly copied.
|
|
1602
|
+
|
|
1603
|
+
The threshold is a calibration knob: if calibration on the worked-example cases shows too many false positives (paraphrastic answers flagged as verbatim), bump to 7-gram. If too many false negatives (clearly-copied answers passing), drop to 5-gram. The skill's own quality doctrine accepts that compression "preserves meaning, names, boundaries, examples, evidence" — meaning short structural overlaps like "type system" or "runtime boundary" should not trigger the check, which is why the n-gram size is large.
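
The calibration loop described above can be sketched as a sweep over candidate n-gram sizes. This is an illustration only: `hasOverlap` is a compact stand-in for the overlap check (stopword filtering omitted for brevity), and the case shape `{ answer, source, copied }`, where `copied` is a human label, is an assumption rather than an existing eval-file field.

```javascript
// Compact stand-in for the overlap check: true when the answer shares at
// least one n-gram of distinctive tokens with the source text.
function hasOverlap(answer, source, n) {
  const tokens = (s) => s.toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter((t) => t.length > 3);
  const grams = (toks) => {
    const out = new Set();
    for (let i = 0; i <= toks.length - n; i++) {
      out.add(toks.slice(i, i + n).join(' '));
    }
    return out;
  };
  const sourceGrams = grams(tokens(source));
  return [...grams(tokens(answer))].some((g) => sourceGrams.has(g));
}

// Hypothetical sweep: counts false positives (paraphrases flagged) and
// false negatives (copies missed) for each candidate n-gram size.
function sweepThreshold(cases, sizes = [5, 6, 7]) {
  return sizes.map((n) => {
    let falsePositives = 0;
    let falseNegatives = 0;
    for (const c of cases) {
      const flagged = hasOverlap(c.answer, c.source, n);
      if (flagged && !c.copied) falsePositives++;
      if (!flagged && c.copied) falseNegatives++;
    }
    return { n, falsePositives, falseNegatives };
  });
}
```

Picking the smallest `n` with an acceptable false-positive count is one reasonable policy; the report itself leaves the choice to calibration on the worked-example cases.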
|
|
1604
|
+
|
|
1605
|
+
### A.4 `# METHODICAL FORCING FUNCTIONS` — the rule block the grader sees
|
|
1606
|
+
|
|
1607
|
+
This subsection specifies the additional prompt block that R9 (§7.10) injects into the Appendix A § A.1 template. It is placed between `# RULES` and `# INPUT` in the rendered prompt so the forcing functions are read immediately before the grader sees the case data. The rules are stated in the *grader's* voice — i.e., they describe what the grader (the model running this prompt) commits to doing as it produces its own verdict.
|
|
1608
|
+
|
|
1609
|
+
```
|
|
1610
|
+
# METHODICAL FORCING FUNCTIONS
|
|
1611
|
+
|
|
1612
|
+
You are subject to four explicit rules that govern HOW you produce the verdict.
|
|
1613
|
+
These rules come from the `methodical` skill at
|
|
1614
|
+
/Users/jacobbalslev/Development/skills/methodical/SKILL.md § 1 (the 10 Rules).
|
|
1615
|
+
They exist because LLM graders, including you, are trained on signals that
|
|
1616
|
+
systematically reward shorter, more positive, more confident outputs over
|
|
1617
|
+
complete and honest ones. The four rules below are structural countermeasures.
|
|
1618
|
+
|
|
1619
|
+
RULE-1 (Complete Before Summarize):
|
|
1620
|
+
You will produce one verdict per `expected_behaviors[].id` BEFORE producing
|
|
1621
|
+
the `dimension_verdict`. The number of entries in `behavior_verdicts[]` in
|
|
1622
|
+
your output MUST equal the number of entries in `expected_behaviors[]` in
|
|
1623
|
+
the input. Never aggregate before enumerating. Never skip a behavior
|
|
1624
|
+
because it seems redundant, low-priority, or hard to evaluate.
|
|
1625
|
+
|
|
1626
|
+
RULE-3 (Separate Generation from Criticism):
|
|
1627
|
+
After you draft the verdict and before you emit it, run a self-critique
|
|
1628
|
+
pass internally. Ask: "Which behaviors did I mark PASS where the
|
|
1629
|
+
evidence_quote does not actually demonstrate the criterion? Which behaviors
|
|
1630
|
+
did I mark with hedged rationale where the response evidence clearly
|
|
1631
|
+
fails?" Revise any verdict where the self-critique surfaces a problem.
|
|
1632
|
+
|
|
1633
|
+
RULE-6 (Negative Findings Are Primary Data):
|
|
1634
|
+
A FAIL verdict is more valuable than a PASS verdict. Do NOT use hedge words
|
|
1635
|
+
("could be improved", "worth reviewing", "could be strengthened", "might
|
|
1636
|
+
be", "perhaps") on a behavior with `verdict: FAIL`. State the failure
|
|
1637
|
+
directly and quote the exact gap in the response. Use of hedge words on a
|
|
1638
|
+
FAIL is itself a grader malfunction that downstream tooling will detect.
|
|
1639
|
+
|
|
1640
|
+
RULE-9 (State the Completeness Claim Explicitly):
|
|
1641
|
+
Your verdict object must enumerate every `expected_behavior.id` from the
|
|
1642
|
+
input. If you could not decide on a behavior, the verdict for that behavior
|
|
1643
|
+
is `FAIL` with `rationale` naming the indecision (e.g., "could not evaluate
|
|
1644
|
+
because the response did not address this criterion") — NEVER silently
|
|
1645
|
+
omitted. The completeness claim is implicit in the schema: every input id
|
|
1646
|
+
appears in `behavior_verdicts[]`.
|
|
1647
|
+
|
|
1648
|
+
Compliance is mechanically checkable:
|
|
1649
|
+
- RULE-1: count(behavior_verdicts) == count(expected_behaviors)
|
|
1650
|
+
- RULE-2 (evidence receipt; enforced by the verdict schema, not restated above): every behavior_verdict has a non-empty evidence_quote
|
|
1651
|
+
- RULE-6: no behavior_verdict has verdict=FAIL with hedge-word rationale
|
|
1652
|
+
- RULE-9: set(behavior_verdict.id) == set(expected_behavior.id)
|
|
1653
|
+
|
|
1654
|
+
The calling script will run these checks on your output. A verdict that
|
|
1655
|
+
violates any rule is a malformed output and will trigger a re-run, not a
|
|
1656
|
+
silent pass.
|
|
1657
|
+
```
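
The "mechanically checkable" list in the block above maps directly onto a post-check in the calling script. The sketch below is one possible shape, assuming a parsed verdict object and the case's `expected_behaviors[]` array; `checkForcingFunctions` and `HEDGE_PHRASES` are hypothetical names, and the hedge list contains only the phrases quoted in RULE-6. RULE-3 is an internal self-critique pass and has no mechanical check.

```javascript
// Hedge phrases quoted in RULE-6; a real deployment would likely extend this.
const HEDGE_PHRASES = [
  'could be improved', 'worth reviewing', 'could be strengthened',
  'might be', 'perhaps',
];

// Hypothetical post-check: returns the list of violated rule names.
function checkForcingFunctions(verdict, expectedBehaviors) {
  const violations = [];
  const verdicts = verdict.behavior_verdicts || [];

  // RULE-1: one verdict per expected behavior.
  if (verdicts.length !== expectedBehaviors.length) violations.push('RULE-1');

  // RULE-2: every verdict carries a non-empty evidence quote.
  const missingEvidence = verdicts.some(
    (b) => typeof b.evidence_quote !== 'string' || b.evidence_quote.trim() === ''
  );
  if (missingEvidence) violations.push('RULE-2');

  // RULE-6: no FAIL verdict with a hedged rationale.
  const hedgedFail = verdicts.some(
    (b) => b.verdict === 'FAIL' &&
      HEDGE_PHRASES.some((h) => (b.rationale || '').toLowerCase().includes(h))
  );
  if (hedgedFail) violations.push('RULE-6');

  // RULE-9: the set of verdict ids equals the set of expected ids.
  const got = new Set(verdicts.map((b) => b.id));
  const want = new Set(expectedBehaviors.map((b) => b.id));
  const sameIds = got.size === want.size && [...want].every((id) => got.has(id));
  if (!sameIds) violations.push('RULE-9');

  return violations;
}
```

A non-empty result would trigger the re-run the prompt block promises, rather than a silent pass.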
|
|
1658
|
+
|
|
1659
|
+
**Why this block is mandatory, not optional.** Per §4.7, an unconstrained comprehension grader inherits three measured base rates: ~58% sycophancy ([SycEval, 2025](https://arxiv.org/abs/2502.08177)), 26–73% summarization overgeneralization ([Peters & Chin-Yee, 2025](https://royalsocietypublishing.org/rsos/article/12/4/241776/235656/Generalization-bias-in-large-language-model)), and 26.42% framing bias ([Aldigheri et al., 2025](https://arxiv.org/abs/2507.03194)). Each of the four rules above targets a specific failure mode: RULE-1 targets summarization overgeneralization, RULE-3 targets the no-self-critique baseline that the [17.2× multi-agent amplification finding (Towards Data Science, 2025)](https://towardsdatascience.com/why-your-multi-agent-system-is-failing-escaping-the-17x-error-trap-of-the-bag-of-agents/) identifies as the dominant cause, RULE-6 targets framing bias, and RULE-9 targets attention-dilution-driven instruction skipping.
|
|
1660
|
+
|
|
1661
|
+
The block is ~30 lines long and adds ~1KB to each grader prompt, so its cost is negligible. It goes immediately after `# RULES` and before `# INPUT` to mitigate the attention-dilution mechanism: rules near the end of a long context (where `# OUTPUT` and the verdict schema also live) get less attention, so the forcing functions are placed before the case data, not after.
|
|
1662
|
+
|
|
1663
|
+
**What the block deliberately omits.** The remaining six `methodical` rules are not in the block:
|
|
1664
|
+
- RULE-2 (Evidence Receipt Per Step) is structurally enforced by the `evidence_quote` field in the verdict schema and the post-check; naming it again in the prompt is redundant.
|
|
1665
|
+
- RULE-4 (Prioritization is reordering, not filtering) does not apply at the grader layer — the grader's task is to score, not to prioritize across cases.
|
|
1666
|
+
- RULE-5 (Observe Before Act) is captured by `# STEPS` ordering ("Read the <skill-body> first").
|
|
1667
|
+
- RULE-7 (Verification is not trust) applies at the audit-loop integration layer (R6, §7.6) where the audit consumes the grader's structured verdict and never trusts its prose summary.
|
|
1668
|
+
- RULE-8 (Scope framing must be challenged) does not apply: the grader's scope is exactly the case's `expected_behaviors[]` and is correct by construction.
|
|
1669
|
+
- RULE-10 (Deliberate pace on high-stakes steps) is implicit in the single-shot per-case grader invocation; the grader does not run a chain of high-stakes steps.
|
|
1670
|
+
|
|
1671
|
+
The four-rule block is the minimum complete set for the comprehension-grading task. Adding more rules dilutes the attention to the four that matter; omitting any of the four reopens a measured failure mode.
|
|
1672
|
+
|
|
1673
|
+
---
|
|
1674
|
+
|
|
1675
|
+
## Appendix B — Observed failure modes in the current eval files
|
|
1676
|
+
|
|
1677
|
+
This appendix catalogs comprehension-shaped failure modes visible in the existing 31 eval files. The goal is to ground the proposed rubric in failures the library actually exhibits, per Husain's "begin with a benevolent dictator reviewing 100+ traces" pattern (§4.3.1). The cases below are illustrative, not exhaustive.
|
|
1678
|
+
|
|
1679
|
+
### B.1 Quote-dependent definition cases
|
|
1680
|
+
|
|
1681
|
+
The clearest pattern: most `dimension: definition` cases (10/177, §2.2) phrase prompts that reward retrieval. Example from [`examples/evals/a11y.json` case 1 (lines 8–17)](/Users/jacobbalslev/Development/skill-graph/examples/evals/a11y.json):
|
|
1682
|
+
|
|
1683
|
+
> "A designer wants a clickable element that triggers an action on the same page without navigating. The current code uses `<a href="#">` with a click handler. According to the a11y skill's Primitive Selection table, what is the correct primitive and why is the current one wrong?"
|
|
1684
|
+
|
|
1685
|
+
A model with the skill loaded will simply locate the Primitive Selection table and quote the relevant row. Nothing in the case prevents verbatim restatement. The agent does not need to internalize "the right primitive is a button"; it only needs to find the line in the body that says so. The prompt's "According to the a11y skill's Primitive Selection table" is a near-verbatim citation pointer.
|
|
1686
|
+
|
|
1687
|
+
This is not a flaw in the case author's intent; it is a missing rubric component. Adding the C1 verbatim-span check (§5.2) gives the case comprehension teeth. The current case becomes a near-transfer C1 case once the rubric demands non-verbatim restatement.
|
|
1688
|
+
|
|
1689
|
+
### B.2 Boundary-as-routing confused with boundary-as-discrimination
|
|
1690
|
+
|
|
1691
|
+
The `dimension: boundary` cases (62/177) are mostly routing handoffs. Example from [`examples/evals/code-review.json` case 3 (lines 31–39)](/Users/jacobbalslev/Development/skill-graph/examples/evals/code-review.json):
|
|
1692
|
+
|
|
1693
|
+
> "Production users are already seeing a known failure and there is no proposed diff yet. Should code-review accept the task?"
|
|
1694
|
+
|
|
1695
|
+
This is a routing decision: code-review should hand off to debugging. It does not test whether the agent **understands the boundary as a discrimination** — i.e., that code-review and debugging operate on different inputs (a proposed change vs an in-flight failure). A model could correctly hand off without understanding the structural difference.
|
|
1696
|
+
|
|
1697
|
+
The proposed C4 rubric distinguishes the two: routing handoff is C9 (in-skill refusal), concept discrimination is C4 (boundary mechanism). The current cases mostly probe C9. Adding C4 cases requires probes that name the discriminating mechanism, not just the handoff target.
|
|
1698
|
+
|
|
1699
|
+
### B.3 The `documentation` comprehension.json is the high-water mark
|
|
1700
|
+
|
|
1701
|
+
The single eval file `examples/evals/comprehension.json` is the closest the library has to a comprehension-graded surface. Two cases (id 11 and id 12, lines 148–181) include `expected_reasoning` fields with multi-sentence reasoning traces. Case 11 probes rule-conflict ("does the same-commit doc-update rule bend for a hotfix?"); case 12 probes Verification-vs-Philosophy tension ("doc passes Verification but fails the Philosophy test — is it acceptable?").
|
|
1702
|
+
|
|
1703
|
+
These cases are sophisticated. They are also **the only two cases in the entire library** that explicitly probe whether the agent reasons about the skill's claims under tension. Their existence is evidence that the comprehension test the rubric proposes is authorable in the existing eval-file shape — `expected_reasoning` already encodes much of what `expected_behaviors[]` would carry. The rubric is partly a formalization of patterns already present in `comprehension.json`.
|
|
1704
|
+
|
|
1705
|
+
The case for keeping `expected_reasoning` and adding `expected_behaviors[]` rather than replacing the former: `expected_reasoning` is human-readable; `expected_behaviors[]` is grader-decidable. Both have consumers. Cases with both fields give the grader structured behaviors AND give a future human reader a prose summary of why the case exists.
|
|
1706
|
+
|
|
1707
|
+
### B.4 Absent dimensions
|
|
1708
|
+
|
|
1709
|
+
Per the table in §2.2 and §3.9, three concept-block fields (`taxonomy`, `analogy`, `misconception`) have **zero** eval cases tagged for them across the entire 31-file library. The two skills with declared `comprehension_state: present` (`type-safety`, `acid-fundamentals`) have zero eval files at all. This is the clearest priority for backfill (R7, §7.7):
|
|
1710
|
+
|
|
1711
|
+
| Skill | `comprehension_state` | Has eval file | Worked example exists |
|
|
1712
|
+
|---|---|---|---|
|
|
1713
|
+
| `type-safety` | `present` | no | yes (§6.2) |
|
|
1714
|
+
| `acid-fundamentals` | `present` | no | no — but analogous to type-safety; the four primitives (A/I/D + C-tension) give the same shape |
|
|
1715
|
+
|
|
1716
|
+
Authoring the comprehension eval for `acid-fundamentals` is the natural next step after the `type-safety` eval lands. The acid-fundamentals skill has an exceptionally rich `concept` block (115 lines of frontmatter content) with explicit primitives (the four properties + the contract) and a rich taxonomy (by property, by isolation level, by durability configuration, by consistency-rule scope, by model). All five C5-style taxonomy probes would be authorable directly from the skill's existing content.
|
|
1717
|
+
|
|
1718
|
+
### B.5 The skills declaring `comprehension_state: absent`
|
|
1719
|
+
|
|
1720
|
+
Most of the 133 skills in the active library do not declare `comprehension_state: present`. Default-absent is honest — the skill author has not authored the concept block and has not authored comprehension evals, so the comprehension rubric should not run against them.
|
|
1721
|
+
|
|
1722
|
+
But there is a discoverability problem: a skill author who would benefit from declaring `present` may not realize the option exists. The recommended documentation (R8, §7.8) closes this: when an author writes a new concept-shape skill, the field reference and skill-scaffold should both flag `comprehension_state` as the toggle for the comprehension rubric.
|
|
1723
|
+
|
|
1724
|
+
A future improvement (out of scope for this report): a lint check that flags skills whose body contains the structural shape of a concept (Coverage, Philosophy, mental-model-like primitives in the body, named misconceptions) but whose frontmatter declares `comprehension_state: absent`. These are candidates for upgrade.
|
|
1725
|
+
|
|
1726
|
+
---
|
|
1727
|
+
|
|
1728
|
+
## Appendix C — Mapping the rubric onto the audit-loop scorecard

The audit-loop ([`SKILL_AUDIT_LOOP.md`](https://github.com/jacob-balslev/skill-audit-loop/blob/main/SKILL_AUDIT_LOOP.md)) scores seven dimensions: Metadata, Activation, Relation, Grounding, Content, Eval, Portability. The proposed comprehension rubric adds C1–C9. The relationship is **complementary, not overlapping**:

| Audit-loop dimension | Comprehension dimension(s) | Relationship |
|---|---|---|
| Metadata validity | — | Disjoint. Audit checks schema conformance; comprehension never reads frontmatter for grading. |
| Activation quality | — | Disjoint. Activation is routing-side; comprehension is loaded-skill-side. |
| Relation quality | partly C4 | The audit checks whether relations point at semantically correct peers (grader-judged content); C4 checks whether the loaded agent recognizes adjacent-skill territory (agent-judged behavior). Different consumers; both surfaces are valuable. |
| Grounding fidelity | — | Disjoint. Grounding is repo-truth-source-side; comprehension is universal-concept-side. |
| Content quality | partly C1, C3, C8 | Audit checks that the body has clear Coverage / Philosophy / Verification; comprehension checks that the agent's behavior reflects them. The audit grades the prose; comprehension grades the prose's effect. |
| Eval quality | all of C1–C9 | This is where the rubric **adds** to the audit-loop. The current audit Eval dimension checks "does an eval file exist, is it ≥7 scenarios, does it have negative expectations." The proposed extension: "do the scenarios cover the 9 comprehension dimensions, do they grade against the concept-block fields and the body's Verification / Do-NOT sections." |
| Portability quality | — | Disjoint. Portability is export-side; comprehension is concept-side. |

The integration recipe (R6, §7.6): when a skill declares `comprehension_state: present`, the audit-loop's `Eval quality` row also reports the per-dimension comprehension verdicts. A skill that PASSes the audit-loop's Eval row (≥7 scenarios, ≥1 negative each) and PASSes all 9 comprehension dimensions is comprehension-verified. A skill that PASSes the audit row but fails 2/9 comprehension dimensions is **PARTIAL** at the eval surface; the existing audit verdict vocabulary (`PASS / PASS WITH FIXES / PARTIAL / FAIL`) is sufficient to express the mix.
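
The recipe can be sketched as a small verdict-combining function. This is an illustration under stated assumptions, not the audit-loop's implementation: the function name, the input shapes, and the returned fields are hypothetical, and the rule that any comprehension failure under a passing audit row yields `PARTIAL` generalizes the single 2/9 example given in the text.

```javascript
// Hypothetical sketch of the R6 integration recipe.
const COMPREHENSION_DIMENSIONS = ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "C9"];

// auditEvalVerdict: "PASS" | "PASS WITH FIXES" | "PARTIAL" | "FAIL"
// comprehension: { C1: "PASS" | "FAIL", ... }, or null when the skill does
// not declare `comprehension_state: present`.
function evalSurfaceVerdict(auditEvalVerdict, comprehension) {
  if (comprehension === null) {
    // No comprehension grading requested: the audit verdict stands alone.
    return { verdict: auditEvalVerdict, comprehensionVerified: false, failed: [] };
  }
  const failed = COMPREHENSION_DIMENSIONS.filter((d) => comprehension[d] !== "PASS");
  if (auditEvalVerdict !== "PASS") {
    // The audit row itself did not pass; report it unchanged.
    return { verdict: auditEvalVerdict, comprehensionVerified: false, failed };
  }
  if (failed.length === 0) {
    return { verdict: "PASS", comprehensionVerified: true, failed };
  }
  // Assumption: any failed dimension downgrades a passing audit row to PARTIAL.
  return { verdict: "PARTIAL", comprehensionVerified: false, failed };
}
```

Returning the failed dimensions alongside the verdict keeps the per-dimension detail available to the scorecard without extending the verdict vocabulary.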

---

## Appendix D — Pre-existing terms the rubric should use carefully

To avoid the v1→v2 enum drift the protocol experienced ([`SKILL_METADATA_PROTOCOL.md`](https://github.com/jacob-balslev/skill-metadata-protocol) § Migration Notes), the rubric should use vocabulary that does not collide with existing protocol terms.

**Terms already in use that the rubric reuses with the same meaning:**

| Term | Existing use | Rubric use | OK to reuse? |
|---|---|---|---|
| `boundary` | `concept.boundary` field; `relations.boundary` edges; `dimension: boundary` in evals | Rubric C4 (concept discrimination) and the body's `## Do NOT Use When` cross-cut | Yes, but the rubric's C4 and C9 distinguish the concept-block use from the body use |
| `definition` | `concept.definition` field; `dimension: definition` in evals | Rubric C1 | Yes |
| `mental_model` | `concept.mental_model` field; `dimension: mental_model` in evals | Rubric C2 | Yes |
| `purpose` | `concept.purpose` field; `dimension: purpose` in evals | Rubric C3 | Yes |
| `taxonomy` | `concept.taxonomy` field | Rubric C5 | Yes |
| `analogy` | `concept.analogy` field | Rubric C6 | Yes |
| `misconception` | `concept.misconception` field | Rubric C7 | Yes |

**New terms the rubric introduces:**

| Term | Defined in | Meaning |
|---|---|---|
| `comprehension_dimension` | eval-case key | Identifies which rubric dimension (C1–C9) the case grades against |
| `concept_field` | eval-case key | Identifies which concept-block field (definition / mental_model / etc.) the case targets; null for C8 and C9 |
| `transfer` | eval-case key | `near` or `far`: whether the case's surface details appear in the body |
| `expected_behaviors[]` | eval-case array | Per-behavior pass/fail criteria with positive/negative kind |
| `verbatim_overlap_check` | grader output field | Result of the 6-gram overlap check |

All new terms are namespaced inside the eval-file structure (not the SKILL.md frontmatter). The protocol contract is not affected.
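
To make the new keys concrete, here is a sketch of one eval case plus a small validator for the constraints stated in the table. Only the key names, the `near` / `far` values, the positive/negative kinds, and the null rule for C8 and C9 come from the table. The `id`, `prompt`, and `behavior` fields, the example values, and `validateCase` itself are hypothetical.

```javascript
// Illustrative eval case. Values and the id / prompt / behavior fields are
// hypothetical; only the four rubric keys are taken from the table.
const exampleCase = {
  id: "type-safety-c2-far",
  comprehension_dimension: "C2",
  concept_field: "mental_model",
  transfer: "far",
  prompt: "Apply the skill to a scenario the body never mentions.",
  expected_behaviors: [
    { kind: "positive", behavior: "Applies the skill's primitives to the new scenario" },
    { kind: "negative", behavior: "Recites the body's wording verbatim" },
  ],
};

// Concept-block fields listed in the reuse table of this appendix.
const CONCEPT_FIELDS = [
  "definition", "mental_model", "purpose", "boundary",
  "taxonomy", "analogy", "misconception",
];

// Return a list of problems; an empty list means the case is well-formed.
function validateCase(c) {
  const errors = [];
  if (!/^C[1-9]$/.test(c.comprehension_dimension)) {
    errors.push("comprehension_dimension must be one of C1..C9");
  }
  const bodyGraded = c.comprehension_dimension === "C8" || c.comprehension_dimension === "C9";
  if (bodyGraded && c.concept_field !== null) {
    errors.push("concept_field must be null for C8 and C9");
  }
  if (!bodyGraded && !CONCEPT_FIELDS.includes(c.concept_field)) {
    errors.push("concept_field must name a concept-block field");
  }
  if (c.transfer !== "near" && c.transfer !== "far") {
    errors.push("transfer must be 'near' or 'far'");
  }
  const behaviors = Array.isArray(c.expected_behaviors) ? c.expected_behaviors : [];
  if (behaviors.length === 0) {
    errors.push("expected_behaviors must be a non-empty array");
  } else if (behaviors.some((b) => b.kind !== "positive" && b.kind !== "negative")) {
    errors.push("each expected behavior needs kind 'positive' or 'negative'");
  }
  return errors;
}
```

Because every key lives inside the eval case, a validator like this can run without touching SKILL.md frontmatter.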

**Terms the rubric explicitly does NOT introduce:**

- A new `dimension` enum value. The existing enum is kept; `comprehension_dimension` is the additive, richer key.
- A new `eval_state` value. The schema's `unverified / passing / monitored` triple is sufficient: a comprehension-graded run that passes is `passing`; one that fails reverts to `unverified`.
- A new top-level frontmatter field. All additions are inside eval files. The schema needs no change.
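
The `eval_state` rule in the second bullet amounts to a two-way mapping. The sketch below assumes a hypothetical per-case result shape (`{ passed: boolean }`) and treats an empty run as not passing. It never produces `monitored`, since the text does not say how that state is reached.

```javascript
// Hypothetical sketch: derive eval_state from one comprehension-graded run.
// Assumption: a run with zero cases does not count as passing.
function evalStateAfterRun(caseResults) {
  const allPassed =
    caseResults.length > 0 && caseResults.every((r) => r.passed === true);
  return allPassed ? "passing" : "unverified";
}
```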

---

## Appendix E — Estimated cost of adoption

For a back-of-envelope cost estimate, assume:

- **Grader CLI cost.** ~10 cents per case using current-generation API graders (Sonnet 4.7 / GPT-5.x / Gemini 3.x), accounting for a prompt size of ~4–6 KB (body + concept field + case) and ~1 KB of output (verdict JSON).
- **Authoring cost.** ~30 minutes per case for an experienced author following the worked example, at ~10 cases per comprehension-graded skill.
- **Backfill scope.** R7 (§7.7) backfills 2 skills today (`type-safety`, `acid-fundamentals`); growth depends on how many additional skills declare `comprehension_state: present` over the next quarter.

Rough total for the proposed adoption:

| Item | Effort | Cost |
|---|---|---|
| R1 — author `type-safety.json` (worked example, already drafted in §6) | ~2 hours | $0 (drafted in this report) |
| R2 — comprehension-prompt-builder | ~1 day | $0 (Node, no deps) |
| R3 — evaluate-skill.js script | ~2 days | $0 implementation; ~$5 per skill comprehension run |
| R4 — far-transfer lint check | ~3 hours | $0 |
| R5 — baseline-comparison harness | ~3 days | $0 implementation; ~$10 per skill (2x grader cost) |
| R6 — audit-loop scorecard extension | ~half day | $0 |
| R7 — backfill `acid-fundamentals` eval | ~5 hours per skill | ~$5 per skill grader run |
| R8 — documentation | ~half day | $0 |

The dominant ongoing cost is **grader CLI calls** at ~$5 per skill per run. For a library of 5 comprehension-graded skills, a quarterly re-run is ~$25. For a library of 50 such skills, ~$250 quarterly. Both are negligible compared with the engineering hours saved by catching the skill drift that the comprehension rubric would surface.
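
The arithmetic behind these figures can be reproduced with a few lines. The sketch takes the ~$5 per-skill run cost as an input, as quoted in the table, rather than deriving it from the ~10 cents per-case estimate. The constant and function names are illustrative.

```javascript
// Back-of-envelope arithmetic using the figures quoted in this appendix.
const COST_PER_SKILL_RUN_USD = 5; // "~$5 per skill per run", taken as given
const MINUTES_PER_CASE = 30;      // authoring estimate
const CASES_PER_SKILL = 10;       // cases per comprehension-graded skill

// Grader cost of re-running every comprehension-graded skill in a quarter.
function quarterlyGraderCostUsd(skillCount, runsPerQuarter = 1) {
  return skillCount * runsPerQuarter * COST_PER_SKILL_RUN_USD;
}

// Authoring effort for one skill's eval file, in hours.
function authoringHoursPerSkill() {
  return (MINUTES_PER_CASE * CASES_PER_SKILL) / 60;
}

// R5 baseline comparison grades each skill twice (2x grader cost).
function baselineComparisonCostUsd(skillCount) {
  return 2 * skillCount * COST_PER_SKILL_RUN_USD;
}

console.log(quarterlyGraderCostUsd(5));    // 25
console.log(quarterlyGraderCostUsd(50));   // 250
console.log(authoringHoursPerSkill());     // 5
console.log(baselineComparisonCostUsd(1)); // 10
```

The 5-hour authoring figure matches the R7 row's "~5 hours per skill" effort estimate.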

---

## Appendix F — Worked-example case index by Bloom level

This appendix indexes the 10 worked-example cases by their Bloom level (per §4.1.1) for cross-referencing.

| Case | Dim | Bloom level | Why |
|---|---|---|---|
| 1 | C1 | Understand | Re-state the definition without copying |
| 2 | C2 | Apply / Analyze | Apply primitives to a novel scenario the body does not enumerate |
| 3 | C3 | Understand / Evaluate | State purpose; evaluate alternative |
| 4 | C4 | Analyze | Decompose a prompt to identify that it is in adjacent territory |
| 5 | C5 | Analyze | Classify a novel instance into the taxonomy |
| 6 | C6 | Apply / Evaluate | Apply analogy to new case; evaluate where it breaks |
| 7 | C7 | Evaluate | Evaluate a misconception's truth and identify the flaw |
| 8 | C8 | Apply | Apply the Verification checklist to a fresh artifact |
| 9 | C9 | Evaluate | Evaluate whether the request is in-scope; refuse if not |
| 10 | C7 | Evaluate | Evaluate a cross-language analogy claim and identify the flaw |

Coverage: Understand (2 cases), Apply (3 cases), Analyze (3 cases), Evaluate (5 cases), with some cases serving multiple levels. Create, the top Bloom level, is not exercised by the rubric; creating new artifacts that embody the skill's discipline is the agent's downstream task and is a different evaluation surface (output quality of the skill-aided work, not comprehension of the skill).
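
The per-level coverage can be recomputed directly from the table by counting each case once for every Bloom level it lists. The data below transcribes the table rows; the variable and function names are illustrative.

```javascript
// Case index transcribed from the table above (case number, dimension, levels).
const CASES = [
  { id: 1, dim: "C1", levels: ["Understand"] },
  { id: 2, dim: "C2", levels: ["Apply", "Analyze"] },
  { id: 3, dim: "C3", levels: ["Understand", "Evaluate"] },
  { id: 4, dim: "C4", levels: ["Analyze"] },
  { id: 5, dim: "C5", levels: ["Analyze"] },
  { id: 6, dim: "C6", levels: ["Apply", "Evaluate"] },
  { id: 7, dim: "C7", levels: ["Evaluate"] },
  { id: 8, dim: "C8", levels: ["Apply"] },
  { id: 9, dim: "C9", levels: ["Evaluate"] },
  { id: 10, dim: "C7", levels: ["Evaluate"] },
];

// Count how many cases exercise each Bloom level; a case with two levels
// contributes to both counts.
function tallyByLevel(cases) {
  const counts = {};
  for (const c of cases) {
    for (const level of c.levels) {
      counts[level] = (counts[level] ?? 0) + 1;
    }
  }
  return counts;
}

console.log(tallyByLevel(CASES));
// { Understand: 2, Apply: 3, Analyze: 3, Evaluate: 5 }
```

Three cases (2, 3, and 6) list two levels each, so the counts sum to 13 across the 10 cases.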

The absence of Create-level cases is deliberate. A skill teaches concept primitives; the Create-level test is whether the primitives produce good work, which depends on the task more than on the skill. The comprehension rubric tests whether the primitives were learned; the audit-loop's Content dimension already checks whether the body is good enough to support Create-level work.

---

*End of report.*