kc-beta 0.7.5 → 0.8.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (81)
  1. package/README.md +47 -0
  2. package/package.json +3 -2
  3. package/src/agent/context.js +17 -1
  4. package/src/agent/engine.js +467 -100
  5. package/src/agent/llm-client.js +24 -1
  6. package/src/agent/pipelines/_advance-hints.js +92 -0
  7. package/src/agent/pipelines/_milestone-derive.js +325 -20
  8. package/src/agent/pipelines/skill-authoring.js +49 -3
  9. package/src/agent/tools/agent-tool.js +2 -2
  10. package/src/agent/tools/consult-skill.js +15 -0
  11. package/src/agent/tools/dashboard-render.js +48 -1
  12. package/src/agent/tools/document-parse.js +31 -2
  13. package/src/agent/tools/phase-advance.js +17 -13
  14. package/src/agent/tools/release.js +343 -7
  15. package/src/agent/tools/sandbox-exec.js +65 -8
  16. package/src/agent/tools/worker-llm-call.js +95 -15
  17. package/src/agent/workspace.js +25 -4
  18. package/src/cli/components.js +4 -1
  19. package/src/cli/index.js +125 -8
  20. package/src/config.js +19 -2
  21. package/src/marathon/driver.js +217 -0
  22. package/src/marathon/prompts.js +93 -0
  23. package/template/.env.template +17 -1
  24. package/template/AGENT.md +2 -2
  25. package/template/skills/en/auto-model-selection/SKILL.md +55 -35
  26. package/template/skills/en/bootstrap-workspace/SKILL.md +27 -0
  27. package/template/skills/en/compliance-judgment/SKILL.md +14 -0
  28. package/template/skills/en/confidence-system/SKILL.md +30 -8
  29. package/template/skills/en/corner-case-management/SKILL.md +53 -33
  30. package/template/skills/en/cross-document-verification/SKILL.md +88 -83
  31. package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
  32. package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  33. package/template/skills/en/data-sensibility/SKILL.md +19 -12
  34. package/template/skills/en/document-chunking/SKILL.md +99 -15
  35. package/template/skills/en/entity-extraction/SKILL.md +14 -4
  36. package/template/skills/en/quality-control/SKILL.md +23 -0
  37. package/template/skills/en/rule-extraction/SKILL.md +92 -94
  38. package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
  39. package/template/skills/en/skill-authoring/SKILL.md +85 -2
  40. package/template/skills/en/skill-creator/SKILL.md +25 -3
  41. package/template/skills/en/skill-to-workflow/SKILL.md +73 -1
  42. package/template/skills/en/task-decomposition/SKILL.md +1 -1
  43. package/template/skills/en/tree-processing/SKILL.md +1 -1
  44. package/template/skills/en/version-control/SKILL.md +15 -0
  45. package/template/skills/en/work-decomposition/SKILL.md +52 -32
  46. package/template/skills/phase_skills.yaml +5 -0
  47. package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
  48. package/template/skills/zh/bootstrap-workspace/SKILL.md +27 -0
  49. package/template/skills/zh/compliance-judgment/SKILL.md +51 -37
  50. package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
  51. package/template/skills/zh/confidence-system/SKILL.md +34 -9
  52. package/template/skills/zh/corner-case-management/SKILL.md +71 -104
  53. package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
  54. package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
  55. package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
  56. package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  57. package/template/skills/zh/data-sensibility/SKILL.md +13 -0
  58. package/template/skills/zh/document-chunking/SKILL.md +101 -18
  59. package/template/skills/zh/document-parsing/SKILL.md +65 -65
  60. package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
  61. package/template/skills/zh/entity-extraction/SKILL.md +78 -68
  62. package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
  63. package/template/skills/zh/quality-control/SKILL.md +23 -0
  64. package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
  65. package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
  66. package/template/skills/zh/rule-extraction/SKILL.md +199 -188
  67. package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
  68. package/template/skills/zh/skill-authoring/SKILL.md +136 -58
  69. package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
  70. package/template/skills/zh/skill-creator/SKILL.md +215 -201
  71. package/template/skills/zh/skill-creator/references/schemas.md +60 -60
  72. package/template/skills/zh/skill-to-workflow/SKILL.md +73 -1
  73. package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
  74. package/template/skills/zh/task-decomposition/SKILL.md +1 -1
  75. package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
  76. package/template/skills/zh/tree-processing/SKILL.md +67 -63
  77. package/template/skills/zh/version-control/SKILL.md +15 -0
  78. package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
  79. package/template/skills/zh/work-decomposition/SKILL.md +52 -30
  80. package/template/workflows/common/llm_client.py +168 -0
  81. package/template/workflows/common/utils.py +132 -0
@@ -1,77 +1,81 @@
1
1
  ---
2
2
  name: skill-creator
3
3
  tier: meta
4
- description: Anthropic 官方 skill 脚手架工具——用于迭代/优化已有 skill 或对其运行 evaluation,不是构建 KC per-rule 核查 skill 的首选参考。要写 KC 规则 skill,先 consult `skill-authoring`(规范目录结构 + 粒度规则 + KC 特定的 check.py 入口约定)和 `work-decomposition`(排序与分组决策)。本 skill 适用于:per-rule skill 已经存在、agent 想优化其 description/触发或跑正式 evaluation 时。
4
+ description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
5
5
  ---
6
6
 
7
- # Skill Creator
7
+ # Skill creator(KC 兜底)
8
8
 
9
- A skill for creating new skills and iteratively improving them.
9
+ 这是 Anthropic 上游 `skill-creator` 方法论的中文版,作为编写复杂技能时的深度参考随 KC 一并发布。KC 内部主流的技能编写路径是 `skill-authoring` 技能(核心思想沿袭自本技能)。只有当 `skill-authoring` 的指引对你正在编写的技能复杂度而言不够用时,才直接来查阅 `skill-creator`——例如你想跑正式的评估循环、做带方差分析的基准对比,或对触发用的 description 做优化。关于"先做哪些技能、怎样分组"的决策,请参考 `work-decomposition`。
10
10
 
11
- At a high level, the process of creating a skill goes like this:
11
+ ---
12
+
13
+ 一个用于创建新技能并对其进行迭代改进的技能。
12
14
 
13
- - Decide what you want the skill to do and roughly how it should do it
14
- - Write a draft of the skill
15
- - Create a few test prompts and run claude-with-access-to-the-skill on them
16
- - Help the user evaluate the results both qualitatively and quantitatively
17
- - While the runs happen in the background, draft some quantitative evals if there aren't any (if there are some, you can either use as is or modify if you feel something needs to change about them). Then explain them to the user (or if they already existed, explain the ones that already exist)
18
- - Use the `eval-viewer/generate_review.py` script to show the user the results for them to look at, and also let them look at the quantitative metrics
19
- - Rewrite the skill based on feedback from the user's evaluation of the results (and also if there are any glaring flaws that become apparent from the quantitative benchmarks)
20
- - Repeat until you're satisfied
21
- - Expand the test set and try again at larger scale
15
+ 从高层次看,创建一个技能的流程大致如下:
22
16
 
23
- Your job when using this skill is to figure out where the user is in this process and then jump in and help them progress through these stages. So for instance, maybe they're like "I want to make a skill for X". You can help narrow down what they mean, write a draft, write the test cases, figure out how they want to evaluate, run all the prompts, and repeat.
17
+ - 先确定这个技能要做什么、以及大致如何完成
18
+ - 写出技能的初稿
19
+ - 准备少量测试提示词,让带有该技能的 Claude 在其上运行
20
+ - 协助用户从定性和定量两个角度评估结果
21
+ - 在测试运行还在后台进行时,先起草一些定量评估项(如果暂时还没有的话);如果已经有现成的评估,可以直接沿用,或者根据需要进行修改。然后向用户解释这些评估项(已存在的也要把含义说明清楚)
22
+ - 使用 `eval-viewer/generate_review.py` 脚本把结果呈现给用户查看,让他们也能浏览定量指标
23
+ - 根据用户对结果的评估反馈(以及定量基准里暴露出的明显缺陷)重写技能
24
+ - 反复迭代,直到满意为止
25
+ - 扩充测试集,在更大规模上再跑一遍
24
26
 
25
- On the other hand, maybe they already have a draft of the skill. In this case you can go straight to the eval/iterate part of the loop.
27
+ 使用这个技能时,你的工作是判断用户目前处在上述流程的哪个阶段,然后从那里切入帮他们继续推进。比如,用户可能说"我想做一个 X 的技能"。你就可以帮他们澄清需求、写初稿、写测试用例、确定评估方式、跑完所有提示词、再迭代。
26
28
 
27
- Of course, you should always be flexible and if the user is like "I don't need to run a bunch of evaluations, just vibe with me", you can do that instead.
29
+ 另一种情况是,他们已经有一个技能初稿。这时你可以直接跳到评估/迭代那一段循环。
28
30
 
29
- Then after the skill is done (but again, the order is flexible), you can also run the skill description improver, which we have a whole separate script for, to optimize the triggering of the skill.
31
+ 当然,你应当保持灵活。如果用户说"我不想跑一堆评估,咱们就随意一点感觉感觉",那就这样做也没问题。
30
32
 
31
- Cool? Cool.
33
+ 技能基本完成之后(顺序仍然是灵活的),你还可以运行技能描述优化器——我们为此专门写了一个脚本——来优化技能的触发表现。
32
34
 
33
- ## Communicating with the user
35
+ 可以了吧?可以了。
34
36
 
35
- The skill creator is liable to be used by people across a wide range of familiarity with coding jargon. If you haven't heard (and how could you, it's only very recently that it started), there's a trend now where the power of Claude is inspiring plumbers to open up their terminals, parents and grandparents to google "how to install npm". On the other hand, the bulk of users are probably fairly computer-literate.
37
+ ## 与用户沟通
36
38
 
37
- So please pay attention to context cues to understand how to phrase your communication! In the default case, just to give you some idea:
39
+ skill creator 的使用者可能跨越从对编程行话非常陌生到相当熟悉的整个区间。如果你没听说过的话(也很正常,这个趋势也才刚出现没多久),现在出现了一股潮流:Claude 的能力激发着水管工去打开终端、爸爸妈妈和爷爷奶奶去 google "怎么安装 npm"。不过绝大多数用户应该还是对计算机比较熟悉的。
38
40
 
39
- - "evaluation" and "benchmark" are borderline, but OK
40
- - for "JSON" and "assertion" you want to see serious cues from the user that they know what those things are before using them without explaining them
41
+ 所以请关注上下文线索,据此判断该如何措辞!默认情况下大概可以这样把握尺度:
41
42
 
42
- It's OK to briefly explain terms if you're in doubt, and feel free to clarify terms with a short definition if you're unsure if the user will get it.
43
+ - "evaluation" 和 "benchmark" 处于临界状态,但通常可以直接用
44
+ - 对于 "JSON" 和 "assertion",则要看到用户给出明确信号、表明他们了解这些概念之后,再直接使用、不加解释
45
+
46
+ 如果拿不准,简短地解释一下术语也完全可以;只要担心用户可能不懂,就顺手给一个一句话的定义。
43
47
 
44
48
  ---
45
49
 
46
- ## Creating a skill
50
+ ## 创建一个技能
47
51
 
48
- ### Capture Intent
52
+ ### 捕捉意图
49
53
 
50
- Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history first — the tools used, the sequence of steps, corrections the user made, input/output formats observed. The user may need to fill the gaps, and should confirm before proceeding to the next step.
54
+ 先从理解用户的意图开始。当前对话里可能已经包含了用户想要沉淀为技能的工作流程(例如他们说"把这个变成一个技能")。如果是这种情况,先从对话历史中提取答案——用过哪些工具、步骤的顺序、用户做过哪些修正、观察到的输入/输出格式。剩下的缺口由用户来补齐,并在进入下一步之前请他们确认。
51
55
 
52
- 1. What should this skill enable Claude to do?
53
- 2. When should this skill trigger? (what user phrases/contexts)
54
- 3. What's the expected output format?
55
- 4. Should we set up test cases to verify the skill works? Skills with objectively verifiable outputs (file transforms, data extraction, code generation, fixed workflow steps) benefit from test cases. Skills with subjective outputs (writing style, art) often don't need them. Suggest the appropriate default based on the skill type, but let the user decide.
56
+ 1. 这个技能应该让 Claude 能做什么?
57
+ 2. 这个技能应该在什么时候触发?(什么样的用户话术/语境)
58
+ 3. 期望的输出格式是什么?
59
+ 4. 是否需要建立测试用例来验证技能是否工作正常?输出可以被客观验证的技能(文件转换、数据抽取、代码生成、固定流程步骤)会从测试用例中获益。输出偏主观的技能(写作风格、艺术品)通常不需要。请根据技能类型给出合适的默认建议,但最终让用户来决定。
56
60
 
57
- ### Interview and Research
61
+ ### 访谈与调研
58
62
 
59
- Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies. Wait to write test prompts until you've got this part ironed out.
63
+ 主动询问关于边界情况、输入/输出格式、示例文件、成功标准和依赖项的问题。在把这一部分敲定之前,不要急着写测试提示词。
60
64
 
61
- Check available MCPs - if useful for research (searching docs, finding similar skills, looking up best practices), research in parallel via subagents if available, otherwise inline. Come prepared with context to reduce burden on the user.
65
+ 留意可用的 MCP——如果它们对调研有帮助(搜索文档、查找相似技能、了解最佳实践),并且环境支持子代理就并行调研,否则就在主线程内联进行。带着上下文进入,可以减轻用户的负担。
62
66
 
63
- ### Write the SKILL.md
67
+ ### 编写 SKILL.md
64
68
 
65
- Based on the user interview, fill in these components:
69
+ 基于对用户的访谈结果,填写以下组成部分:
66
70
 
67
- - **name**: Skill identifier
68
- - **description**: When to trigger, what it does. This is the primary triggering mechanism - include both what the skill does AND specific contexts for when to use it. All "when to use" info goes here, not in the body. Note: currently Claude has a tendency to "undertrigger" skills -- to not use them when they'd be useful. To combat this, please make the skill descriptions a little bit "pushy". So for instance, instead of "How to build a simple fast dashboard to display internal Anthropic data.", you might write "How to build a simple fast dashboard to display internal Anthropic data. Make sure to use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display any kind of company data, even if they don't explicitly ask for a 'dashboard.'"
69
- - **compatibility**: Required tools, dependencies (optional, rarely needed)
70
- - **the rest of the skill :)**
71
+ - **name**:技能标识符
72
+ - **description**:何时触发、做什么。这是最主要的触发机制——既要写技能"做什么",又要写"什么场景下使用它"的具体描述。所有"何时使用"的信息都放在这里,不要放在正文里。注意:当前 Claude 容易"触发不足"——明明应该用的技能却不调用。为了对冲这种倾向,请把技能描述写得稍微"催促"一点。比如,与其写"用于构建一个简单快速的仪表盘,以展示 Anthropic 内部数据。",不如写"用于构建一个简单快速的仪表盘,以展示 Anthropic 内部数据。只要用户提到 dashboard、数据可视化、内部指标,或者想要展示任何类型的公司数据,都要使用这个技能,即使他们没有显式说出 'dashboard' 这个词。"
73
+ - **compatibility**:依赖的工具、依赖项(可选,很少需要)
74
+ - **技能的剩余部分 :)**
71
75
 
72
- ### Skill Writing Guide
76
+ ### 技能写作指南
73
77
 
74
- #### Anatomy of a Skill
78
+ #### 技能的解剖结构
75
79
 
76
80
  ```
77
81
  skill-name/
@@ -84,21 +88,21 @@ skill-name/
84
88
  └── assets/ - Files used in output (templates, icons, fonts)
85
89
  ```
86
90
 
87
- #### Progressive Disclosure
91
+ #### 渐进式信息披露
88
92
 
89
- Skills use a three-level loading system:
90
- 1. **Metadata** (name + description) - Always in context (~100 words)
91
- 2. **SKILL.md body** - In context whenever skill triggers (<500 lines ideal)
92
- 3. **Bundled resources** - As needed (unlimited, scripts can execute without loading)
93
+ 技能采用三层加载体系:
94
+ 1. **元数据**(name + description)——始终在上下文里(约 100 词)
95
+ 2. **SKILL.md 正文**——技能被触发时进入上下文(理想情况下少于 500 行)
96
+ 3. **打包资源**——按需加载(无上限;脚本可以执行而无需被加载进上下文)
93
97
 
94
- These word counts are approximate and you can feel free to go longer if needed.
98
+ 这些字数都是大概数字,必要时可以放宽。
95
99
 
96
- **Key patterns:**
97
- - Keep SKILL.md under 500 lines; if you're approaching this limit, add an additional layer of hierarchy along with clear pointers about where the model using the skill should go next to follow up.
98
- - Reference files clearly from SKILL.md with guidance on when to read them
99
- - For large reference files (>300 lines), include a table of contents
100
+ **关键模式:**
101
+ - SKILL.md 控制在 500 行以内;如果接近上限,就再加一层目录层级,并清楚地告诉调用该技能的模型"下一步该去哪里"。
102
+ - SKILL.md 中清晰地引用其他文件,并说明何时去读它们
103
+ - 对于较大的参考文件(>300 行),请加上目录
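The size guidance in the list above can be checked mechanically. The following is a minimal sketch, not part of the package: `audit_skill` and the two constants are hypothetical names, and it assumes reference files live in a `references/` directory next to `SKILL.md`. It only flags long reference files for a manual look; it does not try to detect whether a table of contents is present.

```python
from pathlib import Path

SKILL_MD_MAX_LINES = 500       # soft limit quoted in the guide
REFERENCE_TOC_THRESHOLD = 300  # longer reference files should carry a table of contents


def count_lines(path: Path) -> int:
    """Count the lines of a UTF-8 text file."""
    with path.open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def audit_skill(skill_dir: Path) -> list[str]:
    """Return human-readable warnings for one skill directory (hypothetical helper)."""
    warnings = []
    skill_md = skill_dir / "SKILL.md"
    if skill_md.is_file() and count_lines(skill_md) >= SKILL_MD_MAX_LINES:
        warnings.append(
            f"SKILL.md: at or over {SKILL_MD_MAX_LINES} lines; move detail into reference files"
        )
    references = skill_dir / "references"  # assumed location of reference files
    if references.is_dir():
        for ref in sorted(references.glob("*.md")):
            if count_lines(ref) > REFERENCE_TOC_THRESHOLD:
                warnings.append(
                    f"references/{ref.name}: over {REFERENCE_TOC_THRESHOLD} lines; "
                    "check that it has a table of contents"
                )
    return warnings
```

Calling `audit_skill(Path("my-skill"))` returns an empty list when both limits are respected.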
100
104
 
101
- **Domain organization**: When a skill supports multiple domains/frameworks, organize by variant:
105
+ **按领域组织**:当一个技能要支持多个领域/框架时,按变体组织:
102
106
  ```
103
107
  cloud-deploy/
104
108
  ├── SKILL.md (workflow + selection)
@@ -107,17 +111,17 @@ cloud-deploy/
107
111
  ├── gcp.md
108
112
  └── azure.md
109
113
  ```
110
- Claude reads only the relevant reference file.
114
+ Claude 只会读取相关的那一份参考文件。
111
115
 
112
- #### Principle of Lack of Surprise
116
+ #### 不让用户惊讶的原则
113
117
 
114
- This goes without saying, but skills must not contain malware, exploit code, or any content that could compromise system security. A skill's contents should not surprise the user in their intent if described. Don't go along with requests to create misleading skills or skills designed to facilitate unauthorized access, data exfiltration, or other malicious activities. Things like a "roleplay as an XYZ" are OK though.
118
+ 这点几乎不言自明,但还是要说:技能不得包含恶意代码、漏洞利用代码,或任何可能危及系统安全的内容。技能的实际内容不应让用户对其意图感到意外——如果你向用户描述这个技能,它的真实行为应该与描述一致。不要配合用户写出误导性的技能,或者用于未授权访问、数据外泄等恶意目的的技能。但像"扮演 XYZ 角色"这种是可以的。
115
119
 
116
- #### Writing Patterns
120
+ #### 写作模式
117
121
 
118
- Prefer using the imperative form in instructions.
122
+ 指令尽量使用祈使句。
119
123
 
120
- **Defining output formats** - You can do it like this:
124
+ **定义输出格式**——可以这样写:
121
125
  ```markdown
122
126
  ## Report structure
123
127
  ALWAYS use this exact template:
@@ -127,7 +131,7 @@ ALWAYS use this exact template:
127
131
  ## Recommendations
128
132
  ```
129
133
 
130
- **Examples pattern** - It's useful to include examples. You can format them like this (but if "Input" and "Output" are in the examples you might want to deviate a little):
134
+ **示例模式**——加入一些示例会很有帮助。可以按下面这种格式写(不过如果示例里出现了 "Input" 和 "Output",可以适当变通一下):
131
135
  ```markdown
132
136
  ## Commit message format
133
137
  **Example 1:**
@@ -135,15 +139,15 @@ Input: Added user authentication with JWT tokens
135
139
  Output: feat(auth): implement JWT-based authentication
136
140
  ```
137
141
 
138
- ### Writing Style
142
+ ### 写作风格
139
143
 
140
- Try to explain to the model why things are important in lieu of heavy-handed musty MUSTs. Use theory of mind and try to make the skill general and not super-narrow to specific examples. Start by writing a draft and then look at it with fresh eyes and improve it.
144
+ 尽量向模型解释清楚某件事情为什么重要,而不是堆砌生硬死板的"必须 MUST"。运用心智理论,把技能写得通用一些,而不是死扣具体例子。先写一版初稿,然后过一段时间用全新的眼光重新审视、加以改进。
141
145
 
142
- ### Test Cases
146
+ ### 测试用例
143
147
 
144
- After writing the skill draft, come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user: [you don't have to use this exact language] "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?" Then run them.
148
+ 写完技能初稿后,准备 2-3 条贴近实际的测试提示词——也就是真实用户实际可能说出的话。把它们摆给用户看:[不必照搬这段措辞]"这里有几条我想跑一下的测试用例,看着合适吗?还要再加几条吗?"然后就去跑。
145
149
 
146
- Save test cases to `evals/evals.json`. Don't write assertions yet — just the prompts. You'll draft assertions in the next step while the runs are in progress.
150
+ 把测试用例保存到 `evals/evals.json`。这一步先不写断言(assertion),只写提示词。断言会在下一步、运行测试的过程中起草。
147
151
 
148
152
  ```json
149
153
  {
@@ -159,19 +163,19 @@ Save test cases to `evals/evals.json`. Don't write assertions yet — just the p
159
163
  }
160
164
  ```
161
165
 
162
- See `references/schemas.md` for the full schema (including the `assertions` field, which you'll add later).
166
+ 完整 schema(包括稍后再加入的 `assertions` 字段)见 `references/schemas.md`。
163
167
 
164
- ## Running and evaluating test cases
168
+ ## 运行并评估测试用例
165
169
 
166
- This section is one continuous sequence — don't stop partway through. Do NOT use `/skill-test` or any other testing skill.
170
+ 这一节是一个连续的整体——不要中途停下。不要使用 `/skill-test` 或任何其他测试用的技能。
167
171
 
168
- Put results in `<skill-name>-workspace/` as a sibling to the skill directory. Within the workspace, organize results by iteration (`iteration-1/`, `iteration-2/`, etc.) and within that, each test case gets a directory (`eval-0/`, `eval-1/`, etc.). Don't create all of this upfront — just create directories as you go.
172
+ 把结果放在 `<skill-name>-workspace/` 下,与技能目录平级。在 workspace 内部,按迭代分组(`iteration-1/`、`iteration-2/` 等),每个测试用例再各占一个目录(`eval-0/`、`eval-1/` 等)。不必一开始就把所有目录建好——边做边建即可。
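The "create directories as you go" layout described above could be sketched as follows. `run_output_dir` is a hypothetical helper, and nesting `<config>/outputs/` under each eval directory is an assumption based on the `without_skill/outputs/` and `old_skill/outputs/` paths mentioned later in this file.

```python
from pathlib import Path


def run_output_dir(workspace: Path, iteration: int, eval_name: str, config: str) -> Path:
    """Create (if needed) and return <workspace>/iteration-N/<eval_name>/<config>/outputs."""
    target = workspace / f"iteration-{iteration}" / eval_name / config / "outputs"
    # Only the directories for this one run are created; nothing is laid out upfront.
    target.mkdir(parents=True, exist_ok=True)
    return target
```

For example, `run_output_dir(Path("my-skill-workspace"), 1, "csv-export-basic", "with_skill")` creates just that one branch and leaves later iterations untouched.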
169
173
 
170
- ### Step 1: Spawn all runs (with-skill AND baseline) in the same turn
174
+ ### 第 1 步:在同一轮里启动全部运行(with-skill 和 baseline)
171
175
 
172
- For each test case, spawn two subagents in the same turn — one with the skill, one without. This is important: don't spawn the with-skill runs first and then come back for baselines later. Launch everything at once so it all finishes around the same time.
176
+ 针对每个测试用例,在同一轮里启动两个子代理——一个带技能,一个不带。这一点很重要:不要先把 with-skill 的那一轮全跑完、之后再回来补 baseline。所有任务一次性启动,让它们大约同时完成。
173
177
 
174
- **With-skill run:**
178
+ **带技能的运行:**
175
179
 
176
180
  ```
177
181
  Execute this task:
@@ -182,11 +186,11 @@ Execute this task:
182
186
  - Outputs to save: <what the user cares about — e.g., "the .docx file", "the final CSV">
183
187
  ```
184
188
 
185
- **Baseline run** (same prompt, but the baseline depends on context):
186
- - **Creating a new skill**: no skill at all. Same prompt, no skill path, save to `without_skill/outputs/`.
187
- - **Improving an existing skill**: the old version. Before editing, snapshot the skill (`cp -r <skill-path> <workspace>/skill-snapshot/`), then point the baseline subagent at the snapshot. Save to `old_skill/outputs/`.
189
+ **基线运行**(同样的提示词,但 baseline 的设置取决于情境):
190
+ - **创建新技能时**:完全不带任何技能。同样的提示词,不传 skill path,输出保存到 `without_skill/outputs/`。
191
+ - **改进已有技能时**:用旧版本作为基线。开始改之前先把技能快照一份(`cp -r <skill-path> <workspace>/skill-snapshot/`),然后让基线子代理指向这个快照。输出保存到 `old_skill/outputs/`。
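For the "improving an existing skill" baseline above, the snapshot step (`cp -r <skill-path> <workspace>/skill-snapshot/`) might look like this in Python. `snapshot_skill` is a hypothetical name. Unlike a plain `cp -r`, this sketch refuses to overwrite an existing snapshot, which is a design choice made here so the baseline cannot be silently replaced by an already edited skill.

```python
import shutil
from pathlib import Path


def snapshot_skill(skill_path: Path, workspace: Path) -> Path:
    """Copy the current skill into <workspace>/skill-snapshot before editing it."""
    snapshot = workspace / "skill-snapshot"
    if snapshot.exists():
        # Keep the first snapshot: it is the baseline the old_skill runs point at.
        raise FileExistsError(f"{snapshot} already exists; refusing to overwrite the baseline")
    workspace.mkdir(parents=True, exist_ok=True)
    shutil.copytree(skill_path, snapshot)
    return snapshot
```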
188
192
 
189
- Write an `eval_metadata.json` for each test case (assertions can be empty for now). Give each eval a descriptive name based on what it's testing — not just "eval-0". Use this name for the directory too. If this iteration uses new or modified eval prompts, create these files for each new eval directory — don't assume they carry over from previous iterations.
193
+ 为每个测试用例写一份 `eval_metadata.json`(断言此时可以留空)。给每个 eval 起一个能体现其测试目的的描述性名字——不要只叫 "eval-0"。这个名字也用作目录名。如果本次迭代用了新的或修改过的 eval 提示词,请为每个新的 eval 目录都创建这些文件——不要假设前一轮的元数据会自动延续过来。
190
194
 
191
195
  ```json
192
196
  {
@@ -197,17 +201,17 @@ Write an `eval_metadata.json` for each test case (assertions can be empty for no
197
201
  }
198
202
  ```
199
203
 
200
- ### Step 2: While runs are in progress, draft assertions
204
+ ### 第 2 步:在运行进行中起草断言
201
205
 
202
- Don't just wait for the runs to finish — you can use this time productively. Draft quantitative assertions for each test case and explain them to the user. If assertions already exist in `evals/evals.json`, review them and explain what they check.
206
+ 不要光等运行结束——这段时间可以利用起来。为每个测试用例起草定量断言,并向用户解释这些断言。如果 `evals/evals.json` 里已经有断言,过一遍并向用户说明每条检查的是什么。
203
207
 
204
- Good assertions are objectively verifiable and have descriptive names — they should read clearly in the benchmark viewer so someone glancing at the results immediately understands what each one checks. Subjective skills (writing style, design quality) are better evaluated qualitatively — don't force assertions onto things that need human judgment.
208
+ 好的断言是客观可验证的,并且名字有描述性——在基准查看器里一眼就能看出每条到底在检查什么。主观性的技能(写作风格、设计质感)更适合做定性评估——不要硬把断言套到需要人类判断的事情上。
205
209
 
206
- Update the `eval_metadata.json` files and `evals/evals.json` with the assertions once drafted. Also explain to the user what they'll see in the viewer — both the qualitative outputs and the quantitative benchmark.
210
+ 起草完之后,把断言更新进 `eval_metadata.json` 和 `evals/evals.json`。同时也告诉用户他们会在查看器里看到什么——既包括定性的输出,也包括定量的基准指标。
207
211
 
208
- ### Step 3: As runs complete, capture timing data
212
+ ### 第 3 步:随着任务完成抓取计时数据
209
213
 
210
- When each subagent task completes, you receive a notification containing `total_tokens` and `duration_ms`. Save this data immediately to `timing.json` in the run directory:
214
+ 每个子代理任务结束时,你会收到一条通知,里面包含 `total_tokens` 和 `duration_ms`。请立刻把这些数据保存到对应运行目录下的 `timing.json`:
211
215
 
212
216
  ```json
213
217
  {
@@ -217,24 +221,24 @@ When each subagent task completes, you receive a notification containing `total_
217
221
  }
218
222
  ```
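A minimal sketch of persisting the timing data as soon as a notification arrives. The shape of the notification (a plain dict) and the `save_timing` helper are assumptions for illustration. Only the two fields named in the text are written; any further fields in the elided part of the JSON example above are not reproduced here.

```python
import json
from pathlib import Path


def save_timing(run_dir: Path, notification: dict) -> Path:
    """Write total_tokens and duration_ms from one task notification to timing.json."""
    timing = {
        "total_tokens": notification["total_tokens"],
        "duration_ms": notification["duration_ms"],
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    out_path = run_dir / "timing.json"
    out_path.write_text(json.dumps(timing, indent=2), encoding="utf-8")
    return out_path
```

Calling this once per notification, rather than batching, matches the advice that the data is not persisted anywhere else.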
219
223
 
220
- This is the only opportunity to capture this data — it comes through the task notification and isn't persisted elsewhere. Process each notification as it arrives rather than trying to batch them.
224
+ 这是你抓取这些数据的唯一机会——它们随任务通知一起来,并不会被持久化到别的地方。请收到一条就处理一条,不要等着攒一批再处理。
221
225
 
222
- ### Step 4: Grade, aggregate, and launch the viewer
226
+ ### 第 4 步:评分、汇总、并启动查看器
223
227
 
224
- Once all runs are done:
228
+ 所有运行结束之后:
225
229
 
226
- 1. **Grade each run** — spawn a grader subagent (or grade inline) that reads `agents/grader.md` and evaluates each assertion against the outputs. Save results to `grading.json` in each run directory. The grading.json expectations array must use the fields `text`, `passed`, and `evidence` (not `name`/`met`/`details` or other variants) — the viewer depends on these exact field names. For assertions that can be checked programmatically, write and run a script rather than eyeballing it — scripts are faster, more reliable, and can be reused across iterations.
230
+ 1. **为每个运行打分**——派一个评分子代理(或者直接在主线程里打分),让它读 `agents/grader.md` 并对每条断言进行评估。把结果保存到对应运行目录下的 `grading.json`。grading.json 里的 expectations 数组必须使用字段 `text`、`passed`、`evidence`(不要用 `name`/`met`/`details` 或其他变体)——查看器依赖这几个确切的字段名。对于能够通过程序检查的断言,写一段脚本去跑,而不要靠肉眼比对——脚本更快、更可靠,而且能在多次迭代中重复使用。
227
231
 
228
- 2. **Aggregate into benchmark** — run the aggregation script from the skill-creator directory:
232
+ 2. **汇总成基准**——在 skill-creator 目录下运行汇总脚本:
229
233
  ```bash
230
234
  python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
231
235
  ```
232
- This produces `benchmark.json` and `benchmark.md` with pass_rate, time, and tokens for each configuration, with mean ± stddev and the delta. If generating benchmark.json manually, see `references/schemas.md` for the exact schema the viewer expects.
233
- Put each with_skill version before its baseline counterpart.
236
+ 它会产生 `benchmark.json` 和 `benchmark.md`,里面包含每种配置的通过率、耗时、token 用量,给出均值 ± 标准差以及差值。如果要手动生成 benchmark.json,查看器期望的精确 schema 见 `references/schemas.md`。
237
+ 请把每个 with_skill 版本放在它对应的 baseline 之前。
234
238
 
235
- 3. **Do an analyst pass** — read the benchmark data and surface patterns the aggregate stats might hide. See `agents/analyzer.md` (the "Analyzing Benchmark Results" section) for what to look for — things like assertions that always pass regardless of skill (non-discriminating), high-variance evals (possibly flaky), and time/token tradeoffs.
239
+ 3. **过一遍分析师视角**——读一下基准数据,把汇总数据可能掩盖掉的模式挖出来。`agents/analyzer.md`("Analyzing Benchmark Results" 一节)里有要点清单——比如那些不管有没有技能都会通过的断言(无区分度)、方差很高的 eval(可能不稳定)、以及时间和 token 的权衡。
236
240
 
237
- 4. **Launch the viewer** with both qualitative outputs and quantitative data:
241
+ 4. **启动查看器**,同时展示定性输出和定量数据:
238
242
  ```bash
239
243
  nohup python <skill-creator-path>/eval-viewer/generate_review.py \
240
244
  <workspace>/iteration-N \
@@ -243,31 +247,31 @@ Put each with_skill version before its baseline counterpart.
243
247
  > /dev/null 2>&1 &
244
248
  VIEWER_PID=$!
245
249
  ```
246
- For iteration 2+, also pass `--previous-workspace <workspace>/iteration-<N-1>`.
250
+ 到第 2 次及以后的迭代,还要加上 `--previous-workspace <workspace>/iteration-<N-1>`。
247
251
 
248
- **Cowork / headless environments:** If `webbrowser.open()` is not available or the environment has no display, use `--static <output_path>` to write a standalone HTML file instead of starting a server. Feedback will be downloaded as a `feedback.json` file when the user clicks "Submit All Reviews". After download, copy `feedback.json` into the workspace directory for the next iteration to pick up.
252
+ **Cowork / 无界面环境:** 如果 `webbrowser.open()` 不可用,或者环境根本没有显示器,请用 `--static <output_path>` 来生成一份独立的 HTML 文件,而不是启动服务器。用户点击 "Submit All Reviews" 时反馈会被下载为 `feedback.json` 文件。下载之后,把 `feedback.json` 拷进 workspace 目录,下一轮迭代会读取它。
249
253
 
250
- Note: please use generate_review.py to create the viewer; there's no need to write custom HTML.
254
+ 注意:请使用 generate_review.py 来生成查看器;没必要手写 HTML。
251
255
 
252
- 5. **Tell the user** something like: "I've opened the results in your browser. There are two tabs — 'Outputs' lets you click through each test case and leave feedback, 'Benchmark' shows the quantitative comparison. When you're done, come back here and let me know."
256
+ 5. **告诉用户**类似这样的话:"我已经把结果在你的浏览器里打开了。有两个标签页——'Outputs' 让你逐个测试用例查看并留下反馈,'Benchmark' 展示定量比较结果。看完之后回来告诉我一声。"
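Item 1 in the list above recommends scripting any assertion that can be checked programmatically. A hedged sketch of such a check follows, using the three field names the text says the viewer depends on (`text`, `passed`, `evidence`). The helper names and the file-existence assertion are hypothetical, and the top-level object here holds only the `expectations` array; the full schema in `references/schemas.md` may require more.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"text", "passed", "evidence"}


def check_file_exists(outputs_dir: Path, filename: str) -> dict:
    """Grade one assertion: was the expected output file produced?"""
    found = (outputs_dir / filename).is_file()
    return {
        "text": f"Output includes {filename}",
        "passed": found,
        "evidence": f"{filename} {'found' if found else 'missing'} in {outputs_dir.name}/",
    }


def write_grading(run_dir: Path, expectations: list[dict]) -> Path:
    """Write grading.json after verifying every expectation uses the expected field names."""
    for item in expectations:
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            raise ValueError(f"expectation is missing fields: {sorted(missing)}")
    out_path = run_dir / "grading.json"
    out_path.write_text(json.dumps({"expectations": expectations}, indent=2), encoding="utf-8")
    return out_path
```

Because the check is a plain function, the same script can be rerun unchanged on each later iteration.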
253
257
 
254
- ### What the user sees in the viewer
258
+ ### 用户在查看器里看到什么
255
259
 
256
- The "Outputs" tab shows one test case at a time:
257
- - **Prompt**: the task that was given
258
- - **Output**: the files the skill produced, rendered inline where possible
259
- - **Previous Output** (iteration 2+): collapsed section showing last iteration's output
260
- - **Formal Grades** (if grading was run): collapsed section showing assertion pass/fail
261
- - **Feedback**: a textbox that auto-saves as they type
262
- - **Previous Feedback** (iteration 2+): their comments from last time, shown below the textbox
260
+ "Outputs" 标签页每次只展示一个测试用例:
261
+ - **Prompt**:当时给出的任务
262
+ - **Output**:技能生成的文件,能内联渲染的就内联展示
263
+ - **Previous Output**(迭代 2 起):折叠区域,展示上一轮迭代的输出
264
+ - **Formal Grades**(若打了分):折叠区域,展示每条断言的通过/失败
265
+ - **Feedback**:一个文本框,输入时会自动保存
266
+ - **Previous Feedback**(迭代 2 起):上一轮的反馈意见,显示在文本框下方
263
267
 
264
- The "Benchmark" tab shows the stats summary: pass rates, timing, and token usage for each configuration, with per-eval breakdowns and analyst observations.
268
+ "Benchmark" 标签页展示统计汇总:通过率、耗时、token 用量,按配置分组,并附带每个 eval 的明细和分析师的观察笔记。
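To make the "mean ± stddev and delta" figures on the Benchmark tab concrete, here is a small sketch of how such a summary can be computed for one metric. It is not the package's aggregation script: `summarize` and `compare` are hypothetical names, and using the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def summarize(values: list[float]) -> dict:
    """Mean and sample standard deviation of one metric across runs."""
    return {
        "mean": mean(values),
        # stdev needs at least two data points; report 0.0 for a single run.
        "stddev": stdev(values) if len(values) > 1 else 0.0,
    }


def compare(with_skill: list[float], baseline: list[float]) -> dict:
    """Summarize both configurations and report the difference of their means."""
    ws = summarize(with_skill)
    bl = summarize(baseline)
    return {"with_skill": ws, "baseline": bl, "delta": ws["mean"] - bl["mean"]}
```

A positive `delta` on pass rate favors the skill, while for time or tokens a positive `delta` means the skill costs more; a large `stddev` relative to the delta suggests the eval may be flaky.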
265
269
 
266
- Navigation is via prev/next buttons or arrow keys. When done, they click "Submit All Reviews" which saves all feedback to `feedback.json`.
270
+ 导航方式是 prev/next 按钮或键盘方向键。看完后他们点 "Submit All Reviews",所有反馈会保存到 `feedback.json`。
267
271
 
268
- ### Step 5: Read the feedback
272
+ ### 第 5 步:读取反馈
269
273
 
270
- When the user tells you they're done, read `feedback.json`:
274
+ 用户告诉你他们看完了之后,读取 `feedback.json`:
271
275
 
272
276
  ```json
273
277
  {
@@ -280,9 +284,9 @@ When the user tells you they're done, read `feedback.json`:
280
284
  }
281
285
  ```
282
286
 
283
- Empty feedback means the user thought it was fine. Focus your improvements on the test cases where the user had specific complaints.
287
+ 反馈为空意味着用户觉得没问题。请把改进精力集中在那些用户提出了具体意见的测试用例上。
284
288
 
285
- Kill the viewer server when you're done with it:
289
+ 事情做完之后记得把查看器服务进程关掉:
286
290
 
287
291
  ```bash
288
292
  kill $VIEWER_PID 2>/dev/null
@@ -290,54 +294,54 @@ kill $VIEWER_PID 2>/dev/null
290
294
 
291
295
  ---
292
296
 
293
- ## Improving the skill
297
+ ## 改进技能
294
298
 
295
- This is the heart of the loop. You've run the test cases, the user has reviewed the results, and now you need to make the skill better based on their feedback.
299
+ 这一段是整个循环的核心。你跑完了测试用例,用户也评审了结果,接下来要根据他们的反馈把技能做得更好。
296
300
 
297
- ### How to think about improvements
301
+ ### 怎样思考改进
298
302
 
299
- 1. **Generalize from the feedback.** The big picture thing that's happening here is that we're trying to create skills that can be used a million times (maybe literally, maybe even more who knows) across many different prompts. Here you and the user are iterating on only a few examples over and over again because it helps move faster. The user knows these examples in and out and it's quick for them to assess new outputs. But if the skill you and the user are codeveloping works only for those examples, it's useless. Rather than put in fiddly overfitty changes, or oppressively constrictive MUSTs, if there's some stubborn issue, you might try branching out and using different metaphors, or recommending different patterns of working. It's relatively cheap to try and maybe you'll land on something great.
303
+ 1. **从反馈中泛化。** 这里大局上要明白的是:我们想做的是能被使用上百万次(也许字面意义上、甚至更多次,谁知道呢)、横跨各种各样提示词的技能。眼下你和用户只是反复在少数几个例子上迭代,因为这样推进得快。这些例子用户烂熟于心,新输出他们一眼就能评估。但如果你和用户共同开发出来的技能只在这几个例子上管用,那就毫无意义。与其塞进各种过拟合的小补丁、或者堆上一堆压迫性的 MUST,不如在遇到某个顽固问题时换条路子:尝试新的比喻、推荐不一样的工作模式。试一下成本很低,说不定就撞到一个特别好的写法。
300
304
 
301
- 2. **Keep the prompt lean.** Remove things that aren't pulling their weight. Make sure to read the transcripts, not just the final outputs — if it looks like the skill is making the model waste a bunch of time doing things that are unproductive, you can try getting rid of the parts of the skill that are making it do that and seeing what happens.
305
+ 2. **保持提示词精简。** 把那些没在出力的内容删掉。除了看最终输出,请务必读一下完整的 transcript——如果发现技能让模型在无谓的事情上浪费时间,可以试着去掉造成这种浪费的部分,看看效果。
302
306
 
303
- 3. **Explain the why.** Try hard to explain the **why** behind everything you're asking the model to do. Today's LLMs are *smart*. They have good theory of mind and when given a good harness can go beyond rote instructions and really make things happen. Even if the feedback from the user is terse or frustrated, try to actually understand the task and why the user is writing what they wrote, and what they actually wrote, and then transmit this understanding into the instructions. If you find yourself writing ALWAYS or NEVER in all caps, or using super rigid structures, that's a yellow flag — if possible, reframe and explain the reasoning so that the model understands why the thing you're asking for is important. That's a more humane, powerful, and effective approach.
307
+ 3. **解释"为什么"。** 努力把你要求模型做的每件事背后的**原因**讲清楚。今天的 LLM 是*聪明*的。它们有不错的心智理论,在有好用的引导框架时能跳出机械执行、把事情真正做成。即使用户的反馈很简短、甚至带着不耐烦,也要努力理解任务本身,理解用户为什么这样写、他们究竟写了什么,然后把这种理解传递到指令里去。如果你发现自己在用大写 ALWAYS 或 NEVER、或者用上特别死板的结构,那是个黄色警示——尽可能换种说法,把背后的道理讲明白,让模型理解你为什么要求做这件事。这样更人性、也更有力、更有效。
304
308
 
305
- 4. **Look for repeated work across test cases.** Read the transcripts from the test runs and notice if the subagents all independently wrote similar helper scripts or took the same multi-step approach to something. If all 3 test cases resulted in the subagent writing a `create_docx.py` or a `build_chart.py`, that's a strong signal the skill should bundle that script. Write it once, put it in `scripts/`, and tell the skill to use it. This saves every future invocation from reinventing the wheel.
309
+ 4. **留意各个测试用例之间的重复劳动。** 阅读测试运行的 transcript,留意几个子代理是不是都各自独立写了类似的辅助脚本、或者走了同样的多步流程。如果 3 个测试用例里子代理都各自写了一个 `create_docx.py` 或 `build_chart.py`,那就是一个很强的信号:这个技能应该把那段脚本打包进来。写一次,放到 `scripts/`,然后让技能去调用它。这样以后每次调用都不必从零再造一遍。
306
310
 
307
- This task is pretty important (we are trying to create billions a year in economic value here!) and your thinking time is not the blocker; take your time and really mull things over. I'd suggest writing a draft revision and then looking at it anew and making improvements. Really do your best to get into the head of the user and understand what they want and need.
311
+ 这项任务相当重要(我们可是想在这上面每年创造数十亿美元的经济价值!),而思考时间从来不是瓶颈;请慢慢来,反复琢磨。我建议先写一版修订稿,然后用全新的眼光再看一遍,进一步打磨。真的努力站到用户的位置,去理解他们想要什么、需要什么。
308
312
 
309
- ### The iteration loop
313
+ ### 迭代循环
310
314
 
311
- After improving the skill:
315
+ 改完技能之后:
312
316
 
313
- 1. Apply your improvements to the skill
314
- 2. Rerun all test cases into a new `iteration-<N+1>/` directory, including baseline runs. If you're creating a new skill, the baseline is always `without_skill` (no skill) — that stays the same across iterations. If you're improving an existing skill, use your judgment on what makes sense as the baseline: the original version the user came in with, or the previous iteration.
315
- 3. Launch the reviewer with `--previous-workspace` pointing at the previous iteration
316
- 4. Wait for the user to review and tell you they're done
317
- 5. Read the new feedback, improve again, repeat
317
+ 1. 把改进应用到技能上
318
+ 2. 把所有测试用例都重新跑一遍,写到一个新的 `iteration-<N+1>/` 目录里,也包括 baseline 运行。如果你在做的是创建新技能,baseline 一直是 `without_skill`(不带任何技能)——这在所有迭代里保持不变。如果你在改进已有技能,那么用哪种作为 baseline 由你判断:用户最初带进来的原始版本,还是上一轮迭代的版本。
319
+ 3. 启动评审器,并把 `--previous-workspace` 指向前一次迭代
320
+ 4. 等用户评审完、告诉你结束为止
321
+ 5. 读取新的反馈,再改进,再迭代
318
322
 
319
- Keep going until:
320
- - The user says they're happy
321
- - The feedback is all empty (everything looks good)
322
- - You're not making meaningful progress
323
+ 一直循环下去,直到:
324
+ - 用户说他们满意了
325
+ - 所有反馈都为空(全都看着没问题)
326
+ - 你已经不再做出有意义的进展
323
327
 
324
328
  ---
325
329
 
326
- ## Advanced: Blind comparison
330
+ ## 进阶:盲对比
327
331
 
328
- For situations where you want a more rigorous comparison between two versions of a skill (e.g., the user asks "is the new version actually better?"), there's a blind comparison system. Read `agents/comparator.md` and `agents/analyzer.md` for the details. The basic idea is: give two outputs to an independent agent without telling it which is which, and let it judge quality. Then analyze why the winner won.
332
+ 在需要对一个技能的两个版本做更严格比较的情况下(比如用户问"新版本到底是不是真的更好?"),有一套盲对比体系。详情请读 `agents/comparator.md` 和 `agents/analyzer.md`。基本思路是:把两份输出交给一个独立代理、但不告诉它哪份是哪份,让它来评判质量。然后再分析"赢的那份为什么赢"。
329
333
 
330
- This is optional, requires subagents, and most users won't need it. The human review loop is usually sufficient.
334
+ 这个步骤是可选的,需要子代理,多数用户用不到。人工评审循环通常已经够用。
331
335
 
332
336
  ---
333
337
 
334
- ## Description Optimization
338
+ ## 描述优化
335
339
 
336
- The description field in SKILL.md frontmatter is the primary mechanism that determines whether Claude invokes a skill. After creating or improving a skill, offer to optimize the description for better triggering accuracy.
340
+ SKILL.md 前置元信息(frontmatter)里的 description 字段是决定 Claude 是否调用一个技能的主要机制。在创建或改进一个技能之后,可以主动提议优化它的 description,以提升触发的准确性。
337
341
 
338
- ### Step 1: Generate trigger eval queries
342
+ ### 第 1 步:生成触发评估查询
339
343
 
340
- Create 20 eval queries — a mix of should-trigger and should-not-trigger. Save as JSON:
344
+ 写出 20 条 eval 查询——里面既有"应当触发"的,也有"不应当触发"的。保存为 JSON:
341
345
 
342
346
  ```json
343
347
  [
@@ -346,38 +350,38 @@ Create 20 eval queries — a mix of should-trigger and should-not-trigger. Save
346
350
  ]
347
351
  ```
348
352
 
349
- The queries must be realistic and something a Claude Code or Claude.ai user would actually type. Not abstract requests, but requests that are concrete and specific and have a good amount of detail. For instance, file paths, personal context about the user's job or situation, column names and values, company names, URLs. A little bit of backstory. Some might be in lowercase or contain abbreviations or typos or casual speech. Use a mix of different lengths, and focus on edge cases rather than making them clear-cut (the user will get a chance to sign off on them).
353
+ 这些查询必须真实可信,是 Claude Code 或者 Claude.ai 的用户实际会输入的内容。不要抽象的请求,而要具体、有细节的请求。比如带上文件路径、用户工作或处境的私人化背景、列名和取值、公司名、URL,加一点点小故事。其中一些可以是全小写、带缩写、带错字或是口语化的。请使用不同长度,重点放在边界情况上而不是泾渭分明的简单例子(用户后面会有机会确认)。
350
354
 
351
- Bad: `"Format this data"`, `"Extract text from PDF"`, `"Create a chart"`
355
+ 不好的写法:`"Format this data"`、`"Extract text from PDF"`、`"Create a chart"`
352
356
 
353
- Good: `"ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin as a percentage. The revenue is in column C and costs are in column D i think"`
357
+ 好的写法:`"ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin as a percentage. The revenue is in column C and costs are in column D i think"`
354
358
 
355
- For the **should-trigger** queries (8-10), think about coverage. You want different phrasings of the same intent — some formal, some casual. Include cases where the user doesn't explicitly name the skill or file type but clearly needs it. Throw in some uncommon use cases and cases where this skill competes with another but should win.
359
+ 对于 **应当触发** 的查询(8-10 条),要想着覆盖度。同一个意图要有不同的表达方式——有些正式、有些随意。也要包含用户没有明说技能名或文件类型、但显然需要这个技能的情形。再放进几个不常见的用例,以及该技能与其他技能竞争、但应该胜出的情形。
356
360
 
357
- For the **should-not-trigger** queries (8-10), the most valuable ones are the near-misses — queries that share keywords or concepts with the skill but actually need something different. Think adjacent domains, ambiguous phrasing where a naive keyword match would trigger but shouldn't, and cases where the query touches on something the skill does but in a context where another tool is more appropriate.
361
+ 对于 **不应当触发** 的查询(8-10 条),最有价值的是那些"擦肩而过"的——和技能共享一些关键词或概念,但实际上需要的是别的东西。要想到相邻领域、关键词上的歧义(朴素的关键词匹配会误触发但其实不该触发)、以及虽然涉及到这个技能能做的事但更适合用别的工具完成的情形。
358
362
 
359
- The key thing to avoid: don't make should-not-trigger queries obviously irrelevant. "Write a fibonacci function" as a negative test for a PDF skill is too easy — it doesn't test anything. The negative cases should be genuinely tricky.
363
+ 要避免的关键问题:不要把"不应触发"的查询写得明显与技能无关。把"写一个斐波那契函数"作为 PDF 技能的反例就太简单了——什么都测不出来。负例应当真正具有迷惑性。
360
364
 
361
- ### Step 2: Review with user
365
+ ### 第 2 步:与用户一起评审
362
366
 
363
- Present the eval set to the user for review using the HTML template:
367
+ 用 HTML 模板把这套 eval 集合呈给用户评审:
364
368
 
365
- 1. Read the template from `assets/eval_review.html`
366
- 2. Replace the placeholders:
367
- - `__EVAL_DATA_PLACEHOLDER__` → the JSON array of eval items (no quotes around it — it's a JS variable assignment)
368
- - `__SKILL_NAME_PLACEHOLDER__` → the skill's name
369
- - `__SKILL_DESCRIPTION_PLACEHOLDER__` → the skill's current description
370
- 3. Write to a temp file (e.g., `/tmp/eval_review_<skill-name>.html`) and open it: `open /tmp/eval_review_<skill-name>.html`
371
- 4. The user can edit queries, toggle should-trigger, add/remove entries, then click "Export Eval Set"
372
- 5. The file downloads to `~/Downloads/eval_set.json` — check the Downloads folder for the most recent version in case there are multiple (e.g., `eval_set (1).json`)
369
+ 1. 从 `assets/eval_review.html` 读取模板
370
+ 2. 替换占位符:
371
+ - `__EVAL_DATA_PLACEHOLDER__` → eval 项的 JSON 数组(不要加外层引号——它是一句 JS 变量赋值的右值)
372
+ - `__SKILL_NAME_PLACEHOLDER__` → 技能的名字
373
+ - `__SKILL_DESCRIPTION_PLACEHOLDER__` → 技能当前的 description
374
+ 3. 写到一个临时文件(例如 `/tmp/eval_review_<skill-name>.html`)然后打开:`open /tmp/eval_review_<skill-name>.html`
375
+ 4. 用户可以编辑查询、切换 should-trigger、增删条目,然后点击 "Export Eval Set"
376
+ 5. 文件会下载到 `~/Downloads/eval_set.json`——记得在 Downloads 目录里挑最新的那一份,以免出现多版本(如 `eval_set (1).json`)
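
A minimal sketch of the placeholder substitution in step 2 above, assuming the template is plain text that contains the three placeholders verbatim. The function name `render_eval_review` and the item fields `query` and `should_trigger` are hypothetical, chosen only for illustration; the package's own tooling may do this differently.

```python
import json


def render_eval_review(template_html, evals, skill_name, description):
    """Fill the three placeholders in an eval-review HTML template."""
    # The eval data goes in as a raw JSON literal with no surrounding quotes,
    # because the template assigns it directly to a JS variable.
    eval_json = json.dumps(evals, ensure_ascii=False)
    return (
        template_html
        .replace("__EVAL_DATA_PLACEHOLDER__", eval_json)
        .replace("__SKILL_NAME_PLACEHOLDER__", skill_name)
        .replace("__SKILL_DESCRIPTION_PLACEHOLDER__", description)
    )


# Usage with a stand-in template (the real one lives in assets/eval_review.html).
template = "<script>const data = __EVAL_DATA_PLACEHOLDER__;</script>"
page = render_eval_review(
    template,
    [{"query": "add a margin column to my xlsx", "should_trigger": True}],
    "xlsx-helper",
    "Edits spreadsheets.",
)
```

The sketch does no HTML escaping: if a query could contain `</script>`, the JSON would need to be escaped before being embedded.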
373
377
 
374
- This step matters — bad eval queries lead to bad descriptions.
378
+ 这一步很关键——评估查询写得不好,会得到不好的 description。
375
379
 
376
- ### Step 3: Run the optimization loop
380
+ ### 第 3 步:运行优化循环
377
381
 
378
- Tell the user: "This will take some time — I'll run the optimization loop in the background and check on it periodically."
382
+ 告诉用户:"这个会花一点时间——我会在后台跑优化循环,并定期回来看看进展。"
379
383
 
380
- Save the eval set to the workspace, then run in the background:
384
+ 把 eval 集合保存到 workspace,然后在后台运行:
381
385
 
382
386
  ```bash
383
387
  python -m scripts.run_loop \
@@ -388,93 +392,103 @@ python -m scripts.run_loop \
388
392
  --verbose
389
393
  ```
390
394
 
391
- Use the model ID from your system prompt (the one powering the current session) so the triggering test matches what the user actually experiences.
395
+ 请使用系统提示词里那个 model ID(也就是驱动当前会话的那个模型),这样触发测试和用户实际体验保持一致。
392
396
 
393
- While it runs, periodically tail the output to give the user updates on which iteration it's on and what the scores look like.
397
+ 它运行期间,请周期性地 tail 一下输出,向用户汇报当前是第几轮迭代、得分是什么样子。
394
398
 
395
- This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude with extended thinking to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with `best_description` selected by test score rather than train score to avoid overfitting.
399
+ 它会自动完成整个优化循环。它会把 eval 集合按 60% 训练 / 40% 留出测试拆开,先评估当前 description(每条查询跑 3 次以得到稳定的触发率),再调用 Claude 根据失败用例提出改进建议。它会在新 description 上重新跑训练集和测试集,最多迭代 5 轮。结束时,它会在浏览器里打开一份 HTML 报告,展示每一轮的结果,并返回包含 `best_description` 的 JSON——是按测试集得分而不是训练集得分挑出来的,以避免过拟合。
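
The split-and-select logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `scripts/run_loop.py`: the helper names, the fixed seed, the plain shuffle (the real script may split differently, for example per label), and the `train_score` / `test_score` field names are all hypothetical.

```python
import random


def split_evals(evals, train_fraction=0.6, seed=0):
    """Shuffle eval items and split them into (train, held-out test)."""
    items = list(evals)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    cut = round(len(items) * train_fraction)
    return items[:cut], items[cut:]


def trigger_rate(outcomes):
    """Fraction of repeated runs (e.g. 3 per query) in which the skill triggered."""
    return sum(1 for hit in outcomes if hit) / len(outcomes)


def pick_best(history):
    """Select the candidate description by held-out test score, not train score."""
    return max(history, key=lambda h: h["test_score"])
```

Selecting on the held-out score means a description that merely memorizes the training queries does not win over one that generalizes.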
396
400
 
397
- ### How skill triggering works
401
+ ### 技能触发机制的工作原理
398
402
 
399
- Understanding the triggering mechanism helps design better eval queries. Skills appear in Claude's `available_skills` list with their name + description, and Claude decides whether to consult a skill based on that description. The important thing to know is that Claude only consults skills for tasks it can't easily handle on its own — simple, one-step queries like "read this PDF" may not trigger a skill even if the description matches perfectly, because Claude can handle them directly with basic tools. Complex, multi-step, or specialized queries reliably trigger skills when the description matches.
403
+ 理解触发机制有助于设计更好的 eval 查询。技能会以 name + description 的形式出现在 Claude 的 `available_skills` 列表里,Claude 根据这个 description 决定是否查阅一个技能。一个要点是:Claude 只会在它自己很难独立完成的任务上去查阅技能——像"读一下这份 PDF"这种简单的一步式查询可能并不会触发技能,哪怕 description 完美匹配,因为 Claude 用基础工具就能直接处理。复杂的、多步骤的、或者专业性强的查询,在 description 匹配的前提下能够可靠地触发技能。
400
404
 
401
- This means your eval queries should be substantive enough that Claude would actually benefit from consulting a skill. Simple queries like "read file X" are poor test cases — they won't trigger skills regardless of description quality.
405
+ 这意味着你的 eval 查询要足够有分量,让 Claude 真的能从查阅技能中获益。"读一下文件 X"这种简单查询不是好的测试用例——不管 description 写得多好,它们都不会触发技能。
402
406
 
403
- ### Step 4: Apply the result
407
+ ### 第 4 步:应用结果
404
408
 
405
- Take `best_description` from the JSON output and update the skill's SKILL.md frontmatter. Show the user before/after and report the scores.
409
+ 从 JSON 输出里取出 `best_description`,更新技能 SKILL.md 的前置元信息。给用户看一下前后对比,并报告分数。
406
410
 
407
411
  ---
408
412
 
409
- ### Package and Present (only if `present_files` tool is available)
413
+ ### 打包与呈现(仅在有 `present_files` 工具时进行)
410
414
 
411
- Check whether you have access to the `present_files` tool. If you don't, skip this step. If you do, package the skill and present the .skill file to the user:
415
+ 先检查你是否有 `present_files` 工具。如果没有,跳过这一步。如果有,就把技能打包,并把 .skill 文件呈现给用户:
412
416
 
413
417
  ```bash
414
418
  python -m scripts.package_skill <path/to/skill-folder>
415
419
  ```
416
420
 
417
- After packaging, direct the user to the resulting `.skill` file path so they can install it.
421
+ 打包完之后,告诉用户生成的 `.skill` 文件的路径,他们就可以安装它了。
418
422
 
419
423
  ---
420
424
 
421
- ## Claude.ai-specific instructions
425
+ ## Claude.ai 专用说明
426
+
427
+ 在 Claude.ai 上,核心工作流不变(起草 → 测试 → 评审 → 改进 → 重复),但因为 Claude.ai 没有子代理,一些机制需要相应调整:
422
428
 
423
- In Claude.ai, the core workflow is the same (draft → test → review → improve → repeat), but because Claude.ai doesn't have subagents, some mechanics change. Here's what to adapt:
429
+ **运行测试用例**:没有子代理就意味着没有并行执行。对每个测试用例,先读技能的 SKILL.md,然后照着它的指示自己来完成测试提示词的任务。一次跑一个。这样的严谨度不如让独立子代理来跑(你既写了技能,也在执行它,会带着完整的上下文),但作为合理性检查仍然有价值——而且人工评审环节会作为补偿。跳过基线运行——就按照用户要求用技能完成任务即可。
424
430
 
425
- **Running test cases**: No subagents means no parallel execution. For each test case, read the skill's SKILL.md, then follow its instructions to accomplish the test prompt yourself. Do them one at a time. This is less rigorous than independent subagents (you wrote the skill and you're also running it, so you have full context), but it's a useful sanity check — and the human review step compensates. Skip the baseline runs — just use the skill to complete the task as requested.
431
+ **评审结果**:如果你打不开浏览器(比如 Claude.ai 的 VM 没有显示器,或者你在远端服务器上),那就完全跳过浏览器评审器。改为直接在对话中展示结果。对每个测试用例,展示提示词和输出。如果输出是一个用户需要看的文件(比如 .docx 或 .xlsx),先把它存到文件系统里、再告诉他们路径,让他们下载并查看。在对话中直接征求反馈:"看着怎么样?有什么想改的吗?"
426
432
 
427
- **Reviewing results**: If you can't open a browser (e.g., Claude.ai's VM has no display, or you're on a remote server), skip the browser reviewer entirely. Instead, present results directly in the conversation. For each test case, show the prompt and the output. If the output is a file the user needs to see (like a .docx or .xlsx), save it to the filesystem and tell them where it is so they can download and inspect it. Ask for feedback inline: "How does this look? Anything you'd change?"
433
+ **基准化**:跳过定量基准——它依赖的基线比较在没有子代理时本来就没什么意义。把精力放在来自用户的定性反馈上。
428
434
 
429
- **Benchmarking**: Skip the quantitative benchmarking — it relies on baseline comparisons which aren't meaningful without subagents. Focus on qualitative feedback from the user.
435
+ **迭代循环**:和之前一样——改进技能、重跑测试用例、收集反馈——只是中间不走浏览器评审器。如果有文件系统,你仍然可以把结果按迭代目录整理起来。
430
436
 
431
- **The iteration loop**: Same as before — improve the skill, rerun the test cases, ask for feedback — just without the browser reviewer in the middle. You can still organize results into iteration directories on the filesystem if you have one.
437
+ **Description 优化**:这一节需要 `claude` 命令行工具(具体来说是 `claude -p`),它只在 Claude Code 里可用。如果你在 Claude.ai,就跳过这一节。
432
438
 
433
- **Description optimization**: This section requires the `claude` CLI tool (specifically `claude -p`) which is only available in Claude Code. Skip it if you're on Claude.ai.
439
+ **盲对比**:需要子代理。跳过。
434
440
 
435
- **Blind comparison**: Requires subagents. Skip it.
441
+ **打包**:`package_skill.py` 脚本只要有 Python 和文件系统就能跑。在 Claude.ai 上你也可以运行它,用户可以下载生成的 `.skill` 文件。
436
442
 
437
- **Packaging**: The `package_skill.py` script works anywhere with Python and a filesystem. On Claude.ai, you can run it and the user can download the resulting `.skill` file.
443
+ **更新一个已存在的技能**:用户可能让你更新一个已存在的技能,而不是创建新技能。这时:
444
+ - **保留原有的名字。** 注意原技能的目录名和 `name` 字段,原样使用。例如已安装的技能叫 `research-helper`,就输出 `research-helper.skill`(不要写成 `research-helper-v2`)。
445
+ - **先复制到可写入的位置再编辑。** 已安装的技能路径可能是只读的。先复制到 `/tmp/skill-name/`,在那里编辑,再从副本打包。
446
+ - **如果要手动打包,先在 `/tmp/` 里组装好**,然后再复制到输出目录——直接写入输出目录可能因权限失败。
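
A hedged sketch of the "copy first, then edit" advice in the list above. The function name `stage_skill_for_edit` and its parameters are hypothetical; the sketch assumes the installed skill is an ordinary directory that can be read but not written.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def stage_skill_for_edit(installed_path, staging_root=None):
    """Copy a skill directory to a writable location, keeping its directory name."""
    src = Path(installed_path)
    root = Path(staging_root) if staging_root else Path(tempfile.mkdtemp())
    dest = root / src.name  # same name as the original, no "-v2" suffix
    shutil.copytree(src, dest)
    # copytree preserves permission bits, so re-add owner write access on the copy.
    for path in [dest, *dest.rglob("*")]:
        path.chmod(path.stat().st_mode | stat.S_IWUSR)
    return dest
```

Edits then go to the returned copy, and packaging runs against that copy rather than the installed path.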
438
447
 
439
448
  ---
440
449
 
441
- ## Cowork-Specific Instructions
450
+ ## Cowork 专用说明
442
451
 
443
- If you're in Cowork, the main things to know are:
452
+ 如果你在 Cowork 环境里,主要要知道这几件事:
444
453
 
445
- - You have subagents, so the main workflow (spawn test cases in parallel, run baselines, grade, etc.) all works. (However, if you run into severe problems with timeouts, it's OK to run the test prompts in series rather than parallel.)
446
- - You don't have a browser or display, so when generating the eval viewer, use `--static <output_path>` to write a standalone HTML file instead of starting a server. Then proffer a link that the user can click to open the HTML in their browser.
447
- - For whatever reason, the Cowork setup seems to disincline Claude from generating the eval viewer after running the tests, so just to reiterate: whether you're in Cowork or in Claude Code, after running tests, you should always generate the eval viewer for the human to look at examples before revising the skill yourself and trying to make corrections, using `generate_review.py` (not writing your own boutique html code). Sorry in advance but I'm gonna go all caps here: GENERATE THE EVAL VIEWER *BEFORE* evaluating inputs yourself. You want to get them in front of the human ASAP!
448
- - Feedback works differently: since there's no running server, the viewer's "Submit All Reviews" button will download `feedback.json` as a file. You can then read it from there (you may have to request access first).
449
- - Packaging works — `package_skill.py` just needs Python and a filesystem.
450
- - Description optimization (`run_loop.py` / `run_eval.py`) should work in Cowork just fine since it uses `claude -p` via subprocess, not a browser, but please save it until you've fully finished making the skill and the user agrees it's in good shape.
454
+ - 你有子代理,所以主流程(并行启动测试用例、跑基线、打分等等)都能照常工作。(不过如果你遇到严重的超时问题,把测试提示词改成串行执行也是可以的。)
455
+ - 你没有浏览器、也没有显示器,所以在生成 eval 查看器时请使用 `--static <output_path>` 生成一份独立 HTML,而不是启动服务器。然后把这个链接抛出来,用户可以点开在自己浏览器里查看。
456
+ - 出于一些原因,Cowork 这个环境似乎容易让 Claude 在跑完测试后不去生成 eval 查看器——所以再次强调:不管是 Cowork 还是 Claude Code,跑完测试之后你都应当先生成 eval 查看器、让用户先看一遍例子,然后再亲自去改进技能、做更正——而且必须使用 `generate_review.py`(不要自己手写一份花式 HTML)。抱歉这里我要全大写一下:先把 EVAL VIEWER *生成出来*,再去自己评估输出。目的是让人尽快看到结果!
457
+ - 反馈机制不同:因为没有在跑的服务器,查看器里的 "Submit All Reviews" 按钮会把 `feedback.json` 作为文件下载。你之后从那个位置去读取它(可能需要先申请访问权限)。
458
+ - 打包能用——`package_skill.py` 只需要 Python 和文件系统。
459
+ - Description 优化(`run_loop.py` / `run_eval.py`)在 Cowork 下应当也没问题,因为它是通过子进程调用 `claude -p`,不是浏览器;但还是请等到技能整体做完、用户也认可后再做这一步。
460
+ - **更新一个已存在的技能**:用户可能让你更新一个已存在的技能,而不是创建新技能。参照上方 Claude.ai 节里的"更新一个已存在的技能"指引执行。
451
461
 
452
462
  ---
453
463
 
454
- ## Reference files
464
+ ## 参考文件
455
465
 
456
- The agents/ directory contains instructions for specialized subagents. Read them when you need to spawn the relevant subagent.
466
+ agents/ 目录下是各种专用子代理的指令文件。需要派出对应的子代理时再去读取。
457
467
 
458
- - `agents/grader.md` — How to evaluate assertions against outputs
459
- - `agents/comparator.md` — How to do blind A/B comparison between two outputs
460
- - `agents/analyzer.md` — How to analyze why one version beat another
468
+ - `agents/grader.md` — 如何对照输出来评估断言
469
+ - `agents/comparator.md` — 如何对两份输出做盲式 A/B 对比
470
+ - `agents/analyzer.md` — 如何分析为什么某一版本胜出
461
471
 
462
- The references/ directory has additional documentation:
463
- - `references/schemas.md` — JSON structures for evals.json, grading.json, etc.
472
+ references/ 目录下是额外的文档:
473
+ - `references/schemas.md` — evals.json、grading.json 等的 JSON 结构
464
474
 
465
475
  ---
466
476
 
467
- Repeating one more time the core loop here for emphasis:
477
+ 最后再把核心循环重申一次,以示强调:
468
478
 
469
- - Figure out what the skill is about
470
- - Draft or edit the skill
471
- - Run claude-with-access-to-the-skill on test prompts
472
- - With the user, evaluate the outputs:
473
- - Create benchmark.json and run `eval-viewer/generate_review.py` to help the user review them
474
- - Run quantitative evals
475
- - Repeat until you and the user are satisfied
476
- - Package the final skill and return it to the user.
479
+ - 弄清楚这个技能要解决什么问题
480
+ - 起草或编辑这个技能
481
+ - 让带有这个技能的 Claude 在测试提示词上运行
482
+ - 与用户一道评估输出:
483
+ - 生成 benchmark.json 并运行 `eval-viewer/generate_review.py` 来辅助用户评审
484
+ - 跑一遍定量评估
485
+ - 反复迭代,直到你和用户都满意
486
+ - 把最终的技能打包,并交付给用户
477
487
 
478
- Please add steps to your TodoList, if you have such a thing, to make sure you don't forget. If you're in Cowork, please specifically put "Create evals JSON and run `eval-viewer/generate_review.py` so human can review test cases" in your TodoList to make sure it happens.
488
+ 如果你那边有 TodoList 这种工具,请把这些步骤加进去,免得忘记。如果你在 Cowork 环境,请特别把 "Create evals JSON and run `eval-viewer/generate_review.py` so human can review test cases" 加进 TodoList,确保它真的会被执行。
489
+
490
+ 祝顺利!
491
+
492
+ ---
479
493
 
480
- Good luck!
494
+ *方法论改编自 [anthropics/skills](https://github.com/anthropics/skills)(Apache 2.0)。*