ironweave 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +16 -0
- package/.clinerules +7 -0
- package/.codex/INSTALL.md +45 -0
- package/.cursor-plugin/plugin.json +19 -0
- package/.cursorrules +7 -0
- package/.github/copilot-instructions.md +7 -0
- package/.opencode/INSTALL.md +42 -0
- package/.windsurfrules +7 -0
- package/AGENTS.md +1 -0
- package/CLAUDE.md +22 -0
- package/CONTRIBUTING.md +81 -0
- package/GEMINI.md +1 -0
- package/LICENSE +21 -0
- package/README.md +250 -0
- package/README_CN.md +248 -0
- package/package.json +48 -0
- package/skills/api-contract-design/SKILL.md +227 -0
- package/skills/api-contract-design/references/api-design-rules.md +106 -0
- package/skills/brainstorm/SKILL.md +271 -0
- package/skills/brainstorm/agents/architect.md +34 -0
- package/skills/brainstorm/agents/challenger.md +34 -0
- package/skills/brainstorm/agents/domain-expert.md +34 -0
- package/skills/brainstorm/agents/pragmatist.md +34 -0
- package/skills/brainstorm/agents/product-manager.md +34 -0
- package/skills/brainstorm/agents/ux-designer.md +34 -0
- package/skills/brainstorm/references/synthesis-rules.md +51 -0
- package/skills/code-scaffold/SKILL.md +313 -0
- package/skills/code-scaffold/references/scaffold-rules.md +131 -0
- package/skills/docs-output/SKILL.md +149 -0
- package/skills/docs-output/references/naming-rules.md +52 -0
- package/skills/docs-output/scripts/docs_manager.py +353 -0
- package/skills/engineering-principles/SKILL.md +133 -0
- package/skills/engineering-principles/references/anti-patterns.md +144 -0
- package/skills/engineering-principles/references/ddd-patterns.md +66 -0
- package/skills/engineering-principles/references/design-patterns.md +34 -0
- package/skills/engineering-principles/references/patterns-architecture.md +301 -0
- package/skills/engineering-principles/references/patterns-backend.md +77 -0
- package/skills/engineering-principles/references/patterns-classic.md +200 -0
- package/skills/engineering-principles/references/patterns-crosscut.md +67 -0
- package/skills/engineering-principles/references/patterns-frontend.md +27 -0
- package/skills/engineering-principles/references/patterns-module.md +95 -0
- package/skills/engineering-principles/references/patterns-small-scale.md +79 -0
- package/skills/engineering-principles/references/quality-checklist.md +76 -0
- package/skills/engineering-principles/references/solid-principles.md +46 -0
- package/skills/engineering-principles/references/tdd-workflow.md +60 -0
- package/skills/engineering-principles/scripts/principles_matcher.py +433 -0
- package/skills/error-handling-strategy/SKILL.md +347 -0
- package/skills/error-handling-strategy/references/error-handling-rules.md +91 -0
- package/skills/implementation-complexity-analysis/SKILL.md +193 -0
- package/skills/implementation-complexity-analysis/references/complexity-rules.md +126 -0
- package/skills/integration-test-design/SKILL.md +296 -0
- package/skills/integration-test-design/references/test-strategy-rules.md +90 -0
- package/skills/observability-design/SKILL.md +327 -0
- package/skills/observability-design/references/observability-rules.md +129 -0
- package/skills/orchestrator/SKILL.md +260 -0
- package/skills/orchestrator/references/deliver.md +112 -0
- package/skills/orchestrator/references/execute.md +313 -0
- package/skills/orchestrator/references/gates.md +252 -0
- package/skills/orchestrator/references/parallel.md +70 -0
- package/skills/orchestrator/references/route-a.md +135 -0
- package/skills/orchestrator/references/route-b.md +91 -0
- package/skills/orchestrator/references/route-c.md +65 -0
- package/skills/orchestrator/references/route-d.md +75 -0
- package/skills/orchestrator/references/scope-sizer.md +219 -0
- package/skills/performance-arch-design/SKILL.md +208 -0
- package/skills/performance-arch-design/references/performance-rules.md +95 -0
- package/skills/project-context/SKILL.md +104 -0
- package/skills/project-context/references/schema.md +97 -0
- package/skills/project-context/scripts/context_db.py +358 -0
- package/skills/requirement-qa/SKILL.md +287 -0
- package/skills/requirement-qa/references/completion-signals.md +42 -0
- package/skills/requirement-qa/references/option-rules.md +57 -0
- package/skills/requirement-qa/scripts/qa_session.py +223 -0
- package/skills/skill-creator/LICENSE.txt +202 -0
- package/skills/skill-creator/SKILL.md +485 -0
- package/skills/skill-creator/agents/analyzer.md +274 -0
- package/skills/skill-creator/agents/comparator.md +202 -0
- package/skills/skill-creator/agents/grader.md +223 -0
- package/skills/skill-creator/assets/eval_review.html +146 -0
- package/skills/skill-creator/eval-viewer/generate_review.py +471 -0
- package/skills/skill-creator/eval-viewer/viewer.html +1325 -0
- package/skills/skill-creator/references/schemas.md +430 -0
- package/skills/skill-creator/scripts/__init__.py +0 -0
- package/skills/skill-creator/scripts/aggregate_benchmark.py +401 -0
- package/skills/skill-creator/scripts/generate_report.py +326 -0
- package/skills/skill-creator/scripts/improve_description.py +247 -0
- package/skills/skill-creator/scripts/package_skill.py +136 -0
- package/skills/skill-creator/scripts/quick_validate.py +103 -0
- package/skills/skill-creator/scripts/run_eval.py +310 -0
- package/skills/skill-creator/scripts/run_loop.py +328 -0
- package/skills/skill-creator/scripts/utils.py +47 -0
- package/skills/spec-writing/SKILL.md +96 -0
- package/skills/spec-writing/references/mermaid-guide.md +66 -0
- package/skills/spec-writing/references/test-matrix.md +73 -0
- package/skills/task-difficulty/SKILL.md +162 -0
- package/skills/task-difficulty/references/scoring-guide.md +123 -0
- package/skills/task-difficulty/scripts/difficulty_scorer.py +328 -0
- package/skills/tech-stack/SKILL.md +67 -0
- package/skills/tech-stack/references/tech-reference-tables.md +130 -0
package/skills/skill-creator/scripts/run_loop.py
@@ -0,0 +1,328 @@

```python
#!/usr/bin/env python3
"""Run the eval + improve loop until all pass or max iterations reached.

Combines run_eval.py and improve_description.py in a loop, tracking history
and returning the best description found. Supports train/test split to prevent
overfitting.
"""

import argparse
import json
import random
import sys
import tempfile
import time
import webbrowser
from pathlib import Path

from scripts.generate_report import generate_html
from scripts.improve_description import improve_description
from scripts.run_eval import find_project_root, run_eval
from scripts.utils import parse_skill_md


def split_eval_set(eval_set: list[dict], holdout: float, seed: int = 42) -> tuple[list[dict], list[dict]]:
    """Split eval set into train and test sets, stratified by should_trigger."""
    random.seed(seed)

    # Separate by should_trigger
    trigger = [e for e in eval_set if e["should_trigger"]]
    no_trigger = [e for e in eval_set if not e["should_trigger"]]

    # Shuffle each group
    random.shuffle(trigger)
    random.shuffle(no_trigger)

    # Calculate split points
    n_trigger_test = max(1, int(len(trigger) * holdout))
    n_no_trigger_test = max(1, int(len(no_trigger) * holdout))

    # Split
    test_set = trigger[:n_trigger_test] + no_trigger[:n_no_trigger_test]
    train_set = trigger[n_trigger_test:] + no_trigger[n_no_trigger_test:]

    return train_set, test_set


def run_loop(
    eval_set: list[dict],
    skill_path: Path,
    description_override: str | None,
    num_workers: int,
    timeout: int,
    max_iterations: int,
    runs_per_query: int,
    trigger_threshold: float,
    holdout: float,
    model: str,
    verbose: bool,
    live_report_path: Path | None = None,
    log_dir: Path | None = None,
) -> dict:
    """Run the eval + improvement loop."""
    project_root = find_project_root()
    name, original_description, content = parse_skill_md(skill_path)
    current_description = description_override or original_description

    # Split into train/test if holdout > 0
    if holdout > 0:
        train_set, test_set = split_eval_set(eval_set, holdout)
        if verbose:
            print(f"Split: {len(train_set)} train, {len(test_set)} test (holdout={holdout})", file=sys.stderr)
    else:
        train_set = eval_set
        test_set = []

    history = []
    exit_reason = "unknown"

    for iteration in range(1, max_iterations + 1):
        if verbose:
            print(f"\n{'='*60}", file=sys.stderr)
            print(f"Iteration {iteration}/{max_iterations}", file=sys.stderr)
            print(f"Description: {current_description}", file=sys.stderr)
            print(f"{'='*60}", file=sys.stderr)

        # Evaluate train + test together in one batch for parallelism
        all_queries = train_set + test_set
        t0 = time.time()
        all_results = run_eval(
            eval_set=all_queries,
            skill_name=name,
            description=current_description,
            num_workers=num_workers,
            timeout=timeout,
            project_root=project_root,
            runs_per_query=runs_per_query,
            trigger_threshold=trigger_threshold,
            model=model,
        )
        eval_elapsed = time.time() - t0

        # Split results back into train/test by matching queries
        train_queries_set = {q["query"] for q in train_set}
        train_result_list = [r for r in all_results["results"] if r["query"] in train_queries_set]
        test_result_list = [r for r in all_results["results"] if r["query"] not in train_queries_set]

        train_passed = sum(1 for r in train_result_list if r["pass"])
        train_total = len(train_result_list)
        train_summary = {"passed": train_passed, "failed": train_total - train_passed, "total": train_total}
        train_results = {"results": train_result_list, "summary": train_summary}

        if test_set:
            test_passed = sum(1 for r in test_result_list if r["pass"])
            test_total = len(test_result_list)
            test_summary = {"passed": test_passed, "failed": test_total - test_passed, "total": test_total}
            test_results = {"results": test_result_list, "summary": test_summary}
        else:
            test_results = None
            test_summary = None

        history.append({
            "iteration": iteration,
            "description": current_description,
            "train_passed": train_summary["passed"],
            "train_failed": train_summary["failed"],
            "train_total": train_summary["total"],
            "train_results": train_results["results"],
            "test_passed": test_summary["passed"] if test_summary else None,
            "test_failed": test_summary["failed"] if test_summary else None,
            "test_total": test_summary["total"] if test_summary else None,
            "test_results": test_results["results"] if test_results else None,
            # For backward compat with report generator
            "passed": train_summary["passed"],
            "failed": train_summary["failed"],
            "total": train_summary["total"],
            "results": train_results["results"],
        })

        # Write live report if path provided
        if live_report_path:
            partial_output = {
                "original_description": original_description,
                "best_description": current_description,
                "best_score": "in progress",
                "iterations_run": len(history),
                "holdout": holdout,
                "train_size": len(train_set),
                "test_size": len(test_set),
                "history": history,
            }
            live_report_path.write_text(generate_html(partial_output, auto_refresh=True, skill_name=name))

        if verbose:
            def print_eval_stats(label, results, elapsed):
                pos = [r for r in results if r["should_trigger"]]
                neg = [r for r in results if not r["should_trigger"]]
                tp = sum(r["triggers"] for r in pos)
                pos_runs = sum(r["runs"] for r in pos)
                fn = pos_runs - tp
                fp = sum(r["triggers"] for r in neg)
                neg_runs = sum(r["runs"] for r in neg)
                tn = neg_runs - fp
                total = tp + tn + fp + fn
                precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
                recall = tp / (tp + fn) if (tp + fn) > 0 else 1.0
                accuracy = (tp + tn) / total if total > 0 else 0.0
                print(f"{label}: {tp+tn}/{total} correct, precision={precision:.0%} recall={recall:.0%} accuracy={accuracy:.0%} ({elapsed:.1f}s)", file=sys.stderr)
                for r in results:
                    status = "PASS" if r["pass"] else "FAIL"
                    rate_str = f"{r['triggers']}/{r['runs']}"
                    print(f"  [{status}] rate={rate_str} expected={r['should_trigger']}: {r['query'][:60]}", file=sys.stderr)

            print_eval_stats("Train", train_results["results"], eval_elapsed)
            if test_summary:
                print_eval_stats("Test ", test_results["results"], 0)

        if train_summary["failed"] == 0:
            exit_reason = f"all_passed (iteration {iteration})"
            if verbose:
                print(f"\nAll train queries passed on iteration {iteration}!", file=sys.stderr)
            break

        if iteration == max_iterations:
            exit_reason = f"max_iterations ({max_iterations})"
            if verbose:
                print(f"\nMax iterations reached ({max_iterations}).", file=sys.stderr)
            break

        # Improve the description based on train results
        if verbose:
            print("\nImproving description...", file=sys.stderr)

        t0 = time.time()
        # Strip test scores from history so improvement model can't see them
        blinded_history = [
            {k: v for k, v in h.items() if not k.startswith("test_")}
            for h in history
        ]
        new_description = improve_description(
            skill_name=name,
            skill_content=content,
            current_description=current_description,
            eval_results=train_results,
            history=blinded_history,
            model=model,
            log_dir=log_dir,
            iteration=iteration,
        )
        improve_elapsed = time.time() - t0

        if verbose:
            print(f"Proposed ({improve_elapsed:.1f}s): {new_description}", file=sys.stderr)

        current_description = new_description

    # Find the best iteration by TEST score (or train if no test set)
    if test_set:
        best = max(history, key=lambda h: h["test_passed"] or 0)
        best_score = f"{best['test_passed']}/{best['test_total']}"
    else:
        best = max(history, key=lambda h: h["train_passed"])
        best_score = f"{best['train_passed']}/{best['train_total']}"

    if verbose:
        print(f"\nExit reason: {exit_reason}", file=sys.stderr)
        print(f"Best score: {best_score} (iteration {best['iteration']})", file=sys.stderr)

    return {
        "exit_reason": exit_reason,
        "original_description": original_description,
        "best_description": best["description"],
        "best_score": best_score,
        "best_train_score": f"{best['train_passed']}/{best['train_total']}",
        "best_test_score": f"{best['test_passed']}/{best['test_total']}" if test_set else None,
        "final_description": current_description,
        "iterations_run": len(history),
        "holdout": holdout,
        "train_size": len(train_set),
        "test_size": len(test_set),
        "history": history,
    }


def main():
    parser = argparse.ArgumentParser(description="Run eval + improve loop")
    parser.add_argument("--eval-set", required=True, help="Path to eval set JSON file")
    parser.add_argument("--skill-path", required=True, help="Path to skill directory")
    parser.add_argument("--description", default=None, help="Override starting description")
    parser.add_argument("--num-workers", type=int, default=10, help="Number of parallel workers")
    parser.add_argument("--timeout", type=int, default=30, help="Timeout per query in seconds")
    parser.add_argument("--max-iterations", type=int, default=5, help="Max improvement iterations")
    parser.add_argument("--runs-per-query", type=int, default=3, help="Number of runs per query")
    parser.add_argument("--trigger-threshold", type=float, default=0.5, help="Trigger rate threshold")
    parser.add_argument("--holdout", type=float, default=0.4, help="Fraction of eval set to hold out for testing (0 to disable)")
    parser.add_argument("--model", required=True, help="Model for improvement")
    parser.add_argument("--verbose", action="store_true", help="Print progress to stderr")
    parser.add_argument("--report", default="auto", help="Generate HTML report at this path (default: 'auto' for temp file, 'none' to disable)")
    parser.add_argument("--results-dir", default=None, help="Save all outputs (results.json, report.html, log.txt) to a timestamped subdirectory here")
    args = parser.parse_args()

    eval_set = json.loads(Path(args.eval_set).read_text())
    skill_path = Path(args.skill_path)

    if not (skill_path / "SKILL.md").exists():
        print(f"Error: No SKILL.md found at {skill_path}", file=sys.stderr)
        sys.exit(1)

    name, _, _ = parse_skill_md(skill_path)

    # Set up live report path
    if args.report != "none":
        if args.report == "auto":
            timestamp = time.strftime("%Y%m%d_%H%M%S")
            live_report_path = Path(tempfile.gettempdir()) / f"skill_description_report_{skill_path.name}_{timestamp}.html"
        else:
            live_report_path = Path(args.report)
        # Open the report immediately so the user can watch
        live_report_path.write_text("<html><body><h1>Starting optimization loop...</h1><meta http-equiv='refresh' content='5'></body></html>")
        webbrowser.open(str(live_report_path))
    else:
        live_report_path = None

    # Determine output directory (create before run_loop so logs can be written)
    if args.results_dir:
        timestamp = time.strftime("%Y-%m-%d_%H%M%S")
        results_dir = Path(args.results_dir) / timestamp
        results_dir.mkdir(parents=True, exist_ok=True)
    else:
        results_dir = None

    log_dir = results_dir / "logs" if results_dir else None

    output = run_loop(
        eval_set=eval_set,
        skill_path=skill_path,
        description_override=args.description,
        num_workers=args.num_workers,
        timeout=args.timeout,
        max_iterations=args.max_iterations,
        runs_per_query=args.runs_per_query,
        trigger_threshold=args.trigger_threshold,
        holdout=args.holdout,
        model=args.model,
        verbose=args.verbose,
        live_report_path=live_report_path,
        log_dir=log_dir,
    )

    # Save JSON output
    json_output = json.dumps(output, indent=2)
    print(json_output)
    if results_dir:
        (results_dir / "results.json").write_text(json_output)

    # Write final HTML report (without auto-refresh)
    if live_report_path:
        live_report_path.write_text(generate_html(output, auto_refresh=False, skill_name=name))
        print(f"\nReport: {live_report_path}", file=sys.stderr)

    if results_dir and live_report_path:
        (results_dir / "report.html").write_text(generate_html(output, auto_refresh=False, skill_name=name))

    if results_dir:
        print(f"Results saved to: {results_dir}", file=sys.stderr)


if __name__ == "__main__":
    main()
```
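The stratified holdout logic in `split_eval_set` can be exercised on its own. This is a minimal sketch that reimplements the same splitting logic (with hypothetical sample queries) so it runs without the package's other scripts:

```python
import random

def split_eval_set(eval_set, holdout, seed=42):
    # Same logic as scripts/run_loop.py: stratify by should_trigger,
    # shuffle each group, and hold out a fraction of each for testing.
    random.seed(seed)
    trigger = [e for e in eval_set if e["should_trigger"]]
    no_trigger = [e for e in eval_set if not e["should_trigger"]]
    random.shuffle(trigger)
    random.shuffle(no_trigger)
    n_t = max(1, int(len(trigger) * holdout))
    n_n = max(1, int(len(no_trigger) * holdout))
    train = trigger[n_t:] + no_trigger[n_n:]
    test = trigger[:n_t] + no_trigger[:n_n]
    return train, test

# Hypothetical eval set: 6 should-trigger and 4 should-not-trigger queries
evals = [{"query": f"q{i}", "should_trigger": i < 6} for i in range(10)]
train, test = split_eval_set(evals, holdout=0.4)
print(len(train), len(test))  # 7 3 (test holds int(6*0.4)=2 positives + int(4*0.4)=1 negative)
```

Because both groups contribute at least one query to the test set (`max(1, ...)`), even a tiny eval set keeps both trigger and non-trigger cases in the holdout.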
package/skills/skill-creator/scripts/utils.py
@@ -0,0 +1,47 @@

```python
"""Shared utilities for skill-creator scripts."""

from pathlib import Path


def parse_skill_md(skill_path: Path) -> tuple[str, str, str]:
    """Parse a SKILL.md file, returning (name, description, full_content)."""
    content = (skill_path / "SKILL.md").read_text()
    lines = content.split("\n")

    if lines[0].strip() != "---":
        raise ValueError("SKILL.md missing frontmatter (no opening ---)")

    end_idx = None
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            end_idx = i
            break

    if end_idx is None:
        raise ValueError("SKILL.md missing frontmatter (no closing ---)")

    name = ""
    description = ""
    frontmatter_lines = lines[1:end_idx]
    i = 0
    while i < len(frontmatter_lines):
        line = frontmatter_lines[i]
        if line.startswith("name:"):
            name = line[len("name:"):].strip().strip('"').strip("'")
        elif line.startswith("description:"):
            value = line[len("description:"):].strip()
            # Handle YAML multiline indicators (>, |, >-, |-)
            if value in (">", "|", ">-", "|-"):
                continuation_lines: list[str] = []
                i += 1
                while i < len(frontmatter_lines) and (frontmatter_lines[i].startswith(" ") or frontmatter_lines[i].startswith("\t")):
                    continuation_lines.append(frontmatter_lines[i].strip())
                    i += 1
                description = " ".join(continuation_lines)
                continue
            else:
                description = value.strip('"').strip("'")
        i += 1

    return name, description, content
```
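The multiline-description branch of `parse_skill_md` is the subtle part. The sketch below inlines the same parsing logic so it runs standalone, and feeds it a hypothetical SKILL.md that uses the `>-` folded-scalar form:

```python
import tempfile
from pathlib import Path

def parse_skill_md(skill_path: Path) -> tuple[str, str, str]:
    # Same logic as scripts/utils.py, inlined so this sketch is self-contained.
    content = (skill_path / "SKILL.md").read_text()
    lines = content.split("\n")
    if lines[0].strip() != "---":
        raise ValueError("SKILL.md missing frontmatter (no opening ---)")
    end_idx = next(i for i, l in enumerate(lines[1:], start=1) if l.strip() == "---")
    name = description = ""
    fm = lines[1:end_idx]
    i = 0
    while i < len(fm):
        line = fm[i]
        if line.startswith("name:"):
            name = line[len("name:"):].strip().strip('"').strip("'")
        elif line.startswith("description:"):
            value = line[len("description:"):].strip()
            if value in (">", "|", ">-", "|-"):  # YAML multiline indicators
                parts = []
                i += 1
                # Consume indented continuation lines, joining them with spaces
                while i < len(fm) and fm[i][:1] in (" ", "\t"):
                    parts.append(fm[i].strip())
                    i += 1
                description = " ".join(parts)
                continue
            description = value.strip('"').strip("'")
        i += 1
    return name, description, content

# Hypothetical SKILL.md exercising the '>-' multiline branch
sample = "---\nname: demo-skill\ndescription: >-\n  Line one,\n  line two.\n---\n\n# Demo\n"
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "SKILL.md").write_text(sample)
    name, description, _ = parse_skill_md(Path(d))
print(name, "|", description)  # demo-skill | Line one, line two.
```

Note that the continuation lines are stripped and space-joined, so a multi-line YAML description collapses into a single string, matching YAML's folded-scalar behavior for this common case.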
package/skills/spec-writing/SKILL.md
@@ -0,0 +1,96 @@

---
name: spec-writing
description: >-
  Write and revise modular feature requirement documents, output to the specs/ directory. Each feature document follows a six-part structure: feature overview, user stories, requirement description, technical requirements, constraints, and test criteria. All structural diagrams use Mermaid (flowchart, sequenceDiagram, erDiagram, etc.). Monorepos default to an apps/ + packages/ layout.
  Always use this skill when the user mentions requirement documents, PRDs, product requirements, feature documents, specs, user stories, acceptance criteria, feature requirements, requirement analysis, business requirements, functional requirements, requirement specifications, requirement reviews, requirement changes, or feature-point breakdowns, or asks to write / revise / review a detailed description of a feature, even without explicitly saying "requirement document". Not for pure technology-selection documents.
---

# Requirement Documents (Specs)

This skill covers writing and revising **requirement documents**: from a one-sentence request to structured, deliverable feature documents. The core problem it solves is turning vague verbal requirements into traceable, reviewable, acceptance-testable standardized documents.

It does **not** produce standalone technology-selection essays. It does **not** replace UI visual design mockups, but it must express all structural information in **Mermaid**, because Mermaid is version-controllable, collaboratively editable, and embeddable in Markdown.

## Principles

- **Modular**: split directories by system module; within a module, give each page or feature point **its own file** to avoid overly long single files. Each file then has a clear boundary of responsibility, the team can write in parallel, and reviews and changes only touch the affected files.
- **Structured**: every feature document must contain the **six sections** below, none optional. The six-part structure closes the loop from "what to build" to "how to accept it" and prevents omissions.
- **Monorepo context**: when a requirement touches repository layout, default to **`apps/`** for business applications and **`packages/`** for reusable packages; if the actual project differs, the user's or repository's conventions win.

## Diagrams (Mermaid as the only diagram language)

Whenever a requirement document needs **visualization**, **always use Mermaid**. Mermaid is plain text: Git-manageable, diffable, PR-reviewable.

Common types: `graph TD/LR` (module/architecture diagrams), `flowchart` (flows/states), `sequenceDiagram` (sequences), `erDiagram` (entity relationships). Diagrams are embedded in ` ```mermaid ` code blocks.

> **When to draw**: ≥2 interacting roles → sequence diagram; ≥3 branching steps → flowchart; new/changed entities → ER diagram; module dependencies → module diagram.
>
> For the full type table, usage guide, and examples see → `references/mermaid-guide.md`

## Recommended directory example

```
specs/
├── auth/
│   ├── login.md
│   └── register.md
├── user/
│   ├── profile.md
│   └── settings.md
└── dashboard/
    └── overview.md
```

## The six-part structure of a feature document

1. **Feature overview**
   One or two sentences stating the goal and business value: what is being built, why, and with what scope. A good overview lets someone with no project context grasp the feature's purpose in 10 seconds.

2. **User stories**
   Format: `As a [user role], I want [to accomplish something], so that [I gain some value].`
   Multiple stories may cover different roles or scenarios.

   **Examples**:
   - As a **regular user**, I want to log in with phone number + verification code, so that I can access the system quickly without memorizing a password.
   - As an **administrator**, I want to view failed-login logs, so that I can detect abnormal access promptly.

3. **Requirement description** (structured)
   - **Input**: data, trigger conditions
   - **Output**: results, UI changes
   - **Interaction flow**: steps and state transitions (**complex flows must include a flowchart or sequence diagram**)
   - **UI requirements**: layout, component form, copy; empty / loading / error states

4. **Technical requirements**
   Interfaces and data structures, performance targets (response time, concurrency, etc.), special implementations (debouncing, polling, lazy loading, etc.); include a **module diagram** when multiple modules collaborate; when **adding or changing tables/entities**, include an **`erDiagram`** (or cross-reference the project's unified ER document and describe the delta).

5. **Constraints**
   Timeline and priority, third-party dependencies (SMS, payment, maps, etc.), compliance and security, and what is explicitly **out of scope for this iteration**. Stating what you will not do matters as much as stating what you will; it prevents scope creep.

6. **Test criteria** (acceptance)
   - **Happy paths**: cover the core business flows
   - **Boundary values**: extremes, empty values, maximum lengths, etc.
   - **Error paths** (invalid input, network errors, insufficient permissions, etc.)
   - **Regression points** (existing features that might be affected)
   - **Cross-combination matrices** (when multiple test dimensions exist)

   Each test criterion should be executable: describe operation steps and expected results, not just abstract requirements.

### Cross-combination matrices and priority grading

When a feature involves multiple independent test dimensions (≥ 2 dimensions, at least one with ≥ 3 values), **explicitly list the combination matrix** and **grade each combination's priority** (P0 / P1 / P2 / P3) to set up the subsequent TDD work.

Priority is judged in this order: core business paths → money/security-related → error recovery → boundary values → equivalence-class merging → logically unreachable. P0 rows in the matrix map directly to the first batch of TDD test cases.

> For the full procedure, priority definitions, judgment criteria, and the 12-combination login example see → `references/test-matrix.md`

## How to execute

- **New document**: decide the module and file path first, then fill in the six sections one by one; when information is missing, confirm with the user rather than inventing business rules; **provide Mermaid wherever a diagram is warranted, never omit it**.
- **Revision**: preserve module boundaries; for any change, update the related items under "Test criteria" and "Constraints", and **update the affected Mermaid diagrams in sync**.
- **Review assistance**: when the user provides an existing requirement document to review, check completeness section by section against the six-part structure, point out gaps or ambiguities, and suggest fixes.

## Common cases and handling

- **The user gives only a one-sentence requirement** (e.g. "build a login feature"): first break it into a checklist of information to collect (which login methods? third-party login? is registration needed?), guide the user to clarify step by step, then output the six-part document.
- **The requirement spans multiple modules**: generate a separate document per module, cross-reference related modules in each document's "Technical requirements", and draw a module dependency diagram if needed.
- **A Mermaid diagram becomes too complex**: split it into several small diagrams, each focused on one concern (e.g. one sequence diagram for the login flow, another for the token-refresh flow).
package/skills/spec-writing/references/mermaid-guide.md
@@ -0,0 +1,66 @@

# Mermaid Diagram Quick Reference

Whenever a requirement document needs visualization, always use Mermaid.

## Common types

| Type | Typical use | Mermaid form |
|------|-------------|--------------|
| Module diagram | Dependencies or boundaries between features/subsystems/packages | `graph TD` / `graph LR` |
| Flowchart | Operation steps, state machines, business branches | `flowchart` |
| Sequence diagram | Interaction order among user, frontend, backend, third parties | `sequenceDiagram` |
| Architecture diagram | Where the feature sits in the overall system, data flow | `flowchart` / `flowchart LR` |
| **ER diagram** | **Persisted entities involved in the feature**, fields, relationships | **`erDiagram`** |

## When to draw

- **Two or more interacting roles or systems** → sequence diagram
- **Three or more flow branches** → flowchart
- **New or changed entities** → ER diagram
- **Call or dependency relationships between modules** → module diagram
- Prefer several small, clear diagrams over long prose describing complex interactions

## How to embed

Embed diagrams in Markdown ` ```mermaid ` code blocks; split complex scenarios into multiple diagrams, each focused on one concern.

## Examples

### Sequence diagram (login flow)

```mermaid
sequenceDiagram
    participant U as User
    participant F as Frontend
    participant B as Backend
    participant SMS as SMS service

    U->>F: Enter phone number
    F->>B: Request verification code
    B->>SMS: Send verification code
    SMS-->>U: Code received
    U->>F: Enter verification code
    F->>B: Submit login request
    B-->>F: Return token
    F-->>U: Redirect to home
```

### ER diagram (user and session)

```mermaid
erDiagram
    USER {
        int id PK
        string phone
        string email
        string status
        datetime created_at
    }
    SESSION {
        int id PK
        int user_id FK
        string token
        datetime expires_at
    }
    USER ||--o{ SESSION : "has"
```
package/skills/spec-writing/references/test-matrix.md
@@ -0,0 +1,73 @@

# Test Criteria: Cross-Combination Matrices and Priority Grading

## Cross-combination matrices

When a feature involves multiple independent test dimensions, explicitly list the combination matrix to lay the groundwork for subsequent TDD. The core value: turning implicit combination logic into a traceable list of test cases so no critical path is missed.

### Steps

1. **Identify test dimensions**: extract from the requirement description every independent variable that affects behavior
2. **List each dimension's values**: enumerate every legal value per dimension
3. **Generate the combination matrix**: list all logical combinations in a table
4. **Grade priorities**: mark each combination P0 / P1 / P2 / P3
5. **Annotate expected results**: the expected behavior for each combination

### When to build a matrix

- With ≥ 2 test dimensions and at least one dimension having ≥ 3 values, list the matrix explicitly
- Simple binary scenarios (e.g. "logged in / not logged in") do not need a matrix

## Priority grading

### Priority definitions

| Priority | Meaning | TDD requirement |
|----------|---------|-----------------|
| **P0** | Core business path; the feature is unusable if it fails | Tests must be written before development |
| **P1** | Important but non-blocking; affects UX or security | Tests should be added within the sprint |
| **P2** | Nice-to-have boundary coverage | Write when capacity allows |
| **P3** | Equivalent to another combination, or logically unreachable | No test needed; document why it is skipped |

### Grading criteria

Judge each combination's priority in this order:

1. **Core business path**: the path users most often take in the normal flow → P0
2. **Money/security-related**: paths involving payment, permissions, or data security → P0
3. **Error recovery path**: the recovery flow after a user mistake → P1
4. **Boundary-value combinations**: extremes, empty values, maximum lengths, etc. → P1
5. **Equivalence-class merging**: several combinations with identical expected behavior → keep one as P1, mark the rest P3
6. **Logically unreachable**: mutually exclusive preconditions make the combination impossible → P3

### Complete example (user login)

**Test dimensions:**
- A - Login method: phone + code, email + password
- B - User state: new, existing, locked
- C - Network state: normal, timeout

**Combination matrix (2 × 3 × 2 = 12 combinations):**

| # | Login method | User state | Network | Expected result | Priority | Rationale |
|---|--------------|-----------|---------|-----------------|----------|-----------|
| 1 | Phone + code | Existing | Normal | Login succeeds, redirect home | P0 | Core business path |
| 2 | Phone + code | New | Normal | Auto-register and log in | P0 | First-time user experience |
| 3 | Phone + code | Locked | Normal | Show "account locked" | P0 | Security-related |
| 4 | Phone + code | Existing | Timeout | Show network error, keep input | P1 | Error recovery path |
| 5 | Email + password | Existing | Normal | Login succeeds, redirect home | P0 | Core business path |
| 6 | Email + password | New | Normal | Show "user not found" | P0 | Core branch |
| 7 | Email + password | Locked | Normal | Show "account locked" | P0 | Security-related |
| 8 | Email + password | Existing | Timeout | Show network error, keep input | P1 | Error recovery path |
| 9 | Phone + code | New | Timeout | Show network error | P3 | Same behavior as #4 (equivalence class) |
| 10 | Phone + code | Locked | Timeout | Show network error | P3 | Network error takes precedence over business messages (equivalence class) |
| 11 | Email + password | New | Timeout | Show network error | P3 | Same behavior as #8 (equivalence class) |
| 12 | Email + password | Locked | Timeout | Show network error | P3 | Same as #10 (equivalence class) |

**Totals:** 6 × P0 / 2 × P1 / 4 × P3 → the first TDD batch needs at least 6 test cases (the P0 rows)

## Hooking into TDD

1. **P0 rows** map directly to the test cases to write; write these tests before development
2. **P1 rows** are added after the P0 tests pass
3. The **Expected result** column is each test case's `expect` assertion
4. The dimension values are the test cases' **parameterized inputs** (e.g. `test.each` / `@pytest.mark.parametrize`)