@devrev-computer/skills 1.0.0

Files changed (179)
  1. package/README.md +37 -0
  2. package/bin/install.mjs +158 -0
  3. package/package.json +33 -0
  4. package/skills/account-evaluation/account-evaluation.md +64 -0
  5. package/skills/account-research/account-research.md +323 -0
  6. package/skills/account-research/references/signals-guide.md +52 -0
  7. package/skills/create-workflow-template/create-workflow-template.md +1091 -0
  8. package/skills/create-workflow-template/examples/3592-Generate rca from pia-template.json +1 -0
  9. package/skills/create-workflow-template/examples/4392-Async opportunity review agent-template.json +1 -0
  10. package/skills/create-workflow-template/examples/4441-Ticket escalator from customer message-template.json +1 -0
  11. package/skills/create-workflow-template/examples/4505-Auto-update issue tcd as end of sprint date-template.json +1 -0
  12. package/skills/create-workflow-template/examples/5040-Devrevu - enablement journey - poc emails-template.json +1 -0
  13. package/skills/create-workflow-template/examples/5158-Devrevu - enablement journey - mailing for non enablement journey users-template.json +1 -0
  14. package/skills/create-workflow-template/examples/5216-Account segment missing notification-template.json +1 -0
  15. package/skills/create-workflow-template/examples/working-csat-score-on-ticket-resolved.json +1 -0
  16. package/skills/create-workflow-template/examples/working-enhancement-replace-agent.json +1 -0
  17. package/skills/create-workflow-template/examples/working-invoke-code-sample.json +1 -0
  18. package/skills/create-workflow-template/examples/working-loop-variable-sample.json +1 -0
  19. package/skills/create-workflow-template/operations/actions.md +2919 -0
  20. package/skills/create-workflow-template/operations/blockings.md +38 -0
  21. package/skills/create-workflow-template/operations/controls.md +108 -0
  22. package/skills/create-workflow-template/operations/schema-index.md +166 -0
  23. package/skills/create-workflow-template/operations/schemas/account_created.md +58 -0
  24. package/skills/create-workflow-template/operations/schemas/account_updated.md +73 -0
  25. package/skills/create-workflow-template/operations/schemas/add_comment.md +29 -0
  26. package/skills/create-workflow-template/operations/schemas/airdrop_sync_run_started.md +33 -0
  27. package/skills/create-workflow-template/operations/schemas/airdrop_sync_run_status_updated.md +35 -0
  28. package/skills/create-workflow-template/operations/schemas/article_created.md +96 -0
  29. package/skills/create-workflow-template/operations/schemas/article_updated.md +135 -0
  30. package/skills/create-workflow-template/operations/schemas/ask_ai.md +11 -0
  31. package/skills/create-workflow-template/operations/schemas/classify_object.md +22 -0
  32. package/skills/create-workflow-template/operations/schemas/contact_created.md +43 -0
  33. package/skills/create-workflow-template/operations/schemas/contact_updated.md +65 -0
  34. package/skills/create-workflow-template/operations/schemas/conversation_created.md +108 -0
  35. package/skills/create-workflow-template/operations/schemas/conversation_sla_tracker_updated.md +46 -0
  36. package/skills/create-workflow-template/operations/schemas/conversation_updated.md +130 -0
  37. package/skills/create-workflow-template/operations/schemas/convert_conversation_to_ticket.md +13 -0
  38. package/skills/create-workflow-template/operations/schemas/create_account.md +62 -0
  39. package/skills/create-workflow-template/operations/schemas/create_article.md +79 -0
  40. package/skills/create-workflow-template/operations/schemas/create_brand.md +42 -0
  41. package/skills/create-workflow-template/operations/schemas/create_contact.md +65 -0
  42. package/skills/create-workflow-template/operations/schemas/create_dm.md +53 -0
  43. package/skills/create-workflow-template/operations/schemas/create_enhancement.md +63 -0
  44. package/skills/create-workflow-template/operations/schemas/create_incident.md +136 -0
  45. package/skills/create-workflow-template/operations/schemas/create_issue.md +150 -0
  46. package/skills/create-workflow-template/operations/schemas/create_meeting.md +105 -0
  47. package/skills/create-workflow-template/operations/schemas/create_opportunity.md +123 -0
  48. package/skills/create-workflow-template/operations/schemas/create_ticket.md +184 -0
  49. package/skills/create-workflow-template/operations/schemas/csat_response_received.md +73 -0
  50. package/skills/create-workflow-template/operations/schemas/dev_user_created.md +54 -0
  51. package/skills/create-workflow-template/operations/schemas/dev_user_updated.md +99 -0
  52. package/skills/create-workflow-template/operations/schemas/enhancement_created.md +46 -0
  53. package/skills/create-workflow-template/operations/schemas/enhancement_updated.md +89 -0
  54. package/skills/create-workflow-template/operations/schemas/evaluate_sentiment.md +14 -0
  55. package/skills/create-workflow-template/operations/schemas/execute_metric_action.md +11 -0
  56. package/skills/create-workflow-template/operations/schemas/feature_created.md +40 -0
  57. package/skills/create-workflow-template/operations/schemas/for_each.md +45 -0
  58. package/skills/create-workflow-template/operations/schemas/get_account.md +59 -0
  59. package/skills/create-workflow-template/operations/schemas/get_airdrop_sync_unit.md +32 -0
  60. package/skills/create-workflow-template/operations/schemas/get_brand.md +40 -0
  61. package/skills/create-workflow-template/operations/schemas/get_complete_enhancement_details.md +13 -0
  62. package/skills/create-workflow-template/operations/schemas/get_conversation.md +120 -0
  63. package/skills/create-workflow-template/operations/schemas/get_customer.md +60 -0
  64. package/skills/create-workflow-template/operations/schemas/get_enhancement.md +66 -0
  65. package/skills/create-workflow-template/operations/schemas/get_feature.md +56 -0
  66. package/skills/create-workflow-template/operations/schemas/get_incident.md +85 -0
  67. package/skills/create-workflow-template/operations/schemas/get_issue.md +117 -0
  68. package/skills/create-workflow-template/operations/schemas/get_kg_schema.md +23 -0
  69. package/skills/create-workflow-template/operations/schemas/get_meeting.md +87 -0
  70. package/skills/create-workflow-template/operations/schemas/get_metric_trackers.md +20 -0
  71. package/skills/create-workflow-template/operations/schemas/get_node_schema.md +29 -0
  72. package/skills/create-workflow-template/operations/schemas/get_opportunity.md +93 -0
  73. package/skills/create-workflow-template/operations/schemas/get_org_user.md +57 -0
  74. package/skills/create-workflow-template/operations/schemas/get_org_user_preference.md +40 -0
  75. package/skills/create-workflow-template/operations/schemas/get_part.md +55 -0
  76. package/skills/create-workflow-template/operations/schemas/get_self.md +54 -0
  77. package/skills/create-workflow-template/operations/schemas/get_session_details.md +45 -0
  78. package/skills/create-workflow-template/operations/schemas/get_sprint_board.md +103 -0
  79. package/skills/create-workflow-template/operations/schemas/get_ticket.md +136 -0
  80. package/skills/create-workflow-template/operations/schemas/get_workspace.md +21 -0
  81. package/skills/create-workflow-template/operations/schemas/go_back.md +13 -0
  82. package/skills/create-workflow-template/operations/schemas/http.md +38 -0
  83. package/skills/create-workflow-template/operations/schemas/hybrid_search.md +144 -0
  84. package/skills/create-workflow-template/operations/schemas/if_else.md +16 -0
  85. package/skills/create-workflow-template/operations/schemas/incident_created.md +88 -0
  86. package/skills/create-workflow-template/operations/schemas/incident_updated.md +126 -0
  87. package/skills/create-workflow-template/operations/schemas/init_variable.md +67 -0
  88. package/skills/create-workflow-template/operations/schemas/invoice_created.md +21 -0
  89. package/skills/create-workflow-template/operations/schemas/invoice_updated.md +41 -0
  90. package/skills/create-workflow-template/operations/schemas/invoke_code.md +132 -0
  91. package/skills/create-workflow-template/operations/schemas/issue_created.md +105 -0
  92. package/skills/create-workflow-template/operations/schemas/issue_sla_tracker_updated.md +46 -0
  93. package/skills/create-workflow-template/operations/schemas/issue_updated.md +172 -0
  94. package/skills/create-workflow-template/operations/schemas/link_incident_with_issue.md +14 -0
  95. package/skills/create-workflow-template/operations/schemas/link_ticket_with_issue.md +14 -0
  96. package/skills/create-workflow-template/operations/schemas/list_enhancements.md +74 -0
  97. package/skills/create-workflow-template/operations/schemas/list_issues.md +108 -0
  98. package/skills/create-workflow-template/operations/schemas/list_sessions.md +79 -0
  99. package/skills/create-workflow-template/operations/schemas/list_sprint.md +29 -0
  100. package/skills/create-workflow-template/operations/schemas/list_web_sessions.md +87 -0
  101. package/skills/create-workflow-template/operations/schemas/loop_over_accounts.md +106 -0
  102. package/skills/create-workflow-template/operations/schemas/loop_over_articles.md +126 -0
  103. package/skills/create-workflow-template/operations/schemas/loop_over_customers.md +88 -0
  104. package/skills/create-workflow-template/operations/schemas/loop_over_dev_users.md +75 -0
  105. package/skills/create-workflow-template/operations/schemas/loop_over_enhancements.md +112 -0
  106. package/skills/create-workflow-template/operations/schemas/loop_over_incidents.md +113 -0
  107. package/skills/create-workflow-template/operations/schemas/loop_over_issues.md +217 -0
  108. package/skills/create-workflow-template/operations/schemas/loop_over_meetings.md +150 -0
  109. package/skills/create-workflow-template/operations/schemas/loop_over_opportunity.md +161 -0
  110. package/skills/create-workflow-template/operations/schemas/loop_over_sprints.md +50 -0
  111. package/skills/create-workflow-template/operations/schemas/loop_over_tickets.md +203 -0
  112. package/skills/create-workflow-template/operations/schemas/manual_trigger.md +11 -0
  113. package/skills/create-workflow-template/operations/schemas/meeting_created.md +116 -0
  114. package/skills/create-workflow-template/operations/schemas/meeting_updated.md +152 -0
  115. package/skills/create-workflow-template/operations/schemas/oasis_sql_execute.md +11 -0
  116. package/skills/create-workflow-template/operations/schemas/opportunity_created.md +92 -0
  117. package/skills/create-workflow-template/operations/schemas/opportunity_updated.md +124 -0
  118. package/skills/create-workflow-template/operations/schemas/pick_user.md +16 -0
  119. package/skills/create-workflow-template/operations/schemas/question_answer_created.md +44 -0
  120. package/skills/create-workflow-template/operations/schemas/question_answer_updated.md +75 -0
  121. package/skills/create-workflow-template/operations/schemas/recall_chats.md +13 -0
  122. package/skills/create-workflow-template/operations/schemas/router.md +15 -0
  123. package/skills/create-workflow-template/operations/schemas/send_notification.md +19 -0
  124. package/skills/create-workflow-template/operations/schemas/set_variable.md +67 -0
  125. package/skills/create-workflow-template/operations/schemas/sleep_for.md +12 -0
  126. package/skills/create-workflow-template/operations/schemas/sleep_until.md +17 -0
  127. package/skills/create-workflow-template/operations/schemas/sprint_updated.md +37 -0
  128. package/skills/create-workflow-template/operations/schemas/suggest_part.md +14 -0
  129. package/skills/create-workflow-template/operations/schemas/task_updated.md +79 -0
  130. package/skills/create-workflow-template/operations/schemas/test_example.md +16 -0
  131. package/skills/create-workflow-template/operations/schemas/ticket_created.md +136 -0
  132. package/skills/create-workflow-template/operations/schemas/ticket_sla_tracker_updated.md +46 -0
  133. package/skills/create-workflow-template/operations/schemas/ticket_updated.md +198 -0
  134. package/skills/create-workflow-template/operations/schemas/timeline_comment_created.md +70 -0
  135. package/skills/create-workflow-template/operations/schemas/update_account.md +68 -0
  136. package/skills/create-workflow-template/operations/schemas/update_article.md +95 -0
  137. package/skills/create-workflow-template/operations/schemas/update_brand.md +44 -0
  138. package/skills/create-workflow-template/operations/schemas/update_contact.md +53 -0
  139. package/skills/create-workflow-template/operations/schemas/update_conversation.md +149 -0
  140. package/skills/create-workflow-template/operations/schemas/update_enhancement.md +64 -0
  141. package/skills/create-workflow-template/operations/schemas/update_incident.md +156 -0
  142. package/skills/create-workflow-template/operations/schemas/update_issue.md +173 -0
  143. package/skills/create-workflow-template/operations/schemas/update_meeting.md +114 -0
  144. package/skills/create-workflow-template/operations/schemas/update_opportunity.md +137 -0
  145. package/skills/create-workflow-template/operations/schemas/update_question_answer.md +60 -0
  146. package/skills/create-workflow-template/operations/schemas/update_ticket.md +188 -0
  147. package/skills/create-workflow-template/operations/schemas/watch_ticket_for_updates.md +225 -0
  148. package/skills/create-workflow-template/operations/schemas/web_search.md +17 -0
  149. package/skills/create-workflow-template/operations/schemas/while.md +24 -0
  150. package/skills/create-workflow-template/operations/schemas/widget_created.md +75 -0
  151. package/skills/create-workflow-template/operations/schemas/widget_updated.md +98 -0
  152. package/skills/create-workflow-template/operations/schemas/workspace_created.md +20 -0
  153. package/skills/create-workflow-template/operations/triggers.md +1583 -0
  154. package/skills/customer-brief/customer-brief.md +66 -0
  155. package/skills/deal-review-meddpicc/deal-review-meddpicc.md +58 -0
  156. package/skills/next-step-for-opportunity/next-step-for-opportunity.md +55 -0
  157. package/skills/opportunity-feature-prioritizer/SKILL.md +183 -0
  158. package/skills/sales-call-plan-coach/sales-call-plan-coach.md +73 -0
  159. package/skills/sales-context/sales-context.md +44 -0
  160. package/skills/sales-search-and-lookup/sales-search-and-lookup.md +58 -0
  161. package/skills/skill-creator/SKILL.md +570 -0
  162. package/skills/skill-creator/agents/analyzer.md +274 -0
  163. package/skills/skill-creator/agents/comparator.md +202 -0
  164. package/skills/skill-creator/agents/grader.md +223 -0
  165. package/skills/skill-creator/assets/eval_review.html +146 -0
  166. package/skills/skill-creator/eval-viewer/generate_review.py +471 -0
  167. package/skills/skill-creator/eval-viewer/viewer.html +1325 -0
  168. package/skills/skill-creator/references/schemas.md +430 -0
  169. package/skills/skill-creator/references/tool-patterns.md +290 -0
  170. package/skills/skill-creator/scripts/__init__.py +0 -0
  171. package/skills/skill-creator/scripts/aggregate_benchmark.py +401 -0
  172. package/skills/skill-creator/scripts/generate_report.py +326 -0
  173. package/skills/skill-creator/scripts/improve_description.py +247 -0
  174. package/skills/skill-creator/scripts/package_skill.py +136 -0
  175. package/skills/skill-creator/scripts/quick_validate.py +103 -0
  176. package/skills/skill-creator/scripts/run_eval.py +310 -0
  177. package/skills/skill-creator/scripts/run_loop.py +328 -0
  178. package/skills/skill-creator/scripts/utils.py +47 -0
  179. package/skills/trace-diagnosis/trace-diagnosis.md +186 -0
@@ -0,0 +1,328 @@
+ #!/usr/bin/env python3
+ """Run the eval + improve loop until all pass or max iterations reached.
+
+ Combines run_eval.py and improve_description.py in a loop, tracking history
+ and returning the best description found. Supports train/test split to prevent
+ overfitting.
+ """
+
+ import argparse
+ import json
+ import random
+ import sys
+ import tempfile
+ import time
+ import webbrowser
+ from pathlib import Path
+
+ from scripts.generate_report import generate_html
+ from scripts.improve_description import improve_description
+ from scripts.run_eval import find_project_root, run_eval
+ from scripts.utils import parse_skill_md
+
+
+ def split_eval_set(eval_set: list[dict], holdout: float, seed: int = 42) -> tuple[list[dict], list[dict]]:
+     """Split eval set into train and test sets, stratified by should_trigger."""
+     random.seed(seed)
+
+     # Separate by should_trigger
+     trigger = [e for e in eval_set if e["should_trigger"]]
+     no_trigger = [e for e in eval_set if not e["should_trigger"]]
+
+     # Shuffle each group
+     random.shuffle(trigger)
+     random.shuffle(no_trigger)
+
+     # Calculate split points
+     n_trigger_test = max(1, int(len(trigger) * holdout))
+     n_no_trigger_test = max(1, int(len(no_trigger) * holdout))
+
+     # Split
+     test_set = trigger[:n_trigger_test] + no_trigger[:n_no_trigger_test]
+     train_set = trigger[n_trigger_test:] + no_trigger[n_no_trigger_test:]
+
+     return train_set, test_set
+
+
+ def run_loop(
+     eval_set: list[dict],
+     skill_path: Path,
+     description_override: str | None,
+     num_workers: int,
+     timeout: int,
+     max_iterations: int,
+     runs_per_query: int,
+     trigger_threshold: float,
+     holdout: float,
+     model: str,
+     verbose: bool,
+     live_report_path: Path | None = None,
+     log_dir: Path | None = None,
+ ) -> dict:
+     """Run the eval + improvement loop."""
+     project_root = find_project_root()
+     name, original_description, content = parse_skill_md(skill_path)
+     current_description = description_override or original_description
+
+     # Split into train/test if holdout > 0
+     if holdout > 0:
+         train_set, test_set = split_eval_set(eval_set, holdout)
+         if verbose:
+             print(f"Split: {len(train_set)} train, {len(test_set)} test (holdout={holdout})", file=sys.stderr)
+     else:
+         train_set = eval_set
+         test_set = []
+
+     history = []
+     exit_reason = "unknown"
+
+     for iteration in range(1, max_iterations + 1):
+         if verbose:
+             print(f"\n{'='*60}", file=sys.stderr)
+             print(f"Iteration {iteration}/{max_iterations}", file=sys.stderr)
+             print(f"Description: {current_description}", file=sys.stderr)
+             print(f"{'='*60}", file=sys.stderr)
+
+         # Evaluate train + test together in one batch for parallelism
+         all_queries = train_set + test_set
+         t0 = time.time()
+         all_results = run_eval(
+             eval_set=all_queries,
+             skill_name=name,
+             description=current_description,
+             num_workers=num_workers,
+             timeout=timeout,
+             project_root=project_root,
+             runs_per_query=runs_per_query,
+             trigger_threshold=trigger_threshold,
+             model=model,
+         )
+         eval_elapsed = time.time() - t0
+
+         # Split results back into train/test by matching queries
+         train_queries_set = {q["query"] for q in train_set}
+         train_result_list = [r for r in all_results["results"] if r["query"] in train_queries_set]
+         test_result_list = [r for r in all_results["results"] if r["query"] not in train_queries_set]
+
+         train_passed = sum(1 for r in train_result_list if r["pass"])
+         train_total = len(train_result_list)
+         train_summary = {"passed": train_passed, "failed": train_total - train_passed, "total": train_total}
+         train_results = {"results": train_result_list, "summary": train_summary}
+
+         if test_set:
+             test_passed = sum(1 for r in test_result_list if r["pass"])
+             test_total = len(test_result_list)
+             test_summary = {"passed": test_passed, "failed": test_total - test_passed, "total": test_total}
+             test_results = {"results": test_result_list, "summary": test_summary}
+         else:
+             test_results = None
+             test_summary = None
+
+         history.append({
+             "iteration": iteration,
+             "description": current_description,
+             "train_passed": train_summary["passed"],
+             "train_failed": train_summary["failed"],
+             "train_total": train_summary["total"],
+             "train_results": train_results["results"],
+             "test_passed": test_summary["passed"] if test_summary else None,
+             "test_failed": test_summary["failed"] if test_summary else None,
+             "test_total": test_summary["total"] if test_summary else None,
+             "test_results": test_results["results"] if test_results else None,
+             # For backward compat with report generator
+             "passed": train_summary["passed"],
+             "failed": train_summary["failed"],
+             "total": train_summary["total"],
+             "results": train_results["results"],
+         })
+
+         # Write live report if path provided
+         if live_report_path:
+             partial_output = {
+                 "original_description": original_description,
+                 "best_description": current_description,
+                 "best_score": "in progress",
+                 "iterations_run": len(history),
+                 "holdout": holdout,
+                 "train_size": len(train_set),
+                 "test_size": len(test_set),
+                 "history": history,
+             }
+             live_report_path.write_text(generate_html(partial_output, auto_refresh=True, skill_name=name))
+
+         if verbose:
+             def print_eval_stats(label, results, elapsed):
+                 pos = [r for r in results if r["should_trigger"]]
+                 neg = [r for r in results if not r["should_trigger"]]
+                 tp = sum(r["triggers"] for r in pos)
+                 pos_runs = sum(r["runs"] for r in pos)
+                 fn = pos_runs - tp
+                 fp = sum(r["triggers"] for r in neg)
+                 neg_runs = sum(r["runs"] for r in neg)
+                 tn = neg_runs - fp
+                 total = tp + tn + fp + fn
+                 precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
+                 recall = tp / (tp + fn) if (tp + fn) > 0 else 1.0
+                 accuracy = (tp + tn) / total if total > 0 else 0.0
+                 print(f"{label}: {tp+tn}/{total} correct, precision={precision:.0%} recall={recall:.0%} accuracy={accuracy:.0%} ({elapsed:.1f}s)", file=sys.stderr)
+                 for r in results:
+                     status = "PASS" if r["pass"] else "FAIL"
+                     rate_str = f"{r['triggers']}/{r['runs']}"
+                     print(f" [{status}] rate={rate_str} expected={r['should_trigger']}: {r['query'][:60]}", file=sys.stderr)
+
+             print_eval_stats("Train", train_results["results"], eval_elapsed)
+             if test_summary:
+                 print_eval_stats("Test ", test_results["results"], 0)
+
+         if train_summary["failed"] == 0:
+             exit_reason = f"all_passed (iteration {iteration})"
+             if verbose:
+                 print(f"\nAll train queries passed on iteration {iteration}!", file=sys.stderr)
+             break
+
+         if iteration == max_iterations:
+             exit_reason = f"max_iterations ({max_iterations})"
+             if verbose:
+                 print(f"\nMax iterations reached ({max_iterations}).", file=sys.stderr)
+             break
+
+         # Improve the description based on train results
+         if verbose:
+             print(f"\nImproving description...", file=sys.stderr)
+
+         t0 = time.time()
+         # Strip test scores from history so improvement model can't see them
+         blinded_history = [
+             {k: v for k, v in h.items() if not k.startswith("test_")}
+             for h in history
+         ]
+         new_description = improve_description(
+             skill_name=name,
+             skill_content=content,
+             current_description=current_description,
+             eval_results=train_results,
+             history=blinded_history,
+             model=model,
+             log_dir=log_dir,
+             iteration=iteration,
+         )
+         improve_elapsed = time.time() - t0
+
+         if verbose:
+             print(f"Proposed ({improve_elapsed:.1f}s): {new_description}", file=sys.stderr)
+
+         current_description = new_description
+
+     # Find the best iteration by TEST score (or train if no test set)
+     if test_set:
+         best = max(history, key=lambda h: h["test_passed"] or 0)
+         best_score = f"{best['test_passed']}/{best['test_total']}"
+     else:
+         best = max(history, key=lambda h: h["train_passed"])
+         best_score = f"{best['train_passed']}/{best['train_total']}"
+
+     if verbose:
+         print(f"\nExit reason: {exit_reason}", file=sys.stderr)
+         print(f"Best score: {best_score} (iteration {best['iteration']})", file=sys.stderr)
+
+     return {
+         "exit_reason": exit_reason,
+         "original_description": original_description,
+         "best_description": best["description"],
+         "best_score": best_score,
+         "best_train_score": f"{best['train_passed']}/{best['train_total']}",
+         "best_test_score": f"{best['test_passed']}/{best['test_total']}" if test_set else None,
+         "final_description": current_description,
+         "iterations_run": len(history),
+         "holdout": holdout,
+         "train_size": len(train_set),
+         "test_size": len(test_set),
+         "history": history,
+     }
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Run eval + improve loop")
+     parser.add_argument("--eval-set", required=True, help="Path to eval set JSON file")
+     parser.add_argument("--skill-path", required=True, help="Path to skill directory")
+     parser.add_argument("--description", default=None, help="Override starting description")
+     parser.add_argument("--num-workers", type=int, default=10, help="Number of parallel workers")
+     parser.add_argument("--timeout", type=int, default=30, help="Timeout per query in seconds")
+     parser.add_argument("--max-iterations", type=int, default=5, help="Max improvement iterations")
+     parser.add_argument("--runs-per-query", type=int, default=3, help="Number of runs per query")
+     parser.add_argument("--trigger-threshold", type=float, default=0.5, help="Trigger rate threshold")
+     parser.add_argument("--holdout", type=float, default=0.4, help="Fraction of eval set to hold out for testing (0 to disable)")
+     parser.add_argument("--model", required=True, help="Model for improvement")
+     parser.add_argument("--verbose", action="store_true", help="Print progress to stderr")
+     parser.add_argument("--report", default="auto", help="Generate HTML report at this path (default: 'auto' for temp file, 'none' to disable)")
+     parser.add_argument("--results-dir", default=None, help="Save all outputs (results.json, report.html, log.txt) to a timestamped subdirectory here")
+     args = parser.parse_args()
+
+     eval_set = json.loads(Path(args.eval_set).read_text())
+     skill_path = Path(args.skill_path)
+
+     if not (skill_path / "SKILL.md").exists():
+         print(f"Error: No SKILL.md found at {skill_path}", file=sys.stderr)
+         sys.exit(1)
+
+     name, _, _ = parse_skill_md(skill_path)
+
+     # Set up live report path
+     if args.report != "none":
+         if args.report == "auto":
+             timestamp = time.strftime("%Y%m%d_%H%M%S")
+             live_report_path = Path(tempfile.gettempdir()) / f"skill_description_report_{skill_path.name}_{timestamp}.html"
+         else:
+             live_report_path = Path(args.report)
+         # Open the report immediately so the user can watch
+         live_report_path.write_text("<html><body><h1>Starting optimization loop...</h1><meta http-equiv='refresh' content='5'></body></html>")
+         webbrowser.open(str(live_report_path))
+     else:
+         live_report_path = None
+
+     # Determine output directory (create before run_loop so logs can be written)
+     if args.results_dir:
+         timestamp = time.strftime("%Y-%m-%d_%H%M%S")
+         results_dir = Path(args.results_dir) / timestamp
+         results_dir.mkdir(parents=True, exist_ok=True)
+     else:
+         results_dir = None
+
+     log_dir = results_dir / "logs" if results_dir else None
+
+     output = run_loop(
+         eval_set=eval_set,
+         skill_path=skill_path,
+         description_override=args.description,
+         num_workers=args.num_workers,
+         timeout=args.timeout,
+         max_iterations=args.max_iterations,
+         runs_per_query=args.runs_per_query,
+         trigger_threshold=args.trigger_threshold,
+         holdout=args.holdout,
+         model=args.model,
+         verbose=args.verbose,
+         live_report_path=live_report_path,
+         log_dir=log_dir,
+     )
+
+     # Save JSON output
+     json_output = json.dumps(output, indent=2)
+     print(json_output)
+     if results_dir:
+         (results_dir / "results.json").write_text(json_output)
+
+     # Write final HTML report (without auto-refresh)
+     if live_report_path:
+         live_report_path.write_text(generate_html(output, auto_refresh=False, skill_name=name))
+         print(f"\nReport: {live_report_path}", file=sys.stderr)
+
+     if results_dir and live_report_path:
+         (results_dir / "report.html").write_text(generate_html(output, auto_refresh=False, skill_name=name))
+
+     if results_dir:
+         print(f"Results saved to: {results_dir}", file=sys.stderr)
+
+
+ if __name__ == "__main__":
+     main()
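Reviewer note on the hunk above: the stratified holdout in `split_eval_set` can be tried in isolation. The sketch below is an illustration, not the packaged code. The function name `stratified_split` and the sample eval set are hypothetical, and it uses a local `random.Random` instance where the script seeds the global `random` module.

```python
import random


def stratified_split(items, holdout, seed=42):
    """Hold out a fraction of each should_trigger class, at least one item per class."""
    rng = random.Random(seed)  # assumption: local RNG instead of global random.seed()
    positives = [e for e in items if e["should_trigger"]]
    negatives = [e for e in items if not e["should_trigger"]]
    rng.shuffle(positives)
    rng.shuffle(negatives)
    # max(1, ...) moves at least one item of each non-empty class into the test set
    n_pos = max(1, int(len(positives) * holdout))
    n_neg = max(1, int(len(negatives) * holdout))
    test = positives[:n_pos] + negatives[:n_neg]
    train = positives[n_pos:] + negatives[n_neg:]
    return train, test


# Hypothetical eval set: 5 queries that should trigger the skill, 5 that should not.
eval_set = [{"query": f"q{i}", "should_trigger": i < 5} for i in range(10)]
train, test = stratified_split(eval_set, holdout=0.4)
print(len(train), len(test))  # 6 4
```

Because of the `max(1, ...)` floor, a class with a single query ends up entirely in the test set, so very small eval sets may leave nothing of that class to train on. Passing `--holdout 0` avoids the split altogether.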
@@ -0,0 +1,47 @@
+ """Shared utilities for skill-creator scripts."""
+
+ from pathlib import Path
+
+
+
+ def parse_skill_md(skill_path: Path) -> tuple[str, str, str]:
+     """Parse a SKILL.md file, returning (name, description, full_content)."""
+     content = (skill_path / "SKILL.md").read_text()
+     lines = content.split("\n")
+
+     if lines[0].strip() != "---":
+         raise ValueError("SKILL.md missing frontmatter (no opening ---)")
+
+     end_idx = None
+     for i, line in enumerate(lines[1:], start=1):
+         if line.strip() == "---":
+             end_idx = i
+             break
+
+     if end_idx is None:
+         raise ValueError("SKILL.md missing frontmatter (no closing ---)")
+
+     name = ""
+     description = ""
+     frontmatter_lines = lines[1:end_idx]
+     i = 0
+     while i < len(frontmatter_lines):
+         line = frontmatter_lines[i]
+         if line.startswith("name:"):
+             name = line[len("name:"):].strip().strip('"').strip("'")
+         elif line.startswith("description:"):
+             value = line[len("description:"):].strip()
+             # Handle YAML multiline indicators (>, |, >-, |-)
+             if value in (">", "|", ">-", "|-"):
+                 continuation_lines: list[str] = []
+                 i += 1
+                 while i < len(frontmatter_lines) and (frontmatter_lines[i].startswith(" ") or frontmatter_lines[i].startswith("\t")):
+                     continuation_lines.append(frontmatter_lines[i].strip())
+                     i += 1
+                 description = " ".join(continuation_lines)
+                 continue
+             else:
+                 description = value.strip('"').strip("'")
+         i += 1
+
+     return name, description, content
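Reviewer note on the hunk above: `parse_skill_md` accepts two shapes for `description:`, a single-line value (optionally quoted) and a YAML block scalar. The sketch below shows both. `read_frontmatter_description` is a hypothetical, condensed restatement of that logic, written only so the example runs without a `SKILL.md` on disk; it is not the packaged function.

```python
def read_frontmatter_description(text: str) -> str:
    """Condensed, hypothetical restatement of the description handling above."""
    lines = text.split("\n")
    if lines[0].strip() != "---":
        raise ValueError("missing opening ---")
    end = next((i for i, l in enumerate(lines[1:], start=1) if l.strip() == "---"), None)
    if end is None:
        raise ValueError("missing closing ---")
    front = lines[1:end]
    for i, line in enumerate(front):
        if not line.startswith("description:"):
            continue
        value = line[len("description:"):].strip()
        if value in (">", "|", ">-", "|-"):
            # Block scalar: gather the indented lines that follow, join with spaces.
            block = []
            for nxt in front[i + 1:]:
                if not nxt.startswith((" ", "\t")):
                    break
                block.append(nxt.strip())
            return " ".join(block)
        # Single-line value: drop surrounding quotes.
        return value.strip('"').strip("'")
    return ""


inline = '---\nname: demo\ndescription: "Short text"\n---\n# Body\n'
folded = "---\nname: demo\ndescription: >\n  First line\n  second line\n---\n# Body\n"
print(read_frontmatter_description(inline))  # Short text
print(read_frontmatter_description(folded))  # First line second line
```

Both folded (`>`) and literal (`|`) block scalars are flattened to one space-joined line here, as in the hunk above, so line breaks inside a literal block are not preserved the way a full YAML parser would preserve them.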
@@ -0,0 +1,186 @@
1
+ ---
2
+ skill-name: trace-diagnosis
3
+ user-invocable: true
4
+ description: Diagnose issues in Computer agent execution traces. Fetches MELTS traces, analyzes tool calls, errors, latency and response quality, creates an engineering ticket. Use when debugging agent failures or investigating why the Computer agent gave wrong answers.
5
+ arguments:
6
+ - name: context
7
+ description: Description of the issue or the conversation context where the agent failed
8
+ required: false
9
+ ---
10
+
11
+ # TraceDiagnosis
12
+
13
+ Diagnoses issues in Computer agent execution traces. Fetches MELTS traces for recent user messages, analyzes tool calls, errors, latency and response quality, and provides a triage summary.
14
+
15
+ ## Tools
16
+
17
+ 1. **FetchTraces** — Fetches MELTS execution traces for the current conversation. Returns the skill call chain, inputs, outputs, errors, timing, and token usage.
18
+ 2. **FetchKnowledgeBase** — Returns content from the **Computer Agent Common Issues** reference article only. Use it for known root causes and fixes. Do not use the customer org's knowledge base or any other org articles for this analysis.
19
+ 3. **CreateTicket** — Creates a ticket with the full technical analysis as the description.
20
+
21
+ ## Your Output to the User
22
+
23
+ Your text output should have two parts: (1) one line stating only the root cause, in plain language; (2) one line stating that their data has been used to report this bug to the DevRev team, along with the ticket ID. Keep it brief: 2 to 3 sentences total.
24
+
25
+ Example output:
26
+ "I analyzed the traces and found the issue: the agent used a search tool instead of a database query for your request. I've created ticket TKT-181 to track this for the engineering team."
27
+
28
+ Do not include in your output:
29
+ - Headings, sections, or markdown formatting
30
+ - "What should have happened" or correct execution paths
31
+ - Skill or tool names (KnowledgeSearch, NLToSQL, GetKGSchema, HybridSearch, ExecuteSQL, FetchObjectContext, etc.)
32
+ - Technical terms (SQL, namespace, DON ID, aggregation, semantic search, DQL, etc.)
33
+ - Step-by-step tables or bullet point breakdowns
34
+ - Code blocks or SQL examples
35
+ - Recommendations or workarounds
36
+ - Apologies or follow-up questions
37
+
38
+ All technical details go exclusively into the CreateTicket description where only engineers see them.
39
+
40
+ ## Task Sequence
41
+
42
+ 1. Call **FetchTraces** to get execution traces for this conversation. Check the result carefully — if the output contains "timeout", "error", or is empty, treat it as a failure even if the tool call itself reports success. If traces are unavailable, use the SHORT ticket template (Reference B2) instead of the full template.
43
+ 2. Call **FetchKnowledgeBase** with the error or failure pattern to find known root causes.
44
+ 3. Call **CreateTicket**:
45
+ - If you have trace data: use the FULL ticket template (Reference B1) with actual data from the traces.
46
+ - If traces are unavailable: use the SHORT ticket template (Reference B2). Do NOT use the full template and fill it with "not available".
47
+ 4. Output (1) one line with the root cause only, and (2) one line stating that the issue has been reported, with the ticket ID.
48
+
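The trace check in step 1 and the template choice in step 3 can be sketched as a small decision function. This is only an illustration of the rule as written: `traces_usable` and `choose_template` are hypothetical names, and the FetchTraces result is assumed to arrive as plain text, which the skill file does not specify.

```python
from typing import Optional

# Markers the task sequence says to treat as a failed fetch.
FAILURE_MARKERS = ("timeout", "error")


def traces_usable(result: Optional[str]) -> bool:
    """Return True only if the fetch result is non-empty and marker-free."""
    if result is None:
        return False
    text = result.strip()
    if not text:
        return False
    lowered = text.lower()
    return not any(marker in lowered for marker in FAILURE_MARKERS)


def choose_template(result: Optional[str]) -> str:
    """Map the fetch result to the ticket template named in the skill."""
    return "B1" if traces_usable(result) else "B2"


print(choose_template("Request timeout after 30s"))
print(choose_template('{"skills": []}'))
```

A literal substring match is blunt: a healthy trace that merely records a failed skill call would also contain the word "error". A real check would probably need to tell a failed fetch apart from a trace that describes a failure.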
49
+ ## Reference A: Failure Patterns
50
+
51
+ Use these to classify the failure when writing the CreateTicket description:
52
+
53
+ - **A: SQL Silent Fallback** — GetNodeSchema fails/wrong DON ID → HybridSearch used for analytical query → wrong counts
54
+ - **B: SQL Generation Error** — NLToSQL/ExecuteSQL → table/column not found, forbidden expression
55
+ - **C: Code Counting Error** — ExecuteSQL returns N rows → ExecuteCode → different count
56
+ - **D: Invalid Search Params** — HybridSearch → namespace not permitted, bad params
57
+ - **E: Object Fetch Failure** — FetchObjectContext → mongo no documents, invalid ID, 400/404 error
58
+ - **F: Permission Error** — GetKGSchema → "not authorised" → downstream blocked
59
+ - **G: Timeout/503** — Any skill → 503, timeout, deadline exceeded
60
+ - **H: Context Overflow** — Multiple calls succeed → sudden "unable to process"
61
+ - **I: No Skills Called** — Agent responded without calling any tools
62
+ - **J: Wrong Skill Selected** — Wrong tool for the job (e.g. HybridSearch for counts)
63
+
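Several of the patterns above are tied to a recognizable error string, so a first guess at the pattern letter can be made by keyword matching. The sketch below is a rough heuristic, not part of the skill: `suggest_pattern` is a hypothetical helper, and the exact substrings are assumptions derived from the wording of the list. Patterns A, C, I, and J depend on the shape of the call chain rather than on a message, so they are left out.

```python
# (pattern letter, substrings that hint at it), checked in order.
SIGNALS = [
    ("D", ("namespace not permitted",)),
    ("F", ("not authorised",)),
    ("E", ("no documents", "invalid id")),
    ("B", ("table not found", "column not found", "forbidden expression")),
    ("G", ("503", "timeout", "deadline exceeded")),
    ("H", ("unable to process",)),
]


def suggest_pattern(error_text: str) -> str:
    """Return a suspected pattern letter, or "Unknown" if nothing matches."""
    lowered = error_text.lower()
    for letter, needles in SIGNALS:
        if any(needle in lowered for needle in needles):
            return letter
    return "Unknown"


print(suggest_pattern("HybridSearch: namespace not permitted"))
print(suggest_pattern("something unexpected"))
```

The "Unknown" fallback mirrors the "Suspected pattern" field of the short template, which allows a pattern letter or "Unknown".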
64
+ ## Reference B1: Full Ticket Template (when trace data IS available)
65
+
66
+ **Title**: [TraceDiagnosis] — [what failed in ≤10 words]
67
+
68
+ **Description**:
69
+
70
+ ```
71
+ ## Summary
72
+ [2-3 sentences: what the user asked, what went wrong, what the impact was]
73
+
74
+ ## Failure Pattern
75
+ [Pattern letter]: [1 sentence explaining why this pattern was identified]
76
+
77
+ ## Turn 1 Analysis: "[user's exact query for turn 1]"
78
+
79
+ **Agent Response**: "[first 200 chars of agent's response]..."
80
+
81
+ ### Skill Calls
82
+
83
+ #### Skill Call 1: [SkillName]
84
+ - **Query**: [the search query, SQL query, or natural language input — verbatim from trace]
85
+ - **Namespace** (if applicable): [the namespace value passed]
86
+ - **Full Input**: [complete input JSON or all key fields]
87
+ - **Output**: [full output or key fields, truncate if >500 chars with "[truncated]"]
88
+ - **Duration**: [X]ms
89
+ - **Status**: Success / Failed ([HTTP status code])
90
+ - **Error**: [exact error message if failed, "none" if success]
91
+
92
+ [Continue for ALL skill calls in this turn.]
93
+
94
+ ### Turn 1 Errors
95
+ 1. **[SkillName] — [error type]**: [exact error message]
96
+ - Triggering input: [what caused it]
97
+ - Downstream impact: [what broke because of this]
98
+
99
+ ### Turn 1 Assessment
100
+ [What went wrong — 2-3 sentences]
101
+
102
+ **Token Usage**: [prompt_tokens] prompt / [completion_tokens] completion
103
+ **Total Duration**: [X]ms
104
+
105
+ ---
106
+
107
+ [Add more turns as needed]
108
+
109
+ ---
110
+
111
+ ## Root Cause Analysis
112
+
113
+ ### What Failed
114
+ [Name the specific skill, the error, and the input that caused it. Include DON IDs and error codes.]
115
+
116
+ ### Why It Failed
117
+ [Technical root cause. Reference KB match if found.]
118
+
119
+ ### Silent Fallbacks
120
+ [List cases where a skill failed and the agent silently switched approach.]
121
+
122
+ ### Knowledge Base Match
123
+ [Quote relevant root cause and fix from KB. If no match: "No match found."]
124
+
125
+ ## Impact
126
+ [How this affected the user]
127
+
128
+ ## Severity
129
+ [High / Medium / Low]
130
+
131
+ ## Recommendation
132
+ *Only if FetchKnowledgeBase found a match.*
133
+ - **Workaround**: [1 sentence]
134
+ - **Engineering Fix**: [1 sentence]
135
+ - **Known Issue**: [Yes/No]
136
+
137
+ ## Reproduction
138
+ - **DM ID**: [conversation ID]
139
+ - **Timestamp**: [when failing query was sent]
140
+ - **User**: [user name and DON ID]
141
+ - **Org**: [org slug / DON]
142
+ - **Agent**: [agent ID]
143
+ - **Repro query**: "[exact user query]"
144
+ ```
145
+
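The full template sets two length limits: the agent response is quoted up to its first 200 characters, and a skill output over 500 characters is cut and marked with "[truncated]". A minimal sketch of both rules follows; the helper names are hypothetical, and adding the trailing "..." only when text was actually cut is a choice made here, not something the template states.

```python
TRUNCATION_MARK = "[truncated]"


def truncate_output(text: str, limit: int = 500) -> str:
    """Keep skill output whole up to `limit` chars, else cut and mark it."""
    if len(text) <= limit:
        return text
    return text[:limit] + TRUNCATION_MARK


def response_preview(text: str, limit: int = 200) -> str:
    """Return the first `limit` chars of the agent response."""
    if len(text) <= limit:
        return text
    return text[:limit] + "..."


print(len(truncate_output("a" * 600)))
print(response_preview("short answer"))
```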
146
+ ## Reference B2: Short Ticket Template (when trace data is NOT available)
147
+
148
+ **Title**: [TraceDiagnosis] Trace Unavailable — [brief description]
149
+
150
+ **Description**:
151
+
152
+ ```
153
+ ## Summary
154
+ [2-3 sentences: what the user asked and what appeared to go wrong]
155
+
156
+ ## Trace Status
157
+ FetchTraces returned: [exact error message or "timeout"].
158
+
159
+ ## What We Know
160
+ - **User query**: "[exact user query]"
161
+ - **Agent response summary**: [brief summary]
162
+ - **Suspected pattern**: [Pattern letter or "Unknown"]
163
+ - **Knowledge base match**: [if found]
164
+
165
+ ## What Engineers Need To Do
166
+ 1. Pull MELTS traces manually
167
+ 2. Identify full skill call chain and error cascade
168
+ 3. Confirm suspected failure pattern
169
+
170
+ ## Reproduction
171
+ - **DM ID**: [if available]
172
+ - **User**: [user name and DON ID]
173
+ - **Org**: [org slug / DON]
174
+ - **Repro query**: "[exact user query]"
175
+ ```
176
+
177
+ ## Rules
178
+
179
+ 1. Your text output to the user is two parts only: root cause + ticket ID. No technical details.
180
+ 2. Always call FetchTraces first. Inspect the result for "timeout", "error", or empty output; treat any of these as a failure.
181
+ 3. Always create a ticket. Every diagnosis gets a ticket.
182
+ 4. Choose the right template: B1 when you have traces, B2 when you don't. Never fill B1 with "not available".
183
+ 5. When using B1: be exhaustive — engineers should never need to re-pull traces.
184
+ 6. When using B2: keep it short and actionable.
185
+ 7. Never fabricate trace data. Only include fields where you have actual data.
186
+ 8. For known root causes, use only FetchKnowledgeBase output. Do not search the customer org's data.