@aiagenta2z/agtm 1.0.9 → 1.1.1

@@ -82,6 +82,115 @@ To use the rate command, have to setup the benchmark levels configuration. save
82
82
  agtm setup --levels
83
83
  ```
84
84
 
85
+ #### Description
86
+ This skill runs the `agent rate` command line to evaluate skill outputs.
87
+
88
+ The Agtm Skills CLI manages local skill bundles for supported agents (for example `claude-code`, `codex`, `openclaw`). It can download skills from GitHub, install them into the correct agent folders, list what is installed, record run logs, and apply rating benchmarks.
89
+
90
+ It also serves as a benchmarking tool to evaluate skill outputs:
91
+ **Benchmark** your AI agent against real-world standards — from Google-level engineering to Apple-caliber product launches.
92
+ **Rate** performance of each run with structured scores and levels, helping agents like Claude Code choose the right skills more effectively.
93
+
94
+
95
+ #### Usage
96
+
97
+
98
+ Each time your agent runs a skill, it then runs the follow-up skill `agent-skills-evaluator`, which records
99
+ a log of the run (the input and a summary of the output) in a log-file-based memory.
100
+ It then calls `agtm skills log`, `agtm skills rate`, and `agtm skills rate show`.
101
+
102
+ - `agtm skills log`: Keeps track of skill runs in a local cached JSON log file.
103
+ - `agtm skills rate prepare`: Fetches the evaluator and `benchmarks.json`, and loads the evaluation criteria, such as job levels and task fulfillment.
104
+ - `agtm skills rate apply`: Appends the LLM-based evaluator's output to the local results.
105
+ - `agtm skills rate show`: Shows a table of historical scores and level ratings.
106
+
107
+ ```
108
+ agtm skills log <skill_id> --data '<json_payload>'
109
+ agtm skills rate prepare --skill_id <skill_id> --prompt "<eval_prompt>" --benchmark <path/benchmark.json>
110
+ agtm skills rate apply --skill_id <skill_id> --result '<result_json: log_id>'
111
+ agtm skills rate show --skill_id <skill_id>
112
+ ```
113
+
114
+ #### Example
115
+ Note: `code_success_skills` is a dummy skill that always produces successful results, and `code_fail_skills` is a dummy skill that always produces failure results.
116
+
117
+
118
+ ```shell
119
+ ## log command will output a log_id
120
+ agtm skills log code_success_skills --data '{"input":"generate sql","output":"ok","meta":{"agent":"claude-code"}}'
121
+ agtm skills rate prepare --skill_id code_success_skills --prompt "Evaluate the code execution results"
122
+ agtm skills rate apply --skill_id code_success_skills --result '{"results":[{"log_id":"3679a3fe-4d97-4eb1-83bc-f83d711be195","rating":0.90,"level":"L4"}]}'
123
+ agtm skills rate show ## show the historical skills dashboard, including score, evaluation levels
124
+ ```
125
+
126
+ Note:
127
+ - `agtm skills log` persists a run record at `.agtm/skills/log/<uuid>.json` (or in the `--logDir` you supply).
128
+ - `<json_payload>` must contain at least `input` and `output`; the optional fields `meta`, `rating`, and `level` are also accepted.
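The payload rule above can be checked before the CLI is ever invoked. The following is a minimal sketch, not part of agtm itself: the helper name `build_log_command` is hypothetical, and it only assembles the command string shown in the example, assuming `input` and `output` are the sole required fields.

```python
import json
import shlex

# Required per the note above; everything else (meta, rating, level) is optional.
REQUIRED_FIELDS = ("input", "output")


def build_log_command(skill_id: str, payload: dict) -> str:
    """Return a shell-safe `agtm skills log` command line for one run.

    Hypothetical helper: it does not call agtm, it only builds the string.
    """
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        raise ValueError(f"payload is missing required fields: {missing}")
    # Compact separators keep the JSON on one line without extra spaces.
    data = json.dumps(payload, separators=(",", ":"))
    # shlex.quote wraps the JSON so its quotes and spaces survive the shell.
    return f"agtm skills log {shlex.quote(skill_id)} --data {shlex.quote(data)}"


if __name__ == "__main__":
    print(build_log_command(
        "code_success_skills",
        {"input": "generate sql", "output": "ok", "meta": {"agent": "claude-code"}},
    ))
```

Quoting the JSON with `shlex.quote` rather than hand-written single quotes keeps the command valid even when an input or output string itself contains a quote character.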
129
+
130
+
131
+ #### Pipeline
132
+
133
+ **Step 1. Add log to memory**
134
+ ```
135
+ agtm skills log code_success_skills --data '{"input":"generate sql","output":"ok","meta":{"agent":"claude-code"}}'
136
+ agtm skills log code_fail_skills --data '{"input":"generate sql","output":"failure","meta":{"agent":"claude-code"}}'
137
+
138
+ ```
139
+
140
+ This generates a `{log_id}.json` file that serves as the memory record:
141
+ ```shell
142
+ ✅ Saved log to .agtm/skills/log/96c216f1-edc5-40f3-b041-b01a68b137a1.json
143
+ ```
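Because each run is saved as `{log_id}.json`, the available `log_id` values can be recovered from the file names alone. The sketch below assumes the default `.agtm/skills/log` directory and that the file stem equals the `log_id`; `list_log_ids` is a hypothetical helper, and the internal schema of the log files is deliberately not read because it is not documented here.

```python
from pathlib import Path


def list_log_ids(log_dir: str = ".agtm/skills/log") -> list[str]:
    """Return the log_ids found in log_dir, taken from the <uuid>.json file names."""
    directory = Path(log_dir)
    if not directory.is_dir():
        # No logs have been written yet (or a different --logDir was used).
        return []
    return sorted(path.stem for path in directory.glob("*.json"))


if __name__ == "__main__":
    for log_id in list_log_ids():
        print(log_id)
```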
144
+
145
+ **Step 2. Prepare Evaluation prompt**
146
+
147
+ Prepare the `(<input, output>, benchmark)` pairs so that the LLM can compare each `<input, output>` with the benchmark.
148
+
149
+ ```shell
150
+ agtm skills rate prepare --skill_id code_success_skills --prompt "Evaluate the code execution results"
151
+
152
+ agtm skills rate prepare --skill_id code_fail_skills --prompt "Evaluate the code execution results"
153
+ ```
154
+
155
+ ```shell
156
+ {"skill_id":"code_success_skills","benchmarks":[{"software-engineering":{"Google":[{"level":"L3","title":"Software Engineer II","description":"Entry-level engineer. Delivers well-scoped tasks with guidance. Learning codebase, tools, and best practices.","signals":["task execution","learning velocity","code quality basics"]},{"level":"L4","title":"Software Engineer III","description":"Independent contributor. Owns small features end-to-end. Writes maintainable code and participates in design discussions.","signals":["ownership","code quality","debugging ability"]},{"level":"L5","title":"Senior Software Engineer","description":"Leads projects and drives design decisions. Mentors others and improves system quality.","signals":["technical leadership","system design","mentorship"]},{"level":"L6","title":"Staff Software Engineer","description":"Owns large systems or cross-team initiatives. Sets technical direction and influences multiple teams.","signals":["architecture","cross-team impact","scalability thinking"]},{"level":"L7","title":"Senior Staff Software Engineer","description":"Drives org-level technical strategy. Solves ambiguous, high-impact problems.","signals":["org influence","complex problem solving","long-term vision"]},{"level":"L8","title":"Principal Engineer","description":"Company-wide impact. Defines technical standards and long-term architecture.","signals":["company impact","vision","industry-level thinking"]}]}}],"logs":[{"log_id":"1db0e927-79f1-46c2-b6dd-200d567f631d","input":"generate sql","output":"ok"},{"log_id":"94a2fae9-80ff-4b18-a77a-5714d34bcc20","input":"generate sql","output":"ok"},{"log_id":"96c216f1-edc5-40f3-b041-b01a68b137a1","input":"generate sql","output":"ok"},{"log_id":"b1f76f33-6f45-41e3-ae14-6b598f6aa357","input":"generate sql","output":"ok"}],"instructions":"System Prompt: You are an evaluator of skill performance. Score each example from 0.0 to 1.0 and assign a level based on benchmarks. Return JSON only. 
Please output json in format of {\"skill_id\": <skill_id>, \"results\": [{\"log_id\": \"<log_id_1>\", \"score\": 1.0, \"level\": \"L3\", **extra},{\"log_id\": \"<log_id_2>\", \"score\": 1.0, \"level\": \"L3\", **extra}]}\nUser prompt: Evaluate the code execution results"}
157
+
158
+ {"skill_id":"code_fail_skills","benchmarks":[{"software-engineering":{"Google":[{"level":"L3","title":"Software Engineer II","description":"Entry-level engineer. Delivers well-scoped tasks with guidance. Learning codebase, tools, and best practices.","signals":["task execution","learning velocity","code quality basics"]},{"level":"L4","title":"Software Engineer III","description":"Independent contributor. Owns small features end-to-end. Writes maintainable code and participates in design discussions.","signals":["ownership","code quality","debugging ability"]},{"level":"L5","title":"Senior Software Engineer","description":"Leads projects and drives design decisions. Mentors others and improves system quality.","signals":["technical leadership","system design","mentorship"]},{"level":"L6","title":"Staff Software Engineer","description":"Owns large systems or cross-team initiatives. Sets technical direction and influences multiple teams.","signals":["architecture","cross-team impact","scalability thinking"]},{"level":"L7","title":"Senior Staff Software Engineer","description":"Drives org-level technical strategy. Solves ambiguous, high-impact problems.","signals":["org influence","complex problem solving","long-term vision"]},{"level":"L8","title":"Principal Engineer","description":"Company-wide impact. Defines technical standards and long-term architecture.","signals":["company impact","vision","industry-level thinking"]}]}}],"logs":[{"log_id":"2e5513e7-27ae-4636-9d21-4b57ec9f739b","input":"generate sql","output":"failure"},{"log_id":"563747fb-ea62-4ebc-80c4-1bc1d1c82ed5","input":"generate sql","output":"failure"},{"log_id":"db699754-b1fd-491c-a49f-2af1a41ad1f7","input":"generate sql","output":"failure"}],"instructions":"System Prompt: You are an evaluator of skill performance. Score each example from 0.0 to 1.0 and assign a level based on benchmarks. Return JSON only. 
Please output json in format of {\"skill_id\": <skill_id>, \"results\": [{\"log_id\": \"<log_id_1>\", \"score\": 1.0, \"level\": \"L3\", **extra},{\"log_id\": \"<log_id_2>\", \"score\": 1.0, \"level\": \"L3\", **extra}]}\nUser prompt: Evaluate the code execution results"}
159
+ ```
160
+
161
+ **Step 3. Local Agent Runs the Evaluation Prompt from Step 2**
162
+
163
+ Your agent assigns `{"score": double, "level": str}` to each `log_id`:
164
+ ```
165
+ {"skill_id":"code_success_skills","results":[{"log_id":"1db0e927-79f1-46c2-b6dd-200d567f631d","score":1,"level":"L3","rationale":"Successfully executed a well-scoped task generate sql. Matches entry-level performance criteria for task execution."},{"log_id":"94a2fae9-80ff-4b18-a77a-5714d34bcc20","score":1,"level":"L3","rationale":"Successfully executed a well-scoped task generate sql. Matches entry-level performance criteria for task execution."},{"log_id":"96c216f1-edc5-40f3-b041-b01a68b137a1","score":1,"level":"L3","rationale":"Successfully executed a well-scoped task generate sql. Matches entry-level performance criteria for task execution."},{"log_id":"b1f76f33-6f45-41e3-ae14-6b598f6aa357","score":1,"level":"L3","rationale":"Successfully executed a well-scoped task generate sql. Matches entry-level performance criteria for task execution."}]}
166
+
167
+ {"skill_id":"code_fail_skills","results":[{"log_id":"2e5513e7-27ae-4636-9d21-4b57ec9f739b","score":0,"level":"L3"},{"log_id":"563747fb-ea62-4ebc-80c4-1bc1d1c82ed5","score":0,"level":"L3"},{"log_id":"db699754-b1fd-491c-a49f-2af1a41ad1f7","score":0,"level":"L3"}]}
168
+ ```
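Since the evaluator output comes from an LLM, it may be worth checking its shape before handing it to `agtm skills rate apply`. The sketch below is one possible check, assuming the result format shown in the prompt instructions (`log_id`, a `score` between 0.0 and 1.0, and a string `level`); `validate_results` is a hypothetical helper, and agtm may perform its own validation.

```python
import json


def validate_results(raw: str, expected_log_ids: set[str]) -> dict:
    """Parse evaluator JSON and check every result entry.

    Raises ValueError on an unknown log_id, an out-of-range score,
    or a non-string level. Returns the parsed dict when everything passes.
    """
    parsed = json.loads(raw)
    for result in parsed["results"]:
        log_id = result["log_id"]
        if log_id not in expected_log_ids:
            raise ValueError(f"unknown log_id: {log_id}")
        score = result["score"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"score for {log_id} must be a number")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {log_id} is outside [0.0, 1.0]: {score}")
        if not isinstance(result["level"], str):
            raise ValueError(f"level for {log_id} must be a string")
    return parsed
```

Extra keys such as `rationale` pass through untouched, which matches the `**extra` placeholder in the prompt instructions.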
169
+
170
+ **Step 4. Apply Results to Local Log Status**
171
+
172
+ ```shell
173
+ agtm skills rate apply --skill_id code_success_skills --result '{"skill_id":"code_success_skills","results":[{"log_id":"1db0e927-79f1-46c2-b6dd-200d567f631d","score":1,"level":"L3","rationale":"Successfully executed a well-scoped task generate sql. Matches entry-level performance criteria for task execution."},{"log_id":"94a2fae9-80ff-4b18-a77a-5714d34bcc20","score":1,"level":"L3","rationale":"Successfully executed a well-scoped task generate sql. Matches entry-level performance criteria for task execution."},{"log_id":"96c216f1-edc5-40f3-b041-b01a68b137a1","score":1,"level":"L3","rationale":"Successfully executed a well-scoped task generate sql. Matches entry-level performance criteria for task execution."},{"log_id":"b1f76f33-6f45-41e3-ae14-6b598f6aa357","score":1,"level":"L3","rationale":"Successfully executed a well-scoped task generate. Matches entry-level performance criteria for task execution."}]}'
174
+
175
+ agtm skills rate apply --skill_id code_fail_skills --result '{"skill_id":"code_fail_skills","results":[{"log_id":"2e5513e7-27ae-4636-9d21-4b57ec9f739b","score":0,"level":"L3"},{"log_id":"563747fb-ea62-4ebc-80c4-1bc1d1c82ed5","score":0,"level":"L3"},{"log_id":"db699754-b1fd-491c-a49f-2af1a41ad1f7","score":0,"level":"L3"}]}'
176
+ ```
177
+
178
+ **Step 5. Show Final Results (Optional)**
179
+ ```shell
180
+ agtm skills rate show
181
+ ```
182
+
183
+
184
+ ```shell
185
+ skill_id             run_times  score  level
186
+ -------------------  ---------  -----  -----
187
+ code_fail_skills     3          0.00   L3
188
+ code_success_skills  4          1.00   L3
189
+ ```
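The `run_times` and `score` columns above are consistent with counting the rated runs per skill and averaging their scores. The sketch below reproduces those two columns under that assumption; how agtm actually aggregates scores, and how it picks the `level` column, is not stated here, so `summarize` and `format_row` are illustrative names only.

```python
from statistics import mean


def summarize(results_by_skill: dict[str, list[dict]]) -> list[tuple[str, int, float]]:
    """Return (skill_id, run_times, mean score) rows, sorted by skill_id.

    Assumption: the dashboard score is the arithmetic mean of per-run scores.
    Skills without any rated run are skipped.
    """
    rows = []
    for skill_id in sorted(results_by_skill):
        scores = [result["score"] for result in results_by_skill[skill_id]]
        if scores:
            rows.append((skill_id, len(scores), mean(scores)))
    return rows


def format_row(row: tuple[str, int, float]) -> str:
    """Render one row with left-aligned columns and a two-decimal score."""
    skill_id, run_times, score = row
    return f"{skill_id:<21}{run_times:<11}{score:.2f}"


if __name__ == "__main__":
    example = {
        "code_success_skills": [{"score": 1}] * 4,
        "code_fail_skills": [{"score": 0}] * 3,
    }
    for row in summarize(example):
        print(format_row(row))
```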
190
+
191
+
192
+ #### CLI Documentation
193
+
85
194
  #### Usage
86
195
  ```
87
196
  agtm skills rate prepare --skill_id <skill_id> --prompt "<eval_prompt>" --benchmark <path/benchmark.json>
@@ -131,6 +240,7 @@ write your `customized_agent_benchmark.json` following the formats
131
240
  }
132
241
  ```
133
242
 
243
+
134
244
  ## Supported Agents
135
245
  We provide the same local skills folder layout as the vercel/skills package.
136
246
  Skills can be installed to any of these agents
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@aiagenta2z/agtm",
3
- "version": "1.0.9",
3
+ "version": "1.1.1",
4
4
  "description": "agtm: CLI Tool for AI Agent Management, Skills, Agent Registry, Benchmarks and Hints in AI Agent Marketplace\n",
5
5
  "main": "dist/agtm-cli.js",
6
6
  "type": "module",