@hanfour.huang/caliber 0.1.2
- package/README.md +667 -0
- package/dist/analyzers/section.d.ts +9 -0
- package/dist/analyzers/section.d.ts.map +1 -0
- package/dist/analyzers/section.js +503 -0
- package/dist/analyzers/section.js.map +1 -0
- package/dist/analyzers/usage.d.ts +3 -0
- package/dist/analyzers/usage.d.ts.map +1 -0
- package/dist/analyzers/usage.js +141 -0
- package/dist/analyzers/usage.js.map +1 -0
- package/dist/cli.d.ts +3 -0
- package/dist/cli.d.ts.map +1 -0
- package/dist/cli.js +295 -0
- package/dist/cli.js.map +1 -0
- package/dist/config.d.ts +19 -0
- package/dist/config.d.ts.map +1 -0
- package/dist/config.js +156 -0
- package/dist/config.js.map +1 -0
- package/dist/data-quality.d.ts +3 -0
- package/dist/data-quality.d.ts.map +1 -0
- package/dist/data-quality.js +54 -0
- package/dist/data-quality.js.map +1 -0
- package/dist/extractors/claude-code.d.ts +6 -0
- package/dist/extractors/claude-code.d.ts.map +1 -0
- package/dist/extractors/claude-code.js +216 -0
- package/dist/extractors/claude-code.js.map +1 -0
- package/dist/extractors/codex.d.ts +5 -0
- package/dist/extractors/codex.d.ts.map +1 -0
- package/dist/extractors/codex.js +184 -0
- package/dist/extractors/codex.js.map +1 -0
- package/dist/i18n.d.ts +53 -0
- package/dist/i18n.d.ts.map +1 -0
- package/dist/i18n.js +163 -0
- package/dist/i18n.js.map +1 -0
- package/dist/period.d.ts +5 -0
- package/dist/period.d.ts.map +1 -0
- package/dist/period.js +25 -0
- package/dist/period.js.map +1 -0
- package/dist/reporters/report.d.ts +7 -0
- package/dist/reporters/report.d.ts.map +1 -0
- package/dist/reporters/report.js +440 -0
- package/dist/reporters/report.js.map +1 -0
- package/dist/standard.d.ts +5 -0
- package/dist/standard.d.ts.map +1 -0
- package/dist/standard.js +98 -0
- package/dist/standard.js.map +1 -0
- package/dist/types.d.ts +216 -0
- package/dist/types.d.ts.map +1 -0
- package/dist/types.js +3 -0
- package/dist/types.js.map +1 -0
- package/dist/utils.d.ts +3 -0
- package/dist/utils.d.ts.map +1 -0
- package/dist/utils.js +17 -0
- package/dist/utils.js.map +1 -0
- package/package.json +59 -0
- package/templates/eval-standard.json +174 -0
package/README.md
ADDED
@@ -0,0 +1,667 @@
# Caliber

**Measure the caliber of your AI-assisted engineering.** A self-hostable gateway, audit log, and evaluator for teams that want to know exactly what their AI coding assistants are doing, and how well.

---

## Why

Engineering managers need evidence-based data to evaluate how effectively their team uses AI coding assistants. Manually reviewing hundreds of AI sessions is impractical. This tool automates the process by:

1. **Extracting** usage data from local Claude Code (`~/.claude/`) and Codex (`~/.codex/`) storage
2. **Analyzing** session patterns for decision-making quality and risk identification
3. **Scoring** against a configurable evaluation standard (default: OneAD R&D standard)
4. **Generating** structured reports with evidence and score recommendations

---

## Features

- Reads Claude Code session metadata, facets, SQLite cost data, and JSONL conversations
- Reads Codex SQLite thread data (tokens, models, sessions)
- Detects decision-making patterns (iterative refinement, multi-task coordination, active corrections)
- Detects risk identification signals (security awareness, performance discussions, bug catching)
- Configurable evaluation standard: bring your own criteria, keywords, and thresholds
- Multiple output formats: terminal (colored), JSON, Markdown, HTML
- JSON output is machine-parseable (`--format json` emits clean JSON to stdout; progress logs go to stderr)
- Noise filtering to exclude system messages and code review templates from analysis
- `init-standard` command to export the default standard as a customization template
- Data quality warnings when data sources are missing or incomplete

---

## Platform mode

Starting with **v0.2.0**, Caliber also ships as a self-hostable web platform with
organization-scoped RBAC, invites, and an audit log. Use this mode if you want
a shared workspace for a team rather than a per-engineer CLI report.

**v0.3.0** adds an opt-in **gateway** that proxies Anthropic-native
(`/v1/messages`) and OpenAI-compatible (`/v1/chat/completions`) traffic
through a shared pool of upstream accounts:

- Admins donate `sk-ant-...` API keys or OAuth bundles extracted from Claude
  Code; the gateway's scheduler picks one per request based on priority,
  concurrency, and rate-limit state.
- Each user self-issues or receives an admin-issued platform API key
  (`ak_...`) that authenticates against the gateway.
- Usage and cost (per Anthropic/OpenAI token pricing) land in a `usage_logs`
  table, surfaced via per-user and per-org dashboards.
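
The account-selection step described above could look roughly like the sketch below. It is an illustration only, not the gateway's actual scheduler: the `UpstreamAccount` shape, its field names, the convention that a lower `priority` number wins, and the `pickAccount` function are all assumptions made for this example.

```typescript
// Hypothetical shape of one pooled upstream account (field names are assumptions).
interface UpstreamAccount {
  id: string;
  priority: number;         // assumed: lower number = preferred
  inFlight: number;         // requests currently using this account
  maxConcurrency: number;   // assumed per-account concurrency cap
  rateLimitedUntil: number; // epoch ms; 0 when not rate-limited
}

// Pick one account for a request: drop accounts that are rate-limited or at
// their concurrency cap, then prefer the best priority, then the least loaded.
function pickAccount(pool: UpstreamAccount[], now: number): UpstreamAccount | undefined {
  const eligible = pool.filter(
    (a) => a.rateLimitedUntil <= now && a.inFlight < a.maxConcurrency,
  );
  eligible.sort((a, b) => a.priority - b.priority || a.inFlight - b.inFlight);
  return eligible[0];
}

const pool: UpstreamAccount[] = [
  { id: "acct-a", priority: 1, inFlight: 2, maxConcurrency: 2, rateLimitedUntil: 0 },
  { id: "acct-b", priority: 2, inFlight: 0, maxConcurrency: 4, rateLimitedUntil: 0 },
  { id: "acct-c", priority: 1, inFlight: 0, maxConcurrency: 4, rateLimitedUntil: 5000 },
];

// At t=1000, acct-a is saturated and acct-c is still rate-limited.
console.log(pickAccount(pool, 1000)?.id);
```

Filtering before sorting keeps the "is this account usable at all" question separate from the "which usable account is best" question, which makes each rule easy to change on its own.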

**v0.4.0** adds an opt-in **evaluator** subsystem for performance evaluation
(gated behind the `ENABLE_EVALUATOR` feature flag):

- **Content capture opt-in**: organization-level toggle; members see their
  captured usage on `/dashboard/profile/evaluation`. 90-day default retention
  (per-org override: 30/60/90). AES-256-GCM encryption with domain-separated
  HKDF keys.
- **Dual-layer evaluation**: rule-based scoring (always-on) plus optional LLM
  Deep Analysis (per-org opt-in). Deep Analysis costs are dogfooded via
  self-gateway loopback and tracked in `usage_logs`.
- **Admin-customizable scoring rubrics**: platform defaults seeded for
  English, Traditional Chinese, and Japanese; organizations can define custom
  rubrics with dry-run preview. Signals are a Zod-validated discriminated union
  (keywords, thresholds, refusal rates, client mix, model diversity, cache
  patterns, extended thinking, tool variety, iteration counts), **extended
  in v0.5.0** with six facet-based signal types.
- **GDPR member-initiated delete request workflow**: members request deletion
  and org admins approve (requests are auto-rejected after 30 days). Retention
  purge and GDPR execution run on separate cron workers.
- **Labor-law-friendly transparency**: members always see their own full
  evaluation report; team managers see redacted team views (LLM analysis
  fields nulled unless they are also org admins). Leaderboard visibility is
  opt-in per organization (privacy default).

**v0.5.0** extends the v0.4.0 evaluator with **per-org LLM cost budgeting**
and **opt-in LLM facet extraction** (gated behind the `ENABLE_FACET_EXTRACTION`
env flag plus per-org `llm_facet_enabled`). All v0.4.0 behaviour is preserved
when both flags are off.

- **Cost budget infrastructure**: every org gets `llm_monthly_budget_usd`
  and `llm_budget_overage_behavior` (`degrade` skips over-budget calls,
  `halt` stops all LLM calls until the next UTC month). Spend is tracked per
  call in a new `llm_usage_events` ledger, summed per UTC month, and enforced
  before each LLM call. A cost dashboard at
  `/dashboard/organizations/<id>/evaluator/costs` shows breakdowns by task,
  model, and 6-month history; a compact widget appears on the evaluator
  status page.
- **LLM facet extraction**: opt-in second LLM pass that classifies each
  evaluation window's sessions into structured JSON
  (`{sessionType, outcome, claudeHelpfulness, frictionCount, bugsCaughtCount,
  codexErrorsCount}`). Extracted rows are persisted to the
  `request_body_facets` table with a `prompt_version` cache so the same LLM
  call doesn't fire twice. Deterministic failures (parse / validation /
  timeout) write an error row so they are not retried; transient failures
  (5xx, budget hit) are skipped silently.
- **Six new rubric signal types** consume the facet aggregate:
  `facet_claude_helpfulness`, `facet_friction_per_session`,
  `facet_bugs_caught`, `facet_codex_errors`, `facet_outcome_success_rate`,
  `facet_session_type_ratio`. Custom rubrics can opt in today; the rubric
  editor ships an in-form Signal types reference.
- **Platform default rubrics bumped to v1.1.0**: strictly additive. Each
  section gains one facet support signal (`facet_outcome_success_rate` for
  `interaction`, `facet_bugs_caught` for `riskControl`). Orgs without facet
  extraction see zero scoring change.
- **Report-page facet drill-down**: when facet rows exist for the period,
  the user's evaluation report shows session-type distribution, success
  rate, average helpfulness, and bug/friction/Codex counters. The panel is
  hidden silently when no rows exist.
- **Observability artifacts** shipped under `ops/`: 3 Grafana dashboards
  (evaluator / body-capture / GDPR), 11 Prometheus alert rules, 9 runbooks
  in `docs/runbooks/`, and a post-release smoke workflow that auto-creates a
  `release-blocker` issue when the canary fails.
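
The budget enforcement described in the first bullet can be sketched as below. Only the names `degrade`, `halt`, and the per-UTC-month summation come from the text; the `UsageEvent` shape, the `mayCallLlm` function, the idea of an estimated per-call cost, and the exact comparison rules are assumptions made for illustration.

```typescript
type OverageBehavior = "degrade" | "halt";

// Hypothetical ledger row; the real llm_usage_events schema is not shown here.
interface UsageEvent {
  costUsd: number;
  at: Date;
}

// Sum spend for the UTC calendar month that contains `now`.
function monthSpendUsd(events: UsageEvent[], now: Date): number {
  const year = now.getUTCFullYear();
  const month = now.getUTCMonth();
  return events
    .filter((e) => e.at.getUTCFullYear() === year && e.at.getUTCMonth() === month)
    .reduce((sum, e) => sum + e.costUsd, 0);
}

// Decide, before an LLM call, whether it may proceed.
// Assumed semantics: "halt" blocks everything once the budget is used up;
// "degrade" skips only the calls that would push spend past the budget.
function mayCallLlm(
  events: UsageEvent[],
  now: Date,
  budgetUsd: number,
  behavior: OverageBehavior,
  estimatedCostUsd: number,
): boolean {
  const spent = monthSpendUsd(events, now);
  if (behavior === "halt") return spent < budgetUsd;
  return spent + estimatedCostUsd <= budgetUsd;
}

const ledger: UsageEvent[] = [
  { costUsd: 100, at: new Date(Date.UTC(2026, 1, 20)) }, // February: ignored in March
  { costUsd: 4, at: new Date(Date.UTC(2026, 2, 3)) },
  { costUsd: 5, at: new Date(Date.UTC(2026, 2, 9)) },
];
const today = new Date(Date.UTC(2026, 2, 15));
console.log(monthSpendUsd(ledger, today), mayCallLlm(ledger, today, 10, "degrade", 2));
```

Summing from the ledger on every check keeps the budget decision derived from a single source of record; a production system would likely cache or index this, which is outside the scope of the sketch.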

See [`docs/UPGRADE-v0.5.0.md`](docs/UPGRADE-v0.5.0.md) for the upgrade
playbook (migrations 0004-0007, env flags, three-tier rollback).

Quick start:
```sh
cd docker
cp .env.example .env                    # fill in OAuth + bootstrap email (+ gateway secrets if enabling)
docker compose up -d                    # api + web + postgres + redis
docker compose --profile gateway up -d  # opt-in: add gateway service
```

Images are published on every `v*` tag to:

| Image | amd64 | arm64 |
|-------|-------|-------|
| `ghcr.io/hanfour/caliber-api` | ✅ | ✅ |
| `ghcr.io/hanfour/caliber-gateway` (new in v0.3.0) | ✅ | ✅ |
| `ghcr.io/hanfour/caliber-web` | ✅ | ❌ (dropped in v0.5.0; QEMU cross-build was unstable) |

Operator guides:

- **Try locally first**: [`docs/LOCAL_DEPLOY.md`](docs/LOCAL_DEPLOY.md), a 5-minute path on your laptop that escalates to on-prem production
- Self-hosting bring-up (api + web + gateway): [`docs/SELF_HOSTING.md`](docs/SELF_HOSTING.md)
- Gateway operator + user reference: [`docs/GATEWAY.md`](docs/GATEWAY.md)

**Cloud deploy templates** (alternatives to docker-compose self-hosting):

- [Render Blueprint](deploy/render/README.md): closest thing to one-click; provisions Postgres + 3 services; needs Upstash Redis externally
- [Fly.io](deploy/fly/README.md): three apps + Fly Postgres + Upstash; geographically distributed if you want
- [Railway](deploy/railway/README.md): native Postgres + Redis plugins; manual service creation per the README

⚠️ Vercel is **not supported**: the gateway is a long-running Fastify
server with BullMQ workers, which doesn't fit Vercel's serverless model. See
the deploy/ READMEs for what does work.

CLI mode and platform mode share no runtime state; pick whichever fits.

> **First time trying platform mode?** Start with
> [`docs/GETTING_STARTED.md`](docs/GETTING_STARTED.md), a 30-minute
> end-to-end walkthrough that takes a fresh checkout to a working
> personal AI gateway sharing your Claude.ai Pro/Max subscription
> across all your devices.

---

## Data Sources

| Source | Path | Data |
|--------|------|------|
| Claude Code Session Meta | `~/.claude/usage-data/session-meta/*.json` | Tokens, duration, tools, languages, git commits, first prompt |
| Claude Code Facets | `~/.claude/usage-data/facets/*.json` | AI-generated session analysis: goals, outcomes, friction, helpfulness |
| Claude Code SQLite | `~/.claude/__store.db` | Per-message cost (USD), model, duration |
| Claude Code JSONL | `~/.claude/projects/*/*.jsonl` | Full conversation content for keyword signal scanning |
| Codex SQLite | `~/.codex/state_5.sqlite` | Threads: tokens_used, model, title, git info |
| Codex History | `~/.codex/history.jsonl` | Full user prompts by thread/session |
| Codex Logs | `~/.codex/logs_2.sqlite` | Thread-level tool calls and error events |

All data is read **locally and read-only**. No data is sent to any external service.

---

## Prerequisites

- **Node.js** >= 18
- **npm** (included with Node.js)
- `~/.claude/` directory (from Claude Code usage)
- `~/.codex/` directory (from Codex CLI usage, optional)

---

## Installation

### Recommended: Install from npm

```bash
npm install -g @hanfour.huang/caliber

# Verify installation
caliber --version
```

### Update

```bash
npm install -g @hanfour.huang/caliber@latest
```

Caliber uses `~/.caliber.json` for CLI settings. On first run after upgrading
from the older `aide` CLI, an existing `~/.aide.json` file is read, migrated to
`~/.caliber.json`, and reported with a one-time deprecation notice.

### Existing local-clone users

If you previously installed from a cloned repo or `npm link`, migrate to the npm package:

```bash
npm unlink -g aide 2>/dev/null || npm uninstall -g @hanfour.huang/aide
npm install -g @hanfour.huang/caliber@latest
```

### Development mode

```bash
git clone https://github.com/hanfour/caliber.git ~/caliber
cd ~/caliber
npm install
npx tsx src/cli.ts --help
```

---

## Quick Start

```bash
# Quick usage summary (last 7 days)
caliber summary

# Full evaluation report (last 30 days, terminal output)
caliber report

# Save report as Markdown
caliber report --format markdown --output report.md

# Save report as HTML
caliber report --format html --output report.html

# Monthly KPI report
caliber monthly
```

---

## Usage

### Quick Summary

```bash
# Last 7 days (default)
caliber summary

# Custom date range
caliber summary --since 2026-03-01 --until 2026-03-31
```

Output:

```
AI Dev Usage Summary
Period: 2026-03-01 ~ 2026-03-31

Claude Code
Sessions: 57
Tokens: 259,336
Duration: 15676 min
Active Days: 9

Codex
Sessions: 1
Tokens: 368,930
Active Days: 1
```

### Full Evaluation Report

```bash
# Default: last 30 days, text format, built-in OneAD standard
caliber report

# Current calendar month
caliber monthly

# Previous full calendar month
caliber monthly --previous

# Current calendar quarter
caliber quarterly

# Previous full calendar quarter
caliber quarterly --previous

# Custom date range
caliber report --since 2026-03-01 --until 2026-04-14

# Output as Markdown file
caliber report --format markdown --output report.md

# Output as HTML file
caliber report --format html --output report.html

# Output as JSON (machine-parseable, clean stdout)
caliber report --format json --output report.json

# Pipe JSON for programmatic consumption
caliber report --format json 2>/dev/null | jq '.sections[].score'

# Use a custom evaluation standard
caliber report --standard my-standard.json

# Include engineer/department metadata in report
caliber report --engineer "Jane Doe" --department "R&D"
```

> **Note:** When using `--format json`, progress and status messages are written to stderr.
> stdout contains only the JSON report, making it safe to pipe to `jq` or other tools.
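
Because stdout carries only JSON, the captured output can be parsed directly in a script. The sketch below mirrors the `jq '.sections[].score'` example above; the only report structure it relies on is a top-level `sections` array whose items have a `score`, and treating `score` as a number is an assumption. The `ReportLike` type and `sectionScores` function are hypothetical names, not part of the package.

```typescript
// Minimal view of the report: only `sections[].score` is implied by the jq
// example; every other field of the real report is deliberately left out.
interface ReportLike {
  sections: Array<{ score: number }>;
}

// Parse captured stdout and return one score per evaluation section.
function sectionScores(stdout: string): number[] {
  const parsed = JSON.parse(stdout) as Partial<ReportLike> | null;
  if (!parsed || !Array.isArray(parsed.sections)) {
    throw new Error("unexpected report shape: missing sections[]");
  }
  return parsed.sections.map((section) => section.score);
}

// Stand-in for the text captured from `caliber report --format json`.
const sampleStdout = '{"sections":[{"score":100},{"score":120}]}';
console.log(sectionScores(sampleStdout));
```

Failing loudly on an unexpected shape is a deliberate choice here: if progress text ever leaked into stdout, `JSON.parse` would throw rather than silently produce partial data.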

### Using the compiled CLI

If you are developing locally and have run `npm run build`, you can use `node dist/cli.js`:

```bash
node dist/cli.js report --since 2026-03-01 --until 2026-03-31
node dist/cli.js summary
node dist/cli.js monthly --previous --format markdown --output march.md
node dist/cli.js report --format html --output report.html
```

---

## CLI Reference

### `caliber report`

Generate a full evaluation report.

```
Options:
  -s, --since <date>     Start date, YYYY-MM-DD (default: 30 days ago)
  -u, --until <date>     End date, YYYY-MM-DD (default: today)
  -f, --format <format>  Output: text | json | markdown | html (default: text)
  -o, --output <file>    Write report to file instead of stdout
  --standard <path>      Path to custom evaluation standard JSON
  --engineer <name>      Engineer name for report identification
  --department <name>    Department name for report identification
```

### `caliber summary`

Quick usage summary for a date range.

```
Options:
  -s, --since <date>     Start date, YYYY-MM-DD (default: 7 days ago)
  -u, --until <date>     End date, YYYY-MM-DD (default: today)
```

### `caliber monthly`

Generate a monthly KPI report.

```
Options:
  -f, --format <format>  Output: text | json | markdown | html (default: text)
  -o, --output <file>    Write report to file instead of stdout
  --standard <path>      Path to custom evaluation standard JSON
  --previous             Use the previous full calendar month
```

### `caliber quarterly`

Generate a quarterly KPI report.

```
Options:
  -f, --format <format>  Output: text | json | markdown | html (default: text)
  -o, --output <file>    Write report to file instead of stdout
  --standard <path>      Path to custom evaluation standard JSON
  --previous             Use the previous full calendar quarter
```
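
The `--previous` flag for `monthly` and `quarterly` amounts to shifting a calendar period back by one. A minimal sketch of that date arithmetic follows; it is not the package's `period` module. The `Period` type, the function names, the use of UTC, and the choice to return the whole calendar period for the current month or quarter are assumptions.

```typescript
interface Period {
  since: string; // YYYY-MM-DD
  until: string; // YYYY-MM-DD
}

const toIsoDate = (d: Date): string => d.toISOString().slice(0, 10);

// Calendar month containing `today`, or the month before it when `previous` is set.
function resolveMonth(today: Date, previous = false): Period {
  const year = today.getUTCFullYear();
  const month = today.getUTCMonth() - (previous ? 1 : 0);
  // Date.UTC normalizes out-of-range values: month -1 is December of the prior
  // year, and day 0 of month + 1 is the last day of `month`.
  return {
    since: toIsoDate(new Date(Date.UTC(year, month, 1))),
    until: toIsoDate(new Date(Date.UTC(year, month + 1, 0))),
  };
}

// Calendar quarter containing `today`, or the quarter before it when `previous` is set.
function resolveQuarter(today: Date, previous = false): Period {
  const year = today.getUTCFullYear();
  const quarter = Math.floor(today.getUTCMonth() / 3) - (previous ? 1 : 0);
  return {
    since: toIsoDate(new Date(Date.UTC(year, quarter * 3, 1))),
    until: toIsoDate(new Date(Date.UTC(year, quarter * 3 + 3, 0))),
  };
}

const reference = new Date(Date.UTC(2026, 3, 14)); // 2026-04-14
console.log(resolveMonth(reference, true), resolveQuarter(reference, true));
```

Leaning on `Date.UTC` overflow handling avoids special cases for January and for the first quarter, where "previous" crosses a year boundary.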

### `caliber init-standard`

Export the default evaluation standard as a JSON template for customization.

```
Options:
  -o, --output <file>    Output file path (default: eval-standard.json)
```

---

## Report Structure

The generated report contains the following sections:

### 1. Management Summary

Management-facing overview for monthly/quarterly KPI review:

- Overall headline
- Period assessment
- Key observations
- Recommended follow-up actions

### 2. Usage Overview

Quantitative metrics for both Claude Code and Codex:

- Total sessions, tokens (input/output), estimated cost
- Active days, duration
- Top projects by token usage
- Top tools used (Bash, Read, Edit, etc.)
- Model breakdown

### 3-N. Evaluation Sections

Each section defined in the evaluation standard generates:

- **Summary**: aggregate statistics
- **Usage evidence**: workload/depth indicators such as sessions, tool usage, follow-up prompts
- **Score evidence**: threshold-relevant evidence used for 100% / 120% scoring
- **Evidence signals**: grouped by type (iterative refinement, bugs caught, security awareness, etc.)
- **Metrics**: numeric indicators used for scoring

### Final. Score Recommendation

For each evaluation section:

- **Score**: Standard (100%) or Superior (120%)
- **Label**: Human-readable grade
- **Reason**: Evidence-backed explanation referencing the criteria

### Data Quality Warnings

The report includes data quality warnings when:

- Required data sources (`~/.claude/usage-data/session-meta`) are missing
- Sessions exist but no facets are found (qualitative analysis limited)
- No keyword signals detected (JSONL files may be missing)
- No sessions found at all in the evaluation period
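
The four warning conditions above can be expressed as a small pure function. This is a sketch under assumptions, not the package's `data-quality` module: the `DataSnapshot` shape, the function name, the message wording, and the decision to skip the facet and keyword checks when there are no sessions are all hypothetical.

```typescript
// Hypothetical summary of what the extractors found for the period.
interface DataSnapshot {
  sessionMetaDirExists: boolean;
  sessionCount: number;
  facetCount: number;
  keywordSignalCount: number;
}

// Return one human-readable warning per condition that applies.
function dataQualityWarnings(s: DataSnapshot): string[] {
  const warnings: string[] = [];
  if (!s.sessionMetaDirExists) {
    warnings.push("Required data source ~/.claude/usage-data/session-meta is missing");
  }
  if (s.sessionCount === 0) {
    warnings.push("No sessions found in the evaluation period");
    return warnings; // the remaining checks only make sense when sessions exist
  }
  if (s.facetCount === 0) {
    warnings.push("Sessions exist but no facets were found; qualitative analysis is limited");
  }
  if (s.keywordSignalCount === 0) {
    warnings.push("No keyword signals detected; JSONL files may be missing");
  }
  return warnings;
}

const recentOnly: DataSnapshot = {
  sessionMetaDirExists: true,
  sessionCount: 12,
  facetCount: 0,
  keywordSignalCount: 3,
};
console.log(dataQualityWarnings(recentOnly));
```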

---

## Custom Evaluation Standards

The built-in default is the OneAD R&D AI-Application Evaluation Standard. To create your own:

### Step 1: Export the default template

```bash
caliber init-standard --output my-standard.json
```

### Step 2: Edit the JSON file

Key fields you can customize:

| Field | Purpose |
|-------|---------|
| `name` | Standard name shown in report header |
| `sections[]` | Array of evaluation sections (add/remove/reorder) |
| `sections[].id` | Unique section identifier |
| `sections[].name` | Section display name |
| `sections[].weight` | KPI weight (display only) |
| `sections[].keywords` | Conversation scanning keywords |
| `sections[].thresholds` | Numeric thresholds for Superior score |
| `sections[].superiorRules` | Optional rule for combining thresholds |
| `sections[].standard` | 100% score criteria text |
| `sections[].superior` | 120% score criteria text |
| `noiseFilters` | Rules to exclude system/template messages |

### Step 3: Use it

```bash
caliber report --standard my-standard.json
```

### Example: Adding a new section

```json
{
  "id": "collaboration",
  "name": "AI-Human Collaboration Quality",
  "weight": "30%",
  "standard": {
    "score": 100,
    "label": "Standard",
    "criteria": ["Uses AI for routine tasks", "Follows AI suggestions without modification"]
  },
  "superior": {
    "score": 120,
    "label": "Superior",
    "criteria": ["Actively debates with AI on design decisions", "Synthesizes multiple AI suggestions into novel solutions"]
  },
  "keywords": ["design", "architecture", "trade-off", "pattern", "alternative"],
  "thresholds": {
    "iterativeRatio": 0.4,
    "keywordHits": 15,
    "avgToolUses": 10
  },
  "superiorRules": {
    "mode": "grouped",
    "strongThresholds": ["iterativeRatio", "keywordHits"],
    "supportThresholds": ["avgToolUses"],
    "minStrongMatched": 1,
    "minSupportMatched": 0
  }
}
```
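
When hand-editing a section like this, it is easy to reference a threshold key in `superiorRules` that is not defined under `thresholds`. The sketch below shows one way such a check could be written before handing the file to the CLI. It is a hypothetical helper, not the package's validator; the `SectionLike` type and `missingThresholdKeys` function are names invented for this example.

```typescript
interface SuperiorRules {
  mode: "any" | "grouped";
  strongThresholds?: string[];
  supportThresholds?: string[];
  minStrongMatched?: number;
  minSupportMatched?: number;
}

// Only the fields this check needs; a real section has more.
interface SectionLike {
  id: string;
  thresholds: Record<string, number>;
  superiorRules?: SuperiorRules;
}

// List every key referenced by superiorRules that has no entry in thresholds.
function missingThresholdKeys(section: SectionLike): string[] {
  const rules = section.superiorRules;
  if (!rules) return [];
  const referenced = [
    ...(rules.strongThresholds ?? []),
    ...(rules.supportThresholds ?? []),
  ];
  return referenced.filter((key) => !(key in section.thresholds));
}

// A deliberately broken draft: "correctionCount" is referenced but never defined.
const draft: SectionLike = {
  id: "draft-section",
  thresholds: { iterativeRatio: 0.4 },
  superiorRules: {
    mode: "grouped",
    strongThresholds: ["iterativeRatio"],
    supportThresholds: ["correctionCount"],
  },
};
console.log(missingThresholdKeys(draft));
```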

### Superior Rules

`superiorRules.mode = "any"`: any matched threshold is enough for 120%.

`superiorRules.mode = "grouped"`: separates strong evidence from support evidence. Strong evidence must meet a minimum count; support evidence alone is not sufficient.

Keys referenced by `strongThresholds` and `supportThresholds` must also exist in `thresholds`.
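
The two modes can be illustrated with a short sketch. This is one reading of the rules above, not the package's section analyzer: treating a threshold as matched when the metric is greater than or equal to it, the defaults of 1 for `minStrongMatched` and 0 for `minSupportMatched`, and the names `countMatched` and `isSuperior` are all assumptions.

```typescript
type Metrics = Record<string, number>;

interface SuperiorRules {
  mode: "any" | "grouped";
  strongThresholds?: string[];
  supportThresholds?: string[];
  minStrongMatched?: number;
  minSupportMatched?: number;
}

// Count keys whose measured metric reaches the configured threshold.
// Using >= as the match test is an assumption of this sketch.
function countMatched(keys: string[], metrics: Metrics, thresholds: Metrics): number {
  return keys.filter((key) => {
    const threshold = thresholds[key];
    const value = metrics[key];
    return threshold !== undefined && value !== undefined && value >= threshold;
  }).length;
}

// Decide whether a section earns the Superior (120%) score.
function isSuperior(metrics: Metrics, thresholds: Metrics, rules: SuperiorRules): boolean {
  if (rules.mode === "any") {
    return countMatched(Object.keys(thresholds), metrics, thresholds) > 0;
  }
  const strong = countMatched(rules.strongThresholds ?? [], metrics, thresholds);
  const support = countMatched(rules.supportThresholds ?? [], metrics, thresholds);
  return strong >= (rules.minStrongMatched ?? 1) && support >= (rules.minSupportMatched ?? 0);
}

const thresholds: Metrics = { iterativeRatio: 0.4, keywordHits: 15, avgToolUses: 10 };
const grouped: SuperiorRules = {
  mode: "grouped",
  strongThresholds: ["iterativeRatio", "keywordHits"],
  supportThresholds: ["avgToolUses"],
  minStrongMatched: 1,
  minSupportMatched: 0,
};

// Only the support threshold is met: enough for "any", not for "grouped".
const supportOnly: Metrics = { iterativeRatio: 0.1, keywordHits: 3, avgToolUses: 20 };
console.log(isSuperior(supportOnly, thresholds, { mode: "any" }), isSuperior(supportOnly, thresholds, grouped));
```

The `supportOnly` case is the one the grouped mode exists for: heavy tool usage by itself matches a threshold, but without any strong evidence the section stays at Standard.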

### Available threshold keys

| Key | Description |
|-----|-------------|
| `iterativeRatio` | Ratio of iterative/multi-task sessions to total |
| `correctionCount` | Number of user corrections/interruptions |
| `keywordHits` | Number of keyword signal matches |
| `avgToolUses` | Average tool uses per session |
| `securityCount` | Security-related keyword matches |
| `performanceCount` | Performance-related keyword matches |
| `bugsCaught` | AI-generated bugs caught (from facets) |
| `frictionSessions` | Sessions with friction events |
| `codexIterativeSessions` | Codex threads with strong iterative evidence |
| `codexMultiTurnSessions` | Codex multi-turn threads |
| `codexFollowUpCount` | Codex follow-up user prompts |
| `codexDeepSessions` | Codex high-depth threads |
| `codexErrorSessions` | Codex threads with logged errors |

---

## Default Evaluation Standard

The built-in OneAD standard evaluates two dimensions:

### AI Interaction & Decision (20% KPI weight)

| Grade | Criteria |
|-------|----------|
| **Standard (100%)** | Actively use AI for coding; clear decision notes |
| **Superior (120%)** | Multi-iteration guidance (A->B->C); system-constraint-aware optimization |

### AI Identification & Risk Control (50% KPI weight)

| Grade | Criteria |
|-------|----------|
| **Standard (100%)** | Catch common AI errors/hallucinations; stable code |
| **Superior (120%)** | Identify critical risks (security, performance, memory); produce SOP/Wiki for team sharing |

---

## Architecture

```
src/
├── cli.ts                  # CLI entry point (commander)
├── types.ts                # TypeScript type definitions
├── standard.ts             # Load & validate evaluation standards
├── period.ts               # Date period resolution (monthly/quarterly)
├── data-quality.ts         # Data source completeness checks
├── utils.ts                # Shared utilities (noise filter)
├── extractors/
│   ├── claude-code.ts      # Read ~/.claude/ data (JSONL, SQLite, JSON)
│   └── codex.ts            # Read ~/.codex/ data (SQLite, JSONL)
├── analyzers/
│   ├── usage.ts            # Aggregate quantitative usage metrics
│   └── section.ts          # Generic section analyzer (facets + keywords + thresholds)
└── reporters/
    └── report.ts           # Render reports (text, JSON, Markdown, HTML)

templates/
└── eval-standard.json      # Default OneAD evaluation standard (source of truth)

tests/
├── cli.test.ts             # CLI regression tests (subprocess)
├── section.test.ts         # Section analyzer unit tests
├── standard.test.ts        # Standard loader/validator tests
├── data-quality.test.ts    # Data quality checker tests
└── fixtures/               # Test fixture files
```

### Pipeline

```
Extract --> Analyze --> Score --> Report

1. Extract: Read session-meta, facets, SQLite, JSONL from local stores
2. Analyze: Aggregate usage + run each section through generic analyzer
3. Score:   Compare metrics against section thresholds
4. Report:  Render in chosen format with evidence and recommendations
```

---

## Development

### Scripts

```bash
npm run build        # Compile TypeScript to dist/
npm run dev          # Run CLI directly via tsx (no build needed)
npm run test         # Run test suite (vitest)
npm run test:watch   # Run tests in watch mode
```

### Running tests

```bash
# Run all tests
npm test

# Run a specific test file
npx vitest run tests/section.test.ts

# Watch mode
npm run test:watch
```

### Project conventions

- All progress/status messages are written to **stderr**; report output goes to **stdout**
- JSON output (`--format json`) is guaranteed clean on stdout for piping
- SQLite connections are wrapped in `try/finally` to prevent resource leaks
- The evaluation standard template (`templates/eval-standard.json`) is the single source of truth
- Custom standards inherit default `noiseFilters` if not specified
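
The last convention, inheriting `noiseFilters` from the default standard, can be sketched as a small merge step. The structure of `noiseFilters` is not shown in this README, so the sketch keeps it generic; `StandardLike` and `withDefaultNoiseFilters` are hypothetical names, and the string-array filters in the usage lines are placeholder data only.

```typescript
// Generic over F because the real shape of noiseFilters is not documented here.
interface StandardLike<F> {
  name: string;
  noiseFilters?: F;
}

// Use the custom standard's filters when present; otherwise fall back to the defaults.
function withDefaultNoiseFilters<F>(
  custom: StandardLike<F>,
  defaults: StandardLike<F>,
): StandardLike<F> {
  return custom.noiseFilters === undefined
    ? { ...custom, noiseFilters: defaults.noiseFilters }
    : custom;
}

// Placeholder filter values, purely for demonstration.
const defaultStandard: StandardLike<string[]> = { name: "default", noiseFilters: ["system-message"] };
const customStandard: StandardLike<string[]> = { name: "my-standard" };
console.log(withDefaultNoiseFilters(customStandard, defaultStandard));
```

Returning a new object rather than mutating the custom standard keeps the loaded file contents unchanged, which is convenient when the same object is later echoed back in a report header.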

---

## Troubleshooting

### No sessions found

- Verify `~/.claude/usage-data/session-meta/` contains JSON files
- Check the date range matches when the AI tools were used
- For Codex, verify `~/.codex/state_5.sqlite` exists

### Empty facets

- Facets are generated asynchronously by Claude Code after sessions end
- Recent sessions may not have facets yet
- The tool will show a data quality warning in this case

### JSON output contains extra text

This was fixed in v0.1.0. All progress messages now go to stderr. If you still see extra text, make sure you are running the latest version. Use `2>/dev/null` to suppress stderr when piping:

```bash
caliber report --format json 2>/dev/null | jq .
```

---

## License

MIT
package/dist/analyzers/section.d.ts
ADDED
@@ -0,0 +1,9 @@
import type { ClaudeCodeSession, ClaudeCodeFacet, ClaudeCodeConversationSignal, CodexSession, CodexConversationSignal, CodexSessionInsight, EvalSectionDef, EvalSectionResult } from "../types.js";
/**
 * Generic section analyzer.
 *
 * Collects evidence from facets, sessions, and conversation signals,
 * then determines standard vs superior score based on section thresholds.
 */
export declare function analyzeSection(section: EvalSectionDef, claudeSessions: ClaudeCodeSession[], facets: Map<string, ClaudeCodeFacet>, claudeSignals: ClaudeCodeConversationSignal[], codexSessions: CodexSession[], codexInsights: Map<string, CodexSessionInsight>, codexSignals: CodexConversationSignal[]): EvalSectionResult;
//# sourceMappingURL=section.d.ts.map
package/dist/analyzers/section.d.ts.map
ADDED
@@ -0,0 +1 @@
{"version":3,"file":"section.d.ts","sourceRoot":"","sources":["../../src/analyzers/section.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EACV,iBAAiB,EACjB,eAAe,EACf,4BAA4B,EAC5B,YAAY,EACZ,uBAAuB,EACvB,mBAAmB,EACnB,cAAc,EACd,iBAAiB,EAClB,MAAM,aAAa,CAAC;AAsiBrB;;;;;GAKG;AACH,wBAAgB,cAAc,CAC5B,OAAO,EAAE,cAAc,EACvB,cAAc,EAAE,iBAAiB,EAAE,EACnC,MAAM,EAAE,GAAG,CAAC,MAAM,EAAE,eAAe,CAAC,EACpC,aAAa,EAAE,4BAA4B,EAAE,EAC7C,aAAa,EAAE,YAAY,EAAE,EAC7B,aAAa,EAAE,GAAG,CAAC,MAAM,EAAE,mBAAmB,CAAC,EAC/C,YAAY,EAAE,uBAAuB,EAAE,GACtC,iBAAiB,CAwKnB"}