@minhpnq1807/contextos 0.6.2 → 0.6.3

package/CHANGELOG.md CHANGED
@@ -1,5 +1,11 @@
  # Changelog
 
+ ## 0.6.3
+
+ - **Launch benchmark wording:** Clarified that `ctx leaderboard --hallucination` is an offline deterministic benchmark comparing a raw heuristic baseline with ContextOS evidence-based context selection, while live agent results remain pending external CLI environments.
+ - **Offline leaderboard labels:** Renamed offline leaderboard output from agent-like labels to `Raw heuristic baseline` and `ContextOS evidence benchmark` so the 10% to 80% result is not confused with a live Codex/Gemini comparison.
+ - **Live leaderboard alias:** Added `ctx leaderboard --hallucination --live --agent <name>` as a launch-friendly alias for running the hallucination benchmark through one installed agent CLI. Live benchmark output now reports `OK`/`SKIPPED`/`ERROR` style statuses and supports `CONTEXTOS_<AGENT>_CMD` command templates for external wrappers.
+
  ## 0.6.2
 
  - **Live agent leaderboard:** Added `ctx leaderboard --agents codex,gemini` and `npm run leaderboard:agents` to run the hallucination benchmark through installed Codex/Gemini CLIs with timeouts and skip/error reporting for missing or unauthenticated agents.
package/README.md CHANGED
@@ -1,8 +1,8 @@
  # ContextOS
 
- Runtime context router for coding agents.
+ Stop coding agents from ignoring repo rules, guessing the wrong path, and reading random files.
 
- Rules, files, skills, workflows, and evidence: injected before the agent writes code.
+ ContextOS gives the agent the right rules, files, skills, workflows, and evidence before it writes code.
 
  [![npm version](https://img.shields.io/npm/v/@minhpnq1807/contextos.svg)](https://www.npmjs.com/package/@minhpnq1807/contextos)
  [![CI](https://github.com/khovan123/contextOS/actions/workflows/ci.yml/badge.svg)](https://github.com/khovan123/contextOS/actions/workflows/ci.yml)
@@ -10,19 +10,17 @@ Rules, files, skills, workflows, and evidence: injected before the agent writes
  [![license: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
 
  ```text
- WITHOUT ContextOS
- AGENTS.md is a long static blob
- important rules drift into the middle
- agent starts by grepping files and misses the repo contract
+ Problem: Agents ignore project rules.
+ Fix: ContextOS puts the relevant AGENTS.md rules in front of the agent for this task.
 
- WITH ContextOS
- prompt -> score relevant AGENTS.md rules
- -> inject critical rules at top and bottom
- -> suggest files, skills, workflows
- -> report followed / ignored / unknown
+ Problem: Agents choose the wrong deployment path.
+ Fix: ContextOS checks repo evidence before suggesting skills like EAS, Vercel, Docker, or CI/CD.
+
+ Problem: Agents grep random files.
+ Fix: ContextOS suggests the files and workflows to check first.
  ```
 
- ContextOS is not another `AGENTS.md` loader. It is a runtime context router for coding agents: it chooses the task-relevant rules, files, skills, workflows, and evidence before the agent starts editing.
+ ContextOS is not another `AGENTS.md` loader. It is a pre-flight context layer for coding agents: it turns repo rules, project signals, skills, workflows, and evidence into a compact task brief before the agent starts editing.
 
  Published package: [`@minhpnq1807/contextos`](https://www.npmjs.com/package/@minhpnq1807/contextos)
 
@@ -36,7 +34,7 @@ Same prompt. Same model. Different context.
  ctx skills doctor -- "fix deployed"
  ```
 
- | Repo evidence | Expected route |
+ | Repo evidence | What ContextOS tells the agent |
  | --- | --- |
  | `eas.json`, `expo`, `react-native` | `eas`, `mobile-deployment`, `github-actions-ci-cd` |
  | `vercel.json`, `next`, GitHub workflow | `vercel-deployment`, `github-actions-ci-cd`, `env-secret-management` |
@@ -55,7 +53,7 @@ Regenerate the GIFs from real local `ctx` command output:
  npm run demo:capture
  ```
 
- ## Agent Hallucination Benchmark
+ ## Wrong Path Benchmark
 
  Generic agents often guess deployment tooling from the prompt alone:
 
@@ -64,7 +62,7 @@ Prompt: Fix deployment
  Raw agent guess: Vercel, Docker, Railway
  ```
 
- ContextOS routes from project evidence instead:
+ ContextOS checks the repo first:
 
  ```text
  Detected evidence:
@@ -78,9 +76,9 @@ Selected skills:
  - github-actions-ci-cd
  ```
 
- That is the core launch demo: same prompt, same model, different repo context, correct skills.
+ That is the core launch demo: same prompt, same model, different repo, correct next step.
 
- Skill Router internal fixture benchmark:
+ Internal fixture benchmark:
 
  | Metric | Result |
  | --- | ---: |
@@ -91,26 +89,44 @@ Skill Router internal fixture benchmark:
  | Confidence Calibration | 100.0% |
  | Negative Gate Accuracy | 100.0% |
 
- This is an internal fixture benchmark, not an external real-world benchmark. It is designed to prove the router behavior across controlled Expo/EAS, Next/Vercel, Docker, Railway/Render, Firebase, auth, database, testing, mobile, and adversarial negative-gate cases.
+ This is an internal fixture benchmark, not an external real-world benchmark. It is designed to prove that ContextOS changes its suggestions from repo evidence across controlled Expo/EAS, Next/Vercel, Docker, Railway/Render, Firebase, auth, database, testing, mobile, and adversarial negative cases.
 
- Hallucination leaderboard:
+ Offline hallucination leaderboard:
 
  ```bash
  ctx leaderboard --hallucination
  ```
 
- Current local result across 20 fixture tasks and 12 repo contexts:
+ Current deterministic result across 20 fixture tasks and 12 repo contexts:
 
- | System | Correct Skill |
+ | System | Correct context choice |
  | --- | ---: |
- | Raw Agent | 10.0% |
- | ContextOS + Codex | 80.0% |
+ | Raw heuristic baseline | 10.0% |
+ | ContextOS evidence benchmark | 80.0% |
+
+ This means ContextOS improves deterministic context routing from 10% to 80% on the offline hallucination task set. It does not claim ContextOS beats Codex, Gemini, Claude Code, or Cursor in live runs.
+
+ Live agent benchmark support exists, but results are pending an external environment with working CLI auth/session access:
+
+ ```bash
+ ctx leaderboard --hallucination --live --agent codex
+ ctx leaderboard --hallucination --live --agent gemini
+ ```
+
+ If a CLI cannot run in the current environment, the command reports `SKIPPED` or an agent error instead of blocking launch.
+
+ Live benchmark tracking:
+
+ - [Run Codex live benchmark](https://github.com/khovan123/contextOS/issues/1)
+ - [Run Claude Code live benchmark](https://github.com/khovan123/contextOS/issues/3)
+ - [Run Gemini CLI live benchmark](https://github.com/khovan123/contextOS/issues/4)
+ - [Run Cursor live benchmark](https://github.com/khovan123/contextOS/issues/2)
 
  Example hook context injected before the agent works:
 
  ```text
  ## Critical ContextOS rules
- - IMPORTANT: This project has a knowledge graph. ALWAYS use code-review-graph MCP tools before Grep/Glob/Read.
+ - IMPORTANT: This project has a knowledge graph. Use it before broad file search.
  - Use `query_graph` pattern="tests_for" to check coverage.
 
  ## Suggested files to check
@@ -129,7 +145,7 @@ ContextOS report
  Efficiency: 100%
  Injected rules: 8
  Rule outcomes: 8 followed, 0 ignored, 0 unknown
- Runtime telemetry: code-review-graph, code-review-graph.query_graph_tool
+ Runtime evidence: project graph was used before file search
  ```
 
  ## Quick Install
@@ -172,37 +188,44 @@ ctx install agy
 
  Restart the agent after setup. Then use the agent normally.
 
- ## Why
+ ## Why ContextOS Exists
 
  Developers put real operating instructions in `AGENTS.md`: use this graph tool before reading files, run these tests, follow this architecture boundary, avoid this migration path.
 
- The problem is not that agents cannot read `AGENTS.md`. The problem is that large context windows bury the important rule in the middle, where attention is weak. ContextOS turns a static rules file into task-aware runtime context.
+ The problem is not that agents cannot read `AGENTS.md`. The problem is that large context windows bury the important rule in the middle, where attention is weak.
+
+ The same thing happens with project structure:
+
+ - A deployment prompt says "fix deploy", and the agent guesses Vercel in an Expo repo.
+ - A backend error mentions Fastify, and the agent loads frontend skills.
+ - A feature request names one route, and the agent starts with broad grep instead of the files that matter.
+
+ ContextOS fixes those three failures before the agent starts work.
 
  The next visible demo is not another feature. It is showing the pain in a few seconds:
 
  ```text
  Raw agent: guesses from the prompt.
- ContextOS: routes from repo evidence.
+ ContextOS: checks repo evidence first.
  ```
 
  ## What ContextOS Does
 
- | Layer | What happens |
+ | Agent failure | ContextOS behavior |
  | --- | --- |
- | Hooks | Codex, Claude Code, and Antigravity hooks run before/after each task. |
- | Scoring | Local MiniLM embeddings plus heuristics rank AGENTS.md rules by the prompt. |
- | Injection | Critical rules are placed with primacy + recency, not buried in the middle. |
- | Discovery | Relevant files, skills, and workflows are suggested before work starts. |
- | Sync | Rules/MCP via Ruler, skills via skillshare, workflows via ContextOS. |
- | Evidence | Stop hooks persist `followed`, `ignored`, `unknown`, and runtime telemetry for explicit reports. |
+ | Ignores project rules | Shows the relevant rules at the start of the task. |
+ | Picks the wrong tool or deployment path | Suggests skills only when the repo has supporting evidence. |
+ | Reads random files first | Suggests the likely files and workflows before exploration starts. |
+ | Claims compliance without proof | Reports which rules were followed, ignored, or unknown after the task. |
+ | Needs to work across agents | Supports Codex, Claude Code, and Antigravity with the same project context. |
 
  ## Comparison
 
  | Approach | What it gives the agent | Main gap |
  | --- | --- | --- |
  | Plain `AGENTS.md` | Static repo instructions. | Important rules get buried or ignored when the task changes. |
- | Generic RAG | Semantically related files or snippets. | It usually does not route skills/workflows or prove rule compliance. |
- | ContextOS | Task-routed rules, files, skills, workflows, and evidence. | Requires local setup and warm indexes for best results. |
+ | Generic RAG | Related files or snippets. | It usually does not choose skills/workflows or prove rule compliance. |
+ | ContextOS | Task-specific rules, files, skills, workflows, and evidence. | Requires local setup and prepared indexes for best results. |
 
  ## Safety Model
 
@@ -212,20 +235,20 @@ ContextOS is designed to be OSS-friendly and low-friction:
  | --- | --- |
  | Standalone by default | `ctx setup` works without `code-review-graph`, `codegraph`, or `agent-memory`. |
  | Optional adapters | Graph and memory backends add signal when available; missing adapters contribute score `0`. |
- | Fail-open hooks | Prompt hooks return local context or nothing instead of blocking the agent when MCP, embeddings, graph, or memory is unavailable. |
+ | Fail-open hooks | Prompt hooks return local context or nothing instead of blocking the agent when optional runtime pieces are unavailable. |
  | Local-only telemetry | Reports, prompt history, evidence, and telemetry stay under `~/.ctx/contextos/`. |
- | No hook network calls | Prompt and stop hooks do not call external services. Install/warm commands may download the local embedding model when explicitly run. |
+ | No hook network calls | Prompt and stop hooks do not call external services. Install/warm commands may prepare local indexes when explicitly run. |
  | No postinstall surprise | `npm install` only installs the CLI. Setup runs only when you call `ctx setup`. |
 
- Positioning: ContextOS works standalone and gets smarter when graph or memory adapters are available.
+ Positioning: ContextOS works standalone and gets smarter when project graph or memory adapters are available.
 
  ## Roadmap
 
- ContextOS is not heading toward a dashboard-first product. The next work is focused on making the existing local runtime more visible and reusable:
+ ContextOS is not heading toward a dashboard-first product. The next work is focused on making the existing local behavior more visible and reusable:
 
  | Next | Why |
  | --- | --- |
- | Hallucination Leaderboard | Compare raw agent guesses vs ContextOS evidence-routed recommendations across the same repos and tasks. |
+ | Hallucination Leaderboard | Compare raw agent guesses vs ContextOS evidence-based recommendations across the same repos and tasks. |
  | Agent Replay | Turn telemetry into a readable post-task narrative: prompt, selected skills, followed rules, suggested files, touched files, efficiency. |
  | Community Skill Packs | Let contributors PR ContextOS-ready skills with triggers, evidence, negative gates, and workflows before building a larger hub. |
  | ContextOS Ready | Define a repository readiness badge for AGENTS.md, skills, workflows, and evidence quality. |
@@ -237,11 +260,11 @@ See [docs/roadmap.md](docs/roadmap.md) for the current roadmap notes.
 
  ContextOS starts the community loop with [`community-skills/`](community-skills/) instead of a hosted marketplace. The seed packs are `eas`, `vercel`, `prisma`, `redis`, `oauth-google`, and `jwt-auth`.
 
- Each pack contains a model-visible `SKILL.md` plus `skill.yaml` routing metadata with prompt triggers, project evidence, negative triggers, and a short workflow. Contributors can PR new packs by copying [`community-skills/_template/`](community-skills/_template/).
+ Each pack contains a model-visible `SKILL.md` plus `skill.yaml` metadata with prompt triggers, project evidence, negative triggers, and a short workflow. Contributors can PR new packs by copying [`community-skills/_template/`](community-skills/_template/).
 
  ## ContextOS Ready
 
- `ctx doctor` scores whether a repository is ready for ContextOS-style agent routing:
+ `ctx doctor` scores whether a repository is ready for ContextOS-style agent guidance:
 
  ```bash
  ctx doctor
@@ -271,10 +294,11 @@ The score checks project `AGENTS.md` rules, project skill packs under `.codex/sk
  | `ctx evidence` | Show why each rule was marked followed/ignored/unknown. |
  | `ctx stats` | Show workspace-level usage and effectiveness metrics. |
  | `ctx benchmark -- "task"` | Compare raw AGENTS.md ordering vs ContextOS scheduling. |
- | `ctx benchmark --skills` | Run the Skill Router eval benchmark. |
- | `ctx leaderboard --hallucination` | Compare raw prompt-only guesses vs ContextOS routing. |
- | `ctx leaderboard --agents codex,gemini` | Run the live CLI leaderboard when Codex/Gemini credentials are available. |
- | `ctx sync --rules` | Sync AGENTS/Ruler/MCP config across agents. |
+ | `ctx benchmark --skills` | Run the skill selection eval benchmark. |
+ | `ctx leaderboard --hallucination` | Run the offline deterministic hallucination benchmark. |
+ | `ctx leaderboard --hallucination --live --agent codex` | Run the live CLI benchmark when agent auth/session is available. |
+ | `ctx leaderboard --agents codex,gemini` | Legacy live CLI leaderboard form. |
+ | `ctx sync --rules` | Sync project rules across agents. |
  | `ctx sync --skills` | Sync skills across agents through skillshare. |
  | `ctx sync --workflows` | Sync workflow markdown across Claude/Codex/Antigravity. |
 
@@ -283,7 +307,7 @@ The score checks project `AGENTS.md` rules, project skill packs under `.codex/sk
  1. Start in a repo with an `AGENTS.md` that contains a rule like:
 
  ```text
- Always use code-review-graph MCP tools before reading files.
+ Always use the project graph before reading files.
  ```
 
  2. Install:
@@ -598,8 +622,9 @@ This warning comes from a transitive dependency in the local embedding/WASM stac
  | `ctx stats` | Shows aggregate runtime metrics for the current workspace. | You want to know whether ContextOS is active and useful over time. | Prints sectioned tables for prompt/report counts, injection rate, efficiency, rule outcomes, hook events, last prompt, and last report. |
  | `ctx benchmark -- "task"` | Compares baseline AGENTS.md ordering with ContextOS task-aware scheduling. | You want a before/after signal for lost-in-the-middle risk. | Prints tables for parsed/actionable/filtered rules, baseline middle-risk, scheduled high/mid rules, recency reminder status, and top scored rules. |
  | `ctx benchmark --skills` | Runs the Skill Router eval benchmark. | You want evidence for skill routing accuracy and negative gates. | Prints top-1 accuracy, top-3 recall, false positive rate, confidence calibration, and negative gate accuracy across `eval/skill-routing` fixtures. |
- | `ctx leaderboard --hallucination` | Compares raw prompt-only skill guesses with ContextOS evidence routing. | You want launch evidence for the hallucination problem. | Runs 20 fixture tasks across 10+ repo contexts and prints Raw Agent vs ContextOS correctness plus sample failures. |
- | `ctx leaderboard --agents codex,gemini` | Runs the same benchmark shape through installed agent CLIs. | You want real agent output instead of the deterministic raw baseline. | Calls `codex exec` in read-only mode and the local Gemini CLI with timeouts; missing or unauthenticated CLIs are reported as skipped/errors instead of blocking. |
+ | `ctx leaderboard --hallucination` | Runs the offline deterministic hallucination benchmark. | You want launch evidence for the wrong-context problem without depending on external agent auth. | Runs 20 fixture tasks across 10+ repo contexts and prints Raw heuristic baseline vs ContextOS evidence benchmark plus sample failures. |
+ | `ctx leaderboard --hallucination --live --agent codex` | Runs the hallucination benchmark through one installed agent CLI. | You want real agent output and have CLI auth/session available. | Calls the selected CLI with timeouts; missing, blocked, or unauthenticated CLIs are reported as skipped/errors instead of blocking. |
+ | `ctx leaderboard --agents codex,gemini` | Legacy live CLI leaderboard form. | You want to run multiple live agents at once. | Equivalent live-agent benchmark shape for comma-separated CLIs. |
  | `ctx sync --rules` | Syncs project rules and MCP servers through Ruler. | You want Codex, Claude Code, and Antigravity to share one project rule/MCP source of truth. | Ensures `.ruler/ruler.toml`, injects `ctx-mcp`, imports existing MCP servers from Codex and project `.mcp.json`, runs `ruler apply --agents codex,claude,antigravity`, mirrors MCP servers to Antigravity MCP configs, and verifies generated config. |
  | `ctx sync --rules --agents <list>` | Syncs only selected agents through Ruler. | You want to update one or two agents without touching the others. | Accepts comma-separated values such as `codex`, `claude`, `agy`, `antigravity`, or `codex,claude,agy`; `agy` is normalized to Ruler's `antigravity`. |
  | `ctx sync --rules --dry-run` | Previews Ruler sync without writing files or running apply. | You want to inspect behavior before changing project config. | Prints the same flow with dry-run status. |
@@ -664,7 +689,7 @@ These files are local telemetry only. Hooks do not make network calls.
 
  ## Project Understanding
 
- ContextOS works standalone. The core path is local rules, file embeddings, import graph expansion, skill routing, workflow routing, and evidence capture.
+ ContextOS works standalone. The default path is local project rules, prepared file indexes, project skills, workflows, and evidence capture.
 
  Project graph and memory backends are optional adapters:
 
@@ -676,26 +701,24 @@ Project graph and memory backends are optional adapters:
 
  ContextOS does not require `code-review-graph`, `codegraph`, or `agent-memory` to install or run. It gets smarter when those backends are available; when they are missing, the adapter scores stay at zero and the hook continues with local context.
 
- For file suggestions, ContextOS now runs a local RAG-style retrieval pass:
+ For file suggestions, ContextOS uses prepared local indexes:
 
  ```text
  prompt
- -> UserPromptSubmit hook calls ctx-mcp bridge
- -> ctx-mcp reads AGENTS.md and scores rules with local MiniLM
- -> query the persisted file-vector index in embeddings.db for semantic file candidates
- -> expand candidates through relative import graph links
- -> optionally query code-review-graph semantic_search_nodes with seed entity names
- -> merge and deduplicate semantic, import-graph, and optional graph matches
- -> inject top suggested files with graph evidence reasons
+ -> read task-relevant AGENTS.md rules
+ -> suggest prepared file candidates
+ -> expand nearby imports
+ -> add optional project-graph matches when available
+ -> inject a compact list of files to check
  ```
 
- This keeps the hook fast and local while still using graph semantics when available. The graph search path is visible in runtime data through file reasons such as `graph:content-moderation.service`. When no graph adapter is available, file suggestions still use local file vectors and import graph expansion.
+ This keeps the hook fast and local while still using project graph signal when available. When no graph adapter is available, file suggestions still use local file indexes and import expansion.
 
- Prompt scoring does not walk the repository for file candidates or import expansion. `ctx install` and `ctx embeddings warm` rebuild the persisted file-vector index and one-hop import adjacency index by walking source paths once; prompt hooks query those indexes directly. Rules, files, skills, and workflows are scored concurrently with `Promise.all()`.
+ Prompt-time file suggestions do not walk the repository. `ctx install` and `ctx embeddings warm` rebuild the file index and one-hop import adjacency by walking source paths once; prompt hooks query those prepared indexes directly. Rules, files, skills, and workflows are resolved concurrently.
 
  `ctx embeddings warm` automatically refreshes the active Codex marketplace payload before rebuilding indexes. Use `ctx refresh` when you want the same marketplace sync plus install-style file, skill, import, and code-review-graph embedding refresh in one command.
 
- If a prompt has no usable context candidates, the hook fails open without emitting an empty `hook context` block, records `emptyContextReason` in the workspace runtime file, and starts a detached `autowarm` rebuild with a cooldown. That background rebuild refreshes file vectors, skill/workflow vectors, import adjacency, and available code-review-graph node embeddings for the next prompt while keeping repository walking out of the current prompt hot path.
+ If a prompt has no usable context candidates, the hook fails open without emitting an empty `hook context` block, records `emptyContextReason` in the workspace runtime file, and starts a detached `autowarm` rebuild with a cooldown. That background rebuild refreshes prepared indexes for the next prompt while keeping repository walking out of the current prompt path.
 
  Use `ctx --config` to choose which prompt sections ContextOS injects and how many suggestions each section may show. Interactive `ctx setup` includes the same section picker and limit prompts, while `ctx setup --yes` keeps the current saved config for automation. The panel supports multiple selection with `Space` and persists the global choice in `~/.ctx/contextos/output-config.json`. Defaults are five suggested files, five skills, and five workflows; caps are 20 files, 10 skills, and 5 workflows. Disabling rules hides both critical and additional relevant rule sections; compliance metadata remains available for reports.
 
package/bin/ctx.js CHANGED
@@ -199,7 +199,9 @@ Usage:
  ctx stats Show workspace statistics
  ctx benchmark -- "task" Benchmark workspace for a task
  ctx benchmark --skills Run skill routing eval benchmark
- ctx leaderboard --hallucination Compare raw agent guesses vs ContextOS routing
+ ctx leaderboard --hallucination Run offline deterministic hallucination benchmark
+ ctx leaderboard --hallucination --live --agent codex
+ Run hallucination benchmark through one live CLI
  ctx leaderboard --agents codex,gemini Run live CLI leaderboard for installed agents
  ctx sync --rules Sync AGENTS.md rules to all agents
  ctx sync --rules --agents <names> Sync rules to specific agents only
@@ -252,6 +254,18 @@ function normalizeInstallAgent(agent) {
    if (normalized === "antigravity") return "agy";
    return normalized;
  }
+
+ function leaderboardAgentsFromArgs(args) {
+   const agentIndex = args.indexOf("--agent");
+   const agentsIndex = args.indexOf("--agents");
+   const index = agentIndex >= 0 ? agentIndex : agentsIndex;
+   if (index < 0) return [];
+   return String(args[index + 1] || "")
+     .split(",")
+     .map((agent) => agent.trim())
+     .filter(Boolean);
+ }
+
  /**
   * Intercept console.log from an async fn,
   * printing each line immediately with "│ " prefix for real-time feedback.
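Reviewer note on the new argument parser. The sketch below copies `leaderboardAgentsFromArgs` from the hunk above and runs it on a few argument lists. It suggests two behaviors worth knowing: `--agent` wins when both `--agent` and `--agents` are present, and a value-less `--agent` followed by another flag picks up that flag as an agent name. The sample argument lists are illustrative and are not taken from the package's tests.

```javascript
// Copied from the hunk above so the sketch runs on its own.
function leaderboardAgentsFromArgs(args) {
  const agentIndex = args.indexOf("--agent");
  const agentsIndex = args.indexOf("--agents");
  const index = agentIndex >= 0 ? agentIndex : agentsIndex;
  if (index < 0) return [];
  return String(args[index + 1] || "")
    .split(",")
    .map((agent) => agent.trim())
    .filter(Boolean);
}

// Illustrative argument lists (assumptions, not package fixtures).
const samples = [
  ["leaderboard", "--hallucination", "--live", "--agent", "codex"],
  ["leaderboard", "--agents", "codex, gemini"],
  ["leaderboard", "--agent", "codex", "--agents", "gemini"],
  ["leaderboard", "--hallucination", "--live"],
  ["leaderboard", "--live", "--agent", "--limit", "5"],
];

for (const args of samples) {
  console.log(JSON.stringify(leaderboardAgentsFromArgs(args)));
}
```

The last sample returns `["--limit"]`, so callers that pass `--agent` without a value before another flag would ask the leaderboard to run an agent named `--limit`.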
@@ -1039,7 +1053,17 @@ try {
        console.log(formatBenchmark(benchmarkWorkspace({ cwd: process.cwd(), task })));
      }
    } else if (command === "leaderboard") {
-     if (args.includes("--hallucination")) {
+     if (args.includes("--hallucination") && args.includes("--live")) {
+       const agents = leaderboardAgentsFromArgs(args);
+       const limitIndex = args.indexOf("--limit");
+       const timeoutIndex = args.indexOf("--timeout-ms");
+       console.log(formatAgentLeaderboard(runAgentLeaderboard({
+         rootDir,
+         agents: agents.length ? agents : undefined,
+         caseLimit: limitIndex >= 0 ? Number(args[limitIndex + 1]) : undefined,
+         timeoutMs: timeoutIndex >= 0 ? Number(args[timeoutIndex + 1]) : undefined
+       })));
+     } else if (args.includes("--hallucination")) {
        console.log(formatHallucinationLeaderboard(await runHallucinationLeaderboard({ rootDir })));
      } else if (args.includes("--agents")) {
        const index = args.indexOf("--agents");
@@ -1053,7 +1077,7 @@ try {
          timeoutMs: timeoutIndex >= 0 ? Number(args[timeoutIndex + 1]) : undefined
        })));
      } else {
-       throw new Error("Usage: ctx leaderboard --hallucination OR ctx leaderboard --agents codex,gemini");
+       throw new Error("Usage: ctx leaderboard --hallucination OR ctx leaderboard --hallucination --live --agent codex OR ctx leaderboard --agents codex,gemini");
      }
    } else if (command === "skills") {
      if (args[1] === "doctor") {
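Reviewer note on branch order in the `leaderboard` command. The sketch below mirrors the order of the checks in the `bin/ctx.js` hunks above: the combined `--hallucination --live` test has to come before the plain `--hallucination` test, otherwise the live form would fall into the offline branch. `leaderboardMode` and `numericOption` are hypothetical helper names introduced only for this illustration; the package inlines this logic.

```javascript
// Hypothetical helper: mirrors the order of the branch checks in the diff.
function leaderboardMode(args) {
  if (args.includes("--hallucination") && args.includes("--live")) return "live";
  if (args.includes("--hallucination")) return "offline";
  if (args.includes("--agents")) return "legacy-live";
  return "usage-error";
}

// Hypothetical helper: mirrors how the diff reads --limit and --timeout-ms.
// A missing flag yields undefined; a flag with a missing or non-numeric
// value yields NaN, because Number(undefined) and Number("abc") are NaN.
function numericOption(args, flag) {
  const index = args.indexOf(flag);
  return index >= 0 ? Number(args[index + 1]) : undefined;
}

console.log(leaderboardMode(["leaderboard", "--hallucination", "--live", "--agent", "codex"]));
console.log(leaderboardMode(["leaderboard", "--hallucination"]));
console.log(leaderboardMode(["leaderboard", "--agents", "codex,gemini"]));
console.log(leaderboardMode(["leaderboard", "--live", "--agent", "codex"]));
console.log(numericOption(["--limit", "5"], "--limit"));
console.log(numericOption(["--limit"], "--limit"));
```

Under this reading, `--live --agent codex` without `--hallucination` ends in the usage error, since `args.includes("--agents")` matches only the exact plural flag.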
@@ -29,7 +29,8 @@ export function runAgentLeaderboard({
    const systems = [];
 
    for (const agent of agents) {
-     const binary = findBinary(agent);
+     const template = agentCommandTemplate(agent);
+     const binary = template ? template.split(/\s+/).filter(Boolean)[0] : findBinary(agent);
      if (!binary) {
        systems.push({ name: agent, status: "skipped", reason: "binary not found", rows: [], correctRate: 0 });
        continue;
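Reviewer note on `CONTEXTOS_<AGENT>_CMD` resolution. The sketch below reproduces the environment-key derivation from `agentCommandTemplate` (added further down in this diff) and the first-token rule used above to pick the binary. The environment is passed in as a parameter so the example does not depend on the real `process.env`, and `templateBinary` is a hypothetical name for the inline expression. The wrapper path is made up for the example. How the rest of the template is expanded (`expandTemplate`) is not shown in this diff and is not modeled here.

```javascript
// Same key derivation as agentCommandTemplate in this diff; env is a parameter
// here (an assumption for testability) instead of process.env.
function agentCommandTemplate(agent, env) {
  const envKey = `CONTEXTOS_${String(agent || "").toUpperCase().replace(/[^A-Z0-9]+/g, "_")}_CMD`;
  return env[envKey] || "";
}

// Same first-token rule runAgentLeaderboard applies to a non-empty template.
// Returns undefined for an empty template, where the diff falls back to findBinary.
function templateBinary(template) {
  return template ? template.split(/\s+/).filter(Boolean)[0] : undefined;
}

// Hypothetical wrapper command, only for illustration.
const env = { CONTEXTOS_CLAUDE_CODE_CMD: "/opt/wrappers/claude-bench run" };

console.log(agentCommandTemplate("claude-code", env));
console.log(templateBinary(agentCommandTemplate("claude-code", env)));
console.log(agentCommandTemplate("codex", env) === "");
```

Because the binary is taken as the first whitespace-separated token, a wrapper path that itself contains spaces would be cut at the first space under this rule.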
@@ -71,7 +72,7 @@ export function formatAgentLeaderboard(result) {
    ];
    for (const system of result.systems) {
      const score = system.status === "ok" ? percent(system.correctRate) : system.reason;
-     lines.push(`${system.name.padEnd(8)} ${system.status.padEnd(8)} ${score}`);
+     lines.push(`${system.name.padEnd(8)} ${system.status.toUpperCase().padEnd(8)} ${score}`);
    }
    lines.push("", "Cases:");
    for (const system of result.systems) {
  for (const system of result.systems) {
@@ -123,6 +124,8 @@ function runAgentCase({ agent, binary, testCase, skillIds, timeoutMs, rootDir })
  }
 
  function agentArgs({ agent, cwd, prompt }) {
+   const genericTemplate = agentCommandTemplate(agent);
+   if (genericTemplate) return expandTemplate(genericTemplate, { cwd, prompt }).slice(1);
    if (agent === "codex") {
      return [
        "exec",
@@ -140,6 +143,11 @@ function agentArgs({ agent, cwd, prompt }) {
    return [prompt];
  }
 
+ function agentCommandTemplate(agent) {
+   const envKey = `CONTEXTOS_${String(agent || "").toUpperCase().replace(/[^A-Z0-9]+/g, "_")}_CMD`;
+   return process.env[envKey] || "";
+ }
+
  function buildPrompt({ task, skillIds }) {
    return [
      "You are evaluating a repository for a coding-agent skill router benchmark.",
@@ -180,6 +188,11 @@ function findBinary(name) {
    for (const candidate of candidates) {
      if (fs.existsSync(candidate)) return candidate;
    }
+   for (const dir of String(process.env.PATH || "").split(path.delimiter)) {
+     if (!dir) continue;
+     const candidate = path.join(dir, safeName);
+     if (fs.existsSync(candidate)) return candidate;
+   }
    for (const command of [
      `command -v ${safeName}`,
      `source ~/.profile >/dev/null 2>&1 || true; source ~/.bashrc >/dev/null 2>&1 || true; command -v ${safeName}`
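Reviewer note on the added `PATH` scan in `findBinary`. The sketch below isolates the new loop into a standalone function and exercises it against a throwaway directory. `findOnPath` and its `pathValue` parameter are assumptions made for testability; the diff reads `process.env.PATH` directly. The sketch uses CommonJS `require` so it runs as a plain script, while the package itself is an ES module.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Scan each PATH entry for a file with the given name, as the added loop does.
function findOnPath(name, pathValue) {
  for (const dir of String(pathValue || "").split(path.delimiter)) {
    if (!dir) continue; // skip empty entries, e.g. a leading or doubled delimiter
    const candidate = path.join(dir, name);
    if (fs.existsSync(candidate)) return candidate;
  }
  return null;
}

// Demo: a temporary directory holding a fake binary name (illustrative only).
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "ctx-path-"));
fs.writeFileSync(path.join(demoDir, "fake-agent"), "");
const demoPath = ["", path.join(demoDir, "missing"), demoDir].join(path.delimiter);

console.log(findOnPath("fake-agent", demoPath) === path.join(demoDir, "fake-agent"));
console.log(findOnPath("no-such-agent", demoPath));

fs.rmSync(demoDir, { recursive: true, force: true });
```

`fs.existsSync` only confirms that the path exists. It does not check the executable bit, and it does not try Windows extensions such as `.cmd` or `.exe`, so a match from this loop is a candidate rather than proof that the CLI can run.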
@@ -41,8 +41,8 @@ export async function runHallucinationLeaderboard({
      caseCount: selectedCases.length,
      repoCount: new Set(selectedCases.map((row) => row.fixture)).size,
      systems: [
-       summarizeSystem("Raw Agent", rawRows),
-       summarizeSystem("ContextOS + Codex", contextRows)
+       summarizeSystem("Raw heuristic baseline", rawRows),
+       summarizeSystem("ContextOS evidence benchmark", contextRows)
      ],
      rows: selectedCases.map((testCase) => ({
        prompt: testCase.prompt,
@@ -60,11 +60,11 @@ export function formatHallucinationLeaderboard(result) {
      `Repos: ${result.repoCount}`,
      `Tasks: ${result.caseCount}`,
      "",
-     "System Correct Skill",
-     "------------------ -------------"
+     "System Correct Context",
+     "---------------------------- ---------------"
    ];
    for (const system of result.systems) {
-     lines.push(`${system.name.padEnd(18)} ${percent(system.correctRate)}`);
+     lines.push(`${system.name.padEnd(28)} ${percent(system.correctRate)}`);
    }
    lines.push("", "Sample failures:");
    const failures = result.rows
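Reviewer note on the `padEnd(18)` to `padEnd(28)` change. The sketch below checks why the column width had to grow with the renamed labels: the old labels fit in 18 characters, while `ContextOS evidence benchmark` is 28 characters long. `columnWidth` and `formatRow` are hypothetical helpers for illustration, and the score strings are sample values taken from the README table rather than output of the package's `percent` function.

```javascript
const oldLabels = ["Raw Agent", "ContextOS + Codex"];
const newLabels = ["Raw heuristic baseline", "ContextOS evidence benchmark"];

// Hypothetical helper: smallest padEnd width that keeps every label aligned.
function columnWidth(labels) {
  return Math.max(...labels.map((label) => label.length));
}

// Hypothetical helper: same row shape as the diff (padded name, one space, score).
function formatRow(name, score, width) {
  return `${name.padEnd(width)} ${score}`;
}

console.log(columnWidth(oldLabels)); // 17
console.log(columnWidth(newLabels)); // 28

// With width 28 both rows have the same length, so the scores line up.
console.log(formatRow("Raw heuristic baseline", "10.0%", 28));
console.log(formatRow("ContextOS evidence benchmark", "80.0%", 28));
```

`padEnd` never truncates, so keeping the old width of 18 would not have cut the new labels. It would only have left the score column misaligned between the two rows.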
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@minhpnq1807/contextos",
-   "version": "0.6.2",
+   "version": "0.6.3",
    "description": "Task-aware AGENTS.md context injection and compliance reporting for Codex, Claude Code, and Antigravity.",
    "type": "module",
    "bin": {
@@ -1,6 +1,6 @@
  {
    "name": "ctx",
-   "version": "0.6.2",
+   "version": "0.6.3",
    "description": "Inject task-relevant AGENTS.md rules into Codex through plugin hooks.",
    "author": {
      "name": "ContextOS"