@pickled-dev/cli 0.18.0 → 0.19.1

Files changed (3)
  1. package/README.md +41 -236
  2. package/dist/index.js +82 -82
  3. package/package.json +2 -2
package/README.md CHANGED
@@ -1,268 +1,73 @@
1
1
  # @pickled-dev/cli
2
2
 
3
- > Test what agents actually understand about your product
3
+ > Pickled runs real agent questions across a matrix of interfaces, sources, and toolsets, then scores the answers with deterministic checks.
4
4
 
5
- Pickled runs scenarios against real agent targets, checks citations against registered sources, and matches declared traps deterministically. No LLM grades another LLM.
5
+ The CLI for [Pickled](https://pickled.dev). Use it locally or in CI to check what agents say about your product. No LLM grades another LLM.
6
6
 
7
- ## Installation
7
+ Full docs: [docs.pickled.dev/docs](https://docs.pickled.dev/docs).
8
+
9
+ ## Install
8
10
 
9
11
  ```bash
10
12
  bun add -g @pickled-dev/cli
13
+ # or run without installing:
14
+ bunx @pickled-dev/cli <command>
11
15
  ```
12
16
 
13
- ## Usage
14
-
15
- ### 1. Initialize config
17
+ ## Commands
16
18
 
17
- ```bash
18
- pickled init
19
- ```
19
+ - **`pickled init [path]`** writes a starter `pickled.yml`.
20
+ - **`pickled audit [path]`** scans agent-facing files (`CLAUDE.md`, `AGENTS.md`, `llms.txt`) for broken refs, oversized sections, and stale-pattern matches. No LLM calls.
21
+ - **`pickled check [path]`** runs the scenarios in `pickled.yml`, expands matrix cells, and scores each answer.
20
22
 
21
- Creates a `pickled.yml` file:
23
+ ## Minimum config
22
24
 
23
25
  ```yaml
24
26
  tool:
25
-   name: "your-product"
26
-   description: "What your product does"
27
+   name: my-product
28
+   description: short one-liner
27
29
 
28
30
  docs:
29
31
    sources:
30
32
      readme: ./README.md
31
33
 
34
+ targets:
35
+   quick:
36
+     category: cli
37
+     provider: claude-code
38
+     model: claude-haiku-4-5
39
+
32
40
  scenarios:
33
-   - name: "Getting started"
34
-     prompt: "How do I install and set up this product?"
41
+   - name: Install
42
+     prompt: How do I install my-product?
35
43
      requiredSources: [readme]
36
44
 
37
- threshold: 80
38
- ```
39
-
40
- ### 2. Edit your config
41
-
42
- Declare the sources agents should cite, the scenarios they should answer, and any stale patterns you want traps to catch.
43
-
44
- ### 3. Run the check
45
-
46
- ```bash
47
- pickled check
48
- ```
49
-
50
- ## Commands
51
-
52
- ### `pickled init [path]`
53
-
54
- Create a starter `pickled.yml` config file.
55
-
56
- ### `pickled audit [path]`
57
-
58
- Static scan of agent-context files. No LLM calls.
59
-
60
- | Option | Description |
61
- | --------------------- | ---------------------------------------------------- |
62
- | `--format <name>` | `terminal` (default), `markdown`, or `json` |
63
- | `--json` | Shorthand for `--format json` |
64
- | `-o, --output <file>` | Save report to file |
65
- | `--fail-on <level>` | Exit non-zero on `error` (default) or `warning` |
66
-
67
- Default `terminal` format is plain text suited to CI logs. Use `--format markdown` for GitHub step summaries; `--format json` for machine consumers.
68
-
69
- ### `pickled check [path]`
70
-
71
- Run agent scenarios against registered sources.
72
-
73
- | Option | Description |
74
- | --------------------- | ------------------------------------------------------------------- |
75
- | `--json` | Output as JSON |
76
- | `-o, --output <file>` | Save JSON report to file |
77
- | `-v, --verbose` | Show progress while scenarios run |
78
- | `-t, --threshold <n>` | Minimum score percent needed to pass |
79
- | `--target <name>` | Restrict to the named target. Overrides `matrix.target` for non-matrix scenarios; for matrix scenarios, also acts as `--interface` unless that flag is explicitly set. |
80
- | `--scenario <name>` | Run only the named scenario (CI-matrix-friendly) |
81
- | `--interface <name>` | Matrix cell filter: run only cells with this interface. Takes precedence over `--target` for matrix cells. |
82
- | `--source <name>` | Matrix cell filter: run only cells with this source id |
83
- | `--toolset <name>` | Matrix cell filter: run only cells with this toolset name |
84
-
85
- `--target` and `--interface` are related but distinct: `--target` is the legacy flag that narrows the top-level `matrix.target` axis (used before per-scenario `scenario.matrix.interfaces` shipped in v0.16.0). When `--target` is the only flag passed, the CLI also applies it as `--interface` so matrix scenarios narrow consistently. Pass `--interface` explicitly to override.
86
-
87
- The cell filters work with `scenario.matrix` declarations. Designed for GitHub Actions matrix usage where each CI job runs one cell:
88
-
89
- ```yaml
90
- # .github/workflows/pickled-matrix.yml
91
- strategy:
92
-   matrix:
93
-     interface: [codex, claude_code]
94
-     source: [docs_site, readme]
95
-     toolset: [none]
96
- steps:
97
-   - run: |
98
-       pickled check \
99
-         --interface "${{ matrix.interface }}" \
100
-         --source "${{ matrix.source }}" \
101
-         --toolset "${{ matrix.toolset }}" \
102
-         --output "pickled-report-${{ matrix.interface }}-${{ matrix.source }}-${{ matrix.toolset }}.json"
103
- ```
104
-
105
- Each job uploads one receipt; a later job can merge or compare them. Full-matrix runs without filters work too; they just produce one report covering every declared cell.
106
-
107
- ## Example Output
108
-
109
- ```text
110
- pickled check
111
- -------------------------------------------------------
112
- Tool: zod
113
- Sources: [readme], [llms]
114
- Scenarios: 1
115
-
116
- Scenario: Error handling
117
- ✗ Trap fired (0%)
118
- trap: old_v2_api
119
- reason: Deprecated in Zod 4; use z.treeifyError()
120
- match: "ZodError.format()"
121
- cited: [readme], [llms]
122
-
123
- -------------------------------------------------------
124
- Overall: 0 / 100 · threshold 80 · run fails
125
- Review fired traps before trusting this surface.
126
- ```
127
-
128
- ## Result Labels
129
-
130
- | Label | Meaning |
131
- | ----- | ------- |
132
- | `Well grounded` | Required sources cited. No unknown sources. High confidence. |
133
- | `Grounded` | Required sources cited. No unknown sources. Lower confidence. |
134
- | `Partially grounded` | Some required citations are missing, or unknown citations appeared. |
135
- | `Trap fired` | A declared stale pattern matched. Score is forced to 0 for that scenario. |
136
- | `Ungrounded` | No valid citations, or every citation is unknown. |
137
- | `Error` | The target failed before Pickled could score the response. |
138
-
139
- ## Sources
140
-
141
- Sources are what scenarios cite. Three loader types:
142
-
143
- - **`file` (default)** - a path to one local file. The string form (`readme: ./README.md`) implicitly uses this.
144
- - **`url`** - an `http(s)://` path. Fetched on every `pickled check` run.
145
- - **`codebase`** - a glob expanded into one logical source whose content is every matched file concatenated with file-separator headers. Useful when you want the agent to answer from a directory of JSDoc, per-package agent docs, or examples.
146
-
147
- Codebase sources are always explicit:
148
-
149
- ```yaml
150
- docs:
151
-   sources:
152
-     readme: ./README.md # file (string form)
153
-     docs_site: https://example.com/docs.md # url (string form, http prefix)
154
-     jsdoc:
155
-       type: codebase
156
-       path: "packages/**/src/**/*.ts"
157
-       exclude: ["**/*.test.ts"] # codebase-only
158
-       maxBytes: 524288 # optional; default 256 KB soft cap
159
- ```
160
-
161
- Codebase loader safety defaults: skips directories (`onlyFiles`), does not follow symlinks, rejects glob patterns containing `..` segments. Files are read in lexicographic order so the same config produces the same content for reproducible LLM calls. The audit's trap cross-reference scans each matched file individually so findings carry per-file `source_id:path:line`.
162
-
163
- URL sources are NOT scanned by the audit's trap cross-reference in v1; they are fetched only during `pickled check`.
164
-
165
- ## Toolsets
166
-
167
- Matrix mode (`scenario.matrix.toolsets`) iterates each scenario across named toolset profiles. Three shapes ship today:
168
-
169
- - **`none`** (the deterministic baseline). Pickled injects the cell's active source content into the agent's prompt. Citation contract applies if `requiredSources` is declared. Same scoring shape as non-matrix scenarios.
170
- - **`web`** on Claude Code only. Set `webSearch: true` and/or `webFetch: true`. The cell scopes the SDK's built-in tool set to exactly those web tools (`tools: ["WebSearch", "WebFetch"]`) so default Read/Edit/Bash do not leak, and adds them to `allowedTools` so they execute without permission prompts. Source is NOT injected; the cell's prompt is rewritten to name the active source as the discovery target. Citation contract is skipped; the cell scores on traps + `expected.includes`/`excludes` + tool-use provenance.
171
- - **`mcp`** on Claude Code only. Declare `mcpServers` (a map of server name to `McpServerConfig` with `stdio`, `http`, or `sse` transport). The cell sets `tools: []` (all built-ins disabled; MCP tools come from `mcpServers`) and `allowedTools: ["mcp__<server>__*", ...]` (auto-permission for the configured server namespaces). Tool-use provenance accepts any invocation of `mcp__<server>__*` for any configured server.
172
-
173
- The SDK `tools` option (not `allowedTools`) is what actually restricts which tools the agent can call: `allowedTools` is only a permission-prompt-bypass list. Pickled sets both for non-none cells so the agent is confined to the configured tool path with no fallback to local filesystem tools.
174
-
175
- Tool-use provenance (web and MCP) is a hard veto. A cell that does not invoke at least one of the configured tools is forced to `NO` with confidence `0`, because an answer pulled from model prior knowledge cannot testify to the tool path the cell is meant to test.
176
-
177
- For non-none cells, scenario-level `context` overrides for `allowedTools`, `disallowedTools`, and `mcpServers` are ignored: the toolset declaration is the single source of truth so the cell label honestly describes what the agent had available. None cells still honor `context` as before.
178
-
179
- Declare profiles at the top level of `pickled.yml`:
180
-
181
- ```yaml
182
- toolsets:
183
-   none: {}
184
-   web:
185
-     webSearch: true
186
-     webFetch: true
187
-   context7_mcp:
188
-     mcpServers:
189
-       context7:
190
-         type: http
191
-         url: https://mcp.context7.com/mcp
192
-         headers:
193
-           CONTEXT7_API_KEY: ${CONTEXT7_API_KEY}
194
- ```
195
-
196
- Mixing `webSearch`/`webFetch` and `mcpServers` in the same toolset is rejected: declare separate toolsets so provenance can be attributed to one tool path.
197
-
198
- String values in `pickled.yml` that match `${UPPER_SNAKE_CASE}` are replaced with the corresponding `process.env` entry at load time, so secrets (MCP auth headers, API keys) stay out of the config file. Missing env vars become empty strings so the failure surfaces at the call site (e.g., a 401 from the MCP server) rather than at config load. Bun auto-loads `.env`, so the conventional dotfile works.
199
-
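The removed paragraph above describes `${UPPER_SNAKE_CASE}` substitution only in prose. A minimal sketch of how such a load-time pass might look is given here; the helper name `interpolateEnv`, the `Env` type, and the choice to substitute placeholders embedded inside longer strings are all assumptions for illustration, not the package's actual implementation.

```typescript
// Hypothetical sketch; names and behavior details are assumptions,
// not taken from the @pickled-dev/cli source.
type Env = Record<string, string | undefined>;

// Matches ${UPPER_SNAKE_CASE} placeholders anywhere inside a string.
const PLACEHOLDER = /\$\{([A-Z][A-Z0-9_]*)\}/g;

// Recursively walks a parsed config value and substitutes placeholders in strings.
function interpolateEnv(value: unknown, env: Env): unknown {
  if (typeof value === "string") {
    // Missing variables become empty strings, as the README text describes.
    return value.replace(PLACEHOLDER, (_match: string, name: string) => env[name] ?? "");
  }
  if (Array.isArray(value)) {
    return value.map((item) => interpolateEnv(item, env));
  }
  if (value !== null && typeof value === "object") {
    const result: Record<string, unknown> = {};
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      result[key] = interpolateEnv(child, env);
    }
    return result;
  }
  // Numbers, booleans, and null pass through unchanged.
  return value;
}
```

Passing the environment as a parameter (rather than reading `process.env` inside the helper) keeps the sketch testable; a caller would presumably hand it `process.env` after parsing `pickled.yml`.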
200
- Then reference them per scenario:
201
-
202
- ```yaml
203
- scenarios:
204
-   - name: "Install"
205
-     matrix:
206
-       interfaces: [quick]
207
-       sources: [llms]
208
-       toolsets: [none, web, context7_mcp]
209
-     expected:
210
-       includes: ["bunx pickled"]
45
+ threshold: 60
211
46
  ```
212
47
 
213
- That scenario produces 3 cells: `[quick · llms · none]` (injected), `[quick · llms · web]` (discovered via web tools), `[quick · llms · context7_mcp]` (discovered via Context7 MCP).
214
-
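The three cells listed above are the cross product of the scenario's `interfaces`, `sources`, and `toolsets` lists. The sketch below illustrates that expansion and how `--interface`, `--source`, and `--toolset` style filters could narrow it; the type and function names (`MatrixSpec`, `expandCells`, `filterCells`, `cellLabel`) are hypothetical and not taken from the package.

```typescript
// Hypothetical sketch of matrix expansion; all names here are assumptions.
interface MatrixSpec {
  interfaces: string[];
  sources: string[];
  toolsets: string[];
}

interface Cell {
  interface: string;
  source: string;
  toolset: string;
}

interface CellFilter {
  interface?: string;
  source?: string;
  toolset?: string;
}

// Builds one cell per (interface, source, toolset) combination.
function expandCells(matrix: MatrixSpec): Cell[] {
  const cells: Cell[] = [];
  for (const iface of matrix.interfaces) {
    for (const source of matrix.sources) {
      for (const toolset of matrix.toolsets) {
        cells.push({ interface: iface, source, toolset });
      }
    }
  }
  return cells;
}

// Keeps only cells matching every filter field that is set.
function filterCells(cells: Cell[], filter: CellFilter): Cell[] {
  return cells.filter(
    (cell) =>
      (filter.interface === undefined || cell.interface === filter.interface) &&
      (filter.source === undefined || cell.source === filter.source) &&
      (filter.toolset === undefined || cell.toolset === filter.toolset),
  );
}

// Formats a cell the way the README writes it, e.g. "[quick · llms · web]".
function cellLabel(cell: Cell): string {
  return `[${cell.interface} · ${cell.source} · ${cell.toolset}]`;
}
```

With the scenario above, `expandCells` would yield three cells, and a filter of `{ toolset: "web" }` would leave only the web cell, which is the shape a one-cell-per-job CI matrix relies on.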
215
- Toolsets that declare neither web flags nor `mcpServers`, and toolsets on a non-Claude-Code interface, throw clear errors per cell so the misconfiguration is obvious.
216
-
217
- ## Targets
218
-
219
- Pickled ships three target shapes today. Each target is a distinct surface that exercises the agent differently; results are comparable but not identical.
48
+ That gets you a single controlled-mode scenario. To compare across interfaces, sources, or tool paths (web / MCP), add `matrix:` and `toolsets:`. See [Matrix evaluation](https://docs.pickled.dev/docs/matrix-evaluation) and the [`pickled.yml` reference](https://docs.pickled.dev/docs/pickled-yml).
220
49
 
221
- ### CLI targets
50
+ ## Matrix filters in CI
222
51
 
223
- - `claude-code` (Claude Agent SDK) - runs the model with tools and workspace context. Requires the Claude Code CLI install.
224
- - `codex-cli` (Codex CLI binary) - spawns the codex binary, pipes the prompt, parses the response.
52
+ `pickled check` accepts `--interface`, `--source`, and `--toolset` flags so a GitHub Actions matrix can fan out one cell per job. Full workflow examples in [GitHub Actions](https://docs.pickled.dev/docs/github-actions).
225
53
 
226
- ### API target
54
+ ## Current support
227
55
 
228
- - `anthropic` - calls the Anthropic Messages API directly via `@anthropic-ai/sdk`. No tools, no workspace, no agent orchestration. Useful when you want a controlled baseline that isolates "did the model understand the registered sources" from "did the agent's tools fix it for the model."
56
+ | Axis | Works today |
57
+ | --- | --- |
58
+ | Sources | local files, URLs, codebase globs |
59
+ | Toolsets | `none`, `web`, `mcp` |
60
+ | Interfaces | Claude Code, Codex CLI, Anthropic API |
61
+ | Output | terminal, JSON, markdown audit reports |
229
62
 
230
- API targets require:
63
+ ## Read more
231
64
 
232
- - `ANTHROPIC_API_KEY` in the environment
233
- - An explicit `model` field on the target config (no silent defaults; reproducibility depends on pinning)
65
+ - [Getting started](https://docs.pickled.dev/docs/getting-started)
66
+ - [Matrix evaluation](https://docs.pickled.dev/docs/matrix-evaluation)
67
+ - [Toolsets](https://docs.pickled.dev/docs/toolsets)
68
+ - [`pickled.yml` reference](https://docs.pickled.dev/docs/pickled-yml)
69
+ - [GitHub Actions](https://docs.pickled.dev/docs/github-actions)
234
70
 
235
- Example config:
236
-
237
- ```yaml
238
- targets:
239
-   anthropic_haiku:
240
-     category: api
241
-     provider: anthropic
242
-     model: claude-haiku-4-5
243
-     temperature: 0
244
-     maxTokens: 4096
245
- ```
71
+ ## License
246
72
 
247
- API targets accept only `model`, `temperature`, `maxTokens`, and `threshold`. The loader rejects CLI-only fields (`allowedTools`, `mcpServers`, `permissionMode`, `maxTurns`, etc.) on an API target so silent no-ops cannot create false confidence.
248
-
249
- **Cost note:** API targets meter by input + output tokens, not by CLI session. Budget accordingly when running matrices with many sources or large scenario sets.
250
-
251
- ## CI
252
-
253
- ```yaml
254
- # GitHub Actions
255
- - name: Check agent legibility
256
-   run: pickled check --threshold 80
257
- ```
258
-
259
- Fail the run when the overall score falls below the threshold.
260
-
261
- ## Local Development
262
-
263
- ```bash
264
- # From the monorepo root
265
- bun install
266
- bun run dev:cli -- init
267
- bun run dev:cli -- check
268
- ```
73
+ MIT