ai-wiki-toolkit-linux-arm64 0.1.32 → 0.1.35
- package/README.md +99 -9
- package/bin/aiwiki-toolkit +0 -0
- package/package.json +1 -1
package/README.md
CHANGED
@@ -1,7 +1,7 @@
 # ai-wiki-toolkit-linux-arm64
 
 This package contains the `aiwiki-toolkit` executable for `linux-arm64-glibc`.
-It is published as the platform-specific binary package for `ai-wiki-toolkit` `0.1.
+It is published as the platform-specific binary package for `ai-wiki-toolkit` `0.1.35`.
 Most users should install `ai-wiki-toolkit` instead of using this package directly.
 
 ---
@@ -473,20 +473,110 @@ This writes a static HTML review queue and JSON payload under `ai-wiki/_toolkit/
 To summarize first-attempt product impact from a captured eval run:
 
 ```bash
+aiwiki-toolkit eval impact families
+aiwiki-toolkit eval impact families --format json
+aiwiki-toolkit eval impact discover
+aiwiki-toolkit eval impact family show ownership_boundary
+aiwiki-toolkit eval impact family candidates
+aiwiki-toolkit eval impact family init --name retry_loop --from-candidate problems/retry-loop --baseline-ref HEAD^
+aiwiki-toolkit eval impact family draft --candidate problems_retry_loop --baseline-ref HEAD^
+aiwiki-toolkit eval impact family promote --candidate problems_retry_loop
+aiwiki-toolkit eval impact family promote --candidate problems_retry_loop --apply
+aiwiki-toolkit eval impact plan --family ownership_boundary
+aiwiki-toolkit eval impact plan --family ownership_boundary --format json
+aiwiki-toolkit eval impact prepare --family ownership_boundary
+aiwiki-toolkit eval impact prepare --family ownership_boundary --format json
+aiwiki-toolkit eval impact run --run-dir /path/to/eval-run --slot s01
+aiwiki-toolkit eval impact run --run-dir /path/to/eval-run --all-slots --score-policy command-exit
+aiwiki-toolkit eval impact run --run-dir /path/to/eval-run --all-slots --score-policy rubric --rubric evals/impact/rubrics/my-family.json
+aiwiki-toolkit eval impact benchmark --family ownership_boundary --score-policy command-exit
+aiwiki-toolkit eval impact schedule report --handle your-handle --candidate-max-items 25
+aiwiki-toolkit eval impact schedule run --family ownership_boundary --score-policy command-exit
+aiwiki-toolkit eval impact schedule run --all-runnable --if-due --score-policy rubric
+aiwiki-toolkit eval impact capture --run-dir /path/to/eval-run --slot s01 --prompt-level original --first-pass-success
+aiwiki-toolkit eval impact validate --run-dir /path/to/eval-run
+aiwiki-toolkit eval impact score --run-dir /path/to/eval-run --slot s01 --prompt-level original --label success
+aiwiki-toolkit eval impact manifest --run-dir /path/to/eval-run
+aiwiki-toolkit eval impact manifest --run-dir /path/to/eval-run --format json
 aiwiki-toolkit eval impact report --run-dir /path/to/eval-run
 aiwiki-toolkit eval impact report --run-dir /path/to/eval-run --format json
 aiwiki-toolkit eval impact summarize --run-dir /path/to/eval-run --run-dir /path/to/another-run
 aiwiki-toolkit eval impact summarize --runs-file evals/impact/runs.json
 ```
 
-
-`
-
-
-
-
-telemetry
-
+Use `eval impact families` before running benchmarks. It discovers registered families from
+`evals/impact/families/*/spec.toml`, reports readiness, prompt and rubric presence, memory fixture
+counts, baseline refs, historical issues, and next commands. Use `eval impact family show <name>`
+for one family.
+
+Use `eval impact family candidates` to expose trial/error replay candidates from existing AI wiki
+telemetry. It layers over `diagnose memory --focus trial-error` and reports candidate readiness
+without writing user-owned AI wiki docs. Use `eval impact family init --from-candidate ...` only
+after confirming a source incident, baseline ref, prompt shape, and rubric direction; it creates a
+draft family scaffold under `evals/impact/`.
+
+Use `eval impact discover` for the continuous loop. It refreshes the managed candidate queue under
+`ai-wiki/_toolkit/evals/candidates/`, preserves first-seen/last-seen/seen-count state, and prints
+the next draft, promotion, and schedule commands. Use `eval impact family draft` to create managed
+candidate files under `ai-wiki/_toolkit/evals/drafts/<candidate>/` without registering a formal
+family. Use `eval impact family promote` as a report-only gate; add `--apply` only after the draft
+has a real baseline ref, prompt, and rubric and you want to write formal files under
+`evals/impact/`.
+
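The discover, draft, and promote loop described in the added paragraph can be sketched as a dry-run helper that only prints commands in the recommended order. This is a minimal sketch, not part of the package: the function name `candidate_loop_commands` and its `yes`/`no` apply switch are hypothetical, while every subcommand and flag is copied from the README's own command list.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the candidate loop commands, runs nothing.
# Arguments: <candidate> <baseline-ref> [yes|no]  (third argument defaults to "no")
candidate_loop_commands() {
  local candidate="$1" baseline_ref="$2" apply="${3:-no}"
  # Refresh the managed candidate queue first.
  echo "aiwiki-toolkit eval impact discover"
  # Create managed draft files without registering a formal family.
  echo "aiwiki-toolkit eval impact family draft --candidate ${candidate} --baseline-ref ${baseline_ref}"
  # Report-only gate: the README describes promote without --apply as non-writing.
  echo "aiwiki-toolkit eval impact family promote --candidate ${candidate}"
  # Only once the draft has a real baseline ref, prompt, and rubric.
  if [ "${apply}" = "yes" ]; then
    echo "aiwiki-toolkit eval impact family promote --candidate ${candidate} --apply"
  fi
}

# Example values taken from the README's command list.
candidate_loop_commands problems_retry_loop 'HEAD^'
```

Keeping `--apply` behind an explicit switch mirrors the README's advice to treat `promote` as a gate before anything is written under `evals/impact/`.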
+Use `eval impact plan` to inspect the next run before creating workspaces or invoking agents. It
+reads `evals/impact/families/<family>/spec.toml` and prompt files, then reports the planned
+baseline ref, prompt hashes, workflow-primary variants, output paths, and script commands. The plan
+command does not mutate eval artifacts or call an agent.
+
+Use `eval impact prepare` to execute the planned setup only: it creates neutral slot workspaces,
+creates the run directory and metadata, and writes initial `manifest.json` and `manifest.md` files.
+It still does not call an agent.
+
+Use `eval impact run` to invoke Codex CLI against one neutral slot or all slots in an already
+prepared run. The command calls the repo-local slot runner, captures first-pass artifacts,
+optionally exports visible Codex sessions, validates confounds, applies an explicit score policy,
+and writes a report bundle under `<run-dir>/report_bundle/`. The default score policy is `none`.
+`--score-policy command-exit` is useful for smoke tests and execution-health automation, but it
+only scores Codex/save-result command completion; use manual or semantic scoring before making
+research-quality correctness claims.
+`--score-policy rubric` reads an `impact-eval-rubric-v1` JSON file, writes
+`rubric_judgment.json` next to each slot score, then writes the normal `score.json` artifact.
+Rubric criteria can inspect captured diffs, final messages, result fields, changed files, and
+untracked files.
+
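The three score policies named in the added text (`none`, `command-exit`, `rubric`) can be illustrated with a small command builder. This is a sketch under assumptions: `impact_run_command` is a hypothetical helper, it only prints a command line, and it insists on a rubric path for the `rubric` policy to mirror the README's `run` example. Whether the CLI itself requires `--rubric`, or accepts `--all-slots` with no `--score-policy`, is not stated in the diff.

```bash
#!/usr/bin/env bash
# Hypothetical helper: builds (and prints) an `eval impact run` command line.
# Arguments: <run-dir> [none|command-exit|rubric] [rubric-json-path]
# Note: the result is a plain string, so paths containing spaces are not handled.
impact_run_command() {
  local run_dir="$1" policy="${2:-none}" rubric="${3:-}"
  local cmd="aiwiki-toolkit eval impact run --run-dir ${run_dir} --all-slots"
  case "${policy}" in
    none)
      # README: the default score policy is `none`, so no flag is added here.
      ;;
    command-exit)
      # Smoke-test scoring only; not a correctness claim.
      cmd="${cmd} --score-policy command-exit"
      ;;
    rubric)
      # Assumption of this sketch: always pass the rubric file explicitly.
      [ -n "${rubric}" ] || { echo "rubric policy needs a rubric JSON path" >&2; return 1; }
      cmd="${cmd} --score-policy rubric --rubric ${rubric}"
      ;;
    *)
      echo "unknown score policy: ${policy}" >&2
      return 1
      ;;
  esac
  echo "${cmd}"
}

# Example values taken from the README's command list.
impact_run_command /path/to/eval-run rubric evals/impact/rubrics/my-family.json
```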
+Use `eval impact benchmark` when you want one command to prepare a family and immediately run all
+slots. It wraps `prepare` plus `run`, then returns the prepared run directory, run result, validation
+status, scores, and report bundle.
+
+Use `eval impact schedule report` to generate a periodic benchmark dashboard under
+`ai-wiki/_toolkit/evals/reports/<period>/`. It combines registered families, the managed candidate
+queue, and the run index. Pass the same candidate filters you use for discovery, such as `--handle`,
+`--since`, and `--candidate-max-items`, so the scheduled report does not accidentally stale a
+larger queue with a narrower refresh. Use `eval impact schedule run --family <name>` or
+`--all-runnable` to run benchmarks, append `ai-wiki/_toolkit/evals/runs/index.json`, refresh the
+report, and record `ai-wiki/_toolkit/evals/schedule/state.json`. `--if-due` is intended for cron,
+launchd, or an agent workflow that should run at most once per period.
+
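Since `--if-due` is described as intended for cron or launchd, a scheduler wrapper might look like the sketch below. Everything outside the `aiwiki-toolkit` command line is an assumption: the function name `scheduled_impact_run`, the `AIWIKI_BIN` override, and the idea that the command should be started from the repository root (inferred only from the relative `ai-wiki/_toolkit/...` paths in the README).

```bash
#!/usr/bin/env bash
# Hypothetical wrapper for cron/launchd. The schedule command itself is copied
# from the README; the working-directory handling is this sketch's assumption.
# Arguments: <repo-dir>
scheduled_impact_run() {
  local repo_dir="$1"
  # AIWIKI_BIN is a hypothetical override; set it to `echo` for a dry run.
  local bin="${AIWIKI_BIN:-aiwiki-toolkit}"
  # Subshell so the caller's working directory is left untouched.
  ( cd "${repo_dir}" && "${bin}" eval impact schedule run --all-runnable --if-due --score-policy rubric )
}

# Dry-run usage (prints the arguments instead of invoking the toolkit):
#   AIWIKI_BIN=echo scheduled_impact_run /path/to/repo
```

Because `--if-due` is meant to cap runs at one per period, the scheduler can call the wrapper more often than the period without the wrapper tracking state itself.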
+Use `eval impact capture` after a manual first pass or repaired pass to save `result.json`, the
+workspace diff, status, head, and optional final-message artifact. It infers slot variant and
+workspace from `metadata.json` when possible. Use `eval impact validate` after exporting visible
+sessions to write `confounds.json`; missing exports are reported as critical confounds rather than
+silently accepted. Use `eval impact score` to write the manual `score.json` artifact for a slot.
+Each of these commands refreshes `manifest.json` and `manifest.md` so the run inventory stays
+current.
+
+The report and manifest commands read an existing run directory with `metadata.json`, result
+captures, optional `score.json` files, and optional `confounds.json`. The `eval impact report`
+command compares the run's primary variants, normally `no_aiwiki_workflow` versus
+`aiwiki_ambient_memory_workflow`, using first-attempt metrics only: `first_pass` captures count
+toward the signal, while `final` repair captures stay diagnostic. The command reports
+first-attempt success rate, average score, attempts, human nudges, changed files, untracked files,
+change-profile splits for project files versus AI wiki telemetry and user-owned wiki churn, and
+whether the run is ready for shareable causal claims. It does not run agents.
+
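The first-attempt rule in the report paragraph (only `first_pass` captures count, `final` captures stay diagnostic) can be illustrated with a small `awk` calculation. This is not the toolkit's implementation: the three-column text input, the `failure` label, and the sample rows are all hypothetical, and the real artifacts are JSON files whose schema the diff does not show.

```bash
#!/usr/bin/env bash
# Illustrative only. Reads rows of "<variant> <capture> <label>" on stdin and
# prints "<variant> <successes>/<first-pass-total> <rate>" per variant.
first_attempt_success_rate() {
  awk '
    # Count only first_pass captures; final (repair) captures are ignored.
    $2 == "first_pass" { total[$1]++; if ($3 == "success") ok[$1]++ }
    END {
      for (v in total)
        printf "%s %d/%d %.2f\n", v, ok[v], total[v], ok[v] / total[v]
    }
  ' | sort
}

# Hypothetical sample rows; variant names come from the README.
first_attempt_success_rate <<'EOF'
no_aiwiki_workflow first_pass failure
no_aiwiki_workflow first_pass success
no_aiwiki_workflow final success
aiwiki_ambient_memory_workflow first_pass success
aiwiki_ambient_memory_workflow first_pass success
EOF
```

In the sample, the `final success` row does not raise the `no_aiwiki_workflow` rate, which is the distinction the README draws between signal and diagnostic captures.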
+Use `eval impact manifest` to audit run identity before interpreting scores. It reports the
+baseline ref, prompt hashes, model, reasoning effort, execution surface, slot-to-variant mapping,
+session export presence, confounds, and captured artifact paths.
 
 Use `eval impact summarize` to aggregate multiple captured runs into a product-level dashboard.
 It reports each family's primary outcome, product signal, shareability, success and score deltas,
package/bin/aiwiki-toolkit
CHANGED
Binary file
package/package.json
CHANGED