PyPI - selfevals - Versions diffs - 0.2.2__tar.gz → 0.4.0__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (242)

{selfevals-0.2.2 → selfevals-0.4.0}/.gitignore RENAMED

@@ -51,6 +51,9 @@ Thumbs.db
 .bootstrap/
 data/
+# Agent worktrees
+.claude/
 # Secrets
 .env
 .env.*

{selfevals-0.2.2 → selfevals-0.4.0}/CHANGELOG.md RENAMED

@@ -7,10 +7,119 @@ Versions follow [SemVer](https://semver.org/).
 ## [Unreleased]
+## [0.4.0] - 2026-05-28
+### Added
+- **Recall-based `must_include` grading (`Expected.min_recall`).** A new
+  optional `min_recall` float in `[0, 1]` on `EvalCase.expected`. When
+  set (and `must_include` is non-empty), the `DeterministicGrader` grades
+  `must_include` by *recall* — the fraction of required substrings that
+  appear in the response — instead of all-or-nothing: the grade is `pass`
+  iff `recall >= min_recall`, and `score` becomes the recall value
+  (exposed in `details["recall"]`). Missing substrings still emit their
+  `missing_required_substring` failure mode for diagnostics but no longer
+  force a FAIL on their own. Hard violations (`must_not_include`,
+  `required_tools`/`forbidden_tools`, `regex_match`, `structured_output`)
+  always take precedence: any hard violation still forces FAIL even when
+  recall clears the threshold. When `min_recall` is `None` (the default),
+  `must_include` stays all-or-nothing as before.
+- **Cache hit counts in the JSON report.** Each iteration in
+  `selfevals report --format json` (and the live `run --format json`) now
+  carries a `"cache": {"hits": N, "llm_calls": M}` object — the number of
+  cache-hit LLM spans and the total LLM-call spans across that iteration's
+  traces — so cost/throughput consumers can read cache effectiveness
+  without raw trace spelunking.
+- **Per-iteration failure rationales in the JSON report.** Each iteration
+  now carries a `"failure_reasons"` array: deduplicated grader rationales
+  for every non-passing grade, one entry per distinct
+  `(grader, label, reason)` with `score` and `failure_modes`. This lets a
+  downstream consumer see *why* a grader failed without reading SQLite.
+  (Populated on a live `run`; empty when an experiment is reconstructed
+  from storage, e.g. via `report` or the HTTP API — see
+  [`docs/json_report_schema.md`](docs/json_report_schema.md).)
+- **Thread viewer (web + API).** `GET /api/workspaces/{ws}/threads/{thread}`
+  (already shipped, now documented) assembles every `Trace` sharing a
+  `thread_id` into an ordered, turn-by-turn conversation (`ThreadResponse`),
+  each turn carrying its `primary_grade` and `grader_results`. New web
+  route `/[workspace]/threads/[thread]` renders the multi-turn conversation.
+- **Funnel drill-down (web + API).** New endpoint
+  `GET /api/workspaces/{ws}/iterations/{id}/funnel` returns the per-iteration
+  grader funnel (`FunnelResponse`, recursive `FunnelNodeResponse` nodes read
+  straight from `IterationRecord.metrics.funnel`). New "Funnel" tab on the
+  experiment-detail view renders it via the recursive `FunnelNode.svelte`
+  component. `nodes` is empty when no grader emitted a breakdown.
+- **Server-rendered iteration compare (web + API).** New endpoint
+  `GET /api/workspaces/{ws}/experiments/{id}/compare?a={itr}&b={itr}` returns
+  a structured `CompareResponse` (proposal diff, metrics diff, failure-mode
+  diff, funnel diff, winner recommendation, `holdout_status`) computed by the
+  reporter's `compute_compare` — the single source of truth shared with the
+  CLI `compare` command. Returns 404 when an iteration is unknown and 400
+  when the two iterations belong to different experiments. The web "Compare"
+  tab now renders this diff server-side instead of recomputing deltas in the
+  browser.
+### Documentation
+- New [`docs/api_reference.md`](docs/api_reference.md): the canonical HTTP
+  API reference — every endpoint, grouped by resource, with method, path,
+  params, response schema, and error codes.
+- New [`docs/eval_config.md`](docs/eval_config.md): the YAML eval-config
+  reference (top-level keys, `EvalCase`/`Expected` fields including
+  `min_recall`, graders, agent transports, proposers) with validating
+  snippets.
+- New [`docs/json_report_schema.md`](docs/json_report_schema.md): the
+  `report --format json` output shape, documenting every root and
+  per-iteration key (including the new `cache` and `failure_reasons`).
+- `docs/FRONTEND.md` §3/§5: the funnel, compare, and thread endpoints/views
+  are now documented as shipped.
+## [0.3.0] - 2026-05-27
+### Added
+- **Validated multi-turn conversation input.** When `EvalCase.input`
+  carries a `messages` key it is validated as a typed conversation:
+  non-empty message list, roles from a new `MessageRole` enum
+  (system/user/assistant/tool), content as a string or a list of
+  content blocks, multimodal-aware via the `Modality` enum. New
+  `Message`, `ContentBlock`, and `ConversationInput` models, plus
+  `EvalCase.conversation()` / `EvalCase.is_conversation()` accessors.
+  Inputs without a `messages` key remain opaque payloads, so the
+  field stays a plain JSON dict that adapters receive verbatim.
+- **Async-first evaluators.** `AgentAdapter.invoke` and `Grader.grade`
+  are now async. The executor runs repetitions concurrently and the
+  optimization loop grades concurrently, each bounded by a
+  configurable semaphore (`concurrency` / `grade_concurrency`,
+  default 8). `EmbeddedAdapter` accepts sync or async callables,
+  `CliCommandAdapter` uses an asyncio subprocess, and
+  `HttpEndpointAdapter` is native async on httpx. `asyncio.run` is
+  confined to the CLI edge.
+### Changed
+- `httpx` is now a runtime dependency (the default HTTP adapter
+  transport), not just a dev dependency.
+### Documentation
+- STATUS.md and README banners read v0.3.0; multi-turn input and async
+  evaluators moved into "What works"; test counts refreshed (default
+  surface 559, full 597); roadmap records both as shipped in 0.3.0.
 ## [0.2.2] - 2026-05-27
 ### Documentation
+- STATUS.md and README banners now read v0.2.2 (they had lagged at
+  v0.2.1 despite the 0.2.2 release). Refreshed the STATUS body against
+  the current tree: test counts (default surface 481 -> 528, full
+  surface 566, extras-gated 9 -> 24), and the forward-looking
+  "What v0.2 will probably contain" section became a "Roadmap" that
+  separates what shipped in 0.2.x from what remains on the backlog.
+### Documentation
 - Onboarding pass after the `bootstrap` -> `selfevals` rename. Fixed the
   CI mypy target (`src/bootstrap` -> `src/selfevals`) and 13 stale
   `bootstrap` CLI/prose references in the bundled error-analysis skill.
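
The recall-based `must_include` grading described in the 0.4.0 entry of this changelog can be sketched roughly as follows. This is a minimal illustration, not the package's `DeterministicGrader`: the `RecallGrade` class, the `grade_must_include` function, the `hard_violation` flag, plain case-sensitive substring matching, and the 1.0/0.0 score in all-or-nothing mode are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RecallGrade:
    """Hypothetical result container; not the selfevals grade model."""
    passed: bool
    score: float
    details: dict = field(default_factory=dict)
    failure_modes: list = field(default_factory=list)


def grade_must_include(response, must_include, min_recall=None, hard_violation=False):
    """Grade required substrings by recall when min_recall is set.

    hard_violation stands in for any hard rule (must_not_include, tools,
    regex, structured output) having already failed elsewhere.
    """
    # Assumption: plain, case-sensitive substring matching.
    missing = [s for s in must_include if s not in response]
    # Missing substrings are still reported for diagnostics in both modes.
    failure_modes = ["missing_required_substring" for _ in missing]

    if min_recall is not None and must_include:
        recall = (len(must_include) - len(missing)) / len(must_include)
        # A hard violation forces FAIL even when recall clears the threshold.
        passed = recall >= min_recall and not hard_violation
        return RecallGrade(passed, recall, {"recall": recall}, failure_modes)

    # Default: all-or-nothing. The 1.0/0.0 score here is an assumption.
    passed = not missing and not hard_violation
    return RecallGrade(passed, 1.0 if passed else 0.0, {}, failure_modes)
```

With four required substrings and three present, `min_recall=0.75` passes with a score of 0.75, while the default (`min_recall=None`) fails the same response.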

{selfevals-0.2.2 → selfevals-0.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: selfevals
-Version: 0.2.2
+Version: 0.4.0
 Summary: Self-improving evals framework for AI agents.
 Project-URL: Homepage, https://github.com/patovaldezf/selfevals
 Project-URL: Repository, https://github.com/patovaldezf/selfevals
@@ -18,6 +18,7 @@ Classifier: Programming Language :: Python :: 3.12
 Classifier: Programming Language :: Python :: 3.13
 Classifier: Topic :: Software Development :: Testing
 Requires-Python: >=3.12
+Requires-Dist: httpx<1,>=0.27
 Requires-Dist: pydantic<3,>=2.7
 Requires-Dist: pyyaml<7,>=6
 Provides-Extra: all
@@ -94,9 +95,10 @@ configuration to keep. CLI-first, multi-tenant from day one, and agnostic
 to the agent framework underneath — selfevals never calls your provider;
 your agent does, and selfevals grades the result.
-> Status: **v0.2.1 — runtime functional.** The CLI works end-to-end:
+> Status: **v0.3.0 — runtime functional.** The CLI works end-to-end:
 > load an experiment spec → run cases through an adapter → grade traces →
-> persist iterations → render a report. See [`docs/spec/`](docs/spec/) for
+> persist iterations → render a report. Adapters and graders are async,
+> with concurrent repetitions and grading. See [`docs/spec/`](docs/spec/) for
 > the canonical and operational specs that drive design, and
 > [`docs/STATUS.md`](docs/STATUS.md) for an honest what-works / what-doesn't
 > snapshot.
@@ -153,7 +155,7 @@ The five nouns you'll meet everywhere:
 | Term | What it is |
 |------|------------|
-| **EvalCase** | One test: an input, the expected outcome, and which graders apply. |
+| **EvalCase** | One test: an input (a validated multi-turn `messages` conversation, or any opaque payload), the expected outcome, and which graders apply. |
 | **Adapter** | The bridge to your agent — embedded callable, CLI subprocess, or HTTP endpoint. selfevals calls *it*, never the provider directly. |
 | **Grader** | Scores a trace. `DeterministicGrader` (rules: substrings, tools, JSON schema) or `LLMJudgeGrader` (a rubric-driven judge). |
 | **Proposer** | Picks the next parameter configuration to try — `manual`, `grid`, or `random`. |
@@ -223,7 +225,7 @@ its arguments. The surface:
 |---------|---------|
 | `init <slug>` | Create a workspace and seed the default failure-mode taxonomy. |
 | `run <spec.yaml>` | Run an experiment spec end-to-end. |
-| `report <ws> <exp>` | Render a stored experiment as markdown (`--format json` for JSON). |
+| `report <ws> <exp>` | Render a stored experiment as markdown (`--format json` for JSON; the JSON now includes per-iteration `cache` hit counts and deduplicated `failure_reasons`). |
 | `compare <ws> <itr_a> <itr_b>` | Diff two iterations side by side. |
 | `estimate` | Dry-run cost estimate for a search space × cases × reps. |
 | `workspace show <ws>` | Inspect a workspace. |
@@ -247,6 +249,17 @@ candidate modes via `failuremode promote`. The bundled
 [`error-analysis` skill](src/selfevals/.agents/skills/error-analysis/SKILL.md)
 (discoverable via `selfevals skills list`) encodes the method.
+## Documentation
+| Doc | What it covers |
+|-----|----------------|
+| [`docs/eval_config.md`](docs/eval_config.md) | The YAML experiment spec: top-level keys, `EvalCase`/`Expected` fields (including recall-based `must_include` via `min_recall`), graders, agent transports, and proposers. |
+| [`docs/api_reference.md`](docs/api_reference.md) | The canonical HTTP API reference — every endpoint, response schema, and error codes. |
+| [`docs/json_report_schema.md`](docs/json_report_schema.md) | The `report --format json` output shape, including the per-iteration `cache` and `failure_reasons` keys. |
+| [`docs/adapters.md`](docs/adapters.md) | Adapter contract and per-transport YAML/code snippets. |
+| [`docs/FRONTEND.md`](docs/FRONTEND.md) | The web UI spec (views, endpoints, roadmap). |
+| [`docs/STATUS.md`](docs/STATUS.md) | Honest what-works / what-doesn't snapshot. |
 ## Layout
 ```

{selfevals-0.2.2 → selfevals-0.4.0}/README.md RENAMED

@@ -9,9 +9,10 @@ configuration to keep. CLI-first, multi-tenant from day one, and agnostic
 to the agent framework underneath — selfevals never calls your provider;
 your agent does, and selfevals grades the result.
-> Status: **v0.2.1 — runtime functional.** The CLI works end-to-end:
+> Status: **v0.3.0 — runtime functional.** The CLI works end-to-end:
 > load an experiment spec → run cases through an adapter → grade traces →
-> persist iterations → render a report. See [`docs/spec/`](docs/spec/) for
+> persist iterations → render a report. Adapters and graders are async,
+> with concurrent repetitions and grading. See [`docs/spec/`](docs/spec/) for
 > the canonical and operational specs that drive design, and
 > [`docs/STATUS.md`](docs/STATUS.md) for an honest what-works / what-doesn't
 > snapshot.
@@ -68,7 +69,7 @@ The five nouns you'll meet everywhere:
 | Term | What it is |
 |------|------------|
-| **EvalCase** | One test: an input, the expected outcome, and which graders apply. |
+| **EvalCase** | One test: an input (a validated multi-turn `messages` conversation, or any opaque payload), the expected outcome, and which graders apply. |
 | **Adapter** | The bridge to your agent — embedded callable, CLI subprocess, or HTTP endpoint. selfevals calls *it*, never the provider directly. |
 | **Grader** | Scores a trace. `DeterministicGrader` (rules: substrings, tools, JSON schema) or `LLMJudgeGrader` (a rubric-driven judge). |
 | **Proposer** | Picks the next parameter configuration to try — `manual`, `grid`, or `random`. |
@@ -138,7 +139,7 @@ its arguments. The surface:
 |---------|---------|
 | `init <slug>` | Create a workspace and seed the default failure-mode taxonomy. |
 | `run <spec.yaml>` | Run an experiment spec end-to-end. |
-| `report <ws> <exp>` | Render a stored experiment as markdown (`--format json` for JSON). |
+| `report <ws> <exp>` | Render a stored experiment as markdown (`--format json` for JSON; the JSON now includes per-iteration `cache` hit counts and deduplicated `failure_reasons`). |
 | `compare <ws> <itr_a> <itr_b>` | Diff two iterations side by side. |
 | `estimate` | Dry-run cost estimate for a search space × cases × reps. |
 | `workspace show <ws>` | Inspect a workspace. |
@@ -162,6 +163,17 @@ candidate modes via `failuremode promote`. The bundled
 [`error-analysis` skill](src/selfevals/.agents/skills/error-analysis/SKILL.md)
 (discoverable via `selfevals skills list`) encodes the method.
+## Documentation
+| Doc | What it covers |
+|-----|----------------|
+| [`docs/eval_config.md`](docs/eval_config.md) | The YAML experiment spec: top-level keys, `EvalCase`/`Expected` fields (including recall-based `must_include` via `min_recall`), graders, agent transports, and proposers. |
+| [`docs/api_reference.md`](docs/api_reference.md) | The canonical HTTP API reference — every endpoint, response schema, and error codes. |
+| [`docs/json_report_schema.md`](docs/json_report_schema.md) | The `report --format json` output shape, including the per-iteration `cache` and `failure_reasons` keys. |
+| [`docs/adapters.md`](docs/adapters.md) | Adapter contract and per-transport YAML/code snippets. |
+| [`docs/FRONTEND.md`](docs/FRONTEND.md) | The web UI spec (views, endpoints, roadmap). |
+| [`docs/STATUS.md`](docs/STATUS.md) | Honest what-works / what-doesn't snapshot. |
 ## Layout
 ```

selfevals-0.4.0/docs/FE_FASE_A_PENDIENTES.md ADDED

@@ -0,0 +1,171 @@
+# Phase A: Pending items and deferred debt
+Living document. Updated every time a Phase A PR is merged and uncovers
+something that must be recorded for later.
+For the full plan, see [FRONTEND_PRODUCT_PLAN.md](FRONTEND_PRODUCT_PLAN.md).
+## Phase A status
+| # | PR | Status | Notes |
+|---|---|---|---|
+| A1 | #16 the 3 bugs + plan | ✅ merged | - |
+| A2 | #17 link iter → trace | ✅ merged | - |
+| A3 | #18 resolve pointers | ✅ merged | - |
+| A4 | #19 selfevals serve | ✅ merged | BUG-4 latent until this PR |
+| A5 | #20 human identity over ULID | ✅ merged | - |
+| A6 | #21 span kind visible + density | ✅ merged | visual QA unblocked by fixing BUG-4 |
+| BUG-4 | #22 `/api` proxy in `selfevals serve` | ✅ merged | hooks.server.ts + `SELFEVALS_API_BASE` |
+| A7 | #23 a11y rows ([button]/[link]) | ✅ merged | - |
+| A8 | pagination + virtual scroll | 🟡 in progress | envelope only on `/experiments` for now |
+## Recorded pending items (in order of discovery)
+### From A5 (#20)
+1. **URL routing by human slug.** The URL is still the ULID
+   (`/<ws_id>/experiments/<exp_id>`). Human slugs in the route
+   require:
+   - `queries.workspace_detail` to accept a slug in addition to an id.
+   - Possibly `queries.list_experiments` / `experiment_detail` to
+     resolve an experiment by slug within a workspace.
+   - SvelteKit routes with `[workspace]` accept both without changes.
+   - Scope: backend resolver + tests. Not part of Phase A; ship as its
+     own PR when it hurts.
+2. ~~**Anchor-set: CopyableId chip inside the row.**~~ ✅ Resolved in
+   A7. Row rewritten from a wrapping `<a>` to a `<div>` with an `<a>`
+   (link to the experiment) and `<CopyableId>` side by side. focus-within
+   ring for consistent keyboard feedback.
+3. ~~**Visual QA of the CopyableId chips.**~~ ✅ Unblocked by
+   BUG-4. Visual render confirmed in the trace viewer and experiment
+   detail via `selfevals serve`. Hover, the `copied` tick, and
+   stop-propagation in clickable cells remain manual verification (no
+   automated test; vitest is not set up, see A6).
+4. **Workspace ID in the overview is not copyable.** `/[workspace]/+page.svelte`
+   shows `workspace.slug` as a chip and `workspace.name` as the h1, but
+   the workspace ULID appears nowhere. Anyone who needs it for
+   curl/CLI must go to another route or read it from the URL.
+   Conscious decision: do not add it until it hurts.
+### From A6 (in progress)
+1. ~~**Visual QA of the trace tree.** Blocked by BUG-4.~~ ✅ Unblocked
+   by the BUG-4 fix. Dogfood: `curl :web_port/<ws>/traces/<run>`
+   shows the ◆ glyph + "llm" label + "adapter_response" name, all
+   rendered correctly.
+2. **Span iconography.** The Unicode glyphs (`◆ ✦ ⚙ ◇ ▽ △ ◉ ↦
+   ☞ ◈ ✕`) are functional but are not SVGs. If the set grows or the
+   visual weight falls short, evaluate an inline SVG set (with no
+   external dependency), but NOT before a real user complains
+   about the current look.
+3. **`tokens_per_second` in the tree.** I expose it in the API
+   (`keep_keys`) but do NOT render it in SpanNode yet: in pingpong it is
+   `None`. When ROADMAP #9 populates it, add it as a fact (alongside TTFT
+   and tokens). Same for `time_to_first_token_ms` in real examples.
+4. **Density: facts hidden on mobile.** `hidden sm:inline-block`
+   hides the facts on narrow viewports. Plan §1.2-7 says
+   "mobile is broken", so this is a conscious desktop-first decision,
+   not a regression. When mobile arrives, refactor SpanNode to
+   collapse facts into an expandable chevron.
+### From A7 (closed)
+No deferred debt: pending item A5#2 (anchor-set CopyableId chip)
+was resolved in this PR.
+### From A8 (in progress)
+1. **Pagination envelope only on `/experiments`.** The other list
+   endpoints (`/workspaces`, `/iterations`) still return a flat
+   list. When a user has >100 of either, apply the same upgrade
+   (the recipe already exists). For now, recorded debt.
+2. **FE UI for "Load more" / pagination.** The envelope travels but the
+   UI has no "Load more" button; it only shows "X of N" when there are
+   more pages. The reason: nobody has >100 experiments yet, so
+   there is no real use case to test against. When it hurts, add the
+   button with an incremental `offset`.
+3. **Virtual scroll: approximate row height.** `SpanTreeFlat` assumes
+   `ROW_HEIGHT_PX = 28` (uniform). If we ever add multi-line rows
+   (e.g. an error with an inline traceback), the window math
+   breaks. The correct solution when that happens: a per-row
+   `ResizeObserver` with measured height. For now a single-line fact fits.
+4. **No "scroll to selected".** If the trace has 1000 spans and the
+   selected span is in the middle, opening the page does NOT scroll
+   to that row. Selection comes from a human click (not from the URL yet),
+   so the selection always starts as `null` and the user
+   scrolls through. When we integrate the URL `?span=sp_...`, add auto-scroll.
+### From A8 (pending)
+_(pending)_
+## Bugs/debt outside Phase A discovered along the way
+### BUG-4 ✅ FIXED: `selfevals serve` does not proxy `/api` → the Node web server sees 404 everywhere
+**Symptom.** With `selfevals serve` (production mode, not `npm run dev`),
+every web route returns 404 or "Backend unreachable / API 404".
+Verified in A6 QA:
+```
+$ uv run selfevals --db /tmp/qa.sqlite serve --port 5188
+$ curl :5189/                          # 200 but renders "API 404"
+$ curl :5189/<ws>/traces/<run_id>      # 404
+$ curl :5189/api/workspaces            # 404 (no handler)
+```
+**Root cause.** `cli/commands.py:cmd_serve` spawns the SvelteKit Node
+server on `port+1` with `ORIGIN=http://host:port+1`, but **does not
+configure a proxy**. The FE uses `fetch('/api/...')` (relative) in
+`+page.server.ts`; in SSR that hits the Node server (which has no
+`/api` route), not FastAPI on `port`. The comment at
+`commands.py:735-737` itself acknowledges the gap ("the built server has no
+proxy, so the web side must call the API via absolute URLs") but
+the fix was never implemented.
+**Impact.** All of Phase A is invisible end-to-end through the official
+dogfood path. PRs A1-A6 pass typecheck + pytest + API roundtrip,
+but no user has seen the result rendered via
+`selfevals serve`. (It does work via `npm run dev` + `uvicorn`
+manually; the bug is specific to A4's "one command" path.)
+**Possible fixes (prefer 1):**
+1. `SELFEVALS_API_BASE=http://host:port` env var passed to the Node
+   subprocess + an absolute API client when it is present (server-side
+   load). The browser client stays relative.
+2. SvelteKit hooks (`hooks.server.ts`) that proxy `/api` → FastAPI
+   when they detect `SELFEVALS_API_BASE`.
+3. Serve the static build from FastAPI and drop the Node server
+   (loses SSR; not acceptable).
+**Test to add.** `test_serve_web_can_reach_api_through_node`: launches
+`selfevals serve`, waits for both ports to respond, and runs
+`curl :web_port/` expecting at least one workspace in the rendered
+HTML (not "Backend unreachable"). This pins the end-to-end
+contract and catches any future proxy regression.
+**Priority.** High. Blocks visual QA of all of Phase A. Ship as its own
+PR (BUG-4); do not entangle it with A6/A7/A8. Ideally before
+merging A6.
+**Fix shipped (PR fix/bug-4-serve-api-proxy).** Option 2:
+`web/src/hooks.server.ts` with a `handle` that intercepts `/api/*`,
+reads `SELFEVALS_API_BASE`, and `fetch`es the upstream FastAPI while
+streaming the body (important for SSE: `/api/.../stream` is
+text/event-stream with no EOF). `cmd_serve` sets
+`SELFEVALS_API_BASE=http://host:api_port` on the Node subprocess.
+End-to-end dogfood OK:
+`curl :web_port/<ws>/traces/<run_id>` now returns 200 with the full
+Phase A render (A5 human name + A6 glyph + facts).
+**Pending.** An integration test with real `node` is out of scope for the
+PR (Python CI has no Node; Web CI has no Python). The env var unit test
+in `test_serve_spawns_node_with_correct_env` covers the Python half.
+For the full matrix, configure a CI job with Python + Node
+together; record it as separate debt when it hurts.
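
The Python half of the BUG-4 fix (passing the API base URL to the Node subprocess) might look something like the sketch below. It is only an illustration of the environment wiring described in the document: the `node_env_for_serve` helper and its parameters are hypothetical, and the real `cmd_serve` may set other variables or build the environment differently.

```python
import os


def node_env_for_serve(host, api_port, base_env=None):
    """Build the environment for the SvelteKit Node subprocess.

    Hypothetical helper, not the actual cmd_serve code. It follows the
    layout described above: FastAPI listens on api_port, the Node web
    server on api_port + 1.
    """
    # Copy so the caller's environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    web_port = api_port + 1
    # ORIGIN tells the built Node server its own public origin.
    env["ORIGIN"] = f"http://{host}:{web_port}"
    # The hooks.server.ts proxy reads this to forward /api/* to FastAPI.
    env["SELFEVALS_API_BASE"] = f"http://{host}:{api_port}"
    return env
```

A dictionary like this could then be passed as the `env` argument of `subprocess.Popen` when spawning the Node server, which is roughly what a unit test such as `test_serve_spawns_node_with_correct_env` would assert on.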
