@lcv-ideas-software/cross-review 4.0.0
- package/CHANGELOG.md +2568 -0
- package/LICENSE +201 -0
- package/NOTICE +26 -0
- package/README.md +208 -0
- package/SECURITY.md +52 -0
- package/dist/scripts/api-streaming-smoke.d.ts +1 -0
- package/dist/scripts/api-streaming-smoke.js +78 -0
- package/dist/scripts/api-streaming-smoke.js.map +1 -0
- package/dist/scripts/runtime-default-smoke.d.ts +1 -0
- package/dist/scripts/runtime-default-smoke.js +88 -0
- package/dist/scripts/runtime-default-smoke.js.map +1 -0
- package/dist/scripts/runtime-smoke.d.ts +1 -0
- package/dist/scripts/runtime-smoke.js +148 -0
- package/dist/scripts/runtime-smoke.js.map +1 -0
- package/dist/scripts/smoke.d.ts +1 -0
- package/dist/scripts/smoke.js +6156 -0
- package/dist/scripts/smoke.js.map +1 -0
- package/dist/src/core/cache-manifest.d.ts +22 -0
- package/dist/src/core/cache-manifest.js +133 -0
- package/dist/src/core/cache-manifest.js.map +1 -0
- package/dist/src/core/caller-tokens.d.ts +32 -0
- package/dist/src/core/caller-tokens.js +240 -0
- package/dist/src/core/caller-tokens.js.map +1 -0
- package/dist/src/core/config.d.ts +9 -0
- package/dist/src/core/config.js +643 -0
- package/dist/src/core/config.js.map +1 -0
- package/dist/src/core/convergence.d.ts +5 -0
- package/dist/src/core/convergence.js +186 -0
- package/dist/src/core/convergence.js.map +1 -0
- package/dist/src/core/cost.d.ts +59 -0
- package/dist/src/core/cost.js +359 -0
- package/dist/src/core/cost.js.map +1 -0
- package/dist/src/core/file-config.d.ts +316 -0
- package/dist/src/core/file-config.js +490 -0
- package/dist/src/core/file-config.js.map +1 -0
- package/dist/src/core/orchestrator.d.ts +199 -0
- package/dist/src/core/orchestrator.js +3430 -0
- package/dist/src/core/orchestrator.js.map +1 -0
- package/dist/src/core/prompt-parts.d.ts +58 -0
- package/dist/src/core/prompt-parts.js +122 -0
- package/dist/src/core/prompt-parts.js.map +1 -0
- package/dist/src/core/relator-lottery.d.ts +23 -0
- package/dist/src/core/relator-lottery.js +112 -0
- package/dist/src/core/relator-lottery.js.map +1 -0
- package/dist/src/core/reports.d.ts +2 -0
- package/dist/src/core/reports.js +82 -0
- package/dist/src/core/reports.js.map +1 -0
- package/dist/src/core/session-store.d.ts +149 -0
- package/dist/src/core/session-store.js +1923 -0
- package/dist/src/core/session-store.js.map +1 -0
- package/dist/src/core/status.d.ts +61 -0
- package/dist/src/core/status.js +249 -0
- package/dist/src/core/status.js.map +1 -0
- package/dist/src/core/timeouts.d.ts +2 -0
- package/dist/src/core/timeouts.js +3 -0
- package/dist/src/core/timeouts.js.map +1 -0
- package/dist/src/core/types.d.ts +604 -0
- package/dist/src/core/types.js +36 -0
- package/dist/src/core/types.js.map +1 -0
- package/dist/src/dashboard/server.d.ts +2 -0
- package/dist/src/dashboard/server.js +339 -0
- package/dist/src/dashboard/server.js.map +1 -0
- package/dist/src/mcp/server.d.ts +54 -0
- package/dist/src/mcp/server.js +1584 -0
- package/dist/src/mcp/server.js.map +1 -0
- package/dist/src/observability/logger.d.ts +9 -0
- package/dist/src/observability/logger.js +24 -0
- package/dist/src/observability/logger.js.map +1 -0
- package/dist/src/peers/anthropic.d.ts +14 -0
- package/dist/src/peers/anthropic.js +290 -0
- package/dist/src/peers/anthropic.js.map +1 -0
- package/dist/src/peers/base.d.ts +72 -0
- package/dist/src/peers/base.js +416 -0
- package/dist/src/peers/base.js.map +1 -0
- package/dist/src/peers/deepseek.d.ts +12 -0
- package/dist/src/peers/deepseek.js +246 -0
- package/dist/src/peers/deepseek.js.map +1 -0
- package/dist/src/peers/errors.d.ts +2 -0
- package/dist/src/peers/errors.js +185 -0
- package/dist/src/peers/errors.js.map +1 -0
- package/dist/src/peers/gemini.d.ts +13 -0
- package/dist/src/peers/gemini.js +215 -0
- package/dist/src/peers/gemini.js.map +1 -0
- package/dist/src/peers/grok.d.ts +17 -0
- package/dist/src/peers/grok.js +346 -0
- package/dist/src/peers/grok.js.map +1 -0
- package/dist/src/peers/model-selection.d.ts +4 -0
- package/dist/src/peers/model-selection.js +260 -0
- package/dist/src/peers/model-selection.js.map +1 -0
- package/dist/src/peers/openai.d.ts +14 -0
- package/dist/src/peers/openai.js +299 -0
- package/dist/src/peers/openai.js.map +1 -0
- package/dist/src/peers/perplexity.d.ts +18 -0
- package/dist/src/peers/perplexity.js +375 -0
- package/dist/src/peers/perplexity.js.map +1 -0
- package/dist/src/peers/registry.d.ts +3 -0
- package/dist/src/peers/registry.js +77 -0
- package/dist/src/peers/registry.js.map +1 -0
- package/dist/src/peers/retry.d.ts +2 -0
- package/dist/src/peers/retry.js +36 -0
- package/dist/src/peers/retry.js.map +1 -0
- package/dist/src/peers/stub.d.ts +13 -0
- package/dist/src/peers/stub.js +344 -0
- package/dist/src/peers/stub.js.map +1 -0
- package/dist/src/peers/text.d.ts +18 -0
- package/dist/src/peers/text.js +39 -0
- package/dist/src/peers/text.js.map +1 -0
- package/dist/src/security/redact.d.ts +2 -0
- package/dist/src/security/redact.js +128 -0
- package/dist/src/security/redact.js.map +1 -0
- package/docs/api-keys.md +34 -0
- package/docs/architecture.md +118 -0
- package/docs/caching.md +135 -0
- package/docs/costs.md +40 -0
- package/docs/evidence-preflight.md +88 -0
- package/docs/github-security-baseline.md +32 -0
- package/docs/model-selection.md +105 -0
- package/docs/reports/cross-review-v2-api-capability-smoke-2026-04-30.md +354 -0
- package/docs/reports/cross-review-v2-format-recovery-findings-2026-04-28.md +223 -0
- package/docs/reports/cross-review-v2-official-provider-docs-refresh-2026-05-05.md +60 -0
- package/docs/reports/cross-review-v2-token-streaming-smoke-2026-04-30.md +119 -0
- package/package.json +88 -0
|
package/docs/architecture.md
ADDED
|
@@ -0,0 +1,118 @@
|
|
|
1
|
+
# Architecture
|
|
2
|
+
|
|
3
|
+
This API-only `cross-review` implementation is intentionally independent from the CLI-based `cross-review-v1` project.
|
|
4
|
+
|
|
5
|
+
## Runtime Layers
|
|
6
|
+
|
|
7
|
+
1. MCP server: exposes workflow tools over stdio.
2. Orchestrator: creates sessions, runs reviews, checks unanimity and asks the lead peer to revise.
3. Peer adapters: call official provider APIs and client libraries.
4. Model selection: queries model APIs and chooses the highest-capability documented model available to the key.
5. Session store: writes durable JSON and Markdown artifacts under `data/sessions`.
6. Session events: writes durable `events.ndjson` streams per session for long-running work.
7. Token streaming: writes count-based `peer.token.delta` and `peer.token.completed` events when provider streaming is enabled.
8. Reports: writes `session-report.md` with convergence, failures, decision quality, costs and recent events.
9. Observability: writes one NDJSON log per process under `data/logs`.
10. Dashboard: local read-only HTTP UI for sessions, events, reports, probes and metrics.
|
|
17
|
+
|
|
18
|
+
## Real Execution Rule
|
|
19
|
+
|
|
20
|
+
Runtime default is real API execution. Stubs are disabled unless `CROSS_REVIEW_STUB=1`.
|
|
21
|
+
|
|
22
|
+
## Timeout Model
|
|
23
|
+
|
|
24
|
+
Real API review rounds are intentionally long-running. The provider-side HTTP
|
|
25
|
+
timeout is controlled by `CROSS_REVIEW_TIMEOUT_MS` and defaults to 30
|
|
26
|
+
minutes.
|
|
27
|
+
|
|
28
|
+
MCP hosts also have their own client-to-server request timeout. For real peer
|
|
29
|
+
calls, configure the host timeout to at least 300 seconds. A lower generic
|
|
30
|
+
default, such as 60 seconds, can close the MCP request while the provider calls
|
|
31
|
+
are still legitimately processing.
|
|
32
|
+
|
|
33
|
+
For host environments that cannot keep a long MCP request open, use
|
|
34
|
+
`session_start_round` or `session_start_unanimous`. Those tools create a
|
|
35
|
+
background in-process job and return immediately. Use `session_poll`,
|
|
36
|
+
`session_events`, `session_metrics` and `session_report` to follow progress
|
|
37
|
+
without blocking the client request. `session_cancel_job` requests cooperative
|
|
38
|
+
cancellation and forwards `AbortSignal` to provider client calls where supported.
|
|
39
|
+
|
|
40
|
+
## Streaming Model
|
|
41
|
+
|
|
42
|
+
`CROSS_REVIEW_STREAM_EVENTS` controls normal workflow events and defaults to
|
|
43
|
+
enabled. `CROSS_REVIEW_STREAM_TOKENS` controls provider token-progress events
|
|
44
|
+
and also defaults to enabled. `runtime_capabilities.token_streaming` reflects
|
|
45
|
+
the effective token-streaming setting, not a compile-time constant.
|
|
46
|
+
|
|
47
|
+
When token streaming is active, adapters use provider-native streaming APIs:
|
|
48
|
+
|
|
49
|
+
- OpenAI: Responses API streaming events, including `response.output_text.delta`.
|
|
50
|
+
- Anthropic: Messages stream helper with text deltas and `finalMessage()`.
|
|
51
|
+
- Gemini: `models.generateContentStream`.
|
|
52
|
+
- DeepSeek: OpenAI-compatible chat completions with `stream: true`.
|
|
53
|
+
|
|
54
|
+
The streaming path is not a separate fake progress channel. The same streamed
|
|
55
|
+
text is accumulated and then parsed into the existing review or generation
|
|
56
|
+
result.
|
|
57
|
+
|
|
58
|
+
For safety, `peer.token.delta` events include character counts by default rather
|
|
59
|
+
than provider text. `CROSS_REVIEW_STREAM_TEXT=1` can include redacted text in
|
|
60
|
+
trusted local diagnostics, but it is intentionally opt-in because providers may
|
|
61
|
+
split sensitive strings across chunks. Raw thinking content is still not
|
|
62
|
+
requested or persisted.
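
The count-only default can be pictured with a minimal sketch. Only the event name `peer.token.delta` and the `CROSS_REVIEW_STREAM_TEXT` variable come from the text above; the event fields, the `buildTokenDelta` helper and the `Redactor` type are hypothetical names chosen for illustration, not the package's actual shapes.

```typescript
// Hypothetical event shape: only the event name is taken from the document.
interface TokenDeltaEvent {
  type: "peer.token.delta";
  peer: string;
  chars: number; // size of the chunk (UTF-16 code units in this sketch)
  text?: string; // present only when text streaming is explicitly opted in
}

// Assumed redaction hook; the real redaction rules are not shown in the document.
type Redactor = (chunk: string) => string;

function buildTokenDelta(
  peer: string,
  chunk: string,
  env: Record<string, string | undefined>,
  redact: Redactor,
): TokenDeltaEvent {
  const event: TokenDeltaEvent = { type: "peer.token.delta", peer, chars: chunk.length };
  // Opt-in only: text is attached (after redaction) when the flag is exactly "1".
  if (env["CROSS_REVIEW_STREAM_TEXT"] === "1") {
    event.text = redact(chunk);
  }
  return event;
}
```

Passing the environment as a parameter keeps the sketch testable; a real adapter would presumably read it once at startup.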
|
|
63
|
+
|
|
64
|
+
## Unanimity Rule
|
|
65
|
+
|
|
66
|
+
A session converges only when the caller status is `READY`, every selected peer returns `READY`, and no peer failed or omitted a machine-readable status.
|
|
67
|
+
|
|
68
|
+
Decision quality is tracked per peer:
|
|
69
|
+
|
|
70
|
+
- `clean`: parsed status without warnings.
|
|
71
|
+
- `format_warning`: parsed with non-blocking parser warnings.
|
|
72
|
+
- `recovered`: recovered through format repair, moderation-safe retry or bounded sanitization.
|
|
73
|
+
- `needs_operator_review`: no parseable status remains after recovery.
|
|
74
|
+
- `failed`: provider or model-selection failure blocked the peer.
|
|
75
|
+
|
|
76
|
+
`unparseable_after_recovery`, `prompt_flagged_by_moderation`,
|
|
77
|
+
`silent_model_downgrade` and other rejected peer failures always block
|
|
78
|
+
unanimity until resolved.
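
A minimal sketch of the unanimity check described in this section. The `PeerResult` shape and the `isConverged` name are assumptions made for illustration; the rule itself (caller `READY`, every peer `READY`, no failed or status-less peer) is taken from the text above.

```typescript
// Hypothetical per-peer record; field names are illustrative only.
interface PeerResult {
  peer: string;
  status?: string;  // machine-readable status; undefined when omitted or unparseable
  failure?: string; // e.g. "unparseable_after_recovery" or "silent_model_downgrade"
}

function isConverged(callerStatus: string, peers: PeerResult[]): boolean {
  if (callerStatus !== "READY") return false;
  // Assumption: a session with no selected peers never counts as unanimous.
  if (peers.length === 0) return false;
  // Any failure or missing status blocks unanimity, as does any non-READY status.
  return peers.every((p) => p.failure === undefined && p.status === "READY");
}
```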
|
|
79
|
+
|
|
80
|
+
## Moderation-Safe Prompting
|
|
81
|
+
|
|
82
|
+
Prior peer history is summarized from structured fields instead of replaying
|
|
83
|
+
raw model text. This keeps prompts smaller, reduces the chance that a verbose
|
|
84
|
+
peer repeats policy-sensitive language into a later provider, and produces more
|
|
85
|
+
useful audit trails.
|
|
86
|
+
|
|
87
|
+
If a provider still rejects a prompt as moderated or safety-blocked, the
|
|
88
|
+
orchestrator records the failure class and retries once with a compact,
|
|
89
|
+
sanitized review prompt. This retry does not bypass provider policy: if the
|
|
90
|
+
compact context is insufficient, the peer must return `NEEDS_EVIDENCE` or the
|
|
91
|
+
session remains blocked for operator action.
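
The retry-once behavior can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: `ReviewResult`, `reviewWithModerationRetry`, `callPeer` and `recordFailure` are hypothetical names, and real provider calls would be asynchronous. The failure class `prompt_flagged_by_moderation` is the one named in the Unanimity Rule section.

```typescript
// Hypothetical result union; the real adapter types are not shown in this document.
type ReviewResult =
  | { ok: true; status: string }
  | { ok: false; failureClass: string };

function reviewWithModerationRetry(
  callPeer: (prompt: string) => ReviewResult,
  fullPrompt: string,
  compactPrompt: string,
  recordFailure: (failureClass: string) => void,
): ReviewResult {
  const first = callPeer(fullPrompt);
  // Only a moderation rejection triggers the compact retry; other results pass through.
  if (first.ok || first.failureClass !== "prompt_flagged_by_moderation") return first;
  recordFailure(first.failureClass);
  // Exactly one retry. A second rejection is returned as-is, so the session stays blocked.
  return callPeer(compactPrompt);
}
```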
|
|
92
|
+
|
|
93
|
+
## Model Discovery
|
|
94
|
+
|
|
95
|
+
Provider model APIs are queried at probe/session initialization:
|
|
96
|
+
|
|
97
|
+
- OpenAI: Models API.
|
|
98
|
+
- Anthropic: Models API.
|
|
99
|
+
- Gemini: `models.list`.
|
|
100
|
+
- DeepSeek: OpenAI-compatible `/models`.
|
|
101
|
+
|
|
102
|
+
The selected model and selection evidence are persisted in the session capability snapshot.
|
|
103
|
+
|
|
104
|
+
## Provider Thinking Baseline
|
|
105
|
+
|
|
106
|
+
The peer adapters use the strongest official reasoning controls available for each provider because cross-review is correctness-oriented:
|
|
107
|
+
|
|
108
|
+
- OpenAI runs through the Responses API with high reasoning effort.
|
|
109
|
+
- Anthropic uses adaptive thinking and omits raw thinking content from responses.
|
|
110
|
+
- Gemini enables thinking configuration for Gemini 3.x and the Gemini 2.5 fallback.
|
|
111
|
+
- DeepSeek enables Thinking Mode and follows the official multi-round guidance by resending the summarized session context in each stateless request.
|
|
112
|
+
|
|
113
|
+
Raw chain-of-thought is not persisted. Session continuity is represented through prompts, structured peer decisions, summaries and artifacts.
|
|
114
|
+
|
|
115
|
+
## Stable Rename
|
|
116
|
+
|
|
117
|
+
Stable version `2.1.0` renamed the active product to `cross-review`. The earlier development
|
|
118
|
+
name remains only in historical changelog or memory notes.
|
package/docs/caching.md
ADDED
|
@@ -0,0 +1,135 @@
|
|
|
1
|
+
# Prompt Caching (v2.21.0+)
|
|
2
|
+
|
|
3
|
+
`cross-review` integrates with the prompt-caching surface of every supported provider. The runtime emits a uniform `provider.cache.usage` event and persists a per-session `cache_manifest.json` so dashboards, FinOps reports and post-mortem tooling can read cache telemetry without branching on provider-specific shapes.
|
|
4
|
+
|
|
5
|
+
This document describes:
|
|
6
|
+
|
|
7
|
+
- per-provider behavior matrix
|
|
8
|
+
- the `stablePrefix` cache key + schema-version invariant
|
|
9
|
+
- pair-scoped cache keys
|
|
10
|
+
- cost savings accounting
|
|
11
|
+
- operator controls (kill-switch + TTL overrides)
|
|
12
|
+
- empirical guidance per provider
|
|
13
|
+
|
|
14
|
+
## Per-provider behavior matrix
|
|
15
|
+
|
|
16
|
+
| Peer (Provider)       | Cache mode | Threshold       | TTL surface                                    | Telemetry source                                                      |
| --------------------- | ---------- | --------------- | ---------------------------------------------- | --------------------------------------------------------------------- |
| `codex` (OpenAI)      | `auto`     | ~1k tokens      | `prompt_cache_retention` (`in_memory` / `24h`) | `usage.prompt_tokens_details.cached_tokens`                           |
| `claude` (Anthropic)  | `explicit` | ~4k tokens      | `cache_control.ttl` (`5m` / `1h`)              | `usage.cache_creation_input_tokens` + `usage.cache_read_input_tokens` |
| `gemini` (Google)     | `implicit` | service-managed | n/a                                            | `usageMetadata.cachedContentTokenCount`                               |
| `deepseek` (DeepSeek) | `auto`     | service-managed | n/a                                            | `usage.prompt_cache_hit_tokens` + `usage.prompt_cache_miss_tokens`    |
| `grok` (xAI)          | `auto`     | service-managed | mirrors OpenAI                                 | `usage.prompt_tokens_details.cached_tokens`                           |
|
|
23
|
+
|
|
24
|
+
`mode` values follow the canonical `TokenUsage.cache_provider_mode` enum:
|
|
25
|
+
|
|
26
|
+
- `auto` — provider auto-detects cacheable prefix (OpenAI, DeepSeek, Grok)
|
|
27
|
+
- `explicit` — runtime places cache_control breakpoints in the body (Anthropic only)
|
|
28
|
+
- `implicit` — provider transparently caches and reports tokens read (Gemini)
|
|
29
|
+
- `not_supported` — peer call did not produce cache telemetry
|
|
30
|
+
|
|
31
|
+
## Cache key scope strategy
|
|
32
|
+
|
|
33
|
+
Every cached call is bucketed by a **pair-scoped cache key**:
|
|
34
|
+
|
|
35
|
+
```
cross-review:<peer>:<caller>:v<cache_schema_version>
```
|
|
38
|
+
|
|
39
|
+
The pair scope means two different callers reviewing the same case still share cache hits within a peer; cache invalidation is bounded by the schema version. Bumping `CROSS_REVIEW_CACHE_SCHEMA_VERSION` (e.g. `v1` → `v2`) invalidates every previously cached entry, by design. Use this when prompt structure changes materially (new convergence rule, new system role line, new evidence index format).
|
|
40
|
+
|
|
41
|
+
Internally, we also emit a `stablePrefixHash` (sha256 hex) computed over the LF-normalized stablePrefix. The hash is invariant across rounds for the same case — see `prompt-parts.ts` and the smoke marker `cache_hash_invariance_test`.
|
|
42
|
+
|
|
43
|
+
## Cache schema versioning
|
|
44
|
+
|
|
45
|
+
`stablePrefix` always begins with the line `cache_schema_version: vN`. This appears verbatim inside the cached prefix payload so any structural shift produces a different hash and a different cache scope automatically.
|
|
46
|
+
|
|
47
|
+
When to bump:
|
|
48
|
+
|
|
49
|
+
- Adding/removing a section in stablePrefix (system, task, review_focus, convergence_rules, evidence_index)
|
|
50
|
+
- Reordering sections inside stablePrefix
|
|
51
|
+
- Changing the systemRole text materially
|
|
52
|
+
- Changing the convergence rules text
|
|
53
|
+
|
|
54
|
+
Smoke marker `cache_schema_version_in_prefix_test` pins the first-line shape so a regression is caught locally.
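
As a rough illustration of the key format and the first-line invariant, the sketch below builds a pair-scoped key and checks the `cache_schema_version` line against the pattern quoted for the smoke marker. The function names and the numeric `schemaVersion` parameter are assumptions; the actual code in `prompt-parts.ts` may be organized differently.

```typescript
// Pattern quoted in the smoke-marker description of this document.
const SCHEMA_LINE = /^cache_schema_version: v\d+$/;

// Hypothetical helper: assembles the pair-scoped key in the documented format.
function buildCacheKey(peer: string, caller: string, schemaVersion: number): string {
  return `cross-review:${peer}:${caller}:v${schemaVersion}`;
}

// Hypothetical helper: the schema-version line always comes first in the prefix.
function buildStablePrefix(schemaVersion: number, sections: string[]): string {
  return [`cache_schema_version: v${schemaVersion}`, ...sections].join("\n");
}

function hasValidSchemaLine(stablePrefix: string): boolean {
  const [firstLine = ""] = stablePrefix.split("\n");
  return SCHEMA_LINE.test(firstLine);
}
```

Because the version appears both in the key and in the prefix text, bumping it changes the cache scope and the prefix content together.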
|
|
55
|
+
|
|
56
|
+
## Cost savings accounting
|
|
57
|
+
|
|
58
|
+
The cost layer (`src/core/cost.ts`) extends `CostEstimate` with two cache-related fields:
|
|
59
|
+
|
|
60
|
+
- `cache_savings_usd?: number` — populated when the rate card knows how to price the savings
|
|
61
|
+
- `cache_savings_unknown?: boolean` — set when cache telemetry was present but no rate card matched
|
|
62
|
+
|
|
63
|
+
Rate cards live in `src/core/cache-rates.json`. The primary input/output rates still flow through `config.cost_rates` (env vars `CROSS_REVIEW_<PROVIDER>_INPUT_USD_PER_MILLION` / `CROSS_REVIEW_<PROVIDER>_OUTPUT_USD_PER_MILLION`). The rate card delta is FALLBACK math: `(fresh_input_per_million - cached_input_per_million) × cache_read_tokens / 1e6`.
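
The fallback formula can be written out as a short sketch. The field `fresh_input_per_million_usd` appears in the smoke-marker list of this document; `cached_input_per_million_usd`, the `CacheRateCard` type and the `estimateCacheSavings` name are assumptions for illustration.

```typescript
// Hypothetical rate-card shape; only fresh_input_per_million_usd is named in the document.
interface CacheRateCard {
  fresh_input_per_million_usd: number;
  cached_input_per_million_usd: number; // assumed field name
}

interface CacheSavings {
  cache_savings_usd?: number;
  cache_savings_unknown?: boolean;
}

function estimateCacheSavings(card: CacheRateCard | undefined, cacheReadTokens: number): CacheSavings {
  if (cacheReadTokens <= 0) return {}; // no cache telemetry, nothing to report
  if (card === undefined) return { cache_savings_unknown: true }; // telemetry present, no rate card
  const deltaPerMillion = card.fresh_input_per_million_usd - card.cached_input_per_million_usd;
  return { cache_savings_usd: (deltaPerMillion * cacheReadTokens) / 1e6 };
}
```

For example, with purely illustrative rates of 2.00 USD fresh and 0.50 USD cached per million tokens, one million cache-read tokens would yield a saving of 1.50 USD.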
|
|
64
|
+
|
|
65
|
+
Adapters surface the read/write counts via `TokenUsage.cache_read_tokens` and `TokenUsage.cache_write_tokens`. The orchestrator reads them, emits a `provider.cache.usage` event, and appends a row to `<data_dir>/sessions/<session_id>/cache_manifest.json`.
|
|
66
|
+
|
|
67
|
+
## Bypass / kill-switch
|
|
68
|
+
|
|
69
|
+
```
CROSS_REVIEW_DISABLE_CACHE=true
```
|
|
72
|
+
|
|
73
|
+
Disables prompt caching globally for the runtime. Adapters fall back to the pre-v2.21 behavior (no `prompt_cache_key`, no `cache_control` blocks, no `x-grok-conv-id` header). The cost layer continues to merge `cache_read_tokens` / `cache_write_tokens` if a provider returns them anyway, so audit reproducibility is preserved.
|
|
74
|
+
|
|
75
|
+
Use cases:
|
|
76
|
+
|
|
77
|
+
- Provider misbehavior (cache poisoning, stale state)
|
|
78
|
+
- Audit reproducibility (force every call to make a fresh inference)
|
|
79
|
+
- A/B comparison between cached and uncached spend
|
|
80
|
+
|
|
81
|
+
## TTL configuration
|
|
82
|
+
|
|
83
|
+
```
CROSS_REVIEW_CACHE_TTL_ANTHROPIC=5m|1h   # default 1h
CROSS_REVIEW_CACHE_TTL_OPENAI=5m|1h      # default 1h
```
|
|
87
|
+
|
|
88
|
+
- **Anthropic** accepts `5m` and `1h` per the SDK. Values other than `5m`/`1h` are ignored with a stderr notice and the default is used.
|
|
89
|
+
- **OpenAI** accepts only `in_memory` and `24h` per the Responses API (the SDK type is locked to those two strings). The runtime translates `1h` → `24h` (extended retention) and anything else → `in_memory` (the default ~5 min window).
|
|
90
|
+
|
|
91
|
+
Grok mirrors OpenAI's mapping.
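
The two translation rules above might look like this in code. The function names and the `warn` callback are hypothetical; treating an unset OpenAI value as the documented default `1h` is an assumption based on the configuration block above.

```typescript
type AnthropicTtl = "5m" | "1h";
type OpenAiRetention = "in_memory" | "24h";

// Anthropic: only "5m" and "1h" are accepted; anything else falls back to the default.
function resolveAnthropicTtl(raw: string | undefined, warn: (msg: string) => void): AnthropicTtl {
  if (raw === "5m" || raw === "1h") return raw;
  if (raw !== undefined) warn(`ignoring unsupported Anthropic cache TTL: ${raw}`);
  return "1h";
}

// OpenAI: "1h" maps to extended retention, everything else to the in-memory window.
function mapOpenAiRetention(raw: string | undefined): OpenAiRetention {
  const ttl = raw ?? "1h"; // assumption: unset means the documented default of 1h
  return ttl === "1h" ? "24h" : "in_memory";
}
```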
|
|
92
|
+
|
|
93
|
+
## Anthropic cache_control placement
|
|
94
|
+
|
|
95
|
+
The Anthropic adapter places exactly **one** cache_control breakpoint at the END of the system prompt block:
|
|
96
|
+
|
|
97
|
+
```ts
system: [
  {
    type: "text",
    text: systemPromptText,
    cache_control: { type: "ephemeral", ttl: "1h" },
  },
],
```
|
|
106
|
+
|
|
107
|
+
Anthropic supports up to 4 breakpoints per request; we reserve 3 for future additions (per-message layering, tool block caching, multi-tier prefixes). The `cache_creation_input_tokens` / `cache_read_input_tokens` fields on `response.usage` map directly to `cache_write_tokens` / `cache_read_tokens` on our canonical `TokenUsage` shape.
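
The field mapping in the last sentence can be sketched as a small conversion function. The four usage field names come from the text; the interface names, the helper name and the choice to default missing counts to zero are assumptions.

```typescript
// Subset of the Anthropic usage object relevant to caching (fields as named in this document).
interface AnthropicCacheUsage {
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

// Subset of the canonical TokenUsage shape described above.
interface CacheTokenUsage {
  cache_write_tokens: number;
  cache_read_tokens: number;
}

function toCacheTokenUsage(usage: AnthropicCacheUsage): CacheTokenUsage {
  return {
    // Assumption: absent or null counts are treated as zero.
    cache_write_tokens: usage.cache_creation_input_tokens ?? 0,
    cache_read_tokens: usage.cache_read_input_tokens ?? 0,
  };
}
```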
|
|
108
|
+
|
|
109
|
+
## Empirical guidance
|
|
110
|
+
|
|
111
|
+
| Provider  | Practical minimum cached prefix | Notes                                                                                      |
| --------- | ------------------------------- | ------------------------------------------------------------------------------------------ |
| OpenAI    | ≥ 1024 tokens                   | The Responses API auto-detects; `prompt_cache_key` improves hit rate for repeat callers.   |
| Anthropic | ≥ 4096 tokens                   | Below this size Anthropic may not actually create the cache entry. Adapter emits a notice. |
| Gemini    | service-managed                 | Implicit only at this writing; explicit `caches.create` is deferred.                       |
| DeepSeek  | service-managed                 | Auto-cached; both hit and miss tokens are returned.                                        |
| Grok      | service-managed                 | xAI mirrors OpenAI; `x-grok-conv-id` header binds the cache scope.                         |
|
|
118
|
+
|
|
119
|
+
## Reference URLs
|
|
120
|
+
|
|
121
|
+
- OpenAI prompt caching: https://platform.openai.com/docs/guides/prompt-caching
|
|
122
|
+
- Anthropic prompt caching: https://docs.claude.com/en/docs/build-with-claude/prompt-caching
|
|
123
|
+
- Google Gemini caching: https://ai.google.dev/gemini-api/docs/caching
|
|
124
|
+
- DeepSeek context caching: https://api-docs.deepseek.com/guides/kv_cache
|
|
125
|
+
- xAI / Grok caching: https://docs.x.ai/docs/api-reference
|
|
126
|
+
|
|
127
|
+
## Smoke markers
|
|
128
|
+
|
|
129
|
+
The smoke harness (`scripts/smoke.ts`) ships five anti-drift markers covering this surface:
|
|
130
|
+
|
|
131
|
+
- `cache_hash_invariance_test` — round/draft/priorRounds permutations do NOT mutate `stablePrefixHash`
|
|
132
|
+
- `cache_schema_version_in_prefix_test` — first line of `stablePrefix` matches `^cache_schema_version: v\d+$`
|
|
133
|
+
- `cache_rates_json_loaded_test` — every provider has a rate card with a numeric `fresh_input_per_million_usd`
|
|
134
|
+
- `cache_manifest_atomic_write_test` — sequential appends preserve every entry
|
|
135
|
+
- `cache_disable_kill_switch_test` — `CROSS_REVIEW_DISABLE_CACHE=true` flips `config.cache.enabled`
|
package/docs/costs.md
ADDED
|
@@ -0,0 +1,40 @@
|
|
|
1
|
+
# Costs
|
|
2
|
+
|
|
3
|
+
Runtime calls are real provider API calls by default.
|
|
4
|
+
|
|
5
|
+
## Smoke Tests
|
|
6
|
+
|
|
7
|
+
`npm test` uses `CROSS_REVIEW_STUB=1` and does not call provider APIs.
|
|
8
|
+
|
|
9
|
+
## Real Runs
|
|
10
|
+
|
|
11
|
+
`probe_peers`, `session_init`, `ask_peers` and `run_until_unanimous` may call provider APIs when keys are present.
|
|
12
|
+
|
|
13
|
+
The server records token usage returned by providers. Paid review/generation tools are blocked until explicit budget ceilings and rate cards are configured. This avoids stale hard-coded prices because provider pricing changes frequently.
|
|
14
|
+
|
|
15
|
+
`CROSS_REVIEW_MAX_OUTPUT_TOKENS` controls the maximum output budget requested from all providers. The default is `20000`; raise or lower it in the MCP host configuration according to the desired quality/cost tradeoff. Invalid, zero or negative values fall back to the default.
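
The fallback rule for this variable can be sketched as below. The helper name is hypothetical, and requiring an integer value is an assumption; the default of `20000` and the fallback for invalid, zero or negative values follow the paragraph above.

```typescript
const DEFAULT_MAX_OUTPUT_TOKENS = 20000;

// The environment is passed in as a plain record so the sketch has no runtime dependencies.
function resolveMaxOutputTokens(env: Record<string, string | undefined>): number {
  const parsed = Number(env["CROSS_REVIEW_MAX_OUTPUT_TOKENS"]);
  // NaN (unset or non-numeric), zero and negative values all fall back to the default.
  return Number.isInteger(parsed) && parsed > 0 ? parsed : DEFAULT_MAX_OUTPUT_TOKENS;
}
```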
|
|
16
|
+
|
|
17
|
+
## Required Financial Configuration
|
|
18
|
+
|
|
19
|
+
Set rates through Windows environment variables or the MCP host configuration before running paid calls. Values are USD per million tokens. Use current official provider pricing; this project intentionally does not ship default provider prices.
|
|
20
|
+
|
|
21
|
+
```powershell
[Environment]::SetEnvironmentVariable("CROSS_REVIEW_MAX_SESSION_COST_USD", "20", "User")
[Environment]::SetEnvironmentVariable("CROSS_REVIEW_PREFLIGHT_MAX_ROUND_COST_USD", "20", "User")
[Environment]::SetEnvironmentVariable("CROSS_REVIEW_UNTIL_STOPPED_MAX_COST_USD", "20", "User")
[Environment]::SetEnvironmentVariable("CROSS_REVIEW_OPENAI_INPUT_USD_PER_MILLION", "<current OpenAI input rate>", "User")
[Environment]::SetEnvironmentVariable("CROSS_REVIEW_OPENAI_OUTPUT_USD_PER_MILLION", "<current OpenAI output rate>", "User")
[Environment]::SetEnvironmentVariable("CROSS_REVIEW_ANTHROPIC_INPUT_USD_PER_MILLION", "<current Anthropic input rate>", "User")
[Environment]::SetEnvironmentVariable("CROSS_REVIEW_ANTHROPIC_OUTPUT_USD_PER_MILLION", "<current Anthropic output rate>", "User")
[Environment]::SetEnvironmentVariable("CROSS_REVIEW_GEMINI_INPUT_USD_PER_MILLION", "<current Gemini input rate>", "User")
[Environment]::SetEnvironmentVariable("CROSS_REVIEW_GEMINI_OUTPUT_USD_PER_MILLION", "<current Gemini output rate>", "User")
[Environment]::SetEnvironmentVariable("CROSS_REVIEW_DEEPSEEK_INPUT_USD_PER_MILLION", "<current DeepSeek input rate>", "User")
[Environment]::SetEnvironmentVariable("CROSS_REVIEW_DEEPSEEK_OUTPUT_USD_PER_MILLION", "<current DeepSeek output rate>", "User")
```
|
|
34
|
+
|
|
35
|
+
`CROSS_REVIEW_MAX_SESSION_COST_USD` sets the default per-session budget guard. `CROSS_REVIEW_PREFLIGHT_MAX_ROUND_COST_USD` blocks a round before calls begin when the estimated cost exceeds the configured value. `CROSS_REVIEW_UNTIL_STOPPED_MAX_COST_USD` is required for `until_stopped=true`.
|
|
36
|
+
|
|
37
|
+
When the estimated session cost exceeds the configured limit, the run is
|
|
38
|
+
finalized as `max-rounds` with reason `budget_exceeded`. Missing financial
|
|
39
|
+
configuration finalizes the session as `max-rounds` with reason
|
|
40
|
+
`financial_controls_missing` before any paid provider call is made.
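
A minimal sketch of the two guard outcomes described above. The `FinancialConfig` and `BudgetDecision` types and the `checkBudget` name are hypothetical; the outcome `max-rounds` and the reasons `budget_exceeded` and `financial_controls_missing` are the ones this section names. Which settings count as required is simplified here to one ceiling plus one input and one output rate.

```typescript
// Hypothetical, simplified view of the financial settings for a single provider.
interface FinancialConfig {
  maxSessionCostUsd?: number;
  inputUsdPerMillion?: number;
  outputUsdPerMillion?: number;
}

type BudgetDecision =
  | { proceed: true }
  | { proceed: false; outcome: "max-rounds"; reason: "budget_exceeded" | "financial_controls_missing" };

function checkBudget(config: FinancialConfig, estimatedSessionCostUsd: number): BudgetDecision {
  const { maxSessionCostUsd, inputUsdPerMillion, outputUsdPerMillion } = config;
  // Missing ceilings or rates block the session before any paid call is made.
  if (maxSessionCostUsd === undefined || inputUsdPerMillion === undefined || outputUsdPerMillion === undefined) {
    return { proceed: false, outcome: "max-rounds", reason: "financial_controls_missing" };
  }
  if (estimatedSessionCostUsd > maxSessionCostUsd) {
    return { proceed: false, outcome: "max-rounds", reason: "budget_exceeded" };
  }
  return { proceed: true };
}
```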
|
|
package/docs/evidence-preflight.md
ADDED
|
@@ -0,0 +1,88 @@
|
|
|
1
|
+
# Evidence Preflight (v3.5.0)
|
|
2
|
+
|
|
3
|
+
`run_until_unanimous` and `session_start_unanimous` run a **pure textual
|
|
4
|
+
evidence preflight** before any paid peer call. It catches the
|
|
5
|
+
`f0db3970`-class failure — a submission that _claims_ completed
|
|
6
|
+
operational work (tests pass, a diff exists, a build was validated) but
|
|
7
|
+
embeds **zero concrete evidence** — and fails it locally with
|
|
8
|
+
`needs_evidence_preflight` instead of burning API budget across multiple
|
|
9
|
+
`NEEDS_EVIDENCE` rounds.
|
|
10
|
+
|
|
11
|
+
## Scope boundary (important)
|
|
12
|
+
|
|
13
|
+
cross-review is an **API-only orchestrator**. The preflight:
|
|
14
|
+
|
|
15
|
+
- **does NOT** execute shell, run `git diff`, or read the repo;
|
|
16
|
+
- **does NOT** package evidence for you;
|
|
17
|
+
- only inspects text the caller already supplied — `task`,
|
|
18
|
+
`initial_draft`, the structured `evidence` field, and
|
|
19
|
+
already-attached evidence.
|
|
20
|
+
|
|
21
|
+
Evidence _packaging_ (`git diff --stat`, hunks, `git status --short`,
|
|
22
|
+
validation-command tails, changed-file lists, target commit) is a
|
|
23
|
+
**caller-side responsibility**. Build it into a local agent helper,
|
|
24
|
+
prompt workflow, or shared utility — never inside the MCP server.
|
|
25
|
+
|
|
26
|
+
## How it decides
|
|
27
|
+
|
|
28
|
+
The preflight trips **only** when **both** are true:
|
|
29
|
+
|
|
30
|
+
1. **Completed-work claim present** — the text matches one of:
|
|
31
|
+
`\d+ passed/failed`, `git diff`, `git status`, `npm run`,
|
|
32
|
+
`cargo test|build`, `build passed/succeeded/clean/green`,
|
|
33
|
+
`tests? pass/passed/green`, `git diff --check`.
|
|
34
|
+
2. **Zero evidence markers** — the text contains none of: fenced code
|
|
35
|
+
blocks (` ``` `), `@@ -`/`@@ +` diff hunks, 7+ hex-char hashes,
|
|
36
|
+
`file.ext:NN` line refs, `$ `/`> ` command-prompt lines.
|
|
37
|
+
|
|
38
|
+
Mere keyword presence does **not** trip it. "I plan to write a patch"
|
|
39
|
+
or "here is the test plan" is a design review with legitimately no diff
|
|
40
|
+
— it passes.
|
|
41
|
+
|
|
42
|
+
A non-empty `evidence` field **or** any attached evidence makes the
|
|
43
|
+
preflight pass unconditionally — that is the caller's authoritative
|
|
44
|
+
declaration that concrete evidence exists.
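
The two-condition decision can be approximated with a short sketch. The regular expressions below are loose approximations of the patterns listed above, not the package's actual expressions, and `evidencePreflightTrips` with its parameters is a hypothetical name.

```typescript
// Approximate completed-work claim patterns (assumed regexes, modeled on the list above).
const CLAIM_PATTERNS: RegExp[] = [
  /\b\d+\s+(passed|failed)\b/i,
  /\bgit (diff|status)\b/i,
  /\bnpm run\b/i,
  /\bcargo (test|build)\b/i,
  /\bbuild (passed|succeeded|clean|green)\b/i,
  /\btests? (pass|passed|green)\b/i,
];

// Approximate evidence markers (assumed regexes, modeled on the list above).
const EVIDENCE_PATTERNS: RegExp[] = [
  /`{3}/,                // fenced code block
  /^@@ [-+]/m,           // diff hunk header
  /\b[0-9a-f]{7,}\b/i,   // hash-like run of 7+ hex characters
  /[\w./-]+\.\w+:\d+/,   // file.ext:NN line reference
  /^(\$|>) /m,           // command-prompt line
];

function evidencePreflightTrips(text: string, evidence = "", attachments = 0): boolean {
  // A non-empty evidence field or any attachment passes unconditionally.
  if (evidence.trim() !== "" || attachments > 0) return false;
  const claimsCompletedWork = CLAIM_PATTERNS.some((re) => re.test(text));
  const hasEvidenceMarker = EVIDENCE_PATTERNS.some((re) => re.test(text));
  // Trip only when a claim is present AND no marker is found.
  return claimsCompletedWork && !hasEvidenceMarker;
}
```

Under these approximations, "Tests pass and the build passed." would trip, while "I plan to write a patch" would not, because it makes no completed-work claim.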
|
|
45
|
+
|
|
46
|
+
## Minimum evidence format
|
|
47
|
+
|
|
48
|
+
To pass the preflight when your task makes a completed-work claim,
|
|
49
|
+
embed at least one of these inline in `initial_draft` or in the
|
|
50
|
+
`evidence` field:
|
|
51
|
+
|
|
52
|
+
- a fenced code block with the relevant diff hunk(s) — `@@ -N,M +N,M @@`;
|
|
53
|
+
- `file/path.ext:LINE` references for every changed location;
|
|
54
|
+
- raw command output (the `$ cmd` line plus its output) for every
|
|
55
|
+
validation you claim — `npm run typecheck`, `cargo test`,
|
|
56
|
+
`git diff --check`, `rg` scans;
|
|
57
|
+
- content hashes (`sha256`, `md5`) when asserting artifact identity.
|
|
58
|
+
|
|
59
|
+
This is the same provenance-grade material the R1 evidence-upfront
|
|
60
|
+
contract already expects. The preflight just refuses to spend API when
|
|
61
|
+
it is obviously absent.
|
|
62
|
+
|
|
63
|
+
## The `evidence` field
|
|
64
|
+
|
|
65
|
+
Both `run_until_unanimous` and `session_start_unanimous` accept an
|
|
66
|
+
optional `evidence: string`. When non-empty it satisfies the preflight
|
|
67
|
+
unconditionally. Use it to attach the caller-packaged evidence bundle
|
|
68
|
+
without inflating `initial_draft`.
|
|
69
|
+
|
|
70
|
+
## Opt-out
|
|
71
|
+
|
|
72
|
+
Set `CROSS_REVIEW_EVIDENCE_PREFLIGHT=off` to disable the preflight
|
|
73
|
+
entirely (default: `on`). Disabling is rarely needed — the trip
|
|
74
|
+
condition is deliberately conservative — but the escape hatch exists
|
|
75
|
+
for callers whose tasks legitimately make completed-work claims in
|
|
76
|
+
prose without inline markers.
|
|
77
|
+
|
|
78
|
+
## Outcome when it trips
|
|
79
|
+
|
|
80
|
+
- session finalized: `outcome = "aborted"`, `reason = "needs_evidence_preflight"`;
- event emitted: `session.evidence_preflight_failed` with `completed_work_claim_matched`, `evidence_marker_found`, `structured_evidence_supplied`, `attachments_present`;
- **zero paid peer calls** were made.
|
|
86
|
+
|
|
87
|
+
Re-submit with evidence embedded inline, with the `evidence` field
|
|
88
|
+
populated, or with evidence attached via `session_attach_evidence`.
|
|
package/docs/github-security-baseline.md
ADDED
|
@@ -0,0 +1,32 @@
|
|
|
1
|
+
# GitHub Security Baseline
|
|
2
|
+
|
|
3
|
+
This project is prepared to become a public repository.
|
|
4
|
+
|
|
5
|
+
Required repository settings after the remote is created:
|
|
6
|
+
|
|
7
|
+
1. Enable Secret Protection / Secret Scanning.
2. Enable Push Protection.
3. Enable Code Scanning with CodeQL Default Setup.
4. Enable Code Quality.
5. Enable Dependabot alerts.
6. Enable Dependabot security updates.
7. Enable Dependabot version updates from `.github/dependabot.yml`.
8. Enable Dependabot auto-merge workflow only after branch rules are active.
9. Protect `main` with a repository ruleset.
10. Require code scanning results with CodeQL security alerts: All / alerts: All.
11. Require code quality thresholds: Any / Any.
12. Require CI to pass before merge.
13. Disable force-push and branch deletion on `main`.
|
|
20
|
+
|
|
21
|
+
Package publishing is active after the repository is created and the `NPM_TOKEN` secret is
|
|
22
|
+
configured. Pushes to `main` auto-create an organization-standard display tag such as `v02.01.00`
|
|
23
|
+
from `package.json`; the tag then creates a normal GitHub Release and publishes
|
|
24
|
+
`@lcv-ideas-software/cross-review` to npmjs.com and GitHub Packages. The API-first package is
|
|
25
|
+
separate from the CLI package `@lcv-ideas-software/cross-review-v1`.
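
The display-tag format can be illustrated with a small sketch. It assumes the tag is the `package.json` version with each component zero-padded to two digits, which is inferred only from the single example `v02.01.00`; the `displayTag` helper is hypothetical and the real workflow may differ.

```typescript
// Assumption: MAJOR.MINOR.PATCH in, "v" plus two-digit zero-padded components out.
function displayTag(version: string): string {
  const parts = version.split(".");
  if (parts.length !== 3 || parts.some((p) => !/^\d+$/.test(p))) {
    throw new Error(`expected MAJOR.MINOR.PATCH, got: ${version}`);
  }
  return "v" + parts.map((p) => p.padStart(2, "0")).join(".");
}
```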
|
|
26
|
+
|
|
27
|
+
CodeQL Advanced Setup is intentionally not committed. If Advanced Setup ever becomes necessary,
|
|
28
|
+
it must be proposed with justification and approved before adding a workflow file.
|
|
29
|
+
|
|
30
|
+
No secrets, runtime sessions, logs, prompts, provider responses, API keys or local AI memories may
|
|
31
|
+
be committed. The `.gitignore` is intentionally strict because this repository is designed for
|
|
32
|
+
public release from its first push.
|
|
@@ -0,0 +1,105 @@
|
|
|
1
|
+
# Model Selection
|
|
2
|
+
|
|
3
|
+
The server uses automatic model selection unless an explicit environment override is present.
|
|
4
|
+
|
|
5
|
+
## Rules
|
|
6
|
+
|
|
7
|
+
1. Query the provider's official model API using the current API key.
|
|
8
|
+
2. Keep only models that can perform text generation for the peer role.
|
|
9
|
+
3. Exclude known non-thinking, low-capacity or deprecated models from cross-review priority lists.
|
|
10
|
+
4. Compare returned model IDs against the documented priority list.
|
|
11
|
+
5. Select the first available model in that priority list.
|
|
12
|
+
6. Persist the selected model, candidate list, source URL, confidence and reason in the session snapshot.
|
|
13
|
+
|
|
14
|
+
If a provider returns models but none match the advanced thinking priority list, the runtime keeps the documented advanced fallback instead of silently downgrading to a weaker random candidate. That makes availability problems visible in probes and review rounds.
|
|
15
|
+
|
|
16
|
+
The no-downgrade behavior is covered by `scripts/smoke.ts`: when a provider
|
|
17
|
+
returns only a weak/deprecated candidate such as `claude-haiku-4-5`, selection
|
|
18
|
+
stays on the documented advanced fallback and records `confidence=unknown`.
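
The rules and the no-downgrade behavior above can be sketched as follows. This is a minimal illustration, not the code in `src/peers/model-selection.ts`: the `ModelSelection` shape, the `selectModel` name, the `"verified"` confidence value, and the choice of the first priority entry as the documented fallback are all assumptions made for the example. Only `confidence=unknown` on the fallback path comes from the text. The `available` list is assumed to be already filtered to text-generation models.

```typescript
// Hypothetical types: only the "unknown" confidence value is taken from the docs.
type Confidence = "verified" | "unknown";

interface ModelSelection {
  selected: string;
  candidates: string[];
  confidence: Confidence;
  reason: string;
}

// Pick the first priority-list model the provider actually offers.
// If none is offered, keep the documented fallback rather than downgrading.
function selectModel(
  available: string[],
  priority: string[],
  documentedFallback: string,
): ModelSelection {
  const offered = new Set(available);
  const match = priority.find((id) => offered.has(id));
  if (match !== undefined) {
    return {
      selected: match,
      candidates: available,
      confidence: "verified",
      reason: "first priority-list model offered by the provider",
    };
  }
  return {
    selected: documentedFallback,
    candidates: available,
    confidence: "unknown",
    reason: "no offered model matches the priority list; kept documented fallback",
  };
}

const claudePriority = ["claude-opus-4-7", "claude-opus-4-6", "claude-sonnet-4-6"];

// Provider offers only a weak candidate: selection stays on the fallback.
const weakOnly = selectModel(["claude-haiku-4-5"], claudePriority, claudePriority[0]);

// Provider offers two listed models: the higher-priority one wins.
const normal = selectModel(
  ["claude-sonnet-4-6", "claude-opus-4-6"],
  claudePriority,
  claudePriority[0],
);
```

Returning the candidate list and reason alongside the selected ID keeps the result ready to persist in the session snapshot, which is what makes an availability problem visible later.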
|
|
19
|
+
|
|
20
|
+
## Current Priority Lists
|
|
21
|
+
|
|
22
|
+
OpenAI/Codex:
|
|
23
|
+
|
|
24
|
+
```text
|
|
25
|
+
gpt-5.5 > gpt-5.4 > gpt-5.2 > gpt-5.1-codex-max > gpt-5.1-codex > gpt-5.1 > gpt-5-pro > gpt-5
|
|
26
|
+
```
|
|
27
|
+
|
|
28
|
+
Anthropic/Claude:
|
|
29
|
+
|
|
30
|
+
```text
|
|
31
|
+
claude-opus-4-7 > claude-opus-4-6 > claude-sonnet-4-6
|
|
32
|
+
```
|
|
33
|
+
|
|
34
|
+
Haiku is intentionally excluded because the cross-review role requires advanced reasoning depth.
|
|
35
|
+
|
|
36
|
+
Google/Gemini:
|
|
37
|
+
|
|
38
|
+
```text
|
|
39
|
+
gemini-2.5-pro > gemini-3.1-pro-preview
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
Operator preference 2026-05-07: `gemini-2.5-pro` is the runtime default because, under the Google One AI Ultra subscription, it allows 1,000 requests/day versus 250 requests/day for `gemini-3.1-pro-preview`, which remains in the priority list as a fallback. Workspace policy (operator directive 2026-05-07): only `gemini-*-pro` variants at version 2.5 or later are permitted; `*-flash` variants and models below 2.5 (which are deprecated) are not. `gemini-3-pro-preview` is intentionally excluded from the active fallback path: preview deprecations are tracked through the official Gemini release notes, and this project avoids soon-to-be-deprecated intermediate previews when a newer advanced model is available.
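
The workspace policy can be expressed as a small predicate. The sketch below is illustrative only: the function name and the regular expression are assumptions, and it checks the policy alone (Pro variants at 2.5 or later). A model such as `gemini-3-pro-preview` passes this check and is kept out of the fallback path only because it is absent from the priority list.

```typescript
// Hypothetical helper: matches "gemini-<major>[.<minor>]-pro" with an optional suffix.
const GEMINI_PRO = /^gemini-(\d+)(?:\.(\d+))?-pro(?:-[a-z0-9-]+)?$/;

function isPermittedGeminiModel(id: string): boolean {
  const m = GEMINI_PRO.exec(id);
  if (m === null) {
    return false; // not a Pro variant (for example a "-flash" model)
  }
  const major = Number(m[1]);
  const minor = Number(m[2] ?? "0");
  // Compare major and minor as integers so that 2.10 would sort above 2.5.
  return major > 2 || (major === 2 && minor >= 5);
}

const policyResults = [
  "gemini-2.5-pro",
  "gemini-3.1-pro-preview",
  "gemini-2.5-flash",
  "gemini-1.5-pro",
].map((id) => [id, isPermittedGeminiModel(id)] as const);
```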
|
|
43
|
+
|
|
44
|
+
DeepSeek:
|
|
45
|
+
|
|
46
|
+
```text
|
|
47
|
+
deepseek-v4-pro > deepseek-v4-flash
|
|
48
|
+
```
|
|
49
|
+
|
|
50
|
+
`deepseek-chat` and `deepseek-reasoner` are not active fallbacks because DeepSeek has announced their deprecation for 2026-07-24. `deepseek-v4-pro` is the preferred thinking-capable model for this project.
|
|
51
|
+
|
|
52
|
+
xAI/Grok:
|
|
53
|
+
|
|
54
|
+
```text
|
|
55
|
+
grok-4.20-multi-agent > grok-4-latest > grok-4.3 > grok-4.20-reasoning > grok-4.20 > grok-4-1-fast > grok-4 > grok-3-fast > grok-3
|
|
56
|
+
```
|
|
57
|
+
|
|
58
|
+
`GROK_API_KEY` is the canonical auth variable. The operator chooses the model
|
|
59
|
+
with `CROSS_REVIEW_GROK_MODEL`. Per official xAI reasoning docs, only
|
|
60
|
+
`grok-4.20-multi-agent` accepts an explicit `reasoning.effort` request body
|
|
61
|
+
field, where the value selects agent count (`low`/`medium` = 4 agents,
|
|
62
|
+
`high`/`xhigh` = 16 agents). Automatic-reasoning Grok models such as
|
|
63
|
+
`grok-4-latest`, `grok-4.20`, and `grok-4.20-reasoning` must not receive that
|
|
64
|
+
field; the adapter detects the configured model and omits it automatically.
|
|
65
|
+
The xAI model catalog currently recommends `grok-4.3` for general Chat API
|
|
66
|
+
usage; the multi-agent model remains first in this project because it is the
|
|
67
|
+
only documented Grok model with explicit agent-count control.
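
A minimal sketch of the adapter rule described above, assuming a hypothetical `grokReasoningFields` helper that returns the fragment to merge into the request body. The helper names and the `Effort` type are illustrative; the model ID, the `reasoning.effort` field, and the effort-to-agent mapping come from the text.

```typescript
type Effort = "low" | "medium" | "high" | "xhigh";

const MULTI_AGENT_MODEL = "grok-4.20-multi-agent";

// Agent count selected by the effort value, as documented above.
function agentCountForEffort(effort: Effort): 4 | 16 {
  return effort === "low" || effort === "medium" ? 4 : 16;
}

// Only the multi-agent model receives an explicit reasoning.effort field;
// for every other Grok model the fragment is empty and the field is omitted.
function grokReasoningFields(
  model: string,
  effort: Effort,
): { reasoning?: { effort: Effort } } {
  return model === MULTI_AGENT_MODEL ? { reasoning: { effort } } : {};
}

const multiAgentBody = { model: MULTI_AGENT_MODEL, ...grokReasoningFields(MULTI_AGENT_MODEL, "xhigh") };
const automaticBody = { model: "grok-4.3", ...grokReasoningFields("grok-4.3", "xhigh") };
```

Spreading an empty fragment leaves the request body without a `reasoning` key, so automatic-reasoning models never see the field.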
|
|
68
|
+
|
|
69
|
+
## Thinking Configuration
|
|
70
|
+
|
|
71
|
+
Cross-review-v2 is optimized for correctness over latency and cost. Provider adapters explicitly request thinking/reasoning where the official APIs support it:
|
|
72
|
+
|
|
73
|
+
- OpenAI/Codex: Responses API with reasoning effort `xhigh` by default.
|
|
74
|
+
- Anthropic/Claude: adaptive thinking with omitted thinking display plus `output_config.effort=xhigh` by default on Opus 4.7.
|
|
75
|
+
- Google/Gemini: `thinkingConfig.thinkingLevel=HIGH` for Gemini 3.x and an automatic thinking budget for Gemini 2.5 Pro.
|
|
76
|
+
- DeepSeek: `thinking.type=enabled` with `reasoning_effort=max` by default.
|
|
77
|
+
- Grok: `reasoning.effort` is sent only for `grok-4.20-multi-agent`; all other
|
|
78
|
+
Grok reasoning models use xAI automatic reasoning without the explicit field.
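
The defaults listed above could be kept in one lookup, sketched here with flat dotted keys that mirror how the settings are written in this document. This is an assumption-laden illustration: the `thinkingDefaults` name, the `Provider` type, and the `reasoning.effort` key used for OpenAI are hypothetical, and the real adapters build provider-specific request bodies rather than flat records. Grok is left out because it has no fixed default; it sends the field for a single model only.

```typescript
type Provider = "openai" | "anthropic" | "gemini" | "deepseek";

// Returns the default thinking settings as dotted-path keys (illustrative only).
function thinkingDefaults(provider: Provider, model: string): Record<string, string> {
  switch (provider) {
    case "openai":
      // Key name is an assumption; the docs only say "reasoning effort xhigh".
      return { "reasoning.effort": "xhigh" };
    case "anthropic":
      return { "output_config.effort": "xhigh" };
    case "gemini":
      // Gemini 3.x gets an explicit level; 2.5 Pro is left on automatic thinking,
      // modelled here as sending no explicit setting.
      return model.startsWith("gemini-3") ? { "thinkingConfig.thinkingLevel": "HIGH" } : {};
    case "deepseek":
      return { "thinking.type": "enabled", "reasoning_effort": "max" };
  }
}

const gemini3Defaults = thinkingDefaults("gemini", "gemini-3.1-pro-preview");
const gemini25Defaults = thinkingDefaults("gemini", "gemini-2.5-pro");
const deepseekDefaults = thinkingDefaults("deepseek", "deepseek-v4-pro");
```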
|
|
79
|
+
|
|
80
|
+
## Official Documentation Refresh — 2026-05-05
|
|
81
|
+
|
|
82
|
+
Checked against primary provider documentation before the v2.16.0 protocol
|
|
83
|
+
repair:
|
|
84
|
+
|
|
85
|
+
- OpenAI: GPT-5.5 is the current recommended frontier model for complex
|
|
86
|
+
reasoning/coding, with Responses API reasoning effort values through `xhigh`
|
|
87
|
+
and 1M context / 128K output.
|
|
88
|
+
- Anthropic: Claude Opus 4.7 is the generally available complex-reasoning and
|
|
89
|
+
agentic-coding default; current docs expose 1M context and adaptive thinking.
|
|
90
|
+
- Google Gemini: Gemini 3.1 Pro Preview is the current advanced Gemini 3.1
|
|
91
|
+
option; Gemini 3 Pro Preview is deprecated/shut down and must stay out of
|
|
92
|
+
active fallbacks.
|
|
93
|
+
- DeepSeek: DeepSeek-V4 exposes `deepseek-v4-pro` and `deepseek-v4-flash`;
|
|
94
|
+
legacy `deepseek-chat` and `deepseek-reasoner` are scheduled for
|
|
95
|
+
discontinuation on 2026-07-24 and must stay out of priority fallbacks.
|
|
96
|
+
- xAI Grok: the model catalog currently recommends `grok-4.3` for general Chat
|
|
97
|
+
API use, while reasoning docs identify `grok-4.20-multi-agent` as the only
|
|
98
|
+
explicit `reasoning.effort` model. Other Grok reasoning models reason
|
|
99
|
+
automatically and must not receive explicit effort.
|
|
100
|
+
|
|
101
|
+
## Important
|
|
102
|
+
|
|
103
|
+
The priority list is intentionally code-level configuration, not hidden behavior. Provider model catalogs and deprecation schedules change often, so this file and `src/peers/model-selection.ts` must be reviewed against official provider documentation whenever defaults change.
|
|
104
|
+
|
|
105
|
+
The redacted real-API capability smoke for the current default models is recorded in `docs/reports/cross-review-api-capability-smoke-2026-04-30.md`.
|