@lcv-ideas-software/cross-review 4.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (122)
  1. package/CHANGELOG.md +2568 -0
  2. package/LICENSE +201 -0
  3. package/NOTICE +26 -0
  4. package/README.md +208 -0
  5. package/SECURITY.md +52 -0
  6. package/dist/scripts/api-streaming-smoke.d.ts +1 -0
  7. package/dist/scripts/api-streaming-smoke.js +78 -0
  8. package/dist/scripts/api-streaming-smoke.js.map +1 -0
  9. package/dist/scripts/runtime-default-smoke.d.ts +1 -0
  10. package/dist/scripts/runtime-default-smoke.js +88 -0
  11. package/dist/scripts/runtime-default-smoke.js.map +1 -0
  12. package/dist/scripts/runtime-smoke.d.ts +1 -0
  13. package/dist/scripts/runtime-smoke.js +148 -0
  14. package/dist/scripts/runtime-smoke.js.map +1 -0
  15. package/dist/scripts/smoke.d.ts +1 -0
  16. package/dist/scripts/smoke.js +6156 -0
  17. package/dist/scripts/smoke.js.map +1 -0
  18. package/dist/src/core/cache-manifest.d.ts +22 -0
  19. package/dist/src/core/cache-manifest.js +133 -0
  20. package/dist/src/core/cache-manifest.js.map +1 -0
  21. package/dist/src/core/caller-tokens.d.ts +32 -0
  22. package/dist/src/core/caller-tokens.js +240 -0
  23. package/dist/src/core/caller-tokens.js.map +1 -0
  24. package/dist/src/core/config.d.ts +9 -0
  25. package/dist/src/core/config.js +643 -0
  26. package/dist/src/core/config.js.map +1 -0
  27. package/dist/src/core/convergence.d.ts +5 -0
  28. package/dist/src/core/convergence.js +186 -0
  29. package/dist/src/core/convergence.js.map +1 -0
  30. package/dist/src/core/cost.d.ts +59 -0
  31. package/dist/src/core/cost.js +359 -0
  32. package/dist/src/core/cost.js.map +1 -0
  33. package/dist/src/core/file-config.d.ts +316 -0
  34. package/dist/src/core/file-config.js +490 -0
  35. package/dist/src/core/file-config.js.map +1 -0
  36. package/dist/src/core/orchestrator.d.ts +199 -0
  37. package/dist/src/core/orchestrator.js +3430 -0
  38. package/dist/src/core/orchestrator.js.map +1 -0
  39. package/dist/src/core/prompt-parts.d.ts +58 -0
  40. package/dist/src/core/prompt-parts.js +122 -0
  41. package/dist/src/core/prompt-parts.js.map +1 -0
  42. package/dist/src/core/relator-lottery.d.ts +23 -0
  43. package/dist/src/core/relator-lottery.js +112 -0
  44. package/dist/src/core/relator-lottery.js.map +1 -0
  45. package/dist/src/core/reports.d.ts +2 -0
  46. package/dist/src/core/reports.js +82 -0
  47. package/dist/src/core/reports.js.map +1 -0
  48. package/dist/src/core/session-store.d.ts +149 -0
  49. package/dist/src/core/session-store.js +1923 -0
  50. package/dist/src/core/session-store.js.map +1 -0
  51. package/dist/src/core/status.d.ts +61 -0
  52. package/dist/src/core/status.js +249 -0
  53. package/dist/src/core/status.js.map +1 -0
  54. package/dist/src/core/timeouts.d.ts +2 -0
  55. package/dist/src/core/timeouts.js +3 -0
  56. package/dist/src/core/timeouts.js.map +1 -0
  57. package/dist/src/core/types.d.ts +604 -0
  58. package/dist/src/core/types.js +36 -0
  59. package/dist/src/core/types.js.map +1 -0
  60. package/dist/src/dashboard/server.d.ts +2 -0
  61. package/dist/src/dashboard/server.js +339 -0
  62. package/dist/src/dashboard/server.js.map +1 -0
  63. package/dist/src/mcp/server.d.ts +54 -0
  64. package/dist/src/mcp/server.js +1584 -0
  65. package/dist/src/mcp/server.js.map +1 -0
  66. package/dist/src/observability/logger.d.ts +9 -0
  67. package/dist/src/observability/logger.js +24 -0
  68. package/dist/src/observability/logger.js.map +1 -0
  69. package/dist/src/peers/anthropic.d.ts +14 -0
  70. package/dist/src/peers/anthropic.js +290 -0
  71. package/dist/src/peers/anthropic.js.map +1 -0
  72. package/dist/src/peers/base.d.ts +72 -0
  73. package/dist/src/peers/base.js +416 -0
  74. package/dist/src/peers/base.js.map +1 -0
  75. package/dist/src/peers/deepseek.d.ts +12 -0
  76. package/dist/src/peers/deepseek.js +246 -0
  77. package/dist/src/peers/deepseek.js.map +1 -0
  78. package/dist/src/peers/errors.d.ts +2 -0
  79. package/dist/src/peers/errors.js +185 -0
  80. package/dist/src/peers/errors.js.map +1 -0
  81. package/dist/src/peers/gemini.d.ts +13 -0
  82. package/dist/src/peers/gemini.js +215 -0
  83. package/dist/src/peers/gemini.js.map +1 -0
  84. package/dist/src/peers/grok.d.ts +17 -0
  85. package/dist/src/peers/grok.js +346 -0
  86. package/dist/src/peers/grok.js.map +1 -0
  87. package/dist/src/peers/model-selection.d.ts +4 -0
  88. package/dist/src/peers/model-selection.js +260 -0
  89. package/dist/src/peers/model-selection.js.map +1 -0
  90. package/dist/src/peers/openai.d.ts +14 -0
  91. package/dist/src/peers/openai.js +299 -0
  92. package/dist/src/peers/openai.js.map +1 -0
  93. package/dist/src/peers/perplexity.d.ts +18 -0
  94. package/dist/src/peers/perplexity.js +375 -0
  95. package/dist/src/peers/perplexity.js.map +1 -0
  96. package/dist/src/peers/registry.d.ts +3 -0
  97. package/dist/src/peers/registry.js +77 -0
  98. package/dist/src/peers/registry.js.map +1 -0
  99. package/dist/src/peers/retry.d.ts +2 -0
  100. package/dist/src/peers/retry.js +36 -0
  101. package/dist/src/peers/retry.js.map +1 -0
  102. package/dist/src/peers/stub.d.ts +13 -0
  103. package/dist/src/peers/stub.js +344 -0
  104. package/dist/src/peers/stub.js.map +1 -0
  105. package/dist/src/peers/text.d.ts +18 -0
  106. package/dist/src/peers/text.js +39 -0
  107. package/dist/src/peers/text.js.map +1 -0
  108. package/dist/src/security/redact.d.ts +2 -0
  109. package/dist/src/security/redact.js +128 -0
  110. package/dist/src/security/redact.js.map +1 -0
  111. package/docs/api-keys.md +34 -0
  112. package/docs/architecture.md +118 -0
  113. package/docs/caching.md +135 -0
  114. package/docs/costs.md +40 -0
  115. package/docs/evidence-preflight.md +88 -0
  116. package/docs/github-security-baseline.md +32 -0
  117. package/docs/model-selection.md +105 -0
  118. package/docs/reports/cross-review-v2-api-capability-smoke-2026-04-30.md +354 -0
  119. package/docs/reports/cross-review-v2-format-recovery-findings-2026-04-28.md +223 -0
  120. package/docs/reports/cross-review-v2-official-provider-docs-refresh-2026-05-05.md +60 -0
  121. package/docs/reports/cross-review-v2-token-streaming-smoke-2026-04-30.md +119 -0
  122. package/package.json +88 -0
package/docs/architecture.md ADDED
@@ -0,0 +1,118 @@
1
+ # Architecture
2
+
3
+ This API-only `cross-review` implementation is intentionally independent from the CLI-based `cross-review-v1` project.
4
+
5
+ ## Runtime Layers
6
+
7
+ 1. MCP server: exposes workflow tools over stdio.
8
+ 2. Orchestrator: creates sessions, runs reviews, checks unanimity and asks the lead peer to revise.
9
+ 3. Peer adapters: call official provider APIs and client libraries.
10
+ 4. Model selection: queries model APIs and chooses the highest-capability documented model available to the key.
11
+ 5. Session store: writes durable JSON and Markdown artifacts under `data/sessions`.
12
+ 6. Session events: writes durable `events.ndjson` streams per session for long-running work.
13
+ 7. Token streaming: writes count-based `peer.token.delta` and `peer.token.completed` events when provider streaming is enabled.
14
+ 8. Reports: writes `session-report.md` with convergence, failures, decision quality, costs and recent events.
15
+ 9. Observability: writes one NDJSON log per process under `data/logs`.
16
+ 10. Dashboard: local read-only HTTP UI for sessions, events, reports, probes and metrics.
17
+
18
+ ## Real Execution Rule
19
+
20
+ Runtime default is real API execution. Stubs are disabled unless `CROSS_REVIEW_STUB=1`.
21
+
22
+ ## Timeout Model
23
+
24
+ Real API review rounds are intentionally long-running. The provider-side HTTP
25
+ timeout is controlled by `CROSS_REVIEW_TIMEOUT_MS` and defaults to 30
26
+ minutes.
27
+
28
+ MCP hosts also have their own client-to-server request timeout. For real peer
29
+ calls, configure the host timeout to at least 300 seconds. A lower generic
30
+ default, such as 60 seconds, can close the MCP request while the provider calls
31
+ are still legitimately processing.
32
+
33
+ For host environments that cannot keep a long MCP request open, use
34
+ `session_start_round` or `session_start_unanimous`. Those tools create a
35
+ background in-process job and return immediately. Use `session_poll`,
36
+ `session_events`, `session_metrics` and `session_report` to follow progress
37
+ without blocking the client request. `session_cancel_job` requests cooperative
38
+ cancellation and forwards `AbortSignal` to provider client calls where supported.
39
+
40
+ ## Streaming Model
41
+
42
+ `CROSS_REVIEW_STREAM_EVENTS` controls normal workflow events and defaults to
43
+ enabled. `CROSS_REVIEW_STREAM_TOKENS` controls provider token-progress events
44
+ and also defaults to enabled. `runtime_capabilities.token_streaming` reflects
45
+ the effective token-streaming setting, not a compile-time constant.
46
+
47
+ When token streaming is active, adapters use provider-native streaming APIs:
48
+
49
+ - OpenAI: Responses API streaming events, including `response.output_text.delta`.
50
+ - Anthropic: Messages stream helper with text deltas and `finalMessage()`.
51
+ - Gemini: `models.generateContentStream`.
52
+ - DeepSeek: OpenAI-compatible chat completions with `stream: true`.
53
+
54
+ The streaming path is not a separate fake progress channel. The same streamed
55
+ text is accumulated and then parsed into the existing review or generation
56
+ result.
57
+
58
+ For safety, `peer.token.delta` events include character counts by default rather
59
+ than provider text. `CROSS_REVIEW_STREAM_TEXT=1` can include redacted text in
60
+ trusted local diagnostics, but it is intentionally opt-in because providers may
61
+ split sensitive strings across chunks. Raw thinking content is still not
62
+ requested or persisted.
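The count-based progress events described above can be sketched as follows. This is a minimal illustration, not the package's adapter code: only the event names `peer.token.delta` and `peer.token.completed` come from the docs, while the payload fields (`chars`, `total_chars`, `text`), the `createTokenProgress` helper and the `redactText` callback are assumptions.

```typescript
// Event payload shapes are assumptions; only the event names are documented.
type TokenEvent =
  | { type: "peer.token.delta"; peer: string; chars: number; text?: string }
  | { type: "peer.token.completed"; peer: string; total_chars: number };

interface TokenProgress {
  push(chunk: string): void; // call once per streamed text chunk
  complete(): string; // returns the accumulated text for normal parsing
}

function createTokenProgress(
  peer: string,
  emit: (event: TokenEvent) => void,
  redactText?: (chunk: string) => string, // hypothetical opt-in, e.g. CROSS_REVIEW_STREAM_TEXT=1
): TokenProgress {
  let accumulated = "";
  return {
    push(chunk) {
      accumulated += chunk;
      // By default only the character count leaves the adapter, never the text.
      emit({
        type: "peer.token.delta",
        peer,
        chars: chunk.length,
        ...(redactText ? { text: redactText(chunk) } : {}),
      });
    },
    complete() {
      emit({ type: "peer.token.completed", peer, total_chars: accumulated.length });
      return accumulated;
    },
  };
}
```

The same accumulated string that drives the events is what the caller parses afterwards, which mirrors the point that streaming is not a separate progress channel.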
63
+
64
+ ## Unanimity Rule
65
+
66
+ A session converges only when the caller status is `READY`, every selected peer returns `READY`, and no peer failed or omitted a machine-readable status.
67
+
68
+ Decision quality is tracked per peer:
69
+
70
+ - `clean`: parsed status without warnings.
71
+ - `format_warning`: parsed with non-blocking parser warnings.
72
+ - `recovered`: recovered through format repair, moderation-safe retry or bounded sanitization.
73
+ - `needs_operator_review`: no parseable status remains after recovery.
74
+ - `failed`: provider or model-selection failure blocked the peer.
75
+
76
+ `unparseable_after_recovery`, `prompt_flagged_by_moderation`,
77
+ `silent_model_downgrade` and other rejected peer failures always block
78
+ unanimity until resolved.
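As a rough sketch of the rule above, convergence can be expressed as a single predicate. The type names, the `NOT_READY` status and the treatment of an empty peer list are assumptions made for illustration; only `READY`, `NEEDS_EVIDENCE` and the decision-quality labels appear in the docs.

```typescript
// Hypothetical shapes; the package's real types live in its own type definitions.
type ReviewStatus = "READY" | "NOT_READY" | "NEEDS_EVIDENCE";
type DecisionQuality =
  | "clean"
  | "format_warning"
  | "recovered"
  | "needs_operator_review"
  | "failed";

interface PeerResult {
  peer: string;
  status?: ReviewStatus; // undefined when no machine-readable status was parsed
  decisionQuality: DecisionQuality;
}

function hasConverged(callerStatus: ReviewStatus, peers: PeerResult[]): boolean {
  // Assumption: a session with no selected peers cannot be unanimous.
  if (callerStatus !== "READY" || peers.length === 0) return false;
  return peers.every(
    (p) =>
      p.status === "READY" &&
      p.decisionQuality !== "failed" &&
      p.decisionQuality !== "needs_operator_review",
  );
}
```

A `recovered` or `format_warning` peer can still count toward unanimity in this sketch because it carries a parsed status; a missing status or a failed peer always blocks.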
79
+
80
+ ## Moderation-Safe Prompting
81
+
82
+ Prior peer history is summarized from structured fields instead of replaying
83
+ raw model text. This keeps prompts smaller, reduces the chance that a verbose
84
+ peer repeats policy-sensitive language into a later provider, and produces more
85
+ useful audit trails.
86
+
87
+ If a provider still rejects a prompt as moderated or safety-blocked, the
88
+ orchestrator records the failure class and retries once with a compact,
89
+ sanitized review prompt. This retry does not bypass provider policy: if the
90
+ compact context is insufficient, the peer must return `NEEDS_EVIDENCE` or the
91
+ session remains blocked for operator action.
92
+
93
+ ## Model Discovery
94
+
95
+ Provider model APIs are queried at probe/session initialization:
96
+
97
+ - OpenAI: Models API.
98
+ - Anthropic: Models API.
99
+ - Gemini: `models.list`.
100
+ - DeepSeek: OpenAI-compatible `/models`.
101
+
102
+ The selected model and selection evidence are persisted in the session capability snapshot.
103
+
104
+ ## Provider Thinking Baseline
105
+
106
+ The peer adapters use the strongest official reasoning controls available for each provider because cross-review is correctness-oriented:
107
+
108
+ - OpenAI runs through the Responses API with high reasoning effort.
109
+ - Anthropic uses adaptive thinking and omits raw thinking content from responses.
110
+ - Gemini enables thinking configuration for Gemini 3.x and the Gemini 2.5 fallback.
111
+ - DeepSeek enables Thinking Mode and follows the official multi-round guidance by resending the summarized session context in each stateless request.
112
+
113
+ Raw chain-of-thought is not persisted. Session continuity is represented through prompts, structured peer decisions, summaries and artifacts.
114
+
115
+ ## Stable Rename
116
+
117
+ Stable version `2.1.0` renamed the active product to `cross-review`. The earlier development
118
+ name remains only in historical changelog or memory notes.
package/docs/caching.md ADDED
@@ -0,0 +1,135 @@
1
+ # Prompt Caching (v2.21.0+)
2
+
3
+ `cross-review` integrates with the prompt-caching surface of every supported provider. The runtime emits a uniform `provider.cache.usage` event and persists a per-session `cache_manifest.json` so dashboards, FinOps reports and post-mortem tooling can read cache telemetry without branching on provider-specific shapes.
4
+
5
+ This document describes:
6
+
7
+ - per-provider behavior matrix
8
+ - the `stablePrefix` cache key + schema-version invariant
9
+ - pair-scoped cache keys
10
+ - cost savings accounting
11
+ - operator controls (kill-switch + TTL overrides)
12
+ - empirical guidance per provider
13
+
14
+ ## Per-provider behavior matrix
15
+
16
+ | Peer (Provider) | Cache mode | Threshold | TTL surface | Telemetry source |
17
+ | --------------------- | ---------- | --------------- | ---------------------------------------------- | --------------------------------------------------------------------- |
18
+ | `codex` (OpenAI) | `auto` | ~1k tokens | `prompt_cache_retention` (`in_memory` / `24h`) | `usage.prompt_tokens_details.cached_tokens` |
19
+ | `claude` (Anthropic) | `explicit` | ~4k tokens | `cache_control.ttl` (`5m` / `1h`) | `usage.cache_creation_input_tokens` + `usage.cache_read_input_tokens` |
20
+ | `gemini` (Google) | `implicit` | service-managed | n/a | `usageMetadata.cachedContentTokenCount` |
21
+ | `deepseek` (DeepSeek) | `auto` | service-managed | n/a | `usage.prompt_cache_hit_tokens` + `usage.prompt_cache_miss_tokens` |
22
+ | `grok` (xAI) | `auto` | service-managed | mirrors OpenAI | `usage.prompt_tokens_details.cached_tokens` |
23
+
24
+ `mode` values follow the canonical `TokenUsage.cache_provider_mode` enum:
25
+
26
+ - `auto` — provider auto-detects cacheable prefix (OpenAI, DeepSeek, Grok)
27
+ - `explicit` — runtime places cache_control breakpoints in the body (Anthropic only)
28
+ - `implicit` — provider transparently caches and reports tokens read (Gemini)
29
+ - `not_supported` — peer call did not produce cache telemetry
30
+
31
+ ## Cache key scope strategy
32
+
33
+ Every cached call is bucketed by a **pair-scoped cache key**:
34
+
35
+ ```
36
+ cross-review:<peer>:<caller>:v<cache_schema_version>
37
+ ```
38
+
39
+ The pair scope means two different callers reviewing the same case still share cache hits within a peer; cache invalidation is bounded by the schema version. Bumping `CROSS_REVIEW_CACHE_SCHEMA_VERSION` (e.g. `v1` → `v2`) invalidates every previously cached entry, by design. Use this when prompt structure changes materially (new convergence rule, new system role line, new evidence index format).
40
+
41
+ Internally, we also emit a `stablePrefixHash` (sha256 hex) computed over the LF-normalized stablePrefix. The hash is invariant across rounds for the same case — see `prompt-parts.ts` and the smoke marker `cache_hash_invariance_test`.
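The key format and hash described above can be sketched as below. The function names are hypothetical, and treating the schema version as a number is an assumption; the real logic lives in `prompt-parts.ts` and may differ in detail.

```typescript
import { createHash } from "node:crypto";

// Builds the documented key shape: cross-review:<peer>:<caller>:v<cache_schema_version>
function pairScopedCacheKey(peer: string, caller: string, schemaVersion: number): string {
  return `cross-review:${peer}:${caller}:v${schemaVersion}`;
}

// SHA-256 hex digest over the LF-normalized prefix, so CRLF and LF inputs hash identically.
function stablePrefixHash(stablePrefix: string): string {
  const normalized = stablePrefix.replace(/\r\n?/g, "\n");
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}
```

Normalizing line endings before hashing is what keeps the hash stable when the same prompt parts are assembled on different platforms.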
42
+
43
+ ## Cache schema versioning
44
+
45
+ `stablePrefix` always begins with the line `cache_schema_version: vN`. This appears verbatim inside the cached prefix payload so any structural shift produces a different hash and a different cache scope automatically.
46
+
47
+ When to bump:
48
+
49
+ - Adding/removing a section in stablePrefix (system, task, review_focus, convergence_rules, evidence_index)
50
+ - Reordering sections inside stablePrefix
51
+ - Changing the systemRole text materially
52
+ - Changing the convergence rules text
53
+
54
+ Smoke marker `cache_schema_version_in_prefix_test` pins the first-line shape so a regression is caught locally.
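A check in the spirit of that smoke marker might look like the sketch below. The regular expression is the one quoted for the marker; the helper name and its return convention are assumptions.

```typescript
// Pattern as quoted for the smoke marker; the helper itself is hypothetical.
const SCHEMA_VERSION_LINE = /^cache_schema_version: v\d+$/;

// Returns the numeric schema version from the first line, or null if the line is malformed.
function schemaVersionOf(stablePrefix: string): number | null {
  const firstLine = stablePrefix.split(/\r?\n/, 1)[0] ?? "";
  if (!SCHEMA_VERSION_LINE.test(firstLine)) return null;
  return Number(firstLine.slice("cache_schema_version: v".length));
}
```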
55
+
56
+ ## Cost savings accounting
57
+
58
+ The cost layer (`src/core/cost.ts`) extends `CostEstimate` with two cache-related fields:
59
+
60
+ - `cache_savings_usd?: number` — populated when the rate card knows how to price the savings
61
+ - `cache_savings_unknown?: boolean` — set when cache telemetry was present but no rate card matched
62
+
63
+ Rate cards live in `src/core/cache-rates.json`. The primary input/output rates still flow through `config.cost_rates` (env vars `CROSS_REVIEW_<PROVIDER>_INPUT_USD_PER_MILLION` / `CROSS_REVIEW_<PROVIDER>_OUTPUT_USD_PER_MILLION`). The rate card delta is FALLBACK math: `(fresh_input_per_million - cached_input_per_million) × cache_read_tokens / 1e6`.
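The fallback formula can be written out as a small function. This is a sketch under assumptions: `fresh_input_per_million_usd` is named in the smoke markers, but `cached_input_per_million_usd`, the `CacheRateCard` interface and the function name are illustrative, and the actual `cost.ts` logic may handle more cases.

```typescript
interface CacheRateCard {
  fresh_input_per_million_usd: number;
  cached_input_per_million_usd: number; // assumed field name
}

interface CacheSavings {
  cache_savings_usd?: number;
  cache_savings_unknown?: boolean;
}

// (fresh_input_per_million - cached_input_per_million) x cache_read_tokens / 1e6
function cacheSavings(cacheReadTokens: number, card?: CacheRateCard): CacheSavings {
  if (cacheReadTokens <= 0) return {}; // no cache telemetry, nothing to report
  if (!card) return { cache_savings_unknown: true }; // telemetry present, no rate card matched
  const deltaPerMillion = card.fresh_input_per_million_usd - card.cached_input_per_million_usd;
  return { cache_savings_usd: (deltaPerMillion * cacheReadTokens) / 1e6 };
}
```

Returning `cache_savings_unknown` instead of a guessed number keeps reports honest when a provider reports cached tokens that the rate card cannot price.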
64
+
65
+ Adapters surface the read/write counts via `TokenUsage.cache_read_tokens` and `TokenUsage.cache_write_tokens`. The orchestrator reads them, emits a `provider.cache.usage` event, and appends a row to `<data_dir>/sessions/<session_id>/cache_manifest.json`.
66
+
67
+ ## Bypass / kill-switch
68
+
69
+ ```
70
+ CROSS_REVIEW_DISABLE_CACHE=true
71
+ ```
72
+
73
+ Disables prompt caching globally for the runtime. Adapters fall back to the pre-v2.21 behavior (no `prompt_cache_key`, no `cache_control` blocks, no `x-grok-conv-id` header). The cost layer continues to merge `cache_read_tokens` / `cache_write_tokens` if a provider returns them anyway, so audit reproducibility is preserved.
74
+
75
+ Use cases:
76
+
77
+ - Provider misbehavior (cache poisoning, stale state)
78
+ - Audit reproducibility (force every call to make a fresh inference)
79
+ - A/B comparison between cached and uncached spend
80
+
81
+ ## TTL configuration
82
+
83
+ ```
84
+ CROSS_REVIEW_CACHE_TTL_ANTHROPIC=5m|1h # default 1h
85
+ CROSS_REVIEW_CACHE_TTL_OPENAI=5m|1h # default 1h
86
+ ```
87
+
88
+ - **Anthropic** accepts `5m` and `1h` per the SDK. Values other than `5m`/`1h` are ignored with a stderr notice and the default is used.
89
+ - **OpenAI** accepts only `in_memory` and `24h` per the Responses API (the SDK type is locked to those two strings). The runtime translates `1h` → `24h` (extended retention) and anything else → `in_memory` (the default ~5 min window).
90
+
91
+ Grok mirrors OpenAI's mapping.
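The TTL translation above can be sketched as two small mappers. Function names and the `warn` callback are hypothetical; the sketch assumes an unset variable falls back to the documented `1h` default before translation.

```typescript
type AnthropicTtl = "5m" | "1h";
type OpenAiRetention = "in_memory" | "24h";

// Anthropic: only 5m and 1h are accepted; anything else falls back to 1h with a notice.
function resolveAnthropicTtl(
  raw: string | undefined,
  warn: (message: string) => void = () => {},
): AnthropicTtl {
  if (raw === "5m" || raw === "1h") return raw;
  if (raw !== undefined) warn(`ignoring unsupported Anthropic cache TTL: ${raw}`);
  return "1h";
}

// OpenAI (and Grok, which mirrors it): 1h maps to extended retention, anything else to in_memory.
function resolveOpenAiRetention(raw: string | undefined): OpenAiRetention {
  const effective = raw ?? "1h"; // assumed: the documented default applies when unset
  return effective === "1h" ? "24h" : "in_memory";
}
```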
92
+
93
+ ## Anthropic cache_control placement
94
+
95
+ The Anthropic adapter places exactly **one** cache_control breakpoint at the END of the system prompt block:
96
+
97
+ ```ts
98
+ system: [
99
+ {
100
+ type: "text",
101
+ text: systemPromptText,
102
+ cache_control: { type: "ephemeral", ttl: "1h" },
103
+ },
104
+ ],
105
+ ```
106
+
107
+ Anthropic supports up to 4 breakpoints per request; we reserve 3 for future additions (per-message layering, tool block caching, multi-tier prefixes). The `cache_creation_input_tokens` / `cache_read_input_tokens` fields on `response.usage` map directly to `cache_write_tokens` / `cache_read_tokens` on our canonical `TokenUsage` shape.
108
+
109
+ ## Empirical guidance
110
+
111
+ | Provider | Practical minimum cached prefix | Notes |
112
+ | --------- | ------------------------------- | ------------------------------------------------------------------------------------------ |
113
+ | OpenAI | ≥ 1024 tokens | The Responses API auto-detects; `prompt_cache_key` improves hit rate for repeat callers. |
114
+ | Anthropic | ≥ 4096 tokens | Below this size Anthropic may not actually create the cache entry. Adapter emits a notice. |
115
+ | Gemini | service-managed | Implicit only at this writing; explicit `caches.create` is deferred. |
116
+ | DeepSeek | service-managed | Auto-cached; both hit and miss tokens are returned. |
117
+ | Grok | service-managed | xAI mirrors OpenAI; `x-grok-conv-id` header binds the cache scope. |
118
+
119
+ ## Reference URLs
120
+
121
+ - OpenAI prompt caching: https://platform.openai.com/docs/guides/prompt-caching
122
+ - Anthropic prompt caching: https://docs.claude.com/en/docs/build-with-claude/prompt-caching
123
+ - Google Gemini caching: https://ai.google.dev/gemini-api/docs/caching
124
+ - DeepSeek context caching: https://api-docs.deepseek.com/guides/kv_cache
125
+ - xAI / Grok caching: https://docs.x.ai/docs/api-reference
126
+
127
+ ## Smoke markers
128
+
129
+ The smoke harness (`scripts/smoke.ts`) ships five anti-drift markers covering this surface:
130
+
131
+ - `cache_hash_invariance_test` — round/draft/priorRounds permutations do NOT mutate `stablePrefixHash`
132
+ - `cache_schema_version_in_prefix_test` — first line of `stablePrefix` matches `^cache_schema_version: v\d+$`
133
+ - `cache_rates_json_loaded_test` — every provider has a rate card with a numeric `fresh_input_per_million_usd`
134
+ - `cache_manifest_atomic_write_test` — sequential appends preserve every entry
135
+ - `cache_disable_kill_switch_test` — `CROSS_REVIEW_DISABLE_CACHE=true` flips `config.cache.enabled`
package/docs/costs.md ADDED
@@ -0,0 +1,40 @@
1
+ # Costs
2
+
3
+ Runtime calls are real provider API calls by default.
4
+
5
+ ## Smoke Tests
6
+
7
+ `npm test` uses `CROSS_REVIEW_STUB=1` and does not call provider APIs.
8
+
9
+ ## Real Runs
10
+
11
+ `probe_peers`, `session_init`, `ask_peers` and `run_until_unanimous` may call provider APIs when keys are present.
12
+
13
+ The server records token usage returned by providers. Paid review/generation tools are blocked until explicit budget ceilings and rate cards are configured. This avoids stale hard-coded prices because provider pricing changes frequently.
14
+
15
+ `CROSS_REVIEW_MAX_OUTPUT_TOKENS` controls the maximum output budget requested from all providers. The default is `20000`; raise or lower it in the MCP host configuration according to the desired quality/cost tradeoff. Invalid, zero or negative values fall back to the default.
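The fallback behavior for this variable can be sketched as follows. The function name is hypothetical, and rejecting non-integer values is an assumption about what counts as "invalid".

```typescript
const DEFAULT_MAX_OUTPUT_TOKENS = 20000;

// Accepts the raw environment value; invalid, zero or negative input yields the default.
function resolveMaxOutputTokens(raw: string | undefined): number {
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : DEFAULT_MAX_OUTPUT_TOKENS;
}
```

A caller would typically pass the environment value directly, for example `resolveMaxOutputTokens(env.CROSS_REVIEW_MAX_OUTPUT_TOKENS)` where `env` is whatever environment map the host provides.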
16
+
17
+ ## Required Financial Configuration
18
+
19
+ Set rates through Windows environment variables or the MCP host configuration before running paid calls. Values are USD per million tokens. Use current official provider pricing; this project intentionally does not ship default provider prices.
20
+
21
+ ```powershell
22
+ [Environment]::SetEnvironmentVariable("CROSS_REVIEW_MAX_SESSION_COST_USD", "20", "User")
23
+ [Environment]::SetEnvironmentVariable("CROSS_REVIEW_PREFLIGHT_MAX_ROUND_COST_USD", "20", "User")
24
+ [Environment]::SetEnvironmentVariable("CROSS_REVIEW_UNTIL_STOPPED_MAX_COST_USD", "20", "User")
25
+ [Environment]::SetEnvironmentVariable("CROSS_REVIEW_OPENAI_INPUT_USD_PER_MILLION", "<current OpenAI input rate>", "User")
26
+ [Environment]::SetEnvironmentVariable("CROSS_REVIEW_OPENAI_OUTPUT_USD_PER_MILLION", "<current OpenAI output rate>", "User")
27
+ [Environment]::SetEnvironmentVariable("CROSS_REVIEW_ANTHROPIC_INPUT_USD_PER_MILLION", "<current Anthropic input rate>", "User")
28
+ [Environment]::SetEnvironmentVariable("CROSS_REVIEW_ANTHROPIC_OUTPUT_USD_PER_MILLION", "<current Anthropic output rate>", "User")
29
+ [Environment]::SetEnvironmentVariable("CROSS_REVIEW_GEMINI_INPUT_USD_PER_MILLION", "<current Gemini input rate>", "User")
30
+ [Environment]::SetEnvironmentVariable("CROSS_REVIEW_GEMINI_OUTPUT_USD_PER_MILLION", "<current Gemini output rate>", "User")
31
+ [Environment]::SetEnvironmentVariable("CROSS_REVIEW_DEEPSEEK_INPUT_USD_PER_MILLION", "<current DeepSeek input rate>", "User")
32
+ [Environment]::SetEnvironmentVariable("CROSS_REVIEW_DEEPSEEK_OUTPUT_USD_PER_MILLION", "<current DeepSeek output rate>", "User")
33
+ ```
34
+
35
+ `CROSS_REVIEW_MAX_SESSION_COST_USD` sets the default per-session budget guard. `CROSS_REVIEW_PREFLIGHT_MAX_ROUND_COST_USD` blocks a round before calls begin when the estimated cost exceeds the configured value. `CROSS_REVIEW_UNTIL_STOPPED_MAX_COST_USD` is required for `until_stopped=true`.
36
+
37
+ When the estimated session cost exceeds the configured limit, the run is
38
+ finalized as `max-rounds` with reason `budget_exceeded`. Missing financial
39
+ configuration finalizes the session as `max-rounds` with reason
40
+ `financial_controls_missing` before any paid provider call is made.
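The two finalization paths above can be sketched as a guard that runs before any paid call. The `FinancialControls` shape and the `checkBudget` name are assumptions; only the outcome and reason strings come from the text.

```typescript
type BudgetDecision =
  | { proceed: true }
  | {
      proceed: false;
      outcome: "max-rounds";
      reason: "budget_exceeded" | "financial_controls_missing";
    };

// Hypothetical view of the configured limits and rates for one provider.
interface FinancialControls {
  maxSessionCostUsd?: number; // CROSS_REVIEW_MAX_SESSION_COST_USD
  inputUsdPerMillion?: number; // CROSS_REVIEW_<PROVIDER>_INPUT_USD_PER_MILLION
  outputUsdPerMillion?: number; // CROSS_REVIEW_<PROVIDER>_OUTPUT_USD_PER_MILLION
}

function checkBudget(estimatedSessionCostUsd: number, controls: FinancialControls): BudgetDecision {
  const { maxSessionCostUsd, inputUsdPerMillion, outputUsdPerMillion } = controls;
  const configured = [maxSessionCostUsd, inputUsdPerMillion, outputUsdPerMillion].every(
    (value) => typeof value === "number" && Number.isFinite(value) && value > 0,
  );
  if (!configured || maxSessionCostUsd === undefined) {
    return { proceed: false, outcome: "max-rounds", reason: "financial_controls_missing" };
  }
  if (estimatedSessionCostUsd > maxSessionCostUsd) {
    return { proceed: false, outcome: "max-rounds", reason: "budget_exceeded" };
  }
  return { proceed: true };
}
```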
package/docs/evidence-preflight.md ADDED
@@ -0,0 +1,88 @@
1
+ # Evidence Preflight (v3.5.0)
2
+
3
+ `run_until_unanimous` and `session_start_unanimous` run a **pure textual
4
+ evidence preflight** before any paid peer call. It catches the
5
+ `f0db3970`-class failure — a submission that _claims_ completed
6
+ operational work (tests pass, a diff exists, a build was validated) but
7
+ embeds **zero concrete evidence** — and fails it locally with
8
+ `needs_evidence_preflight` instead of burning API budget across multiple
9
+ `NEEDS_EVIDENCE` rounds.
10
+
11
+ ## Scope boundary (important)
12
+
13
+ cross-review is an **API-only orchestrator**. The preflight:
14
+
15
+ - **does NOT** execute shell, run `git diff`, or read the repo;
16
+ - **does NOT** package evidence for you;
17
+ - only inspects text the caller already supplied — `task`,
18
+ `initial_draft`, the structured `evidence` field, and
19
+ already-attached evidence.
20
+
21
+ Evidence _packaging_ (`git diff --stat`, hunks, `git status --short`,
22
+ validation-command tails, changed-file lists, target commit) is a
23
+ **caller-side responsibility**. Build it into a local agent helper,
24
+ prompt workflow, or shared utility — never inside the MCP server.
25
+
26
+ ## How it decides
27
+
28
+ The preflight trips **only** when **both** are true:
29
+
30
+ 1. **Completed-work claim present** — the text matches one of:
31
+ `\d+ passed/failed`, `git diff`, `git status`, `npm run`,
32
+ `cargo test|build`, `build passed/succeeded/clean/green`,
33
+ `tests? pass/passed/green`, `git diff --check`.
34
+ 2. **Zero evidence markers** — the text contains none of: fenced code
35
+ blocks (` ``` `), `@@ -`/`@@ +` diff hunks, 7+ hex-char hashes,
36
+ `file.ext:NN` line refs, `$ `/`> ` command-prompt lines.
37
+
38
+ Mere keyword presence does **not** trip it. "I plan to write a patch"
39
+ or "here is the test plan" is a design review with legitimately no diff
40
+ — it passes.
41
+
42
+ A non-empty `evidence` field **or** any attached evidence makes the
43
+ preflight pass unconditionally — that is the caller's authoritative
44
+ declaration that concrete evidence exists.
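The decision logic above can be approximated in a few lines. This is a sketch, not the shipped matcher: the regular expressions are hand-written approximations of the documented patterns, and the input shape and function name are assumptions. The result fields reuse the names listed for the `session.evidence_preflight_failed` event.

```typescript
// Approximations of the documented claim patterns; the real matcher may differ.
const CLAIM_PATTERNS: RegExp[] = [
  /\b\d+\s+(passed|failed)\b/i,
  /\bgit (diff|status)\b/i,
  /\bnpm run\b/i,
  /\bcargo (test|build)\b/i,
  /\bbuild (passed|succeeded|clean|green)\b/i,
  /\btests? (pass|passed|green)\b/i,
];

// Approximations of the documented evidence markers.
const EVIDENCE_PATTERNS: RegExp[] = [
  /`{3}/, // fenced code block
  /@@ [-+]/, // diff hunk header
  /\b[0-9a-f]{7,}\b/i, // hash of 7 or more hex characters
  /[\w./-]+\.\w+:\d+/, // file.ext:NN line reference
  /^\s*[$>] \S/m, // command-prompt line
];

interface PreflightInput {
  task: string;
  initial_draft?: string;
  evidence?: string;
  attachments_present?: boolean;
}

interface PreflightResult {
  tripped: boolean;
  completed_work_claim_matched: boolean;
  evidence_marker_found: boolean;
  structured_evidence_supplied: boolean;
  attachments_present: boolean;
}

function evidencePreflight(input: PreflightInput): PreflightResult {
  const text = [input.task, input.initial_draft ?? ""].join("\n");
  const claim = CLAIM_PATTERNS.some((re) => re.test(text));
  const marker = EVIDENCE_PATTERNS.some((re) => re.test(text));
  const structured = (input.evidence ?? "").trim().length > 0;
  const attachments = input.attachments_present === true;
  return {
    // Trips only on a claim with no inline marker, no evidence field and no attachments.
    tripped: claim && !marker && !structured && !attachments,
    completed_work_claim_matched: claim,
    evidence_marker_found: marker,
    structured_evidence_supplied: structured,
    attachments_present: attachments,
  };
}
```

Because the check is purely textual, it runs before any provider call and costs nothing; the trade-off is that pattern matching like this can only approximate intent.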
45
+
46
+ ## Minimum evidence format
47
+
48
+ To pass the preflight when your task makes a completed-work claim,
49
+ embed at least one of these inline in `initial_draft` or in the
50
+ `evidence` field:
51
+
52
+ - a fenced code block with the relevant diff hunk(s) — `@@ -N,M +N,M @@`;
53
+ - `file/path.ext:LINE` references for every changed location;
54
+ - raw command output (the `$ cmd` line plus its output) for every
55
+ validation you claim — `npm run typecheck`, `cargo test`,
56
+ `git diff --check`, `rg` scans;
57
+ - content hashes (`sha256`, `md5`) when asserting artifact identity.
58
+
59
+ This is the same provenance-grade material the R1 evidence-upfront
60
+ contract already expects. The preflight just refuses to spend API budget when
61
+ it is obviously absent.
62
+
63
+ ## The `evidence` field
64
+
65
+ Both `run_until_unanimous` and `session_start_unanimous` accept an
66
+ optional `evidence: string`. When non-empty it satisfies the preflight
67
+ unconditionally. Use it to attach the caller-packaged evidence bundle
68
+ without inflating `initial_draft`.
69
+
70
+ ## Opt-out
71
+
72
+ Set `CROSS_REVIEW_EVIDENCE_PREFLIGHT=off` to disable the preflight
73
+ entirely (default: `on`). Disabling is rarely needed — the trip
74
+ condition is deliberately conservative — but the escape hatch exists
75
+ for callers whose tasks legitimately make completed-work claims in
76
+ prose without inline markers.
77
+
78
+ ## Outcome when it trips
79
+
80
+ - session finalized: `outcome = "aborted"`, `reason =
81
+ "needs_evidence_preflight"`;
82
+ - event emitted: `session.evidence_preflight_failed` with
83
+ `completed_work_claim_matched`, `evidence_marker_found`,
84
+ `structured_evidence_supplied`, `attachments_present`;
85
+ - **zero paid peer calls** were made.
86
+
87
+ Re-submit with evidence embedded inline, with the `evidence` field
88
+ populated, or with evidence attached via `session_attach_evidence`.
package/docs/github-security-baseline.md ADDED
@@ -0,0 +1,32 @@
1
+ # GitHub Security Baseline
2
+
3
+ This project is prepared to become a public repository.
4
+
5
+ Required repository settings after the remote is created:
6
+
7
+ 1. Enable Secret Protection / Secret Scanning.
8
+ 2. Enable Push Protection.
9
+ 3. Enable Code Scanning with CodeQL Default Setup.
10
+ 4. Enable Code Quality.
11
+ 5. Enable Dependabot alerts.
12
+ 6. Enable Dependabot security updates.
13
+ 7. Enable Dependabot version updates from `.github/dependabot.yml`.
14
+ 8. Enable Dependabot auto-merge workflow only after branch rules are active.
15
+ 9. Protect `main` with a repository ruleset.
16
+ 10. Require code scanning results with CodeQL security alerts: All / alerts: All.
17
+ 11. Require code quality thresholds: Any / Any.
18
+ 12. Require CI to pass before merge.
19
+ 13. Disable force-push and branch deletion on `main`.
20
+
21
+ Package publishing is active after the repository is created and the `NPM_TOKEN` secret is
22
+ configured. Pushes to `main` auto-create an organization-standard display tag such as `v02.01.00`
23
+ from `package.json`; the tag then creates a normal GitHub Release and publishes
24
+ `@lcv-ideas-software/cross-review` to npmjs.com and GitHub Packages. The API-first package is
25
+ separate from the CLI package `@lcv-ideas-software/cross-review-v1`.
26
+
27
+ CodeQL Advanced Setup is intentionally not committed. If Advanced Setup ever becomes necessary,
28
+ it must be proposed with justification and approved before adding a workflow file.
29
+
30
+ No secrets, runtime sessions, logs, prompts, provider responses, API keys or local AI memories may
31
+ be committed. The `.gitignore` is intentionally strict because this repository is designed for
32
+ public release from its first push.
package/docs/model-selection.md ADDED
@@ -0,0 +1,105 @@
1
+ # Model Selection
2
+
3
+ The server uses automatic model selection unless an explicit environment override is present.
4
+
5
+ ## Rules
6
+
7
+ 1. Query the provider's official model API using the current API key.
8
+ 2. Keep only models that can perform text generation for the peer role.
9
+ 3. Exclude known non-thinking, low-capacity or deprecated models from cross-review priority lists.
10
+ 4. Compare returned model IDs against the documented priority list.
11
+ 5. Select the first available model in that priority list.
12
+ 6. Persist the selected model, candidate list, source URL, confidence and reason in the session snapshot.
13
+
14
+ If a provider returns models but none match the advanced thinking priority list, the runtime keeps the documented advanced fallback instead of silently downgrading to a weaker random candidate. That makes availability problems visible in probes and review rounds.
15
+
16
+ The no-downgrade behavior is covered by `scripts/smoke.ts`: when a provider
17
+ returns only a weak/deprecated candidate such as `claude-haiku-4-5`, selection
18
+ stays on the documented advanced fallback and records `confidence=unknown`.
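The selection rules and the no-downgrade behavior can be sketched as below. The `Selection` shape, the `verified` confidence label and the choice of the first priority entry as the "documented advanced fallback" are assumptions; only `confidence=unknown` is taken from the text.

```typescript
interface Selection {
  model: string;
  confidence: "verified" | "unknown"; // "verified" is an assumed label
  reason: string;
}

// available: model IDs returned by the provider; priority: documented list, strongest first.
function selectModel(available: string[], priority: string[]): Selection {
  if (priority.length === 0) throw new Error("priority list must not be empty");
  const returned = new Set(available);
  const match = priority.find((id) => returned.has(id));
  if (match !== undefined) {
    return { model: match, confidence: "verified", reason: "first available model in priority list" };
  }
  // No downgrade: keep an advanced model instead of picking a weaker returned candidate.
  return {
    model: priority[0],
    confidence: "unknown",
    reason: "no returned model matched the priority list; keeping documented fallback",
  };
}
```

With the Anthropic list from this document, a provider response containing only `claude-haiku-4-5` would leave the selection on `claude-opus-4-7` with `confidence` set to `unknown`, so the availability problem surfaces in the first real call rather than being masked.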
19
+
20
+ ## Current Priority Lists
21
+
22
+ OpenAI/Codex:
23
+
24
+ ```text
25
+ gpt-5.5 > gpt-5.4 > gpt-5.2 > gpt-5.1-codex-max > gpt-5.1-codex > gpt-5.1 > gpt-5-pro > gpt-5
26
+ ```
27
+
28
+ Anthropic/Claude:
29
+
30
+ ```text
31
+ claude-opus-4-7 > claude-opus-4-6 > claude-sonnet-4-6
32
+ ```
33
+
34
+ Haiku is intentionally excluded because the cross-review role requires advanced reasoning depth.
35
+
36
+ Google/Gemini:
37
+
38
+ ```text
39
+ gemini-2.5-pro > gemini-3.1-pro-preview
40
+ ```
41
+
42
+ Operator preference (2026-05-07): `gemini-2.5-pro` is the runtime default because, under a Google One AI Ultra subscription, it allows 1k requests/day versus 250 requests/day for `gemini-3.1-pro-preview`; `gemini-3.1-pro-preview` remains in the priority list as a fallback. Workspace policy (operator directive, 2026-05-07): only `gemini-*-pro` variants at version 2.5 or later are permitted, so `*-flash` variants and models below 2.5 (which are deprecated) are excluded. `gemini-3-pro-preview` is intentionally excluded from the active fallback path: preview-model deprecations are tracked through the official Gemini release notes, and this project avoids intermediate previews that are close to deprecation when a newer advanced model is available.
43
+
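The workspace policy above can be expressed as a small predicate. This is a sketch under stated assumptions: `isPermittedGeminiModel` and `EXCLUDED_GEMINI_PREVIEWS` are hypothetical names, and the version is compared as a plain decimal number, which is enough for the IDs listed here but would misorder a hypothetical `2.10`.

```typescript
// Models excluded by name even though they match the `gemini-*-pro` pattern.
const EXCLUDED_GEMINI_PREVIEWS = new Set(["gemini-3-pro-preview"]);

// True only for `gemini-<version>-pro...` IDs with version >= 2.5.
// Flash variants never match because the pattern requires `-pro`.
function isPermittedGeminiModel(id: string): boolean {
  if (EXCLUDED_GEMINI_PREVIEWS.has(id)) return false;
  const match = /^gemini-(\d+(?:\.\d+)?)-pro(?:-|$)/.exec(id);
  if (match === null) return false;
  // Number(undefined) is NaN, and NaN >= 2.5 is false, so a missing
  // capture group is rejected as well.
  return Number(match[1]) >= 2.5;
}

const permitted = [
  "gemini-2.5-pro",
  "gemini-3.1-pro-preview",
  "gemini-3-pro-preview",
  "gemini-2.5-flash",
  "gemini-1.5-pro",
].filter(isPermittedGeminiModel);
```

With the inputs above, only `gemini-2.5-pro` and `gemini-3.1-pro-preview` survive the filter.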
44
+ DeepSeek:
45
+
46
+ ```text
47
+ deepseek-v4-pro > deepseek-v4-flash
48
+ ```
49
+
50
+ `deepseek-chat` and `deepseek-reasoner` are not active fallbacks because DeepSeek has announced their deprecation for 2026-07-24. `deepseek-v4-pro` is the preferred thinking-capable model for this project.
51
+
52
+ xAI/Grok:
53
+
54
+ ```text
55
+ grok-4.20-multi-agent > grok-4-latest > grok-4.3 > grok-4.20-reasoning > grok-4.20 > grok-4-1-fast > grok-4 > grok-3-fast > grok-3
56
+ ```
57
+
58
+ `GROK_API_KEY` is the canonical auth variable. The operator chooses the model
59
+ with `CROSS_REVIEW_GROK_MODEL`. Per official xAI reasoning docs, only
60
+ `grok-4.20-multi-agent` accepts an explicit `reasoning.effort` request body
61
+ field, where the value selects agent count (`low`/`medium` = 4 agents,
62
+ `high`/`xhigh` = 16 agents). Automatic-reasoning Grok models such as
63
+ `grok-4-latest`, `grok-4.20`, and `grok-4.20-reasoning` must not receive that
64
+ field; the adapter detects the configured model and omits it automatically.
65
+ The xAI model catalog currently recommends `grok-4.3` for general Chat API
66
+ usage; the multi-agent model remains first in this project because it is the
67
+ only documented Grok model with explicit agent-count control.
68
+
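The adapter rule described above (send `reasoning.effort` only to `grok-4.20-multi-agent`) can be sketched as below. The function names and the `model`/`input` body fields are assumptions for illustration; only the `reasoning.effort` field and the effort-to-agent-count mapping come from the text.

```typescript
type GrokEffort = "low" | "medium" | "high" | "xhigh";

// The only Grok model documented here as accepting explicit effort.
const EXPLICIT_EFFORT_MODEL = "grok-4.20-multi-agent";

// Effort selects the agent count: low/medium = 4 agents, high/xhigh = 16.
function agentCountFor(effort: GrokEffort): number {
  return effort === "low" || effort === "medium" ? 4 : 16;
}

// Build a request body, omitting `reasoning` for automatic-reasoning models.
// The `model` and `input` keys are placeholders, not a confirmed xAI schema.
function buildGrokBody(
  model: string,
  input: string,
  effort: GrokEffort,
): Record<string, unknown> {
  const body: Record<string, unknown> = { model, input };
  if (model === EXPLICIT_EFFORT_MODEL) {
    body["reasoning"] = { effort };
  }
  return body;
}

const multiAgentBody = buildGrokBody("grok-4.20-multi-agent", "review this", "xhigh");
const automaticBody = buildGrokBody("grok-4-latest", "review this", "xhigh");
```

`multiAgentBody` carries a `reasoning` key while `automaticBody` does not, so the same caller code can target either kind of model.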
69
+ ## Thinking Configuration
70
+
71
+ Cross-review-v2 is optimized for correctness over latency and cost. Provider adapters explicitly request thinking/reasoning where the official APIs support it:
72
+
73
+ - OpenAI/Codex: Responses API with reasoning effort `xhigh` by default.
74
+ - Anthropic/Claude: adaptive thinking with the thinking display omitted, plus `output_config.effort=xhigh` by default on Opus 4.7.
75
+ - Google/Gemini: `thinkingConfig.thinkingLevel=HIGH` for Gemini 3.x and an automatic thinking budget for Gemini 2.5 Pro.
76
+ - DeepSeek: `thinking.type=enabled` with `reasoning_effort=max` by default.
77
+ - Grok: `reasoning.effort` is sent only for `grok-4.20-multi-agent`; all other
78
+ Grok reasoning models use xAI automatic reasoning without the explicit field.
79
+
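As a rough illustration, the non-Grok defaults listed above could be collected in one lookup. Everything about the shape is an assumption: `thinkingDefaults` and the `Provider` type are hypothetical, only field names that appear in the list are used, and the way each fragment is merged into a real request is not shown. Anthropic's adaptive-thinking fields are left out because the text does not name them, and an empty object for Gemini 2.5 Pro stands in for "let the API choose the budget".

```typescript
type Provider = "openai" | "anthropic" | "gemini" | "deepseek";

// Return the thinking-related request fragment for a provider and model.
// Field names follow the list above; nesting is an illustrative guess.
function thinkingDefaults(provider: Provider, model: string): Record<string, unknown> {
  switch (provider) {
    case "openai":
      return { reasoning: { effort: "xhigh" } };
    case "anthropic":
      return { output_config: { effort: "xhigh" } };
    case "gemini":
      // Gemini 3.x takes an explicit level; 2.5 Pro is left on automatic.
      return model.startsWith("gemini-3")
        ? { thinkingConfig: { thinkingLevel: "HIGH" } }
        : {};
    case "deepseek":
      return { thinking: { type: "enabled" }, reasoning_effort: "max" };
  }
}

const gemini3Defaults = thinkingDefaults("gemini", "gemini-3.1-pro-preview");
const gemini25Defaults = thinkingDefaults("gemini", "gemini-2.5-pro");
const deepseekDefaults = thinkingDefaults("deepseek", "deepseek-v4-pro");
```

Keeping these defaults in one place makes it easier to review them against provider documentation when a default model changes.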
80
+ ## Official Documentation Refresh — 2026-05-05
81
+
82
+ Checked against primary provider documentation before the v2.16.0 protocol
83
+ repair:
84
+
85
+ - OpenAI: GPT-5.5 is the current recommended frontier model for complex
86
+ reasoning/coding, with Responses API reasoning effort values through `xhigh`
87
+ and 1M context / 128K output.
88
+ - Anthropic: Claude Opus 4.7 is the generally available complex-reasoning and
89
+ agentic-coding default; current docs expose 1M context and adaptive thinking.
90
+ - Google Gemini: Gemini 3.1 Pro Preview is the current advanced Gemini 3.1
91
+ option; Gemini 3 Pro Preview is deprecated/shut down and must stay out of
92
+ active fallbacks.
93
+ - DeepSeek: DeepSeek-V4 exposes `deepseek-v4-pro` and `deepseek-v4-flash`;
94
+ legacy `deepseek-chat` and `deepseek-reasoner` are scheduled for
95
+ discontinuation on 2026-07-24 and must stay out of priority fallbacks.
96
+ - xAI Grok: the model catalog currently recommends `grok-4.3` for general Chat
97
+ API use, while reasoning docs identify `grok-4.20-multi-agent` as the only
98
+ explicit `reasoning.effort` model. Other Grok reasoning models reason
99
+ automatically and must not receive explicit effort.
100
+
101
+ ## Important
102
+
103
+ The priority list is intentionally code-level configuration, not hidden behavior. Provider model catalogs and deprecation schedules change often, so this file and `src/peers/model-selection.ts` must be reviewed against official provider documentation whenever defaults change.
104
+
105
+ The redacted real-API capability smoke for the current default models is recorded in `docs/reports/cross-review-api-capability-smoke-2026-04-30.md`.