@opencodehub/cli 0.2.2 → 0.2.3

Files changed (52)
  1. package/dist/commands/ci-templates/github-nightly.yml +35 -0
  2. package/dist/commands/ci-templates/github-rescan.yml +52 -0
  3. package/dist/commands/ci-templates/github-verdict.yml +24 -0
  4. package/dist/commands/ci-templates/github-weekly.yml +49 -0
  5. package/dist/commands/ci-templates/gitlab-ci.yml +56 -0
  6. package/dist/index.js +9 -1
  7. package/dist/index.js.map +1 -1
  8. package/dist/plugin-assets/agents/code-analyst.md +18 -0
  9. package/dist/plugin-assets/commands/audit-deps.md +29 -0
  10. package/dist/plugin-assets/commands/owners.md +20 -0
  11. package/dist/plugin-assets/commands/probe.md +21 -0
  12. package/dist/plugin-assets/commands/rename.md +20 -0
  13. package/dist/plugin-assets/commands/verdict.md +18 -0
  14. package/dist/plugin-assets/hooks/augment.sh +128 -0
  15. package/dist/plugin-assets/hooks/docs-staleness.sh +45 -0
  16. package/dist/plugin-assets/hooks.json +34 -0
  17. package/dist/plugin-assets/skills/codehub-code-pack/SKILL.md +181 -0
  18. package/dist/plugin-assets/skills/codehub-code-pack/references/determinism-contract.md +150 -0
  19. package/dist/plugin-assets/skills/codehub-contract-map/SKILL.md +144 -0
  20. package/dist/plugin-assets/skills/codehub-document/SKILL.md +152 -0
  21. package/dist/plugin-assets/skills/codehub-document/references/cross-reference-spec.md +142 -0
  22. package/dist/plugin-assets/skills/codehub-document/references/data-source-map.md +139 -0
  23. package/dist/plugin-assets/skills/codehub-document/references/document-templates.md +347 -0
  24. package/dist/plugin-assets/skills/codehub-document/references/mermaid-patterns.md +181 -0
  25. package/dist/plugin-assets/skills/codehub-document/templates/agents/README.md +64 -0
  26. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-analysis-dead-code.md +104 -0
  27. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-analysis-ownership.md +101 -0
  28. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-analysis-risk-hotspots.md +105 -0
  29. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-architecture-data-flow.md +103 -0
  30. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-architecture-module-map.md +102 -0
  31. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-architecture-system-overview.md +100 -0
  32. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-behavior-processes.md +103 -0
  33. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-behavior-state-machines.md +101 -0
  34. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-cross-repo-contracts-matrix.md +104 -0
  35. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-cross-repo-dependency-flow.md +111 -0
  36. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-cross-repo-portfolio-map.md +106 -0
  37. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-diagrams-components.md +99 -0
  38. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-diagrams-dependency-graph.md +104 -0
  39. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-diagrams-sequences.md +103 -0
  40. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-reference-cli.md +110 -0
  41. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-reference-mcp-tools.md +100 -0
  42. package/dist/plugin-assets/skills/codehub-document/templates/agents/doc-reference-public-api.md +111 -0
  43. package/dist/plugin-assets/skills/codehub-document/templates/orchestrator-prompt.md +110 -0
  44. package/dist/plugin-assets/skills/codehub-onboarding/SKILL.md +111 -0
  45. package/dist/plugin-assets/skills/codehub-pr-description/SKILL.md +122 -0
  46. package/dist/plugin-assets/skills/opencodehub-debugging/SKILL.md +144 -0
  47. package/dist/plugin-assets/skills/opencodehub-exploring/SKILL.md +120 -0
  48. package/dist/plugin-assets/skills/opencodehub-guide/SKILL.md +180 -0
  49. package/dist/plugin-assets/skills/opencodehub-impact-analysis/SKILL.md +151 -0
  50. package/dist/plugin-assets/skills/opencodehub-pr-review/SKILL.md +246 -0
  51. package/dist/plugin-assets/skills/opencodehub-refactoring/SKILL.md +180 -0
  52. package/package.json +11 -9
@@ -0,0 +1,152 @@
1
+ ---
2
+ name: codehub-document
3
+ description: "Use when the user asks to generate, regenerate, or refresh long-form codebase documentation, an architecture book, a module map, or a per-repo reference — especially after `codehub analyze` finishes or after a large merge. Examples: \"document this repo\", \"regenerate the architecture docs\", \"write a module map for the monorepo\", \"produce a group-wide portfolio doc\". DO NOT use if the repo is not indexed — run `codehub analyze` first and confirm `mcp__opencodehub__list_repos` returns the repo. DO NOT use for PR descriptions (use `codehub-pr-description`), onboarding docs (use `codehub-onboarding`), or cross-repo contract maps alone (use `codehub-contract-map`)."
4
+ allowed-tools: "Read, Write, Edit, Glob, Grep, Bash(codehub:*), mcp__opencodehub__list_repos, mcp__opencodehub__project_profile, mcp__opencodehub__query, mcp__opencodehub__context, mcp__opencodehub__impact, mcp__opencodehub__dependencies, mcp__opencodehub__owners, mcp__opencodehub__risk_trends, mcp__opencodehub__route_map, mcp__opencodehub__tool_map, mcp__opencodehub__list_dead_code, mcp__opencodehub__list_findings, mcp__opencodehub__verdict, mcp__opencodehub__group_list, mcp__opencodehub__group_query, mcp__opencodehub__group_status, mcp__opencodehub__group_contracts, mcp__opencodehub__group_cross_repo_links, mcp__opencodehub__sql, Task"
5
+ argument-hint: "[output-dir] [--group <name>] [--committed] [--refresh] [--section <name>]"
6
+ color: indigo
7
+ model: sonnet
8
+ ---
9
+
10
+ # codehub-document
11
+
12
+ Primary artifact generator. Produces a tree of cross-linked Markdown under `.codehub/docs/` (single-repo) or `.codehub/groups/<name>/docs/` (group mode) using a three-phase orchestration: **Phase 0 parallel precompute waves** → **Phase 1 file-level subagent fan-out** (one packet = one output file) → **Phase 2 deterministic cross-reference assembly**.
13
+
14
+ **Model policy.** This skill runs on Sonnet by default. Bump to Opus in two cases:
15
+
16
+ 1. `--refresh --group` combined — the pruning + partial fan-out across members needs the extra judgment.
17
+ 2. Any individual packet may set `model: opus` in its frontmatter to opt into Opus for synthesis-heavy roles. The cross-repo skeletons (`doc-cross-repo-*`) and `doc-analysis-risk-hotspots` typically do this; full-scan single-repo packets stay on Sonnet.
18
+
19
+ ## Preconditions (check before Phase 0)
20
+
21
+ 1. `mcp__opencodehub__list_repos` returns the target. If not, emit `Run codehub analyze first — repo <name> is not indexed.` and stop.
22
+ 2. `codehub status` reports fresh. If stale, emit `Run 'codehub analyze' first — index is stale` and stop.
23
+ 3. Group mode only: `mcp__opencodehub__group_status({group})` must return `fresh: true` for every member. If any member is stale, abort and name each stale repo.
24
+
25
+ ## Arguments
26
+
27
+ - `[output-dir]` (optional positional) — where to write. Default is `.codehub/docs/` (gitignored). With `--committed`, default flips to `docs/codehub/` and the skill does not add a `.gitignore` entry.
28
+ - `--group <name>` — enable group mode. Phase 0 calls `group_list` + `group_status` + `group_contracts` + `group_query`. Phase 1 additionally dispatches the three `doc-cross-repo-*` skeletons.
29
+ - `--committed` — write under `docs/codehub/` (or user-supplied path) instead of `.codehub/docs/`. Does not touch `.gitignore`.
30
+ - `--refresh` — consult `.docmeta.json`, identify stale sections by comparing `max(mtime(section.sources[]))` against `section.mtime`, and dispatch exactly one file-level subagent per stale section (re-seeding its packet from `templates/agents/<role>.md`). Phase 2 always re-runs.
31
+ - `--section <name>` — regenerate one named section (e.g., `architecture/system-overview`). Dispatches exactly one subagent with one skeleton and re-runs Phase 2. Useful for targeted updates.
32
+
33
+ ## Four-phase orchestration
34
+
35
+ ### Phase 0 — Precompute shared context (parallel waves, no subagent)
36
+
37
+ Phase 0 writes `<docs-root>/.context.md` and `<docs-root>/.prefetch.md` so Phase 1 subagents read cached data instead of re-calling tools. It runs as three waves — **two of them are single-message tool-call batches**, so the MCP fan-out parallelizes.
38
+
39
+ **Wave 0a — independent precompute (one message, parallel)**. Issue all of these in a single tool-use batch:
40
+
41
+ - `mcp__opencodehub__list_repos`
42
+ - `mcp__opencodehub__project_profile`
43
+ - `mcp__opencodehub__sql` — schema probe: `SELECT table_name, column_name FROM information_schema.columns WHERE table_name IN ('nodes','relations') ORDER BY table_name, column_name`
44
+ - `mcp__opencodehub__route_map`
45
+ - `mcp__opencodehub__tool_map`
46
+ - `mcp__opencodehub__dependencies`
47
+ - `mcp__opencodehub__risk_trends`
48
+ - `mcp__opencodehub__list_dead_code`
49
+ - `mcp__opencodehub__list_findings`
50
+ - Group mode only: `mcp__opencodehub__group_list`, `mcp__opencodehub__group_status`, `mcp__opencodehub__group_contracts`
51
+
52
+ **Wave 0b — depends on 0a (one message, parallel)**. Needs schema column names + profile entry points from 0a, so it is a second batch. Issue in one message:
53
+
54
+ - `mcp__opencodehub__sql` — top communities (`SELECT … FROM nodes WHERE kind='Community' ORDER BY cohesion DESC LIMIT 10`)
55
+ - `mcp__opencodehub__sql` — top processes (`SELECT … FROM nodes WHERE kind='Process' ORDER BY step_count DESC LIMIT 10`)
56
+ - `mcp__opencodehub__sql` — relations slice for diagrams (filtered per the schema probe)
57
+ - `mcp__opencodehub__owners` × top-5 folders (derived from `project_profile` entry points + file-count heuristic)
58
+ - Group mode only: `mcp__opencodehub__group_query` for any canonical cross-repo search terms
59
+
60
+ **Wave 0c — inline Write (no tool batch)**. Deterministic post-processing; no MCP calls:
61
+
62
+ 1. Assemble `<docs-root>/.context.md` (hard 200-line cap; per-section `truncated: true` when the raw output exceeds the cap). Sections: repo profile, schema probe, top communities, top processes, routes, MCP tools, owners summary, dependencies summary, dead-code counts, findings summary, risk trends. Group mode appends: group manifest, contracts matrix, freshness table.
63
+ 2. Write `<docs-root>/.prefetch.md` — newline-delimited JSON, one record per tool call with `{tool, args, sha256, keys, cached_at, truncated}`. Example:
64
+
65
+ ```json
66
+ {"tool":"project_profile","args":{"repo":"opencodehub"},"sha256":"…","keys":["languages","stacks","entryPoints"],"cached_at":"2026-04-27T18:04:11Z","truncated":false}
67
+ ```
68
+
69
+ The full layout of both files, plus the schema-preflight rationale and the Phase 0 pseudocode, live in `references/data-source-map.md`.
70
+
71
+ ### Phase 1 — File-level subagent fan-out
72
+
73
+ One packet = one output file. The orchestrator seeds packets by copying `templates/agents/<role>.md` to `<docs-root>/.packets/<role>.md`, substituting the placeholders listed in `templates/agents/README.md § Placeholders`, and spawning a `general-purpose` subagent per packet with the prompt from `templates/orchestrator-prompt.md`.
74
+
75
+ **Skeletons (single-repo)**. Ten are always seeded; four are conditional on triggers observed in Phase 0:
76
+
77
+ | Role | Always / Conditional | Trigger |
78
+ |------|----------------------|---------|
79
+ | `doc-architecture-system-overview` | always | — |
80
+ | `doc-architecture-module-map` | always | — |
81
+ | `doc-architecture-data-flow` | always | — |
82
+ | `doc-reference-public-api` | always | — |
83
+ | `doc-behavior-processes` | always | — |
84
+ | `doc-analysis-risk-hotspots` | always | — |
85
+ | `doc-analysis-ownership` | always | — |
86
+ | `doc-analysis-dead-code` | always | — |
87
+ | `doc-diagrams-components` | always | — |
88
+ | `doc-diagrams-dependency-graph` | always | — |
89
+ | `doc-reference-cli` | conditional | CLI detected in `project_profile` entry points |
90
+ | `doc-reference-mcp-tools` | conditional | `tool_map` returns ≥ 1 row |
91
+ | `doc-behavior-state-machines` | conditional | ≥ 2 `StateMachine` nodes in the graph |
92
+ | `doc-diagrams-sequences` | conditional | ≥ 1 process with ≥ 3 steps |
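The seeding rules in the table above can be sketched as a small selection function. This is a minimal illustration, not the orchestrator's actual code: the `Phase0Facts` container and its field names are hypothetical stand-ins for whatever Phase 0 records, and only the role names and thresholds come from the table.

```python
from dataclasses import dataclass

# The ten roles the table marks as "always".
ALWAYS = [
    "doc-architecture-system-overview",
    "doc-architecture-module-map",
    "doc-architecture-data-flow",
    "doc-reference-public-api",
    "doc-behavior-processes",
    "doc-analysis-risk-hotspots",
    "doc-analysis-ownership",
    "doc-analysis-dead-code",
    "doc-diagrams-components",
    "doc-diagrams-dependency-graph",
]


@dataclass
class Phase0Facts:
    """Hypothetical summary of Phase 0 observations (field names are assumptions)."""
    cli_detected: bool          # CLI found in project_profile entry points
    tool_map_rows: int          # rows returned by tool_map
    state_machine_nodes: int    # StateMachine nodes in the graph
    process_step_counts: list   # step_count of each known process


def select_skeletons(facts):
    """Return the roles to seed: the ten fixed ones plus any triggered conditionals."""
    roles = list(ALWAYS)
    if facts.cli_detected:
        roles.append("doc-reference-cli")
    if facts.tool_map_rows >= 1:
        roles.append("doc-reference-mcp-tools")
    if facts.state_machine_nodes >= 2:
        roles.append("doc-behavior-state-machines")
    if any(steps >= 3 for steps in facts.process_step_counts):
        roles.append("doc-diagrams-sequences")
    return roles
```

A repo with a CLI, no MCP tools, one state machine, and a three-step process would seed twelve packets under this sketch: the ten fixed roles plus `doc-reference-cli` and `doc-diagrams-sequences`.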
93
+
94
+ **Group mode** adds three cross-repo skeletons seeded from `{{ group_docs_root }}/.packets/`:
95
+
96
+ - `doc-cross-repo-portfolio-map`
97
+ - `doc-cross-repo-contracts-matrix`
98
+ - `doc-cross-repo-dependency-flow`
99
+
100
+ **Dispatch priority (citation magnetism)**. Single-repo, up to ~10 packets per message, two messages back-to-back (no gate — this is purely how Claude Code's concurrent-Agent ceiling is managed):
101
+
102
+ 1. **Message 1 (high-magnetism first)**: `system-overview`, `public-api`, `processes`, `components`, `module-map`, `data-flow`, plus up to four of the remaining always/conditional skeletons.
103
+ 2. **Message 2 (immediately after)**: every remaining seeded skeleton.
104
+
105
+ **Group mode dispatch**. Sort all seeded packets by `(priority_class, repo_index)` and greedily fill batches of ≤ 10. Example: 3 repos × ~12 per-repo skeletons + 3 cross-repo skeletons ≈ 39 packets → 4 messages of ≤ 10, dispatched back-to-back.
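The greedy batching described above might look like the following sketch. The `Packet` tuple is hypothetical, and two details are assumptions rather than documented behavior: that a lower `priority_class` is dispatched earlier, and that the role name is used as a final tie-breaker to keep the order stable.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical packet descriptor; only the sort keys come from the text."""
    role: str
    priority_class: int  # assumption: lower value = dispatched earlier
    repo_index: int


def plan_batches(packets, batch_size=10):
    """Sort by (priority_class, repo_index) and cut into batches of at most batch_size."""
    ordered = sorted(packets, key=lambda p: (p.priority_class, p.repo_index, p.role))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

With 39 seeded packets this yields four batches of 10, 10, 10, and 9, matching the example in the paragraph above.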
106
+
107
+ **Spawn parameters** (per `templates/orchestrator-prompt.md § Usage at the orchestrator`):
108
+
109
+ - `subagent_type`: `general-purpose`
110
+ - `name`: the role (e.g. `doc-architecture-system-overview`)
111
+ - `description`: 3–5 words
112
+ - `model`: read from the packet's `model:` frontmatter (default `sonnet`, individual packets may request `opus`)
113
+ - `run_in_background`: `true`
114
+ - `prompt`: the canonical text from `templates/orchestrator-prompt.md`, with `{{ packet_path }}` substituted
115
+
116
+ **Monitoring**. Orchestrator tails `.packets/*.md` with `wc -l` at 30 s, 2 m, 5 m, then every 5 m — identical to the erpaval Act monitoring rhythm. A packet whose line count stops growing but whose `status:` line is still `IN_PROGRESS` is the signal to `SendMessage` a nudge or mark it failed.
117
+
118
+ ### Phase 2 — Cross-reference assembler (inline, deterministic)
119
+
120
+ No LLM call. Pure regex + join. See `references/cross-reference-spec.md` for the full algorithm. Summary:
121
+
122
+ 1. Extract every backtick `<path>:<LOC>` (or `<repo>:<path>:<LOC>`) citation from every generated Markdown file.
123
+ 2. Build a co-occurrence index: `source_file → [docs_citing_it]`.
124
+ 3. For any two docs sharing ≥ 2 common sources, append `## See also` (3–5 links) to both.
125
+ 4. **Group mode — sourced cross-repo links (v2)**: call `mcp__opencodehub__group_cross_repo_links` with the current `--group` value. The tool returns a deterministic, alpha-sorted `links[]` array (each entry: `source_repo_uri`, `target_repo_uri`, `source_doc_path`, `target_doc_path`, `relation`, optional `evidence`). Embed that array **verbatim** into `.docmeta.json.cross_repo_links[]` (schema v2). Then render the `## See also (other repos in group)` footer by grouping links by `source_doc_path`, emitting one bullet per target, labelled by `relation` (e.g. `depends_on → orders-api/architecture.md`). Do NOT re-compute links heuristically; the tool is the single source of truth.
126
+ 5. Write `<docs-root>/README.md` (landing page with the structure-is-deterministic disclaimer) and `<docs-root>/.docmeta.json` with `schema_version: 2`. `.docmeta.json.sections[i].agent` records the file-role (e.g. `doc-architecture-system-overview`) for `--refresh` traceability. Pre-v2 `.docmeta.json` files on disk remain readable; the orchestrator lazily upgrades them on the next regeneration by writing v2.
127
+
128
+ ## `--refresh` algorithm
129
+
130
+ See `references/cross-reference-spec.md § --refresh algorithm` for the full procedure. One-line summary: compare `max(mtime(section.sources[]))` against `section.mtime`, dispatch exactly one file-level subagent per stale section (re-seeding its packet from the skeleton), then always re-run Phase 2.
131
+
132
+ ## Progressive disclosure — references/
133
+
134
+ | Reference | When to consult |
135
+ | ---------------------------------- | -------------------------------------------------------- |
136
+ | `references/document-templates.md` | Per-file structural templates (what goes in each section)|
137
+ | `references/data-source-map.md` | Which MCP tools feed which subagent |
138
+ | `references/cross-reference-spec.md` | Phase 2 algorithm (named Phase E in that file) + `.docmeta.json` schema + `--refresh` |
139
+ | `references/mermaid-patterns.md` | Mermaid idioms for each diagram type |
140
+
141
+ ## Quality checklist
142
+
143
+ - [ ] Phase 0 ran waves 0a and 0b as single-message tool batches; `.context.md` is ≤ 200 lines; `.prefetch.md` has one JSON line per tool call including the schema probe.
144
+ - [ ] Phase 1 seeded one packet per output file under `<docs-root>/.packets/`, with placeholders substituted and `status: IN_PROGRESS`.
145
+ - [ ] Phase 1 dispatched packets in batches of ≤ 10 per message, priority-first (system-overview, public-api, processes, components, module-map, data-flow before the rest).
146
+ - [ ] Every generated file has H1 = identifier, no YAML frontmatter (the frontmatter lives in the packet, not the output).
147
+ - [ ] Every factual claim in every output has a backtick citation (`path:LOC` or `repo:path:LOC`).
148
+ - [ ] Every packet has `status: COMPLETE` and a populated Work log / Validation / Summary section before Phase 2 starts.
149
+ - [ ] Phase 2 wrote `.docmeta.json` validating against the schema in `references/cross-reference-spec.md`, with `sections[i].agent` set to the file-role.
150
+ - [ ] `See also` footers appear on every doc with ≥ 2 shared citations.
151
+ - [ ] Group mode: outputs from `doc-cross-repo-*` packets use `repo:path:LOC` citations exclusively.
152
+ - [ ] `codehub status` is fresh before this skill starts; otherwise the preconditions caught the stale state.
@@ -0,0 +1,142 @@
1
+ # cross-reference-spec — Phase E algorithm + `.docmeta.json` schema + `--refresh`
2
+
3
+ Phase E (Phase 2 in `SKILL.md`) is **deterministic Markdown assembly**. No LLM call. Pure regex + join + write.
4
+
5
+ ## Citation grammar
6
+
7
+ Every factual claim carries an inline backtick citation. Two forms, both recognized by the assembler:
8
+
9
+ - **Single-repo**: `` `<path>:<LOC>` `` or `` `<path>:<start>-<end>` ``. File-level cites append ` (N LOC)`.
10
+ - **Group-qualified**: `` `<repo>:<path>:<LOC>` `` — **mandatory** in any file under `cross-repo/` or `contracts.md`.
11
+
12
+ ### The Phase E regex
13
+
14
+ ```
15
+ (?P<repo>[a-zA-Z0-9_-]+:)?(?P<path>[^\s`:]+\.[a-zA-Z0-9]+)(?::(?P<start>\d+)(?:-(?P<end>\d+))?)?(?:\s*\((?P<loc>\d+)\s*LOC\))?
16
+ ```
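The pattern uses Python-style named groups, so it can be exercised directly with the standard `re` module. The sketch below is an illustration under two assumptions: that each backtick span must match the pattern in full (`fullmatch`), and that inline spans never contain newlines. It does not special-case fenced code blocks, which a real assembler would likely need to skip.

```python
import re

# The Phase E pattern, split across lines for readability.
CITATION = re.compile(
    r"(?P<repo>[a-zA-Z0-9_-]+:)?"
    r"(?P<path>[^\s`:]+\.[a-zA-Z0-9]+)"
    r"(?::(?P<start>\d+)(?:-(?P<end>\d+))?)?"
    r"(?:\s*\((?P<loc>\d+)\s*LOC\))?"
)

# Inline code spans: text between a pair of single backticks on one line.
BACKTICK_SPAN = re.compile(r"`([^`\n]+)`")


def extract_citations(markdown):
    """Return one dict per backtick span that is, in full, a citation."""
    found = []
    for span in BACKTICK_SPAN.finditer(markdown):
        match = CITATION.fullmatch(span.group(1).strip())
        if match is None:
            continue  # ordinary inline code such as a command name
        repo = match.group("repo")
        found.append({
            "repo": repo[:-1] if repo else None,  # drop the trailing colon
            "path": match.group("path"),
            "start": int(match.group("start")) if match.group("start") else None,
            "end": int(match.group("end")) if match.group("end") else None,
        })
    return found
```

Because the line-number part of the pattern is optional, a bare file name in backticks (for example a span containing only `.docmeta.json`) also matches; whether such spans count as citations is not stated here.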
17
+
18
+ The assembler scans only between backtick pairs — never raw prose.
19
+
20
+ ## Algorithm
21
+
22
+ 1. **Walk** every `.md` file under the output tree (excluding the precompute files).
23
+ 2. **Extract** every citation matching the regex between backtick pairs.
24
+ 3. **Build** the co-occurrence index: `source_file → [docs_citing_it]`.
25
+ 4. **For each doc**, compute its set of siblings: docs that share ≥ 2 common source citations.
26
+ 5. **Rank** siblings by shared-citation count, then alphabetically. Take the top 3–5.
27
+ 6. **Append** a `## See also` footer to every doc with ≥ 1 sibling. Use Markdown reference-style links, not inline URLs.
28
+ 7. **Group mode**: for every `cross-repo/*.md` file, additionally append `## See also (other repos in group)` listing relative paths into sibling repos' generated docs (e.g., `../../billing/.codehub/docs/reference/public-api.md`).
29
+ 8. **Dedup** sibling paths across both footer sections.
30
+ 9. **Strip** any YAML frontmatter blocks on generated docs and record a `frontmatter_removed: [<path>]` entry in `.docmeta.json`.
31
+ 10. **Write** `README.md` (landing page with the "Prose is LLM-generated; structure is graph-derived" disclaimer) and `.docmeta.json` (schema below).
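Steps 3 through 5 amount to an inverted index followed by a pairwise count. A minimal sketch, assuming citations have already been reduced to a mapping from doc path to the set of source files it cites (the function name and the `limit` default of 5 are illustrative choices):

```python
from collections import defaultdict


def see_also(doc_sources, min_shared=2, limit=5):
    """Map each doc to its ranked siblings (docs sharing >= min_shared sources)."""
    # Step 3: co-occurrence index, source_file -> docs citing it.
    index = defaultdict(set)
    for doc, sources in doc_sources.items():
        for src in sources:
            index[src].add(doc)

    # Step 4: count shared sources for every ordered pair of docs.
    shared = defaultdict(lambda: defaultdict(int))
    for docs in index.values():
        for a in docs:
            for b in docs:
                if a != b:
                    shared[a][b] += 1

    # Step 5: rank by shared count (descending), then alphabetically.
    result = {}
    for doc in sorted(doc_sources):
        siblings = [(s, n) for s, n in shared[doc].items() if n >= min_shared]
        siblings.sort(key=lambda item: (-item[1], item[0]))
        if siblings:
            result[doc] = [s for s, _ in siblings[:limit]]
    return result
```

Docs with no qualifying sibling are simply absent from the result, which corresponds to step 6 appending a footer only when at least one sibling exists.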
32
+
33
+ ## `.docmeta.json` schema
34
+
35
+ The file carries a `schema_version` integer. **v2 is the current schema**; v1 files on disk remain readable — the orchestrator lazily upgrades them on the next regeneration by re-running Phase E and writing v2. v2 adds one new field — `cross_repo_links[]` — populated in group mode from the `group_cross_repo_links` MCP tool. All v1 fields carry through unchanged.
36
+
37
+ ```json
38
+ {
39
+ "$schema": "https://opencodehub.dev/schemas/docmeta-v2.json",
40
+ "schema_version": 2,
41
+ "generated_at": "2026-04-27T18:12:04Z",
42
+ "codehub_graph_hash": "sha256:a1b2c3…",
43
+ "mode": "single-repo",
44
+ "repo": "opencodehub",
45
+ "group": null,
46
+ "staleness_at": "2026-04-27T18:12:04Z",
47
+ "sections": [
48
+ {
49
+ "path": "architecture/system-overview.md",
50
+ "agent": "doc-architecture-system-overview",
51
+ "sources": [
52
+ "packages/mcp/src/server.ts",
53
+ "packages/mcp/src/index.ts"
54
+ ],
55
+ "mtime": "2026-04-27T18:11:58Z",
56
+ "citation_count": 18,
57
+ "mermaid_count": 1
58
+ }
59
+ ],
60
+ "cross_repo_refs": [],
61
+ "cross_repo_links": [],
62
+ "frontmatter_removed": []
63
+ }
64
+ ```
65
+
66
+ Group mode populates `cross_repo_refs[]` (as in v1):
67
+
68
+ ```json
69
+ {
70
+ "cross_repo_refs": [
71
+ {
72
+ "repo": "billing",
73
+ "from_doc": "cross-repo/contracts-matrix.md",
74
+ "to_doc": "../../../billing/.codehub/docs/reference/public-api.md",
75
+ "contract_count": 4
76
+ }
77
+ ]
78
+ }
79
+ ```
80
+
81
+ And `cross_repo_links[]` (new in v2, sourced from `group_cross_repo_links`):
82
+
83
+ ```json
84
+ {
85
+ "cross_repo_links": [
86
+ {
87
+ "source_repo_uri": "github.com/org/frontend",
88
+ "target_repo_uri": "github.com/org/orders-api",
89
+ "source_doc_path": "frontend/architecture.md",
90
+ "target_doc_path": "orders-api/architecture.md",
91
+ "relation": "depends_on",
92
+ "evidence": "GET /orders/{id}"
93
+ },
94
+ {
95
+ "source_repo_uri": "github.com/org/orders-api",
96
+ "target_repo_uri": "github.com/org/frontend",
97
+ "source_doc_path": "orders-api/architecture.md",
98
+ "target_doc_path": "frontend/architecture.md",
99
+ "relation": "consumer_of",
100
+ "evidence": "GET /orders/{id}"
101
+ }
102
+ ]
103
+ }
104
+ ```
105
+
106
+ `cross_repo_links[]` is the sourced, deterministic, alpha-sorted link graph emitted by `group_cross_repo_links`. The engine owns the data (one record per matched contract, emitted in both directions — `depends_on` from consumer to producer, `consumer_of` from producer to consumer). The skill owns the file — it embeds the tool's output verbatim during Phase E and renders the `## See also (other repos in group)` footer from it. Backward-compat: pre-v2 files without `cross_repo_links` are fine to read; the orchestrator writes v2 on next regeneration.
107
+
108
+ **Relation vocabulary**:
109
+
110
+ - `depends_on` — source repo consumes target repo (consumer → producer). The target is an upstream API.
111
+ - `consumer_of` — source repo is consumed BY target repo (producer → consumer). The target is a known downstream.
112
+ - `see_also` — reserved for a later AC. Bidirectional doc link inferred from non-contract cross-repo references.
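Rendering the group footer from `cross_repo_links[]` is a group-by over `source_doc_path`. The sketch below assumes the links arrive as a list of dicts with the fields shown in the JSON example, and that each bullet is plain text of the form `relation → target`; whether the real footer uses Markdown links is not specified here.

```python
from collections import defaultdict


def render_group_footers(links):
    """Return {source_doc_path: footer_markdown}, one bullet per target."""
    by_source = defaultdict(list)
    for link in links:
        # Input order is kept as-is: the tool output is described as already sorted.
        by_source[link["source_doc_path"]].append(link)

    footers = {}
    for source, items in by_source.items():
        lines = ["## See also (other repos in group)", ""]
        for item in items:
            lines.append("- %s → %s" % (item["relation"], item["target_doc_path"]))
        footers[source] = "\n".join(lines)
    return footers
```

The function never reorders or filters links, in keeping with the rule that the tool output is embedded verbatim and not recomputed.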
113
+
114
+ `staleness_at` is copied from the `_meta.codehub/staleness` envelope on the last MCP response the assembler observed.
115
+
116
+ ## `--refresh` algorithm
117
+
118
+ 1. Load `.docmeta.json` from the existing output tree.
119
+ 2. Fetch the current `codehub_graph_hash` from `mcp__opencodehub__list_repos`. If it matches the manifest's hash exactly, skip to step 5.
120
+ 3. For each `section` in the manifest:
121
+ - Compute `max(mtime(source))` across `sections[i].sources[]` via `stat`.
122
+ - If `max(source_mtime) > sections[i].mtime`: mark the section stale.
123
+ 4. Collect the union of stale sections and their owners (`section.agent`). Dispatch only those subagents; pass them a `sections_to_refresh` list so they write only those files.
124
+ 5. Always re-run Phase E over the full tree (cross-reference assembly is cheap and idempotent).
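Steps 2 and 3 can be sketched as follows. This follows the numbered steps literally; the function names are illustrative, timestamps are assumed to be ISO-8601 UTC strings as in the manifest example, and the sketch does not handle sources that have been deleted since the last run.

```python
import os
from datetime import datetime, timezone


def parse_iso(ts):
    """Parse a manifest timestamp; a trailing 'Z' is rewritten for older Pythons."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def source_mtime(path):
    """Default mtime lookup via stat, returned as an aware UTC datetime."""
    return datetime.fromtimestamp(os.stat(path).st_mtime, tz=timezone.utc)


def stale_sections(docmeta, current_graph_hash, mtime_of=source_mtime):
    """Return the paths of sections whose newest source postdates the section."""
    # Step 2: identical graph hash means nothing to regenerate.
    if docmeta.get("codehub_graph_hash") == current_graph_hash:
        return []
    stale = []
    for section in docmeta["sections"]:
        # Step 3: compare max(mtime(source)) against the section's own mtime.
        newest = max((mtime_of(src) for src in section["sources"]), default=None)
        if newest is not None and newest > parse_iso(section["mtime"]):
            stale.append(section["path"])
    return stale
```

Passing `mtime_of` as a parameter keeps the comparison testable without touching the filesystem; in normal use the default `stat`-based lookup applies.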
125
+
126
+ The algorithm is **tolerant of the common case** where `codehub analyze` updates the graph but touches only a few files. Falling back to a full regen when `graph_hash` churns avoids subtle staleness when node IDs shift.
127
+
128
+ ## Determinism call-outs
129
+
130
+ - **Deterministic**: file list, directory layout, section ordering, diagram node set, citation targets, `.docmeta.json` structure.
131
+ - **Non-deterministic**: prose sentences, diagram edge ordering within a node, choice of which 3 processes get sequence diagrams among ties.
132
+
133
+ Generated `README.md` includes the one-line disclaimer: *"Prose is LLM-generated; structure is graph-derived. Phase E cross-references are deterministic."*
134
+
135
+ ## Common failure modes
136
+
137
+ | Symptom | Likely cause | Fix |
138
+ |---|---|---|
139
+ | `See also` footer points at a missing file | A Phase 1 subagent wrote a partial file; Phase E saw the citation but the target was orphaned | Re-run `--refresh` on the owning section |
140
+ | A group-mode `cross-repo/` file has a plain `path:LOC` citation | `doc-cross-repo` slipped on the grammar rule | Update the subagent's Quality Checklist enforcement; add the bad line to the agent's prompt |
141
+ | `.docmeta.json.frontmatter_removed` is non-empty | A subagent emitted YAML frontmatter despite the rule | The assembler stripped it; no user action needed, but fix the subagent |
142
+ | `--refresh` regenerated everything unexpectedly | `graph_hash` changed (node IDs shifted) | Expected behavior on re-analyze; not a bug |
@@ -0,0 +1,139 @@
1
+ # data-source-map — which tools feed which subagent
2
+
3
+ Phase 0 precomputes a shared context from these sources. Subagents read the precompute from disk; they do not re-call tools whose digest is in `.prefetch.md`.
4
+
5
+ ## `.context.md` (200-line cap)
6
+
7
+ ```markdown
8
+ # Codehub context — <repo-or-group-name>
9
+ generated_at: <ISO-8601>
10
+ graph_hash: <from list_repos>
11
+
12
+ ## Repo profile # from project_profile
13
+ - languages: TypeScript 87%, Rust 11%, Python 2%
14
+ - stacks: Node 22, pnpm 10, DuckDB, Vitest
15
+ - entry points: packages/mcp/src/index.ts, packages/cli/src/bin.ts
16
+
17
+ ## Top communities (≤ 10) # from sql: SELECT name, inferred_label, cohesion, symbol_count
18
+ # FROM nodes WHERE kind='Community' ORDER BY cohesion DESC LIMIT 10
19
+ | name | inferred_label | cohesion | symbols |
20
+
21
+ ## Top processes (≤ 10) # from sql: SELECT name, entry_point, step_count
22
+ # FROM nodes WHERE kind='Process' ORDER BY step_count DESC LIMIT 10
23
+ | name | entry_point | step_count |
24
+
25
+ ## Routes # from route_map — truncated to 25 rows
26
+ | method | path | handler |
27
+
28
+ ## MCP tools # from tool_map — truncated to 25 rows
29
+ | tool | summary |
30
+
31
+ ## Owners summary # from owners on top 5 folders
32
+ | path | top_owner | share |
33
+
34
+ ## Staleness envelope # from list_repos._meta.codehub/staleness
35
+ - graph_hash: …
36
+ - indexed_at: …
37
+ - staleness_level: fresh | stale
38
+ ```
39
+
40
+ **Group mode adds:**
41
+
42
+ ```markdown
43
+ ## Group manifest # from group_list
44
+ - group: <name>
45
+ - repos: [<list>]
46
+
47
+ ## Group contracts matrix # from group_contracts
48
+ | producer | consumer | count |
49
+
50
+ ## Group freshness # from group_status
51
+ | repo | fresh | last_indexed |
52
+ ```
53
+
54
+ ## `.prefetch.md` (no cap, ledger)
55
+
56
+ Newline-delimited JSON. One line per tool call. Example:
57
+
58
+ ```json
59
+ {"tool":"project_profile","args":{"repo":"opencodehub"},"sha256":"8c5f…","keys":["languages","stacks","entryPoints"],"cached_at":"2026-04-27T18:04:11Z","truncated":false}
60
+ {"tool":"tool_map","args":{"repo":"opencodehub"},"sha256":"1b9e…","keys":["tools"],"cached_at":"2026-04-27T18:04:12Z","truncated":true}
61
+ ```
62
+
63
+ Subagents use the ledger two ways:
64
+
65
+ 1. **Skip re-call.** If an agent would call a tool whose digest is here, it reads `.prefetch.md` + `.context.md` instead.
66
+ 2. **Know when data is truncated.** Sections with `truncated: true` signal that raw tool output is larger than what's in `.context.md`. The agent may re-call the tool for a targeted slice if needed.
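A ledger line can be produced and looked up with the standard library alone. In this sketch the record fields mirror the example above, but how the digest is computed is an assumption: here it is the SHA-256 of a canonical JSON dump of the response. Both helper names are hypothetical.

```python
import hashlib
import json


def ledger_record(tool, args, response, cached_at, truncated=False):
    """Build one ledger record for a tool response."""
    # Assumption: the digest covers a canonical (sorted-key, compact) JSON dump.
    payload = json.dumps(response, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {
        "tool": tool,
        "args": args,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "keys": list(response) if isinstance(response, dict) else [],
        "cached_at": cached_at,
        "truncated": truncated,
    }


def find_cached(ledger_text, tool, args):
    """Return the first record for (tool, args), or None if the call is not cached."""
    for line in ledger_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record["tool"] == tool and record["args"] == args:
            return record
    return None
```

A subagent following the skip rule would call something like `find_cached` first and re-issue the tool call only on `None`, or when the record has `truncated` set and a targeted slice is needed.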
67
+
68
+ ## Per-role input table
69
+
70
+ File-level fan-out means one role may seed multiple packets (for example, `doc-architecture` seeds `system-overview`, `module-map`, `data-flow`; `doc-diagrams` seeds `components`, `sequences`, `dependency-graph`). This table is indexed by role — the `codehub-document` orchestrator reads it when deciding which cached digests to mention in each packet's Input specification.
71
+
72
+ | Role | Primary tools (Phase 0 cached) | Mid-run tools (not cached; agent may call) |
73
+ |--------------------|-------------------------------------------------------------|-----------------------------------------------------------|
74
+ | `doc-architecture` | `project_profile`, `sql` (communities, processes) | `context`, `query`, `dependencies`, `sql` for deeper joins |
75
+ | `doc-reference` | `tool_map`, `route_map`, `project_profile` | `signature`, `context`, `sql` for export filtering |
76
+ | `doc-behavior` | `sql` (processes), `route_map`, `tool_map` | `context` per process, `query` to disambiguate names |
77
+ | `doc-analysis` | `owners`, `risk_trends`, `list_findings`, `list_dead_code` | `verdict` (optional), `sql` for drill-down |
78
+ | `doc-diagrams` | `sql` (relations), `dependencies` | `context` per process, `query` for actor labels |
79
+ | `doc-cross-repo` | `group_list`, `group_status`, `group_contracts` | `group_query`, `route_map` per member |
80
+
81
+ ## Schema preflight (non-optional)
82
+
83
+ **Before composing any SQL query over `nodes`, `relations`, or any other
84
+ graph table, Phase 0 MUST probe the schema once and cache the result in
85
+ `.prefetch.md`.** Subagents then consult the cached schema instead of
86
+ guessing column names, which would fail with `Binder Error: Referenced
87
+ column "X" not found in FROM clause`.
88
+
89
+ The probe is one SQL call:
90
+
91
+ ```
92
+ sql("SELECT table_name, column_name FROM information_schema.columns
93
+ WHERE table_name IN ('nodes','relations') ORDER BY table_name, column_name")
94
+ ```
95
+
96
+ Write the result as a dedicated `.context.md § Schema` subsection (top 30
97
+ rows, no cap) and as a digest line in `.prefetch.md` with
98
+ `keys: ["table_name","column_name"]`.
99
+
100
+ Historical note: `nodes` does not have a `path` column — routes store their
101
+ endpoint under `name` (as `"METHOD /path"`), and the file path is
102
+ `file_path`. Observed during a 2026-04-27 dogfood when subagent prompts
103
+ blindly referenced `path` and hit a Binder Error on an otherwise fresh
104
+ graph. The preflight prevents this class of bug across every subagent.
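One way a subagent could use the cached probe is to check column names before composing a query. The sketch below assumes the probe result is available as a list of rows with `table_name` and `column_name` keys; the row shape and both helper names are illustrative, not part of the documented tool contract.

```python
def columns_by_table(probe_rows):
    """Turn probe rows into {table_name: {column_name, ...}}."""
    schema = {}
    for row in probe_rows:
        schema.setdefault(row["table_name"], set()).add(row["column_name"])
    return schema


def missing_columns(schema, table, wanted):
    """List the wanted columns that the cached schema does not contain."""
    known = schema.get(table, set())
    return [col for col in wanted if col not in known]
```

Checking `["name", "path"]` against a `nodes` table that has `name`, `kind`, and `file_path` would report `path` as missing, which is the mistake described in the historical note.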
105
+
106
+ ## Phase 0 algorithm (pseudocode)
107
+
108
+ Steps marked `# wave 0a` and `# wave 0b` each run as a single parallel tool-use batch — every line inside a wave issues concurrently in one message.
109
+
110
+ ```
111
+ # wave 0a — independent precompute (one parallel batch)
112
+ 1. staleness = list_repos → entry for this repo → _meta.codehub/staleness
113
+ 2. profile = project_profile({repo})
114
+ 3. schema = sql("SELECT table_name, column_name FROM information_schema.columns …")
115
+ 4. routes = route_map({repo})
116
+ 5. tools = tool_map({repo})
117
+ 6. deps = dependencies({repo})
118
+ 7. risk = risk_trends({repo})
119
+ 8. dead = list_dead_code({repo})
120
+ 9. findings = list_findings({repo})
121
+ 10. if --group: group_manifest = group_list
122
+ group_freshness = group_status({group})
123
+ group_contracts_matrix = group_contracts({group})
124
+ # precondition check: every member fresh; abort otherwise
125
+
126
+ # wave 0b — depends on schema + profile (one parallel batch)
127
+ 11. communities = sql("SELECT … FROM nodes WHERE kind='Community' …")
128
+ 12. processes = sql("SELECT … FROM nodes WHERE kind='Process' …")
129
+ 13. relations = sql("SELECT … FROM relations …") # for diagrams
130
+ 14. top_folders = top-5 folders by file count (from profile.entryPoints + glob)
131
+ 15. owners_summary = [owners({path}) for path in top_folders]
132
+ 16. if --group: group_hits = group_query({group, canonical_terms})
133
+
134
+ # wave 0c — inline deterministic post-processing (no MCP calls)
135
+ 17. write .context.md (enforce 200-line cap; truncate per-section, mark flags)
136
+ 18. write .prefetch.md (one JSON line per tool call with sha256 of response)
137
+ ```
138
+
139
+ Given the same `graph_hash`, the algorithm is **deterministic in structure**: the file list and section layout are identical across runs, though the exact content within each section may vary.