java-codebase-rag 0.1.0__py3-none-any.whl → 0.2.0__py3-none-any.whl

@@ -1,818 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: java-codebase-rag
3
- Version: 0.1.0
4
- Summary: MCP server for semantic + structural search over Java codebases
5
- Author: HumanBean17
6
- License-Expression: MIT
7
- Project-URL: Homepage, https://github.com/HumanBean17/java-codebase-rag
8
- Project-URL: Repository, https://github.com/HumanBean17/java-codebase-rag
9
- Project-URL: Issues, https://github.com/HumanBean17/java-codebase-rag/issues
10
- Keywords: mcp,java,rag,code-search,graph,lancedb,kuzu
11
- Classifier: Development Status :: 3 - Alpha
12
- Classifier: Intended Audience :: Developers
13
- Classifier: Programming Language :: Python :: 3
14
- Classifier: Programming Language :: Python :: 3.11
15
- Classifier: Programming Language :: Python :: 3.12
16
- Classifier: Programming Language :: Python :: 3.13
17
- Classifier: Topic :: Software Development :: Libraries
18
- Requires-Python: >=3.11
19
- Description-Content-Type: text/markdown
20
- License-File: LICENSE
21
- Requires-Dist: kuzu<0.12,>=0.11.3
22
- Requires-Dist: lancedb<0.31,>=0.25.3
23
- Requires-Dist: mcp<2,>=1.27.0
24
- Requires-Dist: numpy<2.5,>=1.26.4
25
- Requires-Dist: pathspec<2,>=1.0.4
26
- Requires-Dist: pyarrow<24,>=23.0.1
27
- Requires-Dist: PyYAML<7,>=6.0.3
28
- Requires-Dist: sentence-transformers<6,>=5.4.0
29
- Requires-Dist: tree-sitter<0.26,>=0.25.2
30
- Requires-Dist: tree-sitter-java<0.24,>=0.23.5
31
- Requires-Dist: unidiff<1,>=0.7.3
32
- Dynamic: license-file
33
-
34
- # java-codebase-rag
35
-
36
- A graph-native code intelligence layer for Java microservice estates, exposed to LLM agents via the **Model Context Protocol (MCP)**.
37
-
38
- The system extracts a deterministic property graph from Java source (tree-sitter), stores it in **Kuzu** (graph) alongside a **LanceDB** vector index (chunks), and exposes a deliberately small MCP surface — **five tools**: `search`, `find`, `describe`, `neighbors`, `resolve` — that collapse onto three primitive agent operations: **locate**, **inspect**, **walk**.
39
-
40
- > **What this MCP is:** a **GPS for code navigation**, not a reasoning engine.
41
- > Agents use a simple loop:
42
- >
43
- > 1. **Locate** entry nodes (`search` / `find`, or identifier-shaped **`resolve`**)
44
- > 2. **Inspect** what a node is (`describe`)
45
- > 3. **Walk** one hop at a time (`neighbors`) until enough evidence is gathered
46
- >
47
- > The MCP exposes structure and adjacency; the agent owns multi-hop reasoning and stop conditions.
48
-
49
- For the design rationale, the GPS metaphor, and the full ontology, see [`docs/paper/paper.pdf`](./docs/paper/paper.pdf) (architecture report).
50
-
51
- > **Stability disclaimer.** This repo does **not** promise backward compatibility. MCP tool contracts, env vars, Lance/Kuzu schemas, config files, and Python APIs may change without a deprecation period. Track `main` and rebuild indexes when ontology or embedding settings change (see [§6 Graph layer](#6-graph-layer)).
52
-
53
- ---
54
-
55
- ## Contents
56
-
57
- 1. [Install](#1-install)
58
- 2. [Environment variables](#2-environment-variables)
59
- 3. [MCP host setup](#3-mcp-host-setup) — Claude Code, Claude Desktop
60
- 4. [MCP tool reference](#4-mcp-tool-reference)
61
- 5. [CLI reference (`java-codebase-rag`)](#5-cli-reference-java-codebase-rag)
62
- 6. [Graph layer](#6-graph-layer) — Kuzu schema, edges, capabilities, ranking
63
- 7. [Brownfield overrides](#7-brownfield-overrides) — config + in-source annotations
64
- 8. [Ignore patterns](#8-ignore-patterns)
65
- 9. [Further reading](#9-further-reading)
66
-
67
- ---
68
-
69
- ## 1. Install
70
-
71
- ```bash
72
- cd /path/to/java-codebase-rag
73
- python3 -m venv .venv
74
- .venv/bin/pip install -r requirements.txt
75
- ```
76
-
77
- - **Python 3.11+** required.
78
- - **Embedding model** must match what the index was built with (default `sentence-transformers/all-MiniLM-L6-v2`).
79
- - The `cocoindex` package is **only** needed for lifecycle commands that run the indexer (`init`, `increment`, `reprocess`, and `erase`). Search and MCP navigation work without it.
80
-
81
- For the assumptions this MCP makes about your Java repo (annotations, DI patterns, naming) and a per-file map of where to edit if you can't refactor your codebase to match, see [`CODEBASE_REQUIREMENTS.md`](./CODEBASE_REQUIREMENTS.md).
82
-
83
- ---
84
-
85
- ## 2. Environment variables
86
-
87
- The operator-facing surface is **five** variables (plus MCP-only `JAVA_CODEBASE_RAG_SOURCE_ROOT` below). Precedence for knobs that also exist as CLI flags or YAML entries is **CLI flag > env var > YAML > built-in default** (see [`docs/JAVA-CODEBASE-RAG-CLI.md`](./docs/JAVA-CODEBASE-RAG-CLI.md)).
88
-
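The precedence rule above (CLI flag > env var > YAML > built-in default) can be sketched in a few lines of Python. This is an illustration only: `resolve_setting` is a hypothetical helper, not a function from the package, and treating a blank string as "unset" is an assumption. The source labels mirror the `cli` / `env` / `yaml` / `default` values that `java-codebase-rag meta` is described as reporting.

```python
from typing import Optional, Tuple


def resolve_setting(
    cli: Optional[str],
    env: Optional[str],
    yaml_value: Optional[str],
    default: str,
) -> Tuple[str, str]:
    """Return (value, source) using CLI > env > YAML > default precedence."""
    # Hypothetical helper: the first non-blank candidate wins.
    for value, source in ((cli, "cli"), (env, "env"), (yaml_value, "yaml")):
        if value is not None and value.strip():
            return value, source
    return default, "default"


# Example: no CLI flag, env var set, YAML also set. The env var wins.
value, source = resolve_setting(None, "/srv/index", "./.java-codebase-rag", "./.java-codebase-rag/")
print(value, source)
```

Returning the source alongside the value makes it easy to see which layer supplied a setting when two layers disagree.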
89
- | Variable | Purpose |
90
- |---|---|
91
- | `JAVA_CODEBASE_RAG_INDEX_DIR` | Local filesystem **directory** for Lance tables, the Kuzu file `code_graph.kuzu`, and cocoindex state (`cocoindex.db`). Not a `lancedb://` or cloud URI — use a path. Default: `./.java-codebase-rag/` under the resolved Java tree root. |
92
- | `SBERT_MODEL` | Hub id or local directory; must match indexer. Overridable via `.java-codebase-rag.yml` `embedding.model` and `--embedding-model`. |
93
- | `SBERT_DEVICE` | Optional: `cpu`, `cuda`, `mps`. Overridable via YAML `embedding.device` and `--embedding-device`. |
94
- | `JAVA_CODEBASE_RAG_DEBUG_CONTEXT` | When truthy, verbose stderr logging for chunk context expansion (diagnostics only). |
95
- | `JAVA_CODEBASE_RAG_RUN_HEAVY` | Test gate: set to `1` / `true` / `yes` to run the slow cocoindex + Lance end-to-end test (`pytest`); not used in normal operator workflows. |
96
-
97
- **MCP host launchers** also set `JAVA_CODEBASE_RAG_SOURCE_ROOT` to the Java repository root when it differs from the server process cwd (see `mcp.json.example`).
98
-
99
- Only the names in the table above (plus `JAVA_CODEBASE_RAG_SOURCE_ROOT` for MCP hosts) are read as configuration. Project config belongs in **`.java-codebase-rag.yml`** (or `.yaml`).
100
-
101
- **Paths and conventions** (for scripts and operators):
102
-
103
- - **`JAVA_CODEBASE_RAG_INDEX_DIR`** — filesystem path to the index directory (not a URI). Lance opens this directory; Kuzu is always `<index-dir>/code_graph.kuzu`; cocoindex keeps **`cocoindex.db`** next to them.
104
- - **Java tree root** — CLI: `--source-root` (else cwd). MCP stdio: set `JAVA_CODEBASE_RAG_SOURCE_ROOT` when the Java repo root differs from the server process cwd.
105
- - **`microservice_roots`** — configure only under **`microservice_roots:`** in `.java-codebase-rag.yml` (or `.yaml`).
106
- - **Chunk context diagnostics / heavy tests** — `JAVA_CODEBASE_RAG_DEBUG_CONTEXT`, `JAVA_CODEBASE_RAG_RUN_HEAVY` (see the table above).
107
-
108
- Python package: **`java_codebase_rag`** (`python -m java_codebase_rag.cli`).
109
-
110
- ### Project YAML reference (`.java-codebase-rag.yml`)
111
-
112
- A single file at the project root (the directory you pass as `--source-root`, or cwd) holds everything that isn't an environment variable. The two accepted filenames are `.java-codebase-rag.yml` and `.java-codebase-rag.yaml`; if both exist, `.yml` wins.
113
-
114
- **All keys are optional.** A project with no YAML at all uses built-in defaults plus env vars. Add only the keys you need.
115
-
116
- ```yaml
117
- # .java-codebase-rag.yml — full reference, every key annotated.
118
- # Place at the project root (same directory you pass as --source-root).
119
-
120
- # -------- Core knobs (mirror env vars; precedence: CLI > env > YAML > default) --------
121
-
122
- # Index directory: where Lance tables, code_graph.kuzu, and cocoindex.db live.
123
- # - Tilde (`~`) is expanded; `$VAR` is NOT (use absolute paths or `~`).
124
- # - Relative paths resolve against source_root, not cwd.
125
- # - Env: JAVA_CODEBASE_RAG_INDEX_DIR. CLI: --index-dir. Default: ./.java-codebase-rag/
126
- index_dir: ./.java-codebase-rag
127
-
128
- # Embedding configuration. Must match between indexer and reader — if you change
129
- # `embedding.model`, rebuild the index (`java-codebase-rag reprocess`).
130
- embedding:
131
- # Hub id OR local directory containing the sentence-transformers model files.
132
- # - Hub id example: `sentence-transformers/all-MiniLM-L6-v2`
133
- # - Local path examples: `/opt/models/minilm`, `~/models/minilm`, `$MODEL_DIR/minilm`
134
- # - Resolution applies expanduser + expandvars when the value is path-shaped
135
- # (starts with `/`, `./`, `../`, `~`, or contains `$`). Same rule for
136
- # `SBERT_MODEL` and `--embedding-model` after precedence picks the string.
137
- # Plain `org/name` is treated as a hub id and passed through unchanged.
138
- # A relative path without `./` (e.g. `models/minilm`) is ambiguous with
139
- # hub-id shape — prepend `./` if you mean a local directory.
140
- # - Env: SBERT_MODEL. CLI: --embedding-model. Default: sentence-transformers/all-MiniLM-L6-v2
141
- model: sentence-transformers/all-MiniLM-L6-v2
142
-
143
- # Optional. One of: cpu, cuda, mps, cuda:0, cuda:1, ...
144
- # When omitted, sentence-transformers picks automatically.
145
- # Env: SBERT_DEVICE. CLI: --embedding-device.
146
- device: cpu
147
-
148
- # -------- Microservice layout --------
149
-
150
- # Explicit microservice roots, relative to source_root. When set, takes priority
151
- # over auto-detection (build markers + outermost source-set folding).
152
- # Each entry is a directory NAME (no leading slash, no `~`). See §7 for the
153
- # auto-detection fallback and the diagnose-microservice CLI verb.
154
- microservice_roots:
155
- - chat-core
156
- - chat-orchestrator
157
- - ranking
158
-
159
- # -------- Cross-service edge resolution --------
160
-
161
- # How the resolver treats auto-detected cross-service call edges. See §7.2.
162
- # - auto (default): promote auto-detected callers to cross_service when a route matches.
163
- # - brownfield_only : only edges where both ends come from brownfield annotations or YAML
164
- # stay cross_service; everything else becomes `unresolved`.
165
- cross_service_resolution: auto
166
-
167
- # -------- Brownfield overrides (see §7 for full schema and semantics) --------
168
-
169
- # Roles & capabilities for custom stereotypes the indexer can't recognise.
170
- role_overrides:
171
- annotations:
172
- AcmeService: SERVICE
173
- CompanyController: CONTROLLER
174
- capabilities:
175
- CompanyKafkaTopic: [MESSAGE_LISTENER]
176
- fqn:
177
- com.legacy.OrderProcessor:
178
- role: SERVICE
179
- capabilities: [MESSAGE_LISTENER]
180
-
181
- # Server-side route declarations for endpoints the framework introspector can't see.
182
- route_overrides:
183
- annotations:
184
- ann.AcmeRoute:
185
- framework: spring_mvc
186
- kind: http_endpoint
187
- method: GET
188
- path: /acme
189
- fqn:
190
- com.legacy.UserApi:
191
- framework: spring_mvc
192
- kind: http_endpoint
193
- path: /legacy/users
194
-
195
- # Caller-side HTTP client overrides (RestTemplate/WebClient wrappers, custom Feign-likes).
196
- http_client_overrides:
197
- annotations:
198
- ann.LegacyHttpClient:
199
- client_kind: rest_template
200
- target_service: chat-core
201
- path: /chat/joinOperator
202
- method: POST
203
- fqn:
204
- com.legacy.ChatClient:
205
- client_kind: feign_method
206
- target_service: chat-core
207
-
208
- # Caller-side async producer overrides (Kafka/RabbitMQ event publishers).
209
- async_producer_overrides:
210
- annotations:
211
- ann.LegacyEvent:
212
- client_kind: kafka_send
213
- topic: chat.follow-up
214
- broker: ""
215
- fqn:
216
- com.legacy.EventBus:
217
- client_kind: kafka_send
218
- topic: chat.follow-up
219
- ```
220
-
221
- **Path expansion (what gets `~` / `$VAR` treatment):**
222
-
223
- | Field | Expanded? | Notes |
224
- |---|---|---|
225
- | `index_dir` | partial | `~` expanded; `$VAR` is NOT expanded. Relative paths resolve against `source_root`. |
226
- | `embedding.model` (when path-shaped) | yes | Path-shape = starts with `/`, `./`, `../`, `~`, or contains `$`. Plain `org/name` is treated as a hub id and passed through. Applies to the value after CLI > env > YAML > default precedence. Long-lived MCP hosts also apply the same expansion when reading `SBERT_MODEL` from the process environment (so table metadata and search agree with `index_common` defaults). |
227
- | `embedding.device` | n/a | Device strings (`cpu`, `cuda`, `mps`) aren't paths. |
228
- | `microservice_roots[*]` | no | Each entry is a directory **name** relative to `source_root`, not an arbitrary path. |
229
- | Brownfield `path:` / `topic:` values | no | These are URL paths and Kafka topic names, not filesystem paths. Literal characters preserved. |
230
-
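The path-shape rule for `embedding.model` in the table above can be expressed as a short Python sketch. The function names are hypothetical and the package's own implementation may differ; the sketch only encodes the stated rule (path-shaped values get `expanduser` plus `expandvars`, and plain `org/name` hub ids pass through unchanged).

```python
import os


def is_path_shaped(value: str) -> bool:
    """True when the value starts with /, ./, ../, ~ or contains a $ sign."""
    return value.startswith(("/", "./", "../", "~")) or "$" in value


def expand_model_ref(value: str) -> str:
    """Expand ~ and $VAR for path-shaped values; leave hub ids untouched."""
    if is_path_shaped(value):
        return os.path.expandvars(os.path.expanduser(value))
    return value


# MODEL_DIR is an illustrative variable name, set here only for the demo.
os.environ["MODEL_DIR"] = "/opt/models"
print(expand_model_ref("$MODEL_DIR/minilm"))
print(expand_model_ref("sentence-transformers/all-MiniLM-L6-v2"))
print(expand_model_ref("models/minilm"))
```

The last call shows the ambiguity noted in the YAML comments: `models/minilm` has hub-id shape, so it is returned unchanged unless it is written as `./models/minilm`.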
231
- **Tips & gotchas:**
232
-
233
- - **The file must be at `source_root`**, not in `$HOME`. The MCP server reads `JAVA_CODEBASE_RAG_SOURCE_ROOT` to find it; the CLI uses `--source-root` (else cwd).
234
- - **Don't commit secrets** into this YAML — it sits next to your source tree and is read by every operator who clones it.
235
- - **Rebuild after editing brownfield overrides.** Run a full `java-codebase-rag reprocess` (no flags) so Lance and Kuzu stay coherent, or use `--graph-only` / `--vectors-only` when you know only one store needs invalidation. Editing `embedding.model` requires a vector rebuild (`reprocess` or `--vectors-only`).
236
- - **Diagnose what's loaded.** `java-codebase-rag meta` prints the resolved config and each value's `*_source` (`cli` / `env` / `yaml` / `default`) — see `embedding_model_source`, `embedding_device_source`, `index_dir_source`.
237
- - **`embedding.model` and `$` in directory names.** `expandvars` treats `$VAR` / `${VAR}` like the shell. HuggingFace hub ids never contain `$`. If a local filesystem path contains a literal `$` in a directory name, use an absolute path that avoids `$`-expansion patterns, or expect `expandvars` to interpret `$` sequences.
238
-
239
- Deeper documentation for the brownfield blocks (`role_overrides`, `route_overrides`, `http_client_overrides`, `async_producer_overrides`, `cross_service_resolution`) lives in [§7 Brownfield overrides](#7-brownfield-overrides).
240
-
241
- ---
242
-
243
- ## 3. MCP host setup
244
-
245
- ### Claude Code
246
-
247
- **Project scope:** copy `mcp.json.example` to your repo as `.mcp.json`, replace absolute paths, and merge with any existing `mcpServers`.
248
-
249
- **Or via CLI:**
250
-
251
- ```bash
252
- claude mcp add --transport stdio java-codebase-rag -- \
253
- /path/to/java-codebase-rag/.venv/bin/python \
254
- /path/to/java-codebase-rag/server.py
255
- ```
256
-
257
- Set env vars (`JAVA_CODEBASE_RAG_INDEX_DIR`, `JAVA_CODEBASE_RAG_SOURCE_ROOT`, `SBERT_MODEL`, …) in `.mcp.json` or your shell profile. Official docs: [Claude Code settings](https://docs.anthropic.com/en/docs/claude-code/settings).
258
-
259
- ### Claude Desktop
260
-
261
- Edit `claude_desktop_config.json` (macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`) and add an entry under `mcpServers` with the same `command`, `args`, and `env` as in `mcp.json.example`.
262
-
263
- ### Driving the MCP from an agent
264
-
265
- - **[`docs/AGENT-GUIDE.md`](./docs/AGENT-GUIDE.md)** — standalone MCP operating manual (copy-paste into `QWEN.md` / `CLAUDE.md` / `AGENTS.md`): five tools, `NodeFilter`, edge taxonomy, required `neighbors` arguments, ontology glossary, recovery playbook, slash-style aliases. No CLI or repo-doc dependencies inside the copy block.
266
- - **[`docs/skills/java-codebase-explore.md`](./docs/skills/java-codebase-explore.md)** — exploration **strategy** (missions, fallbacks, anti-capabilities, stopping rules); AGENT-GUIDE remains the **operating manual** for tool shapes and recovery.
267
- - **[`docs/MANUAL-VERIFICATION-CHECKLIST.md`](./docs/MANUAL-VERIFICATION-CHECKLIST.md)** — 7-phase agent-driven verification you run after indexing your real project. Each item has a copy-paste prompt and calibration data from `tests/bank-chat-system`.
268
- - **[`automation/cursor_propose_only/README.md`](./automation/cursor_propose_only/README.md)** — optional proposal orchestration workflow (single-command autopilot, planning bundles, and automated execution/review loops).
269
-
270
- ---
271
-
272
- ## 4. MCP tool reference
273
-
274
- | Tool | Purpose | Args | Example |
275
- |---|---|---|---|
276
- | `search` | Locate nodes by NL/code text. | `query: str`, `table: str="java"`, `hybrid: bool=False`, `limit: int=5`, `offset: int=0`, `path_contains: str \| None`, `filter: NodeFilter \| str \| None` | `{"query":"join operator flow","limit":5}` |
277
- | `find` | Locate nodes by structured filter. | `kind: "symbol"\|"route"\|"client"\|"producer"`, `filter: NodeFilter \| str`, `limit: int=25`, `offset: int=0` | `{"kind":"symbol","filter":{"role":"CONTROLLER"}}` |
278
- | `describe` | Full record + edge counts for one node. For **type** symbols, `edge_summary` may include composed dot-keys (`DECLARES.DECLARES_CLIENT`, `DECLARES.EXPOSES`); for **method** symbols it may include override-axis virtual keys (`OVERRIDDEN_BY`, …) and an `OVERRIDES` row that **merges** stored `[:OVERRIDES]` in/out with the dispatch-up rollup (per direction `max`). See [`docs/AGENT-GUIDE.md`](./docs/AGENT-GUIDE.md) (`describe`). | `id: str` | `{"id":"sym:com.bank.chat.core.api.ChatController#joinOperator(JoinOperatorRequest)"}` |
279
- | `resolve` | Identifier-shaped node lookup (symbol / route / client / producer). Returns `status` `one`, `many`, or `none`; prefer over `describe(fqn=…)` when an FQN may collide. See [`docs/AGENT-GUIDE.md`](./docs/AGENT-GUIDE.md) (`resolve`). | `identifier: str`, `hint_kind: "symbol"\|"route"\|"client"\|"producer" \| null` | `{"identifier":"com.bank.chat.core.api.ChatController","hint_kind":"symbol"}` |
280
- | `neighbors` | Graph walk. **Required**: `direction` and `edge_types` (stored labels; type Symbols may pass composed `DECLARES.*`; non-static method Symbols may pass `OVERRIDDEN_BY*` — `out` only — see [`docs/AGENT-GUIDE.md`](./docs/AGENT-GUIDE.md)). | `ids: str \| list[str]`, `direction: "in"\|"out"`, `edge_types: list[str]`, `limit: int=25`, `offset: int=0`, `filter: NodeFilter \| str \| None`, `edge_filter: EdgeFilter \| str \| None` (`CALLS` only; see guide) | `{"ids":"sym:…ChatController","direction":"out","edge_types":["DECLARES.DECLARES_CLIENT"]}` |
281
-
282
- **`NodeFilter` notes:**
283
-
284
- - `filter` is a JSON object matching the `NodeFilter` schema. Wire types are `object` or, as a fallback, a JSON-encoded string for clients that flatten objects.
285
- - Unknown filter keys and populated fields that are not applicable to the effective node kind fail loudly with `success=false` and a teaching `message` (no silent key dropping).
286
- - For `neighbors`, mixed-kind neighborhoods fail on the first evaluated neighbor row whose kind makes populated filter fields inapplicable.
287
- - Symbol-only keys: `symbol_kind` (single value) and `symbol_kinds` (set membership) for declaration granularity (`class`, `interface`, `enum`, `record`, `annotation`, `method`, `constructor`).
288
- - `find(kind="symbol", ...)` results include `symbol_kind` so callers can see declaration granularity without a follow-up `describe`.
289
- - For `find`, an empty / whitespace-only filter string or the JSON literal `null` is treated like `{}` (match anything).
290
-
291
- Example:
292
-
293
- ```json
294
- {"kind":"symbol","filter":{"microservice":"chat-core","symbol_kind":"interface"}}
295
- ```
296
-
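The wire-type notes above (an object, or a JSON-encoded string as a fallback, with blank strings and the literal `null` treated like `{}` for `find`) can be illustrated with a small client-side normaliser. `normalize_filter` is a hypothetical name, and mapping Python `None` to `{}` is an assumption; the server's own validation of unknown or inapplicable keys is not reproduced here.

```python
import json
from typing import Any, Dict, Union


def normalize_filter(raw: Union[Dict[str, Any], str, None]) -> Dict[str, Any]:
    """Turn an object, a JSON-encoded string, or an empty value into a dict."""
    if raw is None:
        return {}  # assumption: an omitted filter matches anything
    if isinstance(raw, dict):
        return raw
    text = raw.strip()
    if not text or text == "null":
        return {}  # blank string or JSON null behaves like {}
    decoded = json.loads(text)
    if not isinstance(decoded, dict):
        raise ValueError("filter must decode to a JSON object")
    return decoded


# A client that flattens objects may send the filter as a string.
request = {
    "kind": "symbol",
    "filter": normalize_filter('{"microservice":"chat-core","symbol_kind":"interface"}'),
}
print(request)
```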
297
- **MCP v2 response extras (`hints`, pagination echo):** On success, `search`, `find`, `describe`, `neighbors`, and `resolve` return a `hints` field (`list[str]`, capped at five unique strings) with short, templated suggestions for likely next tool calls; hints are advisory. `hints` is always empty when `success` is false. `resolve` additionally echoes `resolved_identifier` (post-validation trimmed identifier) on every `success=true` response; it is `null` when `success` is false. Resolve hints fire only on `status: none` or `status: many` (not on `status: one`). `search` and `find` additionally echo the request’s `limit` and `offset` on success; on failure those echoed fields are omitted (`null` in JSON). The find page-full hint fires only when another page may exist (handler over-fetches by one row; not exposed on the output model). `neighbors` echoes `requested_edge_types` (deduped edge labels from the request) on success; empty results with non-empty `edge_types` may emit kind- and direction-aware structural hints driven by `EDGE_SCHEMA` (see [`propose/completed/HINTS-V3-PROPOSE.md`](./propose/completed/HINTS-V3-PROPOSE.md)); when any result edge carries a brownfield/fallback `attrs.strategy` (see `FUZZY_STRATEGY_SET` in `java_ontology.py`), a single meta-tier fuzzy-strategy hint may also appear on non-empty results. See [`propose/completed/HINTS-ROAD-SIGNS-PROPOSE.md`](./propose/completed/HINTS-ROAD-SIGNS-PROPOSE.md) Appendix A for the locked v1 template catalog; see [`propose/HINTS-V2-PROPOSE.md`](./propose/HINTS-V2-PROPOSE.md) for v2 additions (`resolve` rules and neighbors fuzzy-strategy hint).
298
-
299
- ---
300
-
301
- ## 5. CLI reference (`java-codebase-rag`)
302
-
303
- Operator playbook with workflows, exit codes, and env alignment: [`docs/JAVA-CODEBASE-RAG-CLI.md`](./docs/JAVA-CODEBASE-RAG-CLI.md).
304
-
305
- Run `java-codebase-rag --help` to list grouped subcommands (lifecycle / introspection / analysis). Output mode is automatic: JSON when piped, pretty text in a TTY. Module entrypoint: `python -m java_codebase_rag.cli`. Lifecycle commands (`init`, `increment`, `reprocess`, `erase`) stream subprocess progress to **stderr** (including any child stdout the tool relays); **`--quiet`** suppresses that human channel; **stdout** remains the machine-readable contract (JSON or pprint).
306
-
307
- Shared flags on all subcommands: `--source-root`, `--index-dir`, `--embedding-model`, `--embedding-device` (each optional; see the CLI guide for precedence).
308
-
309
- | Group | Subcommand | Role |
310
- |---|---|---|
311
- | Lifecycle | `init` | First-time index; refuses if the index dir already has artifacts. |
312
- | Lifecycle | `increment` | CocoIndex catch-up (Lance only); prints a stderr warning that Kuzu is unchanged until `reprocess`. |
313
- | Lifecycle | `reprocess` | Default: full Lance reprocess + full Kuzu rebuild. Optional `--vectors-only` / `--graph-only` (mutually exclusive) for a single phase. |
314
- | Lifecycle | `erase` | Deletes index artifacts; requires `--yes` or interactive TTY confirm. |
315
- | Introspection | `meta`, `tables`, `diagnose-ignore` | Health, table listing, ignore-layer diagnostics. |
316
- | Analysis | `analyze-pr` | Blast-radius / risk from a unified diff. |
317
-
318
- The hidden alias **`refresh`** invokes **`reprocess`** (prefer **`reprocess`** in new scripts).
319
-
320
- Examples:
321
-
322
- ```bash
323
- java-codebase-rag init --source-root /path/to/java/repo --index-dir /path/to/.java-codebase-rag --quiet
324
- java-codebase-rag reprocess --source-root /path/to/java/repo --index-dir /path/to/.java-codebase-rag --quiet
325
- java-codebase-rag meta --source-root /path/to/java/repo --index-dir /path/to/.java-codebase-rag | .venv/bin/python -c "import json,sys; print(json.loads(sys.stdin.read())['edge_counts'])"
326
- java-codebase-rag diagnose-ignore .git/HEAD --source-root /path/to/java/repo
327
- java-codebase-rag analyze-pr --diff-file /tmp/pr.diff --source-root /path/to/java/repo --index-dir /path/to/.java-codebase-rag
328
- ```
329
-
330
- ### `analyze-pr` output shape
331
-
332
- Pass the same unified diff text you would feed to `patch` (e.g. `git diff` output). Paths in the diff should match project-relative `Symbol.filename` values in the graph (e.g. `chat-assign/src/main/java/.../ChatManagementService.java`). A one-line edit returns:
333
-
334
- ```json
335
- {
336
- "success": true,
337
- "changed_symbols": [
338
- {
339
- "symbol_id": "<opaque>",
340
- "fqn": "com.bank.chat.assign.service.ChatManagementService#assign(AssignmentRequest)",
341
- "kind": "method",
342
- "change_type": "modified",
343
- "file": "chat-assign/src/main/java/com/bank/chat/assign/service/ChatManagementService.java",
344
- "hunk_lines": [48, 49, 50, 51, 52]
345
- }
346
- ],
347
- "blast_radius_total": 2,
348
- "blast_radius_by_symbol": { "<opaque>": 1 },
349
- "cross_service_callers": 0,
350
- "routes_touched": [],
351
- "risk_score": 0.008,
352
- "risk_band": "low",
353
- "notes": []
354
- }
355
- ```
356
-
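A report of the shape shown above can be condensed for a CI log or review comment. The sketch below relies only on the field names visible in the sample; `summarize_analysis` is a hypothetical helper, and the one-line wording is an arbitrary choice.

```python
import json


def summarize_analysis(payload: str) -> str:
    """Condense an analyze-pr JSON report into a one-line summary."""
    report = json.loads(payload)
    if not report.get("success"):
        return "analysis failed"
    changed = report["changed_symbols"]
    return (
        f"{report['risk_band']} risk (score {report['risk_score']}): "
        f"{len(changed)} changed symbol(s), "
        f"blast radius {report['blast_radius_total']}, "
        f"{report['cross_service_callers']} cross-service caller(s)"
    )


# Trimmed copy of the sample report, used as demo input.
sample = json.dumps({
    "success": True,
    "changed_symbols": [{"fqn": "com.bank.chat.assign.service.ChatManagementService#assign(AssignmentRequest)"}],
    "blast_radius_total": 2,
    "cross_service_callers": 0,
    "risk_score": 0.008,
    "risk_band": "low",
})
print(summarize_analysis(sample))
```

Because the CLI emits JSON when piped, the same function could read `sys.stdin` in place of the inline sample.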
357
- ### Manual search
358
-
359
- `--model` defaults to the value of `SBERT_MODEL`, with the same path-shaped `~` / `$VAR` expansion as MCP and `java-codebase-rag` config. Omit `--model` to use the env default; pass a hub id or local path explicitly when needed.
360
-
361
- ```bash
362
- # Vector
363
- JAVA_CODEBASE_RAG_INDEX_DIR=/path/to/.java-codebase-rag .venv/bin/python search_lancedb.py "rate limit" --table java --limit 2
364
-
365
- # Graph-expanded (requires the Kuzu DB to exist)
366
- JAVA_CODEBASE_RAG_INDEX_DIR=/path/to/.java-codebase-rag .venv/bin/python search_lancedb.py "rate limit" \
367
- --table java --limit 5 --graph-expand --expand-depth 2
368
-
369
- # Role-filtered
370
- JAVA_CODEBASE_RAG_INDEX_DIR=/path/to/.java-codebase-rag .venv/bin/python search_lancedb.py "place order" --table java --role CONTROLLER
371
-
372
- # With surrounding context (1 chunk before + 1 chunk after)
373
- JAVA_CODEBASE_RAG_INDEX_DIR=/path/to/.java-codebase-rag .venv/bin/python search_lancedb.py "chat assignment" \
374
- --table java --limit 3 --context-neighbors 1
375
- ```
376
-
377
- ### Building the graph standalone
378
-
379
- `java-codebase-rag reprocess` (default, no flags) runs `cocoindex update` with a full reprocess flag, then invokes `build_ast_graph.py` to rebuild Kuzu under the resolved index directory. For a **graph-only** rebuild from the CLI, prefer `java-codebase-rag reprocess --graph-only` (see [`docs/JAVA-CODEBASE-RAG-CLI.md`](./docs/JAVA-CODEBASE-RAG-CLI.md)). To invoke the graph builder directly:
380
-
381
- ```bash
382
- # Scan the current working directory
383
- .venv/bin/python build_ast_graph.py --verbose
384
-
385
- # Or point at a specific repo root and graph path
386
- .venv/bin/python build_ast_graph.py --source-root /path/to/repo --kuzu-path /path/to/.java-codebase-rag/code_graph.kuzu --verbose
387
- ```
388
-
389
- If `--source-root` is omitted, the current working directory is used. The MCP server resolves the Java tree from `JAVA_CODEBASE_RAG_SOURCE_ROOT` when set, otherwise cwd.
390
-
391
- For `reprocess`, the pipeline runs `cocoindex` with `cwd` set to the bundle directory (so Python imports resolve), but passes the resolved Java tree root and index dir to the subprocess so indexing targets your project. The Kuzu DB is dropped and rebuilt from scratch on each full reprocess; graph-side incremental rebuilds are future work ([`propose/TIER2-INCREMENTAL-REBUILD-PROPOSE.md`](./propose/TIER2-INCREMENTAL-REBUILD-PROPOSE.md)).
392
-
393
- ---
394
-
395
- ## 6. Graph layer
396
-
397
- A deterministic property graph derived from tree-sitter Java parsing lives next to the LanceDB tables under the index directory (default `${JAVA_CODEBASE_RAG_INDEX_DIR:-./.java-codebase-rag}/code_graph.kuzu`). Current ontology version: **15** (see [`docs/EDGE-NAVIGATION.md`](./docs/EDGE-NAVIGATION.md) for MCP-traversable edge shapes).
398
-
399
- ### Node kinds
400
-
401
- | Kind | Examples |
402
- |---|---|
403
- | `Symbol` | `package`, `file`, `class`, `interface`, `enum`, `record`, `annotation`, `method`, `constructor` |
404
- | `Route` | HTTP endpoint or async listener (one row per declared route) |
405
- | `Client` | Outbound HTTP / messaging call site |
406
- | `UnresolvedCallSite` | Receiver-failure call site (`chained_receiver`, `phantom_unresolved_receiver`) — not a `Symbol`; ids use the `ucs:` prefix |
407
-
408
- Known-receiver-external JDK / Spring / Lombok callees stay on **`CALLS`** as phantom **method** symbols (`resolved=false`). Receiver-failure sites (unresolved receiver or chained receiver) are **`UnresolvedCallSite`** nodes linked by **`UNRESOLVED_AT`** (not in `EDGE_SCHEMA`; use `describe(method_id).unresolved_call_sites`, `neighbors(..., include_unresolved=True)`, or `java-codebase-rag unresolved-calls`).
409
-
410
- ### Edge types (MCP-traversable)
411
-
412
- | Edge | Direction | Meaning |
413
- |---|---|---|
414
- | `EXTENDS` | type → type | Class- or interface-inheritance. |
415
- | `IMPLEMENTS` | type → interface | Interface implementation. |
416
- | `INJECTS` | type → type | DI: field, constructor, or setter injection (incl. Lombok). |
417
- | `DECLARES` | type → method/constructor | Type declares a callable. |
418
- | `OVERRIDES` | method → method | Subtype instance method overrides a supertype-declared method (same `signature`, one supertype hop via `IMPLEMENTS` / `EXTENDS`). |
419
- | `DECLARES_CLIENT` | type → client | Type declares an outbound call site. |
420
- | `CALLS` | method → method | In-process call (confidence-scored, strategy-tagged). |
421
- | `EXPOSES` | type → route | Type exposes an HTTP/async route. |
422
- | `HTTP_CALLS` | client → route | Cross-service HTTP call (caller-side Client to target Route). |
423
- | `ASYNC_CALLS` | producer → route | Cross-service async (Kafka, Rabbit, JMS, …). |
424
-
425
- Caller/callee traversals default to `exclude_external=true` on **`find_callers`** so library FQN prefixes are filtered without dropping edges from the graph.
426
-
427
- ### Call-graph notes
428
-
429
- - Receiver typing uses **one scope map per method** (locals shadow fields/parameters), but **not** full nested-block lexical scope. See `CODEBASE_REQUIREMENTS.md` → *Call graph*.
430
- - **Anonymous classes** (`new T() { … }`) are indexed as synthetic nested types (`…<anon:startByte>`); `CALLS` from their methods use that member as the caller so inbound-call traversal reaches the handler body.
431
- - **Lambdas** still attribute inner calls to the enclosing named method (no synthetic callable symbol).
432
- - Unqualified calls from anonymous members fall through to the lexically enclosing type for callee lookup (matches Java compiler scoping).
433
-
434
- ### Injection mechanisms detected
435
-
436
- - Field `@Autowired` / `@Inject` / `@Resource`
437
- - Constructor injection (Spring single-ctor rule and explicit `@Autowired`)
438
- - Setter `@Autowired`
439
- - Lombok `@RequiredArgsConstructor` (final fields) and `@AllArgsConstructor` (all non-static)
440
-
441
- ### Chunk enrichment (Lance)
442
-
443
- Java chunk rows are enriched with `package`, `module`, `microservice`, `primary_type_fqn`, `primary_type_kind`, `role`, `capabilities`, `annotations_on_type`, `symbols`, `ontology_version`. `role` and `capabilities` are inferred in `ast_java` / `graph_enrich`.
444
-
445
- ### `module` vs `microservice`
446
-
447
- Two location fields are tracked per Java symbol / chunk:
448
-
449
- - **`module`** — the *innermost* build-marker (`pom.xml`, `build.gradle`, `build.gradle.kts`, `build.sbt`) ancestor's directory name. (Legacy `service` field, renamed.)
450
- - **`microservice`** — the *outermost* build-marker ancestor under the resolved Java tree root. For a single-module project both equal the same name; for a multi-module reactor (e.g. `chat-core/{chat-app,chat-engine,...}`) every child collapses to `microservice='chat-core'` while keeping its own `module='chat-app'`.
451
-
452
- Resolution order for `microservice`:
453
-
454
- 1. Explicit override list — `microservice_roots: [foo, bar]` in `.java-codebase-rag.yml` at the project root (YAML-only).
455
- 2. Outermost build marker between `project_root` and the file.
456
- 3. First path segment under `project_root`.
457
- 4. `""` if nothing matches.
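The resolution order above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the indexer's actual code: `resolve_location` and `_marker_dirs` are hypothetical names, only directories strictly below `project_root` are inspected for build markers, and `microservice_roots` entries are assumed to match directory names on the file's path.

```python
from pathlib import Path

# Build markers named in the README; the constant name itself is hypothetical.
BUILD_MARKERS = ("pom.xml", "build.gradle", "build.gradle.kts", "build.sbt")


def _marker_dirs(project_root: Path, java_file: Path) -> list:
    """Directories on the path to java_file that contain a build marker, outermost first."""
    found = []
    current = project_root
    for part in java_file.relative_to(project_root).parts[:-1]:
        current = current / part
        if any((current / marker).is_file() for marker in BUILD_MARKERS):
            found.append(current)
    return found


def resolve_location(project_root: Path, java_file: Path, microservice_roots=()):
    """Return (module, microservice) following the four-step order described above."""
    rel_parts = java_file.relative_to(project_root).parts
    markers = _marker_dirs(project_root, java_file)
    module = markers[-1].name if markers else ""  # innermost build marker

    # 1. Explicit override list (assumed to match a directory name on the path).
    for part in rel_parts[:-1]:
        if part in microservice_roots:
            return module, part
    # 2. Outermost build marker between project_root and the file.
    if markers:
        return module, markers[0].name
    # 3. First path segment under project_root.
    if len(rel_parts) > 1:
        return module, rel_parts[0]
    # 4. Nothing matched.
    return module, ""
```

For the `chat-core/{chat-app,...}` reactor mentioned above, a file under `chat-core/chat-app/` would come back as `("chat-app", "chat-core")` in this sketch.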
458
-
459
- ### Re-index required when ontology changes
460
-
461
- Current ontology version is **15**. Any index built before this version must be rebuilt via `cocoindex update ... --full-reprocess -f` or a full `java-codebase-rag reprocess` (no selective flags) so vectors and graph stay aligned. Until re-indexed, the server defensively JSON-decodes string-form list columns so nothing explodes, but filters like `array_contains` will not work.
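As a rough illustration of the defensive decoding mentioned above, the sketch below normalises a column value that may arrive either as a real list or as a JSON-encoded string. The helper name `as_str_list` is hypothetical, and the fallback rules (a bare string becomes a one-element list) are assumptions rather than the server's documented behaviour.

```python
import json


def as_str_list(value):
    """Normalise a list-typed column that older indexes may have stored as a JSON string."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    if isinstance(value, str):
        text = value.strip()
        if not text:
            return []
        try:
            decoded = json.loads(text)
        except json.JSONDecodeError:
            # Not JSON at all: treat the raw string as a single entry (assumption).
            return [text]
        if isinstance(decoded, list):
            return [str(item) for item in decoded]
        return [str(decoded)]
    return [str(value)]
```

Decoding on read keeps results readable, but it does not help storage-level filters such as `array_contains`, which is why a rebuild is still required.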
462
-
463
- Ontology **15** (CALLS-NOISE) adds `CALLS.callee_declaring_role`, `GraphMeta.pass3_unresolved_phantom_receiver` / `pass3_unresolved_chained`, and **supertype-walk dedup** at build time. PR-2 adds `edge_filter` on `neighbors`. **PR-3 (breaking):** receiver-failure sites (`chained_receiver`, unresolved-receiver `phantom`) are no longer `CALLS` rows — they live on `UnresolvedCallSite` + `UNRESOLVED_AT`. Default `neighbors(..., ['CALLS'])` returns fewer rows; use `include_unresolved=True` for a source-ordered interleaved transcript (`row_kind`), `describe(method_id).unresolved_call_sites` (capped), or `java-codebase-rag unresolved-calls list|stats`. Known-receiver-external JDK rows stay on `CALLS` with `resolved=false`.
464
-
465
- Ontology **14** introduces `EDGE_SCHEMA` in `java_ontology.py` as the canonical edge navigation schema (see `docs/EDGE-NAVIGATION.md`). **`HTTP_CALLS` is `Client → Route`** (SCHEMA-V2 PR-B). **`ASYNC_CALLS` is `Producer → Route`** with `DECLARES_PRODUCER` (SCHEMA-V2 PR-C). Run one full reprocess after upgrading through the SCHEMA-V2 sequence (or when you need the v14 ontology gate).
466
-
467
- Ontology **13** materializes stored `OVERRIDES` edges between method Symbols (subtype override → supertype declaration, matching `signature` on a direct `IMPLEMENTS` / `EXTENDS` hop). `neighbors(edge_types=["OVERRIDES"])` traverses this relationship; `OVERRIDDEN_BY*` dot-keys in `edge_summary` are also navigable on method Symbol origins (`out` only).
468
-
469
- Ontology **12** renames `@CodebaseClient` to `@CodebaseHttpClient`, types HTTP `method` as the shared `CodebaseHttpMethod` enum on both inbound and outbound stubs, and makes inbound layer-C HTTP routes **replace** same-method built-in Spring rows (no merge). Rebuild after upgrading so `meta_chain` keys and annotation simple names match the extractor.
470
-
471
- ### Capabilities
472
-
473
- In addition to the single primary `role` per Java type, the indexer extracts a multi-tag `capabilities: list[str]` field from method-level annotations, type-level annotations, injected types, and supertypes. A type can carry zero or many capabilities. Capabilities never *replace* the role; they augment it.
474
-
475
- | Capability | Trigger |
476
- |---|---|
477
- | `MESSAGE_LISTENER` | `@KafkaListener`, `@RabbitListener`, `@JmsListener`, `@SqsListener`, `@EventListener`, `@StreamListener` on any method. |
478
- | `MESSAGE_PRODUCER` | Type injects `KafkaTemplate`, `RabbitTemplate`, `JmsTemplate`, `StreamBridge`, or `ApplicationEventPublisher`. |
479
- | `HTTP_CLIENT` | Type has `@FeignClient`. |
480
- | `SCHEDULED_TASK` | `@Scheduled` on any method, or class implements `org.quartz.Job`. |
481
- | `EXCEPTION_HANDLER` | `@ControllerAdvice`, `@RestControllerAdvice`, or any method with `@ExceptionHandler`. |
482
-
483
- Use `find(kind="symbol", filter={"capability":"..."})` to enumerate types carrying a capability. Use `search(..., filter={"capability":"..."})` or `neighbors(..., filter={"capability":"..."})` for capability-aware narrowing.
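The trigger table can be read as a small rule set. The sketch below encodes it directly; `infer_capabilities` and its parameters are hypothetical names, annotations are assumed to be given as simple names, and supertypes as fully qualified names. It is a reading of the table, not the extractor's implementation.

```python
LISTENER_ANNOTATIONS = {
    "KafkaListener", "RabbitListener", "JmsListener",
    "SqsListener", "EventListener", "StreamListener",
}
PRODUCER_TYPES = {
    "KafkaTemplate", "RabbitTemplate", "JmsTemplate",
    "StreamBridge", "ApplicationEventPublisher",
}
ADVICE_ANNOTATIONS = {"ControllerAdvice", "RestControllerAdvice"}


def infer_capabilities(type_annotations, method_annotations, injected_types, supertypes):
    """Return the sorted capability tags for one type, following the trigger table."""
    type_annotations = set(type_annotations)
    method_annotations = set(method_annotations)
    caps = set()
    if method_annotations & LISTENER_ANNOTATIONS:
        caps.add("MESSAGE_LISTENER")
    if set(injected_types) & PRODUCER_TYPES:
        caps.add("MESSAGE_PRODUCER")
    if "FeignClient" in type_annotations:
        caps.add("HTTP_CLIENT")
    if "Scheduled" in method_annotations or "org.quartz.Job" in set(supertypes):
        caps.add("SCHEDULED_TASK")
    if type_annotations & ADVICE_ANNOTATIONS or "ExceptionHandler" in method_annotations:
        caps.add("EXCEPTION_HANDLER")
    return sorted(caps)
```

A type with no matching trigger simply gets an empty list, which matches the "zero or many" wording above.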
484
-
485
- ### Ranking
486
-
487
- Java hits are reweighted after vector / hybrid scoring by their `role`:
488
-
489
- | Role | Weight |
490
- |---|---|
491
- | `CONTROLLER` | +0.10 |
492
- | `SERVICE` | +0.08 |
493
- | `CLIENT` | +0.06 |
494
- | `COMPONENT` | +0.03 |
495
- | `REPOSITORY` | +0.02 |
496
- | `MAPPER` / `OTHER` | 0 |
497
- | `ENTITY` | -0.06 |
498
- | `CONFIG` | -0.10 |
499
-
500
- This favours orchestrators / entrypoints / integrations over configuration and schema chunks for *what happens when…*-style queries, while keeping repositories and entities reachable. Weights are **skipped** when you pass an explicit `role=` filter; the per-row breakdown is surfaced in `score_components`.
501
-
502
- On top of role weights, Java chunks receive a **symbol-match bonus** (exposed as `score_components.symbol_bonus`). Three additive components, all capped:
503
-
504
- 1. **Method / field overlap** — each declared symbol whose tokens overlap the query earns `+0.03` (capped at `+0.06`).
505
- 2. **Action-verb bump** — chunks declaring a method whose name begins with an action verb (`process`, `handle`, `on`, `pick`, `select`, `assign`, `notify`, `dispatch`, `publish`, `consume`, `route`, `trigger`, `enqueue`, `distribute`, …) get a flat `+0.02`.
506
- 3. **Type-name overlap** — strongest single lexical signal: when the simple name of `primary_type_fqn` shares tokens with the query, each overlap hit earns `+0.05` (capped at `+0.10`).
507
-
508
- Combined, these pull `processClientMessage` / `pickEligibleOperator` / `onOperatorAssigned` chunks — and the classes that own them — above ones that only enqueue or configure. Like role weights, the bonus is **skipped when the caller locks `role=`**.
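To make the arithmetic concrete, here is a minimal sketch of the role weight and symbol-match bonus described in this section. The weights and caps come from the tables above; everything else is an assumption made for illustration: the function names, the camelCase tokenizer, the way overlap hits are counted, and treating every entry in `symbols` as a candidate method name for the action-verb check.

```python
import re

ROLE_WEIGHTS = {
    "CONTROLLER": 0.10, "SERVICE": 0.08, "CLIENT": 0.06, "COMPONENT": 0.03,
    "REPOSITORY": 0.02, "MAPPER": 0.0, "OTHER": 0.0, "ENTITY": -0.06, "CONFIG": -0.10,
}
# Only the verbs listed explicitly in the README; the real list is longer.
ACTION_VERBS = {
    "process", "handle", "on", "pick", "select", "assign", "notify",
    "dispatch", "publish", "consume", "route", "trigger", "enqueue", "distribute",
}


def tokens(name):
    """Split camelCase or free text into lowercase tokens (assumed tokenizer)."""
    return {t.lower() for t in re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", name)}


def first_token(name):
    """Leading lowercase run of an identifier, e.g. 'on' for 'onOperatorAssigned'."""
    match = re.match(r"[a-z0-9]+", name)
    return match.group(0) if match else ""


def rerank_adjustment(query, role, symbols, primary_type_fqn, role_filter_locked=False):
    """Return the two additive adjustments applied after vector / hybrid scoring."""
    if role_filter_locked:
        # Both adjustments are skipped when the caller passes an explicit role filter.
        return {"role_weight": 0.0, "symbol_bonus": 0.0}
    q = tokens(query)
    overlapping = sum(1 for symbol in symbols if tokens(symbol) & q)
    member_bonus = min(0.03 * overlapping, 0.06)
    verb_bonus = 0.02 if any(first_token(s) in ACTION_VERBS for s in symbols) else 0.0
    simple_name = primary_type_fqn.rsplit(".", 1)[-1]
    type_bonus = min(0.05 * len(tokens(simple_name) & q), 0.10)
    return {
        "role_weight": ROLE_WEIGHTS.get(role, 0.0),
        "symbol_bonus": round(member_bonus + verb_bonus + type_bonus, 4),
    }
```

Under these assumptions, the query "process client message" against a `SERVICE` type named `ClientMessageService` that declares `processClientMessage` would receive `+0.08` from its role and `+0.15` from the symbol bonus (`0.03 + 0.02 + 0.10`).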
509
-
510
- ### Debugging empty `context_before` / `context_after`
511
-
512
- If `context_neighbors=1` returns empty context strings, set `JAVA_CODEBASE_RAG_DEBUG_CONTEXT=1` in the MCP server env before launching. The server logs (to stderr) why expansion bailed: missing schema columns, empty bucket scan, chunk not found in bucket, or underlying scan error. Typical causes are (a) a stale server that hasn't reloaded after a reindex, or (b) an index missing `range_start` / `range_end` columns — the code falls back to exact-text matching, so re-running fixes it.
513
-
514
- ---
515
-
516
- ## 7. Brownfield overrides
517
-
518
- For Spring-centric defaults that don't match your tree (custom wrapper stereotypes, non-Spring stacks, vendored code), you can steer `role`, `capabilities`, routes, and clients without forking the indexer. Three layers, in priority order:
519
-
520
- 1. **Config** — `.java-codebase-rag.yml` at the project root.
521
- 2. **Meta-annotation walk** — automatic discovery of `@interface` chains in your source.
522
- 3. **Source stubs** — copy `@CodebaseRole`, `@CodebaseCapability`, `@CodebaseHttpRoute`, `@CodebaseAsyncRoute`, `@CodebaseHttpClient`, `@CodebaseProducer` definitions into any package.
523
-
524
- ### 7.1 Config: `role_overrides`, `route_overrides`
525
-
526
- `.java-codebase-rag.yml` at the project root (same file as `microservice_roots`). `role_overrides` maps annotation simple names and/or per-type FQNs to roles and capabilities:
527
-
528
- ```yaml
529
- microservice_roots: []
530
-
531
- role_overrides:
532
-   annotations:
533
-     AcmeService: SERVICE
534
-     CompanyController: CONTROLLER
535
-   capabilities:
536
-     CompanyKafkaTopic: [MESSAGE_LISTENER]
537
-     AcmeBatch: [SCHEDULED_TASK]
538
-   fqn:
539
-     com.legacy.OrderProcessor:
540
-       role: SERVICE
541
-       capabilities: [MESSAGE_LISTENER]
542
-     com.acme.payments.PaymentEventBus:
543
-       capabilities: [MESSAGE_PRODUCER]
544
- ```
545
-
546
- Unknown role or capability strings are ignored with a warning on load.
547
-
548
- `@FeignClient` interfaces auto-attach `role=CLIENT` and `capability=HTTP_CLIENT`. For `RestTemplate` / `WebClient` wrappers, opt in explicitly with `@CodebaseRole(CodebaseRoleKind.CLIENT)` and `@CodebaseCapability(CodebaseCapabilityKind.HTTP_CLIENT)`.
549
-
550
- `route_overrides` maps custom annotation names (or suffixes such as `com.acme.Foo` when usage sites show only `Foo`) and per-type FQNs to `Route` fields for methods that don't otherwise resolve from Spring / Feign / messaging built-ins:
551
-
552
- ```yaml
553
- route_overrides:
554
-   annotations:
555
-     ann.AcmeRoute:
556
-       framework: spring_mvc
557
-       kind: http_endpoint
558
-       method: GET
559
-       path: /acme
560
-   fqn:
561
-     com.legacy.UserApi:
562
-       framework: spring_mvc
563
-       kind: http_endpoint
564
-       path: /legacy/users
565
- ```
566
-
567
- Unknown `framework` / `kind` strings are dropped with a stderr warning.
568
-
569
- ### 7.2 Cross-service resolution mode
570
-
571
- Optional top-level key in the same YAML file:
572
-
573
- ```yaml
574
- cross_service_resolution: auto # default when omitted
575
- # cross_service_resolution: brownfield_only
576
- ```
577
-
578
- With `brownfield_only`, the resolver does **not** promote auto-detected call sites to `cross_service` matches: only edges where both the caller strategy and every matched route's `source_layer` come from brownfield (`@CodebaseHttpRoute` / `@CodebaseAsyncRoute`, `@CodebaseHttpClient`, YAML overrides, meta-annotation closure, or FQN maps) stay `cross_service`. Everything else that would have been a cross-service match becomes `unresolved`. `intra_service`, `phantom`, and `ambiguous` behaviour is unchanged. Unknown values log a warning and behave like `auto`.
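The downgrade rule for `brownfield_only` can be summarised as a small decision function. This is a sketch of the behaviour described above, not the resolver itself: `final_match_kind`, its parameters, and the `"brownfield"` layer label are hypothetical, and the warning that the README mentions for unknown values is omitted.

```python
BROWNFIELD = "brownfield"  # hypothetical label for any brownfield-sourced layer


def final_match_kind(match_kind, caller_layer, route_layers, mode="auto"):
    """Apply cross_service_resolution to one already-classified call edge."""
    if mode not in ("auto", "brownfield_only"):
        mode = "auto"  # unknown values behave like auto
    if mode == "brownfield_only" and match_kind == "cross_service":
        all_brownfield = (
            caller_layer == BROWNFIELD
            and bool(route_layers)
            and all(layer == BROWNFIELD for layer in route_layers)
        )
        if not all_brownfield:
            return "unresolved"
    # intra_service, phantom and ambiguous pass through unchanged.
    return match_kind
```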
579
-
580
- Resolution order for each method: built-in extraction → annotation map → meta-annotation closure → in-source `@CodebaseHttpRoute` / `@CodebaseAsyncRoute` → per-type FQN map (last writer wins on overlapping fields). On the same method, `@CodebaseAsyncRoute` replaces built-in `@KafkaListener` extraction so brownfield topic names aren't duplicated alongside SpEL or multi-topic listeners. For HTTP, `@CodebaseHttpRoute` replaces same-method built-in Spring mapping rows (brownfield exclusivity); enable `build_ast_graph.py --verbose` to see `brownfield-exclusivity-shadowing` INFO when framework annotations are bypassed.
581
-
582
- ### 7.3 Source stubs
583
-
584
- If config and meta-annotations aren't enough, copy these `@interface` definitions into any package — **simple-name-only** matching means no Maven dependency on this bundle. Verbatim copies live under `tests/fixtures/brownfield_route_stubs/` and `tests/fixtures/brownfield_client_stubs/` for copy-pasting.
585
-
586
- #### Roles & capabilities (class-level)
587
-
588
- ```java
589
- package com.example.rag; // any package
590
-
591
- import java.lang.annotation.*;
592
-
593
- public enum CodebaseRoleKind {
594
-     CONTROLLER, SERVICE, REPOSITORY, COMPONENT, CONFIG, ENTITY, CLIENT, MAPPER, DTO
595
- }
596
-
597
- public enum CodebaseCapabilityKind {
598
-     MESSAGE_LISTENER, MESSAGE_PRODUCER, HTTP_CLIENT, SCHEDULED_TASK, EXCEPTION_HANDLER
599
- }
600
-
601
- @Target(ElementType.TYPE)
602
- @Retention(RetentionPolicy.SOURCE)
603
- public @interface CodebaseRole { CodebaseRoleKind value(); }
604
-
605
- @Target(ElementType.TYPE)
606
- @Retention(RetentionPolicy.SOURCE)
607
- @Repeatable(CodebaseCapabilities.class)
608
- public @interface CodebaseCapability { CodebaseCapabilityKind value(); }
609
-
610
- @Target(ElementType.TYPE)
611
- @Retention(RetentionPolicy.SOURCE)
612
- public @interface CodebaseCapabilities { CodebaseCapability[] value(); }
613
- ```
614
-
615
- Usage:
616
-
617
- ```java
618
- @CodebaseRole(CodebaseRoleKind.SERVICE)
619
- @CodebaseCapability(CodebaseCapabilityKind.MESSAGE_LISTENER)
620
- @CodebaseCapability(CodebaseCapabilityKind.MESSAGE_PRODUCER)
621
- public class LegacyChatService { /* ... */ }
622
- ```
623
-
624
- > Resolver binds `@CodebaseRole(CodebaseRoleKind.…)`; string-literal `@CodebaseRole("…")` forms are ignored.
625
-
626
- #### Direction matters: inbound vs outbound
627
-
628
- | Direction | Annotation | Purpose |
629
- |---|---|---|
630
- | Inbound | `@CodebaseHttpRoute`, `@CodebaseAsyncRoute` | Declare handlers/listeners your service exposes as `Route` nodes. |
631
- | Outbound | `@CodebaseHttpClient`, `@CodebaseProducer` | Declare call sites/publish sites your service invokes (caller edges). |
632
-
633
- `@FeignClient` declarations are outbound (`clientKind=feign_method`), not inbound `Route` rows.
634
-
635
- #### Routes (method-level, inbound)
636
-
637
- ```java
638
- public enum CodebaseHttpMethod {
639
-     GET, POST, PUT, PATCH, DELETE, HEAD, OPTIONS
640
- }
641
-
642
- @Target(ElementType.METHOD) @Retention(RetentionPolicy.SOURCE)
643
- @Repeatable(CodebaseHttpRoutes.class)
644
- public @interface CodebaseHttpRoute { String path(); CodebaseHttpMethod method(); }
645
-
646
- @Target(ElementType.METHOD) @Retention(RetentionPolicy.SOURCE)
647
- public @interface CodebaseHttpRoutes { CodebaseHttpRoute[] value(); }
648
-
649
- @Target(ElementType.METHOD) @Retention(RetentionPolicy.SOURCE)
650
- @Repeatable(CodebaseAsyncRoutes.class)
651
- public @interface CodebaseAsyncRoute { String topic(); }
652
-
653
- @Target(ElementType.METHOD) @Retention(RetentionPolicy.SOURCE)
654
- public @interface CodebaseAsyncRoutes { CodebaseAsyncRoute[] value(); }
655
- ```
656
-
657
- Usage:
658
-
659
- ```java
660
- @CodebaseHttpRoute(path = "/chat/joinOperator", method = CodebaseHttpMethod.POST)
661
- public Reply joinOperator(Request req) { /* ... */ }
662
-
663
- @CodebaseAsyncRoute(topic = "chat.follow-up")
664
- public void onFollowUp(Event e) { /* ... */ }
665
- ```
666
-
667
- `path` / `method` are required for HTTP routes; `topic` is required for async routes.
668
-
669
- #### Clients & producers (method-level, outbound)
670
-
671
- ```java
672
- public enum CodebaseClientKind { feign_method, rest_template, web_client }
673
-
674
- @Target(ElementType.METHOD) @Retention(RetentionPolicy.SOURCE)
675
- @Repeatable(CodebaseHttpClients.class)
676
- public @interface CodebaseHttpClient {
677
-     CodebaseClientKind clientKind();
678
-     String targetService() default "";
679
-     String path() default "";
680
-     CodebaseHttpMethod method();
681
- }
682
-
683
- @Target(ElementType.METHOD) @Retention(RetentionPolicy.SOURCE)
684
- public @interface CodebaseHttpClients { CodebaseHttpClient[] value(); }
685
-
686
- public enum CodebaseProducerKind { kafka_send, stream_bridge_send }
687
-
688
- @Target(ElementType.METHOD) @Retention(RetentionPolicy.SOURCE)
689
- @Repeatable(CodebaseProducers.class)
690
- public @interface CodebaseProducer {
691
-     CodebaseProducerKind producerKind() default CodebaseProducerKind.kafka_send;
692
-     String topic();
693
- }
694
-
695
- @Target(ElementType.METHOD) @Retention(RetentionPolicy.SOURCE)
696
- public @interface CodebaseProducers { CodebaseProducer[] value(); }
697
- ```
698
-
699
- Usage:
700
-
701
- ```java
702
- @CodebaseHttpClient(
703
-     clientKind = CodebaseClientKind.rest_template,
704
-     targetService = "chat-core",
705
-     path = "/chat/joinOperator",
706
-     method = CodebaseHttpMethod.POST)
707
- public Reply callJoinOperator(Request req) { /* ... */ }
708
-
709
- @CodebaseProducer(
710
-     producerKind = CodebaseProducerKind.kafka_send,
711
-     topic = "chat.follow-up")
712
- public void publishFollowUp(Event e) { /* ... */ }
713
- ```
714
-
715
- Resolution order in code: built-in inference → config annotation maps → meta-annotation walk → `@CodebaseRole` / `@CodebaseCapability` → `role_overrides.fqn` (highest priority for explicit per-type config). Route composition uses the same first-pass index, then `@CodebaseHttpRoute` / `@CodebaseAsyncRoute`, then `route_overrides.fqn`. Rebuild the affected store (`java-codebase-rag reprocess`, or `--vectors-only` / `--graph-only` when appropriate, or `build_ast_graph.py` for graph-only manual runs) after changing overrides.
716
-
717
- ### 7.4 Caller-side overrides
718
-
719
- ```yaml
720
- http_client_overrides:
721
-   annotations:
722
-     ann.LegacyHttpClient:
723
-       client_kind: rest_template
724
-       target_service: chat-core
725
-       path: /chat/joinOperator
726
-       method: POST
727
-   fqn:
728
-     com.legacy.ChatClient:
729
-       client_kind: feign_method
730
-       target_service: chat-core
731
-
732
- async_producer_overrides:
733
-   annotations:
734
-     ann.LegacyEvent:
735
-       client_kind: kafka_send
736
-       topic: chat.follow-up
737
-       broker: ""
738
-   fqn:
739
-     com.legacy.EventBus:
740
-       client_kind: kafka_send
741
-       topic: chat.follow-up
742
- ```
743
-
744
- Unknown `client_kind` values are dropped with a stderr warning. **One intentional divergence** from route layering: if any brownfield layer emits method-level outgoing calls, built-in outgoing calls for that same method are **replaced** (not appended) to avoid double-counting one network call site.
745
-
746
- When a brownfield caller override specifies only part of what built-in detection would produce, missing fields are inherited from built-in — partial overrides are non-destructive (tightening, not replacing). Example: built-in produces `client_kind=rest_template`, `method=GET`, `path=/users/{id}`; an override sets only `path=/users/me`; the final call keeps `client_kind=rest_template` and `method=GET` while changing only the path.
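The inheritance rule in the example above amounts to a field-wise merge. A minimal sketch, assuming call descriptors are plain dictionaries and that `None` or an empty string means "not specified" (both are assumptions; `merge_outgoing_call` is a hypothetical name):

```python
def merge_outgoing_call(builtin, override):
    """Overlay the override on the built-in result; unspecified fields are inherited."""
    merged = dict(builtin)
    for field, value in override.items():
        if value is None or value == "":
            continue  # treated as "not specified", so the built-in value survives
        merged[field] = value
    return merged


builtin = {"client_kind": "rest_template", "method": "GET", "path": "/users/{id}"}
override = {"path": "/users/me"}
final_call = merge_outgoing_call(builtin, override)
# final_call keeps client_kind and method from built-in detection; only path changes.
```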
747
-
748
- ### 7.5 Brownfield limitations
749
-
750
- - **Duplicate `@interface` simple names across packages.** The meta map is keyed by simple name. If two distinct types share a name (`com.team1.X` and `com.team2.X`), only the first one encountered in **sorted file order** is kept; a stderr message names both FQNs. Resolve by renaming, or use `role_overrides.fqn` / `@CodebaseRole`.
751
- - **Incremental indexing and annotation sources.** The indexer may only reprocess changed files. If you edit an `@interface` declaration (e.g. remove a `@Service` meta-annotation from a wrapper), every class that used it may need re-enrichment; the pipeline does not track that dependency automatically. **Run a full `java-codebase-rag reprocess` after changing any `@interface` used as a custom stereotype.**
752
- - **`Symbol` rows scope.** `role` and `capabilities` on the graph are computed for **type** nodes (classes, interfaces, etc.). Method and constructor `Symbol` rows use defaults `role=OTHER` and `capabilities=[]`.
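The duplicate-name limitation in the first bullet follows from keying by simple name with a deterministic "first wins" rule. The sketch below shows the idea; `build_meta_map` is a hypothetical name, the input shape (pairs of file path and FQN) is an assumption, and the stderr message text is invented for illustration.

```python
import sys


def build_meta_map(declarations):
    """Map annotation simple name to FQN; the first declaration in sorted file order wins.

    `declarations` is an iterable of (file_path, fqn) pairs.
    """
    kept = {}
    for _path, fqn in sorted(declarations):
        simple_name = fqn.rsplit(".", 1)[-1]
        if simple_name in kept:
            # Report both FQNs so the clash can be resolved by renaming or an FQN override.
            print(
                f"duplicate @interface {simple_name}: keeping {kept[simple_name]}, ignoring {fqn}",
                file=sys.stderr,
            )
            continue
        kept[simple_name] = fqn
    return kept
```

Because the input is sorted before scanning, the same tree always yields the same winner, regardless of filesystem enumeration order.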
753
-
754
- ### 7.6 Lance / Kuzu consistency
755
-
756
- Both the Kuzu graph writer and Lance chunk enrichment call **one** function — `graph_enrich.collect_annotation_meta_chain` — which scans the project with sorted `*.java` paths, the same layered ignore rules as `build_ast_graph` / `path_filtering.iter_java_source_files`, parse-error warnings on stderr, and deterministic *first wins* for duplicate annotation simple names. Kuzu and Lance **should** agree; they can still diverge if the same file is handled differently elsewhere in the pipeline (e.g. parse edge cases). If graph tools and `search` disagree on a type, run a full reindex and compare.
757
-
758
- ---
759
-
760
- ## 8. Ignore patterns
761
-
762
- Java file discovery for the Kuzu graph, annotation meta-chain collection, and the CocoIndex Lance pipeline share the same layered ignore model (`path_filtering.LayeredIgnore`):
763
-
764
- 1. **Builtin default** — hardcoded patterns applied to every project.
765
- 2. **Project root** — optional `<project>/.java-codebase-rag/ignore` (gitignore syntax, including negation with `!`).
766
- 3. **Nested** — any `<subdir>/.java-codebase-rag/ignore` on the path from the project root to the file; closer files override farther ones.
767
- 4. **Git** — every `.gitignore` from the project root down to the file's directory, merged in order, using `pathspec.GitIgnoreSpec` (same semantics as git). Disable with `LayeredIgnore(..., use_gitignore=False)`.
768
-
769
- ### Builtin default patterns
770
-
771
- The builtin default layer (`path_filtering.COMMON_EXCLUDED_PATH_PATTERNS`) combines two mechanisms.
772
-
773
- **a) Glob patterns** (applied during the layered match):
774
-
775
- | Pattern | Excludes |
776
- |---|---|
777
- | `**/.*` | Any dot-file or dot-directory at any depth. |
778
- | `**/.git/**` | Git metadata. |
779
- | `**/.idea/**` | IntelliJ project metadata. |
780
- | `**/.venv/**` | Python virtual environments. |
781
- | `**/node_modules/**` | npm/yarn dependency tree. |
782
- | `**/*.class` | Compiled JVM class files. |
783
- | `**/src/test/java/**` | Maven/Gradle test sources (prod-only index by design). |
784
- | `**/src/test/resources/**` | Test resource bundles. |
785
-
786
- **b) Build-output directory pruning** (during `os.walk` traversal). Three directory names (`out`, `build`, `target`) are pruned **only** when they sit alongside a build-tool indicator file (`pom.xml`, `build.gradle`, `build.gradle.kts`, `settings.gradle`, `settings.gradle.kts`). This guards against the false positive where one of these names is a legal Java package (e.g. `com.example.out.api.AssignEndpoint` lives at `src/main/java/com/example/out/api/AssignEndpoint.java`, where `out/` is a package, not a Maven build output).
787
-
788
- A few directory names are pruned **unconditionally** because they are never legal Java package names: `.git`, `.idea`, `.venv`, `node_modules` (defined in `path_filtering.UNCONDITIONAL_PRUNE_DIRS`).
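The two pruning rules can be sketched with the standard `os.walk` idiom of editing `dirnames` in place. This is a simplified illustration, not `path_filtering` itself: `walk_java_files` is a hypothetical name, the constants are re-declared locally from the lists above, and the glob and layered-ignore rules are left out.

```python
import os

BUILD_OUTPUT_DIRS = {"out", "build", "target"}
BUILD_INDICATORS = {
    "pom.xml", "build.gradle", "build.gradle.kts",
    "settings.gradle", "settings.gradle.kts",
}
UNCONDITIONAL_PRUNE_DIRS = {".git", ".idea", ".venv", "node_modules"}


def walk_java_files(root):
    """Yield *.java paths, pruning build-output directories only next to a build file."""
    for dirpath, dirnames, filenames in os.walk(root):
        has_indicator = any(name in BUILD_INDICATORS for name in filenames)
        # Assigning to dirnames[:] tells os.walk not to descend into removed entries.
        dirnames[:] = sorted(
            d for d in dirnames
            if d not in UNCONDITIONAL_PRUNE_DIRS
            and not (has_indicator and d in BUILD_OUTPUT_DIRS)
        )
        for name in sorted(filenames):
            if name.endswith(".java"):
                yield os.path.join(dirpath, name)
```

With a `pom.xml` at the project root, `target/` next to it is skipped, while `src/main/java/com/example/out/` is still walked because no build indicator sits beside that `out/` directory.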
789
-
790
- To skip a directory the builtin walks (or include one it prunes), add a `.java-codebase-rag/ignore` file at the project root or any subtree root. Use `java-codebase-rag diagnose-ignore <path>` to see which layer decided for a given file.
791
-
792
- If no `.java-codebase-rag/ignore` exists anywhere under the project, behaviour matches the builtin list alone (plus git when enabled). When a negation rule could un-ignore paths under directories the CocoIndex walk used to prune globally, the walk switches to a permissive exclude list and each candidate path is filtered again with the full layered rules.
793
-
794
- **Monorepo note:** negation detection runs two full-tree `rglob` passes when constructing a `LayeredIgnore` (one for ignore files, one for `.gitignore` files). The cost is usually cheap to amortise, but on extremely large trees expect to pay it for every new instance.
795
-
796
- **Dependencies:** `pathspec` is pinned in `requirements.txt` and constrained the same way in `pyproject.toml` (loose bundle install vs. wheel metadata).
797
-
798
- ---
799
-
800
- ## 9. Further reading
801
-
802
- | Document | What's in it |
803
- |---|---|
804
- | [`docs/paper/paper.pdf`](./docs/paper/paper.pdf) | Architecture report — design rationale, GPS metaphor, three-layer architecture, design principles, future work. |
805
- | [`docs/AGENT-GUIDE.md`](./docs/AGENT-GUIDE.md) | Agent-facing guide. Copy-paste into `QWEN.md` / `CLAUDE.md` / `AGENTS.md`. |
806
- | [`docs/skills/java-codebase-explore.md`](./docs/skills/java-codebase-explore.md) | Agent exploration skill (strategy, missions, fallbacks); packaged zip [`docs/skills/java-codebase-explore.zip`](./docs/skills/java-codebase-explore.zip) via `./scripts/build-explore-skill.sh` for Perplexity-style hosts. |
807
- | [`docs/JAVA-CODEBASE-RAG-CLI.md`](./docs/JAVA-CODEBASE-RAG-CLI.md) | Operator playbook for the CLI: workflows, exit codes, env alignment. |
808
- | [`docs/MANUAL-VERIFICATION-CHECKLIST.md`](./docs/MANUAL-VERIFICATION-CHECKLIST.md) | 7-phase agent-driven verification after indexing your project. |
809
- | [`automation/cursor_propose_only/README.md`](./automation/cursor_propose_only/README.md) | Optional orchestration workflow for single-command proposal pipelines (autopilot), planning/review loops, and automated per-PR execution via command templates. |
810
- | [`CODEBASE_REQUIREMENTS.md`](./CODEBASE_REQUIREMENTS.md) | Assumptions about your Java repo + per-file edit map for non-conforming codebases. |
811
- | [`propose/PRODUCT-VISION.md`](./propose/PRODUCT-VISION.md) | Long-term product direction. |
812
-
813
- ### Roadmap (graph layer)
814
-
815
- - `get_service_topology` — microservice-level summary aggregating `HTTP_CALLS` / `ASYNC_CALLS`.
816
- - Agentic routing layer (query classifier → vector / graph / both).
817
- - Incremental Kuzu updates (per-changed-file) — see [`propose/TIER2-INCREMENTAL-REBUILD-PROPOSE.md`](./propose/TIER2-INCREMENTAL-REBUILD-PROPOSE.md) and [`propose/INDEX-AUTO-MODE-PROPOSE.md`](./propose/INDEX-AUTO-MODE-PROPOSE.md).
818
- - Optional `codegraph_nodes` LanceDB table embedding symbol summaries so the graph itself is vector-searchable.