diffdoc 0.4.2 → 0.5.0

@@ -9,6 +9,7 @@
    "cloudChatModel": "gpt-4o-mini",
    "cloudEmbedModel": "text-embedding-3-small",
    "embedBatchSize": 25,
+   "summarizeConcurrency": 2,
    "openaiApiKey": "",
    "includeGlobs": [],
    "excludeGlobs": [],
package/README.md CHANGED
@@ -1,277 +1,329 @@
  # DiffDoc
 
- ## Project Description
+ Your codebase already knows how the product works. DiffDoc turns that implementation into a living, portable knowledgebase that humans and agents can search, question, and reuse.
 
- DiffDoc turns source code into searchable, plain-English project context. It scans repository files, asks an OpenAI-compatible chat model to summarize the business behavior in each file, stores the summaries as portable per-hash JSON assets, embeds those assets into a local Vectra index, and answers questions using the indexed results as retrieval context.
+ It generates plain-English summaries from source files, records them in a manifest-first artifact model, and keeps the resulting context close to the repository. Use it to give developers, agents, reviewers, and stakeholders implementation-grounded answers without asking them to read every file first.
 
- The project is designed for teams that need fast codebase comprehension without requiring every stakeholder to read implementation details. It can run against local model servers such as Ollama, LM Studio, or vLLM, or against cloud OpenAI-compatible APIs.
+ ## Guiding Principles
 
- ## Installation
+ - The codebase is the source of truth. Requirements documents, tickets, wikis, and tribal knowledge can drift, but product behavior is ultimately defined by the code that ships.
+ - Summaries should describe implemented behavior, not imagined intent. DiffDoc focuses on what the current files do so product questions are answered from the implementation first.
+ - The knowledgebase should evolve with the product. When files change, DiffDoc refreshes affected summaries and manifest entries so generated context does not become a stale snapshot.
+ - The manifest is the durable contract. DiffDoc is intentionally manifest-first: the manifest is the source of truth for generated summaries, and downstream tools should be able to consume the manifest and summary assets without depending on DiffDoc's built-in embedding workflow.
+ - Retrieval is optional infrastructure. The built-in `embed` command, local Vectra index, `search`, `query`, and MCP server are convenience features for teams that want an end-to-end local workflow, but consumers should be free to use their own embedding provider, vector store, search system, or documentation pipeline.
+ - Useful context should serve humans and agents. The generated knowledgebase is intended for product questions, onboarding, code review, agent workflows, audits, and long-term maintenance.
 
- Run from this repository:
+ ## Requirements
 
- ```bash
- npm install
- npm run build
- node dist/index.js --help
- ```
+ - Node.js `>=22`
+ - An OpenAI-compatible chat model for `summarize` and `query`
+ - An OpenAI-compatible embedding model for `embed`, `search`, and `query`
+ - A local model server such as Ollama, LM Studio, or vLLM, or a cloud OpenAI-compatible endpoint
 
- Run after publishing:
+ ## Install
+
+ Run DiffDoc without adding it to your project:
 
  ```bash
  npx diffdoc --help
  ```
 
- Use as a project dev dependency:
+ Install it as a project dev dependency:
 
  ```bash
  npm install --save-dev diffdoc
- npx diffdoc --help
  ```
 
- Package scripts can call the installed binary:
+ Recommended package scripts:
 
  ```json
  {
    "scripts": {
      "diffdoc:init": "diffdoc init",
      "diffdoc:summarize": "diffdoc summarize",
-     "diffdoc:status": "diffdoc status",
      "diffdoc:embed": "diffdoc embed",
      "diffdoc:search": "diffdoc search",
      "diffdoc:query": "diffdoc query",
+     "diffdoc:status": "diffdoc status",
      "diffdoc:mcp": "diffdoc-mcp"
    }
  }
  ```
 
- ## Configuration
-
- DiffDoc accepts runtime flags on each command. It also loads a JSON `.diffdocrc` file from the current working directory when present, or from a custom path with `--config <path>`.
+ ## Quick Start
 
- Precedence:
-
- 1. CLI flags
- 2. `.diffdocrc`
- 3. Environment variable fallbacks
-
- Create a local config from the example:
+ Initialize DiffDoc in your repository:
 
  ```bash
- cp .diffdocrc.example .diffdocrc
+ npx diffdoc init
  ```
 
- Example config with all supported keys:
+ For a non-interactive setup using defaults:
 
- ```json
- {
-   "baseDir": "./.diffdoc",
-   "aiProvider": "local",
-   "localLlmEndpoint": "http://localhost:11434/v1",
-   "localEmbedEndpoint": "http://localhost:11434/v1/embeddings",
-   "localChatModel": "qwen2.5-coder:7b",
-   "localEmbedModel": "nomic-embed-code",
-   "cloudLlmEndpoint": "https://api.openai.com/v1",
-   "cloudChatModel": "gpt-4o-mini",
-   "cloudEmbedModel": "text-embedding-3-small",
-   "openaiApiKey": "",
-   "includeGlobs": [],
-   "excludeGlobs": [],
-   "ignoreFile": ".diffdocignore"
- }
+ ```bash
+ npx diffdoc init --yes
  ```
 
- Supported environment fallbacks use the uppercase names for the same settings, including `AI_PROVIDER`, `DIFFDOC_BASE_DIR`, `LOCAL_LLM_ENDPOINT`, `LOCAL_EMBED_ENDPOINT`, `LOCAL_CHAT_MODEL`, `LOCAL_EMBED_MODEL`, `CLOUD_LLM_ENDPOINT`, `CLOUD_CHAT_MODEL`, `CLOUD_EMBED_MODEL`, `OPENAI_API_KEY`, `DIFFDOC_INCLUDE_GLOBS`, `DIFFDOC_EXCLUDE_GLOBS`, and `DIFFDOC_IGNORE_FILE`.
-
- ## Manifest-First Design
+ Create summaries:
 
- DiffDoc separates summarization from embedding. The `summarize` command writes file-to-hash mappings to `manifest.json` and stores each summary in an independent hash-addressed JSON file under `./.diffdoc/summaries/`.
-
- The manifest is plain JSON and contains one entry per tracked file:
-
- ```json
- {
-   "schemaVersion": 2,
-   "lastSyncedCommit": "string-hash",
-   "files": {
-     "src/example.ts": "md5-string"
-   }
- }
+ ```bash
+ npx diffdoc summarize --path . --mode all
  ```
 
- Example summary asset at `./.diffdoc/summaries/<hash>.json`:
+ Build the local search index:
 
- ```json
- {
-   "schemaVersion": 1,
-   "content_hash": "md5-string",
-   "summary": "Plain-English explanation text here.",
-   "raw_code_snapshot": "Optional code text when --include-code-snapshot is enabled"
- }
+ ```bash
+ npx diffdoc embed
  ```
 
- Because the summaries are stored independently, users do not have to embed immediately. They can review, archive, transform, or embed the manifest later using their preferred vectorization model and storage solution.
-
- DiffDoc includes `diffdoc embed` as a built-in convenience path for creating a local Vectra index, but the manifest can also be consumed by other tools such as custom OpenAI-compatible embedding pipelines, hosted vector databases, local search systems, or internal documentation workflows.
-
- ## Commands
-
- Initialize DiffDoc configuration for a repository:
+ Search raw matches:
 
  ```bash
- diffdoc init
+ npx diffdoc search "How does authentication work?"
  ```
 
- Use defaults without prompts:
+ Ask a question using retrieved project context:
 
  ```bash
- diffdoc init --yes
+ npx diffdoc query "What business behavior does this repository implement?"
  ```
 
- Choose a provider and overwrite an existing config file:
+ After the first full run, refresh changed files with delta mode:
 
  ```bash
- diffdoc init --provider cloud --force
+ npx diffdoc summarize --path . --mode delta
+ npx diffdoc embed
  ```
 
- `init` creates or updates repo setup files, appends missing `.gitignore` entries, and prints next commands. It does not run `summarize` or `embed`.
+ ## What Init Creates
 
- Summarize a repository into `./.diffdoc/manifest.json`:
+ `diffdoc init` creates or updates repository-local setup files:
 
- ```bash
- diffdoc summarize --path . --mode all
- ```
+ - `.diffdocrc`: local DiffDoc configuration
+ - `.diffdocignore`: gitignore-style file selection rules for summarization
+ - `.gitignore`: entries for local/generated DiffDoc files when needed
 
- Summarize only changed Git files using the existing manifest state:
+ It does not summarize or embed anything. Run `summarize` and `embed` after initialization.
 
- ```bash
- diffdoc summarize --path . --mode delta
- ```
+ ## Configuration
 
- Store raw code snapshots in summary assets:
+ DiffDoc reads settings in this order:
 
- ```bash
- diffdoc summarize --path . --mode all --include-code-snapshot
+ 1. CLI flags
+ 2. `.diffdocrc` or the file passed with `--config <path>`
+ 3. Environment variables
+ 4. Built-in defaults
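The four-step order above can be pictured as a layered lookup in which the first layer that defines a key wins. This is only a minimal sketch under stated assumptions, not DiffDoc's actual loader: `resolveSetting` and the four layer objects are hypothetical names, and an unset value is assumed to be `undefined`.

```javascript
// Hypothetical sketch of layered settings resolution; not DiffDoc's real loader.
// Layers are listed in precedence order and the first defined value wins.
function resolveSetting(key, layers) {
  for (const layer of layers) {
    if (layer[key] !== undefined) {
      return layer[key];
    }
  }
  return undefined;
}

// Illustrative values only.
const cliFlags = { summarizeConcurrency: 4 };
const rcFile = { summarizeConcurrency: 2, aiProvider: "local" };
const envVars = { aiProvider: "cloud", baseDir: "./from-env" };
const defaults = { summarizeConcurrency: 2, aiProvider: "local", baseDir: "./.diffdoc" };

const layers = [cliFlags, rcFile, envVars, defaults];
console.log(resolveSetting("summarizeConcurrency", layers)); // 4 (CLI flag)
console.log(resolveSetting("aiProvider", layers));           // "local" (config file beats environment)
console.log(resolveSetting("baseDir", layers));              // "./from-env" (environment beats default)
```

Under this model a value in `.diffdocrc` shadows the matching environment variable, so a key has to be absent from the config file for the environment value to take effect.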
+
+ Example `.diffdocrc` for local models:
+
+ ```json
+ {
+   "baseDir": "./.diffdoc",
+   "aiProvider": "local",
+   "localLlmEndpoint": "http://localhost:11434/v1",
+   "localEmbedEndpoint": "http://localhost:11434/v1/embeddings",
+   "localChatModel": "qwen2.5-coder:7b",
+   "localEmbedModel": "nomic-embed-code",
+   "embedBatchSize": 25,
+   "summarizeConcurrency": 2,
+   "includeGlobs": [],
+   "excludeGlobs": [],
+   "ignoreFile": ".diffdocignore"
+ }
  ```
 
- Add include/exclude filters at runtime:
+ Example `.diffdocrc` for a cloud OpenAI-compatible endpoint:
 
- ```bash
- diffdoc summarize --path . --mode all --include-glob "src/**/*.ts" --exclude-glob "**/*.test.ts"
+ ```json
+ {
+   "baseDir": "./.diffdoc",
+   "aiProvider": "cloud",
+   "cloudLlmEndpoint": "https://api.openai.com/v1",
+   "cloudChatModel": "gpt-4o-mini",
+   "cloudEmbedModel": "text-embedding-3-small",
+   "embedBatchSize": 25,
+   "summarizeConcurrency": 2,
+   "includeGlobs": [],
+   "excludeGlobs": [],
+   "ignoreFile": ".diffdocignore"
+ }
  ```
 
- Emit a CI-friendly JSON summarize report:
+ Set `OPENAI_API_KEY` for cloud providers instead of committing API keys:
 
  ```bash
- diffdoc summarize --path . --mode delta --json
+ OPENAI_API_KEY="..." npx diffdoc summarize --path . --mode all
  ```
 
- Inspect manifest-relative artifact freshness:
+ Supported environment variables:
 
- ```bash
- diffdoc status
+ ```text
+ AI_PROVIDER
+ DIFFDOC_BASE_DIR
+ DIFFDOC_EMBED_BATCH_SIZE
+ DIFFDOC_SUMMARIZE_CONCURRENCY
+ DIFFDOC_INCLUDE_GLOBS
+ DIFFDOC_EXCLUDE_GLOBS
+ DIFFDOC_IGNORE_FILE
+ LOCAL_LLM_ENDPOINT
+ LOCAL_CHAT_MODEL
+ LOCAL_EMBED_ENDPOINT
+ LOCAL_EMBED_MODEL
+ CLOUD_LLM_ENDPOINT
+ CLOUD_CHAT_MODEL
+ CLOUD_EMBED_MODEL
+ OPENAI_API_KEY
  ```
 
- Use a custom manifest path under `--base-dir`:
+ ## File Selection
 
- ```bash
- diffdoc status --manifest manifest.json
+ `.diffdocignore` uses `.gitignore`-style syntax. This is the main way to keep generated files, dependencies, secrets, binaries, and local artifacts out of summaries.
+
+ Example `.diffdocignore`:
+
+ ```gitignore
+ .git/
+ .diffdoc/
+ node_modules/
+ dist/
+ coverage/
+ .env
+ *.log
  ```
 
- Emit CI-friendly JSON output:
+ Precedence is intentionally conservative:
 
- ```bash
- diffdoc status --json
+ 1. `.diffdocignore` skips files first
+ 2. `excludeGlobs` skip files second
+ 3. `includeGlobs` narrow whatever remains
+
+ An included file is still skipped if it matches `.diffdocignore` or `excludeGlobs`.
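The selection order can be sketched as a single predicate. The check order follows the updated `shouldIncludeFile` in this diff; everything else is an assumption made for illustration: the stub `ignoreMatcher` only imitates an object with an `ignores(path)` method, and the regular expressions stand in for compiled globs.

```javascript
// Sketch of the selection order: ignore rules, then excludes, then includes.
// `ignoreMatcher` is a stand-in exposing `ignores(path)`; the regular
// expressions are illustrative substitutes for compiled glob patterns.
function shouldIncludeFile(filePath, includePatterns, excludePatterns, ignoreMatcher) {
  if (ignoreMatcher.ignores(filePath)) {
    return false; // 1. the ignore file skips files first
  }
  if (excludePatterns.some((pattern) => pattern.test(filePath))) {
    return false; // 2. excludes skip files second
  }
  if (includePatterns.length > 0 && !includePatterns.some((pattern) => pattern.test(filePath))) {
    return false; // 3. includes narrow whatever remains
  }
  return true;
}

const ignoreMatcher = { ignores: (p) => p.startsWith("dist/") };
const include = [/^src\/.*\.ts$/];
const exclude = [/\.test\.ts$/];

console.log(shouldIncludeFile("src/app.ts", include, exclude, ignoreMatcher));      // true
console.log(shouldIncludeFile("src/app.test.ts", include, exclude, ignoreMatcher)); // false (excluded)
console.log(shouldIncludeFile("dist/app.ts", include, exclude, ignoreMatcher));     // false (ignored)
console.log(shouldIncludeFile("README.md", include, exclude, ignoreMatcher));       // false (not included)
```

An empty include list means "no narrowing", so every file that survives the ignore and exclude checks is kept.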
+
+ Use include and exclude filters from config:
+
+ ```json
+ {
+   "includeGlobs": ["src/**/*.ts"],
+   "excludeGlobs": ["**/*.test.ts"]
+ }
  ```
 
- Embed the manifest into a local Vectra index at `./.diffdoc/vectra`:
+ Or pass them at runtime:
 
  ```bash
- diffdoc embed
+ npx diffdoc summarize --path . --mode all --include-glob "src/**/*.ts" --exclude-glob "**/*.test.ts"
  ```
 
- Limit how many summary documents are sent per embeddings request:
+ ## Commands
+
+ Initialize setup files:
 
  ```bash
- diffdoc embed --embed-batch-size 20
+ npx diffdoc init
+ npx diffdoc init --yes
+ npx diffdoc init --provider cloud --force
  ```
 
- Force full index rebuild:
+ Summarize files into `.diffdoc/manifest.json` and `.diffdoc/summaries/*.json`:
 
  ```bash
- diffdoc embed --rebuild
+ npx diffdoc summarize --path . --mode all
+ npx diffdoc summarize --path . --mode delta
+ npx diffdoc summarize --path . --mode delta --json
+ npx diffdoc summarize --path . --mode all --summarize-concurrency 4
  ```
 
- Search the local Vectra index and print raw matches:
+ Summarization runs with bounded concurrency. The default is `2`; use `1` for strict rate limits, `2-4` for most providers, and higher values only when your local model server or API quota can handle the request volume.
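Bounded concurrency here means a fixed number of workers pulling files from a shared list. The helper below follows the `runWithConcurrency` function added in this diff; the `demo` wrapper, the file names, and the timer that stands in for a model request are illustrative assumptions.

```javascript
// Sketch of a bounded worker pool: at most `concurrency` workers pull items
// from a shared cursor until the list is exhausted.
async function runWithConcurrency(items, concurrency, worker) {
  let nextIndex = 0;
  const workerCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: workerCount }, async () => {
    while (nextIndex < items.length) {
      const item = items[nextIndex];
      nextIndex += 1;
      await worker(item);
    }
  }));
}

// Usage: record the peak number of in-flight tasks for a limit of 2.
async function demo() {
  let active = 0;
  let peak = 0;
  const done = [];
  await runWithConcurrency(["a.ts", "b.ts", "c.ts", "d.ts", "e.ts"], 2, async (file) => {
    active += 1;
    peak = Math.max(peak, active);
    await new Promise((resolve) => setTimeout(resolve, 10)); // stands in for a model request
    active -= 1;
    done.push(file);
  });
  return { peak, count: done.length };
}

demo().then((result) => console.log(result)); // { peak: 2, count: 5 }
```

Because the cursor is read and advanced without an intervening `await`, two workers never pick up the same item.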
+
+ Store raw code snapshots in summary assets when you want retrieved results to include source text:
 
  ```bash
- diffdoc search "How does this project process changed files?"
+ npx diffdoc summarize --path . --mode all --include-code-snapshot
  ```
 
- Include retrieved code snapshots in search results:
+ Check manifest and index freshness:
 
  ```bash
- diffdoc search "How does embedding work?" --top 3 --code
+ npx diffdoc status
+ npx diffdoc status --json
  ```
 
- Ask a question and have the configured chat model answer using retrieved embedded context:
+ Embed summaries into the local Vectra index:
 
  ```bash
- diffdoc query "How does this project process changed files?"
+ npx diffdoc embed
+ npx diffdoc embed --rebuild
+ npx diffdoc embed --embed-batch-size 20
  ```
 
- Include retrieved code snapshots after the generated answer:
+ Search indexed summaries:
 
  ```bash
- diffdoc query "How does embedding work?" --top 3 --code
+ npx diffdoc search "How does this project process changed files?"
+ npx diffdoc search "How does embedding work?" --top 3 --code
  ```
 
- Use a custom config file:
+ Ask questions with retrieval-augmented answers:
 
  ```bash
- diffdoc query "How does embedding work?" --config ./config/diffdoc.local.json
+ npx diffdoc query "How does this project process changed files?"
+ npx diffdoc query "How does embedding work?" --top 3 --code
  ```
 
- Override a config value at runtime:
+ Use a custom config or artifact directory:
 
  ```bash
- diffdoc embed --config ./.diffdocrc --base-dir ./tmp-diffdoc
+ npx diffdoc query "How does embedding work?" --config ./config/diffdoc.local.json
+ npx diffdoc embed --config ./.diffdocrc --base-dir ./tmp-diffdoc
  ```
 
- ## Workflow
+ ## Artifacts
 
- Typical usage is:
+ DiffDoc keeps generated project context under `baseDir`, which defaults to `./.diffdoc`:
 
- ```bash
- diffdoc summarize --path . --mode all
- diffdoc embed
- diffdoc search "What files explain the summarization flow?"
- diffdoc query "What business behavior does this repository implement?"
+ ```text
+ .diffdoc/
+   manifest.json
+   summaries/
+     <content-hash>.json
+   vectra/
  ```
 
- After the initial run, use delta mode to refresh changed files:
+ The manifest maps repository-relative file paths to content hashes:
 
- ```bash
- diffdoc summarize --path . --mode delta
- diffdoc embed
+ ```json
+ {
+   "schemaVersion": 2,
+   "lastSyncedCommit": "string-hash",
+   "files": {
+     "src/example.ts": "md5-string"
+   }
+ }
  ```
 
- ## GitHub Actions
+ Each summary asset is portable JSON:
 
- This repository includes a workflow at `.github/workflows/diffdoc-summarize.yml` that runs on pushes to `main`. It installs the project, builds the CLI, runs delta summarization, and commits `.diffdoc/manifest.json` back to the branch when the manifest changes.
+ ```json
+ {
+   "schemaVersion": 1,
+   "content_hash": "md5-string",
+   "summary": "Plain-English explanation text here.",
+   "raw_code_snapshot": "Optional code text when --include-code-snapshot is enabled"
+ }
+ ```
 
- The workflow intentionally ignores `.diffdoc/manifest.json` and `.diffdoc/vectra/**` changes as triggers so the bot commit does not create a loop.
+ Commit `.diffdoc/manifest.json` and `.diffdoc/summaries/*.json` if you want summaries shared across machines or CI runs. Keep `.diffdoc/vectra/` local unless you have a specific reason to commit the generated vector index.
 
- Configure the same values used by the CLI as GitHub Actions variables or secrets, such as `AI_PROVIDER`, `LOCAL_LLM_ENDPOINT`, `LOCAL_CHAT_MODEL`, `CLOUD_LLM_ENDPOINT`, `CLOUD_CHAT_MODEL`, and `OPENAI_API_KEY`. The workflow uses the environment-variable fallback path in DiffDoc, so no `.diffdocrc` file is required in CI.
+ The manifest and summary assets are the stable handoff point for consumers. The local Vectra index produced by `diffdoc embed` is optional and can be replaced by any embedding model and storage backend that fits your environment.
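A downstream consumer of that handoff point might look like the sketch below. It assumes only the manifest shape and the `summaries/<hash>.json` layout documented above; `loadSummaries` is a hypothetical helper name and is not part of DiffDoc.

```javascript
const fs = require("node:fs/promises");
const path = require("node:path");

// Hypothetical consumer: joins manifest entries with their summary assets.
// Assumes manifest.json has a `files` map of path -> hash and that each
// summary lives at summaries/<hash>.json under the same base directory.
async function loadSummaries(baseDir) {
  const manifestText = await fs.readFile(path.join(baseDir, "manifest.json"), "utf8");
  const manifest = JSON.parse(manifestText);
  const records = [];
  for (const [filePath, hash] of Object.entries(manifest.files)) {
    const assetPath = path.join(baseDir, "summaries", `${hash}.json`);
    const asset = JSON.parse(await fs.readFile(assetPath, "utf8"));
    records.push({ filePath, hash, summary: asset.summary });
  }
  return records;
}

// Usage: hand the records to any embedding or documentation pipeline.
// loadSummaries("./.diffdoc").then((records) => console.log(records.length));
```

Files that share identical content map to the same hash, so a consumer that embeds summaries may want to de-duplicate by `hash` before sending requests.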
 
  ## MCP Server
 
- DiffDoc also ships a local MCP stdio server as `diffdoc-mcp`. This lets MCP-compatible agents search or answer questions against the local Vectra index directly.
+ DiffDoc ships an MCP stdio server as `diffdoc-mcp`. Run `summarize` and `embed` before using it so the MCP tools have a local index to query.
 
- Run it manually with the same config style as the CLI:
+ Run the server manually:
 
  ```bash
- diffdoc-mcp --config ./.diffdocrc
+ npx diffdoc-mcp --config ./.diffdocrc
  ```
 
  Example MCP client configuration:
@@ -287,29 +339,34 @@ Example MCP client configuration:
  }
  ```
 
- If DiffDoc is installed as a project dev dependency, the same `npx diffdoc-mcp` command will resolve the local package binary.
-
  Available MCP tools:
 
- - `diffdoc_search`: searches the local Vectra index and returns raw file matches, summaries, scores, hashes, and optional code snapshots.
- - `diffdoc_answer`: retrieves relevant index context and asks the configured chat model to answer the question.
- - `diffdoc_index_stats`: returns the Vectra index path, whether it exists, and the indexed item count.
+ - `diffdoc_search`: search the local index and return matching files, summaries, scores, hashes, and optional code snapshots
+ - `diffdoc_answer`: retrieve relevant context and ask the configured chat model to answer a question
+ - `diffdoc_index_stats`: return index path, existence status, and indexed item count
+
+ ## CI
+
+ For CI, prefer environment variables or a generated config file instead of committing local credentials.
 
- Run `diffdoc summarize` and `diffdoc embed` before using the MCP server, otherwise the search and answer tools will not have a local index to query.
+ Typical CI flow:
+
+ ```bash
+ npm ci
+ npx diffdoc summarize --path . --mode delta --json
+ npx diffdoc embed
+ ```
+
+ Use `summarize --json` and `status --json` when a workflow needs machine-readable output.
+
+ Commit the manifest and summary assets from CI if you want DiffDoc state to advance with the branch. Ignore `.diffdoc/vectra/` unless your workflow intentionally persists the local index.
 
  ## Notes
 
- - Node.js `>=22` is required because Vectra requires it.
- - This repository ignores `.diffdoc/vectra` and `.diffdocrc`; add similar entries to your project's `.gitignore` if you do not want generated indexes or local config committed. The manifest at `.diffdoc/manifest.json` is not ignored by this repository.
- - Summary assets are written to `.diffdoc/summaries/*.json`.
- - Manifest schema is currently `schemaVersion: 2`; older manifest shapes are not auto-migrated.
- - Commit `.diffdoc/manifest.json` when using delta workflows. Delta summarization reads the previous manifest state to decide which changed files need fresh summaries.
  - `summarize` requires a configured chat model.
- - `summarize` prints run progress and final totals (`scanned`, `skipped`, `updated`, `failed`, `pruned`).
- - `summarize --json` prints a single machine-readable run report to stdout for CI parsing.
- - `status` does not require a configured chat or embedding model.
- - `status --json` prints a machine-readable report with summary and index freshness details.
- - `embed` requires a configured embedding model. Use `embedBatchSize` in `.diffdocrc`, `DIFFDOC_EMBED_BATCH_SIZE`, or `--embed-batch-size` to tune how many summary documents are sent per embeddings request.
- - `search` requires a configured embedding model and returns raw retrieval results without calling the chat model.
- - `query` requires both a configured chat model and embedding model.
+ - `embed` and `search` require a configured embedding model.
+ - `query` requires both chat and embedding configuration.
+ - `status` does not require chat or embedding configuration.
+ - Delta summarization uses Git changes plus the existing manifest state.
+ - Manifest schema is currently `schemaVersion: 2`; older manifest shapes are not auto-migrated.
  - For code-oriented embedding models such as `nomic-embed-code`, DiffDoc prefixes query embeddings with `Represent this query for searching relevant code:`.
@@ -19,6 +19,7 @@ const DEFAULT_CONFIG = {
      cloudChatModel: "gpt-4o-mini",
      cloudEmbedModel: "text-embedding-3-small",
      openaiApiKey: "",
+     summarizeConcurrency: 2,
      includeGlobs: [],
      excludeGlobs: [],
      ignoreFile: ".diffdocignore"
@@ -6,6 +6,7 @@ Object.defineProperty(exports, "__esModule", { value: true });
  exports.runSummarize = runSummarize;
  const promises_1 = __importDefault(require("node:fs/promises"));
  const node_path_1 = __importDefault(require("node:path"));
+ const ignore_1 = __importDefault(require("ignore"));
  const artifacts_1 = require("../types/artifacts");
  const git_1 = require("../utils/git");
  const hashing_1 = require("../utils/hashing");
@@ -55,18 +56,21 @@ function compileGlobs(patterns) {
  function matchesAny(filePath, patterns) {
      return patterns.some((pattern) => pattern.test(filePath));
  }
- function shouldIncludeFile(filePath, includeGlobs, excludeGlobs, ignoreGlobs) {
-     if (includeGlobs.length > 0 && !matchesAny(filePath, includeGlobs)) {
+ function shouldIncludeFile(filePath, includeGlobs, excludeGlobs, ignoreMatcher) {
+     if (ignoreMatcher.ignores(filePath)) {
          return false;
      }
      if (excludeGlobs.length > 0 && matchesAny(filePath, excludeGlobs)) {
          return false;
      }
-     if (ignoreGlobs.length > 0 && matchesAny(filePath, ignoreGlobs)) {
+     if (includeGlobs.length > 0 && !matchesAny(filePath, includeGlobs)) {
          return false;
      }
      return true;
  }
+ function isIgnoredDirectory(dirPath, ignoreMatcher) {
+     return ignoreMatcher.ignores(dirPath) || ignoreMatcher.ignores(`${dirPath}/`);
+ }
  async function fileExists(filePath) {
      try {
          await promises_1.default.access(filePath);
@@ -119,38 +123,38 @@ async function readManifest(manifestPath) {
          throw error;
      }
  }
- async function readIgnorePatterns(repoPath, ignoreFilePath) {
+ async function readIgnoreMatcher(repoPath, ignoreFilePath) {
+     const matcher = (0, ignore_1.default)();
      const absolutePath = node_path_1.default.isAbsolute(ignoreFilePath)
          ? ignoreFilePath
          : node_path_1.default.resolve(repoPath, ignoreFilePath);
      try {
          const raw = await promises_1.default.readFile(absolutePath, "utf8");
-         return raw
-             .split(/\r?\n/)
-             .map((line) => line.trim())
-             .filter((line) => line.length > 0 && !line.startsWith("#"))
-             .map(normalizeGlobPattern);
+         return matcher.add(raw);
      }
      catch (error) {
          const nodeError = error;
          if (nodeError.code === "ENOENT") {
-             return [];
+             return matcher;
          }
          throw error;
      }
  }
- async function walkCodeFiles(rootPath, includeGlobs, excludeGlobs, ignoreGlobs, currentPath = rootPath) {
+ async function walkCodeFiles(rootPath, includeGlobs, excludeGlobs, ignoreMatcher, currentPath = rootPath) {
      const entries = await promises_1.default.readdir(currentPath, { withFileTypes: true });
      const files = [];
      for (const entry of entries) {
          const entryPath = node_path_1.default.join(currentPath, entry.name);
          if (entry.isDirectory()) {
-             files.push(...await walkCodeFiles(rootPath, includeGlobs, excludeGlobs, ignoreGlobs, entryPath));
+             const relativePath = normalizeRelativePath(node_path_1.default.relative(rootPath, entryPath));
+             if (!isIgnoredDirectory(relativePath, ignoreMatcher)) {
+                 files.push(...await walkCodeFiles(rootPath, includeGlobs, excludeGlobs, ignoreMatcher, entryPath));
+             }
              continue;
          }
          if (entry.isFile()) {
              const relativePath = normalizeRelativePath(node_path_1.default.relative(rootPath, entryPath));
-             if (shouldIncludeFile(relativePath, includeGlobs, excludeGlobs, ignoreGlobs)) {
+             if (shouldIncludeFile(relativePath, includeGlobs, excludeGlobs, ignoreMatcher)) {
                  files.push(relativePath);
              }
          }
@@ -219,6 +223,25 @@ async function ensureSummaryAsset(summaryDir, hash, summaryText, rawCodeSnapshot
      };
      await writeSummaryAsset(summaryPath, summary);
  }
+ async function runWithConcurrency(items, concurrency, worker) {
+     let nextIndex = 0;
+     const workerCount = Math.min(concurrency, items.length);
+     await Promise.all(Array.from({ length: workerCount }, async () => {
+         while (nextIndex < items.length) {
+             const item = items[nextIndex];
+             nextIndex += 1;
+             await worker(item);
+         }
+     }));
+ }
+ function createManifestLock() {
+     let queue = Promise.resolve();
+     return async function withManifestLock(task) {
+         const run = queue.then(task, task);
+         queue = run.then(() => undefined, () => undefined);
+         return run;
+     };
+ }
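The `createManifestLock` helper above serializes tasks by chaining each one onto the previous task's promise. The sketch below shows that pattern in isolation; `createLock`, `demo`, and the shared counter are illustrative names that do not appear in the package.

```javascript
// Sketch of a promise-chain lock: each task starts only after the previous
// one settles, so read-modify-write sequences on shared state cannot interleave.
function createLock() {
  let queue = Promise.resolve();
  return function withLock(task) {
    const run = queue.then(task, task);
    // Discard the outcome on the queue so one failure does not block later tasks.
    queue = run.then(() => undefined, () => undefined);
    return run;
  };
}

// Usage: three concurrent increments with an await between read and write.
async function demo() {
  const withLock = createLock();
  const state = { count: 0 };
  const increment = () => withLock(async () => {
    const current = state.count;
    await new Promise((resolve) => setTimeout(resolve, 5));
    state.count = current + 1; // no other locked task ran in between
  });
  await Promise.all([increment(), increment(), increment()]);
  return state.count;
}

demo().then((count) => console.log(count)); // 3
```

Without the lock, all three tasks would read `0` before any of them wrote, and the final count would be `1`.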
  async function pruneOrphanedSummaries(summaryDir, manifest) {
      const activeHashes = new Set(Object.values(manifest.files));
      let entries = [];
@@ -261,10 +284,30 @@ async function runSummarize(options, config) {
          ? options.excludeGlobs.map(normalizeGlobPattern)
          : config.summarize.excludeGlobs.map(normalizeGlobPattern));
      const ignoreFile = options.ignoreFile || config.summarize.ignoreFile;
-     const ignorePatterns = compileGlobs(await readIgnorePatterns(repoPath, ignoreFile));
+     const ignoreMatcher = await readIgnoreMatcher(repoPath, ignoreFile);
      const totals = { scanned: 0, skipped: 0, updated: 0, failed: 0, pruned: 0 };
      const failures = [];
      const isJson = options.json;
+     const concurrency = config.summarize.concurrency;
+     const withManifestLock = createManifestLock();
+     const summaryAssetTasks = new Map();
+     async function ensureSummaryAssetForFile(filePath, hash, rawCodeSnapshot) {
+         const summaryPath = getSummaryPath(summaryDir, hash);
+         if (await fileExists(summaryPath)) {
+             return;
+         }
+         let task = summaryAssetTasks.get(hash);
+         if (!task) {
+             task = (async () => {
+                 const summaryText = await (0, llm_1.generateFunctionalSummary)(filePath, rawCodeSnapshot, config.chat);
+                 await ensureSummaryAsset(summaryDir, hash, summaryText, rawCodeSnapshot, options.includeCodeSnapshot);
+             })().finally(() => {
+                 summaryAssetTasks.delete(hash);
+             });
+             summaryAssetTasks.set(hash, task);
+         }
+         await task;
+     }
      if (!isJson) {
          console.log(`Starting summarize run`);
          console.log(`Mode: ${options.mode}`);
@@ -277,46 +320,53 @@ async function runSummarize(options, config) {
          manifest.files = {};
          refs.clear();
          await writeManifest(manifestPath, manifest);
-         const files = await walkCodeFiles(repoPath, includePatterns, excludePatterns, ignorePatterns);
+         const files = await walkCodeFiles(repoPath, includePatterns, excludePatterns, ignoreMatcher);
          const totalFiles = files.length;
+         let completedFiles = 0;
          if (!isJson) {
              console.log(`Candidates: ${totalFiles}`);
+             console.log(`Concurrency: ${concurrency}`);
          }
-         for (let i = 0; i < files.length; i += 1) {
-             const filePath = files[i];
-             totals.scanned += 1;
+         await runWithConcurrency(files, concurrency, async (filePath) => {
+             await withManifestLock(async () => {
+                 totals.scanned += 1;
+             });
              try {
                  const absolutePath = node_path_1.default.join(repoPath, filePath);
                  const rawCodeSnapshot = await promises_1.default.readFile(absolutePath, "utf8");
                  const hash = (0, hashing_1.hashFileContent)(rawCodeSnapshot);
-                 const summaryPath = getSummaryPath(summaryDir, hash);
-                 if (!await fileExists(summaryPath)) {
-                     const summaryText = await (0, llm_1.generateFunctionalSummary)(filePath, rawCodeSnapshot, config.chat);
-                     await ensureSummaryAsset(summaryDir, hash, summaryText, rawCodeSnapshot, options.includeCodeSnapshot);
-                 }
-                 manifest.files[filePath] = hash;
-                 refs.set(hash, (refs.get(hash) || 0) + 1);
-                 await writeManifest(manifestPath, manifest);
-                 totals.updated += 1;
-                 if (!isJson) {
-                     console.log(`[${i + 1}/${totalFiles}] summarized ${filePath}`);
-                 }
+                 await ensureSummaryAssetForFile(filePath, hash, rawCodeSnapshot);
+                 await withManifestLock(async () => {
+                     manifest.files[filePath] = hash;
+                     refs.set(hash, (refs.get(hash) || 0) + 1);
+                     await writeManifest(manifestPath, manifest);
+                     totals.updated += 1;
+                     completedFiles += 1;
+                     if (!isJson) {
+                         console.log(`[${completedFiles}/${totalFiles}] summarized ${filePath}`);
+                     }
+                 });
              }
              catch (error) {
                  const message = error instanceof Error ? error.message : String(error);
-                 failures.push({ filePath, message });
-                 totals.failed += 1;
-                 if (!isJson) {
-                     console.error(`[${i + 1}/${totalFiles}] failed ${filePath}: ${message}`);
-                 }
+                 await withManifestLock(async () => {
+                     failures.push({ filePath, message });
+                     totals.failed += 1;
+                     completedFiles += 1;
+                     if (!isJson) {
+                         console.error(`[${completedFiles}/${totalFiles}] failed ${filePath}: ${message}`);
+                     }
+                 });
              }
-         }
+         });
      }
      else {
          const deltas = await (0, git_1.getGitDeltas)(repoPath, manifest.lastSyncedCommit);
          const totalCandidates = deltas.modifiedOrAdded.length + deltas.deleted.length;
+         let completedModified = 0;
          if (!isJson) {
              console.log(`Candidates: ${totalCandidates} (${deltas.modifiedOrAdded.length} modified/added, ${deltas.deleted.length} deleted)`);
+             console.log(`Concurrency: ${concurrency}`);
          }
          for (const deletedPath of deltas.deleted) {
              const removed = await removeManifestPath(deletedPath, manifest, manifestPath, summaryDir, refs);
@@ -327,73 +377,85 @@ async function runSummarize(options, config) {
327
377
  console.log(`pruned ${deletedPath}`);
328
378
  }
329
379
  }
330
- for (let i = 0; i < deltas.modifiedOrAdded.length; i += 1) {
331
- const filePath = deltas.modifiedOrAdded[i];
332
- totals.scanned += 1;
380
+ await runWithConcurrency(deltas.modifiedOrAdded, concurrency, async (filePath) => {
381
+ await withManifestLock(async () => {
382
+ totals.scanned += 1;
383
+ });
333
384
  try {
334
- if (!shouldIncludeFile(filePath, includePatterns, excludePatterns, ignorePatterns)) {
335
- const removed = await removeManifestPath(filePath, manifest, manifestPath, summaryDir, refs);
336
- if (removed) {
337
- totals.pruned += 1;
338
- }
339
- else {
340
- totals.skipped += 1;
341
- }
342
- if (!isJson) {
343
- console.log(`[${i + 1}/${deltas.modifiedOrAdded.length}] excluded ${filePath}`);
344
- }
345
- continue;
385
+ if (!shouldIncludeFile(filePath, includePatterns, excludePatterns, ignoreMatcher)) {
386
+ await withManifestLock(async () => {
387
+ const removed = await removeManifestPath(filePath, manifest, manifestPath, summaryDir, refs);
388
+ if (removed) {
389
+ totals.pruned += 1;
390
+ }
391
+ else {
392
+ totals.skipped += 1;
393
+ }
394
+ completedModified += 1;
395
+ if (!isJson) {
396
+ console.log(`[${completedModified}/${deltas.modifiedOrAdded.length}] excluded ${filePath}`);
397
+ }
398
+ });
399
+ return;
346
400
  }
347
401
  const previousHash = manifest.files[filePath];
348
402
  const absolutePath = node_path_1.default.join(repoPath, filePath);
349
403
  const rawCodeSnapshot = await promises_1.default.readFile(absolutePath, "utf8");
350
404
  const hash = (0, hashing_1.hashFileContent)(rawCodeSnapshot);
351
405
  if (previousHash === hash) {
352
- totals.skipped += 1;
353
- if (!isJson) {
354
- console.log(`[${i + 1}/${deltas.modifiedOrAdded.length}] unchanged ${filePath}`);
355
- }
356
- continue;
357
- }
358
- const summaryPath = getSummaryPath(summaryDir, hash);
359
- if (!await fileExists(summaryPath)) {
360
- const summaryText = await (0, llm_1.generateFunctionalSummary)(filePath, rawCodeSnapshot, config.chat);
361
- await ensureSummaryAsset(summaryDir, hash, summaryText, rawCodeSnapshot, options.includeCodeSnapshot);
362
- }
363
- const changed = await setManifestPathHash(filePath, hash, manifest, manifestPath, summaryDir, refs);
364
- if (changed) {
365
- totals.updated += 1;
366
- }
367
- else {
368
- totals.skipped += 1;
369
- }
370
- if (!isJson) {
371
- console.log(`[${i + 1}/${deltas.modifiedOrAdded.length}] updated ${filePath}`);
406
+ await withManifestLock(async () => {
407
+ totals.skipped += 1;
408
+ completedModified += 1;
409
+ if (!isJson) {
410
+ console.log(`[${completedModified}/${deltas.modifiedOrAdded.length}] unchanged ${filePath}`);
411
+ }
412
+ });
413
+ return;
372
414
  }
373
- }
374
- catch (error) {
375
- const nodeError = error;
376
- if (nodeError.code === "ENOENT") {
377
- const removed = await removeManifestPath(filePath, manifest, manifestPath, summaryDir, refs);
378
- if (removed) {
379
- totals.pruned += 1;
415
+ await ensureSummaryAssetForFile(filePath, hash, rawCodeSnapshot);
416
+ await withManifestLock(async () => {
417
+ const changed = await setManifestPathHash(filePath, hash, manifest, manifestPath, summaryDir, refs);
418
+ if (changed) {
419
+ totals.updated += 1;
380
420
  }
381
421
  else {
382
422
  totals.skipped += 1;
383
423
  }
424
+ completedModified += 1;
384
425
  if (!isJson) {
385
- console.log(`[${i + 1}/${deltas.modifiedOrAdded.length}] missing ${filePath}`);
426
+ console.log(`[${completedModified}/${deltas.modifiedOrAdded.length}] updated ${filePath}`);
386
427
  }
387
- continue;
428
+ });
429
+ }
430
+ catch (error) {
431
+ const nodeError = error;
432
+ if (nodeError.code === "ENOENT") {
433
+ await withManifestLock(async () => {
434
+ const removed = await removeManifestPath(filePath, manifest, manifestPath, summaryDir, refs);
435
+ if (removed) {
436
+ totals.pruned += 1;
437
+ }
438
+ else {
439
+ totals.skipped += 1;
440
+ }
441
+ completedModified += 1;
442
+ if (!isJson) {
443
+ console.log(`[${completedModified}/${deltas.modifiedOrAdded.length}] missing ${filePath}`);
444
+ }
445
+ });
446
+ return;
388
447
  }
389
448
  const message = error instanceof Error ? error.message : String(error);
390
- failures.push({ filePath, message });
391
- totals.failed += 1;
392
- if (!isJson) {
393
- console.error(`[${i + 1}/${deltas.modifiedOrAdded.length}] failed ${filePath}: ${message}`);
394
- }
449
+ await withManifestLock(async () => {
450
+ failures.push({ filePath, message });
451
+ totals.failed += 1;
452
+ completedModified += 1;
453
+ if (!isJson) {
454
+ console.error(`[${completedModified}/${deltas.modifiedOrAdded.length}] failed ${filePath}: ${message}`);
455
+ }
456
+ });
395
457
  }
396
- }
458
+ });
397
459
  }
398
460
  manifest.lastSyncedCommit = await (0, git_1.getCurrentCommit)(repoPath);
399
461
  await writeManifest(manifestPath, manifest);
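
Note on the hunks above: the sequential `for` loops are replaced by calls to `runWithConcurrency`, and every mutation of shared state (`totals`, `manifest`, `refs`, `failures`, the progress counters) is wrapped in `withManifestLock`. The bodies of those two helpers are not shown in this diff. The following is a minimal sketch of how such helpers could work, assuming a fixed pool of async runners and a promise-chain lock. The name `createLock` and both function bodies are hypothetical and are not taken from the package.

```javascript
// Hypothetical sketch: these helper bodies are NOT taken from the package.

// Run `worker` over `items` with at most `limit` calls in flight at a time.
async function runWithConcurrency(items, limit, worker) {
  let next = 0;
  const runner = async () => {
    // Each runner claims the next unclaimed index until the list is exhausted.
    while (next < items.length) {
      const index = next;
      next += 1;
      await worker(items[index], index);
    }
  };
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, runner));
}

// Serialize critical sections by chaining each one onto a single promise.
function createLock() {
  let tail = Promise.resolve();
  return function withLock(task) {
    const result = tail.then(task);
    // Keep the chain usable after a rejection; the caller still sees the error.
    tail = result.catch(() => {});
    return result;
  };
}
```

Under these assumptions, a lock created once with `createLock()` would play the role of `withManifestLock`: the slow summarization call runs outside the lock, while the manifest write and the progress line run inside it, so the `[completed/total]` counter advances one step at a time even though files finish out of order.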
@@ -409,7 +471,7 @@ async function runSummarize(options, config) {
409
471
  finishedAt: finishedAt.toISOString(),
410
472
  durationMs,
411
473
  totals,
412
- failures
474
+ failures: failures.sort((a, b) => a.filePath.localeCompare(b.filePath))
413
475
  };
414
476
  if (isJson) {
415
477
  console.log(JSON.stringify(report, null, 2));
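
Note on the `failures` change above: with concurrent workers, entries are pushed in completion order, which can differ between runs, so the report now sorts them by `filePath`. `Array.prototype.sort` reorders the array in place. The sketch below shows the same comparator applied to a copy instead; the sample paths and messages are invented for illustration.

```javascript
// Illustrative data only; these paths and messages are made up.
const failures = [
  { filePath: "src/z.js", message: "timeout" },
  { filePath: "src/a.js", message: "rate limited" },
];

// Copy before sorting so the completion-order array is left untouched.
const ordered = [...failures].sort((a, b) => a.filePath.localeCompare(b.filePath));
```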
package/dist/config.js CHANGED
@@ -72,6 +72,7 @@ function buildRuntimeConfig(options, needs = { chat: true, embeddings: true }) {
72
72
  const includeGlobs = readListOption(mergedOptions.includeGlobs, "DIFFDOC_INCLUDE_GLOBS");
73
73
  const excludeGlobs = readListOption(mergedOptions.excludeGlobs, "DIFFDOC_EXCLUDE_GLOBS");
74
74
  const ignoreFile = readOption(mergedOptions.ignoreFile, "DIFFDOC_IGNORE_FILE", ".diffdocignore");
75
+ const summarizeConcurrency = readPositiveIntegerOption(mergedOptions.summarizeConcurrency, "DIFFDOC_SUMMARIZE_CONCURRENCY", 2);
75
76
  const chatBaseURL = provider === "cloud"
76
77
  ? readOption(mergedOptions.cloudLlmEndpoint, "CLOUD_LLM_ENDPOINT", "https://api.openai.com/v1")
77
78
  : readOption(mergedOptions.localLlmEndpoint, "LOCAL_LLM_ENDPOINT");
@@ -116,7 +117,8 @@ function buildRuntimeConfig(options, needs = { chat: true, embeddings: true }) {
116
117
  summarize: {
117
118
  includeGlobs,
118
119
  excludeGlobs,
119
- ignoreFile
120
+ ignoreFile,
121
+ concurrency: summarizeConcurrency
120
122
  }
121
123
  };
122
124
  }
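
Note on the config hunks above: `summarizeConcurrency` is read through `readPositiveIntegerOption` with the environment variable `DIFFDOC_SUMMARIZE_CONCURRENCY` and a default of `2`. The helper's body is not part of this diff. The sketch below is one plausible shape for it, assuming an explicit option wins over the environment variable and that invalid values raise an error. The function name `readPositiveInteger`, the `env` parameter, and the error wording are hypothetical.

```javascript
// Hypothetical helper: resolve a value from an explicit option, then an
// environment variable, then a default. Not the package's implementation.
function readPositiveInteger(optionValue, envName, fallback, env = process.env) {
  const hasOption = optionValue !== undefined && optionValue !== "";
  const raw = hasOption ? optionValue : env[envName];
  if (raw === undefined || raw === "") {
    return fallback;
  }
  const parsed = Number(raw);
  // Assumption: reject zero, negatives, fractions, and non-numeric input.
  if (!Number.isInteger(parsed) || parsed < 1) {
    throw new Error(`${envName} must be a positive integer, got "${raw}"`);
  }
  return parsed;
}
```

Rejecting bad input rather than silently falling back is a design choice made for this sketch; the real helper may behave differently.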
package/dist/index.js CHANGED
@@ -42,7 +42,7 @@ function addCloudEndpointAndKeyOptions(command) {
42
42
  program
43
43
  .name("diffdoc")
44
44
  .description("Translate repository code shifts into plain-English business context")
45
- .version("0.4.2");
45
+ .version("0.5.0");
46
46
  program
47
47
  .command("init")
48
48
  .description("Initialize DiffDoc configuration for this repository")
@@ -71,6 +71,7 @@ addChatOptions(addBaseOptions(program
71
71
  .option("--include-glob <pattern>", "include glob pattern (repeatable)", collectOption, [])
72
72
  .option("--exclude-glob <pattern>", "exclude glob pattern (repeatable)", collectOption, [])
73
73
  .option("--ignore-file <path>", "path to ignore pattern file relative to --path")
74
+ .option("--summarize-concurrency <count>", "number of files to summarize concurrently")
74
75
  .action(async (options) => {
75
76
  try {
76
77
  const config = (0, config_1.buildRuntimeConfig)(options, { chat: true });
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "diffdoc",
3
- "version": "0.4.2",
3
+ "version": "0.5.0",
4
4
  "description": "Translate repository code shifts into plain-English business context",
5
5
  "license": "MIT",
6
6
  "author": "Christopher Sullivan",
@@ -36,6 +36,7 @@
36
36
  "dependencies": {
37
37
  "@modelcontextprotocol/sdk": "^1.29.0",
38
38
  "commander": "^12.0.0",
39
+ "ignore": "^7.0.5",
39
40
  "openai": "^4.28.0",
40
41
  "simple-git": "^3.24.0",
41
42
  "vectra": "^0.14.0",