diffdoc 0.2.0 → 0.3.0

@@ -8,5 +8,8 @@
  "cloudLlmEndpoint": "https://api.openai.com/v1",
  "cloudChatModel": "gpt-4o-mini",
  "cloudEmbedModel": "text-embedding-3-small",
- "openaiApiKey": ""
+ "openaiApiKey": "",
+ "includeGlobs": [],
+ "excludeGlobs": [],
+ "ignoreFile": ".diffdocignore"
  }
package/README.md CHANGED
@@ -2,7 +2,7 @@
 
  ## Project Description
 
- DiffDoc turns source code into searchable, plain-English project context. It scans repository files, asks an OpenAI-compatible chat model to summarize the business behavior in each file, stores those summaries in a portable JSON manifest, embeds the manifest into a local Vectra index, and answers questions using the indexed results as retrieval context.
+ DiffDoc turns source code into searchable, plain-English project context. It scans repository files, asks an OpenAI-compatible chat model to summarize the business behavior in each file, stores the summaries as portable per-hash JSON assets, embeds those assets into a local Vectra index, and answers questions using the indexed results as retrieval context.
 
  The project is designed for teams that need fast codebase comprehension without requiring every stakeholder to read implementation details. It can run against local model servers such as Ollama, LM Studio, or vLLM, or against cloud OpenAI-compatible APIs.
 
@@ -72,31 +72,42 @@ Example config with all supported keys:
  "cloudLlmEndpoint": "https://api.openai.com/v1",
  "cloudChatModel": "gpt-4o-mini",
  "cloudEmbedModel": "text-embedding-3-small",
- "openaiApiKey": ""
+ "openaiApiKey": "",
+ "includeGlobs": [],
+ "excludeGlobs": [],
+ "ignoreFile": ".diffdocignore"
  }
  ```
 
- Supported environment fallbacks use the uppercase names for the same settings, including `AI_PROVIDER`, `DIFFDOC_BASE_DIR`, `LOCAL_LLM_ENDPOINT`, `LOCAL_EMBED_ENDPOINT`, `LOCAL_CHAT_MODEL`, `LOCAL_EMBED_MODEL`, `CLOUD_LLM_ENDPOINT`, `CLOUD_CHAT_MODEL`, `CLOUD_EMBED_MODEL`, and `OPENAI_API_KEY`.
+ Supported environment fallbacks use the uppercase names for the same settings, including `AI_PROVIDER`, `DIFFDOC_BASE_DIR`, `LOCAL_LLM_ENDPOINT`, `LOCAL_EMBED_ENDPOINT`, `LOCAL_CHAT_MODEL`, `LOCAL_EMBED_MODEL`, `CLOUD_LLM_ENDPOINT`, `CLOUD_CHAT_MODEL`, `CLOUD_EMBED_MODEL`, `OPENAI_API_KEY`, `DIFFDOC_INCLUDE_GLOBS`, `DIFFDOC_EXCLUDE_GLOBS`, and `DIFFDOC_IGNORE_FILE`.
 
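Reviewer note: judging from the `parseCsv` and `readListOption` helpers in the `package/dist/config.js` hunk of this diff, the two list-valued fallbacks (`DIFFDOC_INCLUDE_GLOBS`, `DIFFDOC_EXCLUDE_GLOBS`) appear to be read as comma-separated strings. The sketch below isolates that parsing; `parseGlobList` is a hypothetical name used only for illustration, not part of the package.

```javascript
// Hypothetical stand-in for the parseCsv helper shown in the config.js hunk:
// split on commas, trim whitespace, drop empty entries.
function parseGlobList(value) {
  return (value || "")
    .split(",")
    .map((item) => item.trim())
    .filter(Boolean);
}

// Example: what DIFFDOC_INCLUDE_GLOBS="src/**/*.ts, lib/**/*.js," would yield.
console.log(parseGlobList("src/**/*.ts, lib/**/*.js,"));
```

One consequence of splitting on commas is that a pattern containing a literal comma cannot be expressed through these environment variables.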
  ## Manifest-First Design
 
- DiffDoc separates summarization from embedding. The `summarize` command writes all generated file summaries to `manifest.json` under `baseDir`, usually `./.diffdoc/manifest.json`.
+ DiffDoc separates summarization from embedding. The `summarize` command writes file-to-hash mappings to `manifest.json` and stores each summary in an independent hash-addressed JSON file under `./.diffdoc/summaries/`.
 
  The manifest is plain JSON and contains one entry per tracked file:
 
  ```json
  {
+ "schemaVersion": 2,
  "lastSyncedCommit": "string-hash",
  "files": {
- "src/example.ts": {
- "hash": "md5-string",
- "summaryText": "Plain-English explanation text here.",
- "rawCodeSnapshot": "Full code text here..."
- }
+ "src/example.ts": "md5-string"
  }
  }
  ```
 
+ Example summary asset at `./.diffdoc/summaries/<hash>.json`:
+
+ ```json
+ {
+ "schemaVersion": 1,
+ "content_hash": "md5-string",
+ "summary": "Plain-English explanation text here.",
+ "raw_code_snapshot": "Optional code text when --include-code-snapshot is enabled"
+ }
+ ```
+
  Because the summaries are stored independently, users do not have to embed immediately. They can review, archive, transform, or embed the manifest later using their preferred vectorization model and storage solution.
 
  DiffDoc includes `diffdoc embed` as a built-in convenience path for creating a local Vectra index, but the manifest can also be consumed by other tools such as custom OpenAI-compatible embedding pipelines, hosted vector databases, local search systems, or internal documentation workflows.
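Reviewer note: with the 0.3.0 layout, an external consumer has to join two artifacts, the manifest (path to hash) and the per-hash summary files. The following is a minimal sketch of that join, assuming the layout described in the README text above (`manifest.json` next to a `summaries/` directory). `loadSummaries` is a hypothetical helper, not an API exported by the package.

```javascript
const fs = require("node:fs/promises");
const path = require("node:path");

// Hypothetical consumer: resolve every manifest entry to its summary asset.
// Assumes <baseDir>/manifest.json and <baseDir>/summaries/<hash>.json.
async function loadSummaries(baseDir) {
  const manifestRaw = await fs.readFile(path.join(baseDir, "manifest.json"), "utf8");
  const manifest = JSON.parse(manifestRaw);
  if (manifest.schemaVersion !== 2) {
    // The README notes that older manifest shapes are not auto-migrated.
    throw new Error(`Unexpected manifest schemaVersion: ${manifest.schemaVersion}`);
  }
  const results = [];
  for (const [filePath, hash] of Object.entries(manifest.files || {})) {
    const assetPath = path.join(baseDir, "summaries", `${hash}.json`);
    const asset = JSON.parse(await fs.readFile(assetPath, "utf8"));
    results.push({ filePath, hash, summary: asset.summary });
  }
  return results;
}
```

Because assets are keyed by content hash, two paths with identical content would resolve to the same summary file in this sketch.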
@@ -115,12 +126,30 @@ Summarize only changed Git files using the existing manifest state:
  diffdoc summarize --path . --mode delta
  ```
 
+ Store raw code snapshots in summary assets:
+
+ ```bash
+ diffdoc summarize --path . --mode all --include-code-snapshot
+ ```
+
+ Add include/exclude filters at runtime:
+
+ ```bash
+ diffdoc summarize --path . --mode all --include-glob "src/**/*.ts" --exclude-glob "**/*.test.ts"
+ ```
+
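Reviewer note: the glob matcher added in the summarize command code later in this diff is hand-rolled rather than taken from a glob library, so its semantics may differ from shell or gitignore globbing. The sketch below re-implements the `globToRegExp` logic from that hunk (condensed, without the path-separator normalization) to show how the README's example pattern behaves.

```javascript
// Condensed re-implementation of the globToRegExp logic shown in this diff.
function globToRegExp(pattern) {
  let body = "";
  for (let i = 0; i < pattern.length; i += 1) {
    const char = pattern[i];
    if (char === "*" && pattern[i + 1] === "*") {
      body += ".*"; // "**" may cross directory separators
      i += 1;
    } else if (char === "*") {
      body += "[^/]*"; // "*" stays within one path segment
    } else if (char === "?") {
      body += "[^/]"; // "?" matches one non-separator character
    } else {
      body += char.replace(/[|\\{}()[\]^$+?.]/g, "\\$&"); // escape regex metacharacters
    }
  }
  return new RegExp(`^${body}$`);
}

const include = globToRegExp("src/**/*.ts");
console.log(include.test("src/utils/paths.ts")); // true
console.log(include.test("src/index.ts")); // false: the "/" after "**" is still required
```

Under this translation, `src/**/*.ts` does not match files directly inside `src/`, and `**/*.test.ts` does not match a test file at the repository root. Users who expect gitignore-style behavior may need an extra pattern such as `src/*.ts`.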
  Embed the manifest into a local Vectra index at `./.diffdoc/vectra`:
 
  ```bash
  diffdoc embed
  ```
 
+ Force full index rebuild:
+
+ ```bash
+ diffdoc embed --rebuild
+ ```
+
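Reviewer note: without `--rebuild`, the updated `runEmbed` in this diff appears to update the index incrementally: it re-embeds only paths whose manifest hash differs from the hash stored in the index metadata, and deletes index items whose path is no longer in the manifest. The sketch below reduces that planning step to plain data. `planEmbedUpdate` is a hypothetical name, and `existingItems` stands in for the items returned by the Vectra index.

```javascript
// Hypothetical, data-only version of the planning step in runEmbed.
// manifestFiles: { [filePath]: contentHash }
// existingItems: [{ id: filePath, metadata: { hash } }]
function planEmbedUpdate(manifestFiles, existingItems) {
  const existingHashById = new Map(
    existingItems.map((item) => [item.id, item.metadata.hash])
  );
  const activePaths = new Set(Object.keys(manifestFiles));

  // Re-embed paths that are new or whose content hash changed.
  const toUpsert = Object.entries(manifestFiles)
    .filter(([filePath, hash]) => existingHashById.get(filePath) !== hash)
    .map(([filePath]) => filePath);

  // Remove index items whose path is no longer tracked by the manifest.
  const toDelete = existingItems
    .map((item) => item.id)
    .filter((id) => Boolean(id) && !activePaths.has(id));

  return { toUpsert, toDelete };
}
```

When both lists come back empty, the real command logs that the index is already up to date and makes no embedding calls.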
  Search the local Vectra index and print raw matches:
 
  ```bash
@@ -226,6 +255,8 @@ Run `diffdoc summarize` and `diffdoc embed` before using the MCP server, otherwi
 
  - Node.js `>=22` is required because Vectra requires it.
  - This repository ignores `.diffdoc/vectra` and `.diffdocrc`; add similar entries to your project's `.gitignore` if you do not want generated indexes or local config committed. The manifest at `.diffdoc/manifest.json` is not ignored by this repository.
+ - Summary assets are written to `.diffdoc/summaries/*.json`.
+ - Manifest schema is currently `schemaVersion: 2`; older manifest shapes are not auto-migrated.
  - Commit `.diffdoc/manifest.json` when using delta workflows. Delta summarization reads the previous manifest state to decide which changed files need fresh summaries.
  - `summarize` requires a configured chat model.
  - `embed` requires a configured embedding model.
@@ -8,56 +8,142 @@ exports.runEmbed = runEmbed;
  const promises_1 = __importDefault(require("node:fs/promises"));
  const node_path_1 = __importDefault(require("node:path"));
  const vectra_1 = require("vectra");
+ const artifacts_1 = require("../types/artifacts");
  const llm_1 = require("../utils/llm");
  const paths_1 = require("../utils/paths");
  const VECTRA_INDEX_DIR = "vectra";
  function getVectraIndexPath(config) {
  return node_path_1.default.resolve((0, paths_1.getDiffdocBaseDir)(config.baseDir), VECTRA_INDEX_DIR);
  }
+ function getSummaryDir(manifestPath) {
+ return node_path_1.default.resolve(node_path_1.default.dirname(manifestPath), "summaries");
+ }
+ function getSummaryPath(summaryDir, hash) {
+ return node_path_1.default.resolve(summaryDir, `${hash}.json`);
+ }
+ async function readManifest(manifestPath) {
+ const parsed = JSON.parse(await promises_1.default.readFile(manifestPath, "utf8"));
+ if (parsed.schemaVersion !== artifacts_1.MANIFEST_SCHEMA_VERSION) {
+ throw new Error(`Unsupported manifest schema in ${manifestPath}. Expected schemaVersion ${artifacts_1.MANIFEST_SCHEMA_VERSION}.`);
+ }
+ return {
+ schemaVersion: artifacts_1.MANIFEST_SCHEMA_VERSION,
+ lastSyncedCommit: typeof parsed.lastSyncedCommit === "string" ? parsed.lastSyncedCommit : "",
+ files: parsed.files && typeof parsed.files === "object" ? parsed.files : {}
+ };
+ }
+ async function readSummaryAsset(summaryPath) {
+ const parsed = JSON.parse(await promises_1.default.readFile(summaryPath, "utf8"));
+ if (parsed.schemaVersion !== artifacts_1.SUMMARY_ASSET_SCHEMA_VERSION) {
+ throw new Error(`Unsupported summary schema in ${summaryPath}. Expected schemaVersion ${artifacts_1.SUMMARY_ASSET_SCHEMA_VERSION}.`);
+ }
+ if (typeof parsed.content_hash !== "string") {
+ throw new Error(`Invalid summary hash in ${summaryPath}.`);
+ }
+ if (typeof parsed.summary !== "string") {
+ throw new Error(`Invalid summary text in ${summaryPath}.`);
+ }
+ return {
+ schemaVersion: artifacts_1.SUMMARY_ASSET_SCHEMA_VERSION,
+ content_hash: parsed.content_hash,
+ summary: parsed.summary,
+ raw_code_snapshot: typeof parsed.raw_code_snapshot === "string" ? parsed.raw_code_snapshot : undefined
+ };
+ }
  function buildDocument(filePath, summaryText, rawCodeSnapshot) {
- return `File: ${filePath}\n` +
- `Summary: ${summaryText}\n\n` +
- `Code Snapshot:\n\`\`\`\n${rawCodeSnapshot}\n\`\`\``;
+ let output = `File: ${filePath}\nSummary: ${summaryText}`;
+ if (rawCodeSnapshot) {
+ output += `\n\nCode Snapshot:\n\`\`\`\n${rawCodeSnapshot}\n\`\`\``;
+ }
+ return output;
  }
  async function runEmbed(options, config) {
  const manifestPath = (0, paths_1.resolveDiffdocArtifactPath)(options.manifest, config.baseDir);
- const manifest = JSON.parse(await promises_1.default.readFile(manifestPath, "utf8"));
+ const manifest = await readManifest(manifestPath);
  const entries = Object.entries(manifest.files);
+ const summaryDir = getSummaryDir(manifestPath);
  const indexPath = getVectraIndexPath(config);
  const index = new vectra_1.LocalIndex(indexPath);
- await index.createIndex({
- version: 1,
- deleteIfExists: true,
- metadata_config: {
- indexed: ["filePath", "hash"]
+ if (options.rebuild) {
+ await index.createIndex({
+ version: 1,
+ deleteIfExists: true,
+ metadata_config: {
+ indexed: ["filePath", "hash"]
+ }
+ });
+ }
+ else if (!await index.isIndexCreated()) {
+ await index.createIndex({
+ version: 1,
+ deleteIfExists: false,
+ metadata_config: {
+ indexed: ["filePath", "hash"]
+ }
+ });
+ }
+ const existingItems = await index.listItems();
+ const existingByPath = new Map(existingItems.map((item) => [item.id, item]));
+ const toUpsert = [];
+ for (const [filePath, hash] of entries) {
+ const existing = existingByPath.get(filePath);
+ if (existing?.metadata.hash === hash) {
+ continue;
+ }
+ const summaryPath = getSummaryPath(summaryDir, hash);
+ const summaryAsset = await readSummaryAsset(summaryPath);
+ if (summaryAsset.content_hash !== hash) {
+ throw new Error(`Hash mismatch in summary asset ${summaryPath}.`);
  }
- });
- if (entries.length === 0) {
- console.log(`Created empty Vectra index at ${indexPath}.`);
+ toUpsert.push({
+ filePath,
+ hash,
+ summaryText: summaryAsset.summary,
+ rawCodeSnapshot: summaryAsset.raw_code_snapshot,
+ document: buildDocument(filePath, summaryAsset.summary, summaryAsset.raw_code_snapshot)
+ });
+ }
+ const activePathSet = new Set(entries.map(([filePath]) => filePath));
+ const toDelete = existingItems
+ .map((item) => item.id)
+ .filter((id) => Boolean(id) && !activePathSet.has(id));
+ if (toUpsert.length === 0 && toDelete.length === 0) {
+ console.log(`Index is already up to date at ${indexPath}.`);
  return;
  }
- const documents = entries.map(([filePath, file]) => buildDocument(filePath, file.summaryText, file.rawCodeSnapshot));
- const embeddings = await (0, llm_1.generateEmbeddings)(documents, config.embeddings);
+ const embeddings = toUpsert.length > 0
+ ? await (0, llm_1.generateEmbeddings)(toUpsert.map((item) => item.document), config.embeddings)
+ : [];
  await index.beginUpdate();
  try {
- for (let i = 0; i < entries.length; i += 1) {
- const [filePath, file] = entries[i];
+ for (let i = 0; i < toUpsert.length; i += 1) {
+ const item = toUpsert[i];
+ const metadata = item.rawCodeSnapshot
+ ? {
+ filePath: item.filePath,
+ hash: item.hash,
+ summaryText: item.summaryText,
+ rawCodeSnapshot: item.rawCodeSnapshot
+ }
+ : {
+ filePath: item.filePath,
+ hash: item.hash,
+ summaryText: item.summaryText
+ };
  await index.upsertItem({
- id: filePath,
+ id: item.filePath,
  vector: embeddings[i],
- metadata: {
- filePath,
- hash: file.hash,
- summaryText: file.summaryText,
- rawCodeSnapshot: file.rawCodeSnapshot
- }
+ metadata
  });
  }
+ for (const itemId of toDelete) {
+ await index.deleteItem(itemId);
+ }
  await index.endUpdate();
  }
  catch (error) {
  index.cancelUpdate();
  throw error;
  }
- console.log(`Embedded ${entries.length} summaries into Vectra index at ${indexPath}.`);
+ console.log(`Embedded ${toUpsert.length} summaries and pruned ${toDelete.length} items in ${indexPath}.`);
  }
@@ -25,7 +25,7 @@ async function runQuery(message, options, config) {
  console.log((0, retrieval_1.trimForDisplay)(result.summaryText, 1200));
  if (options.code) {
  console.log("Code Snapshot:");
- console.log((0, retrieval_1.trimForDisplay)(result.rawCodeSnapshot, 2000));
+ console.log((0, retrieval_1.trimForDisplay)(result.rawCodeSnapshot || "(not stored)", 2000));
  }
  }
  }
@@ -44,7 +44,7 @@ async function runSearch(message, options, config) {
  console.log((0, retrieval_1.trimForDisplay)(result.summaryText, 1200));
  if (options.code) {
  console.log("Code Snapshot:");
- console.log((0, retrieval_1.trimForDisplay)(result.rawCodeSnapshot, 2000));
+ console.log((0, retrieval_1.trimForDisplay)(result.rawCodeSnapshot || "(not stored)", 2000));
  }
  }
  }
@@ -6,60 +6,240 @@ Object.defineProperty(exports, "__esModule", { value: true });
  exports.runSummarize = runSummarize;
  const promises_1 = __importDefault(require("node:fs/promises"));
  const node_path_1 = __importDefault(require("node:path"));
+ const artifacts_1 = require("../types/artifacts");
  const git_1 = require("../utils/git");
  const hashing_1 = require("../utils/hashing");
  const llm_1 = require("../utils/llm");
  const paths_1 = require("../utils/paths");
- const TARGET_EXTENSIONS = new Set([".ts", ".js", ".cs", ".py"]);
- const IGNORED_DIRECTORIES = new Set([".git", "node_modules", "dist"]);
- const IGNORED_FILES = new Set(["package-lock.json", "yarn.lock", "pnpm-lock.yaml", "bun.lockb"]);
  function normalizeRelativePath(filePath) {
  return filePath.split(node_path_1.default.sep).join("/");
  }
- function isTargetCodeFile(filePath) {
- return TARGET_EXTENSIONS.has(node_path_1.default.extname(filePath)) && !IGNORED_FILES.has(node_path_1.default.basename(filePath));
+ function getSummaryDir(manifestPath) {
+ return node_path_1.default.resolve(node_path_1.default.dirname(manifestPath), "summaries");
+ }
+ function getSummaryPath(summaryDir, hash) {
+ return node_path_1.default.resolve(summaryDir, `${hash}.json`);
+ }
+ function normalizeGlobPattern(pattern) {
+ return pattern.split(node_path_1.default.sep).join("/");
+ }
+ function escapeRegex(value) {
+ return value.replace(/[|\\{}()[\]^$+?.]/g, "\\$&");
+ }
+ function globToRegExp(pattern) {
+ const normalized = normalizeGlobPattern(pattern);
+ let regexBody = "";
+ for (let i = 0; i < normalized.length; i += 1) {
+ const char = normalized[i];
+ const next = normalized[i + 1];
+ if (char === "*" && next === "*") {
+ regexBody += ".*";
+ i += 1;
+ continue;
+ }
+ if (char === "*") {
+ regexBody += "[^/]*";
+ continue;
+ }
+ if (char === "?") {
+ regexBody += "[^/]";
+ continue;
+ }
+ regexBody += escapeRegex(char);
+ }
+ return new RegExp(`^${regexBody}$`);
+ }
+ function compileGlobs(patterns) {
+ return patterns.filter(Boolean).map(globToRegExp);
+ }
+ function matchesAny(filePath, patterns) {
+ return patterns.some((pattern) => pattern.test(filePath));
+ }
+ function shouldIncludeFile(filePath, includeGlobs, excludeGlobs, ignoreGlobs) {
+ if (includeGlobs.length > 0 && !matchesAny(filePath, includeGlobs)) {
+ return false;
+ }
+ if (excludeGlobs.length > 0 && matchesAny(filePath, excludeGlobs)) {
+ return false;
+ }
+ if (ignoreGlobs.length > 0 && matchesAny(filePath, ignoreGlobs)) {
+ return false;
+ }
+ return true;
+ }
+ async function fileExists(filePath) {
+ try {
+ await promises_1.default.access(filePath);
+ return true;
+ }
+ catch {
+ return false;
+ }
+ }
+ async function atomicWriteUtf8(targetPath, content) {
+ await promises_1.default.mkdir(node_path_1.default.dirname(targetPath), { recursive: true });
+ const tempPath = `${targetPath}.${process.pid}.${Date.now()}.tmp`;
+ const handle = await promises_1.default.open(tempPath, "w");
+ try {
+ await handle.writeFile(content, "utf8");
+ await handle.sync();
+ }
+ finally {
+ await handle.close();
+ }
+ await promises_1.default.rename(tempPath, targetPath);
+ }
+ async function writeManifest(manifestPath, manifest) {
+ await atomicWriteUtf8(manifestPath, `${JSON.stringify(manifest, null, 2)}\n`);
+ }
+ async function writeSummaryAsset(summaryPath, summary) {
+ await atomicWriteUtf8(summaryPath, `${JSON.stringify(summary, null, 2)}\n`);
  }
  async function readManifest(manifestPath) {
  try {
- return JSON.parse(await promises_1.default.readFile(manifestPath, "utf8"));
+ const parsed = JSON.parse(await promises_1.default.readFile(manifestPath, "utf8"));
+ if (parsed.schemaVersion !== artifacts_1.MANIFEST_SCHEMA_VERSION) {
+ throw new Error(`Unsupported manifest schema in ${manifestPath}. Expected schemaVersion ${artifacts_1.MANIFEST_SCHEMA_VERSION}.`);
+ }
+ return {
+ schemaVersion: artifacts_1.MANIFEST_SCHEMA_VERSION,
+ lastSyncedCommit: typeof parsed.lastSyncedCommit === "string" ? parsed.lastSyncedCommit : "",
+ files: parsed.files && typeof parsed.files === "object" ? parsed.files : {}
+ };
  }
  catch (error) {
  const nodeError = error;
  if (nodeError.code === "ENOENT") {
- return { lastSyncedCommit: "", files: {} };
+ return {
+ schemaVersion: artifacts_1.MANIFEST_SCHEMA_VERSION,
+ lastSyncedCommit: "",
+ files: {}
+ };
  }
  throw error;
  }
  }
- async function writeManifest(manifestPath, manifest) {
- await promises_1.default.mkdir(node_path_1.default.dirname(manifestPath), { recursive: true });
- await promises_1.default.writeFile(manifestPath, `${JSON.stringify(manifest, null, 2)}\n`, "utf8");
+ async function readIgnorePatterns(repoPath, ignoreFilePath) {
+ const absolutePath = node_path_1.default.isAbsolute(ignoreFilePath)
+ ? ignoreFilePath
+ : node_path_1.default.resolve(repoPath, ignoreFilePath);
+ try {
+ const raw = await promises_1.default.readFile(absolutePath, "utf8");
+ return raw
+ .split(/\r?\n/)
+ .map((line) => line.trim())
+ .filter((line) => line.length > 0 && !line.startsWith("#"))
+ .map(normalizeGlobPattern);
+ }
+ catch (error) {
+ const nodeError = error;
+ if (nodeError.code === "ENOENT") {
+ return [];
+ }
+ throw error;
+ }
  }
- async function walkCodeFiles(rootPath, currentPath = rootPath) {
+ async function walkCodeFiles(rootPath, includeGlobs, excludeGlobs, ignoreGlobs, currentPath = rootPath) {
  const entries = await promises_1.default.readdir(currentPath, { withFileTypes: true });
  const files = [];
  for (const entry of entries) {
  const entryPath = node_path_1.default.join(currentPath, entry.name);
  if (entry.isDirectory()) {
- if (!IGNORED_DIRECTORIES.has(entry.name)) {
- files.push(...await walkCodeFiles(rootPath, entryPath));
- }
+ files.push(...await walkCodeFiles(rootPath, includeGlobs, excludeGlobs, ignoreGlobs, entryPath));
  continue;
  }
- if (entry.isFile() && isTargetCodeFile(entry.name)) {
- files.push(normalizeRelativePath(node_path_1.default.relative(rootPath, entryPath)));
+ if (entry.isFile()) {
+ const relativePath = normalizeRelativePath(node_path_1.default.relative(rootPath, entryPath));
+ if (shouldIncludeFile(relativePath, includeGlobs, excludeGlobs, ignoreGlobs)) {
+ files.push(relativePath);
+ }
  }
  }
  return files.sort();
  }
- async function summarizeFile(rootPath, relativePath, config) {
- const absolutePath = node_path_1.default.join(rootPath, relativePath);
- const rawCodeSnapshot = await promises_1.default.readFile(absolutePath, "utf8");
- return {
- hash: (0, hashing_1.hashFileContent)(rawCodeSnapshot),
- summaryText: await (0, llm_1.generateFunctionalSummary)(relativePath, rawCodeSnapshot, config.chat),
- rawCodeSnapshot
+ function countHashRefs(files) {
+ const refs = new Map();
+ for (const hash of Object.values(files)) {
+ refs.set(hash, (refs.get(hash) || 0) + 1);
+ }
+ return refs;
+ }
+ async function deleteSummaryIfUnreferenced(summaryDir, hash, refs) {
+ if ((refs.get(hash) || 0) > 0) {
+ return;
+ }
+ const summaryPath = getSummaryPath(summaryDir, hash);
+ try {
+ await promises_1.default.unlink(summaryPath);
+ }
+ catch (error) {
+ const nodeError = error;
+ if (nodeError.code !== "ENOENT") {
+ throw error;
+ }
+ }
+ }
+ async function setManifestPathHash(filePath, newHash, manifest, manifestPath, summaryDir, refs) {
+ const previousHash = manifest.files[filePath];
+ if (previousHash === newHash) {
+ return;
+ }
+ if (previousHash) {
+ refs.set(previousHash, Math.max((refs.get(previousHash) || 1) - 1, 0));
+ }
+ manifest.files[filePath] = newHash;
+ refs.set(newHash, (refs.get(newHash) || 0) + 1);
+ await writeManifest(manifestPath, manifest);
+ if (previousHash) {
+ await deleteSummaryIfUnreferenced(summaryDir, previousHash, refs);
+ }
+ }
+ async function removeManifestPath(filePath, manifest, manifestPath, summaryDir, refs) {
+ const previousHash = manifest.files[filePath];
+ if (!previousHash) {
+ return;
+ }
+ delete manifest.files[filePath];
+ refs.set(previousHash, Math.max((refs.get(previousHash) || 1) - 1, 0));
+ await writeManifest(manifestPath, manifest);
+ await deleteSummaryIfUnreferenced(summaryDir, previousHash, refs);
+ }
+ async function ensureSummaryAsset(summaryDir, hash, summaryText, rawCodeSnapshot, includeCodeSnapshot) {
+ const summaryPath = getSummaryPath(summaryDir, hash);
+ if (await fileExists(summaryPath)) {
+ return;
+ }
+ const summary = {
+ schemaVersion: artifacts_1.SUMMARY_ASSET_SCHEMA_VERSION,
+ content_hash: hash,
+ summary: summaryText,
+ raw_code_snapshot: includeCodeSnapshot ? rawCodeSnapshot : undefined
  };
+ await writeSummaryAsset(summaryPath, summary);
+ }
+ async function pruneOrphanedSummaries(summaryDir, manifest) {
+ const activeHashes = new Set(Object.values(manifest.files));
+ let entries = [];
+ try {
+ entries = await promises_1.default.readdir(summaryDir);
+ }
+ catch (error) {
+ const nodeError = error;
+ if (nodeError.code === "ENOENT") {
+ return;
+ }
+ throw error;
+ }
+ for (const entry of entries) {
+ if (!entry.endsWith(".json")) {
+ continue;
+ }
+ const hash = entry.slice(0, -5);
+ if (activeHashes.has(hash)) {
+ continue;
+ }
+ await promises_1.default.unlink(node_path_1.default.resolve(summaryDir, entry));
+ }
  }
  async function runSummarize(options, config) {
  if (options.mode !== "all" && options.mode !== "delta") {
@@ -68,46 +248,93 @@ async function runSummarize(options, config) {
  const commandCwd = process.cwd();
  const repoPath = node_path_1.default.resolve(commandCwd, options.path);
  const manifestPath = (0, paths_1.resolveDiffdocArtifactPath)(options.out, config.baseDir);
- const manifest = options.mode === "delta" ? await readManifest(manifestPath) : { lastSyncedCommit: "", files: {} };
+ const summaryDir = getSummaryDir(manifestPath);
+ const manifest = await readManifest(manifestPath);
+ const refs = countHashRefs(manifest.files);
+ const includePatterns = compileGlobs((options.includeGlobs && options.includeGlobs.length > 0)
+ ? options.includeGlobs.map(normalizeGlobPattern)
+ : config.summarize.includeGlobs.map(normalizeGlobPattern));
+ const excludePatterns = compileGlobs((options.excludeGlobs && options.excludeGlobs.length > 0)
+ ? options.excludeGlobs.map(normalizeGlobPattern)
+ : config.summarize.excludeGlobs.map(normalizeGlobPattern));
+ const ignoreFile = options.ignoreFile || config.summarize.ignoreFile;
+ const ignorePatterns = compileGlobs(await readIgnorePatterns(repoPath, ignoreFile));
+ const failures = [];
  if (options.mode === "all") {
- const files = await walkCodeFiles(repoPath);
  manifest.files = {};
+ refs.clear();
+ await writeManifest(manifestPath, manifest);
+ const files = await walkCodeFiles(repoPath, includePatterns, excludePatterns, ignorePatterns);
  for (const filePath of files) {
- manifest.files[filePath] = await summarizeFile(repoPath, filePath, config);
- console.log(`Summarized ${filePath}`);
+ try {
+ const absolutePath = node_path_1.default.join(repoPath, filePath);
+ const rawCodeSnapshot = await promises_1.default.readFile(absolutePath, "utf8");
+ const hash = (0, hashing_1.hashFileContent)(rawCodeSnapshot);
+ const summaryPath = getSummaryPath(summaryDir, hash);
+ if (!await fileExists(summaryPath)) {
+ const summaryText = await (0, llm_1.generateFunctionalSummary)(filePath, rawCodeSnapshot, config.chat);
+ await ensureSummaryAsset(summaryDir, hash, summaryText, rawCodeSnapshot, options.includeCodeSnapshot);
+ }
+ manifest.files[filePath] = hash;
+ refs.set(hash, (refs.get(hash) || 0) + 1);
+ await writeManifest(manifestPath, manifest);
+ console.log(`Summarized ${filePath}`);
+ }
+ catch (error) {
+ const message = error instanceof Error ? error.message : String(error);
+ failures.push({ filePath, message });
+ console.error(`Failed ${filePath}: ${message}`);
+ }
  }
  }
  else {
  const deltas = await (0, git_1.getGitDeltas)(repoPath, manifest.lastSyncedCommit);
  for (const deletedPath of deltas.deleted) {
- delete manifest.files[deletedPath];
+ await removeManifestPath(deletedPath, manifest, manifestPath, summaryDir, refs);
  console.log(`Pruned ${deletedPath}`);
  }
  for (const filePath of deltas.modifiedOrAdded) {
- const absolutePath = node_path_1.default.join(repoPath, filePath);
  try {
+ if (!shouldIncludeFile(filePath, includePatterns, excludePatterns, ignorePatterns)) {
+ await removeManifestPath(filePath, manifest, manifestPath, summaryDir, refs);
+ continue;
+ }
+ const previousHash = manifest.files[filePath];
+ const absolutePath = node_path_1.default.join(repoPath, filePath);
  const rawCodeSnapshot = await promises_1.default.readFile(absolutePath, "utf8");
  const hash = (0, hashing_1.hashFileContent)(rawCodeSnapshot);
- if (manifest.files[filePath]?.hash === hash)
+ if (previousHash === hash) {
  continue;
- manifest.files[filePath] = {
- hash,
- summaryText: await (0, llm_1.generateFunctionalSummary)(filePath, rawCodeSnapshot, config.chat),
- rawCodeSnapshot
- };
+ }
+ const summaryPath = getSummaryPath(summaryDir, hash);
+ if (!await fileExists(summaryPath)) {
+ const summaryText = await (0, llm_1.generateFunctionalSummary)(filePath, rawCodeSnapshot, config.chat);
+ await ensureSummaryAsset(summaryDir, hash, summaryText, rawCodeSnapshot, options.includeCodeSnapshot);
+ }
+ await setManifestPathHash(filePath, hash, manifest, manifestPath, summaryDir, refs);
  console.log(`Updated ${filePath}`);
  }
  catch (error) {
  const nodeError = error;
  if (nodeError.code === "ENOENT") {
- delete manifest.files[filePath];
+ await removeManifestPath(filePath, manifest, manifestPath, summaryDir, refs);
  continue;
  }
- throw error;
+ const message = error instanceof Error ? error.message : String(error);
+ failures.push({ filePath, message });
+ console.error(`Failed ${filePath}: ${message}`);
  }
  }
  }
  manifest.lastSyncedCommit = await (0, git_1.getCurrentCommit)(repoPath);
  await writeManifest(manifestPath, manifest);
+ await pruneOrphanedSummaries(summaryDir, manifest);
  console.log(`Wrote manifest to ${manifestPath}`);
+ if (failures.length > 0) {
+ console.error(`\n${failures.length} file(s) failed during summarization:`);
+ for (const failure of failures) {
+ console.error(`- ${failure.filePath}: ${failure.message}`);
+ }
+ throw new Error("Summarization completed with failures.");
+ }
  }
package/dist/config.js CHANGED
@@ -9,6 +9,22 @@ const node_path_1 = __importDefault(require("node:path"));
  function readOption(value, envName, fallback = "") {
  return value || process.env[envName] || fallback;
  }
+ function parseCsv(value) {
+ return value.split(",").map((item) => item.trim()).filter(Boolean);
+ }
+ function readListOption(value, envName, fallback = []) {
+ if (Array.isArray(value)) {
+ return value.flatMap((item) => parseCsv(item)).filter(Boolean);
+ }
+ if (typeof value === "string" && value.trim()) {
+ return parseCsv(value);
+ }
+ const envValue = process.env[envName];
+ if (envValue && envValue.trim()) {
+ return parseCsv(envValue);
+ }
+ return fallback;
+ }
  function loadRcFile(configPath) {
  const resolvedPath = node_path_1.default.resolve(process.cwd(), configPath || ".diffdocrc");
  if (!node_fs_1.default.existsSync(resolvedPath)) {
@@ -41,6 +57,9 @@ function buildRuntimeConfig(options, needs = { chat: true, embeddings: true }) {
  const mergedOptions = mergeConfigOptions(options);
  const provider = readProvider(mergedOptions.aiProvider);
  const apiKey = readOption(mergedOptions.openaiApiKey, "OPENAI_API_KEY", provider === "local" ? "local-key" : "");
+ const includeGlobs = readListOption(mergedOptions.includeGlobs, "DIFFDOC_INCLUDE_GLOBS");
+ const excludeGlobs = readListOption(mergedOptions.excludeGlobs, "DIFFDOC_EXCLUDE_GLOBS");
+ const ignoreFile = readOption(mergedOptions.ignoreFile, "DIFFDOC_IGNORE_FILE", ".diffdocignore");
  const chatBaseURL = provider === "cloud"
  ? readOption(mergedOptions.cloudLlmEndpoint, "CLOUD_LLM_ENDPOINT", "https://api.openai.com/v1")
  : readOption(mergedOptions.localLlmEndpoint, "LOCAL_LLM_ENDPOINT");
@@ -80,6 +99,11 @@ function buildRuntimeConfig(options, needs = { chat: true, embeddings: true }) {
  apiKey,
  baseURL: embedBaseURL,
  model: embedModel
+ },
+ summarize: {
+ includeGlobs,
+ excludeGlobs,
+ ignoreFile
  }
  };
  }
package/dist/index.js CHANGED
@@ -8,6 +8,10 @@ const query_1 = require("./commands/query");
  const summarize_1 = require("./commands/summarize");
  const llm_1 = require("./utils/llm");
  const program = new commander_1.Command();
+ function collectOption(value, previous) {
+ previous.push(value);
+ return previous;
+ }
  function addBaseOptions(command) {
  return command
  .option("--config <path>", "path to .diffdocrc JSON config file")
@@ -43,10 +47,22 @@ addChatOptions(addBaseOptions(program
      .option("--path <path>", "repository or code path to scan", ".")
      .option("--out <path>", "manifest output path under --base-dir", "manifest.json")
      .option("--mode <mode>", "summarization mode: all or delta", "all")
+     .option("--include-code-snapshot", "store raw code in summary assets", false)
+     .option("--include-glob <pattern>", "include glob pattern (repeatable)", collectOption, [])
+     .option("--exclude-glob <pattern>", "exclude glob pattern (repeatable)", collectOption, [])
+     .option("--ignore-file <path>", "path to ignore pattern file relative to --path")
      .action(async (options) => {
      try {
          const config = (0, config_1.buildRuntimeConfig)(options, { chat: true });
-         await (0, summarize_1.runSummarize)({ path: options.path, out: options.out, mode: options.mode }, config);
+         await (0, summarize_1.runSummarize)({
+             path: options.path,
+             out: options.out,
+             mode: options.mode,
+             includeCodeSnapshot: options.includeCodeSnapshot,
+             includeGlobs: options.includeGlob,
+             excludeGlobs: options.excludeGlob,
+             ignoreFile: options.ignoreFile
+         }, config);
      }
      catch (error) {
          console.error(error instanceof Error ? error.message : error);
@@ -104,10 +120,11 @@ addCloudEndpointAndKeyOptions(addEmbeddingOptions(addBaseOptions(program
      .command("embed"))))
      .description("Embed manifest summaries into a local Vectra index")
      .option("--manifest <path>", "manifest input path under --base-dir", "manifest.json")
+     .option("--rebuild", "rebuild local index from scratch", false)
      .action(async (options) => {
      try {
          const config = (0, config_1.buildRuntimeConfig)(options, { embeddings: true });
-         await (0, embed_1.runEmbed)({ manifest: options.manifest }, config);
+         await (0, embed_1.runEmbed)({ manifest: options.manifest, rebuild: options.rebuild }, config);
      }
      catch (error) {
          console.error(error instanceof Error ? error.message : error);
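The `--include-glob` and `--exclude-glob` options in the index.js hunks above are marked repeatable by passing `collectOption` as the option processor with `[]` as the starting value. The sketch below shows how such a reducer accumulates values across repeated flags. `foldRepeatedFlag` is a hypothetical stand-in for the parser's behavior, written here so the example runs without Commander installed; it is not part of DiffDoc or Commander.

```javascript
// Same reducer as collectOption in dist/index.js.
function collectOption(value, previous) {
    previous.push(value);
    return previous;
}

// HYPOTHETICAL stand-in for a CLI parser: each occurrence of the flag is
// handed to the reducer together with the value accumulated so far.
function foldRepeatedFlag(occurrences, reducer, initial) {
    return occurrences.reduce((previous, value) => reducer(value, previous), initial);
}

// Equivalent in spirit to: --include-glob "src/**" --include-glob "lib/**"
const defaults = [];
const includeGlob = foldRepeatedFlag(["src/**", "lib/**"], collectOption, defaults);

// push() mutates its argument, so the result is the very array passed as the default.
const sameArray = includeGlob === defaults;

console.log(includeGlob, sameArray);
```

Because `push` mutates the array it receives, the default array itself is what accumulates. That is harmless for a single CLI invocation; if the program were parsed more than once in one process (for example in tests), a non-mutating reducer that returns `previous.concat([value])` would avoid carrying values over.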
@@ -39,7 +39,7 @@ function buildAnswerPrompt(question, results) {
              `File: ${result.filePath}`,
              `Score: ${result.score}`,
              `Summary:\n${result.summaryText}`,
-             `Code Snapshot:\n${result.rawCodeSnapshot}`
+             `Code Snapshot:\n${result.rawCodeSnapshot || "(not stored)"}`
          ].join("\n");
      }).join("\n\n---\n\n");
      return `Answer the user's question using only the retrieved DiffDoc results below. If the results do not contain enough information, say what is missing. Prefer a direct answer first, then cite the relevant file paths. Keep the explanation appropriate to the question: summarize when asked for a summary, explain implementation details when asked how something works, and avoid unsupported claims.\n\nUser question:\n${question}\n\nRetrieved results:\n${context}`;
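Since raw code is now stored only when `--include-code-snapshot` is set, a retrieved result may have no `rawCodeSnapshot`. The hunk above adds a fallback so the prompt does not interpolate `undefined`. A minimal sketch of the effect follows; `formatResult` is a hypothetical helper that mirrors the per-result block of `buildAnswerPrompt`, and the sample results are invented.

```javascript
// HYPOTHETICAL helper mirroring the per-result block built in buildAnswerPrompt.
function formatResult(result) {
    return [
        `File: ${result.filePath}`,
        `Score: ${result.score}`,
        `Summary:\n${result.summaryText}`,
        // Falls back to a placeholder when no snapshot was stored.
        `Code Snapshot:\n${result.rawCodeSnapshot || "(not stored)"}`
    ].join("\n");
}

// Invented sample results, one with a snapshot and one without.
const withSnapshot = formatResult({
    filePath: "src/a.js",
    score: 0.91,
    summaryText: "Parses config.",
    rawCodeSnapshot: "const a = 1;"
});
const withoutSnapshot = formatResult({
    filePath: "src/b.js",
    score: 0.42,
    summaryText: "Loads files."
});

console.log(withSnapshot);
console.log(withoutSnapshot);
```

The `||` operator also replaces an empty-string snapshot with the placeholder, which seems to be the intended reading of "not stored".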
@@ -0,0 +1,5 @@
+ "use strict";
+ Object.defineProperty(exports, "__esModule", { value: true });
+ exports.SUMMARY_ASSET_SCHEMA_VERSION = exports.MANIFEST_SCHEMA_VERSION = void 0;
+ exports.MANIFEST_SCHEMA_VERSION = 2;
+ exports.SUMMARY_ASSET_SCHEMA_VERSION = 1;
package/dist/utils/git.js CHANGED
@@ -7,16 +7,12 @@ exports.getGitDeltas = getGitDeltas;
  exports.getCurrentCommit = getCurrentCommit;
  const node_path_1 = __importDefault(require("node:path"));
  const simple_git_1 = __importDefault(require("simple-git"));
- const TARGET_EXTENSIONS = new Set([".ts", ".js", ".cs", ".py"]);
  function normalizePath(filePath) {
      return filePath.split(node_path_1.default.sep).join("/");
  }
- function isTargetCodeFile(filePath) {
-     return TARGET_EXTENSIONS.has(node_path_1.default.extname(filePath));
- }
  function addUnique(target, filePath) {
      const normalized = normalizePath(filePath.trim());
-     if (normalized && isTargetCodeFile(normalized)) {
+     if (normalized) {
          target.add(normalized);
      }
  }
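With the hard-coded extension filter removed from git.js above, `addUnique` now keeps every non-empty changed path regardless of extension; file selection is presumably left to the new include/exclude globs and ignore file, although that logic is not part of this hunk. A small sketch of the new behavior, using invented paths:

```javascript
const path = require("node:path");

// Same helpers as the 0.3.0 side of the git.js diff.
function normalizePath(filePath) {
    return filePath.split(path.sep).join("/");
}

function addUnique(target, filePath) {
    const normalized = normalizePath(filePath.trim());
    if (normalized) {
        target.add(normalized);
    }
}

// Invented sample of paths as they might appear in git output lines.
const changed = new Set();
for (const entry of ["src/app.ts", " README.md ", "", "src/app.ts", "config/settings.yaml"]) {
    addUnique(changed, entry);
}

// Blank entries are skipped and duplicates collapse; no extension is rejected.
const changedList = [...changed];
console.log(changedList);
```

Under 0.2.0, `README.md` and `config/settings.yaml` would have been dropped because only `.ts`, `.js`, `.cs`, and `.py` files passed `isTargetCodeFile`.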
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "diffdoc",
-   "version": "0.2.0",
+   "version": "0.3.0",
    "description": "Translate repository code shifts into plain-English business context",
    "license": "MIT",
    "author": "Christopher Sullivan",