@hardlydifficult/repo-processor 1.0.66 → 1.0.68

Files changed (2)
  1. package/README.md +197 -139
  2. package/package.json +3 -3
package/README.md CHANGED
@@ -1,6 +1,6 @@
1
1
  # @hardlydifficult/repo-processor
2
2
 
3
- Incremental GitHub repo processor with SHA-based stale detection, parallel file/dir processing, and git-backed YAML persistence.
3
+ Incremental GitHub repository processor with SHA-based stale detection, parallel batch processing (changed files first, then directories bottom-up), and git-backed YAML persistence.
4
4
 
5
5
  ## Installation
6
6
 
@@ -12,79 +12,60 @@ npm install @hardlydifficult/repo-processor
12
12
 
13
13
  ```typescript
14
14
  import { RepoProcessor, GitYamlStore } from "@hardlydifficult/repo-processor";
15
- import { GitHubClient } from "@hardlydifficult/github";
15
+ import { createGitHubClient } from "@hardlydifficult/github";
16
16
 
17
17
  // 1. Configure git-backed YAML store
18
18
  const store = new GitYamlStore({
19
19
  cloneUrl: "https://github.com/owner/repo.git",
20
- localPath: "/tmp/repo-store",
21
- resultDir: (owner, repo) => `results/${owner}/${repo}`,
22
- authToken: process.env.GITHUB_TOKEN,
23
- gitUser: { name: "Processor Bot", email: "bot@example.com" },
20
+ localPath: ".results",
21
+ resultDir: (owner, repo) => `repos/${owner}/${repo}`,
22
+ gitUser: { name: "CI", email: "ci@example.com" },
24
23
  });
25
24
 
26
25
  // 2. Define processing callbacks
27
26
  const callbacks = {
28
- shouldProcess: (entry) => entry.type === "blob" && entry.path.endsWith(".ts"),
27
+ shouldProcess: (entry) => entry.type === "blob",
29
28
  processFile: async ({ entry, content }) => ({
30
29
  path: entry.path,
31
- lineCount: content.split("\n").length,
30
+ length: content.length,
32
31
  }),
33
- processDirectory: async ({ path, subtreeFilePaths, children }) => ({
34
- path,
35
- files: subtreeFilePaths.length,
36
- dirs: children.filter((c) => c.isDir).length,
32
+ processDirectory: async (ctx) => ({
33
+ path: ctx.path,
34
+ fileCount: ctx.subtreeFilePaths.length,
37
35
  }),
38
36
  };
39
37
 
40
38
  // 3. Create and run processor
41
- const github = new GitHubClient({ token: process.env.GITHUB_TOKEN });
42
39
  const processor = new RepoProcessor({
43
- githubClient: github,
40
+ githubClient: createGitHubClient({ token: process.env.GITHUB_TOKEN! }),
44
41
  store,
45
42
  callbacks,
46
43
  });
47
44
 
48
45
  const result = await processor.run("owner", "repo");
49
- // => { filesProcessed: 12, filesRemoved: 1, dirsProcessed: 4 }
46
+ // { filesProcessed: 10, filesRemoved: 1, dirsProcessed: 3 }
50
47
  ```
51
48
 
52
- ## RepoProcessor: Incremental Processing
49
+ ## RepoProcessor: Incremental Repository Processing
53
50
 
54
- `RepoProcessor` executes an incremental pipeline for processing GitHub file trees: fetch tree → diff → process changed files → remove deleted files → resolve stale directories → commit.
51
+ The `RepoProcessor` class implements an incremental pipeline for processing GitHub repository file trees. It detects changes by comparing current and previous file SHAs, processes only changed files in parallel batches, and processes affected directories bottom-up.
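The change-detection step described above can be pictured as a comparison between the stored manifest and the current tree. The following is a minimal sketch, not the package's implementation: `diffManifest`, the local `TreeEntry` stand-in, and the `Record<string, string>` manifest shape (path to blob SHA) are assumptions made for illustration.

```typescript
// Local stand-in for a tree entry; only the fields this sketch needs.
interface TreeEntry {
  path: string;
  type: string; // "blob" for files, "tree" for directories
  sha: string;
}

interface ManifestDiff {
  changed: string[]; // new files, or files whose blob SHA differs
  removed: string[]; // files in the manifest that are no longer in the tree
}

// Hypothetical helper: compare previously stored SHAs with the current tree.
function diffManifest(
  manifest: Readonly<Record<string, string>>,
  tree: readonly TreeEntry[]
): ManifestDiff {
  // Index the current blob SHAs by path.
  const current = new Map<string, string>();
  tree.forEach((entry) => {
    if (entry.type === "blob") current.set(entry.path, entry.sha);
  });

  // A file is changed when it is new or its SHA differs from the stored one.
  const changed = Array.from(current.entries())
    .filter(([path, sha]) => manifest[path] !== sha)
    .map(([path]) => path);

  // A file is removed when the manifest knows it but the tree does not.
  const removed = Object.keys(manifest).filter((path) => !current.has(path));

  return { changed, removed };
}
```

Only the paths in `changed` would need their content fetched and passed to `processFile`; paths in `removed` correspond to the `filesRemoved` count in the result.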
55
52
 
56
- ```typescript
57
- import { RepoProcessor } from "@hardlydifficult/repo-processor";
58
-
59
- const processor = new RepoProcessor({
60
- githubClient,
61
- store,
62
- callbacks,
63
- concurrency: 5, // optional (default 5)
64
- branch: "main", // optional (default "main")
65
- });
66
-
67
- const result = await processor.run("owner", "repo", (progress) => {
68
- console.log(
69
- `Phase: ${progress.phase}, Files: ${progress.filesCompleted}/${progress.filesTotal}, Dirs: ${progress.dirsCompleted}/${progress.dirsTotal}`
70
- );
71
- });
72
- ```
73
-
74
- ### RepoProcessorConfig
53
+ ### Configuration
75
54
 
76
- | Field | Type | Required | Default |
77
- |-------|------|----------|---------|
78
- | `githubClient` | `GitHubClient` | Yes | |
79
- | `store` | `ProcessorStore` | Yes | |
80
- | `callbacks` | `ProcessorCallbacks` | Yes | |
81
- | `concurrency` | `number` | No | `5` |
82
- | `branch` | `string` | No | `"main"` |
55
+ | Field | Type | Default | Description |
56
+ |-------|------|:-------:|-------------|
57
+ | `githubClient` | `GitHubClient` | | GitHub API client from `@hardlydifficult/github` |
58
+ | `store` | `ProcessorStore` | | Persistence layer for file/dir results and manifests |
59
+ | `callbacks` | `ProcessorCallbacks` | | Domain logic for filtering, file processing, and directory processing |
60
+ | `concurrency` | `number` | `5` | Max concurrent file/dir processing per batch |
61
+ | `branch` | `string` | `"main"` | Git branch to fetch tree from |
83
62
 
84
63
  ### ProcessingResult
85
64
 
65
+ Result returned by `RepoProcessor.run()`.
66
+
86
67
  ```typescript
87
- {
68
+ interface ProcessingResult {
88
69
  filesProcessed: number; // Count of files processed
89
70
  filesRemoved: number; // Count of deleted files
90
71
  dirsProcessed: number; // Count of directories processed
@@ -93,52 +74,123 @@ const result = await processor.run("owner", "repo", (progress) => {
93
74
 
94
75
  ### File and Directory Contexts
95
76
 
77
+ All changed files are processed first. Directories are then processed bottom-up (deepest first), so child directories are always processed before their parents.
78
+
96
79
  ```typescript
97
80
  interface FileContext {
98
- entry: TreeEntry;
99
- content: string;
81
+ readonly entry: TreeEntry;
82
+ readonly content: string;
100
83
  }
101
84
 
102
85
  interface DirectoryContext {
103
86
  path: string;
104
87
  sha: string;
105
- subtreeFilePaths: string[];
106
- children: DirectoryChild[];
107
- tree: TreeEntry[];
88
+ subtreeFilePaths: readonly string[];
89
+ children: readonly DirectoryChild[];
90
+ tree: readonly TreeEntry[];
108
91
  }
109
92
 
110
93
  interface DirectoryChild {
111
- name: string;
112
- isDir: boolean;
113
- fullPath: string;
94
+ readonly name: string;
95
+ readonly isDir: boolean;
96
+ readonly fullPath: string;
114
97
  }
115
98
  ```
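To make the bottom-up ordering concrete, here is a small sketch of sorting directory paths deepest first. The helper names `depthOf` and `orderBottomUp` are hypothetical, and the package may well group directories by depth for parallel batches instead of producing one flat list; the sketch only shows the ordering guarantee.

```typescript
// Depth of a directory path; the root "" has depth 0.
function depthOf(dirPath: string): number {
  return dirPath === "" ? 0 : dirPath.split("/").length;
}

// Hypothetical helper: order directories so children come before parents.
// Ties at the same depth are broken alphabetically for a stable result.
function orderBottomUp(dirPaths: readonly string[]): string[] {
  return [...dirPaths].sort(
    (a, b) => depthOf(b) - depthOf(a) || a.localeCompare(b)
  );
}

const ordered = orderBottomUp(["", "src", "src/utils", "docs"]);
// ordered: ["src/utils", "docs", "src", ""]
```

With this ordering, a `processDirectory` call for `src` can rely on the result for `src/utils` already having been written.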
116
99
 
117
- ## RepoWatcher: SHA-based Triggering
100
+ ### Usage Example
101
+
102
+ ```typescript
103
+ import * as crypto from "node:crypto";
+ import { RepoProcessor, GitYamlStore } from "@hardlydifficult/repo-processor";
104
+ import { createGitHubClient } from "@hardlydifficult/github";
105
+
106
+ const processor = new RepoProcessor({
107
+ githubClient: createGitHubClient({ token: process.env.GITHUB_TOKEN! }),
108
+ store: new GitYamlStore({
109
+ cloneUrl: "https://github.com/hardlydifficult/results.git",
110
+ localPath: ".results",
111
+ resultDir: () => "repos",
112
+ gitUser: { name: "CI", email: "ci@example.com" },
113
+ }),
114
+ callbacks: {
115
+ shouldProcess: (entry) => entry.path.endsWith(".ts"),
116
+ processFile: async ({ entry, content }) => ({
117
+ lines: content.split("\n").length,
118
+ checksum: crypto.createHash("sha256").update(content).digest("hex"),
119
+ }),
120
+ processDirectory: async (ctx) => ({
121
+ path: ctx.path,
122
+ fileCount: ctx.subtreeFilePaths.length,
123
+ hasSubdirs: ctx.children.some((c) => c.isDir),
124
+ }),
125
+ },
126
+ concurrency: 10,
127
+ });
128
+
129
+ const result = await processor.run("hardlydifficult", "typescript", (progress) => {
130
+ console.log(`Phase: ${progress.phase} | Files: ${progress.filesCompleted}/${progress.filesTotal}`);
131
+ });
132
+ // => { filesProcessed: 12, filesRemoved: 0, dirsProcessed: 4 }
133
+ ```
134
+
135
+ ## RepoWatcher: SHA-based Change Monitoring
136
+
137
+ The `RepoWatcher` class watches GitHub repositories for SHA changes and triggers processing. It supports automatic state persistence, concurrent run prevention, pending SHA re-triggers, and retry logic.
138
+
139
+ ### Configuration
118
140
 
119
- `RepoWatcher` monitors GitHub repos for SHA changes and triggers processing with automatic retries, concurrency control, and state persistence.
141
+ | Field | Type | Default | Description |
142
+ |-------|------|:-------:|-------------|
143
+ | `stateKey` | `string` | — | Key used for persisting state (e.g., `"repo-processor"`). |
144
+ | `stateDirectory` | `string` | — | Directory where state is persisted. |
145
+ | `autoSaveMs` | `number` | `5000` | Auto-save interval in milliseconds. |
146
+ | `run` | `(owner: string, name: string) => Promise<T>` | — | Function to execute when processing is triggered. |
147
+ | `onComplete` | `(owner, name, result, sha) => void` | — | Called after a successful run (optional). |
148
+ | `onError` | `(owner, name, error) => void` | — | Called when a run fails (optional). |
149
+ | `onEvent` | `(event) => void` | — | Logger/event callback (optional). |
150
+ | `maxAttempts` | `number` | `1` | Max attempts for each run (includes initial + retries). |
151
+
152
+ ### Usage
120
153
 
121
154
  ```typescript
122
155
  import { RepoWatcher } from "@hardlydifficult/repo-processor";
156
+ import { RepoProcessor, GitYamlStore } from "@hardlydifficult/repo-processor";
+ import { createGitHubClient } from "@hardlydifficult/github";
123
157
 
124
158
  const watcher = new RepoWatcher({
125
- stateKey: "repo-state",
126
- stateDirectory: "/tmp/state",
159
+ stateKey: "repo-processor",
160
+ stateDirectory: ".state",
127
161
  run: async (owner, name) => {
128
- const processor = new RepoProcessor({ /* config */ });
162
+ const processor = new RepoProcessor({
163
+ githubClient: createGitHubClient({ token: process.env.GITHUB_TOKEN! }),
164
+ store: new GitYamlStore({
165
+ cloneUrl: "https://github.com/hardlydifficult/results.git",
166
+ localPath: ".results",
167
+ resultDir: () => "repos",
168
+ gitUser: { name: "CI", email: "ci@example.com" },
169
+ }),
170
+ callbacks: {
171
+ shouldProcess: (entry) => entry.path.endsWith(".ts"),
172
+ processFile: async ({ entry, content }) => ({
173
+ path: entry.path,
174
+ lines: content.split("\n").length,
175
+ }),
176
+ processDirectory: async (ctx) => ({
177
+ path: ctx.path,
178
+ fileCount: ctx.subtreeFilePaths.length,
179
+ }),
180
+ },
181
+ });
129
182
  return processor.run(owner, name);
130
183
  },
131
- onComplete: (owner, name, result, sha) => {
132
- console.log(`Completed ${owner}/${name}: ${result.filesProcessed} files`);
133
- },
134
- onError: (owner, name, error) => {
135
- console.error(`Failed ${owner}/${name}:`, error);
136
- },
137
- maxAttempts: 3, // optional retries
138
184
  });
139
185
 
140
- await watcher.init();
186
+ await watcher.init(); // Load persisted state
187
+ watcher.handlePush("hardlydifficult", "typescript", "abc123");
188
+ // Triggers processing if SHA differs from last tracked SHA
189
+ ```
141
190
 
191
+ ### Trigger Methods
192
+
193
+ ```typescript
142
194
  // Handle push events (SHA comparison performed automatically)
143
195
  watcher.handlePush("hardlydifficult", "typescript", "abc123...");
144
196
 
@@ -150,107 +202,84 @@ const response = await watcher.triggerManual("hardlydifficult", "typescript");
150
202
  // => { success: true, result: ProcessingResult } | { success: false, reason: string }
151
203
  ```
152
204
 
153
- ### RepoWatcherConfig
205
+ ## GitYamlStore: Git-Backed YAML Persistence
154
206
 
155
- | Field | Type | Required | Description |
156
- |-------|------|----------|-------------|
157
- | `stateKey` | `string` | Yes | Key for state persistence |
158
- | `stateDirectory` | `string` | Yes | Directory for state files |
159
- | `run` | `(owner, name) => Promise<TResult>` | Yes | Processing logic |
160
- | `onComplete` | `(owner, name, result, sha) => void` | No | Success callback |
161
- | `onError` | `(owner, name, error) => void` | No | Failure callback |
162
- | `autoSaveMs` | `number` | No | `5000` (5s) |
163
- | `maxAttempts` | `number` | No | `1` (no retry) |
207
+ The `GitYamlStore` class implements `ProcessorStore` by persisting results as YAML files in a git repository. Each result includes the tree SHA for change detection. It supports authenticated cloning, auto-pull, batch commits, and push with conflict resolution.
164
208
 
165
- ## GitYamlStore: YAML Persistence
209
+ ### Configuration
166
210
 
167
- `GitYamlStore` implements `ProcessorStore` by persisting results as YAML files in a git repository.
211
+ | Field | Type | Default | Description |
212
+ |-------|------|:-------:|-------------|
213
+ | `cloneUrl` | `string` | — | URL of the git repository to clone (e.g., `"https://github.com/user/results.git"`). |
214
+ | `localPath` | `string` | — | Local directory to clone the repo into. |
215
+ | `resultDir` | `(owner, repo) => string` | — | Function mapping owner/repo to result subdirectory. |
216
+ | `authToken` | `string` | `process.env.GITHUB_TOKEN` | GitHub token for authenticated clone/push. |
217
+ | `gitUser` | `{ name: string; email: string }` | — | Git user identity used when committing. |
168
218
 
169
- ```typescript
170
- import { GitYamlStore } from "@hardlydifficult/repo-processor";
171
-
172
- const store = new GitYamlStore({
173
- cloneUrl: "https://github.com/owner/repo.git",
174
- localPath: "/tmp/store",
175
- resultDir: (owner, repo) => `results/${owner}/${repo}`,
176
- authToken: process.env.GITHUB_TOKEN, // optional, falls back to env
177
- gitUser: { name: "Processor", email: "bot@example.com" },
178
- });
179
- ```
180
-
181
- ### Typed Result Loading
219
+ ### Loading Results with Schema Validation
182
220
 
183
221
  ```typescript
222
+ import { GitYamlStore } from "@hardlydifficult/repo-processor";
184
223
  import { z } from "zod";
185
224
 
186
- const fileSchema = z.object({
187
- path: z.string(),
188
- lineCount: z.number(),
189
- sha: z.string(),
225
+ const store = new GitYamlStore({
226
+ cloneUrl: "https://github.com/hardlydifficult/results.git",
227
+ localPath: ".results",
228
+ resultDir: (owner, repo) => `repos/${owner}/${repo}`,
229
+ gitUser: { name: "CI", email: "ci@example.com" },
190
230
  });
191
231
 
192
- const dirSchema = z.object({
232
+ const FileResultSchema = z.object({
193
233
  path: z.string(),
194
- files: z.number(),
195
- dirs: z.number(),
234
+ lines: z.number(),
196
235
  sha: z.string(),
197
236
  });
198
-
199
- const fileResult = await store.loadFileResult("owner", "repo", "src/index.ts", fileSchema);
200
- const dirResult = await store.loadDirResult("owner", "repo", "src", dirSchema);
237
+ const fileResult = await store.loadFileResult(
238
+ "hardlydifficult",
239
+ "typescript",
240
+ "src/index.ts",
241
+ FileResultSchema
242
+ );
243
+ // { path: "src/index.ts", lines: 12, sha: "abc..." }
201
244
  ```
202
245
 
203
- ## resolveStaleDirectories: Stale Directory Resolution
246
+ ## resolveStaleDirectories: Stale Directory Detection
204
247
 
205
- `resolveStaleDirectories` determines which directories need reprocessing by combining SHA-based detection with diff-derived stale directories.
248
+ The `resolveStaleDirectories` function identifies directories requiring reprocessing by combining diff-based stale directories with SHA-based comparison.
206
249
 
207
250
  ```typescript
208
- import { resolveStaleDirectories } from "@hardlydifficult/repo-processor";
251
+ import {
252
+ resolveStaleDirectories,
253
+ GitYamlStore,
254
+ } from "@hardlydifficult/repo-processor";
255
+ import { createGitHubClient } from "@hardlydifficult/github";
256
+
257
+ const client = createGitHubClient({ token: process.env.GITHUB_TOKEN! });
258
+ const { entries, rootSha } = await client.repo("owner", "repo").getFileTree();
259
+ const tree = [...entries, { path: "", type: "tree", sha: rootSha }];
209
260
 
210
261
  const staleDirs = await resolveStaleDirectories(
211
- owner,
212
- repo,
213
- staleDirsFromDiff,
214
- allFilePaths,
262
+ "owner",
263
+ "repo",
264
+ [], // stale dirs from diff
265
+ entries.filter((e) => e.type === "blob").map((e) => e.path),
215
266
  tree,
216
267
  store
217
268
  );
269
+ // ["src/utils", "src", ""] (bottom-up order inferred later)
218
270
  ```
219
271
 
220
- ### Algorithm
272
+ ### Stale Directory Logic
221
273
 
222
- - All directories derived from file paths (and root `""`) are checked
223
- - A directory is stale if:
224
- - Its stored SHA is missing, or
225
- - Its stored SHA differs from the current tree SHA
226
- - Stale directories from diff (e.g., due to file changes) are also included
274
+ 1. **Diff-based stale dirs**: directories with changed/removed children
275
+ 2. **SHA mismatch**: any directory whose stored SHA differs from the current tree SHA
227
276
 
228
- ## ProcessorStore Interface
229
-
230
- ```typescript
231
- interface ProcessorStore {
232
- ensureReady?(owner: string, repo: string): Promise<void>;
233
- getFileManifest(owner: string, repo: string): Promise<FileManifest>;
234
- getDirSha(owner: string, repo: string, dirPath: string): Promise<string | null>;
235
- writeFileResult(owner: string, repo: string, path: string, sha: string, result: unknown): Promise<void>;
236
- writeDirResult(owner: string, repo: string, path: string, sha: string, result: unknown): Promise<void>;
237
- deleteFileResult(owner: string, repo: string, path: string): Promise<void>;
238
- commitBatch(owner: string, repo: string, count: number): Promise<void>;
239
- }
240
- ```
241
-
242
- ## ProcessorCallbacks Interface
243
-
244
- ```typescript
245
- interface ProcessorCallbacks {
246
- shouldProcess(entry: TreeEntry): boolean;
247
- processFile(ctx: FileContext): Promise<unknown>;
248
- processDirectory(ctx: DirectoryContext): Promise<unknown>;
249
- }
250
- ```
277
+ Directories whose stored SHA is missing are always included.
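The rules above can be restated as a small pure function. This is a sketch under stated assumptions, not the package's source: `sketchStaleDirs` and `directoriesOf` are hypothetical names, `TreeEntry` is a local stand-in, and `storedShas` is a synchronous map standing in for what `store.getDirSha()` would return (an absent key means nothing is stored). The alphabetical sort is only there to make the output deterministic.

```typescript
// Local stand-in for a tree entry; only the fields this sketch needs.
interface TreeEntry {
  path: string;
  type: string; // "blob" for files, "tree" for directories
  sha: string;
}

// Every ancestor directory of the given files, plus the root "".
function directoriesOf(filePaths: readonly string[]): string[] {
  const dirs = new Set<string>([""]);
  filePaths.forEach((filePath) => {
    const parts = filePath.split("/");
    for (let i = 1; i < parts.length; i++) {
      dirs.add(parts.slice(0, i).join("/"));
    }
  });
  return Array.from(dirs);
}

// Hypothetical restatement of the stale-directory rules.
function sketchStaleDirs(
  staleDirsFromDiff: readonly string[],
  allFilePaths: readonly string[],
  tree: readonly TreeEntry[],
  storedShas: ReadonlyMap<string, string>
): string[] {
  // Current SHA of every directory in the tree.
  const currentShas = new Map<string, string>();
  tree.forEach((entry) => {
    if (entry.type === "tree") currentShas.set(entry.path, entry.sha);
  });

  // Rule 1: start from the directories the diff already marked as stale.
  const stale = new Set<string>(staleDirsFromDiff);

  // Rule 2: add directories whose stored SHA is missing or differs.
  directoriesOf(allFilePaths).forEach((dir) => {
    const stored = storedShas.get(dir);
    if (stored === undefined || stored !== currentShas.get(dir)) {
      stale.add(dir);
    }
  });

  return Array.from(stale).sort();
}
```

Because the result is a set union, a directory flagged by both rules appears only once.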
251
278
 
252
279
  ## Progress Reporting
253
280
 
281
+ Progress reported by the `onProgress` callback during `RepoProcessor.run()`.
282
+
254
283
  ```typescript
255
284
  interface ProcessingProgress {
256
285
  phase: "loading" | "files" | "directories" | "committing";
@@ -260,10 +289,39 @@ interface ProcessingProgress {
260
289
  dirsTotal: number;
261
290
  dirsCompleted: number;
262
291
  }
263
-
264
- type ProgressCallback = (progress: ProcessingProgress) => void;
265
292
  ```
266
293
 
294
+ | Phase | Description |
295
+ |-------|-------------|
296
+ | `"loading"` | Initial fetching of file tree. |
297
+ | `"files"` | Processing of files. |
298
+ | `"directories"` | Processing of directories (bottom-up). |
299
+ | `"committing"` | Final commit to persistence. |
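As an illustration of consuming these phases, the sketch below turns a progress update into a one-line status message. `ProgressLike` is a local stand-in limited to the fields visible in this README (the real `ProcessingProgress` may carry more), and `formatProgress` is a hypothetical helper.

```typescript
// Stand-in for ProcessingProgress, limited to the fields shown in this README.
interface ProgressLike {
  phase: "loading" | "files" | "directories" | "committing";
  filesTotal: number;
  filesCompleted: number;
  dirsTotal: number;
  dirsCompleted: number;
}

// Hypothetical formatter: one status line per progress update.
function formatProgress(p: ProgressLike): string {
  switch (p.phase) {
    case "loading":
      return "Fetching file tree";
    case "files":
      return `Files ${p.filesCompleted}/${p.filesTotal}`;
    case "directories":
      return `Directories ${p.dirsCompleted}/${p.dirsTotal}`;
    case "committing":
      return "Committing results";
  }
}
```

A formatter like this could be passed as the third argument to `processor.run()`, for example `(p) => console.log(formatProgress(p))`.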
300
+
301
+ ## ProcessorStore Interface
302
+
303
+ Consumer-implemented persistence layer.
304
+
305
+ | Method | Description |
306
+ |--------|-------------|
307
+ | `ensureReady?(owner, repo)` | One-time initialization (optional). |
308
+ | `getFileManifest(owner, repo)` | Return manifest of previously processed file SHAs (path → blob SHA). |
309
+ | `getDirSha(owner, repo, dirPath)` | Return stored SHA for a directory. Null if not stored. |
310
+ | `writeFileResult(owner, repo, path, sha, result)` | Persist result for a processed file. |
311
+ | `writeDirResult(owner, repo, path, sha, result)` | Persist result for a processed directory. |
312
+ | `deleteFileResult(owner, repo, path)` | Delete stored result for a removed file. |
313
+ | `commitBatch(owner, repo, count)` | Commit current batch of changes. |
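For tests or local experiments, a store does not have to be git-backed. The following in-memory sketch follows the method signatures listed for `ProcessorStore` in the previous README version; the `FileManifest` shape (path to blob SHA), the class name `InMemoryStore`, and the storage layout are assumptions. In real use the class would be declared `implements ProcessorStore` after importing that type from the package.

```typescript
// Assumed manifest shape: file path -> blob SHA.
type FileManifest = Record<string, string>;

interface StoredResult {
  sha: string;
  result: unknown;
}

// Illustrative in-memory store; nothing is persisted across processes.
class InMemoryStore {
  readonly files = new Map<string, StoredResult>();
  readonly dirs = new Map<string, StoredResult>();
  readonly commits: string[] = [];

  private key(owner: string, repo: string, path: string): string {
    return `${owner}/${repo}:${path}`;
  }

  async getFileManifest(owner: string, repo: string): Promise<FileManifest> {
    const prefix = `${owner}/${repo}:`;
    const manifest: FileManifest = {};
    this.files.forEach((value, key) => {
      if (key.startsWith(prefix)) manifest[key.slice(prefix.length)] = value.sha;
    });
    return manifest;
  }

  async getDirSha(owner: string, repo: string, dirPath: string): Promise<string | null> {
    return this.dirs.get(this.key(owner, repo, dirPath))?.sha ?? null;
  }

  async writeFileResult(owner: string, repo: string, path: string, sha: string, result: unknown): Promise<void> {
    this.files.set(this.key(owner, repo, path), { sha, result });
  }

  async writeDirResult(owner: string, repo: string, path: string, sha: string, result: unknown): Promise<void> {
    this.dirs.set(this.key(owner, repo, path), { sha, result });
  }

  async deleteFileResult(owner: string, repo: string, path: string): Promise<void> {
    this.files.delete(this.key(owner, repo, path));
  }

  // Records the batch instead of committing anywhere.
  async commitBatch(owner: string, repo: string, count: number): Promise<void> {
    this.commits.push(`${owner}/${repo} (${count} changes)`);
  }
}
```

The optional `ensureReady` method is omitted here because an in-memory store needs no initialization.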
314
+
315
+ ## ProcessorCallbacks Interface
316
+
317
+ Consumer-provided domain logic.
318
+
319
+ | Method | Description |
320
+ |--------|-------------|
321
+ | `shouldProcess(entry)` | Filter: which tree entries should be processed? |
322
+ | `processFile(ctx)` | Process a single changed file. Return value passed to `store.writeFileResult`. |
323
+ | `processDirectory(ctx)` | Process a directory after all children. Return value passed to `store.writeDirResult`. |
324
+
267
325
  ## Setup
268
326
 
269
327
  No external service setup beyond GitHub is required. The package uses `@hardlydifficult/github` for tree fetches and `simple-git` for git operations.
@@ -271,5 +329,5 @@ No external service setup beyond GitHub is required. The package uses `@hardlydi
271
329
  ### Environment Variables
272
330
 
273
331
  | Variable | Usage |
274
- |----------|-------|
332
+ |---------|-------|
275
333
  | `GITHUB_TOKEN` | Used by `GitYamlStore` for authenticated git operations if `authToken` not provided |
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@hardlydifficult/repo-processor",
3
- "version": "1.0.66",
3
+ "version": "1.0.68",
4
4
  "main": "./dist/index.js",
5
5
  "types": "./dist/index.d.ts",
6
6
  "files": [
@@ -16,7 +16,7 @@
16
16
  },
17
17
  "dependencies": {
18
18
  "@hardlydifficult/collections": "1.0.9",
19
- "@hardlydifficult/github": "1.0.30",
19
+ "@hardlydifficult/github": "1.0.31",
20
20
  "@hardlydifficult/state-tracker": "2.0.20",
21
21
  "@hardlydifficult/text": "1.0.29",
22
22
  "simple-git": "3.32.2",
@@ -25,7 +25,7 @@
25
25
  },
26
26
  "peerDependencies": {
27
27
  "@hardlydifficult/collections": "1.0.9",
28
- "@hardlydifficult/github": "1.0.30",
28
+ "@hardlydifficult/github": "1.0.31",
29
29
  "@hardlydifficult/state-tracker": "2.0.20",
30
30
  "@hardlydifficult/text": "1.0.29"
31
31
  },