agentic-dataset-builder 0.2.0 → 0.2.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +121 -30
- package/dist/cli.js +17 -0
- package/dist/schemas/source.d.ts +23 -0
- package/dist/schemas/source.js +19 -0
- package/dist/sources/claude.js +2 -3
- package/dist/sources/codex.js +2 -7
- package/dist/sources/pi.js +5 -10
- package/package.json +23 -4
package/README.md
CHANGED
@@ -1,20 +1,27 @@
 # Agentic Dataset Builder
 
-Pure TypeScript CLI for
+Pure TypeScript CLI for turning local Pi, Codex, and Claude Code history into one validated `dataset.parquet` file.
 
-##
+## Goal
 
-
+Use this repo when you want an AI coding assistant to do one job end-to-end:
 
-
+1. discover local session history
+2. normalize it into the local Qwen35-compatible schema
+3. label records by training use
+4. write one final parquet dataset
 
-
+The CLI is native Node.js + TypeScript. It does not require Python.
+
+## Fastest path
+
+If the package is published on npm:
 
 ```bash
-npx agentic-dataset-builder@0.2.
+npx --registry=https://registry.npmjs.org/ agentic-dataset-builder@0.2.1 --output-root ./out
 ```
 
-
+If working from this repo locally:
 
 ```bash
 npm install
@@ -22,42 +29,126 @@ npm run build
 node dist/cli.js --output-root ./out
 ```
 
-##
+## What the command does
 
-
-# Pi + Codex
-npx agentic-dataset-builder@0.2.0 --output-root ./out --include-sources pi,codex --include-labels cot_eligible,agent_only
+The CLI will:
 
-
-
+- detect local session roots for `pi`, `codex`, and `claude`
+- read supported history files
+- validate normalized records with `Zod`
+- keep only the labels you requested
+- write one final parquet file
+- write a manifest and a run log
 
-
-
-
+## Default source behavior
+
+- `pi`
+  - full agent traces
+  - can produce `cot_eligible` or `agent_only`
+- `codex`
+  - full agent traces
+  - usually produces `agent_only`
+- `claude`
+  - prompt history only for now
+  - produces `prompt_only`
 
-
+Claude is intentionally low-fidelity right now. It is not treated as a full assistant/tool trace source.
 
-
+## Default output
+
+Each run creates one directory:
 
 ```text
-
+<output-root>/agentic-dataset-<timestamp>/
   dataset.parquet
   manifest.json
   run.log
 ```
 
-
-
-- `
+Files:
+
+- `dataset.parquet`
+  - final merged dataset
+- `manifest.json`
+  - source roots, source counts, labels kept, output path
+- `run.log`
+  - step-by-step execution log for debugging
+
+## Recommended commands
+
+Pi + Codex:
+
+```bash
+node dist/cli.js --output-root ./out --include-sources pi,codex --include-labels cot_eligible,agent_only
+```
+
+Codex + Claude prompt-only:
+
+```bash
+node dist/cli.js --output-root ./out --include-sources codex,claude --include-labels agent_only,prompt_only
+```
+
+Pi only:
+
+```bash
+node dist/cli.js --output-root ./out --include-sources pi --include-labels cot_eligible,agent_only
+```
+
+## Important flags
+
+- `--output-root <dir>`
+  - required output root
+- `--include-sources <csv>`
+  - any of: `pi,codex,claude`
+- `--include-labels <csv>`
+  - any of: `cot_eligible,agent_only,prompt_only,discard`
+- `--pi-root <dir>`
+  - override detected Pi session path
+- `--codex-root <dir>`
+  - override detected Codex session path
+- `--claude-root <dir>`
+  - override detected Claude project-history path
+- `--help`
+  - print CLI help
+
+## Auto-detected paths
+
+The CLI tries OS-specific defaults automatically.
+
+Typical paths:
+
+- Pi: `~/.pi/agent/sessions`
+- Codex: `~/.codex/sessions`
+- Claude: `~/.claude/projects`
+
+On Windows it also checks `APPDATA` and `LOCALAPPDATA` variants.
+
+## Verification checklist
+
+After a run, verify these three things:
 
-
+1. `dataset.parquet` exists
+2. `manifest.json` exists
+3. `run.log` does not end with an uncaught error
 
-
-
-
+Typical quick check:
+
+```bash
+ls ./out/agentic-dataset-*/
+```
+
+## Development notes
+
+Useful development commands:
+
+```bash
+npm run check
+npm run test
+npm run build
+```
 
-
+This repo currently includes:
 
--
--
--
+- Zod validation for source events and final records
+- Vitest coverage for core schema and labeling paths
+- native parquet writing in TypeScript
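The verification checklist added to the README can be approximated in code. The sketch below is not part of the package; `missingArtifacts` is a hypothetical helper that takes a plain directory listing and reports which of the three documented files are absent. It does not inspect the contents of `run.log`, whose format the diff does not show.

```typescript
// Hypothetical helper, not shipped by agentic-dataset-builder.
// The three file names are the ones listed in the README's "Default output" section.
const EXPECTED_ARTIFACTS = ["dataset.parquet", "manifest.json", "run.log"] as const;

// Returns the expected artifact names that do not appear in the listing.
function missingArtifacts(fileNames: readonly string[]): string[] {
  const present = new Set(fileNames);
  return EXPECTED_ARTIFACTS.filter((name) => !present.has(name));
}
```

One way to use it would be to pass the result of `fs.readdirSync(runDir)` from `node:fs`, where `runDir` is one `agentic-dataset-<timestamp>` directory, and treat a non-empty result as a failed run.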
package/dist/cli.js
CHANGED
@@ -9,6 +9,23 @@ import { labelRecord } from './labeling.js';
 import { Qwen35RecordSchema } from './schemas/qwen35.js';
 import { writeParquet } from './parquet.js';
 function parseArgs(argv) {
+    if (argv.includes('--help') || argv.includes('-h')) {
+        console.log(`agentic-dataset-builder@0.2.1
+
+Usage:
+npx agentic-dataset-builder@0.2.1 --output-root ./out
+
+Options:
+--output-root <dir> Output directory root
+--include-sources <list> Comma-separated: pi,codex,claude
+--include-labels <list> Comma-separated: cot_eligible,agent_only,prompt_only,discard
+--pi-root <dir> Override Pi session root
+--codex-root <dir> Override Codex session root
+--claude-root <dir> Override Claude project history root
+--help Show this help message
+`);
+        process.exit(0);
+    }
     const args = {
         outputRoot: './out',
         includeSources: ['pi', 'codex'],
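The hunk above shows only the start of `parseArgs`, so the rest of its body is unknown. As a rough illustration of how the help flag, the `./out` default, and the comma-separated `--include-sources` value could fit together, here is a minimal sketch. `parseFlags`, `ParsedFlags`, and the error message are hypothetical names introduced for this example. Unlike the real CLI, the sketch returns a `help` flag instead of calling `process.exit`.

```typescript
// Hypothetical sketch: this is NOT the package's parseArgs implementation.
type Source = "pi" | "codex" | "claude";

interface ParsedFlags {
  help: boolean;
  outputRoot: string;
  includeSources: Source[];
}

const KNOWN_SOURCES: readonly Source[] = ["pi", "codex", "claude"];

function isSource(value: string): value is Source {
  return (KNOWN_SOURCES as readonly string[]).includes(value);
}

function parseFlags(argv: string[]): ParsedFlags {
  // Defaults mirror the two values visible in the diff context lines.
  const flags: ParsedFlags = {
    help: argv.includes("--help") || argv.includes("-h"),
    outputRoot: "./out",
    includeSources: ["pi", "codex"],
  };
  for (let i = 0; i < argv.length; i += 1) {
    const value = argv[i + 1];
    if (argv[i] === "--output-root" && value !== undefined) {
      flags.outputRoot = value;
      i += 1; // skip the consumed value
    } else if (argv[i] === "--include-sources" && value !== undefined) {
      const names = value
        .split(",")
        .map((name) => name.trim())
        .filter((name) => name.length > 0);
      const unknown = names.filter((name) => !isSource(name));
      if (unknown.length > 0) {
        throw new Error(`Unknown source(s): ${unknown.join(", ")}`);
      }
      flags.includeSources = names.filter(isSource);
      i += 1; // skip the consumed value
    }
  }
  return flags;
}
```

Returning a flag rather than exiting keeps the function easy to test; a caller could print the help text and exit when `help` is true.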
package/dist/schemas/source.d.ts
ADDED
@@ -0,0 +1,23 @@
+import { z } from 'zod';
+export declare const PiSessionHeaderSchema: z.ZodObject<{
+    type: z.ZodLiteral<"session">;
+    id: z.ZodString;
+    timestamp: z.ZodString;
+    cwd: z.ZodOptional<z.ZodString>;
+}, z.core.$loose>;
+export declare const PiSessionEntrySchema: z.ZodObject<{
+    type: z.ZodString;
+    id: z.ZodOptional<z.ZodString>;
+    parentId: z.ZodOptional<z.ZodNullable<z.ZodString>>;
+    timestamp: z.ZodOptional<z.ZodString>;
+}, z.core.$loose>;
+export declare const CodexEntrySchema: z.ZodObject<{
+    timestamp: z.ZodOptional<z.ZodString>;
+    type: z.ZodString;
+    payload: z.ZodOptional<z.ZodRecord<z.ZodString, z.ZodUnknown>>;
+}, z.core.$loose>;
+export declare const ClaudeProjectEntrySchema: z.ZodRecord<z.ZodString, z.ZodUnknown>;
+export type PiSessionHeader = z.infer<typeof PiSessionHeaderSchema>;
+export type PiSessionEntry = z.infer<typeof PiSessionEntrySchema>;
+export type CodexEntry = z.infer<typeof CodexEntrySchema>;
+export type ClaudeProjectEntry = z.infer<typeof ClaudeProjectEntrySchema>;
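To make the `PiSessionHeaderSchema` declaration easier to read, the sketch below spells out roughly what shape it describes, without depending on Zod. `PiSessionHeaderShape` and `isPiSessionHeader` are hypothetical names for this illustration, and the index signature is an approximation of how a loose object schema treats unknown keys, not a copy of the inferred `PiSessionHeader` type.

```typescript
// Approximate, hand-written shape; the real type comes from z.infer in the package.
interface PiSessionHeaderShape {
  type: "session";
  id: string;
  timestamp: string;
  cwd?: string;
  // Loose objects tolerate extra keys, modelled here as an index signature.
  [extra: string]: unknown;
}

// Dependency-free type guard that checks the same four declared fields.
function isPiSessionHeader(row: unknown): row is PiSessionHeaderShape {
  if (typeof row !== "object" || row === null || Array.isArray(row)) return false;
  const candidate = row as Record<string, unknown>;
  return (
    candidate["type"] === "session" &&
    typeof candidate["id"] === "string" &&
    typeof candidate["timestamp"] === "string" &&
    (candidate["cwd"] === undefined || typeof candidate["cwd"] === "string")
  );
}
```

A guard like this returns a boolean, whereas the package's `parse` calls throw on invalid input; which behavior is preferable depends on whether one bad row should abort the whole run.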
package/dist/schemas/source.js
ADDED
@@ -0,0 +1,19 @@
+import { z } from 'zod';
+export const PiSessionHeaderSchema = z.object({
+    type: z.literal('session'),
+    id: z.string(),
+    timestamp: z.string(),
+    cwd: z.string().optional(),
+}).passthrough();
+export const PiSessionEntrySchema = z.object({
+    type: z.string(),
+    id: z.string().optional(),
+    parentId: z.string().nullable().optional(),
+    timestamp: z.string().optional(),
+}).passthrough();
+export const CodexEntrySchema = z.object({
+    timestamp: z.string().optional(),
+    type: z.string(),
+    payload: z.record(z.string(), z.unknown()).optional(),
+}).passthrough();
+export const ClaudeProjectEntrySchema = z.record(z.string(), z.unknown());
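For readers unfamiliar with optional fields combined with `.passthrough()`, the following dependency-free sketch approximates the checks that `CodexEntrySchema` appears to express: a required string `type`, an optional string `timestamp`, an optional object `payload`, and unknown keys left in place. `checkCodexEntry` and its error messages are hypothetical. Zod's own error reporting and edge-case handling differ from this hand-written version.

```typescript
// Hand-written approximation; the package itself validates with Zod.
type LooseRecord = Record<string, unknown>;

function isPlainObject(value: unknown): value is LooseRecord {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Throws on an invalid row; otherwise returns the row unchanged,
// so keys outside the declared fields survive.
function checkCodexEntry(row: unknown): LooseRecord {
  if (!isPlainObject(row)) throw new Error("entry must be an object");
  if (typeof row["type"] !== "string") throw new Error("type must be a string");
  if (row["timestamp"] !== undefined && typeof row["timestamp"] !== "string") {
    throw new Error("timestamp must be a string when present");
  }
  if (row["payload"] !== undefined && !isPlainObject(row["payload"])) {
    throw new Error("payload must be an object when present");
  }
  return row;
}
```

Keeping unknown keys matters for history files whose format may grow new fields: a strict check would reject such rows, while this loose one lets them through.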
package/dist/sources/claude.js
CHANGED
@@ -1,13 +1,12 @@
 import fg from 'fast-glob';
-import { z } from 'zod';
 import { Qwen35RecordSchema } from '../schemas/qwen35.js';
+import { ClaudeProjectEntrySchema } from '../schemas/source.js';
 import { readJsonl } from '../utils/jsonl.js';
-const EntrySchema = z.record(z.string(), z.unknown());
 export async function collectClaudePromptOnlyRecords(root) {
     const files = await fg('**/*.jsonl', { cwd: root, absolute: true, onlyFiles: true });
     const records = [];
     for (const file of files.sort()) {
-        const entries = (await readJsonl(file)).map((row) =>
+        const entries = (await readJsonl(file)).map((row) => ClaudeProjectEntrySchema.parse(row));
         for (const entry of entries) {
             if (entry.type !== 'user')
                 continue;
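The loop above skips every entry whose `type` is not `'user'`, which is how Claude history ends up as prompt-only data. A minimal sketch of that filter in isolation, assuming entries are already parsed into loose records, could look like this; `userEntriesOnly` is a hypothetical name and the rest of the real function body is not shown in the diff.

```typescript
// Loose record type standing in for a parsed Claude project-history row.
type LooseEntry = Record<string, unknown>;

// Keeps only rows whose `type` field is exactly the string "user".
function userEntriesOnly(entries: readonly LooseEntry[]): LooseEntry[] {
  return entries.filter((entry) => entry["type"] === "user");
}
```

Because the schema is a plain string-keyed record, rows without a `type` key pass validation and are simply dropped by the filter.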
package/dist/sources/codex.js
CHANGED
@@ -1,12 +1,7 @@
 import fg from 'fast-glob';
-import { z } from 'zod';
 import { Qwen35RecordSchema } from '../schemas/qwen35.js';
+import { CodexEntrySchema } from '../schemas/source.js';
 import { readJsonl } from '../utils/jsonl.js';
-const EntrySchema = z.object({
-    timestamp: z.string().optional(),
-    type: z.string(),
-    payload: z.record(z.string(), z.unknown()).optional(),
-}).passthrough();
 class TurnBuilder {
     sessionMeta;
     turnId;
@@ -204,7 +199,7 @@ export async function collectCodexRecords(root) {
     const files = await fg('**/*.jsonl', { cwd: root, absolute: true, onlyFiles: true });
     const records = [];
     for (const file of files.sort()) {
-        const entries = (await readJsonl(file)).map((entry) =>
+        const entries = (await readJsonl(file)).map((entry) => CodexEntrySchema.parse(entry));
         const sessionMeta = (entries.find((entry) => entry.type === 'session_meta')?.payload ?? {});
         let builder = null;
         for (const entry of entries) {
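The `sessionMeta` line in the second hunk combines `find`, optional chaining, and a nullish fallback. The sketch below isolates that lookup with explicit types so each case is visible; `CodexLikeEntry` and `sessionMetaOf` are hypothetical names and the interface is a simplified stand-in for the inferred `CodexEntry` type.

```typescript
// Simplified stand-in for a validated Codex entry.
interface CodexLikeEntry {
  type: string;
  timestamp?: string;
  payload?: Record<string, unknown>;
}

// Returns the payload of the first "session_meta" entry, or an empty object
// when there is no such entry or it carries no payload.
function sessionMetaOf(entries: readonly CodexLikeEntry[]): Record<string, unknown> {
  return entries.find((entry) => entry.type === "session_meta")?.payload ?? {};
}
```

The empty-object fallback means downstream code can read fields from the result without a separate null check.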
package/dist/sources/pi.js
CHANGED
@@ -1,20 +1,15 @@
 import fs from 'node:fs';
 import fg from 'fast-glob';
-import { z } from 'zod';
 import { Qwen35RecordSchema } from '../schemas/qwen35.js';
+import { PiSessionEntrySchema, PiSessionHeaderSchema } from '../schemas/source.js';
 import { isFile } from '../utils/common.js';
 import { readJsonl } from '../utils/jsonl.js';
-const SessionEntrySchema = z.object({
-    type: z.string(),
-    id: z.string().optional(),
-    parentId: z.string().nullable().optional(),
-    timestamp: z.string().optional(),
-}).passthrough();
 export async function collectPiRecords(root) {
     const files = await fg('**/*.jsonl', { cwd: root, absolute: true, onlyFiles: true });
     const records = [];
     for (const file of files.sort()) {
-        const
+        const rawRows = await readJsonl(file);
+        const rows = rawRows.map((row, index) => (index === 0 ? PiSessionHeaderSchema.parse(row) : PiSessionEntrySchema.parse(row)));
         if (!rows.length)
             continue;
         const header = rows[0];
@@ -25,7 +20,7 @@ export async function collectPiRecords(root) {
             if (!entry.id)
                 continue;
             byId.set(entry.id, entry);
-            const key = entry.parentId
+            const key = typeof entry.parentId === 'string' ? entry.parentId : null;
             const bucket = children.get(key) ?? [];
             bucket.push(entry.id);
             children.set(key, bucket);
@@ -47,7 +42,7 @@ function branchEntries(leaf, byId) {
         if (!entry)
             break;
         ordered.push(entry);
-        current = entry.parentId
+        current = typeof entry.parentId === 'string' ? entry.parentId : null;
     }
     return ordered.reverse();
 }
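Both changed lines in the last two hunks normalize `parentId` to either a string or `null` before using it as a map key or as the next node to visit. The sketch below reconstructs that pattern under stated assumptions: `TreeEntry`, `parentKey`, `indexChildren`, and `branchFromLeaf` are hypothetical names, the `seen` cycle guard is an addition that the diff does not show, and the parts of the real functions outside the hunks are unknown.

```typescript
// Simplified stand-in for a Pi session entry that has an id.
interface TreeEntry {
  id: string;
  parentId?: string | null;
}

// Collapses undefined and null into a single "root" key.
function parentKey(entry: TreeEntry): string | null {
  return typeof entry.parentId === "string" ? entry.parentId : null;
}

// Groups entry ids by their normalized parent key.
function indexChildren(entries: readonly TreeEntry[]): Map<string | null, string[]> {
  const children = new Map<string | null, string[]>();
  for (const entry of entries) {
    const key = parentKey(entry);
    const bucket = children.get(key) ?? [];
    bucket.push(entry.id);
    children.set(key, bucket);
  }
  return children;
}

// Walks from a leaf up to the root and returns the path in root-first order.
function branchFromLeaf(leafId: string, byId: Map<string, TreeEntry>): TreeEntry[] {
  const ordered: TreeEntry[] = [];
  const seen = new Set<string>(); // assumption: guard against cyclic parent links
  let current: string | null = leafId;
  while (current !== null && !seen.has(current)) {
    seen.add(current);
    const entry: TreeEntry | undefined = byId.get(current);
    if (!entry) break;
    ordered.push(entry);
    current = parentKey(entry);
  }
  return ordered.reverse();
}
```

Normalizing in one place means a missing `parentId` and an explicit `null` land in the same bucket, so root entries are not split across two map keys.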
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "agentic-dataset-builder",
-  "version": "0.2.
+  "version": "0.2.1",
   "description": "Pure TypeScript agentic dataset builder for Pi, Codex, and Claude Code history",
   "license": "MIT",
   "type": "module",
@@ -26,7 +26,24 @@
     "agentic-dataset-builder": "./dist/cli.js"
   },
   "files": [
-    "dist",
+    "dist/cli.js",
+    "dist/cli.d.ts",
+    "dist/labeling.js",
+    "dist/labeling.d.ts",
+    "dist/parquet.js",
+    "dist/parquet.d.ts",
+    "dist/platform",
+    "dist/schemas/qwen35.js",
+    "dist/schemas/qwen35.d.ts",
+    "dist/schemas/source.js",
+    "dist/schemas/source.d.ts",
+    "dist/sources/pi.js",
+    "dist/sources/pi.d.ts",
+    "dist/sources/codex.js",
+    "dist/sources/codex.d.ts",
+    "dist/sources/claude.js",
+    "dist/sources/claude.d.ts",
+    "dist/utils",
     "README.md",
     "LICENSE"
   ],
@@ -37,7 +54,8 @@
     "build": "tsc -p tsconfig.json",
     "dev": "tsx src/cli.ts",
     "check": "tsc -p tsconfig.json --noEmit",
-    "pack:check": "npm pack --dry-run"
+    "pack:check": "npm pack --dry-run",
+    "test": "vitest run"
   },
   "dependencies": {
     "fast-glob": "^3.3.3",
@@ -47,6 +65,7 @@
   "devDependencies": {
     "@types/node": "^24.7.2",
     "tsx": "^4.20.6",
-    "typescript": "^5.9.3"
+    "typescript": "^5.9.3",
+    "vitest": "^4.1.3"
   }
 }