agentic-dataset-builder 0.2.0 → 0.2.2

package/README.md CHANGED
@@ -1,20 +1,27 @@
1
1
  # Agentic Dataset Builder
2
2
 
3
- Pure TypeScript CLI for building one merged parquet dataset from local Pi, Codex, and Claude Code history.
3
+ Pure TypeScript CLI for turning local Pi, Codex, and Claude Code history into one validated `dataset.parquet` file.
4
4
 
5
- ## Requirements
5
+ ## Goal
6
6
 
7
- - Node 18+
7
+ Use this repo when you want an AI coding assistant to do one job end-to-end:
8
8
 
9
- ## Install and run
9
+ 1. discover local session history
10
+ 2. normalize it into the local Qwen35-compatible schema
11
+ 3. label records by training use
12
+ 4. write one final parquet dataset
10
13
 
11
- Without installing globally:
14
+ The CLI is native Node.js + TypeScript. It does not require Python.
15
+
16
+ ## Fastest path
17
+
18
+ If the package is published on npm:
12
19
 
13
20
  ```bash
14
- npx agentic-dataset-builder@0.2.0 --output-root ./out
21
+ npx --registry=https://registry.npmjs.org/ agentic-dataset-builder@0.2.2 --output-root ./out
15
22
  ```
16
23
 
17
- Local development:
24
+ If working from this repo locally:
18
25
 
19
26
  ```bash
20
27
  npm install
@@ -22,42 +29,126 @@ npm run build
22
29
  node dist/cli.js --output-root ./out
23
30
  ```
24
31
 
25
- ## Examples
32
+ ## What the command does
26
33
 
27
- ```bash
28
- # Pi + Codex
29
- npx agentic-dataset-builder@0.2.0 --output-root ./out --include-sources pi,codex --include-labels cot_eligible,agent_only
34
+ The CLI will:
30
35
 
31
- # Codex + Claude prompt-only
32
- npx agentic-dataset-builder@0.2.0 --output-root ./out --include-sources codex,claude --include-labels agent_only,prompt_only
36
+ - detect local session roots for `pi`, `codex`, and `claude`
37
+ - read supported history files
38
+ - validate normalized records with `Zod`
39
+ - keep only the labels you requested
40
+ - write one final parquet file
41
+ - write a manifest and a run log
33
42
 
34
- # Pi only
35
- npx agentic-dataset-builder@0.2.0 --output-root ./out --include-sources pi --include-labels cot_eligible,agent_only
36
- ```
43
+ ## Default source behavior
44
+
45
+ - `pi`
46
+   - full agent traces
47
+   - can produce `cot_eligible` or `agent_only`
48
+ - `codex`
49
+   - full agent traces
50
+   - usually produces `agent_only`
51
+ - `claude`
52
+   - prompt history only for now
53
+   - produces `prompt_only`
37
54
 
38
- ## Output
55
+ Claude is intentionally low-fidelity right now. It is not treated as a full assistant/tool trace source.
39
56
 
40
- Each run creates a directory like:
57
+ ## Default output
58
+
59
+ Each run creates one directory:
41
60
 
42
61
  ```text
43
- out/agentic-dataset-<timestamp>/
62
+ <output-root>/agentic-dataset-<timestamp>/
44
63
  dataset.parquet
45
64
  manifest.json
46
65
  run.log
47
66
  ```
48
67
 
49
- - `dataset.parquet`: final merged dataset
50
- - `manifest.json`: source roots, counts, labels, and summary stats
51
- - `run.log`: step-by-step execution log
68
+ Files:
69
+
70
+ - `dataset.parquet`
71
+   - final merged dataset
72
+ - `manifest.json`
73
+   - source roots, source counts, labels kept, output path
74
+ - `run.log`
75
+   - step-by-step execution log for debugging
76
+
77
+ ## Recommended commands
78
+
79
+ Pi + Codex:
80
+
81
+ ```bash
82
+ node dist/cli.js --output-root ./out --include-sources pi,codex --include-labels cot_eligible,agent_only
83
+ ```
84
+
85
+ Codex + Claude prompt-only:
86
+
87
+ ```bash
88
+ node dist/cli.js --output-root ./out --include-sources codex,claude --include-labels agent_only,prompt_only
89
+ ```
90
+
91
+ Pi only:
92
+
93
+ ```bash
94
+ node dist/cli.js --output-root ./out --include-sources pi --include-labels cot_eligible,agent_only
95
+ ```
96
+
97
+ ## Important flags
98
+
99
+ - `--output-root <dir>`
100
+   - required output root
101
+ - `--include-sources <csv>`
102
+   - any of: `pi,codex,claude`
103
+ - `--include-labels <csv>`
104
+   - any of: `cot_eligible,agent_only,prompt_only,discard`
105
+ - `--pi-root <dir>`
106
+   - override detected Pi session path
107
+ - `--codex-root <dir>`
108
+   - override detected Codex session path
109
+ - `--claude-root <dir>`
110
+   - override detected Claude project-history path
111
+ - `--help`
112
+   - print CLI help
113
+
114
+ ## Auto-detected paths
115
+
116
+ The CLI tries OS-specific defaults automatically.
117
+
118
+ Typical paths:
119
+
120
+ - Pi: `~/.pi/agent/sessions`
121
+ - Codex: `~/.codex/sessions`
122
+ - Claude: `~/.claude/projects`
123
+
124
+ On Windows it also checks `APPDATA` and `LOCALAPPDATA` variants.
125
+
126
+ ## Verification checklist
127
+
128
+ After a run, verify these three things:
52
129
 
53
- ## Source support
130
+ 1. `dataset.parquet` exists
131
+ 2. `manifest.json` exists
132
+ 3. `run.log` does not end with an uncaught error
54
133
 
55
- - `pi`: full agent trace with visible reasoning when available
56
- - `codex`: agent trace, often without visible reasoning
57
- - `claude`: prompt-history only for now, labeled `prompt_only`
134
+ Typical quick check:
135
+
136
+ ```bash
137
+ ls ./out/agentic-dataset-*/
138
+ ```
139
+
140
+ ## Development notes
141
+
142
+ Useful development commands:
143
+
144
+ ```bash
145
+ npm run check
146
+ npm run test
147
+ npm run build
148
+ ```
58
149
 
59
- ## Notes
150
+ This repo currently includes:
60
151
 
61
- - default source roots are auto-detected for Linux, macOS, and Windows
62
- - override paths with `--pi-root`, `--codex-root`, and `--claude-root`
63
- - Claude is intentionally low-fidelity right now: user prompt history only, not full assistant/tool trace
152
+ - Zod validation for source events and final records
153
+ - Vitest coverage for core schema and labeling paths
154
+ - native parquet writing in TypeScript
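
Reviewer note: the three-item verification checklist in the new README could be scripted roughly as follows. This is a minimal sketch, not part of the package. `verifyRun` is a hypothetical helper, and the format of `run.log` is not shown in this diff, so the "ends with an error" test below simply assumes that a failed run leaves the word "error" in the last non-empty log line.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Hypothetical helper (not in the package): checks one run directory against
// the README checklist and returns a list of human-readable problems.
export function verifyRun(runDir: string): string[] {
  const problems: string[] = [];

  // Items 1 and 2: both output files must exist.
  for (const name of ['dataset.parquet', 'manifest.json']) {
    if (!fs.existsSync(path.join(runDir, name))) {
      problems.push(`${name} is missing`);
    }
  }

  // Item 3: the log should not end with an error.
  const logPath = path.join(runDir, 'run.log');
  if (!fs.existsSync(logPath)) {
    problems.push('run.log is missing');
  } else {
    const lines = fs
      .readFileSync(logPath, 'utf8')
      .split(/\r?\n/)
      .filter((line) => line.trim() !== '');
    const last = lines[lines.length - 1] ?? '';
    // Assumption: an uncaught error leaves "error" (any case) in the final line.
    if (/error/i.test(last)) {
      problems.push(`run.log ends with an error: ${last}`);
    }
  }
  return problems;
}
```

An empty return value means all three checks passed; anything else lists what to inspect before trusting the dataset.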
package/dist/cli.js CHANGED
@@ -9,6 +9,23 @@ import { labelRecord } from './labeling.js';
9
9
  import { Qwen35RecordSchema } from './schemas/qwen35.js';
10
10
  import { writeParquet } from './parquet.js';
11
11
  function parseArgs(argv) {
12
+ if (argv.includes('--help') || argv.includes('-h')) {
13
+ console.log(`agentic-dataset-builder@0.2.2
14
+
15
+ Usage:
16
+ npx agentic-dataset-builder@0.2.2 --output-root ./out
17
+
18
+ Options:
19
+ --output-root <dir> Output directory root
20
+ --include-sources <list> Comma-separated: pi,codex,claude
21
+ --include-labels <list> Comma-separated: cot_eligible,agent_only,prompt_only,discard
22
+ --pi-root <dir> Override Pi session root
23
+ --codex-root <dir> Override Codex session root
24
+ --claude-root <dir> Override Claude project history root
25
+ --help Show this help message
26
+ `);
27
+ process.exit(0);
28
+ }
12
29
  const args = {
13
30
  outputRoot: './out',
14
31
  includeSources: ['pi', 'codex'],
@@ -1,13 +1,36 @@
1
1
  import os from 'node:os';
2
2
  import path from 'node:path';
3
3
  import fs from 'node:fs';
4
+ function currentPlatform() {
5
+ return process.platform;
6
+ }
7
+ function xdgConfigHome() {
8
+ return process.env.XDG_CONFIG_HOME;
9
+ }
10
+ function xdgDataHome() {
11
+ return process.env.XDG_DATA_HOME;
12
+ }
13
+ function macAppSupportHome() {
14
+ if (currentPlatform() !== 'darwin')
15
+ return undefined;
16
+ return path.join(os.homedir(), 'Library', 'Application Support');
17
+ }
4
18
  export function candidatePiRoots() {
5
19
  const home = os.homedir();
6
20
  const appdata = process.env.APPDATA;
7
21
  const localappdata = process.env.LOCALAPPDATA;
22
+ const xdgConfig = xdgConfigHome();
23
+ const xdgData = xdgDataHome();
24
+ const macAppSupport = macAppSupportHome();
8
25
  return dedupe([
9
26
  process.env.PI_SESSION_ROOT,
10
27
  path.join(home, '.pi', 'agent', 'sessions'),
28
+ xdgConfig ? path.join(xdgConfig, 'pi', 'agent', 'sessions') : undefined,
29
+ xdgConfig ? path.join(xdgConfig, '.pi', 'agent', 'sessions') : undefined,
30
+ xdgData ? path.join(xdgData, 'pi', 'agent', 'sessions') : undefined,
31
+ xdgData ? path.join(xdgData, '.pi', 'agent', 'sessions') : undefined,
32
+ macAppSupport ? path.join(macAppSupport, 'pi', 'agent', 'sessions') : undefined,
33
+ macAppSupport ? path.join(macAppSupport, '.pi', 'agent', 'sessions') : undefined,
11
34
  appdata ? path.join(appdata, 'pi', 'agent', 'sessions') : undefined,
12
35
  appdata ? path.join(appdata, '.pi', 'agent', 'sessions') : undefined,
13
36
  localappdata ? path.join(localappdata, 'pi', 'agent', 'sessions') : undefined,
@@ -18,9 +41,18 @@ export function candidateCodexRoots() {
18
41
  const home = os.homedir();
19
42
  const appdata = process.env.APPDATA;
20
43
  const localappdata = process.env.LOCALAPPDATA;
44
+ const xdgConfig = xdgConfigHome();
45
+ const xdgData = xdgDataHome();
46
+ const macAppSupport = macAppSupportHome();
21
47
  return dedupe([
22
48
  process.env.CODEX_SESSION_ROOT,
23
49
  path.join(home, '.codex', 'sessions'),
50
+ xdgConfig ? path.join(xdgConfig, 'codex', 'sessions') : undefined,
51
+ xdgConfig ? path.join(xdgConfig, '.codex', 'sessions') : undefined,
52
+ xdgData ? path.join(xdgData, 'codex', 'sessions') : undefined,
53
+ xdgData ? path.join(xdgData, '.codex', 'sessions') : undefined,
54
+ macAppSupport ? path.join(macAppSupport, 'Codex', 'sessions') : undefined,
55
+ macAppSupport ? path.join(macAppSupport, '.codex', 'sessions') : undefined,
24
56
  appdata ? path.join(appdata, 'Codex', 'sessions') : undefined,
25
57
  appdata ? path.join(appdata, '.codex', 'sessions') : undefined,
26
58
  localappdata ? path.join(localappdata, 'Codex', 'sessions') : undefined,
@@ -31,9 +63,18 @@ export function candidateClaudeRoots() {
31
63
  const home = os.homedir();
32
64
  const appdata = process.env.APPDATA;
33
65
  const localappdata = process.env.LOCALAPPDATA;
66
+ const xdgConfig = xdgConfigHome();
67
+ const xdgData = xdgDataHome();
68
+ const macAppSupport = macAppSupportHome();
34
69
  return dedupe([
35
70
  process.env.CLAUDE_SESSION_ROOT,
36
71
  path.join(home, '.claude', 'projects'),
72
+ xdgConfig ? path.join(xdgConfig, 'claude', 'projects') : undefined,
73
+ xdgConfig ? path.join(xdgConfig, '.claude', 'projects') : undefined,
74
+ xdgData ? path.join(xdgData, 'claude', 'projects') : undefined,
75
+ xdgData ? path.join(xdgData, '.claude', 'projects') : undefined,
76
+ macAppSupport ? path.join(macAppSupport, 'Claude', 'projects') : undefined,
77
+ macAppSupport ? path.join(macAppSupport, '.claude', 'projects') : undefined,
37
78
  appdata ? path.join(appdata, 'Claude', 'projects') : undefined,
38
79
  appdata ? path.join(appdata, '.claude', 'projects') : undefined,
39
80
  localappdata ? path.join(localappdata, 'Claude', 'projects') : undefined,
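
Reviewer note: the three `candidate*Roots` functions in this hunk follow one pattern: an explicit override variable first, then the home-directory dotfolder, then one pair of candidates per optional base directory. The sketch below generalises that pattern to make the ordering easier to see. It is illustrative only: `candidateRoots` and its parameters are hypothetical, the body of the package's `dedupe` is not shown in this diff (the version here is a stand-in that drops empty values and keeps first occurrences), and the sketch uses a single application directory name where the real code varies the casing per base (`codex` under XDG, `Codex` elsewhere).

```typescript
import * as path from 'node:path';

// Stand-in for the package's `dedupe` (assumption): remove undefined or empty
// entries and keep the first occurrence of each path, preserving order.
function dedupe(values: Array<string | undefined>): string[] {
  const present = values.filter((v): v is string => typeof v === 'string' && v.length > 0);
  return [...new Set(present)];
}

// Hypothetical generalisation of candidatePiRoots / candidateCodexRoots /
// candidateClaudeRoots. All inputs are passed in so the function is easy to test.
export function candidateRoots(
  home: string,
  env: Record<string, string | undefined>,
  platform: string,
  overrideVar: string, // e.g. 'CODEX_SESSION_ROOT'
  dotDir: string, // e.g. '.codex'
  appDir: string, // e.g. 'codex'
  tail: string[], // e.g. ['sessions']
): string[] {
  // Mirrors macAppSupportHome(): only defined on macOS.
  const macAppSupport =
    platform === 'darwin' ? path.join(home, 'Library', 'Application Support') : undefined;
  const bases = [env.XDG_CONFIG_HOME, env.XDG_DATA_HOME, macAppSupport, env.APPDATA, env.LOCALAPPDATA];

  return dedupe([
    env[overrideVar],
    path.join(home, dotDir, ...tail),
    // Each base that is set contributes a plain and a dot-prefixed candidate.
    ...bases.flatMap((base) =>
      base ? [path.join(base, appDir, ...tail), path.join(base, dotDir, ...tail)] : [],
    ),
  ]);
}
```

Because the override variable comes first and duplicates are dropped in order, an explicitly configured root always wins over any auto-detected location.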
@@ -0,0 +1 @@
1
+ export {};
@@ -0,0 +1,36 @@
1
+ import { describe, expect, it, vi, beforeEach, afterEach } from 'vitest';
2
+ const envBackup = { ...process.env };
3
+ describe('platform path candidates', () => {
4
+ beforeEach(() => {
5
+ vi.resetModules();
6
+ process.env = { ...envBackup };
7
+ });
8
+ afterEach(() => {
9
+ process.env = { ...envBackup };
10
+ vi.unstubAllGlobals();
11
+ });
12
+ it('includes XDG candidates for pi on linux', async () => {
13
+ process.env.XDG_CONFIG_HOME = '/home/test/.config';
14
+ process.env.XDG_DATA_HOME = '/home/test/.local/share';
15
+ vi.stubGlobal('process', { ...process, platform: 'linux', env: process.env });
16
+ const mod = await import('./paths.js');
17
+ const candidates = mod.candidatePiRoots();
18
+ expect(candidates).toContain('/home/test/.config/pi/agent/sessions');
19
+ expect(candidates).toContain('/home/test/.local/share/pi/agent/sessions');
20
+ });
21
+ it('includes Application Support candidates for codex on macOS', async () => {
22
+ vi.stubGlobal('process', { ...process, platform: 'darwin', env: process.env });
23
+ const mod = await import('./paths.js');
24
+ const candidates = mod.candidateCodexRoots();
25
+ expect(candidates.some((value) => value.includes('Library/Application Support/Codex/sessions'))).toBe(true);
26
+ });
27
+ it('includes APPDATA and LOCALAPPDATA candidates for claude on Windows', async () => {
28
+ process.env.APPDATA = 'C:/Users/test/AppData/Roaming';
29
+ process.env.LOCALAPPDATA = 'C:/Users/test/AppData/Local';
30
+ vi.stubGlobal('process', { ...process, platform: 'win32', env: process.env });
31
+ const mod = await import('./paths.js');
32
+ const candidates = mod.candidateClaudeRoots();
33
+ expect(candidates).toContain('C:/Users/test/AppData/Roaming/Claude/projects');
34
+ expect(candidates).toContain('C:/Users/test/AppData/Local/Claude/projects');
35
+ });
36
+ });
@@ -0,0 +1,23 @@
1
+ import { z } from 'zod';
2
+ export declare const PiSessionHeaderSchema: z.ZodObject<{
3
+ type: z.ZodLiteral<"session">;
4
+ id: z.ZodString;
5
+ timestamp: z.ZodString;
6
+ cwd: z.ZodOptional<z.ZodString>;
7
+ }, z.core.$loose>;
8
+ export declare const PiSessionEntrySchema: z.ZodObject<{
9
+ type: z.ZodString;
10
+ id: z.ZodOptional<z.ZodString>;
11
+ parentId: z.ZodOptional<z.ZodNullable<z.ZodString>>;
12
+ timestamp: z.ZodOptional<z.ZodString>;
13
+ }, z.core.$loose>;
14
+ export declare const CodexEntrySchema: z.ZodObject<{
15
+ timestamp: z.ZodOptional<z.ZodString>;
16
+ type: z.ZodString;
17
+ payload: z.ZodOptional<z.ZodRecord<z.ZodString, z.ZodUnknown>>;
18
+ }, z.core.$loose>;
19
+ export declare const ClaudeProjectEntrySchema: z.ZodRecord<z.ZodString, z.ZodUnknown>;
20
+ export type PiSessionHeader = z.infer<typeof PiSessionHeaderSchema>;
21
+ export type PiSessionEntry = z.infer<typeof PiSessionEntrySchema>;
22
+ export type CodexEntry = z.infer<typeof CodexEntrySchema>;
23
+ export type ClaudeProjectEntry = z.infer<typeof ClaudeProjectEntrySchema>;
@@ -0,0 +1,19 @@
1
+ import { z } from 'zod';
2
+ export const PiSessionHeaderSchema = z.object({
3
+ type: z.literal('session'),
4
+ id: z.string(),
5
+ timestamp: z.string(),
6
+ cwd: z.string().optional(),
7
+ }).passthrough();
8
+ export const PiSessionEntrySchema = z.object({
9
+ type: z.string(),
10
+ id: z.string().optional(),
11
+ parentId: z.string().nullable().optional(),
12
+ timestamp: z.string().optional(),
13
+ }).passthrough();
14
+ export const CodexEntrySchema = z.object({
15
+ timestamp: z.string().optional(),
16
+ type: z.string(),
17
+ payload: z.record(z.string(), z.unknown()).optional(),
18
+ }).passthrough();
19
+ export const ClaudeProjectEntrySchema = z.record(z.string(), z.unknown());
@@ -1,13 +1,12 @@
1
1
  import fg from 'fast-glob';
2
- import { z } from 'zod';
3
2
  import { Qwen35RecordSchema } from '../schemas/qwen35.js';
3
+ import { ClaudeProjectEntrySchema } from '../schemas/source.js';
4
4
  import { readJsonl } from '../utils/jsonl.js';
5
- const EntrySchema = z.record(z.string(), z.unknown());
6
5
  export async function collectClaudePromptOnlyRecords(root) {
7
6
  const files = await fg('**/*.jsonl', { cwd: root, absolute: true, onlyFiles: true });
8
7
  const records = [];
9
8
  for (const file of files.sort()) {
10
- const entries = (await readJsonl(file)).map((row) => EntrySchema.parse(row));
9
+ const entries = (await readJsonl(file)).map((row) => ClaudeProjectEntrySchema.parse(row));
11
10
  for (const entry of entries) {
12
11
  if (entry.type !== 'user')
13
12
  continue;
@@ -1,12 +1,7 @@
1
1
  import fg from 'fast-glob';
2
- import { z } from 'zod';
3
2
  import { Qwen35RecordSchema } from '../schemas/qwen35.js';
3
+ import { CodexEntrySchema } from '../schemas/source.js';
4
4
  import { readJsonl } from '../utils/jsonl.js';
5
- const EntrySchema = z.object({
6
- timestamp: z.string().optional(),
7
- type: z.string(),
8
- payload: z.record(z.string(), z.unknown()).optional(),
9
- }).passthrough();
10
5
  class TurnBuilder {
11
6
  sessionMeta;
12
7
  turnId;
@@ -204,7 +199,7 @@ export async function collectCodexRecords(root) {
204
199
  const files = await fg('**/*.jsonl', { cwd: root, absolute: true, onlyFiles: true });
205
200
  const records = [];
206
201
  for (const file of files.sort()) {
207
- const entries = (await readJsonl(file)).map((entry) => EntrySchema.parse(entry));
202
+ const entries = (await readJsonl(file)).map((entry) => CodexEntrySchema.parse(entry));
208
203
  const sessionMeta = (entries.find((entry) => entry.type === 'session_meta')?.payload ?? {});
209
204
  let builder = null;
210
205
  for (const entry of entries) {
@@ -1,20 +1,15 @@
1
1
  import fs from 'node:fs';
2
2
  import fg from 'fast-glob';
3
- import { z } from 'zod';
4
3
  import { Qwen35RecordSchema } from '../schemas/qwen35.js';
4
+ import { PiSessionEntrySchema, PiSessionHeaderSchema } from '../schemas/source.js';
5
5
  import { isFile } from '../utils/common.js';
6
6
  import { readJsonl } from '../utils/jsonl.js';
7
- const SessionEntrySchema = z.object({
8
- type: z.string(),
9
- id: z.string().optional(),
10
- parentId: z.string().nullable().optional(),
11
- timestamp: z.string().optional(),
12
- }).passthrough();
13
7
  export async function collectPiRecords(root) {
14
8
  const files = await fg('**/*.jsonl', { cwd: root, absolute: true, onlyFiles: true });
15
9
  const records = [];
16
10
  for (const file of files.sort()) {
17
- const rows = (await readJsonl(file)).map((row) => SessionEntrySchema.parse(row));
11
+ const rawRows = await readJsonl(file);
12
+ const rows = rawRows.map((row, index) => (index === 0 ? PiSessionHeaderSchema.parse(row) : PiSessionEntrySchema.parse(row)));
18
13
  if (!rows.length)
19
14
  continue;
20
15
  const header = rows[0];
@@ -25,7 +20,7 @@ export async function collectPiRecords(root) {
25
20
  if (!entry.id)
26
21
  continue;
27
22
  byId.set(entry.id, entry);
28
- const key = entry.parentId ?? null;
23
+ const key = typeof entry.parentId === 'string' ? entry.parentId : null;
29
24
  const bucket = children.get(key) ?? [];
30
25
  bucket.push(entry.id);
31
26
  children.set(key, bucket);
@@ -47,7 +42,7 @@ function branchEntries(leaf, byId) {
47
42
  if (!entry)
48
43
  break;
49
44
  ordered.push(entry);
50
- current = entry.parentId ?? null;
45
+ current = typeof entry.parentId === 'string' ? entry.parentId : null;
51
46
  }
52
47
  return ordered.reverse();
53
48
  }
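
Reviewer note: both `parentId` changes in this file replace `entry.parentId ?? null` with a `typeof` check. The two forms agree for `undefined`, `null`, and strings, but `??` would let any other value (a number, an object) through as a map key, whereas the `typeof` form collapses it to `null`. The sketch below shows the leaf-to-root walk with that normalisation. It is reconstructed from the visible fragment of `branchEntries` only; the `Entry` type, the `parentKey` helper, and the cycle guard are assumptions added for illustration and are not in the diff.

```typescript
// Minimal entry shape for the sketch; the real rows are Zod-parsed objects
// with additional passthrough fields.
type Entry = { id: string; parentId?: unknown };

// Hypothetical helper: anything that is not a string counts as "no parent".
function parentKey(entry: Entry): string | null {
  return typeof entry.parentId === 'string' ? entry.parentId : null;
}

// Walk from a leaf id up to the root and return the entries in root-first order.
export function branchEntries(leaf: string, byId: Map<string, Entry>): Entry[] {
  const ordered: Entry[] = [];
  const seen = new Set<string>(); // cycle guard (an addition, not in the diff)
  let current: string | null = leaf;
  while (current !== null && !seen.has(current)) {
    seen.add(current);
    const entry = byId.get(current);
    if (!entry) break; // dangling parent reference: stop at what we have
    ordered.push(entry);
    current = parentKey(entry);
  }
  return ordered.reverse();
}
```

With the guard in place, a malformed session whose entries point at each other in a loop terminates instead of spinning forever.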
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "agentic-dataset-builder",
3
- "version": "0.2.0",
3
+ "version": "0.2.2",
4
4
  "description": "Pure TypeScript agentic dataset builder for Pi, Codex, and Claude Code history",
5
5
  "license": "MIT",
6
6
  "type": "module",
@@ -26,7 +26,24 @@
26
26
  "agentic-dataset-builder": "./dist/cli.js"
27
27
  },
28
28
  "files": [
29
- "dist",
29
+ "dist/cli.js",
30
+ "dist/cli.d.ts",
31
+ "dist/labeling.js",
32
+ "dist/labeling.d.ts",
33
+ "dist/parquet.js",
34
+ "dist/parquet.d.ts",
35
+ "dist/platform",
36
+ "dist/schemas/qwen35.js",
37
+ "dist/schemas/qwen35.d.ts",
38
+ "dist/schemas/source.js",
39
+ "dist/schemas/source.d.ts",
40
+ "dist/sources/pi.js",
41
+ "dist/sources/pi.d.ts",
42
+ "dist/sources/codex.js",
43
+ "dist/sources/codex.d.ts",
44
+ "dist/sources/claude.js",
45
+ "dist/sources/claude.d.ts",
46
+ "dist/utils",
30
47
  "README.md",
31
48
  "LICENSE"
32
49
  ],
@@ -37,7 +54,8 @@
37
54
  "build": "tsc -p tsconfig.json",
38
55
  "dev": "tsx src/cli.ts",
39
56
  "check": "tsc -p tsconfig.json --noEmit",
40
- "pack:check": "npm pack --dry-run"
57
+ "pack:check": "npm pack --dry-run",
58
+ "test": "vitest run"
41
59
  },
42
60
  "dependencies": {
43
61
  "fast-glob": "^3.3.3",
@@ -47,6 +65,7 @@
47
65
  "devDependencies": {
48
66
  "@types/node": "^24.7.2",
49
67
  "tsx": "^4.20.6",
50
- "typescript": "^5.9.3"
68
+ "typescript": "^5.9.3",
69
+ "vitest": "^4.1.3"
51
70
  }
52
71
  }