codealmanac 0.1.4 → 0.1.6
- package/dist/cli-GTEC5PC7.js +6237 -0
- package/dist/cli-GTEC5PC7.js.map +1 -0
- package/dist/codealmanac.js +39 -5142
- package/dist/codealmanac.js.map +1 -1
- package/guides/mini.md +27 -1
- package/guides/processing/claude-code.md +152 -0
- package/guides/processing/codex.md +214 -0
- package/guides/processing/generic.md +128 -0
- package/guides/reference.md +80 -6
- package/package.json +2 -2
- package/prompts/reviewer.md +1 -1
- package/prompts/writer.md +1 -1
package/guides/mini.md
CHANGED
|
@@ -144,7 +144,7 @@ wikilink-syntax
|
|
|
144
144
|
# 2. Triage with --lead
|
|
145
145
|
$ almanac show sqlite-indexer --lead
|
|
146
146
|
The indexer (`src/indexer/`) builds and maintains `.almanac/index.db` — a
|
|
147
|
-
SQLite database that powers all query commands (`search`, `
|
|
147
|
+
SQLite database that powers all query commands (`search`, `show`, `health`,
|
|
148
148
|
`topics show`). It runs silently before every query command, comparing page
|
|
149
149
|
file mtimes against the stored `content_hash`; only changed or new pages are
|
|
150
150
|
re-parsed.
|
|
@@ -216,6 +216,32 @@ No logs at all → the hook isn't installed, or bailed before backgrounding, or
|
|
|
216
216
|
|
|
217
217
|
---
|
|
218
218
|
|
|
219
|
+
## Staying current
|
|
220
|
+
|
|
221
|
+
codealmanac checks for updates in the background (once per 24h) after each
|
|
222
|
+
command. When a new version is available, you'll see a stderr banner on
|
|
223
|
+
every subsequent invocation:
|
|
224
|
+
|
|
225
|
+
```
|
|
226
|
+
! codealmanac 0.1.6 available (you're on 0.1.5) — run: almanac update
|
|
227
|
+
```
|
|
228
|
+
|
|
229
|
+
The banner shows on every command until you update or dismiss it. Run:
|
|
230
|
+
|
|
231
|
+
```bash
|
|
232
|
+
almanac update # upgrade to latest (foreground `npm i -g codealmanac@latest`)
|
|
233
|
+
almanac update --dismiss # skip this version; banner goes away until the next release
|
|
234
|
+
almanac update --check # check now without installing (bypasses 24h cache)
|
|
235
|
+
almanac doctor # see current update status + notifier setting
|
|
236
|
+
```
|
|
237
|
+
|
|
238
|
+
Auto-install is deliberately NOT the default — silent install without consent
|
|
239
|
+
violates the trust contract, npm prefixes diverge across version managers, and a
|
|
240
|
+
mid-invocation binary swap corrupts dynamic imports. Tier B (nag + manual
|
|
241
|
+
install) is the design. See `almanac update --help` for the full flag set.
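The throttle and banner rules above can be sketched as two small predicates. This is an illustration only, not codealmanac's implementation: the function names, the `dismissed` argument, and the assumption that versions are plain dotted integers are all hypothetical.

```python
from datetime import datetime, timedelta

CHECK_INTERVAL = timedelta(hours=24)  # "once per 24h" from the text above


def parse_version(version):
    """Turn '0.1.6' into (0, 1, 6). Assumes purely numeric dotted versions."""
    return tuple(int(part) for part in version.split("."))


def should_check(last_checked, now, force=False):
    """True when no check has run yet, the cache is stale, or the caller forces it."""
    if force or last_checked is None:
        return True
    return now - last_checked >= CHECK_INTERVAL


def should_show_banner(current, latest, dismissed=None):
    """Show the banner only for a newer release that has not been dismissed."""
    if parse_version(latest) <= parse_version(current):
        return False
    return latest != dismissed
```

Here `force=True` plays the role of `almanac update --check`, and passing the dismissed version models `almanac update --dismiss`: the banner stays hidden until `latest` moves past it.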
|
|
242
|
+
|
|
243
|
+
---
|
|
244
|
+
|
|
219
245
|
## When in doubt
|
|
220
246
|
|
|
221
247
|
- `.almanac/README.md` — repo-specific conventions + notability bar
|
package/guides/processing/claude-code.md
ADDED
|
@@ -0,0 +1,152 @@
|
|
|
1
|
+
# Processing Claude Code Sessions
|
|
2
|
+
|
|
3
|
+
## Format overview
|
|
4
|
+
|
|
5
|
+
Claude Code stores sessions as JSONL files (one JSON object per line) at:
|
|
6
|
+
```
|
|
7
|
+
~/.claude/projects/<project-hash>/<session-uuid>.jsonl
|
|
8
|
+
```
|
|
9
|
+
|
|
10
|
+
The project hash is a path with slashes replaced by dashes (e.g., `-Users-rohan-Desktop-Projects-myrepo`). Each session file contains the full conversation history including tool calls, tool results, thinking blocks, and metadata.
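A minimal sketch of locating a project's session files from its working directory. It applies only the slash-to-dash rule stated above; whether other characters are also rewritten is not covered here, and the helper names and the `claude_home` parameter are hypothetical.

```python
from pathlib import Path


def project_dir_name(project_path):
    """Replace every slash with a dash, as described above."""
    return project_path.replace("/", "-")


def session_files(project_path, claude_home=None):
    """List the session JSONL files recorded for one project directory."""
    if claude_home is None:
        claude_home = Path("~/.claude").expanduser()
    project_dir = Path(claude_home) / "projects" / project_dir_name(project_path)
    return sorted(project_dir.glob("*.jsonl"))
```

For the example path, `project_dir_name("/Users/rohan/Desktop/Projects/myrepo")` yields `-Users-rohan-Desktop-Projects-myrepo`.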
|
|
11
|
+
|
|
12
|
+
Typical session sizes: 150KB (short Q&A) to 8MB+ (multi-hour coding session). Line counts range from ~75 to ~1,750.
|
|
13
|
+
|
|
14
|
+
## Record types
|
|
15
|
+
|
|
16
|
+
Each line is a JSON object with a `type` field at the top level:
|
|
17
|
+
|
|
18
|
+
| Type | Frequency | What it contains |
|
|
19
|
+
|------|-----------|-----------------|
|
|
20
|
+
| `assistant` | ~40% of records, ~15% of bytes | Model responses: text, tool_use, thinking blocks. Each content item is in `message.content[]` |
|
|
21
|
+
| `user` | ~35% of records, ~55% of bytes | Human messages OR tool results. Check `message.content[].type` to distinguish |
|
|
22
|
+
| `attachment` | ~5% of records, ~7% of bytes | System context injected by the harness: deferred tool lists, skill listings, memory, task reminders, edited file snippets |
|
|
23
|
+
| `file-history-snapshot` | ~5% of records, <1% of bytes | Checkpoint markers for undo/redo. Always tiny (~250 bytes) |
|
|
24
|
+
| `permission-mode` | ~3% of records, <1% of bytes | Records when permission mode changes (e.g., `bypassPermissions`) |
|
|
25
|
+
| `last-prompt` | ~3% of records, <1% of bytes | Marks turn boundaries. ~120 bytes each |
|
|
26
|
+
| `system` | ~2% of records, <1% of bytes | System messages injected mid-conversation. Often empty content |
|
|
27
|
+
| `ai-title` | rare | Auto-generated session title |
|
|
28
|
+
| `queue-operation` | rare | Queued follow-up commands |
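The percentages in this table vary by session, so it can help to measure a file before processing it. A small sketch, assuming each non-blank line parses as a JSON object; `bytes_by_type` is a hypothetical helper name.

```python
import json
from collections import Counter


def bytes_by_type(lines):
    """Count records and UTF-8 bytes per top-level `type` value."""
    counts, sizes = Counter(), Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        record_type = json.loads(line).get("type", "unknown")
        counts[record_type] += 1
        sizes[record_type] += len(line.encode("utf-8"))
    return counts, sizes
```

Passing an open file handle works, since iterating a text file yields one line at a time.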
|
|
29
|
+
|
|
30
|
+
## What to extract (signal)
|
|
31
|
+
|
|
32
|
+
### 1. Human messages (highest signal density)
|
|
33
|
+
- **Where:** Records with `type: "user"` where `message.content[]` contains items with `type: "text"`
|
|
34
|
+
- **Also check:** `userType` field -- `"external"` means the actual human typed this
|
|
35
|
+
- **What:** Intent, requirements, feedback, decisions, bug reports, design direction
|
|
36
|
+
- **Example pattern:** `{"type": "user", "message": {"content": [{"type": "text", "text": "What problems did you run into?"}]}}`
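A sketch of the human-message check described in this section. Two choices here are assumptions rather than documented behavior: a record without `userType` is treated as if it were `"external"`, and `message.content` is allowed to be a bare string as well as a list.

```python
def human_texts(record):
    """Return the text a human typed in one Claude Code session record."""
    if record.get("type") != "user":
        return []
    # Assumption: a missing userType is treated like "external".
    if record.get("userType", "external") != "external":
        return []
    content = (record.get("message") or {}).get("content", [])
    if isinstance(content, str):  # assumption: content may be a bare string
        return [content] if content else []
    return [
        item["text"]
        for item in content
        if isinstance(item, dict) and item.get("type") == "text" and item.get("text")
    ]
```

Records whose content holds only `tool_result` items return an empty list, which is the "user records serve double duty" distinction in practice.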
|
|
37
|
+
|
|
38
|
+
### 2. Assistant text responses
|
|
39
|
+
- **Where:** Records with `type: "assistant"`, then `message.content[]` items with `type: "text"`
|
|
40
|
+
- **What:** Explanations, decisions, summaries, architecture analysis, bug diagnoses
|
|
41
|
+
- **Typical size:** 100-3000 chars per text block
|
|
42
|
+
- **Example pattern:** The assistant explains a root cause, summarizes what was built, or describes a design decision
|
|
43
|
+
|
|
44
|
+
### 3. Subagent results (high-value signal hidden in noise)
|
|
45
|
+
- **Where:** Records with `type: "user"` that have a top-level `toolUseResult` field with `agentType` set
|
|
46
|
+
- **What:** Complete results from subagents (review, critic, pair, etc.). The `content` field contains the full subagent output, often multi-thousand-character analysis
|
|
47
|
+
- **Key fields:** `toolUseResult.agentType` (e.g., "general-purpose", "review", "critic", "pair"), `toolUseResult.content[].text`, `toolUseResult.prompt` (what the subagent was asked to do)
|
|
48
|
+
- **Why it matters:** These are often the densest signal in a session -- complete audit reports, code reviews, architecture analyses
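A sketch of pulling a subagent result out of a record, using only the fields named above (`agentType`, `prompt`, `content[].text`). The function name and the returned dictionary shape are hypothetical.

```python
def subagent_result(record):
    """Return agent type, prompt, and joined text for a subagent record, else None."""
    result = record.get("toolUseResult")
    if record.get("type") != "user" or not isinstance(result, dict):
        return None
    agent_type = result.get("agentType")
    if not agent_type:
        return None  # ordinary tool result: duplicated content, not signal
    texts = [
        item.get("text", "")
        for item in result.get("content", [])
        if isinstance(item, dict)
    ]
    return {
        "agent_type": agent_type,
        "prompt": result.get("prompt"),
        "text": "\n".join(t for t in texts if t),
    }
```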
|
|
49
|
+
|
|
50
|
+
### 4. Session metadata
|
|
51
|
+
- **Where:** First few records of the file
|
|
52
|
+
- **Key fields on user records:** `cwd`, `gitBranch`, `version`, `timestamp`, `sessionId`, `entrypoint` (cli vs other)
|
|
53
|
+
- **Attachment records** with `type: "nested_memory"` contain the project's memory/context
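A sketch of collecting the listed metadata fields from the first `user` record. `session_metadata` and `META_FIELDS` are hypothetical names; fields missing from a record are simply omitted.

```python
META_FIELDS = ("cwd", "gitBranch", "version", "timestamp", "sessionId", "entrypoint")


def session_metadata(records):
    """Return the metadata fields found on the first user record."""
    for record in records:
        if record.get("type") == "user":
            return {key: record[key] for key in META_FIELDS if key in record}
    return {}
```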
|
|
54
|
+
|
|
55
|
+
## What to skip (noise)
|
|
56
|
+
|
|
57
|
+
### 1. Tool results (~24% of total bytes) -- SKIP
|
|
58
|
+
- **Where:** `user` records where `message.content[]` has `type: "tool_result"`
|
|
59
|
+
- **Why skip:** These are file contents, grep results, build output, test output. The actual files are in the repo; the output is transient
|
|
60
|
+
- **The biggest offenders:** Read tool results can be 150KB+ (entire file contents dumped inline)
|
|
61
|
+
|
|
62
|
+
### 2. Tool calls (~8% of total bytes) -- SKIP or SUMMARIZE
|
|
63
|
+
- **Where:** `assistant` records where `message.content[]` has `type: "tool_use"`
|
|
64
|
+
- **Contains:** `name` (Bash, Read, Grep, Edit, Write, Glob) and `input` (command, file path, pattern)
|
|
65
|
+
- **Why skip:** The sequence of tool calls is operational, not knowledge. Exception: summarize the *pattern* of tool usage ("read 15 files in src/auth/")
|
|
66
|
+
|
|
67
|
+
### 3. Wrapper overhead (~38% of total bytes in large sessions) -- SKIP
|
|
68
|
+
- **Where:** Top-level fields on `user` records: `parentUuid`, `sourceToolAssistantUUID`, `toolUseResult` (when not a subagent), `slug`, `requestId`, `isMeta`, `isSidechain`
|
|
69
|
+
- **Why skip:** The `toolUseResult` field on user records DUPLICATES the tool result content that already appears in `message.content[]`. This is the single biggest source of bloat. In one 8MB session, `toolUseResult` alone was 2.7MB (33%)
|
|
70
|
+
- **EXCEPTION:** When `toolUseResult.agentType` is set, this is a subagent result and IS signal
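A sketch of dropping the wrapper fields while honoring the subagent exception. The field list is copied from the bullet above; `strip_wrapper` is a hypothetical name. It removes `isSidechain` because the list says so; keep that field if abandoned branches matter to you.

```python
WRAPPER_FIELDS = {
    "parentUuid", "sourceToolAssistantUUID", "slug",
    "requestId", "isMeta", "isSidechain",
}


def strip_wrapper(record):
    """Copy a record without wrapper fields; keep toolUseResult only for subagents."""
    slim = {key: value for key, value in record.items() if key not in WRAPPER_FIELDS}
    result = slim.get("toolUseResult")
    is_subagent = isinstance(result, dict) and bool(result.get("agentType"))
    if "toolUseResult" in slim and not is_subagent:
        del slim["toolUseResult"]  # duplicates message.content[]
    return slim
```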
|
|
71
|
+
|
|
72
|
+
### 4. Empty thinking blocks (~6% in some sessions) -- SKIP
|
|
73
|
+
- **Where:** `assistant` records, `message.content[]` with `type: "thinking"` but `thinking: ""`
|
|
74
|
+
- **Why:** Claude Code often records thinking blocks with empty content (the actual thinking happened but was not persisted). These are pure waste
|
|
75
|
+
|
|
76
|
+
### 5. Attachments (~7%) -- SKIP
|
|
77
|
+
- **Where:** `type: "attachment"` records
|
|
78
|
+
- **Contains:** `deferred_tools_delta` (tool availability lists), `skill_listing` (repeated skill menus), `task_reminder` (repeated TODO lists), `mcp_instructions_delta` (MCP setup)
|
|
79
|
+
- **Why skip:** Harness infrastructure, not knowledge. Repeated across turns
|
|
80
|
+
|
|
81
|
+
### 6. Metadata records (<1%) -- SKIP
|
|
82
|
+
- `file-history-snapshot`, `permission-mode`, `last-prompt`, `queue-operation`, `ai-title`
|
|
83
|
+
|
|
84
|
+
### 7. Base64 image data (~1-7% in some sessions) -- SKIP
|
|
85
|
+
- **Where:** `user` records with `message.content[]` containing `type: "image"` and `source.type: "base64"`
|
|
86
|
+
- **Why skip:** Screenshots pasted by the user. Can be 100KB+ of base64 per image. Not extractable as knowledge
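A sketch of filtering image items out of a content list while tracking how much base64 was dropped. It assumes the encoded payload sits in `source.data`, which the text above does not state; `drop_images` is a hypothetical name.

```python
def drop_images(content):
    """Return (items without base64 images, characters of base64 dropped)."""
    kept, dropped = [], 0
    for item in content:
        source = item.get("source") if isinstance(item, dict) else None
        is_image = (
            isinstance(item, dict)
            and item.get("type") == "image"
            and isinstance(source, dict)
            and source.get("type") == "base64"
        )
        if is_image:
            dropped += len(source.get("data", ""))  # assumed field name
            continue
        kept.append(item)
    return kept, dropped
```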
|
|
87
|
+
|
|
88
|
+
## What to summarize
|
|
89
|
+
|
|
90
|
+
These patterns should be compressed rather than fully extracted or fully skipped:
|
|
91
|
+
|
|
92
|
+
| Pattern | Summarize as |
|
|
93
|
+
|---------|-------------|
|
|
94
|
+
| 10+ consecutive tool_use/tool_result pairs reading files | "Read N files in {directory pattern}" |
|
|
95
|
+
| grep/glob sequences searching for a pattern | "Searched for {pattern} across {scope}" |
|
|
96
|
+
| Edit tool calls modifying files | "Modified {file}: {description from the Edit input}" |
|
|
97
|
+
| Bash commands running tests | "Ran tests: {pass/fail summary from result}" |
|
|
98
|
+
| Bash commands running builds | "Built {target}: {success/failure}" |
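The first row of the table can be sketched as follows, given a run of consecutive `tool_use` items. It assumes the Read tool's path lives in `input.file_path` and that paths are absolute POSIX paths; both are assumptions, and `summarize_reads` is a hypothetical name.

```python
import posixpath


def summarize_reads(tool_uses, min_run=10):
    """Collapse a run of Read calls into 'Read N files in <dir>/', else None."""
    paths = [
        (tool.get("input") or {}).get("file_path")  # assumed field name
        for tool in tool_uses
        if tool.get("name") == "Read"
    ]
    paths = [path for path in paths if path]
    if len(paths) < min_run:
        return None  # too short to be worth summarizing
    common = posixpath.commonpath([posixpath.dirname(path) for path in paths])
    return f"Read {len(paths)} files in {common}/"
```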
|
|
99
|
+
|
|
100
|
+
## Extraction approach
|
|
101
|
+
|
|
102
|
+
1. **Parse the JSONL file** line by line. Each line is one JSON object.
|
|
103
|
+
2. **First pass -- extract metadata:** From the first `user` record, grab `cwd`, `gitBranch`, `version`, `timestamp`, `sessionId`.
|
|
104
|
+
3. **For each record, check `type`:**
|
|
105
|
+
- `type: "user"` with `message.content[].type == "text"` -> **extract as human message**
|
|
106
|
+
- `type: "user"` with `toolUseResult.agentType` set -> **extract subagent result** from `toolUseResult.content[].text`
|
|
107
|
+
- `type: "user"` with `message.content[].type == "tool_result"` -> **skip** (or summarize tool call patterns)
|
|
108
|
+
- `type: "assistant"` with `message.content[].type == "text"` -> **extract as assistant reasoning**
|
|
109
|
+
- `type: "assistant"` with `message.content[].type == "thinking"` and non-empty `thinking` field -> **extract as internal reasoning**
|
|
110
|
+
- `type: "assistant"` with `message.content[].type == "tool_use"` -> **skip or summarize**
|
|
111
|
+
- `type: "attachment"`, `type: "system"`, metadata types -> **skip**
|
|
112
|
+
4. **Second pass -- deduplicate:** The `toolUseResult` field on user records often duplicates content from `message.content[]`. Always prefer `message.content[]` and only use `toolUseResult` for subagent data.
|
|
113
|
+
5. **Third pass -- compress tool sequences:** Collapse consecutive tool_use + tool_result pairs into summaries.
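Steps 1 to 4 above can be sketched as a single routing loop. This is an illustrative outline, not the tool's own extractor: the output tuples and the `extract_signal` name are hypothetical, tool calls are skipped rather than summarized, and unparseable lines are ignored.

```python
import json

SKIP_TYPES = {
    "attachment", "system", "file-history-snapshot",
    "permission-mode", "last-prompt", "queue-operation", "ai-title",
}


def extract_signal(lines):
    """Return (kind, text) pairs for human, assistant, thinking, and subagent signal."""
    out = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # blank or truncated line
        record_type = record.get("type")
        if record_type in SKIP_TYPES:
            continue
        content = (record.get("message") or {}).get("content", [])
        items = [i for i in content if isinstance(i, dict)] if isinstance(content, list) else []
        if record_type == "user":
            result = record.get("toolUseResult")
            if isinstance(result, dict) and result.get("agentType"):
                parts = [c.get("text", "") for c in result.get("content", []) if isinstance(c, dict)]
                out.append(("subagent", "\n".join(p for p in parts if p)))
                continue
            out.extend(("human", i["text"]) for i in items if i.get("type") == "text" and i.get("text"))
        elif record_type == "assistant":
            for item in items:
                if item.get("type") == "text" and item.get("text"):
                    out.append(("assistant", item["text"]))
                elif item.get("type") == "thinking" and item.get("thinking"):
                    out.append(("thinking", item["thinking"]))
    return out
```

Because `toolUseResult` is read only when `agentType` is set, the deduplication in step 4 falls out of the routing: ordinary tool results are never read from either location.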
|
|
114
|
+
|
|
115
|
+
## Example: signal extraction from a real session
|
|
116
|
+
|
|
117
|
+
**Human intent (from user record):**
|
|
118
|
+
> "Tell me about our ref token?"
|
|
119
|
+
Signal: User wants to understand the ref token system.
|
|
120
|
+
|
|
121
|
+
**Assistant reasoning (from assistant text):**
|
|
122
|
+
> "Ref Token: Opaque backend-signed token that binds a locally-edited .md file to a specific page in the DB. Backend signs on download, verifies on publish. No signing secret ever touches the client."
|
|
123
|
+
Signal: Architectural explanation of the ref token system.
|
|
124
|
+
|
|
125
|
+
**Assistant problem diagnosis (from assistant text):**
|
|
126
|
+
> "Root cause: the codealmanac CLI expects topics.yaml in the new list format but .almanac/topics.yaml is still in the old dict format"
|
|
127
|
+
Signal: Bug identification with root cause.
|
|
128
|
+
|
|
129
|
+
**Subagent audit (from toolUseResult with agentType):**
|
|
130
|
+
> agentType: "general-purpose", prompt: "You are auditing the hosted editor..."
|
|
131
|
+
> content: "Audit Report: Hosted Editor / Quill Co-Editing (Pre-Phase-5) ... 3 HIGH gaps found ... Fix malformed edit-result surfacing ... Fix Quill session isolation ..."
|
|
132
|
+
Signal: Complete code audit with findings, priorities, and recommendations.
|
|
133
|
+
|
|
134
|
+
**Noise skipped:** 522 tool results totaling 2MB of file contents, 466KB of empty thinking blocks, 408KB of metadata records, 2.7MB of duplicated toolUseResult data.
|
|
135
|
+
|
|
136
|
+
## Gotchas
|
|
137
|
+
|
|
138
|
+
1. **toolUseResult duplication is massive.** In one 8MB session, `toolUseResult` accounted for 33% of the file. It duplicates `message.content[]` tool results AND sometimes contains subagent results. Always check `agentType` before discarding.
|
|
139
|
+
|
|
140
|
+
2. **Thinking blocks are empty in observed recent sessions.** The `thinking` field in content items of type `"thinking"` was an empty string in every observed case (95 of 95 in one session); the thinking content is not persisted. Do not expect signal here despite the promising field name, but keep the non-empty check in case a session does persist it.
|
|
141
|
+
|
|
142
|
+
3. **assistant records contain message-level metadata.** The `message` object has `usage` (token counts), `model` (model name), `stop_reason` (why generation stopped). These can be useful for understanding session dynamics but are not knowledge signal.
|
|
143
|
+
|
|
144
|
+
4. **user records serve double duty.** A `user` record with `userType: "external"` and text content is a real human message. A `user` record with tool_result content is just the harness returning tool output. Always check `message.content[].type`.
|
|
145
|
+
|
|
146
|
+
5. **The `isSidechain` field** indicates branched conversations (user went back and tried a different approach). Sidechain records may represent abandoned approaches -- still potentially valuable as "what was tried and rejected."
|
|
147
|
+
|
|
148
|
+
6. **Base64 images can be huge.** A single screenshot paste can add 100KB+ of base64 data to a user record. These look like small records until you measure them.
|
|
149
|
+
|
|
150
|
+
7. **Attachment records repeat.** `skill_listing` and `task_reminder` attachments are re-injected at many turn boundaries. The same content appears 10-15 times in a long session.
|
|
151
|
+
|
|
152
|
+
8. **Subagent data structure:** The `toolUseResult` for subagents includes `toolStats` (readCount, searchCount, bashCount, editFileCount, linesAdded, linesRemoved) and `usage` (input_tokens, output_tokens, cache stats). These are useful metadata about the subagent's work.
|
package/guides/processing/codex.md
ADDED
|
@@ -0,0 +1,214 @@
|
|
|
1
|
+
# Processing Codex Sessions
|
|
2
|
+
|
|
3
|
+
## Format overview
|
|
4
|
+
|
|
5
|
+
Codex stores sessions as JSONL files at:
|
|
6
|
+
```
|
|
7
|
+
~/.codex/sessions/<year>/<month>/<day>/rollout-<timestamp>-<thread-uuid>.jsonl
|
|
8
|
+
```
|
|
9
|
+
|
|
10
|
+
A SQLite database at `~/.codex/state_5.sqlite` provides session metadata (title, cwd, model, tokens_used, git info) in the `threads` table.
|
|
11
|
+
|
|
12
|
+
Multiple rollout files for the same timestamp indicate subagent threads spawned by the parent session. The SQLite `source` column reveals this: `"vscode"` for top-level sessions, a JSON blob with `subagent.thread_spawn.parent_thread_id` for child threads.
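A sketch of reading the `source` column to tell top-level sessions from subagent threads. It assumes the JSON nests exactly as the dotted path above suggests (`subagent` → `thread_spawn` → `parent_thread_id`); `parent_thread_id` as a function name is hypothetical.

```python
import json


def parent_thread_id(source):
    """Return the parent thread id for a subagent thread, or None for top-level."""
    if not source:
        return None
    try:
        data = json.loads(source)
    except (TypeError, ValueError):
        return None  # plain string such as "vscode"
    subagent = data.get("subagent") if isinstance(data, dict) else None
    spawn = subagent.get("thread_spawn") if isinstance(subagent, dict) else None
    return spawn.get("parent_thread_id") if isinstance(spawn, dict) else None
```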
|
|
13
|
+
|
|
14
|
+
Typical session sizes: 500KB (short task) to 12MB+ (multi-turn debugging session). Line counts range from ~175 to ~2,800.
|
|
15
|
+
|
|
16
|
+
## Record types
|
|
17
|
+
|
|
18
|
+
Each line is a JSON object with a `type` field:
|
|
19
|
+
|
|
20
|
+
| Type | % of records | % of bytes | What it contains |
|
|
21
|
+
|------|-------------|------------|-----------------|
|
|
22
|
+
| `response_item` | ~55% | ~36% | Model outputs: function calls, function call outputs, messages, reasoning |
|
|
23
|
+
| `event_msg` | ~43% | ~29% | Harness events: command execution results, token counts, agent messages, task lifecycle |
|
|
24
|
+
| `turn_context` | ~2% | ~5% | Per-turn context: model, cwd, instructions, settings. Repeated every turn |
|
|
25
|
+
| `session_meta` | 1 per file | ~1-3% | Session metadata: id, cwd, model, CLI version, base instructions, skills |
|
|
26
|
+
| `compacted` | rare (0-3) | 10-30% when present | Compressed conversation history from context window compaction |
|
|
27
|
+
|
|
28
|
+
### response_item subtypes (in `payload.type`)
|
|
29
|
+
|
|
30
|
+
| Subtype | What it contains |
|
|
31
|
+
|---------|-----------------|
|
|
32
|
+
| `function_call` | Tool invocations: `name` (always `exec_command`), `arguments` (JSON with cmd, workdir, yield_time_ms) |
|
|
33
|
+
| `function_call_output` | Tool results: `output` (command stdout/stderr as string) |
|
|
34
|
+
| `message` | Conversation messages. Check `payload.role`: `developer` (system prompts), `user` (human + context), `assistant` (model output) |
|
|
35
|
+
| `reasoning` | Model reasoning. Contains `encrypted_content` (unreadable) and `summary` (always empty in observed data) |
|
|
36
|
+
| `custom_tool_call` | File edit operations via `apply_patch`. Contains unified diff in `input` |
|
|
37
|
+
| `custom_tool_call_output` | Patch application results: success/failure + modified file list |
|
|
38
|
+
| `web_search_call` | Web search invocations (rare) |
|
|
39
|
+
|
|
40
|
+
### event_msg subtypes (in `payload.type`)
|
|
41
|
+
|
|
42
|
+
| Subtype | What it contains |
|
|
43
|
+
|---------|-----------------|
|
|
44
|
+
| `user_message` | Human input: `message` (text), `images` (base64 data URIs), `text_elements` |
|
|
45
|
+
| `agent_message` | Model commentary shown to user: `message`, `phase` (always "commentary") |
|
|
46
|
+
| `exec_command_end` | Command execution results (DUPLICATES `function_call_output`): stdout, stderr, aggregated_output, exit_code, duration, command |
|
|
47
|
+
| `token_count` | Token usage and rate limit info per turn |
|
|
48
|
+
| `task_started` | Turn lifecycle: turn_id, model_context_window, collaboration_mode |
|
|
49
|
+
| `task_complete` | Turn completion: `last_agent_message` (the final text shown to user) |
|
|
50
|
+
| `patch_apply_end` | File edit results: stdout, changes list, success boolean |
|
|
51
|
+
| `context_compacted` | Marker that context window was compacted |
|
|
52
|
+
| `turn_aborted` | Turn was cancelled |
|
|
53
|
+
|
|
54
|
+
## What to extract (signal)
|
|
55
|
+
|
|
56
|
+
### 1. Human messages (highest signal density)
|
|
57
|
+
- **Where:** `event_msg` records with `payload.type == "user_message"`
|
|
58
|
+
- **Field:** `payload.message`
|
|
59
|
+
- **Also in:** `response_item` with `payload.type == "message"` and `payload.role == "user"`, `payload.content[].type == "input_text"`. The event_msg version is cleaner
|
|
60
|
+
- **Watch for:** The `response_item` version also contains system/developer context injected alongside the real user message. Extract only `input_text` items where the text is NOT wrapped in XML tags like `<environment_context>`, `<permissions instructions>`, `<app-context>`, `<skills_instructions>`, `<collaboration_mode>`
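A sketch of the "not wrapped in XML tags" filter for the `response_item` version of a user message. The heuristic (text that starts with an opening tag is treated as injected context) is an assumption; it would also drop a genuine human message that happens to begin with a tag. It further assumes each `input_text` item carries its text in a `text` field.

```python
import re

# Matches text whose first non-space characters form an opening tag,
# e.g. <environment_context> or <permissions instructions>.
_WRAPPED = re.compile(r"^\s*<[A-Za-z_][^>]*>")


def real_user_texts(payload):
    """Return input_text items from a user message that are not XML-wrapped."""
    if payload.get("type") != "message" or payload.get("role") != "user":
        return []
    texts = []
    for item in payload.get("content", []):
        if not isinstance(item, dict) or item.get("type") != "input_text":
            continue
        text = item.get("text", "")  # assumed field name
        if text.strip() and not _WRAPPED.match(text):
            texts.append(text)
    return texts
```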
|
|
61
|
+
|
|
62
|
+
### 2. Agent messages (model's user-facing commentary)
|
|
63
|
+
- **Where:** `event_msg` records with `payload.type == "agent_message"`
|
|
64
|
+
- **Fields:** `payload.message`, `payload.phase`
|
|
65
|
+
- **What:** Short status updates and reasoning the model shares with the user. These are the "thinking out loud" moments
|
|
66
|
+
- **Example:** "I'm inspecting the repo for category vs topic naming drift and I'll trace it through code, routes, API shapes, and copy so the discrepancies are concrete rather than guessed."
|
|
67
|
+
|
|
68
|
+
### 3. Task completion summaries
|
|
69
|
+
- **Where:** `event_msg` records with `payload.type == "task_complete"`
|
|
70
|
+
- **Field:** `payload.last_agent_message`
|
|
71
|
+
- **What:** The final, complete response for each turn. Often the densest signal -- the model's synthesized answer after all tool use. Can be multi-thousand characters of analysis
|
|
72
|
+
|
|
73
|
+
### 4. Assistant output text
|
|
74
|
+
- **Where:** `response_item` with `payload.type == "message"`, `payload.role == "assistant"`, content items with `type: "output_text"`
|
|
75
|
+
- **What:** Model's text responses interspersed with tool calls. Shorter than task_complete but captures incremental reasoning
|
|
76
|
+
|
|
77
|
+
### 5. File edits (apply_patch)
|
|
78
|
+
- **Where:** `response_item` with `payload.type == "custom_tool_call"` and `payload.name == "apply_patch"`
|
|
79
|
+
- **Field:** `payload.input` contains a unified diff
|
|
80
|
+
- **What:** Every code change the model made. Extract the file path and a summary of the change, not the full diff (the repo has the final state)
|
|
81
|
+
|
|
82
|
+
### 6. Session metadata
|
|
83
|
+
- **Where:** `session_meta` record (first line of file)
|
|
84
|
+
- **Key fields:** `payload.id`, `payload.cwd`, `payload.model_provider`, `payload.cli_version`, `payload.source`. The model name is not in `session_meta`; read `payload.model` from a `turn_context` record instead
|
|
85
|
+
- **SQLite enrichment:** Query `threads` table for `title`, `tokens_used`, `git_branch`, `first_user_message`, `source` (reveals if this is a subagent)
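A sketch of the SQLite enrichment using Python's standard `sqlite3` module. The column list is the one used in this guide's own query; `thread_metadata` is a hypothetical helper, and the schema is taken on trust from the description above.

```python
import sqlite3

COLUMNS = (
    "id", "title", "cwd", "model", "tokens_used",
    "git_branch", "source", "first_user_message",
)


def thread_metadata(conn, thread_id):
    """Fetch one row from `threads` as a dict, or None if the id is unknown."""
    query = f"SELECT {', '.join(COLUMNS)} FROM threads WHERE id = ?"
    row = conn.execute(query, (thread_id,)).fetchone()  # bound parameter, no string splicing
    return dict(zip(COLUMNS, row)) if row else None
```

Opening the database read-only, for example `sqlite3.connect("file:/path/to/state_5.sqlite?mode=ro", uri=True)`, avoids touching a file that Codex may be writing to.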
|
|
86
|
+
|
|
87
|
+
## What to skip (noise)
|
|
88
|
+
|
|
89
|
+
### 1. function_call_output records (~17% of bytes) -- SKIP
|
|
90
|
+
- **Where:** `response_item` with `payload.type == "function_call_output"`
|
|
91
|
+
- **Why:** Raw command output (file contents, grep results, build output). Already in the repo or transient
|
|
92
|
+
|
|
93
|
+
### 2. exec_command_end records (~15% of bytes) -- SKIP
|
|
94
|
+
- **Where:** `event_msg` with `payload.type == "exec_command_end"`
|
|
95
|
+
- **Why:** DUPLICATES `function_call_output` with the same `call_id`. Contains stdout, stderr, aggregated_output redundantly. In one 12MB session, 399 of these consumed 1.9MB
|
|
96
|
+
- **Note:** 100% overlap with function_call_output on shared call_ids
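If you keep command output for any reason, the `call_id` overlap can be used to drop the duplicate copy. This sketch is slightly more conservative than the guide's advice: it removes an `exec_command_end` only when a `function_call_output` with the same `call_id` exists. The function name is hypothetical.

```python
def drop_duplicate_exec_ends(records):
    """Remove exec_command_end events already covered by a function_call_output."""
    seen = set()
    for record in records:
        payload = record.get("payload") or {}
        if record.get("type") == "response_item" and payload.get("type") == "function_call_output":
            if "call_id" in payload:
                seen.add(payload["call_id"])
    kept = []
    for record in records:
        payload = record.get("payload") or {}
        is_exec_end = record.get("type") == "event_msg" and payload.get("type") == "exec_command_end"
        if is_exec_end and payload.get("call_id") in seen:
            continue  # same output already present in response_item
        kept.append(record)
    return kept
```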
|
|
97
|
+
|
|
98
|
+
### 3. Reasoning records (~6% of bytes) -- SKIP
|
|
99
|
+
- **Where:** `response_item` with `payload.type == "reasoning"`
|
|
100
|
+
- **Why:** Contains `encrypted_content` (base64 blob, unreadable) and `summary` (consistently empty array in all observed sessions). No extractable signal
|
|
101
|
+
- **Do not confuse with:** `agent_message` records, which ARE readable model reasoning
|
|
102
|
+
|
|
103
|
+
### 4. turn_context records (~5-25% of bytes) -- SKIP
|
|
104
|
+
- **Where:** `type: "turn_context"`
|
|
105
|
+
- **Why:** Repeated every turn. Contains model name, cwd, instructions, sandbox policy, collaboration mode. Same content each time with minor variations
|
|
106
|
+
|
|
107
|
+
### 5. session_meta base_instructions (~3% of bytes) -- SKIP
|
|
108
|
+
- **Where:** `session_meta` record, `payload.base_instructions.text`
|
|
109
|
+
- **Why:** Codex's built-in system prompt. Same across all sessions. ~2000 chars of personality and behavior instructions
|
|
110
|
+
|
|
111
|
+
### 6. Developer instructions in message records -- SKIP
|
|
112
|
+
- **Where:** `response_item` messages with `payload.role == "developer"`
|
|
113
|
+
- **Contains:** `<permissions instructions>`, `<app-context>`, `<collaboration_mode>`, `<apps_instructions>`, `<skills_instructions>` XML blocks
|
|
114
|
+
- **Why:** Harness configuration, not user knowledge. Can be 10KB+ per occurrence
|
|
115
|
+
|
|
116
|
+
### 7. token_count records (~2%) -- SKIP
|
|
117
|
+
- Rate limit and token usage telemetry
|
|
118
|
+
|
|
119
|
+
### 8. function_call records (~1.5%) -- SKIP or SUMMARIZE
|
|
120
|
+
- **Where:** `response_item` with `payload.type == "function_call"`
|
|
121
|
+
- **Contains:** `name` (always `exec_command`), `arguments` (cmd, workdir)
|
|
122
|
+
- **Why:** Operational commands. Summarize the pattern, not individual calls
|
|
123
|
+
|
|
124
|
+
### 9. Base64 images in user_message (~4-7% when present) -- SKIP
|
|
125
|
+
- **Where:** `event_msg` with `payload.type == "user_message"`, `payload.images[]`
|
|
126
|
+
- **Format:** data URIs (`data:image/png;base64,...`), 350KB-550KB each
|
|
127
|
+
- **Also in:** `response_item` message content with `type: "input_image"` and `image_url`
|
|
128
|
+
- **Why:** Screenshots. Not extractable as text knowledge. A single image can be 550KB
|
|
129
|
+
|
|
130
|
+
### 10. Compacted records (10-30% when present) -- EXTRACT SELECTIVELY
|
|
131
|
+
- **Where:** `type: "compacted"` records
|
|
132
|
+
- **Contains:** `payload.replacement_history[]` -- a compressed version of earlier conversation
|
|
133
|
+
- **Treatment:** These contain summarized versions of earlier turns after context compaction. The `replacement_history` items have `role` and `content[]` with `input_text`/`output_text`. Extract output_text items (model summaries) but skip input_text (already captured from the original records earlier in the file)
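A sketch of the selective extraction described above: keep `output_text` items from `replacement_history`, ignore `input_text`. It assumes each content item stores its text in a `text` field; `compacted_summaries` is a hypothetical name.

```python
def compacted_summaries(record):
    """Return output_text strings from a compacted record's replacement_history."""
    if record.get("type") != "compacted":
        return []
    texts = []
    history = (record.get("payload") or {}).get("replacement_history", [])
    for entry in history:
        for item in entry.get("content") or []:
            is_output = isinstance(item, dict) and item.get("type") == "output_text"
            if is_output and item.get("text"):  # assumed field name
                texts.append(item["text"])
    return texts
```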
|
|
134
|
+
|
|
135
|
+
## What to summarize
|
|
136
|
+
|
|
137
|
+
| Pattern | Summarize as |
|
|
138
|
+
|---------|-------------|
|
|
139
|
+
| N consecutive exec_command function_call/output pairs | "Ran N commands exploring {pattern}" |
|
|
140
|
+
| grep/find/sed sequences reading files | "Searched for {pattern} in {directory}" |
|
|
141
|
+
| apply_patch calls | "Modified {file}: {one-line description from diff}" |
|
|
142
|
+
| Multiple agent_message records saying similar things | Keep only the last one before task_complete |
|
|
143
|
+
|
|
144
|
+
## Extraction approach
|
|
145
|
+
|
|
146
|
+
1. **Check SQLite first** for session metadata: `SELECT id, title, cwd, model, tokens_used, git_branch, source, first_user_message FROM threads WHERE id = '<thread-id>'`. The `source` field tells you if this is a subagent session.
|
|
147
|
+
|
|
148
|
+
2. **Parse the JSONL file** line by line.
|
|
149
|
+
|
|
150
|
+
3. **Extract session_meta** (first record): grab `payload.id`, `payload.cwd`, `payload.source`, `payload.model_provider`.
|
|
151
|
+
|
|
152
|
+
4. **For each record, route by type and subtype:**
|
|
153
|
+
- `event_msg` + `user_message` -> **extract** `payload.message` as human input. Note `payload.images` count but skip the base64 data
|
|
154
|
+
- `event_msg` + `agent_message` -> **extract** `payload.message` as model reasoning
|
|
155
|
+
- `event_msg` + `task_complete` -> **extract** `payload.last_agent_message` as turn summary
|
|
156
|
+
- `event_msg` + `exec_command_end` -> **skip** (duplicated in response_item)
|
|
157
|
+
- `event_msg` + `token_count` / `task_started` / `context_compacted` -> **skip**
|
|
158
|
+
- `response_item` + `message` + `role: "assistant"` -> **extract** output_text content
|
|
159
|
+
- `response_item` + `message` + `role: "developer"` or `role: "user"` with XML-tagged content -> **skip**
|
|
160
|
+
- `response_item` + `message` + `role: "user"` with plain text -> **extract** (but deduplicate against event_msg user_message)
|
|
161
|
+
- `response_item` + `function_call_output` -> **skip**
|
|
162
|
+
- `response_item` + `function_call` -> **summarize** command pattern
|
|
163
|
+
- `response_item` + `reasoning` -> **skip** (encrypted, empty summary)
|
|
164
|
+
- `response_item` + `custom_tool_call` -> **summarize** file path and change description
|
|
165
|
+
- `response_item` + `custom_tool_call_output` -> **skip** (just success/fail)
|
|
166
|
+
- `turn_context` -> **skip**
|
|
167
|
+
- `compacted` -> **extract** output_text from `payload.replacement_history[]`
|
|
168
|
+
|
|
169
|
+
5. **Deduplicate across record types:** User messages appear in both `event_msg.user_message` AND `response_item.message` (role: user). Command output appears in both `response_item.function_call_output` AND `event_msg.exec_command_end`. Always prefer the event_msg version for user messages (cleaner), skip the duplicate command output entirely.
|
|
170
|
+
|
|
171
|
+
6. **Link subagent sessions:** Check the SQLite `source` column. If it contains `subagent.thread_spawn`, this session's knowledge should be attributed to the parent thread. The `parent_thread_id` field links them.
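The routing in step 4 can be sketched as one loop over the file. This outline covers only the extract branches for `event_msg` and assistant `response_item` records; summarizing `function_call` / `custom_tool_call` and handling `compacted` records are left out, and everything else is skipped. The output tuples and names are hypothetical.

```python
import json

# event_msg subtype -> (label, payload field holding the text)
EVENT_FIELDS = {
    "user_message": ("human", "message"),
    "agent_message": ("agent", "message"),
    "task_complete": ("summary", "last_agent_message"),
}


def extract_codex_signal(lines):
    """Return (kind, text) pairs from a Codex rollout file's lines."""
    out = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        payload = record.get("payload") or {}
        record_type, subtype = record.get("type"), payload.get("type")
        if record_type == "event_msg" and subtype in EVENT_FIELDS:
            kind, field = EVENT_FIELDS[subtype]
            if payload.get(field):
                out.append((kind, payload[field]))  # images are never read
        elif record_type == "response_item" and subtype == "message" and payload.get("role") == "assistant":
            for item in payload.get("content", []):
                if isinstance(item, dict) and item.get("type") == "output_text" and item.get("text"):
                    out.append(("assistant", item["text"]))
    return out
```

Taking user messages only from `event_msg` implements the deduplication in step 5: the `response_item` copy with `role: "user"` is never read.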
|
|
172
|
+
|
|
173
|
+
## Example: signal extraction from a real session
|
|
174
|
+
|
|
175
|
+
**Human intent (from event_msg.user_message):**
|
|
176
|
+
> "Look at our codebase, we have categories and topics, we want to unify everywhere to be called topics. Find all discrepancies."
|
|
177
|
+
Signal: User wants a category-to-topic naming audit.
|
|
178
|
+
|
|
179
|
+
**Agent reasoning (from event_msg.agent_message):**
|
|
180
|
+
> "I've isolated one concrete runtime mismatch already: the page-topic footer still talks about 'categories' in UI copy while the rest of the product model is 'topics.' I'm now separating first-party mismatches from external-source fields and old design docs so the final list is usable."
|
|
181
|
+
Signal: Agent's approach to categorizing the findings.
|
|
182
|
+
|
|
183
|
+
**Task completion (from event_msg.task_complete):**
|
|
184
|
+
> "Root Cause: This is not primarily a CSS-specificity problem. The break happens because the suggestion extension's renderHTML() emits bare <ins>/<del> tags without the CSS class..."
|
|
185
|
+
Signal: Complete diagnosis with root cause and fix path.
|
|
186
|
+
|
|
187
|
+
**File edit (from custom_tool_call):**
|
|
188
|
+
> name: apply_patch, file: SuggestionChangesExtension.ts
|
|
189
|
+
> Change: Added class attribute to renderHTML() output so CSS selectors match
|
|
190
|
+
Signal: What was actually changed and why.
|
|
191
|
+
|
|
192
|
+
**Noise skipped:** 442 function_call_output records (2.2MB), 399 duplicate exec_command_end records (1.9MB), 266 encrypted reasoning records (787KB), 67 repeated turn_context records (675KB), 368 token_count records (267KB).
|
|
193
|
+
|
|
194
|
+
## Gotchas
|
|
195
|
+
|
|
196
|
+
1. **exec_command_end duplicates function_call_output.** They share the same `call_id` and contain the same command output in different field names. 100% overlap observed. Skip exec_command_end entirely.
|
|
197
|
+
|
|
198
|
+
2. **Reasoning is unreadable.** Reasoning records have `summary` and `content` fields, but the only populated data is `encrypted_content` (opaque base64); `summary` was an empty array (`[]`) in every observed session. Treat reasoning records as carrying no extractable signal.
|
|
199
|
+
|
|
200
|
+
3. **User messages appear twice.** Once in `event_msg.user_message` (clean, just the text + images) and again in `response_item.message` with `role: "user"` (mixed with system context). Use the event_msg version.
|
|
201
|
+
|
|
202
|
+
4. **Developer messages are system prompts, not human.** `response_item.message` with `role: "developer"` contains harness instructions (permissions, app context, collaboration mode, skills). These are NOT human messages. They are wrapped in XML tags like `<permissions instructions>`, `<app-context>`, etc.
|
|
203
|
+
|
|
204
|
+
5. **Images are massive.** User messages with screenshots contain base64 data URIs of 350-550KB each. A single image can make a 75-char message record balloon to 550KB. Check `payload.images` length but do not extract the base64 data.
|
|
205
|
+
|
|
206
|
+
6. **Compacted records contain earlier conversation.** When the context window fills up, Codex compacts history into `compacted` records. The `replacement_history` array contains summarized earlier turns. If you are processing the file start-to-finish, you will see the original records first and then the compacted summary later -- be careful not to double-count.
|
|
207
|
+
|
|
208
|
+
7. **Subagent sessions are separate files.** A parent session spawns subagent threads that are written to their own rollout files in the same date directory. The SQLite `source` column JSON identifies child threads. To get the complete picture of a multi-agent session, you must read all linked rollout files.
|
|
209
|
+
|
|
210
|
+
8. **Codex uses `exec_command` for almost everything.** Unlike Claude Code, which has specialized tools (Read, Grep, Edit, Bash), Codex wraps operations in `exec_command` with shell commands (`sed`, `rg`, `cat`, etc.) and uses `apply_patch` for file edits. This means function_call records are less informative about intent -- you need to parse the `cmd` field to understand what was done.
|
|
211
|
+
|
|
212
|
+
9. **task_complete contains the cleanest signal.** The `last_agent_message` in task_complete records is the final synthesized response after all tool use. If you can only extract one thing per turn, extract this.
|
|
213
|
+
|
|
214
|
+
10. **The model field is in turn_context, not session_meta.** `session_meta` has `model_provider` ("openai"), but the actual model name (e.g., "gpt-5.4") is in `turn_context.model`.
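
Gotchas 1 through 4 can be read as a per-record filter. The following is a minimal sketch, not the tool's actual implementation. It assumes each parsed rollout line is an object with a top-level `type` and a nested `payload` whose own `type` names the record kind. That nesting and the helper name `keep_codex_record` are assumptions for illustration; adjust the field lookups to whatever the real file contains.

```python
def keep_codex_record(record: dict) -> bool:
    """Return False for records that gotchas 1-4 say to skip.

    Assumed shape (hypothetical): {"type": <outer>, "payload": {"type": <inner>, ...}}
    """
    outer = record.get("type")
    payload = record.get("payload")
    if not isinstance(payload, dict):
        payload = {}
    inner = payload.get("type")

    # Gotcha 1: exec_command_end repeats function_call_output.
    # Gotcha 2: reasoning records carry only encrypted content.
    if inner in ("exec_command_end", "reasoning"):
        return False

    # Gotchas 3 and 4: response_item messages with role "user" duplicate the
    # event_msg user text; role "developer" is harness instructions.
    if outer == "response_item" and inner == "message":
        return payload.get("role") not in ("user", "developer")

    return True
```

Everything the filter keeps still needs the later gotchas applied by hand, for example stripping base64 image data from user messages.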
|
|
@@ -0,0 +1,128 @@
|
|
|
1
|
+
# Processing Unknown Session Formats
|
|
2
|
+
|
|
3
|
+
This guide is a fallback for session files that do not match known formats (Claude Code JSONL, Codex rollout JSONL). Use it when you encounter a new tool or an unrecognized file structure.
|
|
4
|
+
|
|
5
|
+
## Step 1: Identify the format
|
|
6
|
+
|
|
7
|
+
Check the file extension and first few lines:
|
|
8
|
+
|
|
9
|
+
- **JSONL** (`.jsonl`): One JSON object per line. Parse each line independently
|
|
10
|
+
- **JSON** (`.json`): Single JSON document. Could be an array of messages or a nested conversation object
|
|
11
|
+
- **Markdown** (`.md`): Likely a conversation export with `## Human` / `## Assistant` headers or similar
|
|
12
|
+
- **Plain text** (`.txt`, `.log`): Look for turn separators (blank lines, `---`, timestamps)
|
|
13
|
+
- **SQLite** (`.sqlite`, `.db`): Database with conversation tables. List tables first, then query
|
|
14
|
+
|
|
15
|
+
For JSONL, check each line for a `type`, `role`, or `kind` field. The presence of certain fields reveals the format:
|
|
16
|
+
- `type: "user"` / `type: "assistant"` + `message.content` -> Claude Code format
|
|
17
|
+
- `type: "response_item"` / `type: "event_msg"` -> Codex format
|
|
18
|
+
- `role: "user"` / `role: "assistant"` without a wrapper type -> Raw API conversation log
|
|
19
|
+
- `type: "human"` / `type: "ai"` -> LangChain/LangSmith format
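
The field checks above can be sketched as a small detector. This is an illustrative sketch only: the function name `guess_jsonl_format` and the returned labels (`"codex"`, `"claude-code"`, `"raw-api"`, `"langchain"`, `"unknown"`) are made up for this example.

```python
import json


def guess_jsonl_format(first_lines: list[str]) -> str:
    """Guess the session format from the first few JSONL lines."""
    for line in first_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON; try the next line
        if not isinstance(record, dict):
            continue

        kind = record.get("type")
        if kind in ("response_item", "event_msg"):
            return "codex"
        if kind in ("human", "ai"):
            return "langchain"
        message = record.get("message")
        if kind in ("user", "assistant") and isinstance(message, dict) and "content" in message:
            return "claude-code"
        # No wrapper type at all, just a role: raw API conversation log.
        if kind is None and record.get("role") in ("user", "assistant"):
            return "raw-api"
    return "unknown"
```

Passing only the first handful of lines keeps the check cheap on multi-megabyte session files.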
|
|
20
|
+
|
|
21
|
+
## Step 2: Classify each record
|
|
22
|
+
|
|
23
|
+
For any conversation format, records fall into these universal categories:
|
|
24
|
+
|
|
25
|
+
### Signal (extract)
|
|
26
|
+
1. **Human messages** -- What the user asked, decided, or directed. These explain intent, requirements, and constraints. Look for:
|
|
27
|
+
- Records with `role: "user"` or `type: "human"` or `type: "user_message"`
|
|
28
|
+
- Text that reads like natural language instructions, questions, or feedback
|
|
29
|
+
- Short messages (under ~500 chars) are almost always signal
|
|
30
|
+
|
|
31
|
+
2. **AI reasoning text** -- Explanations, analyses, decisions, and summaries the model produced. Look for:
|
|
32
|
+
- Records with `role: "assistant"` and text content (not tool calls)
|
|
33
|
+
- Fields named `text`, `content`, `message`, `output`, `response`
|
|
34
|
+
- Text that explains *why* something was done, not *what* command was run
|
|
35
|
+
|
|
36
|
+
3. **Final/summary responses** -- The model's synthesized answer after a chain of tool use. Look for:
|
|
37
|
+
- The last assistant message before the next human message
|
|
38
|
+
- Fields named `final_response`, `last_message`, `summary`, `result`
|
|
39
|
+
- These are typically the densest signal per byte
|
|
40
|
+
|
|
41
|
+
4. **Error messages and failures** -- What went wrong and why. Look for:
|
|
42
|
+
- Records containing `error`, `failed`, `exception`, `traceback`
|
|
43
|
+
- These often reveal important constraints, gotchas, or architectural issues
|
|
44
|
+
|
|
45
|
+
### Noise (skip)
|
|
46
|
+
1. **File contents returned by tools** -- Already in the repo. Look for:
|
|
47
|
+
- Records with `type: "tool_result"`, `type: "function_call_output"`, or `type: "tool_output"`
|
|
48
|
+
- Content that looks like source code (imports, function definitions, indented blocks)
|
|
49
|
+
- Long strings (>2KB) that are clearly file dumps
|
|
50
|
+
- **Size clue:** If a record is >10KB, it is almost certainly a tool result, not reasoning
|
|
51
|
+
|
|
52
|
+
2. **Tool invocations** -- What commands were run. Operational, not knowledge. Look for:
|
|
53
|
+
- Records with `type: "tool_use"`, `type: "function_call"`, or `type: "tool_call"`
|
|
54
|
+
- Fields named `name`, `arguments`, `input`, `command`
|
|
55
|
+
- Exception: file edit commands may contain the *what* of a change (summarize those)
|
|
56
|
+
|
|
57
|
+
3. **System prompts and instructions** -- Same across sessions. Look for:
|
|
58
|
+
- Records with `role: "system"` or `role: "developer"`
|
|
59
|
+
- Content wrapped in XML tags (`<instructions>`, `<context>`, `<rules>`)
|
|
60
|
+
- Long preambles about model behavior, tool availability, permissions
|
|
61
|
+
|
|
62
|
+
4. **Metadata and telemetry** -- Session infrastructure. Look for:
|
|
63
|
+
- Token counts, usage statistics, rate limits
|
|
64
|
+
- Timestamps, UUIDs, session IDs (useful for linking but not knowledge)
|
|
65
|
+
- Permission changes, mode switches, checkpoint markers
|
|
66
|
+
|
|
67
|
+
5. **Base64-encoded data** -- Images, files, binary content. Look for:
|
|
68
|
+
- Long strings matching `[A-Za-z0-9+/=]{1000,}` or data URIs
|
|
69
|
+
- Fields named `image`, `data`, `base64`, `source.data`
|
|
70
|
+
|
|
71
|
+
6. **Duplicate records** -- Many formats log the same event multiple ways. Look for:
|
|
72
|
+
- Records sharing an ID field (`call_id`, `tool_use_id`, `request_id`)
|
|
73
|
+
- The same text appearing in both a streaming record and a final record
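
As a rough illustration of the signal and noise rules above, the sketch below classifies one parsed record by its `type`, `role`, and serialized size. The function name and the `"unknown"` label are hypothetical, and the sketch deliberately ignores the file-edit exception, base64 stripping, and duplicate detection, which need format-specific handling.

```python
import json

# Record types the guide lists as tool results or tool invocations.
NOISE_TYPES = {
    "tool_result", "function_call_output", "tool_output",
    "tool_use", "function_call", "tool_call",
}


def classify_record(record: dict) -> str:
    """Return "signal", "noise", or "unknown" for one parsed record."""
    kind = record.get("type")
    role = record.get("role")

    if kind in NOISE_TYPES or role in ("system", "developer"):
        return "noise"
    if role in ("user", "assistant") or kind in ("human", "user_message"):
        return "signal"
    # Size clue: an unrecognized record over 10KB is almost certainly tool output.
    if len(json.dumps(record).encode("utf-8")) > 10_000:
        return "noise"
    return "unknown"
```

Checking roles before size means a user message that carries a large image is still treated as signal; its base64 payload has to be dropped separately.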
|
|
74
|
+
|
|
75
|
+
### Summarize (compress)
|
|
76
|
+
1. **Sequences of tool calls** -- "Read 15 files in src/lib/" is better than 15 individual read records
|
|
77
|
+
2. **Repetitive status updates** -- "Still searching..." x10 -> "Searched extensively"
|
|
78
|
+
3. **Build/test output** -- "Tests: 58/58 passed" not the full test runner output
|
|
79
|
+
4. **File edit details** -- "Modified auth.ts: added token validation" not the full diff
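
One simple way to compress repetitive sequences is to collapse consecutive identical labels into a count. A minimal sketch, assuming you have already reduced each record to a short label such as a tool name or status text (the function name `compress_runs` is hypothetical):

```python
from itertools import groupby


def compress_runs(labels: list[str]) -> list[str]:
    """Collapse consecutive identical labels into "label xN" entries."""
    out = []
    for label, group in groupby(labels):
        count = sum(1 for _ in group)
        out.append(label if count == 1 else f"{label} x{count}")
    return out
```

For example, `compress_runs(["Read", "Read", "Read", "Edit", "Read"])` yields `["Read x3", "Edit", "Read"]`. Turning a run into a phrase like "Read 15 files in src/lib/" still takes a look at the arguments.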
|
|
80
|
+
|
|
81
|
+
## Step 3: Estimate signal ratio
|
|
82
|
+
|
|
83
|
+
Before processing the full file, sample it:
|
|
84
|
+
|
|
85
|
+
1. Take the first 20 records, middle 20, and last 20
|
|
86
|
+
2. Classify each as signal / noise / summarize
|
|
87
|
+
3. Measure bytes in each category
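
The sampling steps above can be sketched as follows. Both helper names are hypothetical, and `classify` stands for whatever classification function you use; byte size is measured on the re-serialized record, which is an approximation of the on-disk line length.

```python
import json


def sample_indices(total: int, window: int = 20) -> list[int]:
    """Indices of the first, middle, and last `window` records, without repeats."""
    if total <= 3 * window:
        return list(range(total))  # small file: just read everything
    mid_start = (total - window) // 2
    return (
        list(range(window))
        + list(range(mid_start, mid_start + window))
        + list(range(total - window, total))
    )


def bytes_by_category(records, classify) -> dict:
    """Sum serialized byte sizes per category label returned by `classify`."""
    totals = {}
    for record in records:
        size = len(json.dumps(record).encode("utf-8"))
        label = classify(record)
        totals[label] = totals.get(label, 0) + size
    return totals
```

Dividing the signal total by the sum of all totals gives an estimate to compare against the typical ratios.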
|
|
88
|
+
|
|
89
|
+
Typical ratios from known formats:
|
|
90
|
+
- **Claude Code:** 3-15% signal, 85-97% noise (tool results + wrapper overhead dominate)
|
|
91
|
+
- **Codex:** 10-19% signal, 50-70% noise, 20-30% ambiguous (reasoning encrypted, compacted records)
|
|
92
|
+
- **Raw API logs:** 30-50% signal (no tool overhead, just conversation)
|
|
93
|
+
- **Chat exports (markdown):** 60-80% signal (already cleaned by the export process)
|
|
94
|
+
|
|
95
|
+
## Step 4: Extract
|
|
96
|
+
|
|
97
|
+
For each signal record, extract:
|
|
98
|
+
- **Who said it:** human or AI
|
|
99
|
+
- **What they said:** the text content
|
|
100
|
+
- **When:** timestamp if available
|
|
101
|
+
- **Context:** what came before (the preceding human message gives context to an AI response)
|
|
102
|
+
|
|
103
|
+
Structure the output as a sequence of turns:
|
|
104
|
+
```
|
|
105
|
+
[timestamp] HUMAN: <message>
|
|
106
|
+
[timestamp] AI: <response>
|
|
107
|
+
[timestamp] HUMAN: <follow-up>
|
|
108
|
+
[timestamp] AI: <response>
|
|
109
|
+
```
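
A minimal sketch of producing that layout from extracted turns. It assumes each turn is a dict with hypothetical keys `who`, `text`, and an optional `timestamp`; the `unknown time` placeholder for missing timestamps is also an assumption, not something the guide prescribes.

```python
def format_turns(turns: list[dict]) -> str:
    """Render extracted turns as "[timestamp] SPEAKER: text" lines."""
    lines = []
    for turn in turns:
        stamp = turn.get("timestamp") or "unknown time"
        speaker = "HUMAN" if turn["who"] == "human" else "AI"
        lines.append(f"[{stamp}] {speaker}: {turn['text']}")
    return "\n".join(lines)
```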
|
|
110
|
+
|
|
111
|
+
## Step 5: Handle unknowns gracefully
|
|
112
|
+
|
|
113
|
+
If you cannot classify a record:
|
|
114
|
+
- If it is <1KB, include it (small records are cheap to carry)
|
|
115
|
+
- If it is >10KB, skip it (large unclassified records are almost always tool output)
|
|
116
|
+
- If it contains natural language prose, include it
|
|
117
|
+
- If it contains code, JSON, or structured data, skip it
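
These fallback rules can overlap (a small JSON record matches both the first and last rule). The sketch below assumes the size checks win and the content check only decides mid-sized records; that ordering is an assumption. The prose test is a deliberately crude, hypothetical heuristic based on how many structural symbols the text contains.

```python
def looks_like_prose(text: str) -> bool:
    """Crude heuristic: prose has very few brackets, braces, and similar symbols."""
    if not text:
        return False
    structural = sum(text.count(ch) for ch in "{}[]();=<>")
    return structural / len(text) < 0.02


def keep_unclassified(text: str) -> bool:
    """Decide whether to keep a record that matched no known category."""
    size = len(text.encode("utf-8"))
    if size < 1024:
        return True   # small records are cheap to carry
    if size > 10 * 1024:
        return False  # large unclassified records are almost always tool output
    return looks_like_prose(text)
```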
|
|
118
|
+
|
|
119
|
+
## Privacy checks
|
|
120
|
+
|
|
121
|
+
Before outputting extracted content, scan for:
|
|
122
|
+
- API keys: patterns like `sk-`, `ghp_`, `Bearer `, `token: "`
|
|
123
|
+
- Passwords: fields named `password`, `passwd`, `secret`
|
|
124
|
+
- Personal data: email addresses, IP addresses, phone numbers
|
|
125
|
+
- File paths: may reveal usernames (e.g., `/Users/johndoe/`)
|
|
126
|
+
- JWT tokens: strings matching `eyJ...`
|
|
127
|
+
|
|
128
|
+
Flag these but do not include them in extracted output.
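
A minimal sketch of such a scan, using regular expressions for a few of the patterns listed above. The patterns are illustrative and far from exhaustive, and the helper name `redact` and the `[REDACTED:...]` placeholder format are assumptions for this example.

```python
import re

# Illustrative patterns only; real secrets come in many more shapes.
SECRET_PATTERNS = {
    "api_key": re.compile(r"\b(?:sk-|ghp_)[A-Za-z0-9_-]{8,}"),
    "bearer": re.compile(r"Bearer\s+[A-Za-z0-9._~+/-]+=*"),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]*"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "home_path": re.compile(r"/(?:Users|home)/[^/\s]+"),
}


def redact(text: str) -> tuple[str, list[str]]:
    """Replace matches with placeholders and report which kinds were found."""
    found = []
    for label, pattern in SECRET_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[REDACTED:{label}]", text)
    return text, found
```

The returned list of labels serves as the flag, while the redacted text is what goes into the extracted output.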
|