codealmanac 0.1.4 → 0.1.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/guides/mini.md CHANGED
@@ -144,7 +144,7 @@ wikilink-syntax
  # 2. Triage with --lead
  $ almanac show sqlite-indexer --lead
  The indexer (`src/indexer/`) builds and maintains `.almanac/index.db` — a
- SQLite database that powers all query commands (`search`, `info`, `health`,
+ SQLite database that powers all query commands (`search`, `show`, `health`,
  `topics show`). It runs silently before every query command, comparing page
  file mtimes against the stored `content_hash`; only changed or new pages are
  re-parsed.
@@ -216,6 +216,32 @@ No logs at all → the hook isn't installed, or bailed before backgrounding, or
 
  ---
 
+ ## Staying current
+
+ codealmanac checks for updates in the background (once per 24h) after each
+ command. When a new version is available, you'll see a stderr banner on
+ every subsequent invocation:
+
+ ```
+ ! codealmanac 0.1.6 available (you're on 0.1.5) — run: almanac update
+ ```
+
+ The banner shows on every command until you update or dismiss it. Run:
+
+ ```bash
+ almanac update           # upgrade to latest (foreground `npm i -g codealmanac@latest`)
+ almanac update --dismiss # skip this version; banner goes away until the next release
+ almanac update --check   # check now without installing (bypasses 24h cache)
+ almanac doctor           # see current update status + notifier setting
+ ```
+
+ Auto-install is deliberately NOT the default — silent install without consent
+ violates the trust contract, npm prefixes diverge across version managers, and a
+ mid-invocation binary swap corrupts dynamic imports. Tier B (nag + manual
+ install) is the design. See `almanac update --help` for the full flag set.
+
+ ---
+
  ## When in doubt
 
  - `.almanac/README.md` — repo-specific conventions + notability bar
@@ -0,0 +1,152 @@
+ # Processing Claude Code Sessions
+
+ ## Format overview
+
+ Claude Code stores sessions as JSONL files (one JSON object per line) at:
+ ```
+ ~/.claude/projects/<project-hash>/<session-uuid>.jsonl
+ ```
+
+ The project hash is a path with slashes replaced by dashes (e.g., `-Users-rohan-Desktop-Projects-myrepo`). Each session file contains the full conversation history including tool calls, tool results, thinking blocks, and metadata.
+
+ Typical session sizes: 150KB (short Q&A) to 8MB+ (multi-hour coding session). Line counts range from ~75 to ~1,750.
+
+ ## Record types
+
+ Each line is a JSON object with a `type` field at the top level:
+
+ | Type | Frequency | What it contains |
+ |------|-----------|-----------------|
+ | `assistant` | ~40% of records, ~15% of bytes | Model responses: text, tool_use, thinking blocks. Each content item is in `message.content[]` |
+ | `user` | ~35% of records, ~55% of bytes | Human messages OR tool results. Check `message.content[].type` to distinguish |
+ | `attachment` | ~5% of records, ~7% of bytes | System context injected by the harness: deferred tool lists, skill listings, memory, task reminders, edited file snippets |
+ | `file-history-snapshot` | ~5% of records, <1% of bytes | Checkpoint markers for undo/redo. Always tiny (~250 bytes) |
+ | `permission-mode` | ~3% of records, <1% of bytes | Records when permission mode changes (e.g., `bypassPermissions`) |
+ | `last-prompt` | ~3% of records, <1% of bytes | Marks turn boundaries. ~120 bytes each |
+ | `system` | ~2% of records, <1% of bytes | System messages injected mid-conversation. Often empty content |
+ | `ai-title` | rare | Auto-generated session title |
+ | `queue-operation` | rare | Queued follow-up commands |
+
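As a quick sanity check, the table above can be reproduced for any session file by tallying the top-level `type` field. A minimal sketch (the function name is ours, not part of any tool):

```python
import json
from collections import Counter

def tally_record_types(path):
    """Count top-level `type` values in a session JSONL file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate stray blank lines
            record = json.loads(line)
            counts[record.get("type", "<no type>")] += 1
    return counts
```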
+ ## What to extract (signal)
+
+ ### 1. Human messages (highest signal density)
+ - **Where:** Records with `type: "user"` where `message.content[]` contains items with `type: "text"`
+ - **Also check:** `userType` field -- `"external"` means the actual human typed this
+ - **What:** Intent, requirements, feedback, decisions, bug reports, design direction
+ - **Example pattern:** `{"type": "user", "message": {"content": [{"type": "text", "text": "What problems did you run into?"}]}}`
+
+ ### 2. Assistant text responses
+ - **Where:** Records with `type: "assistant"`, then `message.content[]` items with `type: "text"`
+ - **What:** Explanations, decisions, summaries, architecture analysis, bug diagnoses
+ - **Typical size:** 100-3000 chars per text block
+ - **Example pattern:** The assistant explains a root cause, summarizes what was built, or describes a design decision
+
+ ### 3. Subagent results (high-value signal hidden in noise)
+ - **Where:** Records with `type: "user"` that have a top-level `toolUseResult` field with `agentType` set
+ - **What:** Complete results from subagents (review, critic, pair, etc.). The `content` field contains the full subagent output, often multi-thousand-character analysis
+ - **Key fields:** `toolUseResult.agentType` (e.g., "general-purpose", "review", "critic", "pair"), `toolUseResult.content[].text`, `toolUseResult.prompt` (what the subagent was asked to do)
+ - **Why it matters:** These are often the densest signal in a session -- complete audit reports, code reviews, architecture analyses
+
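The subagent rule above can be sketched as a small predicate-plus-extractor. This is illustrative code based on the field names described in this section, not harness code:

```python
def extract_subagent_result(record):
    """Return (agent_type, prompt, text) for a subagent result, else None.

    Signal lives in records of type "user" whose top-level `toolUseResult`
    carries `agentType`; everything else is left to the other passes.
    """
    if record.get("type") != "user":
        return None
    result = record.get("toolUseResult") or {}
    if not result.get("agentType"):
        return None  # plain tool result duplication: noise
    text = "\n".join(
        item["text"]
        for item in result.get("content", [])
        if isinstance(item, dict) and "text" in item
    )
    return result["agentType"], result.get("prompt", ""), text
```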
+ ### 4. Session metadata
+ - **Where:** First few records of the file
+ - **Key fields on user records:** `cwd`, `gitBranch`, `version`, `timestamp`, `sessionId`, `entrypoint` (cli vs other)
+ - **Attachment records** with `type: "nested_memory"` contain the project's memory/context
+
+ ## What to skip (noise)
+
+ ### 1. Tool results (~24% of total bytes) -- SKIP
+ - **Where:** `user` records where `message.content[]` has `type: "tool_result"`
+ - **Why skip:** These are file contents, grep results, build output, test output. The actual files are in the repo; the output is transient
+ - **The biggest offenders:** Read tool results can be 150KB+ (entire file contents dumped inline)
+
+ ### 2. Tool calls (~8% of total bytes) -- SKIP or SUMMARIZE
+ - **Where:** `assistant` records where `message.content[]` has `type: "tool_use"`
+ - **Contains:** `name` (Bash, Read, Grep, Edit, Write, Glob) and `input` (command, file path, pattern)
+ - **Why skip:** The sequence of tool calls is operational, not knowledge. Exception: summarize the *pattern* of tool usage ("read 15 files in src/auth/")
+
+ ### 3. Wrapper overhead (~38% of total bytes in large sessions) -- SKIP
+ - **Where:** Top-level fields on `user` records: `parentUuid`, `sourceToolAssistantUUID`, `toolUseResult` (when not a subagent), `slug`, `requestId`, `isMeta`, `isSidechain`
+ - **Why skip:** The `toolUseResult` field on user records DUPLICATES the tool result content that already appears in `message.content[]`. This is the single biggest source of bloat. In one 8MB session, `toolUseResult` alone was 2.7MB (33%)
+ - **EXCEPTION:** When `toolUseResult.agentType` is set, this is a subagent result and IS signal
+
+ ### 4. Empty thinking blocks (~6% in some sessions) -- SKIP
+ - **Where:** `assistant` records, `message.content[]` with `type: "thinking"` but `thinking: ""`
+ - **Why:** Claude Code often records thinking blocks with empty content (the actual thinking happened but was not persisted). These are pure waste
+
+ ### 5. Attachments (~7%) -- SKIP
+ - **Where:** `type: "attachment"` records
+ - **Contains:** `deferred_tools_delta` (tool availability lists), `skill_listing` (repeated skill menus), `task_reminder` (repeated TODO lists), `mcp_instructions_delta` (MCP setup)
+ - **Why skip:** Harness infrastructure, not knowledge. Repeated across turns
+
+ ### 6. Metadata records (<1%) -- SKIP
+ - `file-history-snapshot`, `permission-mode`, `last-prompt`, `queue-operation`, `ai-title`
+
+ ### 7. Base64 image data (~1-7% in some sessions) -- SKIP
+ - **Where:** `user` records with `message.content[]` containing `type: "image"` and `source.type: "base64"`
+ - **Why skip:** Screenshots pasted by the user. Can be 100KB+ of base64 per image. Not extractable as knowledge
+
+ ## What to summarize
+
+ These patterns should be compressed rather than fully extracted or fully skipped:
+
+ | Pattern | Summarize as |
+ |---------|-------------|
+ | 10+ consecutive tool_use/tool_result pairs reading files | "Read N files in {directory pattern}" |
+ | grep/glob sequences searching for a pattern | "Searched for {pattern} across {scope}" |
+ | Edit tool calls modifying files | "Modified {file}: {description from the Edit input}" |
+ | Bash commands running tests | "Ran tests: {pass/fail summary from result}" |
+ | Bash commands running builds | "Built {target}: {success/failure}" |
+
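The first two table rows amount to run-length collapsing over an already-classified record stream. A sketch (the `("tool", name)` / `("text", content)` tuples are our own intermediate representation, not fields in the file):

```python
from collections import Counter
from itertools import groupby

def collapse_tool_runs(items, min_run=3):
    """Collapse runs of consecutive tool items into one summary entry.

    `items` is a sequence of ("tool", tool_name) or ("text", content)
    tuples produced by an earlier classification pass.
    """
    out = []
    for kind, group in groupby(items, key=lambda t: t[0]):
        group = list(group)
        if kind == "tool" and len(group) >= min_run:
            names = Counter(name for _, name in group)
            detail = ", ".join(f"{n} x{c}" for n, c in names.most_common())
            out.append(("summary", f"Ran {len(group)} tool calls ({detail})"))
        else:
            out.extend(group)
    return out
```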
+ ## Extraction approach
+
+ 1. **Parse the JSONL file** line by line. Each line is one JSON object.
+ 2. **First pass -- extract metadata:** From the first `user` record, grab `cwd`, `gitBranch`, `version`, `timestamp`, `sessionId`.
+ 3. **For each record, check `type`:**
+ - `type: "user"` with `message.content[].type == "text"` -> **extract as human message**
+ - `type: "user"` with `toolUseResult.agentType` set -> **extract subagent result** from `toolUseResult.content[].text`
+ - `type: "user"` with `message.content[].type == "tool_result"` -> **skip** (or summarize tool call patterns)
+ - `type: "assistant"` with `message.content[].type == "text"` -> **extract as assistant reasoning**
+ - `type: "assistant"` with `message.content[].type == "thinking"` and non-empty `thinking` field -> **extract as internal reasoning**
+ - `type: "assistant"` with `message.content[].type == "tool_use"` -> **skip or summarize**
+ - `type: "attachment"`, `type: "system"`, metadata types -> **skip**
+ 4. **Second pass -- deduplicate:** The `toolUseResult` field on user records often duplicates content from `message.content[]`. Always prefer `message.content[]` and only use `toolUseResult` for subagent data.
+ 5. **Third pass -- compress tool sequences:** Collapse consecutive tool_use + tool_result pairs into summaries.
+
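The routing rules in step 3 can be sketched as a single classifier function. This is an illustrative implementation of the rules above, assuming the field layout described earlier in this guide:

```python
def classify(record):
    """Route one Claude Code record; returns (action, payload).

    Actions: "human", "subagent", "assistant", "thinking", "tool", "skip".
    """
    msg = record.get("message") or {}
    content = [c for c in (msg.get("content") or []) if isinstance(c, dict)]
    texts = [c["text"] for c in content if c.get("type") == "text"]
    rtype = record.get("type")
    if rtype == "user":
        tur = record.get("toolUseResult") or {}
        if tur.get("agentType"):          # subagent result: dense signal
            return "subagent", tur
        if texts:                         # real human message
            return "human", "\n".join(texts)
        return "skip", None               # tool_result payloads: noise
    if rtype == "assistant":
        if texts:
            return "assistant", "\n".join(texts)
        thinking = [c["thinking"] for c in content
                    if c.get("type") == "thinking" and c.get("thinking")]
        if thinking:                      # rarely non-empty in practice
            return "thinking", "\n".join(thinking)
        if any(c.get("type") == "tool_use" for c in content):
            return "tool", None           # summarize the pattern elsewhere
        return "skip", None               # empty thinking blocks etc.
    return "skip", None                   # attachment, system, metadata types
```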
+ ## Example: signal extraction from a real session
+
+ **Human intent (from user record):**
+ > "Tell me about our ref token?"
+ Signal: User wants to understand the ref token system.
+
+ **Assistant reasoning (from assistant text):**
+ > "Ref Token: Opaque backend-signed token that binds a locally-edited .md file to a specific page in the DB. Backend signs on download, verifies on publish. No signing secret ever touches the client."
+ Signal: Architectural explanation of the ref token system.
+
+ **Assistant problem diagnosis (from assistant text):**
+ > "Root cause: the codealmanac CLI expects topics.yaml in the new list format but .almanac/topics.yaml is still in the old dict format"
+ Signal: Bug identification with root cause.
+
+ **Subagent audit (from toolUseResult with agentType):**
+ > agentType: "general-purpose", prompt: "You are auditing the hosted editor..."
+ > content: "Audit Report: Hosted Editor / Quill Co-Editing (Pre-Phase-5) ... 3 HIGH gaps found ... Fix malformed edit-result surfacing ... Fix Quill session isolation ..."
+ Signal: Complete code audit with findings, priorities, and recommendations.
+
+ **Noise skipped:** 522 tool results totaling 2MB of file contents, 466KB of empty thinking blocks, 408KB of metadata records, 2.7MB of duplicated toolUseResult data.
+
+ ## Gotchas
+
+ 1. **toolUseResult duplication is massive.** In one 8MB session, `toolUseResult` accounted for 33% of the file. It duplicates `message.content[]` tool results AND sometimes contains subagent results. Always check `agentType` before discarding.
+
+ 2. **Thinking blocks are always empty in recent versions.** The `thinking` field in content items of type `"thinking"` is consistently empty string in observed sessions (95 empty out of 95 in one session). The thinking content is not persisted. Do not expect signal here despite the promising field name.
+
+ 3. **assistant records contain message-level metadata.** The `message` object has `usage` (token counts), `model` (model name), `stop_reason` (why generation stopped). These can be useful for understanding session dynamics but are not knowledge signal.
+
+ 4. **user records serve double duty.** A `user` record with `userType: "external"` and text content is a real human message. A `user` record with tool_result content is just the harness returning tool output. Always check `message.content[].type`.
+
+ 5. **The `isSidechain` field** indicates branched conversations (user went back and tried a different approach). Sidechain records may represent abandoned approaches -- still potentially valuable as "what was tried and rejected."
+
+ 6. **Base64 images can be huge.** A single screenshot paste can add 100KB+ of base64 data to a user record. These look like small records until you measure them.
+
+ 7. **Attachment records repeat.** `skill_listing` and `task_reminder` attachments are re-injected at many turn boundaries. The same content appears 10-15 times in a long session.
+
+ 8. **Subagent data structure:** The `toolUseResult` for subagents includes `toolStats` (readCount, searchCount, bashCount, editFileCount, linesAdded, linesRemoved) and `usage` (input_tokens, output_tokens, cache stats). These are useful metadata about the subagent's work.
@@ -0,0 +1,214 @@
+ # Processing Codex Sessions
+
+ ## Format overview
+
+ Codex stores sessions as JSONL files at:
+ ```
+ ~/.codex/sessions/<year>/<month>/<day>/rollout-<timestamp>-<thread-uuid>.jsonl
+ ```
+
+ A SQLite database at `~/.codex/state_5.sqlite` provides session metadata (title, cwd, model, tokens_used, git info) in the `threads` table.
+
+ Multiple rollout files for the same timestamp indicate subagent threads spawned by the parent session. The SQLite `source` column reveals this: `"vscode"` for top-level sessions, a JSON blob with `subagent.thread_spawn.parent_thread_id` for child threads.
+
+ Typical session sizes: 500KB (short task) to 12MB+ (multi-turn debugging session). Line counts range from ~175 to ~2,800.
+
+ ## Record types
+
+ Each line is a JSON object with a `type` field:
+
+ | Type | % of records | % of bytes | What it contains |
+ |------|-------------|------------|-----------------|
+ | `response_item` | ~55% | ~36% | Model outputs: function calls, function call outputs, messages, reasoning |
+ | `event_msg` | ~43% | ~29% | Harness events: command execution results, token counts, agent messages, task lifecycle |
+ | `turn_context` | ~2% | ~5% | Per-turn context: model, cwd, instructions, settings. Repeated every turn |
+ | `session_meta` | 1 per file | ~1-3% | Session metadata: id, cwd, model, CLI version, base instructions, skills |
+ | `compacted` | rare (0-3) | 10-30% when present | Compressed conversation history from context window compaction |
+
+ ### response_item subtypes (in `payload.type`)
+
+ | Subtype | What it contains |
+ |---------|-----------------|
+ | `function_call` | Tool invocations: `name` (always `exec_command`), `arguments` (JSON with cmd, workdir, yield_time_ms) |
+ | `function_call_output` | Tool results: `output` (command stdout/stderr as string) |
+ | `message` | Conversation messages. Check `payload.role`: `developer` (system prompts), `user` (human + context), `assistant` (model output) |
+ | `reasoning` | Model reasoning. Contains `encrypted_content` (unreadable) and `summary` (always empty in observed data) |
+ | `custom_tool_call` | File edit operations via `apply_patch`. Contains unified diff in `input` |
+ | `custom_tool_call_output` | Patch application results: success/failure + modified file list |
+ | `web_search_call` | Web search invocations (rare) |
+
+ ### event_msg subtypes (in `payload.type`)
+
+ | Subtype | What it contains |
+ |---------|-----------------|
+ | `user_message` | Human input: `message` (text), `images` (base64 data URIs), `text_elements` |
+ | `agent_message` | Model commentary shown to user: `message`, `phase` (always "commentary") |
+ | `exec_command_end` | Command execution results (DUPLICATES `function_call_output`): stdout, stderr, aggregated_output, exit_code, duration, command |
+ | `token_count` | Token usage and rate limit info per turn |
+ | `task_started` | Turn lifecycle: turn_id, model_context_window, collaboration_mode |
+ | `task_complete` | Turn completion: `last_agent_message` (the final text shown to user) |
+ | `patch_apply_end` | File edit results: stdout, changes list, success boolean |
+ | `context_compacted` | Marker that context window was compacted |
+ | `turn_aborted` | Turn was cancelled |
+
+ ## What to extract (signal)
+
+ ### 1. Human messages (highest signal density)
+ - **Where:** `event_msg` records with `payload.type == "user_message"`
+ - **Field:** `payload.message`
+ - **Also in:** `response_item` with `payload.type == "message"` and `payload.role == "user"`, `payload.content[].type == "input_text"`. The event_msg version is cleaner
+ - **Watch for:** The `response_item` version also contains system/developer context injected alongside the real user message. Extract only `input_text` items where the text is NOT wrapped in XML tags like `<environment_context>`, `<permissions instructions>`, `<app-context>`, `<skills_instructions>`, `<collaboration_mode>`
+
+ ### 2. Agent messages (model's user-facing commentary)
+ - **Where:** `event_msg` records with `payload.type == "agent_message"`
+ - **Fields:** `payload.message`, `payload.phase`
+ - **What:** Short status updates and reasoning the model shares with the user. These are the "thinking out loud" moments
+ - **Example:** "I'm inspecting the repo for category vs topic naming drift and I'll trace it through code, routes, API shapes, and copy so the discrepancies are concrete rather than guessed."
+
+ ### 3. Task completion summaries
+ - **Where:** `event_msg` records with `payload.type == "task_complete"`
+ - **Field:** `payload.last_agent_message`
+ - **What:** The final, complete response for each turn. Often the densest signal -- the model's synthesized answer after all tool use. Can be multi-thousand characters of analysis
+
+ ### 4. Assistant output text
+ - **Where:** `response_item` with `payload.type == "message"`, `payload.role == "assistant"`, content items with `type: "output_text"`
+ - **What:** Model's text responses interspersed with tool calls. Shorter than task_complete but captures incremental reasoning
+
+ ### 5. File edits (apply_patch)
+ - **Where:** `response_item` with `payload.type == "custom_tool_call"` and `payload.name == "apply_patch"`
+ - **Field:** `payload.input` contains a unified diff
+ - **What:** Every code change the model made. Extract the file path and a summary of the change, not the full diff (the repo has the final state)
+
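If you only need the touched file paths out of `payload.input`, a header scan is usually enough. The exact patch envelope is not guaranteed, so treat both header patterns here as assumptions to verify against your own sessions:

```python
import re

# Two header styles you may encounter in patch payloads; inspect your
# sessions and adjust -- the exact envelope is not guaranteed.
_HEADERS = [
    re.compile(r"^\+\+\+ (?:b/)?(?P<path>\S+)", re.MULTILINE),
    re.compile(r"^\*\*\* (?:Add|Update|Delete) File: (?P<path>.+)$", re.MULTILINE),
]

def patched_files(diff_text):
    """List the file paths a patch touches, in order, without duplicates."""
    seen, paths = set(), []
    for pattern in _HEADERS:
        for match in pattern.finditer(diff_text):
            path = match.group("path").strip()
            if path not in seen:
                seen.add(path)
                paths.append(path)
    return paths
```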
82
+ ### 6. Session metadata
83
+ - **Where:** `session_meta` record (first line of file)
84
+ - **Key fields:** `payload.id`, `payload.cwd`, `payload.model_provider`, `payload.cli_version`, `payload.source`, `payload.model` (in turn_context)
85
+ - **SQLite enrichment:** Query `threads` table for `title`, `tokens_used`, `git_branch`, `first_user_message`, `source` (reveals if this is a subagent)
86
+
87
+ ## What to skip (noise)
88
+
89
+ ### 1. function_call_output records (~17% of bytes) -- SKIP
90
+ - **Where:** `response_item` with `payload.type == "function_call_output"`
91
+ - **Why:** Raw command output (file contents, grep results, build output). Already in the repo or transient
92
+
93
+ ### 2. exec_command_end records (~15% of bytes) -- SKIP
94
+ - **Where:** `event_msg` with `payload.type == "exec_command_end"`
95
+ - **Why:** DUPLICATES `function_call_output` with the same `call_id`. Contains stdout, stderr, aggregated_output redundantly. In one 12MB session, 399 of these consumed 1.9MB
96
+ - **Note:** 100% overlap with function_call_output on shared call_ids
97
+
98
+ ### 3. Reasoning records (~6% of bytes) -- SKIP
99
+ - **Where:** `response_item` with `payload.type == "reasoning"`
100
+ - **Why:** Contains `encrypted_content` (base64 blob, unreadable) and `summary` (consistently empty array in all observed sessions). No extractable signal
101
+ - **Do not confuse with:** `agent_message` records, which ARE readable model reasoning
102
+
103
+ ### 4. turn_context records (~5-25% of bytes) -- SKIP
104
+ - **Where:** `type: "turn_context"`
105
+ - **Why:** Repeated every turn. Contains model name, cwd, instructions, sandbox policy, collaboration mode. Same content each time with minor variations
106
+
107
+ ### 5. session_meta base_instructions (~3% of bytes) -- SKIP
108
+ - **Where:** `session_meta` record, `payload.base_instructions.text`
109
+ - **Why:** Codex's built-in system prompt. Same across all sessions. ~2000 chars of personality and behavior instructions
110
+
111
+ ### 6. Developer instructions in message records -- SKIP
112
+ - **Where:** `response_item` messages with `payload.role == "developer"`
113
+ - **Contains:** `<permissions instructions>`, `<app-context>`, `<collaboration_mode>`, `<apps_instructions>`, `<skills_instructions>` XML blocks
114
+ - **Why:** Harness configuration, not user knowledge. Can be 10KB+ per occurrence
115
+
116
+ ### 7. token_count records (~2%) -- SKIP
117
+ - Rate limit and token usage telemetry
118
+
119
+ ### 8. function_call records (~1.5%) -- SKIP or SUMMARIZE
120
+ - **Where:** `response_item` with `payload.type == "function_call"`
121
+ - **Contains:** `name` (always `exec_command`), `arguments` (cmd, workdir)
122
+ - **Why:** Operational commands. Summarize the pattern, not individual calls
123
+
124
+ ### 9. Base64 images in user_message (~4-7% when present) -- SKIP
125
+ - **Where:** `event_msg` with `payload.type == "user_message"`, `payload.images[]`
126
+ - **Format:** data URIs (`data:image/png;base64,...`), 350KB-550KB each
127
+ - **Also in:** `response_item` message content with `type: "input_image"` and `image_url`
128
+ - **Why:** Screenshots. Not extractable as text knowledge. A single image can be 550KB
129
+
130
+ ### 10. Compacted records (10-30% when present) -- EXTRACT SELECTIVELY
131
+ - **Where:** `type: "compacted"` records
132
+ - **Contains:** `payload.replacement_history[]` -- a compressed version of earlier conversation
133
+ - **Treatment:** These contain summarized versions of earlier turns after context compaction. The `replacement_history` items have `role` and `content[]` with `input_text`/`output_text`. Extract output_text items (model summaries) but skip input_text (already captured from the original records earlier in the file)
134
+
135
+ ## What to summarize
136
+
137
+ | Pattern | Summarize as |
138
+ |---------|-------------|
139
+ | N consecutive exec_command function_call/output pairs | "Ran N commands exploring {pattern}" |
140
+ | grep/find/sed sequences reading files | "Searched for {pattern} in {directory}" |
141
+ | apply_patch calls | "Modified {file}: {one-line description from diff}" |
142
+ | Multiple agent_message records saying similar things | Keep only the last one before task_complete |
143
+
144
+ ## Extraction approach
145
+
146
+ 1. **Check SQLite first** for session metadata: `SELECT id, title, cwd, model, tokens_used, git_branch, source, first_user_message FROM threads WHERE id = '<thread-id>'`. The `source` field tells you if this is a subagent session.
147
+
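Step 1 can be wrapped in a small helper. This sketch assumes the `threads` columns named above; the `state_5.sqlite` schema is internal to Codex and may change between versions:

```python
import sqlite3

def thread_meta(db_path, thread_id):
    """Fetch one thread's metadata row as a dict, or None if absent."""
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row  # column-name access instead of tuples
    try:
        row = con.execute(
            "SELECT id, title, cwd, model, tokens_used, git_branch, source,"
            " first_user_message FROM threads WHERE id = ?",
            (thread_id,),
        ).fetchone()
        return dict(row) if row else None
    finally:
        con.close()
```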
+ 2. **Parse the JSONL file** line by line.
+
+ 3. **Extract session_meta** (first record): grab `payload.id`, `payload.cwd`, `payload.source`, `payload.model_provider`.
+
+ 4. **For each record, route by type and subtype:**
+ - `event_msg` + `user_message` -> **extract** `payload.message` as human input. Note `payload.images` count but skip the base64 data
+ - `event_msg` + `agent_message` -> **extract** `payload.message` as model reasoning
+ - `event_msg` + `task_complete` -> **extract** `payload.last_agent_message` as turn summary
+ - `event_msg` + `exec_command_end` -> **skip** (duplicated in response_item)
+ - `event_msg` + `token_count` / `task_started` / `context_compacted` -> **skip**
+ - `response_item` + `message` + `role: "assistant"` -> **extract** output_text content
+ - `response_item` + `message` + `role: "developer"` or `role: "user"` with XML-tagged content -> **skip**
+ - `response_item` + `message` + `role: "user"` with plain text -> **extract** (but deduplicate against event_msg user_message)
+ - `response_item` + `function_call_output` -> **skip**
+ - `response_item` + `function_call` -> **summarize** command pattern
+ - `response_item` + `reasoning` -> **skip** (encrypted, empty summary)
+ - `response_item` + `custom_tool_call` -> **summarize** file path and change description
+ - `response_item` + `custom_tool_call_output` -> **skip** (just success/fail)
+ - `turn_context` -> **skip**
+ - `compacted` -> **extract** output_text from `payload.replacement_history[]`
+
+ 5. **Deduplicate across record types:** User messages appear in both `event_msg.user_message` AND `response_item.message` (role: user). Command output appears in both `response_item.function_call_output` AND `event_msg.exec_command_end`. Always prefer the event_msg version for user messages (cleaner), skip the duplicate command output entirely.
+
+ 6. **Link subagent sessions:** Check the SQLite `source` column. If it contains `subagent.thread_spawn`, this session's knowledge should be attributed to the parent thread. The `parent_thread_id` field links them.
+
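The routing table in step 4 can be sketched as one function. Illustrative code assuming the payload layout described in this guide; `compacted` records are deliberately left to a separate pass:

```python
def route(record):
    """Route one Codex rollout record; returns (action, text).

    Actions: "human", "agent", "turn_summary", "assistant", "edit", "skip".
    """
    payload = record.get("payload") or {}
    ptype = payload.get("type")
    rtype = record.get("type")
    if rtype == "event_msg":
        if ptype == "user_message":
            return "human", payload.get("message", "")
        if ptype == "agent_message":
            return "agent", payload.get("message", "")
        if ptype == "task_complete":
            return "turn_summary", payload.get("last_agent_message", "")
        return "skip", None       # exec_command_end (duplicate), telemetry
    if rtype == "response_item":
        if ptype == "message" and payload.get("role") == "assistant":
            texts = [c.get("text", "") for c in payload.get("content", [])
                     if isinstance(c, dict) and c.get("type") == "output_text"]
            return "assistant", "\n".join(texts)
        if ptype == "custom_tool_call" and payload.get("name") == "apply_patch":
            return "edit", payload.get("input", "")
        return "skip", None       # outputs, reasoning, developer/user dupes
    return "skip", None           # turn_context; handle `compacted` separately
```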
+ ## Example: signal extraction from a real session
+
+ **Human intent (from event_msg.user_message):**
+ > "Look at our codebase, we have categories and topics, we want to unify everywhere to be called topics. Find all discrepancies."
+ Signal: User wants a category-to-topic naming audit.
+
+ **Agent reasoning (from event_msg.agent_message):**
+ > "I've isolated one concrete runtime mismatch already: the page-topic footer still talks about 'categories' in UI copy while the rest of the product model is 'topics.' I'm now separating first-party mismatches from external-source fields and old design docs so the final list is usable."
+ Signal: Agent's approach to categorizing the findings.
+
+ **Task completion (from event_msg.task_complete):**
+ > "Root Cause: This is not primarily a CSS-specificity problem. The break happens because the suggestion extension's renderHTML() emits bare <ins>/<del> tags without the CSS class..."
+ Signal: Complete diagnosis with root cause and fix path.
+
+ **File edit (from custom_tool_call):**
+ > name: apply_patch, file: SuggestionChangesExtension.ts
+ > Change: Added class attribute to renderHTML() output so CSS selectors match
+ Signal: What was actually changed and why.
+
+ **Noise skipped:** 442 function_call_output records (2.2MB), 399 duplicate exec_command_end records (1.9MB), 266 encrypted reasoning records (787KB), 67 repeated turn_context records (675KB), 368 token_count records (267KB).
+
+ ## Gotchas
+
+ 1. **exec_command_end duplicates function_call_output.** They share the same `call_id` and contain the same command output in different field names. 100% overlap observed. Skip exec_command_end entirely.
+
+ 2. **Reasoning is unreadable.** Despite having `summary` and `content` fields, reasoning records contain only `encrypted_content` (opaque base64) and consistently empty `summary: []`. There is zero extractable signal from reasoning records.
+
+ 3. **User messages appear twice.** Once in `event_msg.user_message` (clean, just the text + images) and again in `response_item.message` with `role: "user"` (mixed with system context). Use the event_msg version.
+
+ 4. **Developer messages are system prompts, not human.** `response_item.message` with `role: "developer"` contains harness instructions (permissions, app context, collaboration mode, skills). These are NOT human messages. They are wrapped in XML tags like `<permissions instructions>`, `<app-context>`, etc.
+
+ 5. **Images are massive.** User messages with screenshots contain base64 data URIs of 350-550KB each. A single image can make a 75-char message record balloon to 550KB. Check `payload.images` length but do not extract the base64 data.
+
+ 6. **Compacted records contain earlier conversation.** When the context window fills up, Codex compacts history into `compacted` records. The `replacement_history` array contains summarized earlier turns. If you are processing the file start-to-finish, you will see the original records first and then the compacted summary later -- be careful not to double-count.
+
+ 7. **Subagent sessions are separate files.** A parent session spawns subagent threads that are written to their own rollout files in the same date directory. The SQLite `source` column JSON identifies child threads. To get the complete picture of a multi-agent session, you must read all linked rollout files.
+
+ 8. **Codex uses `exec_command` for everything.** Unlike Claude Code which has specialized tools (Read, Grep, Edit, Bash), Codex wraps all operations in `exec_command` with shell commands (`sed`, `rg`, `cat`, etc.) or `apply_patch` for file edits. This means function_call records are less informative about intent -- you need to parse the `cmd` field to understand what was done.
+
+ 9. **task_complete contains the cleanest signal.** The `last_agent_message` in task_complete records is the final synthesized response after all tool use. If you can only extract one thing per turn, extract this.
+
+ 10. **The model field is in turn_context, not session_meta.** Session_meta has `model_provider` ("openai") but the actual model name (e.g., "gpt-5.4") is in `turn_context.model`.
@@ -0,0 +1,128 @@
1
+ # Processing Unknown Session Formats
2
+
3
+ This guide is a fallback for session files that do not match known formats (Claude Code JSONL, Codex rollout JSONL). Use it when you encounter a new tool or an unrecognized file structure.
4
+
5
+ ## Step 1: Identify the format
6
+
7
+ Check the file extension and first few lines:
8
+
9
+ - **JSONL** (`.jsonl`): One JSON object per line. Parse each line independently
10
+ - **JSON** (`.json`): Single JSON document. Could be an array of messages or a nested conversation object
11
+ - **Markdown** (`.md`): Likely a conversation export with `## Human` / `## Assistant` headers or similar
12
+ - **Plain text** (`.txt`, `.log`): Look for turn separators (blank lines, `---`, timestamps)
13
+ - **SQLite** (`.sqlite`, `.db`): Database with conversation tables. List tables first, then query
14
+
15
+ For JSONL, check each line for a `type`, `role`, or `kind` field. The presence of certain fields reveals the format:
16
+ - `type: "user"` / `type: "assistant"` + `message.content` -> Claude Code format
17
+ - `type: "response_item"` / `type: "event_msg"` -> Codex format
18
+ - `role: "user"` / `role: "assistant"` without a wrapper type -> Raw API conversation log
19
+ - `type: "human"` / `type: "ai"` -> LangChain/LangSmith format
20
+
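The sniffing rules above reduce to a small dispatcher. This is a heuristic sketch, not a validator -- the field values are exactly the ones listed, and the return labels are illustrative:

```python
import json

def sniff_jsonl_format(first_line: str) -> str:
    """Guess the session format from the first JSONL record."""
    try:
        rec = json.loads(first_line)
    except json.JSONDecodeError:
        return "not-jsonl"
    t = rec.get("type")
    if t in ("user", "assistant") and "message" in rec:
        return "claude-code"
    if t in ("response_item", "event_msg"):
        return "codex"
    if t in ("human", "ai"):
        return "langchain"
    if rec.get("role") in ("user", "assistant") and "type" not in rec:
        return "raw-api"
    return "unknown"
```

In practice, sniff several lines and take the majority vote -- the first record is often session metadata rather than a message.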
21
+ ## Step 2: Classify each record
22
+
23
+ For any conversation format, records fall into these universal categories:
24
+
25
+ ### Signal (extract)
26
+ 1. **Human messages** -- What the user asked, decided, or directed. These explain intent, requirements, and constraints. Look for:
27
+ - Records with `role: "user"` or `type: "human"` or `type: "user_message"`
28
+ - Text that reads like natural language instructions, questions, or feedback
29
+ - Short messages (under ~500 chars) are almost always signal
30
+
31
+ 2. **AI reasoning text** -- Explanations, analyses, decisions, and summaries the model produced. Look for:
32
+ - Records with `role: "assistant"` and text content (not tool calls)
33
+ - Fields named `text`, `content`, `message`, `output`, `response`
34
+ - Text that explains *why* something was done, not *what* command was run
35
+
36
+ 3. **Final/summary responses** -- The model's synthesized answer after a chain of tool use. Look for:
37
+ - The last assistant message before the next human message
38
+ - Fields named `final_response`, `last_message`, `summary`, `result`
39
+ - These are typically the densest signal per byte
40
+
41
+ 4. **Error messages and failures** -- What went wrong and why. Look for:
42
+ - Records containing `error`, `failed`, `exception`, `traceback`
43
+ - These often reveal important constraints, gotchas, or architectural issues
44
+
45
+ ### Noise (skip)
46
+ 1. **File contents returned by tools** -- Already in the repo. Look for:
47
+ - Records with `type: "tool_result"`, `type: "function_call_output"`, or `type: "tool_output"`
48
+ - Content that looks like source code (imports, function definitions, indented blocks)
49
+ - Long strings (>2KB) that are clearly file dumps
50
+ - **Size clue:** If a record is >10KB, it is almost certainly a tool result, not reasoning
51
+
52
+ 2. **Tool invocations** -- What commands were run. Operational, not knowledge. Look for:
53
+ - Records with `type: "tool_use"`, `type: "function_call"`, or `type: "tool_call"`
54
+ - Fields named `name`, `arguments`, `input`, `command`
55
+ - Exception: file edit commands may contain the *what* of a change (summarize those)
56
+
57
+ 3. **System prompts and instructions** -- Same across sessions. Look for:
58
+ - Records with `role: "system"` or `role: "developer"`
59
+ - Content wrapped in XML tags (`<instructions>`, `<context>`, `<rules>`)
60
+ - Long preambles about model behavior, tool availability, permissions
61
+
62
+ 4. **Metadata and telemetry** -- Session infrastructure. Look for:
63
+ - Token counts, usage statistics, rate limits
64
+ - Timestamps, UUIDs, session IDs (useful for linking but not knowledge)
65
+ - Permission changes, mode switches, checkpoint markers
66
+
67
+ 5. **Base64-encoded data** -- Images, files, binary content. Look for:
68
+ - Long strings matching `[A-Za-z0-9+/=]{1000,}` or data URIs
69
+ - Fields named `image`, `data`, `base64`, `source.data`
70
+
71
+ 6. **Duplicate records** -- Many formats log the same event multiple ways. Look for:
72
+ - Records sharing an ID field (`call_id`, `tool_use_id`, `request_id`)
73
+ - The same text appearing in both a streaming record and a final record
74
+
75
+ ### Summarize (compress)
76
+ 1. **Sequences of tool calls** -- "Read 15 files in src/lib/" is better than 15 individual read records
77
+ 2. **Repetitive status updates** -- "Still searching..." x10 -> "Searched extensively"
78
+ 3. **Build/test output** -- "Tests: 58/58 passed" not the full test runner output
79
+ 4. **File edit details** -- "Modified auth.ts: added token validation" not the full diff
80
+
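Taken together, the three categories above collapse into a rough triage function. The thresholds and field names are the ones listed in this step; any real format will need tuning:

```python
import re

BASE64_RUN = re.compile(r"[A-Za-z0-9+/=]{1000,}")
TOOL_RESULT_TYPES = {"tool_result", "function_call_output", "tool_output"}
TOOL_CALL_TYPES = {"tool_use", "function_call", "tool_call"}
SIGNAL_TYPES = {"user", "human", "user_message"}

def classify(rec: dict) -> str:
    """Return 'signal', 'noise', or 'summarize' for one parsed record."""
    text = str(rec.get("text") or rec.get("content") or rec.get("message") or "")
    if rec.get("role") in ("system", "developer"):
        return "noise"                      # system prompts repeat across sessions
    if rec.get("type") in TOOL_RESULT_TYPES:
        return "noise"                      # file contents are already in the repo
    if rec.get("type") in TOOL_CALL_TYPES:
        return "summarize"                  # compress sequences of invocations
    if BASE64_RUN.search(text):
        return "noise"                      # embedded binary/image data
    if len(text) > 10_000:
        return "noise"                      # size clue: almost always tool output
    if rec.get("role") in ("user", "assistant") or rec.get("type") in SIGNAL_TYPES:
        return "signal"
    # Small unclassified records are cheap to carry (see Step 5).
    return "signal" if len(text) < 1_000 else "noise"
```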
81
+ ## Step 3: Estimate signal ratio
82
+
83
+ Before processing the full file, sample it:
84
+
85
+ 1. Take the first 20 records, middle 20, and last 20
86
+ 2. Classify each as signal / noise / summarize
87
+ 3. Measure bytes in each category
88
+
89
+ Typical ratios from known formats:
90
+ - **Claude Code:** 3-15% signal, 85-97% noise (tool results + wrapper overhead dominate)
91
+ - **Codex:** 10-19% signal, 50-70% noise, 20-30% ambiguous (reasoning encrypted, compacted records)
92
+ - **Raw API logs:** 30-50% signal (no tool overhead, just conversation)
93
+ - **Chat exports (markdown):** 60-80% signal (already cleaned by the export process)
94
+
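The sampling procedure might look like this, with the classifier passed in as a callable (for small files the three slices will overlap, which is fine for an estimate):

```python
import json

def signal_ratio(path, classify, per_slice=20):
    """Estimate byte ratios from head/middle/tail samples of a JSONL file.

    classify(record) must return 'signal', 'noise', or 'summarize'.
    """
    with open(path, encoding="utf-8") as f:
        lines = [ln for ln in f if ln.strip()]
    mid = max(0, len(lines) // 2 - per_slice // 2)
    sample = lines[:per_slice] + lines[mid:mid + per_slice] + lines[-per_slice:]
    counts = {"signal": 0, "noise": 0, "summarize": 0}
    for ln in sample:
        counts[classify(json.loads(ln))] += len(ln.encode())
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}
```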
95
+ ## Step 4: Extract
96
+
97
+ For each signal record, extract:
98
+ - **Who said it:** human or AI
99
+ - **What they said:** the text content
100
+ - **When:** timestamp if available
101
+ - **Context:** what came before (the preceding human message gives context to an AI response)
102
+
103
+ Structure the output as a sequence of turns:
104
+ ```
105
+ [timestamp] HUMAN: <message>
106
+ [timestamp] AI: <response>
107
+ [timestamp] HUMAN: <follow-up>
108
+ [timestamp] AI: <response>
109
+ ```
110
+
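Rendering that layout from extracted turns is mechanical; a sketch, assuming each turn is a dict with `speaker`, `text`, and an optional `timestamp`:

```python
def format_turns(turns):
    """Render extracted turns in the [timestamp] SPEAKER: text layout."""
    out = []
    for t in turns:
        stamp = f"[{t['timestamp']}] " if t.get("timestamp") else ""
        out.append(f"{stamp}{t['speaker'].upper()}: {t['text']}")
    return "\n".join(out)
```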
111
+ ## Step 5: Handle unknowns gracefully
112
+
113
+ If you cannot classify a record:
114
+ - If it is <1KB, include it (small records are cheap to carry)
115
+ - If it is >10KB, skip it (large unclassified records are almost always tool output)
116
+ - If it contains natural language prose, include it
117
+ - If it contains code, JSON, or structured data, skip it
118
+
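Applied in order (earlier rules win), the fallback rules above are a few lines of code. The structured-data check here is a crude stand-in -- detecting "looks like code" reliably is its own problem:

```python
def keep_unclassified(raw: str) -> bool:
    """Decide whether to carry a record that defies classification."""
    size = len(raw.encode())
    if size > 10_000:       # large unknowns are almost always tool output
        return False
    if size < 1_000:        # small records are cheap to carry
        return True
    # In between: keep natural-language prose, drop structured data.
    looks_structured = raw.strip().startswith(("{", "[", "<")) or "def " in raw
    return not looks_structured
```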
119
+ ## Privacy checks
120
+
121
+ Before outputting extracted content, scan for:
122
+ - API keys: patterns like `sk-`, `ghp_`, `Bearer `, `token: "`
123
+ - Passwords: fields named `password`, `passwd`, `secret`
124
+ - Personal data: email addresses, IP addresses, phone numbers
125
+ - File paths: may reveal usernames (e.g., `/Users/johndoe/`)
126
+ - JWT tokens: strings matching `eyJ...`
127
+
128
+ Flag these but do not include them in extracted output.
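A minimal sketch of such a scan. The patterns are illustrative starting points drawn from the list above, not an exhaustive secret detector -- a real pipeline would layer in entropy checks and a proper secret-scanning library:

```python
import re

SECRET_PATTERNS = {
    "api_key": re.compile(r"\b(sk-|ghp_)[A-Za-z0-9_-]{8,}|Bearer\s+\S+"),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "home_path": re.compile(r"/(?:Users|home)/[^/\s]+"),
}

def privacy_flags(text: str) -> list[str]:
    """Return the names of sensitive patterns found in extracted text."""
    return [name for name, pat in SECRET_PATTERNS.items() if pat.search(text)]
```

Anything flagged should be redacted or dropped before the extracted output is written anywhere.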