@rubytech/create-realagent 1.0.828 → 1.0.829

Files changed (72)
  1. package/package.json +1 -1
  2. package/payload/platform/neo4j/schema.cypher +2 -1
  3. package/payload/platform/package.json +2 -2
  4. package/payload/platform/plugins/admin/hooks/__tests__/archive-ingest-surface-gate.test.sh +39 -54
  5. package/payload/platform/plugins/admin/hooks/archive-ingest-surface-gate.sh +26 -58
  6. package/payload/platform/plugins/admin/skills/onboarding/SKILL.md +2 -2
  7. package/payload/platform/plugins/docs/references/plugins-guide.md +1 -1
  8. package/payload/platform/plugins/memory/PLUGIN.md +4 -4
  9. package/payload/platform/plugins/memory/mcp/dist/index.js +18 -218
  10. package/payload/platform/plugins/memory/mcp/dist/index.js.map +1 -1
  11. package/payload/platform/plugins/memory/mcp/dist/lib/__tests__/schema-validator.test.js +103 -0
  12. package/payload/platform/plugins/memory/mcp/dist/lib/__tests__/schema-validator.test.js.map +1 -1
  13. package/payload/platform/plugins/memory/mcp/dist/lib/llm-classifier.d.ts.map +1 -1
  14. package/payload/platform/plugins/memory/mcp/dist/lib/llm-classifier.js +30 -20
  15. package/payload/platform/plugins/memory/mcp/dist/lib/llm-classifier.js.map +1 -1
  16. package/payload/platform/plugins/memory/mcp/dist/lib/schema-validator.d.ts +16 -1
  17. package/payload/platform/plugins/memory/mcp/dist/lib/schema-validator.d.ts.map +1 -1
  18. package/payload/platform/plugins/memory/mcp/dist/lib/schema-validator.js +12 -3
  19. package/payload/platform/plugins/memory/mcp/dist/lib/schema-validator.js.map +1 -1
  20. package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/memory-archive-write.test.js +2 -138
  21. package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/memory-archive-write.test.js.map +1 -1
  22. package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/memory-ingest.test.js +10 -5
  23. package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/memory-ingest.test.js.map +1 -1
  24. package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/profile-update-personfields-open.test.d.ts +2 -0
  25. package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/profile-update-personfields-open.test.d.ts.map +1 -0
  26. package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/profile-update-personfields-open.test.js +148 -0
  27. package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/profile-update-personfields-open.test.js.map +1 -0
  28. package/payload/platform/plugins/memory/mcp/dist/tools/memory-archive-write.d.ts +1 -64
  29. package/payload/platform/plugins/memory/mcp/dist/tools/memory-archive-write.d.ts.map +1 -1
  30. package/payload/platform/plugins/memory/mcp/dist/tools/memory-archive-write.js +6 -336
  31. package/payload/platform/plugins/memory/mcp/dist/tools/memory-archive-write.js.map +1 -1
  32. package/payload/platform/plugins/memory/mcp/dist/tools/memory-ingest.d.ts +7 -11
  33. package/payload/platform/plugins/memory/mcp/dist/tools/memory-ingest.d.ts.map +1 -1
  34. package/payload/platform/plugins/memory/mcp/dist/tools/memory-ingest.js +1 -11
  35. package/payload/platform/plugins/memory/mcp/dist/tools/memory-ingest.js.map +1 -1
  36. package/payload/platform/plugins/memory/mcp/dist/tools/profile-update.d.ts +21 -17
  37. package/payload/platform/plugins/memory/mcp/dist/tools/profile-update.d.ts.map +1 -1
  38. package/payload/platform/plugins/memory/mcp/dist/tools/profile-update.js +77 -37
  39. package/payload/platform/plugins/memory/mcp/dist/tools/profile-update.js.map +1 -1
  40. package/payload/platform/plugins/memory/references/schema-base.md +2 -0
  41. package/payload/platform/plugins/memory/skills/document-ingest/SKILL.md +54 -4
  42. package/payload/platform/plugins/whatsapp/PLUGIN.md +1 -1
  43. package/payload/platform/scripts/seed-neo4j.sh +15 -14
  44. package/payload/platform/templates/specialists/agents/database-operator.md +9 -15
  45. package/payload/server/chunk-CUSH3UXP.js +2305 -0
  46. package/payload/server/chunk-IWNDVGKT.js +10077 -0
  47. package/payload/server/chunk-KC7NUABI.js +654 -0
  48. package/payload/server/chunk-WUVXPZIV.js +1116 -0
  49. package/payload/server/client-pool-3TM3SRIA.js +32 -0
  50. package/payload/server/cloudflare-task-tracker-4NIODMGL.js +19 -0
  51. package/payload/server/maxy-edge.js +3 -3
  52. package/payload/server/neo4j-migrations-XTQ4WEV6.js +428 -0
  53. package/payload/server/server.js +6 -6
  54. package/payload/platform/plugins/whatsapp-import/PLUGIN.md +0 -48
  55. package/payload/platform/plugins/whatsapp-import/bin/ingest.mjs +0 -617
  56. package/payload/platform/plugins/whatsapp-import/bin/whatsapp-ingest.sh +0 -98
  57. package/payload/platform/plugins/whatsapp-import/lib/src/__tests__/delta-append.test.ts +0 -163
  58. package/payload/platform/plugins/whatsapp-import/lib/src/__tests__/parse-export-lrm.test.ts +0 -83
  59. package/payload/platform/plugins/whatsapp-import/lib/src/__tests__/parse-export.test.ts +0 -678
  60. package/payload/platform/plugins/whatsapp-import/lib/src/__tests__/sessionize.test.ts +0 -91
  61. package/payload/platform/plugins/whatsapp-import/lib/src/__tests__/to-classifier-input.test.ts +0 -59
  62. package/payload/platform/plugins/whatsapp-import/lib/src/delta-cursor.ts +0 -54
  63. package/payload/platform/plugins/whatsapp-import/lib/src/derive-keys.ts +0 -82
  64. package/payload/platform/plugins/whatsapp-import/lib/src/index.ts +0 -22
  65. package/payload/platform/plugins/whatsapp-import/lib/src/parse-export.ts +0 -471
  66. package/payload/platform/plugins/whatsapp-import/lib/src/sessionize.ts +0 -81
  67. package/payload/platform/plugins/whatsapp-import/lib/src/to-classifier-input.ts +0 -48
  68. package/payload/platform/plugins/whatsapp-import/lib/tsconfig.json +0 -9
  69. package/payload/platform/plugins/whatsapp-import/lib/vitest.config.ts +0 -9
  70. package/payload/platform/plugins/whatsapp-import/skills/whatsapp-import/SKILL.md +0 -124
  71. package/payload/platform/plugins/whatsapp-import/skills/whatsapp-import/references/conversation-archive-shape.md +0 -143
  72. package/payload/platform/plugins/whatsapp-import/skills/whatsapp-import/references/export-parse.md +0 -109
@@ -1,124 +0,0 @@
1
- ---
2
- name: whatsapp-import
3
- description: Single-phase WhatsApp `_chat.txt` ingest contract (Task 891). Confirms the owner + every distinct sender against existing `:AdminUser` / `:Person` nodes (no auto-creation), then invokes the deterministic Bash entry `whatsapp-ingest.sh`. The script parses the export, sessionizes at gap-hours boundaries, classifies each session via Haiku (mode='chat') into `:Section:Conversation` chunks with summary + topic keywords, and writes them under a parent `:ConversationArchive` MERGEd on `conversationIdentity = sha256(accountId + ":" + sortedParticipantElementIds)`. Re-imports are delta-append: prior chunks are never touched; only messages after `lastIngestedMessageHash` flow through the pipeline. Triggers when the operator drops a `_chat.txt` file or its containing export folder into chat. Distinct from the live `whatsapp` plugin (Baileys QR pairing).
4
- ---
5
-
6
- # WhatsApp Import — Conversation Archive
7
-
8
- Single-phase ingest. The deterministic Bash entry parses the export, splits it into sessions, classifies each session into topic-bounded `:Section:Conversation` chunks via Haiku, and writes everything under a parent `:ConversationArchive`. Insight derivation (`:Observation` / `:Task` / `:Preference` / `:MENTIONS`) is deferred to a separate follow-up task that operates on chunks rather than per-message rows.
9
-
10
- ## Owner + all-participants confirmation (mandatory first step)
11
-
12
- A `_chat.txt` carries N distinct senders (1 owner + N-1 others). Every distinct senderName must resolve to an existing `:AdminUser` or `:Person` elementId before the script runs — the writer LOUD-FAILs on any unresolved sender.
13
-
14
- The flow:
15
-
16
- 1. **Preview the senders.** Call `mcp__memory__whatsapp-export-preview` with the operator-supplied path (read-only, no Cypher writes):
17
- ```json
18
- {
19
- "filePath": "/abs/path/to/_chat.txt",
20
- "timezone": "Europe/London"
21
- }
22
- ```
23
- Returns counters + the sender histogram. Surface to the operator as one chat message — counters and the histogram, no prose.
24
-
25
- 2. **List candidate `:AdminUser` and existing `:Person`** rows for the senders via `mcp__graph__maxy-graph-read_neo4j_cypher`.
26
-
27
- 3. **Iterate the histogram, one operator question per distinct senderName.** For each sender: `"Sender '<name>' (<count> messages) — pick existing :AdminUser/:Person or block?"`. The operator either picks an existing elementId or names "block" (refuses to map to a node). **Never auto-create a `:Person`** — the operator must confirm a canonical node, mirroring `feedback_archives_are_not_documents.md`'s closed-set discipline.
28
-
29
- 4. **Identify the owner** from the resolved set — the operator who exported the chat. Echo back: `"Owner = :AdminUser <name> (<elementId>); other participants = <list>. Confirm yes/no."`
30
-
31
- 5. **Persist the resolved IDs** as `--owner-element-id` + `--participant-person-ids <csv>` for the script call.
32
-
33
- DM and group follow the identical flow. A 1:1 chat resolves 2 senders; a group resolves N senders. Identity (`conversationIdentity = sha256(accountId + ":" + sortedParticipantElementIds.join(","))`) is identical regardless of group size — DM and group use the same MERGE key.
34
-
35
- ## Step 2 — invoke the script
36
-
37
- Single Bash call:
38
-
39
- ```bash
40
- bash platform/plugins/whatsapp-import/bin/whatsapp-ingest.sh <archive.zip|dir|_chat.txt> \
41
- --owner-element-id <id> \
42
- --participant-person-ids <id1>,<id2>,... \
43
- --scope <admin|public>
44
- ```
45
-
46
- Optional flags:
47
- - `--account-id <id>` — explicit account id when more than one exists under `data/accounts/` (Phase 0 has one).
48
- - `--timezone <iana>` — IANA zone for timestamps (default `Europe/London`).
49
- - `--date-format <DD/MM/YY|MM/DD/YY|DD/MM/YYYY|MM/DD/YYYY>` — override auto-detect for ambiguous locales.
50
- - `--session-gap-hours <N>` — gap threshold (in hours) used to split parsed messages into sessions for chunking (default `12`). Smaller values produce more sessions, more Haiku calls, finer chunks; larger values group more messages per session.
51
-
52
- The script:
53
- - Unzips the archive if needed; locates `_chat.txt`.
54
- - Parses the file deterministically (year shape, sender/body grammar, timezone offset, U+200E/U+200F bidi-strip).
55
- - Computes the source file's `archiveSha256` (provenance + cleanup discriminator).
56
- - Validates every distinct parsed senderName against the closed set of `{owner, participants...}` candidate names. Any miss LOUD-FAILs `parser-miss reason="senderName=<verbatim> not in confirmed participant set ..."`.
57
- - Computes `conversationIdentity` from accountId + sorted participant elementIds.
58
- - Looks up any prior `:ConversationArchive` carrying that identity → reads `lastIngestedMessageHash`. If found, slices parsed lines after the cursor (delta-append). Cursor not found → `FAIL delta-cursor-missing`. Cursor at last line → empty-delta noop (exit 0, no writes).
59
- - Sessionizes the delta lines at the operator-supplied gap-hours boundary.
60
- - For each session: renders as turn-attributed text (`[ts] Sender: body\n…`) and calls `memory-classify` with `mode='chat'`. Returns one or more `:Section:Conversation` chunk specs with `summary`, `keywords[]`, `firstMessageAt`, `lastMessageAt`, `participantNames[]`, `messageCount`, `body` (verbatim turn-attributed text).
61
- - Calls `memory-ingest` with `parentLabel='ConversationArchive'`. Server MERGEs the parent on `conversationIdentity`, MERGEs `:PARTICIPANT_IN` edges from each confirmed participant, drops any prior chunks stamped with this `archiveSha256` (idempotency for re-running the same export bytes), CREATEs new chunks, extends the existing `:NEXT` chain from its tail, advances `lastIngestedMessageHash` + `lastIngestedMessageAt`.
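The sessionizing step in the list above can be sketched roughly as follows. This is an illustration only, not the code in `sessionize.ts`: the function name `splitIntoSessions` is hypothetical, the `ParsedLine` shape follows the parser output described in the export-parse reference, and the rule that a new session starts when the gap strictly exceeds the threshold is an assumption.

```typescript
// Hypothetical sketch: split chronologically ordered messages into sessions
// wherever the silence between two consecutive messages exceeds gapHours.
interface ParsedLine {
  senderName: string;
  dateSent: string; // ISO 8601 timestamp
  body: string;
  sequenceIndex: number;
}

function splitIntoSessions(lines: ParsedLine[], gapHours = 12): ParsedLine[][] {
  const gapMs = gapHours * 60 * 60 * 1000;
  const sessions: ParsedLine[][] = [];
  let current: ParsedLine[] = [];
  for (const line of lines) {
    const prev = current[current.length - 1];
    // Assumption: "gap" means strictly more than gapHours between neighbours.
    if (prev && Date.parse(line.dateSent) - Date.parse(prev.dateSent) > gapMs) {
      sessions.push(current);
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) sessions.push(current);
  return sessions;
}
```

A smaller `gapHours` yields more, shorter sessions, which matches the trade-off described for `--session-gap-hours`.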
62
-
63
- NO insight pass runs. Phase 2 (operator-driven `:Observation` / `:Task` / `:Preference` derivation against chunks) is its own follow-up task with its own skill.
64
-
65
- ## Three operator messages per ingest
66
-
67
- After the script succeeds, formulate the three operator-facing messages from the JSON summary on stdout (one operator message per surfaceable phase):
68
-
69
- 1. **Parse summary.** `Parsed <archiveSourceFile>: <parsed> messages across <sessions> sessions, date range <dateRange.first> → <dateRange.last>. Participants: <senderHistogram[i].name (count), …>.`
70
- 2. **Classify summary.** `Classified into <chunks> chunks, covering: <topicKeywords[0], topicKeywords[1], …>.`
71
- 3. **Write summary.** `Created :ConversationArchive <archiveElementId> with <chunks> :Section:Conversation chunks (NEXT chain length <chunks - 1>). Participants linked via :PARTICIPANT_IN: <participantsLinked>.`
72
-
73
- For an empty-delta re-import (`delta.kind === "empty-delta"`): emit only message 1 + a noop line `noop reason="no new messages since <priorLastIngestedMessageAt>"`.
74
-
75
- ## Stdout JSON shape (success)
76
-
77
- ```json
78
- {
79
- "archiveElementId": "4:abcd…:42",
80
- "conversationIdentity": "<sha256-hex>",
81
- "archiveSha256": "<sha256-hex>",
82
- "archiveSourceFile": "_chat.txt",
83
- "parsed": 1707,
84
- "mediaSkipped": 0,
85
- "systemSkipped": 0,
86
- "delta": { "kind": "first-ingest|delta|empty-delta", "deltaStart": 0, "deltaMessages": 1707 },
87
- "sessions": 38,
88
- "chunks": 142,
89
- "nextEdgesCreated": 141,
90
- "participantsLinked": 2,
91
- "dateRange": { "first": "2024-01-15T09:30:00+00:00", "last": "2026-04-30T18:42:00+01:00" },
92
- "senderHistogram": [{ "name": "Joel", "count": 812 }, { "name": "Adam", "count": 895 }],
93
- "topicKeywords": ["pricing", "scheduling", "introductions", "..."],
94
- "ms": 6800
95
- }
96
- ```
97
-
98
- ## Failure path — single FAIL line
99
-
100
- - **Exit non-zero** + one stderr line: `[whatsapp-import] FAIL phase=<argv|parse|classify|delta-cursor-missing|memory-ingest|uncaught> reason="..."`. Surface this verbatim to the operator and yield. **Do not retry.** The archive-ingest-surface-gate denies parser-source edits, JS test runners, and the legacy `whatsapp-export-parse` / `whatsapp-export-insight-write` MCP tools — none of those are escape hatches in your surface.
101
- - `parser-miss` LOUD-FAIL: an unconfirmed senderName slipped through. Either re-run with the missing :Person elementId added to `--participant-person-ids`, or report a parser bug.
102
- - `delta-cursor-missing` LOUD-FAIL: the prior `lastIngestedMessageHash` is not present in the re-export. Either the operator deleted prior messages from the archive, or this is a different chat. Investigation required — never re-run blindly.
103
-
104
- ## Idempotency
105
-
106
- - **Re-running the same export bytes** is a no-op: the cleanup-by-`archiveSha256` step drops THIS export's prior chunks and re-creates them with identical content. `:NEXT` chain length unchanged. Counters: `chunks` non-zero, `delta.kind` either `first-ingest` (if no prior archive) or `delta` (cursor advanced past previous run).
107
- - **Re-running with appended messages** (a fresh export from the same chat with new messages at the tail): cursor lookup finds the prior `lastIngestedMessageHash`, slices new messages, sessionizes only those, and appends new chunks at the tail of the existing `:NEXT` chain. Pre-existing chunks are never touched (their `archiveSha256` differs from this run's).
108
-
109
- ## Verification (post-write)
110
-
111
- Run via `mcp__graph__maxy-graph-read_neo4j_cypher`:
112
-
113
- - `MATCH (a:ConversationArchive { conversationIdentity: $cid }) RETURN elementId(a), a.lastIngestedMessageAt, a.lastIngestedMessageHash` — agrees with the JSON summary.
114
- - `MATCH (a:ConversationArchive { conversationIdentity: $cid })-[:HAS_SECTION]->(c:Section:Conversation) RETURN count(c)` — equals `chunks`.
115
- - `MATCH p=(:Section:Conversation)-[:NEXT*]->(:Section:Conversation) WHERE startNode(relationships(p)[0]).archiveSourceFile = $file WITH max(length(p)) AS chain RETURN chain` — equals `chunks - 1` for a fresh-only ingest, or longer for a delta-append (full chain since first ever ingest).
116
- - `MATCH (p)-[:PARTICIPANT_IN]->(:ConversationArchive { conversationIdentity: $cid }) RETURN count(p)` — equals `participantsLinked` after a first-ingest, or the running total of all confirmed participants ever.
117
- - Phase 1 wrote ZERO observations: `MATCH (o:Observation)-[:OBSERVED_IN]->(:ConversationArchive { conversationIdentity: $cid }) RETURN count(o)` — should be 0 today (Phase 2 deferred).
118
-
119
- ## What this is not
120
-
121
- - **Not** the live `whatsapp` plugin. That plugin (Baileys QR pairing) holds messages in an in-memory store cleared on restart. This plugin imports historical exports into Neo4j as persistent graph nodes.
122
- - **Not** a media-transcription pipeline. Voice notes, photos, PDFs are skipped at parse with a counter logged.
123
- - **Not** an insight-extraction pass. Phase 2 (`:Observation` / `:Task` / `:Preference` / `:MENTIONS` derivation, anchored to chunks) ships in its own task.
124
- - **Not** automatic. The owner + all-participants confirmation gate is mandatory before any line is written, mirroring `feedback_archives_are_not_documents.md`'s closed-set discipline.
@@ -1,143 +0,0 @@
1
- # Conversation Archive — graph shape, identity, edges, delta protocol
2
-
3
- The reference document for the `:ConversationArchive` shape introduced by Task 891. This is the source of truth for what the graph looks like after a `whatsapp-ingest.sh` run — the SKILL.md prescribes the workflow; this file specifies the schema.
4
-
5
- ## Labels
6
-
7
- | Label | Role | MERGE key | Schema constraint |
8
- |---|---|---|---|
9
- | `:ConversationArchive` | Parent node — one per chat | `conversationIdentity` | `FOR (a:ConversationArchive) REQUIRE a.conversationIdentity IS UNIQUE` |
10
- | `:Section:Conversation` | Topic-bounded chunk of messages | (no MERGE — CREATE only) | inherits `:Section` indices |
11
-
12
- `:Section:Conversation` is the chat-mode counterpart of document-mode `:Section:Chapter` etc. Same `:Section` base label, same indexing, different secondary label.
13
-
14
- ## Identity formula
15
-
16
- ```
17
- conversationIdentity = sha256(accountId + ":" + sortedParticipantElementIds.join(","))
18
- ```
19
-
20
- - Stable across re-exports — same accountId + same operator-confirmed participant set always produces the same identity, regardless of the source file's bytes.
21
- - DM and group are identical — the difference is array length.
22
- - Participant order is sorted before joining, so the operator can supply elementIds in any order.
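A minimal sketch of the identity formula, assuming Node's built-in `node:crypto` module, UTF-8 input, and JavaScript's default string sort. The source does not state which comparator or encoding the real pipeline uses, and the function name below is illustrative.

```typescript
import { createHash } from "node:crypto";

// Sketch of: sha256(accountId + ":" + sortedParticipantElementIds.join(","))
function conversationIdentity(accountId: string, participantElementIds: string[]): string {
  // Copy before sorting so the caller's array is not mutated.
  // Assumption: default (UTF-16 code unit) ordering is the intended sort.
  const sorted = [...participantElementIds].sort();
  return createHash("sha256")
    .update(accountId + ":" + sorted.join(","), "utf8")
    .digest("hex");
}
```

Because the ids are sorted before joining, `conversationIdentity(acct, [a, b])` and `conversationIdentity(acct, [b, a])` return the same hex digest, and a group chat differs from a DM only in the number of ids passed.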
23
-
24
- ## Required indices
25
-
26
- ```
27
- INDEX :ConversationArchive(accountId)
28
- INDEX :ConversationArchive(createdBySession)
29
- ```
30
-
31
- Plus the constraint above (which doubles as a uniqueness index on `conversationIdentity`).
32
-
33
- ## Properties on `:ConversationArchive`
34
-
35
- | Property | Type | Source | When set |
36
- |---|---|---|---|
37
- | `conversationIdentity` | string (sha256-hex) | derived | ON CREATE |
38
- | `accountId` | string (UUID) | argv | ON CREATE |
39
- | `scope` | string (`admin` / `public`) | argv | ON CREATE |
40
- | `summary` | string | classifier (synthetic) | ON CREATE |
41
- | `keywords` | string[] | classifier (aggregated across sessions) | ON CREATE |
42
- | `embedding` | float[] | embed(summary) | ON CREATE |
43
- | `archiveSourceFile` | string (basename) | this export | ON CREATE |
44
- | `createdAt` | ISO 8601 | this run | ON CREATE |
45
- | `createdByAgent` | string | constant `"whatsapp-import"` | ON CREATE |
46
- | `createdBySession` | string (UUID) | env / argv | ON CREATE |
47
- | `source` | string | constant `"whatsapp"` | ON CREATE |
48
- | `updatedAt` | ISO 8601 | this run | ON CREATE / ON MATCH |
49
- | `lastIngestedMessageHash` | string (sha256-hex) | derived from last delta line | ON CREATE / ON MATCH |
50
- | `lastIngestedMessageAt` | ISO 8601 | last delta line's `dateSent` | ON CREATE / ON MATCH |
51
- | `lastIngestedBySession` | string (UUID) | this run | ON CREATE / ON MATCH |
52
- | `lastIngestedArchiveSha256` | string (sha256-hex) | this export's file bytes | ON CREATE / ON MATCH |
53
-
54
- ## Properties on `:Section:Conversation` (each chunk)
55
-
56
- | Property | Type | Source |
57
- |---|---|---|
58
- | `accountId` | string | inherited |
59
- | `title` | string | classifier |
60
- | `body` | string | classifier (verbatim turn-attributed text — `[ts] Sender: body\n…`) |
61
- | `bodyPreview` | string (≤150 chars) | first 150 chars of body |
62
- | `position` | int | chunk index within this run |
63
- | `scope` | string | inherited |
64
- | `embedding` | float[] | embed(body) |
65
- | `summary` | string | classifier (1–3 sentences) |
66
- | `keywords` | string[] | classifier |
67
- | `firstMessageAt` | ISO 8601 | first `[ts]` in chunk |
68
- | `lastMessageAt` | ISO 8601 | last `[ts]` in chunk |
69
- | `participantNames` | string[] | distinct senderNames in chunk |
70
- | `messageCount` | int | message count in chunk |
71
- | `archiveSha256` | string (sha256-hex) | this export's file bytes (cleanup discriminator) |
72
- | `archiveSourceFile` | string (basename) | this export |
73
- | `createdAt` | ISO 8601 | this run |
74
- | `createdByAgent` / `createdBySession` / `source` | provenance | this run |
75
-
76
- ## Edges
77
-
78
- | Edge | From | To | Cardinality | When written |
79
- |---|---|---|---|---|
80
- | `:HAS_SECTION` | `:ConversationArchive` | `:Section:Conversation` | one per chunk | every run |
81
- | `:NEXT` | `:Section:Conversation` | `:Section:Conversation` | chunks − 1 (chronological chain across all sessions across all runs) | extending tail per run |
82
- | `:PARTICIPANT_IN` | `:Person` / `:AdminUser` | `:ConversationArchive` | one per confirmed participant | MERGEd every run (idempotent) |
83
-
84
- Every edge carries `createdAt`, `createdByAgent`, `createdBySession`, `source`, plus `archiveSha256` for HAS_SECTION + NEXT (so cleanup-by-archiveSha256 catches the right edges).
85
-
86
- ## Delta-append protocol
87
-
88
- ```
89
- first ingest (or empty graph)
90
-
91
-
92
- parsed_lines ──── all of them ────► sessionize ──► classify ──► memory-ingest
93
-
94
-
95
- :ConversationArchive (NEW)
96
- └── :HAS_SECTION ──► chunks (NEXT chain length K-1)
97
-
98
-
99
- re-import (delta)
100
-
101
-
102
- parsed_lines ──┐
103
-
104
-
105
- find cursor where deriveMessageContentHash(line) == archive.lastIngestedMessageHash
106
-
107
- ┌──────────┼──────────┬───────────────────┐
108
- │ │ │ │
109
- found missing empty (cursor at last line)
110
- │ │ │
111
- ▼ ▼ ▼
112
- slice from FAIL noop (exit 0,
113
- cursor+1 non-zero no writes)
114
-
115
-
116
- delta_lines ──► sessionize ──► classify ──► memory-ingest (parentLabel='ConversationArchive')
117
-
118
-
119
- :ConversationArchive (MERGE on conversationIdentity)
120
- └── :HAS_SECTION ──► NEW chunks
121
- ── :NEXT extends from prior tail
122
- prior chunks unchanged
123
- cursor advances
124
- ```
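The three cursor outcomes in the diagram can be sketched as a small pure function. This is a hedged illustration, not the code in `delta-cursor.ts`: `sliceDelta`, the `Delta` type, and the choice to match the first line whose hash equals the cursor are all assumptions.

```typescript
// Hypothetical result type mirroring the delta.kind values in the stdout JSON.
type Delta<T> =
  | { kind: "first-ingest"; lines: T[] }
  | { kind: "delta"; deltaStart: number; lines: T[] }
  | { kind: "empty-delta" };

function sliceDelta<T>(
  parsed: T[],
  hashOf: (line: T) => string,
  lastIngestedMessageHash: string | null,
): Delta<T> {
  // No prior archive: everything is new.
  if (lastIngestedMessageHash === null) {
    return { kind: "first-ingest", lines: parsed };
  }
  // Assumption: the first matching line is taken as the cursor.
  const cursor = parsed.findIndex((line) => hashOf(line) === lastIngestedMessageHash);
  if (cursor === -1) {
    // Corresponds to the FAIL branch; the caller exits non-zero.
    throw new Error("delta-cursor-missing");
  }
  if (cursor === parsed.length - 1) {
    // Cursor at last line: nothing to write.
    return { kind: "empty-delta" };
  }
  return { kind: "delta", deltaStart: cursor + 1, lines: parsed.slice(cursor + 1) };
}
```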
125
-
126
- ## Why the cursor is content-only (not file-byte-based)
127
-
128
- `lastIngestedMessageHash = sha256(dateSent + "|" + NFKC-trim-lower(senderName) + "|" + body)`. The hash deliberately excludes the source file's bytes — a fresh re-export of the same chat has different file bytes (different SHA-256 of the file) but the SAME message tuples. Without content-only hashing, every delta-import would `delta-cursor-missing` because the file SHA-256 always changes.
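A rough sketch of the content-only hash, assuming `NFKC-trim-lower` means normalise to NFKC, then trim, then lowercase, in that order, and that the joined string is hashed as UTF-8. The function name `messageContentHash` is a hypothetical stand-in for the pipeline's `deriveMessageContentHash`, whose actual implementation is not shown here.

```typescript
import { createHash } from "node:crypto";

// Sketch of: sha256(dateSent + "|" + NFKC-trim-lower(senderName) + "|" + body)
function messageContentHash(dateSent: string, senderName: string, body: string): string {
  // Assumed order of operations: NFKC, then trim, then lowercase.
  const sender = senderName.normalize("NFKC").trim().toLowerCase();
  return createHash("sha256")
    .update(`${dateSent}|${sender}|${body}`, "utf8")
    .digest("hex");
}
```

Nothing about the export file itself feeds into the digest, so two exports of the same chat produce the same hash for the same message tuple.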
129
-
130
- ## Why DM and group are identical
131
-
132
- The brief's "DM-only" Task 887 §A0 contract was a workaround for the per-message writer's auto-Person leak. Under the chunked archive shape, the writer never auto-creates anyone — every participant is operator-confirmed up front. The 2-vs-3-vs-N participant case is just the array length on the right of the identity formula. No special-casing.
133
-
134
- ## Provenance discipline (for cleanup correctness)
135
-
136
- Every node and edge written by this pipeline is stamped with:
137
- - `source = 'whatsapp'`
138
- - `createdByAgent = 'whatsapp-import'`
139
- - `createdBySession = <this run's session UUID>`
140
- - `archiveSha256 = <this export's file SHA-256>` (chunks + HAS_SECTION + NEXT only — the parent records the LAST archiveSha256 separately)
141
- - `archiveSourceFile = <this export's basename>` (parent + chunks)
142
-
143
- The cleanup-by-`archiveSha256` step in memory-ingest's chat path drops only chunks whose `archiveSha256` matches THIS run's. Re-running the same export bytes is a no-op idempotently; re-running with a fresh delta export (different bytes, different SHA-256) leaves prior chunks untouched and appends new ones at the tail.
@@ -1,109 +0,0 @@
1
- # Reference: `_chat.txt` parsing — implementation reference
2
-
3
- > **This is no longer operator instruction.** The agent does NOT walk this grammar in its own LLM turn. Parsing runs deterministically in [`platform/plugins/whatsapp-import/lib/src/parse-export.ts`](../../../lib/src/parse-export.ts), invoked in-process by [`bin/ingest.mjs`](../../../bin/ingest.mjs) (which the operator calls via [`bin/whatsapp-ingest.sh`](../../../bin/whatsapp-ingest.sh) — the single deterministic Bash entry). The legacy MCP wrapper is blocked at the harness gate. The vitest grid in [`lib/src/__tests__/parse-export.test.ts`](../../../lib/src/__tests__/parse-export.test.ts) is the executable contract; this prose is the human-readable companion. Extend the grammar by adding a failing test first.
4
-
5
- WhatsApp's "Export Chat" produces a UTF-8 text file with a deterministic line grammar. This reference describes what the parser library does when it converts that file into the `{senderName, dateSent, body, sequenceIndex}[]` structure the SKILL.md consumes.
6
-
7
- ## File-open invariants
8
-
9
- 1. **UTF-8 only.** Open the file with explicit UTF-8 decoding. On encoding error, abort the import with a named LOUD-FAIL — never silently substitute or corrupt bodies. WhatsApp's modern apps emit UTF-8 reliably; an encoding error usually means the operator manually edited the file with a tool that broke it. Surface the named error so they can re-export.
10
- 2. **No size cap from the parser.** The parser handles arbitrarily large files; the [SKILL.md](../SKILL.md)'s 100-message selective-ingest gate is the operator-facing compression layer.
11
- 3. **Compute `archiveSourceFile = sha256(file bytes)` first.** The hash drives `conversationId` and lets re-imports of the same archive land idempotently.
12
-
13
- ## Line grammar
14
-
15
- Every message line begins with a square-bracketed timestamp prefix followed by `<Sender>: <body>`:
16
-
17
- ```
18
- [DD/MM/YYYY, HH:MM:SS] <Sender>: <body> ← modern WhatsApp default (4-digit year)
19
- [DD/MM/YY, HH:MM:SS] <Sender>: <body> ← legacy exports (2-digit year)
20
- ```
21
-
22
- - **Day/month ordering.** `DD/MM` is the WhatsApp default everywhere except US iOS, which emits `MM/DD`. The parser auto-detects from the first prefix-matching line when `dateFormat` is omitted: probe DD/MM first; if range-valid, lock DD/MM; otherwise lock MM/DD. The lock is per-file — a single export never mixes orderings (the locale is set by the device that generated the export). Manually concatenated multi-locale archives are an explicit out-of-scope: pass `dateFormat` to override.
23
- - **Year shape.** Both 2-digit (`\d{2}` → `2000+yy`) and 4-digit (`\d{4}` → as-is) years are accepted by the same regex (`\d{2,4}`). A single file may hold both shapes; year semantics are resolved per-line from the captured length, not per-file.
24
- - **Time.** `HH:MM:SS` 24-hour; older exports may emit `HH:MM` (no seconds — treat as `:00`).
25
- - **Sender.** Saved contact name, phone number with country code (`+44 7700 900123`), or `You` for legacy operator-sent messages.
26
- - **Body.** Message text, possibly multi-line.
27
-
28
- Trim trailing whitespace from each line before parsing.
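The day/month probe and the per-line year rule described above can be sketched as below. This is illustrative rather than the code in `parse-export.ts`: the helper names are hypothetical, and "range-valid" is assumed to mean a day of 1 to 31 and a month of 1 to 12.

```typescript
type DateOrder = "DD/MM" | "MM/DD";

// Probe DD/MM first on the date part of the first prefix-matching line,
// e.g. "14/03/26"; fall back to MM/DD when that reading is out of range.
function detectDateOrder(firstPrefixDate: string): DateOrder {
  const [first, second] = firstPrefixDate.split("/").map(Number);
  const ddmmValid = first >= 1 && first <= 31 && second >= 1 && second <= 12;
  return ddmmValid ? "DD/MM" : "MM/DD";
}

// Year semantics are resolved per line from the captured length.
function expandYear(year: string): number {
  return year.length === 2 ? 2000 + Number(year) : Number(year);
}
```

Note that a date such as `03/04/26` is range-valid under both readings, so the probe locks DD/MM; that ambiguity is the case the `dateFormat` override exists for.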
29
-
30
- ## Multi-line bodies
31
-
32
- A body that wraps to multiple lines continues onto subsequent lines that do **not** match the timestamp-prefix pattern. The parser must accumulate these continuation lines into the previous message's body, joining with `\n`. End-of-message is detected by the next timestamp prefix or end-of-file.
33
-
34
- ```
35
- [14/03/26, 10:15:23] Joel: Quick question about the deck —
36
- do you have the v3 PDF anywhere?
37
- I checked Drive and only see v2.
38
- [14/03/26, 10:16:01] Sarah: Sec, will dig it out
39
- ```
40
-
41
- Yields two messages:
42
- - `Joel: "Quick question about the deck —\ndo you have the v3 PDF anywhere?\nI checked Drive and only see v2."`
43
- - `Sarah: "Sec, will dig it out"`
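The continuation-line rule can be sketched as a single pass over the file. This is a simplified illustration, not the parser in `parse-export.ts`: the regex, the `RawMessage` shape, and the function name are assumptions, and system-message and media handling are left out entirely (a prefix-matching line without a `: ` separator is simply skipped here, whereas the real parser classifies or fails on it).

```typescript
interface RawMessage {
  timestamp: string;
  sender: string;
  body: string;
}

// Timestamp prefix with a 2- to 4-digit year and optional seconds.
const PREFIX = /^\[(\d{2}\/\d{2}\/\d{2,4}), (\d{2}:\d{2}(?::\d{2})?)\] (.*)$/;

function accumulateMessages(text: string): RawMessage[] {
  const messages: RawMessage[] = [];
  // Drop one trailing newline so it does not become an empty continuation line.
  for (const rawLine of text.replace(/\n$/, "").split("\n")) {
    const line = rawLine.replace(/\s+$/, ""); // trim trailing whitespace
    const match = PREFIX.exec(line);
    if (match) {
      const rest = match[3];
      const sep = rest.indexOf(": "); // split on the FIRST colon-space
      if (sep === -1) continue; // simplification: not handled in this sketch
      messages.push({
        timestamp: `${match[1]}, ${match[2]}`,
        sender: rest.slice(0, sep),
        body: rest.slice(sep + 2),
      });
    } else if (messages.length > 0) {
      // No timestamp prefix: continuation of the previous message's body.
      messages[messages.length - 1].body += "\n" + line;
    }
  }
  return messages;
}
```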
44
-
45
- ## System messages — skip with counter
46
-
47
- WhatsApp injects system-generated lines into the export for group events, contact changes, and security messages. These lines match the timestamp prefix but are **not** sent by a person and have no first-class graph value. Skip them at parse time, increment `systemSkipped`, and do not pass to `memory-archive-write`.
48
-
49
- Patterns to recognise (English; localisation expands the list):
50
-
51
- - `<Sender> created group "<name>"`
52
- - `<Sender> changed the subject from "<old>" to "<new>"`
53
- - `<Sender> changed this group's icon`
54
- - `<Sender> added <other>`
55
- - `<Sender> removed <other>`
56
- - `<Sender> left`
57
- - `<Sender>'s security code changed.`
58
- - `Messages and calls are end-to-end encrypted. ...` (the conversation header)
59
- - `You deleted this message.`
60
- - `This message was deleted.`
61
-
62
- Heuristic for the parser: if the body contains no spaces between `<Sender>` and a verb-phrase token, AND the body lacks a colon-after-sender separator, treat as a system message. Conservative — when uncertain, prefer to ingest as a message; the insight pass tolerates noisy bodies better than the parser would tolerate dropped real messages.
63
-
64
- ## Media attachments — skip with counter
65
-
66
- Lines whose body indicates a media-only message (no text content) get skipped at parse time, increment `mediaSkipped`. Patterns:
67
-
68
- - `<Media omitted>` (when the operator chose "Without Media" on export)
69
- - `IMG-<digits>-<digits>.jpg (file attached)` / `.jpeg` / `.png` / `.heic` / `.gif`
70
- - `VID-<digits>-<digits>.mp4 (file attached)`
71
- - `PTT-<digits>-<digits>.opus (file attached)` (voice notes)
72
- - `AUD-<digits>-<digits>.opus (file attached)` (audio)
73
- - `STK-<digits>-<digits>.webp (file attached)` (stickers)
74
- - `<filename>.pdf (file attached)`
75
- - `<filename>.docx (file attached)`
76
- - `‎<...> attached: <filename>` (alternative format on some platforms)
77
-
78
- Mixed messages (text + media reference in one body) are kept as messages — only pure-media-only lines are skipped. The text body is retained.
79
-
80
- ## Forwarded messages
81
-
82
- A forwarded message is prefixed with the invisible Unicode `‎` (U+200E LEFT-TO-RIGHT MARK) followed by metadata WhatsApp injects. Parse the body normally; the LRM character is preserved in `body` (the insight pass's classifier sees it as benign). Do not strip — the raw body's fidelity matters for downstream queries.
83
-
84
- ## Edge cases
85
-
86
- - **Empty body** (timestamp prefix followed by sender colon but no text). Rare. Skip with `systemSkipped` increment — usually corresponds to a deleted message stub.
87
- - **Leading BOM** (U+FEFF at file start). Strip before parsing the first line.
88
- - **Mixed line endings** (`\r\n` vs `\n`). Normalise to `\n` before tokenisation.
89
- - **Sender containing a colon** (e.g., a contact named "Joel: Work"). The grammar splits on the FIRST `: ` (colon-space) after the timestamp prefix's closing `]`. Subsequent colons in the sender or body are preserved verbatim.
90
-
91
- ## Parser output shape
92
-
93
- The parser returns `{conversationId, archiveSourceFile, parsedLines[], counters}` where:
94
-
95
- - `parsedLines[]: Array<{senderName: string, dateSent: string (ISO 8601 with operator-supplied timezone), body: string, sequenceIndex: number}>`
96
- - `counters: {parsed: n, systemSkipped: n, mediaSkipped: n, parseErrors: n}`
97
-
98
- The skill consumes this directly. The `messageId` is computed by the skill (not the parser) so the `lineHash` covers the original raw line, not the post-parse normalised body.
99
-
100
- ## When to LOUD-FAIL
101
-
102
- The parser throws (and `whatsapp-export-parse` returns `isError: true`) on:
103
-
104
- - Encoding error at file open (UTF-8 decode fails — the parser uses `TextDecoder` with `fatal: true`, so any invalid byte sequence aborts loudly rather than silently substituting U+FFFD).
105
- - Empty file or zero parsed lines after walking the file (the file isn't a `_chat.txt`). The thrown error and the `[whatsapp-import] parse-grammar-miss first-line="<sample>"` stderr line both carry a sanitised first-line sample (control chars stripped, truncated to 80 chars) so the operator can recognise the offending header shape without re-running with a debugger.
106
- - A timestamp prefix matches but the body parse fails (no `: ` separator after the closing `]` AND no system-pattern match) — emits `parse-error file=<...> line=<n> reason=no-sender-body-separator content="<...>"`.
107
- - Missing required input (`accountId`, `timezone`).
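The strict-decode behaviour in the first bullet can be illustrated with the standard `TextDecoder` API. The wrapper name `decodeExport` is hypothetical, and folding line-ending normalisation into the same helper is an assumption about layout, not a description of the actual parser.

```typescript
// Decode export bytes as UTF-8, failing loudly on any invalid sequence.
function decodeExport(bytes: Uint8Array): string {
  // fatal: true makes decode() throw a TypeError instead of substituting U+FFFD.
  // With the default ignoreBOM: false, a leading UTF-8 BOM is dropped from the output.
  const decoder = new TextDecoder("utf-8", { fatal: true });
  const text = decoder.decode(bytes);
  // Normalise CRLF to LF before any line-based tokenisation.
  return text.replace(/\r\n/g, "\n");
}
```

A caller would typically catch the `TypeError` and turn it into the named LOUD-FAIL rather than letting a partially decoded body through.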
108
-
109
- Never silently drop data the parser couldn't classify. The operator chooses to skip; the parser does not choose for them.