@rubytech/create-maxy 1.0.757 → 1.0.759
- package/package.json +1 -1
- package/payload/platform/config/brand.json +1 -1
- package/payload/platform/plugins/docs/references/plugins-guide.md +1 -0
- package/payload/platform/plugins/memory/mcp/dist/index.js +27 -8
- package/payload/platform/plugins/memory/mcp/dist/index.js.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/memory-archive-write.test.js +97 -1
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/memory-archive-write.test.js.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/tools/memory-archive-write.d.ts +56 -6
- package/payload/platform/plugins/memory/mcp/dist/tools/memory-archive-write.d.ts.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/tools/memory-archive-write.js +360 -105
- package/payload/platform/plugins/memory/mcp/dist/tools/memory-archive-write.js.map +1 -1
- package/payload/platform/plugins/whatsapp-import/PLUGIN.md +29 -0
- package/payload/platform/plugins/whatsapp-import/skills/whatsapp-import/SKILL.md +122 -0
- package/payload/platform/plugins/whatsapp-import/skills/whatsapp-import/references/conversation-and-messages.md +99 -0
- package/payload/platform/plugins/whatsapp-import/skills/whatsapp-import/references/export-parse.md +102 -0
- package/payload/platform/plugins/whatsapp-import/skills/whatsapp-import/references/insight-extraction.md +118 -0
- package/payload/platform/templates/specialists/agents/database-operator.md +1 -0
- package/payload/server/chunk-2W7O63CK.js +3052 -0
- package/payload/server/chunk-ZK3VNR7Y.js +9510 -0
- package/payload/server/client-pool-TULUIO6M.js +28 -0
- package/payload/server/maxy-edge.js +2 -2
- package/payload/server/public/assets/{admin-CTbpNMNG.js → admin-DHg5a2u2.js} +2 -2
- package/payload/server/public/index.html +1 -1
- package/payload/server/server.js +21 -12
@@ -0,0 +1,122 @@
+---
+name: whatsapp-import
+description: Import a WhatsApp `_chat.txt` export into a {{productName}} Neo4j graph as a Conversation with chronologically-chained Messages, then derive typed insights (mentions, preferences, commitments, observed relationships) as first-class graph entities. Triggers when the user asks to import a WhatsApp chat, ingest a `_chat.txt` file, or drops the contents of an "Export Chat" folder into chat. Distinct from the live `whatsapp` plugin (Baileys); this is import-from-export only.
+---
+
+# WhatsApp Import
+
+Ingests a WhatsApp "Export Chat" archive — `_chat.txt` plus media attachments — into a {{productName}} Neo4j graph. Two passes:
+
+1. **Deterministic ingest** — Conversation + Messages + chronology + sender edges, written via the fixed Cypher inside `memory-archive-write`.
+2. **Insight extraction** — analysis-derived nodes and edges (mentions, topics, preferences, commitments, observed relationships) written via existing `memory-write` / `memory-update` tools after pass 1 completes.
+
+Every node and edge carries `source='whatsapp'`, `createdByAgent='whatsapp-import'`, `createdBySession=<this-skill-run-uuid>`, and `archiveSourceFile=<sha256-prefix>` so the operator can grep this ingest's footprint at any time.
+
+## Owner + participant confirmation (mandatory first step)
+
+A WhatsApp export belongs to exactly one operator (the person whose phone produced the export) and contains messages from a known set of senders. Both must be confirmed before any line is written. The flow:
+
+### Step 1 — Owner
+
+The owner is metadata: who exported this chat. It is stamped on the `:Conversation` node as `createdBySession` provenance. The owner is **not** a row-level subject — every message has its own sender.
+
+1. List every `:AdminUser` in the graph: `{userId, name, accountId(s)}`.
+2. Ask the operator: "Who exported this `_chat.txt`?" Accept either an existing `:AdminUser` userId or a new external `:Person` (with `givenName` + `familyName` + at least one of `email`/`telephone`).
+3. Echo the chosen owner back verbatim. Require explicit yes/no confirmation before proceeding.
+4. Persist the resolved owner's `elementId` as `$ownerNodeId`.
+
+### Step 2 — Participants
+
+Parse the `_chat.txt` per [export-parse.md](references/export-parse.md). For each distinct sender name, capture: `{senderName, firstSeen, lastSeen, messageCount}`. Display the list in chat with these counts; the operator sees who they're about to ingest before any write.
+
+For each distinct sender, ask the operator to choose:
+
+- **Existing `:AdminUser`** — typically themselves (when their own messages are in the export). Resolve via `memory-search` by `userId` or `name`. Persist the elementId.
+- **Existing `:Person`** — match by `givenName` + `familyName`, `email`, or `telephone`. Use `memory-search` to find candidates; if multiple match, surface them and require operator pick. Persist the elementId.
+- **New external `:Person`** — mint via `memory-write` with `givenName` + `familyName` + at least one of `email`/`telephone`. Provenance: `source='whatsapp'`, `createdByAgent='whatsapp-import'`, `createdBySession=$sessionId`. Capture the resulting elementId.
+- **Skip** — exclude this sender's messages from the import. Operator may pick this for noisy auto-replies, bots, etc.
+
+**Refusing to invent identity is load-bearing.** The skill never silently mints a `:Person` from a WhatsApp display name alone (which is often just a phone number or "Mum"). A new `:Person` requires confirmation of `givenName` + `familyName` + contact information. This is the first contract `feedback_archives_are_not_documents.md` enforces.
+
+### Step 3 — Persist the participant map
+
+Build `$participantNodeIds = {senderName → senderElementId}`. Echo back to operator one final time (`Confirm: 5 senders, 4 :AdminUser/:Person, 1 skipped — proceed?`). On yes, the participant map flows into every row of the `memory-archive-write` call.
+
+### Step 4 — Same-person, multiple display-names heuristic
+
+WhatsApp displays a sender by their phone-saved name when known, by phone number otherwise. If the operator's contact list changed mid-conversation, the same person may appear under two distinct senderNames (`+44 7...` and `Joel Smalley`). Detect this heuristically: surface any senderName that looks like a phone number (digits and spaces, with an optional leading `+`) and ask `Is "+44 7..." the same person as "Joel Smalley"?`. On yes, both senderNames map to the same elementId. On no, keep them distinct.
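The Step 4 heuristic could be sketched roughly as follows. This is only an illustration: `isPhoneLikeSender` and `aliasCandidates` are hypothetical helper names, and the accepted character set and the seven-digit minimum are assumptions rather than anything the skill specifies.

```typescript
// Hypothetical helper: true when a display name looks like a phone number.
// Assumption: digits, spaces, hyphens, parentheses and an optional leading "+",
// with at least seven digits in total.
function isPhoneLikeSender(senderName: string): boolean {
  const trimmed = senderName.trim();
  const digitCount = (trimmed.match(/\d/g) || []).length;
  return digitCount >= 7 && /^\+?[\d\s\-()]+$/.test(trimmed);
}

// Hypothetical helper: pairs each phone-like sender with each named sender so
// the operator can be asked "Is <phone> the same person as <name>?".
function aliasCandidates(senderNames: string[]): Array<[string, string]> {
  const phones = senderNames.filter((name) => isPhoneLikeSender(name));
  const named = senderNames.filter((name) => !isPhoneLikeSender(name));
  const pairs: Array<[string, string]> = [];
  for (const phone of phones) {
    for (const name of named) {
      pairs.push([phone, name]);
    }
  }
  return pairs;
}
```

The sketch only proposes candidate pairs. Whether two names map to the same elementId remains the operator's decision.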
+
+## Selective-ingest threshold (bulk archives)
+
+WhatsApp 1:1 chats commonly contain 1,000–10,000 messages; group chats 10,000+. Writing all of them in one shot defeats compression-on-write and produces a landfill graph. The skill compresses by interrogating the operator before the bulk write.
+
+**Threshold:** when the parsed `rows[]` count exceeds **100 messages**, pause and ask the operator to filter along the natural axes:
+
+- **Date range** — "messages between 2026-01-01 and 2026-04-01"
+- **Sender** — "only messages from Joel and Sarah"
+- **Keyword** — body contains "Q3 report" / "office hours" / etc.
+- **All** — accept the full archive (rare; only for small chats or when the operator explicitly wants every message)
+
+Apply the chosen filter to `rows[]` before invoking `memory-archive-write`. Compress on write, never after — a 5,000-message blanket import is noise; a 200-message filtered import is signal.
+
+When the threshold trips, emit one log line BEFORE the prompt:
+
+```
+[whatsapp-import] selective-ingest-gate count=<n> threshold=100 axes=date,sender,keyword
+```
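As a minimal sketch, the gate and the three filter axes described above might look like this. The `IngestFilter` shape, the inclusive date bounds, and the case-insensitive keyword match are assumptions made for illustration; the skill does not define them.

```typescript
interface ParsedLine {
  senderName: string;
  dateSent: string; // ISO 8601
  body: string;
  sequenceIndex: number;
}

// Hypothetical filter shape: every axis is optional, and omitting all of them
// corresponds to the "All" choice.
interface IngestFilter {
  from?: string; // ISO 8601, inclusive (assumption)
  to?: string; // ISO 8601, inclusive (assumption)
  senders?: string[];
  keyword?: string; // matched case-insensitively (assumption)
}

const SELECTIVE_INGEST_THRESHOLD = 100;

// True when the operator should be asked to filter before the bulk write.
function needsSelectiveIngestGate(lines: ParsedLine[]): boolean {
  return lines.length > SELECTIVE_INGEST_THRESHOLD;
}

function applyIngestFilter(lines: ParsedLine[], filter: IngestFilter): ParsedLine[] {
  const from = filter.from ? Date.parse(filter.from) : -Infinity;
  const to = filter.to ? Date.parse(filter.to) : Infinity;
  const keyword = filter.keyword ? filter.keyword.toLowerCase() : undefined;
  return lines.filter((line) => {
    const sentAt = Date.parse(line.dateSent);
    if (sentAt < from || sentAt > to) return false;
    if (filter.senders && filter.senders.indexOf(line.senderName) === -1) return false;
    if (keyword !== undefined && line.body.toLowerCase().indexOf(keyword) === -1) return false;
    return true;
  });
}
```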
+
+## Stable IDs
+
+- `conversationId = whatsapp-export:<sha256(_chat.txt bytes)>:<accountId>` — same archive, same operator account → idempotent re-import. Different archive (even for the same conversation) → different conversationId.
+- `messageId = whatsapp-export:<conversationId>:<lineHash>` where `lineHash = sha256(<original-line-text>)`. Re-imports of the same archive are zero-write idempotent; re-exports with appended messages add the delta cleanly.
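The two ID formats above can be illustrated with a short sketch using Node's built-in `crypto` module. The function names are hypothetical, and hashing the original line as UTF-8 text with a hex digest is an assumption about details the skill leaves open.

```typescript
import { createHash } from "node:crypto";

// SHA-256 as lowercase hex. Strings are hashed as UTF-8 (assumption).
function sha256Hex(data: string | Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// conversationId = whatsapp-export:<sha256(_chat.txt bytes)>:<accountId>
function buildConversationId(fileBytes: Uint8Array, accountId: string): string {
  return `whatsapp-export:${sha256Hex(fileBytes)}:${accountId}`;
}

// messageId = whatsapp-export:<conversationId>:<lineHash>,
// where lineHash = sha256(<original-line-text>)
function buildMessageId(conversationId: string, originalLine: string): string {
  return `whatsapp-export:${conversationId}:${sha256Hex(originalLine)}`;
}
```

Because both IDs are pure functions of the file bytes, the account, and the raw line, re-running the sketch on the same archive yields the same IDs, which is the property the MERGE-based idempotency relies on.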
+
+## Timezone
+
+WhatsApp's `[DD/MM/YY, HH:MM:SS]` line prefix lacks a timezone offset. The skill **must not silently assume UTC.** When the timezone is non-obvious (the operator hasn't said where they were when the messages were sent), ask:
+
+> The export uses `[DD/MM/YY, HH:MM:SS]` but doesn't include a timezone. Which timezone should I tag these messages with? (e.g., Europe/London, America/New_York, UTC)
+
+Convert each parsed timestamp to ISO 8601 with the supplied offset before passing to `memory-archive-write`. The Cypher's `datetime()` then preserves the exact instant.
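A minimal sketch of that conversion, under stated assumptions: the operator's answer has already been resolved to a fixed UTC offset such as `+01:00` (a zone name like Europe/London needs a timezone database and daylight-saving handling, which this sketch does not attempt), the date order is `DD/MM/YY`, and two-digit years belong to the 2000s. `toIso8601` is a hypothetical name.

```typescript
// Converts the text inside the square brackets, e.g. "14/03/26, 10:15:23",
// into ISO 8601 with an explicit offset. Seconds default to "00" when absent.
function toIso8601(timestamp: string, utcOffset: string): string {
  const match = /^(\d{2})\/(\d{2})\/(\d{2}), (\d{2}):(\d{2})(?::(\d{2}))?$/.exec(timestamp);
  if (match === null) {
    throw new Error(`unrecognised timestamp: ${timestamp}`);
  }
  if (!/^(Z|[+-]\d{2}:\d{2})$/.test(utcOffset)) {
    throw new Error(`unrecognised offset: ${utcOffset}`);
  }
  const day = match[1];
  const month = match[2];
  const year = `20${match[3]}`; // assumption: all messages are from 2000 onwards
  const seconds = match[6] || "00";
  return `${year}-${month}-${day}T${match[4]}:${match[5]}:${seconds}${utcOffset}`;
}
```

Throwing on an unrecognised timestamp or offset keeps the sketch in line with the skill's rule against guessing.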
+
+## Execution model
+
+1. **Parse** — Read `_chat.txt` per [export-parse.md](references/export-parse.md). Build the parsed-line structure: `{senderName, dateSent, body, sequenceIndex}`. Skip system messages and media-only lines, incrementing the matching counters.
+2. **Owner+participant confirmation** — Steps 1–3 above. Persist `$ownerNodeId` + `$participantNodeIds`.
+3. **Selective-ingest gate** — If `parsedLines.length > 100`, pause for filter selection. Apply filter.
+4. **Build rows[]** — Map each parsed line to `{messageId, conversationId, senderNodeId, senderName, dateSent (ISO 8601), body, sequenceIndex}`. Compute `messageId` per line.
+5. **Build conversation block** — `{conversationId, archiveSourceFile, firstMessageAt, lastMessageAt, participantCount, messageCount}` from the rows[].
+6. **Dispatch** `mcp__memory__memory-archive-write` once with `archiveType='whatsapp-export'`, `ownerNodeId`, `conversation`, `participantNodeIds` (the distinct elementIds from the map), `rows`, `sessionId`. The tool MERGEs the Conversation, MERGEs Messages, links PART_OF + SENT + PARTICIPANT_IN edges per row, and runs the `finalize` hook to MERGE the NEXT chronology by dateSent ordering.
+7. **Emit per-export log line:**
+```
+[whatsapp-import] file=<chat.txt> conversationId=<cid> participants=<n> messages-parsed=<n> media-skipped=<n> system-skipped=<n> ms=<elapsed>
+```
+8. **Insight pass** — Run pass 2 per [insight-extraction.md](references/insight-extraction.md). Read the just-written messages via `memory-search`, classify within the specialist's own LLM turn, and write typed observations through `memory-write` / `memory-update`. Emit:
+```
+[whatsapp-import] insight-pass model=sonnet chunks=<n> mentions=<n> preferences=<n> tasks=<n> observed-relationships=<n> novel-insights=<n> ms=<elapsed>
+```
+
+## Doctrine — raw Cypher and `cypher-shell` are forbidden in this skill
+
+All writes route through `mcp__memory__memory-archive-write` (bulk Conversation+Messages) or `mcp__memory__memory-write` / `mcp__memory__memory-update` (second-pass typed observations). The agent never authors Cypher. If the operator hits a write shape these tools do not express, **do not improvise** — surface the gap as a structured task per the database-operator's LOUD-FAIL prerogative. See [database-operator.md](../../../../templates/specialists/agents/database-operator.md#prerogatives).
+
+## LOUD-FAIL on parse errors
+
+If [export-parse.md](references/export-parse.md)'s grammar doesn't match a line (genuine parser failure, not a documented skip case), emit:
+
+```
+[whatsapp-import] parse-error file=<chat.txt> line=<n> reason=<r>
+```
+
+…and abort the import. Do NOT silently truncate or guess. The operator gets a named error; we keep no half-truths in the graph.
+
+## Idempotency contract
+
+Re-importing the same `_chat.txt` is a no-op (`createdMessages=0`, `mergedMessages=N`, NEXT chain unchanged). Re-importing a re-exported file with appended messages adds only the delta and extends the NEXT chain to cover the new tail. Both paths are server-enforced via MERGE on `messageId` and the finalize hook's idempotent NEXT-MERGE.
+
+## Verification (post-write)
+
+- `MATCH (c:Conversation:WhatsAppConversation {conversationId: $cid}) RETURN c.messageCount, c.participantCount, c.firstMessageAt, c.lastMessageAt` — agrees with the per-export log line counts.
+- `MATCH (m:Message:WhatsAppMessage)-[:PART_OF]->(c {conversationId: $cid}) RETURN count(m)` — equals post-filter line count.
+- `MATCH p=(m:Message {conversationId: $cid})-[:NEXT*]->(end) WITH max(length(p)) AS chain RETURN chain` — equals `messageCount - 1`.
+- `MATCH (m:Message {conversationId: $cid}) RETURN min(m.dateSent), max(m.dateSent)` — matches the file's first/last lines (modulo the operator-confirmed timezone).
+- `MATCH (n) WHERE n.createdBySession = $sessionId RETURN labels(n) AS l, count(*) ORDER BY count(*) DESC` — the full graph footprint of this ingest, sortable by label.
@@ -0,0 +1,99 @@
+# Reference: Conversation + Messages row schema
+
+This reference defines the `archiveType='whatsapp-export'` row + conversation block + participantNodeIds shapes that the [SKILL.md](../SKILL.md) passes to `mcp__memory__memory-archive-write`. The Cypher body is fixed server-side in [`platform/plugins/memory/mcp/src/tools/memory-archive-write.ts`](../../../../memory/mcp/src/tools/memory-archive-write.ts); the agent's responsibility ends at producing valid input.
+
+## Tool input shape
+
+```json
+{
+  "archiveType": "whatsapp-export",
+  "ownerNodeId": "<elementId of :AdminUser or :Person — operator who exported the chat>",
+  "sessionId": "<UUID — generated once per skill run>",
+  "conversation": {
+    "conversationId": "whatsapp-export:<sha256(file)>:<accountId>",
+    "archiveSourceFile": "<sha256(file)>",
+    "firstMessageAt": "2026-03-14T10:15:23Z",
+    "lastMessageAt": "2026-04-21T18:42:11Z",
+    "participantCount": 2,
+    "messageCount": 174
+  },
+  "participantNodeIds": ["<elemId-Joel>", "<elemId-Sarah>"],
+  "rows": [
+    {
+      "messageId": "whatsapp-export:<conversationId>:<lineHash>",
+      "conversationId": "<same as conversation.conversationId>",
+      "senderNodeId": "<elemId — operator-confirmed>",
+      "senderName": "Joel",
+      "dateSent": "2026-03-14T10:15:23Z",
+      "body": "Quick question about the deck —\ndo you have the v3 PDF anywhere?",
+      "sequenceIndex": 0
+    },
+    ...
+  ]
+}
+```
+
+## What the server does (informational, not the agent's responsibility)
+
+Each batch (max 500 rows) runs in one `executeWrite` transaction:
+
+1. **MERGE Conversation** — `(:Conversation:WhatsAppConversation {conversationId})`. ON CREATE stamps full provenance + scalar metadata (firstMessageAt, lastMessageAt, participantCount, messageCount, archiveSourceFile). ON MATCH refreshes the moving scalars (`lastMessageAt`, `messageCount`, `participantCount`) plus `lastImportedAt` / `lastImportedBySession` for re-import audit; createdBy* stamps stay frozen.
+2. **MATCH sender by elementId** — `WHERE elementId(sender) = row.senderNodeId`. The verifyParticipants preflight already confirmed every senderNodeId resolves; this MATCH never fails at runtime.
+3. **MERGE Message** — `(:Message:WhatsAppMessage {messageId})`. ON CREATE stamps provenance, body, dateSent, senderName (raw, audit-only), sequenceIndex.
+4. **MERGE PART_OF** — `(m)-[:PART_OF]->(c)`.
+5. **MERGE SENT** — `(sender)-[:SENT]->(m)`. Per-row, idempotent.
+6. **MERGE PARTICIPANT_IN** — `(sender)-[:PARTICIPANT_IN]->(c)`. Per-row, but rolls up across rows because MERGE on the same edge between the same pair is idempotent — a 200-message conversation with 2 participants ends with 2 PARTICIPANT_IN edges, not 200.
+
+After all batches, the **finalize** hook runs one additional `executeWrite` transaction:
+
+7. **MERGE NEXT chain** — `MATCH (m:Message)-[:PART_OF]->(c {conversationId})` ordered by `dateSent ASC, sequenceIndex ASC`, then UNWIND consecutive pairs, MERGE `(prev)-[:NEXT]->(next)`. Idempotent on MERGE — re-imports add only the new tail edges. Returns `nextEdges` count for the per-call log line.
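To make the finalize ordering concrete, here is a client-side illustration of how consecutive pairs fall out of a `dateSent ASC, sequenceIndex ASC` sort. It is not the server's Cypher, and `MessageRef` and `nextPairs` are hypothetical names introduced only for this sketch.

```typescript
interface MessageRef {
  messageId: string;
  dateSent: string; // ISO 8601
  sequenceIndex: number;
}

// Returns the (prev, next) messageId pairs that a NEXT chain would connect.
function nextPairs(messages: MessageRef[]): Array<[string, string]> {
  const ordered = messages.slice().sort((a, b) => {
    const byDate = Date.parse(a.dateSent) - Date.parse(b.dateSent);
    // sequenceIndex breaks ties between messages sent in the same second.
    return byDate !== 0 ? byDate : a.sequenceIndex - b.sequenceIndex;
  });
  const pairs: Array<[string, string]> = [];
  let prev: MessageRef | undefined;
  for (const message of ordered) {
    if (prev !== undefined) {
      pairs.push([prev.messageId, message.messageId]);
    }
    prev = message;
  }
  return pairs;
}
```

For `n` messages the sketch yields `n - 1` pairs, which matches the `messageCount - 1` chain length that SKILL.md's verification query expects.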
+
+## Natural keys + provenance
+
+| Entity | Key | Provenance fields stamped |
+|---|---|---|
+| `:Conversation:WhatsAppConversation` | `conversationId` | `accountId, source='whatsapp', createdByAgent='whatsapp-import', createdBySource='whatsapp-import', createdBySession, createdAt=datetime(), scope='admin', agentType='admin', archiveSourceFile`. ON re-import, also: `lastImportedAt, lastImportedBySession`. |
+| `:Message:WhatsAppMessage` | `messageId` | Same provenance set + `conversationId, dateSent, body, senderName, sequenceIndex, scope='admin'`. |
+| `[:SENT]` | (sender, message) pair | `source='whatsapp', createdAt=datetime()`. |
+| `[:PARTICIPANT_IN]` | (sender, conversation) pair | `source='whatsapp', createdAt=datetime()`. |
+| `[:PART_OF]` | (message, conversation) pair | (no edge properties needed; the edge type is the signal). |
+| `[:NEXT]` | (prev message, next message) pair | `source='whatsapp', createdAt=datetime()`. |
+
+## Edges this reference does NOT mint
+
+- `(:AdminUser)-[:OWNS]->(:Conversation)` — explicitly NOT created. The exporter is metadata-on-Conversation provenance, not a graph relationship. A `:Conversation` is owned by its `accountId`, the same way other account-scoped nodes are. Adding an OWNS edge would duplicate property-based scope with edge-based scope (anti-pattern; see `feedback_account_scope_is_a_property.md`).
+- `(:Conversation)-[:OCCURRED_AT]->(:Date)` — date is a scalar property on the Conversation, not a separate node. Querying by date uses the property, not an edge traversal.
+- `(:Message)-[:HAS_BODY]->(:Text)` — the body is a property on the Message; no separate Text node.
+
+If a future query pattern reveals one of these is genuinely useful, file a separate task — never add them ad-hoc here.
+
+## Divergence from live-channel `persistMessage`
+
+The live `whatsapp` plugin's `persistMessage` in [`platform/ui/app/lib/neo4j-store.ts`](../../../../../ui/app/lib/neo4j-store.ts) holds a per-conversation lock when chaining NEXT — concurrent inbound messages to the same conversation would otherwise race to insert NEXT pointers. Bulk import does **not** need that lock because:
+
+1. The whole import runs in one `database-operator` turn, no concurrency.
+2. The finalize step runs ONE Cypher statement that scopes to a single `conversationId` and rebuilds the chain in `dateSent ASC` order. MERGE is idempotent; concurrent re-runs (which can't happen in one turn) would converge regardless.
+
+If a future change runs imports in parallel across conversations, that's still safe — each conversationId's NEXT chain is rebuilt independently. If two parallel imports targeted the same conversationId (operator scripting error), MERGE-on-NEXT idempotency saves the graph; the latest write wins on edge properties.
+
+## Counter shape returned to the agent
+
+```json
+{
+  "archiveType": "whatsapp-export",
+  "processedRows": 174,
+  "counters": {
+    "createdMessages": 174,
+    "relationshipsCreated": 524,
+    "labelsAdded": 348,
+    "nextEdgesProcessed": 173,
+    "nextEdgesCreated": 173
+  },
+  "errors": [],
+  "nextChainStatus": "ok"
+}
+```
+
+The skill displays the counts to the operator in chat and emits the `[whatsapp-import] file=...` log line using `processedRows`, `nextEdgesCreated`, and the elapsed time. The exact counter keys evolve (per-handler), so the skill should iterate the `counters` map rather than destructure specific names.
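Iterating the `counters` map rather than destructuring it might look like the sketch below. The `ArchiveWriteResult` type mirrors the JSON example above, while `formatCounters` and the alphabetical key order are assumptions made for illustration.

```typescript
// Shape taken from the JSON example in this reference; the counter keys are
// deliberately left open because they vary per handler.
interface ArchiveWriteResult {
  archiveType: string;
  processedRows: number;
  counters: Record<string, number>;
  errors: Array<{ rowIndex: number; reason: string }>;
  nextChainStatus: string;
}

// Renders every counter as key=value without naming any key in the code.
// Keys are sorted so the output is stable across runs (assumption).
function formatCounters(result: ArchiveWriteResult): string {
  return Object.keys(result.counters)
    .sort()
    .map((key) => `${key}=${result.counters[key]}`)
    .join(" ");
}
```

A handler that later adds or renames a counter would then show up in the chat output without any change to the skill.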
+
+When a batch errors, `errors[]` contains `{rowIndex, reason}`; `nextChainStatus` becomes `"partial"` (finalize still runs over whatever messages did land); the skill's chat output names the failed offset and the operator decides whether to re-import.
package/payload/platform/plugins/whatsapp-import/skills/whatsapp-import/references/export-parse.md
ADDED
@@ -0,0 +1,102 @@
+# Reference: `_chat.txt` parsing
+
+WhatsApp's "Export Chat" produces a UTF-8 text file with a deterministic line grammar. This reference is the contract for converting that file into the parsed-line structure the [SKILL.md](../SKILL.md) builds rows from.
+
+## File-open invariants
+
+1. **UTF-8 only.** Open the file with explicit UTF-8 decoding. On encoding error, abort the import with a named LOUD-FAIL — never silently substitute or corrupt bodies. WhatsApp's modern apps emit UTF-8 reliably; an encoding error usually means the operator manually edited the file with a tool that broke it. Surface the named error so they can re-export.
+2. **No size cap from the parser.** The parser handles arbitrarily large files; the [SKILL.md](../SKILL.md)'s 100-message selective-ingest gate is the operator-facing compression layer.
+3. **Compute `archiveSourceFile = sha256(file bytes)` first.** The hash drives `conversationId` and lets re-imports of the same archive land idempotently.
+
+## Line grammar
+
+Every message line begins with a square-bracketed timestamp prefix followed by `<Sender>: <body>`:
+
+```
+[DD/MM/YY, HH:MM:SS] <Sender>: <body>
+```
+
+- `DD/MM/YY` — two-digit day, month, two-digit year (WhatsApp default; some locales emit `MM/DD/YY` — operator confirms ambiguous orderings during the timezone prompt).
+- `HH:MM:SS` — 24-hour time. Older exports may emit `HH:MM` (no seconds); treat seconds as 0 when absent.
+- `<Sender>` — the sender's display name as WhatsApp showed it. May be a saved contact name, a phone number with country code (`+44 7700 900123`), or "You" for messages sent by the operator (older exports).
+- `<body>` — message text, possibly multi-line.
+
+Trim trailing whitespace from each line before parsing.
+
+## Multi-line bodies
+
+A body that wraps to multiple lines continues onto subsequent lines that do **not** match the timestamp-prefix pattern. The parser must accumulate these continuation lines into the previous message's body, joining with `\n`. End-of-message is detected by the next timestamp prefix or end-of-file.
+
+```
+[14/03/26, 10:15:23] Joel: Quick question about the deck —
+do you have the v3 PDF anywhere?
+I checked Drive and only see v2.
+[14/03/26, 10:16:01] Sarah: Sec, will dig it out
+```
+
+Yields two messages:
+- `Joel: "Quick question about the deck —\ndo you have the v3 PDF anywhere?\nI checked Drive and only see v2."`
+- `Sarah: "Sec, will dig it out"`
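The accumulation rule above can be sketched as a small line walker. This is an illustrative sketch only: it covers the timestamp prefix, the first-`: ` split, BOM and line-ending normalisation, and continuation lines, and it deliberately leaves out system-message, media-only, and empty-body handling. `RawMessage` and `parseMessages` are hypothetical names.

```typescript
interface RawMessage {
  timestamp: string; // raw "DD/MM/YY, HH:MM[:SS]" text, not yet converted
  senderName: string;
  body: string;
}

// Timestamp prefix from the line grammar; seconds are optional.
const TIMESTAMP_PREFIX = /^\[(\d{2}\/\d{2}\/\d{2}, \d{2}:\d{2}(?::\d{2})?)\] (.*)$/;

function parseMessages(text: string): RawMessage[] {
  const normalised = text
    .replace(/^\uFEFF/, "") // strip a leading BOM
    .replace(/\r\n/g, "\n") // normalise line endings
    .replace(/\n+$/, ""); // ignore blank lines at end-of-file
  const messages: RawMessage[] = [];
  let current: RawMessage | undefined;
  for (const rawLine of normalised.split("\n")) {
    const line = rawLine.replace(/\s+$/, ""); // trim trailing whitespace
    const match = TIMESTAMP_PREFIX.exec(line);
    if (match === null) {
      // Continuation line: append to the message that is currently open.
      if (current === undefined) {
        throw new Error("continuation line before any message");
      }
      current.body += "\n" + line;
      continue;
    }
    const rest = match[2];
    const separator = rest.indexOf(": "); // split on the FIRST colon-space
    if (separator === -1) {
      throw new Error(`no ": " separator in line: ${line}`);
    }
    current = {
      timestamp: match[1],
      senderName: rest.slice(0, separator),
      body: rest.slice(separator + 2),
    };
    messages.push(current);
  }
  return messages;
}
```

The sketch throws instead of skipping when it cannot classify a line, which errs toward the reference's rule against silently dropping data.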
+
+## System messages — skip with counter
+
+WhatsApp injects system-generated lines into the export for group events, contact changes, and security messages. These lines match the timestamp prefix but are **not** sent by a person and have no first-class graph value. Skip them at parse time, increment `systemSkipped`, and do not pass to `memory-archive-write`.
+
+Patterns to recognise (English; localisation expands the list):
+
+- `<Sender> created group "<name>"`
+- `<Sender> changed the subject from "<old>" to "<new>"`
+- `<Sender> changed this group's icon`
+- `<Sender> added <other>`
+- `<Sender> removed <other>`
+- `<Sender> left`
+- `<Sender>'s security code changed.`
+- `Messages and calls are end-to-end encrypted. ...` (the conversation header)
+- `You deleted this message.`
+- `This message was deleted.`
+
+Heuristic for the parser: if the text after the timestamp prefix lacks the colon-after-sender separator and instead reads as `<Sender>` followed directly by a verb phrase, treat it as a system message. Conservative — when uncertain, prefer to ingest as a message; the insight pass tolerates noisy bodies better than the parser would tolerate dropped real messages.
+
+## Media attachments — skip with counter
+
+Lines whose body indicates a media-only message (no text content) are skipped at parse time and increment `mediaSkipped`. Patterns:
+
+- `<Media omitted>` (when the operator chose "Without Media" on export)
+- `IMG-<digits>-<digits>.jpg (file attached)` / `.jpeg` / `.png` / `.heic` / `.gif`
+- `VID-<digits>-<digits>.mp4 (file attached)`
+- `PTT-<digits>-<digits>.opus (file attached)` (voice notes)
+- `AUD-<digits>-<digits>.opus (file attached)` (audio)
+- `STK-<digits>-<digits>.webp (file attached)` (stickers)
+- `<filename>.pdf (file attached)`
+- `<filename>.docx (file attached)`
+- `<...> attached: <filename>` (alternative format on some platforms)
+
+Mixed messages (text + media reference in one body) are kept as messages — only pure-media-only lines are skipped. The text body is retained.
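One way to express the media-only check is a list of anchored regular expressions, sketched below. The patterns follow the bullet list above literally; `isMediaOnly` is a hypothetical name. Two deliberate simplifications are assumptions: document filenames are only matched when they contain no spaces (so an ambiguous body is ingested rather than dropped), and the `<...> attached: <filename>` variant is omitted because its prefix is not specified here.

```typescript
// Each pattern must match the WHOLE body, so mixed text + media bodies are kept.
const MEDIA_ONLY_PATTERNS: RegExp[] = [
  /^<Media omitted>$/,
  /^IMG-\d+-\d+\.(jpg|jpeg|png|heic|gif) \(file attached\)$/,
  /^VID-\d+-\d+\.mp4 \(file attached\)$/,
  /^(PTT|AUD)-\d+-\d+\.opus \(file attached\)$/,
  /^STK-\d+-\d+\.webp \(file attached\)$/,
  // Assumption: filenames with spaces are NOT matched, to stay conservative.
  /^\S+\.(pdf|docx) \(file attached\)$/,
];

// True when the body is a pure media reference with no accompanying text.
function isMediaOnly(body: string): boolean {
  const trimmed = body.trim();
  return MEDIA_ONLY_PATTERNS.some((pattern) => pattern.test(trimmed));
}
```

A caller would increment `mediaSkipped` when this returns true and pass the message through otherwise.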
+
+## Forwarded messages
+
+A forwarded message is prefixed with the invisible Unicode character U+200E (LEFT-TO-RIGHT MARK) followed by metadata WhatsApp injects. Parse the body normally; the LRM character is preserved in `body` (the insight pass's classifier sees it as benign). Do not strip — the raw body's fidelity matters for downstream queries.
+
+## Edge cases
+
+- **Empty body** (timestamp prefix followed by sender colon but no text). Rare. Skip with `systemSkipped` increment — usually corresponds to a deleted message stub.
+- **Leading BOM** (U+FEFF at file start). Strip before parsing the first line.
+- **Mixed line endings** (`\r\n` vs `\n`). Normalise to `\n` before tokenisation.
+- **Sender containing a colon** (e.g., a contact named "Joel: Work"). The grammar splits on the FIRST `: ` (colon-space) after the timestamp prefix's closing `]`, so such a sender is captured as `Joel` and the remainder (`Work: ...`) lands at the start of the body. Subsequent colons in the sender or body are preserved verbatim.
+
+## Parser output shape
+
+The parser returns `{conversationId, archiveSourceFile, parsedLines[], counters}` where:
+
+- `parsedLines[]: Array<{senderName: string, dateSent: string (ISO 8601 with operator-supplied timezone), body: string, sequenceIndex: number}>`
+- `counters: {parsed: n, systemSkipped: n, mediaSkipped: n, parseErrors: n}`
+
+The skill consumes this directly. The `messageId` is computed by the skill (not the parser) so the `lineHash` covers the original raw line, not the post-parse normalised body.
+
+## When to LOUD-FAIL
+
+- Encoding error at file open (UTF-8 decode fails partway).
+- Zero parsed lines after walking the file (the file isn't a `_chat.txt`).
+- A timestamp prefix matches but the body parse fails (no `: ` separator after the closing `]`, and the line is not a recognised system message) — emit `[whatsapp-import] parse-error file=<...> line=<n> reason=<r>` and abort.
+
+Never silently drop data the parser couldn't classify. The operator chooses to skip; the parser does not choose for them.
@@ -0,0 +1,118 @@
|
|
|
1
|
+
# Reference: Insight extraction (pass 2)
|
|
2
|
+
|
|
3
|
+
After the deterministic ingest completes, the [SKILL.md](../SKILL.md) runs a second pass that emits **first-class graph entities** capturing what the conversation is *about*. Insights are not summaries; they are typed nodes and edges the operator's queries can traverse.
|
|
4
|
+
|
|
5
|
+
This pass runs INLINE in the database-operator specialist's own LLM turn — Sonnet — calling existing memory tools. There is no separate insight-extraction MCP tool. Reuse-over-invent: each observation maps onto an existing graph label or edge type wherever possible. `:Insight` is the last resort.
|
|
6
|
+
|
|
7
|
+
## Reuse-over-invent — the six observation kinds
|
|
8
|
+
|
|
9
|
+
| Kind | Maps to | Edge from | Edge to | Edge type | Notes |
|
|
10
|
+
|---|---|---|---|---|---|
|
|
11
|
+
| Third-party mention | `:Person` (existing) | `:Message` | `:Person` | `:MENTIONS` | Hallucination gate: see below. |
|
|
12
|
+
| Recurring topic (≥3 mentions) | `:DefinedTerm` (existing) with `category:'topic'` | `:Conversation` | `:DefinedTerm` | `:DISCUSSES` (with `frequency` property) | Surface as a topic only if the term recurs ≥3 distinct messages. |
|
|
13
|
+
| Expressed preference | `:Preference` (existing, see [schema.cypher:588](../../../../../neo4j/schema.cypher#L588-L595)) | `:Person` or `:AdminUser` | `:Preference` | `:HAS_PREFERENCE` + `(:Preference)-[:OBSERVED_IN]->(:Conversation)` | Preference text is a single observation, not a paragraph. |
|
|
14
|
+
| Commitment / action item | `:Task` (existing) | `:Person` or `:AdminUser` | `:Task` | `:OWNS` | Set `dueDate` only when explicitly named ("by Friday", "next week"). Skip for vague "soon". |
|
|
15
|
+
| Inter-person relationship | (matched `:Person` nodes) | `:Person` | `:Person` | `:RELATED_TO` (with `kind`, `evidenceMessageIds[]`) | Operator-confirmation gate before write — see below. |
|
|
16
|
+
| Genuinely novel finding | `:Insight` (new label, last resort only) | `:Insight` | `:Message` | `:DERIVED_FROM` | Only when reuse-over-invent fails for every existing label. Self-rated `confidence` 0–1. |
|
|
17
|
+
|
|
18
|
+
## Anti-hallucination gates
|
|
19
|
+
|
|
20
|
+
The biggest risk in this pass is Sonnet writing edges to wrong-Person nodes. Two gates protect the graph:
|
|
21
|
+
|
|
22
|
+
### Gate 1: `memory-search` BEFORE every `:MENTIONS` edge
|
|
23
|
+
|
|
24
|
+
For every `:MENTIONS` edge candidate, the skill turn MUST run `memory-search(query=<mentioned-name>, kind='person')` first. The result determines what happens:
|
|
25
|
+
|
|
26
|
+
- **0 hits** — the mentioned name doesn't exist in the graph. Skip silently. Do not auto-mint a `:Person` from a chat-mention alone.
|
|
27
|
+
- **1 hit** — proceed if the match is unambiguous; surface to operator if ambiguous (see Gate 2).
|
|
28
|
+
- **2+ hits** — ambiguous. Surface to operator before any edge writes.
|
|
29
|
+
|
|
30
|
+
The agent never writes a `:MENTIONS` edge without the prior `memory-search`. This is a discipline gate: the only permitted sequence is `memory-search → memory-write`, never `memory-write` directly from raw classification.
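
A minimal sketch of the hit-count routing for Gate 1, assuming the caller has already run `memory-search` and counted the hits. `MentionAction` and `gate1_action` are hypothetical names used for illustration; they are not the plugin's API.

```python
from enum import Enum


class MentionAction(Enum):
    SKIP = "skip"        # 0 hits: drop silently, never mint a :Person
    PROCEED = "proceed"  # 1 hit: may write, still subject to Gate 2
    SURFACE = "surface"  # 2+ hits: ask the operator before any edge write


def gate1_action(hit_count):
    """Map a memory-search hit count to the Gate 1 outcome."""
    if hit_count < 0:
        raise ValueError("hit_count cannot be negative")
    if hit_count == 0:
        return MentionAction.SKIP
    if hit_count == 1:
        return MentionAction.PROCEED
    return MentionAction.SURFACE
```

A `PROCEED` result is not a licence to write: the single hit still has to pass the first-name-only check in Gate 2.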
|
|
31
|
+
|
|
32
|
+
### Gate 2: First-name-only matches surface to operator regardless of hit count
|
|
33
|
+
|
|
34
|
+
Single-first-name references in chat ("ask Sarah about Q3") have ambiguous referents even when `memory-search` returns one match — that one match might be the wrong Sarah. The rule:
|
|
35
|
+
|
|
36
|
+
- The mention has a **disambiguator** (full name "Sarah Chen", phone, email, role context "Sarah at Acme") → `memory-search` 1-hit → write the edge.
|
|
37
|
+
- The mention is **first-name only**, with no disambiguator → ALWAYS surface to operator confirmation, regardless of `memory-search` result count.
|
|
38
|
+
|
|
39
|
+
Surface format:
|
|
40
|
+
|
|
41
|
+
```
|
|
42
|
+
[whatsapp-import] mention-ambiguous name="Sarah" reason=first-name-only candidates=1 awaiting-operator-resolution
|
|
43
|
+
```
|
|
44
|
+
|
|
45
|
+
Followed by a chat prompt: `"Sarah" mentioned in message <messageId>. Found 1 :Person candidate: Sarah Chen (sarah@acme.com). Confirm? Yes / No / Pick another.`
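
The first-name-only check and the surface log line could be sketched as below. Both helpers are hypothetical: how a disambiguator is detected (full name, phone, email, role context) is left to the caller and passed in as a flag, and the single-token test is an assumed heuristic.

```python
def is_first_name_only(mention, has_disambiguator):
    """True when the mention is a single token with no disambiguator.

    has_disambiguator is supplied by the caller; this sketch does not try
    to detect phones, emails, or role context itself.
    """
    return len(mention.split()) == 1 and not has_disambiguator


def surface_log_line(name, reason, candidates):
    """Format the mention-ambiguous log line shown in the example above."""
    return (
        f'[whatsapp-import] mention-ambiguous name="{name}" '
        f"reason={reason} candidates={candidates} awaiting-operator-resolution"
    )
```

For example, `surface_log_line("Sarah", "first-name-only", 1)` reproduces the sample line in the fenced block above.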
|
|
46
|
+
|
|
47
|
+
### Gate 3: `:RELATED_TO` between two existing distinct Persons
|
|
48
|
+
|
|
49
|
+
When the second pass infers a relationship between two `:Person` nodes who both already exist in the graph (e.g., chat says "Joel and Sarah are working on Q3 together" → `(joel)-[:RELATED_TO {kind:'collaborator'}]->(sarah)`), surface to operator confirmation before write. The operator sees the inferred edge with both endpoints' names + the supporting message excerpts; on yes, the edge writes with `evidenceMessageIds: [...]`.
|
|
50
|
+
|
|
51
|
+
The default for this gate is conservative: when in doubt, surface. False-positive `:RELATED_TO` edges are graph noise; a false-negative skip can be recovered by re-running the pass.
|
|
52
|
+
|
|
53
|
+
## Chunking strategy
|
|
54
|
+
|
|
55
|
+
For conversations with 100+ messages, chunk the input to the inline LLM turn at ~50 messages per chunk. The classifier processes each chunk independently; the skill aggregates observations across chunks before writing. Aggregation deduplicates (the same `:MENTIONS` edge would otherwise be proposed once per chunk that referenced the same person).
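
A rough sketch of the chunk-then-aggregate flow, assuming messages arrive as an in-order list and each proposed edge is a hashable tuple. The function names and the tuple shape are illustrative; what counts as "the same edge" depends on the key the caller chooses.

```python
CHUNK_THRESHOLD = 100  # conversations with 100+ messages are chunked
CHUNK_SIZE = 50        # roughly 50 messages per chunk


def chunk_messages(messages):
    """Split a message list into ~50-message chunks once it reaches 100."""
    if len(messages) < CHUNK_THRESHOLD:
        return [messages]
    return [
        messages[i:i + CHUNK_SIZE]
        for i in range(0, len(messages), CHUNK_SIZE)
    ]


def dedupe_proposals(chunks_of_proposals):
    """Merge per-chunk proposals, keeping the first occurrence of each key."""
    seen = set()
    unique = []
    for proposals in chunks_of_proposals:
        for proposal in proposals:
            if proposal not in seen:
                seen.add(proposal)
                unique.append(proposal)
    return unique
```

Keeping first-seen order means the aggregated list stays roughly chronological, which may make operator prompts easier to follow.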
|
|
56
|
+
|
|
57
|
+
Per-chunk processing emits one log line for grep-ability:
|
|
58
|
+
|
|
59
|
+
```
|
|
60
|
+
[whatsapp-import] insight-pass-chunk index=<i> messages=<n> mentions-proposed=<n> preferences-proposed=<n> tasks-proposed=<n>
|
|
61
|
+
```
|
|
62
|
+
|
|
63
|
+
After all chunks complete and observations are written, the per-pass summary log line fires:
|
|
64
|
+
|
|
65
|
+
```
|
|
66
|
+
[whatsapp-import] insight-pass model=sonnet chunks=<n> mentions=<n> preferences=<n> tasks=<n> observed-relationships=<n> novel-insights=<n> ms=<elapsed>
|
|
67
|
+
```
|
|
68
|
+
|
|
69
|
+
The numbers in the summary line are **edges actually written**, not edges proposed. Proposed-but-skipped (zero hits, ambiguous-skipped, operator-rejected) don't count.
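
Because both log lines are flat `key=value` pairs, they can be read back with a small parser. This is a sketch for log inspection only; `parse_log_line` is a hypothetical helper and it assumes values contain no spaces.

```python
LOG_PREFIX = "[whatsapp-import] "


def parse_log_line(line):
    """Split a whatsapp-import log line into (event, fields).

    Tokens without '=' are ignored. Values are returned as strings.
    """
    if not line.startswith(LOG_PREFIX):
        raise ValueError("not a whatsapp-import log line")
    event, _, rest = line[len(LOG_PREFIX):].partition(" ")
    fields = dict(tok.split("=", 1) for tok in rest.split() if "=" in tok)
    return event, fields
```

Applied to a summary line, `fields["mentions"]` is the count of `:MENTIONS` edges actually written, per the rule above.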
|
|
70
|
+
|
|
71
|
+
## When to mint `:Insight`
|
|
72
|
+
|
|
73
|
+
Only when **every one** of the six observation kinds above fails to fit. Examples that DO fit existing labels:
|
|
74
|
+
|
|
75
|
+
- "I prefer dark roast" → `:Preference`
|
|
76
|
+
- "Send the deck Friday" → `:Task` with `dueDate`
|
|
77
|
+
- "Sarah keeps mentioning Q3 reporting" → `:DefinedTerm{category:'topic', name:'Q3 reporting'}` + `:DISCUSSES`
|
|
78
|
+
- "Joel works with Sarah" → `:RELATED_TO`
|
|
79
|
+
|
|
80
|
+
Examples that genuinely warrant `:Insight`:
|
|
81
|
+
|
|
82
|
+
- "The team's morale dropped after the office move" — observation about a state change with no entity it cleanly attaches to.
|
|
83
|
+
- "There's a recurring tension between sales and ops on lead-handoff timing" — meta-observation about an organisational dynamic.
|
|
84
|
+
- "Joel's writing style shifts to formal English when discussing legal matters" — a behavioural pattern that doesn't fit `:Preference`.
|
|
85
|
+
|
|
86
|
+
Each `:Insight` carries `summary` (one sentence, ≤200 chars), `kind` (free-text classifier label), `confidence` (Sonnet self-rated 0–1), plus the standard provenance stamps. `:DERIVED_FROM` edges link to the SPECIFIC messages that supported the insight (not the entire conversation).
|
|
87
|
+
|
|
88
|
+
```cypher
|
|
89
|
+
MERGE (i:Insight {insightId: $insightId})
|
|
90
|
+
ON CREATE SET
|
|
91
|
+
i.summary = $summary,
|
|
92
|
+
i.kind = $kind,
|
|
93
|
+
i.confidence = $confidence,
|
|
94
|
+
i.accountId = $accountId,
|
|
95
|
+
i.source = 'whatsapp',
|
|
96
|
+
i.createdByAgent = 'whatsapp-import',
|
|
97
|
+
i.createdBySession = $sessionId,
|
|
98
|
+
i.createdAt = datetime(),
|
|
99
|
+
i.scope = 'admin'
|
|
100
|
+
WITH i, $messageIds AS mids
|
|
101
|
+
UNWIND mids AS mid
|
|
102
|
+
MATCH (m:Message {messageId: mid})
|
|
103
|
+
MERGE (i)-[d:DERIVED_FROM]->(m)
|
|
104
|
+
ON CREATE SET d.createdAt = datetime()
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
This Cypher runs through `mcp__memory__memory-write` (single-node + relationships payload), not through `memory-archive-write`. The agent does not author this Cypher directly — the schema-aware writer translates the structured payload.
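
Before handing an insight to the writer, the stated field constraints can be checked locally. The sketch below is an assumption-laden illustration: `InsightDraft` and `validate_insight` are hypothetical, and the real `memory-write` payload shape is not shown in this document.

```python
from dataclasses import dataclass

MAX_SUMMARY_CHARS = 200  # "one sentence, <=200 chars"


@dataclass(frozen=True)
class InsightDraft:
    summary: str
    kind: str
    confidence: float
    message_ids: tuple  # the SPECIFIC supporting messages


def validate_insight(draft):
    """Return a list of problems; an empty list means the draft is usable."""
    problems = []
    if not draft.summary.strip() or len(draft.summary) > MAX_SUMMARY_CHARS:
        problems.append("summary must be 1-200 characters")
    if not draft.kind.strip():
        problems.append("kind must be a non-empty label")
    if not 0.0 <= draft.confidence <= 1.0:
        problems.append("confidence must be between 0 and 1")
    if not draft.message_ids:
        problems.append("at least one supporting message id is required")
    return problems
```

Requiring at least one message id is an inference from the `:DERIVED_FROM` rule above, not a documented constraint.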
|
|
108
|
+
|
|
109
|
+
## What pass 2 does NOT do
|
|
110
|
+
|
|
111
|
+
- **Does not summarise the conversation.** A `:KnowledgeDocument` summary is the wrong shape per `feedback_archives_are_not_documents.md`.
|
|
112
|
+
- **Does not score sentiment** per message. Sentiment is noise at the per-message level. Conversation-level emotional tone may show up as an `:Insight` if genuinely interesting.
|
|
113
|
+
- **Does not aggregate across conversations.** Each export ingests in isolation. Cross-conversation patterns are a query-time concern using the `createdBySession` provenance index.
|
|
114
|
+
- **Does not invent identifiers.** A `:MENTIONS` candidate without a `memory-search` hit is dropped, never auto-minted as a placeholder Person.
|
|
115
|
+
|
|
116
|
+
## Operator-side discipline
|
|
117
|
+
|
|
118
|
+
The operator may interrupt at any prompt during pass 2 and ask "skip the insight pass, just commit the messages." The skill obeys — no retroactive insight scan, no "let me extract a few quick ones first." Pass 1's Conversation+Messages is independent and complete on its own. The insight pass is opt-out at any prompt.
|
|
@@ -107,6 +107,7 @@ The classifier maps document sections to typed ontology labels. It does not inve
|
|
|
107
107
|
Per-source archive imports keep their own skill because their CSVs already encode entity types deterministically and need no LLM classifier. Currently shipped:
|
|
108
108
|
|
|
109
109
|
- **linkedin-import** — LinkedIn Basic Data Export. Ships with references for `Profile.csv` and `Connections.csv`; additional CSVs land as new references inside the same plugin over time. Path: `platform/plugins/linkedin-import/skills/linkedin-import/SKILL.md`. Load via `plugin-read` before any ingestion.
|
|
110
|
+
- **whatsapp-import** — WhatsApp `_chat.txt` export ingestion. Imports historical Conversation + Messages with chronological NEXT chain via `memory-archive-write` (archiveType=`whatsapp-export`), then derives typed insights (mentions, preferences, commitments, observed relationships) inline through existing memory tools. Distinct from the live `whatsapp` plugin (Baileys QR pairing, in-memory store). Path: `platform/plugins/whatsapp-import/skills/whatsapp-import/SKILL.md`. Load via `plugin-read` before any ingestion.
|
|
110
111
|
|
|
111
112
|
Future CRM-type seed plugins (HubSpot, Salesforce, Pipedrive, iCloud contacts, Gmail CSV, etc.) will ship under the same pattern — each as its own opt-in plugin, each with its own `SKILL.md` path under `platform/plugins/<name>/skills/`. When the admin adds a new archive-import skill, its PLUGIN.md will name itself here and in the admin's `<plugin-manifest>`. No prompt change required.
|
|
112
113
|
|