@rubytech/create-realagent 1.0.828 → 1.0.830
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +1 -1
- package/payload/platform/config/brand.json +1 -1
- package/payload/platform/lib/oauth-llm/dist/index.d.ts +1 -1
- package/payload/platform/lib/oauth-llm/dist/index.d.ts.map +1 -1
- package/payload/platform/lib/oauth-llm/dist/index.js +21 -0
- package/payload/platform/lib/oauth-llm/dist/index.js.map +1 -1
- package/payload/platform/lib/oauth-llm/src/index.ts +24 -0
- package/payload/platform/neo4j/migrations/007-conversation-archive-source.ts +116 -0
- package/payload/platform/neo4j/schema.cypher +12 -2
- package/payload/platform/package.json +2 -2
- package/payload/platform/plugins/admin/hooks/__tests__/archive-ingest-surface-gate.test.sh +6 -6
- package/payload/platform/plugins/admin/hooks/archive-ingest-surface-gate.sh +14 -8
- package/payload/platform/plugins/admin/skills/onboarding/SKILL.md +2 -2
- package/payload/platform/plugins/contacts/mcp/dist/index.js +5 -5
- package/payload/platform/plugins/contacts/mcp/dist/index.js.map +1 -1
- package/payload/platform/plugins/contacts/mcp/dist/tools/contact-create.d.ts +1 -1
- package/payload/platform/plugins/contacts/mcp/dist/tools/contact-create.d.ts.map +1 -1
- package/payload/platform/plugins/contacts/mcp/dist/tools/contact-create.js +29 -23
- package/payload/platform/plugins/contacts/mcp/dist/tools/contact-create.js.map +1 -1
- package/payload/platform/plugins/docs/references/plugins-guide.md +1 -1
- package/payload/platform/plugins/memory/PLUGIN.md +6 -5
- package/payload/platform/plugins/{whatsapp-import/bin/ingest.mjs → memory/bin/conversation-archive-ingest.mjs} +136 -212
- package/payload/platform/plugins/{whatsapp-import/bin/whatsapp-ingest.sh → memory/bin/conversation-archive-ingest.sh} +27 -19
- package/payload/platform/plugins/memory/mcp/dist/index.js +26 -212
- package/payload/platform/plugins/memory/mcp/dist/index.js.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/lib/__tests__/llm-classifier.test.js +4 -3
- package/payload/platform/plugins/memory/mcp/dist/lib/__tests__/llm-classifier.test.js.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/lib/__tests__/schema-loader.test.js +11 -6
- package/payload/platform/plugins/memory/mcp/dist/lib/__tests__/schema-loader.test.js.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/lib/__tests__/schema-validator.test.js +103 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/__tests__/schema-validator.test.js.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-normalisers/index.d.ts +5 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-normalisers/index.d.ts.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-normalisers/index.js +30 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-normalisers/index.js.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-normalisers/types.d.ts +48 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-normalisers/types.d.ts.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-normalisers/types.js +23 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-normalisers/types.js.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-normalisers/whatsapp-text.d.ts +3 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-normalisers/whatsapp-text.d.ts.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-normalisers/whatsapp-text.js +237 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-normalisers/whatsapp-text.js.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/delta-cursor.d.ts +11 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/delta-cursor.d.ts.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/delta-cursor.js +21 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/delta-cursor.js.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/derive-keys.d.ts +16 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/derive-keys.d.ts.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/derive-keys.js +39 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/derive-keys.js.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/sender-bind.d.ts +17 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/sender-bind.d.ts.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/sender-bind.js +90 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/sender-bind.js.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/sessionize.d.ts +9 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/sessionize.d.ts.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/sessionize.js +32 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/sessionize.js.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/to-turn-text.d.ts +3 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/to-turn-text.d.ts.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/to-turn-text.js +27 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/conversation-pipeline/to-turn-text.js.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/document-chunker.d.ts +45 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/document-chunker.d.ts.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/document-chunker.js +125 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/document-chunker.js.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/lib/llm-classifier.d.ts +24 -1
- package/payload/platform/plugins/memory/mcp/dist/lib/llm-classifier.d.ts.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/lib/llm-classifier.js +293 -33
- package/payload/platform/plugins/memory/mcp/dist/lib/llm-classifier.js.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/lib/llm-ranker.d.ts.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/lib/llm-ranker.js +9 -2
- package/payload/platform/plugins/memory/mcp/dist/lib/llm-ranker.js.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/lib/schema-validator.d.ts +16 -1
- package/payload/platform/plugins/memory/mcp/dist/lib/schema-validator.d.ts.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/lib/schema-validator.js +12 -3
- package/payload/platform/plugins/memory/mcp/dist/lib/schema-validator.js.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/conversation-normalisers-source-agnosticism.test.d.ts +2 -0
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/conversation-normalisers-source-agnosticism.test.d.ts.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/conversation-normalisers-source-agnosticism.test.js +75 -0
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/conversation-normalisers-source-agnosticism.test.js.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/conversation-normalisers-whatsapp-text.test.d.ts +2 -0
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/conversation-normalisers-whatsapp-text.test.d.ts.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/conversation-normalisers-whatsapp-text.test.js +67 -0
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/conversation-normalisers-whatsapp-text.test.js.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/memory-archive-write.test.js +2 -138
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/memory-archive-write.test.js.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/memory-ingest.test.js +39 -3
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/memory-ingest.test.js.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/profile-update-personfields-open.test.d.ts +2 -0
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/profile-update-personfields-open.test.d.ts.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/profile-update-personfields-open.test.js +148 -0
- package/payload/platform/plugins/memory/mcp/dist/tools/__tests__/profile-update-personfields-open.test.js.map +1 -0
- package/payload/platform/plugins/memory/mcp/dist/tools/memory-archive-write.d.ts +1 -47
- package/payload/platform/plugins/memory/mcp/dist/tools/memory-archive-write.d.ts.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/tools/memory-archive-write.js +9 -318
- package/payload/platform/plugins/memory/mcp/dist/tools/memory-archive-write.js.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/tools/memory-ingest.d.ts +7 -0
- package/payload/platform/plugins/memory/mcp/dist/tools/memory-ingest.d.ts.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/tools/memory-ingest.js +14 -8
- package/payload/platform/plugins/memory/mcp/dist/tools/memory-ingest.js.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/tools/profile-update.d.ts +21 -17
- package/payload/platform/plugins/memory/mcp/dist/tools/profile-update.d.ts.map +1 -1
- package/payload/platform/plugins/memory/mcp/dist/tools/profile-update.js +77 -37
- package/payload/platform/plugins/memory/mcp/dist/tools/profile-update.js.map +1 -1
- package/payload/platform/plugins/memory/references/schema-base.md +3 -1
- package/payload/platform/plugins/{whatsapp-import/skills/whatsapp-import → memory/skills/conversation-archive}/SKILL.md +45 -36
- package/payload/platform/plugins/memory/skills/document-ingest/SKILL.md +59 -6
- package/payload/platform/plugins/whatsapp/PLUGIN.md +1 -1
- package/payload/platform/scripts/seed-neo4j.sh +9 -8
- package/payload/platform/templates/specialists/agents/database-operator.md +7 -14
- package/payload/server/chunk-7BO5HDJC.js +10093 -0
- package/payload/server/chunk-CUSH3UXP.js +2305 -0
- package/payload/server/chunk-EL4DZ56X.js +1116 -0
- package/payload/server/chunk-IWNDVGKT.js +10077 -0
- package/payload/server/chunk-KC7NUABI.js +654 -0
- package/payload/server/chunk-QOJ2D26Z.js +654 -0
- package/payload/server/chunk-RC46ZYGT.js +2305 -0
- package/payload/server/chunk-WUVXPZIV.js +1116 -0
- package/payload/server/client-pool-3TM3SRIA.js +32 -0
- package/payload/server/client-pool-7NTEFNVQ.js +32 -0
- package/payload/server/cloudflare-task-tracker-4NIODMGL.js +19 -0
- package/payload/server/cloudflare-task-tracker-WE77WXSI.js +19 -0
- package/payload/server/maxy-edge.js +3 -3
- package/payload/server/neo4j-migrations-4XPNJNM6.js +490 -0
- package/payload/server/neo4j-migrations-XTQ4WEV6.js +428 -0
- package/payload/server/server.js +6 -6
- package/payload/platform/plugins/whatsapp-import/PLUGIN.md +0 -48
- package/payload/platform/plugins/whatsapp-import/lib/src/__tests__/delta-append.test.ts +0 -163
- package/payload/platform/plugins/whatsapp-import/lib/src/__tests__/parse-export-lrm.test.ts +0 -83
- package/payload/platform/plugins/whatsapp-import/lib/src/__tests__/parse-export.test.ts +0 -678
- package/payload/platform/plugins/whatsapp-import/lib/src/__tests__/sessionize.test.ts +0 -91
- package/payload/platform/plugins/whatsapp-import/lib/src/__tests__/to-classifier-input.test.ts +0 -59
- package/payload/platform/plugins/whatsapp-import/lib/src/delta-cursor.ts +0 -54
- package/payload/platform/plugins/whatsapp-import/lib/src/derive-keys.ts +0 -82
- package/payload/platform/plugins/whatsapp-import/lib/src/index.ts +0 -22
- package/payload/platform/plugins/whatsapp-import/lib/src/parse-export.ts +0 -471
- package/payload/platform/plugins/whatsapp-import/lib/src/sessionize.ts +0 -81
- package/payload/platform/plugins/whatsapp-import/lib/src/to-classifier-input.ts +0 -48
- package/payload/platform/plugins/whatsapp-import/lib/tsconfig.json +0 -9
- package/payload/platform/plugins/whatsapp-import/lib/vitest.config.ts +0 -9
- package/payload/platform/plugins/whatsapp-import/skills/whatsapp-import/references/conversation-archive-shape.md +0 -143
- package/payload/platform/plugins/whatsapp-import/skills/whatsapp-import/references/export-parse.md +0 -109
@@ -1,43 +1,52 @@
 ---
-name:
-description:
+name: conversation-archive
+description: Source-agnostic ingest for conversation transcripts (Task 894). One skill for WhatsApp `_chat.txt`, Telegram, Signal, LinkedIn DMs, Zoom transcript, meeting minutes, iMessage, Slack, and any future source — `source` is a property on `:ConversationArchive`, never a separate skill or plugin. Confirms the owner + every distinct sender against existing `:AdminUser` / `:Person` nodes (no auto-creation), then invokes the deterministic Bash entry `conversation-archive-ingest.sh` with `--source <enum>`. The script normalises the source (per-source pluggable function), sessionizes at gap-hours boundaries, classifies each session via Haiku (mode='chat') into `:Section:Conversation` chunks with summary + topic keywords, and writes them under a parent `:ConversationArchive` MERGEd on `conversationIdentity = sha256(accountId + ":" + sortedParticipantElementIds)`. Re-imports are delta-append: prior chunks are never touched; only messages after `lastIngestedMessageHash` flow through the pipeline. Triggers when the operator drops any conversation export (file or directory) into chat.
 ---

-#
+# Conversation Archive — source-agnostic transcript ingestion

-
+One skill, one bash entry, one writer. The pipeline is identical for every conversation source. The only source-specific code is the per-source normaliser (a pluggable function under [`platform/plugins/memory/mcp/src/lib/conversation-normalisers/`](../../mcp/src/lib/conversation-normalisers/)). New sources = a new normaliser file + an enum entry, never a new skill.
+
+## Confirmed sources (Phase 0)
+
+`whatsapp` ships today (relocated grammar from the retired `whatsapp-import` plugin). Other sources land as needed:
+- `telegram` — Telegram JSON export
+- `signal` — Signal text export
+- `linkedin-messages` — LinkedIn DM thread export (LinkedIn *connections* stay flat-dataset; only DMs route here)
+- `zoom-transcript` — Zoom .vtt / .txt transcript
+- `meeting-minutes` — operator-supplied meeting notes
+- `imessage` — iMessage backup export
+- `slack` — Slack channel export
+
+Until a normaliser ships, the bash entry loud-fails with the implemented set listed in the error.

 ## Owner + all-participants confirmation (mandatory first step)

-A
+A conversation transcript carries N distinct senders (1 owner + N-1 others). Every distinct senderName must resolve to an existing `:AdminUser` or `:Person` elementId before the script runs — the writer LOUD-FAILs on any unresolved sender.

 The flow:

-1. **
-
-
-"filePath": "/abs/path/to/_chat.txt",
-"timezone": "Europe/London"
-}
-```
-Returns counters + the sender histogram. Surface to the operator as one chat message — counters and the histogram, no prose.
+1. **Identify the source.** Look at the file the operator dropped — `_chat.txt` ⇒ `whatsapp`; `*.json` from a Telegram export ⇒ `telegram`; `.vtt` ⇒ `zoom-transcript`; `.txt` of formatted meeting notes ⇒ `meeting-minutes`; etc. If unsure, ask the operator one question to disambiguate.
+
+2. **Read the file head** to discover senders before the script runs. Read the first ~50 lines (or use a per-source preview heuristic). Extract distinct sender display names. The operator never sees the script's own counters — surface a one-line histogram (`Sender '<name>' (<count> messages)`) per distinct sender.

-
+3. **List candidate `:AdminUser` and existing `:Person`** rows for the senders via `mcp__graph__maxy-graph-read_neo4j_cypher`.

-
+4. **Iterate the histogram, one operator question per distinct senderName.** For each sender: `"Sender '<name>' (<count> messages) — pick existing :AdminUser/:Person or block?"`. The operator either picks an existing elementId or names "block" (refuses to map to a node). **Never auto-create a `:Person`** — the operator must confirm a canonical node, mirroring `feedback_archives_are_not_documents.md`'s closed-set discipline.

-
+5. **Identify the owner** from the resolved set — the operator who exported the chat. Echo back: `"Owner = :AdminUser <name> (<elementId>); other participants = <list>. Confirm yes/no."`

-
+6. **Persist the resolved IDs** as `--owner-element-id` + `--participant-person-ids <csv>` for the script call.

-DM and group follow the identical flow. A 1:1 chat resolves 2 senders; a group resolves N senders. Identity (`conversationIdentity = sha256(accountId + ":" + sortedParticipantElementIds.join(","))`) is identical regardless of group size — DM and group use the same MERGE key.
+DM and group follow the identical flow. A 1:1 chat resolves 2 senders; a group resolves N senders. Identity (`conversationIdentity = sha256(accountId + ":" + sortedParticipantElementIds.join(","))`) is identical regardless of group size — DM and group use the same MERGE key. Source format (whatsapp / telegram / slack) does not affect identity.

 ## Step 2 — invoke the script

 Single Bash call:

 ```bash
-bash platform/plugins/
+bash platform/plugins/memory/bin/conversation-archive-ingest.sh <source-path> \
+--source <whatsapp|telegram|signal|linkedin-messages|zoom-transcript|meeting-minutes|imessage|slack|other> \
 --owner-element-id <id> \
 --participant-person-ids <id1>,<id2>,... \
 --scope <admin|public>
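
The MERGE key named in the hunk above, `conversationIdentity = sha256(accountId + ":" + sortedParticipantElementIds.join(","))`, compresses to a few lines. A minimal sketch using Node's `node:crypto`; the helper name and the plain lexicographic sort are assumptions for illustration, not the package's shipped code.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper illustrating the MERGE key described above:
// conversationIdentity = sha256(accountId + ":" + sortedParticipantElementIds.join(","))
// The function name and the plain lexicographic sort are assumptions, not the shipped code.
export function deriveConversationIdentity(
  accountId: string,
  participantElementIds: string[], // owner plus every other confirmed participant
): string {
  const sorted = [...participantElementIds].sort(); // order-independent: DM and group share one shape
  return createHash("sha256")
    .update(`${accountId}:${sorted.join(",")}`)
    .digest("hex");
}
```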
@@ -46,27 +55,25 @@ bash platform/plugins/whatsapp-import/bin/whatsapp-ingest.sh <archive.zip|dir|_c

 Optional flags:
 - `--account-id <id>` — explicit account id when more than one exists under `data/accounts/` (Phase 0 has one).
 - `--timezone <iana>` — IANA zone for timestamps (default `Europe/London`).
-- `--date-format <DD/MM/YY|MM/DD/YY|DD/MM/YYYY|MM/DD/YYYY>` — override auto-detect for ambiguous locales.
+- `--date-format <DD/MM/YY|MM/DD/YY|DD/MM/YYYY|MM/DD/YYYY>` — WhatsApp only; override auto-detect for ambiguous locales.
 - `--session-gap-hours <N>` — gap threshold (in hours) used to split parsed messages into sessions for chunking (default `12`). Smaller values produce more sessions, more Haiku calls, finer chunks; larger values group more messages per session.

 The script:
--
-- Parses the file deterministically (year shape, sender/body grammar, timezone offset, U+200E/U+200F bidi-strip).
-- Computes the source file's `archiveSha256` (provenance + cleanup discriminator).
+- Picks the normaliser for `--source`. WhatsApp: locates `_chat.txt` (zip / dir / direct file), parses deterministically, computes `archiveSha256`. Other sources interpret the path according to their own format.
 - Validates every distinct parsed senderName against the closed set of `{owner, participants...}` candidate names. Any miss LOUD-FAILs `parser-miss reason="senderName=<verbatim> not in confirmed participant set ..."`.
 - Computes `conversationIdentity` from accountId + sorted participant elementIds.
 - Looks up any prior `:ConversationArchive` carrying that identity → reads `lastIngestedMessageHash`. If found, slices parsed lines after the cursor (delta-append). Cursor not found → `FAIL delta-cursor-missing`. Cursor at last line → empty-delta noop (exit 0, no writes).
 - Sessionizes the delta lines at the operator-supplied gap-hours boundary.
-- For each session: renders as turn-attributed text (`[ts] Sender: body\n…`) and calls `memory-classify` with `mode='chat'
-- Calls `memory-ingest` with `parentLabel='ConversationArchive'
+- For each session: renders as turn-attributed text (`[ts] Sender: body\n…`) and calls `memory-classify` with `mode='chat'` (the classifier sees turn-attributed text only — never the source format). Returns one or more `:Section:Conversation` chunk specs.
+- Calls `memory-ingest` with `parentLabel='ConversationArchive'` and `source=<enum>`. Server MERGEs the parent on `conversationIdentity`, MERGEs `:PARTICIPANT_IN` edges from each confirmed participant, drops any prior chunks stamped with this `archiveSha256` (idempotency for re-running the same export bytes), CREATEs new chunks, extends the existing `:NEXT` chain from its tail, advances `lastIngestedMessageHash` + `lastIngestedMessageAt`. Every node and edge stamps `source=<enum>` and `createdByAgent='conversation-archive'`.

-NO insight pass runs. Phase 2 (operator-driven `:Observation` / `:Task` / `:Preference` derivation against chunks) is its own follow-up task
+NO insight pass runs. Phase 2 (operator-driven `:Observation` / `:Task` / `:Preference` derivation against chunks) is its own follow-up task — and applies uniformly to every source once chunks exist.

 ## Three operator messages per ingest

 After the script succeeds, formulate the three operator-facing messages from the JSON summary on stdout (one operator message per surfaceable phase):

-1. **Parse summary.** `Parsed <archiveSourceFile
+1. **Parse summary.** `Parsed <archiveSourceFile> (source=<enum>): <parsed> messages across <sessions> sessions, date range <dateRange.first> → <dateRange.last>. Participants: <senderHistogram[i].name (count), …>.`
 2. **Classify summary.** `Classified into <chunks> chunks, covering: <topicKeywords[0], topicKeywords[1], …>.`
 3. **Write summary.** `Created :ConversationArchive <archiveElementId> with <chunks> :Section:Conversation chunks (NEXT chain length <chunks - 1>). Participants linked via :PARTICIPANT_IN: <participantsLinked>.`
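
The delta-cursor and sessionize bullets in the hunk above describe two small deterministic steps. A sketch under stated assumptions: the message type, the per-message hash recipe behind `lastIngestedMessageHash`, and every name here are illustrative, not the package's implementation.

```typescript
import { createHash } from "node:crypto";

// Illustrative type for the normalised transcript; names are assumptions.
interface NormalisedMessage {
  timestamp: string; // ISO-8601
  sender: string;
  body: string;
}

// One stable hash per message so a re-export can be sliced after the last ingested one.
// The exact hash recipe behind lastIngestedMessageHash is an assumption here.
function messageHash(m: NormalisedMessage): string {
  return createHash("sha256")
    .update(`${m.timestamp}\u0000${m.sender}\u0000${m.body}`)
    .digest("hex");
}

// Delta-append: keep only messages after the stored cursor; a missing cursor is a loud failure.
function sliceAfterCursor(
  messages: NormalisedMessage[],
  lastIngestedMessageHash: string | null,
): NormalisedMessage[] {
  if (!lastIngestedMessageHash) return messages; // first ingest: everything flows through
  const idx = messages.findIndex((m) => messageHash(m) === lastIngestedMessageHash);
  if (idx === -1) throw new Error("FAIL phase=delta-cursor-missing");
  return messages.slice(idx + 1); // empty array means the empty-delta noop
}

// Sessionize: start a new session whenever the gap between consecutive messages exceeds gapHours.
function sessionize(messages: NormalisedMessage[], gapHours = 12): NormalisedMessage[][] {
  const sessions: NormalisedMessage[][] = [];
  for (const m of messages) {
    const last = sessions.at(-1)?.at(-1);
    const gapMs = last ? Date.parse(m.timestamp) - Date.parse(last.timestamp) : Infinity;
    if (!last || gapMs > gapHours * 3_600_000) sessions.push([m]);
    else sessions.at(-1)!.push(m);
  }
  return sessions;
}
```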
@@ -80,6 +87,7 @@ For an empty-delta re-import (`delta.kind === "empty-delta"`): emit only message
 "conversationIdentity": "<sha256-hex>",
 "archiveSha256": "<sha256-hex>",
 "archiveSourceFile": "_chat.txt",
+"source": "whatsapp",
 "parsed": 1707,
 "mediaSkipped": 0,
 "systemSkipped": 0,
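
For orientation, the stdout summary this hunk extends can be read as roughly the following shape; this is a best-effort reading of the documented fields, not a published schema.

```typescript
// Best-effort reading of the stdout JSON summary, gathering the fields visible in
// this hunk and the field names used by the three operator-message templates above.
// Optionality is guessed; this is not the package's declared type.
interface IngestSummary {
  conversationIdentity: string; // sha256 hex
  archiveSha256: string;        // sha256 hex of the export bytes
  archiveSourceFile: string;    // e.g. "_chat.txt"
  source: string;               // e.g. "whatsapp"
  parsed: number;               // messages parsed
  mediaSkipped: number;
  systemSkipped: number;
  sessions?: number;            // referenced by the parse-summary template
  chunks?: number;              // referenced by the classify and write summaries
  participantsLinked?: number;  // referenced by the write summary
}
```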
@@ -97,28 +105,29 @@ For an empty-delta re-import (`delta.kind === "empty-delta"`): emit only message

 ## Failure path — single FAIL line

-- **Exit non-zero** + one stderr line: `[
+- **Exit non-zero** + one stderr line: `[conversation-archive] FAIL phase=<argv|parse|classify|delta-cursor-missing|memory-ingest|uncaught> reason="..."`. Surface this verbatim to the operator and yield. **Do not retry.**
 - `parser-miss` LOUD-FAIL: an unconfirmed senderName slipped through. Either re-run with the missing :Person elementId added to `--participant-person-ids`, or report a parser bug.
 - `delta-cursor-missing` LOUD-FAIL: the prior `lastIngestedMessageHash` is not present in the re-export. Either the operator deleted prior messages from the archive, or this is a different chat. Investigation required — never re-run blindly.

 ## Idempotency

-- **Re-running the same export bytes** is a no-op: the cleanup-by-`archiveSha256` step drops THIS export's prior chunks and re-creates them with identical content. `:NEXT` chain length unchanged.
-- **Re-running with appended messages** (a fresh export from the same chat with new messages at the tail): cursor lookup finds the prior `lastIngestedMessageHash`, slices new messages, sessionizes only those, and appends new chunks at the tail of the existing `:NEXT` chain. Pre-existing chunks are never touched
+- **Re-running the same export bytes** is a no-op: the cleanup-by-`archiveSha256` step drops THIS export's prior chunks and re-creates them with identical content. `:NEXT` chain length unchanged.
+- **Re-running with appended messages** (a fresh export from the same chat with new messages at the tail): cursor lookup finds the prior `lastIngestedMessageHash`, slices new messages, sessionizes only those, and appends new chunks at the tail of the existing `:NEXT` chain. Pre-existing chunks are never touched.

 ## Verification (post-write)

 Run via `mcp__graph__maxy-graph-read_neo4j_cypher`:

-- `MATCH (a:ConversationArchive { conversationIdentity: $cid }) RETURN elementId(a), a.lastIngestedMessageAt, a.lastIngestedMessageHash` —
+- `MATCH (a:ConversationArchive { conversationIdentity: $cid }) RETURN elementId(a), a.source, a.lastIngestedMessageAt, a.lastIngestedMessageHash` — `source` matches `--source`; counters agree with the JSON summary.
 - `MATCH (a:ConversationArchive { conversationIdentity: $cid })-[:HAS_SECTION]->(c:Section:Conversation) RETURN count(c)` — equals `chunks`.
-- `MATCH
-- `MATCH (p)-[:PARTICIPANT_IN]->(:ConversationArchive { conversationIdentity: $cid }) RETURN count(p)` — equals `participantsLinked` after a first-ingest
+- `MATCH (a:ConversationArchive { source: $source }) RETURN count(a)` — uses the `(accountId, source)` index.
+- `MATCH (p)-[:PARTICIPANT_IN]->(:ConversationArchive { conversationIdentity: $cid }) RETURN count(p)` — equals `participantsLinked` after a first-ingest.
 - Phase 1 wrote ZERO observations: `MATCH (o:Observation)-[:OBSERVED_IN]->(:ConversationArchive { conversationIdentity: $cid }) RETURN count(o)` — should be 0 today (Phase 2 deferred).

 ## What this is not

-- **Not** the live `whatsapp` plugin. That plugin (Baileys QR pairing) holds messages in an in-memory store cleared on restart. This
+- **Not** the live `whatsapp` plugin. That plugin (Baileys QR pairing) holds messages in an in-memory store cleared on restart. This skill imports historical exports into Neo4j as persistent graph nodes.
 - **Not** a media-transcription pipeline. Voice notes, photos, PDFs are skipped at parse with a counter logged.
-- **Not** an insight-extraction pass. Phase 2 (`:Observation` / `:Task` / `:Preference` / `:MENTIONS` derivation, anchored to chunks) ships in its own task.
-- **Not**
+- **Not** an insight-extraction pass. Phase 2 (`:Observation` / `:Task` / `:Preference` / `:MENTIONS` derivation, anchored to chunks) ships in its own task and applies uniformly to every source.
+- **Not** an archive-wide importer. Flat datasets (LinkedIn connections, CRM exports) are first-class entities + natural edges, not `:ConversationArchive` — `feedback_archives_are_not_documents.md` stands.
+- **Not** automatic. The owner + all-participants confirmation gate is mandatory before any line is written.
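
The idempotency contract in the hunk above (drop this export's prior chunks by `archiveSha256`, re-create them, never touch earlier chunks) can be pictured with the official `neo4j-driver`. A hedged, simplified sketch; the Cypher and chunk property names are illustrative, and the `:NEXT` tail extension plus cursor update are deliberately omitted.

```typescript
import type { Driver } from "neo4j-driver";

// Simplified sketch of the cleanup-by-archiveSha256 write described under
// "Idempotency" above. The Cypher, the chunk property names, and the omission of
// the :NEXT tail extension and cursor update are illustrative choices, not the
// writer's real queries.
export async function writeConversationChunks(
  driver: Driver,
  cid: string,
  archiveSha256: string,
  chunks: { summary: string; text: string }[],
): Promise<void> {
  const session = driver.session();
  try {
    await session.executeWrite(async (tx) => {
      // Re-running the same export bytes: drop only the chunks stamped with this archiveSha256.
      await tx.run(
        `MATCH (a:ConversationArchive { conversationIdentity: $cid })
               -[:HAS_SECTION]->(c:Section:Conversation { archiveSha256: $sha })
         DETACH DELETE c`,
        { cid, sha: archiveSha256 },
      );
      // MERGE the parent on its identity, then CREATE the new chunks in order.
      await tx.run(
        `MERGE (a:ConversationArchive { conversationIdentity: $cid })
         WITH a
         UNWIND range(0, size($chunks) - 1) AS i
         CREATE (c:Section:Conversation { archiveSha256: $sha, seq: i })
         SET c += $chunks[i]
         CREATE (a)-[:HAS_SECTION]->(c)`,
        { cid, sha: archiveSha256, chunks },
      );
      // Extending the existing :NEXT chain from its tail and advancing
      // lastIngestedMessageHash / lastIngestedMessageAt are left out for brevity.
    });
  } finally {
    await session.close();
  }
}
```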
@@ -1,11 +1,27 @@
 ---
 name: document-ingest
-description: Universal document ingestion — maps any unstructured document (PDF, text, transcript, web page) to ontologically-grounded
+description: Universal document ingestion — maps any unstructured document (PDF, text, transcript, web page) OR chat archive (WhatsApp `_chat.txt`) to ontologically-grounded graph nodes via Haiku-driven classification. Triggers when the operator uploads or fetches a document or chat export for ingestion. One skill for every input shape — no per-doctype, no per-channel branching.
 ---

 # Document Ingest

-Ingests any unstructured
+Ingests any unstructured input — documents (PDF, text, transcript, web page) and chat archives (WhatsApp `_chat.txt`) — into the graph. Every classified section becomes one `:Section` node. **Two parent shapes, one pipeline:**
+
+| Input shape | Parent label | Section secondary label | Identity property | mode |
+|---|---|---|---|---|
+| PDF / text / web (default) | `:KnowledgeDocument` | `:Section:<Kind>` from closed enumeration (`Position`, `Chapter`, `Parties`, …) | `attachmentId` | `document` |
+| Chat archive (`_chat.txt`) | `:ConversationArchive` | `:Section:Conversation` | `conversationIdentity` | `chat` |
+
+The classifier in `memory-classify` decides which section kinds each section maps to (document mode) or chunks the archive into topic-bounded `:Section:Conversation` nodes (chat mode). The skill orchestrates the pipeline; the classifier reads the loaded ontology; the writer enforces the validator. **Classifier failure is terminal — the ingest aborts entirely; nothing is written. Loud failures, never silent landfill.**
+
+## Routing — chat vs document (mandatory first decision)
+
+Before anchor confirmation, decide the parent shape from the input:
+
+- **Chat archive** — input filename ends in `_chat.txt`, the dispatch brief names the input as a WhatsApp chat / messaging-channel export, or the operator labels it as such. Set `mode='chat'` and `parentLabel='ConversationArchive'`. Skip anchor confirmation; run participant confirmation instead (see § Participant confirmation). The classifier produces `:Section:Conversation` chunks; no anchor edges, no related entities.
+- **Document** — everything else. Set `mode='document'` and `parentLabel='KnowledgeDocument'` (or omit — these are the defaults). Run the anchor confirmation flow below.
+
+Both branches go through the same three tools (`memory-ingest-extract` → `memory-classify` → `memory-ingest`); only the parameters differ.

 ## Anchor confirmation (mandatory first step)

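
The routing rule the new section above introduces (chat archive versus document) amounts to one conditional. A tiny hypothetical sketch, not the skill's literal logic:

```typescript
// Tiny illustration of the routing rule above. The helper name and the exact
// signals it checks are assumptions for the sketch, not the skill's literal logic.
type IngestRoute =
  | { mode: "chat"; parentLabel: "ConversationArchive" }
  | { mode: "document"; parentLabel: "KnowledgeDocument" };

function routeInput(filename: string, briefSaysChatExport: boolean): IngestRoute {
  const isChatArchive = filename.endsWith("_chat.txt") || briefSaysChatExport;
  return isChatArchive
    ? { mode: "chat", parentLabel: "ConversationArchive" }
    : { mode: "document", parentLabel: "KnowledgeDocument" };
}
```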
@@ -27,6 +43,26 @@ The confirmation flow:
 3. Run a one-shot graph read to resolve the anchor's element ID. For UserProfile: `MATCH (u:UserProfile {accountId: $accountId}) RETURN elementId(u) AS anchorId, 'UserProfile' AS anchorLabel`. For LocalBusiness: `MATCH (b:LocalBusiness {accountId: $accountId}) RETURN elementId(b) AS anchorId, 'LocalBusiness' AS anchorLabel`. For a third party: search by name via `memory-search` and pick the matching node.
 4. Persist `$anchorNodeId` and `$anchorLabel` for the rest of the run. These flow into both `memory-classify` (as part of the `anchorDescription`) and `memory-ingest` (as the `anchorNodeId` + `anchorLabel` parameters).

+## Participant confirmation (chat mode only)
+
+Chat archives are multi-party — no single subject anchor. Instead, every distinct sender name in the archive must resolve to an existing `:AdminUser` or `:Person` elementId before any classify or ingest call. No auto-creation; missing participants are blockers, not silent skips.
+
+The confirmation flow:
+
+1. Read the dispatch brief. Extract the archive path and any operator-stated participant identities.
+2. Read a small sample of the archive (head ~50 lines) via the `Read` tool to discover the distinct sender names that appear at line starts after the bracketed-timestamp prefix.
+3. For every distinct senderName, search the graph via `memory-search` (or a one-shot `MATCH (n) WHERE (n:Person OR n:AdminUser) AND n.accountId = $accountId AND (n.name = $name OR (n.givenName + ' ' + n.familyName) = $name) RETURN elementId(n)`).
+4. **Resolved fully** — every senderName mapped to exactly one elementId. Capture the owner's elementId (the operator who exported the archive — usually the `:AdminUser` for this account; ask if ambiguous) and the comma-separated list of remaining participant elementIds.
+5. **Unresolved** — at least one senderName has no matching node. Surface to the operator: *"Archive `<filename>` mentions sender `<name>` but no `:Person` / `:AdminUser` matches. Create the contact first or correct the name, then re-dispatch."* Do NOT proceed.
+6. **Ambiguous** — a senderName matches multiple nodes. Ask the operator which one. Do NOT proceed until disambiguated.
+
+Persist `$ownerElementId` and `$participantElementIds` (array, owner excluded) for the run. These flow into `memory-ingest` as the `participantElementIds` parameter (owner + others, deduped).
+
+Compute archive metadata before classify:
+- `archiveSha256` — `bash sha256sum "<file>" | cut -d' ' -f1`. Stamped on the parent + every chunk.
+- `archiveSourceFile` — the basename (e.g. `_chat.txt`).
+- `conversationIdentity` — pass as the `attachmentId` parameter to `memory-ingest`. Format: `chat:<sha256(accountId + ":" + sortedParticipantElementIds)>` where `sortedParticipantElementIds` is the sorted-then-comma-joined list of `[owner, ...participants]`. Same conversation across re-exports → same identity → idempotent MERGE on the `:ConversationArchive`.
+
 ## Pipeline

 Four steps in order. Steps 1–3 are deterministic tool calls (the agent does not classify; the agent calls the classifier tool). Step 4 is agent-driven graph writes against the existing graph, gated by the dispatch brief's named entity list. Hallucination defence and ontology validation stay server-side in `memory-classify` and the `memory-write` validator.
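
The three metadata bullets at the end of the hunk above reduce to a short computation. A sketch assuming Node's `node:crypto`, `node:fs/promises`, and `node:path`; the function name and signature are illustrative.

```typescript
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";
import { basename } from "node:path";

// Sketch of the archive metadata computed before classify, following the bullet
// list above. The "chat:" prefix and the sort-then-comma-join rule come from that
// list; the function name and signature are assumptions.
async function archiveMetadata(
  filePath: string,
  accountId: string,
  participantElementIds: string[], // [owner, ...participants]
) {
  const archiveSha256 = createHash("sha256").update(await readFile(filePath)).digest("hex");
  const sorted = [...participantElementIds].sort().join(",");
  const conversationIdentity =
    "chat:" + createHash("sha256").update(`${accountId}:${sorted}`).digest("hex");
  return { archiveSha256, archiveSourceFile: basename(filePath), conversationIdentity };
}
```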
@@ -62,17 +98,26 @@ After extract returns, look at the document's char count and emit a one-line siz
 | <5K | ~10s |
 | 5K–10K | ~20s |
 | 10K–20K | ~45–90s |
-
+| 20K–525K | ~90–180s (single-shot) |
+| >525K | chunked: ~90–180s × ⌈chars / 525K⌉ |
+
+For inputs >525K chars memory-classify enters the chunked path automatically: deterministic 525K-char chunks with 17.5K-char overlap, one Haiku call per chunk, server-side merge of same-kind boundary straddlers. Inputs >682K chars in chat mode loud-fail with `cause:"input-too-large"` — the chat path must keep sessions under the per-call ceiling via sessionize.

 Form: "Classifying `<filename>` (`<N>` chars) — expect ~`<estimate>`."

 ### 2. `memory-classify`

-Calls Haiku with the loaded ontology and the cached text.
+Calls Haiku with the loaded ontology and the cached text.
+
+**Document mode (default).** Inputs: `attachmentId` (same one), `anchorDescription` (a short sentence built from the confirmed anchor — e.g. `"subject = UserProfile (the account owner); edges from UserProfile."` or `"subject = LocalBusiness {name: 'Acme Roofing'} (the operator's business); edges from LocalBusiness."`).
+
+**Chat mode.** Inputs: `attachmentId` (same one), `mode='chat'`, `anchorDescription` (a short sentence naming the conversation — e.g. `"WhatsApp conversation between Joel and Adam (2 participants)"`). The chat prompt drops the natural-edge map, the closed enumeration, and the orphan logic — Haiku produces topic-bounded `:Section:Conversation` chunks with `summary`, `keywords`, `firstMessageAt`, `lastMessageAt`, `participantNames`, `messageCount` per chunk. The whole archive may produce one chunk (short conversation) or many (long chat with topic transitions); chunks cover every message in chronological order with no gaps.
+
+Returns:

 - `documentSummary` — 1-3 sentences for the KnowledgeDocument node
 - `documentKeywords` — 3-10 lowercase topic keywords
-- `sections` — section-shaped output, each with `kind` (from the closed enumeration), `title`, `body
+- `sections` — section-shaped output, each with `kind` (from the closed enumeration), `title`, `body` (server-reconstructed via `documentText.slice(sourceStart, sourceEnd)` — the LLM emits offsets, not body text), `summary` (≤500 chars, server-validated), `sourceStart`/`sourceEnd` (whole-document char offsets — translated by the chunked-classify path so the writer always sees whole-doc coordinates), `properties`, `anchorEdge` (or null), optional `related`, optional `classifierReason` (when `kind === "Other"`)
 - `documentEdges` — optional. Document-level edges off the KnowledgeDocument, picked by document shape: `PARTY` for contract parties (Person/Organization); `PARTICIPANT` for meeting/call attendees (Person/Organization); `FROM`/`TO`/`CC` for email header parties (Person, plus Organization for `TO`/`CC`); `SPEAKER` for voice-note/single-speaker transcript speakers (Person). Anchor stays the document subject from step 1; these edges are additional document→party links the writer applies off the KnowledgeDocument, not off any Section
 - `orphanCandidates` — nodes the classifier wanted to emit but could not natural-edge. Surfaced loudly to the operator (see Output contract below)
 - `hallucinatedRelated` — diagnostic count of related-kind values dropped server-side as ontology misses
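
The chunked-classify behaviour added above (525K-char windows, 17.5K-char overlap, whole-document offsets) can be sketched as a deterministic sliding window. The constants come from the text; everything else is an assumption.

```typescript
// Deterministic sliding window matching the description above: 525K-char chunks
// with 17.5K-char overlap, offsets kept in whole-document coordinates. The
// constants come from the text; the function shape is an assumption.
const CHUNK_CHARS = 525_000;
const OVERLAP_CHARS = 17_500;

interface DocumentChunk {
  text: string;
  sourceStart: number; // whole-document offset of the chunk's first char
  sourceEnd: number;   // exclusive end offset
}

function chunkDocument(documentText: string): DocumentChunk[] {
  const chunks: DocumentChunk[] = [];
  let start = 0;
  while (start < documentText.length) {
    const end = Math.min(start + CHUNK_CHARS, documentText.length);
    chunks.push({ text: documentText.slice(start, end), sourceStart: start, sourceEnd: end });
    if (end === documentText.length) break;
    start = end - OVERLAP_CHARS; // overlap so boundary-straddling sections appear in both chunks
  }
  return chunks;
}
```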
@@ -83,7 +128,13 @@ After step 2 succeeds, emit a chat message before step 3 naming what the classif

 ### 3. `memory-ingest`

-Writes the classified document
+Writes the classified document or chat archive.
+
+**Document mode (default).** Inputs: `attachmentId`, `documentSummary`, `anchorNodeId`, `anchorLabel`, `sections`, `documentEdges` (pass through if present), `orphanCandidates` (pass through if present), `scope` (from the brief — confirm with the operator if absent), optional `documentKeywords`, `userKeywords`, `sourceUrl`, `sourceType`.
+
+**Chat mode.** Inputs: `attachmentId` (set to `conversationIdentity`), `parentLabel='ConversationArchive'`, `documentSummary`, `sections` (the chunks from chat-mode classify), `scope`, plus the chat-archive metadata: `archiveSha256` (cleanup discriminator), `archiveSourceFile` (audit), `participantElementIds` (owner + others, for `:PARTICIPANT_IN` edges). Pass `anchorNodeId` and `anchorLabel` as any non-empty placeholder (e.g. the owner's elementId + `'AdminUser'`) — they are unused on the chat path but the parameter is non-optional. The writer MERGEs `:ConversationArchive { conversationIdentity }`, drops any chunks stamped with this `archiveSha256` (idempotent re-ingest), CREATEs new chunks chained by `:NEXT`, and MERGEs `:PARTICIPANT_IN` edges from each participant.
+
+Returns:

 - `documentNodeId`, `sectionCount`
 - `kindBreakdown` — per-kind count, e.g. `{"Position": 4, "Chapter": 12, "Other": 1}`
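
The chat-mode parameter list added above reads more easily as a single shape. A hedged sketch of the call input; the field names come from the prose, the interface itself is illustrative rather than the tool's published schema.

```typescript
// Hedged sketch of the chat-mode memory-ingest call shape. Field names come from
// the prose above; this interface is illustrative, not the tool's published schema.
interface ChatModeIngestInput {
  attachmentId: string;            // set to the conversationIdentity
  parentLabel: "ConversationArchive";
  documentSummary: string;
  sections: unknown[];             // the chunks returned by chat-mode memory-classify
  scope: "admin" | "public";
  archiveSha256: string;           // cleanup discriminator
  archiveSourceFile: string;       // audit, e.g. "_chat.txt"
  participantElementIds: string[]; // owner + others, deduped; drives :PARTICIPANT_IN edges
  anchorNodeId: string;            // non-optional but unused on the chat path; any non-empty placeholder
  anchorLabel: string;             // e.g. "AdminUser"
}
```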
@@ -99,6 +150,8 @@ Re-ingesting the same `attachmentId` is safe — the writer drops prior `:Sectio

 ### 4. `wire-brief-entities`

+**Skipped in chat mode** — `:ConversationArchive` does not carry KD-level brief-wired edges; participants are already attached via `:PARTICIPANT_IN` and message bodies stay verbatim inside chunk text (mention extraction is deferred to a separate insight-derivation task). Document mode only.
+
 After `memory-ingest` returns the new KnowledgeDocument's `documentNodeId`, this step iterates the entities the dispatch brief named and connects each to the new document with the natural KD-level edge.

 **Entity sources.** The dispatch brief's "key entities to connect" list. Brief shape: prose names of Persons, Organizations, Services, Tasks, Events, KnowledgeDocuments, BrandingData that the document describes or references. Example: *"Person nodes for Joel Smalley, Adam Mackay, Dan McLeod; LocalBusiness / Organization nodes for Real Agent / Real Agency; Any existing Task nodes related to Real Agent Lettings."* Extract every named entity from the brief before any `memory-write`.
@@ -51,7 +51,7 @@ When per-group activation is `mention`, the agent fires only if the inbound mess

 ## Live persistence

-Every `messages.upsert` event (both `notify` and `append`, both `fromMe` directions) writes a `:Message:WhatsAppMessage` row to Neo4j attached to the sessionKey-keyed `:Conversation`. A single capture site at `platform/ui/app/lib/whatsapp/manager.ts` covers inbound, outbound (Baileys echoes agent-sent messages back through `messages.upsert` with `fromMe=true`), and owner-mirror — without touching `outbound/send.ts`. `messageId` namespace is `whatsapp-live:<waName>:<remoteJid>:<msg.key.id>` where `<waName>` is the Baileys credential dirname (e.g. `default`);
+Every `messages.upsert` event (both `notify` and `append`, both `fromMe` directions) writes a `:Message:WhatsAppMessage` row to Neo4j attached to the sessionKey-keyed `:Conversation`. A single capture site at `platform/ui/app/lib/whatsapp/manager.ts` covers inbound, outbound (Baileys echoes agent-sent messages back through `messages.upsert` with `fromMe=true`), and owner-mirror — without touching `outbound/send.ts`. `messageId` namespace is `whatsapp-live:<waName>:<remoteJid>:<msg.key.id>` where `<waName>` is the Baileys credential dirname (e.g. `default`); distinct from the `:Section:Conversation` chunks written by the source-agnostic `conversation-archive` skill — live and archive live in disjoint label spaces. Persist failures are loud (`[whatsapp-persist] FAIL …`) and never block dispatch — silent loss is the worse failure mode.

 **`accountId` contract.** `n.accountId` on every `:Conversation`, `:Person`, and `:Message:WhatsAppMessage` row stamped by this plugin is the **platform-side UUID** resolved by [`resolvePlatformAccountId()`](../../ui/app/lib/whatsapp/platform-account-id.ts) from `data/accounts/<uuid>/account.json` — NOT the Baileys credential dirname (which is only used as the `messageId`/`sessionKey` namespace token). The boot-time line `[whatsapp-persist] resolved-account-id waname=<dir> uuid=<uuid>` records the resolution. Doctrine: see `.docs/neo4j.md` "Account isolation invariant" — migration 004 `pruneAlienAccounts` `DETACH DELETE`s any node whose `accountId` is not a UUID dir on every boot. The helper loud-throws on zero or multi accounts (Phase 0 single-account invariant), aborting the WhatsApp connection start before any write can occur.
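
Two concrete rules sit in the changed paragraph above: the `messageId` namespace template and the loud-but-non-blocking persist contract. A small sketch; `persistMessage` and the surrounding shape are placeholders, not the plugin's real API.

```typescript
// Sketch of the messageId namespace and the "loud but never blocking" persist
// contract described above. persistMessage is a placeholder, not the plugin's API.
function buildMessageId(waName: string, remoteJid: string, keyId: string): string {
  return `whatsapp-live:${waName}:${remoteJid}:${keyId}`;
}

async function persistUpsert(persistMessage: () => Promise<void>): Promise<void> {
  try {
    await persistMessage();
  } catch (err) {
    // Loud in the logs, but dispatch continues: silent loss is the worse failure mode.
    console.error(`[whatsapp-persist] FAIL ${err instanceof Error ? err.message : String(err)}`);
  }
}
```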
@@ -98,15 +98,16 @@ fi
 # webfetch-preflight.mjs: detects JS-SPA shells before WebFetch's
 # 60s extraction timeout (Task 536). Fail-open on any error;
 # on positive SPA detection exits 2 with WEBFETCH_CANNOT_READ_JS_SPA.
-# archive-ingest-surface-gate.sh (Task 855
+# archive-ingest-surface-gate.sh (Task 855 → Task 894):
 # narrows the database-operator subagent's effective surface
-# during
-# (
-#
-#
-#
-#
-#
+# during conversation-archive ingestion to exactly one Bash
+# entry (memory/bin/conversation-archive-ingest.sh --source
+# <enum>) plus read-only neighbours. The Task-894 generalised
+# ingest path replaces the WhatsApp-specific Task-855 path.
+# Defensive blocks remain on the deleted legacy MCP tool names
+# (mcp__memory__whatsapp-export-{parse,preview,insight-write,
+# insight-pass}, mcp__memory__memory-archive-write when
+# archiveType=whatsapp-export). Preserves the plugin-source edit,
 # JS test-runner, and post-parse-error blocks (LinkedIn and
 # future per-source archive parsers still use the legacy MCP
 # path until they migrate to deterministic Bash entries).
@@ -3,7 +3,7 @@ name: database-operator
 description: "Document and archive ingestion and ad-hoc graph operations — running the universal `document-ingest` skill for any unstructured document (PDF, text, transcript, web page, audio, video) and per-source archive-import skills (LinkedIn Basic Data Export today; CRM-type seed archives as each plugin ships), plus operator-driven graph hygiene (prune orphans, deduplicate entities, add edges, normalise labels). Delegate when the operator uploads any document, drops an archive directory into chat, or asks for any graph operation that is not a routine per-turn write."
 summary: "Ingests every unstructured document and external archive into your graph (LinkedIn today; other CRM sources in future) and handles ad-hoc graph tidy-ups on request. For example, when you upload a CV, a pricing guide, or a contract; when you drop a LinkedIn export folder into chat; or when you ask to prune orphan nodes, merge duplicate people, or add edges between entities."
 model: claude-sonnet-4-6
-tools: Read, Bash, Glob, Grep, mcp__graph__maxy-graph-read_neo4j_cypher, mcp__graph__maxy-graph-write_neo4j_cypher, mcp__graph__maxy-graph-get_neo4j_schema, mcp__memory__memory-write, mcp__memory__memory-update, mcp__memory__memory-delete, mcp__memory__memory-search, mcp__memory__memory-rank, mcp__memory__memory-reindex, mcp__memory__memory-find-candidates, mcp__memory__memory-ingest, mcp__memory__memory-ingest-extract, mcp__memory__memory-ingest-web, mcp__memory__memory-classify, mcp__memory__memory-archive-write,
+tools: Read, Bash, Glob, Grep, mcp__graph__maxy-graph-read_neo4j_cypher, mcp__graph__maxy-graph-write_neo4j_cypher, mcp__graph__maxy-graph-get_neo4j_schema, mcp__memory__memory-write, mcp__memory__memory-update, mcp__memory__memory-delete, mcp__memory__memory-search, mcp__memory__memory-rank, mcp__memory__memory-reindex, mcp__memory__memory-find-candidates, mcp__memory__memory-ingest, mcp__memory__memory-ingest-extract, mcp__memory__memory-ingest-web, mcp__memory__memory-classify, mcp__memory__memory-archive-write, mcp__memory__graph-prune-denylist-list, mcp__memory__graph-prune-denylist-add, mcp__memory__graph-prune-denylist-remove, mcp__contacts__contact-create, mcp__contacts__contact-update, mcp__contacts__contact-lookup, mcp__contacts__contact-list, mcp__tasks__task-create, mcp__admin__file-attach, mcp__admin__plugin-read
 ---

 # Database Operator
@@ -32,15 +32,15 @@ The pre-publish gate (`platform/scripts/verify-skill-tool-surface.sh`) staticall

 **Archive-ingest surface gate.** Each per-source archive importer ships a single deterministic Bash entry under `platform/plugins/<name>/bin/<name>-ingest.sh`. The harness-level gate at `platform/plugins/admin/hooks/archive-ingest-surface-gate.sh` enforces the surface filter that makes the LLM mechanically incapable of deviating mid-ingest:

-- **
-- **
+- **Conversation transcripts (any source) ingest via the `conversation-archive` skill.** The deterministic Bash entry (`platform/plugins/memory/bin/conversation-archive-ingest.sh --source <enum>`) is the only supported path; normalise (per source), sessionize, classify (mode='chat'), and memory-ingest (parentLabel='ConversationArchive') all run in-process. Stale references to the retired `mcp__memory__whatsapp-export-{parse,preview,insight-write}` tools are not exposed by the harness — invoking them returns "tool not found".
+- **Flat-dataset archiveTypes flow unchanged:** `memory-archive-write` with `archiveType=linkedin-connections` (and future flat-dataset archiveTypes like CRM exports) is allowed. Conversation-shaped sources (WhatsApp, Telegram, Signal, Slack, …) NEVER use `memory-archive-write` — they go through the conversation-archive skill.
 - **Plugin-source edits blocked:** `Edit`/`Write`/`NotebookEdit` against `platform/plugins/*/lib/*` is denied. The operator does not own plugin source.
 - **JS test runners blocked** (preserved): `vitest` / `bun test` / `npm test` / `npx jest` Bash commands are denied. The operator does not run plugin tests.
-- **Post-parse-error flag** (preserved
+- **Post-parse-error flag** (preserved): when any `mcp__*__*-export-parse` / `mcp__*__*-import-parse` tool returns `isError: true`, every subsequent tool call this turn is blocked until the operator submits a new prompt.

-Every PreToolUse decision emits `[archive-ingest-gate] decision=<allow|block> tool=<n> reason=<r> ...` to server.log so the full trail of one ingest is greppable alongside the `[
+Every PreToolUse decision emits `[archive-ingest-gate] decision=<allow|block> tool=<n> reason=<r> ...` to server.log so the full trail of one ingest is greppable alongside the `[conversation-archive]` script lines.

-*Failure symptoms (now harness-blocked):*
+*Failure symptoms (now harness-blocked):* calling `mcp__memory__memory-archive-write` with a conversation-shaped `archiveType`, editing a normaliser source ("`whatsapp-text.ts`") to "fix" a malformed input, running `npx vitest` to "diagnose" a parser. Treat these blocks as confirmation the gate is doing its job — invoke the script, surface its FAIL line if it fails, and yield. There is no around-the-block path.

 ---

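
The gate rules and the `[archive-ingest-gate]` log format described above can be mirrored as one decision function. The real gate is a shell hook; this TypeScript sketch is partial (the plugin-source path check is omitted) and every name in it is an assumption.

```typescript
// Illustrative mirror of the gate's allow/block decision and its log-line format,
// based only on the rules quoted above. The real gate is a shell hook; this rule
// set is partial and the function shape is an assumption.
function gateDecision(
  tool: string,
  postParseErrorThisTurn: boolean,
): { allow: boolean; reason: string } {
  if (postParseErrorThisTurn) return { allow: false, reason: "post-parse-error" };
  if (/^mcp__memory__whatsapp-export-(parse|preview|insight-write|insight-pass)$/.test(tool))
    return { allow: false, reason: "legacy-whatsapp-mcp-tool" };
  if (tool === "Edit" || tool === "Write" || tool === "NotebookEdit")
    return { allow: false, reason: "plugin-source-edit" }; // path check omitted in this sketch
  return { allow: true, reason: "default-allow" };
}

function logDecision(tool: string, d: { allow: boolean; reason: string }): void {
  console.log(`[archive-ingest-gate] decision=${d.allow ? "allow" : "block"} tool=${tool} reason=${d.reason}`);
}
```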
@@ -119,14 +119,7 @@ The classifier maps document sections to typed ontology labels. It does not inve
 Per-source archive imports keep their own skill because their CSVs already encode entity types deterministically and need no LLM classifier. Currently shipped:

 - **linkedin-import** — LinkedIn Basic Data Export. Ships with references for `Profile.csv` and `Connections.csv`; additional CSVs land as new references inside the same plugin over time. Path: `platform/plugins/linkedin-import/skills/linkedin-import/SKILL.md`. Load via `plugin-read` before any ingestion.
-- **
-1. **Preview** via `mcp__memory__whatsapp-export-preview` — read-only parse that returns `{conversationSha256, parsed, mediaSkipped, systemSkipped, dateRange, senders:[{name,messageCount}], totalMessages, archiveBytes}`. No Cypher writes.
-2. **Operator confirms owner + every distinct sender.** Iterate the preview's sender histogram one question at a time: for each sender, ask the operator to pick an existing `:AdminUser`/`:Person` elementId or block. Auto-creating participants is forbidden. Identify the owner from the resolved set, then echo back the `{owner, participants...}` pair for explicit yes/no confirmation.
-3. **Archive-write** via `bash platform/plugins/whatsapp-import/bin/whatsapp-ingest.sh <archive> --owner-element-id <id> --participant-person-ids <id1>,<id2>,... --scope <admin|public>`. Parses, sessionizes the parsed messages at gap-hours boundaries (default 12h), classifies each session via Haiku (`memory-classify` with `mode='chat'`) into topic-bounded `:Section:Conversation` chunks, and writes them under a parent `:ConversationArchive` MERGEd on `conversationIdentity = sha256(accountId + ":" + sortedParticipantElementIds)`. Re-imports are delta-append: prior chunks never touched; only messages after `lastIngestedMessageHash` flow through the pipeline. Writer is bound to the operator-confirmed sender set — any parsed senderName outside that closed set LOUD-FAILs `parser-miss reason="..."`.
-
-Surface to the operator as three chat messages built from the JSON summary on stdout: (a) parse summary with sender histogram + date range; (b) classify summary with chunk count + topic keywords; (c) write summary with `:ConversationArchive` elementId + chunk count + NEXT-chain length + participants linked. Empty-delta re-imports collapse to step (a) plus a noop line. The legacy `mcp__memory__whatsapp-export-parse` / `whatsapp-export-insight-write` / `whatsapp-export-insight-pass` / `memory-archive-write{archiveType:whatsapp-export}` MCP tools are blocked at the harness; the Bash script is the only supported invocation surface. SKILL: `platform/plugins/whatsapp-import/skills/whatsapp-import/SKILL.md`. Phase 2 insight derivation (`:Observation` / `:Task` / `:Preference` / `:MENTIONS` against chunks) is deferred to a separate follow-up task with its own skill.
-
-Distinct from the live `whatsapp` plugin (Baileys QR pairing, in-memory store).
+- **conversation-archive** — Conversation transcripts (any source) ingest via `conversation-archive` skill (Task 894 — supersedes the WhatsApp-specific Task 891 plugin). One skill, one bash entry, one writer, with `--source <enum>` selecting the per-source normaliser (`whatsapp`, `telegram`, `signal`, `linkedin-messages`, `zoom-transcript`, `meeting-minutes`, `imessage`, `slack`, `other`). Phase 0 ships only `whatsapp`; other normalisers land per-source. Pipeline: operator confirms owner + every distinct sender → `bash platform/plugins/memory/bin/conversation-archive-ingest.sh <archive> --source <enum> --owner-element-id <id> --participant-person-ids <id1>,<id2>,... --scope <admin|public>`. The script normalises (per source), sessionizes at gap-hours boundaries (default 12h), classifies each session via Haiku (`memory-classify` with `mode='chat'`) into topic-bounded `:Section:Conversation` chunks, and writes them under a parent `:ConversationArchive` MERGEd on `conversationIdentity = sha256(accountId + ":" + sortedParticipantElementIds)`. Re-imports are delta-append; the writer is bound to the operator-confirmed sender set (parser-miss = LOUD-FAIL). SKILL: `platform/plugins/memory/skills/conversation-archive/SKILL.md`. Phase 2 insight derivation against chunks is a separate follow-up task. Distinct from the live `whatsapp` plugin (Baileys QR pairing, in-memory store).

 Future CRM-type seed plugins (HubSpot, Salesforce, Pipedrive, iCloud contacts, Gmail CSV, etc.) will ship under the same pattern — each as its own opt-in plugin, each with its own `SKILL.md` path under `platform/plugins/<name>/skills/`. When the admin adds a new archive-import skill, its PLUGIN.md will name itself here and in the admin's `<plugin-manifest>`. No prompt change required.