memex-mvp 0.8.1 → 0.9.0
- package/HELP.md +18 -3
- package/README.md +1 -1
- package/README.ru.md +2 -1
- package/lib/parse-telegram-html.js +358 -0
- package/package.json +2 -2
- package/server.js +109 -4
- package/skills/install-memex/SKILL.md +24 -7
- package/skills/install-memex/examples.md +13 -13
package/HELP.md
CHANGED
|
@@ -390,9 +390,24 @@ npx memex-sync scan
|
|
|
390
390
|
|
|
391
391
|
### Telegram chats don't show up
|
|
392
392
|
|
|
393
|
-
|
|
394
|
-
|
|
395
|
-
|
|
393
|
+
In Telegram **Desktop** (not mobile!): chat → menu (⋮ or ☰) → **Export chat history**.
|
|
394
|
+
|
|
395
|
+
**memex supports both export formats (v0.9+):**
|
|
396
|
+
|
|
397
|
+
- **"Machine-readable JSON"** ← recommended. A single `result.json` file that ingests cleanly.
|
|
398
|
+
- **"HTML"** ← also works (since v0.9). You get a `ChatExport_<chat>_<date>/` directory: drop the **whole** directory into the inbox and memex parses it automatically.
|
|
399
|
+
|
|
400
|
+
Steps:
|
|
401
|
+
1. Pick a format (JSON is the shorter path; HTML if you have already exported).
|
|
402
|
+
2. Save it to **`~/.memex/inbox/`**: for JSON that is `result.json`, for HTML the entire `ChatExport_…` directory.
|
|
403
|
+
3. Memex picks it up automatically within ~1.5 seconds. After ingest, the file/directory is moved to `~/.memex/data/conversations/telegram/` (or `…/telegram-html/`).
|
|
404
|
+
|
|
405
|
+
If nothing was ingested, check `~/.memex/data/memex.log`. For HTML exports the parser logs an actionable error when something is wrong (Telegram changed the format, the directory is damaged, etc.).
|
|
406
|
+
|
|
407
|
+
**What does NOT work:**
|
|
408
|
+
- ❌ Mobile export (Telegram mobile does not produce a structured export)
|
|
409
|
+
- ❌ Chat screenshots
|
|
410
|
+
- ❌ Media-only chats (messages without text are parsed as `[photo]`/`[voice]` placeholders, with no content)
|
|
396
411
|
|
|
397
412
|
### I want to connect a new Obsidian vault
|
|
398
413
|
|
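The Telegram export steps in the HELP.md hunk above come down to two accepted drop shapes: a `result.json` file or a whole `ChatExport_…` directory. The sketch below is a hypothetical pre-flight check a user script might run before copying something into the inbox. `classifyInboxEntry` and `archiveDirs` are illustrative names, not part of memex, and the check looks only at names, whereas memex itself inspects the HTML for Telegram markers.

```javascript
// Hypothetical pre-flight check, not part of memex: guess how an inbox entry
// would be treated, based only on its name and whether it is a directory.
function classifyInboxEntry(name, isDirectory) {
  if (isDirectory) {
    // HTML exports arrive as a whole ChatExport_<chat>_<date>/ directory.
    return name.startsWith('ChatExport_') ? 'telegram-html' : 'unknown';
  }
  // JSON exports are a single result.json file.
  if (name.toLowerCase() === 'result.json') return 'telegram-json';
  // A bare messages.html / messages2.html chunk is the rare single-file case.
  if (/^messages\d*\.html?$/i.test(name)) return 'telegram-html';
  return 'unknown';
}

// Archive locations named in this diff, keyed by detected format.
const archiveDirs = {
  'telegram-json': '~/.memex/data/conversations/telegram/',
  'telegram-html': '~/.memex/data/conversations/telegram-html/',
};

const samples = [
  ['result.json', false],
  ['ChatExport_2024-01-01', true],
  ['screenshot.png', false],
];
for (const [name, isDir] of samples) {
  const kind = classifyInboxEntry(name, isDir);
  console.log(name, '->', kind, archiveDirs[kind] ?? '(not a Telegram export)');
}
```

A name-based check like this is only a convenience for the person dropping files; it should not be read as the rule memex applies.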
package/README.md
CHANGED
|
@@ -172,7 +172,7 @@ Perplexity threads need to be made **Public** in the Share dialog first — meme
|
|
|
172
172
|
| Cursor IDE chats | Auto: reads Cursor's local SQLite session store |
|
|
173
173
|
| Continue / Zed | Auto: filesystem watchers per platform |
|
|
174
174
|
| Obsidian notes | Auto: per-vault markdown watcher |
|
|
175
|
-
| Telegram exports | Manual: drop `result.json` (Telegram Desktop) into `~/.memex/inbox
|
|
175
|
+
| Telegram exports | Manual: drop `result.json` (Telegram Desktop → Export → JSON) into `~/.memex/inbox/`. v0.9+: also accepts the **HTML** export directory (`ChatExport_*/`) — drop the whole folder, memex parses it. |
|
|
176
176
|
| Telegram (live) | Run [`memex-bot`](bot/README.md) — captures messages you send/forward to your private bot |
|
|
177
177
|
| **Web pages, AI chat shares, pasted text** | From any MCP agent: *"save https://... to memex"*. Agent fetches; memex stores verbatim. Cloudflare-protected pages (Perplexity, npm.com, Twitter, Medium, …) handled via the agent's r.jina.ai fallback. See [HELP.md §8](HELP.md) |
|
|
178
178
|
|
package/README.ru.md
CHANGED
|
@@ -224,7 +224,8 @@ which node # → path to the node binary (e.g. /Users
|
|
|
224
224
|
| **Claude Cowork** | `cowork-*.jsonl` (via filename prefix), including subagents | ✅ works |
|
|
225
225
|
| **Cursor IDE** (Composer + Chat) | SQLite `state.vscdb` in `~/Library/Application Support/Cursor/` | ✅ works (polled every 5 min) |
|
|
226
226
|
| **Obsidian** vault notes | `.md` files + YAML frontmatter | ✅ works (FSEvents, hash-based dedupe) |
|
|
227
|
-
| **Telegram** | `result.json` from Desktop export | ✅ works |
|
|
227
|
+
| **Telegram (JSON export)** | `result.json` from Desktop export | ✅ works |
|
|
228
|
+
| **Telegram (HTML export)** | `ChatExport_*` directory from Desktop (v0.9+) | ✅ works: drop the whole folder into the inbox |
|
|
228
229
|
| **Telegram (live)** | the `memex-bot` bot captures your messages / forwards | ✅ works |
|
|
229
230
|
| **Web pages, AI shares, pastes** | `memex_store_document`: the agent fetches, memex stores verbatim (v0.6+) | ✅ works |
|
|
230
231
|
| Claude.ai web export | planned for v0.7 | — |
|
|
package/lib/parse-telegram-html.js
ADDED
|
@@ -0,0 +1,358 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* Telegram Desktop HTML export → Telegram-JSON-shape converter.
|
|
3
|
+
*
|
|
4
|
+
* Telegram Desktop offers two export formats:
|
|
5
|
+
* - "Machine-readable JSON" — what memex's importTelegram expects
|
|
6
|
+
* - "Human-readable HTML" — what many users pick by default
|
|
7
|
+
*
|
|
8
|
+
* Users frequently export as HTML by accident (often the default in the
|
|
9
|
+
* Telegram UI), then memex's inbox watcher silently ignores the dropped
|
|
10
|
+
* directory. This module makes HTML work: parse → emit the same shape
|
|
11
|
+
* importTelegram already understands.
|
|
12
|
+
*
|
|
13
|
+
* Telegram's HTML export is reasonably stable:
|
|
14
|
+
*
|
|
15
|
+
* ChatExport_<chat-title>_<date>/
|
|
16
|
+
* ├── messages.html (or messages.htm — chunked: messages2, messages3, …)
|
|
17
|
+
* ├── photos/
|
|
18
|
+
* ├── files/
|
|
19
|
+
* ├── stickers/
|
|
20
|
+
* └── voice_messages/
|
|
21
|
+
*
|
|
22
|
+
* Each messages*.html has structure:
|
|
23
|
+
*
|
|
24
|
+
* <div class="message default clearfix" id="message12345">
|
|
25
|
+
* <div class="body">
|
|
26
|
+
* <div class="from_name"> ↳ Sender Name </div> (may be absent on "joined" messages)
|
|
27
|
+
* <div class="text"> message text </div>
|
|
28
|
+
* <div class="pull_right date details" title="2024-01-01 14:23:45 UTC+03:00">14:23</div>
|
|
29
|
+
* </div>
|
|
30
|
+
* </div>
|
|
31
|
+
*
|
|
32
|
+
* Joined message = same sender as previous, has class "joined", no from_name.
|
|
33
|
+
* Service message = class "service" (joined chat, name change, …) — we skip these.
|
|
34
|
+
* Forwarded = "forwarded body" wrapping the message body.
|
|
35
|
+
* Reply = "reply_to details" sibling.
|
|
36
|
+
*
|
|
37
|
+
* We use regex-based parsing (no DOM dependency) because Telegram's class
|
|
38
|
+
* names are stable and we control which fields we care about. If Telegram
|
|
39
|
+
* radically changes the schema, parser breaks loudly (returns 0 messages
|
|
40
|
+
* + clear log) rather than silently corrupting.
|
|
41
|
+
*/
|
|
42
|
+
|
|
43
|
+
import { readFileSync, existsSync, readdirSync, statSync } from 'node:fs';
|
|
44
|
+
import { join, basename, dirname } from 'node:path';
|
|
45
|
+
|
|
46
|
+
/**
|
|
47
|
+
* Detect if a given path is a Telegram HTML export.
|
|
48
|
+
* Accepts both a directory (most common — ChatExport_xxx/) and a bare
|
|
49
|
+
* messages.html file (rare — user dropped just the one file).
|
|
50
|
+
*
|
|
51
|
+
* Returns { type: 'dir' | 'file' | null, htmlFiles: string[] }
|
|
52
|
+
* null type means "not a Telegram HTML export"
|
|
53
|
+
*/
|
|
54
|
+
export function detectTelegramHtml(path) {
|
|
55
|
+
if (!existsSync(path)) return { type: null, htmlFiles: [] };
|
|
56
|
+
const stats = statSync(path);
|
|
57
|
+
|
|
58
|
+
// Directory case: look for messages*.html inside
|
|
59
|
+
if (stats.isDirectory()) {
|
|
60
|
+
let entries = [];
|
|
61
|
+
try { entries = readdirSync(path); } catch (_) { return { type: null, htmlFiles: [] }; }
|
|
62
|
+
const htmlFiles = entries
|
|
63
|
+
.filter((f) => /^messages\d*\.html?$/i.test(f))
|
|
64
|
+
.map((f) => join(path, f));
|
|
65
|
+
if (htmlFiles.length === 0) return { type: null, htmlFiles: [] };
|
|
66
|
+
// Verify the first one contains Telegram-shaped markers
|
|
67
|
+
const head = safeReadHead(htmlFiles[0]);
|
|
68
|
+
if (!looksLikeTelegram(head)) return { type: null, htmlFiles: [] };
|
|
69
|
+
// Sort chunks: messages.html < messages2.html < messages3.html …
|
|
70
|
+
htmlFiles.sort(numericChunkSort);
|
|
71
|
+
return { type: 'dir', htmlFiles };
|
|
72
|
+
}
|
|
73
|
+
|
|
74
|
+
// Single file case: must be messages*.html
|
|
75
|
+
if (stats.isFile() && /\.html?$/i.test(path) && /messages\d*\.html?$/i.test(basename(path))) {
|
|
76
|
+
const head = safeReadHead(path);
|
|
77
|
+
if (!looksLikeTelegram(head)) return { type: null, htmlFiles: [] };
|
|
78
|
+
return { type: 'file', htmlFiles: [path] };
|
|
79
|
+
}
|
|
80
|
+
|
|
81
|
+
return { type: null, htmlFiles: [] };
|
|
82
|
+
}
|
|
83
|
+
|
|
84
|
+
function safeReadHead(file, bytes = 8192) {
|
|
85
|
+
try {
|
|
86
|
+
return readFileSync(file, 'utf-8').slice(0, bytes);
|
|
87
|
+
} catch (_) {
|
|
88
|
+
return '';
|
|
89
|
+
}
|
|
90
|
+
}
|
|
91
|
+
|
|
92
|
+
function looksLikeTelegram(head) {
|
|
93
|
+
// Reliable markers in Telegram Desktop HTML exports
|
|
94
|
+
return /class="page_wrap"/.test(head) ||
|
|
95
|
+
/class="page_body chat_page"/.test(head) ||
|
|
96
|
+
(/class="from_name"/.test(head) && /class="text"/.test(head));
|
|
97
|
+
}
|
|
98
|
+
|
|
99
|
+
function numericChunkSort(a, b) {
|
|
100
|
+
const numA = parseInt((basename(a).match(/messages(\d*)\.html?/i) || [, '0'])[1] || '0', 10);
|
|
101
|
+
const numB = parseInt((basename(b).match(/messages(\d*)\.html?/i) || [, '0'])[1] || '0', 10);
|
|
102
|
+
return numA - numB;
|
|
103
|
+
}
|
|
104
|
+
|
|
105
|
+
/**
|
|
106
|
+
* Strip HTML tags and decode common entities → plain text.
|
|
107
|
+
* Conservative: preserves newlines from <br>, paragraph breaks from </div>.
|
|
108
|
+
*/
|
|
109
|
+
function htmlToText(html) {
|
|
110
|
+
if (!html) return '';
|
|
111
|
+
let out = String(html);
|
|
112
|
+
// Convert breaks to newlines BEFORE stripping tags
|
|
113
|
+
out = out.replace(/<br\s*\/?>/gi, '\n');
|
|
114
|
+
out = out.replace(/<\/p>/gi, '\n\n');
|
|
115
|
+
out = out.replace(/<\/div>/gi, '\n');
|
|
116
|
+
// Drop all remaining tags
|
|
117
|
+
out = out.replace(/<[^>]+>/g, '');
|
|
118
|
+
// Decode common entities
|
|
119
|
+
out = out
|
|
120
|
+
.replace(/&nbsp;/g, ' ')
|
|
121
|
+
.replace(/&amp;/g, '&')
|
|
122
|
+
.replace(/&lt;/g, '<')
|
|
123
|
+
.replace(/&gt;/g, '>')
|
|
124
|
+
.replace(/&quot;/g, '"')
|
|
125
|
+
.replace(/&#39;/g, "'")
|
|
126
|
+
.replace(/&apos;/g, "'")
|
|
127
|
+
.replace(/&#x([0-9a-f]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
|
|
128
|
+
.replace(/&#(\d+);/g, (_, dec) => String.fromCodePoint(parseInt(dec, 10)));
|
|
129
|
+
// Collapse 3+ blank lines, trim
|
|
130
|
+
out = out.replace(/\n{3,}/g, '\n\n').trim();
|
|
131
|
+
return out;
|
|
132
|
+
}
|
|
133
|
+
|
|
134
|
+
/**
|
|
135
|
+
* Parse a Telegram date title into Unix timestamp.
|
|
136
|
+
* Format: "2024-01-01 14:23:45 UTC+03:00" (or "UTC-04:00", etc.)
|
|
137
|
+
* Returns { tsUnix, isoString } or null if unparseable.
|
|
138
|
+
*/
|
|
139
|
+
function parseTelegramDate(title) {
|
|
140
|
+
if (!title) return null;
|
|
141
|
+
const m = title.match(/^(\d{4})-(\d{2})-(\d{2})\s+(\d{2}):(\d{2}):(\d{2})(?:\s+UTC([+-])(\d{2}):(\d{2}))?$/);
|
|
142
|
+
if (!m) return null;
|
|
143
|
+
const [, y, mo, d, h, mi, s, sign, oh, om] = m;
|
|
144
|
+
// Construct an ISO 8601 string with the explicit offset (or UTC if absent)
|
|
145
|
+
const offset = sign ? `${sign}${oh}:${om}` : 'Z';
|
|
146
|
+
const iso = `${y}-${mo}-${d}T${h}:${mi}:${s}${offset}`;
|
|
147
|
+
const date = new Date(iso);
|
|
148
|
+
if (isNaN(date.getTime())) return null;
|
|
149
|
+
return {
|
|
150
|
+
tsUnix: Math.floor(date.getTime() / 1000),
|
|
151
|
+
isoString: iso.replace(/[+-]\d{2}:\d{2}$/, '').replace('Z', ''),
|
|
152
|
+
};
|
|
153
|
+
}
|
|
154
|
+
|
|
155
|
+
/**
|
|
156
|
+
* Parse a single message div (raw HTML segment).
|
|
157
|
+
* Returns null for service messages (we skip those) or messages with no text.
|
|
158
|
+
*/
|
|
159
|
+
function parseMessageDiv(messageHtml, lastSender) {
|
|
160
|
+
// Skip service messages outright
|
|
161
|
+
if (/class="message service\b/.test(messageHtml)) return null;
|
|
162
|
+
|
|
163
|
+
// Extract message id from outer div: id="message12345"
|
|
164
|
+
const idMatch = messageHtml.match(/id="message(\d+)"/);
|
|
165
|
+
const msgId = idMatch ? idMatch[1] : null;
|
|
166
|
+
if (!msgId) return null;
|
|
167
|
+
|
|
168
|
+
const isJoined = /class="message [^"]*joined/.test(messageHtml);
|
|
169
|
+
|
|
170
|
+
// Forwarded marker
|
|
171
|
+
const isForwarded = /class="forwarded body"/.test(messageHtml);
|
|
172
|
+
let forwardedFrom = null;
|
|
173
|
+
if (isForwarded) {
|
|
174
|
+
const fwdM = messageHtml.match(/class="forwarded[^"]*"[\s\S]*?<div class="from_name"[^>]*>\s*([\s\S]*?)\s*<\/div>/);
|
|
175
|
+
if (fwdM) {
|
|
176
|
+
forwardedFrom = htmlToText(fwdM[1]).replace(/^Forwarded from:?\s*/i, '').trim();
|
|
177
|
+
}
|
|
178
|
+
}
|
|
179
|
+
|
|
180
|
+
// Sender (from_name) — absent on joined messages
|
|
181
|
+
let fromName = null;
|
|
182
|
+
const fromM = messageHtml.match(/<div class="from_name"[^>]*>\s*([\s\S]*?)\s*<\/div>/);
|
|
183
|
+
if (fromM && !isForwarded) {
|
|
184
|
+
fromName = htmlToText(fromM[1]).trim();
|
|
185
|
+
}
|
|
186
|
+
// If joined, inherit lastSender; otherwise use parsed or fallback
|
|
187
|
+
if (!fromName && isJoined && lastSender) fromName = lastSender;
|
|
188
|
+
if (!fromName) fromName = 'Unknown';
|
|
189
|
+
|
|
190
|
+
// Date — title attribute on `.date.details`
|
|
191
|
+
let date = null;
|
|
192
|
+
const dateM = messageHtml.match(/class="[^"]*\bdate details[^"]*"\s+title="([^"]+)"/);
|
|
193
|
+
if (dateM) date = parseTelegramDate(dateM[1]);
|
|
194
|
+
|
|
195
|
+
// Main text — last `<div class="text">…</div>` inside body (forwards may have one earlier)
|
|
196
|
+
let text = '';
|
|
197
|
+
const textMatches = [...messageHtml.matchAll(/<div class="text"[^>]*>([\s\S]*?)<\/div>(?=\s*(?:<div class="(?!text)|<\/div>|<a class="|$))/g)];
|
|
198
|
+
if (textMatches.length > 0) {
|
|
199
|
+
// Use last one (the actual message body, after any quoted/forwarded preamble)
|
|
200
|
+
text = htmlToText(textMatches[textMatches.length - 1][1]);
|
|
201
|
+
}
|
|
202
|
+
|
|
203
|
+
// Reply marker — include as prefix so it's searchable but not lost
|
|
204
|
+
const replyM = messageHtml.match(/class="reply_to details"[^>]*>([\s\S]*?)<\/div>/);
|
|
205
|
+
if (replyM) {
|
|
206
|
+
const replyTxt = htmlToText(replyM[1]).replace(/^In reply to\s+/i, '').trim();
|
|
207
|
+
if (replyTxt) text = `↩ Reply: ${replyTxt}\n\n${text}`;
|
|
208
|
+
}
|
|
209
|
+
|
|
210
|
+
// Photo / media — if no text, note the media presence so the row isn't lost.
|
|
211
|
+
// Use word-boundary regexes since class attrs like "photo_wrap clearfix pull_left"
|
|
212
|
+
// wouldn't match a strict `class="photo_wrap"` pattern.
|
|
213
|
+
if (!text) {
|
|
214
|
+
if (/class="[^"]*\bphoto_wrap\b/.test(messageHtml)) text = '[photo]';
|
|
215
|
+
else if (/class="[^"]*\bmedia_voice_message\b/.test(messageHtml)) text = '[voice message]';
|
|
216
|
+
else if (/class="[^"]*\bmedia_video_file\b/.test(messageHtml)) text = '[video]';
|
|
217
|
+
else if (/class="[^"]*\bmedia_audio_file\b/.test(messageHtml)) text = '[audio]';
|
|
218
|
+
else if (/class="[^"]*\bmedia_file\b/.test(messageHtml)) text = '[file]';
|
|
219
|
+
else if (/class="[^"]*\bsticker\b/.test(messageHtml)) text = '[sticker]';
|
|
220
|
+
else return null; // Truly empty — skip
|
|
221
|
+
}
|
|
222
|
+
|
|
223
|
+
// Build the message object in the shape importTelegram expects
|
|
224
|
+
// (date and date_unixtime are required by the importer)
|
|
225
|
+
const isoDate = date ? date.isoString : null;
|
|
226
|
+
const tsUnix = date ? date.tsUnix : 0;
|
|
227
|
+
|
|
228
|
+
return {
|
|
229
|
+
id: parseInt(msgId, 10),
|
|
230
|
+
type: 'message',
|
|
231
|
+
date: isoDate || '1970-01-01T00:00:00',
|
|
232
|
+
date_unixtime: tsUnix > 0 ? String(tsUnix) : '0',
|
|
233
|
+
from: fromName,
|
|
234
|
+
from_id: fromName ? `user_html_${slugify(fromName)}` : 'unknown',
|
|
235
|
+
text: text,
|
|
236
|
+
...(forwardedFrom ? { forwarded_from: forwardedFrom } : {}),
|
|
237
|
+
};
|
|
238
|
+
}
|
|
239
|
+
|
|
240
|
+
function slugify(s) {
|
|
241
|
+
return String(s).toLowerCase().replace(/[^a-z0-9]+/g, '_').replace(/^_+|_+$/g, '').slice(0, 40) || 'anon';
|
|
242
|
+
}
|
|
243
|
+
|
|
244
|
+
/**
|
|
245
|
+
* Extract chat title from messages.html (or first chunk).
|
|
246
|
+
* Falls back to directory name basename, then "Telegram chat".
|
|
247
|
+
*/
|
|
248
|
+
function extractChatTitle(htmlContent, fallbackPath) {
|
|
249
|
+
// Try the <title>...</title>
|
|
250
|
+
const titleM = htmlContent.match(/<title>\s*([^<]+?)\s*<\/title>/i);
|
|
251
|
+
if (titleM) {
|
|
252
|
+
let t = titleM[1].trim();
|
|
253
|
+
// Telegram titles often look like "Alice — Chat Export"
|
|
254
|
+
t = t.replace(/\s*[—-]\s*(Chat Export|Telegram).*$/i, '').trim();
|
|
255
|
+
if (t && t !== 'Telegram') return t;
|
|
256
|
+
}
|
|
257
|
+
// Try the page_header text
|
|
258
|
+
const headerM = htmlContent.match(/<div class="text bold"[^>]*>\s*([\s\S]*?)\s*<\/div>/);
|
|
259
|
+
if (headerM) {
|
|
260
|
+
const t = htmlToText(headerM[1]).trim();
|
|
261
|
+
if (t) return t;
|
|
262
|
+
}
|
|
263
|
+
// Fallback: dirname of the parent ChatExport_xxx folder
|
|
264
|
+
if (fallbackPath) {
|
|
265
|
+
const parent = basename(dirname(fallbackPath));
|
|
266
|
+
if (parent && parent.startsWith('ChatExport')) {
|
|
267
|
+
return parent.replace(/^ChatExport_?/, '').replace(/_/g, ' ').trim() || 'Telegram chat';
|
|
268
|
+
}
|
|
269
|
+
}
|
|
270
|
+
return 'Telegram chat';
|
|
271
|
+
}
|
|
272
|
+
|
|
273
|
+
/**
|
|
274
|
+
* Main entrypoint. Parse a Telegram HTML export path → return an object
|
|
275
|
+
* shaped like a Telegram JSON export, ready for importTelegram().
|
|
276
|
+
*
|
|
277
|
+
* Returns null if path isn't a valid Telegram HTML export.
|
|
278
|
+
*
|
|
279
|
+
* Object shape:
|
|
280
|
+
* {
|
|
281
|
+
* personal_information: { user_id: "" },
|
|
282
|
+
* chats: {
|
|
283
|
+
* list: [{
|
|
284
|
+
* id: <stable hash of chat title>,
|
|
285
|
+
* name: <chat title>,
|
|
286
|
+
* type: "personal_chat",
|
|
287
|
+
* messages: [{ id, type, date, date_unixtime, from, from_id, text, … }, …]
|
|
288
|
+
* }]
|
|
289
|
+
* }
|
|
290
|
+
* }
|
|
291
|
+
*/
|
|
292
|
+
export function parseTelegramHtmlExport(path, opts = {}) {
|
|
293
|
+
const detection = detectTelegramHtml(path);
|
|
294
|
+
if (!detection.type) return null;
|
|
295
|
+
if (detection.htmlFiles.length === 0) return null;
|
|
296
|
+
|
|
297
|
+
let allMessages = [];
|
|
298
|
+
let chatTitle = null;
|
|
299
|
+
let lastSender = null;
|
|
300
|
+
|
|
301
|
+
for (const htmlPath of detection.htmlFiles) {
|
|
302
|
+
let content;
|
|
303
|
+
try { content = readFileSync(htmlPath, 'utf-8'); }
|
|
304
|
+
catch (_) { continue; }
|
|
305
|
+
|
|
306
|
+
if (!chatTitle) chatTitle = extractChatTitle(content, htmlPath);
|
|
307
|
+
|
|
308
|
+
// Split into per-message blocks. The reliable boundary is the
|
|
309
|
+
// opening `<div class="message ` of the next message.
|
|
310
|
+
// Use a tolerant regex that handles the message default / joined variants.
|
|
311
|
+
const messageBlocks = [...content.matchAll(/<div class="message [^"]*"[\s\S]*?(?=<div class="message [^"]*"|<div class="page_footer"|<\/body>)/g)];
|
|
312
|
+
|
|
313
|
+
for (const blockMatch of messageBlocks) {
|
|
314
|
+
const msg = parseMessageDiv(blockMatch[0], lastSender);
|
|
315
|
+
if (msg) {
|
|
316
|
+
allMessages.push(msg);
|
|
317
|
+
// Track sender for "joined" continuation messages
|
|
318
|
+
if (msg.from && msg.from !== 'Unknown') lastSender = msg.from;
|
|
319
|
+
}
|
|
320
|
+
}
|
|
321
|
+
}
|
|
322
|
+
|
|
323
|
+
if (allMessages.length === 0) return null;
|
|
324
|
+
|
|
325
|
+
// Stable chat id: hash of title + first message ts (good enough for dedup)
|
|
326
|
+
// We use a simple numeric hash so the synthetic chat_id is stable across re-imports.
|
|
327
|
+
const chatId = stableChatId(chatTitle || 'Telegram chat', allMessages[0]?.date_unixtime || '0');
|
|
328
|
+
|
|
329
|
+
return {
|
|
330
|
+
personal_information: { user_id: '' },
|
|
331
|
+
chats: {
|
|
332
|
+
list: [
|
|
333
|
+
{
|
|
334
|
+
id: chatId,
|
|
335
|
+
name: chatTitle || 'Telegram chat',
|
|
336
|
+
type: 'personal_chat',
|
|
337
|
+
messages: allMessages,
|
|
338
|
+
},
|
|
339
|
+
],
|
|
340
|
+
},
|
|
341
|
+
_source: {
|
|
342
|
+
format: 'telegram-html',
|
|
343
|
+
original_path: path,
|
|
344
|
+
chunks: detection.htmlFiles.length,
|
|
345
|
+
messages_total: allMessages.length,
|
|
346
|
+
},
|
|
347
|
+
};
|
|
348
|
+
}
|
|
349
|
+
|
|
350
|
+
function stableChatId(title, firstTs) {
|
|
351
|
+
let hash = 0;
|
|
352
|
+
const key = title + ':' + firstTs;
|
|
353
|
+
for (let i = 0; i < key.length; i++) {
|
|
354
|
+
hash = ((hash << 5) - hash) + key.charCodeAt(i);
|
|
355
|
+
hash |= 0;
|
|
356
|
+
}
|
|
357
|
+
return Math.abs(hash);
|
|
358
|
+
}
|
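The date handling described for `parseTelegramDate` can be illustrated in isolation. This is a minimal standalone sketch, not the module's code: `parseExportDate` and the `local` field are hypothetical names, and it assumes the `title` attribute always follows the `YYYY-MM-DD HH:MM:SS UTC±HH:MM` layout shown in the module's comments, with a missing offset treated as UTC.

```javascript
// Standalone sketch (hypothetical helper): turn a Telegram date title such as
// "2024-01-01 14:23:45 UTC+03:00" into a Unix timestamp plus the local
// wall-clock string without the offset.
function parseExportDate(title) {
  const m = /^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})(?:\s+UTC([+-]\d{2}:\d{2}))?$/.exec(title ?? '');
  if (!m) return null;
  const [, day, time, offset] = m;
  // Build an ISO 8601 string; fall back to "Z" (UTC) when no offset is present.
  const ms = Date.parse(`${day}T${time}${offset ?? 'Z'}`);
  if (Number.isNaN(ms)) return null;
  return { tsUnix: Math.floor(ms / 1000), local: `${day}T${time}` };
}

console.log(parseExportDate('2024-01-01 14:23:45 UTC+03:00'));
// { tsUnix: 1704108225, local: '2024-01-01T14:23:45' }
console.log(parseExportDate('not a date')); // null
```

Keeping the offset in the string handed to `Date.parse` is what makes the timestamp independent of the machine's own time zone; the wall-clock string is kept separately because the Telegram JSON shape stores `date` without an offset.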
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "memex-mvp",
|
|
3
|
-
"version": "0.
|
|
3
|
+
"version": "0.9.0",
|
|
4
4
|
"description": "Local-first MCP server for cross-agent AI memory. One SQLite + FTS5 corpus across Claude Code, Cowork, Cursor, Continue, Zed, Obsidian, and Telegram — passively captured, verbatim, searchable from any MCP-compatible client.",
|
|
5
5
|
"type": "module",
|
|
6
6
|
"main": "server.js",
|
|
@@ -26,7 +26,7 @@
|
|
|
26
26
|
"sync": "node ingest.js",
|
|
27
27
|
"ingest": "node ingest.js",
|
|
28
28
|
"bot": "node bot/index.js",
|
|
29
|
-
"test": "node test/parser.test.js && node test/bot-inbox.test.js && node test/search-sort.test.js && node test/store-document.test.js && node test/cli.test.js && node test/hook.test.js",
|
|
29
|
+
"test": "node test/parser.test.js && node test/bot-inbox.test.js && node test/search-sort.test.js && node test/store-document.test.js && node test/cli.test.js && node test/hook.test.js && node test/telegram-html.test.js",
|
|
30
30
|
"prepublishOnly": "npm test"
|
|
31
31
|
},
|
|
32
32
|
"engines": {
|
package/server.js
CHANGED
|
@@ -49,6 +49,10 @@ import {
|
|
|
49
49
|
} from './lib/store-doc/canonicalize.js';
|
|
50
50
|
import { detectIssues, isBlocked } from './lib/store-doc/detect.js';
|
|
51
51
|
import { extractTitle } from './lib/store-doc/extract-title.js';
|
|
52
|
+
import {
|
|
53
|
+
detectTelegramHtml,
|
|
54
|
+
parseTelegramHtmlExport,
|
|
55
|
+
} from './lib/parse-telegram-html.js';
|
|
52
56
|
import { createHash } from 'node:crypto';
|
|
53
57
|
import { runCli, CLI_SUBCOMMAND_NAMES } from './lib/cli/index.js';
|
|
54
58
|
|
|
@@ -325,9 +329,17 @@ const insertImport = db.prepare(`
|
|
|
325
329
|
|
|
326
330
|
// -------------------- Importers --------------------
|
|
327
331
|
|
|
328
|
-
/**
|
|
329
|
-
|
|
330
|
-
|
|
332
|
+
/**
|
|
333
|
+
* Telegram Desktop export importer. Accepts:
|
|
334
|
+
* - filePath (string) — path to result.json
|
|
335
|
+
* - rawObject (object) — already-parsed export, e.g. from parseTelegramHtmlExport
|
|
336
|
+
*
|
|
337
|
+
* Returns total imported message count.
|
|
338
|
+
*/
|
|
339
|
+
function importTelegram(filePathOrRaw) {
|
|
340
|
+
const raw = typeof filePathOrRaw === 'string'
|
|
341
|
+
? JSON.parse(readFileSync(filePathOrRaw, 'utf-8'))
|
|
342
|
+
: filePathOrRaw;
|
|
331
343
|
|
|
332
344
|
// Telegram Desktop produces either a single chat object or { chats: { list: [...] } }
|
|
333
345
|
const chats = Array.isArray(raw.chats?.list)
|
|
@@ -670,9 +682,83 @@ function resolvePendingParents() {
|
|
|
670
682
|
}
|
|
671
683
|
|
|
672
684
|
/** Auto-detect format and import */
|
|
685
|
+
/**
|
|
686
|
+
* Try to import a path as a Telegram HTML export (directory or single file).
|
|
687
|
+
* Returns imported message count, or 0 if not an HTML export.
|
|
688
|
+
*
|
|
689
|
+
* Side effects on success:
|
|
690
|
+
* - Inserts an `imports` row tagged "telegram-html"
|
|
691
|
+
* - Moves the source directory/file to ~/.memex/data/conversations/telegram-html/
|
|
692
|
+
*
|
|
693
|
+
* If it LOOKS like a Telegram HTML export but parsing failed, prints an
|
|
694
|
+
* actionable error pointing the user at the Desktop export menu — instead
|
|
695
|
+
* of silently ignoring. This was Tester 5's friction point.
|
|
696
|
+
*/
|
|
697
|
+
function importTelegramHtmlIfMatches(path) {
|
|
698
|
+
const detection = detectTelegramHtml(path);
|
|
699
|
+
if (!detection.type) return 0;
|
|
700
|
+
|
|
701
|
+
let parsed;
|
|
702
|
+
try {
|
|
703
|
+
parsed = parseTelegramHtmlExport(path);
|
|
704
|
+
} catch (err) {
|
|
705
|
+
log('telegram-html parse error:', basename(path), err.message);
|
|
706
|
+
parsed = null;
|
|
707
|
+
}
|
|
708
|
+
|
|
709
|
+
if (!parsed || parsed.chats.list[0].messages.length === 0) {
|
|
710
|
+
// Looked like Telegram HTML (had markers) but extraction yielded nothing.
|
|
711
|
+
// Print actionable error rather than silent ignore.
|
|
712
|
+
log('');
|
|
713
|
+
log('⚠ Detected Telegram HTML export at ' + basename(path) + ' but extracted 0 messages.');
|
|
714
|
+
log(' This usually means Telegram changed the HTML format, or the export is partial.');
|
|
715
|
+
log(' EASIEST FIX — re-export as JSON:');
|
|
716
|
+
log(' 1. Open Telegram Desktop');
|
|
717
|
+
log(' 2. Click the chat → ⋮ menu → "Export chat history"');
|
|
718
|
+
log(' 3. Format: change "HTML" to "Machine-readable JSON"');
|
|
719
|
+
log(' 4. Drop the new result.json into ~/.memex/inbox/');
|
|
720
|
+
log('');
|
|
721
|
+
log(' HTML export will be left in place — feel free to delete it once JSON works.');
|
|
722
|
+
return 0;
|
|
723
|
+
}
|
|
724
|
+
|
|
725
|
+
let imported = 0;
|
|
726
|
+
try {
|
|
727
|
+
imported = importTelegram(parsed);
|
|
728
|
+
} catch (err) {
|
|
729
|
+
log('telegram-html import error:', err.message);
|
|
730
|
+
return 0;
|
|
731
|
+
}
|
|
732
|
+
|
|
733
|
+
if (imported > 0) {
|
|
734
|
+
insertImport.run(
|
|
735
|
+
basename(path),
|
|
736
|
+
'telegram-html',
|
|
737
|
+
Math.floor(Date.now() / 1000),
|
|
738
|
+
imported
|
|
739
|
+
);
|
|
740
|
+
// Archive: move the whole directory (or file) so the watcher doesn't re-process
|
|
741
|
+
const targetDir = join(ARCHIVE, 'telegram-html');
|
|
742
|
+
mkdirSync(targetDir, { recursive: true });
|
|
743
|
+
const target = join(targetDir, basename(path));
|
|
744
|
+
try {
|
|
745
|
+
renameSync(path, target);
|
|
746
|
+
} catch (_) {}
|
|
747
|
+
log(`imported ${imported} messages from ${basename(path)} (telegram-html, ${detection.htmlFiles.length} chunk(s))`);
|
|
748
|
+
}
|
|
749
|
+
return imported;
|
|
750
|
+
}
|
|
751
|
+
|
|
673
752
|
function importFile(filePath) {
|
|
674
753
|
if (!existsSync(filePath)) return 0;
|
|
675
754
|
const stats = statSync(filePath);
|
|
755
|
+
|
|
756
|
+
// Telegram HTML export — can be either a directory (ChatExport_xxx/)
|
|
757
|
+
// or a bare messages.html file. We accept both. Detected via marker
|
|
758
|
+
// patterns inside the HTML, not file extension alone.
|
|
759
|
+
if (stats.isDirectory()) {
|
|
760
|
+
return importTelegramHtmlIfMatches(filePath);
|
|
761
|
+
}
|
|
676
762
|
if (!stats.isFile()) return 0;
|
|
677
763
|
|
|
678
764
|
const lower = filePath.toLowerCase();
|
|
@@ -692,6 +778,10 @@ function importFile(filePath) {
|
|
|
692
778
|
imported = importTelegram(filePath);
|
|
693
779
|
source = 'telegram';
|
|
694
780
|
}
|
|
781
|
+
} else if (/\.html?$/i.test(lower)) {
|
|
782
|
+
// Single-file HTML drop (rare — usually a directory)
|
|
783
|
+
imported = importTelegramHtmlIfMatches(filePath);
|
|
784
|
+
if (imported > 0) source = 'telegram';
|
|
695
785
|
} else if (lower.endsWith('.jsonl')) {
|
|
696
786
|
// Filename prefix tells us which product the session came from.
|
|
697
787
|
// cowork- → Claude Cowork (incl. its subagents)
|
|
@@ -739,15 +829,30 @@ function importFile(filePath) {
|
|
|
739
829
|
// dropping a partial file by hand — the watcher must not race the writer and
|
|
740
830
|
// move the unfinished tmp into archive, which used to spam ENOENT into the
|
|
741
831
|
// daemon's rename and corrupt the import accounting.
|
|
832
|
+
// Watch INBOX top-level. Files: chokidar 'add' event. Directories:
|
|
833
|
+
// chokidar 'addDir' event (v0.9+ inbox can also receive Telegram HTML
|
|
834
|
+
// export DIRECTORIES like ChatExport_xxx/, not just JSON/JSONL files).
|
|
835
|
+
//
|
|
836
|
+
// `depth: 0` means we only get top-level entries — we DON'T want every
|
|
837
|
+
// .html chunk inside ChatExport_xxx to fire 'add' separately. The
|
|
838
|
+
// directory drop itself is what we react to; the HTML parser walks
|
|
839
|
+
// inside.
|
|
742
840
|
chokidar
|
|
743
841
|
.watch(INBOX, {
|
|
744
842
|
ignoreInitial: false,
|
|
745
843
|
ignored: /\.tmp$/,
|
|
746
844
|
awaitWriteFinish: { stabilityThreshold: 800 },
|
|
845
|
+
depth: 0,
|
|
747
846
|
})
|
|
748
847
|
.on('add', (filePath) => {
|
|
749
|
-
log('inbox detected:', basename(filePath));
|
|
848
|
+
log('inbox detected (file):', basename(filePath));
|
|
750
849
|
importFile(filePath);
|
|
850
|
+
})
|
|
851
|
+
.on('addDir', (dirPath) => {
|
|
852
|
+
// Skip the inbox itself
|
|
853
|
+
if (dirPath === INBOX) return;
|
|
854
|
+
log('inbox detected (dir):', basename(dirPath));
|
|
855
|
+
importFile(dirPath);
|
|
751
856
|
});
|
|
752
857
|
|
|
753
858
|
// -------------------- MCP Server --------------------
|
|
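The `importTelegram` change in this file relies on both inputs ending up in the same shape: a parsed `result.json` and the object returned by `parseTelegramHtmlExport`. The sketch below shows one way to normalize the two shapes mentioned in the importer's comment (a single chat object, or `{ chats: { list: [...] } }`). `listChats` is a hypothetical helper, and treating "has a `messages` array" as the marker of a single-chat export is an assumption, since the rest of the importer is not shown in this diff.

```javascript
// Hypothetical helper: normalize either Telegram export shape into an array
// of chat objects, so downstream code can iterate the same way for both.
function listChats(raw) {
  // Full export (and the HTML converter's output): { chats: { list: [...] } }
  if (Array.isArray(raw?.chats?.list)) return raw.chats.list;
  // Single-chat export: assumed to be a chat object with its own messages.
  if (Array.isArray(raw?.messages)) return [raw];
  return [];
}

// Shape produced by the HTML converter, per its doc comment.
const fromHtml = {
  personal_information: { user_id: '' },
  chats: { list: [{ id: 1, name: 'Alice', type: 'personal_chat', messages: [{ id: 7, text: 'hi' }] }] },
};
// Assumed single-chat shape.
const singleChat = { id: 2, name: 'Bob', type: 'personal_chat', messages: [] };

console.log(listChats(fromHtml).length);      // 1
console.log(listChats(singleChat)[0].name);   // Bob
console.log(listChats({}).length);            // 0
```

Emitting the JSON shape from the HTML converter, rather than adding a second importer, keeps every downstream step (dedup, inserts, archiving) on a single code path.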
package/skills/install-memex/SKILL.md
CHANGED
|
@@ -58,14 +58,19 @@ Do these in order. Show each command before running it. Stop and ask if anything
|
|
|
58
58
|
npm install -g memex-mvp
|
|
59
59
|
```
|
|
60
60
|
|
|
61
|
-
If you get `EACCES` (macOS system Node),
|
|
61
|
+
If you get `EACCES` (macOS system Node), **recommend Option B (permanent prefix fix) by default** and proceed unless the user explicitly chooses sudo.
|
|
62
62
|
|
|
63
|
-
|
|
64
|
-
|
|
65
|
-
|
|
66
|
-
|
|
63
|
+
Say to the user:
|
|
64
|
+
|
|
65
|
+
> "Hit EACCES — your Node is installed in a system directory that npm can't write to without admin rights. Two fixes:
|
|
66
|
+
>
|
|
67
|
+
> **A)** Quick: `sudo npm install -g memex-mvp` — one-time, requires your password
|
|
68
|
+
> **B)** Permanent fix: I'll move npm's install location to your home directory. After this, no `npm install -g` will ever ask for sudo again, for any package. Five commands, one-time.
|
|
69
|
+
>
|
|
70
|
+
> I recommend **B** — it's a permanent improvement that benefits all your future Node tools, not just memex. Proceed with B?"
|
|
71
|
+
|
|
72
|
+
If the user agrees (or says anything like "ok", "yes", "default", "go", "B") — proceed with B without further questions:
|
|
67
73
|
|
|
68
|
-
**Option B** — fix prefix permanently (better long-term):
|
|
69
74
|
```sh
|
|
70
75
|
mkdir -p ~/.npm-global
|
|
71
76
|
npm config set prefix ~/.npm-global
|
|
@@ -74,7 +79,19 @@ source ~/.zshrc
|
|
|
74
79
|
npm install -g memex-mvp
|
|
75
80
|
```
|
|
76
81
|
|
|
77
|
-
|
|
82
|
+
If the user **explicitly** picks A (or says "sudo", "fast", "quick"), use sudo:
|
|
83
|
+
|
|
84
|
+
```sh
|
|
85
|
+
sudo npm install -g memex-mvp
|
|
86
|
+
```
|
|
87
|
+
|
|
88
|
+
After the install completes, verify with `which memex` (should print an absolute path) and `memex --version` (should print the version).
|
|
89
|
+
|
|
90
|
+
**Important if Option B was used + user already had memex from a prior sudo install:** the old sudo-installed copy in `/usr/local/lib/node_modules/memex-mvp/` is now orphaned (PATH prefers the new home install). Mention this once:
|
|
91
|
+
|
|
92
|
+
> "By the way, you have an older sudo-installed memex copy in `/usr/local/lib/`. It's harmless but takes ~60 MB. Clean it up when convenient: `sudo npm uninstall -g memex-mvp` — not urgent."
|
|
93
|
+
|
|
94
|
+
Don't run that uninstall yourself — let the user decide when.
|
|
78
95
|
|
|
79
96
|
If `node` or `npm` are missing, STOP and tell the user to install Node.js (recommend nvm.sh or `brew install node`). Don't try to install Node yourself.
|
|
80
97
|
|
|
package/skills/install-memex/examples.md
CHANGED
|
@@ -73,7 +73,7 @@ Sample interactions when the user invokes this skill. These show typical happy p
|
|
|
73
73
|
|
|
74
74
|
---
|
|
75
75
|
|
|
76
|
-
## Example 2 · EACCES recovery (macOS system Node)
|
|
76
|
+
## Example 2 · EACCES recovery (macOS system Node) — recommend B by default
|
|
77
77
|
|
|
78
78
|
> **User:** install memex
|
|
79
79
|
>
|
|
@@ -84,14 +84,18 @@ Sample interactions when the user invokes this skill. These show typical happy p
|
|
|
84
84
|
> npm error path: '/usr/local/lib/node_modules/memex-mvp'
|
|
85
85
|
> ```
|
|
86
86
|
>
|
|
87
|
-
> Hit
|
|
87
|
+
> Hit EACCES — your Node is installed in a system directory that npm can't write to without admin rights. Two fixes:
|
|
88
88
|
>
|
|
89
|
-
> **
|
|
90
|
-
>
|
|
91
|
-
>
|
|
92
|
-
>
|
|
89
|
+
> **A)** Quick: `sudo npm install -g memex-mvp` — one-time, requires your password
|
|
90
|
+
>
|
|
91
|
+
> **B)** Permanent fix: I'll move npm's install location to your home directory. After this, no `npm install -g` will ever ask for sudo again, for any package. Five commands, one-time.
|
|
92
|
+
>
|
|
93
|
+
> I recommend **B** — it's a permanent improvement that benefits all your future Node tools, not just memex. **Proceed with B?**
|
|
94
|
+
>
|
|
95
|
+
> **User:** ok
|
|
96
|
+
>
|
|
97
|
+
> **Agent:** Running the 5 prefix-fix commands:
|
|
93
98
|
>
|
|
94
|
-
> **Option B** (one-time setup, then future `npm install -g` works without sudo):
|
|
95
99
|
> ```sh
|
|
96
100
|
> mkdir -p ~/.npm-global
|
|
97
101
|
> npm config set prefix ~/.npm-global
|
|
@@ -100,11 +104,7 @@ Sample interactions when the user invokes this skill. These show typical happy p
|
|
|
100
104
|
> npm install -g memex-mvp
|
|
101
105
|
> ```
|
|
102
106
|
>
|
|
103
|
-
>
|
|
104
|
-
>
|
|
105
|
-
> **User:** B
|
|
106
|
-
>
|
|
107
|
-
> **Agent:** [runs the 5 commands in sequence, shows output for each] ✓ Done. Verifying:
|
|
107
|
+
> ✓ Done. Verifying:
|
|
108
108
|
>
|
|
109
109
|
> ```
|
|
110
110
|
> which memex → /Users/macbook/.npm-global/bin/memex
|
|
@@ -242,7 +242,7 @@ Sample interactions when the user invokes this skill. These show typical happy p
|
|
|
242
242
|
|
|
243
243
|
- **Always discover first**, then ask for confirmation before any write action
|
|
244
244
|
- **Show every command** before running it; show diffs for config edits
|
|
245
|
-
- **
|
|
245
|
+
- **Recommend the better default** with a clear reason (e.g. Option B for EACCES) instead of forcing the user to weigh both options blind. Fall back to alternatives only if explicitly chosen.
|
|
246
246
|
- **Detect and adapt** to existing installs, missing prerequisites, partial setups
|
|
247
247
|
- **Refuse gracefully** when running in a context where the install can't actually work (web-only agents)
|
|
248
248
|
|