clawmem 0.7.1 → 0.7.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/AGENTS.md CHANGED
@@ -368,7 +368,7 @@ Pin, snooze, and forget are **manual MCP tools** — not automated. The agent sh
  - Do NOT pin everything — pin is for persistent high-priority items, not temporary boosting.
  - Do NOT forget memories to "clean up" — let confidence decay and contradiction detection handle it naturally.
  - Do NOT run `build_graphs` after every reindex — A-MEM creates per-doc links automatically. Only after bulk ingestion or when `intent_search` returns weak graph results.
- - Do NOT run `clawmem mine` autonomously — it is a bulk ingestion command (same category as `update`/`reindex`). Suggest it to the user when they mention old conversation exports, but let them run it. Bulk import has disk/embedding cost implications that need user consent.
+ - Do NOT run `clawmem mine` autonomously — it is a bulk ingestion command (same category as `update`/`reindex`). Suggest it to the user when they mention old conversation exports, but let them run it. Bulk import has disk/embedding cost implications that need user consent. **v0.7.2 adds `--synthesize`** — an opt-in post-import LLM fact extraction pass that turns raw conversation dumps into searchable structured decisions / preferences / milestones / problems with cross-fact relations. Off by default; also requires user consent because it drives additional LLM calls (one per conversation doc). Suggest both together when the user wants to get real value out of old chat exports, not just the raw dumps.
  - Do NOT use `diary_write` in Claude Code — hooks (`decision-extractor`, `handoff-generator`) capture this automatically. Diary is for non-hooked environments only (Hermes, Gemini, plain MCP clients).
  - Do NOT use `kg_query` for causal "why" questions — use `intent_search` or `memory_retrieve`. `kg_query` returns structured entity facts (SPO triples), not reasoning chains.
 
@@ -505,6 +505,7 @@ The `memory_relations` table is populated by multiple independent sources:
  | Entity co-occurrence graph | entity | A-MEM enrichment (indexing) | LLM entity extraction → quality filters (title/length/blocklist/location validation) → type-agnostic canonical resolution within compatibility buckets (person, org, location, tech=project/service/tool/concept) → `entity_mentions` + `entity_cooccurrences` tables. Entity edges use IDF-based specificity scoring. Feeds ENTITY intent queries and MPFP `[entity, semantic]` patterns. |
  | `consolidated_observations` | supporting, contradicts | Consolidation worker (background) | 3-tier consolidation: facts → observations → mental models. Observations track `proof_count`, `trend` (STABLE/STRENGTHENING/WEAKENING/STALE), and source links. **v0.7.1 safety gates:** name-aware merge gate uses entity-anchor comparison + 3-gram cosine similarity (dual-threshold `CLAWMEM_MERGE_SCORE_NORMAL`=0.93 / `_STRICT`=0.98) to prevent cross-entity merges ("Alice decided X" merging into "Bob decided X"). Merge-time contradiction gate runs deterministic heuristic + LLM check; blocked merges route to `CLAWMEM_CONTRADICTION_POLICY`=`link` (new row + `contradicts` edge, default) or `supersede` (old row `status='inactive'`, new row replaces). |
  | Deductive synthesis | supporting, contradicts | Consolidation worker Phase 3 (background, every ~15 min) | Combines 2-3 related recent observations (decision/preference/milestone/problem, last 7 days) into `content_type='deductive'` documents with `source_doc_ids` provenance. First-class searchable docs with ∞ half-life. **v0.7.1 anti-contamination wrapper:** every draft passes through deterministic pre-checks (empty conclusion, invalid source_indices, pool-only entity contamination via `entity_mentions` or lexical fallback) + LLM validator (fail-open with `validatorFallbackAccepts` counter) + dedupe. Per-reason rejection stats exposed via `DeductiveSynthesisStats` (contaminationRejects, invalidIndexRejects, unsupportedRejects, emptyRejects, dedupSkipped, validatorFallbackAccepts). Contradictory dedupe matches are linked via `contradicts` edges. |
+ | Conversation synthesis (`runConversationSynthesis`) | semantic, supporting, contradicts, causal, temporal, entity | `clawmem mine <dir> --synthesize` (opt-in, post-index) | **v0.7.2.** Two-pass LLM pipeline over freshly imported `content_type='conversation'` docs. Pass 1 extracts structured facts (decision/preference/milestone/problem) via `extractFactsFromConversation`, saves each via dedup-aware `saveMemory`, populates a local Set-based alias map. Pass 2 resolves cross-fact links against the local map first (fails closed on ambiguity — multi-candidate titles return unresolved), falls back to collection-scoped SQL lookup with LIMIT 2 ambiguity detection. Relations upsert via `ON CONFLICT DO UPDATE SET weight = MAX(weight, excluded.weight)` — idempotent on equal-weight reruns but monotonically accepts stronger later evidence. Synthesized fact paths are a pure function of `(sourceDocId, slug(title), short sha256(normalizedTitle))`, so reruns update in place instead of creating parallel rows. Counters split into `llmFailures` (null/thrown/invalid-JSON) vs `docsWithNoFacts` (valid-empty extraction). All failures non-fatal — never rolls back the mine import. |
 
  **Edge collision:** Both `generateMemoryLinks()` and `buildSemanticGraph()` insert `relation_type='semantic'`. PK is `(source_id, target_id, relation_type)` — first writer wins.
 
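The weight-monotonic upsert described in the new row is easiest to see in isolation. A minimal standalone sketch with an in-memory bun:sqlite database, assuming a schema reduced to the conflict-relevant columns (the real `memory_relations` also carries `metadata` and `created_at`):

```ts
import { Database } from "bun:sqlite";

const db = new Database(":memory:");
db.run(`CREATE TABLE memory_relations (
  source_id INTEGER, target_id INTEGER, relation_type TEXT, weight REAL,
  PRIMARY KEY (source_id, target_id, relation_type)
)`);

// Same conflict clause the synthesis pipeline uses.
const upsert = db.prepare(`
  INSERT INTO memory_relations (source_id, target_id, relation_type, weight)
  VALUES (?, ?, ?, ?)
  ON CONFLICT(source_id, target_id, relation_type)
  DO UPDATE SET weight = MAX(weight, excluded.weight)`);

upsert.run(1, 2, "causal", 0.6); // first write: weight 0.6
upsert.run(1, 2, "causal", 0.6); // equal-weight rerun: no change (idempotent)
upsert.run(1, 2, "causal", 0.8); // stronger later evidence: weight rises to 0.8
upsert.run(1, 2, "causal", 0.5); // weaker rerun: weight stays 0.8 (no double-counting)

const row = db
  .prepare(`SELECT weight FROM memory_relations WHERE source_id = 1 AND target_id = 2`)
  .get() as { weight: number };
console.log(row.weight); // 0.8
```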
package/CLAUDE.md CHANGED
@@ -368,7 +368,7 @@ Pin, snooze, and forget are **manual MCP tools** — not automated. The agent sh
  - Do NOT pin everything — pin is for persistent high-priority items, not temporary boosting.
  - Do NOT forget memories to "clean up" — let confidence decay and contradiction detection handle it naturally.
  - Do NOT run `build_graphs` after every reindex — A-MEM creates per-doc links automatically. Only after bulk ingestion or when `intent_search` returns weak graph results.
- - Do NOT run `clawmem mine` autonomously — it is a bulk ingestion command (same category as `update`/`reindex`). Suggest it to the user when they mention old conversation exports, but let them run it. Bulk import has disk/embedding cost implications that need user consent.
+ - Do NOT run `clawmem mine` autonomously — it is a bulk ingestion command (same category as `update`/`reindex`). Suggest it to the user when they mention old conversation exports, but let them run it. Bulk import has disk/embedding cost implications that need user consent. **v0.7.2 adds `--synthesize`** — an opt-in post-import LLM fact extraction pass that turns raw conversation dumps into searchable structured decisions / preferences / milestones / problems with cross-fact relations. Off by default; also requires user consent because it drives additional LLM calls (one per conversation doc). Suggest both together when the user wants to get real value out of old chat exports, not just the raw dumps.
  - Do NOT use `diary_write` in Claude Code — hooks (`decision-extractor`, `handoff-generator`) capture this automatically. Diary is for non-hooked environments only (Hermes, Gemini, plain MCP clients).
  - Do NOT use `kg_query` for causal "why" questions — use `intent_search` or `memory_retrieve`. `kg_query` returns structured entity facts (SPO triples), not reasoning chains.
 
@@ -505,6 +505,7 @@ The `memory_relations` table is populated by multiple independent sources:
  | Entity co-occurrence graph | entity | A-MEM enrichment (indexing) | LLM entity extraction → quality filters (title/length/blocklist/location validation) → type-agnostic canonical resolution within compatibility buckets (person, org, location, tech=project/service/tool/concept) → `entity_mentions` + `entity_cooccurrences` tables. Entity edges use IDF-based specificity scoring. Feeds ENTITY intent queries and MPFP `[entity, semantic]` patterns. |
  | `consolidated_observations` | supporting, contradicts | Consolidation worker (background) | 3-tier consolidation: facts → observations → mental models. Observations track `proof_count`, `trend` (STABLE/STRENGTHENING/WEAKENING/STALE), and source links. **v0.7.1 safety gates:** name-aware merge gate uses entity-anchor comparison + 3-gram cosine similarity (dual-threshold `CLAWMEM_MERGE_SCORE_NORMAL`=0.93 / `_STRICT`=0.98) to prevent cross-entity merges ("Alice decided X" merging into "Bob decided X"). Merge-time contradiction gate runs deterministic heuristic + LLM check; blocked merges route to `CLAWMEM_CONTRADICTION_POLICY`=`link` (new row + `contradicts` edge, default) or `supersede` (old row `status='inactive'`, new row replaces). |
  | Deductive synthesis | supporting, contradicts | Consolidation worker Phase 3 (background, every ~15 min) | Combines 2-3 related recent observations (decision/preference/milestone/problem, last 7 days) into `content_type='deductive'` documents with `source_doc_ids` provenance. First-class searchable docs with ∞ half-life. **v0.7.1 anti-contamination wrapper:** every draft passes through deterministic pre-checks (empty conclusion, invalid source_indices, pool-only entity contamination via `entity_mentions` or lexical fallback) + LLM validator (fail-open with `validatorFallbackAccepts` counter) + dedupe. Per-reason rejection stats exposed via `DeductiveSynthesisStats` (contaminationRejects, invalidIndexRejects, unsupportedRejects, emptyRejects, dedupSkipped, validatorFallbackAccepts). Contradictory dedupe matches are linked via `contradicts` edges. |
+ | Conversation synthesis (`runConversationSynthesis`) | semantic, supporting, contradicts, causal, temporal, entity | `clawmem mine <dir> --synthesize` (opt-in, post-index) | **v0.7.2.** Two-pass LLM pipeline over freshly imported `content_type='conversation'` docs. Pass 1 extracts structured facts (decision/preference/milestone/problem) via `extractFactsFromConversation`, saves each via dedup-aware `saveMemory`, populates a local Set-based alias map. Pass 2 resolves cross-fact links against the local map first (fails closed on ambiguity — multi-candidate titles return unresolved), falls back to collection-scoped SQL lookup with LIMIT 2 ambiguity detection. Relations upsert via `ON CONFLICT DO UPDATE SET weight = MAX(weight, excluded.weight)` — idempotent on equal-weight reruns but monotonically accepts stronger later evidence. Synthesized fact paths are a pure function of `(sourceDocId, slug(title), short sha256(normalizedTitle))`, so reruns update in place instead of creating parallel rows. Counters split into `llmFailures` (null/thrown/invalid-JSON) vs `docsWithNoFacts` (valid-empty extraction). All failures non-fatal — never rolls back the mine import. |
 
  **Edge collision:** Both `generateMemoryLinks()` and `buildSemanticGraph()` insert `relation_type='semantic'`. PK is `(source_id, target_id, relation_type)` — first writer wins.
 
package/README.md CHANGED
@@ -19,7 +19,7 @@ ClawMem turns your markdown notes, project docs, and research dumps into persist
  - **Surfaces relevant context** on every prompt (context-surfacing hook)
  - **Bootstraps sessions** with your profile, latest handoff, recent decisions, and stale notes
  - **Captures decisions, preferences, milestones, and problems** from session transcripts using a local GGUF observer model
- - **Imports conversation exports** from Claude Code, ChatGPT, Claude.ai, Slack, and plain text via `clawmem mine`
+ - **Imports conversation exports** from Claude Code, ChatGPT, Claude.ai, Slack, and plain text via `clawmem mine`, with optional post-import LLM fact extraction (`--synthesize`) that pulls structured decisions / preferences / milestones / problems and cross-fact links out of otherwise full-text conversation dumps (v0.7.2)
  - **Generates handoffs** at session end so the next session can pick up where you left off
  - **Learns what matters** via a feedback loop that boosts referenced notes and decays unused ones
  - **Guards against prompt injection** in surfaced content
@@ -66,6 +66,20 @@ Five independent safety gates around the consolidation pipeline and context surf
  - **Anti-contamination deductive synthesis** — every Phase 3 draft runs through a three-layer validator: deterministic pre-checks (empty conclusion, invalid source_indices, pool-only entity contamination via `entity_mentions`) + LLM validator (fail-open with `validatorFallbackAccepts` counter) + dedupe. Per-reason rejection stats exposed via `DeductiveSynthesisStats` so Phase 3 yield can be diagnosed without enabling extra logging.
  - **Context instruction + relationship snippets** — `context-surfacing` now always prepends an `<instruction>` block framing the surfaced facts as background knowledge the model already holds, and appends an optional `<relationships>` block listing memory-graph edges where BOTH endpoints are in the surfaced doc set. The relationships block is the first thing dropped when the payload would overflow `CLAWMEM_PROFILE`'s token budget, preserving facts-first behaviour while giving the model graph-level reasoning hooks directly in-prompt.
 
+ ### v0.7.2 Post-Import Conversation Synthesis
+
+ Opt-in LLM pass that runs **after** `clawmem mine` finishes indexing an imported collection. Operates on the freshly imported `content_type='conversation'` documents and extracts structured knowledge facts (decisions / preferences / milestones / problems) plus cross-fact relations, writing each fact as a first-class searchable document alongside the raw conversation exchanges. See [post-import synthesis](docs/concepts/architecture.md#post-import-conversation-synthesis-v072) for the architectural walkthrough.
+
+ - **New CLI flag** — `clawmem mine <dir> --synthesize [--synthesis-max-docs N]`. Off by default. When omitted, existing mine behaviour is byte-identical to v0.7.1.
+ - **Two-pass pipeline** — Pass 1 extracts facts per conversation via the existing LLM, saves each via dedup-aware `saveMemory`, and populates a local alias map. Pass 2 resolves cross-fact links against the local map first, falling back to collection-scoped SQL lookup. Forward references (link to a fact extracted later in the same run) are resolved correctly.
+ - **Idempotent reruns** — synthesized fact paths are a pure function of `(sourceDocId, slug(title), short sha256(normalizedTitle))`, so reruns over the same conversation batch hit the `saveMemory` update branch instead of creating parallel rows. Same-slug collisions are disambiguated by the stable hash suffix, not encounter order.
+ - **Fail-closed link resolution** — when two different facts claim the same normalized title or alias, the resolver treats the link as ambiguous and counts it unresolved. Pre-existing docs with duplicate titles in the collection do not silently bind either.
+ - **Weight-monotonic relation upsert** — `memory_relations` insert uses `ON CONFLICT DO UPDATE SET weight = MAX(weight, excluded.weight)`, which is idempotent on equal-weight reruns but still accepts stronger later evidence without double-counting.
+ - **Non-fatal failure model** — any LLM failure, JSON parse error, saveMemory collision, or relation insert error is counted and logged, never re-thrown. Synthesis failure after `indexCollection` commits does not roll back the mine import.
+ - **Split operator counters** — `llmFailures` counts actual LLM path failures (null, thrown, non-array JSON), while `docsWithNoFacts` counts docs where the LLM responded validly but returned zero structured facts. Previously these were conflated as `nullCalls`.
+
+ Adds +63 tests (46 unit + 5 integration + 12 regression) on top of the v0.7.1 baseline.
+
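For embedders who bypass the CLI, a sketch of invoking the same pass directly. `store` and `llm` are assumed handles here (clawmem.ts builds them from its own store and `getDefaultLlamaCpp()`; construction is not shown in this diff), so they are declared rather than constructed:

```ts
import type { Store } from "./store.ts";
import type { LlamaCpp } from "./llm.ts";
import { runConversationSynthesis } from "./conversation-synthesis.ts";

// Assumed to exist in your embedding context — declared, not constructed.
declare const store: Store;
declare const llm: LlamaCpp;

const result = await runConversationSynthesis(store, llm, {
  collection: "old-chats", // hypothetical collection name
  maxDocs: 50,             // mirrors --synthesis-max-docs 50
  dryRun: true,            // count extractable facts without persisting anything
});

console.log(
  `${result.factsExtracted} fact(s) across ${result.docsScanned} doc(s); ` +
    `${result.llmFailures} LLM failure(s), ${result.docsWithNoFacts} with no facts`,
);
```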
  ## Architecture
 
  <p align="center">
@@ -657,7 +671,7 @@ clawmem collection list List collections
  clawmem collection remove <name>          Remove a collection
 
  clawmem update [--pull] [--embed]         Incremental re-scan
- clawmem mine <dir> [-c name] [--embed]    Import conversation exports (Claude, ChatGPT, Slack)
+ clawmem mine <dir> [-c name] [--embed] [--synthesize]  Import conversation exports (Claude, ChatGPT, Slack); --synthesize runs post-import LLM fact extraction (v0.7.2)
  clawmem embed [-f]                        Generate fragment embeddings
  clawmem reindex [--force]                 Full re-index
  clawmem watch                             File watcher daemon
package/SKILL.md CHANGED
@@ -528,6 +528,7 @@ mcp__clawmem__vsearch(query, collection="name", compact=true) # vector
  | `buildSemanticGraph()` | semantic | `build_graphs` MCP (manual) | Pure cosine similarity. A-MEM edges take precedence (first-writer wins). |
  | `consolidated_observations` | supporting, contradicts | Consolidation worker (background) | **v0.7.1 safety gates:** Phase 2 name-aware merge gate (entity anchors + 3-gram cosine, dual-threshold `CLAWMEM_MERGE_SCORE_NORMAL`=0.93 / `_STRICT`=0.98) blocks cross-entity merges. Merge-time contradiction gate (heuristic + LLM) routes blocked merges to `link` (default, inserts `contradicts` edge) or `supersede` (old row `status='inactive'`) via `CLAWMEM_CONTRADICTION_POLICY`. |
  | Deductive synthesis | supporting, contradicts | Consolidation worker Phase 3 (every ~15 min) | Combines 2-3 related observations (decision/preference/milestone/problem, last 7 days) into `content_type='deductive'` docs. **v0.7.1 anti-contamination:** deterministic pre-checks (empty/invalid_indices/pool-only entity contamination) + LLM validator (fail-open, `validatorFallbackAccepts` counter) + dedupe. Per-reason rejection stats via `DeductiveSynthesisStats`. Contradictory dedupe matches linked via `contradicts` edges. |
+ | Conversation synthesis | semantic, supporting, contradicts, causal, temporal, entity | `clawmem mine <dir> --synthesize` (opt-in, post-index) | **v0.7.2.** Two-pass LLM pipeline over freshly imported `content_type='conversation'` docs. Pass 1 extracts structured decision/preference/milestone/problem facts + aliases + cross-fact links, saves via dedup-aware `saveMemory`, populates ambiguity-aware local Set map. Pass 2 resolves links (local first, SQL fallback with `LIMIT 2` ambiguity detection), upserts relations via `ON CONFLICT DO UPDATE SET weight = MAX(weight, excluded.weight)`. Synthesized paths are a pure function of `(sourceDocId, slug(title), short sha256(normalizedTitle))` so reruns update in place. All failures non-fatal. Counters split: `llmFailures` (LLM/parse error) vs `docsWithNoFacts` (valid empty extraction). |
 
  **Graph traversal asymmetry:** `adaptiveTraversal()` traverses all edge types outbound (source->target) but only `semantic` and `entity` inbound.
 
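The "pure function of `(sourceDocId, slug(title), short sha256(normalizedTitle))`" claim in the row above, re-created standalone. `slugify` and the path builder are private to `conversation-synthesis.ts` while `normalizeTitle` is exported; the bodies below are copied from the new file later in this diff:

```ts
function normalizeTitle(title: string): string {
  return title.toLowerCase().trim().replace(/\s+/g, " ");
}

function slugify(title: string): string {
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "")
    .slice(0, 50);
  return slug || "untitled";
}

function buildSynthesizedPath(sourceDocId: number, title: string): string {
  const hasher = new Bun.CryptoHasher("sha256");
  hasher.update(normalizeTitle(title));
  const shortHash = hasher.digest("hex").slice(0, 8);
  return `synthesized/${slugify(title)}-src${sourceDocId}-${shortHash}.md`;
}

// Same title on a rerun -> same path, so saveMemory updates in place:
buildSynthesizedPath(12, "Use OAuth 2.0 with PKCE");
// -> "synthesized/use-oauth-2-0-with-pkce-src12-<hash>.md" on every run

// Same slug, different titles -> distinct hash suffixes, no clobbering:
buildSynthesizedPath(12, "Use OAuth."); // ...use-oauth-src12-<hashA>.md
buildSynthesizedPath(12, "Use OAuth!"); // ...use-oauth-src12-<hashB>.md
```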
@@ -589,7 +590,7 @@ Phase 3 deductive synthesis applies the same `contradicts` link for any draft th
  - Do NOT pin everything — pin is for persistent high-priority items.
  - Do NOT forget memories to "clean up" — let confidence decay and contradiction detection handle it.
  - Do NOT run `build_graphs` after every reindex — A-MEM creates per-doc links automatically.
- - Do NOT run `clawmem mine` autonomously — it is a bulk ingestion command. Suggest it to the user when they mention old conversation exports, but let them run it.
+ - Do NOT run `clawmem mine` autonomously — it is a bulk ingestion command. Suggest it to the user when they mention old conversation exports, but let them run it. **v0.7.2 adds `--synthesize`** — an opt-in post-import LLM fact extraction pass. Also requires user consent because it drives one extra LLM call per conversation doc. Suggest both together when the user wants searchable structured memory from raw chat exports.
  - Do NOT use `diary_write` in Claude Code — hooks capture this automatically. Diary is for non-hooked environments only (Hermes, Gemini, plain MCP).
  - Do NOT use `kg_query` for causal "why" questions — use `intent_search` or `memory_retrieve`. `kg_query` returns structured entity facts (SPO triples), not reasoning chains.
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "clawmem",
-   "version": "0.7.1",
+   "version": "0.7.2",
    "description": "On-device context engine and memory for AI agents. Claude Code and OpenClaw. Hooks + MCP server + hybrid RAG search.",
    "type": "module",
    "bin": {
package/src/clawmem.ts CHANGED
@@ -242,12 +242,14 @@ async function cmdMine(args: string[]) {
        collection: { type: "string", short: "c" },
        embed: { type: "boolean", default: false },
        "dry-run": { type: "boolean", default: false },
+       synthesize: { type: "boolean", default: false },
+       "synthesis-max-docs": { type: "string" },
      },
      allowPositionals: true,
    });
 
    const dir = positionals[0];
-   if (!dir) die("Usage: clawmem mine <directory> [-c collection-name] [--embed] [--dry-run]");
+   if (!dir) die("Usage: clawmem mine <directory> [-c collection-name] [--embed] [--dry-run] [--synthesize] [--synthesis-max-docs N]");
    const absDir = pathResolve(dir);
    if (!existsSync(absDir)) die(`Directory not found: ${absDir}`);
 
@@ -319,6 +321,32 @@ async function cmdMine(args: string[]) {
    const stats = await indexCollection(s, collectionName, stagingDir, "**/*.md");
    console.log(` ${c.green}+${stats.added}${c.reset} added, ${c.yellow}~${stats.updated}${c.reset} updated, ${c.dim}=${stats.unchanged}${c.reset} unchanged`);
 
+   // Ext 4 — post-import conversation synthesis (opt-in via --synthesize)
+   // Runs AFTER indexCollection has committed. Failure is non-fatal and never
+   // rolls back the mine import.
+   if (values.synthesize) {
+     const maxDocs = values["synthesis-max-docs"]
+       ? parseInt(values["synthesis-max-docs"] as string, 10)
+       : undefined;
+     console.log(`\n${c.cyan}Running post-import conversation synthesis${c.reset}`);
+     try {
+       const { runConversationSynthesis } = await import("./conversation-synthesis.ts");
+       const llm = getDefaultLlamaCpp();
+       const synthResult = await runConversationSynthesis(s, llm, {
+         collection: collectionName,
+         maxDocs: Number.isFinite(maxDocs) && (maxDocs as number) > 0 ? maxDocs : undefined,
+       });
+       console.log(
+         ` ${c.green}${synthResult.factsSaved}${c.reset} facts saved, ` +
+           `${c.green}${synthResult.linksResolved}${c.reset} links resolved, ` +
+           `${c.yellow}${synthResult.linksUnresolved}${c.reset} unresolved, ` +
+           `${c.dim}${synthResult.llmFailures} LLM failure(s), ${synthResult.docsWithNoFacts} docs with no facts${c.reset}`,
+       );
+     } catch (err) {
+       console.log(` ${c.yellow}Synthesis failed (mine import preserved):${c.reset} ${err}`);
+     }
+   }
+
    if (values.embed) {
      console.log();
      await cmdEmbed([]);
@@ -2500,7 +2528,7 @@ ${c.bold}Setup:${c.reset}
 
  ${c.bold}Indexing:${c.reset}
    clawmem update [--pull] [--embed]                      Re-scan collections (--embed auto-embeds)
-   clawmem mine <dir> [-c name] [--embed]                 Import conversation exports (Claude, ChatGPT, Slack)
+   clawmem mine <dir> [-c name] [--embed] [--synthesize]  Import conversation exports (Claude, ChatGPT, Slack); --synthesize runs post-import LLM fact extraction
    clawmem embed [-f]                                     Generate fragment embeddings
    clawmem reindex [--force] [--enrich]                   Full re-index (--enrich: run entity extraction + links on all docs)
    clawmem watch                                          File watcher daemon
package/src/conversation-synthesis.ts ADDED
@@ -0,0 +1,637 @@
+ /**
+  * conversation-synthesis.ts — Post-import conversation synthesis pipeline (v0.7.2, Ext 4)
+  *
+  * Runs AFTER `clawmem mine` completes indexing. Operates on imported conversation
+  * docs to extract structured knowledge facts (decisions / preferences / milestones /
+  * problems) and cross-document relations via a two-pass LLM pipeline.
+  *
+  * Pass 1 — Extract facts:
+  *   - For each conversation doc in the target collection, ask the LLM for structured
+  *     facts with {title, contentType, narrative, facts, aliases, links}
+  *   - Save each fact via the dedup-aware saveMemory API
+  *   - Populate a localMap keyed by normalized title + aliases → Set<docId>
+  *
+  * Pass 2 — Resolve links:
+  *   - For each extracted fact, resolve its links[] targetTitle via localMap first,
+  *     fall back to SQL lookup scoped to the same collection
+  *   - Insert memory_relations via a weight-monotonic upsert
+  *     (ON CONFLICT DO UPDATE SET weight = MAX(weight, excluded.weight))
+  *
+  * Failure modes are all non-fatal:
+  *   - null LLM call → increment llmFailures, continue
+  *   - invalid JSON → increment llmFailures, skip doc, continue
+  *   - unresolved or ambiguous link target → increment linksUnresolved, continue
+  *   - any error inside the pipeline never bubbles to the mine import
+  *
+  * Invoked only when `clawmem mine <dir> --synthesize` is passed (off by default).
+  */
+
+ import type { Store } from "./store.ts";
+ import type { LlamaCpp } from "./llm.ts";
+ import { extractJsonFromLLM } from "./amem.ts";
+ import type { ContentType } from "./memory.ts";
+
+ // =============================================================================
+ // Constants
+ // =============================================================================
+
+ const DEFAULT_MAX_DOCS = 20;
+ const DEFAULT_CONTENT_TYPE_FILTER: ContentType[] = ["conversation"];
+ const DEFAULT_LINK_WEIGHT = 0.6;
+ const DEFAULT_CONFIDENCE = 0.7;
+ const DEFAULT_QUALITY_SCORE = 0.6;
+ const CONVERSATION_TRUNCATE_CHARS = 3000;
+ const LLM_MAX_TOKENS = 1200;
+ const LLM_TEMPERATURE = 0.3;
+
+ /** Content types that the extractor is allowed to emit for synthesized facts. */
+ const VALID_EXTRACTED_TYPES = new Set<ContentType>([
+   "decision",
+   "preference",
+   "milestone",
+   "problem",
+ ]);
+
+ /** Relation types the extractor may propose — must match the post-P0 taxonomy. */
+ const VALID_RELATION_TYPES = new Set<string>([
+   "semantic",
+   "supporting",
+   "contradicts",
+   "causal",
+   "temporal",
+   "entity",
+ ]);
+
+ // =============================================================================
+ // Public types (per THOTH_EXTRACTION_PLAN.md Ext 4 spec)
+ // =============================================================================
+
+ export type SynthesizeOptions = {
+   /** Required — only operate on this imported collection. */
+   collection: string;
+   /** Log what would happen but don't insert facts or relations. */
+   dryRun?: boolean;
+   /** Cap total conversation docs scanned per run (default 20). */
+   maxDocs?: number;
+   /** Content types to target for synthesis (default ["conversation"]). */
+   contentTypeFilter?: ContentType[];
+ };
+
+ export type ExtractedFactLink = {
+   targetTitle: string;
+   relationType: string;
+   weight?: number;
+ };
+
+ export type ExtractedFact = {
+   title: string;
+   contentType: ContentType;
+   narrative: string;
+   facts?: string[];
+   aliases?: string[];
+   sourceDocId: number;
+   links?: ExtractedFactLink[];
+ };
+
+ export type SynthesisResult = {
+   docsScanned: number;
+   factsExtracted: number;
+   factsSaved: number;
+   linksResolved: number;
+   /**
+    * Links where the target could not be resolved to a single unique docId.
+    * Includes unknown targets AND ambiguous multi-match targets (Turn 13 fix).
+    */
+   linksUnresolved: number;
+   /**
+    * Docs where the LLM path itself failed — null response, thrown error,
+    * or invalid JSON that couldn't be parsed into an array.
+    */
+   llmFailures: number;
+   /**
+    * Docs where the LLM responded with a valid but empty (or all-invalid)
+    * extraction — distinct from LLM failures so operators can diagnose
+    * "LLM is broken" vs "conversation had no structured facts".
+    */
+   docsWithNoFacts: number;
+ };
+
+ // =============================================================================
+ // Helpers
+ // =============================================================================
+
+ /** Normalize a title or alias for localMap keying. */
+ export function normalizeTitle(title: string): string {
+   return title.toLowerCase().trim().replace(/\s+/g, " ");
+ }
+
+ /** Slugify a title for stable synthesized path generation. */
+ function slugify(title: string): string {
+   const slug = title
+     .toLowerCase()
+     .replace(/[^a-z0-9]+/g, "-")
+     .replace(/^-+|-+$/g, "")
+     .slice(0, 50);
+   return slug || "untitled";
+ }
+
+ /** Render an extracted fact as markdown for the body field. */
+ export function renderFactBody(fact: ExtractedFact): string {
+   const lines: string[] = [
+     `# ${fact.title}`,
+     "",
+     fact.narrative,
+   ];
+
+   if (fact.facts && fact.facts.length > 0) {
+     lines.push("", "## Supporting facts");
+     for (const f of fact.facts) {
+       lines.push(`- ${f}`);
+     }
+   }
+
+   if (fact.aliases && fact.aliases.length > 0) {
+     lines.push("", `**Aliases:** ${fact.aliases.join(", ")}`);
+   }
+
+   lines.push("", `_Synthesized from conversation doc #${fact.sourceDocId}._`);
+   return lines.join("\n");
+ }
+
+ /**
+  * Build the LLM prompt for conversation fact extraction.
+  * Exported for test inspection.
+  */
+ export function buildExtractionPrompt(conversationText: string): string {
+   const content = conversationText.slice(0, CONVERSATION_TRUNCATE_CHARS);
+   return `Analyze this conversation and extract structured knowledge facts.
+
+ Conversation:
+ ${content}
+
+ Extract discrete facts as a JSON array. Each fact should represent ONE of:
+ - "decision": a choice made, architectural decision, tool selection
+ - "preference": a stated preference, convention, or style rule
+ - "milestone": a completed deliverable, version release, or event
+ - "problem": a bug, issue, or constraint discovered
+
+ For each fact provide:
+ - title: concise 3-8 word title (becomes the fact identity)
+ - contentType: one of [decision, preference, milestone, problem]
+ - narrative: 1-3 sentence description of the fact in context
+ - facts: optional array of supporting fact strings (evidence)
+ - aliases: optional alternative titles for linking (e.g., ["OAuth choice"] for "Use OAuth 2.0")
+ - links: optional array of cross-fact references. Each link is
+   {targetTitle, relationType, weight}
+   - targetTitle may refer to another fact extracted from this conversation OR from
+     any other conversation in the same imported batch. Prefer an exact title, and
+     if you have multiple candidates use a canonical alias.
+   - relationType MUST be one of: semantic, supporting, contradicts, causal, temporal, entity
+   - weight is 0.0-1.0 (default 0.6)
+
+ Only extract facts the conversation clearly supports. Do NOT fabricate.
+ Return ONLY valid JSON array. Return empty array [] if no structured facts found.
+
+ Example output:
+ [
+   {
+     "title": "Use OAuth 2.0 with PKCE",
+     "contentType": "decision",
+     "narrative": "Team decided to use OAuth 2.0 with PKCE for user authentication, replacing session cookies.",
+     "facts": ["PKCE chosen for mobile support", "Legacy session auth to be deprecated Q2"],
+     "aliases": ["OAuth decision", "switch to OAuth"],
+     "links": [
+       { "targetTitle": "Deprecate session auth", "relationType": "causal", "weight": 0.8 }
+     ]
+   }
+ ]`;
+ }
+
+ /**
+  * Validate + normalize a single raw fact object from LLM output.
+  * Returns null if the fact is malformed or uses a disallowed content/relation type.
+  */
+ function normalizeExtractedFact(
+   raw: unknown,
+   sourceDocId: number,
+ ): ExtractedFact | null {
+   if (!raw || typeof raw !== "object") return null;
+   const obj = raw as Record<string, unknown>;
+
+   const title = typeof obj.title === "string" ? obj.title.trim() : "";
+   if (!title) return null;
+
+   const contentType = obj.contentType;
+   if (typeof contentType !== "string") return null;
+   if (!VALID_EXTRACTED_TYPES.has(contentType as ContentType)) return null;
+
+   const narrative = typeof obj.narrative === "string" ? obj.narrative.trim() : "";
+   if (!narrative) return null;
+
+   const facts: string[] = Array.isArray(obj.facts)
+     ? obj.facts.filter((f): f is string => typeof f === "string" && f.trim().length > 0)
+     : [];
+
+   const aliases: string[] = Array.isArray(obj.aliases)
+     ? obj.aliases.filter((a): a is string => typeof a === "string" && a.trim().length > 0)
+     : [];
+
+   const links: ExtractedFactLink[] = Array.isArray(obj.links)
+     ? (obj.links as unknown[])
+         .map((l) => {
+           if (!l || typeof l !== "object") return null;
+           const link = l as Record<string, unknown>;
+           const targetTitle =
+             typeof link.targetTitle === "string" ? link.targetTitle.trim() : "";
+           const relationType =
+             typeof link.relationType === "string" ? link.relationType : "";
+           if (!targetTitle || !VALID_RELATION_TYPES.has(relationType)) return null;
+           const weight =
+             typeof link.weight === "number" && Number.isFinite(link.weight)
+               ? Math.max(0, Math.min(1, link.weight))
+               : DEFAULT_LINK_WEIGHT;
+           return { targetTitle, relationType, weight };
+         })
+         .filter((l): l is ExtractedFactLink => l !== null)
+     : [];
+
+   return {
+     title,
+     contentType: contentType as ContentType,
+     narrative,
+     facts,
+     aliases,
+     sourceDocId,
+     links,
+   };
+ }
+
+ /**
+  * Extract facts from a single conversation doc via LLM.
+  *
+  * Return value discriminates failure mode (Turn 13 fix):
+  * - `null` → LLM itself failed: null response, thrown error, or non-array JSON
+  * - `[]` → LLM responded with a valid but empty extraction (or all facts rejected by normalize)
+  * - `[fact..]` → at least one valid fact extracted
+  *
+  * Callers use this distinction to split `llmFailures` from `docsWithNoFacts`.
+  *
+  * Exported for unit testing.
+  */
+ export async function extractFactsFromConversation(
+   llm: LlamaCpp,
+   conversationText: string,
+   sourceDocId: number,
+ ): Promise<ExtractedFact[] | null> {
+   const prompt = buildExtractionPrompt(conversationText);
+
+   let result;
+   try {
+     result = await llm.generate(prompt, {
+       temperature: LLM_TEMPERATURE,
+       maxTokens: LLM_MAX_TOKENS,
+     });
+   } catch (err) {
+     console.log(`[synthesis] LLM generate threw for doc ${sourceDocId}:`, err);
+     return null;
+   }
+
+   if (!result || typeof result.text !== "string") return null;
+
+   const parsed = extractJsonFromLLM(result.text);
+   if (!Array.isArray(parsed)) return null;
+
+   const facts: ExtractedFact[] = [];
+   for (const raw of parsed) {
+     const fact = normalizeExtractedFact(raw, sourceDocId);
+     if (fact) facts.push(fact);
+   }
+   return facts;
+ }
+
+ /**
+  * Resolve a link target to a UNIQUE docId via localMap first, then a SQL
+  * fallback scoped to the same collection.
+  *
+  * Ambiguity handling (Turn 13 fix):
+  * - localMap stores a Set<number> per normalized title/alias. If a key maps
+  *   to more than one docId (two different synthesized facts share the same
+  *   title or alias), the resolver returns `null` — the caller counts this
+  *   as unresolved/ambiguous instead of silently binding to one candidate.
+  * - SQL fallback issues a LIMIT 2 query and returns `null` if more than
+  *   one row matches.
+  *
+  * Exported for unit testing.
+  */
+ export function resolveLinkTarget(
+   store: Store,
+   localMap: Map<string, Set<number>>,
+   titleOrAlias: string,
+   collection: string,
+ ): number | null {
+   const normalized = normalizeTitle(titleOrAlias);
+   if (!normalized) return null;
+
+   const localHits = localMap.get(normalized);
+   if (localHits && localHits.size > 0) {
+     if (localHits.size === 1) {
+       // localHits.values().next().value is the sole docId
+       const first = localHits.values().next().value;
+       return typeof first === "number" ? first : null;
+     }
+     // Ambiguous — two or more synthesized facts claim this title/alias
+     console.log(
+       `[synthesis] Ambiguous local target "${titleOrAlias}" — ${localHits.size} candidates, treated as unresolved`,
+     );
+     return null;
+   }
+
+   try {
+     const rows = store.db
+       .prepare(
+         `SELECT id
+          FROM documents
+          WHERE collection = ?
+            AND active = 1
+            AND LOWER(TRIM(title)) = ?
+          ORDER BY created_at DESC
+          LIMIT 2`,
+       )
+       .all(collection, normalized) as Array<{ id: number }>;
+
+     if (rows.length === 0) return null;
+     if (rows.length > 1) {
+       console.log(
+         `[synthesis] Ambiguous SQL target "${titleOrAlias}" in collection '${collection}' — multiple matches, treated as unresolved`,
+       );
+       return null;
+     }
+     return rows[0]!.id;
+   } catch (err) {
+     console.log(`[synthesis] SQL lookup failed for "${titleOrAlias}":`, err);
+     return null;
+   }
+ }
+
+ // =============================================================================
+ // Main orchestrator
+ // =============================================================================
+
+ /**
+  * Helper: add a docId to the localMap under `key`. Uses Set<number> so we can
+  * detect ambiguous collisions (two different facts claiming the same title/alias).
+  * Turn 13 fix — previous implementation silently overwrote on collision.
+  */
+ function addToLocalMap(
+   localMap: Map<string, Set<number>>,
+   key: string,
+   docId: number,
+ ): void {
+   if (!key) return;
+   const existing = localMap.get(key);
+   if (existing) {
+     existing.add(docId);
+   } else {
+     localMap.set(key, new Set([docId]));
+   }
+ }
+
+ /**
+  * Build a stable synthesized path for a fact (Turn 14 fix).
+  *
+  * The path is a pure function of (sourceDocId, slug(title), hash(normalized title)),
+  * with NO dependence on extraction order. This means:
+  * - Reruns over the same conversation batch hit saveMemory's
+  *   UNIQUE(collection, path) update branch and keep the same synthesized
+  *   document in place, even when the LLM's fact order changes.
+  * - Two different facts with the same slug (e.g., "Use OAuth." and
+  *   "Use OAuth!" both slugify to "use-oauth") get distinct hash suffixes
+  *   because the full normalized title differs, so they do not clobber
+  *   each other in the UNIQUE(collection, path) constraint.
+  *
+  * Turn 13 used a per-run encounter counter which was order-dependent: if the
+  * LLM re-emitted the two same-slug facts in reversed order on a subsequent
+  * run, the `-2` suffix would land on the other fact and saveMemory would
+  * overwrite each row with the wrong body. The hash version is stable.
+  */
+ function buildSynthesizedPath(sourceDocId: number, title: string): string {
+   const baseSlug = slugify(title);
+   const hasher = new Bun.CryptoHasher("sha256");
+   hasher.update(normalizeTitle(title));
+   const shortHash = hasher.digest("hex").slice(0, 8);
+   return `synthesized/${baseSlug}-src${sourceDocId}-${shortHash}.md`;
+ }
+
+ /**
+  * Run the two-pass conversation synthesis pipeline over a collection's
+  * imported conversation documents.
+  *
+  * Failure of this pipeline NEVER aborts or rolls back an upstream mine import —
+  * the caller should invoke this AFTER indexCollection has committed its changes.
+  */
+ export async function runConversationSynthesis(
+   store: Store,
+   llm: LlamaCpp,
+   opts: SynthesizeOptions,
+ ): Promise<SynthesisResult> {
+   const {
+     collection,
+     dryRun = false,
+     maxDocs = DEFAULT_MAX_DOCS,
+     contentTypeFilter = DEFAULT_CONTENT_TYPE_FILTER,
+   } = opts;
+
+   const result: SynthesisResult = {
+     docsScanned: 0,
+     factsExtracted: 0,
+     factsSaved: 0,
+     linksResolved: 0,
+     linksUnresolved: 0,
+     llmFailures: 0,
+     docsWithNoFacts: 0,
+   };
+
+   if (!collection) {
+     console.log(`[synthesis] No collection specified — skipping`);
+     return result;
+   }
+   if (contentTypeFilter.length === 0) {
+     console.log(`[synthesis] Empty contentTypeFilter — skipping`);
+     return result;
+   }
+
+   let docs: Array<{ id: number; title: string; body: string }>;
+   try {
+     const placeholders = contentTypeFilter.map(() => "?").join(",");
+     docs = store.db
+       .prepare(
+         `SELECT d.id, d.title, c.doc as body
+          FROM documents d
+          JOIN content c ON c.hash = d.hash
+          WHERE d.collection = ?
+            AND d.active = 1
+            AND d.content_type IN (${placeholders})
+          ORDER BY d.created_at ASC, d.id ASC
+          LIMIT ?`,
+       )
+       .all(collection, ...contentTypeFilter, maxDocs) as Array<{
+       id: number;
+       title: string;
+       body: string;
+     }>;
+   } catch (err) {
+     console.log(`[synthesis] Query failed for collection '${collection}':`, err);
+     return result;
+   }
+
+   if (docs.length === 0) {
+     console.log(
+       `[synthesis] No matching docs in collection '${collection}' (types=${contentTypeFilter.join(",")})`,
+     );
+     return result;
+   }
+
+   console.log(
+     `[synthesis] Pass 1 — extracting facts from ${docs.length} doc(s) in '${collection}'${dryRun ? " (dry run)" : ""}`,
+   );
+
+   // Pass 1 — extract + save + populate localMap
+   // Each fact carries its resolved docId so Pass 2 can reference it without
+   // re-querying. In dryRun mode we only count, we do not persist anything.
+   type SavedFact = ExtractedFact & { _savedDocId: number };
+   const saved: SavedFact[] = [];
+   const localMap = new Map<string, Set<number>>();
+
+   for (const doc of docs) {
+     result.docsScanned++;
+
+     const extracted = await extractFactsFromConversation(llm, doc.body, doc.id);
+
+     if (extracted === null) {
+       // LLM path failed (null / thrown / non-array)
+       result.llmFailures++;
+       continue;
+     }
+
+     if (extracted.length === 0) {
+       // LLM returned a valid response but there were no structured facts
+       // to extract (or all candidates were rejected by normalize).
+       result.docsWithNoFacts++;
+       continue;
+     }
+
+     for (const fact of extracted) {
+       result.factsExtracted++;
+
+       if (dryRun) continue;
+
+       try {
+         const saveResult = store.saveMemory({
+           collection,
+           path: buildSynthesizedPath(doc.id, fact.title),
+           title: fact.title,
+           body: renderFactBody(fact),
+           contentType: fact.contentType,
+           confidence: DEFAULT_CONFIDENCE,
+           qualityScore: DEFAULT_QUALITY_SCORE,
+           semanticPayload: `${fact.title}\n${fact.narrative}`,
+         });
+
+         if (!saveResult.docId || saveResult.docId < 0) continue;
+
+         if (saveResult.action === "inserted" || saveResult.action === "updated") {
+           result.factsSaved++;
+         }
+
+         // Populate localMap with the canonical title and every alias. Using
+         // Set<number> means a second fact claiming the same title/alias will
+         // make the key ambiguous and the resolver returns null instead of
+         // silently picking one. (Turn 13 fix.)
+         addToLocalMap(localMap, normalizeTitle(fact.title), saveResult.docId);
+         for (const alias of fact.aliases ?? []) {
+           addToLocalMap(localMap, normalizeTitle(alias), saveResult.docId);
+         }
+
+         saved.push({ ...fact, _savedDocId: saveResult.docId });
+       } catch (err) {
+         console.log(`[synthesis] saveMemory error for "${fact.title}":`, err);
+       }
+     }
+   }
+
+   if (dryRun) {
+     console.log(
+       `[synthesis] Dry run complete — docsScanned=${result.docsScanned} factsExtracted=${result.factsExtracted} llmFailures=${result.llmFailures} docsWithNoFacts=${result.docsWithNoFacts}`,
+     );
+     return result;
+   }
+
+   // Pass 2 — resolve links against localMap first, then collection-scoped SQL
+   console.log(
+     `[synthesis] Pass 2 — resolving links for ${saved.length} saved fact(s)`,
+   );
+
+   for (const fact of saved) {
+     if (!fact.links || fact.links.length === 0) continue;
+     const sourceDocId = fact._savedDocId;
+
+     for (const link of fact.links) {
+       const targetId = resolveLinkTarget(
+         store,
+         localMap,
+         link.targetTitle,
+         collection,
+       );
+
+       if (targetId === null || targetId === sourceDocId) {
+         result.linksUnresolved++;
+         if (targetId !== sourceDocId) {
+           console.log(
+             `[synthesis] Unresolved link "${link.targetTitle}" from doc ${sourceDocId}`,
+           );
+         }
+         continue;
+       }
+
+       try {
+         // Idempotent-yet-evidence-preserving upsert (Turn 13 fix):
+         // INSERT OR IGNORE under-accumulated — it discarded later runs that
+         // had stronger evidence for the same triple.
+         // store.insertRelation over-accumulated (weight += excluded.weight) —
+         // it inflated weights linearly with rerun count.
+         // `ON CONFLICT DO UPDATE SET weight = MAX(weight, excluded.weight)`
+         // is idempotent on reruns with equal weight AND monotonically accepts
+         // later-discovered stronger evidence for the same (source, target, type)
+         // triple without double-counting.
+         store.db
+           .prepare(
+             `INSERT INTO memory_relations
+                (source_id, target_id, relation_type, weight, metadata, created_at)
+              VALUES (?, ?, ?, ?, ?, ?)
+              ON CONFLICT(source_id, target_id, relation_type)
+              DO UPDATE SET weight = MAX(weight, excluded.weight)`,
+           )
+           .run(
+             sourceDocId,
+             targetId,
+             link.relationType,
+             link.weight ?? DEFAULT_LINK_WEIGHT,
+             JSON.stringify({ origin: "conversation-synthesis" }),
+             new Date().toISOString(),
+           );
+         result.linksResolved++;
+       } catch (err) {
+         console.log(
+           `[synthesis] insertRelation failed ${sourceDocId}->${targetId} (${link.relationType}):`,
+           err,
+         );
+         result.linksUnresolved++;
+       }
+     }
+   }
+
+   console.log(
+     `[synthesis] Complete — docsScanned=${result.docsScanned} factsExtracted=${result.factsExtracted} factsSaved=${result.factsSaved} linksResolved=${result.linksResolved} linksUnresolved=${result.linksUnresolved} llmFailures=${result.llmFailures} docsWithNoFacts=${result.docsWithNoFacts}`,
+   );
+
+   return result;
+ }
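A sketch of the tri-state contract `extractFactsFromConversation` exposes, driving it with a stub in place of the real llama.cpp handle. It assumes `extractJsonFromLLM` parses a raw JSON string as-is, which is how this module consumes it; the stub implements only `generate()`, the one method this function touches, and the cast papers over the rest of the real `LlamaCpp` surface:

```ts
import { extractFactsFromConversation } from "./conversation-synthesis.ts";
import type { LlamaCpp } from "./llm.ts";

// Stub that ignores the prompt and returns canned text.
const stub = (text: string): LlamaCpp =>
  ({ generate: async () => ({ text }) }) as unknown as LlamaCpp;

// One valid fact -> non-empty array
const ok = await extractFactsFromConversation(
  stub(JSON.stringify([{
    title: "Use OAuth 2.0 with PKCE",
    contentType: "decision",
    narrative: "Team decided to adopt OAuth 2.0 with PKCE.",
  }])),
  "hypothetical conversation text",
  42,
);
console.log(ok?.length); // 1

// Valid but empty extraction -> [] (caller counts docsWithNoFacts)
console.log(await extractFactsFromConversation(stub("[]"), "...", 42)); // []

// Non-array JSON -> null (caller counts llmFailures)
console.log(await extractFactsFromConversation(stub(`{"oops":true}`), "...", 42)); // null
```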