get-claudia 1.56.1 → 1.58.0

package/CHANGELOG.md CHANGED
@@ -2,6 +2,67 @@
 
  All notable changes to Claudia will be documented in this file.
 
+ ## 1.58.0 (2026-05-13)
+
+ ### The Memory Reliability Release
+
+ Five PRs that fix the memory layer's biggest recurring failure mode and lock in the integration philosophy. After this release, memory writes that name entities ("Matt Blumberg") actually create those entities with the correct type. The release history's recurring memory-fix releases ("Recall Recovery", "Vector Search Fix", "Semantic Search Actually Works Now") get permanent regression-test sentinels so the same bug classes can't quietly come back. And the codebase loses a dual-maintenance hazard that was already costing time.
+
+ #### Fixed
+ - **`memory_remember` actually links entities and infers their type correctly (#54)** -- A confirmed bug from 2026-05-13: calling `memory_remember(content="Matt Blumberg said X", entities=["Matt Blumberg", "Markup AI"])` was creating entities but assigning them `type: person` by default, even when the name clearly indicated an organization. "Markup AI" was being saved as a person. The real bug was in `_infer_entity_type` -- it didn't recognise `AI` / `.ai` / `Co.` as corporate suffixes and fell back to `person`. Fixed with a pure-function rule-based type inference (corporate suffixes -> organization, project keywords -> project, person patterns -> person, fallback -> concept, never default to person). Plus a new `claudia-memory --backfill-entities` CLI to retroactively link orphaned references in existing user databases.
+
+ #### Added
+ - **`claudia memory backfill-entities` command (#54)** -- Default dry-run: prints a plan and writes nothing. `--apply` makes a timestamped backup to `~/.claudia/backups/memory-{timestamp}.db` first, then applies the backfill. Idempotent: re-running on an already-backfilled DB is a no-op. Aborts cleanly if backup creation fails.
+ - **5 regression tests for recurring bug classes (#56)** -- New `memory-daemon/tests/test_recurring_regressions.py` adds permanent forward-looking sentinels for: entity linking on `memory_remember`, recall returning results after seed writes, embedding migration preserving recall, daemon startup tolerating stale SHM files, and `memory_briefing` returning a valid structure on an empty database. Each test docstring names the historical releases where its bug class appeared (v1.35.x, v1.51.5, v1.51.18, v1.55.7, v1.55.8, v1.55.14, v1.21.1, v1.40.1).
+ - **API parameter aliases for read-side MCP tools (#57)** -- `memory_about` now accepts `entity_name` and `name` alongside `entity`. `memory_relate` accepts `source_entity` / `target_entity` / `relationship_type` alongside `source` / `target` / `relationship`. `memory_recall` accepts `q` and `search` alongside `query`. Purely additive: every existing caller continues to work unchanged. Aliases normalize at the MCP boundary; service-layer signatures are untouched. If both canonical and alias are passed in the same call, canonical wins.
+
+ #### Removed
+ - **Rube (Composio) MCP integration as a bundled default (#41)** -- Rube is no longer a recommended or bundled MCP server in `.mcp.json.example` (root and template-v2), README, or the Claudia documentation. Locks in the direct-integrations-only philosophy (claude.ai-native MCPs + user-built custom MCPs like Gmail/Calendar). Existing users with `rube` already configured continue to work unchanged; the installer simply no longer ships Rube as an example. The "Tool configuration" example in `claudia-principles.md` was updated to vendor-neutral phrasing.
+ - **Legacy `claudia/` sibling files (#55)** -- Removed 3 stale sibling files (`post-tool-capture.py`, `session-health-check.py`, `settings.local.json`) that lived under `claudia/`. These were never reaching users (the installer ships from `template-v2/` only), but every hook bug fix had to remember to patch both locations. The dual-maintenance hazard was real: PR #38's sibling-fix step had to apply the same env-var fix twice. Removed at the source.
+
+ #### Stats
+ - **43 new tests** across 4 files (22 entity-resolution tests in #54, 5 regression sentinels in #56, 16 alias tests in #57)
+ - **805 total daemon tests passing** (up from 762 before the v1.57.0 chain), 0 regressions
+ - TDD sensitivity proofs for every behavior change: tests fail on the un-modified code, pass after the fix
+ - 5 PRs merged, all with stop-gates and TDD discipline
+
+ #### Notes
+ - The bug in #54 was different from the original proposal (#51) described. The proposal said "entities are silently ignored." Actually the entity *records* were getting created -- the bug was that they were all getting `type: person`. Fixing the actual bug rather than the imagined one was a better outcome.
+ - The `claudia memory backfill-entities` command surface lives on the daemon's argparse (alongside `--backfill-embeddings`, `--migrate-vault-para`), not as a `claudia memory ...` subcommand on the Node CLI. The Node CLI is the installer, not a memory-command dispatcher.
+ - Aliases are NOT yet advertised in the MCP `list_tools()` `inputSchema`. They are tolerantly accepted at the request boundary. Schema-level advertisement is a future enhancement if it proves needed for client discoverability.
+
+ ---
+
+ ## 1.57.0 (2026-05-13)
+
+ ### The Curated Memory Release
+
+ Five PRs that complete one thesis: **curated, judgment-driven memory capture, enforced at prompt time and persisted across sessions.** Claudia now catches the user's intent when it matters, persists canonical facts as they emerge, and writes a daily session summary so context survives across days.
+
+ #### Fixed
+ - **PostToolUse hook actually runs (#38)** -- The hook was reading `os.environ.get("CLAUDE_TOOL_NAME")`, which Claude Code never sets. Every install since the hook landed had been silently no-op'ing, so `~/.claudia/observations.jsonl` was never written. The hook now reads its payload from stdin per the documented hook contract. Includes a sibling fix to the legacy `claudia/.claude/hooks/post-tool-capture.py` for codebase consistency.
+
+ #### Added
+ - **Memory-commitment rule (#39)** -- A new always-active rule (`template-v2/.claude/rules/memory-commitment.md`) codifies when to save canonical facts immediately via `memory_remember` / `memory_batch` rather than batching to end-of-session reflection. Trigger phrases include "lock this in," "remember this," "this is canonical." Substantive-artifact discipline: at the end of producing a multi-file artifact, do a memory commitment pass and save the canonical facts as one bundled `memory_batch` call.
+ - **UserPromptSubmit hook with intent detection (#42)** -- A new hook (`template-v2/.claude/hooks/user-prompt-capture.py`) inspects the user's prompt at submit time and injects reminder context for two trigger classes. Class 1: canonical-fact phrases ("lock this in," "remember this," etc.) tell the agent to save immediately rather than wait for `/meditate`. Class 2: destructive command patterns (`rm -rf`, `git push --force`, `DROP TABLE`, etc.) trigger a "verify before acting" reminder per the safety-first principle. Destructive patterns are surfaced to the model as human-readable labels (`rm -rf (recursive delete)`), not raw regex, so the agent can reason about them clearly.
+ - **Daily session summary system (#40)** -- A new SessionEnd hook (`template-v2/.claude/hooks/session-summary.py`) writes a per-session markdown summary to `~/.claudia/sessions/YYYY-MM-DD/NN-slug.md` covering opening prompt, files touched, external actions, and find-this-again references. SessionStart now surfaces a 3-day digest of recent sessions via the existing health-check hook, so future-Claudia knows what past-Claudia worked on. PostToolUse hook gained `file_path` extraction for Write/Edit/MultiEdit/NotebookEdit and `external_action` labels for git push, gh repo create, vercel/netlify deploy, supabase db push, and direct MCP sends.
+ - **Explicit upgrade messaging (#50)** -- The installer now names `~/.claudia/` explicitly after an upgrade and lists what is preserved (entities, relationships, reflections, embeddings) instead of the generic "data preserved" phrasing. Users care about their accumulated memory graph; the previous wording did not signal that the database is safe.
+
+ #### Changed
+ - **External-action detection uses word-boundary regex (#40)** -- Previously a substring match, so `echo "git push for testing"` falsely fired the `external_action` flag. The new patterns anchor on command separators (line start, `;`, `&&`, `|`, `(`) and skip transparent prefixes (`sudo`, `nohup`, `time`, `env`). False positives on echoed/quoted strings are eliminated; real commands still fire.
+ - **PostToolUse output truncation 200 -> 300 chars (#40)** -- Room for the richer output context that includes `file_path` and `external_action` labels alongside the truncated stdout/stderr.
+
+ #### Stats
+ - 41 new hook tests in `tests/hooks/` (stdlib `unittest`, zero new dependencies), all passing in ~1.5s
+ - TDD sensitivity proofs for every behavior change: tests fail on the un-modified hook, pass after the fix
+ - 5 PRs merged, 0 regressions
+
+ #### Notes
+ - The brief that drove this chain emphasized one principle: **trust the existing user-file preservation policy (commit `efce9f2`)** rather than inventing a new upgrade framework. The installer's behavior didn't change; only the messaging did.
+ - The four hook PRs each landed with their own automated tests and TDD sensitivity proofs. The legacy `claudia/` subdirectory was kept in sync with the canonical `template-v2/` to avoid maintenance drift.
+
+ ---
+
  ## 1.56.1 (2026-04-11)
 
  ### Preserve User-Modified Skills on Upgrade
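
The 1.58.0 "Fixed" entry above describes the new type inference as an ordered rule chain: corporate suffixes, then project keywords, then person patterns, then a `concept` fallback. A minimal sketch of that ordering follows. The function name `sketch_infer_entity_type`, the suffix and keyword lists, and the person pattern are all assumptions made for illustration; the package's own inference code is not part of this diff.

```python
import re

# Hypothetical rule tables; the package's real lists are not shown in this diff.
CORPORATE_SUFFIXES = ("inc", "llc", "corp", "ltd", "co", "ai")
PROJECT_KEYWORDS = ("project", "initiative", "migration")
# Assumed person pattern: two or three capitalised words.
PERSON_RE = re.compile(r"^[A-Z][a-z]+(?: [A-Z][a-z]+){1,2}$")


def sketch_infer_entity_type(name: str) -> str:
    """Apply the changelog's rule order: organization, project, person, concept."""
    cleaned = name.strip()
    lowered = cleaned.lower()
    tokens = [tok.strip(".,") for tok in lowered.split()]
    # 1. Corporate suffixes, including a trailing ".ai" domain-style suffix.
    if lowered.endswith(".ai") or (tokens and tokens[-1] in CORPORATE_SUFFIXES):
        return "organization"
    # 2. Project keywords anywhere in the name.
    if any(tok in PROJECT_KEYWORDS for tok in tokens):
        return "project"
    # 3. Person-shaped names.
    if PERSON_RE.match(cleaned):
        return "person"
    # 4. The fallback is "concept", never "person".
    return "concept"


for sample in ("Markup AI", "Matt Blumberg", "Acme Co.", "Project Phoenix"):
    print(sample, "->", sketch_infer_entity_type(sample))
```

Because the suffix check runs before the person pattern, a name such as "Markup AI" is classified before any person rule can claim it, which is the ordering the changelog entry relies on.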
package/README.md CHANGED
@@ -320,19 +320,6 @@ This generates a one-click URL to enable all required Google APIs and walks you
  | **Extended** | 83 | Core + Docs, Sheets, Tasks, Chat |
  | **Complete** | 111 | Extended + Slides, Forms, Apps Script |
 
- ### 500+ Apps via Rube
-
- [Rube](https://rube.app) (by Composio) connects Claudia to Slack, Notion, Jira, GitHub, Linear, HubSpot, Stripe, Figma, and hundreds more through one-click OAuth. No per-app MCP setup needed.
-
- | Category | Examples |
- |----------|----------|
- | **Communication** | Slack, Discord, Teams, Telegram |
- | **Project Management** | Jira, Linear, Asana, Trello, Monday.com |
- | **Knowledge & Docs** | Notion, Confluence, Google Docs, Coda |
- | **Code & Dev** | GitHub, GitLab, Bitbucket |
- | **CRM & Sales** | HubSpot, Salesforce, Pipedrive |
- | **And 500+ more** | [Browse the full list](https://rube.app) |
-
  ### Obsidian Vault
 
  Memory auto-syncs to an Obsidian vault at `~/.claudia/vault/` using PARA structure. Every entity becomes a markdown note with `[[wikilinks]]`, so Obsidian's graph view maps your network. SQLite is the source of truth; the vault is a read-only projection you can browse and search.
package/bin/index.js CHANGED
@@ -877,7 +877,10 @@ async function main() {
  }
 
  console.log('');
- console.log(` ${colors.cyan}✓${colors.reset} Framework updated (data preserved)`);
+ console.log(` ${colors.cyan}✓${colors.reset} Framework updated`);
+ console.log(` • Your memory at ${colors.bold}~/.claudia/${colors.reset} is preserved (entities, relationships, reflections, embeddings).`);
+ console.log(` • Skills and hooks refreshed; any modifications you chose to keep were respected.`);
+ console.log(` • Restart Claude Code for changes to take effect.`);
  }
 
  // Self-heal: strip CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS from settings (#24)
@@ -1110,6 +1110,24 @@ def main():
          action="store_true",
          help="Preview mode for --migrate-vault-para: show routing plan without making changes",
      )
+     parser.add_argument(
+         "--backfill-entities",
+         action="store_true",
+         help=(
+             "Scan memories for un-linked entity references and propose "
+             "creating/linking them (Proposal #51). Dry-run by default; "
+             "pass --apply to write changes (creates a SQLite backup first)."
+         ),
+     )
+     parser.add_argument(
+         "--apply",
+         action="store_true",
+         help=(
+             "With --backfill-entities: actually write the changes. "
+             "A SQLite backup is created at ~/.claudia/backups/ before any "
+             "writes; if backup creation fails, the command aborts."
+         ),
+     )
      parser.add_argument(
          "--migrate-legacy",
          action="store_true",
@@ -1704,6 +1722,61 @@ def main():
          run_para_migration(vault_path, db=db, preview=args.preview)
          return
 
+     if args.backfill_entities:
+         # Entity-link backfill (Proposal #51). Dry-run by default; --apply
+         # writes after creating a SQLite backup.
+         setup_logging(debug=args.debug)
+         from datetime import datetime as _dt
+
+         from .services.backfill import (
+             apply_backfill,
+             format_plan_summary,
+             plan_backfill,
+         )
+
+         db = get_db()
+         db.initialize()
+
+         plan = plan_backfill(db)
+         print(format_plan_summary(plan))
+
+         if not args.apply:
+             # Dry-run path: we already printed the plan; nothing more to do.
+             return
+
+         # --apply: take the mandatory backup first.
+         config = get_config()
+         backups_dir = Path(config.backup_dir)
+         try:
+             backups_dir.mkdir(parents=True, exist_ok=True)
+         except OSError as e:
+             print(
+                 f"\nCannot create backup directory {backups_dir}: {e}\n"
+                 "Aborting before any database writes."
+             )
+             sys.exit(1)
+
+         timestamp = _dt.utcnow().strftime("%Y-%m-%dT%H%M%SZ")
+         backup_path = backups_dir / f"memory-{timestamp}.db"
+
+         try:
+             result = apply_backfill(db, plan, backup_path=backup_path)
+         except Exception as e:
+             print(
+                 f"\nBackfill aborted (no DB writes performed): {e}\n"
+                 f"Backup target was: {backup_path}"
+             )
+             sys.exit(1)
+
+         print(
+             "\nBackfill applied:\n"
+             f" backup written to: {result.backup_path}\n"
+             f" entities created: {result.entities_created}\n"
+             f" entities reused: {result.entities_reused}\n"
+             f" memory_entities links created: {result.links_created}"
+         )
+         return
+
      if args.merge_databases:
          # Manual consolidation of hash-named databases
          setup_logging(debug=args.debug)
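
The hunk above wires up a dry-run-by-default flow in which `--apply` takes a SQLite backup before any write. The sketch below reproduces that ordering on a throwaway in-memory database, using only the standard-library `sqlite3.Connection.backup` call. The function `backup_then_apply`, the single-column `entities` table, and the inserted row are hypothetical stand-ins, not the daemon's schema or code.

```python
import sqlite3
import tempfile
from pathlib import Path


def backup_then_apply(conn: sqlite3.Connection, backup_path: Path, apply: bool) -> bool:
    """Return True if a write happened; a dry run (apply=False) writes nothing."""
    if not apply:
        return False
    # Take the backup first; any failure raises before the write below.
    backup_path.parent.mkdir(parents=True, exist_ok=True)
    target = sqlite3.connect(str(backup_path))
    try:
        conn.backup(target)
    finally:
        target.close()
    if not backup_path.exists() or backup_path.stat().st_size == 0:
        raise RuntimeError(f"Backup file at {backup_path} is missing or empty")
    # Hypothetical write standing in for the real backfill.
    conn.execute("INSERT INTO entities (name) VALUES (?)", ("Example Org",))
    conn.commit()
    return True


def demo() -> tuple:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entities (name TEXT)")
    conn.commit()
    with tempfile.TemporaryDirectory() as tmp:
        backup = Path(tmp) / "backups" / "memory-demo.db"
        dry = backup_then_apply(conn, backup, apply=False)
        existed_after_dry_run = backup.exists()
        wet = backup_then_apply(conn, backup, apply=True)
        # The backup predates the insert, so it should hold zero rows.
        check = sqlite3.connect(str(backup))
        backup_rows = check.execute("SELECT COUNT(*) FROM entities").fetchone()[0]
        check.close()
    live_rows = conn.execute("SELECT COUNT(*) FROM entities").fetchone()[0]
    conn.close()
    return dry, existed_after_dry_run, wet, backup_rows, live_rows


print(demo())
```

The dry run leaves no backup file behind, and after the apply run the backup holds the pre-write state while the live database holds the new row.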
@@ -141,6 +141,79 @@ def _require(arguments: dict, key: str, tool_name: str):
      return value
 
 
+ # ── Parameter-name aliases (v1.58.0 PR E) ──
+ #
+ # The memory MCP tools historically used different parameter conventions
+ # (entity vs source/target/relationship vs query). The aliases below let
+ # callers use a consistent variant while every existing caller continues
+ # to work unchanged. Normalization happens here at the MCP boundary;
+ # service-layer signatures in claudia_memory/services/ are untouched.
+ #
+ # Rules:
+ #   1. Purely additive. The canonical name continues to work as before.
+ #   2. If both the canonical name and an alias are provided in the same
+ #      call, the canonical name wins (the alias is left in place and is
+ #      not consulted by the handler).
+ #   3. Otherwise, the first matching alias is renamed to the canonical
+ #      key and the alias key is removed from the arguments dict so the
+ #      handler only ever sees the canonical name.
+
+ _PARAM_ALIASES: Dict[str, Dict[str, List[str]]] = {
+     "memory_about": {
+         "entity": ["entity_name", "name"],
+     },
+     "memory_relate": {
+         "source": ["source_entity"],
+         "target": ["target_entity"],
+         "relationship": ["relationship_type"],
+     },
+     "memory_recall": {
+         "query": ["q", "search"],
+     },
+ }
+
+
+ def _normalize_params(arguments: dict, canonical: str, aliases: List[str]) -> dict:
+     """Resolve alias parameter names to the canonical name.
+
+     If the canonical name is already present in `arguments`, it wins and
+     `arguments` is returned unchanged. Otherwise, the first alias from
+     `aliases` that is present is renamed to the canonical key, and the
+     alias key is removed. If no alias matches either, `arguments` is
+     returned unchanged.
+
+     The function never mutates the caller's dict: when a rewrite is
+     needed it returns a shallow copy with the alias key replaced.
+     """
+     if canonical in arguments:
+         return arguments
+     for alias in aliases:
+         if alias in arguments:
+             arguments = dict(arguments)
+             arguments[canonical] = arguments.pop(alias)
+             return arguments
+     return arguments
+
+
+ def _apply_parameter_aliases(tool_name: str, arguments: dict) -> dict:
+     """Apply all registered aliases for the given tool, if any.
+
+     Tools without an entry in `_PARAM_ALIASES` (the vast majority) get
+     their arguments back unchanged. Dot-notation aliases (e.g.
+     'memory.about') are routed to the same canonical tool name for the
+     purposes of alias lookup.
+     """
+     # Dot-notation aliases (memory.about, etc.) share the same canonical
+     # alias map as their underscore counterparts.
+     lookup_name = tool_name.replace(".", "_", 1) if "." in tool_name else tool_name
+     aliases_for_tool = _PARAM_ALIASES.get(lookup_name)
+     if not aliases_for_tool:
+         return arguments
+     for canonical, alias_list in aliases_for_tool.items():
+         arguments = _normalize_params(arguments, canonical, alias_list)
+     return arguments
+
+
  MAX_RESPONSE_BYTES = 50_000
 
 
@@ -3142,6 +3215,10 @@ async def call_tool(name: str, arguments: Dict[str, Any]) -> CallToolResult:
      """Handle tool calls via dispatch registry."""
      db = get_db()
      try:
+         # Normalize parameter-name aliases at the MCP boundary so handlers
+         # only ever see canonical parameter names. Purely additive: tools
+         # without registered aliases are unchanged. See _PARAM_ALIASES.
+         arguments = _apply_parameter_aliases(name, arguments)
          with db.transaction():
              handler = _TOOL_HANDLERS.get(name)
              if handler:
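
The two hunks above define the alias map and apply it at the start of `call_tool`. The standalone sketch below mirrors that logic with a trimmed alias map so the three documented rules can be exercised directly. The un-prefixed names (`PARAM_ALIASES`, `normalize_params`, `apply_parameter_aliases`) and the sample arguments are illustrative copies, not imports from the package.

```python
from typing import Dict, List

# Trimmed copy of the alias map shown in the hunk above.
PARAM_ALIASES: Dict[str, Dict[str, List[str]]] = {
    "memory_about": {"entity": ["entity_name", "name"]},
    "memory_recall": {"query": ["q", "search"]},
}


def normalize_params(arguments: dict, canonical: str, aliases: List[str]) -> dict:
    if canonical in arguments:
        return arguments  # canonical wins; any alias key is left in place
    for alias in aliases:
        if alias in arguments:
            arguments = dict(arguments)  # copy so the caller's dict is untouched
            arguments[canonical] = arguments.pop(alias)
            return arguments
    return arguments


def apply_parameter_aliases(tool_name: str, arguments: dict) -> dict:
    # "memory.recall" and "memory_recall" share one alias entry.
    lookup_name = tool_name.replace(".", "_", 1)
    for canonical, alias_list in PARAM_ALIASES.get(lookup_name, {}).items():
        arguments = normalize_params(arguments, canonical, alias_list)
    return arguments


caller_args = {"q": "pricing decisions", "limit": 5}
normalized = apply_parameter_aliases("memory_recall", caller_args)
both_given = apply_parameter_aliases("memory.about", {"entity": "Acme", "name": "Other"})
print(normalized, caller_args, both_given)
```

In the first call the alias `q` is renamed to `query` on a copy, so `caller_args` keeps its original key. In the second call the canonical `entity` is already present, so the dict comes back unchanged with the `name` key still in it.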
@@ -0,0 +1,346 @@
+ """Entity-link backfill command (Proposal #51).
+
+ The pre-v1.58 write path linked entities to memories *most* of the time
+ but auto-created organisations as type=person and could miss entities
+ referenced only in the content (no ``about_entities`` array supplied).
+
+ This module scans existing memories that have no entity links and
+ proposes new entity creations + ``memory_entities`` rows. Two phases:
+
+ * ``plan_backfill(db)`` -- pure read. Returns a :class:`BackfillPlan`
+   with everything it would do. No writes.
+ * ``apply_backfill(db, plan, backup_path)`` -- writes. **First** creates
+   a SQLite backup at ``backup_path``. If backup fails, raises BEFORE
+   any DB modification.
+
+ CLI entry points live in ``claudia_memory/__main__.py``:
+ ``claudia-memory --backfill-entities`` (dry-run; default) and
+ ``claudia-memory --backfill-entities --apply``.
+
+ No new deps. No schema migrations. Idempotent on re-apply.
+ """
+
+ from __future__ import annotations
+
+ import logging
+ import re
+ import sqlite3
+ from dataclasses import dataclass, field
+ from datetime import datetime
+ from pathlib import Path
+ from typing import Any, Dict, List, Optional
+
+ from .entities import infer_entity_type
+
+ logger = logging.getLogger(__name__)
+
+
+ # ---------------------------------------------------------------------------
+ # Plan / Result dataclasses
+ # ---------------------------------------------------------------------------
+
+
+ @dataclass
+ class BackfillPlan:
+     """Read-only plan of what apply_backfill would do.
+
+     Attributes:
+         orphan_count: number of memory rows with zero memory_entities links
+             that the planner thinks SHOULD have at least one link.
+         proposed_entities: list of dicts ``{"name": str, "inferred_type":
+             str, "memory_ids": [int, ...]}``. Each dict represents a name
+             we detected in memory content for which we will (a) create the
+             entity if missing, (b) link it to those memories.
+         scanned_memories: total memories the planner looked at.
+     """
+
+     orphan_count: int = 0
+     proposed_entities: List[Dict[str, Any]] = field(default_factory=list)
+     scanned_memories: int = 0
+
+
+ @dataclass
+ class BackfillResult:
+     """Counts of writes performed by apply_backfill."""
+
+     entities_created: int = 0
+     entities_reused: int = 0
+     links_created: int = 0
+     backup_path: Optional[Path] = None
+
+
+ # ---------------------------------------------------------------------------
+ # Name detection -- intentionally conservative
+ # ---------------------------------------------------------------------------
+
+ # Two or more capitalised words: a reasonable signal for proper nouns.
+ # We won't catch single-word entities like "Acme" here -- that prevents a
+ # flood of false positives like "The", "She", "Monday" at sentence starts.
+ _PROPER_NOUN_RE = re.compile(r"\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b")
+
+ # Things we never want to propose as entity names.
+ _STOPWORDS = frozenset(
+     {
+         "Project",  # Without a following noun, this is the keyword itself.
+         "Inc",
+         "LLC",
+         "Corp",
+         "AI",
+         "Ltd",
+         "Co",
+     }
+ )
+
+
+ def _candidate_names(content: str) -> List[str]:
+     """Extract proper-noun candidate names from memory content.
+
+     Returns a list of unique, order-preserved candidates.
+     """
+     if not content:
+         return []
+
+     seen: Dict[str, None] = {}
+     for match in _PROPER_NOUN_RE.finditer(content):
+         raw = match.group(1).strip()
+         # Reject single-token stopwords we accidentally captured.
+         if raw in _STOPWORDS:
+             continue
+         seen.setdefault(raw, None)
+     return list(seen.keys())
+
+
+ # ---------------------------------------------------------------------------
+ # Phase 1: plan_backfill (NO writes)
+ # ---------------------------------------------------------------------------
+
+
+ def plan_backfill(db) -> BackfillPlan:
+     """Scan memories with no entity links and propose new links.
+
+     Args:
+         db: The Database object (sqlite wrapper).
+
+     Returns:
+         A :class:`BackfillPlan`. The caller can inspect ``orphan_count``
+         and ``proposed_entities`` before deciding to ``--apply``.
+
+     This function MUST NOT write to the database. Tests assert this.
+     """
+     plan = BackfillPlan()
+
+     # Find memories that have no entity link at all and have content
+     # that looks like it mentions someone or something.
+     rows = db.execute(
+         """
+         SELECT m.id, m.content
+         FROM memories m
+         LEFT JOIN memory_entities me ON m.id = me.memory_id
+         WHERE me.memory_id IS NULL
+         AND m.invalidated_at IS NULL
+         AND m.content IS NOT NULL
+         """,
+         fetch=True,
+     ) or []
+
+     plan.scanned_memories = len(rows)
+     if not rows:
+         return plan
+
+     # name -> {"inferred_type": str, "memory_ids": [int]}
+     by_name: Dict[str, Dict[str, Any]] = {}
+
+     for row in rows:
+         memory_id = row["id"]
+         content = row["content"]
+         names = _candidate_names(content)
+         if not names:
+             continue
+         plan.orphan_count += 1
+         for name in names:
+             entry = by_name.setdefault(
+                 name,
+                 {
+                     "name": name,
+                     "inferred_type": infer_entity_type(name, content),
+                     "memory_ids": [],
+                 },
+             )
+             entry["memory_ids"].append(memory_id)
+
+     plan.proposed_entities = list(by_name.values())
+     return plan
+
+
+ # ---------------------------------------------------------------------------
+ # Phase 2: apply_backfill (WRITES, but only after a successful backup)
+ # ---------------------------------------------------------------------------
+
+
+ def _create_backup(db, backup_path: Path) -> Path:
+     """Write a SQLite-native backup of ``db`` to ``backup_path``.
+
+     Uses :meth:`sqlite3.Connection.backup` for crash-consistent copy.
+     Creates parent directories. Raises on any failure so the caller can
+     abort the apply before touching the main DB.
+     """
+     backup_path = Path(backup_path)
+     backup_path.parent.mkdir(parents=True, exist_ok=True)
+
+     # The Database wrapper exposes a thread-local connection via
+     # ``_get_connection``. We do not capture it as a long-lived
+     # attribute -- always ask the wrapper for the live one.
+     if hasattr(db, "_get_connection"):
+         source_conn = db._get_connection()  # noqa: SLF001
+     elif hasattr(db, "conn"):
+         source_conn = db.conn
+     else:
+         raise RuntimeError(
+             "Cannot create backup: db has no _get_connection or conn attribute"
+         )
+
+     target = sqlite3.connect(str(backup_path))
+     try:
+         source_conn.backup(target)
+     finally:
+         target.close()
+
+     if not backup_path.exists() or backup_path.stat().st_size == 0:
+         raise RuntimeError(f"Backup file at {backup_path} is missing or empty")
+
+     return backup_path
+
+
+ def _ensure_entity_for_backfill(
+     db, name: str, entity_type: str
+ ) -> tuple[int, bool]:
+     """Return (entity_id, created_now).
+
+     Looks up by canonical_name (lowercased). Returns the existing id
+     if found, else inserts a new row. Does not touch embeddings (the
+     main daemon's normal flow will pick those up on next access).
+     """
+     canonical = name.lower().strip()
+     existing = db.get_one(
+         "entities",
+         where="canonical_name = ?",
+         where_params=(canonical,),
+     )
+     if existing:
+         return existing["id"], False
+
+     now = datetime.utcnow().isoformat()
+     new_id = db.insert(
+         "entities",
+         {
+             "name": name,
+             "type": entity_type,
+             "canonical_name": canonical,
+             "importance": 1.0,
+             "created_at": now,
+             "updated_at": now,
+         },
+     )
+     return new_id, True
+
+
+ def apply_backfill(db, plan: BackfillPlan, backup_path: Path) -> BackfillResult:
+     """Apply the plan after first taking a SQLite backup.
+
+     Args:
+         db: Database wrapper.
+         plan: A :class:`BackfillPlan` from :func:`plan_backfill`.
+         backup_path: Where to write the SQLite backup. Required.
+
+     Returns:
+         A :class:`BackfillResult` with counts.
+
+     Raises:
+         Anything raised by :func:`_create_backup`. If the backup step
+         fails, NO writes are performed.
+     """
+     backup_path = Path(backup_path)
+
+     # Backup MUST come first. If it fails, abort before any DB write.
+     created_backup = _create_backup(db, backup_path)
+     logger.info("Backfill: backup created at %s", created_backup)
+
+     result = BackfillResult(backup_path=created_backup)
+
+     for proposal in plan.proposed_entities:
+         name = proposal["name"]
+         entity_type = proposal["inferred_type"]
+         memory_ids = proposal["memory_ids"]
+
+         entity_id, created_now = _ensure_entity_for_backfill(db, name, entity_type)
+         if created_now:
+             result.entities_created += 1
+         else:
+             result.entities_reused += 1
+
+         for memory_id in memory_ids:
+             try:
+                 db.insert(
+                     "memory_entities",
+                     {
+                         "memory_id": memory_id,
+                         "entity_id": entity_id,
+                         "relationship": "about",
+                     },
+                 )
+                 result.links_created += 1
+             except Exception as e:
+                 # Duplicate link (memory already has it) is harmless.
+                 logger.debug(
+                     "Backfill: skipping duplicate link memory=%s entity=%s: %s",
+                     memory_id,
+                     entity_id,
+                     e,
+                 )
+
+     logger.info(
+         "Backfill applied: %d entities created, %d reused, %d links",
+         result.entities_created,
+         result.entities_reused,
+         result.links_created,
+     )
+     return result
+
+
+ # ---------------------------------------------------------------------------
+ # CLI helper: render a plan summary for the dry-run output
+ # ---------------------------------------------------------------------------
+
+
+ def format_plan_summary(plan: BackfillPlan) -> str:
+     """Human-readable summary of a plan (printed in dry-run mode)."""
+     lines = [
+         "Entity-link backfill plan (dry-run, no writes):",
+         f" Scanned memories without links: {plan.scanned_memories}",
+         f" Memories with orphan name references: {plan.orphan_count}",
+         f" Proposed new/linked entities: {len(plan.proposed_entities)}",
+     ]
+     if plan.proposed_entities:
+         lines.append("")
+         lines.append(" By inferred type:")
+         type_counts: Dict[str, int] = {}
+         for p in plan.proposed_entities:
+             type_counts[p["inferred_type"]] = (
+                 type_counts.get(p["inferred_type"], 0) + 1
+             )
+         for t, n in sorted(type_counts.items()):
+             lines.append(f" {t}: {n}")
+         # Show a small sample so the user can sanity-check.
+         sample = plan.proposed_entities[:10]
+         lines.append("")
+         lines.append(" Sample (first 10):")
+         for p in sample:
+             lines.append(
+                 f" - {p['name']!r} -> {p['inferred_type']} "
+                 f"({len(p['memory_ids'])} memory link(s))"
+             )
+     lines.append("")
+     lines.append(
+         "Run with --apply to write changes. A SQLite backup will be created first."
+     )
+     return "\n".join(lines)
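
The planner in the file above finds names with `_PROPER_NOUN_RE`, which requires two or more words that each start with a capital letter followed by lowercase letters. The sketch below copies that pattern into a standalone script to show what it does and does not pick up. The helper `candidate_names` omits the stopword filter and the sample sentences are invented for illustration.

```python
import re

# Same pattern as _PROPER_NOUN_RE in the file above.
PROPER_NOUN_RE = re.compile(r"\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b")


def candidate_names(content: str) -> list:
    """Unique, order-preserving matches (stopword filtering omitted)."""
    seen = {}
    for match in PROPER_NOUN_RE.finditer(content):
        seen.setdefault(match.group(1).strip(), None)
    return list(seen)


samples = [
    "Matt Blumberg said the Markup AI launch slipped.",
    "Yesterday Sarah Chen called about Acme.",
    "Met with Sarah Chen and then Sarah Chen again.",
]
for text in samples:
    print(candidate_names(text))
```

Under these assumptions the first sentence yields only "Matt Blumberg": the all-caps token "AI" does not fit `[A-Z][a-z]+`, so "Markup AI" is not proposed from content alone. The second sentence yields "Yesterday Sarah Chen", because a capitalised sentence opener directly before a name is absorbed into the match. The third shows de-duplication of a repeated name.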