get-claudia 1.56.1 → 1.58.0
- package/CHANGELOG.md +61 -0
- package/README.md +0 -13
- package/bin/index.js +4 -1
- package/memory-daemon/claudia_memory/__main__.py +73 -0
- package/memory-daemon/claudia_memory/mcp/server.py +77 -0
- package/memory-daemon/claudia_memory/services/backfill.py +346 -0
- package/memory-daemon/claudia_memory/services/entities.py +192 -0
- package/memory-daemon/claudia_memory/services/remember.py +46 -7
- package/package.json +1 -1
- package/template-v2/.claude/hooks/__pycache__/post-tool-capture.cpython-313.pyc +0 -0
- package/template-v2/.claude/hooks/__pycache__/session-health-check.cpython-313.pyc +0 -0
- package/template-v2/.claude/hooks/__pycache__/user-prompt-capture.cpython-313.pyc +0 -0
- package/template-v2/.claude/hooks/post-tool-capture.py +109 -9
- package/template-v2/.claude/hooks/session-health-check.py +52 -4
- package/template-v2/.claude/hooks/session-summary.py +399 -0
- package/template-v2/.claude/hooks/user-prompt-capture.py +123 -0
- package/template-v2/.claude/manifest.json +5 -4
- package/template-v2/.claude/rules/claudia-principles.md +1 -1
- package/template-v2/.claude/rules/memory-commitment.md +92 -0
- package/template-v2/.claude/settings.local.json +26 -0
- package/template-v2/.mcp.json.example +0 -10
- package/template-v2/CLAUDE.md +1 -79
- package/template-v2/.claude/hooks/__pycache__/pre-compact.cpython-313.pyc +0 -0
- package/template-v2/gitignore +0 -35
package/CHANGELOG.md
CHANGED
|
@@ -2,6 +2,67 @@
|
|
|
2
2
|
|
|
3
3
|
All notable changes to Claudia will be documented in this file.
|
|
4
4
|
|
|
5
|
+
## 1.58.0 (2026-05-13)
|
|
6
|
+
|
|
7
|
+
### The Memory Reliability Release
|
|
8
|
+
|
|
9
|
+
Five PRs that fix the memory layer's biggest recurring failure mode and lock in the integration philosophy. After this release, memory writes that name entities ("Matt Blumberg") actually create those entities with the correct type. The release history's recurring memory-fix releases ("Recall Recovery", "Vector Search Fix", "Semantic Search Actually Works Now") get permanent regression-test sentinels so the same bug classes can't quietly come back. And the codebase loses a dual-maintenance hazard that was already costing time.
|
|
10
|
+
|
|
11
|
+
#### Fixed
|
|
12
|
+
- **`memory_remember` actually links entities and infers their type correctly (#54)** -- A confirmed bug from 2026-05-13: calling `memory_remember(content="Matt Blumberg said X", entities=["Matt Blumberg", "Markup AI"])` was creating entities but assigning them `type: person` by default, even when the name clearly indicated an organization. "Markup AI" was being saved as a person. The real bug was in `_infer_entity_type` -- it didn't recognise `AI` / `.ai` / `Co.` as corporate suffixes and fell back to `person`. Fixed with a pure-function rule-based type inference (corporate suffixes -> organization, project keywords -> project, person patterns -> person, fallback -> concept, never default to person). Plus a new `claudia-memory --backfill-entities` CLI to retroactively link orphaned references in existing user databases.
|
|
13
|
+
|
|
14
|
+
#### Added
|
|
15
|
+
- **`claudia memory backfill-entities` command (#54)** -- Default dry-run: prints a plan and writes nothing. `--apply` makes a timestamped backup to `~/.claudia/backups/memory-{timestamp}.db` first, then applies the backfill. Idempotent: re-running on an already-backfilled DB is a no-op. Aborts cleanly if backup creation fails.
|
|
16
|
+
- **5 regression tests for recurring bug classes (#56)** -- New `memory-daemon/tests/test_recurring_regressions.py` adds permanent forward-looking sentinels for: entity linking on `memory_remember`, recall returning results after seed writes, embedding migration preserving recall, daemon startup tolerating stale SHM files, and `memory_briefing` returning a valid structure on an empty database. Each test docstring names the historical releases where its bug class appeared (v1.35.x, v1.51.5, v1.51.18, v1.55.7, v1.55.8, v1.55.14, v1.21.1, v1.40.1).
|
|
17
|
+
- **API parameter aliases for read-side MCP tools (#57)** -- `memory_about` now accepts `entity_name` and `name` alongside `entity`. `memory_relate` accepts `source_entity` / `target_entity` / `relationship_type` alongside `source` / `target` / `relationship`. `memory_recall` accepts `q` and `search` alongside `query`. Purely additive: every existing caller continues to work unchanged. Aliases normalize at the MCP boundary; service-layer signatures are untouched. If both canonical and alias are passed in the same call, canonical wins.
|
|
18
|
+
|
|
19
|
+
#### Removed
|
|
20
|
+
- **Rube (Composio) MCP integration as a bundled default (#41)** -- Rube is no longer a recommended or bundled MCP server in `.mcp.json.example` (root and template-v2), README, or the Claudia documentation. Locks in the direct-integrations-only philosophy (claude.ai-native MCPs + user-built custom MCPs like Gmail/Calendar). Existing users with `rube` already configured continue to work unchanged; the installer simply no longer ships Rube as an example. The "Tool configuration" example in `claudia-principles.md` was updated to vendor-neutral phrasing.
|
|
21
|
+
- **Legacy `claudia/` sibling files (#55)** -- Removed 3 stale sibling files (`post-tool-capture.py`, `session-health-check.py`, `settings.local.json`) that lived under `claudia/`. These were never reaching users (the installer ships from `template-v2/` only), but every hook bug fix had to remember to patch both locations. The dual-maintenance hazard was real: PR #38's sibling-fix step had to apply the same env-var fix twice. Removed at the source.
|
|
22
|
+
|
|
23
|
+
#### Stats
|
|
24
|
+
- **43 new tests** across 4 files (22 entity-resolution tests in #54, 5 regression sentinels in #56, 16 alias tests in #57)
|
|
25
|
+
- **805 total daemon tests passing** (up from 762 before the v1.57.0 chain), 0 regressions
|
|
26
|
+
- TDD sensitivity proofs for every behavior change: tests fail on the un-modified code, pass after the fix
|
|
27
|
+
- 5 PRs merged, all with stop-gates and TDD discipline
|
|
28
|
+
|
|
29
|
+
#### Notes
|
|
30
|
+
- The bug in #54 was different from the original proposal (#51) described. The proposal said "entities are silently ignored." Actually the entity *records* were getting created -- the bug was that they were all getting `type: person`. Fixing the actual bug rather than the imagined one was a better outcome.
|
|
31
|
+
- The `claudia memory backfill-entities` command surface lives on the daemon's argparse (alongside `--backfill-embeddings`, `--migrate-vault-para`), not as a `claudia memory ...` subcommand on the Node CLI. The Node CLI is the installer, not a memory-command dispatcher.
|
|
32
|
+
- Aliases are NOT yet advertised in the MCP `list_tools()` `inputSchema`. They are tolerantly accepted at the request boundary. Schema-level advertisement is a future enhancement if it proves needed for client discoverability.
|
|
33
|
+
|
|
34
|
+
---
|
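As a rough illustration of the rule order the #54 entry describes (corporate suffixes, then project keywords, then person patterns, then a `concept` fallback), here is a minimal sketch. The function name `sketch_infer_type`, the suffix and keyword lists, and the person pattern are assumptions made for illustration; the package's actual inference code in `services/entities.py` is not shown in this diff and may differ.

```python
import re

# Hypothetical rule tables; the real lists in entities.py are not shown here.
CORPORATE_SUFFIXES = ("inc", "llc", "corp", "ltd", "co", "ai")
PROJECT_KEYWORDS = ("project", "initiative", "migration")
# Assumed person pattern: two or three capitalised words.
PERSON_RE = re.compile(r"^[A-Z][a-z]+(?: [A-Z][a-z]+){1,2}$")


def sketch_infer_type(name: str) -> str:
    """Return 'organization', 'project', 'person', or 'concept'."""
    cleaned = name.strip()
    lowered = cleaned.lower()
    tokens = lowered.rstrip(".").split()
    last_token = tokens[-1] if tokens else ""
    # Rule 1: corporate suffixes, including a '.ai' domain-style ending.
    if last_token in CORPORATE_SUFFIXES or lowered.endswith(".ai"):
        return "organization"
    # Rule 2: a project keyword appears as a whole word.
    if any(word in tokens for word in PROJECT_KEYWORDS):
        return "project"
    # Rule 3: the name is shaped like a person's name.
    if PERSON_RE.match(cleaned):
        return "person"
    # Fallback is 'concept', never 'person'.
    return "concept"


for example in ("Markup AI", "Matt Blumberg", "Acme Co.", "Project Phoenix", "pricing"):
    print(example, "->", sketch_infer_type(example))
```

Checking the organization rule first is what keeps "Markup AI" from falling through to the person pattern, which is the failure the changelog entry reports.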
|
35
|
+
|
|
36
|
+
## 1.57.0 (2026-05-13)
|
|
37
|
+
|
|
38
|
+
### The Curated Memory Release
|
|
39
|
+
|
|
40
|
+
Five PRs that complete one thesis: **curated, judgment-driven memory capture, enforced at prompt time and persisted across sessions.** Claudia now catches the user's intent when it matters, persists canonical facts as they emerge, and writes a daily session summary so context survives across days.
|
|
41
|
+
|
|
42
|
+
#### Fixed
|
|
43
|
+
- **PostToolUse hook actually runs (#38)** -- The hook was reading `os.environ.get("CLAUDE_TOOL_NAME")`, which Claude Code never sets. Every install since the hook landed had been silently no-op'ing, so `~/.claudia/observations.jsonl` was never written. The hook now reads its payload from stdin per the documented hook contract. Includes a sibling fix to the legacy `claudia/.claude/hooks/post-tool-capture.py` for codebase consistency.
|
|
44
|
+
|
|
45
|
+
#### Added
|
|
46
|
+
- **Memory-commitment rule (#39)** -- A new always-active rule (`template-v2/.claude/rules/memory-commitment.md`) codifies when to save canonical facts immediately via `memory_remember` / `memory_batch` rather than batching to end-of-session reflection. Trigger phrases include "lock this in," "remember this," "this is canonical." Substantive-artifact discipline: at the end of producing a multi-file artifact, do a memory commitment pass and save the canonical facts as one bundled `memory_batch` call.
|
|
47
|
+
- **UserPromptSubmit hook with intent detection (#42)** -- A new hook (`template-v2/.claude/hooks/user-prompt-capture.py`) inspects the user's prompt at submit time and injects reminder context for two trigger classes. Class 1: canonical-fact phrases ("lock this in," "remember this," etc.) tell the agent to save immediately rather than wait for `/meditate`. Class 2: destructive command patterns (`rm -rf`, `git push --force`, `DROP TABLE`, etc.) trigger a "verify before acting" reminder per the safety-first principle. Destructive patterns are surfaced to the model as human-readable labels (`rm -rf (recursive delete)`), not raw regex, so the agent can reason about them clearly.
|
|
48
|
+
- **Daily session summary system (#40)** -- A new SessionEnd hook (`template-v2/.claude/hooks/session-summary.py`) writes a per-session markdown summary to `~/.claudia/sessions/YYYY-MM-DD/NN-slug.md` covering opening prompt, files touched, external actions, and find-this-again references. SessionStart now surfaces a 3-day digest of recent sessions via the existing health-check hook, so future-Claudia knows what past-Claudia worked on. PostToolUse hook gained `file_path` extraction for Write/Edit/MultiEdit/NotebookEdit and `external_action` labels for git push, gh repo create, vercel/netlify deploy, supabase db push, and direct MCP sends.
|
|
49
|
+
- **Explicit upgrade messaging (#50)** -- The installer now names `~/.claudia/` explicitly after an upgrade and lists what is preserved (entities, relationships, reflections, embeddings) instead of the generic "data preserved" phrasing. Users care about their accumulated memory graph; the previous wording did not signal that the database is safe.
|
|
50
|
+
|
|
51
|
+
#### Changed
|
|
52
|
+
- **External-action detection uses word-boundary regex (#40)** -- Previously a substring match, so `echo "git push for testing"` falsely fired the `external_action` flag. The new patterns anchor on command separators (line start, `;`, `&&`, `|`, `(`) and skip transparent prefixes (`sudo`, `nohup`, `time`, `env`). False positives on echoed/quoted strings are eliminated; real commands still fire.
|
|
53
|
+
- **PostToolUse output truncation 200 -> 300 chars (#40)** -- Room for the richer output context that includes `file_path` and `external_action` labels alongside the truncated stdout/stderr.
|
|
54
|
+
|
|
55
|
+
#### Stats
|
|
56
|
+
- 41 new hook tests in `tests/hooks/` (stdlib `unittest`, zero new dependencies), all passing in ~1.5s
|
|
57
|
+
- TDD sensitivity proofs for every behavior change: tests fail on the un-modified hook, pass after the fix
|
|
58
|
+
- 5 PRs merged, 0 regressions
|
|
59
|
+
|
|
60
|
+
#### Notes
|
|
61
|
+
- The brief that drove this chain emphasized one principle: **trust the existing user-file preservation policy (commit `efce9f2`)** rather than inventing a new upgrade framework. The installer's behavior didn't change; only the messaging did.
|
|
62
|
+
- The four hook PRs each landed with their own automated tests and TDD sensitivity proofs. The legacy `claudia/` subdirectory was kept in sync with the canonical `template-v2/` to avoid maintenance drift.
|
|
63
|
+
|
|
64
|
+
---
|
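The separator-anchored matching described in the #40 "Changed" entry can be sketched as follows. The hook source is not part of this excerpt, so the pattern, the two-command table, and the name `detect_external_action` are assumptions; only the separator list and the transparent prefixes come from the changelog text.

```python
import re

# Assumed subset of external commands; the hook's real list is longer.
EXTERNAL_COMMANDS = {
    "git push": r"git\s+push",
    "gh repo create": r"gh\s+repo\s+create",
}
# A command may start at line start or after ;, &&, |, or (.
SEPARATOR = r"(?:^|;|&&|\||\()"
# Transparent prefixes that may precede the real command.
PREFIXES = r"(?:(?:sudo|nohup|time|env)\s+)*"


def detect_external_action(command: str):
    """Return the label of the first external action found, else None."""
    for label, body in EXTERNAL_COMMANDS.items():
        pattern = rf"{SEPARATOR}\s*{PREFIXES}{body}\b"
        if re.search(pattern, command, flags=re.MULTILINE):
            return label
    return None


print(detect_external_action('echo "git push for testing"'))  # None
print(detect_external_action("cd repo && sudo git push origin main"))  # git push
```

This sketch does not parse shell quoting, so a separator inside a quoted string (for example `echo "a; git push"`) would still match; whether the shipped hook handles that case cannot be told from this diff.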
|
65
|
+
|
|
5
66
|
## 1.56.1 (2026-04-11)
|
|
6
67
|
|
|
7
68
|
### Preserve User-Modified Skills on Upgrade
|
package/README.md
CHANGED
|
@@ -320,19 +320,6 @@ This generates a one-click URL to enable all required Google APIs and walks you
|
|
|
320
320
|
| **Extended** | 83 | Core + Docs, Sheets, Tasks, Chat |
|
|
321
321
|
| **Complete** | 111 | Extended + Slides, Forms, Apps Script |
|
|
322
322
|
|
|
323
|
-
### 500+ Apps via Rube
|
|
324
|
-
|
|
325
|
-
[Rube](https://rube.app) (by Composio) connects Claudia to Slack, Notion, Jira, GitHub, Linear, HubSpot, Stripe, Figma, and hundreds more through one-click OAuth. No per-app MCP setup needed.
|
|
326
|
-
|
|
327
|
-
| Category | Examples |
|
|
328
|
-
|----------|----------|
|
|
329
|
-
| **Communication** | Slack, Discord, Teams, Telegram |
|
|
330
|
-
| **Project Management** | Jira, Linear, Asana, Trello, Monday.com |
|
|
331
|
-
| **Knowledge & Docs** | Notion, Confluence, Google Docs, Coda |
|
|
332
|
-
| **Code & Dev** | GitHub, GitLab, Bitbucket |
|
|
333
|
-
| **CRM & Sales** | HubSpot, Salesforce, Pipedrive |
|
|
334
|
-
| **And 500+ more** | [Browse the full list](https://rube.app) |
|
|
335
|
-
|
|
336
323
|
### Obsidian Vault
|
|
337
324
|
|
|
338
325
|
Memory auto-syncs to an Obsidian vault at `~/.claudia/vault/` using PARA structure. Every entity becomes a markdown note with `[[wikilinks]]`, so Obsidian's graph view maps your network. SQLite is the source of truth; the vault is a read-only projection you can browse and search.
|
package/bin/index.js
CHANGED
|
@@ -877,7 +877,10 @@ async function main() {
|
|
|
877
877
|
}
|
|
878
878
|
|
|
879
879
|
console.log('');
|
|
880
|
-
console.log(` ${colors.cyan}✓${colors.reset} Framework updated
|
|
880
|
+
console.log(` ${colors.cyan}✓${colors.reset} Framework updated`);
|
|
881
|
+
console.log(` • Your memory at ${colors.bold}~/.claudia/${colors.reset} is preserved (entities, relationships, reflections, embeddings).`);
|
|
882
|
+
console.log(` • Skills and hooks refreshed; any modifications you chose to keep were respected.`);
|
|
883
|
+
console.log(` • Restart Claude Code for changes to take effect.`);
|
|
881
884
|
}
|
|
882
885
|
|
|
883
886
|
// Self-heal: strip CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS from settings (#24)
|
package/memory-daemon/claudia_memory/__main__.py
CHANGED
|
@@ -1110,6 +1110,24 @@ def main():
|
|
|
1110
1110
|
action="store_true",
|
|
1111
1111
|
help="Preview mode for --migrate-vault-para: show routing plan without making changes",
|
|
1112
1112
|
)
|
|
1113
|
+
parser.add_argument(
|
|
1114
|
+
"--backfill-entities",
|
|
1115
|
+
action="store_true",
|
|
1116
|
+
help=(
|
|
1117
|
+
"Scan memories for un-linked entity references and propose "
|
|
1118
|
+
"creating/linking them (Proposal #51). Dry-run by default; "
|
|
1119
|
+
"pass --apply to write changes (creates a SQLite backup first)."
|
|
1120
|
+
),
|
|
1121
|
+
)
|
|
1122
|
+
parser.add_argument(
|
|
1123
|
+
"--apply",
|
|
1124
|
+
action="store_true",
|
|
1125
|
+
help=(
|
|
1126
|
+
"With --backfill-entities: actually write the changes. "
|
|
1127
|
+
"A SQLite backup is created at ~/.claudia/backups/ before any "
|
|
1128
|
+
"writes; if backup creation fails, the command aborts."
|
|
1129
|
+
),
|
|
1130
|
+
)
|
|
1113
1131
|
parser.add_argument(
|
|
1114
1132
|
"--migrate-legacy",
|
|
1115
1133
|
action="store_true",
|
|
@@ -1704,6 +1722,61 @@ def main():
|
|
|
1704
1722
|
run_para_migration(vault_path, db=db, preview=args.preview)
|
|
1705
1723
|
return
|
|
1706
1724
|
|
|
1725
|
+
if args.backfill_entities:
|
|
1726
|
+
# Entity-link backfill (Proposal #51). Dry-run by default; --apply
|
|
1727
|
+
# writes after creating a SQLite backup.
|
|
1728
|
+
setup_logging(debug=args.debug)
|
|
1729
|
+
from datetime import datetime as _dt
|
|
1730
|
+
|
|
1731
|
+
from .services.backfill import (
|
|
1732
|
+
apply_backfill,
|
|
1733
|
+
format_plan_summary,
|
|
1734
|
+
plan_backfill,
|
|
1735
|
+
)
|
|
1736
|
+
|
|
1737
|
+
db = get_db()
|
|
1738
|
+
db.initialize()
|
|
1739
|
+
|
|
1740
|
+
plan = plan_backfill(db)
|
|
1741
|
+
print(format_plan_summary(plan))
|
|
1742
|
+
|
|
1743
|
+
if not args.apply:
|
|
1744
|
+
# Dry-run path: we already printed the plan; nothing more to do.
|
|
1745
|
+
return
|
|
1746
|
+
|
|
1747
|
+
# --apply: take the mandatory backup first.
|
|
1748
|
+
config = get_config()
|
|
1749
|
+
backups_dir = Path(config.backup_dir)
|
|
1750
|
+
try:
|
|
1751
|
+
backups_dir.mkdir(parents=True, exist_ok=True)
|
|
1752
|
+
except OSError as e:
|
|
1753
|
+
print(
|
|
1754
|
+
f"\nCannot create backup directory {backups_dir}: {e}\n"
|
|
1755
|
+
"Aborting before any database writes."
|
|
1756
|
+
)
|
|
1757
|
+
sys.exit(1)
|
|
1758
|
+
|
|
1759
|
+
timestamp = _dt.utcnow().strftime("%Y-%m-%dT%H%M%SZ")
|
|
1760
|
+
backup_path = backups_dir / f"memory-{timestamp}.db"
|
|
1761
|
+
|
|
1762
|
+
try:
|
|
1763
|
+
result = apply_backfill(db, plan, backup_path=backup_path)
|
|
1764
|
+
except Exception as e:
|
|
1765
|
+
print(
|
|
1766
|
+
f"\nBackfill aborted (no DB writes performed): {e}\n"
|
|
1767
|
+
f"Backup target was: {backup_path}"
|
|
1768
|
+
)
|
|
1769
|
+
sys.exit(1)
|
|
1770
|
+
|
|
1771
|
+
print(
|
|
1772
|
+
"\nBackfill applied:\n"
|
|
1773
|
+
f" backup written to: {result.backup_path}\n"
|
|
1774
|
+
f" entities created: {result.entities_created}\n"
|
|
1775
|
+
f" entities reused: {result.entities_reused}\n"
|
|
1776
|
+
f" memory_entities links created: {result.links_created}"
|
|
1777
|
+
)
|
|
1778
|
+
return
|
|
1779
|
+
|
|
1707
1780
|
if args.merge_databases:
|
|
1708
1781
|
# Manual consolidation of hash-named databases
|
|
1709
1782
|
setup_logging(debug=args.debug)
|
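The `--apply` path above takes a timestamped backup before any write. A minimal, standalone sketch of that step follows, using the standard library's `sqlite3.Connection.backup`. The helper name `make_timestamped_backup` and the demo table are made up for illustration, and the sketch uses `datetime.now(timezone.utc)` where the hunk uses `utcnow()`.

```python
import sqlite3
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def make_timestamped_backup(source: sqlite3.Connection, backups_dir: Path) -> Path:
    """Copy `source` to backups_dir/memory-<timestamp>.db and return the path."""
    backups_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
    backup_path = backups_dir / f"memory-{stamp}.db"
    target = sqlite3.connect(str(backup_path))
    try:
        source.backup(target)  # SQLite online backup API
    finally:
        target.close()
    # Refuse to continue if the copy is missing or empty.
    if not backup_path.exists() or backup_path.stat().st_size == 0:
        raise RuntimeError(f"Backup at {backup_path} is missing or empty")
    return backup_path


# Demo against a throwaway in-memory database.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT)")
demo.execute("INSERT INTO memories (content) VALUES ('Matt Blumberg said X')")
demo.commit()

with tempfile.TemporaryDirectory() as tmp:
    path = make_timestamped_backup(demo, Path(tmp) / "backups")
    restored = sqlite3.connect(str(path))
    rows = restored.execute("SELECT content FROM memories").fetchall()
    restored.close()
demo.close()
print(path.name, rows)
```

Raising before returning means a caller can wrap the backup in a single `try` and exit without touching the main database, which mirrors the abort behaviour the CLI prints.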
package/memory-daemon/claudia_memory/mcp/server.py
CHANGED
|
@@ -141,6 +141,79 @@ def _require(arguments: dict, key: str, tool_name: str):
|
|
|
141
141
|
return value
|
|
142
142
|
|
|
143
143
|
|
|
144
|
+
# ── Parameter-name aliases (v1.58.0 PR E) ──
|
|
145
|
+
#
|
|
146
|
+
# The memory MCP tools historically used different parameter conventions
|
|
147
|
+
# (entity vs source/target/relationship vs query). The aliases below let
|
|
148
|
+
# callers use a consistent variant while every existing caller continues
|
|
149
|
+
# to work unchanged. Normalization happens here at the MCP boundary;
|
|
150
|
+
# service-layer signatures in claudia_memory/services/ are untouched.
|
|
151
|
+
#
|
|
152
|
+
# Rules:
|
|
153
|
+
# 1. Purely additive. The canonical name continues to work as before.
|
|
154
|
+
# 2. If both the canonical name and an alias are provided in the same
|
|
155
|
+
# call, the canonical name wins (the alias is left in place and is
|
|
156
|
+
# not consulted by the handler).
|
|
157
|
+
# 3. Otherwise, the first matching alias is renamed to the canonical
|
|
158
|
+
# key and the alias key is removed from the arguments dict so the
|
|
159
|
+
# handler only ever sees the canonical name.
|
|
160
|
+
|
|
161
|
+
_PARAM_ALIASES: Dict[str, Dict[str, List[str]]] = {
|
|
162
|
+
"memory_about": {
|
|
163
|
+
"entity": ["entity_name", "name"],
|
|
164
|
+
},
|
|
165
|
+
"memory_relate": {
|
|
166
|
+
"source": ["source_entity"],
|
|
167
|
+
"target": ["target_entity"],
|
|
168
|
+
"relationship": ["relationship_type"],
|
|
169
|
+
},
|
|
170
|
+
"memory_recall": {
|
|
171
|
+
"query": ["q", "search"],
|
|
172
|
+
},
|
|
173
|
+
}
|
|
174
|
+
|
|
175
|
+
|
|
176
|
+
def _normalize_params(arguments: dict, canonical: str, aliases: List[str]) -> dict:
|
|
177
|
+
"""Resolve alias parameter names to the canonical name.
|
|
178
|
+
|
|
179
|
+
If the canonical name is already present in `arguments`, it wins and
|
|
180
|
+
`arguments` is returned unchanged. Otherwise, the first alias from
|
|
181
|
+
`aliases` that is present is renamed to the canonical key, and the
|
|
182
|
+
alias key is removed. If no alias matches either, `arguments` is
|
|
183
|
+
returned unchanged.
|
|
184
|
+
|
|
185
|
+
The function never mutates the caller's dict: when a rewrite is
|
|
186
|
+
needed it returns a shallow copy with the alias key replaced.
|
|
187
|
+
"""
|
|
188
|
+
if canonical in arguments:
|
|
189
|
+
return arguments
|
|
190
|
+
for alias in aliases:
|
|
191
|
+
if alias in arguments:
|
|
192
|
+
arguments = dict(arguments)
|
|
193
|
+
arguments[canonical] = arguments.pop(alias)
|
|
194
|
+
return arguments
|
|
195
|
+
return arguments
|
|
196
|
+
|
|
197
|
+
|
|
198
|
+
def _apply_parameter_aliases(tool_name: str, arguments: dict) -> dict:
|
|
199
|
+
"""Apply all registered aliases for the given tool, if any.
|
|
200
|
+
|
|
201
|
+
Tools without an entry in `_PARAM_ALIASES` (the vast majority) get
|
|
202
|
+
their arguments back unchanged. Dot-notation aliases (e.g.
|
|
203
|
+
'memory.about') are routed to the same canonical tool name for the
|
|
204
|
+
purposes of alias lookup.
|
|
205
|
+
"""
|
|
206
|
+
# Dot-notation aliases (memory.about, etc.) share the same canonical
|
|
207
|
+
# alias map as their underscore counterparts.
|
|
208
|
+
lookup_name = tool_name.replace(".", "_", 1) if "." in tool_name else tool_name
|
|
209
|
+
aliases_for_tool = _PARAM_ALIASES.get(lookup_name)
|
|
210
|
+
if not aliases_for_tool:
|
|
211
|
+
return arguments
|
|
212
|
+
for canonical, alias_list in aliases_for_tool.items():
|
|
213
|
+
arguments = _normalize_params(arguments, canonical, alias_list)
|
|
214
|
+
return arguments
|
|
215
|
+
|
|
216
|
+
|
|
144
217
|
MAX_RESPONSE_BYTES = 50_000
|
|
145
218
|
|
|
146
219
|
|
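To see the three alias rules in action, the snippet below restates the helpers from the hunk above in condensed form so it runs on its own, then calls them with invented argument values. The shortened names (`PARAM_ALIASES`, `normalize`, `apply_aliases`) and the sample inputs are for the demo only.

```python
from typing import Dict, List

# Condensed copy of part of the alias map from the hunk above.
PARAM_ALIASES: Dict[str, Dict[str, List[str]]] = {
    "memory_about": {"entity": ["entity_name", "name"]},
    "memory_recall": {"query": ["q", "search"]},
}


def normalize(arguments: dict, canonical: str, aliases: List[str]) -> dict:
    if canonical in arguments:
        return arguments  # canonical wins; any alias key is left in place
    for alias in aliases:
        if alias in arguments:
            renamed = dict(arguments)  # shallow copy; caller's dict is not mutated
            renamed[canonical] = renamed.pop(alias)
            return renamed
    return arguments


def apply_aliases(tool_name: str, arguments: dict) -> dict:
    # 'memory.about' and 'memory_about' share one alias table.
    lookup = tool_name.replace(".", "_", 1)
    for canonical, aliases in PARAM_ALIASES.get(lookup, {}).items():
        arguments = normalize(arguments, canonical, aliases)
    return arguments


print(apply_aliases("memory_recall", {"q": "pricing"}))
# {'query': 'pricing'}
print(apply_aliases("memory.about", {"entity": "Markup AI", "name": "ignored"}))
# {'entity': 'Markup AI', 'name': 'ignored'}
print(apply_aliases("memory_batch", {"q": "untouched"}))
# {'q': 'untouched'}
```

Note the second call: when the canonical key is present, the alias key stays in the dict, so a handler that rejects unknown keys would still see it.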
|
@@ -3142,6 +3215,10 @@ async def call_tool(name: str, arguments: Dict[str, Any]) -> CallToolResult:
|
|
|
3142
3215
|
"""Handle tool calls via dispatch registry."""
|
|
3143
3216
|
db = get_db()
|
|
3144
3217
|
try:
|
|
3218
|
+
# Normalize parameter-name aliases at the MCP boundary so handlers
|
|
3219
|
+
# only ever see canonical parameter names. Purely additive: tools
|
|
3220
|
+
# without registered aliases are unchanged. See _PARAM_ALIASES.
|
|
3221
|
+
arguments = _apply_parameter_aliases(name, arguments)
|
|
3145
3222
|
with db.transaction():
|
|
3146
3223
|
handler = _TOOL_HANDLERS.get(name)
|
|
3147
3224
|
if handler:
package/memory-daemon/claudia_memory/services/backfill.py
ADDED
|
@@ -0,0 +1,346 @@
|
|
|
1
|
+
"""Entity-link backfill command (Proposal #51).
|
|
2
|
+
|
|
3
|
+
The pre-v1.58 write path linked entities to memories *most* of the time
|
|
4
|
+
but auto-created organisations as type=person and could miss entities
|
|
5
|
+
referenced only in the content (no ``about_entities`` array supplied).
|
|
6
|
+
|
|
7
|
+
This module scans existing memories that have no entity links and
|
|
8
|
+
proposes new entity creations + ``memory_entities`` rows. Two phases:
|
|
9
|
+
|
|
10
|
+
* ``plan_backfill(db)`` -- pure read. Returns a :class:`BackfillPlan`
|
|
11
|
+
with everything it would do. No writes.
|
|
12
|
+
* ``apply_backfill(db, plan, backup_path)`` -- writes. **First** creates
|
|
13
|
+
a SQLite backup at ``backup_path``. If backup fails, raises BEFORE
|
|
14
|
+
any DB modification.
|
|
15
|
+
|
|
16
|
+
CLI entry points live in ``claudia_memory/__main__.py``:
|
|
17
|
+
``claudia-memory --backfill-entities`` (dry-run; default) and
|
|
18
|
+
``claudia-memory --backfill-entities --apply``.
|
|
19
|
+
|
|
20
|
+
No new deps. No schema migrations. Idempotent on re-apply.
|
|
21
|
+
"""
|
|
22
|
+
|
|
23
|
+
from __future__ import annotations
|
|
24
|
+
|
|
25
|
+
import logging
|
|
26
|
+
import re
|
|
27
|
+
import sqlite3
|
|
28
|
+
from dataclasses import dataclass, field
|
|
29
|
+
from datetime import datetime
|
|
30
|
+
from pathlib import Path
|
|
31
|
+
from typing import Any, Dict, List, Optional
|
|
32
|
+
|
|
33
|
+
from .entities import infer_entity_type
|
|
34
|
+
|
|
35
|
+
logger = logging.getLogger(__name__)
|
|
36
|
+
|
|
37
|
+
|
|
38
|
+
# ---------------------------------------------------------------------------
|
|
39
|
+
# Plan / Result dataclasses
|
|
40
|
+
# ---------------------------------------------------------------------------
|
|
41
|
+
|
|
42
|
+
|
|
43
|
+
@dataclass
|
|
44
|
+
class BackfillPlan:
|
|
45
|
+
"""Read-only plan of what apply_backfill would do.
|
|
46
|
+
|
|
47
|
+
Attributes:
|
|
48
|
+
orphan_count: number of memory rows with zero memory_entities links
|
|
49
|
+
that the planner thinks SHOULD have at least one link.
|
|
50
|
+
proposed_entities: list of dicts ``{"name": str, "inferred_type":
|
|
51
|
+
str, "memory_ids": [int, ...]}``. Each dict represents a name
|
|
52
|
+
we detected in memory content for which we will (a) create the
|
|
53
|
+
entity if missing, (b) link it to those memories.
|
|
54
|
+
scanned_memories: total memories the planner looked at.
|
|
55
|
+
"""
|
|
56
|
+
|
|
57
|
+
orphan_count: int = 0
|
|
58
|
+
proposed_entities: List[Dict[str, Any]] = field(default_factory=list)
|
|
59
|
+
scanned_memories: int = 0
|
|
60
|
+
|
|
61
|
+
|
|
62
|
+
@dataclass
|
|
63
|
+
class BackfillResult:
|
|
64
|
+
"""Counts of writes performed by apply_backfill."""
|
|
65
|
+
|
|
66
|
+
entities_created: int = 0
|
|
67
|
+
entities_reused: int = 0
|
|
68
|
+
links_created: int = 0
|
|
69
|
+
backup_path: Optional[Path] = None
|
|
70
|
+
|
|
71
|
+
|
|
72
|
+
# ---------------------------------------------------------------------------
|
|
73
|
+
# Name detection -- intentionally conservative
|
|
74
|
+
# ---------------------------------------------------------------------------
|
|
75
|
+
|
|
76
|
+
# Two or more capitalised words: a reasonable signal for proper nouns.
|
|
77
|
+
# We won't catch single-word entities like "Acme" here -- that prevents a
|
|
78
|
+
# flood of false positives like "The", "She", "Monday" at sentence starts.
|
|
79
|
+
_PROPER_NOUN_RE = re.compile(r"\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b")
|
|
80
|
+
|
|
81
|
+
# Things we never want to propose as entity names.
|
|
82
|
+
_STOPWORDS = frozenset(
|
|
83
|
+
{
|
|
84
|
+
"Project", # Without a following noun, this is the keyword itself.
|
|
85
|
+
"Inc",
|
|
86
|
+
"LLC",
|
|
87
|
+
"Corp",
|
|
88
|
+
"AI",
|
|
89
|
+
"Ltd",
|
|
90
|
+
"Co",
|
|
91
|
+
}
|
|
92
|
+
)
|
|
93
|
+
|
|
94
|
+
|
|
95
|
+
def _candidate_names(content: str) -> List[str]:
|
|
96
|
+
"""Extract proper-noun candidate names from memory content.
|
|
97
|
+
|
|
98
|
+
Returns a list of unique, order-preserved candidates.
|
|
99
|
+
"""
|
|
100
|
+
if not content:
|
|
101
|
+
return []
|
|
102
|
+
|
|
103
|
+
seen: Dict[str, None] = {}
|
|
104
|
+
for match in _PROPER_NOUN_RE.finditer(content):
|
|
105
|
+
raw = match.group(1).strip()
|
|
106
|
+
# Reject single-token stopwords we accidentally captured.
|
|
107
|
+
if raw in _STOPWORDS:
|
|
108
|
+
continue
|
|
109
|
+
seen.setdefault(raw, None)
|
|
110
|
+
return list(seen.keys())
|
|
111
|
+
|
|
112
|
+
|
|
113
|
+
# ---------------------------------------------------------------------------
|
|
114
|
+
# Phase 1: plan_backfill (NO writes)
|
|
115
|
+
# ---------------------------------------------------------------------------
|
|
116
|
+
|
|
117
|
+
|
|
118
|
+
def plan_backfill(db) -> BackfillPlan:
|
|
119
|
+
"""Scan memories with no entity links and propose new links.
|
|
120
|
+
|
|
121
|
+
Args:
|
|
122
|
+
db: The Database object (sqlite wrapper).
|
|
123
|
+
|
|
124
|
+
Returns:
|
|
125
|
+
A :class:`BackfillPlan`. The caller can inspect ``orphan_count``
|
|
126
|
+
and ``proposed_entities`` before deciding to ``--apply``.
|
|
127
|
+
|
|
128
|
+
This function MUST NOT write to the database. Tests assert this.
|
|
129
|
+
"""
|
|
130
|
+
plan = BackfillPlan()
|
|
131
|
+
|
|
132
|
+
# Find memories that have no entity link at all and have content
|
|
133
|
+
# that looks like it mentions someone or something.
|
|
134
|
+
rows = db.execute(
|
|
135
|
+
"""
|
|
136
|
+
SELECT m.id, m.content
|
|
137
|
+
FROM memories m
|
|
138
|
+
LEFT JOIN memory_entities me ON m.id = me.memory_id
|
|
139
|
+
WHERE me.memory_id IS NULL
|
|
140
|
+
AND m.invalidated_at IS NULL
|
|
141
|
+
AND m.content IS NOT NULL
|
|
142
|
+
""",
|
|
143
|
+
fetch=True,
|
|
144
|
+
) or []
|
|
145
|
+
|
|
146
|
+
plan.scanned_memories = len(rows)
|
|
147
|
+
if not rows:
|
|
148
|
+
return plan
|
|
149
|
+
|
|
150
|
+
# name -> {"inferred_type": str, "memory_ids": [int]}
|
|
151
|
+
by_name: Dict[str, Dict[str, Any]] = {}
|
|
152
|
+
|
|
153
|
+
for row in rows:
|
|
154
|
+
memory_id = row["id"]
|
|
155
|
+
content = row["content"]
|
|
156
|
+
names = _candidate_names(content)
|
|
157
|
+
if not names:
|
|
158
|
+
continue
|
|
159
|
+
plan.orphan_count += 1
|
|
160
|
+
for name in names:
|
|
161
|
+
entry = by_name.setdefault(
|
|
162
|
+
name,
|
|
163
|
+
{
|
|
164
|
+
"name": name,
|
|
165
|
+
"inferred_type": infer_entity_type(name, content),
|
|
166
|
+
"memory_ids": [],
|
|
167
|
+
},
|
|
168
|
+
)
|
|
169
|
+
entry["memory_ids"].append(memory_id)
|
|
170
|
+
|
|
171
|
+
plan.proposed_entities = list(by_name.values())
|
|
172
|
+
return plan
|
|
173
|
+
|
|
174
|
+
|
|
175
|
+
# ---------------------------------------------------------------------------
|
|
176
|
+
# Phase 2: apply_backfill (WRITES, but only after a successful backup)
|
|
177
|
+
# ---------------------------------------------------------------------------
|
|
178
|
+
|
|
179
|
+
|
|
180
|
+
def _create_backup(db, backup_path: Path) -> Path:
|
|
181
|
+
"""Write a SQLite-native backup of ``db`` to ``backup_path``.
|
|
182
|
+
|
|
183
|
+
Uses :meth:`sqlite3.Connection.backup` for crash-consistent copy.
|
|
184
|
+
Creates parent directories. Raises on any failure so the caller can
|
|
185
|
+
abort the apply before touching the main DB.
|
|
186
|
+
"""
|
|
187
|
+
backup_path = Path(backup_path)
|
|
188
|
+
backup_path.parent.mkdir(parents=True, exist_ok=True)
|
|
189
|
+
|
|
190
|
+
# The Database wrapper exposes a thread-local connection via
|
|
191
|
+
# ``_get_connection``. We do not capture it as a long-lived
|
|
192
|
+
# attribute -- always ask the wrapper for the live one.
|
|
193
|
+
if hasattr(db, "_get_connection"):
|
|
194
|
+
source_conn = db._get_connection() # noqa: SLF001
|
|
195
|
+
elif hasattr(db, "conn"):
|
|
196
|
+
source_conn = db.conn
|
|
197
|
+
else:
|
|
198
|
+
raise RuntimeError(
|
|
199
|
+
"Cannot create backup: db has no _get_connection or conn attribute"
|
|
200
|
+
)
|
|
201
|
+
|
|
202
|
+
target = sqlite3.connect(str(backup_path))
|
|
203
|
+
try:
|
|
204
|
+
source_conn.backup(target)
|
|
205
|
+
finally:
|
|
206
|
+
target.close()
|
|
207
|
+
|
|
208
|
+
if not backup_path.exists() or backup_path.stat().st_size == 0:
|
|
209
|
+
raise RuntimeError(f"Backup file at {backup_path} is missing or empty")
|
|
210
|
+
|
|
211
|
+
return backup_path
|
|
212
|
+
|
|
213
|
+
|
|
214
|
+
def _ensure_entity_for_backfill(
|
|
215
|
+
db, name: str, entity_type: str
|
|
216
|
+
) -> tuple[int, bool]:
|
|
217
|
+
"""Return (entity_id, created_now).
|
|
218
|
+
|
|
219
|
+
Looks up by canonical_name (lowercased). Returns the existing id
|
|
220
|
+
if found, else inserts a new row. Does not touch embeddings (the
|
|
221
|
+
main daemon's normal flow will pick those up on next access).
|
|
222
|
+
"""
|
|
223
|
+
canonical = name.lower().strip()
|
|
224
|
+
existing = db.get_one(
|
|
225
|
+
"entities",
|
|
226
|
+
where="canonical_name = ?",
|
|
227
|
+
where_params=(canonical,),
|
|
228
|
+
)
|
|
229
|
+
if existing:
|
|
230
|
+
return existing["id"], False
|
|
231
|
+
|
|
232
|
+
now = datetime.utcnow().isoformat()
|
|
233
|
+
new_id = db.insert(
|
|
234
|
+
"entities",
|
|
235
|
+
{
|
|
236
|
+
"name": name,
|
|
237
|
+
"type": entity_type,
|
|
238
|
+
"canonical_name": canonical,
|
|
239
|
+
"importance": 1.0,
|
|
240
|
+
"created_at": now,
|
|
241
|
+
"updated_at": now,
|
|
242
|
+
},
|
|
243
|
+
)
|
|
244
|
+
return new_id, True
|
|
245
|
+
|
|
246
|
+
|
|
247
|
+
def apply_backfill(db, plan: BackfillPlan, backup_path: Path) -> BackfillResult:
|
|
248
|
+
"""Apply the plan after first taking a SQLite backup.
|
|
249
|
+
|
|
250
|
+
Args:
|
|
251
|
+
db: Database wrapper.
|
|
252
|
+
plan: A :class:`BackfillPlan` from :func:`plan_backfill`.
|
|
253
|
+
backup_path: Where to write the SQLite backup. Required.
|
|
254
|
+
|
|
255
|
+
Returns:
|
|
256
|
+
A :class:`BackfillResult` with counts.
|
|
257
|
+
|
|
258
|
+
Raises:
|
|
259
|
+
Anything raised by :func:`_create_backup`. If the backup step
|
|
260
|
+
fails, NO writes are performed.
|
|
261
|
+
"""
|
|
262
|
+
backup_path = Path(backup_path)
|
|
263
|
+
|
|
264
|
+
# Backup MUST come first. If it fails, abort before any DB write.
|
|
265
|
+
created_backup = _create_backup(db, backup_path)
|
|
266
|
+
logger.info("Backfill: backup created at %s", created_backup)
|
|
267
|
+
|
|
268
|
+
result = BackfillResult(backup_path=created_backup)
|
|
269
|
+
|
|
270
|
+
for proposal in plan.proposed_entities:
|
|
271
|
+
name = proposal["name"]
|
|
272
|
+
entity_type = proposal["inferred_type"]
|
|
273
|
+
memory_ids = proposal["memory_ids"]
|
|
274
|
+
|
|
275
|
+
entity_id, created_now = _ensure_entity_for_backfill(db, name, entity_type)
|
|
276
|
+
if created_now:
|
|
277
|
+
result.entities_created += 1
|
|
278
|
+
else:
|
|
279
|
+
result.entities_reused += 1
|
|
280
|
+
|
|
281
|
+
for memory_id in memory_ids:
|
|
282
|
+
try:
|
|
283
|
+
db.insert(
|
|
284
|
+
"memory_entities",
|
|
285
|
+
{
|
|
286
|
+
"memory_id": memory_id,
|
|
287
|
+
"entity_id": entity_id,
|
|
288
|
+
"relationship": "about",
|
|
289
|
+
},
|
|
290
|
+
)
|
|
291
|
+
result.links_created += 1
|
|
292
|
+
except Exception as e:
|
|
293
|
+
# Duplicate link (memory already has it) is harmless.
|
|
294
|
+
logger.debug(
|
|
295
|
+
"Backfill: skipping duplicate link memory=%s entity=%s: %s",
|
|
296
|
+
memory_id,
|
|
297
|
+
entity_id,
|
|
298
|
+
e,
|
|
299
|
+
)
|
|
300
|
+
|
|
301
|
+
logger.info(
|
|
302
|
+
"Backfill applied: %d entities created, %d reused, %d links",
|
|
303
|
+
result.entities_created,
|
|
304
|
+
result.entities_reused,
|
|
305
|
+
result.links_created,
|
|
306
|
+
)
|
|
307
|
+
return result
|
|
308
|
+
|
|
309
|
+
|
|
310
|
+
# ---------------------------------------------------------------------------
|
|
311
|
+
# CLI helper: render a plan summary for the dry-run output
|
|
312
|
+
# ---------------------------------------------------------------------------
|
|
313
|
+
|
|
314
|
+
|
|
315
|
+
def format_plan_summary(plan: BackfillPlan) -> str:
|
|
316
|
+
"""Human-readable summary of a plan (printed in dry-run mode)."""
|
|
317
|
+
lines = [
|
|
318
|
+
"Entity-link backfill plan (dry-run, no writes):",
|
|
319
|
+
f" Scanned memories without links: {plan.scanned_memories}",
|
|
320
|
+
f" Memories with orphan name references: {plan.orphan_count}",
|
|
321
|
+
f" Proposed new/linked entities: {len(plan.proposed_entities)}",
|
|
322
|
+
]
|
|
323
|
+
if plan.proposed_entities:
|
|
324
|
+
lines.append("")
|
|
325
|
+
lines.append(" By inferred type:")
|
|
326
|
+
type_counts: Dict[str, int] = {}
|
|
327
|
+
for p in plan.proposed_entities:
|
|
328
|
+
type_counts[p["inferred_type"]] = (
|
|
329
|
+
type_counts.get(p["inferred_type"], 0) + 1
|
|
330
|
+
)
|
|
331
|
+
for t, n in sorted(type_counts.items()):
|
|
332
|
+
lines.append(f" {t}: {n}")
|
|
333
|
+
# Show a small sample so the user can sanity-check.
|
|
334
|
+
sample = plan.proposed_entities[:10]
|
|
335
|
+
lines.append("")
|
|
336
|
+
lines.append(" Sample (first 10):")
|
|
337
|
+
for p in sample:
|
|
338
|
+
lines.append(
|
|
339
|
+
f" - {p['name']!r} -> {p['inferred_type']} "
|
|
340
|
+
f"({len(p['memory_ids'])} memory link(s))"
|
|
341
|
+
)
|
|
342
|
+
lines.append("")
|
|
343
|
+
lines.append(
|
|
344
|
+
"Run with --apply to write changes. A SQLite backup will be created first."
|
|
345
|
+
)
|
|
346
|
+
return "\n".join(lines)
|
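For a feel of what the conservative name detection in this module picks up, the snippet below reuses the same regular expression as `_PROPER_NOUN_RE` in a standalone helper. The helper name `candidate_names` and the sample sentence are illustrative, and the stopword check is left out.

```python
import re

# Same pattern as _PROPER_NOUN_RE in the module above.
PROPER_NOUN_RE = re.compile(r"\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b")


def candidate_names(content: str) -> list:
    """Unique multi-word capitalised phrases, in first-seen order."""
    seen = {}
    for match in PROPER_NOUN_RE.finditer(content or ""):
        seen.setdefault(match.group(1).strip(), None)
    return list(seen)


text = "Matt Blumberg met Markup AI on Monday. Matt Blumberg liked Acme."
print(candidate_names(text))
# ['Matt Blumberg']
```

Two consequences follow from the pattern itself. All-caps tokens such as "AI" do not satisfy `[A-Z][a-z]+`, so "Markup AI" is not proposed by the scan. And since every match contains at least two words, a single-token stopword such as "Inc" can never equal a whole match, which is why the sketch leaves the stopword check out.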