PyPI - claude-sql - Versions diffs - 1.0.0__tar.gz → 1.1.0__tar.gz - Mend

claude-sql 1.0.0.tar.gz → 1.1.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (44)

{claude_sql-1.0.0 → claude_sql-1.1.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.3
 Name: claude-sql
-Version: 1.0.0
+Version: 1.1.0
 Summary: Zero-copy SQL + semantic search + LLM analytics over ~/.claude/ transcripts.
 Keywords: claude,claude-code,anthropic,duckdb,sql,semantic-search,embeddings,bedrock,transcripts,analytics,observability
 Author: Laith Al-Saadoon
@@ -31,8 +31,8 @@ Requires-Dist: numpy>=2.4.4
 Requires-Dist: packaging>=26.2
 Requires-Dist: polars>=1.40.0
 Requires-Dist: pyarrow>=23.0.1
-Requires-Dist: pydantic>=2.13.2
 Requires-Dist: pydantic-settings>=2.13.1
+Requires-Dist: pydantic>=2.13.2
 Requires-Dist: pyyaml>=6.0.3
 Requires-Dist: scikit-learn>=1.5
 Requires-Dist: scipy>=1.13
@@ -200,6 +200,30 @@ The IAM policy needs `bedrock:InvokeModel` on:
 - `inference-profile/global.cohere.embed-v4:0`
 - `inference-profile/global.anthropic.claude-sonnet-4-6`
+### Reading transcripts from S3
+claude-sql reads the local JSONL corpus by default, but any transcript glob
+can be an `s3://` URI instead — point it at sessions mirrored to S3 by the
+[`claude-agent-sdk` `S3SessionStore`](https://github.com/anthropics/claude-agent-sdk-python/tree/main/examples/session_stores)
+(layout `s3://{bucket}/{prefix}{project}/{session}/part-*.jsonl`). DuckDB reads
+the parts zero-copy over HTTP range requests — no download step — and every
+view and macro works unchanged.
+```bash
+# Personal corpus on S3 instead of ~/.claude/projects.
+export CLAUDE_SQL_DEFAULT_GLOB='s3://my-bucket/transcripts/*/*/part-*.jsonl'
+export AWS_PROFILE=your-profile        # credentials via the standard AWS chain
+claude-sql schema
+claude-sql query "SELECT session_id, started_at FROM sessions ORDER BY started_at DESC LIMIT 10"
+```
+claude-sql loads DuckDB's `httpfs` extension and creates a `credential_chain`
+S3 secret automatically when it sees an `s3://` glob — no keys are embedded
+anywhere. For a non-AWS store (MinIO) or a local mock, set
+`CLAUDE_SQL_S3_ENDPOINT`, `CLAUDE_SQL_S3_URL_STYLE=path`, and
+`CLAUDE_SQL_S3_USE_SSL=false`. The IAM policy needs `s3:GetObject` +
+`s3:ListBucket` on the prefix.
 ## Quick tour
 ```bash
@@ -367,13 +391,15 @@ Every option is configurable via `CLAUDE_SQL_*`:
 | `CLAUDE_SQL_DEFAULT_GLOB` | `~/.claude/projects/*/*.jsonl` | Main transcript glob |
 | `CLAUDE_SQL_SUBAGENT_GLOB` | `~/.claude/projects/*/*/subagents/agent-*.jsonl` | Subagent transcripts |
 | `CLAUDE_SQL_TEAM_CORPUS_ROOT` | `None` | Team-corpus root; when set, derives all three globs from `<root>/<author>/projects/*` (replaces the personal corpus) |
-| `CLAUDE_SQL_REGION` | `us-east-1` | Bedrock region |
+| `CLAUDE_SQL_S3_ENDPOINT` | `None` | Custom S3 endpoint `host[:port]` for non-AWS stores (MinIO) or a local mock; unset uses default AWS S3. Only consulted when a glob is an `s3://` URI |
+| `CLAUDE_SQL_S3_URL_STYLE` | `vhost` | S3 addressing style (`vhost` or `path`); set `path` for MinIO / moto |
+| `CLAUDE_SQL_S3_USE_SSL` | `true` | Toggle TLS for the S3 endpoint; set `false` for a local mock |
+| `CLAUDE_SQL_REGION` | `us-east-1` | Bedrock region **and** the S3 secret region |
 | `CLAUDE_SQL_MODEL_ID` | `global.cohere.embed-v4:0` | Embedding model |
 | `CLAUDE_SQL_SONNET_MODEL_ID` | `global.anthropic.claude-sonnet-4-6` | Classification model |
 | `CLAUDE_SQL_OUTPUT_DIMENSION` | `1024` | Matryoshka embedding dimension |
 | `CLAUDE_SQL_EMBED_CONCURRENCY` | `8` | Parallel Cohere Embed v4 calls (global CRIS) |
 | `CLAUDE_SQL_LLM_CONCURRENCY` | `2` | Parallel Sonnet 4.6 calls (global CRIS) |
-| `CLAUDE_SQL_CONCURRENCY` | `None` | DEPRECATED single knob — aliases onto both pipelines with a warning |
 | `CLAUDE_SQL_BATCH_SIZE` | `96` | Cohere batch size |
 | `CLAUDE_SQL_EMBEDDINGS_PARQUET_PATH` | `~/.claude/embeddings/` | Embeddings cache (sharded directory of `part-*.parquet`) |
 | `CLAUDE_SQL_USER_FRICTION_PARQUET_PATH` | `~/.claude/user_friction/` | Friction cache (sharded) |

{claude_sql-1.0.0 → claude_sql-1.1.0}/README.md RENAMED

@@ -152,6 +152,30 @@ The IAM policy needs `bedrock:InvokeModel` on:
 - `inference-profile/global.cohere.embed-v4:0`
 - `inference-profile/global.anthropic.claude-sonnet-4-6`
+### Reading transcripts from S3
+claude-sql reads the local JSONL corpus by default, but any transcript glob
+can be an `s3://` URI instead — point it at sessions mirrored to S3 by the
+[`claude-agent-sdk` `S3SessionStore`](https://github.com/anthropics/claude-agent-sdk-python/tree/main/examples/session_stores)
+(layout `s3://{bucket}/{prefix}{project}/{session}/part-*.jsonl`). DuckDB reads
+the parts zero-copy over HTTP range requests — no download step — and every
+view and macro works unchanged.
+```bash
+# Personal corpus on S3 instead of ~/.claude/projects.
+export CLAUDE_SQL_DEFAULT_GLOB='s3://my-bucket/transcripts/*/*/part-*.jsonl'
+export AWS_PROFILE=your-profile        # credentials via the standard AWS chain
+claude-sql schema
+claude-sql query "SELECT session_id, started_at FROM sessions ORDER BY started_at DESC LIMIT 10"
+```
+claude-sql loads DuckDB's `httpfs` extension and creates a `credential_chain`
+S3 secret automatically when it sees an `s3://` glob — no keys are embedded
+anywhere. For a non-AWS store (MinIO) or a local mock, set
+`CLAUDE_SQL_S3_ENDPOINT`, `CLAUDE_SQL_S3_URL_STYLE=path`, and
+`CLAUDE_SQL_S3_USE_SSL=false`. The IAM policy needs `s3:GetObject` +
+`s3:ListBucket` on the prefix.
 ## Quick tour
 ```bash
@@ -319,13 +343,15 @@ Every option is configurable via `CLAUDE_SQL_*`:
 | `CLAUDE_SQL_DEFAULT_GLOB` | `~/.claude/projects/*/*.jsonl` | Main transcript glob |
 | `CLAUDE_SQL_SUBAGENT_GLOB` | `~/.claude/projects/*/*/subagents/agent-*.jsonl` | Subagent transcripts |
 | `CLAUDE_SQL_TEAM_CORPUS_ROOT` | `None` | Team-corpus root; when set, derives all three globs from `<root>/<author>/projects/*` (replaces the personal corpus) |
-| `CLAUDE_SQL_REGION` | `us-east-1` | Bedrock region |
+| `CLAUDE_SQL_S3_ENDPOINT` | `None` | Custom S3 endpoint `host[:port]` for non-AWS stores (MinIO) or a local mock; unset uses default AWS S3. Only consulted when a glob is an `s3://` URI |
+| `CLAUDE_SQL_S3_URL_STYLE` | `vhost` | S3 addressing style (`vhost` or `path`); set `path` for MinIO / moto |
+| `CLAUDE_SQL_S3_USE_SSL` | `true` | Toggle TLS for the S3 endpoint; set `false` for a local mock |
+| `CLAUDE_SQL_REGION` | `us-east-1` | Bedrock region **and** the S3 secret region |
 | `CLAUDE_SQL_MODEL_ID` | `global.cohere.embed-v4:0` | Embedding model |
 | `CLAUDE_SQL_SONNET_MODEL_ID` | `global.anthropic.claude-sonnet-4-6` | Classification model |
 | `CLAUDE_SQL_OUTPUT_DIMENSION` | `1024` | Matryoshka embedding dimension |
 | `CLAUDE_SQL_EMBED_CONCURRENCY` | `8` | Parallel Cohere Embed v4 calls (global CRIS) |
 | `CLAUDE_SQL_LLM_CONCURRENCY` | `2` | Parallel Sonnet 4.6 calls (global CRIS) |
-| `CLAUDE_SQL_CONCURRENCY` | `None` | DEPRECATED single knob — aliases onto both pipelines with a warning |
 | `CLAUDE_SQL_BATCH_SIZE` | `96` | Cohere batch size |
 | `CLAUDE_SQL_EMBEDDINGS_PARQUET_PATH` | `~/.claude/embeddings/` | Embeddings cache (sharded directory of `part-*.parquet`) |
 | `CLAUDE_SQL_USER_FRICTION_PARQUET_PATH` | `~/.claude/user_friction/` | Friction cache (sharded) |

claude_sql-1.1.0/pyproject.toml ADDED

@@ -0,0 +1,66 @@
+# GENERATED by mise-tasks/build-dist — do not edit. The source of truth is
+# packages/*/pyproject.toml. This file bundles all five members into the one
+# publishable ``claude-sql`` wheel. See the task docstring for why.
+[project]
+name = "claude-sql"
+version = "1.1.0"
+description = 'Zero-copy SQL + semantic search + LLM analytics over ~/.claude/ transcripts.'
+readme = "README.md"
+license = { text = "Apache-2.0" }
+authors = [{ name = "Laith Al-Saadoon", email = "lalsaado@amazon.com" }]
+requires-python = ">=3.13"
+keywords = ["claude", "claude-code", "anthropic", "duckdb", "sql", "semantic-search", "embeddings", "bedrock", "transcripts", "analytics", "observability"]
+classifiers = [
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.13",
+    "Development Status :: 5 - Production/Stable",
+    "Intended Audience :: Developers",
+    "Operating System :: POSIX :: Linux",
+    "Operating System :: MacOS",
+    "Topic :: Software Development",
+    "Topic :: Database",
+    "Topic :: Scientific/Engineering :: Artificial Intelligence",
+    "Topic :: Utilities",
+    "Typing :: Typed",
+]
+dependencies = [
+    "anthropic>=0.40",
+    "anyio>=4.13.0",
+    "boto3>=1.42.91",
+    "cyclopts>=4.10.2",
+    "duckdb>=1.5.2,<2",
+    "hdbscan>=0.8.40",
+    "igraph>=1.0.0,<2.0",
+    "lancedb>=0.30,<0.31",
+    "leidenalg>=0.11.0,<0.12",
+    "loguru>=0.7.3",
+    "numpy>=2.4.4",
+    "packaging>=26.2",
+    "polars>=1.40.0",
+    "pyarrow>=23.0.1",
+    "pydantic-settings>=2.13.1",
+    "pydantic>=2.13.2",
+    "pyyaml>=6.0.3",
+    "scikit-learn>=1.5",
+    "scipy>=1.13",
+    "tenacity>=9.1.4",
+    "tiktoken>=0.12.0",
+    "umap-learn>=0.5.12",
+]
+[project.scripts]
+claude-sql = "claude_sql.app.cli:main"
+[project.urls]
+Homepage = "https://github.com/theagenticguy/claude-sql"
+Repository = "https://github.com/theagenticguy/claude-sql"
+Issues = "https://github.com/theagenticguy/claude-sql/issues"
+Changelog = "https://github.com/theagenticguy/claude-sql/blob/main/CHANGELOG.md"
+[build-system]
+requires = ["uv_build>=0.11.14,<0.12"]
+build-backend = "uv_build"
+[tool.uv.build-backend]
+module-name = "claude_sql"
+namespace = true

claude_sql-1.1.0/src/claude_sql/analytics/__init__.py ADDED

@@ -0,0 +1 @@
+"""claude-sql analytics: embed, classify, trajectory, conflicts, friction, cluster, terms, community, ingest."""

{claude_sql-1.0.0/src/claude_sql → claude_sql-1.1.0/src/claude_sql/analytics}/classify_worker.py RENAMED

@@ -23,8 +23,8 @@ import anyio
 import polars as pl
 from loguru import logger
-from claude_sql import checkpointer, retry_queue
-from claude_sql.llm_shared import (
+from claude_sql.core import checkpointer, retry_queue
+from claude_sql.core.llm_shared import (
     CLASSIFY_SYSTEM_PROMPT,
     _build_bedrock_client,
     _count_pending_sessions,
@@ -32,14 +32,14 @@ from claude_sql.llm_shared import (
     classify_one,
     pipeline_cache_stats,
 )
-from claude_sql.parquet_shards import read_all, write_part
-from claude_sql.schemas import SESSION_CLASSIFICATION_SCHEMA
-from claude_sql.session_text import iter_session_texts, session_bounds
+from claude_sql.core.parquet_shards import read_all, write_part
+from claude_sql.core.schemas import SESSION_CLASSIFICATION_SCHEMA
+from claude_sql.core.session_text import iter_session_texts, session_bounds
 if TYPE_CHECKING:
     import duckdb
-    from claude_sql.config import Settings
+    from claude_sql.core.config import Settings
 async def _classify_sessions_async(
@@ -52,7 +52,7 @@ async def _classify_sessions_async(
 ) -> int:
     """Async implementation behind :func:`classify_sessions`."""
     already: set[str] = set()
-    done_df = read_all(settings.classifications_parquet_path)
+    done_df = read_all(settings.classifications_parquet_path, columns=["session_id"])
     if done_df is not None and done_df.height > 0:
         already = set(done_df["session_id"].to_list())
@@ -212,7 +212,7 @@ def classify_sessions(
     if dry_run:
         already: set[str] = set()
-        done_df = read_all(settings.classifications_parquet_path)
+        done_df = read_all(settings.classifications_parquet_path, columns=["session_id"])
         if done_df is not None and done_df.height > 0:
             already = set(done_df["session_id"].to_list())
         pending_count = _count_pending_sessions(

{claude_sql-1.0.0/src/claude_sql → claude_sql-1.1.0/src/claude_sql/analytics}/cluster_worker.py RENAMED

@@ -22,8 +22,7 @@ import numpy as np
 import polars as pl
 from loguru import logger
-from claude_sql import lance_store
-from claude_sql.config import Settings
+from claude_sql.core.config import Settings
 def _load_embeddings(path: Path) -> tuple[list[str], np.ndarray]:
@@ -33,6 +32,10 @@ def _load_embeddings(path: Path) -> tuple[list[str], np.ndarray]:
     through the DuckDB ``message_embeddings`` view) so this worker can run
     independently of view registration on the calling connection.
     """
+    # Deferred so importing this module via the CLI for a non-cluster command
+    # doesn't pull in the ~2.6s lancedb import subtree.
+    from claude_sql.core import lance_store
     db = lance_store.connect_db(path)
     if not lance_store._has_table(db, lance_store.TABLE_NAME):
         return [], np.zeros((0, 0), dtype=np.float32)
@@ -65,6 +68,10 @@ def run_clustering(settings: Settings, *, force: bool = False) -> dict[str, int]
         ``{"total": N, "clusters": K, "noise": M}`` where K excludes the
         noise cluster (label -1).
     """
+    # Deferred (see _load_embeddings) — keeps the lancedb import off the CLI's
+    # module-load path for non-cluster commands.
+    from claude_sql.core import lance_store
     out_path = settings.clusters_parquet_path
     in_path = settings.lance_uri
@@ -186,13 +193,19 @@ def run_clustering(settings: Settings, *, force: bool = False) -> dict[str, int]
         noise / len(labels) if len(labels) else 0,
     )
+    # Hand polars the numpy arrays directly — it ingests contiguous arrays
+    # near-zero-copy. Round-tripping through ``.tolist()`` materialized N
+    # boxed Python ints/floats/bools per column just to have polars re-parse
+    # them back into the typed columns the schema already pins (mirrors the
+    # read-side boxing fix in #68, now on the write side). ``X2`` columns are
+    # sliced views, so copy to contiguous float32 before handing them over.
     df = pl.DataFrame(
         {
             "uuid": uuids,
-            "cluster_id": labels.astype(np.int32).tolist(),
-            "x": X2[:, 0].astype(np.float32).tolist(),
-            "y": X2[:, 1].astype(np.float32).tolist(),
-            "is_noise": (labels < 0).tolist(),
+            "cluster_id": labels.astype(np.int32),
+            "x": np.ascontiguousarray(X2[:, 0], dtype=np.float32),
+            "y": np.ascontiguousarray(X2[:, 1], dtype=np.float32),
+            "is_noise": labels < 0,
         },
         schema={
             "uuid": pl.Utf8,
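The contiguity remark in the comment above can be checked with NumPy alone. This is a minimal sketch with small stand-in data. `columns_for_frame` is a hypothetical helper, not a function in claude-sql, and the dict it returns has the shape of the column mapping the diff passes to `pl.DataFrame`.

```python
import numpy as np


def columns_for_frame(labels: np.ndarray, X2: np.ndarray) -> dict[str, np.ndarray]:
    """Build typed column arrays without a .tolist() round trip (hypothetical helper)."""
    return {
        "cluster_id": labels.astype(np.int32),
        # X2[:, 0] is a strided view into a row-major matrix, so copy it into
        # a contiguous float32 buffer before handing it on.
        "x": np.ascontiguousarray(X2[:, 0], dtype=np.float32),
        "y": np.ascontiguousarray(X2[:, 1], dtype=np.float32),
        # Comparison already yields a boolean array; no Python bools needed.
        "is_noise": labels < 0,
    }


# Stand-in data: five 2-D points, where label -1 marks noise.
X2 = np.arange(10, dtype=np.float64).reshape(5, 2)
labels = np.array([0, 0, -1, 1, 1])
cols = columns_for_frame(labels, X2)

print(X2[:, 0].flags["C_CONTIGUOUS"])   # flag of the sliced view
print(cols["x"].flags["C_CONTIGUOUS"])  # flag of the copied column
```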

{claude_sql-1.0.0/src/claude_sql → claude_sql-1.1.0/src/claude_sql/analytics}/community_worker.py RENAMED

@@ -63,7 +63,7 @@ import numpy as np
 import polars as pl
 from loguru import logger
-from claude_sql.config import Settings
+from claude_sql.core.config import Settings
 if TYPE_CHECKING:
     import duckdb
@@ -85,10 +85,12 @@ def _load_session_centroids(
     """Return ``(session_ids, centroids)`` where centroids is ``(N_sessions, dim)`` float32.
     Joins the ``message_embeddings`` view (LanceDB-backed via ``register_vss``)
-    to the v1 ``messages`` view on uuid, then aggregates inside DuckDB
-    (unnest with position → ``avg`` per (session, dim_index) → ordered
-    ``list``). The L2-normalize step stays in numpy where
-    ``np.linalg.norm`` is faster on a contiguous (N, dim) matrix.
+    to the v1 ``messages`` view on uuid, pulls one ``(session_id, embedding)``
+    row per message ordered by session, then computes per-session means in
+    numpy with a single ``np.add.reduceat`` segmented sum (sessions are
+    contiguous after the ``ORDER BY``) followed by an L2-normalize. This keeps
+    the intermediate at ``N_messages`` rows rather than the ``N_messages ×
+    dim`` explosion the prior ``unnest``-per-dimension aggregation produced.
     ``embeddings_parquet_path`` is accepted for back-compat with callers that
     still pass it but is no longer consulted — the connection's
@@ -96,29 +98,24 @@ def _load_session_centroids(
     """
     del embeddings_parquet_path  # legacy kwarg — view is the source of truth now
     logger.info("Loading message embeddings and joining to sessions...")
+    # Pull one row per message (session_id, embedding) ordered by session, then
+    # compute per-session means in numpy via a single segmented reduction.
+    #
+    # The prior implementation unnested every embedding into ``dim`` rows
+    # (``generate_subscripts`` + ``unnest``) and grouped on (session, pos) —
+    # that explodes the working set to N_messages × dim rows before the
+    # average. Carrying the FLOAT[dim] vector through the join and reducing it
+    # in numpy keeps the intermediate at N_messages rows and is 1.4–1.8×
+    # faster on a 24k–96k-message corpus (measured), with the win widening as
+    # the corpus grows. ``ORDER BY session_id`` makes the sessions contiguous
+    # so ``np.add.reduceat`` can segment-sum without a Python per-session loop.
     sql = """
-        WITH joined AS (
-            SELECT CAST(m.session_id AS VARCHAR) AS session_id,
-                   e.embedding::FLOAT[] AS emb
-              FROM message_embeddings e
-              JOIN messages m
-                ON CAST(m.uuid AS VARCHAR) = e.uuid
-        ),
-        unrolled AS (
-            SELECT session_id,
-                   generate_subscripts(emb, 1) AS pos,
-                   unnest(emb) AS v
-              FROM joined
-        ),
-        agg AS (
-            SELECT session_id, pos, avg(v) AS m
-              FROM unrolled
-             GROUP BY 1, 2
-        )
-        SELECT session_id, list(m ORDER BY pos) AS centroid
-          FROM agg
-         GROUP BY 1
-         ORDER BY 1
+        SELECT CAST(m.session_id AS VARCHAR) AS session_id,
+               e.embedding AS emb
+          FROM message_embeddings e
+          JOIN messages m
+            ON CAST(m.uuid AS VARCHAR) = e.uuid
+         ORDER BY session_id
     """
     try:
         df = con.execute(sql).pl()
@@ -136,10 +133,32 @@ def _load_session_centroids(
             "Lance embeddings exist and the messages view is registered."
         )
-    sids = df["session_id"].to_list()
-    centroids = np.stack([np.asarray(c, dtype=np.float32) for c in df["centroid"].to_list()])
+    # ``emb`` is a DuckDB ``FLOAT[dim]`` column; polars surfaces it as a
+    # fixed-size ``Array(Float32, dim)`` dtype (occasionally a variable
+    # ``List`` if the cast was lost upstream). ``Series.to_numpy()`` extracts
+    # the buffer directly into a contiguous ``(N_messages, dim)`` matrix; for
+    # the fixed-``Array`` case it's a near-zero-copy view. The prior
+    # ``np.asarray(series.to_list(), ...)`` boxed every one of the
+    # N_messages × dim float32 values into a Python ``float`` object first —
+    # measured at 6–10× slower and ~43× higher peak RSS on a 6k–96k-message
+    # corpus (e.g. 7.4 s / 3.5 GB → 1.2 s / 83 MB at 96k). This is the
+    # read-side analog of the SQL-side ``unnest`` explosion removed in #65.
+    emb_arr = df["emb"].to_numpy()
+    if emb_arr.ndim == 1:
+        # Variable ``List`` dtype (or object array of rows) — stack to 2-D.
+        emb_arr = np.stack(list(emb_arr))
+    emb_np = np.ascontiguousarray(emb_arr, dtype=np.float32)
+    sessions = df["session_id"].to_numpy()
+    # ``return_index`` gives each group's first row offset on the sorted array;
+    # ``np.unique`` returns the labels already sorted, matching the prior
+    # ``ORDER BY 1`` contract. ``reduceat`` sums each [start_i, start_{i+1})
+    # segment in one pass.
+    sids_arr, starts, counts = np.unique(sessions, return_index=True, return_counts=True)
+    summed = np.add.reduceat(emb_np, starts, axis=0)
+    centroids = (summed / counts[:, None]).astype(np.float32)
     norms = np.linalg.norm(centroids, axis=1, keepdims=True)
     centroids = centroids / np.where(norms == 0, 1.0, norms)
+    sids = sids_arr.tolist()
     logger.info("Computed {} session centroids (dim={})", len(sids), centroids.shape[1])
     return sids, centroids
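The segmented-mean step in this hunk can be tried in isolation. The sketch below follows the same `np.unique` + `np.add.reduceat` pattern on a toy input. `session_centroids` is a hypothetical name, and the explicit stable sort stands in for the SQL `ORDER BY session_id`, which the real function relies on instead.

```python
import numpy as np


def session_centroids(session_ids, embeddings):
    """Per-session mean embedding, L2-normalized (illustrative sketch)."""
    sessions = np.asarray(session_ids)
    emb = np.ascontiguousarray(embeddings, dtype=np.float32)

    # Make each session's rows contiguous; the worker gets this from SQL.
    order = np.argsort(sessions, kind="stable")
    sessions, emb = sessions[order], emb[order]

    # starts[i] is the first row of group i in the sorted array, so
    # reduceat sums each [starts[i], starts[i + 1]) block in one pass.
    labels, starts, counts = np.unique(
        sessions, return_index=True, return_counts=True
    )
    summed = np.add.reduceat(emb, starts, axis=0)
    centroids = (summed / counts[:, None]).astype(np.float32)

    # Guard against zero vectors before normalizing.
    norms = np.linalg.norm(centroids, axis=1, keepdims=True)
    centroids = centroids / np.where(norms == 0, 1.0, norms)
    return labels.tolist(), centroids


# Two sessions, two messages each, deliberately interleaved.
ids = ["b", "a", "b", "a"]
vecs = [[2.0, 0.0], [0.0, 1.0], [4.0, 0.0], [0.0, 3.0]]
sids, centroids = session_centroids(ids, vecs)
print(sids)
print(centroids)
```

`reduceat` only gives per-session sums when the rows are grouped; on unsorted input the segments would mix sessions, which is why the ordering step matters.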

{claude_sql-1.0.0/src/claude_sql → claude_sql-1.1.0/src/claude_sql/analytics}/conflicts_worker.py RENAMED

@@ -41,8 +41,8 @@ import pyarrow as pa
 import pyarrow.parquet as pq
 from loguru import logger
-from claude_sql import checkpointer, retry_queue
-from claude_sql.llm_shared import (
+from claude_sql.core import checkpointer, retry_queue
+from claude_sql.core.llm_shared import (
     CONFLICTS_SYSTEM_PROMPT,
     _build_bedrock_client,
     _count_pending_sessions,
@@ -50,16 +50,16 @@ from claude_sql.llm_shared import (
     classify_one,
     pipeline_cache_stats,
 )
-from claude_sql.parquet_shards import iter_part_files, read_all, write_part
-from claude_sql.schemas import SESSION_CONFLICTS_SCHEMA
-from claude_sql.session_text import iter_session_texts, session_bounds
+from claude_sql.core.parquet_shards import iter_part_files, read_all, write_part
+from claude_sql.core.schemas import SESSION_CONFLICTS_SCHEMA
+from claude_sql.core.session_text import iter_session_texts, session_bounds
 if TYPE_CHECKING:
     from pathlib import Path
     import duckdb
-    from claude_sql.config import Settings
+    from claude_sql.core.config import Settings
 # v1.0 parquet schema — kept as a module constant so the worker, the test
@@ -143,7 +143,7 @@ async def _conflicts_async(
     _purge_legacy_shards(settings.conflicts_parquet_path)
     already: set[str] = set()
-    done_df = read_all(settings.conflicts_parquet_path)
+    done_df = read_all(settings.conflicts_parquet_path, columns=["session_id"])
     if done_df is not None and done_df.height > 0:
         already = set(done_df["session_id"].to_list())
@@ -304,7 +304,7 @@ def detect_conflicts(
     thinking_mode = "disabled" if no_thinking else settings.classify_thinking
     if dry_run:
         already: set[str] = set()
-        done_df = read_all(settings.conflicts_parquet_path)
+        done_df = read_all(settings.conflicts_parquet_path, columns=["session_id"])
         if done_df is not None and done_df.height > 0:
             already = set(done_df["session_id"].to_list())
         pending_count = _count_pending_sessions(

{claude_sql-1.0.0/src/claude_sql → claude_sql-1.1.0/src/claude_sql/analytics}/embed_worker.py RENAMED

@@ -22,9 +22,7 @@ from datetime import UTC, datetime
 from pathlib import Path
 from typing import TYPE_CHECKING, Any
-import boto3
 import polars as pl
-from botocore.config import Config as BotoConfig
 from botocore.exceptions import (
     ClientError,
     ConnectionError as BotoConnectionError,
@@ -40,9 +38,9 @@ from tenacity import (
     wait_exponential,
 )
-from claude_sql import lance_store
-from claude_sql.config import Settings
-from claude_sql.logging_setup import loguru_before_sleep
+from claude_sql.core.config import Settings
+from claude_sql.core.llm_shared import _build_bedrock_client
+from claude_sql.core.logging_setup import loguru_before_sleep
 if TYPE_CHECKING:
     import duckdb
@@ -135,6 +133,11 @@ def discover_unembedded(
     list of (uuid, text) tuples
         Messages needing embedding, in DuckDB's scan order.
     """
+    # Deferred so importing this module (e.g. via the CLI for a non-embed
+    # command) doesn't drag in the ~2.6s lancedb import subtree. lance_store
+    # is only touched once an embed-path function actually runs.
+    from claude_sql.core import lance_store
     # Read the already-embedded uuids straight from Lance via its Python API.
     # We don't go through the DuckDB ``message_embeddings`` view here because
     # the embed command runs with ``register_vss`` skipped (cli.py:1205-1213),
@@ -179,31 +182,6 @@ def discover_unembedded(
     return pairs
-def _build_bedrock_client(settings: Settings) -> Any:
-    """Construct a boto3 ``bedrock-runtime`` client from settings.
-    Parameters
-    ----------
-    settings
-        Application settings providing the target AWS region.
-    Returns
-    -------
-    botocore client
-        A low-level ``bedrock-runtime`` client.
-    """
-    # Disable botocore's internal retry layer so tenacity sees throttling
-    # immediately — otherwise botocore silently absorbs 4 retries and our
-    # retry policy never kicks in.  Also bump read_timeout for large batches.
-    boto_cfg = BotoConfig(
-        region_name=settings.region,
-        retries={"max_attempts": 0, "mode": "standard"},
-        read_timeout=60,
-        connect_timeout=10,
-    )
-    return boto3.client("bedrock-runtime", config=boto_cfg)
 @retry(
     # Cohere Embed v4 on Bedrock has a strict TPM bucket that replenishes over
     # tens of seconds; wait up to 60s between attempts and try up to 10 times
@@ -462,6 +440,10 @@ async def run_backfill(
             "dry_run": True,
         }
+    # Deferred (see discover_unembedded) — keeps the lancedb import off the
+    # dry-run / nothing-pending paths above, which return before this point.
+    from claude_sql.core import lance_store
     # Checkpoint every N messages so a throttling-induced timeout doesn't
     # discard work already embedded. chunk must be a multiple of batch_size.
     chunk_size = max(settings.batch_size * 4, 256)
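The deferred-import pattern used in these hunks can be shown with the standard library alone. In this sketch `colorsys` is an arbitrary stand-in for a slow dependency such as `lancedb`, and both function names are hypothetical.

```python
import sys


def version_banner() -> str:
    """A cheap command path that never needs the heavy dependency."""
    return "demo 1.0"


def hue_of(rgb: tuple[float, float, float]) -> float:
    """A path that does need the dependency, imported on first use."""
    # Deferred import: the cost is paid only when this function runs, not
    # when the module is loaded. Later calls hit the sys.modules cache.
    import colorsys

    hue, _saturation, _value = colorsys.rgb_to_hsv(*rgb)
    return hue


print(version_banner())           # runs without touching colorsys
print(hue_of((1.0, 0.0, 0.0)))    # triggers the import
print("colorsys" in sys.modules)  # the module is now cached
```

The trade-off is that an import error surfaces at call time rather than at startup, so the pattern fits optional or command-specific dependencies best.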

{claude_sql-1.0.0/src/claude_sql → claude_sql-1.1.0/src/claude_sql/analytics}/friction_worker.py RENAMED

@@ -49,8 +49,8 @@ import duckdb
 import polars as pl
 from loguru import logger
-from claude_sql import checkpointer, retry_queue
-from claude_sql.llm_shared import (
+from claude_sql.core import checkpointer, retry_queue
+from claude_sql.core.llm_shared import (
     USER_FRICTION_SYSTEM_PROMPT,
     BedrockRefusalError,
     _build_bedrock_client,
@@ -58,12 +58,12 @@ from claude_sql.llm_shared import (
     classify_one,
     pipeline_cache_stats,
 )
-from claude_sql.parquet_shards import read_all, write_part
-from claude_sql.schemas import USER_FRICTION_SCHEMA
-from claude_sql.session_text import session_bounds
+from claude_sql.core.parquet_shards import read_all, write_part
+from claude_sql.core.schemas import USER_FRICTION_SCHEMA
+from claude_sql.core.session_text import session_bounds
 if TYPE_CHECKING:
-    from claude_sql.config import Settings
+    from claude_sql.core.config import Settings
 # ---------------------------------------------------------------------------
@@ -428,7 +428,7 @@ async def _classify_async(
     """Async body behind :func:`detect_user_friction`."""
     out_path = settings.user_friction_parquet_path
     already: set[str] = set()
-    done_df = read_all(out_path)
+    done_df = read_all(out_path, columns=["uuid"])
     if done_df is not None and done_df.height > 0:
         already = set(done_df["uuid"].to_list())

{claude_sql-1.0.0/src/claude_sql → claude_sql-1.1.0/src/claude_sql/analytics}/ingest.py RENAMED

@@ -49,8 +49,8 @@ import polars as pl
 import tiktoken
 from loguru import logger
-from claude_sql.config import Settings
-from claude_sql.parquet_shards import iter_part_files, write_part
+from claude_sql.core.config import Settings
+from claude_sql.core.parquet_shards import iter_part_files, write_part
 if TYPE_CHECKING:
     import duckdb

{claude_sql-1.0.0/src/claude_sql → claude_sql-1.1.0/src/claude_sql/analytics}/skills_catalog.py RENAMED

@@ -41,7 +41,7 @@ import yaml
 from loguru import logger
 from packaging.version import InvalidVersion, Version as _Version
-from claude_sql.config import Settings
+from claude_sql.core.config import Settings
 # Built-in Claude Code slash commands.  These never map to a SKILL.md on
 # disk but show up as ``<command-name>/clear</command-name>`` in the

{claude_sql-1.0.0/src/claude_sql → claude_sql-1.1.0/src/claude_sql/analytics}/terms_worker.py RENAMED

@@ -23,7 +23,7 @@ from loguru import logger
 if TYPE_CHECKING:
     import duckdb
-    from claude_sql.config import Settings
+    from claude_sql.core.config import Settings
 def run_terms(

{claude_sql-1.0.0/src/claude_sql → claude_sql-1.1.0/src/claude_sql/analytics}/trajectory_worker.py RENAMED

@@ -43,22 +43,22 @@ import anyio
 import polars as pl
 from loguru import logger
-from claude_sql import checkpointer, retry_queue
-from claude_sql.llm_shared import (
+from claude_sql.core import checkpointer, retry_queue
+from claude_sql.core.llm_shared import (
     BedrockRefusalError,
     _build_bedrock_client,
     _estimate_cost,
     classify_one,
     pipeline_cache_stats,
 )
-from claude_sql.parquet_shards import iter_part_files, write_part
-from claude_sql.schemas import TRAJECTORY_ARRAY_SCHEMA
-from claude_sql.session_text import session_bounds
+from claude_sql.core.parquet_shards import iter_part_files, replace_sessions, write_part
+from claude_sql.core.schemas import TRAJECTORY_ARRAY_SCHEMA
+from claude_sql.core.session_text import session_bounds
 if TYPE_CHECKING:
     import duckdb
-    from claude_sql.config import Settings
+    from claude_sql.core.config import Settings
 # ---------------------------------------------------------------------------
@@ -726,7 +726,7 @@ async def _trajectory_async(
     # Group by session to chunk per-session (anchor-sharing requires
     # contiguous windows from the same session in chunk order).
-    by_session: dict[str, list] = defaultdict(list)
+    by_session: dict[str, list[Any]] = defaultdict(list)
     for row in raw_rows:
         by_session[row[0]].append(row)
@@ -886,8 +886,20 @@ async def _trajectory_async(
             # don't collide on filenames — but we still keep the lock so the
             # in-memory ``written_box`` / ``processed_sessions`` set updates
             # in lockstep with the on-disk write.
+            #
+            # replace_sessions drops any prior rows for ``sid`` still sitting
+            # in the cache from earlier runs. The checkpointer gates
+            # computation on advancing (latest_ts, message_count) bounds but
+            # does NOT touch the parquet cache; without this step a growing
+            # active session duplicates its (prev_uuid, curr_uuid) pairs
+            # on every rerun. See GH #45.
             df = pl.DataFrame(all_rows, schema=_PARQUET_SCHEMA)
             async with write_lock:
+                replace_sessions(
+                    settings.trajectory_parquet_path,
+                    key_column="session_id",
+                    session_ids=[sid],
+                )
                 write_part(settings.trajectory_parquet_path, df)
                 written_box[0] += len(all_rows)
                 processed_sessions.add(sid)
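The duplication described in the comment above, and the replace-then-write fix, can be modeled in memory. This is a toy sketch: `replace_sessions_toy`, `write_part_toy`, and `pairs` are hypothetical stand-ins operating on lists, not the `parquet_shards` functions, whose bodies do not appear in this diff.

```python
Row = dict[str, str]


def replace_sessions_toy(
    shards: list[list[Row]], key_column: str, session_ids: list[str]
) -> None:
    """Drop every cached row whose key is in session_ids (in-memory stand-in)."""
    drop = set(session_ids)
    kept = [[row for row in shard if row[key_column] not in drop] for shard in shards]
    shards[:] = [shard for shard in kept if shard]  # discard shards left empty


def write_part_toy(shards: list[list[Row]], rows: list[Row]) -> None:
    """Append one new shard, as an append-only writer would."""
    shards.append(list(rows))


def pairs(sid: str, uuids: list[str]) -> list[Row]:
    """Consecutive (prev_uuid, curr_uuid) rows for one session."""
    return [
        {"session_id": sid, "prev_uuid": a, "curr_uuid": b}
        for a, b in zip(uuids, uuids[1:])
    ]


cache: list[list[Row]] = []
write_part_toy(cache, pairs("other", ["x1", "x2"]))

# Run 1 sees three messages; run 2 sees the same session grown to four.
for seen in (["m1", "m2", "m3"], ["m1", "m2", "m3", "m4"]):
    replace_sessions_toy(cache, "session_id", ["s1"])
    write_part_toy(cache, pairs("s1", seen))

s1_rows = [row for shard in cache for row in shard if row["session_id"] == "s1"]
print(len(s1_rows))  # one row per consecutive pair from the latest run
```

Without the `replace_sessions_toy` call, the second run would append its pairs on top of the first run's, leaving `(m1, m2)` and `(m2, m3)` in the cache twice.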

claude_sql-1.1.0/src/claude_sql/app/__init__.py ADDED

@@ -0,0 +1 @@
+"""claude-sql binary: cyclopts CLI + entry point."""
