PyPI - cat-stack - Versions diffs - 1.6.8__tar.gz → 2.0.0__tar.gz - Mend

cat-stack 1.6.8 (tar.gz) → 2.0.0 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (45)

{cat_stack-1.6.8 → cat_stack-2.0.0}/.gitignore RENAMED

@@ -41,3 +41,11 @@ survey-summarizer
 # R package build/check artifacts
 *.Rcheck/
 *.tar.gz
+# Local experiment scratch (not part of the package)
+/patches/
+/AGENTS.md
+/test_live_*.py
+/test_smoke_*.py
+/test_stress_*.py
+/test_parallel_*.py

{cat_stack-1.6.8 → cat_stack-2.0.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: cat-stack
-Version: 1.6.8
+Version: 2.0.0
 Summary: Domain-agnostic text, image, PDF, and DOCX classification engine powered by LLMs
 Project-URL: Documentation, https://github.com/chrissoria/cat-stack#readme
 Project-URL: Issues, https://github.com/chrissoria/cat-stack/issues
@@ -154,6 +154,67 @@ cat.explore(
 )
 ```
+### `collapse_themes()`
+Consolidate a long, redundant list of extracted category labels (e.g. the output of `explore()`) into a smaller, deduplicated taxonomy. Runs the semantic merge iteratively, then applies a single deterministic embedding re-merge over the whole result to collapse cross-batch lexical siblings (e.g. "tension" / "estrangement") that batched passes leave separate. Tuned to err toward over-segmentation (keeping categories) rather than over-merging.
+```python
+# Basic: aggressive merge, auto-stop at the quality peak
+cat.collapse_themes(
+    input_data=raw_labels,            # list[str] or a frequency Series/dict
+    api_key=key,
+    description="Why did you move?",   # the survey question / context
+    aggressive=True,
+    passes="auto",
+    user_model="gpt-4o",
+)
+```
+```python
+# Per-step model assignment: a cheap model thins restatements,
+# a stronger model does the conceptual merge (providers can differ)
+cat.collapse_themes(
+    input_data=raw_labels,
+    api_key=key,
+    description="Why did you move?",
+    aggressive=True,
+    passes="auto",
+    unique_model="Qwen/Qwen2.5-72B-Instruct:together",
+    unique_model_source="huggingface",
+    unique_passes=1,
+    merge_model="Qwen/Qwen3.6-35B-A3B:together",
+    merge_model_source="huggingface",
+    max_workers=8,
+)
+```
+**Parameters**
+| Parameter | Default | Description |
+| --- | --- | --- |
+| `input_data` | — | List of category labels, or a frequency `Series`/`dict` (`label -> count`). |
+| `api_key` | `None` | API key for the LLM provider (required). |
+| `description` | `""` | The survey question or context, used in the merge prompt. |
+| `passes` | `1` | Number of merge iterations, or `"auto"` to iterate until the embedding-quality benchmark peaks. |
+| `max_passes` | `10` | Cap on iterations when `passes="auto"`. |
+| `batch_size` | `40` | Labels per LLM chunk (`ceil(n / batch_size)` calls per pass). |
+| `aggressive` | `False` | `True` = conceptual-merge prompt (compress related labels); `False` = extract-unique (faithful thinning, removes restatements only). |
+| `dedupe_threshold` | `0.95` | Jaro-Winkler similarity at/above which normalized labels are deduped (`1.0` = exact only). |
+| `embedding_merge_threshold` | `0.92` | Cosine similarity at/above which labels are merged in the pre-LLM embedding step. `None`/`>=1.0` disables it. |
+| `shuffle` | `True` | Randomize order each pass so batch composition varies (improves convergence stability). |
+| `final_consolidation` | `0.82` | Cosine threshold for one greedy global embedding re-merge after all passes, collapsing cross-batch duplicates. Conservative by design (errs toward keeping categories). `False`/`None` skips it. |
+| `user_model` | `"gpt-4o"` | Model for the merge phase. Use a capable model — small models can degenerate. |
+| `model_source` | `"auto"` | Provider for `user_model` (`"auto"`, `"openai"`, `"huggingface"`, …). |
+| `unique_model` | `None` | If set, run an initial extract-unique thinning phase on this (typically cheaper) model before the merge phase. `None` skips the phase (backward compatible). |
+| `unique_model_source` | `"auto"` | Provider for `unique_model` — can differ from the merge phase. |
+| `unique_passes` | `1` | Number of thinning passes when `unique_model` is set. |
+| `merge_model` | `None` | Model for the merge phase; falls back to `user_model` when `None`. |
+| `merge_model_source` | `"auto"` | Provider for `merge_model`. |
+| `creativity` | `0` | Temperature (`0` = deterministic). |
+| `max_workers` | `1` | Batches processed concurrently per pass. |
+| `random_state` | `None` | Seed for shuffling (per-pass seed = `random_state + pass`). |
+| `filename` | `None` | Optional CSV path to save the final list. |
+| `progress_callback` | `None` | Optional `callback(pass, passes, label)` for progress reporting. |
 ### `summarize()`
 Summarize text or PDF documents, with optional multi-model ensemble.
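
Reviewer note on `batch_size`, `shuffle`, and `random_state`: the table above says each pass makes `ceil(n / batch_size)` calls and uses a per-pass seed of `random_state + pass`. The package's own batching code is not part of this diff, so the following is only a minimal sketch of that arithmetic. The helper name `plan_pass` and its `pass_index` argument are hypothetical.

```python
import math
import random


def plan_pass(labels, batch_size=40, random_state=None, pass_index=0):
    """Shuffle labels for one pass, then split them into fixed-size batches.

    Hypothetical helper: it mirrors the README's description
    (per-pass seed = random_state + pass), not the package's real code.
    """
    order = list(labels)
    seed = None if random_state is None else random_state + pass_index
    random.Random(seed).shuffle(order)  # seed=None means a nondeterministic shuffle
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


labels = [f"label {i}" for i in range(95)]
batches = plan_pass(labels, batch_size=40, random_state=7, pass_index=0)

# 95 labels at 40 per batch: two full batches plus one of 15.
expected_calls = math.ceil(len(labels) / 40)
print(len(batches), expected_calls)
```

Because the seed changes with the pass index, a fixed `random_state` should still give a different batch composition on each pass while keeping the whole run reproducible.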

{cat_stack-1.6.8 → cat_stack-2.0.0}/README.md RENAMED

@@ -118,6 +118,67 @@ cat.explore(
 )
 ```
+### `collapse_themes()`
+Consolidate a long, redundant list of extracted category labels (e.g. the output of `explore()`) into a smaller, deduplicated taxonomy. Runs the semantic merge iteratively, then applies a single deterministic embedding re-merge over the whole result to collapse cross-batch lexical siblings (e.g. "tension" / "estrangement") that batched passes leave separate. Tuned to err toward over-segmentation (keeping categories) rather than over-merging.
+```python
+# Basic: aggressive merge, auto-stop at the quality peak
+cat.collapse_themes(
+    input_data=raw_labels,            # list[str] or a frequency Series/dict
+    api_key=key,
+    description="Why did you move?",   # the survey question / context
+    aggressive=True,
+    passes="auto",
+    user_model="gpt-4o",
+)
+```
+```python
+# Per-step model assignment: a cheap model thins restatements,
+# a stronger model does the conceptual merge (providers can differ)
+cat.collapse_themes(
+    input_data=raw_labels,
+    api_key=key,
+    description="Why did you move?",
+    aggressive=True,
+    passes="auto",
+    unique_model="Qwen/Qwen2.5-72B-Instruct:together",
+    unique_model_source="huggingface",
+    unique_passes=1,
+    merge_model="Qwen/Qwen3.6-35B-A3B:together",
+    merge_model_source="huggingface",
+    max_workers=8,
+)
+```
+**Parameters**
+| Parameter | Default | Description |
+| --- | --- | --- |
+| `input_data` | — | List of category labels, or a frequency `Series`/`dict` (`label -> count`). |
+| `api_key` | `None` | API key for the LLM provider (required). |
+| `description` | `""` | The survey question or context, used in the merge prompt. |
+| `passes` | `1` | Number of merge iterations, or `"auto"` to iterate until the embedding-quality benchmark peaks. |
+| `max_passes` | `10` | Cap on iterations when `passes="auto"`. |
+| `batch_size` | `40` | Labels per LLM chunk (`ceil(n / batch_size)` calls per pass). |
+| `aggressive` | `False` | `True` = conceptual-merge prompt (compress related labels); `False` = extract-unique (faithful thinning, removes restatements only). |
+| `dedupe_threshold` | `0.95` | Jaro-Winkler similarity at/above which normalized labels are deduped (`1.0` = exact only). |
+| `embedding_merge_threshold` | `0.92` | Cosine similarity at/above which labels are merged in the pre-LLM embedding step. `None`/`>=1.0` disables it. |
+| `shuffle` | `True` | Randomize order each pass so batch composition varies (improves convergence stability). |
+| `final_consolidation` | `0.82` | Cosine threshold for one greedy global embedding re-merge after all passes, collapsing cross-batch duplicates. Conservative by design (errs toward keeping categories). `False`/`None` skips it. |
+| `user_model` | `"gpt-4o"` | Model for the merge phase. Use a capable model — small models can degenerate. |
+| `model_source` | `"auto"` | Provider for `user_model` (`"auto"`, `"openai"`, `"huggingface"`, …). |
+| `unique_model` | `None` | If set, run an initial extract-unique thinning phase on this (typically cheaper) model before the merge phase. `None` skips the phase (backward compatible). |
+| `unique_model_source` | `"auto"` | Provider for `unique_model` — can differ from the merge phase. |
+| `unique_passes` | `1` | Number of thinning passes when `unique_model` is set. |
+| `merge_model` | `None` | Model for the merge phase; falls back to `user_model` when `None`. |
+| `merge_model_source` | `"auto"` | Provider for `merge_model`. |
+| `creativity` | `0` | Temperature (`0` = deterministic). |
+| `max_workers` | `1` | Batches processed concurrently per pass. |
+| `random_state` | `None` | Seed for shuffling (per-pass seed = `random_state + pass`). |
+| `filename` | `None` | Optional CSV path to save the final list. |
+| `progress_callback` | `None` | Optional `callback(pass, passes, label)` for progress reporting. |
 ### `summarize()`
 Summarize text or PDF documents, with optional multi-model ensemble.
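
Reviewer note on `final_consolidation`: the README describes it as one greedy global re-merge by cosine similarity, applied after all passes. The implementation in `collapse_themes.py` is not shown in this diff, so the sketch below only illustrates the general idea of a greedy threshold merge. The function names and the three-dimensional toy vectors are hypothetical; real embeddings would come from whatever embedding model the package uses.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_merge(labels, vectors, threshold=0.82):
    """Walk the labels in order and keep one only if every label kept so far
    is below the similarity threshold. Earlier labels win ties."""
    kept = []  # list of (label, vector) pairs
    for label, vec in zip(labels, vectors):
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((label, vec))
    return [label for label, _ in kept]


# Toy vectors, invented for illustration only.
labels = ["family tension", "family estrangement", "job relocation"]
vectors = [(1.0, 0.1, 0.0), (0.9, 0.3, 0.0), (0.0, 0.1, 1.0)]

merged = greedy_merge(labels, vectors, threshold=0.82)
strict = greedy_merge(labels, vectors, threshold=0.99)
print(merged)
print(strict)
```

In this toy data the first two vectors have a cosine similarity of roughly 0.975, so they merge at `0.82` but stay separate at `0.99`. A higher threshold keeps more categories, which matches the README's stated preference for over-segmentation.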

{cat_stack-1.6.8 → cat_stack-2.0.0}/src/catstack/__about__.py RENAMED

@@ -1,7 +1,7 @@
 # SPDX-FileCopyrightText: 2025-present Christopher Soria <chrissoria@berkeley.edu>
 #
 # SPDX-License-Identifier: GPL-3.0-or-later
-__version__ = "1.6.8"
+__version__ = "2.0.0"
 __author__ = "Chris Soria"
 __email__ = "chrissoria@berkeley.edu"
 __title__ = "cat-stack"

{cat_stack-1.6.8 → cat_stack-2.0.0}/src/catstack/__init__.py RENAMED

@@ -18,6 +18,7 @@ from .__about__ import (
 # Main entry points
 from .extract import extract
 from .explore import explore
+from .collapse_themes import collapse_themes
 from .classify import classify
 from .summarize import summarize
 from .prompt_tune import prompt_tune
@@ -103,6 +104,7 @@ __all__ = [
     # Main entry points
     "extract",
     "explore",
+    "collapse_themes",
     "classify",
     "summarize",
     "prompt_tune",

{cat_stack-1.6.8 → cat_stack-2.0.0}/src/catstack/_formatter.py RENAMED

@@ -45,6 +45,11 @@ def _check_dependencies():
 def _check_dependencies_installed() -> bool:
     """Pure check — returns True if all formatter deps import successfully.
     No side effects, no install attempts."""
+    # If a dep was just pip-installed in this process's lifetime, the import
+    # system may have cached its earlier absence; clear that so the re-check
+    # actually sees the freshly-installed package.
+    import importlib
+    importlib.invalidate_caches()
     try:
         import torch  # noqa: F401
         import transformers  # noqa: F401
@@ -165,7 +170,31 @@ def _ensure_dependencies(verbose: bool = True) -> bool:
             "  To skip this and disable the formatter, pass json_formatter=False."
         )
-    return _install_dependencies(verbose=verbose)
+    ok = _install_dependencies(verbose=verbose)
+    if not ok:
+        # Freshly pip-installed packages (esp. compiled ones like torch) often
+        # cannot be imported by the SAME running process — but they ARE on disk
+        # now. Tell the user to re-run rather than silently degrading every row
+        # to an error.
+        if verbose and _deps_on_disk():
+            print(
+                "[CatLLM] Formatter dependencies were just installed but cannot "
+                "be imported into the already-running process. Please RE-RUN your "
+                "command — they will load on the next start. (Avoid this by "
+                "pre-installing: pip install 'cat-stack[formatter]'.)"
+            )
+    return ok
+def _deps_on_disk() -> bool:
+    """True if the formatter deps are findable on disk (importable in a FRESH
+    process) even if they failed to import in the current one."""
+    import importlib.util
+    try:
+        return all(importlib.util.find_spec(m) is not None
+                   for m in ("torch", "transformers", "accelerate"))
+    except (ImportError, ValueError):
+        return False
 def _is_model_cached() -> bool:
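
Reviewer note on the two hunks above: they separate "can be imported in this process" from "is findable on disk". The sketch below isolates those two checks using only the standard library, so it can be run without `torch` installed. The function names are hypothetical, and the module names in the demo are stand-ins.

```python
import importlib
import importlib.util


def importable_now(module_names):
    """True if every module imports in the current process.

    invalidate_caches() is called first so that finders notice packages
    installed after the interpreter started.
    """
    importlib.invalidate_caches()
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            return False
    return True


def present_on_disk(module_names):
    """True if a spec can be located for every module.

    For a top-level name that is not already imported, find_spec() locates
    the module without executing it.
    """
    try:
        return all(importlib.util.find_spec(n) is not None for n in module_names)
    except (ImportError, ValueError):
        return False


# Stand-in module names: two stdlib modules and one that should not exist.
print(importable_now(["json", "os"]))
print(present_on_disk(["json", "os"]))
print(present_on_disk(["surely_missing_module_xyz123"]))
```

When `present_on_disk` is true but `importable_now` is false, asking the user to re-run is the reasonable response, which is the case the new message in `_ensure_dependencies` covers.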
@@ -205,6 +234,51 @@ def ensure_formatter_available() -> bool:
     return True  # actual download happens in load_formatter()
+def _load_formatter_tokenizer(AutoTokenizer):
+    """Load the formatter tokenizer, defending against a malformed
+    `tokenizer_config.json`.
+    Some published configs store `extra_special_tokens` as a LIST, but
+    transformers 4.56–4.57.x expect a dict and crash in
+    `_set_model_specific_special_tokens` with
+    `'list' object has no attribute 'keys'`. On that failure we snapshot the
+    repo locally, normalize a list-valued `extra_special_tokens` to `{}`
+    (the tokens already live in `added_tokens`/`special_tokens_map`, so
+    dropping the field is lossless), and load from the patched local copy.
+    """
+    try:
+        return AutoTokenizer.from_pretrained(
+            _MERGED_MODEL_REPO, trust_remote_code=True
+        )
+    except (AttributeError, TypeError) as e:
+        if "keys" not in str(e) and "extra_special_tokens" not in str(e):
+            raise
+        import json
+        import os
+        from huggingface_hub import snapshot_download
+        local_dir = snapshot_download(_MERGED_MODEL_REPO)
+        cfg_path = os.path.join(local_dir, "tokenizer_config.json")
+        with open(cfg_path) as f:
+            cfg = json.load(f)
+        if isinstance(cfg.get("extra_special_tokens"), list):
+            cfg["extra_special_tokens"] = {}
+            # snapshot dirs are often read-only symlink caches; patch a copy.
+            import tempfile
+            import shutil
+            patched = tempfile.mkdtemp(prefix="catllm_formatter_tok_")
+            for fn in os.listdir(local_dir):
+                src = os.path.join(local_dir, fn)
+                if os.path.isfile(src):
+                    shutil.copy(src, os.path.join(patched, fn))
+            with open(os.path.join(patched, "tokenizer_config.json"), "w") as f:
+                json.dump(cfg, f)
+            print("[CatLLM] Patched malformed extra_special_tokens in the "
+                  "formatter tokenizer config (list -> {}).")
+            return AutoTokenizer.from_pretrained(patched, trust_remote_code=True)
+        raise
 def load_formatter(device=None):
     """
     Load the merged formatter model and tokenizer.
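
Reviewer note on `_load_formatter_tokenizer`: the config-patching step can be exercised on its own, without `transformers` or a network download. The sketch below reproduces only the "copy the files, normalize a list-valued `extra_special_tokens` to `{}`" part against a throwaway directory. The helper name `patched_config_copy` and the demo config contents are hypothetical.

```python
import json
import os
import shutil
import tempfile


def patched_config_copy(src_dir, field="extra_special_tokens"):
    """Copy the regular files in src_dir to a fresh temp dir, replacing a
    list-valued `field` in tokenizer_config.json with an empty dict.

    Returns the new directory, or None when the field is not a list.
    The source directory is never modified.
    """
    cfg_path = os.path.join(src_dir, "tokenizer_config.json")
    with open(cfg_path, encoding="utf-8") as f:
        cfg = json.load(f)
    if not isinstance(cfg.get(field), list):
        return None
    cfg[field] = {}

    dst_dir = tempfile.mkdtemp(prefix="tok_patch_")
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):  # follows symlinks, so cache links are copied as files
            shutil.copy(src, os.path.join(dst_dir, name))
    with open(os.path.join(dst_dir, "tokenizer_config.json"), "w", encoding="utf-8") as f:
        json.dump(cfg, f)
    return dst_dir


# Demo with an invented, deliberately malformed config.
demo_src = tempfile.mkdtemp()
with open(os.path.join(demo_src, "tokenizer_config.json"), "w", encoding="utf-8") as f:
    json.dump({"extra_special_tokens": ["<a>", "<b>"], "model_max_length": 512}, f)

demo_dst = patched_config_copy(demo_src)
with open(os.path.join(demo_dst, "tokenizer_config.json"), encoding="utf-8") as f:
    demo_cfg = json.load(f)
print(demo_cfg)
```

Writing to a copy rather than in place matters because, as the diff's own comment notes, snapshot directories are often read-only symlink caches.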
@@ -230,15 +304,21 @@ def load_formatter(device=None):
     dtype = torch.float16 if device == "cuda" else torch.float32
     print(f"[CatLLM] Loading JSON formatter on {device}...")
-    tokenizer = AutoTokenizer.from_pretrained(
-        _MERGED_MODEL_REPO, trust_remote_code=True
-    )
+    tokenizer = _load_formatter_tokenizer(AutoTokenizer)
     if tokenizer.pad_token is None:
         tokenizer.pad_token = tokenizer.eos_token
-    model = AutoModelForCausalLM.from_pretrained(
-        _MERGED_MODEL_REPO, dtype=dtype, trust_remote_code=True
-    )
+    # `dtype=` is the transformers >=4.56 kwarg; older versions only accept
+    # `torch_dtype=` and crash if `dtype=` leaks into the config. Try the new
+    # name, fall back to the old one.
+    try:
+        model = AutoModelForCausalLM.from_pretrained(
+            _MERGED_MODEL_REPO, dtype=dtype, trust_remote_code=True
+        )
+    except TypeError:
+        model = AutoModelForCausalLM.from_pretrained(
+            _MERGED_MODEL_REPO, torch_dtype=dtype, trust_remote_code=True
+        )
     model = model.to(device)
     model.eval()
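
Reviewer note on the `dtype=` / `torch_dtype=` fallback: the try/except pattern in this hunk can be shown with plain functions. The sketch below uses two stand-in loaders rather than `transformers`, so every name in it is hypothetical. It only demonstrates the control flow: try the new keyword, retry with the old one on `TypeError`.

```python
def call_with_kwarg_fallback(fn, value, new_name, old_name, **kwargs):
    """Call fn with `value` under new_name; on TypeError retry under old_name.

    Caveat: a TypeError raised for an unrelated reason also triggers the
    retry, so the second call may fail again with a different message.
    """
    try:
        return fn(**{new_name: value}, **kwargs)
    except TypeError:
        return fn(**{old_name: value}, **kwargs)


# Stand-in loaders that model an older and a newer keyword name.
def old_loader(repo, torch_dtype=None):
    return ("old", repo, torch_dtype)


def new_loader(repo, dtype=None):
    return ("new", repo, dtype)


from_old = call_with_kwarg_fallback(
    old_loader, "float16", "dtype", "torch_dtype", repo="some/repo"
)
from_new = call_with_kwarg_fallback(
    new_loader, "float16", "dtype", "torch_dtype", repo="some/repo"
)
print(from_old)
print(from_new)
```

An alternative design would be to branch on the installed library version instead of catching `TypeError`. The try/except form used in the diff avoids version parsing, at the cost of the broad-exception caveat noted in the docstring.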
@@ -281,7 +361,9 @@ def run_formatter(raw_output, categories, model, tokenizer, device):
     with torch.no_grad():
         out = model.generate(
             **inputs,
-            max_new_tokens=128,
+            # 512 (was 128): a large category set produces a long N-key JSON
+            # object; 128 tokens truncated it for 28/48-category tasks.
+            max_new_tokens=512,
             do_sample=False,
             temperature=None,
             top_p=None,
