ssdiff 0.2.0.tar.gz → 0.2.1.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (22)

{ssdiff-0.2.0/ssdiff.egg-info → ssdiff-0.2.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: ssdiff
-Version: 0.2.0
+Version: 0.2.1
 Summary: Supervised Semantic Differential (SSD): interpretable, embedding-based analysis of concept meaning in text.
 Author-email: Hubert Plisiecki <hplisiecki@gmail.com>, Paweł Lenartowicz <pawellenartowicz@europe.com>
 License-Expression: MIT
@@ -271,49 +271,75 @@ These helpers make lexicon selection transparent and data-driven (you can also h
 ### `suggest_lexicon(...)`
-Rank tokens by balanced coverage with a mild penalty for strong correlation with
-- `cov_all`: fraction of essays containing the token (0/1 presence)
-- `cov_bal`: average presence across 𝑛 quantile bins of 𝑦 (default: 4 bins)
-- `corr`: Pearson correlation between 0/1 presence and standardized 𝑦
-- `rank = cov_bal * (1 - min(1, |corr|/corr_cap))` (default `corr_cap=0.30`)
+Rank tokens by balanced coverage with a mild penalty for strong association with the outcome.
+All three lexicon utilities accept `var_type='continuous'` (default) or `var_type='categorical'`:
+| | `var_type='continuous'` | `var_type='categorical'` |
+|---|---|---|
+| `cov_bal` | average presence across 𝑛 quantile bins of 𝑦 | average presence across group labels |
+| `corr` | Pearson correlation between 0/1 presence and standardized 𝑦 | Cramér's V between 0/1 presence and group label |
+| `q1` / `q4` | coverage in lowest / highest 𝑦 quantile bin | min / max group coverage |
+| `rank` | `cov_bal * (1 - min(1, \|corr\|/corr_cap))` | same formula (Cramér's V replaces Pearson) |
 Accepts a DataFrame (`text_col`, `score_col`) or a `(texts, y)` tuple where texts can be raw strings or token lists.
 ```python
 from ssdiff import suggest_lexicon
-# Using a DataFrame
+# Continuous outcome (default)
 cands_df = suggest_lexicon(df, text_col="lemmatized", score_col="questionnaire_result", top_k=150)
 # Or using a tuple (texts, y)
 texts = [" ".join(doc) for doc in docs]
 cands_df2 = suggest_lexicon((docs, y), top_k=150)
+# Categorical groups
+cands_cat = suggest_lexicon(df, text_col="lemmatized", score_col="diagnosis", top_k=150, var_type="categorical")
+cands_cat2 = suggest_lexicon((docs, groups), top_k=150, var_type="categorical")
 ```
 ### `token_presence_stats(...)`
-Per-token coverage & correlation diagnostics:
+Per-token coverage & association diagnostics:
 ```python
 from ssdiff import token_presence_stats
-stats = token_presence_stats((texts, y), token="concept_keyword_1", n_bins=4, verbose=True)
-print(stats)  # dict: token, docs, cov_all, cov_bal, corr, rank
+# Continuous
+stats = token_presence_stats(texts, y, token="concept_keyword_1", n_bins=4, verbose=True)
+print(stats)  # dict: token, docs, cov_all, cov_bal, corr, rank, q1, q4
+# Categorical — output also includes group_cov (per-group coverage dict)
+stats = token_presence_stats(texts, groups, token="concept_keyword_1", var_type="categorical", verbose=True)
+print(stats["group_cov"])  # e.g. {"control": 0.45, "depression": 0.62}
 ```
 ### `coverage_by_lexicon(...)`
 Summary for your chosen lexicon:
-- `summary` : `docs_any`, `cov_all`, `q1`. `q4`, `corr_any`
-  - `q1` / `q4`: coverage within the lowest/highest 𝑦 bins (by quantile)
-- `per_token_df`: stats for each token
+- `summary` : `docs_any`, `cov_all`, `q1`, `q4`, `corr_any`, `hits_mean`, `hits_median`, `types_mean`, `types_median`
+  - `q1` / `q4`: coverage within the lowest/highest 𝑦 bins (continuous) or min/max group coverage (categorical)
+  - when `var_type='categorical'`, summary also includes `group_cov` (per-group coverage dict)
+- `per_token_df`: per-token stats
 ```python
 from ssdiff import coverage_by_lexicon
+# Continuous
 summary, per_tok = coverage_by_lexicon(
     (texts, y),
     lexicon={"concept_keyword_1", "concept_keyword_2", "concept_keyword_3", "concept_keyword_4"},
     n_bins=4,
-    verbose=True
+    verbose=True,
+)
+# Categorical
+summary, per_tok = coverage_by_lexicon(
+    (texts, groups),
+    lexicon={"concept_keyword_1", "concept_keyword_2"},
+    var_type="categorical",
+    verbose=True,
 )
+print(summary["group_cov"])  # e.g. {"control": 0.80, "depression": 0.75}
 ```
 ---
@@ -746,9 +772,10 @@ Returned by `SSDGroup.get_contrast()`. Duck-types with `SSD` for interpretation:
 - `build_docs_from_preprocessed(pre_docs)` → list[list[str]] (lemmas for modeling)
 ### Lexicon
-- `suggest_lexicon(df_or_tuple, text_col=None, score_col=None, top_k=150, min_docs=5, n_bins=4, corr_cap=0.30)` → DataFrame
-- `token_presence_stats(df_or_tuple, token, n_bins=4, corr_cap=0.30, verbose=False)` → dict
-- `coverage_by_lexicon(df_or_tuple, lexicon, n_bins=4, verbose=False)` → `(summary, per_token_df)`
+- `suggest_lexicon(df_or_tuple, text_col=None, score_col=None, top_k=150, min_docs=5, n_bins=4, corr_cap=0.30, var_type='continuous')` → DataFrame
+- `token_presence_stats(texts, y, token, n_bins=4, corr_cap=0.30, verbose=False, var_type='continuous')` → dict
+- `coverage_by_lexicon(df_or_tuple, text_col=None, score_col=None, lexicon=(), n_bins=4, verbose=False, var_type='continuous')` → `(summary, per_token_df)`
+- `var_type`: `'continuous'` (numeric outcome, default) or `'categorical'` (group labels). When categorical, `corr` is Cramér's V, `cov_bal` is balanced across groups, and `q1`/`q4` are min/max group coverage.
 ---
 ## Citing & License
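
The `rank` row in the table above can be checked by hand. The following is a minimal sketch of that formula only, written independently of the package; the function name `rank_score` is a hypothetical helper for illustration and is not part of `ssdiff`.

```python
def rank_score(cov_bal: float, corr: float, corr_cap: float = 0.30) -> float:
    """Balanced coverage, scaled down as |corr| approaches corr_cap.

    Hypothetical helper: mirrors rank = cov_bal * (1 - min(1, |corr|/corr_cap)).
    """
    penalty = min(1.0, abs(corr) / corr_cap)
    return cov_bal * (1.0 - penalty)


# No association: the token keeps its full balanced coverage.
no_assoc = rank_score(0.40, 0.00)
# Association at half the cap: the score is halved.
half_cap = rank_score(0.40, 0.15)
# Association beyond the cap (sign is ignored): the score drops to zero.
over_cap = rank_score(0.40, -0.45)

print(no_assoc, half_cap, over_cap)
```

Because the penalty depends only on `|corr|`, the same arithmetic applies whether `corr` is a Pearson correlation or Cramér's V. Note that Cramér's V is never negative, so in the categorical case the absolute value has no effect.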

{ssdiff-0.2.0 → ssdiff-0.2.1}/README.md RENAMED

@@ -230,49 +230,75 @@ These helpers make lexicon selection transparent and data-driven (you can also h
 ### `suggest_lexicon(...)`
-Rank tokens by balanced coverage with a mild penalty for strong correlation with
-- `cov_all`: fraction of essays containing the token (0/1 presence)
-- `cov_bal`: average presence across 𝑛 quantile bins of 𝑦 (default: 4 bins)
-- `corr`: Pearson correlation between 0/1 presence and standardized 𝑦
-- `rank = cov_bal * (1 - min(1, |corr|/corr_cap))` (default `corr_cap=0.30`)
+Rank tokens by balanced coverage with a mild penalty for strong association with the outcome.
+All three lexicon utilities accept `var_type='continuous'` (default) or `var_type='categorical'`:
+| | `var_type='continuous'` | `var_type='categorical'` |
+|---|---|---|
+| `cov_bal` | average presence across 𝑛 quantile bins of 𝑦 | average presence across group labels |
+| `corr` | Pearson correlation between 0/1 presence and standardized 𝑦 | Cramér's V between 0/1 presence and group label |
+| `q1` / `q4` | coverage in lowest / highest 𝑦 quantile bin | min / max group coverage |
+| `rank` | `cov_bal * (1 - min(1, \|corr\|/corr_cap))` | same formula (Cramér's V replaces Pearson) |
 Accepts a DataFrame (`text_col`, `score_col`) or a `(texts, y)` tuple where texts can be raw strings or token lists.
 ```python
 from ssdiff import suggest_lexicon
-# Using a DataFrame
+# Continuous outcome (default)
 cands_df = suggest_lexicon(df, text_col="lemmatized", score_col="questionnaire_result", top_k=150)
 # Or using a tuple (texts, y)
 texts = [" ".join(doc) for doc in docs]
 cands_df2 = suggest_lexicon((docs, y), top_k=150)
+# Categorical groups
+cands_cat = suggest_lexicon(df, text_col="lemmatized", score_col="diagnosis", top_k=150, var_type="categorical")
+cands_cat2 = suggest_lexicon((docs, groups), top_k=150, var_type="categorical")
 ```
 ### `token_presence_stats(...)`
-Per-token coverage & correlation diagnostics:
+Per-token coverage & association diagnostics:
 ```python
 from ssdiff import token_presence_stats
-stats = token_presence_stats((texts, y), token="concept_keyword_1", n_bins=4, verbose=True)
-print(stats)  # dict: token, docs, cov_all, cov_bal, corr, rank
+# Continuous
+stats = token_presence_stats(texts, y, token="concept_keyword_1", n_bins=4, verbose=True)
+print(stats)  # dict: token, docs, cov_all, cov_bal, corr, rank, q1, q4
+# Categorical — output also includes group_cov (per-group coverage dict)
+stats = token_presence_stats(texts, groups, token="concept_keyword_1", var_type="categorical", verbose=True)
+print(stats["group_cov"])  # e.g. {"control": 0.45, "depression": 0.62}
 ```
 ### `coverage_by_lexicon(...)`
 Summary for your chosen lexicon:
-- `summary` : `docs_any`, `cov_all`, `q1`. `q4`, `corr_any`
-  - `q1` / `q4`: coverage within the lowest/highest 𝑦 bins (by quantile)
-- `per_token_df`: stats for each token
+- `summary` : `docs_any`, `cov_all`, `q1`, `q4`, `corr_any`, `hits_mean`, `hits_median`, `types_mean`, `types_median`
+  - `q1` / `q4`: coverage within the lowest/highest 𝑦 bins (continuous) or min/max group coverage (categorical)
+  - when `var_type='categorical'`, summary also includes `group_cov` (per-group coverage dict)
+- `per_token_df`: per-token stats
 ```python
 from ssdiff import coverage_by_lexicon
+# Continuous
 summary, per_tok = coverage_by_lexicon(
     (texts, y),
     lexicon={"concept_keyword_1", "concept_keyword_2", "concept_keyword_3", "concept_keyword_4"},
     n_bins=4,
-    verbose=True
+    verbose=True,
+)
+# Categorical
+summary, per_tok = coverage_by_lexicon(
+    (texts, groups),
+    lexicon={"concept_keyword_1", "concept_keyword_2"},
+    var_type="categorical",
+    verbose=True,
 )
+print(summary["group_cov"])  # e.g. {"control": 0.80, "depression": 0.75}
 ```
 ---
@@ -705,9 +731,10 @@ Returned by `SSDGroup.get_contrast()`. Duck-types with `SSD` for interpretation:
 - `build_docs_from_preprocessed(pre_docs)` → list[list[str]] (lemmas for modeling)
 ### Lexicon
-- `suggest_lexicon(df_or_tuple, text_col=None, score_col=None, top_k=150, min_docs=5, n_bins=4, corr_cap=0.30)` → DataFrame
-- `token_presence_stats(df_or_tuple, token, n_bins=4, corr_cap=0.30, verbose=False)` → dict
-- `coverage_by_lexicon(df_or_tuple, lexicon, n_bins=4, verbose=False)` → `(summary, per_token_df)`
+- `suggest_lexicon(df_or_tuple, text_col=None, score_col=None, top_k=150, min_docs=5, n_bins=4, corr_cap=0.30, var_type='continuous')` → DataFrame
+- `token_presence_stats(texts, y, token, n_bins=4, corr_cap=0.30, verbose=False, var_type='continuous')` → dict
+- `coverage_by_lexicon(df_or_tuple, text_col=None, score_col=None, lexicon=(), n_bins=4, verbose=False, var_type='continuous')` → `(summary, per_token_df)`
+- `var_type`: `'continuous'` (numeric outcome, default) or `'categorical'` (group labels). When categorical, `corr` is Cramér's V, `cov_bal` is balanced across groups, and `q1`/`q4` are min/max group coverage.
 ---
 ## Citing & License
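
For `var_type='categorical'`, the README describes `cov_bal` as the average presence across group labels and `q1`/`q4` as the min/max group coverage. The sketch below illustrates those definitions on toy data using only the standard library. The function `group_coverage` and the data are assumptions made for illustration, not the package's implementation.

```python
from collections import defaultdict


def group_coverage(presence, groups):
    """Fraction of documents containing the token, per group label (hypothetical helper)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p, g in zip(presence, groups):
        hits[g] += p
        totals[g] += 1
    return {g: hits[g] / totals[g] for g in sorted(totals)}


# Toy data: 8 "control" documents and 2 "depression" documents (imbalanced groups).
presence = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0]
groups = ["control"] * 8 + ["depression"] * 2

group_cov = group_coverage(presence, groups)
cov_all = sum(presence) / len(presence)             # pooled over all documents
cov_bal = sum(group_cov.values()) / len(group_cov)  # each group weighted equally
q1, q4 = min(group_cov.values()), max(group_cov.values())

print(group_cov, cov_all, cov_bal, q1, q4)
```

With imbalanced groups the pooled `cov_all` is dominated by the larger group, while `cov_bal` gives every group equal weight, which is why the two values differ here.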

{ssdiff-0.2.0 → ssdiff-0.2.1}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "ssdiff"
-version = "0.2.0"
+version = "0.2.1"
 description = "Supervised Semantic Differential (SSD): interpretable, embedding-based analysis of concept meaning in text."
 readme = "README.md"
 requires-python = ">=3.9"

{ssdiff-0.2.0 → ssdiff-0.2.1}/ssdiff/lexicon.py RENAMED

@@ -63,31 +63,69 @@ def _z(v: pd.Series | np.ndarray) -> np.ndarray:
     mu = float(np.nanmean(arr))
     return (arr - mu) / sd
+def _validate_var_type(var_type: str) -> None:
+    if var_type not in ("continuous", "categorical"):
+        raise ValueError(
+            f"var_type must be 'continuous' or 'categorical', got {var_type!r}"
+        )
+def _categorical_mask(y) -> np.ndarray:
+    """Boolean mask: True for valid categorical entries (not None/NaN/empty)."""
+    arr = np.asarray(y, dtype=object)
+    return np.array([
+        g is not None and g != "" and (not isinstance(g, float) or np.isfinite(g))
+        for g in arr
+    ], dtype=bool)
+def _cramers_v(presence: np.ndarray, groups: np.ndarray) -> float:
+    """Cramér's V between binary presence (0/1) and group labels."""
+    ct = pd.crosstab(presence, groups)
+    if ct.shape[0] < 2 or ct.shape[1] < 2:
+        return 0.0
+    n = ct.values.sum()
+    row_sums = ct.sum(axis=1).values
+    col_sums = ct.sum(axis=0).values
+    expected = np.outer(row_sums, col_sums) / n
+    chi2 = float(((ct.values - expected) ** 2 / expected).sum())
+    k = min(ct.shape) - 1
+    return float(np.sqrt(chi2 / (n * k))) if n * k > 0 else 0.0
 def _rank_for_token_stats(
     presence_vec: np.ndarray,
     y: pd.Series | np.ndarray,
     n_bins: int = 4,
     corr_cap: float = 0.30,
+    categorical: bool = False,
 ) -> tuple[float, float, float, float]:
     """
     presence_vec: 0/1 per document
     Returns: (cov_all, cov_bal, corr, rank)
     rank = balanced_coverage * (1 - min(1, |corr|/corr_cap))
+    When categorical=True, bins are group labels and corr is Cramér's V.
     """
-    bins = _quantile_bins(y, n_bins=n_bins)
     presence_vec = presence_vec.astype(float)
     cov_all = float(np.mean(presence_vec)) if len(presence_vec) else 0.0
-    # balanced coverage: mean coverage within each bin
-    cov_per_bin = []
-    for b in sorted(np.unique(bins)):
-        idx = np.where(bins == b)[0]
-        cov_per_bin.append(float(np.mean(presence_vec[idx])) if len(idx) else 0.0)
-    cov_bal = float(np.mean(cov_per_bin)) if cov_per_bin else 0.0
+    if categorical:
+        groups = np.asarray(y, dtype=object)
+        cov_per_group = []
+        for g in sorted(set(groups)):
+            idx = np.where(groups == g)[0]
+            cov_per_group.append(float(np.mean(presence_vec[idx])) if len(idx) else 0.0)
+        cov_bal = float(np.mean(cov_per_group)) if cov_per_group else 0.0
+        corr = _cramers_v(presence_vec.astype(int), groups)
+    else:
+        bins = _quantile_bins(y, n_bins=n_bins)
+        # balanced coverage: mean coverage within each bin
+        cov_per_bin = []
+        for b in sorted(np.unique(bins)):
+            idx = np.where(bins == b)[0]
+            cov_per_bin.append(float(np.mean(presence_vec[idx])) if len(idx) else 0.0)
+        cov_bal = float(np.mean(cov_per_bin)) if cov_per_bin else 0.0
+        y_std = _z(y)
+        corr = float(np.corrcoef(presence_vec, y_std)[0, 1]) if np.std(presence_vec) > 0 else 0.0
-    y_std = _z(y)
-    # guard zero variance in presence
-    corr = float(np.corrcoef(presence_vec, y_std)[0, 1]) if np.std(presence_vec) > 0 else 0.0
     pen = min(1.0, abs(corr) / corr_cap)
     rank = cov_bal * (1.0 - pen)
     return cov_all, cov_bal, corr, rank
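
The `_cramers_v` helper added in this hunk builds a contingency table with `pd.crosstab` and computes an uncorrected Cramér's V. As a cross-check, the same statistic can be sketched with the standard library alone. The function `cramers_v` below is a standalone illustration under that assumption, not the package's code.

```python
import math
from collections import Counter


def cramers_v(presence, groups):
    """Uncorrected Cramér's V between a 0/1 presence vector and group labels."""
    n = len(presence)
    cell = Counter(zip(presence, groups))  # observed counts per (presence, group) cell
    row = Counter(presence)                # marginal counts for presence values
    col = Counter(groups)                  # marginal counts for group labels
    if len(row) < 2 or len(col) < 2:
        return 0.0  # a constant variable carries no association
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (cell[(r, c)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return math.sqrt(chi2 / (n * k))


# Presence perfectly separates the two groups.
v_perfect = cramers_v([1, 1, 0, 0], ["a", "a", "b", "b"])
# Presence is spread evenly across the groups.
v_none = cramers_v([1, 0, 1, 0], ["a", "a", "b", "b"])

print(v_perfect, v_none)
```

Since presence is binary, `k` is always 1 whenever both variables vary, so the statistic reduces to `sqrt(chi2 / n)`.
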
@@ -105,9 +143,11 @@ def suggest_lexicon(
     min_docs: int = 5,
     n_bins: int = 4,
     corr_cap: float = 0.30,
+    var_type: str = "continuous",
 ) -> pd.DataFrame:
     """
-    Suggest candidate tokens ranked by coverage with a mild penalty for strong correlation with y.
+    Suggest candidate tokens ranked by coverage with a mild penalty for strong
+    association with y.
     Parameters
     ----------
@@ -118,32 +158,54 @@ def suggest_lexicon(
     text_col : str | None
         Column name with preprocessed text (space-separated) if df provided.
     score_col : str | None
-        Column name with numeric outcome if df provided.
+        Column name with outcome variable if df provided (numeric for
+        continuous, any hashable for categorical).
+    var_type : str
+        ``'continuous'`` (default) for numeric outcomes or ``'categorical'``
+        for group labels.
     Returns
     -------
     DataFrame with columns: token, docs, cov_all, cov_bal, corr, rank (sorted desc).
+    When var_type='categorical', corr is Cramér's V and cov_bal is balanced
+    across group labels instead of quantile bins.
     """
+    _validate_var_type(var_type)
+    is_categorical = var_type == "categorical"
     # Allow passing a tuple (texts, y) directly
     if not isinstance(df_or_texts, pd.DataFrame):
         if isinstance(df_or_texts, tuple) and len(df_or_texts) == 2:
             texts, y = df_or_texts
             texts = _texts_to_token_lists(texts)
-            y = _as_series_1d(y)
-            mask = ~y.isna()
-            if not mask.all():
-                texts = [texts[i] for i in range(len(texts)) if mask.iat[i]]
-                y = y[mask].reset_index(drop=True)
+            if is_categorical:
+                y = np.asarray(y, dtype=object)
+                mask = _categorical_mask(y)
+                if not mask.all():
+                    texts = [texts[i] for i in range(len(texts)) if mask[i]]
+                    y = y[mask]
+            else:
+                y = _as_series_1d(y)
+                mask = ~y.isna()
+                if not mask.all():
+                    texts = [texts[i] for i in range(len(texts)) if mask.iat[i]]
+                    y = y[mask].reset_index(drop=True)
         else:
             raise ValueError("If not passing a DataFrame, pass (texts, y) as a tuple.")
     else:
         if not text_col or not score_col:
             raise ValueError("Provide text_col and score_col when using a DataFrame.")
         s = df_or_texts[text_col].fillna("").astype(str)
-        y = _as_series_1d(df_or_texts[score_col])
-        mask = ~y.isna()
-        texts = _texts_to_token_lists(s[mask].tolist())
-        y = y[mask]
+        if is_categorical:
+            y = np.asarray(df_or_texts[score_col], dtype=object)
+            mask = _categorical_mask(y)
+            texts = _texts_to_token_lists(s[mask].tolist())
+            y = y[mask]
+        else:
+            y = _as_series_1d(df_or_texts[score_col])
+            mask = ~y.isna()
+            texts = _texts_to_token_lists(s[mask].tolist())
+            y = y[mask]
     # Build doc-frequency counts
     token_sets = _token_sets(texts)
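
In the categorical branch above, documents whose label is missing are dropped before any statistics are computed, using the rule from `_categorical_mask` (not `None`, not an empty string, and not a non-finite float). The sketch below restates that rule without NumPy; `is_valid_label` and the toy data are hypothetical and serve only to show which rows survive.

```python
import math


def is_valid_label(g) -> bool:
    """True unless the label is None, an empty string, or a non-finite float."""
    if g is None or g == "":
        return False
    if isinstance(g, float) and not math.isfinite(g):
        return False
    return True


texts = [["sad", "tired"], ["happy"], ["calm"], ["busy"], ["tired"]]
groups = ["depression", None, "", float("nan"), "control"]

# Keep texts and labels aligned by filtering them together.
kept = [(t, g) for t, g in zip(texts, groups) if is_valid_label(g)]
kept_texts = [t for t, _ in kept]
kept_groups = [g for _, g in kept]

print(kept_texts, kept_groups)
```

Filtering texts and labels in one pass keeps the two sequences the same length, which the later presence and coverage computations rely on.
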
@@ -156,10 +218,12 @@ def suggest_lexicon(
         return pd.DataFrame(columns=["token", "docs", "cov_all", "cov_bal", "corr", "rank"])
     rows = []
-    y_clean = y.reset_index(drop=True)
+    y_clean = y if is_categorical else y.reset_index(drop=True)
     for t in vocab:
         pres = np.fromiter((1 if t in ts else 0 for ts in token_sets), dtype=np.int8, count=len(token_sets))
-        cov_all, cov_bal, corr, rank = _rank_for_token_stats(pres, y_clean, n_bins=n_bins, corr_cap=corr_cap)
+        cov_all, cov_bal, corr, rank = _rank_for_token_stats(
+            pres, y_clean, n_bins=n_bins, corr_cap=corr_cap, categorical=is_categorical,
+        )
         rows.append(dict(token=t, docs=int(pres.sum()), cov_all=cov_all, cov_bal=cov_bal, corr=corr, rank=rank))
     out = pd.DataFrame(rows)
@@ -176,14 +240,26 @@ def token_presence_stats(
     n_bins: int = 4,
     corr_cap: float = 0.30,
     verbose: bool = False,
+    var_type: str = "continuous",
 ) -> dict:
     """
     Compute docs count, coverage, balanced coverage, correlation, and rank for a single token.
     Accepts texts as:
       - str                          → split() is used
       - List[str]                    → treated as tokenized document
       - List[List[str]] (or deeper)  → treated as sentences of tokens (flattened)
+    Parameters
+    ----------
+    var_type : str
+        ``'continuous'`` (default) or ``'categorical'``.  When categorical,
+        corr is Cramér's V, q1/q4 become min/max group coverage, and an
+        extra ``group_cov`` dict mapping each group label to its coverage
+        is included in the output.
     """
+    _validate_var_type(var_type)
+    is_categorical = var_type == "categorical"
     def _doc_token_set(doc) -> set[str]:
         # Fast, robust flattener to collect strings at any nesting depth.
@@ -210,48 +286,78 @@ def token_presence_stats(
     # --- coerce inputs ---
     token = str(token)
-    if isinstance(y, np.ndarray):
-        y_series = pd.Series(y, dtype=float)
+    texts_list = list(texts)
+    if is_categorical:
+        y_arr = np.asarray(y, dtype=object)
+        mask = _categorical_mask(y_arr)
+        if not mask.all():
+            texts_list = [texts_list[i] for i in range(len(texts_list)) if mask[i]]
+            y_arr = y_arr[mask]
+        if len(texts_list) != len(y_arr):
+            raise ValueError(f"Length mismatch: texts={len(texts_list)} vs y={len(y_arr)}")
+        pres = np.fromiter(
+            (1 if token in _doc_token_set(doc) else 0 for doc in texts_list),
+            dtype=np.int8, count=len(texts_list),
+        )
+        cov_all, cov_bal, corr, rank = _rank_for_token_stats(
+            pres, y_arr, n_bins=n_bins, corr_cap=corr_cap, categorical=True,
+        )
+        # per-group coverage
+        group_labels = sorted(set(y_arr))
+        group_cov = {}
+        for g in group_labels:
+            idx = np.where(y_arr == g)[0]
+            group_cov[g] = float(pres[idx].mean()) if len(idx) else 0.0
+        q1 = min(group_cov.values()) if group_cov else 0.0
+        q4 = max(group_cov.values()) if group_cov else 0.0
+        out = dict(
+            token=token, docs=int(pres.sum()),
+            cov_all=float(cov_all), cov_bal=float(cov_bal),
+            corr=float(corr), rank=float(rank),
+            q1=q1, q4=q4,
+            group_cov=group_cov,
+        )
     else:
-        y_series = y.copy()
+        if isinstance(y, np.ndarray):
+            y_series = pd.Series(y, dtype=float)
+        else:
+            y_series = y.copy()
-    texts_list = list(texts)
-    mask = ~y_series.isna()
-    if not mask.all():
-        texts_list = [texts_list[i] for i in range(len(texts_list)) if mask.iat[i]]
-        y_series = y_series[mask].reset_index(drop=True)
+        mask = ~y_series.isna()
+        if not mask.all():
+            texts_list = [texts_list[i] for i in range(len(texts_list)) if mask.iat[i]]
+            y_series = y_series[mask].reset_index(drop=True)
-    if len(texts_list) != len(y_series):
-        raise ValueError(f"Length mismatch: texts={len(texts_list)} vs y={len(y_series)}")
+        if len(texts_list) != len(y_series):
+            raise ValueError(f"Length mismatch: texts={len(texts_list)} vs y={len(y_series)}")
-    pres = np.fromiter(
-        (1 if token in _doc_token_set(doc) else 0 for doc in texts_list),
-        dtype=np.int8,
-        count=len(texts_list),
-    )
+        pres = np.fromiter(
+            (1 if token in _doc_token_set(doc) else 0 for doc in texts_list),
+            dtype=np.int8, count=len(texts_list),
+        )
-    # --- core stats (existing helpers assumed available) ---
-    cov_all, cov_bal, corr, rank = _rank_for_token_stats(
-        pres, y_series, n_bins=n_bins, corr_cap=corr_cap
-    )
+        cov_all, cov_bal, corr, rank = _rank_for_token_stats(
+            pres, y_series, n_bins=n_bins, corr_cap=corr_cap, categorical=False,
+        )
-    # quartiles (for interpretability)
-    bins = _quantile_bins(y_series, n_bins=n_bins)
-    low = np.where(bins == bins.min())[0]
-    high = np.where(bins == bins.max())[0]
-    q1 = float(pres[low].mean()) if len(low) else 0.0
-    q4 = float(pres[high].mean()) if len(high) else 0.0
-    out = dict(
-        token=token,
-        docs=int(pres.sum()),
-        cov_all=float(cov_all),
-        cov_bal=float(cov_bal),
-        corr=float(corr),
-        rank=float(rank),
-        q1=q1,
-        q4=q4,
-    )
+        # quartiles (for interpretability)
+        bins = _quantile_bins(y_series, n_bins=n_bins)
+        low = np.where(bins == bins.min())[0]
+        high = np.where(bins == bins.max())[0]
+        q1 = float(pres[low].mean()) if len(low) else 0.0
+        q4 = float(pres[high].mean()) if len(high) else 0.0
+        out = dict(
+            token=token, docs=int(pres.sum()),
+            cov_all=float(cov_all), cov_bal=float(cov_bal),
+            corr=float(corr), rank=float(rank),
+            q1=q1, q4=q4,
+        )
     if verbose:
         print(
@@ -259,6 +365,9 @@ def token_presence_stats(
             f"docs={out['docs']} | cov_all={out['cov_all']:.3f} | cov_bal={out['cov_bal']:.3f} | "
             f"q1={out['q1']:.3f} | q4={out['q4']:.3f} | corr={out['corr']:.3f} | rank={out['rank']:.3f}"
         )
+        if is_categorical and "group_cov" in out:
+            parts = " | ".join(f"{g}={v:.3f}" for g, v in out["group_cov"].items())
+            print(f"  group_cov: {parts}")
     return out
@@ -272,6 +381,7 @@ def coverage_by_lexicon(
     *,
     n_bins: int = 4,
     verbose: bool = False,
+    var_type: str = "continuous",
 ) -> tuple[dict, pd.DataFrame]:
     """
     Summarize coverage for a given lexicon.
@@ -283,24 +393,35 @@ def coverage_by_lexicon(
           * profiles: List[List[str]]  (multiple independent posts per unit)
       - Tuple (texts, y), where texts is Sequence of the same forms above.
+    Parameters
+    ----------
+    var_type : str
+        ``'continuous'`` (default) or ``'categorical'``.  When categorical,
+        q1/q4 become min/max group coverage, corr uses Cramér's V, and the
+        summary includes a ``group_cov`` dict.
     Returns
     -------
     summary : dict(
         docs_any, cov_all, q1, q4, corr_any,
         hits_mean, hits_median, types_mean, types_median
+        [, group_cov]  — only when var_type='categorical'
     )
     per_token_df : DataFrame(word, docs, cov_all, q1, q4, corr)
     """
     import numpy as np
     import pandas as pd
+    _validate_var_type(var_type)
+    is_categorical = var_type == "categorical"
     # --- small internal adapters (robust to nested inputs) --------------------
-    def _as_series_1d(y_like) -> pd.Series:
+    def _local_as_series_1d(y_like) -> pd.Series:
         if isinstance(y_like, pd.Series):
             return y_like.reset_index(drop=True)
         return pd.Series(y_like, dtype="float64")
-    def _z(s: pd.Series) -> np.ndarray:
+    def _local_z(s: pd.Series) -> np.ndarray:
         s = s.astype(float)
         mu = float(s.mean())
         sd = float(s.std(ddof=0))
@@ -308,7 +429,7 @@ def coverage_by_lexicon(
             return np.zeros(len(s), dtype=float)
         return ((s - mu) / sd).to_numpy(dtype=float)
-    def _quantile_bins(y: pd.Series, n_bins: int = 4) -> np.ndarray:
+    def _local_quantile_bins(y: pd.Series, n_bins: int = 4) -> np.ndarray:
         q = pd.qcut(y.rank(method="average"), n_bins, labels=False, duplicates="drop")
         return q.to_numpy(dtype=int)
@@ -339,49 +460,61 @@ def coverage_by_lexicon(
                 return out
         return str(unit).split()
-    def _texts_to_token_lists(texts_like) -> list[list[str]]:
+    def _local_texts_to_token_lists(texts_like) -> list[list[str]]:
         return [_to_unit_tokens(u) for u in texts_like]
-    def _token_sets(text_lists: list[list[str]]) -> list[set[str]]:
+    def _local_token_sets(text_lists: list[list[str]]) -> list[set[str]]:
         return [set(toks) if toks else set() for toks in text_lists]
     # --- coerce inputs --------------------------------------------------------
     if not isinstance(df_or_texts, pd.DataFrame):
         if isinstance(df_or_texts, tuple) and len(df_or_texts) == 2:
             texts, y = df_or_texts
-            texts = _texts_to_token_lists(texts)
-            y = _as_series_1d(y)
-            mask = ~y.isna()
-            if not mask.all():
-                texts = [texts[i] for i in range(len(texts)) if mask.iat[i]]
-                y = y[mask].reset_index(drop=True)
+            texts = _local_texts_to_token_lists(texts)
+            if is_categorical:
+                y = np.asarray(y, dtype=object)
+                mask = _categorical_mask(y)
+                if not mask.all():
+                    texts = [texts[i] for i in range(len(texts)) if mask[i]]
+                    y = y[mask]
+            else:
+                y = _local_as_series_1d(y)
+                mask = ~y.isna()
+                if not mask.all():
+                    texts = [texts[i] for i in range(len(texts)) if mask.iat[i]]
+                    y = y[mask].reset_index(drop=True)
         else:
             raise ValueError("If not passing a DataFrame, pass (texts, y) as a tuple.")
     else:
         if not text_col or not score_col:
             raise ValueError("Provide text_col and score_col when using a DataFrame.")
         s = df_or_texts[text_col]
-        y = _as_series_1d(df_or_texts[score_col])
-        mask = ~y.isna()
-        s = s[mask]
-        y = y[mask].reset_index(drop=True)
-        texts = _texts_to_token_lists(s.tolist())
+        if is_categorical:
+            y = np.asarray(df_or_texts[score_col], dtype=object)
+            mask = _categorical_mask(y)
+            s = s[mask]
+            y = y[mask]
+            texts = _local_texts_to_token_lists(s.tolist())
+        else:
+            y = _local_as_series_1d(df_or_texts[score_col])
+            mask = ~y.isna()
+            s = s[mask]
+            y = y[mask].reset_index(drop=True)
+            texts = _local_texts_to_token_lists(s.tolist())
     # guard: empty after filtering
     if len(texts) == 0 or len(y) == 0:
         summary = dict(
             docs_any=0, cov_all=0.0, q1=0.0, q4=0.0, corr_any=0.0,
-            hits_mean=0.0, hits_median=0.0, types_mean=0.0, types_median=0.0
+            hits_mean=0.0, hits_median=0.0, types_mean=0.0, types_median=0.0,
         )
+        if is_categorical:
+            summary["group_cov"] = {}
         return summary, pd.DataFrame(columns=["word","docs","cov_all","q1","q4","corr"])
     # --- prep features --------------------------------------------------------
-    bins = _quantile_bins(y, n_bins=n_bins)
-    low_idx = np.where(bins == bins.min())[0]
-    high_idx = np.where(bins == bins.max())[0]
     lex = [str(w) for w in lexicon]
-    token_sets = _token_sets(texts)
+    token_sets = _local_token_sets(texts)
     # presence of ANY lexicon word per unit
     pres_any = np.fromiter(
@@ -389,19 +522,73 @@ def coverage_by_lexicon(
         dtype=np.int8,
         count=len(token_sets),
     )
-    y_std = _z(y)
-    corr_any = float(np.corrcoef(pres_any, y_std)[0, 1]) if pres_any.std() > 0 else 0.0
     overall = float(pres_any.mean()) if len(pres_any) else 0.0
-    q1 = float(pres_any[low_idx].mean()) if len(low_idx) else 0.0
-    q4 = float(pres_any[high_idx].mean()) if len(high_idx) else 0.0
     docs_any = int(pres_any.sum())
-    # --- NEW: whole-profile lexicon frequency stats ---------------------------
+    if is_categorical:
+        groups = y  # already np.ndarray of object dtype
+        group_labels = sorted(set(groups))
+        # q1/q4 → min/max group coverage for the any-presence vector
+        group_cov_any = {}
+        for g in group_labels:
+            idx = np.where(groups == g)[0]
+            group_cov_any[g] = float(pres_any[idx].mean()) if len(idx) else 0.0
+        q1 = min(group_cov_any.values()) if group_cov_any else 0.0
+        q4 = max(group_cov_any.values()) if group_cov_any else 0.0
+        corr_any = _cramers_v(pres_any.astype(int), groups)
+        # per-token stats
+        rows = []
+        for w in lex:
+            pres = np.fromiter(((1 if w in ts else 0) for ts in token_sets),
+                               dtype=np.int8, count=len(token_sets))
+            corr = _cramers_v(pres.astype(int), groups)
+            gc = {}
+            for g in group_labels:
+                idx = np.where(groups == g)[0]
+                gc[g] = float(pres[idx].mean()) if len(idx) else 0.0
+            rows.append(dict(
+                word=w,
+                docs=int(pres.sum()),
+                cov_all=float(pres.mean()) if len(pres) else 0.0,
+                q1=min(gc.values()) if gc else 0.0,
+                q4=max(gc.values()) if gc else 0.0,
+                corr=corr,
+            ))
+    else:
+        bins = _local_quantile_bins(y, n_bins=n_bins)
+        low_idx = np.where(bins == bins.min())[0]
+        high_idx = np.where(bins == bins.max())[0]
+        y_std = _local_z(y)
+        corr_any = float(np.corrcoef(pres_any, y_std)[0, 1]) if pres_any.std() > 0 else 0.0
+        q1 = float(pres_any[low_idx].mean()) if len(low_idx) else 0.0
+        q4 = float(pres_any[high_idx].mean()) if len(high_idx) else 0.0
+        # per-token stats
+        rows = []
+        for w in lex:
+            pres = np.fromiter(((1 if w in ts else 0) for ts in token_sets),
+                               dtype=np.int8, count=len(token_sets))
+            corr = float(np.corrcoef(pres, y_std)[0, 1]) if pres.std() > 0 else 0.0
+            rows.append(dict(
+                word=w,
+                docs=int(pres.sum()),
+                cov_all=float(pres.mean()) if len(pres) else 0.0,
+                q1=float(pres[low_idx].mean()) if len(low_idx) else 0.0,
+                q4=float(pres[high_idx].mean()) if len(high_idx) else 0.0,
+                corr=corr,
+            ))
+    per_token = pd.DataFrame(rows, columns=["word", "docs", "cov_all", "q1", "q4", "corr"])
+    # --- whole-profile lexicon frequency stats (DV-agnostic) ------------------
     lex_set = set(lex)
-    # total occurrences of any lexicon token in each profile/unit
     hits_per_unit = np.array([sum(1 for t in toks if t in lex_set) for toks in texts], dtype=np.int32)
-    # number of unique lexicon types present in each unit
     types_per_unit = np.array([len(set(toks) & lex_set) for toks in texts], dtype=np.int32)
     hits_mean = float(hits_per_unit.mean()) if len(hits_per_unit) else 0.0
@@ -409,22 +596,6 @@ def coverage_by_lexicon(
     types_mean = float(types_per_unit.mean()) if len(types_per_unit) else 0.0
     types_median = float(np.median(types_per_unit)) if len(types_per_unit) else 0.0
-    # per-token stats (vectorized presence via set membership)
-    rows = []
-    for w in lex:
-        pres = np.fromiter(((1 if w in ts else 0) for ts in token_sets),
-                           dtype=np.int8, count=len(token_sets))
-        corr = float(np.corrcoef(pres, y_std)[0, 1]) if pres.std() > 0 else 0.0
-        rows.append(dict(
-            word=w,
-            docs=int(pres.sum()),
-            cov_all=float(pres.mean()) if len(pres) else 0.0,
-            q1=float(pres[low_idx].mean()) if len(low_idx) else 0.0,
-            q4=float(pres[high_idx].mean()) if len(high_idx) else 0.0,
-            corr=corr,
-        ))
-    per_token = pd.DataFrame(rows, columns=["word", "docs", "cov_all", "q1", "q4", "corr"])
     summary = dict(
         docs_any=docs_any,
         cov_all=overall,
@@ -436,6 +607,8 @@ def coverage_by_lexicon(
         types_mean=types_mean,
         types_median=types_median,
     )
+    if is_categorical:
+        summary["group_cov"] = group_cov_any
     per_token = per_token.sort_values(
         ["cov_all", "docs"], ascending=[False, False]
@@ -448,6 +621,9 @@ def coverage_by_lexicon(
             f"docs_any={docs_any} | cov_all={overall:.3f} | "
             f"q1={q1:.3f} | q4={q4:.3f} | corr_any={corr_any:.3f}"
         )
+        if is_categorical:
+            parts = " | ".join(f"{g}={v:.3f}" for g, v in group_cov_any.items())
+            print(f"  group_cov: {parts}")
         print(
             f"  hits_mean={hits_mean:.2f} | hits_median={hits_median:.2f} | "
             f"types_mean={types_mean:.2f} | types_median={types_median:.2f}"

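Note on the categorical branch above: the diff calls a private helper `_cramers_v` whose body is not shown in this hunk. The sketch below is one plausible way to compute the same two quantities the branch reports (per-group coverage and Cramér's V between a 0/1 presence vector and group labels). The names `group_coverage_sketch` and `cramers_v_sketch` are hypothetical, and the plain, non-bias-corrected formula is an assumption; the package's actual helper may differ.

```python
import numpy as np


def group_coverage_sketch(presence, groups):
    """Fraction of documents containing the token, per group label."""
    presence = np.asarray(presence).astype(int)
    groups = np.asarray(groups, dtype=object)
    return {g: float(presence[groups == g].mean()) for g in sorted(set(groups))}


def cramers_v_sketch(presence, groups):
    """Cramér's V between 0/1 presence and group labels (no bias correction)."""
    presence = np.asarray(presence).astype(int)
    groups = np.asarray(groups, dtype=object)
    labels = sorted(set(groups))
    n = len(presence)
    # Degenerate cases: no data, a single group, or a constant presence vector.
    if n == 0 or len(labels) < 2 or presence.min() == presence.max():
        return 0.0
    # 2 x K contingency table: row 0 = absent, row 1 = present, columns = groups.
    table = np.zeros((2, len(labels)), dtype=float)
    for j, g in enumerate(labels):
        mask = groups == g
        table[1, j] = presence[mask].sum()
        table[0, j] = mask.sum() - table[1, j]
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    # For a 2 x K table, min(rows - 1, cols - 1) == 1, so V = sqrt(chi2 / n).
    return float(np.sqrt(chi2 / n))


presence = [1, 1, 0, 0]
groups = ["control", "control", "depression", "depression"]
cov = group_coverage_sketch(presence, groups)
q1, q4 = min(cov.values()), max(cov.values())  # min / max group coverage
v = cramers_v_sketch(presence, groups)
```

In this toy input the token appears only in one group, so the min/max coverage span the full range and the association is maximal; a token spread evenly across groups would give a V of zero.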
{ssdiff-0.2.0 → ssdiff-0.2.1/ssdiff.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: ssdiff
-Version: 0.2.0
+Version: 0.2.1
 Summary: Supervised Semantic Differential (SSD): interpretable, embedding-based analysis of concept meaning in text.
 Author-email: Hubert Plisiecki <hplisiecki@gmail.com>, Paweł Lenartowicz <pawellenartowicz@europe.com>
 License-Expression: MIT
@@ -271,49 +271,75 @@ These helpers make lexicon selection transparent and data-driven (you can also h
 ### `suggest_lexicon(...)`
-Rank tokens by balanced coverage with a mild penalty for strong correlation with
-- `cov_all`: fraction of essays containing the token (0/1 presence)
-- `cov_bal`: average presence across 𝑛 quantile bins of 𝑦 (default: 4 bins)
-- `corr`: Pearson correlation between 0/1 presence and standardized 𝑦
-- `rank = cov_bal * (1 - min(1, |corr|/corr_cap))` (default `corr_cap=0.30`)
+Rank tokens by balanced coverage with a mild penalty for strong association with the outcome.
+All three lexicon utilities accept `var_type='continuous'` (default) or `var_type='categorical'`:
+| | `var_type='continuous'` | `var_type='categorical'` |
+|---|---|---|
+| `cov_bal` | average presence across 𝑛 quantile bins of 𝑦 | average presence across group labels |
+| `corr` | Pearson correlation between 0/1 presence and standardized 𝑦 | Cramér's V between 0/1 presence and group label |
+| `q1` / `q4` | coverage in lowest / highest 𝑦 quantile bin | min / max group coverage |
+| `rank` | `cov_bal * (1 - min(1, \|corr\|/corr_cap))` | same formula (Cramér's V replaces Pearson) |
 Accepts a DataFrame (`text_col`, `score_col`) or a `(texts, y)` tuple where texts can be raw strings or token lists.
 ```python
 from ssdiff import suggest_lexicon
-# Using a DataFrame
+# Continuous outcome (default)
 cands_df = suggest_lexicon(df, text_col="lemmatized", score_col="questionnaire_result", top_k=150)
 # Or using a tuple (texts, y)
 texts = [" ".join(doc) for doc in docs]
 cands_df2 = suggest_lexicon((docs, y), top_k=150)
+# Categorical groups
+cands_cat = suggest_lexicon(df, text_col="lemmatized", score_col="diagnosis", top_k=150, var_type="categorical")
+cands_cat2 = suggest_lexicon((docs, groups), top_k=150, var_type="categorical")
 ```
 ### `token_presence_stats(...)`
-Per-token coverage & correlation diagnostics:
+Per-token coverage & association diagnostics:
 ```python
 from ssdiff import token_presence_stats
-stats = token_presence_stats((texts, y), token="concept_keyword_1", n_bins=4, verbose=True)
-print(stats)  # dict: token, docs, cov_all, cov_bal, corr, rank
+# Continuous
+stats = token_presence_stats(texts, y, token="concept_keyword_1", n_bins=4, verbose=True)
+print(stats)  # dict: token, docs, cov_all, cov_bal, corr, rank, q1, q4
+# Categorical — output also includes group_cov (per-group coverage dict)
+stats = token_presence_stats(texts, groups, token="concept_keyword_1", var_type="categorical", verbose=True)
+print(stats["group_cov"])  # e.g. {"control": 0.45, "depression": 0.62}
 ```
 ### `coverage_by_lexicon(...)`
 Summary for your chosen lexicon:
-- `summary` : `docs_any`, `cov_all`, `q1`. `q4`, `corr_any`
-  - `q1` / `q4`: coverage within the lowest/highest 𝑦 bins (by quantile)
-- `per_token_df`: stats for each token
+- `summary` : `docs_any`, `cov_all`, `q1`, `q4`, `corr_any`, `hits_mean`, `hits_median`, `types_mean`, `types_median`
+  - `q1` / `q4`: coverage within the lowest/highest 𝑦 bins (continuous) or min/max group coverage (categorical)
+  - when `var_type='categorical'`, summary also includes `group_cov` (per-group coverage dict)
+- `per_token_df`: per-token stats
 ```python
 from ssdiff import coverage_by_lexicon
+# Continuous
 summary, per_tok = coverage_by_lexicon(
     (texts, y),
     lexicon={"concept_keyword_1", "concept_keyword_2", "concept_keyword_3", "concept_keyword_4"},
     n_bins=4,
-    verbose=True
+    verbose=True,
+)
+# Categorical
+summary, per_tok = coverage_by_lexicon(
+    (texts, groups),
+    lexicon={"concept_keyword_1", "concept_keyword_2"},
+    var_type="categorical",
+    verbose=True,
 )
+print(summary["group_cov"])  # e.g. {"control": 0.80, "depression": 0.75}
 ```
 ---
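Note on the `rank` formula documented in the README table above: it can be read as balanced coverage scaled down linearly as the association strength approaches `corr_cap`, reaching zero at or beyond the cap. A minimal sketch of that arithmetic follows; `rank_score` is a hypothetical name used for illustration only, not a function exported by `ssdiff`.

```python
def rank_score(cov_bal, corr, corr_cap=0.30):
    """rank = cov_bal * (1 - min(1, |corr| / corr_cap)), as stated in the README."""
    penalty = min(1.0, abs(corr) / corr_cap)  # 0 = no penalty, 1 = fully penalized
    return cov_bal * (1.0 - penalty)


no_assoc = rank_score(0.5, 0.0)    # association-free token keeps its full cov_bal
at_cap = rank_score(0.5, 0.30)     # association at the cap zeroes the rank
beyond_cap = rank_score(0.5, -0.9) # the sign is ignored; beyond the cap is still zero
```

With `var_type='categorical'`, `corr` would be Cramér's V, which is non-negative, so the absolute value has no effect in that mode.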
@@ -746,9 +772,10 @@ Returned by `SSDGroup.get_contrast()`. Duck-types with `SSD` for interpretation:
 - `build_docs_from_preprocessed(pre_docs)` → list[list[str]] (lemmas for modeling)
 ### Lexicon
-- `suggest_lexicon(df_or_tuple, text_col=None, score_col=None, top_k=150, min_docs=5, n_bins=4, corr_cap=0.30)` → DataFrame
-- `token_presence_stats(df_or_tuple, token, n_bins=4, corr_cap=0.30, verbose=False)` → dict
-- `coverage_by_lexicon(df_or_tuple, lexicon, n_bins=4, verbose=False)` → `(summary, per_token_df)`
+- `suggest_lexicon(df_or_tuple, text_col=None, score_col=None, top_k=150, min_docs=5, n_bins=4, corr_cap=0.30, var_type='continuous')` → DataFrame
+- `token_presence_stats(texts, y, token, n_bins=4, corr_cap=0.30, verbose=False, var_type='continuous')` → dict
+- `coverage_by_lexicon(df_or_tuple, text_col=None, score_col=None, lexicon=(), n_bins=4, verbose=False, var_type='continuous')` → `(summary, per_token_df)`
+- `var_type`: `'continuous'` (numeric outcome, default) or `'categorical'` (group labels). When categorical, `corr` is Cramér's V, `cov_bal` is balanced across groups, and `q1`/`q4` are min/max group coverage.
 ---
 ## Citing & License