PyPI - pycorpdiff - Versions diffs - 0.1.0a6__tar.gz → 0.1.0a8__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (130)

{pycorpdiff-0.1.0a6 → pycorpdiff-0.1.0a8}/.gitignore RENAMED

@@ -33,9 +33,6 @@ Thumbs.db
 # Hypothesis example database (auto-managed)
 .hypothesis/
-# Local tooling
-.claude/
 # Jupyter checkpoints
 .ipynb_checkpoints/

pycorpdiff-0.1.0a8/CHANGELOG.md ADDED

@@ -0,0 +1,71 @@
+# Changelog
+All notable changes to `pycorpdiff` are documented in this file. The format
+follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this
+project adheres to [Semantic Versioning](https://semver.org/).
+## [0.1.0a8] — first public release
+The first public alpha of `pycorpdiff` — comparative corpus analysis
+for modern Python workflows. Three public verbs (`compare`, `track`,
+`compare.before_after`), nine `Result` dataclasses each implementing
+the relevant subset of `.to_df / .plot / .explain / .summary /
+.to_html / .to_json` (see `docs/design.md` for the per-Result method
+matrix), two `typing.Protocol` extension points (`Tokenizer`,
+`Embedder`), and opt-in extras for visualisation, semantic embedding,
+temporal modelling, polars interop, DuckDB ingestion, 🤗 Datasets,
+and notebook rendering.
+### Analytical surface
+- **Keyness**: signed log-likelihood G² with selectable formula
+  (`formula="rayson"` 2-cell shortcut, default; matches the UCREL
+  LL Wizard. `formula="dunning"` 4-cell G²; matches NLTK +
+  `quanteda::textstat_keyness(measure="lr")` byte-for-byte.). Pearson
+  χ², Hardie LogRatio, Gabrielatos %DIFF, BIC-approximated Bayes
+  factor (also tracks the `formula=` choice), Juilland D / Gries DP
+  dispersion flagging, Benjamini–Hochberg correction, stop-word
+  filtering, empirical permutation *p*-values, N-way contingency G²
+  via `keyness_multi`.
+- **Collocations**: logDice, PMI, t-score, MI³ with Laplace smoothing;
+  cross-corpus `collocation_shift`; co-occurrence networks via
+  `cooccurrence_network`.
+- **Semantic shift**: averaged contextual embeddings, Procrustes
+  alignment, multi-period `semantic_trajectory`, `neighborhood_drift`.
+  Embedder output shape is validated to catch silently-broken
+  embedders before they produce nonsense.
+- **Temporal**: Wilson-CI trajectories, offline PELT changepoints,
+  online Bayesian changepoint detection, segmented-OLS interrupted
+  time series, Bayesian structural time-series causal impact,
+  state-space exponential-smoothing forecasting.
+### Cross-validated
+The package is checked against standard tools by automated test:
+- **Rayson's LL Wizard** — hand-derived contingency-table reference
+  triples (fast tier; runs on every push).
+- **NLTK** `BigramAssocMeasures` — PMI + t-score agreement to ≤ 1e-12
+  on every adjacent bigram (slow tier).
+- **Scattertext (Kessler 2017)** — behavioural agreement on the 2012
+  US Conventions corpus (slow tier).
+- **quanteda (R)** via `rpy2` — G² agreement to ≤ 1e-10 with
+  `formula="dunning"` (slow tier).
+- **HistWords (Hamilton et al. 2016)** — known-shifter / stable-word
+  sanity check on Stanford SNAP COHA decade embeddings; skips
+  gracefully when the archive isn't reachable (slow tier).
+### Extras
+`[viz]`, `[semantic]`, `[temporal]`, `[polars]`, `[duckdb]`, `[nlp]`,
+`[huggingface]`, `[notebooks]`, `[all]` are MIT-compatible. A separate
+`[showcase]` extra pulls in `pysofra` (GPL-3.0-or-later) for
+JAMA-style table polish in the showcase notebook — opt in explicitly
+if you accept that licence.
+### Infrastructure
+Hundreds of tests, `ruff` + `mypy --strict` clean across the source
+tree, matrix CI on three Python versions × two operating systems,
+plus a slow-tier CI job exercising the cross-validation receipts
+against NLTK + quanteda on main pushes.
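
Reviewer note on the two `formula=` options named in the changelog above. The following is a minimal sketch of the textbook definitions of the 2-cell (Rayson) and 4-cell (Dunning) log-likelihood statistics, not the package's own implementation. The function names `g2_rayson` and `g2_dunning` are hypothetical, and the sketch returns the unsigned statistic, whereas the changelog describes a signed G².

```python
import math


def _obs_log_ratio(observed: float, expected: float) -> float:
    """observed * ln(observed / expected), using the 0 * ln(0) = 0 convention."""
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def g2_rayson(a: int, b: int, n_a: int, n_b: int) -> float:
    """2-cell shortcut: only the two 'term present' cells contribute."""
    total = n_a + n_b
    present = a + b
    cells = [
        (a, n_a * present / total),
        (b, n_b * present / total),
    ]
    return 2.0 * sum(_obs_log_ratio(o, e) for o, e in cells)


def g2_dunning(a: int, b: int, n_a: int, n_b: int) -> float:
    """Full 4-cell G2 over the 2x2 contingency table (present / absent x A / B)."""
    total = n_a + n_b
    present = a + b
    absent = total - present
    cells = [
        (a, n_a * present / total),
        (b, n_b * present / total),
        (n_a - a, n_a * absent / total),
        (n_b - b, n_b * absent / total),
    ]
    return 2.0 * sum(_obs_log_ratio(o, e) for o, e in cells)


# Illustrative counts (made up): a term seen 30 times in A and 10 times in B,
# with 1000 tokens in each corpus.
rayson_value = g2_rayson(30, 10, 1000, 1000)
dunning_value = g2_dunning(30, 10, 1000, 1000)
```

The two "term absent" cells add a non-negative amount, so the 4-cell value is never smaller than the 2-cell value for the same counts. The gap is small for rare terms in large corpora, which is why the two variants usually rank terms similarly but do not agree digit for digit.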

{pycorpdiff-0.1.0a6 → pycorpdiff-0.1.0a8}/CITATION.cff RENAMED

@@ -4,7 +4,7 @@ message: >
   entry. GitHub renders a "Cite this repository" widget directly from
   this file.
 title: "pycorpdiff: Comparative Corpus Analysis for Modern Python Workflows"
-version: 0.1.0a6
+version: 0.1.0a8
 date-released: 2026-05-25
 authors:
   - family-names: Turner

{pycorpdiff-0.1.0a6 → pycorpdiff-0.1.0a8}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pycorpdiff
-Version: 0.1.0a6
+Version: 0.1.0a8
 Summary: Comparative corpus analysis for Python: keyness, collocations, semantic shift, temporal trajectories with changepoints + causal inference.
 Project-URL: Homepage, https://github.com/jturner-uofl/pycorpdiff
 Project-URL: Documentation, https://github.com/jturner-uofl/pycorpdiff
@@ -54,7 +54,6 @@ Requires-Dist: matplotlib>=3.8; extra == 'all'
 Requires-Dist: networkx>=3.1; extra == 'all'
 Requires-Dist: polars>=1.0; extra == 'all'
 Requires-Dist: pyarrow>=15; extra == 'all'
-Requires-Dist: pysofra>=0.1.0a3; extra == 'all'
 Requires-Dist: ruptures>=1.1; extra == 'all'
 Requires-Dist: scikit-learn>=1.3; extra == 'all'
 Requires-Dist: sentence-transformers>=2.2; extra == 'all'
@@ -77,7 +76,6 @@ Provides-Extra: nlp
 Requires-Dist: spacy>=3.7; extra == 'nlp'
 Provides-Extra: notebooks
 Requires-Dist: jupyter>=1.0; extra == 'notebooks'
-Requires-Dist: pysofra>=0.1.0a3; extra == 'notebooks'
 Requires-Dist: vl-convert-python>=1.5; extra == 'notebooks'
 Provides-Extra: polars
 Requires-Dist: polars>=1.0; extra == 'polars'
@@ -85,6 +83,8 @@ Requires-Dist: pyarrow>=15; extra == 'polars'
 Provides-Extra: semantic
 Requires-Dist: scikit-learn>=1.3; extra == 'semantic'
 Requires-Dist: sentence-transformers>=2.2; extra == 'semantic'
+Provides-Extra: showcase
+Requires-Dist: pysofra>=0.1.0a3; extra == 'showcase'
 Provides-Extra: temporal
 Requires-Dist: ruptures>=1.1; extra == 'temporal'
 Requires-Dist: statsmodels>=0.14; extra == 'temporal'
@@ -127,11 +127,11 @@ and computational social science routinely have:
 `pycorpdiff` is positioned as **orchestration**, not reinvention.
 Tokenizers (`spaCy`, `Stanza`, `jieba`, `fugashi`) and embedders (any
 `SBERT`-compatible model) plug in via two `typing.Protocol` extension
-points — one-line adapters, no plugin registry. The base install pulls
-only `numpy`, `pandas`, `scipy`, and `pyarrow`; everything else is opt-in
-via extras.
+points — one-line adapters, no plugin registry. The base install's
+direct runtime dependencies are `numpy`, `pandas`, `scipy`, and
+`pyarrow`; everything else is opt-in via extras.
-> **Status: alpha (0.1.0a6).** Public API is stable for the features
+> **Status: alpha (0.1.0a8).** Public API is stable for the features
 > described below; on PyPI as `pip install pycorpdiff`.
 ## The three-layer architecture
@@ -178,7 +178,8 @@ for the full feature tour, or the cheat sheet below for one-line API previews.
 ```python
 # Compare verbs (returns Result objects; methods exposed vary by Result)
-pcd.compare(a, b).keyness()
+pcd.compare(a, b).keyness()                                                   # default formula="rayson" (LL Wizard)
+pcd.compare(a, b).keyness(formula="dunning")                                  # full 4-cell G² (matches quanteda / NLTK)
 pcd.compare(a, b).collocation_shift("immigrant")
 pcd.compare(a, b).semantic_shift("immigrant", embedder=pcd.SBERTEmbedder())   # [semantic]
 # SBERTEmbedder downloads a sentence-transformers model on first call;
@@ -190,7 +191,7 @@ tr.changepoints()                                  # offline PELT
 tr.changepoints_online(hazard=1/24)                # Bayesian online (Adams & MacKay 2007)
 tr.interrupted_time_series(event_date="2016")      # segmented OLS
 tr.causal_impact(event_date="2016")                # Bayesian counterfactual (Brodersen 2015)
-tr.forecast(horizon=4)                             # state-space ETS
+tr.forecast(horizon=4)                             # 4 periods at the over_time freq (state-space ETS)
 # Before / after a known event
 pcd.compare.before_after(corpus, event_date="2016-06-23").keyness()
@@ -209,17 +210,20 @@ every analytical surface.
 ## Installation
 ```bash
-pip install pycorpdiff                # lexical-comparative core
-pip install "pycorpdiff[viz]"         # + altair / matplotlib / networkx
-pip install "pycorpdiff[semantic]"    # + sentence-transformers
-pip install "pycorpdiff[temporal]"    # + ruptures / statsmodels
-pip install "pycorpdiff[notebooks]"   # + jupyter / vl-convert / pysofra
-pip install "pycorpdiff[all]"         # everything
+pip install pycorpdiff                       # lexical-comparative core (MIT)
+pip install "pycorpdiff[viz]"                # + altair / matplotlib / networkx
+pip install "pycorpdiff[semantic]"           # + sentence-transformers
+pip install "pycorpdiff[temporal]"           # + ruptures / statsmodels
+pip install "pycorpdiff[notebooks]"          # + jupyter / vl-convert
+pip install "pycorpdiff[all]"                # everything MIT-compatible
+pip install "pycorpdiff[all,showcase]"       # + pysofra (GPL-3.0-or-later) for the JAMA-style showcase
 ```
-The base install keeps a small dependency footprint (`numpy`, `pandas`,
-`scipy`, `pyarrow`); optional extras land per analytical layer so you
-only pay for what you use.
+The base install's direct runtime dependencies are `numpy`, `pandas`,
+`scipy`, and `pyarrow`; optional extras land per analytical layer so
+you only pay for what you use. `[showcase]` is broken out separately
+because `pysofra` is GPL-3.0-or-later — pure `pycorpdiff` use without
+that extra remains MIT-only.
 To work from source:
@@ -232,13 +236,27 @@ pytest -q
 ## Cross-validation receipts
-The math agrees with the standard tools — by automated test:
+The math is checked against standard tools by automated test. The
+fast tier runs on every push (matrix CI); the slow tier needs heavy
+optional dependencies (R + quanteda, NLTK, rpy2, Stanford SNAP
+downloads) and runs on main pushes only.
-- **Rayson's LL Wizard** — hand-derived contingency-table reference triples
-- **NLTK** `BigramAssocMeasures` — PMI + t-score to ≤ 1e-12 on every adjacent bigram
-- **Scattertext (Kessler 2017)** — behavioural agreement on the 2012 US Conventions corpus
-- **quanteda (R)** via `rpy2` — byte-for-byte G² agreement with `formula="dunning"` (slow tier)
-- **HistWords (Hamilton et al. 2016)** — diachronic cosine displacements on COHA (slow tier)
+Fast tier:
+- **Rayson's LL Wizard** — hand-derived contingency-table reference
+  triples ([`tests/integration/test_crossval_rayson.py`](https://github.com/jturner-uofl/pycorpdiff/blob/main/tests/integration/test_crossval_rayson.py))
+Slow tier:
+- **NLTK** `BigramAssocMeasures` — PMI + t-score agreement to ≤ 1e-12
+  on every adjacent bigram
+- **Scattertext (Kessler 2017)** — behavioural agreement on the 2012
+  US Conventions corpus
+- **quanteda (R)** via `rpy2` — G² agreement to ≤ 1e-10 with
+  `formula="dunning"`
+- **HistWords (Hamilton et al. 2016)** — known-shifter / stable-word
+  sanity check on Stanford SNAP COHA decade embeddings (skips
+  gracefully if the archive isn't reachable)
 ## Citation

{pycorpdiff-0.1.0a6 → pycorpdiff-0.1.0a8}/README.md RENAMED

@@ -31,11 +31,11 @@ and computational social science routinely have:
 `pycorpdiff` is positioned as **orchestration**, not reinvention.
 Tokenizers (`spaCy`, `Stanza`, `jieba`, `fugashi`) and embedders (any
 `SBERT`-compatible model) plug in via two `typing.Protocol` extension
-points — one-line adapters, no plugin registry. The base install pulls
-only `numpy`, `pandas`, `scipy`, and `pyarrow`; everything else is opt-in
-via extras.
+points — one-line adapters, no plugin registry. The base install's
+direct runtime dependencies are `numpy`, `pandas`, `scipy`, and
+`pyarrow`; everything else is opt-in via extras.
-> **Status: alpha (0.1.0a6).** Public API is stable for the features
+> **Status: alpha (0.1.0a8).** Public API is stable for the features
 > described below; on PyPI as `pip install pycorpdiff`.
 ## The three-layer architecture
@@ -82,7 +82,8 @@ for the full feature tour, or the cheat sheet below for one-line API previews.
 ```python
 # Compare verbs (returns Result objects; methods exposed vary by Result)
-pcd.compare(a, b).keyness()
+pcd.compare(a, b).keyness()                                                   # default formula="rayson" (LL Wizard)
+pcd.compare(a, b).keyness(formula="dunning")                                  # full 4-cell G² (matches quanteda / NLTK)
 pcd.compare(a, b).collocation_shift("immigrant")
 pcd.compare(a, b).semantic_shift("immigrant", embedder=pcd.SBERTEmbedder())   # [semantic]
 # SBERTEmbedder downloads a sentence-transformers model on first call;
@@ -94,7 +95,7 @@ tr.changepoints()                                  # offline PELT
 tr.changepoints_online(hazard=1/24)                # Bayesian online (Adams & MacKay 2007)
 tr.interrupted_time_series(event_date="2016")      # segmented OLS
 tr.causal_impact(event_date="2016")                # Bayesian counterfactual (Brodersen 2015)
-tr.forecast(horizon=4)                             # state-space ETS
+tr.forecast(horizon=4)                             # 4 periods at the over_time freq (state-space ETS)
 # Before / after a known event
 pcd.compare.before_after(corpus, event_date="2016-06-23").keyness()
@@ -113,17 +114,20 @@ every analytical surface.
 ## Installation
 ```bash
-pip install pycorpdiff                # lexical-comparative core
-pip install "pycorpdiff[viz]"         # + altair / matplotlib / networkx
-pip install "pycorpdiff[semantic]"    # + sentence-transformers
-pip install "pycorpdiff[temporal]"    # + ruptures / statsmodels
-pip install "pycorpdiff[notebooks]"   # + jupyter / vl-convert / pysofra
-pip install "pycorpdiff[all]"         # everything
+pip install pycorpdiff                       # lexical-comparative core (MIT)
+pip install "pycorpdiff[viz]"                # + altair / matplotlib / networkx
+pip install "pycorpdiff[semantic]"           # + sentence-transformers
+pip install "pycorpdiff[temporal]"           # + ruptures / statsmodels
+pip install "pycorpdiff[notebooks]"          # + jupyter / vl-convert
+pip install "pycorpdiff[all]"                # everything MIT-compatible
+pip install "pycorpdiff[all,showcase]"       # + pysofra (GPL-3.0-or-later) for the JAMA-style showcase
 ```
-The base install keeps a small dependency footprint (`numpy`, `pandas`,
-`scipy`, `pyarrow`); optional extras land per analytical layer so you
-only pay for what you use.
+The base install's direct runtime dependencies are `numpy`, `pandas`,
+`scipy`, and `pyarrow`; optional extras land per analytical layer so
+you only pay for what you use. `[showcase]` is broken out separately
+because `pysofra` is GPL-3.0-or-later — pure `pycorpdiff` use without
+that extra remains MIT-only.
 To work from source:
@@ -136,13 +140,27 @@ pytest -q
 ## Cross-validation receipts
-The math agrees with the standard tools — by automated test:
+The math is checked against standard tools by automated test. The
+fast tier runs on every push (matrix CI); the slow tier needs heavy
+optional dependencies (R + quanteda, NLTK, rpy2, Stanford SNAP
+downloads) and runs on main pushes only.
-- **Rayson's LL Wizard** — hand-derived contingency-table reference triples
-- **NLTK** `BigramAssocMeasures` — PMI + t-score to ≤ 1e-12 on every adjacent bigram
-- **Scattertext (Kessler 2017)** — behavioural agreement on the 2012 US Conventions corpus
-- **quanteda (R)** via `rpy2` — byte-for-byte G² agreement with `formula="dunning"` (slow tier)
-- **HistWords (Hamilton et al. 2016)** — diachronic cosine displacements on COHA (slow tier)
+Fast tier:
+- **Rayson's LL Wizard** — hand-derived contingency-table reference
+  triples ([`tests/integration/test_crossval_rayson.py`](https://github.com/jturner-uofl/pycorpdiff/blob/main/tests/integration/test_crossval_rayson.py))
+Slow tier:
+- **NLTK** `BigramAssocMeasures` — PMI + t-score agreement to ≤ 1e-12
+  on every adjacent bigram
+- **Scattertext (Kessler 2017)** — behavioural agreement on the 2012
+  US Conventions corpus
+- **quanteda (R)** via `rpy2` — G² agreement to ≤ 1e-10 with
+  `formula="dunning"`
+- **HistWords (Hamilton et al. 2016)** — known-shifter / stable-word
+  sanity check on Stanford SNAP COHA decade embeddings (skips
+  gracefully if the archive isn't reachable)
 ## Citation

{pycorpdiff-0.1.0a6 → pycorpdiff-0.1.0a8}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "pycorpdiff"
-version = "0.1.0a6"
+version = "0.1.0a8"
 description = "Comparative corpus analysis for Python: keyness, collocations, semantic shift, temporal trajectories with changepoints + causal inference."
 readme = "README.md"
 license = { file = "LICENSE" }
@@ -62,13 +62,18 @@ nlp = ["spacy>=3.7"]
 # Public-text-corpus hub. Heavy (pulls pyarrow, fsspec, requests, aiohttp),
 # so opt-in only — base install stays small.
 huggingface = ["datasets>=2.14"]
-# Needed if you want to execute the showcase notebook or regenerate the
-# rendered HTML examples. `jupyter` runs the notebook, `vl-convert` does
-# static SVG/PNG export of altair charts, `pysofra` renders the showcase's
-# result tables in JAMA-style typography.
-notebooks = ["jupyter>=1.0", "vl-convert-python>=1.5", "pysofra>=0.1.0a3"]
-# Meta-extra: `pycorpdiff[all]` pulls in every optional code path
-# including the notebook runtime.
+# Needed if you want to execute the example notebooks. `jupyter` runs
+# the notebook; `vl-convert` does static SVG/PNG export of altair charts.
+# Kept MIT-clean — see `showcase` below for the JAMA-style table polish.
+notebooks = ["jupyter>=1.0", "vl-convert-python>=1.5"]
+# Adds `pysofra` for the showcase notebook's JAMA-style typography.
+# IMPORTANT: pysofra is GPL-3.0-or-later. Opting in to `[showcase]` (or
+# installing pysofra directly) brings GPL into your environment; pure
+# pycorpdiff use without this extra remains MIT-only.
+showcase = ["pysofra>=0.1.0a3"]
+# Meta-extra: every MIT-compatible optional code path. Does NOT include
+# `[showcase]` because pysofra is GPL-3.0-or-later; install
+# `pycorpdiff[all,showcase]` explicitly if you accept that licence.
 all = [
     "altair>=5",
     "matplotlib>=3.8",
@@ -84,7 +89,6 @@ all = [
     "spacy>=3.7",
     "jupyter>=1.0",
     "vl-convert-python>=1.5",
-    "pysofra>=0.1.0a3",
 ]
 dev = [
     "pytest>=8",

{pycorpdiff-0.1.0a6 → pycorpdiff-0.1.0a8}/src/pycorpdiff/__init__.py RENAMED

@@ -6,20 +6,21 @@ result objects (:class:`KeynessResult`, :class:`CollocationShiftResult`,
 :class:`SemanticShiftResult`, :class:`TemporalTrajectory`,
 :class:`NetworkResult`, :class:`ForecastResult`,
 :class:`CausalImpactResult`, :class:`BocpdResult`,
-:class:`ConcordanceResult`), each implementing the same
-``.to_df / .plot / .explain / .summary / .to_html / .to_json`` contract.
+:class:`ConcordanceResult`), each implementing the relevant subset of
+the ``.to_df / .plot / .explain / .summary / .to_html / .to_json``
+contract. See ``docs/design.md`` for the per-Result method matrix.
 Example
 -------
 >>> import pycorpdiff as pcd
->>> pcd.__version__
-'0.1.0a6'
+>>> isinstance(pcd.__version__, str)
+True
 """
 from __future__ import annotations
-__version__ = "0.1.0a6"
+__version__ = "0.1.0a8"
 from .collocation.network import NetworkResult, cooccurrence_network
 from .compare import Comparison, compare

{pycorpdiff-0.1.0a6 → pycorpdiff-0.1.0a8}/src/pycorpdiff/compare.py RENAMED

@@ -149,7 +149,9 @@ class Comparison:
         if effect_size:
             table["log_ratio"] = _log_ratio(a_kept, b_kept, n_a, n_b)
             table["percent_diff"] = _percent_diff(a_kept, b_kept, n_a, n_b)
-            table["bayes_factor"] = _bayes_factor(a_kept, b_kept, n_a, n_b)
+            table["bayes_factor"] = _bayes_factor(
+                a_kept, b_kept, n_a, n_b, formula=formula
+            )
         if dispersion:
             kept_terms = table.index

{pycorpdiff-0.1.0a6 → pycorpdiff-0.1.0a8}/src/pycorpdiff/corpus.py RENAMED

@@ -242,6 +242,15 @@ class Corpus:
         """
         from .temporal.slicing import TemporalCorpus  # local import to break cycle
+        if len(self.docs) == 0:
+            raise ValueError(
+                "by_time() requires a non-empty corpus; got 0 documents."
+            )
+        if col not in self.docs.columns:
+            raise ValueError(
+                f"by_time(col={col!r}, ...): column not found in corpus. "
+                f"Available columns: {list(self.docs.columns)!r}."
+            )
         return TemporalCorpus(parent=self, time_col=col, freq=freq)
     def with_tokenizer(self, tokenizer: Tokenizer) -> Corpus:

{pycorpdiff-0.1.0a6 → pycorpdiff-0.1.0a8}/src/pycorpdiff/io/duckdb.py RENAMED

@@ -71,12 +71,24 @@ def read_duckdb(
     ... )
     """
     try:
-        import duckdb  # noqa: F401
+        import duckdb
     except ImportError as exc:  # pragma: no cover
         raise ImportError(
             "read_duckdb requires duckdb. Install with: pip install 'pycorpdiff[duckdb]'"
         ) from exc
+    if isinstance(connection, str):
+        raise TypeError(
+            "read_duckdb expects a DuckDB connection, not a file path. "
+            f"Got connection={connection!r}. Open one first: "
+            f'duckdb.connect({connection!r}), or pcd.read_duckdb(duckdb.connect(), "...")'
+        )
+    if not isinstance(connection, duckdb.DuckDBPyConnection):
+        raise TypeError(
+            "read_duckdb expects a duckdb.DuckDBPyConnection; got "
+            f"{type(connection).__name__}. Open one via duckdb.connect(...)."
+        )
     cursor = connection.execute(query, params) if params is not None else connection.execute(query)
     df = cursor.df()
     if text_col not in df.columns:

{pycorpdiff-0.1.0a6 → pycorpdiff-0.1.0a8}/src/pycorpdiff/keyness/bayes.py RENAMED

@@ -15,7 +15,7 @@ from __future__ import annotations
 import numpy as np
 import pandas as pd
-from .loglikelihood import log_likelihood
+from .loglikelihood import LLFormula, log_likelihood
 def bayes_factor(
@@ -23,6 +23,8 @@ def bayes_factor(
     counts_b: pd.Series,
     total_a: int,
     total_b: int,
+    *,
+    formula: LLFormula = "rayson",
 ) -> pd.Series:
     """BIC-approximated Bayes factor for each term's frequency difference.
@@ -31,6 +33,12 @@ def bayes_factor(
     the unsigned log-likelihood. The Bayes factor is then
     ``exp(BIC / 2)``. Wilson (2013) is the keyness application.
+    ``formula`` selects which G² flavour feeds the BF: ``"rayson"`` (the
+    2-cell shortcut, default; matches the LL Wizard) or ``"dunning"``
+    (the full 4-cell G²; matches quanteda/NLTK). Use the same
+    ``formula=`` as the ``keyness()`` call that produced the row so the
+    G² and the Bayes factor in a single row describe the same statistic.
     Interpret with Kass & Raftery (1995):
     - ``BF > 2``  : positive evidence
@@ -43,7 +51,7 @@ def bayes_factor(
     plots / sorts handle it.
     """
     terms = counts_a.index.union(counts_b.index)
-    ll_table = log_likelihood(counts_a, counts_b, total_a, total_b)
+    ll_table = log_likelihood(counts_a, counts_b, total_a, total_b, formula=formula)
     g2_abs = ll_table["g2"].abs()
     bic = g2_abs - np.log(total_a + total_b)
     with np.errstate(over="ignore"):
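
Reviewer note on why forwarding `formula=` matters here. The hunk above computes `bic = |G2| - ln(total_a + total_b)` and then `exp(BIC / 2)`, so any change in G² changes the Bayes factor. Below is a minimal scalar sketch of that arithmetic, assuming a single term rather than a pandas Series. The helper name `bayes_factor_from_g2` and the G² inputs are hypothetical, and mapping overflow to infinity is an assumption suggested by the `np.errstate(over="ignore")` line.

```python
import math


def bayes_factor_from_g2(g2: float, total_a: int, total_b: int) -> float:
    """BIC-approximated Bayes factor for one term, given its G2 statistic."""
    bic = abs(g2) - math.log(total_a + total_b)
    try:
        return math.exp(bic / 2.0)
    except OverflowError:
        # Very large G2 values overflow a float; treat them as infinite evidence.
        return math.inf


# Illustrative inputs (made up): the same term scored with a 2-cell and a
# 4-cell G2 over two corpora of 1000 tokens each.
bf_two_cell = bayes_factor_from_g2(10.465, 1000, 1000)
bf_four_cell = bayes_factor_from_g2(10.668, 1000, 1000)
```

Because the exponential amplifies differences, a G² gap of about 0.2 already moves the Bayes factor by roughly ten percent. Reporting a 4-cell G² next to a Bayes factor derived from the 2-cell value would therefore put two different statistics in the same row.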

{pycorpdiff-0.1.0a6 → pycorpdiff-0.1.0a8}/src/pycorpdiff/results.py RENAMED

@@ -10,7 +10,7 @@ contract:
 - ``.summary()`` returns a short human-readable string.
 - ``.explain(term, n)`` returns a :class:`ConcordanceResult` with
   KWIC evidence for one row of the result. Defined only on
-  comparison-based Results (``KeynessResult``, ``CollocationShiftResult``)
+  term-ranked Results (``KeynessResult``, ``CollocationShiftResult``)
   where "one row of the result" maps to a target term.
 See ``docs/design.md`` for the per-Result method matrix. This contract
@@ -257,15 +257,32 @@ class SemanticShiftResult:
         return _table_to_json(self.table, path, **kw)
     def plot(self, **kw: Any) -> alt.Chart:
-        """Plotting for SemanticShiftResult is not yet implemented.
+        """Horizontal bar chart of cosine distance per target term.
-        For a forward-looking trajectory of cosine distances, use
-        :func:`pycorpdiff.semantic_trajectory` and pass the resulting
-        DataFrame to :func:`pycorpdiff.viz.semantic_forecast_plot`.
+        For a multi-period trajectory of cosine distances (an across-
+        time view rather than a single A-vs-B snapshot), use
+        :func:`pycorpdiff.semantic_trajectory` paired with
+        :func:`pycorpdiff.viz.semantic_forecast_plot`.
+        Extra keyword arguments forward to :meth:`altair.Chart.properties`.
         """
-        raise NotImplementedError(
-            "SemanticShiftResult.plot() is not yet implemented; "
-            "use .table or pcd.viz.semantic_forecast_plot() instead"
+        import altair as alt
+        return (  # type: ignore[no-any-return]
+            alt.Chart(self.table)
+            .mark_bar(color="#0b6e7c")
+            .encode(
+                x=alt.X("cosine_distance:Q", title="Cosine distance (A → B)"),
+                y=alt.Y("target:N", sort="-x", title=None),
+                tooltip=[
+                    "target",
+                    alt.Tooltip("cosine_similarity:Q", format=".4f"),
+                    alt.Tooltip("cosine_distance:Q", format=".4f"),
+                    "n_contexts_a",
+                    "n_contexts_b",
+                ],
+            )
+            .properties(width=400, **kw)
         )
     def neighbors_before(

{pycorpdiff-0.1.0a6 → pycorpdiff-0.1.0a8}/src/pycorpdiff/semantic/shift.py RENAMED

@@ -46,6 +46,28 @@ def _centroid(vectors: np.ndarray[Any, Any]) -> np.ndarray[Any, Any]:
     return out
+def _validate_embeddings(
+    vecs: np.ndarray[Any, Any], expected_rows: int, side: str
+) -> None:
+    """Catch mis-shaped embedder output before it produces silent nonsense.
+    A 1-D return from ``embedder.encode`` would otherwise be averaged into
+    a scalar centroid and yield ``cosine_similarity == 1.0`` for any
+    comparison — a silently wrong result.
+    """
+    if vecs.ndim != 2:
+        raise ValueError(
+            f"embedder.encode() for corpus {side!r} returned an array of "
+            f"rank {vecs.ndim}; expected 2 (shape (n_windows, d)). "
+            f"Got shape {vecs.shape}."
+        )
+    if vecs.shape[0] != expected_rows:
+        raise ValueError(
+            f"embedder.encode() for corpus {side!r} returned "
+            f"{vecs.shape[0]} rows; expected {expected_rows} (one per window)."
+        )
 def semantic_shift(
     a: Corpus | CorpusSlice,
     b: Corpus | CorpusSlice,
@@ -103,6 +125,8 @@ def semantic_shift(
         vecs_a = np.asarray(embedder.encode(wins_a), dtype=np.float64)
         vecs_b = np.asarray(embedder.encode(wins_b), dtype=np.float64)
+        _validate_embeddings(vecs_a, expected_rows=len(wins_a), side="a")
+        _validate_embeddings(vecs_b, expected_rows=len(wins_b), side="b")
         if align == "procrustes":
             # Procrustes wants two matrices of the same shape. Pad / truncate
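
Reviewer note on the failure mode that `_validate_embeddings` guards against. The sketch below reproduces it with NumPy under one assumption: that the centroid is a mean over axis 0, as the docstring's "averaged into a scalar centroid" suggests. The names `cosine` and `check_embeddings`, and the flat vectors standing in for a broken embedder, are hypothetical.

```python
import numpy as np


def cosine(u, v) -> float:
    """Cosine similarity; scalars are treated as length-1 vectors."""
    u = np.atleast_1d(np.asarray(u, dtype=np.float64))
    v = np.atleast_1d(np.asarray(v, dtype=np.float64))
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def check_embeddings(vecs: np.ndarray, expected_rows: int, side: str) -> None:
    """Reject anything that is not an (n_windows, d) matrix with one row per window."""
    if vecs.ndim != 2:
        raise ValueError(
            f"corpus {side!r}: expected a rank-2 array, got shape {vecs.shape}"
        )
    if vecs.shape[0] != expected_rows:
        raise ValueError(
            f"corpus {side!r}: got {vecs.shape[0]} rows, expected {expected_rows}"
        )


# A broken embedder returns one flat vector instead of one row per window.
flat_a = np.array([0.2, 0.9, 0.4])
flat_b = np.array([0.7, 0.1, 0.3])

# Averaging a 1-D array over axis 0 collapses it to a single scalar.
centroid_a = flat_a.mean(axis=0)
centroid_b = flat_b.mean(axis=0)

# Two positive scalars always point the same way, so the similarity is 1
# regardless of what the embedder actually returned.
silent_similarity = cosine(centroid_a, centroid_b)
```

Calling `check_embeddings(flat_a, expected_rows=3, side="a")` raises `ValueError` before the centroid is computed, which turns the silently wrong similarity into an explicit error.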

{pycorpdiff-0.1.0a6 → pycorpdiff-0.1.0a8}/tests/integration/test_crossval_histwords.py RENAMED

@@ -71,9 +71,12 @@ def test_fetch_coha_1990_returns_real_vocab(histwords_cache_dir: Path) -> None:
     everyday words. Doesn't check vector values — that's the next test."""
     if not _has_internet():
         pytest.skip("offline")
-    vecs = pcd.fetch_histwords_decade(
-        1990, source="coha", cache_dir=histwords_cache_dir
-    )
+    try:
+        vecs = pcd.fetch_histwords_decade(
+            1990, source="coha", cache_dir=histwords_cache_dir
+        )
+    except FileNotFoundError as exc:
+        pytest.skip(f"COHA 1990s not available: {exc}")
     # COHA 1990s vocab is large (~50k+ words). Expect basic English words.
     for word in ("the", "and", "of", "is", "people"):
         assert word in vecs, f"expected {word!r} in 1990s COHA vocab"
@@ -98,6 +101,8 @@ def test_known_shifters_show_high_cosine_distance(
             )
         except KeyError:
             pytest.skip(f"{word!r} missing from COHA 1900s or 1990s vocab")
+        except FileNotFoundError as exc:
+            pytest.skip(f"COHA decade data not available: {exc}")
         assert d > 0.3, (
             f"expected {word!r} to show cosine distance > 0.3 "
             f"between 1900s and 1990s COHA; got {d:.3f}"
@@ -115,9 +120,12 @@ def test_stable_function_words_show_low_cosine_distance(
         pytest.skip("offline")
     stable = ["the", "and", "of"]
     for word in stable:
-        d = pcd.histwords_cosine_shift(
-            1900, 1990, word, source="coha", cache_dir=histwords_cache_dir
-        )
+        try:
+            d = pcd.histwords_cosine_shift(
+                1900, 1990, word, source="coha", cache_dir=histwords_cache_dir
+            )
+        except FileNotFoundError as exc:
+            pytest.skip(f"COHA decade data not available: {exc}")
         assert d < 0.30, (
             f"expected {word!r} to be stable across decades "
             f"(cosine distance < 0.30); got {d:.3f}"
@@ -137,19 +145,25 @@ def test_shifter_distance_exceeds_stable_distance_by_meaningful_margin(
     shifter_distances = []
     for word in ("gay", "broadcast", "awful"):
         with contextlib.suppress(KeyError):
-            shifter_distances.append(
-                pcd.histwords_cosine_shift(
-                    1900, 1990, word, source="coha",
-                    cache_dir=histwords_cache_dir,
+            try:
+                shifter_distances.append(
+                    pcd.histwords_cosine_shift(
+                        1900, 1990, word, source="coha",
+                        cache_dir=histwords_cache_dir,
+                    )
                 )
-            )
+            except FileNotFoundError as exc:
+                pytest.skip(f"COHA decade data not available: {exc}")
     stable_distances = []
     for word in ("the", "and", "of"):
-        stable_distances.append(
-            pcd.histwords_cosine_shift(
-                1900, 1990, word, source="coha", cache_dir=histwords_cache_dir
+        try:
+            stable_distances.append(
+                pcd.histwords_cosine_shift(
+                    1900, 1990, word, source="coha", cache_dir=histwords_cache_dir
+                )
             )
-        )
+        except FileNotFoundError as exc:
+            pytest.skip(f"COHA decade data not available: {exc}")
     if not shifter_distances:
         pytest.skip("no shifters available in COHA vocab")
     avg_shift = sum(shifter_distances) / len(shifter_distances)
