PyPI - pycorpdiff - Versions diffs - 0.1.0a5__tar.gz → 0.1.0a7__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (132)

{pycorpdiff-0.1.0a5 → pycorpdiff-0.1.0a7}/.gitignore RENAMED

@@ -33,9 +33,6 @@ Thumbs.db
 # Hypothesis example database (auto-managed)
 .hypothesis/
-# Local tooling
-.claude/
 # Jupyter checkpoints
 .ipynb_checkpoints/

pycorpdiff-0.1.0a7/CHANGELOG.md ADDED

@@ -0,0 +1,71 @@
+# Changelog
+All notable changes to `pycorpdiff` are documented in this file. The format
+follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this
+project adheres to [Semantic Versioning](https://semver.org/).
+## [0.1.0a7] — first public release
+The first public alpha of `pycorpdiff` — comparative corpus analysis
+for modern Python workflows. Three public verbs (`compare`, `track`,
+`compare.before_after`), nine `Result` dataclasses each implementing
+the relevant subset of `.to_df / .plot / .explain / .summary /
+.to_html / .to_json` (see `docs/design.md` for the per-Result method
+matrix), two `typing.Protocol` extension points (`Tokenizer`,
+`Embedder`), and opt-in extras for visualisation, semantic embedding,
+temporal modelling, polars interop, DuckDB ingestion, 🤗 Datasets,
+and notebook rendering.
+### Analytical surface
+- **Keyness**: signed log-likelihood G² with selectable formula
+  (`formula="rayson"` 2-cell shortcut, default; matches the UCREL
+  LL Wizard. `formula="dunning"` 4-cell G²; matches NLTK +
+  `quanteda::textstat_keyness(measure="lr")` byte-for-byte.). Pearson
+  χ², Hardie LogRatio, Gabrielatos %DIFF, BIC-approximated Bayes
+  factor (also tracks the `formula=` choice), Juilland D / Gries DP
+  dispersion flagging, Benjamini–Hochberg correction, stop-word
+  filtering, empirical permutation *p*-values, N-way contingency G²
+  via `keyness_multi`.
+- **Collocations**: logDice, PMI, t-score, MI³ with Laplace smoothing;
+  cross-corpus `collocation_shift`; co-occurrence networks via
+  `cooccurrence_network`.
+- **Semantic shift**: averaged contextual embeddings, Procrustes
+  alignment, multi-period `semantic_trajectory`, `neighborhood_drift`.
+  Embedder output shape is validated to catch silently-broken
+  embedders before they produce nonsense.
+- **Temporal**: Wilson-CI trajectories, offline PELT changepoints,
+  online Bayesian changepoint detection, segmented-OLS interrupted
+  time series, Bayesian structural time-series causal impact,
+  state-space exponential-smoothing forecasting.
+### Cross-validated
+The package is checked against standard tools by automated test:
+- **Rayson's LL Wizard** — hand-derived contingency-table reference
+  triples (fast tier; runs on every push).
+- **NLTK** `BigramAssocMeasures` — PMI + t-score agreement to ≤ 1e-12
+  on every adjacent bigram (slow tier).
+- **Scattertext (Kessler 2017)** — behavioural agreement on the 2012
+  US Conventions corpus (slow tier).
+- **quanteda (R)** via `rpy2` — G² agreement to ≤ 1e-10 with
+  `formula="dunning"` (slow tier).
+- **HistWords (Hamilton et al. 2016)** — known-shifter / stable-word
+  sanity check on Stanford SNAP COHA decade embeddings; skips
+  gracefully when the archive isn't reachable (slow tier).
+### Extras
+`[viz]`, `[semantic]`, `[temporal]`, `[polars]`, `[duckdb]`, `[nlp]`,
+`[huggingface]`, `[notebooks]`, `[all]` are MIT-compatible. A separate
+`[showcase]` extra pulls in `pysofra` (GPL-3.0-or-later) for
+JAMA-style table polish in the showcase notebook — opt in explicitly
+if you accept that licence.
+### Infrastructure
+Hundreds of tests, `ruff` + `mypy --strict` clean across the source
+tree, matrix CI on three Python versions × two operating systems,
+plus a slow-tier CI job exercising the cross-validation receipts
+against NLTK + quanteda on main pushes.
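
Reviewer note: the changelog's keyness entry distinguishes a 2-cell `"rayson"` G² from a 4-cell `"dunning"` G². The sketch below shows the two textbook forms side by side so the difference is concrete. It is not the package's `log_likelihood` implementation: the function names are hypothetical, the result is unsigned (the package reports a signed statistic), and zero counts are handled by the usual convention that an empty cell contributes nothing.

```python
import math


def _cell(observed: float, expected: float) -> float:
    """Contribution O * ln(O / E) of one cell; an empty cell contributes 0."""
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def g2_rayson(a: int, b: int, n_a: int, n_b: int) -> float:
    """2-cell G²: uses only the cells where the term occurs.

    a, b are the term's counts in corpus A and B; n_a, n_b are corpus sizes.
    """
    total = n_a + n_b
    e_a = n_a * (a + b) / total  # expected count in A under equal rates
    e_b = n_b * (a + b) / total  # expected count in B under equal rates
    return 2.0 * (_cell(a, e_a) + _cell(b, e_b))


def g2_dunning(a: int, b: int, n_a: int, n_b: int) -> float:
    """4-cell G²: also counts the cells where the term does not occur."""
    total = n_a + n_b
    absent = total - a - b
    e_not_a = n_a * absent / total  # expected non-occurrences in A
    e_not_b = n_b * absent / total  # expected non-occurrences in B
    extra = _cell(n_a - a, e_not_a) + _cell(n_b - b, e_not_b)
    return g2_rayson(a, b, n_a, n_b) + 2.0 * extra


# Hypothetical counts: 50 hits in 10,000 tokens vs 10 hits in 10,000 tokens.
print(g2_rayson(50, 10, 10_000, 10_000))
print(g2_dunning(50, 10, 10_000, 10_000))
```

The two extra cells add a non-negative amount, so the 4-cell value is never smaller than the 2-cell value. The gap is small when the term is rare relative to corpus size and grows as the term takes up a larger share of either corpus.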

{pycorpdiff-0.1.0a5 → pycorpdiff-0.1.0a7}/CITATION.cff RENAMED

@@ -4,7 +4,7 @@ message: >
   entry. GitHub renders a "Cite this repository" widget directly from
   this file.
 title: "pycorpdiff: Comparative Corpus Analysis for Modern Python Workflows"
-version: 0.1.0a5
+version: 0.1.0a7
 date-released: 2026-05-25
 authors:
   - family-names: Turner
@@ -32,7 +32,9 @@ abstract: >
   API. The package targets corpus linguistics, digital humanities,
   computational social science, and discourse analysis research,
   emphasising interpretability, explainability, statistical rigour,
-  and reproducibility.
+  and reproducibility. A bundled synthetic UK-Hansard-style sample
+  ships for offline demonstration; real-data interfaces include
+  fetch_hansard and from_huggingface.
 identifiers:
   - type: url
     value: "https://github.com/jturner-uofl/pycorpdiff"

{pycorpdiff-0.1.0a5 → pycorpdiff-0.1.0a7}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pycorpdiff
-Version: 0.1.0a5
+Version: 0.1.0a7
 Summary: Comparative corpus analysis for Python: keyness, collocations, semantic shift, temporal trajectories with changepoints + causal inference.
 Project-URL: Homepage, https://github.com/jturner-uofl/pycorpdiff
 Project-URL: Documentation, https://github.com/jturner-uofl/pycorpdiff
@@ -54,7 +54,6 @@ Requires-Dist: matplotlib>=3.8; extra == 'all'
 Requires-Dist: networkx>=3.1; extra == 'all'
 Requires-Dist: polars>=1.0; extra == 'all'
 Requires-Dist: pyarrow>=15; extra == 'all'
-Requires-Dist: pysofra>=0.1.0a3; extra == 'all'
 Requires-Dist: ruptures>=1.1; extra == 'all'
 Requires-Dist: scikit-learn>=1.3; extra == 'all'
 Requires-Dist: sentence-transformers>=2.2; extra == 'all'
@@ -77,7 +76,6 @@ Provides-Extra: nlp
 Requires-Dist: spacy>=3.7; extra == 'nlp'
 Provides-Extra: notebooks
 Requires-Dist: jupyter>=1.0; extra == 'notebooks'
-Requires-Dist: pysofra>=0.1.0a3; extra == 'notebooks'
 Requires-Dist: vl-convert-python>=1.5; extra == 'notebooks'
 Provides-Extra: polars
 Requires-Dist: polars>=1.0; extra == 'polars'
@@ -85,6 +83,8 @@ Requires-Dist: pyarrow>=15; extra == 'polars'
 Provides-Extra: semantic
 Requires-Dist: scikit-learn>=1.3; extra == 'semantic'
 Requires-Dist: sentence-transformers>=2.2; extra == 'semantic'
+Provides-Extra: showcase
+Requires-Dist: pysofra>=0.1.0a3; extra == 'showcase'
 Provides-Extra: temporal
 Requires-Dist: ruptures>=1.1; extra == 'temporal'
 Requires-Dist: statsmodels>=0.14; extra == 'temporal'
@@ -127,11 +127,11 @@ and computational social science routinely have:
 `pycorpdiff` is positioned as **orchestration**, not reinvention.
 Tokenizers (`spaCy`, `Stanza`, `jieba`, `fugashi`) and embedders (any
 `SBERT`-compatible model) plug in via two `typing.Protocol` extension
-points — one-line adapters, no plugin registry. The base install pulls
-only `numpy`, `pandas`, `scipy`, and `pyarrow`; everything else is opt-in
-via extras.
+points — one-line adapters, no plugin registry. The base install's
+direct runtime dependencies are `numpy`, `pandas`, `scipy`, and
+`pyarrow`; everything else is opt-in via extras.
-> **Status: alpha (0.1.0a5).** Public API is stable for the features
+> **Status: alpha (0.1.0a7).** Public API is stable for the features
 > described below; on PyPI as `pip install pycorpdiff`.
 ## The three-layer architecture
@@ -178,7 +178,8 @@ for the full feature tour, or the cheat sheet below for one-line API previews.
 ```python
 # Compare verbs (returns Result objects; methods exposed vary by Result)
-pcd.compare(a, b).keyness()
+pcd.compare(a, b).keyness()                                                   # default formula="rayson" (LL Wizard)
+pcd.compare(a, b).keyness(formula="dunning")                                  # full 4-cell G² (matches quanteda / NLTK)
 pcd.compare(a, b).collocation_shift("immigrant")
 pcd.compare(a, b).semantic_shift("immigrant", embedder=pcd.SBERTEmbedder())   # [semantic]
 # SBERTEmbedder downloads a sentence-transformers model on first call;
@@ -190,7 +191,7 @@ tr.changepoints()                                  # offline PELT
 tr.changepoints_online(hazard=1/24)                # Bayesian online (Adams & MacKay 2007)
 tr.interrupted_time_series(event_date="2016")      # segmented OLS
 tr.causal_impact(event_date="2016")                # Bayesian counterfactual (Brodersen 2015)
-tr.forecast(horizon=4)                             # state-space ETS
+tr.forecast(horizon=4)                             # 4 periods at the over_time freq (state-space ETS)
 # Before / after a known event
 pcd.compare.before_after(corpus, event_date="2016-06-23").keyness()
@@ -209,17 +210,20 @@ every analytical surface.
 ## Installation
 ```bash
-pip install pycorpdiff                # lexical-comparative core
-pip install "pycorpdiff[viz]"         # + altair / matplotlib / networkx
-pip install "pycorpdiff[semantic]"    # + sentence-transformers
-pip install "pycorpdiff[temporal]"    # + ruptures / statsmodels
-pip install "pycorpdiff[notebooks]"   # + jupyter / vl-convert / pysofra
-pip install "pycorpdiff[all]"         # everything
+pip install pycorpdiff                       # lexical-comparative core (MIT)
+pip install "pycorpdiff[viz]"                # + altair / matplotlib / networkx
+pip install "pycorpdiff[semantic]"           # + sentence-transformers
+pip install "pycorpdiff[temporal]"           # + ruptures / statsmodels
+pip install "pycorpdiff[notebooks]"          # + jupyter / vl-convert
+pip install "pycorpdiff[all]"                # everything MIT-compatible
+pip install "pycorpdiff[all,showcase]"       # + pysofra (GPL-3.0-or-later) for the JAMA-style showcase
 ```
-The base install keeps a small dependency footprint (`numpy`, `pandas`,
-`scipy`, `pyarrow`); optional extras land per analytical layer so you
-only pay for what you use.
+The base install's direct runtime dependencies are `numpy`, `pandas`,
+`scipy`, and `pyarrow`; optional extras land per analytical layer so
+you only pay for what you use. `[showcase]` is broken out separately
+because `pysofra` is GPL-3.0-or-later — pure `pycorpdiff` use without
+that extra remains MIT-only.
 To work from source:
@@ -232,13 +236,27 @@ pytest -q
 ## Cross-validation receipts
-The math agrees with the standard tools — by automated test:
+The math is checked against standard tools by automated test. The
+fast tier runs on every push (matrix CI); the slow tier needs heavy
+optional dependencies (R + quanteda, NLTK, rpy2, Stanford SNAP
+downloads) and runs on main pushes only.
-- **Rayson's LL Wizard** — hand-derived contingency-table reference triples
-- **NLTK** `BigramAssocMeasures` — PMI + t-score to ≤ 1e-12 on every adjacent bigram
-- **Scattertext (Kessler 2017)** — behavioural agreement on the 2012 US Conventions corpus
-- **quanteda (R)** via `rpy2` — byte-for-byte G² agreement (slow tier)
-- **HistWords (Hamilton et al. 2016)** — diachronic cosine displacements on COHA (slow tier)
+Fast tier:
+- **Rayson's LL Wizard** — hand-derived contingency-table reference
+  triples ([`tests/integration/test_crossval_rayson.py`](https://github.com/jturner-uofl/pycorpdiff/blob/main/tests/integration/test_crossval_rayson.py))
+Slow tier:
+- **NLTK** `BigramAssocMeasures` — PMI + t-score agreement to ≤ 1e-12
+  on every adjacent bigram
+- **Scattertext (Kessler 2017)** — behavioural agreement on the 2012
+  US Conventions corpus
+- **quanteda (R)** via `rpy2` — G² agreement to ≤ 1e-10 with
+  `formula="dunning"`
+- **HistWords (Hamilton et al. 2016)** — known-shifter / stable-word
+  sanity check on Stanford SNAP COHA decade embeddings (skips
+  gracefully if the archive isn't reachable)
 ## Citation

{pycorpdiff-0.1.0a5 → pycorpdiff-0.1.0a7}/README.md RENAMED

@@ -31,11 +31,11 @@ and computational social science routinely have:
 `pycorpdiff` is positioned as **orchestration**, not reinvention.
 Tokenizers (`spaCy`, `Stanza`, `jieba`, `fugashi`) and embedders (any
 `SBERT`-compatible model) plug in via two `typing.Protocol` extension
-points — one-line adapters, no plugin registry. The base install pulls
-only `numpy`, `pandas`, `scipy`, and `pyarrow`; everything else is opt-in
-via extras.
+points — one-line adapters, no plugin registry. The base install's
+direct runtime dependencies are `numpy`, `pandas`, `scipy`, and
+`pyarrow`; everything else is opt-in via extras.
-> **Status: alpha (0.1.0a5).** Public API is stable for the features
+> **Status: alpha (0.1.0a7).** Public API is stable for the features
 > described below; on PyPI as `pip install pycorpdiff`.
 ## The three-layer architecture
@@ -82,7 +82,8 @@ for the full feature tour, or the cheat sheet below for one-line API previews.
 ```python
 # Compare verbs (returns Result objects; methods exposed vary by Result)
-pcd.compare(a, b).keyness()
+pcd.compare(a, b).keyness()                                                   # default formula="rayson" (LL Wizard)
+pcd.compare(a, b).keyness(formula="dunning")                                  # full 4-cell G² (matches quanteda / NLTK)
 pcd.compare(a, b).collocation_shift("immigrant")
 pcd.compare(a, b).semantic_shift("immigrant", embedder=pcd.SBERTEmbedder())   # [semantic]
 # SBERTEmbedder downloads a sentence-transformers model on first call;
@@ -94,7 +95,7 @@ tr.changepoints()                                  # offline PELT
 tr.changepoints_online(hazard=1/24)                # Bayesian online (Adams & MacKay 2007)
 tr.interrupted_time_series(event_date="2016")      # segmented OLS
 tr.causal_impact(event_date="2016")                # Bayesian counterfactual (Brodersen 2015)
-tr.forecast(horizon=4)                             # state-space ETS
+tr.forecast(horizon=4)                             # 4 periods at the over_time freq (state-space ETS)
 # Before / after a known event
 pcd.compare.before_after(corpus, event_date="2016-06-23").keyness()
@@ -113,17 +114,20 @@ every analytical surface.
 ## Installation
 ```bash
-pip install pycorpdiff                # lexical-comparative core
-pip install "pycorpdiff[viz]"         # + altair / matplotlib / networkx
-pip install "pycorpdiff[semantic]"    # + sentence-transformers
-pip install "pycorpdiff[temporal]"    # + ruptures / statsmodels
-pip install "pycorpdiff[notebooks]"   # + jupyter / vl-convert / pysofra
-pip install "pycorpdiff[all]"         # everything
+pip install pycorpdiff                       # lexical-comparative core (MIT)
+pip install "pycorpdiff[viz]"                # + altair / matplotlib / networkx
+pip install "pycorpdiff[semantic]"           # + sentence-transformers
+pip install "pycorpdiff[temporal]"           # + ruptures / statsmodels
+pip install "pycorpdiff[notebooks]"          # + jupyter / vl-convert
+pip install "pycorpdiff[all]"                # everything MIT-compatible
+pip install "pycorpdiff[all,showcase]"       # + pysofra (GPL-3.0-or-later) for the JAMA-style showcase
 ```
-The base install keeps a small dependency footprint (`numpy`, `pandas`,
-`scipy`, `pyarrow`); optional extras land per analytical layer so you
-only pay for what you use.
+The base install's direct runtime dependencies are `numpy`, `pandas`,
+`scipy`, and `pyarrow`; optional extras land per analytical layer so
+you only pay for what you use. `[showcase]` is broken out separately
+because `pysofra` is GPL-3.0-or-later — pure `pycorpdiff` use without
+that extra remains MIT-only.
 To work from source:
@@ -136,13 +140,27 @@ pytest -q
 ## Cross-validation receipts
-The math agrees with the standard tools — by automated test:
+The math is checked against standard tools by automated test. The
+fast tier runs on every push (matrix CI); the slow tier needs heavy
+optional dependencies (R + quanteda, NLTK, rpy2, Stanford SNAP
+downloads) and runs on main pushes only.
-- **Rayson's LL Wizard** — hand-derived contingency-table reference triples
-- **NLTK** `BigramAssocMeasures` — PMI + t-score to ≤ 1e-12 on every adjacent bigram
-- **Scattertext (Kessler 2017)** — behavioural agreement on the 2012 US Conventions corpus
-- **quanteda (R)** via `rpy2` — byte-for-byte G² agreement (slow tier)
-- **HistWords (Hamilton et al. 2016)** — diachronic cosine displacements on COHA (slow tier)
+Fast tier:
+- **Rayson's LL Wizard** — hand-derived contingency-table reference
+  triples ([`tests/integration/test_crossval_rayson.py`](https://github.com/jturner-uofl/pycorpdiff/blob/main/tests/integration/test_crossval_rayson.py))
+Slow tier:
+- **NLTK** `BigramAssocMeasures` — PMI + t-score agreement to ≤ 1e-12
+  on every adjacent bigram
+- **Scattertext (Kessler 2017)** — behavioural agreement on the 2012
+  US Conventions corpus
+- **quanteda (R)** via `rpy2` — G² agreement to ≤ 1e-10 with
+  `formula="dunning"`
+- **HistWords (Hamilton et al. 2016)** — known-shifter / stable-word
+  sanity check on Stanford SNAP COHA decade embeddings (skips
+  gracefully if the archive isn't reachable)
 ## Citation

{pycorpdiff-0.1.0a5 → pycorpdiff-0.1.0a7}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "pycorpdiff"
-version = "0.1.0a5"
+version = "0.1.0a7"
 description = "Comparative corpus analysis for Python: keyness, collocations, semantic shift, temporal trajectories with changepoints + causal inference."
 readme = "README.md"
 license = { file = "LICENSE" }
@@ -62,13 +62,18 @@ nlp = ["spacy>=3.7"]
 # Public-text-corpus hub. Heavy (pulls pyarrow, fsspec, requests, aiohttp),
 # so opt-in only — base install stays small.
 huggingface = ["datasets>=2.14"]
-# Needed if you want to execute the showcase notebook or regenerate the
-# rendered HTML examples. `jupyter` runs the notebook, `vl-convert` does
-# static SVG/PNG export of altair charts, `pysofra` renders the showcase's
-# result tables in JAMA-style typography.
-notebooks = ["jupyter>=1.0", "vl-convert-python>=1.5", "pysofra>=0.1.0a3"]
-# Meta-extra: `pycorpdiff[all]` pulls in every optional code path
-# including the notebook runtime.
+# Needed if you want to execute the example notebooks. `jupyter` runs
+# the notebook; `vl-convert` does static SVG/PNG export of altair charts.
+# Kept MIT-clean — see `showcase` below for the JAMA-style table polish.
+notebooks = ["jupyter>=1.0", "vl-convert-python>=1.5"]
+# Adds `pysofra` for the showcase notebook's JAMA-style typography.
+# IMPORTANT: pysofra is GPL-3.0-or-later. Opting in to `[showcase]` (or
+# installing pysofra directly) brings GPL into your environment; pure
+# pycorpdiff use without this extra remains MIT-only.
+showcase = ["pysofra>=0.1.0a3"]
+# Meta-extra: every MIT-compatible optional code path. Does NOT include
+# `[showcase]` because pysofra is GPL-3.0-or-later; install
+# `pycorpdiff[all,showcase]` explicitly if you accept that licence.
 all = [
     "altair>=5",
     "matplotlib>=3.8",
@@ -84,7 +89,6 @@ all = [
     "spacy>=3.7",
     "jupyter>=1.0",
     "vl-convert-python>=1.5",
-    "pysofra>=0.1.0a3",
 ]
 dev = [
     "pytest>=8",
@@ -176,6 +180,8 @@ disallow_any_generics = true
 module = [
     "altair",
     "altair.*",
+    "datasets",
+    "datasets.*",
     "duckdb",
     "duckdb.*",
     "matplotlib",

{pycorpdiff-0.1.0a5 → pycorpdiff-0.1.0a7}/src/pycorpdiff/__init__.py RENAMED

@@ -6,20 +6,21 @@ result objects (:class:`KeynessResult`, :class:`CollocationShiftResult`,
 :class:`SemanticShiftResult`, :class:`TemporalTrajectory`,
 :class:`NetworkResult`, :class:`ForecastResult`,
 :class:`CausalImpactResult`, :class:`BocpdResult`,
-:class:`ConcordanceResult`), each implementing the same
-``.to_df / .plot / .explain / .summary / .to_html / .to_json`` contract.
+:class:`ConcordanceResult`), each implementing the relevant subset of
+the ``.to_df / .plot / .explain / .summary / .to_html / .to_json``
+contract. See ``docs/design.md`` for the per-Result method matrix.
 Example
 -------
 >>> import pycorpdiff as pcd
->>> pcd.__version__
-'0.1.0a5'
+>>> isinstance(pcd.__version__, str)
+True
 """
 from __future__ import annotations
-__version__ = "0.1.0a5"
+__version__ = "0.1.0a7"
 from .collocation.network import NetworkResult, cooccurrence_network
 from .compare import Comparison, compare

pycorpdiff-0.1.0a7/src/pycorpdiff/_backends/pandas.py ADDED

@@ -0,0 +1,9 @@
+"""Pandas-backed internals for :class:`pycorpdiff.Corpus`.
+Corpus operations route through this module so backend-specific code
+stays out of the public API. The pandas backend is the default and is
+exercised on every install; polars is opt-in via the ``polars`` extra
+and lives in the sibling :mod:`pycorpdiff._backends.polars`.
+"""
+from __future__ import annotations

{pycorpdiff-0.1.0a5 → pycorpdiff-0.1.0a7}/src/pycorpdiff/compare.py RENAMED

@@ -10,6 +10,7 @@ from dataclasses import dataclass
 from typing import TYPE_CHECKING, Literal
 from .corpus import Corpus, CorpusSlice
+from .keyness.loglikelihood import LLFormula
 if TYPE_CHECKING:
     from .results import (
@@ -46,6 +47,7 @@ class Comparison:
     def keyness(
         self,
         method: KeynessMethod = "log_likelihood",
+        formula: LLFormula = "rayson",
         effect_size: bool = True,
         dispersion: bool = False,
         min_count: int = 5,
@@ -64,6 +66,14 @@ class Comparison:
             sorts by signed Pearson χ². The other modes
             (``"log_ratio"``, ``"bayes_factor"``, ``"percent_diff"``)
             require ``effect_size=True`` and sort by that column.
+        formula
+            Which log-likelihood formulation to use for the G² column.
+            ``"rayson"`` (default) is the 2-cell shortcut matching
+            Rayson's UCREL LL Wizard; ``"dunning"`` is the full 4-cell
+            G² matching NLTK's ``BigramAssocMeasures`` and R's
+            ``quanteda::textstat_keyness(measure="lr")``. See
+            ``docs/statistical-methods.md`` for the math + when they
+            diverge.
         effect_size
             If True (default), also compute LogRatio (Hardie),
             %DIFF (Gabrielatos), and the BIC-approximated Bayes factor.
@@ -131,7 +141,7 @@ class Comparison:
         # G² is always computed (cheap, the default sort column). χ² is
         # computed only when requested — same shape, asymptotically
         # equivalent, no need to pay for both by default.
-        table = log_likelihood(a_kept, b_kept, n_a, n_b)
+        table = log_likelihood(a_kept, b_kept, n_a, n_b, formula=formula)
         if method == "chi_squared":
             chi_table = _chi_squared(a_kept, b_kept, n_a, n_b)
             table["chi_squared"] = chi_table["chi_squared"]
@@ -139,7 +149,9 @@ class Comparison:
         if effect_size:
             table["log_ratio"] = _log_ratio(a_kept, b_kept, n_a, n_b)
             table["percent_diff"] = _percent_diff(a_kept, b_kept, n_a, n_b)
-            table["bayes_factor"] = _bayes_factor(a_kept, b_kept, n_a, n_b)
+            table["bayes_factor"] = _bayes_factor(
+                a_kept, b_kept, n_a, n_b, formula=formula
+            )
         if dispersion:
             kept_terms = table.index
@@ -192,6 +204,7 @@ class Comparison:
             label_a=_corpus_label(self.a),
             label_b=_corpus_label(self.b),
             params={
+                "formula": formula,
                 "effect_size": effect_size,
                 "dispersion": dispersion,
                 "min_count": min_count,

{pycorpdiff-0.1.0a5 → pycorpdiff-0.1.0a7}/src/pycorpdiff/corpus.py RENAMED

@@ -242,6 +242,15 @@ class Corpus:
         """
         from .temporal.slicing import TemporalCorpus  # local import to break cycle
+        if len(self.docs) == 0:
+            raise ValueError(
+                "by_time() requires a non-empty corpus; got 0 documents."
+            )
+        if col not in self.docs.columns:
+            raise ValueError(
+                f"by_time(col={col!r}, ...): column not found in corpus. "
+                f"Available columns: {list(self.docs.columns)!r}."
+            )
         return TemporalCorpus(parent=self, time_col=col, freq=freq)
     def with_tokenizer(self, tokenizer: Tokenizer) -> Corpus:

{pycorpdiff-0.1.0a5 → pycorpdiff-0.1.0a7}/src/pycorpdiff/io/duckdb.py RENAMED

@@ -71,12 +71,24 @@ def read_duckdb(
     ... )
     """
     try:
-        import duckdb  # noqa: F401
+        import duckdb
     except ImportError as exc:  # pragma: no cover
         raise ImportError(
             "read_duckdb requires duckdb. Install with: pip install 'pycorpdiff[duckdb]'"
         ) from exc
+    if isinstance(connection, str):
+        raise TypeError(
+            "read_duckdb expects a DuckDB connection, not a file path. "
+            f"Got connection={connection!r}. Open one first: "
+            f'duckdb.connect({connection!r}), or pcd.read_duckdb(duckdb.connect(), "...")'
+        )
+    if not isinstance(connection, duckdb.DuckDBPyConnection):
+        raise TypeError(
+            "read_duckdb expects a duckdb.DuckDBPyConnection; got "
+            f"{type(connection).__name__}. Open one via duckdb.connect(...)."
+        )
     cursor = connection.execute(query, params) if params is not None else connection.execute(query)
     df = cursor.df()
     if text_col not in df.columns:

{pycorpdiff-0.1.0a5 → pycorpdiff-0.1.0a7}/src/pycorpdiff/io/huggingface.py RENAMED

@@ -95,7 +95,7 @@ def from_huggingface(
     loader = _loader
     if loader is None:
         try:
-            from datasets import load_dataset as _hf_load  # type: ignore[import-not-found]
+            from datasets import load_dataset as _hf_load
         except ImportError as exc:  # pragma: no cover
             raise ImportError(
                 "from_huggingface requires the `datasets` library. "

{pycorpdiff-0.1.0a5 → pycorpdiff-0.1.0a7}/src/pycorpdiff/keyness/bayes.py RENAMED

@@ -15,7 +15,7 @@ from __future__ import annotations
 import numpy as np
 import pandas as pd
-from .loglikelihood import log_likelihood
+from .loglikelihood import LLFormula, log_likelihood
 def bayes_factor(
@@ -23,6 +23,8 @@ def bayes_factor(
     counts_b: pd.Series,
     total_a: int,
     total_b: int,
+    *,
+    formula: LLFormula = "rayson",
 ) -> pd.Series:
     """BIC-approximated Bayes factor for each term's frequency difference.
@@ -31,6 +33,12 @@ def bayes_factor(
     the unsigned log-likelihood. The Bayes factor is then
     ``exp(BIC / 2)``. Wilson (2013) is the keyness application.
+    ``formula`` selects which G² flavour feeds the BF: ``"rayson"`` (the
+    2-cell shortcut, default; matches the LL Wizard) or ``"dunning"``
+    (the full 4-cell G²; matches quanteda/NLTK). Use the same
+    ``formula=`` as the ``keyness()`` call that produced the row so the
+    G² and the Bayes factor in a single row describe the same statistic.
     Interpret with Kass & Raftery (1995):
     - ``BF > 2``  : positive evidence
@@ -43,7 +51,7 @@ def bayes_factor(
     plots / sorts handle it.
     """
     terms = counts_a.index.union(counts_b.index)
-    ll_table = log_likelihood(counts_a, counts_b, total_a, total_b)
+    ll_table = log_likelihood(counts_a, counts_b, total_a, total_b, formula=formula)
     g2_abs = ll_table["g2"].abs()
     bic = g2_abs - np.log(total_a + total_b)
     with np.errstate(over="ignore"):
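
Reviewer note: the hunk above shows the Bayes factor being derived from whichever G² the `formula=` argument selects, as `bic = |G²| - ln(total_a + total_b)` followed by `exp(bic / 2)`. A scalar sketch of that arithmetic follows, using only the standard library. The function name is hypothetical, and mapping overflow to `inf` is an assumption chosen to mirror what `np.exp` yields under `np.errstate(over="ignore")`; the package itself operates on pandas Series.

```python
import math


def bic_bayes_factor(g2: float, total_a: int, total_b: int) -> float:
    """BIC-approximated Bayes factor from a (possibly signed) G² value."""
    # The sign of G² only encodes direction, so the magnitude is used.
    bic = abs(g2) - math.log(total_a + total_b)
    try:
        return math.exp(bic / 2.0)
    except OverflowError:
        # Assumption: treat overflow as infinite evidence, as numpy would.
        return math.inf


# Hypothetical inputs: two corpora of 10,000 tokens each.
for g2 in (5.0, 15.0, -15.0, 40.0):
    print(g2, bic_bayes_factor(g2, 10_000, 10_000))
```

Because the penalty term depends on the combined corpus size, the same G² yields a smaller Bayes factor in larger corpora, and a G² equal to `ln(total_a + total_b)` gives a factor of exactly 1. Feeding a 2-cell G² and a 4-cell G² into this formula produces different factors, which is why the diff threads `formula=` through to `_bayes_factor`.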
