pycorpdiff 0.1.0a2.tar.gz → 0.1.0a4.tar.gz (PyPI)


Files changed (128)

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/.gitignore RENAMED

@@ -30,12 +30,12 @@ Thumbs.db
 *.swo
 *~
-# AI workflow artefacts (kept local, never published)
-.claude/
 # Hypothesis example database (auto-managed)
 .hypothesis/
+# Local tooling
+.claude/
 # Jupyter checkpoints
 .ipynb_checkpoints/
@@ -56,5 +56,5 @@ examples/*.patched.ipynb
 # Stray uv lockfiles created outside the repo root
 **/uv.lock.tmp
-# Mkdocs build output (legacy; mkdocs.yml itself is gone)
+# Static site build output
 site/

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/CHANGELOG.md RENAMED

@@ -4,13 +4,13 @@ All notable changes to `pycorpdiff` are documented in this file. The format
 follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this
 project adheres to [Semantic Versioning](https://semver.org/).
-## [0.1.0a2] — initial release
+## [0.1.0a4] — initial release
 The initial public release of `pycorpdiff` — comparative corpus analysis
 for modern Python workflows. Three public verbs (`compare`, `track`,
-`compare.before_after`), nine `Result` dataclasses with a uniform
-six-method contract (`.to_df / .plot / .explain / .summary / .to_html /
-.to_json`), two `typing.Protocol` extension points (`Tokenizer`,
+`compare.before_after`), nine `Result` dataclasses each implementing the
+relevant subset of `.to_df / .plot / .explain / .summary / .to_html /
+.to_json`, two `typing.Protocol` extension points (`Tokenizer`,
 `Embedder`), and opt-in extras for visualisation, semantic embedding,
 temporal modelling, polars interop, DuckDB ingestion, and 🤗 Datasets.
@@ -33,12 +33,12 @@ temporal modelling, polars interop, DuckDB ingestion, and 🤗 Datasets.
 ### Cross-validated
-Numerically agrees with Rayson's LL Wizard (15 reference triples),
-NLTK's `BigramAssocMeasures` (≤ 1e-12 on PMI / t-score / MI³),
+Numerically agrees with Rayson's LL Wizard on hand-derived reference
+triples, NLTK's `BigramAssocMeasures` (≤ 1e-12 on PMI / t-score / MI³),
 Scattertext on the 2012 US conventions, `quanteda` via `rpy2`, and
 the HistWords COHA replication.
 ### Infrastructure
-519 tests, `ruff` + `mypy --strict` clean across 55 source files,
-matrix CI on three Python versions × two operating systems.
+Hundreds of tests, `ruff` + `mypy --strict` clean across the source
+tree, matrix CI on three Python versions × two operating systems.
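
The cross-validation entry above names PMI, t-score, and MI³ as the measures checked against NLTK. As a rough sketch of what those scores compute for a single bigram, the block below uses the textbook corpus-linguistics definitions plus logDice. The function name `association_measures` and its argument names are illustrative assumptions, not pycorpdiff's API, and neither the package nor NLTK is guaranteed to use exactly these forms (NLTK's `mi_like`, for instance, is not log-scaled).

```python
import math


def association_measures(f_xy: int, f_x: int, f_y: int, n: int) -> dict[str, float]:
    """Score one bigram from its joint count, two marginal counts, and corpus size.

    Hypothetical helper for illustration only; not part of pycorpdiff.
    """
    if min(f_xy, f_x, f_y, n) <= 0:
        raise ValueError("all counts must be positive")
    # Co-occurrences expected if the two words were independent.
    expected = f_x * f_y / n
    return {
        "pmi": math.log2(f_xy / expected),
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        "mi3": math.log2(f_xy**3 / expected),
        # logDice (Rychly 2008) ignores corpus size, so it compares across corpora.
        "logdice": 14 + math.log2(2 * f_xy / (f_x + f_y)),
    }


scores = association_measures(f_xy=8, f_x=16, f_y=32, n=1024)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

With these counts the expected co-occurrence is 0.5, so PMI is log2(16) and MI³ is log2(1024). Cubing the observed count is what lets MI³ damp PMI's bias toward rare pairs.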

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/CITATION.cff RENAMED

@@ -4,8 +4,8 @@ message: >
   entry. GitHub renders a "Cite this repository" widget directly from
   this file.
 title: "pycorpdiff: Comparative Corpus Analysis for Modern Python Workflows"
-version: 0.1.0a2
-date-released: 2026-05-22
+version: 0.1.0a4
+date-released: 2026-05-25
 authors:
   - family-names: Turner
     given-names: Jason

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pycorpdiff
-Version: 0.1.0a2
+Version: 0.1.0a4
 Summary: Comparative corpus analysis for Python: keyness, collocations, semantic shift, temporal trajectories with changepoints + causal inference.
 Project-URL: Homepage, https://github.com/jturner-uofl/pycorpdiff
 Project-URL: Documentation, https://github.com/jturner-uofl/pycorpdiff
@@ -49,11 +49,12 @@ Provides-Extra: all
 Requires-Dist: altair>=5; extra == 'all'
 Requires-Dist: datasets>=2.14; extra == 'all'
 Requires-Dist: duckdb>=0.10; extra == 'all'
+Requires-Dist: jupyter>=1.0; extra == 'all'
 Requires-Dist: matplotlib>=3.8; extra == 'all'
 Requires-Dist: networkx>=3.1; extra == 'all'
 Requires-Dist: polars>=1.0; extra == 'all'
 Requires-Dist: pyarrow>=15; extra == 'all'
-Requires-Dist: pysofra>=0.1.0a2; extra == 'all'
+Requires-Dist: pysofra>=0.1.0a3; extra == 'all'
 Requires-Dist: ruptures>=1.1; extra == 'all'
 Requires-Dist: scikit-learn>=1.3; extra == 'all'
 Requires-Dist: sentence-transformers>=2.2; extra == 'all'
@@ -76,7 +77,7 @@ Provides-Extra: nlp
 Requires-Dist: spacy>=3.7; extra == 'nlp'
 Provides-Extra: notebooks
 Requires-Dist: jupyter>=1.0; extra == 'notebooks'
-Requires-Dist: pysofra>=0.1.0a2; extra == 'notebooks'
+Requires-Dist: pysofra>=0.1.0a3; extra == 'notebooks'
 Requires-Dist: vl-convert-python>=1.5; extra == 'notebooks'
 Provides-Extra: polars
 Requires-Dist: polars>=1.0; extra == 'polars'
@@ -110,9 +111,9 @@ platform, and the fragmented Python NLP stack
 consolidate keyness, collocations, dispersion, temporal trajectories,
 changepoint detection, interrupted time series, causal-impact analysis,
 forecasting, online changepoint detection, and embedding-based semantic
-shift under a single notebook-native API. Every result carries its own
-KWIC evidence: `.explain(term)` returns the source-text concordances
-behind any ranked term.
+shift under a single notebook-native API. Keyness and collocation
+results carry their own KWIC evidence: `.explain(term)` returns the
+source-text concordances behind any ranked term.
 The package answers the questions corpus linguistics, digital humanities,
 and computational social science routinely have:
@@ -130,7 +131,7 @@ points — one-line adapters, no plugin registry. The base install pulls
 only `numpy`, `pandas`, `scipy`, and `pyarrow`; everything else is opt-in
 via extras.
-> **Status: alpha (0.1.0a2).** Public API is stable for the features
+> **Status: alpha (0.1.0a4).** Public API is stable for the features
 > described below; on PyPI as `pip install pycorpdiff`.
 ## The three-layer architecture
@@ -139,61 +140,71 @@ via extras.
 |---|---|---|
 | **1 — Ingestion + `Corpus`** | get text in, slice it, hash it | `from_dataframe`, `read_csv`, `read_parquet`, `read_txt`, `read_duckdb`, `from_huggingface`, `fetch_hansard`, `Corpus.slice/by_time/__hash__/doc_term_counts(_sparse)/to_polars` |
 | **2 — Pure math** | statistics with no I/O | `keyness.{log_likelihood,chi_squared,log_ratio,percent_diff,bayes_factor,permutation_pvalues,keyness_multi,juilland_d,benjamini_hochberg}`; `collocation.{logdice,pmi,t_score,mi_three,collocation_shift,cooccurrence_network}`; `semantic.{HashEmbedder,SBERTEmbedder,semantic_trajectory,neighborhood_drift}`; `temporal.{changepoints,interrupted_time_series,forecast,causal_impact,bocpd}` |
-| **3 — Verbs + Results** | public API | `compare`, `track`, `compare.before_after`, `keyness_multi`, plus 9 frozen-dataclass Result types each with `.to_df() / .plot() / .explain() / .summary() / .to_html() / .to_json()` |
+| **3 — Verbs + Results** | public API | `compare`, `track`, `compare.before_after`, `keyness_multi`, plus 9 frozen-dataclass Result types each implementing the relevant subset of `.to_df() / .plot() / .explain() / .summary() / .to_html() / .to_json()` |
 ## Quick start
 ```bash
-pip install "pycorpdiff[viz,temporal]"
+pip install "pycorpdiff[viz]"
 ```
 ```python
 import pycorpdiff as pcd
-# Bundled synthetic UK-Hansard corpus — runs offline, no data needed.
+# Bundled synthetic Hansard-style sample — runs offline, no data download.
 corpus = pcd.load_hansard_sample()
 immigration = corpus.slice(topic="immigration")
-human = immigration.slice(frame="humanising")
-criminal = immigration.slice(frame="criminalising")
-# Compare — three verbs
-k = pcd.compare(human, criminal).keyness()
-c = pcd.compare(human, criminal).collocation_shift("immigrant")
-# s = pcd.compare(human, criminal).semantic_shift("immigrant", embedder=pcd.SBERTEmbedder())
-#   ↑ requires `pip install "pycorpdiff[semantic]"`
-# Track over time
-tr = pcd.track(immigration, "criminal").over_time(freq="Y")
-tr.changepoints()                                # offline PELT
-tr.changepoints_online(hazard=1/24)              # Bayesian online (Adams & MacKay 2007)
-tr.interrupted_time_series(event_date="2016")    # segmented OLS
-tr.causal_impact(event_date="2016")              # Bayesian counterfactual (Brodersen 2015)
-tr.forecast(horizon=4)                           # state-space ETS
+# Which words separate the humanising and criminalising frames?
+keyness = pcd.compare(
+    immigration.slice(frame="humanising"),
+    immigration.slice(frame="criminalising"),
+).keyness(min_count=3)
+keyness.plot()                # volcano plot — picture the result
+# keyness.table.head(10)      # or look at the ranked table directly
+# keyness.explain("criminal") # KWIC concordances showing the textual evidence
+```
+That's the entire surface in five lines: load a corpus, slice it,
+compare two slices, plot the result. Every other analytical method —
+collocation shifts, semantic drift, temporal trajectories, changepoint
+detection, causal-impact analysis, forecasting, co-occurrence networks,
+N-way keyness — follows the same shape. See
+[the showcase notebook](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)
+for the full feature tour, or the cheat sheet below for one-line API previews.
+### Cheat sheet — every analytical surface in one block
+```python
+# Compare verbs (returns Result objects; methods exposed vary by Result)
+pcd.compare(a, b).keyness()
+pcd.compare(a, b).collocation_shift("immigrant")
+pcd.compare(a, b).semantic_shift("immigrant", embedder=pcd.SBERTEmbedder())   # [semantic]
+# SBERTEmbedder downloads a sentence-transformers model on first call;
+# use pcd.HashEmbedder() for offline / deterministic-test settings.
+# Track over time (requires [temporal] for the changepoint + ITS + forecast + causal_impact methods)
+tr = pcd.track(corpus, "immigrant").over_time(freq="Y")
+tr.changepoints()                                  # offline PELT
+tr.changepoints_online(hazard=1/24)                # Bayesian online (Adams & MacKay 2007)
+tr.interrupted_time_series(event_date="2016")      # segmented OLS
+tr.causal_impact(event_date="2016")                # Bayesian counterfactual (Brodersen 2015)
+tr.forecast(horizon=4)                             # state-space ETS
 # Before / after a known event
 pcd.compare.before_after(corpus, event_date="2016-06-23").keyness()
-# N-way (≥ 2 corpora) — one keyness across all four parties
-parties = ["Conservative", "Labour", "Liberal Democrat", "SNP"]
-nhs = corpus.slice(topic="nhs")
-pcd.keyness_multi([nhs.slice(party=p) for p in parties], labels=parties)
+# N-way (≥ 2 corpora)
+pcd.keyness_multi([a, b, c, d], labels=["A", "B", "C", "D"])
 # The discourse as a graph
-pcd.cooccurrence_network(immigration, top_n=30).plot()
-# Every Result: .to_df() · .plot() · .explain() · .summary() · .to_html() · .to_json()
+pcd.cooccurrence_network(corpus, top_n=30).plot()
 ```
-Every line of the snippet above is verified end-to-end against
-`pip install "pycorpdiff[viz,temporal]"` — no data download required.
-Replace `load_hansard_sample()` with `pcd.from_dataframe(your_df, ...)`,
-`pcd.read_parquet(...)`, `pcd.fetch_hansard(...)`, or
-`pcd.from_huggingface(...)` to use your own corpus.
-See [`examples/pycorpdiff_showcase.ipynb`](examples/pycorpdiff_showcase.ipynb)
-([rendered HTML](docs/rendered/pycorpdiff_showcase.html)) for a
-walkthrough on a synthetic UK Hansard corpus exercising every analytical
-surface.
+See [`examples/pycorpdiff_showcase.ipynb`](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)
+for a walkthrough on the synthetic Hansard-style corpus exercising
+every analytical surface.
 ## Installation
@@ -223,7 +234,7 @@ pytest -q
 The math agrees with the standard tools — by automated test:
-- **Rayson's LL Wizard** — 15 hand-derived contingency-table reference triples
+- **Rayson's LL Wizard** — hand-derived contingency-table reference triples
 - **NLTK** `BigramAssocMeasures` — PMI + t-score to ≤ 1e-12 on every adjacent bigram
 - **Scattertext (Kessler 2017)** — behavioural agreement on the 2012 US Conventions corpus
 - **quanteda (R)** via `rpy2` — byte-for-byte G² agreement (slow tier)
@@ -237,11 +248,11 @@ repository" widget directly from it.
 ## License
-MIT — see [LICENSE](LICENSE).
+MIT — see [LICENSE](https://github.com/jturner-uofl/pycorpdiff/blob/main/LICENSE).
 ## Further reading
-- [`docs/design.md`](docs/design.md) — three-layer architecture
-- [`docs/statistical-methods.md`](docs/statistical-methods.md) — every metric's formula + citation
-- [`examples/pycorpdiff_showcase.ipynb`](examples/pycorpdiff_showcase.ipynb) — full feature tour as a notebook
-- [`docs/rendered/`](docs/rendered/) — self-contained HTML renders of the example notebooks
+- [`docs/design.md`](https://github.com/jturner-uofl/pycorpdiff/blob/main/docs/design.md) — three-layer architecture
+- [`docs/statistical-methods.md`](https://github.com/jturner-uofl/pycorpdiff/blob/main/docs/statistical-methods.md) — every metric's formula + citation
+- [`examples/pycorpdiff_showcase.ipynb`](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb) — full feature tour as a notebook
+- [`docs/rendered/`](https://github.com/jturner-uofl/pycorpdiff/tree/main/docs/rendered) — static HTML renders for offline viewing

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/README.md RENAMED

@@ -15,9 +15,9 @@ platform, and the fragmented Python NLP stack
 consolidate keyness, collocations, dispersion, temporal trajectories,
 changepoint detection, interrupted time series, causal-impact analysis,
 forecasting, online changepoint detection, and embedding-based semantic
-shift under a single notebook-native API. Every result carries its own
-KWIC evidence: `.explain(term)` returns the source-text concordances
-behind any ranked term.
+shift under a single notebook-native API. Keyness and collocation
+results carry their own KWIC evidence: `.explain(term)` returns the
+source-text concordances behind any ranked term.
 The package answers the questions corpus linguistics, digital humanities,
 and computational social science routinely have:
@@ -35,7 +35,7 @@ points — one-line adapters, no plugin registry. The base install pulls
 only `numpy`, `pandas`, `scipy`, and `pyarrow`; everything else is opt-in
 via extras.
-> **Status: alpha (0.1.0a2).** Public API is stable for the features
+> **Status: alpha (0.1.0a4).** Public API is stable for the features
 > described below; on PyPI as `pip install pycorpdiff`.
 ## The three-layer architecture
@@ -44,61 +44,71 @@ via extras.
 |---|---|---|
 | **1 — Ingestion + `Corpus`** | get text in, slice it, hash it | `from_dataframe`, `read_csv`, `read_parquet`, `read_txt`, `read_duckdb`, `from_huggingface`, `fetch_hansard`, `Corpus.slice/by_time/__hash__/doc_term_counts(_sparse)/to_polars` |
 | **2 — Pure math** | statistics with no I/O | `keyness.{log_likelihood,chi_squared,log_ratio,percent_diff,bayes_factor,permutation_pvalues,keyness_multi,juilland_d,benjamini_hochberg}`; `collocation.{logdice,pmi,t_score,mi_three,collocation_shift,cooccurrence_network}`; `semantic.{HashEmbedder,SBERTEmbedder,semantic_trajectory,neighborhood_drift}`; `temporal.{changepoints,interrupted_time_series,forecast,causal_impact,bocpd}` |
-| **3 — Verbs + Results** | public API | `compare`, `track`, `compare.before_after`, `keyness_multi`, plus 9 frozen-dataclass Result types each with `.to_df() / .plot() / .explain() / .summary() / .to_html() / .to_json()` |
+| **3 — Verbs + Results** | public API | `compare`, `track`, `compare.before_after`, `keyness_multi`, plus 9 frozen-dataclass Result types each implementing the relevant subset of `.to_df() / .plot() / .explain() / .summary() / .to_html() / .to_json()` |
 ## Quick start
 ```bash
-pip install "pycorpdiff[viz,temporal]"
+pip install "pycorpdiff[viz]"
 ```
 ```python
 import pycorpdiff as pcd
-# Bundled synthetic UK-Hansard corpus — runs offline, no data needed.
+# Bundled synthetic Hansard-style sample — runs offline, no data download.
 corpus = pcd.load_hansard_sample()
 immigration = corpus.slice(topic="immigration")
-human = immigration.slice(frame="humanising")
-criminal = immigration.slice(frame="criminalising")
-# Compare — three verbs
-k = pcd.compare(human, criminal).keyness()
-c = pcd.compare(human, criminal).collocation_shift("immigrant")
-# s = pcd.compare(human, criminal).semantic_shift("immigrant", embedder=pcd.SBERTEmbedder())
-#   ↑ requires `pip install "pycorpdiff[semantic]"`
-# Track over time
-tr = pcd.track(immigration, "criminal").over_time(freq="Y")
-tr.changepoints()                                # offline PELT
-tr.changepoints_online(hazard=1/24)              # Bayesian online (Adams & MacKay 2007)
-tr.interrupted_time_series(event_date="2016")    # segmented OLS
-tr.causal_impact(event_date="2016")              # Bayesian counterfactual (Brodersen 2015)
-tr.forecast(horizon=4)                           # state-space ETS
+# Which words separate the humanising and criminalising frames?
+keyness = pcd.compare(
+    immigration.slice(frame="humanising"),
+    immigration.slice(frame="criminalising"),
+).keyness(min_count=3)
+keyness.plot()                # volcano plot — picture the result
+# keyness.table.head(10)      # or look at the ranked table directly
+# keyness.explain("criminal") # KWIC concordances showing the textual evidence
+```
+That's the entire surface in five lines: load a corpus, slice it,
+compare two slices, plot the result. Every other analytical method —
+collocation shifts, semantic drift, temporal trajectories, changepoint
+detection, causal-impact analysis, forecasting, co-occurrence networks,
+N-way keyness — follows the same shape. See
+[the showcase notebook](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)
+for the full feature tour, or the cheat sheet below for one-line API previews.
+### Cheat sheet — every analytical surface in one block
+```python
+# Compare verbs (returns Result objects; methods exposed vary by Result)
+pcd.compare(a, b).keyness()
+pcd.compare(a, b).collocation_shift("immigrant")
+pcd.compare(a, b).semantic_shift("immigrant", embedder=pcd.SBERTEmbedder())   # [semantic]
+# SBERTEmbedder downloads a sentence-transformers model on first call;
+# use pcd.HashEmbedder() for offline / deterministic-test settings.
+# Track over time (requires [temporal] for the changepoint + ITS + forecast + causal_impact methods)
+tr = pcd.track(corpus, "immigrant").over_time(freq="Y")
+tr.changepoints()                                  # offline PELT
+tr.changepoints_online(hazard=1/24)                # Bayesian online (Adams & MacKay 2007)
+tr.interrupted_time_series(event_date="2016")      # segmented OLS
+tr.causal_impact(event_date="2016")                # Bayesian counterfactual (Brodersen 2015)
+tr.forecast(horizon=4)                             # state-space ETS
 # Before / after a known event
 pcd.compare.before_after(corpus, event_date="2016-06-23").keyness()
-# N-way (≥ 2 corpora) — one keyness across all four parties
-parties = ["Conservative", "Labour", "Liberal Democrat", "SNP"]
-nhs = corpus.slice(topic="nhs")
-pcd.keyness_multi([nhs.slice(party=p) for p in parties], labels=parties)
+# N-way (≥ 2 corpora)
+pcd.keyness_multi([a, b, c, d], labels=["A", "B", "C", "D"])
 # The discourse as a graph
-pcd.cooccurrence_network(immigration, top_n=30).plot()
-# Every Result: .to_df() · .plot() · .explain() · .summary() · .to_html() · .to_json()
+pcd.cooccurrence_network(corpus, top_n=30).plot()
 ```
-Every line of the snippet above is verified end-to-end against
-`pip install "pycorpdiff[viz,temporal]"` — no data download required.
-Replace `load_hansard_sample()` with `pcd.from_dataframe(your_df, ...)`,
-`pcd.read_parquet(...)`, `pcd.fetch_hansard(...)`, or
-`pcd.from_huggingface(...)` to use your own corpus.
-See [`examples/pycorpdiff_showcase.ipynb`](examples/pycorpdiff_showcase.ipynb)
-([rendered HTML](docs/rendered/pycorpdiff_showcase.html)) for a
-walkthrough on a synthetic UK Hansard corpus exercising every analytical
-surface.
+See [`examples/pycorpdiff_showcase.ipynb`](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)
+for a walkthrough on the synthetic Hansard-style corpus exercising
+every analytical surface.
 ## Installation
@@ -128,7 +138,7 @@ pytest -q
 The math agrees with the standard tools — by automated test:
-- **Rayson's LL Wizard** — 15 hand-derived contingency-table reference triples
+- **Rayson's LL Wizard** — hand-derived contingency-table reference triples
 - **NLTK** `BigramAssocMeasures` — PMI + t-score to ≤ 1e-12 on every adjacent bigram
 - **Scattertext (Kessler 2017)** — behavioural agreement on the 2012 US Conventions corpus
 - **quanteda (R)** via `rpy2` — byte-for-byte G² agreement (slow tier)
@@ -142,11 +152,11 @@ repository" widget directly from it.
 ## License
-MIT — see [LICENSE](LICENSE).
+MIT — see [LICENSE](https://github.com/jturner-uofl/pycorpdiff/blob/main/LICENSE).
 ## Further reading
-- [`docs/design.md`](docs/design.md) — three-layer architecture
-- [`docs/statistical-methods.md`](docs/statistical-methods.md) — every metric's formula + citation
-- [`examples/pycorpdiff_showcase.ipynb`](examples/pycorpdiff_showcase.ipynb) — full feature tour as a notebook
-- [`docs/rendered/`](docs/rendered/) — self-contained HTML renders of the example notebooks
+- [`docs/design.md`](https://github.com/jturner-uofl/pycorpdiff/blob/main/docs/design.md) — three-layer architecture
+- [`docs/statistical-methods.md`](https://github.com/jturner-uofl/pycorpdiff/blob/main/docs/statistical-methods.md) — every metric's formula + citation
+- [`examples/pycorpdiff_showcase.ipynb`](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb) — full feature tour as a notebook
+- [`docs/rendered/`](https://github.com/jturner-uofl/pycorpdiff/tree/main/docs/rendered) — static HTML renders for offline viewing

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "pycorpdiff"
-version = "0.1.0a2"
+version = "0.1.0a4"
 description = "Comparative corpus analysis for Python: keyness, collocations, semantic shift, temporal trajectories with changepoints + causal inference."
 readme = "README.md"
 license = { file = "LICENSE" }
@@ -45,7 +45,7 @@ dependencies = [
 ]
 [project.optional-dependencies]
-# Visualisation: altair-first, matplotlib retained for paper-grade figures.
+# Visualisation: altair-first, matplotlib retained for publication-quality figures.
 viz = ["altair>=5", "matplotlib>=3.8", "networkx>=3.1"]
 # Embedding-based semantic shift. sentence-transformers pulls torch
 # transitively, which is why this is opt-in rather than a base dep.
@@ -66,8 +66,9 @@ huggingface = ["datasets>=2.14"]
 # rendered HTML examples. `jupyter` runs the notebook, `vl-convert` does
 # static SVG/PNG export of altair charts, `pysofra` renders the showcase's
 # result tables in JAMA-style typography.
-notebooks = ["jupyter>=1.0", "vl-convert-python>=1.5", "pysofra>=0.1.0a2"]
-# Meta-extra so `pycorpdiff[all]` exercises every optional code path.
+notebooks = ["jupyter>=1.0", "vl-convert-python>=1.5", "pysofra>=0.1.0a3"]
+# Meta-extra: `pycorpdiff[all]` pulls in every optional code path
+# including the notebook runtime.
 all = [
     "altair>=5",
     "matplotlib>=3.8",
@@ -81,8 +82,9 @@ all = [
     "pyarrow>=15",
     "duckdb>=0.10",
     "spacy>=3.7",
+    "jupyter>=1.0",
     "vl-convert-python>=1.5",
-    "pysofra>=0.1.0a2",
+    "pysofra>=0.1.0a3",
 ]
 dev = [
     "pytest>=8",

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/__init__.py RENAMED

@@ -14,12 +14,12 @@ Example
 >>> import pycorpdiff as pcd
 >>> pcd.__version__
-'0.1.0a2'
+'0.1.0a4'
 """
 from __future__ import annotations
-__version__ = "0.1.0a2"
+__version__ = "0.1.0a4"
 from .collocation.network import NetworkResult, cooccurrence_network
 from .compare import Comparison, compare

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/compare.py RENAMED

@@ -66,7 +66,7 @@ class Comparison:
             require ``effect_size=True`` and sort by that column.
         effect_size
             If True (default), also compute LogRatio (Hardie),
-            %DIFF (Gabrielatos), and the BIC-Bayes factor (Wilson).
+            %DIFF (Gabrielatos), and the BIC-approximated Bayes factor.
         dispersion
             If True, compute Juilland's D for both corpora and flag
             terms where ``D < 0.5`` in either — the canonical "this is
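
The docstring in this hunk names LogRatio, %DIFF, and Juilland's D without showing their arithmetic. The sketch below uses the commonly cited definitions: LogRatio as the binary log of the ratio of relative frequencies, %DIFF as the percentage difference of normalised frequencies, and D = 1 - V / sqrt(n - 1) with V the population coefficient of variation over n equal-sized parts. The function names and signatures are illustrative assumptions, not pycorpdiff's own, and zero-count smoothing is deliberately left out.

```python
import math
from statistics import fmean, pstdev


def log_ratio(a: int, b: int, c: int, d: int) -> float:
    """Binary log of (a / c) / (b / d): word count a in c tokens vs b in d tokens."""
    if a <= 0 or b <= 0:
        raise ValueError("zero counts need smoothing, which this sketch omits")
    return math.log2((a / c) / (b / d))


def percent_diff(a: int, b: int, c: int, d: int) -> float:
    """%DIFF: how far the target relative frequency sits above the reference, in percent."""
    if b <= 0:
        raise ValueError("reference count must be positive")
    nf_target, nf_reference = a / c, b / d
    return (nf_target - nf_reference) * 100 / nf_reference


def juilland_d(part_freqs: list[float]) -> float:
    """Juilland's D over n equal-sized corpus parts; 1 is perfectly even, 0 is one part only."""
    n = len(part_freqs)
    mean = fmean(part_freqs)
    if n < 2 or mean == 0:
        raise ValueError("need at least two parts and a non-zero mean")
    coefficient_of_variation = pstdev(part_freqs) / mean
    return 1 - coefficient_of_variation / math.sqrt(n - 1)


print(log_ratio(40, 10, 1000, 1000))     # about 2: four times as frequent
print(percent_diff(40, 10, 1000, 1000))  # about 300
print(juilland_d([5, 5, 5, 5]))          # evenly spread
print(juilland_d([20, 0, 0, 0]))         # concentrated in one part
```

Under these definitions a term concentrated in a single part scores D near 0, so it would fall below the `D < 0.5` flag the docstring describes.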

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/datasets/_generate_hansard.py RENAMED

@@ -172,7 +172,7 @@ TOPICS = ["immigration", "brexit", "nhs", "climate"]
 def generate(seed: int = 20260522) -> pd.DataFrame:
-    """Return a deterministic 200-speech synthetic Hansard sample."""
+    """Return a deterministic 193-speech synthetic Hansard sample."""
     rng = np.random.default_rng(seed)
     rows: list[dict[str, object]] = []
     speech_id = 0

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/keyness/bayes.py RENAMED

@@ -26,9 +26,10 @@ def bayes_factor(
 ) -> pd.Series:
     """BIC-approximated Bayes factor for each term's frequency difference.
-    Uses Wilson's BIC approximation: ``BIC = |G²| - ln(N)`` where ``N``
-    is the total tokens across both corpora and ``G²`` is the unsigned
-    log-likelihood. The Bayes factor is then ``exp(BIC / 2)``.
+    The BIC approximation (Kass & Raftery 1995): ``BIC = |G²| - ln(N)``
+    where ``N`` is the total tokens across both corpora and ``G²`` is
+    the unsigned log-likelihood. The Bayes factor is then
+    ``exp(BIC / 2)``. Wilson (2013) is the keyness application.
     Interpret with Kass & Raftery (1995):
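
The docstring's own interpretation table is cut off by this hunk. As a rough illustration of the arithmetic it describes, the sketch below computes `BIC = |G²| - ln(N)` and `exp(BIC / 2)` for a single term and attaches the evidence categories Kass & Raftery (1995) give for 2 ln(B), which equals the BIC value here. The function names are hypothetical, the real `bayes_factor` operates on a `pd.Series` rather than scalars, and whether the package uses these exact labels is an assumption.

```python
import math


def bic_bayes_factor(g2: float, n_tokens: int) -> tuple[float, float]:
    """Return (BIC, Bayes factor) for one term, following the docstring's formula."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    bic = abs(g2) - math.log(n_tokens)
    # math.exp overflows for very large BIC; a production version would guard that.
    return bic, math.exp(bic / 2)


def evidence_label(bic: float) -> str:
    """Kass & Raftery's categories for 2 ln(B), applied to the BIC value."""
    if bic < 0:
        return "favours the null"
    if bic < 2:
        return "not worth more than a bare mention"
    if bic < 6:
        return "positive"
    if bic < 10:
        return "strong"
    return "very strong"


for g2 in (15.0, 30.0):
    bic, bf = bic_bayes_factor(g2, n_tokens=1_000_000)
    print(f"G2={g2}: BIC={bic:.2f}, BF={bf:.1f}, evidence: {evidence_label(bic)}")
```

The first case shows why the correction matters: G² = 15 clears the p < 0.001 critical value of 10.83, yet in a million-token comparison the ln(N) penalty leaves only weak evidence.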

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/viz/__init__.py RENAMED

@@ -1,4 +1,4 @@
-"""Visualisation helpers — altair-first, matplotlib for paper-grade figures.
+"""Visualisation helpers — altair-first, matplotlib for publication-quality figures.
 Every Result type's ``.plot()`` method delegates here. Plot functions
 also accept a bare DataFrame so users can call

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/integration/test_crossval_rayson.py RENAMED

@@ -6,8 +6,10 @@ single-cell keyness computation in corpus linguistics. Every value
 asserted below was either computed from Rayson's exact formula or
 copy-pasted from his calculator on a clean dataset.
-This file extends ``test_loglikelihood.py`` with the broader sweep
-called for by the audit's #15 item (cross-validation receipts).
+This file extends ``test_loglikelihood.py`` with a broader sweep of
+canonical reference triples covering edge cases (lopsided counts,
+sparse cells, mid-sized over-representation) so that any future
+refactor of the LL formula trips multiple assertions simultaneously.
 """
 from __future__ import annotations
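
The rewritten module docstring describes reference triples covering lopsided counts, sparse cells, and mid-sized over-representation. A minimal sketch of the two-corpus log-likelihood in the form Rayson's calculator documents is given below, with one illustrative case of each kind. The function name `log_likelihood_g2` and the example counts are assumptions made for illustration; they are not the test file's actual reference values or pycorpdiff's implementation.

```python
import math


def log_likelihood_g2(a: int, b: int, c: int, d: int) -> float:
    """G2 for a word seen a times in c tokens (corpus 1) and b times in d tokens (corpus 2)."""
    if a < 0 or b < 0 or c <= 0 or d <= 0:
        raise ValueError("counts must be non-negative and corpus sizes positive")
    total = a + b
    # Expected counts if the word were equally frequent in both corpora.
    expected_1 = c * total / (c + d)
    expected_2 = d * total / (c + d)

    def term(observed: int, expected: float) -> float:
        # By convention, a zero observed count contributes nothing.
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    return 2 * (term(a, expected_1) + term(b, expected_2))


cases = {
    "balanced": (10, 10, 1000, 1000),
    "mid-sized over-representation": (20, 10, 1000, 1000),
    "sparse cell": (5, 0, 1000, 1000),
    "lopsided corpus sizes": (50, 50, 1_000, 100_000),
}
for label, counts in cases.items():
    print(f"{label}: G2 = {log_likelihood_g2(*counts):.4f}")
```

Pinning several such cases at once means a change to the zero-count convention or to the expected-count formula would fail more than one assertion, which is the behaviour the docstring asks for.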

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/LICENSE RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/_backends/__init__.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/_backends/pandas.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/_backends/polars.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/collocation/__init__.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/collocation/cooccurrence.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/collocation/measures.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/collocation/network.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/collocation/shift.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/corpus.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/datasets/__init__.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/datasets/_data/hansard_sample.parquet RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/datasets/hansard.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/datasets/histwords.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/explain.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/io/__init__.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/io/duckdb.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/io/huggingface.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/io/readers.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/keyness/__init__.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/keyness/chi_squared.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/keyness/correction.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/keyness/dispersion.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/keyness/effect_sizes.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/keyness/loglikelihood.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/keyness/multicorpus.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/keyness/permutation.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/py.typed RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/results.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/semantic/__init__.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/semantic/alignment.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/semantic/embed.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/semantic/shift.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/semantic/trajectory.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/stats.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/temporal/__init__.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/temporal/bocpd.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/temporal/causal_impact.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/temporal/changepoint.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/temporal/forecast.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/temporal/its.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/temporal/slicing.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/tokenize.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/viz/bocpd.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/viz/causal_impact.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/viz/collocation.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/viz/dispersion.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/viz/forecast.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/viz/keyness.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/viz/network.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/viz/scattertext.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/viz/semantic_forecast.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/src/pycorpdiff/viz/trajectory.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/__init__.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/conftest.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/fixtures/__init__.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/integration/__init__.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/integration/test_collocation_integration.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/integration/test_crossval_histwords.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/integration/test_crossval_nltk.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/integration/test_crossval_quanteda.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/integration/test_crossval_scattertext.py RENAMED

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/integration/test_explain_integration.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/integration/test_keyness_integration.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/integration/test_sbert_slow.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/integration/test_semantic_integration.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/integration/test_stop_words.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/integration/test_temporal_stats.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/integration/test_viz.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/property/__init__.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/property/test_collocation_properties.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/property/test_keyness_properties.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/property/test_temporal_properties.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/__init__.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_bayes_factor.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_bocpd.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_causal_impact.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_changepoint.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_chi_squared.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_collocation_cooccurrence.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_collocation_measures.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_collocation_shift.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_comparison_concordance.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_cooccurrence_network.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_corpus_hash.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_corpus_vocab.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_correction.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_datasets_hansard.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_dispersion.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_dispersion_plot.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_doc_term_counts_sparse.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_effect_sizes.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_embedders.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_explain.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_forecast.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_forecast_semantic_drift.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_from_huggingface.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_hansard_fetcher.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_histwords_loader.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_its.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_keyness_multi.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_loglikelihood.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_ngram_tokenizer.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_permutation_keyness.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_polars_interop.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_procrustes.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_read_duckdb.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_read_txt_line_mode.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_result_exports.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_scattertext_plot.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_semantic_neighbours.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_semantic_shift.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_semantic_trajectory.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_smoke.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_temporal.py RENAMED Viewed

File without changes

{pycorpdiff-0.1.0a2 → pycorpdiff-0.1.0a4}/tests/unit/test_wilson_ci.py RENAMED Viewed

File without changes
