skxperiments 0.1.0.dev0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (110)
  1. skxperiments-0.1.0.dev0/.github/workflows/ci.yml +45 -0
  2. skxperiments-0.1.0.dev0/.github/workflows/notebooks.yml +39 -0
  3. skxperiments-0.1.0.dev0/.gitignore +57 -0
  4. skxperiments-0.1.0.dev0/.pre-commit-config.yaml +17 -0
  5. skxperiments-0.1.0.dev0/CHANGELOG.md +343 -0
  6. skxperiments-0.1.0.dev0/PKG-INFO +272 -0
  7. skxperiments-0.1.0.dev0/README.md +235 -0
  8. skxperiments-0.1.0.dev0/ROADMAP.md +341 -0
  9. skxperiments-0.1.0.dev0/docs/README.md +6 -0
  10. skxperiments-0.1.0.dev0/docs/en/choosing.md +53 -0
  11. skxperiments-0.1.0.dev0/docs/en/glossary.md +87 -0
  12. skxperiments-0.1.0.dev0/docs/en/index.md +64 -0
  13. skxperiments-0.1.0.dev0/docs/pt-br/escolhendo.md +53 -0
  14. skxperiments-0.1.0.dev0/docs/pt-br/glossario.md +86 -0
  15. skxperiments-0.1.0.dev0/docs/pt-br/index.md +63 -0
  16. skxperiments-0.1.0.dev0/examples/for_starters/en/00_why_randomize.ipynb +123 -0
  17. skxperiments-0.1.0.dev0/examples/for_starters/en/01_first_experiment.ipynb +190 -0
  18. skxperiments-0.1.0.dev0/examples/for_starters/en/02_inference_three_ways.ipynb +182 -0
  19. skxperiments-0.1.0.dev0/examples/for_starters/en/03_reducing_variance.ipynb +113 -0
  20. skxperiments-0.1.0.dev0/examples/for_starters/en/04_balance_rerandomization.ipynb +117 -0
  21. skxperiments-0.1.0.dev0/examples/for_starters/en/05_blocking.ipynb +110 -0
  22. skxperiments-0.1.0.dev0/examples/for_starters/en/06_factorial.ipynb +133 -0
  23. skxperiments-0.1.0.dev0/examples/for_starters/en/07_many_tests.ipynb +130 -0
  24. skxperiments-0.1.0.dev0/examples/for_starters/en/08_diagnostics.ipynb +110 -0
  25. skxperiments-0.1.0.dev0/examples/for_starters/en/09_putting_it_together.ipynb +125 -0
  26. skxperiments-0.1.0.dev0/examples/for_starters/pt-br/00_why_randomize.ipynb +123 -0
  27. skxperiments-0.1.0.dev0/examples/for_starters/pt-br/01_first_experiment.ipynb +190 -0
  28. skxperiments-0.1.0.dev0/examples/for_starters/pt-br/02_inference_three_ways.ipynb +182 -0
  29. skxperiments-0.1.0.dev0/examples/for_starters/pt-br/03_reducing_variance.ipynb +113 -0
  30. skxperiments-0.1.0.dev0/examples/for_starters/pt-br/04_balance_rerandomization.ipynb +117 -0
  31. skxperiments-0.1.0.dev0/examples/for_starters/pt-br/05_blocking.ipynb +110 -0
  32. skxperiments-0.1.0.dev0/examples/for_starters/pt-br/06_factorial.ipynb +133 -0
  33. skxperiments-0.1.0.dev0/examples/for_starters/pt-br/07_many_tests.ipynb +130 -0
  34. skxperiments-0.1.0.dev0/examples/for_starters/pt-br/08_diagnostics.ipynb +110 -0
  35. skxperiments-0.1.0.dev0/examples/for_starters/pt-br/09_putting_it_together.ipynb +125 -0
  36. skxperiments-0.1.0.dev0/pyproject.toml +79 -0
  37. skxperiments-0.1.0.dev0/skxperiments/__init__.py +5 -0
  38. skxperiments-0.1.0.dev0/skxperiments/core/__init__.py +42 -0
  39. skxperiments-0.1.0.dev0/skxperiments/core/assignment.py +589 -0
  40. skxperiments-0.1.0.dev0/skxperiments/core/base.py +512 -0
  41. skxperiments-0.1.0.dev0/skxperiments/core/exceptions.py +145 -0
  42. skxperiments-0.1.0.dev0/skxperiments/core/potential_outcomes.py +168 -0
  43. skxperiments-0.1.0.dev0/skxperiments/core/results.py +624 -0
  44. skxperiments-0.1.0.dev0/skxperiments/design/__init__.py +22 -0
  45. skxperiments-0.1.0.dev0/skxperiments/design/balance.py +182 -0
  46. skxperiments-0.1.0.dev0/skxperiments/design/blocked_crd.py +157 -0
  47. skxperiments-0.1.0.dev0/skxperiments/design/crd.py +162 -0
  48. skxperiments-0.1.0.dev0/skxperiments/design/factorial.py +174 -0
  49. skxperiments-0.1.0.dev0/skxperiments/design/power.py +233 -0
  50. skxperiments-0.1.0.dev0/skxperiments/design/rerandomized_crd.py +319 -0
  51. skxperiments-0.1.0.dev0/skxperiments/diagnostics/__init__.py +21 -0
  52. skxperiments-0.1.0.dev0/skxperiments/diagnostics/aa_test.py +277 -0
  53. skxperiments-0.1.0.dev0/skxperiments/diagnostics/balance_report.py +224 -0
  54. skxperiments-0.1.0.dev0/skxperiments/diagnostics/srm.py +327 -0
  55. skxperiments-0.1.0.dev0/skxperiments/estimators/__init__.py +23 -0
  56. skxperiments-0.1.0.dev0/skxperiments/estimators/blocked_difference_in_means.py +197 -0
  57. skxperiments-0.1.0.dev0/skxperiments/estimators/cuped.py +280 -0
  58. skxperiments-0.1.0.dev0/skxperiments/estimators/difference_in_means.py +161 -0
  59. skxperiments-0.1.0.dev0/skxperiments/estimators/factorial_estimator.py +213 -0
  60. skxperiments-0.1.0.dev0/skxperiments/estimators/lin_estimator.py +298 -0
  61. skxperiments-0.1.0.dev0/skxperiments/inference/__init__.py +17 -0
  62. skxperiments-0.1.0.dev0/skxperiments/inference/bootstrap.py +450 -0
  63. skxperiments-0.1.0.dev0/skxperiments/inference/multiple.py +365 -0
  64. skxperiments-0.1.0.dev0/skxperiments/inference/neyman.py +386 -0
  65. skxperiments-0.1.0.dev0/skxperiments/inference/randomization_test.py +319 -0
  66. skxperiments-0.1.0.dev0/skxperiments/pipeline.py +366 -0
  67. skxperiments-0.1.0.dev0/skxperiments/reporting/__init__.py +30 -0
  68. skxperiments-0.1.0.dev0/skxperiments/reporting/plots.py +411 -0
  69. skxperiments-0.1.0.dev0/skxperiments/reporting/summary.py +185 -0
  70. skxperiments-0.1.0.dev0/tests/__init__.py +1 -0
  71. skxperiments-0.1.0.dev0/tests/core/__init__.py +1 -0
  72. skxperiments-0.1.0.dev0/tests/core/test_assignment.py +183 -0
  73. skxperiments-0.1.0.dev0/tests/core/test_base.py +496 -0
  74. skxperiments-0.1.0.dev0/tests/core/test_exceptions.py +168 -0
  75. skxperiments-0.1.0.dev0/tests/core/test_potential_outcomes.py +164 -0
  76. skxperiments-0.1.0.dev0/tests/core/test_results.py +362 -0
  77. skxperiments-0.1.0.dev0/tests/design/__init__.py +1 -0
  78. skxperiments-0.1.0.dev0/tests/design/test_balance.py +299 -0
  79. skxperiments-0.1.0.dev0/tests/design/test_blocked_crd.py +346 -0
  80. skxperiments-0.1.0.dev0/tests/design/test_crd.py +198 -0
  81. skxperiments-0.1.0.dev0/tests/design/test_factorial.py +330 -0
  82. skxperiments-0.1.0.dev0/tests/design/test_power.py +261 -0
  83. skxperiments-0.1.0.dev0/tests/design/test_rerandomized_crd.py +370 -0
  84. skxperiments-0.1.0.dev0/tests/diagnostics/__init__.py +0 -0
  85. skxperiments-0.1.0.dev0/tests/diagnostics/test_aa_test.py +259 -0
  86. skxperiments-0.1.0.dev0/tests/diagnostics/test_balance_report.py +249 -0
  87. skxperiments-0.1.0.dev0/tests/diagnostics/test_srm.py +283 -0
  88. skxperiments-0.1.0.dev0/tests/estimators/__init__.py +1 -0
  89. skxperiments-0.1.0.dev0/tests/estimators/test_blocked_difference_in_means.py +463 -0
  90. skxperiments-0.1.0.dev0/tests/estimators/test_cuped.py +530 -0
  91. skxperiments-0.1.0.dev0/tests/estimators/test_difference_in_means.py +332 -0
  92. skxperiments-0.1.0.dev0/tests/estimators/test_factorial_estimator.py +448 -0
  93. skxperiments-0.1.0.dev0/tests/estimators/test_lin_estimator.py +529 -0
  94. skxperiments-0.1.0.dev0/tests/inference/__init__.py +1 -0
  95. skxperiments-0.1.0.dev0/tests/inference/test_bootstrap.py +699 -0
  96. skxperiments-0.1.0.dev0/tests/inference/test_multiple.py +442 -0
  97. skxperiments-0.1.0.dev0/tests/inference/test_neyman.py +595 -0
  98. skxperiments-0.1.0.dev0/tests/inference/test_randomization_test.py +890 -0
  99. skxperiments-0.1.0.dev0/tests/integration/__init__.py +5 -0
  100. skxperiments-0.1.0.dev0/tests/integration/test_blocked_crd_bdim.py +73 -0
  101. skxperiments-0.1.0.dev0/tests/integration/test_crd_cuped.py +82 -0
  102. skxperiments-0.1.0.dev0/tests/integration/test_crd_dim.py +81 -0
  103. skxperiments-0.1.0.dev0/tests/integration/test_crd_lin.py +75 -0
  104. skxperiments-0.1.0.dev0/tests/integration/test_factorial_dim.py +83 -0
  105. skxperiments-0.1.0.dev0/tests/reporting/__init__.py +0 -0
  106. skxperiments-0.1.0.dev0/tests/reporting/test_plots.py +140 -0
  107. skxperiments-0.1.0.dev0/tests/reporting/test_result_plots.py +149 -0
  108. skxperiments-0.1.0.dev0/tests/reporting/test_summary.py +154 -0
  109. skxperiments-0.1.0.dev0/tests/test_comparison.py +190 -0
  110. skxperiments-0.1.0.dev0/tests/test_pipeline.py +197 -0
@@ -0,0 +1,45 @@
+ name: CI
+
+ on:
+   push:
+     branches: [main]
+     paths:
+       - 'skxperiments/**'
+       - 'tests/**'
+       - 'pyproject.toml'
+       - '.github/workflows/ci.yml'
+   pull_request:
+     branches: [main]
+     paths:
+       - 'skxperiments/**'
+       - 'tests/**'
+       - 'pyproject.toml'
+       - '.github/workflows/ci.yml'
+
+ jobs:
+   test:
+     runs-on: ubuntu-latest
+     # Manual escape hatch: put [skip-lib] in the (head/merge) commit message
+     # to skip the library suite even when library files changed.
+     if: ${{ !contains(github.event.head_commit.message, '[skip-lib]') }}
+     strategy:
+       matrix:
+         python-version: ["3.10", "3.11", "3.12"]
+
+     steps:
+       - uses: actions/checkout@v4
+
+       - name: Set up Python ${{ matrix.python-version }}
+         uses: actions/setup-python@v5
+         with:
+           python-version: ${{ matrix.python-version }}
+           cache: 'pip'
+           cache-dependency-path: 'pyproject.toml'
+
+       - name: Install dependencies
+         run: |
+           python -m pip install --upgrade pip
+           pip install -e ".[dev]"
+
+       - name: Run tests with coverage
+         run: pytest --cov-report=term-missing
@@ -0,0 +1,39 @@
+ name: Notebooks
+
+ on:
+   push:
+     branches: [main]
+     paths:
+       - 'examples/**'
+       - 'skxperiments/**'
+       - 'pyproject.toml'
+       - '.github/workflows/notebooks.yml'
+   pull_request:
+     branches: [main]
+     paths:
+       - 'examples/**'
+       - 'skxperiments/**'
+       - 'pyproject.toml'
+       - '.github/workflows/notebooks.yml'
+
+ jobs:
+   notebooks:
+     runs-on: ubuntu-latest
+
+     steps:
+       - uses: actions/checkout@v4
+
+       - name: Set up Python 3.12
+         uses: actions/setup-python@v5
+         with:
+           python-version: "3.12"
+           cache: 'pip'
+           cache-dependency-path: 'pyproject.toml'
+
+       - name: Install dependencies
+         run: |
+           python -m pip install --upgrade pip
+           pip install -e ".[dev]"
+
+       - name: Execute example notebooks
+         run: pytest --nbmake --no-cov examples/
@@ -0,0 +1,57 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+ *.so
6
+ .Python
7
+
8
+ # Virtual environments
9
+ venv/
10
+ env/
11
+ .venv/
12
+ ENV/
13
+
14
+ # Distribution / packaging
15
+ build/
16
+ develop-eggs/
17
+ dist/
18
+ downloads/
19
+ eggs/
20
+ .eggs/
21
+ lib/
22
+ lib64/
23
+ parts/
24
+ sdist/
25
+ var/
26
+ wheels/
27
+ *.egg-info/
28
+ *.egg
29
+ MANIFEST
30
+
31
+ # Testing
32
+ .pytest_cache/
33
+ .coverage
34
+ .coverage.*
35
+ htmlcov/
36
+ .tox/
37
+ .hypothesis/
38
+ coverage.xml
39
+ *.cover
40
+
41
+ # Type checkers / linters
42
+ .mypy_cache/
43
+ .ruff_cache/
44
+
45
+ # IDE
46
+ .vscode/
47
+ .idea/
48
+ *.swp
49
+ *.swo
50
+
51
+ # OS
52
+ .DS_Store
53
+ Thumbs.db
54
+
55
+ # Jupyter
56
+ .ipynb_checkpoints/
57
+ *.ipynb_checkpoints
@@ -0,0 +1,17 @@
+ repos:
+   - repo: https://github.com/astral-sh/ruff-pre-commit
+     rev: v0.1.9
+     hooks:
+       - id: ruff
+         args: [--fix]
+
+   - repo: https://github.com/psf/black
+     rev: 24.1.1
+     hooks:
+       - id: black
+
+   - repo: https://github.com/pre-commit/mirrors-mypy
+     rev: v1.8.0
+     hooks:
+       - id: mypy
+         additional_dependencies: [numpy, pandas-stubs]
@@ -0,0 +1,343 @@
1
+ # Changelog
2
+
3
+ All notable changes to this project will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
6
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+
8
+ ## [Unreleased]
9
+
10
+ ### Added — Phase 7: Visualization and reporting
11
+
12
+ - **Plots** (`skxperiments.reporting.plots`): `plot_balance`, `plot_srm`,
13
+ `plot_null_distribution` (diagnostic) and `plot_effect`, `plot_forest`,
14
+ `plot_interaction`, `plot_power_curve` (result). Each accepts an
15
+ optional `ax` and returns the `matplotlib.axes.Axes` it drew on.
16
+ - **`ExperimentReport`** (`skxperiments.reporting.summary`): renders a
17
+ `PipelineResult` as a self-contained static HTML page — results table,
18
+ diagnostics summary, and the relevant plots embedded inline as base64
19
+ PNGs. No template engine. `include_plots=False` produces the page
20
+ without matplotlib. `to_html()` / `save(path)`.
21
+ - **Optional `matplotlib` dependency**: added a `viz` extra
22
+ (`pip install skxperiments[viz]`) and included matplotlib in `dev`.
23
+ Plotting is imported lazily; calling a plot without matplotlib raises a
24
+ clear `ImportError` pointing at the extra. Importing the package, or
25
+ building a report with `include_plots=False`, needs no optional
26
+ dependency.
27
+
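The lazy optional-dependency behaviour described in the Phase 7 entries can be illustrated with a small import guard. This is a minimal sketch, not the package's actual code: the helper name `require_optional` and its parameters are hypothetical, and only the standard `importlib` API is assumed.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str, package: str = "skxperiments") -> ModuleType:
    """Import an optional dependency on demand, or fail with an actionable message.

    Hypothetical helper: the real package may structure its guard differently.
    """
    try:
        # The import happens only when a feature that needs it is called,
        # so importing the package itself never touches the optional module.
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature. "
            f"Install it with: pip install {package}[{extra}]"
        ) from exc
```

A plotting function would call something like `require_optional("matplotlib.pyplot", "viz")` as its first statement, so the cost and the failure mode stay local to plotting.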
28
+ ### Added — Phase 6: Pipeline and comparison
29
+
30
+ - **`ExperimentPipeline`** (`skxperiments.pipeline`): composes an inference
31
+ procedure (which already wraps an estimator) with a set of diagnostics
32
+ and runs them against a single `Assignment`. The design travels with the
33
+ assignment (`assignment.design_`), so neither design nor estimator is a
34
+ separate argument. Diagnostics run best-effort (one raising a
35
+ `SkxperimentsError` is recorded as a warning and skipped); a flagged
36
+ diagnostic is surfaced but does not stop estimation unless
37
+ `raise_on_flag=True`. Default diagnostics: `[SRMTest()]`. Returns a
38
+ `PipelineResult` bundling the inference `Results`, a merged
39
+ `DiagnosticsReport`, and per-diagnostic results.
40
+ - **`ExperimentComparison`** (`skxperiments.pipeline`): compares several
41
+ independent experiments by collecting each scalar ATE/p-value and
42
+ applying `MultipleTestingCorrection` across the family. Accepts a dict
43
+ of `PipelineResult` or scalar `Results`. Returns a `ComparisonResult`
44
+ with the corrected `Results` per experiment and a comparison table
45
+ (ATE, SE, CI, original/corrected p-value, significance) ready for the
46
+ Phase 7 forest plot. Multi-effect/subgroup comparison is deferred to v2.
47
+
48
+ ### Added — Phase 5: Diagnostics
49
+
50
+ - **`SRMTest`** (`skxperiments.diagnostics.srm`): Sample Ratio Mismatch
51
+ check via Pearson's chi-squared, comparing observed arm/cell counts to
52
+ the design's intended allocation. Two-arm (CRD/Blocked) and factorial
53
+ designs; expected allocation inferred from `design_.p` or a uniform
54
+ split over cells, or supplied explicitly. Default threshold 0.001 (the
55
+ industry SRM convention). Returns `SRMResult`.
56
+ - **`BalanceReport`** (`skxperiments.diagnostics.balance_report`): wraps
57
+ `check_balance` to report the standardized mean difference (SMD) per
58
+ covariate and flag covariates with `|SMD| > threshold` (default 0.1,
59
+ Austin 2009). Constant covariates (undefined SMD) are surfaced as
60
+ warnings. Two-arm only; rejects factorial. Returns `BalanceResult`
61
+ (the table is exposed via `to_dataframe` for the Phase 7 Love plot).
62
+ - **`AATest`** (`skxperiments.diagnostics.aa_test`): re-randomizes a
63
+ design on fixed data and runs a wrapped inference each time, checking
64
+ calibration. The false-positive rate is compared to `alpha` with an
65
+ exact binomial test (flag below `meta_threshold`, default 0.001), and
66
+ p-value uniformity is reported via a Kolmogorov-Smirnov test (secondary
67
+ warning). Returns `AAResult`.
68
+ - Each diagnostic exposes a dedicated frozen result dataclass with
69
+ `summary`/`to_dict` and a `to_diagnostics_report()` mapping into the
70
+ existing `DiagnosticsReport` (for the Phase 6 pipeline). `Results` is
71
+ left untouched: diagnostics are not estimands.
72
+
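The two-arm SRM check described above can be sketched in a few lines. This is an illustration of Pearson's chi-squared test with one degree of freedom, not the `SRMTest` implementation; the function name `srm_two_arm` and its signature are assumptions. For one degree of freedom the chi-squared survival function has the closed form `erfc(sqrt(x / 2))`, so only the standard library is needed.

```python
import math


def srm_two_arm(n_treated: int, n_control: int, p: float = 0.5, threshold: float = 0.001):
    """Pearson chi-squared SRM check for two arms (hypothetical helper).

    p is the intended treatment proportion; the check flags when the
    p-value falls below threshold.
    """
    n = n_treated + n_control
    expected_t = n * p
    expected_c = n * (1.0 - p)
    stat = (
        (n_treated - expected_t) ** 2 / expected_t
        + (n_control - expected_c) ** 2 / expected_c
    )
    # Survival function of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value, p_value < threshold
```

With an intended 50/50 split, observed counts of 5000 and 5400 give a statistic of about 15.4 and are flagged, while 1000 and 1050 are not.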
73
+ ### Added — Phase 4.4: Bootstrap confidence intervals
74
+
75
+ - **`BootstrapCI`** (`skxperiments.inference.bootstrap`): `BaseInference`
76
+ subclass that approximates the sampling distribution of an estimator's
77
+ ATE by resampling units with replacement within each treatment arm
78
+ (within each block-by-arm stratum for blocked designs), then forms a
79
+ `"percentile"` or `"bca"` (default) confidence interval. The bootstrap
80
+ is the library's explicit **superpopulation** method: it ignores the
81
+ randomization mechanism and always reports
82
+ `inference_mode="superpopulation"`.
83
+ - **Estimator-agnostic**: any estimator producing a scalar `Results.ate`
84
+ is supported (`DifferenceInMeans`, `BlockedDifferenceInMeans`,
85
+ `LinEstimator`, `CUPED`); each resample is turned back into an
86
+ `Assignment` and the estimator is refitted. Multi-effect estimators are
87
+ rejected.
88
+ - **BCa**: bias-correction `z0` from the bootstrap distribution and
89
+ acceleration `a` from a leave-one-out jackknife (Efron 1987). The
90
+ degenerate case (bias-correction undefined) raises `InvalidDesignError`
91
+ suggesting `method="percentile"`.
92
+ - **Output**: bootstrap standard error, percentile/BCa interval, and an
93
+ approximate two-sided bootstrap p-value (achieved significance level).
94
+ - **Fail fast**: `InsufficientDataError` when any arm (CRD) or block-by-arm
95
+ stratum (blocked) has fewer than 2 units; matched-pair blocked designs
96
+ are not supported by the within-stratum bootstrap in v1 (see `ROADMAP.md`).
97
+ - **Reserved keys schema in `Results.extra`** extended with `method`,
98
+ `n_resamples`, `bootstrap_distribution`, and (BCa only)
99
+ `bias_correction` and `acceleration`. Documented in the `Results`
100
+ docstring.
101
+
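As a rough illustration of the within-arm resampling described above, the following sketch builds a percentile interval for a difference in means on a two-arm design. It is not the `BootstrapCI` code: the function name, the plain-list inputs, and the order-statistic rule used for the percentiles are all assumptions, and the BCa adjustment is omitted.

```python
import random
from statistics import mean, stdev


def bootstrap_percentile_ci(treated, control, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap for a difference in means (hypothetical helper)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample with replacement separately inside each arm,
        # so every resample keeps the original arm sizes.
        t_star = rng.choices(treated, k=len(treated))
        c_star = rng.choices(control, k=len(control))
        diffs.append(mean(t_star) - mean(c_star))
    diffs.sort()
    # Simple nearest-rank percentiles of the sorted bootstrap distribution.
    lo_idx = round((alpha / 2) * (n_resamples - 1))
    hi_idx = round((1 - alpha / 2) * (n_resamples - 1))
    return stdev(diffs), (diffs[lo_idx], diffs[hi_idx])
```

Fixing `seed` makes the resamples, and therefore the interval, reproducible across runs.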
102
+ ### Added — Phase 4.3: Neyman confidence intervals
103
+
104
+ - **`NeymanCI`** (`skxperiments.inference.neyman`): `BaseInference`
105
+ subclass that wraps a scalar estimator and builds a two-sided Wald
106
+ confidence interval and p-value from Neyman's variance estimator,
107
+ dispatched by assignment type. CRD uses the conservative
108
+ `s_t^2 / n_t + s_c^2 / n_c` (`ddof=1`); blocked designs use the
109
+ stratified `sum_b (N_b / N)^2 * V_b`, consistent with the size-weighted
110
+ ATE of `BlockedDifferenceInMeans` (Imbens & Rubin 2015, ch. 6 and 9).
111
+ The interval uses the normal quantile (`scipy.stats.norm`).
112
+ - **Estimator whitelist**: v1 accepts only `DifferenceInMeans` and
113
+ `BlockedDifferenceInMeans`, validated at construction with
114
+ `DesignEstimatorMismatch`. `CUPED` and `LinEstimator` support is
115
+ deferred (see `ROADMAP.md`).
116
+ - **Finite-population scope**: an estimator reporting
117
+ `inference_mode="superpopulation"` is rejected with a message
118
+ redirecting to `BootstrapCI` (Phase 4.4). The Neyman formula is
119
+ identical under both interpretations; the restriction is a scope choice.
120
+ - **Fail fast**: `InsufficientDataError` when any arm (CRD) or any arm
121
+ within a block (blocked) has fewer than 2 observations.
122
+ - **Reserved keys schema in `Results.extra`** extended with
123
+ `variance_type` (`"neyman"` or `"neyman_stratified"`); `NeymanCI` also
124
+ propagates `inference_mode`. Documented in the `Results` docstring.
125
+
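The CRD case of the Neyman interval can be written out directly from the formula quoted above, `s_t^2 / n_t + s_c^2 / n_c` with `ddof=1`. The sketch below uses only the standard library and is not the `NeymanCI` class; the function name and return shape are assumptions, and it assumes each arm has at least two observations with a non-zero standard error.

```python
from statistics import NormalDist, mean, variance


def neyman_ci(treated, control, alpha=0.05):
    """Wald interval from Neyman's conservative variance (hypothetical helper)."""
    ate = mean(treated) - mean(control)
    # statistics.variance is the sample variance, i.e. ddof=1.
    var_hat = variance(treated) / len(treated) + variance(control) / len(control)
    se = var_hat ** 0.5
    normal = NormalDist()
    z = normal.inv_cdf(1 - alpha / 2)
    # Two-sided p-value from the normal approximation.
    p_value = 2 * (1 - normal.cdf(abs(ate) / se))
    return ate, se, (ate - z * se, ate + z * se), p_value
```

The stratified variant for blocked designs would apply the same per-block computation and combine the block variances with weights `(N_b / N)^2`, as stated in the entry above.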
126
+ ### Added — Phase 4.2: Multiple testing correction
127
+
128
+ - **`MultipleTestingCorrection`** (`skxperiments.inference.multiple`):
129
+ utility class for applying Bonferroni, Holm, or Benjamini-Hochberg
130
+ correction to a family of p-values. Standalone class (not
131
+ `BaseInference`); operates on post-processed p-values rather than
132
+ on `Assignment` objects. API: `MultipleTestingCorrection(method,
133
+ alpha).correct(results) -> Results | list[Results]`. Accepts
134
+ `Results` in multi-effect mode (`p_value: dict`) or `list[Results]`
135
+ in scalar mode; output preserves input format. Default
136
+ `method="holm"` (uniformly more powerful than Bonferroni for FWER).
137
+ All methods clip corrected p-values to `[0, 1]`.
138
+ - **FWER vs. FDR control**: the docstring explicitly distinguishes
139
+ Bonferroni and Holm (FWER, family-wise error rate) from BH (FDR,
140
+ false discovery rate); the choice depends on the inferential goal.
141
+ - **Double-correction detection**: applying `correct()` to a
142
+ `Results` whose `extra` already contains any of the 4 reserved
143
+ keys raises `InvalidDesignError` pointing the user at the original
144
+ uncorrected `Results`.
145
+ - **`alpha` override**: `MultipleTestingCorrection(alpha=...)`
146
+ overrides the input `Results`' `alpha` in the output.
147
+ - **Reserved keys schema in `Results.extra`** extended with
148
+ `correction_method`, `original_p_values`, `family_wise_alpha`,
149
+ `n_tests`. Documented in the `Results` class docstring.
150
+
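To make the three corrections concrete, here is a minimal sketch of the adjusted p-values for Bonferroni, Holm, and Benjamini-Hochberg on a plain list. It follows the textbook definitions and is not the `MultipleTestingCorrection` class; the function names are assumptions, and the `Results` handling and double-correction check are left out.

```python
def bonferroni(p_values):
    """Multiply each p-value by the family size, clipped to 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Step-down Holm adjustment (controls FWER)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is scaled by (m - k + 1); the running
        # maximum keeps the adjusted values monotone in the sorted order.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(p_values):
    """Step-up Benjamini-Hochberg adjustment (controls FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, p_values[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted
```

For the input `[0.01, 0.04, 0.03]`, Bonferroni gives roughly `[0.03, 0.12, 0.09]`, Holm `[0.03, 0.06, 0.06]`, and Benjamini-Hochberg `[0.03, 0.04, 0.04]`, which shows Holm never exceeding Bonferroni and all outputs keeping the input order.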
151
+ ### Added — Phase 4.1: Randomization-based inference
152
+
153
+ - **`RandomizationTest`** (`skxperiments.inference.randomization_test`): Fisher's sharp null
154
+ hypothesis test via Monte Carlo permutations. Materializes the `BaseAssignment.draw()`
155
+ contract by routing each permutation through the original randomization mechanism — under
156
+ rerandomization, the cached Mahalanobis covariance matrix is reused via
157
+ `CRDAssignment.rerandomization_metadata`; under blocking, within-block proportions are
158
+ preserved automatically. Always refits the estimator on the original assignment at the
159
+ start of `fit()`; prior estimator state is discarded. Three alternatives:
160
+ `"two-sided"` (criterion `|T_perm| >= |T_obs|`), `"greater"`, `"less"`. P-value uses the
161
+ Phipson & Smyth (2010) continuity correction `(1 + n_extreme) / (1 + n_permutations)`,
162
+ guaranteeing a Monte Carlo p-value strictly greater than zero. Reproducibility: same
163
+ `seed` produces identical `null_distribution_`.
164
+ - **`BaseInference.estimate()`** abstract method: subclasses must now implement both
165
+ `fit()` and `estimate()`, mirroring `BaseEstimator`.
166
+ - **`BaseInference._validate_assignment_type()`**: thin wrapper exposing the same
167
+ validation surface as `BaseEstimator`. Underlying logic extracted to a module-level
168
+ helper `_check_assignment_type` in `skxperiments.core.base` so both ABCs share a
169
+ single source of truth for the `DesignEstimatorMismatch` message format.
170
+ - Reserved keys schema in `Results.extra` documented in the `Results` class docstring:
171
+ `inference_mode`, `theta`, `correlation` (written by Phase 3 estimators);
172
+ `n_permutations`, `null_distribution`, `alternative` (written by `RandomizationTest`).
173
+ - `pytest` marker `slow` registered in `pyproject.toml` for tests that run statistical
174
+ property checks.
175
+
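The Monte Carlo p-value with the Phipson and Smyth correction can be sketched for the simplest case, a completely randomized design with a fixed number of treated units and a difference-in-means statistic. This is an illustration only: the function name and list-based inputs are assumptions, and it redraws assignments by sampling indices rather than through the library's `draw()` mechanism, so it does not cover rerandomization or blocking.

```python
import random
from statistics import mean


def randomization_p_value(outcomes, treated_mask, n_permutations=999, seed=0):
    """Two-sided Fisher randomization test under the sharp null (hypothetical helper)."""
    rng = random.Random(seed)
    n_treated = sum(treated_mask)
    indices = list(range(len(outcomes)))

    def diff_in_means(mask):
        t = [y for y, m in zip(outcomes, mask) if m]
        c = [y for y, m in zip(outcomes, mask) if not m]
        return mean(t) - mean(c)

    t_obs = diff_in_means(treated_mask)
    n_extreme = 0
    for _ in range(n_permutations):
        # Redraw the assignment with the same number of treated units;
        # under the sharp null the outcomes stay fixed.
        chosen = set(rng.sample(indices, n_treated))
        mask = [i in chosen for i in indices]
        if abs(diff_in_means(mask)) >= abs(t_obs):
            n_extreme += 1
    # The +1 terms count the observed assignment, so the p-value is never zero.
    return (1 + n_extreme) / (1 + n_permutations)
```

The smallest attainable value is `1 / (1 + n_permutations)`, which is why the number of permutations bounds the resolution of the test.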
176
+ ### Added — Roadmap
177
+
178
+ - **`ROADMAP.md`**: centralized tracking of deferred features, decisions
179
+ to revisit, and v2 plans. Organized by phase with What / Why deferred /
180
+ Trigger structure. Linked from `README.md`.
181
+
182
+ ### Tests — Phase 7
183
+
184
+ - 32 tests across `tests/reporting/`: 8 for the diagnostic plots, 14 for
185
+ the result plots, and 10 for `ExperimentReport`. All run headless (Agg
186
+ backend) and assert on figure artefacts (tick labels, bar/line counts,
187
+ histogram bins) and HTML substrings rather than pixels; the
188
+ optional-dependency guard is exercised by monkeypatching matplotlib
189
+ out of `sys.modules`.
190
+
191
+ ### Tests — Phase 6
192
+
193
+ - 33 tests across `tests/test_pipeline.py` (16) and
194
+ `tests/test_comparison.py` (17): pipeline creation/validation, clean
195
+ and flagged runs, `raise_on_flag`, best-effort diagnostic errors, a
196
+ custom `BalanceReport` diagnostic, empty diagnostics, and the result
197
+ surface; comparison creation/validation, Bonferroni known values,
198
+ mixed `PipelineResult`/`Results` input, order preservation,
199
+ multi-effect and missing-p-value rejection, and the result surface.
200
+
201
+ ### Tests — Phase 5
202
+
203
+ - 61 tests across `tests/diagnostics/`: 23 for `SRMTest` (creation/
204
+ validation, design-inferred and explicit expectations, factorial,
205
+ the result surface, unsupported input), 20 for `BalanceReport`
206
+ (balanced/imbalanced detection, threshold control, covariate subsets,
207
+ constant covariates, blocked designs, factorial rejection, NaN
208
+ propagation, result surface), and 18 for `AATest` (creation/validation,
209
+ calibrated-pipeline calibration, reproducibility, miscalibration flag,
210
+ no-p-value rejection, result surface, and a slow nested run with
211
+ `RandomizationTest`).
212
+
213
+ ### Tests — Phase 4.4
214
+
215
+ - 36 tests for `BootstrapCI` across 10 grouping classes covering creation
216
+ and parameter validation, assignment-type and multi-effect rejection,
217
+ insufficient-data fail-fast (including matched-pair blocks), the
218
+ degenerate BCa case, fitted attributes, estimate output (percentile and
219
+ BCa extras, superpopulation override), reproducibility, percentile-vs-BCa
220
+ agreement on symmetric data (slow), blocked designs, estimator-agnostic
221
+ smoke tests (`LinEstimator`, `CUPED`), assignment immutability, and slow
222
+ numerics (Monte Carlo coverage and agreement of the bootstrap SE with the
223
+ `NeymanCI` SE).
224
+
225
+ ### Tests — Phase 4.3
226
+
227
+ - 30 tests for `NeymanCI` across 8 grouping classes covering creation and
228
+ alpha validation, estimator-whitelist and assignment-type rejection
229
+ (factorial, multi-effect, superpopulation), insufficient-data fail-fast,
230
+ hand-checked CRD and stratified-blocked variance, the Wald CI/p-value,
231
+ rerandomization acceptance, assignment immutability, and slow Monte
232
+ Carlo coverage (CRD and blocked, near the nominal 95%).
233
+
234
+ ### Tests — Phase 4.2
235
+
236
+ - 25 tests for `MultipleTestingCorrection` across 7 grouping classes
237
+ covering creation validations, input validation (rejects scalar
238
+ single, multi-effect without p_value, empty list, mixed list, double
239
+ correction), Bonferroni (known values, clipping, ordering),
240
+ Holm (known values, monotonicity, dominance over Bonferroni,
241
+ clipping), Benjamini-Hochberg (known values, monotonicity, FDR
242
+ control via 1000-rep simulation marked `slow`), multi-effect input
243
+ (effects/metadata preservation, alpha override), and list input
244
+ (order preservation, per-Results metadata, family metadata in each
245
+ element).
246
+
247
+ ### Tests — Phase 4.1
248
+
249
+ - 36 tests for `RandomizationTest` across 10 grouping classes covering creation,
250
+ validation, fit/estimate behavior, statistical properties (slow), reproducibility,
251
+ rerandomization (Mahalanobis preservation under draws), blocking (per-block
252
+ treatment count preservation), integration with all four Phase 3 estimators
253
+ (`DifferenceInMeans`, `BlockedDifferenceInMeans`, `LinEstimator`, `CUPED`), and
254
+ alternative hypothesis behavior. `FactorialAssignment` and multi-effect estimators
255
+ are explicitly rejected (deferred to v2).
256
+ - 15 new tests in `tests/core/test_base.py` covering the extended `BaseInference`
257
+ contract and snapshot tests pinning the `DesignEstimatorMismatch` message format
258
+ after the `_check_assignment_type` refactor.
259
+
260
+ ### Status
261
+
262
+ The v1 feature set (Phases 0–7) is complete: design, estimation,
263
+ randomization/finite-population/superpopulation inference, multiple-testing
264
+ correction, diagnostics, pipeline composition, and reporting.
265
+
266
+ Deferred to v2 (see `ROADMAP.md`): `SequentialTest` (mSPRT/always-valid),
267
+ Benjamini-Yekutieli correction, CUPED/Lin variance in `NeymanCI`,
268
+ studentized bootstrap, block-resampling bootstrap, subgroup comparison,
269
+ a plotly backend, and interactive dashboards.
270
+
271
+ ## [0.1.0-dev] - 2026-05-31
272
+
273
+ ### Added — Phase 0: Scaffold
274
+
275
+ - Project scaffold: `pyproject.toml`, `README.md`, `.pre-commit-config.yaml`, GitHub Actions CI.
276
+ - Package structure: `core`, `design`, `estimators`, `inference`, `diagnostics`, `reporting`.
277
+ - Custom exceptions hierarchy: `SkxperimentsError`, `DesignEstimatorMismatch`, `NotFittedError`,
278
+ `InsufficientDataError`, `InvalidDesignError`.
279
+
280
+ ### Added — Phase 1: Core
281
+
282
+ - `PotentialOutcomes` class: unit-level Y(0), Y(1), ITE, ATE.
283
+ - `BaseAssignment` (ABC) with abstract `draw(seed)` method for randomization-based inference;
284
+ `CRDAssignment`, `BlockedAssignment`, `FactorialAssignment` concrete subclasses.
285
+ - `Results` class with mutually exclusive scalar (`ate`) and multi-effect
286
+ (`effects: dict[tuple[str, ...], float]`) modes; auto-populated metadata fields
287
+ (`estimator_name`, `design_name`, `n_obs`, `n_treated`, `n_control`).
288
+ - `BaseDesign`, `BaseEstimator`, `BaseInference` abstract base classes;
289
+ `_validate_assignment_type` accepting `type | tuple[type, ...]`.
290
+ - `DiagnosticsReport` dataclass.
291
+
292
+ ### Added — Phase 2: Designs
293
+
294
+ - **`CRD`**: completely randomized design with `n_treated` (absolute) or `p` (proportion),
295
+ mutually exclusive.
296
+ - **`BlockedCRD`**: independent randomization within blocks; `BlockedAssignment` carries
297
+ `block_col_` and `block_sizes_`.
298
+ - **`ReRandomizedCRD`**: Mahalanobis acceptance criterion (Morgan & Rubin 2012); covariance
299
+ matrix cached in `CRDAssignment.rerandomization_metadata` and reused in `draw()` without
300
+ recomputation; rejects on singular covariance.
301
+ - **`FactorialDesign`**: 2^K with equal cell sizes; little-endian cell encoding
302
+ (`cell_idx = sum(factor_value * 2^i for i, factor in enumerate(factors))`);
303
+ `FactorialAssignment.cell_ids(**factor_values)` for cell selection.
304
+ - **`check_balance(assignment, covariates)`**: standardized mean differences with pooled std
305
+ `sqrt((var_t + var_c) / 2)`, `ddof=1` (Austin 2009, Stuart 2010).
306
+ - **`power_analysis(...)`**: keyword-only function resolving one of `n`, `mde`, or `power`
307
+ given the other two; normal approximation; supports unequal allocation.
308
+
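Two of the formulas in the Phase 2 entries are small enough to show directly: the standardized mean difference with pooled standard deviation `sqrt((var_t + var_c) / 2)`, and the little-endian cell index of a 2^K factorial. The sketch below is illustrative and uses hypothetical function names operating on plain lists, not the `check_balance` or `FactorialAssignment` APIs.

```python
from statistics import mean, variance


def standardized_mean_difference(x_treated, x_control):
    """SMD with pooled SD sqrt((var_t + var_c) / 2), sample variances (ddof=1)."""
    pooled_sd = ((variance(x_treated) + variance(x_control)) / 2) ** 0.5
    # A covariate that is constant in both arms gives pooled_sd == 0,
    # so the SMD is undefined and this line raises ZeroDivisionError.
    return (mean(x_treated) - mean(x_control)) / pooled_sd


def cell_index(factor_values):
    """Little-endian cell index: the first factor is the least significant bit."""
    return sum(value * 2 ** i for i, value in enumerate(factor_values))
```

For example, factor levels `(1, 0, 1)` map to cell `1 + 0 + 4 = 5`, and covariate values `[2, 4, 6]` versus `[1, 3, 5]` give an SMD of 0.5, above the 0.1 threshold mentioned for `BalanceReport`.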
309
+ ### Added — Phase 3: Estimators
310
+
311
+ - **`DifferenceInMeans`**: simple ATE for `CRDAssignment` (including rerandomized);
312
+ rejects `BlockedAssignment` and `FactorialAssignment`.
313
+ - **`BlockedDifferenceInMeans`**: size-weighted SATE for `BlockedAssignment`
314
+ (Imbens & Rubin 2015); unbiased without within-block variance assumptions.
315
+ - **`FactorialEstimator`**: all 2^K − 1 orthogonal contrasts (main effects and interactions
316
+ of all orders) for `FactorialAssignment`; alphabetical effect keys via
317
+ `itertools.combinations(sorted(factor_cols), r)`; returns `Results` in multi-effect mode.
318
+ - **`LinEstimator`**: OLS of Y on `[1, T, X_centered, T * X_centered]` (Lin 2013);
319
+ accepts `CRDAssignment` or `BlockedAssignment`; rejects constant covariates.
320
+ `inference_mode` is documentational metadata propagated to `Results.extra`.
321
+ - **`CUPED`**: pre-experiment covariate adjustment (Deng et al. 2013); `theta_` and
322
+ `correlation_` propagated via `Results.extra`. v1 accepts only `CRDAssignment`;
323
+ blocked extension planned for v2.
324
+
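The CUPED adjustment referenced above (Deng et al. 2013) replaces each outcome with `Y - theta * (X - mean(X))`, where `theta = cov(Y, X) / var(X)` is computed from a pre-experiment covariate. The following is a minimal sketch under that standard definition, with a hypothetical function name; it is not the `CUPED` estimator class and assumes the covariate is not constant.

```python
from statistics import mean


def cuped_adjust(y, x):
    """Return (theta, adjusted outcomes) for outcome y and pre-experiment covariate x."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    # The shared normalising constant cancels in the ratio.
    theta = cov_xy / var_x
    # Centering x keeps the mean of the adjusted outcome equal to mean(y).
    adjusted = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
    return theta, adjusted
```

The ATE is then estimated as a difference in means of the adjusted outcomes; the adjustment leaves the overall mean unchanged and shrinks the variance in proportion to the squared correlation between `y` and `x`.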
325
+ ### Architectural decisions
326
+
327
+ The library has 19 numbered architectural decisions documented in source comments and base
328
+ class docstrings. Notable ones:
329
+
330
+ - The `Assignment` object carries a reference to the generating design (`design_`) so
331
+ `draw()` can replay the randomization mechanism for inference.
332
+ - The outcome variable is **not** part of the `Assignment` contract; estimators receive
333
+ `outcome_col` as a parameter and resolve it against `assignment.data_` at fit time.
334
+ - `Results` is the uniform output object; estimators auto-populate `estimator_name` and
335
+ `design_name`. Inference classes (Phase 4) will populate `inference_name`.
336
+ - `inference_mode` (`finite_population` vs. `superpopulation`) is metadata in Phase 3;
337
+ it will live as a `BaseInference` attribute in Phase 4.
338
+
339
+ ### Tests
340
+
341
+ ~452 tests across `tests/core/`, `tests/design/`, `tests/estimators/`, and
342
+ `tests/integration/`. All passing on CI. Tests follow a strict pattern of grouping classes
343
+ (`TestXxxCreation`, `TestXxxValidation`, `TestXxxNumerics`, etc.) with seeded helpers.