ars-sigma 1.0.0__py3-none-any.whl
- ars_sigma-1.0.0.dist-info/METADATA +940 -0
- ars_sigma-1.0.0.dist-info/RECORD +27 -0
- ars_sigma-1.0.0.dist-info/WHEEL +5 -0
- ars_sigma-1.0.0.dist-info/licenses/LICENSE.txt +1394 -0
- ars_sigma-1.0.0.dist-info/licenses/NOTICE.txt +16 -0
- ars_sigma-1.0.0.dist-info/top_level.txt +1 -0
- sigma/__init__.py +66 -0
- sigma/_export.py +909 -0
- sigma/_extension.py +48 -0
- sigma/_graphviz.py +367 -0
- sigma/_node.py +459 -0
- sigma/_palette.py +168 -0
- sigma/_partition.py +282 -0
- sigma/_ranking.py +481 -0
- sigma/_response_plot.py +609 -0
- sigma/_splitting.py +282 -0
- sigma/_statistics.py +389 -0
- sigma/_survival.py +553 -0
- sigma/_tree.py +2050 -0
- sigma/_tree_classification.py +630 -0
- sigma/_tree_ranking.py +900 -0
- sigma/_tree_regression.py +685 -0
- sigma/_tree_sql.py +259 -0
- sigma/_tree_survival.py +880 -0
- sigma/_tree_text.py +906 -0
- sigma/_types.py +235 -0
- sigma/util/__init__.py +1 -0
|
@@ -0,0 +1,940 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: ars-sigma
|
|
3
|
+
Version: 1.0.0
|
|
4
|
+
Summary: Conditional inference trees for Python
|
|
5
|
+
Author: ArsChitectura SAS
|
|
6
|
+
License-Expression: LicenseRef-ArsChitectura-Sigma
|
|
7
|
+
Project-URL: Repository, https://github.com/arschitectura/sigma
|
|
8
|
+
Project-URL: Documentation, https://arschitectura.com/products/sigma/
|
|
9
|
+
Project-URL: Contact, https://arschitectura.com/contact/
|
|
10
|
+
Requires-Python: >=3.10
|
|
11
|
+
Description-Content-Type: text/markdown
|
|
12
|
+
License-File: LICENSE.txt
|
|
13
|
+
License-File: NOTICE.txt
|
|
14
|
+
Requires-Dist: numpy>=1.24
|
|
15
|
+
Requires-Dist: scipy>=1.11
|
|
16
|
+
Requires-Dist: scikit-learn>=1.3
|
|
17
|
+
Requires-Dist: typing_extensions>=4.0
|
|
18
|
+
Provides-Extra: viz
|
|
19
|
+
Requires-Dist: graphviz>=0.20; extra == "viz"
|
|
20
|
+
Requires-Dist: matplotlib>=3.8; extra == "viz"
|
|
21
|
+
Dynamic: license-file
|
|
22
|
+
|
|
23
|
+
# Sigma
|
|
24
|
+
|
|
25
|
+
<img src="https://arschitectura.com/medias/sigma_small.webp" alt="Sigma" width="200" height="200" align="right">
|
|
26
|
+
|
|
27
|
+
**Conditional inference trees for Python.**
|
|
28
|
+
|
|
29
|
+
Provides classification (`ClassificationTree`), regression
|
|
30
|
+
(`RegressionTree`), survival-analysis (`SurvivalTree`), and ranking
|
|
31
|
+
(`RankingTree`) estimators, compatible with scikit-learn.
|
|
32
|
+
|
|
33
|
+
- **Unbiased splits** - permutation-based p-values decouple variable selection from split search, avoiding CART's bias toward variables with many possible splits
|
|
34
|
+
- **Interpretable by construction** - each split is a statistical hypothesis test with a reported p-value, and fitted trees render to PNG/SVG via `to_image`
|
|
35
|
+
- **scikit-learn compatible** - `ClassificationTree`, `RegressionTree`, `SurvivalTree`, and `RankingTree` drop into any sklearn pipeline
|
|
36
|
+
|
|
37
|
+
Every statistical method in Sigma comes from a [peer-reviewed paper](#references).
|
|
38
|
+
|
|
39
|
+
## 1. License
|
|
40
|
+
|
|
41
|
+
Governed by the [**Sigma License**](./LICENSE.txt). This is a
|
|
42
|
+
source-available license, not OSI-approved open source. Commercial use
|
|
43
|
+
is permitted with attribution. ArsChitectura SAS retains an at-will right to
|
|
44
|
+
revoke the license, at any time, for any reason. Licensee shall consult
|
|
45
|
+
[Licensor's organization website](https://arschitectura.com/products/sigma/) and
|
|
46
|
+
the [Software's project homepage](https://github.com/arschitectura/sigma) at
|
|
47
|
+
least once every ninety (90) days.
|
|
48
|
+
|
|
49
|
+
**AI, ML, and other automated ingestion of this library, its
|
|
50
|
+
documentation, or any derivative work is prohibited**, except to
|
|
51
|
+
generate your own client code that calls Sigma's public API.
|
|
52
|
+
|
|
53
|
+
Modification of Sigma is permitted only as preparation for a Contribution
|
|
54
|
+
to the Canonical Repository; see [`CONTRIBUTING.md`](./CONTRIBUTING.md)
|
|
55
|
+
for the lifecycle.
|
|
56
|
+
|
|
57
|
+
External contributors must sign the CLA in [`CONTRIBUTING.md`](./CONTRIBUTING.md)
|
|
58
|
+
before a pull request can be accepted.
|
|
59
|
+
|
|
60
|
+
If your needs exceed this, a paid, non-revocable commercial license is available on [request](https://arschitectura.com/contact/).
|
|
61
|
+
|
|
62
|
+
## 2. Support
|
|
63
|
+
|
|
64
|
+
Read the [documentation](https://arschitectura.com/products/sigma/).
|
|
65
|
+
|
|
66
|
+
Have questions, feedback, or need help getting started? I would love to hear from you - [get in touch](https://arschitectura.com/contact/).
|
|
67
|
+
|
|
68
|
+
<div align="center">
|
|
69
|
+
<a href="https://arschitectura.com/contact/">
|
|
70
|
+
<img src="https://arschitectura.com/medias/card.webp" alt="Card" width="500" height="311">
|
|
71
|
+
</a>
|
|
72
|
+
</div>
|
|
73
|
+
|
|
74
|
+
## 3. Installation
|
|
75
|
+
|
|
76
|
+
```bash
|
|
77
|
+
pip install ars-sigma
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
## 4. Sample Trees
|
|
81
|
+
|
|
82
|
+
Four trees fitted on classic datasets. Each subsection shows the
|
|
83
|
+
fit code, the `to_text` rendering, and the rendered tree and response
|
|
84
|
+
images. Click an image to view it at full size.
|
|
85
|
+
|
|
86
|
+
### 4.1. Titanic (classification)
|
|
87
|
+
|
|
88
|
+
Predicting survival probability with a Jeffreys 95% confidence
|
|
89
|
+
interval at each node - surfaces passenger class, sex, and age.
|
|
90
|
+
|
|
91
|
+
```python
|
|
92
|
+
tree = sigma.ClassificationTree(random_state=123)
|
|
93
|
+
tree.fit(X, y)
|
|
94
|
+
print(tree.to_text(precision=1))
|
|
95
|
+
tree.to_image("png", "titanic.png", precision=1)
|
|
96
|
+
tree.to_image("png", "titanic_response.png", kind="response")
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
```
|
|
100
|
+
Died proba. Survived proba. Obs. count Obs. share Split p-value Leaf index
|
|
101
|
+
---------------------- ---------------------- ---------- ---------- ------------- ----------
|
|
102
|
+
All records 59.6% (55.9% to 63.1%) 40.4% (36.9% to 44.1%) 712 100.0% 0.02%
|
|
103
|
+
├── Passenger class is "1st" or "2nd" 43.1% (38.1% to 48.3%) 56.9% (51.7% to 61.9%) 357 50.1% 0.02%
|
|
104
|
+
│ ├── Sex is "female" 5.7% (2.9% to 10.2%) 94.3% (89.8% to 97.1%) 157 22.1% 6
|
|
105
|
+
│ └── Sex is "male" 72.5% (66.0% to 78.3%) 27.5% (21.7% to 34.0%) 200 28.1% 0.08%
|
|
106
|
+
│ ├── Passenger class is "1st" 60.4% (50.7% to 69.5%) 39.6% (30.5% to 49.3%) 101 14.2% 1.02%
|
|
107
|
+
│ │ ├── Age <= 53.0 53.2% (42.2% to 63.9%) 46.8% (36.1% to 57.8%) 79 11.1% 5
|
|
108
|
+
│ │ └── Age > 53.0 86.4% (67.9% to 96.0%) 13.6% (4.0% to 32.1%) 22 3.1% 2
|
|
109
|
+
│ └── Passenger class is "2nd" 84.8% (76.8% to 90.9%) 15.2% (9.1% to 23.2%) 99 13.9% 0.62%
|
|
110
|
+
│ ├── Age <= 12.0 0% (0% to 23.8%) 100% (76.2% to 100%) 9 1.3% 7
|
|
111
|
+
│ └── Age > 12.0 93.3% (86.8% to 97.2%) 6.7% (2.8% to 13.2%) 90 12.6% 1
|
|
112
|
+
└── Passenger class is "3rd" 76.1% (71.4% to 80.3%) 23.9% (19.7% to 28.6%) 355 49.9% 0.02%
|
|
113
|
+
├── Sex is "female" 53.9% (44.3% to 63.4%) 46.1% (36.6% to 55.7%) 102 14.3% 4
|
|
114
|
+
└── Sex is "male" 85.0% (80.2% to 89.0%) 15.0% (11.0% to 19.8%) 253 35.5% 3
|
|
115
|
+
```
|
|
116
|
+
|
|
117
|
+
<table>
|
|
118
|
+
<tr>
|
|
119
|
+
<td><a href="https://arschitectura.com/medias/sigma_titanic.png"><img src="https://arschitectura.com/medias/sigma_titanic.png" alt="Tree fitted on the Titanic dataset"></a></td>
|
|
120
|
+
</tr>
|
|
121
|
+
<tr>
|
|
122
|
+
<td><a href="https://arschitectura.com/medias/sigma_titanic_response.png"><img src="https://arschitectura.com/medias/sigma_titanic_response.png" alt="Response plot for the Titanic dataset"></a></td>
|
|
123
|
+
</tr>
|
|
124
|
+
</table>
|
|
125
|
+
|
|
126
|
+
### 4.2. Diabetes (regression)
|
|
127
|
+
|
|
128
|
+
Predicting one-year disease progression with a Bayesian-bootstrap 95%
|
|
129
|
+
confidence interval at each node - surfaces BMI, triglycerides, and blood
|
|
130
|
+
pressure. BMI splits recursively, so the example calls `compact()` (see
|
|
131
|
+
section 5.5) to fold that chain into one multi-way node.
|
|
132
|
+
|
|
133
|
+
```python
|
|
134
|
+
tree = sigma.RegressionTree(
|
|
135
|
+
test_type="monte_carlo",
|
|
136
|
+
resamples=2000,
|
|
137
|
+
random_state=123,
|
|
138
|
+
reverse_order=True,
|
|
139
|
+
)
|
|
140
|
+
tree.fit(X, y)
|
|
141
|
+
compact_tree = tree.compact()
|
|
142
|
+
print(compact_tree.to_text(precision=1))
|
|
143
|
+
compact_tree.to_image(
|
|
144
|
+
"png", "diabetes.png", orientation="left-to-right", precision=1
|
|
145
|
+
)
|
|
146
|
+
compact_tree.to_image("png", "diabetes_response.png", kind="response")
|
|
147
|
+
```
|
|
148
|
+
|
|
149
|
+
```
|
|
150
|
+
Disease progression mean Obs. count Obs. share Split p-value Leaf index
|
|
151
|
+
------------------------ ---------- ---------- ------------- ----------
|
|
152
|
+
All records 152.1 (145.1 to 159.4) 442 100.0%
|
|
153
|
+
├── BMI <= 24.4 105.0 (97.6 to 112.9) 165 37.3% 0.05%
|
|
154
|
+
│ ├── Triglycerides (log) <= 4.6 93.9 (86.7 to 101.4) 123 27.8% 8
|
|
155
|
+
│ └── Triglycerides (log) > 4.6 137.7 (121.6 to 153.6) 42 9.5% 6
|
|
156
|
+
├── 24.4 < BMI <= 27.2 144.0 (131.5 to 157.1) 112 25.3% 0.05%
|
|
157
|
+
│ ├── Total-to-HDL ratio <= 4.8 118.4 (104.3 to 133.9) 65 14.7% 0.05%
|
|
158
|
+
│ │ ├── Triglycerides (log) <= 4.8 102.2 (89.5 to 116.4) 53 12.0% 7
|
|
159
|
+
│ │ └── Triglycerides (log) > 4.8 190.0 (162.1 to 217.8) 12 2.7% 3
|
|
160
|
+
│ └── Total-to-HDL ratio > 4.8 179.3 (160.8 to 197.9) 47 10.6% 4
|
|
161
|
+
└── BMI > 27.2 204.8 (193.8 to 215.4) 165 37.3% 0.05%
|
|
162
|
+
├── Blood pressure <= 111.8 189.3 (176.8 to 201.6) 124 28.1% 0.05%
|
|
163
|
+
│ ├── Triglycerides (log) <= 5.1 171.1 (156.4 to 186.8) 80 18.1% 5
|
|
164
|
+
│ └── Triglycerides (log) > 5.1 222.5 (205.0 to 238.2) 44 10.0% 2
|
|
165
|
+
└── Blood pressure > 111.8 251.4 (235.1 to 265.8) 41 9.3% 1
|
|
166
|
+
```
|
|
167
|
+
|
|
168
|
+
<table>
|
|
169
|
+
<tr>
|
|
170
|
+
<td><a href="https://arschitectura.com/medias/sigma_diabetes.png"><img src="https://arschitectura.com/medias/sigma_diabetes.png" alt="Tree fitted on the Diabetes dataset"></a></td>
|
|
171
|
+
</tr>
|
|
172
|
+
<tr>
|
|
173
|
+
<td><a href="https://arschitectura.com/medias/sigma_diabetes_response.png"><img src="https://arschitectura.com/medias/sigma_diabetes_response.png" alt="Response plot for the Diabetes dataset"></a></td>
|
|
174
|
+
</tr>
|
|
175
|
+
</table>
|
|
176
|
+
|
|
177
|
+
### 4.3. GBSG-2 breast cancer (survival)
|
|
178
|
+
|
|
179
|
+
Predicting recurrence-free years with a Brookmeyer-Crowley 95%
|
|
180
|
+
confidence interval at each node - splits on positive lymph nodes,
|
|
181
|
+
hormone therapy, and progesterone receptor level.
|
|
182
|
+
|
|
183
|
+
```python
|
|
184
|
+
tree = sigma.SurvivalTree(
|
|
185
|
+
random_state=123,
|
|
186
|
+
metrics=("median", ("survival", 5.0, "years")),
|
|
187
|
+
)
|
|
188
|
+
tree.fit(X, y)
|
|
189
|
+
print(tree.to_text(precision=1))
|
|
190
|
+
tree.to_image("png", "breast_cancer.png", precision=1)
|
|
191
|
+
tree.to_image("png", "breast_cancer_response.png", kind="response")
|
|
192
|
+
```
|
|
193
|
+
|
|
194
|
+
```
|
|
195
|
+
Median Recurrence-free years Survival at 5 years Obs. count Obs. share Split p-value Leaf index
|
|
196
|
+
---------------------------- ---------------------- ---------- ---------- ------------- ----------
|
|
197
|
+
All records 4.9 (4.2 to 5.5) 49.2% (44.6% to 53.6%) 686 100.0% 0.02%
|
|
198
|
+
├── Tumor size <= 19 unknown (5.4 to unknown) 65.6% (55.4% to 73.9%) 135 19.7% 0.04%
|
|
199
|
+
│ ├── Positive lymph nodes <= 2 unknown (unknown bounds) 80.7% (68.2% to 88.7%) 77 11.2% 7
|
|
200
|
+
│ └── Positive lymph nodes > 2 4.7 (2.6 to 5.5) 44.7% (29.2% to 59.1%) 58 8.5% 3.52%
|
|
201
|
+
│ ├── Age > 41 5.4 (3.5 to unknown) 54.6% (36.2% to 69.8%) 49 7.1% 5
|
|
202
|
+
│ └── Age <= 41 1.3 (0.7 to 3.2) 0% (0% to 0%) 9 1.3% 1
|
|
203
|
+
└── Tumor size > 19 4.3 (3.7 to 5.0) 44.9% (39.7% to 49.9%) 551 80.3% 0.02%
|
|
204
|
+
├── Positive lymph nodes <= 4 5.7 (4.8 to unknown) 54.7% (47.7% to 61.0%) 332 48.4% 1.10%
|
|
205
|
+
│ ├── Hormone therapy is true unknown (5.6 to unknown) 67.7% (56.3% to 76.7%) 114 16.6% 6
|
|
206
|
+
│ └── Hormone therapy is false 4.8 (3.9 to unknown) 47.1% (38.3% to 55.4%) 218 31.8% 4
|
|
207
|
+
└── Positive lymph nodes > 4 2.4 (2.0 to 3.1) 29.9% (22.6% to 37.5%) 219 31.9% 0.10%
|
|
208
|
+
├── Progesterone receptor level > 24 4.1 (2.7 to 5.5) 43.8% (31.4% to 55.4%) 107 15.6% 3
|
|
209
|
+
└── Progesterone receptor level <= 24 1.7 (1.4 to 2.2) 17.3% (10.0% to 26.3%) 112 16.3% 2
|
|
210
|
+
```
|
|
211
|
+
|
|
212
|
+
<table>
|
|
213
|
+
<tr>
|
|
214
|
+
<td><a href="https://arschitectura.com/medias/sigma_breast_cancer.png"><img src="https://arschitectura.com/medias/sigma_breast_cancer.png" alt="Tree fitted on the GBSG-2 breast cancer dataset"></a></td>
|
|
215
|
+
</tr>
|
|
216
|
+
<tr>
|
|
217
|
+
<td><a href="https://arschitectura.com/medias/sigma_breast_cancer_response.png"><img src="https://arschitectura.com/medias/sigma_breast_cancer_response.png" alt="Response plot for the GBSG-2 breast cancer dataset"></a></td>
|
|
218
|
+
</tr>
|
|
219
|
+
</table>
|
|
220
|
+
|
|
221
|
+
### 4.4. Sushi (ranking)
|
|
222
|
+
|
|
223
|
+
Predicting per-item Plackett-Luce expected rank with a Bayesian-bootstrap
|
|
224
|
+
95% confidence interval at each node - surfaces sex and age group as the
|
|
225
|
+
strongest demographic drivers of sushi preference among 5000 Japanese
|
|
226
|
+
respondents ranking ten classic sushi. Age group splits recursively on the
|
|
227
|
+
male side, so the example calls `compact()` (see section 5.5) to fold that
|
|
228
|
+
chain into one multi-way node.
|
|
229
|
+
|
|
230
|
+
```python
|
|
231
|
+
tree = sigma.RankingTree(
|
|
232
|
+
pca_components=10,
|
|
233
|
+
random_state=123,
|
|
234
|
+
max_depth=3,
|
|
235
|
+
)
|
|
236
|
+
tree.fit(X, rankings)
|
|
237
|
+
compact_tree = tree.compact()
|
|
238
|
+
print(compact_tree.to_text(precision=2))
|
|
239
|
+
compact_tree.to_image(
|
|
240
|
+
"png", "sushi.png", orientation="left-to-right", precision=2
|
|
241
|
+
)
|
|
242
|
+
compact_tree.to_image("png", "sushi_response.png", kind="response")
|
|
243
|
+
```
|
|
244
|
+
|
|
245
|
+
```
|
|
246
|
+
Ebi rank Anago rank Maguro rank Uni rank Tamago rank Obs. count Obs. share Split p-value Leaf index
|
|
247
|
+
------------------- ------------------- ------------------- ------------------- ------------------- ---------- ---------- ------------- ----------
|
|
248
|
+
All records 4.94 (4.86 to 5.01) 5.39 (5.32 to 5.46) 4.37 (4.31 to 4.43) 6.07 (5.98 to 6.16) 3.24 (3.15 to 3.31) 5000 100.0% <1e-300
|
|
249
|
+
├── Gender is "male" 5.16 (5.05 to 5.25) 5.15 (5.06 to 5.28) 4.22 (4.11 to 4.33) 5.66 (5.50 to 5.77) 2.90 (2.82 to 3.01) 2373 47.5%
|
|
250
|
+
│ ├── Age group is "30-39" 5.27 (5.14 to 5.43) 5.04 (4.83 to 5.23) 4.31 (4.18 to 4.45) 5.53 (5.30 to 5.75) 2.83 (2.71 to 2.99) 830 16.6% 6
|
|
251
|
+
│ ├── Age group is "40-49", "50-59", or "60+" 5.21 (5.04 to 5.39) 5.28 (5.12 to 5.45) 4.29 (4.15 to 4.44) 5.03 (4.80 to 5.23) 2.96 (2.81 to 3.13) 884 17.7% 0.60%
|
|
252
|
+
│ │ ├── Childhood region is "Tohoku", "Hokuriku", "Kanto+Shizuoka", "Nagoya", "Kinki", "Chugoku", or "Okinawa" 5.31 (5.13 to 5.50) 5.17 (4.97 to 5.34) 4.29 (4.14 to 4.43) 5.10 (4.88 to 5.33) 2.94 (2.76 to 3.15) 735 14.7% 7
|
|
253
|
+
│ │ └── Childhood region is "Hokkaido", "Shikoku", "Kyushu", or "abroad" 4.74 (4.36 to 5.13) 5.81 (5.37 to 6.23) 4.32 (3.98 to 4.66) 4.67 (4.23 to 5.19) 3.03 (2.66 to 3.35) 149 3.0% 3
|
|
254
|
+
│ └── Age group is "15-19" or "20-29" 4.96 (4.78 to 5.14) 5.14 (4.95 to 5.34) 4.01 (3.84 to 4.17) 6.54 (6.30 to 6.73) 2.93 (2.73 to 3.15) 659 13.2% 5
|
|
255
|
+
└── Gender is "female" 4.73 (4.64 to 4.82) 5.62 (5.53 to 5.72) 4.52 (4.43 to 4.59) 6.44 (6.32 to 6.55) 3.54 (3.43 to 3.66) 2627 52.5% <1e-300
|
|
256
|
+
├── Age group is "20-29", "30-39", "40-49", "50-59", or "60+" 4.73 (4.63 to 4.82) 5.55 (5.44 to 5.66) 4.56 (4.48 to 4.65) 6.30 (6.13 to 6.41) 3.57 (3.46 to 3.68) 2429 48.6% <1e-300
|
|
257
|
+
│ ├── Childhood region is "Hokuriku", "Kanto+Shizuoka", "Kinki", "Chugoku", "Kyushu", or "abroad" 4.90 (4.78 to 5.01) 5.43 (5.28 to 5.57) 4.54 (4.46 to 4.63) 6.44 (6.31 to 6.59) 3.50 (3.38 to 3.62) 1833 36.7% 4
|
|
258
|
+
│ └── Childhood region is "Hokkaido", "Tohoku", "Nagoya", "Shikoku", or "Okinawa" 4.18 (4.02 to 4.38) 5.90 (5.70 to 6.12) 4.64 (4.47 to 4.82) 5.84 (5.55 to 6.13) 3.78 (3.61 to 3.99) 596 11.9% 1
|
|
259
|
+
└── Age group is "15-19" 4.73 (4.37 to 5.11) 6.46 (6.08 to 6.79) 3.96 (3.68 to 4.30) 7.90 (7.60 to 8.25) 3.25 (2.88 to 3.64) 198 4.0% 2
|
|
260
|
+
```
|
|
261
|
+
|
|
262
|
+
<table>
|
|
263
|
+
<tr>
|
|
264
|
+
<td><a href="https://arschitectura.com/medias/sigma_sushi.png"><img src="https://arschitectura.com/medias/sigma_sushi.png" alt="Tree fitted on the Sushi preference dataset"></a></td>
|
|
265
|
+
</tr>
|
|
266
|
+
<tr>
|
|
267
|
+
<td><a href="https://arschitectura.com/medias/sigma_sushi_response.png"><img src="https://arschitectura.com/medias/sigma_sushi_response.png" alt="Response plot for the Sushi preference dataset"></a></td>
|
|
268
|
+
</tr>
|
|
269
|
+
</table>
|
|
270
|
+
|
|
271
|
+
## 5. Advanced usage
|
|
272
|
+
|
|
273
|
+
### 5.1. Controlling tree depth and node size
|
|
274
|
+
|
|
275
|
+
`alpha` is the principal knob: it sets the significance threshold for
|
|
276
|
+
every split test, so lowering it produces a terser, more statistically
|
|
277
|
+
conservative tree, and raising it produces a richer, more exploratory
|
|
278
|
+
one. `min_splits`, `min_buckets`, and `max_depth` are secondary safety
|
|
279
|
+
bounds, shared between `RegressionTree` and `ClassificationTree`.
|
|
280
|
+
|
|
281
|
+
```python
from sigma import ClassificationTree

tree = ClassificationTree(
    alpha=0.05,   # significance threshold for every split test (the principal knob)
    max_depth=4,  # maximum tree depth (None = unlimited)
)
```
|
|
286
|
+
|
|
287
|
+
### 5.2. Fitting with sample weights
|
|
288
|
+
|
|
289
|
+
Sample weights let you model **variable exposures** - per-row
|
|
290
|
+
time-at-risk, insurance policy-years, or frequency weights for
|
|
291
|
+
pre-aggregated rows. A weight of `k` is equivalent to observing the
|
|
292
|
+
sample `k` times.
|
|
293
|
+
|
|
294
|
+
```python
|
|
295
|
+
import numpy
|
|
296
|
+
from sigma import RegressionTree
|
|
297
|
+
|
|
298
|
+
n = 200
|
|
299
|
+
X = numpy.random.randn(n, 2)
|
|
300
|
+
claim_amount = numpy.where(X[:, 0] > 0, 1500.0, 300.0) + 100 * numpy.random.randn(n)
|
|
301
|
+
exposure_years = numpy.random.uniform(0.1, 2.0, size=n)
|
|
302
|
+
|
|
303
|
+
tree = RegressionTree()
|
|
304
|
+
tree.fit(X, claim_amount, sample_weight=exposure_years)
|
|
305
|
+
predictions = tree.predict(X)
|
|
306
|
+
```
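
The frequency-weight reading above can be checked on plain numbers, independently of Sigma. The following is a minimal sketch that uses only NumPy and hypothetical data (no Sigma calls): a weighted mean with integer weights matches the mean of the row-expanded data.

```python
import numpy

# Hypothetical pre-aggregated rows: a response value and how often it was observed.
values = numpy.array([300.0, 1500.0, 900.0])
counts = numpy.array([3, 1, 2])

# Weighted mean, treating each count as a frequency weight.
weighted_mean = numpy.average(values, weights=counts)

# The same statistic after repeating every row `count` times.
expanded = numpy.repeat(values, counts)
expanded_mean = expanded.mean()

assert numpy.isclose(weighted_mean, expanded_mean)
print(weighted_mean, expanded_mean)
```

Fractional weights such as exposure years have no literal row expansion, but they enter the same weighted sums.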
|
|
307
|
+
|
|
308
|
+
### 5.3. Visualizing the tree
|
|
309
|
+
|
|
310
|
+
Install the optional visualization extra and the Graphviz system
|
|
311
|
+
binary (`brew install graphviz` on macOS):
|
|
312
|
+
|
|
313
|
+
```bash
|
|
314
|
+
pip install "ars-sigma[viz]"
|
|
315
|
+
```
|
|
316
|
+
|
|
317
|
+
Then render to PNG, PDF, SVG, or GIF:
|
|
318
|
+
|
|
319
|
+
```python
|
|
320
|
+
tree.to_image("png", "tree.png", feature_names=["feature_1", "feature_2"], response_name="y")
|
|
321
|
+
```
|
|
322
|
+
|
|
323
|
+
PNG and PDF additionally require `cairosvg`; SVG needs only the
|
|
324
|
+
Graphviz binary. See `to_image` and `export_graphviz` for the full set
|
|
325
|
+
of display options.
|
|
326
|
+
|
|
327
|
+
### 5.4. Exporting the tree as a SQL CASE expression
|
|
328
|
+
|
|
329
|
+
`to_sql` (and the module-level `sigma.export_sql`) emits a single SQL
|
|
330
|
+
`CASE` expression that reproduces `tree.predict` row-by-row in any
|
|
331
|
+
SQL-92/SQL-99 engine, with no extra dependencies:
|
|
332
|
+
|
|
333
|
+
```python
|
|
334
|
+
sql_expression = tree.to_sql()
|
|
335
|
+
print(sql_expression)
|
|
336
|
+
# SELECT id, (<sql_expression>) AS prediction FROM points;
|
|
337
|
+
```
|
|
338
|
+
|
|
339
|
+
```sql
CASE
    WHEN "Passenger class" IN ('1st', '2nd') THEN
        CASE
            WHEN "Sex" = 'female' THEN
                0.9426751592356688 -- Leaf 6
            WHEN "Sex" = 'male' THEN
                CASE
                    WHEN "Passenger class" = '1st' THEN
                        CASE
                            WHEN "Age" <= 53.0 THEN
                                0.46835443037974683 -- Leaf 5
                            WHEN "Age" > 53.0 THEN
                                0.13636363636363635 -- Leaf 2
                            ELSE NULL
                        END
                    WHEN "Passenger class" = '2nd' THEN
                        CASE
                            WHEN "Age" <= 12.0 THEN
                                1.0 -- Leaf 7
                            WHEN "Age" > 12.0 THEN
                                0.06666666666666667 -- Leaf 1
                            ELSE NULL
                        END
                    ELSE 0.275
                END
            ELSE 0.5686274509803921
        END
    WHEN "Passenger class" = '3rd' THEN
        CASE
            WHEN "Sex" = 'female' THEN
                0.46078431372549017 -- Leaf 4
            WHEN "Sex" = 'male' THEN
                0.15019762845849802 -- Leaf 3
            ELSE 0.23943661971830985
        END
    ELSE 0.4044943820224719
END
```
|
|
378
|
+
|
|
379
|
+
For `ClassificationTree`, pass `target_class=` to pick which class
|
|
380
|
+
probability the expression should emit. Categorical values not seen at
|
|
381
|
+
fit time evaluate to the holding node's prediction, mirroring
|
|
382
|
+
`tree.predict`. `NULL` numerical or boolean inputs fall through to
|
|
383
|
+
`ELSE NULL`; wrap in `COALESCE(..., default)` to substitute a fallback
|
|
384
|
+
value.
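
The `NULL` fall-through and the `COALESCE` wrapper can be tried without Sigma installed. The sketch below hand-copies one inner `CASE` from the listing above and evaluates it with Python's standard `sqlite3` module; the `passengers` table, its three rows, and the fallback `0.396` are illustrative assumptions, not output of `to_sql`.

```python
import sqlite3

# One inner CASE copied from the listing above (the "Age" split under 1st-class males).
case_expression = """
CASE
    WHEN "Age" <= 53.0 THEN 0.46835443037974683
    WHEN "Age" > 53.0 THEN 0.13636363636363635
    ELSE NULL
END
"""

connection = sqlite3.connect(":memory:")
connection.execute('CREATE TABLE passengers (id INTEGER, "Age" REAL)')
connection.executemany(
    "INSERT INTO passengers VALUES (?, ?)",
    [(1, 40.0), (2, 60.0), (3, None)],  # row 3 has a NULL age
)

# 0.396 is an arbitrary illustrative fallback, not a value produced by Sigma.
query = f"""
SELECT id,
       ({case_expression}) AS prediction,
       COALESCE(({case_expression}), 0.396) AS prediction_or_default
FROM passengers
ORDER BY id
"""
rows = connection.execute(query).fetchall()
for row in rows:
    print(row)
connection.close()
```

Row 3 comes back with a `NULL` prediction because both comparisons against a `NULL` age are unknown, so only the `COALESCE` column carries a value for it.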
|
|
385
|
+
|
|
386
|
+
### 5.5. Collapsing recursive splits with `compact()`
|
|
387
|
+
|
|
388
|
+
When the same feature is split over several consecutive levels, a binary
|
|
389
|
+
tree repeats that feature down a chain. `compact()` returns a new tree in
|
|
390
|
+
which each such chain is collapsed into a single multi-way node, with one
|
|
391
|
+
branch per resulting interval (numeric features) or category subset
|
|
392
|
+
(categorical features):
|
|
393
|
+
|
|
394
|
+
```python
|
|
395
|
+
compact_tree = tree.compact()
|
|
396
|
+
print(compact_tree.to_text())
|
|
397
|
+
```
|
|
398
|
+
|
|
399
|
+
The compacted tree predicts identically to the original and renders
|
|
400
|
+
through the same `to_text`, `to_image`, and `to_sql` methods. A merged
|
|
401
|
+
node spans several original splits, so it reports no single split
|
|
402
|
+
p-value. For example, a chain of consecutive splits on `Age`
|
|
403
|
+
|
|
404
|
+
```
├── Age <= 30
└── Age > 30
    ├── Age <= 50
    └── Age > 50
```

collapses into one node carrying three interval branches:

```
├── Age <= 30
├── 30 < Age <= 50
└── Age > 50
```
|
|
418
|
+
|
|
419
|
+
The original tree is left unchanged; `compact()` produces an independent
|
|
420
|
+
copy whose node ids are renumbered to match its smaller shape.
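
To make the interval bookkeeping concrete, here is a small sketch of how the thresholds of a chain on one numeric feature map to branch labels. `interval_labels` is a hypothetical helper written for this illustration; it is not part of Sigma's API and says nothing about how `compact()` is implemented.

```python
def interval_labels(feature: str, thresholds: list[float]) -> list[str]:
    """Label the intervals delimited by the thresholds of a chain of splits."""
    cuts = sorted(set(thresholds))
    if not cuts:
        raise ValueError("at least one threshold is required")
    labels = [f"{feature} <= {cuts[0]}"]
    # Each pair of neighbouring thresholds bounds one inner interval.
    for low, high in zip(cuts, cuts[1:]):
        labels.append(f"{low} < {feature} <= {high}")
    labels.append(f"{feature} > {cuts[-1]}")
    return labels


labels = interval_labels("Age", [30, 50])
print(labels)
```

A chain with `k` distinct thresholds yields `k + 1` branches, which is why the two splits on `Age` above become three branches.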
|
|
421
|
+
|
|
422
|
+
## 6. Parameters
|
|
423
|
+
|
|
424
|
+
The table below is a quick reference; each parameter has a dedicated
|
|
425
|
+
subsection further down with defaults, alternatives, and guidance on
|
|
426
|
+
when to choose each option.
|
|
427
|
+
|
|
428
|
+
| Parameter                         | Description                                                                   |
| :-------------------------------- | :---------------------------------------------------------------------------- |
| `correlation`                     | Rank-transform inputs (robust) or use raw values (classical)                  |
| `test_stat`                       | How the multivariate score is aggregated into a scalar test statistic         |
| `test_type`                       | Multiplicity adjustment applied across covariates                             |
| `alpha`                           | Significance level for the stopping rule                                      |
| `min_splits`                      | Minimum sum of weights required to attempt a split                            |
| `min_buckets`                     | Minimum sum of weights in each child node                                     |
| `max_depth`                       | Maximum tree depth                                                            |
| `categorical_features`            | Which feature columns are categorical                                         |
| `ci_method` (classification tree) | Confidence interval method for per-class proportions                          |
| `ci_method` (regression tree)     | Confidence interval method for node mean predictions                          |
| `ci_method` (ranking tree)        | Confidence interval method for per-item leaf PL-MLE expected-rank predictions |
| `npseudo`                         | Turner ghost-item pseudo-comparison weight for the per-node PL fit            |
| `pl_max_iter`                     | Maximum Hunter MM iterations per node's Plackett-Luce fit                     |
| `pl_tolerance`                    | Convergence tolerance on log-worth for the Hunter MM iteration                |
| `ci_coverage`                     | Coverage level for node-prediction confidence intervals                       |
| `transmuter`                      | Per-node data transform with post-hoc split validation                        |
| `resamples`                       | Number of permutations for `test_type="monte_carlo"`                          |
| `decorator`                       | Per-node decoration callable rendered by `to_text` / `to_image`               |
| `random_state`                    | RNG seed for permutation resampling, bootstrap CI methods, and plot jitter    |
|
|
449
|
+
|
|
450
|
+
### 6.1. `correlation`
|
|
451
|
+
|
|
452
|
+
**Default**: `"rank"`.
|
|
453
|
+
|
|
454
|
+
Score function for the test statistic.
|
|
455
|
+
|
|
456
|
+
- `"normal"` uses raw values, recovering the original Pearson-like
|
|
457
|
+
behavior from Hothorn et al. (2006). Choose this when the response is
|
|
458
|
+
well-behaved (approximately Gaussian, no heavy outliers) and you want
|
|
459
|
+
the slight power gain on truly linear associations.
|
|
460
|
+
- `"rank"` (default) rank-transforms continuous covariates and, for
|
|
461
|
+
regression, the response before computing the statistic, yielding a
|
|
462
|
+
Spearman-like nonparametric test. Robust to outliers and heavy tails.
|
|
463
|
+
The safe choice for arbitrary real-world data.
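
The robustness claim is easy to see on a toy example. The sketch below uses only NumPy and made-up data, and it compares plain correlation coefficients rather than Sigma's actual test statistic: one gross outlier flips the sign of the raw-value correlation, while the correlation of the ranks stays clearly positive.

```python
import numpy


def ranks(values):
    """Ranks 1..n of a 1-D array (assumes no ties)."""
    order = numpy.argsort(values)
    result = numpy.empty(len(values), dtype=float)
    result[order] = numpy.arange(1, len(values) + 1)
    return result


x = numpy.arange(1.0, 21.0)
y = x.copy()
y[-1] = -1000.0  # one gross outlier in an otherwise monotone relation

raw_correlation = numpy.corrcoef(x, y)[0, 1]                  # Pearson on raw values
rank_correlation = numpy.corrcoef(ranks(x), ranks(y))[0, 1]   # Pearson on ranks (Spearman)
print(raw_correlation, rank_correlation)
```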
|
|
464
|
+
|
|
465
|
+
### 6.2. `test_stat`
|
|
466
|
+
|
|
467
|
+
**Default**: `"quadratic"`.
|
|
468
|
+
|
|
469
|
+
How the multivariate score is aggregated into a scalar test statistic.
|
|
470
|
+
|
|
471
|
+
- `"maximum"` is a maximum-type statistic that concentrates power on
|
|
472
|
+
alternatives where one component dominates. Choose this when you
|
|
473
|
+
expect a single-direction effect (e.g., a binary classification where
|
|
474
|
+
only one class differs from the rest).
|
|
475
|
+
- `"quadratic"` (default) is an omnibus chi-squared form with good
|
|
476
|
+
power across general alternatives. Choose this when you have no prior
|
|
477
|
+
on the direction of association, or when the response is multivariate
|
|
478
|
+
(multi-class classification with many classes).
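
As a rough illustration of the two aggregation styles, the sketch below follows the usual formulation in the conditional inference framework of Hothorn et al. (2006): a score vector `t` with conditional mean `mu` and covariance `sigma`. The function names are hypothetical and the code is not taken from Sigma; whether Sigma computes its statistics exactly this way is an assumption.

```python
import numpy


def maximum_statistic(t, mu, sigma):
    """Largest absolute standardized component of the score vector."""
    standard_deviations = numpy.sqrt(numpy.diag(sigma))
    return float(numpy.max(numpy.abs((t - mu) / standard_deviations)))


def quadratic_statistic(t, mu, sigma):
    """Quadratic form (t - mu)' pinv(sigma) (t - mu)."""
    centered = t - mu
    return float(centered @ numpy.linalg.pinv(sigma) @ centered)


mu = numpy.zeros(3)
sigma = numpy.eye(3)
one_dominant = numpy.array([3.0, 0.0, 0.0])  # a single component carries the effect
spread_out = numpy.array([1.8, 1.8, 1.8])    # a moderate effect in every component

for name, t in (("one_dominant", one_dominant), ("spread_out", spread_out)):
    print(name, maximum_statistic(t, mu, sigma), quadratic_statistic(t, mu, sigma))
```

The maximum form ranks `one_dominant` higher, while the quadratic form ranks `spread_out` higher, which matches the guidance above on when to choose each.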
|
|
479
|
+
|
|
480
|
+
### 6.3. `test_type`
|
|
481
|
+
|
|
482
|
+
**Default**: `"sidak"`.
|
|
483
|
+
|
|
484
|
+
Multiplicity adjustment applied across the $m$ candidate covariates
|
|
485
|
+
(indexed by $j$), transforming each raw p-value $P_j$ before the
|
|
486
|
+
stopping rule fires.
|
|
487
|
+
|
|
488
|
+
- `"bonferroni"` is the closed-form $\min(m P_j, 1)$. The simplest and
|
|
489
|
+
best-known correction, strictly more conservative than Sidak under
|
|
490
|
+
independence; prefer `"sidak"` unless matching an external reference.
|
|
491
|
+
- `"monte_carlo"` is the Westfall-Young min-P resampling procedure.
|
|
492
|
+
More powerful than Sidak when covariates are correlated, at the cost
|
|
493
|
+
of $B \cdot m$ extra statistic evaluations per node, where $B$ is the
|
|
494
|
+
number of response permutations (controlled by `resamples`). Choose
|
|
495
|
+
when covariates are highly collinear and you can afford the
|
|
496
|
+
resampling budget; requires a positive `resamples`.
|
|
497
|
+
- `"sidak"` (default) is the closed-form $1 - (1 - P_j)^m$. Powerful
|
|
498
|
+
under independence or positive dependence of test statistics. The
|
|
499
|
+
recommended default.
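
The two closed-form adjustments are simple enough to evaluate by hand. The sketch below applies the formulas quoted above to three made-up raw p-values using NumPy; the function names are illustrative and are not Sigma's API.

```python
import numpy


def bonferroni(p_values):
    """min(m * P_j, 1) for each of the m raw p-values."""
    p = numpy.asarray(p_values, dtype=float)
    return numpy.minimum(p.size * p, 1.0)


def sidak(p_values):
    """1 - (1 - P_j)^m for each of the m raw p-values."""
    p = numpy.asarray(p_values, dtype=float)
    return 1.0 - (1.0 - p) ** p.size


raw = numpy.array([0.004, 0.03, 0.40])  # hypothetical raw p-values, m = 3
adjusted_bonferroni = bonferroni(raw)
adjusted_sidak = sidak(raw)
print(adjusted_bonferroni)
print(adjusted_sidak)
```

For every input the Sidak value is at most the Bonferroni value, and the gap is small when the raw p-value is small.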
|
|
500
|
+
|
|
501
|
+
### 6.4. `alpha`
|
|
502
|
+
|
|
503
|
+
**Default**: `0.05`.
|
|
504
|
+
|
|
505
|
+
Significance level for the stopping rule. Recursion stops at a node
|
|
506
|
+
when $\min_j(\text{adjusted } P_j) > \alpha$.
|
|
507
|
+
|
|
508
|
+
The default `0.05` is a good choice for simple, exploratory analysis.
|
|
509
|
+
For trees fitted on very large datasets, or on correlated records
|
|
510
|
+
where the independence assumption is partially broken, tighten
|
|
511
|
+
`alpha` by one or two orders of magnitude (`0.005` or `0.0005`) to
|
|
512
|
+
keep the tree compact. For models aiming at higher predictive
|
|
513
|
+
accuracy (closer to a full-fledged machine learning model), loosen
|
|
514
|
+
`alpha` to between `0.10` and `0.25`. Tune in concert with
|
|
515
|
+
`max_depth`, `min_splits`, and `min_buckets`.
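
The effect of tightening or loosening `alpha` follows directly from the stopping rule. A minimal sketch, with made-up adjusted p-values and a hypothetical helper `should_stop` that is not part of Sigma:

```python
def should_stop(adjusted_p_values, alpha=0.05):
    """Stop splitting when even the smallest adjusted p-value exceeds alpha."""
    return min(adjusted_p_values) > alpha


adjusted = [0.012, 0.09, 0.78]  # hypothetical adjusted p-values at one node
decisions = {alpha: should_stop(adjusted, alpha) for alpha in (0.25, 0.05, 0.005)}
print(decisions)
```

With these numbers the node is split at `alpha=0.25` and `alpha=0.05`, but becomes a leaf at `alpha=0.005`.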
|
|
516
|
+
|
|
517
|
+
### 6.5. `min_splits`
|
|
518
|
+
|
|
519
|
+
**Default**: `20`.
|
|
520
|
+
|
|
521
|
+
Minimum sum of weights required to attempt a split. Nodes whose weight
|
|
522
|
+
sum falls below this become leaves regardless of p-values. Increase to
|
|
523
|
+
enforce statistical reliability of node-level estimates on smaller
|
|
524
|
+
subsets, decrease to allow finer partitioning.
|
|
525
|
+
|
|
526
|
+
### 6.6. `min_buckets`
|
|
527
|
+
|
|
528
|
+
**Default**: `7`.
|
|
529
|
+
|
|
530
|
+
Minimum sum of weights in each child node. Splits that would produce a
|
|
531
|
+
child smaller than this are rejected. Together with `min_splits`,
|
|
532
|
+
controls the smallest leaf permitted; raise both for noisier data.
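
The two size bounds can be read as simple checks on weight sums. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and treating a sum exactly equal to the bound as acceptable is a guess, not documented behavior.

```python
import numpy


def can_attempt_split(weights, min_splits=20):
    """A node is eligible for splitting only if its total weight reaches min_splits."""
    return bool(numpy.sum(weights) >= min_splits)


def split_is_admissible(weights, goes_left, min_buckets=7):
    """A candidate split is kept only if both children reach min_buckets."""
    weights = numpy.asarray(weights, dtype=float)
    goes_left = numpy.asarray(goes_left, dtype=bool)
    left = weights[goes_left].sum()
    right = weights[~goes_left].sum()
    return bool(left >= min_buckets and right >= min_buckets)


weights = numpy.ones(25)                  # 25 unit-weight rows at the node
small_left = numpy.arange(25) < 5         # 5 rows left, 20 rows right
balanced = numpy.arange(25) < 10          # 10 rows left, 15 rows right
print(can_attempt_split(weights))
print(split_is_admissible(weights, small_left))
print(split_is_admissible(weights, balanced))
```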
|
|
533
|
+
|
|
534
|
+
### 6.7. `max_depth`
|
|
535
|
+
|
|
536
|
+
**Default**: `None` (no limit).
|
|
537
|
+
|
|
538
|
+
Maximum tree depth. Set to a small integer for shallow, easily interpreted
|
|
539
|
+
trees. Leave `None` to let the p-value stopping rule fully control depth.
|
|
540
|
+
|
|
541
|
+
### 6.8. `categorical_features`
|
|
542
|
+
|
|
543
|
+
**Default**: `None` (all numeric).
|
|
544
|
+
|
|
545
|
+
List of feature columns to treat as categorical. Entries may be
|
|
546
|
+
column-name strings (resolved against the DataFrame columns at fit
|
|
547
|
+
time, i.e. `feature_names_in_`) or integer column indices; mixing the
|
|
548
|
+
two forms is allowed. Letting $K$ denote the number of levels in a
|
|
549
|
+
categorical feature, Sigma uses exhaustive split enumeration for
|
|
550
|
+
$K \le 10$ and an ordered-merge heuristic for $K > 10$ (see the
|
|
551
|
+
Algorithm section).
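
The cutoff at ten levels makes sense once the candidate splits are counted: a feature with $K$ levels has $2^{K-1} - 1$ distinct binary partitions. The sketch below enumerates them with the standard library; `binary_partitions` is a hypothetical helper for illustration and does not reproduce Sigma's internal enumeration.

```python
from itertools import combinations


def binary_partitions(levels):
    """Yield every split of distinct `levels` into two non-empty groups, once each."""
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    # Pin the first level to the left group so mirrored splits are not repeated.
    for size in range(len(rest)):
        for chosen in combinations(rest, size):
            left = (first, *chosen)
            right = tuple(level for level in rest if level not in chosen)
            yield left, right


three_levels = list(binary_partitions(["1st", "2nd", "3rd"]))
ten_level_count = sum(1 for _ in binary_partitions(range(10)))
print(three_levels)
print(ten_level_count)
```

Three levels give 3 candidate splits and ten levels give 511; each extra level roughly doubles the count.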
|
|
552
|
+
|
|
553
|
+
### 6.9. `ci_method` (`RegressionTree` only)
|
|
554
|
+
|
|
555
|
+
**Default**: `"bayesian_bootstrap"`.
|
|
556
|
+
|
|
557
|
+
Method for the confidence interval on each node's mean prediction. In
|
|
558
|
+
the descriptions below, $y$ denotes the per-row response, $n$ the
|
|
559
|
+
sample size at the node, and $n_{\text{eff}}$ the Kish effective
|
|
560
|
+
sample size at the node.
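
For reference, the Kish effective sample size is $n_{\text{eff}} = (\sum_i w_i)^2 / \sum_i w_i^2$. A minimal NumPy sketch with made-up weights follows; the function name is illustrative and is not Sigma's API.

```python
import numpy


def kish_effective_sample_size(weights):
    """(sum of weights)^2 / (sum of squared weights)."""
    weights = numpy.asarray(weights, dtype=float)
    return float(weights.sum() ** 2 / numpy.sum(weights ** 2))


equal = kish_effective_sample_size([1.0, 1.0, 1.0, 1.0])     # equal weights: n_eff == n
uneven = kish_effective_sample_size([10.0, 1.0, 1.0, 1.0])   # one dominant row
print(equal, uneven)
```

With equal weights $n_{\text{eff}}$ equals $n$; when one row dominates, it drops well below $n$ (here $169 / 103 \approx 1.64$ for four rows).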
|
|
561
|
+
|
|
562
|
+
- `"bayesian_bootstrap"` (default) uses Dirichlet resampling of the
|
|
563
|
+
weighted mean. Nonparametric: makes no assumption on the response
|
|
564
|
+
distribution. The safe choice for arbitrary regression targets, but
|
|
565
|
+
less powerful than a method tailored to the response's actual
|
|
566
|
+
distribution.
|
|
567
|
+
- `"bca"` is the bias-corrected and accelerated bootstrap interval
|
|
568
|
+
(Efron, 1987): resample $10{,}000$ times from the empirical
|
|
569
|
+
distribution, then read percentiles corrected for median
|
|
570
|
+
bias ($z_0$) and skewness ($a$, computed via jackknife).
|
|
571
|
+
Nonparametric and second-order accurate ($O(1/n)$ coverage error);
|
|
572
|
+
transformation-respecting. Choose for the frequentist counterpart of
|
|
573
|
+
`"bayesian_bootstrap"` when an external benchmark specifies a
|
|
574
|
+
frequentist confidence interval. Non-deterministic across calls.
|
|
575
|
+
- `"beta"` is a Clopper-Pearson-style Beta interval for proportional
|
|
576
|
+
responses in $[0, 1]$. Choose when $y$ is naturally a rate or
|
|
577
|
+
proportion (conversion rate, click-through rate).
|
|
578
|
+
- `"exponential"` is the exact chi-squared interval for an Exponential
|
|
579
|
+
mean (Gamma with shape $= 1$); requires $y \ge 0$. Choose when
|
|
580
|
+
responses are non-negative waiting times or lifetimes that follow an
|
|
581
|
+
exponential distribution.
|
|
582
|
+
- `"gamma"` is the exact chi-squared interval for a Gamma mean using
|
|
583
|
+
a method-of-moments shape estimate; requires $y \ge 0$. Choose for
|
|
584
|
+
non-negative right-skewed responses (insurance claims, incomes,
|
|
585
|
+
durations).
|
|
586
|
+
- `"log_normal"` is Cox's interval for the arithmetic mean of a
|
|
587
|
+
log-normal response; requires $y > 0$. Centered on the log-normal
|
|
588
|
+
MLE of the mean, not the sample mean. Choose when $\log y$ is
|
|
589
|
+
approximately normal (financial returns, biological measurements).
|
|
590
|
+
- `"log_normal_gci"` is the generalized confidence interval
|
|
591
|
+
(Krishnamoorthy & Mathew, 2003) for the arithmetic mean of a
|
|
592
|
+
log-normal response; requires $y > 0$. Like `"log_normal"` but built
|
|
593
|
+
via Monte Carlo from a generalized pivot, giving asymmetric bounds.
|
|
594
|
+
Choose when $n_{\text{eff}}$ is very small and the variance of $\log y$ is large, where
|
|
595
|
+
Cox's symmetric Wald form begins to lose calibration.
|
|
596
|
+
Non-deterministic across calls.
|
|
597
|
+
- `"normal"` is a Wald-style interval $\bar{Y} \pm z \cdot \text{SE}$
|
|
598
|
+
($\bar{Y}$ the node weighted mean of $y$, $z$ a standard normal
|
|
599
|
+
quantile, $\text{SE}$ the standard error) with the Kish effective
|
|
600
|
+
sample size. Tight and cheap. Choose when the central limit theorem
|
|
601
|
+
applies comfortably ($n_{\text{eff}}$ well above 30, finite response
|
|
602
|
+
variance).
|
|
603
|
+
- `"poisson"` is the exact Garwood chi-squared interval for a Poisson
|
|
604
|
+
mean rate; requires $y \ge 0$. The conservative choice with
|
|
605
|
+
guaranteed coverage (Patil & Kulkarni, 2012). Choose for count
|
|
606
|
+
responses generated by an approximately Poisson process when
|
|
607
|
+
guaranteed coverage matters more than tightness.
|
|
608
|
+
- `"poisson_jeffreys"` is the equal-tailed Jeffreys interval for a
|
|
609
|
+
Poisson mean rate; requires $y \ge 0$. Shorter than `"poisson"` at
|
|
610
|
+
moderate rates (Patil & Kulkarni, 2012). Choose for count
|
|
611
|
+
responses at moderate rates when you do not require Garwood's
|
|
612
|
+
guaranteed coverage.
|
|
613
|
+
- `"student_t"` has the same form as `"normal"` but uses a Student-t
|
|
614
|
+
quantile with $n_{\text{eff}} - 1$ degrees of freedom. Wider than
|
|
615
|
+
`"normal"` for small effective sample sizes. Choose when
|
|
616
|
+
$n_{\text{eff}}$ is borderline and small-sample coverage matters.
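
To make three of the options above concrete, here is a minimal sketch of the `"normal"`, `"student_t"`, and `"bayesian_bootstrap"` ideas for a weighted mean. It rests on assumptions not stated in this document: the variance estimator, the way Dirichlet draws are combined with case weights, the number of draws, and all function names are illustrative choices, not Sigma's implementation.

```python
import numpy as np
from scipy import stats


def weighted_mean_and_se(y, w):
    """Weighted mean, an assumed standard error, and the Kish n_eff."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n_eff = w.sum() ** 2 / np.sum(w ** 2)
    mean = np.average(y, weights=w)
    # Assumption: weighted variance with an n_eff / (n_eff - 1) correction.
    var = np.average((y - mean) ** 2, weights=w) * n_eff / (n_eff - 1.0)
    return mean, np.sqrt(var / n_eff), n_eff


def wald_ci(y, w, coverage=0.95, use_student_t=False):
    """mean +/- quantile * SE, with a normal or Student-t quantile."""
    mean, se, n_eff = weighted_mean_and_se(y, w)
    upper_tail = 0.5 + coverage / 2.0
    if use_student_t:
        q = stats.t.ppf(upper_tail, df=n_eff - 1.0)
    else:
        q = stats.norm.ppf(upper_tail)
    return mean - q * se, mean + q * se


def bayesian_bootstrap_ci(y, w, coverage=0.95, draws=2000, seed=0):
    """Percentile interval over Dirichlet-reweighted means."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: Dirichlet(1, ..., 1) draws rescaled by the case weights.
    d = rng.dirichlet(np.ones(len(y)), size=draws) * w
    d /= d.sum(axis=1, keepdims=True)
    replicate_means = d @ y
    tails = [(1.0 - coverage) / 2.0, (1.0 + coverage) / 2.0]
    lo, hi = np.quantile(replicate_means, tails)
    return float(lo), float(hi)


y = np.array([2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7])
w = np.array([1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0])
normal_lo, normal_hi = wald_ci(y, w)
t_lo, t_hi = wald_ci(y, w, use_student_t=True)
bb_lo, bb_hi = bayesian_bootstrap_ci(y, w)
```

On a small node like this one the Student-t interval is visibly wider than the normal one, which is the trade-off the `"student_t"` entry describes.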
|
|
617
|
+
|
|
618
|
+
### 6.10. `ci_method` (`ClassificationTree` only)
|
|
619
|
+
|
|
620
|
+
**Default**: `"jeffreys"`.
|
|
621
|
+
|
|
622
|
+
Method for the per-class confidence intervals on node class
|
|
623
|
+
proportions. In the descriptions below, $n$ denotes the node sample
|
|
624
|
+
size and $z$ a standard normal quantile.
|
|
625
|
+
|
|
626
|
+
- `"agresti_coull"` is the adjusted Wald interval: Wald applied
|
|
627
|
+
after adding $z^2/2$ pseudo-successes and $z^2/2$ pseudo-failures.
|
|
628
|
+
Slightly wider and more conservative than `"wilson"` at small
|
|
629
|
+
sample sizes; statistically equivalent to `"wilson"` and
|
|
630
|
+
`"jeffreys"` for $n > 40$ per Brown-Cai-DasGupta (2001). Choose
|
|
631
|
+
when matching an external reference that specifies Agresti-Coull.
|
|
632
|
+
- `"clopper_pearson"` is the exact Beta interval. Has the absolute
|
|
633
|
+
coverage *guarantee* ($\ge$ `ci_coverage` for every true proportion),
|
|
634
|
+
but is conservative: intervals are wider than they need to be on
|
|
635
|
+
average. Choose when guaranteed coverage matters more than tightness
|
|
636
|
+
(regulatory or safety contexts).
|
|
637
|
+
- `"jeffreys"` (default) is a Bayesian interval from the Beta
|
|
638
|
+
posterior with the Jeffreys non-informative prior $\mathrm{Beta}(0.5, 0.5)$.
|
|
639
|
+
Neither systematically conservative nor systematically aggressive on
|
|
640
|
+
average. Recommended for general use.
|
|
641
|
+
- `"mid_p_exact"` is the mid-p variant of Clopper-Pearson. Strictly
|
|
642
|
+
narrower than `"clopper_pearson"` while keeping an exact-tail
|
|
643
|
+
rationale, with average coverage close to nominal. Choose when
|
|
644
|
+
Clopper-Pearson's conservatism feels too wasteful but an exact-tail
|
|
645
|
+
method is still desired.
|
|
646
|
+
- `"wilson"` is the closed-form Wilson score interval, clipped to
|
|
647
|
+
$[0, 1]$. Cheapest to compute and accurate at moderate sample sizes;
|
|
648
|
+
coverage degrades near 0 and 1. Choose when you need vectorized
|
|
649
|
+
speed and class proportions are not extreme.
|
|
650
|
+
- `"wilson_cc"` is the Wilson score interval with Newcombe's
|
|
651
|
+
continuity correction. Slightly wider than `"wilson"`, restoring
|
|
652
|
+
lower-tail coverage at small sample sizes. Choose when the node
|
|
653
|
+
total weight $w_{\text{total}}$ is small and plain Wilson
|
|
654
|
+
under-covers.
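
For orientation, four of the intervals above can be written in a few lines with SciPy. This is a sketch of the standard textbook formulas on plain integer counts (`k` successes out of `n`); how Sigma handles weighted counts is not covered here, and the function names are hypothetical.

```python
import numpy as np
from scipy import stats


def wilson(k, n, coverage=0.95):
    """Closed-form Wilson score interval, clipped to [0, 1]."""
    z = stats.norm.ppf(0.5 + coverage / 2.0)
    p = k / n
    denom = 1.0 + z ** 2 / n
    center = (p + z ** 2 / (2.0 * n)) / denom
    half = z * np.sqrt(p * (1.0 - p) / n + z ** 2 / (4.0 * n ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def agresti_coull(k, n, coverage=0.95):
    """Wald interval after adding z^2/2 pseudo-successes and pseudo-failures."""
    z = stats.norm.ppf(0.5 + coverage / 2.0)
    n_adj = n + z ** 2
    p_adj = (k + z ** 2 / 2.0) / n_adj
    half = z * np.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)


def jeffreys(k, n, coverage=0.95):
    """Equal-tailed interval from the Beta(k + 0.5, n - k + 0.5) posterior."""
    a = (1.0 - coverage) / 2.0
    lo = stats.beta.ppf(a, k + 0.5, n - k + 0.5) if k > 0 else 0.0
    hi = stats.beta.ppf(1.0 - a, k + 0.5, n - k + 0.5) if k < n else 1.0
    return float(lo), float(hi)


def clopper_pearson(k, n, coverage=0.95):
    """Exact interval via Beta quantiles."""
    a = (1.0 - coverage) / 2.0
    lo = stats.beta.ppf(a, k, n - k + 1) if k > 0 else 0.0
    hi = stats.beta.ppf(1.0 - a, k + 1, n - k) if k < n else 1.0
    return float(lo), float(hi)


cp = clopper_pearson(7, 20)
jf = jeffreys(7, 20)
ws = wilson(7, 20)
ac = agresti_coull(7, 20)
```

For 7 successes out of 20, the Clopper-Pearson interval contains the Jeffreys interval and the Agresti-Coull interval contains the Wilson interval, consistent with the ordering of conservatism described in the list.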
|
|
655
|
+
|
|
656
|
+
### 6.11. `ci_method` (`RankingTree` only)
|
|
657
|
+
|
|
658
|
+
**Default**: `"bayesian_bootstrap"`.
|
|
659
|
+
|
|
660
|
+
Method for the per-item confidence intervals on each node's Plackett-Luce
|
|
661
|
+
expected-rank vector. Both supported methods refit the PL MLE on
|
|
662
|
+
resampled active rows and aggregate the resulting expected-rank vectors
|
|
663
|
+
marginally per item; scalar-mean CI methods (`"normal"`, `"student_t"`)
|
|
664
|
+
and the seven distribution-specific methods of `RegressionTree`
|
|
665
|
+
(`"beta"`, `"exponential"`, `"gamma"`, `"log_normal"`,
|
|
666
|
+
`"log_normal_gci"`, `"poisson"`, `"poisson_jeffreys"`) are rejected at
|
|
667
|
+
construction time because PL expected rank is a non-linear functional of
|
|
668
|
+
a joint MLE rather than a scalar sample mean.
|
|
669
|
+
|
|
670
|
+
- `"bayesian_bootstrap"` (default) draws Dirichlet weights for the
|
|
671
|
+
active rows and refits the PL MLE on each replicate. Nonparametric;
|
|
672
|
+
the safe choice.
|
|
673
|
+
- `"bca"` is the bias-corrected and accelerated bootstrap (Efron, 1987)
|
|
674
|
+
applied to row-resampled PL refits, with the acceleration term
|
|
675
|
+
computed from a leave-one-out jackknife of PL refits. Slower than
|
|
676
|
+
`"bayesian_bootstrap"`; non-deterministic across calls.
|
|
677
|
+
|
|
678
|
+
### 6.12. `ci_coverage`
|
|
679
|
+
|
|
680
|
+
**Default**: `0.95`.
|
|
681
|
+
|
|
682
|
+
Coverage level for node-prediction confidence intervals. Set to `None`
|
|
683
|
+
to skip CI computation entirely (the proper way to fully avoid the
|
|
684
|
+
per-node `ci_method` cost). Common alternatives: `0.90` (less
|
|
685
|
+
conservative), `0.99` (more conservative). For survival trees, also
|
|
686
|
+
controls the confidence band drawn behind each Kaplan-Meier curve in
|
|
687
|
+
the response plot; for ranking trees, it sets the per-item whisker
|
|
688
|
+
width in the expected-rank response plot.
|
|
689
|
+
|
|
690
|
+
### 6.13. `transmuter`
|
|
691
|
+
|
|
692
|
+
**Default**: `None`.
|
|
693
|
+
|
|
694
|
+
Optional callable that transforms node-level data before predictions
|
|
695
|
+
and confidence intervals are computed, with post-hoc split validation.
|
|
696
|
+
Signature: `(X, y, sample_weight) -> (y', sample_weight')`, or
|
|
697
|
+
`(X, y, sample_weight, side_data) -> (y', sample_weight')` when
|
|
698
|
+
`side_data` is passed to `fit`. After each candidate split, both
|
|
699
|
+
child subsets are independently transmuted and a significance test is
|
|
700
|
+
run on the transmuted data; if the p-value exceeds `alpha` the split
|
|
701
|
+
is rejected and the node becomes a leaf. Use cases: survival outcomes
|
|
702
|
+
(Kaplan-Meier-style transformation), rate normalization (impressions
|
|
703
|
+
to click-through rate), de-noising heavy-tailed responses.
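
A minimal sketch of a three-argument transmuter for the rate-normalization use case follows. Several details are assumptions made for illustration only: that `X` arrives as something convertible to a NumPy array, that column 0 holds impression counts, that `y` holds click counts, and that `sample_weight` may be `None`. The function name is hypothetical.

```python
import numpy as np


def click_rate_transmuter(X, y, sample_weight):
    """Turn per-row click counts into click-through rates.

    Returns (y', sample_weight') as required by the transmuter signature.
    """
    X = np.asarray(X, dtype=float)
    impressions = X[:, 0]  # assumption: column 0 stores impressions
    clicks = np.asarray(y, dtype=float)
    if sample_weight is None:
        weights = np.ones_like(clicks)
    else:
        weights = np.asarray(sample_weight, dtype=float)
    # Rows with zero impressions get a rate of 0 and, below, a weight of 0.
    rate = np.divide(
        clicks, impressions, out=np.zeros_like(clicks), where=impressions > 0
    )
    # Weight each rate by its impression count so large rows count more.
    return rate, weights * impressions


X_demo = np.array([[100.0, 1.0], [50.0, 0.0], [0.0, 1.0]])
y_demo = np.array([10.0, 5.0, 0.0])
rate_demo, weight_demo = click_rate_transmuter(X_demo, y_demo, None)
```

Such a callable would be passed as `transmuter=click_rate_transmuter` when constructing the estimator.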
|
|
704
|
+
|
|
705
|
+
### 6.14. `resamples`
|
|
706
|
+
|
|
707
|
+
**Default**: `None`.
|
|
708
|
+
|
|
709
|
+
Number of permutations $B$ for `test_type="monte_carlo"`. Required and
|
|
710
|
+
must be a positive integer when `test_type="monte_carlo"` is selected; ignored
|
|
711
|
+
otherwise. Typical choices: `1000` for day-to-day production, `10000`
|
|
712
|
+
for paper-grade reproducible adjusted p-values.
|
|
713
|
+
|
|
714
|
+
### 6.15. `decorator`
|
|
715
|
+
|
|
716
|
+
**Default**: `None`.
|
|
717
|
+
|
|
718
|
+
Optional callable invoked once per node after the tree is built.
|
|
719
|
+
Signature: `(X_active, y_active, w_active, side_data_active) ->
|
|
720
|
+
decoration` where `decoration` is any object (or `None`). The returned
|
|
721
|
+
object is stored on the node as `node.decoration` and rendered by
|
|
722
|
+
`to_text` and `to_image`. Use cases: per-node metric (RMSE,
|
|
723
|
+
classification accuracy), business labels (segment names), diagnostic
|
|
724
|
+
statistics.
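
As an illustration of the per-node metric use case, the sketch below follows the four-argument signature given above and returns a short string. It assumes the active response is numeric and that the weight argument may be `None`; the function name and the returned format are hypothetical.

```python
import numpy as np


def rmse_decorator(X_active, y_active, w_active, side_data_active):
    """Weighted RMSE of the node's rows around the node's weighted mean."""
    y = np.asarray(y_active, dtype=float)
    if w_active is None:
        w = np.ones_like(y)
    else:
        w = np.asarray(w_active, dtype=float)
    mean = np.average(y, weights=w)
    rmse = float(np.sqrt(np.average((y - mean) ** 2, weights=w)))
    return f"RMSE={rmse:.3f}"


label = rmse_decorator(None, [1.0, 3.0], [1.0, 1.0], None)
```

The returned string is what would end up in `node.decoration` if this callable were passed as `decorator=rmse_decorator`.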
|
|
725
|
+
|
|
726
|
+
### 6.16. `random_state`
|
|
727
|
+
|
|
728
|
+
**Default**: `None`.
|
|
729
|
+
|
|
730
|
+
Seed for all stochastic operations in the estimator. Pass an integer
|
|
731
|
+
for reproducibility; `None` uses an unpredictable seed. Controls:
|
|
732
|
+
|
|
733
|
+
- min-P permutation resampling under `test_type="monte_carlo"`;
|
|
734
|
+
- the bootstrap-family CI methods of `RegressionTree`
|
|
735
|
+
(`bayesian_bootstrap`, `bca`, `log_normal_gci`);
|
|
736
|
+
- the bootstrap-family CI methods of `RankingTree`
|
|
737
|
+
(`bayesian_bootstrap`, `bca`) applied K times per node - once per
|
|
738
|
+
item;
|
|
739
|
+
- the jitter of `to_image(kind="response")` raincloud plots
|
|
740
|
+
(`RegressionTree` only; combined with the leaf index so each leaf
|
|
741
|
+
receives a distinct pattern).
|
|
742
|
+
|
|
743
|
+
## 7. Algorithm
|
|
744
|
+
|
|
745
|
+
The algorithm builds a decision tree using statistical hypothesis
|
|
746
|
+
testing for unbiased variable selection. Unlike CART, which selects
|
|
747
|
+
variables by maximizing an impurity criterion (and is therefore biased
|
|
748
|
+
toward variables with many possible splits), conditional inference trees
|
|
749
|
+
use permutation-based p-values to decouple variable selection from split
|
|
750
|
+
search.
|
|
751
|
+
|
|
752
|
+
The framework is generic: the only difference between the four task
|
|
753
|
+
families is the influence function $h$ applied to the response $Y_i$
|
|
754
|
+
of observation $i$. For classification with $J$ classes,
|
|
755
|
+
$h(Y_i) = e_J(Y_i)$ (one-hot encoding of the class label). For
|
|
756
|
+
regression, $h(Y_i) = Y_i$ (identity). For survival, $h(Y_i)$ is the
|
|
757
|
+
log-rank score (a scalar centred Savage score). For ranking, the
|
|
758
|
+
ranks-in-cell vector $Y_i$ is imputed at unranked items with the per-row
|
|
759
|
+
tail mean, log-transformed via $\log(1 + Y_i)$, column-centered, and
|
|
760
|
+
projected onto the top-$R$ right singular vectors of the resulting
|
|
761
|
+
matrix: $h(Y_i) = (\log(1 + Y_i) - \bar{m}) V$, where $\bar{m}$ is
|
|
762
|
+
the global column mean and $V$ is the loading matrix. The
|
|
763
|
+
log-transformation of power-law-distributed rank data before factor
|
|
764
|
+
analysis follows Leydesdorff (2006). All test statistics, p-value
|
|
765
|
+
computations, and splitting criteria use the same formulas.
|
|
766
|
+
|
|
767
|
+
### 7.1. Step 1: Variable selection and stopping
|
|
768
|
+
|
|
769
|
+
Given $n$ observations with response values $Y_i$, covariate values
|
|
770
|
+
$X_{ji}$ (the value of the $j$-th covariate $X_j$ for observation $i$),
|
|
771
|
+
and case weights $w_i$, define $g_j$ as the score function for
|
|
772
|
+
covariate $X_j$ (identity for numeric covariates, dummy encoding for
|
|
773
|
+
categorical ones). When
|
|
774
|
+
`correlation="rank"` (the default), continuous covariates and regression
|
|
775
|
+
responses are rank-transformed within each node before computing the
|
|
776
|
+
test statistics, yielding Spearman-like nonparametric tests that are
|
|
777
|
+
robust to outliers and non-normality. When `correlation="normal"`, raw
|
|
778
|
+
values are used (Pearson-like, as in the original paper). For each
|
|
779
|
+
covariate $X_j$, the algorithm computes the linear statistic
|
|
780
|
+
|
|
781
|
+
$$T_j = \text{vec}\!\left(\sum_{i=1}^{n} w_i \cdot g_j(X_{ji}) \cdot h(Y_i)^\top\right)$$
|
|
782
|
+
|
|
783
|
+
and derives its conditional expectation $\mu_j$ and covariance
|
|
784
|
+
$\Sigma_j$ under the null hypothesis of independence between $X_j$ and
|
|
785
|
+
the response $Y$. A test statistic (quadratic-form or maximum-type) is
|
|
786
|
+
computed and converted to a p-value $P_j$. A multiplicity adjustment is
|
|
787
|
+
applied across all $m$ covariates, and recursion stops when
|
|
788
|
+
$\min_j(\text{adjusted } P_j) > \alpha$. Otherwise the covariate with
|
|
789
|
+
the smallest adjusted p-value is selected.
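
The simplest special case of this step can be sketched directly: one numeric covariate, a scalar response, unit case weights, and identity $g_j$ and $h$. The permutation mean and variance used below are the standard ones for a sum of products under random relabeling; the function name is hypothetical and the code is an illustration, not Sigma's implementation.

```python
import numpy as np
from scipy import stats


def linear_statistic_test(x, y):
    """Quadratic-form permutation test for one numeric covariate.

    Returns (T, mu, sigma2, c, p) where c = (T - mu)^2 / sigma2 is
    referred to a chi-squared distribution with 1 degree of freedom.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    T = np.sum(x * y)                      # linear statistic
    mu = x.sum() * y.mean()                # conditional expectation under H0
    v_y = np.mean((y - y.mean()) ** 2)     # conditional variance of h(Y)
    sigma2 = v_y / (n - 1) * (n * np.sum(x ** 2) - x.sum() ** 2)
    c = (T - mu) ** 2 / sigma2
    p = stats.chi2.sf(c, df=1)
    return T, mu, sigma2, c, p


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.5, 3.1, 5.4, 5.8])
T, mu, sigma2, c, p = linear_statistic_test(x, y)
```

In this scalar case the statistic reduces to $(n - 1) r^2$ with $r$ the Pearson correlation; applying `scipy.stats.rankdata` to `x` and `y` first gives the Spearman-like variant described for `correlation="rank"`.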
|
|
790
|
+
|
|
791
|
+
The default adjustment is the Sidak correction
|
|
792
|
+
($\text{adjusted } P_j = 1 - (1 - P_j)^m$), which is powerful under the
|
|
793
|
+
mild assumption that the test statistics across covariates are
|
|
794
|
+
independent or positively dependent. A simpler closed-form alternative,
|
|
795
|
+
`test_type="bonferroni"`, uses
|
|
796
|
+
$\text{adjusted } P_j = \min(m P_j, 1)$; it is strictly more
|
|
797
|
+
conservative than Sidak. The third alternative,
|
|
798
|
+
`test_type="monte_carlo"`, uses the Westfall-Young (1993) min-P
|
|
799
|
+
resampling procedure. For each of $B$ permutations of the response, all
|
|
800
|
+
$m$ p-values are recomputed and the minimum recorded. The adjusted
|
|
801
|
+
p-value for covariate $j$ is the proportion of permutations where this
|
|
802
|
+
minimum did not exceed the observed $P_j$. This method is more powerful
|
|
803
|
+
than Sidak when covariates are correlated, at the cost of
|
|
804
|
+
$O(B \cdot m)$ additional statistic evaluations. Set `resamples` (e.g.,
|
|
805
|
+
1000 or 10000) and optionally `random_state` for reproducibility. All
|
|
806
|
+
three methods are available via the `test_type` parameter.
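
The three adjustments can be compared side by side in a short sketch. The Sidak and Bonferroni formulas are taken directly from the text above; the min-P function assumes the permuted p-values are already available as a `(B, m)` array, which is a simplification chosen for illustration. All function names are hypothetical.

```python
import numpy as np


def sidak(p, m):
    """Sidak adjustment: 1 - (1 - P_j)^m."""
    return 1.0 - (1.0 - np.asarray(p, dtype=float)) ** m


def bonferroni(p, m):
    """Textbook Bonferroni adjustment: min(m * P_j, 1)."""
    return np.minimum(m * np.asarray(p, dtype=float), 1.0)


def min_p_adjust(p_observed, p_permuted):
    """Westfall-Young min-P adjustment from a (B, m) array of permuted p-values."""
    min_p = np.asarray(p_permuted, dtype=float).min(axis=1)
    # Proportion of permutations whose minimum does not exceed each observed P_j.
    return np.array([(min_p <= p).mean() for p in p_observed])


p_obs = np.array([0.01, 0.2])
sidak_adj = sidak(p_obs, m=10)
bonf_adj = bonferroni(p_obs, m=10)

p_perm = np.array([[0.5, 0.3], [0.02, 0.9], [0.6, 0.005], [0.4, 0.7]])
minp_adj = min_p_adjust(np.array([0.01, 0.35]), p_perm)
```

For every input the Sidak value is no larger than the Bonferroni value, which is the sense in which Bonferroni is the more conservative of the two.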
|
|
807
|
+
|
|
808
|
+
### 7.2. Step 2: Binary splitting
|
|
809
|
+
|
|
810
|
+
For the selected covariate, the algorithm searches for the binary
|
|
811
|
+
partition $A^*$ that maximizes the two-sample test statistic. Numeric
|
|
812
|
+
covariates are split at midpoints between consecutive unique values.
|
|
813
|
+
Categorical covariates with $K \le 10$ levels use exhaustive enumeration
|
|
814
|
+
of all $2^{K-1} - 1$ partitions; for $K > 10$, categories are ordered
|
|
815
|
+
by weighted mean of the first influence function column and only $K - 1$
|
|
816
|
+
contiguous splits are evaluated (provably optimal for regression,
|
|
817
|
+
heuristic for classification).
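
The candidate sets described above can be enumerated in a few lines. This sketch only generates the candidates (midpoints for a numeric covariate, and all $2^{K-1} - 1$ two-way partitions for a categorical one); it does not score them. Function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def numeric_split_points(values):
    """Midpoints between consecutive unique values."""
    u = np.unique(values)
    return (u[:-1] + u[1:]) / 2


def categorical_partitions(levels):
    """All unordered two-way partitions of the levels into non-empty sets.

    The first level is pinned to the left side so that each partition
    is generated exactly once, giving 2^(K-1) - 1 candidates.
    """
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    partitions = []
    for size in range(len(rest) + 1):
        for combo in combinations(rest, size):
            left = {first, *combo}
            if len(left) < len(levels):  # skip the split with an empty right side
                partitions.append((left, set(levels) - left))
    return partitions


midpoints = numeric_split_points([1, 2, 2, 4])
parts = categorical_partitions("abcd")
```

With $K = 4$ levels this yields 7 partitions; at $K = 10$ the count is already 511, and the exponential growth is why a cheaper ordered search is used above that threshold.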
|
|
818
|
+
|
|
819
|
+
### 7.3. Step 3: Recursion and prediction
|
|
820
|
+
|
|
821
|
+
Case weights are updated to reflect node membership and steps 1-2 are
|
|
822
|
+
repeated recursively on each child node. Terminal nodes predict:
|
|
823
|
+
|
|
824
|
+
- **Regression**: the weighted mean of the response.
|
|
825
|
+
- **Classification**: the majority class, with class probabilities
|
|
826
|
+
given by the normalized weighted class counts.
|
|
827
|
+
|
|
828
|
+
## 8. Partykit compatibility
|
|
829
|
+
|
|
830
|
+
Sigma is a pure-Python reimplementation of R's `partykit::ctree` with
|
|
831
|
+
various improvements. Tree shape, split variables, split thresholds,
|
|
832
|
+
and per-leaf predictions are empirically verified to match
|
|
833
|
+
`partykit::ctree` on three reference datasets, one each for regression, classification, and survival:
|
|
834
|
+
|
|
835
|
+
- **Regression**: the `airquality` dataset (Ozone on Wind/Temp/Month/Day,
|
|
836
|
+
n=116 after dropping the rows with no Ozone observation). Crosscheck at
|
|
837
|
+
`tests/test_partykit_equivalence.py:26`.
|
|
838
|
+
- **Classification**: the `GlaucomaM` dataset from R's `TH.data` package
|
|
839
|
+
(Class on 62 morphology covariates, n=196). Crosscheck at
|
|
840
|
+
`tests/test_partykit_equivalence.py:75`.
|
|
841
|
+
- **Survival**: the `GBSG2` dataset from `lifelines`
|
|
842
|
+
(`Surv(time, cens) ~ horTh + age + menostat + tsize + tgrade + pnodes +
|
|
843
|
+
progrec + estrec`, n=686). Crosscheck at
|
|
844
|
+
`tests/test_tree_survival.py:661`.
|
|
845
|
+
|
|
846
|
+
Three deliberate deviations from partykit are worth knowing about:
|
|
847
|
+
|
|
848
|
+
1. **`test_type="sidak"` is the default**, matching partykit's effective
|
|
849
|
+
behavior. Partykit's `testtype="Bonferroni"` is a naming error on their
|
|
850
|
+
part: the adjustment it computes is mathematically the Sidak formula
|
|
851
|
+
$1 - (1 - P_j)^m$, not the textbook Bonferroni $\min(m P_j, 1)$.
|
|
852
|
+
Sigma exposes both options under their correct names; pass
|
|
853
|
+
`test_type="bonferroni"` for the textbook Bonferroni formula, or `test_type="sidak"`
|
|
854
|
+
(the default) to match partykit's "Bonferroni" output exactly.
|
|
855
|
+
2. **`correlation="rank"` is the default**, where partykit uses raw
|
|
856
|
+
values. Rank-transforming both response and continuous covariates
|
|
857
|
+
gives a Spearman-style test that is robust to outliers and skew, at
|
|
858
|
+
the cost of a small loss of power against linear alternatives. Pass
|
|
859
|
+
`correlation="normal"` to match partykit exactly.
|
|
860
|
+
3. **Leaves are reordered for display**: `leaves_` iterates in a
|
|
861
|
+
task-appropriate canonical order, and `to_text` / `to_image` swap
|
|
862
|
+
left and right children of each inner node to match. Sort keys are
|
|
863
|
+
descending majority class share (`ClassificationTree`), ascending
|
|
864
|
+
predicted response (`RegressionTree`), worst prognosis first
|
|
865
|
+
(`SurvivalTree`), and ascending lexicographic per-item PL expected-rank
|
|
866
|
+
vector (`RankingTree`). Partykit prints leaves in tree-traversal
|
|
867
|
+
order. The underlying tree is identical; only the iteration order
|
|
868
|
+
of `leaves_` and the visual left-vs-right placement of children in
|
|
869
|
+
exported renderings differ.
|
|
870
|
+
|
|
871
|
+
## 9. References
|
|
872
|
+
|
|
873
|
+
- Hothorn, T., & Zeileis, A. (2015). *partykit: A Modular Toolkit for
|
|
874
|
+
Recursive Partytioning in R.* *Journal of Machine Learning
|
|
875
|
+
Research*, 16, 3905-3909.
|
|
876
|
+
[jmlr.org/papers/v16/hothorn15a](https://jmlr.org/papers/v16/hothorn15a.html)
|
|
877
|
+
- Turner, H., van Etten, J., Firth, D., & Kosmidis, I. (2020).
|
|
878
|
+
*Modelling Rankings in R: The PlackettLuce Package.* *Computational
|
|
879
|
+
Statistics*, 35(3), 1027-1057.
|
|
880
|
+
[doi:10.1007/s00180-020-00959-3](https://doi.org/10.1007/s00180-020-00959-3)
|
|
881
|
+
- Patil, V. V., & Kulkarni, H. V. (2012). *Comparison of Confidence
|
|
882
|
+
Intervals for the Poisson Mean: Some New Aspects.* *REVSTAT -
|
|
883
|
+
Statistical Journal*, 10(2), 211-227.
|
|
884
|
+
[doi:10.57805/revstat.v10i2.117](https://doi.org/10.57805/revstat.v10i2.117)
|
|
885
|
+
- Hothorn, T., Hornik, K., & Zeileis, A. (2006). *Unbiased Recursive
|
|
886
|
+
Partitioning: A Conditional Inference Framework.* *Journal of
|
|
887
|
+
Computational and Graphical Statistics*, 15(3), 651-674.
|
|
888
|
+
[doi:10.1198/106186006X133933](https://doi.org/10.1198/106186006X133933)
|
|
889
|
+
- Hothorn, T., Hornik, K., van de Wiel, M. A., & Zeileis, A. (2006).
|
|
890
|
+
*A Lego System for Conditional Inference.* *The American
|
|
891
|
+
Statistician*, 60(3), 257-263.
|
|
892
|
+
[doi:10.1198/000313006X118430](https://doi.org/10.1198/000313006X118430)
|
|
893
|
+
- Leydesdorff, L. (2006). *Classification and Powerlaws: The
|
|
894
|
+
Logarithmic Transformation.* *Journal of the American Society for
|
|
895
|
+
Information Science and Technology*, 57(11), 1470-1486.
|
|
896
|
+
[doi:10.1002/asi.20467](https://doi.org/10.1002/asi.20467)
|
|
897
|
+
- Olsson, U. (2005). *Confidence Intervals for the Mean of a
|
|
898
|
+
Log-Normal Distribution.* *Journal of Statistics Education*, 13(1).
|
|
899
|
+
[doi:10.1080/10691898.2005.11910638](https://doi.org/10.1080/10691898.2005.11910638)
|
|
900
|
+
- Hunter, D. R. (2004). *MM Algorithms for Generalized Bradley-Terry
|
|
901
|
+
Models.* *Annals of Statistics*, 32(1), 384-406.
|
|
902
|
+
[doi:10.1214/aos/1079120141](https://doi.org/10.1214/aos/1079120141)
|
|
903
|
+
- Krishnamoorthy, K., & Mathew, T. (2003). *Inferences on the Means of
|
|
904
|
+
Lognormal Distributions Using Generalized p-Values and Generalized
|
|
905
|
+
Confidence Intervals.* *Journal of Statistical Planning and
|
|
906
|
+
Inference*, 115(1), 103-121.
|
|
907
|
+
[doi:10.1016/S0378-3758(02)00153-2](https://doi.org/10.1016/S0378-3758\(02\)00153-2)
|
|
908
|
+
- Hothorn, T., & Lausen, B. (2003). *On the Exact Distribution of
|
|
909
|
+
Maximally Selected Rank Statistics.* *Computational Statistics &
|
|
910
|
+
Data Analysis*, 43(2), 121-137.
|
|
911
|
+
[doi:10.1016/S0167-9473(02)00225-6](https://doi.org/10.1016/S0167-9473\(02\)00225-6)
|
|
912
|
+
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). *Interval
|
|
913
|
+
Estimation for a Binomial Proportion.* *Statistical Science*, 16(2),
|
|
914
|
+
101-133.
|
|
915
|
+
[doi:10.1214/ss/1009213286](https://doi.org/10.1214/ss/1009213286)
|
|
916
|
+
- Agresti, A., & Coull, B. A. (1998). *Approximate is Better than
|
|
917
|
+
"Exact" for Interval Estimation of Binomial Proportions.* *The
|
|
918
|
+
American Statistician*, 52(2), 119-126.
|
|
919
|
+
[doi:10.1080/00031305.1998.10480550](https://doi.org/10.1080/00031305.1998.10480550)
|
|
920
|
+
- Newcombe, R. G. (1998). *Two-Sided Confidence Intervals for the
|
|
921
|
+
Single Proportion: Comparison of Seven Methods.* *Statistics in
|
|
922
|
+
Medicine*, 17(8), 857-872.
|
|
923
|
+
[doi:10.1002/sim.777](https://doi.org/10.1002/\(SICI\)1097-0258\(19980430\)17:8%3C857::AID-SIM777%3E3.0.CO;2-E)
|
|
924
|
+
- Efron, B. (1987). *Better Bootstrap Confidence Intervals.* *Journal
|
|
925
|
+
of the American Statistical Association*, 82(397), 171-185.
|
|
926
|
+
[doi:10.1080/01621459.1987.10478410](https://doi.org/10.1080/01621459.1987.10478410)
|
|
927
|
+
- Rubin, D. B. (1981). *The Bayesian Bootstrap.* *Annals of
|
|
928
|
+
Statistics*, 9(1), 130-134.
|
|
929
|
+
[doi:10.1214/aos/1176345338](https://doi.org/10.1214/aos/1176345338)
|
|
930
|
+
- Efron, B. (1977). *The Efficiency of Cox's Likelihood Function for
|
|
931
|
+
Censored Data.* *Journal of the American Statistical Association*,
|
|
932
|
+
72(359), 557-565.
|
|
933
|
+
[doi:10.1080/01621459.1977.10480613](https://doi.org/10.1080/01621459.1977.10480613)
|
|
934
|
+
- Breslow, N. E. (1974). *Covariance Analysis of Censored Survival
|
|
935
|
+
Data.* *Biometrics*, 30(1), 89-99.
|
|
936
|
+
[doi:10.2307/2529620](https://doi.org/10.2307/2529620)
|
|
937
|
+
- Wilson, E. B. (1927). *Probable Inference, the Law of Succession,
|
|
938
|
+
and Statistical Inference.* *Journal of the American Statistical
|
|
939
|
+
Association*, 22(158), 209-212.
|
|
940
|
+
[doi:10.1080/01621459.1927.10502953](https://doi.org/10.1080/01621459.1927.10502953)
|