edef 0.1.0__tar.gz

edef-0.1.0/LICENSE ADDED
@@ -0,0 +1,11 @@
1
+ Copyright 2026 Ludger Hentschel
2
+
3
+ Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
4
+
5
+ 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
6
+
7
+ 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
8
+
9
+ 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
10
+
11
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
edef-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,601 @@
1
+ Metadata-Version: 2.4
2
+ Name: edef
3
+ Version: 0.1.0
4
+ Summary: Euler Decomposition of Explained Fit: additive feature attribution for realized predictive performance
5
+ Author: Ludger Hentschel
6
+ License: BSD-3-Clause
7
+ Project-URL: Homepage, https://github.com/LudgerHentschel/edef
8
+ Project-URL: Repository, https://github.com/LudgerHentschel/edef
9
+ Project-URL: Issues, https://github.com/LudgerHentschel/edef/issues
10
+ Keywords: feature attribution,feature importance,model fit,machine learning,explainability,integrated gradients,SHAP,SAGE
11
+ Classifier: Development Status :: 3 - Alpha
12
+ Classifier: Intended Audience :: Science/Research
13
+ Classifier: Intended Audience :: Financial and Insurance Industry
14
+ Classifier: License :: OSI Approved :: BSD License
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Programming Language :: Python :: 3.13
20
+ Classifier: Topic :: Scientific/Engineering
21
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
22
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
23
+ Requires-Python: >=3.10
24
+ Description-Content-Type: text/markdown
25
+ License-File: LICENSE
26
+ Requires-Dist: numpy>=1.23
27
+ Provides-Extra: linear
28
+ Requires-Dist: scikit-learn; extra == "linear"
29
+ Provides-Extra: torch
30
+ Requires-Dist: torch; extra == "torch"
31
+ Provides-Extra: tree
32
+ Requires-Dist: treeig; extra == "tree"
33
+ Provides-Extra: numerical
34
+ Requires-Dist: scikit-learn; extra == "numerical"
35
+ Provides-Extra: plots
36
+ Requires-Dist: matplotlib; extra == "plots"
37
+ Requires-Dist: pandas; extra == "plots"
38
+ Requires-Dist: shap; extra == "plots"
39
+ Provides-Extra: dev
40
+ Requires-Dist: pytest; extra == "dev"
41
+ Requires-Dist: scikit-learn; extra == "dev"
42
+ Requires-Dist: matplotlib; extra == "dev"
43
+ Requires-Dist: pandas; extra == "dev"
44
+ Requires-Dist: shap; extra == "dev"
45
+ Provides-Extra: all
46
+ Requires-Dist: torch; extra == "all"
47
+ Requires-Dist: treeig; extra == "all"
48
+ Requires-Dist: scikit-learn; extra == "all"
49
+ Requires-Dist: matplotlib; extra == "all"
50
+ Requires-Dist: pandas; extra == "all"
51
+ Requires-Dist: shap; extra == "all"
52
+ Dynamic: license-file
53
+
54
+ # EDEF
55
+
56
+ EDEF (Euler Decomposition of Explained Fit) decomposes realized predictive
57
+ performance into additive feature contributions.
58
+
59
+ For each observation, EDEF returns feature attributions $\phi_j$ satisfying
60
+
61
+ $$\sum_j \phi_j = \mathcal{L}(y, f(x_0)) - \mathcal{L}(y, f(x)),$$
62
+
63
+ where $\mathcal{L}$ is the prediction loss, $x_0$ is a baseline input, $x$ is
64
+ the observation, and $f(x)$ is the prediction function evaluated at $x$.
65
+ A positive contribution means the feature improved realized predictive fit
66
+ relative to the baseline.
67
+
68
+ Standard attribution methods explain predictions. EDEF explains whether those
69
+ predictions were accurate.
70
+
71
+ EDEF applies the integrated-gradients framework of Sundararajan, Taly, and
72
+ Yan (2017) to the loss function rather than the prediction, and thereby
73
+ inherits the main axiomatic properties of IG — completeness, implementation invariance, and the dummy axiom — while attributing realized
74
+ predictive fit rather than predicted values.
75
+
76
+ ## Installation
77
+
78
+ ```bash
79
+ pip install edef
80
+ ```
81
+
82
+ Optional dependencies:
83
+
84
+ ```bash
85
+ pip install torch # for TorchExplainer (or: pip install "edef[torch]")
85
+ pip install treeig # for TreeExplainer (or: pip install "edef[tree]")
86
+ pip install shap # for SHAP plotting compatibility (included in "edef[plots]")
88
+ ```
89
+
90
+ ## Using EDEF
91
+
92
+ Although EDEF attributes model fit instead of predictions, it follows a
93
+ familiar explainer pattern:
94
+
95
+ ```python
96
+ import numpy as np
97
+ from sklearn.linear_model import LinearRegression
98
+ import edef
99
+
100
+ rng = np.random.default_rng(123)
101
+ X = rng.normal(size=(200, 3))
102
+ y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)
103
+
104
+ model = LinearRegression().fit(X, y)
105
+ result = edef.LinearExplainer(model, feature_names=["x1", "x2", "x3"])(X, y)
106
+ print(result)
107
+ ```
108
+
109
+ ```text
110
+ Feature contributions
111
+ ---------------------
112
+ x1 edef= 0.979 se= 0.134 t= 7.33 share= 0.741
113
+ x2 edef= 0.343 se= 0.064 t= 5.33 share= 0.260
114
+ x3 edef=-0.001 se= 0.002 t=-0.40 share=-0.000
115
+ ```
116
+
117
+ Unlike most attribution methods, EDEF reports standard errors and t-statistics
118
+ alongside attribution values. Features that move predictions without improving
119
+ accuracy show up near zero.
120
+
121
+ The decomposition is exact for any linear model: contributions sum to the
122
+ realized reduction in mean squared error relative to the sample mean.
123
+
124
+ ## Why EDEF?
125
+
126
+ Prediction-attribution methods answer "why did the model predict this value?"
127
+ EDEF answers "which features made the model accurate here?"
128
+
129
+ These questions have different answers. A feature can strongly influence a
130
+ prediction while contributing nothing to predictive accuracy — or can
131
+ actively hurt it. This happens when a feature moves predictions in the wrong
132
+ direction, when a feature is overfit, or when a feature captures real signal
133
+ on average but adds noise on a particular evaluation sample.
134
+
135
+ Consider a model trained to predict financial returns. Feature A captures a
136
+ persistent signal; feature B was correlated with returns in the training set
137
+ but is uncorrelated in the evaluation period. Both features generate large
138
+ prediction movements. SHAP or Integrated Gradients assigns large importances to
139
+ both. EDEF assigns large importance to A and near-zero importance to B —
140
+ because B's prediction movements do not improve realized fit in the evaluation sample.
141
+
142
+ The distinction matters most where prediction accuracy is the natural object of
143
+ interest: in model monitoring, out-of-sample validation, feature selection,
144
+ overfit detection, and scientific settings where fit to held-out outcomes
145
+ is the standard of evidence. The practical differences can be small when predictions are highly accurate but are often large when models are imperfect.
146
+
147
+ ## How EDEF works
148
+
149
+ EDEF applies the path-integral perspective of Integrated Gradients — but to
150
+ the loss function rather than the prediction.
151
+
152
+ Along the straight-line path
153
+
154
+ $$x(t) = x_0 + t \cdot (x - x_0), \qquad 0 \le t \le 1,$$
155
+
156
+ the loss reduction from baseline to observation is
157
+
158
+ $$\mathcal{L}(y, f(x_0)) - \mathcal{L}(y, f(x))
159
+ = -\int_0^1 \frac{d}{dt} \mathcal{L}(y, f(x(t))) \, dt.$$
160
+
161
+ By the chain rule this integral decomposes additively across features:
162
+
163
+ $$\phi_j = (x_j - x_{0,j}) \int_0^1
164
+ \left[-\frac{\partial \mathcal{L}}{\partial f} \cdot
165
+ \frac{\partial f}{\partial x_j}\Bigg{|}_{x(t)}\right] dt.$$
166
+
167
+ The integrand is the prediction gradient $\partial f/\partial x_j$ multiplied
168
+ by the loss gradient $\partial \mathcal{L}/\partial f$. This chain-rule factor
169
+ is what distinguishes EDEF from Integrated Gradients, which integrates only the
170
+ prediction gradient. For squared-error loss,
171
+ $\partial \mathcal{L}/\partial f = -2(y - f(x(t)))$, so EDEF weights the
172
+ prediction gradient by how wrong the prediction is at each point along the
173
+ path. Features that move predictions toward the truth accumulate positive
174
+ contributions; features that move predictions away accumulate negative
175
+ contributions.
176
+
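The path integral above can be checked numerically on a toy function. The following is a minimal sketch, not the package's implementation: the function `f`, the helper name `edef_quadrature`, and the finite-difference step `eps` are all assumptions made for illustration, and squared-error loss is assumed. It approximates each $\phi_j$ with Gauss-Legendre quadrature and confirms that the contributions sum to the loss reduction.

```python
import numpy as np

def edef_quadrature(f, x0, x, y, n_steps=32, eps=1e-6):
    """Approximate per-feature contributions for squared-error loss (one observation)."""
    x0 = np.asarray(x0, dtype=float)
    x = np.asarray(x, dtype=float)
    # Gauss-Legendre nodes and weights on [-1, 1], mapped to [0, 1].
    nodes, weights = np.polynomial.legendre.leggauss(n_steps)
    t = 0.5 * (nodes + 1.0)
    w = 0.5 * weights
    delta = x - x0
    integral = np.zeros_like(x)
    for t_k, w_k in zip(t, w):
        x_t = x0 + t_k * delta
        # Central finite-difference gradient of f at x_t (illustration only;
        # autograd or analytic gradients would avoid the loop over features).
        grad_f = np.array([
            (f(x_t + eps * e) - f(x_t - eps * e)) / (2.0 * eps)
            for e in np.eye(len(x))
        ])
        dL_df = -2.0 * (y - f(x_t))          # squared-error loss gradient
        integral += w_k * (-dL_df) * grad_f  # integrand: -dL/df * df/dx_j
    return delta * integral

def f(z):
    # Hypothetical smooth prediction function used only for this check.
    return z[0] ** 2 + np.sin(z[1]) + 0.5 * z[0] * z[2]

x0 = np.zeros(3)
x = np.array([1.0, 0.5, -2.0])
y = 0.3
phi = edef_quadrature(f, x0, x, y)
loss_reduction = (y - f(x0)) ** 2 - (y - f(x)) ** 2
print(phi, phi.sum(), loss_reduction)
```

The sum of `phi` should match `loss_reduction` up to quadrature and finite-difference error, which is the completeness property stated at the top of this README.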
177
+ This integral is computed differently for each model class:
178
+
179
+ - **Linear models.** The integral has a closed form. No quadrature is needed.
180
+
181
+ - **Tree models.** Via TreeIG, the path trace gives the exact sequence of
182
+ split crossings and prediction jumps along $x(t)$. At each crossing, the
183
+ loss changes by a computable amount. EDEF assigns that loss change to the
184
+ crossing feature. The result is exact — no quadrature, no approximation.
185
+
186
+ - **PyTorch models.** Automatic differentiation computes
187
+ $\partial \mathcal{L}/\partial x_j$ at each interpolation point;
188
+ Gauss-Legendre quadrature integrates over $t$.
189
+
190
+ - **Black-box sklearn models.** Finite-difference approximations to the loss
191
+ gradient replace automatic differentiation; Gauss-Legendre quadrature
192
+ integrates over $t$.
193
+
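For a linear model with squared-error loss, the closed form mentioned in the first bullet can be derived directly from the integral: the prediction along the path is $f(x_0) + t\,\Delta f$ with $\Delta f = f(x) - f(x_0)$, so $\phi_j = \beta_j (x_j - x_{0,j})\,\bigl(2(y - f(x_0)) - \Delta f\bigr)$. The sketch below implements that expression under those assumptions; the function name and arguments are hypothetical and this is not necessarily how `LinearExplainer` is written.

```python
import numpy as np

def linear_edef_squared_error(beta, intercept, X, y, baseline):
    """Closed-form contributions for f(x) = intercept + x @ beta, squared-error loss."""
    X = np.asarray(X, dtype=float)
    delta = X - baseline                  # (n, p) feature movements from the baseline
    ig = delta * beta                     # (n, p) Integrated Gradients terms
    f0 = intercept + baseline @ beta      # prediction at the baseline (scalar)
    f1 = intercept + X @ beta             # predictions at the observations (n,)
    weight = 2.0 * (y - f0) - (f1 - f0)   # integral of 2 * (y - f(x(t))) over t
    return ig * weight[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta = np.array([1.0, 0.5, 0.0])
y = X @ beta + rng.normal(scale=0.5, size=100)
baseline = X.mean(axis=0)

phi = linear_edef_squared_error(beta, 0.2, X, y, baseline)
f0 = 0.2 + baseline @ beta
f1 = 0.2 + X @ beta
loss_reduction = (y - f0) ** 2 - (y - f1) ** 2
print(np.abs(phi.sum(axis=1) - loss_reduction).max())
```

Each row of `phi` sums to that observation's loss reduction without any quadrature, and a feature with a zero coefficient receives exactly zero.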
194
+ ## Speed
195
+
196
+ EDEF's cost is, in many cases, independent of the number of features $p$ because it exploits analytic or automatic gradients. Permutation importance and SAGE must answer "what if feature $j$ were absent?" separately for every feature, so their cost scales with $p$. EDEF follows a single path from $x_0$ to $x$ and integrates gradients along it in $S$ passes, where $S$ is fixed regardless of how many features the model has. For linear models the integral has a closed form; for tree models it is exact via TreeIG path traces; for PyTorch and black-box models, $S$ is typically 32–64.
197
+
198
+ The table illustrates computational costs using full-dataset forward passes as the main cost unit. $S$ is the number of quadrature steps; $R$ is the number of permutation repetitions per feature; $T$ is the number of SAGE coalition samples; $B$ is the background set size.
199
+
200
+ | Method | Forward passes | $p=100$, $n=10{,}000$ | $p=1{,}000$, $n=10{,}000$ |
201
+ |:---|:---|---:|---:|
202
+ | EDEF ($S=50$) | $S$ (independent of $p$) | 50 | 50 |
203
+ | Permutation ($R=10$) | $R \cdot p$ | 1,000 | 10,000 |
204
+ | SAGE ($T=512$, $B=128$) | $T \cdot B \cdot p$ | 6,553,600 | 65,536,000 |
205
+
206
+ At $p=1{,}000$, EDEF requires 200 times fewer passes than permutation and over
207
+ a million times fewer than SAGE. The gap widens with $p$.
208
+
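The pass counts and ratios quoted here follow from the formulas in the table. A small sketch of that arithmetic, assuming $S=50$ for EDEF and the $R$, $T$, and $B$ values given in the table (the helper name is hypothetical):

```python
def forward_passes(p, S=50, R=10, T=512, B=128):
    """Full-dataset forward passes per method, using the table's cost formulas."""
    return {"EDEF": S, "Permutation": R * p, "SAGE": T * B * p}

counts_100 = forward_passes(100)
counts_1000 = forward_passes(1000)

# Ratios relative to EDEF at p = 1,000.
ratio_perm = counts_1000["Permutation"] / counts_1000["EDEF"]
ratio_sage = counts_1000["SAGE"] / counts_1000["EDEF"]
print(counts_100, counts_1000, ratio_perm, ratio_sage)
```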
209
+ Forward-pass counts may understate the practical wall-clock advantage because
210
+ EDEF's passes can be fully vectorized over observations while permutation and
211
+ SAGE run sequentially across features and coalition samples. See the timing
212
+ notebook and the accompanying paper for wall-clock comparisons.
213
+
214
+ For linear models, EDEF is closed form and requires no model evaluations at
215
+ all. For tree models and PyTorch models, wall-clock times are similar: the
216
+ dominant cost for tree models is the TreeIG path traversal rather than model
217
+ forward passes, and both complete attribution for thousands of observations on
218
+ a typical model in well under a second.
219
+
220
+ ## Statistical inference
221
+
222
+ Because EDEF computes a contribution $\phi_j(x_i)$ for each feature and each
223
+ observation, feature importances are sample averages:
224
+
225
+ $$\bar\phi_j = \frac{1}{n} \sum_{i=1}^n \phi_j(x_i).$$
226
+
227
+ Sample averages have standard errors. EDEF reports them:
228
+
229
+ $$\widehat{\text{se}}(\bar\phi_j)
230
+ = \frac{1}{\sqrt{n}} \text{sd}\bigl(\phi_j(x_1), \ldots, \phi_j(x_n)\bigr).$$
231
+
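The two formulas above translate directly into a few lines of NumPy. This is a sketch on simulated per-observation contributions, not the package's code; the helper name `summarize` and the choice of the sample standard deviation (`ddof=1`) are assumptions.

```python
import numpy as np

def summarize(phi):
    """Mean contribution, standard error, t-statistic, and share per feature."""
    phi = np.asarray(phi, dtype=float)   # shape (n_observations, n_features)
    n = phi.shape[0]
    mean = phi.mean(axis=0)
    se = phi.std(axis=0, ddof=1) / np.sqrt(n)   # ddof=1 is an assumption
    t = mean / se
    share = mean / mean.sum()            # share of total explained fit
    return mean, se, t, share

# Simulated contributions: two informative features and one near-zero feature.
rng = np.random.default_rng(1)
phi = rng.normal(loc=[1.0, 0.3, 0.0], scale=[1.5, 0.8, 0.02], size=(400, 3))
mean, se, t, share = summarize(phi)
print(mean, se, t, share)
```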
232
+ Standard errors unlock inference that most attribution methods cannot provide:
233
+ t-statistics to test whether a feature's contribution is distinguishable from
234
+ zero, standard errors on grouped contributions, and uncertainty quantification
235
+ across resampled evaluation sets. In settings where prediction accuracy is
236
+ itself the quantity of scientific interest — rather than a prediction to be
237
+ explained — these inferential outputs are as important as the point estimates.
238
+
239
+ ## Relation to other methods
240
+
241
+ SHAP and Integrated Gradients explain predictions. EDEF and SAGE explain
242
+ realized model fit. These are fundamentally different attribution targets.
243
+
244
+ **SHAP and Integrated Gradients** ask:
245
+ > "How much does feature $j$ contribute to the prediction?"
246
+
247
+ **EDEF and SAGE** ask:
248
+ > "How much does feature $j$ contribute to realized predictive accuracy?"
249
+
250
+ The table below positions EDEF against the main alternatives along six
251
+ dimensions. The columns record what each method attributes; whether it holds
252
+ the model fixed, without feature removal or perturbation; whether it evaluates
253
+ the model only at realized inputs; its scope (local per-observation,
254
+ aggregated-local, or global); whether natural standard errors are available
255
+ without resampling; and the process by which importance is allocated.
256
+
257
+ | Method | Attributes | Model fixed | Realized inputs only | Scope | Nat. SEs | Process |
258
+ |:---|:---|:---:|:---:|:---|:---:|:---|
259
+ | Coefficients | Prediction / score | ✓ | ✓ | Global | — | Local sensitivity |
260
+ | Integrated Gradients | Prediction | ✓ | — | Local | — | Continuous path |
261
+ | SHAP | Prediction | — | — | Local, aggregated | — | Discrete averaging |
262
+ | Permutation / perturbation | Prediction/Accuracy | — | — | Global | — | Discrete removal |
263
+ | SAGE | Accuracy | — | — | Global | — | Discrete averaging |
264
+ | **EDEF** | **Accuracy** | **✓** | **✓** | **Global** | **✓** | **Continuous path** |
265
+
266
+ *Model fixed*: the method conditions on the deployed model at realized inputs,
267
+ without counterfactual feature removal, marginalization, or perturbation.
268
+ *Realized inputs only*: the model is evaluated only at inputs derivable as
269
+ convex combinations of actual observations, requiring no synthetic inputs.
270
+ *Nat. SEs*: standard errors reflecting sampling variation, not Monte Carlo
271
+ approximation error.
272
+
273
+ ### Integrated Gradients
274
+
275
+ Integrated Gradients computes $\phi_j = (x_j - x_{0,j}) \int_0^1 \frac{\partial f(x(t))}{\partial x_j} \, dt$, the integral of the prediction gradient along the path from $x_0$ to $x$. EDEF computes the integral of the loss gradient along the same path. The two methods share a path, a baseline, and an integration method; they differ only in what is integrated.
276
+
277
+ That difference in the integrand is the full story. IG measures how much each
278
+ feature moved the prediction as we interpolate from baseline to observation.
279
+ EDEF measures how much each feature improved or worsened predictive accuracy
280
+ as we make that same interpolation. For a perfect prediction, IG and EDEF
281
+ give the same sign for every feature. For a poor prediction, features that
282
+ moved the prediction in the wrong direction get negative EDEF attribution
283
+ even if they get large positive IG attribution.
284
+
285
+ For a linear model with zero intercept and zero baseline, EDEF and IG agree in
286
+ sign but differ in magnitude, with EDEF attributions scaled by the accuracy of
287
+ the prediction. As predictions become less accurate, the two methods diverge.
288
+
289
+ ### SHAP
290
+
291
+ SHAP builds attributions from discrete feature inclusion effects averaged over
292
+ coalitions of other features. It does not follow a path and does not observe
293
+ realized outcomes. A feature can receive large SHAP importance purely because
294
+ it moves predictions strongly, regardless of whether those prediction movements
295
+ correspond to actual patterns in the outcome variable.
296
+
297
+ SHAP's coalition construction is deliberately indifferent to whether predictions
298
+ are accurate. The same coalition structure, the same expected-prediction
299
+ baseline, and the same discrete inclusion/exclusion logic apply whether the
300
+ model generalizes well or poorly. This makes SHAP a precise tool for
301
+ explaining the model's behavior in input space, and an imprecise tool for
302
+ evaluating that behavior against realized outcomes.
303
+
304
+ ### Permutation and perturbation methods
305
+
306
+ Permutation importance — available in scikit-learn as `permutation_importance`, and implemented in various forms across the model-evaluation literature — measures how much model accuracy declines or predictions change when a feature is shuffled or removed. The question asked is: "how much does the model rely on feature $j$?"
307
+
308
+ Permutation methods can target either predictive accuracy or predictions, depending on the scoring function used. The most common application — measuring the drop in model performance when a feature's values are shuffled — is a fit-based method and sits in the same camp as SAGE and EDEF. Permutation-based SHAP, by contrast, uses permutation sampling to estimate prediction attributions. The discussion below concerns the fit-based variant. It differs from EDEF in three ways.
309
+
310
+ First, permutation methods evaluate the model at counterfactual inputs.
311
+ Shuffling feature $j$ creates (feature, outcome) pairs that never occurred
312
+ in the data. Whether those pairs are meaningful depends on the joint
313
+ distribution of features; correlation with other features means the shuffled
314
+ inputs may fall far outside the model's training support. EDEF evaluates the
315
+ model only at convex combinations of two real inputs — no synthetic feature
316
+ combinations are introduced.
317
+
318
+ Second, permutation methods produce a single importance score per feature
319
+ with no natural observation-level decomposition. Standard errors, when
320
+ reported, reflect Monte Carlo variance across permutation draws — not the
321
+ sampling variation across observations that EDEF's standard errors capture.
322
+
323
+ Third, permutation importance answers a counterfactual question about feature
324
+ removal: "how much worse would accuracy be if the model could not see feature
325
+ $j$?" EDEF answers a realized question about feature contribution: "how much
326
+ did feature $j$ contribute to accuracy on these observations, given the
327
+ predictions the model actually made?" The two questions have the same answer
328
+ for independent features in large samples and can diverge substantially when
329
+ features are correlated or when model accuracy varies across the sample.
330
+
331
+ ### SAGE
332
+
333
+ SAGE is the closest existing method to EDEF in motivation. Both measure feature
334
+ contributions to realized predictive performance rather than to predictions.
335
+ They differ substantially in construction.
336
+
337
+ SAGE applies Shapley-style coalition averaging to predictive performance: it
338
+ measures how much each feature changes expected loss as it enters or leaves a
339
+ coalition, where absent features are marginalized over a background
340
+ distribution. The SAGE attribution for feature $j$ asks "how much worse would
341
+ the model perform if it could not use feature $j$?" — a global counterfactual
342
+ question about feature removal.
343
+
344
+ EDEF asks "how much did feature $j$ contribute to the loss reduction for this
345
+ observation, along the specific path from baseline to observation?" — a local
346
+ path-integral question about feature movement. The difference is analogous to
347
+ the difference between SHAP and IG: Shapley-style marginalizing out features
348
+ versus path-integral accumulation of gradient contributions.
349
+
350
+ Three practical consequences follow. First, EDEF requires only a baseline
351
+ vector; SAGE requires a background distribution from which to marginalize out
352
+ features. Second, EDEF computes observation-level contributions that aggregate
353
+ naturally to sample-average importances with standard errors; SAGE produces
354
+ global importance estimates without natural observation-level decompositions.
355
+ Third, EDEF exploits closed-form path integrals and exact tree path traces for
356
+ efficient computation; SAGE currently lacks analogous backend optimizations
357
+ and can be expensive for large models.
358
+
359
+ ## Available explainers
360
+
361
+ | Explainer | Intended models | Method |
362
+ |:---|:---|:---|
363
+ | `LinearExplainer` | Linear and generalized linear models | Closed-form exact decomposition |
364
+ | `TorchExplainer` | PyTorch neural networks | Autograd + Gauss-Legendre quadrature |
365
+ | `TreeExplainer` | Tree ensembles | Exact TreeIG path traces |
366
+ | `NumericalExplainer` | Any sklearn-style model | Finite-difference + Gauss-Legendre quadrature |
367
+
368
+ ## Supported models
369
+
370
+ ### Linear models
371
+
372
+ - Linear regression
373
+ - Binary logistic regression
374
+ - Multiclass logistic regression
375
+
376
+ ### PyTorch models
377
+
378
+ - Regression (squared-error loss)
379
+ - Binary classification (log loss)
380
+ - Multiclass classification (softmax log loss)
381
+
382
+ ### Tree models (via TreeIG)
383
+
384
+ - `sklearn.tree.DecisionTreeRegressor`
385
+ - `sklearn.ensemble.RandomForestRegressor`
386
+ - `sklearn.ensemble.ExtraTreesRegressor`
387
+ - `sklearn.ensemble.GradientBoostingRegressor`
388
+ - `sklearn.ensemble.GradientBoostingClassifier`
389
+ - `xgboost.XGBRegressor`, `xgboost.XGBClassifier`, `xgboost.Booster`
390
+ - `lightgbm.LGBMRegressor`, `lightgbm.LGBMClassifier`, `lightgbm.Booster`
391
+
392
+ Tree classification uses raw margins/logits rather than predicted
393
+ probabilities. Probabilities are not additive across trees.
394
+
395
+ ### Numerical black-box models
396
+
397
+ Any model with `predict(X)` (regression) or `predict_proba(X)`
398
+ (classification), including sklearn pipelines and `MLPRegressor`/
399
+ `MLPClassifier`.
400
+
401
+ ## Not currently supported
402
+
403
+ - probability-output tree attribution;
404
+ - missing-value tree routing;
405
+ - CatBoost.
406
+
407
+ ## Examples for different models
408
+
409
+ ### Linear regression
410
+
411
+ ```python
412
+ import numpy as np
413
+ from sklearn.linear_model import LinearRegression
414
+ import edef
415
+
416
+ model = LinearRegression().fit(X, y)  # X, y as in the "Using EDEF" example above
417
+ result = edef.LinearExplainer(model, feature_names=["x1", "x2", "x3"])(X, y)
418
+ ```
419
+
420
+ ### Binary classification
421
+
422
+ ```python
423
+ from sklearn.linear_model import LogisticRegression
424
+ import edef
425
+
426
+ model = LogisticRegression().fit(X, y)
427
+ result = edef.LinearExplainer(model, loss="log_loss", feature_names=[...])(X, y)
428
+ ```
429
+
430
+ ### Tree regression
431
+
432
+ ```python
433
+ from sklearn.ensemble import GradientBoostingRegressor
434
+ import edef
435
+
436
+ model = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X, y)
437
+
438
+ explainer = edef.TreeExplainer(model, baseline=X.mean(axis=0), loss="squared_error")
439
+ result = explainer(X_eval, y_eval)
440
+ ```
441
+
442
+ EDEF uses TreeIG to find the exact sequence of split-crossing events along
443
+ the interpolation path for each observation. Each crossing changes the
444
+ prediction, which changes the loss. EDEF assigns the loss change at each
445
+ crossing to the crossing feature. The result is exact — no quadrature,
446
+ no approximation parameters.
447
+
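The split-crossing idea can be illustrated on a tiny hand-written tree. The sketch below is an assumption-laden toy, not TreeIG's algorithm or API: the tree, the `SPLITS` list, and the function names are hypothetical, squared-error loss is assumed, and it assumes no two thresholds are crossed at exactly the same $t$. It finds where the straight-line path crosses each threshold, evaluates the prediction on each segment between crossings, and assigns each loss change to the feature whose threshold was crossed.

```python
import numpy as np

# (feature index, threshold) for every split in a small hand-written tree.
SPLITS = [(0, 0.5), (1, -0.2), (1, 1.0)]

def tree_predict(z):
    """Toy depth-2 regression tree."""
    if z[0] <= 0.5:
        return 1.0 if z[1] <= -0.2 else 2.0
    return 3.0 if z[1] <= 1.0 else 5.0

def tree_edef(x0, x, y):
    """Assign squared-error loss changes at split crossings to the crossing feature."""
    x0, x = np.asarray(x0, dtype=float), np.asarray(x, dtype=float)
    delta = x - x0
    # Times t in (0, 1) at which the path x0 + t * delta crosses a threshold.
    events = []
    for j, c in SPLITS:
        if delta[j] != 0.0:
            t = (c - x0[j]) / delta[j]
            if 0.0 < t < 1.0:
                events.append((t, j))
    events.sort()
    times = [0.0] + [t for t, _ in events] + [1.0]
    phi = np.zeros_like(x)
    # Evaluate the prediction at segment midpoints to stay off the thresholds.
    prev_pred = tree_predict(x0 + 0.5 * (times[0] + times[1]) * delta)
    for k, (_, j) in enumerate(events):
        mid = 0.5 * (times[k + 1] + times[k + 2])
        pred = tree_predict(x0 + mid * delta)
        phi[j] += (y - prev_pred) ** 2 - (y - pred) ** 2
        prev_pred = pred
    return phi

x0 = np.array([0.0, 0.0])
x = np.array([1.0, 1.5])
y = 4.5
phi = tree_edef(x0, x, y)
loss_reduction = (y - tree_predict(x0)) ** 2 - (y - tree_predict(x)) ** 2
print(phi, loss_reduction)
```

Because the loss changes telescope along the path, the contributions sum exactly to the loss reduction; thresholds on branches the path never visits produce no prediction change and therefore contribute nothing.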
448
+ ### Tree classification
449
+
450
+ ```python
451
+ from sklearn.ensemble import GradientBoostingClassifier
452
+ import edef
453
+
454
+ model = GradientBoostingClassifier(...).fit(X, y)
455
+ explainer = edef.TreeExplainer(model, baseline=X.mean(axis=0), loss="log_loss")
456
+ result = explainer(X_eval, y_eval)
457
+ ```
458
+
459
+ For multiclass models, use `loss="multiclass_log_loss"`. EDEF merges the
460
+ split-crossing sequences across all class-margin trees, applying exact softmax
461
+ log-loss changes at each event.
462
+
463
+ ### PyTorch models
464
+
465
+ ```python
466
+ import edef
467
+
468
+ explainer = edef.TorchExplainer(
469
+ model,
470
+ baseline=X_train.mean(axis=0),
471
+ loss="squared_error", # or "log_loss", "multiclass_log_loss"
472
+ n_steps=50,
473
+ feature_names=[...],
474
+ )
475
+ result = explainer(X_eval, y_eval)
476
+ ```
477
+
478
+ ### Black-box sklearn models
479
+
480
+ ```python
481
+ from sklearn.neural_network import MLPRegressor
482
+ from sklearn.pipeline import make_pipeline
483
+ from sklearn.preprocessing import StandardScaler
484
+ import edef
485
+
486
+ model = make_pipeline(StandardScaler(), MLPRegressor(...)).fit(X, y)
487
+
488
+ explainer = edef.NumericalExplainer(
489
+ model,
490
+ baseline=X.mean(axis=0),
491
+ loss="squared_error",
492
+ n_steps=32,
493
+ feature_names=[...],
494
+ )
495
+ result = explainer(X_eval, y_eval)
496
+ ```
497
+
498
+ ## Grouped contributions
499
+
500
+ ```python
501
+ grouped = result.group(["signal", "signal", "noise"])
502
+ ```
503
+
504
+ Grouped contributions preserve exact additivity. Group labels map input
505
+ features to named groups; features sharing a label are summed. This is
506
+ useful for one-hot encoded variables, embedding blocks, factor groups, and
507
+ hierarchical feature structures.
508
+
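One way to picture grouping is as a column sum over the observation-level contributions, followed by the usual mean and standard-error calculation on the summed columns. The sketch below works on a simulated contribution matrix; the helper `group_contributions` is hypothetical and is not claimed to mirror what `result.group` does internally.

```python
import numpy as np

def group_contributions(phi, labels):
    """Sum per-observation contributions within each label, then summarize."""
    phi = np.asarray(phi, dtype=float)        # shape (n_observations, n_features)
    names = list(dict.fromkeys(labels))       # unique labels in first-seen order
    grouped = np.column_stack([
        phi[:, [i for i, lab in enumerate(labels) if lab == name]].sum(axis=1)
        for name in names
    ])
    n = grouped.shape[0]
    mean = grouped.mean(axis=0)
    se = grouped.std(axis=0, ddof=1) / np.sqrt(n)   # ddof=1 is an assumption
    return names, mean, se

rng = np.random.default_rng(2)
phi = rng.normal(loc=[1.0, 0.3, 0.0], scale=1.0, size=(200, 3))
names, mean, se = group_contributions(phi, ["signal", "signal", "noise"])
print(names, mean, se)
```

Summing before averaging keeps the group totals additive, and computing the standard error from the summed column accounts for any correlation between the features in a group.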
509
+
510
+ ## Accessing results
511
+
512
+ ```python
513
+ result.values # feature contributions (point estimates)
514
+ result.standard_errors # standard errors
515
+ result.t_values # t-statistics: values / standard_errors
516
+ result.proportions # share of total explained fit
517
+ result.to_frame() # pandas DataFrame, sorted by contribution
518
+ result.plot() # horizontal bar chart with confidence intervals
519
+ ```
520
+
521
+ Standard errors are computed from the observation-level contributions
522
+ and scale correctly under grouping.
523
+
524
+ ## SHAP plotting
525
+
526
+ ```python
527
+ shap_exp = result.to_shap_explanation(data=X)
528
+
529
+ import shap
530
+ shap.plots.beeswarm(shap_exp)
531
+ ```
532
+
533
+ The underlying values are EDEF realized-fit contributions. The SHAP plotting
534
+ interface is used for visualization only.
535
+
536
+ ## Project status
537
+
538
+ EDEF covers the dominant regression and classification models in the Python
539
+ ecosystem with exact or high-accuracy decompositions:
540
+
541
+ - closed-form exact attribution for linear models;
542
+ - autograd path integration for PyTorch models;
543
+ - exact attribution for tree ensembles via TreeIG;
544
+ - numerical attribution for any sklearn-interface model;
545
+ - multiclass log-loss decomposition throughout;
546
+ - observation-level contributions with standard errors and t-statistics;
547
+ - grouping, SHAP-compatible plotting, and pandas output.
548
+
549
+ ## References
550
+
551
+ EDEF:
552
+
553
+ - Hentschel, Ludger. 2026.
554
+ "Feature importance for model fit: Nonlinear regression and
555
+ classification in machine learning models."
556
+
557
+ - Hentschel, Ludger. 2026.
558
+ "Feature importance for predictive accuracy: An Euler decomposition."
559
+
560
+ TreeIG:
561
+
562
+ - Hentschel, Ludger. 2026.
563
+ "TreeIG: Exact Integrated Gradients for Tree-Based Models."
564
+
565
+ Integrated Gradients:
566
+
567
+ - Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. 2017.
568
+ "Axiomatic Attribution for Deep Networks."
569
+ *International Conference on Machine Learning (ICML).*
570
+
571
+ SHAP and TreeSHAP:
572
+
573
+ - Lundberg, Scott M., and Su-In Lee. 2017.
574
+ "A Unified Approach to Interpreting Model Predictions."
575
+ *Advances in Neural Information Processing Systems (NeurIPS).*
576
+
577
+ - Lundberg, Scott M., Gabriel Erion, and Su-In Lee. 2020.
578
+ "From Local Explanations to Global Understanding with Explainable AI
579
+ for Trees."
580
+ *Nature Machine Intelligence.*
581
+
582
+ SAGE:
583
+
584
+ - Covert, Ian, Scott Lundberg, and Su-In Lee. 2020.
585
+ "Understanding Global Feature Contributions With Additive Importance
586
+ Measures."
587
+ *NeurIPS.*
588
+
589
+ Permutation importance:
590
+
591
+ - Breiman, Leo. 2001.
592
+ "Random Forests."
593
+ *Machine Learning* 45(1): 5–32.
594
+
595
+ - Louppe, Gilles, Louis Wehenkel, Antonio Sutera, and Pierre Geurts. 2013.
596
+ "Understanding Variable Importances in Forests of Randomized Trees."
597
+ *Advances in Neural Information Processing Systems (NeurIPS).*
598
+
599
+ ## License
600
+
601
+ BSD 3-Clause License.