score-analysis 0.2.3__tar.gz

Metadata-Version: 2.1
Name: score-analysis
Version: 0.2.3
Summary: Library to evaluate models
Author: Martins Bruveris
Author-email: martins.bruveris@gmx.com
Requires-Python: >=3.9,<3.13
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Provides-Extra: docs
Provides-Extra: jupyter
Requires-Dist: enum-tools ; extra == "docs"
Requires-Dist: jupyter ; extra == "jupyter"
Requires-Dist: matplotlib ; extra == "jupyter"
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: scipy
Requires-Dist: seaborn ; extra == "jupyter"
Requires-Dist: sphinx (>=4.4.0,<5.0.0) ; extra == "docs"
Requires-Dist: sphinx-rtd-theme ; extra == "docs"
Requires-Dist: sphinx-toolbox (>=2.0.0,<3.0.0) ; extra == "docs"
Requires-Dist: tabulate
Requires-Dist: tqdm ; extra == "jupyter"
Description-Content-Type: text/markdown

# score-analysis

Package to analyse ML model results. It contains efficient implementations of
common metric computations, such as TPR, FPR, and EER, as well as methods for
threshold setting.

Check out the online
[documentation]().

## Usage

### Terminology

Sometimes we like to work with metrics based on acceptance and rejection, such as
FAR (false acceptance rate) and FRR (false rejection rate), while standard ML
terminology talks about positive and negative classes and FPR (false positive rate)
and FNR (false negative rate).

This library adopts the standard ML terminology. The translation is simple: replace
"accept" with "positive" and "reject" with "negative", and you have a dictionary
between the two worlds.

The library is also agnostic to the direction in which scores point. It works with
scores that indicate membership of the positive (accept) class as well as with
scores that indicate membership of the negative (reject) class. The score
interpretation is set using the `score_class` parameter when constructing a
`Scores` object.

The key is to decouple the process of computing scores from the process of
interpreting them. When we compute scores, e.g., using an ML model, some will point
towards genuines, some towards spoofs/fraud. Sometimes we use score mappers that
reverse the score orientation. We cannot change the scores themselves. But when we
move on to interpreting them, we should always use a fixed terminology: the positive
class means accept/genuine; the negative class means reject/spoof/fraud. At the
point where we go from generating scores to interpreting them, we set, via the
`score_class` parameter, how the scores are to be interpreted.

### Scores

We assume that we work with a binary classification problem. First, we create a
`Scores` object with the experiment results. We can do this in two ways.

```python
from score_analysis import Scores

# If we have the scores for the positive and negative classes separately
scores = Scores(pos=[1, 2, 3], neg=[0.5, 1.5])

# If we have an intermingled set of scores and labels
scores = Scores.from_labels(
    labels=[1, 1, 1, 0, 0],
    scores=[1, 2, 3, 0.5, 1.5],
    # We specify the label of the positive class. All other labels are assigned to
    # the negative class.
    pos_label=1,
)
```

There are two parameters that determine how metrics are calculated:

- Does a score indicate membership of the positive or the negative class
  (`score_class`)?
- If a score is exactly equal to the threshold, is it assigned to the positive
  or the negative class (`equal_class`)?

The meaning of the parameters is summarized in the following table:

| score_class | equal_class | Decision logic for positive class |
|:-----------:|:-----------:|:---------------------------------:|
|     pos     |     pos     |         score >= threshold        |
|     pos     |     neg     |         score > threshold         |
|     neg     |     pos     |         score <= threshold        |
|     neg     |     neg     |         score < threshold         |

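The four combinations can be sketched with plain NumPy comparisons. This is an
illustrative re-implementation of the decision logic in the table, not the
library's internals:

```python
import numpy as np

def pos_decision(scores, threshold, score_class="pos", equal_class="pos"):
    """Assign scores to the positive class, following the table above."""
    scores = np.asarray(scores)
    if score_class == "pos":
        # Higher scores mean positive; equal_class decides the tie-break
        return scores >= threshold if equal_class == "pos" else scores > threshold
    else:
        # Lower scores mean positive
        return scores <= threshold if equal_class == "pos" else scores < threshold

# A score exactly at the threshold goes to the class named by equal_class
print(pos_decision([1.0, 2.0, 3.0], threshold=2.0))
print(pos_decision([1.0, 2.0, 3.0], threshold=2.0, equal_class="neg"))
```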
We can apply a threshold to a `Scores` object to obtain a confusion matrix, and
then compute metrics associated with the confusion matrix.

```python
cm = scores.cm(threshold=2.5)
print(cm.fpr())  # Print the false positive rate
```

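For intuition, the false positive rate at a threshold is just the fraction of
negative scores that land on the positive side. A minimal NumPy sketch, assuming
the `score_class="pos"`, `equal_class="pos"` convention (score >= threshold means
positive):

```python
import numpy as np

neg = np.array([0.5, 1.5])  # negative-class scores from the Scores example above
threshold = 2.5

# Negatives that are classified as positive are false positives
fpr = np.mean(neg >= threshold)
print(fpr)  # 0.0 -- neither negative score reaches the threshold
```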
We can work with multiple thresholds at once, which leads to vectorized confusion
matrices. In fact, the `threshold` parameter accepts arbitrarily shaped arrays, and
all confusion matrix operations preserve the shapes.

```python
import numpy as np

threshold = np.linspace(0, 3, num=50)
cm = scores.cm(threshold=threshold)
fpr = cm.fpr()  # Contains FPRs at all defined thresholds
assert fpr.shape == threshold.shape
```

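Under the hood, nothing more than NumPy broadcasting is needed to evaluate such a
metric at many thresholds at once; a rough sketch of the idea (not the library's
actual implementation):

```python
import numpy as np

neg = np.array([0.5, 1.5])               # negative-class scores
thresholds = np.linspace(0, 3, num=50)   # a grid of thresholds

# Compare every negative score against every threshold in one shot,
# then average over the scores axis to get one FPR per threshold
fpr = np.mean(neg[:, None] >= thresholds[None, :], axis=0)
assert fpr.shape == thresholds.shape     # shapes are preserved
```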
We can also determine thresholds at specific operating points. These operations are
also fully vectorized.

```python
import numpy as np

# Calculate the threshold at 30% false positive rate
threshold = scores.threshold_at_fpr(fpr=0.3)

# Calculate thresholds at logarithmically spaced FPRs from 0.1% to 100%
fpr = np.logspace(-3, 0, num=50)
threshold = scores.threshold_at_fpr(fpr)
```

Note that determining thresholds at fixed operating points requires interpolation,
since with a finite dataset we can measure only finitely many values for FPR, etc.
To determine a threshold at any other value of the target metric, we use linear
interpolation.

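The interpolation step can be sketched with hypothetical measured operating points
(`np.interp` expects increasing x values, hence the reversal):

```python
import numpy as np

# Hypothetical measured operating points: FPR at a handful of thresholds
thresholds = np.array([0.0, 1.0, 2.0, 3.0])
fprs = np.array([1.0, 0.6, 0.2, 0.0])  # FPR decreases as the threshold grows

# The threshold at FPR = 0.4 lies between the measured points (0.6 -> 1.0)
# and (0.2 -> 2.0); linear interpolation gives 1.5
threshold_at_04 = np.interp(0.4, fprs[::-1], thresholds[::-1])
print(threshold_at_04)
```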
### Confusion matrices

Most metrics that we use are defined via confusion matrices. We can create a
confusion matrix either from vectors with labels and predictions or directly from a
matrix.

```python
>>> from score_analysis import ConfusionMatrix
>>> labels = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
>>> predictions = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]
>>> cm = ConfusionMatrix(labels=labels, predictions=predictions)
>>> cm.classes
[0, 1, 2]
>>> cm.matrix
array([[3, 0, 0],
       [0, 1, 2],
       [2, 1, 3]])
```

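The counting behind this is straightforward; an illustrative re-implementation
(not the library's code): each (label, prediction) pair increments one cell, with
rows indexed by the true class and columns by the predicted class.

```python
import numpy as np

labels      = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
predictions = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]
classes = sorted(set(labels))

# Count (label, prediction) pairs: row = true class, column = predicted class
matrix = np.zeros((len(classes), len(classes)), dtype=int)
for y, p in zip(labels, predictions):
    matrix[classes.index(y), classes.index(p)] += 1
print(matrix)
```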
A binary confusion matrix is a special case of a `ConfusionMatrix` with specially
designated positive and negative classes. The convention is that the classes are
ordered `classes = [pos, neg]`. It can be created with the parameter `binary=True`.

For binary confusion matrices all metrics, such as TPR, are scalar. Since we have
defined which class is the positive one, there is no need to use the one-vs-all
strategy.

A binary confusion matrix is different from a regular confusion matrix with two
classes, since the latter does not have designated positive and negative classes.

163
+ ```python
164
+ >>> cm = ConfusionMatrix(matrix=[[1, 4], [2, 3]], binary=True)
165
+ >>> cm.tpr()
166
+ 0.2
167
+ >>> cm = ConfusionMatrix(matrix=[[1, 4], [2, 3]])
168
+ >>> cm.tpr() # True positive rate for each class
169
+ array([0.2, 0.6])
170
+ ```
171
+
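Both numbers follow directly from the matrix layout; a NumPy sketch of the
arithmetic:

```python
import numpy as np

matrix = np.array([[1, 4], [2, 3]])

# Binary convention: classes = [pos, neg], so TP = matrix[0, 0] and the
# condition-positive count P is the first row sum
tpr_binary = matrix[0, 0] / matrix[0].sum()
print(tpr_binary)  # 0.2

# Per-class TPR (one-vs-all): diagonal entries over row sums
tpr_per_class = np.diag(matrix) / matrix.sum(axis=1)
print(tpr_per_class)
```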
### Available metrics

Basic parameters

1. TP (true positive)
2. TN (true negative)
3. FP (false positive)
4. FN (false negative)
5. P (condition positive)
6. N (condition negative)
7. TOP (test outcome positive)
8. TON (test outcome negative)
9. POP (population)

Class metrics

1. TPR (true positive rate) + confidence interval
2. TNR (true negative rate) + confidence interval
3. FPR (false positive rate) + confidence interval
4. FNR (false negative rate) + confidence interval
5. TOPR (test outcome positive rate)
6. TONR (test outcome negative rate)
7. PPV (positive predictive value)
8. NPV (negative predictive value)
9. FDR (false discovery rate)
10. FOR (false omission rate)
11. Class accuracy
12. Class error rate

Overall metrics

1. Accuracy
2. Error rate

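For reference, here is how the basic parameters and the overall metrics fit
together on a binary confusion matrix; the layout `[[TP, FN], [FP, TN]]` follows
from the `classes = [pos, neg]` convention described above (a worked sketch, not
library code):

```python
import numpy as np

cm = np.array([[1, 4], [2, 3]])  # [[TP, FN], [FP, TN]]
tp, fn = cm[0]
fp, tn = cm[1]

p, n = tp + fn, fp + tn       # condition positive / condition negative
top, ton = tp + fp, fn + tn   # test outcome positive / test outcome negative
pop = cm.sum()                # population

accuracy = (tp + tn) / pop    # overall accuracy
ppv = tp / top                # positive predictive value
print(accuracy, ppv)
```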
### Confidence intervals

The library implements bootstrapping to compute confidence intervals for arbitrary
(vectorized) metrics, including user-defined functions:

```python
import numpy as np

from score_analysis import Scores

def metric(scores: Scores) -> np.ndarray:
    # Simple metric calculating the mean of positive scores
    return np.mean(scores.pos)

scores = Scores(pos=[1, 2, 3], neg=[0, 2])  # Sample scores
ci = scores.bootstrap_ci(metric=metric, alpha=0.05)

# For metrics that are part of the Scores class we can pass their names
ci = scores.bootstrap_ci(metric="eer")
# Scores.eer() returns both the threshold and the EER value
print(f"Threshold 95%-CI: ({ci[0, 0]:.4f}, {ci[0, 1]:.4f})")
print(f"EER 95%-CI: ({ci[1, 0]:.3%}, {ci[1, 1]:.3%})")
```

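The mechanics are the usual percentile bootstrap; a self-contained NumPy sketch for
the mean-of-positives metric above (illustrative only, the library's sampling
scheme may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
pos = np.array([1.0, 2.0, 3.0])

# Resample the positive scores with replacement and recompute the metric
samples = [np.mean(rng.choice(pos, size=pos.size, replace=True))
           for _ in range(1000)]

# alpha = 0.05: take the 2.5th and 97.5th percentiles of the bootstrap samples
lower, upper = np.percentile(samples, [2.5, 97.5])
print(lower, upper)
```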
### Vectorized operations

All operations are vectorized as far as feasible, and care has been taken to ensure
consistent handling of matrix shapes.

- A (vectorized) confusion matrix has shape (X, N, N), where X can be an arbitrary
  shape, including the empty shape, and N is the number of classes.
- Calculating a metric results in an array of shape (X, Y), where Y is the shape
  defined by the metric. Most metrics are scalar, Y=(), while confidence intervals
  have shape (2,).
- A confusion matrix can be converted to a vector of binary confusion matrices
  using the one-vs-all strategy. This results in a binary confusion matrix of shape
  (X, N, 2, 2).
- Calculating per-class metrics implicitly uses the one-vs-all strategy, so the
  result has shape (X, N, Y).
- Whenever a result is a scalar, we return it as such. This is, e.g., the case when
  computing scalar metrics of single confusion matrices, i.e., X=Y=().

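The one-vs-all conversion can be sketched for a single three-class matrix (i.e.,
X = ()): each class k gets a 2x2 binary matrix in which k is the positive class and
everything else is pooled into the negative class. This is an illustrative
re-implementation, not the library's code:

```python
import numpy as np

matrix = np.array([[3, 0, 0], [0, 1, 2], [2, 1, 3]])  # shape (N, N) with N = 3
n = matrix.shape[0]

binary = np.empty((n, 2, 2), dtype=int)  # result has shape (N, 2, 2)
for k in range(n):
    tp = matrix[k, k]                    # class k predicted as k
    fn = matrix[k].sum() - tp            # class k predicted as something else
    fp = matrix[:, k].sum() - tp         # other classes predicted as k
    tn = matrix.sum() - tp - fn - fp     # everything else
    binary[k] = [[tp, fn], [fp, tn]]
print(binary.shape)  # (3, 2, 2)
```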
### Showbias

The `showbias` function measures how a user-specified metric differs across groups
of data. Typically, we would be interested in knowing how, for example, FRR differs
across ethnicities, which helps us understand whether our product is biased and
performs better for some ethnicities than for others. However, the function is
general enough to measure any variation of a metric across groups: you could, for
example, use it to measure accuracy across document types or flagging rates across
SDK platforms. You could even measure how Dogfido's FRR differs across dog breeds:

![image info](images/showbias.png)

In its simplest form, the `showbias` function assumes that you have a pandas
dataframe with three columns:

- A `group` column that indicates group membership for every row, e.g., `female`
  and `male` values in a column called `gender`
- A `scores` column that contains the predicted scores (e.g., by a model)
- A `labels` column that contains the ground truth as integers

Imagine that you have a dataframe `df` that contains model predictions and ground
truth labels along with gender data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'gender': np.random.choice(['female', 'male'], size=1000),
    'labels': np.random.choice([0, 1], size=1000),
    'scores': np.random.uniform(0.0, 1.0, 1000)
})
```

You can then run the following to measure FRR per gender:

```python
from score_analysis import showbias

bias_frame = showbias(
    data=df,
    group_columns="gender",
    label_column="labels",
    score_column="scores",
    metric="fnr",
    threshold=[0.5]
)
print(bias_frame.to_markdown())
```

which should result in a table like this:

| gender |   0.5 |
|:-------|------:|
| female | 0.508 |
| male   | 0.474 |

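Conceptually, `showbias` with `metric="fnr"` computes a per-group false negative
rate. A hand-rolled pandas equivalent on toy data, assuming scores at or above the
threshold count as accepted:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "labels": [1, 1, 1, 1],
    "scores": [0.4, 0.6, 0.7, 0.8],
})

# FNR = fraction of true positives (label 1) that score below the threshold
fnr = (
    toy[toy["labels"] == 1]
    .groupby("gender")["scores"]
    .apply(lambda s: float(np.mean(s < 0.5)))
)
print(fnr)
```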
Above, we have been passing a threshold of `0.5` (as also indicated by the column
name). You can pass several thresholds at once, like so:

```python
bias_frame = showbias(
    data=df,
    group_columns="gender",
    label_column="labels",
    score_column="scores",
    metric="fnr",
    threshold=[0.3, 0.5, 0.7]
)
print(bias_frame.to_markdown())
```

which will result in several columns, one for every threshold:

| gender |   0.3 |   0.5 |   0.7 |
|:-------|------:|------:|------:|
| female | 0.311 | 0.508 | 0.705 |
| male   | 0.252 | 0.474 | 0.697 |

The metrics can also be normalised. For example, to normalise by the metric
measured across the entire dataset, pass the `normalise="by_overall"` argument,
like so:

```python
bias_frame = showbias(
    data=df,
    group_columns="gender",
    label_column="labels",
    score_column="scores",
    metric="fnr",
    threshold=[0.5],
    normalise="by_overall"
)
```

You can obtain confidence intervals by setting `nb_samples` in the
`BootstrapConfig` to a value greater than `0`:

```python
from score_analysis import BootstrapConfig

bias_frame = showbias(
    data=df,
    group_columns="gender",
    label_column="labels",
    score_column="scores",
    metric="fnr",
    threshold=[0.5],
    bootstrap_config=BootstrapConfig(
        nb_samples=500,
        stratified_sampling="by_group"
    ),
    alpha_level=0.05
)
print(bias_frame.to_markdown())
```

In this case, `bias_frame` has four properties:

- `bias_frame.values` contains the observed values
- `bias_frame.alpha` contains the alpha level
- `bias_frame.lower` contains the lower bounds of the CIs
- `bias_frame.upper` contains the upper bounds of the CIs

Imagine that you collected not only gender data in `df` but also age group data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'gender': np.random.choice(['female', 'male'], size=1000),
    'age_group': np.random.choice(['<25', '25-35', '35-45', '45-55', '>55'], size=1000),
    'labels': np.random.choice([0, 1], size=1000),
    'scores': np.random.uniform(0.0, 1.0, 1000)
})
```

You can then run the following to measure FRR per gender x age group combination:

```python
bias_frame = showbias(
    data=df,
    group_columns=["gender", "age_group"],
    label_column="labels",
    score_column="scores",
    metric="fnr",
    threshold=[0.5]
)
print(bias_frame.to_markdown(reset_display_index=True))
```

which should result in a table like this:

| gender | age_group |   0.5 |
|:-------|:----------|------:|
| female | 25-35     | 0.514 |
| female | 35-45     | 0.571 |
| female | 45-55     | 0.52  |
| female | <25       | 0.517 |
| female | >55       | 0.509 |
| male   | 25-35     | 0.525 |
| male   | 35-45     | 0.435 |
| male   | 45-55     | 0.414 |
| male   | <25       | 0.529 |
| male   | >55       | 0.562 |

## Contributing

Before submitting an MR, please run

```shell
make style
```

This will run `black`, `isort` and `flake8` on the code.

Unit tests can be executed via

```shell
make test
```

## Formatting tips

* `# fmt: skip` to disable formatting on a single line.
* `# fmt: off` / `# fmt: on` to disable formatting on a block of code.
* `# noqa: F401` to disable the flake8 warning for an unused import.

## Future plans

The following features are planned:

- [ ] Aliases for metrics