compsil 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- compsil-0.1.0/LICENSE +21 -0
- compsil-0.1.0/PKG-INFO +377 -0
- compsil-0.1.0/README.md +338 -0
- compsil-0.1.0/compsil/__init__.py +1 -0
- compsil-0.1.0/compsil/compsil.py +377 -0
- compsil-0.1.0/compsil.egg-info/PKG-INFO +377 -0
- compsil-0.1.0/compsil.egg-info/SOURCES.txt +10 -0
- compsil-0.1.0/compsil.egg-info/dependency_links.txt +1 -0
- compsil-0.1.0/compsil.egg-info/requires.txt +5 -0
- compsil-0.1.0/compsil.egg-info/top_level.txt +1 -0
- compsil-0.1.0/setup.cfg +4 -0
- compsil-0.1.0/setup.py +35 -0
compsil-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Aggelos Semoglou

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
compsil-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,377 @@
Metadata-Version: 2.4
Name: compsil
Version: 0.1.0
Summary: CompSil: Composite Silhouette for Cluster-Count Selection
Home-page: https://github.com/semoglou/compsil
Author: Aggelos Semoglou
Author-email: a.semoglou@outlook.gr
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.22
Requires-Dist: pandas>=1.5
Requires-Dist: scikit-learn>=1.4
Requires-Dist: matplotlib>=3.6
Requires-Dist: joblib>=1.2
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# CompSil

<p align="center">
<a href="https://pypi.org/project/compsil/"><img src="https://img.shields.io/pypi/v/compsil.svg?color=blue" alt="PyPI version"></a>
<a href="https://pypi.org/project/compsil/"><img src="https://img.shields.io/badge/python-3.9%2B-blue" alt="Python 3.9+"></a>
<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/license-MIT-yellow.svg" alt="License: MIT"></a>
<a href="https://pepy.tech/project/compsil"><img src="https://pepy.tech/badge/compsil" alt="Downloads"></a>
<a href="#"><img src="https://img.shields.io/badge/ECML%20PKDD-2026-green" alt="ECML PKDD 2026"></a>
</p>

<table>
<tr>
<td>

📄 **Accepted at _ECML PKDD 2026_**

**Composite Silhouette**

</td>
</tr>
</table>

**CompSil** is an open-source Python package for selecting the number of clusters in unlabeled data using **Composite Silhouette**, an internal validation criterion that adaptively combines micro- and macro-averaged Silhouette scores across repeated subsampled clusterings.

### Composite Silhouette: A Subsampling-based Aggregation Strategy

Selecting the number of clusters is a central challenge in unsupervised learning, where ground-truth labels are usually unavailable.

The standard Silhouette coefficient is one of the most widely used internal validation metrics for this task. However, its usual **micro-averaged** form aggregates Silhouette values over all data points, which can make the score strongly influenced by large clusters. In imbalanced datasets, this may mask poor separation or instability in smaller but meaningful groups.

A natural alternative is **macro-averaging**, where Silhouette values are first averaged within each cluster and then averaged across clusters. This gives every cluster equal influence, reducing the dominance of majority groups. However, macro-averaging can also overemphasize small, noisy, or under-represented clusters.
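
As a rough illustration of the two aggregation modes described above, the following sketch computes both from scikit-learn's `silhouette_samples`. The helper name `micro_macro_silhouette` and the toy imbalanced data are assumptions made for this example; they are not part of the CompSil API.

```python
import numpy as np
from sklearn.metrics import silhouette_samples


def micro_macro_silhouette(X, labels):
    """Return (micro, macro) averaged Silhouette for one clustering."""
    s = silhouette_samples(X, labels)  # one Silhouette value per point
    micro = float(s.mean())            # average over all points
    # Average within each cluster first, then across clusters.
    cluster_means = [s[labels == c].mean() for c in np.unique(labels)]
    macro = float(np.mean(cluster_means))
    return micro, macro


# Toy imbalanced data: one large blob and one small blob (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    rng.normal(6.0, 1.0, size=(20, 2)),
])
labels = np.array([0] * 200 + [1] * 20)

micro, macro = micro_macro_silhouette(X, labels)
print(f"micro={micro:.3f}  macro={macro:.3f}")
```

With 200 points in one cluster and 20 in the other, the micro average is weighted roughly 10:1 toward the large cluster, while the macro average weights the two clusters equally.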

The distinction between micro- and macro-averaged Silhouette aggregation is discussed in detail in [**Revisiting Silhouette Aggregation**](https://arxiv.org/abs/2401.05831) by Pavlopoulos, Vardakas, and Likas. The corresponding repository is available here: [https://github.com/ipavlopoulos/revisiting-silhouette-aggregation](https://github.com/ipavlopoulos/revisiting-silhouette-aggregation).

For users who only need direct Silhouette computation, including sample-level, micro-averaged, and macro-averaged Silhouette scores with or without approximation, see the companion Silhouette package: [https://github.com/semoglou/sil_score](https://github.com/semoglou/sil_score).

<img src="https://raw.githubusercontent.com/semoglou/compsil/main/figs/aggr.png" alt="Micro vs Macro Silhouette Aggregation" width="700">

These complementary failure modes create a practical dilemma:

- **Micro-averaging** reflects global, point-wise clustering quality but can favor majority clusters.

- **Macro-averaging** reflects cluster-wise balance but can overemphasize small or noisy groups.

In many applications, it is unclear in advance which view should be trusted.

**CompSil** addresses this issue by using the disagreement between micro- and macro-averaged Silhouette scores as a local signal for adaptive aggregation.

Composite Silhouette evaluates candidate numbers of clusters through repeated subsampled clusterings. For each candidate value of `k`, the method:

1. Draws multiple subsamples of the dataset.

2. Clusters each subsample.

3. Computes both micro- and macro-averaged Silhouette scores.

4. Measures their discrepancy.

5. Converts this discrepancy into a smooth convex weight.

6. Combines the two Silhouette views into a subsample-level composite score.

7. Averages the composite scores across subsamples.
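
Steps 1 to 3 of the list above might be sketched as follows. This is a minimal illustration, not the package's implementation: the use of `KMeans`, sampling without replacement, the fixed `sample_size`, and the function name `subsampled_micro_macro` are all assumptions of this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples


def subsampled_micro_macro(X, k, num_samples=10, sample_size=200, seed=0):
    """Subsample, cluster, and score each subsample for one value of k."""
    rng = np.random.default_rng(seed)
    scores = []
    for b in range(num_samples):
        # Step 1: draw a subsample (without replacement; an assumption).
        idx = rng.choice(len(X), size=sample_size, replace=False)
        Xb = X[idx]
        # Step 2: cluster the subsample (KMeans is an assumption of this sketch).
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed + b).fit_predict(Xb)
        # Step 3: micro- and macro-averaged Silhouette on the subsample.
        s = silhouette_samples(Xb, labels)
        micro = s.mean()
        macro = np.mean([s[labels == c].mean() for c in np.unique(labels)])
        scores.append((micro, macro))
    return np.array(scores)  # shape (num_samples, 2)


X, _ = make_blobs(n_samples=600, centers=3, random_state=0)
scores = subsampled_micro_macro(X, k=3)
print(scores.mean(axis=0))  # average micro and macro over subsamples
```

Each row of `scores` holds the `(S_micro, S_macro)` pair for one subsample; steps 4 to 7 then operate on these pairs.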

<img src="https://raw.githubusercontent.com/semoglou/compsil/main/figs/smmp.png" alt="Composite Silhouette pipeline" width="700">

For each subsample, Composite Silhouette combines the two views as:

```text

S_mM = w * S_micro + (1 - w) * S_macro

```

where the weight `w` is determined adaptively from the normalized discrepancy between `S_micro` and `S_macro`.

This produces a single internal validation score that can be maximized over candidate values of `k`.
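
The convex combination itself is easy to sketch. The exact mapping from discrepancy to weight is not spelled out in this README, so `hypothetical_weight` below is a placeholder (a logistic function of a max-normalized discrepancy) chosen only to produce some smooth weight in (0, 1); it should not be read as CompSil's formula. The input scores are made up for illustration.

```python
import numpy as np


def hypothetical_weight(s_micro, s_macro, eps=1e-12):
    """Placeholder weight in (0, 1); NOT the formula used by CompSil."""
    d = s_micro - s_macro              # signed micro-macro discrepancy
    z = d / (np.abs(d).max() + eps)    # normalize to [-1, 1]
    return 1.0 / (1.0 + np.exp(-z))    # smooth logistic squashing


def composite_silhouette(s_micro, s_macro, w):
    """Per-subsample S_mM = w * S_micro + (1 - w) * S_macro, and its mean."""
    per_subsample = w * s_micro + (1.0 - w) * s_macro
    return per_subsample, float(per_subsample.mean())


# Made-up subsample scores, for illustration only.
s_micro = np.array([0.52, 0.55, 0.50, 0.53])
s_macro = np.array([0.47, 0.56, 0.41, 0.50])

w = hypothetical_weight(s_micro, s_macro)
per_subsample, s_mM = composite_silhouette(s_micro, s_macro, w)
print(per_subsample, s_mM)
```

Whatever weight function is used, as long as `0 <= w <= 1` each subsample-level score lies between `S_micro` and `S_macro` for that subsample.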

CompSil enables:

- Selection of the number of clusters without labels.

- Adaptive balancing of micro- and macro-averaged Silhouette.

- More robust cluster-count selection under size imbalance.

- Repeated subsampling for stable internal validation.

- Optional lower-confidence-bound selection using subsampling variability.

#

## Citation

If you find this work useful, please consider citing:

Semoglou, A., Likas, A., & Pavlopoulos, J. (2026). Composite Silhouette.

Accepted at *ECML PKDD 2026*.

```bibtex
@inproceedings{semoglou2026composite,
  title = {Composite Silhouette},
  author = {Semoglou, Aggelos and Likas, Aristidis and Pavlopoulos, John},
  booktitle = {Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year = {2026}
}
```

The preprint is also available on arXiv: [https://arxiv.org/abs/2604.13816](https://arxiv.org/abs/2604.13816)

## Installation

Install **CompSil** from [PyPI](https://pypi.org/project/compsil/):

```bash
pip install compsil
```

Import the main class in Python as:

```python
from compsil import CompSil
```

## API Reference

CompSil provides a simple class-based interface for evaluating Composite Silhouette over one or more candidate numbers of clusters.

---

#### `CompSil`

Computes Composite Silhouette for candidate cluster counts using repeated subsampled clusterings.

```python
CompSil(
    data,
    ground_truth=None,
    k_values=range(2, 11),
    num_samples=10,
    sample_size="auto",
    random_state=42,
    n_jobs=-1,
    eps=1e-12,
)
```

**Inputs**

- `data`: array-like of shape `(n_samples, n_features)`
  Input data matrix.

- `ground_truth`: int or None, default `None`
  Optional reference number of clusters.
  Used only for visualization.

- `k_values`: iterable of int or int, default `range(2, 11)`
  Candidate number or candidate numbers of clusters to evaluate.

- `num_samples`: int, default `10`
  Number of subsamples used for each candidate value of `k`.

- `sample_size`: int, float, None, or `"auto"`, default `"auto"`
  Subsample size used in each repeated clustering.
  - If `int`, it is interpreted as the absolute subsample size.
  - If `float` in `(0, 1]`, it is interpreted as a fraction of the dataset size.
  - If `None` or `"auto"`, the subsample size is selected automatically from the dataset size and the largest candidate value of `k`.

- `random_state`: int, default `42`
  Base random seed used for reproducible subsampling and clustering.

- `n_jobs`: int, default `-1`
  Number of parallel jobs used during evaluation.

- `eps`: float, default `1e-12`
  Numerical stability constant used when normalizing micro–macro discrepancies.
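
The `int` and `float` conventions for `sample_size` can be illustrated with a small helper. This is a sketch only: the function `resolve_sample_size`, its rounding rule, and its error handling are assumptions of this example, and the `"auto"` rule is deliberately left out because this document does not specify it.

```python
def resolve_sample_size(sample_size, n_samples):
    """Turn an int or float `sample_size` into an absolute subsample size."""
    if sample_size is None or sample_size == "auto":
        # The automatic rule is not described in this README, so it is not sketched.
        raise NotImplementedError("automatic sizing is outside this sketch")
    if isinstance(sample_size, float):
        if not 0.0 < sample_size <= 1.0:
            raise ValueError("a float sample_size must lie in (0, 1]")
        # Rounding to the nearest integer is an assumption of this sketch.
        return max(1, round(sample_size * n_samples))
    if isinstance(sample_size, int) and not isinstance(sample_size, bool):
        if not 1 <= sample_size <= n_samples:
            raise ValueError("an int sample_size must lie in [1, n_samples]")
        return sample_size
    raise TypeError("sample_size must be int, float, None, or 'auto'")


print(resolve_sample_size(200, 1000))   # absolute size
print(resolve_sample_size(0.25, 1000))  # fraction of the dataset
```

Note the difference between `1` (a single point) and `1.0` (the full dataset) under this convention.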

---

#### `evaluate`

Evaluates Composite Silhouette over all candidate values of `k`.

```python
model.evaluate()
```

After calling `evaluate`, the results are stored in:

```python
model.results_df
```

The results table contains:

- `k`: candidate number of clusters.
- `avg S_micro`: average micro-averaged Silhouette across subsamples.
- `avg S_macro`: average macro-averaged Silhouette across subsamples.
- `w_micro`: average adaptive weight assigned to the micro view.
- `S_mM`: Composite Silhouette score.
- `std S_mM`: standard deviation of subsample-level composite scores.
- `se S_mM`: standard error of the Composite Silhouette estimate.
- `LCB S_mM`: lower-confidence-bound score, computed as `S_mM - se S_mM`.
- `B_eff`: number of valid subsampling trials.
- `sample_size`: resolved subsample size.
- `sample_fraction`: resolved subsample fraction.
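
To make the relationship between `S_mM`, `std S_mM`, `se S_mM`, and `LCB S_mM` concrete, here is a small numeric sketch for a single value of `k`. The scores are made up, and two details are assumptions of this example rather than documented behavior: the sample standard deviation (`ddof=1`) and the standard error taken as `std / sqrt(B_eff)`.

```python
import numpy as np

# Made-up subsample-level composite scores for one value of k.
scores = np.array([0.61, 0.58, 0.64, 0.60, 0.57])

b_eff = scores.size            # number of valid trials (B_eff)
s_mM = scores.mean()           # Composite Silhouette estimate
std = scores.std(ddof=1)       # sample std (ddof=1 is an assumption)
se = std / np.sqrt(b_eff)      # standard error of the mean (assumed form)
lcb = s_mM - se                # LCB S_mM = S_mM - se S_mM

print(f"S_mM={s_mM:.3f}  std={std:.3f}  se={se:.3f}  LCB={lcb:.3f}")
```

A candidate `k` whose subsample scores vary widely gets a larger `se` and therefore a lower `LCB S_mM`, which is what makes the lower-confidence-bound rule more conservative.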

---

#### `get_optimal_k`

Returns the selected number of clusters.

```python
model.get_optimal_k(use_lcb=False)
```

**Inputs**

- `use_lcb`: bool, default `False`
  If `False`, selects the `k` that maximizes `S_mM`.
  If `True`, selects the `k` that maximizes `LCB S_mM`.

**Returns**

- `optimal_k`: int
  Selected number of clusters.
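
The two selection rules can differ when a high-scoring `k` is also unstable across subsamples. The sketch below mimics that on a hand-built table; the numbers are invented for illustration, and the selection is done with plain pandas `idxmax` rather than by calling CompSil.

```python
import pandas as pd

# Made-up results for illustration; column names follow the results table.
results = pd.DataFrame(
    {
        "S_mM": [0.52, 0.61, 0.62],
        "se S_mM": [0.01, 0.01, 0.04],
    },
    index=pd.Index([3, 4, 5], name="k"),
)
results["LCB S_mM"] = results["S_mM"] - results["se S_mM"]

k_by_score = int(results["S_mM"].idxmax())    # analogue of use_lcb=False -> 5
k_by_lcb = int(results["LCB S_mM"].idxmax())  # analogue of use_lcb=True  -> 4

print(k_by_score, k_by_lcb)
```

Here `k=5` has the highest `S_mM` but also the largest standard error, so the lower-confidence-bound rule prefers `k=4`.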

---

#### `get_results_dataframe`

Returns the results as a pandas DataFrame indexed by `k`.

```python
results = model.get_results_dataframe()
```

**Returns**

- `results`: pandas DataFrame
  Table containing the Composite Silhouette results for all candidate values of `k`.

---

#### `plot_results`

Plots the Composite Silhouette curve together with the subsample-averaged micro- and macro-averaged Silhouette curves.

```python
model.plot_results()
```

If `ground_truth` was provided, it is shown as a vertical reference line.

## Quick Start

This example creates a simple synthetic dataset with five Gaussian clusters, evaluates candidate values of `k`, and selects the number of clusters using Composite Silhouette.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from compsil import CompSil

# Create a simple synthetic dataset
X, y = make_blobs(
    n_samples=1000,
    centers=5,
    n_features=10,
    cluster_std=1.5,
    random_state=42,
)

# Standardize the data
X = StandardScaler().fit_transform(X)

# Initialize Composite Silhouette
model = CompSil(
    data=X,
    ground_truth=5,
    k_values=range(2, 11),
    num_samples=10,
    sample_size="auto",
    random_state=0,
    n_jobs=-1,
)

# Evaluate all candidate k values
model.evaluate()

# Select the number of clusters
best_k = model.get_optimal_k()

print("Selected k:", best_k)

# Inspect the full results table
results = model.get_results_dataframe()
print(results)

# Plot the Composite Silhouette curve
model.plot_results()
```

The `S_mM` column in the results table contains the Composite Silhouette score for each candidate number of clusters. The selected number of clusters is the value of `k` that maximizes `S_mM`.

CompSil can also be used to evaluate a single candidate number of clusters. In this case, pass an integer to `k_values`.

```python
# Evaluate a single candidate k
model = CompSil(
    data=X,
    k_values=5
)

model.evaluate()

# Composite Silhouette score for k=5
print("Composite Silhouette score:", model.score_)

# Full results table
results = model.get_results_dataframe()
print(results)
```

When a single value of `k` is evaluated, `model.score_` stores the corresponding Composite Silhouette score.

## Acknowledgments
This work was supported by [_Archimedes Research Unit_](https://archimedesai.gr/), [_Athena Research Center_](https://www.athenarc.gr/en).

## License
This project is licensed under the [MIT License](https://github.com/semoglou/compsil/blob/main/LICENSE).

## Links
- Package: [PyPI](https://pypi.org/project/compsil/)
- Paper: Accepted at ECML PKDD 2026
- DOI: Coming soon
- Preprint: [arXiv:2604.13816](https://arxiv.org/abs/2604.13816)