datatypical 0.7.0__tar.gz
- datatypical-0.7.0/LICENSE +21 -0
- datatypical-0.7.0/PKG-INFO +302 -0
- datatypical-0.7.0/README.md +278 -0
- datatypical-0.7.0/datatypical.egg-info/PKG-INFO +302 -0
- datatypical-0.7.0/datatypical.egg-info/SOURCES.txt +10 -0
- datatypical-0.7.0/datatypical.egg-info/dependency_links.txt +1 -0
- datatypical-0.7.0/datatypical.egg-info/top_level.txt +2 -0
- datatypical-0.7.0/datatypical.py +3417 -0
- datatypical-0.7.0/datatypical_viz.py +912 -0
- datatypical-0.7.0/pyproject.toml +26 -0
- datatypical-0.7.0/setup.cfg +4 -0
- datatypical-0.7.0/setup.py +49 -0
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Amanda S. Barnard
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,302 @@
+Metadata-Version: 2.4
+Name: datatypical
+Version: 0.7.0
+Summary: Explainable instance significance discovery for scientific datasets
+Home-page: https://github.com/amaxiom/DataTypical
+Author: Amanda S. Barnard
+Author-email: "Amanda S. Barnard" <amanda.s.barnard@anu.edu.au>
+License: MIT
+Project-URL: Homepage, https://github.com/amaxiom/DataTypical
+Project-URL: Documentation, https://github.com/amaxiom/DataTypical/tree/main/docs
+Project-URL: Repository, https://github.com/amaxiom/DataTypical
+Keywords: machine-learning,explainable-ai,shapley-values,data-science
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Dynamic: author
+Dynamic: home-page
+Dynamic: license-file
+Dynamic: requires-python
+
+# DataTypical
+
+**Explainable Instance Significance Discovery for Scientific Datasets**
+
+[](https://www.python.org/downloads/)
+[](https://opensource.org/licenses/MIT)
+
+DataTypical analyzes datasets through three complementary lenses: archetypal (extreme), prototypical (representative), and stereotypical (target-like), with Shapley value explanations revealing why instances matter and which ones create your dataset's structure.
+
+---
+
+## Key Features
+
+- **Three Significance Types**: Archetypal, prototypical, stereotypical (all computed simultaneously)
+- **Shapley Explanations**: Feature-level attributions for why samples are significant
+- **Formative Discovery**: Distinguish samples that ARE significant from those that CREATE structure
+- **Publication Visualizations**: Dual-perspective scatter plots, heatmaps, and profile plots
+- **Multi-Modal Support**: Tabular data, text, and graph networks through unified API
+- **Performance Optimized**: Fast exploration mode and efficient Shapley computation
+
+---
+
+## Installation
+```bash
+pip install datatypical
+```
+
+---
+
+## Quick Start
+```python
+from datatypical import DataTypical
+from datatypical_viz import significance_plot, heatmap, profile_plot
+import pandas as pd
+
+# Load your data
+data = pd.read_csv('your_data.csv')
+
+# Analyze with explanations
+dt = DataTypical(shapley_mode=True)
+results = dt.fit_transform(data)
+
+# Three significance perspectives (0-1 normalized ranks)
+print(results[['archetypal_rank', 'prototypical_rank', 'stereotypical_rank']])
+
+# Visualize: which samples are critical vs replaceable?
+significance_plot(results, significance='archetypal')
+
+# Understand: which features drive significance?
+heatmap(dt, results, significance='archetypal', order='actual', top_n=20)
+
+# Explain: why is this sample significant?
+top_idx = results['archetypal_rank'].idxmax()
+profile_plot(dt, top_idx, significance='archetypal', order='local')
+```
+
+---
+
+## What DataTypical Does
+
+### Three Complementary Lenses
+
+| Lens | Finds | Use Cases |
+|------|-------|-----------|
+| **Archetypal** | Extreme, boundary samples | Edge case discovery, outlier detection, range understanding |
+| **Prototypical** | Representative, central samples | Dataset summarization, cluster centers, data coverage |
+| **Stereotypical** | Target-similar samples | Optimization, goal-oriented selection, phenotype matching |
+
+**The Power**: All three computed simultaneously—different perspectives reveal different insights.
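
To build intuition for the three lenses in the table above, the following sketch scores a synthetic dataset with three deliberately crude proxies: distance from the centroid for "extreme", closeness to the centroid for "representative", and the value of a chosen column for "target-like". The proxies, the column names `f1`, `f2` and `target`, and the assumption that high target values are the desirable ones are all illustrative. They are not DataTypical's scoring, which this README does not spell out.

```python
import numpy as np
import pandas as pd

# Synthetic data: two descriptive features and one target-like column (hypothetical names).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "target"])

# Standardize so that no single column dominates the distances.
Z = (X - X.mean()) / X.std()

# Distance of every sample from the centroid (the origin after standardizing).
dist_to_center = np.sqrt((Z[["f1", "f2"]] ** 2).sum(axis=1))

toy = pd.DataFrame({
    # Extreme samples sit far from the centre.
    "extreme_score": dist_to_center.rank(pct=True),
    # Representative samples sit close to it.
    "central_score": (-dist_to_center).rank(pct=True),
    # Target-like samples: assumed here to be those with a high target value.
    "target_score": X["target"].rank(pct=True),
})

# The three proxies generally single out different rows.
print(toy.idxmax())
```

Even with proxies this simple, the top-ranked row usually differs from lens to lens, which is the point of computing all three side by side.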
+
+### Dual Perspective (with Shapley)
+
+When `shapley_mode=True`, DataTypical reveals two views:
+
+- **Actual Significance** (`*_rank`): Samples that ARE significant
+- **Formative Significance** (`*_shapley_rank`): Samples that CREATE the structure
+
+This distinction, between what IS significant vs what CREATES structure, is unique to DataTypical.
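
One way to read the two views together is to place each sample in a quadrant of the actual/formative plane. The sketch below does this with plain pandas on a hand-made frame that only mimics the `*_rank` and `*_shapley_rank` column naming described above. The helper `label_quadrants`, the 0.5 cut-off, and the labels "formative only" and "background" are assumptions made for illustration and are not part of the package.

```python
import numpy as np
import pandas as pd

def label_quadrants(results, significance="archetypal", cut=0.5):
    """Label each row by its position on the actual/formative plane (hypothetical helper)."""
    actual = results[f"{significance}_rank"].to_numpy()
    formative = results[f"{significance}_shapley_rank"].to_numpy()
    conditions = [
        (actual >= cut) & (formative >= cut),   # significant and structure-creating
        (actual >= cut) & (formative < cut),    # significant but interchangeable
        (actual < cut) & (formative >= cut),    # shapes the structure without ranking high
    ]
    labels = ["critical", "replaceable", "formative only"]
    quadrant = np.select(conditions, labels, default="background")
    return pd.Series(quadrant, index=results.index, name="quadrant")

# Hand-made stand-in for the frame returned by fit_transform (values are invented).
demo = pd.DataFrame({
    "archetypal_rank": [0.95, 0.90, 0.20, 0.10],
    "archetypal_shapley_rank": [0.85, 0.15, 0.80, 0.05],
})
print(label_quadrants(demo).tolist())
# ['critical', 'replaceable', 'formative only', 'background']
```

The cut-off is a free choice; stricter thresholds shrink the "critical" and "replaceable" groups.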
+
+---
+
+## Example: Drug Discovery
+```python
+# Analyze compound library
+dt = DataTypical(
+    shapley_mode=True,
+    stereotype_column='activity',  # Target property
+    fast_mode=False
+)
+results = dt.fit_transform(compounds)
+
+# Find critical compounds (high actual + high formative)
+critical = results[
+    (results['stereotypical_rank'] > 0.8) &
+    (results['stereotypical_shapley_rank'] > 0.8)
+]
+print(f"Found {len(critical)} critical compounds")
+
+# Find redundant compounds (high actual + low formative)
+redundant = results[
+    (results['stereotypical_rank'] > 0.8) &
+    (results['stereotypical_shapley_rank'] < 0.3)
+]
+print(f"Found {len(redundant)} replaceable compounds")
+
+# Understand alternative mechanisms
+for idx in critical.index:
+    profile_plot(dt, idx, significance='stereotypical')
+    # Each shows different feature pattern → different mechanism
+```
+
+**Discovery**: Multiple structural pathways to high activity.
+
+---
+
+## Key Parameters
+```python
+DataTypical(
+    shapley_mode=False,          # True for explanations
+    fast_mode=True,              # False for publication quality
+    n_archetypes=8,              # Number of extreme corners
+    n_prototypes=8,              # Number of representatives
+    stereotype_column=None,      # Target column for stereotypical
+    shapley_top_n=500,           # Limit explanations to top N
+    shapley_n_permutations=100,  # Number of permutations
+    random_state=None,           # Set for reproducible results
+    max_memory_mb=8000           # Memory limit
+)
+```
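
The `shapley_n_permutations` and `random_state` parameters suggest a permutation-sampling estimate of Shapley values. The sketch below shows that generic Monte Carlo technique on a toy game so the roles of the two knobs are concrete. The function `permutation_shapley`, the toy `weights`, and the value function are assumptions for illustration; this is not DataTypical's implementation, and the README does not state which value function the package uses.

```python
import numpy as np

def permutation_shapley(value, n_players, n_permutations=100, random_state=None):
    """Estimate Shapley values by averaging marginal contributions over random orderings."""
    rng = np.random.default_rng(random_state)
    phi = np.zeros(n_players)
    for _ in range(n_permutations):
        order = rng.permutation(n_players)
        coalition = []
        previous = value(coalition)
        for player in order:
            coalition.append(int(player))
            current = value(coalition)
            phi[player] += current - previous   # marginal contribution of this player
            previous = current
    return phi / n_permutations

# Toy additive game: a coalition is worth the sum of its members' weights,
# so each player's exact Shapley value equals its own weight.
weights = np.array([1.0, 2.0, 3.0])

def coalition_value(members):
    return sum(weights[m] for m in members)

estimate = permutation_shapley(coalition_value, n_players=3,
                               n_permutations=100, random_state=0)
print(estimate)
```

More permutations reduce the sampling noise at a proportional cost in value-function calls, and fixing the seed makes a run repeatable. For games with interactions between players the estimate is approximate rather than exact.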
+
+---
+
+## Visualization Functions
+```python
+from datatypical_viz import significance_plot, heatmap, profile_plot
+
+# 1. Dual-perspective scatter plot
+significance_plot(results, significance='archetypal')
+
+# 2. Feature attribution heatmap
+heatmap(dt, results, significance='archetypal', order='actual', top_n=20)
+
+# 3. Individual sample profile
+profile_plot(dt, sample_idx, significance='archetypal', order='local')
+```
+
+---
+
+## Multi-Modal Support
+
+### Tabular Data
+```python
+df = pd.DataFrame(...)
+dt = DataTypical()
+results = dt.fit_transform(df)
+```
+
+### Text Data
+```python
+texts = ["document 1", "document 2", ...]
+dt = DataTypical()
+results = dt.fit_transform(texts)
+```
+
+### Graph Networks
+```python
+node_features = pd.DataFrame(...)
+edges = [(0, 1), (1, 2), ...]
+dt = DataTypical()
+results = dt.fit_transform(node_features, edges=edges)
+```
+
+---
+
+## Performance
+
+| Dataset Size | Without Shapley | With Shapley |
+|--------------|-----------------|--------------|
+| 1,000 samples | ~5 seconds | ~5 minutes |
+| 10,000 samples | ~30 seconds | ~60 minutes |
+
+**Optimization Strategy**:
+1. Fast exploration (`fast_mode=True`, no Shapley)
+2. Identify interesting samples
+3. Detailed analysis (`shapley_mode=True`, subset)
+4. Generate publication figures
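
Step 2 of the strategy above, identifying interesting samples, can be as simple as taking the union of the top-ranked rows under each lens. The sketch below does that with plain pandas on a random frame standing in for the output of a fast, Shapley-free pass. The helper `shortlist`, the value `k=50`, and the synthetic ranks are assumptions, not part of the package.

```python
import numpy as np
import pandas as pd

def shortlist(results, rank_columns, k=50):
    """Return the union of the top-k rows under each rank column, in original order."""
    keep = set()
    for col in rank_columns:
        keep.update(results[col].nlargest(k).index)
    return results.loc[sorted(keep)]

# Random stand-in for the ranks from a fast exploration pass (values are invented).
rng = np.random.default_rng(0)
fast_results = pd.DataFrame(
    rng.random((1000, 3)),
    columns=["archetypal_rank", "prototypical_rank", "stereotypical_rank"],
)

subset = shortlist(fast_results, fast_results.columns, k=50)
print(f"{len(subset)} of {len(fast_results)} samples shortlisted")
```

The rows of the original data at `subset.index` would then be the candidates for the slower `shapley_mode=True` pass. Whether that is exactly what the author means by "subset", and how it relates to the `shapley_top_n` parameter, is not stated here.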
+
+---
+
+## Use Cases
+
+**Scientific Discovery**: Alternative mechanisms and pathways, boundary definition and edge cases, quality control and validation, coverage analysis and gap identification
+
+**Dataset Curation**: Size reduction while preserving diversity, representative selection, redundancy detection, gap identification for future sampling
+
+**Model Understanding**: Feature importance (global and local), individual sample explanations, pattern recognition across pathways, interpretable explanations
+
+---
+
+## What Makes DataTypical Different
+
+**From outlier detection**: Finds extremes AND explains why
+
+**From clustering**: Finds representatives maximizing coverage AND explains why
+
+**From feature selection**: Explains which features matter for which samples
+
+**From PCA/t-SNE**: Maintains interpretability in original feature space
+
+**The Novel Contribution**: Formative instances distinguish samples that ARE significant from samples that CREATE structure, enabling redundancy detection, identifying structurally important samples, and understanding irreplaceable vs interchangeable samples.
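
The phrase "representatives maximizing coverage" corresponds to the facility-location objective that the Acknowledgments attribute to Nemhauser et al. (1978). A minimal sketch of the classic greedy selection for that objective follows. It is a generic textbook version on synthetic points, with an assumed Gaussian similarity and a hypothetical function name; it is not the package's own code.

```python
import numpy as np

def greedy_facility_location(similarity, k):
    """Greedily pick k samples so that every sample is similar to some chosen one."""
    n = similarity.shape[0]
    best = np.zeros(n)   # best similarity of each sample to the chosen set so far
    chosen = []
    for _ in range(k):
        # Coverage gain of candidate j: sum over samples i of max(sim[i, j] - best[i], 0).
        gains = np.maximum(similarity - best[:, None], 0.0).sum(axis=0)
        if chosen:
            gains[chosen] = -np.inf   # never pick the same sample twice
        j = int(np.argmax(gains))
        chosen.append(j)
        best = np.maximum(best, similarity[:, j])
    return chosen

# Two well-separated synthetic blobs of 20 points each.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])

# Assumed similarity: exp(-squared Euclidean distance), always non-negative.
sq_dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
picked = greedy_facility_location(np.exp(-sq_dist), k=2)
print(picked)
```

With two representatives the greedy rule takes one point from each blob, because a second pick inside an already covered blob adds almost no coverage. For non-negative similarities the objective is monotone submodular, so the greedy selection is guaranteed to reach at least a (1 - 1/e) fraction of the optimal coverage.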
+
+---
+
+## Documentation
+
+Complete documentation, examples, and guides available at:
+**https://github.com/amaxiom/DataTypical**
+
+Includes:
+- Getting started tutorials
+- Comprehensive examples across scientific domains
+- Visualization interpretation guides
+- Advanced usage and computation details
+- Test suite and benchmarks
+
+---
+
+## Support
+
+- **GitHub Repository**: https://github.com/amaxiom/DataTypical
+- **Report Issues**: https://github.com/amaxiom/DataTypical/issues
+- **Questions & Discussions**: https://github.com/amaxiom/DataTypical/discussions
+
+---
+
+## Requirements
+
+- Python ≥ 3.8
+- NumPy ≥ 1.20
+- Pandas ≥ 1.3
+- SciPy ≥ 1.7
+- scikit-learn ≥ 1.0
+- Matplotlib ≥ 3.3
+- Seaborn ≥ 0.11
+- Numba ≥ 0.55
+
+---
+
+## Citation
+
+If you use DataTypical in your research, please cite:
+```bibtex
+@software{datatypical2025,
+  author = {Barnard, Amanda S.},
+  title = {DataTypical: Scientific Data Significance Rankings with Shapley Explanations},
+  year = {2026},
+  url = {https://github.com/amaxiom/DataTypical},
+  version = {0.7}
+}
+```
+
+---
+
+## License
+
+MIT License - Copyright (c) 2026 Amanda S. Barnard
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
+
+---
+
+## Acknowledgments
+
+DataTypical builds on foundational work in archetypal analysis (Cutler & Breiman, 1994), facility location optimization (Nemhauser et al., 1978), Shapley value theory (Shapley, 1953), and PCHA optimization (Mørup & Hansen, 2012).
@@ -0,0 +1,278 @@
+# DataTypical
+
+**Explainable Instance Significance Discovery for Scientific Datasets**
+
+[](https://www.python.org/downloads/)
+[](https://opensource.org/licenses/MIT)
+
+DataTypical analyzes datasets through three complementary lenses: archetypal (extreme), prototypical (representative), and stereotypical (target-like), with Shapley value explanations revealing why instances matter and which ones create your dataset's structure.
+
+---
+
+## Key Features
+
+- **Three Significance Types**: Archetypal, prototypical, stereotypical (all computed simultaneously)
+- **Shapley Explanations**: Feature-level attributions for why samples are significant
+- **Formative Discovery**: Distinguish samples that ARE significant from those that CREATE structure
+- **Publication Visualizations**: Dual-perspective scatter plots, heatmaps, and profile plots
+- **Multi-Modal Support**: Tabular data, text, and graph networks through unified API
+- **Performance Optimized**: Fast exploration mode and efficient Shapley computation
+
+---
+
+## Installation
+```bash
+pip install datatypical
+```
+
+---
+
+## Quick Start
+```python
+from datatypical import DataTypical
+from datatypical_viz import significance_plot, heatmap, profile_plot
+import pandas as pd
+
+# Load your data
+data = pd.read_csv('your_data.csv')
+
+# Analyze with explanations
+dt = DataTypical(shapley_mode=True)
+results = dt.fit_transform(data)
+
+# Three significance perspectives (0-1 normalized ranks)
+print(results[['archetypal_rank', 'prototypical_rank', 'stereotypical_rank']])
+
+# Visualize: which samples are critical vs replaceable?
+significance_plot(results, significance='archetypal')
+
+# Understand: which features drive significance?
+heatmap(dt, results, significance='archetypal', order='actual', top_n=20)
+
+# Explain: why is this sample significant?
+top_idx = results['archetypal_rank'].idxmax()
+profile_plot(dt, top_idx, significance='archetypal', order='local')
+```
+
+---
+
+## What DataTypical Does
+
+### Three Complementary Lenses
+
+| Lens | Finds | Use Cases |
+|------|-------|-----------|
+| **Archetypal** | Extreme, boundary samples | Edge case discovery, outlier detection, range understanding |
+| **Prototypical** | Representative, central samples | Dataset summarization, cluster centers, data coverage |
+| **Stereotypical** | Target-similar samples | Optimization, goal-oriented selection, phenotype matching |
+
+**The Power**: All three computed simultaneously—different perspectives reveal different insights.
+
+### Dual Perspective (with Shapley)
+
+When `shapley_mode=True`, DataTypical reveals two views:
+
+- **Actual Significance** (`*_rank`): Samples that ARE significant
+- **Formative Significance** (`*_shapley_rank`): Samples that CREATE the structure
+
+This distinction, between what IS significant vs what CREATES structure, is unique to DataTypical.
+
+---
+
+## Example: Drug Discovery
+```python
+# Analyze compound library
+dt = DataTypical(
+    shapley_mode=True,
+    stereotype_column='activity',  # Target property
+    fast_mode=False
+)
+results = dt.fit_transform(compounds)
+
+# Find critical compounds (high actual + high formative)
+critical = results[
+    (results['stereotypical_rank'] > 0.8) &
+    (results['stereotypical_shapley_rank'] > 0.8)
+]
+print(f"Found {len(critical)} critical compounds")
+
+# Find redundant compounds (high actual + low formative)
+redundant = results[
+    (results['stereotypical_rank'] > 0.8) &
+    (results['stereotypical_shapley_rank'] < 0.3)
+]
+print(f"Found {len(redundant)} replaceable compounds")
+
+# Understand alternative mechanisms
+for idx in critical.index:
+    profile_plot(dt, idx, significance='stereotypical')
+    # Each shows different feature pattern → different mechanism
+```
+
+**Discovery**: Multiple structural pathways to high activity.
+
+---
+
+## Key Parameters
+```python
+DataTypical(
+    shapley_mode=False,          # True for explanations
+    fast_mode=True,              # False for publication quality
+    n_archetypes=8,              # Number of extreme corners
+    n_prototypes=8,              # Number of representatives
+    stereotype_column=None,      # Target column for stereotypical
+    shapley_top_n=500,           # Limit explanations to top N
+    shapley_n_permutations=100,  # Number of permutations
+    random_state=None,           # Set for reproducible results
+    max_memory_mb=8000           # Memory limit
+)
+```
+
+---
+
+## Visualization Functions
+```python
+from datatypical_viz import significance_plot, heatmap, profile_plot
+
+# 1. Dual-perspective scatter plot
+significance_plot(results, significance='archetypal')
+
+# 2. Feature attribution heatmap
+heatmap(dt, results, significance='archetypal', order='actual', top_n=20)
+
+# 3. Individual sample profile
+profile_plot(dt, sample_idx, significance='archetypal', order='local')
+```
+
+---
+
+## Multi-Modal Support
+
+### Tabular Data
+```python
+df = pd.DataFrame(...)
+dt = DataTypical()
+results = dt.fit_transform(df)
+```
+
+### Text Data
+```python
+texts = ["document 1", "document 2", ...]
+dt = DataTypical()
+results = dt.fit_transform(texts)
+```
+
+### Graph Networks
+```python
+node_features = pd.DataFrame(...)
+edges = [(0, 1), (1, 2), ...]
+dt = DataTypical()
+results = dt.fit_transform(node_features, edges=edges)
+```
+
+---
+
+## Performance
+
+| Dataset Size | Without Shapley | With Shapley |
+|--------------|-----------------|--------------|
+| 1,000 samples | ~5 seconds | ~5 minutes |
+| 10,000 samples | ~30 seconds | ~60 minutes |
+
+**Optimization Strategy**:
+1. Fast exploration (`fast_mode=True`, no Shapley)
+2. Identify interesting samples
+3. Detailed analysis (`shapley_mode=True`, subset)
+4. Generate publication figures
+
+---
+
+## Use Cases
+
+**Scientific Discovery**: Alternative mechanisms and pathways, boundary definition and edge cases, quality control and validation, coverage analysis and gap identification
+
+**Dataset Curation**: Size reduction while preserving diversity, representative selection, redundancy detection, gap identification for future sampling
+
+**Model Understanding**: Feature importance (global and local), individual sample explanations, pattern recognition across pathways, interpretable explanations
+
+---
+
+## What Makes DataTypical Different
+
+**From outlier detection**: Finds extremes AND explains why
+
+**From clustering**: Finds representatives maximizing coverage AND explains why
+
+**From feature selection**: Explains which features matter for which samples
+
+**From PCA/t-SNE**: Maintains interpretability in original feature space
+
+**The Novel Contribution**: Formative instances distinguish samples that ARE significant from samples that CREATE structure, enabling redundancy detection, identifying structurally important samples, and understanding irreplaceable vs interchangeable samples.
+
+---
+
+## Documentation
+
+Complete documentation, examples, and guides available at:
+**https://github.com/amaxiom/DataTypical**
+
+Includes:
+- Getting started tutorials
+- Comprehensive examples across scientific domains
+- Visualization interpretation guides
+- Advanced usage and computation details
+- Test suite and benchmarks
+
+---
+
+## Support
+
+- **GitHub Repository**: https://github.com/amaxiom/DataTypical
+- **Report Issues**: https://github.com/amaxiom/DataTypical/issues
+- **Questions & Discussions**: https://github.com/amaxiom/DataTypical/discussions
+
+---
+
+## Requirements
+
+- Python ≥ 3.8
+- NumPy ≥ 1.20
+- Pandas ≥ 1.3
+- SciPy ≥ 1.7
+- scikit-learn ≥ 1.0
+- Matplotlib ≥ 3.3
+- Seaborn ≥ 0.11
+- Numba ≥ 0.55
+
+---
+
+## Citation
+
+If you use DataTypical in your research, please cite:
+```bibtex
+@software{datatypical2025,
+  author = {Barnard, Amanda S.},
+  title = {DataTypical: Scientific Data Significance Rankings with Shapley Explanations},
+  year = {2026},
+  url = {https://github.com/amaxiom/DataTypical},
+  version = {0.7}
+}
+```
+
+---
+
+## License
+
+MIT License - Copyright (c) 2026 Amanda S. Barnard
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
+
+---
+
+## Acknowledgments
+
+DataTypical builds on foundational work in archetypal analysis (Cutler & Breiman, 1994), facility location optimization (Nemhauser et al., 1978), Shapley value theory (Shapley, 1953), and PCHA optimization (Mørup & Hansen, 2012).