datatypical-0.7.0-py3-none-any.whl

@@ -0,0 +1,302 @@
1
+ Metadata-Version: 2.4
2
+ Name: datatypical
3
+ Version: 0.7.0
4
+ Summary: Explainable instance significance discovery for scientific datasets
5
+ Home-page: https://github.com/amaxiom/DataTypical
6
+ Author: Amanda S. Barnard
7
+ Author-email: "Amanda S. Barnard" <amanda.s.barnard@anu.edu.au>
8
+ License: MIT
9
+ Project-URL: Homepage, https://github.com/amaxiom/DataTypical
10
+ Project-URL: Documentation, https://github.com/amaxiom/DataTypical/tree/main/docs
11
+ Project-URL: Repository, https://github.com/amaxiom/DataTypical
12
+ Keywords: machine-learning,explainable-ai,shapley-values,data-science
13
+ Classifier: Development Status :: 4 - Beta
14
+ Classifier: Intended Audience :: Science/Research
15
+ Classifier: License :: OSI Approved :: MIT License
16
+ Classifier: Programming Language :: Python :: 3
17
+ Requires-Python: >=3.8
18
+ Description-Content-Type: text/markdown
19
+ License-File: LICENSE
20
+ Dynamic: author
21
+ Dynamic: home-page
22
+ Dynamic: license-file
23
+ Dynamic: requires-python
24
+
25
+ # DataTypical
26
+
27
+ **Explainable Instance Significance Discovery for Scientific Datasets**
28
+
29
+ [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
30
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
31
+
32
+ DataTypical analyzes datasets through three complementary lenses: archetypal (extreme), prototypical (representative), and stereotypical (target-like), with Shapley value explanations revealing why instances matter and which ones create your dataset's structure.
33
+
34
+ ---
35
+
36
+ ## Key Features
37
+
38
+ - **Three Significance Types**: Archetypal, prototypical, stereotypical (all computed simultaneously)
39
+ - **Shapley Explanations**: Feature-level attributions for why samples are significant
40
+ - **Formative Discovery**: Distinguish samples that ARE significant from those that CREATE structure
41
+ - **Publication Visualizations**: Dual-perspective scatter plots, heatmaps, and profile plots
42
+ - **Multi-Modal Support**: Tabular data, text, and graph networks through unified API
43
+ - **Performance Optimized**: Fast exploration mode and efficient Shapley computation
44
+
45
+ ---
46
+
47
+ ## Installation
48
+ ```bash
49
+ pip install datatypical
50
+ ```
51
+
52
+ ---
53
+
54
+ ## Quick Start
55
+ ```python
56
+ from datatypical import DataTypical
57
+ from datatypical_viz import significance_plot, heatmap, profile_plot
58
+ import pandas as pd
59
+
60
+ # Load your data
61
+ data = pd.read_csv('your_data.csv')
62
+
63
+ # Analyze with explanations
64
+ dt = DataTypical(shapley_mode=True)
65
+ results = dt.fit_transform(data)
66
+
67
+ # Three significance perspectives (0-1 normalized ranks)
68
+ print(results[['archetypal_rank', 'prototypical_rank', 'stereotypical_rank']])
69
+
70
+ # Visualize: which samples are critical vs replaceable?
71
+ significance_plot(results, significance='archetypal')
72
+
73
+ # Understand: which features drive significance?
74
+ heatmap(dt, results, significance='archetypal', order='actual', top_n=20)
75
+
76
+ # Explain: why is this sample significant?
77
+ top_idx = results['archetypal_rank'].idxmax()
78
+ profile_plot(dt, top_idx, significance='archetypal', order='local')
79
+ ```
80
+
81
+ ---
82
+
83
+ ## What DataTypical Does
84
+
85
+ ### Three Complementary Lenses
86
+
87
+ | Lens | Finds | Use Cases |
88
+ |------|-------|-----------|
89
+ | **Archetypal** | Extreme, boundary samples | Edge case discovery, outlier detection, range understanding |
90
+ | **Prototypical** | Representative, central samples | Dataset summarization, cluster centers, data coverage |
91
+ | **Stereotypical** | Target-similar samples | Optimization, goal-oriented selection, phenotype matching |
92
+
93
+ **The Power**: All three are computed simultaneously, and each perspective reveals different insights.
94
+
95
+ ### Dual Perspective (with Shapley)
96
+
97
+ When `shapley_mode=True`, DataTypical reveals two views:
98
+
99
+ - **Actual Significance** (`*_rank`): Samples that ARE significant
100
+ - **Formative Significance** (`*_shapley_rank`): Samples that CREATE the structure
101
+
102
+ This distinction between what IS significant and what CREATES structure is unique to DataTypical.
103
+
104
+ ---
105
+
106
+ ## Example: Drug Discovery
107
+ ```python
108
+ # Analyze compound library
109
+ dt = DataTypical(
110
+     shapley_mode=True,
111
+     stereotype_column='activity',  # Target property
112
+     fast_mode=False
113
+ )
114
+ results = dt.fit_transform(compounds)
115
+
116
+ # Find critical compounds (high actual + high formative)
117
+ critical = results[
118
+     (results['stereotypical_rank'] > 0.8) &
119
+     (results['stereotypical_shapley_rank'] > 0.8)
120
+ ]
121
+ print(f"Found {len(critical)} critical compounds")
122
+
123
+ # Find redundant compounds (high actual + low formative)
124
+ redundant = results[
125
+     (results['stereotypical_rank'] > 0.8) &
126
+     (results['stereotypical_shapley_rank'] < 0.3)
127
+ ]
128
+ print(f"Found {len(redundant)} replaceable compounds")
129
+
130
+ # Understand alternative mechanisms
131
+ for idx in critical.index:
132
+     profile_plot(dt, idx, significance='stereotypical')
133
+     # Each shows different feature pattern → different mechanism
134
+ ```
135
+
136
+ **Discovery**: Multiple structural pathways to high activity.
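The two filters in the example above amount to a simple rule on the pair (actual rank, formative rank). A minimal plain-Python sketch of that rule follows. The function name `classify`, the `"unlabeled"` fallback, and the sample names are hypothetical; the 0.8 and 0.3 thresholds are taken from the example, not from any DataTypical default.

```python
def classify(actual_rank, formative_rank, high=0.8, low=0.3):
    """Label one sample from its actual and formative ranks (both in [0, 1])."""
    if actual_rank > high and formative_rank > high:
        return "critical"      # significant and structure-forming
    if actual_rank > high and formative_rank < low:
        return "replaceable"   # significant, but other samples could stand in
    return "unlabeled"         # not covered by either filter in the example


# Hypothetical (actual, formative) rank pairs for three compounds.
samples = {"cmpd_a": (0.95, 0.90), "cmpd_b": (0.92, 0.10), "cmpd_c": (0.40, 0.85)}
labels = {name: classify(a, f) for name, (a, f) in samples.items()}
print(labels)
```

Samples with a high actual rank and a formative rank between the two thresholds fall through to `"unlabeled"`, which mirrors the gap left by the two filters in the example.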
137
+
138
+ ---
139
+
140
+ ## Key Parameters
141
+ ```python
142
+ DataTypical(
143
+     shapley_mode=False,          # True for explanations
144
+     fast_mode=True,              # False for publication quality
145
+     n_archetypes=8,              # Number of extreme corners
146
+     n_prototypes=8,              # Number of representatives
147
+     stereotype_column=None,      # Target column for stereotypical
148
+     shapley_top_n=500,           # Limit explanations to top N
149
+     shapley_n_permutations=100,  # Number of permutations
150
+     random_state=None,           # Set for reproducible results
151
+     max_memory_mb=8000           # Memory limit
152
+ )
153
+ ```
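The `shapley_n_permutations` parameter suggests a permutation-sampling estimate of Shapley values. The sketch below shows that generic technique in plain Python: each player's marginal contribution is averaged over random orderings. It is not DataTypical's implementation; `permutation_shapley` and `toy_value` are hypothetical names, and the toy value function is invented for illustration.

```python
import random


def permutation_shapley(players, value, n_permutations=100, seed=0):
    """Monte Carlo Shapley estimate: average each player's marginal
    contribution over random orderings of the players."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(n_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = frozenset()
        previous = value(coalition)
        for p in order:
            coalition = coalition | {p}
            current = value(coalition)
            totals[p] += current - previous  # marginal contribution of p
            previous = current
    return {p: t / n_permutations for p, t in totals.items()}


def toy_value(coalition):
    """Toy game: "a" and "b" only pay off together; "c" adds 1 on its own."""
    pair_bonus = 2.0 if {"a", "b"} <= coalition else 0.0
    solo_bonus = 1.0 if "c" in coalition else 0.0
    return pair_bonus + solo_bonus


phi = permutation_shapley(["a", "b", "c"], toy_value, n_permutations=200, seed=0)
print(phi)
```

In this toy game the exact Shapley values are 1 for each player. The estimate for `"c"` is exact, while the estimates for `"a"` and `"b"` carry sampling noise that shrinks as the number of permutations grows, which is the accuracy-versus-cost trade-off a permutation count controls.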
154
+
155
+ ---
156
+
157
+ ## Visualization Functions
158
+ ```python
159
+ from datatypical_viz import significance_plot, heatmap, profile_plot
160
+
161
+ # 1. Dual-perspective scatter plot
162
+ significance_plot(results, significance='archetypal')
163
+
164
+ # 2. Feature attribution heatmap
165
+ heatmap(dt, results, significance='archetypal', order='actual', top_n=20)
166
+
167
+ # 3. Individual sample profile
168
+ profile_plot(dt, sample_idx, significance='archetypal', order='local')
169
+ ```
170
+
171
+ ---
172
+
173
+ ## Multi-Modal Support
174
+
175
+ ### Tabular Data
176
+ ```python
177
+ df = pd.DataFrame(...)
178
+ dt = DataTypical()
179
+ results = dt.fit_transform(df)
180
+ ```
181
+
182
+ ### Text Data
183
+ ```python
184
+ texts = ["document 1", "document 2", ...]
185
+ dt = DataTypical()
186
+ results = dt.fit_transform(texts)
187
+ ```
188
+
189
+ ### Graph Networks
190
+ ```python
191
+ node_features = pd.DataFrame(...)
192
+ edges = [(0, 1), (1, 2), ...]
193
+ dt = DataTypical()
194
+ results = dt.fit_transform(node_features, edges=edges)
195
+ ```
196
+
197
+ ---
198
+
199
+ ## Performance
200
+
201
+ | Dataset Size | Without Shapley | With Shapley |
202
+ |--------------|-----------------|--------------|
203
+ | 1,000 samples | ~5 seconds | ~5 minutes |
204
+ | 10,000 samples | ~30 seconds | ~60 minutes |
205
+
206
+ **Optimization Strategy**:
207
+ 1. Fast exploration (`fast_mode=True`, no Shapley)
208
+ 2. Identify interesting samples
209
+ 3. Detailed analysis (`shapley_mode=True`, subset)
210
+ 4. Generate publication figures
211
+
212
+ ---
213
+
214
+ ## Use Cases
215
+
216
+ **Scientific Discovery**: Alternative mechanisms and pathways, boundary definition and edge cases, quality control and validation, coverage analysis and gap identification
217
+
218
+ **Dataset Curation**: Size reduction while preserving diversity, representative selection, redundancy detection, gap identification for future sampling
219
+
220
+ **Model Understanding**: Feature importance (global and local), individual sample explanations, pattern recognition across pathways, interpretable explanations
221
+
222
+ ---
223
+
224
+ ## What Makes DataTypical Different
225
+
226
+ **From outlier detection**: Finds extremes AND explains why
227
+
228
+ **From clustering**: Finds representatives maximizing coverage AND explains why
229
+
230
+ **From feature selection**: Explains which features matter for which samples
231
+
232
+ **From PCA/t-SNE**: Maintains interpretability in original feature space
233
+
234
+ **The Novel Contribution**: Formative instances distinguish samples that ARE significant from samples that CREATE structure, enabling redundancy detection, identifying structurally important samples, and understanding irreplaceable vs interchangeable samples.
235
+
236
+ ---
237
+
238
+ ## Documentation
239
+
240
+ Complete documentation, examples, and guides are available at:
241
+ **https://github.com/amaxiom/DataTypical**
242
+
243
+ Includes:
244
+ - Getting started tutorials
245
+ - Comprehensive examples across scientific domains
246
+ - Visualization interpretation guides
247
+ - Advanced usage and computation details
248
+ - Test suite and benchmarks
249
+
250
+ ---
251
+
252
+ ## Support
253
+
254
+ - **GitHub Repository**: https://github.com/amaxiom/DataTypical
255
+ - **Report Issues**: https://github.com/amaxiom/DataTypical/issues
256
+ - **Questions & Discussions**: https://github.com/amaxiom/DataTypical/discussions
257
+
258
+ ---
259
+
260
+ ## Requirements
261
+
262
+ - Python ≥ 3.8
263
+ - NumPy ≥ 1.20
264
+ - Pandas ≥ 1.3
265
+ - SciPy ≥ 1.7
266
+ - scikit-learn ≥ 1.0
267
+ - Matplotlib ≥ 3.3
268
+ - Seaborn ≥ 0.11
269
+ - Numba ≥ 0.55
270
+
271
+ ---
272
+
273
+ ## Citation
274
+
275
+ If you use DataTypical in your research, please cite:
276
+ ```bibtex
277
+ @software{datatypical2025,
278
+ author = {Barnard, Amanda S.},
279
+ title = {DataTypical: Scientific Data Significance Rankings with Shapley Explanations},
280
+ year = {2026},
281
+ url = {https://github.com/amaxiom/DataTypical},
282
+ version = {0.7}
283
+ }
284
+ ```
285
+
286
+ ---
287
+
288
+ ## License
289
+
290
+ MIT License - Copyright (c) 2026 Amanda S. Barnard
291
+
292
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
293
+
294
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
295
+
296
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
297
+
298
+ ---
299
+
300
+ ## Acknowledgments
301
+
302
+ DataTypical builds on foundational work in archetypal analysis (Cutler & Breiman, 1994), facility location optimization (Nemhauser et al., 1978), Shapley value theory (Shapley, 1953), and PCHA optimization (Mørup & Hansen, 2012).
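The facility location objective cited above scores a set of representatives by how well every sample is covered by its most similar representative. A generic greedy selection for that objective can be sketched as follows. Whether DataTypical selects prototypes exactly this way is an assumption; `greedy_facility_location`, the toy points, and the similarity function are hypothetical, and similarities are assumed to be non-negative.

```python
def greedy_facility_location(similarity, k):
    """Greedily pick k indices that maximize
    sum_i max_{j in chosen} similarity[i][j]."""
    n = len(similarity)
    chosen = []
    best = [0.0] * n  # best similarity of each point to the chosen set so far
    for _ in range(min(k, n)):
        # Coverage gained by adding each remaining candidate j.
        gains = {
            j: sum(max(similarity[i][j] - best[i], 0.0) for i in range(n))
            for j in range(n)
            if j not in chosen
        }
        pick = max(gains, key=gains.get)
        chosen.append(pick)
        best = [max(best[i], similarity[i][pick]) for i in range(n)]
    return chosen


# Toy 1-D data: a tight group near 0-2 and one distant point at 50.
points = [0, 1, 2, 50]
similarity = [[1.0 / (1.0 + abs(a - b)) for b in points] for a in points]
prototypes = greedy_facility_location(similarity, k=2)
print(prototypes)  # → [1, 3]
```

The first pick is the middle of the tight group, and the second is the distant point, because covering it adds more than a second representative inside the already covered group.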
@@ -0,0 +1,7 @@
1
+ datatypical.py,sha256=XWxBi9u23nVLs3eOtSafY4FnFcnmqi10sMU40JiLKjg,134841
2
+ datatypical_viz.py,sha256=5eTQPJAb9L4oOGBBsZMPxOzXwYEwqWrfD9aCG949g6g,34557
3
+ datatypical-0.7.0.dist-info/licenses/LICENSE,sha256=g0opCy-9QpR50YnBBb8vaOCFanVirfgoW4L5piJpOzI,1073
4
+ datatypical-0.7.0.dist-info/METADATA,sha256=U4C9Z2cnzEw2iB-w3ytFgnx9hbEMO8TJ66BlMH20o4Y,10078
5
+ datatypical-0.7.0.dist-info/WHEEL,sha256=wUyA8OaulRlbfwMtmQsvNngGrxQHAvkKcvRmdizlJi0,92
6
+ datatypical-0.7.0.dist-info/top_level.txt,sha256=plG_W9bF7qvWQWnJ9fsQnKAubyHhFOAAcxNdVf1hw4Q,28
7
+ datatypical-0.7.0.dist-info/RECORD,,
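Each RECORD line above has the form `path,sha256=<digest>,<size>`. In the wheel format, the digest is the SHA-256 hash encoded as URL-safe base64 with the trailing `=` padding removed. A minimal sketch of how such an entry could be recomputed when checking an unpacked wheel follows. `record_entry` is a hypothetical helper, and the bytes used for `top_level.txt` are an assumption based on the two module names and the listed size of 28.

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build a RECORD-style line: path, unpadded urlsafe-base64 SHA-256, size."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


# Assumed content of top_level.txt: one module name per line, newline-terminated.
top_level = b"datatypical\ndatatypical_viz\n"
entry = record_entry("datatypical-0.7.0.dist-info/top_level.txt", top_level)
print(entry)
```

Comparing the printed line with the corresponding RECORD entry is one way to confirm that a file was not altered after the wheel was built.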
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: setuptools (80.10.2)
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
5
+
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Amanda S. Barnard
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,2 @@
1
+ datatypical
2
+ datatypical_viz