dataset-complexity-profiler 0.1.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1 @@
1
+ include dataset_complexity_profiler/meta_model.pkl
@@ -0,0 +1,187 @@
1
+ Metadata-Version: 2.4
2
+ Name: dataset-complexity-profiler
3
+ Version: 0.1.0
4
+ Summary: Text dataset complexity profiler with a packaged meta-model for automatic embedding dimension recommendation
5
+ License: MIT
6
+ Keywords: nlp,meta-learning,dataset-profiling,embeddings
7
+ Requires-Python: >=3.9
8
+ Description-Content-Type: text/markdown
9
+ License-File: LICENSE
10
+ Requires-Dist: sentence-transformers>=2.2.2
11
+ Requires-Dist: datasets>=2.14.5
12
+ Requires-Dist: pymfe>=0.3.1
13
+ Requires-Dist: scikit-learn>=1.2.2
14
+ Requires-Dist: numpy>=1.24.0
15
+ Requires-Dist: pandas>=1.5.3
16
+ Requires-Dist: joblib>=1.2.0
17
+ Dynamic: license-file
18
+
19
+ # Dataset Complexity Profiling Tool
20
+
21
+ A text dataset complexity profiler with a packaged default meta-model for automatic embedding dimension recommendations.
22
+
23
+ This repository is organized as a package plus scripts, so you can use it as a library, run command-line flows with `uv`, and retrain the meta-model when needed.
24
+
25
+ ## Project structure
26
+
27
+ - `dataset_complexity_profiler/`
28
+ - package code
29
+ - contains `DatasetProfiler` and packaged `meta_model.pkl`
30
+ - `scripts/`
31
+ - helper scripts for demo, prediction, benchmark collection, training, and sanity checks
32
+ - `tests/`
33
+ - automated pytest unit tests
34
+ - `pyproject.toml`
35
+ - packaging and dependency settings for `uv`
36
+ - `MANIFEST.in`
37
+ - includes `dataset_complexity_profiler/meta_model.pkl` in the package
38
+ - `requirements.txt`
39
+ - runtime dependencies
40
+ - `requirements-dev.txt`
41
+ - test/development dependencies
42
+ - `uv.lock`
43
+ - `uv` dependency lock file
44
+
45
+ ## What each file does
46
+
47
+ - `dataset_complexity_profiler/__init__.py`
48
+ - exposes `DatasetProfiler` for `from dataset_complexity_profiler import DatasetProfiler`
49
+ - `dataset_complexity_profiler/dataset_adapter.py`
50
+ - the main implementation
51
+ - extracts meta-features, analyzes datasets, predicts embedding dimension, and trains custom meta-models
52
+ - `scripts/main.py`
53
+ - demo runner for benchmark datasets
54
+ - good for a quick functional check
55
+ - `scripts/predict.py`
56
+ - example script to predict optimal embedding dimension for one dataset and save JSON output
57
+ - `scripts/collect_benchmarks.py`
58
+ - builds `benchmarks.csv` from a set of benchmark text datasets
59
+ - used when preparing training data for a new meta-model
60
+ - `scripts/train_meta_model.py`
61
+ - trains the regression meta-model on `benchmarks.csv`
62
+ - saves a new `meta_model.pkl`
63
+ - `scripts/test.py`
64
+ - manual demo script for running a quick prediction flow
65
+ - a small convenience script, separate from automated tests
66
+ - `tests/`
67
+ - unit tests for package logic using `pytest`
68
+
69
+ ## What `scripts/main.py` does
70
+
71
+ - Runs the package pipeline on a set of preconfigured benchmark datasets.
72
+ - Uses the default packaged `meta_model.pkl`.
73
+ - Downloads datasets, computes embeddings, predicts optimal embedding dimension, and prints recommendations.
74
+ - Useful for a full end-to-end verification of the workflow.
75
+
76
+ ## How to demonstrate the model
77
+
78
+ Use one of these scripts to show the package working:
79
+
80
+ - `uv run python scripts/main.py`
81
+ - runs a complete benchmark demo on several datasets
82
+ - `uv run python scripts/predict.py`
83
+ - predicts optimal embedding dimension for a single dataset and writes `prediction_result.json`
84
+ - `uv run python scripts/test.py`
85
+ - quick manual check script for a simple prediction flow
86
+
87
+ ## Why `benchmarks.csv` exists
88
+
89
+ - `benchmarks.csv` is a training dataset for the meta-model.
90
+ - It is generated by `scripts/collect_benchmarks.py` from several benchmark text datasets.
91
+ - It is not required for normal use of the package because the repository already ships with a default `meta_model.pkl`.
92
+
93
+ In other words:
94
+
95
+ - `benchmarks.csv` is used for meta-model training and experimentation.
96
+ - `DatasetProfiler` can work immediately without it.
97
+
98
+ ## How the default model works
99
+
100
+ The package includes `dataset_complexity_profiler/meta_model.pkl`.
101
+ If you create `DatasetProfiler()` normally, it loads that model automatically.
102
+ That means users can get recommendations without training anything.
103
+
104
+ ## How to train your own meta-model
105
+
106
+ To get a model tuned to your own data distribution, you do not pass a CSV file to `DatasetProfiler`.
107
+ Instead, pass Python data structures that hold the raw texts and their labels.
108
+
109
+ Example:
110
+
111
+ ```python
112
+ from dataset_complexity_profiler import DatasetProfiler
113
+
114
+ profiler = DatasetProfiler(auto_load_meta_model=False)
115
+
116
+ datasets = [
117
+ {"texts": ["sample 1", "sample 2"], "labels": [0, 1]},
118
+ {"texts": ["sample 3", "sample 4"], "labels": [1, 0]},
119
+ ]
120
+
121
+ profiler.train_custom_meta_model(datasets)
122
+ profiler.save_meta_model("custom_meta_model.pkl")
123
+ ```
124
+
125
+ Then later:
126
+
127
+ ```python
128
+ profiler = DatasetProfiler(auto_load_meta_model=False)
129
+ profiler.load_meta_model("custom_meta_model.pkl")
130
+ ```
131
+
132
+ If you want to generate a training dataset from benchmark datasets, use:
133
+
134
+ ```bash
135
+ uv run python scripts/collect_benchmarks.py
136
+ uv run python scripts/train_meta_model.py
137
+ ```
138
+
139
+ The resulting `meta_model.pkl` is then placed in `dataset_complexity_profiler/`.
140
+
141
+ ## How to use the package with `uv`
142
+
143
+ Install dependencies:
144
+
145
+ ```bash
146
+ uv sync
147
+ ```
148
+
149
+ Run the demo script:
150
+
151
+ ```bash
152
+ uv run python scripts/main.py
153
+ ```
154
+
155
+ Predict on a new dataset:
156
+
157
+ ```bash
158
+ uv run python scripts/predict.py
159
+ ```
160
+
161
+ Collect benchmark training data:
162
+
163
+ ```bash
164
+ uv run python scripts/collect_benchmarks.py
165
+ ```
166
+
167
+ Train a new meta-model:
168
+
169
+ ```bash
170
+ uv run python scripts/train_meta_model.py
171
+ ```
172
+
173
+ ## Core API
174
+
175
+ - `analyze_text_dataset(texts, labels, ...)` — full dataset analysis report
176
+ - `analyze_and_adapt(X, y, ...)` — analyze precomputed embeddings
177
+ - `fit_transform(texts, labels, ...)` — text → embeddings → compressed vectors
178
+ - `predict_embedding_dim(X, y, ...)` — recommended embedding dimension
179
+ - `train_custom_meta_model(datasets, ...)` — train a custom meta-model on user datasets
180
+ - `save_meta_model(path)` and `load_meta_model(path)`
181
+
182
+ ## Notes
183
+
184
+ - `scripts/test.py` is optional and only for manual sanity checking.
185
+ - `tests/` is the real automated test suite.
186
+ - `benchmarks.csv` is only needed when you want to retrain or expand the meta-model.
187
+ - For user-specific training, pass text/label pairs in Python, not a single CSV file.
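
The last note above says to pass text/label pairs in Python rather than a single CSV file. If the data happens to live in a CSV, a minimal sketch of the conversion might look like the following. The column names `dataset`, `text`, and `label` and the helper name `csv_to_datasets` are assumptions made for illustration; only the `{"texts": [...], "labels": [...]}` shape comes from the README example.

```python
import csv
import io


def csv_to_datasets(csv_file):
    """Group CSV rows into the list-of-dicts shape shown in the README.

    Assumes hypothetical columns named: dataset, text, label.
    """
    grouped = {}  # dicts keep insertion order, so dataset order is preserved
    for row in csv.DictReader(csv_file):
        entry = grouped.setdefault(row["dataset"], {"texts": [], "labels": []})
        entry["texts"].append(row["text"])
        entry["labels"].append(int(row["label"]))
    return list(grouped.values())


# In-memory CSV used only to keep the sketch self-contained.
sample = io.StringIO(
    "dataset,text,label\n"
    "a,sample 1,0\n"
    "a,sample 2,1\n"
    "b,sample 3,1\n"
    "b,sample 4,0\n"
)
datasets = csv_to_datasets(sample)
```

The resulting list has the same shape as the `datasets` variable in the README example, so it could presumably be handed to `train_custom_meta_model(datasets)` unchanged.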
@@ -0,0 +1,169 @@
1
+ # Dataset Complexity Profiling Tool
2
+
3
+ A text dataset complexity profiler with a packaged default meta-model for automatic embedding dimension recommendations.
4
+
5
+ This repository is organized as a package plus scripts, so you can use it as a library, run command-line flows with `uv`, and retrain the meta-model when needed.
6
+
7
+ ## Project structure
8
+
9
+ - `dataset_complexity_profiler/`
10
+ - package code
11
+ - contains `DatasetProfiler` and packaged `meta_model.pkl`
12
+ - `scripts/`
13
+ - helper scripts for demo, prediction, benchmark collection, training, and sanity checks
14
+ - `tests/`
15
+ - automated pytest unit tests
16
+ - `pyproject.toml`
17
+ - packaging and dependency settings for `uv`
18
+ - `MANIFEST.in`
19
+ - includes `dataset_complexity_profiler/meta_model.pkl` in the package
20
+ - `requirements.txt`
21
+ - runtime dependencies
22
+ - `requirements-dev.txt`
23
+ - test/development dependencies
24
+ - `uv.lock`
25
+ - `uv` dependency lock file
26
+
27
+ ## What each file does
28
+
29
+ - `dataset_complexity_profiler/__init__.py`
30
+ - exposes `DatasetProfiler` for `from dataset_complexity_profiler import DatasetProfiler`
31
+ - `dataset_complexity_profiler/dataset_adapter.py`
32
+ - the main implementation
33
+ - extracts meta-features, analyzes datasets, predicts embedding dimension, and trains custom meta-models
34
+ - `scripts/main.py`
35
+ - demo runner for benchmark datasets
36
+ - good for a quick functional check
37
+ - `scripts/predict.py`
38
+ - example script to predict optimal embedding dimension for one dataset and save JSON output
39
+ - `scripts/collect_benchmarks.py`
40
+ - builds `benchmarks.csv` from a set of benchmark text datasets
41
+ - used when preparing training data for a new meta-model
42
+ - `scripts/train_meta_model.py`
43
+ - trains the regression meta-model on `benchmarks.csv`
44
+ - saves a new `meta_model.pkl`
45
+ - `scripts/test.py`
46
+ - manual demo script for running a quick prediction flow
47
+ - a small convenience script, separate from automated tests
48
+ - `tests/`
49
+ - unit tests for package logic using `pytest`
50
+
51
+ ## What `scripts/main.py` does
52
+
53
+ - Runs the package pipeline on a set of preconfigured benchmark datasets.
54
+ - Uses the default packaged `meta_model.pkl`.
55
+ - Downloads datasets, computes embeddings, predicts optimal embedding dimension, and prints recommendations.
56
+ - Useful for a full end-to-end verification of the workflow.
57
+
58
+ ## How to demonstrate the model
59
+
60
+ Use one of these scripts to show the package working:
61
+
62
+ - `uv run python scripts/main.py`
63
+ - runs a complete benchmark demo on several datasets
64
+ - `uv run python scripts/predict.py`
65
+ - predicts optimal embedding dimension for a single dataset and writes `prediction_result.json`
66
+ - `uv run python scripts/test.py`
67
+ - quick manual check script for a simple prediction flow
68
+
69
+ ## Why `benchmarks.csv` exists
70
+
71
+ - `benchmarks.csv` is a training dataset for the meta-model.
72
+ - It is generated by `scripts/collect_benchmarks.py` from several benchmark text datasets.
73
+ - It is not required for normal use of the package because the repository already ships with a default `meta_model.pkl`.
74
+
75
+ In other words:
76
+
77
+ - `benchmarks.csv` is used for meta-model training and experimentation.
78
+ - `DatasetProfiler` can work immediately without it.
79
+
80
+ ## How the default model works
81
+
82
+ The package includes `dataset_complexity_profiler/meta_model.pkl`.
83
+ If you create `DatasetProfiler()` normally, it loads that model automatically.
84
+ That means users can get recommendations without training anything.
85
+
86
+ ## How to train your own meta-model
87
+
88
+ To get a model tuned to your own data distribution, you do not pass a CSV file to `DatasetProfiler`.
89
+ Instead, pass Python data structures that hold the raw texts and their labels.
90
+
91
+ Example:
92
+
93
+ ```python
94
+ from dataset_complexity_profiler import DatasetProfiler
95
+
96
+ profiler = DatasetProfiler(auto_load_meta_model=False)
97
+
98
+ datasets = [
99
+ {"texts": ["sample 1", "sample 2"], "labels": [0, 1]},
100
+ {"texts": ["sample 3", "sample 4"], "labels": [1, 0]},
101
+ ]
102
+
103
+ profiler.train_custom_meta_model(datasets)
104
+ profiler.save_meta_model("custom_meta_model.pkl")
105
+ ```
106
+
107
+ Then later:
108
+
109
+ ```python
110
+ profiler = DatasetProfiler(auto_load_meta_model=False)
111
+ profiler.load_meta_model("custom_meta_model.pkl")
112
+ ```
113
+
114
+ If you want to generate a training dataset from benchmark datasets, use:
115
+
116
+ ```bash
117
+ uv run python scripts/collect_benchmarks.py
118
+ uv run python scripts/train_meta_model.py
119
+ ```
120
+
121
+ The resulting `meta_model.pkl` is then placed in `dataset_complexity_profiler/`.
122
+
123
+ ## How to use the package with `uv`
124
+
125
+ Install dependencies:
126
+
127
+ ```bash
128
+ uv sync
129
+ ```
130
+
131
+ Run the demo script:
132
+
133
+ ```bash
134
+ uv run python scripts/main.py
135
+ ```
136
+
137
+ Predict on a new dataset:
138
+
139
+ ```bash
140
+ uv run python scripts/predict.py
141
+ ```
142
+
143
+ Collect benchmark training data:
144
+
145
+ ```bash
146
+ uv run python scripts/collect_benchmarks.py
147
+ ```
148
+
149
+ Train a new meta-model:
150
+
151
+ ```bash
152
+ uv run python scripts/train_meta_model.py
153
+ ```
154
+
155
+ ## Core API
156
+
157
+ - `analyze_text_dataset(texts, labels, ...)` — full dataset analysis report
158
+ - `analyze_and_adapt(X, y, ...)` — analyze precomputed embeddings
159
+ - `fit_transform(texts, labels, ...)` — text → embeddings → compressed vectors
160
+ - `predict_embedding_dim(X, y, ...)` — recommended embedding dimension
161
+ - `train_custom_meta_model(datasets, ...)` — train a custom meta-model on user datasets
162
+ - `save_meta_model(path)` and `load_meta_model(path)`
163
+
164
+ ## Notes
165
+
166
+ - `scripts/test.py` is optional and only for manual sanity checking.
167
+ - `tests/` is the real automated test suite.
168
+ - `benchmarks.csv` is only needed when you want to retrain or expand the meta-model.
169
+ - For user-specific training, pass text/label pairs in Python, not a single CSV file.
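
Because custom training takes plain Python structures, it may help to check them before starting a run. The helper below is a hypothetical pre-check and is not part of the package; it only assumes the `texts`/`labels` keys used in the README example.

```python
def check_datasets(datasets):
    """Raise ValueError if an entry lacks the README's texts/labels shape.

    Returns the number of datasets checked. Hypothetical helper for illustration.
    """
    for i, entry in enumerate(datasets):
        missing = {"texts", "labels"} - set(entry)
        if missing:
            raise ValueError(f"dataset {i} is missing keys: {sorted(missing)}")
        if len(entry["texts"]) != len(entry["labels"]):
            raise ValueError(
                f"dataset {i} has {len(entry['texts'])} texts "
                f"but {len(entry['labels'])} labels"
            )
    return len(datasets)


checked = check_datasets(
    [
        {"texts": ["sample 1", "sample 2"], "labels": [0, 1]},
        {"texts": ["sample 3", "sample 4"], "labels": [1, 0]},
    ]
)
```

Failing early on a mismatched entry avoids finding the problem only after the embedding step has already run.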
@@ -0,0 +1,9 @@
1
+ """The dataset_complexity_profiler package.
2
+
3
+ Provides the main `DatasetProfiler` class for analyzing text datasets
4
+ and recommending the optimal embedding dimension.
5
+ """
6
+
7
+ from .dataset_adapter import DatasetProfiler
8
+
9
+ __all__ = ["DatasetProfiler"]
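
The package ships `meta_model.pkl` next to this `__init__.py`, but the diff does not show how `DatasetProfiler` locates it. One common way to find a data file inside an installed package is `importlib.resources.files`, available from Python 3.9, which matches `Requires-Python`. The sketch below is an assumption about a possible approach, and `packaged_file` is a hypothetical helper name.

```python
from importlib.resources import files


def packaged_file(package, filename):
    """Return a path-like handle to a data file shipped inside a package."""
    return files(package).joinpath(filename)


# Demonstrated on a standard-library package so the snippet runs anywhere.
# For this project the call would presumably be
# packaged_file("dataset_complexity_profiler", "meta_model.pkl").
resource = packaged_file("json", "decoder.py")
print(resource.name, resource.is_file())
```

Resolving the file through the package name rather than a relative path keeps the lookup independent of the current working directory.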