dataset-complexity-profiler 0.1.0__tar.gz
- dataset_complexity_profiler-0.1.0/LICENSE +21 -0
- dataset_complexity_profiler-0.1.0/MANIFEST.in +1 -0
- dataset_complexity_profiler-0.1.0/PKG-INFO +187 -0
- dataset_complexity_profiler-0.1.0/README.md +169 -0
- dataset_complexity_profiler-0.1.0/dataset_complexity_profiler/__init__.py +9 -0
- dataset_complexity_profiler-0.1.0/dataset_complexity_profiler/dataset_adapter.py +662 -0
- dataset_complexity_profiler-0.1.0/dataset_complexity_profiler/meta_model.pkl +0 -0
- dataset_complexity_profiler-0.1.0/dataset_complexity_profiler.egg-info/PKG-INFO +187 -0
- dataset_complexity_profiler-0.1.0/dataset_complexity_profiler.egg-info/SOURCES.txt +15 -0
- dataset_complexity_profiler-0.1.0/dataset_complexity_profiler.egg-info/dependency_links.txt +1 -0
- dataset_complexity_profiler-0.1.0/dataset_complexity_profiler.egg-info/requires.txt +7 -0
- dataset_complexity_profiler-0.1.0/dataset_complexity_profiler.egg-info/top_level.txt +1 -0
- dataset_complexity_profiler-0.1.0/pyproject.toml +41 -0
- dataset_complexity_profiler-0.1.0/setup.cfg +4 -0
- dataset_complexity_profiler-0.1.0/tests/test_default_meta_model.py +5 -0
- dataset_complexity_profiler-0.1.0/tests/test_fit_transform.py +17 -0
- dataset_complexity_profiler-0.1.0/tests/test_train_custom_meta_model.py +33 -0
--- /dev/null
+++ dataset_complexity_profiler-0.1.0/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
--- /dev/null
+++ dataset_complexity_profiler-0.1.0/MANIFEST.in
@@ -0,0 +1 @@
+include dataset_complexity_profiler/meta_model.pkl
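Reviewer note on the `MANIFEST.in` hunk: the single `include` line is what requests that `meta_model.pkl` be carried into the source distribution. One way to confirm this on a built archive is to list its members. The sketch below does that with the standard-library `tarfile` module. The helper names (`sdist_members_ending_with`, `build_demo_sdist`) and the in-memory demo archive are assumptions made for illustration; they are not part of the package, and the demo archive only mimics two paths from the file list above.

```python
import io
import tarfile


def sdist_members_ending_with(tar_bytes: bytes, suffix: str) -> list:
    """Return the archive member names that end with `suffix`."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as archive:
        return [name for name in archive.getnames() if name.endswith(suffix)]


def build_demo_sdist(paths: list) -> bytes:
    """Build a tiny in-memory .tar.gz of empty files (a stand-in for a real sdist)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path in paths:
            info = tarfile.TarInfo(name=path)
            info.size = 0  # empty placeholder file
            archive.addfile(info, io.BytesIO(b""))
    return buffer.getvalue()


# Hypothetical demo archive; a real check would read the published .tar.gz instead.
demo = build_demo_sdist([
    "dataset_complexity_profiler-0.1.0/MANIFEST.in",
    "dataset_complexity_profiler-0.1.0/dataset_complexity_profiler/meta_model.pkl",
])
found = sdist_members_ending_with(demo, "dataset_complexity_profiler/meta_model.pkl")
print(found)
```

Against a real archive, an empty result would suggest the data file was dropped at build time.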
--- /dev/null
+++ dataset_complexity_profiler-0.1.0/PKG-INFO
@@ -0,0 +1,187 @@
+Metadata-Version: 2.4
+Name: dataset-complexity-profiler
+Version: 0.1.0
+Summary: Text dataset complexity profiler with a packaged meta-model for automatic embedding dimension recommendation
+License: MIT
+Keywords: nlp,meta-learning,dataset-profiling,embeddings
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: sentence-transformers>=2.2.2
+Requires-Dist: datasets>=2.14.5
+Requires-Dist: pymfe>=0.3.1
+Requires-Dist: scikit-learn>=1.2.2
+Requires-Dist: numpy>=1.24.0
+Requires-Dist: pandas>=1.5.3
+Requires-Dist: joblib>=1.2.0
+Dynamic: license-file
+
+# Dataset Complexity Profiling Tool
+
+A text dataset complexity profiler with a packaged default meta-model for automatic embedding dimension recommendations.
+
+This repository is organized as a package plus scripts, so you can use it as a library, run command-line flows with `uv`, and retrain the meta-model when needed.
+
+## Project structure
+
+- `dataset_complexity_profiler/`
+  - package code
+  - contains `DatasetProfiler` and packaged `meta_model.pkl`
+- `scripts/`
+  - helper scripts for demo, prediction, benchmark collection, training, and sanity checks
+- `tests/`
+  - automated pytest unit tests
+- `pyproject.toml`
+  - packaging and dependency settings for `uv`
+- `MANIFEST.in`
+  - includes `dataset_complexity_profiler/meta_model.pkl` in the package
+- `requirements.txt`
+  - runtime dependencies
+- `requirements-dev.txt`
+  - test/development dependencies
+- `uv.lock`
+  - `uv` dependency lock file
+
+## What each file does
+
+- `dataset_complexity_profiler/__init__.py`
+  - exposes `DatasetProfiler` for `from dataset_complexity_profiler import DatasetProfiler`
+- `dataset_complexity_profiler/dataset_adapter.py`
+  - the main implementation
+  - extracts meta-features, analyzes datasets, predicts embedding dimension, and trains custom meta-models
+- `scripts/main.py`
+  - demo runner for benchmark datasets
+  - good for a quick functional check
+- `scripts/predict.py`
+  - example script to predict optimal embedding dimension for one dataset and save JSON output
+- `scripts/collect_benchmarks.py`
+  - builds `benchmarks.csv` from a set of benchmark text datasets
+  - used when preparing training data for a new meta-model
+- `scripts/train_meta_model.py`
+  - trains the regression meta-model on `benchmarks.csv`
+  - saves a new `meta_model.pkl`
+- `scripts/test.py`
+  - manual demo script for running a quick prediction flow
+  - a small convenience script, separate from automated tests
+- `tests/`
+  - unit tests for package logic using `pytest`
+
+## What `scripts/main.py` does
+
+- Runs the package pipeline on a set of preconfigured benchmark datasets.
+- Uses the default packaged `meta_model.pkl`.
+- Downloads datasets, computes embeddings, predicts optimal embedding dimension, and prints recommendations.
+- Useful for a full end-to-end verification of the workflow.
+
+## How to demonstrate the model
+
+Use one of these scripts to show the package working:
+
+- `uv run python scripts/main.py`
+  - runs a complete benchmark demo on several datasets
+- `uv run python scripts/predict.py`
+  - predicts optimal embedding dimension for a single dataset and writes `prediction_result.json`
+- `uv run python scripts/test.py`
+  - quick manual check script for a simple prediction flow
+
+## Why `benchmarks.csv` exists
+
+- `benchmarks.csv` is a training dataset for the meta-model.
+- It is generated by `scripts/collect_benchmarks.py` from several benchmark text datasets.
+- It is not required for normal use of the package because the repository already ships with a default `meta_model.pkl`.
+
+In other words:
+
+- `benchmarks.csv` is used for meta-model training and experimentation.
+- `DatasetProfiler` can work immediately without it.
+
+## How the default model works
+
+The package includes `dataset_complexity_profiler/meta_model.pkl`.
+If you create `DatasetProfiler()` normally, it loads that model automatically.
+That means users can get recommendations without training anything.
+
+## How to train your own meta-model
+
+If you want a model tuned to your own data distribution, you do not need to pass a CSV file directly to `DatasetProfiler`.
+Instead, use Python data structures with raw text and labels.
+
+Example:
+
+```python
+from dataset_complexity_profiler import DatasetProfiler
+
+profiler = DatasetProfiler(auto_load_meta_model=False)
+
+datasets = [
+    {"texts": ["sample 1", "sample 2"], "labels": [0, 1]},
+    {"texts": ["sample 3", "sample 4"], "labels": [1, 0]},
+]
+
+profiler.train_custom_meta_model(datasets)
+profiler.save_meta_model("custom_meta_model.pkl")
+```
+
+Then later:
+
+```python
+profiler = DatasetProfiler(auto_load_meta_model=False)
+profiler.load_meta_model("custom_meta_model.pkl")
+```
+
+If you want to generate a training dataset from benchmark datasets, use:
+
+```bash
+uv run python scripts/collect_benchmarks.py
+uv run python scripts/train_meta_model.py
+```
+
+Then the resulting `meta_model.pkl` will be placed in `dataset_complexity_profiler/`.
+
+## How to use the package with `uv`
+
+Install dependencies:
+
+```bash
+uv sync
+```
+
+Run the demo script:
+
+```bash
+uv run python scripts/main.py
+```
+
+Predict on a new dataset:
+
+```bash
+uv run python scripts/predict.py
+```
+
+Collect benchmark training data:
+
+```bash
+uv run python scripts/collect_benchmarks.py
+```
+
+Train a new meta-model:
+
+```bash
+uv run python scripts/train_meta_model.py
+```
+
+## Core API
+
+- `analyze_text_dataset(texts, labels, ...)` — full dataset analysis report
+- `analyze_and_adapt(X, y, ...)` — analyze precomputed embeddings
+- `fit_transform(texts, labels, ...)` — text → embeddings → compressed vectors
+- `predict_embedding_dim(X, y, ...)` — recommended embedding dimension
+- `train_custom_meta_model(datasets, ...)` — train a custom meta-model on user datasets
+- `save_meta_model(path)` and `load_meta_model(path)`
+
+## Notes
+
+- `scripts/test.py` is optional and only for manual sanity checking.
+- `tests/` is the real automated test suite.
+- `benchmarks.csv` is only needed when you want to retrain or expand the meta-model.
+- For user-specific training, pass text/label pairs in Python, not a single CSV file.
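Reviewer note on the `PKG-INFO` hunk: the header block at the top of the file (`Name`, `Version`, `Requires-Dist`, and so on) uses an email-style `Key: value` layout, with the README as the message body. A minimal sketch of reading it with the standard-library `email.parser` follows. The `PKG_INFO_HEAD` string is an abridged copy of the hunk made for this example, not the full file, and the variable names are illustrative.

```python
from email.parser import Parser

# Abridged copy of the PKG-INFO headers shown in the hunk, plus the first body line.
PKG_INFO_HEAD = """\
Metadata-Version: 2.4
Name: dataset-complexity-profiler
Version: 0.1.0
Requires-Python: >=3.9
Requires-Dist: sentence-transformers>=2.2.2
Requires-Dist: datasets>=2.14.5
Requires-Dist: pymfe>=0.3.1
Requires-Dist: scikit-learn>=1.2.2
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=1.5.3
Requires-Dist: joblib>=1.2.0

# Dataset Complexity Profiling Tool
"""

message = Parser().parsestr(PKG_INFO_HEAD)

name = message["Name"]                            # single-valued header
version = message["Version"]
requirements = message.get_all("Requires-Dist")   # repeated header, returned as a list
description = message.get_payload()               # everything after the blank line

print(name, version)
for requirement in requirements:
    print("  requires:", requirement)
```

Repeated headers such as `Requires-Dist` need `get_all`; indexing with `message["Requires-Dist"]` would return only one of them.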
--- /dev/null
+++ dataset_complexity_profiler-0.1.0/README.md
@@ -0,0 +1,169 @@
+# Dataset Complexity Profiling Tool
+
+A text dataset complexity profiler with a packaged default meta-model for automatic embedding dimension recommendations.
+
+This repository is organized as a package plus scripts, so you can use it as a library, run command-line flows with `uv`, and retrain the meta-model when needed.
+
+## Project structure
+
+- `dataset_complexity_profiler/`
+  - package code
+  - contains `DatasetProfiler` and packaged `meta_model.pkl`
+- `scripts/`
+  - helper scripts for demo, prediction, benchmark collection, training, and sanity checks
+- `tests/`
+  - automated pytest unit tests
+- `pyproject.toml`
+  - packaging and dependency settings for `uv`
+- `MANIFEST.in`
+  - includes `dataset_complexity_profiler/meta_model.pkl` in the package
+- `requirements.txt`
+  - runtime dependencies
+- `requirements-dev.txt`
+  - test/development dependencies
+- `uv.lock`
+  - `uv` dependency lock file
+
+## What each file does
+
+- `dataset_complexity_profiler/__init__.py`
+  - exposes `DatasetProfiler` for `from dataset_complexity_profiler import DatasetProfiler`
+- `dataset_complexity_profiler/dataset_adapter.py`
+  - the main implementation
+  - extracts meta-features, analyzes datasets, predicts embedding dimension, and trains custom meta-models
+- `scripts/main.py`
+  - demo runner for benchmark datasets
+  - good for a quick functional check
+- `scripts/predict.py`
+  - example script to predict optimal embedding dimension for one dataset and save JSON output
+- `scripts/collect_benchmarks.py`
+  - builds `benchmarks.csv` from a set of benchmark text datasets
+  - used when preparing training data for a new meta-model
+- `scripts/train_meta_model.py`
+  - trains the regression meta-model on `benchmarks.csv`
+  - saves a new `meta_model.pkl`
+- `scripts/test.py`
+  - manual demo script for running a quick prediction flow
+  - a small convenience script, separate from automated tests
+- `tests/`
+  - unit tests for package logic using `pytest`
+
+## What `scripts/main.py` does
+
+- Runs the package pipeline on a set of preconfigured benchmark datasets.
+- Uses the default packaged `meta_model.pkl`.
+- Downloads datasets, computes embeddings, predicts optimal embedding dimension, and prints recommendations.
+- Useful for a full end-to-end verification of the workflow.
+
+## How to demonstrate the model
+
+Use one of these scripts to show the package working:
+
+- `uv run python scripts/main.py`
+  - runs a complete benchmark demo on several datasets
+- `uv run python scripts/predict.py`
+  - predicts optimal embedding dimension for a single dataset and writes `prediction_result.json`
+- `uv run python scripts/test.py`
+  - quick manual check script for a simple prediction flow
+
+## Why `benchmarks.csv` exists
+
+- `benchmarks.csv` is a training dataset for the meta-model.
+- It is generated by `scripts/collect_benchmarks.py` from several benchmark text datasets.
+- It is not required for normal use of the package because the repository already ships with a default `meta_model.pkl`.
+
+In other words:
+
+- `benchmarks.csv` is used for meta-model training and experimentation.
+- `DatasetProfiler` can work immediately without it.
+
+## How the default model works
+
+The package includes `dataset_complexity_profiler/meta_model.pkl`.
+If you create `DatasetProfiler()` normally, it loads that model automatically.
+That means users can get recommendations without training anything.
+
+## How to train your own meta-model
+
+If you want a model tuned to your own data distribution, you do not need to pass a CSV file directly to `DatasetProfiler`.
+Instead, use Python data structures with raw text and labels.
+
+Example:
+
+```python
+from dataset_complexity_profiler import DatasetProfiler
+
+profiler = DatasetProfiler(auto_load_meta_model=False)
+
+datasets = [
+    {"texts": ["sample 1", "sample 2"], "labels": [0, 1]},
+    {"texts": ["sample 3", "sample 4"], "labels": [1, 0]},
+]
+
+profiler.train_custom_meta_model(datasets)
+profiler.save_meta_model("custom_meta_model.pkl")
+```
+
+Then later:
+
+```python
+profiler = DatasetProfiler(auto_load_meta_model=False)
+profiler.load_meta_model("custom_meta_model.pkl")
+```
+
+If you want to generate a training dataset from benchmark datasets, use:
+
+```bash
+uv run python scripts/collect_benchmarks.py
+uv run python scripts/train_meta_model.py
+```
+
+Then the resulting `meta_model.pkl` will be placed in `dataset_complexity_profiler/`.
+
+## How to use the package with `uv`
+
+Install dependencies:
+
+```bash
+uv sync
+```
+
+Run the demo script:
+
+```bash
+uv run python scripts/main.py
+```
+
+Predict on a new dataset:
+
+```bash
+uv run python scripts/predict.py
+```
+
+Collect benchmark training data:
+
+```bash
+uv run python scripts/collect_benchmarks.py
+```
+
+Train a new meta-model:
+
+```bash
+uv run python scripts/train_meta_model.py
+```
+
+## Core API
+
+- `analyze_text_dataset(texts, labels, ...)` — full dataset analysis report
+- `analyze_and_adapt(X, y, ...)` — analyze precomputed embeddings
+- `fit_transform(texts, labels, ...)` — text → embeddings → compressed vectors
+- `predict_embedding_dim(X, y, ...)` — recommended embedding dimension
+- `train_custom_meta_model(datasets, ...)` — train a custom meta-model on user datasets
+- `save_meta_model(path)` and `load_meta_model(path)`
+
+## Notes
+
+- `scripts/test.py` is optional and only for manual sanity checking.
+- `tests/` is the real automated test suite.
+- `benchmarks.csv` is only needed when you want to retrain or expand the meta-model.
+- For user-specific training, pass text/label pairs in Python, not a single CSV file.
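Reviewer note on the `README.md` hunk: the training example passes `train_custom_meta_model` a list of dictionaries, each with a `"texts"` list and a `"labels"` list. The sketch below is a small pre-flight check for that shape. `validate_datasets` is a hypothetical helper written for this note, not a function of the package, and the rules it enforces (string texts, equal lengths) are assumptions inferred from the README example rather than documented requirements.

```python
def validate_datasets(datasets):
    """Return a list of problems found in `datasets`; an empty list means the shape looks right.

    Assumed shape (taken from the README example, not from the package source):
    each entry is a dict with "texts" (list of str) and "labels" (list) of equal length.
    """
    problems = []
    for index, entry in enumerate(datasets):
        if not isinstance(entry, dict):
            problems.append(f"dataset {index}: expected a dict")
            continue
        texts = entry.get("texts")
        labels = entry.get("labels")
        if not isinstance(texts, list) or not all(isinstance(t, str) for t in texts):
            problems.append(f"dataset {index}: 'texts' must be a list of strings")
            continue
        if not isinstance(labels, list):
            problems.append(f"dataset {index}: 'labels' must be a list")
            continue
        if len(texts) != len(labels):
            problems.append(
                f"dataset {index}: {len(texts)} texts but {len(labels)} labels"
            )
    return problems


# The two entries from the README example pass the check.
good = [
    {"texts": ["sample 1", "sample 2"], "labels": [0, 1]},
    {"texts": ["sample 3", "sample 4"], "labels": [1, 0]},
]
# A deliberately broken entry: one label is missing.
bad = [{"texts": ["sample 1", "sample 2"], "labels": [0]}]

print(validate_datasets(good))
print(validate_datasets(bad))
```

Running a check like this before training keeps shape mistakes from surfacing later, inside embedding or meta-feature extraction.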