dataprof-0.4.78-cp314-cp314-win_amd64.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

dataprof/__init__.py ADDED
@@ -0,0 +1,5 @@
+ from .dataprof import *
+
+ __doc__ = dataprof.__doc__
+ if hasattr(dataprof, "__all__"):
+     __all__ = dataprof.__all__
dataprof/dataprof.cp314-win_amd64.pyd ADDED

Binary file

dataprof-0.4.78.dist-info/METADATA ADDED

@@ -0,0 +1,301 @@
+ Metadata-Version: 2.4
+ Name: dataprof
+ Version: 0.4.78
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: POSIX
+ Classifier: Operating System :: Microsoft :: Windows
+ Classifier: Operating System :: MacOS :: MacOS X
+ Classifier: Programming Language :: Rust
+ Classifier: Programming Language :: Python :: Implementation :: CPython
+ Classifier: Programming Language :: Python :: Implementation :: PyPy
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Requires-Dist: pandas>=1.3.0 ; extra == 'pandas'
+ Requires-Dist: pandas>=1.3.0 ; extra == 'jupyter'
+ Requires-Dist: ipython>=7.0.0 ; extra == 'jupyter'
+ Requires-Dist: pandas>=1.3.0 ; extra == 'all'
+ Requires-Dist: ipython>=7.0.0 ; extra == 'all'
+ Requires-Dist: numpy>=1.20.0 ; extra == 'all'
+ Provides-Extra: pandas
+ Provides-Extra: jupyter
+ Provides-Extra: all
+ License-File: LICENSE
+ Summary: Fast, lightweight data profiling and quality assessment library
+ Keywords: data,profiling,quality,csv,json,analysis,performance
+ Author-email: Andrea Bozzo <andrea@example.com>
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+ Project-URL: Homepage, https://github.com/AndreaBozzo/dataprof
+ Project-URL: Repository, https://github.com/AndreaBozzo/dataprof
+ Project-URL: Issues, https://github.com/AndreaBozzo/dataprof/issues
+
+ # dataprof
+
+ [![CI](https://github.com/AndreaBozzo/dataprof/workflows/CI/badge.svg)](https://github.com/AndreaBozzo/dataprof/actions)
+ [![License](https://img.shields.io/github/license/AndreaBozzo/dataprof)](LICENSE)
+ [![Rust](https://img.shields.io/badge/rust-1.70%2B-orange.svg)](https://www.rust-lang.org)
+ [![Crates.io](https://img.shields.io/crates/v/dataprof.svg)](https://crates.io/crates/dataprof)
+
+
+ A fast, reliable data quality assessment tool built in Rust. Analyze datasets with 20x better memory efficiency than pandas, unlimited file streaming, and comprehensive ISO 8000/25012 compliant quality checks across 5 dimensions: Completeness, Consistency, Uniqueness, Accuracy, and Timeliness. Full Python bindings and production database connectivity included.
+
+ Perfect for data scientists, engineers, analysts, and anyone working with data who needs quick, reliable quality insights.
+
+ ## Privacy & Transparency
+
+ DataProf processes **all data locally** on your machine. Zero telemetry, zero external data transmission.
+
+ **[Read exactly what DataProf analyzes →](docs/WHAT_DATAPROF_DOES.md)**
+
+ - 100% local processing - your data never leaves your machine
+ - No telemetry or tracking
+ - Open source & fully auditable
+ - Read-only database access (when using DB features)
+
+ **Complete transparency:** Every metric, calculation, and data point is documented with source code references for independent verification.
+
+ ## CI/CD Integration
+
+ Automate data quality checks in your workflows with our GitHub Action:
+
+ ```yaml
+ - name: DataProf Quality Check
+   uses: AndreaBozzo/dataprof-actions@v1
+   with:
+     file: 'data/dataset.csv'
+     quality-threshold: 80
+     fail-on-issues: true
+     # Batch mode (NEW)
+     recursive: true
+     output-html: 'quality-report.html'
+ ```
+
+ **[Get the Action →](https://github.com/AndreaBozzo/dataprof-action)**
+
+ - **Zero setup** - works out of the box
+ - **ISO 8000/25012 compliant** - industry-standard quality metrics
+ - **Batch processing** - analyze entire directories recursively
+ - **Flexible** - customizable thresholds and output formats
+ - **Fast** - typically completes in under 2 minutes
+
+ Perfect for ensuring data quality in pipelines, validating data integrity, or generating automated quality reports.
+
+ ## Quick Start
+
+ ### CLI (Recommended - Full Features)
+
+ > **Installation**: Download pre-built binaries from [Releases](https://github.com/AndreaBozzo/dataprof/releases) or build from source with `cargo install dataprof`.
+
+ > **Note**: After building with `cargo build --release`, the binary is located at `target/release/dataprof-cli.exe` (Windows) or `target/release/dataprof` (Linux/Mac). Run it from the project root as `target/release/dataprof-cli.exe <command>` or add it to your PATH.
+
+ #### Basic Analysis
+ ```bash
+ # Comprehensive quality analysis
+ dataprof analyze data.csv --detailed
+
+ # Analyze Parquet files (requires --features parquet)
+ dataprof analyze data.parquet --detailed
+
+ # Windows example (from project root after cargo build --release)
+ target\release\dataprof-cli.exe analyze data.csv --detailed
+ ```
+
+ #### HTML Reports
+ ```bash
+ # Generate HTML report with visualizations
+ dataprof report data.csv -o quality_report.html
+
+ # Custom template
+ dataprof report data.csv --template custom.hbs --detailed
+ ```
+
+ #### Batch Processing
+ ```bash
+ # Process entire directory with parallel execution
+ dataprof batch /data/folder --recursive --parallel
+
+ # Generate HTML batch dashboard
+ dataprof batch /data/folder --recursive --html batch_report.html
+
+ # JSON export for CI/CD automation
+ dataprof batch /data/folder --json batch_results.json --recursive
+
+ # JSON output to stdout
+ dataprof batch /data/folder --format json --recursive
+
+ # With custom filter and progress
+ dataprof batch /data/folder --filter "*.csv" --parallel --progress
+ ```
+
+ ![DataProf Batch Report](assets/animations/HTMLbatch.gif)
+
+ #### Database Analysis
+ ```bash
+ # PostgreSQL table profiling
+ dataprof database postgres://user:pass@host/db --table users
+
+ # Custom SQL query
+ dataprof database sqlite://data.db --query "SELECT * FROM users WHERE active=1"
+ ```
+
+ #### Benchmarking
+ ```bash
+ # Benchmark different engines on your data
+ dataprof benchmark data.csv
+
+ # Show engine information
+ dataprof benchmark --info
+ ```
+
+ #### Advanced Options
+ ```bash
+ # Streaming for large files
+ dataprof analyze large_dataset.csv --streaming --sample 10000
+
+ # JSON output for programmatic use
+ dataprof analyze data.csv --format json --output results.json
+
+ # Custom ISO threshold profile
+ dataprof analyze data.csv --threshold-profile strict
+ ```
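+
+ The JSON report can then be consumed from a script. Below is a minimal sketch, assuming `dataprof` is on your PATH; the report's exact key layout is not documented here, so inspect the keys before relying on them:
+
+ ```python
+ import json
+ import subprocess
+
+ # Run the CLI and write the JSON report to a file (flags as shown above).
+ subprocess.run(
+     ["dataprof", "analyze", "data.csv", "--format", "json", "--output", "results.json"],
+     check=True,
+ )
+
+ # Load the report and inspect its structure; keys may vary by CLI version.
+ with open("results.json", encoding="utf-8") as f:
+     report = json.load(f)
+ print(sorted(report))
+ ```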
+
+ **Quick Reference**: All commands follow the pattern `dataprof <command> [args]`. Use `dataprof help` or `dataprof <command> --help` for detailed options.
+
+ ### Python Bindings
+
+ ```bash
+ pip install dataprof
+ ```
+
+ ```python
+ import dataprof
+
+ # Comprehensive quality analysis (ISO 8000/25012 compliant)
+ report = dataprof.analyze_csv_with_quality("data.csv")
+ print(f"Quality score: {report.quality_score():.1f}%")
+
+ # Access individual quality dimensions
+ metrics = report.data_quality_metrics
+ print(f"Completeness: {metrics.complete_records_ratio:.1f}%")
+ print(f"Consistency: {metrics.data_type_consistency:.1f}%")
+ print(f"Uniqueness: {metrics.key_uniqueness:.1f}%")
+
+ # Batch processing
+ result = dataprof.batch_analyze_directory("/data", recursive=True)
+ print(f"Processed {result.processed_files} files at {result.files_per_second:.1f} files/sec")
+ ```
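+
+ For CI-style gating, the score can be checked directly. A minimal sketch using only the calls shown above; the 80% threshold is an arbitrary example:
+
+ ```python
+ import sys
+
+ import dataprof
+
+ # Fail the pipeline when the overall quality score drops below the threshold.
+ report = dataprof.analyze_csv_with_quality("data.csv")
+ score = report.quality_score()
+ if score < 80.0:
+     print(f"Quality gate failed: {score:.1f}% < 80.0%")
+     sys.exit(1)
+ print(f"Quality gate passed: {score:.1f}%")
+ ```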
+
+ > **Note**: Database profiling is available via CLI only. Python users can export SQL results to CSV and use `analyze_csv_with_quality()`.
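+
+ A minimal sketch of that workaround using the standard library's `sqlite3` and `csv` modules (the database path, query, and table are hypothetical examples):
+
+ ```python
+ import csv
+ import sqlite3
+
+ import dataprof
+
+ # Export a query result to CSV (hypothetical database and query)...
+ conn = sqlite3.connect("data.db")
+ cursor = conn.execute("SELECT * FROM users WHERE active = 1")
+ with open("users.csv", "w", newline="", encoding="utf-8") as f:
+     writer = csv.writer(f)
+     writer.writerow([column[0] for column in cursor.description])
+     writer.writerows(cursor)
+ conn.close()
+
+ # ...then profile the exported CSV as usual.
+ report = dataprof.analyze_csv_with_quality("users.csv")
+ print(f"Quality score: {report.quality_score():.1f}%")
+ ```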
+
+ **[Full Python API Documentation →](docs/python/README.md)**
+
+ ### Rust Library
+
+ ```bash
+ cargo add dataprof
+ ```
+
+ ```rust
+ use dataprof::*;
+
+ // High-performance Arrow processing for large files (>100MB)
+ // Requires compilation with: cargo build --features arrow
+ #[cfg(feature = "arrow")]
+ let profiler = DataProfiler::columnar();
+ #[cfg(feature = "arrow")]
+ let report = profiler.analyze_csv_file("large_dataset.csv")?;
+
+ // Standard adaptive profiling (recommended for most use cases)
+ let profiler = DataProfiler::auto();
+ let report = profiler.analyze_file("dataset.csv")?;
+ ```
+
+ ## Development
+
+ Want to contribute or build from source? Here's what you need:
+
+ ### Prerequisites
+ - Rust (latest stable via [rustup](https://rustup.rs/))
+ - Docker (for database testing)
+
+ ### Quick Setup
+ ```bash
+ git clone https://github.com/AndreaBozzo/dataprof.git
+ cd dataprof
+ cargo build --release # Build the project
+ docker-compose -f .devcontainer/docker-compose.yml up -d # Start test databases
+ ```
+
+ ### Feature Flags
+
+ dataprof uses optional features to keep compile times fast and binaries lean:
+
+ ```bash
+ # Minimal build (CSV/JSON only, ~60s compile)
+ cargo build --release
+
+ # With Apache Arrow (columnar processing, ~90s compile)
+ cargo build --release --features arrow
+
+ # With Parquet support (requires arrow, ~95s compile)
+ cargo build --release --features parquet
+
+ # With database connectors
+ cargo build --release --features postgres,mysql,sqlite
+
+ # All features (full functionality, ~130s compile)
+ cargo build --release --all-features
+ ```
+
+ **When to use Arrow?**
+ - ✅ Files > 100MB with many columns (>20)
+ - ✅ Columnar data with uniform types
+ - ✅ Need maximum throughput (up to 13x faster)
+ - ❌ Small files (<10MB) - standard engine is faster
+ - ❌ Mixed/messy data - the streaming engine handles it better
265
+
266
+ **When to use Parquet?**
267
+ - ✅ Analytics workloads with columnar data
268
+ - ✅ Data lake architectures
269
+ - ✅ Integration with Spark, Pandas, PyArrow
270
+ - ✅ Efficient storage and compression
271
+ - ✅ Type-safe schema preservation
272
+
273
+ ### Common Development Tasks
274
+ ```bash
275
+ cargo test # Run all tests
276
+ cargo bench # Performance benchmarks
277
+ cargo fmt # Format code
278
+ cargo clippy # Code quality checks
279
+ ```
280
+
281
+ ## Documentation
282
+
283
+ ### Privacy & Transparency
284
+ - [What DataProf Does](docs/WHAT_DATAPROF_DOES.md) - **Complete transparency guide with source code verification**
285
+
286
+ ### User Guides
287
+ - [Python API Reference](docs/python/API_REFERENCE.md) - Full Python API documentation
288
+ - [Python Integrations](docs/python/INTEGRATIONS.md) - Pandas, scikit-learn, Jupyter, Airflow, dbt
289
+ - [Database Connectors](docs/guides/database-connectors.md) - Production database connectivity
290
+ - [Apache Arrow Integration](docs/guides/apache-arrow-integration.md) - Columnar processing guide
291
+ - [CLI Usage Guide](docs/guides/CLI_USAGE_GUIDE.md) - Complete CLI reference
292
+
293
+ ### Developer Guides
294
+ - [Development Guide](docs/DEVELOPMENT.md) - Complete setup and contribution guide
295
+ - [Performance Guide](docs/guides/performance-guide.md) - Optimization and benchmarking
296
+ - [Performance Benchmarks](docs/project/benchmarking.md) - Benchmark results and methodology
297
+
298
+ ## License
299
+
300
+ Licensed under the MIT License. See [LICENSE](LICENSE) for details.
301
+
dataprof-0.4.78.dist-info/RECORD ADDED

@@ -0,0 +1,6 @@
+ dataprof-0.4.78.dist-info/METADATA,sha256=qg7dZi0LSbykpppm3Q-agQ9M3X4k1LklvAizxt81Z5c,10840
+ dataprof-0.4.78.dist-info/WHEEL,sha256=tZ3VAZ5HuUzziFCJ2lDsDJnJO-xy4omAQIa7TJCFCZk,96
+ dataprof-0.4.78.dist-info/licenses/LICENSE,sha256=pD_29Inf0TmerzrHuH-Lcu2GeD39lNK0_8bDJVkHjos,1090
+ dataprof/__init__.py,sha256=84U5MpyP59z3koB4vbdsJg1XQSKYeTS1SC7b3VqwjfU,115
+ dataprof/dataprof.cp314-win_amd64.pyd,sha256=gFiDQReKYOvVyUYi4lUUkxBExbvghdjlHGLRXbBVTAg,2170368
+ dataprof-0.4.78.dist-info/RECORD,,
dataprof-0.4.78.dist-info/WHEEL ADDED

@@ -0,0 +1,4 @@
+ Wheel-Version: 1.0
+ Generator: maturin (1.9.6)
+ Root-Is-Purelib: false
+ Tag: cp314-cp314-win_amd64
dataprof-0.4.78.dist-info/licenses/LICENSE ADDED

@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 Andrea Bozzo
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.