dataprof 0.4.80__cp314-cp314-win_amd64.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of dataprof might be problematic.

dataprof/__init__.py ADDED
@@ -0,0 +1,5 @@
+ from .dataprof import *
+
+ __doc__ = dataprof.__doc__
+ if hasattr(dataprof, "__all__"):
+     __all__ = dataprof.__all__
dataprof/dataprof.cp314-win_amd64.pyd ADDED
Binary file
dataprof-0.4.80.dist-info/METADATA ADDED
@@ -0,0 +1,403 @@
+ Metadata-Version: 2.4
+ Name: dataprof
+ Version: 0.4.80
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: POSIX
+ Classifier: Operating System :: Microsoft :: Windows
+ Classifier: Operating System :: MacOS :: MacOS X
+ Classifier: Programming Language :: Rust
+ Classifier: Programming Language :: Python :: Implementation :: CPython
+ Classifier: Programming Language :: Python :: Implementation :: PyPy
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Requires-Dist: pandas>=1.3.0 ; extra == 'pandas'
+ Requires-Dist: pandas>=1.3.0 ; extra == 'jupyter'
+ Requires-Dist: ipython>=7.0.0 ; extra == 'jupyter'
+ Requires-Dist: pandas>=1.3.0 ; extra == 'all'
+ Requires-Dist: ipython>=7.0.0 ; extra == 'all'
+ Requires-Dist: numpy>=1.20.0 ; extra == 'all'
+ Provides-Extra: pandas
+ Provides-Extra: jupyter
+ Provides-Extra: all
+ License-File: LICENSE
+ Summary: Fast, lightweight data profiling and quality assessment library
+ Keywords: data,profiling,quality,csv,json,analysis,performance
+ Author-email: Andrea Bozzo <andrea@example.com>
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+ Project-URL: Homepage, https://github.com/AndreaBozzo/dataprof
+ Project-URL: Repository, https://github.com/AndreaBozzo/dataprof
+ Project-URL: Issues, https://github.com/AndreaBozzo/dataprof/issues
+
+ # dataprof
+
+ [![CI](https://github.com/AndreaBozzo/dataprof/workflows/CI/badge.svg)](https://github.com/AndreaBozzo/dataprof/actions)
+ [![License](https://img.shields.io/github/license/AndreaBozzo/dataprof)](LICENSE)
+ [![Rust](https://img.shields.io/badge/rust-1.80%2B-orange.svg)](https://www.rust-lang.org)
+ [![Crates.io](https://img.shields.io/crates/v/dataprof.svg)](https://crates.io/crates/dataprof)
+ [![PyPI Downloads](https://static.pepy.tech/personalized-badge/dataprof?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads)](https://pepy.tech/projects/dataprof)
+
+
+ A fast, reliable data quality assessment tool built in Rust. Analyze datasets with 20x better memory efficiency than pandas, unlimited file streaming, and comprehensive ISO 8000/25012 compliant quality checks across 5 dimensions: Completeness, Consistency, Uniqueness, Accuracy, and Timeliness. Full Python bindings and production database connectivity included.
+
+ Perfect for data scientists, engineers, analysts, and anyone working with data who needs quick, reliable quality insights.
+
+ ## Privacy & Transparency
+
+ DataProf processes **all data locally** on your machine. Zero telemetry, zero external data transmission.
+
+ **[Read exactly what DataProf analyzes →](docs/WHAT_DATAPROF_DOES.md)**
+
+ - 100% local processing - your data never leaves your machine
+ - No telemetry or tracking
+ - Open source & fully auditable
+ - Read-only database access (when using DB features)
+
+ **Complete transparency:** Every metric, calculation, and data point is documented with source code references for independent verification.
+
+ ## CI/CD Integration
+
+ Automate data quality checks in your workflows with our GitHub Action:
+
+ ```yaml
+ - name: DataProf Quality Check
+   uses: AndreaBozzo/dataprof-actions@v1
+   with:
+     file: 'data/dataset.csv'
+     quality-threshold: 80
+     fail-on-issues: true
+     # Batch mode (NEW)
+     recursive: true
+     output-html: 'quality-report.html'
+ ```
+
+ **[Get the Action →](https://github.com/AndreaBozzo/dataprof-action)**
+
+ - **Zero setup** - works out of the box
+ - **ISO 8000/25012 compliant** - industry-standard quality metrics
+ - **Batch processing** - analyze entire directories recursively
+ - **Flexible** - customizable thresholds and output formats
+ - **Fast** - typically completes in under 2 minutes
+
+ Perfect for ensuring data quality in pipelines, validating data integrity, or generating automated quality reports.
+
+ ## Quick Start
+
+ ### CLI (Recommended - Full Features)
+
+ > **Installation**: Download pre-built binaries from [Releases](https://github.com/AndreaBozzo/dataprof/releases) or build from source with `cargo install dataprof`.
+
+ > **Note**: After building with `cargo build --release`, the binary is located at `target/release/dataprof-cli.exe` (Windows) or `target/release/dataprof` (Linux/Mac). Run it from the project root as `target/release/dataprof-cli.exe <command>` or add it to your PATH.
+
+ #### Basic Analysis
+ ```bash
+ # Comprehensive quality analysis
+ dataprof analyze data.csv --detailed
+
+ # Analyze Parquet files (requires --features parquet)
+ dataprof analyze data.parquet --detailed
+
+ # Windows example (from project root after cargo build --release)
+ target\release\dataprof-cli.exe analyze data.csv --detailed
+ ```
+
+ #### HTML Reports
+ ```bash
+ # Generate HTML report with visualizations
+ dataprof report data.csv -o quality_report.html
+
+ # Custom template
+ dataprof report data.csv --template custom.hbs --detailed
+ ```
+
+ #### Batch Processing
+ ```bash
+ # Process entire directory with parallel execution
+ dataprof batch /data/folder --recursive --parallel
+
+ # Generate HTML batch dashboard
+ dataprof batch /data/folder --recursive --html batch_report.html
+
+ # JSON export for CI/CD automation
+ dataprof batch /data/folder --json batch_results.json --recursive
+
+ # JSON output to stdout
+ dataprof batch /data/folder --format json --recursive
+
+ # With custom filter and progress
+ dataprof batch /data/folder --filter "*.csv" --parallel --progress
+ ```
+
+ ![DataProf Batch Report](assets/animations/HTMLbatch.gif)
+
+ #### Database Analysis
+ ```bash
+ # PostgreSQL table profiling
+ dataprof database postgres://user:pass@host/db --table users
+
+ # Custom SQL query
+ dataprof database sqlite://data.db --query "SELECT * FROM users WHERE active=1"
+ ```
+
+ #### Benchmarking
+ ```bash
+ # Benchmark different engines on your data
+ dataprof benchmark data.csv
+
+ # Show engine information
+ dataprof benchmark --info
+ ```
+
+ #### Advanced Options
+ ```bash
+ # Streaming for large files
+ dataprof analyze large_dataset.csv --streaming --sample 10000
+
+ # JSON output for programmatic use
+ dataprof analyze data.csv --format json --output results.json
+
+ # Custom ISO threshold profile
+ dataprof analyze data.csv --threshold-profile strict
+ ```
+
+ **Quick Reference**: All commands follow the pattern `dataprof <command> [args]`. Use `dataprof help` or `dataprof <command> --help` for detailed options.
+
+ ### Python Bindings
+
+ ```bash
+ pip install dataprof
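+ pip install "dataprof[all]"  # optional extras declared in the package metadata: pandas, jupyter, all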
+ ```
+
+ ```python
+ import dataprof
+
+ # Comprehensive quality analysis (ISO 8000/25012 compliant)
+ report = dataprof.analyze_csv_with_quality("data.csv")
+ print(f"Quality score: {report.quality_score():.1f}%")
+
+ # Access individual quality dimensions
+ metrics = report.data_quality_metrics
+ print(f"Completeness: {metrics.complete_records_ratio:.1f}%")
+ print(f"Consistency: {metrics.data_type_consistency:.1f}%")
+ print(f"Uniqueness: {metrics.key_uniqueness:.1f}%")
+
+ # Batch processing
+ result = dataprof.batch_analyze_directory("/data", recursive=True)
+ print(f"Processed {result.processed_files} files at {result.files_per_second:.1f} files/sec")
+
+ # Async database profiling (requires python-async feature)
+ import asyncio
+
+ async def profile_db():
+     result = await dataprof.profile_database_async(
+         "postgresql://user:pass@localhost/db",
+         "SELECT * FROM users LIMIT 1000",
+         batch_size=1000,
+         calculate_quality=True
+     )
+     print(f"Quality score: {result['quality'].overall_score:.1%}")
+
+ asyncio.run(profile_db())
+ ```
+
+ > **Note**: Async database profiling requires building with `--features python-async,database,postgres` (or mysql/sqlite). See [Async Support](#async-support) below.
+
+ **[Full Python API Documentation →](docs/python/README.md)**
+
+ ### Rust Library
+
+ ```bash
+ cargo add dataprof
+ ```
+
+ ```rust
+ use dataprof::*;
+
+ // High-performance Arrow processing for large files (>100MB)
+ // Requires compilation with: cargo build --features arrow
+ #[cfg(feature = "arrow")]
+ let profiler = DataProfiler::columnar();
+ #[cfg(feature = "arrow")]
+ let report = profiler.analyze_csv_file("large_dataset.csv")?;
+
+ // Standard adaptive profiling (recommended for most use cases)
+ let profiler = DataProfiler::auto();
+ let report = profiler.analyze_file("dataset.csv")?;
+ ```
+
+ ## Development
+
+ Want to contribute or build from source? Here's what you need:
+
+ ### Prerequisites
+ - Rust (latest stable via [rustup](https://rustup.rs/))
+ - Docker (for database testing)
+
+ ### Quick Setup
+ ```bash
+ git clone https://github.com/AndreaBozzo/dataprof.git
+ cd dataprof
+ cargo build --release # Build the project
+ docker-compose -f .devcontainer/docker-compose.yml up -d # Start test databases
+ ```
+
+ ### Feature Flags
+
+ dataprof uses optional features to keep compile times fast and binaries lean:
+
+ ```bash
+ # Minimal build (CSV/JSON only, ~60s compile)
+ cargo build --release
+
+ # With Apache Arrow (columnar processing, ~90s compile)
+ cargo build --release --features arrow
+
+ # With Parquet support (requires arrow, ~95s compile)
+ cargo build --release --features parquet
+
+ # With database connectors
+ cargo build --release --features postgres,mysql,sqlite
+
+ # With Python async support (for async database profiling)
+ maturin develop --features python-async,database,postgres
+
+ # All features (full functionality, ~130s compile)
+ cargo build --release --all-features
+ ```
+
+ **When to use Arrow?**
+ - ✅ Files > 100MB with many columns (>20)
+ - ✅ Columnar data with uniform types
+ - ✅ Need maximum throughput (up to 13x faster)
+ - ❌ Small files (<10MB) - standard engine is faster
+ - ❌ Mixed/messy data - streaming engine handles better
+
+ **When to use Parquet?**
+ - ✅ Analytics workloads with columnar data
+ - ✅ Data lake architectures
+ - ✅ Integration with Spark, Pandas, PyArrow
+ - ✅ Efficient storage and compression
+ - ✅ Type-safe schema preservation
+
+ ### Async Support
+
+ DataProf supports asynchronous operations for non-blocking database profiling, both in Rust and Python.
+
+ #### Rust Async (Database Features)
+
+ Database connectors are fully async and use the `tokio` runtime:
+
+ ```rust
+ use dataprof::database::{DatabaseConfig, profile_database};
+
+ #[tokio::main]
+ async fn main() -> Result<()> {
+     let config = DatabaseConfig {
+         connection_string: "postgresql://localhost/mydb".to_string(),
+         batch_size: 10000,
+         ..Default::default()
+     };
+
+     let report = profile_database(config, "SELECT * FROM users").await?;
+     println!("Profiled {} rows", report.total_rows);
+     Ok(())
+ }
+ ```
+
+ **Available async features:**
+ - ✅ Non-blocking database queries
+ - ✅ Concurrent query execution
+ - ✅ Streaming for large result sets
+ - ✅ Connection pooling with SQLx
+ - ✅ Retry logic with exponential backoff
+
+ #### Python Async (python-async Feature)
+
+ Enable async Python bindings for database profiling:
+
+ ```bash
+ # Build with async support
+ maturin develop --features python-async,database,postgres
+ ```
+
+ ```python
+ import asyncio
+ import dataprof
+
+ async def main():
+     # Test connection
+     connected = await dataprof.test_connection_async(
+         "postgresql://user:pass@localhost/db"
+     )
+
+     # Get table schema
+     columns = await dataprof.get_table_schema_async(
+         "postgresql://user:pass@localhost/db",
+         "users"
+     )
+
+     # Count rows
+     count = await dataprof.count_table_rows_async(
+         "postgresql://user:pass@localhost/db",
+         "users"
+     )
+
+     # Profile database query
+     result = await dataprof.profile_database_async(
+         "postgresql://user:pass@localhost/db",
+         "SELECT * FROM users LIMIT 1000",
+         batch_size=1000,
+         calculate_quality=True
+     )
+
+     print(f"Quality score: {result['quality'].overall_score:.1%}")
+
+ asyncio.run(main())
+ ```
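+
+ Because these calls are ordinary awaitables, several profiles can also run concurrently. A minimal sketch, assuming the same placeholder connection string (the second query is illustrative; only `profile_database_async` comes from the documented API):
+
+ ```python
+ import asyncio
+ import dataprof
+
+ DB_URL = "postgresql://user:pass@localhost/db"  # placeholder connection string
+
+ async def profile_many():
+     # Profile two queries concurrently instead of one after the other
+     users, orders = await asyncio.gather(
+         dataprof.profile_database_async(
+             DB_URL, "SELECT * FROM users LIMIT 1000",
+             batch_size=1000, calculate_quality=True,
+         ),
+         dataprof.profile_database_async(
+             DB_URL, "SELECT * FROM orders LIMIT 1000",  # illustrative table
+             batch_size=1000, calculate_quality=True,
+         ),
+     )
+     return users, orders
+
+ asyncio.run(profile_many())
+ ```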
+
+ **Benefits:**
+ - ✅ Non-blocking I/O for better performance
+ - ✅ Concurrent database profiling (see the `asyncio.gather` sketch above)
+ - ✅ Integration with async Python frameworks (FastAPI, aiohttp, etc.; see the sketch below)
+ - ✅ Efficient resource usage
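+
+ To illustrate the framework-integration point, here is a minimal FastAPI sketch. FastAPI itself, the route, and the connection string are assumptions for illustration; only `profile_database_async` is the documented dataprof call:
+
+ ```python
+ # Hypothetical sketch: serving a dataprof quality score from an async endpoint.
+ from fastapi import FastAPI  # assumed dependency, not required by dataprof
+ import dataprof
+
+ app = FastAPI()
+ DB_URL = "postgresql://user:pass@localhost/db"  # placeholder connection string
+
+ @app.get("/quality")
+ async def quality_endpoint():
+     # Await dataprof's async profiler directly inside the async route handler
+     result = await dataprof.profile_database_async(
+         DB_URL,
+         "SELECT * FROM users LIMIT 1000",
+         batch_size=1000,
+         calculate_quality=True,
+     )
+     return {"quality_score": result["quality"].overall_score}
+ ```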
+
+ **See also:** [examples/async_database_example.py](examples/async_database_example.py) for complete examples.
+
+ ### Common Development Tasks
+ ```bash
+ cargo test # Run all tests
+ cargo bench # Performance benchmarks
+ cargo fmt # Format code
+ cargo clippy # Code quality checks
+ ```
+
+ ## Documentation
+
+ ### Privacy & Transparency
+ - [What DataProf Does](docs/WHAT_DATAPROF_DOES.md) - **Complete transparency guide with source code verification**
+
+ ### User Guides
+ - [Python API Reference](docs/python/API_REFERENCE.md) - Full Python API documentation
+ - [Python Integrations](docs/python/INTEGRATIONS.md) - Pandas, scikit-learn, Jupyter, Airflow, dbt
+ - [Database Connectors](docs/guides/database-connectors.md) - Production database connectivity
+ - [Apache Arrow Integration](docs/guides/apache-arrow-integration.md) - Columnar processing guide
+ - [CLI Usage Guide](docs/guides/CLI_USAGE_GUIDE.md) - Complete CLI reference
+
+ ### Developer Guides
+ - [Development Guide](docs/DEVELOPMENT.md) - Complete setup and contribution guide
+ - [Performance Guide](docs/guides/performance-guide.md) - Optimization and benchmarking
+ - [Performance Benchmarks](docs/project/benchmarking.md) - Benchmark results and methodology
+
+ ## License
+
+ Licensed under the MIT License. See [LICENSE](LICENSE) for details.
+
dataprof-0.4.80.dist-info/RECORD ADDED
@@ -0,0 +1,6 @@
+ dataprof-0.4.80.dist-info/METADATA,sha256=Ii4NkZPPUI3RODdzHhxmtDSH3UlazZ3DBFL1T3JH23E,13861
+ dataprof-0.4.80.dist-info/WHEEL,sha256=tZ3VAZ5HuUzziFCJ2lDsDJnJO-xy4omAQIa7TJCFCZk,96
+ dataprof-0.4.80.dist-info/licenses/LICENSE,sha256=pD_29Inf0TmerzrHuH-Lcu2GeD39lNK0_8bDJVkHjos,1090
+ dataprof/__init__.py,sha256=84U5MpyP59z3koB4vbdsJg1XQSKYeTS1SC7b3VqwjfU,115
+ dataprof/dataprof.cp314-win_amd64.pyd,sha256=F4U6H5H6BSx3menvqtVjSWPEMBggSwIrS7BjooXK7nk,2119168
+ dataprof-0.4.80.dist-info/RECORD,,
dataprof-0.4.80.dist-info/WHEEL ADDED
@@ -0,0 +1,4 @@
+ Wheel-Version: 1.0
+ Generator: maturin (1.9.6)
+ Root-Is-Purelib: false
+ Tag: cp314-cp314-win_amd64
dataprof-0.4.80.dist-info/licenses/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 Andrea Bozzo
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.