dataprof 0.4.83__pp311-pypy311_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of dataprof might be problematic.

dataprof/__init__.py ADDED
@@ -0,0 +1,5 @@
+ from .dataprof import *
+
+ __doc__ = dataprof.__doc__
+ if hasattr(dataprof, "__all__"):
+     __all__ = dataprof.__all__
dataprof-0.4.83.dist-info/METADATA ADDED
@@ -0,0 +1,380 @@
+ Metadata-Version: 2.4
+ Name: dataprof
+ Version: 0.4.83
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: POSIX
+ Classifier: Operating System :: Microsoft :: Windows
+ Classifier: Operating System :: MacOS :: MacOS X
+ Classifier: Programming Language :: Rust
+ Classifier: Programming Language :: Python :: Implementation :: CPython
+ Classifier: Programming Language :: Python :: Implementation :: PyPy
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Requires-Dist: pandas>=1.3.0 ; extra == 'pandas'
+ Requires-Dist: pandas>=1.3.0 ; extra == 'jupyter'
+ Requires-Dist: ipython>=7.0.0 ; extra == 'jupyter'
+ Requires-Dist: pandas>=1.3.0 ; extra == 'all'
+ Requires-Dist: ipython>=7.0.0 ; extra == 'all'
+ Requires-Dist: numpy>=1.20.0 ; extra == 'all'
+ Provides-Extra: pandas
+ Provides-Extra: jupyter
+ Provides-Extra: all
+ License-File: LICENSE
+ Summary: Fast, lightweight data profiling and quality assessment library
+ Keywords: data,profiling,quality,csv,json,analysis,performance
+ Home-Page: https://github.com/AndreaBozzo/dataprof
+ Author-email: Andrea Bozzo <andreabozzo92@gmail.com>
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+ Project-URL: Homepage, https://github.com/AndreaBozzo/dataprof
+ Project-URL: Repository, https://github.com/AndreaBozzo/dataprof
+ Project-URL: Issues, https://github.com/AndreaBozzo/dataprof/issues
+
+ # dataprof
+
+ [![CI](https://github.com/AndreaBozzo/dataprof/workflows/CI/badge.svg)](https://github.com/AndreaBozzo/dataprof/actions)
+ [![License](https://img.shields.io/github/license/AndreaBozzo/dataprof)](LICENSE)
+ [![Rust](https://img.shields.io/badge/rust-1.80%2B-orange.svg)](https://www.rust-lang.org)
+ [![Crates.io](https://img.shields.io/crates/v/dataprof.svg)](https://crates.io/crates/dataprof)
+ [![PyPI Downloads](https://static.pepy.tech/personalized-badge/dataprof?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads)](https://pepy.tech/projects/dataprof)
+
+ A fast, reliable data quality assessment tool built in Rust. Analyze datasets with 20x better memory efficiency than pandas, unlimited file streaming, and comprehensive ISO 8000/25012-compliant quality checks across 5 dimensions: Completeness, Consistency, Uniqueness, Accuracy, and Timeliness. Full Python bindings and production database connectivity included.
+
+ **Automatic Pattern Detection** - Identifies 16+ common data patterns, including emails, phone numbers, IP addresses, coordinates, IBANs, file paths, and more.
+
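+ As a rough sketch, per-column pattern hits might be inspected like this (the `column_profiles` and `patterns` attribute names are assumptions for illustration, not the confirmed bindings API):
+
+ ```python
+ import dataprof
+
+ report = dataprof.analyze_csv_with_quality("data.csv")
+ # Hypothetical attributes -- check the Python API reference for exact names.
+ for col in report.column_profiles:
+     for pattern in col.patterns:
+         print(f"{col.name}: looks like {pattern.name}")
+ ```
+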
+ Perfect for data scientists, engineers, analysts, and anyone working with data who needs quick, reliable quality insights.
+
+ ## Privacy & Transparency
+
+ DataProf processes **all data locally** on your machine. Zero telemetry, zero external data transmission.
+
+ **[Read exactly what DataProf analyzes →](docs/WHAT_DATAPROF_DOES.md)**
+
+ - 100% local processing - your data never leaves your machine
+ - No telemetry or tracking
+ - Open source & fully auditable
+ - Read-only database access (when using DB features)
+
+ **Complete transparency:** Every metric, calculation, and data point is documented with source code references for independent verification.
+
+ ## CI/CD Integration
+
+ Automate data quality checks in your workflows with our GitHub Action:
+
+ ```yaml
+ - name: DataProf Quality Check
+   uses: AndreaBozzo/dataprof-actions@v1
+   with:
+     file: 'data/dataset.csv'
+     quality-threshold: 80
+     fail-on-issues: true
+     # Batch mode (NEW)
+     recursive: true
+     output-html: 'quality-report.html'
+ ```
+
+ **[Get the Action →](https://github.com/AndreaBozzo/dataprof-action)**
+
+ - **Zero setup** - works out of the box
+ - **ISO 8000/25012 compliant** - industry-standard quality metrics
+ - **Batch processing** - analyze entire directories recursively
+ - **Flexible** - customizable thresholds and output formats
+ - **Fast** - typically completes in under 2 minutes
+
+ Perfect for ensuring data quality in pipelines, validating data integrity, or generating automated quality reports.
+
+ ## Quick Start
+
+ ### Installation
+
+ ```bash
+ # Install from crates.io (recommended)
+ cargo install dataprof
+
+ # Or build from source
+ git clone https://github.com/AndreaBozzo/dataprof
+ cd dataprof
+ cargo install --path .
+ ```
+
+ **That's it!** Now you can use `dataprof-cli` from anywhere.
+
+ ### Basic Usage
+
+ ```bash
+ # Analyze a CSV file
+ dataprof-cli analyze data.csv
+
+ # Get detailed analysis
+ dataprof-cli analyze data.csv --detailed
+
+ # Generate HTML report
+ dataprof-cli report data.csv -o report.html
+
+ # Analyze Parquet files (requires --features parquet)
+ dataprof-cli analyze data.parquet
+ ```
+
+ ### More Features
+
+ ```bash
+ # Batch process an entire directory
+ dataprof-cli batch /data/folder --recursive --parallel
+
+ # Database profiling
+ dataprof-cli database postgres://user:pass@host/db --table users
+
+ # Benchmark engines
+ dataprof-cli benchmark data.csv
+
+ # Streaming mode for large files
+ dataprof-cli analyze large_file.csv --streaming
+
+ # JSON output for automation
+ dataprof-cli analyze data.csv --format json
+ ```
+
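+ The JSON output composes with standard tooling. A minimal sketch of a quality gate, assuming the report exposes a top-level score field (the exact field name may differ):
+
+ ```bash
+ # Hypothetical gate: fail the step when the score drops below 80.
+ dataprof-cli analyze data.csv --format json > report.json
+ jq -e '.quality_score >= 80' report.json || exit 1
+ ```
+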
147
+ ![DataProf Batch Report](assets/animations/HTMLbatch.gif)
148
+
149
+ **Need help?** Run `dataprof-cli --help` or `dataprof-cli <command> --help` for detailed options.
150
+
151
+ ### Python Bindings
152
+
153
+ ```bash
154
+ pip install dataprof
155
+ ```
156
+
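+ The wheel metadata also declares optional extras for common setups:
+
+ ```bash
+ pip install "dataprof[pandas]"   # adds pandas>=1.3.0
+ pip install "dataprof[jupyter]"  # adds pandas>=1.3.0 and ipython>=7.0.0
+ pip install "dataprof[all]"      # adds pandas, ipython, and numpy>=1.20.0
+ ```
+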
+ ```python
+ import dataprof
+
+ # Comprehensive quality analysis (ISO 8000/25012 compliant)
+ report = dataprof.analyze_csv_with_quality("data.csv")
+ print(f"Quality score: {report.quality_score():.1f}%")
+
+ # Access individual quality dimensions
+ metrics = report.data_quality_metrics
+ print(f"Completeness: {metrics.complete_records_ratio:.1f}%")
+ print(f"Consistency: {metrics.data_type_consistency:.1f}%")
+ print(f"Uniqueness: {metrics.key_uniqueness:.1f}%")
+
+ # Batch processing
+ result = dataprof.batch_analyze_directory("/data", recursive=True)
+ print(f"Processed {result.processed_files} files at {result.files_per_second:.1f} files/sec")
+
+ # Async database profiling (requires the python-async feature)
+ import asyncio
+
+ async def profile_db():
+     result = await dataprof.profile_database_async(
+         "postgresql://user:pass@localhost/db",
+         "SELECT * FROM users LIMIT 1000",
+         batch_size=1000,
+         calculate_quality=True
+     )
+     print(f"Quality score: {result['quality'].overall_score:.1%}")
+
+ asyncio.run(profile_db())
+ ```
+
+ > **Note**: Async database profiling requires building with `--features python-async,database,postgres` (or mysql/sqlite). See [Async Support](#async-support) below.
+
+ **[Full Python API Documentation →](docs/python/README.md)**
+
+ ### Rust Library
+
+ ```bash
+ cargo add dataprof
+ ```
+
+ ```rust
+ use dataprof::*;
+
+ // High-performance Arrow processing for large files (>100MB)
+ // Requires compilation with: cargo build --features arrow
+ #[cfg(feature = "arrow")]
+ let profiler = DataProfiler::columnar();
+ #[cfg(feature = "arrow")]
+ let report = profiler.analyze_csv_file("large_dataset.csv")?;
+
+ // Standard adaptive profiling (recommended for most use cases)
+ let profiler = DataProfiler::auto();
+ let report = profiler.analyze_file("dataset.csv")?;
+ ```
+
+ ## Development
+
+ Want to contribute or build from source? Here's what you need:
+
+ ### Prerequisites
+ - Rust (latest stable via [rustup](https://rustup.rs/))
+ - Docker (for database testing)
+
+ ### Quick Setup
+ ```bash
+ git clone https://github.com/AndreaBozzo/dataprof.git
+ cd dataprof
+ cargo build --release # Build the project
+ docker-compose -f .devcontainer/docker-compose.yml up -d # Start test databases
+ ```
+
+ ### Feature Flags
+
+ dataprof uses optional features to keep compile times fast and binaries lean:
+
+ ```bash
+ # Minimal build (CSV/JSON only, ~60s compile)
+ cargo build --release
+
+ # With Apache Arrow (columnar processing, ~90s compile)
+ cargo build --release --features arrow
+
+ # With Parquet support (requires arrow, ~95s compile)
+ cargo build --release --features parquet
+
+ # With database connectors
+ cargo build --release --features postgres,mysql,sqlite
+
+ # With Python async support (for async database profiling)
+ maturin develop --features python-async,database,postgres
+
+ # All features (full functionality, ~130s compile)
+ cargo build --release --all-features
+ ```
+
+ **When to use Arrow?**
+ - ✅ Files > 100MB with many columns (>20)
+ - ✅ Columnar data with uniform types
+ - ✅ Need maximum throughput (up to 13x faster)
+ - ❌ Small files (<10MB) - the standard engine is faster
+ - ❌ Mixed/messy data - the streaming engine handles it better
+
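+ That decision can be encoded directly. A minimal sketch using the APIs shown above (the error type and the 100MB cut-off are illustrative):
+
+ ```rust
+ use dataprof::*;
+
+ fn profile(path: &str) -> Result<(), Box<dyn std::error::Error>> {
+     let size_mb = std::fs::metadata(path)?.len() / (1024 * 1024);
+
+     // Large, uniform columnar data: prefer the Arrow engine when compiled in.
+     #[cfg(feature = "arrow")]
+     if size_mb > 100 {
+         let _report = DataProfiler::columnar().analyze_csv_file(path)?;
+         return Ok(());
+     }
+
+     // Otherwise let the adaptive engine pick (streaming handles messy data).
+     let _report = DataProfiler::auto().analyze_file(path)?;
+     Ok(())
+ }
+ ```
+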
+ **When to use Parquet?**
+ - ✅ Analytics workloads with columnar data
+ - ✅ Data lake architectures
+ - ✅ Integration with Spark, Pandas, PyArrow
+ - ✅ Efficient storage and compression
+ - ✅ Type-safe schema preservation
+
+ ### Async Support
+
+ DataProf supports asynchronous operations for non-blocking database profiling, in both Rust and Python.
+
+ #### Rust Async (Database Features)
+
+ Database connectors are fully async and use the `tokio` runtime:
+
+ ```rust
+ use dataprof::database::{DatabaseConfig, profile_database};
+
+ #[tokio::main]
+ async fn main() -> Result<()> {
+     let config = DatabaseConfig {
+         connection_string: "postgresql://localhost/mydb".to_string(),
+         batch_size: 10000,
+         ..Default::default()
+     };
+
+     let report = profile_database(config, "SELECT * FROM users").await?;
+     println!("Profiled {} rows", report.total_rows);
+     Ok(())
+ }
+ ```
+
+ **Available async features:**
+ - ✅ Non-blocking database queries
+ - ✅ Concurrent query execution
+ - ✅ Streaming for large result sets
+ - ✅ Connection pooling with SQLx
+ - ✅ Retry logic with exponential backoff
+
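+ For example, two queries can be profiled concurrently with `tokio::join!` (a sketch reusing the `profile_database` API and `Result` alias from the example above; each call takes its own config here):
+
+ ```rust
+ use dataprof::database::{DatabaseConfig, profile_database};
+
+ // Sketch: run two profiling queries concurrently on the same runtime.
+ async fn profile_both(users_cfg: DatabaseConfig, orders_cfg: DatabaseConfig) -> Result<()> {
+     let (users, orders) = tokio::join!(
+         profile_database(users_cfg, "SELECT * FROM users"),
+         profile_database(orders_cfg, "SELECT * FROM orders"),
+     );
+     println!("{} user rows, {} order rows", users?.total_rows, orders?.total_rows);
+     Ok(())
+ }
+ ```
+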
+ #### Python Async (python-async Feature)
+
+ Enable async Python bindings for database profiling:
+
+ ```bash
+ # Build with async support
+ maturin develop --features python-async,database,postgres
+ ```
+
+ ```python
+ import asyncio
+ import dataprof
+
+ async def main():
+     # Test connection
+     connected = await dataprof.test_connection_async(
+         "postgresql://user:pass@localhost/db"
+     )
+
+     # Get table schema
+     columns = await dataprof.get_table_schema_async(
+         "postgresql://user:pass@localhost/db",
+         "users"
+     )
+
+     # Count rows
+     count = await dataprof.count_table_rows_async(
+         "postgresql://user:pass@localhost/db",
+         "users"
+     )
+
+     # Profile database query
+     result = await dataprof.profile_database_async(
+         "postgresql://user:pass@localhost/db",
+         "SELECT * FROM users LIMIT 1000",
+         batch_size=1000,
+         calculate_quality=True
+     )
+
+     print(f"Quality score: {result['quality'].overall_score:.1%}")
+
+ asyncio.run(main())
+ ```
+
+ **Benefits:**
+ - ✅ Non-blocking I/O for better performance
+ - ✅ Concurrent database profiling
+ - ✅ Integration with async Python frameworks (FastAPI, aiohttp, etc.)
+ - ✅ Efficient resource usage
+
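+ As an example of framework integration, a minimal FastAPI endpoint might wrap the profiling call (a sketch; the connection string and response shape are illustrative):
+
+ ```python
+ # Sketch: expose a profile via FastAPI, reusing profile_database_async from above.
+ from fastapi import FastAPI
+ import dataprof
+
+ app = FastAPI()
+
+ @app.get("/quality")
+ async def quality():
+     result = await dataprof.profile_database_async(
+         "postgresql://user:pass@localhost/db",
+         "SELECT * FROM users LIMIT 1000",
+         calculate_quality=True,
+     )
+     return {"quality_score": result["quality"].overall_score}
+ ```
+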
+ **See also:** [examples/async_database_example.py](examples/async_database_example.py) for complete examples.
+
+ ### Common Development Tasks
+ ```bash
+ cargo test # Run all tests
+ cargo bench # Performance benchmarks
+ cargo fmt # Format code
+ cargo clippy # Code quality checks
+ ```
+
+ ## Documentation
+
+ ### Privacy & Transparency
+ - [What DataProf Does](docs/WHAT_DATAPROF_DOES.md) - **Complete transparency guide with source code verification**
+
+ ### User Guides
+ - [Python API Reference](docs/python/API_REFERENCE.md) - Full Python API documentation
+ - [Python Integrations](docs/python/INTEGRATIONS.md) - Pandas, scikit-learn, Jupyter, Airflow, dbt
+ - [Database Connectors](docs/guides/database-connectors.md) - Production database connectivity
+ - [Apache Arrow Integration](docs/guides/apache-arrow-integration.md) - Columnar processing guide
+ - [CLI Usage Guide](docs/guides/CLI_USAGE_GUIDE.md) - Complete CLI reference
+
+ ### Developer Guides
+ - [Development Guide](docs/DEVELOPMENT.md) - Complete setup and contribution guide
+ - [Performance Guide](docs/guides/performance-guide.md) - Optimization and benchmarking
+ - [Performance Benchmarks](docs/project/benchmarking.md) - Benchmark results and methodology
+
+ ## License
+
+ Licensed under the MIT License. See [LICENSE](LICENSE) for details.
+
dataprof-0.4.83.dist-info/RECORD ADDED
@@ -0,0 +1,6 @@
+ dataprof-0.4.83.dist-info/METADATA,sha256=Nn0E99IBi8CQajcIyGeBNtVTMUQSlrj-CLr0dinpx_k,12434
+ dataprof-0.4.83.dist-info/WHEEL,sha256=STZ2cFMnDeKGVvLx9rTR8uRZet7TbXQTcL4SK-uJ1d8,163
+ dataprof-0.4.83.dist-info/licenses/LICENSE,sha256=CImaqYZiNIl11Vmb14wZW9Jzj33IxXlZnXWegpfXJF0,1069
+ dataprof/__init__.py,sha256=84U5MpyP59z3koB4vbdsJg1XQSKYeTS1SC7b3VqwjfU,115
+ dataprof/dataprof.pypy311-pp73-aarch64-linux-gnu.so,sha256=HGDrVUGfFMDpjQhitmqI3XNmJ4_1nn18U7JofTBTAqk,2943832
+ dataprof-0.4.83.dist-info/RECORD,,
dataprof-0.4.83.dist-info/WHEEL ADDED
@@ -0,0 +1,5 @@
+ Wheel-Version: 1.0
+ Generator: maturin (1.10.2)
+ Root-Is-Purelib: false
+ Tag: pp311-pypy311_pp73-manylinux_2_17_aarch64
+ Tag: pp311-pypy311_pp73-manylinux2014_aarch64
dataprof-0.4.83.dist-info/licenses/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 Andrea Bozzo
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.