data-glance 0.1.1 (tar.gz)
This diff shows the content of publicly available package versions released to a supported registry. It is provided for informational purposes only and reflects changes between package versions as they appear in their public registries.
- data_glance-0.1.1/.gitignore +22 -0
- data_glance-0.1.1/PKG-INFO +313 -0
- data_glance-0.1.1/README.md +294 -0
- data_glance-0.1.1/pyproject.toml +104 -0
- data_glance-0.1.1/src/data_profiler/__init__.py +3 -0
- data_glance-0.1.1/src/data_profiler/cli.py +1780 -0
@@ -0,0 +1,22 @@
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv

# Generated reports
*.html
*.json

# IDE
.idea/
.vscode/
*.swp

# OS
.DS_Store
@@ -0,0 +1,313 @@
Metadata-Version: 2.4
Name: data-glance
Version: 0.1.1
Summary: Quick data profiling CLI for parquet and CSV files
Requires-Python: >=3.12
Requires-Dist: pandas>=2.0.0
Requires-Dist: polars>=1.0.0
Requires-Dist: pyarrow>=15.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: setuptools
Requires-Dist: typer>=0.12.0
Requires-Dist: ydata-profiling>=4.6.0
Provides-Extra: dev
Requires-Dist: mypy>=1.13.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.8.0; extra == 'dev'
Description-Content-Type: text/markdown

# data-glance

Fast data profiling CLI for parquet and CSV files. Powered by [ydata-profiling](https://github.com/ydataai/ydata-profiling) and [Polars](https://pola.rs/).

## Installation

Install or run from PyPI:

```bash
# Run with uvx (cached)
uvx data-glance profile data.parquet

# Install globally
uv tool install data-glance

# Install with pip
pip install data-glance
```

Or run directly from GitHub:

```bash
# Run from GitHub (always latest)
uvx --from git+https://github.com/bswrundquist/data-glance data-glance profile data.parquet
```

## Quick Start

```bash
# Profile a file
data-glance profile data.parquet

# Quick preview
data-glance head data.csv

# Check data quality
data-glance diagnose data.parquet

# View schema
data-glance schema data.parquet
```

## Commands

| Command    | Description                  |
| ---------- | ---------------------------- |
| `profile`  | Generate HTML profile report |
| `diagnose` | Check data quality issues    |
| `head`     | Preview first N rows         |
| `tail`     | Preview last N rows          |
| `schema`   | Display column types         |
| `stats`    | Quick statistics             |
| `count`    | Count rows (fast)            |
| `columns`  | List column names            |
| `unique`   | Show unique values           |
| `filter`   | Filter data by expression    |
| `sample`   | Extract random sample        |
| `convert`  | Convert between formats      |
| `compare`  | Compare two files            |
| `validate` | Validate data rules          |
| `info`     | File metadata                |
| `generate` | Create test data             |

## Profile Command

### Basic Usage

```bash
data-glance profile data.parquet
data-glance profile data.csv --preset quick
data-glance profile data.parquet --preset full
data-glance profile huge.parquet --sample 10000
```

### Column Filtering

```bash
data-glance profile data.csv --include "user_*,order_*"
data-glance profile data.csv --exclude "*_id,*_hash"
```

### Null Handling

```bash
data-glance profile data.csv --nulls drop-cols
data-glance profile data.csv --drop-null-threshold 0.5
data-glance profile data.csv --drop-constant
```

### Output Options

```bash
data-glance profile data.csv -o report.html
data-glance profile data.csv --json report.json
data-glance profile data.csv --no-browser
data-glance profile data.csv --dry-run
```

### CSV Options

```bash
data-glance profile data.tsv --delimiter tab
data-glance profile data.csv --encoding latin-1
data-glance profile messy.csv --ignore-errors
```

## Data Inspection

### head / tail - Preview Data

```bash
data-glance head data.parquet --rows 20
data-glance tail data.csv --rows 10
```

### schema - View Structure

```bash
data-glance schema data.parquet
```

### stats - Quick Statistics

```bash
data-glance stats data.parquet
```

### count - Row Count

```bash
data-glance count data.parquet       # Single file
data-glance count *.csv              # Multiple files
data-glance count *.parquet --total  # Just the number
```

### columns - List Columns

```bash
data-glance columns data.parquet
data-glance columns data.csv --one             # One per line (for piping)
data-glance columns data.csv --types           # With data types
data-glance columns data.csv --one | grep user # Filter columns
```

### unique - Value Distribution

```bash
data-glance unique data.csv status
data-glance unique data.parquet category --counts --sort
data-glance unique data.csv user_id --limit 50
```
### info - File Metadata

```bash
data-glance info data.parquet
```

Shows file size and modification time, plus row count, columns, and row groups for parquet files.

## Data Operations

### filter - Query Data

```bash
# Filter by condition
data-glance filter data.csv "col('status') == 'active'"
data-glance filter data.parquet "col('age') > 30" -o filtered.parquet
data-glance filter data.csv "col('name').str.contains('test')" --limit 100

# Expression syntax (Polars)
col('column') == 'value'
col('column') > 100
col('column').is_in(['a', 'b'])
col('column').is_null()
col('column').str.contains('pattern')
```
### sample - Extract Sample

```bash
data-glance sample data.parquet sample.parquet -n 1000
data-glance sample big.csv small.csv --fraction 0.1
data-glance sample data.parquet sample.csv  # Convert while sampling
```

### convert - Format Conversion

```bash
data-glance convert data.csv data.parquet
data-glance convert data.parquet data.csv
data-glance convert data.csv data.parquet --compression zstd
```

## Data Quality

### diagnose - Quality Check

```bash
data-glance diagnose data.csv
```

Shows: schema, null percentages, quality issues, suggested fixes.

### compare - Diff Files

```bash
data-glance compare data_v1.parquet data_v2.parquet
```

Shows: row/column differences, schema changes, null changes.
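To make "schema changes" concrete, here is a hypothetical stdlib-only sketch of the kind of column-level diff such a comparison can report (the column names and dtypes are invented, and this is not the tool's actual code):

```python
# Two schemas as {column: dtype} mappings, e.g. obtained via pyarrow or polars
v1 = {"id": "Int64", "name": "String", "score": "Float64"}
v2 = {"id": "Int64", "name": "String", "email": "String"}

added = sorted(set(v2) - set(v1))    # columns only in v2
removed = sorted(set(v1) - set(v2))  # columns only in v1
retyped = sorted(c for c in set(v1) & set(v2) if v1[c] != v2[c])

print(added, removed, retyped)  # ['email'] ['score'] []
```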
### validate - Check Rules

```bash
# Check for nulls
data-glance validate data.csv --no-nulls "id,email"

# Check uniqueness
data-glance validate data.parquet --unique "id"

# Check row count
data-glance validate data.csv --min-rows 1000

# Check null percentage
data-glance validate data.csv --max-null-pct 0.1

# Check required columns
data-glance validate data.csv --required-cols "id,name,email"

# Combine rules
data-glance validate data.parquet \
  --unique "id" \
  --no-nulls "id,email" \
  --min-rows 1000
```

Returns exit code 1 if validation fails (useful in CI/CD).
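Since a failed rule yields a non-zero exit code, `validate` can gate a pipeline directly. A hypothetical GitHub Actions step, using only the flags documented above (the file path and rule values are placeholders):

```yaml
- name: Validate exported dataset
  run: |
    pip install data-glance
    data-glance validate data/export.parquet \
      --unique "id" \
      --no-nulls "id,email" \
      --min-rows 1000
```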
## Global Options

```bash
data-glance -q profile data.csv  # Quiet mode
data-glance -v profile data.csv  # Verbose mode
```

## Test Data

```bash
data-glance generate test.parquet --rows 5000
data-glance generate test.csv --edge-cases --nulls 0.1
```

## Presets

| Preset    | Speed  | Detail   | Use Case                  |
| --------- | ------ | -------- | ------------------------- |
| `quick`   | Fast   | Minimal  | Large files, quick checks |
| `default` | Medium | Standard | Most use cases            |
| `full`    | Slow   | Detailed | Deep analysis             |

## Tips

- Use `--preset quick` or `--sample` for large files
- Use `diagnose` before `profile` to understand data quality
- Use `--dry-run` to preview what will be profiled
- Use `validate` in CI/CD pipelines
- Use `count --total` for scripting
- Use `columns --one` to pipe to other tools
- Use `filter` to extract subsets before profiling

## Development

```bash
# Clone and install
git clone https://github.com/bswrundquist/data-glance
cd data-glance
make install-dev

# Run tests
make test

# Lint and format
make lint
make format

# Build
make build

# Release
make release-patch  # 0.1.0 -> 0.1.1
make release-minor  # 0.1.0 -> 0.2.0
make release-major  # 0.1.0 -> 1.0.0
```
@@ -0,0 +1,294 @@
README.md is identical to the package long description shown in PKG-INFO above.