data-glance 0.1.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,22 @@
+ # Python-generated files
+ __pycache__/
+ *.py[oc]
+ build/
+ dist/
+ wheels/
+ *.egg-info
+
+ # Virtual environments
+ .venv
+
+ # Generated reports
+ *.html
+ *.json
+
+ # IDE
+ .idea/
+ .vscode/
+ *.swp
+
+ # OS
+ .DS_Store
@@ -0,0 +1,313 @@
+ Metadata-Version: 2.4
+ Name: data-glance
+ Version: 0.1.1
+ Summary: Quick data profiling CLI for parquet and CSV files
+ Requires-Python: >=3.12
+ Requires-Dist: pandas>=2.0.0
+ Requires-Dist: polars>=1.0.0
+ Requires-Dist: pyarrow>=15.0.0
+ Requires-Dist: rich>=13.0.0
+ Requires-Dist: setuptools
+ Requires-Dist: typer>=0.12.0
+ Requires-Dist: ydata-profiling>=4.6.0
+ Provides-Extra: dev
+ Requires-Dist: mypy>=1.13.0; extra == 'dev'
+ Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
+ Requires-Dist: pytest>=8.0.0; extra == 'dev'
+ Requires-Dist: ruff>=0.8.0; extra == 'dev'
+ Description-Content-Type: text/markdown
+
+ # data-glance
+
+ Fast data profiling CLI for parquet and CSV files. Powered by [ydata-profiling](https://github.com/ydataai/ydata-profiling) and [Polars](https://pola.rs/).
+
+ ## Installation
+
+ Install from PyPI:
+
+ ```bash
+ # Run with uvx (cached)
+ uvx data-glance profile data.parquet
+
+ # Install globally
+ uv tool install data-glance
+
+ # Install with pip
+ pip install data-glance
+ ```
+
+ Or run directly from GitHub:
+
+ ```bash
+ # Run from GitHub (always latest)
+ uvx --from git+https://github.com/bswrundquist/data-glance data-glance profile data.parquet
+ ```
+
+ ## Quick Start
+
+ ```bash
+ # Profile a file
+ data-glance profile data.parquet
+
+ # Quick preview
+ data-glance head data.csv
+
+ # Check data quality
+ data-glance diagnose data.parquet
+
+ # View schema
+ data-glance schema data.parquet
+ ```
+
+ ## Commands
+
+ | Command | Description |
+ | ---------- | ---------------------------- |
+ | `profile` | Generate HTML profile report |
+ | `diagnose` | Check data quality issues |
+ | `head` | Preview first N rows |
+ | `tail` | Preview last N rows |
+ | `schema` | Display column types |
+ | `stats` | Quick statistics |
+ | `count` | Count rows (fast) |
+ | `columns` | List column names |
+ | `unique` | Show unique values |
+ | `filter` | Filter data by expression |
+ | `sample` | Extract random sample |
+ | `convert` | Convert between formats |
+ | `compare` | Compare two files |
+ | `validate` | Validate data rules |
+ | `info` | File metadata |
+ | `generate` | Create test data |
+
+ ## Profile Command
+
+ ### Basic Usage
+
+ ```bash
+ data-glance profile data.parquet
+ data-glance profile data.csv --preset quick
+ data-glance profile data.parquet --preset full
+ data-glance profile huge.parquet --sample 10000
+ ```
+
+ ### Column Filtering
+
+ ```bash
+ data-glance profile data.csv --include "user_*,order_*"
+ data-glance profile data.csv --exclude "*_id,*_hash"
+ ```
+
+ ### Null Handling
+
+ ```bash
+ data-glance profile data.csv --nulls drop-cols
+ data-glance profile data.csv --drop-null-threshold 0.5
+ data-glance profile data.csv --drop-constant
+ ```
+
+ ### Output Options
+
+ ```bash
+ data-glance profile data.csv -o report.html
+ data-glance profile data.csv --json report.json
+ data-glance profile data.csv --no-browser
+ data-glance profile data.csv --dry-run
+ ```
+
+ ### CSV Options
+
+ ```bash
+ data-glance profile data.tsv --delimiter tab
+ data-glance profile data.csv --encoding latin-1
+ data-glance profile messy.csv --ignore-errors
+ ```
+
+ ## Data Inspection
+
+ ### head / tail - Preview Data
+
+ ```bash
+ data-glance head data.parquet --rows 20
+ data-glance tail data.csv --rows 10
+ ```
+
+ ### schema - View Structure
+
+ ```bash
+ data-glance schema data.parquet
+ ```
+
+ ### stats - Quick Statistics
+
+ ```bash
+ data-glance stats data.parquet
+ ```
+
+ ### count - Row Count
+
+ ```bash
+ data-glance count data.parquet              # Single file
+ data-glance count *.csv                     # Multiple files
+ data-glance count *.parquet --total         # Just the number
+ ```
+
+ ### columns - List Columns
+
+ ```bash
+ data-glance columns data.parquet
+ data-glance columns data.csv --one          # One per line (for piping)
+ data-glance columns data.csv --types        # With data types
+ data-glance columns data.csv --one | grep user   # Filter columns
+ ```
+
+ ### unique - Value Distribution
+
+ ```bash
+ data-glance unique data.csv status
+ data-glance unique data.parquet category --counts --sort
+ data-glance unique data.csv user_id --limit 50
+ ```
+
+ ### info - File Metadata
+
+ ```bash
+ data-glance info data.parquet
+ ```
+
+ Shows file size and modification time; for parquet files, it also reports the row count, columns, and row groups.
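The generic fields (size, modification time) come from an ordinary `stat` call. A minimal standard-library sketch of that part, illustrative only and not data-glance's actual code:

```python
# Sketch: the file-size / modification-time portion of an `info`-style command.
# Illustrative only -- data-glance's real implementation may differ.
import os
import tempfile
from datetime import datetime


def file_info(path):
    st = os.stat(path)
    return {
        "size_bytes": st.st_size,
        "modified": datetime.fromtimestamp(st.st_mtime).isoformat(timespec="seconds"),
    }


# Demo on a throwaway file
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    path = f.name

info = file_info(path)
os.unlink(path)
print(info)
```

The parquet-specific details (row count, row groups) come from the file's footer metadata rather than the filesystem, so a plain `stat` cannot provide them.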
+
+ ## Data Operations
+
+ ### filter - Query Data
+
+ ```bash
+ # Filter by condition
+ data-glance filter data.csv "col('status') == 'active'"
+ data-glance filter data.parquet "col('age') > 30" -o filtered.parquet
+ data-glance filter data.csv "col('name').str.contains('test')" --limit 100
+
+ # Expression syntax (Polars)
+ col('column') == 'value'
+ col('column') > 100
+ col('column').is_in(['a', 'b'])
+ col('column').is_null()
+ col('column').str.contains('pattern')
+ ```
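The strings passed to `filter` are Polars expressions. To see how a `col()`-style DSL composes predicates, here is a rough pure-Python imitation; the names and mechanics are invented for illustration and are not how data-glance or Polars is implemented:

```python
# Hypothetical sketch of a col()-style expression DSL over dict rows.
# Polars builds a lazy expression tree instead; this only shows the composition idea.

class Col:
    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        return lambda row: row[self.name] == value

    def __gt__(self, value):
        return lambda row: row[self.name] is not None and row[self.name] > value

    def is_in(self, values):
        return lambda row: row[self.name] in values

    def is_null(self):
        return lambda row: row[self.name] is None


def col(name):
    return Col(name)


def filter_rows(rows, predicate):
    return [row for row in rows if predicate(row)]


rows = [
    {"status": "active", "age": 41},
    {"status": "inactive", "age": 25},
    {"status": "active", "age": None},
]

active = filter_rows(rows, col("status") == "active")
older = filter_rows(rows, col("age") > 30)
print(len(active), len(older))  # 2 1
```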
+
+ ### sample - Extract Sample
+
+ ```bash
+ data-glance sample data.parquet sample.parquet -n 1000
+ data-glance sample big.csv small.csv --fraction 0.1
+ data-glance sample data.parquet sample.csv    # Convert while sampling
+ ```
+
+ ### convert - Format Conversion
+
+ ```bash
+ data-glance convert data.csv data.parquet
+ data-glance convert data.parquet data.csv
+ data-glance convert data.csv data.parquet --compression zstd
+ ```
+
+ ## Data Quality
+
+ ### diagnose - Quality Check
+
+ ```bash
+ data-glance diagnose data.csv
+ ```
+
+ Shows the schema, null percentages, quality issues, and suggested fixes.
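One of the checks `diagnose` reports, null percentage per column, can be sketched conceptually with only the standard library (data-glance itself reads files with Polars; this just shows the idea):

```python
# Conceptual sketch: per-column null percentages, one of the figures
# a diagnose-style command reports. Not data-glance's actual implementation.
import csv
import io

CSV_DATA = """id,email,age
1,a@example.com,34
2,,29
3,b@example.com,
"""


def null_percentages(text):
    rows = list(csv.DictReader(io.StringIO(text)))
    totals = {}
    for row in rows:
        for name, value in row.items():
            nulls, count = totals.get(name, (0, 0))
            # Empty CSV fields arrive as "", so treat "" and None as null.
            totals[name] = (nulls + (value in ("", None)), count + 1)
    return {name: nulls / count for name, (nulls, count) in totals.items()}


print(null_percentages(CSV_DATA))
```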
+
+ ### compare - Diff Files
+
+ ```bash
+ data-glance compare data_v1.parquet data_v2.parquet
+ ```
+
+ Shows row/column differences, schema changes, and null changes.
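Conceptually, the schema-change portion of such a comparison is a diff of two column-to-type mappings. A hedged sketch of that idea, with invented helper names, not the actual implementation:

```python
# Sketch: schema diff between two files, as a compare-style command might report it.
# Column names and dtype strings below are illustrative.

def diff_schemas(a, b):
    added = {c: t for c, t in b.items() if c not in a}
    removed = {c: t for c, t in a.items() if c not in b}
    changed = {c: (a[c], b[c]) for c in a.keys() & b.keys() if a[c] != b[c]}
    return added, removed, changed


v1 = {"id": "Int64", "name": "String", "score": "Int64"}
v2 = {"id": "Int64", "name": "String", "score": "Float64", "ts": "Datetime"}

added, removed, changed = diff_schemas(v1, v2)
print(added, removed, changed)
```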
+
+ ### validate - Check Rules
+
+ ```bash
+ # Check for nulls
+ data-glance validate data.csv --no-nulls "id,email"
+
+ # Check uniqueness
+ data-glance validate data.parquet --unique "id"
+
+ # Check row count
+ data-glance validate data.csv --min-rows 1000
+
+ # Check null percentage
+ data-glance validate data.csv --max-null-pct 0.1
+
+ # Check required columns
+ data-glance validate data.csv --required-cols "id,name,email"
+
+ # Combine rules
+ data-glance validate data.parquet \
+     --unique "id" \
+     --no-nulls "id,email" \
+     --min-rows 1000
+ ```
+
+ Exits with code 1 if any rule fails, which makes `validate` easy to wire into CI/CD pipelines.
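The rule-combination behavior can be pictured as each flag contributing a boolean check, with the exit code derived from whether any check failed. A hypothetical standard-library sketch (rule names and helpers invented for illustration):

```python
# Sketch: combining validate-style rules into a pass/fail exit code.
# Illustrative only -- not data-glance's actual rule engine.

def check_min_rows(rows, minimum):
    return len(rows) >= minimum


def check_unique(rows, column):
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def check_no_nulls(rows, columns):
    return all(row[c] not in ("", None) for row in rows for c in columns)


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]

# Each CLI flag maps to one boolean check; collect the names of failed rules.
rules = {
    "min-rows 1000": check_min_rows(rows, 1000),
    "unique id": check_unique(rows, "id"),
    "no-nulls id,email": check_no_nulls(rows, ["id", "email"]),
}
failures = [name for name, ok in rules.items() if not ok]
exit_code = 1 if failures else 0
print(failures, exit_code)  # ['min-rows 1000'] 1
```

In a CI job, that exit code is what lets a failing rule fail the pipeline step.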
+
+ ## Global Options
+
+ ```bash
+ data-glance -q profile data.csv    # Quiet mode
+ data-glance -v profile data.csv    # Verbose mode
+ ```
+
+ ## Test Data
+
+ ```bash
+ data-glance generate test.parquet --rows 5000
+ data-glance generate test.csv --edge-cases --nulls 0.1
+ ```
+
+ ## Presets
+
+ | Preset | Speed | Detail | Use Case |
+ | --------- | ------ | -------- | ------------------------- |
+ | `quick` | Fast | Minimal | Large files, quick checks |
+ | `default` | Medium | Standard | Most use cases |
+ | `full` | Slow | Detailed | Deep analysis |
+
+ ## Tips
+
+ - Use `--preset quick` or `--sample` for large files
+ - Use `diagnose` before `profile` to understand data quality
+ - Use `--dry-run` to preview what will be profiled
+ - Use `validate` in CI/CD pipelines
+ - Use `count --total` for scripting
+ - Use `columns --one` to pipe to other tools
+ - Use `filter` to extract subsets before profiling
+
+ ## Development
+
+ ```bash
+ # Clone and install
+ git clone https://github.com/bswrundquist/data-glance
+ cd data-glance
+ make install-dev
+
+ # Run tests
+ make test
+
+ # Lint and format
+ make lint
+ make format
+
+ # Build
+ make build
+
+ # Release
+ make release-patch    # 0.1.0 -> 0.1.1
+ make release-minor    # 0.1.0 -> 0.2.0
+ make release-major    # 0.1.0 -> 1.0.0
+ ```