iparq 0.2.6__tar.gz → 0.4.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,4 @@
+ # These are supported funding model platforms
+
+ github: [MiguelElGallo]
+
@@ -5,7 +5,7 @@

  version: 2
  updates:
-   - package-ecosystem: "pip" # See documentation for possible values
+   - package-ecosystem: "uv" # See documentation for possible values
      directory: "/" # Location of package manifests
      schedule:
        interval: "weekly"
@@ -0,0 +1,31 @@
+ name: "Copilot Setup Steps"
+
+ # Allow testing of the setup steps from your repository's "Actions" tab.
+ on: workflow_dispatch
+
+ jobs:
+   # The job MUST be called `copilot-setup-steps` or it will not be picked up by Copilot.
+   copilot-setup-steps:
+     runs-on: ubuntu-latest
+
+     # Set the permissions to the lowest permissions possible needed for your steps.
+     # Copilot will be given its own token for its operations.
+     permissions:
+       # If you want to clone the repository as part of your setup steps, for example to install dependencies, you'll need the `contents: read` permission. If you don't clone the repository in your setup steps, Copilot will do this for you automatically after the steps complete.
+       contents: read
+
+     # You can define any steps you want, and they will run before the agent starts.
+     # If you do not check out your code, Copilot will do this for you.
+     steps:
+       - name: Checkout code
+         uses: actions/checkout@v4
+
+       - name: Install UV (Python package manager)
+         run: |
+           curl -LsSf https://astral.sh/uv/install.sh | sh
+           export PATH="$HOME/.cargo/bin:$PATH"
+           echo "$HOME/.cargo/bin" >> $GITHUB_PATH
+           uv --version
+
+       # Note: GitHub MCP server is not publicly available as npm package
+       # Remove this step until the package is officially released
@@ -172,3 +172,4 @@ cython_debug/
  .github/.DS_Store
  yellow_tripdata_2024-01.parquet
  filter.parquet
+ .DS_Store
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: iparq
- Version: 0.2.6
+ Version: 0.4.0
  Summary: Display version compression and bloom filter information about a parquet file
  Author-email: MiguelElGallo <miguel.zurcher@gmail.com>
  License-File: LICENSE
@@ -88,10 +88,10 @@ Read more about bloom filters in this [great article](https://duckdb.org/2025/03

  ## Usage

- iparq now supports additional options:
+ iparq supports inspecting single files, multiple files, and glob patterns:

  ```sh
- iparq inspect <filename> [OPTIONS]
+ iparq inspect <filename(s)> [OPTIONS]
  ```

  Options include:
@@ -100,9 +100,12 @@ Options include:
  - `--metadata-only`, `-m`: Show only file metadata without column details
  - `--column`, `-c`: Filter results to show only a specific column

- Examples:
+ ### Single File Examples:

  ```sh
+ # Basic inspection
+ iparq inspect yourfile.parquet
+
  # Output in JSON format
  iparq inspect yourfile.parquet --format json

@@ -113,9 +116,25 @@ iparq inspect yourfile.parquet --metadata-only
  iparq inspect yourfile.parquet --column column_name
  ```

- Replace `<filename>` with the path to your .parquet file. The utility will read the metadata of the file and print the compression codecs used in the parquet file.
+ ### Multiple Files and Glob Patterns:
+
+ ```sh
+ # Inspect multiple specific files
+ iparq inspect file1.parquet file2.parquet file3.parquet
+
+ # Use glob patterns to inspect all parquet files
+ iparq inspect *.parquet
+
+ # Use specific patterns
+ iparq inspect yellow*.parquet data_*.parquet
+
+ # Combine patterns and specific files
+ iparq inspect important.parquet temp_*.parquet
+ ```
+
+ When inspecting multiple files, each file's results are displayed with a header showing the filename. The utility will read the metadata of each file and print the compression codecs used in the parquet files.

- ## Example ouput - Bloom Filters
+ ## Example output - Bloom Filters

  ```log
  ParquetMetaModel(
@@ -70,10 +70,10 @@ Read more about bloom filters in this [great article](https://duckdb.org/2025/03

  ## Usage

- iparq now supports additional options:
+ iparq supports inspecting single files, multiple files, and glob patterns:

  ```sh
- iparq inspect <filename> [OPTIONS]
+ iparq inspect <filename(s)> [OPTIONS]
  ```

  Options include:
@@ -82,9 +82,12 @@ Options include:
  - `--metadata-only`, `-m`: Show only file metadata without column details
  - `--column`, `-c`: Filter results to show only a specific column

- Examples:
+ ### Single File Examples:

  ```sh
+ # Basic inspection
+ iparq inspect yourfile.parquet
+
  # Output in JSON format
  iparq inspect yourfile.parquet --format json

@@ -95,9 +98,25 @@ iparq inspect yourfile.parquet --metadata-only
  iparq inspect yourfile.parquet --column column_name
  ```

- Replace `<filename>` with the path to your .parquet file. The utility will read the metadata of the file and print the compression codecs used in the parquet file.
+ ### Multiple Files and Glob Patterns:
+
+ ```sh
+ # Inspect multiple specific files
+ iparq inspect file1.parquet file2.parquet file3.parquet
+
+ # Use glob patterns to inspect all parquet files
+ iparq inspect *.parquet
+
+ # Use specific patterns
+ iparq inspect yellow*.parquet data_*.parquet
+
+ # Combine patterns and specific files
+ iparq inspect important.parquet temp_*.parquet
+ ```
+
+ When inspecting multiple files, each file's results are displayed with a header showing the filename. The utility will read the metadata of each file and print the compression codecs used in the parquet files.

- ## Example ouput - Bloom Filters
+ ## Example output - Bloom Filters

  ```log
  ParquetMetaModel(
@@ -1,6 +1,6 @@
  [project]
  name = "iparq"
- version = "0.2.6"
+ version = "0.4.0"
  description = "Display version compression and bloom filter information about a parquet file"
  readme = "README.md"
  authors = [
@@ -38,4 +38,9 @@ testpaths = [

  [[tool.mypy.overrides]]
  module = ["pyarrow.*"]
- ignore_missing_imports = true
+ ignore_missing_imports = true
+
+ [dependency-groups]
+ dev = [
+     "pytest>=8.4.1",
+ ]
@@ -0,0 +1 @@
+ __version__ = "0.4.0"
@@ -1,3 +1,4 @@
+ import glob
  import json
  from enum import Enum
  from typing import List, Optional
@@ -53,6 +54,9 @@ class ColumnInfo(BaseModel):
          column_index (int): The index of the column.
          compression_type (str): The compression type used for the column.
          has_bloom_filter (bool): Whether the column has a bloom filter.
+         has_min_max (bool): Whether min/max statistics are available.
+         min_value (Optional[str]): The minimum value in the column (as string for display).
+         max_value (Optional[str]): The maximum value in the column (as string for display).
      """

      row_group: int
@@ -60,6 +64,9 @@ class ColumnInfo(BaseModel):
      column_index: int
      compression_type: str
      has_bloom_filter: Optional[bool] = False
+     has_min_max: Optional[bool] = False
+     min_value: Optional[str] = None
+     max_value: Optional[str] = None


  class ParquetColumnInfo(BaseModel):
@@ -84,22 +91,16 @@ def read_parquet_metadata(filename: str):
          tuple: A tuple containing:
              - parquet_metadata (pyarrow.parquet.FileMetaData): The metadata of the Parquet file.
              - compression_codecs (set): A set of compression codecs used in the Parquet file.
-     """
-     try:
-         compression_codecs = set([])
-         parquet_metadata = pq.ParquetFile(filename).metadata

-         for i in range(parquet_metadata.num_row_groups):
-             for j in range(parquet_metadata.num_columns):
-                 compression_codecs.add(
-                     parquet_metadata.row_group(i).column(j).compression
-                 )
+     Raises:
+         FileNotFoundError: If the file cannot be found or opened.
+     """
+     compression_codecs = set([])
+     parquet_metadata = pq.ParquetFile(filename).metadata

-     except FileNotFoundError:
-         console.print(
-             f"Cannot open: {filename}.", style="blink bold red underline on white"
-         )
-         exit(1)
+     for i in range(parquet_metadata.num_row_groups):
+         for j in range(parquet_metadata.num_columns):
+             compression_codecs.add(parquet_metadata.row_group(i).column(j).compression)

      return parquet_metadata, compression_codecs

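With this refactor, `read_parquet_metadata` no longer prints an error and calls `exit(1)` when the file is missing; it lets `FileNotFoundError` propagate so each caller can decide how to react (the per-file error handling in the later hunks relies on this). A minimal usage sketch under that assumption; the `summarize` wrapper is illustrative only and not part of the package:

```python
# Illustrative only: read_parquet_metadata is the helper changed above;
# summarize() is a hypothetical caller showing the new error contract.
from iparq.source import read_parquet_metadata


def summarize(path: str) -> None:
    try:
        metadata, codecs = read_parquet_metadata(path)
    except FileNotFoundError:
        # The helper no longer terminates the process; the caller handles the error.
        print(f"Cannot open: {path}")
        return
    print(f"{path}: {metadata.num_rows} rows, codecs={codecs}")


summarize("yourfile.parquet")
```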
@@ -208,6 +209,59 @@ def print_bloom_filter_info(parquet_metadata, column_info: ParquetColumnInfo) ->
  )


+ def print_min_max_statistics(parquet_metadata, column_info: ParquetColumnInfo) -> None:
+     """
+     Updates the column_info model with min/max statistics information.
+
+     Args:
+         parquet_metadata: The Parquet file metadata.
+         column_info: The ParquetColumnInfo model to update.
+     """
+     try:
+         num_row_groups = parquet_metadata.num_row_groups
+         num_columns = parquet_metadata.num_columns
+
+         for i in range(num_row_groups):
+             row_group = parquet_metadata.row_group(i)
+
+             for j in range(num_columns):
+                 column_chunk = row_group.column(j)
+
+                 # Find the corresponding column in our model
+                 for col in column_info.columns:
+                     if col.row_group == i and col.column_index == j:
+                         # Check if this column has statistics
+                         if column_chunk.is_stats_set:
+                             stats = column_chunk.statistics
+                             col.has_min_max = stats.has_min_max
+
+                             if stats.has_min_max:
+                                 # Convert values to string for display, handling potential None values
+                                 try:
+                                     col.min_value = (
+                                         str(stats.min)
+                                         if stats.min is not None
+                                         else "null"
+                                     )
+                                     col.max_value = (
+                                         str(stats.max)
+                                         if stats.max is not None
+                                         else "null"
+                                     )
+                                 except Exception:
+                                     # Fallback for complex types that might not stringify well
+                                     col.min_value = "<unable to display>"
+                                     col.max_value = "<unable to display>"
+                         else:
+                             col.has_min_max = False
+                         break
+     except Exception as e:
+         console.print(
+             f"Error while collecting min/max statistics: {e}",
+             style="blink bold red underline on white",
+         )
+
+
  def print_column_info_table(column_info: ParquetColumnInfo) -> None:
      """
      Prints the column information using a Rich table.
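The min/max values collected by `print_min_max_statistics` come directly from pyarrow's per-column-chunk statistics; the attributes it touches (`is_stats_set`, `statistics`, `has_min_max`, `min`, `max`) are standard `pyarrow.parquet` metadata fields. A standalone sketch of that lookup, with `example.parquet` as a placeholder path:

```python
# Standalone sketch of the statistics lookup used above; "example.parquet" is a placeholder.
import pyarrow.parquet as pq

metadata = pq.ParquetFile("example.parquet").metadata

for rg in range(metadata.num_row_groups):
    for col in range(metadata.num_columns):
        chunk = metadata.row_group(rg).column(col)
        if chunk.is_stats_set and chunk.statistics.has_min_max:
            stats = chunk.statistics
            print(f"row group {rg}, {chunk.path_in_schema}: min={stats.min}, max={stats.max}")
        else:
            print(f"row group {rg}, column {col}: no min/max statistics")
```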
@@ -223,15 +277,27 @@ def print_column_info_table(column_info: ParquetColumnInfo) -> None:
      table.add_column("Index", justify="center")
      table.add_column("Compression", style="magenta")
      table.add_column("Bloom Filter", justify="center")
+     table.add_column("Min Value", style="yellow")
+     table.add_column("Max Value", style="yellow")

      # Add rows to the table
      for col in column_info.columns:
+         # Format min/max values for display
+         min_display = (
+             col.min_value if col.has_min_max and col.min_value is not None else "N/A"
+         )
+         max_display = (
+             col.max_value if col.has_min_max and col.max_value is not None else "N/A"
+         )
+
          table.add_row(
              str(col.row_group),
              col.column_name,
              str(col.column_index),
              col.compression_type,
              "✅" if col.has_bloom_filter else "❌",
+             min_display,
+             max_display,
          )

      # Print the table
@@ -260,27 +326,24 @@ def output_json(
      print(json.dumps(result, indent=2))


- @app.command(name="")
- @app.command(name="inspect")
- def inspect(
-     filename: str = typer.Argument(..., help="Path to the Parquet file to inspect"),
-     format: OutputFormat = typer.Option(
-         OutputFormat.RICH, "--format", "-f", help="Output format (rich or json)"
-     ),
-     metadata_only: bool = typer.Option(
-         False,
-         "--metadata-only",
-         "-m",
-         help="Show only file metadata without column details",
-     ),
-     column_filter: Optional[str] = typer.Option(
-         None, "--column", "-c", help="Filter results to show only specific column"
-     ),
- ):
+ def inspect_single_file(
+     filename: str,
+     format: OutputFormat,
+     metadata_only: bool,
+     column_filter: Optional[str],
+ ) -> None:
      """
-     Inspect a Parquet file and display its metadata, compression settings, and bloom filter information.
+     Inspect a single Parquet file and display its metadata, compression settings, and bloom filter information.
+
+     Raises:
+         Exception: If the file cannot be processed.
      """
-     (parquet_metadata, compression) = read_parquet_metadata(filename)
+     try:
+         (parquet_metadata, compression) = read_parquet_metadata(filename)
+     except FileNotFoundError:
+         raise Exception(f"Cannot open: {filename}.")
+     except Exception as e:
+         raise Exception(f"Failed to read metadata: {e}")

      # Create metadata model
      meta_model = ParquetMetaModel(
@@ -298,6 +361,7 @@ def inspect(
      # Collect information
      print_compression_types(parquet_metadata, column_info)
      print_bloom_filter_info(parquet_metadata, column_info)
+     print_min_max_statistics(parquet_metadata, column_info)

      # Filter columns if requested
      if column_filter:
@@ -322,5 +386,61 @@ def inspect(
      console.print(f"Compression codecs: {compression}")


+ @app.command(name="")
+ @app.command(name="inspect")
+ def inspect(
+     filenames: List[str] = typer.Argument(
+         ..., help="Path(s) or pattern(s) to Parquet files to inspect"
+     ),
+     format: OutputFormat = typer.Option(
+         OutputFormat.RICH, "--format", "-f", help="Output format (rich or json)"
+     ),
+     metadata_only: bool = typer.Option(
+         False,
+         "--metadata-only",
+         "-m",
+         help="Show only file metadata without column details",
+     ),
+     column_filter: Optional[str] = typer.Option(
+         None, "--column", "-c", help="Filter results to show only specific column"
+     ),
+ ):
+     """
+     Inspect Parquet files and display their metadata, compression settings, and bloom filter information.
+     """
+     # Expand glob patterns and collect all matching files
+     all_files = []
+     for pattern in filenames:
+         matches = glob.glob(pattern)
+         if matches:
+             all_files.extend(matches)
+         else:
+             # If no matches found, treat as literal filename (for better error reporting)
+             all_files.append(pattern)
+
+     # Remove duplicates while preserving order
+     seen = set()
+     unique_files = []
+     for file in all_files:
+         if file not in seen:
+             seen.add(file)
+             unique_files.append(file)
+
+     # Process each file
+     for i, filename in enumerate(unique_files):
+         # For multiple files, add a header to separate results
+         if len(unique_files) > 1:
+             if i > 0:
+                 console.print()  # Add blank line between files
+             console.print(f"[bold blue]File: {filename}[/bold blue]")
+             console.print("─" * (len(filename) + 6))
+
+         try:
+             inspect_single_file(filename, format, metadata_only, column_filter)
+         except Exception as e:
+             console.print(f"Error processing {filename}: {e}", style="red")
+             continue
+
+
  if __name__ == "__main__":
      app()
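The new `inspect` command expands each argument with `glob.glob`, keeps the literal name when a pattern matches nothing (so the per-file error handler can still report it), and then de-duplicates while preserving order. The same pattern in isolation; the `patterns` list is hypothetical, and `dict.fromkeys` is an equivalent shorthand for the seen-set loop used in the command:

```python
# Isolated sketch of the expansion and ordered de-duplication performed by inspect;
# the patterns list is hypothetical.
import glob

patterns = ["important.parquet", "temp_*.parquet"]

all_files = []
for pattern in patterns:
    matches = glob.glob(pattern)
    # Fall back to the literal name so a missing file is still reported per file.
    all_files.extend(matches if matches else [pattern])

# dict keys preserve insertion order, giving order-preserving de-duplication.
unique_files = list(dict.fromkeys(all_files))
print(unique_files)
```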
@@ -0,0 +1,173 @@
+ import json
+ from pathlib import Path
+
+ from typer.testing import CliRunner
+
+ from iparq.source import app
+
+ # Define path to test fixtures
+ FIXTURES_DIR = Path(__file__).parent
+ fixture_path = FIXTURES_DIR / "dummy.parquet"
+
+
+ def test_parquet_info():
+     """Test that the CLI correctly displays parquet file information."""
+     runner = CliRunner()
+     result = runner.invoke(app, ["inspect", str(fixture_path)])
+
+     assert result.exit_code == 0
+
+     # Check for key components instead of exact table format
+     assert "ParquetMetaModel" in result.stdout
+     assert "created_by='parquet-cpp-arrow version 14.0.2'" in result.stdout
+     assert "num_columns=3" in result.stdout
+     assert "num_rows=3" in result.stdout
+     assert "Parquet Column Information" in result.stdout
+     assert "Min Value" in result.stdout
+     assert (
+         "Value" in result.stdout
+     )  # This covers "Max Value" which is split across lines
+     assert "one" in result.stdout and "-1.0" in result.stdout and "2.5" in result.stdout
+     assert "two" in result.stdout and "bar" in result.stdout and "foo" in result.stdout
+     assert (
+         "three" in result.stdout
+         and "False" in result.stdout
+         and "True" in result.stdout
+     )
+     assert "Compression codecs: {'SNAPPY'}" in result.stdout
+
+
+ def test_metadata_only_flag():
+     """Test that the metadata-only flag works correctly."""
+     runner = CliRunner()
+     fixture_path = FIXTURES_DIR / "dummy.parquet"
+     result = runner.invoke(app, ["inspect", "--metadata-only", str(fixture_path)])
+
+     assert result.exit_code == 0
+     assert "ParquetMetaModel" in result.stdout
+     assert "Parquet Column Information" not in result.stdout
+
+
+ def test_column_filter():
+     """Test that filtering by column name works correctly."""
+     runner = CliRunner()
+     fixture_path = FIXTURES_DIR / "dummy.parquet"
+     result = runner.invoke(app, ["inspect", "--column", "one", str(fixture_path)])
+
+     assert result.exit_code == 0
+     assert "one" in result.stdout
+     assert "two" not in result.stdout
+
+
+ def test_json_output():
+     """Test JSON output format."""
+     runner = CliRunner()
+     fixture_path = FIXTURES_DIR / "dummy.parquet"
+     result = runner.invoke(app, ["inspect", "--format", "json", str(fixture_path)])
+
+     assert result.exit_code == 0
+
+     # Test that output is valid JSON
+     data = json.loads(result.stdout)
+
+     # Check JSON structure
+     assert "metadata" in data
+     assert "columns" in data
+     assert "compression_codecs" in data
+     assert data["metadata"]["num_columns"] == 3
+
+     # Check that min/max statistics are included
+     for column in data["columns"]:
+         assert "has_min_max" in column
+         assert "min_value" in column
+         assert "max_value" in column
+         # For our test data, all columns should have min/max stats
+         assert column["has_min_max"] is True
+         assert column["min_value"] is not None
+         assert column["max_value"] is not None
+
+
+ def test_multiple_files():
+     """Test that multiple files can be inspected in a single command."""
+     runner = CliRunner()
+     fixture_path = FIXTURES_DIR / "dummy.parquet"
+     # Use the same file twice to test deduplication behavior
+
+     result = runner.invoke(app, ["inspect", str(fixture_path), str(fixture_path)])
+
+     assert result.exit_code == 0
+     # Since both arguments are the same file, deduplication means only one file is processed
+     # and since there's only one unique file, no file header should be shown
+     assert (
+         "File:" not in result.stdout
+     )  # No header for single file (after deduplication)
+     assert result.stdout.count("ParquetMetaModel") == 1
+
+
+ def test_multiple_different_files():
+     """Test multiple different files by creating a temporary copy."""
+     import shutil
+     import tempfile
+
+     runner = CliRunner()
+     fixture_path = FIXTURES_DIR / "dummy.parquet"
+
+     # Create a temporary file copy
+     with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as tmp_file:
+         shutil.copy2(fixture_path, tmp_file.name)
+         tmp_path = tmp_file.name
+
+     try:
+         result = runner.invoke(app, ["inspect", str(fixture_path), tmp_path])
+
+         assert result.exit_code == 0
+         # Should contain file headers for both files
+         assert f"File: {fixture_path}" in result.stdout
+         assert f"File: {tmp_path}" in result.stdout
+         # Should contain metadata for both files
+         assert result.stdout.count("ParquetMetaModel") == 2
+         assert result.stdout.count("Parquet Column Information") == 2
+     finally:
+         # Clean up temporary file
+         import os
+
+         os.unlink(tmp_path)
+
+
+ def test_glob_pattern():
+     """Test that glob patterns work correctly."""
+     runner = CliRunner()
+     # Test with a pattern that should match dummy files
+     result = runner.invoke(app, ["inspect", str(FIXTURES_DIR / "dummy*.parquet")])
+
+     assert result.exit_code == 0
+     # Should process at least one file
+     assert "ParquetMetaModel" in result.stdout
+
+
+ def test_single_file_no_header():
+     """Test that single files don't show file headers."""
+     runner = CliRunner()
+     fixture_path = FIXTURES_DIR / "dummy.parquet"
+     result = runner.invoke(app, ["inspect", str(fixture_path)])
+
+     assert result.exit_code == 0
+     # Should not contain file header for single file
+     assert "File:" not in result.stdout
+     assert "ParquetMetaModel" in result.stdout
+
+
+ def test_error_handling_with_multiple_files():
+     """Test that errors in one file don't stop processing of other files."""
+     runner = CliRunner()
+     fixture_path = FIXTURES_DIR / "dummy.parquet"
+     nonexistent_path = FIXTURES_DIR / "nonexistent.parquet"
+
+     result = runner.invoke(app, ["inspect", str(fixture_path), str(nonexistent_path)])
+
+     assert result.exit_code == 0
+     # Should process the good file
+     assert "ParquetMetaModel" in result.stdout
+     # Should show error for bad file
+     assert "Error processing" in result.stdout
+     assert "nonexistent.parquet" in result.stdout