iparq-0.2.0.tar.gz → iparq-0.2.5.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -56,5 +56,5 @@ jobs:
  run: uvx black . --check --verbose
  - name: Run Python tests
  if: runner.os != 'Windows'
- run: uv run pytest -s -vv
+ run: uv run pytest -vv

@@ -0,0 +1,50 @@
+ # This workflow will install Python dependencies, run tests and lint with a variety of Python versions
+ # For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
+
+ name: Python package
+ on:
+ push:
+ branches: [ "main" ]
+ pull_request:
+ branches: [ "main" ]
+
+ jobs:
+ build:
+ permissions:
+ contents: read
+ pull-requests: write
+ name: Test ${{ matrix.os }} Python ${{ matrix.python_version }}
+ runs-on: ${{ matrix.os }}
+ strategy:
+ fail-fast: false
+ matrix:
+ os: ["ubuntu-20.04", "windows-latest"]
+ python_version: ["3.9", "3.10", "3.11", "3.12", "3.13"]
+ env:
+ UV_SYSTEM_PYTHON: 1
+ steps:
+ - uses: actions/checkout@v4
+ - name: Setup python
+ uses: actions/setup-python@v5
+ with:
+ python-version: ${{ matrix.python_version }}
+ architecture: x64
+ - name: Install uv
+ uses: astral-sh/setup-uv@v5
+
+ # dependencies are in uv.lock
+ - name: Install dependencies
+ run: |
+ uv sync --all-extras
+
+ - name: Lint with ruff
+ run: uv run ruff check .
+ - name: Check types with mypy
+ run: |
+ cd src/iparq
+ uv run mypy . --config-file=../../pyproject.toml
+ - name: Check formatting with black
+ run: uvx black . --check --verbose
+ - name: Run Python tests
+ if: runner.os != 'Windows'
+ run: uv run pytest -vv
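For contributors, the CI steps above map one-to-one onto local commands. A minimal sketch of reproducing the same checks on a dev machine, assuming uv and uvx are already on PATH:

```sh
# Mirror the workflow's check steps locally (sketch; assumes uv/uvx installed)
uv sync --all-extras                  # install deps pinned in uv.lock
uv run ruff check .                   # lint
(cd src/iparq && uv run mypy . --config-file=../../pyproject.toml)  # type-check
uvx black . --check --verbose         # formatting
uv run pytest -vv                     # tests (CI skips these on Windows)
```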
@@ -11,9 +11,7 @@
  "request": "launch",
  "program": "${file}",
  "console": "integratedTerminal",
- "args": [
- "${command:pickArgs}"
- ]
+ "args": "${command:pickArgs}"
  }
  ]
  }
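Worth noting about this hunk: the VS Code Python debugger has long taken `args` as an array of strings, and newer releases also appear to accept a plain string value such as `"${command:pickArgs}"`, which is what lets the three-line array collapse to one line here.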
iparq-0.2.5/PKG-INFO ADDED
@@ -0,0 +1,145 @@
+ Metadata-Version: 2.4
+ Name: iparq
+ Version: 0.2.5
+ Summary: Display version compression and bloom filter information about a parquet file
+ Author-email: MiguelElGallo <miguel.zurcher@gmail.com>
+ License-File: LICENSE
+ Requires-Python: >=3.9
+ Requires-Dist: pyarrow
+ Requires-Dist: pydantic
+ Requires-Dist: rich
+ Requires-Dist: typer[all]
+ Provides-Extra: checks
+ Requires-Dist: mypy>=1.14.1; extra == 'checks'
+ Requires-Dist: ruff>=0.9.3; extra == 'checks'
+ Provides-Extra: test
+ Requires-Dist: pytest>=7.0; extra == 'test'
+ Description-Content-Type: text/markdown
+
+ # iparq
+
+ [![Python package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml)
+
+ [![Dependabot Updates](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates)
+
+ [![Upload Python Package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml)
+
+ ![alt text](media/iparq.png)
+ After reading [this blog](https://duckdb.org/2025/01/22/parquet-encodings.html), I began to wonder which Parquet version and compression methods the everyday tools we rely on actually use, only to find that there's no straightforward way to determine this. That curiosity and the difficulty of quickly discovering such details motivated me to create iparq (Information Parquet). My goal with iparq is to help users easily identify the specifics of the Parquet files generated by different engines, making it clear which features—like newer encodings or certain compression algorithms—the creator of the parquet is using.
+
+ ***New*** Bloom filters information: Displays if there are bloom filters.
+ Read more about bloom filters in this [great article](https://duckdb.org/2025/03/07/parquet-bloom-filters-in-duckdb.html).
+
+ ## Installation
+
+ ### Zero installation - Recommended
+
+ 1) Make sure to have Astral's UV installed by following the steps here:
+
+ <https://docs.astral.sh/uv/getting-started/installation/>
+
+ 2) Execute the following command:
+
+ ```sh
+ uvx --refresh iparq inspect yourparquet.parquet
+ ```
+
+ ### Using pip
+
+ 1) Install the package using pip:
+
+ ```sh
+ pip install iparq
+ ```
+
+ 2) Verify the installation by running:
+
+ ```sh
+ iparq --help
+ ```
+
+ ### Using uv
+
+ 1) Make sure to have Astral's UV installed by following the steps here:
+
+ <https://docs.astral.sh/uv/getting-started/installation/>
+
+ 2) Execute the following command:
+
+ ```sh
+ uv pip install iparq
+ ```
+
+ 3) Verify the installation by running:
+
+ ```sh
+ iparq --help
+ ```
+
+ ### Using Homebrew in a MAC
+
+ 1) Run the following:
+
+ ```sh
+ brew tap MiguelElGallo/tap https://github.com/MiguelElGallo//homebrew-iparq.git
+ brew install MiguelElGallo/tap/iparq
+ iparq --help
+ ```
+
+ ## Usage
+
+ iparq now supports additional options:
+
+ ```sh
+ iparq inspect <filename> [OPTIONS]
+ ```
+
+ Options include:
+
+ - `--format`, `-f`: Output format, either `rich` (default) or `json`
+ - `--metadata-only`, `-m`: Show only file metadata without column details
+ - `--column`, `-c`: Filter results to show only a specific column
+
+ Examples:
+
+ ```sh
+ # Output in JSON format
+ iparq inspect yourfile.parquet --format json
+
+ # Show only metadata
+ iparq inspect yourfile.parquet --metadata-only
+
+ # Filter to show only a specific column
+ iparq inspect yourfile.parquet --column column_name
+ ```
+
+ Replace `<filename>` with the path to your .parquet file. The utility will read the metadata of the file and print the compression codecs used in the parquet file.
+
+ ## Example ouput - Bloom Filters
+
+ ```log
+ ParquetMetaModel(
+ created_by='DuckDB version v1.2.1 (build 8e52ec4395)',
+ num_columns=1,
+ num_rows=100000000,
+ num_row_groups=10,
+ format_version='1.0',
+ serialized_size=1196
+ )
+ Parquet Column Information
+ ┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
+ ┃ Row Group ┃ Column Name ┃ Index ┃ Compression ┃ Bloom Filter ┃
+ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
+ │ 0 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 1 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 2 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 3 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 4 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 5 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 6 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 7 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 8 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 9 │ r │ 0 │ SNAPPY │ ✅ │
+ └───────────┴─────────────┴───────┴─────────────┴──────────────┘
+ Compression codecs: {'SNAPPY'}
+ ```
iparq-0.2.5/README.md ADDED
@@ -0,0 +1,127 @@
+ # iparq
+
+ [![Python package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml)
+
+ [![Dependabot Updates](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates)
+
+ [![Upload Python Package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml)
+
+ ![alt text](media/iparq.png)
+ After reading [this blog](https://duckdb.org/2025/01/22/parquet-encodings.html), I began to wonder which Parquet version and compression methods the everyday tools we rely on actually use, only to find that there's no straightforward way to determine this. That curiosity and the difficulty of quickly discovering such details motivated me to create iparq (Information Parquet). My goal with iparq is to help users easily identify the specifics of the Parquet files generated by different engines, making it clear which features—like newer encodings or certain compression algorithms—the creator of the parquet is using.
+
+ ***New*** Bloom filters information: Displays if there are bloom filters.
+ Read more about bloom filters in this [great article](https://duckdb.org/2025/03/07/parquet-bloom-filters-in-duckdb.html).
+
+ ## Installation
+
+ ### Zero installation - Recommended
+
+ 1) Make sure to have Astral's UV installed by following the steps here:
+
+ <https://docs.astral.sh/uv/getting-started/installation/>
+
+ 2) Execute the following command:
+
+ ```sh
+ uvx --refresh iparq inspect yourparquet.parquet
+ ```
+
+ ### Using pip
+
+ 1) Install the package using pip:
+
+ ```sh
+ pip install iparq
+ ```
+
+ 2) Verify the installation by running:
+
+ ```sh
+ iparq --help
+ ```
+
+ ### Using uv
+
+ 1) Make sure to have Astral's UV installed by following the steps here:
+
+ <https://docs.astral.sh/uv/getting-started/installation/>
+
+ 2) Execute the following command:
+
+ ```sh
+ uv pip install iparq
+ ```
+
+ 3) Verify the installation by running:
+
+ ```sh
+ iparq --help
+ ```
+
+ ### Using Homebrew in a MAC
+
+ 1) Run the following:
+
+ ```sh
+ brew tap MiguelElGallo/tap https://github.com/MiguelElGallo//homebrew-iparq.git
+ brew install MiguelElGallo/tap/iparq
+ iparq --help
+ ```
+
+ ## Usage
+
+ iparq now supports additional options:
+
+ ```sh
+ iparq inspect <filename> [OPTIONS]
+ ```
+
+ Options include:
+
+ - `--format`, `-f`: Output format, either `rich` (default) or `json`
+ - `--metadata-only`, `-m`: Show only file metadata without column details
+ - `--column`, `-c`: Filter results to show only a specific column
+
+ Examples:
+
+ ```sh
+ # Output in JSON format
+ iparq inspect yourfile.parquet --format json
+
+ # Show only metadata
+ iparq inspect yourfile.parquet --metadata-only
+
+ # Filter to show only a specific column
+ iparq inspect yourfile.parquet --column column_name
+ ```
+
+ Replace `<filename>` with the path to your .parquet file. The utility will read the metadata of the file and print the compression codecs used in the parquet file.
+
+ ## Example ouput - Bloom Filters
+
+ ```log
+ ParquetMetaModel(
+ created_by='DuckDB version v1.2.1 (build 8e52ec4395)',
+ num_columns=1,
+ num_rows=100000000,
+ num_row_groups=10,
+ format_version='1.0',
+ serialized_size=1196
+ )
+ Parquet Column Information
+ ┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
+ ┃ Row Group ┃ Column Name ┃ Index ┃ Compression ┃ Bloom Filter ┃
+ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
+ │ 0 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 1 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 2 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 3 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 4 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 5 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 6 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 7 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 8 │ r │ 0 │ SNAPPY │ ✅ │
+ │ 9 │ r │ 0 │ SNAPPY │ ✅ │
+ └───────────┴─────────────┴───────┴─────────────┴──────────────┘
+ Compression codecs: {'SNAPPY'}
+ ```
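The new `--format json` option composes well with standard JSON tooling. A small usage sketch, assuming the third-party jq CLI is installed (it is not a dependency of iparq):

```sh
# Pull just the codec list out of the JSON output
iparq inspect yourfile.parquet --format json | jq '.compression_codecs'
```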
@@ -1,6 +1,6 @@
  [project]
  name = "iparq"
- version = "0.2.0"
+ version = "0.2.5"
  description = "Display version compression and bloom filter information about a parquet file"
  readme = "README.md"
  authors = [
@@ -31,7 +31,7 @@ requires = ["hatchling"]
  build-backend = "hatchling.build"

  [tool.pytest.ini_options]
- addopts = "-ra -q"
+ addopts = ["-ra", "-q"]
  testpaths = [
  "tests",
  ]
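Context for this hunk: pytest accepts `addopts` either as a single whitespace-split string or as a list of strings; in `pyproject.toml` the list form is the more robust spelling, since each flag stays a distinct token regardless of embedded spaces.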
@@ -1,3 +1,5 @@
+ import json
+ from enum import Enum
  from typing import List, Optional

  import pyarrow.parquet as pq
@@ -7,10 +9,19 @@ from rich import print
  from rich.console import Console
  from rich.table import Table

- app = typer.Typer()
+ app = typer.Typer(
+ help="Inspect Parquet files for metadata, compression, and bloom filters"
+ )
  console = Console()


+ class OutputFormat(str, Enum):
+ """Enum for output format options."""
+
+ RICH = "rich"
+ JSON = "json"
+
+
  class ParquetMetaModel(BaseModel):
  """
  ParquetMetaModel is a data model representing metadata for a Parquet file.
@@ -227,20 +238,59 @@ def print_column_info_table(column_info: ParquetColumnInfo) -> None:
  console.print(table)


- @app.command()
- def main(filename: str):
+ def output_json(
+ meta_model: ParquetMetaModel,
+ column_info: ParquetColumnInfo,
+ compression_codecs: set,
+ ) -> None:
  """
- Main function to read and print Parquet file metadata.
+ Outputs the parquet information in JSON format.

  Args:
- filename (str): The path to the Parquet file.
-
- Returns:
- Metadata of the Parquet file and the compression codecs used.
+ meta_model: The Parquet metadata model
+ column_info: The column information model
+ compression_codecs: Set of compression codecs used
+ """
+ result = {
+ "metadata": meta_model.model_dump(),
+ "columns": [column.model_dump() for column in column_info.columns],
+ "compression_codecs": list(compression_codecs),
+ }
+
+ print(json.dumps(result, indent=2))
+
+
+ @app.command(name="")
+ @app.command(name="inspect")
+ def inspect(
+ filename: str = typer.Argument(..., help="Path to the Parquet file to inspect"),
+ format: OutputFormat = typer.Option(
+ OutputFormat.RICH, "--format", "-f", help="Output format (rich or json)"
+ ),
+ metadata_only: bool = typer.Option(
+ False,
+ "--metadata-only",
+ "-m",
+ help="Show only file metadata without column details",
+ ),
+ column_filter: Optional[str] = typer.Option(
+ None, "--column", "-c", help="Filter results to show only specific column"
+ ),
+ ):
+ """
+ Inspect a Parquet file and display its metadata, compression settings, and bloom filter information.
  """
  (parquet_metadata, compression) = read_parquet_metadata(filename)

- print_parquet_metadata(parquet_metadata)
+ # Create metadata model
+ meta_model = ParquetMetaModel(
+ created_by=parquet_metadata.created_by,
+ num_columns=parquet_metadata.num_columns,
+ num_rows=parquet_metadata.num_rows,
+ num_row_groups=parquet_metadata.num_row_groups,
+ format_version=str(parquet_metadata.format_version),
+ serialized_size=parquet_metadata.serialized_size,
+ )

  # Create a model to store column information
  column_info = ParquetColumnInfo()
@@ -249,10 +299,27 @@ def main(filename: str):
  print_compression_types(parquet_metadata, column_info)
  print_bloom_filter_info(parquet_metadata, column_info)

- # Print the information as a table
- print_column_info_table(column_info)
-
- print(f"Compression codecs: {compression}")
+ # Filter columns if requested
+ if column_filter:
+ column_info.columns = [
+ col for col in column_info.columns if col.column_name == column_filter
+ ]
+ if not column_info.columns:
+ console.print(
+ f"No columns match the filter: {column_filter}", style="yellow"
+ )
+
+ # Output based on format selection
+ if format == OutputFormat.JSON:
+ output_json(meta_model, column_info, compression)
+ else: # Rich format
+ # Print the metadata
+ console.print(meta_model)
+
+ # Print column details if not metadata only
+ if not metadata_only:
+ print_column_info_table(column_info)
+ console.print(f"Compression codecs: {compression}")


  if __name__ == "__main__":
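The JSON branch added above is essentially a thin layer over pyarrow's metadata API. A self-contained sketch of the same idea, for readers who want the raw mechanics (the function name `show_parquet_json` and the example path are illustrative, not part of iparq):

```python
import json

import pyarrow.parquet as pq


def show_parquet_json(path: str) -> None:
    """Print file-level metadata and per-column-chunk codecs as JSON."""
    md = pq.read_metadata(path)  # reads only the footer, not the data pages
    columns = []
    codecs = set()
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            chunk = md.row_group(rg).column(col)
            codecs.add(chunk.compression)
            columns.append(
                {
                    "row_group": rg,
                    "column_name": chunk.path_in_schema,
                    "index": col,
                    "compression": chunk.compression,
                }
            )
    result = {
        "metadata": {
            "created_by": md.created_by,
            "num_columns": md.num_columns,
            "num_rows": md.num_rows,
            "num_row_groups": md.num_row_groups,
            "format_version": str(md.format_version),
            "serialized_size": md.serialized_size,
        },
        "columns": columns,
        "compression_codecs": sorted(codecs),
    }
    print(json.dumps(result, indent=2))


if __name__ == "__main__":
    show_parquet_json("yourfile.parquet")  # replace with a real path
```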
@@ -0,0 +1,78 @@
+ import json
+ from pathlib import Path
+
+ from typer.testing import CliRunner
+
+ from iparq.source import app
+
+ # Define path to test fixtures
+ FIXTURES_DIR = Path(__file__).parent
+ fixture_path = FIXTURES_DIR / "dummy.parquet"
+
+
+ def test_parquet_info():
+ """Test that the CLI correctly displays parquet file information."""
+ runner = CliRunner()
+ result = runner.invoke(app, ["inspect", str(fixture_path)])
+
+ assert result.exit_code == 0
+
+ expected_output = """ParquetMetaModel(
+ created_by='parquet-cpp-arrow version 14.0.2',
+ num_columns=3,
+ num_rows=3,
+ num_row_groups=1,
+ format_version='2.6',
+ serialized_size=2223
+ )
+ Parquet Column Information
+ ┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
+ ┃ Row Group ┃ Column Name ┃ Index ┃ Compression ┃ Bloom Filter ┃
+ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
+ │ 0 │ one │ 0 │ SNAPPY │ ✅ │
+ │ 0 │ two │ 1 │ SNAPPY │ ✅ │
+ │ 0 │ three │ 2 │ SNAPPY │ ✅ │
+ └───────────┴─────────────┴───────┴─────────────┴──────────────┘
+ Compression codecs: {'SNAPPY'}"""
+
+ assert expected_output in result.stdout
+
+
+ def test_metadata_only_flag():
+ """Test that the metadata-only flag works correctly."""
+ runner = CliRunner()
+ fixture_path = FIXTURES_DIR / "dummy.parquet"
+ result = runner.invoke(app, ["inspect", "--metadata-only", str(fixture_path)])
+
+ assert result.exit_code == 0
+ assert "ParquetMetaModel" in result.stdout
+ assert "Parquet Column Information" not in result.stdout
+
+
+ def test_column_filter():
+ """Test that filtering by column name works correctly."""
+ runner = CliRunner()
+ fixture_path = FIXTURES_DIR / "dummy.parquet"
+ result = runner.invoke(app, ["inspect", "--column", "one", str(fixture_path)])
+
+ assert result.exit_code == 0
+ assert "one" in result.stdout
+ assert "two" not in result.stdout
+
+
+ def test_json_output():
+ """Test JSON output format."""
+ runner = CliRunner()
+ fixture_path = FIXTURES_DIR / "dummy.parquet"
+ result = runner.invoke(app, ["inspect", "--format", "json", str(fixture_path)])
+
+ assert result.exit_code == 0
+
+ # Test that output is valid JSON
+ data = json.loads(result.stdout)
+
+ # Check JSON structure
+ assert "metadata" in data
+ assert "columns" in data
+ assert "compression_codecs" in data
+ assert data["metadata"]["num_columns"] == 3
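As an aside, a fixture like `dummy.parquet` can be produced with pyarrow. A sketch under the assumption that three small columns named `one`, `two`, and `three` are enough; note that whether bloom filters end up in the file depends on the writer that created the real fixture:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Three tiny columns matching the names asserted in the tests above.
table = pa.table(
    {
        "one": [1.0, 2.0, 3.0],
        "two": ["a", "b", "c"],
        "three": [True, False, True],
    }
)
# SNAPPY matches the codec the tests expect to see reported.
pq.write_table(table, "dummy.parquet", compression="snappy")
```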
@@ -52,11 +52,12 @@ wheels = [

  [[package]]
  name = "iparq"
- version = "0.2.0"
+ version = "0.2.5"
  source = { editable = "." }
  dependencies = [
  { name = "pyarrow" },
  { name = "pydantic" },
+ { name = "rich" },
  { name = "typer" },
  ]

@@ -72,11 +73,12 @@ test = [
  [package.metadata]
  requires-dist = [
  { name = "mypy", marker = "extra == 'checks'", specifier = ">=1.14.1" },
- { name = "pyarrow", specifier = ">=19.0.0" },
- { name = "pydantic", specifier = ">=2.10.6" },
+ { name = "pyarrow" },
+ { name = "pydantic" },
  { name = "pytest", marker = "extra == 'test'", specifier = ">=7.0" },
+ { name = "rich" },
  { name = "ruff", marker = "extra == 'checks'", specifier = ">=0.9.3" },
- { name = "typer", specifier = ">=0.15.1" },
+ { name = "typer", extras = ["all"] },
  ]
  provides-extras = ["test", "checks"]

@@ -1,41 +0,0 @@
- # This workflow will install Python dependencies, run tests and lint with a variety of Python versions
- # For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
-
- name: Python package
- on:
- push:
- branches: [ "main" ]
- pull_request:
- branches: [ "main" ]
-
- jobs:
- build:
- permissions:
- contents: read
- pull-requests: write
- runs-on: ubuntu-latest
- strategy:
- fail-fast: false
- matrix:
- python-version: ["3.9", "3.10", "3.11"]
-
- steps:
- - uses: actions/checkout@v4
- - name: Set up Python ${{ matrix.python-version }}
- uses: actions/setup-python@v3
- with:
- python-version: ${{ matrix.python-version }}
- - name: Install dependencies
- run: |
- python -m pip install --upgrade pip
- python -m pip install flake8 pytest pydantic iparq
- if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- - name: Lint with flake8
- run: |
- # stop the build if there are Python syntax errors or undefined names
- flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
- # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
- flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- - name: Test with pytest
- run: |
- pytest
iparq-0.2.0/PKG-INFO DELETED
@@ -1,229 +0,0 @@
- Metadata-Version: 2.4
- Name: iparq
- Version: 0.2.0
- Summary: Display version compression and bloom filter information about a parquet file
- Author-email: MiguelElGallo <miguel.zurcher@gmail.com>
- License-File: LICENSE
- Requires-Python: >=3.9
- Requires-Dist: pyarrow
- Requires-Dist: pydantic
- Requires-Dist: rich
- Requires-Dist: typer[all]
- Provides-Extra: checks
- Requires-Dist: mypy>=1.14.1; extra == 'checks'
- Requires-Dist: ruff>=0.9.3; extra == 'checks'
- Provides-Extra: test
- Requires-Dist: pytest>=7.0; extra == 'test'
- Description-Content-Type: text/markdown
-
- # iparq
-
- [![Python package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml)
-
- [![Dependabot Updates](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates)
-
- [![Upload Python Package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml)
-
- ![alt text](media/iparq.png)
- After reading [this blog](https://duckdb.org/2025/01/22/parquet-encodings.html), I began to wonder which Parquet version and compression methods the everyday tools we rely on actually use, only to find that there’s no straightforward way to determine this. That curiosity and the difficulty of quickly discovering such details motivated me to create iparq (Information Parquet). My goal with iparq is to help users easily identify the specifics of the Parquet files generated by different engines, making it clear which features—like newer encodings or certain compression algorithms—the creator of the parquet is using.
-
- ***New*** Bloom filters information: Displays if there are bloom filters.
- Read more about bloom filters in this [great article](https://duckdb.org/2025/03/07/parquet-bloom-filters-in-duckdb.html).
-
-
- ## Installation
-
- ### Zero installation - Recommended
-
- 1) Make sure to have Astral’s UV installed by following the steps here:
-
- <https://docs.astral.sh/uv/getting-started/installation/>
-
- 2) Execute the following command:
-
- ```sh
- uvx iparq yourparquet.parquet
- ```
-
- ### Using pip
-
- 1) Install the package using pip:
-
- ```sh
- pip install iparq
- ```
-
- 2) Verify the installation by running:
-
- ```sh
- iparq --help
- ```
-
- ### Using uv
-
- 1) Make sure to have Astral’s UV installed by following the steps here:
-
- <https://docs.astral.sh/uv/getting-started/installation/>
-
- 2) Execute the following command:
-
- ```sh
- uv pip install iparq
- ```
-
- 3) Verify the installation by running:
-
- ```sh
- iparq --help
- ```
-
- ### Using Homebrew in a MAC
-
- 1) Run the following:
-
- ```sh
- brew tap MiguelElGallo/tap https://github.com/MiguelElGallo//homebrew-iparq.git
- brew install MiguelElGallo/tap/iparq
- iparq —help
- ```
-
- ## Usage
-
- Run
-
- ```sh
- iparq <filename>
- ```
-
- Replace `<filename>` with the path to your .parquet file. The utility will read the metadata of the file and print the compression codecs used in the parquet file.
-
- ## Example ouput - Bloom Filters
-
- ```log
- ParquetMetaModel(
- created_by='DuckDB version v1.2.1 (build 8e52ec4395)',
- num_columns=1,
- num_rows=100000000,
- num_row_groups=10,
- format_version='1.0',
- serialized_size=1196
- )
- Column Compression Info:
- Row Group 0:
- Column 'r' (Index 0): SNAPPY
- Row Group 1:
- Column 'r' (Index 0): SNAPPY
- Row Group 2:
- Column 'r' (Index 0): SNAPPY
- Row Group 3:
- Column 'r' (Index 0): SNAPPY
- Row Group 4:
- Column 'r' (Index 0): SNAPPY
- Row Group 5:
- Column 'r' (Index 0): SNAPPY
- Row Group 6:
- Column 'r' (Index 0): SNAPPY
- Row Group 7:
- Column 'r' (Index 0): SNAPPY
- Row Group 8:
- Column 'r' (Index 0): SNAPPY
- Row Group 9:
- Column 'r' (Index 0): SNAPPY
- Bloom Filter Info:
- Row Group 0:
- Column 'r' (Index 0): Has bloom filter
- Row Group 1:
- Column 'r' (Index 0): Has bloom filter
- Row Group 2:
- Column 'r' (Index 0): Has bloom filter
- Row Group 3:
- Column 'r' (Index 0): Has bloom filter
- Row Group 4:
- Column 'r' (Index 0): Has bloom filter
- Row Group 5:
- Column 'r' (Index 0): Has bloom filter
- Row Group 6:
- Column 'r' (Index 0): Has bloom filter
- Row Group 7:
- Column 'r' (Index 0): Has bloom filter
- Row Group 8:
- Column 'r' (Index 0): Has bloom filter
- Row Group 9:
- Column 'r' (Index 0): Has bloom filter
- Compression codecs: {'SNAPPY'}
- ```
-
- ## Example output
-
- ```log
- ParquetMetaModel(
- created_by='parquet-cpp-arrow version 14.0.2',
- num_columns=19,
- num_rows=2964624,
- num_row_groups=3,
- format_version='2.6',
- serialized_size=6357
- )
- Column Compression Info:
- Row Group 0:
- Column 'VendorID' (Index 0): ZSTD
- Column 'tpep_pickup_datetime' (Index 1): ZSTD
- Column 'tpep_dropoff_datetime' (Index 2): ZSTD
- Column 'passenger_count' (Index 3): ZSTD
- Column 'trip_distance' (Index 4): ZSTD
- Column 'RatecodeID' (Index 5): ZSTD
- Column 'store_and_fwd_flag' (Index 6): ZSTD
- Column 'PULocationID' (Index 7): ZSTD
- Column 'DOLocationID' (Index 8): ZSTD
- Column 'payment_type' (Index 9): ZSTD
- Column 'fare_amount' (Index 10): ZSTD
- Column 'extra' (Index 11): ZSTD
- Column 'mta_tax' (Index 12): ZSTD
- Column 'tip_amount' (Index 13): ZSTD
- Column 'tolls_amount' (Index 14): ZSTD
- Column 'improvement_surcharge' (Index 15): ZSTD
- Column 'total_amount' (Index 16): ZSTD
- Column 'congestion_surcharge' (Index 17): ZSTD
- Column 'Airport_fee' (Index 18): ZSTD
- Row Group 1:
- Column 'VendorID' (Index 0): ZSTD
- Column 'tpep_pickup_datetime' (Index 1): ZSTD
- Column 'tpep_dropoff_datetime' (Index 2): ZSTD
- Column 'passenger_count' (Index 3): ZSTD
- Column 'trip_distance' (Index 4): ZSTD
- Column 'RatecodeID' (Index 5): ZSTD
- Column 'store_and_fwd_flag' (Index 6): ZSTD
- Column 'PULocationID' (Index 7): ZSTD
- Column 'DOLocationID' (Index 8): ZSTD
- Column 'payment_type' (Index 9): ZSTD
- Column 'fare_amount' (Index 10): ZSTD
- Column 'extra' (Index 11): ZSTD
- Column 'mta_tax' (Index 12): ZSTD
- Column 'tip_amount' (Index 13): ZSTD
- Column 'tolls_amount' (Index 14): ZSTD
- Column 'improvement_surcharge' (Index 15): ZSTD
- Column 'total_amount' (Index 16): ZSTD
- Column 'congestion_surcharge' (Index 17): ZSTD
- Column 'Airport_fee' (Index 18): ZSTD
- Row Group 2:
- Column 'VendorID' (Index 0): ZSTD
- Column 'tpep_pickup_datetime' (Index 1): ZSTD
- Column 'tpep_dropoff_datetime' (Index 2): ZSTD
- Column 'passenger_count' (Index 3): ZSTD
- Column 'trip_distance' (Index 4): ZSTD
- Column 'RatecodeID' (Index 5): ZSTD
- Column 'store_and_fwd_flag' (Index 6): ZSTD
- Column 'PULocationID' (Index 7): ZSTD
- Column 'DOLocationID' (Index 8): ZSTD
- Column 'payment_type' (Index 9): ZSTD
- Column 'fare_amount' (Index 10): ZSTD
- Column 'extra' (Index 11): ZSTD
- Column 'mta_tax' (Index 12): ZSTD
- Column 'tip_amount' (Index 13): ZSTD
- Column 'tolls_amount' (Index 14): ZSTD
- Column 'improvement_surcharge' (Index 15): ZSTD
- Column 'total_amount' (Index 16): ZSTD
- Column 'congestion_surcharge' (Index 17): ZSTD
- Column 'Airport_fee' (Index 18): ZSTD
- Compression codecs: {'ZSTD'}
- ```
iparq-0.2.0/README.md DELETED
@@ -1,211 +0,0 @@
- # iparq
-
- [![Python package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml)
-
- [![Dependabot Updates](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates)
-
- [![Upload Python Package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml)
-
- ![alt text](media/iparq.png)
- After reading [this blog](https://duckdb.org/2025/01/22/parquet-encodings.html), I began to wonder which Parquet version and compression methods the everyday tools we rely on actually use, only to find that there’s no straightforward way to determine this. That curiosity and the difficulty of quickly discovering such details motivated me to create iparq (Information Parquet). My goal with iparq is to help users easily identify the specifics of the Parquet files generated by different engines, making it clear which features—like newer encodings or certain compression algorithms—the creator of the parquet is using.
-
- ***New*** Bloom filters information: Displays if there are bloom filters.
- Read more about bloom filters in this [great article](https://duckdb.org/2025/03/07/parquet-bloom-filters-in-duckdb.html).
-
-
- ## Installation
-
- ### Zero installation - Recommended
-
- 1) Make sure to have Astral’s UV installed by following the steps here:
-
- <https://docs.astral.sh/uv/getting-started/installation/>
-
- 2) Execute the following command:
-
- ```sh
- uvx iparq yourparquet.parquet
- ```
-
- ### Using pip
-
- 1) Install the package using pip:
-
- ```sh
- pip install iparq
- ```
-
- 2) Verify the installation by running:
-
- ```sh
- iparq --help
- ```
-
- ### Using uv
-
- 1) Make sure to have Astral’s UV installed by following the steps here:
-
- <https://docs.astral.sh/uv/getting-started/installation/>
-
- 2) Execute the following command:
-
- ```sh
- uv pip install iparq
- ```
-
- 3) Verify the installation by running:
-
- ```sh
- iparq --help
- ```
-
- ### Using Homebrew in a MAC
-
- 1) Run the following:
-
- ```sh
- brew tap MiguelElGallo/tap https://github.com/MiguelElGallo//homebrew-iparq.git
- brew install MiguelElGallo/tap/iparq
- iparq —help
- ```
-
- ## Usage
-
- Run
-
- ```sh
- iparq <filename>
- ```
-
- Replace `<filename>` with the path to your .parquet file. The utility will read the metadata of the file and print the compression codecs used in the parquet file.
-
- ## Example ouput - Bloom Filters
-
- ```log
- ParquetMetaModel(
- created_by='DuckDB version v1.2.1 (build 8e52ec4395)',
- num_columns=1,
- num_rows=100000000,
- num_row_groups=10,
- format_version='1.0',
- serialized_size=1196
- )
- Column Compression Info:
- Row Group 0:
- Column 'r' (Index 0): SNAPPY
- Row Group 1:
- Column 'r' (Index 0): SNAPPY
- Row Group 2:
- Column 'r' (Index 0): SNAPPY
- Row Group 3:
- Column 'r' (Index 0): SNAPPY
- Row Group 4:
- Column 'r' (Index 0): SNAPPY
- Row Group 5:
- Column 'r' (Index 0): SNAPPY
- Row Group 6:
- Column 'r' (Index 0): SNAPPY
- Row Group 7:
- Column 'r' (Index 0): SNAPPY
- Row Group 8:
- Column 'r' (Index 0): SNAPPY
- Row Group 9:
- Column 'r' (Index 0): SNAPPY
- Bloom Filter Info:
- Row Group 0:
- Column 'r' (Index 0): Has bloom filter
- Row Group 1:
- Column 'r' (Index 0): Has bloom filter
- Row Group 2:
- Column 'r' (Index 0): Has bloom filter
- Row Group 3:
- Column 'r' (Index 0): Has bloom filter
- Row Group 4:
- Column 'r' (Index 0): Has bloom filter
- Row Group 5:
- Column 'r' (Index 0): Has bloom filter
- Row Group 6:
- Column 'r' (Index 0): Has bloom filter
- Row Group 7:
- Column 'r' (Index 0): Has bloom filter
- Row Group 8:
- Column 'r' (Index 0): Has bloom filter
- Row Group 9:
- Column 'r' (Index 0): Has bloom filter
- Compression codecs: {'SNAPPY'}
- ```
-
- ## Example output
-
- ```log
- ParquetMetaModel(
- created_by='parquet-cpp-arrow version 14.0.2',
- num_columns=19,
- num_rows=2964624,
- num_row_groups=3,
- format_version='2.6',
- serialized_size=6357
- )
- Column Compression Info:
- Row Group 0:
- Column 'VendorID' (Index 0): ZSTD
- Column 'tpep_pickup_datetime' (Index 1): ZSTD
- Column 'tpep_dropoff_datetime' (Index 2): ZSTD
- Column 'passenger_count' (Index 3): ZSTD
- Column 'trip_distance' (Index 4): ZSTD
- Column 'RatecodeID' (Index 5): ZSTD
- Column 'store_and_fwd_flag' (Index 6): ZSTD
- Column 'PULocationID' (Index 7): ZSTD
- Column 'DOLocationID' (Index 8): ZSTD
- Column 'payment_type' (Index 9): ZSTD
- Column 'fare_amount' (Index 10): ZSTD
- Column 'extra' (Index 11): ZSTD
- Column 'mta_tax' (Index 12): ZSTD
- Column 'tip_amount' (Index 13): ZSTD
- Column 'tolls_amount' (Index 14): ZSTD
- Column 'improvement_surcharge' (Index 15): ZSTD
- Column 'total_amount' (Index 16): ZSTD
- Column 'congestion_surcharge' (Index 17): ZSTD
- Column 'Airport_fee' (Index 18): ZSTD
- Row Group 1:
- Column 'VendorID' (Index 0): ZSTD
- Column 'tpep_pickup_datetime' (Index 1): ZSTD
- Column 'tpep_dropoff_datetime' (Index 2): ZSTD
- Column 'passenger_count' (Index 3): ZSTD
- Column 'trip_distance' (Index 4): ZSTD
- Column 'RatecodeID' (Index 5): ZSTD
- Column 'store_and_fwd_flag' (Index 6): ZSTD
- Column 'PULocationID' (Index 7): ZSTD
- Column 'DOLocationID' (Index 8): ZSTD
- Column 'payment_type' (Index 9): ZSTD
- Column 'fare_amount' (Index 10): ZSTD
- Column 'extra' (Index 11): ZSTD
- Column 'mta_tax' (Index 12): ZSTD
- Column 'tip_amount' (Index 13): ZSTD
- Column 'tolls_amount' (Index 14): ZSTD
- Column 'improvement_surcharge' (Index 15): ZSTD
- Column 'total_amount' (Index 16): ZSTD
- Column 'congestion_surcharge' (Index 17): ZSTD
- Column 'Airport_fee' (Index 18): ZSTD
- Row Group 2:
- Column 'VendorID' (Index 0): ZSTD
- Column 'tpep_pickup_datetime' (Index 1): ZSTD
- Column 'tpep_dropoff_datetime' (Index 2): ZSTD
- Column 'passenger_count' (Index 3): ZSTD
- Column 'trip_distance' (Index 4): ZSTD
- Column 'RatecodeID' (Index 5): ZSTD
- Column 'store_and_fwd_flag' (Index 6): ZSTD
- Column 'PULocationID' (Index 7): ZSTD
- Column 'DOLocationID' (Index 8): ZSTD
- Column 'payment_type' (Index 9): ZSTD
- Column 'fare_amount' (Index 10): ZSTD
- Column 'extra' (Index 11): ZSTD
- Column 'mta_tax' (Index 12): ZSTD
- Column 'tip_amount' (Index 13): ZSTD
- Column 'tolls_amount' (Index 14): ZSTD
- Column 'improvement_surcharge' (Index 15): ZSTD
- Column 'total_amount' (Index 16): ZSTD
- Column 'congestion_surcharge' (Index 17): ZSTD
- Column 'Airport_fee' (Index 18): ZSTD
- Compression codecs: {'ZSTD'}
- ```
@@ -1 +0,0 @@
- # This empty file marks the package as typed for mypy
@@ -1,6 +0,0 @@
- import sys
- from pathlib import Path
-
- # Add the project root to the Python path
- root_dir = Path(__file__).parent.parent
- sys.path.insert(0, str(root_dir))
@@ -1,35 +0,0 @@
- from typer.testing import CliRunner
-
- from src.iparq.source import app
-
-
- def test_empty():
- assert True
-
-
- def test_parquet_info():
- """Test that the CLI correctly displays parquet file information."""
- runner = CliRunner()
- result = runner.invoke(app, ["dummy.parquet"])
-
- assert result.exit_code == 0
-
- expected_output = """ParquetMetaModel(
- created_by='parquet-cpp-arrow version 14.0.2',
- num_columns=3,
- num_rows=3,
- num_row_groups=1,
- format_version='2.6',
- serialized_size=2223
- )
- Parquet Column Information
- ┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
- ┃ Row Group ┃ Column Name ┃ Index ┃ Compression ┃ Bloom Filter ┃
- ┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
- │ 0 │ one │ 0 │ SNAPPY │ ✅ │
- │ 0 │ two │ 1 │ SNAPPY │ ✅ │
- │ 0 │ three │ 2 │ SNAPPY │ ✅ │
- └───────────┴─────────────┴───────┴─────────────┴──────────────┘
- Compression codecs: {'SNAPPY'}"""
-
- assert expected_output in result.stdout