iparq 0.2.0__tar.gz → 0.2.5__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {iparq-0.2.0 → iparq-0.2.5}/.github/workflows/merge.yml +1 -1
- iparq-0.2.5/.github/workflows/python-package.yml +50 -0
- {iparq-0.2.0 → iparq-0.2.5}/.vscode/launch.json +1 -3
- iparq-0.2.5/PKG-INFO +145 -0
- iparq-0.2.5/README.md +127 -0
- {iparq-0.2.0 → iparq-0.2.5}/pyproject.toml +2 -2
- iparq-0.2.5/src/iparq/py.typed +0 -0
- {iparq-0.2.0 → iparq-0.2.5}/src/iparq/source.py +80 -13
- iparq-0.2.5/tests/conftest.py +0 -0
- iparq-0.2.5/tests/dummy.parquet +0 -0
- iparq-0.2.5/tests/test_cli.py +78 -0
- {iparq-0.2.0 → iparq-0.2.5}/uv.lock +6 -4
- iparq-0.2.0/.github/workflows/python-package.yml +0 -41
- iparq-0.2.0/PKG-INFO +0 -229
- iparq-0.2.0/README.md +0 -211
- iparq-0.2.0/src/iparq/py.typed +0 -1
- iparq-0.2.0/tests/conftest.py +0 -6
- iparq-0.2.0/tests/test_cli.py +0 -35
- {iparq-0.2.0 → iparq-0.2.5}/.github/copilot-instructions.md +0 -0
- {iparq-0.2.0 → iparq-0.2.5}/.github/dependabot.yml +0 -0
- {iparq-0.2.0 → iparq-0.2.5}/.github/workflows/python-publish.yml +0 -0
- {iparq-0.2.0 → iparq-0.2.5}/.gitignore +0 -0
- {iparq-0.2.0 → iparq-0.2.5}/.python-version +0 -0
- {iparq-0.2.0 → iparq-0.2.5}/.vscode/settings.json +0 -0
- {iparq-0.2.0 → iparq-0.2.5}/CONTRIBUTING.md +0 -0
- {iparq-0.2.0 → iparq-0.2.5}/LICENSE +0 -0
- {iparq-0.2.0 → iparq-0.2.5}/dummy.parquet +0 -0
- {iparq-0.2.0 → iparq-0.2.5}/media/iparq.png +0 -0
- {iparq-0.2.0 → iparq-0.2.5}/src/iparq/__init__.py +0 -0
iparq-0.2.5/.github/workflows/python-package.yml
ADDED

@@ -0,0 +1,50 @@
+# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
+# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
+
+name: Python package
+on:
+  push:
+    branches: [ "main" ]
+  pull_request:
+    branches: [ "main" ]
+
+jobs:
+  build:
+    permissions:
+      contents: read
+      pull-requests: write
+    name: Test ${{ matrix.os }} Python ${{ matrix.python_version }}
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os: ["ubuntu-20.04", "windows-latest"]
+        python_version: ["3.9", "3.10", "3.11", "3.12", "3.13"]
+    env:
+      UV_SYSTEM_PYTHON: 1
+    steps:
+      - uses: actions/checkout@v4
+      - name: Setup python
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python_version }}
+          architecture: x64
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      # dependencies are in uv.lock
+      - name: Install dependencies
+        run: |
+          uv sync --all-extras
+
+      - name: Lint with ruff
+        run: uv run ruff check .
+      - name: Check types with mypy
+        run: |
+          cd src/iparq
+          uv run mypy . --config-file=../../pyproject.toml
+      - name: Check formatting with black
+        run: uvx black . --check --verbose
+      - name: Run Python tests
+        if: runner.os != 'Windows'
+        run: uv run pytest -vv
iparq-0.2.5/PKG-INFO
ADDED

@@ -0,0 +1,145 @@
+Metadata-Version: 2.4
+Name: iparq
+Version: 0.2.5
+Summary: Display version compression and bloom filter information about a parquet file
+Author-email: MiguelElGallo <miguel.zurcher@gmail.com>
+License-File: LICENSE
+Requires-Python: >=3.9
+Requires-Dist: pyarrow
+Requires-Dist: pydantic
+Requires-Dist: rich
+Requires-Dist: typer[all]
+Provides-Extra: checks
+Requires-Dist: mypy>=1.14.1; extra == 'checks'
+Requires-Dist: ruff>=0.9.3; extra == 'checks'
+Provides-Extra: test
+Requires-Dist: pytest>=7.0; extra == 'test'
+Description-Content-Type: text/markdown
+
+# iparq
+
+[](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml)
+
+[](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates)
+
+[](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml)
+
+
+After reading [this blog](https://duckdb.org/2025/01/22/parquet-encodings.html), I began to wonder which Parquet version and compression methods the everyday tools we rely on actually use, only to find that there's no straightforward way to determine this. That curiosity, and the difficulty of quickly discovering such details, motivated me to create iparq (Information Parquet). My goal with iparq is to help users easily identify the specifics of the Parquet files generated by different engines, making it clear which features (like newer encodings or certain compression algorithms) the creator of the Parquet file is using.
+
+***New*** Bloom filter information: displays whether bloom filters are present.
+Read more about bloom filters in this [great article](https://duckdb.org/2025/03/07/parquet-bloom-filters-in-duckdb.html).
+
+## Installation
+
+### Zero installation - Recommended
+
+1) Make sure to have Astral's UV installed by following the steps here:
+
+<https://docs.astral.sh/uv/getting-started/installation/>
+
+2) Execute the following command:
+
+```sh
+uvx --refresh iparq inspect yourparquet.parquet
+```
+
+### Using pip
+
+1) Install the package using pip:
+
+```sh
+pip install iparq
+```
+
+2) Verify the installation by running:
+
+```sh
+iparq --help
+```
+
+### Using uv
+
+1) Make sure to have Astral's UV installed by following the steps here:
+
+<https://docs.astral.sh/uv/getting-started/installation/>
+
+2) Execute the following command:
+
+```sh
+uv pip install iparq
+```
+
+3) Verify the installation by running:
+
+```sh
+iparq --help
+```
+
+### Using Homebrew on a Mac
+
+1) Run the following:
+
+```sh
+brew tap MiguelElGallo/tap https://github.com/MiguelElGallo//homebrew-iparq.git
+brew install MiguelElGallo/tap/iparq
+iparq --help
+```
+
+## Usage
+
+iparq now supports additional options:
+
+```sh
+iparq inspect <filename> [OPTIONS]
+```
+
+Options include:
+
+- `--format`, `-f`: Output format, either `rich` (default) or `json`
+- `--metadata-only`, `-m`: Show only file metadata without column details
+- `--column`, `-c`: Filter results to show only a specific column
+
+Examples:
+
+```sh
+# Output in JSON format
+iparq inspect yourfile.parquet --format json
+
+# Show only metadata
+iparq inspect yourfile.parquet --metadata-only
+
+# Filter to show only a specific column
+iparq inspect yourfile.parquet --column column_name
+```
+
+Replace `<filename>` with the path to your .parquet file. The utility will read the file's metadata and print the compression codecs used in the Parquet file.
+
+## Example output - Bloom Filters
+
+```log
+ParquetMetaModel(
+    created_by='DuckDB version v1.2.1 (build 8e52ec4395)',
+    num_columns=1,
+    num_rows=100000000,
+    num_row_groups=10,
+    format_version='1.0',
+    serialized_size=1196
+)
+Parquet Column Information
+┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
+┃ Row Group ┃ Column Name ┃ Index ┃ Compression ┃ Bloom Filter ┃
+┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
+│ 0         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 1         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 2         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 3         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 4         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 5         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 6         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 7         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 8         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 9         │ r           │ 0     │ SNAPPY      │ ✅           │
+└───────────┴─────────────┴───────┴─────────────┴──────────────┘
+Compression codecs: {'SNAPPY'}
+```
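The new PKG-INFO above advertises bloom filter reporting per column chunk. As background, here is a minimal stdlib-only sketch of the general bloom filter idea; every name and parameter in it is invented for the illustration, and it is not Parquet's split-block bloom filter nor code from iparq.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: a few hash-derived positions per item in a fixed bit array."""

    def __init__(self, size: int = 1024, num_hashes: int = 3) -> None:
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per bit, for clarity

    def _positions(self, item: str):
        # Derive num_hashes deterministic positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False -> definitely absent; True -> possibly present (false positives allowed).
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("alice")
bf.add("bob")
print(bf.might_contain("alice"))  # True: added items always hit
print(bf.might_contain("zoe"))    # almost certainly False for a nearly empty filter
```

The "definitely absent" answer is what lets a reader skip a whole row group without decoding it, which is why iparq surfaces the presence of these filters.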
iparq-0.2.5/README.md
ADDED

@@ -0,0 +1,127 @@
+# iparq
+
+[](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml)
+
+[](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates)
+
+[](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml)
+
+
+After reading [this blog](https://duckdb.org/2025/01/22/parquet-encodings.html), I began to wonder which Parquet version and compression methods the everyday tools we rely on actually use, only to find that there's no straightforward way to determine this. That curiosity, and the difficulty of quickly discovering such details, motivated me to create iparq (Information Parquet). My goal with iparq is to help users easily identify the specifics of the Parquet files generated by different engines, making it clear which features (like newer encodings or certain compression algorithms) the creator of the Parquet file is using.
+
+***New*** Bloom filter information: displays whether bloom filters are present.
+Read more about bloom filters in this [great article](https://duckdb.org/2025/03/07/parquet-bloom-filters-in-duckdb.html).
+
+## Installation
+
+### Zero installation - Recommended
+
+1) Make sure to have Astral's UV installed by following the steps here:
+
+<https://docs.astral.sh/uv/getting-started/installation/>
+
+2) Execute the following command:
+
+```sh
+uvx --refresh iparq inspect yourparquet.parquet
+```
+
+### Using pip
+
+1) Install the package using pip:
+
+```sh
+pip install iparq
+```
+
+2) Verify the installation by running:
+
+```sh
+iparq --help
+```
+
+### Using uv
+
+1) Make sure to have Astral's UV installed by following the steps here:
+
+<https://docs.astral.sh/uv/getting-started/installation/>
+
+2) Execute the following command:
+
+```sh
+uv pip install iparq
+```
+
+3) Verify the installation by running:
+
+```sh
+iparq --help
+```
+
+### Using Homebrew on a Mac
+
+1) Run the following:
+
+```sh
+brew tap MiguelElGallo/tap https://github.com/MiguelElGallo//homebrew-iparq.git
+brew install MiguelElGallo/tap/iparq
+iparq --help
+```
+
+## Usage
+
+iparq now supports additional options:
+
+```sh
+iparq inspect <filename> [OPTIONS]
+```
+
+Options include:
+
+- `--format`, `-f`: Output format, either `rich` (default) or `json`
+- `--metadata-only`, `-m`: Show only file metadata without column details
+- `--column`, `-c`: Filter results to show only a specific column
+
+Examples:
+
+```sh
+# Output in JSON format
+iparq inspect yourfile.parquet --format json
+
+# Show only metadata
+iparq inspect yourfile.parquet --metadata-only
+
+# Filter to show only a specific column
+iparq inspect yourfile.parquet --column column_name
+```
+
+Replace `<filename>` with the path to your .parquet file. The utility will read the file's metadata and print the compression codecs used in the Parquet file.
+
+## Example output - Bloom Filters
+
+```log
+ParquetMetaModel(
+    created_by='DuckDB version v1.2.1 (build 8e52ec4395)',
+    num_columns=1,
+    num_rows=100000000,
+    num_row_groups=10,
+    format_version='1.0',
+    serialized_size=1196
+)
+Parquet Column Information
+┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
+┃ Row Group ┃ Column Name ┃ Index ┃ Compression ┃ Bloom Filter ┃
+┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
+│ 0         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 1         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 2         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 3         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 4         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 5         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 6         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 7         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 8         │ r           │ 0     │ SNAPPY      │ ✅           │
+│ 9         │ r           │ 0     │ SNAPPY      │ ✅           │
+└───────────┴─────────────┴───────┴─────────────┴──────────────┘
+Compression codecs: {'SNAPPY'}
+```
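The `--format json` option documented in the README above emits a single JSON object. Here is a stand-alone sketch of that shape, using plain dicts in place of the package's pydantic models; the values are copied from the example output, while the per-column field names are assumptions modeled on the table headers, not necessarily the exact keys iparq emits.

```python
import json

# Mirrors the {"metadata", "columns", "compression_codecs"} structure
# produced by --format json; values taken from the example output above.
metadata = {
    "created_by": "DuckDB version v1.2.1 (build 8e52ec4395)",
    "num_columns": 1,
    "num_rows": 100000000,
    "num_row_groups": 10,
    "format_version": "1.0",
    "serialized_size": 1196,
}
columns = [
    # Hypothetical field names modeled on the table headers.
    {"row_group": 0, "column_name": "r", "index": 0,
     "compression": "SNAPPY", "bloom_filter": True},
]
result = {
    "metadata": metadata,
    "columns": columns,
    "compression_codecs": sorted({"SNAPPY"}),  # sets are not JSON-serializable
}
print(json.dumps(result, indent=2))
```

A machine-readable shape like this is what makes the `json` format pipeable into tools such as `jq`, unlike the rich-rendered table.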
{iparq-0.2.0 → iparq-0.2.5}/pyproject.toml

@@ -1,6 +1,6 @@
 [project]
 name = "iparq"
-version = "0.2.0"
+version = "0.2.5"
 description = "Display version compression and bloom filter information about a parquet file"
 readme = "README.md"
 authors = [

@@ -31,7 +31,7 @@ requires = ["hatchling"]
 build-backend = "hatchling.build"

 [tool.pytest.ini_options]
-addopts = "-ra -q"
+addopts = ["-ra", "-q"]
 testpaths = [
     "tests",
 ]
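The `addopts` change above replaces pytest's string form with the list form. Both are accepted: the string form is shell-style whitespace-split, and the list form simply makes that tokenization explicit. A quick sketch of the equivalence using the stdlib's `shlex`:

```python
import shlex

# The old string value is split the way a shell would split it...
old_form = "-ra -q"
print(shlex.split(old_form))  # ['-ra', '-q']

# ...which yields exactly the new explicit list value.
new_form = ["-ra", "-q"]
print(shlex.split(old_form) == new_form)  # True
```

The list form avoids any ambiguity once an option itself needs to contain a space.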
File without changes
{iparq-0.2.0 → iparq-0.2.5}/src/iparq/source.py

@@ -1,3 +1,5 @@
+import json
+from enum import Enum
 from typing import List, Optional

 import pyarrow.parquet as pq

@@ -7,10 +9,19 @@ from rich import print
 from rich.console import Console
 from rich.table import Table

-app = typer.Typer(
+app = typer.Typer(
+    help="Inspect Parquet files for metadata, compression, and bloom filters"
+)
 console = Console()


+class OutputFormat(str, Enum):
+    """Enum for output format options."""
+
+    RICH = "rich"
+    JSON = "json"
+
+
 class ParquetMetaModel(BaseModel):
     """
     ParquetMetaModel is a data model representing metadata for a Parquet file.

@@ -227,20 +238,59 @@ def print_column_info_table(column_info: ParquetColumnInfo) -> None:
     console.print(table)


-
-
+def output_json(
+    meta_model: ParquetMetaModel,
+    column_info: ParquetColumnInfo,
+    compression_codecs: set,
+) -> None:
     """
-
+    Outputs the parquet information in JSON format.

     Args:
-
-
-
-
+        meta_model: The Parquet metadata model
+        column_info: The column information model
+        compression_codecs: Set of compression codecs used
+    """
+    result = {
+        "metadata": meta_model.model_dump(),
+        "columns": [column.model_dump() for column in column_info.columns],
+        "compression_codecs": list(compression_codecs),
+    }
+
+    print(json.dumps(result, indent=2))
+
+
+@app.command(name="")
+@app.command(name="inspect")
+def inspect(
+    filename: str = typer.Argument(..., help="Path to the Parquet file to inspect"),
+    format: OutputFormat = typer.Option(
+        OutputFormat.RICH, "--format", "-f", help="Output format (rich or json)"
+    ),
+    metadata_only: bool = typer.Option(
+        False,
+        "--metadata-only",
+        "-m",
+        help="Show only file metadata without column details",
+    ),
+    column_filter: Optional[str] = typer.Option(
+        None, "--column", "-c", help="Filter results to show only specific column"
+    ),
+):
+    """
+    Inspect a Parquet file and display its metadata, compression settings, and bloom filter information.
     """
     (parquet_metadata, compression) = read_parquet_metadata(filename)

-
+    # Create metadata model
+    meta_model = ParquetMetaModel(
+        created_by=parquet_metadata.created_by,
+        num_columns=parquet_metadata.num_columns,
+        num_rows=parquet_metadata.num_rows,
+        num_row_groups=parquet_metadata.num_row_groups,
+        format_version=str(parquet_metadata.format_version),
+        serialized_size=parquet_metadata.serialized_size,
+    )

     # Create a model to store column information
     column_info = ParquetColumnInfo()

@@ -249,10 +299,27 @@ def main(filename: str):
     print_compression_types(parquet_metadata, column_info)
     print_bloom_filter_info(parquet_metadata, column_info)

-    #
-
-
-
+    # Filter columns if requested
+    if column_filter:
+        column_info.columns = [
+            col for col in column_info.columns if col.column_name == column_filter
+        ]
+        if not column_info.columns:
+            console.print(
+                f"No columns match the filter: {column_filter}", style="yellow"
+            )
+
+    # Output based on format selection
+    if format == OutputFormat.JSON:
+        output_json(meta_model, column_info, compression)
+    else:  # Rich format
+        # Print the metadata
+        console.print(meta_model)
+
+        # Print column details if not metadata only
+        if not metadata_only:
+            print_column_info_table(column_info)
+    console.print(f"Compression codecs: {compression}")


 if __name__ == "__main__":
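The `--column` option in the new `inspect` command above is implemented as a plain list comprehension over the collected column entries. The same filtering logic can be sketched stand-alone with plain dicts in place of the pydantic column objects (the field names here are chosen for the example):

```python
columns = [
    {"column_name": "one", "compression": "SNAPPY"},
    {"column_name": "two", "compression": "SNAPPY"},
    {"column_name": "three", "compression": "SNAPPY"},
]


def filter_columns(entries, column_filter=None):
    """Keep only entries matching column_filter; no filter keeps everything."""
    if not column_filter:
        return entries
    return [e for e in entries if e["column_name"] == column_filter]


print(filter_columns(columns, "one"))
# [{'column_name': 'one', 'compression': 'SNAPPY'}]
print(filter_columns(columns, "missing"))
# [] -> the CLI prints a "No columns match the filter" warning in this case
```

Filtering after the metadata has been collected keeps `read_parquet_metadata` unchanged; only the presentation layer is narrowed.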
File without changes

Binary file
iparq-0.2.5/tests/test_cli.py
ADDED

@@ -0,0 +1,78 @@
+import json
+from pathlib import Path
+
+from typer.testing import CliRunner
+
+from iparq.source import app
+
+# Define path to test fixtures
+FIXTURES_DIR = Path(__file__).parent
+fixture_path = FIXTURES_DIR / "dummy.parquet"
+
+
+def test_parquet_info():
+    """Test that the CLI correctly displays parquet file information."""
+    runner = CliRunner()
+    result = runner.invoke(app, ["inspect", str(fixture_path)])
+
+    assert result.exit_code == 0
+
+    expected_output = """ParquetMetaModel(
+    created_by='parquet-cpp-arrow version 14.0.2',
+    num_columns=3,
+    num_rows=3,
+    num_row_groups=1,
+    format_version='2.6',
+    serialized_size=2223
+)
+Parquet Column Information
+┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
+┃ Row Group ┃ Column Name ┃ Index ┃ Compression ┃ Bloom Filter ┃
+┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
+│ 0         │ one         │ 0     │ SNAPPY      │ ✅           │
+│ 0         │ two         │ 1     │ SNAPPY      │ ✅           │
+│ 0         │ three       │ 2     │ SNAPPY      │ ✅           │
+└───────────┴─────────────┴───────┴─────────────┴──────────────┘
+Compression codecs: {'SNAPPY'}"""
+
+    assert expected_output in result.stdout
+
+
+def test_metadata_only_flag():
+    """Test that the metadata-only flag works correctly."""
+    runner = CliRunner()
+    fixture_path = FIXTURES_DIR / "dummy.parquet"
+    result = runner.invoke(app, ["inspect", "--metadata-only", str(fixture_path)])
+
+    assert result.exit_code == 0
+    assert "ParquetMetaModel" in result.stdout
+    assert "Parquet Column Information" not in result.stdout
+
+
+def test_column_filter():
+    """Test that filtering by column name works correctly."""
+    runner = CliRunner()
+    fixture_path = FIXTURES_DIR / "dummy.parquet"
+    result = runner.invoke(app, ["inspect", "--column", "one", str(fixture_path)])
+
+    assert result.exit_code == 0
+    assert "one" in result.stdout
+    assert "two" not in result.stdout
+
+
+def test_json_output():
+    """Test JSON output format."""
+    runner = CliRunner()
+    fixture_path = FIXTURES_DIR / "dummy.parquet"
+    result = runner.invoke(app, ["inspect", "--format", "json", str(fixture_path)])
+
+    assert result.exit_code == 0
+
+    # Test that output is valid JSON
+    data = json.loads(result.stdout)
+
+    # Check JSON structure
+    assert "metadata" in data
+    assert "columns" in data
+    assert "compression_codecs" in data
+    assert data["metadata"]["num_columns"] == 3
{iparq-0.2.0 → iparq-0.2.5}/uv.lock

@@ -52,11 +52,12 @@ wheels = [

 [[package]]
 name = "iparq"
-version = "0.2.0"
+version = "0.2.5"
 source = { editable = "." }
 dependencies = [
     { name = "pyarrow" },
     { name = "pydantic" },
+    { name = "rich" },
     { name = "typer" },
 ]

@@ -72,11 +73,12 @@ test = [
 [package.metadata]
 requires-dist = [
     { name = "mypy", marker = "extra == 'checks'", specifier = ">=1.14.1" },
-    { name = "pyarrow"
-    { name = "pydantic"
+    { name = "pyarrow" },
+    { name = "pydantic" },
     { name = "pytest", marker = "extra == 'test'", specifier = ">=7.0" },
+    { name = "rich" },
     { name = "ruff", marker = "extra == 'checks'", specifier = ">=0.9.3" },
-    { name = "typer",
+    { name = "typer", extras = ["all"] },
 ]
 provides-extras = ["test", "checks"]

iparq-0.2.0/.github/workflows/python-package.yml
DELETED

@@ -1,41 +0,0 @@
-# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
-# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
-
-name: Python package
-on:
-  push:
-    branches: [ "main" ]
-  pull_request:
-    branches: [ "main" ]
-
-jobs:
-  build:
-    permissions:
-      contents: read
-      pull-requests: write
-    runs-on: ubuntu-latest
-    strategy:
-      fail-fast: false
-      matrix:
-        python-version: ["3.9", "3.10", "3.11"]
-
-    steps:
-      - uses: actions/checkout@v4
-      - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v3
-        with:
-          python-version: ${{ matrix.python-version }}
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          python -m pip install flake8 pytest pydantic iparq
-          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
-      - name: Lint with flake8
-        run: |
-          # stop the build if there are Python syntax errors or undefined names
-          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
-          # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
-          flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
-      - name: Test with pytest
-        run: |
-          pytest
iparq-0.2.0/PKG-INFO
DELETED

@@ -1,229 +0,0 @@
-Metadata-Version: 2.4
-Name: iparq
-Version: 0.2.0
-Summary: Display version compression and bloom filter information about a parquet file
-Author-email: MiguelElGallo <miguel.zurcher@gmail.com>
-License-File: LICENSE
-Requires-Python: >=3.9
-Requires-Dist: pyarrow
-Requires-Dist: pydantic
-Requires-Dist: rich
-Requires-Dist: typer[all]
-Provides-Extra: checks
-Requires-Dist: mypy>=1.14.1; extra == 'checks'
-Requires-Dist: ruff>=0.9.3; extra == 'checks'
-Provides-Extra: test
-Requires-Dist: pytest>=7.0; extra == 'test'
-Description-Content-Type: text/markdown
-
-# iparq
-
-[](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml)
-
-[](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates)
-
-[](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml)
-
-
-After reading [this blog](https://duckdb.org/2025/01/22/parquet-encodings.html), I began to wonder which Parquet version and compression methods the everyday tools we rely on actually use, only to find that there’s no straightforward way to determine this. That curiosity and the difficulty of quickly discovering such details motivated me to create iparq (Information Parquet). My goal with iparq is to help users easily identify the specifics of the Parquet files generated by different engines, making it clear which features—like newer encodings or certain compression algorithms—the creator of the parquet is using.
-
-***New*** Bloom filters information: Displays if there are bloom filters.
-Read more about bloom filters in this [great article](https://duckdb.org/2025/03/07/parquet-bloom-filters-in-duckdb.html).
-
-
-## Installation
-
-### Zero installation - Recommended
-
-1) Make sure to have Astral’s UV installed by following the steps here:
-
-<https://docs.astral.sh/uv/getting-started/installation/>
-
-2) Execute the following command:
-
-```sh
-uvx iparq yourparquet.parquet
-```
-
-### Using pip
-
-1) Install the package using pip:
-
-```sh
-pip install iparq
-```
-
-2) Verify the installation by running:
-
-```sh
-iparq --help
-```
-
-### Using uv
-
-1) Make sure to have Astral’s UV installed by following the steps here:
-
-<https://docs.astral.sh/uv/getting-started/installation/>
-
-2) Execute the following command:
-
-```sh
-uv pip install iparq
-```
-
-3) Verify the installation by running:
-
-```sh
-iparq --help
-```
-
-### Using Homebrew in a MAC
-
-1) Run the following:
-
-```sh
-brew tap MiguelElGallo/tap https://github.com/MiguelElGallo//homebrew-iparq.git
-brew install MiguelElGallo/tap/iparq
-iparq —help
-```
-
-## Usage
-
-Run
-
-```sh
-iparq <filename>
-```
-
-Replace `<filename>` with the path to your .parquet file. The utility will read the metadata of the file and print the compression codecs used in the parquet file.
-
-## Example ouput - Bloom Filters
-
-```log
-ParquetMetaModel(
-    created_by='DuckDB version v1.2.1 (build 8e52ec4395)',
-    num_columns=1,
-    num_rows=100000000,
-    num_row_groups=10,
-    format_version='1.0',
-    serialized_size=1196
-)
-Column Compression Info:
-Row Group 0:
-  Column 'r' (Index 0): SNAPPY
-Row Group 1:
-  Column 'r' (Index 0): SNAPPY
-Row Group 2:
-  Column 'r' (Index 0): SNAPPY
-Row Group 3:
-  Column 'r' (Index 0): SNAPPY
-Row Group 4:
-  Column 'r' (Index 0): SNAPPY
-Row Group 5:
-  Column 'r' (Index 0): SNAPPY
-Row Group 6:
-  Column 'r' (Index 0): SNAPPY
-Row Group 7:
-  Column 'r' (Index 0): SNAPPY
-Row Group 8:
-  Column 'r' (Index 0): SNAPPY
-Row Group 9:
-  Column 'r' (Index 0): SNAPPY
-Bloom Filter Info:
-Row Group 0:
-  Column 'r' (Index 0): Has bloom filter
-Row Group 1:
-  Column 'r' (Index 0): Has bloom filter
-Row Group 2:
-  Column 'r' (Index 0): Has bloom filter
-Row Group 3:
-  Column 'r' (Index 0): Has bloom filter
-Row Group 4:
-  Column 'r' (Index 0): Has bloom filter
-Row Group 5:
-  Column 'r' (Index 0): Has bloom filter
-Row Group 6:
-  Column 'r' (Index 0): Has bloom filter
-Row Group 7:
-  Column 'r' (Index 0): Has bloom filter
-Row Group 8:
-  Column 'r' (Index 0): Has bloom filter
-Row Group 9:
-  Column 'r' (Index 0): Has bloom filter
-Compression codecs: {'SNAPPY'}
-```
-
-## Example output
-
-```log
-ParquetMetaModel(
-    created_by='parquet-cpp-arrow version 14.0.2',
-    num_columns=19,
-    num_rows=2964624,
-    num_row_groups=3,
-    format_version='2.6',
-    serialized_size=6357
-)
-Column Compression Info:
-Row Group 0:
-  Column 'VendorID' (Index 0): ZSTD
-  Column 'tpep_pickup_datetime' (Index 1): ZSTD
-  Column 'tpep_dropoff_datetime' (Index 2): ZSTD
-  Column 'passenger_count' (Index 3): ZSTD
-  Column 'trip_distance' (Index 4): ZSTD
-  Column 'RatecodeID' (Index 5): ZSTD
-  Column 'store_and_fwd_flag' (Index 6): ZSTD
-  Column 'PULocationID' (Index 7): ZSTD
-  Column 'DOLocationID' (Index 8): ZSTD
-  Column 'payment_type' (Index 9): ZSTD
-  Column 'fare_amount' (Index 10): ZSTD
-  Column 'extra' (Index 11): ZSTD
-  Column 'mta_tax' (Index 12): ZSTD
-  Column 'tip_amount' (Index 13): ZSTD
-  Column 'tolls_amount' (Index 14): ZSTD
-  Column 'improvement_surcharge' (Index 15): ZSTD
-  Column 'total_amount' (Index 16): ZSTD
-  Column 'congestion_surcharge' (Index 17): ZSTD
-  Column 'Airport_fee' (Index 18): ZSTD
-Row Group 1:
-  Column 'VendorID' (Index 0): ZSTD
-  Column 'tpep_pickup_datetime' (Index 1): ZSTD
-  Column 'tpep_dropoff_datetime' (Index 2): ZSTD
-  Column 'passenger_count' (Index 3): ZSTD
-  Column 'trip_distance' (Index 4): ZSTD
-  Column 'RatecodeID' (Index 5): ZSTD
-  Column 'store_and_fwd_flag' (Index 6): ZSTD
-  Column 'PULocationID' (Index 7): ZSTD
-  Column 'DOLocationID' (Index 8): ZSTD
-  Column 'payment_type' (Index 9): ZSTD
-  Column 'fare_amount' (Index 10): ZSTD
-  Column 'extra' (Index 11): ZSTD
-  Column 'mta_tax' (Index 12): ZSTD
-  Column 'tip_amount' (Index 13): ZSTD
-  Column 'tolls_amount' (Index 14): ZSTD
-  Column 'improvement_surcharge' (Index 15): ZSTD
-  Column 'total_amount' (Index 16): ZSTD
-  Column 'congestion_surcharge' (Index 17): ZSTD
-  Column 'Airport_fee' (Index 18): ZSTD
-Row Group 2:
-  Column 'VendorID' (Index 0): ZSTD
-  Column 'tpep_pickup_datetime' (Index 1): ZSTD
-  Column 'tpep_dropoff_datetime' (Index 2): ZSTD
-  Column 'passenger_count' (Index 3): ZSTD
-  Column 'trip_distance' (Index 4): ZSTD
-  Column 'RatecodeID' (Index 5): ZSTD
-  Column 'store_and_fwd_flag' (Index 6): ZSTD
-  Column 'PULocationID' (Index 7): ZSTD
-  Column 'DOLocationID' (Index 8): ZSTD
-  Column 'payment_type' (Index 9): ZSTD
-  Column 'fare_amount' (Index 10): ZSTD
-  Column 'extra' (Index 11): ZSTD
-  Column 'mta_tax' (Index 12): ZSTD
-  Column 'tip_amount' (Index 13): ZSTD
-  Column 'tolls_amount' (Index 14): ZSTD
-  Column 'improvement_surcharge' (Index 15): ZSTD
|
|
225
|
-
Column 'total_amount' (Index 16): ZSTD
|
|
226
|
-
Column 'congestion_surcharge' (Index 17): ZSTD
|
|
227
|
-
Column 'Airport_fee' (Index 18): ZSTD
|
|
228
|
-
Compression codecs: {'ZSTD'}
|
|
229
|
-
```
|
iparq-0.2.0/README.md DELETED
@@ -1,211 +0,0 @@
# iparq

[![Python package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml)

[![Dependabot Updates](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates)

[![Upload Python Package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml)

![LOGO](media/iparq.png)

After reading [this blog](https://duckdb.org/2025/01/22/parquet-encodings.html), I began to wonder which Parquet version and compression methods the everyday tools we rely on actually use, only to find that there is no straightforward way to determine this. That curiosity, and the difficulty of quickly discovering such details, motivated me to create iparq (Information Parquet). My goal with iparq is to help users easily identify the specifics of the Parquet files generated by different engines, making it clear which features (such as newer encodings or particular compression algorithms) the creator of the file is using.

***New*** Bloom filter information: displays whether bloom filters are present.
Read more about bloom filters in this [great article](https://duckdb.org/2025/03/07/parquet-bloom-filters-in-duckdb.html).

## Installation
### Zero installation - Recommended

1) Make sure you have Astral's uv installed by following the steps here:

<https://docs.astral.sh/uv/getting-started/installation/>

2) Execute the following command:

```sh
uvx iparq yourparquet.parquet
```

### Using pip

1) Install the package using pip:

```sh
pip install iparq
```

2) Verify the installation by running:

```sh
iparq --help
```

### Using uv

1) Make sure you have Astral's uv installed by following the steps here:

<https://docs.astral.sh/uv/getting-started/installation/>

2) Execute the following command:

```sh
uv pip install iparq
```

3) Verify the installation by running:

```sh
iparq --help
```

### Using Homebrew on a Mac

1) Run the following:

```sh
brew tap MiguelElGallo/tap https://github.com/MiguelElGallo//homebrew-iparq.git
brew install MiguelElGallo/tap/iparq
iparq --help
```
## Usage

Run

```sh
iparq <filename>
```

Replace `<filename>` with the path to your .parquet file. The utility reads the file's metadata and prints the compression codecs used in the Parquet file.
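Because everything iparq reports lives in the footer, it never has to touch the data pages: a Parquet file ends with the Thrift-serialized metadata, a 4-byte little-endian length, and the magic bytes `PAR1`. The stdlib-only sketch below illustrates that footer layout; it is an illustration of the format, not iparq's actual code, and the function name `parquet_footer_length` is made up for this example.

```python
import struct


def parquet_footer_length(path: str) -> int:
    """Validate the PAR1 magic and return the metadata footer length."""
    with open(path, "rb") as f:
        # The last 8 bytes are: <4-byte LE footer length><b"PAR1">
        f.seek(-8, 2)
        length_bytes, magic = f.read(4), f.read(4)
    if magic != b"PAR1":
        raise ValueError(f"{path} is not a Parquet file (missing PAR1 magic)")
    return struct.unpack("<I", length_bytes)[0]
```

On a valid file this returns how many bytes of metadata sit just before the trailing 8 bytes, which is what a metadata reader then parses.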
## Example output - Bloom Filters

```log
ParquetMetaModel(
    created_by='DuckDB version v1.2.1 (build 8e52ec4395)',
    num_columns=1,
    num_rows=100000000,
    num_row_groups=10,
    format_version='1.0',
    serialized_size=1196
)
Column Compression Info:
  Row Group 0:
    Column 'r' (Index 0): SNAPPY
  Row Group 1:
    Column 'r' (Index 0): SNAPPY
  Row Group 2:
    Column 'r' (Index 0): SNAPPY
  Row Group 3:
    Column 'r' (Index 0): SNAPPY
  Row Group 4:
    Column 'r' (Index 0): SNAPPY
  Row Group 5:
    Column 'r' (Index 0): SNAPPY
  Row Group 6:
    Column 'r' (Index 0): SNAPPY
  Row Group 7:
    Column 'r' (Index 0): SNAPPY
  Row Group 8:
    Column 'r' (Index 0): SNAPPY
  Row Group 9:
    Column 'r' (Index 0): SNAPPY
Bloom Filter Info:
  Row Group 0:
    Column 'r' (Index 0): Has bloom filter
  Row Group 1:
    Column 'r' (Index 0): Has bloom filter
  Row Group 2:
    Column 'r' (Index 0): Has bloom filter
  Row Group 3:
    Column 'r' (Index 0): Has bloom filter
  Row Group 4:
    Column 'r' (Index 0): Has bloom filter
  Row Group 5:
    Column 'r' (Index 0): Has bloom filter
  Row Group 6:
    Column 'r' (Index 0): Has bloom filter
  Row Group 7:
    Column 'r' (Index 0): Has bloom filter
  Row Group 8:
    Column 'r' (Index 0): Has bloom filter
  Row Group 9:
    Column 'r' (Index 0): Has bloom filter
Compression codecs: {'SNAPPY'}
```
## Example output

```log
ParquetMetaModel(
    created_by='parquet-cpp-arrow version 14.0.2',
    num_columns=19,
    num_rows=2964624,
    num_row_groups=3,
    format_version='2.6',
    serialized_size=6357
)
Column Compression Info:
  Row Group 0:
    Column 'VendorID' (Index 0): ZSTD
    Column 'tpep_pickup_datetime' (Index 1): ZSTD
    Column 'tpep_dropoff_datetime' (Index 2): ZSTD
    Column 'passenger_count' (Index 3): ZSTD
    Column 'trip_distance' (Index 4): ZSTD
    Column 'RatecodeID' (Index 5): ZSTD
    Column 'store_and_fwd_flag' (Index 6): ZSTD
    Column 'PULocationID' (Index 7): ZSTD
    Column 'DOLocationID' (Index 8): ZSTD
    Column 'payment_type' (Index 9): ZSTD
    Column 'fare_amount' (Index 10): ZSTD
    Column 'extra' (Index 11): ZSTD
    Column 'mta_tax' (Index 12): ZSTD
    Column 'tip_amount' (Index 13): ZSTD
    Column 'tolls_amount' (Index 14): ZSTD
    Column 'improvement_surcharge' (Index 15): ZSTD
    Column 'total_amount' (Index 16): ZSTD
    Column 'congestion_surcharge' (Index 17): ZSTD
    Column 'Airport_fee' (Index 18): ZSTD
  Row Group 1:
    Column 'VendorID' (Index 0): ZSTD
    Column 'tpep_pickup_datetime' (Index 1): ZSTD
    Column 'tpep_dropoff_datetime' (Index 2): ZSTD
    Column 'passenger_count' (Index 3): ZSTD
    Column 'trip_distance' (Index 4): ZSTD
    Column 'RatecodeID' (Index 5): ZSTD
    Column 'store_and_fwd_flag' (Index 6): ZSTD
    Column 'PULocationID' (Index 7): ZSTD
    Column 'DOLocationID' (Index 8): ZSTD
    Column 'payment_type' (Index 9): ZSTD
    Column 'fare_amount' (Index 10): ZSTD
    Column 'extra' (Index 11): ZSTD
    Column 'mta_tax' (Index 12): ZSTD
    Column 'tip_amount' (Index 13): ZSTD
    Column 'tolls_amount' (Index 14): ZSTD
    Column 'improvement_surcharge' (Index 15): ZSTD
    Column 'total_amount' (Index 16): ZSTD
    Column 'congestion_surcharge' (Index 17): ZSTD
    Column 'Airport_fee' (Index 18): ZSTD
  Row Group 2:
    Column 'VendorID' (Index 0): ZSTD
    Column 'tpep_pickup_datetime' (Index 1): ZSTD
    Column 'tpep_dropoff_datetime' (Index 2): ZSTD
    Column 'passenger_count' (Index 3): ZSTD
    Column 'trip_distance' (Index 4): ZSTD
    Column 'RatecodeID' (Index 5): ZSTD
    Column 'store_and_fwd_flag' (Index 6): ZSTD
    Column 'PULocationID' (Index 7): ZSTD
    Column 'DOLocationID' (Index 8): ZSTD
    Column 'payment_type' (Index 9): ZSTD
    Column 'fare_amount' (Index 10): ZSTD
    Column 'extra' (Index 11): ZSTD
    Column 'mta_tax' (Index 12): ZSTD
    Column 'tip_amount' (Index 13): ZSTD
    Column 'tolls_amount' (Index 14): ZSTD
    Column 'improvement_surcharge' (Index 15): ZSTD
    Column 'total_amount' (Index 16): ZSTD
    Column 'congestion_surcharge' (Index 17): ZSTD
    Column 'Airport_fee' (Index 18): ZSTD
Compression codecs: {'ZSTD'}
```
iparq-0.2.0/src/iparq/py.typed DELETED
@@ -1 +0,0 @@
# This empty file marks the package as typed for mypy
iparq-0.2.0/tests/conftest.py DELETED
iparq-0.2.0/tests/test_cli.py DELETED
@@ -1,35 +0,0 @@
from typer.testing import CliRunner

from src.iparq.source import app


def test_empty():
    assert True


def test_parquet_info():
    """Test that the CLI correctly displays parquet file information."""
    runner = CliRunner()
    result = runner.invoke(app, ["dummy.parquet"])

    assert result.exit_code == 0

    expected_output = """ParquetMetaModel(
    created_by='parquet-cpp-arrow version 14.0.2',
    num_columns=3,
    num_rows=3,
    num_row_groups=1,
    format_version='2.6',
    serialized_size=2223
)
                   Parquet Column Information
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Row Group ┃ Column Name ┃ Index ┃ Compression ┃ Bloom Filter ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ 0         │ one         │ 0     │ SNAPPY      │ ✅           │
│ 0         │ two         │ 1     │ SNAPPY      │ ✅           │
│ 0         │ three       │ 2     │ SNAPPY      │ ✅           │
└───────────┴─────────────┴───────┴─────────────┴──────────────┘
Compression codecs: {'SNAPPY'}"""

    assert expected_output in result.stdout
File without changes