iparq 0.1.4__tar.gz → 0.1.5__tar.gz

@@ -170,3 +170,4 @@ cython_debug/
  # PyPI configuration file
  .pypirc
  .github/.DS_Store
+ yellow_tripdata_2024-01.parquet
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: iparq
- Version: 0.1.4
+ Version: 0.1.5
  Summary: Display version and compression information about a parquet file
  Author-email: MiguelElGallo <miguel.zurcher@gmail.com>
  License-File: LICENSE
@@ -69,7 +69,7 @@ After reading [this blog](https://duckdb.org/2025/01/22/parquet-encodings.html),
  brew install MiguelElGallo/tap/iparq
  iparq --help
  ```
-
+
  ## Usage

  Run
@@ -80,17 +80,77 @@ iparq <filename>

  Replace `<filename>` with the path to your .parquet file. The utility will read the metadata of the file and print the compression codecs used in the parquet file.

-
  ## Example output

  ```log
  ParquetMetaModel(
  created_by='parquet-cpp-arrow version 14.0.2',
- num_columns=3,
- num_rows=3,
- num_row_groups=1,
+ num_columns=19,
+ num_rows=2964624,
+ num_row_groups=3,
  format_version='2.6',
- serialized_size=2223
+ serialized_size=6357
  )
- Compression codecs: {'SNAPPY'}
+ Column Compression Info:
+ Row Group 0:
+ Column 'VendorID' (Index 0): ZSTD
+ Column 'tpep_pickup_datetime' (Index 1): ZSTD
+ Column 'tpep_dropoff_datetime' (Index 2): ZSTD
+ Column 'passenger_count' (Index 3): ZSTD
+ Column 'trip_distance' (Index 4): ZSTD
+ Column 'RatecodeID' (Index 5): ZSTD
+ Column 'store_and_fwd_flag' (Index 6): ZSTD
+ Column 'PULocationID' (Index 7): ZSTD
+ Column 'DOLocationID' (Index 8): ZSTD
+ Column 'payment_type' (Index 9): ZSTD
+ Column 'fare_amount' (Index 10): ZSTD
+ Column 'extra' (Index 11): ZSTD
+ Column 'mta_tax' (Index 12): ZSTD
+ Column 'tip_amount' (Index 13): ZSTD
+ Column 'tolls_amount' (Index 14): ZSTD
+ Column 'improvement_surcharge' (Index 15): ZSTD
+ Column 'total_amount' (Index 16): ZSTD
+ Column 'congestion_surcharge' (Index 17): ZSTD
+ Column 'Airport_fee' (Index 18): ZSTD
+ Row Group 1:
+ Column 'VendorID' (Index 0): ZSTD
+ Column 'tpep_pickup_datetime' (Index 1): ZSTD
+ Column 'tpep_dropoff_datetime' (Index 2): ZSTD
+ Column 'passenger_count' (Index 3): ZSTD
+ Column 'trip_distance' (Index 4): ZSTD
+ Column 'RatecodeID' (Index 5): ZSTD
+ Column 'store_and_fwd_flag' (Index 6): ZSTD
+ Column 'PULocationID' (Index 7): ZSTD
+ Column 'DOLocationID' (Index 8): ZSTD
+ Column 'payment_type' (Index 9): ZSTD
+ Column 'fare_amount' (Index 10): ZSTD
+ Column 'extra' (Index 11): ZSTD
+ Column 'mta_tax' (Index 12): ZSTD
+ Column 'tip_amount' (Index 13): ZSTD
+ Column 'tolls_amount' (Index 14): ZSTD
+ Column 'improvement_surcharge' (Index 15): ZSTD
+ Column 'total_amount' (Index 16): ZSTD
+ Column 'congestion_surcharge' (Index 17): ZSTD
+ Column 'Airport_fee' (Index 18): ZSTD
+ Row Group 2:
+ Column 'VendorID' (Index 0): ZSTD
+ Column 'tpep_pickup_datetime' (Index 1): ZSTD
+ Column 'tpep_dropoff_datetime' (Index 2): ZSTD
+ Column 'passenger_count' (Index 3): ZSTD
+ Column 'trip_distance' (Index 4): ZSTD
+ Column 'RatecodeID' (Index 5): ZSTD
+ Column 'store_and_fwd_flag' (Index 6): ZSTD
+ Column 'PULocationID' (Index 7): ZSTD
+ Column 'DOLocationID' (Index 8): ZSTD
+ Column 'payment_type' (Index 9): ZSTD
+ Column 'fare_amount' (Index 10): ZSTD
+ Column 'extra' (Index 11): ZSTD
+ Column 'mta_tax' (Index 12): ZSTD
+ Column 'tip_amount' (Index 13): ZSTD
+ Column 'tolls_amount' (Index 14): ZSTD
+ Column 'improvement_surcharge' (Index 15): ZSTD
+ Column 'total_amount' (Index 16): ZSTD
+ Column 'congestion_surcharge' (Index 17): ZSTD
+ Column 'Airport_fee' (Index 18): ZSTD
+ Compression codecs: {'ZSTD'}
  ```
iparq-0.1.5/README.md ADDED
@@ -0,0 +1,139 @@
+ # iparq
+
+ [![Python package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml)
+
+ [![Dependabot Updates](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates)
+
+ [![Upload Python Package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml)
+
+ ![alt text](media/iparq.png)
+ After reading [this blog](https://duckdb.org/2025/01/22/parquet-encodings.html), I began to wonder which Parquet version and compression methods the everyday tools we rely on actually use, only to find that there’s no straightforward way to determine this. That curiosity and the difficulty of quickly discovering such details motivated me to create iparq (Information Parquet). My goal with iparq is to help users easily identify the specifics of the Parquet files generated by different engines, making it clear which features—like newer encodings or certain compression algorithms—the creator of the parquet is using.
+
+ ## Installation
+
+ ### Using pip
+
+ 1) Install the package using pip:
+
+ ```sh
+ pip install iparq
+ ```
+
+ 2) Verify the installation by running:
+
+ ```sh
+ iparq --help
+ ```
+
+ ### Using uv
+
+ 1) Make sure to have Astral’s UV installed by following the steps here:
+
+ <https://docs.astral.sh/uv/getting-started/installation/>
+
+ 2) Execute the following command:
+
+ ```sh
+ uv pip install iparq
+ ```
+
+ 3) Verify the installation by running:
+
+ ```sh
+ iparq --help
+ ```
+
+ ### Using Homebrew in a MAC
+
+ 1) Run the following:
+
+ ```sh
+ brew tap MiguelElGallo/tap https://github.com/MiguelElGallo//homebrew-iparq.git
+ brew install MiguelElGallo/tap/iparq
+ iparq --help
+ ```
+
+ ## Usage
+
+ Run
+
+ ```sh
+ iparq <filename>
+ ```
+
+ Replace `<filename>` with the path to your .parquet file. The utility will read the metadata of the file and print the compression codecs used in the parquet file.
+
+ ## Example output
+
+ ```log
+ ParquetMetaModel(
+ created_by='parquet-cpp-arrow version 14.0.2',
+ num_columns=19,
+ num_rows=2964624,
+ num_row_groups=3,
+ format_version='2.6',
+ serialized_size=6357
+ )
+ Column Compression Info:
+ Row Group 0:
+ Column 'VendorID' (Index 0): ZSTD
+ Column 'tpep_pickup_datetime' (Index 1): ZSTD
+ Column 'tpep_dropoff_datetime' (Index 2): ZSTD
+ Column 'passenger_count' (Index 3): ZSTD
+ Column 'trip_distance' (Index 4): ZSTD
+ Column 'RatecodeID' (Index 5): ZSTD
+ Column 'store_and_fwd_flag' (Index 6): ZSTD
+ Column 'PULocationID' (Index 7): ZSTD
+ Column 'DOLocationID' (Index 8): ZSTD
+ Column 'payment_type' (Index 9): ZSTD
+ Column 'fare_amount' (Index 10): ZSTD
+ Column 'extra' (Index 11): ZSTD
+ Column 'mta_tax' (Index 12): ZSTD
+ Column 'tip_amount' (Index 13): ZSTD
+ Column 'tolls_amount' (Index 14): ZSTD
+ Column 'improvement_surcharge' (Index 15): ZSTD
+ Column 'total_amount' (Index 16): ZSTD
+ Column 'congestion_surcharge' (Index 17): ZSTD
+ Column 'Airport_fee' (Index 18): ZSTD
+ Row Group 1:
+ Column 'VendorID' (Index 0): ZSTD
+ Column 'tpep_pickup_datetime' (Index 1): ZSTD
+ Column 'tpep_dropoff_datetime' (Index 2): ZSTD
+ Column 'passenger_count' (Index 3): ZSTD
+ Column 'trip_distance' (Index 4): ZSTD
+ Column 'RatecodeID' (Index 5): ZSTD
+ Column 'store_and_fwd_flag' (Index 6): ZSTD
+ Column 'PULocationID' (Index 7): ZSTD
+ Column 'DOLocationID' (Index 8): ZSTD
+ Column 'payment_type' (Index 9): ZSTD
+ Column 'fare_amount' (Index 10): ZSTD
+ Column 'extra' (Index 11): ZSTD
+ Column 'mta_tax' (Index 12): ZSTD
+ Column 'tip_amount' (Index 13): ZSTD
+ Column 'tolls_amount' (Index 14): ZSTD
+ Column 'improvement_surcharge' (Index 15): ZSTD
+ Column 'total_amount' (Index 16): ZSTD
+ Column 'congestion_surcharge' (Index 17): ZSTD
+ Column 'Airport_fee' (Index 18): ZSTD
+ Row Group 2:
+ Column 'VendorID' (Index 0): ZSTD
+ Column 'tpep_pickup_datetime' (Index 1): ZSTD
+ Column 'tpep_dropoff_datetime' (Index 2): ZSTD
+ Column 'passenger_count' (Index 3): ZSTD
+ Column 'trip_distance' (Index 4): ZSTD
+ Column 'RatecodeID' (Index 5): ZSTD
+ Column 'store_and_fwd_flag' (Index 6): ZSTD
+ Column 'PULocationID' (Index 7): ZSTD
+ Column 'DOLocationID' (Index 8): ZSTD
+ Column 'payment_type' (Index 9): ZSTD
+ Column 'fare_amount' (Index 10): ZSTD
+ Column 'extra' (Index 11): ZSTD
+ Column 'mta_tax' (Index 12): ZSTD
+ Column 'tip_amount' (Index 13): ZSTD
+ Column 'tolls_amount' (Index 14): ZSTD
+ Column 'improvement_surcharge' (Index 15): ZSTD
+ Column 'total_amount' (Index 16): ZSTD
+ Column 'congestion_surcharge' (Index 17): ZSTD
+ Column 'Airport_fee' (Index 18): ZSTD
+ Compression codecs: {'ZSTD'}
+ ```
@@ -1,6 +1,6 @@
  [project]
  name = "iparq"
- version = "0.1.4"
+ version = "0.1.5"
  description = "Display version and compression information about a parquet file"
  readme = "README.md"
  authors = [
@@ -94,6 +94,32 @@ def print_parquet_metadata(parquet_metadata):
          pass


+ def print_compression_types(parquet_metadata) -> None:
+     """
+     Prints the compression type for each column in each row group of the Parquet file.
+     """
+     try:
+         num_row_groups = parquet_metadata.num_row_groups
+         num_columns = parquet_metadata.num_columns
+         console.print("[bold underline]Column Compression Info:[/bold underline]")
+         for i in range(num_row_groups):
+             console.print(f"[bold]Row Group {i}:[/bold]")
+             for j in range(num_columns):
+                 column_chunk = parquet_metadata.row_group(i).column(j)
+                 compression = column_chunk.compression
+                 column_name = parquet_metadata.schema.column(j).name
+                 console.print(
+                     f" Column '{column_name}' (Index {j}): [italic]{compression}[/italic]"
+                 )
+     except Exception as e:
+         console.print(
+             f"Error while printing compression types: {e}",
+             style="blink bold red underline on white",
+         )
+     finally:
+         pass
+
+
  @app.command()
  def main(filename: str):
      """
@@ -107,9 +133,8 @@ def main(filename: str):
      """
      (parquet_metadata, compression) = read_parquet_metadata(filename)

-     print_parquet_metadata(
-         parquet_metadata,
-     )
+     print_parquet_metadata(parquet_metadata)
+     print_compression_types(parquet_metadata)
      print(f"Compression codecs: {compression}")


@@ -0,0 +1,2 @@
+ def test_empty():
+     assert True
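
Note on the `print_compression_types` hunk above: the new function walks every row group and every column and reads `compression` from each column chunk. The sketch below shows the same traversal as a function that returns data instead of printing it, which is easier to assert on. It is an illustration, not the package's code: `collect_compression` and `make_stub` are hypothetical names, and the stub only imitates the attributes the hunk relies on (`num_row_groups`, `num_columns`, `row_group(i).column(j).compression`, `schema.column(j).name`).

```python
from types import SimpleNamespace


def collect_compression(metadata):
    """Return {row_group_index: [(column_name, codec), ...]}.

    `metadata` is any object exposing the attributes used in the hunk above.
    """
    result = {}
    for i in range(metadata.num_row_groups):
        row_group = metadata.row_group(i)  # looked up once per row group
        result[i] = [
            (metadata.schema.column(j).name, row_group.column(j).compression)
            for j in range(metadata.num_columns)
        ]
    return result


def make_stub(codecs_by_group, names):
    """Build a hypothetical stand-in for Parquet file metadata (illustration only)."""

    def row_group(i):
        return SimpleNamespace(
            column=lambda j: SimpleNamespace(compression=codecs_by_group[i][j])
        )

    schema = SimpleNamespace(column=lambda j: SimpleNamespace(name=names[j]))
    return SimpleNamespace(
        num_row_groups=len(codecs_by_group),
        num_columns=len(names),
        row_group=row_group,
        schema=schema,
    )


# Two row groups, two columns; the second group mixes codecs on purpose.
stub = make_stub([["ZSTD", "ZSTD"], ["ZSTD", "SNAPPY"]], ["VendorID", "trip_distance"])
info = collect_compression(stub)
```

Assuming the metadata object is a PyArrow `FileMetaData` (for example from `pyarrow.parquet.ParquetFile(path).metadata`), the same function should work on a real file unchanged. How iparq itself obtains `parquet_metadata` is not visible in this diff.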
iparq-0.1.4/README.md DELETED
@@ -1,79 +0,0 @@
- # iparq
-
- [![Python package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-package.yml)
-
- [![Dependabot Updates](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/dependabot/dependabot-updates)
-
- [![Upload Python Package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml)
-
- ![alt text](media/iparq.png)
- After reading [this blog](https://duckdb.org/2025/01/22/parquet-encodings.html), I began to wonder which Parquet version and compression methods the everyday tools we rely on actually use, only to find that there’s no straightforward way to determine this. That curiosity and the difficulty of quickly discovering such details motivated me to create iparq (Information Parquet). My goal with iparq is to help users easily identify the specifics of the Parquet files generated by different engines, making it clear which features—like newer encodings or certain compression algorithms—the creator of the parquet is using.
-
- ## Installation
-
- ### Using pip
-
- 1) Install the package using pip:
-
- ```sh
- pip install iparq
- ```
-
- 2) Verify the installation by running:
-
- ```sh
- iparq --help
- ```
-
- ### Using uv
-
- 1) Make sure to have Astral’s UV installed by following the steps here:
-
- <https://docs.astral.sh/uv/getting-started/installation/>
-
- 2) Execute the following command:
-
- ```sh
- uv pip install iparq
- ```
-
- 3) Verify the installation by running:
-
- ```sh
- iparq --help
- ```
-
- ### Using Homebrew in a MAC
-
- 1) Run the following:
-
- ```sh
- brew tap MiguelElGallo/tap https://github.com/MiguelElGallo//homebrew-iparq.git
- brew install MiguelElGallo/tap/iparq
- iparq --help
- ```
-
- ## Usage
-
- Run
-
- ```sh
- iparq <filename>
- ```
-
- Replace `<filename>` with the path to your .parquet file. The utility will read the metadata of the file and print the compression codecs used in the parquet file.
-
-
- ## Example output
-
- ```log
- ParquetMetaModel(
- created_by='parquet-cpp-arrow version 14.0.2',
- num_columns=3,
- num_rows=3,
- num_row_groups=1,
- format_version='2.6',
- serialized_size=2223
- )
- Compression codecs: {'SNAPPY'}
- ```
@@ -1,46 +0,0 @@
- import shutil
- import subprocess
- from pathlib import Path
-
- import pytest
- from pydantic import BaseModel
-
-
- class FileCopyConfig(BaseModel):
-     source: Path
-     destination: Path
-
-
- @pytest.fixture
- def copy_file(tmp_path: Path) -> Path:
-     config = FileCopyConfig(
-         source=Path("../dummy.parquet"), destination=tmp_path / "dummy.parquet"
-     )
-     try:
-         shutil.copy(config.source, config.destination)
-     except FileNotFoundError:
-         print("Source file not found.")
-     finally:
-         print("Copy operation complete.")
-     return config.destination
-
-
- def test_empty():
-     assert True
-
-
- def test_dummy_parquet(copy_file: Path) -> None:
-     try:
-         result = subprocess.run(
-             ["iparq", str(copy_file)],
-             capture_output=True,
-             text=True,
-             check=True,
-         )
-         data = result.stdout
-         assert "SNAPPY" in data
-         assert "2.6" in data
-     except subprocess.CalledProcessError as e:
-         print(f"Test failed with error: {e}")
-     finally:
-         print("Test execution complete.")
File without changes