pyarrow-bigquery 0.1.0__tar.gz

@@ -0,0 +1,262 @@
1
+ Metadata-Version: 2.1
2
+ Name: pyarrow-bigquery
3
+ Version: 0.1.0
4
+ Summary: A simple library to **write to** and **download from** BigQuery tables as PyArrow tables.
5
+ Author-email: Sebastian Pawluś <sebastian.pawlus@gmail.com>
6
+ License: MIT
7
+ Keywords: pyarrow,bigquery
8
+ Description-Content-Type: text/markdown
9
+ Requires-Dist: google-cloud-bigquery<5,>=3
10
+ Requires-Dist: google-cloud-bigquery-storage<3,>=2
11
+ Requires-Dist: pyarrow<17,>=16
12
+ Requires-Dist: tenacity
13
+
14
+ # pyarrow-bigquery
15
+
16
+ A simple library to **write to** and **download from** BigQuery tables as PyArrow tables.
17
+
18
+ ## Installation
19
+
20
+ ```bash
21
+ pip install pyarrow-bigquery
22
+ ```
23
+
24
+ ## Quick Start
25
+
26
+ This guide will help you quickly get started with `pyarrow-bigquery`, a library that allows you to **read** from and **write** to Google BigQuery using PyArrow.
27
+
28
+ ### Reading from BigQuery
29
+
30
+ `pyarrow-bigquery` exposes two methods to read BigQuery tables as PyArrow tables. Depending on your use case or the size of the table, you might want to use one method over the other.
31
+
32
+ #### Read the Whole Table
33
+
34
+ When the table is small enough to fit in memory, you can read it directly using `bq.read_table`.
35
+
36
+ ```python
37
+ import pyarrow.bigquery as bq
38
+
39
+ table = bq.read_table("gcp_project.dataset.small_table")
40
+
41
+ print(table.num_rows)
42
+ ```
43
+
44
+ #### Read with Batches
45
+
46
+ If the target table is larger than memory or you have other reasons not to fetch the whole table at once, you can use the `bq.reader` iterator method along with the `batch_size` parameter to limit how much data is fetched per iteration.
47
+
48
+ ```python
49
+ import pyarrow.bigquery as bq
50
+
51
+ for table in bq.reader("gcp_project.dataset.big_table", batch_size=100):
52
+     print(table.num_rows)
53
+ ```
54
+
55
+ ### Writing to BigQuery
56
+
57
+ Similarly, the package exposes two methods to write to BigQuery. Depending on your use case or the size of the table, you might want to use one method over the other.
58
+
59
+ #### Write the Whole Table
60
+
61
+ When you want to write a complete table at once, you can use the `bq.write_table` method.
62
+
63
+ ```python
64
+ import pyarrow as pa
65
+ import pyarrow.bigquery as bq
66
+
67
+ table = pa.Table.from_arrays([[1, 2, 3, 4]], names=['integers'])
68
+
69
+ bq.write_table(table, 'gcp_project.dataset.table')
70
+ ```
71
+
72
+ #### Write in Batches (Smaller Chunks)
73
+
74
+ If you need to write data in smaller chunks, you can use the `bq.writer` method with the `schema` parameter to define the table structure.
75
+
76
+ ```python
77
+ import pyarrow as pa
78
+ import pyarrow.bigquery as bq
79
+
80
+ schema = pa.schema([
81
+     ("integers", pa.int64())
82
+ ])
83
+
84
+ with bq.writer("gcp_project.dataset.table", schema=schema) as w:
85
+     w.write_batch(record_batch)  # record_batch: a pa.RecordBatch matching `schema`
85
+     w.write_table(table)  # table: a pa.Table matching `schema`
87
+ ```
88
+
89
+ ## API Reference
90
+
91
+ ### `pyarrow.bigquery.write_table`
92
+
93
+ Write a PyArrow Table to a BigQuery Table. No return value.
94
+
95
+ **Parameters:**
96
+
97
+ - `table`: `pa.Table`
98
+ PyArrow table.
99
+
100
+ - `where`: `str`
101
+ Destination location in BigQuery catalog.
102
+
103
+ - `project`: `str`, *default* `None`
104
+ BigQuery execution project, also the billing project. If not provided, it will be extracted from `where`.
105
+
106
+ - `table_create`: `bool`, *default* `True`
107
+ Specifies if the BigQuery table should be created.
108
+
109
+ - `table_expire`: `None | int`, *default* `None`
110
+ Number of seconds after which the created table will expire. Used only if `table_create` is `True`. Set to `None` to disable expiration.
111
+
112
+ - `table_overwrite`: `bool`, *default* `False`
113
+ If the table already exists, destroy it and create a new one.
114
+
115
+ - `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`
116
+ Worker backend for writing data.
117
+
118
+ - `worker_count`: `int`, *default* `os.cpu_count()`
119
+ Number of threads or processes to use for writing data to BigQuery.
120
+
121
+ - `batch_size`: `int`, *default* `100`
122
+ Batch size used for writes. Table will be automatically split to this value.
123
+
124
+ ```python
125
+ bq.write_table(table, 'gcp_project.dataset.table')
126
+ ```
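
The `table_expire` parameter above is given in seconds, while BigQuery itself stores an absolute expiration time on the table. The following is a minimal sketch of that conversion using only the standard library. The helper name `expiration_time` and its `now` argument are hypothetical and are not part of `pyarrow-bigquery`; how the package performs this step internally is not shown in the source.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def expiration_time(table_expire: Optional[int],
                    now: Optional[datetime] = None) -> Optional[datetime]:
    """Hypothetical helper: convert `table_expire` (seconds) to an absolute UTC time."""
    if table_expire is None:
        # None means the table should never expire.
        return None
    if now is None:
        now = datetime.now(timezone.utc)
    return now + timedelta(seconds=table_expire)


# A table created at midnight UTC with table_expire=3600 would expire at 01:00 UTC.
base = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(expiration_time(3600, now=base))
```

Passing `now` explicitly keeps the helper deterministic, which makes it easy to check in isolation.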
127
+
128
+ ### `pyarrow.bigquery.writer`
129
+
130
+ Context manager version of `write_table`. Useful when the PyArrow table is larger than memory or the data arrives in chunks.
131
+
132
+ **Parameters:**
133
+
134
+ - `schema`: `pa.Schema`
135
+ PyArrow schema.
136
+
137
+ - `where`: `str`
138
+ Destination location in BigQuery catalog.
139
+
140
+ - `project`: `str`, *default* `None`
141
+ BigQuery execution project, also the billing project. If not provided, it will be extracted from `where`.
142
+
143
+ - `table_create`: `bool`, *default* `True`
144
+ Specifies if the BigQuery table should be created.
145
+
146
+ - `table_expire`: `None | int`, *default* `None`
147
+ Number of seconds after which the created table will expire. Used only if `table_create` is `True`. Set to `None` to disable expiration.
148
+
149
+ - `table_overwrite`: `bool`, *default* `False`
150
+ If the table already exists, destroy it and create a new one.
151
+
152
+ - `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`
153
+ Worker backend for writing data.
154
+
155
+ - `worker_count`: `int`, *default* `os.cpu_count()`
156
+ Number of threads or processes to use for writing data to BigQuery.
157
+
158
+ - `batch_size`: `int`, *default* `100`
159
+ Batch size used for writes. Table will be automatically split to this value.
160
+
161
+ Depending on the use case, you might want to use one of the methods below to write your data to a BigQuery table, using either `pa.Table` or `pa.RecordBatch`.
162
+
163
+ #### `pyarrow.bigquery.writer.write_table`
164
+
165
+ Context manager method to write a table.
166
+
167
+ **Parameters:**
168
+
169
+ - `table`: `pa.Table`
170
+ PyArrow table.
171
+
172
+ ```python
173
+ import pyarrow as pa
174
+ import pyarrow.bigquery as bq
175
+
176
+ schema = pa.schema([("value", pa.list_(pa.int64()))])
177
+
178
+ with bq.writer("gcp_project.dataset.table", schema=schema) as w:
179
+     for a in range(1000):
180
+         w.write_table(pa.Table.from_pylist([{'value': [a] * 10}]))
181
+ ```
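
The `batch_size` description for the writer says a table is automatically split to that value. As an illustration only, the sketch below shows that kind of splitting on a plain Python list of rows. `split_rows` is a hypothetical name; the package presumably splits Arrow data rather than lists, and its actual logic may differ.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_rows(rows: Sequence[T], batch_size: int = 100) -> Iterator[List[T]]:
    """Hypothetical helper: yield consecutive chunks of at most `batch_size` rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(rows), batch_size):
        # The final chunk may be shorter than batch_size.
        yield list(rows[start:start + batch_size])


chunks = list(split_rows(list(range(250)), batch_size=100))
print([len(chunk) for chunk in chunks])
```

With 250 rows and a batch size of 100, this yields two full chunks and one partial chunk of 50 rows.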
182
+
183
+ #### `pyarrow.bigquery.writer.write_batch`
184
+
185
+ Context manager method to write a record batch.
186
+
187
+ **Parameters:**
188
+
189
+ - `batch`: `pa.RecordBatch`
190
+ PyArrow record batch.
191
+
192
+ ```python
193
+ import pyarrow as pa
194
+ import pyarrow.bigquery as bq
195
+
196
+ schema = pa.schema([("value", pa.list_(pa.int64()))])
197
+
198
+ with bq.writer("gcp_project.dataset.table", schema=schema) as w:
199
+     for a in range(1000):
200
+         w.write_batch(pa.RecordBatch.from_pylist([{'value': [1] * 10}]))
201
+ ```
202
+
203
+ ### `pyarrow.bigquery.read_table`
204
+
205
+ **Parameters:**
206
+
207
+ - `source`: `str`
208
+ BigQuery table location.
209
+
210
+ - `project`: `str`, *default* `None`
211
+ BigQuery execution project, also the billing project. If not provided, it will be extracted from `source`.
212
+
213
+ - `columns`: `str`, *default* `None`
214
+ Columns to download. When not provided, all available columns will be downloaded.
215
+
216
+ - `row_restrictions`: `str`, *default* `None`
217
+ Row level filtering executed on the BigQuery side. More in [BigQuery documentation](https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1beta1).
218
+
219
+ - `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`
220
+ Worker backend for fetching data.
221
+
222
+ - `worker_count`: `int`, *default* `os.cpu_count()`
223
+ Number of threads or processes to use for fetching data from BigQuery.
224
+
225
+ - `batch_size`: `int`, *default* `100`
226
+ Batch size used for fetching. Table will be automatically split to this value.
227
+
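
The `project` parameter is described as being extracted from `source` when it is not provided. A minimal sketch of what such a fallback could look like is shown below, assuming a plain three-part `project.dataset.table` identifier. `TableId` and `split_table_id` are hypothetical names, not part of the package, and identifiers that contain extra dots or a `domain:project` prefix are deliberately not handled.

```python
from typing import NamedTuple, Optional, Tuple


class TableId(NamedTuple):
    project: str
    dataset: str
    table: str


def split_table_id(source: str, project: Optional[str] = None) -> Tuple[TableId, str]:
    """Hypothetical helper: parse 'project.dataset.table' and choose the billing project."""
    parts = source.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected 'project.dataset.table', got {source!r}")
    table_id = TableId(*parts)
    # An explicit `project` wins; otherwise fall back to the one embedded in `source`.
    billing_project = project if project is not None else table_id.project
    return table_id, billing_project


table_id, billing = split_table_id("gcp_project.dataset.table")
print(table_id.dataset, billing)
```

Keeping the table's own project separate from the billing project reflects the wording of the parameter description: the table can live in one project while the read is executed and billed in another.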
228
+ ### `pyarrow.bigquery.reader`
229
+
230
+ **Parameters:**
231
+
232
+ - `source`: `str`
233
+ BigQuery table location.
234
+
235
+ - `project`: `str`, *default* `None`
236
+ BigQuery execution project, also the billing project. If not provided, it will be extracted from `source`.
237
+
238
+ - `columns`: `str`, *default* `None`
239
+ Columns to download. When not provided, all available columns will be downloaded.
240
+
241
+ - `row_restrictions`: `str`, *default* `None`
242
+ Row level filtering executed on the BigQuery side. More in [BigQuery documentation](https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1beta1).
243
+
244
+ - `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`
245
+ Worker backend for fetching data.
246
+
247
+ - `worker_count`: `int`, *default* `os.cpu_count()`
248
+ Number of threads or processes to use for fetching data from BigQuery.
249
+
250
+ - `batch_size`: `int`, *default* `100`
251
+ Batch size used for fetching. Table will be automatically split to this value.
252
+
253
+ ```python
254
+ import pyarrow as pa
255
+ import pyarrow.bigquery as bq
256
+
257
+ parts = []
258
+ for part in bq.reader("gcp_project.dataset.table"):
259
+     parts.append(part)
260
+
261
+ table = pa.concat_tables(parts)
262
+ ```
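
The `worker_type` and `worker_count` parameters describe a pool of workers that fetch data concurrently. The sketch below illustrates the general pattern with threads only. `run_workers`, `tasks` and `handle` are hypothetical names and do not reflect the package's internals. A process-based backend would additionally need `multiprocessing` queues and picklable callables, which this sketch does not cover.

```python
import queue
import threading
from typing import Callable, Iterable, List, Type, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_workers(tasks: Iterable[T],
                handle: Callable[[T], R],
                worker_type: Type[threading.Thread] = threading.Thread,
                worker_count: int = 4) -> List[R]:
    """Hypothetical helper: process `tasks` with `worker_count` threads.

    Results are returned in completion order, not input order.
    """
    pending: "queue.Queue[T]" = queue.Queue()
    for task in tasks:
        pending.put(task)
    results: "queue.Queue[R]" = queue.Queue()

    def work() -> None:
        # Each worker pulls tasks until the shared queue is empty.
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return
            results.put(handle(task))

    workers = [worker_type(target=work) for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    # All workers have finished, so the results queue is no longer changing.
    return [results.get_nowait() for _ in range(results.qsize())]


squares = run_workers(range(10), lambda x: x * x, worker_count=3)
print(sorted(squares))
```

Because results arrive in completion order, callers that need a stable order have to sort or tag them, as the `sorted` call in the usage line does.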
@@ -0,0 +1,249 @@
1
+ # pyarrow-bigquery
2
+
3
+ A simple library to **write to** and **download from** BigQuery tables as PyArrow tables.
4
+
5
+ ## Installation
6
+
7
+ ```bash
8
+ pip install pyarrow-bigquery
9
+ ```
10
+
11
+ ## Quick Start
12
+
13
+ This guide will help you quickly get started with `pyarrow-bigquery`, a library that allows you to **read** from and **write** to Google BigQuery using PyArrow.
14
+
15
+ ### Reading from BigQuery
16
+
17
+ `pyarrow-bigquery` exposes two methods to read BigQuery tables as PyArrow tables. Depending on your use case or the size of the table, you might want to use one method over the other.
18
+
19
+ #### Read the Whole Table
20
+
21
+ When the table is small enough to fit in memory, you can read it directly using `bq.read_table`.
22
+
23
+ ```python
24
+ import pyarrow.bigquery as bq
25
+
26
+ table = bq.read_table("gcp_project.dataset.small_table")
27
+
28
+ print(table.num_rows)
29
+ ```
30
+
31
+ #### Read with Batches
32
+
33
+ If the target table is larger than memory or you have other reasons not to fetch the whole table at once, you can use the `bq.reader` iterator method along with the `batch_size` parameter to limit how much data is fetched per iteration.
34
+
35
+ ```python
36
+ import pyarrow.bigquery as bq
37
+
38
+ for table in bq.reader("gcp_project.dataset.big_table", batch_size=100):
39
+     print(table.num_rows)
40
+ ```
41
+
42
+ ### Writing to BigQuery
43
+
44
+ Similarly, the package exposes two methods to write to BigQuery. Depending on your use case or the size of the table, you might want to use one method over the other.
45
+
46
+ #### Write the Whole Table
47
+
48
+ When you want to write a complete table at once, you can use the `bq.write_table` method.
49
+
50
+ ```python
51
+ import pyarrow as pa
52
+ import pyarrow.bigquery as bq
53
+
54
+ table = pa.Table.from_arrays([[1, 2, 3, 4]], names=['integers'])
55
+
56
+ bq.write_table(table, 'gcp_project.dataset.table')
57
+ ```
58
+
59
+ #### Write in Batches (Smaller Chunks)
60
+
61
+ If you need to write data in smaller chunks, you can use the `bq.writer` method with the `schema` parameter to define the table structure.
62
+
63
+ ```python
64
+ import pyarrow as pa
65
+ import pyarrow.bigquery as bq
66
+
67
+ schema = pa.schema([
68
+     ("integers", pa.int64())
69
+ ])
70
+
71
+ with bq.writer("gcp_project.dataset.table", schema=schema) as w:
72
+     w.write_batch(record_batch)  # record_batch: a pa.RecordBatch matching `schema`
73
+     w.write_table(table)  # table: a pa.Table matching `schema`
74
+ ```
75
+
76
+ ## API Reference
77
+
78
+ ### `pyarrow.bigquery.write_table`
79
+
80
+ Write a PyArrow Table to a BigQuery Table. No return value.
81
+
82
+ **Parameters:**
83
+
84
+ - `table`: `pa.Table`
85
+ PyArrow table.
86
+
87
+ - `where`: `str`
88
+ Destination location in BigQuery catalog.
89
+
90
+ - `project`: `str`, *default* `None`
91
+ BigQuery execution project, also the billing project. If not provided, it will be extracted from `where`.
92
+
93
+ - `table_create`: `bool`, *default* `True`
94
+ Specifies if the BigQuery table should be created.
95
+
96
+ - `table_expire`: `None | int`, *default* `None`
97
+ Number of seconds after which the created table will expire. Used only if `table_create` is `True`. Set to `None` to disable expiration.
98
+
99
+ - `table_overwrite`: `bool`, *default* `False`
100
+ If the table already exists, destroy it and create a new one.
101
+
102
+ - `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`
103
+ Worker backend for writing data.
104
+
105
+ - `worker_count`: `int`, *default* `os.cpu_count()`
106
+ Number of threads or processes to use for writing data to BigQuery.
107
+
108
+ - `batch_size`: `int`, *default* `100`
109
+ Batch size used for writes. Table will be automatically split to this value.
110
+
111
+ ```python
112
+ bq.write_table(table, 'gcp_project.dataset.table')
113
+ ```
114
+
115
+ ### `pyarrow.bigquery.writer`
116
+
117
+ Context manager version of `write_table`. Useful when the PyArrow table is larger than memory or the data arrives in chunks.
118
+
119
+ **Parameters:**
120
+
121
+ - `schema`: `pa.Schema`
122
+ PyArrow schema.
123
+
124
+ - `where`: `str`
125
+ Destination location in BigQuery catalog.
126
+
127
+ - `project`: `str`, *default* `None`
128
+ BigQuery execution project, also the billing project. If not provided, it will be extracted from `where`.
129
+
130
+ - `table_create`: `bool`, *default* `True`
131
+ Specifies if the BigQuery table should be created.
132
+
133
+ - `table_expire`: `None | int`, *default* `None`
134
+ Number of seconds after which the created table will expire. Used only if `table_create` is `True`. Set to `None` to disable expiration.
135
+
136
+ - `table_overwrite`: `bool`, *default* `False`
137
+ If the table already exists, destroy it and create a new one.
138
+
139
+ - `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`
140
+ Worker backend for writing data.
141
+
142
+ - `worker_count`: `int`, *default* `os.cpu_count()`
143
+ Number of threads or processes to use for writing data to BigQuery.
144
+
145
+ - `batch_size`: `int`, *default* `100`
146
+ Batch size used for writes. Table will be automatically split to this value.
147
+
148
+ Depending on the use case, you might want to use one of the methods below to write your data to a BigQuery table, using either `pa.Table` or `pa.RecordBatch`.
149
+
150
+ #### `pyarrow.bigquery.writer.write_table`
151
+
152
+ Context manager method to write a table.
153
+
154
+ **Parameters:**
155
+
156
+ - `table`: `pa.Table`
157
+ PyArrow table.
158
+
159
+ ```python
160
+ import pyarrow as pa
161
+ import pyarrow.bigquery as bq
162
+
163
+ schema = pa.schema([("value", pa.list_(pa.int64()))])
164
+
165
+ with bq.writer("gcp_project.dataset.table", schema=schema) as w:
166
+     for a in range(1000):
167
+         w.write_table(pa.Table.from_pylist([{'value': [a] * 10}]))
168
+ ```
169
+
170
+ #### `pyarrow.bigquery.writer.write_batch`
171
+
172
+ Context manager method to write a record batch.
173
+
174
+ **Parameters:**
175
+
176
+ - `batch`: `pa.RecordBatch`
177
+ PyArrow record batch.
178
+
179
+ ```python
180
+ import pyarrow as pa
181
+ import pyarrow.bigquery as bq
182
+
183
+ schema = pa.schema([("value", pa.list_(pa.int64()))])
184
+
185
+ with bq.writer("gcp_project.dataset.table", schema=schema) as w:
186
+     for a in range(1000):
187
+         w.write_batch(pa.RecordBatch.from_pylist([{'value': [1] * 10}]))
188
+ ```
189
+
190
+ ### `pyarrow.bigquery.read_table`
191
+
192
+ **Parameters:**
193
+
194
+ - `source`: `str`
195
+ BigQuery table location.
196
+
197
+ - `project`: `str`, *default* `None`
198
+ BigQuery execution project, also the billing project. If not provided, it will be extracted from `source`.
199
+
200
+ - `columns`: `str`, *default* `None`
201
+ Columns to download. When not provided, all available columns will be downloaded.
202
+
203
+ - `row_restrictions`: `str`, *default* `None`
204
+ Row level filtering executed on the BigQuery side. More in [BigQuery documentation](https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1beta1).
205
+
206
+ - `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`
207
+ Worker backend for fetching data.
208
+
209
+ - `worker_count`: `int`, *default* `os.cpu_count()`
210
+ Number of threads or processes to use for fetching data from BigQuery.
211
+
212
+ - `batch_size`: `int`, *default* `100`
213
+ Batch size used for fetching. Table will be automatically split to this value.
214
+
215
+ ### `pyarrow.bigquery.reader`
216
+
217
+ **Parameters:**
218
+
219
+ - `source`: `str`
220
+ BigQuery table location.
221
+
222
+ - `project`: `str`, *default* `None`
223
+ BigQuery execution project, also the billing project. If not provided, it will be extracted from `source`.
224
+
225
+ - `columns`: `str`, *default* `None`
226
+ Columns to download. When not provided, all available columns will be downloaded.
227
+
228
+ - `row_restrictions`: `str`, *default* `None`
229
+ Row level filtering executed on the BigQuery side. More in [BigQuery documentation](https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1beta1).
230
+
231
+ - `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`
232
+ Worker backend for fetching data.
233
+
234
+ - `worker_count`: `int`, *default* `os.cpu_count()`
235
+ Number of threads or processes to use for fetching data from BigQuery.
236
+
237
+ - `batch_size`: `int`, *default* `100`
238
+ Batch size used for fetching. Table will be automatically split to this value.
239
+
240
+ ```python
241
+ import pyarrow as pa
242
+ import pyarrow.bigquery as bq
243
+
244
+ parts = []
245
+ for part in bq.reader("gcp_project.dataset.table"):
246
+     parts.append(part)
247
+
248
+ table = pa.concat_tables(parts)
249
+ ```
@@ -0,0 +1,36 @@
1
+ [project]
2
+ name = "pyarrow-bigquery"
3
+ version = "0.1.0"
4
+ description = "A simple library to **write to** and **download from** BigQuery tables as PyArrow tables."
5
+ authors = [{ name = "Sebastian Pawluś", email = "sebastian.pawlus@gmail.com" }]
6
+ readme = "README.md"
7
+ keywords = ["pyarrow", "bigquery"]
8
+
9
+ dependencies = [
10
+ "google-cloud-bigquery>=3,<5",
11
+ "google-cloud-bigquery-storage>=2,<3",
12
+ "pyarrow>=16,<17",
13
+ "tenacity"
14
+ ]
15
+
16
+ [project.license]
17
+ text = "MIT"
18
+
19
+ [build-system]
20
+ requires = ["setuptools>=40.6.0", "wheel"]
21
+ build-backend = "setuptools.build_meta"
22
+
23
+ [tool.setuptools.packages.find]
24
+ where = ["src/"]
25
+ include = ["pyarrow.bigquery*"]
26
+
27
+ [[tool.mypy.overrides]]
28
+ module = ["pyarrow.*"]
29
+ ignore_missing_imports = true
30
+
31
+
32
+ [tool.ruff]
33
+ exclude = [".git"]
34
+
35
+ line-length = 120
36
+ indent-width = 4
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,2 @@
1
+ from .read import reader, read_table, reader_query, read_query # noqa
2
+ from .write import writer, write_table # noqa