db2pq-0.1.3.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
db2pq-0.1.3/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2023 Ian Gow
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
db2pq-0.1.3/PKG-INFO ADDED
@@ -0,0 +1,79 @@
+ Metadata-Version: 2.4
+ Name: db2pq
+ Version: 0.1.3
+ Summary: Convert database tables to parquet files.
+ Home-page: https://github.com/iangow/db2pq/
+ Author: Ian Gow
+ Author-email: iandgow@gmail.com
+ Classifier: Programming Language :: Python :: 3
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Requires-Python: >=3
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: ibis-framework[duckdb,postgres]
+ Requires-Dist: pyarrow
+ Requires-Dist: pandas
+ Requires-Dist: paramiko
+ Dynamic: author
+ Dynamic: author-email
+ Dynamic: classifier
+ Dynamic: description
+ Dynamic: description-content-type
+ Dynamic: home-page
+ Dynamic: license-file
+ Dynamic: requires-dist
+ Dynamic: requires-python
+ Dynamic: summary
+
+ # Library to convert PostgreSQL data to parquet files
+
+ This package was created to convert PostgreSQL data to parquet format.
+ It provides four main functions: three that export data at different scopes (a WRDS table, a generic PostgreSQL table, or an entire schema), plus an "update" function that only fetches new data when necessary.
+
+ - `wrds_pg_to_pq()`: Exports a WRDS PostgreSQL table to a parquet file.
+ - `db_to_pq()`: Exports a PostgreSQL table to a parquet file.
+ - `db_schema_to_pq()`: Exports a PostgreSQL schema to parquet files.
+ - `wrds_update_pq()`: A variant of `wrds_pg_to_pq()` that checks the "last modified" value for the relevant SAS file against that of the local parquet file before getting new data from the WRDS PostgreSQL server.
+
+ ## Requirements
+
+ ### 1. Python
+ The software uses Python 3 and depends on Ibis, `pyarrow` (the Python API for Apache Arrow), and Paramiko.
+ These dependencies are installed automatically when you install the package with pip:
+
+ ```bash
+ pip install db2pq --upgrade
+ ```
+
+ ### 2. A WRDS ID
+ To use public-key authentication to access WRDS, follow the hints [here](https://debian-administration.org/article/152/Password-less_logins_with_OpenSSH) to set up a public key.
+ Copy that key to the WRDS server from the terminal on your computer.
+ (Note that this step assumes you have a `.ssh` directory in your home directory on the WRDS server. If not, log in to WRDS via SSH, then type `mkdir ~/.ssh` to create it.)
+ Here's code to create the key and send it to WRDS:
+
+ ```bash
+ ssh-keygen -t rsa
+ cat ~/.ssh/id_rsa.pub | ssh $WRDS_ID@wrds-cloud-sshkey.wharton.upenn.edu "cat >> ~/.ssh/authorized_keys"
+ ```
+
+ Use an empty passphrase when setting up the key so that the scripts can run without user intervention.
+
+ ### 3. Environment variables
+
+ Environment variables that the code uses include:
+
+ - `WRDS_ID`: Your [WRDS](https://wrds-web.wharton.upenn.edu/wrds/) ID.
+ - `DATA_DIR`: The local repository for parquet files.
+
+ One can set these environment variables in (say) `~/.zprofile`:
+
+ ```bash
+ export WRDS_ID="iangow"
+ export DATA_DIR="~/Dropbox/pq_data"
+ ```
+
+ As an alternative to setting these environment variables, you can pass the values directly to the functions above via the `wrds_id` and `data_dir` arguments.
+
+ ### Report bugs
+ Author: Ian Gow, <iandgow@gmail.com>
db2pq-0.1.3/README.md ADDED
@@ -0,0 +1,51 @@
+ # Library to convert PostgreSQL data to parquet files
+
+ This package was created to convert PostgreSQL data to parquet format.
+ It provides four main functions: three that export data at different scopes (a WRDS table, a generic PostgreSQL table, or an entire schema), plus an "update" function that only fetches new data when necessary.
+
+ - `wrds_pg_to_pq()`: Exports a WRDS PostgreSQL table to a parquet file.
+ - `db_to_pq()`: Exports a PostgreSQL table to a parquet file.
+ - `db_schema_to_pq()`: Exports a PostgreSQL schema to parquet files.
+ - `wrds_update_pq()`: A variant of `wrds_pg_to_pq()` that checks the "last modified" value for the relevant SAS file against that of the local parquet file before getting new data from the WRDS PostgreSQL server.
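+
+ For example, a typical session looks like this (a minimal sketch, assuming `WRDS_ID` and `DATA_DIR` are set as described below):
+
+ ```python
+ from db2pq import db_to_pq, wrds_update_pq
+
+ # Download crsp.dsi from WRDS only if the local parquet file is out of date.
+ wrds_update_pq("dsi", "crsp")
+
+ # Export a table from a PostgreSQL database to DATA_DIR/crsp/dsi.parquet.
+ db_to_pq("dsi", "crsp")
+ ```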
+
+ ## Requirements
+
+ ### 1. Python
+ The software uses Python 3 and depends on Ibis, `pyarrow` (the Python API for Apache Arrow), and Paramiko.
+ These dependencies are installed automatically when you install the package with pip:
+
+ ```bash
+ pip install db2pq --upgrade
+ ```
+
+ ### 2. A WRDS ID
+ To use public-key authentication to access WRDS, follow the hints [here](https://debian-administration.org/article/152/Password-less_logins_with_OpenSSH) to set up a public key.
+ Copy that key to the WRDS server from the terminal on your computer.
+ (Note that this step assumes you have a `.ssh` directory in your home directory on the WRDS server. If not, log in to WRDS via SSH, then type `mkdir ~/.ssh` to create it.)
+ Here's code to create the key and send it to WRDS:
+
+ ```bash
+ ssh-keygen -t rsa
+ cat ~/.ssh/id_rsa.pub | ssh $WRDS_ID@wrds-cloud-sshkey.wharton.upenn.edu "cat >> ~/.ssh/authorized_keys"
+ ```
+
+ Use an empty passphrase when setting up the key so that the scripts can run without user intervention.
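+
+ To check that key-based authentication works (no password prompt should appear), you can try something like:
+
+ ```bash
+ ssh $WRDS_ID@wrds-cloud-sshkey.wharton.upenn.edu "echo ok"
+ ```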
+
+ ### 3. Environment variables
+
+ Environment variables that the code uses include:
+
+ - `WRDS_ID`: Your [WRDS](https://wrds-web.wharton.upenn.edu/wrds/) ID.
+ - `DATA_DIR`: The local repository for parquet files.
+
+ One can set these environment variables in (say) `~/.zprofile`:
+
+ ```bash
+ export WRDS_ID="iangow"
+ export DATA_DIR="~/Dropbox/pq_data"
+ ```
+
+ As an alternative to setting these environment variables, you can pass the values directly to the functions above via the `wrds_id` and `data_dir` arguments.
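+
+ For instance (with a hypothetical local directory):
+
+ ```python
+ from db2pq import wrds_update_pq
+
+ wrds_update_pq("dsi", "crsp", wrds_id="iangow", data_dir="~/pq_data")
+ ```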
+
+ ### Report bugs
+ Author: Ian Gow, <iandgow@gmail.com>
db2pq-0.1.3/db2pq/__init__.py ADDED
@@ -0,0 +1,6 @@
+ name = "db2pq"
+
+ from db2pq.db2pq import db_to_pq, wrds_pg_to_pq
+ from db2pq.db2pq import db_schema_to_pq, db_schema_tables
+ from db2pq.db2pq import wrds_update_pq, get_pq_files, update_schema
+ from db2pq.db2pq import get_modified_pq, pq_last_updated
db2pq-0.1.3/db2pq/db2pq.py ADDED
@@ -0,0 +1,726 @@
+ import ibis
+ import os
+ import ibis.selectors as s
+ from ibis import _
+ from tempfile import TemporaryFile
+ import pyarrow.parquet as pq
+ import re
+ import warnings
+ import paramiko
+ from pathlib import Path
+ from time import gmtime, strftime
+ import pandas as pd
+
+ client = paramiko.SSHClient()
+ wrds_id = os.getenv("WRDS_ID")
+ warnings.filterwarnings(action='ignore', module='.*paramiko.*')
+
+ def df_to_arrow(df, col_types=None, obs=None, batches=False):
+     """Convert an ibis table expression to pyarrow data,
+     optionally casting columns and limiting rows."""
+     if col_types:
+         types = set(col_types.values())
+         for type_ in types:
+             to_convert = [key for (key, value) in col_types.items()
+                           if value == type_]
+             df = df.mutate(s.across(to_convert, _.cast(type_)))
+
+     if obs:
+         df = df.limit(obs)
+
+     if batches:
+         return df.to_pyarrow_batches()
+     else:
+         return df.to_pyarrow()
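+ # Example (a sketch; `df` is any ibis table expression):
+ #     df_to_arrow(df, col_types={'permno': 'int32'}, obs=1000)
+ # casts `permno` to int32 and returns the first 1,000 rows as a pyarrow Table.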
+
+ def db_to_pq(table_name, schema,
+              user=os.getenv("PGUSER", default=os.getlogin()),
+              host=os.getenv("PGHOST", default="localhost"),
+              database=os.getenv("PGDATABASE", default=os.getlogin()),
+              port=os.getenv("PGPORT", default=5432),
+              data_dir=os.getenv("DATA_DIR", default=""),
+              col_types=None,
+              row_group_size=1048576,
+              obs=None,
+              modified=None,
+              alt_table_name=None,
+              keep=None,
+              drop=None,
+              batched=True,
+              threads=None):
+     """Export a PostgreSQL table to a parquet file.
+
+     Parameters
+     ----------
+     table_name:
+         Name of table in database.
+
+     schema:
+         Name of database schema.
+
+     user: string [Optional]
+         User role for the PostgreSQL database.
+         The default is to use the environment value `PGUSER`
+         or (if not set) the login user ID.
+
+     host: string [Optional]
+         Host name for the PostgreSQL server.
+         The default is to use the environment value `PGHOST`.
+
+     database: string [Optional]
+         Name of the PostgreSQL database.
+         The default is to use the environment value `PGDATABASE`
+         or (if not set) the user ID.
+
+     port: int [Optional]
+         Port for the PostgreSQL server.
+         The default is to use the environment value `PGPORT`
+         or (if not set) 5432.
+
+     data_dir: string [Optional]
+         Root directory of the parquet data repository.
+         The default is to use the environment value `DATA_DIR`
+         or (if not set) the current directory.
+
+     col_types: dict [Optional]
+         Dictionary of data types to cast columns to when writing to the parquet file.
+         Conversion from PostgreSQL to PyArrow types is handled by DuckDB.
+         Only a subset of columns needs to be supplied.
+         Supplied types should be compatible with data emitted by PostgreSQL
+         (i.e., one can't "fix" arbitrary type issues using this argument).
+         For example, `col_types = {'permno': 'integer', 'permco': 'integer'}`.
+
+     row_group_size: int [Optional]
+         Maximum number of rows in each written row group.
+         Default is `1024 * 1024`.
+
+     obs: int [Optional]
+         Number of observations to import from the database table.
+         Implemented using SQL `LIMIT`.
+         Setting this to a modest value (e.g., `obs=1000`) can be useful for
+         testing `db_to_pq()` with large tables.
+
+     modified: string [Optional]
+         "Last modified" string to store in the parquet file's metadata
+         (under the key `last_modified`).
+
+     alt_table_name: string [Optional]
+         Basename of the parquet file. Used when the file should have a
+         different name from `table_name`.
+
+     keep: string [Optional]
+         Regular expression indicating columns to keep.
+
+     drop: string [Optional]
+         Regular expression indicating columns to drop.
+
+     batched: bool [Optional]
+         Indicates whether data will be extracted in batches using
+         `to_pyarrow_batches()` instead of a single call to `to_pyarrow()`.
+         Using batches degrades performance slightly, but dramatically
+         reduces memory requirements for large tables.
+
+     threads: int [Optional]
+         The number of threads DuckDB is allowed to use.
+         Setting this may be necessary due to limits imposed on the user
+         by the PostgreSQL database server.
+
+     Returns
+     -------
+     pq_file: string
+         Name of parquet file created.
+
+     Examples
+     --------
+     >>> db_to_pq("dsi", "crsp")
+     >>> db_to_pq("feed21_bankruptcy_notification", "audit")
+     """
+     if not alt_table_name:
+         alt_table_name = table_name
+
+     con = ibis.duckdb.connect()
+     if threads:
+         con.raw_sql(f"SET threads TO {threads};")
+
+     uri = f"postgres://{user}@{host}:{port}/{database}"
+     df = con.read_postgres(uri, table_name=table_name, database=schema)
+     data_dir = os.path.expanduser(data_dir)
+     pq_dir = os.path.join(data_dir, schema)
+     if not os.path.exists(pq_dir):
+         os.makedirs(pq_dir)
+     pq_file = os.path.join(data_dir, schema, alt_table_name + '.parquet')
+     tmp_pq_file = os.path.join(data_dir, schema, '.temp_' + alt_table_name + '.parquet')
+
+     if drop:
+         df = df.drop(s.matches(drop))
+
+     if keep:
+         df = df.select(s.matches(keep))
+
+     if batched:
+         # Get a few rows to infer the schema for the batched write
+         tmpfile = TemporaryFile()
+         df_arrow = df_to_arrow(df, col_types=col_types, obs=10)
+         pq.write_table(df_arrow, tmpfile)
+         pq_schema = pq.read_schema(tmpfile)
+         if modified:
+             pq_schema = pq_schema.with_metadata({b'last_modified': modified.encode()})
+
+         # Process data in batches
+         with pq.ParquetWriter(tmp_pq_file, pq_schema) as writer:
+             batches = df_to_arrow(df, col_types=col_types, obs=obs, batches=True)
+             for batch in batches:
+                 writer.write_batch(batch)
+     else:
+         df_arrow = df_to_arrow(df, col_types=col_types, obs=obs)
+         if modified:
+             # Mirror the batched branch: record the "last modified" string
+             # in the file's metadata.
+             df_arrow = df_arrow.replace_schema_metadata(
+                 {b'last_modified': modified.encode()})
+         pq.write_table(df_arrow, tmp_pq_file, row_group_size=row_group_size)
+
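+     # Rename the completed temporary file into place, so an interrupted
+     # download never leaves a partial .parquet file under the final name.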
+     os.rename(tmp_pq_file, pq_file)
+     return pq_file
+
+ def wrds_pg_to_pq(table_name,
+                   schema,
+                   wrds_id=os.getenv("WRDS_ID", default=""),
+                   data_dir=os.getenv("DATA_DIR", default=""),
+                   col_types=None,
+                   row_group_size=1048576,
+                   obs=None,
+                   modified=None,
+                   alt_table_name=None,
+                   keep=None,
+                   drop=None,
+                   batched=True,
+                   threads=3):
+     """Export a table from the WRDS PostgreSQL database to a parquet file.
+
+     Parameters
+     ----------
+     table_name:
+         Name of table in database.
+
+     schema:
+         Name of database schema.
+
+     wrds_id: string
+         WRDS ID to be used to access WRDS.
+         Default is to use the environment value `WRDS_ID`.
+
+     data_dir: string [Optional]
+         Root directory of the parquet data repository.
+         The default is to use the environment value `DATA_DIR`
+         or (if not set) the current directory.
+
+     col_types: dict [Optional]
+         Dictionary of data types to cast columns to when writing to the parquet file.
+         Conversion from PostgreSQL to PyArrow types is handled by DuckDB.
+         Only a subset of columns needs to be supplied.
+         Supplied types should be compatible with data emitted by PostgreSQL
+         (i.e., one can't "fix" arbitrary type issues using this argument).
+         For example, `col_types = {'permno': 'int32', 'permco': 'int32'}`.
+
+     row_group_size: int [Optional]
+         Maximum number of rows in each written row group.
+         Default is `1024 * 1024`.
+
+     obs: int [Optional]
+         Number of observations to import from the database table.
+         Implemented using SQL `LIMIT`.
+         Setting this to a modest value (e.g., `obs=1000`) can be useful for
+         testing `wrds_pg_to_pq()` with large tables.
+
+     modified: string [Optional]
+         "Last modified" string to store in the parquet file's metadata
+         (under the key `last_modified`).
+
+     alt_table_name: string [Optional]
+         Basename of the parquet file. Used when the file should have a
+         different name from `table_name`.
+
+     keep: string [Optional]
+         Regular expression indicating columns to keep.
+
+     drop: string [Optional]
+         Regular expression indicating columns to drop.
+
+     batched: bool [Optional]
+         Indicates whether data will be extracted in batches using
+         `to_pyarrow_batches()` instead of a single call to `to_pyarrow()`.
+         Using batches degrades performance slightly, but dramatically
+         reduces memory requirements for large tables.
+
+     threads: int [Optional]
+         The number of threads DuckDB is allowed to use.
+         Setting this may be necessary due to limits imposed on the user
+         by the PostgreSQL database server.
+
+     Returns
+     -------
+     pq_file: string
+         Name of parquet file created.
+
+     Examples
+     --------
+     >>> wrds_pg_to_pq("dsi", "crsp")
+     >>> wrds_pg_to_pq("feed21_bankruptcy_notification", "audit")
+     """
+     return db_to_pq(table_name, schema, user=wrds_id,
+                     host="wrds-pgdata.wharton.upenn.edu",
+                     database="wrds",
+                     port=9737,
+                     data_dir=data_dir,
+                     col_types=col_types,
+                     row_group_size=row_group_size,
+                     obs=obs,
+                     modified=modified,
+                     alt_table_name=alt_table_name,
+                     keep=keep,
+                     drop=drop,
+                     batched=batched,
+                     threads=threads)
+
+ def db_schema_tables(schema,
+                      user=os.getenv("PGUSER", default=os.getlogin()),
+                      host=os.getenv("PGHOST", default="localhost"),
+                      database=os.getenv("PGDATABASE", default=os.getlogin()),
+                      port=os.getenv("PGPORT", default=5432)):
+     """Get a list of all tables in a PostgreSQL schema.
+
+     Parameters
+     ----------
+     schema:
+         Name of database schema.
+
+     user: string [Optional]
+         User role for the PostgreSQL database.
+         The default is to use the environment value `PGUSER`
+         or (if not set) the login user ID.
+
+     host: string [Optional]
+         Host name for the PostgreSQL server.
+         The default is to use the environment value `PGHOST`.
+
+     database: string [Optional]
+         Name of the PostgreSQL database.
+         The default is to use the environment value `PGDATABASE`
+         or (if not set) the user ID.
+
+     port: int [Optional]
+         Port for the PostgreSQL server.
+         The default is to use the environment value `PGPORT`
+         or (if not set) 5432.
+
+     Returns
+     -------
+     tables: list of strings
+         Names of tables in schema.
+
+     Examples
+     --------
+     >>> db_schema_tables("crsp")
+     >>> db_schema_tables("audit")
+     """
+     con = ibis.postgres.connect(user=user,
+                                 host=host,
+                                 port=port,
+                                 database=database)
+     tables = con.list_tables(database=schema)
+     return tables
+
+ def db_schema_to_pq(schema,
+                     user=os.getenv("PGUSER", default=os.getlogin()),
+                     host=os.getenv("PGHOST", default="localhost"),
+                     database=os.getenv("PGDATABASE", default=os.getlogin()),
+                     port=os.getenv("PGPORT", default=5432),
+                     data_dir=os.getenv("DATA_DIR", default=""),
+                     row_group_size=1048576,
+                     batched=True,
+                     threads=None):
+     """Export all tables in a PostgreSQL schema to parquet files.
+
+     Parameters
+     ----------
+     schema:
+         Name of database schema.
+
+     user: string [Optional]
+         User role for the PostgreSQL database.
+         The default is to use the environment value `PGUSER`
+         or (if not set) the login user ID.
+
+     host: string [Optional]
+         Host name for the PostgreSQL server.
+         The default is to use the environment value `PGHOST`.
+
+     database: string [Optional]
+         Name of the PostgreSQL database.
+         The default is to use the environment value `PGDATABASE`
+         or (if not set) the user ID.
+
+     port: int [Optional]
+         Port for the PostgreSQL server.
+         The default is to use the environment value `PGPORT`
+         or (if not set) 5432.
+
+     data_dir: string [Optional]
+         Root directory of the parquet data repository.
+         The default is to use the environment value `DATA_DIR`
+         or (if not set) the current directory.
+
+     row_group_size: int [Optional]
+         Maximum number of rows in each written row group.
+         Default is `1024 * 1024`.
+
+     batched: bool [Optional]
+         Indicates whether data will be extracted in batches using
+         `to_pyarrow_batches()` instead of a single call to `to_pyarrow()`.
+         Using batches degrades performance slightly, but dramatically
+         reduces memory requirements for large tables.
+
+     threads: int [Optional]
+         The number of threads DuckDB is allowed to use.
+         Setting this may be necessary due to limits imposed on the user
+         by the PostgreSQL database server.
+
+     Returns
+     -------
+     pq_files: list of strings
+         Names of parquet files created.
+
+     Examples
+     --------
+     >>> db_schema_to_pq("crsp")
+     >>> db_schema_to_pq("audit")
+     """
+     tables = db_schema_tables(schema, user, host, database, port)
+     res = [db_to_pq(table_name=table_name,
+                     schema=schema,
+                     user=user,
+                     host=host,
+                     database=database,
+                     port=port,
+                     data_dir=data_dir,
+                     row_group_size=row_group_size,
+                     threads=threads,
+                     batched=batched) for table_name in tables]
+     return res
+
+ def get_process(sas_code, wrds_id=wrds_id, fpath=None):
+     """Run SAS code on the WRDS server and return its output as a stream.
+
+     Parameters
+     ----------
+     sas_code:
+         SAS code to be run to yield output.
+
+     wrds_id: string
+         Optional WRDS ID to be used to access WRDS SAS.
+         Default is to use the environment value `WRDS_ID`.
+
+     fpath:
+         Optional path to a local SAS file.
+
+     Returns
+     -------
+     The STDOUT component of the process as a stream.
+     """
+     if client:
+         client.close()
+
+     if wrds_id:
+         # Run the SAS code on the WRDS server and return the result
+         # as a pipe on stdout.
+         client.load_system_host_keys()
+         client.set_missing_host_key_policy(paramiko.WarningPolicy())
+         client.connect('wrds-cloud-sshkey.wharton.upenn.edu',
+                        username=wrds_id, compress=False)
+         command = "qsas -stdio -noterminal"
+         stdin, stdout, stderr = client.exec_command(command)
+         stdin.write(sas_code)
+         stdin.close()
+
+         channel = stdout.channel
+         # Indicate that we're not going to write to that channel anymore
+         channel.shutdown_write()
+         return stdout
+
+ def proc_contents(table_name, sas_schema=None, wrds_id=os.getenv("WRDS_ID"),
+                   encoding=None):
+     if not encoding:
+         encoding = "utf-8"
+
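+     # PROC CONTENTS prints the dataset's metadata, including the
+     # "Last Modified" timestamp that get_modified_str() parses below.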
+     sas_code = f"PROC CONTENTS data={sas_schema}.{table_name}(encoding='{encoding}');"
+
+     p = get_process(sas_code, wrds_id)
+
+     return p.readlines()
+
+ def get_modified_str(table_name, sas_schema, wrds_id=wrds_id,
+                      encoding=None):
+
+     contents = proc_contents(table_name=table_name, sas_schema=sas_schema,
+                              wrds_id=wrds_id, encoding=encoding)
+
+     if len(contents) == 0:
+         print(f"Table {sas_schema}.{table_name} not found.")
+         return None
+
+     modified = ""
+     next_row = False
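+     # The value of "Last Modified" can wrap onto the following line of the
+     # PROC CONTENTS output, so after matching the label we also inspect the
+     # next line, skipping any text belonging to the adjacent "Protection" field.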
+     for line in contents:
+         if next_row:
+             line = re.sub(r"^\s+(.*)\s+$", r"\1", line)
+             line = re.sub(r"\s+$", "", line)
+             if not re.findall(r"Protection", line):
+                 modified += " " + line.rstrip()
+             next_row = False
+
+         if re.match(r"Last Modified", line):
+             modified = re.sub(r"^Last Modified\s+(.*?)\s{2,}.*$",
+                               r"Last modified: \1", line)
+             modified = modified.rstrip()
+             next_row = True
+
+     return modified
+
+ def get_modified_pq(file_name):
+     """Return the `last_modified` metadata value stored in a parquet file,
+     or an empty string if the file or the metadata key is missing."""
+     if os.path.exists(file_name):
+         md = pq.read_schema(file_name)
+         schema_md = md.metadata
+         if not schema_md:
+             return ''
+         if b'last_modified' in schema_md.keys():
+             last_modified = schema_md[b'last_modified'].decode('utf-8')
+         else:
+             last_modified = ''
+     else:
+         last_modified = ''
+     return last_modified
+
+ def wrds_update_pq(table_name, schema,
+                    wrds_id=os.getenv("WRDS_ID", default=""),
+                    data_dir=os.getenv("DATA_DIR", default=""),
+                    force=False,
+                    col_types=None,
+                    encoding="utf-8",
+                    sas_schema=None,
+                    row_group_size=1048576,
+                    obs=None,
+                    alt_table_name=None,
+                    keep=None,
+                    drop=None,
+                    batched=True,
+                    threads=3):
+     """Update a local parquet file from the WRDS PostgreSQL database if it is out of date.
+
+     Parameters
+     ----------
+     table_name:
+         Name of table in database.
+
+     schema:
+         Name of database schema.
+
+     wrds_id: string
+         WRDS ID to be used to access WRDS.
+         Default is to use the environment value `WRDS_ID`.
+
+     data_dir: string [Optional]
+         Root directory of the parquet data repository.
+         The default is to use the environment value `DATA_DIR`
+         or (if not set) the current directory.
+
+     force: bool
+         Whether the update should proceed regardless of date-comparison results.
+
+     col_types: dict [Optional]
+         Dictionary of data types to cast columns to when writing to the parquet file.
+         Conversion from PostgreSQL to PyArrow types is handled by DuckDB.
+         Only a subset of columns needs to be supplied.
+         Supplied types should be compatible with data emitted by PostgreSQL
+         (i.e., one can't "fix" arbitrary type issues using this argument).
+         For example, `col_types = {'permno': 'int32', 'permco': 'int32'}`.
+
+     encoding: string [Optional]
+         Encoding used when reading metadata for the SAS file.
+         Default is "utf-8".
+
+     sas_schema: string [Optional]
+         Name of the SAS library containing the table.
+         Defaults to `schema`.
+
+     row_group_size: int [Optional]
+         Maximum number of rows in each written row group.
+         Default is `1024 * 1024`.
+
+     obs: int [Optional]
+         Number of observations to import from the database table.
+         Implemented using SQL `LIMIT`.
+         Setting this to a modest value (e.g., `obs=1000`) can be useful for
+         testing `wrds_update_pq()` with large tables.
+
+     alt_table_name: string [Optional]
+         Basename of the parquet file. Used when the file should have a
+         different name from `table_name`.
+
+     keep: string [Optional]
+         Regular expression indicating columns to keep.
+
+     drop: string [Optional]
+         Regular expression indicating columns to drop.
+
+     batched: bool [Optional]
+         Indicates whether data will be extracted in batches using
+         `to_pyarrow_batches()` instead of a single call to `to_pyarrow()`.
+         Using batches degrades performance slightly, but dramatically
+         reduces memory requirements for large tables.
+
+     threads: int [Optional]
+         The number of threads DuckDB is allowed to use.
+         Setting this may be necessary due to limits imposed on the user
+         by the PostgreSQL database server.
+
+     Returns
+     -------
+     pq_file: string or False
+         Name of the parquet file created, or `False` if no update was performed.
+
+     Examples
+     --------
+     >>> wrds_update_pq("dsi", "crsp")
+     >>> wrds_update_pq("feed21_bankruptcy_notification", "audit")
+     """
+     if not sas_schema:
+         sas_schema = schema
+
+     if not alt_table_name:
+         alt_table_name = table_name
+
+     # Compare against the file that will actually be written (alt_table_name).
+     pq_file = get_pq_file(table_name=alt_table_name, schema=schema,
+                           data_dir=data_dir)
+
+     modified = get_modified_str(table_name=table_name,
+                                 sas_schema=sas_schema, wrds_id=wrds_id,
+                                 encoding=encoding)
+     if not modified:
+         return False
+
+     pq_modified = get_modified_pq(pq_file)
+     if modified == pq_modified and not force:
+         print(schema + "." + alt_table_name + " already up to date.")
+         return False
+     if force:
+         print("Forcing update based on user request.")
+     else:
+         print(f"Updated {schema}.{alt_table_name} is available.")
+         print("Getting from WRDS.")
+
+     print(f"Beginning file download at {get_now()} UTC.")
+     pq_file = wrds_pg_to_pq(table_name=table_name,
+                             schema=schema,
+                             data_dir=data_dir,
+                             wrds_id=wrds_id,
+                             col_types=col_types,
+                             row_group_size=row_group_size,
+                             obs=obs,
+                             modified=modified,
+                             alt_table_name=alt_table_name,
+                             keep=keep,
+                             drop=drop,
+                             batched=batched,
+                             threads=threads)
+     print(f"Completed file download at {get_now()} UTC.\n")
+     return pq_file
+
+ def get_pq_file(table_name, schema, data_dir=os.getenv("DATA_DIR")):
+     """Return the path of the parquet file for a table, creating the
+     schema directory if necessary."""
+     data_dir = os.path.expanduser(data_dir)
+     if not os.path.exists(data_dir):
+         os.makedirs(data_dir)
+
+     schema_dir = Path(data_dir, schema)
+     if not os.path.exists(schema_dir):
+         os.makedirs(schema_dir)
+
+     pq_file = Path(data_dir, schema, table_name).with_suffix('.parquet')
+     return pq_file
+
+ def get_now():
+     """Return the current UTC time as a string."""
+     return strftime("%Y-%m-%d %H:%M:%S", gmtime())
+
+ def get_pq_files(schema, data_dir=os.getenv("DATA_DIR", default="")):
+     """Get a list of parquet files in a schema.
+
+     Parameters
+     ----------
+     schema:
+         Name of database schema.
+
+     data_dir: string [Optional]
+         Root directory of the parquet data repository.
+         The default is to use the environment value `DATA_DIR`
+         or (if not set) the current directory.
+
+     Returns
+     -------
+     pq_files: [string]
+         Base names (without the `.parquet` extension) of the parquet files found.
+     """
+     data_dir = os.path.expanduser(data_dir)
+     pq_dir = os.path.join(data_dir, schema)
+     files = os.listdir(pq_dir)
+     return [re.sub(r"\.parquet$", "", pq_file)
+             for pq_file in files
+             if re.search(r"\.parquet$", pq_file)]
+
+ def update_schema(schema, data_dir=os.getenv("DATA_DIR", default="")):
+     """Update existing parquet files in a schema.
+
+     Parameters
+     ----------
+     schema:
+         Name of database schema.
+
+     data_dir: string [Optional]
+         Root directory of the parquet data repository.
+         The default is to use the environment value `DATA_DIR`
+         or (if not set) the current directory.
+
+     Returns
+     -------
+     None
+         Each existing parquet file in the schema is updated in place.
+     """
+     pq_files = get_pq_files(schema=schema, data_dir=data_dir)
+     for pq_file in pq_files:
+         wrds_update_pq(table_name=pq_file, schema=schema,
+                        data_dir=data_dir, threads=3)
+
+ def pq_last_updated(data_dir=None):
+     """
+     Get `last_updated` metadata for data files in a parquet data repository
+     set up along the lines described at
+     https://iangow.github.io/far_book/parquet-wrds.html.
+
+     Parameters
+     ----------
+     data_dir: string [Optional]
+         Root directory of the parquet data repository.
+         The default is to use the environment value `DATA_DIR`.
+
+     Returns
+     -------
+     df: pd.DataFrame
+         Data frame with four columns: table, schema, last_mod_str, last_mod.
+     """
+     if not data_dir:
+         data_dir = os.path.expanduser(os.environ["DATA_DIR"])
+     data_dir = Path(data_dir)
+
+     df = pd.DataFrame([
+         {"table": p.stem,
+          "schema": subdir.name,
+          "last_mod_str": get_modified_pq(p)}
+         for subdir in data_dir.iterdir()
+         if subdir.is_dir()
+         for p in subdir.glob("*.parquet")
+     ])
+
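+     # The "Last modified:" strings come from the WRDS server; the code
+     # assumes these timestamps are in US Eastern time.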
+     df["last_mod"] = (
+         df["last_mod_str"]
+         .str.replace("Last modified: ", "", regex=False)
+         .pipe(pd.to_datetime)
+         .dt.tz_localize("US/Eastern")
+     )
+
+     return df.sort_values("schema").reset_index(drop=True)
db2pq-0.1.3/db2pq.egg-info/PKG-INFO ADDED
@@ -0,0 +1,79 @@
+ Metadata-Version: 2.4
+ Name: db2pq
+ Version: 0.1.3
+ Summary: Convert database tables to parquet files.
+ Home-page: https://github.com/iangow/db2pq/
+ Author: Ian Gow
+ Author-email: iandgow@gmail.com
+ Classifier: Programming Language :: Python :: 3
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Requires-Python: >=3
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: ibis-framework[duckdb,postgres]
+ Requires-Dist: pyarrow
+ Requires-Dist: pandas
+ Requires-Dist: paramiko
+ Dynamic: author
+ Dynamic: author-email
+ Dynamic: classifier
+ Dynamic: description
+ Dynamic: description-content-type
+ Dynamic: home-page
+ Dynamic: license-file
+ Dynamic: requires-dist
+ Dynamic: requires-python
+ Dynamic: summary
+
+ # Library to convert PostgreSQL data to parquet files
+
+ This package was created to convert PostgreSQL data to parquet format.
+ It provides four main functions: three that export data at different scopes (a WRDS table, a generic PostgreSQL table, or an entire schema), plus an "update" function that only fetches new data when necessary.
+
+ - `wrds_pg_to_pq()`: Exports a WRDS PostgreSQL table to a parquet file.
+ - `db_to_pq()`: Exports a PostgreSQL table to a parquet file.
+ - `db_schema_to_pq()`: Exports a PostgreSQL schema to parquet files.
+ - `wrds_update_pq()`: A variant of `wrds_pg_to_pq()` that checks the "last modified" value for the relevant SAS file against that of the local parquet file before getting new data from the WRDS PostgreSQL server.
+
+ ## Requirements
+
+ ### 1. Python
+ The software uses Python 3 and depends on Ibis, `pyarrow` (the Python API for Apache Arrow), and Paramiko.
+ These dependencies are installed automatically when you install the package with pip:
+
+ ```bash
+ pip install db2pq --upgrade
+ ```
+
+ ### 2. A WRDS ID
+ To use public-key authentication to access WRDS, follow the hints [here](https://debian-administration.org/article/152/Password-less_logins_with_OpenSSH) to set up a public key.
+ Copy that key to the WRDS server from the terminal on your computer.
+ (Note that this step assumes you have a `.ssh` directory in your home directory on the WRDS server. If not, log in to WRDS via SSH, then type `mkdir ~/.ssh` to create it.)
+ Here's code to create the key and send it to WRDS:
+
+ ```bash
+ ssh-keygen -t rsa
+ cat ~/.ssh/id_rsa.pub | ssh $WRDS_ID@wrds-cloud-sshkey.wharton.upenn.edu "cat >> ~/.ssh/authorized_keys"
+ ```
+
+ Use an empty passphrase when setting up the key so that the scripts can run without user intervention.
+
+ ### 3. Environment variables
+
+ Environment variables that the code uses include:
+
+ - `WRDS_ID`: Your [WRDS](https://wrds-web.wharton.upenn.edu/wrds/) ID.
+ - `DATA_DIR`: The local repository for parquet files.
+
+ One can set these environment variables in (say) `~/.zprofile`:
+
+ ```bash
+ export WRDS_ID="iangow"
+ export DATA_DIR="~/Dropbox/pq_data"
+ ```
+
+ As an alternative to setting these environment variables, you can pass the values directly to the functions above via the `wrds_id` and `data_dir` arguments.
+
+ ### Report bugs
+ Author: Ian Gow, <iandgow@gmail.com>
db2pq-0.1.3/db2pq.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,10 @@
+ LICENSE
+ README.md
+ setup.py
+ db2pq/__init__.py
+ db2pq/db2pq.py
+ db2pq.egg-info/PKG-INFO
+ db2pq.egg-info/SOURCES.txt
+ db2pq.egg-info/dependency_links.txt
+ db2pq.egg-info/requires.txt
+ db2pq.egg-info/top_level.txt
db2pq-0.1.3/db2pq.egg-info/requires.txt ADDED
@@ -0,0 +1,4 @@
+ ibis-framework[duckdb,postgres]
+ pyarrow
+ pandas
+ paramiko
db2pq-0.1.3/db2pq.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ db2pq
db2pq-0.1.3/setup.cfg ADDED
@@ -0,0 +1,4 @@
+ [egg_info]
+ tag_build =
+ tag_date = 0
+
db2pq-0.1.3/setup.py ADDED
@@ -0,0 +1,24 @@
+ import setuptools
+
+ with open("README.md", "r") as f:
+     long_description = f.read()
+
+ setuptools.setup(
+     name="db2pq",
+     version="0.1.3",
+     author="Ian Gow",
+     author_email="iandgow@gmail.com",
+     description="Convert database tables to parquet files.",
+     long_description=long_description,
+     long_description_content_type="text/markdown",
+     url="https://github.com/iangow/db2pq/",
+     packages=setuptools.find_packages(),
+     install_requires=['ibis-framework[duckdb,postgres]', 'pyarrow', 'pandas', 'paramiko'],
+     python_requires=">=3",
+     classifiers=[
+         "Programming Language :: Python :: 3",
+         "License :: OSI Approved :: MIT License",
+         "Operating System :: OS Independent",
+     ],
+ )