db2pq-0.1.3.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- db2pq-0.1.3/LICENSE +21 -0
- db2pq-0.1.3/PKG-INFO +79 -0
- db2pq-0.1.3/README.md +51 -0
- db2pq-0.1.3/db2pq/__init__.py +6 -0
- db2pq-0.1.3/db2pq/db2pq.py +726 -0
- db2pq-0.1.3/db2pq.egg-info/PKG-INFO +79 -0
- db2pq-0.1.3/db2pq.egg-info/SOURCES.txt +10 -0
- db2pq-0.1.3/db2pq.egg-info/dependency_links.txt +1 -0
- db2pq-0.1.3/db2pq.egg-info/requires.txt +4 -0
- db2pq-0.1.3/db2pq.egg-info/top_level.txt +1 -0
- db2pq-0.1.3/setup.cfg +4 -0
- db2pq-0.1.3/setup.py +24 -0
db2pq-0.1.3/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Ian Gow

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
db2pq-0.1.3/PKG-INFO
ADDED
@@ -0,0 +1,79 @@
Metadata-Version: 2.4
Name: db2pq
Version: 0.1.3
Summary: Convert database tables to parquet tables.
Home-page: https://github.com/iangow/db2pq/
Author: Ian Gow
Author-email: iandgow@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: ibis-framework[duckdb,postgres]
Requires-Dist: pyarrow
Requires-Dist: pandas
Requires-Dist: paramiko
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

(The long description is identical to README.md, shown below.)
db2pq-0.1.3/README.md
ADDED
@@ -0,0 +1,51 @@
# Library to convert PostgreSQL data to parquet files

This package was created to convert PostgreSQL data to parquet format.
It provides four main functions: three that export tables or schemas, plus an "update" function that only fetches data when necessary (see the usage sketch after this list).

- `wrds_pg_to_pq()`: Exports a WRDS PostgreSQL table to a parquet file.
- `db_to_pq()`: Exports a PostgreSQL table to a parquet file.
- `db_schema_to_pq()`: Exports all tables in a PostgreSQL schema to parquet files.
- `wrds_update_pq()`: A variant of `wrds_pg_to_pq()` that checks the "last modified" value for the relevant SAS file against that of the local parquet file before getting new data from the WRDS PostgreSQL server.
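
A minimal sketch of typical use (the table and schema names follow the examples in the package's docstrings; adjust for your own data):

```python
from db2pq import wrds_update_pq

# Downloads crsp.dsi from WRDS only if the local parquet file is out of date.
wrds_update_pq("dsi", "crsp")
```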

## Requirements

### 1. Python

The software uses Python 3 and depends on Ibis, `pyarrow` (the Python API for the Apache Arrow libraries), and Paramiko.
These dependencies are installed automatically when you install the package with pip:

```bash
pip install db2pq --upgrade
```

### 2. A WRDS ID

To use public-key authentication to access WRDS, follow the hints [here](https://debian-administration.org/article/152/Password-less_logins_with_OpenSSH) to set up a public key.
Copy that key to the WRDS server from the terminal on your computer.
(Note that this code assumes you have a directory `.ssh` in your home directory on the WRDS server. If not, log in to WRDS via SSH, then type `mkdir ~/.ssh` to create it.)
Here's code to create the key and send it to WRDS:

```bash
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub | ssh $WRDS_ID@wrds-cloud-sshkey.wharton.upenn.edu "cat >> ~/.ssh/authorized_keys"
```

Use an empty passphrase when setting up the key so that scripts can run without user intervention.

### 3. Environment variables

Environment variables that the code uses include:

- `WRDS_ID`: Your [WRDS](https://wrds-web.wharton.upenn.edu/wrds/) ID.
- `DATA_DIR`: The local repository for parquet files.

One can set these environment variables in (say) `~/.zprofile`:

```bash
export WRDS_ID="iangow"
export DATA_DIR="~/Dropbox/pq_data"
```

As an alternative to setting these environment variables, you can pass the values as the `wrds_id` and `data_dir` arguments, respectively, of the functions above.
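For example (a sketch; the ID and path shown are illustrative):

```python
from db2pq import wrds_update_pq

wrds_update_pq("dsi", "crsp", wrds_id="iangow", data_dir="~/pq_data")
```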

### Report bugs

Author: Ian Gow, <iandgow@gmail.com>
db2pq-0.1.3/db2pq/db2pq.py
ADDED
@@ -0,0 +1,726 @@
import ibis
import os
import ibis.selectors as s
from ibis import _
from tempfile import TemporaryFile
import pyarrow.parquet as pq
import re
import warnings
import paramiko
from pathlib import Path
from time import gmtime, strftime
import pandas as pd

client = paramiko.SSHClient()
wrds_id = os.getenv("WRDS_ID")
warnings.filterwarnings(action='ignore', module='.*paramiko.*')

def df_to_arrow(df, col_types=None, obs=None, batches=False):
    """Convert an Ibis table expression to a PyArrow table or batch reader."""
    if col_types:
        # Cast all columns that share a given target type in one mutate.
        types = set(col_types.values())
        for col_type in types:
            to_convert = [key for (key, value) in col_types.items()
                          if value == col_type]
            df = df.mutate(s.across(to_convert, _.cast(col_type)))

    if obs:
        df = df.limit(obs)

    if batches:
        return df.to_pyarrow_batches()
    else:
        return df.to_pyarrow()

def db_to_pq(table_name, schema,
             user=os.getenv("PGUSER", default=os.getlogin()),
             host=os.getenv("PGHOST", default="localhost"),
             database=os.getenv("PGDATABASE", default=os.getlogin()),
             port=os.getenv("PGPORT", default=5432),
             data_dir=os.getenv("DATA_DIR", default=""),
             col_types=None,
             row_group_size=1048576,
             obs=None,
             modified=None,
             alt_table_name=None,
             keep=None,
             drop=None,
             batched=True,
             threads=None):
    """Export a PostgreSQL table to a parquet file.

    Parameters
    ----------
    table_name:
        Name of table in database.

    schema:
        Name of database schema.

    user: string [Optional]
        User role for the PostgreSQL database.
        The default is to use the environment value `PGUSER`
        or (if not set) the login ID.

    host: string [Optional]
        Host name for the PostgreSQL server.
        The default is to use the environment value `PGHOST`.

    database: string [Optional]
        Name for the PostgreSQL database.
        The default is to use the environment value `PGDATABASE`
        or (if not set) the user ID.

    port: int [Optional]
        Port for the PostgreSQL server.
        The default is to use the environment value `PGPORT`
        or (if not set) 5432.

    data_dir: string [Optional]
        Root directory of parquet data repository.
        The default is to use the environment value `DATA_DIR`
        or (if not set) the current directory.

    col_types: Dict [Optional]
        Dictionary of data types to be used when writing parquet files.
        Conversion from PostgreSQL to PyArrow types is handled by DuckDB.
        Only a subset of columns needs to be supplied.
        Supplied types should be compatible with data emitted by PostgreSQL
        (i.e., one can't "fix" arbitrary type issues using this argument).
        For example, `col_types = {'permno': 'integer', 'permco': 'integer'}`.

    row_group_size: int [Optional]
        Maximum number of rows in each written row group.
        Default is `1024 * 1024`.

    obs: int [Optional]
        Number of observations to import from database table.
        Implemented using SQL `LIMIT`.
        Setting this to a modest value (e.g., `obs=1000`) can be useful for
        testing `db_to_pq()` with large tables.

    modified: string [Optional]
        "Last modified" string to store in the parquet file's metadata.

    alt_table_name: string [Optional]
        Basename of parquet file. Used when the file should have a
        different name from `table_name`.

    keep: string [Optional]
        Regular expression indicating columns to keep.

    drop: string [Optional]
        Regular expression indicating columns to drop.

    batched: bool [Optional]
        Indicates whether data will be extracted in batches using
        `to_pyarrow_batches()` instead of a single call to `to_pyarrow()`.
        Using batches degrades performance slightly, but dramatically
        reduces memory requirements for large tables.

    threads: int [Optional]
        The number of threads DuckDB is allowed to use.
        Setting this may be necessary due to limits imposed on the user
        by the PostgreSQL database server.

    Returns
    -------
    pq_file: string
        Name of parquet file created.

    Examples
    --------
    >>> db_to_pq("dsi", "crsp")
    >>> db_to_pq("feed21_bankruptcy_notification", "audit")
    """
    if not alt_table_name:
        alt_table_name = table_name

    con = ibis.duckdb.connect()
    if threads:
        con.raw_sql(f"SET threads TO {threads};")

    uri = f"postgres://{user}@{host}:{port}/{database}"
    df = con.read_postgres(uri, table_name=table_name, database=schema)
    data_dir = os.path.expanduser(data_dir)
    pq_dir = os.path.join(data_dir, schema)
    if not os.path.exists(pq_dir):
        os.makedirs(pq_dir)
    pq_file = os.path.join(data_dir, schema, alt_table_name + '.parquet')
    tmp_pq_file = os.path.join(data_dir, schema,
                               '.temp_' + alt_table_name + '.parquet')

    if drop:
        df = df.drop(s.matches(drop))

    if keep:
        df = df.select(s.matches(keep))

    if batched:
        # Write a few rows to a temporary file to infer the schema
        # for the batched write.
        tmpfile = TemporaryFile()
        df_arrow = df_to_arrow(df, col_types=col_types, obs=10)
        pq.write_table(df_arrow, tmpfile)
        pa_schema = pq.read_schema(tmpfile)
        if modified:
            pa_schema = pa_schema.with_metadata(
                {b'last_modified': modified.encode()})

        # Process data in batches
        with pq.ParquetWriter(tmp_pq_file, pa_schema) as writer:
            batches = df_to_arrow(df, col_types=col_types, obs=obs,
                                  batches=True)
            for batch in batches:
                writer.write_batch(batch)
    else:
        df_arrow = df_to_arrow(df, col_types=col_types, obs=obs)
        pq.write_table(df_arrow, tmp_pq_file, row_group_size=row_group_size)

    # Write to a temporary file first, then rename, so that an interrupted
    # export does not leave a truncated parquet file in place.
    os.rename(tmp_pq_file, pq_file)
    return pq_file
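
# A usage sketch for db_to_pq() (assumes a reachable PostgreSQL server and
# that DATA_DIR is set; the column types and row limit are illustrative):
#
#   db_to_pq("dsi", "crsp",
#            col_types={"permno": "integer", "permco": "integer"},
#            obs=1000, threads=3)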

def wrds_pg_to_pq(table_name,
                  schema,
                  wrds_id=os.getenv("WRDS_ID", default=""),
                  data_dir=os.getenv("DATA_DIR", default=""),
                  col_types=None,
                  row_group_size=1048576,
                  obs=None,
                  modified=None,
                  alt_table_name=None,
                  keep=None,
                  drop=None,
                  batched=True,
                  threads=3):
    """Export a table from the WRDS PostgreSQL database to a parquet file.

    Parameters
    ----------
    table_name:
        Name of table in database.

    schema:
        Name of database schema.

    wrds_id: string
        WRDS ID to be used to access WRDS.
        Default is to use the environment value `WRDS_ID`.

    data_dir: string [Optional]
        Root directory of parquet data repository.
        The default is to use the environment value `DATA_DIR`
        or (if not set) the current directory.

    col_types: Dict [Optional]
        Dictionary of data types to be used when writing parquet files.
        Conversion from PostgreSQL to PyArrow types is handled by DuckDB.
        Only a subset of columns needs to be supplied.
        Supplied types should be compatible with data emitted by PostgreSQL
        (i.e., one can't "fix" arbitrary type issues using this argument).
        For example, `col_types = {'permno': 'int32', 'permco': 'int32'}`.

    row_group_size: int [Optional]
        Maximum number of rows in each written row group.
        Default is `1024 * 1024`.

    obs: int [Optional]
        Number of observations to import from database table.
        Implemented using SQL `LIMIT`.
        Setting this to a modest value (e.g., `obs=1000`) can be useful for
        testing with large tables.

    modified: string [Optional]
        "Last modified" string to store in the parquet file's metadata.

    alt_table_name: string [Optional]
        Basename of parquet file. Used when the file should have a
        different name from `table_name`.

    keep: string [Optional]
        Regular expression indicating columns to keep.

    drop: string [Optional]
        Regular expression indicating columns to drop.

    batched: bool [Optional]
        Indicates whether data will be extracted in batches using
        `to_pyarrow_batches()` instead of a single call to `to_pyarrow()`.
        Using batches degrades performance slightly, but dramatically
        reduces memory requirements for large tables.

    threads: int [Optional]
        The number of threads DuckDB is allowed to use.
        Setting this may be necessary due to limits imposed on the user
        by the PostgreSQL database server.

    Returns
    -------
    pq_file: string
        Name of parquet file created.

    Examples
    --------
    >>> wrds_pg_to_pq("dsi", "crsp")
    >>> wrds_pg_to_pq("feed21_bankruptcy_notification", "audit")
    """
    return db_to_pq(table_name, schema, user=wrds_id,
                    host="wrds-pgdata.wharton.upenn.edu",
                    database="wrds",
                    port=9737,
                    data_dir=data_dir,
                    col_types=col_types,
                    row_group_size=row_group_size,
                    obs=obs,
                    modified=modified,
                    alt_table_name=alt_table_name,
                    keep=keep,
                    drop=drop,
                    batched=batched,
                    threads=threads)
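
# A usage sketch for wrds_pg_to_pq() (assumes WRDS_ID and DATA_DIR are set
# and that SSH key access to WRDS has been configured):
#
#   wrds_pg_to_pq("dsi", "crsp", col_types={"permno": "int32"})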

def db_schema_tables(schema,
                     user=os.getenv("PGUSER", default=os.getlogin()),
                     host=os.getenv("PGHOST", default="localhost"),
                     database=os.getenv("PGDATABASE", default=os.getlogin()),
                     port=os.getenv("PGPORT", default=5432)):
    """Get list of all tables in a PostgreSQL schema.

    Parameters
    ----------
    schema:
        Name of database schema.

    user: string [Optional]
        User role for the PostgreSQL database.
        The default is to use the environment value `PGUSER`
        or (if not set) the login ID.

    host: string [Optional]
        Host name for the PostgreSQL server.
        The default is to use the environment value `PGHOST`.

    database: string [Optional]
        Name for the PostgreSQL database.
        The default is to use the environment value `PGDATABASE`
        or (if not set) the user ID.

    port: int [Optional]
        Port for the PostgreSQL server.
        The default is to use the environment value `PGPORT`
        or (if not set) 5432.

    Returns
    -------
    tables: list of strings
        Names of tables in schema.

    Examples
    --------
    >>> db_schema_tables("crsp")
    >>> db_schema_tables("audit")
    """
    con = ibis.postgres.connect(user=user,
                                host=host,
                                port=port,
                                database=database)
    tables = con.list_tables(database=schema)
    return tables

def db_schema_to_pq(schema,
                    user=os.getenv("PGUSER", default=os.getlogin()),
                    host=os.getenv("PGHOST", default="localhost"),
                    database=os.getenv("PGDATABASE", default=os.getlogin()),
                    port=os.getenv("PGPORT", default=5432),
                    data_dir=os.getenv("DATA_DIR", default=""),
                    row_group_size=1048576,
                    batched=True,
                    threads=None):
    """Export all tables in a PostgreSQL schema to parquet files.

    Parameters
    ----------
    schema:
        Name of database schema.

    user: string [Optional]
        User role for the PostgreSQL database.
        The default is to use the environment value `PGUSER`
        or (if not set) the login ID.

    host: string [Optional]
        Host name for the PostgreSQL server.
        The default is to use the environment value `PGHOST`.

    database: string [Optional]
        Name for the PostgreSQL database.
        The default is to use the environment value `PGDATABASE`
        or (if not set) the user ID.

    port: int [Optional]
        Port for the PostgreSQL server.
        The default is to use the environment value `PGPORT`
        or (if not set) 5432.

    data_dir: string [Optional]
        Root directory of parquet data repository.
        The default is to use the environment value `DATA_DIR`
        or (if not set) the current directory.

    row_group_size: int [Optional]
        Maximum number of rows in each written row group.
        Default is `1024 * 1024`.

    batched: bool [Optional]
        Indicates whether data will be extracted in batches using
        `to_pyarrow_batches()` instead of a single call to `to_pyarrow()`.
        Using batches degrades performance slightly, but dramatically
        reduces memory requirements for large tables.

    threads: int [Optional]
        The number of threads DuckDB is allowed to use.
        Setting this may be necessary due to limits imposed on the user
        by the PostgreSQL database server.

    Returns
    -------
    pq_files: list of strings
        Names of parquet files created.

    Examples
    --------
    >>> db_schema_to_pq("crsp")
    >>> db_schema_to_pq("audit")
    """
    tables = db_schema_tables(schema, user, host, database, port)
    res = [db_to_pq(table_name=table_name,
                    schema=schema,
                    user=user,
                    host=host,
                    database=database,
                    port=port,
                    data_dir=data_dir,
                    row_group_size=row_group_size,
                    threads=threads,
                    batched=batched) for table_name in tables]
    return res
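
# A usage sketch for db_schema_to_pq() (assumes a local PostgreSQL server
# with a schema named "crsp" and that DATA_DIR is set):
#
#   pq_files = db_schema_to_pq("crsp", threads=3)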

def get_process(sas_code, wrds_id=wrds_id, fpath=None):
    """Run SAS code on the WRDS server and return its standard output.

    Parameters
    ----------
    sas_code:
        SAS code to be run to yield output.

    wrds_id: string
        Optional WRDS ID to be used to access WRDS SAS.
        Default is to use the environment value `WRDS_ID`.

    fpath:
        Optional path to a local SAS file.

    Returns
    -------
    The STDOUT component of the process as a stream.
    """
    if client:
        client.close()

    if wrds_id:
        # Run SAS code on the WRDS server and return the result
        # as a pipe on stdout.
        client.load_system_host_keys()
        client.set_missing_host_key_policy(paramiko.WarningPolicy())
        client.connect('wrds-cloud-sshkey.wharton.upenn.edu',
                       username=wrds_id, compress=False)
        command = "qsas -stdio -noterminal"
        stdin, stdout, stderr = client.exec_command(command)
        stdin.write(sas_code)
        stdin.close()

        channel = stdout.channel
        # Indicate that we're not going to write to that channel anymore.
        channel.shutdown_write()
        return stdout

def proc_contents(table_name, sas_schema=None, wrds_id=os.getenv("WRDS_ID"),
                  encoding=None):
    """Run PROC CONTENTS on a WRDS SAS data set and return the output lines."""
    if not encoding:
        encoding = "utf-8"

    sas_code = f"PROC CONTENTS data={sas_schema}.{table_name}(encoding='{encoding}');"

    p = get_process(sas_code, wrds_id)

    return p.readlines()

def get_modified_str(table_name, sas_schema, wrds_id=wrds_id,
                     encoding=None):
    """Extract the "Last Modified" value from PROC CONTENTS output."""
    contents = proc_contents(table_name=table_name, sas_schema=sas_schema,
                             wrds_id=wrds_id, encoding=encoding)

    if len(contents) == 0:
        print(f"Table {sas_schema}.{table_name} not found.")
        return None

    modified = ""
    next_row = False
    for line in contents:
        if next_row:
            # The value may continue on the row after "Last Modified";
            # append it unless it belongs to the "Protection" field.
            line = re.sub(r"^\s+(.*)\s+$", r"\1", line)
            line = re.sub(r"\s+$", "", line)
            if not re.findall(r"Protection", line):
                modified += " " + line.rstrip()
            next_row = False

        if re.match(r"Last Modified", line):
            modified = re.sub(r"^Last Modified\s+(.*?)\s{2,}.*$",
                              r"Last modified: \1", line)
            modified = modified.rstrip()
            next_row = True

    return modified

def get_modified_pq(file_name):
    """Return the `last_modified` metadata stored in a parquet file, if any."""
    if os.path.exists(file_name):
        md = pq.read_schema(file_name)
        schema_md = md.metadata
        if not schema_md:
            return ''
        if b'last_modified' in schema_md.keys():
            last_modified = schema_md[b'last_modified'].decode('utf-8')
        else:
            last_modified = ''
    else:
        last_modified = ''
    return last_modified
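
# A usage sketch for get_modified_pq() (the path shown is illustrative):
#
#   get_modified_pq("/home/user/pq_data/crsp/dsi.parquet")
#   # => 'Last modified: ...', or '' if no metadata is present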

def wrds_update_pq(table_name, schema,
                   wrds_id=os.getenv("WRDS_ID", default=""),
                   data_dir=os.getenv("DATA_DIR", default=""),
                   force=False,
                   col_types=None,
                   encoding="utf-8",
                   sas_schema=None,
                   row_group_size=1048576,
                   obs=None,
                   alt_table_name=None,
                   keep=None,
                   drop=None,
                   batched=True,
                   threads=3):
    """Update a parquet file from the WRDS PostgreSQL database if needed.

    The "last modified" value for the relevant WRDS SAS file is compared
    against that of the local parquet file; data are downloaded only if
    the two differ (or if `force` is set).

    Parameters
    ----------
    table_name:
        Name of table in database.

    schema:
        Name of database schema.

    wrds_id: string
        WRDS ID to be used to access WRDS SAS.
        Default is to use the environment value `WRDS_ID`.

    data_dir: string [Optional]
        Root directory of parquet data repository.
        The default is to use the environment value `DATA_DIR`
        or (if not set) the current directory.

    force: bool
        Whether update should proceed regardless of date comparison results.

    col_types: Dict [Optional]
        Dictionary of data types to be used when writing parquet files.
        Conversion from PostgreSQL to PyArrow types is handled by DuckDB.
        Only a subset of columns needs to be supplied.
        Supplied types should be compatible with data emitted by PostgreSQL
        (i.e., one can't "fix" arbitrary type issues using this argument).
        For example, `col_types = {'permno': 'int32', 'permco': 'int32'}`.

    encoding: string [Optional]
        Encoding used when reading the SAS file metadata.
        Default is "utf-8".

    sas_schema: string [Optional]
        Name of the SAS library holding the table.
        Default is to use `schema`.

    row_group_size: int [Optional]
        Maximum number of rows in each written row group.
        Default is `1024 * 1024`.

    obs: int [Optional]
        Number of observations to import from database table.
        Implemented using SQL `LIMIT`.
        Setting this to a modest value (e.g., `obs=1000`) can be useful for
        testing with large tables.

    alt_table_name: string [Optional]
        Basename of parquet file. Used when the file should have a
        different name from `table_name`.

    keep: string [Optional]
        Regular expression indicating columns to keep.

    drop: string [Optional]
        Regular expression indicating columns to drop.

    batched: bool [Optional]
        Indicates whether data will be extracted in batches using
        `to_pyarrow_batches()` instead of a single call to `to_pyarrow()`.
        Using batches degrades performance slightly, but dramatically
        reduces memory requirements for large tables.

    threads: int [Optional]
        The number of threads DuckDB is allowed to use.
        Setting this may be necessary due to limits imposed on the user
        by the PostgreSQL database server.

    Returns
    -------
    bool
        True if the parquet file was updated, False otherwise.

    Examples
    --------
    >>> wrds_update_pq("dsi", "crsp")
    >>> wrds_update_pq("feed21_bankruptcy_notification", "audit")
    """
    if not sas_schema:
        sas_schema = schema

    if not alt_table_name:
        alt_table_name = table_name

    # The local file is named after alt_table_name, so check that file.
    pq_file = get_pq_file(table_name=alt_table_name, schema=schema,
                          data_dir=data_dir)

    modified = get_modified_str(table_name=table_name,
                                sas_schema=sas_schema, wrds_id=wrds_id,
                                encoding=encoding)
    if not modified:
        return False

    pq_modified = get_modified_pq(pq_file)
    if modified == pq_modified and not force:
        print(f"{schema}.{alt_table_name} already up to date.")
        return False
    if force:
        print("Forcing update based on user request.")
    else:
        print(f"Updated {schema}.{alt_table_name} is available.")
        print("Getting from WRDS.")

    print(f"Beginning file download at {get_now()} UTC.")
    wrds_pg_to_pq(table_name=table_name,
                  schema=schema,
                  data_dir=data_dir,
                  wrds_id=wrds_id,
                  col_types=col_types,
                  row_group_size=row_group_size,
                  obs=obs,
                  modified=modified,
                  alt_table_name=alt_table_name,
                  keep=keep,
                  drop=drop,
                  batched=batched,
                  threads=threads)
    print(f"Completed file download at {get_now()} UTC.\n")
    return True

def get_pq_file(table_name, schema, data_dir=os.getenv("DATA_DIR")):
    """Return the path of the parquet file for a table, creating directories."""
    data_dir = os.path.expanduser(data_dir)
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)

    schema_dir = Path(data_dir, schema)
    if not os.path.exists(schema_dir):
        os.makedirs(schema_dir)

    pq_file = Path(data_dir, schema, table_name).with_suffix('.parquet')
    return pq_file

def get_now():
    """Return the current UTC time as a formatted string."""
    return strftime("%Y-%m-%d %H:%M:%S", gmtime())

def get_pq_files(schema, data_dir=os.getenv("DATA_DIR", default="")):
    """Get a list of parquet files in a schema.

    Parameters
    ----------
    schema:
        Name of database schema.

    data_dir: string [Optional]
        Root directory of parquet data repository.
        The default is to use the environment value `DATA_DIR`
        or (if not set) the current directory.

    Returns
    -------
    pq_files: [string]
        Names of parquet files found (without the `.parquet` suffix).
    """
    data_dir = os.path.expanduser(data_dir)
    pq_dir = os.path.join(data_dir, schema)
    files = os.listdir(pq_dir)
    return [re.sub(r"\.parquet$", "", pq_file)
            for pq_file in files
            if re.search(r"\.parquet$", pq_file)]

def update_schema(schema, data_dir=os.getenv("DATA_DIR", default="")):
    """Update existing parquet files in a schema.

    Parameters
    ----------
    schema:
        Name of database schema.

    data_dir: string [Optional]
        Root directory of parquet data repository.
        The default is to use the environment value `DATA_DIR`
        or (if not set) the current directory.

    Returns
    -------
    None
    """
    pq_files = get_pq_files(schema=schema, data_dir=data_dir)
    for pq_file in pq_files:
        wrds_update_pq(table_name=pq_file, schema=schema,
                       data_dir=data_dir, threads=3)

def pq_last_updated(data_dir=None):
    """Get `last_modified` metadata for files in a parquet data repository.

    The repository is assumed to be set up along the lines described at
    https://iangow.github.io/far_book/parquet-wrds.html.

    Parameters
    ----------
    data_dir: string [Optional]
        Root directory of parquet data repository.
        The default is to use the environment value `DATA_DIR`
        or (if not set) the current directory.

    Returns
    -------
    df: pd.DataFrame
        Data frame with four columns: table, schema, last_mod_str, last_mod.
    """
    if not data_dir:
        data_dir = os.path.expanduser(os.environ["DATA_DIR"])
    data_dir = Path(data_dir)

    df = pd.DataFrame([
        {"table": p.stem,
         "schema": subdir.name,
         "last_mod_str": get_modified_pq(p)}
        for subdir in data_dir.iterdir()
        if subdir.is_dir()
        for p in subdir.glob("*.parquet")
    ])

    df["last_mod"] = (
        df["last_mod_str"]
        .str.replace("Last modified: ", "", regex=False)
        .pipe(pd.to_datetime)
        .dt.tz_localize("US/Eastern")
    )

    return df.sort_values("schema").reset_index(drop=True)
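
# A usage sketch for pq_last_updated() (assumes DATA_DIR points at a
# repository populated by the functions above):
#
#   df = pq_last_updated()
#   df.head()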

db2pq-0.1.3/db2pq.egg-info/PKG-INFO
ADDED
@@ -0,0 +1,79 @@
(Identical to db2pq-0.1.3/PKG-INFO above.)
db2pq-0.1.3/db2pq.egg-info/dependency_links.txt
ADDED
@@ -0,0 +1 @@

db2pq-0.1.3/db2pq.egg-info/top_level.txt
ADDED
@@ -0,0 +1 @@
db2pq
db2pq-0.1.3/setup.cfg
ADDED
db2pq-0.1.3/setup.py
ADDED
@@ -0,0 +1,24 @@
import setuptools

with open("README.md", "r") as f:
    long_description = f.read()
print(long_description)

setuptools.setup(
    name="db2pq",
    version="0.1.3",
    author="Ian Gow",
    author_email="iandgow@gmail.com",
    description="Convert database tables to parquet tables.",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/iangow/db2pq/",
    packages=setuptools.find_packages(),
    install_requires=['ibis-framework[duckdb, postgres]', 'pyarrow',
                      'pandas', 'paramiko'],
    python_requires=">=3",
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
)