bp-condenser-postgresql 0.2.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,9 @@
+ Copyright 2019, Tonic AI
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+
@@ -0,0 +1,119 @@
+ Metadata-Version: 2.4
+ Name: bp-condenser-postgresql
+ Version: 0.2.1
+ Summary: Config-driven Postgres database subsetting tool.
+ Author: Brightpick
+ License-Expression: MIT
+ Keywords: database,subset,subsetting,postgres,sampling,ETL
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3 :: Only
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Programming Language :: Python :: 3.14
+ Classifier: Operating System :: OS Independent
+ Classifier: Environment :: Console
+ Classifier: Topic :: Database
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: toposort
+ Requires-Dist: psycopg2-binary
+ Dynamic: license-file
+
+ # Condenser
+
+ Condenser is a config-driven database subsetting tool for Postgres.
+
+ Subsetting data is the process of taking a representative sample of your data in a manner that preserves the integrity of your database, e.g., give me 5% of my users. If you do this naively, e.g., by just grabbing 5% of every table in your database, you will most likely break foreign key constraints. At best, you'll end up with a statistically non-representative data sample.
+
+ One common use case is to scale down a production database to a more reasonable size so that it can be used in staging, test, and development environments. This can be done to save costs and, when used in tandem with PII removal, can be quite powerful as a productivity enhancer. Another example is copying specific rows from one database and placing them into another while maintaining referential integrity.
+
+ You can find more details about how we built this [here](https://www.tonic.ai/blog/condenser-a-database-subsetting-tool) and [here](https://www.tonic.ai/blog/condenser-v2/).
+
+ ## Need to Subset a Large Database?
+
+ Our open-source tool can subset databases up to 10GB, but it will struggle with larger databases. Our premium database subsetter can, among other things (graphical UI, job scheduling, fancy algorithms), subset multi-TB databases with ease. If you're interested, find us at [hello@tonic.ai](mailto:hello@tonic.ai).
+
+ # Installation
+
+ Five steps to install, assuming Python 3.8+:
+
+ 1. Download the required Python modules. You can use [`pip`](https://pypi.org/project/pip/) for easy installation. The required modules are `toposort` and `psycopg2-binary`.
+ ```
+ $ pip install toposort
+ $ pip install psycopg2-binary
+ ```
+ 2. Install the Postgres database tools. We need `pg_dump` and `psql`; either put them on your `$PATH` or point to them with `$POSTGRES_PATH`.
+ 3. Download this repo. You can clone it or download it as a zip (the green "Clone or download" button at the top of the repo page).
+ 4. Set up your configuration and save it in `config.json`. The provided `config.json.example` has the skeleton of what you need to provide: source and destination database connection details, as well as subsetting goals in `initial_targets`. Here's an example that will collect 10% of a table named `public.target_table`.
+ ```
+ "initial_targets": [
+     {
+         "table": "public.target_table",
+         "percent": 10
+     }
+ ]
+ ```
+ There may be more required configuration depending on your database, but simple databases should be easy. See the Config section for more details, and `config.json.example_all` for all of the options in a single config file.
+
+ 5. Run! `$ python direct_subset.py`
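Putting the steps together, a minimal `config.json` might look like the sketch below. The connection values are placeholders, and your database may need additional fields (see the Config section); the field names follow `config.json.example`.

```
{
    "db_type": "postgres",
    "source_db_connection_info": {
        "user_name": "postgres",
        "host": "source-db.example.com",
        "db_name": "production",
        "ssl_mode": "prefer",
        "port": 5432
    },
    "destination_db_connection_info": {
        "user_name": "postgres",
        "host": "localhost",
        "db_name": "production_subset",
        "ssl_mode": "prefer",
        "port": 5432
    },
    "initial_targets": [
        { "table": "public.target_table", "percent": 10 }
    ]
}
```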
+
+ # Config
+
+ Configuration must exist in `config.json`. There is an example configuration provided in `config.json.example`. Most of the configuration is straightforward: source and destination DB connection details and subsetting settings. There are three fields that deserve some additional attention.
+
+ The first is `initial_targets`. This is where you tell the subsetter to begin the subset. You can specify any number of tables as initial targets, and provide either a percent goal (e.g. 5% of the `users` table) or a WHERE clause.
+
+ Next is `dependency_breaks`. The best way to get a full understanding of this is to read our [blog post](https://www.tonic.ai/blog/condenser-a-database-subsetting-tool). But if you want a TL;DR, it's this: the subsetting tool cannot operate on databases with cycles in their foreign key relationships. (Example: table `events` references `users`, which references `company`, which references `events`; a cycle exists if you think of the foreign keys as a directed graph.) If your database has a foreign key cycle (and many do), have no fear! This field lets you tell the subsetter to ignore certain foreign keys, essentially removing the cycle. You'll have to know a bit about your database to use this field effectively. The tool will warn you if you have a cycle that you haven't broken.
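The cycle constraint can be illustrated independently of the tool: modeling tables as nodes and foreign keys as directed edges, a topological ordering of the tables only exists when that graph is acyclic. A small sketch using Python's standard-library `graphlib` (Condenser itself uses the `toposort` package; the table names below are illustrative):

```python
from graphlib import TopologicalSorter, CycleError

# Directed FK graph: table -> set of tables it references.
fks = {"events": {"users"}, "users": {"company"}, "company": {"events"}}

try:
    list(TopologicalSorter(fks).static_order())
except CycleError:
    # Fires: events -> users -> company -> events is a cycle.
    print("cycle detected; a dependency break is required")

# Removing one edge (a "dependency break") makes ordering possible again.
fks["company"] = set()
print(list(TopologicalSorter(fks).static_order()))
# -> ['company', 'users', 'events']
```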
+
+ The last is `fk_augmentation`. Databases frequently have foreign keys that are not codified as constraints on the database; these are implicit foreign keys. For a subsetter to create useful subsets, it needs to know about these implicit constraints. This field lets you add foreign keys to the subsetter that the DB doesn't have listed as constraints.
+
+ Below we describe the use of all configuration parameters, but the best place to start for the exact format is `config.json.example`.
+
+ `db_type`: Required database type selector. The only supported value is `"postgres"`.
+
+ `source_db_connection_info`: Source database connection details. These are recorded as a JSON object with the fields `user_name`, `host`, `db_name`, `ssl_mode`, `password` (optional), and `port`. If `password` is omitted, you will be prompted for a password. See `config.json.example` for details.
+
+ `destination_db_connection_info`: Destination database connection details. Same fields as `source_db_connection_info`.
+
+ `initial_targets`: JSON array of JSON objects. Each object must contain a `table` field, which names a target table, and either a `where` field or a `percent` field. The `where` field specifies a WHERE clause for the subsetting. The `percent` field indicates we want a specific percentage of the target table; it is equivalent to `"where": "random() < <percent>/100.0"`.
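For instance, a `where`-based target might look like this (table and column names are illustrative):

```
"initial_targets": [
    { "table": "public.users", "where": "created_at > '2019-01-01'" }
]
```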
+
+ `passthrough_tables`: Tables that will be copied to the destination database in whole. The value is a JSON array of strings in the form `"<schema>.<table>"`.
+
+ `excluded_tables`: Tables that will be excluded from the subset. The table will exist in the output, but contain no rows. The value is a JSON array of strings in the form `"<schema>.<table>"`.
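Both options take the same shape, for example (hypothetical table names):

```
"passthrough_tables": ["public.countries", "public.plan_types"],
"excluded_tables": ["public.audit_log"]
```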
+
+ `upstream_filters`: Additional filtering to be applied to tables during upstream subsetting. Upstream subsetting happens when a row is imported and there are rows with foreign keys to that row. The subsetter then greedily grabs as many rows from the database as it can, based on the rows already imported. If you don't want such greedy behavior, you can impose additional filters with this option. This is an advanced feature that you probably won't need for your first subsets. The value is a JSON array of JSON objects. See `example-config.json` for details.
+
+ `fk_augmentation`: Additional foreign keys that, while not represented as constraints in the database, are logically present in the data. Foreign keys listed in `fk_augmentation` are unioned with the foreign keys provided by constraints in the database. The value is a JSON array of JSON objects. See `example-config.json` for details.
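As a sketch, one augmented key might declare which columns reference which target table. The field names here are an assumption for illustration; consult `config.json.example` for the exact format.

```
"fk_augmentation": [
    {
        "fk_table": "public.orders",
        "fk_columns": ["user_id"],
        "target_table": "public.users",
        "target_columns": ["id"]
    }
]
```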
+
+ `dependency_breaks`: An array of JSON objects with `fk_table` and `target_table` fields naming the table relationships to be ignored in order to break cycles.
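Using the `events`/`users`/`company` cycle described earlier as an example (the table names are illustrative), breaking the `company` to `events` foreign key would look like:

```
"dependency_breaks": [
    { "fk_table": "public.company", "target_table": "public.events" }
]
```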
+
+ `keep_disconnected_tables`: If `true`, tables that the subset target(s) can't reach by following foreign keys are copied over in full. If `false`, their schema is copied but the tables contain no rows. Put more mathematically: the tables and foreign keys form a graph (tables are nodes, foreign keys are directed edges); disconnected tables are the tables in components that don't contain any targets. This setting decides how to import those tables.
+
+ `max_rows_per_table`: Interpreted as a row limit on every table to be copied. Useful if you have some very large tables that you only want a sampling from. For an unlimited dataset (recommended), set this parameter to `ALL`.
+
+ `pre_constraint_sql`: An array of SQL commands that will be issued on the destination database after subsetting is complete, but before the database constraints have been applied. Useful for cleaning up any data that would otherwise violate the database constraints. `post_subset_sql` is the preferred option for general-purpose queries.
+
+ `post_subset_sql`: An array of SQL commands that will be issued on the destination database after subsetting is complete, and after the database constraints have been applied. Useful for additional ad hoc tasks after subsetting.
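Both fields hold plain arrays of SQL strings. For example (hypothetical tables and statements):

```
"pre_constraint_sql": [
    "DELETE FROM public.orders WHERE user_id NOT IN (SELECT id FROM public.users)"
],
"post_subset_sql": [
    "ANALYZE"
]
```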
+
+ # Running
+
+ Almost all the configuration is in the `config.json` file, so running is as simple as
+
+ ```
+ $ python direct_subset.py
+ ```
+
+ Two command-line arguments are supported:
+
+ `-v`: Verbose output. Useful for performance debugging. Lists almost every query made and its speed.
+
+ `--no-constraints`: Do not add constraints found in the source database to the destination database.
+
+ # Requirements
+
+ Reference the `requirements.txt` file for a list of required Python packages. Python 3.8+ is required.
@@ -0,0 +1,19 @@
+ LICENSE
+ README.md
+ config_reader.py
+ database_helper.py
+ db_connect.py
+ direct_subset.py
+ psql_database_creator.py
+ psql_database_helper.py
+ pyproject.toml
+ result_tabulator.py
+ subset.py
+ subset_utils.py
+ topo_orderer.py
+ bp_condenser_postgresql.egg-info/PKG-INFO
+ bp_condenser_postgresql.egg-info/SOURCES.txt
+ bp_condenser_postgresql.egg-info/dependency_links.txt
+ bp_condenser_postgresql.egg-info/entry_points.txt
+ bp_condenser_postgresql.egg-info/requires.txt
+ bp_condenser_postgresql.egg-info/top_level.txt
@@ -0,0 +1,2 @@
+ [console_scripts]
+ condenser-direct-subset = direct_subset:main
@@ -0,0 +1,2 @@
+ toposort
+ psycopg2-binary
@@ -0,0 +1,10 @@
+ config_reader
+ database_helper
+ db_connect
+ direct_subset
+ psql_database_creator
+ psql_database_helper
+ result_tabulator
+ subset
+ subset_utils
+ topo_orderer
@@ -0,0 +1,93 @@
1
+ import json, sys, collections
2
+
3
+ _config = None
4
+
5
+ def initialize(file_like = None):
6
+ global _config
7
+ if _config != None:
8
+ print('WARNING: Attempted to initialize configuration twice.', file=sys.stderr)
9
+
10
+ if not file_like:
11
+ with open('config.json', 'r') as fp:
12
+ _config = json.load(fp)
13
+ else:
14
+ _config = json.load(file_like)
15
+
16
+ if "desired_result" in _config:
17
+ raise ValueError("desired_result is a key in the old config spec. Check the README.md and config.json.example for the latest configuration parameters.")
18
+
19
+ _validate_db_type()
20
+
21
+ def _validate_db_type():
22
+ if 'db_type' not in _config:
23
+ raise ValueError("Missing required config key 'db_type'. The only supported value is 'postgres'.")
24
+
25
+ db_type = _config['db_type']
26
+ if not isinstance(db_type, str):
27
+ raise ValueError("Invalid db_type {!r}. The only supported value is 'postgres'.".format(db_type))
28
+
29
+ normalized_db_type = db_type.lower()
30
+ if normalized_db_type != 'postgres':
31
+ raise ValueError("Unsupported db_type '{}'. Condenser supports only 'postgres'.".format(db_type))
32
+
33
+ _config['db_type'] = normalized_db_type
34
+
35
+ DependencyBreak = collections.namedtuple('DependencyBreak', ['fk_table', 'target_table'])
36
+ def get_dependency_breaks():
37
+ return set([DependencyBreak(b['fk_table'], b['target_table']) for b in _config['dependency_breaks']])
38
+
39
+ def get_preserve_fk_opportunistically():
40
+ return set([DependencyBreak(b['fk_table'], b['target_table']) for b in _config['dependency_breaks'] if 'perserve_fk_opportunistically' in b and b['perserve_fk_opportunistically']])
41
+
42
+ def get_initial_targets():
43
+ return _config['initial_targets']
44
+
45
+ def get_initial_target_tables():
46
+ return [target["table"] for target in _config['initial_targets']]
47
+
48
+ def keep_disconnected_tables():
49
+ return 'keep_disconnected_tables' in _config and bool(_config['keep_disconnected_tables'])
50
+
51
+ def get_db_type():
52
+ return _config['db_type']
53
+
54
+ def get_source_db_connection_info():
55
+ return _config['source_db_connection_info']
56
+
57
+ def get_destination_db_connection_info():
58
+ return _config['destination_db_connection_info']
59
+
60
+ def get_excluded_tables():
61
+ return list(_config['excluded_tables'])
62
+
63
+ def get_passthrough_tables():
64
+ return list(_config['passthrough_tables'])
65
+
66
+ def get_fk_augmentation():
67
+ return list(map(__convert_tonic_format, _config['fk_augmentation']))
68
+
69
+ def get_upstream_filters():
70
+ return _config["upstream_filters"]
71
+
72
+ def get_pre_constraint_sql():
73
+ return _config.get("pre_constraint_sql", [])
74
+
75
+ def get_post_subset_sql():
76
+ return _config.get("post_subset_sql", [])
77
+
78
+ def get_max_rows_per_table():
79
+ return _config.get("max_rows_per_table")
80
+
81
+ def __convert_tonic_format(obj):
82
+ if "fk_schema" in obj:
83
+ return {
84
+ "fk_table": obj["fk_schema"] + "." + obj["fk_table"],
85
+ "fk_columns": obj["fk_columns"],
86
+ "target_table": obj["target_schema"] + "." + obj["target_table"],
87
+ "target_columns": obj["target_columns"],
88
+ }
89
+ else:
90
+ return obj
91
+
92
+ def verbose_logging():
93
+ return '-v' in sys.argv
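The `initialize`/`_validate_db_type` pair above can be exercised without a `config.json` on disk by passing a file-like object. A minimal self-contained sketch of that validation path (a re-implementation for illustration, not the module itself):

```python
import io
import json

def load_config(file_like):
    # Sketch of config_reader.initialize()'s validation: parse JSON, then
    # require db_type to be the string 'postgres' (case-insensitively).
    config = json.load(file_like)
    db_type = config.get('db_type')
    if not isinstance(db_type, str) or db_type.lower() != 'postgres':
        raise ValueError("Unsupported db_type {!r}; only 'postgres' is supported.".format(db_type))
    config['db_type'] = db_type.lower()  # normalize case, as the module does
    return config

cfg = load_config(io.StringIO('{"db_type": "Postgres"}'))
print(cfg['db_type'])
```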
@@ -0,0 +1,8 @@
1
+ import config_reader
2
+
3
+ def get_specific_helper():
4
+ if config_reader.get_db_type() == 'postgres':
5
+ import psql_database_helper
6
+ return psql_database_helper
7
+ else:
8
+ raise ValueError('unsupported db_type ' + config_reader.get_db_type())
@@ -0,0 +1,85 @@
1
+ import config_reader
2
+ import psycopg2
3
+ import os, pathlib, re, urllib, subprocess, os.path, json, getpass, time, sys, datetime
4
+
5
+ class DbConnect:
6
+
7
+ def __init__(self, db_type, connection_info):
8
+ requiredKeys = [
9
+ 'user_name',
10
+ 'host',
11
+ 'db_name',
12
+ 'port'
13
+ ]
14
+
15
+ for r in requiredKeys:
16
+ if r not in connection_info.keys():
17
+ raise Exception('Missing required key in database connection info: ' + r)
18
+ if 'password' not in connection_info.keys():
19
+ connection_info['password'] = getpass.getpass('Enter password for {0} on host {1}: '.format(connection_info['user_name'], connection_info['host']))
20
+
21
+ self.user = connection_info['user_name']
22
+ self.password = connection_info['password']
23
+ self.host = connection_info['host']
24
+ self.port = connection_info['port']
25
+ self.db_name = connection_info['db_name']
26
+ self.ssl_mode = connection_info.get('ssl_mode')
27
+ self.__db_type = db_type.lower()
28
+
29
+ def get_db_connection(self, read_repeatable=False):
30
+
31
+ if self.__db_type == 'postgres':
32
+ return PsqlConnection(self, read_repeatable)
33
+ else:
34
+ raise ValueError('unsupported db_type ' + self.__db_type)
35
+
36
+ class DbConnection:
37
+ def __init__(self, connection):
38
+ self.connection = connection
39
+
40
+ def commit(self):
41
+ self.connection.commit()
42
+
43
+ def close(self):
44
+ self.connection.close()
45
+
46
+
47
+ class LoggingCursor:
48
+ def __init__(self, cursor):
49
+ self.inner_cursor = cursor
50
+
51
+ def execute(self, query):
52
+ start_time = time.time()
53
+ if config_reader.verbose_logging():
54
+ print('Beginning query @ {}:\n\t{}'.format(str(datetime.datetime.now()), query))
55
+ sys.stdout.flush()
56
+ retval = self.inner_cursor.execute(query)
57
+ if config_reader.verbose_logging():
58
+ print('\tQuery completed in {}s'.format(time.time() - start_time))
59
+ sys.stdout.flush()
60
+ return retval
61
+
62
+ def __getattr__(self, name):
63
+ return getattr(self.inner_cursor, name)
64
+
65
+ def __exit__(self, a, b, c):
66
+ return self.inner_cursor.__exit__(a, b, c)
67
+
68
+ def __enter__(self):
69
+ return LoggingCursor(self.inner_cursor.__enter__())
70
+
71
+ # small wrapper to the connection class that gives us a common interface to the cursor()
72
+ # method. This one is for Postgres.
73
+ class PsqlConnection(DbConnection):
74
+ def __init__(self, connect, read_repeatable):
75
+ connection_string = 'dbname=\'{0}\' user=\'{1}\' password=\'{2}\' host={3} port={4}'.format(connect.db_name, connect.user, connect.password, connect.host, connect.port)
76
+
77
+ if connect.ssl_mode :
78
+ connection_string = connection_string + ' sslmode={0}'.format(connect.ssl_mode)
79
+
80
+ DbConnection.__init__(self, psycopg2.connect(connection_string))
81
+ if read_repeatable:
82
+ self.connection.isolation_level = psycopg2.extensions.ISOLATION_LEVEL_REPEATABLE_READ
83
+
84
+ def cursor(self, name=None, withhold=False):
85
+ return LoggingCursor(self.connection.cursor(name=name, withhold=withhold))
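The `LoggingCursor` delegation pattern above — time `execute()`, forward everything else to the wrapped cursor — can be demonstrated without a live Postgres connection. A runnable sketch around a stand-in cursor (`FakeCursor` is invented for the example):

```python
import time

class TimingCursor:
    """Sketch of the db_connect.LoggingCursor pattern: wrap a cursor,
    time execute(), and delegate all other attributes."""
    def __init__(self, inner, log=print):
        self.inner = inner
        self.log = log

    def execute(self, query):
        start = time.time()
        result = self.inner.execute(query)
        self.log('query took {:.6f}s'.format(time.time() - start))
        return result

    def __getattr__(self, name):
        # Anything not defined here falls through to the wrapped cursor.
        return getattr(self.inner, name)

class FakeCursor:
    """Stand-in for a DB-API cursor; just echoes the query back."""
    def execute(self, query):
        return query.upper()

messages = []
cur = TimingCursor(FakeCursor(), log=messages.append)
print(cur.execute('select 1'))
```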
@@ -0,0 +1,66 @@
1
+ import uuid, sys
2
+ import config_reader, result_tabulator
3
+ import time
4
+ from subset import Subset
5
+ from psql_database_creator import PsqlDatabaseCreator
6
+ from db_connect import DbConnect
7
+ from subset_utils import print_progress
8
+ import database_helper
9
+
10
+ def db_creator(db_type, source, dest):
11
+ if db_type == 'postgres':
12
+ return PsqlDatabaseCreator(source, dest, False)
13
+ else:
14
+ raise ValueError('unsupported db_type ' + db_type)
15
+
16
+
17
+ def main() -> None:
18
+ if "--stdin" in sys.argv:
19
+ config_reader.initialize(sys.stdin)
20
+ else:
21
+ config_reader.initialize()
22
+
23
+ db_type = config_reader.get_db_type()
24
+ source_dbc = DbConnect(db_type, config_reader.get_source_db_connection_info())
25
+ destination_dbc = DbConnect(db_type, config_reader.get_destination_db_connection_info())
26
+
27
+ database = db_creator(db_type, source_dbc, destination_dbc)
28
+ database.teardown()
29
+ database.create()
30
+
31
+ # Get list of tables to operate on
32
+ db_helper = database_helper.get_specific_helper()
33
+ all_tables = db_helper.list_all_tables(source_dbc)
34
+ all_tables = [x for x in all_tables if x not in config_reader.get_excluded_tables()]
35
+
36
+ subsetter = Subset(source_dbc, destination_dbc, all_tables)
37
+
38
+ try:
39
+ subsetter.prep_temp_dbs()
40
+ subsetter.run_middle_out()
41
+
42
+ print("Beginning pre constraint SQL calls")
43
+ start_time = time.time()
44
+ for idx, sql in enumerate(config_reader.get_pre_constraint_sql()):
45
+ print_progress(sql, idx+1, len(config_reader.get_pre_constraint_sql()))
46
+ db_helper.run_query(sql, destination_dbc.get_db_connection())
47
+ print("Completed pre constraint SQL calls in {}s".format(time.time()-start_time))
48
+
49
+ print("Adding database constraints")
50
+ if "--no-constraints" not in sys.argv:
51
+ database.add_constraints()
52
+
53
+ print("Beginning post subset SQL calls")
54
+ start_time = time.time()
55
+ for idx, sql in enumerate(config_reader.get_post_subset_sql()):
56
+ print_progress(sql, idx+1, len(config_reader.get_post_subset_sql()))
57
+ db_helper.run_query(sql, destination_dbc.get_db_connection())
58
+ print("Completed post subset SQL calls in {}s".format(time.time()-start_time))
59
+
60
+ result_tabulator.tabulate(source_dbc, destination_dbc, all_tables)
61
+ finally:
62
+ subsetter.unprep_temp_dbs()
63
+
64
+
65
+ if __name__ == '__main__':
66
+ main()