bp-condenser-postgresql 0.2.1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
bp_condenser_postgresql-0.2.1.dist-info/METADATA ADDED
@@ -0,0 +1,119 @@
1
+ Metadata-Version: 2.4
2
+ Name: bp-condenser-postgresql
3
+ Version: 0.2.1
4
+ Summary: Config-driven Postgres database subsetting tool.
5
+ Author: Brightpick
6
+ License-Expression: MIT
7
+ Keywords: database,subset,subsetting,postgres,sampling,ETL
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: Programming Language :: Python :: 3 :: Only
10
+ Classifier: Programming Language :: Python :: 3.8
11
+ Classifier: Programming Language :: Python :: 3.9
12
+ Classifier: Programming Language :: Python :: 3.10
13
+ Classifier: Programming Language :: Python :: 3.11
14
+ Classifier: Programming Language :: Python :: 3.12
15
+ Classifier: Programming Language :: Python :: 3.13
16
+ Classifier: Programming Language :: Python :: 3.14
17
+ Classifier: Operating System :: OS Independent
18
+ Classifier: Environment :: Console
19
+ Classifier: Topic :: Database
20
+ Requires-Python: >=3.8
21
+ Description-Content-Type: text/markdown
22
+ License-File: LICENSE
23
+ Requires-Dist: toposort
24
+ Requires-Dist: psycopg2-binary
25
+ Dynamic: license-file
26
+
27
+ # Condenser
28
+
29
+ Condenser is a config-driven database subsetting tool for Postgres.
30
+
31
+ Subsetting data is the process of taking a representative sample of your data in a manner that preserves the integrity of your database, e.g., give me 5% of my users. If you do this naively (e.g., just grab 5% of every table in your database), you will most likely break foreign key constraints. At best, you'll end up with a statistically non-representative data sample.
32
+
33
+ One common use-case is to scale down a production database to a more reasonable size so that it can be used in staging, test, and development environments. This can be done to save costs and, when used in tandem with PII removal, can be quite powerful as a productivity enhancer. Another example is copying specific rows from one database and placing them into another while maintaining referential integrity.
34
+
35
+ You can find more details about how we built this [here](https://www.tonic.ai/blog/condenser-a-database-subsetting-tool) and [here](https://www.tonic.ai/blog/condenser-v2/).
36
+
37
+ ## Need to Subset a Large Database?
38
+
39
+ Our open-source tool can subset databases up to 10GB, but it will struggle with larger databases. Our premium database subsetter can, among other things (graphical UI, job scheduling, fancy algorithms), subset multi-TB databases with ease. If you're interested, find us at [hello@tonic.ai](mailto:hello@tonic.ai).
40
+
41
+ # Installation
42
+
43
+ Five steps to install, assuming Python 3.8+:
44
+
45
+ 1. Download the required Python modules. You can use [`pip`](https://pypi.org/project/pip/) for easy installation. The required modules are `toposort` and `psycopg2-binary`.
46
+ ```
47
+ $ pip install toposort
48
+ $ pip install psycopg2-binary
49
+ ```
50
+ 2. Install the Postgres client tools. We need `pg_dump` and `psql`; they must be on your `$PATH`, or you can point to them with the `$POSTGRES_PATH` environment variable.
51
+ 3. Download this repo. You can clone the repo or download it as a zip. Scroll up; it's the green button that says "Clone or download".
52
+ 4. Set up your configuration and save it in `config.json`. The provided `config.json.example` has the skeleton of what you need to provide: source and destination database connection details, as well as subsetting goals in `initial_targets`. Here's an example that will collect 10% of a table named `public.target_table`.
53
+ ```
54
+ "initial_targets": [
55
+ {
56
+ "table": "public.target_table",
57
+ "percent": 10
58
+ }
59
+ ]
60
+ ```
61
+ There may be more required configuration depending on your database, but simple databases should be easy. See the Config section for more details, and `config.json.example_all` for all of the options in a single config file.
62
+
63
+ 5. Run! `$ python direct_subset.py`
64
+
65
+ # Config
66
+
67
+ Configuration must exist in `config.json`. There is an example configuration provided in `config.json.example`. Most of the configuration is straightforward: source and destination DB connection details and subsetting settings. There are three fields that deserve some additional attention.
68
+
69
+ The first is `initial_targets`. This is where you tell the subsetter to begin the subset. You can specify any number of tables as an initial target, and provide either a percent goal (e.g. 5% of the `users` table) or a WHERE clause.
70
+
71
+ Next is `dependency_breaks`. The best way to get a full understanding of this is to read our [blog post](https://www.tonic.ai/blog/condenser-a-database-subsetting-tool). But if you want a TLDR, it's this: the subsetting tool cannot operate on databases with cycles in their foreign key relationships. (Example: table `events` references `users`, which references `company`, which references `events`; a cycle exists if you think of the foreign keys as a directed graph.) If your database has a foreign key cycle (and many do), have no fear! This field lets you tell the subsetter to ignore certain foreign keys, essentially removing the cycle. You'll have to know a bit about your database to use this field effectively. The tool will warn you if you have a cycle that you haven't broken.
72
+
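+ For example, to break the cycle above at the foreign key from `company` to `events` (table names are from that hypothetical schema), you could add this to `config.json`:
+ ```
+ "dependency_breaks": [
+     {
+         "fk_table": "public.company",
+         "target_table": "public.events"
+     }
+ ]
+ ```
+ 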
73
+ The last is `fk_augmentation`. Databases frequently have foreign keys that are not codified as constraints in the database; these are implicit foreign keys. For a subsetter to create useful subsets, it needs to know about these implicit constraints. This field lets you add foreign keys to the subsetter that the DB doesn't have listed as constraints.
74
+
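+ A minimal sketch, assuming hypothetical `orders`/`users` tables (the field names match what `config_reader.py` accepts):
+ ```
+ "fk_augmentation": [
+     {
+         "fk_table": "public.orders",
+         "fk_columns": ["user_id"],
+         "target_table": "public.users",
+         "target_columns": ["id"]
+     }
+ ]
+ ```
+ 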
75
+ Below we describe the use of all configuration parameters, but the best place to start for the exact format is `config.json.example`.
76
+
77
+ `db_type`: Required database type selector. The only supported value is `"postgres"`.
78
+
79
+ `source_db_connection_info`: Source database connection details. These are recorded as a JSON object with the fields `user_name`, `host`, `db_name`, `port`, and optionally `password` and `ssl_mode`. If `password` is omitted, then you will be prompted for a password. See `config.json.example` for details.
80
+
81
+ `destination_db_connection_info`: Destination database connection details. Same fields as `source_db_connection_info`.
82
+
83
+ `initial_targets`: JSON array of JSON objects. The inner object must contain a `table` field, which is a target table, and either a `where` field or a `percent` field. The `where` field is used to specify a WHERE clause for the subsetting. The `percent` field indicates we want a specific percentage of the target table; it is equivalent to `"where": "random() < <percent>/100.0"`.
84
+
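+ For example, to subset with a WHERE clause instead of a percent goal (table and column names are hypothetical):
+ ```
+ "initial_targets": [
+     {
+         "table": "public.users",
+         "where": "created_at >= '2024-01-01'"
+     }
+ ]
+ ```
+ 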
85
+ `passthrough_tables`: Tables that will be copied in full to the destination database. The value is a JSON array of strings in the form `"<schema>.<table>"`.
86
+
87
+ `excluded_tables`: Tables that will be excluded from the subset. The table will exist in the output, but contain no rows. The value is a JSON array of strings in the form `"<schema>.<table>"`.
88
+
89
+ `upstream_filters`: Additional filtering to be applied to tables during upstream subsetting. Upstream subsetting happens when a row is imported and there are rows with foreign keys to that row. The subsetter then greedily grabs as many rows from the database as it can, based on the rows already imported. If you don't want such greedy behavior, you can impose additional filters with this option. This is an advanced feature that you probably won't need for your first subsets. The value is a JSON array of JSON objects. See `config.json.example` for details.
90
+
91
+ `fk_augmentation`: Additional foreign keys that, while not represented as constraints in the database, are logically present in the data. Foreign keys listed in `fk_augmentation` are unioned with the foreign keys provided by constraints in the database. The value is a JSON array of JSON objects. See `config.json.example` for details.
92
+
93
+ `dependency_breaks`: An array of JSON objects, each with *"fk_table"* and *"target_table"* fields naming a table relationship to ignore in order to break cycles.
94
+
95
+ `keep_disconnected_tables`: If `true`, tables that the subset target(s) don't reach when following foreign keys will be copied over in full. If `false`, their schema will be copied but the table contents will be empty. Put more mathematically: the tables and foreign keys create a graph (tables are nodes, foreign keys are directed edges), and disconnected tables are the tables in components that don't contain any targets. This setting decides how to import those tables.
96
+
97
+ `max_rows_per_table`: A row limit applied to every table that is copied. Useful if you have some very large tables that you want only a sampling from. For an unlimited dataset (recommended), set this parameter to `ALL`.
98
+
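+ For example, to cap every copied table at 100,000 rows (an arbitrary illustrative value):
+ ```
+ "max_rows_per_table": 100000
+ ```
+ 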
99
+ `pre_constraint_sql`: An array of SQL commands that will be issued on the destination database after subsetting is complete, but before the database constraints have been applied. Useful to perform tasks that will clean up any data that would otherwise violate the database constraints. `post_subset_sql` is the preferred option for any general purpose queries.
100
+
101
+ `post_subset_sql`: An array of SQL commands that will be issued on the destination database after subsetting is complete, and after the database constraints have been applied. Useful to perform additional ad hoc tasks after subsetting.
102
+
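+ A sketch of the expected format, assuming a hypothetical `users` table (the commands themselves are only illustrative):
+ ```
+ "post_subset_sql": [
+     "VACUUM ANALYZE",
+     "UPDATE public.users SET email = 'redacted@example.com'"
+ ]
+ ```
+ 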
103
+ # Running
104
+
105
+ Almost all the configuration is in the `config.json` file, so running is as simple as
106
+
107
+ ```
108
+ $ python direct_subset.py
109
+ ```
110
+
111
+ Two command-line arguments are supported:
112
+
113
+ `-v`: Verbose output. Useful for performance debugging. Lists almost every query made, and its speed.
114
+
115
+ `--no-constraints`: Do not add constraints found in the source database to the destination database.
116
+
117
+ # Requirements
118
+
119
+ Reference the `requirements.txt` file for a list of required Python packages. Python 3.8+ is required.
bp_condenser_postgresql-0.2.1.dist-info/RECORD ADDED
@@ -0,0 +1,16 @@
1
+ config_reader.py,sha256=47Jt2gNNZhA-pDux3_zoat8cPGzq2PyikE1U5rtHmms,3216
2
+ database_helper.py,sha256=nFVVjKLZ1fE7V9A8ot8UrLAZZ6MzZkHluMVTH7HvePU,260
3
+ db_connect.py,sha256=dzG6ki6UzffPzEtC9OTn-7CMr2e_sciic_o-_-gC3xI,3174
4
+ direct_subset.py,sha256=uTqNC-u4GJBSKMJIR6eVx1C_y-UT5yfZBt_sd0SySjE,2395
5
+ psql_database_creator.py,sha256=1jyRUH3dxXzuqlKyQNkeFOn5SeoefiicNCEetMLzd80,6614
6
+ psql_database_helper.py,sha256=os_JzCeY0y6ActhYox6h4tq15ORYfsyIAbM152CSvxo,8903
7
+ result_tabulator.py,sha256=IddIh3pDcQhbHZ9OCBdY864siKzwmp2Oxw0nP_A3j9E,850
8
+ subset.py,sha256=ZDtUooxvP1QiJsK_uVZtrENXG8ZMpjf4uCX-O00Qyd4,11068
9
+ subset_utils.py,sha256=05yYYx_n4vf26O1yaTaiovFW0KfQ5mpOIfIfNeEHbrE,5534
10
+ topo_orderer.py,sha256=ff-l0ni7ze__5n-ZdsM8EtS_zYQLh2lX5mMp7CYJtMk,1171
11
+ bp_condenser_postgresql-0.2.1.dist-info/licenses/LICENSE,sha256=rHPHCfXUe8vG68ncurrHjKtpD8Tl48gi5J4ACqIWV-Q,1051
12
+ bp_condenser_postgresql-0.2.1.dist-info/METADATA,sha256=0Z8I0Ws78eyTHPJ68IYKy6E4ratL5Pa5T2rw09iGhNg,9692
13
+ bp_condenser_postgresql-0.2.1.dist-info/WHEEL,sha256=aeYiig01lYGDzBgS8HxWXOg3uV61G9ijOsup-k9o1sk,91
14
+ bp_condenser_postgresql-0.2.1.dist-info/entry_points.txt,sha256=rzV1x0_f9t2VYgcDFrVb5-aPT5PIPDR2W8sOKxwz-uA,63
15
+ bp_condenser_postgresql-0.2.1.dist-info/top_level.txt,sha256=NXsX5NsBK9_elKFzgGN30Jxtz_lWRmmZuumormQFQZA,148
16
+ bp_condenser_postgresql-0.2.1.dist-info/RECORD,,
bp_condenser_postgresql-0.2.1.dist-info/WHEEL ADDED
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: setuptools (82.0.1)
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
5
+
bp_condenser_postgresql-0.2.1.dist-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ condenser-direct-subset = direct_subset:main
bp_condenser_postgresql-0.2.1.dist-info/licenses/LICENSE ADDED
@@ -0,0 +1,9 @@
1
+ Copyright 2019, Tonic AI
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4
+
5
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6
+
7
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
8
+
9
+
bp_condenser_postgresql-0.2.1.dist-info/top_level.txt ADDED
@@ -0,0 +1,10 @@
1
+ config_reader
2
+ database_helper
3
+ db_connect
4
+ direct_subset
5
+ psql_database_creator
6
+ psql_database_helper
7
+ result_tabulator
8
+ subset
9
+ subset_utils
10
+ topo_orderer
config_reader.py ADDED
@@ -0,0 +1,93 @@
1
+ import json, sys, collections
2
+
3
+ _config = None
4
+
5
+ def initialize(file_like = None):
6
+ global _config
7
+ if _config is not None:
8
+ print('WARNING: Attempted to initialize configuration twice.', file=sys.stderr)
9
+
10
+ if not file_like:
11
+ with open('config.json', 'r') as fp:
12
+ _config = json.load(fp)
13
+ else:
14
+ _config = json.load(file_like)
15
+
16
+ if "desired_result" in _config:
17
+ raise ValueError("desired_result is a key in the old config spec. Check the README.md and config.json.example for the latest configuration parameters.")
18
+
19
+ _validate_db_type()
20
+
21
+ def _validate_db_type():
22
+ if 'db_type' not in _config:
23
+ raise ValueError("Missing required config key 'db_type'. The only supported value is 'postgres'.")
24
+
25
+ db_type = _config['db_type']
26
+ if not isinstance(db_type, str):
27
+ raise ValueError("Invalid db_type {!r}. The only supported value is 'postgres'.".format(db_type))
28
+
29
+ normalized_db_type = db_type.lower()
30
+ if normalized_db_type != 'postgres':
31
+ raise ValueError("Unsupported db_type '{}'. Condenser supports only 'postgres'.".format(db_type))
32
+
33
+ _config['db_type'] = normalized_db_type
34
+
35
+ DependencyBreak = collections.namedtuple('DependencyBreak', ['fk_table', 'target_table'])
36
+ def get_dependency_breaks():
37
+ return set([DependencyBreak(b['fk_table'], b['target_table']) for b in _config['dependency_breaks']])
38
+
39
+ def get_preserve_fk_opportunistically():
40
+ return set([DependencyBreak(b['fk_table'], b['target_table']) for b in _config['dependency_breaks'] if 'perserve_fk_opportunistically' in b and b['perserve_fk_opportunistically']])
41
+
42
+ def get_initial_targets():
43
+ return _config['initial_targets']
44
+
45
+ def get_initial_target_tables():
46
+ return [target["table"] for target in _config['initial_targets']]
47
+
48
+ def keep_disconnected_tables():
49
+ return 'keep_disconnected_tables' in _config and bool(_config['keep_disconnected_tables'])
50
+
51
+ def get_db_type():
52
+ return _config['db_type']
53
+
54
+ def get_source_db_connection_info():
55
+ return _config['source_db_connection_info']
56
+
57
+ def get_destination_db_connection_info():
58
+ return _config['destination_db_connection_info']
59
+
60
+ def get_excluded_tables():
61
+ return list(_config['excluded_tables'])
62
+
63
+ def get_passthrough_tables():
64
+ return list(_config['passthrough_tables'])
65
+
66
+ def get_fk_augmentation():
67
+ return list(map(__convert_tonic_format, _config['fk_augmentation']))
68
+
69
+ def get_upstream_filters():
70
+ return _config["upstream_filters"]
71
+
72
+ def get_pre_constraint_sql():
73
+ return _config["pre_constraint_sql"] if "pre_constraint_sql" in _config else []
74
+
75
+ def get_post_subset_sql():
76
+ return _config["post_subset_sql"] if "post_subset_sql" in _config else []
77
+
78
+ def get_max_rows_per_table():
79
+ return _config["max_rows_per_table"] if "max_rows_per_table" in _config else None
80
+
81
+ def __convert_tonic_format(obj):
82
+ if "fk_schema" in obj:
83
+ return {
84
+ "fk_table": obj["fk_schema"] + "." + obj["fk_table"],
85
+ "fk_columns": obj["fk_columns"],
86
+ "target_table": obj["target_schema"] + "." + obj["target_table"],
87
+ "target_columns": obj["target_columns"],
88
+ }
89
+ else:
90
+ return obj
91
+
92
+ def verbose_logging():
93
+ return '-v' in sys.argv
database_helper.py ADDED
@@ -0,0 +1,8 @@
1
+ import config_reader
2
+
3
+ def get_specific_helper():
4
+ if config_reader.get_db_type() == 'postgres':
5
+ import psql_database_helper
6
+ return psql_database_helper
7
+ else:
8
+ raise ValueError('unsupported db_type ' + config_reader.get_db_type())
db_connect.py ADDED
@@ -0,0 +1,85 @@
1
+ import config_reader
2
+ import psycopg2
3
+ import os, pathlib, re, urllib, subprocess, os.path, json, getpass, time, sys, datetime
4
+
5
+ class DbConnect:
6
+
7
+ def __init__(self, db_type, connection_info):
8
+ requiredKeys = [
9
+ 'user_name',
10
+ 'host',
11
+ 'db_name',
12
+ 'port'
13
+ ]
14
+
15
+ for r in requiredKeys:
16
+ if r not in connection_info.keys():
17
+ raise Exception('Missing required key in database connection info: ' + r)
18
+ if 'password' not in connection_info.keys():
19
+ connection_info['password'] = getpass.getpass('Enter password for {0} on host {1}: '.format(connection_info['user_name'], connection_info['host']))
20
+
21
+ self.user = connection_info['user_name']
22
+ self.password = connection_info['password']
23
+ self.host = connection_info['host']
24
+ self.port = connection_info['port']
25
+ self.db_name = connection_info['db_name']
26
+ self.ssl_mode = connection_info['ssl_mode'] if 'ssl_mode' in connection_info else None
27
+ self.__db_type = db_type.lower()
28
+
29
+ def get_db_connection(self, read_repeatable=False):
30
+
31
+ if self.__db_type == 'postgres':
32
+ return PsqlConnection(self, read_repeatable)
33
+ else:
34
+ raise ValueError('unsupported db_type ' + self.__db_type)
35
+
36
+ class DbConnection:
37
+ def __init__(self, connection):
38
+ self.connection = connection
39
+
40
+ def commit(self):
41
+ self.connection.commit()
42
+
43
+ def close(self):
44
+ self.connection.close()
45
+
46
+
47
+ class LoggingCursor:
48
+ def __init__(self, cursor):
49
+ self.inner_cursor = cursor
50
+
51
+ def execute(self, query):
52
+ start_time = time.time()
53
+ if config_reader.verbose_logging():
54
+ print('Beginning query @ {}:\n\t{}'.format(str(datetime.datetime.now()), query))
55
+ sys.stdout.flush()
56
+ retval = self.inner_cursor.execute(query)
57
+ if config_reader.verbose_logging():
58
+ print('\tQuery completed in {}s'.format(time.time() - start_time))
59
+ sys.stdout.flush()
60
+ return retval
61
+
62
+ def __getattr__(self, name):
63
+ return self.inner_cursor.__getattribute__(name)
64
+
65
+ def __exit__(self, a, b, c):
66
+ return self.inner_cursor.__exit__(a, b, c)
67
+
68
+ def __enter__(self):
69
+ return LoggingCursor(self.inner_cursor.__enter__())
70
+
71
+ # small wrapper to the connection class that gives us a common interface to the cursor()
72
+ # method. This one is for Postgres.
73
+ class PsqlConnection(DbConnection):
74
+ def __init__(self, connect, read_repeatable):
75
+ connection_string = 'dbname=\'{0}\' user=\'{1}\' password=\'{2}\' host={3} port={4}'.format(connect.db_name, connect.user, connect.password, connect.host, connect.port)
76
+
77
+ if connect.ssl_mode :
78
+ connection_string = connection_string + ' sslmode={0}'.format(connect.ssl_mode)
79
+
80
+ DbConnection.__init__(self, psycopg2.connect(connection_string))
81
+ if read_repeatable:
82
+ self.connection.isolation_level = psycopg2.extensions.ISOLATION_LEVEL_REPEATABLE_READ
83
+
84
+ def cursor(self, name=None, withhold=False):
85
+ return LoggingCursor(self.connection.cursor(name=name, withhold=withhold))
direct_subset.py ADDED
@@ -0,0 +1,66 @@
1
+ import uuid, sys
2
+ import config_reader, result_tabulator
3
+ import time
4
+ from subset import Subset
5
+ from psql_database_creator import PsqlDatabaseCreator
6
+ from db_connect import DbConnect
7
+ from subset_utils import print_progress
8
+ import database_helper
9
+
10
+ def db_creator(db_type, source, dest):
11
+ if db_type == 'postgres':
12
+ return PsqlDatabaseCreator(source, dest, False)
13
+ else:
14
+ raise ValueError('unsupported db_type ' + db_type)
15
+
16
+
17
+ def main() -> None:
18
+ if "--stdin" in sys.argv:
19
+ config_reader.initialize(sys.stdin)
20
+ else:
21
+ config_reader.initialize()
22
+
23
+ db_type = config_reader.get_db_type()
24
+ source_dbc = DbConnect(db_type, config_reader.get_source_db_connection_info())
25
+ destination_dbc = DbConnect(db_type, config_reader.get_destination_db_connection_info())
26
+
27
+ database = db_creator(db_type, source_dbc, destination_dbc)
28
+ database.teardown()
29
+ database.create()
30
+
31
+ # Get list of tables to operate on
32
+ db_helper = database_helper.get_specific_helper()
33
+ all_tables = db_helper.list_all_tables(source_dbc)
34
+ all_tables = [x for x in all_tables if x not in config_reader.get_excluded_tables()]
35
+
36
+ subsetter = Subset(source_dbc, destination_dbc, all_tables)
37
+
38
+ try:
39
+ subsetter.prep_temp_dbs()
40
+ subsetter.run_middle_out()
41
+
42
+ print("Beginning pre constraint SQL calls")
43
+ start_time = time.time()
44
+ for idx, sql in enumerate(config_reader.get_pre_constraint_sql()):
45
+ print_progress(sql, idx+1, len(config_reader.get_pre_constraint_sql()))
46
+ db_helper.run_query(sql, destination_dbc.get_db_connection())
47
+ print("Completed pre constraint SQL calls in {}s".format(time.time()-start_time))
48
+
49
+ print("Adding database constraints")
50
+ if "--no-constraints" not in sys.argv:
51
+ database.add_constraints()
52
+
53
+ print("Beginning post subset SQL calls")
54
+ start_time = time.time()
55
+ for idx, sql in enumerate(config_reader.get_post_subset_sql()):
56
+ print_progress(sql, idx+1, len(config_reader.get_post_subset_sql()))
57
+ db_helper.run_query(sql, destination_dbc.get_db_connection())
58
+ print("Completed post subset SQL calls in {}s".format(time.time()-start_time))
59
+
60
+ result_tabulator.tabulate(source_dbc, destination_dbc, all_tables)
61
+ finally:
62
+ subsetter.unprep_temp_dbs()
63
+
64
+
65
+ if __name__ == '__main__':
66
+ main()
psql_database_creator.py ADDED
@@ -0,0 +1,163 @@
1
+ import os, urllib, urllib.parse, subprocess
2
+ from db_connect import DbConnect
3
+ import database_helper
4
+
5
+ class PsqlDatabaseCreator:
6
+ def __init__(self, source_dbc, destination_dbc, use_existing_dump = False):
7
+ self.destination_dbc = destination_dbc
8
+ self.source_dbc = source_dbc
9
+ self.__source_db_connection = source_dbc.get_db_connection()
10
+
11
+ self.use_existing_dump = use_existing_dump
12
+
13
+ self.output_path = os.path.join(os.getcwd(),'SQL')
14
+ if not os.path.isdir(self.output_path):
15
+ os.mkdir(self.output_path)
16
+
17
+ self.add_constraint_output_path = os.path.join(os.getcwd(), 'SQL', 'add_constraint_output.txt')
18
+ self.add_constraint_error_path = os.path.join(os.getcwd(), 'SQL', 'add_constraint_error.txt')
19
+
20
+ if os.path.exists(self.add_constraint_output_path):
21
+ os.remove(self.add_constraint_output_path)
22
+ if os.path.exists(self.add_constraint_error_path):
23
+ os.remove(self.add_constraint_error_path)
24
+
25
+
26
+ self.create_output_path = os.path.join(os.getcwd(), 'SQL', 'create_output.txt')
27
+ self.create_error_path = os.path.join(os.getcwd(), 'SQL', 'create_error.txt')
28
+
29
+ if os.path.exists(self.create_output_path):
30
+ os.remove(self.create_output_path)
31
+ if os.path.exists(self.create_error_path):
32
+ os.remove(self.create_error_path)
33
+
34
+ def create(self):
35
+
36
+ if self.use_existing_dump == True:
37
+ pass
38
+ else:
39
+ cur_path = os.getcwd()
40
+
41
+ pg_dump_path = get_pg_bin_path()
42
+ if pg_dump_path != '':
43
+ os.chdir(pg_dump_path)
44
+
45
+ connection = '--dbname=postgresql://{0}@{2}:{3}/{4}?{1}'.format(self.source_dbc.user, urllib.parse.urlencode({'password': self.source_dbc.password}), self.source_dbc.host, self.source_dbc.port, self.source_dbc.db_name)
46
+
47
+ result = subprocess.run(['pg_dump', connection, '--schema-only', '--no-owner', '--no-privileges', '--section=pre-data']
48
+ , stdout = subprocess.PIPE, stderr = subprocess.PIPE)
49
+ if result.returncode != 0 or contains_errors(result.stderr):
50
+ raise Exception('Capturing pre-data schema failed. Details:\n{}'.format(result.stderr))
51
+ os.chdir(cur_path)
52
+
53
+ pre_data_sql = self.__filter_commands(result.stdout.decode('utf-8'))
54
+ self.run_psql(pre_data_sql)
55
+
56
+ def teardown(self):
57
+ user_schemas = database_helper.get_specific_helper().list_all_user_schemas(self.__source_db_connection)
58
+
59
+ if len(user_schemas) == 0:
60
+ raise Exception("Couldn't find any non system schemas.")
61
+
62
+ drop_statements = ["DROP SCHEMA IF EXISTS \"{}\" CASCADE".format(s) for s in user_schemas if s != 'public']
63
+
64
+ q = ';'.join(drop_statements)
65
+ q += ";DROP SCHEMA IF EXISTS public CASCADE;CREATE SCHEMA IF NOT EXISTS public;"
66
+
67
+ self.run_query(q)
68
+
69
+
70
+ def add_constraints(self):
71
+ if self.use_existing_dump == True:
72
+ pass
73
+ else:
74
+ cur_path = os.getcwd()
75
+
76
+ pg_dump_path = get_pg_bin_path()
77
+ if pg_dump_path != '':
78
+ os.chdir(pg_dump_path)
79
+ connection = '--dbname=postgresql://{0}@{2}:{3}/{4}?{1}'.format(self.source_dbc.user, urllib.parse.urlencode({'password': self.source_dbc.password}), self.source_dbc.host, self.source_dbc.port, self.source_dbc.db_name)
80
+ result = subprocess.run(['pg_dump', connection, '--schema-only', '--no-owner', '--no-privileges', '--section=post-data']
81
+ , stderr = subprocess.PIPE, stdout = subprocess.PIPE)
82
+ if result.returncode != 0 or contains_errors(result.stderr):
83
+ raise Exception('Capturing post-data schema failed. Details:\n{}'.format(result.stderr))
84
+
85
+ os.chdir(cur_path)
86
+
87
+ self.run_psql(result.stdout.decode('utf-8'))
88
+
89
+ def __filter_commands(self, input):
90
+
91
+ input = input.split('\n')
92
+ filtered_key_words = [
93
+ 'COMMENT ON CONSTRAINT',
94
+ 'COMMENT ON EXTENSION'
95
+ ]
96
+
97
+ retval = []
98
+ for line in input:
99
+ l = line.rstrip()
100
+ filtered = False
101
+ for key in filtered_key_words:
102
+ if l.startswith(key):
103
+ filtered = True
104
+
105
+ if not filtered:
106
+ retval.append(l)
107
+
108
+ return '\n'.join(retval)
109
+
110
+ def run_query(self, query):
111
+
112
+ pg_dump_path = get_pg_bin_path()
113
+ cur_path = os.getcwd()
114
+
115
+ if(pg_dump_path != ''):
116
+ os.chdir(pg_dump_path)
117
+
118
+ connection_info = self.destination_dbc
119
+ connection_string = '--dbname=postgresql://{0}@{2}:{3}/{4}?{1}'.format(
120
+ connection_info.user, urllib.parse.urlencode({'password': connection_info.password}), connection_info.host,
121
+ connection_info.port, connection_info.db_name)
122
+
123
+
124
+ result = subprocess.run(['psql', connection_string, '-c {0}'.format(query)], stderr = subprocess.PIPE, stdout = subprocess.DEVNULL)
125
+ if result.returncode != 0 or contains_errors(result.stderr):
126
+ raise Exception('Running query: "{}" failed. Details:\n{}'.format(query, result.stderr))
127
+
128
+ os.chdir(cur_path)
129
+
130
+ def run_psql(self, queries):
131
+
132
+ pg_dump_path = get_pg_bin_path()
133
+ cur_path = os.getcwd()
134
+
135
+ if(pg_dump_path != ''):
136
+ os.chdir(pg_dump_path)
137
+
138
+ connect = self.destination_dbc
139
+ connection_string = '--dbname=postgresql://{0}@{2}:{3}/{4}?{1}'.format(
140
+ connect.user, urllib.parse.urlencode({'password': connect.password}), connect.host,
141
+ connect.port, connect.db_name)
142
+
143
+ input = queries.encode('utf-8')
144
+ result = subprocess.run(['psql', connection_string], stderr = subprocess.PIPE, input = input, stdout= subprocess.DEVNULL)
145
+ if result.returncode != 0 or contains_errors(result.stderr):
146
+ raise Exception('Creating schema failed. Details:\n{}'.format(result.stderr))
147
+
148
+ os.chdir(cur_path)
149
+
150
+ def get_pg_bin_path():
151
+ if 'POSTGRES_PATH' in os.environ:
152
+ pg_dump_path = os.environ['POSTGRES_PATH']
153
+ else:
154
+ pg_dump_path = ''
155
+ err = os.system('"' + os.path.join(pg_dump_path, 'pg_dump') + '"' + ' --help > ' + os.devnull)
156
+ if err != 0:
157
+ raise Exception("Couldn't find Postgres utilities, consider specifying POSTGRES_PATH environment variable if Postgres isn't " +
158
+ "in your PATH.")
159
+ return pg_dump_path
160
+
161
+ def contains_errors(stderr):
162
+ msgs = stderr.decode('utf-8')
163
+ return any(filter(lambda msg: msg.strip().startswith('ERROR'), msgs.split('\n')))
psql_database_helper.py ADDED
@@ -0,0 +1,211 @@
1
+ import os, uuid, csv
2
+ import config_reader
3
+ from pathlib import Path
4
+ from psycopg2.extras import execute_values, register_default_json, register_default_jsonb
5
+ from subset_utils import columns_joined, columns_tupled, schema_name, table_name, fully_qualified_table, redact_relationships, quoter
6
+
7
+ register_default_json(loads=lambda x: str(x))
8
+ register_default_jsonb(loads=lambda x: str(x))
9
+
10
+ def prep_temp_dbs(_, __):
11
+ pass
12
+
13
+ def unprep_temp_dbs(_, __):
14
+ pass
15
+
16
+ def turn_off_constraints(connection):
17
+ # can't be done in postgres
18
+ pass
19
+
20
+ def copy_rows(source, destination, query, destination_table):
21
+ datatypes = get_table_datatypes(table_name(destination_table), schema_name(destination_table), destination)
22
+
23
+ non_generated_columns = [(dt[0], dt[1]) for i, dt in enumerate(datatypes) if dt[2] != 's']
24
+ generated_columns_positions = [i for i, dt in enumerate(datatypes) if 's' in dt[2]]
25
+ always_generated_id = any([dt[3] == 'a' for dt in datatypes])
26
+
27
+ def template_piece(dt):
28
+ if dt == '_json':
29
+ return '%s::json[]'
30
+ elif dt == '_jsonb':
31
+ return '%s::jsonb[]'
32
+ else:
33
+ return '%s'
34
+
35
+ template = '(' + ','.join([template_piece(dt[1]) for dt in non_generated_columns]) + ')'
36
+ columns = '("' + '","'.join([dt[0] for dt in non_generated_columns]) + '")'
37
+
38
+ cursor_name='table_cursor_'+str(uuid.uuid4()).replace('-','')
39
+ cursor = source.cursor(name=cursor_name)
40
+ cursor.execute(query)
41
+
42
+ fetch_row_count = 100000
43
+ while True:
44
+ rows = cursor.fetchmany(fetch_row_count)
45
+ if len(rows) == 0:
46
+ break
47
+
48
+ # using the inner_cursor means we don't log all the noise
49
+ destination_cursor = destination.cursor().inner_cursor
50
+
51
+ insert_query = 'INSERT INTO {} {} VALUES %s'.format(fully_qualified_table(destination_table), columns)
52
+ if (always_generated_id):
53
+ insert_query = 'INSERT INTO {} {} OVERRIDING SYSTEM VALUE VALUES %s'.format(fully_qualified_table(destination_table), columns)
54
+
55
+ updated_rows = [tuple(val for i, val in enumerate(row) if i not in generated_columns_positions) for row in rows]
56
+
57
+ execute_values(destination_cursor, insert_query, updated_rows, template)
58
+
59
+ destination_cursor.close()
60
+
61
+ cursor.close()
62
+ destination.commit()
63
+
64
+ def source_db_temp_table(target_table):
65
+ return 'tonic_subset_' + schema_name(target_table) + '_' + table_name(target_table)
66
+
67
+ def create_id_temp_table(conn, number_of_columns):
68
+ table_name = 'tonic_subset_' + str(uuid.uuid4())
69
+ cursor = conn.cursor()
70
+ column_defs = ',\n'.join([' col' + str(aye) + ' varchar' for aye in range(number_of_columns)])
71
+ q = 'CREATE TEMPORARY TABLE "{}" (\n {} \n)'.format(table_name, column_defs)
72
+ cursor.execute(q)
73
+ cursor.close()
74
+ return table_name
75
+
76
+ def copy_to_temp_table(conn, query, target_table, pk_columns = None):
77
+ temp_table = fully_qualified_table(source_db_temp_table(target_table))
78
+ with conn.cursor() as cur:
79
+ cur.execute('CREATE TEMPORARY TABLE IF NOT EXISTS ' + temp_table + ' AS ' + query + ' LIMIT 0')
80
+ if pk_columns:
81
+ query = query + ' WHERE {} NOT IN (SELECT {} FROM {})'.format(columns_tupled(pk_columns), columns_joined(pk_columns), temp_table)
82
+ cur.execute('INSERT INTO ' + temp_table + ' ' + query)
83
+ conn.commit()
84
+
85
+ def clean_temp_table_cells(fk_table, fk_columns, target_table, target_columns, conn):
86
+ fk_alias = 'tonic_subset_398dhjr23_fk'
87
+ target_alias = 'tonic_subset_398dhjr23_target'
88
+
89
+ fk_table = fully_qualified_table(source_db_temp_table(fk_table))
90
+ target_table = fully_qualified_table(source_db_temp_table(target_table))
91
+ assignment_list = ','.join(['{} = NULL'.format(quoter(c)) for c in fk_columns])
92
+ column_matching = ' AND '.join(['{}.{} = {}.{}'.format(fk_alias, quoter(fc), target_alias, quoter(tc)) for fc, tc in zip(fk_columns, target_columns)])
93
+ q = 'UPDATE {} {} SET {} WHERE NOT EXISTS (SELECT 1 FROM {} {} WHERE {})'.format(fk_table, fk_alias, assignment_list, target_table, target_alias, column_matching)
94
+ run_query(q, conn)
95
+
96
+ def get_redacted_table_references(table_name, tables, conn):
97
+ relationships = get_unredacted_fk_relationships(tables, conn)
98
+ redacted = redact_relationships(relationships)
99
+ return [r for r in redacted if r['target_table']==table_name]
100
+
101
+ def get_unredacted_fk_relationships(tables, conn):
102
+ cur = conn.cursor()
103
+
104
+ q = '''
105
+ SELECT fk_nsp.nspname || '.' || fk_table AS fk_table, array_agg(fk_att.attname ORDER BY fk_att.attnum) AS fk_columns, tar_nsp.nspname || '.' || target_table AS target_table, array_agg(tar_att.attname ORDER BY fk_att.attnum) AS target_columns
106
+ FROM (
107
+ SELECT
108
+ fk.oid AS fk_table_id,
109
+ fk.relnamespace AS fk_schema_id,
110
+ fk.relname AS fk_table,
111
+ unnest(con.conkey) as fk_column_id,
112
+
113
+ tar.oid AS target_table_id,
114
+ tar.relnamespace AS target_schema_id,
115
+ tar.relname AS target_table,
116
+ unnest(con.confkey) as target_column_id,
117
+
118
+ con.connamespace AS constraint_nsp,
119
+ con.conname AS constraint_name
120
+
121
+ FROM pg_constraint con
122
+ JOIN pg_class fk ON con.conrelid = fk.oid
123
+ JOIN pg_class tar ON con.confrelid = tar.oid
124
+ WHERE con.contype = 'f'
125
+ ) sub
126
+ JOIN pg_attribute fk_att ON fk_att.attrelid = fk_table_id AND fk_att.attnum = fk_column_id
127
+ JOIN pg_attribute tar_att ON tar_att.attrelid = target_table_id AND tar_att.attnum = target_column_id
128
+ JOIN pg_namespace fk_nsp ON fk_schema_id = fk_nsp.oid
129
+ JOIN pg_namespace tar_nsp ON target_schema_id = tar_nsp.oid
130
+ GROUP BY 1, 3, sub.constraint_nsp, sub.constraint_name;
131
+ '''
132
+
133
+ cur.execute(q)
134
+
135
+ relationships = list()
136
+
137
+ for row in cur.fetchall():
138
+ d = dict()
139
+ d['fk_table'] = row[0]
140
+ d['fk_columns'] = row[1]
141
+ d['target_table'] = row[2]
142
+ d['target_columns'] = row[3]
143
+
144
+ if d['fk_table'] in tables and d['target_table'] in tables:
145
+ relationships.append( d )
146
+ cur.close()
147
+
148
+ for augment in config_reader.get_fk_augmentation():
149
+ not_present = True
150
+ for r in relationships:
151
+ if all([r[key] == augment[key] for key in r.keys()]):
152
+ not_present = False
153
+ break
154
+
155
+ if augment['fk_table'] in tables and augment['target_table'] in tables and not_present:
156
+ relationships.append(augment)
157
+
158
+ return relationships
159
+
160
+ def run_query(query, conn, commit=True):
161
+ with conn.cursor() as cur:
162
+ cur.execute(query)
163
+ if commit:
164
+ conn.commit()
165
+
166
+ def get_table_count_estimate(table_name, schema, conn):
167
+ with conn.cursor() as cur:
168
+ cur.execute('SELECT reltuples::BIGINT AS count FROM pg_class WHERE oid=\'"{}"."{}"\'::regclass'.format(schema, table_name))
169
+ return cur.fetchone()[0]
170
+
171
+ def get_table_columns(table, schema, conn):
172
+ with conn.cursor() as cur:
173
+ cur.execute('SELECT attname FROM pg_attribute WHERE attrelid=\'"{}"."{}"\'::regclass AND attnum > 0 AND NOT attisdropped ORDER BY attnum;'.format(schema, table))
174
+ return [r[0] for r in cur.fetchall()]
175
+
176
+ def list_all_user_schemas(conn):
177
+ with conn.cursor() as cur:
178
+ cur.execute("SELECT nspname FROM pg_catalog.pg_namespace WHERE nspname NOT LIKE 'pg\\_%' and nspname != 'information_schema';")
179
+ return [r[0] for r in cur.fetchall()]
180
+
181
+ def list_all_tables(db_connect):
182
+ conn = db_connect.get_db_connection()
183
+ with conn.cursor() as cur:
184
+ cur.execute("""SELECT concat(concat(nsp.nspname,'.'),cls.relname)
185
+ FROM pg_class cls
186
+ JOIN pg_namespace nsp ON nsp.oid = cls.relnamespace
187
+ WHERE nsp.nspname NOT IN ('information_schema', 'pg_catalog') AND cls.relkind = 'r';""")
188
+ return [r[0] for r in cur.fetchall()]
189
+
190
+ def get_table_datatypes(table, schema, conn):
191
+ if not schema:
192
+ table_clause = "cl.relname = '{}'".format(table)
193
+ else:
194
+ table_clause = "cl.relname = '{}' AND ns.nspname = '{}'".format(table, schema)
195
+ with conn.cursor() as cur:
196
+ cur.execute("""SELECT att.attname, ty.typname, att.attgenerated, att.attidentity
197
+ FROM pg_attribute att
198
+ JOIN pg_class cl ON cl.oid = att.attrelid
199
+ JOIN pg_type ty ON ty.oid = att.atttypid
200
+ JOIN pg_namespace ns ON ns.oid = cl.relnamespace
201
+ WHERE {} AND att.attnum > 0 AND
202
+ NOT att.attisdropped
203
+ ORDER BY att.attnum;
204
+ """.format(table_clause))
205
+
206
+ return [(r[0], r[1], r[2], r[3]) for r in cur.fetchall()]
207
+
208
+ def truncate_table(target_table, conn):
209
+ with conn.cursor() as cur:
210
+ cur.execute("TRUNCATE TABLE {}".format(target_table))
211
+ conn.commit()
result_tabulator.py ADDED
@@ -0,0 +1,26 @@
1
+ import database_helper
2
+
3
+
4
+ def tabulate(source_dbc, destination_dbc, tables):
5
+ # compare estimated row counts between source and destination for each table
6
+ row_counts = list()
7
+ source_conn = source_dbc.get_db_connection()
8
+ dest_conn = destination_dbc.get_db_connection()
9
+ db_helper = database_helper.get_specific_helper()
10
+ try:
11
+ for table in tables:
12
+ o = db_helper.get_table_count_estimate(table_name(table), schema_name(table), source_conn)
13
+ n = db_helper.get_table_count_estimate(table_name(table), schema_name(table), dest_conn)
14
+ row_counts.append((table,o,n))
15
+ finally:
16
+ source_conn.close()
17
+ dest_conn.close()
18
+
19
+ print('\n'.join(['{}, {}, {}, {}'.format(x[0], x[1], x[2], x[2]/x[1] if x[1] > 0 else 0) for x in row_counts]))
20
+
21
+
22
+ def schema_name(table):
23
+ return table.split('.')[0]
24
+
25
+ def table_name(table):
26
+ return table.split('.')[1]
subset.py ADDED
@@ -0,0 +1,199 @@
1
+ from topo_orderer import get_topological_order_by_tables
2
+ from subset_utils import UnionFind, schema_name, table_name, find, compute_disconnected_tables, compute_downstream_tables, compute_upstream_tables, columns_joined, columns_tupled, columns_to_copy, quoter, fully_qualified_table, print_progress, upstream_filter_match, redact_relationships
3
+ import database_helper
4
+ import config_reader
5
+ import shutil, os, uuid, time, itertools
6
+
7
+ #
8
+ # A QUICK NOTE ON DEFINITIONS:
9
+ #
10
+ # Foreign key relationships form a graph. We make sure all subsetting happens on DAGs.
11
+ # Nodes in the DAG are tables, and FKs point from the table with a FK column to the table
12
+ # with the PK column. In other words, tables with FKs are upstream of tables with PKs.
13
+ #
14
+ # Sometimes we'll refer to tables as downstream or 'target' tables, because they are
15
+ # targeted by foreign keys. We will also use upstream or 'fk' tables, because they
16
+ # have foreign keys.
17
+ #
18
+ # Generally speaking, tables downstream of other tables have their membership defined
19
+ # by the requirements of their upstream tables. And tables upstream can be more flexible
20
+ # about their membership vis-a-vis the downstream tables (i.e. upstream tables can decide
21
+ # to include more or less).
22
+ #
23
+
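The vocabulary above can be made concrete with a toy layering function (illustrative only; the real ordering comes from the `toposort` library via `topo_orderer`). If `orders` has a FK to `users`, then `orders` is upstream and `users` is downstream, and downstream-most tables land in the earliest stratum:

```python
def strata(deps):
    # deps maps each upstream ('fk') table to the set of downstream ('target')
    # tables it points at; emit strata of tables whose dependencies are all
    # satisfied by earlier strata (Kahn-style layering, like toposort's output)
    deps = {k: set(v) for k, v in deps.items()}
    for targets in list(deps.values()):
        for t in targets:
            deps.setdefault(t, set())  # leaves with no outgoing FKs
    layers, done = [], set()
    while len(done) < len(deps):
        layer = {t for t, d in deps.items() if t not in done and d <= done}
        if not layer:
            raise ValueError('cycle detected')
        layers.append(layer)
        done |= layer
    return layers
```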
24
+ class Subset:
25
+
26
+ def __init__(self, source_dbc, destination_dbc, all_tables, clean_previous = True):
27
+ self.__source_dbc = source_dbc
28
+ self.__destination_dbc = destination_dbc
29
+
30
+ self.__source_conn = source_dbc.get_db_connection(read_repeatable=True)
31
+ self.__destination_conn = destination_dbc.get_db_connection()
32
+
33
+ self.__all_tables = all_tables
34
+
35
+ self.__db_helper = database_helper.get_specific_helper()
36
+
37
+ self.__db_helper.turn_off_constraints(self.__destination_conn)
38
+
39
+
40
+ def run_middle_out(self):
41
+ passthrough_tables = self.__get_passthrough_tables()
42
+ relationships = self.__db_helper.get_unredacted_fk_relationships(self.__all_tables, self.__source_conn)
43
+ disconnected_tables = compute_disconnected_tables(config_reader.get_initial_target_tables(), passthrough_tables, self.__all_tables, relationships)
44
+ connected_tables = [table for table in self.__all_tables if table not in disconnected_tables]
45
+ order = get_topological_order_by_tables(relationships, connected_tables)
46
+ order = list(order)
47
+
48
+ # start by subsetting the direct targets
49
+ print('Beginning subsetting with these direct targets: ' + str(config_reader.get_initial_target_tables()))
50
+ start_time = time.time()
51
+ processed_tables = set()
52
+ for idx, target in enumerate(config_reader.get_initial_targets()):
53
+ print_progress(target, idx+1, len(config_reader.get_initial_targets()))
54
+ self.__subset_direct(target, relationships)
55
+ processed_tables.add(target['table'])
56
+ print('Direct target tables completed in {}s'.format(time.time()-start_time))
57
+
58
+ # greedily grab rows with foreign keys to rows in the target strata
59
+ upstream_tables = compute_upstream_tables(config_reader.get_initial_target_tables(), order)
60
+ print('Beginning greedy upstream subsetting with these tables: ' + str(upstream_tables))
61
+ start_time = time.time()
62
+ for idx, t in enumerate(upstream_tables):
63
+ print_progress(t, idx+1, len(upstream_tables))
64
+ data_added = self.__subset_upstream(t, processed_tables, relationships)
65
+ if data_added:
66
+ processed_tables.add(t)
67
+ print('Greedy subsetting completed in {}s'.format(time.time()-start_time))
68
+
69
+ # process pass-through tables, you need this before subset_downstream, so you can get all required downstream rows
70
+ print('Beginning pass-through tables: ' + str(passthrough_tables))
71
+ start_time = time.time()
72
+ for idx, t in enumerate(passthrough_tables):
73
+ print_progress(t, idx+1, len(passthrough_tables))
74
+ q = 'SELECT * FROM {}'.format(fully_qualified_table(t))
75
+ if config_reader.get_max_rows_per_table() is not None:
76
+ q += ' LIMIT {}'.format(config_reader.get_max_rows_per_table())
77
+ self.__db_helper.copy_rows(self.__source_conn, self.__destination_conn, q, t)
78
+ print('Pass-through completed in {}s'.format(time.time()-start_time))
79
+
80
+ # use subset_downstream to get all supporting rows according to existing needs
81
+ downstream_tables = compute_downstream_tables(passthrough_tables, disconnected_tables, order)
82
+ print('Beginning downstream subsetting with these tables: ' + str(downstream_tables))
83
+ start_time = time.time()
84
+ for idx, t in enumerate(downstream_tables):
85
+ print_progress(t, idx+1, len(downstream_tables))
86
+ self.subset_downstream(t, relationships)
87
+ print('Downstream subsetting completed in {}s'.format(time.time()-start_time))
88
+
89
+ if config_reader.keep_disconnected_tables():
90
+ # get all the data for tables in disconnected components (i.e. pass those tables through)
91
+ print('Beginning disconnected tables: ' + str(disconnected_tables))
92
+ start_time = time.time()
93
+ for idx, t in enumerate(disconnected_tables):
94
+ print_progress(t, idx+1, len(disconnected_tables))
95
+ q = 'SELECT * FROM {}'.format(fully_qualified_table(t))
96
+ self.__db_helper.copy_rows(self.__source_conn, self.__destination_conn, q, t)
97
+ print('Disconnected tables completed in {}s'.format(time.time()-start_time))
98
+
99
+ def prep_temp_dbs(self):
100
+ self.__db_helper.prep_temp_dbs(self.__source_conn, self.__destination_conn)
101
+
102
+ def unprep_temp_dbs(self):
103
+ self.__db_helper.unprep_temp_dbs(self.__source_conn, self.__destination_conn)
104
+
105
+ def __subset_direct(self, target, relationships):
106
+ t = target['table']
107
+ columns_query = columns_to_copy(t, relationships, self.__source_conn)
108
+ if 'where' in target:
109
+ q = 'SELECT {} FROM {} WHERE {}'.format(columns_query, fully_qualified_table(t), target['where'])
110
+ elif 'percent' in target:
111
+ q = 'SELECT {} FROM {} WHERE random() < {}'.format(columns_query, fully_qualified_table(t), float(target['percent'])/100)
112
+ else:
113
+ raise ValueError('target table {} had no \'where\' or \'percent\' term defined, check your configuration.'.format(t))
114
+ self.__db_helper.copy_rows(self.__source_conn, self.__destination_conn, q, t)
115
+
116
+
117
+ def __subset_upstream(self, target, processed_tables, relationships):
118
+
119
+ redacted_relationships = redact_relationships(relationships)
120
+ relevant_key_constraints = list(filter(lambda r: r['target_table'] in processed_tables and r['fk_table'] == target, redacted_relationships))
121
+ # this table isn't referenced by anything we've already processed, so let's leave it empty
122
+ # OR
123
+ # table was already added, this only happens if the upstream table was also a direct target
124
+ if len(relevant_key_constraints) == 0 or target in processed_tables:
125
+ return False
126
+
127
+ temp_target_name = 'subset_temp_' + table_name(target)
128
+
129
+ try:
130
+ # copy the whole table
131
+ columns_query = columns_to_copy(target, relationships, self.__source_conn)
132
+ self.__db_helper.run_query('CREATE TEMPORARY TABLE {} AS SELECT * FROM {} LIMIT 0'.format(quoter(temp_target_name), fully_qualified_table(target)), self.__destination_conn)
133
+ query = 'SELECT {} FROM {}'.format(columns_query, fully_qualified_table(target))
134
+ self.__db_helper.copy_rows(self.__source_conn, self.__destination_conn, query, temp_target_name)
135
+
136
+ # filter it down in the target database
137
+ table_columns = self.__db_helper.get_table_columns(table_name(target), schema_name(target), self.__source_conn)
138
+ clauses = ['{} IN (SELECT {} FROM {})'.format(columns_tupled(kc['fk_columns']), columns_joined(kc['target_columns']), fully_qualified_table(kc['target_table'])) for kc in relevant_key_constraints]
139
+ clauses.extend(upstream_filter_match(target, table_columns))
140
+
141
+ select_query = 'SELECT * FROM {} WHERE TRUE AND {}'.format(quoter(temp_target_name), ' AND '.join(clauses))
142
+ if config_reader.get_max_rows_per_table() is not None:
143
+ select_query += " LIMIT {}".format(config_reader.get_max_rows_per_table())
144
+ insert_query = 'INSERT INTO {} {}'.format(fully_qualified_table(target), select_query)
145
+ self.__db_helper.run_query(insert_query, self.__destination_conn)
146
+ self.__destination_conn.commit()
147
+
148
+ finally:
149
+ self.__db_helper.run_query('DROP TABLE IF EXISTS {}'.format(quoter(temp_target_name)), self.__destination_conn)
150
+
151
+ return True
152
+
153
+
154
+ def __get_passthrough_tables(self):
155
+ passthrough_tables = config_reader.get_passthrough_tables()
156
+ return list(set(passthrough_tables))
157
+
158
+ # Given Table A -> Table B, where Table A has a b_id column: we SELECT b_id FROM table_a in the
159
+ # destination database, run `SELECT * FROM table_b WHERE id IN (<those b_ids>)` against the source,
160
+ # and insert that result set into table_b of the destination database.
161
+ def subset_downstream(self, table, relationships):
162
+ referencing_tables = self.__db_helper.get_redacted_table_references(table, self.__all_tables, self.__source_conn)
163
+
164
+ if len(referencing_tables) > 0:
165
+ pk_columns = referencing_tables[0]['target_columns']
166
+ else:
167
+ return
168
+
169
+ temp_table = self.__db_helper.create_id_temp_table(self.__destination_conn, len(pk_columns))
170
+
171
+ for r in referencing_tables:
172
+ fk_table = r['fk_table']
173
+ fk_columns = r['fk_columns']
174
+
175
+ q = 'SELECT {} FROM {} WHERE {} NOT IN (SELECT {} FROM {})'.format(columns_joined(fk_columns), fully_qualified_table(fk_table), columns_tupled(fk_columns), columns_joined(pk_columns), fully_qualified_table(table))
176
+ self.__db_helper.copy_rows(self.__destination_conn, self.__destination_conn, q, temp_table)
177
+
178
+ columns_query = columns_to_copy(table, relationships, self.__source_conn)
179
+
180
+ cursor_name = 'table_cursor_' + str(uuid.uuid4()).replace('-', '')
181
+ cursor = self.__destination_conn.cursor(name=cursor_name, withhold=True)
182
+ cursor_query = 'SELECT DISTINCT * FROM {}'.format(fully_qualified_table(temp_table))
183
+ cursor.execute(cursor_query)
184
+ fetch_row_count = 100000
185
+ while True:
186
+ rows = cursor.fetchmany(fetch_row_count)
187
+ if len(rows) == 0:
188
+ break
189
+
190
+ ids = ['('+','.join(['\'' + str(c) + '\'' for c in row])+')' for row in rows if all([c is not None for c in row])]
191
+
192
+ if len(ids) == 0:
193
+ break
194
+
195
+ ids_to_query = ','.join(ids)
196
+ q = 'SELECT {} FROM {} WHERE {} IN ({})'.format(columns_query, fully_qualified_table(table), columns_tupled(pk_columns), ids_to_query)
197
+ self.__db_helper.copy_rows(self.__source_conn, self.__destination_conn, q, table)
198
+
199
+ cursor.close()
subset_utils.py ADDED
@@ -0,0 +1,171 @@
1
+ import config_reader
2
+ import database_helper
3
+
4
+ # this function generally copies all columns as is, but if the table has been selected as
5
+ # breaking a dependency cycle, then it will insert NULLs instead of that table's foreign keys
6
+ # to the downstream dependency that breaks the cycle
7
+ def columns_to_copy(table, relationships, conn):
8
+ target_breaks = set()
9
+ opportunists = config_reader.get_preserve_fk_opportunistically()
10
+ for dep_break in config_reader.get_dependency_breaks():
11
+ if dep_break.fk_table == table and dep_break not in opportunists:
12
+ target_breaks.add(dep_break.target_table)
13
+
14
+ columns_to_null = set()
15
+ for rel in relationships:
16
+ if rel['fk_table'] == table and rel['target_table'] in target_breaks:
17
+ columns_to_null.update(rel['fk_columns'])
18
+
19
+ columns = database_helper.get_specific_helper().get_table_columns(table_name(table), schema_name(table), conn)
20
+ return ','.join(['{}.{}'.format(quoter(table_name(table)), quoter(c)) if c not in columns_to_null else 'NULL as {}'.format(quoter(c)) for c in columns])
21
+
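A minimal sketch of the SELECT list this produces (hypothetical `select_list` helper; in `columns_to_copy` the null set is derived from the configured dependency breaks):

```python
def select_list(table, columns, null_columns):
    # emit quoted "table"."col" references, except for FK columns that break
    # a dependency cycle, which are emitted as NULL instead
    def quote(ident):
        return '"' + ident + '"'
    return ','.join(
        'NULL as {}'.format(quote(c)) if c in null_columns
        else '{}.{}'.format(quote(table), quote(c))
        for c in columns)
```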
22
+ def upstream_filter_match(target, table_columns):
23
+ retval = []
24
+ filters = config_reader.get_upstream_filters()
25
+ for fltr in filters:
26
+ if "table" in fltr and target == fltr["table"]:
27
+ retval.append(fltr["condition"])
28
+ if "column" in fltr and fltr["column"] in table_columns:
29
+ retval.append(fltr["condition"])
30
+ return retval
31
+
32
+ def redact_relationships(relationships):
33
+ breaks = config_reader.get_dependency_breaks()
34
+ retval = [r for r in relationships if (r['fk_table'], r['target_table']) not in breaks]
35
+ return retval
36
+
37
+ def find(f, seq):
38
+ """Return first item in sequence where f(item) == True."""
39
+ for item in seq:
40
+ if f(item):
41
+ return item
42
+
43
+ def compute_upstream_tables(target_tables, order):
44
+ upstream_tables = []
45
+ in_upstream = False
46
+ for strata in order:
47
+ if in_upstream:
48
+ upstream_tables.extend(strata)
49
+ if any([tt in strata for tt in target_tables]):
50
+ in_upstream = True
51
+ return upstream_tables
52
+
53
+ def compute_downstream_tables(passthrough_tables, disconnected_tables, order):
54
+ downstream_tables = []
55
+ for strata in order:
56
+ downstream_tables.extend(strata)
57
+ downstream_tables = list(reversed(list(filter(lambda table: table not in passthrough_tables and table not in disconnected_tables, downstream_tables))))
58
+ return downstream_tables
59
+
60
+ def compute_disconnected_tables(target_tables, passthrough_tables, all_tables, relationships):
61
+ uf = UnionFind()
62
+ for t in all_tables:
63
+ uf.make_set(t)
64
+ for rel in relationships:
65
+ uf.link(rel['fk_table'], rel['target_table'])
66
+
67
+ connected_components = set([uf.find(tt) for tt in target_tables])
68
+ connected_components.update([uf.find(pt) for pt in passthrough_tables])
69
+ return [t for t in all_tables if uf.find(t) not in connected_components]
70
+
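The same computation can be sketched with a plain dict-based union-find (a toy stand-in for the UnionFind class below; the anchors are the target and pass-through tables):

```python
def disconnected(all_tables, relationships, anchors):
    # tables whose FK-connected component contains no anchor table
    parent = {t: t for t in all_tables}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for rel in relationships:
        parent[find(rel['fk_table'])] = find(rel['target_table'])

    anchored_roots = {find(a) for a in anchors}
    return [t for t in all_tables if find(t) not in anchored_roots]
```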
71
+ def fully_qualified_table(table):
72
+ if '.' in table:
73
+ return quoter(schema_name(table)) + '.' + quoter(table_name(table))
74
+ else:
75
+ return quoter(table_name(table))
76
+
77
+ def schema_name(table):
78
+ return table.split('.')[0] if '.' in table else None
79
+
80
+ def table_name(table):
81
+ split = table.split('.')
82
+ return split[1] if len(split) > 1 else split[0]
83
+
84
+ def columns_tupled(columns):
85
+ return '(' + ','.join([quoter(c) for c in columns]) + ')'
86
+
87
+ def columns_joined(columns):
88
+ return ','.join([quoter(c) for c in columns])
89
+
90
+ def quoter(ident):
91
+ return '"' + ident + '"'
92
+
93
+ def print_progress(target, idx, count):
94
+ print('Processing {} of {}: {}'.format(idx, count, target))
95
+
96
+ class UnionFind:
97
+
98
+ def __init__(self):
99
+ self.elementsToId = dict()
100
+ self.elements = []
101
+ self.roots = []
102
+ self.ranks = []
103
+
104
+ def __len__(self):
105
+ return len(self.roots)
106
+
107
+ def make_set(self, elem):
108
+ self.id_of(elem)
109
+
110
+ def find(self, elem):
111
+ x = self.elementsToId.get(elem)
112
+ if x is None:
113
+ return None
114
+
115
+ rootId = self.find_internal(x)
116
+ return self.elements[rootId]
117
+
118
+ def find_internal(self, x):
119
+ x0 = x
120
+ while self.roots[x] != x:
121
+ x = self.roots[x]
122
+
123
+ while self.roots[x0] != x:
124
+ y = self.roots[x0]
125
+ self.roots[x0] = x
126
+ x0 = y
127
+
128
+ return x
129
+
130
+ def id_of(self, elem):
131
+ if elem not in self.elementsToId:
132
+ idx = len(self.roots)
133
+ self.elements.append(elem)
134
+ self.elementsToId[elem] = idx
135
+ self.roots.append(idx)
136
+ self.ranks.append(0)
137
+
138
+ return self.elementsToId[elem]
139
+
140
+ def link(self, elem1, elem2):
141
+ x = self.id_of(elem1)
142
+ y = self.id_of(elem2)
143
+
144
+ xr = self.find_internal(x)
145
+ yr = self.find_internal(y)
146
+ if xr == yr:
147
+ return
148
+
149
+ xd = self.ranks[xr]
150
+ yd = self.ranks[yr]
151
+ if xd < yd:
152
+ self.roots[xr] = yr
153
+ elif yd < xd:
154
+ self.roots[yr] = xr
155
+ else:
156
+ self.roots[yr] = xr
157
+ self.ranks[xr] = self.ranks[xr] + 1
158
+
159
+ def members_of(self, elem):
160
+ elem_id = self.elementsToId.get(elem)
161
+ if elem_id is None:
162
+ raise ValueError("tried calling membersOf on an unknown element")
163
+
164
+ elemRoot = self.find_internal(elem_id)
165
+ retval = []
166
+ for idx in range(len(self.elements)):
167
+ otherRoot = self.find_internal(idx)
168
+ if elemRoot == otherRoot:
169
+ retval.append(self.elements[idx])
170
+
171
+ return retval
topo_orderer.py ADDED
@@ -0,0 +1,38 @@
1
+ from toposort import toposort, toposort_flatten
2
+ import config_reader
3
+
4
+ def get_topological_order_by_tables(relationships, tables):
5
+ topsort_input = __prepare_topsort_input(relationships, tables)
6
+ return list(toposort(topsort_input))
7
+
8
+ def __prepare_topsort_input(relationships, tables):
9
+ dep_breaks = config_reader.get_dependency_breaks()
10
+ deps = dict()
11
+ for r in relationships:
12
+ p = r['fk_table']
13
+ c = r['target_table']
14
+
15
+ # skip relationships that the config marks as dependency breaks
16
+ dep_break_found = False
17
+ for dep_break in dep_breaks:
18
+ if p == dep_break.fk_table and c == dep_break.target_table:
19
+ dep_break_found = True
20
+ break
21
+
22
+ if dep_break_found:
23
+ continue
24
+
25
+ # toposort silently ignores self-dependencies, but a table that depends on itself can never be ordered, so fail fast
26
+ if p == c:
27
+ raise ValueError('Circular dependency, {} depends on itself!'.format(p))
28
+
29
+ if tables is not None and len(tables) > 0 and (p not in tables or c not in tables):
30
+ continue
31
+
32
+ if p in deps:
33
+ deps[p].add(c)
34
+ else:
35
+ deps[p] = set()
36
+ deps[p].add(c)
37
+
38
+ return deps
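The effect of dependency breaks on the toposort input is easy to check in isolation (hypothetical `prepare_deps` sketch of `__prepare_topsort_input`, with breaks as plain `(fk_table, target_table)` tuples):

```python
def prepare_deps(relationships, breaks, tables=None):
    # an upstream ('fk') table depends on its downstream ('target') table;
    # configured breaks are skipped so circular FK chains become orderable
    deps = {}
    for r in relationships:
        p, c = r['fk_table'], r['target_table']
        if (p, c) in breaks:
            continue
        if p == c:
            raise ValueError('{} depends on itself'.format(p))
        if tables and (p not in tables or c not in tables):
            continue
        deps.setdefault(p, set()).add(c)
    return deps
```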