ingestr 0.0.2.tar.gz → 0.0.4.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release — this version of ingestr might be problematic; see the registry listing for details.

Files changed (51)
  1. {ingestr-0.0.2 → ingestr-0.0.4}/.gitignore +4 -1
  2. {ingestr-0.0.2 → ingestr-0.0.4}/Makefile +1 -1
  3. {ingestr-0.0.2 → ingestr-0.0.4}/PKG-INFO +19 -6
  4. {ingestr-0.0.2 → ingestr-0.0.4}/README.md +15 -5
  5. ingestr-0.0.4/docs/.vitepress/config.mjs +49 -0
  6. ingestr-0.0.4/docs/api-examples.md +49 -0
  7. ingestr-0.0.4/docs/commands/example-uris.md +5 -0
  8. ingestr-0.0.4/docs/commands/ingest.md +44 -0
  9. ingestr-0.0.4/docs/getting-started/core-concepts.md +41 -0
  10. ingestr-0.0.4/docs/getting-started/incremental-loading.md +228 -0
  11. ingestr-0.0.4/docs/getting-started/quickstart.md +39 -0
  12. ingestr-0.0.4/docs/index.md +25 -0
  13. ingestr-0.0.4/docs/markdown-examples.md +85 -0
  14. ingestr-0.0.4/docs/supported-sources/bigquery.md +20 -0
  15. ingestr-0.0.4/docs/supported-sources/databricks.md +20 -0
  16. ingestr-0.0.4/docs/supported-sources/duckdb.md +16 -0
  17. ingestr-0.0.4/docs/supported-sources/mssql.md +22 -0
  18. ingestr-0.0.4/docs/supported-sources/mysql.md +20 -0
  19. ingestr-0.0.4/docs/supported-sources/overview.md +16 -0
  20. ingestr-0.0.4/docs/supported-sources/postgres.md +21 -0
  21. ingestr-0.0.4/docs/supported-sources/redshift.md +21 -0
  22. ingestr-0.0.4/docs/supported-sources/snowflake.md +20 -0
  23. ingestr-0.0.4/docs/supported-sources/sqlite.md +16 -0
  24. ingestr-0.0.4/ingestr/main.py +279 -0
  25. ingestr-0.0.4/ingestr/main_test.py +579 -0
  26. {ingestr-0.0.2 → ingestr-0.0.4}/ingestr/src/factory.py +5 -3
  27. {ingestr-0.0.2 → ingestr-0.0.4}/ingestr/src/sources.py +7 -3
  28. {ingestr-0.0.2 → ingestr-0.0.4}/ingestr/src/sources_test.py +4 -2
  29. {ingestr-0.0.2 → ingestr-0.0.4}/ingestr/src/sql_database/__init__.py +3 -47
  30. {ingestr-0.0.2 → ingestr-0.0.4}/ingestr/src/sql_database/helpers.py +0 -1
  31. ingestr-0.0.4/ingestr/src/telemetry/event.py +14 -0
  32. ingestr-0.0.4/ingestr/testdata/test_append.db +0 -0
  33. ingestr-0.0.4/ingestr/testdata/test_create_replace.db +0 -0
  34. ingestr-0.0.4/ingestr/testdata/test_delete_insert_with_timerange.db +0 -0
  35. ingestr-0.0.4/ingestr/testdata/test_delete_insert_without_primary_key.db +0 -0
  36. ingestr-0.0.4/ingestr/testdata/test_merge_with_primary_key.db +0 -0
  37. ingestr-0.0.4/package-lock.json +1571 -0
  38. ingestr-0.0.4/package.json +10 -0
  39. {ingestr-0.0.2 → ingestr-0.0.4}/pyproject.toml +1 -1
  40. {ingestr-0.0.2 → ingestr-0.0.4}/requirements.txt +3 -0
  41. ingestr-0.0.4/resources/demo.gif +0 -0
  42. ingestr-0.0.4/resources/demo.tape +18 -0
  43. ingestr-0.0.2/ingestr/main.py +0 -191
  44. {ingestr-0.0.2 → ingestr-0.0.4}/LICENSE.md +0 -0
  45. {ingestr-0.0.2 → ingestr-0.0.4}/ingestr/src/destinations.py +0 -0
  46. {ingestr-0.0.2 → ingestr-0.0.4}/ingestr/src/destinations_test.py +0 -0
  47. {ingestr-0.0.2 → ingestr-0.0.4}/ingestr/src/sql_database/schema_types.py +0 -0
  48. {ingestr-0.0.2 → ingestr-0.0.4}/ingestr/src/sql_database/settings.py +0 -0
  49. {ingestr-0.0.2 → ingestr-0.0.4}/ingestr/src/testdata/fakebqcredentials.json +0 -0
  50. {ingestr-0.0.2 → ingestr-0.0.4}/requirements-dev.txt +0 -0
  51. {ingestr-0.0.2 → ingestr-0.0.4}/resources/ingestr.svg +0 -0
{ingestr-0.0.2 → ingestr-0.0.4}/.gitignore
@@ -10,4 +10,7 @@ venv
 .pytest_cache
 .mypy_cache
 pipeline_data
-dist
+dist
+docs/.vitepress/dist
+docs/.vitepress/cache
+node_modules
{ingestr-0.0.2 → ingestr-0.0.4}/Makefile
@@ -18,7 +18,7 @@ test: venv
 	. venv/bin/activate; $(MAKE) test-ci
 
 test-specific: venv
-	. venv/bin/activate; pytest -rP -vv --tb=short --cov=ingestr --no-cov-on-fail -k $(test)
+	. venv/bin/activate; pytest -rP -vv --tb=short --cov=ingestr --no-cov-on-fail --capture=no -k $(test)
 
 lint-ci:
 	ruff ingestr --fix && ruff format ingestr
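The `--capture=no` flag added here lets test output (e.g. `print` calls) reach the terminal while debugging a single test. Given the `-k $(test)` filter in the recipe above, a run for one test would look something like the sketch below; the test name is a placeholder:

```bash
# run only the tests matching the given name via pytest's -k expression;
# "test_append" is a hypothetical test name used for illustration
make test-specific test=test_append
```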
{ingestr-0.0.2 → ingestr-0.0.4}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: ingestr
-Version: 0.0.2
+Version: 0.0.4
 Summary: ingestr is a command-line application that ingests data from various sources and stores them in any database.
 Project-URL: Homepage, https://github.com/bruin-data/ingestr
 Project-URL: Issues, https://github.com/bruin-data/ingestr/issues
@@ -16,11 +16,14 @@ Classifier: Topic :: Database
 Requires-Python: >=3.9
 Requires-Dist: databricks-sql-connector==2.9.3
 Requires-Dist: dlt==0.4.3
+Requires-Dist: duckdb-engine==0.11.1
 Requires-Dist: duckdb==0.9.2
+Requires-Dist: google-cloud-bigquery-storage
 Requires-Dist: pendulum==3.0.0
 Requires-Dist: psycopg2==2.9.9
 Requires-Dist: pyodbc==5.1.0
 Requires-Dist: rich==13.7.0
+Requires-Dist: rudder-sdk-python==2.0.2
 Requires-Dist: snowflake-sqlalchemy==1.5.1
 Requires-Dist: sqlalchemy-bigquery==1.9.0
 Requires-Dist: sqlalchemy2-stubs==0.0.2a38
@@ -32,18 +35,19 @@ Description-Content-Type: text/markdown
 <div align="center">
     <img src="./resources/ingestr.svg" width="500" />
     <p>Ingest & copy data from any source to any destination without any code</p>
+    <img src="./resources/demo.gif" width="500" />
 </div>
 
+
 -----
 
 Ingestr is a command-line application that allows you to ingest data from any source into any destination using simple command-line flags, no code necessary.
 
-- ✨ copy data from your Postges / Mongo / BigQuery or any other source into any destination
-- ➕ incremental loading
+- ✨ copy data from your database into any destination
+- ➕ incremental loading: `append`, `merge` or `delete+insert`
 - 🐍 single-command installation
-- 💅 Docker image for easy installation & usage
 
-ingestr takes away the complexity of managing any backend or writing any code for ingesting data, simply run the command and watch the magic.
+ingestr takes away the complexity of managing any backend or writing any code for ingesting data, simply run the command and watch the data land on its destination.
 
 
 ## Installation
@@ -67,6 +71,10 @@ This command will:
 - get the table `public.some_data` from the Postgres instance.
 - upload this data to your BigQuery warehouse under the schema `ingestr` and table `some_data`.
 
+## Documentation
+You can see the full documentation [here](https://bruindata.github.com/ingestr).
+
+
 ## Supported Sources & Destinations
 
 | Database | Source | Destination |
@@ -79,4 +87,9 @@ This command will:
 | DuckDB | ✅ | ✅ |
 | Microsoft SQL Server | ✅ | ✅ |
 | SQLite | ✅ | ❌ |
-| MySQL | ✅ | ❌ |
+| MySQL | ✅ | ❌ |
+
+More to come soon!
+
+## Acknowledgements
+This project would not have been possible without the amazing work done by the [SQLAlchemy](https://www.sqlalchemy.org/) and [dlt](https://dlthub.com/) teams. We relied on their work to connect to various sources and destinations, and built `ingestr` as a simple, opinionated wrapper around their work.
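The dependency bumps above (duckdb-engine, google-cloud-bigquery-storage, rudder-sdk-python) are picked up by a plain upgrade. Assuming the PyPI package name the README's `pip install ingestr` already uses, pinning the new release is just:

```bash
# upgrade to the release covered by this diff; pip resolves the new
# duckdb-engine / google-cloud-bigquery-storage / rudder-sdk-python pins
pip install --upgrade 'ingestr==0.0.4'
```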
{ingestr-0.0.2 → ingestr-0.0.4}/README.md
@@ -1,18 +1,19 @@
 <div align="center">
     <img src="./resources/ingestr.svg" width="500" />
     <p>Ingest & copy data from any source to any destination without any code</p>
+    <img src="./resources/demo.gif" width="500" />
 </div>
 
+
 -----
 
 Ingestr is a command-line application that allows you to ingest data from any source into any destination using simple command-line flags, no code necessary.
 
-- ✨ copy data from your Postges / Mongo / BigQuery or any other source into any destination
-- ➕ incremental loading
+- ✨ copy data from your database into any destination
+- ➕ incremental loading: `append`, `merge` or `delete+insert`
 - 🐍 single-command installation
-- 💅 Docker image for easy installation & usage
 
-ingestr takes away the complexity of managing any backend or writing any code for ingesting data, simply run the command and watch the magic.
+ingestr takes away the complexity of managing any backend or writing any code for ingesting data, simply run the command and watch the data land on its destination.
 
 
 ## Installation
@@ -36,6 +37,10 @@ This command will:
 - get the table `public.some_data` from the Postgres instance.
 - upload this data to your BigQuery warehouse under the schema `ingestr` and table `some_data`.
 
+## Documentation
+You can see the full documentation [here](https://bruindata.github.com/ingestr).
+
+
 ## Supported Sources & Destinations
 
 | Database | Source | Destination |
@@ -48,4 +53,9 @@ This command will:
 | DuckDB | ✅ | ✅ |
 | Microsoft SQL Server | ✅ | ✅ |
 | SQLite | ✅ | ❌ |
-| MySQL | ✅ | ❌ |
+| MySQL | ✅ | ❌ |
+
+More to come soon!
+
+## Acknowledgements
+This project would not have been possible without the amazing work done by the [SQLAlchemy](https://www.sqlalchemy.org/) and [dlt](https://dlthub.com/) teams. We relied on their work to connect to various sources and destinations, and built `ingestr` as a simple, opinionated wrapper around their work.
ingestr-0.0.4/docs/.vitepress/config.mjs
@@ -0,0 +1,49 @@
+import { defineConfig } from "vitepress";
+
+// https://vitepress.dev/reference/site-config
+export default defineConfig({
+  title: "ingestr",
+  description: "Ingest & copy data between any source and any destination",
+  themeConfig: {
+    // https://vitepress.dev/reference/default-theme-config
+    nav: [
+      { text: "Home", link: "/" },
+      { text: "Getting started", link: "/getting-started/quickstart.md" },
+    ],
+
+    sidebar: [
+      {
+        text: "Getting started",
+        items: [
+          { text: "Quickstart", link: "/getting-started/quickstart.md" },
+          { text: "Core Concepts", link: "/getting-started/core-concepts.md" },
+          { text: "Incremental Loading", link: "/getting-started/incremental-loading.md" },
+        ],
+      },
+      {
+        text: "Commands",
+        items: [
+          { text: "ingest", link: "/commands/ingest.md" },
+          { text: "example-uris", link: "/commands/example-uris.md" },
+        ],
+      },
+      {
+        text: "Sources & Destinations",
+        items: [
+          { text: "Overview", link: "/supported-sources/overview.md" },
+          { text: "Postgres", link: "/supported-sources/postgres.md" },
+          { text: "Google BigQuery", link: "/supported-sources/bigquery.md" },
+          { text: "Snowflake", link: "/supported-sources/snowflake.md" },
+          { text: "AWS Redshift", link: "/supported-sources/redshift.md" },
+          { text: "Databricks", link: "/supported-sources/databricks.md" },
+          { text: "DuckDB", link: "/supported-sources/duckdb.md" },
+          { text: "Microsoft SQL Server", link: "/supported-sources/mssql.md" },
+          { text: "SQLite", link: "/supported-sources/sqlite.md" },
+          { text: "MySQL", link: "/supported-sources/mysql.md" },
+        ],
+      },
+    ],
+
+    socialLinks: [{ icon: "github", link: "https://github.com/bruin-data/ingestr" }],
+  },
+});
ingestr-0.0.4/docs/api-examples.md
@@ -0,0 +1,49 @@
+---
+outline: deep
+---
+
+# Runtime API Examples
+
+This page demonstrates usage of some of the runtime APIs provided by VitePress.
+
+The main `useData()` API can be used to access site, theme, and page data for the current page. It works in both `.md` and `.vue` files:
+
+```md
+<script setup>
+import { useData } from 'vitepress'
+
+const { theme, page, frontmatter } = useData()
+</script>
+
+## Results
+
+### Theme Data
+<pre>{{ theme }}</pre>
+
+### Page Data
+<pre>{{ page }}</pre>
+
+### Page Frontmatter
+<pre>{{ frontmatter }}</pre>
+```
+
+<script setup>
+import { useData } from 'vitepress'
+
+const { site, theme, page, frontmatter } = useData()
+</script>
+
+## Results
+
+### Theme Data
+<pre>{{ theme }}</pre>
+
+### Page Data
+<pre>{{ page }}</pre>
+
+### Page Frontmatter
+<pre>{{ frontmatter }}</pre>
+
+## More
+
+Check out the documentation for the [full list of runtime APIs](https://vitepress.dev/reference/runtime-api#usedata).
ingestr-0.0.4/docs/commands/example-uris.md
@@ -0,0 +1,5 @@
+# `ingestr example-uris`
+
+This command serves as a guide to the various URI formats supported by the `ingestr` tool: it prints the list of supported sources and destinations, along with the URI format for each of them.
+
+For the detailed documentation, please refer to the [Sources & Destinations](../supported-sources/overview.md) section.
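Going by the description above, a plain invocation presumably prints the reference list straight to the terminal:

```bash
# prints the supported sources and destinations with a URI format for each
ingestr example-uris
```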
ingestr-0.0.4/docs/commands/ingest.md
@@ -0,0 +1,44 @@
+# `ingestr ingest`
+
+The `ingest` command is the core functionality of the `ingestr` tool, allowing users to transfer data from a source to a destination with optional support for incremental updates.
+
+## Example
+
+The following example demonstrates how to use the `ingest` command to transfer data from a source to a destination.
+
+```bash
+ingestr ingest \
+    --source-uri '<your-source-uri-here>' \
+    --source-table '<your-schema>.<your-table>' \
+    --dest-uri '<your-destination-uri-here>'
+```
+
+## Required Options
+
+- `--source-uri TEXT`: Specifies the URI of the data source. This parameter is required.
+- `--dest-uri TEXT`: Specifies the URI of the destination where data will be ingested. This parameter is required.
+- `--source-table TEXT`: Defines the source table to fetch data from. This parameter is required.
+
+## Optional Options
+
+- `--dest-table TEXT`: Designates the destination table to save the data. If not specified, defaults to the value of `--source-table`.
+- `--incremental-key TEXT`: Identifies the key used for incremental data strategies. Defaults to `None`.
+- `--incremental-strategy TEXT`: Defines the strategy for incremental updates. Options include `replace`, `append`, `delete+insert`, or `merge`. The default strategy is `replace`.
+- `--interval-start`: Sets the start of the interval for the incremental key. Defaults to `None`.
+- `--interval-end`: Sets the end of the interval for the incremental key. Defaults to `None`.
+- `--primary-key TEXT`: Specifies the primary key for the merge operation. Defaults to `None`.
+
+The `interval-start` and `interval-end` options support various datetime formats; here are some examples:
+- `%Y-%m-%d`: `2023-01-31`
+- `%Y-%m-%dT%H:%M:%S`: `2023-01-31T15:00:00`
+- `%Y-%m-%dT%H:%M:%S%z`: `2023-01-31T15:00:00+00:00`
+- `%Y-%m-%dT%H:%M:%S.%f`: `2023-01-31T15:00:00.000123`
+- `%Y-%m-%dT%H:%M:%S.%f%z`: `2023-01-31T15:00:00.000123+00:00`
+
+> [!INFO]
+> For the details around the incremental key and the various strategies, please refer to the [Incremental Loading](../getting-started/incremental-loading.md) section.
+
+## General Options
+
+- `--help`: Displays the help message and exits the command.
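As a sketch of how the options above compose, here is a hypothetical merge-style run scoped to a single day with the interval flags; every URI, table, and column name below is a placeholder:

```bash
# hypothetical invocation combining the documented flags; URIs, tables
# and columns are placeholders, not values from any real deployment
ingestr ingest \
    --source-uri 'postgresql://admin:admin@localhost:8837/web?sslmode=disable' \
    --source-table 'public.events' \
    --dest-uri 'bigquery://<your-project-name>?credentials_path=/path/to/service/account.json' \
    --dest-table 'ingestr.events' \
    --incremental-strategy merge \
    --incremental-key updated_at \
    --primary-key id \
    --interval-start '2023-01-31' \
    --interval-end '2023-02-01'
```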
ingestr-0.0.4/docs/getting-started/core-concepts.md
@@ -0,0 +1,41 @@
+---
+outline: deep
+---
+
+# Core Concepts
+ingestr has a few simple concepts that you should understand before you start using it.
+
+## Source & Destination URIs
+The source and destination are the two main components of ingestr. The source is the place from which you want to ingest the data, hence the name "source", and the destination is the place where you want to store the data.
+
+Sources and destinations are identified with [URIs](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier). A URI is a simple string that contains the credentials used to connect to the source or destination.
+
+Here's an example URI for a Postgres database:
+```
+postgresql://admin:admin@localhost:8837/web?sslmode=disable
+```
+
+The URI is composed of the following parts:
+- `postgresql`: the scheme, i.e. the type of database to connect to
+- `admin:admin`: the username and password
+- `localhost:8837`: the host and port
+- `web`: the database name
+- `sslmode=disable`: the query parameters
+
+ingestr can connect to any source or destination using this structure across all databases.
+
+> [!NOTE]
+> ingestr uses the [dlt](https://github.com/dlt-hub/dlt) & [SQLAlchemy](https://www.sqlalchemy.org/) libraries internally, which means you can get connection URIs by following their documentation as well; they are supposed to work right away in ingestr.
+
+## Source & Destination Tables
+The source and destination tables are the tables in the source and destination databases, respectively. The source table is the table you want to ingest the data from, and the destination table is the table where you want to store the data.
+
+ingestr uses the `--source-table` and `--dest-table` flags to specify the source and destination tables, respectively. The `--dest-table` flag is optional; if you don't specify it, ingestr will use the same table name as the source table.
+
+
+## Incremental Loading
+ingestr supports incremental loading, which means you can choose to append, merge or delete+insert data into the destination table. Incremental loading allows you to ingest only the new rows from the source table into the destination table, which means that you don't have to ingest the entire table every time you run ingestr.
+
+Incremental loading requires various identifiers in the source table to understand what has changed and when, so that the new rows can be ingested into the destination table. Read more in the [Incremental Loading](/getting-started/incremental-loading.md) section.
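Because the same `scheme://user:password@host:port/database?params` anatomy applies across engines, switching sources is mostly a matter of swapping the scheme. A sketch with a MySQL source — the `mysql://` form follows the standard SQLAlchemy convention and is an assumption here, since the MySQL page isn't shown in this diff, and all credentials below are made up — reusing the documented Postgres URI shape as the destination:

```bash
# same URI anatomy, different scheme; the mysql:// form is assumed from
# SQLAlchemy conventions, and every value here is a placeholder
ingestr ingest \
    --source-uri 'mysql://reader:secret@db.example.com:3306/analytics' \
    --source-table 'analytics.orders' \
    --dest-uri 'postgresql://admin:admin@localhost:8837/web?sslmode=disable'
```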
ingestr-0.0.4/docs/getting-started/incremental-loading.md
@@ -0,0 +1,228 @@
+---
+outline: deep
+---
+
+# Incremental Loading
+ingestr supports incremental loading, which means you can choose to append, merge or delete+insert data into the destination table. Incremental loading allows you to ingest only the new rows from the source table into the destination table, which means that you don't have to ingest the entire table every time you run ingestr.
+
+Before you use incremental loading, you should understand three important concepts:
+- `primary_key`: the column(s) that uniquely identify a row in the table. If you give a primary key for an ingestion, the resulting rows will be deduplicated based on it, i.e. there will only be one row per primary key in the destination.
+- `incremental_key`: the column that will be used to determine the new rows. If you give an incremental key for an ingestion, the resulting rows will be filtered based on it, i.e. only the new rows will be ingested.
+  - A good example of an incremental key is a timestamp column where you only want to ingest the rows that are newer than the last ingestion, e.g. `created_at` or `updated_at`.
+- `strategy`: the strategy to use for incremental loading. The available strategies are:
+  - `replace`: replace the existing destination table with the source directly; this is the default strategy and the simplest one.
+    - This strategy is not recommended for large tables, as it will replace the entire table and can be slow.
+  - `append`: simply append the new rows to the destination table.
+  - `merge`: merge the new rows with the existing rows in the destination table, inserting the new ones and updating the existing ones with the new values.
+  - `delete+insert`: delete the existing rows in the destination table that match the incremental key, then insert the new rows.
+
+
+
+## Replace
+Replace is the default strategy, and it simply replaces the entire destination table with the source table.
+
+The following example will replace the entire `my_schema.some_data` table in BigQuery with the `my_schema.some_data` table in Postgres.
+```bash
+ingestr \
+    --source-uri 'postgresql://admin:admin@localhost:8837/web?sslmode=disable' \
+    --source-table 'my_schema.some_data' \
+    --dest-uri 'bigquery://<your-project-name>?credentials_path=/path/to/service/account.json'
+```
+
+Here's how the replace strategy works:
+- The source table is downloaded.
+- The source table is uploaded to the destination, replacing the destination table.
+
+
+> [!CAUTION]
+> This strategy will delete the entire destination table and replace it with the source table; use with caution.
+
+## Append
+Append will simply append the new rows from the source table to the destination table. By default it will append all the rows; in order to use it as an incremental strategy, you should provide an `incremental_key`.
+
+The following example will append to the `my_schema.some_data` table in BigQuery only the rows of the Postgres `my_schema.some_data` table that are new.
+```bash
+ingestr \
+    --source-uri 'postgresql://admin:admin@localhost:8837/web?sslmode=disable' \
+    --source-table 'my_schema.some_data' \
+    --dest-uri 'bigquery://<your-project-name>?credentials_path=/path/to/service/account.json' \
+    --incremental-strategy append \
+    --incremental-key updated_at
+```
+
+### Example
+
+Let's assume you had the following source table:
+| id | name | updated_at |
+|----|------|------------|
+| 1  | John | 2021-01-01 |
+| 2  | Jane | 2021-01-01 |
+
+#### First Ingestion
+The first time you run the command, it will ingest all the rows into the destination table. Here's what your destination looks like now:
+
+| id | name | updated_at |
+|----|------|------------|
+| 1  | John | 2021-01-01 |
+| 2  | Jane | 2021-01-01 |
+
+#### Second Ingestion, no new data
+When there's no new data in the source table, the destination table will remain the same.
+
+#### Third Ingestion, new data
+Let's say John changed his name to Johnny and his `updated_at` was set to `2021-01-02`, e.g. your source:
+| id | name   | updated_at |
+|----|--------|------------|
+| 1  | Johnny | 2021-01-02 |
+| 2  | Jane   | 2021-01-01 |
+
+
+When you run the command again, it will only ingest the new rows into the destination table. Here's what your destination looks like now:
+| id | name   | updated_at |
+|----|--------|------------|
+| 1  | John   | 2021-01-01 |
+| 2  | Jane   | 2021-01-01 |
+| 1  | Johnny | 2021-01-02 |
+
+**Notice the last row in the table:** it's the new row that was ingested from the source table.
+
+The behavior is the same if there were new rows in the source table: they would be appended to the destination table if they have an `updated_at` that is **later than the latest record** in the destination table.
+
+> [!TIP]
+> The `append` strategy allows you to keep a version history of your data, as it will keep appending the new rows to the destination table. You can use it to build [SCD Type 2](https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row) tables, for example.
+
+
+## Merge
+Merge will merge the new rows with the existing rows in the destination table, inserting the new ones and updating the existing ones with the new values. By default it will merge all the rows; in order to use it as an incremental strategy, you should provide an `incremental_key` as well as a `primary_key` to find the right rows to update.
+
+The following example will merge into the `my_schema.some_data` table in BigQuery only the rows of the Postgres `my_schema.some_data` table that are new.
+```bash
+ingestr \
+    --source-uri 'postgresql://admin:admin@localhost:8837/web?sslmode=disable' \
+    --source-table 'my_schema.some_data' \
+    --dest-uri 'bigquery://<your-project-name>?credentials_path=/path/to/service/account.json' \
+    --incremental-strategy merge \
+    --incremental-key updated_at \
+    --primary-key id
+```
+
+Here's how the merge strategy works:
+- If a row with the `primary_key` exists in the destination table, it will be updated with the new values from the source table.
+- If a row with the `primary_key` exists in the source table but not in the destination table, it will be inserted into the destination table.
+- If a row with the `primary_key` exists in the destination table but not in the source table, it will remain in the destination table.
+
+### Example
+
+Let's assume you had the following source table:
+| id | name | updated_at |
+|----|------|------------|
+| 1  | John | 2021-01-01 |
+| 2  | Jane | 2021-01-01 |
+
+#### First Ingestion
+The first time you run the command, it will ingest all the rows into the destination table. Here's what your destination looks like now:
+
+| id | name | updated_at |
+|----|------|------------|
+| 1  | John | 2021-01-01 |
+| 2  | Jane | 2021-01-01 |
+
+#### Second Ingestion, no new data
+When there's no new data in the source table, the destination table will remain the same.
+
+#### Third Ingestion, new data
+Let's say John changed his name to Johnny, e.g. your source:
+| id | name   | updated_at |
+|----|--------|------------|
+| 1  | Johnny | 2021-01-02 |
+| 2  | Jane   | 2021-01-01 |
+
+When you run the command again, it will merge the new rows into the destination table. Here's what your destination looks like now:
+| id | name   | updated_at |
+|----|--------|------------|
+| 1  | Johnny | 2021-01-02 |
+| 2  | Jane   | 2021-01-01 |
+
+**Notice the first row in the table:** it's the updated row that was ingested from the source table.
+
+The behavior is the same if there were new rows in the source table: they would be merged into the destination table if they have an `updated_at` that is **later than the latest record** in the destination table.
+
+> [!TIP]
+> The `merge` strategy is different from the `append` strategy, as it will update the existing rows in the destination table with the new values from the source table. It's useful when you want to keep the latest version of your data in the destination table.
+
+> [!CAUTION]
+> For the cases where there's a primary key match, the `merge` strategy will **update** the existing rows in the destination table with the new values from the source table. Use with caution, as it can lead to data loss if not used properly, as well as data processing charges if your data warehouse charges for updates.
+
+## Delete+Insert
+Delete+Insert will delete the existing rows in the destination table that match the `incremental_key` and then insert the new rows from the source table. By default it will delete and insert all the rows; in order to use it as an incremental strategy, you should provide an `incremental_key`.
+
+The following example will delete the existing rows in the BigQuery `my_schema.some_data` table that match the incoming `updated_at` values, and then insert the new rows from the Postgres `my_schema.some_data` table.
+```bash
+ingestr \
+    --source-uri 'postgresql://admin:admin@localhost:8837/web?sslmode=disable' \
+    --source-table 'my_schema.some_data' \
+    --dest-uri 'bigquery://<your-project-name>?credentials_path=/path/to/service/account.json' \
+    --incremental-strategy delete+insert \
+    --incremental-key updated_at
+```
+
+Here's how the delete+insert strategy works:
+- The new rows from the source table are inserted into a staging table in the destination database.
+- The existing rows in the destination table that match the `incremental_key` are deleted.
+- The new rows from the staging table are inserted into the destination table.
+
+A few important notes about the `delete+insert` strategy:
+- It does not guarantee the order of the rows in the destination table, since it deletes and re-inserts rows.
+- It does not deduplicate the rows in the destination table, which means you may end up with multiple rows with the same `incremental_key` in the destination table.
+
+### Example
+Let's assume you had the following source table:
+| id | name | updated_at |
+|----|------|------------|
+| 1  | John | 2021-01-01 |
+| 2  | Jane | 2021-01-01 |
+
+#### First Ingestion
+The first time you run the command, it will ingest all the rows into the destination table. Here's what your destination looks like now:
+| id | name | updated_at |
+|----|------|------------|
+| 1  | John | 2021-01-01 |
+| 2  | Jane | 2021-01-01 |
+
+#### Second Ingestion, no new data
+Even when there's no new data in the source table, the source rows are inserted into a staging table in the destination database, the matching rows in the destination table are deleted, and the staging rows are inserted into the destination table. In this example, the destination table ends up unchanged.
+> [!CAUTION]
+> If you had rows in the destination table that do not exist in the source table, they will be deleted from the destination table.
+
+#### Third Ingestion, new data
+Let's say John changed his name to Johnny, e.g. your source:
+| id | name   | updated_at |
+|----|--------|------------|
+| 1  | Johnny | 2021-01-02 |
+| 2  | Jane   | 2021-01-01 |
+
+When you run the command again, it will delete the existing rows in the destination table that match the `incremental_key` and then insert the new rows from the source table. Here's what your destination looks like now:
+| id | name   | updated_at |
+|----|--------|------------|
+| 1  | Johnny | 2021-01-02 |
+| 2  | Jane   | 2021-01-01 |
+
+**Notice the first row in the table:** it's the updated row that was ingested from the source table.
+
+The behavior is the same if there were new rows in the source table: they would be deleted and re-inserted into the destination table if they have an `updated_at` that is **later than the latest record** in the destination table.
+
+> [!TIP]
+> The `delete+insert` strategy is useful when you want to keep the destination table clean, as it deletes the matching rows in the destination before inserting the new ones. It also allows you to do backfills on the data, e.g. going back to a past date and ingesting the data again.
+
+
+## Conclusion
+Incremental loading is a powerful feature that allows you to ingest only the new rows from the source table into the destination table. It's useful when you want to keep the destination table up-to-date with the source table, as well as when you want to keep a version history of your data in the destination table. A few rules of thumb:
+
+- If you can, and your data is not huge, use the `replace` strategy: it's the simplest one and always gives you a clean, exact replica of the source table.
+- If you want to keep a version history of your data, use the `append` strategy: it keeps appending the new rows to the destination table.
+- If you want to keep only the latest version of your data and your table has a natural primary key, such as a user ID, use the `merge` strategy: it updates the existing rows in the destination table with the new values from the source.
+- If you want to keep the destination table clean and be able to do backfills, use the `delete+insert` strategy: it deletes the destination rows that match the `incremental_key`, then inserts the new rows from the source.
+
+> [!TIP]
+> There's no better way to learn than trying it yourself. Use the [Quickstart](/getting-started/quickstart.md) to experiment with the incremental loading strategies.
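The backfill use mentioned in the tip above pairs naturally with the `--interval-start`/`--interval-end` flags from the `ingest` command reference. A hypothetical re-run of one past week, reusing the same placeholder URIs as the examples in this file:

```bash
# hypothetical backfill sketch: re-ingest a past week with delete+insert,
# scoping the incremental key to the given interval; URIs are placeholders
ingestr ingest \
    --source-uri 'postgresql://admin:admin@localhost:8837/web?sslmode=disable' \
    --source-table 'my_schema.some_data' \
    --dest-uri 'bigquery://<your-project-name>?credentials_path=/path/to/service/account.json' \
    --incremental-strategy delete+insert \
    --incremental-key updated_at \
    --interval-start '2023-01-01' \
    --interval-end '2023-01-08'
```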
ingestr-0.0.4/docs/getting-started/quickstart.md
@@ -0,0 +1,39 @@
+---
+outline: deep
+---
+
+# Quickstart
+ingestr is a command-line application that allows you to ingest data from any source into any destination using simple command-line flags, no code necessary.
+
+- ✨ copy data from your Postgres / Mongo / BigQuery or any other source into any destination
+- ➕ incremental loading: `append`, `merge` or `delete+insert`
+- 🐍 single-command installation
+
+ingestr takes away the complexity of managing any backend or writing any code for ingesting data; simply run the command and watch the magic.
+
+
+## Installation
+```
+pip install ingestr
+```
+
+## Quickstart
+
+```bash
+ingestr \
+    --source-uri 'postgresql://admin:admin@localhost:8837/web?sslmode=disable' \
+    --source-table 'public.some_data' \
+    --dest-uri 'bigquery://<your-project-name>?credentials_path=/path/to/service/account.json' \
+    --dest-table 'ingestr.some_data'
+```
+
+That's it.
+
+This command will:
+- get the table `public.some_data` from the Postgres instance.
+- upload this data to your BigQuery warehouse under the schema `ingestr` and table `some_data`.
+
+
+## Supported Sources & Destinations
+
+See the [Supported Sources & Destinations](/supported-sources/overview.md) page for a list of all supported sources and destinations. More to come soon!
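For trying the quickstart without any warehouse credentials, the support table lists SQLite as a source and DuckDB as both source and destination, and both are file-based. The URI forms below follow SQLAlchemy and duckdb-engine conventions and are an assumption here, since the per-database pages aren't shown in this diff:

```bash
# local try-out sketch; the sqlite:/// and duckdb:/// URI forms are assumed
# from SQLAlchemy / duckdb-engine conventions, not confirmed by this diff
ingestr \
    --source-uri 'sqlite:///mydata.db' \
    --source-table 'main.some_data' \
    --dest-uri 'duckdb:///warehouse.db' \
    --dest-table 'main.some_data'
```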
ingestr-0.0.4/docs/index.md
@@ -0,0 +1,25 @@
+---
+# https://vitepress.dev/reference/default-theme-home-page
+layout: home
+
+hero:
+  name: "ingestr"
+  text: Copy data between any source and any destination
+  tagline: "ingestr is a command-line application that allows ingesting or copying data from any source into any destination database."
+  actions:
+    - theme: brand
+      text: Getting Started
+      link: /getting-started/quickstart.md
+    - theme: alt
+      text: GitHub
+      link: https://github.com/bruin-data/ingestr
+
+features:
+  - title: Feature A
+    details: Lorem ipsum dolor sit amet, consectetur adipiscing elit
+  - title: Feature B
+    details: Lorem ipsum dolor sit amet, consectetur adipiscing elit
+  - title: Feature C
+    details: Lorem ipsum dolor sit amet, consectetur adipiscing elit
+---
+