dbt-vectorize 0.1.4__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,203 @@
1
+ Metadata-Version: 2.4
2
+ Name: dbt-vectorize
3
+ Version: 0.1.4
4
+ Summary: dbt + Rust vectorization runner for pgvector
5
+ Author-email: Maria Dubyaga <kraftaa@gmail.com>
6
+ License-Expression: Apache-2.0
7
+ Project-URL: Homepage, https://github.com/kraftaa/dbt-vector
8
+ Project-URL: Repository, https://github.com/kraftaa/dbt-vector
9
+ Classifier: Environment :: Console
10
+ Classifier: Intended Audience :: Developers
11
+ Classifier: Programming Language :: Python :: 3
12
+ Classifier: Programming Language :: Rust
13
+ Classifier: Topic :: Database
14
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
15
+ Requires-Python: >=3.9
16
+ Description-Content-Type: text/markdown
17
+ Requires-Dist: PyYAML>=6.0
18
+
19
+ # dbt-vectors (prototype scaffold)
20
+
21
+ > Make vector indexes a first-class materialization in dbt. This repo is an MVP scaffold to prove the concept.
22
+
23
+ ## Why
24
+ - dbt today only materializes SQL artifacts (table, view, incremental, ephemeral).
25
+ - Vector pipelines require SQL + embeddings + upsert to a vector DB; teams currently stitch that together with ad-hoc external scripts.
26
+ - A custom `vector_index` materialization can run inside `dbt build`, generating embeddings, handling incremental logic, and writing to pgvector/Pinecone/Qdrant.
27
+
28
+ ## What’s here
29
+ - **dbt package skeleton** with a `vector_index` materialization and dispatchable macros (pgvector working).
30
+ - **Rust embedder** (`rust/embedding_engine`) that can generate embeddings via OpenAI, Amazon Bedrock, or a local ONNX model (no Python needed).
31
+ - **`./bin/vectorize` runner**: orchestrates `dbt run` for the model and then calls the Rust embedder to write embeddings into Postgres/pgvector.
32
+ - **Examples** to show how a model is defined and run.
33
+
34
+ ## Prerequisites
35
+
36
+ `dbt-vectorize` does not vendor dbt. It uses whatever dbt binary you point it to (`DBT=...`) or finds on PATH.
37
+
38
+ Verify your existing dbt + adapter:
39
+ ```bash
40
+ dbt --version
41
+ ```
42
+ You should see a plugin like `postgres` under "Plugins".
43
+
44
+ If you do not have dbt and the Postgres adapter installed:
45
+ ```bash
46
+ python -m pip install "dbt-core~=1.9" "dbt-postgres~=1.9"
47
+ ```
48
+
49
+ You also need pgvector available in Postgres:
50
+ - install the extension package on the **Postgres server** (`vector.control` must exist on that server)
51
+ - enable it in each **database** you want to use
52
+
53
+ ```sql
54
+ CREATE EXTENSION IF NOT EXISTS vector;
55
+ ```
56
+
57
+ (`pgvector` is the project name; the SQL extension name is `vector`.)
58
+
59
+ ## Repo layout
60
+ - `dbt_project.yml` – declares this as a dbt package and exposes macros.
61
+ - `macros/materializations/vector_index.sql` – Jinja materialization scaffold (pgvector first, adapters dispatchable).
62
+ - `macros/adapters/vector_index_pgvector.sql` – pgvector adapter macro that creates/loads the target table.
63
+ - `bin/vectorize` – orchestration command that runs dbt and then Rust embedding.
64
+ - `rust/embedding_engine` – Rust crate and `pg_embedder` binary used for embedding generation/upsert.
65
+
66
+ ## Next steps (MVP path)
67
+ 1. Harden Rust embedding provider support (OpenAI/Bedrock/local ONNX) with better diagnostics and retries. ⏳
68
+ 2. Expand adapter macros beyond pgvector (Pinecone/Qdrant). ⏳
69
+ 3. Add end-to-end integration tests for dbt + pgvector + `pg_embedder`. ⏳
70
+ 4. Publish package docs and a reproducible quickstart. ⏳
71
+
72
+ ## Example model (current)
73
+ ```sql
74
+ {{ config(
75
+ materialized='vector_index',
76
+ vector_db='pgvector',
77
+ index_name='knowledge_base',
78
+ embedding_model='text-embedding-3-small',
79
+ dimensions=(env_var('EMBED_DIMS', '1536') | int),
80
+ metadata_columns=['source', 'created_at', 'doc_id']
81
+ ) }}
82
+
83
+ select
84
+ doc_id,
85
+ chunk_text as text,
86
+ source,
87
+ created_at
88
+ from {{ ref('staging_documents') }}
89
+ where is_active = true
90
+ ```
91
+
92
+ Running `./bin/vectorize --select vector_knowledge_base` should:
93
+ - fetch incremental rows
94
+ - generate embeddings via Rust engine
95
+ - upsert to pgvector (or Pinecone/Qdrant via adapters)
96
+ - emit metrics (processed, failed, latency) and run freshness tests
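Under the hood, the packaged CLI (`dbt_vectorize/cli.py`) assembles the `dbt run` invocation before handing off to the Rust embedder. A simplified sketch (the real code also honors `PROFILE_DIR`/`DBT_PROFILES_DIR` and a `SELECT_MODEL` env var):

```python
import os

def build_dbt_cmd(select_model: str = "vector_knowledge_base",
                  project_dir: str = ".") -> list:
    """Simplified mirror of how the runner builds the dbt command."""
    dbt = os.environ.get("DBT", "dbt")  # use an explicit dbt binary if set
    return [
        dbt, "run", "--no-partial-parse",
        "--profiles-dir", project_dir,
        "--project-dir", project_dir,
        "--select", select_model,
    ]
```

If the dbt step exits non-zero, the runner stops and never invokes the embedder.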
97
+
98
+ ## Run locally (preferred: existing local Postgres)
99
+
100
+ 1) Ensure Postgres is running, reachable (`PGHOST/PGPORT/PGUSER/PGDATABASE`), and has `vector` enabled:
101
+ ```sql
102
+ CREATE EXTENSION IF NOT EXISTS vector;
103
+ ```
104
+
105
+ 2) Choose a provider and matching dimensions:
106
+ ```bash
107
+ # Local ONNX (MiniLM, 384 dims)
108
+ EMBED_PROVIDER=local
109
+ EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
110
+ EMBED_LOCAL_MODEL_PATH=$PWD/ml_model # contains model.onnx + tokenizer.json
111
+ EMBED_DIMS=384
112
+
113
+ # OpenAI
114
+ EMBED_PROVIDER=openai
115
+ EMBED_MODEL=text-embedding-3-small
116
+ EMBED_DIMS=1536 # or a smaller dim if you request it from OpenAI
117
+
118
+ # Bedrock Titan v2 (defaults)
119
+ EMBED_PROVIDER=bedrock
120
+ EMBED_MODEL=amazon.titan-embed-text-v2:0
121
+ EMBED_DIMS=1024 # or 512/256 if you override
122
+ ```
123
+
124
+ 3) Run vectorization (dbt model + embedding upsert):
125
+ ```bash
126
+ PGHOST=localhost PGPORT=5432 PGUSER=postgres PGDATABASE=postgres \
127
+ EMBED_PROVIDER=... EMBED_MODEL=... EMBED_DIMS=... \
128
+ ./bin/vectorize --select vector_knowledge_base
129
+ ```
130
+
131
+ Shortcut with env file:
132
+ ```bash
133
+ cp .env.vectorize.example .env.vectorize
134
+ ./bin/vectorize --select vector_knowledge_base
135
+ ```
136
+ `bin/vectorize` auto-loads `.env.vectorize` if present. Use `VECTORIZE_ENV_FILE=/path/to/file` to load a different env file.
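The env file is plain `KEY=VALUE` lines. A minimal illustrative loader (the exact parsing rules of `bin/vectorize` are not shown in this package, so details like quoting or `export` prefixes are assumptions here); note that already-exported variables win over file values:

```python
import os

def load_env_file(path: str) -> dict:
    """Minimal KEY=VALUE env-file loader (illustrative sketch only)."""
    loaded = {}
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                # skip blanks, comments, and lines without an '='
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                loaded[key.strip()] = value.strip()
    except FileNotFoundError:
        pass  # "auto-loads ... if present": a missing file is not an error
    for key, value in loaded.items():
        os.environ.setdefault(key, value)  # explicit env still wins
    return loaded
```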
137
+
138
+ Expected CLI output (example):
139
+ ```text
140
+ [vectorize] running dbt model vector_knowledge_base (provider=local, model=sentence-transformers/all-MiniLM-L6-v2)
141
+ ...
142
+ Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
143
+ [vectorize] generating embeddings via Rust into public.knowledge_base
144
+ embedded 20 rows into public.knowledge_base
145
+ [vectorize] done.
146
+ ```
147
+
148
+ Quick verification in Postgres:
149
+ ```sql
150
+ SELECT count(*) AS rows FROM public.knowledge_base;
151
+
152
+ SELECT
153
+ doc_id,
154
+ (embedding::float4[])[1:8] AS first_8_dims,
155
+ source,
156
+ created_at
157
+ FROM public.knowledge_base
158
+ LIMIT 5;
159
+ ```
160
+
161
+ ## Optional Docker Postgres
162
+
163
+ Use this only if you want a disposable local pgvector instance:
164
+ ```bash
165
+ docker-compose up -d postgres
166
+ ```
167
+ If Docker/Colima is not running, this will fail with a daemon connection error.
168
+
169
+ ## Build pip package (`dbt-vectorize`)
170
+
171
+ Build from the repo root (factorlens-style; bundles the Rust binary into the wheel):
172
+ ```bash
173
+ ./scripts/build_wheel_with_binary.sh
174
+ ```
175
+
176
+ Artifacts will be written to `dist/`.
177
+ Install locally:
178
+ ```bash
179
+ python -m pip install dist/dbt_vectorize-*.whl
180
+ ```
181
+
182
+ CLI entrypoint after install:
183
+ ```bash
184
+ dbt-vectorize --select vector_knowledge_base
185
+ ```
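After install, the CLI resolves the embedding backend in a fixed order: an explicit env override, then the binary bundled in the wheel, then a cargo fallback for source checkouts. A simplified sketch of that resolution (condensed from `dbt_vectorize/cli.py`):

```python
from __future__ import annotations

import os
import shutil
from pathlib import Path

def resolve_embedder(pkg_dir: Path) -> list | None:
    """Simplified backend resolution order from dbt_vectorize/cli.py."""
    explicit = os.environ.get("DBT_VECTORIZE_PG_EMBEDDER")
    if explicit:
        return [explicit]                      # 1. explicit override
    bundled = pkg_dir / "bin" / "pg_embedder"
    if bundled.exists() and os.access(bundled, os.X_OK):
        return [str(bundled)]                  # 2. binary bundled in the wheel
    cargo = shutil.which("cargo")
    if cargo:                                  # 3. build/run from a repo checkout
        return [cargo, "run", "--quiet", "--bin", "pg_embedder",
                "--release", "--"]
    return None                                # real CLI raises FileNotFoundError
```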
186
+
187
+ CI release wheel build (macOS arm64 + Linux x86_64):
188
+ - workflow file: `.github/workflows/release.yml`
189
+ - trigger manually from Actions or push a `v*` tag
190
+ - outputs platform-specific wheels under workflow artifacts / GitHub release assets
191
+
192
+ ### Supported embedding dimensions (set `EMBED_DIMS` to match)
193
+ - OpenAI `text-embedding-3-small`: 1536 (can request smaller via API parameter)
194
+ - OpenAI `text-embedding-3-large`: 3072 (can request smaller)
195
+ - Bedrock Titan embed text v2: 1024 (or 512/256)
196
+ - Bedrock Titan embed text v1: 1024 (or 512/256)
197
+ - Bedrock Cohere Embed v4: 1536 (or 1024/512/256)
198
+ - Local MiniLM (all-MiniLM-L6-v2 ONNX): 384
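A mismatched `EMBED_DIMS` surfaces only at insert time as a pgvector dimension error, so it can pay to validate up front. A small sketch of such a check — the helper names are hypothetical and the table above is the source of truth (the OpenAI entries list only the default size, since smaller sizes are requested via an API parameter):

```python
# Hypothetical helper; mirrors the "Supported embedding dimensions" table.
SUPPORTED_DIMS = {
    "text-embedding-3-small": {1536},
    "text-embedding-3-large": {3072},
    "amazon.titan-embed-text-v2:0": {1024, 512, 256},
    "sentence-transformers/all-MiniLM-L6-v2": {384},
}

def check_dims(model: str, embed_dims: int) -> bool:
    """True when EMBED_DIMS is a known-valid size for the model."""
    return embed_dims in SUPPORTED_DIMS.get(model, set())
```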
199
+
200
+ ## Notes
201
+ - The Rust embedder is Python-free.
202
+ - Keep your Postgres `vector` column dimension aligned with `EMBED_DIMS`.
203
+ - IVFFLAT indexes warn on very small datasets; that’s expected. Rebuild after you have more rows.
@@ -0,0 +1,3 @@
1
+ __all__ = ["__version__"]
2
+
3
+ __version__ = "0.1.4"
@@ -0,0 +1,2 @@
1
+ """Package data holder for bundled native binaries."""
2
+
@@ -0,0 +1,165 @@
1
+ from __future__ import annotations
2
+
3
+ import os
4
+ import shutil
5
+ import subprocess
6
+ import sys
7
+ from pathlib import Path
8
+
9
+ import yaml
10
+
11
+ def _resolve_cwd() -> Path:
12
+ explicit = os.environ.get("DBT_VECTORIZE_CWD")
13
+ if explicit:
14
+ return Path(explicit).expanduser().resolve()
15
+ return Path.cwd()
16
+
17
+
18
+ def _find_repo_root(start: Path) -> Path | None:
19
+ env_root = os.environ.get("DBT_VECTORIZE_REPO")
20
+ if env_root:
21
+ p = Path(env_root).expanduser().resolve()
22
+ if (p / "dbt_project.yml").exists() and (p / "rust" / "embedding_engine" / "Cargo.toml").exists():
23
+ return p
24
+
25
+ cur = start.resolve()
26
+ for p in [cur, *cur.parents]:
27
+ if (p / "dbt_project.yml").exists() and (p / "rust" / "embedding_engine" / "Cargo.toml").exists():
28
+ return p
29
+ return None
30
+
31
+
32
+ def _packaged_embedder() -> str | None:
33
+ pkg_dir = Path(__file__).resolve().parent
34
+ candidates = [
35
+ pkg_dir / "bin" / "pg_embedder",
36
+ pkg_dir / "bin" / "pg_embedder.exe",
37
+ ]
38
+ for c in candidates:
39
+ if c.exists() and os.access(c, os.X_OK):
40
+ return str(c)
41
+ return None
42
+
43
+
44
+ def _find_pg_embedder_cmd(cwd: Path) -> tuple[list[str], Path | None]:
45
+ explicit = os.environ.get("DBT_VECTORIZE_PG_EMBEDDER")
46
+ if explicit:
47
+ return [explicit], cwd
48
+
49
+ packaged = _packaged_embedder()
50
+ if packaged:
51
+ return [packaged], cwd
52
+
53
+ cargo = shutil.which("cargo")
54
+ repo = _find_repo_root(cwd)
55
+ if cargo and repo:
56
+ return [cargo, "run", "--quiet", "--bin", "pg_embedder", "--release", "--"], repo / "rust" / "embedding_engine"
57
+
58
+ raise FileNotFoundError(
59
+ "Could not find runnable pg_embedder backend. "
60
+ "Set DBT_VECTORIZE_PG_EMBEDDER, install wheel with bundled binary, "
61
+ "or run inside a cloned dbt-vector repo with Rust/cargo available."
62
+ )
63
+
64
+
65
+ def _build_dbt_cmd(cwd: Path, argv: list[str]) -> tuple[list[str], dict[str, str], str, str]:
66
+ dbt = os.environ.get("DBT", "dbt")
67
+ profile_dir = os.environ.get("PROFILE_DIR") or os.environ.get("DBT_PROFILES_DIR") or str(cwd)
68
+ project_dir = os.environ.get("PROJECT_DIR") or str(cwd)
69
+ select_model = os.environ.get("SELECT_MODEL", "vector_knowledge_base")
70
+ embed_provider = os.environ.get("EMBED_PROVIDER", "local")
71
+ embed_model = os.environ.get("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
72
+
73
+ run_args = list(argv) if argv else ["--select", select_model]
74
+ cmd = [
75
+ dbt,
76
+ "run",
77
+ "--no-partial-parse",
78
+ "--profiles-dir",
79
+ profile_dir,
80
+ "--project-dir",
81
+ project_dir,
82
+ *run_args,
83
+ ]
84
+ env = os.environ.copy()
85
+ env["EMBED_PROVIDER"] = embed_provider
86
+ env["EMBED_MODEL"] = embed_model
87
+ return cmd, env, embed_provider, embed_model
88
+
89
+
90
+ def _build_embed_env(cwd: Path) -> dict[str, str]:
91
+ env = os.environ.copy()
92
+ profile_dir = env.get("PROFILE_DIR") or env.get("DBT_PROFILES_DIR") or str(cwd)
93
+ profile_name = env.get("PROFILE", "default")
94
+ target_name = env.get("TARGET")
95
+ profile_file = env.get("PROFILE_FILE") or str(Path(profile_dir) / "profiles.yml")
96
+
97
+ # Fallback to dbt profile values when PG* are not explicitly provided.
98
+ if (
99
+ not env.get("PGHOST")
100
+ or not env.get("PGPORT")
101
+ or not env.get("PGUSER")
102
+ or not env.get("PGDATABASE")
103
+ ):
104
+ p = Path(profile_file)
105
+ if p.exists():
106
+ with p.open("r", encoding="utf-8") as f:
107
+ data = yaml.safe_load(f) or {}
108
+ profile = data.get(profile_name, {}) or {}
109
+ outputs = profile.get("outputs", {}) or {}
110
+ target = target_name or profile.get("target")
111
+ if not target and outputs:
112
+ target = next(iter(outputs.keys()))
113
+ cfg = outputs.get(target, {}) if target else {}
114
+ if cfg.get("type") == "postgres":
115
+ if not env.get("PGHOST") and cfg.get("host") is not None:
116
+ env["PGHOST"] = str(cfg["host"])
117
+ if not env.get("PGPORT") and cfg.get("port") is not None:
118
+ env["PGPORT"] = str(cfg["port"])
119
+ if not env.get("PGUSER") and cfg.get("user") is not None:
120
+ env["PGUSER"] = str(cfg["user"])
121
+ if not env.get("PGPASSWORD") and cfg.get("password") is not None:
122
+ env["PGPASSWORD"] = str(cfg["password"])
123
+ if not env.get("PGDATABASE") and cfg.get("dbname") is not None:
124
+ env["PGDATABASE"] = str(cfg["dbname"])
125
+ if not env.get("SCHEMA") and cfg.get("schema") is not None:
126
+ env["SCHEMA"] = str(cfg["schema"])
127
+
128
+ env.setdefault("PGHOST", "localhost")
129
+ env.setdefault("PGPORT", "5432")
130
+ env.setdefault("PGUSER", "postgres")
131
+ env.setdefault("PGPASSWORD", "")
132
+ env.setdefault("PGDATABASE", "postgres")
133
+ env.setdefault("SCHEMA", "public")
134
+ env.setdefault("INDEX_NAME", "knowledge_base")
135
+ env.setdefault("EMBED_DIMS", "1536")
136
+ env.setdefault("EMBED_PROVIDER", "local")
137
+ env.setdefault("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
138
+ return env
139
+
140
+
141
+ def main() -> int:
142
+ argv = sys.argv[1:]
143
+ cwd = _resolve_cwd()
144
+
145
+ dbt_cmd, dbt_env, provider, model = _build_dbt_cmd(cwd, argv)
146
+ print(f"[vectorize] running dbt model selection (provider={provider}, model={model})")
147
+ dbt_proc = subprocess.run(dbt_cmd, cwd=str(cwd), env=dbt_env)
148
+ if dbt_proc.returncode != 0:
149
+ return dbt_proc.returncode
150
+
151
+ embed_cmd, embed_cwd = _find_pg_embedder_cmd(cwd)
152
+ embed_env = _build_embed_env(cwd)
153
+ schema = embed_env.get("SCHEMA", "public")
154
+ index_name = embed_env.get("INDEX_NAME", "knowledge_base")
155
+ print(f"[vectorize] generating embeddings via Rust into {schema}.{index_name}")
156
+ embed_proc = subprocess.run(embed_cmd, cwd=str(embed_cwd or cwd), env=embed_env)
157
+ if embed_proc.returncode != 0:
158
+ return embed_proc.returncode
159
+
160
+ print("[vectorize] done.")
161
+ return 0
162
+
163
+
164
+ if __name__ == "__main__":
165
+ raise SystemExit(main())
@@ -0,0 +1,13 @@
1
+ README.md
2
+ pyproject.toml
3
+ setup.py
4
+ dbt_vectorize/__init__.py
5
+ dbt_vectorize/cli.py
6
+ dbt_vectorize.egg-info/PKG-INFO
7
+ dbt_vectorize.egg-info/SOURCES.txt
8
+ dbt_vectorize.egg-info/dependency_links.txt
9
+ dbt_vectorize.egg-info/entry_points.txt
10
+ dbt_vectorize.egg-info/requires.txt
11
+ dbt_vectorize.egg-info/top_level.txt
12
+ dbt_vectorize/bin/__init__.py
13
+ dbt_vectorize/bin/pg_embedder
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ dbt-vectorize = dbt_vectorize.cli:main
@@ -0,0 +1 @@
1
+ PyYAML>=6.0
@@ -0,0 +1 @@
1
+ dbt_vectorize
@@ -0,0 +1,39 @@
1
+ [build-system]
2
+ requires = ["setuptools>=70.1"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "dbt-vectorize"
7
+ version = "0.1.4"
8
+ description = "dbt + Rust vectorization runner for pgvector"
9
+ readme = "README.md"
10
+ license = "Apache-2.0"
11
+ requires-python = ">=3.9"
12
+ authors = [
13
+ { name = "Maria Dubyaga", email = "kraftaa@gmail.com" }
14
+ ]
15
+ dependencies = [
16
+ "PyYAML>=6.0",
17
+ ]
18
+ classifiers = [
19
+ "Environment :: Console",
20
+ "Intended Audience :: Developers",
21
+ "Programming Language :: Python :: 3",
22
+ "Programming Language :: Rust",
23
+ "Topic :: Database",
24
+ "Topic :: Scientific/Engineering :: Artificial Intelligence",
25
+ ]
26
+
27
+ [project.scripts]
28
+ dbt-vectorize = "dbt_vectorize.cli:main"
29
+
30
+ [project.urls]
31
+ Homepage = "https://github.com/kraftaa/dbt-vector"
32
+ Repository = "https://github.com/kraftaa/dbt-vector"
33
+
34
+ [tool.setuptools]
35
+ packages = ["dbt_vectorize", "dbt_vectorize.bin"]
36
+ include-package-data = true
37
+
38
+ [tool.setuptools.package-data]
39
+ dbt_vectorize = ["bin/*"]
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,27 @@
1
+ from setuptools import setup
2
+
3
+ try:
4
+ from setuptools.command.bdist_wheel import bdist_wheel as _bdist_wheel
5
+ except Exception:
6
+ try:
7
+ from wheel.bdist_wheel import bdist_wheel as _bdist_wheel
8
+ except Exception:
9
+ _bdist_wheel = None
10
+
11
+
12
+ if _bdist_wheel is not None:
13
+ class bdist_wheel(_bdist_wheel):
14
+ # Force platform wheels because we bundle a native Rust binary.
15
+ def finalize_options(self):
16
+ super().finalize_options()
17
+ self.root_is_pure = False
18
+
19
+ # The package works with any Python 3 version; only the platform matters.
20
+ def get_tag(self):
21
+ _py, _abi, plat = super().get_tag()
22
+ return "py3", "none", plat
23
+
24
+
25
+ setup(cmdclass={"bdist_wheel": bdist_wheel})
26
+ else:
27
+ setup()