pgsemantic 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pgsemantic-0.1.0/LICENSE +21 -0
- pgsemantic-0.1.0/PKG-INFO +615 -0
- pgsemantic-0.1.0/README.md +580 -0
- pgsemantic-0.1.0/pgsemantic/__init__.py +3 -0
- pgsemantic-0.1.0/pgsemantic/__main__.py +4 -0
- pgsemantic-0.1.0/pgsemantic/cli.py +52 -0
- pgsemantic-0.1.0/pgsemantic/commands/__init__.py +0 -0
- pgsemantic-0.1.0/pgsemantic/commands/apply.py +491 -0
- pgsemantic-0.1.0/pgsemantic/commands/index.py +246 -0
- pgsemantic-0.1.0/pgsemantic/commands/inspect.py +144 -0
- pgsemantic-0.1.0/pgsemantic/commands/integrate.py +142 -0
- pgsemantic-0.1.0/pgsemantic/commands/search.py +130 -0
- pgsemantic-0.1.0/pgsemantic/commands/serve.py +86 -0
- pgsemantic-0.1.0/pgsemantic/commands/status.py +131 -0
- pgsemantic-0.1.0/pgsemantic/commands/worker.py +80 -0
- pgsemantic-0.1.0/pgsemantic/config.py +167 -0
- pgsemantic-0.1.0/pgsemantic/db/__init__.py +0 -0
- pgsemantic-0.1.0/pgsemantic/db/client.py +108 -0
- pgsemantic-0.1.0/pgsemantic/db/introspect.py +369 -0
- pgsemantic-0.1.0/pgsemantic/db/queue.py +210 -0
- pgsemantic-0.1.0/pgsemantic/db/vectors.py +812 -0
- pgsemantic-0.1.0/pgsemantic/embeddings/__init__.py +41 -0
- pgsemantic-0.1.0/pgsemantic/embeddings/base.py +32 -0
- pgsemantic-0.1.0/pgsemantic/embeddings/local.py +92 -0
- pgsemantic-0.1.0/pgsemantic/embeddings/ollama_provider.py +72 -0
- pgsemantic-0.1.0/pgsemantic/embeddings/openai_provider.py +51 -0
- pgsemantic-0.1.0/pgsemantic/exceptions.py +54 -0
- pgsemantic-0.1.0/pgsemantic/mcp_server/__init__.py +0 -0
- pgsemantic-0.1.0/pgsemantic/mcp_server/server.py +266 -0
- pgsemantic-0.1.0/pgsemantic/worker/__init__.py +0 -0
- pgsemantic-0.1.0/pgsemantic/worker/daemon.py +228 -0
- pgsemantic-0.1.0/pgsemantic.egg-info/PKG-INFO +615 -0
- pgsemantic-0.1.0/pgsemantic.egg-info/SOURCES.txt +37 -0
- pgsemantic-0.1.0/pgsemantic.egg-info/dependency_links.txt +1 -0
- pgsemantic-0.1.0/pgsemantic.egg-info/entry_points.txt +2 -0
- pgsemantic-0.1.0/pgsemantic.egg-info/requires.txt +17 -0
- pgsemantic-0.1.0/pgsemantic.egg-info/top_level.txt +1 -0
- pgsemantic-0.1.0/pyproject.toml +70 -0
- pgsemantic-0.1.0/setup.cfg +4 -0
pgsemantic-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Sai Ram Varma Budharaju

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,615 @@
Metadata-Version: 2.4
Name: pgsemantic
Version: 0.1.0
Summary: Zero-config semantic search bootstrap for any PostgreSQL database
Author: Sai Ram Varma Budharaju
License-Expression: MIT
Project-URL: Homepage, https://github.com/varmabudharaju/pgsemantic
Project-URL: Issues, https://github.com/varmabudharaju/pgsemantic/issues
Keywords: pgvector,postgresql,semantic-search,rag,embeddings,mcp
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typer>=0.9.0
Requires-Dist: rich>=13.0.0
Requires-Dist: psycopg[binary]>=3.1.0
Requires-Dist: pgvector>=0.3.0
Requires-Dist: sentence-transformers>=3.0.0
Requires-Dist: openai>=1.0.0
Requires-Dist: mcp>=1.0.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
Requires-Dist: pytest-benchmark>=4.0.0; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy>=1.8.0; extra == "dev"
Requires-Dist: types-psycopg2; extra == "dev"
Dynamic: license-file

# pgsemantic

**Point it at your existing Postgres database. Get semantic search in 60 seconds.**

[](https://pypi.org/project/pgsemantic/)
[](https://www.python.org/downloads/)
[](LICENSE)

Zero-config CLI that bootstraps production-quality semantic search on any existing PostgreSQL database — local or remote (Supabase, Neon, RDS, Railway). No pgvector expertise required.

---

## Install

```bash
pip install pgsemantic
```

```
$ pgsemantic --help

Usage: pgsemantic [OPTIONS] COMMAND [ARGS]...

Zero-config semantic search bootstrap for any PostgreSQL database.

╭─ Commands ───────────────────────────────────────────────────────────────────╮
│ inspect    Scan a database and score columns for semantic search suitability │
│ apply      Set up semantic search on a table                                 │
│ index      Bulk embed existing rows where embedding IS NULL                  │
│ search     Search your database using natural language                       │
│ worker     Start the background worker daemon                                │
│ serve      Start the pgsemantic MCP server                                   │
│ status     Show embedding health dashboard for all watched tables            │
│ integrate  Set up AI agent integrations (Claude Desktop)                     │
╰──────────────────────────────────────────────────────────────────────────────╯
```

---

## Full Walkthrough

Here's the complete flow from zero to semantic search. Every output below is real terminal output, not mockups.

### Step 1: Connect to your database

pgsemantic needs a Postgres connection string. You can either pass it with `--db` or save it in a `.env` file:

```bash
# Option A: pass directly (good for trying things out)
pgsemantic inspect postgresql://postgres:password@localhost:5432/mydb

# Option B: save in a .env file (recommended)
echo 'DATABASE_URL=postgresql://postgres:password@localhost:5432/mydb' > .env
pgsemantic inspect # reads from .env automatically
```

**Where to find your connection string:**

| Provider | Where to find it |
|----------|-----------------|
| **Local Docker** | The URL you set when creating the container. With our docker-compose.yml: `postgresql://postgres:password@localhost:5432/pgvector_dev` |
| **Supabase** | Dashboard → Settings → Database → Connection string → URI |
| **Neon** | Dashboard → your project → Connection Details → Connection string |
| **Railway** | Dashboard → your Postgres service → Variables → `DATABASE_URL` |
| **AWS RDS** | `postgresql://USER:PASS@your-instance.region.rds.amazonaws.com:5432/dbname` |

> **Tip:** The `.env` file contains your database password. Add `.env` to your `.gitignore` so it's never committed to git.

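A minimal sketch of what Option B involves, using only the Python standard library. `read_database_url` and `redact` are hypothetical helper names for illustration, not pgsemantic functions (the package metadata lists `python-dotenv` as a dependency, so the real loading is presumably done there). The sketch reads `DATABASE_URL` from `.env`-style text and prints a credential-free `host:port/dbname` form that is safe to log.

```python
from urllib.parse import urlsplit


def read_database_url(env_text: str) -> str:
    """Return the DATABASE_URL value from the text of a .env file."""
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "DATABASE_URL":
            return value.strip().strip("'\"")
    raise KeyError("DATABASE_URL not found")


def redact(url: str) -> str:
    """Render host:port/dbname without the user name or password."""
    parts = urlsplit(url)
    port = parts.port or 5432  # 5432 is the default Postgres port
    return f"{parts.hostname}:{port}{parts.path}"


env = "DATABASE_URL=postgresql://postgres:password@localhost:5432/mydb\n"
print(redact(read_database_url(env)))
```

Keeping the password out of anything printed is the same concern the tip above raises for git.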
### Step 2: Scan your database

```
$ pgsemantic inspect postgresql://postgres:password@localhost:5432/pgvector_dev

╭──────────────────────────────────────────────────────────────────────────────╮
│ Semantic Search Candidates — localhost:5432/pgvector_dev                     │
│ pgvector 0.8.2                                                               │
╰──────────────────────────────────────────────────────────────────────────────╯
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━┓
┃ Table           ┃ Column        ┃ Score ┃ Avg Length  ┃ Type ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━┩
│ clinical_trials │ eligibility   │ ★★★   │ 1,382 chars │ text │
│ clinical_trials │ brief_summary │ ★★★   │ 711 chars   │ text │
│ products        │ description   │ ★★★   │ 108 chars   │ text │
│ clinical_trials │ title         │ ★★★   │ 85 chars    │ text │
│ products        │ name          │ ★☆☆   │ 20 chars    │ text │
│ products        │ category      │ ★☆☆   │ 9 chars     │ text │
└─────────────────┴───────────────┴───────┴─────────────┴──────┘

Scoring is heuristic (text length + column name patterns).
Always verify recommendations make sense for your use case.

Next step:
pgsemantic apply --table clinical_trials --column eligibility
```

`inspect` scans every text column in your database, samples average text length, and scores them. Longer text in columns named `description`, `body`, `content`, etc. scores higher. It's a heuristic to help you decide what to embed.

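To make the heuristic concrete, here is a minimal sketch that combines average text length with a column-name bonus. The thresholds (50 and 200 characters) and the name pattern are assumptions chosen for illustration. They are not taken from the pgsemantic source, and `score_column` is a hypothetical name.

```python
import re

# Hypothetical name patterns; real tools may use a different list.
NAME_HINTS = re.compile(r"description|body|content|summary|text|title", re.I)


def score_column(name: str, avg_length: float) -> int:
    """Return 1 to 3 stars from average text length plus a name bonus."""
    if avg_length >= 200:
        stars = 3
    elif avg_length >= 50:
        stars = 2
    else:
        stars = 1
    if NAME_HINTS.search(name):
        stars += 1  # columns with content-like names get a bump
    return min(stars, 3)


for name, avg in [("eligibility", 1382), ("description", 108), ("category", 9)]:
    print(f"{name:<12} {'★' * score_column(name, avg)}")
```

With these made-up thresholds, a short `category` column stays at one star while a 108-character `description` reaches three. That has the same shape as the table above, but the real scoring rules may differ.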
### Step 3: Set up semantic search on a table

```
$ pgsemantic apply --table products --column description

╭────────────────────────────── pgsemantic apply ──────────────────────────────╮
│ Setting up semantic search on public.products                                │
│ Column(s): description                                                       │
│ Model: all-MiniLM-L6-v2 (384 dimensions)                                     │
│ Storage: inline                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯

[1/9] Checking pgvector extension... v0.8.2
[2/9] Verifying source column(s)... found (PK: id)
[3/9] Checking embedding column... needs creation

╭───────────────────────── SQL Preview — ALTER TABLE ──────────────────────────╮
│ ALTER TABLE "public"."products" ADD COLUMN "embedding" vector(384);          │
╰──────────────────────────────────────────────────────────────────────────────╯
This will add an embedding column to your table. Continue? [y/N]: y

[4/9] Adding embedding column... done
[5/9] Creating HNSW index (CONCURRENTLY)... done
[6/9] Creating queue table... done
[7/9] Installing trigger function... done
[8/9] Installing trigger on table... done
[9/9] Saving config... done

╭─────────────────────────────── Setup Complete ───────────────────────────────╮
│ Semantic search is ready on public.products (description)                    │
│                                                                              │
│ Next steps:                                                                  │
│ 1. Embed existing rows: pgsemantic index --table products                    │
│ 2. Start live sync: pgsemantic worker                                        │
│ 3. Start MCP server: pgsemantic serve                                        │
╰──────────────────────────────────────────────────────────────────────────────╯
```

`apply` shows you the exact SQL it will run and asks for confirmation before touching your table. It adds ONE column (`embedding`) and ONE trigger. Nothing else changes.

### Step 4: Embed your existing rows

```
$ pgsemantic index --table products

Indexing public.products (description) with all-MiniLM-L6-v2 (batch size: 32,
storage: inline)

Total rows with content: 20
Already embedded: 0
Remaining: 20

Embedding rows... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 0:00:00 0:00:00

Indexed 20 rows in 0.3s (4,631 rows/min)
Coverage: 20/20 (100.0%)

Next steps:
Start live sync: pgsemantic worker
Start MCP server: pgsemantic serve
```

`index` generates vector embeddings for every row. It uses the local model by default — no API key, no internet required (after first download). Re-running only processes rows that haven't been embedded yet, so it's safe to interrupt and resume.

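The resume behaviour can be pictured with a small in-memory sketch: select only rows whose embedding is still missing, process them in batches of 32, and write each vector back. The row layout, `index_pending`, and `fake_embed` are hypothetical stand-ins. A real run would read and update Postgres rows and call an actual embedding model.

```python
from typing import Callable


def index_pending(rows: list, embed: Callable[[list], list], batch_size: int = 32) -> int:
    """Embed only rows whose embedding is still None; return how many were done."""
    pending = [r for r in rows if r["embedding"] is None and r["description"]]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        vectors = embed([r["description"] for r in batch])
        for row, vec in zip(batch, vectors):
            row["embedding"] = vec  # in Postgres this would be an UPDATE
    return len(pending)


def fake_embed(texts):
    # Stand-in for a real model: one tiny vector per input text.
    return [[float(len(t))] for t in texts]


rows = [{"id": i, "description": f"item {i}", "embedding": None} for i in range(70)]
first = index_pending(rows, fake_embed)
second = index_pending(rows, fake_embed)  # nothing left, so this is a no-op
print(first, second)
```

Because the selection is driven by the missing embedding rather than by a saved cursor, an interrupted run loses at most the batch that was in flight.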
### Step 5: Search!

```
$ pgsemantic search "wireless headphones with good battery" --table products

Results for: "wireless headphones with good battery" in products.description

1. (score: 0.698) id: 2 name: Bose QuietComfort 45 category: electronics price: 329.99
Comfortable over-ear wireless headphones with world-class noise cancellation,
24-hour battery life, and high-fidelity audio

2. (score: 0.673) id: 1 name: Sony WH-1000XM5 category: electronics price: 349.99
Premium wireless noise-canceling headphones with 30-hour battery life,
exceptional sound quality, and comfortable fit for all-day listening

3. (score: 0.604) id: 3 name: Apple AirPods Max category: electronics price: 549.99
High-fidelity wireless over-ear headphones with active noise cancellation,
spatial audio, and premium build quality

4. (score: 0.569) id: 4 name: Samsung Galaxy Buds Pro category: electronics price: 199.99
True wireless earbuds with intelligent active noise cancellation, 360 audio,
and water resistance for workouts

5. (score: 0.533) id: 17 name: Sony WF-1000XM5 Earbuds category: electronics price: 279.99
Premium true wireless earbuds with industry-leading noise cancellation and
high-resolution audio in a compact design
```

Natural language in, ranked results out. It works with any query:

```
$ pgsemantic search "espresso machine" --table products --limit 3

Results for: "espresso machine" in products.description

1. (score: 0.738) id: 19 name: Breville Barista Express category: kitchen price: 699.99
Semi-automatic espresso machine with built-in grinder, precise temperature
control, and micro-foam milk texturing

2. (score: 0.303) id: 11 name: Instant Pot Duo 7-in-1 category: kitchen price: 89.99
Multi-use programmable pressure cooker, slow cooker, rice cooker, steamer, and
more in one appliance

3. (score: 0.253) id: 13 name: Yeti Rambler Tumbler category: kitchen price: 35.00
Vacuum-insulated stainless steel tumbler that keeps drinks cold for hours,
dishwasher safe
```

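The README does not say how `score` is computed. A common choice for sentence embeddings is cosine similarity, so the sketch below assumes that and ranks a few toy vectors against a query vector. `cosine_similarity` and `rank` are illustrative names. In a real setup the vectors would come from the embedding model and the ranking would run inside Postgres through pgvector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, rows, limit=5):
    """rows: iterable of (row_id, vector). Returns (row_id, score), best first."""
    scored = [(row_id, cosine_similarity(query_vec, vec)) for row_id, vec in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:limit]


rows = [(1, [1.0, 0.0]), (2, [0.7, 0.7]), (3, [0.0, 1.0])]
top = rank([1.0, 0.1], rows, limit=2)
print([row_id for row_id, _ in top])
```

Under this assumption, the sharp drop from 0.738 to 0.303 in the espresso example would mean that only the first hit points in nearly the same direction as the query.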
### Step 6: Keep embeddings in sync (optional)

```bash
$ pgsemantic worker

INFO Worker started. Polling pgvector_setup_queue every 500ms, batch size 10.
INFO Claimed 3 jobs from queue.
INFO Embedded products#21 (description, 384d)
INFO Embedded products#22 (description, 384d)
INFO Embedded products#23 (description, 384d)
INFO Worker alive, queue empty.
```

The worker runs in the background. When your app inserts/updates/deletes rows, the trigger fires, the worker picks up the job, and the embedding is updated automatically. Stop it with `Ctrl+C`.

### Step 7: Check health

```
$ pgsemantic status

Embedding Status
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ Table           ┃ Column    ┃ Model            ┃ Storage  ┃ Coverage   ┃ Pending ┃ Failed ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ clinical_trials │ brief_sum │ all-MiniLM-L6-v2 │ inline   │ 3003/3003  │ 0       │ 0      │
│                 │           │                  │          │ (100.0%)   │         │        │
│ products        │ descript  │ all-MiniLM-L6-v2 │ inline   │ 20/20      │ 0       │ 0      │
│                 │           │                  │          │ (100.0%)   │         │        │
│ test_products   │ title,    │ all-MiniLM-L6-v2 │ external │ 10/10      │ 0       │ 0      │
│                 │ descript  │                  │          │ (100.0%)   │         │        │
└─────────────────┴───────────┴──────────────────┴──────────┴────────────┴─────────┴────────┘
```

### Step 8: Connect Claude Desktop (optional)

```bash
# Auto-configure Claude Desktop
pgsemantic integrate claude

# Restart Claude Desktop, then ask:
# "Search the products table for wireless headphones under $100"
# "Find articles about machine learning"
```

Claude calls `semantic_search` or `hybrid_search` behind the scenes and gets ranked results from your actual database.

---

## Multi-Column Search (`--columns`)

Search across multiple columns at once — for example, finding products by both their name and description:

```
$ pgsemantic apply --table test_products --columns title,description --external

╭────────────────────────────── pgsemantic apply ──────────────────────────────╮
│ Setting up semantic search on public.test_products                           │
│ Column(s): title, description                                                │
│ Model: all-MiniLM-L6-v2 (384 dimensions)                                     │
│ Storage: external → pgsemantic_embeddings_test_products                      │
╰──────────────────────────────────────────────────────────────────────────────╯

[1/9] Checking pgvector extension... v0.8.2
[2/9] Verifying source column(s)... found (PK: id)
[3/9] Creating shadow table... done
Your table remains unchanged. Embeddings stored in
pgsemantic_embeddings_test_products
[5/9] Creating HNSW index (CONCURRENTLY)... done
[6/9] Creating queue table... done
[7/9] Installing trigger function... done
[8/9] Installing trigger on table... done
[9/9] Saving config... done

╭─────────────────────────────── Setup Complete ───────────────────────────────╮
│ Semantic search is ready on public.test_products (title, description)        │
│ Embeddings stored in: pgsemantic_embeddings_test_products                    │
╰──────────────────────────────────────────────────────────────────────────────╯
```

Columns are concatenated with field labels before embedding:

```
title: Sony WH-1000XM5
description: Wireless noise-canceling headphones with 30-hour battery life...
```

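A minimal sketch of the labelled concatenation shown above. `build_embedding_text` is a hypothetical name, and skipping empty or NULL columns is an assumption, since the README only shows the case where both columns have values.

```python
def build_embedding_text(row: dict, columns: list) -> str:
    """Join the chosen columns as 'label: value' lines, skipping empty values."""
    parts = []
    for col in columns:
        value = row.get(col)
        if value:  # assumption: NULL or empty columns contribute nothing
            parts.append(f"{col}: {value}")
    return "\n".join(parts)


row = {"id": 1, "title": "Sony WH-1000XM5", "description": "Wireless headphones"}
print(build_embedding_text(row, ["title", "description"]))
```

Keeping the field labels in the text lets the model see which part is a title and which is a description, which is presumably why the tool includes them.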
Then index and search work exactly the same:

```
$ pgsemantic index --table test_products

Indexing public.test_products (title, description) with all-MiniLM-L6-v2
(batch size: 32, storage: external)

Total rows with content: 10
Already embedded: 0
Remaining: 10

Embedding rows... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 0:00:00 0:00:00

Indexed 10 rows in 0.4s (1,370 rows/min)
Coverage: 10/10 (100.0%)
```

```
$ pgsemantic search "warm jacket for hiking" --table test_products

Results for: "warm jacket for hiking" in test_products.title

1. (score: 0.581) id: 5 North Face Thermoball
Warm and compressible insulated jacket with synthetic fill for wet conditions

2. (score: 0.507) id: 4 Patagonia Nano Puff
Lightweight insulated jacket perfect for cold weather hiking and outdoor adventures
```

---

## External Storage Mode (`--external`)

By default, pgsemantic adds an `embedding` column directly to your source table. If you can't or don't want to alter your production table (no ALTER permissions, strict schema policies, etc.), use `--external`:

```bash
pgsemantic apply --table products --column description --external
```

This creates a separate **shadow table** to store embeddings. Your source table stays completely untouched:

```
-- Your source table — NO embedding column, no changes at all
SELECT column_name FROM information_schema.columns WHERE table_name = 'test_products';

 column_name
─────────────
 id
 title
 description
 category
 price

-- Embeddings live in a separate shadow table
SELECT row_id, source_column, model_name FROM pgsemantic_embeddings_test_products LIMIT 3;

 row_id │ source_column     │ model_name
────────┼───────────────────┼──────────────────
 1      │ title+description │ all-MiniLM-L6-v2
 2      │ title+description │ all-MiniLM-L6-v2
 3      │ title+description │ all-MiniLM-L6-v2
```

**How updates and deletes work with `--external`:**

- **INSERT/UPDATE** — the worker generates a new embedding and upserts it into the shadow table (`INSERT ... ON CONFLICT DO UPDATE`)
- **DELETE** — the worker deletes the corresponding row from the shadow table
- **Search** — pgsemantic automatically JOINs the shadow table to your source table at query time, so search results look exactly the same

The shadow table schema:

```sql
CREATE TABLE pgsemantic_embeddings_{table} (
    row_id        TEXT PRIMARY KEY,          -- stringified PK from your source table
    embedding     vector(384),               -- the embedding vector
    source_column TEXT NOT NULL,             -- which column(s) were embedded
    model_name    TEXT NOT NULL,             -- model used to generate the embedding
    updated_at    TIMESTAMPTZ DEFAULT NOW()  -- when the embedding was last updated
);
```

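Given that schema, the upsert and delete statements the worker issues could look roughly like the strings built below. This is a sketch under assumptions: the exact SQL pgsemantic runs is not shown in the README, `shadow_upsert_sql` and `shadow_delete_sql` are hypothetical names, and the `%s` placeholders follow the psycopg parameter style.

```python
import re

# Only plain identifiers are accepted, because the table name is
# interpolated into the statement rather than passed as a parameter.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _shadow_table(table: str) -> str:
    if not _IDENT.match(table):
        raise ValueError(f"unsafe table name: {table!r}")
    return f'"pgsemantic_embeddings_{table}"'


def shadow_upsert_sql(table: str) -> str:
    """INSERT a new embedding, or overwrite the existing one for that row_id."""
    return (
        f"INSERT INTO {_shadow_table(table)} "
        "(row_id, embedding, source_column, model_name, updated_at) "
        "VALUES (%s, %s, %s, %s, NOW()) "
        "ON CONFLICT (row_id) DO UPDATE SET "
        "embedding = EXCLUDED.embedding, "
        "source_column = EXCLUDED.source_column, "
        "model_name = EXCLUDED.model_name, "
        "updated_at = NOW()"
    )


def shadow_delete_sql(table: str) -> str:
    """Remove the embedding that belonged to a deleted source row."""
    return f"DELETE FROM {_shadow_table(table)} WHERE row_id = %s"


print(shadow_upsert_sql("test_products"))
```

Because `row_id` is the primary key, `ON CONFLICT (row_id)` guarantees at most one embedding per source row however many times the row is updated.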
You can combine `--external` with `--columns`:

```bash
# Multi-column + external: search across title and description,
# without modifying the products table at all
pgsemantic apply --table products --columns title,description --external
```

---

## Embedding Models

| Model | Provider | Dimensions | Cost | API Key Required |
|-------|----------|-----------|------|-----------------|
| **all-MiniLM-L6-v2** (default) | Local (sentence-transformers) | 384 | Free | No |
| **nomic-embed-text** | Ollama (local) | 768 | Free | No |
| **text-embedding-3-small** | OpenAI | 1536 | $0.02/1M tokens | Yes |

```bash
# Use default local model (no API key needed)
pgsemantic apply --table products --column description

# Use OpenAI (set OPENAI_API_KEY in .env first)
pgsemantic apply --table products --column description --model openai

# Use Ollama (requires: ollama serve && ollama pull nomic-embed-text)
pgsemantic apply --table products --column description --model ollama
```

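The model choice determines the width of the vector column. The sketch below maps each `--model` value to the dimension listed in the table and builds the matching `ALTER TABLE` statement in the same shape as the SQL preview printed by `apply`. `MODEL_DIMENSIONS` and `add_column_sql` are illustrative names. The dimensions for `openai` and `ollama` come from the table, and the idea that `apply` emits the same statement shape for them is an assumption.

```python
# --model value -> embedding dimensions, as listed in the table above.
MODEL_DIMENSIONS = {"local": 384, "ollama": 768, "openai": 1536}


def add_column_sql(schema: str, table: str, model: str = "local") -> str:
    """Build the ALTER TABLE statement for an inline embedding column."""
    dims = MODEL_DIMENSIONS[model]  # KeyError for an unknown model name
    return f'ALTER TABLE "{schema}"."{table}" ADD COLUMN "embedding" vector({dims});'


print(add_column_sql("public", "products"))
print(add_column_sql("public", "products", model="openai"))
```

A pgvector column has a fixed dimension, so switching models on an existing table would presumably require a new column (or shadow table) and a full re-index.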
---

## Supported Postgres Hosts

Works with any Postgres that has the pgvector extension.

| Host | How to Enable pgvector |
|------|----------------------|
| **Docker (local)** | Use `ankane/pgvector:pg16` image (included in our docker-compose.yml) |
| **Supabase** | Dashboard → Database → Extensions → "vector" |
| **Neon** | `CREATE EXTENSION vector;` as project owner |
| **Railway** | `CREATE EXTENSION vector;` as postgres user |
| **AWS RDS** | `CREATE EXTENSION vector;` with `rds_superuser` role |
| **Google Cloud SQL** | `CREATE EXTENSION vector;` as `cloudsqlsuperuser` |
| **Azure Database** | `CREATE EXTENSION vector;` (Flexible Server, PG 15+) |
| **Bare metal** | `apt install postgresql-16-pgvector` then `CREATE EXTENSION vector;` |

When pgvector isn't installed, `pgsemantic apply` tries to install it automatically. If that fails (permissions), it prints host-specific instructions and exits cleanly.

---

## Architecture

```
┌──────────────────────────────────────────────────────────┐
│                     Your Application                     │
│                                                          │
│   INSERT INTO products (name, description)               │
│   VALUES ('Widget', 'A great widget for...');            │
└─────────────────────────┬────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────┐
│                  PostgreSQL + pgvector                   │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │  products table                                    │  │
│  │  id | name | description | embedding vector(384)   │  │
│  └──────────────────────────┬─────────────────────────┘  │
│                             │                            │
│  Trigger fires on INSERT/UPDATE ──► pgvector_setup_queue │
│                                     (table, row_id, op)  │
└──────────────────────────────────────┬───────────────────┘
                                       │
              ┌────────────────────────┘
              ▼
┌─────────────────────────────┐    ┌───────────────────────┐
│ pgsemantic worker           │    │ pgsemantic serve      │
│                             │    │ (MCP Server)          │
│ Polls queue every 500ms     │    │                       │
│ Fetches source text         │    │ semantic_search()     │
│ Generates embedding         │    │ hybrid_search()       │
│ Writes back to table        │    │ get_embedding_status  │
│ Deletes completed job       │    │       │               │
└─────────────────────────────┘    └───────┼───────────────┘
                                           │
                                           ▼
                                   ┌────────────────┐
                                   │ Claude Desktop │
                                   │ Cursor / AI    │
                                   └────────────────┘
```

**Key design decisions:**

- **Embedding column lives on your source table** (by default) — no central embeddings table. Use `--external` if you prefer a separate shadow table.
- **Persistent queue, not LISTEN/NOTIFY** — LISTEN/NOTIFY drops events silently under load. The queue table with `SELECT FOR UPDATE SKIP LOCKED` never loses events and survives process crashes.
- **HNSW index, not IVFFlat** — no training step, handles inserts without full rebuild, >95% recall.
- **Pre-filtered hybrid search** — Uses `hnsw.iterative_scan = relaxed_order` (pgvector >=0.8.0) to combine vector similarity with SQL WHERE filters in a single query.

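The queue decision can be illustrated in two parts: a claim query of the kind `SELECT FOR UPDATE SKIP LOCKED` refers to, and a toy in-memory analog showing why two workers never receive the same job. The column names in `CLAIM_SQL` are assumptions based on the diagram's `(table, row_id, op)`, and `InMemoryQueue` is a teaching aid, not how the worker is implemented.

```python
# Hypothetical claim query; the id, table_name and op column names are assumed.
CLAIM_SQL = """
SELECT id, table_name, row_id, op
FROM pgvector_setup_queue
ORDER BY id
LIMIT %s
FOR UPDATE SKIP LOCKED
"""


class InMemoryQueue:
    """Toy analog of SKIP LOCKED: a claim skips jobs another worker holds."""

    def __init__(self, jobs):
        self.jobs = list(jobs)
        self.locked = set()

    def claim(self, batch_size):
        batch = [j for j in self.jobs if j not in self.locked][:batch_size]
        self.locked.update(batch)
        return batch

    def complete(self, batch):
        # Finished jobs are deleted from the queue and their locks released.
        self.jobs = [j for j in self.jobs if j not in batch]
        self.locked.difference_update(batch)


queue = InMemoryQueue(range(1, 8))
worker_a = queue.claim(3)
worker_b = queue.claim(3)  # skips the jobs worker_a is holding
print(worker_a, worker_b)
```

In Postgres the locks belong to the claiming transaction. If a worker crashes before committing, its rows are unlocked again and remain in the table, which is the property the bullet above describes as surviving process crashes.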
---

## Commands Reference

| Command | Description |
|---------|-------------|
| `pgsemantic inspect <DB_URL>` | Scan and score columns for semantic search suitability |
| `pgsemantic apply --table T --column C` | Set up semantic search: extension, column, index, trigger |
| `pgsemantic index --table T` | Bulk embed all existing rows |
| `pgsemantic search "query" --table T` | Natural language search from the terminal |
| `pgsemantic worker` | Background daemon that keeps embeddings in sync |
| `pgsemantic serve` | Start MCP server for Claude Desktop / Cursor |
| `pgsemantic status` | Show embedding health dashboard |
| `pgsemantic integrate claude` | Auto-configure Claude Desktop |

### `apply` options

| Flag | Default | Description |
|------|---------|-------------|
| `--table`, `-t` | required | Table to set up |
| `--column`, `-c` | required* | Single text column to embed |
| `--columns` | — | Comma-separated columns to embed together (e.g. `title,description`) |
| `--model`, `-m` | `local` | `local`, `openai`, or `ollama` |
| `--external` | off | Store embeddings in a shadow table (don't modify source table) |
| `--db`, `-d` | `DATABASE_URL` | Database connection string |
| `--schema`, `-s` | `public` | Postgres schema |

*Use `--column` or `--columns`, not both.

### MCP server tools

| Tool | Description |
|------|-------------|
| `semantic_search` | Natural-language similarity search |
| `hybrid_search` | Semantic search + SQL WHERE filters (e.g. `{"category": "laptop", "price_max": 1500}`) |
| `get_embedding_status` | Coverage %, queue depth, model info |

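One plausible way to turn a `hybrid_search` filter object into a parameterised `WHERE` clause is sketched below. The `_max` and `_min` suffix convention is inferred from the `price_max` example and may not match the real tool. `build_where` is a hypothetical name, and the `%s` placeholders follow the psycopg parameter style.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_where(filters: dict):
    """Translate a filter dict into (sql_fragment, params)."""
    clauses, params = [], []
    for key, value in filters.items():
        if key.endswith("_max"):
            column, op = key[:-4], "<="
        elif key.endswith("_min"):
            column, op = key[:-4], ">="
        else:
            column, op = key, "="
        if not _IDENT.match(column):
            raise ValueError(f"unsafe column name: {column!r}")
        clauses.append(f'"{column}" {op} %s')
        params.append(value)  # values travel as parameters, never inline
    return " AND ".join(clauses), params


where, params = build_where({"category": "laptop", "price_max": 1500})
print(where, params)
```

Column names are validated and quoted while values stay in the parameter list, which matters because the filter object originates from an AI agent rather than from trusted code.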
---

## Configuration

### `.env` file

Create in the directory where you run pgsemantic:

```env
# Required — your Postgres connection string
DATABASE_URL=postgresql://user:pass@host:5432/dbname

# Only needed for OpenAI embeddings
OPENAI_API_KEY=sk-...

# Only needed for Ollama embeddings
OLLAMA_BASE_URL=http://localhost:11434
```

Most users only need `DATABASE_URL`. Everything else has sensible defaults.

### `.pgsemantic.json`

Created automatically by `pgsemantic apply`. Tracks which tables are set up and with what settings. Don't delete it — `index`, `worker`, `search`, and `serve` read it. Safe to commit to git.

---

## FAQ

**Does this modify my existing tables?**
By default, it adds ONE column (`embedding`) and ONE trigger. Nothing else changes. Use `--external` to avoid even that — embeddings go in a separate shadow table.

**What if pgvector is not installed?**
`apply` tries to install it automatically. If that fails (permissions), it prints host-specific instructions and exits cleanly.

**Do I need to download my data?**
No. pgsemantic connects to your database (local or remote) and works in place. Embeddings are generated on your machine and written back over the connection.

**Can I search across multiple columns?**
Yes. Use `--columns title,description` to concatenate multiple columns into one embedding.

**What about very large text fields?**
all-MiniLM-L6-v2 has a 256-token limit (longer text is truncated). For long documents, use OpenAI's text-embedding-3-small (8191 token limit).

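If switching models is not an option, one common workaround is to split long text into overlapping chunks before embedding. The README does not describe pgsemantic doing this, so the sketch below is something you would run yourself. It counts whitespace-separated words as a rough stand-in for tokens, and `chunk_words` with its default sizes is purely illustrative.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list:
    """Split text into chunks of at most max_words words, sharing `overlap` words."""
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)] if words else []
    step = max_words - overlap  # how far each new chunk advances
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words) - overlap, step)
    ]


document = " ".join(str(n) for n in range(500))
chunks = chunk_words(document)
print(len(chunks), [len(c.split()) for c in chunks])
```

Word counts usually undershoot token counts, so the limit should sit comfortably below the model's 256 tokens. The overlap keeps a sentence that straddles a boundary present in both neighbouring chunks.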
**What is `.pgsemantic.json`?**
Config file created by `apply`. Tracks your setup. Don't delete it. Safe to commit to git.

---

## Development

```bash
git clone https://github.com/varmabudharaju/pgsemantic.git
cd pgsemantic
pip install -e ".[dev]"

docker-compose up -d # local Postgres with pgvector
pytest tests/unit/ -v # unit tests (no DB needed)
pytest tests/integration/ -m integration -v # integration tests
ruff check . # lint
```

---

## License

MIT