raglite_chromadb-1.0.1.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- raglite_chromadb-1.0.1/PKG-INFO +167 -0
- raglite_chromadb-1.0.1/README.md +154 -0
- raglite_chromadb-1.0.1/pyproject.toml +28 -0
- raglite_chromadb-1.0.1/raglite/__init__.py +1 -0
- raglite_chromadb-1.0.1/raglite/chroma_rest.py +111 -0
- raglite_chromadb-1.0.1/raglite/extract.py +63 -0
- raglite_chromadb-1.0.1/raglite/prompts.py +56 -0
- raglite_chromadb-1.0.1/raglite/raglite_cli.py +953 -0
- raglite_chromadb-1.0.1/raglite/vector_index.py +325 -0
- raglite_chromadb-1.0.1/raglite_chromadb.egg-info/PKG-INFO +167 -0
- raglite_chromadb-1.0.1/raglite_chromadb.egg-info/SOURCES.txt +20 -0
- raglite_chromadb-1.0.1/raglite_chromadb.egg-info/dependency_links.txt +1 -0
- raglite_chromadb-1.0.1/raglite_chromadb.egg-info/entry_points.txt +2 -0
- raglite_chromadb-1.0.1/raglite_chromadb.egg-info/requires.txt +3 -0
- raglite_chromadb-1.0.1/raglite_chromadb.egg-info/top_level.txt +1 -0
- raglite_chromadb-1.0.1/setup.cfg +4 -0
raglite_chromadb-1.0.1/PKG-INFO
@@ -0,0 +1,167 @@
Metadata-Version: 2.4
Name: raglite-chromadb
Version: 1.0.1
Summary: Local-first RAG-lite CLI: condense docs into structured Markdown, then index/query with Chroma + hybrid search
Author: Viraj Sanghvi
License: MIT
Keywords: rag,docs,chroma,ollama,openclaw,summarization,local-first
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: beautifulsoup4==4.12.3
Requires-Dist: lxml==5.3.0
Requires-Dist: pypdf==5.2.0

# RAGLite

<p align="center">
  <img src="assets/hero.svg" alt="RAGLite: Compress first. Index second." width="900" />
</p>

RAGLite is a local-first CLI that turns a folder of docs (PDF/HTML/TXT/MD) into **structured, low-fluff Markdown**, then makes it searchable with **Chroma** 🧠 + **ripgrep** 🔍.

Core idea: **compression-before-embeddings** ✂️➡️🧠

<p align="center">
  <img src="assets/diagram.svg" alt="RAGLite workflow: condense, index, query" width="900" />
</p>

## What you get

For each input file:
- `*.execution-notes.md` – practical run/operate notes (checks, failure modes, commands)
- `*.tool-summary.md` – compact index entry (purpose, capabilities, entrypoints, footguns)

Optionally:
- `raglite index` stores embeddings in **Chroma** 🧠 (one DB, many collections)
- `raglite query` runs **hybrid search** 🔍 (vector + keyword)

## Why local + open-source?

If you want a private, local setup (no managed "fancy vector DB" required), RAGLite keeps everything on your machine:
- Distilled Markdown artifacts are plain files you can audit + version control
- Indexing uses **Chroma** (open-source, local) and keyword search uses **ripgrep**
- You can still swap in a hosted vector DB later if you outgrow local

## Engines

RAGLite supports two backends:

- **OpenClaw (recommended):** uses your local OpenClaw Gateway `/v1/responses` endpoint for higher-quality, format-following condensation.
- **Ollama:** uses `POST /api/generate` for fully local inference (often less reliable at following strict templates).

## Prereqs

- **Python 3.11+**
- An LLM engine:
  - **OpenClaw** (recommended) 🦞, or
  - **Ollama** 🦙
- For search:
  - **Chroma** (open-source, local) 🧠 at `http://127.0.0.1:8100`

## Install

```bash
# from repo root
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

## Quickstart (60s)

```bash
# 0) Setup
cd ~/Projects/raglite
source .venv/bin/activate

# 1) Condense → Index (one command)
raglite run /path/to/docs \
  --out ./raglite_out \
  --engine ollama --ollama-model llama3.2:3b \
  --collection my-docs \
  --chroma-url http://127.0.0.1:8100 \
  --skip-indexed

# 2) Query
raglite query ./raglite_out \
  --collection my-docs \
  "rollback procedure"
```

## Usage

### 1) Distill docs ✂️

```bash
raglite condense /path/to/docs \
  --out ./raglite_out \
  --engine openclaw
```

(Or fully local: `--engine ollama --ollama-model llama3.2:3b`.)

### 2) Index distilled output (Chroma)

```bash
raglite index ./raglite_out \
  --collection my-docs \
  --chroma-url http://127.0.0.1:8100
```

### 3) Query (hybrid)

```bash
raglite query ./raglite_out \
  --collection my-docs \
  --top-k 5 \
  --keyword-top-k 5 \
  "rollback procedure"
```

### Useful flags

- `--skip-existing` : don't redo files that already have both outputs
- `--skip-indexed` : don't re-embed chunks that are already indexed
- `--nodes` : write per-section nodes + per-doc/root indices
- `--node-max-chars 1200` : keep nodes embed-friendly
- `--sleep-ms 200` : throttle between files (helps avoid timeouts)
- `--max-chars 180000` : cap extracted text per file before summarizing

## Output layout

RAGLite preserves folder structure under your `--out` dir:

```text
<out>/
  some/subdir/file.execution-notes.md
  some/subdir/file.tool-summary.md
```

(Default output folder is `./raglite_out`.)

## Notes / gotchas

- PDF extraction is best-effort: scanned PDFs with no embedded text layer won't extract well.
- If you use `--engine openclaw`, pass `--gateway-token` or set `OPENCLAW_GATEWAY_TOKEN`.
- Indexing defaults to high-signal artifacts (nodes/summaries/notes) and skips `*.outline.md` unless you opt in.

## Roadmap

### Current (implemented)
- `condense` – condense/summarize documents into Markdown artifacts
- `index` – chunk + embed + store in **Chroma** collections
- `query` – retrieve relevant chunks (vector + keyword)
- `run` – one-command pipeline (condense → index)
- Outline + nodes + indices: `--outline`, `--nodes`, root `index.md` + per-doc `*.index.md`

### Next (near-term)
- Detect deletions (prune removed chunks from Chroma)
- Batch upserts to Chroma for speed
- Better query output formatting (snippets + anchors)
- `raglite doctor` (dependency checks)

(Full: [ROADMAP.md](ROADMAP.md))

---

Built to turn "docs" into **usable, searchable tool knowledge**.

raglite_chromadb-1.0.1/README.md
@@ -0,0 +1,154 @@
# RAGLite

<p align="center">
  <img src="assets/hero.svg" alt="RAGLite: Compress first. Index second." width="900" />
</p>

RAGLite is a local-first CLI that turns a folder of docs (PDF/HTML/TXT/MD) into **structured, low-fluff Markdown**, then makes it searchable with **Chroma** 🧠 + **ripgrep** 🔍.

Core idea: **compression-before-embeddings** ✂️➡️🧠

<p align="center">
  <img src="assets/diagram.svg" alt="RAGLite workflow: condense, index, query" width="900" />
</p>

## What you get

For each input file:
- `*.execution-notes.md` – practical run/operate notes (checks, failure modes, commands)
- `*.tool-summary.md` – compact index entry (purpose, capabilities, entrypoints, footguns)

Optionally:
- `raglite index` stores embeddings in **Chroma** 🧠 (one DB, many collections)
- `raglite query` runs **hybrid search** 🔍 (vector + keyword)

## Why local + open-source?

If you want a private, local setup (no managed "fancy vector DB" required), RAGLite keeps everything on your machine:
- Distilled Markdown artifacts are plain files you can audit + version control
- Indexing uses **Chroma** (open-source, local) and keyword search uses **ripgrep**
- You can still swap in a hosted vector DB later if you outgrow local

## Engines

RAGLite supports two backends:

- **OpenClaw (recommended):** uses your local OpenClaw Gateway `/v1/responses` endpoint for higher-quality, format-following condensation.
- **Ollama:** uses `POST /api/generate` for fully local inference (often less reliable at following strict templates).
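
For orientation, this is roughly the request the Ollama backend issues. It is a sketch, not RAGLite's actual client code: the default Ollama port `11434` and the non-streaming response shape follow Ollama's public API, and the function name is illustrative.

```python
import json
import urllib.request

def ollama_generate(prompt: str, model: str = "llama3.2:3b",
                    base_url: str = "http://127.0.0.1:11434") -> str:
    # POST /api/generate with stream=False returns a single JSON object
    # whose "response" field holds the full completion.
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(f"{base_url}/api/generate",
                                 data=body.encode("utf-8"), method="POST")
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```
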
## Prereqs

- **Python 3.11+**
- An LLM engine:
  - **OpenClaw** (recommended) 🦞, or
  - **Ollama** 🦙
- For search:
  - **Chroma** (open-source, local) 🧠 at `http://127.0.0.1:8100`

## Install

```bash
# from repo root
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

## Quickstart (60s)

```bash
# 0) Setup
cd ~/Projects/raglite
source .venv/bin/activate

# 1) Condense → Index (one command)
raglite run /path/to/docs \
  --out ./raglite_out \
  --engine ollama --ollama-model llama3.2:3b \
  --collection my-docs \
  --chroma-url http://127.0.0.1:8100 \
  --skip-indexed

# 2) Query
raglite query ./raglite_out \
  --collection my-docs \
  "rollback procedure"
```

## Usage

### 1) Distill docs ✂️

```bash
raglite condense /path/to/docs \
  --out ./raglite_out \
  --engine openclaw
```

(Or fully local: `--engine ollama --ollama-model llama3.2:3b`.)

### 2) Index distilled output (Chroma)

```bash
raglite index ./raglite_out \
  --collection my-docs \
  --chroma-url http://127.0.0.1:8100
```

### 3) Query (hybrid)

```bash
raglite query ./raglite_out \
  --collection my-docs \
  --top-k 5 \
  --keyword-top-k 5 \
  "rollback procedure"
```
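
Under the hood, "hybrid" means two retrieval passes whose results are merged: vector hits from Chroma plus keyword hits from ripgrep. A minimal sketch of the keyword half and the merge, under stated assumptions: the de-duplicating merge order is illustrative, `vector_hits` would come from the Chroma query, and the real ranking lives in `raglite/vector_index.py`.

```python
import subprocess

def keyword_search(query: str, root: str = "./raglite_out", n: int = 5) -> list[str]:
    # rg -l lists files containing the query; -i matches case-insensitively
    proc = subprocess.run(["rg", "-li", query, root],
                          capture_output=True, text=True)
    return proc.stdout.splitlines()[:n]

def hybrid_merge(query: str, vector_hits: list[str], keyword_top_k: int = 5) -> list[str]:
    merged = list(vector_hits)  # keep the vector ranking first
    for path in keyword_search(query, n=keyword_top_k):
        if path not in merged:  # de-duplicate across the two passes
            merged.append(path)
    return merged
```
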
### Useful flags

- `--skip-existing` : don't redo files that already have both outputs
- `--skip-indexed` : don't re-embed chunks that are already indexed
- `--nodes` : write per-section nodes + per-doc/root indices
- `--node-max-chars 1200` : keep nodes embed-friendly
- `--sleep-ms 200` : throttle between files (helps avoid timeouts)
- `--max-chars 180000` : cap extracted text per file before summarizing

## Output layout

RAGLite preserves folder structure under your `--out` dir:

```text
<out>/
  some/subdir/file.execution-notes.md
  some/subdir/file.tool-summary.md
```

(Default output folder is `./raglite_out`.)

## Notes / gotchas

- PDF extraction is best-effort: scanned PDFs with no embedded text layer won't extract well.
- If you use `--engine openclaw`, pass `--gateway-token` or set `OPENCLAW_GATEWAY_TOKEN`.
- Indexing defaults to high-signal artifacts (nodes/summaries/notes) and skips `*.outline.md` unless you opt in.

## Roadmap

### Current (implemented)
- `condense` – condense/summarize documents into Markdown artifacts
- `index` – chunk + embed + store in **Chroma** collections
- `query` – retrieve relevant chunks (vector + keyword)
- `run` – one-command pipeline (condense → index)
- Outline + nodes + indices: `--outline`, `--nodes`, root `index.md` + per-doc `*.index.md`

### Next (near-term)
- Detect deletions (prune removed chunks from Chroma)
- Batch upserts to Chroma for speed
- Better query output formatting (snippets + anchors)
- `raglite doctor` (dependency checks)

(Full: [ROADMAP.md](ROADMAP.md))

---

Built to turn "docs" into **usable, searchable tool knowledge**.

raglite_chromadb-1.0.1/pyproject.toml
@@ -0,0 +1,28 @@
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "raglite-chromadb"
version = "1.0.1"
description = "Local-first RAG-lite CLI: condense docs into structured Markdown, then index/query with Chroma + hybrid search"
readme = "README.md"
requires-python = ">=3.11"
authors = [{ name = "Viraj Sanghvi" }]
license = { text = "MIT" }
keywords = ["rag", "docs", "chroma", "ollama", "openclaw", "summarization", "local-first"]
dependencies = [
  "beautifulsoup4==4.12.3",
  "lxml==5.3.0",
  "pypdf==5.2.0",
]

[project.scripts]
raglite = "raglite.raglite_cli:cli"

[tool.setuptools]
package-dir = {"" = "."}

[tool.setuptools.packages.find]
where = ["."]
include = ["raglite*"]

raglite_chromadb-1.0.1/raglite/__init__.py
@@ -0,0 +1 @@
__all__ = []

raglite_chromadb-1.0.1/raglite/chroma_rest.py
@@ -0,0 +1,111 @@
from __future__ import annotations

import json
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class ChromaLoc:
    base_url: str = "http://127.0.0.1:8100"
    tenant: str = "default_tenant"
    database: str = "default_database"

    def collections_url(self) -> str:
        return f"{self.base_url}/api/v2/tenants/{self.tenant}/databases/{self.database}/collections"


def _req_json(method: str, url: str, body: dict | None = None, timeout: int = 120) -> dict | list:
    data = None if body is None else json.dumps(body).encode("utf-8")
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Content-Type", "application/json")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            raw = resp.read().decode("utf-8")
            if not raw:
                return {}
            return json.loads(raw)
    except urllib.error.HTTPError as e:
        detail = ""
        try:
            detail = e.read().decode("utf-8", errors="ignore")
        except Exception:
            detail = ""
        raise RuntimeError(f"Chroma HTTP {e.code} calling {url}: {detail[:500]}") from e


def list_collections(loc: ChromaLoc) -> list[dict]:
    res = _req_json("GET", loc.collections_url())
    if not isinstance(res, list):
        raise RuntimeError(f"Expected list from Chroma list_collections, got {type(res).__name__}")
    return res


def get_or_create_collection(loc: ChromaLoc, name: str, *, space: str = "cosine") -> dict:
    cols = list_collections(loc)
    for c in cols:
        if c.get("name") == name:
            return c

    created = _req_json(
        "POST",
        loc.collections_url(),
        {"name": name, "metadata": {"hnsw:space": space}},
    )
    if not isinstance(created, dict):
        raise RuntimeError(f"Expected dict from Chroma create_collection, got {type(created).__name__}")
    return created


def add(
    loc: ChromaLoc,
    collection_id: str,
    *,
    ids: list[str],
    documents: list[str],
    embeddings: list[list[float]],
    metadatas: list[dict] | None = None,
) -> None:
    url = f"{loc.collections_url()}/{collection_id}/add"
    body: dict = {"ids": ids, "documents": documents, "embeddings": embeddings}
    if metadatas is not None:
        body["metadatas"] = metadatas
    _req_json("POST", url, body, timeout=600)


def upsert(
    loc: ChromaLoc,
    collection_id: str,
    *,
    ids: list[str],
    documents: list[str],
    embeddings: list[list[float]],
    metadatas: list[dict] | None = None,
) -> None:
    """Upsert records into a collection.

    Chroma's /upsert updates existing ids and inserts new ones.
    """
    url = f"{loc.collections_url()}/{collection_id}/upsert"
    body: dict = {"ids": ids, "documents": documents, "embeddings": embeddings}
    if metadatas is not None:
        body["metadatas"] = metadatas
    _req_json("POST", url, body, timeout=600)


def query(
    loc: ChromaLoc,
    collection_id: str,
    *,
    query_embeddings: list[list[float]],
    n_results: int = 10,
    include: list[str] | None = None,
) -> dict:
    url = f"{loc.collections_url()}/{collection_id}/query"
    body: dict = {"query_embeddings": query_embeddings, "n_results": n_results}
    if include is not None:
        body["include"] = include
    res = _req_json("POST", url, body, timeout=600)
    assert isinstance(res, dict)
    return res
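
A minimal usage sketch of this module against a local Chroma server. The three-element vectors are placeholders, and reading the collection id via `col["id"]` assumes the response shape of Chroma's v2 REST API; real embeddings come from the model wired up in `raglite/vector_index.py`.

```python
from raglite.chroma_rest import ChromaLoc, get_or_create_collection, query, upsert

loc = ChromaLoc(base_url="http://127.0.0.1:8100")
col = get_or_create_collection(loc, "my-docs")  # find-or-create by name

upsert(
    loc,
    col["id"],  # assumes the collection record carries an "id" field
    ids=["runbook.md#chunk-0"],
    documents=["Rollback: redeploy the previous tag, then re-run health checks."],
    embeddings=[[0.1, 0.2, 0.3]],  # placeholder vector
    metadatas=[{"source": "runbook.md"}],
)

res = query(loc, col["id"], query_embeddings=[[0.1, 0.2, 0.3]], n_results=5,
            include=["documents", "distances"])
print(res["documents"])
```
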
raglite_chromadb-1.0.1/raglite/extract.py
@@ -0,0 +1,63 @@
from __future__ import annotations

import re
from dataclasses import dataclass
from pathlib import Path
from typing import Literal

from bs4 import BeautifulSoup
from pypdf import PdfReader


FileKind = Literal["pdf", "html", "htm", "txt"]


@dataclass
class ExtractResult:
    kind: FileKind
    text: str


def _clean_text(s: str) -> str:
    s = s.replace("\r\n", "\n").replace("\r", "\n")
    s = re.sub(r"[ \t]+", " ", s)
    s = re.sub(r"\n{3,}", "\n\n", s)
    return s.strip()


def extract_pdf(path: Path) -> ExtractResult:
    reader = PdfReader(str(path))
    parts = []
    for i, page in enumerate(reader.pages):
        try:
            txt = page.extract_text() or ""
        except Exception:
            txt = ""
        if txt.strip():
            parts.append(f"\n\n--- Page {i+1} ---\n\n{txt}")
    return ExtractResult(kind="pdf", text=_clean_text("\n".join(parts)))


def extract_html(path: Path) -> ExtractResult:
    html = path.read_text(encoding="utf-8", errors="ignore")
    soup = BeautifulSoup(html, "lxml")

    # Remove scripts/styles
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    text = soup.get_text("\n")
    return ExtractResult(kind="html", text=_clean_text(text))


def extract_txt(path: Path) -> ExtractResult:
    return ExtractResult(kind="txt", text=_clean_text(path.read_text(encoding="utf-8", errors="ignore")))


def extract_file(path: Path) -> ExtractResult:
    suffix = path.suffix.lower().lstrip(".")
    if suffix == "pdf":
        return extract_pdf(path)
    if suffix in ("html", "htm"):
        return extract_html(path)
    return extract_txt(path)
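
A quick way to exercise these extractors over a docs folder; the path and the extension filter here are illustrative, since `extract_file` falls back to plain-text handling for anything that is not PDF or HTML.

```python
from pathlib import Path

from raglite.extract import extract_file

for path in Path("/path/to/docs").rglob("*"):
    if path.is_file() and path.suffix.lower() in {".pdf", ".html", ".htm", ".txt", ".md"}:
        result = extract_file(path)
        print(f"{path}  kind={result.kind}  {len(result.text)} chars")
```
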
raglite_chromadb-1.0.1/raglite/prompts.py
@@ -0,0 +1,56 @@
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PromptPair:
    execution_notes_prompt: str
    tool_summary_prompt: str


def build_prompts(*, token_cap_hint: str = "~1200 tokens max") -> PromptPair:
    # These prompts are designed to be copy/pasted into Cosmo.
    execution_notes = f"""You are an expert at converting documentation into EXECUTION-RELEVANT notes for an AI agent that can run tools (CLI commands, HTTP calls, scripts, and functions).

OUTPUT FORMAT (Markdown):
- Title
- What this tool/service is
- When to use
- Inputs (required/optional)
- Outputs
- Preconditions / assumptions
- Step-by-step 'golden path' (numbered)
- Verification checks (how to confirm success)
- Common errors + fixes
- Safety/rollback notes (what not to do / how to undo)

RULES:
- Be concise and operational; no marketing.
- Prefer concrete commands, flags, endpoints, and example payloads.
- If the doc is long, extract only what is needed to execute.
- Keep the final output within {token_cap_hint}.

SOURCE DOCUMENT (extracted text) is below. Use it as the only source of truth.
---
"""

    tool_summary = """You are an expert at writing ultra-condensed TOOL INDEX summaries for an agent tool library.

Write a short Markdown file with:
- Tool name
- 1-sentence purpose
- Capabilities (3-7 bullets)
- Required environment/dependencies
- Primary entrypoints (commands/endpoints)
- Key limitations / footguns (1-3 bullets)

RULES:
- No fluff. Assume the reader is an executor agent.
- Keep within ~250-400 tokens.

SOURCE DOCUMENT (extracted text) is below. Use it as the only source of truth.
---
"""

    return PromptPair(execution_notes_prompt=execution_notes, tool_summary_prompt=tool_summary)
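
And a sketch of how these templates combine with extraction. The concatenation and the 180,000-character cap mirror the CLI's `--max-chars` default; the exact assembly in `raglite_cli.py` may differ, and the input path is a placeholder.

```python
from pathlib import Path

from raglite.extract import extract_file
from raglite.prompts import build_prompts

prompts = build_prompts(token_cap_hint="~1200 tokens max")
doc = extract_file(Path("/path/to/docs/runbook.pdf"))

# Prompt template + extracted text, capped like the CLI's --max-chars default.
request_text = prompts.execution_notes_prompt + doc.text[:180_000]
# request_text is what gets sent to the OpenClaw or Ollama engine.
```
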