raglite-chromadb 1.0.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,167 @@
1
+ Metadata-Version: 2.4
2
+ Name: raglite-chromadb
3
+ Version: 1.0.1
4
+ Summary: Local-first RAG-lite CLI: condense docs into structured Markdown, then index/query with Chroma + hybrid search
5
+ Author: Viraj Sanghvi
6
+ License: MIT
7
+ Keywords: rag,docs,chroma,ollama,openclaw,summarization,local-first
8
+ Requires-Python: >=3.11
9
+ Description-Content-Type: text/markdown
10
+ Requires-Dist: beautifulsoup4==4.12.3
11
+ Requires-Dist: lxml==5.3.0
12
+ Requires-Dist: pypdf==5.2.0
13
+
14
+ # RAGLite
15
+
16
+ <p align="center">
17
+ <img src="assets/hero.svg" alt="RAGLite: Compress first. Index second." width="900" />
18
+ </p>
19
+
20
+ RAGLite is a local-first CLI that turns a folder of docs (PDF/HTML/TXT/MD) into **structured, low-fluff Markdown**, and then makes it searchable with **Chroma** 🧠 + **ripgrep** 🔎.
21
+
22
+ Core idea: **compression-before-embeddings** ✂️➡️🧠
23
+
24
+ <p align="center">
25
+ <img src="assets/diagram.svg" alt="RAGLite workflow: condense, index, query" width="900" />
26
+ </p>
27
+
28
+ ## What you get
29
+
30
+ For each input file:
31
+ - `*.execution-notes.md` – practical run/operate notes (checks, failure modes, commands)
32
+ - `*.tool-summary.md` – compact index entry (purpose, capabilities, entrypoints, footguns)
33
+
34
+ Optionally:
35
+ - `raglite index` stores embeddings in **Chroma** 🧠 (one DB, many collections)
36
+ - `raglite query` runs **hybrid search** 🔎 (vector + keyword)
37
+
38
+ ## Why local + open-source?
39
+
40
+ If you want a private, local setup (no managed “fancy vector DB” required), RAGLite keeps everything on your machine:
41
+ - Distilled Markdown artifacts are plain files you can audit + version control
42
+ - Indexing uses **Chroma** (open-source, local) and keyword search uses **ripgrep**
43
+ - You can still swap in a hosted vector DB later if you outgrow local
44
+
45
+ ## Engines
46
+
47
+ RAGLite supports two backends:
48
+
49
+ - **OpenClaw (recommended):** uses your local OpenClaw Gateway `/v1/responses` endpoint for higher-quality, format-following condensation.
50
+ - **Ollama:** uses `POST /api/generate` for fully local inference (often less reliable at strict templates).
51
+
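+ For reference, the Ollama path boils down to one JSON POST per prompt. A minimal sketch of that call follows (illustrative only; the CLI drives this for you, and the model name and default Ollama port are assumptions, not raglite settings):
+
+ ```python
+ # Minimal sketch of the /api/generate call that --engine ollama relies on.
+ # Model name and local Ollama URL are examples, not raglite defaults.
+ import json
+ import urllib.request
+
+ req = urllib.request.Request(
+     "http://127.0.0.1:11434/api/generate",
+     data=json.dumps({"model": "llama3.2:3b", "prompt": "Summarize: ...", "stream": False}).encode("utf-8"),
+     headers={"Content-Type": "application/json"},
+ )
+ with urllib.request.urlopen(req, timeout=120) as resp:
+     print(json.loads(resp.read())["response"])
+ ```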
52
+ ## Prereqs
53
+
54
+ - **Python 3.11+**
55
+ - An LLM engine:
56
+ - **OpenClaw** (recommended) 🦞, or
57
+ - **Ollama** 🦙
58
+ - For search:
59
+ - **Chroma** (open-source, local) 🧠 at `http://127.0.0.1:8100` (see the connectivity check below)
60
+
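+ A quick way to confirm the Chroma server is reachable before indexing is to list its collections with the HTTP helpers bundled in this package. A small sketch, assuming those helpers are importable as `raglite.chroma` (adjust the import path to wherever the module lives in your install):
+
+ ```python
+ # Connectivity check: list collections on the local Chroma server.
+ # Assumes the package's Chroma helpers are importable as raglite.chroma.
+ from raglite.chroma import ChromaLoc, list_collections
+
+ loc = ChromaLoc(base_url="http://127.0.0.1:8100")
+ print(f"Chroma is up; {len(list_collections(loc))} collection(s) found.")
+ ```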
61
+ ## Install
62
+
63
+ ```bash
64
+ # from repo root
65
+ python3 -m venv .venv
66
+ source .venv/bin/activate
67
+ pip install -e .
68
+ ```
69
+
70
+ ## Quickstart (60s)
71
+
72
+ ```bash
73
+ # 0) Setup
74
+ cd ~/Projects/raglite
75
+ source .venv/bin/activate
76
+
77
+ # 1) Condense → Index (one command)
78
+ raglite run /path/to/docs \
79
+ --out ./raglite_out \
80
+ --engine ollama --ollama-model llama3.2:3b \
81
+ --collection my-docs \
82
+ --chroma-url http://127.0.0.1:8100 \
83
+ --skip-indexed
84
+
85
+ # 2) Query
86
+ raglite query ./raglite_out \
87
+ --collection my-docs \
88
+ "rollback procedure"
89
+ ```
90
+
91
+ ## Usage
92
+
93
+ ### 1) Distill docs ✍️
94
+
95
+ ```bash
96
+ raglite condense /path/to/docs \
97
+ --out ./raglite_out \
98
+ --engine openclaw
99
+ ```
100
+
101
+ (Or fully local: `--engine ollama --ollama-model llama3.2:3b`.)
102
+
103
+ ### 2) Index distilled output (Chroma)
104
+
105
+ ```bash
106
+ raglite index ./raglite_out \
107
+ --collection my-docs \
108
+ --chroma-url http://127.0.0.1:8100
109
+ ```
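+ Under the hood, indexing amounts to: chunk the distilled Markdown, embed each chunk, and upsert it into a Chroma collection. Below is a minimal sketch of one such round trip using the Chroma helpers shipped in this package plus Ollama embeddings; the `raglite.chroma` import path, the embedding model, and the single-chunk flow are assumptions, not the CLI's exact pipeline:
+
+ ```python
+ # Sketch: embed one distilled file with Ollama and upsert it into Chroma.
+ # Import path, embedding model, and id scheme are illustrative assumptions.
+ import json
+ import urllib.request
+ from pathlib import Path
+
+ from raglite.chroma import ChromaLoc, get_or_create_collection, upsert
+
+ def embed(text: str) -> list[float]:
+     req = urllib.request.Request(
+         "http://127.0.0.1:11434/api/embeddings",
+         data=json.dumps({"model": "nomic-embed-text", "prompt": text}).encode("utf-8"),
+         headers={"Content-Type": "application/json"},
+     )
+     with urllib.request.urlopen(req, timeout=120) as resp:
+         return json.loads(resp.read())["embedding"]
+
+ loc = ChromaLoc(base_url="http://127.0.0.1:8100")
+ collection = get_or_create_collection(loc, "my-docs")
+ text = Path("raglite_out/some/subdir/file.tool-summary.md").read_text(encoding="utf-8")
+ upsert(loc, collection["id"], ids=["file.tool-summary.md#0"], documents=[text], embeddings=[embed(text)])
+ ```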
110
+
111
+ ### 3) Query (hybrid)
112
+
113
+ ```bash
114
+ raglite query ./raglite_out \
115
+ --collection my-docs \
116
+ --top-k 5 \
117
+ --keyword-top-k 5 \
118
+ "rollback procedure"
119
+ ```
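+ Conceptually, the hybrid step merges two ranked lists: vector hits from Chroma (`--top-k`) and keyword hits from ripgrep (`--keyword-top-k`). A rough sketch of that merge is below; the `rg` invocation, dedup order, and weighting are illustrative assumptions rather than the CLI's exact scoring:
+
+ ```python
+ # Rough sketch of hybrid retrieval: vector hits first, then ripgrep keyword
+ # hits, deduplicated by path. Ordering/weighting here is illustrative.
+ import subprocess
+
+ def keyword_hits(query: str, root: str, limit: int = 5) -> list[str]:
+     # ripgrep: files containing the query, case-insensitive
+     out = subprocess.run(
+         ["rg", "--files-with-matches", "-i", query, root],
+         capture_output=True, text=True,
+     )
+     return out.stdout.splitlines()[:limit]
+
+ def hybrid(vector_paths: list[str], query: str, root: str) -> list[str]:
+     merged: list[str] = []
+     for path in vector_paths + keyword_hits(query, root):
+         if path not in merged:
+             merged.append(path)
+     return merged
+ ```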
120
+
121
+ ### Useful flags
122
+
123
+ - `--skip-existing` : don’t redo files that already have both outputs
124
+ - `--skip-indexed` : don’t re-embed chunks that are already indexed
125
+ - `--nodes` : write per-section nodes + per-doc/root indices
126
+ - `--node-max-chars 1200` : keep nodes embed-friendly (see the chunking sketch below)
127
+ - `--sleep-ms 200` : throttle between files (helps avoid timeouts)
128
+ - `--max-chars 180000` : cap extracted text per file before summarizing
129
+
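+ For intuition on the size caps, one simple way to keep nodes under a character budget is to pack paragraphs greedily, as in the sketch below (purely illustrative; raglite's actual node splitting may differ):
+
+ ```python
+ # Illustrative greedy packer: group blank-line-separated paragraphs into
+ # chunks of at most max_chars characters (cf. --node-max-chars).
+ def pack_nodes(markdown: str, max_chars: int = 1200) -> list[str]:
+     nodes: list[str] = []
+     current = ""
+     for para in markdown.split("\n\n"):
+         candidate = f"{current}\n\n{para}".strip() if current else para
+         if len(candidate) <= max_chars:
+             current = candidate
+         else:
+             if current:
+                 nodes.append(current)
+             current = para[:max_chars]  # oversized paragraph: hard-truncate
+     if current:
+         nodes.append(current)
+     return nodes
+ ```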
130
+ ## Output layout
131
+
132
+ RAGLite preserves folder structure under your `--out` dir:
133
+
134
+ ```text
135
+ <out>/
136
+ some/subdir/file.execution-notes.md
137
+ some/subdir/file.tool-summary.md
138
+ ```
139
+
140
+ (Default output folder is `./raglite_out`.)
141
+
142
+ ## Notes / gotchas
143
+
144
+ - PDF extraction is best-effort: scanned PDFs without an embedded text layer yield little or no usable text.
145
+ - If you use `--engine openclaw`, pass `--gateway-token` or set `OPENCLAW_GATEWAY_TOKEN`.
146
+ - Indexing defaults to high-signal artifacts (nodes/summaries/notes) and skips `*.outline.md` unless you opt in.
147
+
148
+ ## Roadmap
149
+
150
+ ### Current (implemented)
151
+ - `condense` – condense/summarize documents into Markdown artifacts
152
+ - `index` – chunk + embed + store in **Chroma** collections
153
+ - `query` – retrieve relevant chunks (vector + keyword)
154
+ - `run` – one-command pipeline (condense → index)
155
+ - Outline + nodes + indices: `--outline`, `--nodes`, root `index.md` + per-doc `*.index.md`
156
+
157
+ ### Next (near-term)
158
+ - Detect deletions (prune removed chunks from Chroma)
159
+ - Batch upserts to Chroma for speed
160
+ - Better query output formatting (snippets + anchors)
161
+ - `raglite doctor` (dependency checks)
162
+
163
+ (Full: [ROADMAP.md](ROADMAP.md))
164
+
165
+ ---
166
+
167
+ Built to turn “docs” into **usable, searchable tool knowledge**.
@@ -0,0 +1,154 @@
1
+ # RAGLite
2
+
3
+ <p align="center">
4
+ <img src="assets/hero.svg" alt="RAGLite: Compress first. Index second." width="900" />
5
+ </p>
6
+
7
+ RAGLite is a local-first CLI that turns a folder of docs (PDF/HTML/TXT/MD) into **structured, low-fluff Markdown**, and then makes it searchable with **Chroma** 🧠 + **ripgrep** 🔎.
8
+
9
+ Core idea: **compression-before-embeddings** ✂️➡️🧠
10
+
11
+ <p align="center">
12
+ <img src="assets/diagram.svg" alt="RAGLite workflow: condense, index, query" width="900" />
13
+ </p>
14
+
15
+ ## What you get
16
+
17
+ For each input file:
18
+ - `*.execution-notes.md` – practical run/operate notes (checks, failure modes, commands)
19
+ - `*.tool-summary.md` – compact index entry (purpose, capabilities, entrypoints, footguns)
20
+
21
+ Optionally:
22
+ - `raglite index` stores embeddings in **Chroma** 🧠 (one DB, many collections)
23
+ - `raglite query` runs **hybrid search** 🔎 (vector + keyword)
24
+
25
+ ## Why local + open-source?
26
+
27
+ If you want a private, local setup (no managed “fancy vector DB” required), RAGLite keeps everything on your machine:
28
+ - Distilled Markdown artifacts are plain files you can audit + version control
29
+ - Indexing uses **Chroma** (open-source, local) and keyword search uses **ripgrep**
30
+ - You can still swap in a hosted vector DB later if you outgrow local
31
+
32
+ ## Engines
33
+
34
+ RAGLite supports two backends:
35
+
36
+ - **OpenClaw (recommended):** uses your local OpenClaw Gateway `/v1/responses` endpoint for higher-quality, format-following condensation.
37
+ - **Ollama:** uses `POST /api/generate` for fully local inference (often less reliable at strict templates).
38
+
39
+ ## Prereqs
40
+
41
+ - **Python 3.11+**
42
+ - An LLM engine:
43
+ - **OpenClaw** (recommended) 🦞, or
44
+ - **Ollama** 🦙
45
+ - For search:
46
+ - **Chroma** (open-source, local) 🧠 at `http://127.0.0.1:8100`
47
+
48
+ ## Install
49
+
50
+ ```bash
51
+ # from repo root
52
+ python3 -m venv .venv
53
+ source .venv/bin/activate
54
+ pip install -e .
55
+ ```
56
+
57
+ ## Quickstart (60s)
58
+
59
+ ```bash
60
+ # 0) Setup
61
+ cd ~/Projects/raglite
62
+ source .venv/bin/activate
63
+
64
+ # 1) Condense → Index (one command)
65
+ raglite run /path/to/docs \
66
+ --out ./raglite_out \
67
+ --engine ollama --ollama-model llama3.2:3b \
68
+ --collection my-docs \
69
+ --chroma-url http://127.0.0.1:8100 \
70
+ --skip-indexed
71
+
72
+ # 2) Query
73
+ raglite query ./raglite_out \
74
+ --collection my-docs \
75
+ "rollback procedure"
76
+ ```
77
+
78
+ ## Usage
79
+
80
+ ### 1) Distill docs ✍️
81
+
82
+ ```bash
83
+ raglite condense /path/to/docs \
84
+ --out ./raglite_out \
85
+ --engine openclaw
86
+ ```
87
+
88
+ (Or fully local: `--engine ollama --ollama-model llama3.2:3b`.)
89
+
90
+ ### 2) Index distilled output (Chroma)
91
+
92
+ ```bash
93
+ raglite index ./raglite_out \
94
+ --collection my-docs \
95
+ --chroma-url http://127.0.0.1:8100
96
+ ```
97
+
98
+ ### 3) Query (hybrid)
99
+
100
+ ```bash
101
+ raglite query ./raglite_out \
102
+ --collection my-docs \
103
+ --top-k 5 \
104
+ --keyword-top-k 5 \
105
+ "rollback procedure"
106
+ ```
107
+
108
+ ### Useful flags
109
+
110
+ - `--skip-existing` : don’t redo files that already have both outputs
111
+ - `--skip-indexed` : don’t re-embed chunks that are already indexed
112
+ - `--nodes` : write per-section nodes + per-doc/root indices
113
+ - `--node-max-chars 1200` : keep nodes embed-friendly
114
+ - `--sleep-ms 200` : throttle between files (helps avoid timeouts)
115
+ - `--max-chars 180000` : cap extracted text per file before summarizing
116
+
117
+ ## Output layout
118
+
119
+ RAGLite preserves folder structure under your `--out` dir:
120
+
121
+ ```text
122
+ <out>/
123
+ some/subdir/file.execution-notes.md
124
+ some/subdir/file.tool-summary.md
125
+ ```
126
+
127
+ (Default output folder is `./raglite_out`.)
128
+
129
+ ## Notes / gotchas
130
+
131
+ - PDF extraction is best-effort: scanned PDFs without an embedded text layer yield little or no usable text.
132
+ - If you use `--engine openclaw`, pass `--gateway-token` or set `OPENCLAW_GATEWAY_TOKEN`.
133
+ - Indexing defaults to high-signal artifacts (nodes/summaries/notes) and skips `*.outline.md` unless you opt in.
134
+
135
+ ## Roadmap
136
+
137
+ ### Current (implemented)
138
+ - `condense` – condense/summarize documents into Markdown artifacts
139
+ - `index` – chunk + embed + store in **Chroma** collections
140
+ - `query` – retrieve relevant chunks (vector + keyword)
141
+ - `run` – one-command pipeline (condense → index)
142
+ - Outline + nodes + indices: `--outline`, `--nodes`, root `index.md` + per-doc `*.index.md`
143
+
144
+ ### Next (near-term)
145
+ - Detect deletions (prune removed chunks from Chroma)
146
+ - Batch upserts to Chroma for speed
147
+ - Better query output formatting (snippets + anchors)
148
+ - `raglite doctor` (dependency checks)
149
+
150
+ (Full: [ROADMAP.md](ROADMAP.md))
151
+
152
+ ---
153
+
154
+ Built to turn “docs” into **usable, searchable tool knowledge**.
@@ -0,0 +1,28 @@
1
+ [build-system]
2
+ requires = ["setuptools>=68", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "raglite-chromadb"
7
+ version = "1.0.1"
8
+ description = "Local-first RAG-lite CLI: condense docs into structured Markdown, then index/query with Chroma + hybrid search"
9
+ readme = "README.md"
10
+ requires-python = ">=3.11"
11
+ authors = [{ name = "Viraj Sanghvi" }]
12
+ license = { text = "MIT" }
13
+ keywords = ["rag", "docs", "chroma", "ollama", "openclaw", "summarization", "local-first"]
14
+ dependencies = [
15
+ "beautifulsoup4==4.12.3",
16
+ "lxml==5.3.0",
17
+ "pypdf==5.2.0",
18
+ ]
19
+
20
+ [project.scripts]
21
+ raglite = "raglite.raglite_cli:cli"
22
+
23
+ [tool.setuptools]
24
+ package-dir = {"" = "."}
25
+
26
+ [tool.setuptools.packages.find]
27
+ where = ["."]
28
+ include = ["raglite*"]
@@ -0,0 +1 @@
1
+ __all__ = []
@@ -0,0 +1,111 @@
1
+ from __future__ import annotations
2
+
3
+ import json
4
+ import urllib.error
5
+ import urllib.request
6
+ from dataclasses import dataclass
7
+
8
+
9
+ @dataclass
10
+ class ChromaLoc:
11
+ base_url: str = "http://127.0.0.1:8100"
12
+ tenant: str = "default_tenant"
13
+ database: str = "default_database"
14
+
15
+ def collections_url(self) -> str:
16
+ return f"{self.base_url}/api/v2/tenants/{self.tenant}/databases/{self.database}/collections"
17
+
18
+
19
+ def _req_json(method: str, url: str, body: dict | None = None, timeout: int = 120) -> dict | list:
20
+ data = None if body is None else json.dumps(body).encode("utf-8")
21
+ req = urllib.request.Request(url, data=data, method=method)
22
+ req.add_header("Content-Type", "application/json")
23
+ try:
24
+ with urllib.request.urlopen(req, timeout=timeout) as resp:
25
+ raw = resp.read().decode("utf-8")
26
+ if not raw:
27
+ return {}
28
+ return json.loads(raw)
29
+ except urllib.error.HTTPError as e: # type: ignore[attr-defined]
30
+ detail = ""
31
+ try:
32
+ detail = e.read().decode("utf-8", errors="ignore")
33
+ except Exception:
34
+ detail = ""
35
+ raise RuntimeError(f"Chroma HTTP {e.code} calling {url}: {detail[:500]}")
36
+
37
+
38
+ def list_collections(loc: ChromaLoc) -> list[dict]:
39
+ res = _req_json("GET", loc.collections_url())
40
+ if not isinstance(res, list):
41
+ raise RuntimeError(f"Expected list from Chroma list_collections, got {type(res).__name__}")
42
+ return res
43
+
44
+
45
+ def get_or_create_collection(loc: ChromaLoc, name: str, *, space: str = "cosine") -> dict:
46
+ cols = list_collections(loc)
47
+ for c in cols:
48
+ if c.get("name") == name:
49
+ return c
50
+
51
+ created = _req_json(
52
+ "POST",
53
+ loc.collections_url(),
54
+ {"name": name, "metadata": {"hnsw:space": space}},
55
+ )
56
+ if not isinstance(created, dict):
57
+ raise RuntimeError(f"Expected dict from Chroma create_collection, got {type(created).__name__}")
58
+ return created
59
+
60
+
61
+ def add(
62
+ loc: ChromaLoc,
63
+ collection_id: str,
64
+ *,
65
+ ids: list[str],
66
+ documents: list[str],
67
+ embeddings: list[list[float]],
68
+ metadatas: list[dict] | None = None,
69
+ ) -> None:
70
+ url = f"{loc.collections_url()}/{collection_id}/add"
71
+ body: dict = {"ids": ids, "documents": documents, "embeddings": embeddings}
72
+ if metadatas is not None:
73
+ body["metadatas"] = metadatas
74
+ _req_json("POST", url, body, timeout=600)
75
+
76
+
77
+ def upsert(
78
+ loc: ChromaLoc,
79
+ collection_id: str,
80
+ *,
81
+ ids: list[str],
82
+ documents: list[str],
83
+ embeddings: list[list[float]],
84
+ metadatas: list[dict] | None = None,
85
+ ) -> None:
86
+ """Upsert records into a collection.
87
+
88
+ Chroma's /upsert updates existing ids and inserts new ones.
89
+ """
90
+ url = f"{loc.collections_url()}/{collection_id}/upsert"
91
+ body: dict = {"ids": ids, "documents": documents, "embeddings": embeddings}
92
+ if metadatas is not None:
93
+ body["metadatas"] = metadatas
94
+ _req_json("POST", url, body, timeout=600)
95
+
96
+
97
+ def query(
98
+ loc: ChromaLoc,
99
+ collection_id: str,
100
+ *,
101
+ query_embeddings: list[list[float]],
102
+ n_results: int = 10,
103
+ include: list[str] | None = None,
104
+ ) -> dict:
105
+ url = f"{loc.collections_url()}/{collection_id}/query"
106
+ body: dict = {"query_embeddings": query_embeddings, "n_results": n_results}
107
+ if include is not None:
108
+ body["include"] = include
109
+ res = _req_json("POST", url, body, timeout=600)
110
+ if not isinstance(res, dict):
+ raise RuntimeError(f"Expected dict from Chroma query, got {type(res).__name__}")
111
+ return res
@@ -0,0 +1,63 @@
1
+ from __future__ import annotations
2
+
3
+ import re
4
+ from dataclasses import dataclass
5
+ from pathlib import Path
6
+ from typing import Literal
7
+
8
+ from bs4 import BeautifulSoup
9
+ from pypdf import PdfReader
10
+
11
+
12
+ FileKind = Literal["pdf", "html", "htm", "txt"]
13
+
14
+
15
+ @dataclass
16
+ class ExtractResult:
17
+ kind: FileKind
18
+ text: str
19
+
20
+
21
+ def _clean_text(s: str) -> str:
22
+ s = s.replace("\r\n", "\n").replace("\r", "\n")
23
+ s = re.sub(r"[ \t]+", " ", s)
24
+ s = re.sub(r"\n{3,}", "\n\n", s)
25
+ return s.strip()
26
+
27
+
28
+ def extract_pdf(path: Path) -> ExtractResult:
29
+ reader = PdfReader(str(path))
30
+ parts = []
31
+ for i, page in enumerate(reader.pages):
32
+ try:
33
+ txt = page.extract_text() or ""
34
+ except Exception:
35
+ txt = ""
36
+ if txt.strip():
37
+ parts.append(f"\n\n--- Page {i+1} ---\n\n{txt}")
38
+ return ExtractResult(kind="pdf", text=_clean_text("\n".join(parts)))
39
+
40
+
41
+ def extract_html(path: Path) -> ExtractResult:
42
+ html = path.read_text(encoding="utf-8", errors="ignore")
43
+ soup = BeautifulSoup(html, "lxml")
44
+
45
+ # Remove scripts/styles
46
+ for tag in soup(["script", "style", "noscript"]):
47
+ tag.decompose()
48
+
49
+ text = soup.get_text("\n")
50
+ return ExtractResult(kind="html", text=_clean_text(text))
51
+
52
+
53
+ def extract_txt(path: Path) -> ExtractResult:
54
+ return ExtractResult(kind="txt", text=_clean_text(path.read_text(encoding="utf-8", errors="ignore")))
55
+
56
+
57
+ def extract_file(path: Path) -> ExtractResult:
58
+ suffix = path.suffix.lower().lstrip(".")
59
+ if suffix == "pdf":
60
+ return extract_pdf(path)
61
+ if suffix in ("html", "htm"):
62
+ return extract_html(path)
63
+ return extract_txt(path)
@@ -0,0 +1,56 @@
1
+ from __future__ import annotations
2
+
3
+ from dataclasses import dataclass
4
+
5
+
6
+ @dataclass
7
+ class PromptPair:
8
+ execution_notes_prompt: str
9
+ tool_summary_prompt: str
10
+
11
+
12
+ def build_prompts(*, token_cap_hint: str = "~1200 tokens max") -> PromptPair:
13
+ # These prompts are designed to be copy/pasted into Cosmo.
14
+ execution_notes = f"""You are an expert at converting documentation into EXECUTION-RELEVANT notes for an AI agent that can run tools (CLI commands, HTTP calls, scripts, and functions).
15
+
16
+ OUTPUT FORMAT (Markdown):
17
+ - Title
18
+ - What this tool/service is
19
+ - When to use
20
+ - Inputs (required/optional)
21
+ - Outputs
22
+ - Preconditions / assumptions
23
+ - Step-by-step 'golden path' (numbered)
24
+ - Verification checks (how to confirm success)
25
+ - Common errors + fixes
26
+ - Safety/rollback notes (what not to do / how to undo)
27
+
28
+ RULES:
29
+ - Be concise and operational; no marketing.
30
+ - Prefer concrete commands, flags, endpoints, and example payloads.
31
+ - If the doc is long, extract only what is needed to execute.
32
+ - Keep the final output within {token_cap_hint}.
33
+
34
+ SOURCE DOCUMENT (extracted text) is below. Use it as the only source of truth.
35
+ ---
36
+ """
37
+
38
+ tool_summary = f"""You are an expert at writing ultra-condensed TOOL INDEX summaries for an agent tool library.
39
+
40
+ Write a short Markdown file with:
41
+ - Tool name
42
+ - 1-sentence purpose
43
+ - Capabilities (3-7 bullets)
44
+ - Required environment/dependencies
45
+ - Primary entrypoints (commands/endpoints)
46
+ - Key limitations / footguns (1-3 bullets)
47
+
48
+ RULES:
49
+ - No fluff. Assume the reader is an executor agent.
50
+ - Keep within ~250-400 tokens.
51
+
52
+ SOURCE DOCUMENT (extracted text) is below. Use it as the only source of truth.
53
+ ---
54
+ """
55
+
56
+ return PromptPair(execution_notes_prompt=execution_notes, tool_summary_prompt=tool_summary)