knowledge-rag 3.5.2__tar.gz → 3.6.0__tar.gz

@@ -1,59 +1,59 @@
1
- # Runtime
2
- venv/
3
- __pycache__/
4
- *.pyc
5
- data/
6
-
7
- # User config (personal settings — use presets/ as starting point)
8
- config.yaml
9
-
10
- # Personal documents (NEVER commit — user-populated content)
11
- documents/aar/
12
- documents/security/
13
- documents/logscale/
14
- documents/general/
15
- documents/ctf/
16
- documents/development/
17
- documents/documents/
18
-
19
- # Sensitive file formats (extra safety layer)
20
- documents/**/*.pdf
21
- documents/**/*.docx
22
- documents/**/*.xlsx
23
- documents/**/*.pptx
24
- documents/**/*.csv
25
-
26
- # Keep example docs
27
- !documents/examples/
28
- !documents/examples/**
29
-
30
- # FastEmbed model cache
31
- .cache/
32
- models_cache/
33
- *.onnx
34
-
35
- # Local scripts (not part of distribution)
36
- scripts/
37
- setup-notebook.ps1
38
- demo-real.yml
39
- documents/.sync-log.txt
40
- documents/README-CATEGORIES.md
41
-
42
- # Temp/junk
43
- *.b64
44
- *.tar.gz
45
- *.bak
46
-
47
- # OS files
48
- .DS_Store
49
- Thumbs.db
50
- desktop.ini
51
-
52
- # IDE
53
- .vscode/
54
- .idea/
55
- *.swp
56
- *.swo
57
- dist/
58
- .ruff_cache/
59
- .pytest_cache/
1
+ # Runtime
2
+ venv/
3
+ __pycache__/
4
+ *.pyc
5
+ data/
6
+
7
+ # User config (personal settings — use presets/ as starting point)
8
+ config.yaml
9
+
10
+ # Personal documents (NEVER commit — user-populated content)
11
+ documents/aar/
12
+ documents/security/
13
+ documents/logscale/
14
+ documents/general/
15
+ documents/ctf/
16
+ documents/development/
17
+ documents/documents/
18
+
19
+ # Sensitive file formats (extra safety layer)
20
+ documents/**/*.pdf
21
+ documents/**/*.docx
22
+ documents/**/*.xlsx
23
+ documents/**/*.pptx
24
+ documents/**/*.csv
25
+
26
+ # Keep example docs
27
+ !documents/examples/
28
+ !documents/examples/**
29
+
30
+ # FastEmbed model cache
31
+ .cache/
32
+ models_cache/
33
+ *.onnx
34
+
35
+ # Local scripts (not part of distribution)
36
+ scripts/
37
+ setup-notebook.ps1
38
+ demo-real.yml
39
+ documents/.sync-log.txt
40
+ documents/README-CATEGORIES.md
41
+
42
+ # Temp/junk
43
+ *.b64
44
+ *.tar.gz
45
+ *.bak
46
+
47
+ # OS files
48
+ .DS_Store
49
+ Thumbs.db
50
+ desktop.ini
51
+
52
+ # IDE
53
+ .vscode/
54
+ .idea/
55
+ *.swp
56
+ *.swo
57
+ dist/
58
+ .ruff_cache/
59
+ .pytest_cache/
@@ -1,21 +1,21 @@
1
- MIT License
2
-
3
- Copyright (c) 2025 Ailton Rocha (Lyon)
4
-
5
- Permission is hereby granted, free of charge, to any person obtaining a copy
6
- of this software and associated documentation files (the "Software"), to deal
7
- in the Software without restriction, including without limitation the rights
8
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
- copies of the Software, and to permit persons to whom the Software is
10
- furnished to do so, subject to the following conditions:
11
-
12
- The above copyright notice and this permission notice shall be included in all
13
- copies or substantial portions of the Software.
14
-
15
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
- SOFTWARE.
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Ailton Rocha (Lyon)
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -1,7 +1,7 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: knowledge-rag
3
- Version: 3.5.2
4
- Summary: Local RAG System for Claude Code — Hybrid search + Cross-encoder Reranking + 12 MCP Tools. Zero external servers.
3
+ Version: 3.6.0
4
+ Summary: Local RAG System for Claude Code — Hybrid search + Cross-encoder Reranking + 12 MCP Tools + 20 Format Parsers. Zero external servers.
5
5
  Project-URL: Homepage, https://github.com/lyonzin/knowledge-rag
6
6
  Project-URL: Repository, https://github.com/lyonzin/knowledge-rag
7
7
  Project-URL: Issues, https://github.com/lyonzin/knowledge-rag/issues
@@ -50,19 +50,19 @@ Description-Content-Type: text/markdown
50
50
  [![Glama Score](https://glama.ai/mcp/servers/lyonzin/knowledge-rag/badges/score.svg)](https://glama.ai/mcp/servers/lyonzin/knowledge-rag)
51
51
  [![PyPI](https://img.shields.io/pypi/v/knowledge-rag)](https://pypi.org/project/knowledge-rag/)
52
52
 
53
- ### LLMs don't know your docs. Every conversation starts from zero.
53
+ ### Your docs, your machine, zero cloud. Claude Code searches them natively.
54
54
 
55
- Your notes, writeups, internal procedures, PDFs: none of it exists to your AI assistant.
56
- Cloud RAG solutions leak your private data. Local ones require Docker, Ollama, and 15 minutes of setup before a single query.
55
+ Drop your PDFs, markdown, code, notebooks: **1800+ files, 39K chunks, indexed in under 3 minutes.**<br/>
56
+ Hybrid search (BM25 + semantic vectors + cross-encoder reranking) through 12 MCP tools.<br/>
57
+ Everything runs locally via ONNX. No Docker, no Ollama, no API keys, no data leaves your machine.
57
58
 
58
- **Knowledge RAG fixes this.** One `pip install`, zero external servers.
59
- Your documents become instantly searchable inside Claude Code with reranking precision that actually finds what you need.
60
-
61
- `pip install → restart Claude Code → done.`
59
+ ```
60
+ pip install knowledge-rag → restart Claude Code → search_knowledge("your query")
61
+ ```
62
62
 
63
63
  ---
64
64
 
65
- **12 MCP Tools** | **Hybrid Search + Cross-Encoder Reranking** | **Markdown-Aware Chunking** | **100% Local, Zero Cloud**
65
+ **12 MCP Tools** | **Hybrid Search + Reranking** | **20 File Formats** | **Optional NVIDIA GPU** | **100% Local**
66
66
 
67
67
  [What's New](#whats-new-in-v352) | [Supported Formats](#supported-formats) | [Installation](#installation) | [Configuration](#configuration) | [API Reference](#api-reference) | [Architecture](#architecture)
68
68
 
@@ -125,6 +125,14 @@ See [Changelog](#changelog) for full history.
125
125
  | Excel | `.xlsx` | openpyxl | Yes | Sheet-by-sheet extraction |
126
126
  | PowerPoint | `.pptx` | python-pptx | Yes | Slide-by-slide extraction |
127
127
  | Jupyter Notebook | `.ipynb` | Cell-aware parser | Yes | Markdown + code cells only, no outputs/base64 |
128
+ | C Source | `.c` | Code-aware parser | Yes | Functions/structs/includes extracted |
129
+ | C/C++ Header | `.h` | Code-aware parser | Yes | Function declarations/structs extracted |
130
+ | C++ Source | `.cpp` | Code-aware parser | Yes | Classes/structs/includes extracted |
131
+ | JavaScript | `.js` | Code-aware parser | Yes | Functions/classes/imports (ESM + CJS) |
132
+ | React JSX | `.jsx` | Code-aware parser | Yes | Same as JS parser |
133
+ | TypeScript | `.ts` | Code-aware parser | Yes | Functions/classes/interfaces/enums/imports |
134
+ | React TSX | `.tsx` | Code-aware parser | Yes | Same as TS parser |
135
+ | XML | `.xml` | XML parser | Yes | Root element and namespace extraction |
128
136
  | MQL4 Header | `.mqh` | Code parser | No | MetaTrader — add to `supported_formats` to enable |
129
137
  | MQL4 Source | `.mq4` | Code parser | No | MetaTrader — add to `supported_formats` to enable |
130
138
 
@@ -144,7 +152,7 @@ See [Changelog](#changelog) for full history.
144
152
  | **Markdown-Aware Chunking** | `.md` files split by `##`/`###` sections instead of fixed windows |
145
153
  | **In-Process Embeddings** | FastEmbed ONNX Runtime (BAAI/bge-small-en-v1.5, 384D) |
146
154
  | **Keyword Routing** | Word-boundary aware routing for domain-specific queries |
147
- | **12 Format Parsers** | MD, TXT, PDF, PY, JSON, CSV, DOCX, XLSX, PPTX, IPYNB + opt-in MQH/MQ4 |
155
+ | **20 Format Parsers** | MD, TXT, PDF, PY, C, H, CPP, JS, JSX, TS, TSX, JSON, XML, CSV, DOCX, XLSX, PPTX, IPYNB + opt-in MQH/MQ4 |
148
156
  | **Category Organization** | Organize docs by folder, auto-tagged by path |
149
157
  | **Incremental Indexing** | Change detection via mtime/size — only re-indexes modified files |
150
158
  | **Chunk Deduplication** | SHA256 content hashing prevents duplicate chunks |
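The "Chunk Deduplication" row above mentions SHA256 content hashing. A minimal sketch of that idea follows, assuming chunks are plain strings and that only exact duplicates should be dropped. The function name `dedupe_chunks` is hypothetical and is not taken from the package.

```python
import hashlib


def dedupe_chunks(chunks):
    """Drop chunks whose text is byte-for-byte identical to an earlier one.

    Hypothetical helper for illustration; the package's real logic may differ.
    """
    seen = set()      # SHA256 hex digests of chunks already kept
    unique = []
    for text in chunks:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Hashing keeps the "seen" set small and fixed-width per entry, at the cost of only catching exact duplicates; near-duplicates with different whitespace would still pass through.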
@@ -202,7 +210,7 @@ flowchart TB
202
210
  end
203
211
 
204
212
  subgraph INGEST["DOCUMENT INGESTION"]
205
- PARSERS["12 Parsers<br/>MD | PDF | TXT | PY | JSON | CSV<br/>DOCX | XLSX | PPTX | IPYNB | MQH | MQ4"]
213
+ PARSERS["20 Parsers<br/>MD | PDF | TXT | PY | C | H | CPP | JS | JSX | TS | TSX | JSON | XML | CSV<br/>DOCX | XLSX | PPTX | IPYNB | MQH | MQ4"]
206
214
  CHUNKER["Chunking<br/>MD: section-aware<br/>Other: 1000 chars + 200 overlap"]
207
215
  PARSERS --> CHUNKER
208
216
  end
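The CHUNKER node above lists "1000 chars + 200 overlap" for non-Markdown files. One way to read that is a sliding character window, sketched below. The function name `chunk_text` and the exact boundary handling are assumptions, not the package's implementation.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into windows of `size` chars that share `overlap` chars.

    Illustrative sketch only; parameter names are hypothetical.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap          # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break                  # last window already reached the end
    return chunks
```

With the defaults, each window starts 800 characters after the previous one, so the final 200 characters of one chunk reappear at the start of the next.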
@@ -268,11 +276,11 @@ flowchart LR
268
276
  FILES["documents/<br/>├── security/<br/>├── development/<br/>├── ctf/<br/>└── general/"]
269
277
  end
270
278
 
271
- subgraph PARSE["Parse (12 formats)"]
279
+ subgraph PARSE["Parse (20 formats)"]
272
280
  MD["Markdown"]
273
281
  PDF["PDF<br/>(PyMuPDF)"]
274
282
  OFFICE["DOCX | XLSX<br/>PPTX | CSV"]
275
- CODE["PY | JSON<br/>IPYNB"]
283
+ CODE["PY | C | H | CPP | JS | JSX<br/>TS | TSX | JSON | XML | IPYNB"]
276
284
  end
277
285
 
278
286
  subgraph CHUNK["Chunk"]
@@ -933,7 +941,7 @@ knowledge-rag/
933
941
  ├── mcp_server/
934
942
  │ ├── __init__.py # Stdout protection + version
935
943
  │ ├── config.py # YAML config loader + defaults
936
- │ ├── ingestion.py # 12 parsers, chunking, metadata extraction
944
+ │ ├── ingestion.py # 20 parsers, chunking, metadata extraction
937
945
  │ └── server.py # MCP server, ChromaDB, BM25, reranker, 12 tools
938
946
  ├── config.example.yaml # Documented config template (copy to config.yaml)
939
947
  ├── config.yaml # Your active configuration (git-ignored)
@@ -1024,6 +1032,15 @@ With ~200 documents, expect ~300-500MB RAM. The embedding model (~50MB) and rera
1024
1032
 
1025
1033
  ## Changelog
1026
1034
 
1035
+ ### v3.6.0 (2026-04-23)
1036
+
1037
+ - **NEW**: Multi-language code parsing — C (`.c`), C++ (`.cpp`/`.h`), JavaScript (`.js`/`.jsx`), TypeScript (`.ts`/`.tsx`) with per-language function/class/import extraction
1038
+ - **NEW**: XML parser (`.xml`) — root element and namespace metadata extraction
1039
+ - **NEW**: All 8 new formats default enabled — no config change needed
1040
+ - **NEW**: NPM wrapper (`npx knowledge-rag`) + Docker image (`ghcr.io/lyonzin/knowledge-rag`)
1041
+ - **NEW**: Automated release pipeline — PyPI (Trusted Publishing), NPM, Docker GHCR
1042
+ - **IMPROVED**: Code parser reports correct `language` metadata per file type (was hardcoded to `"python"` for all code files)
1043
+
1027
1044
  ### v3.5.2 (2026-04-16)
1028
1045
 
1029
1046
  - **NEW**: Auto-discovery of CUDA 12 DLLs from pip-installed NVIDIA packages — no manual PATH configuration needed
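The v3.6.0 entry above says the code parser now reports `language` metadata per file type instead of a hardcoded `"python"`. A minimal sketch of an extension lookup that would achieve this is shown below. The mapping name, the function name, and the label strings (for example treating `.h` as `"c"`) are assumptions for illustration.

```python
from pathlib import Path

# Hypothetical mapping; the labels the package actually emits may differ.
LANGUAGE_BY_EXTENSION = {
    ".py": "python",
    ".c": "c",
    ".h": "c",
    ".cpp": "cpp",
    ".js": "javascript",
    ".jsx": "javascript",
    ".ts": "typescript",
    ".tsx": "typescript",
}


def detect_language(path, default="unknown"):
    """Return a language label based on the file extension (case-insensitive)."""
    return LANGUAGE_BY_EXTENSION.get(Path(path).suffix.lower(), default)
```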
@@ -1038,7 +1055,7 @@ With ~200 documents, expect ~300-500MB RAM. The embedding model (~50MB) and rera
1038
1055
  ### v3.5.0 (2026-04-16)
1039
1056
 
1040
1057
  - **NEW**: Optional GPU acceleration for ONNX embeddings — `pip install knowledge-rag[gpu]` + `models.embedding.gpu: true` in config. 5-10x faster indexing on NVIDIA GPUs with automatic CPU fallback.
1041
- - **DOCS**: Supported formats table added to README (12 formats)
1058
+ - **DOCS**: Supported formats table added to README (20 formats)
1042
1059
 
1043
1060
  ### v3.4.3 (2026-04-16)
1044
1061
 
@@ -12,19 +12,19 @@
12
12
  [![Glama Score](https://glama.ai/mcp/servers/lyonzin/knowledge-rag/badges/score.svg)](https://glama.ai/mcp/servers/lyonzin/knowledge-rag)
13
13
  [![PyPI](https://img.shields.io/pypi/v/knowledge-rag)](https://pypi.org/project/knowledge-rag/)
14
14
 
15
- ### LLMs don't know your docs. Every conversation starts from zero.
15
+ ### Your docs, your machine, zero cloud. Claude Code searches them natively.
16
16
 
17
- Your notes, writeups, internal procedures, PDFs: none of it exists to your AI assistant.
18
- Cloud RAG solutions leak your private data. Local ones require Docker, Ollama, and 15 minutes of setup before a single query.
17
+ Drop your PDFs, markdown, code, notebooks: **1800+ files, 39K chunks, indexed in under 3 minutes.**<br/>
18
+ Hybrid search (BM25 + semantic vectors + cross-encoder reranking) through 12 MCP tools.<br/>
19
+ Everything runs locally via ONNX. No Docker, no Ollama, no API keys, no data leaves your machine.
19
20
 
20
- **Knowledge RAG fixes this.** One `pip install`, zero external servers.
21
- Your documents become instantly searchable inside Claude Code with reranking precision that actually finds what you need.
22
-
23
- `pip install → restart Claude Code → done.`
21
+ ```
22
+ pip install knowledge-rag → restart Claude Code → search_knowledge("your query")
23
+ ```
24
24
 
25
25
  ---
26
26
 
27
- **12 MCP Tools** | **Hybrid Search + Cross-Encoder Reranking** | **Markdown-Aware Chunking** | **100% Local, Zero Cloud**
27
+ **12 MCP Tools** | **Hybrid Search + Reranking** | **20 File Formats** | **Optional NVIDIA GPU** | **100% Local**
28
28
 
29
29
  [What's New](#whats-new-in-v352) | [Supported Formats](#supported-formats) | [Installation](#installation) | [Configuration](#configuration) | [API Reference](#api-reference) | [Architecture](#architecture)
30
30
 
@@ -87,6 +87,14 @@ See [Changelog](#changelog) for full history.
87
87
  | Excel | `.xlsx` | openpyxl | Yes | Sheet-by-sheet extraction |
88
88
  | PowerPoint | `.pptx` | python-pptx | Yes | Slide-by-slide extraction |
89
89
  | Jupyter Notebook | `.ipynb` | Cell-aware parser | Yes | Markdown + code cells only, no outputs/base64 |
90
+ | C Source | `.c` | Code-aware parser | Yes | Functions/structs/includes extracted |
91
+ | C/C++ Header | `.h` | Code-aware parser | Yes | Function declarations/structs extracted |
92
+ | C++ Source | `.cpp` | Code-aware parser | Yes | Classes/structs/includes extracted |
93
+ | JavaScript | `.js` | Code-aware parser | Yes | Functions/classes/imports (ESM + CJS) |
94
+ | React JSX | `.jsx` | Code-aware parser | Yes | Same as JS parser |
95
+ | TypeScript | `.ts` | Code-aware parser | Yes | Functions/classes/interfaces/enums/imports |
96
+ | React TSX | `.tsx` | Code-aware parser | Yes | Same as TS parser |
97
+ | XML | `.xml` | XML parser | Yes | Root element and namespace extraction |
90
98
  | MQL4 Header | `.mqh` | Code parser | No | MetaTrader — add to `supported_formats` to enable |
91
99
  | MQL4 Source | `.mq4` | Code parser | No | MetaTrader — add to `supported_formats` to enable |
92
100
 
@@ -106,7 +114,7 @@ See [Changelog](#changelog) for full history.
106
114
  | **Markdown-Aware Chunking** | `.md` files split by `##`/`###` sections instead of fixed windows |
107
115
  | **In-Process Embeddings** | FastEmbed ONNX Runtime (BAAI/bge-small-en-v1.5, 384D) |
108
116
  | **Keyword Routing** | Word-boundary aware routing for domain-specific queries |
109
- | **12 Format Parsers** | MD, TXT, PDF, PY, JSON, CSV, DOCX, XLSX, PPTX, IPYNB + opt-in MQH/MQ4 |
117
+ | **20 Format Parsers** | MD, TXT, PDF, PY, C, H, CPP, JS, JSX, TS, TSX, JSON, XML, CSV, DOCX, XLSX, PPTX, IPYNB + opt-in MQH/MQ4 |
110
118
  | **Category Organization** | Organize docs by folder, auto-tagged by path |
111
119
  | **Incremental Indexing** | Change detection via mtime/size — only re-indexes modified files |
112
120
  | **Chunk Deduplication** | SHA256 content hashing prevents duplicate chunks |
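The "Incremental Indexing" row above describes change detection via mtime/size. A minimal sketch of that check follows, assuming the indexer keeps a dictionary of previously seen signatures. The names `file_signature` and `changed_files` are hypothetical.

```python
import os


def file_signature(path):
    """Return (mtime in ns, size in bytes) for a file."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def changed_files(paths, previous):
    """Return the paths whose signature differs from the stored one.

    `previous` maps path -> signature from the last indexing run;
    paths missing from it count as changed (new files).
    """
    return [p for p in paths if previous.get(p) != file_signature(p)]
```

Comparing mtime and size avoids reading file contents, so unchanged files cost one `stat` call each. An edit that preserves both values would be missed, which is the usual trade-off of this approach.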
@@ -164,7 +172,7 @@ flowchart TB
164
172
  end
165
173
 
166
174
  subgraph INGEST["DOCUMENT INGESTION"]
167
- PARSERS["12 Parsers<br/>MD | PDF | TXT | PY | JSON | CSV<br/>DOCX | XLSX | PPTX | IPYNB | MQH | MQ4"]
175
+ PARSERS["20 Parsers<br/>MD | PDF | TXT | PY | C | H | CPP | JS | JSX | TS | TSX | JSON | XML | CSV<br/>DOCX | XLSX | PPTX | IPYNB | MQH | MQ4"]
168
176
  CHUNKER["Chunking<br/>MD: section-aware<br/>Other: 1000 chars + 200 overlap"]
169
177
  PARSERS --> CHUNKER
170
178
  end
@@ -230,11 +238,11 @@ flowchart LR
230
238
  FILES["documents/<br/>├── security/<br/>├── development/<br/>├── ctf/<br/>└── general/"]
231
239
  end
232
240
 
233
- subgraph PARSE["Parse (12 formats)"]
241
+ subgraph PARSE["Parse (20 formats)"]
234
242
  MD["Markdown"]
235
243
  PDF["PDF<br/>(PyMuPDF)"]
236
244
  OFFICE["DOCX | XLSX<br/>PPTX | CSV"]
237
- CODE["PY | JSON<br/>IPYNB"]
245
+ CODE["PY | C | H | CPP | JS | JSX<br/>TS | TSX | JSON | XML | IPYNB"]
238
246
  end
239
247
 
240
248
  subgraph CHUNK["Chunk"]
@@ -895,7 +903,7 @@ knowledge-rag/
895
903
  ├── mcp_server/
896
904
  │ ├── __init__.py # Stdout protection + version
897
905
  │ ├── config.py # YAML config loader + defaults
898
- │ ├── ingestion.py # 12 parsers, chunking, metadata extraction
906
+ │ ├── ingestion.py # 20 parsers, chunking, metadata extraction
899
907
  │ └── server.py # MCP server, ChromaDB, BM25, reranker, 12 tools
900
908
  ├── config.example.yaml # Documented config template (copy to config.yaml)
901
909
  ├── config.yaml # Your active configuration (git-ignored)
@@ -986,6 +994,15 @@ With ~200 documents, expect ~300-500MB RAM. The embedding model (~50MB) and rera
986
994
 
987
995
  ## Changelog
988
996
 
997
+ ### v3.6.0 (2026-04-23)
998
+
999
+ - **NEW**: Multi-language code parsing — C (`.c`), C++ (`.cpp`/`.h`), JavaScript (`.js`/`.jsx`), TypeScript (`.ts`/`.tsx`) with per-language function/class/import extraction
1000
+ - **NEW**: XML parser (`.xml`) — root element and namespace metadata extraction
1001
+ - **NEW**: All 8 new formats default enabled — no config change needed
1002
+ - **NEW**: NPM wrapper (`npx knowledge-rag`) + Docker image (`ghcr.io/lyonzin/knowledge-rag`)
1003
+ - **NEW**: Automated release pipeline — PyPI (Trusted Publishing), NPM, Docker GHCR
1004
+ - **IMPROVED**: Code parser reports correct `language` metadata per file type (was hardcoded to `"python"` for all code files)
1005
+
989
1006
  ### v3.5.2 (2026-04-16)
990
1007
 
991
1008
  - **NEW**: Auto-discovery of CUDA 12 DLLs from pip-installed NVIDIA packages — no manual PATH configuration needed
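The v3.6.0 entry above mentions root element and namespace metadata extraction for `.xml` files. A minimal sketch using the standard library's `xml.etree.ElementTree` is shown below. The function name and the returned keys are assumptions, and the package may use a different parser.

```python
import xml.etree.ElementTree as ET


def xml_root_metadata(xml_text):
    """Return the root element's local name and namespace URI.

    ElementTree encodes namespaced tags as "{uri}local", so the
    two parts are split apart here. Illustrative helper only.
    """
    root = ET.fromstring(xml_text)
    tag = root.tag
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
    else:
        namespace, local = "", tag
    return {"root_element": local, "namespace": namespace}
```

For an Atom feed, this yields `feed` as the root element and `http://www.w3.org/2005/Atom` as the namespace.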
@@ -1000,7 +1017,7 @@ With ~200 documents, expect ~300-500MB RAM. The embedding model (~50MB) and rera
1000
1017
  ### v3.5.0 (2026-04-16)
1001
1018
 
1002
1019
  - **NEW**: Optional GPU acceleration for ONNX embeddings — `pip install knowledge-rag[gpu]` + `models.embedding.gpu: true` in config. 5-10x faster indexing on NVIDIA GPUs with automatic CPU fallback.
1003
- - **DOCS**: Supported formats table added to README (12 formats)
1020
+ - **DOCS**: Supported formats table added to README (20 formats)
1004
1021
 
1005
1022
  ### v3.4.3 (2026-04-16)
1006
1023