knowledge-rag 3.5.2__tar.gz → 3.6.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/.gitignore +59 -59
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/LICENSE +21 -21
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/PKG-INFO +33 -16
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/README.md +31 -14
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/config.example.yaml +247 -239
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/documents/examples/sample-document.md +36 -36
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/mcp_server/__init__.py +17 -17
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/mcp_server/config.py +595 -536
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/mcp_server/ingestion.py +919 -801
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/mcp_server/server.py +1968 -1968
- knowledge_rag-3.6.0/npm/README.md +61 -0
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/presets/cybersecurity.yaml +367 -363
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/presets/developer.yaml +364 -356
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/presets/general.yaml +112 -112
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/presets/research.yaml +256 -256
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/pyproject.toml +114 -114
- {knowledge_rag-3.5.2 → knowledge_rag-3.6.0}/requirements.txt +49 -49
--- knowledge_rag-3.5.2/.gitignore
+++ knowledge_rag-3.6.0/.gitignore
@@ -1,59 +1,59 @@
-# Runtime
-venv/
-__pycache__/
-*.pyc
-data/
-
-# User config (personal settings — use presets/ as starting point)
-config.yaml
-
-# Personal documents (NEVER commit — user-populated content)
-documents/aar/
-documents/security/
-documents/logscale/
-documents/general/
-documents/ctf/
-documents/development/
-documents/documents/
-
-# Sensitive file formats (extra safety layer)
-documents/**/*.pdf
-documents/**/*.docx
-documents/**/*.xlsx
-documents/**/*.pptx
-documents/**/*.csv
-
-# Keep example docs
-!documents/examples/
-!documents/examples/**
-
-# FastEmbed model cache
-.cache/
-models_cache/
-*.onnx
-
-# Local scripts (not part of distribution)
-scripts/
-setup-notebook.ps1
-demo-real.yml
-documents/.sync-log.txt
-documents/README-CATEGORIES.md
-
-# Temp/junk
-*.b64
-*.tar.gz
-*.bak
-
-# OS files
-.DS_Store
-Thumbs.db
-desktop.ini
-
-# IDE
-.vscode/
-.idea/
-*.swp
-*.swo
-dist/
-.ruff_cache/
-.pytest_cache/
+# Runtime
+venv/
+__pycache__/
+*.pyc
+data/
+
+# User config (personal settings — use presets/ as starting point)
+config.yaml
+
+# Personal documents (NEVER commit — user-populated content)
+documents/aar/
+documents/security/
+documents/logscale/
+documents/general/
+documents/ctf/
+documents/development/
+documents/documents/
+
+# Sensitive file formats (extra safety layer)
+documents/**/*.pdf
+documents/**/*.docx
+documents/**/*.xlsx
+documents/**/*.pptx
+documents/**/*.csv
+
+# Keep example docs
+!documents/examples/
+!documents/examples/**
+
+# FastEmbed model cache
+.cache/
+models_cache/
+*.onnx
+
+# Local scripts (not part of distribution)
+scripts/
+setup-notebook.ps1
+demo-real.yml
+documents/.sync-log.txt
+documents/README-CATEGORIES.md
+
+# Temp/junk
+*.b64
+*.tar.gz
+*.bak
+
+# OS files
+.DS_Store
+Thumbs.db
+desktop.ini
+
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+dist/
+.ruff_cache/
+.pytest_cache/
--- knowledge_rag-3.5.2/LICENSE
+++ knowledge_rag-3.6.0/LICENSE
@@ -1,21 +1,21 @@
-MIT License
-
-Copyright (c) 2025 Ailton Rocha (Lyon)
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
+MIT License
+
+Copyright (c) 2025 Ailton Rocha (Lyon)
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
--- knowledge_rag-3.5.2/PKG-INFO
+++ knowledge_rag-3.6.0/PKG-INFO
@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: knowledge-rag
-Version: 3.
-Summary: Local RAG System for Claude Code — Hybrid search + Cross-encoder Reranking + 12 MCP Tools. Zero external servers.
+Version: 3.6.0
+Summary: Local RAG System for Claude Code — Hybrid search + Cross-encoder Reranking + 12 MCP Tools + 20 Format Parsers. Zero external servers.
 Project-URL: Homepage, https://github.com/lyonzin/knowledge-rag
 Project-URL: Repository, https://github.com/lyonzin/knowledge-rag
 Project-URL: Issues, https://github.com/lyonzin/knowledge-rag/issues
@@ -50,19 +50,19 @@ Description-Content-Type: text/markdown
 [](https://glama.ai/mcp/servers/lyonzin/knowledge-rag)
 [](https://pypi.org/project/knowledge-rag/)
 
-###
+### Your docs, your machine, zero cloud. Claude Code searches them natively.
 
-
-
+Drop your PDFs, markdown, code, notebooks — **1800+ files, 39K chunks, indexed in under 3 minutes.**<br/>
+Hybrid search (BM25 + semantic vectors + cross-encoder reranking) through 12 MCP tools.<br/>
+Everything runs locally via ONNX. No Docker, no Ollama, no API keys, no data leaves your machine.
 
-
-
-
-`pip install → restart Claude Code → done.`
+```
+pip install knowledge-rag → restart Claude Code → search_knowledge("your query")
+```
 
 ---
 
-**12 MCP Tools** | **Hybrid Search +
+**12 MCP Tools** | **Hybrid Search + Reranking** | **20 File Formats** | **Optional NVIDIA GPU** | **100% Local**
 
 [What's New](#whats-new-in-v352) | [Supported Formats](#supported-formats) | [Installation](#installation) | [Configuration](#configuration) | [API Reference](#api-reference) | [Architecture](#architecture)
 
@@ -125,6 +125,14 @@ See [Changelog](#changelog) for full history.
 | Excel | `.xlsx` | openpyxl | Yes | Sheet-by-sheet extraction |
 | PowerPoint | `.pptx` | python-pptx | Yes | Slide-by-slide extraction |
 | Jupyter Notebook | `.ipynb` | Cell-aware parser | Yes | Markdown + code cells only, no outputs/base64 |
+| C Source | `.c` | Code-aware parser | Yes | Functions/structs/includes extracted |
+| C/C++ Header | `.h` | Code-aware parser | Yes | Function declarations/structs extracted |
+| C++ Source | `.cpp` | Code-aware parser | Yes | Classes/structs/includes extracted |
+| JavaScript | `.js` | Code-aware parser | Yes | Functions/classes/imports (ESM + CJS) |
+| React JSX | `.jsx` | Code-aware parser | Yes | Same as JS parser |
+| TypeScript | `.ts` | Code-aware parser | Yes | Functions/classes/interfaces/enums/imports |
+| React TSX | `.tsx` | Code-aware parser | Yes | Same as TS parser |
+| XML | `.xml` | XML parser | Yes | Root element and namespace extraction |
 | MQL4 Header | `.mqh` | Code parser | No | MetaTrader — add to `supported_formats` to enable |
 | MQL4 Source | `.mq4` | Code parser | No | MetaTrader — add to `supported_formats` to enable |
 
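The `.xml` row added in the hunk above describes "root element and namespace extraction". The sketch below shows one way such metadata could be pulled out, assuming Python's standard `xml.etree.ElementTree`. The helper name `xml_root_metadata` and the returned keys are hypothetical and are not taken from the package's `ingestion.py`.

```python
import xml.etree.ElementTree as ET


def xml_root_metadata(xml_text: str) -> dict:
    """Return the root element name and namespace URI of an XML document.

    Hypothetical helper for illustration; not the package's actual parser.
    """
    root = ET.fromstring(xml_text)
    tag = root.tag
    # ElementTree encodes a namespaced tag as "{namespace-uri}localname".
    if tag.startswith("{"):
        namespace, _, local_name = tag[1:].partition("}")
    else:
        namespace, local_name = "", tag
    return {"root_element": local_name, "namespace": namespace}
```

For a document without a default namespace the `namespace` value is simply the empty string, so downstream code can treat both cases uniformly.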
@@ -144,7 +152,7 @@ See [Changelog](#changelog) for full history.
 | **Markdown-Aware Chunking** | `.md` files split by `##`/`###` sections instead of fixed windows |
 | **In-Process Embeddings** | FastEmbed ONNX Runtime (BAAI/bge-small-en-v1.5, 384D) |
 | **Keyword Routing** | Word-boundary aware routing for domain-specific queries |
-| **
+| **20 Format Parsers** | MD, TXT, PDF, PY, C, H, CPP, JS, JSX, TS, TSX, JSON, XML, CSV, DOCX, XLSX, PPTX, IPYNB + opt-in MQH/MQ4 |
 | **Category Organization** | Organize docs by folder, auto-tagged by path |
 | **Incremental Indexing** | Change detection via mtime/size — only re-indexes modified files |
 | **Chunk Deduplication** | SHA256 content hashing prevents duplicate chunks |
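The context rows in this hunk mention change detection via mtime/size and SHA256 content hashing for chunk deduplication. The sketch below illustrates both ideas using only the standard library. It assumes nothing about the package's real code: `file_signature`, `needs_reindex`, `dedupe_chunks`, and the `index_state` dict are hypothetical names.

```python
import hashlib
import os


def file_signature(path):
    """Return a cheap (mtime, size) signature for a file."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def needs_reindex(path, signature, index_state):
    """Return True when the signature differs from the last one recorded.

    index_state is a plain dict mapping path -> signature (an assumption;
    a real index would persist this somewhere).
    """
    if index_state.get(path) == signature:
        return False
    index_state[path] = signature
    return True


def dedupe_chunks(chunks):
    """Drop chunks whose SHA256 digest has already been seen, keeping order."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```

A caller would typically combine the first two as `needs_reindex(p, file_signature(p), index_state)` and skip parsing when it returns `False`.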
@@ -202,7 +210,7 @@ flowchart TB
 end
 
 subgraph INGEST["DOCUMENT INGESTION"]
-PARSERS["
+PARSERS["20 Parsers<br/>MD | PDF | TXT | PY | C | H | CPP | JS | JSX | TS | TSX | JSON | XML | CSV<br/>DOCX | XLSX | PPTX | IPYNB | MQH | MQ4"]
 CHUNKER["Chunking<br/>MD: section-aware<br/>Other: 1000 chars + 200 overlap"]
 PARSERS --> CHUNKER
 end
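The `CHUNKER` node in this hunk lists "1000 chars + 200 overlap" for non-Markdown files. One plausible reading is a sliding character window in which each chunk repeats the last 200 characters of the previous one. The sketch below implements that reading; `chunk_text` and its parameter names are hypothetical, and the package's actual chunker may differ.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into fixed-size character windows that overlap.

    Hypothetical sketch: each window starts (size - overlap) characters
    after the previous one, so neighbours share `overlap` characters.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once a window has reached the end of the text.
        if start + size >= len(text):
            break
    return chunks
```

With the defaults, a 2000-character input yields windows starting at offsets 0, 800, and 1600, the last one being shorter than the others.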
@@ -268,11 +276,11 @@ flowchart LR
 FILES["documents/<br/>├── security/<br/>├── development/<br/>├── ctf/<br/>└── general/"]
 end
 
-subgraph PARSE["Parse (
+subgraph PARSE["Parse (20 formats)"]
 MD["Markdown"]
 PDF["PDF<br/>(PyMuPDF)"]
 OFFICE["DOCX | XLSX<br/>PPTX | CSV"]
-CODE["PY |
+CODE["PY | C | H | CPP | JS | JSX<br/>TS | TSX | JSON | XML | IPYNB"]
 end
 
 subgraph CHUNK["Chunk"]
@@ -933,7 +941,7 @@ knowledge-rag/
 ├── mcp_server/
 │ ├── __init__.py # Stdout protection + version
 │ ├── config.py # YAML config loader + defaults
-│ ├── ingestion.py #
+│ ├── ingestion.py # 20 parsers, chunking, metadata extraction
 │ └── server.py # MCP server, ChromaDB, BM25, reranker, 12 tools
 ├── config.example.yaml # Documented config template (copy to config.yaml)
 ├── config.yaml # Your active configuration (git-ignored)
@@ -1024,6 +1032,15 @@ With ~200 documents, expect ~300-500MB RAM. The embedding model (~50MB) and rera
 
 ## Changelog
 
+### v3.6.0 (2026-04-23)
+
+- **NEW**: Multi-language code parsing — C (`.c`), C++ (`.cpp`/`.h`), JavaScript (`.js`/`.jsx`), TypeScript (`.ts`/`.tsx`) with per-language function/class/import extraction
+- **NEW**: XML parser (`.xml`) — root element and namespace metadata extraction
+- **NEW**: All 8 new formats default enabled — no config change needed
+- **NEW**: NPM wrapper (`npx knowledge-rag`) + Docker image (`ghcr.io/lyonzin/knowledge-rag`)
+- **NEW**: Automated release pipeline — PyPI (Trusted Publishing), NPM, Docker GHCR
+- **IMPROVED**: Code parser reports correct `language` metadata per file type (was hardcoded to `"python"` for all code files)
+
 ### v3.5.2 (2026-04-16)
 
 - **NEW**: Auto-discovery of CUDA 12 DLLs from pip-installed NVIDIA packages — no manual PATH configuration needed
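The IMPROVED entry in the v3.6.0 changelog above says the code parser now reports a `language` value per file type instead of a hardcoded `"python"`. A minimal sketch of an extension-based lookup that would give that behaviour follows. The mapping values and the `detect_language` helper are assumptions made for illustration; the labels the package actually emits are not shown in this diff.

```python
from pathlib import Path

# Hypothetical extension -> language labels; the real labels may differ.
LANGUAGE_BY_EXTENSION = {
    ".py": "python",
    ".c": "c",
    ".h": "c",  # assumption: headers could equally be labelled "cpp"
    ".cpp": "cpp",
    ".js": "javascript",
    ".jsx": "javascript",
    ".ts": "typescript",
    ".tsx": "typescript",
}


def detect_language(path, default="unknown"):
    """Pick a language label from the file extension, case-insensitively."""
    return LANGUAGE_BY_EXTENSION.get(Path(path).suffix.lower(), default)
```

Keeping the mapping in one table means a newly supported extension needs a single new entry rather than another branch in the parser.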
@@ -1038,7 +1055,7 @@ With ~200 documents, expect ~300-500MB RAM. The embedding model (~50MB) and rera
 ### v3.5.0 (2026-04-16)
 
 - **NEW**: Optional GPU acceleration for ONNX embeddings — `pip install knowledge-rag[gpu]` + `models.embedding.gpu: true` in config. 5-10x faster indexing on NVIDIA GPUs with automatic CPU fallback.
-- **DOCS**: Supported formats table added to README (
+- **DOCS**: Supported formats table added to README (20 formats)
 
 ### v3.4.3 (2026-04-16)
 
--- knowledge_rag-3.5.2/README.md
+++ knowledge_rag-3.6.0/README.md
@@ -12,19 +12,19 @@
 [](https://glama.ai/mcp/servers/lyonzin/knowledge-rag)
 [](https://pypi.org/project/knowledge-rag/)
 
-###
+### Your docs, your machine, zero cloud. Claude Code searches them natively.
 
-
-
+Drop your PDFs, markdown, code, notebooks — **1800+ files, 39K chunks, indexed in under 3 minutes.**<br/>
+Hybrid search (BM25 + semantic vectors + cross-encoder reranking) through 12 MCP tools.<br/>
+Everything runs locally via ONNX. No Docker, no Ollama, no API keys, no data leaves your machine.
 
-
-
-
-`pip install → restart Claude Code → done.`
+```
+pip install knowledge-rag → restart Claude Code → search_knowledge("your query")
+```
 
 ---
 
-**12 MCP Tools** | **Hybrid Search +
+**12 MCP Tools** | **Hybrid Search + Reranking** | **20 File Formats** | **Optional NVIDIA GPU** | **100% Local**
 
 [What's New](#whats-new-in-v352) | [Supported Formats](#supported-formats) | [Installation](#installation) | [Configuration](#configuration) | [API Reference](#api-reference) | [Architecture](#architecture)
 
@@ -87,6 +87,14 @@ See [Changelog](#changelog) for full history.
 | Excel | `.xlsx` | openpyxl | Yes | Sheet-by-sheet extraction |
 | PowerPoint | `.pptx` | python-pptx | Yes | Slide-by-slide extraction |
 | Jupyter Notebook | `.ipynb` | Cell-aware parser | Yes | Markdown + code cells only, no outputs/base64 |
+| C Source | `.c` | Code-aware parser | Yes | Functions/structs/includes extracted |
+| C/C++ Header | `.h` | Code-aware parser | Yes | Function declarations/structs extracted |
+| C++ Source | `.cpp` | Code-aware parser | Yes | Classes/structs/includes extracted |
+| JavaScript | `.js` | Code-aware parser | Yes | Functions/classes/imports (ESM + CJS) |
+| React JSX | `.jsx` | Code-aware parser | Yes | Same as JS parser |
+| TypeScript | `.ts` | Code-aware parser | Yes | Functions/classes/interfaces/enums/imports |
+| React TSX | `.tsx` | Code-aware parser | Yes | Same as TS parser |
+| XML | `.xml` | XML parser | Yes | Root element and namespace extraction |
 | MQL4 Header | `.mqh` | Code parser | No | MetaTrader — add to `supported_formats` to enable |
 | MQL4 Source | `.mq4` | Code parser | No | MetaTrader — add to `supported_formats` to enable |
 
@@ -106,7 +114,7 @@ See [Changelog](#changelog) for full history.
 | **Markdown-Aware Chunking** | `.md` files split by `##`/`###` sections instead of fixed windows |
 | **In-Process Embeddings** | FastEmbed ONNX Runtime (BAAI/bge-small-en-v1.5, 384D) |
 | **Keyword Routing** | Word-boundary aware routing for domain-specific queries |
-| **
+| **20 Format Parsers** | MD, TXT, PDF, PY, C, H, CPP, JS, JSX, TS, TSX, JSON, XML, CSV, DOCX, XLSX, PPTX, IPYNB + opt-in MQH/MQ4 |
 | **Category Organization** | Organize docs by folder, auto-tagged by path |
 | **Incremental Indexing** | Change detection via mtime/size — only re-indexes modified files |
 | **Chunk Deduplication** | SHA256 content hashing prevents duplicate chunks |
@@ -164,7 +172,7 @@ flowchart TB
 end
 
 subgraph INGEST["DOCUMENT INGESTION"]
-PARSERS["
+PARSERS["20 Parsers<br/>MD | PDF | TXT | PY | C | H | CPP | JS | JSX | TS | TSX | JSON | XML | CSV<br/>DOCX | XLSX | PPTX | IPYNB | MQH | MQ4"]
 CHUNKER["Chunking<br/>MD: section-aware<br/>Other: 1000 chars + 200 overlap"]
 PARSERS --> CHUNKER
 end
@@ -230,11 +238,11 @@ flowchart LR
 FILES["documents/<br/>├── security/<br/>├── development/<br/>├── ctf/<br/>└── general/"]
 end
 
-subgraph PARSE["Parse (
+subgraph PARSE["Parse (20 formats)"]
 MD["Markdown"]
 PDF["PDF<br/>(PyMuPDF)"]
 OFFICE["DOCX | XLSX<br/>PPTX | CSV"]
-CODE["PY |
+CODE["PY | C | H | CPP | JS | JSX<br/>TS | TSX | JSON | XML | IPYNB"]
 end
 
 subgraph CHUNK["Chunk"]
@@ -895,7 +903,7 @@ knowledge-rag/
 ├── mcp_server/
 │ ├── __init__.py # Stdout protection + version
 │ ├── config.py # YAML config loader + defaults
-│ ├── ingestion.py #
+│ ├── ingestion.py # 20 parsers, chunking, metadata extraction
 │ └── server.py # MCP server, ChromaDB, BM25, reranker, 12 tools
 ├── config.example.yaml # Documented config template (copy to config.yaml)
 ├── config.yaml # Your active configuration (git-ignored)
@@ -986,6 +994,15 @@ With ~200 documents, expect ~300-500MB RAM. The embedding model (~50MB) and rera
 
 ## Changelog
 
+### v3.6.0 (2026-04-23)
+
+- **NEW**: Multi-language code parsing — C (`.c`), C++ (`.cpp`/`.h`), JavaScript (`.js`/`.jsx`), TypeScript (`.ts`/`.tsx`) with per-language function/class/import extraction
+- **NEW**: XML parser (`.xml`) — root element and namespace metadata extraction
+- **NEW**: All 8 new formats default enabled — no config change needed
+- **NEW**: NPM wrapper (`npx knowledge-rag`) + Docker image (`ghcr.io/lyonzin/knowledge-rag`)
+- **NEW**: Automated release pipeline — PyPI (Trusted Publishing), NPM, Docker GHCR
+- **IMPROVED**: Code parser reports correct `language` metadata per file type (was hardcoded to `"python"` for all code files)
+
 ### v3.5.2 (2026-04-16)
 
 - **NEW**: Auto-discovery of CUDA 12 DLLs from pip-installed NVIDIA packages — no manual PATH configuration needed
@@ -1000,7 +1017,7 @@ With ~200 documents, expect ~300-500MB RAM. The embedding model (~50MB) and rera
 ### v3.5.0 (2026-04-16)
 
 - **NEW**: Optional GPU acceleration for ONNX embeddings — `pip install knowledge-rag[gpu]` + `models.embedding.gpu: true` in config. 5-10x faster indexing on NVIDIA GPUs with automatic CPU fallback.
-- **DOCS**: Supported formats table added to README (
+- **DOCS**: Supported formats table added to README (20 formats)
 
 ### v3.4.3 (2026-04-16)
 