@robthepcguy/rag-vault 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (131)
  1. package/LICENSE +24 -0
  2. package/README.md +421 -0
  3. package/dist/bin/install-skills.d.ts +20 -0
  4. package/dist/bin/install-skills.d.ts.map +1 -0
  5. package/dist/bin/install-skills.js +196 -0
  6. package/dist/bin/install-skills.js.map +1 -0
  7. package/dist/chunker/index.d.ts +11 -0
  8. package/dist/chunker/index.d.ts.map +1 -0
  9. package/dist/chunker/index.js +6 -0
  10. package/dist/chunker/index.js.map +1 -0
  11. package/dist/chunker/semantic-chunker.d.ts +96 -0
  12. package/dist/chunker/semantic-chunker.d.ts.map +1 -0
  13. package/dist/chunker/semantic-chunker.js +267 -0
  14. package/dist/chunker/semantic-chunker.js.map +1 -0
  15. package/dist/chunker/sentence-splitter.d.ts +16 -0
  16. package/dist/chunker/sentence-splitter.d.ts.map +1 -0
  17. package/dist/chunker/sentence-splitter.js +114 -0
  18. package/dist/chunker/sentence-splitter.js.map +1 -0
  19. package/dist/embedder/index.d.ts +55 -0
  20. package/dist/embedder/index.d.ts.map +1 -0
  21. package/dist/embedder/index.js +146 -0
  22. package/dist/embedder/index.js.map +1 -0
  23. package/dist/errors/index.d.ts +73 -0
  24. package/dist/errors/index.d.ts.map +1 -0
  25. package/dist/errors/index.js +170 -0
  26. package/dist/errors/index.js.map +1 -0
  27. package/dist/index.d.ts +3 -0
  28. package/dist/index.d.ts.map +1 -0
  29. package/dist/index.js +91 -0
  30. package/dist/index.js.map +1 -0
  31. package/dist/parser/html-parser.d.ts +14 -0
  32. package/dist/parser/html-parser.d.ts.map +1 -0
  33. package/dist/parser/html-parser.js +99 -0
  34. package/dist/parser/html-parser.js.map +1 -0
  35. package/dist/parser/index.d.ts +144 -0
  36. package/dist/parser/index.d.ts.map +1 -0
  37. package/dist/parser/index.js +446 -0
  38. package/dist/parser/index.js.map +1 -0
  39. package/dist/parser/pdf-filter.d.ts +89 -0
  40. package/dist/parser/pdf-filter.d.ts.map +1 -0
  41. package/dist/parser/pdf-filter.js +304 -0
  42. package/dist/parser/pdf-filter.js.map +1 -0
  43. package/dist/server/index.d.ts +144 -0
  44. package/dist/server/index.d.ts.map +1 -0
  45. package/dist/server/index.js +518 -0
  46. package/dist/server/index.js.map +1 -0
  47. package/dist/server/raw-data-utils.d.ts +81 -0
  48. package/dist/server/raw-data-utils.d.ts.map +1 -0
  49. package/dist/server/raw-data-utils.js +196 -0
  50. package/dist/server/raw-data-utils.js.map +1 -0
  51. package/dist/server/schemas.d.ts +186 -0
  52. package/dist/server/schemas.d.ts.map +1 -0
  53. package/dist/server/schemas.js +99 -0
  54. package/dist/server/schemas.js.map +1 -0
  55. package/dist/utils/config-parsers.d.ts +14 -0
  56. package/dist/utils/config-parsers.d.ts.map +1 -0
  57. package/dist/utils/config-parsers.js +47 -0
  58. package/dist/utils/config-parsers.js.map +1 -0
  59. package/dist/utils/config.d.ts +37 -0
  60. package/dist/utils/config.d.ts.map +1 -0
  61. package/dist/utils/config.js +52 -0
  62. package/dist/utils/config.js.map +1 -0
  63. package/dist/utils/logger.d.ts +36 -0
  64. package/dist/utils/logger.d.ts.map +1 -0
  65. package/dist/utils/logger.js +64 -0
  66. package/dist/utils/logger.js.map +1 -0
  67. package/dist/utils/math.d.ts +34 -0
  68. package/dist/utils/math.d.ts.map +1 -0
  69. package/dist/utils/math.js +73 -0
  70. package/dist/utils/math.js.map +1 -0
  71. package/dist/utils/process-handlers.d.ts +26 -0
  72. package/dist/utils/process-handlers.d.ts.map +1 -0
  73. package/dist/utils/process-handlers.js +69 -0
  74. package/dist/utils/process-handlers.js.map +1 -0
  75. package/dist/vectordb/index.d.ts +210 -0
  76. package/dist/vectordb/index.d.ts.map +1 -0
  77. package/dist/vectordb/index.js +613 -0
  78. package/dist/vectordb/index.js.map +1 -0
  79. package/dist/web/api-routes.d.ts +9 -0
  80. package/dist/web/api-routes.d.ts.map +1 -0
  81. package/dist/web/api-routes.js +127 -0
  82. package/dist/web/api-routes.js.map +1 -0
  83. package/dist/web/config-routes.d.ts +7 -0
  84. package/dist/web/config-routes.d.ts.map +1 -0
  85. package/dist/web/config-routes.js +54 -0
  86. package/dist/web/config-routes.js.map +1 -0
  87. package/dist/web/database-manager.d.ts +130 -0
  88. package/dist/web/database-manager.d.ts.map +1 -0
  89. package/dist/web/database-manager.js +382 -0
  90. package/dist/web/database-manager.js.map +1 -0
  91. package/dist/web/http-server.d.ts +28 -0
  92. package/dist/web/http-server.d.ts.map +1 -0
  93. package/dist/web/http-server.js +311 -0
  94. package/dist/web/http-server.js.map +1 -0
  95. package/dist/web/index.d.ts +3 -0
  96. package/dist/web/index.d.ts.map +1 -0
  97. package/dist/web/index.js +114 -0
  98. package/dist/web/index.js.map +1 -0
  99. package/dist/web/middleware/async-handler.d.ts +17 -0
  100. package/dist/web/middleware/async-handler.d.ts.map +1 -0
  101. package/dist/web/middleware/async-handler.js +26 -0
  102. package/dist/web/middleware/async-handler.js.map +1 -0
  103. package/dist/web/middleware/auth.d.ts +22 -0
  104. package/dist/web/middleware/auth.d.ts.map +1 -0
  105. package/dist/web/middleware/auth.js +81 -0
  106. package/dist/web/middleware/auth.js.map +1 -0
  107. package/dist/web/middleware/error-handler.d.ts +36 -0
  108. package/dist/web/middleware/error-handler.d.ts.map +1 -0
  109. package/dist/web/middleware/error-handler.js +68 -0
  110. package/dist/web/middleware/error-handler.js.map +1 -0
  111. package/dist/web/middleware/index.d.ts +6 -0
  112. package/dist/web/middleware/index.d.ts.map +1 -0
  113. package/dist/web/middleware/index.js +19 -0
  114. package/dist/web/middleware/index.js.map +1 -0
  115. package/dist/web/middleware/rate-limit.d.ts +38 -0
  116. package/dist/web/middleware/rate-limit.d.ts.map +1 -0
  117. package/dist/web/middleware/rate-limit.js +116 -0
  118. package/dist/web/middleware/rate-limit.js.map +1 -0
  119. package/dist/web/middleware/request-logger.d.ts +52 -0
  120. package/dist/web/middleware/request-logger.d.ts.map +1 -0
  121. package/dist/web/middleware/request-logger.js +74 -0
  122. package/dist/web/middleware/request-logger.js.map +1 -0
  123. package/dist/web/types.d.ts +6 -0
  124. package/dist/web/types.d.ts.map +1 -0
  125. package/dist/web/types.js +4 -0
  126. package/dist/web/types.js.map +1 -0
  127. package/package.json +135 -0
  128. package/skills/rag-vault/SKILL.md +111 -0
  129. package/skills/rag-vault/references/html-ingestion.md +73 -0
  130. package/skills/rag-vault/references/query-optimization.md +57 -0
  131. package/skills/rag-vault/references/result-refinement.md +54 -0
@@ -0,0 +1,111 @@
1
+ ---
2
+ name: rag-vault
3
+ description: This skill should be used when the user asks to "search documents", "query RAG", "ingest file", "ingest PDF", "save web page", "add to knowledge base", or mentions document search, semantic search, vector search, or RAG operations. Provides score interpretation (< 0.3 good, > 0.5 skip), query optimization, and ingestion guidance for query_documents, ingest_file, ingest_data tools.
4
+ version: 1.0.0
5
+ ---
6
+
7
+ # RAG Vault Skills
8
+
9
+ ## Tools
10
+
11
+ | Tool | Use When |
12
+ |------|----------|
13
+ | `ingest_file` | Local files (PDF, DOCX, TXT, MD, JSON, JSONL) |
14
+ | `ingest_data` | Raw content (HTML, text) with source URL |
15
+ | `query_documents` | Semantic + keyword hybrid search |
16
+ | `delete_file` / `list_files` / `status` | Management |
17
+
18
+ ## Search: Core Rules
19
+
20
+ Hybrid search combines vector (semantic) and keyword (BM25).
21
+
22
+ ### Score Interpretation
23
+
24
+ Lower = better match. Use this to filter noise.
25
+
26
+ | Score | Action |
27
+ |-------|--------|
28
+ | < 0.3 | Use directly |
29
+ | 0.3-0.5 | Include if mentions same concept/entity |
30
+ | > 0.5 | Skip unless no better results |
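The score table above can be read as a simple threshold function. The sketch below is illustrative only: `scoreAction` and its return labels are hypothetical names, not part of the package, and it assumes a score of exactly 0.5 falls in the middle band.

```typescript
// Hypothetical helper (not part of rag-vault): maps a result score to the
// action suggested by the score table. Lower score = better match.
type ScoreAction = "use-directly" | "include-if-same-concept" | "skip-unless-nothing-better";

function scoreAction(score: number): ScoreAction {
  if (score < 0.3) return "use-directly";
  // Assumption: the 0.3-0.5 band is inclusive at both ends.
  if (score <= 0.5) return "include-if-same-concept";
  return "skip-unless-nothing-better";
}
```

The middle band still needs a judgment call about whether the chunk mentions the same concept or entity; the function only narrows down which results need that check.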
31
+
32
+ ### Limit Selection
33
+
34
+ | Intent | Limit |
35
+ |--------|-------|
36
+ | Specific answer (function, error) | 5 |
37
+ | General understanding | 10 |
38
+ | Comprehensive survey | 20 |
39
+
40
+ ### Query Formulation
41
+
42
+ | Situation | Why Transform | Action |
43
+ |-----------|---------------|--------|
44
+ | Specific term mentioned | Keyword search needs exact match | KEEP term |
45
+ | Vague query | Vector search needs semantic signal | ADD context |
46
+ | Error stack or code block | Long text dilutes relevance | EXTRACT core keywords |
47
+ | Multiple distinct topics | Single query conflates results | SPLIT queries |
48
+ | Few/poor results | Term mismatch | EXPAND (see below) |
49
+
50
+ ### Query Expansion
51
+
52
+ When results are few or all score > 0.5, expand query terms:
53
+
54
+ - Keep original term first, add 2-4 variants
55
+ - Types: synonyms, abbreviations, related terms, word forms
56
+ - Example: `"config"` → `"config configuration settings configure"`
57
+
58
+ Avoid over-expansion (causes topic drift).
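As a rough sketch of the expansion rules above (original term first, a capped number of variants), something like the following could build the expanded query string. `expandQuery` is a hypothetical helper, and the choice of variants is left to the caller; nothing here is taken from the package itself.

```typescript
// Hypothetical helper (not part of rag-vault): builds an expanded query with
// the original term first, followed by at most `maxVariants` distinct variants.
function expandQuery(original: string, variants: string[], maxVariants = 4): string {
  const seen = new Set<string>([original.trim().toLowerCase()]);
  const kept: string[] = [];
  for (const variant of variants) {
    const trimmed = variant.trim();
    const key = trimmed.toLowerCase();
    // Skip blanks and duplicates of the original or of earlier variants.
    if (key === "" || seen.has(key)) continue;
    seen.add(key);
    kept.push(trimmed);
    // Cap the additions to limit topic drift.
    if (kept.length === maxVariants) break;
  }
  return [original.trim(), ...kept].join(" ");
}

const expanded = expandQuery("config", ["configuration", "settings", "configure"]);
```

With the inputs shown, `expanded` reproduces the example query from the list above.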
59
+
60
+ ### Result Selection
61
+
62
+ Decide whether to include or skip a result based on answer quality, not just score.
63
+
64
+ **INCLUDE** if:
65
+ - Directly answers the question
66
+ - Provides necessary context
67
+ - Score < 0.5
68
+
69
+ **SKIP** if:
70
+ - Same keyword, unrelated context
71
+ - Score > 0.7
72
+ - Mentions term without explanation
73
+
74
+ ## Ingestion
75
+
76
+ ### ingest_file
77
+ ```
78
+ ingest_file({ filePath: "/absolute/path/to/document.pdf" })
79
+ ```
80
+
81
+ ### ingest_data
82
+ ```
83
+ ingest_data({
84
+ content: "<html>...</html>",
85
+ metadata: { source: "https://example.com/page", format: "html" }
86
+ })
87
+ ```
88
+
89
+ **Format selection** — match the data you have:
90
+ - HTML string → `format: "html"`
91
+ - Markdown string → `format: "markdown"`
92
+ - Other → `format: "text"`
93
+
94
+ **Source format:**
95
+ - Web page → Use URL: `https://example.com/page`
96
+ - Other content → Use scheme: `{type}://{date}` or `{type}://{date}/{detail}`
97
+ - Examples: `clipboard://2024-12-30`, `chat://2024-12-30/project-discussion`
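For non-web content, the `{type}://{date}` scheme above could be generated with a small helper. This is a sketch under assumptions: `buildSource` is a hypothetical name, and the date is formatted as `YYYY-MM-DD` in UTC to match the examples.

```typescript
// Hypothetical helper (not part of rag-vault): builds a source identifier of
// the form {type}://{date} or {type}://{date}/{detail}.
function buildSource(type: string, date: Date, detail?: string): string {
  // toISOString() yields e.g. "2024-12-30T00:00:00.000Z"; keep the date part (UTC).
  const day = date.toISOString().slice(0, 10);
  return detail ? `${type}://${day}/${detail}` : `${type}://${day}`;
}

const clipboardSource = buildSource("clipboard", new Date("2024-12-30T00:00:00Z"));
const chatSource = buildSource("chat", new Date("2024-12-30T00:00:00Z"), "project-discussion");
```

Using a stable, predictable source string matters because the same value is what identifies the entry when it is re-ingested or deleted.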
98
+
99
+ **HTML source options:**
100
+ - Static page → LLM fetch
101
+ - SPA/JS-rendered → Browser MCP
102
+ - Auth required → Manual paste
103
+
104
+ Re-ingest with the same source to update. Pass the same source to `delete_file` to remove it.
105
+
106
+ ## References
107
+
108
+ For edge cases and examples:
109
+ - [html-ingestion.md](references/html-ingestion.md) - URL normalization, SPA handling
110
+ - [query-optimization.md](references/query-optimization.md) - Query patterns by intent
111
+ - [result-refinement.md](references/result-refinement.md) - Contradiction resolution, chunking
@@ -0,0 +1,73 @@
1
+ # HTML Ingestion Reference
2
+
3
+ Basic usage is in SKILL.md. This covers URL handling and edge cases.
4
+
5
+ ## System Behavior
6
+
7
+ The parser extracts main content only: navigation, ads, and boilerplate are stripped. What gets indexed is clean body text, not the full HTML.
8
+
9
+ ## When to Use Each Source Method
10
+
11
+ | Source Type | Method | Why |
12
+ |-------------|--------|-----|
13
+ | Static page, public | LLM fetch | Simplest, no extra tools |
14
+ | SPA / JS-rendered | Browser MCP | Need rendered DOM |
15
+ | Auth required | Manual paste | Can't fetch programmatically |
16
+
17
+ ## URL Normalization
18
+
19
+ The system strips query strings and fragments from the source URL:
20
+ ```
21
+ https://example.com/page?utm=x#section → https://example.com/page
22
+ ```
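The normalization described above can be approximated with plain string handling. The sketch below illustrates the stated behavior only; `normalizeSourceUrl` and `stripAfter` are hypothetical names, and the package's actual implementation may differ.

```typescript
// Hypothetical helpers (not part of rag-vault): illustrate stripping the
// fragment and query string from a source URL.
function stripAfter(value: string, marker: string): string {
  const index = value.indexOf(marker);
  return index === -1 ? value : value.slice(0, index);
}

function normalizeSourceUrl(raw: string): string {
  // Remove the fragment first, since a "?" may legally appear inside it.
  return stripAfter(stripAfter(raw, "#"), "?");
}

const normalized = normalizeSourceUrl("https://example.com/page?utm=x#section");
```

Here `normalized` matches the right-hand side of the example above. Two URLs that differ only in query string collapse to the same value, which is why paginated pages need extra care.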
23
+
24
+ **When query strings matter** (pagination, dynamic IDs):
25
+ ```
26
+ ingest_data({
27
+ content: page1_html,
28
+ metadata: { source: "https://example.com/results?page=1", format: "html" }
29
+ })
30
+ ```
31
+ Explicitly pass the full URL, including the query string, as `source`.
32
+
33
+ ## Edge Cases
34
+
35
+ ### Empty/Minimal Extraction
36
+
37
+ Why it happens:
38
+ - JS-rendered content (use browser MCP)
39
+ - Non-standard HTML structure
40
+ - Login required
41
+
42
+ ### SPA/Dynamic Content
43
+
44
+ 1. Use browser MCP to render
45
+ 2. Wait for content load
46
+ 3. Extract rendered HTML
47
+ 4. Ingest via `ingest_data`
48
+
49
+ ### Pages with Only Navigation
50
+
51
+ Skip or fetch deeper linked pages instead.
52
+
53
+ ## Updating Content
54
+
55
+ Re-ingest with same source to replace:
56
+ ```
57
+ ingest_data({
58
+ content: updated_html,
59
+ metadata: { source: "https://example.com/page", format: "html" }
60
+ })
61
+ ```
62
+
63
+ ## Search Results
64
+
65
+ Results from ingested HTML include a `source` field:
66
+ ```json
67
+ {
68
+ "filePath": "raw-data/abc123.md",
69
+ "source": "https://example.com/page",
70
+ "text": "...",
71
+ "score": 0.25
72
+ }
73
+ ```
@@ -0,0 +1,57 @@
1
+ # Query Optimization Reference
2
+
3
+ Core rules are in SKILL.md. This covers patterns and edge cases.
4
+
5
+ ## Query Patterns by Intent
6
+
7
+ | User Intent | Query Pattern | Why |
8
+ |-------------|---------------|-----|
9
+ | Definition/Concept | `"[term] definition concept"` | Targets explanatory content |
10
+ | How-To/Procedure | `"[action] steps example usage"` | Targets instructional content |
11
+ | API/Function | `"[function] API arguments return"` | Targets reference docs |
12
+ | Troubleshooting | `"[error] fix solution cause"` | Targets problem-solving content |
13
+
14
+ ## Multi-Query: When to Split
15
+
16
+ **Split** when "and" connects distinct topics:
17
+ ```
18
+ "How do I authenticate AND handle errors?"
19
+ → Query 1: "authentication login JWT session"
20
+ → Query 2: "error handling exception catch"
21
+ ```
22
+
23
+ **Don't split** when "and" is within a single topic:
24
+ ```
25
+ "How do I set up and configure the database?"
26
+ → Single: "database setup configuration"
27
+ ```
28
+
29
+ ## Query Expansion Examples
30
+
31
+ When results are few or all score > 0.5:
32
+
33
+ | Type | Original | Expanded |
34
+ |------|----------|----------|
35
+ | Synonyms | delete | "delete remove" |
36
+ | Abbreviations | API | "API Application Programming Interface" |
37
+ | Related terms | auth | "auth authentication login" |
38
+ | Word forms | config | "config configuration configure" |
39
+
40
+ Keep original term first. Limit to 2-4 additions.
41
+
42
+ ## Iterative Refinement
43
+
44
+ When initial results are unsatisfactory:
45
+
46
+ | Problem | Why It Happens | Action |
47
+ |---------|----------------|--------|
48
+ | Too few results | Term mismatch | Expand query (see above) |
49
+ | Too many irrelevant | Query too broad | Add specific terms |
50
+ | Missing expected | Phrasing mismatch | Try alternative wording |
51
+
52
+ ## Language Mixing
53
+
54
+ Ngram tokenization supports cross-language queries:
55
+ ```
56
+ "API error handling" → matches both English and Japanese content
57
+ ```
@@ -0,0 +1,54 @@
1
+ # Result Refinement Reference
2
+
3
+ Core rules (score, include/skip) are in SKILL.md. This covers when and how to combine multiple results.
4
+
5
+ ## When to Synthesize vs Filter
6
+
7
+ Match approach to user intent:
8
+
9
+ | User Intent | Approach | Why |
10
+ |-------------|----------|-----|
11
+ | Specific answer ("how to X") | Filter to 1-2 best | Extra results add noise |
12
+ | Understanding a topic | Synthesize multiple | Builds complete picture |
13
+ | Troubleshooting error | Filter to direct cause | Tangential info confuses |
14
+ | Comparing options | Synthesize with structure | Need all perspectives |
15
+
16
+ ## Multiple Results Handling
17
+
18
+ ### Synthesis
19
+
20
+ When: User needs comprehensive understanding.
21
+
22
+ ```
23
+ Result 1: "API accepts JSON..."
24
+ Result 2: "Auth uses Bearer tokens..."
25
+ → Combine into unified answer
26
+ ```
27
+
28
+ ### Deduplication
29
+
30
+ When: Results overlap significantly.
31
+
32
+ 1. Pick most complete result
33
+ 2. Add only unique info from others
34
+
35
+ ### Contradiction Resolution
36
+
37
+ When: Results conflict.
38
+
39
+ Priority: Lower score (= better match)
40
+ If unresolved → Note discrepancy to user
41
+
42
+ ## Chunk Context
43
+
44
+ Single chunks may lack context ("as described above").
45
+
46
+ - Note when information is partial
47
+ - Group multiple chunks from same `filePath` as coherent sections
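Grouping chunks by `filePath`, as suggested above, could look like the following sketch. The `Chunk` shape is an assumption based on the `filePath`, `text`, and `score` fields shown in the search-result example; `groupByFilePath` is a hypothetical helper, not a package API.

```typescript
// Assumed result shape, based on the fields shown in the search-result example.
interface Chunk {
  filePath: string;
  text: string;
  score: number;
}

// Hypothetical helper (not part of rag-vault): groups result chunks by file,
// preserving the order in which results were returned.
function groupByFilePath(results: Chunk[]): Map<string, Chunk[]> {
  const groups = new Map<string, Chunk[]>();
  for (const result of results) {
    const group = groups.get(result.filePath);
    if (group) {
      group.push(result);
    } else {
      groups.set(result.filePath, [result]);
    }
  }
  return groups;
}

const grouped = groupByFilePath([
  { filePath: "docs/api.md", text: "API accepts JSON...", score: 0.21 },
  { filePath: "docs/auth.md", text: "Auth uses Bearer tokens...", score: 0.28 },
  { filePath: "docs/api.md", text: "as described above...", score: 0.35 },
]);
```

Reading the grouped chunks together makes it easier to spot when a chunk such as "as described above" depends on a sibling chunk from the same file.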
48
+
49
+ ## No Results
50
+
51
+ 1. Rephrase query (alternative terms)
52
+ 2. Broaden scope
53
+ 3. Check ingestion (`list_files`)
54
+ 4. Inform user: no matching content