@robthepcguy/rag-vault 1.5.0 → 1.5.1

Files changed (84)
  1. package/LICENSE +0 -0
  2. package/README.md +1 -0
  3. package/dist/bin/install-skills.d.ts +0 -0
  4. package/dist/bin/install-skills.js +0 -0
  5. package/dist/chunker/index.d.ts +0 -0
  6. package/dist/chunker/index.js +0 -0
  7. package/dist/chunker/semantic-chunker.d.ts +0 -0
  8. package/dist/chunker/semantic-chunker.js +0 -0
  9. package/dist/chunker/sentence-splitter.d.ts +0 -0
  10. package/dist/chunker/sentence-splitter.js +0 -0
  11. package/dist/embedder/index.d.ts +0 -0
  12. package/dist/embedder/index.js +0 -0
  13. package/dist/errors/index.d.ts +0 -0
  14. package/dist/errors/index.js +0 -0
  15. package/dist/explainability/index.d.ts +0 -0
  16. package/dist/explainability/index.js +0 -0
  17. package/dist/explainability/keywords.d.ts +0 -0
  18. package/dist/explainability/keywords.js +0 -0
  19. package/dist/flywheel/feedback.d.ts +0 -0
  20. package/dist/flywheel/feedback.js +0 -0
  21. package/dist/flywheel/index.d.ts +0 -0
  22. package/dist/flywheel/index.js +0 -0
  23. package/dist/index.d.ts +0 -0
  24. package/dist/parser/html-parser.d.ts +0 -0
  25. package/dist/parser/html-parser.js +0 -0
  26. package/dist/parser/index.d.ts +0 -0
  27. package/dist/parser/index.js +0 -0
  28. package/dist/parser/pdf-filter.d.ts +0 -0
  29. package/dist/parser/pdf-filter.js +0 -0
  30. package/dist/query/index.d.ts +0 -0
  31. package/dist/query/index.js +0 -0
  32. package/dist/query/parser.d.ts +0 -0
  33. package/dist/query/parser.js +0 -0
  34. package/dist/server/index.d.ts +0 -0
  35. package/dist/server/index.js +0 -0
  36. package/dist/server/raw-data-utils.d.ts +0 -0
  37. package/dist/server/raw-data-utils.js +0 -0
  38. package/dist/server/schemas.d.ts +0 -0
  39. package/dist/server/schemas.js +0 -0
  40. package/dist/utils/config-parsers.d.ts +0 -0
  41. package/dist/utils/config-parsers.js +0 -0
  42. package/dist/utils/config.d.ts +0 -0
  43. package/dist/utils/config.js +0 -0
  44. package/dist/utils/file-utils.d.ts +0 -0
  45. package/dist/utils/file-utils.js +0 -0
  46. package/dist/utils/math.d.ts +0 -0
  47. package/dist/utils/math.js +0 -0
  48. package/dist/utils/process-handlers.d.ts +0 -0
  49. package/dist/utils/process-handlers.js +0 -0
  50. package/dist/vectordb/index.d.ts +0 -0
  51. package/dist/vectordb/index.js +12 -12
  52. package/dist/web/api-routes.d.ts +0 -0
  53. package/dist/web/api-routes.js +0 -0
  54. package/dist/web/config-routes.d.ts +0 -0
  55. package/dist/web/config-routes.js +0 -0
  56. package/dist/web/database-manager.d.ts +0 -0
  57. package/dist/web/database-manager.js +0 -0
  58. package/dist/web/http-server.d.ts +0 -0
  59. package/dist/web/http-server.js +0 -0
  60. package/dist/web/index.d.ts +0 -0
  61. package/dist/web/index.js +0 -0
  62. package/dist/web/middleware/async-handler.d.ts +0 -0
  63. package/dist/web/middleware/async-handler.js +0 -0
  64. package/dist/web/middleware/auth.d.ts +0 -0
  65. package/dist/web/middleware/auth.js +0 -0
  66. package/dist/web/middleware/error-handler.d.ts +0 -0
  67. package/dist/web/middleware/error-handler.js +0 -0
  68. package/dist/web/middleware/index.d.ts +0 -0
  69. package/dist/web/middleware/index.js +0 -0
  70. package/dist/web/middleware/rate-limit.d.ts +0 -0
  71. package/dist/web/middleware/rate-limit.js +0 -0
  72. package/dist/web/middleware/request-logger.d.ts +0 -0
  73. package/dist/web/middleware/request-logger.js +0 -0
  74. package/dist/web/types.d.ts +0 -0
  75. package/dist/web/types.js +0 -0
  76. package/package.json +37 -50
  77. package/skills/rag-vault/SKILL.md +111 -111
  78. package/skills/rag-vault/references/html-ingestion.md +73 -73
  79. package/skills/rag-vault/references/query-optimization.md +57 -57
  80. package/skills/rag-vault/references/result-refinement.md +54 -54
  81. package/web-ui/dist/assets/index-SBHxoAwi.js +0 -0
  82. package/web-ui/dist/assets/index-ej8i4PGl.css +0 -0
  83. package/web-ui/dist/index.html +0 -0
  84. package/web-ui/dist/vite.svg +0 -0
package/LICENSE CHANGED
File without changes
package/README.md CHANGED
@@ -397,6 +397,7 @@ Copy the `DB_PATH` directory (default: `./lancedb/`).
  | File too large | Default limit is 100MB. Set `MAX_FILE_SIZE` higher or split the file. |
  | Path outside BASE_DIR | All file paths must be under `BASE_DIR`. Use absolute paths. |
  | MCP tools not showing | Verify config syntax, restart your AI tool completely (Cmd+Q on Mac). |
+ | `mcp-publisher login github` fails with `slow_down` | Use token login instead: `mcp-publisher login github --token "$(gh auth token)"` (or pass a PAT). |
  | 401 Unauthorized | API key required. Set `RAG_API_KEY` or use correct header format. |
  | 429 Too Many Requests | Rate limited. Wait for reset or increase `RATE_LIMIT_MAX_REQUESTS`. |
  | CORS errors | Add your origin to `CORS_ORIGINS` environment variable. |
package/dist/index.d.ts CHANGED
File without changes
package/dist/vectordb/index.js CHANGED
@@ -323,15 +323,15 @@ class VectorStore {
  if (tableNames.includes(this.config.tableName)) {
  // Open existing table
  this.table = await this.db.openTable(this.config.tableName);
- console.log(`VectorStore: Opened existing table "${this.config.tableName}"`);
+ console.error(`VectorStore: Opened existing table "${this.config.tableName}"`);
  // Ensure FTS index exists (migration for existing databases)
  await this.ensureFtsIndex();
  }
  else {
  // Create new table (schema auto-defined on first data insertion)
- console.log(`VectorStore: Table "${this.config.tableName}" will be created on first data insertion`);
+ console.error(`VectorStore: Table "${this.config.tableName}" will be created on first data insertion`);
  }
- console.log(`VectorStore initialized: ${this.config.dbPath}`);
+ console.error(`VectorStore initialized: ${this.config.dbPath}`);
  }
  catch (error) {
  // Clean up partially initialized resources on failure
@@ -365,7 +365,7 @@ class VectorStore {
  async deleteChunks(filePath) {
  if (!this.table) {
  // If table doesn't exist, no deletion targets, return normally
- console.log('VectorStore: Skipping deletion as table does not exist');
+ console.error('VectorStore: Skipping deletion as table does not exist');
  return;
  }
  // Validate file path before use in query to prevent SQL injection
@@ -381,7 +381,7 @@ class VectorStore {
  // so call delete directly
  // Note: Field names are case-sensitive, use backticks for camelCase fields
  await this.table.delete(`\`filePath\` = '${escapedFilePath}'`);
- console.log(`VectorStore: Deleted chunks for file "${filePath}"`);
+ console.error(`VectorStore: Deleted chunks for file "${filePath}"`);
  // Rebuild FTS index after deleting data
  await this.rebuildFtsIndex();
  }
@@ -435,7 +435,7 @@ class VectorStore {
  // Convert to LanceDB record format using explicit field mapping
  const records = chunksWithFingerprints.map(toDbRecord);
  this.table = await this.db.createTable(this.config.tableName, records);
- console.log(`VectorStore: Created table "${this.config.tableName}"`);
+ console.error(`VectorStore: Created table "${this.config.tableName}"`);
  // Create FTS index for hybrid search
  await this.ensureFtsIndex();
  })();
@@ -445,7 +445,7 @@ class VectorStore {
  finally {
  this.tableCreationPromise = null;
  }
- console.log(`VectorStore: Inserted ${chunks.length} chunks`);
+ console.error(`VectorStore: Inserted ${chunks.length} chunks`);
  return;
  }
  }
@@ -454,7 +454,7 @@ class VectorStore {
  await this.table.add(records);
  // Rebuild FTS index after adding new data
  await this.rebuildFtsIndex();
- console.log(`VectorStore: Inserted ${chunks.length} chunks`);
+ console.error(`VectorStore: Inserted ${chunks.length} chunks`);
  }
  catch (error) {
  throw new index_js_1.DatabaseError('Failed to insert chunks', error);
@@ -492,12 +492,12 @@ class VectorStore {
  name: FTS_INDEX_NAME,
  });
  this.ftsEnabled = true;
- console.log(`VectorStore: FTS index "${FTS_INDEX_NAME}" created successfully`);
+ console.error(`VectorStore: FTS index "${FTS_INDEX_NAME}" created successfully`);
  // Drop old FTS indices
  for (const idx of existingFtsIndices) {
  if (idx.name !== FTS_INDEX_NAME) {
  await this.table.dropIndex(idx.name);
- console.log(`VectorStore: Dropped old FTS index "${idx.name}"`);
+ console.error(`VectorStore: Dropped old FTS index "${idx.name}"`);
  }
  }
  }
@@ -579,7 +579,7 @@ class VectorStore {
  */
  async search(queryVector, queryText, limit = 10) {
  if (!this.table) {
- console.log('VectorStore: Returning empty results as table does not exist');
+ console.error('VectorStore: Returning empty results as table does not exist');
  return [];
  }
  if (limit < 1 || limit > 20) {
@@ -779,7 +779,7 @@ class VectorStore {
  this.ftsEnabled = false;
  this.ftsFailureCount = 0;
  this.ftsLastFailure = null;
- console.log('VectorStore: Connection closed');
+ console.error('VectorStore: Connection closed');
  // Propagate errors to caller after cleanup is complete
  if (errors.length > 0) {
  throw new index_js_1.DatabaseError(`Errors during close: ${errors.map((e) => e.message).join('; ')}`, errors[0]);
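Every hunk in this file swaps `console.log` for `console.error`. The diff does not state the motivation. A plausible reading, given that the package describes itself as an MCP server, is that stdout has to stay reserved for protocol messages when the server runs over a stdio transport, so diagnostics are routed to stderr. The sketch below illustrates that pattern only; `formatLog` and `logInfo` are hypothetical names and are not part of the package.

```javascript
// Hypothetical helpers (not from rag-vault): keep stdout free for protocol
// traffic by sending every diagnostic line to stderr.
function formatLog(component, message) {
  return `${component}: ${message}`;
}

function logInfo(component, message) {
  // In Node.js, console.error writes to process.stderr, so stdout is untouched.
  console.error(formatLog(component, message));
}

logInfo('VectorStore', 'Connection closed');
```

Routing through a single helper like this would also make it harder to reintroduce a stray `console.log` later, though the package itself calls `console.error` directly.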
package/dist/web/index.js CHANGED
File without changes
package/dist/web/types.js CHANGED
File without changes
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@robthepcguy/rag-vault",
- "version": "1.5.0",
+ "version": "1.5.1",
  "description": "Local RAG MCP Server - Easy-to-setup document search with minimal configuration",
  "main": "dist/index.js",
  "bin": {
@@ -41,42 +41,6 @@
  "type": "git",
  "url": "git+https://github.com/RobThePCGuy/rag-vault.git"
  },
- "scripts": {
- "build": "tsc -p tsconfig.build.json && tsc-alias -p tsconfig.build.json",
- "check": "pnpm type-check && pnpm lint && pnpm format:check",
- "check:all": "pnpm check && pnpm check:web-ui && pnpm check:unused && pnpm check:deps && pnpm build && pnpm test:unit",
- "check:deps": "madge --circular --extensions ts src",
- "check:deps:graph": "madge --extensions ts --image graph.svg src",
- "check:web-ui": "pnpm --prefix web-ui check",
- "check:unused": "node scripts/check-unused-exports.js",
- "check:unused:all": "knip",
- "cleanup:processes": "bash ./scripts/cleanup-test-processes.sh",
- "clean:dev": "rm -rf ./node_modules ./tmp ./uploads ./models ./lancedb ./dist ./package-lock.json && cd web-ui && rm -rf ./dist ./node_modules ./package-lock.json",
- "dev": "tsx src/index.ts",
- "format": "biome format --write src",
- "format:check": "biome format src",
- "lint": "biome lint src",
- "lint:fix": "biome lint --write src",
- "start": "node dist/index.js",
- "test": "vitest run",
- "test:coverage": "vitest run --coverage",
- "test:safe": "pnpm test && pnpm cleanup:processes",
- "test:watch": "vitest",
- "type-check": "tsc --noEmit",
- "audit": "pnpm audit --audit-level=moderate",
- "audit:fix": "pnpm audit --fix",
- "setup:web": "pnpm install && pnpm web:build && pnpm --prefix web-ui install && pnpm ui:build && pnpm web:start",
- "ui:build": "pnpm --prefix web-ui build",
- "ui:dev": "cd web-ui && pnpm dev",
- "web:build": "pnpm build",
- "web:dev": "concurrently -n api,ui -c blue,magenta \"pnpm web:watch\" \"pnpm --prefix web-ui dev\"",
- "web:start": "node dist/web/index.js",
- "web:watch": "tsx watch src/web/index.ts",
- "web": "tsx src/web/index.ts",
- "test:unit": "vitest run --project backend-unit --project web-ui",
- "test:integration": "RUN_EMBEDDING_INTEGRATION=1 vitest run --project backend-integration",
- "hooks:install": "git config core.hooksPath .githooks"
- },
  "dependencies": {
  "@huggingface/transformers": "^3.7.6",
  "@lancedb/lancedb": "^0.23.0",
@@ -118,17 +82,40 @@
  "node": ">=20"
  },
  "mcpName": "io.github.RobThePCGuy/rag-vault",
- "pnpm": {
- "overrides": {
- "tar": ">=7.5.7",
- "diff": ">=4.0.4"
- },
- "onlyBuiltDependencies": [
- "esbuild",
- "onnxruntime-node",
- "protobufjs",
- "@robthepcguy/rag-vault",
- "sharp"
- ]
+ "scripts": {
+ "build": "tsc -p tsconfig.build.json && tsc-alias -p tsconfig.build.json",
+ "check": "pnpm type-check && pnpm lint && pnpm format:check",
+ "check:all": "pnpm check && pnpm check:web-ui && pnpm check:unused && pnpm check:deps && pnpm build && pnpm test:unit",
+ "check:deps": "madge --circular --extensions ts src",
+ "check:deps:graph": "madge --extensions ts --image graph.svg src",
+ "check:web-ui": "pnpm --prefix web-ui check",
+ "check:unused": "node scripts/check-unused-exports.js",
+ "check:unused:all": "knip",
+ "cleanup:processes": "bash ./scripts/cleanup-test-processes.sh",
+ "clean:dev": "rm -rf ./node_modules ./tmp ./uploads ./models ./lancedb ./dist ./package-lock.json && cd web-ui && rm -rf ./dist ./node_modules ./package-lock.json",
+ "dev": "tsx src/index.ts",
+ "format": "biome format --write src",
+ "format:check": "biome format src",
+ "lint": "biome lint src",
+ "lint:fix": "biome lint --write src",
+ "start": "node dist/index.js",
+ "test": "vitest run",
+ "test:coverage": "vitest run --coverage",
+ "test:safe": "pnpm test && pnpm cleanup:processes",
+ "test:watch": "vitest",
+ "type-check": "tsc --noEmit",
+ "audit": "pnpm audit --audit-level=moderate",
+ "audit:fix": "pnpm audit --fix",
+ "setup:web": "pnpm install && pnpm web:build && pnpm --prefix web-ui install && pnpm ui:build && pnpm web:start",
+ "ui:build": "pnpm --prefix web-ui build",
+ "ui:dev": "cd web-ui && pnpm dev",
+ "web:build": "pnpm build",
+ "web:dev": "concurrently -n api,ui -c blue,magenta \"pnpm web:watch\" \"pnpm --prefix web-ui dev\"",
+ "web:start": "node dist/web/index.js",
+ "web:watch": "tsx watch src/web/index.ts",
+ "web": "tsx src/web/index.ts",
+ "test:unit": "vitest run --project backend-unit --project web-ui",
+ "test:integration": "RUN_EMBEDDING_INTEGRATION=1 vitest run --project backend-integration",
+ "hooks:install": "git config core.hooksPath .githooks"
  }
- }
+ }
package/skills/rag-vault/SKILL.md CHANGED
@@ -1,111 +1,111 @@
- ---
- name: rag-vault
- description: This skill should be used when the user asks to "search documents", "query RAG", "ingest file", "ingest PDF", "save web page", "add to knowledge base", or mentions document search, semantic search, vector search, or RAG operations. Provides score interpretation (< 0.3 good, > 0.5 skip), query optimization, and ingestion guidance for query_documents, ingest_file, ingest_data tools.
- version: 1.0.0
- ---
-
- # RAG Vault Skills
-
- ## Tools
-
- | Tool | Use When |
- |------|----------|
- | `ingest_file` | Local files (PDF, DOCX, TXT, MD, JSON, JSONL) |
- | `ingest_data` | Raw content (HTML, text) with source URL |
- | `query_documents` | Semantic + keyword hybrid search |
- | `delete_file` / `list_files` / `status` | Management |
-
- ## Search: Core Rules
-
- Hybrid search combines vector (semantic) and keyword (BM25).
-
- ### Score Interpretation
-
- Lower = better match. Use this to filter noise.
-
- | Score | Action |
- |-------|--------|
- | < 0.3 | Use directly |
- | 0.3-0.5 | Include if mentions same concept/entity |
- | > 0.5 | Skip unless no better results |
-
- ### Limit Selection
-
- | Intent | Limit |
- |--------|-------|
- | Specific answer (function, error) | 5 |
- | General understanding | 10 |
- | Comprehensive survey | 20 |
-
- ### Query Formulation
-
- | Situation | Why Transform | Action |
- |-----------|---------------|--------|
- | Specific term mentioned | Keyword search needs exact match | KEEP term |
- | Vague query | Vector search needs semantic signal | ADD context |
- | Error stack or code block | Long text dilutes relevance | EXTRACT core keywords |
- | Multiple distinct topics | Single query conflates results | SPLIT queries |
- | Few/poor results | Term mismatch | EXPAND (see below) |
-
- ### Query Expansion
-
- When results are few or all score > 0.5, expand query terms:
-
- - Keep original term first, add 2-4 variants
- - Types: synonyms, abbreviations, related terms, word forms
- - Example: `"config"` → `"config configuration settings configure"`
-
- Avoid over-expansion (causes topic drift).
-
- ### Result Selection
-
- When to include vs skip—based on answer quality, not just score.
-
- **INCLUDE** if:
- - Directly answers the question
- - Provides necessary context
- - Score < 0.5
-
- **SKIP** if:
- - Same keyword, unrelated context
- - Score > 0.7
- - Mentions term without explanation
-
- ## Ingestion
-
- ### ingest_file
- ```
- ingest_file({ filePath: "/absolute/path/to/document.pdf" })
- ```
-
- ### ingest_data
- ```
- ingest_data({
- content: "<html>...</html>",
- metadata: { source: "https://example.com/page", format: "html" }
- })
- ```
-
- **Format selection** — match the data you have:
- - HTML string → `format: "html"`
- - Markdown string → `format: "markdown"`
- - Other → `format: "text"`
-
- **Source format:**
- - Web page → Use URL: `https://example.com/page`
- - Other content → Use scheme: `{type}://{date}` or `{type}://{date}/{detail}`
- - Examples: `clipboard://2024-12-30`, `chat://2024-12-30/project-discussion`
-
- **HTML source options:**
- - Static page → LLM fetch
- - SPA/JS-rendered → Browser MCP
- - Auth required → Manual paste
-
- Re-ingest same source to update. Use same source in `delete_file` to remove.
-
- ## References
-
- For edge cases and examples:
- - [html-ingestion.md](references/html-ingestion.md) - URL normalization, SPA handling
- - [query-optimization.md](references/query-optimization.md) - Query patterns by intent
- - [result-refinement.md](references/result-refinement.md) - Contradiction resolution, chunking
+ ---
+ name: rag-vault
+ description: This skill should be used when the user asks to "search documents", "query RAG", "ingest file", "ingest PDF", "save web page", "add to knowledge base", or mentions document search, semantic search, vector search, or RAG operations. Provides score interpretation (< 0.3 good, > 0.5 skip), query optimization, and ingestion guidance for query_documents, ingest_file, ingest_data tools.
+ version: 1.0.0
+ ---
+
+ # RAG Vault Skills
+
+ ## Tools
+
+ | Tool | Use When |
+ |------|----------|
+ | `ingest_file` | Local files (PDF, DOCX, TXT, MD, JSON, JSONL) |
+ | `ingest_data` | Raw content (HTML, text) with source URL |
+ | `query_documents` | Semantic + keyword hybrid search |
+ | `delete_file` / `list_files` / `status` | Management |
+
+ ## Search: Core Rules
+
+ Hybrid search combines vector (semantic) and keyword (BM25).
+
+ ### Score Interpretation
+
+ Lower = better match. Use this to filter noise.
+
+ | Score | Action |
+ |-------|--------|
+ | < 0.3 | Use directly |
+ | 0.3-0.5 | Include if mentions same concept/entity |
+ | > 0.5 | Skip unless no better results |
+
+ ### Limit Selection
+
+ | Intent | Limit |
+ |--------|-------|
+ | Specific answer (function, error) | 5 |
+ | General understanding | 10 |
+ | Comprehensive survey | 20 |
+
+ ### Query Formulation
+
+ | Situation | Why Transform | Action |
+ |-----------|---------------|--------|
+ | Specific term mentioned | Keyword search needs exact match | KEEP term |
+ | Vague query | Vector search needs semantic signal | ADD context |
+ | Error stack or code block | Long text dilutes relevance | EXTRACT core keywords |
+ | Multiple distinct topics | Single query conflates results | SPLIT queries |
+ | Few/poor results | Term mismatch | EXPAND (see below) |
+
+ ### Query Expansion
+
+ When results are few or all score > 0.5, expand query terms:
+
+ - Keep original term first, add 2-4 variants
+ - Types: synonyms, abbreviations, related terms, word forms
+ - Example: `"config"` → `"config configuration settings configure"`
+
+ Avoid over-expansion (causes topic drift).
+
+ ### Result Selection
+
+ When to include vs skip—based on answer quality, not just score.
+
+ **INCLUDE** if:
+ - Directly answers the question
+ - Provides necessary context
+ - Score < 0.5
+
+ **SKIP** if:
+ - Same keyword, unrelated context
+ - Score > 0.7
+ - Mentions term without explanation
+
+ ## Ingestion
+
+ ### ingest_file
+ ```
+ ingest_file({ filePath: "/absolute/path/to/document.pdf" })
+ ```
+
+ ### ingest_data
+ ```
+ ingest_data({
+ content: "<html>...</html>",
+ metadata: { source: "https://example.com/page", format: "html" }
+ })
+ ```
+
+ **Format selection** — match the data you have:
+ - HTML string → `format: "html"`
+ - Markdown string → `format: "markdown"`
+ - Other → `format: "text"`
+
+ **Source format:**
+ - Web page → Use URL: `https://example.com/page`
+ - Other content → Use scheme: `{type}://{date}` or `{type}://{date}/{detail}`
+ - Examples: `clipboard://2024-12-30`, `chat://2024-12-30/project-discussion`
+
+ **HTML source options:**
+ - Static page → LLM fetch
+ - SPA/JS-rendered → Browser MCP
+ - Auth required → Manual paste
+
+ Re-ingest same source to update. Use same source in `delete_file` to remove.
+
+ ## References
+
+ For edge cases and examples:
+ - [html-ingestion.md](references/html-ingestion.md) - URL normalization, SPA handling
+ - [query-optimization.md](references/query-optimization.md) - Query patterns by intent
+ - [result-refinement.md](references/result-refinement.md) - Contradiction resolution, chunking
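The Score Interpretation table in SKILL.md can be read as a three-way triage on the hybrid-search score. The sketch below is one possible encoding of that table, assuming results carry a numeric `score` field as in the skill's examples; `scoreAction` and the action labels are hypothetical and do not appear in the package.

```javascript
// Hypothetical helper: map a score to the action from the Score Interpretation
// table (lower score = better match).
function scoreAction(score) {
  if (score < 0.3) return 'use';                        // "< 0.3": use directly
  if (score <= 0.5) return 'include-if-same-concept';   // "0.3-0.5"
  return 'skip-unless-nothing-better';                  // "> 0.5"
}

// Made-up sample results, for illustration only.
const sampleResults = [
  { text: 'chunk A', score: 0.12 },
  { text: 'chunk B', score: 0.41 },
  { text: 'chunk C', score: 0.83 },
];

const triaged = sampleResults.map((r) => ({ ...r, action: scoreAction(r.score) }));
console.log(triaged.map((r) => `${r.text}: ${r.action}`).join('\n'));
```

The middle band still needs a judgment about whether the chunk mentions the same concept or entity, which a threshold alone cannot decide.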
package/skills/rag-vault/references/html-ingestion.md CHANGED
@@ -1,73 +1,73 @@
- # HTML Ingestion Reference
-
- Basic usage is in SKILL.md. This covers URL handling and edge cases.
-
- ## System Behavior
-
- The parser extracts main content only—navigation, ads, and boilerplate are stripped. What gets indexed is clean body text, not the full HTML.
-
- ## When to Use Each Source Method
-
- | Source Type | Method | Why |
- |-------------|--------|-----|
- | Static page, public | LLM fetch | Simplest, no extra tools |
- | SPA / JS-rendered | Browser MCP | Need rendered DOM |
- | Auth required | Manual paste | Can't fetch programmatically |
-
- ## URL Normalization
-
- System strips query strings and fragments:
- ```
- https://example.com/page?utm=x#section → https://example.com/page
- ```
-
- **When query strings matter** (pagination, dynamic IDs):
- ```
- ingest_data({
- content: page1_html,
- metadata: { source: "https://example.com/results?page=1", format: "html" }
- })
- ```
- Explicitly include full URL as source.
-
- ## Edge Cases
-
- ### Empty/Minimal Extraction
-
- Why it happens:
- - JS-rendered content (use browser MCP)
- - Non-standard HTML structure
- - Login required
-
- ### SPA/Dynamic Content
-
- 1. Use browser MCP to render
- 2. Wait for content load
- 3. Extract rendered HTML
- 4. Ingest via `ingest_data`
-
- ### Pages with Only Navigation
-
- Skip or fetch deeper linked pages instead.
-
- ## Updating Content
-
- Re-ingest with same source to replace:
- ```
- ingest_data({
- content: updated_html,
- metadata: { source: "https://example.com/page", format: "html" }
- })
- ```
-
- ## Search Results
-
- Results from HTML include `source` field:
- ```json
- {
- "filePath": "raw-data/abc123.md",
- "source": "https://example.com/page",
- "text": "...",
- "score": 0.25
- }
- ```
+ # HTML Ingestion Reference
+
+ Basic usage is in SKILL.md. This covers URL handling and edge cases.
+
+ ## System Behavior
+
+ The parser extracts main content only—navigation, ads, and boilerplate are stripped. What gets indexed is clean body text, not the full HTML.
+
+ ## When to Use Each Source Method
+
+ | Source Type | Method | Why |
+ |-------------|--------|-----|
+ | Static page, public | LLM fetch | Simplest, no extra tools |
+ | SPA / JS-rendered | Browser MCP | Need rendered DOM |
+ | Auth required | Manual paste | Can't fetch programmatically |
+
+ ## URL Normalization
+
+ System strips query strings and fragments:
+ ```
+ https://example.com/page?utm=x#section → https://example.com/page
+ ```
+
+ **When query strings matter** (pagination, dynamic IDs):
+ ```
+ ingest_data({
+ content: page1_html,
+ metadata: { source: "https://example.com/results?page=1", format: "html" }
+ })
+ ```
+ Explicitly include full URL as source.
+
+ ## Edge Cases
+
+ ### Empty/Minimal Extraction
+
+ Why it happens:
+ - JS-rendered content (use browser MCP)
+ - Non-standard HTML structure
+ - Login required
+
+ ### SPA/Dynamic Content
+
+ 1. Use browser MCP to render
+ 2. Wait for content load
+ 3. Extract rendered HTML
+ 4. Ingest via `ingest_data`
+
+ ### Pages with Only Navigation
+
+ Skip or fetch deeper linked pages instead.
+
+ ## Updating Content
+
+ Re-ingest with same source to replace:
+ ```
+ ingest_data({
+ content: updated_html,
+ metadata: { source: "https://example.com/page", format: "html" }
+ })
+ ```
+
+ ## Search Results
+
+ Results from HTML include `source` field:
+ ```json
+ {
+ "filePath": "raw-data/abc123.md",
+ "source": "https://example.com/page",
+ "text": "...",
+ "score": 0.25
+ }
+ ```
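The URL Normalization section says the system strips query strings and fragments. The package's own implementation is not shown in this diff, so the following is only a sketch of the stated behavior using the standard WHATWG `URL` class available in Node.js; `normalizeSource` is a hypothetical name.

```javascript
// Hypothetical helper (not the package's code): drop the query string and
// fragment from a source URL, as described under "URL Normalization".
function normalizeSource(rawUrl) {
  const url = new URL(rawUrl); // throws a TypeError on an invalid URL
  url.search = '';             // remove "?utm=x"
  url.hash = '';               // remove "#section"
  return url.toString();
}

console.log(normalizeSource('https://example.com/page?utm=x#section'));
// prints "https://example.com/page"
```

When the query string is significant, for instance paginated results, the reference above advises passing the full URL explicitly as `source`.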
package/skills/rag-vault/references/query-optimization.md CHANGED
@@ -1,57 +1,57 @@
- # Query Optimization Reference
-
- Core rules are in SKILL.md. This covers patterns and edge cases.
-
- ## Query Patterns by Intent
-
- | User Intent | Query Pattern | Why |
- |-------------|---------------|-----|
- | Definition/Concept | `"[term] definition concept"` | Targets explanatory content |
- | How-To/Procedure | `"[action] steps example usage"` | Targets instructional content |
- | API/Function | `"[function] API arguments return"` | Targets reference docs |
- | Troubleshooting | `"[error] fix solution cause"` | Targets problem-solving content |
-
- ## Multi-Query: When to Split
-
- **Split** when "and" connects distinct topics:
- ```
- "How do I authenticate AND handle errors?"
- → Query 1: "authentication login JWT session"
- → Query 2: "error handling exception catch"
- ```
-
- **Don't split** when "and" is within single topic:
- ```
- "How do I set up and configure the database?"
- → Single: "database setup configuration"
- ```
-
- ## Query Expansion Examples
-
- When results are few or all score > 0.5:
-
- | Type | Original | Expanded |
- |------|----------|----------|
- | Synonyms | delete | "delete remove" |
- | Abbreviations | API | "API Application Programming Interface" |
- | Related terms | auth | "auth authentication login" |
- | Word forms | config | "config configuration configure" |
-
- Keep original term first. Limit to 2-4 additions.
-
- ## Iterative Refinement
-
- When initial results are unsatisfactory:
-
- | Problem | Why It Happens | Action |
- |---------|----------------|--------|
- | Too few results | Term mismatch | Expand query (see above) |
- | Too many irrelevant | Query too broad | Add specific terms |
- | Missing expected | Phrasing mismatch | Try alternative wording |
-
- ## Language Mixing
-
- Ngram tokenization supports cross-language queries:
- ```
- "API error handling" → matches both English and Japanese content
- ```
+ # Query Optimization Reference
+
+ Core rules are in SKILL.md. This covers patterns and edge cases.
+
+ ## Query Patterns by Intent
+
+ | User Intent | Query Pattern | Why |
+ |-------------|---------------|-----|
+ | Definition/Concept | `"[term] definition concept"` | Targets explanatory content |
+ | How-To/Procedure | `"[action] steps example usage"` | Targets instructional content |
+ | API/Function | `"[function] API arguments return"` | Targets reference docs |
+ | Troubleshooting | `"[error] fix solution cause"` | Targets problem-solving content |
+
+ ## Multi-Query: When to Split
+
+ **Split** when "and" connects distinct topics:
+ ```
+ "How do I authenticate AND handle errors?"
+ → Query 1: "authentication login JWT session"
+ → Query 2: "error handling exception catch"
+ ```
+
+ **Don't split** when "and" is within single topic:
+ ```
+ "How do I set up and configure the database?"
+ → Single: "database setup configuration"
+ ```
+
+ ## Query Expansion Examples
+
+ When results are few or all score > 0.5:
+
+ | Type | Original | Expanded |
+ |------|----------|----------|
+ | Synonyms | delete | "delete remove" |
+ | Abbreviations | API | "API Application Programming Interface" |
+ | Related terms | auth | "auth authentication login" |
+ | Word forms | config | "config configuration configure" |
+
+ Keep original term first. Limit to 2-4 additions.
+
+ ## Iterative Refinement
+
+ When initial results are unsatisfactory:
+
+ | Problem | Why It Happens | Action |
+ |---------|----------------|--------|
+ | Too few results | Term mismatch | Expand query (see above) |
+ | Too many irrelevant | Query too broad | Add specific terms |
+ | Missing expected | Phrasing mismatch | Try alternative wording |
+
+ ## Language Mixing
+
+ Ngram tokenization supports cross-language queries:
+ ```
+ "API error handling" → matches both English and Japanese content
+ ```
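The expansion rule in this reference (original term first, at most 2-4 additions) is simple enough to sketch as a small function. This is an illustration under that rule only; `expandQuery` and its `maxVariants` parameter are hypothetical and are not part of rag-vault.

```javascript
// Hypothetical helper: build an expanded query string with the original term
// first, followed by a capped number of distinct variants.
function expandQuery(term, variants, maxVariants = 4) {
  const seen = new Set([term.toLowerCase()]);
  const extras = [];
  for (const variant of variants) {
    const key = variant.toLowerCase();
    if (seen.has(key)) continue;        // skip duplicates of the term or earlier variants
    seen.add(key);
    extras.push(variant);
    if (extras.length === maxVariants) break; // cap additions to limit topic drift
  }
  return [term, ...extras].join(' ');
}

console.log(expandQuery('config', ['configuration', 'settings', 'configure']));
// prints "config configuration settings configure"
```

Choosing the variants themselves (synonyms, abbreviations, related terms, word forms) remains a manual or model-driven step; the function only enforces ordering and the cap.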
package/skills/rag-vault/references/result-refinement.md CHANGED
@@ -1,54 +1,54 @@
- # Result Refinement Reference
-
- Core rules (score, include/skip) are in SKILL.md. This covers when and how to combine multiple results.
-
- ## When to Synthesize vs Filter
-
- Match approach to user intent:
-
- | User Intent | Approach | Why |
- |-------------|----------|-----|
- | Specific answer ("how to X") | Filter to 1-2 best | Extra results add noise |
- | Understanding a topic | Synthesize multiple | Builds complete picture |
- | Troubleshooting error | Filter to direct cause | Tangential info confuses |
- | Comparing options | Synthesize with structure | Need all perspectives |
-
- ## Multiple Results Handling
-
- ### Synthesis
-
- When: User needs comprehensive understanding.
-
- ```
- Result 1: "API accepts JSON..."
- Result 2: "Auth uses Bearer tokens..."
- → Combine into unified answer
- ```
-
- ### Deduplication
-
- When: Results overlap significantly.
-
- 1. Pick most complete result
- 2. Add only unique info from others
-
- ### Contradiction Resolution
-
- When: Results conflict.
-
- Priority: Lower score (= better match)
- If unresolved → Note discrepancy to user
-
- ## Chunk Context
-
- Single chunks may lack context ("as described above").
-
- - Note when information is partial
- - Group multiple chunks from same `filePath` as coherent sections
-
- ## No Results
-
- 1. Rephrase query (alternative terms)
- 2. Broaden scope
- 3. Check ingestion (`list_files`)
- 4. Inform user: no matching content
+ # Result Refinement Reference
+
+ Core rules (score, include/skip) are in SKILL.md. This covers when and how to combine multiple results.
+
+ ## When to Synthesize vs Filter
+
+ Match approach to user intent:
+
+ | User Intent | Approach | Why |
+ |-------------|----------|-----|
+ | Specific answer ("how to X") | Filter to 1-2 best | Extra results add noise |
+ | Understanding a topic | Synthesize multiple | Builds complete picture |
+ | Troubleshooting error | Filter to direct cause | Tangential info confuses |
+ | Comparing options | Synthesize with structure | Need all perspectives |
+
+ ## Multiple Results Handling
+
+ ### Synthesis
+
+ When: User needs comprehensive understanding.
+
+ ```
+ Result 1: "API accepts JSON..."
+ Result 2: "Auth uses Bearer tokens..."
+ → Combine into unified answer
+ ```
+
+ ### Deduplication
+
+ When: Results overlap significantly.
+
+ 1. Pick most complete result
+ 2. Add only unique info from others
+
+ ### Contradiction Resolution
+
+ When: Results conflict.
+
+ Priority: Lower score (= better match)
+ If unresolved → Note discrepancy to user
+
+ ## Chunk Context
+
+ Single chunks may lack context ("as described above").
+
+ - Note when information is partial
+ - Group multiple chunks from same `filePath` as coherent sections
+
+ ## No Results
+
+ 1. Rephrase query (alternative terms)
+ 2. Broaden scope
+ 3. Check ingestion (`list_files`)
+ 4. Inform user: no matching content
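Two of the rules in this reference, grouping chunks from the same `filePath` and preferring the lower score when results conflict, can be combined in one pass. The sketch below assumes result objects shaped like the JSON sample in html-ingestion.md (`filePath`, `text`, `score`); `groupByFile` is a hypothetical name and the sample data is made up.

```javascript
// Hypothetical helper: group search results by filePath and order each group
// so the lowest (best) score comes first.
function groupByFile(results) {
  const groups = new Map();
  for (const result of results) {
    if (!groups.has(result.filePath)) groups.set(result.filePath, []);
    groups.get(result.filePath).push(result);
  }
  for (const chunks of groups.values()) {
    chunks.sort((a, b) => a.score - b.score); // lower score = better match
  }
  return groups;
}

// Made-up sample results, for illustration only.
const hits = [
  { filePath: 'raw-data/abc123.md', text: 'second-best chunk', score: 0.42 },
  { filePath: 'docs/guide.md', text: 'other file', score: 0.31 },
  { filePath: 'raw-data/abc123.md', text: 'best chunk', score: 0.25 },
];

for (const [filePath, chunks] of groupByFile(hits)) {
  console.log(filePath, chunks.map((c) => c.score));
}
```

If two groups still contradict each other after this ordering, the reference's guidance is to note the discrepancy to the user instead of picking silently.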