rag-skills 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/marketplace.json +14 -0
- package/.claude-plugin/plugin.json +8 -0
- package/CONTRIBUTING.md +210 -0
- package/LICENSE +21 -0
- package/README.md +148 -0
- package/examples/foundational-rag-pipeline.md +104 -0
- package/examples/multi-agent-rag.md +111 -0
- package/examples/production-rag-setup.md +133 -0
- package/package.json +22 -0
- package/scripts/generate-index.py +276 -0
- package/scripts/validate-skills.py +214 -0
- package/skills/chunking/choosing-a-chunking-framework.md +186 -0
- package/skills/chunking/contextual-chunk-headers.md +106 -0
- package/skills/chunking/hierarchical-chunking.md +77 -0
- package/skills/chunking/semantic-chunking.md +78 -0
- package/skills/chunking/sliding-window-chunking.md +82 -0
- package/skills/data-type-handling/rag-for-code-documentation.md +83 -0
- package/skills/data-type-handling/rag-for-multimodal-content.md +83 -0
- package/skills/performance-optimization/optimize-retrieval-latency.md +88 -0
- package/skills/retrieval-strategies/adaptive-retrieval.md +102 -0
- package/skills/retrieval-strategies/context-enrichment-window.md +99 -0
- package/skills/retrieval-strategies/crag-corrective-rag.md +108 -0
- package/skills/retrieval-strategies/explainable-retrieval.md +106 -0
- package/skills/retrieval-strategies/graph-rag.md +107 -0
- package/skills/retrieval-strategies/hybrid-search-bm25-dense.md +81 -0
- package/skills/retrieval-strategies/hyde-hypothetical-document-embeddings.md +91 -0
- package/skills/retrieval-strategies/hype-hypothetical-prompt-embeddings.md +98 -0
- package/skills/retrieval-strategies/multi-pass-retrieval-with-reranking.md +82 -0
- package/skills/retrieval-strategies/query-transformation-strategies.md +93 -0
- package/skills/retrieval-strategies/raptor-hierarchical-retrieval.md +106 -0
- package/skills/retrieval-strategies/self-rag.md +108 -0
- package/skills/vector-databases/choosing-vector-db-by-datatype.md +112 -0
- package/skills/vector-databases/qdrant-for-production-rag.md +88 -0
- package/skills/vector-databases/qdrant-setup-rag.md +86 -0
- package/templates/skill-template.md +53 -0
- package/templates/workflow-template.md +67 -0
@@ -0,0 +1,186 @@
---
title: "Choosing a Chunking Framework"
description: "Select the right chunking framework based on document type, pipeline architecture, and retrieval goals."
allowed-tools:
- Read
- Grep
- Glob
- Bash
category: "chunking"
tags: ["framework-selection", "chonkie", "langchain", "llamaindex", "haystack", "unstructured"]
---

## Overview
Chunking quality depends as much on the framework as on the strategy itself. This guide helps a coding agent choose between chunking frameworks based on the shape of the data, the surrounding RAG stack, and whether the main need is speed, structure-awareness, retrieval quality, or integration simplicity.

## Problem Statement
Teams often pick a chunking framework for the wrong reasons:
- They default to the framework already in the app, even when it has weaker chunking primitives for the data type
- They choose a sophisticated semantic chunker for documents that only need simple deterministic splitting
- They use generic text splitters for PDFs, tables, or code repositories where structure matters more than raw length
- They optimize for chunking features without considering metadata propagation, ingestion workflow, or downstream retrieval

## Key Concepts
- **Framework Fit**: How well the chunking library matches the rest of the ingestion and retrieval stack
- **Structure Awareness**: Whether the framework understands sections, headings, tables, pages, or code blocks
- **Semantic Awareness**: Whether the framework can chunk by meaning rather than only size or delimiters
- **Operational Simplicity**: How easy it is to adopt the framework without introducing a second ingestion system
- **Chunking Depth**: Whether you need only fixed-size chunks or richer patterns like hierarchical, sentence-window, or code-aware chunking

## Implementation Guide

### Step 1: Identify the Dominant Document Type
Decide whether the corpus is mostly plain text, structured documents, code, or messy extracted files.

**Why**: Chunking frameworks differ most sharply in how they handle document structure. The right choice for markdown docs is often wrong for scanned PDFs or source code.

Use this default mapping:
- Plain text and generic prose: favor LangChain or Chonkie
- Metadata-rich RAG pipelines: favor LlamaIndex
- Pipeline-centric search applications: favor Haystack
- Parsed PDFs, office files, and layout-heavy documents: favor Unstructured
- Code-heavy repositories: favor Chonkie first, then LangChain if you need simpler integration

### Step 2: Check Whether Chunking Is a Core Requirement or Just a Utility
Determine whether chunking itself is a strategic part of the system or just one preprocessing step.

**Why**: If chunking is central to recall quality, code structure, or high-throughput ingestion, you want a framework with deeper chunking specialization. If it is incidental, use the framework already governing the RAG pipeline.

Choose this way:
- If chunking sophistication is a competitive advantage, start with Chonkie
- If chunking is just one component in a broader orchestration framework, prefer LangChain, LlamaIndex, or Haystack based on the app stack

### Step 3: Match the Framework to the Retrieval Pattern
Choose a framework whose chunking model supports the retrieval pattern you intend to use.

**Why**: Some frameworks are built around plain chunks, while others are better for hierarchical retrieval, metadata-enriched nodes, or section-preserving document flows.

Recommended fit by retrieval pattern:
- Simple vector search over text: LangChain or Chonkie
- Hierarchical retrieval and parent-child context: LlamaIndex or Haystack
- Section-aware retrieval for reports and long docs: Unstructured or Chonkie recursive chunking
- Code search and AST-aware chunking: Chonkie
- Fine-grained sentence-level retrieval with context reconstruction: LlamaIndex

### Step 4: Use the Framework Decision Matrix
Pick the framework whose strengths best match the workload.

**Why**: A framework should be chosen by tradeoff, not popularity.

| Framework | Best At | Choose It When | Avoid It When |
|----------|---------|----------------|---------------|
| **[Chonkie](https://docs.chonkie.ai/oss/chunkers/overview)** | Specialized chunking strategies, high throughput, semantic and code-aware chunking | Chunking is a first-class problem, you need semantic/code chunkers, or ingestion speed matters | You want the simplest possible default inside an existing LangChain or Haystack app |
| **[LangChain Text Splitters](https://docs.langchain.com/oss/python/integrations/splitters/index)** | Simple, reliable general-purpose chunking tightly integrated with LangChain apps | You already use LangChain or LangGraph and need practical default chunking with minimal extra tooling | You need deeply specialized code, layout, or semantic chunking beyond standard splitter patterns |
| **[LlamaIndex Node Parsers](https://developers.llamaindex.ai/python/framework/module_guides/loading/node_parsers/)** | Metadata-rich nodes, sentence windows, hierarchical parsing, retrieval-aware chunking | You use LlamaIndex ingestion/query pipelines or need chunking that preserves node relationships and retrieval metadata | You only need straightforward standalone chunking without adopting LlamaIndex concepts |
| **[Haystack Preprocessors](https://docs.haystack.deepset.ai/docs/documentsplitter)** | Deterministic pipeline-based splitting for production search systems | You are building around Haystack pipelines and want predictable document preprocessing components | You need more advanced semantic or code-aware chunking than Haystack’s built-in preprocessors provide |
| **[Unstructured Chunking](https://docs.unstructured.io/open-source/core-functionality/chunking)** | Partition-first chunking for PDFs, Office docs, HTML, tables, and layout-heavy content | Your main challenge is document parsing and structural preservation before chunking | Your corpus is already clean text and you do not need layout-aware parsing |
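As a starting point, the matrix can be collapsed into a small helper a coding agent might call before deeper evaluation. The doc-type labels and return values below are this guide's own illustrative names, not any framework's API:

```python
# Illustrative decision helper distilled from the matrix above.
# The category strings and return values are this guide's own labels.

def recommend_framework(doc_type: str, chunking_is_core: bool) -> str:
    """Return a starting-point framework recommendation for a corpus."""
    if doc_type in ("pdf", "office", "html-layout"):
        return "unstructured"   # partition-first, layout-aware
    if doc_type == "code":
        return "chonkie"        # code-aware chunkers
    if chunking_is_core:
        return "chonkie"        # specialized chunking strategies
    # Otherwise, defer to whichever orchestration framework runs the app.
    return "langchain | llamaindex | haystack (match the existing stack)"

print(recommend_framework("pdf", False))   # layout-heavy docs
print(recommend_framework("prose", True))  # chunking as a core requirement
```

Treat the output as a default to challenge with the per-framework guidance below, not a final answer.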
### Step 5: Apply Framework-Specific Guidance
Use the framework only in the situations where it has a clear advantage.

**Why**: Each framework has a distinct operating model. Mixing them without a reason usually adds complexity.

#### Chonkie
Use [Chonkie](https://www.chonkie.ai/) when chunking itself is the focus of the system.

Strong fit:
- You need a dedicated chunking library rather than a general LLM framework
- You want to choose among token, sentence, recursive, semantic, late, fast, or code chunkers
- You have code repositories or technical docs where chunk coherence matters
- You have high-throughput ingestion where fast chunking is important

Especially relevant docs:
- [Chunkers overview](https://docs.chonkie.ai/oss/chunkers/overview)
- [RecursiveChunker](https://docs.chonkie.ai/oss/chunkers/recursive-chunker)
- [SemanticChunker](https://docs.chonkie.ai/oss/chunkers/semantic-chunker)
- [CodeChunker](https://docs.chonkie.ai/oss/chunkers/code-chunker)
- [FastChunker](https://docs.chonkie.ai/oss/chunkers/fast-chunker)

Do not choose Chonkie just because it has more chunkers. Choose it when those chunkers materially improve your corpus handling.
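For context on what a throughput-oriented chunker optimizes, here is a minimal stdlib sketch of fixed-size token chunking with overlap. It is illustrative only (whitespace tokens stand in for model tokens) and is not Chonkie's implementation:

```python
# Minimal fixed-size token chunking with overlap: the baseline that
# dedicated libraries optimize and extend. Whitespace splitting is a
# stand-in "tokenizer" for illustration; real chunkers use model tokenizers.

def chunk_by_tokens(text: str, chunk_size: int = 64, overlap: int = 8) -> list[str]:
    tokens = text.split()
    if not tokens:
        return []
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

doc = ("word " * 200).strip()
chunks = chunk_by_tokens(doc, chunk_size=64, overlap=8)
print(len(chunks))  # 200 tokens, stride 56 -> 4 chunks
```

A specialized library earns its place when it replaces this baseline with semantic, recursive, or code-aware boundaries at comparable speed.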
#### LangChain
Use [LangChain Text Splitters](https://docs.langchain.com/oss/python/integrations/splitters/index) when you want a practical default inside a LangChain or LangGraph application.

Strong fit:
- You already use LangChain loaders, embeddings, retrievers, or agents
- You want an opinionated default like `RecursiveCharacterTextSplitter`
- You need markdown, JSON, HTML, or code splitting without introducing another ingestion framework
- You value integration speed over chunking specialization

Especially relevant docs:
- [Text splitters overview](https://docs.langchain.com/oss/python/integrations/splitters/index)
- [RecursiveCharacterTextSplitter](https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter)
- [CharacterTextSplitter](https://docs.langchain.com/oss/python/integrations/splitters/character_text_splitter)

Do not choose LangChain if you need the strongest code-aware or layout-aware chunking and are willing to use a more specialized tool.

#### LlamaIndex
Use [LlamaIndex Node Parsers](https://developers.llamaindex.ai/python/framework/module_guides/loading/node_parsers/) when chunking is tightly coupled to retrieval behavior and metadata-rich nodes.

Strong fit:
- You want sentence-window retrieval, node relationships, or hierarchical retrieval patterns
- You care about chunk metadata because retrieval and postprocessing depend on it
- You are already using LlamaIndex ingestion pipelines, retrievers, and node postprocessors
- You want chunking that fits directly into a retrieval-aware framework

Especially relevant docs:
- [Node parser usage](https://developers.llamaindex.ai/python/framework/module_guides/loading/node_parsers/)
- [HierarchicalNodeParser](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/hierarchical/)
- [SentenceWindowNodeParser](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_window/)

Do not choose LlamaIndex if you only need a lightweight chunker and do not want its document and node abstractions in the pipeline.

#### Haystack
Use [Haystack preprocessors](https://docs.haystack.deepset.ai/docs/documentsplitter) when the application is built as a search or RAG pipeline and you want deterministic preprocessing components.

Strong fit:
- You use Haystack pipelines end-to-end
- You need sentence, passage, page, line, or function splitting in a production-oriented indexing flow
- You want hierarchical document splitting without adopting a different ingestion framework
- You value stable preprocessing stages more than cutting-edge chunking algorithms

Especially relevant docs:
- [DocumentSplitter](https://docs.haystack.deepset.ai/docs/documentsplitter)
- [HierarchicalDocumentSplitter](https://docs.haystack.deepset.ai/docs/hierarchicaldocumentsplitter)
- [DocumentPreprocessor](https://docs.haystack.deepset.ai/docs/documentpreprocessor)

Do not choose Haystack for advanced semantic chunking if you are not otherwise using Haystack.

#### Unstructured
Use [Unstructured chunking](https://docs.unstructured.io/open-source/core-functionality/chunking) when the hard problem is not text splitting but extracting meaningful document elements first.

Strong fit:
- Your corpus includes PDFs, PowerPoints, HTML, tables, page layouts, and other messy enterprise documents
- You want chunking to follow structural elements created during partitioning
- You need section-preserving chunking such as `by_title`
- You care about table isolation, page boundaries, or composite document elements

Especially relevant docs:
- [Open-source chunking](https://docs.unstructured.io/open-source/core-functionality/chunking)
- [Platform API chunking strategies](https://docs.unstructured.io/platform-api/partition-api/chunking)

Do not choose Unstructured when the documents are already clean plain text and layout parsing adds no value.

## When to Use This Skill
- When deciding which chunking framework a coding agent should adopt for a new RAG pipeline
- When replacing ad hoc chunking logic with a framework-backed approach
- When the team needs a framework-level recommendation before tuning chunk size or overlap
- When the corpus mixes prose, code, and structured enterprise documents

## When NOT to Use This Skill
- When the framework is already fixed by platform constraints
- When the problem is chunk sizing rather than framework choice
- When you only need one simple deterministic splitter and framework selection is not material

## Related Skills
- [Semantic Chunking](./semantic-chunking.md) - Meaning-aware chunk boundaries
- [Hierarchical Chunking](./hierarchical-chunking.md) - Multi-level chunk structures
- [Sliding Window Chunking](./sliding-window-chunking.md) - Overlap-based chunking
- [RAG for Code Documentation](../data-type-handling/rag-for-code-documentation.md) - Code-heavy corpus guidance

## Metrics & Success Criteria
- **Framework Fit**: The chosen framework matches the dominant document and retrieval pattern
- **Operational Simplicity**: The framework does not introduce unnecessary ingestion complexity
- **Retrieval Quality**: Chunking improves recall and answer quality for the actual corpus
- **Maintainability**: The chunking approach is understandable and sustainable for the team
- **Migration Cost**: The framework choice does not create avoidable lock-in without clear upside
@@ -0,0 +1,106 @@
---
title: "Contextual Chunk Headers"
description: "Add higher-level context to chunks by prepending headers before embedding for better retrieval."
allowed-tools:
- Read
- Grep
- Glob
- Bash
category: "chunking"
tags: ["contextual-headers", "metadata", "chunk-enhancement", "document-structure"]
---

## Overview
Contextual chunk headers (CCH) enhance retrieval by prepending higher-level context (document title, section headers, summaries) to each chunk before embedding. This gives embeddings a more accurate representation of content and meaning, significantly improving retrieval quality and reducing irrelevant results.

## Problem Statement
Individual chunks often lack sufficient context:
- **Implicit References**: Chunks use pronouns and implicit references that aren't clear in isolation
- **Missing Context**: Chunks only make sense in the context of their section or document
- **Misleading Isolation**: Read alone, chunks can be misleading or incomplete
- **Poor Retrieval**: Queries that reference subjects or themes aren't matched effectively
- **LLM Misinterpretation**: LLMs struggle to understand decontextualized chunks

## Key Concepts
- **Chunk Header**: Higher-level context prepended to chunk content
- **Document-Level Context**: Document title or summary as header context
- **Section-Level Context**: Hierarchy of section and subsection titles
- **Pre-Embedding Concatenation**: Combining header and chunk before embedding
- **Context Preservation**: Carrying headers through to retrieval and presentation

## Implementation Guide

### Step 1: Generate Document-Level Context
Extract or generate document titles and summaries for use in chunk headers.

**Why**: Document-level context provides the most important higher-level information for understanding any chunk within that document.

Use an LLM to generate a descriptive document title if one doesn't exist, or extract titles from document structure.

### Step 2: Extract Section Hierarchy
Parse document structure to capture section and subsection titles.

**Why**: Section hierarchy provides granular context that helps retrieve information about specific topics and themes within the document.

Parse the document structure (markdown headings, HTML tags, etc.) to build a hierarchical representation of sections.
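For markdown sources, the section hierarchy can be recovered with a simple heading scan. This is a stdlib sketch under the assumption of well-formed `#` headings; real pipelines would likely use a proper markdown or HTML parser:

```python
import re

# Track the current heading path while scanning markdown line by line,
# so each text line can be tagged with its section hierarchy.

def section_paths(markdown: str) -> list[tuple[str, list[str]]]:
    """Return (text_line, [section, subsection, ...]) pairs."""
    path: list[str] = []
    out: list[tuple[str, list[str]]] = []
    for line in markdown.split("\n"):
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            level = len(m.group(1))
            # Truncate the path to the parent level, then descend.
            path = path[: level - 1] + [m.group(2).strip()]
        elif line.strip():
            out.append((line.strip(), list(path)))
    return out

doc = "# Report\n## Methods\nWe sampled users.\n## Results\nRecall improved."
for text, path in section_paths(doc):
    print(path, "->", text)
```

The resulting paths feed directly into header construction in the next steps.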
### Step 3: Create Chunk Headers
Combine document and section context into comprehensive headers.

**Why**: Combined context provides both broad document understanding and specific section information, giving embeddings the best possible representation.

For each chunk, create a header combining document title and the section path that contains the chunk.

### Step 4: Prepend Headers to Chunks
Combine headers with chunk content before embedding.

**Why**: Embeddings represent the combined header+chunk content, capturing both local meaning and global context in the same vector space.

When indexing each chunk, include the header as a prefix to the chunk content.
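Putting Steps 3 and 4 together, a header is just a formatted prefix on the text that gets embedded. The sketch below uses one reasonable header convention (`Document:` / `Section:` lines); it is not a standard format:

```python
# Build a contextual header from a document title and section path,
# then prepend it to the chunk text before embedding.

def make_header(doc_title: str, section_path: list[str]) -> str:
    parts = [f"Document: {doc_title}"]
    if section_path:
        parts.append("Section: " + " > ".join(section_path))
    return "\n".join(parts)

def with_header(chunk: str, doc_title: str, section_path: list[str]) -> str:
    # The combined string is what gets embedded and indexed.
    return make_header(doc_title, section_path) + "\n\n" + chunk

text = with_header(
    "It improved recall by 12%.",
    doc_title="Q3 Retrieval Experiments",
    section_path=["Results", "Hybrid Search"],
)
print(text)
```

Note how the ambiguous pronoun "It" becomes interpretable once the header names the document and section.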
### Step 5: Embed with Headers
Use the header-enhanced chunks for embedding and indexing.

**Why**: Embedding with headers ensures that retrieval considers both local content and broader context, leading to more relevant matches.

Index the enhanced chunks using your preferred embedding model and vector database.

### Step 6: Present Results with Headers
Include headers when presenting retrieved chunks to users or LLMs.

**Why**: Presenting headers provides necessary context for understanding retrieved chunks, reducing misinterpretation and improving answer quality.

When returning retrieved content, include the original header for context.

### Step 7: Build Complete CCH Pipeline
Combine all components into a cohesive chunking and retrieval system.

**Why**: A complete pipeline ensures that contextual headers are consistently applied through chunking, embedding, retrieval, and presentation.

For implementation patterns, see [LangChain RecursiveCharacterTextSplitter](https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter), [LlamaIndex node parser usage](https://developers.llamaindex.ai/python/framework/module_guides/loading/node_parsers/), [Anthropic Contextual Retrieval](https://www.anthropic.com/engineering/contextual-retrieval), and [dsRAG AutoContext](https://github.com/D-Star-AI/dsRAG).

## When to Use This Skill
- Building RAG systems with structured documents (articles, reports, documentation)
- When chunks frequently refer to their subject via pronouns or implicit references
- For documents with clear section hierarchy or structure
- When retrieval quality suffers from decontextualized chunks
- For applications where LLMs need to understand chunk context

## When NOT to Use This Skill
- For unstructured documents without clear hierarchy or titles
- When chunk headers would be redundant or too long
- For very short documents where the entire text is already coherent
- When storage overhead is a major concern (headers increase chunk size)
- For domains where document structure is meaningless (code snippets, logs)

## Related Skills
- [Semantic Chunking](./semantic-chunking.md) - Context-aware chunking approach
- [Hierarchical Chunking](./hierarchical-chunking.md) - Structure-based chunking
- [Context Enrichment Window](../retrieval-strategies/context-enrichment-window.md) - Post-retrieval context

## Metrics & Success Criteria
- **Retrieval Precision**: Increased rate of correct retrieval for subject-based queries
- **Context Completeness**: Retrieved chunks provide sufficient context for understanding
- **Header Effectiveness**: Headers improve embedding representation quality
- **Answer Quality**: Reduced hallucinations from decontextualized chunks
- **Storage Overhead**: Acceptable increase in chunk size due to headers
@@ -0,0 +1,77 @@
---
title: "Hierarchical Chunking"
description: "Chunk nested documents into parent-child levels so retrieval can move from broad sections to fine-grained passages."
allowed-tools:
- Read
- Grep
- Glob
- Bash
category: "chunking"
tags: ["nested", "multi-level", "document-structure", "parent-child"]
---

## Overview
Hierarchical chunking creates multi-level chunk structures that preserve document hierarchies (chapters, sections, subsections). This enables both broad-overview retrieval (high-level chunks) and detailed retrieval (low-level chunks) with parent-child relationships for context propagation.

## Problem Statement
Flat chunking strategies lose the structural relationships within documents:
- Questions about broad topics retrieve detailed fragments instead of summaries
- No way to retrieve "context for context" - when a detail is retrieved, its containing section is lost
- Navigation through document hierarchies becomes impossible
- Retrieval results can't be organized by document structure

## Key Concepts
- **Parent-Child Relationships**: Links between summary chunks and their detailed sub-chunks
- **Multi-Level Granularity**: Chunks at different abstraction levels (document → chapter → section → paragraph)
- **Context Propagation**: Ability to include parent context when retrieving child chunks
- **Metadata Enrichment**: Storing hierarchy information for filtering and navigation
- **Recursive Chunking**: Applying chunking rules recursively through document structure

## Implementation Guide

### Step 1: Parse Document Structure
Extract the hierarchical structure of your documents.

**Why**: Proper parsing is foundational—you can't create a hierarchy without understanding the structure.

### Step 2: Create Multi-Level Chunks
Generate chunks at multiple levels from the parsed hierarchy.

**Why**: Multiple chunk levels allow different retrieval strategies—summary chunks for broad questions, detailed chunks for specifics.

### Step 3: Store Hierarchy Metadata
Persist hierarchy information for retrieval-time context propagation.

**Why**: Metadata enables filtering and hierarchical queries (e.g., "retrieve from level 2 or deeper").

### Step 4: Implement Hierarchical Retrieval
Retrieve chunks with optional parent context.

**Why**: Hierarchical retrieval allows users to "zoom in" from broad topics to specific details, just like browsing a table of contents.

For implementation details, see the [LlamaIndex HierarchicalNodeParser](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/hierarchical/), [LangChain RecursiveCharacterTextSplitter](https://reference.langchain.com/python/langchain-text-splitters/character/RecursiveCharacterTextSplitter), [Haystack HierarchicalDocumentSplitter](https://docs.haystack.deepset.ai/docs/hierarchicaldocumentsplitter), and [LlamaIndex Recursive Retriever + Node References](https://developers.llamaindex.ai/python/framework/integrations/retrievers/recursive_retriever_nodes/).
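The parent-child idea behind these libraries can also be sketched without any framework: split each section (parent) into smaller child chunks and record the link in metadata. The field names below are illustrative, not a framework schema:

```python
# Two-level hierarchy: each parent (section) chunk links to the smaller
# child chunks derived from it, so retrieving a child can pull in its
# parent for context.

def build_hierarchy(sections: dict[str, str], child_size: int = 20) -> list[dict]:
    chunks: list[dict] = []
    next_id = 0
    for title, body in sections.items():
        parent_id = next_id
        next_id += 1
        chunks.append({"id": parent_id, "level": 1, "parent": None,
                       "title": title, "text": body})
        words = body.split()
        for i in range(0, len(words), child_size):
            chunks.append({"id": next_id, "level": 2, "parent": parent_id,
                           "title": title, "text": " ".join(words[i:i + child_size])})
            next_id += 1
    return chunks

chunks = build_hierarchy({"Intro": "word " * 45})
parents = [c for c in chunks if c["level"] == 1]
children = [c for c in chunks if c["level"] == 2]
print(len(parents), len(children))  # one section, three child chunks
```

At query time, embedding and searching only the children while expanding hits to their `parent` chunk gives the "zoom in, then zoom out" behavior described above.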
## When to Use This Skill
- Long-form documents with clear structure (books, technical documentation, research papers)
- When you need both summary and detailed retrieval
- For applications that support "drill-down" navigation
- When building document browsing/search interfaces
- For Q&A systems that benefit from hierarchical context

## When NOT to Use This Skill
- Flat content without structure (blog posts, single-page articles)
- When storage complexity outweighs retrieval benefits
- For small document collections where simple chunking suffices
- When retrieval latency is critical (hierarchical retrieval adds overhead)

## Related Skills
- [Semantic Chunking](./semantic-chunking.md) - For content-aware chunking
- [Sliding Window Chunking](./sliding-window-chunking.md) - For context overlap
- [Multi-Pass Retrieval with Reranking](../retrieval-strategies/multi-pass-retrieval-with-reranking.md) - For refining hierarchical results

## Metrics & Success Criteria
- **Coverage**: All document content preserved across hierarchy levels
- **Retrieval Flexibility**: Can retrieve at any abstraction level
- **Context Relevance**: Parent context improves answer quality for detailed questions
- **Storage Efficiency**: Metadata overhead doesn't exceed 30% of storage
- **Query Latency**: Hierarchical retrieval within 2x of flat retrieval
@@ -0,0 +1,78 @@
---
title: "Semantic Chunking"
description: "Use semantic boundaries and embedding similarity to chunk text for higher-relevance retrieval."
allowed-tools:
- Read
- Grep
- Glob
- Bash
category: "chunking"
tags: ["semantic", "nlp", "sentence-boundary", "context-preservation"]
---

## Overview
Semantic chunking divides documents into segments based on natural language boundaries and semantic meaning rather than fixed character counts. This approach preserves context integrity and improves retrieval relevance by keeping related ideas together.

## Problem Statement
Fixed-size chunking (e.g., 1000 tokens) often cuts text mid-sentence, mid-paragraph, or mid-concept, resulting in:
- Fragments that lose semantic meaning
- Context bleeding across unrelated chunks
- Poor retrieval relevance for semantic search
- Difficulty for LLMs to understand isolated fragments

## Key Concepts
- **Semantic Boundaries**: Natural break points in text (paragraphs, sections, sentences)
- **Context Preservation**: Keeping related ideas and arguments together
- **Sentence Boundary Detection**: Using NLP tools to identify proper sentence ends
- **Paragraph Cohesion**: Identifying when a new thought or argument begins
- **Overlapping Windows**: Small overlaps to preserve context between adjacent chunks

## Implementation Guide

### Step 1: Detect Sentence Boundaries
Use sentence boundary detection rather than simple period splitting. This handles abbreviations, decimal numbers, and other edge cases.

**Why**: Simple period/regex splitting fails on "Dr. Smith", "3.14", "U.S.A.", etc.
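A minimal illustration of the failure mode and a partial fix: split on terminal punctuation, then re-join pieces that ended on a known abbreviation or a bare number. Production systems should rely on spaCy, NLTK, or a similar trained segmenter rather than this sketch:

```python
import re

# Tiny abbreviation list for illustration; real segmenters ship far
# more complete models or learn boundaries from data.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}

def split_sentences(text: str) -> list[str]:
    pieces = re.split(r"(?<=[.!?])\s+", text)
    out: list[str] = []
    for piece in pieces:
        prev_last = out[-1].split()[-1].lower() if out else ""
        if out and (prev_last in ABBREVIATIONS or re.fullmatch(r"\d+\.", prev_last)):
            out[-1] += " " + piece   # re-join a false split after "Dr." etc.
        else:
            out.append(piece)
    return out

sents = split_sentences("Dr. Smith arrived. He measured the sample. It was fine.")
print(len(sents))  # 3, not 4: "Dr." did not end a sentence
```

A naive split on `.` would have produced a fragment ending at "Dr.", which is exactly the kind of broken boundary that hurts embeddings.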
### Step 2: Group Sentences into Semantic Units
Group sentences into chunks based on semantic cohesion rather than fixed counts.

**Why**: Similarity-based chunking adapts to document density—dense technical sections get smaller chunks, narrative sections get larger ones.
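One way to sketch this grouping: merge a sentence into the current chunk while its similarity to the previous sentence stays above a threshold. Word-overlap (Jaccard) similarity stands in here for embedding cosine similarity, purely for illustration:

```python
import re

def words(s: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def jaccard(a: str, b: str) -> float:
    sa, sb = words(a), words(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def semantic_groups(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    groups: list[list[str]] = []
    for sent in sentences:
        if groups and jaccard(groups[-1][-1], sent) >= threshold:
            groups[-1].append(sent)   # cohesive with the previous sentence
        else:
            groups.append([sent])     # semantic break: start a new chunk
    return groups

sentences = [
    "Vector search ranks documents by embedding similarity.",
    "Embedding similarity search works best with coherent chunks.",
    "Invoices are due at the end of the month.",
]
groups = semantic_groups(sentences)
print(len(groups))  # the invoice sentence starts its own chunk
```

Swapping `jaccard` for cosine similarity over real sentence embeddings turns this into the embedding-based semantic chunking the frameworks below implement.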
|
|
41
|
+
|
|
42
|
+
### Step 3: Apply Structure-Based Chunking
|
|
43
|
+
Leverage document structure (headings, markdown, HTML tags) to identify natural sections.
|
|
44
|
+
|
|
45
|
+
**Why**: Document structure often encodes semantic boundaries (chapters, sections, subsections).
|
|
46
|
+
|
|
47
|
+
### Step 4: Implement Adaptive Sizing
Adjust chunk sizes based on content characteristics (code blocks, lists, tables).

**Why**: Different content types need different chunking strategies for optimal retrieval.
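One concrete form of adaptive sizing is a line-budget chunker that refuses to split a fenced code block, a sketch under the assumption that fences are balanced (the fence string is built indirectly so this example renders cleanly):

```python
FENCE = "`" * 3  # three backticks, built indirectly for clean rendering

def adaptive_chunks(lines: list[str], max_lines: int = 4) -> list[list[str]]:
    """Chunk by line count, but never split a fenced code block:
    a block that would cross a boundary is kept whole."""
    chunks, current, in_code = [], [], False
    for line in lines:
        current.append(line)
        if line.strip().startswith(FENCE):
            in_code = not in_code
        # Flush only when outside a code fence and over the size budget.
        if len(current) >= max_lines and not in_code:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks
```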
For practical implementations, compare [Sentence-BERT](https://arxiv.org/abs/1908.10084), [LlamaIndex Semantic Chunker](https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/), [LangChain RecursiveCharacterTextSplitter](https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter), and [spaCy sentence segmentation](https://spacy.io/usage/linguistic-features#sentence-segmentation).

## When to Use This Skill
- Documents with clear semantic structure (articles, reports, documentation)
- When retrieval quality is more important than fixed-size guarantees
- For narrative or explanatory content where context matters
- When using semantic search (vector embeddings) for retrieval
- For Q&A systems that need coherent context for answers

## When NOT to Use This Skill
- Code repositories where line-level precision matters
- Log files or structured data with fixed formats
- When you need exact token count guarantees (e.g., for API limits)
- Streaming data where chunk boundaries can't be deferred
- When using exact keyword matching (BM25) where chunk size matters less

## Related Skills
- [Hierarchical Chunking](./hierarchical-chunking.md) - For nested document structures
- [Sliding Window Chunking](./sliding-window-chunking.md) - For overlapping context preservation
- [Hybrid Search BM25 Dense](../retrieval-strategies/hybrid-search-bm25-dense.md) - Combining search methods

## Metrics & Success Criteria
- **Retrieval Relevance**: Retrieved chunks should contain complete, self-contained information
- **Context Preservation**: Related concepts remain in same chunk
- **Coverage**: No significant information lost in chunk boundaries
- **Token Efficiency**: Chunks balanced between too small (lossy) and too large (noisy)
- **Semantic Coherence**: Chunks represent coherent thoughts or sections
@@ -0,0 +1,82 @@
---
title: "Sliding Window Chunking"
description: "Use overlapping windows to preserve context across chunk boundaries while controlling retrieval size."
allowed-tools:
- Read
- Grep
- Glob
- Bash
category: "chunking"
tags: ["overlap", "context-preservation", "window", "boundary"]
---

## Overview
Sliding window chunking creates overlapping chunks where each chunk shares content with adjacent chunks. This preserves context at chunk boundaries, ensuring that information split across chunks remains accessible through multiple retrieval paths.

## Problem Statement
Non-overlapping chunking creates hard boundaries that can break important context:
- Critical information appears exactly at a chunk boundary
- Concepts spanning multiple chunks become fragmented
- Retrieval may miss relevant content because it's split across boundaries
- LLMs lose context when only one fragment is provided

## Key Concepts
- **Overlap Percentage**: The proportion of shared content between adjacent chunks (typically 10-25%)
- **Window Size**: The total size of each chunk including overlap
- **Stride**: The distance between chunk start points (window size minus overlap)
- **Context Preservation**: Ensuring relevant context exists on both sides of boundaries
- **Multiple Retrieval Paths**: Same content accessible from different chunks

## Implementation Guide

### Step 1: Determine Overlap Parameters
Calculate appropriate overlap based on your use case and chunk size.

**Why**: The overlap ratio determines how much context is preserved: too little loses context, too much creates redundancy.
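The parameter relationships above (stride = window size minus overlap) can be sketched directly; the 500-token/20% figures in the test are illustrative defaults, not recommendations:

```python
import math

def window_params(window: int, overlap_ratio: float) -> tuple[int, int]:
    """Derive absolute overlap and stride from a window size and ratio."""
    overlap = int(window * overlap_ratio)
    stride = window - overlap
    return overlap, stride

def num_chunks(total_tokens: int, window: int, stride: int) -> int:
    """How many windows are needed to cover the whole text."""
    if total_tokens <= window:
        return 1
    # One full window, then one more chunk per additional stride.
    return 1 + math.ceil((total_tokens - window) / stride)
```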
### Step 2: Implement Sliding Window Chunking
Create chunks with configurable overlap.

**Why**: Boundary adjustment prevents chunks from ending mid-sentence while maintaining the sliding window pattern.
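One way to get boundary adjustment for free is to slide the window over whole sentences instead of characters; a sketch (assumes sentences were already split upstream):

```python
def sliding_sentence_windows(sentences: list[str], per_chunk: int = 5,
                             overlap: int = 1) -> list[str]:
    """Slide a window over whole sentences: chunks never end mid-sentence,
    and adjacent chunks share `overlap` sentences."""
    if overlap >= per_chunk:
        raise ValueError("overlap must be smaller than per_chunk")
    stride = per_chunk - overlap
    chunks = []
    for start in range(0, len(sentences), stride):
        chunks.append(" ".join(sentences[start:start + per_chunk]))
        if start + per_chunk >= len(sentences):
            break  # last window reached the end of the document
    return chunks
```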
### Step 3: Add Chunk Metadata
Track overlap information for retrieval-time decisions.

**Why**: Metadata allows retrieval systems to identify and deduplicate overlapping content.
### Step 4: Implement Smart Retrieval
Deduplicate or merge overlapping results at query time.

**Why**: Smart retrieval prevents redundant information while preserving the benefits of overlapping chunks.
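A sketch of retrieval-time merging, assuming each chunk's metadata carries `(start, end)` character offsets into the source document as Step 3 suggests:

```python
def merge_retrieved(chunks: list[tuple[int, int, str]]) -> list[tuple[int, int, str]]:
    """Merge retrieved chunks whose source spans overlap, so the final
    context contains each passage only once. Each chunk is
    (start, end, text) with character offsets into the source document."""
    merged = []
    for start, end, text in sorted(chunks):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end, prev_text = merged[-1]
            if end > prev_end:
                # Keep only the suffix that extends past the previous chunk.
                merged[-1] = (prev_start, end, prev_text + text[prev_end - start:])
        else:
            merged.append((start, end, text))
    return merged
```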
### Step 5: Token-Aware Sliding Window
Implement sliding window based on tokens rather than characters.

**Why**: Token-aware chunking is more accurate for LLMs and respects their actual token limits.
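A token-budget variant of the same loop. The whitespace `tokenize` is a stand-in: a real system would use the model's own tokenizer (e.g., tiktoken) so sizes match the LLM's actual token counting:

```python
def tokenize(text: str) -> list[str]:
    # Whitespace stand-in for a real tokenizer such as tiktoken.
    return text.split()

def token_sliding_window(text: str, max_tokens: int = 128,
                         overlap_tokens: int = 32) -> list[str]:
    """Sliding window measured in tokens rather than characters, so every
    chunk is guaranteed to fit a token budget."""
    tokens = tokenize(text)
    stride = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks
```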
Useful implementations include [LangChain RecursiveCharacterTextSplitter](https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter), the [OpenAI tiktoken cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb), [tiktoken](https://github.com/openai/tiktoken), and [Sentence Transformers rerankers](https://www.sbert.net/examples/cross_encoder/training/rerankers/README.html).

## When to Use This Skill
- Documents where context at boundaries is important
- When using semantic search where retrieval might match content near boundaries
- For narrative or sequential content where flow matters
- When building systems that benefit from multiple retrieval paths
- For technical documentation where definitions/examples might span chunks

## When NOT to Use This Skill
- When storage costs are critical (overlap increases storage 20-30%)
- When exact deduplication is required (overlap makes this harder)
- For structured data with clear boundaries (JSON, CSV)
- When using exact keyword search where overlap doesn't help

## Related Skills
- [Semantic Chunking](./semantic-chunking.md) - For content-aware boundaries
- [Hierarchical Chunking](./hierarchical-chunking.md) - For nested structures
- [Hybrid Search BM25 Dense](../retrieval-strategies/hybrid-search-bm25-dense.md) - For combining search methods

## Metrics & Success Criteria
- **Boundary Coverage**: Critical terms appear in at least one chunk (not lost at boundaries)
- **Storage Overhead**: Overlap increases storage by expected ratio (e.g., 20%)
- **Retrieval Recall**: Same content accessible through multiple chunks
- **Context Preservation**: Retrieved chunks contain sufficient surrounding context
- **Deduplication Quality**: Can identify and handle overlapping content effectively
@@ -0,0 +1,83 @@
---
title: "RAG for Code Documentation"
description: "Handle code-aware retrieval by preserving symbols, file structure, and API context."
allowed-tools:
- Read
- Grep
- Glob
- Bash
category: "data-type-handling"
tags: ["code", "programming", "syntax", "api", "documentation"]
---

## Overview
RAG for code documentation requires specialized handling due to code's structured nature, syntax-specific patterns, and the importance of preserving function signatures, imports, and contextual relationships. This skill covers embedding code snippets, handling API references, and retrieving code-aware context.

## Problem Statement
Generic RAG approaches struggle with code-related queries:
- Standard chunking breaks function/class structures mid-signature
- General-purpose language models understand code syntax and semantics less well than natural language
- Code requires preserving structural relationships (imports, dependencies, inheritance)
- API queries need exact matches alongside semantic understanding
- Code examples need context to be useful (imports, dependencies)

## Key Concepts
- **Code-Aware Chunking**: Preserving function/class boundaries and syntactic units
- **Code-Specific Embeddings**: Models trained on code (CodeBERT, StarCoder) vs. general embeddings
- **Syntax Preservation**: Maintaining formatting, indentation, and structure
- **Context Tracking**: Following import chains and dependency relationships
- **Hybrid Search for Code**: Combining semantic search with AST-based matching

## Implementation Guide

### Step 1: Code-Aware Document Parsing
Parse code files into structured chunks that preserve syntactic units.

**Why**: AST-based parsing preserves code structure that would be lost with generic text chunking.
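For Python sources this can be done entirely with the standard library's `ast` module; a sketch that emits one chunk per top-level definition:

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split a Python file into one chunk per top-level function or class,
    so signatures and bodies are never cut mid-definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "type": type(node).__name__,
                # lineno/end_lineno are 1-based, inclusive (Python 3.8+).
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks
```

Other languages need a parser of their own (e.g., tree-sitter), but the shape of the output chunks stays the same.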
### Step 2: Code-Specific Embedding
Use embeddings designed for code or augment text embeddings with code-aware features.

**Why**: Code-specific embeddings capture semantic meaning in code (function relationships, patterns) that general embeddings miss.
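When a code-specific model is unavailable, one cheap "code-aware feature" is to expand identifiers before embedding with a general model; a sketch of that preprocessing step (the regexes are a heuristic, not a full lexer):

```python
import re

def expand_identifiers(code: str) -> str:
    """Produce an embedding-friendly rendering of code: split snake_case
    and camelCase identifiers into plain words so that a natural-language
    query like 'parse request body' can match parseRequestBody."""
    words = []
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code):
        for part in token.split("_"):
            # Split camelCase/PascalCase transitions, keeping acronyms whole.
            words.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return " ".join(w.lower() for w in words if w)
```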
### Step 2.5: Store Code Chunks in Vector DB
Store code chunks with rich metadata for effective retrieval.

**Why**: Rich metadata enables filtering by language, type, file path, and other code-specific attributes.
### Step 3: Implement Code-Aware Search
Combine semantic search with filters on code-specific metadata.

**Why**: Code-aware search combines semantic understanding with precise filtering for programming-specific use cases.
### Step 4: Handle API Documentation
Add specialized handling for API reference queries.

**Why**: API queries often require exact matching (function/class names) rather than pure semantic search.
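The exact-match-first routing can be sketched as below; `api_index` and `semantic_search` are hypothetical stand-ins for a symbol table built during indexing and a vector-store query function:

```python
def api_lookup(query: str, api_index: dict, semantic_search):
    """Route an API-reference query: exact symbol match first, then fall
    back to semantic search. `api_index` maps symbol names to doc entries;
    `semantic_search` stands in for a vector-store query function."""
    for token in query.replace("(", " ").replace(")", " ").split():
        if token in api_index:
            return [api_index[token]]   # exact hit on a known symbol wins
    return semantic_search(query)       # otherwise fall back to embeddings
```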
Practical references include the [CodeBERT paper](https://huggingface.co/papers/2002.08155), [GraphCodeBERT paper](https://huggingface.co/papers/2009.08366), [StarCoder paper](https://huggingface.co/papers/2305.06161), and [GitHub code search](https://github.com/features/code-search).

## When to Use This Skill
- Building code search or documentation assistants
- When users query code syntax, APIs, or programming concepts
- For IDE integrations and code completion systems
- When building technical documentation with code examples
- For troubleshooting programming queries

## When NOT to Use This Skill
- For natural language-only content (use general RAG)
- When code structure isn't important for retrieval
- For very simple codebases where grep suffices
- When running code-specific embedding models is not feasible

## Related Skills
- [Semantic Chunking](../chunking/semantic-chunking.md) - For document chunks
- [Choosing Vector DB by Datatype](../vector-databases/choosing-vector-db-by-datatype.md) - Database selection
- [Hierarchical Chunking](../chunking/hierarchical-chunking.md) - For nested code structures

## Metrics & Success Criteria
- **Code Retrieval Accuracy**: Exact matches for function/class names > 90%
- **Semantic Understanding**: NDCG@10 for code-related queries > 0.7
- **Context Preservation**: Relevant imports included > 80% of time
- **Syntax Preservation**: Code formatting maintained 100%
- **Query Latency**: < 200ms for typical code queries