rag-skills 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (36)
  1. package/.claude-plugin/marketplace.json +14 -0
  2. package/.claude-plugin/plugin.json +8 -0
  3. package/CONTRIBUTING.md +210 -0
  4. package/LICENSE +21 -0
  5. package/README.md +148 -0
  6. package/examples/foundational-rag-pipeline.md +104 -0
  7. package/examples/multi-agent-rag.md +111 -0
  8. package/examples/production-rag-setup.md +133 -0
  9. package/package.json +22 -0
  10. package/scripts/generate-index.py +276 -0
  11. package/scripts/validate-skills.py +214 -0
  12. package/skills/chunking/choosing-a-chunking-framework.md +186 -0
  13. package/skills/chunking/contextual-chunk-headers.md +106 -0
  14. package/skills/chunking/hierarchical-chunking.md +77 -0
  15. package/skills/chunking/semantic-chunking.md +78 -0
  16. package/skills/chunking/sliding-window-chunking.md +82 -0
  17. package/skills/data-type-handling/rag-for-code-documentation.md +83 -0
  18. package/skills/data-type-handling/rag-for-multimodal-content.md +83 -0
  19. package/skills/performance-optimization/optimize-retrieval-latency.md +88 -0
  20. package/skills/retrieval-strategies/adaptive-retrieval.md +102 -0
  21. package/skills/retrieval-strategies/context-enrichment-window.md +99 -0
  22. package/skills/retrieval-strategies/crag-corrective-rag.md +108 -0
  23. package/skills/retrieval-strategies/explainable-retrieval.md +106 -0
  24. package/skills/retrieval-strategies/graph-rag.md +107 -0
  25. package/skills/retrieval-strategies/hybrid-search-bm25-dense.md +81 -0
  26. package/skills/retrieval-strategies/hyde-hypothetical-document-embeddings.md +91 -0
  27. package/skills/retrieval-strategies/hype-hypothetical-prompt-embeddings.md +98 -0
  28. package/skills/retrieval-strategies/multi-pass-retrieval-with-reranking.md +82 -0
  29. package/skills/retrieval-strategies/query-transformation-strategies.md +93 -0
  30. package/skills/retrieval-strategies/raptor-hierarchical-retrieval.md +106 -0
  31. package/skills/retrieval-strategies/self-rag.md +108 -0
  32. package/skills/vector-databases/choosing-vector-db-by-datatype.md +112 -0
  33. package/skills/vector-databases/qdrant-for-production-rag.md +88 -0
  34. package/skills/vector-databases/qdrant-setup-rag.md +86 -0
  35. package/templates/skill-template.md +53 -0
  36. package/templates/workflow-template.md +67 -0
package/skills/chunking/choosing-a-chunking-framework.md
@@ -0,0 +1,186 @@
1
+ ---
2
+ title: "Choosing a Chunking Framework"
3
+ description: "Select the right chunking framework based on document type, pipeline architecture, and retrieval goals."
4
+ allowed-tools:
5
+ - Read
6
+ - Grep
7
+ - Glob
8
+ - Bash
9
+ category: "chunking"
10
+ tags: ["framework-selection", "chonkie", "langchain", "llamaindex", "haystack", "unstructured"]
11
+ ---
12
+
13
+ ## Overview
14
+ Chunking quality depends as much on the framework as on the strategy itself. This guide helps a coding agent choose between chunking frameworks based on the shape of the data, the surrounding RAG stack, and whether the main need is speed, structure-awareness, retrieval quality, or integration simplicity.
15
+
16
+ ## Problem Statement
17
+ Teams often pick a chunking framework for the wrong reasons:
18
+ - They default to the framework already in the app, even when it has weaker chunking primitives for the data type
19
+ - They choose a sophisticated semantic chunker for documents that only need simple deterministic splitting
20
+ - They use generic text splitters for PDFs, tables, or code repositories where structure matters more than raw length
21
+ - They optimize for chunking features without considering metadata propagation, ingestion workflow, or downstream retrieval
22
+
23
+ ## Key Concepts
24
+ - **Framework Fit**: How well the chunking library matches the rest of the ingestion and retrieval stack
25
+ - **Structure Awareness**: Whether the framework understands sections, headings, tables, pages, or code blocks
26
+ - **Semantic Awareness**: Whether the framework can chunk by meaning rather than only size or delimiters
27
+ - **Operational Simplicity**: How easy it is to adopt the framework without introducing a second ingestion system
28
+ - **Chunking Depth**: Whether you need only fixed-size chunks or richer patterns like hierarchical, sentence-window, or code-aware chunking
29
+
30
+ ## Implementation Guide
31
+
32
+ ### Step 1: Identify the Dominant Document Type
33
+ Decide whether the corpus is mostly plain text, structured documents, code, or messy extracted files.
34
+
35
+ **Why**: Chunking frameworks differ most sharply in how they handle document structure. The right choice for markdown docs is often wrong for scanned PDFs or source code.
36
+
37
+ Use this default mapping:
38
+ - Plain text and generic prose: favor LangChain or Chonkie
39
+ - Metadata-rich RAG pipelines: favor LlamaIndex
40
+ - Pipeline-centric search applications: favor Haystack
41
+ - Parsed PDFs, office files, and layout-heavy documents: favor Unstructured
42
+ - Code-heavy repositories: favor Chonkie first, then LangChain if you need simpler integration
43
+
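The default mapping above can be expressed as a small lookup table an agent can apply programmatically. This is an illustrative sketch only; the category keys and the helper name are invented for this example and are not part of any framework.

```python
# Illustrative lookup encoding the default corpus-type -> framework mapping.
# Category keys and the helper name are invented for this sketch.
DEFAULT_FRAMEWORKS = {
    "plain_text": ["langchain", "chonkie"],
    "metadata_rich": ["llamaindex"],
    "pipeline_search": ["haystack"],
    "layout_heavy": ["unstructured"],
    "code": ["chonkie", "langchain"],
}

def suggest_frameworks(corpus_type: str) -> list[str]:
    # Fall back to a general-purpose default when the corpus type is unknown.
    return DEFAULT_FRAMEWORKS.get(corpus_type, ["langchain"])
```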
44
+ ### Step 2: Check Whether Chunking Is a Core Requirement or Just a Utility
45
+ Determine whether chunking itself is a strategic part of the system or just one preprocessing step.
46
+
47
+ **Why**: If chunking is central to recall quality, code structure, or high-throughput ingestion, you want a framework with deeper chunking specialization. If it is incidental, use the framework already governing the RAG pipeline.
48
+
49
+ Choose this way:
50
+ - If chunking sophistication is a competitive advantage, start with Chonkie
51
+ - If chunking is just one component in a broader orchestration framework, prefer LangChain, LlamaIndex, or Haystack based on the app stack
52
+
53
+ ### Step 3: Match the Framework to the Retrieval Pattern
54
+ Choose a framework whose chunking model supports the retrieval pattern you intend to use.
55
+
56
+ **Why**: Some frameworks are built around plain chunks, while others are better for hierarchical retrieval, metadata-enriched nodes, or section-preserving document flows.
57
+
58
+ Recommended fit by retrieval pattern:
59
+ - Simple vector search over text: LangChain or Chonkie
60
+ - Hierarchical retrieval and parent-child context: LlamaIndex or Haystack
61
+ - Section-aware retrieval for reports and long docs: Unstructured or Chonkie recursive chunking
62
+ - Code search and AST-aware chunking: Chonkie
63
+ - Fine-grained sentence-level retrieval with context reconstruction: LlamaIndex
64
+
65
+ ### Step 4: Use the Framework Decision Matrix
66
+ Pick the framework whose strengths best match the workload.
67
+
68
+ **Why**: A framework should be chosen by tradeoff, not popularity.
69
+
70
+ | Framework | Best At | Choose It When | Avoid It When |
71
+ |----------|---------|----------------|---------------|
72
+ | **[Chonkie](https://docs.chonkie.ai/oss/chunkers/overview)** | Specialized chunking strategies, high throughput, semantic and code-aware chunking | Chunking is a first-class problem, you need semantic/code chunkers, or ingestion speed matters | You want the simplest possible default inside an existing LangChain or Haystack app |
73
+ | **[LangChain Text Splitters](https://docs.langchain.com/oss/python/integrations/splitters/index)** | Simple, reliable general-purpose chunking tightly integrated with LangChain apps | You already use LangChain or LangGraph and need practical default chunking with minimal extra tooling | You need deeply specialized code, layout, or semantic chunking beyond standard splitter patterns |
74
+ | **[LlamaIndex Node Parsers](https://developers.llamaindex.ai/python/framework/module_guides/loading/node_parsers/)** | Metadata-rich nodes, sentence windows, hierarchical parsing, retrieval-aware chunking | You use LlamaIndex ingestion/query pipelines or need chunking that preserves node relationships and retrieval metadata | You only need straightforward standalone chunking without adopting LlamaIndex concepts |
75
+ | **[Haystack Preprocessors](https://docs.haystack.deepset.ai/docs/documentsplitter)** | Deterministic pipeline-based splitting for production search systems | You are building around Haystack pipelines and want predictable document preprocessing components | You need more advanced semantic or code-aware chunking than Haystack’s built-in preprocessors provide |
76
+ | **[Unstructured Chunking](https://docs.unstructured.io/open-source/core-functionality/chunking)** | Partition-first chunking for PDFs, Office docs, HTML, tables, and layout-heavy content | Your main challenge is document parsing and structural preservation before chunking | Your corpus is already clean text and you do not need layout-aware parsing |
77
+
78
+ ### Step 5: Apply Framework-Specific Guidance
79
+ Use the framework only in the situations where it has a clear advantage.
80
+
81
+ **Why**: Each framework has a distinct operating model. Mixing them without a reason usually adds complexity.
82
+
83
+ #### Chonkie
84
+ Use [Chonkie](https://www.chonkie.ai/) when chunking itself is the focus of the system.
85
+
86
+ Strong fit:
87
+ - You need a dedicated chunking library rather than a general LLM framework
88
+ - You want to choose among token, sentence, recursive, semantic, late, fast, or code chunkers
89
+ - You have code repositories or technical docs where chunk coherence matters
90
+ - You have high-throughput ingestion where fast chunking is important
91
+
92
+ Especially relevant docs:
93
+ - [Chunkers overview](https://docs.chonkie.ai/oss/chunkers/overview)
94
+ - [RecursiveChunker](https://docs.chonkie.ai/oss/chunkers/recursive-chunker)
95
+ - [SemanticChunker](https://docs.chonkie.ai/oss/chunkers/semantic-chunker)
96
+ - [CodeChunker](https://docs.chonkie.ai/oss/chunkers/code-chunker)
97
+ - [FastChunker](https://docs.chonkie.ai/oss/chunkers/fast-chunker)
98
+
99
+ Do not choose Chonkie just because it has more chunkers. Choose it when those chunkers materially improve your corpus handling.
100
+
101
+ #### LangChain
102
+ Use [LangChain Text Splitters](https://docs.langchain.com/oss/python/integrations/splitters/index) when you want a practical default inside a LangChain or LangGraph application.
103
+
104
+ Strong fit:
105
+ - You already use LangChain loaders, embeddings, retrievers, or agents
106
+ - You want an opinionated default like `RecursiveCharacterTextSplitter`
107
+ - You need markdown, JSON, HTML, or code splitting without introducing another ingestion framework
108
+ - You value integration speed over chunking specialization
109
+
110
+ Especially relevant docs:
111
+ - [Text splitters overview](https://docs.langchain.com/oss/python/integrations/splitters/index)
112
+ - [RecursiveCharacterTextSplitter](https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter)
113
+ - [CharacterTextSplitter](https://docs.langchain.com/oss/python/integrations/splitters/character_text_splitter)
114
+
115
+ Do not choose LangChain if you need the strongest code-aware or layout-aware chunking and are willing to use a more specialized tool.
116
+
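The recursive-splitting idea behind `RecursiveCharacterTextSplitter` can be sketched in plain Python: try the coarsest separator first (paragraphs, then lines, then words) and only hard-split by size when nothing finer is available. This is a simplified stdlib illustration of the pattern, not the library's actual implementation or API.

```python
def recursive_split(text, chunk_size=1000, separators=("\n\n", "\n", " ", "")):
    """Split text by the coarsest separator that fits, recursing to finer ones."""
    if len(text) <= chunk_size:
        return [text]
    sep = next((s for s in separators if s and s in text), "")
    if sep == "":
        # No separator available: fall back to a hard split by size.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(piece) > chunk_size:
                # A single piece is still too big: recurse with finer separators.
                finer = separators[separators.index(sep) + 1:]
                chunks.extend(recursive_split(piece, chunk_size, finer))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```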
117
+ #### LlamaIndex
118
+ Use [LlamaIndex Node Parsers](https://developers.llamaindex.ai/python/framework/module_guides/loading/node_parsers/) when chunking is tightly coupled to retrieval behavior and metadata-rich nodes.
119
+
120
+ Strong fit:
121
+ - You want sentence-window retrieval, node relationships, or hierarchical retrieval patterns
122
+ - You care about chunk metadata because retrieval and postprocessing depend on it
123
+ - You are already using LlamaIndex ingestion pipelines, retrievers, and node postprocessors
124
+ - You want chunking that fits directly into a retrieval-aware framework
125
+
126
+ Especially relevant docs:
127
+ - [Node parser usage](https://developers.llamaindex.ai/python/framework/module_guides/loading/node_parsers/)
128
+ - [HierarchicalNodeParser](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/hierarchical/)
129
+ - [SentenceWindowNodeParser](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_window/)
130
+
131
+ Do not choose LlamaIndex if you only need a lightweight chunker and do not want its document and node abstractions in the pipeline.
132
+
133
+ #### Haystack
134
+ Use [Haystack preprocessors](https://docs.haystack.deepset.ai/docs/documentsplitter) when the application is built as a search or RAG pipeline and you want deterministic preprocessing components.
135
+
136
+ Strong fit:
137
+ - You use Haystack pipelines end-to-end
138
+ - You need sentence, passage, page, line, or function splitting in a production-oriented indexing flow
139
+ - You want hierarchical document splitting without adopting a different ingestion framework
140
+ - You value stable preprocessing stages more than cutting-edge chunking algorithms
141
+
142
+ Especially relevant docs:
143
+ - [DocumentSplitter](https://docs.haystack.deepset.ai/docs/documentsplitter)
144
+ - [HierarchicalDocumentSplitter](https://docs.haystack.deepset.ai/docs/hierarchicaldocumentsplitter)
145
+ - [DocumentPreprocessor](https://docs.haystack.deepset.ai/docs/documentpreprocessor)
146
+
147
+ Do not choose Haystack for advanced semantic chunking if you are not otherwise using Haystack.
148
+
149
+ #### Unstructured
150
+ Use [Unstructured chunking](https://docs.unstructured.io/open-source/core-functionality/chunking) when the hard problem is not text splitting but extracting meaningful document elements first.
151
+
152
+ Strong fit:
153
+ - Your corpus includes PDFs, PowerPoints, HTML, tables, page layouts, and other messy enterprise documents
154
+ - You want chunking to follow structural elements created during partitioning
155
+ - You need section-preserving chunking such as `by_title`
156
+ - You care about table isolation, page boundaries, or composite document elements
157
+
158
+ Especially relevant docs:
159
+ - [Open-source chunking](https://docs.unstructured.io/open-source/core-functionality/chunking)
160
+ - [Platform API chunking strategies](https://docs.unstructured.io/platform-api/partition-api/chunking)
161
+
162
+ Do not choose Unstructured when the documents are already clean plain text and layout parsing adds no value.
163
+
164
+ ## When to Use This Skill
165
+ - When deciding which chunking framework a coding agent should adopt for a new RAG pipeline
166
+ - When replacing ad hoc chunking logic with a framework-backed approach
167
+ - When the team needs a framework-level recommendation before tuning chunk size or overlap
168
+ - When the corpus mixes prose, code, and structured enterprise documents
169
+
170
+ ## When NOT to Use This Skill
171
+ - When the framework is already fixed by platform constraints
172
+ - When the problem is chunk sizing rather than framework choice
173
+ - When you only need one simple deterministic splitter and framework selection is not material
174
+
175
+ ## Related Skills
176
+ - [Semantic Chunking](./semantic-chunking.md) - Meaning-aware chunk boundaries
177
+ - [Hierarchical Chunking](./hierarchical-chunking.md) - Multi-level chunk structures
178
+ - [Sliding Window Chunking](./sliding-window-chunking.md) - Overlap-based chunking
179
+ - [RAG for Code Documentation](../data-type-handling/rag-for-code-documentation.md) - Code-heavy corpus guidance
180
+
181
+ ## Metrics & Success Criteria
182
+ - **Framework Fit**: The chosen framework matches the dominant document and retrieval pattern
183
+ - **Operational Simplicity**: The framework does not introduce unnecessary ingestion complexity
184
+ - **Retrieval Quality**: Chunking improves recall and answer quality for the actual corpus
185
+ - **Maintainability**: The chunking approach is understandable and sustainable for the team
186
+ - **Migration Cost**: The framework choice does not create avoidable lock-in without clear upside
package/skills/chunking/contextual-chunk-headers.md
@@ -0,0 +1,106 @@
1
+ ---
2
+ title: "Contextual Chunk Headers"
3
+ description: "Add higher-level context to chunks by prepending headers before embedding for better retrieval."
4
+ allowed-tools:
5
+ - Read
6
+ - Grep
7
+ - Glob
8
+ - Bash
9
+ category: "chunking"
10
+ tags: ["contextual-headers", "metadata", "chunk-enhancement", "document-structure"]
11
+ ---
12
+
13
+ ## Overview
14
+ Contextual chunk headers (CCH) enhance retrieval by prepending higher-level context (document title, section headers, summaries) to each chunk before embedding. This gives embeddings a more accurate representation of content and meaning, significantly improving retrieval quality and reducing irrelevant results.
15
+
16
+ ## Problem Statement
17
+ Individual chunks often lack sufficient context:
18
+ - **Implicit References**: Chunks use pronouns and implicit references that aren't clear in isolation
19
+ - **Missing Context**: Chunks only make sense in the context of their section or document
20
+ - **Misleading Isolation**: Read alone, chunks can be misleading or incomplete
21
+ - **Poor Retrieval**: Queries that reference subjects or themes aren't matched effectively
22
+ - **LLM Misinterpretation**: LLMs struggle to understand decontextualized chunks
23
+
24
+ ## Key Concepts
25
+ - **Chunk Header**: Higher-level context prepended to chunk content
26
+ - **Document-Level Context**: Document title or summary as header context
27
+ - **Section-Level Context**: Hierarchy of section and subsection titles
28
+ - **Pre-Embedding Concatenation**: Combining header and chunk before embedding
29
+ - **Context Preservation**: Carrying headers through to retrieval and presentation
30
+
31
+ ## Implementation Guide
32
+
33
+ ### Step 1: Generate Document-Level Context
34
+ Extract or generate document titles and summaries for use in chunk headers.
35
+
36
+ **Why**: Document-level context provides the most important higher-level information for understanding any chunk within that document.
37
+
38
+ Use an LLM to generate a descriptive document title if one doesn't exist, or extract titles from document structure.
39
+
40
+ ### Step 2: Extract Section Hierarchy
41
+ Parse document structure to capture section and subsection titles.
42
+
43
+ **Why**: Section hierarchy provides granular context that helps retrieve information about specific topics and themes within the document.
44
+
45
+ Parse the document structure (markdown headings, HTML tags, etc.) to build a hierarchical representation of sections.
46
+
47
+ ### Step 3: Create Chunk Headers
48
+ Combine document and section context into comprehensive headers.
49
+
50
+ **Why**: Combined context provides both broad document understanding and specific section information, giving embeddings the best possible representation.
51
+
52
+ For each chunk, create a header combining document title and the section path that contains the chunk.
53
+
54
+ ### Step 4: Prepend Headers to Chunks
55
+ Combine headers with chunk content before embedding.
56
+
57
+ **Why**: Embeddings represent the combined header+chunk content, capturing both local meaning and global context in the same vector space.
58
+
59
+ When indexing each chunk, include the header as a prefix to the chunk content.
60
+
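A minimal sketch of the prepend step. The header layout is one common convention, not a standard, and the commented embedding call is a hypothetical API stand-in:

```python
def with_header(doc_title, section_path, chunk_text):
    # The header travels with the chunk into the embedding model and the index.
    header = f"Document: {doc_title}\nSection: {section_path}\n\n"
    return header + chunk_text

# At indexing time, embed the combined string rather than the bare chunk:
# vector = embedding_model.embed(with_header(title, path, chunk))  # hypothetical API
```

Store the bare chunk text alongside the header-enhanced text so presentation can choose either form.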
61
+ ### Step 5: Embed with Headers
62
+ Use the header-enhanced chunks for embedding and indexing.
63
+
64
+ **Why**: Embedding with headers ensures that retrieval considers both local content and broader context, leading to more relevant matches.
65
+
66
+ Index the enhanced chunks using your preferred embedding model and vector database.
67
+
68
+ ### Step 6: Present Results with Headers
69
+ Include headers when presenting retrieved chunks to users or LLMs.
70
+
71
+ **Why**: Presenting headers provides necessary context for understanding retrieved chunks, reducing misinterpretation and improving answer quality.
72
+
73
+ When returning retrieved content, include the original header for context.
74
+
75
+ ### Step 7: Build Complete CCH Pipeline
76
+ Combine all components into a cohesive chunking and retrieval system.
77
+
78
+ **Why**: A complete pipeline ensures that contextual headers are consistently applied through chunking, embedding, retrieval, and presentation.
79
+
80
+ For implementation patterns, see [LangChain RecursiveCharacterTextSplitter](https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter), [LlamaIndex node parser usage](https://developers.llamaindex.ai/python/framework/module_guides/loading/node_parsers/), [Anthropic Contextual Retrieval](https://www.anthropic.com/engineering/contextual-retrieval), and [dsRAG AutoContext](https://github.com/D-Star-AI/dsRAG).
81
+
82
+ ## When to Use This Skill
83
+ - Building RAG systems with structured documents (articles, reports, documentation)
84
+ - When chunks frequently refer to their subject via pronouns or implicit references
85
+ - For documents with clear section hierarchy or structure
86
+ - When retrieval quality suffers from decontextualized chunks
87
+ - For applications where LLMs need to understand chunk context
88
+
89
+ ## When NOT to Use This Skill
90
+ - For unstructured documents without clear hierarchy or titles
91
+ - When chunk headers would be redundant or too long
92
+ - For very short documents where the entire text is already coherent
93
+ - When storage overhead is a major concern (headers increase chunk size)
94
+ - For domains where document structure is meaningless (code snippets, logs)
95
+
96
+ ## Related Skills
97
+ - [Semantic Chunking](./semantic-chunking.md) - Context-aware chunking approach
98
+ - [Hierarchical Chunking](./hierarchical-chunking.md) - Structure-based chunking
99
+ - [Context Enrichment Window](../retrieval-strategies/context-enrichment-window.md) - Post-retrieval context
100
+
101
+ ## Metrics & Success Criteria
102
+ - **Retrieval Precision**: Increased rate of correct retrieval for subject-based queries
103
+ - **Context Completeness**: Retrieved chunks provide sufficient context for understanding
104
+ - **Header Effectiveness**: Headers improve embedding representation quality
105
+ - **Answer Quality**: Reduced hallucinations from decontextualized chunks
106
+ - **Storage Overhead**: Acceptable increase in chunk size due to headers
package/skills/chunking/hierarchical-chunking.md
@@ -0,0 +1,77 @@
1
+ ---
2
+ title: "Hierarchical Chunking"
3
+ description: "Chunk nested documents into parent-child levels so retrieval can move from broad sections to fine-grained passages."
4
+ allowed-tools:
5
+ - Read
6
+ - Grep
7
+ - Glob
8
+ - Bash
9
+ category: "chunking"
10
+ tags: ["nested", "multi-level", "document-structure", "parent-child"]
11
+ ---
12
+
13
+ ## Overview
14
+ Hierarchical chunking creates multi-level chunk structures that preserve document hierarchies (chapters, sections, subsections). This enables both broad-overview retrieval (high-level chunks) and detailed retrieval (low-level chunks) with parent-child relationships for context propagation.
15
+
16
+ ## Problem Statement
17
+ Flat chunking strategies lose the structural relationships within documents:
18
+ - Questions about broad topics retrieve detailed fragments instead of summaries
19
+ - No way to retrieve "context for context": when a detail is retrieved, its containing section is lost
20
+ - Navigation through document hierarchies becomes impossible
21
+ - Retrieval results can't be organized by document structure
22
+
23
+ ## Key Concepts
24
+ - **Parent-Child Relationships**: Links between summary chunks and their detailed sub-chunks
25
+ - **Multi-Level Granularity**: Chunks at different abstraction levels (document → chapter → section → paragraph)
26
+ - **Context Propagation**: Ability to include parent context when retrieving child chunks
27
+ - **Metadata Enrichment**: Storing hierarchy information for filtering and navigation
28
+ - **Recursive Chunking**: Applying chunking rules recursively through document structure
29
+
30
+ ## Implementation Guide
31
+
32
+ ### Step 1: Parse Document Structure
33
+ Extract the hierarchical structure of your documents.
34
+
35
+ **Why**: Proper parsing is foundational—you can't create a hierarchy without understanding the structure.
36
+
37
+ ### Step 2: Create Multi-Level Chunks
38
+ Generate chunks at multiple levels from the parsed hierarchy.
39
+
40
+ **Why**: Multiple chunk levels allow different retrieval strategies—summary chunks for broad questions, detailed chunks for specifics.
41
+
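As a rough sketch, multi-level chunks can be produced by re-splitting each parent at progressively smaller sizes while recording `parent_id` links. Fixed-size splitting stands in here for whatever splitter you actually use; the sizes and field names are illustrative:

```python
import uuid

def build_hierarchy(doc_text, sizes=(2048, 512, 128)):
    """Create chunks at several sizes, linking each child to its parent chunk."""
    levels = []
    parents = [{"id": str(uuid.uuid4()), "text": doc_text, "parent_id": None}]
    levels.append(parents)
    for size in sizes:
        children = []
        for parent in parents:
            text = parent["text"]
            # Fixed-size split as a stand-in for a real splitter.
            for i in range(0, len(text), size):
                children.append({
                    "id": str(uuid.uuid4()),
                    "text": text[i:i + size],
                    "parent_id": parent["id"],
                })
        levels.append(children)
        parents = children
    return levels
```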
42
+ ### Step 3: Store Hierarchy Metadata
43
+ Persist hierarchy information for retrieval-time context propagation.
44
+
45
+ **Why**: Metadata enables filtering and hierarchical queries (e.g., "retrieve from level 2 or deeper").
46
+
47
+ ### Step 4: Implement Hierarchical Retrieval
48
+ Retrieve chunks with optional parent context.
49
+
50
+ **Why**: Hierarchical retrieval allows users to "zoom in" from broad topics to specific details, just like browsing a table of contents.
51
+
52
+ For implementation details, see the [LlamaIndex HierarchicalNodeParser](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/hierarchical/), [LangChain RecursiveCharacterTextSplitter](https://reference.langchain.com/python/langchain-text-splitters/character/RecursiveCharacterTextSplitter), [Haystack HierarchicalDocumentSplitter](https://docs.haystack.deepset.ai/docs/hierarchicaldocumentsplitter), and [LlamaIndex Recursive Retriever + Node References](https://developers.llamaindex.ai/python/framework/integrations/retrievers/recursive_retriever_nodes/).
53
+
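Retrieval-time context propagation then amounts to following parent links after vector search. A minimal sketch, assuming chunks were stored with `parent_id` metadata at indexing time (field names are illustrative):

```python
def retrieve_with_parent(hits, chunk_store):
    """Attach each retrieved chunk's parent text for broader context.

    hits: leaf chunks returned by vector search; chunk_store maps id -> chunk dict.
    """
    enriched = []
    for hit in hits:
        parent = chunk_store.get(hit.get("parent_id"))
        enriched.append({
            "text": hit["text"],
            "parent_text": parent["text"] if parent else None,
        })
    return enriched
```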
54
+ ## When to Use This Skill
55
+ - Long-form documents with clear structure (books, technical documentation, research papers)
56
+ - When you need both summary and detailed retrieval
57
+ - For applications that support "drill-down" navigation
58
+ - When building document browsing/search interfaces
59
+ - For Q&A systems that benefit from hierarchical context
60
+
61
+ ## When NOT to Use This Skill
62
+ - Flat content without structure (blog posts, single-page articles)
63
+ - When storage complexity outweighs retrieval benefits
64
+ - For small document collections where simple chunking suffices
65
+ - When retrieval latency is critical (hierarchical retrieval adds overhead)
66
+
67
+ ## Related Skills
68
+ - [Semantic Chunking](./semantic-chunking.md) - For content-aware chunking
69
+ - [Sliding Window Chunking](./sliding-window-chunking.md) - For context overlap
70
+ - [Multi-Pass Retrieval with Reranking](../retrieval-strategies/multi-pass-retrieval-with-reranking.md) - For refining hierarchical results
71
+
72
+ ## Metrics & Success Criteria
73
+ - **Coverage**: All document content preserved across hierarchy levels
74
+ - **Retrieval Flexibility**: Can retrieve at any abstraction level
75
+ - **Context Relevance**: Parent context improves answer quality for detailed questions
76
+ - **Storage Efficiency**: Metadata overhead doesn't exceed 30% of storage
77
+ - **Query Latency**: Hierarchical retrieval within 2x of flat retrieval
package/skills/chunking/semantic-chunking.md
@@ -0,0 +1,78 @@
1
+ ---
2
+ title: "Semantic Chunking"
3
+ description: "Use semantic boundaries and embedding similarity to chunk text for higher-relevance retrieval."
4
+ allowed-tools:
5
+ - Read
6
+ - Grep
7
+ - Glob
8
+ - Bash
9
+ category: "chunking"
10
+ tags: ["semantic", "nlp", "sentence-boundary", "context-preservation"]
11
+ ---
12
+
13
+ ## Overview
14
+ Semantic chunking divides documents into segments based on natural language boundaries and semantic meaning rather than fixed character counts. This approach preserves context integrity and improves retrieval relevance by keeping related ideas together.
15
+
16
+ ## Problem Statement
17
+ Fixed-size chunking (e.g., 1000 tokens) often cuts through mid-sentence, mid-paragraph, or mid-concept, resulting in:
18
+ - Fragments that lose semantic meaning
19
+ - Context bleeding across unrelated chunks
20
+ - Poor retrieval relevance for semantic search
21
+ - Difficulty for LLMs to understand isolated fragments
22
+
23
+ ## Key Concepts
24
+ - **Semantic Boundaries**: Natural break points in text (paragraphs, sections, sentences)
25
+ - **Context Preservation**: Keeping related ideas and arguments together
26
+ - **Sentence Boundary Detection**: Using NLP tools to identify proper sentence ends
27
+ - **Paragraph Cohesion**: Identifying when a new thought or argument begins
28
+ - **Overlapping Windows**: Small overlaps to preserve context between adjacent chunks
29
+
30
+ ## Implementation Guide
31
+
32
+ ### Step 1: Detect Sentence Boundaries
33
+ Use sentence boundary detection rather than simple period splitting. This handles abbreviations, decimal numbers, and other edge cases.
34
+
35
+ **Why**: Simple period/regex splitting fails on "Dr. Smith", "3.14", "U.S.A.", etc.
36
+
37
+ ### Step 2: Group Sentences into Semantic Units
38
+ Group sentences into chunks based on semantic cohesion rather than fixed counts.
39
+
40
+ **Why**: Similarity-based chunking adapts to document density—dense technical sections get smaller chunks, narrative sections get larger ones.
41
+
42
+ ### Step 3: Apply Structure-Based Chunking
43
+ Leverage document structure (headings, markdown, HTML tags) to identify natural sections.
44
+
45
+ **Why**: Document structure often encodes semantic boundaries (chapters, sections, subsections).
46
+
47
+ ### Step 4: Implement Adaptive Sizing
48
+ Adjust chunk sizes based on content characteristics (code blocks, lists, tables).
49
+
50
+ **Why**: Different content types need different chunking strategies for optimal retrieval.
51
+
52
+ For practical implementations, compare [Sentence-BERT](https://arxiv.org/abs/1908.10084), [LlamaIndex Semantic Chunker](https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/), [LangChain RecursiveCharacterTextSplitter](https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter), and [spaCy sentence segmentation](https://spacy.io/usage/linguistic-features#sentence-segmentation).
53
+
54
+ ## When to Use This Skill
55
+ - Documents with clear semantic structure (articles, reports, documentation)
56
+ - When retrieval quality is more important than fixed-size guarantees
57
+ - For narrative or explanatory content where context matters
58
+ - When using semantic search (vector embeddings) for retrieval
59
+ - For Q&A systems that need coherent context for answers
60
+
61
+ ## When NOT to Use This Skill
62
+ - Code repositories where line-level precision matters
63
+ - Log files or structured data with fixed formats
64
+ - When you need exact token count guarantees (e.g., for API limits)
65
+ - Streaming data where chunk boundaries can't be deferred
66
+ - When using exact keyword matching (BM25) where chunk size matters less
67
+
68
+ ## Related Skills
69
+ - [Hierarchical Chunking](./hierarchical-chunking.md) - For nested document structures
70
+ - [Sliding Window Chunking](./sliding-window-chunking.md) - For overlapping context preservation
71
+ - [Hybrid Search BM25 Dense](../retrieval-strategies/hybrid-search-bm25-dense.md) - Combining search methods
72
+
73
+ ## Metrics & Success Criteria
74
+ - **Retrieval Relevance**: Retrieved chunks should contain complete, self-contained information
75
+ - **Context Preservation**: Related concepts remain in same chunk
76
+ - **Coverage**: No significant information lost in chunk boundaries
77
+ - **Token Efficiency**: Chunks balanced between too small (lossy) and too large (noisy)
78
+ - **Semantic Coherence**: Chunks represent coherent thoughts or sections
package/skills/chunking/sliding-window-chunking.md
@@ -0,0 +1,82 @@
+ ---
+ title: "Sliding Window Chunking"
+ description: "Use overlapping windows to preserve context across chunk boundaries while controlling retrieval size."
+ allowed-tools:
+ - Read
+ - Grep
+ - Glob
+ - Bash
+ category: "chunking"
+ tags: ["overlap", "context-preservation", "window", "boundary"]
+ ---
+
+ ## Overview
+ Sliding window chunking creates overlapping chunks where each chunk shares content with adjacent chunks. This preserves context at chunk boundaries, ensuring that information split across chunks remains accessible through multiple retrieval paths.
+
+ ## Problem Statement
+ Non-overlapping chunking creates hard boundaries that can break important context:
+ - Critical information appears exactly at a chunk boundary
+ - Concepts spanning multiple chunks become fragmented
+ - Retrieval may miss relevant content because it's split across boundaries
+ - LLMs lose context when only one fragment is provided
+
+ ## Key Concepts
+ - **Overlap Percentage**: The proportion of shared content between adjacent chunks (typically 10-25%)
+ - **Window Size**: The total size of each chunk including overlap
+ - **Stride**: The distance between chunk start points (window size - overlap)
+ - **Context Preservation**: Ensuring relevant context exists on both sides of boundaries
+ - **Multiple Retrieval Paths**: Same content accessible from different chunks
+
+ ## Implementation Guide
+
+ ### Step 1: Determine Overlap Parameters
+ Calculate appropriate overlap based on your use case and chunk size.
+
+ **Why**: The overlap ratio determines how much context is preserved—too little loses context, too much creates redundancy.
+
+ ### Step 2: Implement Sliding Window Chunking
+ Create chunks with configurable overlap, adjusting chunk boundaries so they do not fall mid-sentence.
+
+ **Why**: Boundary adjustment prevents chunks from ending mid-sentence while maintaining the sliding window pattern.
+
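Steps 2 and 3 can be sketched together in a character-based chunker. This is a minimal sketch under stated assumptions: the 30-character snapping tolerance and the metadata field names are illustrative choices, not a standard.

```python
def sliding_window_chunks(text: str, window: int = 200, overlap: int = 50) -> list[dict]:
    """Split text into overlapping chunks; stride = window - overlap."""
    assert 0 <= overlap < window
    stride = window - overlap
    chunks, start, idx = [], 0, 0
    while start < len(text):
        end = min(start + window, len(text))
        # Snap the end forward to the next sentence terminator within a
        # small tolerance, so chunks avoid stopping mid-sentence.
        period = text.find(". ", end, min(end + 30, len(text)))
        if period != -1:
            end = period + 1
        chunks.append({
            "chunk_id": idx,
            "start": start,              # offsets double as overlap metadata
            "end": end,                  # for retrieval-time deduplication
            "text": text[start:end],
            "overlaps_previous": idx > 0,
        })
        if end == len(text):
            break
        start += stride                  # fixed stride; overlap varies
        idx += 1                         # slightly after snapping
    return chunks
```

Because the stride stays fixed while the end may snap forward, the actual overlap fluctuates a little around the nominal `overlap` value; that trade-off keeps chunk starts predictable.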
+ ### Step 3: Add Chunk Metadata
+ Track overlap information for retrieval-time decisions.
+
+ **Why**: Metadata allows retrieval systems to identify and deduplicate overlapping content.
+
+ ### Step 4: Smart Retrieval
+ Handle overlapping results intelligently.
+
+ **Why**: Smart retrieval prevents redundant information while preserving the benefits of overlapping chunks.
+
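One way to implement this step, assuming each stored chunk carries the `start`/`end` offsets from the metadata step: merge retrieved chunks whose character ranges overlap, so the LLM never sees the same passage twice.

```python
def merge_overlapping(results: list[dict]) -> list[dict]:
    """Merge retrieved chunks whose [start, end) ranges overlap, so each
    region of the source document appears only once in the final context."""
    merged = []
    for chunk in sorted(results, key=lambda c: c["start"]):
        if merged and chunk["start"] < merged[-1]["end"]:
            prev = merged[-1]
            # Append only the non-overlapping suffix of the new chunk.
            suffix_from = prev["end"] - chunk["start"]
            prev["text"] += chunk["text"][suffix_from:]
            prev["end"] = max(prev["end"], chunk["end"])
        else:
            merged.append(dict(chunk))
    return merged
```

A chunk fully contained in an already-merged one contributes nothing new: its suffix slice is empty and the merged range is unchanged.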
+ ### Step 5: Token-Aware Sliding Window
+ Implement sliding window based on tokens rather than characters.
+
+ **Why**: Token-aware chunking is more accurate for LLMs and respects their actual token limits.
+
+ Useful implementations include [LangChain RecursiveCharacterTextSplitter](https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter), the [OpenAI tiktoken cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb), [tiktoken](https://github.com/openai/tiktoken), and [Sentence Transformers rerankers](https://www.sbert.net/examples/cross_encoder/training/rerankers/README.html).
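A sketch of the token-aware variant, using whitespace tokens purely as a stand-in: in practice you would swap in tiktoken (e.g. `tiktoken.get_encoding("cl100k_base")`) so counts match the target model's real tokenizer.

```python
def token_windows(text: str, max_tokens: int = 256, overlap_tokens: int = 32) -> list[str]:
    # Whitespace tokens stand in for a real tokenizer here; replace
    # text.split() with an encoder to respect actual model token limits.
    assert 0 <= overlap_tokens < max_tokens
    tokens = text.split()
    stride = max_tokens - overlap_tokens
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # last window already reaches the end of the text
    return windows
```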
+
+ ## When to Use This Skill
+ - Documents where context at boundaries is important
+ - When using semantic search where retrieval might match content near boundaries
+ - For narrative or sequential content where flow matters
+ - When building systems that benefit from multiple retrieval paths
+ - For technical documentation where definitions/examples might span chunks
+
+ ## When NOT to Use This Skill
+ - When storage costs are critical (overlap increases storage by 20-30%)
+ - When exact deduplication is required (overlap makes this harder)
+ - For structured data with clear boundaries (JSON, CSV)
+ - When using exact keyword search where overlap doesn't help
+
+ ## Related Skills
+ - [Semantic Chunking](./semantic-chunking.md) - For content-aware boundaries
+ - [Hierarchical Chunking](./hierarchical-chunking.md) - For nested structures
+ - [Hybrid Search BM25 Dense](../retrieval-strategies/hybrid-search-bm25-dense.md) - For combining search methods
+
+ ## Metrics & Success Criteria
+ - **Boundary Coverage**: Critical terms appear in at least one chunk (not lost at boundaries)
+ - **Storage Overhead**: Overlap increases storage by the expected ratio (e.g., 20%)
+ - **Retrieval Recall**: Same content accessible through multiple chunks
+ - **Context Preservation**: Retrieved chunks contain sufficient surrounding context
+ - **Deduplication Quality**: Can identify and handle overlapping content effectively
@@ -0,0 +1,83 @@
+ ---
+ title: "RAG for Code Documentation"
+ description: "Handle code-aware retrieval by preserving symbols, file structure, and API context."
+ allowed-tools:
+ - Read
+ - Grep
+ - Glob
+ - Bash
+ category: "data-type-handling"
+ tags: ["code", "programming", "syntax", "api", "documentation"]
+ ---
+
+ ## Overview
+ RAG for code documentation requires specialized handling due to code's structured nature, syntax-specific patterns, and the importance of preserving function signatures, imports, and contextual relationships. This skill covers embedding code snippets, handling API references, and retrieving code-aware context.
+
+ ## Problem Statement
+ Generic RAG approaches struggle with code-related queries:
+ - Standard chunking breaks function/class structures mid-signature
+ - General-purpose embedding models capture code syntax and semantics less reliably than natural language
+ - Code requires preserving structural relationships (imports, dependencies, inheritance)
+ - API queries need exact matches alongside semantic understanding
+ - Code examples need context to be useful (imports, dependencies)
+
+ ## Key Concepts
+ - **Code-Aware Chunking**: Preserving function/class boundaries and syntactic units
+ - **Code-Specific Embeddings**: Models trained on code (CodeBERT, StarCoder) vs. general embeddings
+ - **Syntax Preservation**: Maintaining formatting, indentation, and structure
+ - **Context Tracking**: Following import chains and dependency relationships
+ - **Hybrid Search for Code**: Combining semantic search with AST-based matching
+
+ ## Implementation Guide
+
+ ### Step 1: Code-Aware Document Parsing
+ Parse code files into structured chunks that preserve syntactic units.
+
+ **Why**: AST-based parsing preserves code structure that would be lost with generic text chunking.
+
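For Python sources, the standard library's `ast` module is enough to sketch this step. The chunk fields below are illustrative, not a fixed schema.

```python
import ast

def function_chunks(source: str, path: str = "<memory>") -> list[dict]:
    """Split a Python module into one chunk per top-level function or class,
    keeping the full source segment so signatures are never cut in half."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "type": type(node).__name__,
                "file": path,
                "docstring": ast.get_docstring(node),
                "source": ast.get_source_segment(source, node),
            })
    return chunks
```

For other languages, tree-sitter grammars fill the same role the `ast` module plays here.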
+ ### Step 2: Code-Specific Embedding
+ Use embeddings designed for code or augment text embeddings with code-aware features.
+
+ **Why**: Code-specific embeddings capture semantic meaning in code (function relationships, patterns) that general embeddings miss.
+
+ ### Step 2.5: Store Code Chunks in Vector DB
+ Store code chunks with rich metadata for effective retrieval.
+
+ **Why**: Rich metadata enables filtering by language, type, file path, and other code-specific attributes.
+
+ ### Step 3: Implement Code-Aware Search
+ Search with code-specific considerations.
+
+ **Why**: Code-aware search combines semantic understanding with precise filtering for programming-specific use cases.
+
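Steps 2.5 and 3 can be sketched together as a toy filtered search: a plain list stands in for the vector DB, word overlap stands in for embedding similarity, and the `language`/`type` metadata fields are illustrative.

```python
import re
from typing import Optional

def search_code(index: list[dict], query: str, language: Optional[str] = None,
                chunk_type: Optional[str] = None, k: int = 3) -> list[dict]:
    # Metadata filtering first (what a vector DB does with a filter clause),
    # then a naive lexical score standing in for embedding similarity.
    q_terms = set(re.findall(r"[a-z0-9_]+", query.lower()))
    candidates = [
        c for c in index
        if (language is None or c["language"] == language)
        and (chunk_type is None or c["type"] == chunk_type)
    ]
    def score(c: dict) -> int:
        return len(q_terms & set(re.findall(r"[a-z0-9_]+", c["text"].lower())))
    return sorted(candidates, key=score, reverse=True)[:k]
```

In a real deployment the filter clause runs inside the vector store (Qdrant, Weaviate, pgvector all support metadata filters) rather than in Python.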
+ ### Step 4: Handle API Documentation
+ Specialized handling for API reference queries.
+
+ **Why**: API queries often require exact matching (function/class names) rather than pure semantic search.
+
+ Practical references include the [CodeBERT paper](https://huggingface.co/papers/2002.08155), [GraphCodeBERT paper](https://huggingface.co/papers/2009.08366), [StarCoder paper](https://huggingface.co/papers/2305.06161), and [GitHub code search](https://github.com/features/code-search).
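The exact-match-first routing for API queries can be sketched as below; the symbol regex, the `api_index` dict, and the fallback callable are illustrative stand-ins for a real symbol table and semantic retriever.

```python
import re

def answer_api_query(query: str, api_index: dict[str, str], semantic_search) -> list[str]:
    """Route API queries: exact symbol lookup first, semantic search as a
    fallback. api_index maps symbol names to their reference docs."""
    # Identifiers such as json.loads or DataFrame are strong exact-match signals.
    symbols = re.findall(r"[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z0-9_]+)*", query)
    exact = [api_index[s] for s in symbols if s in api_index]
    return exact if exact else semantic_search(query)
```

A production router would also normalize aliases (e.g. `pd.DataFrame` vs `pandas.DataFrame`) before the lookup.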
+
+ ## When to Use This Skill
+ - Building code search or documentation assistants
+ - When users query code syntax, APIs, or programming concepts
+ - For IDE integrations and code completion systems
+ - When building technical documentation with code examples
+ - For troubleshooting programming queries
+
+ ## When NOT to Use This Skill
+ - For natural language-only content (use general RAG)
+ - When code structure isn't important for retrieval
+ - For very simple codebases where grep suffices
+ - When running code-specific embedding models is not feasible
+
+ ## Related Skills
+ - [Semantic Chunking](../chunking/semantic-chunking.md) - For document chunks
+ - [Choosing Vector DB by Datatype](../vector-databases/choosing-vector-db-by-datatype.md) - Database selection
+ - [Hierarchical Chunking](../chunking/hierarchical-chunking.md) - For nested code structures
+
+ ## Metrics & Success Criteria
+ - **Code Retrieval Accuracy**: Exact matches for function/class names > 90%
+ - **Semantic Understanding**: NDCG@10 for code-related queries > 0.7
+ - **Context Preservation**: Relevant imports included > 80% of the time
+ - **Syntax Preservation**: Code formatting maintained 100%
+ - **Query Latency**: < 200ms for typical code queries