@mastra/rag 2.1.1 → 2.1.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +18 -0
- package/dist/docs/SKILL.md +38 -0
- package/dist/docs/assets/SOURCE_MAP.json +6 -0
- package/dist/docs/references/docs-rag-chunking-and-embedding.md +183 -0
- package/dist/docs/references/docs-rag-graph-rag.md +215 -0
- package/dist/docs/references/docs-rag-overview.md +72 -0
- package/dist/docs/references/docs-rag-retrieval.md +515 -0
- package/dist/docs/references/reference-rag-chunk.md +221 -0
- package/dist/docs/references/reference-rag-database-config.md +261 -0
- package/dist/docs/references/reference-rag-document.md +114 -0
- package/dist/docs/references/reference-rag-extract-params.md +168 -0
- package/dist/docs/references/reference-rag-graph-rag.md +111 -0
- package/dist/docs/references/reference-rag-rerank.md +75 -0
- package/dist/docs/references/reference-rag-rerankWithScorer.md +80 -0
- package/dist/docs/references/reference-tools-document-chunker-tool.md +89 -0
- package/dist/docs/references/reference-tools-graph-rag-tool.md +182 -0
- package/dist/docs/references/reference-tools-vector-query-tool.md +459 -0
- package/dist/document/transformers/semantic-markdown.d.ts +6 -4
- package/dist/document/transformers/semantic-markdown.d.ts.map +1 -1
- package/dist/document/transformers/token.d.ts +5 -4
- package/dist/document/transformers/token.d.ts.map +1 -1
- package/dist/index.cjs +41 -26
- package/dist/index.cjs.map +1 -1
- package/dist/index.js +41 -26
- package/dist/index.js.map +1 -1
- package/package.json +5 -5
@@ -0,0 +1,168 @@

# ExtractParams

ExtractParams configures metadata extraction from document chunks using LLM analysis.

## Example

```typescript
import { MDocument } from '@mastra/rag'

const doc = MDocument.fromText(text)
const chunks = await doc.chunk({
  extract: {
    title: true, // Extract titles using default settings
    summary: true, // Generate summaries using default settings
    keywords: true, // Extract keywords using default settings
  },
})

// Example output:
// chunks[0].metadata = {
//   documentTitle: "AI Systems Overview",
//   sectionSummary: "Overview of artificial intelligence concepts and applications",
//   excerptKeywords: "KEYWORDS: AI, machine learning, algorithms"
// }
```

## Parameters

The `extract` parameter accepts the following fields:

**title?:** (`boolean | TitleExtractorsArgs`): Enable title extraction. Set to `true` for default settings, or provide custom configuration.

**summary?:** (`boolean | SummaryExtractArgs`): Enable summary extraction. Set to `true` for default settings, or provide custom configuration.

**questions?:** (`boolean | QuestionAnswerExtractArgs`): Enable question generation. Set to `true` for default settings, or provide custom configuration.

**keywords?:** (`boolean | KeywordExtractArgs`): Enable keyword extraction. Set to `true` for default settings, or provide custom configuration.

**schema?:** (`SchemaExtractArgs`): Enable structured metadata extraction using a Zod schema.

## Extractor Arguments

### TitleExtractorsArgs

**llm?:** (`MastraLanguageModel`): AI SDK language model to use for title extraction

**nodes?:** (`number`): Number of title nodes to extract

**nodeTemplate?:** (`string`): Custom prompt template for title node extraction. Must include the `{context}` placeholder

**combineTemplate?:** (`string`): Custom prompt template for combining titles. Must include the `{context}` placeholder

### SummaryExtractArgs

**llm?:** (`MastraLanguageModel`): AI SDK language model to use for summary extraction

**summaries?:** (`('self' | 'prev' | 'next')[]`): List of summary types to generate. Can only include `'self'` (current chunk), `'prev'` (previous chunk), or `'next'` (next chunk)

**promptTemplate?:** (`string`): Custom prompt template for summary generation. Must include the `{context}` placeholder

### QuestionAnswerExtractArgs

**llm?:** (`MastraLanguageModel`): AI SDK language model to use for question generation

**questions?:** (`number`): Number of questions to generate

**promptTemplate?:** (`string`): Custom prompt template for question generation. Must include both `{context}` and `{numQuestions}` placeholders

**embeddingOnly?:** (`boolean`): If `true`, only generate embeddings without actual questions

### KeywordExtractArgs

**llm?:** (`MastraLanguageModel`): AI SDK language model to use for keyword extraction

**keywords?:** (`number`): Number of keywords to extract

**promptTemplate?:** (`string`): Custom prompt template for keyword extraction. Must include both `{context}` and `{maxKeywords}` placeholders

### SchemaExtractArgs

**schema:** (`ZodType`): Zod schema defining the structure of the data to extract.

**llm?:** (`MastraLanguageModel`): AI SDK language model to use for extraction.

**instructions?:** (`string`): Instructions for the LLM on what to extract.

**metadataKey?:** (`string`): Key to nest extraction results under. If omitted, results are spread into the metadata object.
## Advanced Example

```typescript
import { MDocument } from '@mastra/rag'
import { z } from 'zod'

const doc = MDocument.fromText(text)
const chunks = await doc.chunk({
  extract: {
    // Title extraction with custom settings
    title: {
      nodes: 2, // Extract 2 title nodes
      nodeTemplate: 'Generate a title for this: {context}',
      combineTemplate: 'Combine these titles: {context}',
    },

    // Summary extraction with custom settings
    summary: {
      summaries: ['self'], // Generate summaries for the current chunk
      promptTemplate: 'Summarize this: {context}',
    },

    // Question generation with custom settings
    questions: {
      questions: 3, // Generate 3 questions
      promptTemplate: 'Generate {numQuestions} questions about: {context}',
      embeddingOnly: false,
    },

    // Keyword extraction with custom settings
    keywords: {
      keywords: 5, // Extract 5 keywords
      promptTemplate: 'Extract {maxKeywords} key terms from: {context}',
    },

    // Schema extraction with Zod
    schema: {
      schema: z.object({
        productName: z.string(),
        category: z.enum(['electronics', 'clothing']),
      }),
      instructions: 'Extract product information.',
      metadataKey: 'product',
    },
  },
})

// Example output:
// chunks[0].metadata = {
//   documentTitle: "AI in Modern Computing",
//   sectionSummary: "Overview of AI concepts and their applications in computing",
//   questionsThisExcerptCanAnswer: "1. What is machine learning?\n2. How do neural networks work?",
//   excerptKeywords: "1. Machine learning\n2. Neural networks\n3. Training data",
//   product: {
//     productName: "Neural Net 2000",
//     category: "electronics"
//   }
// }
```

## Document Grouping for Title Extraction

When using the `TitleExtractor`, you can group multiple chunks together for title extraction by specifying a shared `docId` in the `metadata` field of each chunk. All chunks with the same `docId` receive the same extracted title. If no `docId` is set, each chunk is treated as its own document for title extraction.

**Example:**

```ts
import { MDocument } from '@mastra/rag'

const doc = new MDocument({
  docs: [
    { text: 'chunk 1', metadata: { docId: 'docA' } },
    { text: 'chunk 2', metadata: { docId: 'docA' } },
    { text: 'chunk 3', metadata: { docId: 'docB' } },
  ],
  type: 'text',
})

await doc.extractMetadata({ title: true })
// The first two chunks share a title; the third chunk is assigned a separate title.
```
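The grouping behavior described above can be sketched as a plain group-by over `docId`. The helper below is illustrative only (not the library's internals); the per-chunk fallback key is an assumption that mirrors "each chunk is its own document" when `docId` is absent.

```typescript
// Illustrative sketch of docId-based grouping, not @mastra/rag's internals:
// chunks sharing a docId form one group, and each group would get one title.
interface Chunk {
  text: string
  metadata: Record<string, any>
}

function groupByDocId(chunks: Chunk[]): Map<string, Chunk[]> {
  const groups = new Map<string, Chunk[]>()
  chunks.forEach((chunk, i) => {
    // Chunks without a docId fall back to a unique per-chunk key,
    // so each is treated as its own document.
    const key = chunk.metadata.docId ?? `__chunk_${i}`
    const group = groups.get(key) ?? []
    group.push(chunk)
    groups.set(key, group)
  })
  return groups
}

const groups = groupByDocId([
  { text: 'chunk 1', metadata: { docId: 'docA' } },
  { text: 'chunk 2', metadata: { docId: 'docA' } },
  { text: 'chunk 3', metadata: { docId: 'docB' } },
])
// 'docA' holds two chunks (they would share a title); 'docB' holds one.
```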
@@ -0,0 +1,111 @@

# GraphRAG

The `GraphRAG` class implements a graph-based approach to retrieval-augmented generation. It creates a knowledge graph from document chunks, where nodes represent documents and edges represent semantic relationships, enabling both direct similarity matching and discovery of related content through graph traversal.

## Basic Usage

```typescript
import { GraphRAG } from '@mastra/rag'

const graphRag = new GraphRAG({
  dimension: 1536,
  threshold: 0.7,
})

// Create the graph from chunks and embeddings
graphRag.createGraph(documentChunks, embeddings)

// Query the graph with an embedding
const results = await graphRag.query({
  query: queryEmbedding,
  topK: 10,
  randomWalkSteps: 100,
  restartProb: 0.15,
})
```

## Constructor Parameters

**dimension?:** (`number`): Dimension of the embedding vectors (Default: `1536`)

**threshold?:** (`number`): Similarity threshold for creating edges between nodes (0-1) (Default: `0.7`)
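The threshold decides which pairs of chunks get connected: an edge links two nodes when the cosine similarity of their embeddings meets or exceeds it. A minimal self-contained sketch of that idea (illustrative only, not GraphRAG's implementation):

```typescript
// Illustrative sketch: connect chunks whose embedding cosine similarity
// meets the threshold. Not @mastra/rag's actual internals.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

function buildEdges(embeddings: number[][], threshold: number): [number, number][] {
  const edges: [number, number][] = []
  for (let i = 0; i < embeddings.length; i++) {
    for (let j = i + 1; j < embeddings.length; j++) {
      if (cosineSimilarity(embeddings[i], embeddings[j]) >= threshold) {
        edges.push([i, j])
      }
    }
  }
  return edges
}

// Identical vectors have similarity 1, orthogonal vectors 0,
// so only the first pair clears a 0.7 threshold here.
const edges = buildEdges(
  [
    [1, 0],
    [1, 0],
    [0, 1],
  ],
  0.7,
)
// edges → [[0, 1]]
```

Raising the threshold (as in the advanced example below, `0.8`) produces a sparser graph where only strongly related chunks are linked.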
## Methods

### createGraph

Creates a knowledge graph from document chunks and their embeddings.

```typescript
createGraph(chunks: GraphChunk[], embeddings: GraphEmbedding[]): void
```

#### Parameters

**chunks:** (`GraphChunk[]`): Array of document chunks with text and metadata

**embeddings:** (`GraphEmbedding[]`): Array of embeddings corresponding to chunks

### query

Performs a graph-based search combining vector similarity and graph traversal.

```typescript
query({
  query,
  topK = 10,
  randomWalkSteps = 100,
  restartProb = 0.15
}: {
  query: number[];
  topK?: number;
  randomWalkSteps?: number;
  restartProb?: number;
}): RankedNode[]
```

#### Parameters

**query:** (`number[]`): Query embedding vector

**topK?:** (`number`): Number of results to return (Default: `10`)

**randomWalkSteps?:** (`number`): Number of steps in the random walk (Default: `100`)

**restartProb?:** (`number`): Probability of restarting the walk from the query node (Default: `0.15`)
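The `randomWalkSteps` and `restartProb` parameters describe a random walk with restart, a standard way to spread relevance from strong vector matches to their graph neighbors. The sketch below shows the procedure over an assumed adjacency-list layout; treat it as a conceptual illustration, not the library's implementation.

```typescript
// Illustrative random walk with restart over an adjacency list
// (assumed data layout, not GraphRAG's internals). Visit counts
// approximate each node's relevance to the start node.
function randomWalkWithRestart(
  adjacency: number[][], // adjacency[i] = neighbors of node i
  start: number,
  steps: number,
  restartProb: number,
): number[] {
  const visits = new Array(adjacency.length).fill(0)
  let current = start
  for (let step = 0; step < steps; step++) {
    visits[current]++
    const neighbors = adjacency[current]
    if (Math.random() < restartProb || neighbors.length === 0) {
      current = start // jump back to the query-matched node
    } else {
      current = neighbors[Math.floor(Math.random() * neighbors.length)]
    }
  }
  return visits
}

// Nodes 0-1-2 form a chain; node 2 is reachable from 0 only via 1,
// so distant nodes accumulate fewer visits than close ones.
const visits = randomWalkWithRestart([[1], [0, 2], [1]], 0, 1000, 0.15)
```

More steps smooth out the estimates, while a higher restart probability keeps results closer to the original query matches.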
#### Returns

Returns an array of `RankedNode` objects, where each node contains:

**id:** (`string`): Unique identifier for the node

**content:** (`string`): Text content of the document chunk

**metadata:** (`Record<string, any>`): Additional metadata associated with the chunk

**score:** (`number`): Combined relevance score from graph traversal

## Advanced Example

```typescript
const graphRag = new GraphRAG({
  dimension: 1536,
  threshold: 0.8, // Stricter similarity threshold
})

// Create graph from chunks and embeddings
graphRag.createGraph(documentChunks, embeddings)

// Query with custom parameters
const results = await graphRag.query({
  query: queryEmbedding,
  topK: 5,
  randomWalkSteps: 200,
  restartProb: 0.2,
})
```

## Related

- [createGraphRAGTool](https://mastra.ai/reference/tools/graph-rag-tool)
@@ -0,0 +1,75 @@

# rerank()

The `rerank()` function provides advanced reranking capabilities for vector search results by combining semantic relevance, vector similarity, and position-based scoring.

```typescript
function rerank(
  results: QueryResult[],
  query: string,
  model: MastraLanguageModel,
  options?: RerankerFunctionOptions,
): Promise<RerankResult[]>
```

## Usage Example

```typescript
import { rerank } from '@mastra/rag'

const model = 'openai/gpt-5.1'

const rerankedResults = await rerank(vectorSearchResults, 'How do I deploy to production?', model, {
  weights: {
    semantic: 0.5,
    vector: 0.3,
    position: 0.2,
  },
  topK: 3,
})
```

## Parameters

**results:** (`QueryResult[]`): The vector search results to rerank

**query:** (`string`): The search query text used to evaluate relevance

**model:** (`MastraLanguageModel`): The language model to use for reranking

**options?:** (`RerankerFunctionOptions`): Options for the reranking model

The rerank function accepts any LanguageModel from the Vercel AI SDK. When using the Cohere model `rerank-v3.5`, it automatically uses Cohere's reranking capabilities.

> **Note:** For semantic scoring to work properly during re-ranking, each result must include the text content in its `metadata.text` field.

### RerankerFunctionOptions

**weights?:** (`WeightConfig`): Weights for the scoring components:

- **semantic?:** (`number`): Weight for semantic relevance (Default: `0.4`)
- **vector?:** (`number`): Weight for vector similarity (Default: `0.4`)
- **position?:** (`number`): Weight for position-based scoring (Default: `0.2`)

**queryEmbedding?:** (`number[]`): Embedding of the query

**topK?:** (`number`): Number of top results to return (Default: `3`)
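The final score blends the three component scores according to these weights. A minimal sketch of that combination, assuming the weights act as plain multipliers summed together (an illustration of the documented defaults, not the library's exact formula):

```typescript
// Illustrative weighted blend of the three component scores.
// The weights mirror the documented defaults; the exact internal
// formula in @mastra/rag may differ.
interface ScoringDetails {
  semantic: number
  vector: number
  position: number
}

function combineScores(
  details: ScoringDetails,
  weights = { semantic: 0.4, vector: 0.4, position: 0.2 }, // documented defaults
): number {
  return (
    details.semantic * weights.semantic +
    details.vector * weights.vector +
    details.position * weights.position
  )
}

const score = combineScores({ semantic: 0.9, vector: 0.8, position: 0.5 })
// 0.9 * 0.4 + 0.8 * 0.4 + 0.5 * 0.2 = 0.78
```

Shifting weight toward `semantic` (as in the usage example above) makes LLM-judged relevance dominate over raw vector similarity and original result position.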
## Returns

The function returns an array of `RerankResult` objects:

**result:** (`QueryResult`): The original query result

**score:** (`number`): Combined reranking score (0-1)

**details:** (`ScoringDetails`): Detailed scoring information

### ScoringDetails

**semantic:** (`number`): Semantic relevance score (0-1)

**vector:** (`number`): Vector similarity score (0-1)

**position:** (`number`): Position-based score (0-1)

**queryAnalysis?:** (`object`): Query analysis details:

- **magnitude:** (`number`): Magnitude of the query
- **dominantFeatures:** (`number[]`): Dominant features of the query

## Related

- [createVectorQueryTool](https://mastra.ai/reference/tools/vector-query-tool)
@@ -0,0 +1,80 @@

# rerankWithScorer()

The `rerankWithScorer()` function provides advanced reranking capabilities for vector search results by combining semantic relevance, vector similarity, and position-based scoring.

```typescript
function rerankWithScorer({
  results,
  query,
  scorer,
  options,
}: {
  results: QueryResult[];
  query: string;
  scorer: RelevanceScoreProvider;
  options?: RerankerFunctionOptions;
}): Promise<RerankResult[]>
```

## Usage Example

```typescript
import { rerankWithScorer as rerank, CohereRelevanceScorer } from '@mastra/rag'

const scorer = new CohereRelevanceScorer('rerank-v3.5')

const rerankedResults = await rerank({
  results: vectorSearchResults,
  query: 'How do I deploy to production?',
  scorer,
  options: {
    weights: {
      semantic: 0.5,
      vector: 0.3,
      position: 0.2,
    },
    topK: 3,
  },
})
```

## Parameters

**results:** (`QueryResult[]`): The vector search results to rerank

**query:** (`string`): The search query text used to evaluate relevance

**scorer:** (`RelevanceScoreProvider`): The relevance scorer to use for reranking

**options?:** (`RerankerFunctionOptions`): Options for the reranking model

The `rerankWithScorer` function accepts any `RelevanceScoreProvider` from `@mastra/rag`.
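A custom scorer can be supplied in place of the built-in ones. The sketch below assumes `RelevanceScoreProvider` exposes a `getRelevanceScore(query, text)` method resolving to a 0-1 score; both that interface shape and the naive keyword-overlap scoring are illustrative assumptions, not the package's definition.

```typescript
// Hypothetical custom scorer. The interface shape is an assumption
// for illustration, not copied from @mastra/rag.
interface RelevanceScoreProvider {
  getRelevanceScore(query: string, text: string): Promise<number>
}

// Naive scorer: fraction of query words that appear in the text.
class KeywordOverlapScorer implements RelevanceScoreProvider {
  async getRelevanceScore(query: string, text: string): Promise<number> {
    const queryWords = query.toLowerCase().split(/\s+/).filter(Boolean)
    if (queryWords.length === 0) return 0
    const textWords = new Set(text.toLowerCase().split(/\s+/))
    const matched = queryWords.filter((w) => textWords.has(w)).length
    return matched / queryWords.length
  }
}

const scorer = new KeywordOverlapScorer()
const score = await scorer.getRelevanceScore(
  'deploy to production',
  'How to deploy your app to production',
)
// → 1 (all three query words appear in the text)
```

A scorer like this trades quality for zero API cost; in practice an LLM- or Cohere-backed provider gives far better relevance judgments.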
> **Note:** For semantic scoring to work properly during re-ranking, each result must include the text content in its `metadata.text` field.

### RerankerFunctionOptions

**weights?:** (`WeightConfig`): Weights for the scoring components:

- **semantic?:** (`number`): Weight for semantic relevance (Default: `0.4`)
- **vector?:** (`number`): Weight for vector similarity (Default: `0.4`)
- **position?:** (`number`): Weight for position-based scoring (Default: `0.2`)

**queryEmbedding?:** (`number[]`): Embedding of the query

**topK?:** (`number`): Number of top results to return (Default: `3`)

## Returns

The function returns an array of `RerankResult` objects:

**result:** (`QueryResult`): The original query result

**score:** (`number`): Combined reranking score (0-1)

**details:** (`ScoringDetails`): Detailed scoring information

### ScoringDetails

**semantic:** (`number`): Semantic relevance score (0-1)

**vector:** (`number`): Vector similarity score (0-1)

**position:** (`number`): Position-based score (0-1)

**queryAnalysis?:** (`object`): Query analysis details:

- **magnitude:** (`number`): Magnitude of the query
- **dominantFeatures:** (`number[]`): Dominant features of the query

## Related

- [createVectorQueryTool](https://mastra.ai/reference/tools/vector-query-tool)
@@ -0,0 +1,89 @@

# createDocumentChunkerTool()

The `createDocumentChunkerTool()` function creates a tool for splitting documents into smaller chunks for efficient processing and retrieval. It supports different chunking strategies and configurable parameters.

## Basic Usage

```typescript
import { createDocumentChunkerTool, MDocument } from '@mastra/rag'

const document = new MDocument({
  text: 'Your document content here...',
  metadata: { source: 'user-manual' },
})

const chunker = createDocumentChunkerTool({
  doc: document,
  params: {
    strategy: 'recursive',
    size: 512,
    overlap: 50,
    separator: '\n',
  },
})

const { chunks } = await chunker.execute()
```

## Parameters

**doc:** (`MDocument`): The document to be chunked

**params?:** (`ChunkParams`): Configuration parameters for chunking (Default: default chunking parameters)

### ChunkParams

**strategy?:** (`'recursive'`): The chunking strategy to use (Default: `'recursive'`)

**size?:** (`number`): Target size of each chunk in tokens/characters (Default: `512`)

**overlap?:** (`number`): Number of overlapping tokens/characters between chunks (Default: `50`)

**separator?:** (`string`): Character(s) to use as the chunk separator (Default: `'\n'`)
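The interaction of `size` and `overlap` can be pictured as a sliding window: each chunk starts `size - overlap` units after the previous one, so consecutive chunks share `overlap` units of context. The character-based sketch below shows only that windowing idea; the real tool additionally works on tokens, separators, and strategies.

```typescript
// Illustrative character-based sliding window: each chunk starts
// (size - overlap) characters after the previous one. The actual
// chunker also honors separators and strategies; this is only the
// windowing concept.
function slidingWindowChunks(text: string, size: number, overlap: number): string[] {
  if (overlap >= size) throw new Error('overlap must be smaller than size')
  const chunks: string[] = []
  const step = size - overlap
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size))
    if (start + size >= text.length) break // last window reached the end
  }
  return chunks
}

const chunks = slidingWindowChunks('abcdefghij', 4, 2)
// → ['abcd', 'cdef', 'efgh', 'ghij']
```

Larger overlaps preserve more cross-chunk context for retrieval at the cost of storing and embedding more duplicated text.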
## Returns

**chunks:** (`DocumentChunk[]`): Array of document chunks with their content and metadata

## Example with Custom Parameters

```typescript
const technicalDoc = new MDocument({
  text: longDocumentContent,
  metadata: {
    type: 'technical',
    version: '1.0',
  },
})

const chunker = createDocumentChunkerTool({
  doc: technicalDoc,
  params: {
    strategy: 'recursive',
    size: 1024, // Larger chunks
    overlap: 100, // More overlap
    separator: '\n\n', // Split on double newlines
  },
})

const { chunks } = await chunker.execute()

// Process the chunks
chunks.forEach((chunk, index) => {
  console.log(`Chunk ${index + 1} length: ${chunk.content.length}`)
})
```

## Tool Details

The chunker is created as a Mastra tool with the following properties:

- **Tool ID**: `Document Chunker {strategy} {size}`
- **Description**: `Chunks document using {strategy} strategy with size {size} and {overlap} overlap`
- **Input Schema**: Empty object (no additional inputs required)
- **Output Schema**: Object containing the chunks array

## Related

- [MDocument](https://mastra.ai/reference/rag/document)
- [createVectorQueryTool](https://mastra.ai/reference/tools/vector-query-tool)