@mastra/mcp-docs-server 1.1.35-alpha.15 → 1.1.35-alpha.17
- package/.docs/docs/agents/response-caching.md +148 -0
- package/.docs/docs/memory/observational-memory.md +56 -12
- package/.docs/docs/observability/tracing/exporters/arize.md +5 -5
- package/.docs/docs/observability/tracing/overview.md +1 -1
- package/.docs/models/index.md +1 -1
- package/.docs/models/providers/llmgateway.md +7 -1
- package/.docs/models/providers/nebius.md +2 -1
- package/.docs/models/providers/opencode.md +1 -2
- package/.docs/reference/index.md +1 -0
- package/.docs/reference/memory/observational-memory.md +11 -1
- package/.docs/reference/observability/tracing/exporters/arize.md +1 -1
- package/.docs/reference/processors/response-cache.md +114 -0
- package/CHANGELOG.md +8 -0
- package/package.json +4 -4
|
@@ -0,0 +1,148 @@
|
|
|
1
|
+
# Response caching
|
|
2
|
+
|
|
3
|
+
Response caching skips the LLM call and replays a previously cached response when an agent receives an identical request. Use it to drop latency to single-digit milliseconds and avoid paying for repeated calls.
|
|
4
|
+
|
|
5
|
+
Caching is implemented as the [`ResponseCache`](https://mastra.ai/reference/processors/response-cache) input processor. There is no agent-level option — to enable caching, register the processor explicitly. This keeps the API surface small while we collect feedback; per-call overrides flow through `RequestContext`.
|
|
6
|
+
|
|
7
|
+
## When to use response caching
|
|
8
|
+
|
|
9
|
+
Reach for it when the same request shape repeats across users or sessions, for example prompt templates, suggested-prompt buttons, agentic search re-asks, or guardrail LLMs that classify the same input over and over. Skip it when calls trigger external side effects through tools, since cache hits replay tool calls without re-executing them.
|
|
10
|
+
|
|
11
|
+
## Quickstart
|
|
12
|
+
|
|
13
|
+
Add a `ResponseCache` to the agent's `inputProcessors` and pass any `MastraServerCache` as the backend. For development, `InMemoryServerCache` works out of the box:
|
|
14
|
+
|
|
15
|
+
```typescript
import { Agent } from '@mastra/core/agent'
import { InMemoryServerCache } from '@mastra/core/cache'
import { ResponseCache } from '@mastra/core/processors'

const cache = new InMemoryServerCache()

export const searchAgent = new Agent({
  name: 'Search Agent',
  instructions: 'You answer questions concisely.',
  model: 'openai/gpt-5',
  inputProcessors: [new ResponseCache({ cache, ttl: 600 })], // 10 minutes
})
```
|
|
29
|
+
|
|
30
|
+
The first call runs the LLM normally and writes the response to the cache. Subsequent calls with an identical resolved prompt return the cached response without invoking the LLM.
|
|
31
|
+
|
|
32
|
+
## Per-call overrides via RequestContext
|
|
33
|
+
|
|
34
|
+
Per-call config flows through `RequestContext`. Use `ResponseCache.context()` to build a fresh context, or `ResponseCache.applyContext()` to merge into one you already have:
|
|
35
|
+
|
|
36
|
+
```typescript
import { ResponseCache } from '@mastra/core/processors'
import { RequestContext } from '@mastra/core/request-context'

// Fresh context with the override
await agent.stream('hello', {
  requestContext: ResponseCache.context({ key: 'custom-key', bust: true }),
})

// Or merge into an existing context
const ctx = new RequestContext()
ctx.set('caller-meta', { userId: 'u-123' })
ResponseCache.applyContext(ctx, { bust: true })
await agent.stream('hello', { requestContext: ctx })
```
|
|
51
|
+
|
|
52
|
+
Three fields are overridable per call:
|
|
53
|
+
|
|
54
|
+
- `key` — string or function. Overrides the auto-derived cache key for this request only.
|
|
55
|
+
- `scope` — string or `null`. Overrides the tenant/user scope for this request only. `null` opts out of scoping.
|
|
56
|
+
- `bust` — boolean. Skips the cache read but still writes on completion (useful for "force refresh" buttons).
|
|
57
|
+
|
|
58
|
+
`cache`, `ttl`, and `agentId` stay on the constructor — they are instance-level concerns and not safe to vary per call.
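
The `key` and `bust` semantics can be illustrated with a small dependency-free sketch. This is not Mastra's implementation: `PerCallOptions`, `handleRequest`, the plain `Map` standing in for the cache backend, and the synchronous `callModel` stub are all hypothetical, and `key` is limited to the string form for brevity.

```typescript
// Hypothetical per-call options covering two of the overridable fields; not Mastra's types.
type PerCallOptions = { key?: string; bust?: boolean }

// Serve one request against a Map-backed cache (a stand-in for a real backend).
function handleRequest(
  store: Map<string, string>,
  defaultKey: string,
  callModel: () => string,
  options: PerCallOptions = {},
): { response: string; fromCache: boolean } {
  // A per-call key replaces the derived key for this request only.
  const key = options.key ?? defaultKey

  // `bust` skips the read, so a stale entry is never replayed.
  if (!options.bust) {
    const cached = store.get(key)
    if (cached !== undefined) return { response: cached, fromCache: true }
  }

  const response = callModel()
  store.set(key, response) // written even when `bust` skipped the read
  return { response, fromCache: false }
}
```

Because the write still happens, a busted call refreshes the entry for later callers, which is what a "force refresh" button needs.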
|
|
59
|
+
|
|
60
|
+
## Tenant scoping
|
|
61
|
+
|
|
62
|
+
By default, `ResponseCache` looks up `MASTRA_RESOURCE_ID_KEY` on the request context and uses it as the cache scope. This means an agent that already populates the resource id (e.g. via memory) gets per-user isolation automatically — two users never see each other's cached responses.
|
|
63
|
+
|
|
64
|
+
Override explicitly when you need a different scope:
|
|
65
|
+
|
|
66
|
+
```typescript
new Agent({
  // ...
  inputProcessors: [
    new ResponseCache({
      cache,
      scope: 'org-123', // explicit tenant scope
    }),
  ],
})
```
|
|
77
|
+
|
|
78
|
+
Pass `scope: null` to deliberately share entries across all callers — only use this for known-public, non-personalized content.
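
The scoping rules above amount to a short precedence chain. The sketch below is an illustration under assumptions, not the processor's code: `resolveScope` and `ScopeSetting` are hypothetical names, the per-call value is assumed to win over the constructor value, and a request with no resource id is assumed to fall back to an unscoped (`null`) entry.

```typescript
// `undefined` means "not configured"; `null` is an explicit opt-out of scoping.
type ScopeSetting = string | null | undefined

// Pick the scope for one request: per-call override, then constructor option,
// then the resource id found on the request context.
function resolveScope(
  perCall: ScopeSetting,
  configured: ScopeSetting,
  resourceId: string | undefined,
): string | null {
  if (perCall !== undefined) return perCall
  if (configured !== undefined) return configured
  return resourceId ?? null // assumption: no resource id means no scope
}
```

Keeping `null` distinct from `undefined` is what lets a caller opt out of scoping deliberately instead of silently inheriting the resource id.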
|
|
79
|
+
|
|
80
|
+
## Custom cache backend
|
|
81
|
+
|
|
82
|
+
`ResponseCache` accepts any `MastraServerCache`. For production, use `RedisCache` from `@mastra/redis`:
|
|
83
|
+
|
|
84
|
+
```typescript
import { Agent } from '@mastra/core/agent'
import { ResponseCache } from '@mastra/core/processors'
import { RedisCache } from '@mastra/redis'

const cache = new RedisCache({ url: process.env.REDIS_URL })

export const agent = new Agent({
  name: 'Cached Agent',
  instructions: '...',
  model: 'openai/gpt-5',
  inputProcessors: [new ResponseCache({ cache })],
})
```
|
|
98
|
+
|
|
99
|
+
For a custom backend, extend `MastraServerCache` and implement its abstract methods (the processor only calls `get` and `set`).
|
|
100
|
+
|
|
101
|
+
## How caching is implemented
|
|
102
|
+
|
|
103
|
+
`ResponseCache` hooks into `processLLMRequest` (cache lookup, short-circuits on hit) and `processLLMResponse` (cache write on completion). Both run inside the agentic loop _after_ memory has loaded and earlier input processors have transformed the prompt.
|
|
104
|
+
|
|
105
|
+
This means the cache key is derived from the resolved `LanguageModelV2Prompt` that Mastra is about to send to the model, so any change made by memory or an earlier input processor produces a different key. Each step in an agentic tool loop is cached independently.
|
|
106
|
+
|
|
107
|
+
## What's in the cache key
|
|
108
|
+
|
|
109
|
+
When you don't supply `key`, the processor derives one deterministically from the inputs that change the LLM's response at this step: `agentId`, `stepNumber` (so each step in a tool loop has its own cache entry), `scope`, model identity (`provider`, `modelId`, spec version), and the resolved `prompt` (post-memory + post-processors). Any change to these inputs automatically invalidates the cache.
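
To make "derived deterministically" concrete, here is a minimal sketch of one way such a key could be built. It is not `buildResponseCacheKey`: the `KeyInputs` shape, `stableStringify`, `fnv1a`, `deriveKey`, and the `response-cache:` prefix are all hypothetical, and the 32-bit FNV-1a hash is used only to keep the example dependency-free (a real implementation would more likely use a cryptographic hash such as SHA-256).

```typescript
// Hypothetical input shape based on the fields listed above; not Mastra's exported type.
type KeyInputs = {
  agentId: string
  stepNumber: number
  scope: string | null
  model: { provider: string; modelId: string; specificationVersion: string }
  prompt: unknown
}

// Serialize with sorted object keys so property order never changes the output.
function stableStringify(value: unknown): string {
  if (value === undefined) return 'null'
  if (value === null || typeof value !== 'object') return JSON.stringify(value)
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`
  const record = value as Record<string, unknown>
  const body = Object.keys(record)
    .sort()
    .map(name => `${JSON.stringify(name)}:${stableStringify(record[name])}`)
    .join(',')
  return `{${body}}`
}

// 32-bit FNV-1a over UTF-16 code units: tiny, but not collision-resistant.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash.toString(16).padStart(8, '0')
}

// Same inputs always give the same key; changing any input changes the serialized form.
function deriveKey(inputs: KeyInputs): string {
  return `response-cache:${fnv1a(stableStringify(inputs))}`
}
```

Sorting object keys before hashing is the design choice that matters here: two prompts that differ only in property order should hit the same entry.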
|
|
110
|
+
|
|
111
|
+
### Customize the cache key
|
|
112
|
+
|
|
113
|
+
Pass `key` as a function on the constructor or per-call to derive your own cache key from any subset of those inputs. The function receives the same inputs the deterministic hash would have consumed and returns a string (or a `Promise<string>`):
|
|
114
|
+
|
|
115
|
+
```typescript
import { ResponseCache, buildResponseCacheKey } from '@mastra/core/processors'

await agent.stream(input, {
  requestContext: ResponseCache.context({
    // Cache only on the model id and the resolved prompt tail — ignore
    // step number, scope, etc.
    key: ({ model, prompt }) => `qa:${model.modelId}:${JSON.stringify(prompt).slice(-200)}`,
  }),
})

// Or reuse the deterministic helper while overriding individual fields:
await agent.stream(input, {
  requestContext: ResponseCache.context({
    key: inputs => buildResponseCacheKey({ ...inputs, scope: 'global' }),
  }),
})
```
|
|
133
|
+
|
|
134
|
+
If the function throws, the processor falls back to the default key derivation so the call still benefits from caching.
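
The fallback behavior can be sketched as follows. This is a simplified illustration, not the processor's source: `resolveKey`, `KeyFn`, and the injected `deriveDefault` function are hypothetical names.

```typescript
// A custom key function may be synchronous or asynchronous.
type KeyFn<I> = (inputs: I) => string | Promise<string>

// Resolve the cache key for one request, falling back to the default
// derivation when a custom key function throws or rejects.
async function resolveKey<I>(
  inputs: I,
  custom: string | KeyFn<I> | undefined,
  deriveDefault: (inputs: I) => string,
): Promise<string> {
  if (typeof custom === 'string') return custom
  if (custom) {
    try {
      return await custom(inputs)
    } catch {
      // Swallow the error: the request should still be cacheable under the default key.
    }
  }
  return deriveDefault(inputs)
}
```

Awaiting inside the `try` block matters: it ensures a rejected promise from an async key function is caught the same way as a synchronous throw.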
|
|
135
|
+
|
|
136
|
+
## How cache hits work
|
|
137
|
+
|
|
138
|
+
When the processor finds a cache hit, it short-circuits the LLM call by returning the cached chunks from `processLLMRequest`. The agentic loop synthesizes a stream from those chunks instead of calling the model. `agent.generate()` collects them into a `FullOutput`; `agent.stream()` returns a `MastraModelOutput` whose chunks come from the cached buffer, so consumers iterating `fullStream` or awaiting `text`, `usage`, and `finishReason` see the cached values.
|
|
139
|
+
|
|
140
|
+
Cache writes happen after the response completes. Failed runs (errors, tripwire activations) are not cached, so the next call retries cleanly.
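
The hit, miss, and failure paths described above can be modeled with a synchronous generator. This is a sketch under assumptions rather than the agentic loop's real code: the `Chunk` shape, `runStep`, and the `Map` used as the cache are hypothetical, and real model streams are asynchronous.

```typescript
// Hypothetical chunk shape for illustration; Mastra's real chunk types are richer.
type Chunk = { type: string; payload: string }

// Run one step: replay cached chunks on a hit, otherwise stream from the model
// and store the buffered chunks only after the stream completes.
function* runStep(
  store: Map<string, Chunk[]>,
  key: string,
  callModel: () => Iterable<Chunk>,
): Generator<Chunk> {
  const cached = store.get(key)
  if (cached) {
    yield* cached // cache hit: the model is never called
    return
  }

  const buffer: Chunk[] = []
  for (const chunk of callModel()) {
    buffer.push(chunk)
    yield chunk // consumers see chunks as they arrive
  }

  // Reached only if the model stream finished without throwing,
  // so a failed run leaves no entry behind.
  store.set(key, buffer)
}
```

Placing the write after the loop is what keeps errors out of the cache: an exception thrown mid-stream propagates to the consumer before `store.set` runs.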
|
|
141
|
+
|
|
142
|
+
## Related
|
|
143
|
+
|
|
144
|
+
- [`ResponseCache` reference](https://mastra.ai/reference/processors/response-cache)
|
|
145
|
+
- [Processors](https://mastra.ai/docs/agents/processors)
|
|
146
|
+
- [Guardrails](https://mastra.ai/docs/agents/guardrails)
|
|
147
|
+
- [Agent.stream()](https://mastra.ai/reference/streaming/agents/stream)
|
|
148
|
+
- [Agent.generate()](https://mastra.ai/reference/agents/generate)
|
|
@@ -77,6 +77,48 @@ The observer also sees these markers when it processes the thread, so the observ
|
|
|
77
77
|
|
|
78
78
|
See [the API reference](https://mastra.ai/reference/memory/observational-memory) for the full configuration shape.
|
|
79
79
|
|
|
80
|
+
## Early activation
|
|
81
|
+
|
|
82
|
+
OM can activate buffered observations before the token threshold is reached. This is useful when a prompt cache is likely to expire, or when the agent changes model providers.
|
|
83
|
+
|
|
84
|
+
Top-level early activation settings apply to observations by default:
|
|
85
|
+
|
|
86
|
+
```typescript
const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      activateAfterIdle: '5m',
      activateOnProviderChange: true,
    },
  },
})
```
|
|
97
|
+
|
|
98
|
+
Use nested `observation` and `reflection` settings for per-phase control. Reflection early activation is opt-in, so top-level settings affect only observations.
|
|
99
|
+
|
|
100
|
+
```typescript
const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      activateAfterIdle: '5m',
      observation: {
        activateAfterIdle: false,
      },
      reflection: {
        activateAfterIdle: '10m',
        activateOnProviderChange: true,
      },
    },
  },
})
```
|
|
117
|
+
|
|
118
|
+
In this example, the top-level idle setting is disabled for observations, while reflections opt into idle and provider-change activation.
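
The inheritance rules can be written out as a small resolver. This is a sketch of the documented behavior, not OM's internal code: `resolveEarlyActivation` and the type names are hypothetical, and an unset idle value is represented as `false` for simplicity.

```typescript
type Idle = number | string | false
type PhaseSettings = { activateAfterIdle?: Idle; activateOnProviderChange?: boolean }
type OmSettings = PhaseSettings & { observation?: PhaseSettings; reflection?: PhaseSettings }

// Compute the effective early-activation settings for each phase.
function resolveEarlyActivation(config: OmSettings): {
  observation: Required<PhaseSettings>
  reflection: Required<PhaseSettings>
} {
  return {
    // Observations use their own value if set (including `false`), else the top-level value.
    observation: {
      activateAfterIdle:
        config.observation?.activateAfterIdle ?? config.activateAfterIdle ?? false,
      activateOnProviderChange:
        config.observation?.activateOnProviderChange ?? config.activateOnProviderChange ?? false,
    },
    // Reflections never inherit top-level values; they must opt in explicitly.
    reflection: {
      activateAfterIdle: config.reflection?.activateAfterIdle ?? false,
      activateOnProviderChange: config.reflection?.activateOnProviderChange ?? false,
    },
  }
}
```

Using `??` instead of `||` is deliberate: an explicit `false` on `observation.activateAfterIdle` has to win over the top-level value rather than fall through to it.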
|
|
119
|
+
|
|
120
|
+
See [the API reference](https://mastra.ai/reference/memory/observational-memory) for the full configuration shape.
|
|
121
|
+
|
|
80
122
|
## Benefits
|
|
81
123
|
|
|
82
124
|
- **Prompt caching**: OM's context is stable and observations append over time rather than being dynamically retrieved each turn. This keeps the prompt prefix cacheable, which reduces costs.
|
|
@@ -368,17 +410,19 @@ Reflection works similarly — the Reflector runs in the background when observa
|
|
|
368
410
|
|
|
369
411
|
### Settings
|
|
370
412
|
|
|
371
|
-
| Setting
|
|
372
|
-
|
|
|
373
|
-
| `observation.bufferTokens`
|
|
374
|
-
| `observation.bufferActivation`
|
|
375
|
-
| `observation.blockAfter`
|
|
376
|
-
| `activateAfterIdle`
|
|
377
|
-
| `activateOnProviderChange`
|
|
378
|
-
| `reflection.bufferActivation`
|
|
379
|
-
| `reflection.
|
|
413
|
+
| Setting | Default | What it controls |
|
|
414
|
+
| ------------------------------------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
415
|
+
| `observation.bufferTokens` | `0.2` | How often to buffer. `0.2` means every 20% of `messageTokens` — with the default 30k threshold, that's roughly every 6k tokens. Can also be an absolute token count (e.g. `5000`). |
|
|
416
|
+
| `observation.bufferActivation` | `0.8` | How aggressively to clear the message window on activation. `0.8` means remove enough messages to keep only 20% of `messageTokens` remaining. Lower values keep more message history. |
|
|
417
|
+
| `observation.blockAfter` | `1.2` | Safety threshold as a multiplier of `messageTokens`. At `1.2`, synchronous observation is forced at 36k tokens (1.2 × 30k). Only matters if buffering can't keep up. |
|
|
418
|
+
| `activateAfterIdle` | none | Forces buffered observations to activate after a period of inactivity, even before `observation.messageTokens` is reached. Accepts a numeric millisecond value such as `300_000`, or duration strings like `"5m"` or `"1hr"`. Set this to your prompt cache TTL if you want activation to happen before the next cold prompt. |
|
|
419
|
+
| `activateOnProviderChange` | `false` | Forces buffered observations to activate when the next step uses a different `provider/model` than the one that produced the latest assistant step. Use this when switching providers or models would invalidate prompt cache reuse. |
|
|
420
|
+
| `reflection.bufferActivation` | `0.5` | When to start background reflection. `0.5` means reflection begins when observations reach 50% of the `observationTokens` threshold. |
|
|
421
|
+
| `reflection.activateAfterIdle` | none | Opts buffered reflections into idle activation. Reflections don't inherit top-level `activateAfterIdle`. |
|
|
422
|
+
| `reflection.activateOnProviderChange` | `false` | Opts buffered reflections into provider-change activation. Reflections don't inherit top-level `activateOnProviderChange`. |
|
|
423
|
+
| `reflection.blockAfter` | `1.2` | Safety threshold for reflection, same logic as observation. |
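
The default ratios in the table translate into token counts as follows. This is worked arithmetic, not OM's code: the function names are hypothetical, and the cutoffs used to tell a ratio from an absolute count (below 1 for `bufferTokens`, below 2 for `blockAfter`) are assumptions based on the documented ranges.

```typescript
const messageTokens = 30_000 // default threshold cited in the table

// `bufferTokens`: a value below 1 is a ratio of the threshold, otherwise an absolute count.
function bufferInterval(bufferTokens: number, threshold: number): number {
  return bufferTokens < 1 ? Math.round(bufferTokens * threshold) : bufferTokens
}

// `bufferActivation` in ratio form: the share of the threshold removed on activation.
function retainedAfterActivation(bufferActivation: number, threshold: number): number {
  return Math.round((1 - bufferActivation) * threshold)
}

// `blockAfter`: a value below 2 is a multiplier of the threshold, otherwise an absolute count.
function blockThreshold(blockAfter: number, threshold: number): number {
  return blockAfter < 2 ? Math.round(blockAfter * threshold) : blockAfter
}

const interval = bufferInterval(0.2, messageTokens) // 6,000 tokens between buffers
const retained = retainedAfterActivation(0.8, messageTokens) // 6,000 tokens kept
const blocking = blockThreshold(1.2, messageTokens) // 36,000 tokens forces sync observation
```

`Math.round` is there because floating-point products such as `(1 - 0.8) * 30_000` do not land exactly on whole token counts.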
|
|
380
424
|
|
|
381
|
-
If you're relying on prompt caching, set `activateAfterIdle` to match your cache TTL. That way, once a thread has been idle long enough for the cache to expire, the next request can activate buffered observations
|
|
425
|
+
If you're relying on prompt caching, set `activateAfterIdle` to match your cache TTL. That way, once a thread has been idle long enough for the cache to expire, the next request can activate buffered observations first and send a smaller compressed context window.
|
|
382
426
|
|
|
383
427
|
```typescript
|
|
384
428
|
const memory = new Memory({
|
|
@@ -392,9 +436,9 @@ const memory = new Memory({
|
|
|
392
436
|
})
|
|
393
437
|
```
|
|
394
438
|
|
|
395
|
-
With a 5-minute prompt cache TTL, this activates buffered
|
|
439
|
+
With a 5-minute prompt cache TTL, this activates buffered observations after 5 minutes of inactivity so the next uncached prompt uses compressed observations instead of a larger raw message window. If you prefer, `300_000` works the same way.
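
To show why `'5m'` and `300_000` are equivalent, here is a minimal duration parser. It is a sketch, not OM's parser: `toMilliseconds` and the `UNIT_MS` table are hypothetical, and the set of accepted units is an assumption (the docs only show `"5m"` and `"1hr"` as examples).

```typescript
// Hypothetical unit table; only "m" and "hr" are confirmed by the examples in the docs.
const UNIT_MS: Record<string, number> = {
  ms: 1,
  s: 1_000,
  m: 60_000,
  min: 60_000,
  h: 3_600_000,
  hr: 3_600_000,
}

// Normalize a numeric millisecond value or a duration string to milliseconds.
function toMilliseconds(value: number | string): number {
  if (typeof value === 'number') return value // already milliseconds

  const match = /^(\d+(?:\.\d+)?)\s*([a-z]+)$/i.exec(value.trim())
  if (!match) throw new Error(`Unrecognized duration: ${value}`)

  const [, amount = '', unitName = ''] = match
  const unit = UNIT_MS[unitName.toLowerCase()]
  if (unit === undefined) throw new Error(`Unknown unit in duration: ${value}`)

  return Number(amount) * unit
}
```

Under these assumptions, `toMilliseconds('5m')` and `toMilliseconds(300_000)` both yield `300000`.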
|
|
396
440
|
|
|
397
|
-
Changing model or providers mid-thread will invalidate the prompt cache. If your agent can switch between providers or models mid-thread, `activateOnProviderChange: true` forces buffered
|
|
441
|
+
Changing models or providers mid-thread invalidates the prompt cache. If your agent can switch between providers or models mid-thread, `activateOnProviderChange: true` forces buffered observations to activate before the new provider runs. That avoids sending a large raw window to a provider that can't reuse the previous prompt cache.
|
|
398
442
|
|
|
399
443
|
### Disabling
|
|
400
444
|
|
|
@@ -43,7 +43,7 @@ Phoenix is an open-source observability platform that can be self-hosted or used
|
|
|
43
43
|
|
|
44
44
|
```bash
|
|
45
45
|
# Required
|
|
46
|
-
|
|
46
|
+
PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006/v1/traces # Or your Phoenix Cloud URL
|
|
47
47
|
|
|
48
48
|
# Optional
|
|
49
49
|
PHOENIX_API_KEY=your-api-key # For authenticated Phoenix instances
|
|
@@ -87,7 +87,7 @@ export const mastra = new Mastra({
|
|
|
87
87
|
serviceName: process.env.PHOENIX_PROJECT_NAME || 'mastra-service',
|
|
88
88
|
exporters: [
|
|
89
89
|
new ArizeExporter({
|
|
90
|
-
endpoint: process.env.
|
|
90
|
+
endpoint: process.env.PHOENIX_COLLECTOR_ENDPOINT!,
|
|
91
91
|
apiKey: process.env.PHOENIX_API_KEY,
|
|
92
92
|
projectName: process.env.PHOENIX_PROJECT_NAME,
|
|
93
93
|
}),
|
|
@@ -106,7 +106,7 @@ export const mastra = new Mastra({
|
|
|
106
106
|
> arizephoenix/phoenix:latest
|
|
107
107
|
> ```
|
|
108
108
|
>
|
|
109
|
-
> Set `
|
|
109
|
+
> Set `PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006/v1/traces` and run your Mastra agent to see traces at [localhost:6006](http://localhost:6006).
|
|
110
110
|
|
|
111
111
|
### Arize AX Setup
|
|
112
112
|
|
|
@@ -218,7 +218,7 @@ Control how traces are batched and exported:
|
|
|
218
218
|
|
|
219
219
|
```typescript
|
|
220
220
|
new ArizeExporter({
|
|
221
|
-
endpoint: process.env.
|
|
221
|
+
endpoint: process.env.PHOENIX_COLLECTOR_ENDPOINT!,
|
|
222
222
|
apiKey: process.env.PHOENIX_API_KEY,
|
|
223
223
|
|
|
224
224
|
// Batch processing configuration
|
|
@@ -233,7 +233,7 @@ Add custom attributes to all exported spans:
|
|
|
233
233
|
|
|
234
234
|
```typescript
|
|
235
235
|
new ArizeExporter({
|
|
236
|
-
endpoint: process.env.
|
|
236
|
+
endpoint: process.env.PHOENIX_COLLECTOR_ENDPOINT!,
|
|
237
237
|
resourceAttributes: {
|
|
238
238
|
'deployment.environment': process.env.NODE_ENV,
|
|
239
239
|
'service.namespace': 'production',
|
|
@@ -292,7 +292,7 @@ export const mastra = new Mastra({
|
|
|
292
292
|
serviceName: 'my-service',
|
|
293
293
|
exporters: [
|
|
294
294
|
new ArizeExporter({
|
|
295
|
-
endpoint: process.env.
|
|
295
|
+
endpoint: process.env.PHOENIX_COLLECTOR_ENDPOINT,
|
|
296
296
|
apiKey: process.env.PHOENIX_API_KEY,
|
|
297
297
|
}),
|
|
298
298
|
new DefaultExporter(), // Keep Studio access
|
package/.docs/models/index.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
# Model Providers
|
|
2
2
|
|
|
3
|
-
Mastra provides a unified interface for working with LLMs across multiple providers, giving you access to
|
|
3
|
+
Mastra provides a unified interface for working with LLMs across multiple providers, giving you access to 3897 models from 108 providers through a single API.
|
|
4
4
|
|
|
5
5
|
## Features
|
|
6
6
|
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
# LLM Gateway
|
|
2
2
|
|
|
3
|
-
Access
|
|
3
|
+
Access 195 LLM Gateway models through Mastra's model router. Authentication is handled automatically using the `LLMGATEWAY_API_KEY` environment variable.
|
|
4
4
|
|
|
5
5
|
Learn more in the [LLM Gateway documentation](https://llmgateway.io/docs).
|
|
6
6
|
|
|
@@ -66,6 +66,7 @@ for await (const chunk of stream) {
|
|
|
66
66
|
| `llmgateway/gemini-2.5-flash-lite-preview-09-2025` | 1.0M | | | | | | $0.10 | $0.40 |
|
|
67
67
|
| `llmgateway/gemini-2.5-pro` | 1.0M | | | | | | $1 | $10 |
|
|
68
68
|
| `llmgateway/gemini-3-flash-preview` | 1.0M | | | | | | $0.50 | $3 |
|
|
69
|
+
| `llmgateway/gemini-3.1-flash-lite` | 1.0M | | | | | | $0.25 | $2 |
|
|
69
70
|
| `llmgateway/gemini-3.1-flash-lite-preview` | 1.0M | | | | | | $0.25 | $2 |
|
|
70
71
|
| `llmgateway/gemini-3.1-pro-preview` | 1.0M | | | | | | $2 | $12 |
|
|
71
72
|
| `llmgateway/gemini-pro-latest` | 1.0M | | | | | | $2 | $12 |
|
|
@@ -132,6 +133,7 @@ for await (const chunk of stream) {
|
|
|
132
133
|
| `llmgateway/grok-4-1-fast-reasoning` | 2.0M | | | | | | $0.20 | $0.50 |
|
|
133
134
|
| `llmgateway/grok-4-20-beta-0309-non-reasoning` | 2.0M | | | | | | $2 | $6 |
|
|
134
135
|
| `llmgateway/grok-4-20-beta-0309-reasoning` | 2.0M | | | | | | $2 | $6 |
|
|
136
|
+
| `llmgateway/grok-4-3` | 1.0M | | | | | | $1 | $3 |
|
|
135
137
|
| `llmgateway/grok-4-fast` | 2.0M | | | | | | $0.20 | $0.50 |
|
|
136
138
|
| `llmgateway/grok-4-fast-non-reasoning` | 2.0M | | | | | | $0.20 | $0.50 |
|
|
137
139
|
| `llmgateway/grok-4-fast-reasoning` | 2.0M | | | | | | $0.20 | $0.50 |
|
|
@@ -154,6 +156,10 @@ for await (const chunk of stream) {
|
|
|
154
156
|
| `llmgateway/llama-4-scout` | 33K | | | | | | $0.18 | $0.59 |
|
|
155
157
|
| `llmgateway/llama-4-scout-17b-instruct` | 8K | | | | | | $0.17 | $0.66 |
|
|
156
158
|
| `llmgateway/mimo-v2-flash` | 262K | | | | | | $0.10 | $0.30 |
|
|
159
|
+
| `llmgateway/mimo-v2-omni` | 262K | | | | | | $0.40 | $2 |
|
|
160
|
+
| `llmgateway/mimo-v2-pro` | 1.0M | | | | | | $1 | $3 |
|
|
161
|
+
| `llmgateway/mimo-v2.5` | 1.0M | | | | | | $0.40 | $2 |
|
|
162
|
+
| `llmgateway/mimo-v2.5-pro` | 1.0M | | | | | | $1 | $3 |
|
|
157
163
|
| `llmgateway/minimax-m2` | 197K | | | | | | $0.30 | $1 |
|
|
158
164
|
| `llmgateway/minimax-m2.1` | 205K | | | | | | $0.30 | $1 |
|
|
159
165
|
| `llmgateway/minimax-m2.1-lightning` | 197K | | | | | | $0.12 | $0.48 |
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
# Nebius Token Factory
|
|
2
2
|
|
|
3
|
-
Access
|
|
3
|
+
Access 31 Nebius Token Factory models through Mastra's model router. Authentication is handled automatically using the `NEBIUS_API_KEY` environment variable.
|
|
4
4
|
|
|
5
5
|
Learn more in the [Nebius Token Factory documentation](https://docs.tokenfactory.nebius.com/).
|
|
6
6
|
|
|
@@ -36,6 +36,7 @@ for await (const chunk of stream) {
|
|
|
36
36
|
| ------------------------------------------------ | ------- | ----- | --------- | ----- | ----- | ----- | ---------- | ----------- |
|
|
37
37
|
| `nebius/deepseek-ai/DeepSeek-V3.2` | 163K | | | | | | $0.30 | $0.45 |
|
|
38
38
|
| `nebius/deepseek-ai/DeepSeek-V3.2-fast` | 8K | | | | | | $0.40 | $2 |
|
|
39
|
+
| `nebius/deepseek-ai/DeepSeek-V4-Pro` | 1.0M | | | | | | $2 | $4 |
|
|
39
40
|
| `nebius/google/gemma-2-2b-it` | 8K | | | | | | $0.02 | $0.06 |
|
|
40
41
|
| `nebius/google/gemma-3-27b-it` | 110K | | | | | | $0.10 | $0.30 |
|
|
41
42
|
| `nebius/meta-llama/Llama-3.3-70B-Instruct` | 128K | | | | | | $0.13 | $0.40 |
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
# OpenCode Zen
|
|
2
2
|
|
|
3
|
-
Access
|
|
3
|
+
Access 38 OpenCode Zen models through Mastra's model router. Authentication is handled automatically using the `OPENCODE_API_KEY` environment variable.
|
|
4
4
|
|
|
5
5
|
Learn more in the [OpenCode Zen documentation](https://opencode.ai/docs/zen).
|
|
6
6
|
|
|
@@ -64,7 +64,6 @@ for await (const chunk of stream) {
|
|
|
64
64
|
| `opencode/gpt-5.4-pro` | 1.1M | | | | | | $30 | $180 |
|
|
65
65
|
| `opencode/gpt-5.5` | 1.1M | | | | | | $5 | $30 |
|
|
66
66
|
| `opencode/gpt-5.5-pro` | 1.1M | | | | | | $30 | $180 |
|
|
67
|
-
| `opencode/hy3-preview-free` | 256K | | | | | | — | — |
|
|
68
67
|
| `opencode/kimi-k2.5` | 262K | | | | | | $0.60 | $3 |
|
|
69
68
|
| `opencode/kimi-k2.6` | 262K | | | | | | $0.95 | $4 |
|
|
70
69
|
| `opencode/minimax-m2.5` | 205K | | | | | | $0.30 | $1 |
|
package/.docs/reference/index.md
CHANGED
|
@@ -170,6 +170,7 @@ The Reference section provides documentation of Mastra's API, including paramete
|
|
|
170
170
|
- [PromptInjectionDetector](https://mastra.ai/reference/processors/prompt-injection-detector)
|
|
171
171
|
- [ProviderHistoryCompat](https://mastra.ai/reference/processors/provider-history-compat)
|
|
172
172
|
- [RegexFilterProcessor](https://mastra.ai/reference/processors/regex-filter-processor)
|
|
173
|
+
- [ResponseCache](https://mastra.ai/reference/processors/response-cache)
|
|
173
174
|
- [SemanticRecall](https://mastra.ai/reference/processors/semantic-recall-processor)
|
|
174
175
|
- [SkillSearchProcessor](https://mastra.ai/reference/processors/skill-search-processor)
|
|
175
176
|
- [StreamErrorRetryProcessor](https://mastra.ai/reference/processors/stream-error-retry-processor)
|
|
@@ -36,7 +36,9 @@ OM performs thresholding with fast local token estimation. Text uses `tokenx`, a
|
|
|
36
36
|
|
|
37
37
|
**scope** (`'resource' | 'thread'`): Memory scope for observations. \`'thread'\` keeps observations per-thread. \`'resource'\` (experimental) shares observations across all threads for a resource, enabling cross-conversation memory. (Default: `'thread'`)
|
|
38
38
|
|
|
39
|
-
**activateAfterIdle** (`number | string`): Time before buffered observations
|
|
39
|
+
**activateAfterIdle** (`number | string | false`): Time before buffered observations are forced to activate after inactivity, even before \`observation.messageTokens\` is reached. Accepts a numeric millisecond value such as \`300\_000\`, duration strings like \`"5m"\` or \`"1hr"\`, or \`false\` to disable inherited observation idle activation. Reflections do not inherit this setting. Use \`reflection.activateAfterIdle\` to opt reflections into idle activation.
|
|
40
|
+
|
|
41
|
+
**activateOnProviderChange** (`boolean`): Force buffered observations to activate when the actor provider or model changes. Reflections do not inherit this setting. Use \`reflection.activateOnProviderChange\` to opt reflections into provider-change activation. (Default: `false`)
|
|
40
42
|
|
|
41
43
|
**shareTokenBudget** (`boolean`): Share the token budget between messages and observations. When enabled, the total budget is \`observation.messageTokens + reflection.observationTokens\`. Messages can use more space when observations are small, and vice versa. This maximizes context usage through flexible allocation. \`shareTokenBudget\` is not yet compatible with async buffering. You must set \`observation: { bufferTokens: false }\` when using this option (this is a temporary limitation). (Default: `false`)
|
|
42
44
|
|
|
@@ -66,6 +68,10 @@ OM performs thresholding with fast local token estimation. Text uses `tokenx`, a
|
|
|
66
68
|
|
|
67
69
|
**observation.bufferActivation** (`number`): Controls how much of the message window to retain after activation. Accepts a ratio (0-1) or an absolute token count (≥ 1000). For example, \`0.8\` means: activate enough buffers to remove 80% of \`messageTokens\` and leave 20% as active message history. An absolute token count like \`4000\` targets a goal of keeping \~4k message tokens remaining after activation. Higher values remove more message history per activation when using a ratio. Higher values keep more message history when using a token count.
|
|
68
70
|
|
|
71
|
+
**observation.activateAfterIdle** (`number | string | false`): Time before buffered observations are forced to activate after inactivity. Accepts milliseconds, a duration string, or \`false\`. If unset, the top-level \`activateAfterIdle\` value is used for observations. Set \`false\` to disable the top-level idle setting for observations.
|
|
72
|
+
|
|
73
|
+
**observation.activateOnProviderChange** (`boolean`): Force buffered observations to activate when the actor provider or model changes. If unset, the top-level \`activateOnProviderChange\` value is used for observations.
|
|
74
|
+
|
|
69
75
|
**observation.blockAfter** (`number`): Token threshold above which synchronous (blocking) observation is forced. Between \`messageTokens\` and \`blockAfter\`, only async buffering/activation is used. Above \`blockAfter\`, a synchronous observation runs as a last resort, while buffered activation still preserves a minimum remaining context (min(1000, retention floor)). Accepts a multiplier (1 < value < 2, multiplied by \`messageTokens\`) or an absolute token count (≥ 2, must be greater than \`messageTokens\`). Only relevant when \`bufferTokens\` is set. Defaults to \`1.2\` when async buffering is enabled.
|
|
70
76
|
|
|
71
77
|
**observation.previousObserverTokens** (`number | false`): Optional token budget for the observer's previous-observations context. When set to a number, the observations passed to the Observer agent are tail-truncated to fit within this budget while keeping the newest observations and preserving highlighted 🔴 items when possible. When a buffered reflection is pending, the already-reflected observation lines are automatically replaced with the reflection summary before truncation. Set to \`0\` to omit previous observations entirely, or \`false\` to disable truncation explicitly.
|
|
@@ -86,6 +92,10 @@ OM performs thresholding with fast local token estimation. Text uses `tokenx`, a
|
|
|
86
92
|
|
|
87
93
|
**reflection.bufferActivation** (`number`): Ratio (0-1) controlling when async reflection buffering starts. When observation tokens reach \`observationTokens \* bufferActivation\`, reflection runs in the background. On activation at the full threshold, the buffered reflection replaces the observations it covers, preserving any new observations appended after that range.
|
|
88
94
|
|
|
95
|
+
**reflection.activateAfterIdle** (`number | string | false`): Time before buffered reflections are forced to activate after inactivity. Accepts milliseconds, a duration string, or \`false\`. Reflections do not inherit top-level \`activateAfterIdle\`; set this explicitly to opt reflections into idle activation.
|
|
96
|
+
|
|
97
|
+
**reflection.activateOnProviderChange** (`boolean`): Force buffered reflections to activate when the actor provider or model changes. Reflections do not inherit top-level \`activateOnProviderChange\`; set this explicitly to opt reflections into provider-change activation.
|
|
98
|
+
|
|
89
99
|
**reflection.blockAfter** (`number`): Token threshold above which synchronous (blocking) reflection is forced. Between \`observationTokens\` and \`blockAfter\`, only async buffering/activation is used. Above \`blockAfter\`, a synchronous reflection runs as a last resort. Accepts a multiplier (1 < value < 2, multiplied by \`observationTokens\`) or an absolute token count (≥ 2, must be greater than \`observationTokens\`). Only relevant when \`bufferActivation\` is set. Defaults to \`1.2\` when async reflection is enabled.
|
|
90
100
|
|
|
91
101
|
### Token estimate metadata cache
|
|
@@ -77,7 +77,7 @@ Flushes pending data and shuts down the client.
|
|
|
77
77
|
```typescript
|
|
78
78
|
import { ArizeExporter } from '@mastra/arize'
|
|
79
79
|
|
|
80
|
-
// For Phoenix: Set
|
|
80
|
+
// For Phoenix: Set PHOENIX_COLLECTOR_ENDPOINT, PHOENIX_API_KEY, PHOENIX_PROJECT_NAME
|
|
81
81
|
// For Arize AX: Set ARIZE_SPACE_ID, ARIZE_API_KEY, ARIZE_PROJECT_NAME
|
|
82
82
|
const exporter = new ArizeExporter()
|
|
83
83
|
```
|
|
@@ -0,0 +1,114 @@
|
|
|
1
|
+
# ResponseCache
|
|
2
|
+
|
|
3
|
+
`ResponseCache` is an input processor that caches LLM responses on the request/response boundary inside the agentic loop. It hooks into `processLLMRequest` (cache lookup; short-circuits on hit) and `processLLMResponse` (cache write on completion).
|
|
4
|
+
|
|
5
|
+
The cache key is derived from the resolved `LanguageModelV2Prompt` Mastra is about to send to the model — i.e. _after_ memory has loaded and earlier input processors have transformed the prompt — so two users with different memory contexts produce different cache keys. Each step in an agentic tool loop is independently cached.
|
|
6
|
+
|
|
7
|
+
There is no agent-level option for response caching; register `ResponseCache` explicitly on `inputProcessors`. Per-call overrides flow through `RequestContext` via [`ResponseCache.context()`](#static-helpers) and [`ResponseCache.applyContext()`](#static-helpers).
|
|
8
|
+
|
|
9
|
+
## Usage example
|
|
10
|
+
|
|
11
|
+
```typescript
import { Agent } from '@mastra/core/agent'
import { InMemoryServerCache } from '@mastra/core/cache'
import { ResponseCache } from '@mastra/core/processors'

const cache = new InMemoryServerCache()

const agent = new Agent({
  name: 'Search Agent',
  instructions: 'You answer questions concisely.',
  model: 'openai/gpt-5',
  inputProcessors: [new ResponseCache({ cache, ttl: 600 })],
})

// First call hits the LLM and writes to the cache.
await agent.generate('What is the capital of France?')

// Second identical call replays the cached response.
await agent.generate('What is the capital of France?')

// Force a fresh call but still update the cache.
await agent.generate('What is the capital of France?', {
  requestContext: ResponseCache.context({ bust: true }),
})
```
|
|
36
|
+
|
|
37
|
+
See [Response caching](https://mastra.ai/docs/agents/response-caching) for the conceptual overview, scoping rules, and recommended deployment patterns.
|
|
38
|
+
|
|
39
|
+
## Constructor parameters
|
|
40
|
+
|
|
41
|
+
**cache** (`MastraServerCache`): The cache backend. Required. Pass any \`MastraServerCache\` implementation — \`InMemoryServerCache\` for local development, \`RedisCache\` from \`@mastra/redis\` for production, or your own subclass for a custom backend.
|
|
42
|
+
|
|
43
|
+
**ttl** (`number`): Time-to-live (seconds) for entries written by this processor. Defaults to 300 seconds (5 minutes), matching OpenRouter's reference implementation. (Default: `300`)
|
|
44
|
+
|
|
45
|
+
**scope** (`string | null`): Tenant scope appended to the cache key. \`null\` opts out of scoping. When omitted, the processor falls back to the resource id resolved from the request context (\`MASTRA\_RESOURCE\_ID\_KEY\`) for automatic per-user isolation.
|
|
46
|
+
|
|
47
|
+
**key** (`string | (inputs: ResponseCacheKeyInputs) => string | Promise<string>`): Override the auto-derived cache key. Pass a string to pin a key, or a function that receives \`{ agentId, scope, model, prompt, stepNumber }\` and returns a key. If the function throws, the processor falls back to the deterministic hash so the call still benefits from caching.
|
|
48
|
+
|
|
49
|
+
**bust** (`boolean`): Force a cache miss on every call: skip the read but still write on completion. Useful for explicit refresh paths. (Default: `false`)
|
|
50
|
+
|
|
51
|
+
**agentId** (`string`): Logical id used in the cache key namespace. Defaults to \`'mastra-response-cache'\`. Set this to the owning agent's id when you want cache entries scoped per-agent. (Default: `'mastra-response-cache'`)
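The `scope` and `key` rules above can be read as two small resolution functions. The sketch below is an assumption-laden model, not Mastra's code: `resolveScope`, `resolveKey`, and `defaultKey` are hypothetical names, the key function is synchronous here although the documented type also allows a `Promise`, the default key is a readable string rather than the real deterministic hash, and falling back to `null` when no resource id is available is a guess.

```typescript
// Hypothetical, trimmed-down version of the key inputs.
type KeyInputs = { agentId: string; scope: string | null | undefined; stepNumber: number }
type KeyOption = string | ((inputs: KeyInputs) => string)

// string => use it; null => scoping disabled; undefined => fall back to the resource id.
function resolveScope(scope: string | null | undefined, resourceId?: string): string | null {
  if (scope === null) return null
  if (scope !== undefined) return scope
  return resourceId ?? null // assumption: no resource id means no scope
}

// Stand-in for the default key; the real default is a deterministic hash.
function defaultKey(inputs: KeyInputs): string {
  return [inputs.agentId, inputs.scope ?? 'unscoped', inputs.stepNumber].join(':')
}

function resolveKey(key: KeyOption | undefined, inputs: KeyInputs): string {
  if (typeof key === 'string') return key // pinned key
  if (typeof key === 'function') {
    try {
      return key(inputs)
    } catch {
      return defaultKey(inputs) // a throwing key function falls back to the default
    }
  }
  return defaultKey(inputs)
}

const inputs: KeyInputs = {
  agentId: 'mastra-response-cache',
  scope: resolveScope(undefined, 'user-1'),
  stepNumber: 0,
}
const pinned = resolveKey('pinned-key', inputs)
const fallback = resolveKey(() => {
  throw new Error('boom')
}, inputs)
```

Here `pinned` is the literal string that was passed in, while `fallback` comes from `defaultKey` because the custom function threw.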
|
|
52
|
+
|
|
53
|
+
## Static helpers
|
|
54
|
+
|
|
55
|
+
`ResponseCache` exposes two static helpers for setting per-call overrides on a `RequestContext`. The helpers keep the underlying context key a private implementation detail — prefer them over reading/writing the raw key.
|
|
56
|
+
|
|
57
|
+
### `ResponseCache.context(options)`
|
|
58
|
+
|
|
59
|
+
Build a fresh `RequestContext` preloaded with per-call response cache overrides.
|
|
60
|
+
|
|
61
|
+
```typescript
await agent.stream('hello', {
  requestContext: ResponseCache.context({ key: 'custom', bust: true }),
})
```
|
|
66
|
+
|
|
67
|
+
### `ResponseCache.applyContext(requestContext, options)`
|
|
68
|
+
|
|
69
|
+
Merge per-call response cache overrides into an existing `RequestContext`. Returns the same context for chaining.
|
|
70
|
+
|
|
71
|
+
```typescript
const ctx = new RequestContext()
ctx.set('caller-meta', { userId: 'u-123' })
ResponseCache.applyContext(ctx, { bust: true })
await agent.stream('hello', { requestContext: ctx })
```
|
|
77
|
+
|
|
78
|
+
## ResponseCacheContextOptions
|
|
79
|
+
|
|
80
|
+
The shape passed to `ResponseCache.context()` / `ResponseCache.applyContext()`.
|
|
81
|
+
|
|
82
|
+
**key** (`string | (inputs: ResponseCacheKeyInputs) => string | Promise<string>`): Overrides the auto-derived cache key for this request only.
|
|
83
|
+
|
|
84
|
+
**scope** (`string | null`): Overrides the tenant scope for this request only. \`null\` opts out of scoping.
|
|
85
|
+
|
|
86
|
+
**bust** (`boolean`): Skip the cache read but still write on completion.
|
|
87
|
+
|
|
88
|
+
`cache`, `ttl`, and `agentId` are intentionally not overridable per call — they are instance-level concerns that should not vary per request.
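One way to picture the split between instance-level and per-call options is a merge function that only reads `scope` and `bust` from the overrides. This is a sketch under stated assumptions: `effectiveOptions`, `generate`, and the option types are hypothetical, `key` is left out for brevity, and the in-memory `Map` ignores `ttl` entirely.

```typescript
// Hypothetical option shapes, simplified from the tables above.
type InstanceOptions = { ttl: number; agentId: string; scope?: string | null; bust?: boolean }
type CallOverrides = { scope?: string | null; bust?: boolean }

function effectiveOptions(instance: InstanceOptions, call: CallOverrides = {}) {
  return {
    // Instance-level concerns: never taken from the per-call overrides.
    ttl: instance.ttl,
    agentId: instance.agentId,
    // `null` is a meaningful scope value, so compare against undefined.
    scope: call.scope !== undefined ? call.scope : instance.scope,
    bust: call.bust ?? instance.bust ?? false,
  }
}

const entries = new Map<string, string>()
let llmCalls = 0

// bust skips the read but still writes, so later calls see the refreshed entry.
function generate(key: string, opts: { bust: boolean }): string {
  if (!opts.bust) {
    const hit = entries.get(key)
    if (hit !== undefined) return hit
  }
  llmCalls += 1
  const fresh = `response #${llmCalls}`
  entries.set(key, fresh)
  return fresh
}

const instance: InstanceOptions = { ttl: 300, agentId: 'mastra-response-cache', scope: 'tenant-a' }

const first = generate('k', effectiveOptions(instance)) // 'response #1' (miss)
const replay = generate('k', effectiveOptions(instance)) // 'response #1' (hit)
const refreshed = generate('k', effectiveOptions(instance, { bust: true })) // 'response #2'
const after = generate('k', effectiveOptions(instance)) // 'response #2' (hit on refreshed entry)
```

Keeping `ttl` and `agentId` out of `CallOverrides` makes the restriction a type-level property in this model: a caller cannot pass them per request at all.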
|
|
89
|
+
|
|
90
|
+
## ResponseCacheKeyInputs
|
|
91
|
+
|
|
92
|
+
The argument passed to a `key` function (constructor or per-call). All fields contribute to the deterministic hash by default.
|
|
93
|
+
|
|
94
|
+
**agentId** (`string`): Logical processor id used to namespace the cache key.
|
|
95
|
+
|
|
96
|
+
**scope** (`string | null | undefined`): Resolved scope for this request, or \`null\` when scoping is disabled.
|
|
97
|
+
|
|
98
|
+
**model** (`{ provider?: string; modelId?: string; specVersion?: string }`): Provider/model identity. Different models produce different responses.
|
|
99
|
+
|
|
100
|
+
**prompt** (`LanguageModelV2Prompt`): The exact prompt the provider would receive, post memory load and post any prompt-modifying input processors.
|
|
101
|
+
|
|
102
|
+
**stepNumber** (`number`): 0-indexed step number within the agentic loop. Greater than zero for tool steps.
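To make "all fields contribute to the deterministic hash" concrete, here is a rough sketch of a key derived from these inputs. It is not `buildResponseCacheKey`: `sketchCacheKey`, `stableStringify`, and `fnv1a` are hypothetical, `prompt` is typed as `unknown` in place of `LanguageModelV2Prompt`, and the 32-bit FNV-1a hash is used only to keep the example dependency-free. A production key would more plausibly use a cryptographic hash.

```typescript
// Hypothetical mirror of the documented key inputs.
type KeyInputs = {
  agentId: string
  scope: string | null | undefined
  model: { provider?: string; modelId?: string; specVersion?: string }
  prompt: unknown // stands in for LanguageModelV2Prompt
  stepNumber: number
}

// Serialize with sorted object keys so property order does not change the result.
function stableStringify(value: unknown): string {
  if (value === undefined) return 'null'
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`
  if (value !== null && typeof value === 'object') {
    const record = value as Record<string, unknown>
    const body = Object.keys(record)
      .sort()
      .map(name => `${JSON.stringify(name)}:${stableStringify(record[name])}`)
      .join(',')
    return `{${body}}`
  }
  return JSON.stringify(value)
}

// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex characters.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return (hash >>> 0).toString(16).padStart(8, '0')
}

function sketchCacheKey(inputs: KeyInputs): string {
  return fnv1a(stableStringify(inputs))
}

const base: KeyInputs = {
  agentId: 'mastra-response-cache',
  scope: 'user-1',
  model: { provider: 'openai', modelId: 'gpt-5' },
  prompt: [{ role: 'user', content: 'What is the capital of France?' }],
  stepNumber: 0,
}
const sameFieldsReordered: KeyInputs = {
  stepNumber: 0,
  prompt: [{ content: 'What is the capital of France?', role: 'user' }],
  model: { modelId: 'gpt-5', provider: 'openai' },
  scope: 'user-1',
  agentId: 'mastra-response-cache',
}

const keyA = sketchCacheKey(base)
const keyB = sketchCacheKey(sameFieldsReordered) // equal to keyA
const keyStep1 = sketchCacheKey({ ...base, stepNumber: 1 })
const keyOtherUser = sketchCacheKey({ ...base, scope: 'user-2' })
```

Sorting keys before hashing is the design choice that matters: without it, two logically identical requests could serialize differently and miss the cache.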
|
|
103
|
+
|
|
104
|
+
## Helper exports
|
|
105
|
+
|
|
106
|
+
- `buildResponseCacheKey(inputs)` — the deterministic hash used by default. Re-export it to override individual fields while preserving the rest of the standard key shape.
|
|
107
|
+
- `DEFAULT_RESPONSE_CACHE_TTL_SECONDS` — the default `ttl` (`300`).
|
|
108
|
+
- `RESPONSE_CACHE_CONTEXT_KEY` — the `RequestContext` key the static helpers write to. Exposed for advanced cases (e.g. clearing the override mid-pipeline); prefer the helpers.
|
|
109
|
+
|
|
110
|
+
## Related
|
|
111
|
+
|
|
112
|
+
- [Response caching](https://mastra.ai/docs/agents/response-caching)
|
|
113
|
+
- [Processors](https://mastra.ai/docs/agents/processors)
|
|
114
|
+
- [Processor interface](https://mastra.ai/reference/processors/processor-interface)
|
package/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,13 @@
|
|
|
1
1
|
# @mastra/mcp-docs-server
|
|
2
2
|
|
|
3
|
+
## 1.1.35-alpha.16
|
|
4
|
+
|
|
5
|
+
### Patch Changes
|
|
6
|
+
|
|
7
|
+
- Updated dependencies [[`7c275a8`](https://github.com/mastra-ai/mastra/commit/7c275a810595e1a6c41ccc39720531ab65734700), [`890b24c`](https://github.com/mastra-ai/mastra/commit/890b24cc7d32ed6aa4dfe253e54dc6bf4099f690), [`0f48ebf`](https://github.com/mastra-ai/mastra/commit/0f48ebfc7ac7897b2092a189f45751924cf56d1c), [`f180e49`](https://github.com/mastra-ai/mastra/commit/f180e4990e71b04c9a475b523584071712f0048f), [`9260e01`](https://github.com/mastra-ai/mastra/commit/9260e015276fb1b500f7878ee452b47476bf1583), [`2f6c54e`](https://github.com/mastra-ai/mastra/commit/2f6c54e17c041cac1def54baaa6b771647836414), [`e06a159`](https://github.com/mastra-ai/mastra/commit/e06a1598ca07a6c3778aefc2a2d288363c6294ff), [`db34bc6`](https://github.com/mastra-ai/mastra/commit/db34bc6fb36cf125bda0c46be4d3fdc774b70cc4)]:
|
|
8
|
+
- @mastra/core@1.33.0-alpha.8
|
|
9
|
+
- @mastra/mcp@1.7.0
|
|
10
|
+
|
|
3
11
|
## 1.1.35-alpha.14
|
|
4
12
|
|
|
5
13
|
### Patch Changes
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@mastra/mcp-docs-server",
|
|
3
|
-
"version": "1.1.35-alpha.
|
|
3
|
+
"version": "1.1.35-alpha.17",
|
|
4
4
|
"description": "MCP server for accessing Mastra.ai documentation, changelogs, and news.",
|
|
5
5
|
"type": "module",
|
|
6
6
|
"main": "dist/index.js",
|
|
@@ -29,8 +29,8 @@
|
|
|
29
29
|
"jsdom": "^26.1.0",
|
|
30
30
|
"local-pkg": "^1.1.2",
|
|
31
31
|
"zod": "^4.3.6",
|
|
32
|
-
"@mastra/
|
|
33
|
-
"@mastra/
|
|
32
|
+
"@mastra/core": "1.33.0-alpha.8",
|
|
33
|
+
"@mastra/mcp": "^1.7.0"
|
|
34
34
|
},
|
|
35
35
|
"devDependencies": {
|
|
36
36
|
"@hono/node-server": "^1.19.11",
|
|
@@ -48,7 +48,7 @@
|
|
|
48
48
|
"vitest": "4.1.5",
|
|
49
49
|
"@internal/lint": "0.0.92",
|
|
50
50
|
"@internal/types-builder": "0.0.67",
|
|
51
|
-
"@mastra/core": "1.33.0-alpha.
|
|
51
|
+
"@mastra/core": "1.33.0-alpha.8"
|
|
52
52
|
},
|
|
53
53
|
"homepage": "https://mastra.ai",
|
|
54
54
|
"repository": {
|