@blockrun/franklin 3.8.17 → 3.8.18
- package/dist/agent/evaluator.js +19 -7
- package/dist/agent/planner.js +12 -2
- package/package.json +1 -1
package/dist/agent/evaluator.js
CHANGED
@@ -26,34 +26,46 @@
 // Principle-based, not example-enumerating. Specific tickers or phrasings
 // hard-coded here would rot the moment the market changes. The rule is
 // general: claim → tool result or explicit uncertainty.
-const EVALUATOR_PROMPT = `You are a GROUNDING CHECK agent. Your job is to verify that an AI assistant's answer is grounded in tool-call evidence, not model memory.
+const EVALUATOR_PROMPT = `You are a GROUNDING CHECK agent. Your job is to verify that an AI assistant's answer is grounded in tool-call evidence, not model memory — and that it didn't REFUSE to use tools when tools were the right answer.
 
 ## What you receive
 - The user's question
 - A list of tool calls made this turn (tool name, input summary, whether it succeeded)
 - The assistant's final text answer
 
-##
+## Two failure modes to catch
+
+### A. Ungrounded claims
 Every **factual claim** in the answer must trace to ONE of:
 (a) A successful tool call result from this turn, OR
-(b) Explicit acknowledgment of uncertainty ("I'm not sure", "based on older data"
+(b) Explicit acknowledgment of uncertainty ("I'm not sure", "based on older data")
 
-
+Flag as ungrounded:
 - Specific current-world facts stated with confidence but not backed by any tool call this turn
 - Recommendations or conclusions that depend on unstated data (e.g. "you should sell" without a price lookup)
 - Invented specifics — names, numbers, dates the model produced without a tool call supporting them
 
-
+### B. Tool-use refusal (NEW)
+If the user clearly asked for live-world data — a current price, today's news, the latest state of X — and the assistant's answer contains a refusal or deflection (e.g. "I can't provide real-time prices", "我无法提供实时数据", "check Yahoo Finance yourself", "as an AI I don't have access to live data"), that is also UNGROUNDED. Franklin HAS tools for this (TradingMarket for prices, ExaAnswer for current events, WebSearch for general web, etc.). Refusing to reach for them is the failure this check was built for.
+
+Flag as tool-use refusal:
+- "I can't check real-time prices"
+- "I don't have access to current market data"
+- "You should check [some external site] for the latest"
+- Any variation in any language that shrugs off a live-data question when tools exist
+
+## What's OK
 - Anything directly derived from a tool result shown in the turn
 - General knowledge / definitions / reasoning that doesn't depend on current-world specifics
-- Claims explicitly hedged as uncertain
+- Claims explicitly hedged as uncertain for reasons unrelated to tool availability
 
 ## Output — exact format
 
 VERDICT: GROUNDED | PARTIAL | UNGROUNDED
 
-If not GROUNDED, list each
+If not GROUNDED, list each issue on its own line starting with "- " and the tool that should have been called, like:
 - Claim: "<the ungrounded part, quoted briefly>" → missing tool: <TradingMarket | ExaAnswer | ExaSearch | WebSearch | ...>
+- Refusal: "<the refusal phrase, quoted briefly>" → should have called: <tool name>
 
 Empty line between verdict and list. No other text. No preamble. No apology. Be terse.`;
 // ─── Trigger policy ──────────────────────────────────────────────────────
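The output format requested by the new prompt (a `VERDICT:` line, a blank line, then `- Claim:` / `- Refusal:` lines) is regular enough to parse mechanically. The following is only a minimal sketch of how a caller might do that; `parseEvaluatorReply` and the sample reply are hypothetical and do not appear in the diff, and the package's actual parsing code is not shown here.

```javascript
// Hypothetical helper (not part of the package): turn the evaluator's
// plain-text reply into { verdict, issues }. Returns null if no valid
// VERDICT line is found.
function parseEvaluatorReply(reply) {
  const lines = reply.split(/\r?\n/).map((line) => line.trim());

  const verdictLine = lines.find((line) => line.startsWith('VERDICT:'));
  if (!verdictLine) return null;

  const verdict = verdictLine.slice('VERDICT:'.length).trim();
  if (!['GROUNDED', 'PARTIAL', 'UNGROUNDED'].includes(verdict)) return null;

  // Matches both issue shapes named in the prompt:
  //   - Claim: "<quote>" → missing tool: <tool>
  //   - Refusal: "<quote>" → should have called: <tool>
  const issuePattern =
    /^- (Claim|Refusal): "(.*)" → (?:missing tool|should have called): (.+)$/;

  const issues = [];
  for (const line of lines) {
    const match = issuePattern.exec(line);
    if (match) {
      issues.push({
        kind: match[1].toLowerCase(), // 'claim' or 'refusal'
        quote: match[2],
        tool: match[3].trim(),
      });
    }
  }
  return { verdict, issues };
}

// Made-up reply, shaped like the format the prompt asks for.
const sampleReply = [
  'VERDICT: UNGROUNDED',
  '',
  '- Refusal: "I can\'t check real-time prices" → should have called: TradingMarket',
].join('\n');

console.log(parseEvaluatorReply(sampleReply));
```

Returning `null` for a malformed reply leaves it to the caller to decide whether an unparseable verdict counts as a pass or a retry; the diff does not say which the package does.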
package/dist/agent/planner.js
CHANGED
@@ -17,8 +17,18 @@ const MULTI_STEP_PATTERN = /first.*then|step\s+\d|\d+\.\s|and\s+then|after\s+tha
  * the overhead of an extra planning call.
  */
 export function shouldPlan(tier, profile, userText, ultrathink, planDisabled) {
-    //
-    //
+    // Default: plan-then-execute is OFF (v3.8.18). Observed failure: router
+    // correctly picks Sonnet for a "should I sell CRCL" prompt, but the
+    // executor swap downgrades actual execution to gemini-2.5-flash, which
+    // then answers from memory instead of calling TradingMarket / ExaAnswer.
+    // The cheap-executor pattern was load-bearing for Sonnet 4.0-era models;
+    // Opus 4.7 / Sonnet 4.6 handle multi-step tool use coherently in a
+    // single pass, so the two-call path is pure overhead — and it actively
+    // hurts when the executor is weaker than the planner.
+    // Opt back in with FRANKLIN_PLAN=1 (for experiments / ablation).
+    if (process.env.FRANKLIN_PLAN !== '1')
+        return false;
+    // Legacy env opt-out — still honored for users who set it previously.
     if (process.env.FRANKLIN_NOPLAN === '1')
         return false;
     // User disabled planning for this session
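The two environment checks at the top of `shouldPlan` combine as follows: planning is off unless `FRANKLIN_PLAN` is exactly `'1'`, and the legacy `FRANKLIN_NOPLAN=1` still vetoes it even when the opt-in is set. The sketch below isolates just that gating so the combinations are easy to see. `planningAllowedByEnv` is a hypothetical name, the environment is passed in as a parameter rather than read from `process.env`, and a `true` result only means the remaining checks in `shouldPlan` (which this hunk does not show) would still run.

```javascript
// Hypothetical helper mirroring only the env gating shown in the hunk above.
// `env` is any object with string values, e.g. process.env.
function planningAllowedByEnv(env) {
  // v3.8.18 default: planning is off unless explicitly opted in.
  if (env.FRANKLIN_PLAN !== '1') return false;
  // Legacy opt-out still wins over the opt-in.
  if (env.FRANKLIN_NOPLAN === '1') return false;
  // The real function goes on to further checks not shown in this diff.
  return true;
}

// Print the result for each combination of the two variables.
for (const env of [
  {},
  { FRANKLIN_PLAN: '1' },
  { FRANKLIN_NOPLAN: '1' },
  { FRANKLIN_PLAN: '1', FRANKLIN_NOPLAN: '1' },
]) {
  console.log(JSON.stringify(env), planningAllowedByEnv(env));
}
```

Because the comparison is strict equality against the string `'1'`, values such as `true` or `yes` do not opt back in.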
package/package.json
CHANGED