@blockrun/franklin 3.8.17 → 3.8.18

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -26,34 +26,46 @@
26
26
  // Principle-based, not example-enumerating. Specific tickers or phrasings
27
27
  // hard-coded here would rot the moment the market changes. The rule is
28
28
  // general: claim → tool result or explicit uncertainty.
29
- const EVALUATOR_PROMPT = `You are a GROUNDING CHECK agent. Your job is to verify that an AI assistant's answer is grounded in tool-call evidence, not model memory.
29
+ const EVALUATOR_PROMPT = `You are a GROUNDING CHECK agent. Your job is to verify that an AI assistant's answer is grounded in tool-call evidence, not model memory — and that it didn't REFUSE to use tools when tools were the right answer.
30
30
 
31
31
  ## What you receive
32
32
  - The user's question
33
33
  - A list of tool calls made this turn (tool name, input summary, whether it succeeded)
34
34
  - The assistant's final text answer
35
35
 
36
- ## What you check
36
+ ## Two failure modes to catch
37
+
38
+ ### A. Ungrounded claims
37
39
  Every **factual claim** in the answer must trace to ONE of:
38
40
  (a) A successful tool call result from this turn, OR
39
- (b) Explicit acknowledgment of uncertainty ("I'm not sure", "based on older data", "I'd need to check")
41
+ (b) Explicit acknowledgment of uncertainty ("I'm not sure", "based on older data")
40
42
 
41
- Claims that are ungrounded:
43
+ Flag as ungrounded:
42
44
  - Specific current-world facts stated with confidence but not backed by any tool call this turn
43
45
  - Recommendations or conclusions that depend on unstated data (e.g. "you should sell" without a price lookup)
44
46
  - Invented specifics — names, numbers, dates the model produced without a tool call supporting them
45
47
 
46
- Claims that are grounded:
48
+ ### B. Tool-use refusal (NEW)
49
+ If the user clearly asked for live-world data — a current price, today's news, the latest state of X — and the assistant's answer contains a refusal or deflection (e.g. "I can't provide real-time prices", "我无法提供实时数据", "check Yahoo Finance yourself", "as an AI I don't have access to live data"), that is also UNGROUNDED. Franklin HAS tools for this (TradingMarket for prices, ExaAnswer for current events, WebSearch for general web, etc.). Refusing to reach for them is the failure this check was built for.
50
+
51
+ Flag as tool-use refusal:
52
+ - "I can't check real-time prices"
53
+ - "I don't have access to current market data"
54
+ - "You should check [some external site] for the latest"
55
+ - Any variation in any language that shrugs off a live-data question when tools exist
56
+
57
+ ## What's OK
47
58
  - Anything directly derived from a tool result shown in the turn
48
59
  - General knowledge / definitions / reasoning that doesn't depend on current-world specifics
49
- - Claims explicitly hedged as uncertain
60
+ - Claims explicitly hedged as uncertain for reasons unrelated to tool availability
50
61
 
51
62
  ## Output — exact format
52
63
 
53
64
  VERDICT: GROUNDED | PARTIAL | UNGROUNDED
54
65
 
55
- If not GROUNDED, list each ungrounded claim on its own line starting with "- " and the tool that should have been called, like:
66
+ If not GROUNDED, list each issue on its own line starting with "- " and the tool that should have been called, like:
56
67
  - Claim: "<the ungrounded part, quoted briefly>" → missing tool: <TradingMarket | ExaAnswer | ExaSearch | WebSearch | ...>
68
+ - Refusal: "<the refusal phrase, quoted briefly>" → should have called: <tool name>
57
69
 
58
70
  Empty line between verdict and list. No other text. No preamble. No apology. Be terse.`;
59
71
  // ─── Trigger policy ──────────────────────────────────────────────────────
@@ -17,8 +17,18 @@ const MULTI_STEP_PATTERN = /first.*then|step\s+\d|\d+\.\s|and\s+then|after\s+tha
17
17
  * the overhead of an extra planning call.
18
18
  */
19
19
  export function shouldPlan(tier, profile, userText, ultrathink, planDisabled) {
20
- // Per-process opt-out for ablation / scripting ("is plan-then-execute
21
- // still load-bearing?"). Takes precedence over every other heuristic.
20
+ // Default: plan-then-execute is OFF (v3.8.18). Observed failure: router
21
+ // correctly picks Sonnet for a "should I sell CRCL" prompt, but the
22
+ // executor swap downgrades actual execution to gemini-2.5-flash, which
23
+ // then answers from memory instead of calling TradingMarket / ExaAnswer.
24
+ // The cheap-executor pattern was load-bearing for Sonnet 4.0-era models;
25
+ // Opus 4.7 / Sonnet 4.6 handle multi-step tool use coherently in a
26
+ // single pass, so the two-call path is pure overhead — and it actively
27
+ // hurts when the executor is weaker than the planner.
28
+ // Opt back in with FRANKLIN_PLAN=1 (for experiments / ablation).
29
+ if (process.env.FRANKLIN_PLAN !== '1')
30
+ return false;
31
+ // Legacy env opt-out — still honored for users who set it previously.
22
32
  if (process.env.FRANKLIN_NOPLAN === '1')
23
33
  return false;
24
34
  // User disabled planning for this session
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@blockrun/franklin",
3
- "version": "3.8.17",
3
+ "version": "3.8.18",
4
4
  "description": "Franklin — The AI agent with a wallet. Spends USDC autonomously to get real work done. Pay per action, no subscriptions.",
5
5
  "type": "module",
6
6
  "exports": {