verifiable-thinking-mcp 0.4.1 → 0.5.0
- package/README.md +125 -128
- package/dist/index.js +536 -0
- package/package.json +2 -1
package/README.md
CHANGED
@@ -1,56 +1,33 @@
 <div align="center">
 
-
+<img src="assets/header.svg" alt="Verifiable Thinking MCP" width="800" />
 
-**Your LLM is confidently wrong 40% of the time on reasoning questions
-**This fixes that.**
+**Your LLM is confidently wrong 40% of the time on reasoning questions. This fixes that.**
 
-[](https://www.npmjs.com/package/verifiable-thinking-mcp)
-[](https://github.com/CoderDayton/verifiable-thinking-mcp/actions)
+[](https://www.npmjs.com/package/verifiable-thinking-mcp)
+[](https://github.com/CoderDayton/verifiable-thinking-mcp/actions/workflows/ci.yml)
 [](https://codecov.io/gh/CoderDayton/verifiable-thinking-mcp)
 [](https://opensource.org/licenses/MIT)
 
-
+*15 trap patterns detected in <1ms. No LLM calls. Just pattern matching.*
+
+[Quick Start](#quick-start) • [Features](#features) • [Trap Detection](#trap-detection) • [API](#tools)
 
 </div>
 
 ---
 
-## The Problem
-
-Ask Claude or GPT this:
-
-> *A bat and ball cost $1.10. The bat costs $1 more than the ball. How much does the ball cost?*
-
-**40% of the time, it answers $0.10.** Confidently. With reasoning. And it's wrong.
-
-The correct answer is $0.05 (because $0.05 + $1.05 = $1.10).
-
-This isn't a cherry-picked example. LLMs fail predictably on cognitive traps:
-- Lily pad doubling problems
-- Monty Hall scenarios
-- Base rate fallacies
-- Gambler's fallacy questions
-
-They fail because they pattern-match to *similar-looking* problems instead of reasoning through the actual structure.
-
-## The Solution
-
 ```
-
-│
-│
-│
-│
-│
-│
-
-│ Answer: $0.05 ✓ │
-└─────────────────────────────────────────────────────────────────┘
+┌────────────────────────────────────────────────────────────────┐
+│ "A bat and ball cost $1.10. The bat costs $1 more..." │
+│ ↓ │
+│ TRAP DETECTED: additive_system │
+│ > Don't subtract $1 from $1.10. Set up: x + (x+1) = 1.10 │
+│ ↓ │
+│ Answer: $0.05 (not $0.10) │
+└────────────────────────────────────────────────────────────────┘
 ```
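The warning in the new box sets up `x + (x+1) = 1.10`. Worked through, with `x` as the ball's price in dollars (the symbol follows the README's own hint), the steps are:

```latex
\begin{align*}
x + (x + 1.00) &= 1.10 \\
2x &= 0.10 \\
x &= 0.05
\end{align*}
```

So the ball costs $0.05 and the bat $1.05. The intuitive answer of $0.10 would put the bat at $1.10 and the total at $1.20.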
 
-**Verifiable Thinking** detects 15 cognitive trap patterns in <1ms and warns the LLM before it starts reasoning. No extra LLM calls. Just pattern matching.
-
 ## Quick Start
 
 ```bash
@@ -70,137 +47,159 @@ Add to Claude Desktop (`claude_desktop_config.json`):
 }
 ```
 
-
+## Features
 
-
+| | |
+|---|---|
+| 🎯 **Trap Detection** | 15 patterns (bat-ball, Monty Hall, base rate) caught before reasoning starts |
+| ⚔️ **Auto-Challenge** | Forces counterarguments when confidence >95%—no more overconfident wrong answers |
+| 🔍 **Contradiction Detection** | Catches "Let x=5" then "Now x=10" across steps |
+| 🌿 **Hypothesis Branching** | Explore alternatives, auto-detects when branches confirm/refute |
+| 🔢 **Local Math** | Evaluates expressions without LLM round-trips |
+| 🗜️ **Smart Compression** | 56.8% token savings with query-aware CPC compression |
+| ⚡ **Real Token Counting** | Tiktoken integration—3,922× cache speedup, zero estimation error |
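The Contradiction Detection row gives one concrete case: "Let x=5" followed later by "Now x=10". The diff does not show how the package implements this, so the following is only a minimal sketch of the idea. The names `findContradictions`, `Contradiction`, and the `ASSIGNMENT` regex are hypothetical, and the sketch handles only simple numeric assignments.

```typescript
// Hypothetical sketch: NOT the package's implementation.
interface Contradiction {
  variable: string;
  firstValue: number;
  firstStep: number;
  laterValue: number;
  laterStep: number;
}

// Matches simple numeric assignments such as "Let x=5" or "Now x = 10".
const ASSIGNMENT = /\b([A-Za-z_]\w*)\s*=\s*(-?\d+(?:\.\d+)?)/;

function findContradictions(steps: string[]): Contradiction[] {
  const seen = new Map<string, { value: number; step: number }>();
  const found: Contradiction[] = [];

  steps.forEach((text, step) => {
    // Fresh global regex per step so lastIndex never leaks between steps.
    const pattern = new RegExp(ASSIGNMENT.source, "g");
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(text)) !== null) {
      const variable = match[1];
      const value = Number(match[2]);
      const prior = seen.get(variable);
      if (prior === undefined) {
        // First time this variable is bound: remember where and to what.
        seen.set(variable, { value, step });
      } else if (prior.value !== value) {
        // Same variable, different value: report the conflict.
        found.push({
          variable,
          firstValue: prior.value,
          firstStep: prior.step,
          laterValue: value,
          laterStep: step,
        });
      }
    }
  });
  return found;
}

console.log(findContradictions(["Let x=5", "So y = 2", "Now x=10"]));
```

A real detector would also need to handle deliberate reassignment and symbolic values, which this sketch ignores.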
 
-
+## Token Efficiency
 
-
+Every operation counts. Verifiable Thinking uses **real token counting** (tiktoken) and **intelligent compression** to cut costs by 50-60% without sacrificing reasoning quality.
 
-
+```typescript
+// Traditional reasoning: ~1,350 tokens for 10-step chain
+// Verifiable Thinking: ~580 tokens (56.8% savings)
 
-
+// Real token counting (not estimation)
+countTokens("What is 2+2?") // → 7 tokens (not 3)
+// Cache speedup: 3,922× faster on repeated strings
 
-
+// Compress before processing (not just storage)
+scratchpad({
+  operation: "step",
+  thought: "Long analysis...", // 135 tokens → 72 tokens
+  compress: true
+})
 
-
-
-
-
-
-
-| **Local Math** | Evaluates expressions without LLM calls | Catches arithmetic errors instantly |
-| **Budget Control** | Token tracking with soft/hard limits | Prevents runaway reasoning chains |
+// Budget controls
+scratchpad({
+  warn_at_tokens: 2000, // Soft warning
+  hard_limit_tokens: 5000 // Hard stop
+})
+```
 
-
-<summary><strong>All 15 Trap Patterns</strong></summary>
-
-| Pattern | Classic Example | The Trap |
-|---------|-----------------|----------|
-| `additive_system` | Bat and ball | Subtract instead of solve equations |
-| `nonlinear_growth` | Lily pad doubling | Linear interpolation on exponential |
-| `rate_pattern` | 5 machines, 5 minutes | Incorrect scaling |
-| `harmonic_mean` | Round-trip average speed | Arithmetic mean for rates |
-| `independence` | Coin flip sequence | Gambler's fallacy |
-| `pigeonhole` | Socks in the dark | Underestimate worst case |
-| `base_rate` | Medical test accuracy | Ignore prevalence |
-| `factorial_counting` | Trailing zeros in n! | Simple division |
-| `clock_overlap` | Hour/minute hand overlaps | Assume exactly 12 |
-| `conditional_probability` | Given/if probability | Ignore conditioning |
-| `conjunction_fallacy` | Linda the bank teller | More detail = more likely |
-| `monty_hall` | Door switching game | 50/50 fallacy after reveal |
-| `anchoring` | Estimation after priming | Irrelevant number influence |
-| `sunk_cost` | Should I continue? | Past investment bias |
-| `framing_effect` | "Save 200" vs "400 die" | Gain/loss framing |
+**At scale:** 1,000 reasoning chains/day = **$4,193/year saved** (at GPT-4o pricing).
 
-
+See [`docs/token-optimization.md`](docs/token-optimization.md) for architecture details and benchmarks.
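The savings figures quoted in this section can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: `savingsPercent` and `yearlySavingsUSD` are hypothetical helpers, and the per-token price is left as a parameter because the diff does not state which GPT-4o rate was used.

```typescript
// Percentage of tokens removed by compression.
function savingsPercent(before: number, after: number): number {
  if (before <= 0) throw new RangeError("before must be positive");
  return ((before - after) / before) * 100;
}

// Hypothetical cost projection. pricePerMillionTokens is an input here,
// not a quoted price.
function yearlySavingsUSD(
  tokensSavedPerChain: number,
  chainsPerDay: number,
  pricePerMillionTokens: number,
): number {
  return (tokensSavedPerChain * chainsPerDay * 365 * pricePerMillionTokens) / 1e6;
}

// Rounded chain-level figures from the README (~1,350 → ~580 tokens).
console.log(savingsPercent(1350, 580).toFixed(1)); // "57.0"
// Single-step example from the README (135 → 72 tokens).
console.log(savingsPercent(135, 72).toFixed(1)); // "46.7"
```

The rounded chain-level counts give roughly 57%, in line with the quoted 56.8% given that both counts are approximate. The single-step example comes out lower, which suggests savings vary from step to step.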
 
 ## How It Works
 
 ```typescript
-//
+// Start with a question—trap detection runs automatically
 scratchpad({
   operation: "step",
-  question: "A bat and ball cost $1.10
-  thought: "Let
-  confidence: 0.8
-})
-// → Returns trap_analysis: { pattern: "additive_system", warning: "..." }
-
-// Step 2: Continue reasoning with the warning in context
-scratchpad({
-  operation: "step",
-  thought: "Setting up equations: ball = x, bat = x + 1.00",
+  question: "A bat and ball cost $1.10...",
+  thought: "Let ball = x, bat = x + 1.00",
   confidence: 0.9
 })
+// → Returns trap_analysis warning
 
-//
-scratchpad({
-
-
-
-
+// High confidence? Auto-challenge kicks in
+scratchpad({ operation: "step", thought: "...", confidence: 0.96 })
+// → Returns challenge_suggestion: "What if your assumption is wrong?"
+
+// Complete with spot-check
+scratchpad({ operation: "complete", final_answer: "$0.05" })
 ```
 
-##
+## Trap Detection
 
-| |
-
-|
-|
-|
-|
-|
-| Token budgets | ❌ | ✓ |
-| Lines of code | ~100 | 22,000+ |
-| Tests | ? | 1,831 |
+| Pattern | What It Catches |
+|---------|-----------------|
+| `additive_system` | Bat-ball, widget-gadget (subtract instead of solve) |
+| `nonlinear_growth` | Lily pad doubling (linear interpolation) |
+| `monty_hall` | Door switching (50/50 fallacy) |
+| `base_rate` | Medical tests (ignoring prevalence) |
+| `independence` | Coin flips (gambler's fallacy) |
 
-
-
+<details>
+<summary>All 15 patterns</summary>
+
+| Pattern | Trap |
+|---------|------|
+| `additive_system` | Subtract instead of solve |
+| `nonlinear_growth` | Linear interpolation |
+| `rate_pattern` | Incorrect scaling |
+| `harmonic_mean` | Arithmetic mean for rates |
+| `independence` | Gambler's fallacy |
+| `pigeonhole` | Underestimate worst case |
+| `base_rate` | Ignore prevalence |
+| `factorial_counting` | Simple division |
+| `clock_overlap` | Assume 12 overlaps |
+| `conditional_probability` | Ignore conditioning |
+| `conjunction_fallacy` | More detail = more likely |
+| `monty_hall` | 50/50 after reveal |
+| `anchoring` | Irrelevant number influence |
+| `sunk_cost` | Past investment bias |
+| `framing_effect` | Gain/loss framing |
 
-
+</details>
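The README describes trap detection as plain pattern matching with no LLM calls, but this diff does not include the detection rules. As a rough sketch of what regex-based detection could look like, the example below covers two of the pattern names from the table. The `TrapRule` shape, the regular expressions, and the warning strings are all assumptions made for illustration.

```typescript
// Hypothetical sketch: the rule shape and regexes are NOT taken from the package.
interface TrapRule {
  pattern: string; // pattern name as listed in the README table
  test: RegExp;    // assumed surface cue for the trap
  warning: string; // assumed priming message
}

const RULES: TrapRule[] = [
  {
    pattern: "additive_system",
    // A stated cost combined with a "more than" difference.
    test: /\bcosts?\b[\s\S]*\bmore than\b/i,
    warning: "Set up x + (x + d) = total instead of subtracting d from the total.",
  },
  {
    pattern: "nonlinear_growth",
    // Any mention of doubling suggests exponential growth.
    test: /\bdoubl(?:es|ing)\b/i,
    warning: "Growth is exponential, so do not interpolate linearly.",
  },
];

// Returns every rule whose cue appears in the question.
function detectTraps(question: string): TrapRule[] {
  return RULES.filter((rule) => rule.test.test(question));
}

const hits = detectTraps(
  "A bat and ball cost $1.10. The bat costs $1 more than the ball. How much does the ball cost?",
);
console.log(hits.map((rule) => rule.pattern));
```

Checks like these are a handful of regex tests per question, which is consistent with the sub-millisecond claim, though cues this loose would also produce false positives.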
 
-##
+## Tools
 
-
-<summary><strong>scratchpad operations</strong></summary>
+**`scratchpad`** — the main tool with 11 operations:
 
-| Operation |
-
+| Operation | What It Does |
+|-----------|--------------|
 | `step` | Add reasoning step (trap priming on first) |
 | `complete` | Finalize with auto spot-check |
 | `revise` | Fix earlier step |
 | `branch` | Explore alternative path |
 | `challenge` | Force adversarial self-check |
 | `navigate` | View history/branches |
-| `spot_check` | Manual trap validation |
-| `hint` | Progressive algebraic help |
-| `mistakes` | Detect common errors |
-| `augment` | Evaluate math expressions |
-| `override` | Force-commit after failure |
-
-</details>
 
 <details>
-<summary
+<summary>All operations</summary>
 
-
-
-
-
+| Operation | Purpose |
+|-----------|---------|
+| `step` | Add reasoning step |
+| `complete` | Finalize chain |
+| `revise` | Fix earlier step |
+| `branch` | Alternative path |
+| `challenge` | Adversarial self-check |
+| `navigate` | View history |
+| `spot_check` | Manual trap check |
+| `hint` | Progressive simplification |
+| `mistakes` | Algebraic error detection |
+| `augment` | Compute math expressions |
+| `override` | Force-commit failed step |
 
 </details>
 
+**Other tools:** `list_sessions`, `get_session`, `clear_session`, `compress`
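The README examples call `scratchpad({...})` as if it were a function, but an MCP client invokes a tool by sending a JSON-RPC `tools/call` request. The sketch below builds such a request. The argument fields (`operation`, `question`, `thought`, `confidence`) are taken from the README example. The helper `buildToolCall` is hypothetical, and the server's full input schema is not shown in this diff.

```typescript
// Shape of an MCP tools/call request (JSON-RPC 2.0).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper that wraps tool arguments in the request envelope.
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

const request = buildToolCall(1, "scratchpad", {
  operation: "step",
  question: "A bat and ball cost $1.10...",
  thought: "Let ball = x, bat = x + 1.00",
  confidence: 0.9,
});

console.log(JSON.stringify(request, null, 2));
```

In practice an MCP client library would normally construct and send this envelope, so the sketch mainly shows what goes over the wire.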
+
+## vs Sequential Thinking MCP
+
+| | Sequential Thinking | Verifiable Thinking |
+|---|---|---|
+| Trap detection | ❌ | 15 patterns |
+| Auto-challenge | ❌ | >95% confidence |
+| Contradiction detection | ❌ | ✅ |
+| Confidence tracking | ❌ | Per-step + chain |
+| Local compute | ❌ | ✅ |
+| Token budgets | ❌ | Soft + hard limits |
+| Real token counting | ❌ | Tiktoken (3,922× cache speedup) |
+| Compression | ❌ | 56.8% token savings |
+
+Sequential Thinking is ~100 lines. This is 22,000+ with 1,831 tests.
+
+See [`docs/competitive-analysis.md`](docs/competitive-analysis.md) for full breakdown.
+
 ## Development
 
 ```bash
 git clone https://github.com/CoderDayton/verifiable-thinking-mcp.git
 cd verifiable-thinking-mcp && bun install
-
-bun run dev # MCP Inspector
+bun run dev # Interactive MCP Inspector
 bun test # 1,831 tests
-bun run build # Production bundle
 ```
 
 ## License
@@ -211,8 +210,6 @@ MIT
 
 <div align="center">
 
-**[Report Bug](https://github.com/CoderDayton/verifiable-thinking-mcp/issues) · [Request Feature](https://github.com/CoderDayton/verifiable-thinking-mcp/issues)
-
-*Built because LLMs shouldn't be confidently wrong.*
+**[Report Bug](https://github.com/CoderDayton/verifiable-thinking-mcp/issues) · [Request Feature](https://github.com/CoderDayton/verifiable-thinking-mcp/issues)**
 
 </div>