ollama-bench 1.1.0 → 1.2.0

package/README.md CHANGED
@@ -1,22 +1,27 @@
1
1
  # Ollama-bench
2
2
 
3
- Minimal CLI tool to benchmark Ollama models with detailed phase analysis. Zero runtime dependencies.
3
+ Minimal CLI tool to benchmark Ollama models with detailed phase-by-phase analysis, now with **time-to-first-token (TTFT)**, **reasoning/thinking** measurement, GPU/VRAM reporting, and a side-by-side ranking table.
4
4
 
5
5
  ## Features
6
6
 
7
- - Phase-by-phase performance breakdown
8
- - Precise timing measurements
9
- - Works with npm, pnpm, yarn, and bun
7
+ - Phase-by-phase performance breakdown (load · prompt eval · generation)
8
+ - **TTFT** (time to first token) — the metric that actually drives perceived latency
9
+ - **Reasoning models**: auto-detects thinking-capable models and measures the thinking phase separately
10
+ - Size / quantization / VRAM (GPU vs CPU) reporting via the live model state
11
+ - Aligned **ranking table** when comparing multiple models
12
+ - `--json` output for scripting and CI
13
+ - Multi-run averaging, custom prompts, custom host
14
+ - TTY-aware: colors/spinners on a terminal, clean plain text when piped (honors `NO_COLOR`)
10
15
 
11
16
  ## Quick Start
12
17
 
13
18
  ```bash
14
19
  # Run directly (no installation)
15
- npx ollama-bench qwen2.5:0.5b llama3.2:1b
20
+ npx ollama-bench qwen3:0.6b llama3.2:1b
16
21
 
17
22
  # Or with other package managers
18
- bunx ollama-bench qwen2.5:0.5b
19
- pnpm dlx ollama-bench qwen2.5:0.5b
23
+ bunx ollama-bench qwen3:0.6b
24
+ pnpm dlx ollama-bench qwen3:0.6b
20
25
  ```
21
26
 
22
27
  ## Prerequisites
@@ -24,23 +29,78 @@ pnpm dlx ollama-bench qwen2.5:0.5b
24
29
  1. **Install Ollama** - [ollama.com/download](https://ollama.com/download)
25
30
  2. **Start Ollama server** - Run `ollama serve`
26
31
 
32
+ ## Usage
33
+
34
+ ```
35
+ ollama-bench [options] <model> [model...]
36
+
37
+ Options
38
+ --think[=high|medium|low] Enable reasoning/thinking (auto-detected by default)
39
+ --no-think Disable thinking even for reasoning models
40
+ --prompt <text> Custom benchmark prompt
41
+ --runs <n> Repeat each model n times and average (default: 1)
42
+ --host <url> Ollama server URL (default: http://127.0.0.1:11434)
43
+ --json Emit machine-readable JSON instead of the report
44
+ --demo Render the UI with synthetic data (no server needed)
45
+ -v, --version Print version
46
+ -h, --help Show this help
47
+ ```
48
+
49
+ ### Examples
50
+
51
+ ```bash
52
+ # Compare two models
53
+ ollama-bench qwen3:0.6b llama3.2:1b
54
+
55
+ # Benchmark a reasoning model at high thinking effort, averaged over 3 runs
56
+ ollama-bench --runs 3 --think=high deepseek-r1:1.5b
57
+
58
+ # Custom prompt, JSON output for a script
59
+ ollama-bench --prompt "Write a haiku about TCP" --json gemma3:1b > result.json
60
+
61
+ # Preview the UI without an Ollama server
62
+ ollama-bench --demo
63
+ ```
64
+
27
65
  ## Benchmark Phases
28
66
 
29
- Each benchmark measures three distinct phases:
67
+ Each benchmark measures these phases (timings come straight from the Ollama server):
68
+
69
+ **Model Loading** — time to load weights into memory. Hardware-dependent, very consistent.
30
70
 
31
- **Phase 1: Model Loading** (Loading weights into memory)
32
- - Time to load model from disk into RAM
33
- - Hardware-dependent, very consistent
71
+ **Prompt Processing** - time to encode and process the input prompt. Fast, scales with prompt length.
34
72
 
35
- **Phase 2: Prompt Processing** (Encoding input)
36
- - Time to encode and process your input prompt
37
- - Fast, scales with prompt length
73
+ **Thinking** *(reasoning models only)* - the model's streamed thinking text, measured separately from the visible answer. Automatically enabled for thinking-capable models such as `qwen3` and `deepseek-r1`. Ollama does not expose separate thinking token counts, so ollama-bench reports exact thinking characters and chars/sec instead of estimating tokens.
38
74
 
39
- **Phase 3: Response Generation** (Creating output)
40
- - Time to generate the actual response
41
- - Most important metric for user-facing performance
42
- - Varies with content complexity
75
+ **Response Generation** - time to generate the output tokens. The most important metric for user-facing performance.
43
76
 
77
+ Alongside the phases, ollama-bench reports **TTFT** (wall-clock time to the first streamed token) and the model's **size / quantization / VRAM** placement.
78
+
79
+ ## JSON output
80
+
81
+ `--json` writes a single JSON object to **stdout** (all progress goes to stderr, so the stream stays parseable):
82
+
83
+ ```json
84
+ {
85
+ "server": "0.12.0",
86
+ "prompt": "Explain the theory of relativity in simple terms.",
87
+ "results": [
88
+ {
89
+ "model": "qwen3:0.6b",
90
+ "ok": true,
91
+ "tokensPerSecond": 168.4,
92
+ "ttft": 0.51,
93
+ "thinking": true,
94
+ "thinkingTime": 1.13,
95
+ "thinkingChars": 640,
96
+ "thinkingCharsPerSecond": 568,
97
+ "loadTime": 0.42,
98
+ "generationTime": 1.9,
99
+ "totalTime": 2.4
100
+ }
101
+ ]
102
+ }
103
+ ```
44
104
 
45
105
  ## Available Models
46
106
 
@@ -48,4 +108,4 @@ See [ollama.com/library](https://ollama.com/library) for all available models.
48
108
 
49
109
  ## License
50
110
 
51
- MIT
111
+ MIT
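Note on the README's "JSON output" section: a minimal sketch of how a script might consume the `--json` report, assuming the object shape shown in the README example (`results[].ok`, `results[].tokensPerSecond`, `results[].ttft`). The `fastestModel` helper and the inline sample report are hypothetical and only for illustration; they are not part of ollama-bench.

```javascript
// Hypothetical helper: pick the fastest successful model from a report
// shaped like the README's --json example.
function fastestModel(report) {
  const ok = report.results.filter((r) => r.ok);
  if (ok.length === 0) return null; // every model failed
  return ok.reduce((best, r) => (r.tokensPerSecond > best.tokensPerSecond ? r : best));
}

// Inline sample standing in for the contents of result.json.
// In a real script this string would come from the file or from stdin.
const sampleReport = JSON.parse(`{
  "server": "0.12.0",
  "prompt": "Explain the theory of relativity in simple terms.",
  "results": [
    { "model": "qwen3:0.6b", "ok": true, "tokensPerSecond": 168.4, "ttft": 0.51 },
    { "model": "llama3.2:1b", "ok": true, "tokensPerSecond": 116.7, "ttft": 0.66 },
    { "model": "gemma3:1b", "ok": false, "error": "model not found", "tokensPerSecond": 0, "ttft": 0 }
  ]
}`);

const winner = fastestModel(sampleReport);
console.log(winner ? `${winner.model}: ${winner.tokensPerSecond} t/s` : 'no successful runs');
```

Filtering on `ok` first matters because failed entries carry `tokensPerSecond: 0` rather than being omitted, so a naive minimum or average over all entries would be skewed.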
package/dist/index.js CHANGED
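Review note on multi-run averaging in this file: `benchmarkModel` averages token counts and durations across runs and then divides them via `rate()`, so the reported tokens/sec is a ratio of means rather than a mean of per-run rates. The sketch below, using made-up sample values, shows that the two can differ. `mean`, `ratioOfMeans`, and `meanOfRates` are hypothetical names for illustration; only the `rate` guard is copied from the diff.

```javascript
// Same guard as rate() in dist/index.js: avoid Infinity/NaN on zero durations.
function rate(count, seconds) {
  return seconds > 0 ? count / seconds : 0;
}

// Hypothetical helper: arithmetic mean of a non-empty array.
const mean = (xs) => xs.reduce((a, x) => a + x, 0) / xs.length;

// Made-up samples for two runs of the same model.
const samples = [
  { evalCount: 100, generationTime: 1.0 }, // 100 t/s on its own
  { evalCount: 300, generationTime: 2.0 }, // 150 t/s on its own
];

// What the diff computes: average the counts and times, then divide.
const ratioOfMeans = rate(
  mean(samples.map((s) => s.evalCount)),
  mean(samples.map((s) => s.generationTime)),
);

// Alternative: compute each run's rate, then average the rates.
const meanOfRates = mean(samples.map((s) => rate(s.evalCount, s.generationTime)));

console.log(ratioOfMeans.toFixed(2), meanOfRates.toFixed(2));
```

The ratio of means equals total tokens divided by total generation time, so longer runs weigh more. Either definition is defensible; the point is only that readers comparing `--runs 1` and `--runs 3` output should know which one is reported.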
@@ -1,9 +1,9 @@
1
1
  #!/usr/bin/env node
2
- import ollama from 'ollama';
2
+ import { Ollama } from 'ollama';
3
3
  /**
4
4
  * Object containing ANSI color codes for text coloring.
5
5
  */
6
- const colors = {
6
+ const codes = {
7
7
  reset: '\x1b[0m',
8
8
  green: '\x1b[32m',
9
9
  yellow: '\x1b[33m',
@@ -11,165 +11,620 @@ const colors = {
11
11
  cyan: '\x1b[36m',
12
12
  magenta: '\x1b[35m',
13
13
  blue: '\x1b[34m',
14
+ gray: '\x1b[90m',
15
+ bold: '\x1b[1m',
14
16
  };
15
17
  /**
16
- * Applies color to the given text.
18
+ * Whether the current stdout is an interactive terminal (controls spinners).
19
+ */
20
+ const isTTY = process.stdout.isTTY === true;
21
+ /**
22
+ * Whether ANSI colors should be emitted. Honors NO_COLOR / FORCE_COLOR and TTY.
23
+ */
24
+ const useColor = !('NO_COLOR' in process.env) &&
25
+ process.env.TERM !== 'dumb' &&
26
+ (isTTY || 'FORCE_COLOR' in process.env);
27
+ /**
28
+ * Applies color to the given text (no-op when colors are disabled).
17
29
  * @param text - The text to colorize.
18
30
  * @param color - The color to apply.
19
31
  * @returns The colorized text.
20
32
  */
21
33
  function colorize(text, color) {
22
- return `${colors[color]}${text}${colors.reset}`;
34
+ return useColor ? `${codes[color]}${text}${codes.reset}` : text;
35
+ }
36
+ /* -------------------------------------------------------------------------- */
37
+ /* Formatting helpers */
38
+ /* -------------------------------------------------------------------------- */
39
+ /**
40
+ * Formats a duration in seconds into a compact human-readable string.
41
+ */
42
+ function fmtDuration(seconds) {
43
+ if (!isFinite(seconds) || seconds < 0)
44
+ return '—';
45
+ if (seconds < 1)
46
+ return `${(seconds * 1000).toFixed(0)}ms`;
47
+ if (seconds < 60)
48
+ return `${seconds.toFixed(2)}s`;
49
+ const m = Math.floor(seconds / 60);
50
+ const s = seconds % 60;
51
+ return `${m}m${s.toFixed(0)}s`;
23
52
  }
24
53
  /**
25
- * Creates a loading animation for the console.
26
- * @param operation - The operation being performed.
27
- * @param model - The model being processed.
28
- * @returns An interval ID for the animation.
54
+ * Formats a byte count into a human-readable string (GB / MB / KB).
29
55
  */
30
- function createLoadingAnimation(operation, model) {
31
- const frames = ['|', '/', '-', '\\'];
56
+ function fmtBytes(bytes) {
57
+ if (!bytes || bytes <= 0)
58
+ return '—';
59
+ const units = ['B', 'KB', 'MB', 'GB', 'TB'];
60
+ let v = bytes;
32
61
  let i = 0;
33
- let dots = 0;
34
- return setInterval(() => {
35
- const frame = frames[i];
36
- const dotString = '.'.repeat(dots);
37
- const operationText = colorize(`${operation} ${model}${dotString}`, 'blue');
38
- process.stdout.write(`\r${frame} ${operationText}`.padEnd(50));
39
- i = (i + 1) % frames.length;
40
- dots = (dots + 1) % 4;
41
- }, 100);
42
- }
43
- /**
44
- * Pulls a model from Ollama.
45
- * @param model - The name of the model to pull.
46
- */
47
- async function pullModel(model) {
48
- console.log(colorize(`Initiating pull for ${model}...`, 'yellow'));
49
- const loadingAnimation = createLoadingAnimation('Pulling', model);
62
+ while (v >= 1024 && i < units.length - 1) {
63
+ v /= 1024;
64
+ i++;
65
+ }
66
+ return `${v.toFixed(v < 10 && i > 0 ? 1 : 0)}${units[i]}`;
67
+ }
68
+ /**
69
+ * Formats a tokens-per-second value.
70
+ */
71
+ function fmtRate(rate) {
72
+ if (!isFinite(rate) || rate <= 0)
73
+ return '—';
74
+ return `${rate.toFixed(1)} t/s`;
75
+ }
76
+ /* -------------------------------------------------------------------------- */
77
+ /* Spinner */
78
+ /* -------------------------------------------------------------------------- */
79
+ /**
80
+ * A minimal TTY spinner. On non-interactive terminals it prints a single line
81
+ * and becomes a no-op, so piped/CI output stays clean.
82
+ */
83
+ class Spinner {
84
+ static get tty() {
85
+ return process.stderr.isTTY === true;
86
+ }
87
+ constructor(text) {
88
+ this.text = text;
89
+ this.frames = ['⠋', '⠙', '⠹', '⠸', '⠼', '⠴', '⠦', '⠧', '⠇', '⠏'];
90
+ this.timer = null;
91
+ this.i = 0;
92
+ }
93
+ start() {
94
+ // Progress goes to stderr so stdout stays clean for reports / --json.
95
+ if (!Spinner.tty) {
96
+ process.stderr.write(`${this.text}\n`);
97
+ return this;
98
+ }
99
+ this.timer = setInterval(() => {
100
+ const frame = colorize(this.frames[this.i], 'cyan');
101
+ process.stderr.write(`\r${frame} ${this.text}\x1b[K`);
102
+ this.i = (this.i + 1) % this.frames.length;
103
+ }, 80);
104
+ return this;
105
+ }
106
+ update(text) {
107
+ this.text = text;
108
+ }
109
+ stop() {
110
+ if (this.timer) {
111
+ clearInterval(this.timer);
112
+ this.timer = null;
113
+ }
114
+ if (Spinner.tty)
115
+ process.stderr.write('\r\x1b[K');
116
+ }
117
+ }
118
+ const DEFAULT_PROMPT = 'Explain the theory of relativity in simple terms.';
119
+ const TOOL_VERSION = '1.2.0';
120
+ /* -------------------------------------------------------------------------- */
121
+ /* Argument parsing */
122
+ /* -------------------------------------------------------------------------- */
123
+ /**
124
+ * Parses process.argv into structured CLI options.
125
+ */
126
+ function parseArgs(argv) {
127
+ const opts = {
128
+ models: [],
129
+ prompt: DEFAULT_PROMPT,
130
+ runs: 1,
131
+ think: undefined,
132
+ noThink: false,
133
+ json: false,
134
+ demo: false,
135
+ help: false,
136
+ version: false,
137
+ };
138
+ for (let i = 0; i < argv.length; i++) {
139
+ const arg = argv[i];
140
+ const next = () => argv[++i];
141
+ if (arg === '-h' || arg === '--help')
142
+ opts.help = true;
143
+ else if (arg === '-v' || arg === '--version')
144
+ opts.version = true;
145
+ else if (arg === '--json')
146
+ opts.json = true;
147
+ else if (arg === '--demo')
148
+ opts.demo = true;
149
+ else if (arg === '--no-think')
150
+ opts.noThink = true;
151
+ else if (arg === '--think')
152
+ opts.think = true;
153
+ else if (arg.startsWith('--think=')) {
154
+ const level = arg.slice('--think='.length);
155
+ opts.think = level === 'true' ? true : level;
156
+ }
157
+ else if (arg === '--prompt')
158
+ opts.prompt = next() ?? opts.prompt;
159
+ else if (arg.startsWith('--prompt='))
160
+ opts.prompt = arg.slice('--prompt='.length);
161
+ else if (arg === '--runs')
162
+ opts.runs = Math.max(1, parseInt(next() ?? '1', 10) || 1);
163
+ else if (arg.startsWith('--runs='))
164
+ opts.runs = Math.max(1, parseInt(arg.slice('--runs='.length), 10) || 1);
165
+ else if (arg === '--host')
166
+ opts.host = next();
167
+ else if (arg.startsWith('--host='))
168
+ opts.host = arg.slice('--host='.length);
169
+ else if (arg.startsWith('-')) {
170
+ console.error(colorize(`Unknown option: ${arg}`, 'red'));
171
+ process.exit(1);
172
+ }
173
+ else
174
+ opts.models.push(arg);
175
+ }
176
+ return opts;
177
+ }
178
+ /**
179
+ * Prints CLI usage / help text.
180
+ */
181
+ function printHelp() {
182
+ const b = (t) => colorize(t, 'bold');
183
+ const c = (t) => colorize(t, 'cyan');
184
+ console.log(`
185
+ ${b('ollama-bench')} — benchmark Ollama models with phase-by-phase analysis
186
+
187
+ ${b('USAGE')}
188
+ ollama-bench [options] <model> [model...]
189
+
190
+ ${b('OPTIONS')}
191
+ ${c('--think[=high|medium|low]')} Enable reasoning/thinking (auto-detected by default)
192
+ ${c('--no-think')} Disable thinking even for reasoning models
193
+ ${c('--prompt <text>')} Custom benchmark prompt
194
+ ${c('--runs <n>')} Repeat each model n times and average (default: 1)
195
+ ${c('--host <url>')} Ollama server URL (default: http://127.0.0.1:11434)
196
+ ${c('--json')} Emit machine-readable JSON instead of the report
197
+ ${c('--demo')} Render the UI with synthetic data (no server needed)
198
+ ${c('-v, --version')} Print version
199
+ ${c('-h, --help')} Show this help
200
+
201
+ ${b('EXAMPLES')}
202
+ ollama-bench qwen3:0.6b llama3.2:1b
203
+ ollama-bench --runs 3 --think=high deepseek-r1:1.5b
204
+ ollama-bench --prompt "Write a haiku about TCP" --json gemma3:1b
205
+ `);
206
+ }
207
+ /* -------------------------------------------------------------------------- */
208
+ /* Ollama interactions */
209
+ /* -------------------------------------------------------------------------- */
210
+ /**
211
+ * Verifies the Ollama server is reachable, returning its version.
212
+ * Exits with a friendly message when the server is unreachable.
213
+ */
214
+ async function ensureServer(client) {
50
215
  try {
51
- const start = performance.now();
52
- const response = await ollama.pull({ model, stream: true });
53
- for await (const part of response) {
54
- if (part.status === 'success') {
55
- clearInterval(loadingAnimation);
56
- const end = performance.now();
57
- const duration = (end - start) / 1000;
58
- console.log(`\r${colorize(`Successfully pulled ${model} in ${duration.toFixed(2)} seconds`, 'green')} `);
59
- return;
216
+ const { version } = await client.version();
217
+ return version;
218
+ }
219
+ catch {
220
+ console.error(colorize('✗ Could not reach the Ollama server.', 'red'));
221
+ console.error(colorize(' Is it running? Start it with: ollama serve', 'gray'));
222
+ process.exit(1);
223
+ }
224
+ }
225
+ /**
226
+ * Returns the capability list for a model (e.g. ['completion', 'thinking', 'tools']).
227
+ * Returns an empty array if the model is not present locally.
228
+ */
229
+ async function modelCapabilities(client, model) {
230
+ try {
231
+ const info = await client.show({ model });
232
+ return info.capabilities ?? [];
233
+ }
234
+ catch {
235
+ return [];
236
+ }
237
+ }
238
+ /**
239
+ * Pulls a model, rendering a live progress bar with percentage.
240
+ */
241
+ async function pullModel(client, model) {
242
+ const spinner = new Spinner(colorize(`Pulling ${model}…`, 'blue')).start();
243
+ const start = performance.now();
244
+ try {
245
+ const stream = await client.pull({ model, stream: true });
246
+ for await (const part of stream) {
247
+ if (part.total && part.completed) {
248
+ const pct = Math.min(100, (part.completed / part.total) * 100);
249
+ const width = 24;
250
+ const filled = Math.round((pct / 100) * width);
251
+ const bar = '█'.repeat(filled) + '░'.repeat(width - filled);
252
+ spinner.update(`${colorize(`Pulling ${model}`, 'blue')} ${colorize(bar, 'cyan')} ` +
253
+ `${pct.toFixed(0)}% ${colorize(`${fmtBytes(part.completed)}/${fmtBytes(part.total)}`, 'gray')}`);
254
+ }
255
+ else {
256
+ spinner.update(colorize(`Pulling ${model} — ${part.status}…`, 'blue'));
60
257
  }
61
258
  }
259
+ spinner.stop();
260
+ const elapsed = (performance.now() - start) / 1000;
261
+ console.error(colorize(`✓ ${model} ready ${colorize(`(${fmtDuration(elapsed)})`, 'gray')}`, 'green'));
262
+ return true;
62
263
  }
63
264
  catch (error) {
64
- clearInterval(loadingAnimation);
65
- console.log(`\r${colorize(`Error pulling ${model}: ${error.message}`, 'red')} `);
265
+ spinner.stop();
266
+ console.error(colorize(`✗ Failed to pull ${model}: ${error.message}`, 'red'));
267
+ return false;
268
+ }
269
+ }
270
+ /**
271
+ * Computes a finite rate, avoiding Infinity/NaN in JSON output.
272
+ */
273
+ function rate(count, seconds) {
274
+ return seconds > 0 ? count / seconds : 0;
275
+ }
276
+ /**
277
+ * Runs a single streamed generation and captures timing samples.
278
+ */
279
+ async function runOnce(client, model, prompt, think) {
280
+ const start = performance.now();
281
+ let firstTokenAt = 0;
282
+ let thinkStartAt = 0;
283
+ let thinkEndAt = 0;
284
+ let thinkingChars = 0;
285
+ let final;
286
+ const stream = await client.generate({
287
+ model,
288
+ prompt,
289
+ // Pass false explicitly so --no-think disables models whose default is to think.
290
+ think,
291
+ stream: true,
292
+ });
293
+ for await (const chunk of stream) {
294
+ const now = performance.now();
295
+ if (chunk.thinking) {
296
+ if (!thinkStartAt)
297
+ thinkStartAt = now;
298
+ thinkEndAt = now;
299
+ thinkingChars += chunk.thinking.length;
300
+ }
301
+ if ((chunk.response || chunk.thinking) && !firstTokenAt)
302
+ firstTokenAt = now;
303
+ if (chunk.done)
304
+ final = chunk;
66
305
  }
306
+ if (!final)
307
+ throw new Error('No response received from server');
308
+ return {
309
+ loadTime: final.load_duration / 1e9,
310
+ promptEvalTime: final.prompt_eval_duration / 1e9,
311
+ promptEvalCount: final.prompt_eval_count,
312
+ generationTime: final.eval_duration / 1e9,
313
+ evalCount: final.eval_count,
314
+ totalTime: final.total_duration / 1e9,
315
+ ttft: firstTokenAt ? (firstTokenAt - start) / 1000 : 0,
316
+ thinkingChars,
317
+ thinkingWallTime: thinkStartAt ? (thinkEndAt - thinkStartAt) / 1000 : 0,
318
+ };
67
319
  }
68
320
  /**
69
- * Benchmarks a model's performance.
70
- * @param model - The name of the model to benchmark.
71
- * @returns A promise that resolves to the benchmark result.
321
+ * Benchmarks a model across one or more runs and aggregates the results.
72
322
  */
73
- async function benchmarkModel(model) {
74
- const prompt = "Explain the theory of relativity in simple terms.";
75
- console.log(colorize(`\nBenchmarking ${model}`, 'cyan'));
76
- console.log(colorize('─'.repeat(50), 'cyan'));
77
- const loadingAnimation = createLoadingAnimation('Running benchmark', model);
323
+ async function benchmarkModel(client, model, opts) {
324
+ // Decide whether to enable thinking.
325
+ let think = opts.think;
326
+ if (opts.noThink)
327
+ think = false;
328
+ else if (think === undefined) {
329
+ const caps = await modelCapabilities(client, model);
330
+ think = caps.includes('thinking') ? true : false;
331
+ }
332
+ const label = think ? `${model} ${colorize('(thinking)', 'magenta')}` : model;
333
+ const spinner = new Spinner(colorize(`Benchmarking ${label}…`, 'blue')).start();
334
+ const samples = [];
78
335
  try {
79
- const response = await ollama.generate({
80
- model,
81
- prompt,
82
- stream: false,
83
- });
84
- clearInterval(loadingAnimation);
85
- process.stdout.write('\r' + ' '.repeat(50) + '\r');
86
- // Calculate phase timings
87
- const loadTime = response.load_duration / 1e9;
88
- const promptEvalTime = response.prompt_eval_duration / 1e9;
89
- const generationTime = response.eval_duration / 1e9;
90
- const totalTime = response.total_duration / 1e9;
91
- const tokensPerSecond = response.eval_count / generationTime;
92
- // Calculate percentages
93
- const loadPercent = (loadTime / totalTime * 100).toFixed(1);
94
- const promptPercent = (promptEvalTime / totalTime * 100).toFixed(1);
95
- const genPercent = (generationTime / totalTime * 100).toFixed(1);
96
- // Display phases
97
- console.log(colorize('Phase 1: Model Loading (Loading weights into memory)', 'yellow'));
98
- console.log(colorize(` Time: ${loadTime.toFixed(2)}s (${loadPercent}% of total)`, 'yellow'));
99
- console.log();
100
- console.log(colorize('Phase 2: Prompt Processing (Encoding input)', 'yellow'));
101
- console.log(colorize(` Tokens: ${response.prompt_eval_count}`, 'yellow'));
102
- console.log(colorize(` Time: ${promptEvalTime.toFixed(2)}s (${promptPercent}% of total)`, 'yellow'));
103
- console.log(colorize(` Speed: ${(response.prompt_eval_count / promptEvalTime).toFixed(2)} tokens/s`, 'yellow'));
104
- console.log();
105
- console.log(colorize('Phase 3: Response Generation (Creating output)', 'yellow'));
106
- console.log(colorize(` Tokens: ${response.eval_count}`, 'yellow'));
107
- console.log(colorize(` Time: ${generationTime.toFixed(2)}s (${genPercent}% of total)`, 'yellow'));
108
- console.log(colorize(` Speed: ${tokensPerSecond.toFixed(2)} tokens/s`, 'yellow'));
109
- console.log();
110
- console.log(colorize('Summary', 'green'));
111
- console.log(colorize(` Total time: ${totalTime.toFixed(2)}s`, 'green'));
112
- console.log(colorize(` Generation speed: ${tokensPerSecond.toFixed(2)} tokens/s`, 'green'));
113
- console.log();
114
- return {
115
- model,
116
- tokensPerSecond,
117
- loadTime,
118
- promptEvalTime,
119
- generationTime,
120
- totalTime
121
- };
336
+ for (let r = 0; r < opts.runs; r++) {
337
+ if (opts.runs > 1)
338
+ spinner.update(colorize(`Benchmarking ${label} — run ${r + 1}/${opts.runs}…`, 'blue'));
339
+ samples.push(await runOnce(client, model, opts.prompt, think));
340
+ }
341
+ spinner.stop();
122
342
  }
123
343
  catch (error) {
124
- clearInterval(loadingAnimation);
125
- process.stdout.write('\r' + ' '.repeat(50) + '\r');
126
- console.log(colorize(`Error benchmarking ${model}: ${error.message}`, 'red'));
127
- console.log();
344
+ spinner.stop();
128
345
  return {
129
346
  model,
130
- tokensPerSecond: 0,
347
+ ok: false,
348
+ error: error.message,
349
+ runs: 0,
131
350
  loadTime: 0,
132
351
  promptEvalTime: 0,
352
+ promptEvalCount: 0,
353
+ promptTokensPerSecond: 0,
133
354
  generationTime: 0,
134
- totalTime: 0
355
+ evalCount: 0,
356
+ tokensPerSecond: 0,
357
+ totalTime: 0,
358
+ ttft: 0,
359
+ thinking: false,
360
+ thinkingTime: 0,
361
+ thinkingChars: 0,
362
+ thinkingCharsPerSecond: 0,
135
363
  };
136
364
  }
365
+ // Average across runs.
366
+ const avg = (pick) => samples.reduce((a, s) => a + pick(s), 0) / samples.length;
367
+ const loadTime = avg((s) => s.loadTime);
368
+ const promptEvalTime = avg((s) => s.promptEvalTime);
369
+ const promptEvalCount = avg((s) => s.promptEvalCount);
370
+ const generationTime = avg((s) => s.generationTime);
371
+ const evalCount = avg((s) => s.evalCount);
372
+ const totalTime = avg((s) => s.totalTime);
373
+ const thinkingChars = avg((s) => s.thinkingChars);
374
+ const thinkingWallTime = avg((s) => s.thinkingWallTime);
375
+ // Pull resource usage for the (still-loaded) model.
376
+ let sizeBytes;
377
+ let sizeVramBytes;
378
+ let parameterSize;
379
+ let quantization;
380
+ try {
381
+ const { models } = await client.ps();
382
+ const live = models.find((m) => m.name === model || m.model === model);
383
+ if (live) {
384
+ sizeBytes = live.size;
385
+ sizeVramBytes = live.size_vram;
386
+ parameterSize = live.details?.parameter_size;
387
+ quantization = live.details?.quantization_level;
388
+ }
389
+ }
390
+ catch {
391
+ /* ps() is best-effort */
392
+ }
393
+ return {
394
+ model,
395
+ ok: true,
396
+ runs: samples.length,
397
+ loadTime,
398
+ promptEvalTime,
399
+ promptEvalCount,
400
+ promptTokensPerSecond: rate(promptEvalCount, promptEvalTime),
401
+ generationTime,
402
+ evalCount,
403
+ tokensPerSecond: rate(evalCount, generationTime),
404
+ totalTime,
405
+ ttft: avg((s) => s.ttft),
406
+ thinking: thinkingChars > 0,
407
+ thinkingTime: thinkingWallTime,
408
+ thinkingChars,
409
+ // Ollama does not expose a separate token count for thinking chunks, so report
410
+ // the exact streamed character rate instead of estimating tokens from chars.
411
+ thinkingCharsPerSecond: rate(thinkingChars, thinkingWallTime),
412
+ sizeBytes,
413
+ sizeVramBytes,
414
+ parameterSize,
415
+ quantization,
416
+ };
417
+ }
418
+ /* -------------------------------------------------------------------------- */
419
+ /* Rendering */
420
+ /* -------------------------------------------------------------------------- */
421
+ /**
422
+ * Renders the detailed per-phase breakdown for a single result.
423
+ */
424
+ function renderResult(r) {
425
+ const runsNote = r.runs > 1 ? colorize(` (avg of ${r.runs} runs)`, 'gray') : '';
426
+ console.log(colorize(`\n${r.model}`, 'cyan') + runsNote);
427
+ console.log(colorize('─'.repeat(52), 'gray'));
428
+ if (!r.ok) {
429
+ console.log(colorize(` ✗ ${r.error}`, 'red'));
430
+ return;
431
+ }
432
+ const pct = (t) => (r.totalTime > 0 ? `${((t / r.totalTime) * 100).toFixed(0)}%` : '—');
433
+ const line = (label, value, note = '') => console.log(` ${label.padEnd(22)} ${colorize(value, 'bold')} ${colorize(note, 'gray')}`);
434
+ if (r.sizeBytes || r.parameterSize) {
435
+ const where = r.sizeVramBytes && r.sizeVramBytes > 0 ? 'GPU' : 'CPU';
436
+ const detail = [
437
+ r.parameterSize,
438
+ r.quantization,
439
+ r.sizeBytes ? fmtBytes(r.sizeBytes) : undefined,
440
+ r.sizeVramBytes ? `${fmtBytes(r.sizeVramBytes)} VRAM · ${where}` : where,
441
+ ]
442
+ .filter(Boolean)
443
+ .join(' · ');
444
+ console.log(' ' + colorize(detail, 'gray'));
445
+ console.log();
446
+ }
447
+ line('Load', fmtDuration(r.loadTime), pct(r.loadTime) + ' of total');
448
+ line('Prompt eval', fmtDuration(r.promptEvalTime), `${Math.round(r.promptEvalCount)} tok · ${fmtRate(r.promptTokensPerSecond)}`);
449
+ line('First token (TTFT)', fmtDuration(r.ttft));
450
+ if (r.thinking) {
451
+ line('Thinking', fmtDuration(r.thinkingTime), `${Math.round(r.thinkingChars)} chars · ${r.thinkingCharsPerSecond.toFixed(1)} chars/s`);
452
+ }
453
+ line('Generation', fmtDuration(r.generationTime), `${Math.round(r.evalCount)} tok · ${pct(r.generationTime)} of total`);
454
+ console.log();
455
+ line(colorize('Speed', 'green'), colorize(fmtRate(r.tokensPerSecond), 'green'), colorize('total ' + fmtDuration(r.totalTime), 'gray'));
137
456
  }
138
457
  /**
139
- * The main function that orchestrates the model pulling and benchmarking process.
458
+ * Renders an aligned comparison table ranking models by generation speed.
459
+ */
460
+ function renderTable(results) {
461
+ const ok = results.filter((r) => r.ok);
462
+ if (ok.length === 0)
463
+ return;
464
+ const ranked = [...ok].sort((a, b) => b.tokensPerSecond - a.tokensPerSecond);
465
+ const best = ranked[0];
466
+ const headers = ['', 'Model', 'Params', 'Gen', 'Prompt', 'TTFT', 'Load', 'Total'];
467
+ const rows = ranked.map((r, i) => ({
468
+ '': i === 0 ? '★' : `${i + 1}`,
469
+ Model: r.model + (r.thinking ? ' ◇' : ''),
470
+ Params: r.parameterSize ?? '—',
471
+ Gen: fmtRate(r.tokensPerSecond),
472
+ Prompt: fmtRate(r.promptTokensPerSecond),
473
+ TTFT: fmtDuration(r.ttft),
474
+ Load: fmtDuration(r.loadTime),
475
+ Total: fmtDuration(r.totalTime),
476
+ }));
477
+ const widths = headers.map((h) => Math.max(h.length, ...rows.map((row) => row[h].length)));
478
+ const fmtRow = (cells) => cells.map((c, i) => (i <= 1 ? c.padEnd(widths[i]) : c.padStart(widths[i]))).join(' ');
479
+ console.log(colorize('\nRanking', 'magenta'));
480
+ console.log(colorize('═'.repeat(52), 'magenta'));
481
+ console.log(colorize(fmtRow(headers), 'bold'));
482
+ console.log(colorize(headers.map((_, i) => '─'.repeat(widths[i])).join(' '), 'gray'));
483
+ ranked.forEach((r, i) => {
484
+ const cells = fmtRow(headers.map((h) => rows[i][h]));
485
+ console.log(i === 0 ? colorize(cells, 'green') : cells);
486
+ });
487
+ if (ranked.some((r) => r.thinking)) {
488
+ console.log(colorize('\n◇ reasoning model (thinking enabled)', 'gray'));
489
+ }
490
+ console.log(colorize(`\nFastest: ${best.model} at ${fmtRate(best.tokensPerSecond)}`, 'magenta'));
491
+ }
492
+ /* -------------------------------------------------------------------------- */
493
+ /* Demo data (for UI testing without a server) */
494
+ /* -------------------------------------------------------------------------- */
495
+ /**
496
+ * Produces synthetic benchmark results so the UI can be previewed/tested
497
+ * without a running Ollama server.
498
+ */
499
+ function demoResults() {
500
+ return [
501
+ {
502
+ model: 'qwen3:0.6b',
503
+ ok: true,
504
+ runs: 1,
505
+ loadTime: 0.42,
506
+ promptEvalTime: 0.08,
507
+ promptEvalCount: 14,
508
+ promptTokensPerSecond: 175,
509
+ generationTime: 1.9,
510
+ evalCount: 320,
511
+ tokensPerSecond: 168.4,
512
+ totalTime: 2.4,
513
+ ttft: 0.51,
514
+ thinking: true,
515
+ thinkingTime: 1.13,
516
+ thinkingChars: 640,
517
+ thinkingCharsPerSecond: 568,
518
+ sizeBytes: 1.3e9,
519
+ sizeVramBytes: 1.3e9,
520
+ parameterSize: '0.6B',
521
+ quantization: 'Q4_K_M',
522
+ },
523
+ {
524
+ model: 'llama3.2:1b',
525
+ ok: true,
526
+ runs: 1,
527
+ loadTime: 0.6,
528
+ promptEvalTime: 0.05,
529
+ promptEvalCount: 12,
530
+ promptTokensPerSecond: 240,
531
+ generationTime: 2.4,
532
+ evalCount: 280,
533
+ tokensPerSecond: 116.7,
534
+ totalTime: 3.05,
535
+ ttft: 0.66,
536
+ thinking: false,
537
+ thinkingTime: 0,
538
+ thinkingChars: 0,
539
+ thinkingCharsPerSecond: 0,
540
+ sizeBytes: 1.9e9,
541
+ sizeVramBytes: 0,
542
+ parameterSize: '1.2B',
543
+ quantization: 'Q8_0',
544
+ },
545
+ {
546
+ model: 'gemma3:1b',
547
+ ok: false,
548
+ error: "model 'gemma3:1b' not found",
549
+ runs: 0,
550
+ loadTime: 0,
551
+ promptEvalTime: 0,
552
+ promptEvalCount: 0,
553
+ promptTokensPerSecond: 0,
554
+ generationTime: 0,
555
+ evalCount: 0,
556
+ tokensPerSecond: 0,
557
+ totalTime: 0,
558
+ ttft: 0,
559
+ thinking: false,
560
+ thinkingTime: 0,
561
+ thinkingChars: 0,
562
+ thinkingCharsPerSecond: 0,
563
+ },
564
+ ];
565
+ }
566
+ /* -------------------------------------------------------------------------- */
567
+ /* Main */
568
+ /* -------------------------------------------------------------------------- */
569
+ /**
570
+ * Orchestrates argument parsing, model preparation, benchmarking and output.
140
571
  */
141
572
  export async function main() {
142
- const models = process.argv.slice(2);
143
- if (models.length === 0) {
144
- console.log(colorize(`Error: No models provided. Please specify at least one model.`, 'red'));
573
+ const opts = parseArgs(process.argv.slice(2));
574
+ if (opts.help)
575
+ return printHelp();
576
+ if (opts.version) {
577
+ console.log(TOOL_VERSION);
578
+ return;
579
+ }
580
+ // Demo mode: render the UI from synthetic data, no server required.
581
+ if (opts.demo) {
582
+ const results = demoResults();
583
+ console.log(colorize('ollama-bench (demo)', 'cyan'));
584
+ console.log(colorize('═'.repeat(52), 'cyan'));
585
+ results.forEach(renderResult);
586
+ renderTable(results);
587
+ return;
588
+ }
589
+ if (opts.models.length === 0) {
590
+ console.error(colorize('Error: specify at least one model.\n', 'red'));
591
+ printHelp();
145
592
  process.exit(1);
146
593
  }
147
- console.log(colorize(`Ollama Benchmark Script`, 'cyan'));
148
- console.log(colorize('═'.repeat(50), 'cyan'));
149
- // Pull models
150
- console.log(colorize('\nPhase: Model Preparation', 'cyan'));
151
- console.log(colorize(''.repeat(50), 'cyan'));
152
- for (const model of models) {
153
- await pullModel(model);
154
- }
155
- // Benchmark models
156
- console.log(colorize('\nPhase: Performance Testing', 'cyan'));
157
- console.log(colorize('─'.repeat(50), 'cyan'));
594
+ const client = new Ollama(opts.host ? { host: opts.host } : undefined);
595
+ const serverVersion = await ensureServer(client);
596
+ if (!opts.json) {
597
+ console.log(colorize('ollama-bench', 'cyan') + colorize(` · server v${serverVersion}`, 'gray'));
598
+ console.log(colorize('═'.repeat(52), 'cyan'));
599
+ console.log(colorize('\nPreparing models', 'cyan'));
600
+ console.log(colorize('─'.repeat(52), 'gray'));
601
+ }
602
+ for (const model of opts.models) {
603
+ await pullModel(client, model);
604
+ }
605
+ if (!opts.json) {
606
+ console.log(colorize('\nBenchmarking', 'cyan'));
607
+ console.log(colorize('─'.repeat(52), 'gray'));
608
+ }
158
609
  const results = [];
159
- for (const model of models) {
160
- const result = await benchmarkModel(model);
610
+ for (const model of opts.models) {
611
+ const result = await benchmarkModel(client, model, opts);
161
612
  results.push(result);
613
+ if (!opts.json)
614
+ renderResult(result);
162
615
  }
163
- // Find the best performing model
164
- const bestModel = results.reduce((best, current) => current.tokensPerSecond > best.tokensPerSecond ? current : best);
165
- console.log(colorize('Final Results', 'magenta'));
166
- console.log(colorize('═'.repeat(50), 'magenta'));
167
- console.log(colorize(`Best performing model: ${bestModel.model}`, 'magenta'));
168
- console.log(colorize(`Generation speed: ${bestModel.tokensPerSecond.toFixed(2)} tokens/s`, 'magenta'));
169
- console.log(colorize(`Total time: ${bestModel.totalTime.toFixed(2)}s`, 'magenta'));
616
+ if (opts.json) {
617
+ console.log(JSON.stringify({ server: serverVersion, prompt: opts.prompt, results }, null, 2));
618
+ }
619
+ else if (results.filter((r) => r.ok).length > 1) {
620
+ renderTable(results);
621
+ }
622
+ // Non-zero exit if every model failed.
623
+ if (results.every((r) => !r.ok))
624
+ process.exit(1);
170
625
  }
171
626
  if (import.meta.url === import.meta.resolve(process.argv[1])) {
172
- main().catch(error => {
627
+ main().catch((error) => {
173
628
  console.error('Error:', error);
174
629
  process.exit(1);
175
630
  });
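Reviewer note: the tail of the new `main()` decides two things from the collected results: what to print (JSON, the ranking table, or nothing extra) and which exit code to use. The sketch below restates that decision as a pure function so it can be checked in isolation. `ResultLike`, `Outcome`, and `decideOutcome` are hypothetical names introduced here for illustration; they are not part of the package, and the real result objects carry many more fields.

```typescript
// Hypothetical, trimmed-down result shape (assumption: only `ok` matters here).
interface ResultLike {
  model: string;
  ok: boolean;
}

type OutputMode = 'json' | 'table' | 'none';

interface Outcome {
  output: OutputMode;
  exitCode: 0 | 1;
}

// Restates the branching at the end of main() without any I/O.
function decideOutcome(results: ResultLike[], json: boolean): Outcome {
  const succeeded = results.filter((r) => r.ok).length;

  // --json always wins; otherwise the table only appears when there are
  // at least two successful results to compare.
  const output: OutputMode = json ? 'json' : succeeded > 1 ? 'table' : 'none';

  // Non-zero exit only if every model failed.
  const exitCode = results.every((r) => !r.ok) ? 1 : 0;

  return { output, exitCode };
}

const sample: ResultLike[] = [
  { model: 'qwen3:0.6b', ok: true },
  { model: 'llama3.2:1b', ok: false },
];
console.log(decideOutcome(sample, false));
```

With one success and one failure in plain-text mode, no table is rendered and the process still exits with 0, which matches the diff: a partial failure is not treated as an error.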
package/package.json CHANGED
@@ -1,18 +1,23 @@
1
1
  {
2
2
  "name": "ollama-bench",
3
- "version": "1.1.0",
4
- "description": "Minimal CLI tool to benchmark Ollama models with detailed phase analysis. Zero runtime dependencies.",
3
+ "version": "1.2.0",
4
+ "description": "Minimal CLI tool to benchmark Ollama models: phase analysis, TTFT, reasoning/thinking measurement, and side-by-side ranking.",
5
5
  "main": "dist/index.js",
6
6
  "type": "module",
7
7
  "bin": {
8
8
  "ollama-bench": "./dist/index.js"
9
9
  },
10
+ "files": [
11
+ "dist",
12
+ "README.md",
13
+ "LICENSE"
14
+ ],
10
15
  "scripts": {
11
16
  "build": "tsc",
12
17
  "start": "node dist/index.js",
13
18
  "dev": "tsc && node dist/index.js"
14
19
  },
15
- "keywords": ["ollama", "benchmark", "ai", "models", "cli", "performance", "llm", "testing"],
20
+ "keywords": ["ollama", "benchmark", "ai", "models", "cli", "performance", "llm", "testing", "ttft", "reasoning", "tokens-per-second"],
16
21
  "author": "dalist1",
17
22
  "license": "MIT",
18
23
  "repository": {
@@ -24,14 +29,14 @@
24
29
  },
25
30
  "homepage": "https://github.com/dalist1/ollama-bench#readme",
26
31
  "dependencies": {
27
- "ollama": "latest"
32
+ "ollama": "^0.6.3"
28
33
  },
29
34
  "devDependencies": {
30
- "@types/node": "^20.19.25",
35
+ "@types/node": "^20.19.41",
31
36
  "typescript": "^5.9.3"
32
37
  },
33
38
  "engines": {
34
- "node": ">=14.0.0"
39
+ "node": ">=18.0.0"
35
40
  },
36
41
  "publishConfig": {
37
42
  "access": "public"
package/.idx/dev.nix DELETED
@@ -1,21 +0,0 @@
1
- { pkgs }: {
2
- channel = "unstable";
3
- packages =
4
- let
5
- bunLatest = builtins.fetchurl {
6
- url = "https://github.com/oven-sh/bun/releases/download/canary/bun-linux-x64.zip";
7
- };
8
- in
9
- [
10
- pkgs.nodejs_23
11
- (pkgs.bun.overrideAttrs (oldAttrs: {
12
- version = "canary";
13
- src = bunLatest;
14
- }))
15
- ];
16
- idx.extensions = [
17
- "biomejs.biome"
18
- "BeardedBear.beardedicons"
19
- "BeardedBear.beardedtheme"
20
- ];
21
- }
package/bun.lock DELETED
@@ -1,27 +0,0 @@
1
- {
2
- "lockfileVersion": 1,
3
- "configVersion": 1,
4
- "workspaces": {
5
- "": {
6
- "name": "ollama-bench",
7
- "dependencies": {
8
- "ollama": "latest",
9
- },
10
- "devDependencies": {
11
- "@types/node": "^20.19.25",
12
- "typescript": "^5.9.3",
13
- },
14
- },
15
- },
16
- "packages": {
17
- "@types/node": ["@types/node@20.19.25", "", { "dependencies": { "undici-types": "~6.21.0" } }, "sha512-ZsJzA5thDQMSQO788d7IocwwQbI8B5OPzmqNvpf3NY/+MHDAS759Wo0gd2WQeXYt5AAAQjzcrTVC6SKCuYgoCQ=="],
18
-
19
- "ollama": ["ollama@0.6.3", "", { "dependencies": { "whatwg-fetch": "^3.6.20" } }, "sha512-KEWEhIqE5wtfzEIZbDCLH51VFZ6Z3ZSa6sIOg/E/tBV8S51flyqBOXi+bRxlOYKDf8i327zG9eSTb8IJxvm3Zg=="],
20
-
21
- "typescript": ["typescript@5.9.3", "", { "bin": { "tsc": "bin/tsc", "tsserver": "bin/tsserver" } }, "sha512-jl1vZzPDinLr9eUt3J/t7V6FgNEw9QjvBPdysz9KfQDD41fQrC2Y4vKQdiaUpFT4bXlb1RHhLpp8wtm6M5TgSw=="],
22
-
23
- "undici-types": ["undici-types@6.21.0", "", {}, "sha512-iwDZqg0QAGrg9Rav5H4n0M64c3mkR59cJ6wQp+7C4nI0gsmExaedaYLNO44eT4AtBBwjbTiGPMlt2Md0T9H9JQ=="],
24
-
25
- "whatwg-fetch": ["whatwg-fetch@3.6.20", "", {}, "sha512-EqhiFU6daOA8kpjOWTL0olhVOF3i7OrFzSYiGsEMB8GcXS+RrzauAERX65xMeNWVqxA6HXH2m69Z9LaKKdisfg=="],
26
- }
27
- }
package/src/index.ts DELETED
@@ -1,217 +0,0 @@
1
- #!/usr/bin/env node
2
-
3
- import ollama from 'ollama';
4
-
5
- /**
6
- * Represents the available color codes for text coloring.
7
- */
8
- type Color = 'reset' | 'green' | 'yellow' | 'red' | 'cyan' | 'magenta' | 'blue';
9
-
10
- /**
11
- * Object containing ANSI color codes for text coloring.
12
- */
13
- const colors: Record<Color, string> = {
14
- reset: '\x1b[0m',
15
- green: '\x1b[32m',
16
- yellow: '\x1b[33m',
17
- red: '\x1b[31m',
18
- cyan: '\x1b[36m',
19
- magenta: '\x1b[35m',
20
- blue: '\x1b[34m',
21
- };
22
-
23
- /**
24
- * Applies color to the given text.
25
- * @param text - The text to colorize.
26
- * @param color - The color to apply.
27
- * @returns The colorized text.
28
- */
29
- function colorize(text: string, color: Color): string {
30
- return `${colors[color]}${text}${colors.reset}`;
31
- }
32
-
33
- /**
34
- * Creates a loading animation for the console.
35
- * @param operation - The operation being performed.
36
- * @param model - The model being processed.
37
- * @returns An interval ID for the animation.
38
- */
39
- function createLoadingAnimation(operation: string, model: string): NodeJS.Timeout {
40
- const frames: string[] = ['|', '/', '-', '\\'];
41
- let i = 0;
42
- let dots = 0;
43
- return setInterval(() => {
44
- const frame = frames[i];
45
- const dotString = '.'.repeat(dots);
46
- const operationText = colorize(`${operation} ${model}${dotString}`, 'blue');
47
- process.stdout.write(`\r${frame} ${operationText}`.padEnd(50));
48
- i = (i + 1) % frames.length;
49
- dots = (dots + 1) % 4;
50
- }, 100);
51
- }
52
-
53
- /**
54
- * Pulls a model from Ollama.
55
- * @param model - The name of the model to pull.
56
- */
57
- async function pullModel(model: string): Promise<void> {
58
- console.log(colorize(`Initiating pull for ${model}...`, 'yellow'));
59
- const loadingAnimation = createLoadingAnimation('Pulling', model);
60
- try {
61
- const start = performance.now();
62
- const response = await ollama.pull({ model, stream: true });
63
- for await (const part of response) {
64
- if (part.status === 'success') {
65
- clearInterval(loadingAnimation);
66
- const end = performance.now();
67
- const duration = (end - start) / 1000;
68
- console.log(`\r${colorize(`Successfully pulled ${model} in ${duration.toFixed(2)} seconds`, 'green')} `);
69
- return;
70
- }
71
- }
72
- } catch (error) {
73
- clearInterval(loadingAnimation);
74
- console.log(`\r${colorize(`Error pulling ${model}: ${(error as Error).message}`, 'red')} `);
75
- }
76
- }
77
-
78
- /**
79
- * Represents the result of a model benchmark.
80
- */
81
- interface BenchmarkResult {
82
- model: string;
83
- tokensPerSecond: number;
84
- loadTime: number;
85
- promptEvalTime: number;
86
- generationTime: number;
87
- totalTime: number;
88
- }
89
-
90
- /**
91
- * Benchmarks a model's performance.
92
- * @param model - The name of the model to benchmark.
93
- * @returns A promise that resolves to the benchmark result.
94
- */
95
- async function benchmarkModel(model: string): Promise<BenchmarkResult> {
96
- const prompt = "Explain the theory of relativity in simple terms.";
97
- console.log(colorize(`\nBenchmarking ${model}`, 'cyan'));
98
- console.log(colorize('─'.repeat(50), 'cyan'));
99
-
100
- const loadingAnimation = createLoadingAnimation('Running benchmark', model);
101
-
102
- try {
103
- const response = await ollama.generate({
104
- model,
105
- prompt,
106
- stream: false,
107
- });
108
-
109
- clearInterval(loadingAnimation);
110
- process.stdout.write('\r' + ' '.repeat(50) + '\r');
111
-
112
- // Calculate phase timings
113
- const loadTime = response.load_duration / 1e9;
114
- const promptEvalTime = response.prompt_eval_duration / 1e9;
115
- const generationTime = response.eval_duration / 1e9;
116
- const totalTime = response.total_duration / 1e9;
117
- const tokensPerSecond = response.eval_count / generationTime;
118
-
119
- // Calculate percentages
120
- const loadPercent = (loadTime / totalTime * 100).toFixed(1);
121
- const promptPercent = (promptEvalTime / totalTime * 100).toFixed(1);
122
- const genPercent = (generationTime / totalTime * 100).toFixed(1);
123
-
124
- // Display phases
125
- console.log(colorize('Phase 1: Model Loading (Loading weights into memory)', 'yellow'));
126
- console.log(colorize(` Time: ${loadTime.toFixed(2)}s (${loadPercent}% of total)`, 'yellow'));
127
- console.log();
128
-
129
- console.log(colorize('Phase 2: Prompt Processing (Encoding input)', 'yellow'));
130
- console.log(colorize(` Tokens: ${response.prompt_eval_count}`, 'yellow'));
131
- console.log(colorize(` Time: ${promptEvalTime.toFixed(2)}s (${promptPercent}% of total)`, 'yellow'));
132
- console.log(colorize(` Speed: ${(response.prompt_eval_count / promptEvalTime).toFixed(2)} tokens/s`, 'yellow'));
133
- console.log();
134
-
135
- console.log(colorize('Phase 3: Response Generation (Creating output)', 'yellow'));
136
- console.log(colorize(` Tokens: ${response.eval_count}`, 'yellow'));
137
- console.log(colorize(` Time: ${generationTime.toFixed(2)}s (${genPercent}% of total)`, 'yellow'));
138
- console.log(colorize(` Speed: ${tokensPerSecond.toFixed(2)} tokens/s`, 'yellow'));
139
- console.log();
140
-
141
- console.log(colorize('Summary', 'green'));
142
- console.log(colorize(` Total time: ${totalTime.toFixed(2)}s`, 'green'));
143
- console.log(colorize(` Generation speed: ${tokensPerSecond.toFixed(2)} tokens/s`, 'green'));
144
- console.log();
145
-
146
- return {
147
- model,
148
- tokensPerSecond,
149
- loadTime,
150
- promptEvalTime,
151
- generationTime,
152
- totalTime
153
- };
154
- } catch (error) {
155
- clearInterval(loadingAnimation);
156
- process.stdout.write('\r' + ' '.repeat(50) + '\r');
157
- console.log(colorize(`Error benchmarking ${model}: ${(error as Error).message}`, 'red'));
158
- console.log();
159
- return {
160
- model,
161
- tokensPerSecond: 0,
162
- loadTime: 0,
163
- promptEvalTime: 0,
164
- generationTime: 0,
165
- totalTime: 0
166
- };
167
- }
168
- }
169
-
170
- /**
171
- * The main function that orchestrates the model pulling and benchmarking process.
172
- */
173
- export async function main(): Promise<void> {
174
- const models = process.argv.slice(2);
175
-
176
- if (models.length === 0) {
177
- console.log(colorize(`Error: No models provided. Please specify at least one model.`, 'red'));
178
- process.exit(1);
179
- }
180
-
181
- console.log(colorize(`Ollama Benchmark Script`, 'cyan'));
182
- console.log(colorize('═'.repeat(50), 'cyan'));
183
-
184
- // Pull models
185
- console.log(colorize('\nPhase: Model Preparation', 'cyan'));
186
- console.log(colorize('─'.repeat(50), 'cyan'));
187
- for (const model of models) {
188
- await pullModel(model);
189
- }
190
-
191
- // Benchmark models
192
- console.log(colorize('\nPhase: Performance Testing', 'cyan'));
193
- console.log(colorize('─'.repeat(50), 'cyan'));
194
- const results: BenchmarkResult[] = [];
195
- for (const model of models) {
196
- const result = await benchmarkModel(model);
197
- results.push(result);
198
- }
199
-
200
- // Find the best performing model
201
- const bestModel = results.reduce((best, current) =>
202
- current.tokensPerSecond > best.tokensPerSecond ? current : best
203
- );
204
-
205
- console.log(colorize('Final Results', 'magenta'));
206
- console.log(colorize('═'.repeat(50), 'magenta'));
207
- console.log(colorize(`Best performing model: ${bestModel.model}`, 'magenta'));
208
- console.log(colorize(`Generation speed: ${bestModel.tokensPerSecond.toFixed(2)} tokens/s`, 'magenta'));
209
- console.log(colorize(`Total time: ${bestModel.totalTime.toFixed(2)}s`, 'magenta'));
210
- }
211
-
212
- if (import.meta.url === import.meta.resolve(process.argv[1])) {
213
- main().catch(error => {
214
- console.error('Error:', error);
215
- process.exit(1);
216
- });
217
- }
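Reviewer note: both the deleted source and the new build keep the entry-point guard `import.meta.url === import.meta.resolve(process.argv[1])`. When the CLI is launched through a symlinked bin shim (as package managers commonly create on POSIX systems), `process.argv[1]` may point at the symlink while `import.meta.url` points at the real file, so the comparison can come out false. A minimal sketch of a more defensive check follows; `isEntryPoint` is a hypothetical helper, not something the package ships, and it uses only Node's built-in `fs`, `path`, and `url` modules.

```typescript
import { realpathSync } from 'node:fs';
import { resolve } from 'node:path';
import { pathToFileURL } from 'node:url';

// Hypothetical helper: returns true when `moduleUrl` (normally import.meta.url)
// refers to the script that Node was asked to run (normally process.argv[1]).
function isEntryPoint(moduleUrl: string, argv1: string | undefined): boolean {
  if (!argv1) return false;

  let scriptPath = resolve(argv1);
  try {
    // Follow symlinks so a bin shim and its target compare as equal.
    scriptPath = realpathSync(scriptPath);
  } catch {
    // Path does not exist or is unreadable: compare the unresolved path instead.
  }

  // pathToFileURL handles platform-specific details such as Windows drive letters.
  return pathToFileURL(scriptPath).href === moduleUrl;
}

// Intended usage inside an ES module (assumption: main() is defined there):
//   if (isEntryPoint(import.meta.url, process.argv[1])) { main(); }
console.log(isEntryPoint('file:///not/a/real/module.js', process.argv[1]));
```

The helper takes both values as parameters rather than reading `import.meta` itself, which keeps it testable outside an ES module context.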
package/tsconfig.json DELETED
@@ -1,15 +0,0 @@
1
- {
2
- "compilerOptions": {
3
- "target": "ES2020",
4
- "module": "ES2020",
5
- "moduleResolution": "node",
6
- "outDir": "./dist",
7
- "rootDir": "./src",
8
- "strict": true,
9
- "esModuleInterop": true,
10
- "skipLibCheck": true,
11
- "forceConsistentCasingInFileNames": true
12
- },
13
- "include": ["src/**/*"],
14
- "exclude": ["node_modules", "dist"]
15
- }