benchforge 0.1.2 → 0.1.3

package/README.md CHANGED
@@ -151,18 +151,17 @@ The `--profile` flag executes exactly one iteration with no warmup, making it id
  Results are displayed in a formatted table:
 
  ```
- ╔═════════════════╤════════════════════════════════════════════╤═══════╤═════════╗
- ║                 │                    time                    │       │         ║
- ║ name            │ mean  Δ% CI                     p50   p99  │ conv% │  runs   ║
- ╟─────────────────┼────────────────────────────────────────────┼───────┼─────────╢
- ║ quicksort       │ 0.17   +5.5%  [+4.7%, +6.2%]    0.15  0.63 │ 100%  │  1,134  ║
- ║ insertion sort  │ 0.24  +25.9%  [+25.3%, +27.4%]  0.18  0.36 │ 100%  │    807  ║
- ║ --> native sort │ 0.16                            0.15  0.41 │ 100%  │  1,210  ║
- ╚═════════════════╧════════════════════════════════════════════╧═══════╧═════════╝
+ ╔═════════════════╤════════════════════════════════════════════╤═════════╗
+ ║                 │                    time                    │         ║
+ ║ name            │ mean  Δ% CI                     p50   p99  │  runs   ║
+ ╟─────────────────┼────────────────────────────────────────────┼─────────╢
+ ║ quicksort       │ 0.17   +5.5%  [+4.7%, +6.2%]    0.15  0.63 │  1,134  ║
+ ║ insertion sort  │ 0.24  +25.9%  [+25.3%, +27.4%]  0.18  0.36 │    807  ║
+ ║ --> native sort │ 0.16                            0.15  0.41 │  1,210  ║
+ ╚═════════════════╧════════════════════════════════════════════╧═════════╝
  ```
 
  - **Δ% CI**: Percentage difference from baseline with bootstrap confidence interval
- - **conv%**: Convergence percentage (100% = stable measurements)
 
  ### HTML
 
@@ -284,136 +283,23 @@ V8's sampling profiler uses Poisson-distributed sampling. When an allocation occ
  - Node.js 22.6+ (for native TypeScript support)
  - Use `--expose-gc --allow-natives-syntax` flags for garbage collection monitoring and V8 native functions
 
- ## Adaptive Mode
+ ## Adaptive Mode (Experimental)
 
- Adaptive mode automatically adjusts the number of benchmark iterations until measurements stabilize, providing statistically significant results without excessive runtime.
+ Adaptive mode (`--adaptive`) automatically adjusts iteration count until measurements stabilize. The algorithm is still being tuned — use `--help` for available options.
 
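One common way a harness decides that measurements have stabilized (as described in the removed "How It Works" section below: comparing recent samples against a previous window for median drift) can be sketched in TypeScript. The function name, window size, and 5% threshold here are illustrative assumptions, not benchforge's actual implementation:

```typescript
// Hypothetical sketch of a window-based stability check: keep sampling
// until the median of the most recent window of timings stops drifting
// relative to the previous window.

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function hasConverged(
  samplesMs: number[],
  windowSize = 50,
  maxDriftPct = 5,
): boolean {
  // Need at least two full windows before stability can be judged.
  if (samplesMs.length < windowSize * 2) return false;
  const recent = samplesMs.slice(-windowSize);
  const previous = samplesMs.slice(-2 * windowSize, -windowSize);
  const drift =
    (Math.abs(median(recent) - median(previous)) / median(previous)) * 100;
  return drift < maxDriftPct;
}
```

A real implementation would also cap total runtime and track outlier impact alongside median drift.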
- ### Using Adaptive Mode
+ ## Interpreting Results
 
- ```bash
- # Enable adaptive benchmarking with default settings
- simple-cli.ts --adaptive
-
- # Customize time limits
- simple-cli.ts --adaptive --time 60 --min-time 5
-
- # Combine with other options
- simple-cli.ts --adaptive --filter "quicksort"
- ```
-
- ### CLI Options for Adaptive Mode
-
- - `--adaptive` - Enable adaptive sampling mode
- - `--min-time <seconds>` - Minimum time before convergence can stop (default: 1s)
- - `--convergence <percent>` - Confidence threshold 0-100 (default: 95)
- - `--time <seconds>` - Maximum time limit (default: 20s in adaptive mode)
-
- ### How It Works
-
- 1. **Initial Sampling**: Collects initial batch of ~100 samples (includes warmup)
- 2. **Window Comparison**: Compares recent samples against previous window
- 3. **Stability Detection**: Checks median drift and outlier impact between windows
- 4. **Convergence**: Stops when both metrics are stable (<5% drift) or reaches threshold
-
- Progress is shown during execution:
- ```
- ◊ quicksort: 75% confident (2.1s)
- ```
-
- ### Output with Adaptive Mode
-
- ```
- ╔═════════════════╤════════════════════════════════════════════╤═══════╤═════════╤══════╗
- ║                 │                    time                    │       │         │      ║
- ║ name            │ median Δ% CI                    mean  p99  │ conv% │  runs   │ time ║
- ╟─────────────────┼────────────────────────────────────────────┼───────┼─────────┼──────╢
- ║ quicksort       │ 0.17  +17.3%  [+15.4%, +20.0%]  0.20  0.65 │ 100%  │    526  │ 0.0s ║
- ║ insertion sort  │ 0.18  +24.2%  [+23.9%, +24.6%]  0.19  0.36 │ 100%  │    529  │ 0.0s ║
- ║ --> native sort │ 0.15                            0.15  0.25 │ 100%  │    647  │ 0.0s ║
- ╚═════════════════╧════════════════════════════════════════════╧═══════╧═════════╧══════╝
- ```
-
- - **conv%**: Convergence percentage (100% = stable measurements)
- - **time**: Total sampling duration for that benchmark
-
- ## Statistical Considerations: Mean vs Median
-
- ### When to Use Mean with Confidence Intervals
-
- **Best for:**
- - **Normally distributed data** - When benchmark times follow a bell curve
- - **Statistical comparison** - Comparing performance between implementations
- - **Throughput analysis** - Understanding average system performance
- - **Resource planning** - Estimating typical resource usage
-
- **Advantages:**
- - Provides confidence intervals for statistical significance
- - Captures the full distribution including outliers
- - Better for detecting small but consistent performance differences
- - Standard in academic performance research
-
- **Example use cases:**
- - Comparing algorithm implementations
- - Measuring API response times under normal load
- - Evaluating compiler optimizations
- - Benchmarking pure computational functions
-
- ### When to Use Median (p50)
-
- **Best for:**
- - **Skewed distributions** - When outliers are common
- - **Latency-sensitive applications** - Where typical user experience matters
- - **Noisy environments** - Systems with unpredictable interference
- - **Service Level Agreements** - "50% of requests complete within X ms"
-
- **Advantages:**
- - Robust to outliers and system noise
- - Better represents "typical" performance
- - More stable in virtualized/cloud environments
- - Less affected by GC pauses and OS scheduling
-
- **Example use cases:**
- - Web server response times
- - Database query performance
- - UI responsiveness metrics
- - Real-time system benchmarks
-
- ### Interpreting Results
-
- #### Baseline Comparison (Δ% CI)
+ ### Baseline Comparison (Δ% CI)
  ```
  0.17 +5.5% [+4.7%, +6.2%]
  ```
- This shows the benchmark is 5.5% slower than baseline, with a bootstrap confidence interval of [+4.7%, +6.2%]. Use this for comparing implementations.
+ The benchmark is 5.5% slower than baseline, with a bootstrap confidence interval of [+4.7%, +6.2%].
 
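The resampling idea behind a bootstrap confidence interval like the one above can be sketched in TypeScript. The function name and the seeded RNG are assumptions made for reproducibility; benchforge's real procedure (permutation testing with bootstrap CIs) is more involved:

```typescript
// Illustrative bootstrap CI for the Δ% statistic: resample both timing
// sets with replacement many times, compute Δ% for each resample, and
// take percentiles of the resulting deltas.

function bootstrapDeltaCI(
  candidateMs: number[],
  baselineMs: number[],
  iters = 2000,
): [number, number] {
  // Small deterministic LCG so repeated runs give the same interval.
  let seed = 42;
  const rand = () => (seed = (seed * 1664525 + 1013904223) >>> 0) / 2 ** 32;
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const resample = (xs: number[]) =>
    xs.map(() => xs[Math.floor(rand() * xs.length)]);

  const deltas: number[] = [];
  for (let i = 0; i < iters; i++) {
    const base = mean(resample(baselineMs));
    deltas.push(((mean(resample(candidateMs)) - base) / base) * 100);
  }
  deltas.sort((a, b) => a - b);
  // 95% interval: the 2.5th and 97.5th percentiles of the bootstrap deltas.
  return [deltas[Math.floor(iters * 0.025)], deltas[Math.floor(iters * 0.975)]];
}
```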
- #### Percentiles
+ ### Percentiles
  ```
  p50: 0.15ms, p99: 0.27ms
  ```
- This shows that 50% of runs completed in ≤0.15ms and 99% in ≤0.27ms. Use this when you care about consistency and tail latencies.
-
- ### Practical Guidelines
-
- 1. **Use adaptive mode when:**
-    - You want automatic convergence detection
-    - Benchmarks have varying execution times
-    - You need stable measurements without guessing iteration counts
-
- 2. **Use fixed iterations when:**
-    - Comparing across runs/machines (reproducibility)
-    - You know roughly how many samples you need
-    - Running in CI pipelines with time constraints
-
- 3. **Interpreting conv%:**
-    - 100% = measurements are stable
-    - <100% = still converging or high variance
-    - Red color indicates low confidence
-
- ### Statistical Notes
-
- - **Bootstrap CI**: Baseline comparison uses permutation testing with bootstrap confidence intervals
- - **Window Stability**: Adaptive mode compares sliding windows for median drift and outlier impact
- - **Independence**: Assumes benchmark iterations are independent (use `--worker` flag for better isolation)
+ 50% of runs completed in ≤0.15ms and 99% in ≤0.27ms. Use percentiles when you care about consistency and tail latencies.
 
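The p50/p99 and Δ% figures used throughout these docs follow standard definitions, which can be sketched in TypeScript. The helper names are hypothetical, not benchforge's API; the percentile uses the common nearest-rank method:

```typescript
// Minimal sketch of the statistics shown in the results tables.

function percentile(samplesMs: number[], p: number): number {
  // Nearest-rank percentile: the smallest sample such that at least p%
  // of samples are less than or equal to it.
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function deltaPercent(candidateMs: number[], baselineMs: number[]): number {
  // Positive means the candidate's mean is slower than the baseline's.
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  return ((mean(candidateMs) - mean(baselineMs)) / mean(baselineMs)) * 100;
}

// Example with hypothetical timings: p50 of the samples, and a delta
// against a baseline run.
const times = [0.14, 0.15, 0.15, 0.16, 0.27];
console.log(percentile(times, 50)); // 0.15
console.log(deltaPercent([0.17, 0.17], [0.16, 0.16])); // ~ +6.25
```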
  ## Understanding GC Time Measurements