@xynogen/pix-data 0.2.3 → 0.2.5

package/README.md CHANGED
@@ -1,42 +1,57 @@
1
1
  # pix-data
2
2
 
3
- Pi coding agent extension — shared model data layer. Fetches and caches the
4
- [modelgrep](https://modelgrep.com) model catalog to `~/.cache/pi/` on session
5
- start, so other extensions (model picker, footer, subagent resolver) can read
6
- context window, pricing, and a coding-focused score/rank synchronously without
7
- redundant network calls.
3
+ Pi coding agent extension — shared model data layer. Warms two cached
4
+ data sources on session start so other extensions (model picker, footer,
5
+ subagent resolver) can read context window, pricing, and a coding-focused
6
+ score/rank synchronously without redundant network calls:
8
7
 
9
- ## Data source
8
+ - **[modelgrep](https://modelgrep.com)** — the model catalog (context window,
9
+ pricing, modalities, capabilities, raw benchmark fields) used as the
10
+ authoritative source when present.
11
+ - **[benchlm.ai](https://benchlm.ai)** — a leaderboard of 0–100 coding scores
12
+ used as a fallback when modelgrep's `artificial_analysis` block is null
13
+ (currently the common case for the long tail of models).
10
14
 
11
- All data comes from a **single source**: [modelgrep.com](https://modelgrep.com)
12
- (`/api/v1/models?benchmarked=1&sort=coding`). Free, no API key, ~190 benchmarked
13
- models with real model ids. modelgrep aggregates benchmark numbers from
14
- [Artificial Analysis](https://artificialanalysis.ai).
15
+ Both caches live under `~/.cache/pi/` and are shared across every Pi
16
+ extension using the same `DataSource` class; whichever extension loads
17
+ first populates the cache; subsequent extensions read from disk.
15
18
 
16
- - **Context window + pricing** — taken verbatim from modelgrep.
17
- - **Score** — computed locally from the raw benchmark fields (see below).
18
- - **Rank** — the model's position once the whole catalog is sorted by that score
19
- (best = `#1`). Unscored models sink to the bottom.
19
+ ## Data sources
20
20
 
21
- Cached 24h → `~/.cache/pi/modelgrep.json`. On outage the stale cache keeps the
22
- picker working until it can refresh.
21
+ - **`modelgrep`**: `GET /api/v1/models?sort=coding&order=desc&limit=200`,
22
+ paginated up to 10 pages (`meta.has_more` / `next_offset`). Free, no API key.
23
+ modelgrep aggregates benchmark numbers from
24
+ [Artificial Analysis](https://artificialanalysis.ai). Context window, pricing,
25
+ and modalities are taken verbatim from the catalog.
26
+ - **`benchlm`** — `GET https://benchlm.ai/api/data/leaderboard`. Free, no API
27
+ key. Each entry has an `overallScore` (0–100) used as the fallback score
28
+ when modelgrep's `artificial_analysis` block is null.
29
+
30
+ Cache files:
31
+
32
+ - `~/.cache/pi/modelgrep.json` (TTL 24h)
33
+ - `~/.cache/pi/benchlm.json` (TTL 24h)
34
+
35
+ On outage the stale cache keeps the picker working until it can refresh.
23
36
 
24
37
  ## Scoring methodology
25
38
 
26
- **Primary score = [Artificial Analysis Intelligence Index](https://artificialanalysis.ai/methodology/intelligence-benchmarking)**
27
- when available — AA's authoritative composite of 9 independent evals (agents,
28
- coding, scientific reasoning, general), already weighted toward agentic work.
29
- It is rescaled to 0–100 (`intelligence / 65 × 100`; the current leader scores
30
- ~65).
31
-
32
- **Fallback = a coding-and-agentic heuristic** for the ~84% of models AA has not
33
- index-scored, computed from the raw benchmarks below, then mapped onto the index
34
- scale by a least-squares line. Both the heuristic weights *and* the line were
35
- jointly tuned against the index on the models that carry *both* it and the raw
36
- benches (`index100 ≈ 120.6·heuristic − 10.6`, deduped n=29, R²=0.901,
37
- leave-one-out RMSE 6.55pt): a data calibration, not a guessed penalty. The
38
- picker exists to choose a model *for coding work in an agent*, so the heuristic
39
- is weighted toward exactly that:
39
+ The score a model receives is the first of the following that succeeds, in
40
+ order:
41
+
42
+ 1. **Primary = [Artificial Analysis Intelligence Index](https://artificialanalysis.ai/methodology/intelligence-benchmarking)**
43
+ when present on the modelgrep entry — AA's authoritative composite of 9
44
+ independent evals (agents, coding, scientific reasoning, general), already
45
+ weighted toward agentic work. Rescaled to 0–100
46
+ (`intelligence / 65 × 100`; the current leader scores ~65).
47
+ 2. **Heuristic** from modelgrep's raw benchmark fields when the AA index is
48
+ absent. Weighted blend of the same family of evals AA uses, then mapped onto
49
+ the index scale by a least-squares line. Both the heuristic weights *and*
50
+ the line were jointly tuned against the index on the models that carry
51
+ *both* it and the raw benches (`index100 ≈ 120.6·heuristic − 10.6`, deduped
52
+ n=29, R²=0.901, leave-one-out RMSE 6.55pt) — a data calibration, not a
53
+ guessed penalty. The picker exists to choose a model *for coding work in an
54
+ agent*, so the heuristic is weighted toward exactly that:
40
55
 
41
56
  | bench | range | measures |
42
57
  |---|---|---|
@@ -59,6 +74,12 @@ heuristic = 0.30·coding_score + 0.60·agentic_score + 0.10·reasoning_score
59
74
  score = round(clamp₀₁₀₀(120.6·heuristic − 10.6)) // fitted to the index
60
75
  ```
61
76
 
77
+ 3. **benchlm.ai fallback** — if the model exists in benchlm but modelgrep has
78
+ no AA index and no raw benches, look up the benchlm `overallScore` (0–100)
79
+ and use it verbatim. Match strategy (in `lookupBenchlmScore`): exact
80
+ normalized slug, then prefix overlap either way, then take the
81
+ highest-scoring match on a tie.
82
+
62
83
  **Why a heuristic at all, and why these raw evals only:** the AA Intelligence
63
84
  Index *is* the ideal number — but only ~16% of the catalog has it. For the rest
64
85
  we rebuild a comparable score from the same family of raw evals. Crucially we
@@ -89,13 +110,15 @@ place if your priorities differ.
89
110
 
90
111
  | Export | Description |
91
112
  |---|---|
92
- | `modelgrep` | `DataSource<ModelGrepModel[]>` — the catalog. TTL 24h → `~/.cache/pi/modelgrep.json` |
113
+ | `modelgrep` | `DataSource<ModelGrepModel[]>` — the modelgrep catalog. TTL 24h → `~/.cache/pi/modelgrep.json` |
114
+ | `benchlm` | `DataSource<BenchLMRawEntry[]>` — the benchlm.ai leaderboard (fallback scores). TTL 24h → `~/.cache/pi/benchlm.json` |
93
115
  | `DataSource` | Generic cached data source class |
94
116
  | `CACHE_DIR` | Resolved cache directory (`~/.cache/pi`) |
95
117
  | `buildModelsDevIndex` | Build a lookup `Map` from the catalog (context/cost/modalities) |
96
118
  | `lookupInIndex` | Fuzzy-match a router model id against an index |
97
- | `lookupModelsDev` | Sync lookup by provider + id from in-memory cache |
119
+ | `lookupModelsDev` | Sync lookup by id from in-memory cache (joined on slug) |
98
120
  | `lookupBenchmark` | Sync lookup a model by id — returns score + rank + pricing |
121
+ | `benchScoreColor` | Map a 0–100 score to a `success`/`warning`/`error`/`muted` token |
99
122
 
100
123
  ## Install
101
124
 
@@ -105,10 +128,11 @@ pi install npm:@xynogen/pix-data
105
128
 
106
129
  ## How it works
107
130
 
108
- On session start the extension fires a background fetch (`modelgrep.get()`),
109
- paginating the API until the full benchmarked catalog is retrieved. If the cache
110
- is fresh the fetch is skipped. The cache file lives in `~/.cache/pi/`; any Pi
111
- extension using the same `DataSource` shares it automatically.
131
+ On session start the extension fires two non-blocking fetches in parallel
132
+ (`modelgrep.get()` and `benchlm.get()`); Pi session start is not gated on
133
+ either. If the cache is fresh both fetches are skipped. The cache files live
134
+ in `~/.cache/pi/` — any Pi extension using the same `DataSource` shares them
135
+ automatically.
112
136
 
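The caching behaviour described in "How it works" (24h TTL, fetch skipped when the cache is fresh, stale data served on outage) could look roughly like the sketch below. It is not the package's `DataSource` implementation: `CacheEntry`, `cacheState`, and `resolveData` are hypothetical names, and the on-disk format is assumed.

```typescript
// Hypothetical shape of one cache file's contents (assumed, not the real format).
interface CacheEntry<T> {
  fetchedAt: number; // epoch milliseconds when the payload was written
  data: T;
}

// The 24h TTL stated in the README.
const TTL_MS = 24 * 60 * 60 * 1000;

type CacheState = "fresh" | "stale" | "missing";

// Classify a cache entry relative to the current time.
function cacheState<T>(
  entry: CacheEntry<T> | null,
  nowMs: number,
  ttlMs: number = TTL_MS,
): CacheState {
  if (entry === null) return "missing";
  return nowMs - entry.fetchedAt < ttlMs ? "fresh" : "stale";
}

// Decide what consumers should read after a refresh attempt.
// `fetched` is null when the network call failed or was skipped.
function resolveData<T>(
  entry: CacheEntry<T> | null,
  fetched: T | null,
  nowMs: number,
): T | null {
  // A fresh cache wins outright; no fetch is needed.
  if (entry !== null && cacheState(entry, nowMs) === "fresh") return entry.data;
  // Otherwise prefer the newly fetched payload.
  if (fetched !== null) return fetched;
  // On outage, fall back to the stale cache if there is one.
  return entry !== null ? entry.data : null;
}
```

Keeping the freshness check separate from the fetch makes the outage path easy to reason about: a failed fetch simply leaves `fetched` as `null` and the stale entry is returned.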
113
137
  ## Full distro
114
138
 
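The three-step scoring order from the README diff above (AA index, then fitted heuristic, then benchlm `overallScore`) can be sketched as follows. The constants come from the README; `ScoreInputs` and `resolveScore` are hypothetical names, the heuristic is assumed to be a 0 to 1 blend, and rounding and clamping the primary score is an assumption rather than something the README states.

```typescript
// Hypothetical input shape holding only the fields this sketch needs.
interface ScoreInputs {
  aaIntelligence: number | null; // AA Intelligence Index (current leader is about 65)
  heuristic: number | null;      // assumed 0..1: 0.30*coding + 0.60*agentic + 0.10*reasoning
  benchlmOverall: number | null; // benchlm overallScore, already on a 0..100 scale
}

const clamp0to100 = (x: number): number => Math.min(100, Math.max(0, x));

// Return the first score that is available, in the README's stated order.
function resolveScore(m: ScoreInputs): number | null {
  // 1. Primary: rescale the AA index to 0..100 (rounding/clamping is assumed here).
  if (m.aaIntelligence !== null) {
    return Math.round(clamp0to100((m.aaIntelligence / 65) * 100));
  }
  // 2. Heuristic: map onto the index scale with the fitted line from the README.
  if (m.heuristic !== null) {
    return Math.round(clamp0to100(120.6 * m.heuristic - 10.6));
  }
  // 3. Fallback: benchlm overallScore used verbatim (may still be null).
  return m.benchlmOverall;
}
```

Under these assumptions a model with an AA index of 32.5 scores 50, and a model with no AA index and a heuristic of 0.5 also lands at 50 (120.6 × 0.5 − 10.6 = 49.7, rounded).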
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@xynogen/pix-data",
3
- "version": "0.2.3",
3
+ "version": "0.2.5",
4
4
  "description": "Pi extension — shared model data layer (models.dev + BenchLM), cached at ~/.cache/pi",
5
5
  "type": "module",
6
6
  "main": "src/index.ts",
package/src/data.ts CHANGED
@@ -1,9 +1,11 @@
1
1
  /**
2
2
  * data.ts — shared Pi model data layer
3
3
  *
4
- * Single source of truth, sourced from modelgrep (coding-sorted), cached at
5
- * ~/.cache/pi/modelgrep.json (TTL 24h). Provides context, cost, modalities,
6
- * capabilities, coding-percentile score, and rank.
4
+ * Two data sources, each its own cached DataSource:
5
+ * - modelgrep (coding-sorted catalog) — ~/.cache/pi/modelgrep.json (TTL 24h):
6
+ * context, cost, modalities, capabilities, coding-percentile score, rank.
7
+ * - BenchLM — ~/.cache/pi/benchlm.json: fallback overall score when modelgrep
8
+ * has no benchmark for a model (see lookupBenchmark).
7
9
  *
8
10
  * Cache files are shared across all Pi extensions — whichever extension loads
9
11
  * first populates the cache; subsequent extensions read from disk.
@@ -549,7 +551,7 @@ function lookupBenchlmScore(
549
551
 
550
552
  // Best entry = highest overallScore. Sort by score desc, then by slug
551
553
  // length asc (prefer base name over suffix variants on a tie).
552
- const best = [...candidates].sort((a, b) => {
554
+ const sorted = [...candidates].sort((a, b) => {
553
555
  const sa = a.overallScore ?? -Infinity;
554
556
  const sb = b.overallScore ?? -Infinity;
555
557
  if (sa !== sb) return sb - sa;
@@ -557,7 +559,9 @@ function lookupBenchlmScore(
557
559
  normalizeBenchlmName(a.model).length -
558
560
  normalizeBenchlmName(b.model).length
559
561
  );
560
- })[0];
562
+ });
563
+ const best = sorted[0];
564
+ if (!best) return null;
561
565
  return best.overallScore ?? null;
562
566
  }
563
567
 
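The change in this hunk splits the sort from the index access so that an empty candidate list returns `null` instead of dereferencing `undefined`. A self-contained sketch of the same tie-break logic follows; `Entry`, `normalize`, and `pickBest` are hypothetical stand-ins (the real `BenchLMRawEntry` and `normalizeBenchlmName` are not shown in full in the diff).

```typescript
// Hypothetical minimal entry type; the real BenchLMRawEntry likely has more fields.
interface Entry {
  model: string;
  overallScore: number | null;
}

// Stand-in for normalizeBenchlmName; the real normalization rules are assumed.
const normalize = (name: string): string =>
  name.toLowerCase().replace(/[^a-z0-9]+/g, "-");

// Pick the best candidate: highest score first, shorter slug first on a tie.
function pickBest(candidates: Entry[]): Entry | null {
  const sorted = [...candidates].sort((a, b) => {
    const sa = a.overallScore ?? -Infinity;
    const sb = b.overallScore ?? -Infinity;
    if (sa !== sb) return sb - sa; // score descending
    return normalize(a.model).length - normalize(b.model).length; // slug length ascending
  });
  const best = sorted[0];
  if (!best) return null; // guard: no candidates at all
  return best;
}

// Usage: the score lookup then reduces to one expression.
const score = pickBest([
  { model: "foo-preview", overallScore: 80 },
  { model: "foo", overallScore: 80 },
])?.overallScore ?? null;
```

Copying the array before sorting leaves the caller's candidate list untouched, which matches the `[...candidates]` spread in the diff.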
package/src/index.ts CHANGED
@@ -4,7 +4,8 @@
4
4
  * Warms the shared model data cache on session start so other extensions
5
5
  * (pix-9router, models picker, footer) can read from ~/.cache/pi/* synchronously.
6
6
  *
7
- * Single non-blocking fetch — Pi session starts immediately.
7
+ * Two non-blocking fetches (modelgrep catalog + BenchLM scores) — Pi session
8
+ * starts immediately; consumers read whichever cache file they need.
8
9
  */
9
10
 
10
11
  import type { ExtensionAPI } from "@earendil-works/pi-coding-agent";
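
The header comment in this hunk describes two non-blocking fetches that do not delay session start. One way to express that fire-and-forget pattern is sketched below; `Warmable` and `warmInBackground` are hypothetical names, and the only thing assumed about the real sources is that `get()` returns a promise.

```typescript
// Hypothetical minimal interface: anything with an async get() can be warmed.
interface Warmable {
  name: string;
  get(): Promise<unknown>;
}

// Start every fetch without awaiting it and return how many were started.
// The caller (e.g. a session-start hook) continues immediately.
function warmInBackground(
  sources: Warmable[],
  onError: (name: string, err: unknown) => void = () => {},
): number {
  let started = 0;
  for (const source of sources) {
    // Not awaited on purpose; the catch handler keeps a failed fetch
    // from surfacing as an unhandled promise rejection.
    void source.get().catch((err: unknown) => onError(source.name, err));
    started += 1;
  }
  return started;
}
```

Because each promise carries its own `catch`, one source being offline does not affect the other, and consumers simply read whichever cache file has been populated.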