PyPI - voting-mcp - Versions diffs - 0.1.0__tar.gz - Mend

voting-mcp 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (40)

voting_mcp-0.1.0/.env.example +10 -0
voting_mcp-0.1.0/.github/workflows/ci.yml +29 -0
voting_mcp-0.1.0/.gitignore +20 -0
voting_mcp-0.1.0/CLAUDE.md +155 -0
voting_mcp-0.1.0/PKG-INFO +143 -0
voting_mcp-0.1.0/README.md +120 -0
voting_mcp-0.1.0/RESULTS.md +91 -0
voting_mcp-0.1.0/bench/__init__.py +1 -0
voting_mcp-0.1.0/bench/compare.py +153 -0
voting_mcp-0.1.0/bench/config/models.yaml +29 -0
voting_mcp-0.1.0/bench/ensemble.py +84 -0
voting_mcp-0.1.0/bench/fetch_arc.py +72 -0
voting_mcp-0.1.0/bench/fetch_mmlu_pro.py +82 -0
voting_mcp-0.1.0/bench/metrics.py +47 -0
voting_mcp-0.1.0/bench/prompts.py +75 -0
voting_mcp-0.1.0/bench/run_ensemble.py +232 -0
voting_mcp-0.1.0/docs/accuracy_arc.png +0 -0
voting_mcp-0.1.0/docs/accuracy_mmlu_pro.png +0 -0
voting_mcp-0.1.0/pyproject.toml +66 -0
voting_mcp-0.1.0/src/voting_mcp/__init__.py +3 -0
voting_mcp-0.1.0/src/voting_mcp/aggregate.py +54 -0
voting_mcp-0.1.0/src/voting_mcp/rules/__init__.py +23 -0
voting_mcp-0.1.0/src/voting_mcp/rules/_common.py +69 -0
voting_mcp-0.1.0/src/voting_mcp/rules/approval.py +19 -0
voting_mcp-0.1.0/src/voting_mcp/rules/borda.py +35 -0
voting_mcp-0.1.0/src/voting_mcp/rules/condorcet.py +48 -0
voting_mcp-0.1.0/src/voting_mcp/rules/copeland.py +29 -0
voting_mcp-0.1.0/src/voting_mcp/rules/majority.py +49 -0
voting_mcp-0.1.0/src/voting_mcp/rules/opinion_pool.py +31 -0
voting_mcp-0.1.0/src/voting_mcp/rules/plurality.py +23 -0
voting_mcp-0.1.0/src/voting_mcp/rules/stv.py +74 -0
voting_mcp-0.1.0/src/voting_mcp/scoring.py +64 -0
voting_mcp-0.1.0/src/voting_mcp/server.py +164 -0
voting_mcp-0.1.0/src/voting_mcp/types.py +145 -0
voting_mcp-0.1.0/tests/test_bench.py +149 -0
voting_mcp-0.1.0/tests/test_rules.py +416 -0
voting_mcp-0.1.0/tests/test_scoring.py +67 -0
voting_mcp-0.1.0/tests/test_server.py +82 -0
voting_mcp-0.1.0/tests/test_types.py +153 -0
voting_mcp-0.1.0/uv.lock +1559 -0

voting_mcp-0.1.0/.env.example ADDED

@@ -0,0 +1,10 @@
+# Bench harness only — the MCP server itself reads NO secrets.
+# Routing default: OpenRouter single gateway (one key, every member is a model_id).
+OPENROUTER_API_KEY=
+# Per-provider fallback keys (only if you switch models.yaml off the gateway).
+# OPENAI_API_KEY=
+# GOOGLE_API_KEY=
+# DEEPSEEK_API_KEY=
+# ANTHROPIC_API_KEY=
+# ZHIPU_API_KEY=

voting_mcp-0.1.0/.github/workflows/ci.yml ADDED

@@ -0,0 +1,29 @@
+name: ci
+on:
+  push:
+    branches: [main]
+  pull_request:
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          python-version: "3.12"
+      - name: Sync (core + dev)
+        run: uv sync
+      - name: Lint
+        run: uv run ruff check .
+      - name: Type-check (strict)
+        run: uv run mypy --strict src
+      - name: Test
+        run: uv run pytest -q

voting_mcp-0.1.0/.gitignore ADDED

@@ -0,0 +1,20 @@
+# Python
+__pycache__/
+*.py[cod]
+.venv/
+*.egg-info/
+dist/
+build/
+# Tooling caches
+.pytest_cache/
+.mypy_cache/
+.ruff_cache/
+# Secrets
+.env
+# Bench artifacts: caches, downloaded data, and generated tables/plots stay local.
+# (The raw cache is a cost guardrail; commit it deliberately if you want reuse.)
+bench/results/
+bench/datasets/*.jsonl

voting_mcp-0.1.0/CLAUDE.md ADDED

@@ -0,0 +1,155 @@
+# CLAUDE.md — voting-mcp
+## What this is
+An MCP server that exposes **principled social-choice aggregation rules** (Borda, Copeland,
+Condorcet, approval, STV, linear opinion pool) as callable tools — plus a benchmark harness
+that **proves** these rules beat naive majority vote when aggregating a diverse ensemble of
+LLM agents on a reasoning benchmark.
+The differentiator is NOT "I can count votes." Almost every multi-agent system hand-rolls
+`Counter(votes).most_common(1)`, which throws away preference intensity and confidence.
+This server ships the rules with their known axiomatic properties AND a measured accuracy
+delta over majority vote. The number is the point.
+Target end state: published to PyPI + the MCP registry, README carrying the eval table.
+## Goals / non-goals
+- GOAL: correct, exhaustively-tested rules; a clean FastMCP server over stdio; a reproducible
+  benchmark emitting an accuracy table + plot with confidence intervals.
+- NON-GOAL (this weekend): web UI, remote HTTP/OAuth transport, exact Kemeny/Schulze,
+  learned agent weights. Leave hooks, don't build them.
+## Tech stack (pin these)
+- Python 3.12, managed with `uv`
+- `mcp[cli]` (FastMCP) for the server
+- `pydantic` v2 for ballot/profile/result schemas
+- `pytest` for tests, `ruff` for lint, `mypy --strict` for types
+- Bench: async HTTP via the `openai` SDK pointed at an OpenAI-compatible endpoint; `.env` for keys;
+  `numpy` for aggregation math; `matplotlib` for the one plot
+## Repo layout
+```
+voting-mcp/
+├── CLAUDE.md
+├── pyproject.toml
+├── README.md
+├── .env.example
+├── src/voting_mcp/
+│   ├── __init__.py
+│   ├── server.py          # FastMCP entrypoint + @mcp.tool registrations
+│   ├── types.py           # Ballot / Profile / Result pydantic models
+│   ├── aggregate.py       # dispatch + ballot validation
+│   └── rules/
+│       ├── __init__.py
+│       ├── borda.py
+│       ├── copeland.py
+│       ├── condorcet.py
+│       ├── approval.py
+│       ├── stv.py
+│       ├── opinion_pool.py
+│       ├── plurality.py   # baseline
+│       └── majority.py    # baseline used by the bench
+├── tests/
+│   └── test_rules.py
+├── bench/
+│   ├── run_ensemble.py    # calls N models on a dataset, caches raw responses
+│   ├── compare.py         # rules vs majority: accuracy + bootstrap 95% CI -> table + plot
+│   ├── config/models.yaml # one entry per ensemble member
+│   ├── datasets/          # downloaded benchmark (jsonl)
+│   └── results/
+│       └── raw/           # cached per-model responses (NEVER re-call API if present)
+└── .github/workflows/ci.yml
+```
+## Commands
+- install:        `uv sync`
+- run server:     `uv run python -m voting_mcp.server`
+- inspect (REQUIRED before a tool is "done"):
+                  `npx @modelcontextprotocol/inspector uv run python -m voting_mcp.server`
+- test:           `uv run pytest -q`
+- lint + types:   `uv run ruff check . && uv run mypy --strict src`
+- bench:          `uv run python -m bench.run_ensemble --dataset bench/datasets/<file>.jsonl \
+                   --models bench/config/models.yaml --limit 200`
+- compare:        `uv run python -m bench.compare --results bench/results/<run>/`
+## Domain model
+- **Ballot** variants: strict ranking, truncated/partial ranking, approval set, score/utility
+  vector, probability distribution over candidates.
+- **Profile** = candidate set + list of ballots (all over the same candidate set).
+- **Result** = winner (or winners on a tie), full ranking where defined, and a `note` field
+  (e.g. "no Condorcet winner exists").
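Reviewer sketch (not part of the package): the domain model above, together with the later "reject malformed ballots" rule, can be illustrated with a small validator. The `Profile` dataclass and `validate_profile` below are hypothetical stand-ins for the package's pydantic models in `types.py`, whose real field names are not visible in this part of the diff. Only ranking ballots are covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical stand-in for the package's pydantic Profile model."""

    candidates: tuple[str, ...]
    # Each ballot is a ranking, best first; it may be truncated.
    ballots: tuple[tuple[str, ...], ...]


def validate_profile(profile: Profile) -> None:
    """Raise ValueError on malformed input instead of coercing it silently."""
    known = set(profile.candidates)
    if len(known) != len(profile.candidates):
        raise ValueError("duplicate candidate in candidate set")
    for i, ballot in enumerate(profile.ballots):
        if len(set(ballot)) != len(ballot):
            raise ValueError(f"ballot {i} ranks a candidate more than once")
        unknown = set(ballot) - known
        if unknown:
            raise ValueError(f"ballot {i} names unknown candidates: {sorted(unknown)}")


# A truncated ballot ("B" only) is accepted; an unknown candidate would raise.
validate_profile(Profile(("A", "B"), (("A", "B"), ("B",))))
```

Raising early keeps the rule functions themselves pure: they can assume a well-formed profile.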
+## Rules to implement — each its own module + tests
+| Rule | Consumes | Known for / why it's here |
+|------|----------|---------------------------|
+| borda | rankings | positional, Condorcet-inconsistent — good contrast |
+| copeland | rankings | Condorcet-consistent pairwise |
+| condorcet | rankings | returns winner OR explicit "no winner" on a cycle |
+| approval | approval sets | simple, strategy-relevant |
+| stv | rankings | multi-round elimination, clone-resistant |
+| opinion_pool | distributions | linear pool — preserves confidence, NOT argmax |
+| plurality | top choice | baseline |
+| majority | top choice | the baseline the bench must beat |
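Reviewer sketch (not part of the package): the table calls Borda "Condorcet-inconsistent" and Copeland "Condorcet-consistent". A five-voter textbook profile shows the difference. The function names are illustrative, and the code assumes strict, complete rankings; the package's own modules may handle truncation and ties differently.

```python
from itertools import combinations


def borda_scores(candidates, ballots):
    """Positional scoring: m-1 points for first place, down to 0 for last."""
    m = len(candidates)
    scores = {c: 0 for c in candidates}
    for ballot in ballots:
        for position, c in enumerate(ballot):
            scores[c] += m - 1 - position
    return scores


def copeland_scores(candidates, ballots):
    """Pairwise scoring: +1 per head-to-head win, +0.5 per head-to-head tie."""
    scores = {c: 0.0 for c in candidates}
    for a, b in combinations(candidates, 2):
        a_wins = sum(1 for r in ballots if r.index(a) < r.index(b))
        b_wins = len(ballots) - a_wins
        if a_wins > b_wins:
            scores[a] += 1
        elif b_wins > a_wins:
            scores[b] += 1
        else:
            scores[a] += 0.5
            scores[b] += 0.5
    return scores


candidates = ["A", "B", "C"]
ballots = [["A", "B", "C"]] * 3 + [["B", "C", "A"]] * 2

print(borda_scores(candidates, ballots))     # {'A': 6, 'B': 7, 'C': 2}
print(copeland_scores(candidates, ballots))  # {'A': 2.0, 'B': 1.0, 'C': 0.0}
```

A beats both rivals head-to-head (3 of 5 voters each time), so Copeland picks A, while Borda picks B because the two B-first voters rank A last.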
+## Correctness discipline — READ THIS
+Social choice is full of subtle edge cases and an LLM gets them wrong confidently. So:
+- **TEST-FIRST for every rule.** Before/with each implementation, write textbook profiles with
+  a KNOWN expected winner. Required edge cases per rule where applicable:
+  - a **Condorcet cycle** (A>B>C>A) → condorcet must return "no winner", not crash
+  - a **tie** → must surface multiple winners, never silently pick one
+  - a **truncated/partial ballot** → defined behavior, not a coerce-to-zero
+  - a **clone** scenario for stv/borda
+- **Tie-breaking is an explicit, documented parameter** — never an implicit silent rule.
+- **Validate inputs**: reject malformed ballots with a clear error. Never coerce silently.
+- **Rules are pure functions**: take a Profile, return a Result. No I/O, no globals, deterministic.
+- If a rule is NP-hard to compute exactly (Kemeny), it's out of scope; do not approximate
+  silently. (Schulze is polynomial-time, but is likewise out of scope for now.)
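Reviewer sketch (not part of the package): the required Condorcet-cycle edge case can be written as a tiny executable check. `condorcet_winner` is a hypothetical helper that assumes strict, complete rankings and returns `None` where the package's `Result` would carry its "no winner" note.

```python
def condorcet_winner(candidates, ballots):
    """Return the candidate that beats every other head-to-head, or None."""

    def beats(a, b):
        # a beats b if a strict majority of ballots rank a above b.
        a_over_b = sum(1 for r in ballots if r.index(a) < r.index(b))
        return a_over_b * 2 > len(ballots)

    for a in candidates:
        if all(beats(a, b) for b in candidates if b != a):
            return a
    return None  # cycle or pairwise tie: no Condorcet winner exists


candidates = ["A", "B", "C"]

# Textbook cycle: A beats B, B beats C, C beats A.
cycle = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
print(condorcet_winner(candidates, cycle))  # None

# Same candidates, no cycle.
clear = [["A", "B", "C"], ["A", "B", "C"], ["B", "A", "C"]]
print(condorcet_winner(candidates, clear))  # A
```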
+## MCP tool layer
+- Wrap each rule with `@mcp.tool()`. The **docstring IS the model-facing spec** — state inputs,
+  output shape, and the "no winner" condition precisely.
+- **Strict JSON schema**: `additionalProperties: false`, fully typed fields, an enum for rule names.
+- **stdio transport only. NO network calls. NO file writes. NO secrets in this package.**
+  (This keeps the server clean against the OWASP MCP Top 10 by construction — it's pure compute.)
+- A tool is not "done" until it has been exercised in the MCP Inspector.
+## Bench harness — the proof
+- **Claim to demonstrate**: principled aggregation > naive majority vote, measured, with CIs.
+- **Ensemble** (bench/config/models.yaml) — one entry per member:
+  `{ name, base_url, api_key_env, model_id, weight }`. Default members (5 distinct labs):
+  - `gpt-4o-mini`   (OpenAI)
+  - `gemini-2.5-flash-lite` (Google)
+  - `deepseek-v3`   (DeepSeek, open-weight)
+  - `claude-haiku-4-5` (Anthropic)
+  - `glm-4.7`       (Zhipu)
+- **Routing**: ONE OpenAI-compatible client. Prefer a single gateway (OpenRouter) so every member
+  is just a `model_id`; per-provider `base_url` is the fallback. Each member is interchangeable
+  config, never hardcoded.
+- **Per call**: ask for a final label AND (where the task allows) a confidence distribution over
+  the options — the opinion_pool rule needs the distribution. Cap `max_tokens`. Retry w/ backoff,
+  timeout, and record failures.
+- **Caching (cost guardrail)**: write every raw response to `bench/results/raw/`. On re-run, if a
+  cached response exists, DO NOT call the API. This makes aggregation tweaks free — you only pay
+  again when prompts or models change.
+- **compare.py**: for each rule + majority baseline, compute accuracy and a **bootstrap 95% CI**;
+  emit a markdown table and a single matplotlib bar chart with error bars.
+- **Reproducibility**: fix seeds; record model IDs + dataset hash + timestamp in the results dir.
+- **Cost discipline**: default `--limit 200`; print an estimated cost before any run that hits the API.
+- **Dataset**: an openly downloadable MCQ reasoning set (MMLU subset / ARC-Challenge / GPQA-diamond).
+  Pick one that's freely fetchable; store as jsonl with `{id, question, options, answer}`.
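Reviewer sketch (not part of the package): the caching guardrail described above amounts to "look on disk before calling the API". The helper below is a hypothetical illustration using only the standard library. The cache key (a hash of model ID and prompt) and the JSON layout are assumptions, not the format used by `bench/run_ensemble.py`.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_call(
    cache_dir: Path,
    model_id: str,
    prompt: str,
    call_api: Callable[[str, str], str],
) -> str:
    """Return the cached response if present; otherwise call the API once and store it."""
    # Assumed key scheme: changing the model or the prompt yields a new cache entry.
    key = hashlib.sha256(f"{model_id}\n{prompt}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["response"]
    response = call_api(model_id, prompt)
    cache_dir.mkdir(parents=True, exist_ok=True)
    record = {"model_id": model_id, "prompt": prompt, "response": response}
    path.write_text(json.dumps(record), encoding="utf-8")
    return response
```

Keying on both model and prompt matches the stated behavior: aggregation tweaks hit the cache, and only a changed prompt or model triggers a new paid call.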
+## Coding conventions
+- Type hints everywhere; docstrings on public functions; small modules, no god-files.
+- `ruff` clean, `mypy --strict` clean before any commit.
+- Commit after each green `pytest` run, one logical change per commit.
+## Definition of done (phase gates)
+1. Scaffold builds, `uv sync` clean, CI green on an empty test.
+2. All rules implemented; `pytest` green INCLUDING every required edge case above.
+3. All tools exercised in the MCP Inspector; schemas strict.
+4. Bench runs end-to-end on `--limit 200`, produces table + plot, raw responses cached.
+5. README has install one-liner + eval table; `uv build` succeeds; published; `uvx mcp-scan`
+   run against our own server reported clean.
+## Out of scope this weekend (leave hooks, don't build)
+Web UI · remote HTTP/OAuth transport · exact Kemeny/Schulze · learned/accuracy-based agent weights.

voting_mcp-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,143 @@
+Metadata-Version: 2.4
+Name: voting-mcp
+Version: 0.1.0
+Summary: MCP server exposing principled social-choice aggregation rules (Borda, Copeland, Condorcet, approval, STV, opinion pool), with a reproducible benchmark measuring their accuracy vs majority vote over an LLM ensemble.
+Author-email: Hrishi Kabra <kabrahrishi@gmail.com>
+License: MIT
+Keywords: aggregation,ensemble,mcp,social-choice,voting
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.12
+Requires-Dist: mcp[cli]>=1.2.0
+Requires-Dist: pydantic>=2.6
+Provides-Extra: bench
+Requires-Dist: matplotlib>=3.8; extra == 'bench'
+Requires-Dist: numpy>=1.26; extra == 'bench'
+Requires-Dist: openai>=1.30; extra == 'bench'
+Requires-Dist: python-dotenv>=1.0; extra == 'bench'
+Requires-Dist: pyyaml>=6.0; extra == 'bench'
+Description-Content-Type: text/markdown
+# voting-mcp
+**Principled social-choice aggregation as MCP tools — with a benchmark that measures the
+accuracy lift over naive majority vote.**
+Almost every multi-agent system aggregates votes with `Counter(votes).most_common(1)`, throwing
+away preference order and confidence. `voting-mcp` ships the real rules (Borda, Copeland,
+Condorcet, approval, STV, linear opinion pool) as callable MCP tools — each with its known
+axiomatic behavior and explicit, documented tie-breaking — plus a reproducible benchmark that
+aggregates a diverse ensemble of LLMs on a reasoning set and reports accuracy with bootstrap
+confidence intervals.
+The server is **pure compute**: stdio transport, no network, no file writes, no secrets — clean
+against the OWASP MCP Top 10 by construction.
+## Install
+```sh
+# run the server directly (once published)
+uvx voting-mcp
+# or from source
+git clone <repo> && cd voting-mcp
+uv sync
+uv run python -m voting_mcp.server
+```
+Add it to an MCP client (e.g. Claude Desktop `claude_desktop_config.json`):
+```json
+{
+  "mcpServers": {
+    "voting": { "command": "uvx", "args": ["voting-mcp"] }
+  }
+}
+```
+## Tools
+Every tool takes a `profile` (`{candidates, ballots}`) and returns a `Result` with the full
+co-winner set (`winners`, so ties are never hidden), the single tie-broken `winner` (or `null`
+when none exists), a `ranking`, per-candidate `scores`, and a `note`.
+| Tool | Ballots | Notes |
+|------|---------|-------|
+| `borda` | rankings | positional; Condorcet-inconsistent, clone-sensitive |
+| `copeland` | rankings | Condorcet-consistent pairwise (+1 win, +0.5 tie) |
+| `condorcet` | rankings | returns the pairwise winner **or an explicit no-winner on a cycle** |
+| `approval` | approval sets | most-approved wins |
+| `stv` | rankings | single-winner instant-runoff; clone-resistant |
+| `opinion_pool` | distributions | linear pool — **preserves confidence, not an argmax vote** |
+| `plurality` | rankings | baseline (most first choices) |
+| `majority` | rankings | strict >50% or **no winner** |
+| `aggregate_rule` | any | dispatch by a `rule` enum |
+Tie-breaking is an explicit parameter (`lexicographic` default, `none`, or seeded `random`).
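Reviewer sketch (not part of the package): one way the three tie-breaking modes could behave. `break_tie` is a hypothetical helper; the mode names come from the README line above, but the exact semantics (for example, that `none` yields a null `winner` on a tie) are an assumption.

```python
import random


def break_tie(winners, tie_break="lexicographic", seed=None):
    """Reduce a co-winner list to a single winner, or None if that is not possible."""
    if not winners:
        return None
    if len(winners) == 1:
        return winners[0]
    if tie_break == "lexicographic":
        return min(winners)
    if tie_break == "none":
        return None  # assumed: leave the tie unresolved
    if tie_break == "random":
        # Sort first so the draw depends only on the seed, not on input order.
        return random.Random(seed).choice(sorted(winners))
    raise ValueError(f"unknown tie_break: {tie_break!r}")


print(break_tie(["B", "A"]))          # A
print(break_tie(["B", "A"], "none"))  # None
```

The full co-winner list stays available to the caller either way, so a tie is never hidden behind the single `winner` field.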
+## Benchmark
+Aggregate an ensemble of 5 models (one OpenAI-compatible client via OpenRouter) on
+ARC-Challenge and compare each rule to the naive majority vote:
+```sh
+uv sync --extra bench
+uv run python -m bench.fetch_arc --limit 200
+# prints a cost estimate and STOPS; add --yes to actually call the API, --mock for a free dry run
+uv run python -m bench.run_ensemble --dataset bench/datasets/arc_challenge.jsonl --limit 200 --yes
+uv run python -m bench.compare --dataset bench/datasets/arc_challenge.jsonl --limit 200
+```
+Every raw response is cached under `bench/results/raw/`; re-runs never re-call the API, so
+aggregation tweaks are free.
+### Results
+5-model ensemble (gpt-4o-mini · gemini-2.5-flash-lite · deepseek-v3 · claude-haiku-4.5 ·
+glm-4.7), n = 200, bootstrap 95% CI. Two datasets of different difficulty; full write-up and
+both plots in [`RESULTS.md`](RESULTS.md).
+**MMLU-Pro (hard, baseline 73.5%) — the informative case:**
+| Rule | Accuracy | 95% CI | Δ vs majority |
+|------|---------:|:------:|--------------:|
+| **opinion_pool** | **0.755** | [0.695, 0.815] | **+0.020** |
+| **majority_vote (baseline)** | 0.735 | [0.679, 0.788] | — |
+| approval | 0.701 | [0.640, 0.757] | −0.035 |
+| stv | 0.693 | [0.630, 0.750] | −0.043 |
+| copeland | 0.647 | [0.580, 0.710] | −0.088 |
+| condorcet | 0.620 | [0.550, 0.685] | −0.115 |
+| majority (strict) | 0.590 | [0.520, 0.655] | −0.145 |
+| borda | 0.472 | [0.405, 0.540] | −0.263 |
+![MMLU-Pro](docs/accuracy_mmlu_pro.png)
+**The finding (honest):** the value isn't "fancy voting beats majority." It's that **the
+confidence-preserving rule (`opinion_pool`) wins** when the crowd is uncertain (+2.0pp, the only
+rule above baseline — though its CI still overlaps, so *suggestive, not conclusive*), while
+**forcing the distributions into full rankings actively hurts** — `borda` collapses to 0.472,
+far below majority, because with 10 options the tail of the ranking is mostly noise. Aggregate
+the confidence; don't throw it away. On **ARC-Challenge** (baseline 96.8%, near-ceiling) nothing
+separates — every rule lands within overlapping CIs. See [`RESULTS.md`](RESULTS.md).
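Reviewer sketch (not part of the package): why averaging confidence can disagree with counting top picks. The three distributions below are made up for illustration, not taken from the benchmark. `linear_pool` and `top_pick_vote` are hypothetical names, and equal weights are assumed by default.

```python
from collections import Counter


def linear_pool(distributions, weights=None):
    """Weighted average of probability distributions over the same options."""
    weights = weights or [1.0] * len(distributions)
    total = sum(weights)
    options = list(distributions[0])
    return {
        o: sum(w * d[o] for w, d in zip(weights, distributions)) / total
        for o in options
    }


def top_pick_vote(distributions):
    """Naive vote: each model contributes only its argmax."""
    counts = Counter(max(d, key=d.get) for d in distributions)
    return counts.most_common(1)[0][0]


models = [
    {"A": 0.40, "B": 0.35, "C": 0.25},  # lukewarm on A
    {"A": 0.40, "B": 0.35, "C": 0.25},  # lukewarm on A
    {"A": 0.05, "B": 0.90, "C": 0.05},  # confident in B
]

pooled = linear_pool(models)
print(top_pick_vote(models))           # A
print(max(pooled, key=pooled.get))     # B
```

Two lukewarm votes for A outnumber one confident vote for B, but the pooled distribution puts the most mass on B.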
+## Develop
+```sh
+uv run pytest -q
+uv run ruff check .
+uv run mypy --strict src
+# exercise the tools in the MCP Inspector:
+npx @modelcontextprotocol/inspector uv run python -m voting_mcp.server
+```
+> Note: if you keep this repo under an iCloud-synced folder (e.g. `~/Desktop`), iCloud can spawn
+> duplicate `.pth` files that intermittently break the editable install. Tests use
+> `pythonpath=src`; run the server with `PYTHONPATH=src` if an import fails, or move the repo
+> off the synced folder.
+## License
+MIT

voting_mcp-0.1.0/README.md ADDED

@@ -0,0 +1,120 @@
+# voting-mcp
+**Principled social-choice aggregation as MCP tools — with a benchmark that measures the
+accuracy lift over naive majority vote.**
+Almost every multi-agent system aggregates votes with `Counter(votes).most_common(1)`, throwing
+away preference order and confidence. `voting-mcp` ships the real rules (Borda, Copeland,
+Condorcet, approval, STV, linear opinion pool) as callable MCP tools — each with its known
+axiomatic behavior and explicit, documented tie-breaking — plus a reproducible benchmark that
+aggregates a diverse ensemble of LLMs on a reasoning set and reports accuracy with bootstrap
+confidence intervals.
+The server is **pure compute**: stdio transport, no network, no file writes, no secrets — clean
+against the OWASP MCP Top 10 by construction.
+## Install
+```sh
+# run the server directly (once published)
+uvx voting-mcp
+# or from source
+git clone <repo> && cd voting-mcp
+uv sync
+uv run python -m voting_mcp.server
+```
+Add it to an MCP client (e.g. Claude Desktop `claude_desktop_config.json`):
+```json
+{
+  "mcpServers": {
+    "voting": { "command": "uvx", "args": ["voting-mcp"] }
+  }
+}
+```
+## Tools
+Every tool takes a `profile` (`{candidates, ballots}`) and returns a `Result` with the full
+co-winner set (`winners`, so ties are never hidden), the single tie-broken `winner` (or `null`
+when none exists), a `ranking`, per-candidate `scores`, and a `note`.
+| Tool | Ballots | Notes |
+|------|---------|-------|
+| `borda` | rankings | positional; Condorcet-inconsistent, clone-sensitive |
+| `copeland` | rankings | Condorcet-consistent pairwise (+1 win, +0.5 tie) |
+| `condorcet` | rankings | returns the pairwise winner **or an explicit no-winner on a cycle** |
+| `approval` | approval sets | most-approved wins |
+| `stv` | rankings | single-winner instant-runoff; clone-resistant |
+| `opinion_pool` | distributions | linear pool — **preserves confidence, not an argmax vote** |
+| `plurality` | rankings | baseline (most first choices) |
+| `majority` | rankings | strict >50% or **no winner** |
+| `aggregate_rule` | any | dispatch by a `rule` enum |
+Tie-breaking is an explicit parameter (`lexicographic` default, `none`, or seeded `random`).
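Reviewer sketch (not part of the package): the `stv` row describes single-winner instant-runoff. A minimal version is shown below. `instant_runoff` is a hypothetical function; eliminating the lexicographically first candidate among those tied for last, and skipping exhausted truncated ballots, are assumptions rather than the package's documented behavior.

```python
from collections import Counter


def instant_runoff(candidates, ballots):
    """Eliminate the weakest candidate each round until one holds a strict majority.

    Returns a list of winners; more than one entry means an unresolved tie.
    """
    remaining = set(candidates)
    while True:
        tallies = Counter({c: 0 for c in remaining})
        for ballot in ballots:
            # Count each ballot for its highest-ranked surviving candidate.
            top = next((c for c in ballot if c in remaining), None)
            if top is not None:
                tallies[top] += 1
        active = sum(tallies.values())
        majority = [c for c in remaining if tallies[c] * 2 > active]
        if majority:
            return majority
        fewest = min(tallies.values())
        last_place = [c for c in remaining if tallies[c] == fewest]
        if len(last_place) == len(remaining):
            return sorted(remaining)  # everyone is tied: surface all of them
        remaining.discard(min(last_place))  # assumed elimination tie-break


candidates = ["A", "B", "C"]
ballots = [["A", "B", "C"]] * 4 + [["B", "C", "A"]] * 3 + [["C", "B", "A"]] * 2
print(instant_runoff(candidates, ballots))  # ['B']
```

A leads on first choices (4 of 9) but has no majority; once C is eliminated its two ballots transfer to B, which wins 5 to 4.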
+## Benchmark
+Aggregate an ensemble of 5 models (one OpenAI-compatible client via OpenRouter) on
+ARC-Challenge and compare each rule to the naive majority vote:
+```sh
+uv sync --extra bench
+uv run python -m bench.fetch_arc --limit 200
+# prints a cost estimate and STOPS; add --yes to actually call the API, --mock for a free dry run
+uv run python -m bench.run_ensemble --dataset bench/datasets/arc_challenge.jsonl --limit 200 --yes
+uv run python -m bench.compare --dataset bench/datasets/arc_challenge.jsonl --limit 200
+```
+Every raw response is cached under `bench/results/raw/`; re-runs never re-call the API, so
+aggregation tweaks are free.
+### Results
+5-model ensemble (gpt-4o-mini · gemini-2.5-flash-lite · deepseek-v3 · claude-haiku-4.5 ·
+glm-4.7), n = 200, bootstrap 95% CI. Two datasets of different difficulty; full write-up and
+both plots in [`RESULTS.md`](RESULTS.md).
+**MMLU-Pro (hard, baseline 73.5%) — the informative case:**
+| Rule | Accuracy | 95% CI | Δ vs majority |
+|------|---------:|:------:|--------------:|
+| **opinion_pool** | **0.755** | [0.695, 0.815] | **+0.020** |
+| **majority_vote (baseline)** | 0.735 | [0.679, 0.788] | — |
+| approval | 0.701 | [0.640, 0.757] | −0.035 |
+| stv | 0.693 | [0.630, 0.750] | −0.043 |
+| copeland | 0.647 | [0.580, 0.710] | −0.088 |
+| condorcet | 0.620 | [0.550, 0.685] | −0.115 |
+| majority (strict) | 0.590 | [0.520, 0.655] | −0.145 |
+| borda | 0.472 | [0.405, 0.540] | −0.263 |
+![MMLU-Pro](docs/accuracy_mmlu_pro.png)
+**The finding (honest):** the value isn't "fancy voting beats majority." It's that **the
+confidence-preserving rule (`opinion_pool`) wins** when the crowd is uncertain (+2.0pp, the only
+rule above baseline — though its CI still overlaps, so *suggestive, not conclusive*), while
+**forcing the distributions into full rankings actively hurts** — `borda` collapses to 0.472,
+far below majority, because with 10 options the tail of the ranking is mostly noise. Aggregate
+the confidence; don't throw it away. On **ARC-Challenge** (baseline 96.8%, near-ceiling) nothing
+separates — every rule lands within overlapping CIs. See [`RESULTS.md`](RESULTS.md).
+## Develop
+```sh
+uv run pytest -q
+uv run ruff check .
+uv run mypy --strict src
+# exercise the tools in the MCP Inspector:
+npx @modelcontextprotocol/inspector uv run python -m voting_mcp.server
+```
+> Note: if you keep this repo under an iCloud-synced folder (e.g. `~/Desktop`), iCloud can spawn
+> duplicate `.pth` files that intermittently break the editable install. Tests use
+> `pythonpath=src`; run the server with `PYTHONPATH=src` if an import fails, or move the repo
+> off the synced folder.
+## License
+MIT

voting_mcp-0.1.0/RESULTS.md ADDED

@@ -0,0 +1,91 @@
+# Benchmark results
+**Ensemble:** 5 models via OpenRouter, equal weight —
+`gpt-4o-mini`, `gemini-2.5-flash-lite`, `deepseek-v3`, `claude-haiku-4.5`, `glm-4.7`.
+**Stats:** accuracy with a percentile bootstrap 95% CI (5,000 resamples, seed 0). Ties are
+scored unbiasedly (a k-way tie containing the gold answer scores 1/k). Each member returns a
+label + a confidence distribution; rules consume the ballot kind they need (rankings derived
+from the distribution, approval set thresholded at ≥ 1/#options, distribution verbatim for the
+opinion pool). Run date: 2026-06-30.
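Reviewer sketch (not part of the package): the two statistical conventions named above, tie-aware scoring (1/k) and a percentile bootstrap CI, fit in a few lines. The function names are hypothetical, and the standard-library `random` module is used here instead of the `numpy` the bench lists as a dependency, so intervals will not match the published ones exactly.

```python
import random


def tie_aware_score(winners, gold):
    """Score 1/k when the gold answer is among k co-winners, else 0."""
    return 1.0 / len(winners) if gold in winners else 0.0


def bootstrap_ci(scores, n_boot=5000, seed=0, alpha=0.05):
    """Mean of per-question scores with a percentile bootstrap (1 - alpha) CI."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample questions with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lo = means[round((alpha / 2) * (n_boot - 1))]
    hi = means[round((1 - alpha / 2) * (n_boot - 1))]
    return sum(scores) / n, lo, hi


print(tie_aware_score(["A", "B"], "A"))  # 0.5
print(tie_aware_score(["A"], "B"))       # 0.0
```

Passing the per-question scores of one rule to `bootstrap_ci` yields one row of the accuracy table: the point estimate plus the lower and upper bounds.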
+Two datasets, deliberately chosen for different difficulty:
+---
+## 1. ARC-Challenge (easy — n=200, baseline 96.8%)
+992/1000 responses parsed; every question keeps ≥ 4 ballots.
+| Rule | Accuracy | 95% CI | Δ vs majority |
+|------|---------:|:------:|--------------:|
+| borda / copeland | 0.975 | [0.950, 0.995] | +0.007 |
+| stv | 0.973 | [0.948, 0.993] | +0.005 |
+| condorcet | 0.970 | [0.945, 0.990] | +0.002 |
+| **majority_vote (baseline)** | 0.968 | [0.943, 0.988] | — |
+| opinion_pool | 0.965 | [0.935, 0.990] | −0.003 |
+| approval | 0.964 | [0.937, 0.987] | −0.003 |
+| majority (strict) | 0.960 | [0.930, 0.985] | −0.008 |
+![ARC-Challenge](docs/accuracy_arc.png)
+**Nothing separates here.** The ensemble is near-ceiling (96.8% baseline), so there is no
+headroom — every rule lands within overlapping CIs. When the crowd is almost always right, how
+you count the votes barely matters.
+---
+## 2. MMLU-Pro (hard — n=200, baseline 73.5%, up to 10 options)
+This is the informative setting: lower ceiling, real disagreement. 199/200 questions keep ≥ 3
+ballots, 192/200 keep ≥ 4. (`gemini-2.5-flash-lite` returned non-JSON on ~25% of these hard
+10-option items even with a large token budget — a real model-compliance characteristic; those
+responses abstain rather than corrupt a ballot.)
+| Rule | Accuracy | 95% CI | Δ vs majority |
+|------|---------:|:------:|--------------:|
+| **opinion_pool** | **0.755** | [0.695, 0.815] | **+0.020** |
+| **majority_vote (baseline)** | 0.735 | [0.679, 0.788] | — |
+| approval | 0.701 | [0.640, 0.757] | −0.035 |
+| stv | 0.693 | [0.630, 0.750] | −0.043 |
+| copeland | 0.647 | [0.580, 0.710] | −0.088 |
+| condorcet | 0.620 | [0.550, 0.685] | −0.115 |
+| majority (strict) | 0.590 | [0.520, 0.655] | −0.145 |
+| borda | 0.472 | [0.405, 0.540] | −0.263 |
+![MMLU-Pro](docs/accuracy_mmlu_pro.png)
+## The actual finding
+The headline isn't "fancy voting beats majority." It's sharper and more useful:
+1. **The confidence-preserving rule wins.** `opinion_pool` (a linear pool of the models'
+   probability distributions) is the only rule above majority vote, by +2.0 points. When the
+   crowd is genuinely uncertain, *keeping* the confidence and averaging it beats discarding it
+   for a one-vote-per-model count. This is exactly the rule CLAUDE.md flagged as "preserves
+   confidence, NOT argmax" — and it's the one that pays off.
+2. **Forcing rankings actively hurts.** The positional/pairwise rules degrade, and `borda`
+   collapses to 0.472 — far *below* majority. With up to 10 options, each model is confident
+   about its top pick and near-noise over the other nine; turning that into a full ranking and
+   then weighting all ten positions (Borda) or all pairwise contests (Condorcet/Copeland) feeds
+   the aggregator mostly noise. The signal lives in the top pick and the confidence, not in the
+   tail order.
+3. **Honesty on significance.** opinion_pool's +2.0 is the only positive margin, but its CI overlaps the baseline's: *suggestive, not conclusive* at n=200.
+   The large *negative* results are clearer: the CIs for `borda` and strict `majority` sit entirely below the baseline's,
+   while the smaller drops (`approval`, `stv`, `copeland`, marginally `condorcet`) still overlap it.
+So the measured, honest takeaway: **on hard problems, aggregate the confidence (opinion_pool);
+do not throw it away by forcing noisy full rankings.** On easy problems it doesn't matter.
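Reviewer sketch (not part of the package): how a single confidence distribution might be turned into the ranking and approval ballots described in the Stats paragraph. Both helpers are hypothetical; the approval threshold (at least 1/#options) comes from the write-up, while breaking equal probabilities by label is an assumption. The example distribution is invented.

```python
def ranking_from_distribution(dist):
    """Order options by descending probability; equal probabilities fall back to label order."""
    return sorted(dist, key=lambda option: (-dist[option], option))


def approval_from_distribution(dist):
    """Approve every option whose probability is at least uniform (1 / #options)."""
    threshold = 1.0 / len(dist)
    return {option for option, p in dist.items() if p >= threshold}


# Ten options: confident in A, near-uniform noise over the other nine.
dist = {"A": 0.82, **{label: 0.02 for label in "BCDEFGHIJ"}}

print(ranking_from_distribution(dist))   # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
print(approval_from_distribution(dist))  # {'A'}
```

Positions 2 through 10 of the derived ranking carry no real preference here, yet a positional rule such as Borda still assigns them different point values. That is the "noisy tail" effect the finding describes.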
+## Reproduce
+```sh
+uv sync --extra bench
+echo "OPENROUTER_API_KEY=sk-or-..." > .env
+# pick a dataset:
+uv run python -m bench.fetch_arc --limit 200          # easy
+uv run python -m bench.fetch_mmlu_pro --limit 200     # hard
+uv run python -m bench.run_ensemble --dataset bench/datasets/mmlu_pro.jsonl --limit 200 --yes --max-tokens 1024
+uv run python -m bench.compare --dataset bench/datasets/mmlu_pro.jsonl --limit 200 --n-boot 5000
+```
+Every raw response is cached, so re-scoring under different rules costs nothing.

voting_mcp-0.1.0/bench/__init__.py ADDED

@@ -0,0 +1 @@
+"""Benchmark harness: prove principled aggregation beats naive majority vote."""