npm - ultimate-pi - Versions diffs - 0.1.2 → 0.1.3 - Mend

ultimate-pi 0.1.2 → 0.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (516)

package/vault/wiki/sources/claude-code-security-architecture-penligent-2026.md ADDED

@@ -0,0 +1,70 @@
+---
+type: source
+status: ingested
+source_type: blog
+title: "Inside Claude Code: The Architecture Behind Tools, Memory, Hooks, and MCP"
+author: "Penligent"
+date_published: 2026-04-02
+url: "https://www.penligent.ai/hackinglabs/inside-claude-code-the-architecture-behind-tools-memory-hooks-and-mcp/"
+confidence: medium
+tags: [claude-code, architecture, security, hooks, permissions, sandboxing, MCP, CVE]
+key_claims:
+  - "Claude Code is a governed execution environment with a model in the middle"
+  - "The control plane around the model often matters more than the model itself"
+  - "Permissions and sandboxing are complementary: permissions control what can be attempted, sandboxing provides OS-level enforcement"
+  - "All child processes inherit sandbox boundaries"
+  - "Auto mode drops broad allow rules to prevent unsafe policy combinations"
+  - "Managed settings: allowManagedHooksOnly, allowManagedMcpServersOnly, allowManagedPermissionRulesOnly"
+  - "CVE-2025-68143/144/145: MCP Git server vulnerabilities show thin tool wrappers inherit unsafe command surfaces"
+  - "Agent risk is compositional: repo, tool result, config file, MCP server — all can become control inputs"
+created: 2026-05-02
+updated: 2026-05-02
+---
+# Claude Code Security Architecture (Penligent, 2026)
+## Source Summary
+Security-focused technical analysis by Penligent (AI security platform). Covers the full Claude Code architecture through a security lens. Published April 2, 2026 following Anthropic's source map leak incident. Notable for concrete CVE case studies and the explicit security model analysis.
+## Five Operational Layers
+| Layer | What it controls | Why it matters |
+|---|---|---|
+| Agent loop | Task decomposition, next-step selection | Shifts from text generation to action |
+| Context and memory | What Claude knows now and across sessions | Most "drift" problems are context problems |
+| Execution surface | How work touches code and systems | Determines blast radius and reproducibility |
+| Governance and safety | What Claude is allowed to attempt | The real control plane for production |
+| Extensibility | How new capabilities arrive | Flexibility and supply-chain risk both increase |
+## Permission Modes
+| Mode | Autonomy | Best fit |
+|---|---|---|
+| `default` | Read only | Sensitive work, first use |
+| `acceptEdits` | Read + edit files | Iterating while gating commands |
+| `plan` | Read + plan only | Research before modification |
+| `auto` | ML classifier reviews | Long-running with governance |
+| `bypassPermissions` | All actions, no checks | Isolated containers only |
+| `dontAsk` | Pre-approved tools only | Locked-down environments |
+## Sandboxing
+macOS: Seatbelt. Linux/WSL2: bubblewrap. WSL1 unsupported. Child processes inherit boundaries. `autoAllowBashIfSandboxed` + filesystem/network restrictions. Escape hatch: `dangerouslyDisableSandbox` — still goes through permission flow if enabled.
+## Security CVE Case Studies
+- **CVE-2025-68143**: `mcp-server-git` `git_init` tool accepted arbitrary paths. Fix: remove tool entirely.
+- **CVE-2025-68144**: `git_diff`/`git_checkout` passed user-controlled args directly to Git CLI. Flag-like values interpreted as options. Fix: reject `-` prefix, verify valid Git refs.
+- **CVE-2025-68145**: `--repository` restriction not validated in subsequent tool calls. Fix: path validation with symlink resolution.
+- **CVE-2024-32002**: Malicious Git repo with submodules could write hooks via recursive clone on case-insensitive filesystems. Lesson: repos are execution inputs, not passive context.
+- **CVE-2026-25153**: Backstage TechDocs `mkdocs.yml` hooks execution. Fix: allowlist supported keys, strip hooks.
+## Key Quotes
+> "Claude Code is a governed execution environment with a model in the middle. This is not just 'Claude plus bash.'"
+> "The dangerous thing about agent systems is often not only the code they generate, but the quiet changes to the agent's own configuration surface."
+> "Permissions can still be tricked by bad policy or human error. OS-level boundaries change what the subprocess can actually touch."
+> "Even if prompt injection manipulates Claude's behavior, the sandbox can still prevent critical file modification, unauthorized network egress, and access outside defined boundaries."
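
The CVE-2025-68144 entry in the file above describes its fix as rejecting `-`-prefixed values and verifying that arguments are valid Git refs. A minimal sketch of that idea follows. `isSafeGitRef` and `buildDiffArgs` are hypothetical names, the character allowlist is a conservative assumption rather than Git's full ref-name grammar, and none of this is taken from the actual `mcp-server-git` patch.

```typescript
// Hypothetical helper, not the upstream fix: accept only values that the Git
// CLI cannot read as an option and that match a conservative ref-name subset.
function isSafeGitRef(ref: string): boolean {
  // Empty or flag-like values (e.g. "--output=/tmp/x") are rejected outright.
  if (ref.length === 0 || ref.startsWith("-")) return false;
  // A few sequences that Git itself forbids in ref names.
  if (ref.includes("..") || ref.includes("@{") || ref.endsWith(".lock")) return false;
  if (ref.endsWith("/") || ref.endsWith(".")) return false;
  // Assumed allowlist: letters, digits, dot, underscore, slash, hyphen.
  return /^[A-Za-z0-9._\/-]+$/.test(ref);
}

// Build an argv array (never a shell string) for `git diff <ref> --`.
function buildDiffArgs(target: string): string[] {
  if (!isSafeGitRef(target)) {
    throw new Error(`rejected unsafe ref: ${JSON.stringify(target)}`);
  }
  return ["diff", target, "--"];
}
```

Passing an argv array to the process spawner, instead of interpolating into a shell string, keeps this check from being bypassed through quoting.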

package/vault/wiki/sources/claude-context-editing-docs.md ADDED

@@ -0,0 +1,13 @@
+---
+type: source
+status: stub
+created: 2026-05-02
+updated: 2026-05-02
+tags: [source, external-doc]
+---
+# Claude Context Editing Docs
+Anthropic documentation on context editing APIs for Claude. Describes how to modify conversation context in-place (vs session restart).
+Referenced in: [[resolved-context-pruning-inplace-vs-restart]]

package/vault/wiki/sources/cloudflare-codemode.md ADDED

@@ -0,0 +1,63 @@
+---
+type: source
+status: ingested
+source_type: official_documentation
+title: "Cloudflare Code Mode"
+author: "Cloudflare (Kenton Varda, Sunil Pai)"
+date_published: 2025-09-29
+url: "https://developers.cloudflare.com/agents/api-reference/codemode/"
+confidence: high
+tags:
+  - agent-tools
+  - typescript-execution-layer
+  - sandbox
+  - mcp
+key_claims:
+  - "LLMs are better at writing code to call APIs than at calling them directly through tool functions"
+  - "Code Mode converts MCP tools into typed TypeScript APIs, gives LLM a single 'write code' tool, and executes generated code in isolated Worker sandbox"
+  - "Inspired by Apple's CodeAct research"
+  - "DynamicWorkerExecutor spins up isolated Worker per execution via WorkerLoader"
+  - "Network isolation enforced at Workers runtime level (globalOutbound: null)"
+  - "Tool calls dispatched via Workers RPC, not network requests"
+  - "3-4x context reduction vs traditional tool calling"
+created: 2026-05-02
+updated: 2026-05-02
+---
+# Cloudflare Code Mode
+Cloudflare's `@cloudflare/codemode` package (beta) implements the **TypeScript execution layer** pattern for AI agents. Instead of exposing dozens of MCP tools as separate function calls in the LLM context, it converts all tools into a typed TypeScript API, gives the LLM a single "write code" tool, and executes the generated JavaScript in a secure, isolated Worker sandbox.
+## Architecture
+```
+Host Worker ←→ Dynamic Worker (isolated sandbox)
+  ToolDispatcher     LLM-generated code runs here
+  holds tool fns     codemode.myTool() → dispatcher.call()
+                     fetch() blocked by default
+```
+1. `createCodeTool` generates TypeScript type definitions from tools
+2. LLM writes an async arrow function calling `codemode.toolName(args)`
+3. Code is normalized via AST parsing (acorn)
+4. `DynamicWorkerExecutor` spins up isolated Worker via `WorkerLoader`
+5. Inside sandbox, `Proxy` intercepts `codemode.*` calls → RPC to host
+6. Console output captured and returned in result
+## Key Design Decisions
+- **TypeScript types as guardrails**: Generated type defs guide LLM to correct implementations
+- **Deterministic execution**: Once code is generated, execution is fully deterministic
+- **Executor interface is minimal** (`execute(code, fns) → ExecuteResult`): pluggable sandbox backends
+- **MCP server wrappers**: `codeMcpServer` and `openApiMcpServer` for wrapping existing servers
+- **Tool name sanitization**: hyphens/dots → underscores for valid JS identifiers
+## Limitations
+- Requires Cloudflare Workers for DynamicWorkerExecutor (custom Executor can use any sandbox)
+- JavaScript execution only
+- Tool approval (`needsApproval`) not yet supported
+- Experimental — may have breaking changes
+## Relevance to ultimate-pi
+Validates the TypeScript execution layer pattern at production scale (Cloudflare Agents SDK). The minimal Executor interface means we can implement our own sandbox backend (Node.js VM, Deno, or bubblewrap) without depending on Cloudflare infrastructure. The 3-4x context reduction directly supports our token budget goals.
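
Steps 2 and 5 of the architecture list in the file above, together with the tool-name sanitization rule, can be sketched without any Cloudflare dependency. This is an in-process illustration only: `createCodemodeProxy`, `ToolFn`, and `WeatherApi` are hypothetical names, the dispatcher is a plain `Map` and not Workers RPC, and nothing here reproduces the `@cloudflare/codemode` API.

```typescript
type ToolFn = (args: unknown) => Promise<unknown>;

// Hyphens and dots become underscores so the name is a valid JS identifier.
function sanitizeToolName(toolName: string): string {
  return toolName.replace(/[-.]/g, "_");
}

// Hypothetical stand-in for the sandbox-side `codemode` object: each property
// access is intercepted and resolved to the matching host tool function.
function createCodemodeProxy<Api extends object>(tools: Record<string, ToolFn>): Api {
  const bySafeName = new Map<string, ToolFn>();
  for (const [toolName, fn] of Object.entries(tools)) {
    bySafeName.set(sanitizeToolName(toolName), fn);
  }
  return new Proxy({}, {
    get(_target, prop) {
      // Keep the proxy from looking like a thenable if it is ever awaited.
      if (prop === "then") return undefined;
      const fn = typeof prop === "string" ? bySafeName.get(prop) : undefined;
      if (fn) return fn;
      return async () => {
        throw new Error(`unknown tool: ${String(prop)}`);
      };
    },
  }) as Api;
}

// Plays the role of the generated type definitions the LLM would be shown.
interface WeatherApi {
  weather_get_forecast: ToolFn;
}

const codemode = createCodemodeProxy<WeatherApi>({
  "weather.get-forecast": async (args) => ({ forecastFor: args, tempC: 21 }),
});

// The kind of async arrow function the LLM is asked to write.
const generated = async () => codemode.weather_get_forecast({ city: "Lisbon" });
```

In the real design the generated function runs in an isolated Worker and each intercepted call crosses an RPC boundary; the sketch keeps everything in one process to show only the interception.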

package/vault/wiki/sources/code-chunk-library-supermemory.md ADDED

@@ -0,0 +1,63 @@
+---
+type: source
+status: ingested
+source_type: open-source-tool
+author: Shoubhit Dash / Supermemory AI
+date_published: 2025-12-27
+url: https://www.nexxel.dev/blog/code-chunk
+confidence: high
+key_claims:
+  - "AST-based code chunking library that implements the cAST paper algorithm in production"
+  - "70.1% Recall@5 vs 49.0% (chonkie-code) vs 42.4% (fixed-size baseline)"
+  - "Adds contextualized text: file path, scope chain, entity signatures, imports used"
+  - "Supports TypeScript, JavaScript, Python, Rust, Go, Java via tree-sitter"
+  - "Open source (MIT), npm install code-chunk"
+tags:
+  - code-chunking
+  - AST
+  - tree-sitter
+  - embeddings
+  - open-source
+created: 2026-05-02
+updated: 2026-05-02
+---
+# code-chunk: AST-Aware Code Chunking Library
+## Summary
+Production-grade open-source library implementing the cAST paper algorithm. Built by Supermemory AI. Uses tree-sitter for parsing, extracts semantic entities with metadata, builds scope trees, and generates contextualized text for embedding.
+## Key Features Beyond cAST Paper
+1. **Rich context extraction**: Full entity metadata, scope trees, contextualized text formatting
+2. **Overlap support**: Chunks can include last N lines from previous chunk
+3. **Streaming**: Process large files without loading everything into memory
+4. **Batch processing**: Chunk entire codebases with controlled concurrency
+5. **WASM support**: Works in Cloudflare Workers and edge runtimes
+## Contextualized Text Format
+```
+# src/services/user.ts
+# Scope: UserService > getUser
+# Defines: async getUser(id: string): Promise<User>
+# Uses: Database
+# After: constructor
+  async getUser(id: string): Promise<User> { ... }
+```
+This prepended header enriches raw code with semantic context that embedding models (trained on natural language) can leverage.
+## Benchmark Results (SWE-bench Lite Eval)
+| Metric | Without Search | With Semantic Search |
+|--------|---------------|---------------------|
+| Duration | 2.0m | 1.2m |
+| Tokens | 4.3k | 2.4k |
+| Cost | $0.25 | $0.20 |
+| Tool Calls | 19 | 12 |
+## Relevance to Our Implementation
+We should adopt the same approach: tree-sitter AST parsing (already via lean-ctx) → extract entities → scope tree → greedy window assignment → contextualized text prepending → embed with contextualized text.
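
The contextualized text format shown in the file above can be produced with a small formatter. The sketch below assumes a hypothetical `ChunkContext` shape and `contextualize` function; the field names are illustrative and are not the `code-chunk` library's API.

```typescript
// Hypothetical metadata for one chunk; field names are assumptions.
interface ChunkContext {
  filePath: string;
  scopeChain: string[]; // outermost scope first
  signature: string;
  imports: string[];
  previousSibling?: string;
}

// Prepend the header lines to the raw chunk text before embedding.
function contextualize(ctx: ChunkContext, code: string): string {
  const header = [
    `# ${ctx.filePath}`,
    `# Scope: ${ctx.scopeChain.join(" > ")}`,
    `# Defines: ${ctx.signature}`,
  ];
  if (ctx.imports.length > 0) header.push(`# Uses: ${ctx.imports.join(", ")}`);
  if (ctx.previousSibling) header.push(`# After: ${ctx.previousSibling}`);
  return [...header, code].join("\n");
}

// Rebuilds the example from the note above.
const sample = contextualize(
  {
    filePath: "src/services/user.ts",
    scopeChain: ["UserService", "getUser"],
    signature: "async getUser(id: string): Promise<User>",
    imports: ["Database"],
    previousSibling: "constructor",
  },
  "  async getUser(id: string): Promise<User> { ... }",
);
```

The string in `sample` is what would be sent to the embedding model in place of the bare code.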

package/vault/wiki/sources/codeact-apple-2024.md ADDED

@@ -0,0 +1,62 @@
+---
+type: source
+status: ingested
+source_type: academic_paper
+title: "CodeAct: Executable Code Actions Elicit Better LLM Agents"
+author: "Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji (Apple / UIUC)"
+date_published: 2024-07
+url: "https://arxiv.org/abs/2402.01030"
+confidence: high
+conference: "ICML 2024"
+tags:
+  - agent-tools
+  - code-generation
+  - tool-calling
+  - academic
+key_claims:
+  - "Replacing JSON/text tool-calling with executable Python code improves LLM agent success rate by ~20 percentage points on multi-tool tasks"
+  - "CodeAct agents require ~30% fewer interaction turns than JSON-based agents"
+  - "Python interpreter provides automatic, zero-cost error signals — wrong calculations raise exceptions immediately"
+  - "Open-source models benefit more: CodeActAgent (Mistral 7B) at 12.2% vs Lemur-70B at 3.7% on multi-tool benchmark"
+  - "CodeActInstruct dataset: 7,139 multi-turn code-based trajectories across 4 domains"
+created: 2026-05-02
+updated: 2026-05-02
+---
+# CodeAct (Apple, ICML 2024)
+Foundation research paper that established the **code-as-unified-action-space** paradigm. Proposes replacing the JSON and text action formats common in tool-calling agents with executable Python code.
+## Core Insight
+LLMs have seen millions of lines of real-world code during pretraining but only contrived tool-calling examples. Code is a better lingua franca for agent actions because it already encodes control flow, data dependencies, and multi-step composition.
+## Key Results
+| Metric | JSON Actions | CodeAct | Improvement |
+|--------|-------------|---------|-------------|
+| Multi-tool success (GPT-4) | 53.7% | 74.4% | +20.7 pp |
+| Interaction turns | baseline | -30% | fewer round-trips |
+| Open-source (best) | 3.7% (Lemur-70B) | 12.2% (Mistral 7B) | +8.5 pp |
+## Mechanism
+- Unified action space: all agent actions expressed as Python code
+- Python interpreter catches errors automatically — no separate critique step
+- Dynamic revision: agent can emit new actions or revise prior ones based on observations
+- CodeActInstruct: fine-tuning dataset covering information retrieval, package calls, external memory, robot planning
+## Limitations
+- M3ToolEval benchmark has only 82 tasks (small sample, no confidence intervals)
+- Sandbox security is acknowledged but not deeply addressed (one paragraph)
+- 60+ point capability gap between GPT-4 and CodeActAgent remains
+## Adoption
+- Directly inspired Cloudflare Code Mode (TypeScript variant for Workers)
+- Implemented in OpenHands/OpenDevin, LangGraph CodeAct, Manus
+- Foundation for the entire "code execution layer" agent paradigm
+## Relevance to ultimate-pi
+The academic foundation for our P14 (Think-in-Code Enforcement) and the new TypeScript execution layer phase. The roughly 20-percentage-point improvement on multi-tool tasks (53.7% to 74.4% with GPT-4) validates that code-based tool orchestration is not just a context optimization but a capability improvement. The interpreter-as-error-signal mechanism complements our L4 adversarial verification.
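
The interpreter-as-error-signal mechanism in the file above can be illustrated in a few lines. The paper works with a Python interpreter; the sketch below is a TypeScript analogue with hypothetical names (`runAction`, `Observation`) and shows only how an exception becomes feedback for the next turn.

```typescript
type Observation =
  | { ok: true; value: unknown }
  | { ok: false; error: string };

// Run one code action and convert any exception into an observation.
function runAction(action: () => unknown): Observation {
  try {
    return { ok: true, value: action() };
  } catch (err) {
    const message = err instanceof Error ? `${err.name}: ${err.message}` : String(err);
    return { ok: false, error: message };
  }
}

const rows: Array<{ price: number }> = [{ price: 4 }, { price: 6 }];

// A faulty action: index 2 does not exist, so the property read throws.
const bad = runAction(() => (rows[2] as { price: number }).price);

// A corrected action the agent could emit after seeing the error.
const good = runAction(() => rows.reduce((sum, row) => sum + row.price, 0));
```

The runtime produces the error text for free, so no separate critique step is needed to tell the agent that the first action failed.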

package/vault/wiki/sources/codex-dsc-rfc-8573.md ADDED

@@ -0,0 +1,41 @@
+---
+type: source
+source_type: github-issue
+title: "RFC: Deterministic Session Checkpoint v1 (DSC) — Codex"
+author: "Community contributor"
+date_published: 2026-03-20
+date_accessed: 2026-05-05
+url: "https://github.com/openai/codex/issues/8573"
+confidence: medium
+tags:
+  - compaction
+  - deterministic
+  - codex
+  - checkpoint
+key_claims:
+  - "Proposes replacing Codex's lossy LLM summarization with deterministic host-generated checkpoints"
+  - "Checkpoint is derived from session event logs (rollout-*.jsonl) — zero LLM calls"
+  - "Data model: Artifact (file URI + hash), FactRecord (VALID/SUSPECT), DecisionRecord"
+  - "Stale derived facts auto-marked as SUSPECT after compaction"
+  - "RFC closed as not_planned by OpenAI (2026-03-20)"
+---
+# Codex DSC RFC 8573 — Deterministic Session Checkpoint
+## Summary
+Community-authored RFC proposing that Codex replace its lossy LLM-based compaction with a deterministic checkpoint derived from session event logs. Closed by OpenAI as "not_planned" but the approach independently validates the pi-vcc pattern.
+## Key Details
+- **Problem**: Codex's auto-compaction causes agents to re-read files, re-derive known facts, and lose task awareness
+- **Proposed solution**: `checkpoint_v1.json` — structured projection of session event log
+- **Data model**: Artifact (file URI + content hash), FactRecord (status: VALID | SUSPECT, evidence refs), DecisionRecord (rationale + dependencies)
+- **Key innovation**: SUSPECT marking — when a file changes after a fact was derived from it, the fact is automatically marked stale
+- **Outcome**: Rejected by OpenAI
+## Why This Matters
+The DSC RFC validates pi-vcc's core thesis: deterministic compaction preserves more useful state than LLM summarization. Codex choosing NOT to implement it does not invalidate the pattern — it may reflect prioritization, not disagreement. The SUSPECT marking concept is novel and absent from pi-vcc.
+> [!gap] OpenAI's rationale for closing the RFC is not documented publicly. May reflect architectural constraints in Codex's Rust core rather than disagreement with deterministic compaction.
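
The SUSPECT marking described in the file above amounts to comparing the content hash recorded when a fact was derived with the artifact's current hash. The sketch below assumes hypothetical field names (`uri`, `hash`, `evidence`) and a hypothetical `revalidate` function; the RFC's actual `checkpoint_v1.json` schema is not reproduced here.

```typescript
type FactStatus = "VALID" | "SUSPECT";

interface Artifact {
  uri: string;
  hash: string; // content hash at the time the fact was derived
}

interface FactRecord {
  id: string;
  statement: string;
  status: FactStatus;
  evidence: Artifact[];
}

// Mark a fact SUSPECT when any artifact it was derived from has changed.
// A missing artifact has no current hash and is therefore treated as changed.
function revalidate(facts: FactRecord[], currentHashes: Map<string, string>): FactRecord[] {
  return facts.map((fact): FactRecord => {
    const stale = fact.evidence.some((a) => currentHashes.get(a.uri) !== a.hash);
    return stale ? { ...fact, status: "SUSPECT" } : fact;
  });
}

const checked = revalidate(
  [
    { id: "f1", statement: "parse() returns null on empty input", status: "VALID",
      evidence: [{ uri: "file:///src/parse.ts", hash: "a1" }] },
    { id: "f2", statement: "config uses port 8080", status: "VALID",
      evidence: [{ uri: "file:///config.json", hash: "c3" }] },
  ],
  new Map([
    ["file:///src/parse.ts", "b2"], // edited since the fact was recorded
    ["file:///config.json", "c3"],  // unchanged
  ]),
);
```

No model call is involved: the check is a pure function of the event log and the current file hashes, which is the property the RFC argues for.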

package/vault/wiki/sources/codex-open-source-agent-2026.md ADDED

@@ -0,0 +1,110 @@
+---
+type: source
+status: ingested
+source_type: open-source-repository
+title: "Codex CLI — OpenAI's Open-Source Coding Agent"
+author: "OpenAI"
+date_published: 2026
+url: "https://github.com/openai/codex"
+tags: [codex, openai, agent, harness, rust, open-source]
+confidence: high
+key_claims:
+  - "Codex is a lightweight coding agent that runs locally (CLI, IDE, Desktop App, Web)"
+  - "Written in Rust (96.3%) — compiled binary, zero-dependency install"
+  - "79.2K+ GitHub stars, 11.4K forks, 756 releases, open-source Apache 2.0"
+  - "Platform-native sandboxing: Seatbelt (macOS), bubblewrap (Linux), Windows Sandbox"
+  - "MCP client AND server — Codex can be a tool for other agents"
+  - "Subagent workflows with per-agent model selection (gpt-5.5, gpt-5.4, gpt-5.4-mini, gpt-5.3-codex-spark)"
+  - "Memories system with Chronicle screen-context capture for cross-thread recall"
+  - "Lifecycle hooks at 6 events (SessionStart, PreToolUse, PermissionRequest, PostToolUse, UserPromptSubmit, Stop)"
+  - "Skills system following open agentskills.io standard with progressive disclosure"
+  - "Git worktrees for safe parallel branch work"
+  - "Automations for scheduled recurring agent tasks"
+created: 2026-05-02
+updated: 2026-05-02
+---
+# Codex CLI — OpenAI's Open-Source Coding Agent
+**Source type**: Open-source repository + official documentation. Primary documentation at `developers.openai.com/codex`.
+## Repository Facts
+- **Repo**: `github.com/openai/codex`
+- **Stars**: 79.2K+
+- **Forks**: 11.4K
+- **Language**: Rust (96.3%), Python (2.7%), TypeScript (0.3%)
+- **License**: Apache 2.0
+- **Latest release**: v0.128.0 (Apr 30, 2026)
+- **Total releases**: 756
+- **Build system**: Bazel (monorepo)
+- **Key crates**: `codex-core`, `codex-tui`, `codex-cli`, `codex-exec`, `codex-mcp`, `codex-app-server-protocol`
+## Architecture
+Codex is structured as a Cargo workspace with these key components:
+- **`codex-core/`**: Business logic library. Largest crate (actively being refactored to reduce bloat).
+- **`codex-tui/`**: Terminal UI built with Ratatui (Rust TUI library).
+- **`codex-cli/`**: CLI multitool exposing subcommands.
+- **`codex-exec/`**: Headless CLI for non-interactive automation (`codex exec`).
+- **`codex-mcp/`**: MCP client and server implementation.
+- **`app-server/`**: Local HTTP/WebSocket server for IDE extension communication.
+- **SDK**: TypeScript/Node.js SDK for programmatic use.
+## Multi-Surface Architecture
+A single agent core runs across four surfaces:
+1. **CLI** — Zero-dependency compiled binary
+2. **IDE Extension** — VS Code, Cursor, Windsurf integration
+3. **Desktop App** — Full GUI (`codex app`)
+4. **Web (Cloud)** — `chatgpt.com/codex` for cloud-based agent runs
+The App Server (local HTTP/WebSocket) is the bridge between agent core and IDE extensions. App-server protocol v2 defines typed RPC methods with camelCase wire format, TypeScript type generation from Rust structs, and explicit cursor pagination.
+## Key Innovations
+### 1. Rust-Native Implementation
+Zero-dependency compiled binary. No Node.js required (unlike Claude Code). Platform-optimized sandbox integration via native OS APIs.
+### 2. Platform-Native Sandboxing
+Three-tier sandbox: `read-only` (inspect only), `workspace-write` (edit within workspace), `danger-full-access` (no restrictions). Uses OS-level enforcement: Seatbelt on macOS, bubblewrap on Linux, Windows Sandbox on Windows. Approval policies: `untrusted`, `on-request`, `never`. Writable roots for multi-directory work. Permission profiles with per-domain and per-unix-socket rules.
+### 3. Bidirectional MCP
+Codex functions as MCP client (connects to external MCP servers) AND MCP server (`codex mcp-server` — other agents can use Codex as a tool). This is architecturally unique among production agents.
+### 4. Memories + Chronicle
+Opt-in persistent cross-thread memory. Stores under `~/.codex/memories/`. Chronicle captures screen context to bootstrap memories. Background generation (idle-thread-based, rate-limit-aware). Secret redaction. Per-thread controls (`/memories`). Configurable extract/consolidation models.
+### 5. Hooks Framework
+JSON-configurable lifecycle hooks at 6 events. Exit-code semantics: 0=continue, 2=block. JSON stdin/stdout contracts. Multiple matching hooks run concurrently. Regex matchers filter by tool name. Managed hooks via `requirements.toml` for enterprise enforcement.
+### 6. Subagent Workflows
+Parallel agent dispatch with per-agent model selection. Addresses "context pollution" and "context rot". Subagents return summaries to main thread. Explicit triggering only (no auto-spawn). Model selection: `gpt-5.5` for demanding agents, `gpt-5.4-mini` for fast scans, `gpt-5.3-codex-spark` for near-instant text-only.
+### 7. Git Worktrees
+Isolated git worktrees for safe parallel branch work. Enables multiple agents editing different branches without conflicts.
+### 8. Skills System
+Follows `agentskills.io` open standard. Progressive disclosure: name + description loaded first, full SKILL.md only when skill is activated. 2% context budget cap. Built-in `$skill-creator` and `$skill-installer`. Scopes: REPO, USER, ADMIN, SYSTEM. Plugins for distribution.
+### 9. Automations
+Scheduled recurring agent tasks — CI-like but agent-driven. No equivalent in Claude Code or Cursor.
+### 10. Enterprise Governance
+Managed config via `requirements.toml`. Admin hooks. Organization-level policy enforcement. Full enterprise story from day one.
+## What This Means for Our Harness
+Codex independently validates 7 of our planned features and reveals 5 new gaps. See [[codex-harness-innovations]] for the detailed mapping and [[Research: Codex State-of-the-Art Harness Improvements]] for the synthesis.
+## Development Conventions (from AGENTS.md)
+- Crate names prefixed with `codex-`. Collapse if statements, inline format! args, use method references over closures.
+- Avoid bool/Option parameters in public APIs. Prefer enums, named methods, newtypes.
+- Argument comment lint: `/*param_name*/` before opaque literals.
+- Exhaustive match statements. Doc comments on new traits.
+- Prefer RPITIT native async with explicit Send bounds.
+- Object-level deep equality in tests.
+- Modules under 500 LoC. Extract from large modules aggressively.
+- Snapshot tests (insta) for TUI. `pretty_assertions::assert_eq` for tests.
+- Bazel lockfile updates on dependency changes.
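
The hooks section in the file above mentions regex matchers on tool names and exit-code semantics (0 = continue, 2 = block). A rough sketch of that contract follows. The `HookConfig` shape, the unanchored regex test, and the `"unspecified"` result for other exit codes are all assumptions, since the note documents only codes 0 and 2; this is not Codex's implementation.

```typescript
type HookDecision = "continue" | "block" | "unspecified";

// Hypothetical config shape: a regex over the tool name plus a command to run.
interface HookConfig {
  matcher: string;
  command: string;
}

// Select the hooks whose matcher applies to the given tool name.
function hooksFor(toolName: string, hooks: HookConfig[]): HookConfig[] {
  return hooks.filter((hook) => new RegExp(hook.matcher).test(toolName));
}

// Map a hook process exit code to a decision.
function decide(exitCode: number): HookDecision {
  if (exitCode === 0) return "continue";
  if (exitCode === 2) return "block";
  return "unspecified"; // behavior for other codes is not stated in the note
}

const configured: HookConfig[] = [
  { matcher: "^Bash$", command: "./check-command.sh" },
  { matcher: "^(Edit|Write)$", command: "./check-paths.sh" },
];
const matched = hooksFor("Bash", configured);
```

Anchoring the matchers with `^` and `$`, as in the sample config, avoids accidental matches on tool names that merely contain the pattern.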

package/vault/wiki/sources/coir-code-retrieval-benchmark.md ADDED

@@ -0,0 +1,51 @@
+---
+type: source
+status: ingested
+source_type: research-paper
+author: Xiangyang Li, Kuicai Dong, Yi Quan Lee, Wei Xia, Yichun Yin, Hao Zhang, Yong Liu, Yasheng Wang, Ruiming Tang
+date_published: 2024-07-03
+url: https://arxiv.org/abs/2407.02883
+confidence: high
+key_claims:
+  - "CoIR is the leading benchmark for code information retrieval, accepted at ACL 2025 Main"
+  - "10 curated code datasets, 8 retrieval tasks across 7 domains, 2M+ documents"
+  - "Trusted by Voyage, Jina, BGE, Salesforce, OpenAI, Google, Qwen, NV-Embed"
+  - "Integrated into MTEB leaderboard for cross-benchmark evaluation"
+  - "Pip-installable Python framework (coir-eval)"
+tags:
+  - benchmark
+  - code-retrieval
+  - embeddings
+  - coir
+  - mteb
+created: 2026-05-02
+updated: 2026-05-02
+---
+# CoIR: A Comprehensive Benchmark for Code Information Retrieval Models
+## Summary
+ACL 2025 Main paper introducing CoIR, the standard benchmark for evaluating code embedding/retrieval models. 10 curated datasets, 8 retrieval tasks, 7 domains, 2M+ documents. Integrated into the MTEB leaderboard.
+## Top Models on CoIR Leaderboard
+The CoIR leaderboard is adopted by major embedding providers:
+- **Voyage AI**: voyage-code-3 (top-ranked)
+- **Salesforce**: SFR-Embedding-Code-2B_R
+- **BAAI**: bge-code-v1
+- **Jina AI**: jina-embeddings-v4
+- **Qwen**: Qwen3-Embedding
+- **OpenAI**: text-embedding-3 series
+- **Google**: Gemini embedding models
+- **NVIDIA**: NV-Embed
+## Framework
+- Install: `pip install coir-eval`
+- Compatible with MTEB/BEIR data schema
+- Supports custom models and API-based models
+- 10 datasets: codetrans-dl, stackoverflow-qa, apps, codefeedback-mt, codefeedback-st, codetrans-contest, synthetic-text2sql, cosqa, codesearchnet, codesearchnet-ccr
+## Relevance to Our Implementation
+CoIR is the benchmark we should use to evaluate our embedding pipeline. If we want to validate whether all-MiniLM-L6-v2 with good chunking approaches larger model quality, we run CoIR eval and compare against the leaderboard.

package/vault/wiki/sources/colinmcnamara-context-optimization-codemode.md ADDED

@@ -0,0 +1,48 @@
+---
+type: source
+status: ingested
+source_type: blog
+title: "Context Optimization in AI Agents: From Sub-Agents to TypeScript Interfaces"
+author: "Colin McNamara"
+date_published: 2025-09-29
+url: "https://colinmcnamara.com/blog/context-optimization-mcp-code-mode"
+confidence: medium
+tags:
+  - agent-tools
+  - context-optimization
+  - typescript-execution-layer
+  - mcp
+key_claims:
+  - "Traditional tool calling uses ~10,500+ tokens per interaction; Code Mode uses ~3,100 tokens — 3-4x reduction"
+  - "Sub-agent pattern is 'non-deterministic to non-deterministic' — LLMs coordinating LLMs"
+  - "Code Mode is 'non-deterministic to deterministic' — LLM generates code, runtime executes predictably"
+  - "TypeScript has advantages for LLM-generated code: rich training data, type safety as guardrails, deterministic execution"
+  - "LLMs have seen millions of lines of TypeScript in training but far fewer synthetic tool-call examples"
+created: 2026-05-02
+updated: 2026-05-02
+---
+# Context Optimization in AI Agents: From Sub-Agents to TypeScript Interfaces
+Colin McNamara's analysis of Cloudflare's Code Mode, comparing context consumption across three agent tool-calling patterns: traditional tool calling, sub-agent architecture, and the TypeScript execution layer.
+## Context Efficiency Comparison
+| Pattern | Context per Interaction | Mechanism |
+|---------|------------------------|-----------|
+| Traditional tool calling | ~10,500+ tokens | System prompt + tool defs + history + call/response pairs |
+| Sub-agent pattern | ~1,000-1,300 per agent | Supervisor with minimal context, task agents with focused tools |
+| Code Mode (TS execution) | ~3,100 tokens | System + type defs + generated code + results only |
+## Key Insight: Deterministic Bridge
+Traditional sub-agent patterns are "non-deterministic to non-deterministic" — LLMs coordinating LLMs. Code Mode creates a "non-deterministic to deterministic" bridge — an LLM generates code that executes predictably. This has advantages for debugging, reliability, performance, and security.
+## TypeScript Advantages
+- **Rich training data**: Vast quantities of high-quality TypeScript in open-source
+- **Type safety as guardrails**: Type system constrains LLM toward correct implementations
+- **Deterministic execution**: Once code is generated, execution is fully predictable
+## Relevance to ultimate-pi
+The context efficiency comparison directly supports our token budget goals. The "deterministic bridge" concept aligns with our L4 adversarial verification — generated code that's been verified once is reliable, unlike agent intuition which must be re-verified each turn. The sub-agent pattern limitations validate our move toward a TypeScript execution layer (P43).

package/vault/wiki/sources/context-folding-paper.md ADDED

@@ -0,0 +1,61 @@
+---
+type: source
+source_type: paper
+title: "Scaling Long-Horizon LLM Agent via Context-Folding"
+author: "Sun et al. (ByteDance Seed, CMU, Stanford)"
+date_published: 2025-10-15
+date_accessed: 2026-05-05
+url: "https://arxiv.org/abs/2510.11967"
+confidence: high
+tags:
+  - context-folding
+  - compaction
+  - reinforcement-learning
+  - agent-architecture
+  - academic-paper
+key_claims:
+  - "200-step agents in 10x less context (32K tokens vs 327K baseline)"
+  - "62.0% on BrowseComp-Plus, 58.0% on SWE-Bench Verified with 32K budget"
+  - "FoldGRPO: RL framework with token-level process rewards for learned folding"
+  - "Branch/return sub-trajectories replace settled segments with summaries"
+  - "Outperforms summarization-based context management"
+  - "Tool-calling accuracy collapses ~40% past 80K effective-context tokens"
+  - "Now available as first-class API primitive in Anthropic's context-management beta"
+---
+# Context Folding
+## Summary
+Context Folding (arXiv 2510.11967) is a structured compaction technique from ByteDance Seed, CMU, and Stanford that enables 200+ step agents to maintain only ~32K active tokens — 10x less than naive approaches. Published October 2025.
+## Core Mechanism
+Agents create temporary sub-trajectories for subtasks via a "branch" action. Upon completion, intermediate steps are summarized and "folded" away via a "return" action, leaving only the compressed artifact in active context.
+**Key distinction**: Folding compresses WITHIN a single run. Memory persists ACROSS runs. Different problems, different solutions.
+## FoldGRPO
+End-to-end reinforcement learning framework that makes folding behavior learnable. Uses token-level process rewards to encourage effective task decomposition and context management. Agents learn WHEN and HOW to branch and fold.
+## Results
+| Benchmark | Folding (32K) | Baseline (327K) |
+|-----------|---------------|-----------------|
+| BrowseComp-Plus | 62.0% | < 62.0% |
+| SWE-Bench Verified | 58.0% | comparable |
+Significantly outperforms summarization-based context management.
+## Critical Finding
+Past ~80K effective-context tokens, agent tool-calling accuracy collapses by approximately 40%. This is a hard cliff, not a gradual decline. Context windows beyond 80K are misleading for agentic workloads.
+## Relevance to pi-vcc
+Context folding is a fundamentally different approach from pi-vcc:
+- **Folding**: Learned, within-run, branch/return structure, RL-trained
+- **pi-vcc**: Deterministic, at compaction boundaries, extraction-based, no ML
+They could theoretically combine: pi-vcc for deterministic boundary compaction + context folding for within-run trajectory management.
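
The branch/return mechanism in the file above can be pictured as bookkeeping on the active message list. The sketch below covers only that bookkeeping: `FoldingContext` and its methods are hypothetical names, the summary is passed in by the caller, and the learned part (FoldGRPO deciding when to branch and what to summarize) is out of scope.

```typescript
class FoldingContext {
  private readonly messages: string[] = [];
  private readonly branchStarts: number[] = [];

  append(message: string): void {
    this.messages.push(message);
  }

  // "branch": remember where the sub-trajectory begins.
  branch(task: string): void {
    this.branchStarts.push(this.messages.length);
    this.messages.push(`[branch] ${task}`);
  }

  // "return": replace the whole sub-trajectory with its summary.
  fold(summary: string): void {
    const start = this.branchStarts.pop();
    if (start === undefined) throw new Error("return without a matching branch");
    this.messages.splice(start, this.messages.length - start, `[folded] ${summary}`);
  }

  active(): readonly string[] {
    return this.messages;
  }
}

const ctx = new FoldingContext();
ctx.append("plan: fix failing test");
ctx.branch("locate the bug");
ctx.append("read parser.ts");
ctx.append("read lexer.ts");
ctx.append("ran test, saw stack trace");
ctx.fold("bug is an off-by-one in parser.ts");
```

After the fold, the active context holds the plan and one summary line; the three intermediate steps no longer count against the token budget.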

package/vault/wiki/sources/context-mode-website.md ADDED

@@ -0,0 +1,63 @@
+---
+type: source
+source_type: website
+title: context-mode.com
+author: B. Mert Köseoğlu
+date_published: 2026
+url: https://context-mode.com
+confidence: medium
+key_claims:
+  - "Saves 98% of AI coding agent's context window"
+  - "66,000+ developers across 14 platforms"
+  - "99.5% reduction on Playwright output (56.2KB → 299B)"
+  - "30× fewer tokens across full sessions"
+  - "Used at Microsoft, Google, Meta, ByteDance, Red Hat, GitHub"
+  - "HN #1 with 570+ points"
+created: 2026-04-30
+updated: 2026-04-30
+status: ingested
+tags: [#source/website]
+---
+# context-mode.com
+Landing page for the context-mode MCP plugin. Source for architecture claims, feature descriptions, and benchmark numbers.
+## Architecture
+- **PreToolUse hook**: Routes tool calls. Blocks curl/wget, redirects large output to sandbox.
+- **PostToolUse hook**: Captures events to SessionDB (file ops, git, errors, decisions).
+- **SessionStart**: Restores state from previous session.
+- **PreCompact**: Builds snapshot before context wipe.
+- **UserPromptSubmit**: Captures intent, tracks decisions.
+## Think in Code Paradigm
+Introduced in v1.0.64. Mandatory across all 14 platforms. Rule: when you need to analyze/count/filter/process data, write code that does it. Don't read raw data into context. Uses `ctx_execute()` MCP tool that runs JavaScript in a sandbox (Node.js built-ins only, no npm deps).
+## Compression Results (claimed)
+- Playwright: 56.2 KB → 299 B (99.5%)
+- GitHub Issues: 58.9 KB → 1.1 KB (98%)
+- Access Logs: 45.1 KB → 155 B (99.7%)
+- Full Session: 315 KB → 5.4 KB (98%)
+## Platforms
+Claude Code, Cursor, Codex CLI, VS Code Copilot, JetBrains Copilot, Gemini CLI, Qwen Code, Kiro, OpenCode, KiloCode, Zed, OpenClaw, Pi, Antigravity
+## License
+Elastic License 2.0 (ELv2) — source-available, not OSI-approved open source.
+## GitHub
+- Stars: 11,245 (as of 2026-04-30)
+- Forks: 769
+- Language: TypeScript
+- Created: 2026-02-23
+## npm
+- Package: context-mode
+- Downloads last month: 48,161
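
The claimed compression percentages in the file above can be re-derived from the before/after sizes it lists. The sketch below assumes 1 KB = 1000 bytes, which the page does not state; `reductionPercent` and `claims` are illustrative names.

```typescript
// Percentage of the original size that was removed.
function reductionPercent(beforeBytes: number, afterBytes: number): number {
  return (1 - afterBytes / beforeBytes) * 100;
}

// Sizes copied from the note above, converted with the assumed 1 KB = 1000 B.
const claims = [
  { label: "Playwright", before: 56200, after: 299 },
  { label: "GitHub Issues", before: 58900, after: 1100 },
  { label: "Access Logs", before: 45100, after: 155 },
  { label: "Full Session", before: 315000, after: 5400 },
];

for (const claim of claims) {
  const pct = reductionPercent(claim.before, claim.after);
  console.log(`${claim.label}: ${pct.toFixed(1)}% smaller`);
}
```

Using 1024 bytes per KB instead shifts each result by only a few hundredths of a percentage point, so the rounded figures on the page hold under either convention.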