npm - ultimate-pi - Versions diffs - 0.1.7 → 0.2.2 - Mend

ultimate-pi 0.1.7 → 0.2.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (524)

package/vault/wiki/concepts/Dual-Model Agent Architecture.md DELETED

@@ -1,65 +0,0 @@
----
-type: concept
-title: "Dual-Model Agent Architecture"
-created: 2026-04-30
-status: developing
-tags:
-  - agent-architecture
-  - llm
-  - ensembling
-  - swe-bench
-aliases:
-  - Two-Model Agent
-related:
-  - "[[Majority Vote Ensembling]]"
-  - "[[Agentic Coding Harness]]"
-sources:
-  - "[[Augment SWE-bench Agent GitHub]]"
-  - "[[Augment SWE-bench Pro Blog]]"
-updated: 2026-05-02
----
-# Dual-Model Agent Architecture
-An agent architecture that uses two different LLMs for distinct phases: a fast, capable model for iterative reasoning and coding, and a more deliberative model for solution selection and verification.
-## Augment Code's Implementation
-### Phase 1: Core Reasoning (Claude Sonnet 3.7)
-- Handles the iterative coding loop: read files, write code, run tests, debug.
-- Fast, capable, good at following instructions.
-- Runs in a loop with tool access (bash, file edit, sequential thinking).
-### Phase 2: Solution Ensembling (OpenAI o1)
-- Runs after N candidate solutions (typically 8) have been generated.
-- All candidates are presented to o1 together with their evaluation outcomes.
-- o1 analyzes and selects the best solution.
-- o1 is slower but more deliberative — better at comparative analysis.
-## Why Two Models?
-1. **Cost optimization**: Fast model for 95% of the work; expensive model only for selection.
-2. **Complementary strengths**: Claude excels at code generation; o1 excels at analysis and comparison.
-3. **Error reduction**: Majority vote ensembling catches errors that any single run might miss.
-4. **Separation of concerns**: Generation and evaluation use different reasoning patterns.
-## Alternative Patterns
-### Single-Model Multi-Pass
-- Same model generates multiple solutions then self-reviews.
-- Simpler but less effective than cross-model ensembling.
-### Model Cascade
-- Start with fast/cheap model; escalate to stronger model on failure.
-- Used by SWE-agent and some production systems.
-### Committee of Models
-- 3+ different models generate solutions independently.
-- Voting or LLM-based selection.
-## Implementation for Our Harness
-We can implement dual-model architecture as a configurable strategy:
-- **Primary model**: Claude (fast, code-capable) for the main agent loop.
-- **Ensembler model**: GPT-5 or o1 for solution verification and selection.
-- Generate 3-5 candidate solutions, use ensembler to pick best.
-- Configurable via harness config.
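
A minimal sketch of what such a configurable strategy could look like. The `DualModelConfig` shape, its field names, and the phase names `generate` and `select` are assumptions made for illustration; the note does not specify the harness config format, and the model strings are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualModelConfig:
    """Hypothetical config shape; field names are illustrative only."""

    primary_model: str = "claude"   # runs the main agent loop
    ensembler_model: str = "o1"     # verifies and selects among candidates
    num_candidates: int = 3         # the note suggests 3-5

    def __post_init__(self) -> None:
        if self.num_candidates < 1:
            raise ValueError("num_candidates must be at least 1")

    @property
    def uses_ensembling(self) -> bool:
        # With a single candidate there is nothing to select between.
        return self.num_candidates > 1


def model_for_phase(config: DualModelConfig, phase: str) -> str:
    """Map a phase name ('generate' or 'select') to the configured model."""
    if phase == "generate":
        return config.primary_model
    if phase == "select":
        return config.ensembler_model
    raise ValueError(f"unknown phase: {phase!r}")
```

Keeping the model choice in one immutable object means the agent loop only asks which model serves a phase, so swapping the ensembler is a config change rather than a code change.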

package/vault/wiki/concepts/Late Chunking vs Early Chunking.md DELETED

@@ -1,43 +0,0 @@
----
-type: concept
-title: "Late Chunking vs Early Chunking"
-created: 2026-04-30
-status: developing
-tags:
-  - chunking
-  - embeddings
-  - rag
-  - semantic-search
-related:
-  - "[[AST-Aware Code Chunking]]"
-  - "[[Contextualized Text Embedding]]"
-sources:
-  - "[[vectara-chunking-vs-embedding-naacl2025]]"
-updated: 2026-05-02
----
-# Late Chunking vs Early Chunking
-## Definitions
-- **Early chunking (standard)**: Split text → embed each chunk separately. Each chunk's embedding only sees its own text.
-- **Late chunking**: Embed the entire document first (producing token-level embeddings), then pool token embeddings into chunk-level embeddings using chunk boundaries. Each chunk's embedding "sees" the full document context.
-- **Contextual retrieval**: An intermediate approach: prepend document-level context to each chunk before embedding. Simpler than late chunking, captures some cross-chunk context.
-## Trade-offs
-| Approach | Semantic Coherence | Compute Cost | Implementation Complexity |
-|----------|-------------------|--------------|---------------------------|
-| Early chunking | Lowest | Lowest | Simplest |
-| Contextual retrieval | Medium | Medium | Moderate |
-| Late chunking | Highest | Highest | Complex |
-## Research Findings (arXiv:2504.19754)
-The paper evaluates late chunking and contextual retrieval for RAG systems:
-- Contextual retrieval preserves semantic coherence more effectively than early chunking.
-- It also requires greater computational resources (embeds full documents).
-- For code: contextual retrieval (prepending scope/file context) is the sweet spot — better than bare early chunking, cheaper than full late chunking
-## Relevance to Our Implementation
-We implement **contextual retrieval** (not full late chunking): prepend file path, scope chain, signatures, and imports to each chunk before embedding. This gives us much of the benefit at moderate cost.
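
A minimal sketch of the contextual-retrieval step described in the note: a header built from file path, scope chain, signature, and imports is prepended to each chunk before it is embedded. The `CodeChunk` record, its field names, and the `# key: value` header layout are assumptions for illustration; the embedding call itself is left out.

```python
from dataclasses import dataclass, field


@dataclass
class CodeChunk:
    """Hypothetical chunk record; field names are illustrative."""

    file_path: str
    scope_chain: list[str]   # e.g. ["PaymentService", "charge"]
    signature: str
    body: str
    imports: list[str] = field(default_factory=list)


def contextualize(chunk: CodeChunk) -> str:
    """Return the text to embed: a context header followed by the chunk body."""
    header = [f"# file: {chunk.file_path}"]
    if chunk.scope_chain:
        header.append("# scope: " + " > ".join(chunk.scope_chain))
    if chunk.signature:
        header.append(f"# signature: {chunk.signature}")
    if chunk.imports:
        header.append("# imports: " + ", ".join(chunk.imports))
    return "\n".join(header) + "\n" + chunk.body
```

The body is still embedded one chunk at a time, as in early chunking; only the header carries document-level context, which is why the cost stays below full late chunking.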

package/vault/wiki/concepts/Majority Vote Ensembling.md DELETED

@@ -1,68 +0,0 @@
----
-type: concept
-title: "Majority Vote Ensembling"
-created: 2026-04-30
-status: developing
-tags:
-  - agent-architecture
-  - llm
-  - ensembling
-aliases:
-  - Solution Ensembling
-related:
-  - "[[Dual-Model Agent Architecture]]"
-sources:
-  - "[[Augment SWE-bench Agent GitHub]]"
-updated: 2026-05-02
----
-# Majority Vote Ensembling
-A technique where an agent generates multiple candidate solutions to the same problem, then uses an LLM (or voting mechanism) to select the best one. Used by Augment Code's SWE-bench agent to boost success rates.
-## How Augment Implements It
-1. Run the core agent (Claude Sonnet 3.7) N times on the same problem (typically N=8).
-2. Each run produces a candidate solution (diff).
-3. Run evaluation harness on each candidate to get pass/fail outcomes.
-4. Feed all candidates + outcomes to OpenAI o1 with a prompt asking it to select the best solution.
-5. o1 returns the index of the selected solution.
-## Input Format
-```json
-{
-  "id": "problem-1",
-  "instruction": "Fix the login timeout issue",
-  "diffs": ["diff1", "diff2", "..."],
-  "eval_outcomes": [
-    {"is_success": true},
-    {"is_success": false}
-  ]
-}
-```
-## Why It Works
-1. **Variance reduction**: Multiple independent runs reduce the impact of any single bad generation.
-2. **Complementary failures**: Different runs fail on different aspects; ensembling can pick the run that succeeded.
-3. **LLM-as-judge**: o1's reasoning capabilities are better suited for comparative analysis than code generation.
-4. **Evaluation-guided**: Including eval outcomes helps the ensembler distinguish between functionally correct and incorrect solutions.
-## Cost Consideration
-Running N candidates multiplies cost by N. Augment's approach: use a fast/cheap model (Sonnet) for the N runs, then an expensive model (o1) only for the single ensembling step.
-## Implementation for Our Harness
-```python
-def ensemble_solutions(problem: str, candidates: int = 5) -> str:
-    solutions = []
-    for i in range(candidates):
-        # Run agent independently
-        diff = run_agent(problem)
-        result = evaluate(diff)
-        solutions.append({"diff": diff, "success": result.passed})
-    # Select best via LLM ensembler (assumed to return one of the dicts in `solutions`)
-    best = llm_ensembler.select_best(problem, solutions)
-    return best["diff"]
-```
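
The snippet above relies on `run_agent`, `evaluate`, and `llm_ensembler`, which the note does not define. A self-contained variant, assuming those three dependencies are passed in as plain callables and that the selector returns an index (as step 5 of the Augment description says o1 does), might look like this. `first_passing` is a hypothetical stand-in for the LLM ensembler so the sketch runs without any model call.

```python
from typing import Callable, TypedDict


class Candidate(TypedDict):
    diff: str
    success: bool


def ensemble_solutions(
    problem: str,
    run_agent: Callable[[str], str],
    evaluate: Callable[[str], bool],
    select_best: Callable[[str, list[Candidate]], int],
    candidates: int = 5,
) -> str:
    """Generate `candidates` diffs, evaluate each, and return the selected one."""
    solutions: list[Candidate] = []
    for _ in range(candidates):
        diff = run_agent(problem)  # each run is independent
        solutions.append({"diff": diff, "success": evaluate(diff)})
    index = select_best(problem, solutions)
    if not 0 <= index < len(solutions):
        raise IndexError(f"selector returned invalid index {index}")
    return solutions[index]["diff"]


def first_passing(problem: str, solutions: list[Candidate]) -> int:
    """Stand-in selector: first candidate whose evaluation passed, else 0."""
    for i, candidate in enumerate(solutions):
        if candidate["success"]:
            return i
    return 0
```

Validating the returned index guards against an ensembler that answers with a number outside the candidate list, which is a plausible failure mode when the selector is an LLM.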

package/vault/wiki/concepts/Meta-Harness.md DELETED

@@ -1,16 +0,0 @@
----
-type: concept
-status: stub
-created: 2026-05-02
-updated: 2026-05-02
-tags: [concept, harness, meta-learning]
----
-# Meta-Harness
-Outer-loop harness optimization framework from Lee et al. (Stanford/Together AI). A harness that optimizes the inner harness — selecting best configurations, prompts, and patterns across multiple agent runs.
-## References
-- [[lee2026-meta-harness]]
-- [[self-evolving-harness]]

package/vault/wiki/concepts/Multi-Agent AI Coding Architecture.md DELETED

@@ -1,75 +0,0 @@
----
-type: concept
-title: "Multi-Agent AI Coding Architecture"
-created: 2026-05-03
-updated: 2026-05-03
-status: developing
-tags:
-  - multi-agent
-  - architecture
-  - agentic-coding
-  - harness
-related:
-  - "[[subagent-orchestration]]"
-  - "[[generator-evaluator-architecture]]"
-  - "[[agentic-harness]]"
-  - "[[Source: Lovable Architecture & Clone Analysis]]"
-  - "[[anthropic2026-harness-design]]"
-sources:
-  - "[[Source: Lovable Architecture & Clone Analysis]]"
-  - "[[anthropic2026-harness-design]]"
-  - "[[Source: OpenAI Harness Engineering — 0 Lines of Human Code]]"
-  - "[[Source: OpenDev — Building AI Coding Agents for the Terminal]]"
----
-# Multi-Agent AI Coding Architecture
-The decomposition of software engineering tasks across specialized agents, each with a defined role, input/output contract, and tool surface. This is the **universal pattern** across all successful AI coding platforms.
-## Three Common Decompositions
-### Lovable/Clone Pattern: Planner → Architect → Coder
-```
-User prompt → Planner (structured Plan) → Architect (TaskPlan) → Coder (files on disk)
-```
-- Each agent receives Pydantic-validated inputs
-- LangGraph orchestrates with conditional edges
-- Coder uses ReAct pattern with file system tools
-### Anthropic Pattern: Planner → Generator → Evaluator
-```
-User prompt → Planner (product spec) → Generator (implements) ⇄ Evaluator (grades)
-```
-- Generator and Evaluator negotiate "sprint contracts" before coding
-- Evaluator uses Playwright to actually click through the app
-- Hard thresholds on grading criteria — fall below any, sprint fails
-### OpenAI Pattern: Agent-to-Agent Review Loops
-```
-Codex generates → Codex reviews locally → Additional agent review (cloud) → Human/agent feedback → Iterate
-```
-- "Ralph Wiggum Loop": agent reviews its own changes, requests additional reviews, responds to feedback, and iterates until all agent reviewers are satisfied
-- Humans may review PRs but aren't required to
-- OpenAI pushed "almost all review effort towards being handled agent-to-agent"
-## First-Principles Architecture
-### 1. Separate Planning from Execution
-Do not let the same agent plan and code in one step. The Planner should have read-only tools only — structurally prevented from writing code. This forces deliberation before action and prevents premature implementation.
-### 2. Structured Handoffs Between Agents
-Every handoff must be a validated data contract, not free text. Pydantic schemas, typed dicts, or structured files. The downstream agent processes objects, not unstructured descriptions.
-### 3. Independent Evaluator with Hard Criteria
-The agent that builds cannot be trusted to evaluate. Separate evaluator with explicit, gradable criteria. Each criterion has a hard threshold — not negotiable. "Claude is a poor QA agent out of the box" — evaluator requires explicit tuning to be skeptical.
-### 4. Sprint Contracts (Agree on "Done" Before Work)
-Before coding starts, the implementer and evaluator negotiate what success looks like. This prevents scope creep and provides concrete verification targets. Communication via files, not chat.
-### 5. Tool Surface = Agent Capability Boundary
-Each agent's available tools define its actual capability — not its prompt, not its role description. Remove write tools from planners. Remove subagent-spawning from subagents. Make capabilities structural, not aspirational.
-## Relevance to Our Harness
-- L2 (Planning) should be a separate agent with read-only tools
-- L3 (Execution) should work from L2's structured output
-- L4 (Verification) needs hard criteria with thresholds, not narrative feedback
-- Sprint contracts between L2 and L4 before L3 begins
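
A minimal sketch of a sprint contract with hard thresholds, as described in principles 3 and 4 of the note: the contract is a typed object agreed before work starts, and the sprint fails if any criterion scores below its threshold. The `Criterion` and `SprintContract` types and the 0-to-1 score scale are assumptions for illustration; the note mentions Pydantic, while this sketch uses standard-library dataclasses to stay self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    threshold: float  # minimum acceptable score


@dataclass(frozen=True)
class SprintContract:
    goal: str
    criteria: tuple[Criterion, ...]


def grade(contract: SprintContract, scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, names of failed criteria).

    A criterion with no reported score counts as a failure, so the
    evaluator cannot pass a sprint by omitting a grade.
    """
    failed = [
        c.name
        for c in contract.criteria
        if scores.get(c.name, float("-inf")) < c.threshold
    ]
    return (not failed, failed)
```

Returning the failed criterion names gives the implementer a concrete target for the next iteration instead of narrative feedback.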

package/vault/wiki/concepts/Prompt Enhancement.md DELETED

@@ -1,90 +0,0 @@
----
-type: concept
-title: "Prompt Enhancement"
-created: 2026-04-30
-status: developing
-tags:
-  - prompt-engineering
-  - context
-  - retrieval
-aliases:
-  - Prompt Enrichment
-  - Context Injection
-related:
-  - "[[Context Engine (AI Coding)]]"
-  - "[[Semantic Codebase Indexing]]"
-sources:
-  - "[[Augment Code WorkOS ERC 2025]]"
-  - "[[Augment Code Codacy AI Giants]]"
-updated: 2026-05-02
----
-# Prompt Enhancement
-The process of automatically enriching a user's query with relevant codebase context before it reaches the LLM. The goal is to give the LLM the same understanding a senior engineer would have when approaching a task.
-## How Augment's Prompt Enhancer Works
-1. User types a query: "add logging to payment API."
-2. Context Engine semantically searches the codebase for relevant code.
-3. Enhancer constructs an augmented prompt containing:
-   - The original query.
-   - Relevant source files and their paths.
-   - Existing patterns (how logging is done elsewhere).
-   - Related utilities and libraries already in the codebase.
-   - Team conventions and coding standards.
-4. The augmented prompt is sent to the LLM.
-## Key Design Principles
-### Reuse Over Reinvention
-The enhancer actively detects existing utilities and libraries. In Augment's demo, when asked to add Git branch info to a status bar, the enhancer detected an existing internal Git library and guided the agent to use it instead of shelling out to git.
-### Context Budget Management
-The enhancer must balance context richness with token budget:
-- Retrieve only what's relevant (semantic search).
-- Compress retrieved context (summarize large files).
-- Rank by relevance, not just similarity.
-- Respect the model's context window.
-### Pattern Recognition
-The enhancer learns from the codebase:
-- Naming conventions.
-- Error handling patterns.
-- Import structure.
-- Testing patterns.
-- Architectural layering.
-## Implementation for Our Harness
-```python
-def enhance_prompt(query: str, workspace: str) -> str:
-    # 1. Semantic search for relevant code
-    relevant_files = semantic_search(query, workspace, top_k=10)
-    # 2. Extract patterns from relevant files
-    patterns = extract_patterns(relevant_files)
-    # 3. Find existing utilities/libraries
-    utilities = find_related_utilities(query, workspace)
-    # 4. Fetch wiki knowledge (our existing knowledge base)
-    wiki_context = query_wiki(query)
-    # 5. Build augmented prompt
-    return build_prompt(
-        query=query,
-        relevant_code=relevant_files,
-        patterns=patterns,
-        utilities=utilities,
-        wiki=wiki_context
-    )
-```
-## Integration with Existing Harness
-Our harness already has several context sources:
-- **lean-ctx**: Exact file retrieval (grep, find, read).
-- **wiki**: Architectural knowledge, research, patterns.
-- **ctx_knowledge**: Persistent project conventions and gotchas.
-Prompt enhancement would unify these into a preprocessing step before the main agent loop.
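
A minimal sketch of the context-budget step from the note: retrieved snippets are ranked by relevance and kept only while they fit the token budget, then assembled into the augmented prompt. The `Snippet` type, the four-characters-per-token estimate, and the prompt layout are assumptions for illustration; a real harness would use the model's tokenizer and its own retrieval scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    path: str
    text: str
    relevance: float  # higher means more relevant


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)


def select_within_budget(snippets: list[Snippet], budget: int) -> list[Snippet]:
    """Greedily keep the most relevant snippets that still fit the budget."""
    chosen: list[Snippet] = []
    remaining = budget
    for snippet in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        cost = estimate_tokens(snippet.text)
        if cost <= remaining:
            chosen.append(snippet)
            remaining -= cost
    return chosen


def build_prompt(query: str, snippets: list[Snippet]) -> str:
    """Assemble the augmented prompt: the task first, then each snippet."""
    parts = [f"Task: {query}"]
    parts.extend(f"--- {s.path} ---\n{s.text}" for s in snippets)
    return "\n\n".join(parts)
```

The greedy pass skips a snippet that is too large and keeps looking, so one oversized file does not crowd out smaller, still-relevant context.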

package/vault/wiki/concepts/Prompt Renderer.md DELETED

@@ -1,89 +0,0 @@
----
-type: concept
-title: "Prompt Renderer"
-created: 2026-05-02
-updated: 2026-05-02
-tags:
-  - prompt-renderer
-  - multi-model
-  - build-time-compilation
-  - harness
-status: developing
-related:
-  - "[[provider-native-prompting]]"
-  - "[[model-adaptive-harness]]"
-  - "[[research: Prompt Renderer for Multi-Model Agent Harness]]"
-sources:
-  - "[[Source: Build-Time Prompt Compilation Architecture]]"
-  - "[[Source: AgentBus Jinja2 Prompt Pipelines]]"
----
-# Prompt Renderer
-A build-time prompt compilation system that takes a **base prompt spec** (model-agnostic) and renders **per-model optimized prompts** by applying each model's official prompting conventions, substituting variables, and caching compiled output.
-## Architecture
-```
-Base Prompt Spec (JSON/YAML)
-        ↓
-  [Compile-time Renderer]
-        ↓
-┌───────┼───────┬─────────┐
-│ GPT   │Claude │Gemini   │  ← Per-model compiled prompts
-│.json  │.json  │.json    │
-└───────┴───────┴─────────┘
-        ↓
-  [npm package]              ← Shipped in lib
-        ↓
-  [Runtime] → load pre-compiled prompt → substitute runtime vars → send to LLM
-```
-## Key Properties
-- **Build-time, not runtime**: Compiler runs during `npm run build`, output shipped as JSON in npm package
-- **Base spec is model-agnostic**: Single source of truth that describes WHAT the prompt should do, not HOW
-- **Per-model renderers**: Each model gets a plugin that knows its official prompting conventions
-- **Variable system**: Two-phase — compile-time variables (resolved at build) vs runtime variables (resolved at call time)
-- **Caching layer**: Pre-compiled prompts are the cache — no runtime compilation, no warmup needed
-- **Deterministic**: Same spec + same renderer version → identical output (hash-verifiable)
-## Rendering Pipeline
-1. **Parse base spec**: Validate structure, required fields, variable declarations
-2. **Select model renderer**: Load per-model plugin (GPT, Claude, Gemini, etc.)
-3. **Apply model conventions**: XML tags for Claude, constraints-first for GPT, constraints-last for Gemini
-4. **Substitute compile-time variables**: Resolve all vars marked `compile: true`
-5. **Validate output**: Check token count, syntax, caching thresholds
-6. **Serialize**: Write compiled prompt to JSON with hash + metadata
-7. **Cache**: Store hash → compiled output for incremental builds
-## Model-Specific Rendering Rules
-| Convention | GPT (OpenAI) | Claude (Anthropic) | Gemini (Google) |
-|-----------|-------------|-------------------|-----------------|
-| System prompt | `system` role message | `system` parameter | `systemInstruction` |
-| Structure | Constraints-first, flat | XML tags, nesting OK | Constraints-last, plain text |
-| Instruction style | Outcome-first, shorter | Long-form, detailed | Multimodal-friendly |
-| Cache control | Auto (no code) | `cache_control: {type: "ephemeral"}` | Explicit context cache |
-| Output format | Function calling | Structured output API | Controlled generation |
-| Best practice source | platform.openai.com/docs/guides/prompt-engineering | docs.anthropic.com + interactive tutorial | cloud.google.com/vertex-ai/docs |
-## Variable Substitution
-Two-phase variable system:
-```yaml
-variables:
-  model_name: { type: string, compile: true }   # Resolved at build
-  user_query: { type: string, compile: false }   # Resolved at runtime
-  max_tokens: { type: number, compile: true, default: 4096 }
-```
-Compile-time variables produce multiple compiled variants if multiple values are specified (e.g., `model_name: [gpt-5.2, claude-sonnet-4.5]`).
-## Caching Strategy
-- **Build cache**: Incremental — only recompile prompts whose spec hash changed
-- **Output cache**: Compiled prompts stored by `{spec_hash}-{model}-{var_hash}.json`
-- **Runtime**: Zero cost — load pre-compiled JSON, substitute runtime vars, send
-- **npm distribution**: Compiled prompts are regular files in the package — no compilation code shipped
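
A minimal sketch of the two-phase variable system and the `{spec_hash}-{model}-{var_hash}.json` cache key described in the note. It assumes the base spec is a dict with a `template` field using `$name` placeholders, and it uses Python's `string.Template` purely for illustration; the note does not say which template syntax the renderer uses, and the per-model convention step is omitted here.

```python
import hashlib
import json
from string import Template


def short_hash(obj: object) -> str:
    """Stable short hash of a JSON-serializable value."""
    data = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:12]


def compile_prompt(spec: dict, model: str, compile_vars: dict) -> tuple[str, str]:
    """Resolve compile-time variables and leave runtime placeholders untouched.

    Returns (cache_file_name, compiled_template).
    """
    compiled = Template(spec["template"]).safe_substitute(compile_vars)
    name = f"{short_hash(spec)}-{model}-{short_hash(compile_vars)}.json"
    return name, compiled


def render_at_runtime(compiled: str, runtime_vars: dict) -> str:
    """Fill the remaining placeholders; raises KeyError if one is missing."""
    return Template(compiled).substitute(runtime_vars)
```

`safe_substitute` leaves unknown placeholders such as `$user_query` in place at build time, while the strict `substitute` at call time fails loudly if a runtime variable was forgotten. Sorting keys before hashing keeps the file name deterministic regardless of dict ordering.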

package/vault/wiki/concepts/Semantic Codebase Indexing.md DELETED

@@ -1,67 +0,0 @@
----
-type: concept
-title: "Semantic Codebase Indexing"
-created: 2026-04-30
-status: developing
-tags:
-  - code-indexing
-  - embeddings
-  - vector-search
-  - ast
-aliases:
-  - Code Embedding
-related:
-  - "[[Context Engine (AI Coding)]]"
-  - "[[Prompt Enhancement]]"
-sources:
-  - "[[Augment Context Engine Official]]"
-  - "[[Augment Code Codacy AI Giants]]"
-updated: 2026-05-02
----
-# Semantic Codebase Indexing
-The process of converting source code into vector embeddings that capture semantic meaning, enabling similarity search across a codebase without relying on exact keyword matching.
-## How It Works
-### 1. Code Chunking
-- Split source files into logical units: functions, classes, methods, modules.
-- Use tree-sitter AST parsing for language-aware chunk boundaries.
-- Typical chunk size: 200-500 tokens for optimal embedding quality.
-### 2. Embedding Generation
-- Pass each chunk through an embedding model.
-- Options: all-MiniLM-L6-v2 (384-dim, local), CodeBERT, or Voyage AI code embeddings.
-- Augment Code uses custom embedding models trained in pairs for maximum retrieval quality.
-### 3. Vector Database Storage
-- Store embeddings in LanceDB, ChromaDB, or Qdrant.
-- Index for fast approximate nearest neighbor (ANN) search.
-- Attach metadata: file path, line range, function/class name, dependencies.
-### 4. Real-time Sync
-- Watch filesystem for changes using watchdog/inotify.
-- Re-embed changed files incrementally.
-- Augment claims "millisecond-level sync."
-### 5. Hybrid Search
-- Combine vector similarity (semantic) + BM25/keyword (lexical).
-- Re-rank results by relevance, recency, and relationship proximity.
-## Why Semantic > Grep
-| Aspect | Grep/Keyword | Semantic Indexing |
-|--------|-------------|-------------------|
-| Finds related code | Only exact matches | Finds semantically similar code |
-| Understands intent | No | Yes — "payment logging" finds telemetry, billing, audit |
-| Cross-language | No | Partially — embeddings capture patterns |
-| Relationship aware | No | Yes — understands call graphs and imports |
-| Noise filtering | Manual | Automatic relevance ranking |
-## Implementation Stack (for our harness)
-- **Parser**: tree-sitter (18 languages via lean-ctx).
-- **Embeddings**: sentence-transformers (all-MiniLM-L6-v2) or voyage-code-2.
-- **Vector DB**: LanceDB (embedded, zero-config) or ChromaDB.
-- **Sync**: watchdog (Python).
-- **Search**: hybrid BM25 + cosine similarity with re-ranking.
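
A minimal sketch of the hybrid search step, assuming chunk embeddings are already computed and documents are pre-tokenized. It scores each document with BM25 and cosine similarity, min-max normalizes both, and blends them with a weight `alpha`. The weighted-sum fusion, the `alpha` parameter, and the BM25 constants `k1=1.5`, `b=0.75` are illustrative choices, not something the note specifies; the re-ranking by recency and relationship proximity is not covered.

```python
import math
from collections import Counter


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bm25_scores(query_terms: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """BM25 score of each tokenized document for the query terms."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def min_max(values: list[float]) -> list[float]:
    """Scale values to [0, 1]; all-equal input maps to zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query_terms: list[str], query_vec: list[float],
                docs: list[list[str]], doc_vecs: list[list[float]],
                alpha: float = 0.5) -> list[int]:
    """Return document indices, best first, by blended semantic/lexical score."""
    lexical = min_max(bm25_scores(query_terms, docs))
    semantic = min_max([cosine(query_vec, v) for v in doc_vecs])
    combined = [alpha * s + (1 - alpha) * l for s, l in zip(semantic, lexical)]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
```

With `alpha=0.5`, a chunk that shares no keyword with the query can still rank above an unrelated chunk when its embedding is close to the query embedding, which is the behavior the comparison table attributes to semantic indexing.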

package/vault/wiki/concepts/additive-config-hierarchy.md DELETED

@@ -1,16 +0,0 @@
----
-type: concept
-status: stub
-created: 2026-05-02
-updated: 2026-05-02
-tags: [concept, configuration, claude-code]
----
-# Additive Config Hierarchy
-Configuration pattern from Claude Code: config layers stack additively (CLAUDE.md → project-level → user-level → system-level) rather than overriding. Each layer adds context rather than replacing previous layers.
-## References
-- [[claude-code-architecture-karaxai-2026]]
-- [[harness-configuration-layers]]

package/vault/wiki/concepts/agent-artifacts-verifiable-deliverables.md DELETED

@@ -1,71 +0,0 @@
----
-type: concept
-title: "Agent Artifacts (Trust via Verifiable Deliverables)"
-status: developing
-created: 2026-05-01
-updated: 2026-05-01
-tags:
-  - antigravity
-  - verification
-  - trust
-  - harness-design
-aliases: ["Artifact system", "verifiable artifacts"]
-related:
-  - "[[adversarial-verification]]"
-  - "[[automated-observability]]"
-  - "[[harness-implementation-plan]]"
-  - "[[antigravity-agent-first-architecture]]"
-sources:
-  - "[[google-antigravity-official-blog]]"
-  - "[[cursor-vs-antigravity-2026]]"
----
-# Agent Artifacts: Trust via Verifiable Deliverables
-Google Antigravity's Artifact system replaces raw tool-call logs with human-readable, verifiable deliverables that agents generate as they work.
-## What Are Artifacts?
-Structured, verifiable outputs agents produce during execution:
-- Task lists and implementation plans
-- Screenshots and browser recordings
-- Walkthrough documents
-- Test result summaries
-- Architecture diagrams
-Artifacts represent work at a **task level**, not an API-call level. They are designed to be audited by humans, not parsed by machines.
-## How Artifacts Build Trust
-```
-Raw tool logs: "execute_command: npm install" → "exit 0" → "write_file: src/auth.ts" → ...
-Artifact: "Authentication migration plan" → "Screenshot: login page working" → "Test results: 23/23 pass"
-```
-The second format is reviewable in seconds. The first requires scrolling through hundreds of lines.
-## Feedback on Artifacts
-- Developers comment on artifacts (Google Docs-style commenting)
-- Agents incorporate feedback **without stopping execution**
-- Feedback is asynchronous: you comment, the agent picks it up at the next checkpoint
-- No need to restart tasks for mid-course corrections
-## Comparison with Our Harness
-| Dimension | Our Harness (L4 + L5) | Antigravity Artifacts |
-|-----------|----------------------|----------------------|
-| Verification type | Adversarial critic agents | Human-reviewable deliverables |
-| Feedback loop | Multi-round debate (selective) | Async comments on artifacts |
-| Trust mechanism | Critic proves work wrong | Agent proves work right |
-| Cost | LLM tokens (critic rounds) | Human attention (review artifacts) |
-## Gap Analysis
-Our L4 adversarial verification asks: "Is this correct?" (critic finds flaws).
-Antigravity's Artifacts ask: "Here's proof this is correct" (agent demonstrates success).
-These are **complementary**. The critic catches what the agent missed. The artifact proves what the agent got right. Both should exist in the harness.
-## Proposed Integration: Phase P31
-Add an **Artifact Generation Layer** after L4 verification. Agents generate screenshots, browser recordings, and test result summaries as verifiable proof of work. These artifacts feed into L5 observability and serve as the human-reviewable interface.
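
A minimal sketch of one piece of such an artifact layer: condensing per-test outcomes into a task-level, human-reviewable summary like the "Test results: 23/23 pass" example in the note. The `CheckResult` and `Artifact` types and the `test-summary` kind string are hypothetical; screenshots and browser recordings are outside what this sketch covers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool


@dataclass(frozen=True)
class Artifact:
    kind: str     # e.g. "test-summary"
    title: str
    summary: str  # one line a reviewer can read at a glance


def summarize_tests(title: str, results: list[CheckResult]) -> Artifact:
    """Condense raw per-test outcomes into a single reviewable artifact."""
    passed = sum(r.passed for r in results)
    failing = [r.name for r in results if not r.passed]
    summary = f"Test results: {passed}/{len(results)} pass"
    if failing:
        summary += "; failing: " + ", ".join(failing)
    return Artifact(kind="test-summary", title=title, summary=summary)
```

Listing the failing test names in the summary keeps the artifact reviewable in seconds while still pointing the reader at what to inspect.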