npm - ultimate-pi - Versions diffs - 0.1.2 → 0.1.4 - Mend

ultimate-pi 0.1.2 → 0.1.4

Files changed (516)

package/vault/wiki/questions/Research: Prompt Renderer for Multi-Model Agent Harness.md ADDED

@@ -0,0 +1,216 @@
+---
+type: synthesis
+title: "Research: Prompt Renderer for Multi-Model Agent Harness"
+created: 2026-05-02
+updated: 2026-05-02
+tags:
+  - research
+  - prompt-renderer
+  - multi-model
+  - build-time-compilation
+  - caching
+  - harness
+status: developing
+related:
+  - "[[Prompt Renderer]]"
+  - "[[Build-Time Prompt Compilation]]"
+  - "[[provider-native-prompting]]"
+  - "[[model-adaptive-harness]]"
+  - "[[harness-configuration-layers]]"
+  - "[[harness-implementation-plan]]"
+sources:
+  - "[[Source: Build-Time Prompt Compilation Architecture]]"
+  - "[[Source: AgentBus Jinja2 Prompt Pipelines]]"
+  - "[[Source: TianPan Prompt Caching Architecture]]"
+  - "[[Source: Arxiv — Don't Break the Cache]]"
+  - "[[openai-prompt-guidance]]"
+  - "[[anthropic-prompt-best-practices]]"
+  - "[[gemini-3-prompting-guide]]"
+---
+# Research: Prompt Renderer for Multi-Model Agent Harness
+## Overview
+Design a custom prompt renderer for the ultimate-pi agentic harness that takes a **base prompt spec** (model-agnostic), applies **per-model prompting best practices**, substitutes variables, uses a **caching layer** for cost optimization, and compiles rendered prompts at **build time** (not runtime) — shipped as compiled assets inside the npm library. This extends the existing [[provider-native-prompting]] concept with compilation, caching, and a two-phase variable system.
+## Key Findings
+1. **Build-time compilation is a proven architectural pattern but no mature off-the-shelf npm package exists.** The pattern is validated by Microsoft prompt-engine (2.8K stars, MIT — YAML-based prompt management, abandoned 2022) and PromptWeaver (`@iqai/prompt-weaver`, MIT, active Dec 2025 — Handlebars template compilation with Zod validation). The implementation is a DIY build pipeline: `js-yaml` (parse specs) + `@iqai/prompt-weaver` (template engine) + per-model renderer plugins → compiled JSON shipped in npm. No runtime template engine needed.
+2. **Strategic cache boundary control is essential** (Source: [[Source: Arxiv — Don't Break the Cache]]). Across 500 agent sessions and 4 flagship models, caching only the system prompt provides the most consistent benefits (41-80% cost reduction, 13-31% TTFT improvement). Full-context caching can paradoxically increase latency. The golden rule: static content first, dynamic content last. Compile-time rendering makes this trivial: all static content is in the compiled prompt, and runtime vars are appended at the end.
+3. **Multi-tier caching architecture is well understood** (Source: [[Source: TianPan Prompt Caching Architecture]]). Three tiers: Semantic cache (100% savings for exact/near-duplicate queries), Prefix cache (50-90% savings for shared static context), Full inference (0% savings). Build-time compilation eliminates the need for runtime prefix caching entirely — compiled prompts ARE the cache. The "parallel execution trap" (4% cache hit rate without warming) is irrelevant when prompts are pre-compiled.
+4. **Each model has fundamentally different prompting conventions** (Sources: [[openai-prompt-guidance]], [[anthropic-prompt-best-practices]], [[gemini-3-prompting-guide]]). OpenAI says constraints-first and outcome-first. Anthropic mandates XML tags and long-form reasoning. Google says constraints-LAST with plain text. A single canonical prompt relaxed per model is WRONG — each model needs a purpose-built renderer that applies its official conventions from a shared semantic spec.
+5. **Jinja2 template patterns are production-ready but runtime-only** (Source: [[Source: AgentBus Jinja2 Prompt Pipelines]]). The Jinja2 pattern (FileSystemLoader, template inheritance, conditionals, loops, pipeline runner) is excellent for prompt structure but designed for runtime. We adapt the pattern to build-time: templates are compiled to static JSON with placeholders for runtime variables, not rendered at request time.
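To make the "static content first, dynamic content last" rule from finding 2 concrete, here is a minimal sketch of the relevant part of a request body for Anthropic's Messages API. The compiled prompt is marked as a cacheable prefix and the per-request query follows it. The helper name `buildClaudeRequest` and the example strings are hypothetical, `model` and `max_tokens` are left out for brevity, and no network call is made.

```typescript
// Relevant subset of an Anthropic Messages API request body.
interface ClaudeRequest {
  system: { type: "text"; text: string; cache_control?: { type: "ephemeral" } }[];
  messages: { role: "user" | "assistant"; content: string }[];
}

// Hypothetical helper: the static compiled prompt goes first and is marked
// cacheable; the per-request query comes last so it never changes the prefix.
function buildClaudeRequest(compiledSystemPrompt: string, userQuery: string): ClaudeRequest {
  return {
    system: [
      { type: "text", text: compiledSystemPrompt, cache_control: { type: "ephemeral" } },
    ],
    messages: [{ role: "user", content: userQuery }],
  };
}
```

Because the compiled text is byte-identical across calls, every request shares the same prefix, which is the property prefix caches depend on.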
+## Architecture
+```
+┌────────────────────────────────────────────────────────┐
+│                       BUILD TIME                       │
+│                                                        │
+│  Base Prompt Spec (prompts/*.yaml)                     │
+│       │                                                │
+│       ▼                                                │
+│  ┌───────────────────┐                                 │
+│  │ Prompt Compiler   │  ← TypeScript build script      │
+│  │                   │                                 │
+│  │ • Parse YAML      │                                 │
+│  │ • Validate spec   │                                 │
+│  │ • Per-model       │  ← Renderer plugins             │
+│  │   renderers       │    (GPT, Claude, Gemini)        │
+│  │ • Substitute      │                                 │
+│  │   compile vars    │                                 │
+│  │ • Hash + cache    │                                 │
+│  └──────┬────────────┘                                 │
+│         │                                              │
+│         ▼                                              │
+│  Compiled Prompts (dist/prompts/*.json)                │
+│  ✓ Per-model variants                                  │
+│  ✓ Syntax-validated                                    │
+│  ✓ Token-count checked                                 │
+│  ✓ Hash-verified                                       │
+│  ✓ Shipped in npm package                              │
+│                                                        │
+└────────────────────────────────────────────────────────┘
+                            │
+                            ▼
+┌────────────────────────────────────────────────────────┐
+│                        RUNTIME                         │
+│                                                        │
+│  Load compiled prompt by {spec, model}                 │
+│       │                                                │
+│       ▼                                                │
+│  Substitute runtime variables                          │
+│  (user_query, context, etc.)                           │
+│       │                                                │
+│       ▼                                                │
+│  Send to LLM API                                       │
+│  (no template engine, no compilation, no cache warmup) │
+└────────────────────────────────────────────────────────┘
+```
+## Caching Layer Design
+### Build Cache (incremental compilation)
+```
+cache/
+└── compile-cache.json    # { spec_hash → compiled_output_hash }
+```
+Only recompile prompts whose spec hash changed since last build.
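The incremental rule above can be sketched as follows. This is a minimal illustration, assuming the cache is a plain object keyed by the SHA-256 of each spec's source text; the function names are hypothetical, and reading or writing `cache/compile-cache.json` is left out.

```typescript
import { createHash } from "node:crypto";

// Assumed shape of cache/compile-cache.json: spec hash → compiled output hash.
type CompileCache = Record<string, string>;

/** SHA-256 hex digest of a string. */
function sha256(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

/** Names of the specs whose current content hash is not yet in the cache. */
function specsToRecompile(
  specs: Record<string, string>, // spec name → raw YAML source
  cache: CompileCache,
): string[] {
  return Object.entries(specs)
    .filter(([, source]) => !(sha256(source) in cache))
    .map(([specName]) => specName);
}

/** Returns a new cache that records a finished compilation. */
function recordCompilation(cache: CompileCache, source: string, compiled: string): CompileCache {
  return { ...cache, [sha256(source)]: sha256(compiled) };
}
```

Keying on the content hash rather than on file modification time keeps the decision deterministic across machines and CI runs.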
+### Output Cache (compiled prompts)
+```
+dist/prompts/manifest.json  # { spec → { model → { hash, path, build_time } } }
+```
+Each compiled prompt is content-hashed for deterministic verification.
+### Runtime Cache (not needed)
+No runtime cache required — compiled prompts are static files loaded directly. Zero compilation latency, zero cache warming, zero parallel-execution traps.
+## Per-Model Rendering Rules
+| Dimension | GPT (OpenAI) | Claude (Anthropic) | Gemini (Google) |
+|-----------|-------------|-------------------|-----------------|
+| **System prompt** | `messages[0].role="system"` | `system` parameter | `systemInstruction` config |
+| **Structure** | Flat, constraints-first | XML tags (`<instructions>`) | Plain text, constraints-last |
+| **Instruction ordering** | Outcome → Constraints → Context | Role → Context → Task → XML | Context → Task → Constraints |
+| **Output format** | Function calling / JSON mode | Structured output API | Controlled generation / JSON |
+| **Cache mechanism** | Auto (prefix match) | `cache_control: {type: "ephemeral"}` | Explicit context cache |
+| **Best practice source** | platform.openai.com/docs/guides/prompt-engineering | docs.anthropic.com + interactive tutorial | cloud.google.com/vertex-ai/docs |
+| **Examples preference** | Few-shot inline | Few-shot with XML wrappers | Few-shot with clear separation |
+| **Token threshold** | 1,024 (cache min) | 1,024 (cache min) | 4,096 (cache min) |
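A rough sketch of how three renderer plugins could apply the ordering and system-prompt rows of the table to one shared spec. The `BaseSpec` fields and function names are hypothetical, the spec's `task` stands in for the "outcome", and the XML tag names other than `<instructions>` are illustrative choices rather than anything the providers mandate.

```typescript
// Hypothetical model-agnostic spec; field names are illustrative only.
interface BaseSpec {
  role: string;
  context: string;
  task: string;
  constraints: string[];
}

function bullets(items: string[]): string {
  return items.map((item) => `- ${item}`).join("\n");
}

// GPT: flat text in a system-role message; outcome, then constraints, then context.
function renderGpt(spec: BaseSpec): { messages: { role: "system"; content: string }[] } {
  const content = [spec.task, bullets(spec.constraints), spec.context].join("\n\n");
  return { messages: [{ role: "system", content }] };
}

// Claude: `system` parameter; role, context, then task wrapped in XML tags.
function renderClaude(spec: BaseSpec): { system: string } {
  const system = [
    spec.role,
    `<context>\n${spec.context}\n</context>`,
    `<instructions>\n${spec.task}\n${bullets(spec.constraints)}\n</instructions>`,
  ].join("\n\n");
  return { system };
}

// Gemini: `systemInstruction`; plain text with constraints placed last.
function renderGemini(spec: BaseSpec): { systemInstruction: string } {
  const systemInstruction = [spec.context, spec.task, bullets(spec.constraints)].join("\n\n");
  return { systemInstruction };
}
```

Each renderer reads the same semantic fields and differs only in ordering, wrapping, and the output envelope, which is what lets one spec produce per-model variants.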
+## Variable System
+Two-phase variable resolution:
+```typescript
+interface PromptVariable {
+  name: string;
+  type: 'string' | 'number' | 'boolean' | 'json';
+  phase: 'compile' | 'runtime';
+  default?: unknown;
+  required: boolean;
+}
+```
+- **Compile-time vars** (`phase: 'compile'`): Resolved at build time. Multiple values produce multiple compiled variants. Example: `model_name: [gpt-5.2, claude-sonnet-4.5]` → 2 compiled prompts.
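+- **Runtime vars** (`phase: 'runtime'`): Resolved at call time. Left as placeholders in the compiled output (the implementation plan fixes the format as `__VAR_name__`, chosen to avoid colliding with template syntax such as `{{...}}`). Substituted by a lightweight runtime function (no template engine needed, just a simple string replace).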
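The two phases can be sketched as below. This is a simplified illustration, not the planned compiler: a single regex over `{{name}}` stands in for the Handlebars layer, runtime variables are rewritten to the `__VAR_name__` placeholder format named in the implementation plan, and the helper names `expandCompileVars` and `compileTemplate` are hypothetical.

```typescript
interface PromptVariable {
  name: string;
  type: "string" | "number" | "boolean" | "json";
  phase: "compile" | "runtime";
  default?: unknown;
  required: boolean;
}

// One binding per combination of compile-time values (cartesian product).
function expandCompileVars(values: Record<string, string[]>): Record<string, string>[] {
  let combos: Record<string, string>[] = [{}];
  for (const [varName, options] of Object.entries(values)) {
    const next: Record<string, string>[] = [];
    for (const combo of combos) {
      for (const option of options) next.push({ ...combo, [varName]: option });
    }
    combos = next;
  }
  return combos;
}

// Resolves compile-phase variables and leaves runtime-phase ones as placeholders.
function compileTemplate(
  template: string,
  variables: PromptVariable[],
  compileValues: Record<string, string>,
): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match: string, varName: string) => {
    const decl = variables.find((v) => v.name === varName);
    if (decl === undefined) throw new Error(`Undeclared variable: ${varName}`);
    if (decl.phase === "runtime") return `__VAR_${varName}__`;
    const value = compileValues[varName] ?? decl.default;
    if (value === undefined) throw new Error(`Missing compile-time value: ${varName}`);
    return String(value);
  });
}
```

With `model_name` declared as a compile-time variable with two values, `expandCompileVars` yields two bindings and therefore two compiled variants, each still carrying the runtime placeholder.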
+## npm Package Structure
+```
+@ultimate-pi/harness/
+├── dist/
+│   ├── prompts/
+│   │   ├── gpt/
+│   │   │   ├── system.json        # Compiled system prompt for GPT
+│   │   │   ├── spec-hardening.json
+│   │   │   └── verify.json
+│   │   ├── claude/
+│   │   │   ├── system.json        # Compiled system prompt for Claude
+│   │   │   └── ...
+│   │   └── gemini/
+│   │       └── ...
+│   ├── manifest.json              # Build manifest
+│   └── renderers/
+│       ├── gpt-renderer.js        # Renderer plugins (only if runtime rendering needed)
+│       └── ...
+├── prompts/                       # Source specs (for development)
+│   ├── base/
+│   │   ├── system.yaml
+│   │   └── verify.yaml
+│   └── fragments/
+│       └── common.yaml
+├── scripts/
+│   └── compile-prompts.ts         # Build script
+└── src/
+    └── runtime/
+        └── prompt-loader.ts       # Runtime loader (reads compiled JSON)
+```
+## Implementation Plan (integrated into harness)
+### Phase 1: Compiler Core
+- TypeScript build script that reads YAML specs → validates → applies per-model renderers → outputs compiled JSON
+- Supported models: GPT-5.2, Claude Sonnet 4.5, Gemini 2.5 Pro (extensible plugin system)
+- Deterministic builds with SHA-256 manifest
+- Integration: `npm run compile-prompts` as build step
+### Phase 2: Per-Model Renderers
+- GPT renderer: constraints-first, flat structure, outcome-first ordering, system role message
+- Claude renderer: XML tags, long-form structure, cache_control markers, system parameter
+- Gemini renderer: constraints-last, plain text, systemInstruction, context cache config
+### Phase 3: Variable System
+- Two-phase variable resolution with type checking
+- Compile-time multi-value expansion
+- Runtime placeholder format: `__VAR_name__` (avoid collision with any template syntax)
+### Phase 4: Caching
+- Incremental build cache (recompile only changed specs)
+- Compiled prompts shipped as static JSON in npm (no runtime compilation)
+- Content-hash verification for deterministic builds
+### Phase 5: Runtime Integration
+- `loadPrompt(specName, model, runtimeVars)` function
+- Zero-dependency runtime (just JSON.parse + string replace)
+- Type-safe with TypeScript types for all compiled prompts
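A minimal sketch of what the Phase 5 loader might look like, assuming each compiled file holds a `hash` and a `text` field and that placeholders use the `__VAR_name__` format from Phase 3. The compiled JSON is inlined in a lookup table here so the example runs without filesystem access; a real loader would read from `dist/prompts/`. The `CompiledPrompt` shape and the sample prompt are assumptions.

```typescript
type ModelId = "gpt" | "claude" | "gemini";

// Assumed shape of one compiled prompt file.
interface CompiledPrompt {
  hash: string;
  text: string;
}

// Stand-in for dist/prompts/<model>/<spec>.json, inlined for the sketch.
const compiledFiles: Record<string, string> = {
  "gpt/verify": '{"hash":"demo-hash","text":"Verify the claim: __VAR_claim__"}',
};

function loadPrompt(
  specName: string,
  model: ModelId,
  runtimeVars: Record<string, string>,
): string {
  const raw = compiledFiles[`${model}/${specName}`];
  if (raw === undefined) {
    throw new Error(`No compiled prompt for ${model}/${specName}`);
  }
  const compiled = JSON.parse(raw) as CompiledPrompt;
  let text = compiled.text;
  for (const [varName, value] of Object.entries(runtimeVars)) {
    // split/join replaces every occurrence without any regex escaping.
    text = text.split(`__VAR_${varName}__`).join(value);
  }
  // Fail loudly rather than send an unresolved placeholder to the model.
  if (/__VAR_\w+__/.test(text)) {
    throw new Error(`Unresolved runtime placeholder in ${model}/${specName}`);
  }
  return text;
}
```

The only operations at call time are `JSON.parse` and string replacement, which matches the zero-dependency goal stated above.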
+## Contradictions
+- [[Source: AgentBus Jinja2 Prompt Pipelines]] advocates runtime template rendering with Jinja2 (Python). Our design deliberately avoids this — pre-compiling at build time eliminates template engine dependency, reduces runtime overhead to zero, and makes prompts auditable static assets. The contradiction is resolved by recognizing that Jinja2's patterns (inheritance, blocks, pipelines) are excellent for prompt STRUCTURE but should be resolved at build time, not runtime.
+- [[Source: TianPan Prompt Caching Architecture]] describes runtime prefix caching with cache warming. Our design makes this mostly irrelevant — when prompts are pre-compiled and shipped in npm, there is no runtime prefix to cache. However, the multi-tier caching insight (semantic → prefix → full) remains valuable for the broader harness caching strategy beyond prompt rendering.
+## Open Questions
+1. **What template syntax for base specs?** YAML with JSON Schema validation is the practical choice. PromptWeaver's Handlebars syntax provides the template layer. Microsoft prompt-engine validated the YAML pattern. JSON Schema (or Zod, integrated with PromptWeaver) provides better validation than raw YAML parsing alone. YAML stays human-friendly for spec authors.
+2. **How to handle prompt versioning across npm releases?** Compiled prompts must be versioned with the harness. Semantic versioning for prompts: major = breaking spec change, minor = new prompt added, patch = rendering tweak. The build manifest provides traceability.
+3. **What about custom/fine-tuned models?** The renderer plugin system should support user-defined renderers for custom models. Default: fall back to "generic" renderer that produces a neutral format.
+4. **How to test compiled prompts?** Each compiled variant needs automated testing. PromptWeaver's Zod schema validation checks structure and types at compile time. Token thresholds checked against model-specific limits. Semantic testing (does the compiled prompt produce expected behavior?) requires sending to the target model — this is a separate integration test concern.
+5. **What happens when a provider changes its API format?** Compiled prompts for that provider become stale. The build manifest tracks renderer version — recompilation produces updated prompts. A CI check should flag prompts compiled with outdated renderer versions.
+6. **Where does token budget allocation fit?** The base spec should declare expected token budgets. The compiler validates that compiled prompts don't exceed model limits. Budget allocation is a prompt design concern, not a renderer concern — but the renderer enforces it.
+7. **Does the renderer need to support chat message arrays (multi-turn)?** Yes — the base spec should support defining multi-message prompts (system + examples + user template). The renderer compiles the full message array structure per model's expected format.
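The CI check suggested in question 5 could look roughly like this. It is a sketch under assumptions: the manifest follows the `spec → model → entry` layout described earlier, and each entry carries a `rendererVersion` field, which is a hypothetical addition to the `hash`, `path`, and `build_time` fields the manifest is said to hold.

```typescript
// `rendererVersion` is an assumed field, not part of the documented manifest.
interface ManifestEntry {
  hash: string;
  path: string;
  build_time: string;
  rendererVersion: string;
}

// spec name → model name → entry
type Manifest = Record<string, Record<string, ManifestEntry>>;

/** Lists "spec/model" pairs compiled with a renderer version other than the current one. */
function findStalePrompts(
  manifest: Manifest,
  currentRendererVersions: Record<string, string>, // model name → current renderer version
): string[] {
  const stale: string[] = [];
  for (const [specName, models] of Object.entries(manifest)) {
    for (const [model, entry] of Object.entries(models)) {
      if (entry.rendererVersion !== currentRendererVersions[model]) {
        stale.push(`${specName}/${model}`);
      }
    }
  }
  return stale;
}
```

A CI job could fail the build whenever this list is non-empty, forcing a recompile before release.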

package/vault/wiki/questions/Research: Skill-First Harness Architecture.md ADDED

@@ -0,0 +1,91 @@
+---
+type: synthesis
+title: "Research: Skill-First MVP & Harness Implementation Architecture"
+created: 2026-05-03
+updated: 2026-05-03
+tags:
+  - research
+  - harness
+  - mvp
+  - skills
+  - architecture
+  - first-principles
+status: developing
+related:
+  - "[[skill-first-architecture]]"
+  - "[[harness-implementation-plan]]"
+  - "[[mvp-implementation-blueprint]]"
+  - "[[agent-skills-pattern]]"
+  - "[[drift-detection-unified]]"
+  - "[[harness-engineering-first-principles]]"
+  - "[[adr-015]]"
+sources:
+  - "[[Source: SwirlAI Agent Skills Progressive Disclosure]]"
+  - "[[Source: Claude API Agent Skills Overview]]"
+  - "[[Source: Blake Crosley Agent Architecture Guide]]"
+---
+# Research: Skill-First MVP & Harness Implementation Architecture
+## Overview
+Rethought the entire MVP and harness implementation plans from first principles. The core insight: the harness is NOT a code pipeline; it is a skill coordination layer. 80% of harness functionality can be markdown-based skills (`SKILL.md` files) loaded on-demand via progressive disclosure. Only deterministic infrastructure needs code: the drift monitor (real-time pattern matching), shared types, and config. Event wiring is left to pi's built-in event bus. Everything else (spec hardening, planning, adversarial verification, observability, memory) is an LLM-invoked skill. This cuts the code surface from ~15 TypeScript files to 3, while gaining auto-activation, progressive disclosure, and zero-compile iteration speed.
+## Key Findings
+- **Skills are the atomic unit of harness behavior.** Validated by Anthropic's open standard (Dec 2025), adopted by OpenAI, Google, GitHub, Cursor within weeks. Skills use three-tier progressive disclosure: Discovery (~80 tokens/skill), Activation (~2,000 tokens), Execution (unlimited supporting files). (Source: [[Source: SwirlAI Agent Skills Progressive Disclosure]])
+- **Code is for determinism, not logic.** Hooks guarantee execution (exit code 2 blocks). Skills are probabilistic (model decides when to activate). The drift monitor MUST be code because it runs deterministic pattern matching on every `tool_result` event with sliding windows. Everything else is LLM evaluation and SHOULD be a skill. (Source: [[Source: Blake Crosley Agent Architecture Guide]])
+- **The harness pattern is hooks→skills→agents→workflows.** Claude Code's architecture (22+ lifecycle events, markdown skills, subagents, filesystem memory) validates that the harness is "a programmable runtime with an LLM kernel" — not a TypeScript codebase. (Source: [[Source: Blake Crosley Agent Architecture Guide]])
+- **Skills compose with hooks.** Skills can define their own hooks in YAML frontmatter that activate only while the skill runs. This creates domain-specific deterministic behavior without polluting other sessions. (Source: [[Source: Blake Crosley Agent Architecture Guide]])
+- **Markdown skills ARE the spec.** No separate spec files. The SKILL.md body is simultaneously the specification, the implementation instructions, and the documentation. Supporting files (reference.md, scripts/) provide execution-layer resources. (Source: [[Source: Claude API Agent Skills Overview]])
+- **Pi's built-in event bus handles routing.** No custom event bus needed — pi's native event system wires events to skill invocations. Pipeline ordering is enforced by skill activation sequence, not imperative code.
+- **File count drops from 15 to 3 TypeScript files.** `src/harness/drift-monitor.ts` (drift detection), `src/harness/types.ts` (shared types), `src/harness/config.ts` (config loader). All other functionality becomes `.pi/skills/harness-*/SKILL.md` files.
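To illustrate why the drift monitor is kept as code, here is a minimal sketch of deterministic pattern matching over a sliding window of `tool_result` events. The class shape, the event fields, and the "N matches within the last M events" rule are all assumptions made for illustration; the actual `drift-monitor.ts` is not shown in this diff.

```typescript
// Hypothetical event payload; only `output` is inspected here.
interface ToolResultEvent {
  tool: string;
  output: string;
}

class DriftMonitor {
  private readonly hits: boolean[] = [];
  private readonly pattern: RegExp;
  private readonly windowSize: number;
  private readonly threshold: number;

  // Pass a non-global RegExp: `test` on a /g regex keeps state between calls.
  constructor(pattern: RegExp, windowSize: number, threshold: number) {
    this.pattern = pattern;
    this.windowSize = windowSize;
    this.threshold = threshold;
  }

  /** Records one event; returns true when matches in the window reach the threshold. */
  observe(event: ToolResultEvent): boolean {
    this.hits.push(this.pattern.test(event.output));
    if (this.hits.length > this.windowSize) this.hits.shift();
    return this.hits.filter(Boolean).length >= this.threshold;
  }
}
```

The same input sequence always produces the same verdicts, which is the guarantee a probabilistic skill cannot offer.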
+## Key Entities
+- [[Anthropic]]: Released Agent Skills open standard Dec 18, 2025. Three-tier progressive disclosure. Adopted industry-wide within weeks.
+- [[OpenAI]]: Adopted skills for Codex CLI and ChatGPT.
+- [[Google]]: Added skills to Gemini CLI.
+- [[GitHub Copilot]]: Launched skills support same day as standard.
+- [[Cursor]]: Integrated skills alongside Rules system.
+## Key Concepts
+- [[skill-first-architecture]]: Harness layers as markdown-based skills instead of TypeScript code modules. Only deterministic infrastructure remains as code.
+- [[progressive-disclosure-agents]]: Three-tier loading: metadata (always, ~80 tokens/skill) → full SKILL.md (when relevant, ~2,000 tokens) → supporting files (on demand, unlimited).
+- [[agent-skills-pattern]]: Progressive disclosure as a system design pattern. Context windows are finite and lossy — skills keep context lean.
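The first two tiers of progressive disclosure can be sketched as below. This is an illustrative model only: the `Skill` shape and function names are hypothetical, and `loadBody` stands in for reading a `SKILL.md` file from disk.

```typescript
interface Skill {
  name: string;
  description: string;    // tier 1: always present in context
  loadBody: () => string; // tier 2: full SKILL.md, read only on activation
}

// Tier 1 (discovery): only names and descriptions enter the context window.
function discoveryContext(skills: Skill[]): string {
  return skills.map((s) => `${s.name}: ${s.description}`).join("\n");
}

// Tier 2 (activation): the body is loaded for the one skill that is relevant.
function activateSkill(skills: Skill[], skillName: string): string {
  const skill = skills.find((s) => s.name === skillName);
  if (skill === undefined) throw new Error(`Unknown skill: ${skillName}`);
  return skill.loadBody();
}
```

Keeping the body behind a function call means the cost of a skill is paid only when it is activated, which is what keeps the per-skill discovery footprint small.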
+## Architecture Comparison
+| Dimension | Old Plan (Code-First) | New Plan (Skill-First) |
+|-----------|----------------------|------------------------|
+| L1 Spec Hardening | `src/harness/l1-spec.ts` (~300 lines TS) | `.pi/skills/harness-spec/SKILL.md` (markdown) |
+| L2 Planning | `src/harness/l2-planner.ts` (~400 lines TS) | `.pi/skills/harness-plan/SKILL.md` (markdown) |
+| L2.5 Drift Monitor | `src/harness/l2.5-drift.ts` (~500 lines TS) | **KEPT AS CODE** — deterministic pattern matching |
+| L4 Adversarial | `src/harness/l4-critics.ts` (~300 lines TS) | `.pi/skills/harness-critic/SKILL.md` + `.pi/agents/critic.md` |
+| P20 Gate | `src/harness/p20-gate.ts` (~100 lines TS) | `.pi/skills/harness-gate/SKILL.md` (bash commands) |
+| L5 Observability | `src/harness/l5-observability.ts` (~200 lines TS) | `.pi/skills/harness-observe/SKILL.md` (markdown) |
+| L6 Memory | `src/harness/l6-memory.ts` (~150 lines TS) | Already wiki-based (claude-obsidian skills) |
+| Event Bus | ~~`src/harness/events.ts`~~ (~200 lines TS) | **REMOVED** — pi's built-in event bus handles routing (2026-05-04) |
+| Types + Config | `src/harness/types.ts` + `config.ts` (~300 lines) | **KEPT AS CODE** — shared infrastructure |
+| **Total TS files** | **~15 files, ~2,500 lines** | **~3 files, ~600 lines** |
+| **Total skill files** | **0** | **6 SKILL.md files + supporting** |
+## Contradictions
+- None identified. All three sources converge on the same architecture: skills for domain expertise, hooks for deterministic enforcement, code only when determinism is required. The skill-first approach is independently validated by Anthropic, Microsoft, OpenAI, and Google within a 4-month window (Dec 2025–Mar 2026).
+## Open Questions
+- **How does pi's skill system handle skill-to-skill invocation?** Can a harness skill invoke the next pipeline skill programmatically, or does pi's built-in event bus need to sequence them? Pi's event bus likely sequences — each skill returns, pi fires next hook.
+- **Can pi skills define hooks in frontmatter?** Claude Code skills can. If pi doesn't support this, hooks must remain in pi's event system or `.pi/settings.json`.
+- **What is the skill context budget in pi?** Claude Code uses 2% of context window with 16,000 char fallback. Pi's budget is unknown.
+- **Skill caching behavior.** Smarter implementations cache recently used skills. Does pi reload SKILL.md from disk every activation or cache? This affects drift monitor → spec hardening reinvocation performance.
+- **How are skill-generated artifacts stored?** L2 planning generates YAML plan files. Can skills write to `.pi/harness/plans/` directly, or does pi's event system broker file writes?
+- **Skill version pinning across releases.** When harness skills ship in pi package, compiled prompts vs live markdown: which approach? The research shows build-time compilation is valid but adds complexity vs live markdown that can be user-edited.
+## Sources
+- [[Source: SwirlAI Agent Skills Progressive Disclosure]]: Mar 11, 2026. Three-tier architecture, ecosystem adoption speed, progressive disclosure as system design pattern.
+- [[Source: Claude API Agent Skills Overview]]: Official docs. Filesystem-based skill architecture, three loading levels, security considerations.
+- [[Source: Blake Crosley Agent Architecture Guide]]: Apr 29, 2026. Complete harness pattern: hooks, skills, subagents, multi-agent orchestration, memory, production patterns.

package/vault/wiki/questions/Research: TypeScript Best Practices and Codebase Structure.md ADDED

@@ -0,0 +1,88 @@
+---
+type: synthesis
+title: "Research: TypeScript Best Practices and Codebase Structure"
+created: 2026-05-02
+updated: 2026-05-02
+tags:
+  - research
+  - typescript
+  - best-practices
+  - codebase-structure
+status: developing
+related:
+  - "[[ts-strict-mode-rishikc]]"
+  - "[[ts-runtimes-comparison-betterstack]]"
+  - "[[barrel-files-tkdodo]]"
+  - "[[ts-monorepo-koerselman]]"
+  - "[[vitest-official]]"
+  - "[[ts-folder-structure-mingyang]]"
+  - "[[ts-best-practices-2025-devto]]"
+  - "[[ts-result-error-handling-kkalamarski]]"
+  - "[[typescript-strict-mode]]"
+  - "[[barrel-files]]"
+  - "[[monorepo-architecture]]"
+  - "[[result-monad-error-handling]]"
+  - "[[javascript-runtimes]]"
+  - "[[vitest]]"
+sources:
+  - "[[ts-strict-mode-rishikc]]"
+  - "[[ts-runtimes-comparison-betterstack]]"
+  - "[[barrel-files-tkdodo]]"
+  - "[[ts-monorepo-koerselman]]"
+  - "[[vitest-official]]"
+  - "[[ts-folder-structure-mingyang]]"
+  - "[[ts-best-practices-2025-devto]]"
+  - "[[ts-result-error-handling-kkalamarski]]"
+---
+# Research: TypeScript Best Practices and Codebase Structure
+## Overview
+Research across 8 authoritative sources covering TypeScript compiler configuration, runtime selection, code organization patterns, monorepo strategies, testing frameworks, and error handling approaches. The ecosystem has matured significantly: strict mode is the default, barrel files are discouraged, monorepo tooling is production-ready, and type-safe API patterns (tRPC) are gaining adoption.
+## Key Findings
+- **Enable `strict: true` by default** for all new TypeScript projects. `strictNullChecks` alone eliminates a major class of null-reference production bugs. Migrate existing codebases incrementally — one strict flag at a time. (Source: [[ts-strict-mode-rishikc]])
+- **Avoid barrel files (`index.ts` re-exports) in application code**. Barrel files cause circular imports and inflate the module graph, which slows dev servers; in one real production measurement, removing them cut the number of loaded modules by 68%. Libraries are the only valid use case. (Source: [[barrel-files-tkdodo]])
+- **Bun is the fastest runtime** (52K req/s vs Node 13K), but Node.js remains the safe choice for production due to ecosystem maturity and backporting of Bun/Deno features. (Source: [[ts-runtimes-comparison-betterstack]])
+- **Built-package strategy with Turborepo** is preferred for TypeScript monorepos. Build packages to JS with a bundler (tsup, Rslib), use TypeScript project references, and generate `.d.ts.map` files for IDE go-to-definition. (Source: [[ts-monorepo-koerselman]])
+- **Vitest has replaced Jest** as the default test runner for new TypeScript projects. Vite-native, Jest-compatible API, smart watch mode. (Source: [[vitest-official]])
+- **Name backend folders by technical capability** (controllers, services, repositories), not by business feature. Feature-based structure works better for frontend. Separate database logic from business logic. (Source: [[ts-folder-structure-mingyang]])
+- **`Result<Ok, Err>` monad pattern** enables declarative error handling — errors are values, not exceptions. Wrap early, unwrap late. Gaining adoption via libraries like neverthrow and effect-ts. (Source: [[ts-result-error-handling-kkalamarski]])
+- **ESLint `@typescript-eslint/recommended-type-checked`** pairs with strict mode for defense-in-depth. Strict mode catches type issues; ESLint catches floating promises and behavioral bugs. (Sources: [[ts-strict-mode-rishikc]], [[ts-best-practices-2025-devto]])
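As a small illustration of the null-reference class of bugs that `strictNullChecks` removes, consider a lookup that may miss. The function name is hypothetical; the commented-out line is the version the compiler rejects under `strict: true`.

```typescript
// Map.get returns `string | undefined`, so under strictNullChecks the
// missing-key case has to be handled before the value can be used.
function displayName(users: Map<string, string>, id: string): string {
  const found = users.get(id);
  // return found.toUpperCase(); // rejected: 'found' is possibly 'undefined'
  return found === undefined ? "UNKNOWN" : found.toUpperCase();
}
```

Without the flag, the commented-out line compiles and fails at runtime only for ids that are absent from the map.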
+## Key Entities
+- [[vitest]]: Vite-native test framework, Jest-compatible, v4.1.5 (2026)
+- [[javascript-runtimes]]: Node.js (stable, mature), Deno (secure, tooling-rich), Bun (fast, drop-in Node.js replacement)
+## Key Concepts
+- [[typescript-strict-mode]]: The `"strict": true` compiler flag enables 8+ sub-checks
+- [[barrel-files]]: Re-export files — useful for libraries, harmful for app code
+- [[monorepo-architecture]]: Single repo, multiple packages — built-package vs internal-packages strategies
+- [[result-monad-error-handling]]: Functional error handling — `Result<Ok, Err>` with map/flatMap/match
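A hand-rolled sketch of the `Result` pattern with `map`, `flatMap`, and `match`. It is not the API of neverthrow or effect-ts, and `parsePort` is a hypothetical example function.

```typescript
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

function ok<T>(value: T): Result<T, never> {
  return { ok: true, value };
}

function err<E>(error: E): Result<never, E> {
  return { ok: false, error };
}

// Transforms the success value; errors pass through untouched.
function map<T, U, E>(r: Result<T, E>, f: (value: T) => U): Result<U, E> {
  return r.ok ? ok(f(r.value)) : r;
}

// Chains a step that can itself fail.
function flatMap<T, U, E>(r: Result<T, E>, f: (value: T) => Result<U, E>): Result<U, E> {
  return r.ok ? f(r.value) : r;
}

// Unwraps at the edge: both branches must be handled.
function match<T, E, R>(r: Result<T, E>, onOk: (value: T) => R, onErr: (error: E) => R): R {
  return r.ok ? onOk(r.value) : onErr(r.error);
}

// Wrap early: parsing returns a value describing failure instead of throwing.
function parsePort(text: string): Result<number, string> {
  const port = Number(text);
  return Number.isInteger(port) && port > 0 && port < 65536
    ? ok(port)
    : err(`invalid port: ${text}`);
}
```

Callers compose with `map` and `flatMap` and only call `match` at the boundary, which is the "wrap early, unwrap late" discipline described above.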
+## Contradictions
+- **Barrel files**: Traditional advice says barrel files clean up imports; TkDodo (2024) demonstrates they cause circular imports and 68% module bloat. Consensus is shifting toward direct imports for app code. Resolution: Use barrels only for library entry points. (Sources: [[barrel-files-tkdodo]] vs common practice)
+- **Folder structure**: Mingyang Li argues for technical-capability folders on backend (Clean Architecture). Vertical Slice advocates argue feature-based folders reduce context switching. Resolution: Technical structure for backend stability, feature structure for frontend adaptability. (Source: [[ts-folder-structure-mingyang]])
+- **Built vs source-only packages**: Koerselman prefers building packages with bundlers for caching and ESM compatibility. Turborepo team's blog argues source-only is simpler and often sufficient. Resolution: Depends on project size. Small teams: source-only. Large teams with CI/CD: built-package. (Source: [[ts-monorepo-koerselman]])
+## Open Questions
+- How does tRPC compare to traditional REST in non-TypeScript environments? (Research focused on TS-TS stacks)
+- What is the adoption rate of Biome (Rust-based linter/formatter) vs ESLint+Prettier in 2026?
+- Are there published benchmarks for `isolatedModules: true` performance impact in large monorepos?
+- How does the Oxc-based TypeScript transpiler (used by Vitest) compare to SWC and ESBuild for type stripping?
+## Sources
+- [[ts-strict-mode-rishikc]]: Rishi Kumar Chawda, 2021/2026 — comprehensive strict mode guide
+- [[ts-runtimes-comparison-betterstack]]: Stanley Ulili, 2026 — Node.js vs Deno vs Bun comparison with benchmarks
+- [[barrel-files-tkdodo]]: Dominik Dorfmeister, 2024 — argument against barrel files with performance data
+- [[ts-monorepo-koerselman]]: Thijs Koerselman, 2023/2026 — deep dive into TS monorepo patterns
+- [[vitest-official]]: Vitest contributors, 2026 — official testing framework documentation
+- [[ts-folder-structure-mingyang]]: Mingyang Li, 2024 — production-grade Node.js/TS folder structure
+- [[ts-best-practices-2025-devto]]: Mitu M, 2025 — broad overview of 2025 best practices
+- [[ts-result-error-handling-kkalamarski]]: Krzysztof Kalamarski, 2022 — Result monad pattern implementation

package/vault/wiki/questions/Research: TypeScript Execution Layer for Agent Tool Calling.md ADDED

@@ -0,0 +1,81 @@
+---
+type: synthesis
+title: "Research: TypeScript Execution Layer for Agent Tool Calling"
+created: 2026-05-01
+updated: 2026-05-01
+tags:
+  - research
+  - agent-tools
+  - typescript-execution-layer
+  - harness
+status: developing
+related:
+  - "[[ts-execution-layer]]"
+  - "[[mcp-tool-routing]]"
+  - "[[agentic-harness-context-enforcement]]"
+  - "[[think-in-code-enforcement]]"
+  - "[[harness-implementation-plan]]"
+sources:
+  - "[[codeact-apple-2024]]"
+  - "[[cloudflare-codemode]]"
+  - "[[executor-rhyssullivan]]"
+  - "[[colinmcnamara-context-optimization-codemode]]"
+---
+# Research: TypeScript Execution Layer for Agent Tool Calling
+## Overview
+The TypeScript execution layer pattern replaces flat tool calling with a single "write code" tool plus a sandboxed TypeScript runtime. Research across 4 sources (1 academic paper, 2 production systems, 1 analysis) confirms this pattern reduces context by 3-4x, improves multi-tool success rates by ~20%, and reduces interaction turns by ~30%. The pattern is validated at production scale by Cloudflare (Code Mode) and the open-source Executor project (1.3K stars). It directly addresses the **tool context bloat problem** identified in MCP-heavy agent architectures and complements our existing Think-in-Code enforcement (P14).
+## Key Findings
+- **20% higher success rate on multi-tool tasks** when agents write Python/TypeScript code instead of JSON tool calls (CodeAct, ICML 2024, tested on 17 LLMs). This is a capability improvement, not just a context optimization. (Source: [[codeact-apple-2024]])
+- **~3-4x context reduction**: Code Mode uses ~3,100 tokens vs ~10,500+ tokens for traditional tool calling per interaction. The LLM only sees type definitions and final results — intermediate tool call/response pairs stay in the sandbox. (Source: [[cloudflare-codemode]], [[colinmcnamara-context-optimization-codemode]])
+- **30% fewer interaction turns**: Multi-step workflows that required 5-10 round-trips become one code generation turn. Fewer round-trips mean fewer opportunities for error propagation. (Source: [[codeact-apple-2024]])
+- **Python interpreter provides zero-cost error signals**: Wrong intermediate calculations raise exceptions immediately. Agent sees traceback and revises without a separate critique step. This complements our L4 adversarial verification by catching syntax/semantic errors before they reach the critic agent. (Source: [[codeact-apple-2024]])
+- **TypeScript preferred over Python for agent code**: LLMs have seen millions of TS/JS repos in training data. Type system provides natural guardrails — malformed API calls are caught at generation time, not execution time. (Source: [[cloudflare-codemode]], [[executor-rhyssullivan]])
+- **Tool discovery without context load**: Executor's `tools.discover({ query, limit })` pattern lets agents discover tools dynamically without loading all definitions into context. This is a fundamental improvement over MCP's `tools/list` which returns everything. (Source: [[executor-rhyssullivan]])
+- **Cross-agent tool sharing**: Executor's MCP server mode enables a single tool catalog shared across Cursor, Claude Code, OpenCode, and other agents. Aligns with our P39 (Harness as MCP Server). (Source: [[executor-rhyssullivan]])
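The discovery pattern described above can be sketched as a query against a catalog that returns only the best few matches. This is a minimal illustration, not Executor's implementation: the `ToolEntry` shape, the keyword scoring, and the tool descriptions are assumptions made for the example; only the `discover({ query, limit })` call shape comes from the findings.

```typescript
// Hypothetical catalog entry; field names are assumptions, not Executor's schema.
interface ToolEntry {
  name: string;
  description: string;
}

interface DiscoverOptions {
  query: string;
  limit: number;
}

class ToolCatalog {
  private readonly entries: ToolEntry[];

  constructor(entries: ToolEntry[]) {
    this.entries = entries;
  }

  // Score each tool by how many query terms appear in its name or description,
  // then return only the top `limit` matches instead of the whole catalog.
  discover({ query, limit }: DiscoverOptions): ToolEntry[] {
    const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
    return this.entries
      .map((entry) => {
        const haystack = `${entry.name} ${entry.description}`.toLowerCase();
        const score = terms.filter((t) => haystack.includes(t)).length;
        return { entry, score };
      })
      .filter((scored) => scored.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, limit)
      .map((scored) => scored.entry);
  }
}

// Illustrative catalog; the descriptions are invented for this sketch.
const catalog = new ToolCatalog([
  { name: "read", description: "Read a file from the workspace" },
  { name: "bash", description: "Run a shell command" },
  { name: "edit", description: "Apply an edit to a file" },
  { name: "ck_search", description: "Search the codebase" },
]);

// Only the matching definitions would need to enter the model's context.
const fileTools = catalog.discover({ query: "file", limit: 2 });
```

A production catalog would likely rank by something stronger than keyword overlap, but the context saving comes from the `limit` cut, not from the ranking method.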
+## Key Entities
+- **Apple Machine Learning Research**: Published CodeAct (ICML 2024), the foundational paper establishing code-as-unified-action-space
+- **Cloudflare**: Production implementation of TypeScript execution layer via `@cloudflare/codemode` (Workers-based sandbox)
+- **Rhys Sullivan**: Creator of Executor, the leading open-source local-first TypeScript runtime for agents (1.3K stars)
+- **Kenton Varda & Sunil Pai**: Cloudflare engineers who articulated "LLMs are better at writing code than making tool calls"
+## Key Concepts
+- **[[ts-execution-layer]]**: The pattern of replacing flat tool lists with a typed TypeScript runtime. Core concept page.
+- **CodeAct paradigm**: LLM actions expressed as executable code rather than JSON/text. Foundation for all code execution layer systems.
+- **Tool catalog**: Single discovery point for all tools, replacing per-agent tool loading. Queryable by intent.
+- **Deterministic bridge**: LLM (non-deterministic) generates code → runtime (deterministic) executes it → predictable results. Contrasts with sub-agent pattern (non-deterministic all the way down).
+- **Network isolation**: Executed code has no network access by default. All external interaction flows through tool dispatch mechanism. Enforced at runtime level.
+## Contradictions
+- **Python vs TypeScript**: CodeAct uses Python; Cloudflare and Executor use TypeScript. CodeAct argues Python's interpreter errors are the key mechanism; TypeScript advocates argue type definitions provide similar guardrails at generation time. Both valid — the language choice depends on sandbox infrastructure. TypeScript is the better fit for our Node.js harness.
+- **Cloud vs Local**: Cloudflare Code Mode requires Cloudflare Workers; Executor runs locally. For our CLI harness, local execution is mandatory. Executor's architecture (local daemon, tool catalog, typed runtime) is the closer reference implementation.
+## What the Harness Does NOT Need from These Systems
+- **Cloudflare Workers dependency**: Our sandbox uses local Node.js VM or Deno — not CF infrastructure. The `Executor` interface is minimal and we implement our own backend.
+- **Python interpreter sandbox**: TypeScript is our harness language. CodeAct's Python research validates the paradigm but we implement in TypeScript.
+- **Web UI for tool configuration**: Executor has a browser UI. Our harness is CLI-only. Configuration via `.pi/harness/tool-catalog.json` and CLI commands.
+- **Multi-agent tool sharing**: Nice-to-have but not in scope for Phase 0. P39 (Harness as MCP Server) covers this eventually.
+## Open Questions
+- **Sandbox security model**: What's the minimum viable sandbox for TypeScript code that calls our tools? Node.js `vm` module? Deno subprocess? Bubblewrap? Each has different security/performance tradeoffs. Needs a dedicated spike.
+- **Type generation from our tool schemas**: Our tools (read, bash, edit, ctx_execute, ck_search) need TypeScript type definitions auto-generated. Cloudflare's `generateTypes()` is CF-specific. Executor's type generation is tied to its plugin system. We need a harness-specific solution.
+- **Model compatibility**: Which models are good enough at TypeScript generation to use this pattern? Frontier GPT and Claude models are strong. Smaller models (Haiku, Gemini Flash) may struggle. Need model-adaptive routing per [[model-adaptive-harness]].
+- **Permission gating inside sandbox**: If the LLM-generated code calls `tools.bash("rm -rf /")`, how does the permission subsystem intercept? The tool dispatch mechanism must route through P35 (Permission Subsystem) before execution.
+- **Error handling UX**: When generated TypeScript has syntax errors or type mismatches, what does the agent see? Traceback? Auto-fix attempt? The error feedback loop design is critical.
+- **Benchmark against direct tool calling**: Before committing to this phase, benchmark our harness with direct tool calling vs TS execution layer on real tasks (not just M3ToolEval). Measure context usage, success rate, and wall-clock time.
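One possible answer to the permission-gating question above is to hand generated code a wrapped `tools` object, so every call passes a policy check before it reaches the real handler. The sketch below assumes a hypothetical `PermissionCheck` signature and a toy deny rule; neither is the P35 contract, and a real sandbox would also need to stop code from reaching the unwrapped handlers.

```typescript
type Decision = "allow" | "deny";

// Hypothetical policy signature; the real P35 interface is not specified here.
type PermissionCheck = (tool: string, args: string[]) => Decision;
type ToolHandler = (...args: string[]) => string;

// Wrap every handler so a call reaches the real tool only if the policy allows it.
function gateTools<K extends string>(
  handlers: Record<K, ToolHandler>,
  check: PermissionCheck,
): Record<K, ToolHandler> {
  const gated = {} as Record<K, ToolHandler>;
  for (const name of Object.keys(handlers) as K[]) {
    const handler = handlers[name];
    gated[name] = (...args: string[]) => {
      if (check(name, args) === "deny") {
        throw new Error(`${name} blocked by policy`);
      }
      return handler(...args);
    };
  }
  return gated;
}

// Stand-in for the real bash tool: records the command instead of running it.
const executed: string[] = [];
const tools = gateTools(
  {
    bash: (cmd: string) => {
      executed.push(cmd);
      return `ran: ${cmd}`;
    },
  },
  // Toy rule for illustration only: deny any bash call containing "rm -rf".
  (tool, args) =>
    tool === "bash" && args.some((a) => a.includes("rm -rf")) ? "deny" : "allow",
);

const okResult = tools.bash("ls");

let blockedMessage = "";
try {
  tools.bash("rm -rf /");
} catch (err) {
  blockedMessage = err instanceof Error ? err.message : String(err);
}
```

Because the denial surfaces as a thrown error inside the sandbox, the agent would see it through the same error feedback loop as any other runtime failure.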
+## Sources
+- [[codeact-apple-2024]]: Wang et al., ICML 2024 — 20% improvement, 30% fewer turns
+- [[cloudflare-codemode]]: Cloudflare official docs — production implementation, 3-4x context reduction
+- [[executor-rhyssullivan]]: RhysSullivan/executor — open-source local TS runtime, 1.3K stars
+- [[colinmcnamara-context-optimization-codemode]]: Colin McNamara analysis — context efficiency comparison, sub-agent vs Code Mode

package/vault/wiki/questions/Research: claude-mem over Obsidian for Harness Layer.md ADDED

@@ -0,0 +1,71 @@
+---
+type: synthesis
+title: "Research: claude-mem over Obsidian for Harness Layer"
+created: 2026-05-05
+updated: 2026-05-05
+tags:
+  - research
+  - memory
+  - harness
+  - claude-mem
+  - obsidian
+status: developing
+related:
+  - "[[persistent-memory]]"
+  - "[[adr-009]]"
+  - "[[Research: Claude Code State-of-the-Art Harness Improvements]]"
+  - "[[lifecycle-hooks]]"
+  - "[[Codex Harness Innovations (OpenAI)]]"
+  - "[[memory-system-of-record-vs-ephemeral-cache]]"
+sources:
+  - "[[adr-009]]"
+  - "[[persistent-memory]]"
+  - "[[Research: Claude Code State-of-the-Art Harness Improvements]]"
+  - "[[claude-code-architecture-karaxai-2026]]"
+  - "[[codex-harness-innovations]]"
+---
+# Research: claude-mem over Obsidian for Harness Layer
+## Overview
+The current harness memory decision is explicit and stable: the Obsidian wiki is the Layer 6 system of record per ADR-009. The local corpus has no direct `claude-mem` implementation details, so any recommendation to replace the wiki cannot be high confidence. On current evidence, full replacement is not advised; the safest path is to keep Obsidian as the source of truth and add an optional local auto-memory cache if needed.
+## Key Findings
+- **System-of-record already chosen (high)**: ADR-009 explicitly replaces vector-memory stack with claude-obsidian Mode B, with `hot.md -> index.md -> pages` retrieval and lower dependency surface (Source: [[adr-009]]).
+- **Harness contracts depend on wiki structure (high)**: Layer write/read hooks, auditability, and cross-layer memory mapping are built around `wiki/index.md`, `wiki/log.md`, and `wiki/hot.md` (Source: [[persistent-memory]]).
+- **Prompt memory is weaker than deterministic controls (high)**: The Claude Code architecture notes report ~92% compliance with instructions placed in CLAUDE.md, whereas hooks, when configured, enforce rules deterministically (Source: [[lifecycle-hooks]], [[claude-code-architecture-karaxai-2026]]).
+- **Automatic memory and explicit wiki solve different problems (medium)**: Codex-style implicit memories help fast context recovery, but explicit wiki remains better for durable decisions, provenance, and human review (Source: [[codex-harness-innovations]]).
+- **No verified claude-mem data in current wiki (low)**: No local source page or benchmark for claude-mem behavior, storage model, recall quality, or failure modes in this repo context.
+## Key Entities
+- [[Claude Code]]: demonstrates multi-memory architecture and deterministic hook enforcement.
+- [[Codex Harness Innovations (OpenAI)]]: demonstrates automatic/implicit memory pattern.
+## Key Concepts
+- [[persistent-memory]]: current Layer 6 design.
+- [[lifecycle-hooks]]: reliability boundary between prompt-following and enforcement.
+- [[memory-system-of-record-vs-ephemeral-cache]]: recommended split architecture.
+## Contradictions
+- [[adr-009]] optimizes for explicit wiki memory with human-readable provenance; [[codex-harness-innovations]] shows value in implicit auto-memory capture. Best reconciliation: keep explicit wiki as canonical and use implicit memory only as non-authoritative accelerator.
+## Recommendation
+- Do **not** replace Obsidian with claude-mem as primary harness memory right now.
+- If you want claude-mem, run it as a **secondary cache layer** only:
+  - read order: claude-mem quick hints -> wiki `hot.md` -> wiki `index.md` -> linked pages
+  - write order: all accepted decisions and patterns must land in wiki files
+  - enforcement: hooks block "done" unless wiki write completed for decision-bearing tasks
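The read order in the recommendation above can be sketched as a chain of lookups that stops at the first layer with an answer and reports whether that layer is canonical. The `MemoryLayer` shape, the `authoritative` flag, and the sample entries are assumptions for illustration; nothing here reflects a real `claude-mem` API.

```typescript
// Each layer either answers a query or returns undefined. Shapes are assumptions.
interface MemoryLayer {
  name: string;
  authoritative: boolean; // true only for wiki-backed layers
  lookup: (query: string) => string | undefined;
}

interface Recall {
  text: string;
  layer: string;
  authoritative: boolean;
}

// Walk the layers in read order and return the first hit with its provenance.
function recall(layers: MemoryLayer[], query: string): Recall | undefined {
  for (const layer of layers) {
    const text = layer.lookup(query);
    if (text !== undefined) {
      return { text, layer: layer.name, authoritative: layer.authoritative };
    }
  }
  return undefined;
}

// In-memory stand-ins for the cache and the wiki files.
const fromMap = (m: Map<string, string>) => (q: string) => m.get(q);

const layers: MemoryLayer[] = [
  {
    name: "claude-mem",
    authoritative: false,
    lookup: fromMap(new Map<string, string>([["sandbox", "hint: sandbox spike pending"]])),
  },
  { name: "hot.md", authoritative: true, lookup: fromMap(new Map<string, string>()) },
  {
    name: "index.md",
    authoritative: true,
    lookup: fromMap(new Map<string, string>([["memory", "see adr-009"]])),
  },
];

const memoryAnswer = recall(layers, "memory"); // falls through to index.md
const sandboxHint = recall(layers, "sandbox"); // served by the cache, not canonical
```

Carrying the `authoritative` flag on every result lets a caller treat cache hits as hints to verify against the wiki rather than as decisions.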
+## Open Questions
+- What is claude-mem persistence format and durability under compaction/restart?
+- Can claude-mem expose provenance links equivalent to wiki page references?
+- What are precision/recall metrics on this repo versus wiki query flow?
+- How should conflict resolution work when claude-mem memory disagrees with wiki decisions?
+- What token/cost/latency delta appears in real harness runs with hybrid mode?
+## Sources
+- [[adr-009]]: persistent memory ADR, 2026-04-28.
+- [[persistent-memory]]: Layer 6 module contract.
+- [[Research: Claude Code State-of-the-Art Harness Improvements]]: memory and control architecture synthesis.
+- [[claude-code-architecture-karaxai-2026]]: CLAUDE.md memory and compliance behavior.
+- [[codex-harness-innovations]]: implicit/automatic memory pattern comparison.

package/vault/wiki/questions/Research: claude-mem over obsidian wiki as the knowledge base for our agentic harness pipeline. think from first principles. does this replace or complement our current setup? no hard feelings about previous decisions. gimme accurate points.md ADDED

@@ -0,0 +1,80 @@
+---
+type: synthesis
+title: "Research: claude-mem over obsidian wiki as the knowledge base for our agentic harness pipeline. think from first principles. does this replace or complement our current setup? no hard feelings about previous decisions. gimme accurate points"
+created: 2026-05-05
+updated: 2026-05-05
+tags:
+  - research
+  - memory
+  - claude-mem
+  - obsidian
+  - harness
+  - first-principles
+status: developing
+related:
+  - "[[adr-009]]"
+  - "[[persistent-memory]]"
+  - "[[lifecycle-hooks]]"
+  - "[[memory-system-of-record-vs-ephemeral-cache]]"
+  - "[[Research: how claude-mem fits into our workflow. and whether it should replace obsidian in the codebase. no hard feelings about previous actions, rethink from first principles always]]"
+sources:
+  - "[[adr-009]]"
+  - "[[persistent-memory]]"
+  - "[[anthropic-effective-harnesses]]"
+  - "[[claude-code-architecture-qubytes-2026]]"
+  - "[[codex-open-source-agent-2026]]"
+---
+# Research: claude-mem over obsidian wiki as the knowledge base for our agentic harness pipeline. think from first principles. does this replace or complement our current setup? no hard feelings about previous decisions. gimme accurate points
+## Overview
+First-principles answer: `claude-mem` does not replace Obsidian wiki in current harness design. It complements wiki as a fast recall cache. Canonical decision memory, provenance, and enforcement still belong in wiki system-of-record.
+## Research Method
+- Round 1 (broad): evaluate harness-memory requirements, existing ADRs, and architecture contracts.
+- Round 2 (gap fill): validate whether vault contains direct `claude-mem` primary evidence.
+- Round 3: skipped; confidence ceiling reached because direct `claude-mem` source evidence is still missing.
+- Constraint note: web tool access blocked in this run, so findings rely on existing filed sources.
+## First-Principles Criteria
+1. Canonical memory must be durable across sessions.
+2. Memory claims must be auditable and human-inspectable.
+3. Decision provenance must be linkable and conflict-resolvable.
+4. Completion gates must enforce write-back deterministically.
+5. Fast recall is useful, but cannot override canonical truth.
+## Key Findings
+- **(high)** Current architecture already sets canonical memory to wiki via `[[adr-009]]`, with read order and durable files (`hot.md`, `index.md`, linked pages). (Source: [[adr-009]], [[persistent-memory]])
+- **(high)** Harness integration points and lifecycle write patterns are coupled to wiki artifacts; replacing wiki implies re-architecting memory contracts across layers. (Source: [[persistent-memory]])
+- **(high)** Deterministic hooks and gates are required for policy reliability; prompt memory alone drifts under long-running workloads. (Source: [[anthropic-effective-harnesses]], [[claude-code-architecture-qubytes-2026]])
+- **(medium)** Auto-memory pattern is useful for convenience and continuity, but strongest in non-authoritative role alongside explicit durable memory. (Source: [[codex-open-source-agent-2026]])
+- **(low)** Vault still lacks direct `claude-mem` benchmark evidence (storage semantics, provenance fidelity, recall precision, conflict behavior), so full replacement claim is unproven.
+## Replace vs Complement
+### Replace
+Not supported by current evidence. Fails criteria 2-4 unless additional mechanisms are proven.
+### Complement
+Supported. Use `claude-mem` as optional acceleration cache; keep wiki as source-of-record.
+## Recommended Operating Model
+1. Read path: `claude-mem` hints -> `[[hot]]` -> `[[index]]` -> canonical linked pages.
+2. Write path: all decision-bearing outputs must land in wiki first.
+3. Conflict rule: cache and wiki disagree -> wiki wins.
+4. Enforcement: stop-hook blocks completion when required wiki filing is missing.
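The conflict rule and the stop-hook gate in the operating model above might look like the following. This is a sketch under stated assumptions: the `Claim` and `TaskOutcome` shapes, the function names, and the sample values are all hypothetical, and a real hook would inspect the filesystem rather than an in-memory list.

```typescript
interface Claim {
  key: string;
  value: string;
}

// Conflict rule: if the wiki holds a value for the key, it wins over the cache.
function resolve(key: string, cache: Claim[], wiki: Claim[]): string | undefined {
  const fromWiki = wiki.find((c) => c.key === key);
  if (fromWiki !== undefined) return fromWiki.value;
  const fromCache = cache.find((c) => c.key === key);
  return fromCache === undefined ? undefined : fromCache.value;
}

interface TaskOutcome {
  decisionBearing: boolean;
  requiredPages: string[]; // wiki files that must be written for this task
  filedPages: string[]; // wiki files actually written
}

// Stop-hook check: block completion while any required wiki page is unfiled.
function stopHook(outcome: TaskOutcome): { allow: boolean; missing: string[] } {
  if (!outcome.decisionBearing) return { allow: true, missing: [] };
  const missing = outcome.requiredPages.filter(
    (page) => outcome.filedPages.indexOf(page) === -1,
  );
  return { allow: missing.length === 0, missing };
}

// Illustrative values only.
const cacheClaims: Claim[] = [
  { key: "memory-backend", value: "vector store" },
  { key: "last-task", value: "sandbox spike" },
];
const wikiClaims: Claim[] = [{ key: "memory-backend", value: "obsidian wiki" }];

const backend = resolve("memory-backend", cacheClaims, wikiClaims); // wiki wins
const lastTask = resolve("last-task", cacheClaims, wikiClaims); // cache-only hint

const verdict = stopHook({
  decisionBearing: true,
  requiredPages: ["wiki/log.md", "wiki/hot.md"],
  filedPages: ["wiki/log.md"],
});
```

Returning the list of missing pages, rather than a bare boolean, gives the agent something concrete to fix before it retries completion.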
+## Contradictions
+- Fast-memory systems optimize latency; wiki optimizes auditability and governance. One layer cannot optimize both without tradeoffs. Use two-layer memory model.
+## Open Questions
+- What are exact retention and deletion semantics for `claude-mem` in team workflows?
+- Can `claude-mem` produce source-level provenance links compatible with wikilinks?
+- What measured latency/token gain appears in this repo with cache+wiki mode?
+- What precision/recall benchmark should define acceptable cache quality?
+## Sources
+- [[adr-009]]: canonical memory ADR.
+- [[persistent-memory]]: layer contract and write/read patterns.
+- [[anthropic-effective-harnesses]]: long-running harness constraints.
+- [[claude-code-architecture-qubytes-2026]]: practical persistence/hook architecture notes.
+- [[codex-open-source-agent-2026]]: implicit memory as complementary pattern.