RubyGems - phronomy - Versions diffs - 0.6.0 → 0.7.0 - Mend

phronomy 0.6.0 → 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (104)

checksums.yaml +4 -4
data/.mutant.yml +21 -0
data/CHANGELOG.md +338 -0
data/CONTRIBUTING.md +102 -0
data/README.md +242 -27
data/RELEASE_CHECKLIST.md +86 -0
data/SECURITY.md +80 -0
data/benchmark/baseline.json +9 -0
data/benchmark/bench_agent_invoke.rb +105 -0
data/benchmark/bench_context_assembler.rb +46 -0
data/benchmark/bench_regression.rb +171 -0
data/benchmark/bench_token_estimator.rb +44 -0
data/benchmark/bench_tool_schema.rb +69 -0
data/benchmark/bench_vector_store.rb +39 -0
data/benchmark/bench_workflow.rb +55 -0
data/benchmark/run_all.rb +118 -0
data/docs/decisions/001-rubyllm-as-provider-layer.md +42 -0
data/docs/decisions/002-workflow-context-immutability.md +42 -0
data/docs/decisions/003-event-loop-singleton.md +48 -0
data/docs/decisions/004-invoke-timeout-is-not-cancellation.md +51 -0
data/docs/decisions/005-static-knowledge-class-level-cache.md +45 -0
data/docs/decisions/006-no-built-in-guardrails.md +48 -0
data/docs/decisions/007-mcp-is-beta-stability.md +51 -0
data/docs/decisions/008-orchestrator-uses-os-threads.md +52 -0
data/docs/decisions/009-state-store-abstraction.md +141 -0
data/lib/phronomy/agent/base.rb +194 -12
data/lib/phronomy/agent/before_completion_context.rb +1 -0
data/lib/phronomy/agent/checkpoint.rb +1 -0
data/lib/phronomy/agent/concerns/before_completion.rb +6 -0
data/lib/phronomy/agent/concerns/error_translation.rb +45 -0
data/lib/phronomy/agent/concerns/guardrailable.rb +3 -0
data/lib/phronomy/agent/concerns/retryable.rb +12 -1
data/lib/phronomy/agent/concerns/suspendable.rb +4 -0
data/lib/phronomy/agent/fsm.rb +15 -0
data/lib/phronomy/agent/handoff.rb +3 -0
data/lib/phronomy/agent/orchestrator.rb +123 -11
data/lib/phronomy/agent/parallel_tool_chat.rb +21 -4
data/lib/phronomy/agent/react_agent.rb +8 -6
data/lib/phronomy/agent/runner.rb +2 -0
data/lib/phronomy/agent/shared_state.rb +11 -0
data/lib/phronomy/agent/suspend_signal.rb +2 -0
data/lib/phronomy/agent/team_coordinator.rb +17 -5
data/lib/phronomy/cancellation_token.rb +92 -0
data/lib/phronomy/configuration.rb +26 -2
data/lib/phronomy/context/assembler.rb +6 -0
data/lib/phronomy/context/compaction_context.rb +2 -0
data/lib/phronomy/context/context_version_cache.rb +2 -0
data/lib/phronomy/context/token_budget.rb +3 -0
data/lib/phronomy/context/token_estimator.rb +9 -2
data/lib/phronomy/context/trigger_context.rb +1 -0
data/lib/phronomy/context/trim_context.rb +4 -0
data/lib/phronomy/embeddings/base.rb +5 -2
data/lib/phronomy/embeddings/ruby_llm_embeddings.rb +6 -2
data/lib/phronomy/eval/comparison.rb +2 -0
data/lib/phronomy/eval/dataset.rb +4 -0
data/lib/phronomy/eval/metrics.rb +6 -0
data/lib/phronomy/eval/runner.rb +2 -0
data/lib/phronomy/eval/scorer/base.rb +1 -0
data/lib/phronomy/eval/scorer/exact_match.rb +2 -0
data/lib/phronomy/eval/scorer/includes_scorer.rb +2 -0
data/lib/phronomy/eval/scorer/llm_judge.rb +2 -0
data/lib/phronomy/event_loop.rb +114 -7
data/lib/phronomy/fsm_session.rb +8 -1
data/lib/phronomy/generator_verifier.rb +2 -0
data/lib/phronomy/guardrail/base.rb +3 -0
data/lib/phronomy/knowledge_source/base.rb +6 -2
data/lib/phronomy/knowledge_source/entity_knowledge.rb +7 -2
data/lib/phronomy/knowledge_source/rag_knowledge.rb +8 -4
data/lib/phronomy/knowledge_source/static_knowledge.rb +7 -2
data/lib/phronomy/loader/base.rb +1 -0
data/lib/phronomy/loader/csv_loader.rb +2 -0
data/lib/phronomy/loader/markdown_loader.rb +2 -0
data/lib/phronomy/loader/plain_text_loader.rb +1 -0
data/lib/phronomy/output_parser/base.rb +1 -0
data/lib/phronomy/output_parser/json_parser.rb +22 -3
data/lib/phronomy/output_parser/structured_parser.rb +2 -0
data/lib/phronomy/prompt_template.rb +5 -0
data/lib/phronomy/runnable.rb +20 -3
data/lib/phronomy/splitter/base.rb +2 -0
data/lib/phronomy/splitter/fixed_size_splitter.rb +2 -0
data/lib/phronomy/splitter/recursive_splitter.rb +2 -0
data/lib/phronomy/state_store/base.rb +48 -0
data/lib/phronomy/state_store/in_memory.rb +62 -0
data/lib/phronomy/tool/agent_tool.rb +1 -0
data/lib/phronomy/tool/base.rb +189 -27
data/lib/phronomy/tool/mcp_tool.rb +68 -13
data/lib/phronomy/tracing/base.rb +3 -0
data/lib/phronomy/tracing/langfuse_tracer.rb +2 -0
data/lib/phronomy/tracing/open_telemetry_tracer.rb +2 -0
data/lib/phronomy/vector_store/base.rb +33 -7
data/lib/phronomy/vector_store/in_memory.rb +16 -7
data/lib/phronomy/vector_store/pgvector.rb +40 -9
data/lib/phronomy/vector_store/redis_search.rb +29 -8
data/lib/phronomy/version.rb +1 -1
data/lib/phronomy/workflow.rb +96 -7
data/lib/phronomy/workflow_context.rb +54 -4
data/lib/phronomy/workflow_runner.rb +35 -7
data/lib/phronomy.rb +70 -1
data/scripts/api_snapshot.rb +91 -0
data/scripts/check_api_annotations.rb +68 -0
data/scripts/check_private_enforcement.rb +93 -0
data/scripts/check_readme_runnable.rb +98 -0
data/scripts/run_mutation.sh +46 -0
metadata +45 -2

data/benchmark/bench_agent_invoke.rb ADDED

@@ -0,0 +1,105 @@
+# frozen_string_literal: true
+# bench_agent_invoke.rb — Agent#invoke framework overhead benchmark.
+#
+# Measures the per-invoke cost of the Phronomy::Agent::Base framework path
+# (context assembly, guardrail checks, before_completion hooks, response
+# handling) with a fully stubbed LLM.  No network calls are made.
+#
+# Scenarios:
+#   1. Minimal agent (no tools, no knowledge) — baseline framework overhead.
+#   2. Tool-aware agent with max_parallel_tools=4 (4 stub tools per turn).
+#   3. Agent#stream setup latency (first-chunk time with stubbed stream);
+#      listed here but not measured by the reports below.
+require "benchmark"
+require_relative "../lib/phronomy"
+# ---------------------------------------------------------------------------
+# Shared stubs
+# ---------------------------------------------------------------------------
+BenchAgentMessage = Struct.new(:role, :content, :tool_calls, :tokens) do
+  def self.assistant(content = "done")
+    new(:assistant, content, nil,
+      Struct.new(:input, :output, :cached, :cache_creation).new(5, 5, 0, 0))
+  end
+end
+# A minimal stub Chat that returns a pre-built response immediately.
+class BenchStubChat
+  attr_reader :messages
+  def initialize(response)
+    @response = response
+    @messages = []
+  end
+  def with_instructions(_) = self
+  def with_tool(_) = self
+  def with_temperature(_) = self
+  def with_cache_instructions(_) = self
+  def with_output_schema(_) = self
+  def last_message = @response
+  def ask(_)
+    @messages << @response
+    @response
+  end
+  def stream(*)
+    yield @response.content if block_given?
+    @response
+  end
+end
+# A stub tool that does nothing but conforms to the Tool::Base interface.
+class BenchNullTool < Phronomy::Tool::Base
+  description "No-op benchmark tool"
+  param :x, type: :string, desc: "input"
+  def execute(x:)
+    "result:#{x}"
+  end
+end
+# ---------------------------------------------------------------------------
+# Agent classes
+# ---------------------------------------------------------------------------
+BENCH_RESP = BenchAgentMessage.assistant("benchmark complete")
+BENCH_RESP_CHAT = BenchStubChat.new(BENCH_RESP)
+bench_minimal_class = Class.new(Phronomy::Agent::Base) do
+  model "stub-model"
+  define_method(:build_chat) { |*| BenchStubChat.new(BENCH_RESP) }
+end
+bench_tool_class = Class.new(Phronomy::Agent::Base) do
+  model "stub-model"
+  tools BenchNullTool
+  max_parallel_tools 4
+  define_method(:build_chat) { |*| BenchStubChat.new(BENCH_RESP) }
+end
+BENCH_AGENT_MINIMAL = bench_minimal_class.new
+BENCH_AGENT_TOOLS = bench_tool_class.new
+AGENT_INVOKE_ITERATIONS = 200
+puts "=== bench_agent_invoke ==="
+Benchmark.bm(50) do |x|
+  x.report("Agent#invoke — minimal (no tools), #{AGENT_INVOKE_ITERATIONS} iters") do
+    AGENT_INVOKE_ITERATIONS.times do
+      BENCH_AGENT_MINIMAL.invoke("ping", thread_id: "bench-#{rand(1_000_000)}")
+    end
+  end
+  x.report("Agent#invoke — 4 parallel stub tools, #{AGENT_INVOKE_ITERATIONS} iters") do
+    AGENT_INVOKE_ITERATIONS.times do
+      BENCH_AGENT_TOOLS.invoke("ping", thread_id: "bench-#{rand(1_000_000)}")
+    end
+  end
+end
+puts

data/benchmark/bench_context_assembler.rb ADDED

@@ -0,0 +1,46 @@
+# frozen_string_literal: true
+# Benchmark: Context::Assembler#build
+#
+# Tests context assembly performance for varying numbers of messages and
+# knowledge chunks. This path is exercised on every agent turn.
+require "benchmark"
+require_relative "../lib/phronomy"
+BenchAsmMessage = Struct.new(:content)
+def make_assembler(n_messages:, n_chunks:, with_budget: false)
+  budget = if with_budget
+    Phronomy::Context::TokenBudget.new(context_window: 4096, max_output_tokens: 512)
+  end
+  asm = Phronomy::Context::Assembler.new(budget: budget)
+  asm.add_instruction("You are a helpful assistant. Answer the user's question.")
+  n_chunks.times do |i|
+    asm.add_knowledge("Fact #{i}: The capital of country #{i} is City #{i}.", type: :entity, trusted: true)
+  end
+  msgs = Array.new(n_messages) { BenchAsmMessage.new("This is a conversation message.") }
+  asm.add_messages(msgs)
+  asm
+end
+BENCH_ASM_ITERATIONS = 1_000
+puts "=== bench_context_assembler ==="
+Benchmark.bm(40) do |x|
+  x.report("build(10 msgs, 0 chunks)") do
+    BENCH_ASM_ITERATIONS.times { make_assembler(n_messages: 10, n_chunks: 0).build }
+  end
+  x.report("build(100 msgs, 5 chunks)") do
+    BENCH_ASM_ITERATIONS.times { make_assembler(n_messages: 100, n_chunks: 5).build }
+  end
+  x.report("build(1000 msgs, 10 chunks, no budget)") do
+    (BENCH_ASM_ITERATIONS / 10).times { make_assembler(n_messages: 1000, n_chunks: 10).build }
+  end
+  x.report("build(1000 msgs, 10 chunks, budgeted)") do
+    (BENCH_ASM_ITERATIONS / 10).times { make_assembler(n_messages: 1000, n_chunks: 10, with_budget: true).build }
+  end
+end

data/benchmark/bench_regression.rb ADDED

@@ -0,0 +1,171 @@
+# frozen_string_literal: true
+# bench_regression.rb — Targeted regression benchmarks.
+#
+# Measures the five minimum regression targets defined in Issue #232:
+#   1. WorkflowContext#merge throughput
+#   2. Workflow.define (graph build) time
+#   3. Tool::Base#params_schema generation (10 params)
+#   4. Orchestrator#dispatch_parallel overhead (10 stub agents, no LLM)
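+#   5. CancellationToken#cancelled? throughput (shared token, 8 threads)
+# plus two additional targets measured below:
+#   6. CancellationToken#raise_if_cancelled! hot path (no-op, single thread)
+#   7. Context::TrimContext#remove on a 2000-element history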
+#
+# Results are stored in a global REGRESSION_RESULTS hash (keyed by metric name,
+# value = iterations per second) for use by run_all.rb baseline comparison.
+require "benchmark"
+require_relative "../lib/phronomy"
+REGRESSION_ITERATIONS = 5_000
+# ---------------------------------------------------------------------------
+# Target 1: WorkflowContext#merge throughput
+# ---------------------------------------------------------------------------
+context_class = Class.new do
+  include Phronomy::WorkflowContext
+  field :value, type: :replace, default: -> { 0 }
+  field :log, type: :append, default: -> { [] }
+end
+sample_ctx = context_class.new(value: 42, log: ["a"])
+t1 = Benchmark.measure("WorkflowContext#merge") do
+  REGRESSION_ITERATIONS.times { sample_ctx.merge(value: 99, log: "b") }
+end
+# ---------------------------------------------------------------------------
+# Target 2: Workflow.define graph build time
+# ---------------------------------------------------------------------------
+BUILD_ITERATIONS = 1_000
+t2 = Benchmark.measure("Workflow.define (5 states)") do
+  BUILD_ITERATIONS.times do
+    build_ctx = Class.new do
+      include Phronomy::WorkflowContext
+      field :x, type: :replace, default: -> { 0 }
+    end
+    Phronomy::Workflow.define(build_ctx) do
+      initial :a
+      %i[a b c d].each_with_index do |state, i|
+        next_state = %i[a b c d e][i + 1]
+        action = ->(s) { s.merge(x: s.x + 1) }
+        self.state state, action: action
+        transition from: state, to: next_state
+      end
+      self.state :e, action: ->(s) { s }
+      transition from: :e, to: :__finish__
+    end
+  end
+end
+# ---------------------------------------------------------------------------
+# Target 3: Tool::Base#params_schema generation (10 params)
+# ---------------------------------------------------------------------------
+tool_class = Class.new(Phronomy::Tool::Base) do
+  description "Test tool with 10 params"
+  param :p1, type: :string, desc: "param 1"
+  param :p2, type: :string, desc: "param 2"
+  param :p3, type: :string, desc: "param 3"
+  param :p4, type: :string, desc: "param 4"
+  param :p5, type: :string, desc: "param 5"
+  param :p6, type: :string, desc: "param 6"
+  param :p7, type: :string, desc: "param 7"
+  param :p8, type: :string, desc: "param 8"
+  param :p9, type: :string, desc: "param 9"
+  param :p10, type: :string, desc: "param 10"
+  def execute(**_kwargs)
+    "ok"
+  end
+end
+t3 = Benchmark.measure("Tool::Base#params_schema_definition (10 params)") do
+  REGRESSION_ITERATIONS.times { tool_class.params_schema_definition }
+end
+# ---------------------------------------------------------------------------
+# Target 4: Orchestrator#dispatch_parallel overhead (10 stub agents, no LLM)
+# ---------------------------------------------------------------------------
+stub_agent_class = Class.new(Phronomy::Agent::Base) do
+  define_method(:invoke) do |_input, messages: [], thread_id: nil, config: {}|
+    {output: "stub", messages: []}
+  end
+end
+orchestrator_class = Class.new(Phronomy::Agent::Orchestrator)
+orchestrator = orchestrator_class.new
+PARALLEL_ITERATIONS = 200
+t4 = Benchmark.measure("Orchestrator#dispatch_parallel (10 agents)") do
+  PARALLEL_ITERATIONS.times do
+    tasks = Array.new(10) { {agent: stub_agent_class, input: "x"} }
+    orchestrator.dispatch_parallel(*tasks)
+  end
+end
+# ---------------------------------------------------------------------------
+# Target 5: CancellationToken#cancelled? throughput (8 threads)
+# ---------------------------------------------------------------------------
+CANCEL_TOKEN = Phronomy::CancellationToken.new
+CANCEL_ITERATIONS = 10_000
+t5 = Benchmark.measure("CancellationToken#cancelled? (8 threads)") do
+  threads = 8.times.map do
+    Thread.new { CANCEL_ITERATIONS.times { CANCEL_TOKEN.cancelled? } }
+  end
+  threads.each(&:join)
+end
+# ---------------------------------------------------------------------------
+# Target 6: CancellationToken#raise_if_cancelled! hot path (no-op, single thread)
+# ---------------------------------------------------------------------------
+RAISE_TOKEN = Phronomy::CancellationToken.new  # not cancelled — no-op path
+RAISE_ITERATIONS = 200_000
+t6 = Benchmark.measure("CancellationToken#raise_if_cancelled! (no-op)") do
+  RAISE_ITERATIONS.times { RAISE_TOKEN.raise_if_cancelled! }
+end
+# ---------------------------------------------------------------------------
+# Target 7: Context::TrimContext#remove on a 2000-element history
+# ---------------------------------------------------------------------------
+BenchMsg = Struct.new(:content) unless defined?(BenchMsg)
+TRIM_ELEMENTS = Array.new(2_000) { |i| {seq: i, message: BenchMsg.new("msg #{i}"), tokens: 10, role: :user} }
+TRIM_BUDGET = Phronomy::Context::TokenBudget.new(context_window: 4096, max_output_tokens: 512)
+TRIM_ITERATIONS = 500
+t7 = Benchmark.measure("TrimContext#remove (2000-element history)") do
+  TRIM_ITERATIONS.times do
+    tc = Phronomy::Context::TrimContext.new(message_elements: TRIM_ELEMENTS, budget: TRIM_BUDGET)
+    tc.remove((0...200).to_a)  # remove 200 oldest messages
+  end
+end
+# ---------------------------------------------------------------------------
+# Print results and store in REGRESSION_RESULTS
+# ---------------------------------------------------------------------------
+puts "=== bench_regression ==="
+printf("%-46s  %8s  %12s\n", "Metric", "Real (s)", "Iter/s")
+puts "-" * 70
+metrics = {
+  "workflow_context_merge" => [t1, REGRESSION_ITERATIONS],
+  "workflow_define" => [t2, BUILD_ITERATIONS],
+  "tool_params_schema_definition" => [t3, REGRESSION_ITERATIONS],
+  "dispatch_parallel_10" => [t4, PARALLEL_ITERATIONS],
+  "cancellation_token_cancelled" => [t5, 8 * CANCEL_ITERATIONS],
+  "cancellation_token_raise_if_cancelled_noop" => [t6, RAISE_ITERATIONS],
+  "trim_context_remove_2000" => [t7, TRIM_ITERATIONS]
+}
+REGRESSION_RESULTS = {} # rubocop:disable Style/MutableConstant
+metrics.each do |key, (measure, iters)|
+  ips = iters / measure.real
+  REGRESSION_RESULTS[key] = ips
+  printf("%-46s  %8.3f  %12.0f\n", key, measure.real, ips)
+end
+puts
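Targets 5 and 6 above poll a single token from several threads and call a raising check on the not-cancelled path. The gem's `cancellation_token.rb` is not shown in this section, so the following is only a hypothetical sketch of that kind of object, assuming a Mutex-guarded boolean flag. `SketchCancellationToken` and its `Cancelled` error are illustrative names, not the phronomy API.

```ruby
# Hypothetical token for illustration; NOT Phronomy::CancellationToken.
class SketchCancellationToken
  class Cancelled < StandardError; end

  def initialize
    @mutex = Mutex.new
    @cancelled = false
  end

  # Flip the flag once; later readers observe true.
  def cancel!
    @mutex.synchronize { @cancelled = true }
  end

  # Cheap read that many threads may call concurrently.
  def cancelled?
    @mutex.synchronize { @cancelled }
  end

  # Returns nil on the not-cancelled path, raises once cancelled.
  def raise_if_cancelled!
    raise Cancelled, "operation cancelled" if cancelled?
  end
end

token = SketchCancellationToken.new

# Eight workers poll the shared token until it is cancelled.
workers = 8.times.map do
  Thread.new do
    polls = 0
    until token.cancelled?
      polls += 1
      sleep 0.001
    end
    polls
  end
end

sleep 0.01
token.cancel!
poll_counts = workers.map(&:value)
puts "workers stopped: #{poll_counts.size}"
```

The benchmark measures only the read side (`cancelled?` and the no-op `raise_if_cancelled!`), which is the path a cooperative worker would hit on every loop iteration.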

data/benchmark/bench_token_estimator.rb ADDED

@@ -0,0 +1,44 @@
+# frozen_string_literal: true
+# Benchmark: Context::TokenEstimator.estimate
+#
+# Tests estimation speed for short, medium, and long text inputs, and for
+# Arrays of message-like objects. This method is called on every message in
+# every agent turn, so it must be consistently fast.
+require "benchmark"
+require_relative "../lib/phronomy"
+SHORT_TEXT = "Hello, how are you today?"
+MEDIUM_TEXT = "A" * 500
+LONG_TEXT = "A" * 10_000
+BenchMessage = Struct.new(:content)
+MESSAGES_100 = Array.new(100) { BenchMessage.new("A" * 100) }
+MESSAGES_1000 = Array.new(1000) { BenchMessage.new("A" * 100) }
+BENCH_TOKEN_ITERATIONS = 10_000
+puts "=== bench_token_estimator ==="
+Benchmark.bm(30) do |x|
+  x.report("estimate(short text)") do
+    BENCH_TOKEN_ITERATIONS.times { Phronomy::Context::TokenEstimator.estimate(SHORT_TEXT) }
+  end
+  x.report("estimate(medium text 500c)") do
+    BENCH_TOKEN_ITERATIONS.times { Phronomy::Context::TokenEstimator.estimate(MEDIUM_TEXT) }
+  end
+  x.report("estimate(long text 10k c)") do
+    BENCH_TOKEN_ITERATIONS.times { Phronomy::Context::TokenEstimator.estimate(LONG_TEXT) }
+  end
+  x.report("estimate(100 messages)") do
+    BENCH_TOKEN_ITERATIONS.times { Phronomy::Context::TokenEstimator.estimate(MESSAGES_100) }
+  end
+  x.report("estimate(1000 messages)") do
+    (BENCH_TOKEN_ITERATIONS / 10).times { Phronomy::Context::TokenEstimator.estimate(MESSAGES_1000) }
+  end
+end

data/benchmark/bench_tool_schema.rb ADDED

@@ -0,0 +1,69 @@
+# frozen_string_literal: true
+# Benchmark: Tool::Base params_schema generation and static_knowledge_chunks cache
+#
+# Tool schema generation happens once per tool class (lazily memoised).
+# static_knowledge_chunks is cached at the class level; cache-hit overhead
+# should be negligible compared to cache-miss (which calls the knowledge source).
+require "benchmark"
+require_relative "../lib/phronomy"
+# --- Tool schema ---
+class BenchTool10Params < Phronomy::Tool::Base
+  description "A tool with 10 parameters for benchmarking purposes"
+  param :param1, type: :string, desc: "First parameter"
+  param :param2, type: :integer, desc: "Second parameter"
+  param :param3, type: :number, desc: "Third parameter"
+  param :param4, type: :boolean, desc: "Fourth parameter"
+  param :param5, type: :string, desc: "Fifth parameter"
+  param :param6, type: :string, desc: "Sixth parameter", required: false
+  param :param7, type: :integer, desc: "Seventh parameter", required: false
+  param :param8, type: :string, desc: "Eighth parameter", required: false
+  param :param9, type: :string, desc: "Ninth parameter", required: false
+  param :param10, type: :string, desc: "Tenth parameter", required: false
+  def execute(**_)
+    "ok"
+  end
+end
+# Warm up memoisation
+BenchTool10Params.params_schema_definition
+BENCH_TOOL_ITERATIONS = 50_000
+puts "=== bench_tool_schema ==="
+Benchmark.bm(35) do |x|
+  x.report("params_schema_definition (memoised, 10p)") do
+    BENCH_TOOL_ITERATIONS.times { BenchTool10Params.params_schema_definition }
+  end
+end
+# --- static_knowledge_chunks cache ---
+class BenchKnowledgeSource < Phronomy::KnowledgeSource::Base
+  def fetch(query: nil)
+    [{content: "Cached knowledge fact.", type: :static}]
+  end
+  def static?
+    true
+  end
+end
+class BenchAgentWithKnowledge < Phronomy::Agent::Base
+  model "gpt-4o-mini"
+  static_knowledge BenchKnowledgeSource.new
+end
+# Warm up cache
+BenchAgentWithKnowledge.static_knowledge_chunks
+puts "\n=== bench_static_knowledge_cache ==="
+Benchmark.bm(35) do |x|
+  x.report("static_knowledge_chunks (hit)") do
+    BENCH_TOOL_ITERATIONS.times { BenchAgentWithKnowledge.static_knowledge_chunks }
+  end
+end

data/benchmark/bench_vector_store.rb ADDED

@@ -0,0 +1,39 @@
+# frozen_string_literal: true
+# Benchmark: VectorStore::InMemory#search
+#
+# Tests search performance at different corpus sizes (100, 1000, 10_000 docs).
+# Linear scan is expected; this benchmark establishes the scaling baseline.
+require "benchmark"
+require_relative "../lib/phronomy"
+DIM = 64
+def random_embedding(dim)
+  Array.new(dim) { rand(-1.0..1.0) }
+end
+def populate(store, n)
+  n.times do |i|
+    store.add(id: "doc#{i}", embedding: random_embedding(DIM), metadata: {text: "Document #{i}"})
+  end
+end
+QUERY = random_embedding(DIM)
+# Use fewer iterations for larger corpora to keep total run time reasonable.
+BENCH_VS_ITERS = {100 => 100, 1_000 => 20, 10_000 => 5}.freeze
+puts "=== bench_vector_store_inmemory ==="
+Benchmark.bm(35) do |x|
+  [100, 1_000, 10_000].each do |n|
+    store = Phronomy::VectorStore::InMemory.new(dimension: DIM)
+    populate(store, n)
+    iters = BENCH_VS_ITERS[n]
+    x.report("search(k=5, corpus=#{n}, iters=#{iters})") do
+      iters.times { store.search(query_embedding: QUERY, k: 5) }
+    end
+  end
+end
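The header comment above says a linear scan is expected for the in-memory store. As a rough illustration of why search time should grow with corpus size, here is a minimal sketch of a linear-scan top-k search. It assumes cosine similarity as the scoring function, which this diff does not confirm, and the helper names `cosine_similarity` and `linear_scan_search` are hypothetical rather than part of `Phronomy::VectorStore::InMemory`.

```ruby
# Hypothetical stand-in for a linear-scan vector search; not the gem's code.
# Cosine similarity is an assumed metric.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Scores every document against the query (O(n * dim)), then keeps the
# k highest-scoring hits in descending score order.
def linear_scan_search(docs, query, k:)
  docs
    .map { |doc| {id: doc[:id], score: cosine_similarity(doc[:embedding], query)} }
    .max_by(k) { |hit| hit[:score] }
end

docs = [
  {id: "doc0", embedding: [1.0, 0.0]},
  {id: "doc1", embedding: [0.0, 1.0]},
  {id: "doc2", embedding: [0.7, 0.7]}
]

top = linear_scan_search(docs, [1.0, 0.0], k: 2)
top.each { |hit| puts "#{hit[:id]} #{hit[:score].round(3)}" }
```

Because every document is scored on each query, a 10x larger corpus should cost roughly 10x per search, which is why the benchmark lowers its iteration count as the corpus grows.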

data/benchmark/bench_workflow.rb ADDED

@@ -0,0 +1,55 @@
+# frozen_string_literal: true
+# Benchmark: Workflow transition loop
+#
+# Builds a linear chain of N states and measures how long it takes to run
+# the full workflow to completion. 100 transitions must complete in <10ms.
+require "benchmark"
+require_relative "../lib/phronomy"
+# Build a linear workflow: state_0 -> state_1 -> ... -> state_(N-1) -> __finish__
+def build_linear_workflow(n)
+  context_class = Class.new do
+    include Phronomy::WorkflowContext
+    field :count, type: :replace, default: -> { 0 }
+  end
+  Phronomy::Workflow.define(context_class) do
+    initial :state_0
+    n.times do |i|
+      state :"state_#{i}", action: ->(s) { s.merge(count: s.count + 1) }
+      transition from: :"state_#{i}", to: (i + 1 < n) ? :"state_#{i + 1}" : :__finish__
+    end
+  end
+end
+BENCH_WF_ITERATIONS = 50
+puts "=== bench_workflow_transition ==="
+Benchmark.bm(30) do |x|
+  [10, 50, 100].each do |n|
+    app = build_linear_workflow(n)
+    cfg = {recursion_limit: n + 5}
+    x.report("#{n} transitions") do
+      BENCH_WF_ITERATIONS.times { app.invoke({}, config: cfg) }
+    end
+  end
+end
+# Threshold assertion: 100 transitions should complete in <10ms on average
+puts "\nThreshold check: 100 transitions < 10ms average..."
+app100 = build_linear_workflow(100)
+cfg100 = {recursion_limit: 110}
+samples = 20
+elapsed = Benchmark.realtime { samples.times { app100.invoke({}, config: cfg100) } }
+avg_ms = (elapsed / samples) * 1000.0
+puts "  Average: #{"%.2f" % avg_ms}ms per run"
+if avg_ms < 10.0
+  puts "  PASS (< 10ms)"
+else
+  warn "  WARN: #{avg_ms.round(2)}ms exceeds 10ms threshold (environment may be slow)"
+end

data/benchmark/run_all.rb ADDED

@@ -0,0 +1,118 @@
+# frozen_string_literal: true
+# run_all.rb — Runs all Phronomy benchmarks in sequence.
+#
+# Usage:
+#   ruby benchmark/run_all.rb
+#
+# In CI this script must complete within 30 seconds (smoke check only).
+# The limit is read from BENCHMARK_MAX_SECONDS, which defaults to 60.
+#
+# Baseline management (nightly regression tracking):
+#   BENCHMARK_WRITE_BASELINE=path/to/baseline.json — write current throughput
+#     results from bench_regression.rb to a JSON baseline file.
+#   BENCHMARK_BASELINE=path/to/baseline.json — compare current results against
+#     the stored baseline; exit 1 if any metric regresses beyond the threshold.
+#   BENCHMARK_REGRESSION_THRESHOLD — percentage allowed before failing (default 20).
+require "benchmark"
+require "json"
+BENCH_DIR = __dir__
+SCRIPTS = %w[
+  bench_token_estimator.rb
+  bench_context_assembler.rb
+  bench_vector_store.rb
+  bench_workflow.rb
+  bench_tool_schema.rb
+  bench_agent_invoke.rb
+  bench_regression.rb
+].freeze
+puts "Phronomy benchmark suite"
+puts "Ruby #{RUBY_VERSION} on #{RUBY_PLATFORM}"
+puts "=" * 60
+overall_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+SCRIPTS.each do |script|
+  path = File.join(BENCH_DIR, script)
+  puts
+  load path
+end
+overall_elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - overall_start
+puts
+puts "=" * 60
+puts "Total elapsed: #{"%.2f" % overall_elapsed}s"
+# CI smoke check: fail if total exceeds the allowed limit.
+max_seconds = ENV.fetch("BENCHMARK_MAX_SECONDS", "60").to_i
+if overall_elapsed > max_seconds
+  warn "FAIL: benchmark suite exceeded #{max_seconds}s limit (took #{"%.1f" % overall_elapsed}s)"
+  exit 1
+end
+puts "OK: completed in #{"%.1f" % overall_elapsed}s (limit: #{max_seconds}s)"
+# ---------------------------------------------------------------------------
+# Baseline management — only active when the relevant env vars are set.
+# REGRESSION_RESULTS is defined in bench_regression.rb (loaded above).
+# ---------------------------------------------------------------------------
+write_path = ENV["BENCHMARK_WRITE_BASELINE"]
+compare_path = ENV["BENCHMARK_BASELINE"]
+threshold = ENV.fetch("BENCHMARK_REGRESSION_THRESHOLD", "20").to_f / 100.0
+if write_path
+  File.write(write_path, JSON.pretty_generate(REGRESSION_RESULTS))
+  puts "\nBaseline written to #{write_path}"
+end
+if compare_path
+  unless File.exist?(compare_path)
+    warn "FAIL: baseline file not found: #{compare_path}"
+    exit 1
+  end
+  baseline = JSON.parse(File.read(compare_path))
+  regressions = []
+  puts "\n#{"=" * 60}"
+  puts "Regression comparison (threshold: #{(threshold * 100).to_i}%)"
+  printf("%-46s  %10s  %10s  %8s\n", "Metric", "Baseline", "Current", "Change")
+  puts "-" * 78
+  REGRESSION_RESULTS.each do |key, current_ips|
+    unless baseline.key?(key)
+      printf("%-46s  %10s  %10.0f  %8s\n", key, "N/A", current_ips, "new")
+      next
+    end
+    baseline_ips = baseline[key].to_f
+    change = (baseline_ips - current_ips) / baseline_ips  # positive = slower
+    status = if change > threshold
+      regressions << {key:, baseline: baseline_ips, current: current_ips, change:}
+      "FAIL"
+    elsif change > threshold * 0.5
+      "WARN"
+    else
+      "OK"
+    end
+    printf("%-46s  %10.0f  %10.0f  %+7.1f%%  %s\n",
+      key, baseline_ips, current_ips, -change * 100, status)
+  end
+  if regressions.any?
+    puts
+    warn "FAIL: #{regressions.size} benchmark(s) regressed beyond #{(threshold * 100).to_i}%:"
+    regressions.each do |r|
+      warn "  #{r[:key]}: #{r[:baseline].round} → #{r[:current].round} iter/s " \
+           "(#{format("%+.1f%%", -r[:change] * 100)})"
+    end
+    exit 1
+  else
+    puts "\nAll benchmarks within threshold."
+  end
+end
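The comparison loop above classifies each metric by its relative drop in iterations per second. The arithmetic can be isolated as a small sketch. `classify_regression` is a hypothetical helper name, and the sample throughput numbers are made up for illustration; the 20% default and the half-threshold warning band mirror the script.

```ruby
# Hypothetical helper that isolates the comparison rule used in run_all.rb.
# change is the relative slowdown versus baseline: positive means slower.
def classify_regression(baseline_ips, current_ips, threshold: 0.20)
  change = (baseline_ips - current_ips) / baseline_ips.to_f
  status =
    if change > threshold
      "FAIL"
    elsif change > threshold * 0.5
      "WARN"
    else
      "OK"
    end
  {change: change, status: status}
end

# Illustrative inputs only (iterations per second).
[[1000.0, 950.0], [1000.0, 850.0], [1000.0, 700.0]].each do |baseline, current|
  result = classify_regression(baseline, current)
  puts format("%.0f -> %.0f iter/s: %+.1f%% %s",
    baseline, current, -result[:change] * 100, result[:status])
end
```

Note that the printed percentage is negated, so a slowdown shows up as a negative change, matching the `-change * 100` formatting in the script.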

data/docs/decisions/001-rubyllm-as-provider-layer.md ADDED

@@ -0,0 +1,42 @@
+# ADR-001: Use RubyLLM as the LLM Provider Layer
+## Status
+Accepted
+## Context
+Phronomy needs to send prompts to large language models and receive structured
+responses. The options were:
+1. Implement provider clients directly (OpenAI, Anthropic, Google, etc.)
+2. Vendor an existing Ruby abstraction library
+3. Treat providers as a pluggable adapter with a thin wrapper
+Implementing provider clients directly would require maintaining authentication,
+retry logic, streaming, and model versioning for each provider — significant
+ongoing maintenance cost. The Ruby ecosystem has a maturing option in RubyLLM,
+which provides a unified interface for multiple providers and handles streaming,
+tool call serialization, and response parsing.
+## Decision
+Phronomy delegates all LLM provider communication to the `ruby_llm` gem.
+`Phronomy::Agent::Base` and `Phronomy::Chain::LLMChain` call `RubyLLM.chat`
+(or equivalent) rather than provider SDKs directly.
+## Consequences
+**Positive:**
+- Provider switching is a configuration change, not a code change.
+- Streaming, tool call parsing, and multi-modal input handling are inherited
+  from RubyLLM without re-implementation.
+- The phronomy codebase stays focused on agent/workflow orchestration.
+**Negative / Tradeoffs:**
+- Phronomy's LLM feature surface is bounded by what RubyLLM exposes. Provider
+  capabilities not yet supported by RubyLLM are unavailable without a custom
+  adapter.
+- Bugs or breaking changes in RubyLLM require downstream fixes in phronomy.
+- Error types from providers are wrapped in RubyLLM errors; phronomy wraps
+  them again (see `Agent::Concerns::ErrorTranslation`).
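The last tradeoff describes two layers of error wrapping. A minimal sketch of that pattern in plain Ruby follows. The `ProviderLayer` and `Framework` modules and their error classes are hypothetical stand-ins, not the real RubyLLM or phronomy class names, and `error_translation.rb` itself is not shown in this section.

```ruby
# Hypothetical error classes for illustration only; these are not the
# actual RubyLLM or phronomy class names.
module ProviderLayer
  class Error < StandardError; end
end

module Framework
  class AgentError < StandardError; end
end

# Stands in for a provider-layer call that fails.
def call_provider
  raise ProviderLayer::Error, "rate limited"
end

# Translates the provider-layer error into a framework-level error.
def invoke_agent
  call_provider
rescue ProviderLayer::Error => e
  # Raising inside a rescue block sets Exception#cause automatically,
  # so the original error stays reachable for debugging.
  raise Framework::AgentError, "LLM call failed: #{e.message}"
end

begin
  invoke_agent
rescue Framework::AgentError => e
  puts "#{e.class}: #{e.message} (cause: #{e.cause.class})"
end
```

Callers can then rescue a single framework-level class, while `cause` still points back at the underlying error from the layer below.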