pikuri-core 0.0.6 → 0.0.7

@@ -0,0 +1,179 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Pikuri
4
+ # Renders attacker-controlled text safe to display, and reports *why*
5
+ # it was unsafe.
6
+ #
7
+ # Every string an LLM composes is untrusted: a bash command, a tool
8
+ # observation echoed back to the user, a description it wrote for a
9
+ # confirmation prompt. A model that is broken — or, far more likely,
10
+ # being driven by a prompt injection — can embed bytes that a terminal
11
+ # acts on rather than prints: a carriage return that overwrites the
12
+ # line the user just read, an ESC that recolors or repositions, a
13
+ # backspace that erases, a bidirectional override that reorders text so
14
+ # it reads differently than it runs, a zero-width character that hides
15
+ # in plain sight, or a Cyrillic +а+ masquerading as a Latin +a+. The
16
+ # whole point of a confirmation prompt collapses if the bytes the user
17
+ # approves are not the bytes that execute.
18
+ #
19
+ # {.sanitize} is the one chrome-independent primitive every renderer
20
+ # (terminal, TUI, web) routes through. It does two things and returns
21
+ # both as a {Result}:
22
+ #
23
+ # 1. *Neutralize* — make the dangerous bytes visible without changing
24
+ # structure. Control bytes become +\xNN+, bidi/zero-width codepoints
25
+ # become +\u{NNNN}+, tab becomes +\t+. Newlines are preserved
26
+ # (multi-line commands are normal). This is *faithful, not
27
+ # beautifying*: it never collapses runs of whitespace or rewrites a
28
+ # tab to a space, because the user must see exactly what they are
29
+ # approving — a Makefile's leading tab stays visibly a tab. A web
30
+ # chrome composes +html_escape(sanitize(s).text)+; the HTML layer is
31
+ # the caller's, not ours.
32
+ # 2. *Warn* — return a {Warning} per category detected, each a semantic
33
+ # record (kind + offending tokens + a plain-English explanation).
34
+ # Presentation is the chrome's: a terminal renders these bold yellow,
35
+ # a web client a banner. The {Warning} carries no color or markup.
36
+ #
37
+ # == Scope (deliberately closed)
38
+ #
39
+ # Detection covers the *invisibility / cursor-control / reordering*
40
+ # attack classes completely, because each is a finite, enumerable set
41
+ # of codepoints: C0 controls, C1 controls (a second ANSI introducer on
42
+ # some emulators), DEL, the bidi overrides, and the zero-width
43
+ # characters. On top of that, {.sanitize} flags *mixed-script tokens* —
44
+ # a single word mixing two or more of Latin, Cyrillic, and Greek, which
45
+ # is the signature of a homoglyph spoof and has near-zero false
46
+ # positives on real text (humans do not weld two alphabets inside one
47
+ # word; +café+ is all-Latin, +Москва+ all-Cyrillic, only +Pаypal+ mixes).
48
+ #
49
+ # Two confusable classes are explicitly *out of scope*, because
50
+ # detecting them needs Unicode confusables tables and produces heavy
51
+ # false positives on legitimate multilingual text:
52
+ #
53
+ # * *Whole-script* homoglyphs — an entirely-Cyrillic string that merely
54
+ # looks Latin (no mixing to detect).
55
+ # * *Single-symbol* confusables — the Greek question mark +;+ (U+037E)
56
+ # that looks like a semicolon, full-width forms, the division slash.
57
+ #
58
+ # "Solid" here means complete on the classes above, not exhaustive over
59
+ # all of Unicode.
60
+ module Sanitizer
61
+ # One reason a piece of text was flagged, ready for a chrome to
62
+ # render however it surfaces warnings (bold yellow line, web banner).
63
+ #
64
+ # * +kind+ — a {Symbol} category: +:backspace+, +:control_bytes+,
65
+ # +:bidi+, +:zero_width+, or +:mixed_script+.
66
+ # * +offenders+ — the distinct offending tokens, in first-seen order:
67
+ # the escaped forms (+"\\x1b"+, +"\\u{202e}"+) for byte categories,
68
+ # the raw tokens (+"Pаypal"+) for +:mixed_script+.
69
+ # * +explanation+ — a one-line, chrome-agnostic English summary of
70
+ # what the bytes can do.
71
+ Warning = Data.define(:kind, :offenders, :explanation)
72
+
73
+ # The output of {Sanitizer.sanitize}.
74
+ #
75
+ # * +text+ — the neutralized string, safe to print literally.
76
+ # * +warnings+ — {Array}<{Warning}>, empty when nothing was flagged.
77
+ Result = Data.define(:text, :warnings)
78
+
79
+ # Bidirectional-override codepoints: the explicit LRO/RLO/PDF/LRE/RLE
80
+ # set plus the isolate set (LRI/RLI/FSI/PDI). Reordering attacks.
81
+ BIDI_OVERRIDES = [*0x202a..0x202e, *0x2066..0x2069].freeze
82
+
83
+ # Zero-width and invisible codepoints: ZWSP, ZWNJ, ZWJ, and the BOM /
84
+ # zero-width no-break space.
85
+ ZERO_WIDTH = [0x200b, 0x200c, 0x200d, 0xfeff].freeze
86
+
87
+ # Codepoints {.sanitize} rewrites: C0 controls including tab (U+0009)
88
+ # but *excluding* newline (U+000A, which passes through untouched),
89
+ # C1 controls + DEL (U+007F–009F), the zero-width set, and the bidi
90
+ # overrides. Newline is the one control character a faithful render
91
+ # must keep, so the C0 range is split around it.
92
+ SUSPECT = /[\u0000-\u0009\u000b-\u001f\u007f-\u009f\u200b-\u200d\u202a-\u202e\u2066-\u2069\ufeff]/
93
+
94
+ # The three Latin-confusable scripts whose mixing inside one token
95
+ # signals a homoglyph spoof. Punctuation, digits and spaces are the
96
+ # +Common+ script and match none of these, so they never count toward
97
+ # the "two distinct scripts" threshold.
98
+ CONFUSABLE_SCRIPTS = { 'Latin' => /\p{Latin}/, 'Cyrillic' => /\p{Cyrillic}/, 'Greek' => /\p{Greek}/ }.freeze
99
+
100
+ # Neutralize +text+ for literal display and report what was flagged.
101
+ #
102
+ # @param text [String] attacker-controlled text (an LLM-composed
103
+ # command, description, or tool observation), e.g.
104
+ # +"echo hi\rrm -rf /"+
105
+ # @return [Result] the neutralized +text+ plus an {Array}<{Warning}>
106
+ # (empty when clean)
107
+ def self.sanitize(text)
108
+ backspace = false
109
+ control = []
110
+ bidi = []
111
+ zero_width = []
112
+
113
+ clean = text.gsub(SUSPECT) do |ch|
114
+ cp = ch.ord
115
+ if cp == 0x09
116
+ '\\t'
117
+ elsif cp == 0x08
118
+ backspace = true
119
+ '\\x08'
120
+ elsif BIDI_OVERRIDES.include?(cp)
121
+ format('\\u{%04x}', cp).tap { |t| bidi << t }
122
+ elsif ZERO_WIDTH.include?(cp)
123
+ format('\\u{%04x}', cp).tap { |t| zero_width << t }
124
+ else
125
+ format('\\x%02x', cp).tap { |t| control << t }
126
+ end
127
+ end
128
+
129
+ Result.new(text: clean, warnings: warnings_for(backspace, control, bidi, zero_width, mixed_script_tokens(text)))
130
+ end
131
+
132
+ # Tokens (whitespace-delimited runs) that combine letters from two or
133
+ # more of {CONFUSABLE_SCRIPTS} — the homoglyph-spoof signature.
134
+ #
135
+ # @param text [String]
136
+ # @return [Array<String>] distinct offending tokens, first-seen order
137
+ def self.mixed_script_tokens(text)
138
+ text.split(/\s+/).reject(&:empty?).select do |token|
139
+ CONFUSABLE_SCRIPTS.count { |_name, re| token.match?(re) } >= 2
140
+ end.uniq
141
+ end
142
+
143
+ # Assemble one {Warning} per non-empty category, in a stable order
144
+ # (most-deceptive first).
145
+ #
146
+ # @return [Array<Warning>]
147
+ def self.warnings_for(backspace, control, bidi, zero_width, mixed)
148
+ out = []
149
+ if backspace
150
+ out << Warning.new(kind: :backspace, offenders: ['\\x08'],
151
+ explanation: 'Backspace characters present — the model may be trying to visually erase ' \
152
+ 'part of the text after you have read it.')
153
+ end
154
+ unless bidi.empty?
155
+ out << Warning.new(kind: :bidi, offenders: bidi.uniq,
156
+ explanation: "Bidirectional-override characters present (#{bidi.uniq.join(' ')}) — these " \
157
+ 'can reorder how text is displayed so it reads differently than it runs.')
158
+ end
159
+ unless zero_width.empty?
160
+ out << Warning.new(kind: :zero_width, offenders: zero_width.uniq,
161
+ explanation: "Zero-width / invisible characters present (#{zero_width.uniq.join(' ')}) — " \
162
+ 'the text may contain characters you cannot see.')
163
+ end
164
+ unless control.empty?
165
+ out << Warning.new(kind: :control_bytes, offenders: control.uniq,
166
+ explanation: "Non-printable control bytes present (#{control.uniq.join(' ')}) — in a " \
167
+ 'terminal these can move the cursor, change colors, or hide output.')
168
+ end
169
+ unless mixed.empty?
170
+ out << Warning.new(kind: :mixed_script, offenders: mixed,
171
+ explanation: "Mixed-script tokens present (#{mixed.join(', ')}) — letters from different " \
172
+ "alphabets are combined within one word, a classic homoglyph spoof (e.g. " \
173
+ "Cyrillic 'а' standing in for Latin 'a').")
174
+ end
175
+ out
176
+ end
177
+ private_class_method :warnings_for
178
+ end
179
+ end
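A sketch of how a chrome might consume the sanitizer's output. It assumes two stand-in Structs (`DemoWarning`, `DemoResult`) that only mirror the shape of `Sanitizer::Warning` and `Sanitizer::Result`, and a hand-built result instead of a real `sanitize` call. The helpers `render_terminal` and `render_web` are hypothetical names, not part of pikuri-core. The point is that color and HTML escaping live in the chrome while the sanitizer stays markup-free.

```ruby
require 'erb'

# Stand-ins for Sanitizer::Warning / Sanitizer::Result (assumption: plain
# Structs, so this sketch runs without the gem installed).
DemoWarning = Struct.new(:kind, :offenders, :explanation, keyword_init: true)
DemoResult  = Struct.new(:text, :warnings, keyword_init: true)

BOLD_YELLOW = "\e[1;33m"
RESET       = "\e[0m"

# Hypothetical terminal chrome: one styled line per warning, then the
# already-neutralized text printed literally.
def render_terminal(result)
  lines = result.warnings.map { |w| "#{BOLD_YELLOW}WARNING: #{w.explanation}#{RESET}" }
  (lines + [result.text]).join("\n")
end

# Hypothetical web chrome: HTML escaping is layered on top of the
# neutralized text, in the spirit of html_escape(sanitize(s).text).
def render_web(result)
  banners = result.warnings.map do |w|
    %(<div class="warning">#{ERB::Util.html_escape(w.explanation)}</div>)
  end
  (banners + ["<pre>#{ERB::Util.html_escape(result.text)}</pre>"]).join("\n")
end

# Hand-built result resembling what sanitizing "echo hi\rrm -rf / <b>"
# could return: the carriage return is shown as a literal \x0d.
DEMO_RESULT = DemoResult.new(
  text: 'echo hi\\x0drm -rf / <b>',
  warnings: [
    DemoWarning.new(kind: :control_bytes, offenders: ['\\x0d'],
                    explanation: 'Non-printable control bytes present (\\x0d)')
  ]
)

puts render_terminal(DEMO_RESULT)
puts render_web(DEMO_RESULT)
```

Keeping the two renderers separate from the data records means a new chrome only needs to read `kind`, `offenders`, and `explanation`.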
@@ -68,6 +68,36 @@ module Pikuri
68
68
  add(name, 'string', description, required: false)
69
69
  end
70
70
 
71
+ # Add a required +array+-of-+string+ property — JSON-Schema
72
+ # +{type: 'array', items: {type: 'string'}}+. The LLM sends a
73
+ # native JSON array in the tool-call arguments (the shape its
74
+ # training data overwhelmingly uses for list-valued parameters),
75
+ # so there is no in-band encoding for it to get wrong.
76
+ # The value must arrive as an Array — no
77
+ # JSON-encoded-array-in-a-string fallback. Element coercion
78
+ # mirrors the scalar fields' one documented leniency, in
79
+ # reverse: Integers and finite Floats are converted to their
80
+ # +to_s+ form (a model emitting +["Fix issue 12", 42]+ meant a
81
+ # string list — the conversion is unambiguous), while booleans,
82
+ # +nil+, and nested structures are rejected — those signal a
83
+ # genuinely wrong call shape, not a representational quirk.
84
+ # An empty array is type-valid; rejecting it (if the tool needs
85
+ # at least one element) is the tool's job, with a tool-specific
86
+ # error message.
87
+ #
88
+ # @param name [Symbol] property name
89
+ # @param description [String] human-readable description shown to the LLM
90
+ # @return [self]
91
+ def required_string_array(name, description)
92
+ @properties[name] = {
93
+ type: 'array',
94
+ items: { type: 'string' },
95
+ description: description
96
+ }
97
+ @required << name.to_s
98
+ self
99
+ end
100
+
71
101
  # Add a required +integer+ property. Accepts Integers, Floats with a
72
102
  # zero fractional part (e.g. +1.0+), and base-10 numeric Strings (after
73
103
  # trimming) that resolve to whole numbers; rejects everything else.
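The element rules described for the string-array property can be restated as a standalone sketch. `DemoCoercionError`, `demo_coerce_string_array`, and `demo_rejection` are illustrative stand-ins with shortened error messages, not the gem's API.

```ruby
# Stand-in for the gem's CoercionError (assumption for this sketch).
class DemoCoercionError < StandardError; end

# Restates the element rules: Strings pass through, Integers and finite
# Floats become their to_s form, everything else is rejected.
def demo_coerce_string_array(value)
  raise DemoCoercionError, "must be an array of strings (got #{value.class})" unless value.is_a?(Array)

  value.each_with_index.map do |element, i|
    case element
    when String then element
    when Integer then element.to_s
    when Float
      raise DemoCoercionError, "element #{i} is a non-finite Float" unless element.finite?

      element.to_s
    else
      raise DemoCoercionError, "element #{i} is #{element.class}"
    end
  end
end

# Returns the error message for a rejected input, or nil when accepted.
def demo_rejection(value)
  demo_coerce_string_array(value)
  nil
rescue DemoCoercionError => e
  e.message
end

p demo_coerce_string_array(['Fix issue 12', 42, 1.5]) # numbers become strings
p demo_coerce_string_array([])                        # empty array is type-valid
p demo_rejection(['ok', true])                        # booleans are rejected
p demo_rejection('["a","b"]')                         # JSON-in-a-string is rejected
p demo_rejection([Float::NAN])                        # non-finite Floats are rejected
```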
@@ -260,9 +290,35 @@ module Pikuri
260
290
  coerce_number(value)
261
291
  when 'boolean'
262
292
  coerce_boolean(value)
293
+ when 'array'
294
+ coerce_string_array(value)
295
+ end
296
+ end
297
+
298
+ def coerce_string_array(value)
299
+ raise CoercionError, "must be an array of strings (got #{value.class}: #{value.inspect})" unless value.is_a?(Array)
300
+
301
+ value.each_with_index.map do |element, i|
302
+ case element
303
+ when String
304
+ element
305
+ when Integer
306
+ element.to_s
307
+ when Float
308
+ raise CoercionError, array_element_message(i, element) unless element.finite?
309
+
310
+ element.to_s
311
+ else
312
+ raise CoercionError, array_element_message(i, element)
313
+ end
263
314
  end
264
315
  end
265
316
 
317
+ def array_element_message(index, element)
318
+ "must be an array of strings (element #{index} is #{element.class}: #{element.inspect}; " \
319
+ 'numbers are auto-converted, other types are not)'
320
+ end
321
+
266
322
  def coerce_boolean(value)
267
323
  return value if value == true || value == false
268
324
 
@@ -341,7 +397,14 @@ module Pikuri
341
397
 
342
398
  def missing_required_message(name, schema)
343
399
  enum_part = schema[:enum] ? ", one of: #{schema[:enum].map { |v| "`#{v}`" }.join(', ')}" : ''
344
- "Missing required parameter `#{name}` (#{schema[:type]}#{enum_part}): #{schema[:description]}"
400
+ "Missing required parameter `#{name}` (#{type_label(schema)}#{enum_part}): #{schema[:description]}"
401
+ end
402
+
403
+ # Human/LLM-facing label for a property's type in error messages:
404
+ # +"array of strings"+ for array properties, the bare JSON-Schema
405
+ # type name otherwise.
406
+ def type_label(schema)
407
+ schema[:items] ? "array of #{schema[:items][:type]}s" : schema[:type]
345
408
  end
346
409
 
347
410
  def unknown_key_error(unknown)
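For illustration, the label rule applied to two hand-written schema hashes. `demo_type_label` copies the one-line rule from `type_label`; the property names and descriptions are made up.

```ruby
# Copy of the label rule: array properties read "array of <item type>s",
# everything else uses the bare JSON-Schema type name.
def demo_type_label(schema)
  schema[:items] ? "array of #{schema[:items][:type]}s" : schema[:type]
end

# Hypothetical property schemas, shaped like the ones the builder stores.
PATHS_SCHEMA = { type: 'array', items: { type: 'string' }, description: 'Files to read' }.freeze
LIMIT_SCHEMA = { type: 'integer', description: 'Maximum number of lines' }.freeze

puts "Missing required parameter `paths` (#{demo_type_label(PATHS_SCHEMA)}): #{PATHS_SCHEMA[:description]}"
puts "Missing required parameter `limit` (#{demo_type_label(LIMIT_SCHEMA)}): #{LIMIT_SCHEMA[:description]}"
```

Without the label, the first message would say only `array`, which leaves the expected element type implicit.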
@@ -366,7 +429,7 @@ module Pikuri
366
429
  *@properties.map { |name, prop|
367
430
  req = @required.include?(name.to_s) ? 'required' : 'optional'
368
431
  enum_part = prop[:enum] ? ", one of: #{prop[:enum].map { |v| "`#{v}`" }.join(', ')}" : ''
369
- " - `#{name}` (#{prop[:type]}, #{req}#{enum_part}): #{prop[:description]}"
432
+ " - `#{name}` (#{type_label(prop)}, #{req}#{enum_part}): #{prop[:description]}"
370
433
  }
371
434
  ].join("\n")
372
435
  end
@@ -9,13 +9,17 @@ module Pikuri
9
9
  module Search
10
10
  # Performs a Brave Search via the official Web Search API and returns
11
11
  # the hits as a list of {Result} rows. Split into a thin HTTP fetch
12
- # (#search) and a pure parser (#parse) so tests can exercise the
12
+ # (#search) and a pure parser (.parse) so tests can exercise the
13
13
  # parser against fixture JSON without hitting the network. The
14
- # cascade in {Engines.search} owns the final Markdown rendering.
14
+ # cascade in {Engines#search} owns the final Markdown rendering.
15
15
  #
16
- # Requires a Brave Search API key. Get one at
17
- # https://api-dashboard.search.brave.com the free "Data for Search"
18
- # tier allows 1 query/sec and ~2k queries/month.
16
+ # A class constructed with the API key it should use
17
+ # (+Brave.new(api_key:)+); {Engines} builds one only when a Brave key
18
+ # was configured and then drives it through the same +#search+ /
19
+ # +#label+ interface as every other provider. pikuri reads no key
20
+ # from the environment (see CLAUDE.md "Environment is not a secret
21
+ # store"). Get a key at https://api-dashboard.search.brave.com — the
22
+ # free "Data for Search" tier allows 1 query/sec and ~2k queries/month.
19
23
  #
20
24
  # == Privacy posture
21
25
  #
@@ -32,49 +36,59 @@ module Pikuri
32
36
  # 90-day retention by default, real ZDR if you pay for it. Still a
33
37
  # logged 90-day window on the cheap tier, so not a substitute for
34
38
  # ZDR for genuinely sensitive queries.
35
- module Brave
39
+ class Brave
36
40
  # @return [String] Web Search endpoint
37
41
  ENDPOINT = 'https://api.search.brave.com/res/v1/web/search'
38
42
  # @return [Integer] default number of results returned, matching
39
43
  # {DuckDuckGo::DEFAULT_MAX_RESULTS}
40
44
  DEFAULT_MAX_RESULTS = 10
41
- # @return [String] env var holding the API key; +X-Subscription-Token+
42
- ENV_KEY = 'BRAVE_SEARCH_API_KEY'
43
45
  # @return [RateLimiter] free-tier Brave caps at 1 req/sec; the
44
46
  # 5-minute cooldown protects the limited monthly quota from
45
47
  # being burned on doomed retries when a 429 hits.
46
48
  LIMITER = RateLimiter.new(min_interval: 1.0, cooldown: 300.0)
47
49
 
50
+ # @param api_key [String] Brave Search subscription token. Required
51
+ # and non-blank: pikuri reads no key from the environment — the
52
+ # host supplies it ({Engines} only constructs a Brave when a key
53
+ # was configured).
54
+ # @raise [ArgumentError] if +api_key+ is blank
55
+ def initialize(api_key:)
56
+ raise ArgumentError, 'Brave Search API key is blank' if api_key.to_s.strip.empty?
57
+
58
+ @api_key = api_key
59
+ end
60
+
61
+ # @return [String] short provider label for {Engines} logging /
62
+ # fallback messages.
63
+ def label
64
+ 'Brave'
65
+ end
66
+
48
67
  # Fetch results for +query+ and return them as an +Array<Result>+.
49
68
  # Calls are throttled to one per second and circuit-broken for 5
50
69
  # minutes on rate-limit / quota-exhausted responses; see {LIMITER}.
51
- # The caller (typically {Engines.search}) is expected to have
70
+ # The caller (typically {Engines#search}) is expected to have
52
71
  # already normalized the query and to wrap this in a result cache.
53
72
  #
54
73
  # @param query [String] search query (already normalized)
55
74
  # @param max_results [Integer] maximum number of result entries;
56
75
  # passed through as Brave's +count+ (1..20)
57
- # @param api_key [String] Brave Search subscription token; defaults to
58
- # the {ENV_KEY} environment variable
59
76
  # @return [Array<Result>] hits, possibly empty when Brave ran the
60
77
  # query and matched nothing
61
- # @raise [ArgumentError] if no API key is available
62
78
  # @raise [Engines::Unavailable] when Brave returns HTTP 429
63
79
  # (rate limit / quota exhausted) or 5xx — "try again later"
64
- # responses the cascade in {Engines.search} can fall back
80
+ # responses the cascade in {Engines#search} can fall back
65
81
  # from. Also raised immediately if {LIMITER} is in cooldown.
66
82
  # Other non-2xx (e.g. 401/403 from a bad API key) bubble up as
67
83
  # +RuntimeError+ so config problems stay visible.
68
84
  # @raise [RuntimeError] for non-rate-limit HTTP failures or when the
69
85
  # response shape contains no results.
70
- def self.search(query, max_results: DEFAULT_MAX_RESULTS, api_key: ENV.fetch(ENV_KEY, nil))
71
- raise ArgumentError, "Brave Search API key not set (#{ENV_KEY})" if api_key.to_s.strip.empty?
72
-
86
+ def search(query, max_results: DEFAULT_MAX_RESULTS)
73
87
  LIMITER.call do
74
88
  response = Faraday.get(
75
89
  ENDPOINT,
76
90
  { q: query, count: max_results },
77
- { 'X-Subscription-Token' => api_key, 'Accept' => 'application/json' }
91
+ { 'X-Subscription-Token' => @api_key, 'Accept' => 'application/json' }
78
92
  )
79
93
  unless response.success?
80
94
  if response.status == 429 || response.status >= 500
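A minimal sketch of the constructor-injected key pattern, using a hypothetical `DemoKeyedProvider` rather than the real `Brave` class (which also performs HTTP). It shows that `nil`, empty, and whitespace-only keys are all treated as blank by the `to_s.strip.empty?` check, and that the key reaches the request headers from instance state rather than from `ENV`.

```ruby
# Hypothetical stand-in for a keyed provider; no network access.
class DemoKeyedProvider
  def initialize(api_key:)
    raise ArgumentError, 'API key is blank' if api_key.to_s.strip.empty?

    @api_key = api_key
  end

  def label
    'Demo'
  end

  # Headers an instance would send; the key comes from the constructor.
  def headers
    { 'X-Subscription-Token' => @api_key, 'Accept' => 'application/json' }
  end
end

# True when constructing with this key raises ArgumentError.
def demo_blank_key?(key)
  DemoKeyedProvider.new(api_key: key)
  false
rescue ArgumentError
  true
end

p demo_blank_key?(nil)      # treated as blank
p demo_blank_key?("  \n")   # treated as blank
p demo_blank_key?('abc123') # accepted
p DemoKeyedProvider.new(api_key: 'abc123').headers['X-Subscription-Token']
```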
@@ -84,7 +98,7 @@ module Pikuri
84
98
  raise "Brave Search request failed: #{response.status} #{response.body}"
85
99
  end
86
100
 
87
- parse(response.body, max_results: max_results)
101
+ self.class.parse(response.body, max_results: max_results)
88
102
  end
89
103
  end
90
104
 
@@ -9,9 +9,14 @@ module Pikuri
9
9
  module Search
10
10
  # Performs a DuckDuckGo search by scraping +html.duckduckgo.com+ and
11
11
  # returns the hits as a list of {Result} rows. Split into a thin HTTP
12
- # fetch (#search) and a pure parser (#parse) so tests can exercise
12
+ # fetch (#search) and a pure parser (.parse) so tests can exercise
13
13
  # the parser against fixture HTML without hitting the network. The
14
- # cascade in {Engines.search} owns the final Markdown rendering.
14
+ # cascade in {Engines#search} owns the final Markdown rendering.
15
+ #
16
+ # A class (constructed with no arguments) so it shares the uniform
17
+ # provider shape with the keyed {Brave} / {Exa}: {Engines} holds a
18
+ # list of provider *instances* and calls +#search+ / +#label+ on each
19
+ # without caring which is which.
15
20
  #
16
21
  # == Privacy posture
17
22
  #
@@ -30,7 +35,7 @@ module Pikuri
30
35
  # Microsoft, who has no comparable no-training pledge. Better than
31
36
  # Exa for sensitive queries, worse than Brave; for anything
32
37
  # genuinely embarrassing, don't search the web at all.
33
- module DuckDuckGo
38
+ class DuckDuckGo
34
39
  # @return [String] HTML search endpoint
35
40
  ENDPOINT = 'https://html.duckduckgo.com/html/'
36
41
  # @return [String] User-Agent sent with each request; DDG often rejects
@@ -44,10 +49,16 @@ module Pikuri
44
49
  # soft-block response doesn't get retried for the next 5 minutes
45
50
  LIMITER = RateLimiter.new(min_interval: 2.0, cooldown: 300.0)
46
51
 
52
+ # @return [String] short provider label for {Engines} logging /
53
+ # fallback messages. Uniform across providers (see {Brave#label}).
54
+ def label
55
+ 'DuckDuckGo'
56
+ end
57
+
47
58
  # Fetch results for +query+ and return them as an +Array<Result>+.
48
59
  # Calls are throttled to one every 2s and circuit-broken for 5 minutes
49
60
  # after a soft-block; see {LIMITER}. The caller (typically
50
- # {Engines.search}) is expected to have already normalized the
61
+ # {Engines#search}) is expected to have already normalized the
51
62
  # query and to wrap this in a result cache.
52
63
  #
53
64
  # @param query [String] search query (already normalized)
@@ -56,12 +67,12 @@ module Pikuri
56
67
  # query and matched nothing
57
68
  # @raise [Engines::Unavailable] when DDG soft-blocks the IP
58
69
  # (anomaly/CAPTCHA page) or returns HTTP 429/5xx — i.e. "try again
59
- # later" responses the cascade in {Engines.search} can fall
70
+ # later" responses the cascade in {Engines#search} can fall
60
71
  # back from. Also raised immediately if {LIMITER} is in cooldown.
61
72
  # @raise [RuntimeError] if the HTTP call fails for other reasons or
62
73
  # the empty-results page is in an unrecognized layout. A genuine
63
74
  # empty-results page is *not* an error; see {.parse}.
64
- def self.search(query, max_results: DEFAULT_MAX_RESULTS)
75
+ def search(query, max_results: DEFAULT_MAX_RESULTS)
65
76
  LIMITER.call do
66
77
  response = Faraday.get(ENDPOINT, { q: query }, { 'User-Agent' => USER_AGENT })
67
78
  unless response.success?
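A sketch of the shape the providers now share, with hypothetical classes and canned response bodies instead of HTTP. The instance-level `#search` reaches the class-level `.parse` through `self.class.parse`; a bare `parse(...)` call would look for an instance method once `search` is no longer a module function.

```ruby
# Hypothetical keyless provider: pure class-level parser, instance-level search.
class DemoKeylessProvider
  # Pure parser: testable against fixture text without a network call.
  def self.parse(body, max_results:)
    body.lines.map(&:strip).reject(&:empty?).first(max_results)
  end

  def label
    'Keyless'
  end

  def search(_query, max_results: 10)
    body = "first hit\nsecond hit\nthird hit\n" # canned stand-in for an HTTP body
    self.class.parse(body, max_results: max_results)
  end
end

# Hypothetical keyed provider with the same #search / #label interface.
class DemoKeyedSearch
  def self.parse(body, max_results:)
    body.split('|').first(max_results)
  end

  def initialize(api_key:)
    @api_key = api_key
  end

  def label
    'Keyed'
  end

  def search(_query, max_results: 10)
    self.class.parse('alpha|beta|gamma', max_results: max_results)
  end
end

# The caller holds instances and never asks which class it is talking to.
DEMO_PROVIDERS = [DemoKeylessProvider.new, DemoKeyedSearch.new(api_key: 'abc123')].freeze
DEMO_PROVIDERS.each do |provider|
  puts "#{provider.label}: #{provider.search('ruby', max_results: 2).inspect}"
end
```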
@@ -72,7 +83,7 @@ module Pikuri
72
83
  raise "DuckDuckGo request failed: #{response.status} #{response.body}"
73
84
  end
74
85
 
75
- parse(response.body, max_results: max_results)
86
+ self.class.parse(response.body, max_results: max_results)
76
87
  end
77
88
  end
78
89
 
@@ -2,20 +2,31 @@
2
2
 
3
3
  module Pikuri
4
4
  class Tool
5
- # Namespace for the web-search stack used by {Tool::WEB_SEARCH}: per-
5
+ # Namespace for the web-search stack used by {Tool::WebSearch}: per-
6
6
  # provider modules ({DuckDuckGo}, {Brave}, {Exa}), the {Result} value
7
7
  # object they all return, the cross-provider {Engines} cascade with
8
8
  # its on-disk cache, and the shared {RateLimiter} a provider can wire
9
9
  # in to back off when a quota header says so.
10
10
  module Search
11
- # Search-orchestration entry point: the cascade across configured
11
+ # Search-orchestration object: the cascade across configured
12
12
  # providers, the result cache, and the {Unavailable} protocol marker
13
- # the cascade uses to fall back. The LLM-facing tool itself
14
- # ({Tool::WEB_SEARCH}) lives in +lib/tool/web_search.rb+ and calls
15
- # into {.search} below. Each {Tool::Search} provider module
13
+ # the cascade uses to fall back. The LLM-facing tool itself is built
14
+ # by {Tool::WebSearch.build}, which constructs one of these and wires
15
+ # its {#search} into a {Tool}. Each {Tool::Search} provider class
16
16
  # ({DuckDuckGo}, {Brave}, {Exa}) raises {Unavailable} when it wants
17
17
  # the cascade to try the next one.
18
- module Engines
18
+ #
19
+ # == Provider keys are constructor config, not environment
20
+ #
21
+ # Brave and Exa are paid and need an API key; DuckDuckGo needs none.
22
+ # An {Engines} is constructed with the keys it should use
23
+ # (+brave_key:+ / +exa_key:+, both optional) — pikuri reads no key
24
+ # from the environment, so the only providers in the cascade are
25
+ # DuckDuckGo plus whichever keyed providers the host actually
26
+ # configured. The host sources those keys however it likes (the
27
+ # bundled +bin/+ examples load a JSON config file by convention); see
28
+ # CLAUDE.md "Environment is not a secret store".
29
+ class Engines
19
30
  # Subsystem logger; set its level with +PIKURI_LOG_ENGINES+
20
31
  # (e.g. +PIKURI_LOG_ENGINES=debug+) or the global +PIKURI_LOG+.
21
32
  #
@@ -24,40 +35,54 @@ module Pikuri
24
35
 
25
36
  # Raised by a provider when it is temporarily unavailable (rate-limited,
26
37
  # bot-blocked, quota-exhausted, or otherwise saying "try again later"
27
- # rather than "your request is wrong"). The cascade in {Engines.search}
38
+ # rather than "your request is wrong"). The cascade in {#search}
28
39
  # catches this and tries the next provider; any other exception bubbles
29
40
  # up unchanged so genuine bugs and config errors stay visible.
30
41
  class Unavailable < StandardError; end
31
42
 
32
- # All providers that are currently configured. {DuckDuckGo} is always
33
- # available (no API key needed); {Brave} and {Exa} each join the
34
- # list when their API token is present in the environment. Recomputed
35
- # on every call so a process picks up a newly-set token without a
36
- # restart.
37
- #
38
- # @return [Array<Module>] +Tool::Search::*+ provider modules, each
39
- # exposing +.search(query, max_results:)+ → +Array<Result>+
40
- def self.providers
41
- list = [DuckDuckGo]
42
- list << Brave unless ENV[Brave::ENV_KEY].to_s.strip.empty?
43
- list << Exa unless ENV[Exa::ENV_KEY].to_s.strip.empty?
44
- list
45
- end
46
-
47
- # On-disk cache used by {.search} to memoize answered queries.
48
- # Defined as a method so specs can swap it for an isolated cache
49
- # or {UrlCache::NULL} without touching the shared instance.
43
+ # Process-shared on-disk cache backing {#search}'s default. Kept at
44
+ # class level (not per-instance) so every engine dedupes answered
45
+ # queries into one directory; the constructor's +cache:+ parameter
46
+ # injects a different store for tests. Exposed as a method so specs
47
+ # can swap it for {UrlCache::NULL} without touching the instance.
50
48
  #
51
49
  # @return [UrlCache, #fetch]
52
50
  CACHE = UrlCache.new(ttl: UrlCache::DEFAULT_TTL, dir: "#{UrlCache::ROOT_DIR}/web_search")
53
- # Accessor for {CACHE}; specs override this to swap in
54
- # {UrlCache::NULL} or an isolated cache.
51
+ # Accessor for {CACHE}, used as the constructor's +cache:+ default;
52
+ # specs override this to swap in {UrlCache::NULL}.
55
53
  #
56
54
  # @return [UrlCache, #fetch]
57
55
  def self.cache
58
56
  CACHE
59
57
  end
60
58
 
59
+ # Builds the provider cascade once: {DuckDuckGo} always (no key
60
+ # needed), plus {Brave} / {Exa} when their key was supplied
61
+ # (non-blank). Each keyed provider is constructed with its key, so
62
+ # from here on every provider is just an object answering +#search+
63
+ # / +#label+ — the cascade in {#search} treats them uniformly.
64
+ #
65
+ # @param brave_key [String, nil] Brave Search subscription token;
66
+ # non-blank ⇒ Brave joins the cascade. +nil+/blank ⇒ not configured.
67
+ # @param exa_key [String, nil] Exa API key; non-blank ⇒ Exa joins the
68
+ # cascade. +nil+/blank ⇒ not configured.
69
+ # @param cache [UrlCache, #fetch] result store memoizing answered
70
+ # queries; defaults to the process-shared {.cache}.
71
+ # @return [Engines]
72
+ def initialize(brave_key: nil, exa_key: nil, cache: self.class.cache)
73
+ @providers = [DuckDuckGo.new]
74
+ @providers << Brave.new(api_key: brave_key) unless brave_key.to_s.strip.empty?
75
+ @providers << Exa.new(api_key: exa_key) unless exa_key.to_s.strip.empty?
76
+ @cache = cache
77
+ @last_logged_providers = nil
78
+ end
79
+
80
+ # The provider instances this engine cascades across, in
81
+ # declaration order (the cascade itself shuffles them per call).
82
+ #
83
+ # @return [Array<#search, #label>] configured provider instances
84
+ attr_reader :providers
85
+
61
86
  # Run +query+ through the configured providers in random order, falling
62
87
  # back to the next one each time a provider raises {Unavailable}. The
63
88
  # shuffle spreads load so a single provider isn't always hit first
@@ -68,13 +93,13 @@ module Pikuri
68
93
  # +Array<Result>+ is rendered into smolagents-style Markdown here
69
94
  # (+"## Search Results"+ header, then +[title](url)\nbody+ entries
70
95
  # joined by blank lines; an empty array becomes +"No results found."+),
71
- # and the rendered Markdown is cached on disk via {.cache}, keyed by
72
- # the cleaned query. A cache hit short-circuits the cascade entirely
73
- # (and benefits whichever provider would have answered next time too
74
- # — once a query is cached, the cooldown state of the original
75
- # answering provider no longer matters). +max_results+ is not part
76
- # of the cache key, so callers passing a non-default value may get
77
- # a result rendered with the previously-cached size.
96
+ # and the rendered Markdown is cached on disk via {#initialize}'s
97
+ # +cache:+, keyed by the cleaned query. A cache hit short-circuits the
98
+ # cascade entirely (and benefits whichever provider would have
99
+ # answered next time too — once a query is cached, the cooldown state
100
+ # of the original answering provider no longer matters). +max_results+
101
+ # is not part of the cache key, so callers passing a non-default value
102
+ # may get a result rendered with the previously-cached size.
78
103
  #
79
104
  # If every provider reports temporary unavailability, returns an
80
105
  # +"Error: ..."+ string instead of raising — same convention as
@@ -88,7 +113,7 @@ module Pikuri
88
113
  # @return [String] Markdown-formatted result list, or +"Error: ..."+
89
114
  # when all providers are exhausted
90
115
  # @raise [ArgumentError] if the query is empty after normalization
91
- def self.search(query, max_results:)
116
+ def search(query, max_results:)
92
117
  cleaned = query.to_s.strip.gsub(/\s+/, ' ')
93
118
  raise ArgumentError, 'query is empty' if cleaned.empty?
94
119
 
@@ -96,7 +121,7 @@ module Pikuri
96
121
  log_providers(current_providers)
97
122
 
98
123
  hit = true
99
- result = cache.fetch(cleaned) do
124
+ result = @cache.fetch(cleaned) do
100
125
  hit = false
101
126
  failures = []
102
127
  results = nil
@@ -106,7 +131,7 @@ module Pikuri
106
131
  chosen = provider
107
132
  break
108
133
  rescue Unavailable => e
109
- failures << "#{provider.name.split('::').last} (#{e.message})"
134
+ failures << "#{provider.label} (#{e.message})"
110
135
  end
111
136
  # Raise so {UrlCache#fetch} does NOT persist the all-unavailable
112
137
  # message — otherwise that string would block every future search
@@ -115,7 +140,7 @@ module Pikuri
115
140
  chosen or raise Unavailable, "all search providers temporarily unavailable: #{failures.join('; ')}"
116
141
 
117
142
  LOGGER.info do
118
- "engine=#{chosen.name.split('::').last} query=#{cleaned.inspect} results=#{results.size}"
143
+ "engine=#{chosen.label} query=#{cleaned.inspect} results=#{results.size}"
119
144
  end
120
145
  render(results)
121
146
  end
@@ -125,6 +150,8 @@ module Pikuri
125
150
  "Error: #{e.message}"
126
151
  end
127
152
 
153
+ private
154
+
128
155
  # Render an +Array<Result>+ into the smolagents-style Markdown the
129
156
  # LLM consumes: +"## Search Results"+ header, then +[title](url)\nbody+
130
157
  # entries joined by blank lines. An empty array becomes the
@@ -133,30 +160,26 @@ module Pikuri
133
160
  #
134
161
  # @param results [Array<Result>] hits from the winning provider
135
162
  # @return [String] Markdown-formatted result list
136
- def self.render(results)
163
+ def render(results)
137
164
  return "## Search Results\n\nNo results found." if results.empty?
138
165
 
139
166
  "## Search Results\n\n" + results.map { |r| "[#{r.title}](#{r.url})\n#{r.body}" }.join("\n\n")
140
167
  end
141
- private_class_method :render
142
168
 
143
169
  # Emit an INFO log line listing the currently-available providers,
144
- # but only when the set differs from the last one we logged.
145
- # {.providers} is recomputed on every {.search} call so a process
146
- # picks up newly-set API keys without a restart; the memo here
147
- # keeps the log to one line per distinct configuration rather
148
- # than one per search.
170
+ # but only when the set differs from the last one this engine
171
+ # logged. The memo keeps the log to one line per distinct
172
+ # configuration rather than one per search.
149
173
  #
150
- # @param current [Array<Module>] providers returned by {.providers}
174
+ # @param current [Array<#label>] providers returned by {#providers}
151
175
  # @return [void]
152
- def self.log_providers(current)
176
+ def log_providers(current)
153
177
  return if @last_logged_providers == current
154
178
 
155
179
  @last_logged_providers = current
156
- names = current.map { |p| p.name.split('::').last }.join(', ')
180
+ names = current.map(&:label).join(', ')
157
181
  LOGGER.info("engines available: #{names}")
158
182
  end
159
- private_class_method :log_providers
160
183
  end
161
184
  end
162
185
  end
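A reduced sketch of the key-driven cascade, assuming stub providers and a JSON string with made-up field names as the host's config source. `DemoUnavailable`, `DemoStubProvider`, and `DemoEngines` are hypothetical; caching, Markdown rendering, and logging from the real `Engines` are left out on purpose.

```ruby
require 'json'

# Stand-in for Engines::Unavailable (assumption for this sketch).
class DemoUnavailable < StandardError; end

# Hypothetical provider stub: either answers with canned hits or reports
# temporary unavailability.
class DemoStubProvider
  attr_reader :label

  def initialize(label, hits: nil)
    @label = label
    @hits = hits
  end

  def search(_query, max_results:)
    raise DemoUnavailable, 'rate limited' if @hits.nil?

    @hits.first(max_results)
  end
end

# Reduced cascade: keyless provider always, keyed ones only for non-blank keys.
class DemoEngines
  attr_reader :providers

  def initialize(brave_key: nil, exa_key: nil)
    @providers = [DemoStubProvider.new('DuckDuckGo')] # stub: always unavailable
    @providers << DemoStubProvider.new('Brave', hits: %w[b1 b2 b3]) unless brave_key.to_s.strip.empty?
    @providers << DemoStubProvider.new('Exa', hits: %w[e1 e2]) unless exa_key.to_s.strip.empty?
  end

  # Try providers in random order; fall back on DemoUnavailable only.
  def search(query, max_results:)
    failures = []
    @providers.shuffle.each do |provider|
      return provider.search(query, max_results: max_results)
    rescue DemoUnavailable => e
      failures << "#{provider.label} (#{e.message})"
    end
    "Error: all search providers temporarily unavailable: #{failures.join('; ')}"
  end
end

# The host decides where keys come from; here the Exa key is blank on purpose.
DEMO_CONFIG = JSON.parse('{"brave_key": "abc123", "exa_key": "   "}')

DEMO_ENGINE = DemoEngines.new(brave_key: DEMO_CONFIG['brave_key'], exa_key: DEMO_CONFIG['exa_key'])
p DEMO_ENGINE.providers.map(&:label)
p DEMO_ENGINE.search('ruby', max_results: 2)
p DemoEngines.new.search('ruby', max_results: 2)
```

Because the stubbed DuckDuckGo always raises, the first search succeeds only through the fallback, whichever order the shuffle picks.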