simple_inference 0.1.3 → 0.1.5
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +316 -138
- data/lib/simple_inference/client.rb +169 -74
- data/lib/simple_inference/config.rb +16 -0
- data/lib/simple_inference/errors.rb +11 -5
- data/lib/simple_inference/openai.rb +178 -0
- data/lib/simple_inference/response.rb +28 -0
- data/lib/simple_inference/version.rb +1 -1
- data/lib/simple_inference.rb +2 -0
- data/sig/simple_inference.rbs +68 -1
- metadata +9 -8
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ad988c1bb0af4938ea72fd303943a6dc27b90f26a8128abd737e0fca6429e081
+  data.tar.gz: 6be00487c1533201ffc48afb14a64c385b434698cf1bf3ab1c5c4ab10834d06a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 066dbeee456edae89770a5ed6541d77dda53d6ebcac59a2f277e28e00dde8b12b373cdec67bb0e79f84df781397034f1ff75694560bd6f612dca608ce6252630
+  data.tar.gz: 8008d5a95c38e45465e48a3f45fe8b7fd1cffec49e16cfd54419cbed08a11d7d613715314c91c742f63430860caac1fe332e10270cd0741401e98540a0582d65
data/README.md
CHANGED
@@ -1,13 +1,24 @@
-
+# SimpleInference
 
-Fiber-friendly Ruby client for
+A lightweight, Fiber-friendly Ruby client for OpenAI-compatible LLM APIs. Works seamlessly with OpenAI, Azure OpenAI, 火山引擎 (Volcengine), DeepSeek, Groq, Together AI, and any other provider that implements the OpenAI API specification.
 
-
+Designed for simplicity and compatibility – no heavy dependencies, just pure Ruby with `Net::HTTP`.
 
-
+## Features
+
+- 🔌 **Universal compatibility** – Works with any OpenAI-compatible API provider
+- 🌊 **Streaming support** – Native SSE streaming for chat completions
+- 🧵 **Fiber-friendly** – Compatible with Ruby 3 Fiber scheduler, works great with Falcon
+- 🔧 **Flexible configuration** – Customizable API prefix for non-standard endpoints
+- 🎯 **Simple interface** – Receive-an-Object / Return-an-Object style API
+- 📦 **Zero runtime dependencies** – Uses only Ruby standard library
+
+## Installation
+
+Add to your Gemfile:
 
 ```ruby
-gem "simple_inference"
+gem "simple_inference"
 ```
 
 Then run:
@@ -16,231 +27,398 @@ Then run:
 bundle install
 ```
 
-
+## Quick Start
+
+```ruby
+require "simple_inference"
+
+# Connect to OpenAI
+client = SimpleInference::Client.new(
+  base_url: "https://api.openai.com",
+  api_key: ENV["OPENAI_API_KEY"]
+)
+
+result = client.chat(
+  model: "gpt-4o-mini",
+  messages: [{ "role" => "user", "content" => "Hello!" }]
+)
+
+puts result.content
+p result.usage
+```
+
+## Configuration
+
+### Options
 
-
+| Option | Env Variable | Default | Description |
+|--------|--------------|---------|-------------|
+| `base_url` | `SIMPLE_INFERENCE_BASE_URL` | `http://localhost:8000` | API base URL |
+| `api_key` | `SIMPLE_INFERENCE_API_KEY` | `nil` | API key (sent as `Authorization: Bearer <token>`) |
+| `api_prefix` | `SIMPLE_INFERENCE_API_PREFIX` | `/v1` | API path prefix (e.g., `/v1`, empty string for some providers) |
+| `timeout` | `SIMPLE_INFERENCE_TIMEOUT` | `nil` | Request timeout in seconds |
+| `open_timeout` | `SIMPLE_INFERENCE_OPEN_TIMEOUT` | `nil` | Connection open timeout |
+| `read_timeout` | `SIMPLE_INFERENCE_READ_TIMEOUT` | `nil` | Read timeout |
+| `raise_on_error` | `SIMPLE_INFERENCE_RAISE_ON_ERROR` | `true` | Raise exceptions on HTTP errors |
+| `headers` | – | `{}` | Additional headers to send with requests |
+| `adapter` | – | `Default` | HTTP adapter (see [Adapters](#http-adapters)) |
 
-
-- `SIMPLE_INFERENCE_API_KEY`: optional, if your deployment requires auth (sent as `Authorization: Bearer <token>`).
-- `SIMPLE_INFERENCE_TIMEOUT`, `SIMPLE_INFERENCE_OPEN_TIMEOUT`, `SIMPLE_INFERENCE_READ_TIMEOUT` (seconds).
-- `SIMPLE_INFERENCE_RAISE_ON_ERROR`: `true`/`false` (default `true`).
+### Provider Examples
 
-
+#### OpenAI
 
 ```ruby
 client = SimpleInference::Client.new(
-  base_url: "
-  api_key:
-  timeout: 30.0
+  base_url: "https://api.openai.com",
+  api_key: ENV["OPENAI_API_KEY"]
 )
 ```
 
-
+#### 火山引擎 (Volcengine / ByteDance)
+
+Volcengine's API paths do not include the `/v1` prefix, so set `api_prefix: ""`:
 
 ```ruby
-client = SimpleInference.new(
-
+client = SimpleInference::Client.new(
+  base_url: "https://ark.cn-beijing.volces.com/api/v3",
+  api_key: ENV["ARK_API_KEY"],
+  api_prefix: "" # Important: Volcengine does not use the /v1 prefix
+)
 
-
+result = client.chat(
+  model: "deepseek-v3-250324",
+  messages: [
+    { "role" => "system", "content" => "你是人工智能助手" },
+    { "role" => "user", "content" => "你好" }
+  ]
+)
+
+puts result.content
+```
 
-
+#### DeepSeek
 
 ```ruby
-
-  base_url:
-  api_key:
+client = SimpleInference::Client.new(
+  base_url: "https://api.deepseek.com",
+  api_key: ENV["DEEPSEEK_API_KEY"]
 )
 ```
 
-
+#### Groq
 
 ```ruby
-
-
-
-
-
-    { "role" => "user", "content" => params[:prompt] }
-  ]
-)
+client = SimpleInference::Client.new(
+  base_url: "https://api.groq.com/openai",
+  api_key: ENV["GROQ_API_KEY"]
+)
+```
 
-
-
-
+#### Together AI
+
+```ruby
+client = SimpleInference::Client.new(
+  base_url: "https://api.together.xyz",
+  api_key: ENV["TOGETHER_API_KEY"]
+)
 ```
 
-
+#### Local inference servers (Ollama, vLLM, etc.)
 
 ```ruby
-
-
+# Ollama
+client = SimpleInference::Client.new(
+  base_url: "http://localhost:11434"
+)
 
-
-
-
-
-
+# vLLM
+client = SimpleInference::Client.new(
+  base_url: "http://localhost:8000"
+)
+```
 
-
-
-
-
+#### Custom authentication header
+
+Some providers use non-standard authentication headers:
+
+```ruby
+client = SimpleInference::Client.new(
+  base_url: "https://my-service.example.com",
+  api_prefix: "/v1",
+  headers: {
+    "x-api-key" => ENV["MY_SERVICE_KEY"]
+  }
+)
 ```
 
-
+## API Methods
+
+### Chat
 
 ```ruby
-
-
-
-
-
+result = client.chat(
+  model: "gpt-4o-mini",
+  messages: [
+    { "role" => "system", "content" => "You are a helpful assistant." },
+    { "role" => "user", "content" => "Hello!" }
+  ],
+  temperature: 0.7,
+  max_tokens: 1000
+)
 
-
-
+puts result.content
+p result.usage
 ```
 
-###
+### Streaming Chat
 
-
-
-
-
-
-
-
-
+```ruby
+result = client.chat(
+  model: "gpt-4o-mini",
+  messages: [{ "role" => "user", "content" => "Tell me a story" }],
+  stream: true,
+  include_usage: true
+) do |delta|
+  print delta
+end
+puts
 
-
+p result.usage
+```
 
-
-- Output: a `Hash` with keys:
-  - `:status` – HTTP status code
-  - `:headers` – response headers (lowercased keys)
-  - `:body` – parsed JSON (Ruby `Hash`) when the response is JSON, or a `String` for text bodies.
+Low-level streaming (events) is also available, and can be used as an Enumerator:
 
-
+```ruby
+stream = client.chat_completions_stream(
+  model: "gpt-4o-mini",
+  messages: [{ "role" => "user", "content" => "Hello" }]
+)
 
-
+stream.each do |event|
+  # process event
+end
+```
 
-
+Or as an Enumerable of delta strings:
 
-
+```ruby
+stream = client.chat_stream(
+  model: "gpt-4o-mini",
+  messages: [{ "role" => "user", "content" => "Hello" }],
+  include_usage: true
+)
 
-
-
-
+stream.each { |delta| print delta }
+puts
+p stream.result&.usage
+```
 
-
+### Embeddings
 
 ```ruby
-
-
-
+response = client.embeddings(
+  model: "text-embedding-3-small",
+  input: "Hello, world!"
 )
 
-
-if response[:status] == 200
-  # happy path
-else
-  Rails.logger.warn("Embedding call failed: #{response[:status]} #{response[:body].inspect}")
-end
+vector = response.body["data"][0]["embedding"]
 ```
 
-###
+### Rerank
 
-
+```ruby
+response = client.rerank(
+  model: "bge-reranker-v2-m3",
+  query: "What is machine learning?",
+  documents: [
+    "Machine learning is a subset of AI...",
+    "The weather today is sunny...",
+    "Deep learning uses neural networks..."
+  ]
+)
+```
 
-
+### Audio Transcription
 
 ```ruby
-
-
-
+response = client.audio_transcriptions(
+  model: "whisper-1",
+  file: File.open("audio.mp3", "rb")
 )
 
-response
-
-  messages: [{ "role" => "user", "content" => "Hello" }]
-)
+puts response.body["text"]
+```
 
-
+### Audio Translation
+
+```ruby
+response = client.audio_translations(
+  model: "whisper-1",
+  file: File.open("audio.mp3", "rb")
+)
 ```
 
-
+### List Models
 
-
+```ruby
+model_ids = client.models
+```
+
+### Health Check
 
 ```ruby
-
-
-
-
-
-
+# Returns full response
+response = client.health
+
+# Returns boolean
+if client.healthy?
+  puts "Service is up!"
 end
-puts
 ```
 
-
+## Response Format
+
+All HTTP methods return a `SimpleInference::Response` with:
 
 ```ruby
-
-#
-
+response.status   # Integer HTTP status code
+response.headers  # Hash with downcased String keys
+response.body     # Parsed JSON (Hash/Array), raw String, or nil (SSE success)
+response.success? # true for 2xx
 ```
 
-
+## Error Handling
 
-
+By default, non-2xx responses raise exceptions:
 
-
+```ruby
+begin
+  client.chat_completions(model: "invalid", messages: [])
+rescue SimpleInference::Errors::HTTPError => e
+  puts "HTTP #{e.status}: #{e.message}"
+  p e.body # parsed body (Hash/Array/String)
+  puts e.raw_body # raw response body string (if available)
+end
+```
+
+Other exception types:
 
-
+- `SimpleInference::Errors::TimeoutError` – Request timed out
+- `SimpleInference::Errors::ConnectionError` – Network error
+- `SimpleInference::Errors::DecodeError` – JSON parsing failed
+- `SimpleInference::Errors::ConfigurationError` – Invalid configuration
+
+To handle errors manually:
 
 ```ruby
 client = SimpleInference::Client.new(
-  base_url: "https://
-  api_key:
+  base_url: "https://api.openai.com",
+  api_key: ENV["OPENAI_API_KEY"],
+  raise_on_error: false
 )
+
+response = client.chat_completions(model: "gpt-4o-mini", messages: [...])
+
+if response.success?
+  # success
+else
+  puts "Error: #{response.status} - #{response.body}"
+end
 ```
 
-
+## HTTP Adapters
+
+### Default (Net::HTTP)
+
+The default adapter uses Ruby's built-in `Net::HTTP`. It's thread-safe and compatible with Ruby 3 Fiber scheduler.
+
+### HTTPX Adapter
+
+For better performance or async environments, use the optional HTTPX adapter:
 
 ```ruby
+# Gemfile
+gem "httpx"
+```
+
+```ruby
+adapter = SimpleInference::HTTPAdapters::HTTPX.new(timeout: 30.0)
+
 client = SimpleInference::Client.new(
-  base_url: "https://
-
-
-  }
+  base_url: "https://api.openai.com",
+  api_key: ENV["OPENAI_API_KEY"],
+  adapter: adapter
 )
 ```
 
-###
+### Custom Adapter
 
-
+Implement your own adapter by subclassing `SimpleInference::HTTPAdapter`:
 
-
-
-
+```ruby
+class MyAdapter < SimpleInference::HTTPAdapter
+  def call(request)
+    # request keys: :method, :url, :headers, :body, :timeout, :open_timeout, :read_timeout
+    # Must return: { status: Integer, headers: Hash, body: String }
+  end
 
-
+  def call_stream(request, &block)
+    # For streaming support (optional)
+    # Yield raw chunks to block for SSE responses
+  end
+end
+```
+
+## Rails Integration
 
-
+Create an initializer `config/initializers/simple_inference.rb`:
 
 ```ruby
-
+INFERENCE_CLIENT = SimpleInference::Client.new(
+  base_url: ENV.fetch("INFERENCE_BASE_URL", "https://api.openai.com"),
+  api_key: ENV["INFERENCE_API_KEY"]
+)
 ```
 
-
+Use in controllers:
 
 ```ruby
-
+class ChatsController < ApplicationController
+  def create
+    response = INFERENCE_CLIENT.chat_completions(
+      model: "gpt-4o-mini",
+      messages: [{ "role" => "user", "content" => params[:prompt] }]
+    )
+
+    render json: response.body
+  end
+end
+```
+
+Use in background jobs:
+
+```ruby
+class EmbedJob < ApplicationJob
+  def perform(text)
+    response = INFERENCE_CLIENT.embeddings(
+      model: "text-embedding-3-small",
+      input: text
+    )
 
-
-
-
-
-  adapter: adapter
-)
+    vector = response.body["data"][0]["embedding"]
+    # Store vector...
+  end
+end
 ```
+
+## Thread Safety
+
+The client is thread-safe:
+
+- No global mutable state
+- Per-client configuration only
+- Each request uses its own HTTP connection
+
+## License
+
+MIT License. See [LICENSE](LICENSE.txt) for details.
data/lib/simple_inference/client.rb
CHANGED

@@ -22,8 +22,112 @@ module SimpleInference
 
     # POST /v1/chat/completions
     # params: { model: "model-name", messages: [...], ... }
-    def chat_completions(params)
-      post_json("/
+    def chat_completions(**params)
+      post_json(api_path("/chat/completions"), params)
+    end
+
+    # High-level helper for OpenAI-compatible chat.
+    #
+    # - Non-streaming: returns an OpenAI::ChatResult with `content` + `usage`.
+    # - Streaming: yields delta strings to the block (if given), accumulates, and returns OpenAI::ChatResult.
+    #
+    # @param model [String]
+    # @param messages [Array<Hash>]
+    # @param stream [Boolean] force streaming when true (default: block_given?)
+    # @param include_usage [Boolean, nil] when true (and streaming), requests usage in the final chunk
+    # @param request_logprobs [Boolean] when true, requests logprobs (and collects them in streaming mode)
+    # @param top_logprobs [Integer, nil] default: 5 (when request_logprobs is true)
+    # @param params [Hash] additional OpenAI parameters (max_tokens, temperature, etc.)
+    # @yield [String] delta content chunks (streaming only)
+    # @return [SimpleInference::OpenAI::ChatResult]
+    def chat(model:, messages:, stream: nil, include_usage: nil, request_logprobs: false, top_logprobs: 5, **params, &block)
+      raise ArgumentError, "model is required" if model.nil? || model.to_s.strip.empty?
+      raise ArgumentError, "messages must be an Array" unless messages.is_a?(Array)
+
+      use_stream = stream.nil? ? block_given? : stream
+
+      request = { model: model, messages: messages }.merge(params)
+      request.delete(:stream)
+      request.delete("stream")
+
+      if request_logprobs
+        request[:logprobs] = true unless request.key?(:logprobs) || request.key?("logprobs")
+        if top_logprobs && !(request.key?(:top_logprobs) || request.key?("top_logprobs"))
+          request[:top_logprobs] = top_logprobs
+        end
+      end
+
+      if use_stream && include_usage
+        stream_options = request[:stream_options] || request["stream_options"]
+        stream_options ||= {}
+
+        if stream_options.is_a?(Hash)
+          stream_options[:include_usage] = true unless stream_options.key?(:include_usage) || stream_options.key?("include_usage")
+        end
+
+        request[:stream_options] = stream_options
+      end
+
+      if use_stream
+        full = +""
+        finish_reason = nil
+        last_usage = nil
+        collected_logprobs = []
+
+        response =
+          chat_completions_stream(**request) do |event|
+            delta = OpenAI.chat_completion_chunk_delta(event)
+            if delta
+              full << delta
+              block.call(delta) if block
+            end
+
+            fr = event.is_a?(Hash) ? event.dig("choices", 0, "finish_reason") : nil
+            finish_reason = fr if fr
+
+            if request_logprobs
+              chunk_logprobs = event.is_a?(Hash) ? event.dig("choices", 0, "logprobs", "content") : nil
+              if chunk_logprobs.is_a?(Array)
+                collected_logprobs.concat(chunk_logprobs)
+              end
+            end
+
+            usage = OpenAI.chat_completion_usage(event)
+            last_usage = usage if usage
+          end
+
+        OpenAI::ChatResult.new(
+          content: full,
+          usage: last_usage || OpenAI.chat_completion_usage(response),
+          finish_reason: finish_reason || OpenAI.chat_completion_finish_reason(response),
+          logprobs: collected_logprobs.empty? ? OpenAI.chat_completion_logprobs(response) : collected_logprobs,
+          response: response
+        )
+      else
+        response = chat_completions(**request)
+        OpenAI::ChatResult.new(
+          content: OpenAI.chat_completion_content(response),
+          usage: OpenAI.chat_completion_usage(response),
+          finish_reason: OpenAI.chat_completion_finish_reason(response),
+          logprobs: OpenAI.chat_completion_logprobs(response),
+          response: response
+        )
+      end
+    end
+
+    # Streaming chat as an Enumerable.
+    #
+    # @return [SimpleInference::OpenAI::ChatStream]
+    def chat_stream(model:, messages:, include_usage: nil, request_logprobs: false, top_logprobs: 5, **params)
+      OpenAI::ChatStream.new(
+        client: self,
+        model: model,
+        messages: messages,
+        include_usage: include_usage,
+        request_logprobs: request_logprobs,
+        top_logprobs: top_logprobs,
+        params: params
+      )
     end
 
     # POST /v1/chat/completions (streaming)
@@ -31,45 +135,41 @@ module SimpleInference
     # Yields parsed JSON events from an OpenAI-style SSE stream (`text/event-stream`).
     #
     # If no block is given, returns an Enumerator.
-    def chat_completions_stream(params)
-      return enum_for(:chat_completions_stream, params) unless block_given?
-
-      unless params.is_a?(Hash)
-        raise Errors::ConfigurationError, "params must be a Hash"
-      end
+    def chat_completions_stream(**params)
+      return enum_for(:chat_completions_stream, **params) unless block_given?
 
       body = params.dup
       body.delete(:stream)
       body.delete("stream")
       body["stream"] = true
 
-      response = post_json_stream("/
+      response = post_json_stream(api_path("/chat/completions"), body) do |event|
        yield event
       end
 
-      content_type = response.
+      content_type = response.headers["content-type"].to_s
 
       # Streaming case: we already yielded events from the SSE stream.
-      if response
+      if response.status >= 200 && response.status < 300 && content_type.include?("text/event-stream")
        return response
       end
 
       # Fallback when upstream does not support streaming (this repo's server).
-      if streaming_unsupported_error?(response
+      if streaming_unsupported_error?(response.status, response.body)
        fallback_body = params.dup
        fallback_body.delete(:stream)
        fallback_body.delete("stream")
 
-        fallback_response = post_json("/
-        chunk = synthesize_chat_completion_chunk(fallback_response
+        fallback_response = post_json(api_path("/chat/completions"), fallback_body)
+        chunk = synthesize_chat_completion_chunk(fallback_response.body)
        yield chunk if chunk
        return fallback_response
       end
 
       # If we got a non-streaming success response (JSON), convert it into a single
       # chunk so streaming consumers can share the same code path.
-      if response
-        chunk = synthesize_chat_completion_chunk(response
+      if response.status >= 200 && response.status < 300
+        chunk = synthesize_chat_completion_chunk(response.body)
        yield chunk if chunk
       end
 
@@ -77,18 +177,27 @@ module SimpleInference
     end
 
     # POST /v1/embeddings
-    def embeddings(params)
-      post_json("/
+    def embeddings(**params)
+      post_json(api_path("/embeddings"), params)
     end
 
     # POST /v1/rerank
-    def rerank(params)
-      post_json("/
+    def rerank(**params)
+      post_json(api_path("/rerank"), params)
     end
 
     # GET /v1/models
     def list_models
-      get_json("/
+      get_json(api_path("/models"))
+    end
+
+    # Convenience wrapper for list_models.
+    #
+    # @return [Array<String>] model IDs
+    def models
+      response = list_models
+      data = response.body.is_a?(Hash) ? response.body["data"] : nil
+      Array(data).filter_map { |m| m.is_a?(Hash) ? m["id"] : nil }
     end
 
     # GET /health
@@ -99,8 +208,8 @@ module SimpleInference
     # Returns true when service is healthy, false otherwise.
     def healthy?
       response = get_json("/health", raise_on_http_error: false)
-      status_ok = response
-      body_status_ok = response.
+      status_ok = response.status == 200
+      body_status_ok = response.body.is_a?(Hash) && response.body["status"] == "ok"
       status_ok && body_status_ok
     rescue Errors::Error
       false
@@ -108,13 +217,13 @@ module SimpleInference
 
     # POST /v1/audio/transcriptions
     # params: { file: io_or_hash, model: "model-name", **audio_options }
-    def audio_transcriptions(params)
-      post_multipart("/
+    def audio_transcriptions(**params)
+      post_multipart(api_path("/audio/transcriptions"), params)
     end
 
     # POST /v1/audio/translations
-    def audio_translations(params)
-      post_multipart("/
+    def audio_translations(**params)
+      post_multipart(api_path("/audio/translations"), params)
     end
 
     private
@@ -123,6 +232,10 @@ module SimpleInference
       config.base_url
     end
 
+    def api_path(endpoint)
+      "#{config.api_prefix}#{endpoint}"
+    end
+
     def get_json(path, params: nil, raise_on_http_error: nil)
       full_path = with_query(path, params)
       request_json(
@@ -199,31 +312,26 @@ module SimpleInference
          consume_sse_buffer!(buffer, &on_event)
        end
 
-        return
-          status: status,
-          headers: headers,
-          body: nil,
-        }
+        return Response.new(status: status, headers: headers, body: nil)
       end
 
       # Non-streaming response path (adapter doesn't support streaming or server returned JSON).
       should_parse_json = content_type.include?("json")
-      parsed_body =
-
-
-
-
-
-
-
-
+      parsed_body =
+        if should_parse_json
+          begin
+            parse_json(body_str)
+          rescue Errors::DecodeError
+            # Prefer HTTPError over DecodeError for non-2xx responses.
+            status >= 200 && status < 300 ? raise : body_str
+          end
+        else
+          body_str
+        end
 
-
-
-
-        body: parsed_body,
-      }
+      response = Response.new(status: status, headers: headers, body: parsed_body, raw_body: body_str)
+      maybe_raise_http_error(response: response, raise_on_http_error: raise_on_http_error, ignore_streaming_unsupported: true)
+      response
     rescue Timeout::Error => e
       raise Errors::TimeoutError, e.message
     rescue SocketError, SystemCallError => e
@@ -575,13 +683,6 @@ module SimpleInference
       headers = (response[:headers] || {}).transform_keys { |k| k.to_s.downcase }
       body = response[:body].to_s
 
-      maybe_raise_http_error(
-        status: status,
-        headers: headers,
-        body_str: body,
-        raise_on_http_error: raise_on_http_error
-      )
-
       should_parse_json =
        if expect_json.nil?
          content_type = headers["content-type"]
@@ -592,16 +693,19 @@ module SimpleInference
 
       parsed_body =
        if should_parse_json
-
+          begin
+            parse_json(body)
+          rescue Errors::DecodeError
+            # Prefer HTTPError over DecodeError for non-2xx responses.
+            status >= 200 && status < 300 ? raise : body
+          end
        else
          body
        end
 
-
-
-
-        body: parsed_body,
-      }
+      response = Response.new(status: status, headers: headers, body: parsed_body, raw_body: body)
+      maybe_raise_http_error(response: response, raise_on_http_error: raise_on_http_error)
+      response
     rescue Timeout::Error => e
       raise Errors::TimeoutError, e.message
     rescue SocketError, SystemCallError => e
@@ -644,26 +748,17 @@ module SimpleInference
       end
     end
 
-    def maybe_raise_http_error(
-      status:,
-      headers:,
-      body_str:,
-      raise_on_http_error:,
-      ignore_streaming_unsupported: false,
-      parsed_body: nil
-    )
+    def maybe_raise_http_error(response:, raise_on_http_error:, ignore_streaming_unsupported: false)
       return unless raise_on_http_error?(raise_on_http_error)
-      return
+      return if response.success?
 
       # Do not raise for the known "streaming unsupported" case; the caller will
       # perform a non-streaming retry fallback.
-      return if ignore_streaming_unsupported && streaming_unsupported_error?(status,
+      return if ignore_streaming_unsupported && streaming_unsupported_error?(response.status, response.body)
 
       raise Errors::HTTPError.new(
-        http_error_message(status,
-
-        headers: headers,
-        body: body_str
+        http_error_message(response.status, response.raw_body.to_s, parsed_body: response.body),
+        response: response
       )
     end
   end
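With 0.1.5, all endpoint helpers take keyword arguments and return `SimpleInference::Response` objects, and `models` is a thin convenience wrapper over `list_models`. A minimal sketch of the resulting call shapes (the base URL and model name below are placeholders, not values from this diff):

```ruby
require "simple_inference"

# Placeholder endpoint; any OpenAI-compatible server works.
client = SimpleInference::Client.new(base_url: "http://localhost:8000")

response = client.chat_completions(
  model: "gpt-4o-mini", # placeholder model id
  messages: [{ "role" => "user", "content" => "Hello" }]
)

response.status   # => Integer HTTP status
response.success? # => true for 2xx
response.body     # => parsed JSON Hash

client.models     # => Array of model id Strings, extracted from GET <api_prefix>/models
```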
data/lib/simple_inference/config.rb
CHANGED

@@ -4,6 +4,7 @@ module SimpleInference
   class Config
     attr_reader :base_url,
                 :api_key,
+                :api_prefix,
                 :timeout,
                 :open_timeout,
                 :read_timeout,
@@ -19,6 +20,10 @@ module SimpleInference
       @api_key = (opts[:api_key] || ENV["SIMPLE_INFERENCE_API_KEY"]).to_s
       @api_key = nil if @api_key.empty?
 
+      @api_prefix = normalize_api_prefix(
+        opts.key?(:api_prefix) ? opts[:api_prefix] : ENV.fetch("SIMPLE_INFERENCE_API_PREFIX", "/v1")
+      )
+
       @timeout = to_float_or_nil(opts[:timeout] || ENV["SIMPLE_INFERENCE_TIMEOUT"])
       @open_timeout = to_float_or_nil(opts[:open_timeout] || ENV["SIMPLE_INFERENCE_OPEN_TIMEOUT"])
       @read_timeout = to_float_or_nil(opts[:read_timeout] || ENV["SIMPLE_INFERENCE_READ_TIMEOUT"])
@@ -46,6 +51,17 @@ module SimpleInference
       url.chomp("/")
     end
 
+    def normalize_api_prefix(value)
+      return "" if value.nil?
+
+      prefix = value.to_s.strip
+      return "" if prefix.empty?
+
+      # Ensure it starts with / and does not end with /
+      prefix = "/#{prefix}" unless prefix.start_with?("/")
+      prefix.chomp("/")
+    end
+
     def to_float_or_nil(value)
       return nil if value.nil? || value == ""
 
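The new `api_prefix` option is normalized before use: a leading slash is added when missing, a trailing slash is stripped, and `nil` or an empty string disables the prefix entirely. An illustrative sketch of how `normalize_api_prefix` combines with the client's `api_path` helper (the mappings in the comments follow the code above; the base URL is the one from the README's Volcengine example):

```ruby
# Illustrative only:
# api_prefix: "/v1" (default) -> POST {base_url}/v1/chat/completions
# api_prefix: "v1"            -> normalized to "/v1" (leading slash added)
# api_prefix: "/v1/"          -> normalized to "/v1" (trailing slash removed)
# api_prefix: "" or nil       -> no prefix, POST {base_url}/chat/completions

client = SimpleInference::Client.new(
  base_url: "https://ark.cn-beijing.volces.com/api/v3",
  api_prefix: "" # provider exposes /chat/completions without a /v1 prefix
)
```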
data/lib/simple_inference/errors.rb
CHANGED

@@ -7,14 +7,20 @@ module SimpleInference
     class ConfigurationError < Error; end
 
     class HTTPError < Error
-      attr_reader :
+      attr_reader :response
 
-      def initialize(message,
+      def initialize(message, response:)
        super(message)
-        @
-        @headers = headers
-        @body = body
+        @response = response
       end
+
+      def status = @response.status
+
+      def headers = @response.headers
+
+      def body = @response.body
+
+      def raw_body = @response.raw_body
     end
 
     class TimeoutError < Error; end
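`Errors::HTTPError` now wraps the full `Response` instead of carrying separate status, headers, and body fields; the old readers keep working because they delegate to the wrapped response. A small sketch, assuming a `client` built with the default `raise_on_error: true` and a request that fails (the model name is a placeholder):

```ruby
begin
  client.chat_completions(model: "nonexistent-model", messages: [])
rescue SimpleInference::Errors::HTTPError => e
  e.status    # delegates to e.response.status
  e.headers   # delegates to e.response.headers
  e.body      # parsed error body (Hash/Array/String)
  e.raw_body  # raw body String, when available
  e.response  # the underlying SimpleInference::Response (new in this release)
end
```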
data/lib/simple_inference/openai.rb
ADDED

@@ -0,0 +1,178 @@
+# frozen_string_literal: true
+
+module SimpleInference
+  # Helpers for extracting common fields from OpenAI-compatible `chat/completions` payloads.
+  #
+  # These helpers accept either:
+  # - A `SimpleInference::Response`, or
+  # - A parsed `body` / `chunk` hash (typically from JSON.parse, with String keys)
+  #
+  # Providers are "OpenAI-compatible", but many differ in subtle ways:
+  # - Some return `choices[0].text` instead of `choices[0].message.content`
+  # - Some represent `content` as an array or structured hash
+  #
+  # This module normalizes those shapes so application code can stay small and predictable.
+  module OpenAI
+    module_function
+
+    ChatResult =
+      Struct.new(
+        :content,
+        :usage,
+        :finish_reason,
+        :logprobs,
+        :response,
+        keyword_init: true
+      )
+
+    # Enumerable wrapper for streaming chat responses.
+    #
+    # @example
+    #   stream = client.chat_stream(model: "...", messages: [...], include_usage: true)
+    #   stream.each { |delta| print delta }
+    #   p stream.result.usage
+    class ChatStream
+      include Enumerable
+
+      attr_reader :result
+
+      def initialize(client:, model:, messages:, include_usage:, request_logprobs:, top_logprobs:, params:)
+        @client = client
+        @model = model
+        @messages = messages
+        @include_usage = include_usage
+        @request_logprobs = request_logprobs
+        @top_logprobs = top_logprobs
+        @params = params
+        @started = false
+        @result = nil
+      end
+
+      def each
+        return enum_for(:each) unless block_given?
+        raise Errors::ConfigurationError, "ChatStream can only be consumed once" if @started
+
+        @started = true
+        @result =
+          @client.chat(
+            model: @model,
+            messages: @messages,
+            stream: true,
+            include_usage: @include_usage,
+            request_logprobs: @request_logprobs,
+            top_logprobs: @top_logprobs,
+            **(@params || {})
+          ) { |delta| yield delta }
+      end
+    end
+
+    # Extract assistant content from a non-streaming chat completion.
+    #
+    # @param response_or_body [Hash] SimpleInference response hash or parsed body hash
+    # @return [String, nil]
+    def chat_completion_content(response_or_body)
+      body = unwrap_body(response_or_body)
+      choice = first_choice(body)
+      return nil unless choice
+
+      raw =
+        choice.dig("message", "content") ||
+        choice["text"]
+
+      normalize_content(raw)
+    end
+
+    # Extract finish_reason from a non-streaming chat completion.
+    #
+    # @param response_or_body [Hash] SimpleInference response hash or parsed body hash
+    # @return [String, nil]
+    def chat_completion_finish_reason(response_or_body)
+      body = unwrap_body(response_or_body)
+      first_choice(body)&.[]("finish_reason")
+    end
+
+    # Extract usage from a chat completion response or a final streaming chunk.
+    #
+    # @param response_or_body [Hash] SimpleInference response hash, body hash, or chunk hash
+    # @return [Hash, nil] symbol-keyed usage hash
+    def chat_completion_usage(response_or_body)
+      body = unwrap_body(response_or_body)
+      usage = body.is_a?(Hash) ? body["usage"] : nil
+      return nil unless usage.is_a?(Hash)
+
+      {
+        prompt_tokens: usage["prompt_tokens"],
+        completion_tokens: usage["completion_tokens"],
+        total_tokens: usage["total_tokens"],
+      }.compact
+    end
+
+    # Extract logprobs (if present) from a non-streaming chat completion.
+    #
+    # @param response_or_body [Hash] SimpleInference response hash or parsed body hash
+    # @return [Array<Hash>, nil]
+    def chat_completion_logprobs(response_or_body)
+      body = unwrap_body(response_or_body)
+      first_choice(body)&.dig("logprobs", "content")
+    end
+
+    # Extract delta content from a streaming `chat.completion.chunk`.
+    #
+    # @param chunk [Hash] parsed streaming event hash
+    # @return [String, nil]
+    def chat_completion_chunk_delta(chunk)
+      chunk = unwrap_body(chunk)
+      return nil unless chunk.is_a?(Hash)
+
+      raw = chunk.dig("choices", 0, "delta", "content")
+      normalize_content(raw)
+    end
+
+    # Normalize `content` shapes into a simple String.
+    #
+    # Supports strings, arrays of parts, and part hashes.
+    #
+    # @param value [Object]
+    # @return [String, nil]
+    def normalize_content(value)
+      case value
+      when String
+        value
+      when Array
+        value.map { |part| normalize_content(part) }.join
+      when Hash
+        value["text"] ||
+          value["content"] ||
+          value.to_s
+      when nil
+        nil
+      else
+        value.to_s
+      end
+    end
+
+    # Unwrap a full SimpleInference response into its `:body`, otherwise return the object.
+    #
+    # @param obj [Object]
+    # @return [Object]
+    def unwrap_body(obj)
+      return {} unless obj
+      return obj.body || {} if obj.respond_to?(:body)
+
+      obj
+    end
+
+    def first_choice(body)
+      return nil unless body.is_a?(Hash)
+
+      choices = body["choices"]
+      return nil unless choices.is_a?(Array) && !choices.empty?
+
+      choice0 = choices[0]
+      return nil unless choice0.is_a?(Hash)
+
+      choice0
+    end
+    private_class_method :first_choice
+  end
+end
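The helpers above accept either a `Response` or an already-parsed hash, and flatten the various `content` shapes providers return. A hand-written payload (illustrative, not a real provider response) shows the normalization:

```ruby
body = {
  "choices" => [
    {
      "message" => { "content" => [{ "type" => "text", "text" => "Hel" }, { "text" => "lo" }] },
      "finish_reason" => "stop"
    }
  ],
  "usage" => { "prompt_tokens" => 3, "completion_tokens" => 2, "total_tokens" => 5 }
}

SimpleInference::OpenAI.chat_completion_content(body)       # => "Hello" (array parts are joined)
SimpleInference::OpenAI.chat_completion_finish_reason(body) # => "stop"
SimpleInference::OpenAI.chat_completion_usage(body)         # => { prompt_tokens: 3, completion_tokens: 2, total_tokens: 5 }
```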
data/lib/simple_inference/response.rb
ADDED

@@ -0,0 +1,28 @@
+# frozen_string_literal: true
+
+module SimpleInference
+  # A lightweight wrapper for HTTP responses returned by SimpleInference.
+  #
+  # - `status` is an Integer HTTP status code
+  # - `headers` is a Hash with downcased String keys
+  # - `body` is a parsed JSON Hash/Array, a String, or nil (e.g. SSE streaming success)
+  # - `raw_body` is the raw response body String (when available)
+  class Response
+    attr_reader :status, :headers, :body, :raw_body
+
+    def initialize(status:, headers:, body:, raw_body: nil)
+      @status = status.to_i
+      @headers = (headers || {}).transform_keys { |k| k.to_s.downcase }
+      @body = body
+      @raw_body = raw_body
+    end
+
+    def success?
+      status >= 200 && status < 300
+    end
+
+    def to_h
+      { status: status, headers: headers, body: body, raw_body: raw_body }
+    end
+  end
+end
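`Response` is the value object every client call now returns; header keys are downcased on construction and `success?` covers the 2xx range. A small sketch constructing one by hand (normally the client builds these internally):

```ruby
res = SimpleInference::Response.new(
  status: 200,
  headers: { "Content-Type" => "application/json" },
  body: { "status" => "ok" },
  raw_body: '{"status":"ok"}'
)

res.headers["content-type"] # => "application/json" (keys are downcased)
res.success?                # => true
res.to_h                    # => { status: 200, headers: {...}, body: {...}, raw_body: "..." }
```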
data/lib/simple_inference.rb
CHANGED
@@ -4,6 +4,8 @@ require_relative "simple_inference/version"
 require_relative "simple_inference/config"
 require_relative "simple_inference/errors"
 require_relative "simple_inference/http_adapter"
+require_relative "simple_inference/response"
+require_relative "simple_inference/openai"
 require_relative "simple_inference/client"
 
 module SimpleInference
data/sig/simple_inference.rbs
CHANGED
@@ -1,4 +1,71 @@
 module SimpleInference
   VERSION: String
-end
 
+  class Response
+    attr_reader status: Integer
+    attr_reader headers: Hash[String, untyped]
+    attr_reader body: untyped
+    attr_reader raw_body: String?
+
+    def initialize: (status: Integer, headers: Hash[untyped, untyped], body: untyped, ?raw_body: String?) -> void
+    def success?: () -> bool
+    def to_h: () -> Hash[Symbol, untyped]
+  end
+
+  module OpenAI
+    class ChatResult
+      attr_reader content: String?
+      attr_reader usage: Hash[Symbol, untyped]?
+      attr_reader finish_reason: String?
+      attr_reader logprobs: Array[Hash[untyped, untyped]]?
+      attr_reader response: Response
+    end
+
+    class ChatStream
+      include Enumerable[String]
+      attr_reader result: ChatResult?
+    end
+
+    def self.chat_completion_content: (untyped) -> String?
+    def self.chat_completion_finish_reason: (untyped) -> String?
+    def self.chat_completion_usage: (untyped) -> Hash[Symbol, untyped]?
+    def self.chat_completion_logprobs: (untyped) -> Array[Hash[untyped, untyped]]?
+    def self.chat_completion_chunk_delta: (untyped) -> String?
+    def self.normalize_content: (untyped) -> String?
+  end
+
+  class Client
+    def initialize: (?Hash[untyped, untyped]) -> void
+
+    def chat: (
+      model: String,
+      messages: Array[Hash[untyped, untyped]],
+      ?stream: bool?,
+      ?include_usage: bool?,
+      ?request_logprobs: bool,
+      ?top_logprobs: Integer?,
+      **untyped
+    ) { (String) -> void } -> OpenAI::ChatResult
+
+    def chat_stream: (
+      model: String,
+      messages: Array[Hash[untyped, untyped]],
+      ?include_usage: bool?,
+      ?request_logprobs: bool,
+      ?top_logprobs: Integer?,
+      **untyped
+    ) -> OpenAI::ChatStream
+
+    def chat_completions: (**untyped) -> Response
+    def chat_completions_stream: (**untyped) { (Hash[untyped, untyped]) -> void } -> Response
+
+    def embeddings: (**untyped) -> Response
+    def rerank: (**untyped) -> Response
+    def list_models: () -> Response
+    def models: () -> Array[String]
+    def health: () -> Response
+    def healthy?: () -> bool
+    def audio_transcriptions: (**untyped) -> Response
+    def audio_translations: (**untyped) -> Response
+  end
+end
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: simple_inference
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.5
 platform: ruby
 authors:
 - jasl
@@ -9,8 +9,8 @@ bindir: exe
 cert_chain: []
 date: 1980-01-02 00:00:00.000000000 Z
 dependencies: []
-description: Fiber-friendly Ruby client for
-  audio, rerank, health).
+description: A lightweight, Fiber-friendly Ruby client for OpenAI-compatible LLM APIs.
+  (chat, embeddings, audio, rerank, health).
 email:
 - jasl9187@hotmail.com
 executables: []
@@ -27,15 +27,16 @@ files:
 - lib/simple_inference/http_adapter.rb
 - lib/simple_inference/http_adapters/default.rb
 - lib/simple_inference/http_adapters/httpx.rb
+- lib/simple_inference/openai.rb
+- lib/simple_inference/response.rb
 - lib/simple_inference/version.rb
 - sig/simple_inference.rbs
-homepage: https://github.com/jasl/
+homepage: https://github.com/jasl/simple_inference.rb
 licenses:
 - MIT
 metadata:
   allowed_push_host: https://rubygems.org
-  homepage_uri: https://github.com/jasl/
-  source_code_uri: https://github.com/jasl/simple_inference_server
+  homepage_uri: https://github.com/jasl/simple_inference.rb
   rubygems_mfa_required: 'true'
 rdoc_options: []
 require_paths:
@@ -51,7 +52,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 - !ruby/object:Gem::Version
   version: '0'
 requirements: []
-rubygems_version: 4.0.
+rubygems_version: 4.0.3
 specification_version: 4
-summary: Fiber-friendly Ruby client for
+summary: A lightweight, Fiber-friendly Ruby client for OpenAI-compatible LLM APIs.
 test_files: []