ruby-spacy 0.2.3 → 0.3.0

This diff shows the changes between publicly released versions of the package, as they appear in their respective public registries, and is provided for informational purposes only.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 9c9ca5b4cba8eb115192aa0b5a45216d12a9d9e4cdddc253ba55ace52e778afd
- data.tar.gz: 197c61acfa742048fefff05b35d6045e17dd5cf212667c277537fb984a0ff926
+ metadata.gz: d6005c638c2b268fe162b288e124439be6a525952557a48b0b50685bbd2a6ea1
+ data.tar.gz: 41dbc057c9ec51ffa8d6f1149fb8acde3fb52a251299d0209b4e2d351942eac0
  SHA512:
- metadata.gz: 950daeb4f8ee140a15bacf18ea3228f2604a552df8aa12be52fb7a488c78e67b894b8678fbe6fbed74da54beb714e89d02ab1bd46d5c59a908b8ddfbc5c9e7c0
- data.tar.gz: 84b183babd37f9120c0ac2332eec23dff30d3180da165aaf044bf72ef4be7af4efc2b339ad5ac5b489e3e3b9b44ba33d3df4fca287addbbed05cfa4201b79d75
+ metadata.gz: 5be0efa9e649b3d46da859472ce403adaa3cdaa34d4158e7a531680eb2830ae64779ec6ada8f0f6e324cc9cb314fb1fcbc617daa26e37e91a7d14f703caeec2d
+ data.tar.gz: b8f56b4842fea3bec1b35366624c7ab9297c3a3b25c9a8502dc32c623593e511d9da538bf3e5cac272baf854cf4c2c97d4129790b492329183d88873467f8dbb
data/.gitignore CHANGED
@@ -62,3 +62,4 @@ tags
  .rubocop.yml
  .solargraph.yml
  .yardopts
+ CLAUDE.md
data/CHANGELOG.md CHANGED
@@ -1,17 +1,34 @@
  # Change Log

+ ## 0.3.0 - 2025-01-06
+ ### Added
+ - Ruby 4.0 support
+ - `Doc#to_bytes` for serializing documents to binary format
+ - `Doc.from_bytes` for restoring documents from binary data
+ - `PhraseMatcher` class for efficient phrase matching
+ - `Language#phrase_matcher` helper method
+
+ ### Changed
+ - Replaced `ruby-openai` gem with custom `OpenAIClient` implementation
+ - Updated default OpenAI model to `gpt-5-mini`
+ - Updated embeddings model to `text-embedding-3-small`
+ - Changed `max_tokens` parameter to `max_completion_tokens` (backward compatible)
+ - Added `fiddle` gem dependency (required for Ruby 4.0)
+
+ ## 0.2.4 - 2024-12-11
+ ### Changed
+ - Timeout and retry feature for `Spacy::Language.new`
+
  ## 0.2.3 - 2024-08-27
  - Timeout option added to `Spacy::Language.new`
- - Default OpenaAI models updated to `gpt-4o-mini`
-
- ## 0.2.0 - 2022-10-02
- - spaCy 3.7.0 supported
+ - Default OpenAI models updated to `gpt-4o-mini`

  ## 0.2.0 - 2022-10-02
  ### Added
- - `Doc::openai_query`
- - `Doc::openai_completion`
- - `Doc::openai_embeddings`
+ - spaCy 3.7.0 supported
+ - `Doc#openai_query`
+ - `Doc#openai_completion`
+ - `Doc#openai_embeddings`

  ## 0.1.4.1 - 2021-07-06
  - Test code refined
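The backward-compatible parameter rename noted in the 0.3.0 entry above (`max_tokens` → `max_completion_tokens`) amounts to a small coalescing rule. The following standalone sketch (not the gem's actual code) illustrates the resolution a caller can expect:

```ruby
# Hypothetical reduction of the parameter handling described in the changelog:
# the new name wins when given, the legacy name still works, 1000 is the default.
def resolve_max_completion_tokens(max_completion_tokens: nil, max_tokens: nil)
  max_completion_tokens || max_tokens || 1000
end

puts resolve_max_completion_tokens(max_completion_tokens: 256) # => 256
puts resolve_max_completion_tokens(max_tokens: 512)            # => 512 (legacy name)
puts resolve_max_completion_tokens                             # => 1000 (default)
```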
data/Gemfile CHANGED
@@ -5,9 +5,9 @@ source "https://rubygems.org"
  # Specify your gem's dependencies in ruby-spacy.gemspec
  gemspec

+ gem "fiddle" # Required for Ruby 4.0+ (moved from default to bundled gem)
  gem "numpy"
  gem "pycall", "~> 1.5.1"
- gem "ruby-openai"
  gem "terminal-table"

  group :development do
data/README.md CHANGED
@@ -13,10 +13,11 @@
  | ✅ | Access to pre-trained word vectors |
  | ✅ | OpenAI Chat/Completion/Embeddings API integration |

- Current Version: `0.2.3`
+ Current Version: `0.3.0`

- - spaCy 3.7.0 supported
- - OpenAI API integration
+ - Ruby 4.0 supported
+ - spaCy 3.8 supported
+ - OpenAI GPT-5 API integration

  ## Installation of Prerequisites

@@ -522,12 +523,73 @@ Output:
  | 9 | アルザス | 0.5644999742507935 |
  | 10 | 南仏 | 0.5547999739646912 |

+ ### PhraseMatcher
+
+ `PhraseMatcher` is more efficient than `Matcher` for matching large terminology lists. It's ideal for extracting known entities like product names, company names, or domain-specific terms.
+
+ **Basic usage:**
+
+ ```ruby
+ require "ruby-spacy"
+
+ nlp = Spacy::Language.new("en_core_web_sm")
+
+ # Create a phrase matcher
+ matcher = nlp.phrase_matcher
+ matcher.add("PRODUCT", ["iPhone", "MacBook Pro", "iPad"])
+
+ doc = nlp.read("I bought an iPhone and a MacBook Pro yesterday.")
+ matches = matcher.match(doc)
+
+ matches.each do |span|
+   puts "#{span.text} => #{span.label}"
+ end
+ # => iPhone => PRODUCT
+ # => MacBook Pro => PRODUCT
+ ```
+
+ **Case-insensitive matching:**
+
+ ```ruby
+ # Use attr: "LOWER" for case-insensitive matching
+ matcher = nlp.phrase_matcher(attr: "LOWER")
+ matcher.add("COMPANY", ["apple", "google", "microsoft"])
+
+ doc = nlp.read("Apple and GOOGLE are competitors of Microsoft.")
+ matches = matcher.match(doc)
+
+ matches.each do |span|
+   puts span.text
+ end
+ # => Apple
+ # => GOOGLE
+ # => Microsoft
+ ```
+
+ **Multiple categories:**
+
+ ```ruby
+ matcher = nlp.phrase_matcher(attr: "LOWER")
+ matcher.add("TECH_COMPANY", ["apple", "google", "microsoft", "amazon"])
+ matcher.add("PRODUCT", ["iphone", "pixel", "surface", "kindle"])
+
+ doc = nlp.read("Apple released the new iPhone while Google announced Pixel updates.")
+ matches = matcher.match(doc)
+
+ matches.each do |span|
+   puts "#{span.text}: #{span.label}"
+ end
+ # => Apple: TECH_COMPANY
+ # => iPhone: PRODUCT
+ # => Google: TECH_COMPANY
+ # => Pixel: PRODUCT
+ ```

  ## OpenAI API Integration

- > ⚠️ This feature is currently experimental. Details are subject to change. Please refer to OpenAI's [API reference](https://platform.openai.com/docs/api-reference) and [Ruby OpenAI](https://github.com/alexrudall/ruby-openai) for available parameters (`max_tokens`, `temperature`, etc).
+ > ⚠️ This feature requires GPT-5 series models. Please refer to OpenAI's [API reference](https://platform.openai.com/docs/api-reference) for details. Note: GPT-5 models do not support the `temperature` parameter.

- Easily leverage GPT models within ruby-spacy by using an OpenAI API key. When constructing prompts for the `Doc::openai_query` method, you can incorporate the following token properties of the document. These properties are retrieved through function calls (made internally by GPT when necessary) and seamlessly integrated into your prompt. Note that function calls need `gpt-4o-mini` or greater. The available properties include:
+ Easily leverage GPT models within ruby-spacy by using an OpenAI API key. When constructing prompts for the `Doc::openai_query` method, you can incorporate the following token properties of the document. These properties are retrieved through tool calls (made internally by GPT when necessary) and seamlessly integrated into your prompt. The available properties include:

  - `surface`
  - `lemma`
@@ -550,9 +612,8 @@ nlp = Spacy::Language.new("en_core_web_sm")
  doc = nlp.read("The Beatles released 12 studio albums")

  # default parameter values
- # max_tokens: 1000
- # temperature: 0.7
- # model: "gpt-4o-mini"
+ # max_completion_tokens: 1000
+ # model: "gpt-5-mini"
  res1 = doc.openai_query(
    access_token: api_key,
    prompt: "Translate the text to Japanese."
@@ -576,9 +637,8 @@ nlp = Spacy::Language.new("en_core_web_sm")
  doc = nlp.read("The Beatles were an English rock band formed in Liverpool in 1960.")

  # default parameter values
- # max_tokens: 1000
- # temperature: 0.7
- # model: "gpt-4o-mini"
+ # max_completion_tokens: 1000
+ # model: "gpt-5-mini"
  res = doc.openai_query(
    access_token: api_key,
    prompt: "Extract the topic of the document and list 10 entities (names, concepts, locations, etc.) that are relevant to the topic."
@@ -614,9 +674,8 @@ nlp = Spacy::Language.new("en_core_web_sm")
  doc = nlp.read("The Beatles released 12 studio albums")

  # default parameter values
- # max_tokens: 1000
- # temperature: 0.7
- # model: "gpt-4o-mini"
+ # max_completion_tokens: 1000
+ # model: "gpt-5-mini"
  res = doc.openai_query(
    access_token: api_key,
    prompt: "List token data of each of the words used in the sentence. Add 'meaning' property and value (brief semantic definition) to each token data. Output as a JSON object."
@@ -692,7 +751,7 @@ Output:
  }
  ```

- ### GPT Prompting (Generate a Syntaxt Tree using Token Properties)
+ ### GPT Prompting (Generate a Syntax Tree using Token Properties)

  Ruby code:

@@ -704,11 +763,10 @@ nlp = Spacy::Language.new("en_core_web_sm")
  doc = nlp.read("The Beatles released 12 studio albums")

  # default parameter values
- # max_tokens: 1000
- # temperature: 0.7
+ # max_completion_tokens: 1000
+ # model: "gpt-5-mini"
  res = doc.openai_query(
    access_token: api_key,
-   model: "gpt-4",
    prompt: "Generate a tree diagram from the text using given token data. Use the following bracketing style: [S [NP [Det the] [N cat]] [VP [V sat] [PP [P on] [NP the mat]]]"
  )
  puts res
@@ -747,9 +805,8 @@ nlp = Spacy::Language.new("en_core_web_sm")
  doc = nlp.read("Vladimir Nabokov was a")

  # default parameter values
- # max_tokens: 1000
- # temperature: 0.7
- # model: "gpt-4o-mini"
+ # max_completion_tokens: 1000
+ # model: "gpt-5-mini"
  res = doc.openai_completion(access_token: api_key)
  puts res
  ```
@@ -769,7 +826,7 @@ api_key = ENV["OPENAI_API_KEY"]
  nlp = Spacy::Language.new("en_core_web_sm")
  doc = nlp.read("Vladimir Nabokov was a Russian-American novelist, poet, translator and entomologist.")

- # default model: text-embedding-ada-002
+ # default model: text-embedding-3-small
  res = doc.openai_embeddings(access_token: api_key)

  puts res
@@ -796,6 +853,47 @@ You can set a timeout for the `Spacy::Language.new` method:
  nlp = Spacy::Language.new("en_core_web_sm", timeout: 120) # Set timeout to 120 seconds
  ```

+ ### Document Serialization
+
+ You can serialize processed documents to binary format for caching or storage. This is useful when you want to avoid re-processing the same text multiple times.
+
+ **Saving a document:**
+
+ ```ruby
+ require "ruby-spacy"
+
+ nlp = Spacy::Language.new("en_core_web_sm")
+ doc = nlp.read("Apple Inc. was founded by Steve Jobs in California.")
+
+ # Serialize to binary
+ bytes = doc.to_bytes
+
+ # Save to file
+ File.binwrite("doc_cache.bin", bytes)
+ ```
+
+ **Restoring a document:**
+
+ ```ruby
+ nlp = Spacy::Language.new("en_core_web_sm")
+
+ # Load from file
+ bytes = File.binread("doc_cache.bin")
+
+ # Restore the document (all annotations are preserved)
+ restored_doc = Spacy::Doc.from_bytes(nlp, bytes)
+
+ puts restored_doc.text
+ # => "Apple Inc. was founded by Steve Jobs in California."
+
+ restored_doc.ents.each do |ent|
+   puts "#{ent.text} (#{ent.label})"
+ end
+ # => Apple Inc. (ORG)
+ # => Steve Jobs (PERSON)
+ # => California (GPE)
+ ```
+
  ## Author

  Yoichiro Hasebe [<yohasebe@gmail.com>]
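The serialization section of the README above suggests a cache-or-compute workflow. The sketch below shows that pattern in isolation, using Ruby's built-in `Marshal` as a stand-in serializer; in ruby-spacy itself, `doc.to_bytes` and `Spacy::Doc.from_bytes` would play the dump/load roles:

```ruby
require "tmpdir"

# Cache-or-compute: reuse a cached binary blob when present, otherwise run the
# expensive block and persist its serialized result. Marshal stands in here for
# doc.to_bytes / Spacy::Doc.from_bytes.
def cached(path)
  return Marshal.load(File.binread(path)) if File.exist?(path)

  result = yield
  File.binwrite(path, Marshal.dump(result))
  result
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "doc_cache.bin")
  first  = cached(path) { { text: "expensively processed" } } # computed and cached
  second = cached(path) { raise "should not recompute" }      # served from cache
  puts second[:text] # => expensively processed
end
```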
data/lib/ruby-spacy/openai_client.rb ADDED
@@ -0,0 +1,149 @@
+ # frozen_string_literal: true
+
+ require "net/http"
+ require "openssl"
+ require "uri"
+ require "json"
+
+ module Spacy
+   # A lightweight OpenAI API client with tools support for GPT-5 series models.
+   # This client implements the chat completions and embeddings endpoints
+   # without external dependencies.
+   class OpenAIClient
+     API_ENDPOINT = "https://api.openai.com/v1"
+     DEFAULT_TIMEOUT = 120
+     MAX_RETRIES = 3
+     RETRY_DELAY = 1
+
+     class APIError < StandardError
+       attr_reader :status_code, :response_body
+
+       def initialize(message, status_code: nil, response_body: nil)
+         @status_code = status_code
+         @response_body = response_body
+         super(message)
+       end
+     end
+
+     def initialize(access_token:, timeout: DEFAULT_TIMEOUT)
+       @access_token = access_token
+       @timeout = timeout
+     end
+
+     # Sends a chat completion request with optional tools support.
+     # Note: GPT-5 series models do not support the temperature parameter.
+     #
+     # @param model [String] The model to use (e.g., "gpt-5-mini")
+     # @param messages [Array<Hash>] The conversation messages
+     # @param max_completion_tokens [Integer] Maximum tokens in the response
+     # @param temperature [Float, nil] Sampling temperature (ignored for GPT-5 models)
+     # @param tools [Array<Hash>, nil] Tool definitions for function calling
+     # @param tool_choice [String, Hash, nil] Tool selection strategy
+     # @return [Hash] The API response
+     def chat(model:, messages:, max_completion_tokens: 1000, temperature: nil, tools: nil, tool_choice: nil)
+       body = {
+         model: model,
+         messages: messages,
+         max_completion_tokens: max_completion_tokens
+       }
+
+       # GPT-5 series models do not support temperature parameter
+       unless gpt5_model?(model)
+         body[:temperature] = temperature || 0.7
+       end
+
+       if tools && !tools.empty?
+         body[:tools] = tools
+         body[:tool_choice] = tool_choice || "auto"
+       end
+
+       post("/chat/completions", body)
+     end
+
+     # Checks if the model is a GPT-5 series model.
+     # GPT-5 models have different parameter requirements (no temperature support).
+     def gpt5_model?(model)
+       model.to_s.start_with?("gpt-5")
+     end
+
+     # Sends an embeddings request.
+     #
+     # @param model [String] The embeddings model (e.g., "text-embedding-3-small")
+     # @param input [String] The text to embed
+     # @return [Hash] The API response
+     def embeddings(model:, input:)
+       body = {
+         model: model,
+         input: input
+       }
+
+       post("/embeddings", body)
+     end
+
+     private
+
+     # Creates a certificate store with system CA certificates but without CRL checking.
+     # This avoids "unable to get certificate CRL" errors on some systems.
+     def default_cert_store
+       store = OpenSSL::X509::Store.new
+       store.set_default_paths
+       store
+     end
+
+     def post(path, body)
+       uri = URI.parse("#{API_ENDPOINT}#{path}")
+       retries = 0
+
+       begin
+         http = Net::HTTP.new(uri.host, uri.port)
+         http.use_ssl = true
+         http.verify_mode = OpenSSL::SSL::VERIFY_PEER
+         http.cert_store = default_cert_store
+         http.open_timeout = @timeout
+         http.read_timeout = @timeout
+
+         request = Net::HTTP::Post.new(uri.path)
+         request["Content-Type"] = "application/json"
+         request["Authorization"] = "Bearer #{@access_token}"
+         request.body = body.to_json
+
+         response = http.request(request)
+
+         handle_response(response)
+       rescue Net::OpenTimeout, Net::ReadTimeout => e
+         retries += 1
+         if retries <= MAX_RETRIES
+           sleep RETRY_DELAY
+           retry
+         end
+         raise APIError.new("Request timed out after #{MAX_RETRIES} retries: #{e.message}")
+       rescue Errno::ECONNREFUSED, Errno::ECONNRESET, SocketError => e
+         retries += 1
+         if retries <= MAX_RETRIES
+           sleep RETRY_DELAY
+           retry
+         end
+         raise APIError.new("Network error after #{MAX_RETRIES} retries: #{e.message}")
+       end
+     end
+
+     def handle_response(response)
+       body = JSON.parse(response.body)
+
+       case response.code.to_i
+       when 200
+         body
+       when 400..499
+         error_message = body.dig("error", "message") || "Client error"
+         raise APIError.new(error_message, status_code: response.code.to_i, response_body: body)
+       when 500..599
+         error_message = body.dig("error", "message") || "Server error"
+         raise APIError.new(error_message, status_code: response.code.to_i, response_body: body)
+       else
+         raise APIError.new("Unexpected response: #{response.code}", status_code: response.code.to_i, response_body: body)
+       end
+     rescue JSON::ParserError
+       raise APIError.new("Invalid JSON response", status_code: response.code.to_i, response_body: response.body)
+     end
+   end
+ end
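The request-building rule in `OpenAIClient#chat` above — include `temperature` only for non-GPT-5 models — can be isolated to a few lines. This standalone sketch mirrors (rather than reuses) the client's logic to show which request bodies result:

```ruby
# Standalone mirror of the client's temperature gating: GPT-5 series models
# reject the temperature parameter, so it is only added for older models.
def build_chat_body(model:, temperature: nil)
  body = { model: model }
  body[:temperature] = temperature || 0.7 unless model.to_s.start_with?("gpt-5")
  body
end

p build_chat_body(model: "gpt-5-mini")                    # no :temperature key
p build_chat_body(model: "gpt-4o-mini")                   # :temperature defaults to 0.7
p build_chat_body(model: "gpt-4o-mini", temperature: 0.2) # caller's value kept
```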
data/lib/ruby-spacy/version.rb CHANGED
@@ -2,5 +2,5 @@

  module Spacy
    # The version number of the module
-   VERSION = "0.2.3"
+   VERSION = "0.3.0"
  end
data/lib/ruby-spacy.rb CHANGED
@@ -1,11 +1,12 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  require_relative "ruby-spacy/version"
4
+ require_relative "ruby-spacy/openai_client"
4
5
  require "numpy"
5
- require "openai"
6
6
  require "pycall"
7
7
  require "strscan"
8
8
  require "timeout"
9
+ require "json"
9
10
 
10
11
  begin
11
12
  PyCall.init
@@ -39,6 +40,9 @@ module Spacy
39
40
  # Python `Matcher` class object
40
41
  PyMatcher = spacy.matcher.Matcher
41
42
 
43
+ # Python `PhraseMatcher` class object
44
+ PyPhraseMatcher = spacy.matcher.PhraseMatcher
45
+
42
46
  # Python `displacy` object
43
47
  PyDisplacy = PyCall.import_module('spacy.displacy')
44
48
 
@@ -49,18 +53,6 @@ module Spacy
49
53
  PyCall::List.call(py_generator)
50
54
  end
51
55
 
52
- @openai_client = nil
53
-
54
- def self.openai_client(access_token:)
55
- # If @client is already set, just return it. Otherwise, create a new instance.
56
- @openai_client ||= OpenAI::Client.new(access_token: access_token)
57
- end
58
-
59
- # Provide an accessor method to get the client (optional)
60
- def self.client
61
- @openai_client
62
- end
63
-
64
56
  # See also spaCy Python API document for [`Doc`](https://spacy.io/api/doc).
65
57
  class Doc
66
58
  # @return [Object] a Python `Language` instance accessible via `PyCall`
@@ -216,6 +208,30 @@ module Spacy
216
208
  py_doc.similarity(other.py_doc)
217
209
  end
218
210
 
211
+ # Serializes the doc to a binary string.
212
+ # The binary data includes all annotations (tokens, entities, etc.) and can be
213
+ # used to restore the doc later without re-processing.
214
+ # @return [String] binary representation of the doc
215
+ # @example Save doc to file
216
+ # doc = nlp.read("Hello world")
217
+ # File.binwrite("doc.bin", doc.to_bytes)
218
+ def to_bytes
219
+ @py_doc.to_bytes.force_encoding(Encoding::BINARY)
220
+ end
221
+
222
+ # Restores a doc from binary data created by {#to_bytes}.
223
+ # This is useful for caching processed documents to avoid re-processing.
224
+ # @param byte_string [String] binary data from {#to_bytes}
225
+ # @return [Doc] the restored doc
226
+ # @example Load doc from file
227
+ # bytes = File.binread("doc.bin")
228
+ # doc = Spacy::Doc.from_bytes(nlp, bytes)
229
+ def self.from_bytes(nlp, byte_string)
230
+ py_bytes = PyCall.eval("bytes(#{byte_string.bytes})")
231
+ py_doc = nlp.py_nlp.call("").from_bytes(py_bytes)
232
+ new(nlp.py_nlp, py_doc: py_doc)
233
+ end
234
+
219
235
  # Visualize the document in one of two styles: "dep" (dependencies) or "ent" (named entities).
220
236
  # @param style [String] either `dep` or `ent`
221
237
  # @param compact [Boolean] only relevant to the `dep' style
@@ -224,12 +240,26 @@ module Spacy
224
240
  PyDisplacy.render(py_doc, style: style, options: { compact: compact }, jupyter: false)
225
241
  end
226
242
 
243
+ # Sends a query to OpenAI's chat completion API with optional tool support.
244
+ # The get_tokens tool allows the model to request token-level linguistic analysis.
245
+ #
246
+ # @param access_token [String, nil] OpenAI API key (defaults to OPENAI_API_KEY env var)
247
+ # @param max_completion_tokens [Integer] Maximum tokens in the response
248
+ # @param max_tokens [Integer] Alias for max_completion_tokens (deprecated, for backward compatibility)
249
+ # @param temperature [Float] Sampling temperature (ignored for GPT-5 models)
250
+ # @param model [String] The model to use (default: gpt-5-mini)
251
+ # @param messages [Array<Hash>] Conversation history (for recursive tool calls)
252
+ # @param prompt [String, nil] System prompt for the query
253
+ # @return [String, nil] The model's response content
227
254
  def openai_query(access_token: nil,
228
- max_tokens: 1000,
255
+ max_completion_tokens: nil,
256
+ max_tokens: nil,
229
257
  temperature: 0.7,
230
- model: "gpt-4o-mini",
258
+ model: "gpt-5-mini",
231
259
  messages: [],
232
260
  prompt: nil)
261
+ # Support both max_completion_tokens and max_tokens for backward compatibility
262
+ max_completion_tokens ||= max_tokens || 1000
233
263
  if messages.empty?
234
264
  messages = [
235
265
  { role: "system", content: prompt },
@@ -240,110 +270,134 @@ module Spacy
240
270
  access_token ||= ENV["OPENAI_API_KEY"]
241
271
  raise "Error: OPENAI_API_KEY is not set" unless access_token
242
272
 
243
- begin
244
- response = Spacy.openai_client(access_token: access_token).chat(
245
- parameters: {
246
- model: model,
247
- messages: messages,
248
- max_tokens: max_tokens,
249
- temperature: temperature,
250
- function_call: "auto",
251
- stream: false,
252
- functions: [
253
- {
254
- name: "get_tokens",
255
- description: "Tokenize given text and return a list of tokens with their attributes: surface, lemma, tag, pos (part-of-speech), dep (dependency), ent_type (entity type), and morphology",
256
- "parameters": {
257
- "type": "object",
258
- "properties": {
259
- "text": {
260
- "type": "string",
261
- "description": "text to be tokenized"
262
- }
263
- },
264
- "required": ["text"]
273
+ # Tool definition for token analysis (GPT-5 tools API format)
274
+ tools = [
275
+ {
276
+ type: "function",
277
+ function: {
278
+ name: "get_tokens",
279
+ description: "Tokenize given text and return a list of tokens with their attributes: surface, lemma, tag, pos (part-of-speech), dep (dependency), ent_type (entity type), and morphology",
280
+ parameters: {
281
+ type: "object",
282
+ properties: {
283
+ text: {
284
+ type: "string",
285
+ description: "text to be tokenized"
265
286
  }
266
- }
267
- ]
287
+ },
288
+ required: ["text"]
289
+ }
268
290
  }
269
- )
291
+ }
292
+ ]
293
+
294
+ client = OpenAIClient.new(access_token: access_token)
295
+ response = client.chat(
296
+ model: model,
297
+ messages: messages,
298
+ max_completion_tokens: max_completion_tokens,
299
+ temperature: temperature,
300
+ tools: tools,
301
+ tool_choice: "auto"
302
+ )
303
+
304
+ message = response.dig("choices", 0, "message")
270
305
 
271
- message = response.dig("choices", 0, "message")
306
+ # Handle tool calls (GPT-5 format)
307
+ if message["tool_calls"] && !message["tool_calls"].empty?
308
+ messages << message
309
+
310
+ message["tool_calls"].each do |tool_call|
311
+ function_name = tool_call.dig("function", "name")
312
+ tool_call_id = tool_call["id"]
272
313
 
273
- if message["role"] == "assistant" && message["function_call"]
274
- messages << message
275
- function_name = message.dig("function_call", "name")
276
- _args = JSON.parse(message.dig("function_call", "arguments"))
277
314
  case function_name
278
315
  when "get_tokens"
279
- res = tokens.map do |t|
316
+ result = tokens.map do |t|
280
317
  {
281
- "surface": t.text,
282
- "lemma": t.lemma,
283
- "pos": t.pos,
284
- "tag": t.tag,
285
- "dep": t.dep,
286
- "ent_type": t.ent_type,
287
- "morphology": t.morphology
318
+ surface: t.text,
319
+ lemma: t.lemma,
320
+ pos: t.pos,
321
+ tag: t.tag,
322
+ dep: t.dep,
323
+ ent_type: t.ent_type,
324
+ morphology: t.morphology
288
325
  }
289
326
  end.to_json
327
+
328
+ messages << {
329
+ role: "tool",
330
+ tool_call_id: tool_call_id,
331
+ content: result
332
+ }
290
333
  end
291
- messages << { role: "system", content: res }
292
- openai_query(access_token: access_token, max_tokens: max_tokens,
293
- temperature: temperature, model: model,
294
- messages: messages, prompt: prompt)
295
- else
296
- message["content"]
297
334
  end
298
- rescue StandardError => e
299
- puts "Error: OpenAI API call failed."
300
- pp e.message
301
- pp e.backtrace
335
+
336
+ # Recursive call to get final response after tool execution
337
+ openai_query(
338
+ access_token: access_token,
339
+ max_completion_tokens: max_completion_tokens,
340
+ temperature: temperature,
341
+ model: model,
342
+ messages: messages,
343
+ prompt: prompt
344
+ )
345
+ else
346
+ message["content"]
302
347
  end
303
- end
348
+ rescue OpenAIClient::APIError => e
349
+ puts "Error: OpenAI API call failed - #{e.message}"
350
+ nil
351
+ end
352
+
353
+ # Sends a text completion request to OpenAI's chat API.
354
+ #
355
+ # @param access_token [String, nil] OpenAI API key (defaults to OPENAI_API_KEY env var)
356
+ # @param max_completion_tokens [Integer] Maximum tokens in the response
357
+ # @param max_tokens [Integer] Alias for max_completion_tokens (deprecated, for backward compatibility)
358
+ # @param temperature [Float] Sampling temperature (ignored for GPT-5 models)
359
+ # @param model [String] The model to use (default: gpt-5-mini)
360
+ # @return [String, nil] The completed text
361
+ def openai_completion(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini")
362
+ # Support both max_completion_tokens and max_tokens for backward compatibility
363
+ max_completion_tokens ||= max_tokens || 1000
304
364
 
305
- def openai_completion(access_token: nil, max_tokens: 1000, temperature: 0.7, model: "gpt-4o-mini")
306
365
  messages = [
307
366
  { role: "system", content: "Complete the text input by the user." },
308
367
  { role: "user", content: @text }
309
368
  ]
369
+
310
370
  access_token ||= ENV["OPENAI_API_KEY"]
311
371
  raise "Error: OPENAI_API_KEY is not set" unless access_token
312
372
 
313
- begin
314
- response = Spacy.openai_client(access_token: access_token).chat(
315
- parameters: {
316
- model: model,
317
- messages: messages,
318
- max_tokens: max_tokens,
319
- temperature: temperature
320
- }
321
- )
322
- response.dig("choices", 0, "message", "content")
323
- rescue StandardError => e
324
- puts "Error: OpenAI API call failed."
325
- pp e.message
326
- pp e.backtrace
327
- end
328
- end
329
-
330
- def openai_embeddings(access_token: nil, model: "text-embedding-ada-002")
373
+ client = OpenAIClient.new(access_token: access_token)
374
+ response = client.chat(
375
+ model: model,
376
+ messages: messages,
377
+ max_completion_tokens: max_completion_tokens,
378
+ temperature: temperature
379
+ )
380
+ response.dig("choices", 0, "message", "content")
381
+ rescue OpenAIClient::APIError => e
382
+ puts "Error: OpenAI API call failed - #{e.message}"
383
+ nil
384
+ end
385
+
386
+ # Generates text embeddings using OpenAI's embeddings API.
387
+ #
388
+ # @param access_token [String, nil] OpenAI API key (defaults to OPENAI_API_KEY env var)
389
+ # @param model [String] The embeddings model (default: text-embedding-3-small)
390
+ # @return [Array<Float>, nil] The embedding vector
391
+ def openai_embeddings(access_token: nil, model: "text-embedding-3-small")
331
392
  access_token ||= ENV["OPENAI_API_KEY"]
332
393
  raise "Error: OPENAI_API_KEY is not set" unless access_token
333
394
 
334
- begin
335
- response = Spacy.openai_client(access_token: access_token).embeddings(
336
- parameters: {
337
- model: model,
338
- input: @text
339
- }
340
- )
341
- response.dig("data", 0, "embedding")
342
- rescue StandardError => e
343
- puts "Error: OpenAI API call failed."
344
- pp e.message
345
- pp e.backtrace
346
- end
395
+ client = OpenAIClient.new(access_token: access_token)
396
+ response = client.embeddings(model: model, input: @text)
397
+ response.dig("data", 0, "embedding")
398
+ rescue OpenAIClient::APIError => e
399
+ puts "Error: OpenAI API call failed - #{e.message}"
400
+ nil
347
401
  end
348
402
 
349
403
  # Methods defined in Python but not wrapped in ruby-spacy can be called by this dynamic method handling mechanism.
@@ -351,7 +405,7 @@ module Spacy
351
405
  @py_doc.send(name, *args)
352
406
  end
353
407
 
354
- def respond_to_missing?(sym)
408
+ def respond_to_missing?(sym, *args)
355
409
  sym ? true : super
356
410
  end
357
411
  end
@@ -398,6 +452,18 @@ module Spacy
398
452
  Matcher.new(@py_nlp)
399
453
  end
400
454
 
455
+ # Generates a phrase matcher for the current language model.
456
+ # PhraseMatcher is more efficient than {Matcher} for matching large terminology lists.
457
+ # @param attr [String] the token attribute to match on (default: "ORTH").
458
+ # Use "LOWER" for case-insensitive matching.
459
+ # @return [PhraseMatcher]
460
+ # @example
461
+ # matcher = nlp.phrase_matcher(attr: "LOWER")
462
+ # matcher.add("PRODUCT", ["iPhone", "MacBook Pro"])
463
+ def phrase_matcher(attr: "ORTH")
464
+ PhraseMatcher.new(self, attr: attr)
465
+ end
466
+
401
467
  # A utility method to lookup a vocabulary item of the given id.
402
468
  # @param id [Integer] a vocabulary id
403
469
  # @return [Object] a Python `Lexeme` object (https://spacy.io/api/lexeme)
@@ -473,7 +539,7 @@ module Spacy
473
539
  @py_nlp.send(name, *args)
474
540
  end
475
541
 
476
- def respond_to_missing?(sym)
542
+ def respond_to_missing?(sym, *args)
477
543
  sym ? true : super
478
544
  end
479
545
  end
@@ -516,6 +582,54 @@ module Spacy
516
582
  end
517
583
  end
518
584
 
585
+ # See also spaCy Python API document for [`PhraseMatcher`](https://spacy.io/api/phrasematcher).
586
+ # PhraseMatcher is useful for efficiently matching large terminology lists.
587
+ # It's faster than {Matcher} when matching many phrase patterns.
588
+ class PhraseMatcher
589
+ # @return [Object] a Python `PhraseMatcher` instance accessible via `PyCall`
590
+ attr_reader :py_matcher
591
+
592
+ # @return [Language] the language model used by this matcher
593
+ attr_reader :nlp
594
+
595
+ # Creates a {PhraseMatcher} instance.
596
+ # @param nlp [Language] an instance of {Language} class
597
+ # @param attr [String] the token attribute to match on (default: "ORTH").
598
+ # Use "LOWER" for case-insensitive matching.
599
+ # @example Case-insensitive matching
600
+ # matcher = Spacy::PhraseMatcher.new(nlp, attr: "LOWER")
601
+ def initialize(nlp, attr: "ORTH")
602
+ @nlp = nlp
603
+ @py_matcher = PyPhraseMatcher.call(nlp.py_nlp.vocab, attr: attr)
604
+ end
605
+
606
+ # Adds phrase patterns to the matcher.
607
+ # @param label [String] a label string given to the patterns
608
+ # @param phrases [Array<String>] an array of phrase strings to match
609
+ # @example Add product names
610
+ # matcher.add("PRODUCT", ["iPhone", "MacBook Pro", "iPad"])
611
+ def add(label, phrases)
612
+ patterns = phrases.map { |phrase| @nlp.py_nlp.call(phrase) }
613
+ @py_matcher.add(label, patterns)
614
+ end
615
+
616
+ # Execute the phrase match and return matching spans.
617
+ # @param doc [Doc] a {Doc} instance to search
618
+ # @return [Array<Span>] an array of {Span} objects with labels
619
+ # @example Find matches
620
+ # matches = matcher.match(doc)
621
+ # matches.each { |span| puts "#{span.text} => #{span.label}" }
622
+ def match(doc)
623
+ py_matches = @py_matcher.call(doc.py_doc, as_spans: true)
624
+ results = []
625
+ PyCall::List.call(py_matches).each do |py_span|
626
+ span = Span.new(doc, py_span: py_span)
627
+ results << span
628
+ end
629
+ results
630
+ end
631
+ end
632
+
519
633
  # See also spaCy Python API document for [`Span`](https://spacy.io/api/span).
520
634
  class Span
521
635
  # @return [Object] a Python `Span` instance accessible via `PyCall`
@@ -679,7 +793,7 @@ module Spacy
679
793
  @py_span.send(name, *args)
680
794
  end
681
795
 
682
- def respond_to_missing?(sym)
796
+ def respond_to_missing?(sym, *args)
683
797
  sym ? true : super
684
798
  end
685
799
  end
@@ -845,7 +959,7 @@ module Spacy
845
959
  @py_token.send(name, *args)
846
960
  end
847
961
 
848
- def respond_to_missing?(sym)
962
+ def respond_to_missing?(sym, *args)
849
963
  sym ? true : super
850
964
  end
851
965
  end
@@ -920,7 +1034,7 @@ module Spacy
920
1034
  @py_lexeme.send(name, *args)
921
1035
  end
922
1036
 
923
- def respond_to_missing?(sym)
1037
+ def respond_to_missing?(sym, *args)
924
1038
  sym ? true : super
925
1039
  end
926
1040
  end
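The repeated `respond_to_missing?` changes above fix the method's arity: Ruby invokes this hook with two arguments, the method name and an `include_private` flag, so the old one-parameter definitions would raise `ArgumentError` whenever `respond_to?` was called on a delegating object. A minimal illustration of the contract (`Wrapper` is a hypothetical class, not part of ruby-spacy):

```ruby
# A delegating object that claims to respond to everything,
# as the ruby-spacy wrapper classes do for their Python objects.
class Wrapper
  def method_missing(name, *args)
    "delegated #{name}"
  end

  # Ruby calls this as respond_to_missing?(name, include_private);
  # a single-parameter definition would raise ArgumentError here.
  def respond_to_missing?(sym, include_private = false)
    true
  end
end

puts Wrapper.new.respond_to?(:anything)  # prints "true"
```

Accepting `*args` (as the diff does) or an explicit `include_private = false` default both satisfy the contract; the explicit form documents the second argument's meaning.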
data/ruby-spacy.gemspec CHANGED
@@ -15,7 +15,7 @@ Gem::Specification.new do |spec|
15
15
 
16
16
  spec.homepage = "https://github.com/yohasebe/ruby-spacy"
17
17
  spec.license = "MIT"
18
- spec.required_ruby_version = Gem::Requirement.new(">= 2.6")
18
+ spec.required_ruby_version = Gem::Requirement.new(">= 3.1")
19
19
 
20
20
  # Specify which files should be added to the gem when it is released.
21
21
  # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
@@ -31,9 +31,9 @@ Gem::Specification.new do |spec|
31
31
  spec.add_development_dependency "rspec"
32
32
  spec.add_development_dependency "solargraph"
33
33
 
34
+ spec.add_dependency "fiddle" # Required for Ruby 4.0+ (moved from default to bundled gem)
34
35
  spec.add_dependency "numpy", "~> 0.4.0"
35
36
  spec.add_dependency "pycall", "~> 1.5.1"
36
- spec.add_dependency "ruby-openai"
37
37
  spec.add_dependency "terminal-table", "~> 3.0.1"
38
38
 
39
39
  # For more information and examples about making a new gem, checkout our
metadata CHANGED
@@ -1,14 +1,13 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: ruby-spacy
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.2.3
4
+ version: 0.3.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Yoichiro Hasebe
8
- autorequire:
9
8
  bindir: bin
10
9
  cert_chain: []
11
- date: 2024-08-27 00:00:00.000000000 Z
10
+ date: 1980-01-02 00:00:00.000000000 Z
12
11
  dependencies:
13
12
  - !ruby/object:Gem::Dependency
14
13
  name: bundler
@@ -67,47 +66,47 @@ dependencies:
67
66
  - !ruby/object:Gem::Version
68
67
  version: '0'
69
68
  - !ruby/object:Gem::Dependency
70
- name: numpy
69
+ name: fiddle
71
70
  requirement: !ruby/object:Gem::Requirement
72
71
  requirements:
73
- - - "~>"
72
+ - - ">="
74
73
  - !ruby/object:Gem::Version
75
- version: 0.4.0
74
+ version: '0'
76
75
  type: :runtime
77
76
  prerelease: false
78
77
  version_requirements: !ruby/object:Gem::Requirement
79
78
  requirements:
80
- - - "~>"
79
+ - - ">="
81
80
  - !ruby/object:Gem::Version
82
- version: 0.4.0
81
+ version: '0'
83
82
  - !ruby/object:Gem::Dependency
84
- name: pycall
83
+ name: numpy
85
84
  requirement: !ruby/object:Gem::Requirement
86
85
  requirements:
87
86
  - - "~>"
88
87
  - !ruby/object:Gem::Version
89
- version: 1.5.1
88
+ version: 0.4.0
90
89
  type: :runtime
91
90
  prerelease: false
92
91
  version_requirements: !ruby/object:Gem::Requirement
93
92
  requirements:
94
93
  - - "~>"
95
94
  - !ruby/object:Gem::Version
96
- version: 1.5.1
95
+ version: 0.4.0
97
96
  - !ruby/object:Gem::Dependency
98
- name: ruby-openai
97
+ name: pycall
99
98
  requirement: !ruby/object:Gem::Requirement
100
99
  requirements:
101
- - - ">="
100
+ - - "~>"
102
101
  - !ruby/object:Gem::Version
103
- version: '0'
102
+ version: 1.5.1
104
103
  type: :runtime
105
104
  prerelease: false
106
105
  version_requirements: !ruby/object:Gem::Requirement
107
106
  requirements:
108
- - - ">="
107
+ - - "~>"
109
108
  - !ruby/object:Gem::Version
110
- version: '0'
109
+ version: 1.5.1
111
110
  - !ruby/object:Gem::Dependency
112
111
  name: terminal-table
113
112
  requirement: !ruby/object:Gem::Requirement
@@ -203,13 +202,13 @@ files:
203
202
  - examples/rule_based_matching/creating_spans_from_matches.rb
204
203
  - examples/rule_based_matching/matcher.rb
205
204
  - lib/ruby-spacy.rb
205
+ - lib/ruby-spacy/openai_client.rb
206
206
  - lib/ruby-spacy/version.rb
207
207
  - ruby-spacy.gemspec
208
208
  homepage: https://github.com/yohasebe/ruby-spacy
209
209
  licenses:
210
210
  - MIT
211
211
  metadata: {}
212
- post_install_message:
213
212
  rdoc_options: []
214
213
  require_paths:
215
214
  - lib
@@ -217,15 +216,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
217
216
  requirements:
218
217
  - - ">="
219
218
  - !ruby/object:Gem::Version
220
- version: '2.6'
219
+ version: '3.1'
221
220
  required_rubygems_version: !ruby/object:Gem::Requirement
222
221
  requirements:
223
222
  - - ">="
224
223
  - !ruby/object:Gem::Version
225
224
  version: '0'
226
225
  requirements: []
227
- rubygems_version: 3.4.13
228
- signing_key:
226
+ rubygems_version: 3.6.9
229
227
  specification_version: 4
230
228
  summary: A wrapper module for using spaCy natural language processing library from
231
229
  the Ruby programming language using PyCall