RubyGems - ruby-spacy - Versions diffs - 0.2.3 → 0.4.0 - Mend

ruby-spacy 0.2.3 → 0.4.0


Files changed (12)

checksums.yaml +4 -4
data/.github/FUNDING.yml +6 -0
data/.gitignore +1 -0
data/CHANGELOG.md +24 -7
data/Gemfile +1 -1
data/README.md +120 -22
data/lib/ruby-spacy/openai_client.rb +166 -0
data/lib/ruby-spacy/openai_helper.rb +91 -0
data/lib/ruby-spacy/version.rb +1 -1
data/lib/ruby-spacy.rb +455 -248
data/ruby-spacy.gemspec +3 -2
metadata +34 -20

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 9c9ca5b4cba8eb115192aa0b5a45216d12a9d9e4cdddc253ba55ace52e778afd
-  data.tar.gz: 197c61acfa742048fefff05b35d6045e17dd5cf212667c277537fb984a0ff926
+  metadata.gz: 6185c586feb32fa51efcd4349398cd4ca9541280a5cc8a1b6a73eb93a987d4ac
+  data.tar.gz: a146a9c40e2d5293e2401cb16b8ac6866cbb577e11a10d9657c406f933e7a3aa
 SHA512:
-  metadata.gz: 950daeb4f8ee140a15bacf18ea3228f2604a552df8aa12be52fb7a488c78e67b894b8678fbe6fbed74da54beb714e89d02ab1bd46d5c59a908b8ddfbc5c9e7c0
-  data.tar.gz: 84b183babd37f9120c0ac2332eec23dff30d3180da165aaf044bf72ef4be7af4efc2b339ad5ac5b489e3e3b9b44ba33d3df4fca287addbbed05cfa4201b79d75
+  metadata.gz: bf558d4e9a7a6765fd7d088bbf8324a6ee0e4f4186962551d71e5a991e0aefd1e51a186f19c2824fabcc6afd0c83960771f082237febece52c2a522ccb39a5cf
+  data.tar.gz: 3a64559cf8c169d1ac1ecdef526d26e5776989b9cc203a8ed30e0dd5d87ff62a4d1b741aff30c8cb49e5ffb716c6068f9af3a12d50d0d4de8ad6f22ebe80ea0d

data/.github/FUNDING.yml ADDED

@@ -0,0 +1,6 @@
+# These are supported funding model platforms
+github: [yohasebe]
+ko_fi: yohasebe
+buy_me_a_coffee: yohasebe
+# custom: # Replace with up to 4 custom sponsorship URLs e.g., ['link1', 'link2']

data/.gitignore CHANGED

@@ -62,3 +62,4 @@ tags
 .rubocop.yml
 .solargraph.yml
 .yardopts
+CLAUDE.md

data/CHANGELOG.md CHANGED

@@ -1,17 +1,34 @@
 # Change Log
+## 0.3.0 - 2025-01-06
+### Added
+- Ruby 4.0 support
+- `Doc#to_bytes` for serializing documents to binary format
+- `Doc.from_bytes` for restoring documents from binary data
+- `PhraseMatcher` class for efficient phrase matching
+- `Language#phrase_matcher` helper method
+### Changed
+- Replaced `ruby-openai` gem with custom `OpenAIClient` implementation
+- Updated default OpenAI model to `gpt-5-mini`
+- Updated embeddings model to `text-embedding-3-small`
+- Changed `max_tokens` parameter to `max_completion_tokens` (backward compatible)
+- Added `fiddle` gem dependency (required for Ruby 4.0)
+## 0.2.4 - 2024-12-11
+### Changed
+- Timeout and retry feature for `Spacy::Language.new`
 ## 0.2.3 - 2024-08-27
 - Timeout option added to `Spacy::Language.new`
-- Default OpenaAI models updated to `gpt-4o-mini`
-## 0.2.0 - 2022-10-02
-- spaCy 3.7.0 supported
+- Default OpenAI models updated to `gpt-4o-mini`
 ## 0.2.0 - 2022-10-02
 ### Added
-- `Doc::openai_query`
-- `Doc::openai_completion`
-- `Doc::openai_embeddings`
+- spaCy 3.7.0 supported
+- `Doc#openai_query`
+- `Doc#openai_completion`
+- `Doc#openai_embeddings`
 ## 0.1.4.1 - 2021-07-06
 - Test code refined

data/Gemfile CHANGED

@@ -5,9 +5,9 @@ source "https://rubygems.org"
 # Specify your gem's dependencies in ruby-spacy.gemspec
 gemspec
+gem "fiddle" # Required for Ruby 4.0+ (moved from default to bundled gem)
 gem "numpy"
 gem "pycall", "~> 1.5.1"
-gem "ruby-openai"
 gem "terminal-table"
 group :development do

data/README.md CHANGED

@@ -13,10 +13,11 @@
 | ✅ | Access to pre-trained word vectors                 |
 | ✅ | OpenAI Chat/Completion/Embeddings API integration  |
-Current Version: `0.2.3`
+Current Version: `0.3.0`
-- spaCy 3.7.0 supported
-- OpenAI API integration
+- Ruby 4.0 supported
+- spaCy 3.8 supported
+- OpenAI GPT-5 API integration
 ## Installation of Prerequisites
@@ -522,12 +523,73 @@ Output:
 | 9    | アルザス       | 0.5644999742507935 |
 | 10   | 南仏           | 0.5547999739646912 |
+### PhraseMatcher
+`PhraseMatcher` is more efficient than `Matcher` for matching large terminology lists. It's ideal for extracting known entities like product names, company names, or domain-specific terms.
+**Basic usage:**
+```ruby
+require "ruby-spacy"
+nlp = Spacy::Language.new("en_core_web_sm")
+# Create a phrase matcher
+matcher = nlp.phrase_matcher
+matcher.add("PRODUCT", ["iPhone", "MacBook Pro", "iPad"])
+doc = nlp.read("I bought an iPhone and a MacBook Pro yesterday.")
+matches = matcher.match(doc)
+matches.each do |span|
+  puts "#{span.text} => #{span.label}"
+end
+# => iPhone => PRODUCT
+# => MacBook Pro => PRODUCT
+```
+**Case-insensitive matching:**
+```ruby
+# Use attr: "LOWER" for case-insensitive matching
+matcher = nlp.phrase_matcher(attr: "LOWER")
+matcher.add("COMPANY", ["apple", "google", "microsoft"])
+doc = nlp.read("Apple and GOOGLE are competitors of Microsoft.")
+matches = matcher.match(doc)
+matches.each do |span|
+  puts span.text
+end
+# => Apple
+# => GOOGLE
+# => Microsoft
+```
+**Multiple categories:**
+```ruby
+matcher = nlp.phrase_matcher(attr: "LOWER")
+matcher.add("TECH_COMPANY", ["apple", "google", "microsoft", "amazon"])
+matcher.add("PRODUCT", ["iphone", "pixel", "surface", "kindle"])
+doc = nlp.read("Apple released the new iPhone while Google announced Pixel updates.")
+matches = matcher.match(doc)
+matches.each do |span|
+  puts "#{span.text}: #{span.label}"
+end
+# => Apple: TECH_COMPANY
+# => iPhone: PRODUCT
+# => Google: TECH_COMPANY
+# => Pixel: PRODUCT
+```
 ## OpenAI API Integration
-> ⚠️ This feature is currently experimental. Details are subject to change. Please refer to OpenAI's [API reference](https://platform.openai.com/docs/api-reference) and [Ruby OpenAI](https://github.com/alexrudall/ruby-openai) for available parameters (`max_tokens`, `temperature`, etc).
+> ⚠️ This feature requires GPT-5 series models. Please refer to OpenAI's [API reference](https://platform.openai.com/docs/api-reference) for details. Note: GPT-5 models do not support the `temperature` parameter.
-Easily leverage GPT models within ruby-spacy by using an OpenAI API key. When constructing prompts for the `Doc::openai_query` method, you can incorporate the following token properties of the document. These properties are retrieved through function calls (made internally by GPT when necessary) and seamlessly integrated into your prompt. Note that function calls need `gpt-4o-mini` or greater. The available properties include:
+Easily leverage GPT models within ruby-spacy by using an OpenAI API key. When constructing prompts for the `Doc::openai_query` method, you can incorporate the following token properties of the document. These properties are retrieved through tool calls (made internally by GPT when necessary) and seamlessly integrated into your prompt. The available properties include:
 - `surface`
 - `lemma`
@@ -550,9 +612,8 @@ nlp = Spacy::Language.new("en_core_web_sm")
 doc = nlp.read("The Beatles released 12 studio albums")
 # default parameter values
-# max_tokens: 1000
-# temperature: 0.7
-# model: "gpt-4o-mini"
+# max_completion_tokens: 1000
+# model: "gpt-5-mini"
 res1 = doc.openai_query(
   access_token: api_key,
   prompt: "Translate the text to Japanese."
@@ -576,9 +637,8 @@ nlp = Spacy::Language.new("en_core_web_sm")
 doc = nlp.read("The Beatles were an English rock band formed in Liverpool in 1960.")
 # default parameter values
-# max_tokens: 1000
-# temperature: 0.7
-# model: "gpt-4o-mini"
+# max_completion_tokens: 1000
+# model: "gpt-5-mini"
 res = doc.openai_query(
   access_token: api_key,
   prompt: "Extract the topic of the document and list 10 entities (names, concepts, locations, etc.) that are relevant to the topic."
@@ -614,9 +674,8 @@ nlp = Spacy::Language.new("en_core_web_sm")
 doc = nlp.read("The Beatles released 12 studio albums")
 # default parameter values
-# max_tokens: 1000
-# temperature: 0.7
-# model: "gpt-4o-mini"
+# max_completion_tokens: 1000
+# model: "gpt-5-mini"
 res = doc.openai_query(
   access_token: api_key,
   prompt: "List token data of each of the words used in the sentence. Add 'meaning' property and value (brief semantic definition) to each token data. Output as a JSON object."
@@ -692,7 +751,7 @@ Output:
 }
 ```
-### GPT Prompting (Generate a Syntaxt Tree using Token Properties)
+### GPT Prompting (Generate a Syntax Tree using Token Properties)
 Ruby code:
@@ -704,11 +763,10 @@ nlp = Spacy::Language.new("en_core_web_sm")
 doc = nlp.read("The Beatles released 12 studio albums")
 # default parameter values
-# max_tokens: 1000
-# temperature: 0.7
+# max_completion_tokens: 1000
+# model: "gpt-5-mini"
 res = doc.openai_query(
   access_token: api_key,
-  model: "gpt-4",
   prompt: "Generate a tree diagram from the text using given token data. Use the following bracketing style: [S [NP [Det the] [N cat]] [VP [V sat] [PP [P on] [NP the mat]]]"
 )
 puts res
@@ -747,9 +805,8 @@ nlp = Spacy::Language.new("en_core_web_sm")
 doc = nlp.read("Vladimir Nabokov was a")
 # default parameter values
-# max_tokens: 1000
-# temperature: 0.7
-# model: "gpt-4o-mini"
+# max_completion_tokens: 1000
+# model: "gpt-5-mini"
 res = doc.openai_completion(access_token: api_key)
 puts res
 ```
@@ -769,7 +826,7 @@ api_key = ENV["OPENAI_API_KEY"]
 nlp = Spacy::Language.new("en_core_web_sm")
 doc = nlp.read("Vladimir Nabokov was a Russian-American novelist, poet, translator and entomologist.")
-# default model: text-embedding-ada-002
+# default model: text-embedding-3-small
 res = doc.openai_embeddings(access_token: api_key)
 puts res
@@ -796,6 +853,47 @@ You can set a timeout for the `Spacy::Language.new` method:
 nlp = Spacy::Language.new("en_core_web_sm", timeout: 120) # Set timeout to 120 seconds
 ```
+### Document Serialization
+You can serialize processed documents to binary format for caching or storage. This is useful when you want to avoid re-processing the same text multiple times.
+**Saving a document:**
+```ruby
+require "ruby-spacy"
+nlp = Spacy::Language.new("en_core_web_sm")
+doc = nlp.read("Apple Inc. was founded by Steve Jobs in California.")
+# Serialize to binary
+bytes = doc.to_bytes
+# Save to file
+File.binwrite("doc_cache.bin", bytes)
+```
+**Restoring a document:**
+```ruby
+nlp = Spacy::Language.new("en_core_web_sm")
+# Load from file
+bytes = File.binread("doc_cache.bin")
+# Restore the document (all annotations are preserved)
+restored_doc = Spacy::Doc.from_bytes(nlp, bytes)
+puts restored_doc.text
+# => "Apple Inc. was founded by Steve Jobs in California."
+restored_doc.ents.each do |ent|
+  puts "#{ent.text} (#{ent.label})"
+end
+# => Apple Inc. (ORG)
+# => Steve Jobs (PERSON)
+# => California (GPE)
+```
 ## Author
 Yoichiro Hasebe [<yohasebe@gmail.com>]

data/lib/ruby-spacy/openai_client.rb ADDED

@@ -0,0 +1,166 @@
+# frozen_string_literal: true
+require "net/http"
+require "openssl"
+require "uri"
+require "json"
+module Spacy
+  # A lightweight OpenAI API client with tools support for GPT-5 series models.
+  # This client implements the chat completions and embeddings endpoints
+  # without external dependencies.
+  class OpenAIClient
+    API_ENDPOINT = "https://api.openai.com/v1"
+    DEFAULT_TIMEOUT = 120
+    MAX_RETRIES = 3
+    BASE_RETRY_DELAY = 1
+    class APIError < StandardError
+      attr_reader :status_code, :response_body
+      def initialize(message, status_code: nil, response_body: nil)
+        @status_code = status_code
+        @response_body = response_body
+        super(message)
+      end
+    end
+    def initialize(access_token:, timeout: DEFAULT_TIMEOUT)
+      @access_token = access_token
+      @timeout = timeout
+    end
+    # Sends a chat completion request with optional tools support.
+    # Note: GPT-5 series and o-series models do not support the temperature parameter.
+    #
+    # @param model [String] The model to use (e.g., "gpt-5-mini")
+    # @param messages [Array<Hash>] The conversation messages
+    # @param max_completion_tokens [Integer] Maximum tokens in the response
+    # @param temperature [Float, nil] Sampling temperature (ignored for models that don't support it)
+    # @param tools [Array<Hash>, nil] Tool definitions for function calling
+    # @param tool_choice [String, Hash, nil] Tool selection strategy
+    # @param response_format [Hash, nil] Response format specification (e.g., { type: "json_object" })
+    # @return [Hash] The API response
+    def chat(model:, messages:, max_completion_tokens: 1000, temperature: nil, tools: nil, tool_choice: nil, response_format: nil)
+      body = {
+        model: model,
+        messages: messages,
+        max_completion_tokens: max_completion_tokens
+      }
+      # GPT-5 series and o-series models do not support temperature parameter
+      unless temperature_unsupported?(model)
+        body[:temperature] = temperature || 0.7
+      end
+      if tools && !tools.empty?
+        body[:tools] = tools
+        body[:tool_choice] = tool_choice || "auto"
+      end
+      body[:response_format] = response_format if response_format
+      post("/chat/completions", body)
+    end
+    # Checks if the model does not support the temperature parameter.
+    # This includes GPT-5 series and o-series (o1, o3, o4-mini, etc.) models.
+    # @param model [String] The model name
+    # @return [Boolean]
+    def temperature_unsupported?(model)
+      name = model.to_s
+      name.start_with?("gpt-5") || name.match?(/\Ao\d/)
+    end
+    # Sends an embeddings request.
+    #
+    # @param model [String] The embeddings model (e.g., "text-embedding-3-small")
+    # @param input [String] The text to embed
+    # @param dimensions [Integer, nil] The number of dimensions for the output embeddings
+    # @return [Hash] The API response
+    def embeddings(model:, input:, dimensions: nil)
+      body = {
+        model: model,
+        input: input
+      }
+      body[:dimensions] = dimensions if dimensions
+      post("/embeddings", body)
+    end
+    private
+    # Creates a certificate store with system CA certificates but without CRL checking.
+    # This avoids "unable to get certificate CRL" errors on some systems.
+    def default_cert_store
+      store = OpenSSL::X509::Store.new
+      store.set_default_paths
+      store
+    end
+    def post(path, body)
+      uri = URI.parse("#{API_ENDPOINT}#{path}")
+      retries = 0
+      loop do
+        begin
+          http = Net::HTTP.new(uri.host, uri.port)
+          http.use_ssl = true
+          http.verify_mode = OpenSSL::SSL::VERIFY_PEER
+          http.cert_store = default_cert_store
+          http.open_timeout = @timeout
+          http.read_timeout = @timeout
+          request = Net::HTTP::Post.new(uri.path)
+          request["Content-Type"] = "application/json"
+          request["Authorization"] = "Bearer #{@access_token}"
+          request.body = body.to_json
+          response = http.request(request)
+          # Handle 429 rate limiting before general response handling
+          if response.code.to_i == 429
+            retries += 1
+            if retries <= MAX_RETRIES
+              retry_after = response["Retry-After"]&.to_f
+              delay = retry_after || (BASE_RETRY_DELAY * (2**(retries - 1)) + rand * 0.5)
+              sleep delay
+              next
+            end
+            raise APIError.new("Rate limited after #{MAX_RETRIES} retries",
+                               status_code: 429, response_body: response.body)
+          end
+          return handle_response(response)
+        rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNREFUSED, Errno::ECONNRESET, SocketError => e
+          retries += 1
+          if retries <= MAX_RETRIES
+            delay = BASE_RETRY_DELAY * (2**(retries - 1)) + rand * 0.5
+            sleep delay
+            next
+          end
+          raise APIError.new("Network error after #{MAX_RETRIES} retries: #{e.message}")
+        end
+      end
+    end
+    def handle_response(response)
+      body = JSON.parse(response.body)
+      case response.code.to_i
+      when 200
+        body
+      when 400..499
+        error_message = body.dig("error", "message") || "Client error"
+        raise APIError.new(error_message, status_code: response.code.to_i, response_body: body)
+      when 500..599
+        error_message = body.dig("error", "message") || "Server error"
+        raise APIError.new(error_message, status_code: response.code.to_i, response_body: body)
+      else
+        raise APIError.new("Unexpected response: #{response.code}", status_code: response.code.to_i, response_body: body)
+      end
+    rescue JSON::ParserError
+      raise APIError.new("Invalid JSON response", status_code: response.code.to_i, response_body: response.body)
+    end
+  end
+end
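
The retry schedule used by `post` in the file above can be looked at in isolation. The following is a minimal sketch, not code from the gem: `backoff_delay` is a hypothetical helper name, and the jitter is made injectable (an assumption for illustration) so that the schedule is deterministic. It reuses the `BASE_RETRY_DELAY` and `MAX_RETRIES` values from the diff and the same formula: a `Retry-After` value wins when present, otherwise the delay doubles on each attempt.

```ruby
BASE_RETRY_DELAY = 1 # seconds, same value as in OpenAIClient
MAX_RETRIES = 3      # same value as in OpenAIClient

# Hypothetical helper: returns how long to wait before retry number `retries`.
# A server-provided Retry-After value takes precedence over the computed delay.
def backoff_delay(retries, retry_after: nil, jitter: rand * 0.5)
  return retry_after.to_f if retry_after

  BASE_RETRY_DELAY * (2**(retries - 1)) + jitter
end

# With jitter disabled, the delays double on each attempt.
schedule = (1..MAX_RETRIES).map { |n| backoff_delay(n, jitter: 0) }
puts schedule.inspect
# => [1, 2, 4]
```

With the default jitter, each delay grows by a random amount below half a second, which spreads out retries from clients that were rate limited at the same moment.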

data/lib/ruby-spacy/openai_helper.rb ADDED

@@ -0,0 +1,91 @@
+# frozen_string_literal: true
+module Spacy
+  # A helper class for OpenAI API interactions, designed to work with spaCy's
+  # linguistic analysis via the block-based {Language#with_openai} API.
+  #
+  # @example Basic usage with linguistic_summary
+  #   nlp = Spacy::Language.new("en_core_web_sm")
+  #   nlp.with_openai(model: "gpt-5-mini") do |ai|
+  #     doc = nlp.read("Apple Inc. was founded by Steve Jobs.")
+  #     ai.chat(system: "Analyze the linguistic data.", user: doc.linguistic_summary)
+  #   end
+  class OpenAIHelper
+    # @return [String] the default model for chat requests
+    attr_reader :model
+    # Creates a new OpenAIHelper instance.
+    # @param access_token [String, nil] OpenAI API key (defaults to OPENAI_API_KEY env var)
+    # @param model [String] the default model for chat requests
+    # @param max_completion_tokens [Integer] default maximum tokens in responses
+    # @param temperature [Float] default sampling temperature
+    def initialize(access_token: nil, model: "gpt-5-mini",
+                   max_completion_tokens: 1000, temperature: 0.7)
+      @access_token = access_token || ENV["OPENAI_API_KEY"]
+      raise "Error: OPENAI_API_KEY is not set" unless @access_token
+      @model = model
+      @default_max_completion_tokens = max_completion_tokens
+      @default_temperature = temperature
+      @client = OpenAIClient.new(access_token: @access_token)
+    end
+    # Sends a chat completion request to OpenAI.
+    #
+    # Provides convenient `system:` and `user:` keyword arguments as shortcuts
+    # for building simple message arrays. For more complex conversations, pass
+    # a full `messages:` array directly.
+    #
+    # @param system [String, nil] system message content (shortcut)
+    # @param user [String, nil] user message content (shortcut)
+    # @param messages [Array<Hash>, nil] full message array (overrides system:/user:)
+    # @param model [String, nil] model override (defaults to instance model)
+    # @param max_completion_tokens [Integer, nil] token limit override
+    # @param temperature [Float, nil] temperature override
+    # @param response_format [Hash, nil] response format (e.g., { type: "json_object" })
+    # @param raw [Boolean] if true, returns the full API response Hash instead of text
+    # @return [String, Hash, nil] the response text, full response Hash (if raw:), or nil on error
+    def chat(system: nil, user: nil, messages: nil,
+             model: nil, max_completion_tokens: nil,
+             temperature: nil, response_format: nil, raw: false)
+      msgs = messages || build_messages(system: system, user: user)
+      raise ArgumentError, "No messages provided. Use system:/user: or messages:" if msgs.empty?
+      response = @client.chat(
+        model: model || @model,
+        messages: msgs,
+        max_completion_tokens: max_completion_tokens || @default_max_completion_tokens,
+        temperature: temperature || @default_temperature,
+        response_format: response_format
+      )
+      raw ? response : response.dig("choices", 0, "message", "content")
+    rescue OpenAIClient::APIError => e
+      puts "Error: OpenAI API call failed - #{e.message}"
+      nil
+    end
+    # Generates text embeddings using OpenAI's embeddings API.
+    #
+    # @param text [String] the text to embed
+    # @param model [String] the embeddings model
+    # @param dimensions [Integer, nil] number of dimensions (nil uses model default)
+    # @return [Array<Float>, nil] the embedding vector, or nil on error
+    def embeddings(text, model: "text-embedding-3-small", dimensions: nil)
+      response = @client.embeddings(model: model, input: text, dimensions: dimensions)
+      response.dig("data", 0, "embedding")
+    rescue OpenAIClient::APIError => e
+      puts "Error: OpenAI API call failed - #{e.message}"
+      nil
+    end
+    private
+    def build_messages(system: nil, user: nil)
+      msgs = []
+      msgs << { role: "system", content: system } if system
+      msgs << { role: "user", content: user } if user
+      msgs
+    end
+  end
+end
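
To see how a helper call turns into a request body, the message resolution from `OpenAIHelper#chat` and the temperature rule from `OpenAIClient#chat` can be combined in one stand-alone sketch. This is an illustration under assumptions, not the gem's code: `request_body` is a hypothetical function that sends nothing over the network, while `build_messages` and `temperature_unsupported?` copy the logic shown in the diff.

```ruby
# Stand-alone copy of the private OpenAIHelper#build_messages logic.
def build_messages(system: nil, user: nil)
  msgs = []
  msgs << { role: "system", content: system } if system
  msgs << { role: "user", content: user } if user
  msgs
end

# Same predicate as OpenAIClient#temperature_unsupported?:
# true for "gpt-5..." names and for o-series names such as "o3".
def temperature_unsupported?(model)
  name = model.to_s
  name.start_with?("gpt-5") || name.match?(/\Ao\d/)
end

# Hypothetical function: resolves messages the way the helper does, then
# builds the body the way the client does. An explicit messages: array
# takes precedence over the system:/user: shortcuts.
def request_body(model:, system: nil, user: nil, messages: nil,
                 max_completion_tokens: 1000, temperature: 0.7)
  msgs = messages || build_messages(system: system, user: user)
  raise ArgumentError, "No messages provided" if msgs.empty?

  body = { model: model, messages: msgs, max_completion_tokens: max_completion_tokens }
  # The temperature is only sent to models that accept it.
  body[:temperature] = temperature unless temperature_unsupported?(model)
  body
end

gpt5 = request_body(model: "gpt-5-mini", system: "Be brief.", user: "Hello")
gpt4 = request_body(model: "gpt-4o-mini", user: "Hello")

puts gpt5.key?(:temperature) # => false
puts gpt4[:temperature]      # => 0.7
puts gpt5[:messages].length  # => 2
```

The helper's default temperature of 0.7 is therefore harmless for GPT-5 and o-series models: the value is dropped before the request is built rather than rejected by the API.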

data/lib/ruby-spacy/version.rb CHANGED

@@ -2,5 +2,5 @@
 module Spacy
   # The version number of the module
-  VERSION = "0.2.3"
+  VERSION = "0.4.0"
 end