webscraping_ai 3.2.1 → 4.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +25 -0
- data/LICENSE +21 -0
- data/README.md +110 -85
- data/lib/webscraping_ai/client.rb +130 -0
- data/lib/webscraping_ai/configuration.rb +10 -300
- data/lib/webscraping_ai/errors.rb +44 -0
- data/lib/webscraping_ai/query_encoder.rb +74 -0
- data/lib/webscraping_ai/version.rb +1 -13
- data/lib/webscraping_ai.rb +15 -40
- data/webscraping_ai.gemspec +33 -36
- metadata +27 -74
- data/Gemfile +0 -9
- data/Rakefile +0 -10
- data/docs/AIApi.md +0 -209
- data/docs/Account.md +0 -24
- data/docs/AccountApi.md +0 -76
- data/docs/Error.md +0 -24
- data/docs/HTMLApi.md +0 -109
- data/docs/SelectedHTMLApi.md +0 -209
- data/docs/TextApi.md +0 -109
- data/git_push.sh +0 -57
- data/lib/webscraping_ai/api/account_api.rb +0 -79
- data/lib/webscraping_ai/api/ai_api.rb +0 -295
- data/lib/webscraping_ai/api/html_api.rb +0 -160
- data/lib/webscraping_ai/api/selected_html_api.rb +0 -291
- data/lib/webscraping_ai/api/text_api.rb +0 -160
- data/lib/webscraping_ai/api_client.rb +0 -397
- data/lib/webscraping_ai/api_error.rb +0 -58
- data/lib/webscraping_ai/api_model_base.rb +0 -88
- data/lib/webscraping_ai/models/account.rb +0 -178
- data/lib/webscraping_ai/models/error.rb +0 -178
- data/spec/api/account_api_spec.rb +0 -46
- data/spec/api/ai_api_spec.rb +0 -86
- data/spec/api/html_api_spec.rb +0 -61
- data/spec/api/selected_html_api_spec.rb +0 -86
- data/spec/api/text_api_spec.rb +0 -61
- data/spec/models/account_spec.rb +0 -54
- data/spec/models/error_spec.rb +0 -54
- data/spec/spec_helper.rb +0 -111
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 9118cbe0e21d02653f6d56c66be8c05a21735241529782164f1e5352236d8567
|
|
4
|
+
data.tar.gz: 3e6d144a7dac8202e8e1405a612c1cacfda06301d17cb6c815ba35e6333237f7
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: af8e1bab5b5887224e6a0bae94ce178e2cc1a7405310fa60f4d11de621a6fcc1dbe28034b11247a29ff38cc33a63b7a3fc12908e88d54efe4500fe3cf617f801
|
|
7
|
+
data.tar.gz: 1e90885ed4a79632c94102863325eabfe5968cbe1901dec0f6a568526a416a3cf22550f8b8534927008c8aeef6e55c33372cdba6b50c62f74832cd74213879bd
|
data/CHANGELOG.md
ADDED
|
@@ -0,0 +1,25 @@
|
|
|
1
|
+
# Changelog
|
|
2
|
+
|
|
3
|
+
All notable changes to this project will be documented in this file. This project follows [Semantic Versioning](https://semver.org/).
|
|
4
|
+
|
|
5
|
+
## [4.0.0] - Unreleased
|
|
6
|
+
|
|
7
|
+
### Changed
|
|
8
|
+
|
|
9
|
+
- **Complete rewrite**: the gem is now a hand-written, idiomatic Ruby client rather than OpenAPI-generated code.
|
|
10
|
+
- New unified entry point: `WebScrapingAI::Client.new(api_key: ...)` with one method per endpoint (`#html`, `#text`, `#selected`, `#selected_multiple`, `#question`, `#fields`, `#account`).
|
|
11
|
+
- Switched HTTP layer from `typhoeus` to `faraday ~> 2.0`.
|
|
12
|
+
- Minimum Ruby version is now `3.1`.
|
|
13
|
+
|
|
14
|
+
### Removed
|
|
15
|
+
|
|
16
|
+
- `WebScrapingAI::HTMLApi`, `WebScrapingAI::TextApi`, `WebScrapingAI::SelectedHTMLApi`, `WebScrapingAI::AIApi`, `WebScrapingAI::AccountApi` classes and all generated model classes. This is a hard break — see the README for the new API surface.
|
|
17
|
+
- `typhoeus` runtime dependency.
|
|
18
|
+
|
|
19
|
+
### Added
|
|
20
|
+
|
|
21
|
+
- Typed error hierarchy: `BadRequestError`, `PaymentRequiredError`, `AuthenticationError`, `RateLimitError`, `ServerError`, `GatewayTimeoutError` (all `< WebScrapingAI::ApiError`), plus `TimeoutError` and `ConnectionError` for transport failures.
|
|
22
|
+
- Module-level configuration: `WebScrapingAI.configure { |c| c.api_key = "..." }`.
|
|
23
|
+
- `WEBSCRAPING_AI_API_KEY` environment variable picked up by default.
|
|
24
|
+
- RSpec test suite with WebMock-based stubs.
|
|
25
|
+
- GitHub Actions workflows for CI and RubyGems trusted publishing on release.
|
data/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) WebScraping.AI
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
data/README.md
CHANGED
|
@@ -1,128 +1,153 @@
|
|
|
1
|
-
#
|
|
1
|
+
# WebScraping.AI Ruby Client
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
Official Ruby client for the [WebScraping.AI](https://webscraping.ai) API. Provides LLM-powered web scraping with Chromium JavaScript rendering, rotating proxies, and built-in HTML parsing.
|
|
4
4
|
|
|
5
|
-
|
|
6
|
-
|
|
7
|
-
This SDK is automatically generated by the [OpenAPI Generator](https://openapi-generator.tech) project:
|
|
8
|
-
|
|
9
|
-
- API version: 3.2.1
|
|
10
|
-
- Package version: 3.2.1
|
|
11
|
-
- Generator version: 7.22.0
|
|
12
|
-
- Build package: org.openapitools.codegen.languages.RubyClientCodegen
|
|
13
|
-
For more information, please visit [https://webscraping.ai](https://webscraping.ai)
|
|
5
|
+
[](https://rubygems.org/gems/webscraping_ai)
|
|
14
6
|
|
|
15
7
|
## Installation
|
|
16
8
|
|
|
17
|
-
|
|
18
|
-
|
|
19
|
-
|
|
20
|
-
|
|
21
|
-
```shell
|
|
22
|
-
gem build webscraping_ai.gemspec
|
|
9
|
+
```ruby
|
|
10
|
+
# Gemfile
|
|
11
|
+
gem "webscraping_ai", "~> 4.0"
|
|
23
12
|
```
|
|
24
13
|
|
|
25
|
-
|
|
14
|
+
Or:
|
|
26
15
|
|
|
27
|
-
```
|
|
28
|
-
gem install
|
|
16
|
+
```bash
|
|
17
|
+
gem install webscraping_ai
|
|
29
18
|
```
|
|
30
19
|
|
|
31
|
-
|
|
20
|
+
Requires Ruby 3.1+.
|
|
32
21
|
|
|
33
|
-
|
|
22
|
+
## Quick start
|
|
34
23
|
|
|
35
|
-
|
|
24
|
+
```ruby
|
|
25
|
+
require "webscraping_ai"
|
|
26
|
+
|
|
27
|
+
client = WebScrapingAI::Client.new(api_key: ENV.fetch("WEBSCRAPING_AI_API_KEY"))
|
|
36
28
|
|
|
37
|
-
|
|
29
|
+
# Page HTML
|
|
30
|
+
html = client.html("https://example.com", js: true)
|
|
38
31
|
|
|
39
|
-
|
|
32
|
+
# Visible text
|
|
33
|
+
text = client.text("https://example.com")
|
|
40
34
|
|
|
41
|
-
|
|
35
|
+
# CSS-selected fragment
|
|
36
|
+
title = client.selected("https://example.com", selector: "h1")
|
|
42
37
|
|
|
43
|
-
|
|
38
|
+
# Multiple selectors at once
|
|
39
|
+
fragments = client.selected_multiple("https://example.com", selectors: ["h1", ".price"])
|
|
44
40
|
|
|
45
|
-
|
|
41
|
+
# Ask the LLM a question about the page
|
|
42
|
+
answer = client.question("https://example.com", question: "What is the main product?")
|
|
46
43
|
|
|
47
|
-
|
|
44
|
+
# Extract structured fields with the LLM
|
|
45
|
+
data = client.fields(
|
|
46
|
+
"https://example.com",
|
|
47
|
+
fields: {
|
|
48
|
+
title: "Main product title",
|
|
49
|
+
price: "Current product price",
|
|
50
|
+
description: "Full product description"
|
|
51
|
+
}
|
|
52
|
+
)
|
|
48
53
|
|
|
49
|
-
|
|
50
|
-
|
|
54
|
+
# Check your account quota
|
|
55
|
+
info = client.account
|
|
56
|
+
# => { "remaining_api_calls" => 200_000, "resets_at" => 1_617_073_667, "remaining_concurrency" => 100 }
|
|
51
57
|
```
|
|
52
58
|
|
|
53
|
-
##
|
|
59
|
+
## Configuration
|
|
54
60
|
|
|
55
|
-
|
|
61
|
+
Configure globally once, then create clients without arguments:
|
|
56
62
|
|
|
57
63
|
```ruby
|
|
58
|
-
# Load the gem
|
|
59
|
-
require 'webscraping_ai'
|
|
60
|
-
|
|
61
|
-
# Setup authorization
|
|
62
64
|
WebScrapingAI.configure do |config|
|
|
63
|
-
|
|
64
|
-
config.
|
|
65
|
-
|
|
66
|
-
# config.api_key_prefix['api_key'] = 'Bearer'
|
|
65
|
+
config.api_key = ENV.fetch("WEBSCRAPING_AI_API_KEY")
|
|
66
|
+
config.timeout = 60 # seconds, total request timeout
|
|
67
|
+
config.open_timeout = 10 # seconds, connection timeout
|
|
67
68
|
end
|
|
68
69
|
|
|
69
|
-
|
|
70
|
-
|
|
71
|
-
|
|
72
|
-
|
|
73
|
-
headers: { key: { key: 'inner_example'}}, # Hash<String, String> | HTTP headers to pass to the target page. Can be specified either via a nested query parameter (...&headers[One]=value1&headers=[Another]=value2) or as a JSON encoded object (...&headers={\"One\": \"value1\", \"Another\": \"value2\"}).
|
|
74
|
-
timeout: 10000, # Integer | Maximum web page retrieval time in ms. Increase it in case of timeout errors (10000 by default, maximum is 30000).
|
|
75
|
-
js: true, # Boolean | Execute on-page JavaScript using a headless browser (true by default).
|
|
76
|
-
js_timeout: 2000, # Integer | Maximum JavaScript rendering time in ms. Increase it in case if you see a loading indicator instead of data on the target page.
|
|
77
|
-
wait_for: 'wait_for_example', # String | CSS selector to wait for before returning the page content. Useful for pages with dynamic content loading. Overrides js_timeout.
|
|
78
|
-
proxy: 'datacenter', # String | Type of proxy. Use `residential` if your site restricts traffic from datacenters, or `stealth` for the most heavily protected sites with advanced anti-bot detection (`datacenter` by default). Residential and stealth proxy requests are more expensive than datacenter, see the pricing page for details.
|
|
79
|
-
country: 'us', # String | Country of the proxy to use (US by default).
|
|
80
|
-
custom_proxy: 'custom_proxy_example', # String | Your own proxy URL to use instead of our built-in proxy pool in \"http://user:password@host:port\" format (<a target=\"_blank\" href=\"https://webscraping.ai/proxies/smartproxy\">Smartproxy</a> for example).
|
|
81
|
-
device: 'desktop', # String | Type of device emulation.
|
|
82
|
-
error_on_404: false, # Boolean | Return error on 404 HTTP status on the target page (false by default).
|
|
83
|
-
error_on_redirect: false, # Boolean | Return error on redirect on the target page (false by default).
|
|
84
|
-
js_script: 'document.querySelector('button').click();' # String | Custom JavaScript code to execute on the target page.
|
|
85
|
-
}
|
|
70
|
+
client = WebScrapingAI::Client.new
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
The gem also reads `WEBSCRAPING_AI_API_KEY` from the environment automatically.
|
|
86
74
|
|
|
87
|
-
|
|
88
|
-
#Extract structured data fields from a web page
|
|
89
|
-
result = api_instance.get_fields(url, fields, opts)
|
|
90
|
-
p result
|
|
91
|
-
rescue WebScrapingAI::ApiError => e
|
|
92
|
-
puts "Exception when calling AIApi->get_fields: #{e}"
|
|
93
|
-
end
|
|
75
|
+
Per-instance overrides:
|
|
94
76
|
|
|
77
|
+
```ruby
|
|
78
|
+
client = WebScrapingAI::Client.new(
|
|
79
|
+
api_key: "...",
|
|
80
|
+
timeout: 90,
|
|
81
|
+
base_url: "https://api.webscraping.ai"
|
|
82
|
+
)
|
|
95
83
|
```
|
|
96
84
|
|
|
97
|
-
##
|
|
85
|
+
## Endpoints and options
|
|
98
86
|
|
|
99
|
-
All
|
|
87
|
+
All page-fetching endpoints accept these common options (passed as keyword arguments):
|
|
100
88
|
|
|
101
|
-
|
|
102
|
-
|
|
103
|
-
|
|
104
|
-
|
|
105
|
-
|
|
106
|
-
|
|
107
|
-
|
|
108
|
-
|
|
109
|
-
|
|
89
|
+
| Option | Type | Default | Description |
|
|
90
|
+
| --- | --- | --- | --- |
|
|
91
|
+
| `headers` | `Hash` | — | HTTP headers to send to the target page (e.g. `{ "Cookie" => "session=..." }`) |
|
|
92
|
+
| `timeout` | `Integer` | `10000` | Page retrieval timeout in ms (1–30000) |
|
|
93
|
+
| `js` | `Boolean` | `true` | Execute on-page JavaScript via headless Chromium |
|
|
94
|
+
| `js_timeout` | `Integer` | `2000` | JS rendering timeout in ms (1–20000) |
|
|
95
|
+
| `wait_for` | `String` | — | CSS selector to wait for before returning (overrides `js_timeout`) |
|
|
96
|
+
| `proxy` | `String` | `"datacenter"` | One of `datacenter`, `residential`, `stealth` |
|
|
97
|
+
| `country` | `String` | `"us"` | Proxy country: `us`, `gb`, `de`, `it`, `fr`, `ca`, `es`, `ru`, `jp`, `kr`, `in`, `hk`, `tr` |
|
|
98
|
+
| `custom_proxy` | `String` | — | Your own proxy in `http://user:pass@host:port` form |
|
|
99
|
+
| `device` | `String` | `"desktop"` | One of `desktop`, `mobile`, `tablet` |
|
|
100
|
+
| `error_on_404` | `Boolean` | `false` | Raise an error if the target page returns 404 |
|
|
101
|
+
| `error_on_redirect` | `Boolean` | `false` | Raise an error if the target page redirects |
|
|
102
|
+
| `js_script` | `String` | — | Custom JS to execute on the page |
|
|
110
103
|
|
|
104
|
+
Endpoint-specific options:
|
|
111
105
|
|
|
112
|
-
|
|
106
|
+
- `#html` — `return_script_result` (`Boolean`), `format` (`"json"`/`"text"`)
|
|
107
|
+
- `#text` — `text_format` (`"plain"`/`"xml"`/`"json"`), `return_links` (`Boolean`, only with `text_format: "json"`)
|
|
108
|
+
- `#selected` — `selector` (`String`), `format` (`"json"`/`"text"`)
|
|
109
|
+
- `#selected_multiple` — `selectors` (`Array<String>` or single `String`)
|
|
110
|
+
- `#question` — `question` (`String`, required), `format` (`"json"`/`"text"`)
|
|
111
|
+
- `#fields` — `fields` (`Hash<String, String>`, required) — keys are field names, values are descriptions
|
|
113
112
|
|
|
114
|
-
|
|
115
|
-
- [WebScrapingAI::Error](docs/Error.md)
|
|
113
|
+
Returns: `String` for HTML/text responses, `Hash`/`Array` for JSON responses.
|
|
116
114
|
|
|
115
|
+
## Error handling
|
|
117
116
|
|
|
118
|
-
|
|
117
|
+
All API errors inherit from `WebScrapingAI::ApiError` and expose `#status`, `#message`, `#status_code`, `#status_message`, `#body`, and `#response_body`.
|
|
119
118
|
|
|
119
|
+
```ruby
|
|
120
|
+
begin
|
|
121
|
+
client.html("https://example.com")
|
|
122
|
+
rescue WebScrapingAI::RateLimitError => e
|
|
123
|
+
# 429 — too many concurrent requests
|
|
124
|
+
sleep 1; retry
|
|
125
|
+
rescue WebScrapingAI::PaymentRequiredError => e
|
|
126
|
+
# 402 — out of API credits
|
|
127
|
+
rescue WebScrapingAI::AuthenticationError => e
|
|
128
|
+
# 403 — wrong API key
|
|
129
|
+
rescue WebScrapingAI::BadRequestError => e
|
|
130
|
+
# 400 — invalid parameters
|
|
131
|
+
rescue WebScrapingAI::ServerError => e
|
|
132
|
+
# 500 — target page returned a non-2xx code, or unexpected error.
|
|
133
|
+
# e.status_code / e.status_message expose the target page's response.
|
|
134
|
+
rescue WebScrapingAI::GatewayTimeoutError => e
|
|
135
|
+
# 504 — page took longer than `timeout` ms to load. Try a higher `timeout:`.
|
|
136
|
+
rescue WebScrapingAI::TimeoutError => e
|
|
137
|
+
# Client-side: the HTTP request exceeded `Client#timeout`.
|
|
138
|
+
rescue WebScrapingAI::ConnectionError => e
|
|
139
|
+
# Network failure before a response was received.
|
|
140
|
+
end
|
|
141
|
+
```
|
|
120
142
|
|
|
121
|
-
|
|
122
|
-
### api_key
|
|
143
|
+
## Development
|
|
123
144
|
|
|
145
|
+
```bash
|
|
146
|
+
bin/setup # bundle install
|
|
147
|
+
bundle exec rspec
|
|
148
|
+
bundle exec rubocop
|
|
149
|
+
```
|
|
124
150
|
|
|
125
|
-
|
|
126
|
-
- **API key parameter name**: api_key
|
|
127
|
-
- **Location**: URL query string
|
|
151
|
+
## License
|
|
128
152
|
|
|
153
|
+
MIT — see [LICENSE](LICENSE).
|
|
# frozen_string_literal: true

# data/lib/webscraping_ai/client.rb — HTTP client for the WebScraping.AI API.
require "faraday"
require "json"

module WebScrapingAI
  # Faraday-backed API client. One public method per endpoint; all of them
  # funnel through #get, which authenticates the request, parses the body,
  # and maps HTTP error statuses onto the gem's typed error hierarchy
  # (STATUS_TO_ERROR is declared in errors.rb).
  class Client
    # Values accepted by the API for the corresponding options. Not validated
    # client-side; the server rejects unknown values with a 400.
    PROXY_TYPES = %w[datacenter residential stealth].freeze
    COUNTRIES = %w[us gb de it fr ca es ru jp kr in hk tr].freeze
    DEVICES = %w[desktop mobile tablet].freeze
    TEXT_FORMATS = %w[plain xml json].freeze
    FORMATS = %w[json text].freeze

    # Options shared by every page-fetching endpoint; used to whitelist
    # caller-supplied keyword arguments before they become query parameters.
    PAGE_FETCH_OPTIONS = %i[
      headers timeout js js_timeout wait_for proxy country
      custom_proxy device error_on_404 error_on_redirect js_script
    ].freeze

    attr_reader :configuration

    # Builds a client. Any argument left nil falls back to the corresponding
    # module-level WebScrapingAI.configuration value.
    #
    # @raise [ConfigurationError] when no api_key is available from either source.
    def initialize(api_key: nil, base_url: nil, timeout: nil, open_timeout: nil, adapter: nil, user_agent: nil)
      global = WebScrapingAI.configuration
      @configuration = Configuration.new.tap do |c|
        c.api_key = api_key || global.api_key
        c.base_url = base_url || global.base_url
        c.timeout = timeout || global.timeout
        c.open_timeout = open_timeout || global.open_timeout
        c.adapter = adapter || global.adapter
        c.user_agent = user_agent || global.user_agent
      end

      # `.to_s.empty?` already treats nil as blank, so one check suffices.
      return unless @configuration.api_key.to_s.empty?

      raise ConfigurationError,
            "api_key is required (pass api_key: or set WebScrapingAI.configure { |c| c.api_key = ... })"
    end

    # GET /ai/question — returns the LLM's answer about the page.
    # Returns a String by default, or a Hash when format: "json".
    def question(url, question:, **opts)
      get("/ai/question", url: url, question: question, **opts.slice(*PAGE_FETCH_OPTIONS, :format))
    end

    # GET /ai/fields — extracts the named fields from the page.
    # `fields` is a Hash of { field_name => description }. Returns a Hash.
    def fields(url, fields:, **opts)
      get("/ai/fields", url: url, fields: fields, **opts.slice(*PAGE_FETCH_OPTIONS))
    end

    # GET /html — returns the full page HTML as a String.
    def html(url, **opts)
      get("/html", url: url, **opts.slice(*PAGE_FETCH_OPTIONS, :return_script_result, :format))
    end

    # GET /text — returns the visible text content of the page.
    # Returns a String when text_format is "plain"/"xml" (default), or a Hash
    # when text_format: "json".
    def text(url, **opts)
      get("/text", url: url, **opts.slice(*PAGE_FETCH_OPTIONS, :text_format, :return_links))
    end

    # GET /selected — returns HTML of the element matching `selector` as a String.
    def selected(url, selector: nil, **opts)
      get("/selected", url: url, selector: selector, **opts.slice(*PAGE_FETCH_OPTIONS, :format))
    end

    # GET /selected-multiple — returns an Array of HTML strings, one per selector.
    # Array() lets callers pass a single String as well as an Array.
    def selected_multiple(url, selectors:, **opts)
      get("/selected-multiple", url: url, selectors: Array(selectors), **opts.slice(*PAGE_FETCH_OPTIONS))
    end

    # GET /account — returns Hash with remaining_api_calls, resets_at,
    # remaining_concurrency, email.
    def account
      get("/account")
    end

    private

    # Lazily-built Faraday connection shared by all requests on this client.
    def connection
      @connection ||= Faraday.new(url: configuration.base_url) do |conn|
        conn.options.timeout = configuration.timeout
        conn.options.open_timeout = configuration.open_timeout
        conn.options.params_encoder = QueryEncoder
        conn.headers["User-Agent"] = configuration.user_agent
        conn.headers["Accept"] = "application/json, text/html, text/xml, text/plain"
        conn.adapter(configuration.adapter || Faraday.default_adapter)
      end
    end

    # Performs an authenticated GET and maps transport failures onto the gem's
    # own error classes so callers never have to rescue Faraday errors.
    def get(path, **params)
      response = connection.get(path) do |req|
        # compact: drop options the caller left nil (e.g. #selected without a
        # selector) instead of serializing empty query parameters.
        req.params = params.compact.merge(api_key: configuration.api_key)
      end
      handle_response(response)
    rescue Faraday::TimeoutError => e
      raise TimeoutError, e.message
    rescue Faraday::ConnectionFailed => e
      raise ConnectionError, e.message
    end

    # Returns the parsed body for 2xx responses; otherwise raises the
    # ApiError subclass mapped from the HTTP status.
    def handle_response(response)
      return parse_body(response) if response.status.between?(200, 299)

      error_class = STATUS_TO_ERROR.fetch(response.status, ApiError)
      data = safe_parse_json(response.body) || {}
      raise error_class.new(
        message: data["message"] || "HTTP #{response.status}",
        status: response.status,
        status_code: data["status_code"],
        status_message: data["status_message"],
        body: data["body"],
        response_body: response.body
      )
    end

    # JSON-decodes JSON responses; returns the raw body text otherwise.
    def parse_body(response)
      content_type = response.headers["content-type"].to_s
      if content_type.include?("application/json")
        JSON.parse(response.body)
      else
        response.body
      end
    end

    # Best-effort JSON parse used for error bodies; nil on blank or invalid input.
    def safe_parse_json(body)
      return nil if body.nil? || body.empty?

      JSON.parse(body)
    rescue JSON::ParserError
      nil
    end
  end
end