webscraping_ai 2.0.2 → 3.1.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +31 -20
- data/docs/AIApi.md +109 -0
- data/docs/Account.md +22 -0
- data/docs/AccountApi.md +76 -0
- data/docs/Error.md +14 -7
- data/docs/HTMLApi.md +45 -82
- data/docs/SelectedHTMLApi.md +92 -173
- data/docs/TextApi.md +105 -0
- data/git_push.sh +3 -4
- data/lib/webscraping_ai/api/account_api.rb +79 -0
- data/lib/webscraping_ai/api/ai_api.rb +164 -0
- data/lib/webscraping_ai/api/html_api.rb +54 -107
- data/lib/webscraping_ai/api/selected_html_api.rb +99 -217
- data/lib/webscraping_ai/api/text_api.rb +154 -0
- data/lib/webscraping_ai/api_client.rb +71 -65
- data/lib/webscraping_ai/api_error.rb +4 -3
- data/lib/webscraping_ai/configuration.rb +65 -15
- data/lib/webscraping_ai/models/{page_error.rb → account.rb} +60 -42
- data/lib/webscraping_ai/models/error.rb +66 -28
- data/lib/webscraping_ai/version.rb +4 -4
- data/lib/webscraping_ai.rb +7 -4
- data/spec/api/account_api_spec.rb +46 -0
- data/spec/api/ai_api_spec.rb +61 -0
- data/spec/api/html_api_spec.rb +17 -27
- data/spec/api/selected_html_api_spec.rb +29 -53
- data/spec/api/text_api_spec.rb +59 -0
- data/spec/models/account_spec.rb +48 -0
- data/spec/models/error_spec.rb +27 -14
- data/spec/spec_helper.rb +3 -3
- data/webscraping_ai.gemspec +7 -7
- metadata +22 -34
- data/docs/PageError.md +0 -19
- data/spec/api_client_spec.rb +0 -226
- data/spec/configuration_spec.rb +0 -42
- data/spec/models/page_error_spec.rb +0 -47
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b9d5c09646212d1d2b3127a8ab58b7db3c400b725e696e2da3a6049d193d00b1
+  data.tar.gz: f0ef33b178203e092eb99c8f94a9a8b5acab23e09913ac01f09fdb1a36e08a60
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7432ef958f2d51ced742d7c7b80e9662456e990e10eed9bbf3a95519253c8788746c9aa5c0cd3c8136c52adb41fac1b1de53b3abfa6daac9a6ca15419b2b780b
+  data.tar.gz: 8cf21004408e4dfa207d4f709ac81de2b2a2851f6c8b86ef2cdf5a28b7f88f3a826ba1ca8b87fb9c60a9e365521a99ffac1cbf8ced043abb3822c010a7357112
data/README.md
CHANGED
@@ -2,12 +2,12 @@
 
 WebScrapingAI - the Ruby gem for the WebScraping.AI
 
-
+WebScraping.AI scraping API provides GPT-powered tools with Chromium JavaScript rendering, rotating proxies, and built-in HTML parsing.
 
 This SDK is automatically generated by the [OpenAPI Generator](https://openapi-generator.tech) project:
 
-- API version:
-- Package version:
+- API version: 3.1.3
+- Package version: 3.1.3
 - Build package: org.openapitools.codegen.languages.RubyClientCodegen
 For more information, please visit [https://webscraping.ai](https://webscraping.ai)
 
@@ -24,16 +24,16 @@ gem build webscraping_ai.gemspec
 Then either install the gem locally:
 
 ```shell
-gem install ./webscraping_ai-
+gem install ./webscraping_ai-3.1.3.gem
 ```
 
-(for development, run `gem install --dev ./webscraping_ai-
+(for development, run `gem install --dev ./webscraping_ai-3.1.3.gem` to install the development dependencies)
 
 or publish the gem to a gem hosting service, e.g. [RubyGems](https://rubygems.org/).
 
 Finally add this to the Gemfile:
 
-    gem 'webscraping_ai', '~>
+    gem 'webscraping_ai', '~> 3.1.3'
 
 ### Install from Git
 
@@ -62,24 +62,34 @@ WebScrapingAI.configure do |config|
   # Configure API key authorization: api_key
   config.api_key['api_key'] = 'YOUR API KEY'
   # Uncomment the following line to set a prefix for the API key, e.g. 'Bearer' (defaults to nil)
-  #config.api_key_prefix['api_key'] = 'Bearer'
+  # config.api_key_prefix['api_key'] = 'Bearer'
 end
 
-api_instance = WebScrapingAI::
-url = 'https://example.com' # String | URL of the target page
+api_instance = WebScrapingAI::AIApi.new
+url = 'https://example.com' # String | URL of the target page.
 opts = {
-
-
-
-
+  question: 'What is the summary of this page content?', # String | Question or instructions to ask the LLM model about the target page.
+  context_limit: 4000, # Integer | Maximum number of tokens to use as context for the LLM model (4000 by default).
+  response_tokens: 100, # Integer | Maximum number of tokens to return in the LLM model response. The total context size (context_limit) includes the question, the target page content and the response, so this parameter reserves tokens for the response (see also on_context_limit).
+  on_context_limit: 'truncate', # String | What to do if the context_limit parameter is exceeded (truncate by default). The context is exceeded when the target page content is too long.
+  headers: { key: 3.56}, # Hash<String, String> | HTTP headers to pass to the target page. Can be specified either via a nested query parameter (...&headers[One]=value1&headers=[Another]=value2) or as a JSON encoded object (...&headers={\"One\": \"value1\", \"Another\": \"value2\"}).
+  timeout: 10000, # Integer | Maximum web page retrieval time in ms. Increase it in case of timeout errors (10000 by default, maximum is 30000).
+  js: true, # Boolean | Execute on-page JavaScript using a headless browser (true by default).
+  js_timeout: 2000, # Integer | Maximum JavaScript rendering time in ms. Increase it in case if you see a loading indicator instead of data on the target page.
+  proxy: 'datacenter', # String | Type of proxy, use residential proxies if your site restricts traffic from datacenters (datacenter by default). Note that residential proxy requests are more expensive than datacenter, see the pricing page for details.
+  country: 'us', # String | Country of the proxy to use (US by default). Only available on Startup and Custom plans.
+  device: 'desktop', # String | Type of device emulation.
+  error_on_404: false, # Boolean | Return error on 404 HTTP status on the target page (false by default).
+  error_on_redirect: false, # Boolean | Return error on redirect on the target page (false by default).
+  js_script: 'document.querySelector('button').click();' # String | Custom JavaScript code to execute on the target page.
 }
 
 begin
-  #
-  result = api_instance.
+  #Get an answer to a question about a given web page
+  result = api_instance.get_question(url, opts)
   p result
 rescue WebScrapingAI::ApiError => e
-  puts "Exception when calling
+  puts "Exception when calling AIApi->get_question: #{e}"
 end
 
 ```
@@ -90,23 +100,24 @@ All URIs are relative to *https://api.webscraping.ai*
 
 Class | Method | HTTP request | Description
 ------------ | ------------- | ------------- | -------------
+*WebScrapingAI::AIApi* | [**get_question**](docs/AIApi.md#get_question) | **GET** /ai/question | Get an answer to a question about a given web page
+*WebScrapingAI::AccountApi* | [**account**](docs/AccountApi.md#account) | **GET** /account | Information about your account calls quota
 *WebScrapingAI::HTMLApi* | [**get_html**](docs/HTMLApi.md#get_html) | **GET** /html | Page HTML by URL
-*WebScrapingAI::HTMLApi* | [**post_html**](docs/HTMLApi.md#post_html) | **POST** /html | Page HTML by URL with POST request to the target page
 *WebScrapingAI::SelectedHTMLApi* | [**get_selected**](docs/SelectedHTMLApi.md#get_selected) | **GET** /selected | HTML of a selected page area by URL and CSS selector
 *WebScrapingAI::SelectedHTMLApi* | [**get_selected_multiple**](docs/SelectedHTMLApi.md#get_selected_multiple) | **GET** /selected-multiple | HTML of multiple page areas by URL and CSS selectors
-*WebScrapingAI::
-*WebScrapingAI::SelectedHTMLApi* | [**post_selected_multiple**](docs/SelectedHTMLApi.md#post_selected_multiple) | **POST** /selected-multiple | HTML of multiple page areas by URL and CSS selectors, with POST request to the target page
+*WebScrapingAI::TextApi* | [**get_text**](docs/TextApi.md#get_text) | **GET** /text | Page text by URL
 
 
 ## Documentation for Models
 
+- [WebScrapingAI::Account](docs/Account.md)
 - [WebScrapingAI::Error](docs/Error.md)
-- [WebScrapingAI::PageError](docs/PageError.md)
 
 
 ## Documentation for Authorization
 
 
+Authentication schemes defined for the API:
 ### api_key
 
 
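The Gemfile line in the README hunk moves the pessimistic constraint to `~> 3.1.3`. As a rough illustration of what that constraint admits, the sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The helper name `allowed?` and the candidate versions 3.1.9 and 3.2.0 are made up for the example; only 2.0.2 and 3.1.3 come from this diff.

```ruby
require 'rubygems'

# Hypothetical helper: true when the version string satisfies the constraint.
# The default constraint is the one from the updated Gemfile line.
def allowed?(version, constraint = '~> 3.1.3')
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

%w[2.0.2 3.1.3 3.1.9 3.2.0].each do |version|
  puts "#{version}: #{allowed?(version) ? 'allowed' : 'rejected'}"
end
```

Under this constraint a project pinned to the 2.x line would not pick up the new release automatically, which matters here because the same diff removes `post_html`, `post_selected_multiple` and the `PageError` model.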
data/docs/AIApi.md
ADDED
@@ -0,0 +1,109 @@
+# WebScrapingAI::AIApi
+
+All URIs are relative to *https://api.webscraping.ai*
+
+| Method | HTTP request | Description |
+| ------ | ------------ | ----------- |
+| [**get_question**](AIApi.md#get_question) | **GET** /ai/question | Get an answer to a question about a given web page |
+
+
+## get_question
+
+> String get_question(url, opts)
+
+Get an answer to a question about a given web page
+
+Returns the answer in plain text. Proxies and Chromium JavaScript rendering are used for page retrieval and processing, then the answer is extracted using an LLM model.
+
+### Examples
+
+```ruby
+require 'time'
+require 'webscraping_ai'
+# setup authorization
+WebScrapingAI.configure do |config|
+  # Configure API key authorization: api_key
+  config.api_key['api_key'] = 'YOUR API KEY'
+  # Uncomment the following line to set a prefix for the API key, e.g. 'Bearer' (defaults to nil)
+  # config.api_key_prefix['api_key'] = 'Bearer'
+end
+
+api_instance = WebScrapingAI::AIApi.new
+url = 'https://example.com' # String | URL of the target page.
+opts = {
+  question: 'What is the summary of this page content?', # String | Question or instructions to ask the LLM model about the target page.
+  context_limit: 4000, # Integer | Maximum number of tokens to use as context for the LLM model (4000 by default).
+  response_tokens: 100, # Integer | Maximum number of tokens to return in the LLM model response. The total context size (context_limit) includes the question, the target page content and the response, so this parameter reserves tokens for the response (see also on_context_limit).
+  on_context_limit: 'truncate', # String | What to do if the context_limit parameter is exceeded (truncate by default). The context is exceeded when the target page content is too long.
+  headers: { key: 3.56}, # Hash<String, String> | HTTP headers to pass to the target page. Can be specified either via a nested query parameter (...&headers[One]=value1&headers=[Another]=value2) or as a JSON encoded object (...&headers={\"One\": \"value1\", \"Another\": \"value2\"}).
+  timeout: 10000, # Integer | Maximum web page retrieval time in ms. Increase it in case of timeout errors (10000 by default, maximum is 30000).
+  js: true, # Boolean | Execute on-page JavaScript using a headless browser (true by default).
+  js_timeout: 2000, # Integer | Maximum JavaScript rendering time in ms. Increase it in case if you see a loading indicator instead of data on the target page.
+  proxy: 'datacenter', # String | Type of proxy, use residential proxies if your site restricts traffic from datacenters (datacenter by default). Note that residential proxy requests are more expensive than datacenter, see the pricing page for details.
+  country: 'us', # String | Country of the proxy to use (US by default). Only available on Startup and Custom plans.
+  device: 'desktop', # String | Type of device emulation.
+  error_on_404: false, # Boolean | Return error on 404 HTTP status on the target page (false by default).
+  error_on_redirect: false, # Boolean | Return error on redirect on the target page (false by default).
+  js_script: 'document.querySelector('button').click();' # String | Custom JavaScript code to execute on the target page.
+}
+
+begin
+  # Get an answer to a question about a given web page
+  result = api_instance.get_question(url, opts)
+  p result
+rescue WebScrapingAI::ApiError => e
+  puts "Error when calling AIApi->get_question: #{e}"
+end
+```
+
+#### Using the get_question_with_http_info variant
+
+This returns an Array which contains the response data, status code and headers.
+
+> <Array(String, Integer, Hash)> get_question_with_http_info(url, opts)
+
+```ruby
+begin
+  # Get an answer to a question about a given web page
+  data, status_code, headers = api_instance.get_question_with_http_info(url, opts)
+  p status_code # => 2xx
+  p headers # => { ... }
+  p data # => String
+rescue WebScrapingAI::ApiError => e
+  puts "Error when calling AIApi->get_question_with_http_info: #{e}"
+end
+```
+
+### Parameters
+
+| Name | Type | Description | Notes |
+| ---- | ---- | ----------- | ----- |
+| **url** | **String** | URL of the target page. | |
+| **question** | **String** | Question or instructions to ask the LLM model about the target page. | [optional] |
+| **context_limit** | **Integer** | Maximum number of tokens to use as context for the LLM model (4000 by default). | [optional][default to 4000] |
+| **response_tokens** | **Integer** | Maximum number of tokens to return in the LLM model response. The total context size (context_limit) includes the question, the target page content and the response, so this parameter reserves tokens for the response (see also on_context_limit). | [optional][default to 100] |
+| **on_context_limit** | **String** | What to do if the context_limit parameter is exceeded (truncate by default). The context is exceeded when the target page content is too long. | [optional][default to 'error'] |
+| **headers** | [**Hash<String, String>**](String.md) | HTTP headers to pass to the target page. Can be specified either via a nested query parameter (...&headers[One]=value1&headers=[Another]=value2) or as a JSON encoded object (...&headers={\"One\": \"value1\", \"Another\": \"value2\"}). | [optional] |
+| **timeout** | **Integer** | Maximum web page retrieval time in ms. Increase it in case of timeout errors (10000 by default, maximum is 30000). | [optional][default to 10000] |
+| **js** | **Boolean** | Execute on-page JavaScript using a headless browser (true by default). | [optional][default to true] |
+| **js_timeout** | **Integer** | Maximum JavaScript rendering time in ms. Increase it in case if you see a loading indicator instead of data on the target page. | [optional][default to 2000] |
+| **proxy** | **String** | Type of proxy, use residential proxies if your site restricts traffic from datacenters (datacenter by default). Note that residential proxy requests are more expensive than datacenter, see the pricing page for details. | [optional][default to 'datacenter'] |
+| **country** | **String** | Country of the proxy to use (US by default). Only available on Startup and Custom plans. | [optional][default to 'us'] |
+| **device** | **String** | Type of device emulation. | [optional][default to 'desktop'] |
+| **error_on_404** | **Boolean** | Return error on 404 HTTP status on the target page (false by default). | [optional][default to false] |
+| **error_on_redirect** | **Boolean** | Return error on redirect on the target page (false by default). | [optional][default to false] |
+| **js_script** | **String** | Custom JavaScript code to execute on the target page. | [optional] |
+
+### Return type
+
+**String**
+
+### Authorization
+
+[api_key](../README.md#api_key)
+
+### HTTP request headers
+
+- **Content-Type**: Not defined
+- **Accept**: application/json, text/html
+
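The `response_tokens` description says the total context (`context_limit`) covers the question, the page content and the response, so part of the limit is reserved for the answer. A minimal sketch of that arithmetic follows. The helper names `input_token_budget` and `question_opts` are hypothetical and not part of the gem, and the rule that `response_tokens` must be smaller than `context_limit` is an assumption of the sketch rather than something the docs state. Note also that the docs give two different defaults for `on_context_limit` ("truncate by default" in the description, `'error'` in the Notes column), so the sketch always passes it explicitly.

```ruby
# Hypothetical helper, not part of the webscraping_ai gem.
# Returns how many tokens remain for the question plus the page content
# once response_tokens has been reserved out of context_limit.
def input_token_budget(context_limit: 4000, response_tokens: 100)
  unless context_limit.is_a?(Integer) && response_tokens.is_a?(Integer)
    raise ArgumentError, 'token counts must be integers'
  end
  # Assumption of this sketch: the reservation cannot consume the whole limit.
  if response_tokens.negative? || response_tokens >= context_limit
    raise ArgumentError, 'response_tokens must be >= 0 and smaller than context_limit'
  end

  context_limit - response_tokens
end

# Builds an opts hash for get_question with on_context_limit stated
# explicitly instead of relying on a server-side default.
def question_opts(question, context_limit: 4000, response_tokens: 100, on_context_limit: 'truncate')
  # Validate the numbers before building the hash.
  input_token_budget(context_limit: context_limit, response_tokens: response_tokens)
  {
    question: question,
    context_limit: context_limit,
    response_tokens: response_tokens,
    on_context_limit: on_context_limit
  }
end

puts input_token_budget # documented defaults: 4000 - 100
p question_opts('What is the summary of this page content?')
```

The resulting hash can be passed as the `opts` argument of `get_question(url, opts)` shown in the generated example.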
data/docs/Account.md
ADDED
@@ -0,0 +1,22 @@
+# WebScrapingAI::Account
+
+## Properties
+
+| Name | Type | Description | Notes |
+| ---- | ---- | ----------- | ----- |
+| **remaining_api_calls** | **Integer** | Remaining API credits quota | [optional] |
+| **resets_at** | **Integer** | Next billing cycle start time (UNIX timestamp) | [optional] |
+| **remaining_concurrency** | **Integer** | Remaining concurrent requests | [optional] |
+
+## Example
+
+```ruby
+require 'webscraping_ai'
+
+instance = WebScrapingAI::Account.new(
+  remaining_api_calls: null,
+  resets_at: null,
+  remaining_concurrency: null
+)
+```
+
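The generated example uses `null` as a placeholder. Ruby has no `null` literal and spells the empty value `nil`, so real code needs actual values or `nil`. Since `resets_at` is documented as a UNIX timestamp, the sketch below shows one way to turn it into a `Time`. It uses a plain Hash in place of `WebScrapingAI::Account` so that it runs without the gem; the helper name `resets_at_time` and the sample values are invented.

```ruby
# Hypothetical helper: converts the resets_at UNIX timestamp (seconds)
# into a UTC Time. Returns nil when the field is absent, because every
# property in the table is marked optional.
def resets_at_time(account)
  timestamp = account[:resets_at]
  timestamp.nil? ? nil : Time.at(timestamp).utc
end

# Plain hash standing in for the Account model; the values are invented.
sample = { remaining_api_calls: 1200, resets_at: 1_700_000_000, remaining_concurrency: 5 }

puts resets_at_time(sample).strftime('%Y-%m-%d %H:%M:%S UTC')
p resets_at_time({}) # no resets_at key, so nil
```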
data/docs/AccountApi.md
ADDED
@@ -0,0 +1,76 @@
+# WebScrapingAI::AccountApi
+
+All URIs are relative to *https://api.webscraping.ai*
+
+| Method | HTTP request | Description |
+| ------ | ------------ | ----------- |
+| [**account**](AccountApi.md#account) | **GET** /account | Information about your account calls quota |
+
+
+## account
+
+> <Account> account
+
+Information about your account calls quota
+
+Returns information about your account, including the remaining API credits quota, the next billing cycle start time, and the remaining concurrent requests. The response is in JSON format.
+
+### Examples
+
+```ruby
+require 'time'
+require 'webscraping_ai'
+# setup authorization
+WebScrapingAI.configure do |config|
+  # Configure API key authorization: api_key
+  config.api_key['api_key'] = 'YOUR API KEY'
+  # Uncomment the following line to set a prefix for the API key, e.g. 'Bearer' (defaults to nil)
+  # config.api_key_prefix['api_key'] = 'Bearer'
+end
+
+api_instance = WebScrapingAI::AccountApi.new
+
+begin
+  # Information about your account calls quota
+  result = api_instance.account
+  p result
+rescue WebScrapingAI::ApiError => e
+  puts "Error when calling AccountApi->account: #{e}"
+end
+```
+
+#### Using the account_with_http_info variant
+
+This returns an Array which contains the response data, status code and headers.
+
+> <Array(<Account>, Integer, Hash)> account_with_http_info
+
+```ruby
+begin
+  # Information about your account calls quota
+  data, status_code, headers = api_instance.account_with_http_info
+  p status_code # => 2xx
+  p headers # => { ... }
+  p data # => <Account>
+rescue WebScrapingAI::ApiError => e
+  puts "Error when calling AccountApi->account_with_http_info: #{e}"
+end
+```
+
+### Parameters
+
+This endpoint does not need any parameter.
+
+### Return type
+
+[**Account**](Account.md)
+
+### Authorization
+
+[api_key](../README.md#api_key)
+
+### HTTP request headers
+
+- **Content-Type**: Not defined
+- **Accept**: application/json
+
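The `/account` endpoint is described as returning the remaining credits and remaining concurrent requests as JSON. One possible use is to check those numbers before starting a batch of requests. The sketch below works on a raw JSON string so that it runs without the gem or network access. The function `quota_allows?` and the sample body are hypothetical; only the three field names come from the docs.

```ruby
require 'json'

# Hypothetical guard, not part of the gem: decides whether a planned batch
# fits into the quota reported by GET /account.
def quota_allows?(account_json, planned_calls:, parallel: 1)
  account = JSON.parse(account_json)
  # Missing or null fields are treated as zero, since all are optional.
  calls = account.fetch('remaining_api_calls', 0).to_i
  slots = account.fetch('remaining_concurrency', 0).to_i
  planned_calls <= calls && parallel <= slots
end

# Invented response body carrying the three documented fields.
body = '{"remaining_api_calls": 50, "resets_at": 1700000000, "remaining_concurrency": 2}'

puts quota_allows?(body, planned_calls: 10, parallel: 2)
puts quota_allows?(body, planned_calls: 10, parallel: 3)
```

With the generated client, the same fields would be read from the `Account` object returned by `api_instance.account` rather than parsed by hand.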
data/docs/Error.md
CHANGED
@@ -2,16 +2,23 @@
 
 ## Properties
 
-Name | Type | Description | Notes
-
-**message** | **String** | Error description | [optional]
+| Name | Type | Description | Notes |
+| ---- | ---- | ----------- | ----- |
+| **message** | **String** | Error description | [optional] |
+| **status_code** | **Integer** | Target page response HTTP status code (403, 500, etc) | [optional] |
+| **status_message** | **String** | Target page response HTTP status message | [optional] |
+| **body** | **String** | Target page response body | [optional] |
 
-##
+## Example
 
 ```ruby
-require '
+require 'webscraping_ai'
 
-instance = WebScrapingAI::Error.new(
+instance = WebScrapingAI::Error.new(
+  message: null,
+  status_code: null,
+  status_message: null,
+  body: null
+)
 ```
 
-
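The `Error` model gains three fields describing the target page's response (`status_code`, `status_message`, `body`) alongside `message`. The sketch below shows one way an error body with those fields could be condensed into a single log line. It parses a raw JSON string so that it runs without the gem. The function `describe_error`, its `body_preview` parameter and the sample payload are all invented for illustration.

```ruby
require 'json'

# Hypothetical formatter, not part of the gem: turns a JSON error body with
# the documented fields into one line, skipping fields that are absent.
def describe_error(json_body, body_preview: 80)
  err = JSON.parse(json_body)
  parts = []
  parts << err['message'] if err['message']
  status = [err['status_code'], err['status_message']].compact.join(' ')
  parts << "target responded #{status}" unless status.empty?
  # Only a prefix of the target page body is kept to limit log size.
  parts << "body: #{err['body'][0, body_preview]}" if err['body']
  parts.join('; ')
rescue JSON::ParserError
  # Not JSON at all: fall back to the raw text.
  json_body.to_s
end

# Invented payload using the four documented fields.
sample = '{"message": "Unexpected HTTP code", "status_code": 403, ' \
         '"status_message": "Forbidden", "body": "<html>denied</html>"}'

puts describe_error(sample)
```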
data/docs/HTMLApi.md
CHANGED
@@ -2,11 +2,9 @@
 
 All URIs are relative to *https://api.webscraping.ai*
 
-Method | HTTP request | Description
-
-[**get_html**](HTMLApi.md#get_html) | **GET** /html | Page HTML by URL
-[**post_html**](HTMLApi.md#post_html) | **POST** /html | Page HTML by URL with POST request to the target page
-
+| Method | HTTP request | Description |
+| ------ | ------------ | ----------- |
+| [**get_html**](HTMLApi.md#get_html) | **GET** /html | Page HTML by URL |
 
 
 ## get_html
@@ -15,115 +13,80 @@ Method | HTTP request | Description
 
 Page HTML by URL
 
-Returns
+Returns the full HTML content of a webpage specified by the URL. The response is in plain text. Proxies and Chromium JavaScript rendering are used for page retrieval and processing.
 
-###
+### Examples
 
 ```ruby
-
+require 'time'
 require 'webscraping_ai'
 # setup authorization
 WebScrapingAI.configure do |config|
   # Configure API key authorization: api_key
   config.api_key['api_key'] = 'YOUR API KEY'
   # Uncomment the following line to set a prefix for the API key, e.g. 'Bearer' (defaults to nil)
-  #config.api_key_prefix['api_key'] = 'Bearer'
+  # config.api_key_prefix['api_key'] = 'Bearer'
 end
 
 api_instance = WebScrapingAI::HTMLApi.new
-url = 'https://example.com' # String | URL of the target page
+url = 'https://example.com' # String | URL of the target page.
 opts = {
-  headers: {
-  timeout:
-  js: true, # Boolean | Execute on-page JavaScript using a headless browser (true by default)
-
+  headers: { key: 3.56}, # Hash<String, String> | HTTP headers to pass to the target page. Can be specified either via a nested query parameter (...&headers[One]=value1&headers=[Another]=value2) or as a JSON encoded object (...&headers={\"One\": \"value1\", \"Another\": \"value2\"}).
+  timeout: 10000, # Integer | Maximum web page retrieval time in ms. Increase it in case of timeout errors (10000 by default, maximum is 30000).
+  js: true, # Boolean | Execute on-page JavaScript using a headless browser (true by default).
+  js_timeout: 2000, # Integer | Maximum JavaScript rendering time in ms. Increase it in case if you see a loading indicator instead of data on the target page.
+  proxy: 'datacenter', # String | Type of proxy, use residential proxies if your site restricts traffic from datacenters (datacenter by default). Note that residential proxy requests are more expensive than datacenter, see the pricing page for details.
+  country: 'us', # String | Country of the proxy to use (US by default). Only available on Startup and Custom plans.
+  device: 'desktop', # String | Type of device emulation.
+  error_on_404: false, # Boolean | Return error on 404 HTTP status on the target page (false by default).
+  error_on_redirect: false, # Boolean | Return error on redirect on the target page (false by default).
+  js_script: 'document.querySelector('button').click();', # String | Custom JavaScript code to execute on the target page.
+  return_script_result: false # Boolean | Return result of the custom JavaScript code (js_script parameter) execution on the target page (false by default, page HTML will be returned).
 }
 
 begin
-  #Page HTML by URL
+  # Page HTML by URL
   result = api_instance.get_html(url, opts)
   p result
 rescue WebScrapingAI::ApiError => e
-  puts "
+  puts "Error when calling HTMLApi->get_html: #{e}"
 end
 ```
 
-
-
-
-Name | Type | Description | Notes
-------------- | ------------- | ------------- | -------------
-**url** | **String**| URL of the target page |
-**headers** | [**Hash<String, String>**](String.md)| HTTP headers to pass to the target page. Can be specified either via a nested query parameter (...&headers[One]=value1&headers=[Another]=value2) or as a JSON encoded object (...&headers={\"One\": \"value1\", \"Another\": \"value2\"}) | [optional]
-**timeout** | **Integer**| Maximum processing time in ms. Increase it in case of timeout errors (5000 by default, maximum is 30000) | [optional] [default to 5000]
-**js** | **Boolean**| Execute on-page JavaScript using a headless browser (true by default), costs 2 requests | [optional] [default to true]
-**proxy** | **String**| Type of proxy, use residential proxies if your site restricts traffic from datacenters (datacenter by default) | [optional] [default to 'datacenter']
-
-### Return type
-
-**String**
-
-### Authorization
-
-[api_key](../README.md#api_key)
-
-### HTTP request headers
-
-- **Content-Type**: Not defined
-- **Accept**: application/json, text/html
-
+#### Using the get_html_with_http_info variant
 
-
+This returns an Array which contains the response data, status code and headers.
 
-> String
-
-Page HTML by URL with POST request to the target page
-
-Returns just HTML on success, JSON on error. Request body will be passed to the target page.
-
-### Example
+> <Array(String, Integer, Hash)> get_html_with_http_info(url, opts)
 
 ```ruby
-# load the gem
-require 'webscraping_ai'
-# setup authorization
-WebScrapingAI.configure do |config|
-  # Configure API key authorization: api_key
-  config.api_key['api_key'] = 'YOUR API KEY'
-  # Uncomment the following line to set a prefix for the API key, e.g. 'Bearer' (defaults to nil)
-  #config.api_key_prefix['api_key'] = 'Bearer'
-end
-
-api_instance = WebScrapingAI::HTMLApi.new
-url = 'https://httpbin.org/post' # String | URL of the target page
-opts = {
-  headers: {'key' => '{\"Cookie\":\"session=some_id\"}'}, # Hash<String, String> | HTTP headers to pass to the target page. Can be specified either via a nested query parameter (...&headers[One]=value1&headers=[Another]=value2) or as a JSON encoded object (...&headers={\"One\": \"value1\", \"Another\": \"value2\"})
-  timeout: 5000, # Integer | Maximum processing time in ms. Increase it in case of timeout errors (5000 by default, maximum is 30000)
-  js: true, # Boolean | Execute on-page JavaScript using a headless browser (true by default), costs 2 requests
-  proxy: 'datacenter', # String | Type of proxy, use residential proxies if your site restricts traffic from datacenters (datacenter by default)
-  request_body: nil # Hash<String, Object> | Request body to pass to the target page
-}
-
 begin
-  #Page HTML by URL
-
-  p
+  # Page HTML by URL
+  data, status_code, headers = api_instance.get_html_with_http_info(url, opts)
+  p status_code # => 2xx
+  p headers # => { ... }
+  p data # => String
 rescue WebScrapingAI::ApiError => e
-  puts "
+  puts "Error when calling HTMLApi->get_html_with_http_info: #{e}"
 end
 ```
 
 ### Parameters
 
-
-
-
-**
-**
-**
-**
-**proxy** | **String
-**
+| Name | Type | Description | Notes |
+| ---- | ---- | ----------- | ----- |
+| **url** | **String** | URL of the target page. | |
+| **headers** | [**Hash<String, String>**](String.md) | HTTP headers to pass to the target page. Can be specified either via a nested query parameter (...&headers[One]=value1&headers=[Another]=value2) or as a JSON encoded object (...&headers={\"One\": \"value1\", \"Another\": \"value2\"}). | [optional] |
+| **timeout** | **Integer** | Maximum web page retrieval time in ms. Increase it in case of timeout errors (10000 by default, maximum is 30000). | [optional][default to 10000] |
+| **js** | **Boolean** | Execute on-page JavaScript using a headless browser (true by default). | [optional][default to true] |
+| **js_timeout** | **Integer** | Maximum JavaScript rendering time in ms. Increase it in case if you see a loading indicator instead of data on the target page. | [optional][default to 2000] |
+| **proxy** | **String** | Type of proxy, use residential proxies if your site restricts traffic from datacenters (datacenter by default). Note that residential proxy requests are more expensive than datacenter, see the pricing page for details. | [optional][default to 'datacenter'] |
+| **country** | **String** | Country of the proxy to use (US by default). Only available on Startup and Custom plans. | [optional][default to 'us'] |
+| **device** | **String** | Type of device emulation. | [optional][default to 'desktop'] |
+| **error_on_404** | **Boolean** | Return error on 404 HTTP status on the target page (false by default). | [optional][default to false] |
+| **error_on_redirect** | **Boolean** | Return error on redirect on the target page (false by default). | [optional][default to false] |
+| **js_script** | **String** | Custom JavaScript code to execute on the target page. | [optional] |
+| **return_script_result** | **Boolean** | Return result of the custom JavaScript code (js_script parameter) execution on the target page (false by default, page HTML will be returned). | [optional][default to false] |
 
 ### Return type
 
@@ -135,6 +98,6 @@ Name | Type | Description | Notes
 
 ### HTTP request headers
 
-- **Content-Type**:
+- **Content-Type**: Not defined
 - **Accept**: application/json, text/html
 
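The `headers` parameter description names two encodings: a nested query parameter and a JSON encoded object. The sketch below builds both query strings with Ruby's standard `URI` and `JSON` libraries. The function names are hypothetical. The sketch assumes every pair takes the `headers[Name]=value` shape, although the quoted text writes the second pair as `headers=[Another]=value2`. It also uses string values, matching the documented `Hash<String, String>` type rather than the generated `{ key: 3.56}` placeholder.

```ruby
require 'json'
require 'uri'

# Nested form: one headers[Name]=value pair per header.
def nested_headers_query(headers)
  URI.encode_www_form(headers.map { |name, value| ["headers[#{name}]", value] })
end

# JSON form: a single headers parameter holding a JSON object.
def json_headers_query(headers)
  URI.encode_www_form('headers' => JSON.generate(headers))
end

headers = { 'One' => 'value1', 'Another' => 'value2' }

puts nested_headers_query(headers)
puts json_headers_query(headers)
```

Both strings are percent-encoded, so brackets, quotes and braces appear escaped in the output. When calling through the generated client, the hash is passed as `opts[:headers]` and the sketch is only useful for seeing what the two documented wire formats look like.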