curlyq 0.0.2 → 0.0.4
- checksums.yaml +4 -4
- data/.github/FUNDING.yml +2 -0
- data/CHANGELOG.md +19 -0
- data/Gemfile.lock +1 -1
- data/README.md +18 -8
- data/bin/curlyq +42 -19
- data/lib/curly/curl/html.rb +71 -49
- data/lib/curly/curl/json.rb +23 -11
- data/lib/curly/hash.rb +2 -0
- data/lib/curly/string.rb +1 -1
- data/lib/curly/version.rb +1 -1
- data/src/_README.md +15 -7
- metadata +2 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: '091e39001a4456eef85fa25e97281c6218e3383619d37a14a66ca9bd41fee9ab'
+  data.tar.gz: c287e095d4f9525e924d08cd580c37ab19448ae93bb384d419526b60cf895493
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e51492325696e09319666ee29b753472f555129dc7664f34a08b1ae30dd319cdb5518727030d2d7bbaa1dfaefc6dd14ce4ccf2756243238dd885d37e2dbffbb4
+  data.tar.gz: 27c006a7433cd9bd9208cc0aa69c0029974ca98524ecf3ee323bad00cc09490acfbfd5b4e61a65e22d5f750ebe05fa03d5d1bae1d1b977bf1028ed2aa21e2ee7
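As a side note on how the values above are produced: each entry is the SHA digest of a packaged file (metadata.gz or data.tar.gz), which can be reproduced with Ruby's standard Digest module. A minimal sketch; the file name and payload here are placeholders, not the gem's actual archives:

```ruby
require 'digest'
require 'tempfile'

# Illustrative only: write a stand-in payload and hash it the same way
# RubyGems hashes the packaged metadata.gz / data.tar.gz files.
file = Tempfile.new('data.tar.gz')
file.write('example payload')
file.close

sha256 = Digest::SHA256.file(file.path).hexdigest
sha512 = Digest::SHA512.file(file.path).hexdigest

puts sha256.length # => 64
puts sha512.length # => 128
```

A SHA256 hexdigest is always 64 hex characters and a SHA512 hexdigest 128, which matches the lengths of the new checksums above.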
data/.github/FUNDING.yml
ADDED
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,22 @@
+### 0.0.4
+
+2024-01-10 13:54
+
+#### FIXED
+
+- Queries combined with + or & not requiring all matches to be true
+
+### 0.0.3
+
+2024-01-10 13:38
+
+#### IMPROVED
+
+- Refactor Curl and Json libs to allow setting of options after creation of object
+- Allow setting of headers on most subcommands
+- --clean now affects source, head, and body keys of output
+- Also remove tabs when cleaning whitespace
+
 ### 0.0.2
 
 2024-01-10 09:18
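The 0.0.3 refactor noted above changes Curl::Html and Curl::Json so that options can be set through accessors after the object is created, with the actual request deferred to an explicit `curl` call. A minimal self-contained sketch of that pattern, using a hypothetical stand-in class rather than the gem's real code:

```ruby
# Sketch of the 0.0.3 pattern: options become accessors, and no work
# happens until #curl is called explicitly. (Hypothetical stand-in class;
# the real classes are Curl::Html and Curl::Json.)
class DeferredRequest
  attr_accessor :compressed, :headers

  def initialize(url, options = {})
    @url = url
    @compressed = options[:compressed]
    @headers = options[:headers] || {}
  end

  # The expensive fetch is deferred until explicitly requested
  def curl
    "GET #{@url} compressed=#{!!@compressed} headers=#{@headers.size}"
  end
end

req = DeferredRequest.new('https://example.com')
req.compressed = true                       # options can be set after creation
req.headers = { 'Accept' => 'text/html' }
req.curl
# => "GET https://example.com compressed=true headers=1"
```

This is what allows the `bin/curlyq` changes below to construct an object, assign headers and browser, and only then trigger the request.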
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -1,4 +1,4 @@
-#
+# CurlyQ
 
 [![Gem](https://img.shields.io/gem/v/na.svg)](https://rubygems.org/gems/curlyq)
 [![GitHub license](https://img.shields.io/github/license/ttscoff/curlyq.svg)](./LICENSE.txt)
@@ -7,11 +7,13 @@
 
 _If you find this useful, feel free to [buy me some coffee][donate]._
 
+[donate]: https://brettterpstra.com/donate
 
-
+
+The current version of `curlyq` is 0.0.4
 .
 
-
+CurlyQ is a utility that provides a simple interface for curl, with additional features for things like extracting images and links, finding elements by CSS selector or XPath, getting detailed header info, and more. It's designed to be part of a scripting pipeline, outputting everything as structured data (JSON or YAML). It also has rudimentary support for making calls to JSON endpoints easier, but it's expected that you'll use something like `jq` to parse the output.
 
 [github]: https://github.com/ttscoff/curlyq/
 
@@ -24,11 +26,15 @@ If you're using Homebrew, you have the option to install via [brew-gem](https://
     brew install brew-gem
     brew gem install curlyq
 
-If you don't have Ruby/RubyGems, you can install them pretty easily with Homebrew, rvm, or asdf.
+If you don't have Ruby/RubyGems, you can install them pretty easily with [Homebrew], [rvm], or [asdf].
+
+[Homebrew]: https://brew.sh/ "Homebrew—The Missing Package Manager for macOS (or Linux)"
+[rvm]: https://rvm.io/ "Ruby Version Manager (RVM)"
+[asdf]: https://github.com/asdf-vm/asdf "asdf-vm/asdf:Extendable version manager with support for ..."
 
 ### Usage
 
-Run `curlyq help` for a list of
+Run `curlyq help` for a list of subcommands. Run `curlyq help SUBCOMMAND` for details on a particular subcommand and its options.
 
 ```
 NAME
@@ -38,7 +44,7 @@ SYNOPSIS
     curlyq [global options] command [command options] [arguments...]
 
 VERSION
-    0.0.
+    0.0.4
 
 GLOBAL OPTIONS
     --help - Show this message
@@ -61,7 +67,7 @@ COMMANDS
 
 #### Commands
 
-curlyq makes use of subcommands, e.g. `curlyq html` or `curlyq extract`. Each subcommand takes its own options, but I've made an effort to standardize the choices between each command.
+curlyq makes use of subcommands, e.g. `curlyq html [options] URL` or `curlyq extract [options] URL`. Each subcommand takes its own options, but I've made an effort to standardize the choices between each command as much as possible.
 
 ##### extract
 
@@ -135,6 +141,7 @@ SYNOPSIS
 COMMAND OPTIONS
     -c, --[no-]compressed - Expect compressed results
     --[no-]clean          - Remove extra whitespace from results
+    -h, --header=arg      - Define a header to send as key=value (may be used more than once, default: none)
    -t, --type=arg        - Type of images to return (img, srcset, opengraph, all) (may be used more than once, default: ["all"])
 ```
 
@@ -193,6 +200,8 @@ COMMAND OPTIONS
 
 ##### screenshot
 
+Full-page screenshots require Firefox, installed and specified with `--browser firefox`.
+
 ```
 NAME
     screenshot - Save a screenshot of a URL
@@ -203,6 +212,7 @@ SYNOPSIS
 
 COMMAND OPTIONS
     -b, --browser=arg     - Browser to use (firefox, chrome) (default: chrome)
+    -h, --header=arg      - Define a header to send as key=value (may be used more than once, default: none)
    -o, --out, --file=arg - File destination (default: none)
    -t, --type=arg        - Type of screenshot to save (full (requires firefox), print, visible) (default: full)
 ```
@@ -230,4 +240,4 @@ PayPal link: [paypal.me/ttscoff](https://paypal.me/ttscoff)
 
 ## Changelog
 
-See [CHANGELOG.md](https://github.com/ttscoff/
+See [CHANGELOG.md](https://github.com/ttscoff/curlyq/blob/main/CHANGELOG.md)
data/bin/curlyq
CHANGED
@@ -110,12 +110,13 @@ command %i[html curl] do |c|
     output = []
 
     urls.each do |url|
-      res = Curl::Html.new(url, browser: options[:browser], fallback: options[:fallback],
-
-
-
-
-
+      res = Curl::Html.new(url, { browser: options[:browser], fallback: options[:fallback],
+                                  headers: headers, headers_only: options[:info],
+                                  compressed: options[:compressed], clean: options[:clean],
+                                  ignore_local_links: options[:ignore_relative],
+                                  ignore_fragment_links: options[:ignore_fragments],
+                                  external_links_only: options[:external_links_only] })
+      res.curl
 
       if options[:info]
         output.push(res.headers)
@@ -156,12 +157,18 @@ command :screenshot do |c|
   c.desc 'File destination'
   c.flag %i[o out file]
 
+  c.desc 'Define a header to send as key=value'
+  c.flag %i[h header], multiple: true
+
   c.action do |_, options, args|
     urls = args.join(' ').split(/[, ]+/)
+    headers = break_headers(options[:header])
 
     urls.each do |url|
       c = Curl::Html.new(url)
-      c.
+      c.headers = headers
+      c.browser = options[:browser]
+      c.screenshot(options[:out], type: options[:type])
     end
   end
 end
@@ -185,7 +192,11 @@ command :json do |c|
     output = []
 
     urls.each do |url|
-      res = Curl::Json.new(url
+      res = Curl::Json.new(url)
+      res.request_headers = headers
+      res.compressed = options[:compressed],
+      res.symbolize_names = false
+      res.curl
 
       json = res.json
 
@@ -235,8 +246,9 @@ command :extract do |c|
     output = []
 
     urls.each do |url|
-      res = Curl::Html.new(url, headers: headers, headers_only: false,
-
+      res = Curl::Html.new(url, { headers: headers, headers_only: false,
+                                  compressed: options[:compressed], clean: options[:clean] })
+      res.curl
       extracted = res.extract(options[:before], options[:after])
       extracted.strip_tags! if options[:strip]
       output.concat(extracted)
@@ -271,8 +283,9 @@ command :tags do |c|
     output = []
 
     urls.each do |url|
-      res = Curl::Html.new(url, headers: headers, headers_only: options[:headers],
-
+      res = Curl::Html.new(url, { headers: headers, headers_only: options[:headers],
+                                  compressed: options[:compressed], clean: options[:clean] })
+      res.curl
       output = []
       if options[:search]
         output = res.tags.search(options[:search])
@@ -299,15 +312,20 @@ command :images do |c|
   c.desc 'Remove extra whitespace from results'
   c.switch %i[clean]
 
+  c.desc 'Define a header to send as key=value'
+  c.flag %i[h header], multiple: true
+
   c.action do |global_options, options, args|
     urls = args.join(' ').split(/[, ]+/)
+    headers = break_headers(options[:header])
 
     output = []
 
     types = options[:type].join(' ').split(/[ ,]+/).map(&:normalize_image_type)
 
     urls.each do |url|
-      res = Curl::Html.new(url, compressed: options[:compressed], clean: options[:clean])
+      res = Curl::Html.new(url, { compressed: options[:compressed], clean: options[:clean] })
+      res.curl
       output.concat(res.images(types: types))
     end
 
@@ -339,10 +357,13 @@ command :links do |c|
     output = []
 
     urls.each do |url|
-      res = Curl::Html.new(url,
-
-
-
+      res = Curl::Html.new(url, {
+                             compressed: options[:compressed], clean: options[:clean],
+                             ignore_local_links: options[:ignore_relative],
+                             ignore_fragment_links: options[:ignore_fragments],
+                             external_links_only: options[:external_links_only]
+                           })
+      res.curl
 
       if options[:query]
         query = options[:query] =~ /^links/ ? options[:query] : "links#{options[:query]}"
@@ -371,7 +392,8 @@ command :headlinks do |c|
     output = []
 
     urls.each do |url|
-      res = Curl::Html.new(url, compressed: options[:compressed], clean: options[:clean])
+      res = Curl::Html.new(url, { compressed: options[:compressed], clean: options[:clean] })
+      res.curl
 
       if options[:query]
         query = options[:query] =~ /^links/ ? options[:query] : "links#{options[:query]}"
@@ -420,7 +442,8 @@ command :scrape do |c|
       driver.get url
       res = driver.page_source
 
-      res = Curl::Html.new(nil, source: res, clean: options[:clean])
+      res = Curl::Html.new(nil, { source: res, clean: options[:clean] })
+      res.curl
       if options[:search]
         out = res.search(options[:search])
 
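Several subcommands above pass their repeated `--header key=value` flags through a `break_headers` helper before assigning the result to the request object. The helper's implementation is not part of this diff; a plausible sketch of what such a helper does (hypothetical code, not the gem's actual source):

```ruby
# Hypothetical sketch of a break_headers-style helper: turn an array of
# "key=value" strings (from repeated --header flags) into a Hash suitable
# for assignment to Curl::Html#headers=.
def break_headers(headers)
  return {} if headers.nil?

  headers.each_with_object({}) do |header, hash|
    key, value = header.split(/=/, 2)  # split only on the first "="
    hash[key.strip] = value.to_s.strip
  end
end

break_headers(['User-Agent=curlyq', 'X-Token=a=b'])
# => { 'User-Agent' => 'curlyq', 'X-Token' => 'a=b' }
```

Splitting on only the first `=` matters so that values containing `=` (tokens, base64) survive intact.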
data/lib/curly/curl/html.rb
CHANGED
@@ -10,8 +10,11 @@ module Curl
 
   # Class for CURLing an HTML page
   class Html
-
-
+    attr_accessor :settings, :browser, :source, :headers, :headers_only, :compressed, :clean, :fallback,
+                  :ignore_local_links, :ignore_fragment_links, :external_links_only
+
+    attr_reader :url, :code, :meta, :links, :head, :body,
+                :title, :description, :body_links, :body_images
 
     def to_data(url: nil)
       {
@@ -20,9 +23,9 @@
         headers: @headers,
         meta: @meta,
         meta_links: @links,
-        head: @head,
-        body: @body,
-        source: @source,
+        head: @clean ? @head&.strip&.clean : @head,
+        body: @clean ? @body&.strip&.clean : @body,
+        source: @clean ? @source&.strip&.clean : @source,
         title: @title,
         description: @description,
         links: @body_links,
@@ -33,29 +36,48 @@
     ##
     ## Create a new page object from a URL
     ##
-    ## @param url
-    ## @param
-    ##
-    ## @
+    ## @param url [String] The url
+    ## @param options [Hash] The options
+    ##
+    ## @option options :browser [Symbol] the browser to use instead of curl (:chrome, :firefox)
+    ## @option options :source [String] source provided instead of curl
+    ## @option options :headers [Hash] headers to send in the request
+    ## @option options :headers_only [Boolean] whether to return just response headers
+    ## @option options :compressed [Boolean] expect compressed response
+    ## @option options :clean [Boolean] clean whitespace from response
+    ## @option options :fallback [Symbol] browser to fall back to if curl doesn't work (:chrome, :firefox)
+    ## @option options :ignore_local_links [Boolean] when collecting links, ignore local/relative links
+    ## @option options :ignore_fragment_links [Boolean] when collecting links, ignore links that are just #fragments
+    ## @option options :external_links_only [Boolean] only collect links outside of current site
     ##
     ## @return [HTMLCurl] new page object
     ##
-    def initialize(url,
-
-
-      @
-      @
-      @
-      @
+    def initialize(url, options = {})
+      @browser = options[:browser] || :none
+      @source = options[:source]
+      @headers = options[:headers] || {}
+      @headers_only = options[:headers_only]
+      @compressed = options[:compressed]
+      @clean = options[:clean]
+      @fallback = options[:fallback]
+      @ignore_local_links = options[:ignore_local_links]
+      @ignore_fragment_links = options[:ignore_fragment_links]
+      @external_links_only = options[:external_links_only]
+
       @curl = TTY::Which.which('curl')
       @url = url
-
-
-
+    end
+
+    def curl
+      res = if @url && @browser && @browser != :none
+              source = curl_dynamic_html
+              curl_html(nil, source: source, headers: @headers)
       elsif url.nil? && !source.nil?
-        curl_html(nil, source: source, headers: headers, headers_only: headers_only,
+              curl_html(nil, source: @source, headers: @headers, headers_only: @headers_only,
+                        compressed: @compressed, fallback: false)
       else
-        curl_html(url, headers: headers, headers_only: headers_only,
+              curl_html(@url, headers: @headers, headers_only: @headers_only,
+                        compressed: @compressed, fallback: @fallback)
       end
       @url = res[:url]
       @code = res[:code]
@@ -82,10 +104,10 @@
     ## save (:full_page,
     ## :print_page, :visible)
     ##
-    def screenshot(destination = nil,
+    def screenshot(destination = nil, type: :full_page)
      full_page = type.to_sym == :full_page
      print_page = type.to_sym == :print_page
-      save_screenshot(destination,
+      save_screenshot(destination, type: type)
     end
 
     ##
@@ -297,7 +319,7 @@
 
       {
         tag: el.name,
-        source: el.to_html,
+        source: @clean ? el.to_html&.strip&.clean : el.to_html,
         attrs: attributes,
        content: @clean ? el.text&.strip&.clean : el.text.strip,
        tags: recurse_children(el)
@@ -511,14 +533,14 @@
     ##
     ## @return [String] page source
     ##
-    def curl_dynamic_html
-      browser = browser.normalize_browser_type if browser.is_a?(String)
+    def curl_dynamic_html
+      browser = @browser.normalize_browser_type if @browser.is_a?(String)
       res = nil
 
       driver = Selenium::WebDriver.for browser
       driver.manage.timeouts.implicit_wait = 4
       begin
-        driver.get url
+        driver.get @url
         res = driver.page_source
       ensure
         driver.quit
@@ -534,7 +556,7 @@
     ## @param browser [Symbol] The browser (:chrome or :firefox)
     ## @param type [Symbol] The type of screenshot (:full_page, :print_page, or :visible)
     ##
-    def save_screenshot(destination = nil,
+    def save_screenshot(destination = nil, type: :full_page)
       raise 'No URL provided' if url.nil?
 
       raise 'No file destination provided' if destination.nil?
@@ -554,7 +576,7 @@
                       "#{destination.sub(/\.(pdf|jpe?g|png)$/, '')}.png"
                     end
 
-      driver = Selenium::WebDriver.for browser
+      driver = Selenium::WebDriver.for @browser
       driver.manage.timeouts.implicit_wait = 4
       begin
         driver.get @url
@@ -587,38 +609,38 @@
                   headers_only: false, compressed: false, fallback: false)
       unless url.nil?
         flags = 'SsL'
-        flags += headers_only ? 'I' : 'i'
+        flags += @headers_only ? 'I' : 'i'
         agents = [
          'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.1',
          'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.',
          'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.3',
          'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.'
        ]
-        headers = headers.nil? ? '' : headers.map { |h, v| %(-H "#{h}: #{v}") }.join(' ')
-        compress = compressed ? '--compressed' : ''
-        source = `#{@curl} -#{flags} #{compress} #{headers} '#{url}' 2>/dev/null`
+        headers = @headers.nil? ? '' : @headers.map { |h, v| %(-H "#{h}: #{v}") }.join(' ')
+        compress = @compressed ? '--compressed' : ''
+        @source = `#{@curl} -#{flags} #{compress} #{headers} '#{@url}' 2>/dev/null`
         agent = 0
         while source.nil? || source.empty?
-          source = `#{@curl} -#{flags} #{compress} -A "#{agents[agent]}" #{headers} '#{url}' 2>/dev/null`
+          source = `#{@curl} -#{flags} #{compress} -A "#{agents[agent]}" #{headers} '#{@url}' 2>/dev/null`
           break if agent >= agents.count - 1
         end
 
-        unless $?.success? || fallback
-          warn "Error curling #{url}"
+        unless $?.success? || @fallback
+          warn "Error curling #{@url}"
           Process.exit 1
         end
 
-        if fallback && (source.nil? || source.empty?)
-          source = curl_dynamic_html(url, fallback, headers)
+        if @fallback && (@source.nil? || @source.empty?)
+          @source = curl_dynamic_html(@url, @fallback, @headers)
         end
       end
 
       return false if source.nil? || source.empty?
 
-      source.strip!
+      @source.strip!
 
-      headers = { 'location' => url }
-      lines = source.split(/\r\n/)
+      headers = { 'location' => @url }
+      lines = @source.split(/\r\n/)
       code = lines[0].match(/(\d\d\d)/)[1]
       lines.shift
       lines.each_with_index do |line, idx|
@@ -626,7 +648,7 @@
           m = Regexp.last_match
           headers[m[1]] = m[2]
         else
-          source = lines[idx..].join("\n")
+          @source = lines[idx..].join("\n")
           break
         end
       end
@@ -636,21 +658,21 @@
       end
 
       if headers['content-type'] =~ /json/
-        return { url: url, code: code, headers: headers, meta: nil, links: nil,
-                 head: nil, body: source.strip, source: source.strip, body_links: nil, body_images: nil }
+        return { url: @url, code: code, headers: headers, meta: nil, links: nil,
+                 head: nil, body: @source.strip, source: @source.strip, body_links: nil, body_images: nil }
       end
 
       head = source.match(%r{(?<=<head>)(.*?)(?=</head>)}mi)
 
       if head.nil?
-        { url: url, code: code, headers: headers, meta: nil, links: nil, head: nil, body: source.strip,
-          source: source.strip, body_links: nil, body_images: nil }
+        { url: @url, code: code, headers: headers, meta: nil, links: nil, head: nil, body: @source.strip,
+          source: @source.strip, body_links: nil, body_images: nil }
       else
         meta = meta_tags(head[1])
         links = link_tags(head[1])
-        body = source.match(%r{<body.*?>(.*?)</body>}mi)[1]
-        { url: url, code: code, headers: headers, meta: meta, links: links, head: head[1], body: body,
-          source: source.strip, body_links: body_links, body_images: body_images }
+        body = @source.match(%r{<body.*?>(.*?)</body>}mi)[1]
+        { url: @url, code: code, headers: headers, meta: meta, links: links, head: head[1], body: body,
+          source: @source.strip, body_links: body_links, body_images: body_images }
       end
     end
 
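The `--clean` handling in this file chains `&.strip&.clean` on the head, body, and source values. `String#clean` is curlyq's own extension (defined in lib/curly/string.rb, which this diff touches only in passing), and per the changelog 0.0.3 makes it remove tabs as well as spaces. A hedged sketch of what such an extension might look like; this is an assumption, not the gem's actual implementation:

```ruby
# Assumed sketch of a String#clean-style extension: collapse runs of
# spaces and tabs and squeeze repeated blank lines. Monkeypatching String
# mirrors how the gem adds its own string helpers.
class String
  def clean
    gsub(/[\t ]+/, ' ')         # collapse spaces AND tabs (the 0.0.3 change)
      .gsub(/\n{2,}/, "\n\n")   # squeeze runs of blank lines
  end
end

"curly\t\tq   rocks\n\n\n\nend".clean
# => "curly q rocks\n\nend"
```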
data/lib/curly/curl/json.rb
CHANGED
@@ -3,7 +3,11 @@
 module Curl
   # Class for CURLing a JSON response
   class Json
-
+    attr_accessor :url
+
+    attr_writer :compressed, :request_headers, :symbolize_names
+
+    attr_reader :code, :json, :headers
 
     def to_data
       {
@@ -23,9 +27,17 @@
     ##
     ## @return [Curl::Json] Curl::Json object with url, code, parsed json, and response headers
     ##
-    def initialize(url,
+    def initialize(url, options = {})
+      @url = url
+      @request_headers = options[:headers]
+      @compressed = options[:compressed]
+      @symbolize_names = options[:symbolize_names]
+
       @curl = TTY::Which.which('curl')
-
+    end
+
+    def curl
+      page = curl_json
 
       raise "Error retrieving #{url}" if page.nil? || page.empty?
 
@@ -60,7 +72,7 @@
     ##
     ## @return [Hash] hash of url, code, headers, and parsed json
     ##
-    def curl_json
+    def curl_json
       flags = 'SsLi'
       agents = [
         'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.1',
@@ -69,12 +81,12 @@
         'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.'
       ]
 
-      headers = headers.nil? ? '' : headers.map { |h, v| %(-H "#{h}: #{v}") }.join(' ')
-      compress = compressed ? '--compressed' : ''
-      source = `#{@curl} -#{flags} #{compress} #{headers} '#{url}' 2>/dev/null`
+      headers = @headers.nil? ? '' : @headers.map { |h, v| %(-H "#{h}: #{v}") }.join(' ')
+      compress = @compressed ? '--compressed' : ''
+      source = `#{@curl} -#{flags} #{compress} #{headers} '#{@url}' 2>/dev/null`
       agent = 0
       while source.nil? || source.empty?
-        source = `#{@curl} -#{flags} #{compress} -A "#{agents[agent]}" #{headers} '#{url}' 2>/dev/null`
+        source = `#{@curl} -#{flags} #{compress} -A "#{agents[agent]}" #{headers} '#{@url}' 2>/dev/null`
         break if agent >= agents.count - 1
       end
 
@@ -99,9 +111,9 @@
       json = source.strip.force_encoding('utf-8')
       begin
         json.gsub!(/[\u{1F600}-\u{1F6FF}]/, '')
-        { url: url, code: code, headers: headers, json: JSON.parse(json, symbolize_names: symbolize_names) }
-      rescue StandardError
-        { url: url, code: code, headers: headers, json: nil}
+        { url: @url, code: code, headers: headers, json: JSON.parse(json, symbolize_names: @symbolize_names) }
+      rescue StandardError
+        { url: @url, code: code, headers: headers, json: nil }
       end
     end
   end
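Both Curl::Html and Curl::Json build their curl command line the same way: the headers hash is mapped to repeated `-H` flags and joined into a single string. That one-liner from the diffs above, shown in isolation:

```ruby
# The header-flag assembly used by both classes: a headers Hash becomes
# repeated -H "Key: value" flags for the curl command line.
headers = { 'Accept' => 'application/json', 'X-Token' => 'abc' }
flags = headers.map { |h, v| %(-H "#{h}: #{v}") }.join(' ')
# => '-H "Accept: application/json" -H "X-Token: abc"'
```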
data/lib/curly/hash.rb
CHANGED
data/lib/curly/string.rb
CHANGED
data/lib/curly/version.rb
CHANGED
data/src/_README.md
CHANGED
@@ -1,4 +1,4 @@
-<!--README--><!--GITHUB-->#
+<!--README--><!--GITHUB--># CurlyQ
 
 [![Gem](https://img.shields.io/gem/v/na.svg)](https://rubygems.org/gems/curlyq)
 [![GitHub license](https://img.shields.io/github/license/ttscoff/curlyq.svg)](./LICENSE.txt)
@@ -6,11 +6,13 @@
 **A command line helper for curl and web scraping**
 
 _If you find this useful, feel free to [buy me some coffee][donate]._
+
+[donate]: https://brettterpstra.com/donate
 <!--END GITHUB-->
 
-The current version of `curlyq` is <!--VER
+The current version of `curlyq` is <!--VER-->0.0.3<!--END VER-->.
 
-
+CurlyQ is a utility that provides a simple interface for curl, with additional features for things like extracting images and links, finding elements by CSS selector or XPath, getting detailed header info, and more. It's designed to be part of a scripting pipeline, outputting everything as structured data (JSON or YAML). It also has rudimentary support for making calls to JSON endpoints easier, but it's expected that you'll use something like `jq` to parse the output.
 
 [github]: https://github.com/ttscoff/curlyq/
 
@@ -23,11 +25,15 @@ If you're using Homebrew, you have the option to install via [brew-gem](https://
     brew install brew-gem
     brew gem install curlyq
 
-If you don't have Ruby/RubyGems, you can install them pretty easily with Homebrew, rvm, or asdf.
+If you don't have Ruby/RubyGems, you can install them pretty easily with [Homebrew], [rvm], or [asdf].
+
+[Homebrew]: https://brew.sh/ "Homebrew—The Missing Package Manager for macOS (or Linux)"
+[rvm]: https://rvm.io/ "Ruby Version Manager (RVM)"
+[asdf]: https://github.com/asdf-vm/asdf "asdf-vm/asdf:Extendable version manager with support for ..."
 
 ### Usage
 
-Run `curlyq help` for a list of
+Run `curlyq help` for a list of subcommands. Run `curlyq help SUBCOMMAND` for details on a particular subcommand and its options.
 
 ```
 @cli(bundle exec bin/curlyq help)
@@ -35,7 +41,7 @@ Run `curlyq help` for a list of commands. Run `curlyq help SUBCOMMAND` for detai
 
 #### Commands
 
-curlyq makes use of subcommands, e.g. `curlyq html` or `curlyq extract`. Each subcommand takes its own options, but I've made an effort to standardize the choices between each command.
+curlyq makes use of subcommands, e.g. `curlyq html [options] URL` or `curlyq extract [options] URL`. Each subcommand takes its own options, but I've made an effort to standardize the choices between each command as much as possible.
 
 ##### extract
 
@@ -82,6 +88,8 @@ curlyq makes use of subcommands, e.g. `curlyq html` or `curlyq extract`. Each su
 
 ##### screenshot
 
+Full-page screenshots require Firefox, installed and specified with `--browser firefox`.
+
 ```
 @cli(bundle exec bin/curlyq help screenshot)
 ```
@@ -97,5 +105,5 @@ PayPal link: [paypal.me/ttscoff](https://paypal.me/ttscoff)
 
 ## Changelog
 
-See [CHANGELOG.md](https://github.com/ttscoff/
+See [CHANGELOG.md](https://github.com/ttscoff/curlyq/blob/main/CHANGELOG.md)
 <!--END GITHUB--><!--END README-->
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: curlyq
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.4
 platform: ruby
 authors:
 - Brett Terpstra
@@ -137,6 +137,7 @@ extra_rdoc_files:
 - README.rdoc
 - curlyq.rdoc
 files:
+- ".github/FUNDING.yml"
 - ".gitignore"
 - CHANGELOG.md
 - Gemfile