curlyq 0.0.8 → 0.0.10
- checksums.yaml +4 -4
- data/CHANGELOG.md +20 -0
- data/Gemfile.lock +1 -1
- data/README.md +20 -4
- data/Rakefile +17 -0
- data/bin/curlyq +8 -13
- data/lib/curly/array.rb +40 -4
- data/lib/curly/curl/html.rb +22 -0
- data/lib/curly/hash.rb +128 -31
- data/lib/curly/numeric.rb +11 -0
- data/lib/curly/string.rb +27 -3
- data/lib/curly/version.rb +3 -1
- data/lib/curly.rb +1 -0
- data/src/_README.md +19 -3
- data/test/curlyq_headlinks_test.rb +3 -2
- data/test/curlyq_html_test.rb +3 -3
- data/test/curlyq_scrape_test.rb +32 -2
- data/test/curlyq_tags_test.rb +12 -4
- data/test/helpers/curlyq-helpers.rb +1 -0
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 6109483b8869733f9e21ecab9bc8bcda0aa3b58ca1f13f9b96fe7739d019df1f
|
4
|
+
data.tar.gz: 98a8d46fe68bc88ea030dfb8e04262fbab5418005390ff79693d6f636a3bf276
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 1d75b4af2d6c1fadb83501fa707184ef41d061c08de14666b86d296048e8f21540fe2ad53a79985d5b042c93fa629cdbe8d101828edbb02832d1b55b920d5834
|
7
|
+
data.tar.gz: 238855918e3e765a2edf1864dd2663a959b099cfa5f1b89942f94eb20ba428c1700adee85590879662f0cf8de659328fbe752e8648ee210eefe0769639c57da2
|
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,23 @@
|
|
1
|
+
### 0.0.10
|
2
|
+
|
3
|
+
2024-01-17 13:50
|
4
|
+
|
5
|
+
#### IMPROVED
|
6
|
+
|
7
|
+
- Update YARD documentation
|
8
|
+
- Breaking change, ensure all return types are Arrays, even with single objects, to aid in scriptability
|
9
|
+
- Screenshot test suite
|
10
|
+
|
11
|
+
### 0.0.9
|
12
|
+
|
13
|
+
2024-01-16 12:38
|
14
|
+
|
15
|
+
#### IMPROVED
|
16
|
+
|
17
|
+
- You can now use dot syntax inside of a square bracket comparison in --query (`[attrs.id*=what]`)
|
18
|
+
- *=, ^=, $=, and == work with array values
|
19
|
+
- [] comparisons with no comparison, e.g. [attrs.id], will return every match that has that element populated
|
20
|
+
|
1
21
|
### 0.0.8
|
2
22
|
|
3
23
|
2024-01-15 16:45
|
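The 0.0.10 entry above means a script that indexed a single result directly now has to expect an Array. Below is a minimal consumer-side sketch, assuming the JSON shapes shown; `results_from` and both sample strings are hypothetical and not part of curlyq.

```ruby
require 'json'

# Hypothetical helper: always hand the caller an Array of results,
# whichever curlyq version produced the JSON.
def results_from(json_text)
  parsed = JSON.parse(json_text)
  # Kernel#Array would turn a Hash into key/value pairs, so wrap explicitly.
  parsed.is_a?(Array) ? parsed : [parsed]
end

# Hypothetical sample payloads, not captured from a real run.
old_style = '{"rel": "stylesheet", "href": "/screen.css"}'
new_style = '[{"rel": "stylesheet", "href": "/screen.css"}]'

puts results_from(old_style).first['href']
puts results_from(new_style).first['href']
```

Wrapping explicitly rather than calling `Array(parsed)` matters here because a bare Hash result would otherwise be split into key/value pairs.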
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -10,10 +10,13 @@ _If you find this useful, feel free to [buy me some coffee][donate]._
|
|
10
10
|
[donate]: https://brettterpstra.com/donate
|
11
11
|
|
12
12
|
|
13
|
-
|
13
|
+
[jq]: https://github.com/jqlang/jq "Command-line JSON processor"
|
14
|
+
[yq]: https://github.com/mikefarah/yq "yq is a portable command-line YAML, JSON, XML, CSV, TOML and properties processor"
|
15
|
+
|
16
|
+
The current version of `curlyq` is 0.0.10
|
14
17
|
.
|
15
18
|
|
16
|
-
CurlyQ is a utility that provides a simple interface for curl, with additional features for things like extracting images and links, finding elements by CSS selector or XPath, getting detailed header info, and more. It's designed to be part of a scripting pipeline, outputting everything as structured data (JSON or YAML). It also has rudimentary support for making calls to JSON endpoints easier, but it's expected that you'll use something like
|
19
|
+
CurlyQ is a utility that provides a simple interface for curl, with additional features for things like extracting images and links, finding elements by CSS selector or XPath, getting detailed header info, and more. It's designed to be part of a scripting pipeline, outputting everything as structured data (JSON or YAML). It also has rudimentary support for making calls to JSON endpoints easier, but it's expected that you'll use something like [jq] to parse the output.
|
17
20
|
|
18
21
|
[github]: https://github.com/ttscoff/curlyq/
|
19
22
|
|
@@ -44,7 +47,7 @@ SYNOPSIS
|
|
44
47
|
curlyq [global options] command [command options] [arguments...]
|
45
48
|
|
46
49
|
VERSION
|
47
|
-
0.0.
|
50
|
+
0.0.10
|
48
51
|
|
49
52
|
GLOBAL OPTIONS
|
50
53
|
--help - Show this message
|
@@ -71,6 +74,9 @@ You can shape the results using `--search` (`-s`) and `--query` (`-q`) on some c
|
|
71
74
|
|
72
75
|
A search uses either CSS or XPath syntax to locate elements. For example, if you wanted to locate all of the `<article>` elements with a class of `post` inside of the div with an id of `main`, you would run `--search '#main article.post'`. Searches can target tags, ids, and classes, and can accept `>` to target direct descendents. You can also use XPaths, but I hate those so I'm not going to document them.
|
73
76
|
|
77
|
+
> I've tried to make the query function useful, but if you want to do any kind of advanced shaping, you're better off piping the JSON output to [jq] or [yq].
|
78
|
+
|
79
|
+
|
74
80
|
Queries are specifically for shaping CurlyQ output. If you're using the `html` command, it returns a key called `images`, so you can target just the images in the response with `-q 'images'`. The queries accept array syntax, so to get the first image, you would use `-q 'images[0]'`. Ranges are accepted as well, so `-q 'images[1..4]'` will return the 2nd through 5th images found on the page. You can also do comparisons, e.g. `images[rel=me]'` to target only images with a `rel` attribute of `me`.
|
75
81
|
|
76
82
|
The comparisons for the query flag are:
|
@@ -84,6 +90,16 @@ The comparisons for the query flag are:
|
|
84
90
|
- `^=` starts with text
|
85
91
|
- `$=` ends with text
|
86
92
|
|
93
|
+
Comparisons can be numeric or string comparisons. A numeric comparison like `curlyq images -q '[width>500]' URL` would return all of the images on the page with a width attribute greater than 500.
|
94
|
+
|
95
|
+
You can also use dot syntax inside of comparisons, e.g. `[links.rel*=me]` to target the links object (`html` command), and return only the links with a `rel=me` attribute. If the comparison is to an array object (like `class` or `rel`), it will match if any of the elements of the array match your comparison.
|
96
|
+
|
97
|
+
If you end the query with a specific key, only that key will be output. If there's only one match, it will be output as a raw string. If there are multiple matches, output will be an array:
|
98
|
+
|
99
|
+
curlyq tags --search '#main .post h3' -q '[attrs.id*=what].source' 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/'
|
100
|
+
|
101
|
+
<h3 id="whats-next">What’s Next</h3>
|
102
|
+
|
87
103
|
#### Commands
|
88
104
|
|
89
105
|
curlyq makes use of subcommands, e.g. `curlyq html [options] URL` or `curlyq extract [options] URL`. Each subcommand takes its own options, but I've made an effort to standardize the choices between each command as much as possible.
|
@@ -440,7 +456,7 @@ COMMAND OPTIONS
|
|
440
456
|
|
441
457
|
Return a hierarchy of all tags in a page. Use `-t` to limit to a specific tag.
|
442
458
|
|
443
|
-
curlyq tags --search '#main .post h3' -q 'attrs
|
459
|
+
curlyq tags --search '#main .post h3' -q '[attrs.id*=what]' https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
|
444
460
|
|
445
461
|
[
|
446
462
|
{
|
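The comparison rules in the README hunks above (`^=`, `$=`, `*=`, exact match, and "any element matches" for array values) can be sketched in a few lines. This is an illustration only: `matches?` is a hypothetical name, and unlike the gem code in this diff it escapes the target text before building the pattern.

```ruby
# Hypothetical sketch of the README's string comparison rules.
# Array values match when any single element matches.
def matches?(value, operator, target)
  return value.any? { |el| matches?(el, operator, target) } if value.is_a?(Array)

  text = value.to_s
  escaped = Regexp.escape(target.to_s) # assumption: treat the target as literal text
  case operator
  when '^=' then text.match?(/\A#{escaped}/i)   # starts with
  when '$=' then text.match?(/#{escaped}\z/i)   # ends with
  when '*=' then text.match?(/#{escaped}/i)     # contains
  else text.match?(/\A#{escaped}\z/i)           # exact match
  end
end

puts matches?('whats-next', '*=', 'what')
puts matches?(%w[me nofollow], '==', 'me')
```

Matching is case-insensitive here because the patterns in the diff carry the `i` flag.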
data/Rakefile
CHANGED
@@ -56,6 +56,23 @@ task :test, :pattern, :threads, :max_tests do |_, args|
|
|
56
56
|
ThreadedTests.new.run(pattern: pattern, max_threads: args[:threads].to_i, max_tests: args[:max_tests])
|
57
57
|
end
|
58
58
|
|
59
|
+
desc 'Install current gem in all versions of asdf-controlled ruby'
|
60
|
+
task :install do
|
61
|
+
Rake::Task['clobber'].invoke
|
62
|
+
Rake::Task['package'].invoke
|
63
|
+
Dir.chdir 'pkg'
|
64
|
+
file = Dir.glob('*.gem').last
|
65
|
+
|
66
|
+
current_ruby = `asdf current ruby`.match(/(\d.\d+.\d+)/)[1]
|
67
|
+
|
68
|
+
`asdf list ruby`.split.map { |ruby| ruby.strip.sub(/^*/, '') }.each do |ruby|
|
69
|
+
`asdf shell ruby #{ruby}`
|
70
|
+
puts `gem install #{file}`
|
71
|
+
end
|
72
|
+
|
73
|
+
`asdf shell ruby #{current_ruby}`
|
74
|
+
end
|
75
|
+
|
59
76
|
desc 'Development version check'
|
60
77
|
task :ver do
|
61
78
|
gver = `git ver`
|
data/bin/curlyq
CHANGED
@@ -49,7 +49,7 @@ end
|
|
49
49
|
def self.print_out(output, yaml, raw: false, pretty: true)
|
50
50
|
output = output.to_data if output.respond_to?(:to_data)
|
51
51
|
# Was intended to flatten single responses, but not getting an array back is unpredictable
|
52
|
-
|
52
|
+
output = output.clean_output
|
53
53
|
if output.is_a?(String)
|
54
54
|
print output
|
55
55
|
elsif raw
|
@@ -130,13 +130,13 @@ command %i[html curl] do |c|
|
|
130
130
|
out = res.parse(source)
|
131
131
|
|
132
132
|
if options[:query]
|
133
|
-
out = out.to_data(url: url, clean: options[:clean]).dot_query(options[:query])
|
133
|
+
out = out.to_data(url: url, clean: options[:clean]).dot_query(options[:query], full_tag: false)
|
134
134
|
else
|
135
135
|
out = out.to_data
|
136
136
|
end
|
137
137
|
output.push([out])
|
138
138
|
elsif options[:query]
|
139
|
-
queried = res.to_data.dot_query(options[:query])
|
139
|
+
queried = res.to_data.dot_query(options[:query], full_tag: false)
|
140
140
|
output.push(queried) if queried
|
141
141
|
else
|
142
142
|
output.push(res.to_data(url: url))
|
@@ -144,14 +144,9 @@ command %i[html curl] do |c|
|
|
144
144
|
end
|
145
145
|
output.delete_if(&:nil?)
|
146
146
|
output.delete_if(&:empty?)
|
147
|
-
# output = output[0] if output.count == 1
|
148
147
|
output.map! { |o| o[options[:raw].to_sym] } if options[:raw]
|
149
148
|
|
150
|
-
|
151
|
-
while output.length == 1
|
152
|
-
output = output[0]
|
153
|
-
end
|
154
|
-
end
|
149
|
+
output = output.clean_output
|
155
150
|
|
156
151
|
print_out(output, global_options[:yaml], raw: options[:raw], pretty: global_options[:pretty])
|
157
152
|
end
|
@@ -246,7 +241,7 @@ command :json do |c|
|
|
246
241
|
end
|
247
242
|
end
|
248
243
|
|
249
|
-
|
244
|
+
output = output.clean_output
|
250
245
|
|
251
246
|
print_out(output, global_options[:yaml], pretty: global_options[:pretty])
|
252
247
|
end
|
@@ -356,7 +351,7 @@ command :tags do |c|
|
|
356
351
|
end
|
357
352
|
end
|
358
353
|
|
359
|
-
output = output
|
354
|
+
output = output.clean_output
|
360
355
|
|
361
356
|
if options[:source]
|
362
357
|
puts output.to_html
|
@@ -480,7 +475,7 @@ command :headlinks do |c|
|
|
480
475
|
end
|
481
476
|
end
|
482
477
|
|
483
|
-
output = output
|
478
|
+
output = output.clean_output
|
484
479
|
|
485
480
|
print_out(output, global_options[:yaml], pretty: global_options[:pretty])
|
486
481
|
end
|
@@ -531,7 +526,7 @@ command :scrape do |c|
|
|
531
526
|
|
532
527
|
output.delete_if(&:empty?)
|
533
528
|
|
534
|
-
output = output
|
529
|
+
output = output.clean_output
|
535
530
|
|
536
531
|
if options[:raw]
|
537
532
|
output.map! { |o| o[options[:raw].to_sym] }
|
data/lib/curly/array.rb
CHANGED
@@ -66,7 +66,7 @@ class ::Array
|
|
66
66
|
replace dedup_links
|
67
67
|
end
|
68
68
|
|
69
|
-
|
69
|
+
##
|
70
70
|
## Run a query on array elements
|
71
71
|
##
|
72
72
|
## @param path [String] dot.syntax path to compare
|
@@ -80,17 +80,29 @@ class ::Array
|
|
80
80
|
res
|
81
81
|
end
|
82
82
|
|
83
|
+
##
|
84
|
+
## Gets the value of every item in the array
|
85
|
+
##
|
86
|
+
## @param path The query path (dot syntax)
|
87
|
+
##
|
88
|
+
## @return [Array] array of values
|
89
|
+
##
|
83
90
|
def get_value(path)
|
84
|
-
|
85
|
-
res.is_a?(Array) && res.count == 1 ? res[0] : res
|
91
|
+
map { |el| el.get_value(path) }
|
86
92
|
end
|
87
93
|
|
94
|
+
##
|
95
|
+
## Convert every item in the array to HTML
|
96
|
+
##
|
97
|
+
## @return [String] Html representation of the object.
|
98
|
+
##
|
88
99
|
def to_html
|
89
100
|
map(&:to_html)
|
90
101
|
end
|
91
102
|
|
92
103
|
##
|
93
|
-
## Test if a tag contains an attribute matching filter
|
104
|
+
## Test if a tag contains an attribute matching filter
|
105
|
+
## queries
|
94
106
|
##
|
95
107
|
## @param tag_name [String] The tag name
|
96
108
|
## @param classes [String] The classes to match
|
@@ -102,6 +114,8 @@ class ::Array
|
|
102
114
|
## @param value [String] The value to match
|
103
115
|
## @param descendant [Boolean] Check descendant tags
|
104
116
|
##
|
117
|
+
## @return [Boolean] tag matches
|
118
|
+
##
|
105
119
|
def tag_match(tag_name, classes, id, attribute, operator, value, descendant: false)
|
106
120
|
tag = self
|
107
121
|
keep = true
|
@@ -155,4 +169,26 @@ class ::Array
|
|
155
169
|
keep
|
156
170
|
end
|
157
171
|
end
|
172
|
+
|
173
|
+
##
|
174
|
+
## Clean up output, shrink single-item arrays, ensure array output
|
175
|
+
##
|
176
|
+
## @return [Array] cleaned up array
|
177
|
+
##
|
178
|
+
def clean_output
|
179
|
+
output = dup
|
180
|
+
while output.is_a?(Array) && output.count == 1
|
181
|
+
output = output[0]
|
182
|
+
end
|
183
|
+
output.ensure_array
|
184
|
+
end
|
185
|
+
|
186
|
+
##
|
187
|
+
## Ensure that an object is an array
|
188
|
+
##
|
189
|
+
## @return [Array] object as Array
|
190
|
+
##
|
191
|
+
def ensure_array
|
192
|
+
return self
|
193
|
+
end
|
158
194
|
end
|
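The `clean_output` and `ensure_array` pair added above can be illustrated with a standalone sketch. It is not the gem's code: `normalize_output` is a hypothetical helper that avoids patching core classes and only mirrors the behavior visible in this diff, which is to unwrap nested single-item arrays and then guarantee an Array.

```ruby
# Hypothetical standalone equivalent of the clean_output/ensure_array pair.
def normalize_output(obj)
  output = obj
  # Peel off single-item wrappers such as [[value]]
  output = output[0] while output.is_a?(Array) && output.count == 1
  # Re-wrap anything that is not already an Array
  output.is_a?(Array) ? output : [output]
end

p normalize_output([[{ 'id' => 'whats-next' }]])
p normalize_output('raw string')
p normalize_output([[1, 2]])
p normalize_output([])
```

A multi-item array passes through unchanged once its outer wrappers are removed, and an empty array stays empty.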
data/lib/curly/curl/html.rb
CHANGED
@@ -16,6 +16,12 @@ module Curl
|
|
16
16
|
attr_reader :url, :code, :meta, :links, :head, :body,
|
17
17
|
:title, :description, :body_links, :body_images
|
18
18
|
|
19
|
+
# Convert self to a hash of data
|
20
|
+
#
|
21
|
+
# @param url [String] A base url to fall back to
|
22
|
+
#
|
23
|
+
# @return [Hash] a hash of data
|
24
|
+
#
|
19
25
|
def to_data(url: nil)
|
20
26
|
{
|
21
27
|
url: @url || url,
|
@@ -68,12 +74,23 @@ module Curl
|
|
68
74
|
@url = url.nil? ? options[:url] : url
|
69
75
|
end
|
70
76
|
|
77
|
+
##
|
78
|
+
# Parse raw HTML source instead of curling
|
79
|
+
#
|
80
|
+
# @param source [String] The source
|
81
|
+
#
|
82
|
+
#
|
83
|
+
# @return [Hash] Hash of data after processing #
|
84
|
+
#
|
71
85
|
def parse(source)
|
72
86
|
@body = source
|
73
87
|
{ url: @url, code: @code, headers: @headers, meta: @meta, links: @links, head: @head, body: source,
|
74
88
|
source: source.strip, body_links: content_links, body_images: content_images }
|
75
89
|
end
|
76
90
|
|
91
|
+
##
|
92
|
+
## Curl a url, either with curl or Selenium based on browser settings
|
93
|
+
##
|
77
94
|
def curl
|
78
95
|
res = if @url && @browser && @browser != :none
|
79
96
|
source = curl_dynamic_html
|
@@ -283,6 +300,11 @@ module Curl
|
|
283
300
|
output
|
284
301
|
end
|
285
302
|
|
303
|
+
##
|
304
|
+
## String representation
|
305
|
+
##
|
306
|
+
## @return String representation of the object.
|
307
|
+
##
|
286
308
|
def to_s
|
287
309
|
headers = @headers.nil? ? 0 : @headers.count
|
288
310
|
meta = @meta.nil? ? 0 : @meta.count
|
data/lib/curly/hash.rb
CHANGED
@@ -2,6 +2,14 @@
|
|
2
2
|
|
3
3
|
# Hash helpers
|
4
4
|
class ::Hash
|
5
|
+
## Convert a Curly object to data hash
|
6
|
+
##
|
7
|
+
## @return [Hash] return a hash with keys renamed and
|
8
|
+
## cleaned up
|
9
|
+
##
|
10
|
+
## @param url [String] A url to fall back to
|
11
|
+
## @param clean [Boolean] Clean extra spaces and newlines in sources
|
12
|
+
##
|
5
13
|
def to_data(url: nil, clean: false)
|
6
14
|
if key?(:body_links)
|
7
15
|
{
|
@@ -23,17 +31,33 @@ class ::Hash
|
|
23
31
|
end
|
24
32
|
end
|
25
33
|
|
34
|
+
##
|
35
|
+
## Return the raw HTML of the object
|
36
|
+
##
|
37
|
+
## @return [String] Html representation of the object.
|
38
|
+
##
|
26
39
|
def to_html
|
27
40
|
if key?(:source)
|
28
41
|
self[:source]
|
29
42
|
end
|
30
43
|
end
|
31
44
|
|
45
|
+
##
|
46
|
+
## Get a value from the hash using a dot-syntax query
|
47
|
+
##
|
48
|
+
## @param query [String] The query (dot notation)
|
49
|
+
##
|
50
|
+
## @return [Object] result of querying the hash
|
51
|
+
##
|
32
52
|
def get_value(query)
|
33
53
|
return nil if self.empty?
|
54
|
+
stringify_keys!
|
55
|
+
|
34
56
|
query.split('.').inject(self) do |v, k|
|
35
|
-
|
36
|
-
|
57
|
+
return v.map { |el| el.get_value(k) } if v.is_a? Array
|
58
|
+
# k = k.to_i if v.is_a? Array
|
59
|
+
next v unless v.key?(k)
|
60
|
+
|
37
61
|
v.fetch(k)
|
38
62
|
end
|
39
63
|
end
|
@@ -42,7 +66,7 @@ class ::Hash
|
|
42
66
|
#
|
43
67
|
# @param path [String] The path
|
44
68
|
#
|
45
|
-
# @return Result of path query
|
69
|
+
# @return [Object] Result of path query
|
46
70
|
#
|
47
71
|
def dot_query(path, root = nil, full_tag: true)
|
48
72
|
res = stringify_keys
|
@@ -52,12 +76,17 @@ class ::Hash
|
|
52
76
|
return res.get_value(path)
|
53
77
|
end
|
54
78
|
|
55
|
-
|
79
|
+
path.gsub!(/\[(.*?)\]/) do
|
80
|
+
inter = Regexp.last_match(1).gsub(/\./, '%')
|
81
|
+
"[#{inter}]"
|
82
|
+
end
|
83
|
+
|
56
84
|
out = []
|
57
85
|
q = path.split(/(?<![\d.])\./)
|
58
86
|
|
59
87
|
while q.count.positive?
|
60
88
|
pth = q.shift
|
89
|
+
pth.gsub!(/%/, '.')
|
61
90
|
|
62
91
|
return nil if res.nil?
|
63
92
|
|
@@ -70,8 +99,8 @@ class ::Hash
|
|
70
99
|
|
71
100
|
ats = []
|
72
101
|
at = []
|
73
|
-
while pth =~ /\[[+&,]
|
74
|
-
m = pth.match(/\[(?<com>[,+&])? *(?<key
|
102
|
+
while pth =~ /\[[+&,]?[\w.]+( *[\^*$=<>]=? *\w+)?/
|
103
|
+
m = pth.match(/\[(?<com>[,+&])? *(?<key>[\w.]+)( *(?<op>[\^*$=<>]{1,2}) *(?<val>[^,&\]]+))? */)
|
75
104
|
|
76
105
|
comp = [m['key'], m['op'], m['val']]
|
77
106
|
case m['com']
|
@@ -82,7 +111,7 @@ class ::Hash
|
|
82
111
|
at.push(comp)
|
83
112
|
end
|
84
113
|
|
85
|
-
pth.sub!(/\[(?<com>[,&+])? *(?<key
|
114
|
+
pth.sub!(/\[(?<com>[,&+])? *(?<key>[\w.]+)( *(?<op>[\^*$=<>]{1,2}) *(?<val>[^,&\]]+))?/, '[')
|
86
115
|
end
|
87
116
|
ats.push(at) unless at.empty?
|
88
117
|
pth.sub!(/\[\]/, '')
|
@@ -110,11 +139,11 @@ class ::Hash
|
|
110
139
|
pth = ''
|
111
140
|
|
112
141
|
return false if res.nil?
|
142
|
+
|
113
143
|
if ats.count.positive?
|
114
144
|
while ats.count.positive?
|
115
145
|
atr = ats.shift
|
116
146
|
res = [res] if res.is_a?(Hash)
|
117
|
-
|
118
147
|
res.each do |r|
|
119
148
|
out.push(full_tag ? tag : r) if evaluate_comp(r, atr)
|
120
149
|
end
|
@@ -140,6 +169,32 @@ class ::Hash
|
|
140
169
|
out
|
141
170
|
end
|
142
171
|
|
172
|
+
##
|
173
|
+
## Test if values in an array match an operator
|
174
|
+
##
|
175
|
+
## @param array [Array] The array
|
176
|
+
## @param key [String] The key
|
177
|
+
## @param comp [String] The comparison, e.g. *= or $=
|
178
|
+
##
|
179
|
+
## @return [Boolean] true if array contains match
|
180
|
+
def array_match(array, key, comp)
|
181
|
+
keep = false
|
182
|
+
array.each do |el|
|
183
|
+
keep = case comp
|
184
|
+
when /^\^/
|
185
|
+
key =~ /^#{el}/i ? true : false
|
186
|
+
when /^\$/
|
187
|
+
key =~ /#{el}$/i ? true : false
|
188
|
+
when /^\*/
|
189
|
+
key =~ /#{el}/i ? true : false
|
190
|
+
else
|
191
|
+
key =~ /^#{el}$/i ? true : false
|
192
|
+
end
|
193
|
+
break if keep
|
194
|
+
end
|
195
|
+
keep
|
196
|
+
end
|
197
|
+
|
143
198
|
##
|
144
199
|
## Evaluate a comparison
|
145
200
|
##
|
@@ -165,40 +220,57 @@ class ::Hash
|
|
165
220
|
end
|
166
221
|
r = r.get_value(key.to_s) if key.to_s =~ /\./
|
167
222
|
|
168
|
-
|
223
|
+
if val.nil?
|
224
|
+
if r.is_a?(Hash)
|
225
|
+
return r.key?(key) && !r[key].nil? && !r[key].empty?
|
226
|
+
elsif r.is_a?(String)
|
227
|
+
return r.nil? ? false : true
|
228
|
+
elsif r.is_a?(Array)
|
229
|
+
return r.empty? ? false : true
|
230
|
+
end
|
231
|
+
end
|
169
232
|
|
170
|
-
if
|
233
|
+
if r.nil?
|
171
234
|
keep = false
|
172
|
-
elsif r
|
173
|
-
valid = r
|
174
|
-
|
175
|
-
|
176
|
-
k =~ /^#{a[2]}/i ? true : false
|
177
|
-
when /^\$/
|
178
|
-
k =~ /#{a[2]}$/i ? true : false
|
179
|
-
when /^\*/
|
180
|
-
k =~ /#{a[2]}/i ? true : false
|
235
|
+
elsif r.is_a?(Array)
|
236
|
+
valid = r.filter do |k|
|
237
|
+
if k.is_a? Array
|
238
|
+
array_match(k, a[2], a[1])
|
181
239
|
else
|
182
|
-
|
240
|
+
case a[1]
|
241
|
+
when /^\^/
|
242
|
+
k =~ /^#{a[2]}/i ? true : false
|
243
|
+
when /^\$/
|
244
|
+
k =~ /#{a[2]}$/i ? true : false
|
245
|
+
when /^\*/
|
246
|
+
k =~ /#{a[2]}/i ? true : false
|
247
|
+
else
|
248
|
+
k =~ /^#{a[2]}$/i ? true : false
|
249
|
+
end
|
183
250
|
end
|
184
251
|
end
|
185
252
|
|
186
253
|
keep = valid.count.positive?
|
187
254
|
elsif val.is_a?(Numeric) && a[1] =~ /^[<>=]{1,2}$/
|
188
|
-
k = r
|
255
|
+
k = r.to_i
|
189
256
|
comp = a[1] =~ /^=$/ ? '==' : a[1]
|
190
257
|
keep = eval("#{k}#{comp}#{val}")
|
191
258
|
else
|
192
|
-
|
193
|
-
|
194
|
-
|
195
|
-
|
196
|
-
|
197
|
-
|
198
|
-
|
199
|
-
|
200
|
-
|
201
|
-
|
259
|
+
v = r.is_a?(Hash) ? r[key] : r
|
260
|
+
if v.is_a? Array
|
261
|
+
keep = array_match(v, a[2], a[1])
|
262
|
+
else
|
263
|
+
keep = case a[1]
|
264
|
+
when /^\^/
|
265
|
+
v =~ /^#{a[2]}/i ? true : false
|
266
|
+
when /^\$/
|
267
|
+
v =~ /#{a[2]}$/i ? true : false
|
268
|
+
when /^\*/
|
269
|
+
v =~ /#{a[2]}/i ? true : false
|
270
|
+
else
|
271
|
+
v =~ /^#{a[2]}$/i ? true : false
|
272
|
+
end
|
273
|
+
end
|
202
274
|
end
|
203
275
|
|
204
276
|
return false unless keep
|
@@ -306,7 +378,32 @@ class ::Hash
|
|
306
378
|
end
|
307
379
|
end
|
308
380
|
|
381
|
+
##
|
382
|
+
## Destructive version of #stringify_keys
|
383
|
+
##
|
384
|
+
## @see #stringify_keys
|
385
|
+
##
|
309
386
|
def stringify_keys!
|
310
387
|
replace stringify_keys
|
311
388
|
end
|
389
|
+
|
390
|
+
##
|
391
|
+
## Clean up empty arrays and return an array with one or
|
392
|
+
## more elements
|
393
|
+
##
|
394
|
+
## @return [Array] output array
|
395
|
+
##
|
396
|
+
def clean_output
|
397
|
+
output = ensure_array
|
398
|
+
output.clean_output
|
399
|
+
end
|
400
|
+
|
401
|
+
##
|
402
|
+
## Ensure that an object is an array
|
403
|
+
##
|
404
|
+
## @return [Array] object as Array
|
405
|
+
##
|
406
|
+
def ensure_array
|
407
|
+
return [self]
|
408
|
+
end
|
312
409
|
end
|
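The `dot_query` change above protects dots inside square brackets by swapping them for `%` before the path is split on dots, then restores them per segment. Here is a minimal sketch of that idea in isolation, assuming query segments never contain a literal `%`; `split_query` is a hypothetical name, and it works on a copy instead of mutating the path with `gsub!`.

```ruby
# Hypothetical sketch of the bracket-protecting split used for dot-syntax queries.
def split_query(path)
  # Hide dots inside [...] so they survive the split
  protected_path = path.gsub(/\[(.*?)\]/) do
    inner = Regexp.last_match(1).gsub('.', '%')
    "[#{inner}]"
  end

  # Split on dots that are not preceded by a digit or another dot,
  # then restore the hidden dots in each segment
  protected_path.split(/(?<![\d.])\./).map { |segment| segment.gsub('%', '.') }
end

p split_query('[attrs.id*=what].source')
p split_query('images[1..4]')
p split_query('meta.title')
```

Without the protection step, `[attrs.id*=what].source` would be split at the dot inside the brackets and the comparison would never be parsed as one unit.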
data/lib/curly/string.rb
CHANGED
@@ -6,6 +6,11 @@
|
|
6
6
|
## @return [String] cleaned string
|
7
7
|
##
|
8
8
|
class ::String
|
9
|
+
## Remove extra spaces and newlines, compress space
|
10
|
+
## between tags
|
11
|
+
##
|
12
|
+
## @return [String] cleaned string
|
13
|
+
##
|
9
14
|
def clean
|
10
15
|
gsub(/[\t\n ]+/m, ' ').gsub(/> +</, '><')
|
11
16
|
end
|
@@ -40,7 +45,7 @@ class ::String
|
|
40
45
|
##
|
41
46
|
## Convert an image type string to a symbol
|
42
47
|
##
|
43
|
-
## @return Symbol :srcset, :img, :opengraph, :all
|
48
|
+
## @return [Symbol] :srcset, :img, :opengraph, :all
|
44
49
|
##
|
45
50
|
def normalize_image_type(default = :all)
|
46
51
|
case self.to_s
|
@@ -58,7 +63,7 @@ class ::String
|
|
58
63
|
##
|
59
64
|
## Convert a browser type string to a symbol
|
60
65
|
##
|
61
|
-
## @return Symbol :chrome, :firefox
|
66
|
+
## @return [Symbol] :chrome, :firefox
|
62
67
|
##
|
63
68
|
def normalize_browser_type(default = :none)
|
64
69
|
case self.to_s
|
@@ -74,7 +79,7 @@ class ::String
|
|
74
79
|
##
|
75
80
|
## Convert a screenshot type string to a symbol
|
76
81
|
##
|
77
|
-
## @return Symbol :full_page, :print_page, :visible
|
82
|
+
## @return [Symbol] :full_page, :print_page, :visible
|
78
83
|
##
|
79
84
|
def normalize_screenshot_type(default = :none)
|
80
85
|
case self.to_s
|
@@ -88,4 +93,23 @@ class ::String
|
|
88
93
|
default.is_a?(Symbol) ? default.to_sym : default.normalize_browser_type
|
89
94
|
end
|
90
95
|
end
|
96
|
+
|
97
|
+
##
|
98
|
+
## Clean up output and return a single-item array
|
99
|
+
##
|
100
|
+
## @return [Array] output array
|
101
|
+
##
|
102
|
+
def clean_output
|
103
|
+
output = ensure_array
|
104
|
+
output.clean_output
|
105
|
+
end
|
106
|
+
|
107
|
+
##
|
108
|
+
## Ensure that an object is an array
|
109
|
+
##
|
110
|
+
## @return [Array] object as Array
|
111
|
+
##
|
112
|
+
def ensure_array
|
113
|
+
return [self]
|
114
|
+
end
|
91
115
|
end
|
data/lib/curly/version.rb
CHANGED
data/lib/curly.rb
CHANGED
data/src/_README.md
CHANGED
@@ -10,9 +10,12 @@ _If you find this useful, feel free to [buy me some coffee][donate]._
|
|
10
10
|
[donate]: https://brettterpstra.com/donate
|
11
11
|
<!--END GITHUB-->
|
12
12
|
|
13
|
-
|
13
|
+
[jq]: https://github.com/jqlang/jq "Command-line JSON processor"
|
14
|
+
[yq]: https://github.com/mikefarah/yq "yq is a portable command-line YAML, JSON, XML, CSV, TOML and properties processor"
|
14
15
|
|
15
|
-
|
16
|
+
The current version of `curlyq` is <!--VER-->0.0.9<!--END VER-->.
|
17
|
+
|
18
|
+
CurlyQ is a utility that provides a simple interface for curl, with additional features for things like extracting images and links, finding elements by CSS selector or XPath, getting detailed header info, and more. It's designed to be part of a scripting pipeline, outputting everything as structured data (JSON or YAML). It also has rudimentary support for making calls to JSON endpoints easier, but it's expected that you'll use something like [jq] to parse the output.
|
16
19
|
|
17
20
|
[github]: https://github.com/ttscoff/curlyq/
|
18
21
|
|
@@ -45,6 +48,9 @@ You can shape the results using `--search` (`-s`) and `--query` (`-q`) on some c
|
|
45
48
|
|
46
49
|
A search uses either CSS or XPath syntax to locate elements. For example, if you wanted to locate all of the `<article>` elements with a class of `post` inside of the div with an id of `main`, you would run `--search '#main article.post'`. Searches can target tags, ids, and classes, and can accept `>` to target direct descendents. You can also use XPaths, but I hate those so I'm not going to document them.
|
47
50
|
|
51
|
+
> I've tried to make the query function useful, but if you want to do any kind of advanced shaping, you're better off piping the JSON output to [jq] or [yq].
|
52
|
+
<!--JEKYLL{:.warn}-->
|
53
|
+
|
48
54
|
Queries are specifically for shaping CurlyQ output. If you're using the `html` command, it returns a key called `images`, so you can target just the images in the response with `-q 'images'`. The queries accept array syntax, so to get the first image, you would use `-q 'images[0]'`. Ranges are accepted as well, so `-q 'images[1..4]'` will return the 2nd through 5th images found on the page. You can also do comparisons, e.g. `images[rel=me]'` to target only images with a `rel` attribute of `me`.
|
49
55
|
|
50
56
|
The comparisons for the query flag are:
|
@@ -58,6 +64,16 @@ The comparisons for the query flag are:
|
|
58
64
|
- `^=` starts with text
|
59
65
|
- `$=` ends with text
|
60
66
|
|
67
|
+
Comparisons can be numeric or string comparisons. A numeric comparison like `curlyq images -q '[width>500]' URL` would return all of the images on the page with a width attribute greater than 500.
|
68
|
+
|
69
|
+
You can also use dot syntax inside of comparisons, e.g. `[links.rel*=me]` to target the links object (`html` command), and return only the links with a `rel=me` attribute. If the comparison is to an array object (like `class` or `rel`), it will match if any of the elements of the array match your comparison.
|
70
|
+
|
71
|
+
If you end the query with a specific key, only that key will be output. If there's only one match, it will be output as a raw string. If there are multiple matches, output will be an array:
|
72
|
+
|
73
|
+
curlyq tags --search '#main .post h3' -q '[attrs.id*=what].source' 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/'
|
74
|
+
|
75
|
+
<h3 id="whats-next">What’s Next</h3>
|
76
|
+
|
61
77
|
#### Commands
|
62
78
|
|
63
79
|
curlyq makes use of subcommands, e.g. `curlyq html [options] URL` or `curlyq extract [options] URL`. Each subcommand takes its own options, but I've made an effort to standardize the choices between each command as much as possible.
|
@@ -314,7 +330,7 @@ Example:
|
|
314
330
|
|
315
331
|
Return a hierarchy of all tags in a page. Use `-t` to limit to a specific tag.
|
316
332
|
|
317
|
-
curlyq tags --search '#main .post h3' -q 'attrs
|
333
|
+
curlyq tags --search '#main .post h3' -q '[attrs.id*=what]' https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
|
318
334
|
|
319
335
|
[
|
320
336
|
{
|
data/test/curlyq_headlinks_test.rb
CHANGED
@@ -17,8 +17,9 @@ class CurlyQHeadlinksTest < Test::Unit::TestCase
|
|
17
17
|
result = curlyq('headlinks', '-q', '[rel=stylesheet]', 'https://brettterpstra.com')
|
18
18
|
json = JSON.parse(result)
|
19
19
|
|
20
|
-
|
21
|
-
assert_match(/
|
20
|
+
assert_equal(Array, json.class, 'Result should be an array')
|
21
|
+
assert_match(/stylesheet/, json[0]['rel'], 'Should have retrieved a single result with rel stylesheet')
|
22
|
+
assert_match(/screen\.\d+\.css$/, json[0]['href'], 'Stylesheet should be correct primary stylesheet')
|
22
23
|
end
|
23
24
|
|
24
25
|
def test_headlinks
|
data/test/curlyq_html_test.rb
CHANGED
@@ -14,12 +14,12 @@ class CurlyQHtmlTest < Test::Unit::TestCase
|
|
14
14
|
result = curlyq('html', '-s', '#main article .aligncenter', '-q', 'images[1]', 'https://brettterpstra.com')
|
15
15
|
json = JSON.parse(result)
|
16
16
|
|
17
|
-
assert_match(/aligncenter/, json['class'], 'Should have found an image with class "aligncenter"')
|
17
|
+
assert_match(/aligncenter/, json[0]['class'], 'Should have found an image with class "aligncenter"')
|
18
18
|
end
|
19
19
|
|
20
20
|
def test_html_query
|
21
21
|
result = curlyq('html', '-q', 'meta.title', 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/')
|
22
|
-
|
23
|
-
assert_match(/Introducing CurlyQ/,
|
22
|
+
json = JSON.parse(result)
|
23
|
+
assert_match(/Introducing CurlyQ/, json[0], 'Should have retrived the page title')
|
24
24
|
end
|
25
25
|
end
|
data/test/curlyq_scrape_test.rb
CHANGED
@@ -11,12 +11,42 @@ class CurlyQScrapeTest < Test::Unit::TestCase
|
|
11
11
|
include CurlyQHelpers
|
12
12
|
|
13
13
|
def setup
|
14
|
+
@screenshot = File.join(File.dirname(__FILE__), 'screenshot_test')
|
15
|
+
FileUtils.rm_f("#{@screenshot}.pdf") if File.exist?("#{@screenshot}.pdf")
|
16
|
+
FileUtils.rm_f('screenshot_test.png') if File.exist?("#{@screenshot}.png")
|
17
|
+
FileUtils.rm_f("#{@screenshot}_full.png") if File.exist?("#{@screenshot}_full.png")
|
14
18
|
end
|
15
19
|
|
16
|
-
def
|
20
|
+
def teardown
|
21
|
+
FileUtils.rm_f("#{@screenshot}.pdf") if File.exist?("#{@screenshot}.pdf")
|
22
|
+
FileUtils.rm_f('screenshot_test.png') if File.exist?("#{@screenshot}.png")
|
23
|
+
FileUtils.rm_f("#{@screenshot}_full.png") if File.exist?("#{@screenshot}_full.png")
|
24
|
+
end
|
25
|
+
|
26
|
+
def test_scrape_firefox
|
17
27
|
result = curlyq('scrape', '-b', 'firefox', '-q', 'links[rel=me&content*=mastodon][0]', 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/')
|
18
28
|
json = JSON.parse(result)
|
19
29
|
|
20
|
-
|
30
|
+
assert_equal(Array, json.class, 'Result should be an Array')
|
31
|
+
assert_match(/Mastodon/, json[0]['content'], 'Should have retrieved a Mastodon link')
|
32
|
+
end
|
33
|
+
|
34
|
+
def test_scrape_chrome
|
35
|
+
result = curlyq('scrape', '-b', 'chrome', '-q', 'links[rel=me&content*=mastodon][0]', 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/')
|
36
|
+
json = JSON.parse(result)
|
37
|
+
|
38
|
+
assert_equal(Array, json.class, 'Result should be an Array')
|
39
|
+
assert_match(/Mastodon/, json[0]['content'], 'Should have retrieved a Mastodon link')
|
40
|
+
end
|
41
|
+
|
42
|
+
def test_screenshot
|
43
|
+
curlyq('screenshot', '-b', 'firefox', '-o', @screenshot, '-t', 'print', 'https://brettterpstra.com')
|
44
|
+
assert(File.exist?("#{@screenshot}.pdf"), 'PDF Screenshot should exist')
|
45
|
+
|
46
|
+
curlyq('screenshot', '-b', 'chrome', '-o', @screenshot, '-t', 'visible', 'https://brettterpstra.com')
|
47
|
+
assert(File.exist?("#{@screenshot}.png"), 'PNG Screenshot should exist')
|
48
|
+
|
49
|
+
curlyq('screenshot', '-b', 'firefox', '-o', "#{@screenshot}_full", '-t', 'full', 'https://brettterpstra.com')
|
50
|
+
assert(File.exist?("#{@screenshot}_full.png"), 'PNG Screenshot should exist')
|
21
51
|
end
|
22
52
|
end
|
data/test/curlyq_tags_test.rb
CHANGED
@@ -14,18 +14,26 @@ class CurlyQTagsTest < Test::Unit::TestCase
|
|
14
14
|
end
|
15
15
|
|
16
16
|
def test_tags
|
17
|
-
result = curlyq('tags', '--search', '#main .post h3', '
|
17
|
+
result = curlyq('tags', '--search', '#main .post h3', 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/')
|
18
18
|
json = JSON.parse(result)
|
19
19
|
|
20
|
-
assert_equal(json.
|
21
|
-
|
20
|
+
assert_equal(Array, json.class, 'Should be an array of matches')
|
21
|
+
assert_equal(6, json.count, 'Should be six results')
|
22
22
|
end
|
23
23
|
|
24
24
|
def test_clean
|
25
25
|
result = curlyq('tags', '--search', '#main section.related', '--clean', 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/')
|
26
26
|
json = JSON.parse(result)
|
27
27
|
|
28
|
-
assert_equal(json.
|
28
|
+
assert_equal(Array, json.class, 'Should be a single Array')
|
29
|
+
assert_equal(1, json.count, 'Should be one element')
|
29
30
|
assert_match(%r{Last.fm</h5></a></li>}, json[0]['source'], 'Should have matched #whats-next')
|
30
31
|
end
|
32
|
+
|
33
|
+
def test_query
|
34
|
+
result = curlyq('tags', '--search', '#main .post h3', '-q', '[attrs.id*=what].source', 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/')
|
35
|
+
json = JSON.parse(result)
|
36
|
+
assert_equal(Array, json.class, 'Should be an array')
|
37
|
+
assert_match(%r{^<h3 id="whats-next">What’s Next</h3>$}, json[0], 'Should have returned just source')
|
38
|
+
end
|
31
39
|
end
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: curlyq
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.0.
|
4
|
+
version: 0.0.10
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Brett Terpstra
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2024-01-
|
11
|
+
date: 2024-01-17 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: rake
|
@@ -236,6 +236,7 @@ files:
|
|
236
236
|
- lib/curly/curl/html.rb
|
237
237
|
- lib/curly/curl/json.rb
|
238
238
|
- lib/curly/hash.rb
|
239
|
+
- lib/curly/numeric.rb
|
239
240
|
- lib/curly/string.rb
|
240
241
|
- lib/curly/version.rb
|
241
242
|
- src/_README.md
|