curlyq 0.0.8 → 0.0.10

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: d3e32b382d7318b067ee3fb22f2e9057cf6aa9facfac41c74a0ebb5d4fb4743d
- data.tar.gz: d379da3f0db621052e61230356f5c58b587eefccbb0a4c997216516a4159b44a
+ metadata.gz: 6109483b8869733f9e21ecab9bc8bcda0aa3b58ca1f13f9b96fe7739d019df1f
+ data.tar.gz: 98a8d46fe68bc88ea030dfb8e04262fbab5418005390ff79693d6f636a3bf276
  SHA512:
- metadata.gz: ae63654deb943771e5f6f3aa0f6a037b1015336abbd696a8ce77acc22f361a3b6a18b03f3b7d02e5c7d5dcaa8d3608248bed240679acfce22ba2e462d84b529f
- data.tar.gz: 481f8499e45a65cb3981fcf20ef7fc9f01f97a1b7014c6566aa2f3bf7a6611fd2d5d35f78e742e4063eea192b938c0642f0ca764e5032f330778d2815a191a41
+ metadata.gz: 1d75b4af2d6c1fadb83501fa707184ef41d061c08de14666b86d296048e8f21540fe2ad53a79985d5b042c93fa629cdbe8d101828edbb02832d1b55b920d5834
+ data.tar.gz: 238855918e3e765a2edf1864dd2663a959b099cfa5f1b89942f94eb20ba428c1700adee85590879662f0cf8de659328fbe752e8648ee210eefe0769639c57da2
data/CHANGELOG.md CHANGED
@@ -1,3 +1,23 @@
+ ### 0.0.10
+
+ 2024-01-17 13:50
+
+ #### IMPROVED
+
+ - Update YARD documentation
+ - Breaking change, ensure all return types are Arrays, even with single objects, to aid in scriptability
+ - Screenshot test suite
+
+ ### 0.0.9
+
+ 2024-01-16 12:38
+
+ #### IMPROVED
+
+ - You can now use dot syntax inside of a square bracket comparison in --query (`[attrs.id*=what]`)
+ - *=, ^=, $=, and == work with array values
+ - [] comparisons with no comparison, e.g. [attrs.id], will return every match that has that element populated
+
  ### 0.0.8
 
  2024-01-15 16:45
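The breaking change listed under 0.0.10 means a consuming script can treat every result as an array, whether one item or many matched. A minimal Ruby sketch of what that looks like on the consumer side follows. The JSON strings are made-up stand-ins for CurlyQ output (not captured from a real run), and `hrefs_from` is a hypothetical helper name used only for illustration.

```ruby
require 'json'

# Hypothetical stand-ins for CurlyQ JSON output; real output will differ.
SINGLE_MATCH = '[{"rel": "stylesheet", "href": "/css/screen.css"}]'
MANY_MATCHES = '[{"rel": "me", "href": "https://example.com/a"},' \
               ' {"rel": "me", "href": "https://example.com/b"}]'

# Collect the href of every result. Because the output is assumed to
# always be an Array (per the 0.0.10 changelog entry), no check for
# "is this a Hash or an Array?" is needed before iterating.
def hrefs_from(json_text)
  JSON.parse(json_text).map { |item| item['href'] }
end

puts hrefs_from(SINGLE_MATCH).inspect
puts hrefs_from(MANY_MATCHES).inspect
```

The same loop handles both inputs, which is the scriptability benefit the changelog entry describes.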
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- curlyq (0.0.8)
+ curlyq (0.0.10)
  gli (~> 2.21.0)
  nokogiri (~> 1.16.0)
  selenium-webdriver (~> 4.16.0)
data/README.md CHANGED
@@ -10,10 +10,13 @@ _If you find this useful, feel free to [buy me some coffee][donate]._
10
10
  [donate]: https://brettterpstra.com/donate
11
11
 
12
12
 
13
- The current version of `curlyq` is 0.0.8
13
+ [jq]: https://github.com/jqlang/jq "Command-line JSON processor"
14
+ [yq]: https://github.com/mikefarah/yq "yq is a portable command-line YAML, JSON, XML, CSV, TOML and properties processor"
15
+
16
+ The current version of `curlyq` is 0.0.10
14
17
  .
15
18
 
16
- CurlyQ is a utility that provides a simple interface for curl, with additional features for things like extracting images and links, finding elements by CSS selector or XPath, getting detailed header info, and more. It's designed to be part of a scripting pipeline, outputting everything as structured data (JSON or YAML). It also has rudimentary support for making calls to JSON endpoints easier, but it's expected that you'll use something like `jq` to parse the output.
19
+ CurlyQ is a utility that provides a simple interface for curl, with additional features for things like extracting images and links, finding elements by CSS selector or XPath, getting detailed header info, and more. It's designed to be part of a scripting pipeline, outputting everything as structured data (JSON or YAML). It also has rudimentary support for making calls to JSON endpoints easier, but it's expected that you'll use something like [jq] to parse the output.
17
20
 
18
21
  [github]: https://github.com/ttscoff/curlyq/
19
22
 
@@ -44,7 +47,7 @@ SYNOPSIS
44
47
  curlyq [global options] command [command options] [arguments...]
45
48
 
46
49
  VERSION
47
- 0.0.8
50
+ 0.0.10
48
51
 
49
52
  GLOBAL OPTIONS
50
53
  --help - Show this message
@@ -71,6 +74,9 @@ You can shape the results using `--search` (`-s`) and `--query` (`-q`) on some c
71
74
 
72
75
  A search uses either CSS or XPath syntax to locate elements. For example, if you wanted to locate all of the `<article>` elements with a class of `post` inside of the div with an id of `main`, you would run `--search '#main article.post'`. Searches can target tags, ids, and classes, and can accept `>` to target direct descendents. You can also use XPaths, but I hate those so I'm not going to document them.
73
76
 
77
+ > I've tried to make the query function useful, but if you want to do any kind of advanced shaping, you're better off piping the JSON output to [jq] or [yq].
78
+
79
+
74
80
  Queries are specifically for shaping CurlyQ output. If you're using the `html` command, it returns a key called `images`, so you can target just the images in the response with `-q 'images'`. The queries accept array syntax, so to get the first image, you would use `-q 'images[0]'`. Ranges are accepted as well, so `-q 'images[1..4]'` will return the 2nd through 5th images found on the page. You can also do comparisons, e.g. `images[rel=me]'` to target only images with a `rel` attribute of `me`.
75
81
 
76
82
  The comparisons for the query flag are:
@@ -84,6 +90,16 @@ The comparisons for the query flag are:
84
90
  - `^=` starts with text
85
91
  - `$=` ends with text
86
92
 
93
+ Comparisons can be numeric or string comparisons. A numeric comparison like `curlyq images -q '[width>500]' URL` would return all of the images on the page with a width attribute greater than 500.
94
+
95
+ You can also use dot syntax inside of comparisons, e.g. `[links.rel*=me]` to target the links object (`html` command), and return only the links with a `rel=me` attribute. If the comparison is to an array object (like `class` or `rel`), it will match if any of the elements of the array match your comparison.
96
+
97
+ If you end the query with a specific key, only that key will be output. If there's only one match, it will be output as a raw string. If there are multiple matches, output will be an array:
98
+
99
+ curlyq tags --search '#main .post h3' -q '[attrs.id*=what].source' 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/'
100
+
101
+ <h3 id="whats-next">What’s Next</h3>
102
+
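To make the comparison rules in the README text above concrete, here is a rough Ruby sketch of how a bracket comparison with a dot-syntax key could be evaluated, including the "match if any array element matches" behavior. This is an illustration under assumptions, not CurlyQ's implementation: `OPERATORS`, `dig_path`, `comparison_match?`, and the sample `TAGS` data are all hypothetical, and only the string operators are covered.

```ruby
# Map each string operator to a case-insensitive test of a candidate
# string against the wanted text. (Hypothetical table for illustration.)
OPERATORS = {
  '*=' => ->(s, want) { s.downcase.include?(want.downcase) },
  '^=' => ->(s, want) { s.downcase.start_with?(want.downcase) },
  '$=' => ->(s, want) { s.downcase.end_with?(want.downcase) },
  '==' => ->(s, want) { s.casecmp?(want) }
}.freeze

# Follow a dot-syntax path such as "attrs.id" through nested hashes.
# Returns nil as soon as a step cannot be followed.
def dig_path(item, path)
  path.split('.').reduce(item) { |value, key| value.is_a?(Hash) ? value[key] : nil }
end

# True when the value at `path` passes the operator. If the value is an
# array (like class or rel), any single matching element is enough.
def comparison_match?(item, path, operator, wanted)
  test = OPERATORS.fetch(operator)
  candidates = [dig_path(item, path)].flatten.compact
  candidates.any? { |candidate| test.call(candidate.to_s, wanted) }
end

# Made-up sample data shaped loosely like tag results.
TAGS = [
  { 'tag' => 'h3', 'attrs' => { 'id' => 'whats-next', 'class' => %w[title big] } },
  { 'tag' => 'h3', 'attrs' => { 'id' => 'intro', 'class' => %w[title] } }
].freeze

matches = TAGS.select { |t| comparison_match?(t, 'attrs.id', '*=', 'what') }
puts matches.map { |t| t['attrs']['id'] }.inspect
```

Numeric comparisons such as `[width>500]` would need a separate branch and are left out of this sketch.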
87
103
  #### Commands
88
104
 
89
105
  curlyq makes use of subcommands, e.g. `curlyq html [options] URL` or `curlyq extract [options] URL`. Each subcommand takes its own options, but I've made an effort to standardize the choices between each command as much as possible.
@@ -440,7 +456,7 @@ COMMAND OPTIONS
440
456
 
441
457
  Return a hierarchy of all tags in a page. Use `-t` to limit to a specific tag.
442
458
 
443
- curlyq tags --search '#main .post h3' -q 'attrs[id*=what]' https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
459
+ curlyq tags --search '#main .post h3' -q '[attrs.id*=what]' https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
444
460
 
445
461
  [
446
462
  {
data/Rakefile CHANGED
@@ -56,6 +56,23 @@ task :test, :pattern, :threads, :max_tests do |_, args|
56
56
  ThreadedTests.new.run(pattern: pattern, max_threads: args[:threads].to_i, max_tests: args[:max_tests])
57
57
  end
58
58
 
59
+ desc 'Install current gem in all versions of asdf-controlled ruby'
60
+ task :install do
61
+ Rake::Task['clobber'].invoke
62
+ Rake::Task['package'].invoke
63
+ Dir.chdir 'pkg'
64
+ file = Dir.glob('*.gem').last
65
+
66
+ current_ruby = `asdf current ruby`.match(/(\d.\d+.\d+)/)[1]
67
+
68
+ `asdf list ruby`.split.map { |ruby| ruby.strip.sub(/^*/, '') }.each do |ruby|
69
+ `asdf shell ruby #{ruby}`
70
+ puts `gem install #{file}`
71
+ end
72
+
73
+ `asdf shell ruby #{current_ruby}`
74
+ end
75
+
59
76
  desc 'Development version check'
60
77
  task :ver do
61
78
  gver = `git ver`
data/bin/curlyq CHANGED
@@ -49,7 +49,7 @@ end
49
49
  def self.print_out(output, yaml, raw: false, pretty: true)
50
50
  output = output.to_data if output.respond_to?(:to_data)
51
51
  # Was intended to flatten single responses, but not getting an array back is unpredictable
52
- # output = output[0] if output&.is_a?(Array) && output.count == 1
52
+ output = output.clean_output
53
53
  if output.is_a?(String)
54
54
  print output
55
55
  elsif raw
@@ -130,13 +130,13 @@ command %i[html curl] do |c|
130
130
  out = res.parse(source)
131
131
 
132
132
  if options[:query]
133
- out = out.to_data(url: url, clean: options[:clean]).dot_query(options[:query])
133
+ out = out.to_data(url: url, clean: options[:clean]).dot_query(options[:query], full_tag: false)
134
134
  else
135
135
  out = out.to_data
136
136
  end
137
137
  output.push([out])
138
138
  elsif options[:query]
139
- queried = res.to_data.dot_query(options[:query])
139
+ queried = res.to_data.dot_query(options[:query], full_tag: false)
140
140
  output.push(queried) if queried
141
141
  else
142
142
  output.push(res.to_data(url: url))
@@ -144,14 +144,9 @@ command %i[html curl] do |c|
144
144
  end
145
145
  output.delete_if(&:nil?)
146
146
  output.delete_if(&:empty?)
147
- # output = output[0] if output.count == 1
148
147
  output.map! { |o| o[options[:raw].to_sym] } if options[:raw]
149
148
 
150
- if output.is_a?(Array)
151
- while output.length == 1
152
- output = output[0]
153
- end
154
- end
149
+ output = output.clean_output
155
150
 
156
151
  print_out(output, global_options[:yaml], raw: options[:raw], pretty: global_options[:pretty])
157
152
  end
@@ -246,7 +241,7 @@ command :json do |c|
246
241
  end
247
242
  end
248
243
 
249
- # output = output[0] if output.count == 1
244
+ output = output.clean_output
250
245
 
251
246
  print_out(output, global_options[:yaml], pretty: global_options[:pretty])
252
247
  end
@@ -356,7 +351,7 @@ command :tags do |c|
356
351
  end
357
352
  end
358
353
 
359
- output = output[0] if output.count == 1
354
+ output = output.clean_output
360
355
 
361
356
  if options[:source]
362
357
  puts output.to_html
@@ -480,7 +475,7 @@ command :headlinks do |c|
480
475
  end
481
476
  end
482
477
 
483
- output = output[0] if output.count == 1
478
+ output = output.clean_output
484
479
 
485
480
  print_out(output, global_options[:yaml], pretty: global_options[:pretty])
486
481
  end
@@ -531,7 +526,7 @@ command :scrape do |c|
531
526
 
532
527
  output.delete_if(&:empty?)
533
528
 
534
- output = output[0] if output.count == 1
529
+ output = output.clean_output
535
530
 
536
531
  if options[:raw]
537
532
  output.map! { |o| o[options[:raw].to_sym] }
data/lib/curly/array.rb CHANGED
@@ -66,7 +66,7 @@ class ::Array
66
66
  replace dedup_links
67
67
  end
68
68
 
69
- #---------------------------------------------------------
69
+ ##
70
70
  ## Run a query on array elements
71
71
  ##
72
72
  ## @param path [String] dot.syntax path to compare
@@ -80,17 +80,29 @@ class ::Array
80
80
  res
81
81
  end
82
82
 
83
+ ##
84
+ ## Gets the value of every item in the array
85
+ ##
86
+ ## @param path The query path (dot syntax)
87
+ ##
88
+ ## @return [Array] array of values
89
+ ##
83
90
  def get_value(path)
84
- res = map { |el| el.get_value(path) }
85
- res.is_a?(Array) && res.count == 1 ? res[0] : res
91
+ map { |el| el.get_value(path) }
86
92
  end
87
93
 
94
+ ##
95
+ ## Convert every item in the array to HTML
96
+ ##
97
+ ## @return [String] Html representation of the object.
98
+ ##
88
99
  def to_html
89
100
  map(&:to_html)
90
101
  end
91
102
 
92
103
  ##
93
- ## Test if a tag contains an attribute matching filter queries
104
+ ## Test if a tag contains an attribute matching filter
105
+ ## queries
94
106
  ##
95
107
  ## @param tag_name [String] The tag name
96
108
  ## @param classes [String] The classes to match
@@ -102,6 +114,8 @@ class ::Array
102
114
  ## @param value [String] The value to match
103
115
  ## @param descendant [Boolean] Check descendant tags
104
116
  ##
117
+ ## @return [Boolean] tag matches
118
+ ##
105
119
  def tag_match(tag_name, classes, id, attribute, operator, value, descendant: false)
106
120
  tag = self
107
121
  keep = true
@@ -155,4 +169,26 @@ class ::Array
155
169
  keep
156
170
  end
157
171
  end
172
+
173
+ ##
174
+ ## Clean up output, shrink single-item arrays, ensure array output
175
+ ##
176
+ ## @return [Array] cleaned up array
177
+ ##
178
+ def clean_output
179
+ output = dup
180
+ while output.is_a?(Array) && output.count == 1
181
+ output = output[0]
182
+ end
183
+ output.ensure_array
184
+ end
185
+
186
+ ##
187
+ ## Ensure that an object is an array
188
+ ##
189
+ ## @return [Array] object as Array
190
+ ##
191
+ def ensure_array
192
+ return self
193
+ end
158
194
  end
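The `clean_output` and `ensure_array` additions above peel away single-element array wrappers and then guarantee the result is still an array. The following standalone Ruby sketch shows the same idea without patching core classes. `clean_output_sketch` is a hypothetical name, and unlike the diff it wraps any non-array value (including `nil`), which is an assumption made here for simplicity.

```ruby
# Standalone sketch of the unwrap-then-rewrap behavior.
def clean_output_sketch(value)
  output = value
  # Peel away single-element array wrappers: [[[x]]] becomes x.
  output = output[0] while output.is_a?(Array) && output.count == 1
  # Re-wrap anything that is not already an array, which is the role
  # ensure_array plays for Hash, String, and Numeric in the diff.
  output.is_a?(Array) ? output : [output]
end

p clean_output_sketch([[{ 'id' => 'a' }]]) # nested single hash
p clean_output_sketch([[1, 2]])            # nested multi-element array
p clean_output_sketch('text')              # bare string
p clean_output_sketch([])                  # empty array stays empty
```

The effect is that redundant nesting is removed, but callers never receive a bare Hash or String.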
@@ -16,6 +16,12 @@ module Curl
16
16
  attr_reader :url, :code, :meta, :links, :head, :body,
17
17
  :title, :description, :body_links, :body_images
18
18
 
19
+ # Convert self to a hash of data
20
+ #
21
+ # @param url [String] A base url to fall back to
22
+ #
23
+ # @return [Hash] a hash of data
24
+ #
19
25
  def to_data(url: nil)
20
26
  {
21
27
  url: @url || url,
@@ -68,12 +74,23 @@ module Curl
68
74
  @url = url.nil? ? options[:url] : url
69
75
  end
70
76
 
77
+ ##
78
+ # Parse raw HTML source instead of curling
79
+ #
80
+ # @param source [String] The source
81
+ #
82
+ #
83
+ # @return [Hash] Hash of data after processing #
84
+ #
71
85
  def parse(source)
72
86
  @body = source
73
87
  { url: @url, code: @code, headers: @headers, meta: @meta, links: @links, head: @head, body: source,
74
88
  source: source.strip, body_links: content_links, body_images: content_images }
75
89
  end
76
90
 
91
+ ##
92
+ ## Curl a url, either with curl or Selenium based on browser settings
93
+ ##
77
94
  def curl
78
95
  res = if @url && @browser && @browser != :none
79
96
  source = curl_dynamic_html
@@ -283,6 +300,11 @@ module Curl
283
300
  output
284
301
  end
285
302
 
303
+ ##
304
+ ## String representation
305
+ ##
306
+ ## @return String representation of the object.
307
+ ##
286
308
  def to_s
287
309
  headers = @headers.nil? ? 0 : @headers.count
288
310
  meta = @meta.nil? ? 0 : @meta.count
data/lib/curly/hash.rb CHANGED
@@ -2,6 +2,14 @@
2
2
 
3
3
  # Hash helpers
4
4
  class ::Hash
5
+ ## Convert a Curly object to data hash
6
+ ##
7
+ ## @return [Hash] return a hash with keys renamed and
8
+ ## cleaned up
9
+ ##
10
+ ## @param url [String] A url to fall back to
11
+ ## @param clean [Boolean] Clean extra spaces and newlines in sources
12
+ ##
5
13
  def to_data(url: nil, clean: false)
6
14
  if key?(:body_links)
7
15
  {
@@ -23,17 +31,33 @@ class ::Hash
23
31
  end
24
32
  end
25
33
 
34
+ ##
35
+ ## Return the raw HTML of the object
36
+ ##
37
+ ## @return [String] Html representation of the object.
38
+ ##
26
39
  def to_html
27
40
  if key?(:source)
28
41
  self[:source]
29
42
  end
30
43
  end
31
44
 
45
+ ##
46
+ ## Get a value from the hash using a dot-syntax query
47
+ ##
48
+ ## @param query [String] The query (dot notation)
49
+ ##
50
+ ## @return [Object] result of querying the hash
51
+ ##
32
52
  def get_value(query)
33
53
  return nil if self.empty?
54
+ stringify_keys!
55
+
34
56
  query.split('.').inject(self) do |v, k|
35
- k = k.to_i if v.is_a? Array
36
- next unless v.key?(k)
57
+ return v.map { |el| el.get_value(k) } if v.is_a? Array
58
+ # k = k.to_i if v.is_a? Array
59
+ next v unless v.key?(k)
60
+
37
61
  v.fetch(k)
38
62
  end
39
63
  end
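The `get_value` change above lets a dot-syntax query pass through arrays by mapping over their elements. Below is a simplified recursive Ruby sketch of that traversal idea. It is a variation rather than a copy of the method in the diff: `value_at` and `PAGE` are hypothetical, the remaining path is applied to every array element, and a missing key yields `nil` (the diff instead carries the current value forward).

```ruby
# Resolve a dot-syntax path against nested hashes and arrays.
# Arrays are mapped element by element; a missing key gives nil.
def value_at(data, path)
  keys = path.is_a?(Array) ? path : path.split('.')
  return data if keys.empty?

  case data
  when Array
    data.map { |element| value_at(element, keys) }
  when Hash
    key, *rest = keys
    data.key?(key) ? value_at(data[key], rest) : nil
  end
end

# Made-up sample data with string keys, as after stringify_keys.
PAGE = {
  'links' => [
    { 'href' => '/a', 'rel' => ['me'] },
    { 'href' => '/b', 'rel' => [] }
  ]
}.freeze

p value_at(PAGE, 'links.href')
p value_at(PAGE, 'links.rel')
```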
@@ -42,7 +66,7 @@ class ::Hash
42
66
  #
43
67
  # @param path [String] The path
44
68
  #
45
- # @return Result of path query
69
+ # @return [Object] Result of path query
46
70
  #
47
71
  def dot_query(path, root = nil, full_tag: true)
48
72
  res = stringify_keys
@@ -52,12 +76,17 @@ class ::Hash
52
76
  return res.get_value(path)
53
77
  end
54
78
 
55
- enumerate = false
79
+ path.gsub!(/\[(.*?)\]/) do
80
+ inter = Regexp.last_match(1).gsub(/\./, '%')
81
+ "[#{inter}]"
82
+ end
83
+
56
84
  out = []
57
85
  q = path.split(/(?<![\d.])\./)
58
86
 
59
87
  while q.count.positive?
60
88
  pth = q.shift
89
+ pth.gsub!(/%/, '.')
61
90
 
62
91
  return nil if res.nil?
63
92
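The hunk above protects dots inside square brackets by swapping them for `%` before the path is split, then restores them per segment. A compact Ruby sketch of that placeholder technique follows. `split_query_path` is a hypothetical helper, and it splits on every remaining dot, whereas the diff uses a lookbehind (`/(?<![\d.])\./`) to skip dots that follow a digit or another dot.

```ruby
# Split a query path on dots, leaving dots inside [...] untouched.
def split_query_path(path)
  # Swap dots inside brackets for a placeholder so the split ignores them.
  guarded = path.gsub(/\[(.*?)\]/) { "[#{Regexp.last_match(1).tr('.', '%')}]" }
  # Split on the remaining dots, then restore the placeholder in each segment.
  guarded.split('.').map { |segment| segment.tr('%', '.') }
end

p split_query_path('[attrs.id*=what].source')
p split_query_path('images[1..4]')
p split_query_path('links[0].href')
```

One limitation of this sketch: a literal `%` inside a comparison value would come back as a dot after restoration.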
 
@@ -70,8 +99,8 @@ class ::Hash
70
99
 
71
100
  ats = []
72
101
  at = []
73
- while pth =~ /\[[+&,]?\w+( *[\^*$=<>]=? *\w+)?/
74
- m = pth.match(/\[(?<com>[,+&])? *(?<key>\w+)( *(?<op>[\^*$=<>]{1,2}) *(?<val>[^,&\]]+))? */)
102
+ while pth =~ /\[[+&,]?[\w.]+( *[\^*$=<>]=? *\w+)?/
103
+ m = pth.match(/\[(?<com>[,+&])? *(?<key>[\w.]+)( *(?<op>[\^*$=<>]{1,2}) *(?<val>[^,&\]]+))? */)
75
104
 
76
105
  comp = [m['key'], m['op'], m['val']]
77
106
  case m['com']
@@ -82,7 +111,7 @@ class ::Hash
82
111
  at.push(comp)
83
112
  end
84
113
 
85
- pth.sub!(/\[(?<com>[,&+])? *(?<key>\w+)( *(?<op>[\^*$=<>]{1,2}) *(?<val>[^,&\]]+))?/, '[')
114
+ pth.sub!(/\[(?<com>[,&+])? *(?<key>[\w.]+)( *(?<op>[\^*$=<>]{1,2}) *(?<val>[^,&\]]+))?/, '[')
86
115
  end
87
116
  ats.push(at) unless at.empty?
88
117
  pth.sub!(/\[\]/, '')
@@ -110,11 +139,11 @@ class ::Hash
110
139
  pth = ''
111
140
 
112
141
  return false if res.nil?
142
+
113
143
  if ats.count.positive?
114
144
  while ats.count.positive?
115
145
  atr = ats.shift
116
146
  res = [res] if res.is_a?(Hash)
117
-
118
147
  res.each do |r|
119
148
  out.push(full_tag ? tag : r) if evaluate_comp(r, atr)
120
149
  end
@@ -140,6 +169,32 @@ class ::Hash
140
169
  out
141
170
  end
142
171
 
172
+ ##
173
+ ## Test if values in an array match an operator
174
+ ##
175
+ ## @param array [Array] The array
176
+ ## @param key [String] The key
177
+ ## @param comp [String] The comparison, e.g. *= or $=
178
+ ##
179
+ ## @return [Boolean] true if array contains match
180
+ def array_match(array, key, comp)
181
+ keep = false
182
+ array.each do |el|
183
+ keep = case comp
184
+ when /^\^/
185
+ key =~ /^#{el}/i ? true : false
186
+ when /^\$/
187
+ key =~ /#{el}$/i ? true : false
188
+ when /^\*/
189
+ key =~ /#{el}/i ? true : false
190
+ else
191
+ key =~ /^#{el}$/i ? true : false
192
+ end
193
+ break if keep
194
+ end
195
+ keep
196
+ end
197
+
143
198
  ##
144
199
  ## Evaluate a comparison
145
200
  ##
@@ -165,40 +220,57 @@ class ::Hash
165
220
  end
166
221
  r = r.get_value(key.to_s) if key.to_s =~ /\./
167
222
 
168
- return r.key?(key) && !r[key].nil? && !r[key].empty? if val.nil?
223
+ if val.nil?
224
+ if r.is_a?(Hash)
225
+ return r.key?(key) && !r[key].nil? && !r[key].empty?
226
+ elsif r.is_a?(String)
227
+ return r.nil? ? false : true
228
+ elsif r.is_a?(Array)
229
+ return r.empty? ? false : true
230
+ end
231
+ end
169
232
 
170
- if !r.key?(key)
233
+ if r.nil?
171
234
  keep = false
172
- elsif r[key].is_a?(Array)
173
- valid = r[key].filter do |k|
174
- case a[1]
175
- when /^\^/
176
- k =~ /^#{a[2]}/i ? true : false
177
- when /^\$/
178
- k =~ /#{a[2]}$/i ? true : false
179
- when /^\*/
180
- k =~ /#{a[2]}/i ? true : false
235
+ elsif r.is_a?(Array)
236
+ valid = r.filter do |k|
237
+ if k.is_a? Array
238
+ array_match(k, a[2], a[1])
181
239
  else
182
- k =~ /^#{a[2]}$/i ? true : false
240
+ case a[1]
241
+ when /^\^/
242
+ k =~ /^#{a[2]}/i ? true : false
243
+ when /^\$/
244
+ k =~ /#{a[2]}$/i ? true : false
245
+ when /^\*/
246
+ k =~ /#{a[2]}/i ? true : false
247
+ else
248
+ k =~ /^#{a[2]}$/i ? true : false
249
+ end
183
250
  end
184
251
  end
185
252
 
186
253
  keep = valid.count.positive?
187
254
  elsif val.is_a?(Numeric) && a[1] =~ /^[<>=]{1,2}$/
188
- k = r[key].to_i
255
+ k = r.to_i
189
256
  comp = a[1] =~ /^=$/ ? '==' : a[1]
190
257
  keep = eval("#{k}#{comp}#{val}")
191
258
  else
192
- keep = case a[1]
193
- when /^\^/
194
- r[key] =~ /^#{a[2]}/i ? true : false
195
- when /^\$/
196
- r[key] =~ /#{a[2]}$/i ? true : false
197
- when /^\*/
198
- r[key] =~ /#{a[2]}/i ? true : false
199
- else
200
- r[key] =~ /^#{a[2]}$/i ? true : false
201
- end
259
+ v = r.is_a?(Hash) ? r[key] : r
260
+ if v.is_a? Array
261
+ keep = array_match(v, a[2], a[1])
262
+ else
263
+ keep = case a[1]
264
+ when /^\^/
265
+ v =~ /^#{a[2]}/i ? true : false
266
+ when /^\$/
267
+ v =~ /#{a[2]}$/i ? true : false
268
+ when /^\*/
269
+ v =~ /#{a[2]}/i ? true : false
270
+ else
271
+ v =~ /^#{a[2]}$/i ? true : false
272
+ end
273
+ end
202
274
  end
203
275
 
204
276
  return false unless keep
@@ -306,7 +378,32 @@ class ::Hash
306
378
  end
307
379
  end
308
380
 
381
+ ##
382
+ ## Destructive version of #stringify_keys
383
+ ##
384
+ ## @see #stringify_keys
385
+ ##
309
386
  def stringify_keys!
310
387
  replace stringify_keys
311
388
  end
389
+
390
+ ##
391
+ ## Clean up empty arrays and return an array with one or
392
+ ## more elements
393
+ ##
394
+ ## @return [Array] output array
395
+ ##
396
+ def clean_output
397
+ output = ensure_array
398
+ output.clean_output
399
+ end
400
+
401
+ ##
402
+ ## Ensure that an object is an array
403
+ ##
404
+ ## @return [Array] object as Array
405
+ ##
406
+ def ensure_array
407
+ return [self]
408
+ end
312
409
  end
@@ -0,0 +1,11 @@
+ # Numeric helpers
+ class ::Numeric
+ ##
+ ## Return an array version of self
+ ##
+ ## @return [Array] self enclosed in an array
+ ##
+ def ensure_array
+ [self]
+ end
+ end
data/lib/curly/string.rb CHANGED
@@ -6,6 +6,11 @@
6
6
  ## @return [String] cleaned string
7
7
  ##
8
8
  class ::String
9
+ ## Remove extra spaces and newlines, compress space
10
+ ## between tags
11
+ ##
12
+ ## @return [String] cleaned string
13
+ ##
9
14
  def clean
10
15
  gsub(/[\t\n ]+/m, ' ').gsub(/> +</, '><')
11
16
  end
@@ -40,7 +45,7 @@ class ::String
40
45
  ##
41
46
  ## Convert an image type string to a symbol
42
47
  ##
43
- ## @return Symbol :srcset, :img, :opengraph, :all
48
+ ## @return [Symbol] :srcset, :img, :opengraph, :all
44
49
  ##
45
50
  def normalize_image_type(default = :all)
46
51
  case self.to_s
@@ -58,7 +63,7 @@ class ::String
58
63
  ##
59
64
  ## Convert a browser type string to a symbol
60
65
  ##
61
- ## @return Symbol :chrome, :firefox
66
+ ## @return [Symbol] :chrome, :firefox
62
67
  ##
63
68
  def normalize_browser_type(default = :none)
64
69
  case self.to_s
@@ -74,7 +79,7 @@ class ::String
74
79
  ##
75
80
  ## Convert a screenshot type string to a symbol
76
81
  ##
77
- ## @return Symbol :full_page, :print_page, :visible
82
+ ## @return [Symbol] :full_page, :print_page, :visible
78
83
  ##
79
84
  def normalize_screenshot_type(default = :none)
80
85
  case self.to_s
@@ -88,4 +93,23 @@ class ::String
88
93
  default.is_a?(Symbol) ? default.to_sym : default.normalize_browser_type
89
94
  end
90
95
  end
96
+
97
+ ##
98
+ ## Clean up output and return a single-item array
99
+ ##
100
+ ## @return [Array] output array
101
+ ##
102
+ def clean_output
103
+ output = ensure_array
104
+ output.clean_output
105
+ end
106
+
107
+ ##
108
+ ## Ensure that an object is an array
109
+ ##
110
+ ## @return [Array] object as Array
111
+ ##
112
+ def ensure_array
113
+ return [self]
114
+ end
91
115
  end
data/lib/curly/version.rb CHANGED
@@ -1,3 +1,5 @@
+ # Top level module for CurlyQ
  module Curly
- VERSION = '0.0.8'
+ # Current version number
+ VERSION = '0.0.10'
  end
data/lib/curly.rb CHANGED
@@ -4,6 +4,7 @@ require 'curly/version'
  require 'curly/hash'
  require 'curly/string'
  require 'curly/array'
+ require 'curly/numeric'
  require 'json'
  require 'yaml'
  require 'uri'
data/src/_README.md CHANGED
@@ -10,9 +10,12 @@ _If you find this useful, feel free to [buy me some coffee][donate]._
10
10
  [donate]: https://brettterpstra.com/donate
11
11
  <!--END GITHUB-->
12
12
 
13
- The current version of `curlyq` is <!--VER-->0.0.4<!--END VER-->.
13
+ [jq]: https://github.com/jqlang/jq "Command-line JSON processor"
14
+ [yq]: https://github.com/mikefarah/yq "yq is a portable command-line YAML, JSON, XML, CSV, TOML and properties processor"
14
15
 
15
- CurlyQ is a utility that provides a simple interface for curl, with additional features for things like extracting images and links, finding elements by CSS selector or XPath, getting detailed header info, and more. It's designed to be part of a scripting pipeline, outputting everything as structured data (JSON or YAML). It also has rudimentary support for making calls to JSON endpoints easier, but it's expected that you'll use something like `jq` to parse the output.
16
+ The current version of `curlyq` is <!--VER-->0.0.9<!--END VER-->.
17
+
18
+ CurlyQ is a utility that provides a simple interface for curl, with additional features for things like extracting images and links, finding elements by CSS selector or XPath, getting detailed header info, and more. It's designed to be part of a scripting pipeline, outputting everything as structured data (JSON or YAML). It also has rudimentary support for making calls to JSON endpoints easier, but it's expected that you'll use something like [jq] to parse the output.
16
19
 
17
20
  [github]: https://github.com/ttscoff/curlyq/
18
21
 
@@ -45,6 +48,9 @@ You can shape the results using `--search` (`-s`) and `--query` (`-q`) on some c
45
48
 
46
49
  A search uses either CSS or XPath syntax to locate elements. For example, if you wanted to locate all of the `<article>` elements with a class of `post` inside of the div with an id of `main`, you would run `--search '#main article.post'`. Searches can target tags, ids, and classes, and can accept `>` to target direct descendents. You can also use XPaths, but I hate those so I'm not going to document them.
47
50
 
51
+ > I've tried to make the query function useful, but if you want to do any kind of advanced shaping, you're better off piping the JSON output to [jq] or [yq].
52
+ <!--JEKYLL{:.warn}-->
53
+
48
54
  Queries are specifically for shaping CurlyQ output. If you're using the `html` command, it returns a key called `images`, so you can target just the images in the response with `-q 'images'`. The queries accept array syntax, so to get the first image, you would use `-q 'images[0]'`. Ranges are accepted as well, so `-q 'images[1..4]'` will return the 2nd through 5th images found on the page. You can also do comparisons, e.g. `images[rel=me]'` to target only images with a `rel` attribute of `me`.
49
55
 
50
56
  The comparisons for the query flag are:
@@ -58,6 +64,16 @@ The comparisons for the query flag are:
58
64
  - `^=` starts with text
59
65
  - `$=` ends with text
60
66
 
67
+ Comparisons can be numeric or string comparisons. A numeric comparison like `curlyq images -q '[width>500]' URL` would return all of the images on the page with a width attribute greater than 500.
68
+
69
+ You can also use dot syntax inside of comparisons, e.g. `[links.rel*=me]` to target the links object (`html` command), and return only the links with a `rel=me` attribute. If the comparison is to an array object (like `class` or `rel`), it will match if any of the elements of the array match your comparison.
70
+
71
+ If you end the query with a specific key, only that key will be output. If there's only one match, it will be output as a raw string. If there are multiple matches, output will be an array:
72
+
73
+ curlyq tags --search '#main .post h3' -q '[attrs.id*=what].source' 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/'
74
+
75
+ <h3 id="whats-next">What’s Next</h3>
76
+
61
77
  #### Commands
62
78
 
63
79
  curlyq makes use of subcommands, e.g. `curlyq html [options] URL` or `curlyq extract [options] URL`. Each subcommand takes its own options, but I've made an effort to standardize the choices between each command as much as possible.
@@ -314,7 +330,7 @@ Example:
314
330
 
315
331
  Return a hierarchy of all tags in a page. Use `-t` to limit to a specific tag.
316
332
 
317
- curlyq tags --search '#main .post h3' -q 'attrs[id*=what]' https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
333
+ curlyq tags --search '#main .post h3' -q '[attrs.id*=what]' https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
318
334
 
319
335
  [
320
336
  {
@@ -17,8 +17,9 @@ class CurlyQHeadlinksTest < Test::Unit::TestCase
17
17
  result = curlyq('headlinks', '-q', '[rel=stylesheet]', 'https://brettterpstra.com')
18
18
  json = JSON.parse(result)
19
19
 
20
- assert_match(/stylesheet/, json['rel'], 'Should have retrieved a single result with rel stylesheet')
21
- assert_match(/screen\.\d+\.css$/, json['href'], 'Stylesheet should be correct primary stylesheet')
20
+ assert_equal(Array, json.class, 'Result should be an array')
21
+ assert_match(/stylesheet/, json[0]['rel'], 'Should have retrieved a single result with rel stylesheet')
22
+ assert_match(/screen\.\d+\.css$/, json[0]['href'], 'Stylesheet should be correct primary stylesheet')
22
23
  end
23
24
 
24
25
  def test_headlinks
@@ -14,12 +14,12 @@ class CurlyQHtmlTest < Test::Unit::TestCase
14
14
  result = curlyq('html', '-s', '#main article .aligncenter', '-q', 'images[1]', 'https://brettterpstra.com')
15
15
  json = JSON.parse(result)
16
16
 
17
- assert_match(/aligncenter/, json['class'], 'Should have found an image with class "aligncenter"')
17
+ assert_match(/aligncenter/, json[0]['class'], 'Should have found an image with class "aligncenter"')
18
18
  end
19
19
 
20
20
  def test_html_query
21
21
  result = curlyq('html', '-q', 'meta.title', 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/')
22
-
23
- assert_match(/Introducing CurlyQ/, result, 'Should have retrived the page title')
22
+ json = JSON.parse(result)
23
+ assert_match(/Introducing CurlyQ/, json[0], 'Should have retrived the page title')
24
24
  end
25
25
  end
@@ -11,12 +11,42 @@ class CurlyQScrapeTest < Test::Unit::TestCase
11
11
  include CurlyQHelpers
12
12
 
13
13
  def setup
14
+ @screenshot = File.join(File.dirname(__FILE__), 'screenshot_test')
15
+ FileUtils.rm_f("#{@screenshot}.pdf") if File.exist?("#{@screenshot}.pdf")
16
+ FileUtils.rm_f('screenshot_test.png') if File.exist?("#{@screenshot}.png")
17
+ FileUtils.rm_f("#{@screenshot}_full.png") if File.exist?("#{@screenshot}_full.png")
14
18
  end
15
19
 
16
- def test_scrape
20
+ def teardown
21
+ FileUtils.rm_f("#{@screenshot}.pdf") if File.exist?("#{@screenshot}.pdf")
22
+ FileUtils.rm_f('screenshot_test.png') if File.exist?("#{@screenshot}.png")
23
+ FileUtils.rm_f("#{@screenshot}_full.png") if File.exist?("#{@screenshot}_full.png")
24
+ end
25
+
26
+ def test_scrape_firefox
17
27
  result = curlyq('scrape', '-b', 'firefox', '-q', 'links[rel=me&content*=mastodon][0]', 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/')
18
28
  json = JSON.parse(result)
19
29
 
20
- assert_match(/Mastodon/, json['content'], 'Should have retrieved a Mastodon link')
30
+ assert_equal(Array, json.class, 'Result should be an Array')
31
+ assert_match(/Mastodon/, json[0]['content'], 'Should have retrieved a Mastodon link')
32
+ end
33
+
34
+ def test_scrape_chrome
35
+ result = curlyq('scrape', '-b', 'chrome', '-q', 'links[rel=me&content*=mastodon][0]', 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/')
36
+ json = JSON.parse(result)
37
+
38
+ assert_equal(Array, json.class, 'Result should be an Array')
39
+ assert_match(/Mastodon/, json[0]['content'], 'Should have retrieved a Mastodon link')
40
+ end
41
+
42
+ def test_screenshot
43
+ curlyq('screenshot', '-b', 'firefox', '-o', @screenshot, '-t', 'print', 'https://brettterpstra.com')
44
+ assert(File.exist?("#{@screenshot}.pdf"), 'PDF Screenshot should exist')
45
+
46
+ curlyq('screenshot', '-b', 'chrome', '-o', @screenshot, '-t', 'visible', 'https://brettterpstra.com')
47
+ assert(File.exist?("#{@screenshot}.png"), 'PNG Screenshot should exist')
48
+
49
+ curlyq('screenshot', '-b', 'firefox', '-o', "#{@screenshot}_full", '-t', 'full', 'https://brettterpstra.com')
50
+ assert(File.exist?("#{@screenshot}_full.png"), 'PNG Screenshot should exist')
21
51
  end
22
52
  end
@@ -14,18 +14,26 @@ class CurlyQTagsTest < Test::Unit::TestCase
14
14
  end
15
15
 
16
16
  def test_tags
17
- result = curlyq('tags', '--search', '#main .post h3', '-q', 'attrs[id*=what]', 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/')
17
+ result = curlyq('tags', '--search', '#main .post h3', 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/')
18
18
  json = JSON.parse(result)
19
19
 
20
- assert_equal(json.count, 1, 'Should have 1 result')
21
- assert_match(/whats-next/, json[0]['attrs']['id'], 'Should have matched #whats-next')
20
+ assert_equal(Array, json.class, 'Should be an array of matches')
21
+ assert_equal(6, json.count, 'Should be six results')
22
22
  end
23
23
 
24
24
  def test_clean
25
25
  result = curlyq('tags', '--search', '#main section.related', '--clean', 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/')
26
26
  json = JSON.parse(result)
27
27
 
28
- assert_equal(json.count, 1, 'Should have 1 result')
28
+ assert_equal(Array, json.class, 'Should be a single Array')
29
+ assert_equal(1, json.count, 'Should be one element')
29
30
  assert_match(%r{Last.fm</h5></a></li>}, json[0]['source'], 'Should have matched #whats-next')
30
31
  end
32
+
33
+ def test_query
34
+ result = curlyq('tags', '--search', '#main .post h3', '-q', '[attrs.id*=what].source', 'https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/')
35
+ json = JSON.parse(result)
36
+ assert_equal(Array, json.class, 'Should be an array')
37
+ assert_match(%r{^<h3 id="whats-next">What’s Next</h3>$}, json[0], 'Should have returned just source')
38
+ end
31
39
  end
@@ -1,5 +1,6 @@
  require 'open3'
  require 'time'
+ require 'fileutils'
  $LOAD_PATH.unshift File.join(__dir__, '..', '..', 'lib')
  require 'curly'
 
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: curlyq
  version: !ruby/object:Gem::Version
- version: 0.0.8
+ version: 0.0.10
  platform: ruby
  authors:
  - Brett Terpstra
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2024-01-15 00:00:00.000000000 Z
+ date: 2024-01-17 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: rake
@@ -236,6 +236,7 @@ files:
  - lib/curly/curl/html.rb
  - lib/curly/curl/json.rb
  - lib/curly/hash.rb
+ - lib/curly/numeric.rb
  - lib/curly/string.rb
  - lib/curly/version.rb
  - src/_README.md
  - src/_README.md