curlyq 0.0.4 → 0.0.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: '091e39001a4456eef85fa25e97281c6218e3383619d37a14a66ca9bd41fee9ab'
- data.tar.gz: c287e095d4f9525e924d08cd580c37ab19448ae93bb384d419526b60cf895493
+ metadata.gz: 2c5eb3f9a5444f19c44362545b302e3889c4e25dc34d9180452a736b1b80bc34
+ data.tar.gz: 3bf8d1009f493b60c31efb3636c64aa8871656dbcd9cebbeb01800d30fd0761c
  SHA512:
- metadata.gz: e51492325696e09319666ee29b753472f555129dc7664f34a08b1ae30dd319cdb5518727030d2d7bbaa1dfaefc6dd14ce4ccf2756243238dd885d37e2dbffbb4
- data.tar.gz: 27c006a7433cd9bd9208cc0aa69c0029974ca98524ecf3ee323bad00cc09490acfbfd5b4e61a65e22d5f750ebe05fa03d5d1bae1d1b977bf1028ed2aa21e2ee7
+ metadata.gz: 808d8122080450acee5e98e0a6338e887ba5b6e3306764dab79c713052c6e5f6749d8b4ef90f43fcdc2cc7da41766f40e6684e0e40d2de98055e2d71986ac0e8
+ data.tar.gz: d4e17b0cc425cbf7a704cdd188e36f734707cd885a097c6e99cb0f8bc0089e46ffdd99d1e15844a981bbfd9a205778178e45dcaa637cd8a7e761432f2610991e
data/.gitignore CHANGED
@@ -1 +1,2 @@
  html
+ *.bak
data/.irbrc ADDED
@@ -0,0 +1,4 @@
+ $LOAD_PATH.unshift File.join(__dir__, 'lib')
+ require_relative 'lib/curly'
+ include Curly
+
data/CHANGELOG.md CHANGED
@@ -1,3 +1,22 @@
+ ### 0.0.5
+
+ 2024-01-11 18:06
+
+ #### IMPROVED
+
+ - Add --query capabilities to images command
+ - Add --query to links command
+ - Allow hyphens in query syntax
+ - Allow any character other than comma, ampersand, or right square bracket in query value
+
+ #### FIXED
+
+ - Html --search returns a full Curl::Html object
+ - --query works better with --search and is consistent with other query functions
+ - Scrape command outputting malformed data
+ - Hash output when --query is used with scrape
+ - Nil match on tags command
+
  ### 0.0.4

  2024-01-10 13:54
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
    remote: .
    specs:
-     curlyq (0.0.4)
+     curlyq (0.0.5)
      gli (~> 2.21.0)
      nokogiri (~> 1.16.0)
      selenium-webdriver (~> 4.16.0)
data/README.md CHANGED
@@ -10,7 +10,7 @@ _If you find this useful, feel free to [buy me some coffee][donate]._
  [donate]: https://brettterpstra.com/donate


- The current version of `curlyq` is 0.0.4
+ The current version of `curlyq` is 0.0.5
  .

  CurlyQ is a utility that provides a simple interface for curl, with additional features for things like extracting images and links, finding elements by CSS selector or XPath, getting detailed header info, and more. It's designed to be part of a scripting pipeline, outputting everything as structured data (JSON or YAML). It also has rudimentary support for making calls to JSON endpoints easier, but it's expected that you'll use something like `jq` to parse the output.
@@ -44,7 +44,7 @@ SYNOPSIS
  curlyq [global options] command [command options] [arguments...]

  VERSION
-     0.0.4
+     0.0.5

  GLOBAL OPTIONS
      --help - Show this message
@@ -65,12 +65,41 @@ COMMANDS
      tags - Extract all instances of a tag
  ```

+ ### Query and Search syntax
+
+ You can shape the results using `--search` (`-s`) and `--query` (`-q`) on some commands.
+
+ A search uses either CSS or XPath syntax to locate elements. For example, if you wanted to locate all of the `<article>` elements with a class of `post` inside of the div with an id of `main`, you would run `--search '#main article.post'`. Searches can target tags, ids, and classes, and can accept `>` to target direct descendants. You can also use XPaths, but I hate those so I'm not going to document them.
+
+ Queries are specifically for shaping CurlyQ output. If you're using the `html` command, it returns a key called `images`, so you can target just the images in the response with `-q 'images'`. The queries accept array syntax, so to get the first image, you would use `-q 'images[0]'`. Ranges are accepted as well, so `-q 'images[1..4]'` will return the 2nd through 5th images found on the page. You can also do comparisons, e.g. `-q 'images[rel=me]'`, to target only images with a `rel` attribute of `me`.
+
+ The comparisons for the query flag are:
+
+ - `<` less than
+ - `>` greater than
+ - `<=` less than or equal to
+ - `>=` greater than or equal to
+ - `=` or `==` is equal to
+ - `*=` contains text
+ - `^=` starts with text
+ - `$=` ends with text
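+
+ Comparisons can be combined with `&` to require that every condition matches, as the scrape example later in this README does. A sketch, assuming the target page marks identity links up with `rel="me"`:
+
+     curlyq links -q '[rel=me&content*=mastodon][0]' https://brettterpstra.com
+
+ This would return the first link whose `rel` contains `me` and whose text contains `mastodon`.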
  #### Commands

  curlyq makes use of subcommands, e.g. `curlyq html [options] URL` or `curlyq extract [options] URL`. Each subcommand takes its own options, but I've made an effort to standardize the choices between each command as much as possible.

  ##### extract

+ Example:
+
+     curlyq extract -i -b 'Adding' -a 'accessing the source.' 'https://stackoverflow.com/questions/52428409/get-fully-rendered-html-using-selenium-webdriver-and-python'
+
+     [
+       "Adding <code>time.sleep(10)</code> in various places in case the page had not fully loaded when I was accessing the source."
+     ]
+
+ This specifies a before and after string and includes them (`-i`) in the result.
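+
+ The new `-r` (`--regex`) switch treats the before and after strings as regular expressions rather than literal text. A sketch, assuming the target page wraps sections in `<h2>` headings:
+
+     curlyq extract -r -b '<h2[^>]*>' -a '</h2>' 'https://brettterpstra.com'
+
+ Without `-r`, both strings are escaped and matched literally.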
  ```
  NAME
      extract - Extract contents between two regular expressions
@@ -80,17 +109,32 @@ SYNOPSIS
  curlyq [global options] extract [command options] URL...

  COMMAND OPTIONS
-     -a, --after=arg - Text after extraction, parsed as regex (default: none)
-     -b, --before=arg - Text before extraction, parsed as regex (default: none)
+     -a, --after=arg - Text after extraction (default: none)
+     -b, --before=arg - Text before extraction (default: none)
      -c, --[no-]compressed - Expect compressed results
      --[no-]clean - Remove extra whitespace from results
      -h, --header=arg - Define a header to send as key=value (may be used more than once, default: none)
+     -i, --[no-]include - Include the before/after matches in the result
+     -r, --[no-]regex - Process before/after strings as regular expressions
      --[no-]strip - Strip HTML tags from results
  ```


  ##### headlinks

+ Example:
+
+     curlyq headlinks -q '[rel=stylesheet]' https://brettterpstra.com
+
+     {
+       "rel": "stylesheet",
+       "href": "https://cdn3.brettterpstra.com/stylesheets/screen.7261.css",
+       "type": "text/css",
+       "title": null
+     }
+
+ This pulls all `<link>` elements from the `<head>` of the page, and uses a query (`-q`) to show only links with `rel="stylesheet"`.
+
  ```
  NAME
      headlinks - Return all <head> links on URL's page
@@ -105,6 +149,61 @@ COMMAND OPTIONS

  ##### html

+ The html command (aliased as `curl`) gets the entire text of the web page and provides a JSON response with a breakdown of:
+
+ - URL, after any redirects
+ - Response code
+ - Response headers as a keyed hash
+ - Meta elements for the page as a keyed hash
+ - All meta links in the head as an array of objects containing (as available):
+     - rel
+     - href
+     - type
+     - title
+ - source of `<head>`
+ - source of `<body>`
+ - the page title (determined first by og:title, then by a title tag)
+ - description (using og:description first)
+ - All links on the page as an array of objects with:
+     - href
+     - title
+     - rel
+     - text content
+     - classes as array
+ - All images on the page as an array of objects containing:
+     - class
+     - all attributes as key/value pairs
+     - width and height (if specified)
+     - src
+     - alt and title
+
+ You can add a query (`-q`) to get only the information you need, e.g. `-q 'images[width>600]'`.
+
+ Example:
+
+     curlyq html -s '#main article .aligncenter' -q 'images[1]' 'https://brettterpstra.com'
+
+     [
+       {
+         "class": "aligncenter",
+         "original": "https://cdn3.brettterpstra.com/uploads/2023/09/giveaway-keyboardmaestro2024-rb_tw.jpg",
+         "at2x": "https://cdn3.brettterpstra.com/uploads/2023/09/giveaway-keyboardmaestro2024-rb@2x.jpg",
+         "width": "800",
+         "height": "226",
+         "src": "https://cdn3.brettterpstra.com/uploads/2023/09/giveaway-keyboardmaestro2024-rb.jpg",
+         "alt": "Giveaway Robot with Keyboard Maestro icon",
+         "title": "Giveaway Robot with Keyboard Maestro icon"
+       }
+     ]
+
+ The above example queries the full HTML of the page, narrows the elements with `--search`, and then takes the 2nd image from the results.
+
+     curlyq html -q 'meta.title' https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
+
+     Introducing CurlyQ, a pipeline-oriented curl helper - BrettTerpstra.com
+
+ The above example curls the page and returns the title found in its meta (`-q 'meta.title'`).
+
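+ The `-r` (`--raw`) flag maps each result to a single key, which pairs well with a query. A sketch, assuming you only want the image URLs:
+
+     curlyq html -q 'images' -r src 'https://brettterpstra.com'
+
+ This should print just the `src` value of each image object instead of the full hash.
+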
  ```
  NAME
      html - Curl URL and output its elements, multiple URLs allowed
@@ -124,12 +223,78 @@ COMMAND OPTIONS
      --[no-]ignore_relative - Ignore relative hrefs when gathering content links
      -q, --query, --filter=arg - Filter output using dot-syntax path (default: none)
      -r, --raw=arg - Output a raw value for a key (default: none)
-     --search=arg - Return an array of matches to a CSS or XPath query (default: none)
+     -s, --search=arg - Return an array of matches to a CSS or XPath query (default: none)
      -x, --external_links_only - Only gather external links
  ```

  ##### images

+ The images command returns only the images on the page as an array of objects. It can be queried to match certain requirements (see Query and Search syntax above).
+
+ The base command will return all images on the page, including OpenGraph images from the head, `<img>` tags from the body, and `<srcset>` tags along with their child images.
+
+ OpenGraph images will be returned with the structure:
+
+     {
+       "type": "opengraph",
+       "attrs": null,
+       "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb_tw.jpg"
+     }
+
+ `img` tags will be returned with the structure:
+
+     {
+       "type": "img",
+       "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb.jpg",
+       "width": "800",
+       "height": "226",
+       "alt": "Banner image for CurlyQ",
+       "title": "CurlyQ, curl better",
+       "attrs": [
+         {
+           "key": "class",
+           "value": [
+             "aligncenter"
+           ]
+         } // all attributes included
+       ]
+     }
+
+ `srcset` images will be returned with the structure:
+
+     {
+       "type": "srcset",
+       "attrs": [
+         {
+           "key": "srcset",
+           "value": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb_tw.jpg 1x, https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb@2x.jpg 2x"
+         }
+       ],
+       "images": [
+         {
+           "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb_tw.jpg",
+           "media": "1x"
+         },
+         {
+           "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb@2x.jpg",
+           "media": "2x"
+         }
+       ]
+     }
+
+ Example:
+
+     curlyq images -t img -q '[alt$=screenshot]' https://brettterpstra.com
+
+ This will return an array of images that are `<img>` tags, showing only the ones whose `alt` attribute ends with `screenshot`.
+
+     curlyq images -q '[width>750]' https://brettterpstra.com
+
+ This example will only return images that have a width greater than 750 pixels, which depends on the images having proper `width` attributes set in the source.
+
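+ Comparisons can be chained with `&` here as well. A sketch combining the two examples above:
+
+     curlyq images -t img -q '[width>600&alt*=screenshot]' https://brettterpstra.com
+
+ This should return only `<img>` tags wider than 600 pixels whose `alt` text contains `screenshot`.
+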
  ```
  NAME
      images - Extract all images from a URL
@@ -139,14 +304,17 @@ SYNOPSIS
  curlyq [global options] images [command options] URL...

  COMMAND OPTIONS
-     -c, --[no-]compressed - Expect compressed results
-     --[no-]clean - Remove extra whitespace from results
-     -h, --header=arg - Define a header to send as key=value (may be used more than once, default: none)
-     -t, --type=arg - Type of images to return (img, srcset, opengraph, all) (may be used more than once, default: ["all"])
+     -c, --[no-]compressed - Expect compressed results
+     --[no-]clean - Remove extra whitespace from results
+     -h, --header=arg - Define a header to send as key=value (may be used more than once, default: none)
+     -q, --query, --filter=arg - Filter output using dot-syntax path (default: none)
+     -t, --type=arg - Type of images to return (img, srcset, opengraph, all) (may be used more than once, default: ["all"])
  ```

  ##### json

+ The `json` command returns an object containing the header/response info plus the contents of the JSON response, parsed by the Ruby JSON library. If fetching or parsing fails, it fails gracefully with an error code.
+
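+ A sketch, using a hypothetical endpoint (any URL that returns JSON will do):
+
+     curlyq json 'https://api.github.com/repos/ttscoff/curlyq'
+
+ From there, something like `jq` can pick individual fields out of the structured output.
+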
  ```
  NAME
      json - Get a JSON response from a URL, multiple URLs allowed
@@ -163,6 +331,12 @@ COMMAND OPTIONS

  ##### links

+ Returns all the links on the page, which can be queried on any attribute.
+
+ Example:
+
+     curlyq links -q '[rel=me]' https://brettterpstra.com
+
  ```
  NAME
      links - Return all links on a URL's page
@@ -181,6 +355,26 @@ COMMAND OPTIONS

  ##### scrape

+ Loads the page in a web browser, allowing scraping of dynamically loaded pages that return nothing but scripts when `curl`ed. The `-b` (`--browser`) option is required and should be 'chrome' or 'firefox' (or just 'c' or 'f'). The selected browser must be installed on your system.
+
+ Example:
+
+     curlyq scrape -b firefox -q 'links[rel=me&content*=mastodon][0]' https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
+
+     {
+       "href": "https://nojack.easydns.ca/@ttscoff",
+       "title": null,
+       "rel": [
+         "me"
+       ],
+       "content": "Mastodon",
+       "class": [
+         "u-url"
+       ]
+     }
+
+ This example scrapes the page using Firefox and finds the first link with a rel of 'me' and text containing 'mastodon'.
+
  ```
  NAME
      scrape - Scrape a page using a web browser, for dynamic (JS) pages. Be sure to have the selected --browser installed.
@@ -190,7 +384,7 @@ SYNOPSIS
  curlyq [global options] scrape [command options] URL...

  COMMAND OPTIONS
-     -b, --browser=arg - Browser to use (firefox, chrome) (default: none)
+     -b, --browser=arg - Browser to use (firefox, chrome) (required, default: none)
      --[no-]clean - Remove extra whitespace from results
      -h, --header=arg - Define a header to send as "key=value" (may be used more than once, default: none)
      -q, --query, --filter=arg - Filter output using dot-syntax path (default: none)
@@ -202,6 +396,17 @@ COMMAND OPTIONS

  Full-page screenshots require Firefox, installed and specified with `--browser firefox`.

+ The `full` type only works when `-b` is Firefox; with Chrome you must use a `--type` of 'visible' or 'print'. The default type is `visible`.
+
+ The `-o` (`--out`) flag is required. It should be a path to a target PNG file (or PDF for `-t print` output). The extension will be adjusted automatically; all you need is the base name.
+
+ Example:
+
+     curlyq screenshot -b f -o ~/Desktop/test https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
+
+     Screenshot saved to /Users/ttscoff/Desktop/test.png
+
  ```
  NAME
      screenshot - Save a screenshot of a URL
@@ -213,12 +418,14 @@ SYNOPSIS
  COMMAND OPTIONS
      -b, --browser=arg - Browser to use (firefox, chrome) (default: chrome)
      -h, --header=arg - Define a header to send as key=value (may be used more than once, default: none)
-     -o, --out, --file=arg - File destination (default: none)
-     -t, --type=arg - Type of screenshot to save (full (requires firefox), print, visible) (default: full)
+     -o, --out, --file=arg - File destination (required, default: none)
+     -t, --type=arg - Type of screenshot to save (full (requires firefox), print, visible) (default: visible)
  ```

  ##### tags

+ Return a hierarchy of all tags in a page. Use `-t` to limit to a specific tag.
+
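+ A sketch, assuming you only want headline tags (`-t` may be repeated):
+
+     curlyq tags -t h1 -t h2 'https://brettterpstra.com'
+
+ This collects every `h1` and `h2` tag on the page.
+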
  ```
  NAME
      tags - Extract all instances of a tag
@@ -231,7 +438,8 @@ COMMAND OPTIONS
      -c, --[no-]compressed - Expect compressed results
      --[no-]clean - Remove extra whitespace from results
      -h, --header=arg - Define a header to send as key=value (may be used more than once, default: none)
-     -q, --query, --search=arg - CSS/XPath query (default: none)
+     -q, --query, --filter=arg - CSS/XPath query (default: none)
+     --search=arg - Return an array of matches to a CSS or XPath query (default: none)
      -t, --tag=arg - Specify a tag to collect (may be used more than once, default: none)
  ```

data/bin/curlyq CHANGED
@@ -71,7 +71,7 @@ command %i[html curl] do |c|
    c.switch %i[I info], negatable: false

    c.desc 'Return an array of matches to a CSS or XPath query'
-   c.flag %i[search]
+   c.flag %i[s search]

    c.desc 'Define a header to send as "key=value"'
    c.flag %i[h header], multiple: true
@@ -110,25 +110,31 @@ command %i[html curl] do |c|
      output = []

      urls.each do |url|
-       res = Curl::Html.new(url, { browser: options[:browser], fallback: options[:fallback],
-                                   headers: headers, headers_only: options[:info],
-                                   compressed: options[:compressed], clean: options[:clean],
-                                   ignore_local_links: options[:ignore_relative],
-                                   ignore_fragment_links: options[:ignore_fragments],
-                                   external_links_only: options[:external_links_only] })
+       curl_settings = { browser: options[:browser], fallback: options[:fallback],
+                         headers: headers, headers_only: options[:info],
+                         compressed: options[:compressed], clean: options[:clean],
+                         ignore_local_links: options[:ignore_relative],
+                         ignore_fragment_links: options[:ignore_fragments],
+                         external_links_only: options[:external_links_only] }
+       res = Curl::Html.new(url, curl_settings)
        res.curl

        if options[:info]
          output.push(res.headers)
-         # print_out(res.headers, global_options[:yaml], raw: options[:raw], pretty: global_options[:pretty])
          next
        end

        if options[:search]
-         out = res.search(options[:search])
+         source = res.search(options[:search], return_source: true)

-         out = out.dot_query(options[:query]) if options[:query]
-         output.push(out)
+         out = res.parse(source)
+
+         if options[:query]
+           out = out.to_data(url: url, clean: options[:clean]).dot_query(options[:query])
+         else
+           out = out.to_data
+         end
+         output.push([out])
        elsif options[:query]
          queried = res.to_data.dot_query(options[:query])
          output.push(queried) if queried
@@ -136,7 +142,7 @@ command %i[html curl] do |c|
          output.push(res.to_data(url: url))
        end
      end
-
+     output.delete_if(&:nil?)
      output.delete_if(&:empty?)
      output = output[0] if output.count == 1
      output.map! { |o| o[options[:raw].to_sym] } if options[:raw]
@@ -149,13 +155,13 @@ desc 'Save a screenshot of a URL'
  arg_name 'URL', multiple: true
  command :screenshot do |c|
    c.desc 'Type of screenshot to save (full (requires firefox), print, visible)'
-   c.flag %i[t type], type: ScreenshotType, must_match: /^[fpv].*?$/, default_value: 'full'
+   c.flag %i[t type], type: ScreenshotType, must_match: /^[fpv].*?$/, default_value: 'visible'

    c.desc 'Browser to use (firefox, chrome)'
    c.flag %i[b browser], type: BrowserType, must_match: /^[fc].*?$/, default_value: 'chrome'

    c.desc 'File destination'
-   c.flag %i[o out file]
+   c.flag %i[o out file], required: true

    c.desc 'Define a header to send as key=value'
    c.flag %i[h header], multiple: true
@@ -164,11 +170,19 @@ command :screenshot do |c|
      urls = args.join(' ').split(/[, ]+/)
      headers = break_headers(options[:header])

+     type = options[:type]
+     browser = options[:browser]
+
+     type = type.is_a?(Symbol) ? type : type.normalize_screenshot_type
+     browser = browser.is_a?(Symbol) ? browser : browser.normalize_browser_type
+
+     raise 'Full page screen shots only available with Firefox' if type == :full_page && browser != :firefox
+
      urls.each do |url|
        c = Curl::Html.new(url)
        c.headers = headers
-       c.browser = options[:browser]
-       c.screenshot(options[:out], type: options[:type])
+       c.browser = browser
+       c.screenshot(options[:out], type: type)
      end
    end
  end
@@ -221,12 +235,18 @@ end
  desc 'Extract contents between two regular expressions'
  arg_name 'URL', multiple: true
  command :extract do |c|
-   c.desc 'Text before extraction, parsed as regex'
+   c.desc 'Text before extraction'
    c.flag %i[b before]

-   c.desc 'Text after extraction, parsed as regex'
+   c.desc 'Text after extraction'
    c.flag %i[a after]

+   c.desc 'Process before/after strings as regular expressions'
+   c.switch %i[r regex]
+
+   c.desc 'Include the before/after matches in the result'
+   c.switch %i[i include]
+
    c.desc 'Define a header to send as key=value'
    c.flag %i[h header], multiple: true

@@ -249,7 +269,15 @@ command :extract do |c|
      res = Curl::Html.new(url, { headers: headers, headers_only: false,
                                  compressed: options[:compressed], clean: options[:clean] })
      res.curl
-     extracted = res.extract(options[:before], options[:after])
+     if options[:regex]
+       before = Regexp.new(options[:before])
+       after = Regexp.new(options[:after])
+     else
+       before = /#{Regexp.escape(options[:before])}/
+       after = /#{Regexp.escape(options[:after])}/
+     end
+
+     extracted = res.extract(before, after, inclusive: options[:include])
      extracted.strip_tags! if options[:strip]
      output.concat(extracted)
    end
@@ -274,7 +302,10 @@ command :tags do |c|
    c.switch %i[clean]

    c.desc 'CSS/XPath query'
-   c.flag %i[q query search]
+   c.flag %i[q query filter]
+
+   c.desc 'Return an array of matches to a CSS or XPath query'
+   c.flag %i[search]

    c.action do |global_options, options, args|
      urls = args.join(' ').split(/[, ]+/)
@@ -286,9 +317,17 @@ command :tags do |c|
      res = Curl::Html.new(url, { headers: headers, headers_only: options[:headers],
                                  compressed: options[:compressed], clean: options[:clean] })
      res.curl
+
      output = []
      if options[:search]
-       output = res.tags.search(options[:search])
+       out = res.search(options[:search])
+
+       # out = out.dot_query(options[:query]) if options[:query]
+       output.push(out)
+     elsif options[:query]
+       query = options[:query] =~ /^links/ ? options[:query] : "links#{options[:query]}"
+
+       output = res.to_data.dot_query(query)
      elsif tags.count.positive?
        tags.each { |tag| output.concat(res.tags(tag)) }
      else
@@ -312,6 +351,9 @@ command :images do |c|
    c.desc 'Remove extra whitespace from results'
    c.switch %i[clean]

+   c.desc 'Filter output using dot-syntax path'
+   c.flag %i[q query filter]
+
    c.desc 'Define a header to send as key=value'
    c.flag %i[h header], multiple: true

@@ -326,7 +368,15 @@ command :images do |c|
      urls.each do |url|
        res = Curl::Html.new(url, { compressed: options[:compressed], clean: options[:clean] })
        res.curl
-       output.concat(res.images(types: types))
+
+       res = res.images(types: types)
+
+       if options[:query]
+         query = options[:query] =~ /^images/ ? options[:query] : "images#{options[:query]}"
+         res = { images: res }.dot_query(query)
+       end
+
+       output.concat(res)
      end

      print_out(output, global_options[:yaml], pretty: global_options[:pretty])
@@ -367,7 +417,7 @@ command :links do |c|

        if options[:query]
          query = options[:query] =~ /^links/ ? options[:query] : "links#{options[:query]}"
-         queried = { links: res.to_data[:links] }.dot_query(query)
+         queried = res.to_data.dot_query(query)
          output.concat(queried) if queried
        else
          output.concat(res.body_links)
@@ -414,7 +464,7 @@ desc %(Scrape a page using a web browser, for dynamic (JS) pages. Be sure to hav
  arg_name 'URL', multiple: true
  command :scrape do |c|
    c.desc 'Browser to use (firefox, chrome)'
-   c.flag %i[b browser], type: BrowserType
+   c.flag %i[b browser], type: BrowserType, required: true

    c.desc 'Return an array of matches to a CSS or XPath query'
    c.flag %i[search]
@@ -437,30 +487,19 @@ command :scrape do |c|
      output = []

      urls.each do |url|
-       driver = Selenium::WebDriver.for options[:browser]
-       begin
-         driver.get url
-         res = driver.page_source
-
-         res = Curl::Html.new(nil, { source: res, clean: options[:clean] })
-         res.curl
-         if options[:search]
-           out = res.search(options[:search])
-
-           out = out.dot_query(options[:query]) if options[:query]
-           output.push(out)
-         elsif options[:query]
-           queried = res.to_data(url: url).dot_query(options[:query])
-           output = queried if queried
-         else
-           output.push(res.to_data(url: url))
-         end
+       res = Curl::Html.new(url, { browser: options[:browser], clean: options[:clean] })
+       res.curl

-       # elements = driver.find_elements(css: options[:query])
+       if options[:search]
+         out = res.search(options[:search])

-       # elements.each { |e| output.push(e.text.strip) }
-       ensure
-         driver.quit
+         out = out.dot_query(options[:query]) if options[:query]
+         output.push(out)
+       elsif options[:query]
+         queried = res.to_data(url: url).dot_query(options[:query])
+         output.push(queried) if queried
+       else
+         output.push(res.to_data(url: url))
        end
      end

data/lib/curly/array.rb CHANGED
@@ -67,68 +67,69 @@ class ::Array
    end

    ##
-   ## Convert and execute a dot-syntax query on the array
-   ##
-   ## @param path [String] The dot-syntax path
-   ##
-   ## @return [Array] Matching elements
-   ##
-   def dot_query(path)
-     output = []
-     if path =~ /^\[([\d+.])\]\.?/
-       int = Regexp.last_match(1)
-       path.sub!(/^\[[\d.]+\]\.?/, '')
-       items = self[eval(int)]
-     else
-       items = self
-     end
+   ## Test if a tag contains an attribute matching filter queries
+   ##
+   ## @param tag_name   [String]  The tag name
+   ## @param classes    [String]  The classes to match
+   ## @param id         [String]  The id attribute to match
+   ## @param attribute  [String]  The attribute
+   ## @param operator   [String]  The operator, <>= *= $= ^=
+   ## @param value      [String]  The value to match
+   ## @param descendant [Boolean] Check descendant tags
+   ##
+   def tag_match(tag_name, classes, id, attribute, operator, value, descendant: false)
+     tag = self
+     keep = true
+
+     keep = false if tag_name && tag['tag'] !~ /^#{tag_name}$/i
+
+     if tag.key?('attrs') && tag['attrs']
+       if keep && id
+         tag_id = tag['attrs'].filter { |a| a['key'] == 'id' }.first['value']
+         keep = tag_id && tag_id =~ /#{id}/i
+       end

-     if items.is_a? Hash
-       output = items.dot_query(path)
-     else
-       items.each do |item|
-         res = item.is_a?(Hash) ? item.stringify_keys : item
-         out = []
-         q = path.split(/(?<![\d.])\./)
-         q.each do |pth|
-           el = Regexp.last_match(1) if pth =~ /\[([0-9,.]+)\]/
-           pth.sub!(/\[([0-9,.]+)\]/, '')
-           ats = []
-           at = []
-           while pth =~ /\[[+&,]?\w+ *[\^*$=<>]=? *\w+/
-             m = pth.match(/\[(?<com>[,+&])? *(?<key>\w+) *(?<op>[\^*$=<>]{1,2}) *(?<val>\w+) */)
-             comp = [m['key'], m['op'], m['val']]
-             case m['com']
-             when ','
-               ats.push(comp)
-               at = []
-             else
-               at.push(comp)
-             end
-
-             pth.sub!(/\[(?<com>[,&+])? *(?<key>\w+) *(?<op>[\^*$=<>]{1,2}) *(?<val>\w+)/, '[')
-           end
-           ats.push(at) unless at.empty?
-           pth.sub!(/\[\]/, '')
-
-           return false if el.nil? && ats.empty? && !res.key?(pth)
-
-           res = res[pth] unless pth.empty?
-
-           while ats.count.positive?
-             atr = ats.shift
-
-             keepers = res.filter do |r|
-               evaluate_comp(r, atr)
-             end
-             out.concat(keepers)
-           end
-
-           out = out[eval(el)] if out.is_a?(Array) && el =~ /^[\d.,]+$/
+       if keep && classes
+         cls = tag['attrs'].filter { |a| a['key'] == 'class' }.first
+         if cls
+           all = true
+           classes.each { |c| all = cls['value'].include?(c) }
+           keep = all
+         else
+           keep = false
          end
-         output.push(out)
        end
+
+       if keep && attribute
+         attributes = tag['attrs'].filter { |a| a['key'] =~ /^#{attribute}$/i }
+         any = false
+         attributes.each do |a|
+           break if any
+
+           any = case operator
+                 when /^\*/
+                   a['value'] =~ /#{value}/i
+                 when /^\^/
+                   a['value'] =~ /^#{value}/i
+                 when /^\$/
+                   a['value'] =~ /#{value}$/i
+                 else
+                   a['value'] =~ /^#{value}$/i
+                 end
+         end
+         keep = any
+       end
+     end
+
+     return false if descendant && !keep
+
+     if !descendant && tag.key?('tags')
+       tags = tag['tags'].filter { |t| t.tag_match(tag_name, classes, id, attribute, operator, value) }
+       tags.count.positive?
+     else
+       keep
      end
-     output
    end
  end
data/lib/curly/html.rb CHANGED
@@ -65,7 +65,13 @@ module Curl
    @external_links_only = options[:external_links_only]

    @curl = TTY::Which.which('curl')
-   @url = url
+   @url = url.nil? ? options[:url] : url
+ end
+
+ def parse(source)
+   @body = source
+   { url: @url, code: @code, headers: @headers, meta: @meta, links: @links, head: @head, body: source,
+     source: source.strip, body_links: content_links, body_images: content_images }
  end

  def curl
@@ -118,10 +124,15 @@ module Curl
  ##
  ## @return [Array] array of matches
  ##
- def extract(before, after)
-   before = /#{Regexp.escape(before)}/ unless before.instance_of?(Regexp)
-   after = /#{Regexp.escape(after)}/ unless after.instance_of?(Regexp)
-   rx = /(?<=#{before.source})(.*?)(?=#{after.source})/m
+ def extract(before, after, inclusive: false)
+   before = /#{Regexp.escape(before)}/ unless before.is_a?(Regexp)
+   after = /#{Regexp.escape(after)}/ unless after.is_a?(Regexp)
+
+   if inclusive
+     rx = /(#{before.source}.*?#{after.source})/m
+   else
+     rx = /(?<=#{before.source})(.*?)(?=#{after.source})/m
+   end
    @body.scan(rx).map { |r| @clean ? r[0].clean : r[0] }
  end

@@ -343,12 +354,16 @@ module Curl
  ##
  ## @return [Array] array of matched elements
  ##
- def search(path, source: @source)
+ def search(path, source: @source, return_source: false)
    doc = Nokogiri::HTML(source)
    output = []
-   doc.search(path).each do |el|
-     out = nokogiri_to_tag(el)
-     output.push(out)
+   if return_source
+     output = doc.search(path).to_html
+   else
+     doc.search(path).each do |el|
+       out = nokogiri_to_tag(el)
+       output.push(out)
+     end
    end
    output
  end
@@ -480,6 +495,7 @@ module Curl
  ##
  def content_links
    links = []
+
    link_tags = @body.to_enum(:scan, %r{<a ?(?<tag>.*?)>(?<text>.*?)</a>}).map { Regexp.last_match }
    link_tags.each do |m|
      href = m['tag'].match(/href=(["'])(.*?)\1/)
@@ -534,7 +550,7 @@ module Curl
  ## @return [String] page source
  ##
  def curl_dynamic_html
-   browser = @browser.normalize_browser_type if @browser.is_a?(String)
+   browser = @browser.is_a?(String) ? @browser.normalize_browser_type : @browser
    res = nil

    driver = Selenium::WebDriver.for browser
@@ -607,7 +623,7 @@ module Curl
  ##
  def curl_html(url = nil, source: nil, headers: nil,
                headers_only: false, compressed: false, fallback: false)
-   unless url.nil?
+   if !url.nil?
      flags = 'SsL'
      flags += @headers_only ? 'I' : 'i'
      agents = [
@@ -620,8 +636,8 @@ module Curl
      compress = @compressed ? '--compressed' : ''
      @source = `#{@curl} -#{flags} #{compress} #{headers} '#{@url}' 2>/dev/null`
      agent = 0
-     while source.nil? || source.empty?
-       source = `#{@curl} -#{flags} #{compress} -A "#{agents[agent]}" #{headers} '#{@url}' 2>/dev/null`
+     while @source.nil? || @source.empty?
+       @source = `#{@curl} -#{flags} #{compress} -A "#{agents[agent]}" #{headers} '#{@url}' 2>/dev/null`
        break if agent >= agents.count - 1
      end

@@ -630,49 +646,50 @@ module Curl
        Process.exit 1
      end

-     if @fallback && (@source.nil? || @source.empty?)
-       @source = curl_dynamic_html(@url, @fallback, @headers)
+     headers = { 'location' => @url }
+     lines = @source.split(/\r\n/)
+     code = lines[0].match(/(\d\d\d)/)[1]
+     lines.shift
+     lines.each_with_index do |line, idx|
+       if line =~ /^([\w-]+): (.*?)$/
+         m = Regexp.last_match
+         headers[m[1]] = m[2]
+       else
+         @source = lines[idx..].join("\n")
+         break
+       end
      end
-   end
-
-   return false if source.nil? || source.empty?

-   @source.strip!
+     if headers['content-encoding'] =~ /gzip/i && !compressed
+       warn 'Response is gzipped, you may need to try again with --compressed'
+     end

-   headers = { 'location' => @url }
-   lines = @source.split(/\r\n/)
-   code = lines[0].match(/(\d\d\d)/)[1]
-   lines.shift
-   lines.each_with_index do |line, idx|
-     if line =~ /^([\w-]+): (.*?)$/
-       m = Regexp.last_match
-       headers[m[1]] = m[2]
-     else
-       @source = lines[idx..].join("\n")
-       break
+     if headers['content-type'] =~ /json/
+       return { url: @url, code: code, headers: headers, meta: nil, links: nil,
+                head: nil, body: @source.strip, source: @source.strip, body_links: nil, body_images: nil }
      end
+   else
+     @source = source unless source.nil?
    end

-   if headers['content-encoding'] =~ /gzip/i && !compressed
-     warn 'Response is gzipped, you may need to try again with --compressed'
-   end
+   @source = curl_dynamic_html(@url, @fallback, @headers) if @fallback && (@source.nil? || @source.empty?)

-   if headers['content-type'] =~ /json/
-     return { url: @url, code: code, headers: headers, meta: nil, links: nil,
-              head: nil, body: @source.strip, source: @source.strip, body_links: nil, body_images: nil }
-   end
+   return false if @source.nil? || @source.empty?
+
+   @source.strip!

-   head = source.match(%r{(?<=<head>)(.*?)(?=</head>)}mi)
+   head = @source.match(%r{(?<=<head>)(.*?)(?=</head>)}mi)

    if head.nil?
      { url: @url, code: code, headers: headers, meta: nil, links: nil, head: nil, body: @source.strip,
        source: @source.strip, body_links: nil, body_images: nil }
    else
+     @body = @source.match(%r{<body.*?>(.*?)</body>}mi)[1]
      meta = meta_tags(head[1])
      links = link_tags(head[1])
-     body = @source.match(%r{<body.*?>(.*?)</body>}mi)[1]
-     { url: @url, code: code, headers: headers, meta: meta, links: links, head: head[1], body: body,
-       source: @source.strip, body_links: body_links, body_images: body_images }
+
+     { url: @url, code: code, headers: headers, meta: meta, links: links, head: head[1], body: @body,
+       source: @source.strip, body_links: nil, body_images: nil }
    end
  end

data/lib/curly/hash.rb CHANGED
@@ -2,6 +2,27 @@

  # Hash helpers
  class ::Hash
+   def to_data(url: nil, clean: false)
+     if key?(:body_links)
+       {
+         url: self[:url] || url,
+         code: self[:code],
+         headers: self[:headers],
+         meta: self[:meta],
+         meta_links: self[:links],
+         head: clean ? self[:head]&.strip&.clean : self[:head],
+         body: clean ? self[:body]&.strip&.clean : self[:body],
+         source: clean ? self[:source]&.strip&.clean : self[:source],
+         title: self[:title],
+         description: self[:description],
+         links: self[:body_links],
+         images: self[:body_images]
+       }
+     else
+       self
+     end
+   end
+
    # Extract data using a dot-syntax path
    #
    # @param path [String] The path
@@ -18,7 +39,7 @@ class ::Hash
        ats = []
        at = []
        while pth =~ /\[[+&,]?\w+ *[\^*$=<>]=? *\w+/
-         m = pth.match(/\[(?<com>[,+&])? *(?<key>\w+) *(?<op>[\^*$=<>]{1,2}) *(?<val>\w+) */)
+         m = pth.match(/\[(?<com>[,+&])? *(?<key>\w+) *(?<op>[\^*$=<>]{1,2}) *(?<val>[^,&\]]+) */)
          comp = [m['key'], m['op'], m['val']]
          case m['com']
          when ','
@@ -28,15 +49,16 @@ class ::Hash
            at.push(comp)
          end

-         pth.sub!(/\[(?<com>[,&+])? *(?<key>\w+) *(?<op>[\^*$=<>]{1,2}) *(?<val>\w+)/, '[')
+         pth.sub!(/\[(?<com>[,&+])? *(?<key>\w+) *(?<op>[\^*$=<>]{1,2}) *(?<val>[^,&\]]+)/, '[')
        ats.push(at) unless at.empty?
        pth.sub!(/\[\]/, '')

        return false if el.nil? && ats.empty? && !res.key?(pth)
-
        res = res[pth] unless pth.empty?

+       return false if res.nil?
+
        if ats.count.positive?
          while ats.count.positive?
            atr = ats.shift

@@ -60,7 +82,7 @@ class ::Hash
  ##
  ## @param r [Hash] hash of source elements and
  ##                 comparison operators
- ## @param atr [String] The attribute to compare
+ ## @param atr [Array] Array of arrays containing [attribute, comparator, value]
  ##
  ## @return [Boolean] whether the comparison passes or fails
  ##
@@ -118,7 +140,7 @@ class ::Hash
  end

  ##
- ## Test if a hash contains a tag matching filter queries
+ ## Test if a tag contains an attribute matching filter queries
  ##
  ## @param tag_name [String] The tag name
  ## @param classes [String] The classes to match
data/lib/curly/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Curly
-   VERSION = '0.0.4'
+   VERSION = '0.0.5'
  end
data/src/_README.md CHANGED
@@ -10,7 +10,7 @@ _If you find this useful, feel free to [buy me some coffee][donate]._
  [donate]: https://brettterpstra.com/donate
  <!--END GITHUB-->

- The current version of `curlyq` is <!--VER-->0.0.3<!--END VER-->.
+ The current version of `curlyq` is <!--VER-->0.0.4<!--END VER-->.

  CurlyQ is a utility that provides a simple interface for curl, with additional features for things like extracting images and links, finding elements by CSS selector or XPath, getting detailed header info, and more. It's designed to be part of a scripting pipeline, outputting everything as structured data (JSON or YAML). It also has rudimentary support for making calls to JSON endpoints easier, but it's expected that you'll use something like `jq` to parse the output.

@@ -39,12 +39,41 @@ Run `curlyq help` for a list of subcommands. Run `curlyq help SUBCOMMAND` for de
  @cli(bundle exec bin/curlyq help)
  ```

+ ### Query and Search syntax
+
+ You can shape the results using `--search` (`-s`) and `--query` (`-q`) on some commands.
+
+ A search uses either CSS or XPath syntax to locate elements. For example, if you wanted to locate all of the `<article>` elements with a class of `post` inside of the div with an id of `main`, you would run `--search '#main article.post'`. Searches can target tags, ids, and classes, and can accept `>` to target direct descendants. You can also use XPaths, but I hate those so I'm not going to document them.
+
+ Queries are specifically for shaping CurlyQ output. If you're using the `html` command, it returns a key called `images`, so you can target just the images in the response with `-q 'images'`. The queries accept array syntax, so to get the first image, you would use `-q 'images[0]'`. Ranges are accepted as well, so `-q 'images[1..4]'` will return the 2nd through 5th images found on the page. You can also do comparisons, e.g. `-q 'images[rel=me]'`, to target only images with a `rel` attribute of `me`.
+
+ The comparisons for the query flag are:
+
+ - `<` less than
+ - `>` greater than
+ - `<=` less than or equal to
+ - `>=` greater than or equal to
+ - `=` or `==` is equal to
+ - `*=` contains text
+ - `^=` starts with text
+ - `$=` ends with text
+
  #### Commands

  curlyq makes use of subcommands, e.g. `curlyq html [options] URL` or `curlyq extract [options] URL`. Each subcommand takes its own options, but I've made an effort to standardize the choices between each command as much as possible.

  ##### extract

+ Example:
+
+     curlyq extract -i -b 'Adding' -a 'accessing the source.' 'https://stackoverflow.com/questions/52428409/get-fully-rendered-html-using-selenium-webdriver-and-python'
+
+     [
+       "Adding <code>time.sleep(10)</code> in various places in case the page had not fully loaded when I was accessing the source."
+     ]
+
+ This specifies a before and after string and includes them (`-i`) in the result.
+
  ```
  @cli(bundle exec bin/curlyq help extract)
  ```
@@ -52,36 +81,198 @@ curlyq makes use of subcommands, e.g. `curlyq html [options] URL` or `curlyq ext

  ##### headlinks

+ Example:
+
+     curlyq headlinks -q '[rel=stylesheet]' https://brettterpstra.com
+
+     {
+       "rel": "stylesheet",
+       "href": "https://cdn3.brettterpstra.com/stylesheets/screen.7261.css",
+       "type": "text/css",
+       "title": null
+     }
+
+ This pulls all `<link>` elements from the `<head>` of the page, and uses a query (`-q`) to show only links with `rel="stylesheet"`.
+
  ```
  @cli(bundle exec bin/curlyq help headlinks)
  ```

  ##### html

+ The html command (aliased as `curl`) gets the entire text of the web page and provides a JSON response with a breakdown of:
+
+ - URL, after any redirects
+ - Response code
+ - Response headers as a keyed hash
+ - Meta elements for the page as a keyed hash
+ - All meta links in the head as an array of objects containing (as available):
+     - rel
+     - href
+     - type
+     - title
+ - source of `<head>`
+ - source of `<body>`
+ - the page title (determined first by og:title, then by a title tag)
+ - description (using og:description first)
+ - All links on the page as an array of objects with:
+     - href
+     - title
+     - rel
+     - text content
+     - classes as array
+ - All images on the page as an array of objects containing:
+     - class
+     - all attributes as key/value pairs
+     - width and height (if specified)
+     - src
+     - alt and title
+
+ You can add a query (`-q`) to get only the information you need, e.g. `-q 'images[width>600]'`.
+
+ Example:
+
+     curlyq html -s '#main article .aligncenter' -q 'images[1]' 'https://brettterpstra.com'
+
+     [
+       {
+         "class": "aligncenter",
+         "original": "https://cdn3.brettterpstra.com/uploads/2023/09/giveaway-keyboardmaestro2024-rb_tw.jpg",
+         "at2x": "https://cdn3.brettterpstra.com/uploads/2023/09/giveaway-keyboardmaestro2024-rb@2x.jpg",
+         "width": "800",
+         "height": "226",
+         "src": "https://cdn3.brettterpstra.com/uploads/2023/09/giveaway-keyboardmaestro2024-rb.jpg",
+         "alt": "Giveaway Robot with Keyboard Maestro icon",
+         "title": "Giveaway Robot with Keyboard Maestro icon"
+       }
+     ]
+
+ The above example queries the full HTML of the page, narrows the elements with `--search`, and then takes the 2nd image from the results.
+
+     curlyq html -q 'meta.title' https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
+
+     Introducing CurlyQ, a pipeline-oriented curl helper - BrettTerpstra.com
+
+ The above example curls the page and returns the title found in its meta (`-q 'meta.title'`).
+
  ```
  @cli(bundle exec bin/curlyq help html)
  ```

  ##### images

+ The images command returns only the images on the page as an array of objects. It can be queried to match certain requirements (see Query and Search syntax above).
+
+ The base command will return all images on the page, including OpenGraph images from the head, `<img>` tags from the body, and `<srcset>` tags along with their child images.
+
+ OpenGraph images will be returned with the structure:
+
+     {
+       "type": "opengraph",
+       "attrs": null,
+       "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb_tw.jpg"
+     }
+
+ `img` tags will be returned with the structure:
+
+     {
+       "type": "img",
+       "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb.jpg",
+       "width": "800",
+       "height": "226",
+       "alt": "Banner image for CurlyQ",
+       "title": "CurlyQ, curl better",
+       "attrs": [
+         {
+           "key": "class",
+           "value": [
+             "aligncenter"
+           ]
+         } // all attributes included
+       ]
+     }
+
+ `srcset` images will be returned with the structure:
+
+     {
+       "type": "srcset",
+       "attrs": [
+         {
+           "key": "srcset",
+           "value": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb_tw.jpg 1x, https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb@2x.jpg 2x"
+         }
+       ],
+       "images": [
+         {
+           "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb_tw.jpg",
+           "media": "1x"
+         },
+         {
+           "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb@2x.jpg",
+           "media": "2x"
+         }
+       ]
+     }
+
+ Example:
+
+     curlyq images -t img -q '[alt$=screenshot]' https://brettterpstra.com
+
+ This will return an array of images that are `<img>` tags, showing only the ones whose `alt` attribute ends with `screenshot`.
+
+     curlyq images -q '[width>750]' https://brettterpstra.com
+
+ This example will only return images that have a width greater than 750 pixels, which depends on the images having proper `width` attributes set in the source.
+
  ```
  @cli(bundle exec bin/curlyq help images)
  ```

  ##### json

+ The `json` command returns an object containing the header/response info plus the contents of the JSON response, parsed by the Ruby JSON library. If fetching or parsing fails, it fails gracefully with an error code.
+
  ```
  @cli(bundle exec bin/curlyq help json)
  ```

  ##### links

+ Returns all the links on the page, which can be queried on any attribute.
+
+ Example:
+
+     curlyq links -q '[rel=me]' https://brettterpstra.com
+
  ```
  @cli(bundle exec bin/curlyq help links)
  ```

  ##### scrape

+ Loads the page in a web browser, allowing scraping of dynamically loaded pages that return nothing but scripts when `curl`ed. The `-b` (`--browser`) option is required and should be 'chrome' or 'firefox' (or just 'c' or 'f'). The selected browser must be installed on your system.
+
+ Example:
+
+     curlyq scrape -b firefox -q 'links[rel=me&content*=mastodon][0]' https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
+
+     {
+       "href": "https://nojack.easydns.ca/@ttscoff",
+       "title": null,
+       "rel": [
+         "me"
+       ],
+       "content": "Mastodon",
+       "class": [
+         "u-url"
+       ]
+     }
+
+ This example scrapes the page using Firefox and finds the first link with a rel of 'me' and text containing 'mastodon'.
+
  ```
  @cli(bundle exec bin/curlyq help scrape)
  ```
@@ -90,12 +281,25 @@ curlyq makes use of subcommands, e.g. `curlyq html [options] URL` or `curlyq ext

  Full-page screenshots require Firefox, installed and specified with `--browser firefox`.

+ The `full` type only works when `-b` is Firefox; with Chrome you must use a `--type` of 'visible' or 'print'. The default type is `visible`.
+
+ The `-o` (`--out`) flag is required. It should be a path to a target PNG file (or PDF for `-t print` output). The extension will be adjusted automatically; all you need is the base name.
+
+ Example:
+
+     curlyq screenshot -b f -o ~/Desktop/test https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
+
+     Screenshot saved to /Users/ttscoff/Desktop/test.png
+
  ```
  @cli(bundle exec bin/curlyq help screenshot)
  ```

  ##### tags

+ Return a hierarchy of all tags in a page. Use `-t` to limit to a specific tag.
+
  ```
  @cli(bundle exec bin/curlyq help tags)
  ```
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: curlyq
  version: !ruby/object:Gem::Version
-   version: 0.0.4
+   version: 0.0.5
  platform: ruby
  authors:
  - Brett Terpstra
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2024-01-10 00:00:00.000000000 Z
+ date: 2024-01-12 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rake
@@ -139,6 +139,7 @@ extra_rdoc_files:
  files:
  - ".github/FUNDING.yml"
  - ".gitignore"
+ - ".irbrc"
  - CHANGELOG.md
  - Gemfile
  - Gemfile.lock