curlyq 0.0.4 → 0.0.5
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/.irbrc +4 -0
- data/CHANGELOG.md +19 -0
- data/Gemfile.lock +1 -1
- data/README.md +221 -13
- data/bin/curlyq +85 -46
- data/lib/curly/array.rb +60 -59
- data/lib/curly/curl/html.rb +58 -41
- data/lib/curly/hash.rb +27 -5
- data/lib/curly/version.rb +1 -1
- data/src/_README.md +205 -1
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2c5eb3f9a5444f19c44362545b302e3889c4e25dc34d9180452a736b1b80bc34
+  data.tar.gz: 3bf8d1009f493b60c31efb3636c64aa8871656dbcd9cebbeb01800d30fd0761c
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 808d8122080450acee5e98e0a6338e887ba5b6e3306764dab79c713052c6e5f6749d8b4ef90f43fcdc2cc7da41766f40e6684e0e40d2de98055e2d71986ac0e8
+  data.tar.gz: d4e17b0cc425cbf7a704cdd188e36f734707cd885a097c6e99cb0f8bc0089e46ffdd99d1e15844a981bbfd9a205778178e45dcaa637cd8a7e761432f2610991e
data/.gitignore
CHANGED
data/.irbrc
ADDED
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,22 @@
+### 0.0.5
+
+2024-01-11 18:06
+
+#### IMPROVED
+
+- Add --query capabilities to images command
+- Add --query to links command
+- Allow hyphens in query syntax
+- Allow any character other than comma, ampersand, or right square bracket in query value
+
+#### FIXED
+
+- Html --search returns a full Curl::Html object
+- --query works better with --search and is consistent with other query functions
+- Scrape command outputting malformed data
+- Hash output when --query is used with scrape
+- Nil match on tags command
+
 ### 0.0.4
 
 2024-01-10 13:54
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -10,7 +10,7 @@ _If you find this useful, feel free to [buy me some coffee][donate]._
 [donate]: https://brettterpstra.com/donate
 
-The current version of `curlyq` is 0.0.4.
+The current version of `curlyq` is 0.0.5.
 
 CurlyQ is a utility that provides a simple interface for curl, with additional features for things like extracting images and links, finding elements by CSS selector or XPath, getting detailed header info, and more. It's designed to be part of a scripting pipeline, outputting everything as structured data (JSON or YAML). It also has rudimentary support for making calls to JSON endpoints easier, but it's expected that you'll use something like `jq` to parse the output.
@@ -44,7 +44,7 @@ SYNOPSIS
     curlyq [global options] command [command options] [arguments...]
 
 VERSION
-    0.0.4
+    0.0.5
 
 GLOBAL OPTIONS
     --help - Show this message
@@ -65,12 +65,41 @@ COMMANDS
     tags - Extract all instances of a tag
 ```
 
+### Query and Search syntax
+
+You can shape the results using `--search` (`-s`) and `--query` (`-q`) on some commands.
+
+A search uses either CSS or XPath syntax to locate elements. For example, if you wanted to locate all of the `<article>` elements with a class of `post` inside the div with an id of `main`, you would run `--search '#main article.post'`. Searches can target tags, ids, and classes, and can accept `>` to target direct descendants. You can also use XPaths, but I hate those so I'm not going to document them.
+
+Queries are specifically for shaping CurlyQ output. If you're using the `html` command, it returns a key called `images`, so you can target just the images in the response with `-q 'images'`. The queries accept array syntax, so to get the first image, you would use `-q 'images[0]'`. Ranges are accepted as well, so `-q 'images[1..4]'` will return the 2nd through 5th images found on the page. You can also do comparisons, e.g. `images[rel=me]` to target only images with a `rel` attribute of `me`.
+
+The comparisons for the query flag are:
+
+- `<` less than
+- `>` greater than
+- `<=` less than or equal to
+- `>=` greater than or equal to
+- `=` or `==` is equal to
+- `*=` contains text
+- `^=` starts with text
+- `$=` ends with text
+
 #### Commands
 
 curlyq makes use of subcommands, e.g. `curlyq html [options] URL` or `curlyq extract [options] URL`. Each subcommand takes its own options, but I've made an effort to standardize the choices between each command as much as possible.
 
 ##### extract
 
+Example:
+
+    curlyq extract -i -b 'Adding' -a 'accessing the source.' 'https://stackoverflow.com/questions/52428409/get-fully-rendered-html-using-selenium-webdriver-and-python'
+
+    [
+      "Adding <code>time.sleep(10)</code> in various places in case the page had not fully loaded when I was accessing the source."
+    ]
+
+This specifies a before and after string and includes them (`-i`) in the result.
+
 ```
 NAME
     extract - Extract contents between two regular expressions
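To make the operator list above concrete, here is a small plain-Ruby sketch of how each comparison behaves. This is illustrative only, not curlyq's internal code (curlyq's own matching is case-insensitive), and the sample data is invented:

```
# Illustrative reimplementation of the query comparisons (not curlyq's code)
images = [
  { 'width' => '800', 'alt' => 'Banner screenshot' },
  { 'width' => '400', 'alt' => 'Thumbnail' }
]

def compare(value, op, target)
  case op
  when '<'       then value.to_i < target.to_i
  when '>'       then value.to_i > target.to_i
  when '<='      then value.to_i <= target.to_i
  when '>='      then value.to_i >= target.to_i
  when '=', '==' then value.to_s == target.to_s
  when '*='      then value.to_s.include?(target)   # contains text
  when '^='      then value.to_s.start_with?(target) # starts with text
  when '$='      then value.to_s.end_with?(target)   # ends with text
  end
end

# images[width>600]
p images.select { |img| compare(img['width'], '>', '600') }
# images[alt$=screenshot]
p images.select { |img| compare(img['alt'], '$=', 'screenshot') }
```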
@@ -80,17 +109,32 @@ SYNOPSIS
     curlyq [global options] extract [command options] URL...
 
 COMMAND OPTIONS
-    -a, --after=arg - Text after extraction
-    -b, --before=arg - Text before extraction
+    -a, --after=arg - Text after extraction (default: none)
+    -b, --before=arg - Text before extraction (default: none)
     -c, --[no-]compressed - Expect compressed results
     --[no-]clean - Remove extra whitespace from results
     -h, --header=arg - Define a header to send as key=value (may be used more than once, default: none)
+    -i, --[no-]include - Include the before/after matches in the result
+    -r, --[no-]regex - Process before/after strings as regular expressions
     --[no-]strip - Strip HTML tags from results
 ```
 
 
 ##### headlinks
 
+Example:
+
+    curlyq headlinks -q '[rel=stylesheet]' https://brettterpstra.com
+
+    {
+      "rel": "stylesheet",
+      "href": "https://cdn3.brettterpstra.com/stylesheets/screen.7261.css",
+      "type": "text/css",
+      "title": null
+    }
+
+This pulls all `<link>` tags from the `<head>` of the page, and uses a query (`-q`) to only show links with `rel="stylesheet"`.
+
 ```
 NAME
     headlinks - Return all <head> links on URL's page
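The new `-r` and `-i` switches correspond to regex construction that 0.0.5 adds to `Curl::Html#extract` (shown later in this diff under lib/curly/curl/html.rb). A standalone sketch of that logic, reusing the README's sample strings:

```
# Sketch of the inclusive/non-inclusive extraction regexes (sample text
# adapted from the README example above)
body = 'Adding time.sleep(10) in various places in case the page had not fully loaded when I was accessing the source.'
before = /#{Regexp.escape('Adding')}/
after  = /#{Regexp.escape('accessing the source.')}/

# -i: delimiters included in the match
inclusive = /(#{before.source}.*?#{after.source})/m
# default: lookarounds exclude the delimiters
non_inclusive = /(?<=#{before.source})(.*?)(?=#{after.source})/m

p body.scan(inclusive).map(&:first)
p body.scan(non_inclusive).map(&:first)
```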
@@ -105,6 +149,61 @@ COMMAND OPTIONS
 
 ##### html
 
+The html command (aliased as `curl`) gets the entire text of the web page and provides a JSON response with a breakdown of:
+
+- URL, after any redirects
+- Response code
+- Response headers as a keyed hash
+- Meta elements for the page as a keyed hash
+- All meta links in the head as an array of objects containing (as available):
+    - rel
+    - href
+    - type
+    - title
+- Source of `<head>`
+- Source of `<body>`
+- The page title (determined first by og:title, then by a title tag)
+- Description (using og:description first)
+- All links on the page as an array of objects with:
+    - href
+    - title
+    - rel
+    - text content
+    - classes as array
+- All images on the page as an array of objects containing:
+    - class
+    - all attributes as key/value pairs
+    - width and height (if specified)
+    - src
+    - alt and title
+
+You can add a query (`-q`) to only get the information needed, e.g. `-q 'images[width>600]'`.
+
+Example:
+
+    curlyq html -s '#main article .aligncenter' -q 'images[1]' 'https://brettterpstra.com'
+
+    [
+      {
+        "class": "aligncenter",
+        "original": "https://cdn3.brettterpstra.com/uploads/2023/09/giveaway-keyboardmaestro2024-rb_tw.jpg",
+        "at2x": "https://cdn3.brettterpstra.com/uploads/2023/09/giveaway-keyboardmaestro2024-rb@2x.jpg",
+        "width": "800",
+        "height": "226",
+        "src": "https://cdn3.brettterpstra.com/uploads/2023/09/giveaway-keyboardmaestro2024-rb.jpg",
+        "alt": "Giveaway Robot with Keyboard Maestro icon",
+        "title": "Giveaway Robot with Keyboard Maestro icon"
+      }
+    ]
+
+The above example queries the full html of the page, but narrows the elements using `--search` and then takes the 2nd image from the results.
+
+    curlyq html -q 'meta.title' https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
+
+    Introducing CurlyQ, a pipeline-oriented curl helper - BrettTerpstra.com
+
+The above example curls the page and returns the title attribute found in the meta (`-q 'meta.title'`).
+
 ```
 NAME
     html - Curl URL and output its elements, multiple URLs allowed
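For reference, `-q 'meta.title'` resolves through `dot_query`, which this release extends on `::Hash` (see lib/curly/hash.rb below); at its simplest, a dot path is just a walk over nested keys. A sketch with an invented hash:

```
# Simplified dot-path walk (illustrative; dot_query itself also handles
# array indexes, ranges, and comparisons)
response = {
  'meta'   => { 'title' => 'Introducing CurlyQ, a pipeline-oriented curl helper' },
  'images' => []
}
p 'meta.title'.split('.').reduce(response) { |acc, key| acc[key] }
# => "Introducing CurlyQ, a pipeline-oriented curl helper"
```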
@@ -124,12 +223,78 @@ COMMAND OPTIONS
     --[no-]ignore_relative - Ignore relative hrefs when gathering content links
     -q, --query, --filter=arg - Filter output using dot-syntax path (default: none)
     -r, --raw=arg - Output a raw value for a key (default: none)
-    --search=arg
+    -s, --search=arg - Return an array of matches to a CSS or XPath query (default: none)
     -x, --external_links_only - Only gather external links
 ```
 
 ##### images
 
+The images command returns only the images on the page as an array of objects. It can be queried to match certain requirements (see Query and Search syntax above).
+
+The base command will return all images on the page, including OpenGraph images from the head, `<img>` tags from the body, and `<srcset>` tags along with their child images.
+
+OpenGraph images will be returned with the structure:
+
+    {
+      "type": "opengraph",
+      "attrs": null,
+      "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb_tw.jpg"
+    }
+
+`img` tags will be returned with the structure:
+
+    {
+      "type": "img",
+      "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb.jpg",
+      "width": "800",
+      "height": "226",
+      "alt": "Banner image for CurlyQ",
+      "title": "CurlyQ, curl better",
+      "attrs": [
+        {
+          "key": "class",
+          "value": [
+            "aligncenter"
+          ] // all attributes included
+        }
+      ]
+    }
+
+`srcset` images will be returned with the structure:
+
+    {
+      "type": "srcset",
+      "attrs": [
+        {
+          "key": "srcset",
+          "value": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb_tw.jpg 1x, https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb@2x.jpg 2x"
+        }
+      ],
+      "images": [
+        {
+          "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb_tw.jpg",
+          "media": "1x"
+        },
+        {
+          "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb@2x.jpg",
+          "media": "2x"
+        }
+      ]
+    }
+
+Example:
+
+    curlyq images -t img -q '[alt$=screenshot]' https://brettterpstra.com
+
+This will return an array of images that are `<img>` tags, and only show the ones that have an `alt` attribute that ends with `screenshot`.
+
+    curlyq images -q '[width>750]' https://brettterpstra.com
+
+This example will only return images that have a width greater than 750 pixels. This query depends on the images having proper `width` attributes set on them in the source.
+
 ```
 NAME
     images - Extract all images from a URL
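The src/media pairs in the `srcset` structure above follow from splitting the attribute value on commas and whitespace. A rough sketch of that decomposition (example URLs invented):

```
# Decomposing a srcset attribute into the src/media objects shown above
srcset = 'https://example.com/img.jpg 1x, https://example.com/img@2x.jpg 2x'
images = srcset.split(/\s*,\s*/).map do |candidate|
  src, media = candidate.split(/\s+/)
  { 'src' => src, 'media' => media }
end
p images
# => [{"src"=>"https://example.com/img.jpg", "media"=>"1x"},
#     {"src"=>"https://example.com/img@2x.jpg", "media"=>"2x"}]
```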
@@ -139,14 +304,17 @@ SYNOPSIS
     curlyq [global options] images [command options] URL...
 
 COMMAND OPTIONS
-    -c, --[no-]compressed
-    --[no-]clean
-    -h, --header=arg
-    -
+    -c, --[no-]compressed - Expect compressed results
+    --[no-]clean - Remove extra whitespace from results
+    -h, --header=arg - Define a header to send as key=value (may be used more than once, default: none)
+    -q, --query, --filter=arg - Filter output using dot-syntax path (default: none)
+    -t, --type=arg - Type of images to return (img, srcset, opengraph, all) (may be used more than once, default: ["all"])
 ```
 
 ##### json
 
+The `json` command just returns an object with header/response info, and the contents of the JSON response after it's been read by the Ruby JSON library and output. If there are fetching or parsing errors it will fail gracefully with an error code.
+
 ```
 NAME
     json - Get a JSON response from a URL, multiple URLs allowed
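Since the command leans on Ruby's stdlib JSON parser, the parsing step is essentially the following (sample payload invented):

```
require 'json'

body = '{"status": "ok", "count": 3}' # invented payload
data = JSON.parse(body)               # raises JSON::ParserError on bad input
p data['status']                      # => "ok"
```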
@@ -163,6 +331,12 @@ COMMAND OPTIONS
 
 ##### links
 
+Returns all the links on the page, which can be queried on any attribute.
+
+Example:
+
+    curlyq images -t img -q '[width>750]' https://brettterpstra.com
+
 ```
 NAME
     links - Return all links on a URL's page
@@ -181,6 +355,26 @@ COMMAND OPTIONS
 
 ##### scrape
 
+Loads the page in a web browser, allowing scraping of dynamically loaded pages that return nothing but scripts when `curl`ed. The `-b` (`--browser`) option is required and should be 'chrome' or 'firefox' (or just 'c' or 'f'). The selected browser must be installed on your system.
+
+Example:
+
+    curlyq scrape -b firefox -q 'links[rel=me&content*=mastodon][0]' https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
+
+    {
+      "href": "https://nojack.easydns.ca/@ttscoff",
+      "title": null,
+      "rel": [
+        "me"
+      ],
+      "content": "Mastodon",
+      "class": [
+        "u-url"
+      ]
+    }
+
+This example scrapes the page using Firefox and finds the first link with a rel of 'me' and text containing 'mastodon'.
+
 ```
 NAME
     scrape - Scrape a page using a web browser, for dynamic (JS) pages. Be sure to have the selected --browser installed.
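Under the hood, scraping drives a real browser through the selenium-webdriver gem (the same mechanism `Curl::Html#curl_dynamic_html` uses, per the lib diff below). A minimal sketch, assuming Firefox and the gem are installed:

```
require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox # or :chrome
driver.get 'https://brettterpstra.com'
html = driver.page_source # the DOM after JavaScript has run, unlike plain curl
driver.quit

puts html.length
```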
@@ -190,7 +384,7 @@ SYNOPSIS
     curlyq [global options] scrape [command options] URL...
 
 COMMAND OPTIONS
-    -b, --browser=arg - Browser to use (firefox, chrome) (default: none)
+    -b, --browser=arg - Browser to use (firefox, chrome) (required, default: none)
     --[no-]clean - Remove extra whitespace from results
     -h, --header=arg - Define a header to send as "key=value" (may be used more than once, default: none)
     -q, --query, --filter=arg - Filter output using dot-syntax path (default: none)
@@ -202,6 +396,17 @@ COMMAND OPTIONS
 
 Full-page screenshots require Firefox, installed and specified with `--browser firefox`.
 
+Type defaults to `full`, but will only work if `-b` is Firefox. If you want to use Chrome, you must specify a `--type` of 'visible' or 'print'.
+
+The `-o` (`--out`) flag is required. It should be a path to a target PNG file (or PDF for `-t print` output). The extension will be modified automatically; all you need is the base name.
+
+Example:
+
+    curlyq screenshot -b f -o ~/Desktop/test https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
+
+    Screenshot saved to /Users/ttscoff/Desktop/test.png
+
 ```
 NAME
     screenshot - Save a screenshot of a URL
@@ -213,12 +418,14 @@ SYNOPSIS
 COMMAND OPTIONS
     -b, --browser=arg - Browser to use (firefox, chrome) (default: chrome)
     -h, --header=arg - Define a header to send as key=value (may be used more than once, default: none)
-    -o, --out, --file=arg - File destination (default: none)
-    -t, --type=arg - Type of screenshot to save (full (requires firefox), print, visible) (default:
+    -o, --out, --file=arg - File destination (required, default: none)
+    -t, --type=arg - Type of screenshot to save (full (requires firefox), print, visible) (default: visible)
 ```
 
 ##### tags
 
+Return a hierarchy of all tags in a page. Use `-t` to limit to a specific tag.
+
 ```
 NAME
     tags - Extract all instances of a tag
@@ -231,7 +438,8 @@ COMMAND OPTIONS
     -c, --[no-]compressed - Expect compressed results
     --[no-]clean - Remove extra whitespace from results
     -h, --header=arg - Define a header to send as key=value (may be used more than once, default: none)
-    -q, --query, --
+    -q, --query, --filter=arg - CSS/XPath query (default: none)
+    --search=arg - Return an array of matches to a CSS or XPath query (default: none)
     -t, --tag=arg - Specify a tag to collect (may be used more than once, default: none)
 ```
data/bin/curlyq
CHANGED
@@ -71,7 +71,7 @@ command %i[html curl] do |c|
   c.switch %i[I info], negatable: false
 
   c.desc 'Return an array of matches to a CSS or XPath query'
-  c.flag %i[search]
+  c.flag %i[s search]
 
   c.desc 'Define a header to send as "key=value"'
   c.flag %i[h header], multiple: true
@@ -110,25 +110,31 @@ command %i[html curl] do |c|
     output = []
 
     urls.each do |url|
-
-
-
-
-
-
+      curl_settings = { browser: options[:browser], fallback: options[:fallback],
+                        headers: headers, headers_only: options[:info],
+                        compressed: options[:compressed], clean: options[:clean],
+                        ignore_local_links: options[:ignore_relative],
+                        ignore_fragment_links: options[:ignore_fragments],
+                        external_links_only: options[:external_links_only] }
+      res = Curl::Html.new(url, curl_settings)
       res.curl
 
       if options[:info]
        output.push(res.headers)
-        # print_out(res.headers, global_options[:yaml], raw: options[:raw], pretty: global_options[:pretty])
        next
      end
 
      if options[:search]
-
+        source = res.search(options[:search], return_source: true)
 
-        out =
-
+        out = res.parse(source)
+
+        if options[:query]
+          out = out.to_data(url: url, clean: options[:clean]).dot_query(options[:query])
+        else
+          out = out.to_data
+        end
+
+        output.push([out])
      elsif options[:query]
        queried = res.to_data.dot_query(options[:query])
        output.push(queried) if queried
@@ -136,7 +142,7 @@ command %i[html curl] do |c|
        output.push(res.to_data(url: url))
      end
    end
-
+    output.delete_if(&:nil?)
    output.delete_if(&:empty?)
    output = output[0] if output.count == 1
    output.map! { |o| o[options[:raw].to_sym] } if options[:raw]
@@ -149,13 +155,13 @@ desc 'Save a screenshot of a URL'
 arg_name 'URL', multiple: true
 command :screenshot do |c|
   c.desc 'Type of screenshot to save (full (requires firefox), print, visible)'
-  c.flag %i[t type], type: ScreenshotType, must_match: /^[fpv].*?$/, default_value: '
+  c.flag %i[t type], type: ScreenshotType, must_match: /^[fpv].*?$/, default_value: 'visible'
 
   c.desc 'Browser to use (firefox, chrome)'
   c.flag %i[b browser], type: BrowserType, must_match: /^[fc].*?$/, default_value: 'chrome'
 
   c.desc 'File destination'
-  c.flag %i[o out file]
+  c.flag %i[o out file], required: true
 
   c.desc 'Define a header to send as key=value'
   c.flag %i[h header], multiple: true
@@ -164,11 +170,19 @@ command :screenshot do |c|
    urls = args.join(' ').split(/[, ]+/)
    headers = break_headers(options[:header])
 
+    type = options[:type]
+    browser = options[:browser]
+
+    type = type.is_a?(Symbol) ? type : type.normalize_screenshot_type
+    browser = browser.is_a?(Symbol) ? browser : browser.normalize_browser_type
+
+    raise 'Full page screen shots only available with Firefox' if type == :full_page && browser != :firefox
+
    urls.each do |url|
      c = Curl::Html.new(url)
      c.headers = headers
-      c.browser =
-      c.screenshot(options[:out], type:
+      c.browser = browser
+      c.screenshot(options[:out], type: type)
    end
  end
 end
@@ -221,12 +235,18 @@ end
 desc 'Extract contents between two regular expressions'
 arg_name 'URL', multiple: true
 command :extract do |c|
-  c.desc 'Text before extraction
+  c.desc 'Text before extraction'
   c.flag %i[b before]
 
-  c.desc 'Text after extraction
+  c.desc 'Text after extraction'
   c.flag %i[a after]
 
+  c.desc 'Process before/after strings as regular expressions'
+  c.switch %i[r regex]
+
+  c.desc 'Include the before/after matches in the result'
+  c.switch %i[i include]
+
   c.desc 'Define a header to send as key=value'
   c.flag %i[h header], multiple: true
@@ -249,7 +269,15 @@ command :extract do |c|
      res = Curl::Html.new(url, { headers: headers, headers_only: false,
                                  compressed: options[:compressed], clean: options[:clean] })
      res.curl
-
+      if options[:regex]
+        before = Regexp.new(options[:before])
+        after = Regexp.new(options[:after])
+      else
+        before = /#{Regexp.escape(options[:before])}/
+        after = /#{Regexp.escape(options[:after])}/
+      end
+
+      extracted = res.extract(before, after, inclusive: options[:include])
      extracted.strip_tags! if options[:strip]
      output.concat(extracted)
    end
@@ -274,7 +302,10 @@ command :tags do |c|
   c.switch %i[clean]
 
   c.desc 'CSS/XPath query'
-  c.flag %i[q query
+  c.flag %i[q query filter]
+
+  c.desc 'Return an array of matches to a CSS or XPath query'
+  c.flag %i[search]
 
   c.action do |global_options, options, args|
     urls = args.join(' ').split(/[, ]+/)
@@ -286,9 +317,17 @@ command :tags do |c|
      res = Curl::Html.new(url, { headers: headers, headers_only: options[:headers],
                                  compressed: options[:compressed], clean: options[:clean] })
      res.curl
+
      output = []
      if options[:search]
-
+        out = res.search(options[:search])
+
+        # out = out.dot_query(options[:query]) if options[:query]
+        output.push(out)
+      elsif options[:query]
+        query = options[:query] =~ /^links/ ? options[:query] : "links#{options[:query]}"
+
+        output = res.to_data.dot_query(query)
      elsif tags.count.positive?
        tags.each { |tag| output.concat(res.tags(tag)) }
      else
@@ -312,6 +351,9 @@ command :images do |c|
   c.desc 'Remove extra whitespace from results'
   c.switch %i[clean]
 
+  c.desc 'Filter output using dot-syntax path'
+  c.flag %i[q query filter]
+
   c.desc 'Define a header to send as key=value'
   c.flag %i[h header], multiple: true
@@ -326,7 +368,15 @@ command :images do |c|
    urls.each do |url|
      res = Curl::Html.new(url, { compressed: options[:compressed], clean: options[:clean] })
      res.curl
-
+
+      res = res.images(types: types)
+
+      if options[:query]
+        query = options[:query] =~ /^images/ ? options[:query] : "images#{options[:query]}"
+        res = { images: res }.dot_query(query)
+      end
+
      output.concat(res)
    end
 
    print_out(output, global_options[:yaml], pretty: global_options[:pretty])
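Both the images and links actions normalize bare queries by prefixing the collection key before calling `dot_query`, as the hunk above shows. The pattern, extracted into a sketch:

```
# Prefix normalization used by the images and links actions: a bare
# '[width>750]' becomes 'images[width>750]' before dot_query runs.
def normalize(query, key)
  query =~ /^#{key}/ ? query : "#{key}#{query}"
end

p normalize('[width>750]', 'images') # => "images[width>750]"
p normalize('images[0]', 'images')   # => "images[0]" (already prefixed)
```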
@@ -367,7 +417,7 @@ command :links do |c|
 
      if options[:query]
        query = options[:query] =~ /^links/ ? options[:query] : "links#{options[:query]}"
-        queried =
+        queried = res.to_data.dot_query(query)
        output.concat(queried) if queried
      else
        output.concat(res.body_links)
@@ -414,7 +464,7 @@ desc %(Scrape a page using a web browser, for dynamic (JS) pages. Be sure to hav
 arg_name 'URL', multiple: true
 command :scrape do |c|
   c.desc 'Browser to use (firefox, chrome)'
-  c.flag %i[b browser], type: BrowserType
+  c.flag %i[b browser], type: BrowserType, required: true
 
   c.desc 'Return an array of matches to a CSS or XPath query'
   c.flag %i[search]
@@ -437,30 +487,19 @@ command :scrape do |c|
    output = []
 
    urls.each do |url|
-      driver.get url
-      res = driver.page_source
-
-      res = Curl::Html.new(nil, { source: res, clean: options[:clean] })
-      res.curl
-      if options[:search]
-        out = res.search(options[:search])
-
-        out = out.dot_query(options[:query]) if options[:query]
-        output.push(out)
-      elsif options[:query]
-        queried = res.to_data(url: url).dot_query(options[:query])
-        output = queried if queried
-      else
-        output.push(res.to_data(url: url))
-      end
+      res = Curl::Html.new(url, { browser: options[:browser], clean: options[:clean] })
+      res.curl
+
+      if options[:search]
+        out = res.search(options[:search])
+
+        out = out.dot_query(options[:query]) if options[:query]
+        output.push(out)
+      elsif options[:query]
+        queried = res.to_data(url: url).dot_query(options[:query])
+        output.push(queried) if queried
+      else
+        output.push(res.to_data(url: url))
+      end
    end
 
data/lib/curly/array.rb
CHANGED
@@ -67,68 +67,69 @@ class ::Array
   end
 
   ##
-  ##
-  ##
-  ## @param
-  ##
-  ## @
-  ##
-
-      el = Regexp.last_match(1) if pth =~ /\[([0-9,.]+)\]/
-      pth.sub!(/\[([0-9,.]+)\]/, '')
-      ats = []
-      at = []
-      while pth =~ /\[[+&,]?\w+ *[\^*$=<>]=? *\w+/
-        m = pth.match(/\[(?<com>[,+&])? *(?<key>\w+) *(?<op>[\^*$=<>]{1,2}) *(?<val>\w+) */)
-        comp = [m['key'], m['op'], m['val']]
-        case m['com']
-        when ','
-          ats.push(comp)
-          at = []
-        else
-          at.push(comp)
-        end
-
-        pth.sub!(/\[(?<com>[,&+])? *(?<key>\w+) *(?<op>[\^*$=<>]{1,2}) *(?<val>\w+)/, '[')
-      end
-      ats.push(at) unless at.empty?
-      pth.sub!(/\[\]/, '')
-
-      return false if el.nil? && ats.empty? && !res.key?(pth)
-
-      res = res[pth] unless pth.empty?
-
-      while ats.count.positive?
-        atr = ats.shift
-
-        keepers = res.filter do |r|
-          evaluate_comp(r, atr)
-        end
-        out.concat(keepers)
-      end
-
-      out = out[eval(el)] if out.is_a?(Array) && el =~ /^[\d.,]+$/
-      end
-      output.push(out)
-    end
-    output
+  ## Test if a tag contains an attribute matching filter queries
+  ##
+  ## @param tag_name [String] The tag name
+  ## @param classes [String] The classes to match
+  ## @param id [String] The id attribute to match
+  ## @param attribute [String] The attribute
+  ## @param operator [String] The operator, <>= *= $= ^=
+  ## @param value [String] The value to match
+  ## @param descendant [Boolean] Check descendant tags
+  ##
+  def tag_match(tag_name, classes, id, attribute, operator, value, descendant: false)
+    tag = self
+    keep = true
+
+    keep = false if tag_name && !tag['tag'] =~ /^#{tag_name}$/i
+
+    if tag.key?('attrs') && tag['attrs']
+      if keep && id
+        tag_id = tag['attrs'].filter { |a| a['key'] == 'id' }.first['value']
+        keep = tag_id && tag_id =~ /#{id}/i
+      end
+
+      if keep && classes
+        cls = tag['attrs'].filter { |a| a['key'] == 'class' }.first
+        if cls
+          all = true
+          classes.each { |c| all = cls['value'].include?(c) }
+          keep = all
+        else
+          keep = false
+        end
+      end
+
+      if keep && attribute
+        attributes = tag['attrs'].filter { |a| a['key'] =~ /^#{attribute}$/i }
+        any = false
+        attributes.each do |a|
+          break if any
+
+          any = case operator
+                when /^\*/
+                  a['value'] =~ /#{value}/i
+                when /^\^/
+                  a['value'] =~ /^#{value}/i
+                when /^\$/
+                  a['value'] =~ /#{value}$/i
+                else
+                  a['value'] =~ /^#{value}$/i
+                end
+        end
+        keep = any
+      end
+    end
+
+    return false if descendant && !keep
+
+    if !descendant && tag.key?('tags')
+      tags = tag['tags'].filter { |t| t.tag_match(tag_name, classes, id, attribute, operator, value) }
+      tags.count.positive?
+    else
+      keep
+    end
   end
 end
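The operator dispatch in `tag_match` keys off the first character of the comparison; condensed into a standalone sketch (not the gem's code, but the same regex shapes):

```
# Condensed version of tag_match's operator dispatch (illustrative only)
def attr_match?(value, operator, target)
  case operator
  when /^\*/ then !(value =~ /#{target}/i).nil?   # *= contains
  when /^\^/ then !(value =~ /^#{target}/i).nil?  # ^= starts with
  when /^\$/ then !(value =~ /#{target}$/i).nil?  # $= ends with
  else            !(value =~ /^#{target}$/i).nil? # = exact, case-insensitive
  end
end

p attr_match?('aligncenter', '*=', 'center') # => true
p attr_match?('aligncenter', '^=', 'center') # => false
```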
data/lib/curly/curl/html.rb
CHANGED
@@ -65,7 +65,13 @@ module Curl
      @external_links_only = options[:external_links_only]
 
      @curl = TTY::Which.which('curl')
-      @url = url
+      @url = url.nil? ? options[:url] : url
+    end
+
+    def parse(source)
+      @body = source
+      { url: @url, code: @code, headers: @headers, meta: @meta, links: @links, head: @head, body: source,
+        source: source.strip, body_links: content_links, body_images: content_images }
    end
 
    def curl
@@ -118,10 +124,15 @@ module Curl
    ##
    ## @return [Array] array of matches
    ##
-    def extract(before, after)
-      before = /#{Regexp.escape(before)}/ unless before.
-      after = /#{Regexp.escape(after)}/ unless after.
-
+    def extract(before, after, inclusive: false)
+      before = /#{Regexp.escape(before)}/ unless before.is_a?(Regexp)
+      after = /#{Regexp.escape(after)}/ unless after.is_a?(Regexp)
+
+      if inclusive
+        rx = /(#{before.source}.*?#{after.source})/m
+      else
+        rx = /(?<=#{before.source})(.*?)(?=#{after.source})/m
+      end
      @body.scan(rx).map { |r| @clean ? r[0].clean : r[0] }
    end
@@ -343,12 +354,16 @@ module Curl
    ##
    ## @return [Array] array of matched elements
    ##
-    def search(path, source: @source)
+    def search(path, source: @source, return_source: false)
      doc = Nokogiri::HTML(source)
      output = []
-
-
-
+
+      if return_source
+        output = doc.search(path).to_html
+      else
+        doc.search(path).each do |el|
+          out = nokogiri_to_tag(el)
+          output.push(out)
+        end
      end
      output
    end
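What `return_source: true` produces: Nokogiri's `NodeSet#to_html` joins the matched nodes back into a single HTML string. A self-contained sketch:

```
require 'nokogiri'

doc = Nokogiri::HTML('<div id="main"><article class="post">Hi</article></div>')
# The same search call, returning source instead of tag objects
puts doc.search('#main article.post').to_html
# => <article class="post">Hi</article>
```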
@@ -480,6 +495,7 @@ module Curl
    ##
    def content_links
      links = []
+
      link_tags = @body.to_enum(:scan, %r{<a ?(?<tag>.*?)>(?<text>.*?)</a>}).map { Regexp.last_match }
      link_tags.each do |m|
        href = m['tag'].match(/href=(["'])(.*?)\1/)
|
|
534
550
|
## @return [String] page source
|
535
551
|
##
|
536
552
|
def curl_dynamic_html
|
537
|
-
browser = @browser.normalize_browser_type
|
553
|
+
browser = @browser.is_a?(String) ? @browser.normalize_browser_type : @browser
|
538
554
|
res = nil
|
539
555
|
|
540
556
|
driver = Selenium::WebDriver.for browser
|
@@ -607,7 +623,7 @@ module Curl
    ##
    def curl_html(url = nil, source: nil, headers: nil,
                  headers_only: false, compressed: false, fallback: false)
-
+      if !url.nil?
        flags = 'SsL'
        flags += @headers_only ? 'I' : 'i'
        agents = [
|
|
620
636
|
compress = @compressed ? '--compressed' : ''
|
621
637
|
@source = `#{@curl} -#{flags} #{compress} #{headers} '#{@url}' 2>/dev/null`
|
622
638
|
agent = 0
|
623
|
-
while source.nil? || source.empty?
|
624
|
-
source = `#{@curl} -#{flags} #{compress} -A "#{agents[agent]}" #{headers} '#{@url}' 2>/dev/null`
|
639
|
+
while @source.nil? || @source.empty?
|
640
|
+
@source = `#{@curl} -#{flags} #{compress} -A "#{agents[agent]}" #{headers} '#{@url}' 2>/dev/null`
|
625
641
|
break if agent >= agents.count - 1
|
626
642
|
end
|
627
643
|
|
@@ -630,49 +646,50 @@ module Curl
          Process.exit 1
        end
 
-        lines.shift
-        lines.each_with_index do |line, idx|
-          if line =~ /^([\w-]+): (.*?)$/
-            m = Regexp.last_match
-            headers[m[1]] = m[2]
-          else
-            @source = lines[idx..].join("\n")
-            break
-          end
-        end
-
-      return false if source.nil? || source.empty?
-
-          warn 'Response is gzipped, you may need to try again with --compressed'
-        end
-
-      if
-
-      head = source.match(%r{(?<=<head>)(.*?)(?=</head>)}mi)
+        headers = { 'location' => @url }
+        lines = @source.split(/\r\n/)
+        code = lines[0].match(/(\d\d\d)/)[1]
+        lines.shift
+        lines.each_with_index do |line, idx|
+          if line =~ /^([\w-]+): (.*?)$/
+            m = Regexp.last_match
+            headers[m[1]] = m[2]
+          else
+            @source = lines[idx..].join("\n")
+            break
+          end
+        end
+
+        if headers['content-encoding'] =~ /gzip/i && !compressed
+          warn 'Response is gzipped, you may need to try again with --compressed'
+        end
+
+        if headers['content-type'] =~ /json/
+          return { url: @url, code: code, headers: headers, meta: nil, links: nil,
+                   head: nil, body: @source.strip, source: @source.strip, body_links: nil, body_images: nil }
+        end
+      else
+        @source = source unless source.nil?
+      end
+
+      @source = curl_dynamic_html(@url, @fallback, @headers) if @fallback && (@source.nil? || @source.empty?)
+
+      return false if @source.nil? || @source.empty?
+
+      @source.strip!
+
+      head = @source.match(%r{(?<=<head>)(.*?)(?=</head>)}mi)
 
      if head.nil?
        { url: @url, code: code, headers: headers, meta: nil, links: nil, head: nil, body: @source.strip,
          source: @source.strip, body_links: nil, body_images: nil }
      else
+        @body = @source.match(%r{<body.*?>(.*?)</body>}mi)[1]
        meta = meta_tags(head[1])
        links = link_tags(head[1])
-
-        { url: @url, code: code, headers: headers, meta: meta, links: links, head: head[1], body: body,
-          source: @source.strip, body_links:
+
+        { url: @url, code: code, headers: headers, meta: meta, links: links, head: head[1], body: @body,
+          source: @source.strip, body_links: nil, body_images: nil }
      end
    end
 
data/lib/curly/hash.rb
CHANGED
@@ -2,6 +2,27 @@
 
 # Hash helpers
 class ::Hash
+  def to_data(url: nil, clean: false)
+    if key?(:body_links)
+      {
+        url: self[:url] || url,
+        code: self[:code],
+        headers: self[:headers],
+        meta: self[:meta],
+        meta_links: self[:links],
+        head: clean ? self[:head]&.strip&.clean : self[:head],
+        body: clean ? self[:body]&.strip&.clean : self[:body],
+        source: clean ? self[:source]&.strip&.clean : self[:source],
+        title: self[:title],
+        description: self[:description],
+        links: self[:body_links],
+        images: self[:body_images]
+      }
+    else
+      self
+    end
+  end
+
   # Extract data using a dot-syntax path
   #
   # @param path [String] The path
@@ -18,7 +39,7 @@ class ::Hash
    ats = []
    at = []
    while pth =~ /\[[+&,]?\w+ *[\^*$=<>]=? *\w+/
-      m = pth.match(/\[(?<com>[,+&])? *(?<key>\w+) *(?<op>[\^*$=<>]{1,2}) *(?<val
+      m = pth.match(/\[(?<com>[,+&])? *(?<key>\w+) *(?<op>[\^*$=<>]{1,2}) *(?<val>[^,&\]]+) */)
      comp = [m['key'], m['op'], m['val']]
      case m['com']
      when ','
|
|
28
49
|
at.push(comp)
|
29
50
|
end
|
30
51
|
|
31
|
-
pth.sub!(/\[(?<com>[,&+])? *(?<key>\w+) *(?<op>[\^*$=<>]{1,2}) *(?<val
|
52
|
+
pth.sub!(/\[(?<com>[,&+])? *(?<key>\w+) *(?<op>[\^*$=<>]{1,2}) *(?<val>[^,&\]]+)/, '[')
|
32
53
|
end
|
33
54
|
ats.push(at) unless at.empty?
|
34
55
|
pth.sub!(/\[\]/, '')
|
35
56
|
|
36
57
|
return false if el.nil? && ats.empty? && !res.key?(pth)
|
37
|
-
|
38
58
|
res = res[pth] unless pth.empty?
|
39
59
|
|
60
|
+
return false if res.nil?
|
61
|
+
|
40
62
|
if ats.count.positive?
|
41
63
|
while ats.count.positive?
|
42
64
|
atr = ats.shift
|
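This is the fix behind the changelog's "allow hyphens in query syntax" entry: the value capture widens from `\w+` to `[^,&\]]+`. A quick before/after comparison of the two patterns:

```
# Old vs new value captures from dot_query (reduced to the relevant groups)
old_val = /\[(?<key>\w+) *(?<op>[\^*$=<>]{1,2}) *(?<val>\w+)/
new_val = /\[(?<key>\w+) *(?<op>[\^*$=<>]{1,2}) *(?<val>[^,&\]]+)/

query = '[class*=entry-title]'
p query.match(old_val)[:val] # => "entry" (stops at the hyphen)
p query.match(new_val)[:val] # => "entry-title"
```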
@@ -60,7 +82,7 @@ class ::Hash
  ##
  ## @param r [Hash] hash of source elements and
  ##          comparison operators
-  ## @param atr [
+  ## @param atr [Array] Array of arrays containing [attribute, comparator, value]
  ##
  ## @return [Boolean] whether the comparison passes or fails
  ##
@@ -118,7 +140,7 @@ class ::Hash
  end
 
  ##
-  ## Test if a
+  ## Test if a tag contains an attribute matching filter queries
  ##
  ## @param tag_name [String] The tag name
  ## @param classes [String] The classes to match
data/lib/curly/version.rb
CHANGED
data/src/_README.md
CHANGED
@@ -10,7 +10,7 @@ _If you find this useful, feel free to [buy me some coffee][donate]._
 [donate]: https://brettterpstra.com/donate
 <!--END GITHUB-->
 
-The current version of `curlyq` is <!--VER-->0.0.
+The current version of `curlyq` is <!--VER-->0.0.4<!--END VER-->.
 
 CurlyQ is a utility that provides a simple interface for curl, with additional features for things like extracting images and links, finding elements by CSS selector or XPath, getting detailed header info, and more. It's designed to be part of a scripting pipeline, outputting everything as structured data (JSON or YAML). It also has rudimentary support for making calls to JSON endpoints easier, but it's expected that you'll use something like `jq` to parse the output.
@@ -39,12 +39,41 @@ Run `curlyq help` for a list of subcommands. Run `curlyq help SUBCOMMAND` for de
 @cli(bundle exec bin/curlyq help)
 ```
 
+### Query and Search syntax
+
+You can shape the results using `--search` (`-s`) and `--query` (`-q`) on some commands.
+
+A search uses either CSS or XPath syntax to locate elements. For example, if you wanted to locate all of the `<article>` elements with a class of `post` inside the div with an id of `main`, you would run `--search '#main article.post'`. Searches can target tags, ids, and classes, and can accept `>` to target direct descendants. You can also use XPaths, but I hate those so I'm not going to document them.
+
+Queries are specifically for shaping CurlyQ output. If you're using the `html` command, it returns a key called `images`, so you can target just the images in the response with `-q 'images'`. The queries accept array syntax, so to get the first image, you would use `-q 'images[0]'`. Ranges are accepted as well, so `-q 'images[1..4]'` will return the 2nd through 5th images found on the page. You can also do comparisons, e.g. `images[rel=me]` to target only images with a `rel` attribute of `me`.
+
+The comparisons for the query flag are:
+
+- `<` less than
+- `>` greater than
+- `<=` less than or equal to
+- `>=` greater than or equal to
+- `=` or `==` is equal to
+- `*=` contains text
+- `^=` starts with text
+- `$=` ends with text
+
 #### Commands
 
 curlyq makes use of subcommands, e.g. `curlyq html [options] URL` or `curlyq extract [options] URL`. Each subcommand takes its own options, but I've made an effort to standardize the choices between each command as much as possible.
 
 ##### extract
 
+Example:
+
+    curlyq extract -i -b 'Adding' -a 'accessing the source.' 'https://stackoverflow.com/questions/52428409/get-fully-rendered-html-using-selenium-webdriver-and-python'
+
+    [
+      "Adding <code>time.sleep(10)</code> in various places in case the page had not fully loaded when I was accessing the source."
+    ]
+
+This specifies a before and after string and includes them (`-i`) in the result.
+
 ```
 @cli(bundle exec bin/curlyq help extract)
 ```
@@ -52,36 +81,198 @@ curlyq makes use of subcommands, e.g. `curlyq html [options] URL` or `curlyq ext
 
 ##### headlinks
 
+Example:
+
+    curlyq headlinks -q '[rel=stylesheet]' https://brettterpstra.com
+
+    {
+      "rel": "stylesheet",
+      "href": "https://cdn3.brettterpstra.com/stylesheets/screen.7261.css",
+      "type": "text/css",
+      "title": null
+    }
+
+This pulls all `<link>` tags from the `<head>` of the page, and uses a query (`-q`) to only show links with `rel="stylesheet"`.
+
 ```
 @cli(bundle exec bin/curlyq help headlinks)
 ```
 
 ##### html
 
+The html command (aliased as `curl`) gets the entire text of the web page and provides a JSON response with a breakdown of:
+
+- URL, after any redirects
+- Response code
+- Response headers as a keyed hash
+- Meta elements for the page as a keyed hash
+- All meta links in the head as an array of objects containing (as available):
+    - rel
+    - href
+    - type
+    - title
+- Source of `<head>`
+- Source of `<body>`
+- The page title (determined first by og:title, then by a title tag)
+- Description (using og:description first)
+- All links on the page as an array of objects with:
+    - href
+    - title
+    - rel
+    - text content
+    - classes as array
+- All images on the page as an array of objects containing:
+    - class
+    - all attributes as key/value pairs
+    - width and height (if specified)
+    - src
+    - alt and title
+
+You can add a query (`-q`) to only get the information needed, e.g. `-q 'images[width>600]'`.
+
+Example:
+
+    curlyq html -s '#main article .aligncenter' -q 'images[1]' 'https://brettterpstra.com'
+
+    [
+      {
+        "class": "aligncenter",
+        "original": "https://cdn3.brettterpstra.com/uploads/2023/09/giveaway-keyboardmaestro2024-rb_tw.jpg",
+        "at2x": "https://cdn3.brettterpstra.com/uploads/2023/09/giveaway-keyboardmaestro2024-rb@2x.jpg",
+        "width": "800",
+        "height": "226",
+        "src": "https://cdn3.brettterpstra.com/uploads/2023/09/giveaway-keyboardmaestro2024-rb.jpg",
+        "alt": "Giveaway Robot with Keyboard Maestro icon",
+        "title": "Giveaway Robot with Keyboard Maestro icon"
+      }
+    ]
+
+The above example queries the full html of the page, but narrows the elements using `--search` and then takes the 2nd image from the results.
+
+    curlyq html -q 'meta.title' https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
+
+    Introducing CurlyQ, a pipeline-oriented curl helper - BrettTerpstra.com
+
+The above example curls the page and returns the title attribute found in the meta (`-q 'meta.title'`).
+
 ```
 @cli(bundle exec bin/curlyq help html)
 ```
 
 ##### images
 
+The images command returns only the images on the page as an array of objects. It can be queried to match certain requirements (see Query and Search syntax above).
+
+The base command will return all images on the page, including OpenGraph images from the head, `<img>` tags from the body, and `<srcset>` tags along with their child images.
+
+OpenGraph images will be returned with the structure:
+
+    {
+      "type": "opengraph",
+      "attrs": null,
+      "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb_tw.jpg"
+    }
+
+`img` tags will be returned with the structure:
+
+    {
+      "type": "img",
+      "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb.jpg",
+      "width": "800",
+      "height": "226",
+      "alt": "Banner image for CurlyQ",
+      "title": "CurlyQ, curl better",
+      "attrs": [
+        {
+          "key": "class",
+          "value": [
+            "aligncenter"
+          ] // all attributes included
+        }
+      ]
+    }
+
+`srcset` images will be returned with the structure:
+
+    {
+      "type": "srcset",
+      "attrs": [
+        {
+          "key": "srcset",
+          "value": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb_tw.jpg 1x, https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb@2x.jpg 2x"
+        }
+      ],
+      "images": [
+        {
+          "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb_tw.jpg",
+          "media": "1x"
+        },
+        {
+          "src": "https://cdn3.brettterpstra.com/uploads/2024/01/curlyq_header-rb@2x.jpg",
+          "media": "2x"
+        }
+      ]
+    }
+
+Example:
+
+    curlyq images -t img -q '[alt$=screenshot]' https://brettterpstra.com
+
+This will return an array of images that are `<img>` tags, and only show the ones that have an `alt` attribute that ends with `screenshot`.
+
+    curlyq images -q '[width>750]' https://brettterpstra.com
+
+This example will only return images that have a width greater than 750 pixels. This query depends on the images having proper `width` attributes set on them in the source.
+
 ```
 @cli(bundle exec bin/curlyq help images)
 ```
 
 ##### json
 
+The `json` command just returns an object with header/response info, and the contents of the JSON response after it's been read by the Ruby JSON library and output. If there are fetching or parsing errors it will fail gracefully with an error code.
+
 ```
 @cli(bundle exec bin/curlyq help json)
 ```
 
 ##### links
 
+Returns all the links on the page, which can be queried on any attribute.
+
+Example:
+
+    curlyq images -t img -q '[width>750]' https://brettterpstra.com
+
 ```
 @cli(bundle exec bin/curlyq help links)
 ```
 
 ##### scrape
 
+Loads the page in a web browser, allowing scraping of dynamically loaded pages that return nothing but scripts when `curl`ed. The `-b` (`--browser`) option is required and should be 'chrome' or 'firefox' (or just 'c' or 'f'). The selected browser must be installed on your system.
+
+Example:
+
+    curlyq scrape -b firefox -q 'links[rel=me&content*=mastodon][0]' https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
+
+    {
+      "href": "https://nojack.easydns.ca/@ttscoff",
+      "title": null,
+      "rel": [
+        "me"
+      ],
+      "content": "Mastodon",
+      "class": [
+        "u-url"
+      ]
+    }
+
+This example scrapes the page using Firefox and finds the first link with a rel of 'me' and text containing 'mastodon'.
+
 ```
 @cli(bundle exec bin/curlyq help scrape)
 ```
|
|
90
281
|
|
91
282
|
Full-page screenshots require Firefox, installed and specified with `--browser firefox`.
|
92
283
|
|
284
|
+
Type defaults to `full`, but will only work if `-b` is Firefox. If you want to use Chrome, you must specify a `--type` as 'visible' or 'print'.
|
285
|
+
|
286
|
+
The `-o` (`--output`) flag is required. It should be a path to a target PNG file (or PDF for `-t print` output). Extension will be modified automatically, all you need is the base name.
|
287
|
+
|
288
|
+
Example:
|
289
|
+
|
290
|
+
curlyq screenshot -b f -o ~/Desktop/test https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
|
291
|
+
|
292
|
+
Screenshot saved to /Users/ttscoff/Desktop/test.png
|
293
|
+
|
294
|
+
|
93
295
|
```
|
94
296
|
@cli(bundle exec bin/curlyq help screenshot)
|
95
297
|
```
|
96
298
|
|
97
299
|
##### tags
|
98
300
|
|
301
|
+
Return a hierarchy of all tags in a page. Use `-t` to limit to a specific tag.
|
302
|
+
|
99
303
|
```
|
100
304
|
@cli(bundle exec bin/curlyq help tags)
|
101
305
|
```
|
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: curlyq
 version: !ruby/object:Gem::Version
-  version: 0.0.4
+  version: 0.0.5
 platform: ruby
 authors:
 - Brett Terpstra
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2024-01-
+date: 2024-01-12 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rake
@@ -139,6 +139,7 @@ extra_rdoc_files:
 files:
 - ".github/FUNDING.yml"
 - ".gitignore"
+- ".irbrc"
 - CHANGELOG.md
 - Gemfile
 - Gemfile.lock