RubyGems - curlyq - Versions diffs - 0.0.10 → 0.0.12 - Mend

curlyq 0.0.10 → 0.0.12


Files changed (10)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 6109483b8869733f9e21ecab9bc8bcda0aa3b58ca1f13f9b96fe7739d019df1f
-  data.tar.gz: 98a8d46fe68bc88ea030dfb8e04262fbab5418005390ff79693d6f636a3bf276
+  metadata.gz: af564949987eb7b96f7dfc730fea6ac79fb4a903c3eaff7a412af446c4a0e699
+  data.tar.gz: 58fb83c8132da551813adad468a6eb546a0d758df2b7c756b794a2a019963084
 SHA512:
-  metadata.gz: 1d75b4af2d6c1fadb83501fa707184ef41d061c08de14666b86d296048e8f21540fe2ad53a79985d5b042c93fa629cdbe8d101828edbb02832d1b55b920d5834
-  data.tar.gz: 238855918e3e765a2edf1864dd2663a959b099cfa5f1b89942f94eb20ba428c1700adee85590879662f0cf8de659328fbe752e8648ee210eefe0769639c57da2
+  metadata.gz: 70e9485a7729acbd45a501278ceaae575bf358205c05da49de5c36f08ab8fd263824c8f8bd302c32b058986db3783b7abc6d9f0e58bb11f18879cdfde5210eac
+  data.tar.gz: 8d2e664b9e786c0a9d13fb8177c703be84c0a784835dba14354189978c34501334cc1dbdd68cc0112ad363c73554c170e9eb5fea64b60ee229bc4105b08d3fc6

data/CHANGELOG.md CHANGED

@@ -1,3 +1,15 @@
+### 0.0.12
+2024-04-04 13:06
+### 0.0.11
+2024-01-21 15:29
+#### IMPROVED
+- Add option for --local_links_only to html and links command, only returning links with the same origin site
 ### 0.0.10
 2024-01-17 13:50

data/Gemfile.lock CHANGED

@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    curlyq (0.0.10)
+    curlyq (0.0.12)
       gli (~> 2.21.0)
       nokogiri (~> 1.16.0)
       selenium-webdriver (~> 4.16.0)
@@ -11,7 +11,7 @@ GEM
   remote: https://rubygems.org/
   specs:
     gli (2.21.1)
-    nokogiri (1.16.0-arm64-darwin)
+    nokogiri (1.16.2-arm64-darwin)
       racc (~> 1.4)
     parallel (1.23.0)
     parallel_tests (3.13.0)

data/README.md CHANGED

@@ -13,8 +13,7 @@ _If you find this useful, feel free to [buy me some coffee][donate]._
 [jq]: https://github.com/jqlang/jq "Command-line JSON processor"
 [yq]: https://github.com/mikefarah/yq "yq is a portable command-line YAML, JSON, XML, CSV, TOML and properties processor"
-The current version of `curlyq` is 0.0.10
-.
+The current version of `curlyq` is 0.0.12.
 CurlyQ is a utility that provides a simple interface for curl, with additional features for things like extracting images and links, finding elements by CSS selector or XPath, getting detailed header info, and more. It's designed to be part of a scripting pipeline, outputting everything as structured data (JSON or YAML). It also has rudimentary support for making calls to JSON endpoints easier, but it's expected that you'll use something like [jq] to parse the output.
@@ -47,7 +46,7 @@ SYNOPSIS
     curlyq [global options] command [command options] [arguments...]
 VERSION
-    0.0.10
+    0.0.12
 GLOBAL OPTIONS
     --help          - Show this message
@@ -56,6 +55,7 @@ GLOBAL OPTIONS
     -y, --[no-]yaml - Output YAML instead of json
 COMMANDS
+    execute    - Execute JavaScript on a URL
     extract    - Extract contents between two regular expressions
     headlinks  - Return all <head> links on URL's page
     help       - Shows a list of commands or help for one command
@@ -136,6 +136,34 @@ COMMAND OPTIONS
 ```
+##### execute
+You can execute JavaScript on a given web page using the `execute` subcommand.
+Example:
+    curlyq execute -s "NiftyAPI.find('file/save').arrow().shoot('file-save')" file:///Users/ttscoff/Desktop/Code/niftymenu/dist/MultiMarkdown-Composer.html
+You can specify an element id to wait for using `--id`, and define a pause to wait after executing a script with `--wait` (defaults to 2 seconds). Scripts can be read from the command line arguments with `--script "SCRIPT"`, from STDIN with `--script -`, or from a file using `--script PATH`.
+If you expect a return value, be sure to include a `return` statement in your executed script. Results will be output to STDOUT.
+```
+NAME
+    execute - Execute JavaScript on a URL
+SYNOPSIS
+    curlyq [global options] execute [command options] URL...
+COMMAND OPTIONS
+    -b, --browser=arg - Browser to use (firefox, chrome) (default: chrome)
+    -h, --header=arg  - Define a header to send as key=value (may be used more than once, default: none)
+    -i, --id=arg      - Element ID to wait for before executing (default: none)
+    -s, --script=arg  - Script to execute, use - to read from STDIN (may be used more than once, default: none)
+    -w, --wait=arg    - Seconds to wait after executing JS (default: 2)
+```
 ##### headlinks
 Example:
@@ -237,6 +265,7 @@ COMMAND OPTIONS
     -h, --header=arg          - Define a header to send as "key=value" (may be used more than once, default: none)
     --[no-]ignore_fragments   - Ignore fragment hrefs when gathering content links
     --[no-]ignore_relative    - Ignore relative hrefs when gathering content links
+    -l, --local_links_only    - Only gather internal (same-site) links
     -q, --query, --filter=arg - Filter output using dot-syntax path (default: none)
     -r, --raw=arg             - Output a raw value for a key (default: none)
     -s, --search=arg          - Regurn an array of matches to a CSS or XPath query (default: none)
@@ -379,6 +408,7 @@ COMMAND OPTIONS
     -d, --[no-]dedup          - Filter out duplicate links, preserving only first one
     --[no-]ignore_fragments   - Ignore fragment hrefs when gathering content links
     --[no-]ignore_relative    - Ignore relative hrefs when gathering content links
+    -l, --local_links_only    - Only gather internal (same-site) links
     -q, --query, --filter=arg - Filter output using dot-syntax path (default: none)
     -x, --external_links_only - Only gather external links
 ```
@@ -436,6 +466,9 @@ Example:
     Screenshot saved to /Users/ttscoff/Desktop/test.png
+You can wait for an element ID to be visible using `--id`. This can be any `#ID` on the page. If the ID doesn't exist on the page, though, the screenshot will hang for a timeout of 10 seconds.
+You can execute a script before taking the screenshot with the `--script` flag. If this is set to `-`, it will read the script from STDIN. If it's set to an existing file path, that file will be read for script input. Specify an interval (in seconds) to wait after executing the script with `--wait`.
 ```
 NAME
@@ -448,8 +481,11 @@ SYNOPSIS
 COMMAND OPTIONS
     -b, --browser=arg     - Browser to use (firefox, chrome) (default: chrome)
     -h, --header=arg      - Define a header to send as key=value (may be used more than once, default: none)
+    -i, --id=arg          - Element ID to wait for before taking screenshot (default: none)
     -o, --out, --file=arg - File destination (required, default: none)
+    -s, --script=arg      - Script to execute before taking screenshot (may be used more than once, default: none)
     -t, --type=arg        - Type of screenshot to save (full (requires firefox), print, visible) (default: visible)
+    -w, --wait=arg        - Time to wait before taking screenshot (default: 0)
 ```
 ##### tags

data/bin/curlyq CHANGED

@@ -103,6 +103,9 @@ command %i[html curl] do |c|
   c.desc 'Only gather external links'
   c.switch %i[x external_links_only], default_value: false, negatable: false
+  c.desc 'Only gather internal (same-site) links'
+  c.switch %i[l local_links_only], default_value: false, negatable: false
   c.action do |global_options, options, args|
     urls = args.join(' ').split(/[, ]+/)
     headers = break_headers(options[:header])
@@ -115,7 +118,8 @@ command %i[html curl] do |c|
                         compressed: options[:compressed], clean: options[:clean],
                         ignore_local_links: options[:ignore_relative],
                         ignore_fragment_links: options[:ignore_fragments],
-                        external_links_only: options[:external_links_only] }
+                        external_links_only: options[:external_links_only],
+                        local_links_only: options[:local_links_only] }
       res = Curl::Html.new(url, curl_settings)
       res.curl
@@ -152,6 +156,61 @@ command %i[html curl] do |c|
   end
 end
+desc 'Execute JavaScript on a URL'
+arg_name 'URL', multiple: true
+command :execute do |c|
+  c.desc 'Browser to use (firefox, chrome)'
+  c.flag %i[b browser], type: BrowserType, must_match: /^[fc].*?$/, default_value: 'chrome'
+  c.desc 'Define a header to send as key=value'
+  c.flag %i[h header], multiple: true
+  c.desc 'Script to execute, use - to read from STDIN'
+  c.flag %i[s script], multiple: true
+  c.desc 'Element ID to wait for before executing'
+  c.flag %i[i id]
+  c.desc 'Seconds to wait after executing JS'
+  c.flag %i[w wait], default_value: 2
+  c.action do |_, options, args|
+    urls = args.join(' ').split(/[, ]+/)
+    raise 'Script input required' unless options[:file] || options[:script]
+    compiled_script = []
+    if options[:script].count.positive?
+      options[:script].each do |scr|
+        scr.strip!
+        if scr == '-'
+          compiled_script << $stdin.read
+        elsif File.exist?(File.expand_path(scr))
+          compiled_script << IO.read(File.expand_path(scr))
+        else
+          compiled_script << scr
+        end
+      end
+    end
+    script = compiled_script.count.positive? ? compiled_script.join(';') : nil
+    headers = break_headers(options[:header])
+    browser = options[:browser]
+    browser = browser.is_a?(Symbol) ? browser : browser.normalize_browser_type
+    urls.each do |url|
+      c = Curl::Html.new(url)
+      c.headers = headers
+      c.browser = browser
+      $stdout.puts c.execute(script, options[:wait], options[:id])
+    end
+  end
+end
 desc 'Save a screenshot of a URL'
 arg_name 'URL', multiple: true
 command :screenshot do |c|
@@ -167,6 +226,15 @@ command :screenshot do |c|
   c.desc 'Define a header to send as key=value'
   c.flag %i[h header], multiple: true
+  c.desc 'Script to execute before taking screenshot'
+  c.flag %i[s script], multiple: true
+  c.desc 'Element ID to wait for before taking screenshot'
+  c.flag %i[i id]
+  c.desc 'Time to wait before taking screenshot'
+  c.flag %i[w wait], default_value: 0, type: Integer
   c.action do |_, options, args|
     urls = args.join(' ').split(/[, ]+/)
     headers = break_headers(options[:header])
@@ -177,13 +245,30 @@ command :screenshot do |c|
     type = type.is_a?(Symbol) ? type : type.normalize_screenshot_type
     browser = browser.is_a?(Symbol) ? browser : browser.normalize_browser_type
+    compiled_script = []
+    if options[:script].count.positive?
+      options[:script].each do |scr|
+        scr.strip!
+        if scr == '-'
+          compiled_script << $stdin.read
+        elsif File.exist?(File.expand_path(scr))
+          compiled_script << IO.read(File.expand_path(scr))
+        else
+          compiled_script << scr
+        end
+      end
+    end
+    script = compiled_script.count.positive? ? compiled_script.join(';') : nil
     raise 'Full page screen shots only available with Firefox' if type == :full_page && browser != :firefox
     urls.each do |url|
       c = Curl::Html.new(url)
       c.headers = headers
       c.browser = browser
-      c.screenshot(options[:out], type: type)
+      c.screenshot(options[:out], type: type, script: script, id: options[:id], wait: options[:wait])
     end
   end
 end
@@ -417,6 +502,9 @@ command :links do |c|
   c.desc 'Only gather external links'
   c.switch %i[x external_links_only], default_value: false, negatable: false
+  c.desc 'Only gather internal (same-site) links'
+  c.switch %i[l local_links_only], default_value: false, negatable: false
   c.desc 'Filter output using dot-syntax path'
   c.flag %i[q query filter]
@@ -433,7 +521,8 @@ command :links do |c|
                              compressed: options[:compressed], clean: options[:clean],
                              ignore_local_links: options[:ignore_relative],
                              ignore_fragment_links: options[:ignore_fragments],
-                             external_links_only: options[:external_links_only]
+                             external_links_only: options[:external_links_only],
+                             local_links_only: options[:local_links_only]
                            })
       res.curl
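
The `--script` handling added to `execute` and `screenshot` in this file resolves each value in three steps: `-` reads STDIN, an existing file path reads that file, and anything else is taken as literal JavaScript. The pieces are then joined with `;`. Below is a minimal sketch of that logic as a standalone helper. The method name `compile_scripts` and the injectable `stdin:` argument are assumptions made for illustration; neither exists in curlyq.

```ruby
require 'stringio'

# Hypothetical helper (not part of curlyq): mirrors the script-resolution
# loop shown in the diff above.
def compile_scripts(values, stdin: $stdin)
  compiled = Array(values).map do |scr|
    scr = scr.strip
    if scr == '-'
      stdin.read                        # "-" means: read the script from STDIN
    elsif File.exist?(File.expand_path(scr))
      File.read(File.expand_path(scr))  # an existing path: read the file
    else
      scr                               # otherwise: treat as literal JavaScript
    end
  end
  # nil when nothing was supplied, otherwise one ';'-joined script
  compiled.empty? ? nil : compiled.join(';')
end

# Usage: one literal script plus one script piped in on (simulated) STDIN.
puts compile_scripts(['var a = 1', '-'], stdin: StringIO.new('return a'))
```

Because the file check runs before the literal fallback, a script string that happens to match an existing path would be read as a file. The sketch keeps that ordering, since it is what the diff shows.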

data/lib/curly/curl/html.rb CHANGED

@@ -11,7 +11,7 @@ module Curl
   # Class for CURLing an HTML page
   class Html
     attr_accessor :settings, :browser, :source, :headers, :headers_only, :compressed, :clean, :fallback,
-                  :ignore_local_links, :ignore_fragment_links, :external_links_only
+                  :ignore_local_links, :ignore_fragment_links, :external_links_only, :local_links_only
     attr_reader :url, :code, :meta, :links, :head, :body,
                 :title, :description, :body_links, :body_images
@@ -69,6 +69,7 @@ module Curl
       @ignore_local_links = options[:ignore_local_links]
       @ignore_fragment_links = options[:ignore_fragment_links]
       @external_links_only = options[:external_links_only]
+      @local_links_only = options[:local_links_only]
       @curl = TTY::Which.which('curl')
       @url = url.nil? ? options[:url] : url
@@ -127,10 +128,19 @@ module Curl
     ##                          save (:full_page,
     ##                          :print_page, :visible)
     ##
-    def screenshot(destination = nil, type: :full_page)
-      full_page = type.to_sym == :full_page
-      print_page = type.to_sym == :print_page
-      save_screenshot(destination, type: type)
+    def screenshot(destination = nil, type: :full_page, script: nil, id: nil, wait: 0)
+      # full_page = type.to_sym == :full_page
+      # print_page = type.to_sym == :print_page
+      save_screenshot(destination, type: type, script: script, id: id, wait_seconds: wait)
+    end
+    ##
+    ## @brief      Execute JavaScript
+    ##
+    ## @param      script  The script to run
+    ##
+    def execute(script, wait, element_id)
+      run_js(script, wait, element_id)
     end
     ##
@@ -144,12 +154,11 @@ module Curl
     def extract(before, after, inclusive: false)
       before = /#{Regexp.escape(before)}/ unless before.is_a?(Regexp)
       after = /#{Regexp.escape(after)}/ unless after.is_a?(Regexp)
-      if inclusive
-        rx = /(#{before.source}.*?#{after.source})/m
-      else
-        rx = /(?<=#{before.source})(.*?)(?=#{after.source})/m
-      end
+      rx = if inclusive
+             /(#{before.source}.*?#{after.source})/m
+           else
+             /(?<=#{before.source})(.*?)(?=#{after.source})/m
+           end
       @body.scan(rx).map { |r| @clean ? r[0].clean : r[0] }
     end
@@ -490,11 +499,19 @@ module Curl
         link_href = link_href[2]
-        next if link_href =~ /^#/ && (@ignore_fragment_links || @external_links_only)
+        if @local_links_only
+          next if @ignore_fragment_links && link_href =~ /^#/
+          next unless same_origin?(link_href)
-        next if link_href !~ %r{^(\w+:)?//} && (@ignore_local_links || @external_links_only)
+        else
+          next if link_href =~ /^#/ && (@ignore_fragment_links || @external_links_only)
+          next if link_href !~ %r{^(\w+:)?//} && (@ignore_local_links || @external_links_only)
-        next if same_origin?(link_href) && @external_links_only
+          next if same_origin?(link_href) && @external_links_only
+        end
         link_title = tag.match(/title=(['"])(.*?)\1/)
         link_title = link_title.nil? ? nil : link_title[2]
@@ -522,11 +539,19 @@ module Curl
       link_tags.each do |m|
         href = m['tag'].match(/href=(["'])(.*?)\1/)
         href = href[2] unless href.nil?
-        next if href =~ /^#/ && (@ignore_fragment_links || @external_links_only)
+        if @local_links_only
+          next if href =~ /^#/ && @ignore_fragment_links
+          next unless same_origin?(href)
+        else
+          next if href =~ /^#/ && (@ignore_fragment_links || @external_links_only)
+          next if href !~ %r{^(\w+:)?//} && (@ignore_local_links || @external_links_only)
-        next if href !~ %r{^(\w+:)?//} && (@ignore_local_links || @external_links_only)
+          next if same_origin?(href) && @external_links_only
-        next if same_origin?(href) && @external_links_only
+        end
         title = m['tag'].match(/title=(["'])(.*?)\1/)
         title = title[2] unless title.nil?
@@ -587,14 +612,46 @@ module Curl
       res
     end
+    ##
+    ## Run JavaScript on a URL
+    ##
+    ## @param      script      The JavaScript to execute
+    ## @param      wait        Seconds to wait after executing JS
+    ## @param      element_id  The element identifier
+    ##
+    def run_js(script, wait_seconds = 2, element_id = nil)
+      raise 'No script provided' if script.nil?
+      browser = @browser.is_a?(String) ? @browser.normalize_browser_type : @browser
+      driver = Selenium::WebDriver.for browser
+      driver.manage.timeouts.implicit_wait = 15
+      res = nil
+      begin
+        driver.get @url
+        if element_id
+          wait = Selenium::WebDriver::Wait.new(timeout: 10) # seconds
+          wait.until { driver.find_element(id: element_id) }
+        end
+        res = driver.execute_script(script)
+        sleep wait_seconds.to_i
+      ensure
+        driver.quit
+      end
+      $stderr.puts "Executed JS on #{@url}"
+      res
+    end
     ##
     ## Save a screenshot of a url
     ##
     ## @param      destination  [String] File path destination
-    ## @param      browser      [Symbol] The browser (:chrome or :firefox)
     ## @param      type         [Symbol] The type of screenshot (:full_page, :print_page, or :visible)
     ##
-    def save_screenshot(destination = nil, type: :full_page)
+    def save_screenshot(destination = nil, type: :full_page, script: nil, wait_seconds: 0, id: nil)
       raise 'No URL provided' if url.nil?
       raise 'No file destination provided' if destination.nil?
@@ -603,7 +660,7 @@ module Curl
       raise 'Path doesn\'t exist' unless File.directory?(File.dirname(destination))
-      browser = browser.normalize_browser_type if browser.is_a?(String)
+      browser = @browser.is_a?(String) ? @browser.normalize_browser_type : @browser
       type = type.normalize_screenshot_type if type.is_a?(String)
       raise 'Can not save full screen with Chrome, use Firefox' if type == :full_page && browser == :chrome
@@ -614,10 +671,21 @@ module Curl
                       "#{destination.sub(/\.(pdf|jpe?g|png)$/, '')}.png"
                     end
-      driver = Selenium::WebDriver.for @browser
+      driver = Selenium::WebDriver.for browser
       driver.manage.timeouts.implicit_wait = 4
       begin
         driver.get @url
+        if id
+          wait = Selenium::WebDriver::Wait.new(timeout: 10) # seconds
+          wait.until { driver.find_element(id: id) }
+        end
+        if script
+          res = driver.execute_script(script)
+        end
+        sleep wait_seconds.to_i
         case type
         when :print_page
           driver.save_print_page(destination)
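
The link-filtering hunks in this file add a `@local_links_only` branch alongside the existing `@external_links_only` checks. The sketch below restates that branch structure as a standalone predicate so the two modes can be compared side by side. It rests on assumptions: `skip_link?` is a hypothetical name, and `same_origin?` here is a simplified stand-in (host comparison via `URI`), because curlyq's real `same_origin?` is not shown in this diff.

```ruby
require 'uri'

# Hypothetical stand-in for curlyq's same_origin? (real implementation not
# shown in the diff): hrefs without a host (relative paths, fragments) count
# as local; absolute hrefs are compared by host.
def same_origin?(href, base)
  uri = URI.parse(href)
  uri.host.nil? || uri.host == URI.parse(base).host
rescue URI::InvalidURIError
  false
end

# Returns true when a link should be skipped, following the if/else
# structure added in the diff.
def skip_link?(href, base, local_only: false, external_only: false,
               ignore_fragments: false, ignore_relative: false)
  if local_only
    return true if ignore_fragments && href.start_with?('#')
    return true unless same_origin?(href, base)
  else
    return true if href.start_with?('#') && (ignore_fragments || external_only)
    return true if href !~ %r{^(\w+:)?//} && (ignore_relative || external_only)
    return true if same_origin?(href, base) && external_only
  end
  false
end

base = 'https://example.com/post'
hrefs = ['#top', '/about', 'https://example.com/x', 'https://other.org/y']

# Links kept in each mode
p hrefs.reject { |h| skip_link?(h, base, local_only: true) }
p hrefs.reject { |h| skip_link?(h, base, external_only: true) }
```

As in the diff, `local_only` takes precedence: when it is set, the `external_only` and `ignore_relative` checks are never reached.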

data/lib/curly/version.rb CHANGED

@@ -1,5 +1,5 @@
 # Top level module for CurlyQ
 module Curly
   # Current version number
-  VERSION = '0.0.10'
+  VERSION = '0.0.12'
 end

data/src/_README.md CHANGED

@@ -95,6 +95,22 @@ This specifies a before and after string and includes them (`-i`) in the result.
 ```
+##### execute
+You can execute JavaScript on a given web page using the `execute` subcommand.
+Example:
+    curlyq execute -s "NiftyAPI.find('file/save').arrow().shoot('file-save')" file:///Users/ttscoff/Desktop/Code/niftymenu/dist/MultiMarkdown-Composer.html
+You can specify an element id to wait for using `--id`, and define a pause to wait after executing a script with `--wait` (defaults to 2 seconds). Scripts can be read from the command line arguments with `--script "SCRIPT"`, from STDIN with `--script -`, or from a file using `--script PATH`.
+If you expect a return value, be sure to include a `return` statement in your executed script. Results will be output to STDOUT.
+```
+@cli(bundle exec bin/curlyq help execute)
+```
 ##### headlinks
 Example:
@@ -321,6 +337,9 @@ Example:
     Screenshot saved to /Users/ttscoff/Desktop/test.png
+You can wait for an element ID to be visible using `--id`. This can be any `#ID` on the page. If the ID doesn't exist on the page, though, the screenshot will hang for a timeout of 10 seconds.
+You can execute a script before taking the screenshot with the `--script` flag. If this is set to `-`, it will read the script from STDIN. If it's set to an existing file path, that file will be read for script input. Specify an interval (in seconds) to wait after executing the script with `--wait`.
 ```
 @cli(bundle exec bin/curlyq help screenshot)

data/test/curlyq_scrape_test.rb CHANGED

@@ -13,13 +13,13 @@ class CurlyQScrapeTest < Test::Unit::TestCase
   def setup
     @screenshot = File.join(File.dirname(__FILE__), 'screenshot_test')
     FileUtils.rm_f("#{@screenshot}.pdf") if File.exist?("#{@screenshot}.pdf")
-    FileUtils.rm_f('screenshot_test.png') if File.exist?("#{@screenshot}.png")
+    FileUtils.rm_f("#{@screenshot}.png") if File.exist?("#{@screenshot}.png")
     FileUtils.rm_f("#{@screenshot}_full.png") if File.exist?("#{@screenshot}_full.png")
   end
   def teardown
     FileUtils.rm_f("#{@screenshot}.pdf") if File.exist?("#{@screenshot}.pdf")
-    FileUtils.rm_f('screenshot_test.png') if File.exist?("#{@screenshot}.png")
+    FileUtils.rm_f("#{@screenshot}.png") if File.exist?("#{@screenshot}.png")
     FileUtils.rm_f("#{@screenshot}_full.png") if File.exist?("#{@screenshot}_full.png")
   end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: curlyq
 version: !ruby/object:Gem::Version
-  version: 0.0.10
+  version: 0.0.12
 platform: ruby
 authors:
 - Brett Terpstra
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2024-01-17 00:00:00.000000000 Z
+date: 2024-04-04 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rake