url_common 0.1.0

Files changed (15)

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 4f3290f0cba2dcd19ebc741320e0153f2ba5065f927914f850f956d675fe8752
+  data.tar.gz: f93a6899d9b39729db16140698c35885c4b000710cc9b5e1dc635dc9a21467aa
+SHA512:
+  metadata.gz: a534233ebf72a903303eb4b273459ec717d81fb4acff7daafbfe57f368be7c44b06d9fd1773561506c0b1957b1eac61e5012ff9d582743245b100009e8feaa72
+  data.tar.gz: 5b2efc559ad70767383a9fd35a458ef524713bb12c1c1c0c3393527e98a7ccde3ab32480793a2a91cd23c7bbbefffaaa3e16f12ca7656c75f7920621ceb9fc37

data/.gitignore ADDED

@@ -0,0 +1,11 @@
+/.bundle/
+/.yardoc
+/_yardoc/
+/coverage/
+/doc/
+/pkg/
+/spec/reports/
+/tmp/
+# rspec failure tracking
+.rspec_status

data/.rspec ADDED

@@ -0,0 +1,3 @@
+--format documentation
+--color
+--require spec_helper

data/.travis.yml ADDED

@@ -0,0 +1,6 @@
+---
+language: ruby
+cache: bundler
+rvm:
+  - 2.7.1
+before_install: gem install bundler -v 2.1.4

data/CODE_OF_CONDUCT.md ADDED

@@ -0,0 +1,74 @@
+# Contributor Covenant Code of Conduct
+## Our Pledge
+In the interest of fostering an open and welcoming environment, we as
+contributors and maintainers pledge to making participation in our project and
+our community a harassment-free experience for everyone, regardless of age, body
+size, disability, ethnicity, gender identity and expression, level of experience,
+nationality, personal appearance, race, religion, or sexual identity and
+orientation.
+## Our Standards
+Examples of behavior that contributes to creating a positive environment
+include:
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+Examples of unacceptable behavior by participants include:
+* The use of sexualized language or imagery and unwelcome sexual attention or
+  advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic
+  address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+  professional setting
+## Our Responsibilities
+Project maintainers are responsible for clarifying the standards of acceptable
+behavior and are expected to take appropriate and fair corrective action in
+response to any instances of unacceptable behavior.
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviors that they deem inappropriate,
+threatening, offensive, or harmful.
+## Scope
+This Code of Conduct applies both within project spaces and in public spaces
+when an individual is representing the project or its community. Examples of
+representing a project or community include using an official project e-mail
+address, posting via an official social media account, or acting as an appointed
+representative at an online or offline event. Representation of a project may be
+further defined and clarified by project maintainers.
+## Enforcement
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported by contacting the project team at fuzzygroup@gmail.com. All
+complaints will be reviewed and investigated and will result in a response that
+is deemed necessary and appropriate to the circumstances. The project team is
+obligated to maintain confidentiality with regard to the reporter of an incident.
+Further details of specific enforcement policies may be posted separately.
+Project maintainers who do not follow or enforce the Code of Conduct in good
+faith may face temporary or permanent repercussions as determined by other
+members of the project's leadership.
+## Attribution
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+available at [https://contributor-covenant.org/version/1/4][version]
+[homepage]: https://contributor-covenant.org
+[version]: https://contributor-covenant.org/version/1/4/

data/Gemfile ADDED

@@ -0,0 +1,10 @@
+source "https://rubygems.org"
+# Specify your gem's dependencies in url_common.gemspec
+gemspec
+gem "rake", "~> 12.0"
+gem "rspec", "~> 3.0"
+gem "fuzzyurl", '~> 0.9.0'
+gem 'mechanize', '~> 2.6'
+gem "byebug"

data/LICENSE.txt ADDED

@@ -0,0 +1,21 @@
+The MIT License (MIT)
+Copyright (c) 2020 Scott Johnson
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,44 @@
+# UrlCommon
+Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/url_common`. To experiment with that code, run `bin/console` for an interactive prompt.
+TODO: Delete this and the text above, and describe your gem
+## Installation
+Add this line to your application's Gemfile:
+```ruby
+gem 'url_common'
+```
+And then execute:
+    $ bundle install
+Or install it yourself as:
+    $ gem install url_common
+## Usage
+TODO: Write usage instructions here
+## Development
+After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+## Contributing
+Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/url_common. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/url_common/blob/master/CODE_OF_CONDUCT.md).
+## License
+The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
+## Code of Conduct
+Everyone interacting in the UrlCommon project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/url_common/blob/master/CODE_OF_CONDUCT.md).
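
The README's Usage section above is still a TODO. As a rough sketch of the kind of calls it might document, the block below copies three helpers from `lib/url_common.rb` that need only the standard library into a stand-in module. `UrlCommonSketch` is a hypothetical name, used here only so the snippet runs without installing the gem or its dependencies; the real entry point is `UrlCommon`.

```ruby
require "uri"

# Hypothetical stand-in module: these bodies mirror helpers from
# lib/url_common.rb that rely only on the standard library.
module UrlCommonSketch
  # Extracts the numeric id that follows "/id" in an iTunes URL, or nil.
  def self.parse_fid_from_itunes_url(url)
    match = %r{/id([0-9]+)}.match(url)
    match && match[1]
  end

  # Extracts the two-character storefront code, defaulting to "us".
  def self.parse_country_from_itunes_url(url)
    match = %r{https?://itunes\.apple\.com/(..)/}.match(url)
    match ? match[1] : "us"
  end

  # Host without a leading "www.", plus path and query, capped at 255 chars.
  def self.url_base(url)
    parts = URI.parse(url)
    host = parts.host.sub(/\Awww\./, "")
    query = parts.query ? "?#{parts.query}" : ""
    "#{host}#{parts.path}#{query}"[0..254]
  end
end

itunes_url = "https://itunes.apple.com/us/app/imovie/id408981434?mt=12"
puts UrlCommonSketch.parse_fid_from_itunes_url(itunes_url)
puts UrlCommonSketch.parse_country_from_itunes_url(itunes_url)
puts UrlCommonSketch.url_base("https://www.udemy.com/the-build-a-saas-app-with-flask-course/")
```

The sample URLs are the ones used in the comments of `lib/url_common.rb`.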

data/Rakefile ADDED

@@ -0,0 +1,6 @@
+require "bundler/gem_tasks"
+require "rspec/core/rake_task"
+RSpec::Core::RakeTask.new(:spec)
+task :default => :spec

data/bin/console ADDED

@@ -0,0 +1,14 @@
+#!/usr/bin/env ruby
+require "bundler/setup"
+require "url_common"
+# You can add fixtures and/or initialization code here to make experimenting
+# with your gem easier. You can also use a different console, if you like.
+# (If you use this, don't forget to add pry to your Gemfile!)
+# require "pry"
+# Pry.start
+require "irb"
+IRB.start(__FILE__)

data/bin/setup ADDED

@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+set -euo pipefail
+IFS=$'\n\t'
+set -vx
+bundle install
+# Do any other automated setup that you need to do here

data/lib/url_common.rb ADDED

@@ -0,0 +1,269 @@
+require "url_common/version"
+require 'fuzzyurl'
+require 'mechanize'
+require 'ostruct'
+module UrlCommon
+  class Error < StandardError; end
+  # UrlCommon.is_valid?("http://fuzzyblog.io/blog/")
+  # UrlCommon.is_valid?("fuzzyblog.io/blog/")
+  def self.is_valid?(url)
+    result = Fuzzyurl.from_string(url)
+    # Both a hostname and a protocol are required
+    return false if result.hostname.nil?
+    return false if result.protocol.nil?
+    return true
+  rescue StandardError
+    return false
+  end
+  # UrlCommon.parse_fid_from_itunes_url("https://itunes.apple.com/us/app/imovie/id408981434?mt=12")
+  def self.parse_fid_from_itunes_url(url)
+    tmp = /\/id([0-9]+)/.match(url)
+    if tmp && tmp[1]
+      return tmp[1]
+    else
+      return nil
+    end
+  end
+  def self.parse_country_from_itunes_url(url)
+    country = /https?:\/\/itunes\.apple\.com\/(..)\//.match(url)
+    if country
+      country = country[1]
+    end
+    return country if country
+    return 'us'
+  end
+  def self.get_base_domain(url)
+    parts = URI.parse(url)
+    # Escape the dot so only a literal "www." prefix is stripped
+    return parts.host.sub(/^www\./,'')
+  end
+  def self.join(base, rest, debug = false)
+    return URI.join(base, rest).to_s
+  end
+  def self.url_no_www(url)
+    parts = Fuzzyurl.new(url)
+    # path can be nil, so coerce it to a string before concatenating
+    base = parts.hostname.sub(/^www\./, '') + parts.path.to_s
+    return base + '?' + parts.query if parts.query
+    return base
+  end
+  #TODO
+  def self.count_links(html)
+    return 0
+  end
+  def self.agent
+    return Mechanize.new
+  end
+  def self.strip_a_tag(a_tag)
+    #<a href="https://www.keyingredient.com/recipes/12194051/egg-salad-best-ever-creamy/">
+    return a_tag.sub(/<a href=[\"']/,'').sub(/[\"']>/,'')
+  end
+  #
+  # Returns a url w/o the scheme and a leading www.
+  # UrlCommon.url_base("https://www.udemy.com/the-build-a-saas-app-with-flask-course/")
+  # "udemy.com/the-build-a-saas-app-with-flask-course/"
+  #
+  def self.url_base(url, base_domain=nil)
+    if base_domain.nil?
+      base_domain = get_base_domain(url)
+    end
+    parts = URI.parse(url)
+    extra = ""
+    extra = "?#{parts.query}" if parts.query
+    url_base = "#{base_domain}#{parts.path}#{extra}"
+    return url_base[0..254]
+  end
+  #tested #https://www.amazon.com/gp/product/B01DT4A2R4/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=nickjanetakis-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01DT4A2R4&linkId=496be5e222b6291369c0a393c797c2c0
+  # returns nil if link isn't amazon at all
+  # returns true if link is amazon and has referrer code
+  # returns false if link is amazon and doesn't have referrer code
+  def self.check_for_amazon_referrer(url, referrer_code)
+    #https://github.com/gamache/fuzzyurl.rb
+    fu = Fuzzyurl.from_string(url)
+    return nil if fu.hostname.nil?
+    base_domain = fu.hostname.sub(/^www\./,'')
+    parts = base_domain.split(".")
+    return nil if parts[0] != "amazon"
+    # Substring match, so regex metacharacters in the code are taken literally
+    return url.include?(referrer_code.to_s)
+  end
+  # TODO needs tests
+  #def self.check_for_jekyll_subdomain?(url)
+  def self.has_own_domain?(url)
+    return false if url =~ /\.github\.io/
+    return false if url =~ /\.blogspot\.com/
+    return false if url =~ /\.wordpress\.com/
+    return true
+  end
+  # TODO needs tests
+  def self.get_page(url, return_html = false, user_agent = nil)
+    agent = Mechanize.new { |a|
+      if user_agent.nil?
+        #a.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:46.0) Gecko/20100101 Firefox/46.0"
+        a.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
+      else
+        a.user_agent = user_agent
+      end
+      #a.user_agent = "curl/7.54.0"
+      #debugger
+    }
+    agent.verify_callback = Proc.new do |ok,x509|
+      status = x509.error
+      msg = x509.error_string
+      # No logger is defined in this module, so report through Kernel#warn
+      warn "server certificate verify: status: #{status}, msg: #{msg}" if status != 0
+      true # returning true accepts the certificate even when verification failed
+    end
+    begin
+      page = agent.get(url)
+      if return_html
+        return :ok, page.body
+      else
+        return :ok, page
+      end
+      #return :ok, page
+    rescue StandardError => e
+      return :error, e
+    end
+  end
+  # def self.get_page_caching_attempt(url, return_html = false)
+  #   agent = Mechanize.new { |a|
+  #     a.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:46.0) Gecko/20100101 Firefox/46.0"
+  #   }
+  #   agent.verify_callback = Proc.new do |ok,x509|
+  #     status = x509.error
+  #     msg = x509.error_string
+  #     logger.warn "server certificate verify: status: #{status}, msg: #{msg}" if status != 0
+  #     true # this has the side effect of ignoring errors. nice!
+  #   end
+  #   begin
+  #     page = agent.get(url)
+  #     if return_html
+  #       Rails.cache.fetch(UrlCommon.sha_it(url), :expires_in => 1.hour) do
+  #         page.body
+  #       end
+  #       # Rails.cache.fetch(UrlCommon.sha_it(url), :expires_in => 1.hour) do
+  #       #   debugger
+  #       #   page.body
+  #       # end
+  #       return :ok, page.body
+  #     else
+  #       return :ok, page
+  #     end
+  #   rescue StandardError => e
+  #     return :error, e
+  #   end
+  # end
+  def self.mpage_is_html?(page)
+    return true if page.respond_to?(:title)
+    return false
+  end
+  # TODO needs tests
+  def self.check_for_404(url, elixir_style = false)
+    agent = Mechanize.new
+    begin
+      agent.head(url)
+      return :ok, url if elixir_style
+      return OpenStruct.new(:url => url, :status => 200)
+    rescue StandardError => e
+      # Every failure, not only a 404 response, is reported with status 404
+      return OpenStruct.new(:url => url, :error => e, :status => 404)
+    end
+  end
+  # TODO needs tests
+  def self.check_for_broken_links(links)
+    results = []
+    agent = Mechanize.new
+    links.each do |link|
+      begin
+        result = agent.head(link.href)
+        results << OpenStruct.new(:url => link.href, :status => 200)
+      rescue StandardError => e
+        if e.to_s =~ /404/
+          results << OpenStruct.new(:url => link.href, :error => e, :status => 404)
+        end
+      end
+    end
+    #debugger
+    results
+  end
+  def self.fix_relative_url(base_url, partial_url)
+    return partial_url if partial_url =~ /^http/
+    parts = URI.parse(base_url)
+    # partial_url is expected to start with "/"
+    return parts.scheme + '://' + parts.host + partial_url
+  end
+  # status, url = UrlCommon.validate_with_merge_fragment("nickjj/orats", "https://www.github.com/")
+  def self.validate_with_merge_fragment(url, merge_fragment)
+    #
+    # verify it is a valid url and it isn't a 404 or redirect
+    #
+    # check_for_404 returns an OpenStruct (always truthy), so compare its status
+    if is_valid?(url) && check_for_404(url).status == 200
+      return true, url
+    end
+    #
+    # Try and make it valid
+    #
+    if url =~ /^http/
+      # if it's invalid and has http then don't know what to do so return false
+      return false, url
+    end
+    url = File.join(merge_fragment, url)
+    if is_valid?(url) && check_for_404(url).status == 200
+      return true, url
+    end
+    # The merged url did not validate either
+    return false, url
+  end
+  #TODO needs tests
+  def self.create_mechanize_page_from_html(url, html)
+    mechanize_page = Mechanize::Page.new(nil, {'content-type'=>'text/html'}, html, nil, Mechanize.new)
+    mechanize_page.uri = URI.parse(url)
+    return mechanize_page
+  end
+end
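
The comments on `check_for_amazon_referrer` describe a three-state result: `nil` when the link is not Amazon at all, `true` when it is Amazon and carries the referrer code, `false` when it is Amazon without the code. A minimal sketch of that contract follows, assuming the standard library's `URI.parse` as a stand-in for Fuzzyurl. The method names `amazon_referrer_status` and `describe_referrer` are hypothetical and not part of the gem, and unlike the gem this variant treats scheme-less input as "not Amazon", because `URI.parse` finds no host in it.

```ruby
require "uri"

# Hypothetical stdlib-only variant of check_for_amazon_referrer.
# Assumption: URI.parse replaces Fuzzyurl, so input without a scheme
# yields no host and is reported as nil (not Amazon).
def amazon_referrer_status(url, referrer_code)
  host = begin
    URI.parse(url).host
  rescue URI::InvalidURIError
    nil
  end
  return nil if host.nil?

  labels = host.sub(/\Awww\./, "").split(".")
  return nil unless labels.first == "amazon"

  # Plain substring test; the code is not interpreted as a regex.
  url.include?(referrer_code)
end

# Callers have to distinguish nil from false, so match on all three states.
def describe_referrer(url, referrer_code)
  case amazon_referrer_status(url, referrer_code)
  when nil   then "not an Amazon link"
  when true  then "Amazon link with referrer code"
  when false then "Amazon link without referrer code"
  end
end

puts describe_referrer("https://www.amazon.com/gp/product/B01DT4A2R4/?tag=nickjanetakis-20", "nickjanetakis-20")
puts describe_referrer("https://fuzzyblog.io/blog/", "nickjanetakis-20")
```

Because `nil` and `false` are both falsy in Ruby, a bare `if` on the result would conflate "not Amazon" with "Amazon without the code"; the explicit `case` keeps them apart.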

data/lib/url_common/version.rb ADDED

@@ -0,0 +1,3 @@
+module UrlCommon
+  VERSION = "0.1.0"
+end

data/url_common.gemspec ADDED

@@ -0,0 +1,29 @@
+require_relative 'lib/url_common/version'
+Gem::Specification.new do |spec|
+  spec.name          = "url_common"
+  spec.version       = UrlCommon::VERSION
+  spec.authors       = ["Scott Johnson"]
+  spec.email         = ["fuzzygroup@gmail.com"]
+  spec.summary       = %q{This is a class library designed for common url manipulation and crawling tasks.}
+  spec.description   = %q{This is a class library for common url manipulation and crawling tasks.  It is based on a career focused on the practical side of working with the Internet using Ruby.}
+  spec.homepage      = "https://github.com/fuzzygroup/url_common/"
+  spec.license       = "MIT"
+  spec.required_ruby_version = Gem::Requirement.new(">= 2.3.0")
+  spec.metadata["allowed_push_host"] = "https://rubygems.org"
+  spec.metadata["homepage_uri"] = spec.homepage
+  spec.metadata["source_code_uri"] = "https://github.com/fuzzygroup/url_common/"
+  spec.metadata["changelog_uri"] = "https://github.com/fuzzygroup/url_common/CHANGELOG.md"
+  # Specify which files should be added to the gem when it is released.
+  # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
+  spec.files         = Dir.chdir(File.expand_path('..', __FILE__)) do
+    `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
+  end
+  spec.bindir        = "exe"
+  spec.executables   = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
+  spec.require_paths = ["lib"]
+end
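
The gemspec above declares no runtime dependencies, and the generated metadata accordingly lists `dependencies: []`, even though `lib/url_common.rb` requires `fuzzyurl` and `mechanize` (both appear only in the Gemfile). One possible way to declare them is sketched below with `add_dependency`. This is an illustration rather than the author's gemspec: the spec name, summary and author are placeholders, and the version constraints are simply copied from the Gemfile in this diff.

```ruby
require "rubygems"

# Hypothetical, trimmed-down spec used only to illustrate add_dependency;
# the name, summary and author are placeholders, not the gem's real values.
def build_sketch_spec
  Gem::Specification.new do |spec|
    spec.name    = "url_common_deps_sketch"
    spec.version = "0.0.0"
    spec.summary = "Illustrates runtime dependency declarations."
    spec.authors = ["Example Author"]

    # Version constraints copied from the Gemfile in this diff.
    spec.add_dependency "fuzzyurl", "~> 0.9.0"
    spec.add_dependency "mechanize", "~> 2.6"
  end
end

build_sketch_spec.runtime_dependencies.each do |dep|
  puts "#{dep.name} #{dep.requirement}"
end
```

With declarations like these in the real gemspec, `gem install url_common` would pull in both libraries instead of relying on the consuming application's Gemfile.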

metadata ADDED

@@ -0,0 +1,63 @@
+--- !ruby/object:Gem::Specification
+name: url_common
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- Scott Johnson
+autorequire:
+bindir: exe
+cert_chain: []
+date: 2020-08-12 00:00:00.000000000 Z
+dependencies: []
+description: This is a class library for common url manipulation and crawling tasks.  It
+  is based on a career focused on the practical side of working with the Internet
+  using Ruby.
+email:
+- fuzzygroup@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- ".gitignore"
+- ".rspec"
+- ".travis.yml"
+- CODE_OF_CONDUCT.md
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- bin/console
+- bin/setup
+- lib/url_common.rb
+- lib/url_common/version.rb
+- url_common.gemspec
+homepage: https://github.com/fuzzygroup/url_common/
+licenses:
+- MIT
+metadata:
+  allowed_push_host: https://rubygems.org
+  homepage_uri: https://github.com/fuzzygroup/url_common/
+  source_code_uri: https://github.com/fuzzygroup/url_common/
+  changelog_uri: https://github.com/fuzzygroup/url_common/CHANGELOG.md
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: 2.3.0
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.1.2
+signing_key:
+specification_version: 4
+summary: This is a class library designed for common url manipulation and crawling
+  tasks.
+test_files: []