url_common 0.1.0

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 4f3290f0cba2dcd19ebc741320e0153f2ba5065f927914f850f956d675fe8752
4
+ data.tar.gz: f93a6899d9b39729db16140698c35885c4b000710cc9b5e1dc635dc9a21467aa
5
+ SHA512:
6
+ metadata.gz: a534233ebf72a903303eb4b273459ec717d81fb4acff7daafbfe57f368be7c44b06d9fd1773561506c0b1957b1eac61e5012ff9d582743245b100009e8feaa72
7
+ data.tar.gz: 5b2efc559ad70767383a9fd35a458ef524713bb12c1c1c0c3393527e98a7ccde3ab32480793a2a91cd23c7bbbefffaaa3e16f12ca7656c75f7920621ceb9fc37
@@ -0,0 +1,11 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /_yardoc/
4
+ /coverage/
5
+ /doc/
6
+ /pkg/
7
+ /spec/reports/
8
+ /tmp/
9
+
10
+ # rspec failure tracking
11
+ .rspec_status
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --format documentation
2
+ --color
3
+ --require spec_helper
@@ -0,0 +1,6 @@
1
+ ---
2
+ language: ruby
3
+ cache: bundler
4
+ rvm:
5
+ - 2.7.1
6
+ before_install: gem install bundler -v 2.1.4
@@ -0,0 +1,74 @@
1
+ # Contributor Covenant Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ In the interest of fostering an open and welcoming environment, we as
6
+ contributors and maintainers pledge to making participation in our project and
7
+ our community a harassment-free experience for everyone, regardless of age, body
8
+ size, disability, ethnicity, gender identity and expression, level of experience,
9
+ nationality, personal appearance, race, religion, or sexual identity and
10
+ orientation.
11
+
12
+ ## Our Standards
13
+
14
+ Examples of behavior that contributes to creating a positive environment
15
+ include:
16
+
17
+ * Using welcoming and inclusive language
18
+ * Being respectful of differing viewpoints and experiences
19
+ * Gracefully accepting constructive criticism
20
+ * Focusing on what is best for the community
21
+ * Showing empathy towards other community members
22
+
23
+ Examples of unacceptable behavior by participants include:
24
+
25
+ * The use of sexualized language or imagery and unwelcome sexual attention or
26
+ advances
27
+ * Trolling, insulting/derogatory comments, and personal or political attacks
28
+ * Public or private harassment
29
+ * Publishing others' private information, such as a physical or electronic
30
+ address, without explicit permission
31
+ * Other conduct which could reasonably be considered inappropriate in a
32
+ professional setting
33
+
34
+ ## Our Responsibilities
35
+
36
+ Project maintainers are responsible for clarifying the standards of acceptable
37
+ behavior and are expected to take appropriate and fair corrective action in
38
+ response to any instances of unacceptable behavior.
39
+
40
+ Project maintainers have the right and responsibility to remove, edit, or
41
+ reject comments, commits, code, wiki edits, issues, and other contributions
42
+ that are not aligned to this Code of Conduct, or to ban temporarily or
43
+ permanently any contributor for other behaviors that they deem inappropriate,
44
+ threatening, offensive, or harmful.
45
+
46
+ ## Scope
47
+
48
+ This Code of Conduct applies both within project spaces and in public spaces
49
+ when an individual is representing the project or its community. Examples of
50
+ representing a project or community include using an official project e-mail
51
+ address, posting via an official social media account, or acting as an appointed
52
+ representative at an online or offline event. Representation of a project may be
53
+ further defined and clarified by project maintainers.
54
+
55
+ ## Enforcement
56
+
57
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
58
+ reported by contacting the project team at fuzzygroup@gmail.com. All
59
+ complaints will be reviewed and investigated and will result in a response that
60
+ is deemed necessary and appropriate to the circumstances. The project team is
61
+ obligated to maintain confidentiality with regard to the reporter of an incident.
62
+ Further details of specific enforcement policies may be posted separately.
63
+
64
+ Project maintainers who do not follow or enforce the Code of Conduct in good
65
+ faith may face temporary or permanent repercussions as determined by other
66
+ members of the project's leadership.
67
+
68
+ ## Attribution
69
+
70
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71
+ available at [https://contributor-covenant.org/version/1/4][version]
72
+
73
+ [homepage]: https://contributor-covenant.org
74
+ [version]: https://contributor-covenant.org/version/1/4/
data/Gemfile ADDED
@@ -0,0 +1,10 @@
1
+ source "https://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in url_common.gemspec
4
+ gemspec
5
+
6
+ gem "rake", "~> 12.0"
7
+ gem "rspec", "~> 3.0"
8
+ gem "fuzzyurl", '~> 0.9.0'
9
+ gem 'mechanize', '~> 2.6'
10
+ gem "byebug"
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2020 Scott Johnson
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
@@ -0,0 +1,44 @@
1
+ # UrlCommon
2
+
3
+ UrlCommon is a class library for common URL manipulation and crawling tasks. The code lives in `lib/url_common.rb`. To experiment with it, run `bin/console` for an interactive prompt.
4
+
5
+ The library exposes module-level helpers such as `UrlCommon.is_valid?`, `UrlCommon.get_base_domain`, and `UrlCommon.url_base`.
6
+
7
+ ## Installation
8
+
9
+ Add this line to your application's Gemfile:
10
+
11
+ ```ruby
12
+ gem 'url_common'
13
+ ```
14
+
15
+ And then execute:
16
+
17
+ $ bundle install
18
+
19
+ Or install it yourself as:
20
+
21
+ $ gem install url_common
22
+
23
+ ## Usage
24
+
25
+ TODO: Write usage instructions here
26
+
27
+ ## Development
28
+
29
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
30
+
31
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
32
+
33
+ ## Contributing
34
+
35
+ Bug reports and pull requests are welcome on GitHub at https://github.com/fuzzygroup/url_common. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/fuzzygroup/url_common/blob/master/CODE_OF_CONDUCT.md).
36
+
37
+
38
+ ## License
39
+
40
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
41
+
42
+ ## Code of Conduct
43
+
44
+ Everyone interacting in the UrlCommon project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/fuzzygroup/url_common/blob/master/CODE_OF_CONDUCT.md).
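The README above leaves its Usage section as a TODO. As a rough illustration of the kind of helper the library file in this diff defines, the sketch below re-implements three of the regex-based methods with no gem dependencies. The module name `UrlCommonSketch` is hypothetical; the method bodies follow the logic of `parse_fid_from_itunes_url`, `parse_country_from_itunes_url`, and `strip_a_tag` in `lib/url_common.rb`, but this is not the gem itself.

```ruby
# Hypothetical standalone module (not part of the gem); it mirrors three
# regex-based helpers from lib/url_common.rb so they can be tried in isolation.
module UrlCommonSketch
  # Returns the numeric id that follows "/id" in an iTunes URL, or nil.
  def self.parse_fid_from_itunes_url(url)
    match = %r{/id([0-9]+)}.match(url)
    match && match[1]
  end

  # Returns the two-character storefront code from the path, defaulting to "us".
  def self.parse_country_from_itunes_url(url)
    match = %r{https?://itunes\.apple\.com/(..)/}.match(url)
    match ? match[1] : "us"
  end

  # Strips the leading `<a href="` and trailing `">` from a bare anchor tag.
  def self.strip_a_tag(a_tag)
    a_tag.sub(/<a href=["']/, "").sub(/["']>/, "")
  end
end

itunes_url = "https://itunes.apple.com/us/app/imovie/id408981434?mt=12"
puts UrlCommonSketch.parse_fid_from_itunes_url(itunes_url)
puts UrlCommonSketch.parse_country_from_itunes_url(itunes_url)
puts UrlCommonSketch.strip_a_tag('<a href="https://example.com/recipes/">')
```

With the gem installed, the equivalent calls would presumably be made on `UrlCommon` directly, as the comments in `lib/url_common.rb` suggest.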
@@ -0,0 +1,6 @@
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+
6
+ task :default => :spec
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "url_common"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start(__FILE__)
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
@@ -0,0 +1,269 @@
1
+ require "url_common/version"
2
+ require 'fuzzyurl'
3
+ require 'mechanize'
4
+ require 'ostruct'
5
+
6
+ module UrlCommon
7
+ class Error < StandardError; end
8
+
9
+ # UrlCommon.is_valid?("http://fuzzyblog.io/blog/")
10
+ # UrlCommon.is_valid?("fuzzyblog.io/blog/")
11
+ def self.is_valid?(url)
12
+ begin
13
+ result = Fuzzyurl.from_string(url)
14
+ return false if result.hostname.nil?
15
+ return false if result.protocol.nil?
16
+ return false if (!result.hostname.include?('.')) && result.protocol.nil?
17
+ return true
18
+ rescue StandardError => e
19
+ return false
20
+ end
21
+ end
22
+
23
+ # UrlCommon.parse_fid_from_itunes_url("https://itunes.apple.com/us/app/imovie/id408981434?mt=12")
24
+ def self.parse_fid_from_itunes_url(url)
25
+ tmp = /\/id([0-9]+)/.match(url)
26
+ if tmp && tmp[1]
27
+ return tmp[1]
28
+ else
29
+ return nil
30
+ end
31
+ end
32
+
33
+ def self.parse_country_from_itunes_url(url)
34
+ country = /https?:\/\/itunes\.apple\.com\/(..)\//.match(url)
35
+ if country
36
+ country = country[1]
37
+ end
38
+ return country if country
39
+ return 'us'
40
+ end
41
+
42
+ def self.get_base_domain(url)
43
+ parts = URI.parse(url)
44
+ return parts.host.sub(/^www\./,'')
45
+ end
46
+
47
+ def self.join(base, rest, debug = false)
48
+ return URI.join(base, rest).to_s
49
+ end
50
+
51
+ def self.url_no_www(url)
52
+ parts = Fuzzyurl.new(url)
53
+ if parts.query
54
+ #return parts.hostname.sub(/^www\./, '') + parts.try(:path) + '?' + parts.query
55
+ return parts.hostname.sub(/^www\./, '') + parts.path.to_s + '?' + parts.query
56
+ else
57
+ #byebug
58
+ #return parts.hostname.sub(/^www\./, '') + parts.try(:path).to_s
59
+ return parts.hostname.sub(/^www\./, '') + parts&.path.to_s
60
+ end
61
+ end
62
+
63
+ #TODO
64
+ def self.count_links(html)
65
+ return 0
66
+ end
67
+
68
+ def self.agent
69
+ return Mechanize.new
70
+ end
71
+
72
+ def self.strip_a_tag(a_tag)
73
+ #<a href="https://www.keyingredient.com/recipes/12194051/egg-salad-best-ever-creamy/">
74
+ return a_tag.sub(/<a href=[\"']/,'').sub(/[\"']>/,'')
75
+ end
76
+
77
+
78
+ #
79
+ # Returns a url w/o http://www
80
+ # UrlCommon.url_base("https://www.udemy.com/the-build-a-saas-app-with-flask-course/")
81
+ # "udemy.com/the-build-a-saas-app-with-flask-course/"
82
+ #
83
+ def self.url_base(url, base_domain=nil)
84
+ if base_domain.nil?
85
+ base_domain = get_base_domain(url)
86
+ end
87
+ parts = URI.parse(url)
88
+ extra = ""
89
+ extra = "?#{parts.query}" if parts.query
90
+ url_base = "#{base_domain}#{parts.path}#{extra}"
91
+ return url_base[0..254]
92
+ end
93
+
94
+ #tested #https://www.amazon.com/gp/product/B01DT4A2R4/ref=as_li_qf_sp_asin_il_tl?ie=UTF8&tag=nickjanetakis-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B01DT4A2R4&linkId=496be5e222b6291369c0a393c797c2c0
95
+ # returns nil if link isn't amazon at all
96
+ # returns true if link is amazon and has referrer code
97
+ # returns false if link is amazon and doesn't have referrer code
98
+ def self.check_for_amazon_referrer(url, referrer_code)
99
+ #def check_for_amazon_referrer(url, referrer_code)
100
+ #https://github.com/gamache/fuzzyurl.rb
101
+ fu = Fuzzyurl.from_string(url)
102
+ return nil if fu.hostname.nil?
103
+ base_domain = fu.hostname.sub(/^www\./,'')
104
+ # base_domain = UrlCommon.get_base_domain
105
+ parts = base_domain.split(".")
106
+ return nil if parts[0] != "amazon"
107
+ #referer_code = self.account.user.details[:amazon_referrer_code]
108
+ if url.include?(referrer_code.to_s)
109
+ return true
110
+ else
111
+ return false
112
+ end
113
+ end
114
+
115
+ # TODO needs tests
116
+ #def self.check_for_jekyll_subdomain?(url)
117
+ def self.has_own_domain?(url)
118
+ return false if url =~ /\.github\.io/
119
+ return false if url =~ /\.blogspot\.com/
120
+ return false if url =~ /\.wordpress\.com/
121
+ #return false if url =~ /\..+\./
122
+ return true
123
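+ # NOTE: the remaining lines were unreachable after `return true` and referenced
124
+ # undefined names (site_url, analysis_results); they are kept only as comments.
125
+ # if site_url =~ /\..+\./ then return true
126
+ # else analysis_results << "You have a domain of your own; that's a great first step!"
127
+ # end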
128
+
129
+ end
130
+
131
+ # TODO needs tests
132
+ def self.get_page(url, return_html = false, user_agent = nil)
133
+ agent = Mechanize.new { |a|
134
+ if user_agent.nil?
135
+ #a.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:46.0) Gecko/20100101 Firefox/46.0"
136
+ a.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
137
+ else
138
+ a.user_agent = user_agent
139
+ end
140
+ #a.user_agent = "curl/7.54.0"
141
+ #debugger
142
+ }
143
+ agent.verify_callback = Proc.new do |ok,x509|
144
+ status = x509.error
145
+ msg = x509.error_string
146
+ warn "server certificate verify: status: #{status}, msg: #{msg}" if status != 0
147
+ true # always accept the certificate, so verification errors are logged but ignored
148
+ end
149
+ begin
150
+ page = agent.get(url)
151
+ if return_html
152
+ return :ok, page.body
153
+ else
154
+ return :ok, page
155
+ end
156
+ #return :ok, page
157
+ rescue StandardError => e
158
+ return :error, e
159
+ end
160
+ end
161
+
162
+ # def self.get_page_caching_attempt(url, return_html = false)
163
+ # agent = Mechanize.new { |a|
164
+ # a.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:46.0) Gecko/20100101 Firefox/46.0"
165
+ # }
166
+ # agent.verify_callback = Proc.new do |ok,x509|
167
+ # status = x509.error
168
+ # msg = x509.error_string
169
+ # logger.warn "server certificate verify: status: #{status}, msg: #{msg}" if status != 0
170
+ # true # this has the side effect of ignoring errors. nice!
171
+ # end
172
+ # begin
173
+ # page = agent.get(url)
174
+ # if return_html
175
+ # Rails.cache.fetch(UrlCommon.sha_it(url), :expires_in => 1.hour) do
176
+ # page.body
177
+ # end
178
+ # # Rails.cache.fetch(UrlCommon.sha_it(url), :expires_in => 1.hour) do
179
+ # # debugger
180
+ # # page.body
181
+ # # end
182
+ # return :ok, page.body
183
+ # else
184
+ # return :ok, page
185
+ # end
186
+ # rescue StandardError => e
187
+ # return :error, e
188
+ # end
189
+ # end
190
+
191
+ def self.mpage_is_html?(page)
192
+ return true if page.respond_to?(:title)
193
+ return false
194
+ end
195
+
196
+ # TODO needs tests
197
+ def self.check_for_404(url, elixir_style = false)
198
+ agent = Mechanize.new
199
+ results = []
200
+
201
+ begin
202
+ head_result = agent.head(url)
203
+ return OpenStruct.new(:url => url, :status => 200) if elixir_style == false
204
+ return :ok, url if elixir_style
205
+ rescue StandardError => e
206
+ if e.to_s =~ /404/
207
+ return OpenStruct.new(:url => url, :error => e, :status => 404)
208
+ else
209
+ return OpenStruct.new(:url => url, :error => e, :status => 404)
210
+ end
211
+ end
212
+ end
213
+
214
+ # TODO needs tests
215
+ def self.check_for_broken_links(links)
216
+ results = []
217
+ agent = Mechanize.new
218
+ links.each do |link|
219
+ begin
220
+ result = agent.head(link.href)
221
+ results << OpenStruct.new(:url => link.href, :status => 200)
222
+ rescue StandardError => e
223
+ if e.to_s =~ /404/
224
+ results << OpenStruct.new(:url => link.href, :error => e, :status => 404)
225
+ end
226
+ end
227
+ end
228
+ #debugger
229
+ results
230
+ end
231
+
232
+ def self.fix_relative_url(base_url, partial_url)
233
+ return partial_url if partial_url =~ /^http/
234
+ parts = URI.parse(base_url)
235
+ return parts.scheme + '://' + parts.host + partial_url
236
+ # return File.join(base_url, partial_url) # unreachable alternative, left for reference
237
+ end
238
+
239
+ # status, url = UrlCommon.validate_with_merge_fragment("nickjj/orats", "https://www.github.com/")
240
+ def self.validate_with_merge_fragment(url, merge_fragment)
241
+ #
242
+ # verify it is a valid url and it isn't a 404 or redirect
243
+ #
244
+ if is_valid?(url) && check_for_404(url).status == 200
245
+ return true, url
246
+ end
247
+
248
+ #
249
+ # Try and make it valid
250
+ #
251
+ if url =~ /^http/
252
+ # if its invalid and has http then don't know what to do so return false
253
+ return false, url
254
+ end
255
+
256
+ url = File.join(merge_fragment, url)
257
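+ return true, url if is_valid?(url) && check_for_404(url).status == 200
258
+ # Neither form resolved; report failure explicitly instead of returning nil.
259
+ return false, url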
260
+ end
261
+
262
+ #TODO needs tests
263
+ def self.create_mechanize_page_from_html(url, html)
264
+ mechanize_page = Mechanize::Page.new(nil, {'content-type'=>'text/html'}, html, nil, Mechanize.new)
265
+ mechanize_page.uri = URI.parse(url)
266
+
267
+ return mechanize_page
268
+ end
269
+ end
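Several helpers in the file above (`get_base_domain`, `url_base`, `fix_relative_url`) rest on the standard library's `URI`. The sketch below is a dependency-free approximation of them, offered as one possible reading rather than the author's implementation. The module name `UrlHelpersSketch` is hypothetical, and `fix_relative_url` here swaps in `URI.join` for the original string concatenation so that relative paths without a leading slash also resolve.

```ruby
require "uri"

# Hypothetical module, for illustration only; not part of the gem.
module UrlHelpersSketch
  # Host without a leading "www." (the dot is escaped so "wwwx.example.com"
  # is left untouched).
  def self.get_base_domain(url)
    URI.parse(url).host.sub(/\Awww\./, "")
  end

  # Host (minus "www."), path, and query string, truncated to 255 characters
  # as in the original url_base.
  def self.url_base(url)
    parts = URI.parse(url)
    query = parts.query ? "?#{parts.query}" : ""
    "#{get_base_domain(url)}#{parts.path}#{query}"[0..254]
  end

  # Assumption: URI.join replaces the original scheme + host + partial
  # concatenation; absolute URLs pass through unchanged.
  def self.fix_relative_url(base_url, partial_url)
    URI.join(base_url, partial_url).to_s
  end
end

puts UrlHelpersSketch.url_base("https://www.udemy.com/the-build-a-saas-app-with-flask-course/")
puts UrlHelpersSketch.fix_relative_url("https://example.com/blog/post.html", "images/a.png")
```

`URI.join` resolves the partial URL against the base according to RFC 3986, which differs from the original method when the base URL has a path: the original always appends to the bare host.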
@@ -0,0 +1,3 @@
1
+ module UrlCommon
2
+ VERSION = "0.1.0"
3
+ end
@@ -0,0 +1,29 @@
1
+ require_relative 'lib/url_common/version'
2
+
3
+ Gem::Specification.new do |spec|
4
+ spec.name = "url_common"
5
+ spec.version = UrlCommon::VERSION
6
+ spec.authors = ["Scott Johnson"]
7
+ spec.email = ["fuzzygroup@gmail.com"]
8
+
9
+ spec.summary = %q{This is a class library designed for common url manipulation and crawling tasks.}
10
+ spec.description = %q{This is a class library for common url manipulation and crawling tasks. It is based on a career focused on the practical side of working with the Internet using Ruby.}
11
+ spec.homepage = "https://github.com/fuzzygroup/url_common/"
12
+ spec.license = "MIT"
13
+ spec.required_ruby_version = Gem::Requirement.new(">= 2.3.0")
14
+
15
+ spec.metadata["allowed_push_host"] = "https://rubygems.org"
16
+
17
+ spec.metadata["homepage_uri"] = spec.homepage
18
+ spec.metadata["source_code_uri"] = "https://github.com/fuzzygroup/url_common/"
19
+ spec.metadata["changelog_uri"] = "https://github.com/fuzzygroup/url_common/CHANGELOG.md"
20
+
21
+ # Specify which files should be added to the gem when it is released.
22
+ # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
23
+ spec.files = Dir.chdir(File.expand_path('..', __FILE__)) do
24
+ `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
25
+ end
26
+ spec.bindir = "exe"
27
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
28
+ spec.require_paths = ["lib"]
29
+ end
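The gemspec above declares no runtime dependencies, yet `lib/url_common.rb` requires `fuzzyurl` and `mechanize`, which are pinned only in the Gemfile. A Gemfile governs development of the gem itself, so an installed copy may fail at `require` time unless the consumer adds those gems by hand. One possible remedy, sketched here as an assumption about what the author might want rather than as the project's actual configuration, is to declare them with `add_dependency`. The spec name `url_common_example` and the other placeholder fields are hypothetical; the version constraints are copied from the Gemfile.

```ruby
# Hypothetical spec, for illustration only; the real gemspec is shown above.
example_spec = Gem::Specification.new do |spec|
  spec.name    = "url_common_example" # placeholder name
  spec.version = "0.0.0"              # placeholder version
  spec.summary = "Illustrates declaring runtime dependencies"
  spec.authors = ["Example Author"]   # placeholder author
  spec.files   = []

  # Runtime dependencies, using the constraints found in the Gemfile.
  spec.add_dependency "fuzzyurl",  "~> 0.9.0"
  spec.add_dependency "mechanize", "~> 2.6"
end

example_spec.runtime_dependencies.each do |dep|
  puts "#{dep.name} #{dep.requirement}"
end
```

Dependencies declared this way are recorded in the packaged metadata and installed along with the gem.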
metadata ADDED
@@ -0,0 +1,63 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: url_common
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Scott Johnson
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2020-08-12 00:00:00.000000000 Z
12
+ dependencies: []
13
+ description: This is a class library for common url manipulation and crawling tasks. It
14
+ is based on a career focused on the practical side of working with the Internet
15
+ using Ruby.
16
+ email:
17
+ - fuzzygroup@gmail.com
18
+ executables: []
19
+ extensions: []
20
+ extra_rdoc_files: []
21
+ files:
22
+ - ".gitignore"
23
+ - ".rspec"
24
+ - ".travis.yml"
25
+ - CODE_OF_CONDUCT.md
26
+ - Gemfile
27
+ - LICENSE.txt
28
+ - README.md
29
+ - Rakefile
30
+ - bin/console
31
+ - bin/setup
32
+ - lib/url_common.rb
33
+ - lib/url_common/version.rb
34
+ - url_common.gemspec
35
+ homepage: https://github.com/fuzzygroup/url_common/
36
+ licenses:
37
+ - MIT
38
+ metadata:
39
+ allowed_push_host: https://rubygems.org
40
+ homepage_uri: https://github.com/fuzzygroup/url_common/
41
+ source_code_uri: https://github.com/fuzzygroup/url_common/
42
+ changelog_uri: https://github.com/fuzzygroup/url_common/CHANGELOG.md
43
+ post_install_message:
44
+ rdoc_options: []
45
+ require_paths:
46
+ - lib
47
+ required_ruby_version: !ruby/object:Gem::Requirement
48
+ requirements:
49
+ - - ">="
50
+ - !ruby/object:Gem::Version
51
+ version: 2.3.0
52
+ required_rubygems_version: !ruby/object:Gem::Requirement
53
+ requirements:
54
+ - - ">="
55
+ - !ruby/object:Gem::Version
56
+ version: '0'
57
+ requirements: []
58
+ rubygems_version: 3.1.2
59
+ signing_key:
60
+ specification_version: 4
61
+ summary: This is a class library designed for common url manipulation and crawling
62
+ tasks.
63
+ test_files: []