RubyGems - url_parser - Versions diffs - 0.4.0 → 0.5.0 - Mend

url_parser 0.4.0 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (30)

checksums.yaml +4 -4
data/.gitignore +1 -0
data/.ruby-gemset +1 -0
data/.ruby-version +1 -0
data/.travis.yml +7 -0
data/CHANGELOG.md +20 -0
data/Gemfile +4 -0
data/Guardfile +40 -7
data/LICENSE.txt +1 -1
data/README.md +301 -5
data/Rakefile +5 -0
data/lib/url_parser.rb +93 -286
data/lib/url_parser/db.yml +77 -0
data/lib/url_parser/domain.rb +102 -0
data/lib/url_parser/model.rb +233 -0
data/lib/url_parser/option_setter.rb +47 -0
data/lib/url_parser/parser.rb +206 -0
data/lib/url_parser/uri.rb +206 -0
data/lib/url_parser/version.rb +1 -1
data/spec/spec_helper.rb +83 -6
data/spec/support/.gitkeep +0 -0
data/spec/support/helpers.rb +7 -0
data/spec/url_parser/domain_spec.rb +163 -0
data/spec/url_parser/model_spec.rb +426 -0
data/spec/url_parser/option_setter_spec.rb +71 -0
data/spec/url_parser/parser_spec.rb +515 -0
data/spec/url_parser/uri_spec.rb +570 -0
data/spec/url_parser_spec.rb +93 -387
data/url_parser.gemspec +5 -6
metadata +39 -29

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 69dee1c0cfaf6d371119bbbc969312ea4ec1632b
-  data.tar.gz: e57b7aea6c0a4ea33e4d4a8fd63954a47385ec73
+  metadata.gz: e2a74ec366bad04988dd1f34963428f2f926fe08
+  data.tar.gz: b3fdc8e738b103bc74820bb94ad2327d14ec6dc5
 SHA512:
-  metadata.gz: ebd283b722a4e5e4a6ac6b66812fad32d138be669c22e87b9c226e0a8aab7fc0ee0ff9b8445ba1d3e5139d9d59ce41b6d26ba864270c201e7742f7ff00e3572a
-  data.tar.gz: 2ba608ea61964843d308a7c88415b7f73509d8f7be4f7541e461faae5989213f4c072b3fe9a071f4f3a521fdef8b1062cb31dca7c3608a515f24d61877cd608e
+  metadata.gz: 0bd4a7b601dc635af88e0ff0a8a601de05ff6b4cbee00a5e479f0b23df4a447fbd854f10ea4e1164851d56de661df5e322878f032277e7da87b75cf94f8bd8e8
+  data.tar.gz: afb86d3bf4f117840bef16d2f43ca205ce7e7697416850b28b18c83a5d639444d2d5a167d93cf50dac955d6e4a04b0bd2ecac8c4224e8d3a1dce29406d1c8349

data/.gitignore CHANGED

@@ -20,3 +20,4 @@ tmp
 *.o
 *.a
 mkmf.log
+NOTES*

data/.ruby-gemset ADDED

@@ -0,0 +1 @@
+url_parser

data/.ruby-version ADDED

@@ -0,0 +1 @@
+2.3.0

data/.travis.yml ADDED

@@ -0,0 +1,7 @@
+language: ruby
+rvm:
+  - 2.3.0
+before_install: gem install bundler -v 1.11.2
+addons:
+  code_climate:
+    repo_token: 913908cf4192af635496d66bcb17f5bb9cf56d27b6fc2156073961b941f5b684

data/CHANGELOG.md ADDED

@@ -0,0 +1,20 @@
+v0.5.0 / 2016-02-09
+======================
+  * Updated README.md
+  * Added CHANGELOG.md
+  * Only tag errors that inherit from StandardError
+  * Deprecate UrlParser.new, is now UrlParser.parse
+  * Added UrlParser::URI#ipv4 and UrlParser::URI#ipv6 to return the actual values, if applicable
+  * Added [gem_config](https://github.com/krautcomputing/gem_config) for configurable library settings :embedded_params, :default_scheme, and :scheme_map, see README.md for usage
+  * Add UrlParser module functions .parse, .unembed, .normalize, .canonicalize, and .clean
+  * Add UrlParser::Domain to handle domain name validations
+  * Add UrlParser .escape and .unescape to encode and decode strings
+  * Add UrlParser::Parser class for unescaping, parsing, unembedding, canonicalization, normalization, and hashing URI strings
+  * Add UrlParser::URI#naked_hostname to return the entire hostname without any ww? prefix
+  * Refactored UrlParser::URI and UrlParser::Parser classes, see README.md for updated usage
+  * Added 'addressable' to gemspec
+  * Remove 'naught' gem dependency
+  * Remove 'activemodel' gem dependency
+  * Remove 'activesupport' gem dependency
+  * Remove 'postrank-uri' gem dependency
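
Note on `UrlParser::URI#naked_hostname`: the changelog describes it as returning the hostname without any `ww?` prefix. A minimal sketch of that idea in plain Ruby follows. The helper name `strip_www_prefix` and the `WWW_PREFIX` pattern are assumptions made for illustration; they are not taken from the gem's source, which may match prefixes differently.

```ruby
# Hypothetical pattern (an assumption, not the gem's code): a leading label
# made of "ww" plus a letter "w" or a digit, optionally followed by more
# digits, e.g. "www.", "ww2.", "www3.".
WWW_PREFIX = /\Aww(?:w|\d)\d*\./i

# Hypothetical helper: returns the hostname with a leading www-style label removed.
def strip_www_prefix(hostname)
  hostname.sub(WWW_PREFIX, '')
end

puts strip_www_prefix('ww2.foo.bar.example.com') # prints "foo.bar.example.com"
puts strip_www_prefix('www.example.com')         # prints "example.com"
puts strip_www_prefix('wwwidgets.example.com')   # prints "wwwidgets.example.com" (no match)
```

Anchoring the pattern with `\A` and requiring the trailing dot keeps ordinary labels that merely start with "ww" intact.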

data/Gemfile CHANGED

@@ -2,3 +2,7 @@ source 'https://rubygems.org'
 # Specify your gem's dependencies in url_parser.gemspec
 gemspec
+group :test do
+  gem "codeclimate-test-reporter", require: nil
+end

data/Guardfile CHANGED

@@ -1,17 +1,50 @@
 # A sample Guardfile
 # More info at https://github.com/guard/guard#readme
+## Uncomment and set this to only include directories you want to watch
+# directories %w(app lib config test spec features)
+## Uncomment to clear the screen before every task
+# clearing :on
+## Guard internally checks for changes in the Guardfile and exits.
+## If you want Guard to automatically start up again, run guard in a
+## shell loop, e.g.:
+##
+##  $ while bundle exec guard; do echo "Restarting Guard..."; done
+##
+## Note: if you are using the `directories` clause above and you are not
+## watching the project directory ('.'), then you will want to move
+## the Guardfile to a watched dir and symlink it back, e.g.
+#
+#  $ mkdir config
+#  $ mv Guardfile config/
+#  $ ln -s config/Guardfile .
+#
+# and, you'll have to watch "config/Guardfile" instead of "Guardfile"
 # Note: The cmd option is now required due to the increasing number of ways
 #       rspec may be run, below are examples of the most common uses.
 #  * bundler: 'bundle exec rspec'
 #  * bundler binstubs: 'bin/rspec'
-#  * spring: 'bin/rsspec' (This will use spring if running and you have
+#  * spring: 'bin/rspec' (This will use spring if running and you have
 #                          installed the spring binstubs per the docs)
-#  * zeus: 'zeus rspec' (requires the server to be started separetly)
+#  * zeus: 'zeus rspec' (requires the server to be started separately)
 #  * 'just' rspec: 'rspec'
-guard :rspec, cmd: 'bundle exec rspec' do
-  watch(%r{^spec/.+_spec\.rb$})
-  watch(%r{^lib/(.+)\.rb$})     { |m| "spec/lib/#{m[1]}_spec.rb" }
-  watch('spec/spec_helper.rb')  { "spec" }
-end
+guard :rspec, cmd: "bundle exec rspec" do
+  require "guard/rspec/dsl"
+  dsl = Guard::RSpec::Dsl.new(self)
+  # Feel free to open issues for suggestions and improvements
+  # RSpec files
+  rspec = dsl.rspec
+  watch(rspec.spec_helper) { rspec.spec_dir }
+  watch(rspec.spec_support) { rspec.spec_dir }
+  watch(rspec.spec_files)
+  # Ruby files
+  ruby = dsl.ruby
+  dsl.watch_spec_files_for(ruby.lib_files)
+end

data/LICENSE.txt CHANGED

@@ -1,4 +1,4 @@
-Copyright (c) 2014 Matt Solt
+Copyright (c) 2014-2015 Matthew Solt
 MIT License

data/README.md CHANGED

@@ -1,10 +1,12 @@
 # UrlParser
-Combine PostRank-URI, Domainatrix, and other Ruby url parsing libraries into a common interface.
+[![Gem Version](https://img.shields.io/gem/v/url_parser.svg?style=flat)](https://rubygems.org/gems/url_parser)
+[![Build Status](https://img.shields.io/travis/activefx/url_parser.svg?style=flat)](http://travis-ci.org/activefx/url_parser)
+[![Code Climate](https://img.shields.io/codeclimate/github/activefx/url_parser.svg?style=flat)](https://codeclimate.com/github/activefx/url_parser)
+[![Test Coverage](https://img.shields.io/codeclimate/coverage/github/activefx/url_parser.svg?style=flat)](https://codeclimate.com/github/activefx/url_parser/coverage)
+[![Dependency Status](https://gemnasium.com/activefx/url_parser.svg)](https://gemnasium.com/activefx/url_parser)
-See also:
-- https://github.com/pauldix/domainatrix
-- https://github.com/postrank-labs/postrank-uri
+Extended URI capabilities built on top of Addressable::URI. Parse URIs into granular components, unescape encoded characters, extract embedded URIs, normalize URIs, handle canonical url generation, and validate domains. Inspired by [PostRank-URI](https://github.com/postrank-labs/postrank-uri) and [URI.js](https://github.com/medialize/URI.js).
 ## Installation
@@ -20,9 +22,303 @@ Or install it yourself as:
     $ gem install url_parser
+## Example
+```ruby
+uri = UrlParser.parse('foo://username:password@ww2.foo.bar.example.com:123/hello/world/there.html?name=ferret#foo')
+uri.class               #=> UrlParser::URI
+uri.scheme              #=> 'foo'
+uri.username            #=> 'username'
+uri.user                #=> 'username' # Alias for #username
+uri.password            #=> 'password'
+uri.userinfo            #=> 'username:password'
+uri.hostname            #=> 'ww2.foo.bar.example.com'
+uri.naked_hostname      #=> 'foo.bar.example.com'
+uri.port                #=> 123
+uri.host                #=> 'ww2.foo.bar.example.com:123'
+uri.www                 #=> 'ww2'
+uri.tld                 #=> 'com'
+uri.top_level_domain    #=> 'com' # Alias for #tld
+uri.extension           #=> 'com' # Alias for #tld
+uri.sld                 #=> 'example'
+uri.second_level_domain #=> 'example' # Alias for #sld
+uri.domain_name         #=> 'example' # Alias for #sld
+uri.trd                 #=> 'ww2.foo.bar'
+uri.third_level_domain  #=> 'ww2.foo.bar' # Alias for #trd
+uri.subdomains          #=> 'ww2.foo.bar' # Alias for #trd
+uri.naked_trd           #=> 'foo.bar'
+uri.naked_subdomain     #=> 'foo.bar' # Alias for #naked_trd
+uri.domain              #=> 'example.com'
+uri.subdomain           #=> 'ww2.foo.bar.example.com'
+uri.origin              #=> 'foo://ww2.foo.bar.example.com:123'
+uri.authority           #=> 'username:password@ww2.foo.bar.example.com:123'
+uri.site                #=> 'foo://username:password@ww2.foo.bar.example.com:123'
+uri.path                #=> '/hello/world/there.html'
+uri.segment             #=> 'there.html'
+uri.directory           #=> '/hello/world'
+uri.filename            #=> 'there.html'
+uri.suffix              #=> 'html'
+uri.query               #=> 'name=ferret'
+uri.query_values        #=> { 'name' => 'ferret' }
+uri.fragment            #=> 'foo'
+uri.resource            #=> 'there.html?name=ferret#foo'
+uri.location            #=> '/hello/world/there.html?name=ferret#foo'
+```
 ## Usage
-TODO: Write usage instructions here
+### Parse
+Parse takes the provided URI and breaks it down into its component parts. For a full list of the components provided, see [URI Data Model](#uri-data-model). If you provide an instance of Addressable::URI, it will consider the URI already parsed.
+```ruby
+uri = UrlParser.parse('http://example.org/foo?bar=baz')
+uri.class
+#=> UrlParser::URI
+```
+Unembed, canonicalize, normalize, and clean all rely on parse.
+### Unembed
+Unembed searches the provided URI's query values for redirection urls. By default, it searches the `u` and `url` params; however, you can configure custom params to search.
+```ruby
+uri = UrlParser.unembed('http://energy.gov/exit?url=https%3A//twitter.com/energy')
+uri.to_s
+#=> "https://twitter.com/energy"
+```
+With custom embedded params keys:
+```ruby
+uri = UrlParser.unembed('https://www.upwork.com/leaving?ref=https%3A%2F%2Fwww.example.com', embedded_params: [ 'u', 'url', 'ref' ])
+uri.to_s
+#=> "https://www.example.com/"
+```
+### Canonicalize
+Canonicalize applies filters on param keys to remove common tracking params, attempting to make it easier to identify duplicate URIs. For a full list of params, see `db.yml`.
+```ruby
+uri = UrlParser.canonicalize('https://en.wikipedia.org/wiki/Ruby_(programming_language)?source=ABCD&utm_source=EFGH')
+uri.to_s
+#=> "https://en.wikipedia.org/wiki/Ruby_(programming_language)?"
+```
+### Normalize
+Normalize standardizes paths, query strings, anchors, whitespace, hostnames, and trailing slashes.
+```ruby
+# Normalize paths
+uri = UrlParser.normalize('http://example.com/a/b/../../')
+uri.to_s
+#=> "http://example.com/"
+# Normalize query strings
+uri = UrlParser.normalize('http://example.com/?')
+uri.to_s
+#=> "http://example.com/"
+# Normalize anchors
+uri = UrlParser.normalize('http://example.com/#test')
+uri.to_s
+#=> "http://example.com/"
+# Normalize whitespace
+uri = UrlParser.normalize('http://example.com/a/../? #test')
+uri.to_s
+#=> "http://example.com/"
+# Normalize hostnames
+uri = UrlParser.normalize("💩.la")
+uri.to_s
+#=> "http://xn--ls8h.la/"
+# Normalize trailing slashes
+uri = UrlParser.normalize('http://example.com/a/b/')
+uri.to_s
+#=> "http://example.com/a/b"
+```
+### Clean
+Clean combines parsing, unembedding, canonicalization, and normalization into a single call. It is designed to provide a method for cross-referencing identical urls.
+```ruby
+uri = UrlParser.clean('http://example.com/a/../?url=https%3A//💩.la/&utm_source=google')
+uri.to_s
+#=> "https://xn--ls8h.la/"
+uri = UrlParser.clean('https://en.wikipedia.org/wiki/Ruby_(programming_language)?source=ABCD&utm_source%3Danalytics')
+uri.to_s
+#=> "https://en.wikipedia.org/wiki/Ruby_(programming_language)"
+```
+## UrlParser::URI
+Parsing a URI with UrlParser returns an instance of `UrlParser::URI`, with the following methods available:
+### URI Data Model
+```ruby
+ * :scheme              # Top level URI naming structure / protocol.
+ * :username            # Username portion of the userinfo.
+ * :user                # Alias for #username.
+ * :password            # Password portion of the userinfo.
+ * :userinfo            # URI username and password for authentication.
+ * :hostname            # Fully qualified domain name or IP address.
+ * :naked_hostname      # Hostname without any ww? prefix.
+ * :port                # Port number.
+ * :host                # Hostname and port.
+ * :www                 # The ww? portion of the subdomain.
+ * :tld                 # Returns the top level domain portion, aka the extension.
+ * :top_level_domain    # Alias for #tld.
+ * :extension           # Alias for #tld.
+ * :sld                 # Returns the second level domain portion, aka the domain part.
+ * :second_level_domain # Alias for #sld.
+ * :domain_name         # Alias for #sld.
+ * :trd                 # Returns the third level domain portion, aka the subdomain part.
+ * :third_level_domain  # Alias for #trd.
+ * :subdomains          # Alias for #trd.
+ * :naked_trd           # Any non-ww? subdomains.
+ * :naked_subdomain     # Alias for #naked_trd.
+ * :domain              # The domain name with the tld.
+ * :subdomain           # All subdomains, including ww?.
+ * :origin              # Scheme and host.
+ * :authority           # Userinfo and host.
+ * :site                # Scheme, userinfo, and host.
+ * :path                # Directory and segment.
+ * :segment             # Last portion of the path.
+ * :directory           # Any directories following the site within the URI.
+ * :filename            # Segment if a file extension is present.
+ * :suffix              # The file extension of the filename.
+ * :query               # Params and values as a string.
+ * :query_values        # A hash of params and values.
+ * :fragment            # Fragment identifier.
+ * :resource            # Path, query, and fragment.
+ * :location            # Directory and resource - everything after the site.
+```
+### Additional URI Methods
+```ruby
+uri = UrlParser.clean('#')
+uri.unescaped?      #=> true
+uri.parsed?         #=> true
+uri.unembedded?     #=> true
+uri.canonicalized?  #=> true
+uri.normalized?     #=> true
+uri.cleaned?        #=> true
+# IP / localhost methods
+uri.localhost?
+uri.ip_address?
+uri.ipv4?
+uri.ipv6?
+uri.ipv4 #=> returns IPv4 address if applicable
+uri.ipv6 #=> returns IPv6 address if applicable
+# UrlParser::URI#relative?
+uri = UrlParser.parse('/')
+uri.relative?
+#=> true
+# UrlParser::URI#absolute?
+uri = UrlParser.parse('http://example.com/')
+uri.absolute?
+#=> true
+# UrlParser::URI#clean - return a cleaned string
+uri = UrlParser.parse('http://example.com/?utm_source=google')
+uri.clean
+#=> "http://example.com/"
+# UrlParser::URI#canonical - cleans and strips the scheme
+uri = UrlParser.parse('http://example.com/?utm_source%3Danalytics')
+uri.canonical
+#=> "//example.com/"
+# Joining URIs
+uri = UrlParser.parse('http://foo.com/zee/zaw/zoom.html')
+joined_uri = uri + '/bar#id'
+joined_uri.to_s
+#=> "http://foo.com/bar#id"
+# UrlParser::URI #raw / #to_s - return the URI as a string
+uri = UrlParser.parse('http://example.com/')
+uri.raw
+#=> "http://example.com/"
+# Compare URIs
+# Taking into account the scheme:
+uri = UrlParser.parse('http://example.com/a/../?')
+uri == 'http://example.com/'
+#=> true
+uri == 'https://example.com/'
+#=> false
+# Ignoring the scheme:
+uri =~ 'https://example.com/'
+#=> true
+# UrlParser::URI#valid? - checks if URI is absolute and domain is valid
+uri = UrlParser.parse('http://example.qqq/')
+uri.valid?
+#=> false
+```
+## Configuration
+### embedded_params
+Set the params the unembed parser uses to search for embedded URIs. Default is `[ 'u', 'url' ]`. Set to an empty array to disable unembedding.
+```ruby
+UrlParser.configure do |config|
+  config.embedded_params = [ 'ref' ]
+end
+uri = UrlParser.unembed('https://www.upwork.com/leaving?ref=https%3A%2F%2Fwww.example.com')
+uri.to_s
+#=> "https://www.example.com/"
+```
+### default_scheme
+Set a default scheme if one is not present. Can also be set to nil if there should not be a default scheme. Default is `'http'`.
+```ruby
+UrlParser.configure do |config|
+  config.default_scheme = 'https'
+end
+uri = UrlParser.parse('example.com')
+uri.to_s
+#=> "https://example.com/"
+```
+### scheme_map
+Replace scheme keys in the 'map' with the corresponding value. Useful for replacing invalid or outdated schemes. Default is an empty hash.
+```ruby
+UrlParser.configure do |config|
+  config.scheme_map = { 'feed' => 'http' }
+end
+uri = UrlParser.parse('feed://feeds.feedburner.com/YourBlog')
+uri.to_s
+#=> "http://feeds.feedburner.com/YourBlog"
+```
+## TODO
+* Extract URIs from text
+* Enable custom rules for normalization, canonicalization, escaping, and extraction
 ## Contributing

data/Rakefile CHANGED

@@ -1,2 +1,7 @@
 require "bundler/gem_tasks"
+require "rspec/core/rake_task"
+RSpec::Core::RakeTask.new(:spec)
+task :default => :spec