RubyGems - postrank-uri - Versions diffs - 1.0.0 - Mend

postrank-uri 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

data/Gemfile ADDED

@@ -0,0 +1,3 @@
+source "http://rubygems.org"
+gemspec

data/README.md ADDED

@@ -0,0 +1,40 @@
+# PostRank URI
+A collection of convenience methods (Ruby 1.8 & Ruby 1.9) for extracting, (un)escaping, normalizing, and canonicalizing URIs. At PostRank we process over 20M URI-associated activities each day. We need to reliably extract URIs from a variety of text formats, handle the numerous and creative ways users escape and unescape their URIs, normalize the resulting URIs, and finally apply a set of custom canonicalization rules so that we can cross-reference when users are talking about the same URL.
+In a nutshell, we need to make sure that creative cases like the ones below all resolve to the same URI:
+ - http://igvita.com/
+ - http://igvita.com///
+ - http://igvita.com/../?#
+ - http://igvita.com/a/../?
+ - http://igvita.com/a/../?utm_source%3Danalytics
+ - ... and the list goes on - check the specs.
+## API
+- **PostRank::URI.extract(text)** - Detect URIs in text, discard bad TLDs
+- **PostRank::URI.clean(uri)** - Unescape, normalize, apply c18n filters - 95% use case.
+- **PostRank::URI.normalize(uri)** - Apply RFC normalization rules, discard extra path characters, drop anchors
+- **PostRank::URI.unescape(uri)** - Unescape URI entities, handle +/%20's, etc
+- **PostRank::URI.escape(uri)** - Escape URI
+## Example
+    >> PostRank::URI.extract('some random text with http://link.to somecanadiansite.ca')
+    [
+        [0] "http://link.to/",
+        [1] "http://somecanadiansite.ca/"
+    ]
+    >> PostRank::URI.clean('link.to?a=b&utm_source=FeedBurner#stuff')
+    "http://link.to/?a=b"
+## C18N
+As part of URI canonicalization, the library removes common tracking parameters from Google Analytics and several other providers. Beyond that, host-specific rules are also applied. For example, nytimes.com likes to add a 'partner' query parameter for tracking purposes that has no effect on the content, so it is removed from the URI. For the full list, see the c18n.yml file.
+Detecting "duplicate URLs" is a hard problem to solve (expensive in all senses), so instead we are compiling a manually assembled database. If you find cases that are missing, please report them or send us a pull request!
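As a rough illustration of what the README's list of equivalent URLs means, the sketch below collapses those five variants using only Ruby's standard `uri` library. It is not the gem's implementation (which relies on Addressable and c18n.yml). The helper name `sketch_clean` and the shortened `TRACKING_PARAMS` list are assumptions made for this example, and it only handles absolute http(s) URLs.

```ruby
require 'uri'

# Hypothetical, shortened list; the gem reads its list from c18n.yml.
TRACKING_PARAMS = %w[utm_source utm_medium utm_term utm_content utm_campaign].freeze

# Assumed helper name; a stdlib-only approximation of unescape + c18n + normalize.
def sketch_clean(url)
  # Decode percent-escapes first so "utm_source%3Danalytics" becomes key=value.
  u = URI.parse(URI.decode_www_form_component(url))

  # Resolve "." and ".." segments and collapse repeated slashes.
  segments = []
  u.path.split('/').each do |seg|
    next if seg.empty? || seg == '.'
    seg == '..' ? segments.pop : segments.push(seg)
  end
  trailing = u.path.match?(%r{/\.{0,2}\z})
  path = '/' + segments.join('/')
  path << '/' if trailing && !segments.empty?
  u.path = path

  # Drop tracking parameters, then drop the query entirely if nothing is left.
  pairs = URI.decode_www_form(u.query.to_s).reject { |key, _| TRACKING_PARAMS.include?(key) }
  u.query = pairs.empty? ? nil : URI.encode_www_form(pairs)

  u.fragment = nil                       # anchors are discarded
  u.host = u.host.downcase if u.host
  u.to_s
end

VARIANTS = [
  'http://igvita.com/',
  'http://igvita.com///',
  'http://igvita.com/../?#',
  'http://igvita.com/a/../?',
  'http://igvita.com/a/../?utm_source%3Danalytics'
].freeze

puts VARIANTS.map { |v| sketch_clean(v) }.uniq.inspect
```

Decoding the whole URL before parsing is a shortcut that works for these inputs; it would fail on URLs whose decoded form contains spaces or other characters the stdlib parser rejects.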

data/Rakefile ADDED

@@ -0,0 +1,9 @@
+require 'bundler'
+Bundler::GemHelper.install_tasks
+require 'rspec'
+require 'rspec/core/rake_task'
+RSpec::Core::RakeTask.new do |t|
+  t.rspec_opts = '--color'
+end

data/lib/postrank-uri.rb ADDED

@@ -0,0 +1,126 @@
+# -*- encoding: utf-8 -*-
+require 'addressable/uri'
+require 'domainatrix'
+require 'yaml'
+module PostRank
+  module URI
+    c18ndb = YAML.load_file(File.dirname(__FILE__) + '/postrank-uri/c18n.yml')
+    C18N = {}
+    C18N[:global] = c18ndb[:all].freeze
+    C18N[:hosts]  = c18ndb[:hosts].inject({}) {|h,(k,v)| h[/#{Regexp.escape(k)}$/.freeze] = v; h}
+    URIREGEX = {}
+    URIREGEX[:protocol] = /https?:\/\//i
+    URIREGEX[:valid_preceding_chars] = /(?:|\.|[^-\/"':!=A-Z0-9_@＠]|^|\:)/i
+    URIREGEX[:valid_domain] = /(?:[^[:punct:]\s][\.-](?=[^[:punct:]\s])|[^[:punct:]\s]){1,}\.[a-z]{2,}(?::[0-9]+)?/i
+    URIREGEX[:valid_general_url_path_chars] = /[a-z0-9!\*';:=\+\,\$\/%#\[\]\-_~]/i
+    # Allow URL paths to contain balanced parens
+    #  1. Used in Wikipedia URLs like /Primer_(film)
+    #  2. Used in IIS sessions like /S(dfd346)/
+    URIREGEX[:wikipedia_disambiguation] = /(?:\(#{URIREGEX[:valid_general_url_path_chars]}+\))/i
+    # Allow @ in a url, but only in the middle. Catch things like http://example.com/@user
+    URIREGEX[:valid_url_path_chars] = /(?:
+      #{URIREGEX[:wikipedia_disambiguation]}|
+      @#{URIREGEX[:valid_general_url_path_chars]}+\/|
+      [\.,]#{URIREGEX[:valid_general_url_path_chars]}+|
+      #{URIREGEX[:valid_general_url_path_chars]}+
+    )/ix
+    # Valid end-of-path characters (so /foo. does not gobble the period).
+    #   1. Allow =&# for empty URL parameters and other URL-join artifacts
+    URIREGEX[:valid_url_path_ending_chars] = /[a-z0-9=_#\/\+\-]|#{URIREGEX[:wikipedia_disambiguation]}/io
+    URIREGEX[:valid_url_query_chars] = /[a-z0-9!\*'\(\);:&=\+\$\/%#\[\]\-_\.,~]/i
+    URIREGEX[:valid_url_query_ending_chars] = /[a-z0-9_&=#\/]/i
+    URIREGEX[:valid_url] = %r{
+          (                                               #   $1 total match
+            (#{URIREGEX[:valid_preceding_chars]})         #   $2 Preceding character
+            (                                             #   $3 URL
+              (https?:\/\/)?                              #   $4 Protocol
+              (#{URIREGEX[:valid_domain]})                #   $5 Domain(s) and optional port number
+              (/
+                (?:
+                  # 1+ path chars and a valid last char
+                  #{URIREGEX[:valid_url_path_chars]}+#{URIREGEX[:valid_url_path_ending_chars]}|
+                  # Optional last char to handle /@foo/ case
+                  #{URIREGEX[:valid_url_path_chars]}+#{URIREGEX[:valid_url_path_ending_chars]}?|
+                  # Just a # case
+                  #{URIREGEX[:valid_url_path_ending_chars]}
+                )?
+              )?                                          #   $6 URL Path and anchor
+              # $7 Query String
+              (\?#{URIREGEX[:valid_url_query_chars]}*#{URIREGEX[:valid_url_query_ending_chars]})?
+            )
+          )
+        }iox;
+    URIREGEX[:escape]   = /([^ a-zA-Z0-9_.-]+)/x
+    URIREGEX[:unescape] = /((?:%[0-9a-fA-F]{2})+)/x
+    URIREGEX.each_pair{|k,v| v.freeze }
+    def self.extract(text)
+      return [] if !text
+      urls = []
+      text.to_s.scan(URIREGEX[:valid_url]) do |all, before, url, protocol, domain, path, query|
+        begin
+          url = clean(url).to_s
+          Domainatrix.parse(url)
+          urls.push url
+        rescue NoMethodError
+          # Parsing failed (NoMethodError from clean or Domainatrix.parse); drop this candidate
+        end
+      end
+      urls.compact
+    end
+    def self.escape(uri)
+      uri.gsub(URIREGEX[:escape]) do
+        '%' + $1.unpack('H2' * $1.size).join('%').upcase
+      end.gsub(' ','%20')
+    end
+    def self.unescape(uri)
+      uri.tr('+', ' ').gsub(URIREGEX[:unescape]) do
+        [$1.delete('%')].pack('H*')
+      end
+    end
+    def self.clean(uri)
+      normalize(c18n(unescape(uri))).to_s
+    end
+    def self.normalize(uri)
+      u = parse(uri)
+      u.path = u.path.squeeze('/')
+      u.query = nil if u.query && u.query.empty?
+      u.fragment = nil
+      u
+    end
+    def self.c18n(uri)
+      u = parse(uri)
+      if q = u.query_values(:notation => :flat_array)
+        q.delete_if { |k,v| C18N[:global].include?(k) }
+        q.delete_if { |k,v| C18N[:hosts].find {|r,p| u.host =~ r && p.include?(k) } }
+      end
+      u.query_values = q
+      u
+    end
+    def self.parse(uri)
+      return uri if uri.is_a? Addressable::URI
+      uri = uri.index(URIREGEX[:protocol]) == 0 ? uri : "http://#{uri}"
+      Addressable::URI.parse(uri).normalize
+    end
+  end
+end
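The escape and unescape helpers in the file above can be exercised without Addressable or Domainatrix. The sketch below copies the two regular expressions and re-implements both methods as free functions named `sketch_escape` and `sketch_unescape` (names assumed for this example). Using `bytesize` and `force_encoding` is a deliberate departure from the original that targets Ruby 1.9 and later, where `String#size` counts characters rather than bytes.

```ruby
# Same patterns as URIREGEX[:escape] and URIREGEX[:unescape] in the file above.
ESCAPE_RE   = /([^ a-zA-Z0-9_.-]+)/
UNESCAPE_RE = /((?:%[0-9a-fA-F]{2})+)/

# Percent-encode every run of unsafe characters, then encode spaces as %20.
def sketch_escape(str)
  str.gsub(ESCAPE_RE) do
    # One 'H2' directive per byte, so multibyte characters are fully encoded.
    '%' + $1.unpack('H2' * $1.bytesize).join('%').upcase
  end.gsub(' ', '%20')
end

# Treat '+' as a space, then turn each run of %XX escapes back into bytes.
def sketch_unescape(str)
  str.tr('+', ' ').gsub(UNESCAPE_RE) do
    # pack returns a binary string; relabel it as UTF-8 (an assumption of this sketch).
    [$1.delete('%')].pack('H*').force_encoding(Encoding::UTF_8)
  end
end

puts sketch_escape('id= 1')
puts sketch_unescape('id%3D+1')
```

Because the unescape pattern captures a whole run of consecutive escapes, a multibyte character such as `%E3%81%A8` is decoded in one step rather than byte by byte.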

data/lib/postrank-uri/c18n.yml ADDED

@@ -0,0 +1,37 @@
+---
+:all:
+- utm_source    # Google Analytics: campaign source
+- utm_medium    # Google Analytics: campaign medium
+- utm_term      # Google Analytics: campaign term
+- utm_content   # Google Analytics: campaign content
+- utm_campaign  # Google Analytics: campaign name
+- sms_ss        # addthis.com tracker
+- awesm         # awe.sm tracker
+:hosts:
+  nytimes.com:
+  - partner
+  - emc
+  - _r
+  washingtonpost.com:
+  - nav
+  - wprss
+  cnn.com:
+  - eref
+  latimes.com:
+  - track
+  usatoday.com:
+  - csp
+  economist.com:
+  - fsrc
+  espn.go.com:
+  - campaign
+  - source
+  dw-world.de:
+  - maca
+  repubblica.it:
+  - rss
+  welt.de:
+  - wtmc
+  usatoday.com:
+  - csp
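To make the shape of this rules file concrete, here is a minimal sketch of how a file like it could drive parameter removal using only the standard library. The inline `RULES_YAML` excerpt, the constant names, and the helper `strip_tracking` are assumptions for illustration; the gem itself does this through Addressable's `query_values`, and it anchors host patterns with `$` where this sketch uses `\z`.

```ruby
require 'uri'
require 'yaml'

# Abbreviated excerpt in the same layout as c18n.yml (symbol keys, string hosts).
RULES_YAML = <<~RULES
  ---
  :all:
  - utm_source
  - awesm
  :hosts:
    nytimes.com:
    - partner
    - emc
    - _r
    economist.com:
    - fsrc
RULES

# Symbol keys must be explicitly permitted when using safe_load.
rules  = YAML.safe_load(RULES_YAML, permitted_classes: [Symbol])
GLOBAL = rules[:all].freeze
# Compile each host into a suffix-matching regex, paired with its parameter list.
HOSTS  = rules[:hosts].map { |host, params| [/#{Regexp.escape(host)}\z/, params] }.freeze

# Assumed helper: remove global and host-specific tracking parameters, keeping order.
def strip_tracking(url)
  u = URI.parse(url)
  return url if u.query.nil?

  kept = URI.decode_www_form(u.query).reject do |key, _|
    GLOBAL.include?(key) ||
      HOSTS.any? { |re, params| u.host =~ re && params.include?(key) }
  end
  u.query = kept.empty? ? nil : URI.encode_www_form(kept)
  u.to_s
end

puts strip_tracking('http://www.economist.com/node/17522368?story_id=17522368&fsrc=rss')
```

Because the host patterns are plain suffix matches, a rule for `nytimes.com` also applies to any host whose name merely ends in that string, which is worth keeping in mind when adding entries.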

data/lib/postrank-uri/version.rb ADDED

@@ -0,0 +1,5 @@
+module PostRank
+  module URI
+    VERSION = "1.0.0"
+  end
+end

data/postrank-uri.gemspec ADDED

@@ -0,0 +1,25 @@
+# -*- encoding: utf-8 -*-
+$:.push File.expand_path("../lib", __FILE__)
+require "postrank-uri/version"
+Gem::Specification.new do |s|
+  s.name        = "postrank-uri"
+  s.version     = PostRank::URI::VERSION
+  s.platform    = Gem::Platform::RUBY
+  s.authors     = ["Ilya Grigorik"]
+  s.email       = ["ilya@igvita.com"]
+  s.homepage    = "http://rubygems.org/gems/postrank-uri"
+  s.summary     = "URI normalization, c18n, escaping, and extraction"
+  s.description = s.summary
+  s.rubyforge_project = "postrank-uri"
+  s.add_dependency "addressable"
+  s.add_dependency "domainatrix"
+  s.add_development_dependency "rspec"
+  s.files         = `git ls-files`.split("\n")
+  s.test_files    = `git ls-files -- {test,spec,features}/*`.split("\n")
+  s.executables   = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
+  s.require_paths = ["lib"]
+end

data/spec/c18n_hosts.yml ADDED

@@ -0,0 +1,39 @@
+---
+- - http://www.nytimes.com/2010/12/16/world/europe/16russia.html?_r=1&partner=rss&emc=rss
+  - http://www.nytimes.com/2010/12/16/world/europe/16russia.html
+- - http://dotearth.blogs.nytimes.com/2010/12/14/beyond-political-science/?partner=rss&emc=rss
+  - http://dotearth.blogs.nytimes.com/2010/12/14/beyond-political-science/
+- - http://www.washingtonpost.com/wp-dyn/content/article/2010/12/14/AR2010121406045.html?nav=rss_email/components
+  - http://www.washingtonpost.com/wp-dyn/content/article/2010/12/14/AR2010121406045.html
+- - http://www.washingtonpost.com/wp-dyn/content/article/2010/12/14/AR2010121407704.html?wprss=rss_politics
+  - http://www.washingtonpost.com/wp-dyn/content/article/2010/12/14/AR2010121407704.html
+- - http://edition.cnn.com/2010/US/12/14/afghanistan.review/index.html?eref=edition
+  - http://edition.cnn.com/2010/US/12/14/afghanistan.review/index.html
+- - http://www.latimes.com/news/politics/la-na-steele-rnc-20101214,0,6423667.story?track=rss&utm_source=feedburner
+  - http://www.latimes.com/news/politics/la-na-steele-rnc-20101214,0,6423667.story
+- - http://www.usatoday.com/sports/baseball/2010-12-14-reggie-jackson-yankees-baby-boomers_N.htm?csp=34sports
+  - http://www.usatoday.com/sports/baseball/2010-12-14-reggie-jackson-yankees-baby-boomers_N.htm
+- - http://www.economist.com/node/17522368?story_id=17522368&fsrc=rss
+  - http://www.economist.com/node/17522368?story_id=17522368
+- - http://sports.espn.go.com/dallas/mlb/news/story?id=5919388&campaign=rss&source=MLBHeadlines
+  - http://sports.espn.go.com/dallas/mlb/news/story?id=5919388
+- - http://www.dw-world.de/dw/article/0,,6330472,00.html?maca=en-rss-en-all-1573-rdf
+  - http://www.dw-world.de/dw/article/0,,6330472,00.html
+- - http://www.repubblica.it/rubriche/il-caso-del-giorno/2010/12/13/news/riscossa_aeffe-10153565/?rss
+  - http://www.repubblica.it/rubriche/il-caso-del-giorno/2010/12/13/news/riscossa_aeffe-10153565/
+- - http://www.welt.de/sport/Der-Hoellenritt-des-Fussball-Profis-Jean-Marc-Bosman.html?wtmc=RSS.Sport.Fussball
+  - http://www.welt.de/sport/Der-Hoellenritt-des-Fussball-Profis-Jean-Marc-Bosman.html
+- - http://www.usatoday.com/life/television/news/2011-01-19-race19_ST_N.htm?csp=34life
+  - http://www.usatoday.com/life/television/news/2011-01-19-race19_ST_N.htm

data/spec/helper.rb ADDED

@@ -0,0 +1,4 @@
+require 'bundler'
+Bundler.setup
+require File.expand_path('../../lib/postrank-uri', __FILE__)

data/spec/postrank-uri_spec.rb ADDED

@@ -0,0 +1,186 @@
+# -*- encoding: utf-8 -*-
+require 'helper'
+describe PostRank::URI do
+  let(:igvita) { 'http://igvita.com/' }
+  context "escaping" do
+    it "should escape PostRank::URI string" do
+      PostRank::URI.escape('id=1').should == 'id%3D1'
+    end
+    it "should escape spaces as %20's" do
+      PostRank::URI.escape('id= 1').should match('%20')
+    end
+  end
+  context "unescape" do
+    it "should unescape PostRank::URI" do
+      PostRank::URI.unescape(PostRank::URI.escape('id=1')).should == 'id=1'
+    end
+    it "should unescape PostRank::URI with spaces" do
+      PostRank::URI.unescape(PostRank::URI.escape('id= 1')).should == 'id= 1'
+    end
+    context "accept improperly escaped PostRank::URI strings" do
+      # See http://tools.ietf.org/html/rfc3986#section-2.3
+      it "should unescape PostRank::URI with spaces encoded as '+'" do
+        PostRank::URI.unescape('id=+1').should == 'id= 1'
+      end
+      it "should unescape PostRank::URI with an escaped '=' and spaces encoded as '+'" do
+        PostRank::URI.unescape('id%3D+1').should == 'id= 1'
+      end
+      it "should unescape PostRank::URI with spaces encoded as %20" do
+        PostRank::URI.unescape('id=%201').should == 'id= 1'
+      end
+    end
+  end
+  context "normalize" do
+    def n(uri)
+      PostRank::URI.normalize(uri).to_s
+    end
+    it "should normalize paths in PostRank::URIs" do
+      n('http://igvita.com/').should == igvita
+      n('http://igvita.com').to_s.should == igvita
+      n('http://igvita.com///').should == igvita
+      n('http://igvita.com/../').should == igvita
+      n('http://igvita.com/a/b/../../').should == igvita
+      n('http://igvita.com/a/b/../..').should == igvita
+    end
+    it "should normalize query strings in PostRank::URIs" do
+      n('http://igvita.com/?').should == igvita
+      n('http://igvita.com?').should == igvita
+      n('http://igvita.com/a/../?').should == igvita
+    end
+    it "should normalize anchors in PostRank::URIs" do
+      n('http://igvita.com#test').should == igvita
+      n('http://igvita.com#test#test').should == igvita
+      n('http://igvita.com/a/../?#test').should == igvita
+    end
+    it "should clean whitespace in PostRank::URIs" do
+      n('http://igvita.com/a/../?  ').should == igvita
+      n('http://igvita.com/a/../? #test').should == igvita
+      n('http://igvita.com/ /../').should == igvita
+    end
+    it "should default to http scheme if missing" do
+      n('igvita.com').should == igvita
+      n('https://test.com/').to_s.should == 'https://test.com/'
+    end
+    it "should downcase hostname" do
+      n('IGVITA.COM').should == igvita
+      n('IGVITA.COM/ABC').should == (igvita + "ABC")
+    end
+  end
+  context "canonicalization" do
+    def c(uri)
+      PostRank::URI.c18n(uri).to_s
+    end
+    context "query parameters" do
+      it "should handle nested parameters" do
+        c('igvita.com/?id=a&utm_source=a').should == 'http://igvita.com/?id=a'
+      end
+      it "should preserve order of parameters" do
+        url = 'http://a.com/?'+('a'..'z').to_a.shuffle.map {|e| "#{e}=#{e}"}.join("&")
+        c(url).should == url
+      end
+      it "should remove Google Analytics parameters" do
+        c('igvita.com/?id=a&utm_source=a').should == 'http://igvita.com/?id=a'
+        c('igvita.com/?id=a&utm_source=a&utm_valid').should == 'http://igvita.com/?id=a&utm_valid'
+      end
+      it "should remove awesm/sms parameters" do
+        c('igvita.com/?id=a&utm_source=a&awesm=b').should == 'http://igvita.com/?id=a'
+        c('igvita.com/?id=a&sms_ss=a').should == 'http://igvita.com/?id=a'
+      end
+    end
+  end
+  context "clean" do
+    def c(uri)
+      PostRank::URI.clean(uri)
+    end
+    it "should unescape, c18n and normalize" do
+      c('http://igvita.com/?id=1').should == 'http://igvita.com/?id=1'
+      c('igvita.com/?id=1').should == 'http://igvita.com/?id=1'
+      c('http://igvita.com/?id= 1').should == 'http://igvita.com/?id=%201'
+      c('http://igvita.com/?id=+1').should == 'http://igvita.com/?id=%201'
+      c('http://igvita.com/?id%3D%201').should == 'http://igvita.com/?id=%201'
+      c('igvita.com/a/..?id=1&utm_source=a&awesm=b#c').should == 'http://igvita.com/?id=1'
+      c('igvita.com?id=<>').should == 'http://igvita.com/?id=%3C%3E'
+      c('igvita.com?id="').should == 'http://igvita.com/?id=%22'
+    end
+    it "should clean host specific parameters" do
+      YAML.load_file('spec/c18n_hosts.yml').each do |orig, clean|
+        c(orig).should == clean
+      end
+    end
+  end
+  context "extract" do
+    def e(text)
+      PostRank::URI.extract(text)
+    end
+    context "TLDs" do
+      it "should not pick up bad grammar as a domain name and think it has a link" do
+        e("yah.lets").should be_empty
+      end
+      it "should not pick up bad TLDs" do
+        e('stuff.zz a.b.c d.zq').should be_empty
+      end
+    end
+    it "should handle a URL that comes after text without a space" do
+      e("text:http://spn.tw/tfnLT").should include("http://spn.tw/tfnLT")
+      e("text;http://spn.tw/tfnLT").should include("http://spn.tw/tfnLT")
+      e("text.http://spn.tw/tfnLT").should include("http://spn.tw/tfnLT")
+      e("text-http://spn.tw/tfnLT").should include("http://spn.tw/tfnLT")
+    end
+    it "should not pick up anything on or after the first . in the path of a URL with a shortener domain" do
+      e("http://bit.ly/9cJ2mz......if ur pickin up anythign here, u FAIL.").should == ["http://bit.ly/9cJ2mz"]
+    end
+    it "should pick up URLs without protocol" do
+      u = e('abc.com abc.co')
+      u.should include('http://abc.com/')
+      u.should include('http://abc.co/')
+    end
+    context "multibyte characters" do
+      it "should stop extracting URLs at the full-width CJK space character" do
+        e("http://www.youtube.com/watch?v=w_j4Lda25jA　　とんかつ定食").should == ["http://www.youtube.com/watch?v=w_j4Lda25jA"]
+      end
+    end
+  end
+end
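Several of the extraction specs above can be approximated with a much simpler pattern. The sketch below is an assumption-laden stand-in for `URIREGEX[:valid_url]`: it only recognizes URLs with an explicit http(s) scheme, performs no TLD validation, and does not clean the result. The names `SIMPLE_URL` and `sketch_extract` are made up for this example.

```ruby
# Simplified URL pattern (not the gem's regex).
SIMPLE_URL = %r{
  https?://                 # scheme is required in this sketch
  [a-z0-9.-]+\.[a-z]{2,}    # host ending in an alphabetic TLD
  (?:/[a-z0-9/_-]*)?        # optional path; dots are excluded so trailing periods are not swallowed
  (?:\?[a-z0-9_&=]*)?       # optional query string
}ix

# Return every substring of text that matches the simplified pattern.
def sketch_extract(text)
  text.to_s.scan(SIMPLE_URL)
end

puts sketch_extract('text:http://spn.tw/tfnLT').inspect
puts sketch_extract('http://bit.ly/9cJ2mz......if ur pickin up anythign here, u FAIL.').inspect
```

Since the character classes are ASCII-only, matching stops at the first character outside them, which is also why a full-width CJK space ends the URL in this sketch.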

metadata ADDED

@@ -0,0 +1,114 @@
+--- !ruby/object:Gem::Specification
+name: postrank-uri
+version: !ruby/object:Gem::Version
+  prerelease: false
+  segments:
+  - 1
+  - 0
+  - 0
+  version: 1.0.0
+platform: ruby
+authors:
+- Ilya Grigorik
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2011-01-20 00:00:00 -05:00
+default_executable:
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: addressable
+  prerelease: false
+  requirement: &id001 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        segments:
+        - 0
+        version: "0"
+  type: :runtime
+  version_requirements: *id001
+- !ruby/object:Gem::Dependency
+  name: domainatrix
+  prerelease: false
+  requirement: &id002 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        segments:
+        - 0
+        version: "0"
+  type: :runtime
+  version_requirements: *id002
+- !ruby/object:Gem::Dependency
+  name: rspec
+  prerelease: false
+  requirement: &id003 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        segments:
+        - 0
+        version: "0"
+  type: :development
+  version_requirements: *id003
+description: URI normalization, c18n, escaping, and extraction
+email:
+- ilya@igvita.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- Gemfile
+- README.md
+- Rakefile
+- lib/postrank-uri.rb
+- lib/postrank-uri/c18n.yml
+- lib/postrank-uri/version.rb
+- postrank-uri.gemspec
+- spec/c18n_hosts.yml
+- spec/helper.rb
+- spec/postrank-uri_spec.rb
+has_rdoc: true
+homepage: http://rubygems.org/gems/postrank-uri
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      segments:
+      - 0
+      version: "0"
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      segments:
+      - 0
+      version: "0"
+requirements: []
+rubyforge_project: postrank-uri
+rubygems_version: 1.3.7
+signing_key:
+specification_version: 3
+summary: URI normalization, c18n, escaping, and extraction
+test_files:
+- spec/c18n_hosts.yml
+- spec/helper.rb
+- spec/postrank-uri_spec.rb