metapage 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 9bae1882c2b1169f6554b9b26de750fe67087f09
4
+ data.tar.gz: 10e23f034e67140017cd0d828d84e0f38466f66d
5
+ SHA512:
6
+ metadata.gz: 4a84d1afcda95d171349f35c4fa6af500d488cbfebd4a516ebfedfdd058586988a0ea4fbac9775d5f8c36d05cd8eec8bba01d549c7a89f7ae4a2f1e80d7b9b6b
7
+ data.tar.gz: fcd70936980172149326f7856656daf7873ac05145c862924c50568e6c5bc387a669cead087cea734b5668015a4c23917c5cbc0ec07a66dadb619ebe72d195db
data/.gitignore ADDED
@@ -0,0 +1,10 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /spec/fixtures/vcr_cassettes
10
+ /tmp/
data/.rspec ADDED
@@ -0,0 +1,2 @@
1
+ --format documentation
2
+ --color
data/.travis.yml ADDED
@@ -0,0 +1,4 @@
1
+ language: ruby
2
+ rvm:
3
+ - 2.2.3
4
+ before_install: gem install bundler -v 1.10.6
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in metapage.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2015 Christoph Olszowka
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,54 @@
1
+ # Metapage
2
+
3
+ [![Build Status](https://travis-ci.org/colszowka/metapage.svg)](https://travis-ci.org/colszowka/metapage)
4
+
5
+ A tiny class for extracting title, description and some further information from given urls using [open graph](http://www.ogp.me) and regular meta tags.
6
+
7
+ **Why?** For example this can be used for enriching urls submitted in a chat application.
8
+
9
+ ## Features
10
+
11
+ * Fetch open graph info for a page, with fallback to regular meta tags to give *something* for most HTML urls
12
+ * Bulk-fetch info for any urls contained in a given text snippet
13
+ * Checks if the given URL's host resolves via [Google DNS](https://developers.google.com/speed/public-dns/) and
14
+ is not on a [private network](https://en.wikipedia.org/wiki/Private_network) subnet to slow down clever people
15
+ from entering private urls like `http://localhost:3000/secret` to explore your network.
16
+
17
+ ## Installing
18
+
19
+ Add `gem 'metapage'` to your `Gemfile` or `gem install metapage` on your command line.
20
+
21
+ ## Usage
22
+
23
+ Fetch a specific URL. Returns `nil` if the content is not html or loading fails due to invalid url, http response or timeout.
24
+
25
+ Metapage.fetch('https://github.com/colszowka/simplecov').to_h
26
+ {:title=>"colszowka/simplecov",
27
+ :description=>"simplecov - Code coverage for Ruby 1.9+ with a powerful configuration library and automatic merging of coverage across test suites",
28
+ :image_url=>"https://avatars0.githubusercontent.com/u/13972?v=3&s=400",
29
+ :type=>"object",
30
+ :canonical_url=>"https://github.com/colszowka/simplecov",
31
+ :site_name=>"GitHub"}
32
+
33
+ Extract urls from a given string and fetch the metadata for them. Only returns successfully retrieved results.
34
+
35
+ msg = "The text is http://github.com/colszowka/simplecov and links to http://hamburg.onruby.de?foo=bar but also to invalid http://fooooooooonoexist.com"
36
+ Metapage.extract(msg).map(&:title)
37
+ #=> ['colszowka/simplecov', 'Hamburg on Ruby - Heimathafen der Hamburger Ruby Community']
38
+
39
+ Both `Metapage.fetch` and `Metapage.extract` have equivalent bang methods `fetch!` and `extract!` that will bubble HTTP or parsing exceptions instead of returning
40
+ nil or silently ignoring invalid urls.
41
+
42
+ ## Development
43
+
44
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
45
+
46
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
47
+
48
+ ## Contributing
49
+
50
+ 1. Fork it ( https://github.com/colszowka/metapage/fork )
51
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
52
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
53
+ 4. Push to the branch (`git push origin my-new-feature`)
54
+ 5. Create a new Pull Request
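The README's third feature (rejecting hosts that resolve to a private network) can be illustrated without any network access. The following is a minimal sketch, not the gem's own code: `private_address?` and `PRIVATE_SUBNETS` are hypothetical names, and the CIDR list is copied from `private_subnets` in `lib/metapage.rb`. It assumes the address has already been resolved to an IPv4 string.

```ruby
require 'ipaddr'

# Loopback and private IPv4 ranges, same list as in lib/metapage.rb.
PRIVATE_SUBNETS = %w[127.0.0.0/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16]
                  .map { |cidr| IPAddr.new(cidr) }
                  .freeze

# Hypothetical helper (not part of the gem): true when the given
# IPv4 address string falls inside one of the ranges above.
def private_address?(address)
  ip = IPAddr.new(address)
  PRIVATE_SUBNETS.any? { |net| net.include?(ip) }
end

puts private_address?('192.168.1.20') # prints "true"
puts private_address?('8.8.8.8')      # prints "false"
```

Note that this list only covers the four ranges named in the source; other special-purpose ranges (for example link-local addresses) are outside what this sketch checks.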
data/Rakefile ADDED
@@ -0,0 +1,18 @@
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+
6
+ task :default => :spec
7
+
8
+ task :env do
9
+ require 'metapage'
10
+ end
11
+
12
+ desc "Generates examples for the readme on the fly"
13
+ task examples: :env do
14
+ require 'pp'
15
+ cmd = "pp Metapage.fetch('https://github.com/colszowka/simplecov').to_h"
16
+ puts cmd
17
+ eval cmd
18
+ end
data/bin/console ADDED
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "metapage"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start
data/bin/setup ADDED
@@ -0,0 +1,7 @@
1
+ #!/bin/bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+
5
+ bundle install
6
+
7
+ # Do any other automated setup that you need to do here
data/lib/metapage/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module Metapage
2
+ VERSION = "0.1.0"
3
+ end
data/lib/metapage.rb ADDED
@@ -0,0 +1,189 @@
1
+ require "metapage/version"
2
+ require 'nokogiri'
3
+ require 'httpclient'
4
+ require 'uri'
5
+ require 'resolv'
6
+ require 'ipaddr'
7
+ require 'digest/sha1'
8
+
9
+ module Metapage
10
+ class ResolveError < StandardError; end;
11
+ class HTTPResponseError < StandardError; end;
12
+ class ContentTypeError < StandardError; end;
13
+ ERROR_CLASSES = [ResolveError, HTTPResponseError, ContentTypeError]
14
+
15
+ class << self
16
+ def fetch(url)
17
+ fetch! url
18
+ rescue *ERROR_CLASSES => err
19
+ nil
20
+ end
21
+
22
+ def fetch!(url)
23
+ Metadata.new(url)
24
+ end
25
+
26
+ def extract(text)
27
+ URI.extract(text, ['http', 'https']).map {|url| fetch(url.gsub(/[\.\,]+\Z/, '')) }.compact
28
+ end
29
+
30
+ def extract!(text)
31
+ URI.extract(text, ['http', 'https']).map {|url| fetch!(url.gsub(/[\.\,]+\Z/, '')) }.compact
32
+ end
33
+ end
34
+
35
+ class Metadata
36
+ attr_reader :url
37
+ def initialize(url)
38
+ @url = url
39
+ title
40
+ end
41
+
42
+ def title
43
+ @title ||= metatag_content('og:title') || html_content('title')
44
+ end
45
+
46
+ def description
47
+ @description ||= metatag_content('og:description') || metatag_content('description')
48
+ end
49
+
50
+ def image_url
51
+ # Fallback to apple-touch-icon, fluid-icon, ms-tileicon etc
52
+ @image_url ||= metatag_content('og:image:secure_url') || metatag_content('og:image') || link_rel('apple-touch-icon-precomposed')
53
+ end
54
+
55
+ def type
56
+ @type ||= metatag_content('og:type') || 'website'
57
+ end
58
+
59
+ def canonical_url
60
+ @canonical_url ||= metatag_content('og:url') || link_rel('canonical') || url
61
+ end
62
+
63
+ def id
64
+ if canonical_url
65
+ @id ||= Digest::SHA1.hexdigest(canonical_url)
66
+ end
67
+ end
68
+
69
+ def site_name
70
+ @site_name ||= metatag_content('og:site_name') || host
71
+ end
72
+
73
+ def to_h
74
+ {
75
+ id: id,
76
+ title: title,
77
+ description: description,
78
+ image_url: image_url,
79
+ type: type,
80
+ canonical_url: canonical_url,
81
+ site_name: site_name
82
+ }
83
+ end
84
+
85
+ def to_json
86
+ to_h.to_json
87
+ end
88
+
89
+
90
+ private
91
+
92
+ def uri
93
+ @uri ||= URI(canonical_url)
94
+ end
95
+
96
+ def host
97
+ @host ||= uri.host
98
+ end
99
+
100
+ def scheme
101
+ @scheme ||= uri.scheme
102
+ end
103
+
104
+ def absolute_url(href)
105
+ if href.start_with?('http')
106
+ href
107
+ else
108
+ scheme + '://' + File.join(host, href)
109
+ end
110
+ end
111
+
112
+ def link_rel(rel)
113
+ if tag = doc.css('link[rel="'+rel+'"]').first
114
+ absolute_url tag['href']
115
+ else
116
+ nil
117
+ end
118
+ end
119
+
120
+ def metatag_content(tag_name)
121
+ if tag = doc.css('meta[property="'+ tag_name +'"]').first
122
+ tag["content"]
123
+ elsif tag = doc.css('meta[name="'+ tag_name +'"]').first
124
+ tag["content"]
125
+ end
126
+ end
127
+
128
+ def html_content(selector)
129
+ if tag = doc.css(selector).first
130
+ tag.text
131
+ end
132
+ end
133
+
134
+ def doc
135
+ @doc ||= Nokogiri::HTML.parse(content).tap do |doc|
136
+ raise ContentTypeError, "Document does not seem to be valid html" if doc.css('div').empty?
137
+ end
138
+ end
139
+
140
+ def content
141
+ http_response.body
142
+ end
143
+
144
+ def http_response
145
+ begin
146
+ raise Metapage::HTTPResponseError, "Invalid scheme for #{url}" unless valid_scheme?
147
+ raise Metapage::ResolveError, "Could not find any DNS records for #{url}" unless valid_dns?
148
+ rescue ArgumentError
149
+ raise Metapage::ResolveError, "Cannot parse url #{url.inspect}"
150
+ end
151
+
152
+ @http_response ||= begin
153
+ http_client.get(url, follow_redirect: true).tap do |response|
154
+ unless (200..299).include? response.status
155
+ raise Metapage::HTTPResponseError, "Invalid response status #{response.status}"
156
+ end
157
+ end
158
+ end
159
+ rescue SocketError, HTTPClient::ReceiveTimeoutError => err
160
+ raise Metapage::HTTPResponseError, err.to_s
161
+ end
162
+
163
+ def valid_scheme?
164
+ %w(http https).include? URI(url).scheme
165
+ end
166
+
167
+ def valid_dns?
168
+ dns = Resolv::DNS.new(nameserver: ['8.8.8.8', '8.8.4.4'])
169
+ address = dns.getaddress(URI(url).host).to_s
170
+ not private_subnets.any? {|net| net.include? IPAddr.new(address) }
171
+ rescue Resolv::ResolvError
172
+ false
173
+ end
174
+
175
+ def private_subnets
176
+ @private_subnets ||= ['127.0.0.0/8', '10.0.0.0/8', '172.16.0.0/12', '192.168.0.0/16'].map {|cidr| IPAddr.new cidr }
177
+ end
178
+
179
+ def http_client
180
+ @http_client ||= HTTPClient.new.tap do |http_client|
181
+ http_client.receive_timeout = 3
182
+ http_client.connect_timeout = 3
183
+ http_client.send_timeout = 3
184
+ http_client.keep_alive_timeout = 3
185
+ http_client.ssl_config.timeout = 3
186
+ end
187
+ end
188
+ end
189
+ end
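The `fetch` / `fetch!` pair in `lib/metapage.rb` follows a common Ruby pattern: the bang method raises, and the non-bang method rescues a fixed list of error classes with a splat and returns `nil`. A minimal sketch of that pattern follows. The names `load_page!` and `load_page` are hypothetical stand-ins, and the scheme check here only imitates the gem's behaviour; no HTTP request is made.

```ruby
# Stand-ins for the gem's error classes (redefined here so the sketch is self-contained).
class ResolveError < StandardError; end
class HTTPResponseError < StandardError; end
ERROR_CLASSES = [ResolveError, HTTPResponseError].freeze

# Hypothetical loader: raises for anything that is not an http(s) URL.
# Real code would resolve the host and fetch the page here.
def load_page!(url)
  unless url.start_with?('http://', 'https://')
    raise HTTPResponseError, "Invalid scheme for #{url}"
  end
  { url: url }
end

# Non-bang variant: rescue only the known error classes and return nil.
# Any other exception (for example NoMethodError) still propagates.
def load_page(url)
  load_page!(url)
rescue *ERROR_CLASSES
  nil
end

p load_page('ftp://example.com')   # prints "nil"
p load_page('https://example.com')
```

Rescuing an explicit list rather than `StandardError` keeps programming mistakes visible while still letting callers treat expected failures as "no result".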
data/metapage.gemspec ADDED
@@ -0,0 +1,31 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'metapage/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "metapage"
8
+ spec.version = Metapage::VERSION
9
+ spec.authors = ["Christoph Olszowka"]
10
+ spec.email = ["christoph at olszowka de"]
11
+
12
+ spec.summary = %q{Extract metadata about a given HTML url from open graph and regular meta tags}
13
+ spec.description = spec.summary
14
+ spec.homepage = "https://github.com/colszowka/metapage"
15
+
16
+ spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
17
+ spec.bindir = "exe"
18
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
19
+ spec.require_paths = ["lib"]
20
+
21
+ spec.add_dependency 'httpclient'
22
+ spec.add_dependency 'nokogiri'
23
+ spec.add_dependency 'mini_magick'
24
+
25
+ spec.add_development_dependency "bundler", "~> 1.10"
26
+ spec.add_development_dependency "rake", "~> 10.0"
27
+ spec.add_development_dependency "rspec"
28
+ spec.add_development_dependency "vcr"
29
+ spec.add_development_dependency "webmock"
30
+ spec.add_development_dependency "simplecov"
31
+ end
metadata ADDED
@@ -0,0 +1,184 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: metapage
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Christoph Olszowka
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2015-11-04 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: httpclient
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: '0'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ">="
25
+ - !ruby/object:Gem::Version
26
+ version: '0'
27
+ - !ruby/object:Gem::Dependency
28
+ name: nokogiri
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ">="
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: mini_magick
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - ">="
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ type: :runtime
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - ">="
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ - !ruby/object:Gem::Dependency
56
+ name: bundler
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - "~>"
60
+ - !ruby/object:Gem::Version
61
+ version: '1.10'
62
+ type: :development
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - "~>"
67
+ - !ruby/object:Gem::Version
68
+ version: '1.10'
69
+ - !ruby/object:Gem::Dependency
70
+ name: rake
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - "~>"
74
+ - !ruby/object:Gem::Version
75
+ version: '10.0'
76
+ type: :development
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - "~>"
81
+ - !ruby/object:Gem::Version
82
+ version: '10.0'
83
+ - !ruby/object:Gem::Dependency
84
+ name: rspec
85
+ requirement: !ruby/object:Gem::Requirement
86
+ requirements:
87
+ - - ">="
88
+ - !ruby/object:Gem::Version
89
+ version: '0'
90
+ type: :development
91
+ prerelease: false
92
+ version_requirements: !ruby/object:Gem::Requirement
93
+ requirements:
94
+ - - ">="
95
+ - !ruby/object:Gem::Version
96
+ version: '0'
97
+ - !ruby/object:Gem::Dependency
98
+ name: vcr
99
+ requirement: !ruby/object:Gem::Requirement
100
+ requirements:
101
+ - - ">="
102
+ - !ruby/object:Gem::Version
103
+ version: '0'
104
+ type: :development
105
+ prerelease: false
106
+ version_requirements: !ruby/object:Gem::Requirement
107
+ requirements:
108
+ - - ">="
109
+ - !ruby/object:Gem::Version
110
+ version: '0'
111
+ - !ruby/object:Gem::Dependency
112
+ name: webmock
113
+ requirement: !ruby/object:Gem::Requirement
114
+ requirements:
115
+ - - ">="
116
+ - !ruby/object:Gem::Version
117
+ version: '0'
118
+ type: :development
119
+ prerelease: false
120
+ version_requirements: !ruby/object:Gem::Requirement
121
+ requirements:
122
+ - - ">="
123
+ - !ruby/object:Gem::Version
124
+ version: '0'
125
+ - !ruby/object:Gem::Dependency
126
+ name: simplecov
127
+ requirement: !ruby/object:Gem::Requirement
128
+ requirements:
129
+ - - ">="
130
+ - !ruby/object:Gem::Version
131
+ version: '0'
132
+ type: :development
133
+ prerelease: false
134
+ version_requirements: !ruby/object:Gem::Requirement
135
+ requirements:
136
+ - - ">="
137
+ - !ruby/object:Gem::Version
138
+ version: '0'
139
+ description: Extract metadata about a given HTML url from open graph and regular meta
140
+ tags
141
+ email:
142
+ - christoph at olszowka de
143
+ executables: []
144
+ extensions: []
145
+ extra_rdoc_files: []
146
+ files:
147
+ - ".gitignore"
148
+ - ".rspec"
149
+ - ".travis.yml"
150
+ - Gemfile
151
+ - LICENSE.txt
152
+ - README.md
153
+ - Rakefile
154
+ - bin/console
155
+ - bin/setup
156
+ - lib/metapage.rb
157
+ - lib/metapage/version.rb
158
+ - metapage.gemspec
159
+ homepage: https://github.com/colszowka/metapage
160
+ licenses: []
161
+ metadata: {}
162
+ post_install_message:
163
+ rdoc_options: []
164
+ require_paths:
165
+ - lib
166
+ required_ruby_version: !ruby/object:Gem::Requirement
167
+ requirements:
168
+ - - ">="
169
+ - !ruby/object:Gem::Version
170
+ version: '0'
171
+ required_rubygems_version: !ruby/object:Gem::Requirement
172
+ requirements:
173
+ - - ">="
174
+ - !ruby/object:Gem::Version
175
+ version: '0'
176
+ requirements: []
177
+ rubyforge_project:
178
+ rubygems_version: 2.4.5.1
179
+ signing_key:
180
+ specification_version: 4
181
+ summary: Extract metadata about a given HTML url from open graph and regular meta
182
+ tags
183
+ test_files: []
184
+ has_rdoc: