metapage 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: 9bae1882c2b1169f6554b9b26de750fe67087f09
+   data.tar.gz: 10e23f034e67140017cd0d828d84e0f38466f66d
+ SHA512:
+   metadata.gz: 4a84d1afcda95d171349f35c4fa6af500d488cbfebd4a516ebfedfdd058586988a0ea4fbac9775d5f8c36d05cd8eec8bba01d549c7a89f7ae4a2f1e80d7b9b6b
+   data.tar.gz: fcd70936980172149326f7856656daf7873ac05145c862924c50568e6c5bc387a669cead087cea734b5668015a4c23917c5cbc0ec07a66dadb619ebe72d195db
data/.gitignore ADDED
@@ -0,0 +1,10 @@
+ /.bundle/
+ /.yardoc
+ /Gemfile.lock
+ /_yardoc/
+ /coverage/
+ /doc/
+ /pkg/
+ /spec/reports/
+ /spec/fixtures/vcr_cassettes
+ /tmp/
data/.rspec ADDED
@@ -0,0 +1,2 @@
+ --format documentation
+ --color
data/.travis.yml ADDED
@@ -0,0 +1,4 @@
+ language: ruby
+ rvm:
+   - 2.2.3
+ before_install: gem install bundler -v 1.10.6
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source 'https://rubygems.org'
+
+ # Specify your gem's dependencies in metapage.gemspec
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2015 Christoph Olszowka
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,54 @@
+ # Metapage
+
+ [![Build Status](https://travis-ci.org/colszowka/metapage.svg)](https://travis-ci.org/colszowka/metapage)
+
+ A tiny class for extracting the title, description and some further information from given URLs using [Open Graph](http://www.ogp.me) and regular meta tags.
+
+ **Why?** For example, this can be used for enriching URLs submitted in a chat application.
+
+ ## Features
+
+ * Fetch Open Graph info for a page, with fallback to regular meta tags to give *something* for most HTML URLs
+ * Bulk-fetch info for any URLs contained in a given text snippet
+ * Checks that the given URL's host resolves via [Google DNS](https://developers.google.com/speed/public-dns/) and
+   is not on a [private network](https://en.wikipedia.org/wiki/Private_network) subnet, to make it harder for clever people
+   to explore your network by entering private URLs like `http://localhost:3000/secret`.
+
+ ## Installing
+
+ Add `gem 'metapage'` to your `Gemfile` or run `gem install metapage` on your command line.
+
+ ## Usage
+
+ Fetch a specific URL. Returns `nil` if the content is not HTML or if loading fails because of an invalid URL, an unsuccessful HTTP response or a timeout.
+
+     Metapage.fetch('https://github.com/colszowka/simplecov').to_h
+     {:title=>"colszowka/simplecov",
+      :description=>"simplecov - Code coverage for Ruby 1.9+ with a powerful configuration library and automatic merging of coverage across test suites",
+      :image_url=>"https://avatars0.githubusercontent.com/u/13972?v=3&s=400",
+      :type=>"object",
+      :canonical_url=>"https://github.com/colszowka/simplecov",
+      :site_name=>"GitHub"}
+
+ Extract URLs from a given string and fetch the metadata for them. Only successfully retrieved results are returned.
+
+     msg = "The text is http://github.com/colszowka/simplecov and links to http://hamburg.onruby.de?foo=bar but also to invalid http://fooooooooonoexist.com"
+     Metapage.extract(msg).map(&:title)
+     #=> ['colszowka/simplecov', 'Hamburg on Ruby - Heimathafen der Hamburger Ruby Community']
+
+ Both `Metapage.fetch` and `Metapage.extract` have equivalent bang methods `fetch!` and `extract!` that raise HTTP or parsing exceptions instead of returning
+ `nil` or silently ignoring invalid URLs.
+
+ ## Development
+
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+
+ ## Contributing
+
+ 1. Fork it ( https://github.com/colszowka/metapage/fork )
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
+ 4. Push to the branch (`git push origin my-new-feature`)
+ 5. Create a new Pull Request
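The URL-finding step behind `Metapage.extract`, as the README describes it, can be approximated with the standard library alone. The following is a minimal sketch, not the gem's code: `candidate_urls` is a hypothetical helper name and the sample text is made up. It only lists the http(s) URLs and strips trailing dots and commas; it does not fetch anything. Note that `URI.extract` is marked obsolete in current Ruby versions, so a maintained project may prefer a different approach.

```ruby
require 'uri'

# Hypothetical helper (not part of metapage): list the http(s) URLs found in
# a text and strip trailing dots and commas picked up from the sentence.
def candidate_urls(text)
  URI.extract(text, %w[http https]).map { |url| url.sub(/[.,]+\z/, '') }
end

msg = "See http://example.com/a, then https://example.org/b."
puts candidate_urls(msg).inspect
```

Stripping the trailing punctuation matters because a comma or full stop directly after a URL is otherwise treated as part of its path.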
data/Rakefile ADDED
@@ -0,0 +1,18 @@
+ require "bundler/gem_tasks"
+ require "rspec/core/rake_task"
+
+ RSpec::Core::RakeTask.new(:spec)
+
+ task :default => :spec
+
+ task :env do
+   require 'metapage'
+ end
+
+ desc "Generates examples for the readme on the fly"
+ task examples: :env do
+   require 'pp'
+   cmd = "pp Metapage.fetch('https://github.com/colszowka/simplecov').to_h"
+   puts cmd
+   eval cmd
+ end
data/bin/console ADDED
@@ -0,0 +1,14 @@
+ #!/usr/bin/env ruby
+
+ require "bundler/setup"
+ require "metapage"
+
+ # You can add fixtures and/or initialization code here to make experimenting
+ # with your gem easier. You can also use a different console, if you like.
+
+ # (If you use this, don't forget to add pry to your Gemfile!)
+ # require "pry"
+ # Pry.start
+
+ require "irb"
+ IRB.start
data/bin/setup ADDED
@@ -0,0 +1,7 @@
+ #!/bin/bash
+ set -euo pipefail
+ IFS=$'\n\t'
+
+ bundle install
+
+ # Do any other automated setup that you need to do here
data/lib/metapage/version.rb ADDED
@@ -0,0 +1,3 @@
+ module Metapage
+   VERSION = "0.1.0"
+ end
data/lib/metapage.rb ADDED
@@ -0,0 +1,189 @@
+ require "metapage/version"
+ require 'nokogiri'
+ require 'httpclient'
+ require 'uri'
+ require 'resolv'
+ require 'ipaddr'
+ require 'digest/sha1'
+
+ module Metapage
+   class ResolveError < StandardError; end
+   class HTTPResponseError < StandardError; end
+   class ContentTypeError < StandardError; end
+   ERROR_CLASSES = [ResolveError, HTTPResponseError, ContentTypeError]
+
+   class << self
+     def fetch(url) # like fetch!, but returns nil on any of ERROR_CLASSES
+       fetch! url
+     rescue *ERROR_CLASSES
+       nil
+     end
+
+     def fetch!(url)
+       Metadata.new(url)
+     end
+
+     def extract(text) # strips trailing dots and commas, drops failed lookups
+       URI.extract(text, ['http', 'https']).map {|url| fetch(url.gsub(/[\.\,]+\Z/, '')) }.compact
+     end
+
+     def extract!(text)
+       URI.extract(text, ['http', 'https']).map {|url| fetch!(url.gsub(/[\.\,]+\Z/, '')) }.compact
+     end
+   end
+
+   class Metadata
+     attr_reader :url
+     def initialize(url)
+       @url = url
+       title # eagerly loads and parses the page, so errors surface here
+     end
+
+     def title
+       @title ||= metatag_content('og:title') || html_content('title')
+     end
+
+     def description
+       @description ||= metatag_content('og:description') || metatag_content('description')
+     end
+
+     def image_url
+       # Fallback to apple-touch-icon, fluid-icon, ms-tileicon etc
+       @image_url ||= metatag_content('og:image:secure_url') || metatag_content('og:image') || link_rel('apple-touch-icon-precomposed')
+     end
+
+     def type
+       @type ||= metatag_content('og:type') || 'website'
+     end
+
+     def canonical_url
+       @canonical_url ||= metatag_content('og:url') || link_rel('canonical') || url
+     end
+
+     def id
+       if canonical_url
+         @id ||= Digest::SHA1.hexdigest(canonical_url)
+       end
+     end
+
+     def site_name
+       @site_name ||= metatag_content('og:site_name') || host
+     end
+
+     def to_h
+       {
+         id: id,
+         title: title,
+         description: description,
+         image_url: image_url,
+         type: type,
+         canonical_url: canonical_url,
+         site_name: site_name
+       }
+     end
+
+     def to_json
+       to_h.to_json # relies on the json library having been loaded elsewhere
+     end
+
+
+     private
+
+     def uri
+       @uri ||= URI(canonical_url)
+     end
+
+     def host
+       @host ||= uri.host
+     end
+
+     def scheme
+       @scheme ||= uri.scheme
+     end
+
+     def absolute_url(href)
+       if href.start_with?('http')
+         href
+       else
+         scheme + '://' + File.join(host, href)
+       end
+     end
+
+     def link_rel(rel)
+       if tag = doc.css('link[rel="'+rel+'"]').first
+         absolute_url tag['href']
+       else
+         nil
+       end
+     end
+
+     def metatag_content(tag_name)
+       if tag = doc.css('meta[property="'+ tag_name +'"]').first
+         tag["content"]
+       elsif tag = doc.css('meta[name="'+ tag_name +'"]').first
+         tag["content"]
+       end
+     end
+
+     def html_content(selector)
+       if tag = doc.css(selector).first
+         tag.text
+       end
+     end
+
+     def doc
+       @doc ||= Nokogiri::HTML.parse(content).tap do |doc|
+         raise ContentTypeError, "Document does not seem to be valid html" if doc.css('div').empty?
+       end
+     end
+
+     def content
+       http_response.body
+     end
+
+     def http_response
+       begin
+         raise Metapage::HTTPResponseError, "Invalid scheme for #{url}" unless valid_scheme?
+         raise Metapage::ResolveError, "Could not find any DNS records for #{url}" unless valid_dns?
+       rescue ArgumentError, URI::InvalidURIError
+         raise Metapage::ResolveError, "Cannot parse url #{url.inspect}"
+       end
+
+       @http_response ||= begin
+         http_client.get(url, follow_redirect: true).tap do |response|
+           unless (200..299).include? response.status
+             raise Metapage::HTTPResponseError, "Invalid response status #{response.status}"
+           end
+         end
+       end
+     rescue SocketError, HTTPClient::ReceiveTimeoutError => err
+       raise Metapage::HTTPResponseError, err.to_s
+     end
+
+     def valid_scheme?
+       %w(http https).include? URI(url).scheme
+     end
+
+     def valid_dns?
+       dns = Resolv::DNS.new(nameserver: ['8.8.8.8', '8.8.4.4'])
+       address = dns.getaddress(URI(url).host).to_s
+       private_subnets.none? {|net| net.include? IPAddr.new(address) }
+     rescue Resolv::ResolvError
+       false
+     end
+
+     def private_subnets
+       @private_subnets ||= ['127.0.0.0/8', '10.0.0.0/8', '172.16.0.0/12', '192.168.0.0/16'].map {|cidr| IPAddr.new cidr }
+     end
+
+     def http_client
+       @http_client ||= HTTPClient.new.tap do |client| # memoized in an instance variable
+         client.receive_timeout = 3
+         client.connect_timeout = 3
+         client.send_timeout = 3
+         client.keep_alive_timeout = 3
+         client.ssl_config.timeout = 3
+       end
+     end
+   end
+ end
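The private-subnet part of `valid_dns?` above can be tried in isolation with the standard library. This is a minimal sketch under stated assumptions, not the gem's implementation: `PRIVATE_SUBNETS` and `private_address?` are hypothetical names, the DNS lookup via Google's resolvers is left out, and only the four IPv4 ranges listed in `private_subnets` are checked. Other non-public ranges, such as link-local `169.254.0.0/16` or IPv6 addresses, are not covered by that list.

```ruby
require 'ipaddr'

# Assumed for illustration: the same four IPv4 ranges listed in
# Metadata#private_subnets.
PRIVATE_SUBNETS = %w[127.0.0.0/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16]
                  .map { |cidr| IPAddr.new(cidr) }.freeze

# Hypothetical helper: true when the address lies inside one of the ranges.
def private_address?(address)
  ip = IPAddr.new(address)
  PRIVATE_SUBNETS.any? { |net| net.include?(ip) }
end

puts private_address?('192.168.1.20')
puts private_address?('8.8.8.8')
```

Checking the resolved address rather than the host name is what makes this useful: a public-looking host name can still resolve to an internal address.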
data/metapage.gemspec ADDED
@@ -0,0 +1,31 @@
+ # coding: utf-8
+ lib = File.expand_path('../lib', __FILE__)
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+ require 'metapage/version'
+
+ Gem::Specification.new do |spec|
+   spec.name = "metapage"
+   spec.version = Metapage::VERSION
+   spec.authors = ["Christoph Olszowka"]
+   spec.email = ["christoph at olszowka de"]
+
+   spec.summary = %q{Extract metadata about a given HTML url from open graph and regular meta tags}
+   spec.description = spec.summary
+   spec.homepage = "https://github.com/colszowka/metapage"
+
+   spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
+   spec.bindir = "exe"
+   spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
+   spec.require_paths = ["lib"]
+
+   spec.add_dependency 'httpclient'
+   spec.add_dependency 'nokogiri'
+   spec.add_dependency 'mini_magick'
+
+   spec.add_development_dependency "bundler", "~> 1.10"
+   spec.add_development_dependency "rake", "~> 10.0"
+   spec.add_development_dependency "rspec"
+   spec.add_development_dependency "vcr"
+   spec.add_development_dependency "webmock"
+   spec.add_development_dependency "simplecov"
+ end
metadata ADDED
@@ -0,0 +1,184 @@
+ --- !ruby/object:Gem::Specification
+ name: metapage
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ platform: ruby
+ authors:
+ - Christoph Olszowka
+ autorequire:
+ bindir: exe
+ cert_chain: []
+ date: 2015-11-04 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: httpclient
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: nokogiri
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: mini_magick
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: bundler
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.10'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.10'
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '10.0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '10.0'
+ - !ruby/object:Gem::Dependency
+   name: rspec
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: vcr
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: webmock
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: simplecov
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ description: Extract metadata about a given HTML url from open graph and regular meta
+   tags
+ email:
+ - christoph at olszowka de
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - ".gitignore"
+ - ".rspec"
+ - ".travis.yml"
+ - Gemfile
+ - LICENSE.txt
+ - README.md
+ - Rakefile
+ - bin/console
+ - bin/setup
+ - lib/metapage.rb
+ - lib/metapage/version.rb
+ - metapage.gemspec
+ homepage: https://github.com/colszowka/metapage
+ licenses: []
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 2.4.5.1
+ signing_key:
+ specification_version: 4
+ summary: Extract metadata about a given HTML url from open graph and regular meta
+   tags
+ test_files: []
+ has_rdoc: