webinspector 0.5.0 → 1.1.0

This diff compares the contents of two publicly released versions of the package as they appear in their public registry. It is provided for informational purposes only.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
- SHA1:
-   metadata.gz: 943a718d012d5b472e7ecdb96d1eb4e6e178bc4a
-   data.tar.gz: c600d322e258a2efd527bde502875655f2819f86
+ SHA256:
+   metadata.gz: df0bf76a03246a803f338a903611f128ee8b6d09329f33a9745a27eeb4e9793b
+   data.tar.gz: adda6867a10d3dc5f7a9fd0ec414d7046e140fe016d945eb20098a84a5176642
  SHA512:
-   metadata.gz: 253f5907d503c96fc19485c58ddd59ec0373a9d4c755ffd5189bfd02bf0851692ffda30c25a7e44c7b2263f6dc300049bdd59b56fcc4f4970aca22d7d6ecb700
-   data.tar.gz: 630e4c3c60de4ef1d057180cea7e09d030de573a4d9dc248d0a61657a7f44049f54b60c633d0f7ebfbe54ee70863f8d615095a4335376d2d6a8ed86527d9ff49
+   metadata.gz: 01ce7c5aab007a3c9ef300c61a990a6f00d00604c14f0a3fdec28fdfb620a1e50ad576055b892cfe4d59de66975236674ce1f81f66e2d60c0ad0e0a0c3f4a951
+   data.tar.gz: ca58cdda149cf3b0cc6dcb29017b8e080b70b4ed881c924acddfba98760246417bb0304bc4eb42242a328d4ade99408181cc3ac10016998cd2b23de8fe34a8bf
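
The archive digests move from SHA1 to SHA256 (the SHA512 entries change simply because the archives themselves changed). As a minimal sketch of how such digests are computed, assuming local copies of the two archives extracted from the packaged gem:

```ruby
# Sketch: recomputing the digests recorded in checksums.yaml.
# Assumes metadata.gz and data.tar.gz sit in the current directory.
require 'digest'

%w[metadata.gz data.tar.gz].each do |name|
  bytes = File.binread(name)
  puts "#{name}:"
  puts "  SHA256: #{Digest::SHA256.hexdigest(bytes)}"
  puts "  SHA512: #{Digest::SHA512.hexdigest(bytes)}"
end
```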
data/Gemfile CHANGED
@@ -1,3 +1,5 @@
+ # frozen_string_literal: true
+
  source 'https://rubygems.org'
 
  # Specify your gem's dependencies in webinspector.gemspec
data/README.md CHANGED
@@ -1,10 +1,10 @@
- # Webinspector
+ # WebInspector
 
- Ruby gem to inspect completely a web page. It scrapes a given URL, and returns you its title, description, meta, links, images and more.
+ Ruby gem to inspect web pages. It scrapes a given URL and returns its title, description, meta tags, links, images, and more.
+
+ <a href="https://codeclimate.com/github/davidesantangelo/webinspector"><img src="https://codeclimate.com/github/davidesantangelo/webinspector/badges/gpa.svg" /></a>
 
- ## See it in action!
 
- You can try WebInspector live at this little demo: [https://scrappet.herokuapp.com](https://scrappet.herokuapp.com)
  ## Installation
 
  Add this line to your application's Gemfile:
@@ -23,50 +23,126 @@ Or install it yourself as:
 
  ## Usage
 
- Initialize a WebInspector instance for an URL, like this:
+ ### Initialize a WebInspector instance
 
  ```ruby
- page = WebInspector.new('http://davidesantangelo.com')
+ page = WebInspector.new('http://example.com')
  ```
 
- ## Accessing response status and headers
+ ### With options
+
+ ```ruby
+ page = WebInspector.new('http://example.com', {
+   timeout: 30, # Request timeout in seconds (default: 30)
+   retries: 3, # Number of retries (default: 3)
+   headers: {'User-Agent': 'Custom UA'} # Custom HTTP headers
+ })
+ ```
 
- You can check the status and headers from the response like this:
+ ### Accessing response status and headers
 
  ```ruby
  page.response.status # 200
- page.response.headers # { "server"=>"apache", "content-type"=>"text/html; charset=utf-8", "cache-control"=>"must-revalidate, private, max-age=0", ... }
+ page.response.headers # { "server"=>"apache", "content-type"=>"text/html; charset=utf-8", ... }
+ page.status_code # 200
+ page.success? # true if the page was loaded successfully
+ page.error_message # returns the error message if any
  ```
 
- ## Accessing inpsected data
-
- You can see the data like this:
+ ### Accessing page data
 
  ```ruby
- page.url # URL of the page
- page.scheme # Scheme of the page (http, https)
- page.host # Hostname of the page (like, davidesantangelo.com, without the scheme)
- page.port # Port of the page
- page.title # title of the page from the head section, as string
- page.description # description of the page
- page.links # every link found
- page.images # every image found
- page.meta # metatags of the page
+ page.url # URL of the page
+ page.scheme # Scheme of the page (http, https)
+ page.host # Hostname of the page (like, example.com, without the scheme)
+ page.port # Port of the page
+ page.title # title of the page from the head section
+ page.description # description of the page
+ page.links # array of all links found on the page (absolute URLs)
+ page.images # array of all images found on the page (absolute URLs)
+ page.meta # meta tags of the page
+ page.favicon # favicon URL if available
  ```
 
- ## Accessing meta tags
+ ### Working with meta tags
 
  ```ruby
- page.meta # metatags of the page
+ page.meta # all meta tags
  page.meta['description'] # meta description
  page.meta['keywords'] # meta keywords
+ page.meta['og:title'] # OpenGraph title
+ ```
+
+ ### Filtering links and images by domain
+
+ ```ruby
+ page.domain_links('example.com') # returns only links pointing to example.com
+ page.domain_images('example.com') # returns only images hosted on example.com
+ ```
+
+ ### Searching for words
+
+ ```ruby
+ page.find(["ruby", "rails"]) # returns [{"ruby"=>3}, {"rails"=>1}]
+ ```
+
+ #### JavaScript and Stylesheets
+
+ ```ruby
+ page.javascripts # array of all JavaScript files (absolute URLs)
+ page.stylesheets # array of all CSS stylesheets (absolute URLs)
+ ```
+
+ #### Language Detection
+
+ ```ruby
+ page.language # detected language code (e.g., "en", "es", "fr")
+ ```
+
+ #### Structured Data
+
+ ```ruby
+ page.structured_data # array of JSON-LD structured data objects
+ page.microdata # array of microdata items
+ page.json_ld # alias for structured_data
+ ```
+
+ #### Security Information
+
+ ```ruby
+ page.security_info # hash with security details: { secure: true, hsts: true, ... }
  ```
 
- ## Find words (as array)
+ #### Performance Metrics
+
+ ```ruby
+ page.load_time # page load time in seconds
+ page.size # page size in bytes
+ ```
+
+ #### Content Type
+
  ```ruby
- page.find(["word1, word2"]) # return {"word1"=>3, "word2"=>1}
+ page.content_type # content type header (e.g., "text/html; charset=utf-8")
  ```
 
+ #### Technology Detection
+
+ ```ruby
+ page.technologies # hash of detected technologies: { jquery: true, react: true, ... }
+ ```
+
+ #### HTML Tag Statistics
+
+ ```ruby
+ page.tag_count # hash with counts of each HTML tag: { "div" => 45, "p" => 12, ... }
+ ```
+
+ ### Export all data to JSON
+
+ ```ruby
+ page.to_hash # returns a hash with all page data
+ ```
 
  ## Contributors
 
@@ -74,13 +150,13 @@ page.find(["word1, word2"]) # return {"word1"=>3, "word2"=>1}
  * Sam Nissen ([@samnissen](https://github.com/samnissen))
 
  ## License
- The webinspector GEM is released under the MIT License.
+
+ The WebInspector gem is released under the MIT License.
 
  ## Contributing
 
- 1. Fork it ( https://github.com/[my-github-username]/webinspector/fork )
+ 1. Fork it ( https://github.com/davidesantangelo/webinspector/fork )
  2. Create your feature branch (`git checkout -b my-new-feature`)
  3. Commit your changes (`git commit -am 'Add some feature'`)
  4. Push to the branch (`git push origin my-new-feature`)
  5. Create a new Pull Request
- >>>>>>> develop
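
Taken together, the rewritten README documents a much larger read-only API. A minimal end-to-end sketch built only from the calls listed above (the URL and option values are placeholders):

```ruby
require 'webinspector'

# Option values are illustrative; each call mirrors a README section above.
page = WebInspector.new('http://example.com', timeout: 10, retries: 2)

if page.success?
  puts page.title
  puts page.description
  puts page.meta['og:title']            # OpenGraph title, if present
  puts page.domain_links('example.com') # links restricted to one domain
  p    page.find(%w[ruby rails])        # e.g. [{"ruby"=>3}, {"rails"=>1}]
else
  warn page.error_message
end
```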
data/Rakefile CHANGED
@@ -1,2 +1,3 @@
- require "bundler/gem_tasks"
+ # frozen_string_literal: true
 
+ require 'bundler/gem_tasks'
data/bin/console CHANGED
@@ -1,7 +1,8 @@
  #!/usr/bin/env ruby
+ # frozen_string_literal: true
 
- require "bundler/setup"
- require "webinspector"
+ require 'bundler/setup'
+ require 'webinspector'
 
  # You can add fixtures and/or initialization code here to make experimenting
  # with your gem easier. You can also use a different console, if you like.
@@ -10,5 +11,5 @@ require "webinspector"
  # require "pry"
  # Pry.start
 
- require "irb"
+ require 'irb'
  IRB.start
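
data/Gemfile, data/Rakefile, and data/bin/console all gain the same `# frozen_string_literal: true` magic comment. A short sketch of what the pragma changes (the values here are illustrative):

```ruby
# frozen_string_literal: true

s = 'hello'
s.frozen?       # => true; string literals in this file are frozen
# s << ' world' # would raise FrozenError
t = +'hello'    # unary plus returns a mutable (unfrozen) copy
t << ' world'   # => "hello world"
```

The largest change, shown in the hunk below, is a rewrite of the `WebInspector::Inspector` class.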
@@ -1,144 +1,336 @@
+ # frozen_string_literal: true
+
  require File.expand_path(File.join(File.dirname(__FILE__), 'meta'))
 
  module WebInspector
    class Inspector
+     attr_reader :page, :url, :host, :meta
 
      def initialize(page)
        @page = page
        @meta = WebInspector::Meta.new(page).meta
+       @base_url = nil
+     end
+
+     def set_url(url, host)
+       @url = url
+       @host = host
      end
 
      def title
-       @page.css('title').inner_text.strip rescue nil
+       @page.css('title').inner_text.strip
+     rescue StandardError
+       nil
      end
 
      def description
-       @meta['description'] || snippet
+       @meta['description'] || @meta['og:description'] || snippet
      end
 
      def body
        @page.css('body').to_html
      end
 
-     def meta
-       @meta
-     end
-
+     # Search for specific words in the page content
+     # @param words [Array<String>] List of words to search for
+     # @return [Array<Hash>] Counts of word occurrences
      def find(words)
-       text = @page.at('html').inner_text
+       text = @page.at('html').inner_text
        counter(text.downcase, words)
      end
 
+     # Get all links from the page
+     # @return [Array<String>] Array of URLs
      def links
-       get_new_links unless @links
-       return @links
+       @links ||= begin
+         links = []
+         @page.css('a').each do |a|
+           href = a[:href]
+           next unless href
+
+           # Skip javascript and mailto links
+           next if href.start_with?('javascript:', 'mailto:', 'tel:')
+
+           # Clean and normalize URL
+           href = href.strip
+
+           begin
+             absolute_url = make_absolute_url(href)
+             links << absolute_url if absolute_url
+           rescue URI::InvalidURIError
+             # Skip invalid URLs
+           end
+         end
+         links.uniq
+       end
      end
-
-     def domain_links(user_domain, host)
+
+     # Get links from a specific domain
+     # @param user_domain [String] Domain to filter links by
+     # @param host [String] Current host
+     # @return [Array<String>] Filtered links
+     def domain_links(user_domain, host = nil)
        @host ||= host
-
-       validated_domain_uri = validate_url_domain("http://#{user_domain.downcase.gsub(/\s+/, '')}")
-       raise "Invalid domain provided" unless validated_domain_uri
-
-       domain = validated_domain_uri.domain
-
-       domain_links = []
-
-       links.each do |l|
-
-         u = validate_url_domain(l)
-         next unless u && u.domain
-
-         domain_links.push(l) if domain == u.domain.downcase
+       filter_by_domain(links, user_domain)
+     end
+
+     # Get all images from the page
+     # @return [Array<String>] Array of image URLs
+     def images
+       @images ||= begin
+         images = []
+         @page.css('img').each do |img|
+           src = img[:src]
+           next unless src
+
+           # Clean and normalize URL
+           src = src.strip
+
+           begin
+             absolute_url = make_absolute_url(src)
+             images << absolute_url if absolute_url
+           rescue URI::InvalidURIError, URI::BadURIError
+             # Skip invalid URLs
+           end
+         end
+         images.uniq.compact
        end
-
-       return domain_links.compact
      end
-
-     def domain_images(user_domain, host)
+
+     # Get images from a specific domain
+     # @param user_domain [String] Domain to filter images by
+     # @param host [String] Current host
+     # @return [Array<String>] Filtered images
+     def domain_images(user_domain, host = nil)
        @host ||= host
-
-       validated_domain_uri = validate_url_domain("http://#{user_domain.downcase.gsub(/\s+/, '')}")
-       raise "Invalid domain provided" unless validated_domain_uri
-
-       domain = validated_domain_uri.domain
-
-       domain_images = []
-
-       images.each do |img|
-         u = validate_url_domain(img)
-         next unless u && u.domain
-
-         domain_images.push(img) if u.domain.downcase.end_with?(domain)
+       filter_by_domain(images, user_domain)
+     end
+
+     # Get all JavaScript files used by the page
+     # @return [Array<String>] Array of JavaScript file URLs
+     def javascripts
+       @javascripts ||= begin
+         scripts = []
+         @page.css('script[src]').each do |script|
+           src = script[:src]
+           next unless src
+
+           # Clean and normalize URL
+           src = src.strip
+
+           begin
+             absolute_url = make_absolute_url(src)
+             scripts << absolute_url if absolute_url
+           rescue URI::InvalidURIError, URI::BadURIError
+             # Skip invalid URLs
+           end
+         end
+         scripts.uniq.compact
        end
-
-       return domain_images.compact
      end
-
-     # Normalize and validate the URLs on the page for comparison
-     def validate_url_domain(u)
-       # Enforce a few bare standards before proceeding
-       u = "#{u}"
-       u = "/" if u.empty?
-
-       begin
-         # Look for evidence of a host. If this is a relative link
-         # like '/contact', add the page host.
-         domained_url = @host + u unless (u.split("/").first || "").match(/(\:|\.)/)
-         domained_url ||= u
-
-         # http the URL if it is missing
-         httpped_url = "http://" + domained_url unless domained_url[0..3] == 'http'
-         httpped_url ||= domained_url
-
-         # Make sure the URL parses
-         uri = URI.parse(httpped_url)
-
-         # Make sure the URL passes ICANN rules.
-         # The PublicSuffix object splits the domain and subdomain
-         # (unlike URI), which allows more liberal URL matching.
-         return PublicSuffix.parse(uri.host)
-       rescue URI::InvalidURIError, PublicSuffix::DomainInvalid => e
-         return false
+
+     # Get stylesheets used by the page
+     # @return [Array<String>] Array of CSS file URLs
+     def stylesheets
+       @stylesheets ||= begin
+         styles = []
+         @page.css('link[rel="stylesheet"]').each do |style|
+           href = style[:href]
+           next unless href
+
+           # Clean and normalize URL
+           href = href.strip
+
+           begin
+             absolute_url = make_absolute_url(href)
+             styles << absolute_url if absolute_url
+           rescue URI::InvalidURIError, URI::BadURIError
+             # Skip invalid URLs
+           end
+         end
+         styles.uniq.compact
        end
      end
 
-     def images
-       get_new_images unless @images
-       return @images
+     # Detect the page language
+     # @return [String, nil] Language code if detected, nil otherwise
+     def language
+       # Check for html lang attribute first
+       html_tag = @page.at('html')
+       return html_tag['lang'] if html_tag && html_tag['lang'] && !html_tag['lang'].empty?
+
+       # Then check for language meta tag
+       lang_meta = @meta['content-language']
+       return lang_meta if lang_meta && !lang_meta.empty?
+
+       # Fallback to inspecting content headers if available
+       nil
+     end
+
+     # Extract structured data (JSON-LD) from the page
+     # @return [Array<Hash>] Array of structured data objects
+     def structured_data
+       @structured_data ||= begin
+         data = []
+         @page.css('script[type="application/ld+json"]').each do |script|
+           parsed = JSON.parse(script.text)
+           data << parsed if parsed
+         rescue JSON::ParserError
+           # Skip invalid JSON
+         end
+         data
+       end
+     end
+
+     # Extract microdata from the page
+     # @return [Array<Hash>] Array of microdata items
+     def microdata
+       @microdata ||= begin
+         items = []
+         @page.css('[itemscope]').each do |scope|
+           item = { type: scope['itemtype'] }
+           properties = {}
+
+           scope.css('[itemprop]').each do |prop|
+             name = prop['itemprop']
+             # Extract value based on tag
+             value = case prop.name.downcase
+                     when 'meta'
+                       prop['content']
+                     when 'img', 'audio', 'embed', 'iframe', 'source', 'track', 'video'
+                       make_absolute_url(prop['src'])
+                     when 'a', 'area', 'link'
+                       make_absolute_url(prop['href'])
+                     when 'time'
+                       prop['datetime'] || prop.text.strip
+                     else
+                       prop.text.strip
+                     end
+             properties[name] = value
+           end
+
+           item[:properties] = properties
+           items << item
+         end
+         items
+       end
+     end
+
+     # Count all tag types on the page
+     # @return [Hash] Counts of different HTML elements
+     def tag_count
+       tags = {}
+       @page.css('*').each do |element|
+         tag_name = element.name.downcase
+         tags[tag_name] ||= 0
+         tags[tag_name] += 1
+       end
+       tags
      end
 
      private
-
+
+     # Count occurrences of words in text
+     # @param text [String] Text to search in
+     # @param words [Array<String>] Words to find
+     # @return [Array<Hash>] Count results
      def counter(text, words)
-       results = []
-       hash = Hash.new
+       words.map do |word|
+         { word => text.scan(/#{Regexp.escape(word.downcase)}/).size }
+       end
+     end
+
+     # Validate a URL domain
+     # @param u [String] URL to validate
+     # @return [PublicSuffix::Domain, false] Domain object or false if invalid
+     def validate_url_domain(u)
+       u = u.to_s
+       u = '/' if u.empty?
+
+       begin
+         domained_url = if !(u.split('/').first || '').match(/(:|\.)/)
+                          @host + u
+                        else
+                          u
+                        end
+
+         httpped_url = domained_url.start_with?('http') ? domained_url : "http://#{domained_url}"
+         uri = URI.parse(httpped_url)
 
-       words.each do |word|
-         hash[word] = text.scan(/#{word.downcase}/).size
-         results.push(hash)
-         hash = Hash.new
+         PublicSuffix.parse(uri.host)
+       rescue URI::InvalidURIError, PublicSuffix::DomainInvalid
+         false
        end
-       return results
      end
 
-     def get_new_images
-       @images = []
-       @page.css("img").each do |img|
-         @images.push((img[:src].to_s.start_with? @url.to_s) ? img[:src] : URI.join(url, img[:src]).to_s) if (img and img[:src])
+     # Filter a list of URLs by a given domain.
+     # @param collection [Array<String>] The list of URLs to filter.
+     # @param user_domain [String] The domain to filter by.
+     # @return [Array<String>] The filtered list of URLs.
+     def filter_by_domain(collection, user_domain)
+       return [] if collection.empty?
+
+       # Handle nil user_domain
+       user_domain = @host.to_s if user_domain.nil? || user_domain.empty?
+
+       # Normalize domain for comparison
+       normalized_domain = user_domain.to_s.downcase.gsub(/\s+/, '').sub(/^www\./, '')
+
+       collection.select do |item|
+         uri = URI.parse(item.to_s)
+         next false unless uri.host
+
+         uri_host = uri.host.to_s.downcase.sub(/^www\./, '')
+         uri_host.include?(normalized_domain)
+       rescue URI::InvalidURIError, NoMethodError
+         false
        end
      end
-
-     def get_new_links
-       @links = []
-       @page.css("a").each do |a|
-         @links.push((a[:href].to_s.start_with? @url.to_s) ? a[:href] : URI.join(@url, a[:href]).to_s) if (a and a[:href])
+
+     # Make a URL absolute
+     # @param url [String] URL to make absolute
+     # @return [String, nil] Absolute URL or nil if invalid
+     def make_absolute_url(url)
+       return nil if url.nil? || url.empty?
+
+       # If it's already absolute, return it
+       return url if url.start_with?('http://', 'https://')
+
+       # Get base URL from the page if not already set
+       if @base_url.nil?
+         base_tag = @page.at_css('base[href]')
+         @base_url = base_tag ? base_tag['href'] : ''
        end
+
+       begin
+         # Try joining with base URL first if available
+         return URI.join(@base_url, url).to_s unless @base_url.empty?
+       rescue URI::InvalidURIError, URI::BadURIError
+         # Fall through to next method
+       end
+
+       begin
+         # If we have @url, try to use it
+         return URI.join(@url, url).to_s if @url
+       rescue URI::InvalidURIError, URI::BadURIError
+         # Fall through to next method
+       end
+
+       # For relative URLs, we need to make our best guess
+       return "http://#{@host}#{url}" if url.start_with?('/')
+       return "http://#{@host}/#{url}" if @host
+
+       # Last resort, return the original
+       url
+     rescue URI::InvalidURIError, URI::BadURIError
+       url # Return original instead of nil to be more lenient
      end
 
+     # Extract a snippet from the first long paragraph
+     # @return [String] Text snippet
      def snippet
        first_long_paragraph = @page.search('//p[string-length() >= 120]').first
-       first_long_paragraph ? first_long_paragraph.text : ''
+       first_long_paragraph ? first_long_paragraph.text.strip[0..255] : ''
      end
    end
- end
+ end
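
Worth calling out in the rewrite: the private `make_absolute_url` helper replaces the bare `URI.join(@url, href)` calls from 0.5.0 with a layered fallback, trying the page's `<base href>`, then the page URL, then a host-based guess. A small sketch of that resolution order with stand-in values:

```ruby
require 'uri'

# Stand-in values; the order mirrors make_absolute_url in the hunk above.
base_url = 'http://example.com/docs/'            # from <base href>, when present
page_url = 'http://example.com/docs/index.html'  # the inspected page's URL
host     = 'example.com'

href = 'guide/intro.html'

URI.join(base_url, href).to_s  # => "http://example.com/docs/guide/intro.html"
URI.join(page_url, href).to_s  # same result when there is no <base> tag
"http://#{host}/#{href}"       # last-resort guess for a bare relative path
```

The word counter gets a similar cleanup: it now passes each search term through `Regexp.escape` before scanning, so terms containing regex metacharacters (such as "c++") are counted literally instead of being interpreted as patterns.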