web_analytics_discovery 2.0

@@ -0,0 +1,133 @@
1
+ # web_analytics_discovery
2
+ <!--[![Gem Version](https://badge.fury.io/rb/web_analytics_discovery.png)](http://badge.fury.io/rb/web_analytics_discovery)-->
3
+ [![Build Status](https://travis-ci.org/GreyCat/web_analytics_discovery.svg?branch=master)](https://travis-ci.org/GreyCat/web_analytics_discovery)
4
+ [![Dependency Status](https://gemnasium.com/GreyCat/web_analytics_discovery.svg)](https://gemnasium.com/GreyCat/web_analytics_discovery)
5
+ [![Code Climate](https://codeclimate.com/github/GreyCat/web_analytics_discovery/badges/gpa.svg)](https://codeclimate.com/github/GreyCat/web_analytics_discovery)
6
+ <!--[![Coverage Status](https://coveralls.io/repos/GreyCat/web_analytics_discovery/badge.png?branch=master)](https://coveralls.io/r/GreyCat/web_analytics_discovery)-->
7
+ <!--[![Security Status](http://rails-brakeman.com/GreyCat/web_analytics_discovery.png)](http://rails-brakeman.com/GreyCat/web_analytics_discovery)-->
8
+
9
+ This gem provides a set of tools for discovery and export of data from
10
+ popular web analytics tools.
11
+
12
+ The supported web analytics systems are:
13
+
14
+ * Alexa
15
+ * Google Analytics
16
+ * LiveInternet
17
+ * Mail.ru
18
+ * Openstat
19
+ * Quantcast
20
+ * Rambler Top100
21
+ * Yandex Metrika
22
+
23
+ ## The problem
24
+
25
+ Given a particular site URL (e.g. `http://example.com/`), we'd like to
26
+ know audience statistics on that particular site (e.g. how many unique
27
+ people visit this site per day, per week, per month, how many page views
28
+ they generate, etc.).
29
+
30
+ ## The solution
31
+
32
+ Many sites use web analytics tools to measure audience stats. Quite
33
+ often, these statistics are even publicly available, although one needs to know:
34
+
35
+ * which particular web analytics system a given site uses
36
+ * what this site's ID is in that web analytics system
37
+
38
+ Answering these questions usually requires a tedious manual process:
39
+
40
+ * Look up the site's HTML code
41
+ * Locate JavaScript code / tags / calls to web analytics system
42
+ * Identify this system
43
+ * Identify the site's ID in the code / calls
44
+ * Go to the web analytics system's site or API and get the desired statistics
45
+
46
+ This gem tries to automate these tasks, looking up all the info and
47
+ retrieving information from web analytics systems. Exported data can
48
+ be accessed in simple tabular form or programmatically, as a hash,
49
+ using the API.
50
+
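The manual lookup steps listed above can be sketched in a few lines of Ruby. This is only an illustration, not the gem's actual code path: `DETECTORS`, `discover_ids` and `SAMPLE_HTML` are hypothetical names, the sample page is made up, and the two patterns are adapted from the `find_id` methods of the `GoogleAnalytics` and `MailRu` grabbers in this package.

```ruby
# Hypothetical helper, not part of the gem: maps a service name to a regex
# whose first capture group is the site's ID in that service.
DETECTORS = {
  googleanalytics: /_gaq\.push\(\[['"]_setAccount['"], ['"]([^"']+)['"]\]\)/,
  mailru: /_tmr\.push\(\{id:\s*['"](\d+)['"]/
}

# Returns a hash of {service_name => id} for every pattern found in the HTML.
def discover_ids(html)
  DETECTORS.each_with_object({}) do |(name, pattern), found|
    match = pattern.match(html)
    found[name] = match[1] if match
  end
end

# Made-up page source used only for this illustration.
SAMPLE_HTML = <<~HTML
  <script>
    var _gaq = _gaq || [];
    _gaq.push(['_setAccount', 'UA-0000000-1']);
  </script>
HTML

p discover_ids(SAMPLE_HTML)
```

Services whose tags are not present in the page are simply left out of the returned hash.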
51
+ ## Installation
52
+
53
+ ### From RubyGems repository
54
+
55
+ * Make sure you have Ruby and RubyGems
56
+ * Just run `gem install web_analytics_discovery`
57
+
58
+ ### Manually from source
59
+
60
+ * Clone this repository / download snapshot
61
+ * `gem build web_analytics_discovery.gemspec`
62
+ * `gem install --local ./web_analytics_discovery-*.gem` (usually as
63
+ root, if you need system-wide installation)
64
+
65
+ ## Basic usage
66
+
67
+ For basic usage, a simple executable `web_analytics_discover` is
68
+ provided and installed during gem installation. It can be run with one
69
+ or several URLs as command-line arguments and it will produce a simple
70
+ summary table for each of the URLs.
71
+
72
+ Example:
73
+
74
+     $ web_analytics_discover http://kp.ru/
+                         |                      id|      v/day|      s/day|     pv/day|      v/mon|      s/mon|     pv/mon
+     alexa               |                   kp.ru|        N/A|        N/A|    1477599|    6825125|        N/A|   44974428
+     googleanalytics     |           UA-23870775-1|        N/A|        N/A|        N/A|        N/A|        N/A|        N/A
+     liveinternet        |                        |     597956|     745757|    1787863|   10585641|   21308436|   49775501
+     mailru              |                  294001|     756600|        N/A|    2230674|   15086634|        N/A|   73738178
+     openstat            |                 2026010|     983579|    1195306|    2823114|   14757845|   28953554|   69970669
+     quantcast           |                wd:ru.kp|        N/A|        N/A|        N/A|      36300|        N/A|        N/A
+     rambler             |                   17841|    1048235|    1287761|    3015270|   15550162|   31307958|   75869606
+     yandexmetrika       |                 1051362|     259987|     310983|     727833|        N/A|        N/A|   22153416
84
+
85
+ ## API usage
86
+
87
+ One can easily use web analytics discovery through a simple API. Every web
88
+ analytics service is supported by a separate class named after that
89
+ service in the `WebAnalyticsDiscovery` module:
90
+
91
+ * `Alexa`
92
+ * `GoogleAnalytics`
93
+ * `LiveInternet`
94
+ * `MailRu`
95
+ * `Openstat`
96
+ * `Quantcast`
97
+ * `Rambler`
98
+ * `YandexMetrika`
99
+
100
+ One can use it like this:
101
+
102
+     require 'web_analytics_discovery'
+     d = WebAnalyticsDiscovery::MailRu.new
+     result = d.run('http://kp.ru/')
105
+
106
+ `result` will look like this:
107
+
108
+     {:id=>294001,
+      :visitors_day=>756600,
+      :pv_day=>2230674,
+      :visitors_week=>3365344,
+      :pv_week=>13102096,
+      :visitors_mon=>15086634,
+      :pv_mon=>73738178}
115
+
116
+ Some values might be missing if it's not possible to retrieve them
117
+ from a given service.
118
+
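Because keys can be absent (and `run` can return `nil` when no ID is found, as the `GoogleAnalytics` and `MailRu` grabbers in this package do), callers may want a small guard when reading values. The sketch below is one possible way to do that; `metric` is a hypothetical helper, not part of the gem, and the sample hash is a shortened copy of the result shown above.

```ruby
# Sample hash shaped like the result shown above; metrics a service could
# not provide are simply absent.
result = { id: 294001, visitors_day: 756600, pv_day: 2230674 }

# Hypothetical helper: returns the metric as a string, or 'N/A' when the
# whole result is nil or the key is missing.
def metric(result, key)
  value = result && result[key]
  value.nil? ? 'N/A' : value.to_s
end

[:visitors_day, :visits_day, :pv_day].each do |key|
  puts format('%-14s %s', key, metric(result, key))
end
```

This mirrors what the bundled command-line tool does when it prints `N/A` for missing cells.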
119
+ ## Licensing and usage
120
+
121
+ Copyright (C) 2013-2014 Mikhail Yakshin <greycat@altlinux.org>
122
+
123
+ This program is free software: you can redistribute it and/or modify
124
+ it under the terms of the GNU Affero General Public License as
125
+ published by the Free Software Foundation, either version 3 of the
126
+ License, or (at your option) any later version.
127
+
128
+ This program is distributed in the hope that it will be useful, but
129
+ WITHOUT ANY WARRANTY; without even the implied warranty of
130
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
131
+ Affero General Public License for more details.
132
+
133
+ Please consult the LICENSE file for more details and full license text.
@@ -0,0 +1,7 @@
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new
5
+
6
+ task :default => :spec
7
+ task :test => :spec
@@ -0,0 +1,77 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'fileutils'
4
+ require 'uri'
5
+ require 'optparse'
6
+
7
+ require 'web_analytics_discovery'
8
+ include WebAnalyticsDiscovery
9
+
10
+ class AnalyticsGrabber
11
+ def initialize
12
+ @services = {}
13
+ SERVICES.each_pair { |name, klass|
14
+ begin
15
+ @services[name] = klass.new
16
+ rescue StandardError => ex
17
+ warn "Unable to start analytics service #{name}"
18
+ $stderr.puts ex.message
19
+ $stderr.puts ex.backtrace.join("\n")
20
+ end
21
+ }
22
+ end
23
+
24
+ def run(url)
25
+ r = {}
26
+ @services.each_pair { |name, service|
27
+ begin
28
+ r[name] = service.run(url)
29
+ rescue StandardError => ex
30
+ warn "Exception querying analytics service #{name}"
31
+ $stderr.puts ex.message
32
+ $stderr.puts ex.backtrace.join("\n")
33
+ end
34
+ }
35
+ r
36
+ end
37
+
38
+ def pp(url)
39
+ r = run(url)
40
+ print_line ['', 'id', 'v/day', 's/day', 'pv/day', 'v/mon', 's/mon', 'pv/mon']
41
+ r.keys.sort.each { |service|
42
+ res = r[service]
43
+ next unless res
44
+ print_line [
45
+ service,
46
+ res[:id],
47
+ res[:visitors_day],
48
+ res[:visits_day],
49
+ res[:pv_day],
50
+ res[:visitors_mon],
51
+ res[:visits_mon],
52
+ res[:pv_mon],
53
+ ]
54
+ }
55
+ end
56
+
57
+ def print_line(a)
58
+ printf '%-20s', a.shift
59
+ printf '|%24s', a.shift
60
+ a.each { |x|
61
+ printf '|%11s', x || 'N/A'
62
+ }
63
+ puts
64
+ end
65
+ end
66
+
67
+ options = {}
68
+ OptionParser.new { |opts|
69
+ opts.banner = "Usage: #{__FILE__} [options] <urls>"
70
+
71
+ # opts.on('-v', '--[no-]verbose', 'Verbose logging') { |v| options[:verbose] = v }
72
+
73
+ opts.on_tail('-h', '--help', 'Show this message') { puts opts; exit }
74
+ }.parse!
75
+
76
+ ag = AnalyticsGrabber.new
77
+ ARGV.each { |a| ag.pp(a) }
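The row layout produced by `print_line` above (a 20-character left-aligned label, a 24-character ID column, then 11-character metric columns) can also be built as a single string, which is easier to test or log. A minimal sketch, assuming the same widths; `format_row` is a hypothetical name and not part of the script.

```ruby
# Same column widths as print_line in the script above; builds the row as a
# string instead of printing it piece by piece.
def format_row(label, id, *values)
  cells = values.map { |v| format('|%11s', v.nil? ? 'N/A' : v) }
  format('%-20s|%24s', label, id) + cells.join
end

puts format_row('mailru', 294001, 756600, nil, 2230674)
```

Missing metrics are passed as `nil` and rendered as `N/A`, as in the original loop.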
@@ -0,0 +1,23 @@
1
+ require 'web_analytics_discovery/version'
2
+
3
+ require 'web_analytics_discovery/grabber/alexa'
4
+ require 'web_analytics_discovery/grabber/googleanalytics'
5
+ require 'web_analytics_discovery/grabber/liveinternet'
6
+ require 'web_analytics_discovery/grabber/mailru'
7
+ require 'web_analytics_discovery/grabber/openstat'
8
+ require 'web_analytics_discovery/grabber/quantcast'
9
+ require 'web_analytics_discovery/grabber/rambler'
10
+ require 'web_analytics_discovery/grabber/tns'
11
+ require 'web_analytics_discovery/grabber/yandexmetrika'
12
+
13
+ module WebAnalyticsDiscovery
14
+ # Special trickery to get a map of {:service_name => ClassThatImplementsServiceExtraction} magic
15
+   SERVICES = Hash[constants.map { |x|
+     possible_class = const_get(x)
+     [x.to_s.downcase.to_sym, possible_class] if possible_class.is_a?(Class)
+   }.compact]
23
+ end
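The `SERVICES` map above relies on reflection: every constant in the module that is a class becomes an entry keyed by its lowercased name. The sketch below shows the same technique on a stand-in module so it can be run without the gem; `DemoGrabbers` and everything inside it is hypothetical.

```ruby
# Hypothetical stand-in module, used only to illustrate the technique.
module DemoGrabbers
  VERSION = '0.0.0'   # a String constant: skipped
  module Utils; end   # a Module, not a Class: skipped
  class Alexa; end
  class MailRu; end

  # Collect every constant defined so far that is a Class, keyed by its
  # lowercased name as a symbol.
  SERVICES = constants.each_with_object({}) do |name, map|
    value = const_get(name)
    map[name.to_s.downcase.to_sym] = value if value.is_a?(Class)
  end
end

p DemoGrabbers::SERVICES.keys.sort
```

One consequence of this design is that any class defined in the module before the map is built is picked up automatically, so adding a grabber only requires a new `require` line.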
@@ -0,0 +1,33 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'uri'
4
+ require 'web_analytics_discovery/grabberutils'
5
+
6
+ module WebAnalyticsDiscovery
7
+ class Alexa
8
+ include GrabberUtils
9
+
10
+ def run(url)
11
+ uri = URI.parse(url)
12
+ host = uri.host
13
+ r = {}
14
+ doc = download("http://www.alexa.com/siteinfo/#{host}#trafficstats")
15
+
16
+ # Try to extract certified metrics
17
+ r[:visitors_day], r[:pv_day], r[:visitors_mon], r[:pv_mon] = grab_certified_metrics(doc)
18
+
19
+ # Grab ID for clarity's sake
20
+ if doc =~ /<img src="http:\/\/traffic\.alexa\.com\/graph\?.*&u=([^"]+)">/
21
+ r[:id] = $1
22
+ end
23
+ return r
24
+ end
25
+
26
+     def grab_certified_metrics(doc)
+       # Collect every "metrics-data" number and strip thousands separators
+       doc.scan(/<strong class="metrics-data">([0-9,]+)<\/strong>/).flatten.map { |x| x.delete(',').to_i }
+     end
32
+ end
33
+ end
@@ -0,0 +1,29 @@
1
+ require 'web_analytics_discovery/grabberutils'
2
+
3
+ module WebAnalyticsDiscovery
4
+ class GoogleAnalytics
5
+ include GrabberUtils
6
+
7
+ def run(url)
8
+ @page = download(url)
9
+ run_id(find_id)
10
+ end
11
+
12
+ def find_id
13
+ case @page
14
+ when /_gat\._getTracker\(["']([^"']+)["']\)/
15
+ $1
16
+ when /_gaq\.push\(\[['"]_setAccount['"], ['"]([^"']+)['"]\]\)/
17
+ $1
18
+ else
19
+ nil
20
+ end
21
+ end
22
+
23
+ def run_id(id)
24
+ return nil unless id
25
+ r = {:id => id}
26
+ return r
27
+ end
28
+ end
29
+ end
@@ -0,0 +1,61 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ require 'uri'
4
+ require 'web_analytics_discovery/grabberutils'
5
+
6
+ module WebAnalyticsDiscovery
7
+ class LiveInternet
8
+ include GrabberUtils
9
+
10
+ def run(url)
11
+ @url = url
12
+ @page = download(url)
13
+ run_id(find_id)
14
+ end
15
+
16
+ def find_id
17
+ case @page
18
+ when /new Image\(\)\.src = "\/\/counter\.yadro\.ru\/hit;([^?"]+)\?/
19
+ $1
20
+ else
21
+ # Use hostname as a last resort measure
22
+ URI.parse(@url).host
23
+ end
24
+ end
25
+
26
+ def run_id(host)
27
+ r = {:id => host}
28
+
29
+ doc = download("http://www.liveinternet.ru/stat/#{host}/index.csv")
30
+ r[:pv_day], r[:visits_day], r[:visitors_day] = grab_psv(doc, 4)
31
+
32
+ # Bail out early if no LiveInternet data available
33
+ return r unless r[:pv_day]
34
+
35
+ doc = download("http://www.liveinternet.ru/stat/#{host}/index.csv?period=week;total=yes")
36
+ r[:pv_week], r[:visits_week], r[:visitors_week] = grab_psv(doc, 2)
37
+
38
+ doc = download("http://www.liveinternet.ru/stat/#{host}/index.csv?period=month;total=yes")
39
+ r[:pv_mon], r[:visits_mon], r[:visitors_mon] = grab_psv(doc, 2)
40
+
41
+ return r
42
+ end
43
+
44
+ private
45
+ def grab_psv(doc, col)
46
+ r = [nil, nil, nil]
47
+ doc.split(/\n/).each { |l|
48
+ c = l.split(/;/)
49
+ case c[0]
50
+ when '"Просмотры"' # "Page views"
51
+ r[0] = c[col].to_i
52
+ when '"Сессии"' # "Sessions"
53
+ r[1] = c[col].to_i
54
+ when '"Посетители"' # "Visitors"
55
+ r[2] = c[col].to_i
56
+ end
57
+ }
58
+ return r
59
+ end
60
+ end
61
+ end
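To make the column-picking in `grab_psv` easier to follow, here is a self-contained sketch run against made-up data. The sample only imitates the shape the method expects (a quoted Russian label in column 0, numbers after it); real LiveInternet output may differ. `parse_psv`, `LABELS` and `SAMPLE` are hypothetical names, and the sketch returns a hash rather than the three-element array used by the gem.

```ruby
# Made-up sample: a quoted label in column 0, followed by numeric columns.
SAMPLE = <<~CSV
  "Просмотры";100;200;300;400
  "Сессии";10;20;30;40
  "Посетители";1;2;3;4
CSV

# Row labels matched by grab_psv: page views, sessions, visitors.
LABELS = {
  '"Просмотры"' => :pv,
  '"Сессии"' => :visits,
  '"Посетители"' => :visitors
}

# Picks column `col` from every recognized row; unknown rows are ignored.
def parse_psv(doc, col)
  doc.each_line.with_object({}) do |line, out|
    cells = line.chomp.split(';')
    key = LABELS[cells[0]]
    out[key] = cells[col].to_i if key
  end
end

p parse_psv(SAMPLE, 4)
```

Rows that are missing from the input leave their keys unset, which corresponds to the `nil` entries in the array returned by `grab_psv`.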
@@ -0,0 +1,89 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ require 'web_analytics_discovery/grabberutils'
4
+
5
+ module WebAnalyticsDiscovery
6
+ class MailRu
7
+ include GrabberUtils
8
+
9
+ def run(url)
10
+ @page = download(url)
11
+ run_id(find_id)
12
+ end
13
+
14
+ def find_id
15
+ case @page
16
+ when /<a [^>]*href="http:\/\/top\.mail\.ru\/jump\?from=(\d+)".*>\s*<img src="http:\/\/.*\.top\.mail\.ru\/counter/m,
17
+ /<img src=['"]?http:\/\/top\.list\.ru\/counter\?id=(\d+)/,
18
+ /<img src=['"]?http:\/\/.*top\.mail\.ru\/counter\?js=na;id=(\d+)/,
19
+ /_tmr\.push\(\{id:\s*['"](\d+)['"]/
20
+ $1.to_i
21
+ else
22
+ nil
23
+ end
24
+ end
25
+
26
+ def run_id(id)
27
+ return nil unless id
28
+ r = {:id => id}
29
+
30
+ #doc = download("http://top.mail.ru/visits?id=#{id}")
31
+
32
+ # Analyze daily report
33
+ doc = download("http://top.mail.ru/visits.csv?id=#{id}&period=0&date=&back=30&", 'windows-1251').split(/\n/)
34
+ return run_id_html_rating(r, id) if doc.empty?
35
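+ # Skip the 4 header lines; bail out if no data rows remain (avoids division by zero below)
+ doc = doc[4..-1] || []
+ return r if doc.empty?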
36
+
37
+ sum_v = 0
38
+ sum_pv = 0
39
+ doc.each { |l|
40
+ # Columns: "Дата" (date); "Посетители" (visitors); "Новые посетители" (new visitors); "Ядро" (core audience); "Хосты" (hosts); "Просмотры" (page views); "Глубина" (depth)
41
+ date, v, new_v, core_v, hosts, pv, depth = l.split(/;/)
42
+ sum_v += v.to_i
43
+ sum_pv += pv.to_i
44
+ }
45
+
46
+ r[:visitors_day] = sum_v / doc.size
47
+ r[:pv_day] = sum_pv / doc.size
48
+
49
+ # Analyze weekly report
50
+ doc = download("http://top.mail.ru/visits.csv?id=#{id}&period=1&date=&back=98&", 'windows-1251').split(/\n/)
51
+ return r if doc.size < 5 # need at least one data row after the 4 header lines
52
+ date, v, new_v, core_v, hosts, pv, depth = doc[4].split(/;/)
53
+ r[:visitors_week] = v.to_i
54
+ r[:pv_week] = pv.to_i
55
+
56
+ # Analyze monthly report
57
+ doc = download("http://top.mail.ru/visits.csv?id=#{id}&period=2&date=&back=395&", 'windows-1251').split(/\n/)
58
+ return r if doc.size < 5 # need at least one data row after the 4 header lines
59
+ date, v, new_v, core_v, hosts, pv, depth = doc[4].split(/;/)
60
+ r[:visitors_mon] = v.to_i
61
+ r[:pv_mon] = pv.to_i
62
+
63
+ return r
64
+ end
65
+
66
+ # Parse semi-closed rating when normal full CSV export is not available
67
+ def run_id_html_rating(r, id)
68
+ doc = download("http://top.mail.ru/rating?id=#{id}", 'windows-1251')
69
+
70
+ today = []
71
+ doc.gsub(/<td class="l_col">Сегодня<\/td>.*?<td class="r_col"><b>([0-9,]+)<\/b>/m) { today << $1.gsub(/,/, '').to_i }
72
+
73
+ week = []
74
+ doc.gsub(/<td class="l_col">Неделя<\/td>.*?<td class="r_col"><b>([0-9,]+)<\/b>/m) { week << $1.gsub(/,/, '').to_i }
75
+
76
+ month = []
77
+ doc.gsub(/<td class="l_col">Месяц<\/td>.*?<td class="r_col"><b>([0-9,]+)<\/b>/m) { month << $1.gsub(/,/, '').to_i }
78
+
79
+ # Non-normal number of matches? That's weird, bail out
80
+ return r unless today.length == 3 && week.length == 3 && month.length == 3
81
+
82
+ r[:visitors_day], r[:pv_day], r[:ip_day] = today
83
+ r[:visitors_week], r[:pv_week], r[:ip_week] = week
84
+ r[:visitors_mon], r[:pv_mon], r[:ip_mon] = month
85
+
86
+ return r
87
+ end
88
+ end
89
+ end