statlysis 0.0.3 → 0.0.7

data/README.markdown CHANGED
@@ -1,52 +1,45 @@
1
1
  Statlysis
2
2
  ===============================================
3
- Statistical & Analysis in Ruby DSL
3
+ Statistical and analysis in Ruby DSL, just as simple as SQL operations in ActiveRecord.
4
4
 
5
- Usage
5
+ Project origin, philosophy, and usage
6
6
  -----------------------------------------------
7
- ### setup
7
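+ This project started as a statistics backend for eoe.cn. Its design draws on lessons learned while building the [Android eoeMarket data collection and analysis system](http://mvj3.github.io/2012/11/01/android_eoemarket_data_collect_and_analysis_system_summary/) in the first half of 2012. The architecture and most of the code were completed in the first half of 2013, with support for two ORMs, ActiveRecord and Mongoid. In the second half of 2013, MapReduce support for Mongoid was added at Sunshine Library.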
8
8
 
9
- ```ruby
10
- Statlysis.setup do
11
- set_database :statlysis
9
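+ For ordinary website statistics, you put a JavaScript snippet from an analytics service such as Google Analytics at the bottom of each page, and you can then see detailed daily traffic for the site. Internal data needs, such as the number of users registering each day, generally cannot be opened up to a third party to count, and that is why statlysis exists.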
12
10
 
13
- daily CodeGist
14
- hourly EoeLog, :time_column => :t # support custom time_column
11
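+ Anyone who has done data analysis knows the pitfalls. For example, some setups compute statistics directly with SQL joins across multiple tables and run the complete query every time data is requested; as the data grows, the performance problems are easy to imagine.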
15
12
 
16
- [EoeLog,
17
- EoeLog.where(:ui => 0), # support query scope
18
- EoeLog.where(:ui => {"$ne" => 0}),
19
- Mongoid[/eoe_logs_[0-9]+$/].where(:ui => {"$ne" => 0}), # support collection name regexp
20
- EoeLog.where(:do => {"$in" => [DOMAINS_HASH[:blog], DOMAINS_HASH[:my]]}),
21
- ].each do |s|
22
- daily s, :time_column => :t
23
- end
24
- end
25
- ```
13
+ The following describes how to do statistical analysis with statlysis:
26
14
 
27
- ### access
15
+ #### Whether to clean the data with ETL (Extract, Transform, Load)
16
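+ statlysis holds that the data source must be ETL'd into a flat, single-level dataset with a few simple dimensions, because what users can ultimately see and understand is a chart with two or three dimensions. The design therefore starts from what the user can understand.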
28
17
 
29
- ```ruby
30
- Statlysis.daily # => return daily crons
31
- Statlysis.daily.run # => run daily crons
32
- Statlysis.daily[/name_regexp/] # => return matched daily crons
33
- ```
18
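+ Note also that if a table can already support statistical analysis directly but holds a lot of data, you need to add the relevant indexes, or import the data into a separate single table (in an ORM this can be done in hooks such as `after_save`) and index that table.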
34
19
 
35
- ### process
20
+ #### Process
21
+ 1. Analyze the statistics requirements and sketch the charts with the corresponding dimensions.
22
+ 2. ETL; see the ETL data-cleaning section above.
23
+ 3. In the `Statlysis.setup { }` block, configure the data the page needs. Each source must be a single table, like a SQL query with no cross-table JOIN.
24
+ 4. Run the statistics, for example `Statlysis.daily.run`. This can be driven by a scheduled cron job or by data updates through hooks such as `after_save`.
25
+ 5. Write the HTML page for the people who need the data. The statistics can be queried directly through `Statlysis.daily['code_gists'].first.stat_model` or `TimelyCodegist`.
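
Reviewer note on steps 3 and 4 above: the sketch below shows, in plain Ruby, the kind of single-table, per-day count that a `daily` entry is described as producing. It is not statlysis code. `Row`, `daily_counts` and the sample rows are hypothetical names and data introduced only for illustration.

```ruby
require 'date'

# Hypothetical flat rows, as they might look after ETL: one row per event.
Row = Struct.new(:user, :created_at)

# Group rows by calendar day and count them. No joins are involved,
# which is the property the README asks of every data source.
def daily_counts(rows, time_column)
  rows.group_by { |row| row.public_send(time_column).to_date }
      .map { |day, day_rows| [day, day_rows.size] }
      .sort_by(&:first)
      .to_h
end

rows = [
  Row.new('alice', DateTime.new(2013, 12, 1, 9, 30)),
  Row.new('bob',   DateTime.new(2013, 12, 1, 18, 5)),
  Row.new('carol', DateTime.new(2013, 12, 2, 7, 45)),
]

daily_counts(rows, :created_at).each { |day, count| puts "#{day}: #{count}" }
```

In the gem itself the equivalent work is pushed down to the database (SQL or MapReduce) rather than done in Ruby memory.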
36
26
 
37
- ```irb
38
- [23] pry(#<Statlysis::Configuration>)> Statlysis.daily['multi'].first
39
- ```
27
+ #### Prefer MongoDB as the statistics data source
28
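+ As a NoSQL database, MongoDB is optimized for reading and writing **whole single records** within a **single collection**, and it supports parallel MapReduce to speed up the statistics.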
40
29
 
41
- Features
30
+ Success stories
42
31
  -----------------------------------------------
43
- * Support time column that stored as integer.
32
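+ * Page-view statistics for the eoe.cn sub-sites, plus daily statistics on database tables under multiple conditions; see the [example config file](https://github.com/mvj3/statlysis/blob/master/examples/eoecn.rb). Split by the date dimension.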
33
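+ * Statistical analysis of exercise-answering activity in Sunshine Library's learning improvement classes; see the [example config file](https://github.com/mvj3/statlysis/blob/master/examples/sunshinelibrary.rb). Split by the class dimension.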
44
34
 
45
- TODO
35
+ Usage
46
36
  -----------------------------------------------
47
- * Admin interface
48
- * statistical query api in Ruby and HTTP
49
- * Interacting with Javascript charting library, e.g. Highcharts, D3.
37
+ See the config files under [Success stories](#成功案例) above and the [step-by-step demo](http://mvj3.github.io/statlysis/showterm.html).
38
+
39
+ Features
40
+ -----------------------------------------------
41
+ * Supports two ORMs, Mongoid and ActiveRecord. Mongoid statistics use MapReduce, while ActiveRecord statistics rely on plain SQL.
42
+ * Support time column that stored as integer.
50
43
 
51
44
 
52
45
  Statistical Process
@@ -73,34 +66,13 @@ Q: In Mongodb, why use MapReduce instead of Aggregation?
73
66
  A: The result of aggregation pipeline is a document and is subject to the BSON Document size limit, which is currently 16 megabytes, see more details at http://docs.mongodb.org/manual/core/aggregation-pipeline/#pipeline
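
Reviewer note: for readers unfamiliar with the MapReduce model named in this Q&A, here is a plain-Ruby sketch of its three phases applied to a per-author count. It only illustrates the idea and makes no claim about how statlysis builds its MongoDB jobs. `map_reduce_count` and the sample documents are hypothetical.

```ruby
# Count documents per key with explicit map, shuffle and reduce phases.
def map_reduce_count(docs, key)
  # Map: emit one [key_value, 1] pair per document.
  emitted = docs.map { |doc| [doc[key], 1] }
  # Shuffle: collect the pairs emitted under each key value.
  grouped = emitted.group_by(&:first)
  # Reduce: fold the emitted ones for each key value into a single count.
  grouped.map { |value, pairs| { key => value, :c => pairs.map(&:last).reduce(:+) } }
end

# Hypothetical documents standing in for a MongoDB collection.
docs = [
  { :author => 'mvj3',  :fav_count => 2 },
  { :author => 'mvj3',  :fav_count => 5 },
  { :author => 'guest', :fav_count => 1 },
]

map_reduce_count(docs, :author).each { |row| puts row.inspect }
```

Because each reduced row is small and independent, the output can be written row by row instead of being returned as one large document.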
74
67
 
75
68
 
76
- Copyright
69
+ TODO
77
70
  -----------------------------------------------
78
- MIT. David Chen at eoe.cn.
71
+ * Admin interface
72
+ * statistical query api in Ruby and HTTP
73
+ * Interacting with Javascript charting library, e.g. Highcharts, D3.
79
74
 
80
75
 
81
- Related
76
+ Copyright
82
77
  -----------------------------------------------
83
- ### Projects
84
- * https://github.com/paulasmuth/fnordmetric FnordMetric is a redis/ruby-based realtime Event-Tracking app
85
- * https://github.com/thirtysixthspan/descriptive_statistics adds methods to the Enumerable module to allow easy calculation of basic descriptive statistics for a set of data
86
- * https://github.com/tmcw/simple-statistics simple statistics for javascript in node and the browser
87
- * https://github.com/clbustos/statsample/ A suite for basic and advanced statistics on Ruby.
88
- * https://github.com/SciRuby/sciruby Tools for scientific computation in Ruby/Rails
89
-
90
- ### Articles
91
- * http://www.slideshare.net/WombatNation/logging-app-behavior-to-mongo-db
92
-
93
- ### Event collector
94
- * https://github.com/fluent
95
- * https://github.com/logstash/logstash
96
-
97
- ### Admin interface
98
- * http://three.kibana.org/ browser based analytics and search interface to Logstash and other timestamped data sets stored in ElasticSearch.
99
-
100
-
101
- ### ETL
102
- * https://github.com/activewarehouse/activewarehouse-etl/
103
- * http://jisraelsen.github.io/drudgery/ ruby ETL DSL, support csv, sqlite3, ActiveRecord, without support time range
104
- * https://github.com/square/ETL Simply encapsulates the SQL procedures
105
-
106
-
78
+ MIT. David Chen at eoe.cn, sunshine-library .
data/examples/eoecn.rb ADDED
@@ -0,0 +1,35 @@
1
+ # encoding: UTF-8
2
+
3
+ Statlysis.setup do
4
+ set_database :statlysis
5
+ update_time_columns :t
6
+ set_tablename_default_pre :st
7
+
8
+ # Count overall daily site visits
9
+ @log_model = IS_DEVELOP ? EoeLogTest : EoeLog
10
+ hourly @log_model, :time_column => :t
11
+ daily @log_model, :time_column => :t
12
+ # Count visits by logged-in and anonymous users
13
+ daily @log_model.where(:ui => 0), :time_column => :t
14
+ daily @log_model.where(:ui => {"$ne" => 0}), :time_column => :t
15
+
16
+ # Count daily visits for each sub-site
17
+ daily @log_model.where(:do => {"$in" => [DOMAINS_HASH[:blog], DOMAINS_HASH[:my]]}), :time_column => :t
18
+ [:www, :code, :skill, :book, :edu, :news, :wiki, :salon, :android].each do |site|
19
+ daily @log_model.where(:do => DOMAINS_HASH[site]), :time_column => :t
20
+ end
21
+
22
+ # Count daily changes for each data model under different conditions
23
+ daily CodeGist
24
+ [BlogPost, NewsNews, WikiPage].each do |model|
25
+ daily model.where("create_time > 0"), :time_column => :create_time
26
+ daily model.where("update_time > 0"), :time_column => :update_time
27
+ end
28
+
29
+ daily CommonComment.where("is_delete = 0"), :time_column => :create_time
30
+ daily CommonComment.where(:model => 'blog').where("is_delete = 0"), :time_column => :create_time
31
+ daily CommonComment.where(:model => 'code').where("is_delete = 0"), :time_column => :create_time
32
+
33
+ daily CommonMember.where("regdate > 0"), :time_column => :regdate
34
+
35
+ end
data/examples/sunshinelibrary.rb ADDED
@@ -0,0 +1,41 @@
1
+ # encoding: UTF-8
2
+
3
+ Statlysis.setup do
4
+ set_database :local_statistic
5
+
6
+ daily UserRecord.where(item_type: "activity")
7
+
8
+ # Table relations: subject <= chapter <= lesson <= activity <= problem
9
+ # room is bound to [user, duration], etc.
10
+
11
+ # ********
12
+ # **List**
13
+ # ********
14
+ # Query conditions: [chapter]
15
+ # Chapter and lesson analysis: room, lesson, level{5}, group_concat(user), count
16
+ # Inferred fields: [lesson] => [chapter]
17
+ %w[not_done bad good1 good3 good5].each do |level|
18
+ always ETL::LessonLog.where(:level => level),
19
+ :group_by_columns => [
20
+ {:column_name => :room, :type => :string},
21
+ {:column_name => :lesson, :type => :string}
22
+ ],
23
+ :group_concat_columns => [:user]
24
+ end
25
+
26
+ # ********
27
+ # **Detail**
28
+ # ********
29
+ # Query conditions: [activity]
30
+ # Activity analysis: room, problem, answer, group_concat(user, duration), count
31
+ # Inferred fields: [problem] => [activity, lesson]
32
+ always ETL::ProblemLog,
33
+ :group_by_columns => [
34
+ {:column_name => :room, :type => :string},
35
+ {:column_name => :problem, :type => :string},
36
+ # statlysis uses column_name to build the table name, which is why the no_index option exists
37
+ {:column_name => :answer, :type => :string, :no_index => true}
38
+ ],
39
+ :group_concat_columns => [:user, :duration]
40
+
41
+ end
data/lib/statlysis.rb CHANGED
@@ -31,7 +31,7 @@ module Statlysis
31
31
 
32
32
  logger.info "Start to setup Statlysis" if ENV['DEBUG']
33
33
  time_log do
34
- self.config.instance_exec(&blk)
34
+ self.configuration.instance_exec(&blk)
35
35
  end
36
36
  end
37
37
 
@@ -44,10 +44,10 @@ module Statlysis
44
44
  end
45
45
 
46
46
  # delegate config methods to Configuration
47
- def config; Configuration.instance end
47
+ def configuration; Configuration.instance end
48
48
  require 'active_support/core_ext/module/delegation.rb'
49
49
  Configuration::DelegateMethods.each do |sym|
50
- delegate sym, :to => :config
50
+ delegate sym, :to => :configuration
51
51
  end
52
52
 
53
53
  attr_accessor :logger
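
Reviewer note: the rename from `config` to `configuration` keeps the same delegation pattern, in which module-level calls are forwarded to a singleton configuration object. A minimal sketch of that pattern follows, using the standard library's `Forwardable` instead of ActiveSupport's `delegate` so that it runs without dependencies. `Demo`, `database_name` and the body of `set_database` are assumptions made for illustration, not the gem's code.

```ruby
require 'forwardable'
require 'singleton'

module Demo
  # Hypothetical stand-in for Statlysis::Configuration.
  class Configuration
    include Singleton
    attr_accessor :database_name

    def set_database(name)
      self.database_name = name
    end
  end

  class << self
    extend Forwardable

    def configuration; Configuration.instance end

    # Forward selected module-level calls to the singleton.
    def_delegators :configuration, :set_database, :database_name

    # Evaluate the block with the configuration object as self,
    # which is what lets DSL calls appear without a receiver.
    def setup(&blk)
      configuration.instance_exec(&blk)
    end
  end
end

Demo.setup { set_database :statlysis }
puts Demo.database_name
```

Both `Demo.set_database(...)` and a bare `set_database ...` inside `setup` reach the same object, which is why a single accessor rename touches every call site in this diff.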
@@ -56,9 +56,9 @@ module Statlysis
56
56
  def source_to_database_type; @_source_to_database_type ||= {} end
57
57
 
58
58
  # Proxy access to the crons of each time unit
59
- def daily; CronSet.new(Statlysis.config.day_crons) end
60
- def hourly; CronSet.new(Statlysis.config.hour_crons) end
61
- def always; CronSet.new(Statlysis.config.always_crons) end
59
+ def daily; CronSet.new(Statlysis.configuration.day_crons) end
60
+ def hourly; CronSet.new(Statlysis.configuration.hour_crons) end
61
+ def always; CronSet.new(Statlysis.configuration.always_crons) end
62
62
 
63
63
  end
64
64
 
@@ -6,9 +6,7 @@ module Statlysis
6
6
  include Common
7
7
 
8
8
  # feature is a string
9
- def initialize feature, default_time
10
- raise "Please assign default_time params" if not default_time
11
-
9
+ def initialize feature, default_time = nil
12
10
  # init table & model
13
11
  cron.stat_table_name = [Statlysis.tablename_default_pre, 'clocks'].compact.join("_")
14
12
  unless Statlysis.sequel.table_exists?(cron.stat_table_name)
@@ -23,12 +21,13 @@ module Statlysis
23
21
  cron.stat_model = h[:model]
24
22
 
25
23
  # init default_time
24
+ default_time ||= DateTime.now
26
25
  cron.clock = cron.stat_model.find_or_create(:feature => feature)
27
26
  cron.clock.update :t => default_time if cron.current.nil?
28
27
  cron
29
28
  end
30
29
 
31
- def update time
30
+ def update time = DateTime.now
32
31
  time = DateTime.now if time == DateTime1970
33
32
  return false if time && (time < cron.current)
34
33
  cron.clock.update :t => time
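
Reviewer note: after this change the clock no longer requires a starting time, and `update` defaults to the current time. The in-memory sketch below mirrors that behavior under simplifying assumptions: it has no database table, `SketchClock` is a hypothetical name, and returning `true` on success is a choice made here, not something the diff shows.

```ruby
require 'date'

# In-memory stand-in for the gem's clock row (an assumption for this sketch).
class SketchClock
  attr_reader :feature, :current

  def initialize(feature, default_time = nil)
    @feature = feature
    # Fall back to "now" when the caller gives no starting time.
    @current = default_time || DateTime.now
  end

  # Move the clock forward; refuse to move it backwards.
  def update(time = DateTime.now)
    return false if time < @current
    @current = time
    true
  end
end

clock = SketchClock.new('last_updated_at__demo', DateTime.new(2013, 12, 3))
puts clock.update(DateTime.new(2013, 12, 30))
puts clock.update(DateTime.new(2013, 1, 1))
puts clock.current
```

The optional arguments are what allow the new `clock.update` call at the end of a cron run to be written without parameters.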
@@ -3,6 +3,7 @@
3
3
  module Statlysis
4
4
  class Cron
5
5
  attr_reader :multiple_dataset, :source_type, :time_column, :time_unit, :time_zone
6
+ attr_reader :clock
6
7
  include Common
7
8
 
8
9
  def initialize s, opts = {}
@@ -34,6 +34,9 @@ module Statlysis
34
34
  end
35
35
  end
36
36
 
37
+ # record last executed time
38
+ clock.update
39
+
37
40
  return self
38
41
  end
39
42
 
@@ -90,7 +93,7 @@ module Statlysis
90
93
  _truncated_columns = _group_by_columns_index_name.dup # only String column
91
94
  _group_by_columns_index_name = _group_by_columns_index_name.unshift :t if cron.time_column?
92
95
  # TODO use https://github.com/german/redis_orm to support full string indexes
93
- if !Statlysis.config.is_skip_database_index && _group_by_columns_index_name.any?
96
+ if !Statlysis.configuration.is_skip_database_index && _group_by_columns_index_name.any?
94
97
  mysql_per_column_length_limit_in_one_index = (1000 / 3.0 / _group_by_columns_index_name.size.to_f).to_i
95
98
  index_columns_str = _group_by_columns_index_name.map {|s| _truncated_columns.include?(s) ? "#{s.to_s}(#{mysql_per_column_length_limit_in_one_index})" : s.to_s }.join(", ")
96
99
  index_columns_str = "(#{index_columns_str})"
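
Reviewer note: the index code above splits a fixed key budget evenly across the indexed columns and applies a prefix length only to the string columns. The sketch below isolates that arithmetic. The constants 1000 and 3.0 are copied from the diff; reading them as a 1000-byte key limit and 3 bytes per utf8 character is an assumption. `index_columns_sql` is a hypothetical helper name.

```ruby
# Build the column list of an index, giving each string column an equal
# share of the key budget as its prefix length.
def index_columns_sql(columns, string_columns, budget_bytes = 1000, bytes_per_char = 3.0)
  per_column = (budget_bytes / bytes_per_char / columns.size.to_f).to_i
  parts = columns.map do |col|
    string_columns.include?(col) ? "#{col}(#{per_column})" : col.to_s
  end
  "(#{parts.join(', ')})"
end

puts index_columns_sql([:t, :room, :lesson], [:room, :lesson])
# prints "(t, room(111), lesson(111))"
```

Note that the time column `t` also takes a share of the budget even though it gets no prefix, so the string prefixes shrink as more columns are indexed.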
@@ -151,6 +154,8 @@ module Statlysis
151
154
 
152
155
  private
153
156
  def remodel
157
+ @clock ||= reclock
158
+
154
159
  n = cron.stat_table_name.to_s.singularize.camelize
155
160
  cron.stat_model = class_eval <<-MODEL, __FILE__, __LINE__+1
156
161
  class ::#{n} < Sequel::Model;
@@ -162,6 +167,11 @@ module Statlysis
162
167
  MODEL
163
168
  end
164
169
 
170
+ def reclock
171
+ # setup a clock to record the last updated
172
+ @clock = Clock.new "last_updated_at__#{cron.stat_table_name}"
173
+ end
174
+
165
175
  end
166
176
  end
167
177
 
@@ -7,23 +7,23 @@ namespace :statlysis do
7
7
  Statlysis::TimeUnits.each do |unit|
8
8
  desc "statistical in #{unit}"
9
9
  only_one_task "#{unit}_count" => :environment do
10
- Statlysis.config.send("#{unit}_crons").map(&:run)
10
+ Statlysis.configuration.send("#{unit}_crons").map(&:run)
11
11
  end
12
12
  end
13
13
 
14
14
  desc "realtime process"
15
15
  only_one_task :realtime_process => :environment do
16
- loop { Statlysis.config.realtime_crons.map(&:run); sleep 1 }
16
+ loop { Statlysis.configuration.realtime_crons.map(&:run); sleep 1 }
17
17
  end
18
18
 
19
19
  desc "similar process"
20
20
  only_one_task :similar_process => :environment do
21
- Statlysis.config.similar_crons.map(&:run)
21
+ Statlysis.configuration.similar_crons.map(&:run)
22
22
  end
23
23
 
24
24
  desc "hotest process"
25
25
  only_one_task :hotest_process => :environment do
26
- Statlysis.config.hotest_crons.map(&:run)
26
+ Statlysis.configuration.hotest_crons.map(&:run)
27
27
  end
28
28
 
29
29
  end
data/statlysis.gemspec CHANGED
@@ -4,13 +4,13 @@ $:.push File.expand_path("../lib", __FILE__)
4
4
 
5
5
  Gem::Specification.new do |s|
6
6
  s.name = 'statlysis'
7
- s.version = '0.0.3'
8
- s.date = '2013-12-03'
7
+ s.version = '0.0.7'
8
+ s.date = '2013-12-30'
9
9
  s.summary = File.read("README.markdown").split(/===+/)[1].strip.split("\n")[0]
10
10
  s.description = s.summary
11
11
  s.authors = ["David Chen"]
12
12
  s.email = 'mvjome@gmail.com'
13
- s.homepage = 'https://github.com/SunshineLibrary/statlysis'
13
+ s.homepage = 'https://github.com/mvj3/statlysis'
14
14
  s.license = 'MIT'
15
15
 
16
16
  s.files = `git ls-files`.split("\n")
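
Reviewer note: the gem summary is derived from the README, which is why the README edit in this release also changes `summary` and `description` in the metadata. The sketch below applies the same expression from the gemspec to an inline string. `summary_from_readme` is a hypothetical wrapper, and the sample text is a shortened copy of the new README header.

```ruby
# Take the first non-empty line after the "====" title underline.
def summary_from_readme(text)
  text.split(/===+/)[1].strip.split("\n")[0]
end

readme = <<~README
  Statlysis
  ===============================================
  Statistical and analysis in Ruby DSL, just as simple as SQL operations in ActiveRecord.

  Usage
  -----------------------------------------------
README

puts summary_from_readme(readme)
```

The expression depends on the title underline being the first run of three or more `=` characters in the file.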
data/test/helper.rb CHANGED
@@ -28,8 +28,8 @@ require 'sqlite3'
28
28
 
29
29
  # load ActiveRecord setup
30
30
  Statlysis.set_database ":memory:"
31
- Statlysis.config.is_skip_database_index = true
32
- ActiveRecord::Base.establish_connection(Statlysis.config.database_opts.merge("adapter" => "sqlite3"))
31
+ Statlysis.configuration.is_skip_database_index = true
32
+ ActiveRecord::Base.establish_connection(Statlysis.configuration.database_opts.merge("adapter" => "sqlite3"))
33
33
  Dir[File.expand_path("../migrate/*.rb", __FILE__).to_s].each { |f| require f }
34
34
  Dir[File.expand_path("../models/*.rb", __FILE__).to_s].each { |f| require f }
35
35
 
@@ -45,6 +45,24 @@ csv.each do |row|
45
45
  end
46
46
 
47
47
 
48
+
49
+
50
+
51
+
52
+
53
+
54
+
55
+
56
+ (require 'pry-debugger';binding.pry) if ENV['DEBUG']
57
+
58
+
59
+
60
+
61
+
62
+
63
+
64
+
65
+
48
66
  Statlysis.setup do
49
67
  hourly EoeLog, :time_column => :t
50
68
 
@@ -62,5 +80,4 @@ Statlysis.setup do
62
80
  cron1 = Statlysis.daily['mul'][1]
63
81
  cron2 = Statlysis.daily['cod'][0]
64
82
  cron3 = Statlysis.always['code']['mongoid'][0]
65
- require 'pry-debugger';binding.pry
66
83
  end
@@ -1,8 +1,9 @@
1
1
  # encoding: UTF-8
2
+ # NOTE The statistics below depend on the code_gists test data.
2
3
 
3
4
  require 'helper'
4
5
 
5
- class TestDailyCount < Test::Unit::TestCase
6
+ class TestGenerallyCount < Test::Unit::TestCase
6
7
  def setup
7
8
  @output = Statlysis.daily['code_gist'].first.output
8
9
  end
@@ -20,5 +21,4 @@ class TestDailyCount < Test::Unit::TestCase
20
21
  assert_equal @output[-1][:totally_favcount_s].to_i, CodeGist.all.map(&:fav_count).reduce(:+)
21
22
  end
22
23
 
23
-
24
24
  end
@@ -2,7 +2,7 @@
2
2
 
3
3
  require 'helper'
4
4
 
5
- class TestStatlysis < Test::Unit::TestCase
5
+ class TestManipulateTableAndModel < Test::Unit::TestCase
6
6
  def setup
7
7
  @old_datetime = Time.zone.parse("20130105")
8
8
  end
@@ -7,12 +7,18 @@ class TestMapReduce < Test::Unit::TestCase
7
7
  end
8
8
 
9
9
  def test_multiple_dimensions_output_without_time_column
10
+ before_time = DateTime.now
10
11
  cron = Statlysis.always['mongoid']['code'][0]
11
12
  assert_equal cron.time_column, false
12
13
  assert_equal cron.time_unit, false
13
14
  assert_equal cron.stat_table_name, 'timely_codegistmongoids_author_a'
14
15
 
15
16
  cron.run
17
+
18
+ # Test the clock that records the last execution time
19
+ # TODO may be moved elsewhere
20
+ assert((cron.clock.current - before_time) > 0)
21
+
16
22
  assert_equal cron.output.detect {|h| h[:author] == 'mvj3' }[:c].to_i, cron.multiple_dataset.sources.first.where(:author => 'mvj3').count
17
23
  end
18
24
 
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: statlysis
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.3
4
+ version: 0.0.7
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2013-12-03 00:00:00.000000000 Z
12
+ date: 2013-12-30 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: rake
@@ -251,7 +251,8 @@ dependencies:
251
251
  - - ! '>='
252
252
  - !ruby/object:Gem::Version
253
253
  version: '0'
254
- description: Statistical & Analysis in Ruby DSL
254
+ description: Statistical and analysis in Ruby DSL, just as simple as SQL operations
255
+ in ActiveRecord.
255
256
  email: mvjome@gmail.com
256
257
  executables: []
257
258
  extensions: []
@@ -263,6 +264,8 @@ files:
263
264
  - Guardfile
264
265
  - README.markdown
265
266
  - Rakefile
267
+ - examples/eoecn.rb
268
+ - examples/sunshinelibrary.rb
266
269
  - lib/statlysis.rb
267
270
  - lib/statlysis/clock.rb
268
271
  - lib/statlysis/common.rb
@@ -298,12 +301,12 @@ files:
298
301
  - test/models/.gitkeep
299
302
  - test/models/code_gist.rb
300
303
  - test/models/eoe_log.rb
301
- - test/test_daily_count.rb
304
+ - test/test_generally_count.rb
305
+ - test/test_manipulate_table_and_model.rb
302
306
  - test/test_mapreduce.rb
303
307
  - test/test_single_log_in_multiple_collections.rb
304
- - test/test_statlysis.rb
305
308
  - test/test_timeseries.rb
306
- homepage: https://github.com/SunshineLibrary/statlysis
309
+ homepage: https://github.com/mvj3/statlysis
307
310
  licenses:
308
311
  - MIT
309
312
  post_install_message:
@@ -316,22 +319,17 @@ required_ruby_version: !ruby/object:Gem::Requirement
316
319
  - - ! '>='
317
320
  - !ruby/object:Gem::Version
318
321
  version: '0'
319
- segments:
320
- - 0
321
- hash: -1643509325996557122
322
322
  required_rubygems_version: !ruby/object:Gem::Requirement
323
323
  none: false
324
324
  requirements:
325
325
  - - ! '>='
326
326
  - !ruby/object:Gem::Version
327
327
  version: '0'
328
- segments:
329
- - 0
330
- hash: -1643509325996557122
331
328
  requirements: []
332
329
  rubyforge_project:
333
330
  rubygems_version: 1.8.23
334
331
  signing_key:
335
332
  specification_version: 3
336
- summary: Statistical & Analysis in Ruby DSL
333
+ summary: Statistical and analysis in Ruby DSL, just as simple as SQL operations in
334
+ ActiveRecord.
337
335
  test_files: []