statlysis 0.0.3 → 0.0.7

data/README.markdown CHANGED
@@ -1,52 +1,45 @@
1
1
  Statlysis
2
2
  ===============================================
3
- Statistical & Analysis in Ruby DSL
3
+ Statistical and analysis in Ruby DSL, just as simple as SQL operations in ActiveRecord.
4
4
 
5
- Usage
5
+ Project origin, philosophy, and usage notes
6
6
  -----------------------------------------------
7
- ### setup
7
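+ This project began as a statistics backend for eoe.cn. Its design draws on lessons learned while building the [Android eoeMarket data collection and analysis system](http://mvj3.github.io/2012/11/01/android_eoemarket_data_collect_and_analysis_system_summary/) in the first half of 2012. The architecture and most of the code were completed in the first half of 2013, with support for two ORMs, ActiveRecord and Mongoid. In the second half of the year, MapReduce support for Mongoid was added at Sunshine Library.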
8
8
 
9
- ```ruby
10
- Statlysis.setup do
11
- set_database :statlysis
9
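+ For typical website analytics, you paste a JavaScript snippet from an analytics service such as Google Analytics at the bottom of each page, and you can then see detailed daily traffic for the site. Internal data needs, such as the number of users registering each day, generally cannot be opened up to a third party to measure, and that is the gap statlysis fills.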
12
10
 
13
- daily CodeGist
14
- hourly EoeLog, :time_column => :t # support custom time_column
11
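+ Anyone who has done data analysis knows the pitfalls. For example, some setups compute statistics directly with SQL JOINs across several tables and rerun the full query for every data request; as the data grows, the performance problems are easy to predict.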
15
12
 
16
- [EoeLog,
17
- EoeLog.where(:ui => 0), # support query scope
18
- EoeLog.where(:ui => {"$ne" => 0}),
19
- Mongoid[/eoe_logs_[0-9]+$/].where(:ui => {"$ne" => 0}), # support collection name regexp
20
- EoeLog.where(:do => {"$in" => [DOMAINS_HASH[:blog], DOMAINS_HASH[:my]]}),
21
- ].each do |s|
22
- daily s, :time_column => :t
23
- end
24
- end
25
- ```
13
+ The following describes how to run statistical analysis with statlysis:
26
14
 
27
- ### access
15
+ #### Whether to clean the data with ETL (Extract, Transform, Load)
16
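+ statlysis holds that the data source must be ETL'd into a flat, single-level dataset with just a few dimensions, because what users ultimately see and understand is a chart of two or three dimensions. The design therefore starts from what the user can understand.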
28
17
 
29
- ```ruby
30
- Statlysis.daily # => return daily crons
31
- Statlysis.daily.run # => run daily crons
32
- Statlysis.daily[/name_regexp/] # => return matched daily crons
33
- ```
18
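+ Note also that if the table can already support the statistics directly but holds a large volume of data, you need to add the relevant indexes, or import the data into a separate single table (in an ORM this can be done in hooks such as `after_save`) and index that table instead.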
34
19
 
35
- ### process
20
+ #### Process
21
+ 1. Analyze the statistics requirements and sketch charts containing the corresponding dimensions.
22
+ 2. ETL; see the ETL (Extract, Transform, Load) section above.
23
+ 3. Configure the data the page needs inside the `Statlysis.setup { }` block. Note that each source must be a single table, like a SQL query with no cross-table JOIN.
24
+ 4. Run the statistics, for example `Statlysis.daily.run`. This step can be driven by a scheduled cron job or by data updates such as `after_save`.
25
+ 5. Write the HTML pages for the people who need the data; the statistics can be queried directly via `Statlysis.daily['code_gists'].first.stat_model` or `TimelyCodegist`.
36
26
 
37
- ```irb
38
- [23] pry(#<Statlysis::Configuration>)> Statlysis.daily['multi'].first
39
- ```
27
+ #### Prefer MongoDB as the statistics data source
28
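+ As a NoSQL database, MongoDB is optimized for reading and writing **whole single records** within a **single collection**, and it supports concurrent MapReduce to speed up the statistics.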
40
29
 
41
- Features
30
+ Success stories
42
31
  -----------------------------------------------
43
- * Support time column that stored as integer.
32
+ * Page-view statistics for each eoe.cn sub-site, plus daily statistics on database tables under multiple conditions, broken down by date; see the [example configuration file](https://github.com/mvj3/statlysis/blob/master/examples/eoecn.rb).
33
+ * Statistical analysis of problem-solving results for Sunshine Library's study-improvement classes, broken down by class; see the [example configuration file](https://github.com/mvj3/statlysis/blob/master/examples/sunshinelibrary.rb).
44
34
 
45
- TODO
35
+ Usage
46
36
  -----------------------------------------------
47
- * Admin interface
48
- * statistical query api in Ruby and HTTP
49
- * Interacting with Javascript charting library, e.g. Highcharts, D3.
37
+ See the configuration files under [Success stories](#success-stories) above and the [step-by-step walkthrough](http://mvj3.github.io/statlysis/showterm.html).
38
+
39
+ Features
40
+ -----------------------------------------------
41
+ * Supports two ORMs, Mongoid and ActiveRecord; Mongoid statistics run via MapReduce, while ActiveRecord statistics use plain SQL.
42
+ * Support time column that stored as integer.
50
43
 
51
44
 
52
45
  Statistical Process
@@ -73,34 +66,13 @@ Q: In Mongodb, why use MapReduce instead of Aggregation?
73
66
  A: The result of aggregation pipeline is a document and is subject to the BSON Document size limit, which is currently 16 megabytes, see more details at http://docs.mongodb.org/manual/core/aggregation-pipeline/#pipeline
74
67
 
75
68
 
76
- Copyright
69
+ TODO
77
70
  -----------------------------------------------
78
- MIT. David Chen at eoe.cn.
71
+ * Admin interface
72
+ * statistical query api in Ruby and HTTP
73
+ * Interacting with Javascript charting library, e.g. Highcharts, D3.
79
74
 
80
75
 
81
- Related
76
+ Copyright
82
77
  -----------------------------------------------
83
- ### Projects
84
- * https://github.com/paulasmuth/fnordmetric FnordMetric is a redis/ruby-based realtime Event-Tracking app
85
- * https://github.com/thirtysixthspan/descriptive_statistics adds methods to the Enumerable module to allow easy calculation of basic descriptive statistics for a set of data
86
- * https://github.com/tmcw/simple-statistics simple statistics for javascript in node and the browser
87
- * https://github.com/clbustos/statsample/ A suite for basic and advanced statistics on Ruby.
88
- * https://github.com/SciRuby/sciruby Tools for scientific computation in Ruby/Rails
89
-
90
- ### Articles
91
- * http://www.slideshare.net/WombatNation/logging-app-behavior-to-mongo-db
92
-
93
- ### Event collector
94
- * https://github.com/fluent
95
- * https://github.com/logstash/logstash
96
-
97
- ### Admin interface
98
- * http://three.kibana.org/ browser based analytics and search interface to Logstash and other timestamped data sets stored in ElasticSearch.
99
-
100
-
101
- ### ETL
102
- * https://github.com/activewarehouse/activewarehouse-etl/
103
- * http://jisraelsen.github.io/drudgery/ ruby ETL DSL, support csv, sqlite3, ActiveRecord, without support time range
104
- * https://github.com/square/ETL Simply encapsulates the SQL procedures
105
-
106
-
78
+ MIT. David Chen at eoe.cn, sunshine-library .
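The new README's process section amounts to pre-aggregating one flat, single-table dataset per chart instead of rerunning a multi-table JOIN on every request. Below is a minimal plain-Ruby sketch of the daily-count idea. The `Row` struct and the `daily_counts` helper are hypothetical names used only for illustration; they are not part of statlysis, which stores its results through its own DSL and stat tables.

```ruby
require 'date'

# Hypothetical flat, single-table rows as they might look after ETL.
Row = Struct.new(:user_id, :created_at)

# Count rows per calendar day.
# Returns a Hash of Date => Integer, ordered by date.
def daily_counts(rows)
  rows.group_by { |r| r.created_at.to_date }
      .map { |day, members| [day, members.size] }
      .sort_by(&:first)
      .to_h
end

rows = [
  Row.new(1, Time.utc(2013, 12, 1, 9, 0)),
  Row.new(2, Time.utc(2013, 12, 1, 18, 30)),
  Row.new(3, Time.utc(2013, 12, 2, 7, 15))
]

counts = daily_counts(rows)
counts.each { |day, c| puts "#{day}: #{c}" }
```

In a real deployment the per-day result would be persisted and only new rows processed on each run, which is what makes the approach cheaper than a full query per request.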
data/examples/eoecn.rb ADDED
@@ -0,0 +1,35 @@
1
+ # encoding: UTF-8
2
+
3
+ Statlysis.setup do
4
+ set_database :statlysis
5
+ update_time_columns :t
6
+ set_tablename_default_pre :st
7
+
8
+ # Total daily page views across the whole site
9
+ @log_model = IS_DEVELOP ? EoeLogTest : EoeLog
10
+ hourly @log_model, :time_column => :t
11
+ daily @log_model, :time_column => :t
12
+ # Page views by logged-in and anonymous users
13
+ daily @log_model.where(:ui => 0), :time_column => :t
14
+ daily @log_model.where(:ui => {"$ne" => 0}), :time_column => :t
15
+
16
+ # Daily page views for each sub-site
17
+ daily @log_model.where(:do => {"$in" => [DOMAINS_HASH[:blog], DOMAINS_HASH[:my]]}), :time_column => :t
18
+ [:www, :code, :skill, :book, :edu, :news, :wiki, :salon, :android].each do |site|
19
+ daily @log_model.where(:do => DOMAINS_HASH[site]), :time_column => :t
20
+ end
21
+
22
+ # Daily changes in each data model under different conditions
23
+ daily CodeGist
24
+ [BlogPost, NewsNews, WikiPage].each do |model|
25
+ daily model.where("create_time > 0"), :time_column => :create_time
26
+ daily model.where("update_time > 0"), :time_column => :update_time
27
+ end
28
+
29
+ daily CommonComment.where("is_delete = 0"), :time_column => :create_time
30
+ daily CommonComment.where(:model => 'blog').where("is_delete = 0"), :time_column => :create_time
31
+ daily CommonComment.where(:model => 'code').where("is_delete = 0"), :time_column => :create_time
32
+
33
+ daily CommonMember.where("regdate > 0"), :time_column => :regdate
34
+
35
+ end
data/examples/sunshinelibrary.rb ADDED
@@ -0,0 +1,41 @@
1
+ # encoding: UTF-8
2
+
3
+ Statlysis.setup do
4
+ set_database :local_statistic
5
+
6
+ daily UserRecord.where(item_type: "activity")
7
+
8
+ # Table relationships: subject <= chapter <= lesson <= activity <= problem
9
+ # room is bound to [user, duration] and similar fields
10
+
11
+ # ********
12
+ # **List**
13
+ # ********
14
+ # Query condition: [chapter]
15
+ # Chapter/lesson analysis: room, lesson, level{5}, group_concat(user), count
16
+ # Inferred fields: [lesson] => [chapter]
17
+ %w[not_done bad good1 good3 good5].each do |level|
18
+ always ETL::LessonLog.where(:level => level),
19
+ :group_by_columns => [
20
+ {:column_name => :room, :type => :string},
21
+ {:column_name => :lesson, :type => :string}
22
+ ],
23
+ :group_concat_columns => [:user]
24
+ end
25
+
26
+ # ********
27
+ # **Details**
28
+ # ********
29
+ # Query condition: [activity]
30
+ # Activity analysis: room, problem, answer, group_concat(user, duration), count
31
+ # Inferred fields: [problem] => [activity, lesson]
32
+ always ETL::ProblemLog,
33
+ :group_by_columns => [
34
+ {:column_name => :room, :type => :string},
35
+ {:column_name => :problem, :type => :string},
36
+ # statlysis uses column_name to build the table name, which is why the no_index option exists
37
+ {:column_name => :answer, :type => :string, :no_index => true}
38
+ ],
39
+ :group_concat_columns => [:user, :duration]
40
+
41
+ end
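The `:group_by_columns` and `:group_concat_columns` options in the example above read like a SQL `GROUP BY` combined with `GROUP_CONCAT` and a count. The plain-Ruby sketch below shows that shape of result under stated assumptions: the `LessonLog` struct, the `group_with_concat` helper, and the sample rows are all hypothetical, and the `:c` key for the count is borrowed from the `[:c]` lookup in the gem's MapReduce test rather than from any documented output format.

```ruby
# Hypothetical flat log rows; field names only loosely mirror the example config.
LessonLog = Struct.new(:room, :lesson, :level, :user)

# Group rows by the key columns and return one summary Hash per group,
# holding the key values, a comma-joined string per concat column,
# and the group size under :c.
def group_with_concat(rows, keys:, concat:)
  rows.group_by { |r| keys.map { |col| r[col] } }.map do |key_values, members|
    summary = keys.zip(key_values).to_h
    concat.each { |col| summary[col] = members.map { |m| m[col] }.join(",") }
    summary[:c] = members.size
    summary
  end
end

logs = [
  LessonLog.new("room1", "lesson1", "good5", "alice"),
  LessonLog.new("room1", "lesson1", "good5", "bob"),
  LessonLog.new("room2", "lesson1", "good5", "carol")
]

result = group_with_concat(logs, keys: [:room, :lesson], concat: [:user])
result.each { |row| puts row.inspect }
```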
data/lib/statlysis.rb CHANGED
@@ -31,7 +31,7 @@ module Statlysis
31
31
 
32
32
  logger.info "Start to setup Statlysis" if ENV['DEBUG']
33
33
  time_log do
34
- self.config.instance_exec(&blk)
34
+ self.configuration.instance_exec(&blk)
35
35
  end
36
36
  end
37
37
 
@@ -44,10 +44,10 @@ module Statlysis
44
44
  end
45
45
 
46
46
  # delegate config methods to Configuration
47
- def config; Configuration.instance end
47
+ def configuration; Configuration.instance end
48
48
  require 'active_support/core_ext/module/delegation.rb'
49
49
  Configuration::DelegateMethods.each do |sym|
50
- delegate sym, :to => :config
50
+ delegate sym, :to => :configuration
51
51
  end
52
52
 
53
53
  attr_accessor :logger
@@ -56,9 +56,9 @@ module Statlysis
56
56
  def source_to_database_type; @_source_to_database_type ||= {} end
57
57
 
58
58
  # Proxy access to the crons of each time unit
59
- def daily; CronSet.new(Statlysis.config.day_crons) end
60
- def hourly; CronSet.new(Statlysis.config.hour_crons) end
61
- def always; CronSet.new(Statlysis.config.always_crons) end
59
+ def daily; CronSet.new(Statlysis.configuration.day_crons) end
60
+ def hourly; CronSet.new(Statlysis.configuration.hour_crons) end
61
+ def always; CronSet.new(Statlysis.configuration.always_crons) end
62
62
 
63
63
  end
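The rename from `config` to `configuration` touches a small pattern: a singleton configuration object, module-level delegation to it, and a `setup` block evaluated with `instance_exec`. The sketch below reproduces that pattern in isolation. It is an approximation, not the gem's code: `MiniStat`, its two delegated methods, and the use of the standard library's `Forwardable` in place of ActiveSupport's `delegate` are all assumptions made to keep the example self-contained.

```ruby
require 'singleton'
require 'forwardable'

module MiniStat
  class Configuration
    include Singleton

    # Hypothetical list of methods exposed at module level.
    DelegateMethods = [:set_database, :database_name]

    attr_reader :database_name

    def set_database(name)
      @database_name = name
      self
    end
  end

  class << self
    extend Forwardable

    # Single access point for the configuration singleton.
    def configuration
      Configuration.instance
    end

    # Forward each listed method from the module to the singleton.
    def_delegators :configuration, *Configuration::DelegateMethods

    # Evaluate the block with the configuration object as self,
    # so DSL calls inside the block hit Configuration methods.
    def setup(&blk)
      configuration.instance_exec(&blk)
    end
  end
end

MiniStat.setup { set_database :statlysis }
puts MiniStat.database_name
```

Because every caller goes through one accessor, a rename like the one in this diff only requires changing the accessor and the delegation target, plus any call sites that named the accessor directly.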
64
64
 
@@ -6,9 +6,7 @@ module Statlysis
6
6
  include Common
7
7
 
8
8
  # feature is a string
9
- def initialize feature, default_time
10
- raise "Please assign default_time params" if not default_time
11
-
9
+ def initialize feature, default_time = nil
12
10
  # init table & model
13
11
  cron.stat_table_name = [Statlysis.tablename_default_pre, 'clocks'].compact.join("_")
14
12
  unless Statlysis.sequel.table_exists?(cron.stat_table_name)
@@ -23,12 +21,13 @@ module Statlysis
23
21
  cron.stat_model = h[:model]
24
22
 
25
23
  # init default_time
24
+ default_time ||= DateTime.now
26
25
  cron.clock = cron.stat_model.find_or_create(:feature => feature)
27
26
  cron.clock.update :t => default_time if cron.current.nil?
28
27
  cron
29
28
  end
30
29
 
31
- def update time
30
+ def update time = DateTime.now
32
31
  time = DateTime.now if time == DateTime1970
33
32
  return false if time && (time < cron.current)
34
33
  cron.clock.update :t => time
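The clock changes above make `default_time` optional and let `update` default to the current time, while still refusing to move the clock backwards. A minimal in-memory sketch of that behaviour follows. `MiniClock` is a hypothetical stand-in: the gem's clock persists its value in a database row, and its `update` also special-cases `DateTime1970`, which this sketch leaves out.

```ruby
require 'date'

# In-memory stand-in for a persisted "last updated" clock.
class MiniClock
  attr_reader :feature, :current

  # default_time is optional; fall back to the current time when omitted.
  def initialize(feature, default_time = nil)
    @feature = feature
    @current = default_time || DateTime.now
  end

  # Move the clock forward to `time` (now by default).
  # Returns false and leaves the clock untouched if `time` is in the past
  # relative to the stored value.
  def update(time = DateTime.now)
    return false if time < @current
    @current = time
    true
  end
end

before = DateTime.now
clock = MiniClock.new("last_updated_at__demo", DateTime.new(2013, 12, 1))
clock.update                                  # record "now" as the last run
moved_back = clock.update(DateTime.new(2013, 1, 1))
puts clock.current
```

Subtracting two `DateTime` values yields a `Rational` number of days, which is why the test added in this release can compare `cron.clock.current - before_time` against `0`.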
@@ -3,6 +3,7 @@
3
3
  module Statlysis
4
4
  class Cron
5
5
  attr_reader :multiple_dataset, :source_type, :time_column, :time_unit, :time_zone
6
+ attr_reader :clock
6
7
  include Common
7
8
 
8
9
  def initialize s, opts = {}
@@ -34,6 +34,9 @@ module Statlysis
34
34
  end
35
35
  end
36
36
 
37
+ # record last executed time
38
+ clock.update
39
+
37
40
  return self
38
41
  end
39
42
 
@@ -90,7 +93,7 @@ module Statlysis
90
93
  _truncated_columns = _group_by_columns_index_name.dup # only String column
91
94
  _group_by_columns_index_name = _group_by_columns_index_name.unshift :t if cron.time_column?
92
95
  # TODO use https://github.com/german/redis_orm to support full string indexes
93
- if !Statlysis.config.is_skip_database_index && _group_by_columns_index_name.any?
96
+ if !Statlysis.configuration.is_skip_database_index && _group_by_columns_index_name.any?
94
97
  mysql_per_column_length_limit_in_one_index = (1000 / 3.0 / _group_by_columns_index_name.size.to_f).to_i
95
98
  index_columns_str = _group_by_columns_index_name.map {|s| _truncated_columns.include?(s) ? "#{s.to_s}(#{mysql_per_column_length_limit_in_one_index})" : s.to_s }.join(", ")
96
99
  index_columns_str = "(#{index_columns_str})"
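The prefix-length arithmetic in this hunk can be worked through in isolation. The sketch below assumes, as the constants in the diff suggest, a 1000-byte budget per index key and up to 3 bytes per character. `prefix_length_per_column` and `index_columns_sql` are hypothetical helper names, not methods from the gem.

```ruby
# Assumed limits, taken from the constants visible in the diff:
# 1000 bytes per index key, up to 3 bytes per character.
MAX_INDEX_BYTES = 1000
BYTES_PER_CHAR  = 3.0

# Characters each column may contribute when the budget is split
# evenly across all columns in the index.
def prefix_length_per_column(column_count)
  (MAX_INDEX_BYTES / BYTES_PER_CHAR / column_count.to_f).to_i
end

# Build the "(col_a, col_b(N), ...)" fragment of an index definition.
# Only columns listed in string_columns get a prefix length.
def index_columns_sql(columns, string_columns)
  limit = prefix_length_per_column(columns.size)
  parts = columns.map do |col|
    string_columns.include?(col) ? "#{col}(#{limit})" : col.to_s
  end
  "(#{parts.join(', ')})"
end

sql = index_columns_sql([:t, :room, :lesson], [:room, :lesson])
puts sql
```

With three columns the budget works out to 111 characters per string column. As in the diff, the time column counts toward the divisor even though it is not truncated, so the split is conservative.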
@@ -151,6 +154,8 @@ module Statlysis
151
154
 
152
155
  private
153
156
  def remodel
157
+ @clock ||= reclock
158
+
154
159
  n = cron.stat_table_name.to_s.singularize.camelize
155
160
  cron.stat_model = class_eval <<-MODEL, __FILE__, __LINE__+1
156
161
  class ::#{n} < Sequel::Model;
@@ -162,6 +167,11 @@ module Statlysis
162
167
  MODEL
163
168
  end
164
169
 
170
+ def reclock
171
+ # setup a clock to record the last updated
172
+ @clock = Clock.new "last_updated_at__#{cron.stat_table_name}"
173
+ end
174
+
165
175
  end
166
176
  end
167
177
 
@@ -7,23 +7,23 @@ namespace :statlysis do
7
7
  Statlysis::TimeUnits.each do |unit|
8
8
  desc "statistical in #{unit}"
9
9
  only_one_task "#{unit}_count" => :environment do
10
- Statlysis.config.send("#{unit}_crons").map(&:run)
10
+ Statlysis.configuration.send("#{unit}_crons").map(&:run)
11
11
  end
12
12
  end
13
13
 
14
14
  desc "realtime process"
15
15
  only_one_task :realtime_process => :environment do
16
- loop { Statlysis.config.realtime_crons.map(&:run); sleep 1 }
16
+ loop { Statlysis.configuration.realtime_crons.map(&:run); sleep 1 }
17
17
  end
18
18
 
19
19
  desc "similar process"
20
20
  only_one_task :similar_process => :environment do
21
- Statlysis.config.similar_crons.map(&:run)
21
+ Statlysis.configuration.similar_crons.map(&:run)
22
22
  end
23
23
 
24
24
  desc "hotest process"
25
25
  only_one_task :hotest_process => :environment do
26
- Statlysis.config.hotest_crons.map(&:run)
26
+ Statlysis.configuration.hotest_crons.map(&:run)
27
27
  end
28
28
 
29
29
  end
data/statlysis.gemspec CHANGED
@@ -4,13 +4,13 @@ $:.push File.expand_path("../lib", __FILE__)
4
4
 
5
5
  Gem::Specification.new do |s|
6
6
  s.name = 'statlysis'
7
- s.version = '0.0.3'
8
- s.date = '2013-12-03'
7
+ s.version = '0.0.7'
8
+ s.date = '2013-12-30'
9
9
  s.summary = File.read("README.markdown").split(/===+/)[1].strip.split("\n")[0]
10
10
  s.description = s.summary
11
11
  s.authors = ["David Chen"]
12
12
  s.email = 'mvjome@gmail.com'
13
- s.homepage = 'https://github.com/SunshineLibrary/statlysis'
13
+ s.homepage = 'https://github.com/mvj3/statlysis'
14
14
  s.license = 'MIT'
15
15
 
16
16
  s.files = `git ls-files`.split("\n")
data/test/helper.rb CHANGED
@@ -28,8 +28,8 @@ require 'sqlite3'
28
28
 
29
29
  # load ActiveRecord setup
30
30
  Statlysis.set_database ":memory:"
31
- Statlysis.config.is_skip_database_index = true
32
- ActiveRecord::Base.establish_connection(Statlysis.config.database_opts.merge("adapter" => "sqlite3"))
31
+ Statlysis.configuration.is_skip_database_index = true
32
+ ActiveRecord::Base.establish_connection(Statlysis.configuration.database_opts.merge("adapter" => "sqlite3"))
33
33
  Dir[File.expand_path("../migrate/*.rb", __FILE__).to_s].each { |f| require f }
34
34
  Dir[File.expand_path("../models/*.rb", __FILE__).to_s].each { |f| require f }
35
35
 
@@ -45,6 +45,24 @@ csv.each do |row|
45
45
  end
46
46
 
47
47
 
48
+
49
+
50
+
51
+
52
+
53
+
54
+
55
+
56
+ (require 'pry-debugger';binding.pry) if ENV['DEBUG']
57
+
58
+
59
+
60
+
61
+
62
+
63
+
64
+
65
+
48
66
  Statlysis.setup do
49
67
  hourly EoeLog, :time_column => :t
50
68
 
@@ -62,5 +80,4 @@ Statlysis.setup do
62
80
  cron1 = Statlysis.daily['mul'][1]
63
81
  cron2 = Statlysis.daily['cod'][0]
64
82
  cron3 = Statlysis.always['code']['mongoid'][0]
65
- require 'pry-debugger';binding.pry
66
83
  end
@@ -1,8 +1,9 @@
1
1
  # encoding: UTF-8
2
+ # NOTE: the statistics below depend on the code_gists test data.
2
3
 
3
4
  require 'helper'
4
5
 
5
- class TestDailyCount < Test::Unit::TestCase
6
+ class TestGenerallyCount < Test::Unit::TestCase
6
7
  def setup
7
8
  @output = Statlysis.daily['code_gist'].first.output
8
9
  end
@@ -20,5 +21,4 @@ class TestDailyCount < Test::Unit::TestCase
20
21
  assert_equal @output[-1][:totally_favcount_s].to_i, CodeGist.all.map(&:fav_count).reduce(:+)
21
22
  end
22
23
 
23
-
24
24
  end
@@ -2,7 +2,7 @@
2
2
 
3
3
  require 'helper'
4
4
 
5
- class TestStatlysis < Test::Unit::TestCase
5
+ class TestManipulateTableAndModel < Test::Unit::TestCase
6
6
  def setup
7
7
  @old_datetime = Time.zone.parse("20130105")
8
8
  end
@@ -7,12 +7,18 @@ class TestMapReduce < Test::Unit::TestCase
7
7
  end
8
8
 
9
9
  def test_multiple_dimensions_output_without_time_column
10
+ before_time = DateTime.now
10
11
  cron = Statlysis.always['mongoid']['code'][0]
11
12
  assert_equal cron.time_column, false
12
13
  assert_equal cron.time_unit, false
13
14
  assert_equal cron.stat_table_name, 'timely_codegistmongoids_author_a'
14
15
 
15
16
  cron.run
17
+
18
+ # Test the clock that records the last execution time
19
+ # TODO: may be moved elsewhere
20
+ assert((cron.clock.current - before_time) > 0)
21
+
16
22
  assert_equal cron.output.detect {|h| h[:author] == 'mvj3' }[:c].to_i, cron.multiple_dataset.sources.first.where(:author => 'mvj3').count
17
23
  end
18
24
 
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: statlysis
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.3
4
+ version: 0.0.7
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2013-12-03 00:00:00.000000000 Z
12
+ date: 2013-12-30 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: rake
@@ -251,7 +251,8 @@ dependencies:
251
251
  - - ! '>='
252
252
  - !ruby/object:Gem::Version
253
253
  version: '0'
254
- description: Statistical & Analysis in Ruby DSL
254
+ description: Statistical and analysis in Ruby DSL, just as simple as SQL operations
255
+ in ActiveRecord.
255
256
  email: mvjome@gmail.com
256
257
  executables: []
257
258
  extensions: []
@@ -263,6 +264,8 @@ files:
263
264
  - Guardfile
264
265
  - README.markdown
265
266
  - Rakefile
267
+ - examples/eoecn.rb
268
+ - examples/sunshinelibrary.rb
266
269
  - lib/statlysis.rb
267
270
  - lib/statlysis/clock.rb
268
271
  - lib/statlysis/common.rb
@@ -298,12 +301,12 @@ files:
298
301
  - test/models/.gitkeep
299
302
  - test/models/code_gist.rb
300
303
  - test/models/eoe_log.rb
301
- - test/test_daily_count.rb
304
+ - test/test_generally_count.rb
305
+ - test/test_manipulate_table_and_model.rb
302
306
  - test/test_mapreduce.rb
303
307
  - test/test_single_log_in_multiple_collections.rb
304
- - test/test_statlysis.rb
305
308
  - test/test_timeseries.rb
306
- homepage: https://github.com/SunshineLibrary/statlysis
309
+ homepage: https://github.com/mvj3/statlysis
307
310
  licenses:
308
311
  - MIT
309
312
  post_install_message:
@@ -316,22 +319,17 @@ required_ruby_version: !ruby/object:Gem::Requirement
316
319
  - - ! '>='
317
320
  - !ruby/object:Gem::Version
318
321
  version: '0'
319
- segments:
320
- - 0
321
- hash: -1643509325996557122
322
322
  required_rubygems_version: !ruby/object:Gem::Requirement
323
323
  none: false
324
324
  requirements:
325
325
  - - ! '>='
326
326
  - !ruby/object:Gem::Version
327
327
  version: '0'
328
- segments:
329
- - 0
330
- hash: -1643509325996557122
331
328
  requirements: []
332
329
  rubyforge_project:
333
330
  rubygems_version: 1.8.23
334
331
  signing_key:
335
332
  specification_version: 3
336
- summary: Statistical & Analysis in Ruby DSL
333
+ summary: Statistical and analysis in Ruby DSL, just as simple as SQL operations in
334
+ ActiveRecord.
337
335
  test_files: []