daru 0.0.3.1 → 0.0.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: e0388674c6bcbd695b0c456529e987cce5124052
- data.tar.gz: 4433ab2ef63dd8908e12c43937dbb704abccba43
+ metadata.gz: dc60c1a59b4c112dface4bdd6d0a40cf1bcc499c
+ data.tar.gz: aec8fd874e2e74ded4eeb2cebe8503b5a3cbefbc
  SHA512:
- metadata.gz: 37c6930a8a54e567e92efaa68c04c0774a33a7c08b6d90f66b9ee3309a09d05713d265c0dc502dd05ad36658ebd992f50df0994f96f84da5b2bff2576d5d78e7
- data.tar.gz: 9985f10d9d66615a0fce7a949d03765029bb186ec7f69e91b6397d1a9a51bf2849ede0c64303719918bb097e3cff0f32ba52a610a08a0f92803805dc1a298d88
+ metadata.gz: 091f1ce3e0da8b9083dc6f53a9977550def5d10a4f759797ec56cb6b90788aa8d6a7fb8ad1a0e4105c640b5987e0b200877a12441cc8cd6b2614f26141554d9e
+ data.tar.gz: d189373c2196a3ddc10856f4800fee908d9361f83921e68078c971402a3996b02d82c13bb275a26d070518e90626d4e654556d970d91b69179f2899065b31685
data/History.txt CHANGED
@@ -25,3 +25,19 @@
 
  == 0.0.3
  * This release is a complete rewrite of the entire gem to accommodate index values.
+
+ == 0.0.3.1
+ * Added arithmetic methods for vector arithmetic that take the index of values into account.
+
+ == 0.0.4
+ * Added wrappers for Array, NMatrix and MDArray so that the data type used internally is completely transparent to the external interface.
+ * Added statistics methods for vectors for ArrayWrapper. These are compatible with statsample methods.
+ * Added plotting functions for DataFrame and Vector using Nyaplot.
+ * Create a DataFrame by specifying the rows with the ".rows" class method.
+ * Create a Vector from a Hash.
+ * Call a Vector element by specifying the index name as a method call (method_missing logic).
+ * Retrieve multiple rows of a DataFrame by specifying a Range or an Array with multiple index names.
+ * #head and #tail for DataFrame.
+ * #uniq for Vector.
+ * #max for Vector can return a Vector object with the index set to the index of the max value.
+ * Tonnes of documentation for most methods.
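
The 0.0.4 entry mentions calling a Vector element by its index name through `method_missing`. The sketch below shows the general Ruby technique in isolation. `LabelledVector` is a hypothetical stand-in written for this illustration and is not daru's actual class or implementation.

```ruby
# Hypothetical stand-in for a labelled vector; NOT daru's Daru::Vector.
class LabelledVector
  def initialize(values, index)
    @values = values
    @index  = index
  end

  # Look up a value by its index label; returns nil for unknown labels.
  def [](label)
    position = @index.index(label)
    position && @values[position]
  end

  # Treat an unknown zero-argument method name as an index label.
  def method_missing(name, *args, &block)
    return self[name] if args.empty? && @index.include?(name)
    super
  end

  # Keep respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    @index.include?(name) || super
  end
end

v = LabelledVector.new([10, 20, 30], [:a, :b, :c])
v.b  # => 20, the same value as v[:b]
```

Names that are not in the index still raise `NoMethodError`, because the call falls through to `super`.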
data/README.md CHANGED
@@ -11,9 +11,9 @@ daru (Data Analysis in RUby) is a library for storage, analysis and manipulation
11
11
 
12
12
  Development of daru was started to address the fragmentation of Dataframe-like classes which were created in many ruby gems as per their own needs. daru offers a uniform interface for all sorts of data analysis and manipulation operations and aims to be compatible with all ruby gems involved in any way with data.
13
13
 
14
- daru is heavily inspired by `Statsample::Dataset`, `Nyaplot::DataFrame` and the super-awesome pandas, a very mature solution in Python.
14
+ daru is inspired by `Statsample::Dataset` and pandas, a very mature solution in Python.
15
15
 
16
- daru works with CRuby (1.9.3+) and JRuby and in a few weeks will be completely compatible with NMatrix and MDArray for fast data manipulation using C or Java structures.
16
+ daru works with CRuby (1.9.3+) and JRuby.
17
17
 
18
18
  ## Features
19
19
 
@@ -24,7 +24,19 @@ daru works with CRuby (1.9.3+) and JRuby and in a few weeks will be completely c
24
24
  * Indexed and named data structures.
25
25
  * Flexible and intuitive API for manipulation and analysis of data.
26
26
 
27
- ## Usage
27
+ ## Notebooks
28
+
29
+ * [Analysis and plotting of a data set comprising music listening habits of a last.fm user (iruby notebook)](http://nbviewer.ipython.org/github/v0dro/daru/blob/master/notebooks/intro_with_music_data_.ipynb)
30
+
31
+ ## Blog Posts
32
+
33
+ * [Data Analysis in RUby: Basic data manipulation and plotting](http://v0dro.github.io/blog/2014/11/25/data-analysis-in-ruby-basic-data-manipulation-and-plotting/)
34
+
35
+ ## Documentation
36
+
37
+ Docs can be found [here](https://rubygems.org/gems/daru).
38
+
39
+ ## Basic Usage
28
40
 
29
41
  daru has been created with keeping extreme ease of use in mind.
30
42
 
@@ -33,7 +45,7 @@ The gem consists of two data structures, Vector and DataFrame. Any data in a ser
33
45
  #### Initialization of DataFrame
34
46
 
35
47
  A data frame can be initialized from the following sources:
36
- * Hash of indexed vectors: `{ b: Daru::Vector.new(:b, [11,12,13,14,15], [:two, :one, :four, :five, :three]), a: Daru::Vector.new(:a, [1,2,3,4,5], [:two,:one,:three, :four, :five])}`.
48
+ * Hash of indexed vectors: `{ b: Daru::Vector.new(:b, [11,12,13,14,15], [:two, :one, :four, :five, :three]), a: Daru::Vector.new(:a, [1,2,3,4,5], [:two,:one,:three, :four, :five])}`.
37
49
  * Array of hashes: `[{a: 1, b: 11}, {a: 2, b: 12}, {a: 3, b: 13},{a: 4, b: 14}, {a: 5, b: 15}]`.
38
50
  * Hash of names and Arrays: `{b: [11,12,13,14,15], a: [1,2,3,4,5]}`
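
The "Array of hashes" and "Hash of names and Arrays" forms above carry the same information, one row-oriented and one column-oriented. As a minimal plain-Ruby sketch (no daru calls; the variable names are illustrative only), the row form can be regrouped into the column form like this:

```ruby
# Row-oriented input: one hash per row.
rows = [{a: 1, b: 11}, {a: 2, b: 12}, {a: 3, b: 13}]

# Regroup into one array per name (column-oriented).
columns = rows.each_with_object({}) do |row, cols|
  row.each { |name, value| (cols[name] ||= []) << value }
end

columns[:a]  # => [1, 2, 3]
columns[:b]  # => [11, 12, 13]
```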
39
51
 
@@ -43,7 +55,7 @@ A basic DataFrame can be initialized like this:
43
55
 
44
56
  ```ruby
45
57
 
46
- df = Daru::DataFrame.new({b: [11,12,13,14,15], a: [1,2,3,4,5]}, vectors: [:a, :b], index: [:one, :two, :three, :four, :five])
58
+ df = Daru::DataFrame.new({b: [11,12,13,14,15], a: [1,2,3,4,5]}, order: [:a, :b], index: [:one, :two, :three, :four, :five])
47
59
  df
48
60
  # =>
49
61
  # #<Daru::DataFrame:87274040 @name = 7308c587-4073-4e7d-b3ca-3679d1dcc946 @size = 5>
@@ -65,7 +77,7 @@ The vectors of the DataFrame will be arranged according to the array specified i
65
77
  b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
66
78
  a: [1,2,3,4,5].dv(:a, [:two,:one,:three, :four, :five])
67
79
  },
68
- vectors: [:a, :b]
80
+ order: [:a, :b]
69
81
  )
70
82
  df
71
83
 
@@ -89,7 +101,7 @@ If an index for the DataFrame is supplied (third argument), then the indexes of
89
101
  a: [1,2,3] .dv(nil, [:one, :two, :three]),
90
102
  c: [11,22,33,44,55] .dv(nil, [:one, :two, :three, :four, :five]),
91
103
  d: [49,69,89,99,108,44].dv(nil, [:one, :two, :three, :four, :five, :six])
92
- }, vectors: [:a, :b, :c, :d], index: [:one, :two, :three, :four, :five, :six])
104
+ }, order: [:a, :b, :c, :d], index: [:one, :two, :three, :four, :five, :six])
93
105
  df
94
106
  # =>
95
107
  # #<Daru::DataFrame:87523270 @name = bda4eb68-afdd-4404-9981-708edab14201 @size = 6>
@@ -111,7 +123,7 @@ If some of the supplied vectors do not contain certain indexes that are containe
111
123
  b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
112
124
  a: [1,2,3] .dv(:a, [:two,:one,:three])
113
125
  },
114
- vectors: [:a, :b]
126
+ order: [:a, :b]
115
127
  )
116
128
  df
117
129
 
@@ -176,7 +188,7 @@ Initialize a dataframe:
176
188
  b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
177
189
  a: [1,2,3,4,5].dv(:a, [:two,:one,:three, :four, :five])
178
190
  },
179
- vectors: [:a, :b]
191
+ order: [:a, :b]
180
192
  )
181
193
 
182
194
  # =>
@@ -203,6 +215,20 @@ Select a row from a DataFrame:
203
215
  ```
204
216
  A row or a vector is returned as a `Daru::Vector` object, so any manipulations supported by `Daru::Vector` can be performed on the chosen row as well.
205
217
 
218
+ Select multiple rows with a Range and get a DataFrame in return:
219
+
220
+ ``` ruby
221
+
222
+ df.row[1..3] # OR df.row[:four..:three]
223
+ # =>
224
+ # #<Daru::DataFrame:85361520 @name = d6582f66-5a55-473e-ba57-cb2ba974da6a @size = 3>
225
+ # a b
226
+ # four 4 13
227
+ # one 2 12
228
+ # three 3 15
229
+
230
+ ```
231
+
206
232
  Select a single vector:
207
233
 
208
234
  ```ruby
@@ -251,16 +277,16 @@ Keep/remove row according to a specified condition:
251
277
  # five 5 14
252
278
 
253
279
  ```
254
- The same can be applied to vectors using `keep_vector_if`.
280
+ The same can be applied to vectors using `filter_vectors`.
255
281
 
256
- To iterate over a DataFrame and perform operations on rows or vectors, `#each_row` or `#each_vector` can be used, which works just like `#each` for Ruby Arrays.
282
+ To iterate over a DataFrame and perform operations on rows or vectors, use `#each_row` or `#each_vector`.
257
283
 
258
284
  To change the values of a row/vector while iterating through the DataFrame, use `map_rows` or `map_vectors`:
259
285
 
260
286
  ```ruby
261
287
 
262
288
  df.map_rows do |row|
263
- row = row.map { |e| e*e }
289
+ row = row * row
264
290
  end
265
291
 
266
292
  df
@@ -278,23 +304,48 @@ To change the values of a row/vector while iterating through the DataFrame, use
278
304
 
279
305
  Rows/vectors can be deleted using `delete_row` or `delete_vector`.
280
306
 
281
- #### Basic Math Operations
307
+ #### Basic Maths Operations
308
+
309
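+ Performing a binary arithmetic operation on two `Daru::Vector` objects returns a new `Daru::Vector` in which the operation is applied to elements that share the same index. As the example below shows, indexes present in only one of the vectors are dropped from the result.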
310
+
311
+ ```ruby
312
+
313
+ dv1 = Daru::Vector.new [1,2,3,4], name: :boozy, index: [:a, :b, :c, :d]
314
+
315
+ dv2 = Daru::Vector.new [1,2,3,4], name: :mayer, index: [:e, :f, :b, :d]
316
+
317
+ dv1 * dv2
318
+
319
+ # #<Daru::Vector:80924700 @name = boozy @size = 2 >
320
+ # boozy
321
+ # b 6
322
+ # d 16
323
+
324
+ ```
325
+
326
+ Applying an arithmetic operator with a single Numeric performs the operation between that number and every element of the vector.
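
To make the alignment rule concrete, here is a minimal sketch that models each vector as a plain Ruby Hash of index label to value. `aligned_product` and `scaled` are hypothetical helpers written for this illustration. They are not part of daru's API and say nothing about how daru implements the operation internally.

```ruby
# Hypothetical helper: multiply two label => value hashes on shared labels only.
def aligned_product(left, right)
  shared = left.keys & right.keys
  shared.each_with_object({}) { |label, out| out[label] = left[label] * right[label] }
end

# Hypothetical helper: apply a scalar to every element, keeping the labels.
def scaled(vector, number)
  vector.each_with_object({}) { |(label, value), out| out[label] = value * number }
end

dv1 = { a: 1, b: 2, c: 3, d: 4 }
dv2 = { e: 1, f: 2, b: 3, d: 4 }

product = aligned_product(dv1, dv2)  # b: 6, d: 16 (only the shared labels)
doubled = scaled(dv1, 2)             # a: 2, b: 4, c: 6, d: 8
```

The values for `:b` and `:d` match the `dv1 * dv2` output shown above.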
327
+
328
+ #### Statistics Operations
282
329
 
283
- Coming soon!
330
+ Daru::Vector has a whole lot of statistics operations to maintain compatibility with Statsample::Vector. Check the docs for details.
331
+
332
+ #### Plotting
333
+
334
+ daru uses [Nyaplot](https://github.com/domitry/nyaplot) for plotting and an example of this can be found in the [notebook](http://nbviewer.ipython.org/github/v0dro/daru/blob/master/notebooks/intro_with_music_data_.ipynb) or [blog post](http://v0dro.github.io/blog/2014/11/25/data-analysis-in-ruby-basic-data-manipulation-and-plotting/).
335
+
336
+ Head over to the tutorials and notebooks listed above for more examples.
284
337
 
285
338
  ## Roadmap
286
339
 
287
340
  * Automate testing for both MRI and JRuby.
288
341
  * Enable creation of DataFrame by only specifying an NMatrix/MDArray in initialize. Vector naming happens automatically (alphabetic) or is specified in an Array.
289
- * Destructive map iterators for DataFrame and Vector.
342
+ * Destructive map iterators for DataFrame.
290
343
  * Completely test all functionality for NMatrix and MDArray.
291
344
  * Basic Data manipulation and analysis operations:
292
345
  - Different kinds of join operations
293
346
  - Dataframe/vector merge
294
347
  - Creation of correlation, covariance matrices
295
348
  - Verification of data in a vector
296
- - Basic vector statistics - mean, median, variance, etc.
297
- * Vector arithmetic - elementwise addition, subtraction, multiplication, division.
298
349
  * Transpose a dataframe.
299
350
  * Option to express a DataFrame as an NMatrix or MDArray so as to use more efficient storage techniques.
300
351
  * Assignment of a column to a single number should set the entire column to that number.
@@ -303,15 +354,24 @@ Coming soon!
303
354
  * Creation of DataFrame from Array of Arrays.
304
355
  * Multiple value assignment for vectors with []=.
305
356
  * Load DataFrame from multiple sources (excel, SQL, etc.).
306
- * Allow for boolean operations inside #[].
307
357
  * Deletion of elements from Vector should only modify the index and leave the vector as it is so that compacting is not needed and things are faster.
308
358
  * Add a #sync method which will sync the modified index with the unmodified vector.
309
359
  * Ability to reorder the index of a dataframe.
310
- * Slicing operations using Range.
311
- * Create DataFrame by providing rows.
312
- * Integrate basic plotting with Nyaplot.
313
- * Filter through a dataframe with filter\_rows or filter\_vectors based on whatever boolean value evaluates to true.
314
- * Named arguments
360
+ * head/tail for DV.
361
+ * #find\_max function which will evaluate a block and return the row for which the value of the block is max.
362
+ * Function to check if a value of a row/vector is within a specified range.
363
+ * Create a new vector in map_rows if any of the already present rows don't match the one assigned in the block.
364
+ * Direct functions to answer something like 'number of something per thousand of something else'.
365
+ * Tests for checking NMatrix resizing
366
+ * Sort while preserving index.
367
+
368
+ ## Contributing
369
+
370
+ Pick a feature from the Roadmap above or think of your own and send me a Pull Request!
371
+
372
+ ## Acknowledgements
373
+
374
+ * Thank you [last.fm](http://www.last.fm/) for making user data accessible to the public.
315
375
 
316
376
  Copyright (c) 2014, Sameer Deshmukh
317
377
  All rights reserved
data/daru.gemspec CHANGED
@@ -6,6 +6,12 @@ require 'version.rb'
6
6
  DESCRIPTION = <<MSG
7
7
  Daru (Data Analysis in RUby) is a library for storage, analysis and manipulation
8
8
  of data.
9
+
10
+ Daru works with Ruby arrays, NMatrix and MDArray, thus working seamlessly across
11
+ ruby interpreters, at the same time providing speed for those who need it.
12
+
13
+ This library is under active development so NMatrix and MDArray support is
14
+ somewhat limited, but should be available soon!
9
15
  MSG
10
16
 
11
17
  Gem::Specification.new do |spec|
@@ -27,6 +33,7 @@ Gem::Specification.new do |spec|
27
33
  spec.add_development_dependency 'rake'
28
34
  spec.add_development_dependency 'rspec'
29
35
  spec.add_development_dependency 'awesome_print'
36
+ spec.add_development_dependency 'nyaplot'
30
37
  if RUBY_ENGINE != 'jruby'
31
38
  spec.add_development_dependency 'nmatrix', '~> 0.1.0.rc5'
32
39
  end
@@ -2,7 +2,255 @@ module Daru
2
2
  module Accessors
3
3
  # Internal class for wrapping ruby array
4
4
  class ArrayWrapper
5
+ module Statistics
5
6
 
7
+ def average_deviation_population m=nil
8
+ m ||= mean
9
+ (@vector.inject(0) {|memo, val| memo + (val - m).abs }) / n_valid
10
+ end
11
+
12
+ def coefficient_of_variation
13
+ standard_deviation_sample / mean
14
+ end
15
+
16
+ def count value=false
17
+ if block_given?
18
+ @vector.inject(0){ |memo, val| memo += 1 if yield val; memo}
19
+ else
20
+ val = frequencies[value]
21
+ val.nil? ? 0 : val
22
+ end
23
+ end
24
+
25
+ def factors
26
+ index = @data.sorted_indices
27
+ index.reduce([]){|memo, val| memo.push(@data[val]) if memo.last != @data[val]; memo}
28
+ end # TODO
29
+
30
+ def frequencies
31
+ @vector.inject({}) do |hash, element|
32
+ hash[element] ||= 0
33
+ hash[element] += 1
34
+ hash
35
+ end
36
+ end
37
+
38
+ def has_missing_data?
39
+ has_missing_data
40
+ end
41
+
42
+ def kurtosis m=nil
43
+ m ||= mean
44
+ fo = @vector.inject(0){ |a, x| a + ((x - m) ** 4) }
45
+ fo.quo(@size * standard_deviation_sample(m) ** 4) - 3
46
+ end
47
+
48
+ def mean
49
+ sum.quo(@size).to_f
50
+ end
51
+
52
+ def median
53
+ percentile 50
54
+ end
55
+
56
+ def median_absolute_deviation
57
+ m = median
58
+ recode {|val| (val - m).abs }.median
59
+ end
60
+
61
+ def mode
62
+ freqs = frequencies
63
+
64
+ freqs.key(freqs.values.max)
65
+ end
66
+
67
+ def n_valid
68
+ @size
69
+ end
70
+
71
+ def percentile percent
72
+ sorted = @vector.sort
73
+ v = (n_valid * percent).quo(100)
74
+ if v.to_i != v
75
+ sorted[v.to_i]
76
+ else
77
+ (sorted[(v - 0.5).to_i].to_f + sorted[(v + 0.5).to_i]).quo(2)
78
+ end
79
+ end
80
+
81
+ def product
82
+ @vector.inject(:*)
83
+ end
84
+
85
+ def max
86
+ @vector.max
87
+ end
88
+
89
+ def min
90
+ @vector.min
91
+ end
92
+
93
+ def proportion value=1
94
+ frequencies[value].quo(n_valid)
95
+ end
96
+
97
+ def proportions
98
+ len = n_valid
99
+ frequencies.inject({}) { |hash, arr| hash[arr[0]] = arr[1].quo(len); hash }
100
+ end
101
+
102
+ def range
103
+ max - min
104
+ end
105
+
106
+ def ranked
107
+ sum = 0
108
+ r = frequencies.sort.inject( {} ) do |memo, val|
109
+ memo[val[0]] = ((sum + 1) + (sum + val[1])) / 2
110
+ sum += val[1]
111
+ memo
112
+ end
113
+
114
+ Daru::Vector.new @vector.map { |e| r[e] }, index: @caller.index,
115
+ name: @caller.name, dtype: @caller.dtype
116
+ end
117
+
118
+ def recode(&block)
119
+ @vector.map(&block)
120
+ end
121
+
122
+ def recode!(&block)
123
+ @vector.map!(&block)
124
+ end
125
+
126
+ # Calculate skewness using (sigma(xi - mean)^3)/((N)*std_dev_sample^3)
127
+ def skew m=nil
128
+ m ||= mean
129
+ th = @vector.inject(0) { |memo, val| memo + ((val - m)**3) }
130
+ th.quo(@size * (standard_deviation_sample(m)**3))
131
+ end
132
+
133
+ def standard_deviation_population m=nil
134
+ m ||= mean
135
+ Math::sqrt(variance_population(m))
136
+ end
137
+
138
+ def standard_deviation_sample m=nil
139
+ Math::sqrt(variance_sample(m))
140
+ end
141
+
142
+ def standard_error
143
+ standard_deviation_sample/(Math::sqrt(@size))
144
+ end
145
+
146
+ def sum_of_squared_deviation
147
+ (@vector.inject(0) { |a,x| x.square + a } - (sum.square.quo(@size))).to_f
148
+ end
149
+
150
+ def sum_of_squares(m=nil)
151
+ m ||= mean
152
+ @vector.inject(0) { |memo, val| memo + (val - m)**2 }
153
+ end
154
+
155
+ def sum
156
+ @vector.inject(:+)
157
+ end
158
+
159
+ # Sample variance with denominator (N-1)
160
+ def variance_sample m=nil
161
+ m ||= self.mean
162
+
163
+ sum_of_squares(m).quo(@size - 1)
164
+ end
165
+
166
+ # Population variance with denominator (N)
167
+ def variance_population m=nil
168
+ m ||= mean
169
+
170
+ sum_of_squares(m).quo(@size).to_f
171
+ end
172
+ end # module Statistics
173
+
174
+ include Statistics
175
+ include Enumerable
176
+
177
+ def each(&block)
178
+ @vector.each(&block)
179
+ end
180
+
181
+ def map!(&block)
182
+ @vector.map!(&block)
183
+ end
184
+
185
+ attr_accessor :size
186
+ attr_reader :vector
187
+ attr_reader :has_missing_data
188
+
189
+ def initialize vector, caller
190
+ @vector = vector
191
+ @caller = caller
192
+
193
+ set_size
194
+ end
195
+
196
+ def [] index
197
+ @vector[index]
198
+ end
199
+
200
+ def []= index, value
201
+ @has_missing_data = true if value.nil?
202
+ @vector[index] = value
203
+ set_size
204
+ end
205
+
206
+ def == other
207
+ @vector == other
208
+ end
209
+
210
+ def delete_at index
211
+ @vector.delete_at index
212
+ set_size
213
+ end
214
+
215
+ def index key
216
+ @vector.index key
217
+ end
218
+
219
+ def << element
220
+ @vector << element
221
+ set_size
222
+ end
223
+
224
+ def uniq
225
+ @vector.uniq
226
+ end
227
+
228
+ def to_a
229
+ @vector
230
+ end
231
+
232
+ def dup
233
+ ArrayWrapper.new @vector.dup, @caller
234
+ end
235
+
236
+ def coerce dtype
237
+ case
238
+ when dtype == Array
239
+ self
240
+ when dtype == NMatrix
241
+ Daru::Accessors::NMatrixWrapper.new @vector, @caller
242
+ when dtype == MDArray
243
+ raise NotImplementedError
244
+ else
245
+ raise ArgumentError, "Cant coerce to dtype #{dtype}"
246
+ end
247
+ end
248
+
249
+ private
250
+
251
+ def set_size
252
+ @size = @vector.size
253
+ end
6
254
  end
7
255
  end
8
256
  end
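
The `variance_sample` and `variance_population` methods in the Statistics module above differ only in the denominator (N-1 versus N). The standalone sketch below mirrors those formulas on a plain Array so the difference can be checked by hand. The free functions and the sample data are illustrative assumptions, not part of the gem.

```ruby
# Arithmetic mean as a Float.
def mean(values)
  values.inject(:+).quo(values.size).to_f
end

# Sum of squared deviations from the mean.
def sum_of_squares(values, m = mean(values))
  values.inject(0) { |memo, val| memo + (val - m)**2 }
end

# Sample variance: denominator (N - 1).
def variance_sample(values)
  sum_of_squares(values).quo(values.size - 1)
end

# Population variance: denominator N.
def variance_population(values)
  sum_of_squares(values).quo(values.size)
end

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean(data)                 # => 5.0
variance_population(data)  # => 4.0   (32 / 8)
variance_sample(data)      # 32 / 7, roughly 4.57
```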