daru 0.0.3.1 → 0.0.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: e0388674c6bcbd695b0c456529e987cce5124052
4
- data.tar.gz: 4433ab2ef63dd8908e12c43937dbb704abccba43
3
+ metadata.gz: dc60c1a59b4c112dface4bdd6d0a40cf1bcc499c
4
+ data.tar.gz: aec8fd874e2e74ded4eeb2cebe8503b5a3cbefbc
5
5
  SHA512:
6
- metadata.gz: 37c6930a8a54e567e92efaa68c04c0774a33a7c08b6d90f66b9ee3309a09d05713d265c0dc502dd05ad36658ebd992f50df0994f96f84da5b2bff2576d5d78e7
7
- data.tar.gz: 9985f10d9d66615a0fce7a949d03765029bb186ec7f69e91b6397d1a9a51bf2849ede0c64303719918bb097e3cff0f32ba52a610a08a0f92803805dc1a298d88
6
+ metadata.gz: 091f1ce3e0da8b9083dc6f53a9977550def5d10a4f759797ec56cb6b90788aa8d6a7fb8ad1a0e4105c640b5987e0b200877a12441cc8cd6b2614f26141554d9e
7
+ data.tar.gz: d189373c2196a3ddc10856f4800fee908d9361f83921e68078c971402a3996b02d82c13bb275a26d070518e90626d4e654556d970d91b69179f2899065b31685
data/History.txt CHANGED
@@ -25,3 +25,19 @@
25
25
 
26
26
  == 0.0.3
27
27
  * This release is a complete rewrite of the entire gem to accommodate index values.
28
+
29
+ == 0.0.3.1
30
+ * Added arithmetic methods for Vector that take the index of values into account.
31
+
32
+ == 0.0.4
33
+ * Added wrappers for Array, NMatrix and MDArray so that the external interface is independent of the data type being used internally.
34
+ * Added statistics methods for vectors for ArrayWrapper. These are compatible with statsample methods.
35
+ * Added plotting functions for DataFrame and Vector using Nyaplot.
36
+ * Create a DataFrame by specifying the rows with the ".rows" class method.
37
+ * Create a Vector from a Hash.
38
+ * Call a Vector element by specifying the index name as a method call (method_missing logic).
39
+ * Retrieve multiple rows of a DataFrame by specifying a Range or an Array with multiple index names.
40
+ * #head and #tail for DataFrame.
41
+ * #uniq for Vector.
42
+ * #max for Vector can return a Vector object with the index set to the index of the max value.
43
+ * Tonnes of documentation for most methods.
data/README.md CHANGED
@@ -11,9 +11,9 @@ daru (Data Analysis in RUby) is a library for storage, analysis and manipulation
11
11
 
12
12
  Development of daru was started to address the fragmentation of Dataframe-like classes which were created in many ruby gems as per their own needs. daru offers a uniform interface for all sorts of data analysis and manipulation operations and aims to be compatible with all ruby gems involved in any way with data.
13
13
 
14
- daru is heavily inspired by `Statsample::Dataset`, `Nyaplot::DataFrame` and the super-awesome pandas, a very mature solution in Python.
14
+ daru is inspired by `Statsample::Dataset` and pandas, a very mature solution in Python.
15
15
 
16
- daru works with CRuby (1.9.3+) and JRuby and in a few weeks will be completely compatible with NMatrix and MDArray for fast data manipulation using C or Java structures.
16
+ daru works with CRuby (1.9.3+) and JRuby.
17
17
 
18
18
  ## Features
19
19
 
@@ -24,7 +24,19 @@ daru works with CRuby (1.9.3+) and JRuby and in a few weeks will be completely c
24
24
  * Indexed and named data structures.
25
25
  * Flexible and intuitive API for manipulation and analysis of data.
26
26
 
27
- ## Usage
27
+ ## Notebooks
28
+
29
+ * [Analysis and plotting of a data set comprising music listening habits of a last.fm user (iruby notebook)](http://nbviewer.ipython.org/github/v0dro/daru/blob/master/notebooks/intro_with_music_data_.ipynb)
30
+
31
+ ## Blog Posts
32
+
33
+ * [Data Analysis in RUby: Basic data manipulation and plotting](http://v0dro.github.io/blog/2014/11/25/data-analysis-in-ruby-basic-data-manipulation-and-plotting/)
34
+
35
+ ## Documentation
36
+
37
+ Docs can be found [here](https://rubygems.org/gems/daru).
38
+
39
+ ## Basic Usage
28
40
 
29
41
  daru has been created with extreme ease of use in mind.
30
42
 
@@ -33,7 +45,7 @@ The gem consists of two data structures, Vector and DataFrame. Any data in a ser
33
45
  #### Initialization of DataFrame
34
46
 
35
47
  A data frame can be initialized from the following sources:
36
- * Hash of indexed vectors: `{ b: Daru::Vector.new(:b, [11,12,13,14,15], [:two, :one, :four, :five, :three]), a: Daru::Vector.new(:a, [1,2,3,4,5], [:two,:one,:three, :four, :five])}`.
48
+ * Hash of indexed vectors: `{ b: Daru::Vector.new(:b, [11,12,13,14,15], [:two, :one, :four, :five, :three]), a: Daru::Vector.new(:a, [1,2,3,4,5], [:two,:one,:three, :four, :five])}`.
37
49
  * Array of hashes: `[{a: 1, b: 11}, {a: 2, b: 12}, {a: 3, b: 13},{a: 4, b: 14}, {a: 5, b: 15}]`.
38
50
  * Hash of names and Arrays: `{b: [11,12,13,14,15], a: [1,2,3,4,5]}`
39
51
 
@@ -43,7 +55,7 @@ A basic DataFrame can be initialized like this:
43
55
 
44
56
  ```ruby
45
57
 
46
- df = Daru::DataFrame.new({b: [11,12,13,14,15], a: [1,2,3,4,5]}, vectors: [:a, :b], index: [:one, :two, :three, :four, :five])
58
+ df = Daru::DataFrame.new({b: [11,12,13,14,15], a: [1,2,3,4,5]}, order: [:a, :b], index: [:one, :two, :three, :four, :five])
47
59
  df
48
60
  # =>
49
61
  # # <Daru::DataFrame:87274040 @name = 7308c587-4073-4e7d-b3ca-3679d1dcc946 # @size = 5>
@@ -65,7 +77,7 @@ The vectors of the DataFrame will be arranged according to the array specified i
65
77
  b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
66
78
  a: [1,2,3,4,5].dv(:a, [:two,:one,:three, :four, :five])
67
79
  },
68
- vectors: [:a, :b]
80
+ order: [:a, :b]
69
81
  )
70
82
  df
71
83
 
@@ -89,7 +101,7 @@ If an index for the DataFrame is supplied (third argument), then the indexes of
89
101
  a: [1,2,3] .dv(nil, [:one, :two, :three]),
90
102
  c: [11,22,33,44,55] .dv(nil, [:one, :two, :three, :four, :five]),
91
103
  d: [49,69,89,99,108,44].dv(nil, [:one, :two, :three, :four, :five, :six])
92
- }, vectors: [:a, :b, :c, :d], index: [:one, :two, :three, :four, :five, :six])
104
+ }, order: [:a, :b, :c, :d], index: [:one, :two, :three, :four, :five, :six])
93
105
  df
94
106
  # =>
95
107
  # #<Daru::DataFrame:87523270 @name = bda4eb68-afdd-4404-9981-708edab14201 #@size = 6>
@@ -111,7 +123,7 @@ If some of the supplied vectors do not contain certain indexes that are containe
111
123
  b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
112
124
  a: [1,2,3] .dv(:a, [:two,:one,:three])
113
125
  },
114
- vectors: [:a, :b]
126
+ order: [:a, :b]
115
127
  )
116
128
  df
117
129
 
@@ -176,7 +188,7 @@ Initialize a dataframe:
176
188
  b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
177
189
  a: [1,2,3,4,5].dv(:a, [:two,:one,:three, :four, :five])
178
190
  },
179
- vectors: [:a, :b]
191
+ order: [:a, :b]
180
192
  )
181
193
 
182
194
  # =>
@@ -203,6 +215,20 @@ Select a row from a DataFrame:
203
215
  ```
204
216
  A row or a vector is returned as a `Daru::Vector` object, so any manipulations supported by `Daru::Vector` can be performed on the chosen row as well.
205
217
 
218
+ Select multiple rows with a Range and get a DataFrame in return:
219
+
220
+ ``` ruby
221
+
222
+ df.row[1..3] # OR df.row[:four..:three]
223
+ # =>
224
+ # #<Daru::DataFrame:85361520 @name = d6582f66-5a55-473e-ba57-cb2ba974da6a @size = 3>
225
+ # a b
226
+ # four 4 13
227
+ # one 2 12
228
+ # three 3 15
229
+
230
+ ```
231
+
206
232
  Select a single vector:
207
233
 
208
234
  ```ruby
@@ -251,16 +277,16 @@ Keep/remove row according to a specified condition:
251
277
  # five 5 14
252
278
 
253
279
  ```
254
- The same can be applied to vectors using `keep_vector_if`.
280
+ The same can be applied to vectors using `filter_vectors`.
255
281
 
256
- To iterate over a DataFrame and perform operations on rows or vectors, `#each_row` or `#each_vector` can be used, which works just like `#each` for Ruby Arrays.
282
+ To iterate over a DataFrame and perform operations on rows or vectors, use `#each_row` or `#each_vector`.
257
283
 
258
284
  To change the values of a row/vector while iterating through the DataFrame, use `map_rows` or `map_vectors`:
259
285
 
260
286
  ```ruby
261
287
 
262
288
  df.map_rows do |row|
263
- row = row.map { |e| e*e }
289
+ row = row * row
264
290
  end
265
291
 
266
292
  df
@@ -278,23 +304,48 @@ To change the values of a row/vector while iterating through the DataFrame, use
278
304
 
279
305
  Rows/vectors can be deleted using `delete_row` or `delete_vector`.
280
306
 
281
- #### Basic Math Operations
307
+ #### Basic Maths Operations
308
+
309
+ Performing a binary arithmetic operation on two `Daru::Vector` objects will return a `Vector` object in which the operation will be performed on elements of the same index.
310
+
311
+ ```ruby
312
+
313
+ dv1 = Daru::Vector.new [1,2,3,4], name: :boozy, index: [:a, :b, :c, :d]
314
+
315
+ dv2 = Daru::Vector.new [1,2,3,4], name: :mayer, index: [:e, :f, :b, :d]
316
+
317
+ dv1 * dv2
318
+
319
+ # #<Daru::Vector:80924700 @name = boozy @size = 2 >
320
+ # boozy
321
+ # b 6
322
+ # d 16
323
+
324
+ ```
325
+
326
+ Arithmetic operators applied on a single Numeric will perform the operation with that number against the entire vector.
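
The two behaviours described above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not daru's implementation: each vector is modelled as a Hash of index to value, and `aligned_op` and `scalar_op` are hypothetical helper names. Indexes present in only one operand are dropped, which matches the size-2 result shown in the example above.

```ruby
# Plain-Ruby sketch of index-aligned arithmetic; NOT daru's own code.
# A "vector" is modelled here as a Hash of index => value (an assumption).

# Applies the block to the values that share an index in both operands.
def aligned_op(left, right)
  common = left.keys & right.keys # indexes present in both operands
  common.each_with_object({}) do |idx, result|
    result[idx] = yield(left[idx], right[idx])
  end
end

# Applies the block to every element together with a single number.
def scalar_op(vector, number)
  vector.each_with_object({}) do |(idx, value), result|
    result[idx] = yield(value, number)
  end
end

dv1 = { a: 1, b: 2, c: 3, d: 4 }
dv2 = { e: 1, f: 2, b: 3, d: 4 }

product = aligned_op(dv1, dv2) { |x, y| x * y } # only :b and :d survive
scaled  = scalar_op(dv1, 10) { |x, n| x * n }   # every element is multiplied
```

Here `product` holds 6 for `:b` and 16 for `:d`, the same values as the `dv1 * dv2` output above, while `scaled` keeps all four indexes.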
327
+
328
+ #### Statistics Operations
282
329
 
283
- Coming soon!
330
+ Daru::Vector provides a range of statistics operations (mean, median, variance, standard deviation and others), named to stay compatible with Statsample::Vector. Check the docs for details.
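
As a rough guide to what the variance-related methods compute, here is a standalone sketch on a plain Array. The function names mirror the `ArrayWrapper` methods added in this release, but the code is an assumption-laden illustration and not the gem's source: the sample variance divides by N - 1 and the population variance divides by N.

```ruby
# Standalone sketch on plain Arrays; names mirror the ArrayWrapper methods
# in this release, but this is NOT the gem's implementation.

def mean(values)
  values.inject(:+).quo(values.size).to_f
end

# Sum of squared differences from the mean.
def sum_of_squares(values, m = mean(values))
  values.inject(0) { |memo, val| memo + (val - m)**2 }
end

# Sample variance, denominator N - 1.
def variance_sample(values)
  sum_of_squares(values).quo(values.size - 1)
end

# Population variance, denominator N.
def variance_population(values)
  sum_of_squares(values).quo(values.size)
end

def standard_deviation_population(values)
  Math.sqrt(variance_population(values))
end

data = [2, 4, 4, 4, 5, 5, 7, 9]

data_mean     = mean(data)                          # 5.0
pop_variance  = variance_population(data)           # 32.0 / 8
samp_variance = variance_sample(data)               # 32.0 / 7
pop_std_dev   = standard_deviation_population(data) # square root of 4.0
```

For this data set the squared deviations sum to 32, so the population variance is 4.0 and the population standard deviation is 2.0.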
331
+
332
+ #### Plotting
333
+
334
+ daru uses [Nyaplot](https://github.com/domitry/nyaplot) for plotting and an example of this can be found in the [notebook](http://nbviewer.ipython.org/github/v0dro/daru/blob/master/notebooks/intro_with_music_data_.ipynb) or [blog post](http://v0dro.github.io/blog/2014/11/25/data-analysis-in-ruby-basic-data-manipulation-and-plotting/).
335
+
336
+ Head over to the tutorials and notebooks listed above for more examples.
284
337
 
285
338
  ## Roadmap
286
339
 
287
340
  * Automate testing for both MRI and JRuby.
288
341
  * Enable creation of DataFrame by only specifying an NMatrix/MDArray in initialize. Vector naming happens automatically (alphabetic) or is specified in an Array.
289
- * Destructive map iterators for DataFrame and Vector.
342
+ * Destructive map iterators for DataFrame.
290
343
  * Completely test all functionality for NMatrix and MDArray.
291
344
  * Basic Data manipulation and analysis operations:
292
345
  - Different kinds of join operations
293
346
  - Dataframe/vector merge
294
347
  - Creation of correlation, covariance matrices
295
348
  - Verification of data in a vector
296
- - Basic vector statistics - mean, median, variance, etc.
297
- * Vector arithmetic - elementwise addition, subtraction, multiplication, division.
298
349
  * Transpose a dataframe.
299
350
  * Option to express a DataFrame as an NMatrix or MDArray so as to use more efficient storage techniques.
300
351
  * Assignment of a column to a single number should set the entire column to that number.
@@ -303,15 +354,24 @@ Coming soon!
303
354
  * Creation of DataFrame from Array of Arrays.
304
355
  * Multiple value assignment for vectors with []=.
305
356
  * Load DataFrame from multiple sources (excel, SQL, etc.).
306
- * Allow for boolean operations inside #[].
307
357
  * Deletion of elements from Vector should only modify the index and leave the vector as it is so that compacting is not needed and things are faster.
308
358
  * Add a #sync method which will sync the modified index with the unmodified vector.
309
359
  * Ability to reorder the index of a dataframe.
310
- * Slicing operations using Range.
311
- * Create DataFrame by providing rows.
312
- * Integrate basic plotting with Nyaplot.
313
- * Filter through a dataframe with filter\_rows or filter\_vectors based on whatever boolean value evaluates to true.
314
- * Named arguments
360
+ * head/tail for DV.
361
+ * #find\_max function which will evaluate a block and return the row for which the value of the block is max.
362
+ * Function to check if a value of a row/vector is within a specified range.
363
+ * Create a new vector in map_rows if any of the already present rows don't match the one assigned in the block.
364
+ * Direct functions to answer something like 'number of something per thousand of something else'.
365
+ * Tests for checking NMatrix resizing
366
+ * Sort while preserving index.
367
+
368
+ ## Contributing
369
+
370
+ Pick a feature from the Roadmap above or think of your own and send me a Pull Request!
371
+
372
+ ## Acknowledgements
373
+
374
+ * Thank you [last.fm](http://www.last.fm/) for making user data accessible to the public.
315
375
 
316
376
  Copyright (c) 2014, Sameer Deshmukh
317
377
  All rights reserved
data/daru.gemspec CHANGED
@@ -6,6 +6,12 @@ require 'version.rb'
6
6
  DESCRIPTION = <<MSG
7
7
  Daru (Data Analysis in RUby) is a library for storage, analysis and manipulation
8
8
  of data.
9
+
10
+ Daru works with Ruby arrays, NMatrix and MDArray, thus working seamlessly across
11
+ ruby interpreters, at the same time providing speed for those who need it.
12
+
13
+ This library is under active development so NMatrix and MDArray support is
14
+ somewhat limited, but should be available soon!
9
15
  MSG
10
16
 
11
17
  Gem::Specification.new do |spec|
@@ -27,6 +33,7 @@ Gem::Specification.new do |spec|
27
33
  spec.add_development_dependency 'rake'
28
34
  spec.add_development_dependency 'rspec'
29
35
  spec.add_development_dependency 'awesome_print'
36
+ spec.add_development_dependency 'nyaplot'
30
37
  if RUBY_ENGINE != 'jruby'
31
38
  spec.add_development_dependency 'nmatrix', '~> 0.1.0.rc5'
32
39
  end
@@ -2,7 +2,255 @@ module Daru
2
2
  module Accessors
3
3
  # Internal class for wrapping ruby array
4
4
  class ArrayWrapper
5
+ module Statistics
5
6
 
7
+ def average_deviation_population m=nil
8
+ m ||= mean
9
+ (@vector.inject(0) {|memo, val| memo + (val - m).abs }) / n_valid
10
+ end
11
+
12
+ def coefficient_of_variation
13
+ standard_deviation_sample / mean
14
+ end
15
+
16
+ def count value=false
17
+ if block_given?
18
+ @vector.inject(0){ |memo, val| memo += 1 if yield val; memo}
19
+ else
20
+ val = frequencies[value]
21
+ val.nil? ? 0 : val
22
+ end
23
+ end
24
+
25
+ def factors
26
+ index = @data.sorted_indices
27
+ index.reduce([]){|memo, val| memo.push(@data[val]) if memo.last != @data[val]; memo}
28
+ end # TODO
29
+
30
+ def frequencies
31
+ @vector.inject({}) do |hash, element|
32
+ hash[element] ||= 0
33
+ hash[element] += 1
34
+ hash
35
+ end
36
+ end
37
+
38
+ def has_missing_data?
39
+ has_missing_data
40
+ end
41
+
42
+ def kurtosis m=nil
43
+ m ||= mean
44
+ fo = @vector.inject(0){ |a, x| a + ((x - m) ** 4) }
45
+ fo.quo(@size * standard_deviation_sample(m) ** 4) - 3
46
+ end
47
+
48
+ def mean
49
+ sum.quo(@size).to_f
50
+ end
51
+
52
+ def median
53
+ percentile 50
54
+ end
55
+
56
+ def median_absolute_deviation
57
+ m = median
58
+ recode {|val| (val - m).abs }.median
59
+ end
60
+
61
+ def mode
62
+ freqs = frequencies.values
63
+
64
+ frequencies.keys[freqs.index(freqs.max)]
65
+ end
66
+
67
+ def n_valid
68
+ @size
69
+ end
70
+
71
+ def percentile percent
72
+ sorted = @vector.sort
73
+ v = (n_valid * percent).quo(100)
74
+ if v.to_i != v
75
+ sorted[v.to_i]
76
+ else
77
+ (sorted[(v - 0.5).to_i].to_f + sorted[(v + 0.5).to_i]).quo(2)
78
+ end
79
+ end
80
+
81
+ def product
82
+ @vector.inject(:*)
83
+ end
84
+
85
+ def max
86
+ @vector.max
87
+ end
88
+
89
+ def min
90
+ @vector.min
91
+ end
92
+
93
+ def proportion value=1
94
+ frequencies[value].quo(n_valid)
95
+ end
96
+
97
+ def proportions
98
+ len = n_valid
99
+ frequencies.inject({}) { |hash, arr| hash[arr[0]] = arr[1].quo(len); hash }
100
+ end
101
+
102
+ def range
103
+ max - min
104
+ end
105
+
106
+ def ranked
107
+ sum = 0
108
+ r = frequencies.sort.inject( {} ) do |memo, val|
109
+ memo[val[0]] = ((sum + 1) + (sum + val[1])).quo(2)
110
+ sum += val[1]
111
+ memo
112
+ end
113
+
114
+ Daru::Vector.new @vector.map { |e| r[e] }, index: @caller.index,
115
+ name: @caller.name, dtype: @caller.dtype
116
+ end
117
+
118
+ def recode(&block)
119
+ @vector.map(&block)
120
+ end
121
+
122
+ def recode!(&block)
123
+ @vector.map!(&block)
124
+ end
125
+
126
+ # Calculate skewness using (sigma(xi - mean)^3)/((N)*std_dev_sample^3)
127
+ def skew m=nil
128
+ m ||= mean
129
+ th = @vector.inject(0) { |memo, val| memo + ((val - m)**3) }
130
+ th.quo(@size * (standard_deviation_sample(m)**3))
131
+ end
132
+
133
+ def standard_deviation_population m=nil
134
+ m ||= mean
135
+ Math::sqrt(variance_population(m))
136
+ end
137
+
138
+ def standard_deviation_sample m=nil
139
+ Math::sqrt(variance_sample(m))
140
+ end
141
+
142
+ def standard_error
143
+ standard_deviation_sample/(Math::sqrt(@size))
144
+ end
145
+
146
+ def sum_of_squared_deviation
147
+ (@vector.inject(0) { |a,x| x.square + a } - (sum.square.quo(@size))).to_f
148
+ end
149
+
150
+ def sum_of_squares(m=nil)
151
+ m ||= mean
152
+ @vector.inject(0) { |memo, val| memo + (val - m)**2 }
153
+ end
154
+
155
+ def sum
156
+ @vector.inject(:+)
157
+ end
158
+
159
+ # Sample variance with denominator (N-1)
160
+ def variance_sample m=nil
161
+ m ||= self.mean
162
+
163
+ sum_of_squares(m).quo(@size - 1)
164
+ end
165
+
166
+ # Population variance with denominator (N)
167
+ def variance_population m=nil
168
+ m ||= mean
169
+
170
+ sum_of_squares(m).quo(@size).to_f
171
+ end
172
+ end # module Statistics
173
+
174
+ include Statistics
175
+ include Enumerable
176
+
177
+ def each(&block)
178
+ @vector.each(&block)
179
+ end
180
+
181
+ def map!(&block)
182
+ @vector.map!(&block)
183
+ end
184
+
185
+ attr_accessor :size
186
+ attr_reader :vector
187
+ attr_reader :has_missing_data
188
+
189
+ def initialize vector, caller
190
+ @vector = vector
191
+ @caller = caller
192
+
193
+ set_size
194
+ end
195
+
196
+ def [] index
197
+ @vector[index]
198
+ end
199
+
200
+ def []= index, value
201
+ @has_missing_data = true if value.nil?
202
+ @vector[index] = value
203
+ set_size
204
+ end
205
+
206
+ def == other
207
+ @vector == other
208
+ end
209
+
210
+ def delete_at index
211
+ @vector.delete_at index
212
+ set_size
213
+ end
214
+
215
+ def index key
216
+ @vector.index key
217
+ end
218
+
219
+ def << element
220
+ @vector << element
221
+ set_size
222
+ end
223
+
224
+ def uniq
225
+ @vector.uniq
226
+ end
227
+
228
+ def to_a
229
+ @vector
230
+ end
231
+
232
+ def dup
233
+ ArrayWrapper.new @vector.dup, @caller
234
+ end
235
+
236
+ def coerce dtype
237
+ case
238
+ when dtype == Array
239
+ self
240
+ when dtype == NMatrix
241
+ Daru::Accessors::NMatrixWrapper.new @vector, @caller
242
+ when dtype == MDArray
243
+ raise NotImplementedError
244
+ else
245
+ raise ArgumentError, "Can't coerce to dtype #{dtype}"
246
+ end
247
+ end
248
+
249
+ private
250
+
251
+ def set_size
252
+ @size = @vector.size
253
+ end
6
254
  end
7
255
  end
8
256
  end
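
The `median` method above delegates to `percentile 50`, so the index arithmetic in `percentile` decides which element is returned. Below is a minimal standalone sketch of that rule on a plain Array, assuming the statsample-style convention of truncating the position rather than rounding it. The bounds clamping and the empty-input guard are additions for this sketch and are not part of the gem's code.

```ruby
# Standalone sketch of a percentile lookup on a plain Array.
# Assumption: positions are truncated (to_i), not rounded; the clamping
# and the nil guard are extras for this sketch, NOT from the gem.
def percentile(values, percent)
  return nil if values.empty?

  sorted = values.sort
  v = (sorted.size * percent).quo(100) # position in the sorted data

  if v.to_i != v
    # Fractional position: take the element at the truncated index.
    sorted[v.to_i]
  else
    # Whole position: average the two neighbouring elements.
    lower = [v.to_i - 1, 0].max
    upper = [v.to_i, sorted.size - 1].min
    (sorted[lower].to_f + sorted[upper]).quo(2)
  end
end

odd_median  = percentile([5, 1, 3, 2, 4], 50) # middle element of 1..5
even_median = percentile([4, 1, 3, 2], 50)    # average of 2 and 3
```

With five elements the position is 2.5, which truncates to index 2 (the middle element). With four elements the position is exactly 2, so the elements at indexes 1 and 2 are averaged.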