daru 0.0.3 → 0.0.3.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 7bee85826c8bd5bb962982278d93ee62b7874d93
- data.tar.gz: 7e9e66b3282f44888c3d018bfcfe09e3d03ae065
+ metadata.gz: e0388674c6bcbd695b0c456529e987cce5124052
+ data.tar.gz: 4433ab2ef63dd8908e12c43937dbb704abccba43
  SHA512:
- metadata.gz: de3897032876c4ced80ca9b8ac741e369b09478adbcc0bec1001e38b7c1474e62b2b0d9d2f5d29802830c2540c460b0688b66124a1aac570699424f95c6884c2
- data.tar.gz: edeb5ae0b7d9523a1fc72ae845e89145b7a462b09c89989f01e614620c27860ea2bb14b020d59d3e9f21851085ff66d4104fe243d4daa57bc0544fcc46b023ba
+ metadata.gz: 37c6930a8a54e567e92efaa68c04c0774a33a7c08b6d90f66b9ee3309a09d05713d265c0dc502dd05ad36658ebd992f50df0994f96f84da5b2bff2576d5d78e7
+ data.tar.gz: 9985f10d9d66615a0fce7a949d03765029bb186ec7f69e91b6397d1a9a51bf2849ede0c64303719918bb097e3cff0f32ba52a610a08a0f92803805dc1a298d88
data/README.md CHANGED
@@ -7,29 +7,280 @@ Data Analysis in RUby

  ## Introduction

- daru (Data Analysis in RUby) is a library for storage, analysis and manipulation of data. It aims to be the preferred data analysis library for Ruby.
+ daru (Data Analysis in RUby) is a library for storage, analysis and manipulation of data.

- Development of daru was started to address the fragmentation of Dataframe-like classes which were created in many ruby gems as per their own needs.
-
- This creates a hurdle in using these gems together to solve a problem. For example, calculating something in [statsample](https://github.com/clbustos/statsample) and plotting the results in [Nyaplot](https://github.com/domitry/nyaplot).
+ Development of daru was started to address the fragmentation of Dataframe-like classes which were created in many ruby gems as per their own needs. daru offers a uniform interface for all sorts of data analysis and manipulation operations and aims to be compatible with all ruby gems involved in any way with data.

  daru is heavily inspired by `Statsample::Dataset`, `Nyaplot::DataFrame` and the super-awesome pandas, a very mature solution in Python.

- ## Data Structures
+ daru works with CRuby (1.9.3+) and JRuby and in a few weeks will be completely compatible with NMatrix and MDArray for fast data manipulation using C or Java structures.
+
+ ## Features
+
+ * Data structures:
+     - Vector - A basic 1-D vector.
+     - DataFrame - A 2-D matrix-like structure which is internally composed of named `Vector` classes.
+ * Compatible with IRuby notebook.
+ * Indexed and named data structures.
+ * Flexible and intuitive API for manipulation and analysis of data.
+
+ ## Usage
+
+ daru has been created with extreme ease of use in mind.
+
+ The gem consists of two data structures, Vector and DataFrame. Any data in a serial format is a Vector and a table is a DataFrame.
+
+ #### Initialization of DataFrame
+
+ A data frame can be initialized from the following sources:
+ * Hash of indexed vectors: `{ b: Daru::Vector.new(:b, [11,12,13,14,15], [:two, :one, :four, :five, :three]), a: Daru::Vector.new(:a, [1,2,3,4,5], [:two,:one,:three, :four, :five])}`.
+ * Array of hashes: `[{a: 1, b: 11}, {a: 2, b: 12}, {a: 3, b: 13},{a: 4, b: 14}, {a: 5, b: 15}]`.
+ * Hash of names and Arrays: `{b: [11,12,13,14,15], a: [1,2,3,4,5]}`
+
+ The DataFrame constructor takes the source as its first argument, followed by an options hash with the keys `vectors`, `index` and `name`. Only the source is mandatory; the options are optional.
+
+ A basic DataFrame can be initialized like this:
+
+ ```ruby
+
+ df = Daru::DataFrame.new({b: [11,12,13,14,15], a: [1,2,3,4,5]}, vectors: [:a, :b], index: [:one, :two, :three, :four, :five])
+ df
+ # =>
+ # #<Daru::DataFrame:87274040 @name = 7308c587-4073-4e7d-b3ca-3679d1dcc946 # @size = 5>
+ # a b
+ # one 1 11
+ # two 2 12
+ # three 3 13
+ # four 4 14
+ # five 5 15
+
+ ```
+ Daru will automatically align the vectors correctly according to the specified index and then create the DataFrame. Thus, elements having the same index will show up in the same row. The indexes will be arranged alphabetically if vectors with unaligned indexes are supplied.
+
+ The vectors of the DataFrame will be arranged according to the array specified in the (optional) `vectors` option. Otherwise the vectors are ordered alphabetically.
+
+ ```ruby
+
+ df = Daru::DataFrame.new({
+     b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
+     a: [1,2,3,4,5].dv(:a, [:two,:one,:three, :four, :five])
+   },
+   vectors: [:a, :b]
+ )
+ df
+
+ # =>
+ # #<Daru::DataFrame:87363700 @name = 75ba0a14-8291-48ac-ac30-35017e4d6c5f # @size = 5>
+ # a b
+ # five 5 14
+ # four 4 13
+ # one 2 12
+ # three 3 15
+ # two 1 11
+
+ ```
+
+ If an index for the DataFrame is supplied (the `index` option), then the indexes of the individual vectors will be matched to the DataFrame index. If any of the indexes do not match, nils will be inserted instead:
+
+ ```ruby
+
+ df = Daru::DataFrame.new({
+   b: [11]                .dv(nil, [:one]),
+   a: [1,2,3]             .dv(nil, [:one, :two, :three]),
+   c: [11,22,33,44,55]    .dv(nil, [:one, :two, :three, :four, :five]),
+   d: [49,69,89,99,108,44].dv(nil, [:one, :two, :three, :four, :five, :six])
+ }, vectors: [:a, :b, :c, :d], index: [:one, :two, :three, :four, :five, :six])
+ df
+ # =>
+ # #<Daru::DataFrame:87523270 @name = bda4eb68-afdd-4404-9981-708edab14201 #@size = 6>
+ # a b c d
+ # one 1 11 11 49
+ # two 2 nil 22 69
+ # three 3 nil 33 89
+ # four nil nil 44 99
+ # five nil nil 55 108
+ # six nil nil nil 44
+
+ ```
+
+ If some of the supplied vectors do not contain certain indexes that are contained in other vectors, they are added to those vectors and the corresponding elements are set to `nil`.
+
+ ```ruby
+
+ df = Daru::DataFrame.new({
+     b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
+     a: [1,2,3]         .dv(:a, [:two,:one,:three])
+   },
+   vectors: [:a, :b]
+ )
+ df
+
+ # =>
+ # #<Daru::DataFrame:87612510 @name = 1e904c15-e095-4dce-bfdf-c07ee4d6e4a4 # @size = 5>
+ # a b
+ # five nil 14
+ # four nil 13
+ # one 2 12
+ # three 3 15
+ # two 1 11
+
+ ```
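
The alignment rule described in the README text above can be sketched in plain Ruby, without daru. This is only an illustration under stated assumptions: `align_columns` is a hypothetical helper, and plain Hashes of label to value stand in for indexed `Daru::Vector` objects. It is not how daru implements alignment internally.

```ruby
# Hypothetical helper (not part of daru): plain Hashes of label => value
# stand in for indexed vectors.
def align_columns(columns, index = nil)
  # With no explicit index, fall back to the sorted union of all labels.
  index ||= columns.values.flat_map(&:keys).uniq.sort

  aligned = columns.each_with_object({}) do |(name, labelled), result|
    # Hash#[] returns nil for a label the column does not have.
    result[name] = index.map { |label| labelled[label] }
  end

  [index, aligned]
end

b = { two: 11, one: 12, four: 13, five: 14, three: 15 }
a = { two: 1, one: 2, three: 3 }

index, aligned = align_columns({ a: a, b: b })
index       # => [:five, :four, :one, :three, :two]
aligned[:a] # => [nil, nil, 2, 3, 1]
aligned[:b] # => [14, 13, 12, 15, 11]
```

The resulting rows match the last README example: labels present in only one column yield `nil` in the other.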
+
+ #### Initialization of Vector
+
+ The `Vector` data structure is also named and indexed. It accepts the source as its first argument, followed by the optional `name` and `index` options.
+
+ In the simplest case it can be constructed like this:
+
+ ```ruby
+
+ dv = Daru::Vector.new [1,2,3,4,5], name: :ravan, index: [:ek, :don, :teen, :char, :pach]
+ dv
+
+ # =>
+ # #<Daru::Vector:87630270 @name = ravan @size = 5 >
+ # ravan
+ # ek 1
+ # don 2
+ # teen 3
+ # char 4
+ # pach 5
+
+ ```
+
+ Initializing a vector with indexes will insert nils in places where elements don't exist:
+
+ ```ruby
+
+ dv = Daru::Vector.new [1,2,3], name: :yoga, index: [0,1,2,3,4]
+ dv
+ # =>
+ # #<Daru::Vector:87890840 @name = yoga @size = 5 >
+ # y
+ # 0 1
+ # 1 2
+ # 2 3
+ # 3 nil
+ # 4 nil
+
+
+ ```
+
+ #### Basic Selection Operations
+
+ Initialize a dataframe:
+
+ ```ruby
+
+ df = Daru::DataFrame.new({
+     b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
+     a: [1,2,3,4,5].dv(:a, [:two,:one,:three, :four, :five])
+   },
+   vectors: [:a, :b]
+ )
+
+ # =>
+ # #<Daru::DataFrame:87455010 @name = b3d14e23-98c2-4741-a563-92e8f1fd0f13 # @size = 5>
+ # a b
+ # five 5 14
+ # four 4 13
+ # one 2 12
+ # three 3 15
+ # two 1 11
+
+ ```
+ Select a row from a DataFrame:
+
+ ```ruby
+
+ df.row[:one]
+
+ # =>
+ # #<Daru::Vector:87432070 @name = one @size = 2 >
+ # one
+ # a 2
+ # b 12
+ ```
+ A row or a vector is returned as a `Daru::Vector` object, so any manipulations supported by `Daru::Vector` can be performed on the chosen row as well.
+
+ Select a single vector:
+
+ ```ruby
+
+ df.vector[:a] # or simply df.a
+
+ # =>
+ # #<Daru::Vector:87454270 @name = a @size = 5 >
+ # a
+ # five 5
+ # four 4
+ # one 2
+ # three 3
+ # two 1
+
+ ```
+
+ Select multiple vectors and return a DataFrame in the specified order:
+
+ ```ruby
+
+ df.vector[:b, :a]
+ # =>
+ # #<Daru::DataFrame:87835960 @name = e80902cc-cff9-4b23-9eca-5da36ebc88a8 # @size = 5>
+ # b a
+ # five 14 5
+ # four 13 4
+ # one 12 2
+ # three 15 3
+ # two 11 1
+
+ ```
+
+ Keep/remove row according to a specified condition:
+
+ ```ruby
+
+ df = df.filter_rows do |row|
+   row[:a] == 5
+ end
+
+ df
+ # =>
+ # #<Daru::DataFrame:87455010 @name = b3d14e23-98c2-4741-a563-92e8f1fd0f13 # @size = 1>
+ # a b
+ # five 5 14
+
+ ```
+ The same can be applied to vectors using `keep_vector_if`.
+
+ To iterate over a DataFrame and perform operations on rows or vectors, `#each_row` or `#each_vector` can be used, which work just like `#each` for Ruby Arrays.
+
+ To change the values of a row/vector while iterating through the DataFrame, use `map_rows` or `map_vectors`:
+
+ ```ruby
+
+ df.map_rows do |row|
+   row = row.map { |e| e*e }
+ end

- daru employs several data structures for storing and manipulating data:
- * Vector - A basic 1-D vector.
- * DataFrame - A 2-D matrix-like structure which is internally composed of named `Vector` classes.
+ df

- daru data structures can be constructed by using several Ruby classes. These include `Array`, `Hash`, `Matrix`, [NMatrix](https://github.com/SciRuby/nmatrix) and [MDArray](https://github.com/rbotafogo/mdarray). daru brings a uniform API for handling and manipulating data represented in any of the above Ruby classes.
+ # =>
+ # #<Daru::DataFrame:86826830 @name = b092ca5b-7b83-4dbe-a469-124f7f25a568 # @size = 5>
+ # a b
+ # five 25 196
+ # four 16 169
+ # one 4 144
+ # three 9 225
+ # two 1 121

- Currently things work as expected for Arrays only. Rest will added over the next few weeks.
+ ```

- ## Testing
+ Rows/vectors can be deleted using `delete_row` or `delete_vector`.

- Install jruby using `rvm install jruby`, then run `jruby -S gem install mdarray`, followed by `bundle install`. You will need to install `mdarray` manually because of strange gemspec file behaviour. If anyone can automate this then I'd greatly appreciate it! Then run `rspec` in JRuby to test for MDArray functionality.
+ #### Basic Math Operations

- Then switch to MRI, do a normal `bundle install` followed by `rspec` for testing everything else with NMatrix functionality.
+ Coming soon!

  ## Roadmap

@@ -56,6 +307,11 @@ Then switch to MRI, do a normal `bundle install` followed by `rspec` for testing
  * Deletion of elements from Vector should only modify the index and leave the vector as it is so that compacting is not needed and things are faster.
  * Add a #sync method which will sync the modified index with the unmodified vector.
  * Ability to reorder the index of a dataframe.
+ * Slicing operations using Range.
+ * Create DataFrame by providing rows.
+ * Integrate basic plotting with Nyaplot.
+ * Filter through a dataframe with filter\_rows or filter\_vectors based on whatever boolean value evaluates to true.
+ * Named arguments

  Copyright (c) 2014, Sameer Deshmukh
  All rights reserved
@@ -0,0 +1,8 @@
+ module Daru
+   module Accessors
+     # Internal class for wrapping ruby array
+     class ArrayWrapper
+
+     end
+   end
+ end
@@ -0,0 +1,17 @@
+ module Daru
+   module Accessors
+     class DataFrameByRow
+       def initialize data_frame
+         @data_frame = data_frame
+       end
+
+       def [](*names)
+         @data_frame[*names, :row]
+       end
+
+       def []=(name, vector)
+         @data_frame[name, :row] = vector
+       end
+     end
+   end
+ end
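
The accessor class above is a thin proxy: `df.row[...]` forwards to the frame's own `[]` with an axis tag appended. A minimal plain-Ruby sketch of that delegation follows. `TinyFrame` and `RowAccessor` are hypothetical stand-ins written for this illustration and are not daru classes; only the `[*names, :row]` call shape is taken from the diff.

```ruby
# Hypothetical stand-in for a data frame; stores columns as name => Array.
class TinyFrame
  def initialize(columns)
    @columns = columns
  end

  # Axis-tagged lookup: the last argument says which axis the names refer to.
  def [](*names, axis)
    case axis
    when :vector
      @columns.fetch(names.first)
    when :row
      # Collect the value at the given position from every column.
      @columns.each_with_object({}) do |(name, values), row|
        row[name] = values.fetch(names.first)
      end
    else
      raise ArgumentError, "unknown axis #{axis.inspect}"
    end
  end

  # Returns a proxy so callers can write frame.row[0].
  def row
    RowAccessor.new(self)
  end
end

# Proxy that appends the :row tag and delegates to the frame.
class RowAccessor
  def initialize(frame)
    @frame = frame
  end

  def [](*names)
    @frame[*names, :row]
  end
end

frame = TinyFrame.new({ a: [1, 2], b: [11, 12] })
first_row = frame.row[0] # => {:a=>1, :b=>11}
```

Keeping the axis logic in one `[]` method and exposing it through small proxies lets `row` and `vector` share the same lookup code.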
@@ -0,0 +1,17 @@
+ module Daru
+   module Accessors
+     class DataFrameByVector
+       def initialize data_frame
+         @data_frame = data_frame
+       end
+
+       def [](*names)
+         @data_frame[*names, :vector]
+       end
+
+       def []=(name, vectors)
+         @data_frame[name, :vector] = vectors
+       end
+     end
+   end
+ end
@@ -0,0 +1,9 @@
+ module Daru
+   module Accessors
+
+     # Internal class for wrapping MDArray
+     class MDArrayWrapper
+
+     end
+   end
+ end
@@ -0,0 +1,9 @@
+ module Daru
+   module Accessors
+
+     # Internal class for wrapping NMatrix
+     class NMatrixWrapper
+
+     end
+   end
+ end
@@ -1,10 +1,15 @@
- require_relative 'dataframe_by_row.rb'
- require_relative 'dataframe_by_vector.rb'
- require_relative 'io.rb'
+ require_relative 'accessors/dataframe_by_row.rb'
+ require_relative 'accessors/dataframe_by_vector.rb'
+ require_relative 'math/arithmetic/dataframe.rb'
+ require_relative 'math/statistics/dataframe.rb'
+ require_relative 'io/io.rb'

  module Daru
    class DataFrame

+     include Daru::Math::Arithmetic::DataFrame
+     include Daru::Math::Statistics::DataFrame
+
      class << self
        def from_csv path, opts={}, &block
          Daru::IO.from_csv path, opts, &block
@@ -19,10 +24,10 @@ module Daru
      # DataFrame basically consists of an Array of Vector objects.
      # These objects are indexed by row and column by vectors and index Index objects.
      # Arguments - source, vectors, index, name in that order. Last 3 are optional.
-     def initialize source, *args
-       vectors = args.shift
-       index = args.shift
-       @name = args.shift || SecureRandom.uuid
+     def initialize source, opts={}
+       vectors = opts[:vectors]
+       index = opts[:index]
+       @name = (opts[:name] || SecureRandom.uuid).to_sym

        @data = []

@@ -77,7 +82,7 @@ module Daru
        end

        @vectors.each do |vector|
-         @data << Daru::Vector.new(vector, [], @index)
+         @data << Daru::Vector.new([], name: vector, index: @index)

          @index.each do |idx|
            begin
@@ -132,11 +137,11 @@ module Daru
      end

      def vector
-       Daru::DataFrameByVector.new(self)
+       Daru::Accessors::DataFrameByVector.new(self)
      end

      def row
-       Daru::DataFrameByRow.new(self)
+       Daru::Accessors::DataFrameByRow.new(self)
      end

      def dup
@@ -145,7 +150,7 @@ module Daru
          src[vector] = @data[@vectors[vector]]
        end

-       Daru::DataFrame.new src, @vectors.dup, @index.dup, @name
+       Daru::DataFrame.new src, vectors: @vectors.dup, index: @index.dup, name: @name
      end

      def each_vector(&block)
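
The constructor hunks above replace positional arguments (`source, *args`) with a trailing options hash, so call sites name what they pass. The following is a rough plain-Ruby sketch of that pattern. `NamedTable` is a hypothetical class invented for this example; only the `opts[:vectors]`, `opts[:index]` and `opts[:name]` handling mirrors the diff.

```ruby
require 'securerandom'

# Hypothetical class; only the argument handling follows the diff above.
class NamedTable
  attr_reader :name, :vectors, :index

  def initialize(source, opts = {})
    @source  = source
    @vectors = opts[:vectors] # nil when the caller omits it
    @index   = opts[:index]
    # Fall back to a random name, and normalise to a Symbol either way.
    @name    = (opts[:name] || SecureRandom.uuid).to_sym
  end
end

# Any subset of options can be given, in any order.
table = NamedTable.new({ a: [1, 2] }, index: [:one, :two])
named = NamedTable.new({ a: [1, 2] }, name: 'scores')

table.index # => [:one, :two]
named.name  # => :scores
```

With positional arguments, supplying only an index would have required passing `nil` for the vectors; the options hash removes that placeholder.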
@@ -244,11 +249,17 @@ module Daru
      end

      def keep_row_if &block
+       deletion = []
+
        @index.each do |index|
          keep_row = yield access_row(index)

-         delete_row index unless keep_row
+         deletion << index unless keep_row
        end
+
+       deletion.each { |idx|
+         delete_row idx
+       }
      end

      def keep_vector_if &block
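
The `keep_row_if` change above first collects the labels to delete and only removes them after the iteration has finished. The sketch below shows, with a plain Array, why deleting during iteration is risky. It assumes Array semantics on CRuby; `@index` in daru is an Index object whose iteration behaviour may differ, so this illustrates the general hazard rather than reproducing the original bug.

```ruby
unwanted = [:one, :two]

# Single pass: deleting the current element shifts the next one into the
# slot that was just visited, so the iteration skips it.
labels = [:one, :two, :three, :four]
labels.each { |label| labels.delete(label) if unwanted.include?(label) }
single_pass = labels # on CRuby, :two is typically left behind

# Two passes: decide first, delete afterwards (the shape used in the diff).
labels = [:one, :two, :three, :four]
deletion = []
labels.each { |label| deletion << label if unwanted.include?(label) }
deletion.each { |label| labels.delete(label) }
two_pass = labels # => [:three, :four]
```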
@@ -259,6 +270,31 @@ module Daru
        end
      end

+     def filter_rows &block
+       df = Daru::DataFrame.new({}, vectors: @vectors.to_a)
+       marked = []
+
+       @index.each do |index|
+         keep_row = yield access_row(index)
+
+         marked << index if keep_row
+       end
+
+       marked.each do |idx|
+         df.row[idx] = self[idx, :row]
+       end
+
+       df
+     end
+
+     def filter_vectors &block
+       df = self.dup
+
+       df.keep_vector_if &block
+
+       df
+     end
+
      def has_vector? name
        !!@vectors[name]
      end
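
Unlike `keep_row_if`, the `filter_rows` method added above builds a new frame from the rows that pass the block and leaves the receiver untouched. A minimal sketch of that mark-then-copy shape in plain Ruby follows. The standalone `filter_rows` function and the Hash-of-rows representation are assumptions made for this example, not daru's data model.

```ruby
# Hypothetical stand-in: a Hash of row label => row Hash plays the data frame.
def filter_rows(rows)
  # First pass: mark the labels whose row satisfies the block.
  marked = rows.keys.select { |label| yield rows[label] }

  # Second pass: copy only the marked rows into a fresh collection.
  marked.each_with_object({}) { |label, kept| kept[label] = rows[label] }
end

rows = {
  five: { a: 5, b: 14 },
  four: { a: 4, b: 13 },
  one:  { a: 2, b: 12 }
}

filtered = filter_rows(rows) { |row| row[:a] == 5 }
filtered  # => {:five=>{:a=>5, :b=>14}}
rows.size # => 3, the original is unchanged
```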
@@ -323,7 +359,8 @@ module Daru
      end

      def inspect spacing=10, threshold=15
-       longest = [@vectors.map(&:to_s).map(&:size).max,
+       longest = [@name.to_s.size,
+                  @vectors.map(&:to_s).map(&:size).max,
                   @index  .map(&:to_s).map(&:size).max,
                   @data   .map{ |v| v.map(&:to_s).map(&:size).max }.max].max

@@ -395,16 +432,14 @@ module Daru
          new_vcs[name] = @data[@vectors[name]]
        end

-       Daru::DataFrame.new new_vcs, new_vcs.keys, @index, @name
+       Daru::DataFrame.new new_vcs, vectors: new_vcs.keys, index: @index, name: @name
      end

      def access_row *names
        unless names[1]
          row = []

-         @vectors.each do |vector|
-           row << @data[@vectors[vector]][names[0]]
-         end
+         name = nil

          if @index.include? names[0]
            name = names[0]
@@ -414,7 +449,11 @@ module Daru
            raise IndexError, "Specified row #{names[0]} does not exist."
          end

-         Daru::Vector.new name, row, @vectors
+         @vectors.each do |vector|
+           row << @data[@vectors[vector]][name]
+         end
+
+         Daru::Vector.new row, index: @vectors, name: name
        else
          # TODO: Access multiple rows
        end
@@ -426,7 +465,7 @@ module Daru
        v = nil

        if vector.is_a?(Daru::Vector)
-         v = Daru::Vector.new name, [], @index
+         v = Daru::Vector.new [], name: name, index: @index

          @index.each do |idx|
            begin
@@ -474,7 +513,7 @@ module Daru

      def create_empty_vectors
        @vectors.each do |name|
-         @data << Daru::Vector.new(name, [], @index)
+         @data << Daru::Vector.new([], name: name, index: @index)
        end
      end