daru 0.2.0 → 0.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 69452b32fd8ef0ef7fb4ed58ab53ffa8aa15806d
- data.tar.gz: 56927c77adbe7941eb2ca9a5e44d705931aad237
+ metadata.gz: 87e4e2869fe6411e3eece92bb5dc24d48f890774
+ data.tar.gz: e711d0db1d57f51f31ccb7fb54078a6bdbcc4ff5
  SHA512:
- metadata.gz: 8e7511133b3409f7821cfec944a950d53df57bcd5893bb8a9557c013f31bf1e4a9cc07bbe1c143c63684f00f7d8d8f1adf3b31df732508e667ba6677f47d1d96
- data.tar.gz: fc4beb70106372a276b21e0da645951595e5674f56e4422752aeeabc9cc2156983add90e59486aea4d88386fbeb2896d15f7ede30667bc84027abd900ee42e0e
+ metadata.gz: afdb295d0d01542ba9f439cf5f7959d7f2a3b9e47de6047ecf7719548ef760e657c0dfe753ed16ee1da65e071bb5a182aaf03ee83c9de6075d54149753b9c346
+ data.tar.gz: e0c4ace661d9f1cb7e8040d424bb004a0b650a9605037d1aff258258bbac40a3c158e5f5b8a2a5c6a28070cf55566a0729ee9b77c8114d40d4d18cf9d26e69c3
data/History.md CHANGED
@@ -1,3 +1,18 @@
+ # 0.2.1 (02 July 2018)
+
+ * Minor Enhancements
+   - Allow passing singular Symbol to CSV converters option (@takkanm)
+   - Support calling GroupBy#each_group w/o blocks (@hibariya)
+   - Refactor grouping and aggregation (@paisible-wanderer)
+   - Add String Converter to Daru::IO::CSV::CONVERTERS (@takkanm)
+   - Fix annoying missing libraries warning
+   - Remove post-install message (nice yet useless)
+
+ * Fixes
+   - Fix group_by for DataFrame with single row (@baarkerlounger)
+   - `#rolling_fillna!` bugfixes on `Daru::Vector` and `Daru::DataFrame` (@mhammiche)
+   - Fixes `#include?` on multiindex (@rohitner)
+
  # 0.2.0 (31 October 2017)
  * Major Enhancements
  - Add `DataFrame#which` query DSL (experimental! @rainchen)
data/README.md CHANGED
@@ -3,12 +3,13 @@
  [![Gem Version](https://badge.fury.io/rb/daru.svg)](http://badge.fury.io/rb/daru)
  [![Build Status](https://travis-ci.org/SciRuby/daru.svg?branch=master)](https://travis-ci.org/SciRuby/daru)
  [![Gitter](https://badges.gitter.im/v0dro/daru.svg)](https://gitter.im/v0dro/daru?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
+ [![Open Source Helpers](https://www.codetriage.com/sciruby/daru/badges/users.svg)](https://www.codetriage.com/sciruby/daru)
 
  ## Introduction
 
  daru (Data Analysis in RUby) is a library for storage, analysis, manipulation and visualization of data in Ruby.
 
- daru makes it easy and intuitive to process data predominantly through 2 data structures: `Daru::DataFrame` and `Daru::Vector`. Written in pure Ruby works with all ruby implementations. Tested with MRI 2.0, 2.1, 2.2 and 2.3.
+ daru makes it easy and intuitive to process data predominantly through 2 data structures: `Daru::DataFrame` and `Daru::Vector`. Written in pure Ruby works with all ruby implementations. Tested with MRI 2.0, 2.1, 2.2, 2.3, and 2.4.
 
  ## Features
 
@@ -73,6 +74,7 @@ $ gem install daru
  * [Data Analysis in RUby: Basic data manipulation and plotting](http://v0dro.github.io/blog/2014/11/25/data-analysis-in-ruby-basic-data-manipulation-and-plotting/)
  * [Data Analysis in RUby: Splitting, sorting, aggregating data and data types](http://v0dro.github.io/blog/2015/02/24/data-analysis-in-ruby-part-2/)
  * [Finding and Combining data in daru](http://v0dro.github.io/blog/2015/08/03/finding-and-combining-data-in-daru/)
+ * [Introduction to analyzing datasets with daru library](http://gafur.me/2018/02/05/analysing-datasets-with-daru-library.html)
 
  ### Time series
 
@@ -192,13 +194,13 @@ In addition to nyaplot, daru also supports plotting out of the box with [gnuplot
 
  ## Documentation
 
- Docs can be found [here](https://rubygems.org/gems/daru).
+ Docs can be found [here](http://www.rubydoc.info/gems/daru).
 
  ## Contributing
 
  Pick a feature from the Roadmap or the issue tracker or think of your own and send me a Pull Request!
 
- For details see [CONTRIBUTING](https://github.com/v0dro/daru/blob/master/CONTRIBUTING.md).
+ For details see [CONTRIBUTING](https://github.com/SciRuby/daru/blob/master/CONTRIBUTING.md).
 
  ## Acknowledgements
 
@@ -27,29 +27,6 @@ Gem::Specification.new do |spec|
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ["lib"]
 
- spec.post_install_message = <<-EOF
- *************************************************************************
- Thank you for installing daru!
-
- oOOOOOo
- ,| oO
- //| |
- \\\\| |
- `| |
- `-----`
-
-
- Hope you love daru! For enhanced interactivity and better visualizations,
- consider using gnuplotrb and nyaplot with iruby. For statistics use the
- statsample family.
-
- Read the README for interesting use cases and examples.
-
- Cheers!
- *************************************************************************
- EOF
-
-
  spec.add_runtime_dependency 'backports'
 
  # it is required by NMatrix, yet we want to specify clearly which minimal version is OK
@@ -86,16 +86,6 @@ module Daru
  create_has_library :gruff
  end
 
- {'spreadsheet' => '~>1.1.1', 'mechanize' => '~>2.7.5'}.each do |name, version|
- begin
- gem name, version
- require name
- rescue LoadError
- Daru.error "\nInstall the #{name} gem version #{version} for using"\
- " #{name} functions."
- end
- end
-
  autoload :CSV, 'csv'
  require 'matrix'
  require 'forwardable'
@@ -1,11 +1,64 @@
  module Daru
  module Core
  class GroupBy
+ class << self
+ def get_positions_group_map_on(indexes_with_positions, sort: false)
+ group_map = {}
+
+ indexes_with_positions.each do |idx, position|
+ (group_map[idx] ||= []) << position
+ end
+
+ if sort # TODO: maybe add a more "stable" sorting option?
+ sorted_keys = group_map.keys.sort(&Daru::Core::GroupBy::TUPLE_SORTER)
+ group_map = sorted_keys.map { |k| [k, group_map[k]] }.to_h
+ end
+
+ group_map
+ end
+
+ def get_positions_group_for_aggregation(multi_index, level=-1)
+ raise unless multi_index.is_a?(Daru::MultiIndex)
+
+ new_index = multi_index.dup
+ new_index.remove_layer(level) # TODO: recheck code of Daru::MultiIndex#remove_layer
+
+ get_positions_group_map_on(new_index.each_with_index)
+ end
+
+ def get_positions_group_map_for_df(df, group_by_keys, sort: true)
+ indexes_with_positions = df[*group_by_keys].to_df.each_row.map(&:to_a).each_with_index
+
+ get_positions_group_map_on(indexes_with_positions, sort: sort)
+ end
+
+ def group_map_from_positions_to_indexes(positions_group_map, index)
+ positions_group_map.map { |k, positions| [k, positions.map { |pos| index.at(pos) }] }.to_h
+ end
+
+ def df_from_group_map(df, group_map, remaining_vectors, from_position: true)
+ return nil if group_map == {}
+
+ new_index = group_map.flat_map { |group, values| values.map { |val| group + [val] } }
+ new_index = Daru::MultiIndex.from_tuples(new_index)
+
+ return Daru::DataFrame.new({}, index: new_index) if remaining_vectors == []
+
+ new_rows_order = group_map.values.flatten
+ new_df = df[*remaining_vectors].to_df.get_sub_dataframe(new_rows_order, by_position: from_position)
+ new_df.index = new_index
+
+ new_df
+ end
+ end
+
  attr_reader :groups, :df
 
  # Iterate over each group created by group_by. A DataFrame is yielded in
  # block.
  def each_group
+ return to_enum(:each_group) unless block_given?
+
  groups.keys.each do |k|
  yield get_group(k)
  end
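The class-level helpers above all revolve around one map from group key to row positions. A minimal, hypothetical re-implementation of the `get_positions_group_map_on` idea in plain Ruby (outside daru; the plain `sort` below stands in for daru's tuple-aware `TUPLE_SORTER`):

```ruby
# Build { group_key => [row positions] } from (key, position) pairs,
# mirroring the grouping step above (standalone sketch, not daru itself).
def positions_group_map(pairs, sort: false)
  map = {}
  pairs.each { |key, pos| (map[key] ||= []) << pos }
  map = map.sort.to_h if sort # daru sorts with a nil-tolerant tuple sorter
  map
end

rows = [['foo', 0], ['bar', 1], ['foo', 2]]
positions_group_map(rows, sort: true)
# => {"bar"=>[1], "foo"=>[0, 2]}
```

Working in positions rather than index labels is what lets the refactor share one grouping core between `group_by`, aggregation, and `df_from_group_map`.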
@@ -22,11 +75,8 @@ module Daru
  end
 
  def initialize context, names
- @groups = {}
  @non_group_vectors = context.vectors.to_a - names
  @context = context
- vectors = names.map { |vec| context[vec].to_a }
- tuples = vectors[0].zip(*vectors[1..-1])
  # FIXME: It feels like we don't want to sort here. Ruby's #group_by
  # never sorts:
  #
@@ -34,7 +84,10 @@ module Daru
  # # => {4=>["test"], 2=>["me"], 6=>["please"]}
  #
  # - zverok, 2016-09-12
- init_groups_df tuples, names
+ positions_groups = GroupBy.get_positions_group_map_for_df(@context, names, sort: true)
+
+ @groups = GroupBy.group_map_from_positions_to_indexes(positions_groups, @context.index)
+ @df = GroupBy.df_from_group_map(@context, positions_groups, @non_group_vectors)
  end
 
  # Get a Daru::Vector of the size of each group.
@@ -282,26 +335,11 @@ module Daru
  # Ram Hyderabad,Mumbai
  #
  def aggregate(options={})
- @df.index = @df.index.remove_layer(@df.index.levels.size - 1)
  @df.aggregate(options)
  end
 
  private
 
- def init_groups_df tuples, names
- multi_index_tuples = []
- keys = tuples.uniq.sort(&TUPLE_SORTER)
- keys.each do |key|
- indices = all_indices_for(tuples, key)
- @groups[key] = indices
- indices.each do |indice|
- multi_index_tuples << key + [indice]
- end
- end
- @groups.freeze
- @df = resultant_context(multi_index_tuples, names) unless multi_index_tuples.empty?
- end
-
  def select_groups_from method, quantity
  selection = @context
  rows, indexes = [], []
@@ -342,33 +380,6 @@ module Daru
  end
  end
 
- def resultant_context(multi_index_tuples, names)
- multi_index = Daru::MultiIndex.from_tuples(multi_index_tuples)
- context_tmp = @context.dup.delete_vectors(*names)
- rows_tuples = context_tmp.access_row_tuples_by_indexs(
- *@groups.values.flatten!
- )
- context_new = Daru::DataFrame.rows(rows_tuples, index: multi_index)
- context_new.vectors = context_tmp.vectors
- context_new
- end
-
- def all_indices_for arry, element
- found, index, indexes = -1, -1, []
- while found
- found = arry[index+1..-1].index(element)
- if found
- index = index + found + 1
- indexes << index
- end
- end
- if indexes.count == 1
- [@context.index.at(*indexes)]
- else
- @context.index.at(*indexes).to_a
- end
- end
-
  def multi_indexed_grouping?
  return false unless @groups.keys[0]
  @groups.keys[0].size > 1
@@ -17,17 +17,17 @@ module Daru
  end
  end
 
- def initialize left_df, right_df, opts={}
+ def initialize left_df, right_df, opts={} # rubocop:disable Metrics/AbcSize -- quick-fix for issue #171
  init_opts(opts)
  validate_on!(left_df, right_df)
  key_sanitizer = ->(h) { sanitize_merge_keys(h.values_at(*on)) }
 
  @left = df_to_a(left_df)
- @left.sort_by!(&key_sanitizer)
+ @left.sort! { |a, b| safe_compare(a.values_at(*on), b.values_at(*on)) }
  @left_key_values = @left.map(&key_sanitizer)
 
  @right = df_to_a(right_df)
- @right.sort_by!(&key_sanitizer)
+ @right.sort! { |a, b| safe_compare(a.values_at(*on), b.values_at(*on)) }
  @right_key_values = @right.map(&key_sanitizer)
 
  @left_keys, @right_keys = merge_keys(left_df, right_df, on)
@@ -246,6 +246,15 @@ module Daru
  raise ArgumentError, "Both dataframes expected to have #{on.inspect} field"
  end
  end
+
+ def safe_compare(left_array, right_array)
+ left_array.zip(right_array).map { |l, r|
+ next 0 if l.nil? && r.nil?
+ next 1 if r.nil?
+ next -1 if l.nil?
+ l <=> r
+ }.reject(&:zero?).first || 0
+ end
  end
 
  module Merge
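The switch from `sort_by!` to `sort!` with `safe_compare` is what makes merge-key sorting tolerate `nil`s: `nil <=> 1` returns `nil` and would crash a plain sort. The method body is self-contained, so the same logic runs outside daru (a sketch that simply mirrors the code above):

```ruby
# Compare two key arrays element-wise, treating nil as smaller than any
# value, so rows with missing merge keys get a deterministic sort order.
def safe_compare(left_array, right_array)
  left_array.zip(right_array).map { |l, r|
    next 0 if l.nil? && r.nil?
    next 1 if r.nil?   # right is nil  => left sorts after it
    next -1 if l.nil?  # left is nil   => left sorts before
    l <=> r
  }.reject(&:zero?).first || 0
end

safe_compare([1, nil], [1, 2])   # => -1 (nil sorts before 2)
safe_compare([2, nil], [1, nil]) # => 1  (decided by the first column)
```

The `reject(&:zero?).first || 0` at the end implements tuple ordering: the first non-tied column decides, and fully tied tuples compare equal.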
@@ -549,6 +549,20 @@ module Daru
  Daru::Accessors::DataFrameByRow.new(self)
  end
 
+ # Extract a dataframe given row indexes or positions
+ # @param keys [Array] can be positions (if by_position is true) or indexes (if by_position is false)
+ # @return [Daru::DataFrame]
+ def get_sub_dataframe(keys, by_position: true)
+ return Daru::DataFrame.new({}) if keys == []
+
+ keys = @index.pos(*keys) unless by_position
+
+ sub_df = row_at(*keys)
+ sub_df = sub_df.to_df.transpose if sub_df.is_a?(Daru::Vector)
+
+ sub_df
+ end
+
  # Duplicate the DataFrame entirely.
  #
  # == Arguments
@@ -698,6 +712,7 @@ module Daru
  #
  def rolling_fillna!(direction=:forward)
  @data.each { |vec| vec.rolling_fillna!(direction) }
+ self
  end
 
  def rolling_fillna(direction=:forward)
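Returning `self` from `rolling_fillna!` lets calls chain and lets specs assert that the bang method hands back the receiver. The forward-fill itself reduces to a single scan; a standalone sketch on a plain array (daru seeds leading `nil`s with `0`, which this sketch assumes too):

```ruby
# Forward-fill nils in an array in place, returning the mutated array
# so calls can chain (plain-Ruby sketch of rolling_fillna!(:forward)).
def rolling_fillna_forward!(arr)
  last_valid = 0 # assumed seed for leading nils, matching daru's behavior
  arr.each_with_index do |v, i|
    v.nil? ? arr[i] = last_valid : last_valid = v
  end
  arr # returning the receiver is the 0.2.1 fix being mirrored here
end

rolling_fillna_forward!([1, nil, nil, 4, nil]) # => [1, 1, 1, 4, 4]
```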
@@ -990,6 +1005,17 @@ module Daru
  self
  end
 
+ def apply_method(method, keys: nil, by_position: true)
+ df = keys ? get_sub_dataframe(keys, by_position: by_position) : self
+
+ case method
+ when Symbol then df.send(method)
+ when Proc then method.call(df)
+ else raise
+ end
+ end
+ alias :apply_method_on_sub_df :apply_method
+
  # Retrieves a Daru::Vector, based on the result of calculation
  # performed on each row.
  def collect_rows &block
@@ -1450,11 +1476,10 @@ module Daru
  # # ["foo", "two", 3]=>[2, 4]}
  def group_by *vectors
  vectors.flatten!
- # FIXME: wouldn't it better to do vectors - @vectors here and
- # raise one error with all non-existent vector names?.. - zverok, 2016-05-18
- vectors.each { |v|
- raise(ArgumentError, "Vector #{v} does not exist") unless has_vector?(v)
- }
+ missing = vectors - @vectors.to_a
+ unless missing.empty?
+ raise(ArgumentError, "Vector(s) missing: #{missing.join(', ')}")
+ end
 
  vectors = [@vectors.first] if vectors.empty?
 
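The rewritten validation resolves the old FIXME: it reports every unknown vector name in one error instead of raising on the first, which is friendlier for interactive use. The check is plain array difference (standalone sketch with illustrative names):

```ruby
# Collect all missing names before raising, mirroring the group_by
# validation above (standalone sketch, not daru's actual method).
def validate_vectors!(requested, available)
  missing = requested - available
  unless missing.empty?
    raise ArgumentError, "Vector(s) missing: #{missing.join(', ')}"
  end
end

validate_vectors!([:a, :x, :y], [:a, :b]) rescue puts $!.message
# prints "Vector(s) missing: x, y"
```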
@@ -2249,22 +2274,6 @@ module Daru
  end
  end
 
- # returns array of row tuples at given index(s)
- def access_row_tuples_by_indexs *indexes
- positions = @index.pos(*indexes)
-
- return populate_row_for(positions) if positions.is_a? Numeric
-
- res = []
- new_rows = @data.map { |vec| vec[*indexes] }
- indexes.each do |index|
- tuples = []
- new_rows.map { |row| tuples += [row[index]] }
- res << tuples
- end
- res
- end
-
  # Function to use for aggregating the data.
  #
  # @param options [Hash] options for column, you want in resultant dataframe
@@ -2282,7 +2291,7 @@ module Daru
  # 3 d 17
  # 4 e 1
  #
- # df.aggregate(num_100_times: ->(df) { df.num*100 })
+ # df.aggregate(num_100_times: ->(df) { (df.num*100).first })
  # => #<Daru::DataFrame(5x1)>
  # num_100_ti
  # 0 5200
@@ -2312,41 +2321,26 @@ module Daru
  #
  # Note: `GroupBy` class `aggregate` method uses this `aggregate` method
  # internally.
- def aggregate(options={})
- colmn_value, index_tuples = aggregated_colmn_value(options)
- Daru::DataFrame.new(
- colmn_value, index: index_tuples, order: options.keys
- )
- end
+ def aggregate(options={}, multi_index_level=-1)
+ positions_tuples, new_index = group_index_for_aggregation(@index, multi_index_level)
 
- private
+ colmn_value = aggregate_by_positions_tuples(options, positions_tuples)
 
- # Do the `method` (`method` can be :sum, :mean, :std, :median, etc or
- # lambda), on the column.
- def apply_method_on_colmns colmn, index_tuples, method
- rows = []
- index_tuples.each do |indexes|
- # If single element then also make it vector.
- slice = Daru::Vector.new(Array(self[colmn][*indexes]))
- case method
- when Symbol
- rows << (slice.is_a?(Daru::Vector) ? slice.send(method) : slice)
- when Proc
- rows << method.call(slice)
- end
- end
- rows
+ Daru::DataFrame.new(colmn_value, index: new_index, order: options.keys)
  end
 
- def apply_method_on_df index_tuples, method
- rows = []
- index_tuples.each do |indexes|
- slice = row[*indexes]
- rows << method.call(slice)
- end
- rows
+ # Is faster than using group_by followed by aggregate (because it doesn't generate an intermediary dataframe)
+ def group_by_and_aggregate(*group_by_keys, **aggregation_map)
+ positions_groups = Daru::Core::GroupBy.get_positions_group_map_for_df(self, group_by_keys.flatten, sort: true)
+
+ new_index = Daru::MultiIndex.from_tuples(positions_groups.keys).coerce_index
+ colmn_value = aggregate_by_positions_tuples(aggregation_map, positions_groups.values)
+
+ Daru::DataFrame.new(colmn_value, index: new_index, order: aggregation_map.keys)
  end
 
+ private
+
  def headers
  Daru::Index.new(Array(index.name) + @vectors.to_a)
  end
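`aggregate` now works from tuples of row positions instead of index tuples, and `group_by_and_aggregate` reuses the same core while skipping the intermediate grouped dataframe. That per-group aggregation core can be sketched with plain arrays (standalone, hypothetical names; real daru operates on `Daru::Vector` slices):

```ruby
# For each group of row positions, slice the column and apply either a
# Symbol method or a Proc - the idea behind aggregate_by_positions_tuples.
def aggregate_by_positions(column, positions_tuples, method)
  positions_tuples.map do |positions|
    slice = column.values_at(*positions)
    method.is_a?(Symbol) ? slice.public_send(method) : method.call(slice)
  end
end

salaries = [600, 500, 1200, 1000]
groups   = [[0, 1], [2, 3]]                 # row positions per group
aggregate_by_positions(salaries, groups, :sum)            # => [1100, 2200]
aggregate_by_positions(salaries, groups, ->(s) { s.max }) # => [600, 1200]
```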
@@ -2910,27 +2904,41 @@ module Daru
  end
 
  def update_data source, vectors
- @data = @vectors.each_with_index.map do |_vec,idx|
+ @data = @vectors.each_with_index.map do |_vec, idx|
  Daru::Vector.new(source[idx], index: @index, name: vectors[idx])
  end
  end
 
- def aggregated_colmn_value(options)
- colmn_value = []
- index_tuples = Array(@index).uniq
- options.keys.each do |vec|
- do_this_on_vec = options[vec]
- colmn_value << if @vectors.include?(vec)
- apply_method_on_colmns(
- vec, index_tuples, do_this_on_vec
- )
- else
- apply_method_on_df(
- index_tuples, do_this_on_vec
- )
- end
+ def aggregate_by_positions_tuples(options, positions_tuples)
+ options.map do |vect, method|
+ if @vectors.include?(vect)
+ vect = self[vect]
+
+ positions_tuples.map do |positions|
+ vect.apply_method_on_sub_vector(method, keys: positions)
+ end
+ else
+ positions_tuples.map do |positions|
+ apply_method_on_sub_df(method, keys: positions)
+ end
+ end
  end
- [colmn_value, index_tuples]
+ end
+
+ def group_index_for_aggregation(index, multi_index_level=-1)
+ case index
+ when Daru::MultiIndex
+ groups = Daru::Core::GroupBy.get_positions_group_for_aggregation(index, multi_index_level)
+ new_index, pos_tuples = groups.keys, groups.values
+
+ new_index = Daru::MultiIndex.from_tuples(new_index).coerce_index
+ when Daru::Index, Daru::CategoricalIndex
+ new_index = Array(index).uniq
+ pos_tuples = new_index.map { |idx| [*index.pos(idx)] }
+ else raise
+ end
+
+ [pos_tuples, new_index]
  end
  end
 
  # coerce ranges, integers and array in appropriate ways
@@ -244,8 +244,21 @@ module Daru
  @labels.delete_at(layer_index)
  @name.delete_at(layer_index) unless @name.nil?
 
- # CategoricalIndex is used , to allow duplicate indexes.
- @levels.size == 1 ? Daru::CategoricalIndex.new(to_a.flatten) : self
+ coerce_index
+ end
+
+ def coerce_index
+ if @levels.size == 1
+ elements = to_a.flatten
+
+ if elements.uniq.length == elements.length
+ Daru::Index.new(elements)
+ else
+ Daru::CategoricalIndex.new(elements)
+ end
+ else
+ self
+ end
  end
 
  # Array `name` must have same length as levels and labels.
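`coerce_index` picks the result type for a single remaining layer: a plain `Daru::Index` when the values are unique, and a `Daru::CategoricalIndex` only when duplicates force duplicate index entries. The decision is just a uniqueness check (standalone sketch with hypothetical names):

```ruby
# Decide what a one-layer index should coerce to, mirroring the
# uniqueness test in coerce_index above (sketch, not daru's API).
def index_kind(elements)
  elements.uniq.length == elements.length ? :plain_index : :categorical_index
end

index_kind([2010, 2011, 2012])   # => :plain_index
index_kind(%w[dev dev hr])       # => :categorical_index
```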
@@ -272,7 +285,7 @@ module Daru
  end
 
  def dup
- MultiIndex.new levels: levels.dup, labels: labels
+ MultiIndex.new levels: levels.dup, labels: labels.dup, name: (@name.nil? ? nil : @name.dup)
  end
 
  def drop_left_level by=1
@@ -293,8 +306,9 @@ module Daru
 
  def include? tuple
  return false unless tuple.is_a? Enumerable
- tuple.flatten.each_with_index
- .all? { |tup, i| @levels[i][tup] }
+ @labels[0...tuple.flatten.size]
+ .transpose
+ .include?(tuple.flatten.each_with_index.map { |e, i| @levels[i][e] })
  end
 
  def size
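The fixed `#include?` encodes the tuple into label codes via `@levels`, then checks whether that exact (possibly partial) column appears among `@labels`. The old version answered true whenever each element merely existed somewhere in its layer, regardless of combination. A plain-array sketch of the new check, with toy stand-ins for `@levels` and `@labels`:

```ruby
# Toy MultiIndex data: two layers, three tuples.
LEVELS = [{ a: 0, b: 1 }, { one: 0, two: 1 }]
LABELS = [[0, 0, 1], [0, 1, 1]] # columns: [:a,:one], [:a,:two], [:b,:two]

def mi_include?(tuple, levels = LEVELS, labels = LABELS)
  coded = tuple.each_with_index.map { |e, i| levels[i][e] }
  return false if coded.any?(&:nil?) # an element absent from its layer
  labels[0...tuple.size].transpose.include?(coded)
end

mi_include?([:a, :two]) # => true  (an actual stored combination)
mi_include?([:b, :one]) # => false (both exist, but never together)
```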
@@ -11,6 +11,9 @@ module Daru
  else
  f
  end
+ },
+ string: lambda { |f, _|
+ f
  }
  }.freeze
  end
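The new `:string` converter is an identity lambda: it keeps fields such as zero-padded case numbers as strings instead of letting numeric conversion mangle them. Ruby's stdlib `CSV` accepts the same shape of lambda directly, so the idea can be tried outside daru (a sketch using stdlib `CSV`, not daru's API):

```ruby
require 'csv'

# An identity converter: every field stays a String, preserving
# leading zeros that an integer conversion would destroy.
keep_string = ->(field) { field }

CSV.parse_line('8517337,094652', converters: [keep_string])
# => ["8517337", "094652"]
```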
@@ -34,11 +34,12 @@ module Daru
  end
  end
 
- module IO
+ module IO # rubocop:disable Metrics/ModuleLength
  class << self
  # Functions for loading/writing Excel files.
 
  def from_excel path, opts={}
+ optional_gem 'spreadsheet', '~>1.1.1'
  opts = {
  worksheet_id: 0
  }.merge opts
@@ -185,19 +186,25 @@ module Daru
  end
 
  def from_html path, opts
+ optional_gem 'mechanize', '~>2.7.5'
  page = Mechanize.new.get(path)
  page.search('table').map { |table| html_parse_table table }
  .keep_if { |table| html_search table, opts[:match] }
  .compact
  .map { |table| html_decide_values table, opts }
  .map { |table| html_table_to_dataframe table }
- rescue LoadError
- raise 'Install the mechanize gem version 2.7.5 with `gem install mechanize`,'\
- ' for using the from_html function.'
  end
 
  private
 
+ def optional_gem(name, version)
+ gem name, version
+ require name
+ rescue LoadError
+ Daru.error "\nInstall the #{name} gem version #{version} for using"\
+ " #{name} functions."
+ end
+
  DARU_OPT_KEYS = %i[clone order index name].freeze
 
  def from_csv_prepare_opts opts
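The scattered load-time dependency checks become one lazy `optional_gem` helper that runs only when the feature is actually used, which removes the "missing libraries" warning on every require. The pattern is generic Ruby (a sketch; `Daru.error` is replaced with `warn` and a return value so it runs standalone, and the gem name below is deliberately fake):

```ruby
# Lazily activate and require an optional dependency, warning instead
# of crashing when it is absent (sketch of the optional_gem idea above).
def optional_gem(name, version)
  gem name, version   # raises Gem::LoadError (a LoadError) if missing
  require name
  true
rescue LoadError
  warn "Install the #{name} gem version #{version} to use #{name} features."
  false
end

optional_gem('surely-not-a-real-gem', '~> 1.0') # warns and returns false
```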
@@ -214,7 +221,7 @@ module Daru
  end
 
  def from_csv_prepare_converters(converters)
- converters.flat_map do |c|
+ Array(converters).flat_map do |c|
  if ::CSV::Converters[c]
  ::CSV::Converters[c]
  elsif Daru::IO::CSV::CONVERTERS[c]
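Wrapping the option in `Array()` is the whole fix that lets callers pass `converters: :boolean` as well as `converters: [:boolean, :string]`: Kernel's `Array()` normalizes scalars, arrays, and `nil` alike.

```ruby
# Kernel#Array normalizes the :converters option before flat_map:
Array(:boolean)            # => [:boolean]
Array([:boolean, :string]) # => [:boolean, :string]
Array(nil)                 # => []
```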
@@ -122,6 +122,17 @@ module Daru
  self
  end
 
+ def apply_method(method, keys: nil, by_position: true)
+ vect = keys ? get_sub_vector(keys, by_position: by_position) : self
+
+ case method
+ when Symbol then vect.send(method)
+ when Proc then method.call(vect)
+ else raise
+ end
+ end
+ alias :apply_method_on_sub_vector :apply_method
+
  # The name of the Daru::Vector. String.
  attr_reader :name
  # The row index. Can be either Daru::Index or Daru::MultiIndex.
@@ -790,6 +801,7 @@ module Daru
  self[idx] = last_valid_value
  end
  end
+ self
  end
 
  # Non-destructive version of rolling_fillna!
@@ -870,6 +882,19 @@ module Daru
  @index.include? index
  end
 
+ # @param keys [Array] can be positions (if by_position is true) or indexes (if by_position is false)
+ # @return [Daru::Vector]
+ def get_sub_vector(keys, by_position: true)
+ return Daru::Vector.new([]) if keys == []
+
+ keys = @index.pos(*keys) unless by_position
+
+ sub_vect = at(*keys)
+ sub_vect = Daru::Vector.new([sub_vect]) unless sub_vect.is_a?(Daru::Vector)
+
+ sub_vect
+ end
+
  # @return [Daru::DataFrame] the vector as a single-vector dataframe
  def to_df
  Daru::DataFrame.new({@name => @data}, name: @name, index: @index)
@@ -1,3 +1,3 @@
  module Daru
- VERSION = '0.2.0'.freeze
+ VERSION = '0.2.1'.freeze
  end
@@ -201,6 +201,22 @@ describe Daru::Core::GroupBy do
  end
  end
 
+ context '#each_group without block' do
+ it 'enumerates groups' do
+ enum = @dl_group.each_group
+
+ expect(enum.count).to eq 6
+ expect(enum).to all be_a(Daru::DataFrame)
+ expect(enum.to_a.last).to eq(Daru::DataFrame.new({
+ a: ['foo', 'foo'],
+ b: ['two', 'two'],
+ c: [3, 3],
+ d: [33, 55]
+ }, index: [2, 4]
+ ))
+ end
+ end
+
  context '#first' do
  it 'gets the first row from each group' do
  expect(@dl_group.first).to eq(Daru::DataFrame.new({
@@ -223,10 +239,6 @@ describe Daru::Core::GroupBy do
  end
  end
 
- context "#aggregate" do
- pending
- end
-
  context "#mean" do
  it "computes mean of the numeric columns of a single layer group" do
  expect(@sl_group.mean).to eq(Daru::DataFrame.new({
@@ -498,23 +510,6 @@ describe Daru::Core::GroupBy do
  }
  end
 
- context 'group and aggregate sum for two vectors' do
- subject {
- dataframe.group_by([:employee, :month]).aggregate(salary: :sum) }
-
- it { is_expected.to eq Daru::DataFrame.new({
- salary: [600, 500, 1200, 1000, 600, 700]},
- index: Daru::MultiIndex.from_tuples([
- ['Jane', 'July'],
- ['Jane', 'June'],
- ['John', 'July'],
- ['John', 'June'],
- ['Mark', 'July'],
- ['Mark', 'June']
- ])
- )}
- end
-
  context 'group and aggregate sum and lambda function for vectors' do
  subject { dataframe.group_by([:employee]).aggregate(
  salary: :sum,
@@ -592,5 +587,64 @@ describe Daru::Core::GroupBy do
  )
  end
  end
+
+ let(:spending_df) {
+ Daru::DataFrame.rows([
+ [2010, 'dev', 50, 1],
+ [2010, 'dev', 150, 1],
+ [2010, 'dev', 200, 1],
+ [2011, 'dev', 50, 1],
+ [2012, 'dev', 150, 1],
+
+ [2011, 'office', 300, 1],
+
+ [2010, 'market', 50, 1],
+ [2011, 'market', 500, 1],
+ [2012, 'market', 500, 1],
+ [2012, 'market', 300, 1],
+
+ [2012, 'R&D', 10, 1],],
+ order: [:year, :category, :spending, :nb_spending])
+ }
+ let(:multi_index_year_category) {
+ Daru::MultiIndex.from_tuples([
+ [2010, "dev"], [2010, "market"],
+ [2011, "dev"], [2011, "market"], [2011, "office"],
+ [2012, "R&D"], [2012, "dev"], [2012, "market"]])
+ }
+
+ context 'group_by and aggregate on multiple elements' do
+ it 'does aggregate' do
+ expect(spending_df.group_by([:year, :category]).aggregate(spending: :sum)).to eq(
+ Daru::DataFrame.new({spending: [400, 50, 50, 500, 300, 10, 150, 800]}, index: multi_index_year_category))
+ end
+
+ it 'works as older methods' do
+ newer_way = spending_df.group_by([:year, :category]).aggregate(spending: :sum, nb_spending: :sum)
+ older_way = spending_df.group_by([:year, :category]).sum
+ expect(newer_way).to eq(older_way)
+ end
+
+ context 'can aggregate on MultiIndex' do
+ let(:multi_indexed_aggregated_df) { spending_df.group_by([:year, :category]).aggregate(spending: :sum) }
+ let(:index_year) { Daru::Index.new([2010, 2011, 2012]) }
+ let(:index_category) { Daru::Index.new(["dev", "market", "office", "R&D"]) }
+
+ it 'aggregates by default on the last layer of MultiIndex' do
+ expect(multi_indexed_aggregated_df.aggregate(spending: :sum)).to eq(
+ Daru::DataFrame.new({spending: [450, 850, 960]}, index: index_year))
+ end
+
+ it 'can aggregate on the first layer of MultiIndex' do
+ expect(multi_indexed_aggregated_df.aggregate({spending: :sum},0)).to eq(
+ Daru::DataFrame.new({spending: [600, 1350, 300, 10]}, index: index_category))
+ end
+
+ it 'does coercion: when one layer is remaining, MultiIndex is coerced in Index that does not aggregate anymore' do
+ df_with_simple_index = multi_indexed_aggregated_df.aggregate(spending: :sum)
+ expect(df_with_simple_index.aggregate(spending: :sum)).to eq(df_with_simple_index)
+ end
+ end
+ end
  end
@@ -1858,7 +1858,7 @@ describe Daru::DataFrame do
 
  context 'rolling_fillna! forwards' do
  before { subject.rolling_fillna!(:forward) }
- it { is_expected.to be_a Daru::DataFrame }
+ it { expect(subject.rolling_fillna!(:forward)).to eq(subject) }
  its(:'a.to_a') { is_expected.to eq [1, 2, 3, 3, 3, 3, 1, 7] }
  its(:'b.to_a') { is_expected.to eq [:a, :b, :b, :b, :b, 3, 5, 5] }
  its(:'c.to_a') { is_expected.to eq ['a', 'a', 3, 4, 3, 5, 5, 7] }
@@ -1866,7 +1866,7 @@ describe Daru::DataFrame do
 
  context 'rolling_fillna! backwards' do
  before { subject.rolling_fillna!(:backward) }
- it { is_expected.to be_a Daru::DataFrame }
+ it { expect(subject.rolling_fillna!(:backward)).to eq(subject) }
  its(:'a.to_a') { is_expected.to eq [1, 2, 3, 1, 1, 1, 1, 7] }
  its(:'b.to_a') { is_expected.to eq [:a, :b, 3, 3, 3, 3, 5, 0] }
  its(:'c.to_a') { is_expected.to eq ['a', 3, 3, 4, 3, 5, 7, 7] }
@@ -3266,6 +3266,18 @@ describe Daru::DataFrame do
  end
  end
 
+ context "group_by" do
+ context "on a single row DataFrame" do
+ let(:df){ Daru::DataFrame.new(city: %w[Kyiv], year: [2015], value: [1]) }
+ it "returns a groupby object" do
+ expect(df.group_by([:city])).to be_a(Daru::Core::GroupBy)
+ end
+ it "has the correct index" do
+ expect(df.group_by([:city]).groups).to eq({["Kyiv"]=>[0]})
+ end
+ end
+ end
+
  context "#vector_sum" do
  before do
  a1 = Daru::Vector.new [1, 2, 3, 4, 5, nil, nil]
@@ -4032,7 +4044,7 @@ describe Daru::DataFrame do
  Daru::DataFrame.new({num: [52,12,07,17,01]}, index: cat_idx) }
 
  it 'lambda function on particular column' do
- expect(df.aggregate(num_100_times: ->(df) { df.num*100 })).to eq(
+ expect(df.aggregate(num_100_times: ->(df) { (df.num*100).first })).to eq(
  Daru::DataFrame.new(num_100_times: [5200, 1200, 700, 1700, 100])
  )
  end
@@ -4043,6 +4055,34 @@ describe Daru::DataFrame do
  end
  end
 
+ context '#group_by_and_aggregate' do
+ let(:spending_df) {
+ Daru::DataFrame.rows([
+ [2010, 'dev', 50, 1],
+ [2010, 'dev', 150, 1],
+ [2010, 'dev', 200, 1],
+ [2011, 'dev', 50, 1],
+ [2012, 'dev', 150, 1],
+
+ [2011, 'office', 300, 1],
+
+ [2010, 'market', 50, 1],
+ [2011, 'market', 500, 1],
+ [2012, 'market', 500, 1],
+ [2012, 'market', 300, 1],
+
+ [2012, 'R&D', 10, 1],],
+ order: [:year, :category, :spending, :nb_spending])
+ }
+
+ it 'works as group_by + aggregate' do
+ expect(spending_df.group_by_and_aggregate(:year, {spending: :sum})).to eq(
+ spending_df.group_by(:year).aggregate(spending: :sum))
+ expect(spending_df.group_by_and_aggregate([:year, :category], spending: :sum, nb_spending: :size)).to eq(
+ spending_df.group_by([:year, :category]).aggregate(spending: :sum, nb_spending: :size))
+ end
+ end
+
  context '#create_sql' do
  let(:df) { Daru::DataFrame.new({
  a: [1,2,3],
@@ -0,0 +1,5 @@
+ ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
+ 8517337,094652,03/12/2012 02:00:00 PM,027XX S HAMLIN AVE,1152,DECEPTIVE PRACTICE,ILLEGAL USE CASH CARD,ATM (AUTOMATIC TELLER MACHINE),false,true,1031,010,22,30,11,1151482,1885517,2012,02/04/2016 06:33:39 AM,41.841738053,-87.719605942,"(41.841738053, -87.719605942)"
+ 8517338,194241,03/06/2012 10:49:00 PM,102XX S VERNON AVE,0917,MOTOR VEHICLE THEFT,"CYCLE, SCOOTER, BIKE W-VIN",STREET,false,false,0511,005,9,49,07,1181052,1837191,2012,02/04/2016 06:33:39 AM,41.708495677,-87.612580474,"(41.708495677, -87.612580474)"
+ 8517339,194563,02/01/2012 08:15:00 AM,003XX W 108TH ST,0460,BATTERY,SIMPLE,"SCHOOL, PRIVATE, BUILDING",false,false,0513,005,34,49,08B,1176016,1833309,2012,02/04/2016 06:33:39 AM,41.6979571,-87.631138505,"(41.6979571, -87.631138505)"
+ 8517340,194531,03/12/2012 05:50:00 PM,089XX S CARPENTER ST,0560,ASSAULT,SIMPLE,STREET,false,false,2222,022,21,73,08A,1170886,1845421,2012,02/04/2016 06:33:39 AM,41.731307475,-87.649569675,"(41.731307475, -87.649569675)"
@@ -202,8 +202,16 @@ describe Daru::MultiIndex do
  expect(@multi_mi.include?([:a, :one])).to eq(true)
  end
 
- it "checks for non-existence of a tuple" do
- expect(@multi_mi.include?([:boo])).to eq(false)
+ it "checks for non-existence of completely specified tuple" do
+ expect(@multi_mi.include?([:b, :two, :foo])).to eq(false)
+ end
+
+ it "checks for non-existence of a top layer incomplete tuple" do
+ expect(@multi_mi.include?([:d])).to eq(false)
+ end
+
+ it "checks for non-existence of a middle layer incomplete tuple" do
+ expect(@multi_mi.include?([:c, :three])).to eq(false)
  end
  end
 
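The new `#include?` specs check both complete tuples and incomplete prefixes at different layers. A rough sketch of that prefix-membership semantics over a flat list of tuples (illustrative only, not daru's actual MultiIndex implementation; the tuple data here is made up):

```ruby
# A tuple is "included" if it equals a full entry or is a prefix of one.
tuples = [[:a, :one, :bar], [:a, :two, :baz], [:b, :one, :foo], [:b, :two, :bar]]

include_tuple = lambda do |prefix|
  # Compare the leading elements of each stored tuple against the query.
  tuples.any? { |t| t.first(prefix.length) == prefix }
end

include_tuple.call([:a, :one])        # prefix of [:a, :one, :bar] => true
include_tuple.call([:b, :two, :foo])  # fully specified, absent    => false
include_tuple.call([:d])              # top-layer prefix, absent   => false
```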
@@ -51,6 +51,16 @@ describe Daru::IO do
  expect(df['Domestic'].to_a).to all be_boolean
  end
 
+ it "uses the custom string converter correctly" do
+ df = Daru::DataFrame.from_csv 'spec/fixtures/string_converter_test.csv', converters: [:string]
+ expect(df['Case Number'].to_a.all? {|x| String === x }).to be_truthy
+ end
+
+ it "allow symbol to converters option" do
+ df = Daru::DataFrame.from_csv 'spec/fixtures/boolean_converter_test.csv', converters: :boolean
+ expect(df['Domestic'].to_a).to all be_boolean
+ end
+
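The second spec covers the 0.2.1 change allowing a bare Symbol (not just an array) for the `converters` option. Ruby's stdlib `CSV` accepts the same two shapes for its own `converters` keyword, which is the behaviour being mirrored; a sketch using the built-in `:float` converter rather than daru's custom ones:

```ruby
require 'csv'

data = "a,b\n1.5,2\n3.25,4\n"

# Both a single symbol and a one-element array name the same converter chain.
single = CSV.parse(data, headers: true, converters: :float)
array  = CSV.parse(data, headers: true, converters: [:float])

single['a']  # => [1.5, 3.25]  -- fields parsed as Float, not String
```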
  it "checks for equal parsing of local CSV files and remote CSV files" do
  %w[matrix_test repeated_fields scientific_notation sales-funnel].each do |file|
  df_local = Daru::DataFrame.from_csv("spec/fixtures/#{file}.csv")
@@ -1808,6 +1808,22 @@ describe Daru::Vector do
  end
  end
 
+ context '#rolling_fillna' do
+ subject do
+ Daru::Vector.new(
+ [Float::NAN, 2, 1, 4, nil, Float::NAN, 3, nil, Float::NAN]
+ )
+ end
+
+ context 'rolling_fillna forwards' do
+ it { expect(subject.rolling_fillna(:forward).to_a).to eq [0, 2, 1, 4, 4, 4, 3, 3, 3] }
+ end
+
+ context 'rolling_fillna backwards' do
+ it { expect(subject.rolling_fillna(direction: :backward).to_a).to eq [2, 2, 1, 4, 3, 3, 3, 0, 0] }
+ end
+ end
+
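The forward-fill expectation above carries the last seen value across gaps and uses 0 before any value has appeared. A pure-Ruby sketch of that semantics (not daru's implementation), treating both `nil` and `Float::NAN` as missing, per the spec's input:

```ruby
values = [Float::NAN, 2, 1, 4, nil, Float::NAN, 3, nil, Float::NAN]

missing = ->(v) { v.nil? || (v.is_a?(Float) && v.nan?) }

# Forward fill: replace each missing entry with the last non-missing value,
# defaulting to 0 until one has been seen.
last = 0
filled = values.map { |v| missing.call(v) ? last : (last = v) }
# => [0, 2, 1, 4, 4, 4, 3, 3, 3]
```

The backward case in the spec is the same idea run over the reversed vector.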
  context "#type" do
  before(:each) do
  @numeric = Daru::Vector.new([1,2,3,4,5])
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: daru
  version: !ruby/object:Gem::Version
- version: 0.2.0
+ version: 0.2.1
  platform: ruby
  authors:
  - Sameer Deshmukh
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2017-10-31 00:00:00.000000000 Z
+ date: 2018-07-02 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: backports
@@ -532,6 +532,7 @@ files:
  - spec/fixtures/repeated_fields.csv
  - spec/fixtures/sales-funnel.csv
  - spec/fixtures/scientific_notation.csv
+ - spec/fixtures/string_converter_test.csv
  - spec/fixtures/strings.dat
  - spec/fixtures/test_xls.xls
  - spec/fixtures/url_test.txt~
@@ -569,26 +570,7 @@ homepage: http://github.com/v0dro/daru
  licenses:
  - BSD-2
  metadata: {}
- post_install_message: |
- *************************************************************************
- Thank you for installing daru!
-
- oOOOOOo
- ,| oO
- //| |
- \\| |
- `| |
- `-----`
-
-
- Hope you love daru! For enhanced interactivity and better visualizations,
- consider using gnuplotrb and nyaplot with iruby. For statistics use the
- statsample family.
-
- Read the README for interesting use cases and examples.
-
- Cheers!
- *************************************************************************
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -604,7 +586,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.6.10
+ rubygems_version: 2.6.14
  signing_key:
  specification_version: 4
  summary: Data Analysis in RUby
@@ -638,6 +620,7 @@ test_files:
  - spec/fixtures/repeated_fields.csv
  - spec/fixtures/sales-funnel.csv
  - spec/fixtures/scientific_notation.csv
+ - spec/fixtures/string_converter_test.csv
  - spec/fixtures/strings.dat
  - spec/fixtures/test_xls.xls
  - spec/fixtures/url_test.txt~