daru 0.2.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 69452b32fd8ef0ef7fb4ed58ab53ffa8aa15806d
- data.tar.gz: 56927c77adbe7941eb2ca9a5e44d705931aad237
+ metadata.gz: 87e4e2869fe6411e3eece92bb5dc24d48f890774
+ data.tar.gz: e711d0db1d57f51f31ccb7fb54078a6bdbcc4ff5
  SHA512:
- metadata.gz: 8e7511133b3409f7821cfec944a950d53df57bcd5893bb8a9557c013f31bf1e4a9cc07bbe1c143c63684f00f7d8d8f1adf3b31df732508e667ba6677f47d1d96
- data.tar.gz: fc4beb70106372a276b21e0da645951595e5674f56e4422752aeeabc9cc2156983add90e59486aea4d88386fbeb2896d15f7ede30667bc84027abd900ee42e0e
+ metadata.gz: afdb295d0d01542ba9f439cf5f7959d7f2a3b9e47de6047ecf7719548ef760e657c0dfe753ed16ee1da65e071bb5a182aaf03ee83c9de6075d54149753b9c346
+ data.tar.gz: e0c4ace661d9f1cb7e8040d424bb004a0b650a9605037d1aff258258bbac40a3c158e5f5b8a2a5c6a28070cf55566a0729ee9b77c8114d40d4d18cf9d26e69c3
data/History.md CHANGED
@@ -1,3 +1,18 @@
+ # 0.2.1 (02 July 2018)
+
+ * Minor Enhancements
+ - Allow pasing singular Symbol to CSV converters option (@takkanm)
+ - Support calling GroupBy#each_group w/o blocks (@hibariya)
+ - Refactor grouping and aggregation (@paisible-wanderer)
+ - Add String Converter to Daru::IO::CSV::CONVERTERS (@takkanm)
+ - Fix annoying missing libraries warning
+ - Remove post-install message (nice yet useless)
+
+ * Fixes
+ - Fix group_by for DataFrame with single row (@baarkerlounger)
+ - `#rolling_fillna!` bugfixes on `Daru::Vector` and `Daru::DataFrame` (@mhammiche)
+ - Fixes `#include?` on multiindex (@rohitner)
+
  # 0.2.0 (31 October 2017)
  * Major Enhancements
  - Add `DataFrame#which` query DSL (experimental! @rainchen)
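A minimal usage sketch, not part of the released files, shown only to illustrate the 0.2.1 entries above (the sample data is made up):

require 'daru'

df = Daru::DataFrame.new(a: %w[foo foo bar], b: [1, 2, 3])

# 0.2.1: GroupBy#each_group called without a block now returns an Enumerator
df.group_by([:a]).each_group.to_a.length        # => 2 (one DataFrame per group)

# 0.2.1: #rolling_fillna! now returns self, so calls can be chained
Daru::Vector.new([1, nil, 3]).rolling_fillna!(:forward).to_a   # => [1, 1, 3]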
data/README.md CHANGED
@@ -3,12 +3,13 @@
  [![Gem Version](https://badge.fury.io/rb/daru.svg)](http://badge.fury.io/rb/daru)
  [![Build Status](https://travis-ci.org/SciRuby/daru.svg?branch=master)](https://travis-ci.org/SciRuby/daru)
  [![Gitter](https://badges.gitter.im/v0dro/daru.svg)](https://gitter.im/v0dro/daru?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
+ [![Open Source Helpers](https://www.codetriage.com/sciruby/daru/badges/users.svg)](https://www.codetriage.com/sciruby/daru)

  ## Introduction

  daru (Data Analysis in RUby) is a library for storage, analysis, manipulation and visualization of data in Ruby.

- daru makes it easy and intuitive to process data predominantly through 2 data structures: `Daru::DataFrame` and `Daru::Vector`. Written in pure Ruby works with all ruby implementations. Tested with MRI 2.0, 2.1, 2.2 and 2.3.
+ daru makes it easy and intuitive to process data predominantly through 2 data structures: `Daru::DataFrame` and `Daru::Vector`. Written in pure Ruby works with all ruby implementations. Tested with MRI 2.0, 2.1, 2.2, 2.3, and 2.4.

  ## Features

@@ -73,6 +74,7 @@ $ gem install daru
  * [Data Analysis in RUby: Basic data manipulation and plotting](http://v0dro.github.io/blog/2014/11/25/data-analysis-in-ruby-basic-data-manipulation-and-plotting/)
  * [Data Analysis in RUby: Splitting, sorting, aggregating data and data types](http://v0dro.github.io/blog/2015/02/24/data-analysis-in-ruby-part-2/)
  * [Finding and Combining data in daru](http://v0dro.github.io/blog/2015/08/03/finding-and-combining-data-in-daru/)
+ * [Introduction to analyzing datasets with daru library](http://gafur.me/2018/02/05/analysing-datasets-with-daru-library.html)

  ### Time series

@@ -192,13 +194,13 @@ In addition to nyaplot, daru also supports plotting out of the box with [gnuplot

  ## Documentation

- Docs can be found [here](https://rubygems.org/gems/daru).
+ Docs can be found [here](http://www.rubydoc.info/gems/daru).

  ## Contributing

  Pick a feature from the Roadmap or the issue tracker or think of your own and send me a Pull Request!

- For details see [CONTRIBUTING](https://github.com/v0dro/daru/blob/master/CONTRIBUTING.md).
+ For details see [CONTRIBUTING](https://github.com/SciRuby/daru/blob/master/CONTRIBUTING.md).

  ## Acknowledgements

@@ -27,29 +27,6 @@ Gem::Specification.new do |spec|
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ["lib"]

- spec.post_install_message = <<-EOF
- *************************************************************************
- Thank you for installing daru!
-
- oOOOOOo
- ,| oO
- //| |
- \\\\| |
- `| |
- `-----`
-
-
- Hope you love daru! For enhanced interactivity and better visualizations,
- consider using gnuplotrb and nyaplot with iruby. For statistics use the
- statsample family.
-
- Read the README for interesting use cases and examples.
-
- Cheers!
- *************************************************************************
- EOF
-
-
  spec.add_runtime_dependency 'backports'

  # it is required by NMatrix, yet we want to specify clearly which minimal version is OK
@@ -86,16 +86,6 @@ module Daru
  create_has_library :gruff
  end

- {'spreadsheet' => '~>1.1.1', 'mechanize' => '~>2.7.5'}.each do |name, version|
- begin
- gem name, version
- require name
- rescue LoadError
- Daru.error "\nInstall the #{name} gem version #{version} for using"\
- " #{name} functions."
- end
- end
-
  autoload :CSV, 'csv'
  require 'matrix'
  require 'forwardable'
@@ -1,11 +1,64 @@
  module Daru
  module Core
  class GroupBy
+ class << self
+ def get_positions_group_map_on(indexes_with_positions, sort: false)
+ group_map = {}
+
+ indexes_with_positions.each do |idx, position|
+ (group_map[idx] ||= []) << position
+ end
+
+ if sort # TODO: maybe add a more "stable" sorting option?
+ sorted_keys = group_map.keys.sort(&Daru::Core::GroupBy::TUPLE_SORTER)
+ group_map = sorted_keys.map { |k| [k, group_map[k]] }.to_h
+ end
+
+ group_map
+ end
+
+ def get_positions_group_for_aggregation(multi_index, level=-1)
+ raise unless multi_index.is_a?(Daru::MultiIndex)
+
+ new_index = multi_index.dup
+ new_index.remove_layer(level) # TODO: recheck code of Daru::MultiIndex#remove_layer
+
+ get_positions_group_map_on(new_index.each_with_index)
+ end
+
+ def get_positions_group_map_for_df(df, group_by_keys, sort: true)
+ indexes_with_positions = df[*group_by_keys].to_df.each_row.map(&:to_a).each_with_index
+
+ get_positions_group_map_on(indexes_with_positions, sort: sort)
+ end
+
+ def group_map_from_positions_to_indexes(positions_group_map, index)
+ positions_group_map.map { |k, positions| [k, positions.map { |pos| index.at(pos) }] }.to_h
+ end
+
+ def df_from_group_map(df, group_map, remaining_vectors, from_position: true)
+ return nil if group_map == {}
+
+ new_index = group_map.flat_map { |group, values| values.map { |val| group + [val] } }
+ new_index = Daru::MultiIndex.from_tuples(new_index)
+
+ return Daru::DataFrame.new({}, index: new_index) if remaining_vectors == []
+
+ new_rows_order = group_map.values.flatten
+ new_df = df[*remaining_vectors].to_df.get_sub_dataframe(new_rows_order, by_position: from_position)
+ new_df.index = new_index
+
+ new_df
+ end
+ end
+
  attr_reader :groups, :df

  # Iterate over each group created by group_by. A DataFrame is yielded in
  # block.
  def each_group
+ return to_enum(:each_group) unless block_given?
+
  groups.keys.each do |k|
  yield get_group(k)
  end
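A toy sketch (not part of the diff) of what the new class-level helper computes, assuming daru 0.2.1 is loaded: it maps each distinct grouping key to the list of row positions at which it occurs, optionally sorting the keys.

pairs = [[['foo'], 0], [['bar'], 1], [['foo'], 2]]
Daru::Core::GroupBy.get_positions_group_map_on(pairs, sort: true)
# => {["bar"]=>[1], ["foo"]=>[0, 2]}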
@@ -22,11 +75,8 @@ module Daru
  end

  def initialize context, names
- @groups = {}
  @non_group_vectors = context.vectors.to_a - names
  @context = context
- vectors = names.map { |vec| context[vec].to_a }
- tuples = vectors[0].zip(*vectors[1..-1])
  # FIXME: It feels like we don't want to sort here. Ruby's #group_by
  # never sorts:
  #
@@ -34,7 +84,10 @@ module Daru
  # # => {4=>["test"], 2=>["me"], 6=>["please"]}
  #
  # - zverok, 2016-09-12
- init_groups_df tuples, names
+ positions_groups = GroupBy.get_positions_group_map_for_df(@context, names, sort: true)
+
+ @groups = GroupBy.group_map_from_positions_to_indexes(positions_groups, @context.index)
+ @df = GroupBy.df_from_group_map(@context, positions_groups, @non_group_vectors)
  end

  # Get a Daru::Vector of the size of each group.
@@ -282,26 +335,11 @@ module Daru
  # Ram Hyderabad,Mumbai
  #
  def aggregate(options={})
- @df.index = @df.index.remove_layer(@df.index.levels.size - 1)
  @df.aggregate(options)
  end

  private

- def init_groups_df tuples, names
- multi_index_tuples = []
- keys = tuples.uniq.sort(&TUPLE_SORTER)
- keys.each do |key|
- indices = all_indices_for(tuples, key)
- @groups[key] = indices
- indices.each do |indice|
- multi_index_tuples << key + [indice]
- end
- end
- @groups.freeze
- @df = resultant_context(multi_index_tuples, names) unless multi_index_tuples.empty?
- end
-
  def select_groups_from method, quantity
  selection = @context
  rows, indexes = [], []
@@ -342,33 +380,6 @@ module Daru
  end
  end

- def resultant_context(multi_index_tuples, names)
- multi_index = Daru::MultiIndex.from_tuples(multi_index_tuples)
- context_tmp = @context.dup.delete_vectors(*names)
- rows_tuples = context_tmp.access_row_tuples_by_indexs(
- *@groups.values.flatten!
- )
- context_new = Daru::DataFrame.rows(rows_tuples, index: multi_index)
- context_new.vectors = context_tmp.vectors
- context_new
- end
-
- def all_indices_for arry, element
- found, index, indexes = -1, -1, []
- while found
- found = arry[index+1..-1].index(element)
- if found
- index = index + found + 1
- indexes << index
- end
- end
- if indexes.count == 1
- [@context.index.at(*indexes)]
- else
- @context.index.at(*indexes).to_a
- end
- end
-
  def multi_indexed_grouping?
  return false unless @groups.keys[0]
  @groups.keys[0].size > 1
@@ -17,17 +17,17 @@ module Daru
  end
  end

- def initialize left_df, right_df, opts={}
+ def initialize left_df, right_df, opts={} # rubocop:disable Metrics/AbcSize -- quick-fix for issue #171
  init_opts(opts)
  validate_on!(left_df, right_df)
  key_sanitizer = ->(h) { sanitize_merge_keys(h.values_at(*on)) }

  @left = df_to_a(left_df)
- @left.sort_by!(&key_sanitizer)
+ @left.sort! { |a, b| safe_compare(a.values_at(*on), b.values_at(*on)) }
  @left_key_values = @left.map(&key_sanitizer)

  @right = df_to_a(right_df)
- @right.sort_by!(&key_sanitizer)
+ @right.sort! { |a, b| safe_compare(a.values_at(*on), b.values_at(*on)) }
  @right_key_values = @right.map(&key_sanitizer)

  @left_keys, @right_keys = merge_keys(left_df, right_df, on)
@@ -246,6 +246,15 @@ module Daru
  raise ArgumentError, "Both dataframes expected to have #{on.inspect} field"
  end
  end
+
+ def safe_compare(left_array, right_array)
+ left_array.zip(right_array).map { |l, r|
+ next 0 if l.nil? && r.nil?
+ next 1 if r.nil?
+ next -1 if l.nil?
+ l <=> r
+ }.reject(&:zero?).first || 0
+ end
  end

  module Merge
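For clarity, the nil-safe comparison introduced above can be restated outside the class (a standalone sketch, not the library's public API): nil keys no longer raise during the sort and compare as smaller than any non-nil value.

safe_compare = lambda do |left_array, right_array|
  left_array.zip(right_array).map { |l, r|
    next 0 if l.nil? && r.nil?
    next 1 if r.nil?
    next -1 if l.nil?
    l <=> r
  }.reject(&:zero?).first || 0
end

safe_compare.call([1, nil], [1, 2])   # => -1 (nil sorts before 2)
safe_compare.call([1, 2], [1, 2])     # =>  0
safe_compare.call([2, nil], [1, nil]) # =>  1 (decided by the first differing key)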
@@ -549,6 +549,20 @@ module Daru
  Daru::Accessors::DataFrameByRow.new(self)
  end

+ # Extract a dataframe given row indexes or positions
+ # @param keys [Array] can be positions (if by_position is true) or indexes (if by_position if false)
+ # @return [Daru::Dataframe]
+ def get_sub_dataframe(keys, by_position: true)
+ return Daru::DataFrame.new({}) if keys == []
+
+ keys = @index.pos(*keys) unless by_position
+
+ sub_df = row_at(*keys)
+ sub_df = sub_df.to_df.transpose if sub_df.is_a?(Daru::Vector)
+
+ sub_df
+ end
+
  # Duplicate the DataFrame entirely.
  #
  # == Arguments
@@ -698,6 +712,7 @@ module Daru
  #
  def rolling_fillna!(direction=:forward)
  @data.each { |vec| vec.rolling_fillna!(direction) }
+ self
  end

  def rolling_fillna(direction=:forward)
@@ -990,6 +1005,17 @@ module Daru
  self
  end

+ def apply_method(method, keys: nil, by_position: true)
+ df = keys ? get_sub_dataframe(keys, by_position: by_position) : self
+
+ case method
+ when Symbol then df.send(method)
+ when Proc then method.call(df)
+ else raise
+ end
+ end
+ alias :apply_method_on_sub_df :apply_method
+
  # Retrieves a Daru::Vector, based on the result of calculation
  # performed on each row.
  def collect_rows &block
@@ -1450,11 +1476,10 @@ module Daru
  # # ["foo", "two", 3]=>[2, 4]}
  def group_by *vectors
  vectors.flatten!
- # FIXME: wouldn't it better to do vectors - @vectors here and
- # raise one error with all non-existent vector names?.. - zverok, 2016-05-18
- vectors.each { |v|
- raise(ArgumentError, "Vector #{v} does not exist") unless has_vector?(v)
- }
+ missing = vectors - @vectors.to_a
+ unless missing.empty?
+ raise(ArgumentError, "Vector(s) missing: #{missing.join(', ')}")
+ end

  vectors = [@vectors.first] if vectors.empty?

@@ -2249,22 +2274,6 @@ module Daru
  end
  end

- # returns array of row tuples at given index(s)
- def access_row_tuples_by_indexs *indexes
- positions = @index.pos(*indexes)
-
- return populate_row_for(positions) if positions.is_a? Numeric
-
- res = []
- new_rows = @data.map { |vec| vec[*indexes] }
- indexes.each do |index|
- tuples = []
- new_rows.map { |row| tuples += [row[index]] }
- res << tuples
- end
- res
- end
-
  # Function to use for aggregating the data.
  #
  # @param options [Hash] options for column, you want in resultant dataframe
@@ -2282,7 +2291,7 @@ module Daru
  # 3 d 17
  # 4 e 1
  #
- # df.aggregate(num_100_times: ->(df) { df.num*100 })
+ # df.aggregate(num_100_times: ->(df) { (df.num*100).first })
  # => #<Daru::DataFrame(5x1)>
  # num_100_ti
  # 0 5200
@@ -2312,41 +2321,26 @@ module Daru
  #
  # Note: `GroupBy` class `aggregate` method uses this `aggregate` method
  # internally.
- def aggregate(options={})
- colmn_value, index_tuples = aggregated_colmn_value(options)
- Daru::DataFrame.new(
- colmn_value, index: index_tuples, order: options.keys
- )
- end
+ def aggregate(options={}, multi_index_level=-1)
+ positions_tuples, new_index = group_index_for_aggregation(@index, multi_index_level)

- private
+ colmn_value = aggregate_by_positions_tuples(options, positions_tuples)

- # Do the `method` (`method` can be :sum, :mean, :std, :median, etc or
- # lambda), on the column.
- def apply_method_on_colmns colmn, index_tuples, method
- rows = []
- index_tuples.each do |indexes|
- # If single element then also make it vector.
- slice = Daru::Vector.new(Array(self[colmn][*indexes]))
- case method
- when Symbol
- rows << (slice.is_a?(Daru::Vector) ? slice.send(method) : slice)
- when Proc
- rows << method.call(slice)
- end
- end
- rows
+ Daru::DataFrame.new(colmn_value, index: new_index, order: options.keys)
  end

- def apply_method_on_df index_tuples, method
- rows = []
- index_tuples.each do |indexes|
- slice = row[*indexes]
- rows << method.call(slice)
- end
- rows
+ # Is faster than using group_by followed by aggregate (because it doesn't generate an intermediary dataframe)
+ def group_by_and_aggregate(*group_by_keys, **aggregation_map)
+ positions_groups = Daru::Core::GroupBy.get_positions_group_map_for_df(self, group_by_keys.flatten, sort: true)
+
+ new_index = Daru::MultiIndex.from_tuples(positions_groups.keys).coerce_index
+ colmn_value = aggregate_by_positions_tuples(aggregation_map, positions_groups.values)
+
+ Daru::DataFrame.new(colmn_value, index: new_index, order: aggregation_map.keys)
  end

+ private
+
  def headers
  Daru::Index.new(Array(index.name) + @vectors.to_a)
  end
@@ -2910,27 +2904,41 @@ module Daru
  end

  def update_data source, vectors
- @data = @vectors.each_with_index.map do |_vec,idx|
+ @data = @vectors.each_with_index.map do |_vec, idx|
  Daru::Vector.new(source[idx], index: @index, name: vectors[idx])
  end
  end

- def aggregated_colmn_value(options)
- colmn_value = []
- index_tuples = Array(@index).uniq
- options.keys.each do |vec|
- do_this_on_vec = options[vec]
- colmn_value << if @vectors.include?(vec)
- apply_method_on_colmns(
- vec, index_tuples, do_this_on_vec
- )
- else
- apply_method_on_df(
- index_tuples, do_this_on_vec
- )
- end
+ def aggregate_by_positions_tuples(options, positions_tuples)
+ options.map do |vect, method|
+ if @vectors.include?(vect)
+ vect = self[vect]
+
+ positions_tuples.map do |positions|
+ vect.apply_method_on_sub_vector(method, keys: positions)
+ end
+ else
+ positions_tuples.map do |positions|
+ apply_method_on_sub_df(method, keys: positions)
+ end
+ end
  end
- [colmn_value, index_tuples]
+ end
+
+ def group_index_for_aggregation(index, multi_index_level=-1)
+ case index
+ when Daru::MultiIndex
+ groups = Daru::Core::GroupBy.get_positions_group_for_aggregation(index, multi_index_level)
+ new_index, pos_tuples = groups.keys, groups.values
+
+ new_index = Daru::MultiIndex.from_tuples(new_index).coerce_index
+ when Daru::Index, Daru::CategoricalIndex
+ new_index = Array(index).uniq
+ pos_tuples = new_index.map { |idx| [*index.pos(idx)] }
+ else raise
+ end
+
+ [pos_tuples, new_index]
  end

  # coerce ranges, integers and array in appropriate ways
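The new aggregation entry points, sketched on a hypothetical DataFrame `df` with :year, :category and :spending vectors (mirroring the specs further down in this diff):

# group_by followed by aggregate, as before
df.group_by([:year, :category]).aggregate(spending: :sum)

# new in 0.2.1: same result without materialising the intermediate grouped DataFrame
df.group_by_and_aggregate([:year, :category], spending: :sum)

# aggregate now accepts the MultiIndex layer to collapse (last layer by default)
df.group_by([:year, :category]).aggregate(spending: :sum).aggregate({spending: :sum}, 0)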
@@ -244,8 +244,21 @@ module Daru
  @labels.delete_at(layer_index)
  @name.delete_at(layer_index) unless @name.nil?

- # CategoricalIndex is used , to allow duplicate indexes.
- @levels.size == 1 ? Daru::CategoricalIndex.new(to_a.flatten) : self
+ coerce_index
+ end
+
+ def coerce_index
+ if @levels.size == 1
+ elements = to_a.flatten
+
+ if elements.uniq.length == elements.length
+ Daru::Index.new(elements)
+ else
+ Daru::CategoricalIndex.new(elements)
+ end
+ else
+ self
+ end
  end

  # Array `name` must have same length as levels and labels.
@@ -272,7 +285,7 @@ module Daru
  end

  def dup
- MultiIndex.new levels: levels.dup, labels: labels
+ MultiIndex.new levels: levels.dup, labels: labels.dup, name: (@name.nil? ? nil : @name.dup)
  end

  def drop_left_level by=1
@@ -293,8 +306,9 @@ module Daru

  def include? tuple
  return false unless tuple.is_a? Enumerable
- tuple.flatten.each_with_index
- .all? { |tup, i| @levels[i][tup] }
+ @labels[0...tuple.flatten.size]
+ .transpose
+ .include?(tuple.flatten.each_with_index.map { |e, i| @levels[i][e] })
  end

  def size
@@ -11,6 +11,9 @@ module Daru
  else
  f
  end
+ },
+ string: lambda { |f, _|
+ f
  }
  }.freeze
  end
@@ -34,11 +34,12 @@ module Daru
  end
  end

- module IO
+ module IO # rubocop:disable Metrics/ModuleLength
  class << self
  # Functions for loading/writing Excel files.

  def from_excel path, opts={}
+ optional_gem 'spreadsheet', '~>1.1.1'
  opts = {
  worksheet_id: 0
  }.merge opts
@@ -185,19 +186,25 @@ module Daru
  end

  def from_html path, opts
+ optional_gem 'mechanize', '~>2.7.5'
  page = Mechanize.new.get(path)
  page.search('table').map { |table| html_parse_table table }
  .keep_if { |table| html_search table, opts[:match] }
  .compact
  .map { |table| html_decide_values table, opts }
  .map { |table| html_table_to_dataframe table }
- rescue LoadError
- raise 'Install the mechanize gem version 2.7.5 with `gem install mechanize`,'\
- ' for using the from_html function.'
  end

  private

+ def optional_gem(name, version)
+ gem name, version
+ require name
+ rescue LoadError
+ Daru.error "\nInstall the #{name} gem version #{version} for using"\
+ " #{name} functions."
+ end
+
  DARU_OPT_KEYS = %i[clone order index name].freeze

  def from_csv_prepare_opts opts
@@ -214,7 +221,7 @@ module Daru
  end

  def from_csv_prepare_converters(converters)
- converters.flat_map do |c|
+ Array(converters).flat_map do |c|
  if ::CSV::Converters[c]
  ::CSV::Converters[c]
  elsif Daru::IO::CSV::CONVERTERS[c]
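Together with the :string converter added earlier, the CSV loader now accepts a bare Symbol as well as an Array for :converters. A usage sketch (the paths are the spec fixtures listed elsewhere in this diff):

Daru::DataFrame.from_csv 'spec/fixtures/string_converter_test.csv', converters: [:string]
Daru::DataFrame.from_csv 'spec/fixtures/boolean_converter_test.csv', converters: :boolean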
@@ -122,6 +122,17 @@ module Daru
  self
  end

+ def apply_method(method, keys: nil, by_position: true)
+ vect = keys ? get_sub_vector(keys, by_position: by_position) : self
+
+ case method
+ when Symbol then vect.send(method)
+ when Proc then method.call(vect)
+ else raise
+ end
+ end
+ alias :apply_method_on_sub_vector :apply_method
+
  # The name of the Daru::Vector. String.
  attr_reader :name
  # The row index. Can be either Daru::Index or Daru::MultiIndex.
@@ -790,6 +801,7 @@ module Daru
  self[idx] = last_valid_value
  end
  end
+ self
  end

  # Non-destructive version of rolling_fillna!
@@ -870,6 +882,19 @@ module Daru
  @index.include? index
  end

+ # @param keys [Array] can be positions (if by_position is true) or indexes (if by_position if false)
+ # @return [Daru::Vector]
+ def get_sub_vector(keys, by_position: true)
+ return Daru::Vector.new([]) if keys == []
+
+ keys = @index.pos(*keys) unless by_position
+
+ sub_vect = at(*keys)
+ sub_vect = Daru::Vector.new([sub_vect]) unless sub_vect.is_a?(Daru::Vector)
+
+ sub_vect
+ end
+
  # @return [Daru::DataFrame] the vector as a single-vector dataframe
  def to_df
  Daru::DataFrame.new({@name => @data}, name: @name, index: @index)
@@ -1,3 +1,3 @@
  module Daru
- VERSION = '0.2.0'.freeze
+ VERSION = '0.2.1'.freeze
  end
@@ -201,6 +201,22 @@ describe Daru::Core::GroupBy do
  end
  end

+ context '#each_group without block' do
+ it 'enumerates groups' do
+ enum = @dl_group.each_group
+
+ expect(enum.count).to eq 6
+ expect(enum).to all be_a(Daru::DataFrame)
+ expect(enum.to_a.last).to eq(Daru::DataFrame.new({
+ a: ['foo', 'foo'],
+ b: ['two', 'two'],
+ c: [3, 3],
+ d: [33, 55]
+ }, index: [2, 4]
+ ))
+ end
+ end
+
  context '#first' do
  it 'gets the first row from each group' do
  expect(@dl_group.first).to eq(Daru::DataFrame.new({
@@ -223,10 +239,6 @@ describe Daru::Core::GroupBy do
  end
  end

- context "#aggregate" do
- pending
- end
-
  context "#mean" do
  it "computes mean of the numeric columns of a single layer group" do
  expect(@sl_group.mean).to eq(Daru::DataFrame.new({
@@ -498,23 +510,6 @@ describe Daru::Core::GroupBy do
  }
  end

- context 'group and aggregate sum for two vectors' do
- subject {
- dataframe.group_by([:employee, :month]).aggregate(salary: :sum) }
-
- it { is_expected.to eq Daru::DataFrame.new({
- salary: [600, 500, 1200, 1000, 600, 700]},
- index: Daru::MultiIndex.from_tuples([
- ['Jane', 'July'],
- ['Jane', 'June'],
- ['John', 'July'],
- ['John', 'June'],
- ['Mark', 'July'],
- ['Mark', 'June']
- ])
- )}
- end
-
  context 'group and aggregate sum and lambda function for vectors' do
  subject { dataframe.group_by([:employee]).aggregate(
  salary: :sum,
@@ -592,5 +587,64 @@ describe Daru::Core::GroupBy do
  )
  end
  end
+
+ let(:spending_df) {
+ Daru::DataFrame.rows([
+ [2010, 'dev', 50, 1],
+ [2010, 'dev', 150, 1],
+ [2010, 'dev', 200, 1],
+ [2011, 'dev', 50, 1],
+ [2012, 'dev', 150, 1],
+
+ [2011, 'office', 300, 1],
+
+ [2010, 'market', 50, 1],
+ [2011, 'market', 500, 1],
+ [2012, 'market', 500, 1],
+ [2012, 'market', 300, 1],
+
+ [2012, 'R&D', 10, 1],],
+ order: [:year, :category, :spending, :nb_spending])
+ }
+ let(:multi_index_year_category) {
+ Daru::MultiIndex.from_tuples([
+ [2010, "dev"], [2010, "market"],
+ [2011, "dev"], [2011, "market"], [2011, "office"],
+ [2012, "R&D"], [2012, "dev"], [2012, "market"]])
+ }
+
+ context 'group_by and aggregate on multiple elements' do
+ it 'does aggregate' do
+ expect(spending_df.group_by([:year, :category]).aggregate(spending: :sum)).to eq(
+ Daru::DataFrame.new({spending: [400, 50, 50, 500, 300, 10, 150, 800]}, index: multi_index_year_category))
+ end
+
+ it 'works as older methods' do
+ newer_way = spending_df.group_by([:year, :category]).aggregate(spending: :sum, nb_spending: :sum)
+ older_way = spending_df.group_by([:year, :category]).sum
+ expect(newer_way).to eq(older_way)
+ end
+
+ context 'can aggregate on MultiIndex' do
+ let(:multi_indexed_aggregated_df) { spending_df.group_by([:year, :category]).aggregate(spending: :sum) }
+ let(:index_year) { Daru::Index.new([2010, 2011, 2012]) }
+ let(:index_category) { Daru::Index.new(["dev", "market", "office", "R&D"]) }
+
+ it 'aggregates by default on the last layer of MultiIndex' do
+ expect(multi_indexed_aggregated_df.aggregate(spending: :sum)).to eq(
+ Daru::DataFrame.new({spending: [450, 850, 960]}, index: index_year))
+ end
+
+ it 'can aggregate on the first layer of MultiIndex' do
+ expect(multi_indexed_aggregated_df.aggregate({spending: :sum},0)).to eq(
+ Daru::DataFrame.new({spending: [600, 1350, 300, 10]}, index: index_category))
+ end
+
+ it 'does coercion: when one layer is remaining, MultiIndex is coerced in Index that does not aggregate anymore' do
+ df_with_simple_index = multi_indexed_aggregated_df.aggregate(spending: :sum)
+ expect(df_with_simple_index.aggregate(spending: :sum)).to eq(df_with_simple_index)
+ end
+ end
+ end
  end
  end
@@ -1858,7 +1858,7 @@ describe Daru::DataFrame do

  context 'rolling_fillna! forwards' do
  before { subject.rolling_fillna!(:forward) }
- it { is_expected.to be_a Daru::DataFrame }
+ it { expect(subject.rolling_fillna!(:forward)).to eq(subject) }
  its(:'a.to_a') { is_expected.to eq [1, 2, 3, 3, 3, 3, 1, 7] }
  its(:'b.to_a') { is_expected.to eq [:a, :b, :b, :b, :b, 3, 5, 5] }
  its(:'c.to_a') { is_expected.to eq ['a', 'a', 3, 4, 3, 5, 5, 7] }
@@ -1866,7 +1866,7 @@ describe Daru::DataFrame do

  context 'rolling_fillna! backwards' do
  before { subject.rolling_fillna!(:backward) }
- it { is_expected.to be_a Daru::DataFrame }
+ it { expect(subject.rolling_fillna!(:backward)).to eq(subject) }
  its(:'a.to_a') { is_expected.to eq [1, 2, 3, 1, 1, 1, 1, 7] }
  its(:'b.to_a') { is_expected.to eq [:a, :b, 3, 3, 3, 3, 5, 0] }
  its(:'c.to_a') { is_expected.to eq ['a', 3, 3, 4, 3, 5, 7, 7] }
@@ -3266,6 +3266,18 @@ describe Daru::DataFrame do
  end
  end

+ context "group_by" do
+ context "on a single row DataFrame" do
+ let(:df){ Daru::DataFrame.new(city: %w[Kyiv], year: [2015], value: [1]) }
+ it "returns a groupby object" do
+ expect(df.group_by([:city])).to be_a(Daru::Core::GroupBy)
+ end
+ it "has the correct index" do
+ expect(df.group_by([:city]).groups).to eq({["Kyiv"]=>[0]})
+ end
+ end
+ end
+
  context "#vector_sum" do
  before do
  a1 = Daru::Vector.new [1, 2, 3, 4, 5, nil, nil]
@@ -4032,7 +4044,7 @@ describe Daru::DataFrame do
  Daru::DataFrame.new({num: [52,12,07,17,01]}, index: cat_idx) }

  it 'lambda function on particular column' do
- expect(df.aggregate(num_100_times: ->(df) { df.num*100 })).to eq(
+ expect(df.aggregate(num_100_times: ->(df) { (df.num*100).first })).to eq(
  Daru::DataFrame.new(num_100_times: [5200, 1200, 700, 1700, 100])
  )
  end
@@ -4043,6 +4055,34 @@ describe Daru::DataFrame do
  end
  end

+ context '#group_by_and_aggregate' do
+ let(:spending_df) {
+ Daru::DataFrame.rows([
+ [2010, 'dev', 50, 1],
+ [2010, 'dev', 150, 1],
+ [2010, 'dev', 200, 1],
+ [2011, 'dev', 50, 1],
+ [2012, 'dev', 150, 1],
+
+ [2011, 'office', 300, 1],
+
+ [2010, 'market', 50, 1],
+ [2011, 'market', 500, 1],
+ [2012, 'market', 500, 1],
+ [2012, 'market', 300, 1],
+
+ [2012, 'R&D', 10, 1],],
+ order: [:year, :category, :spending, :nb_spending])
+ }
+
+ it 'works as group_by + aggregate' do
+ expect(spending_df.group_by_and_aggregate(:year, {spending: :sum})).to eq(
+ spending_df.group_by(:year).aggregate(spending: :sum))
+ expect(spending_df.group_by_and_aggregate([:year, :category], spending: :sum, nb_spending: :size)).to eq(
+ spending_df.group_by([:year, :category]).aggregate(spending: :sum, nb_spending: :size))
+ end
+ end
+
  context '#create_sql' do
  let(:df) { Daru::DataFrame.new({
  a: [1,2,3],
@@ -0,0 +1,5 @@
+ ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
+ 8517337,094652,03/12/2012 02:00:00 PM,027XX S HAMLIN AVE,1152,DECEPTIVE PRACTICE,ILLEGAL USE CASH CARD,ATM (AUTOMATIC TELLER MACHINE),false,true,1031,010,22,30,11,1151482,1885517,2012,02/04/2016 06:33:39 AM,41.841738053,-87.719605942,"(41.841738053, -87.719605942)"
+ 8517338,194241,03/06/2012 10:49:00 PM,102XX S VERNON AVE,0917,MOTOR VEHICLE THEFT,"CYCLE, SCOOTER, BIKE W-VIN",STREET,false,false,0511,005,9,49,07,1181052,1837191,2012,02/04/2016 06:33:39 AM,41.708495677,-87.612580474,"(41.708495677, -87.612580474)"
+ 8517339,194563,02/01/2012 08:15:00 AM,003XX W 108TH ST,0460,BATTERY,SIMPLE,"SCHOOL, PRIVATE, BUILDING",false,false,0513,005,34,49,08B,1176016,1833309,2012,02/04/2016 06:33:39 AM,41.6979571,-87.631138505,"(41.6979571, -87.631138505)"
+ 8517340,194531,03/12/2012 05:50:00 PM,089XX S CARPENTER ST,0560,ASSAULT,SIMPLE,STREET,false,false,2222,022,21,73,08A,1170886,1845421,2012,02/04/2016 06:33:39 AM,41.731307475,-87.649569675,"(41.731307475, -87.649569675)"
@@ -202,8 +202,16 @@ describe Daru::MultiIndex do
  expect(@multi_mi.include?([:a, :one])).to eq(true)
  end

- it "checks for non-existence of a tuple" do
- expect(@multi_mi.include?([:boo])).to eq(false)
+ it "checks for non-existence of completely specified tuple" do
+ expect(@multi_mi.include?([:b, :two, :foo])).to eq(false)
+ end
+
+ it "checks for non-existence of a top layer incomplete tuple" do
+ expect(@multi_mi.include?([:d])).to eq(false)
+ end
+
+ it "checks for non-existence of a middle layer incomplete tuple" do
+ expect(@multi_mi.include?([:c, :three])).to eq(false)
  end
  end

@@ -51,6 +51,16 @@ describe Daru::IO do
  expect(df['Domestic'].to_a).to all be_boolean
  end

+ it "uses the custom string converter correctly" do
+ df = Daru::DataFrame.from_csv 'spec/fixtures/string_converter_test.csv', converters: [:string]
+ expect(df['Case Number'].to_a.all? {|x| String === x }).to be_truthy
+ end
+
+ it "allow symbol to converters option" do
+ df = Daru::DataFrame.from_csv 'spec/fixtures/boolean_converter_test.csv', converters: :boolean
+ expect(df['Domestic'].to_a).to all be_boolean
+ end
+
  it "checks for equal parsing of local CSV files and remote CSV files" do
  %w[matrix_test repeated_fields scientific_notation sales-funnel].each do |file|
  df_local = Daru::DataFrame.from_csv("spec/fixtures/#{file}.csv")
@@ -1808,6 +1808,22 @@ describe Daru::Vector do
  end
  end

+ context '#rolling_fillna' do
+ subject do
+ Daru::Vector.new(
+ [Float::NAN, 2, 1, 4, nil, Float::NAN, 3, nil, Float::NAN]
+ )
+ end
+
+ context 'rolling_fillna forwards' do
+ it { expect(subject.rolling_fillna(:forward).to_a).to eq [0, 2, 1, 4, 4, 4, 3, 3, 3] }
+ end
+
+ context 'rolling_fillna backwards' do
+ it { expect(subject.rolling_fillna(direction: :backward).to_a).to eq [2, 2, 1, 4, 3, 3, 3, 0, 0] }
+ end
+ end
+
  context "#type" do
  before(:each) do
  @numeric = Daru::Vector.new([1,2,3,4,5])
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: daru
  version: !ruby/object:Gem::Version
- version: 0.2.0
+ version: 0.2.1
  platform: ruby
  authors:
  - Sameer Deshmukh
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2017-10-31 00:00:00.000000000 Z
+ date: 2018-07-02 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: backports
@@ -532,6 +532,7 @@ files:
  - spec/fixtures/repeated_fields.csv
  - spec/fixtures/sales-funnel.csv
  - spec/fixtures/scientific_notation.csv
+ - spec/fixtures/string_converter_test.csv
  - spec/fixtures/strings.dat
  - spec/fixtures/test_xls.xls
  - spec/fixtures/url_test.txt~
@@ -569,26 +570,7 @@ homepage: http://github.com/v0dro/daru
  licenses:
  - BSD-2
  metadata: {}
- post_install_message: |
- *************************************************************************
- Thank you for installing daru!
-
- oOOOOOo
- ,| oO
- //| |
- \\| |
- `| |
- `-----`
-
-
- Hope you love daru! For enhanced interactivity and better visualizations,
- consider using gnuplotrb and nyaplot with iruby. For statistics use the
- statsample family.
-
- Read the README for interesting use cases and examples.
-
- Cheers!
- *************************************************************************
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -604,7 +586,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.6.10
+ rubygems_version: 2.6.14
  signing_key:
  specification_version: 4
  summary: Data Analysis in RUby
@@ -638,6 +620,7 @@ test_files:
  - spec/fixtures/repeated_fields.csv
  - spec/fixtures/sales-funnel.csv
  - spec/fixtures/scientific_notation.csv
+ - spec/fixtures/string_converter_test.csv
  - spec/fixtures/strings.dat
  - spec/fixtures/test_xls.xls
  - spec/fixtures/url_test.txt~