kraps 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: d1f08b6fa0f725c63e3750f4b3bf04479622b40160a6364f708d91d37c0b1948
4
+ data.tar.gz: '0681d837d852846cc6c115fe9dae3075a0e5b3bb8b8eae2d90d8a23ec26581e3'
5
+ SHA512:
6
+ metadata.gz: 354ab3129ef1713c8229af54945251069c98d681e2db5c716d93b5925576b601751c23c1d502cb72ddaea5df5fc91e6eceb4590a619de730ae65f0762662da21
7
+ data.tar.gz: 0647fc85f445bc634f70e2e10feab325c3df6aec3a30d2af4b1a792e82b9adf1a31e8bb339465e10944bce927933956e8d88e72b37fb7d84027ec569441781d6
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --format documentation
2
+ --color
3
+ --require spec_helper
data/.rubocop.yml ADDED
@@ -0,0 +1,79 @@
1
+ AllCops:
2
+ NewCops: enable
3
+
4
+ Naming/FileName:
5
+ Exclude:
6
+ - lib/map-reduce-ruby.rb
7
+
8
+ Style/StringConcatenation:
9
+ Exclude:
10
+ - spec/**/*.rb
11
+
12
+ Lint/UnreachableLoop:
13
+ Exclude:
14
+ - spec/**/*.rb
15
+
16
+ Metrics/BlockLength:
17
+ Enabled: false
18
+
19
+ Gemspec/RequiredRubyVersion:
20
+ Enabled: false
21
+
22
+ Style/MutableConstant:
23
+ Enabled: false
24
+
25
+ Metrics/MethodLength:
26
+ Enabled: false
27
+
28
+ Style/Documentation:
29
+ Enabled: false
30
+
31
+ Style/NumericPredicate:
32
+ Enabled: false
33
+
34
+ Metrics/AbcSize:
35
+ Enabled: false
36
+
37
+ Metrics/CyclomaticComplexity:
38
+ Enabled: false
39
+
40
+ Metrics/PerceivedComplexity:
41
+ Enabled: false
42
+
43
+ Style/StringLiterals:
44
+ Enabled: true
45
+ EnforcedStyle: double_quotes
46
+
47
+ Style/StringLiteralsInInterpolation:
48
+ Enabled: true
49
+ EnforcedStyle: double_quotes
50
+
51
+ Layout/LineLength:
52
+ Max: 250
53
+
54
+ Style/FrozenStringLiteralComment:
55
+ EnforcedStyle: never
56
+
57
+ Style/ObjectThen:
58
+ Enabled: false
59
+
60
+ Gemspec/RequireMFA:
61
+ Enabled: false
62
+
63
+ Lint/EmptyBlock:
64
+ Enabled: false
65
+
66
+ Metrics/ModuleLength:
67
+ Enabled: false
68
+
69
+ Metrics/ParameterLists:
70
+ Enabled: false
71
+
72
+ Metrics/ClassLength:
73
+ Enabled: false
74
+
75
+ Lint/EmptyClass:
76
+ Enabled: false
77
+
78
+ Style/WordArray:
79
+ Enabled: false
data/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,84 @@
1
+ # Contributor Covenant Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
6
+
7
+ We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.
8
+
9
+ ## Our Standards
10
+
11
+ Examples of behavior that contributes to a positive environment for our community include:
12
+
13
+ * Demonstrating empathy and kindness toward other people
14
+ * Being respectful of differing opinions, viewpoints, and experiences
15
+ * Giving and gracefully accepting constructive feedback
16
+ * Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
17
+ * Focusing on what is best not just for us as individuals, but for the overall community
18
+
19
+ Examples of unacceptable behavior include:
20
+
21
+ * The use of sexualized language or imagery, and sexual attention or
22
+ advances of any kind
23
+ * Trolling, insulting or derogatory comments, and personal or political attacks
24
+ * Public or private harassment
25
+ * Publishing others' private information, such as a physical or email
26
+ address, without their explicit permission
27
+ * Other conduct which could reasonably be considered inappropriate in a
28
+ professional setting
29
+
30
+ ## Enforcement Responsibilities
31
+
32
+ Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.
33
+
34
+ Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.
35
+
36
+ ## Scope
37
+
38
+ This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
39
+
40
+ ## Enforcement
41
+
42
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at benjamin.vetter@wlw.de. All complaints will be reviewed and investigated promptly and fairly.
43
+
44
+ All community leaders are obligated to respect the privacy and security of the reporter of any incident.
45
+
46
+ ## Enforcement Guidelines
47
+
48
+ Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:
49
+
50
+ ### 1. Correction
51
+
52
+ **Community Impact**: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.
53
+
54
+ **Consequence**: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.
55
+
56
+ ### 2. Warning
57
+
58
+ **Community Impact**: A violation through a single incident or series of actions.
59
+
60
+ **Consequence**: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.
61
+
62
+ ### 3. Temporary Ban
63
+
64
+ **Community Impact**: A serious violation of community standards, including sustained inappropriate behavior.
65
+
66
+ **Consequence**: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.
67
+
68
+ ### 4. Permanent Ban
69
+
70
+ **Community Impact**: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.
71
+
72
+ **Consequence**: A permanent ban from any sort of public interaction within the community.
73
+
74
+ ## Attribution
75
+
76
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 2.0,
77
+ available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
78
+
79
+ Community Impact Guidelines were inspired by [Mozilla's code of conduct enforcement ladder](https://github.com/mozilla/diversity).
80
+
81
+ [homepage]: https://www.contributor-covenant.org
82
+
83
+ For answers to common questions about this code of conduct, see the FAQ at
84
+ https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations.
data/Gemfile ADDED
@@ -0,0 +1,8 @@
1
+ source "https://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in kraps.gemspec
4
+ gemspec
5
+
6
+ gem "rake", "~> 13.0"
7
+
8
+ gem "rspec", "~> 3.0"
data/Gemfile.lock ADDED
@@ -0,0 +1,113 @@
1
+ PATH
2
+ remote: .
3
+ specs:
4
+ kraps (0.1.0)
5
+ attachie
6
+ distributed_job
7
+ map-reduce-ruby (>= 2.1.1)
8
+ redis
9
+ ruby-progressbar
10
+
11
+ GEM
12
+ remote: https://rubygems.org/
13
+ specs:
14
+ activesupport (6.1.7)
15
+ concurrent-ruby (~> 1.0, >= 1.0.2)
16
+ i18n (>= 1.6, < 2)
17
+ minitest (>= 5.1)
18
+ tzinfo (~> 2.0)
19
+ zeitwerk (~> 2.3)
20
+ ast (2.4.2)
21
+ attachie (1.2.0)
22
+ activesupport
23
+ aws-sdk-s3
24
+ connection_pool
25
+ mime-types
26
+ aws-eventstream (1.2.0)
27
+ aws-partitions (1.649.0)
28
+ aws-sdk-core (3.164.0)
29
+ aws-eventstream (~> 1, >= 1.0.2)
30
+ aws-partitions (~> 1, >= 1.525.0)
31
+ aws-sigv4 (~> 1.1)
32
+ jmespath (~> 1, >= 1.6.1)
33
+ aws-sdk-kms (1.58.0)
34
+ aws-sdk-core (~> 3, >= 3.127.0)
35
+ aws-sigv4 (~> 1.1)
36
+ aws-sdk-s3 (1.116.0)
37
+ aws-sdk-core (~> 3, >= 3.127.0)
38
+ aws-sdk-kms (~> 1)
39
+ aws-sigv4 (~> 1.4)
40
+ aws-sigv4 (1.5.2)
41
+ aws-eventstream (~> 1, >= 1.0.2)
42
+ concurrent-ruby (1.1.10)
43
+ connection_pool (2.3.0)
44
+ diff-lcs (1.5.0)
45
+ distributed_job (3.1.0)
46
+ redis (>= 4.1.0)
47
+ i18n (1.12.0)
48
+ concurrent-ruby (~> 1.0)
49
+ jmespath (1.6.1)
50
+ json (2.6.2)
51
+ lazy_priority_queue (0.1.1)
52
+ map-reduce-ruby (2.1.1)
53
+ json
54
+ lazy_priority_queue
55
+ mime-types (3.4.1)
56
+ mime-types-data (~> 3.2015)
57
+ mime-types-data (3.2022.0105)
58
+ minitest (5.16.3)
59
+ parallel (1.22.1)
60
+ parser (3.1.2.1)
61
+ ast (~> 2.4.1)
62
+ rainbow (3.1.1)
63
+ rake (13.0.6)
64
+ redis (5.0.5)
65
+ redis-client (>= 0.9.0)
66
+ redis-client (0.10.0)
67
+ connection_pool
68
+ regexp_parser (2.5.0)
69
+ rexml (3.2.5)
70
+ rspec (3.11.0)
71
+ rspec-core (~> 3.11.0)
72
+ rspec-expectations (~> 3.11.0)
73
+ rspec-mocks (~> 3.11.0)
74
+ rspec-core (3.11.0)
75
+ rspec-support (~> 3.11.0)
76
+ rspec-expectations (3.11.1)
77
+ diff-lcs (>= 1.2.0, < 2.0)
78
+ rspec-support (~> 3.11.0)
79
+ rspec-mocks (3.11.1)
80
+ diff-lcs (>= 1.2.0, < 2.0)
81
+ rspec-support (~> 3.11.0)
82
+ rspec-support (3.11.1)
83
+ rubocop (1.36.0)
84
+ json (~> 2.3)
85
+ parallel (~> 1.10)
86
+ parser (>= 3.1.2.1)
87
+ rainbow (>= 2.2.2, < 4.0)
88
+ regexp_parser (>= 1.8, < 3.0)
89
+ rexml (>= 3.2.5, < 4.0)
90
+ rubocop-ast (>= 1.20.1, < 2.0)
91
+ ruby-progressbar (~> 1.7)
92
+ unicode-display_width (>= 1.4.0, < 3.0)
93
+ rubocop-ast (1.21.0)
94
+ parser (>= 3.1.1.0)
95
+ ruby-progressbar (1.11.0)
96
+ tzinfo (2.0.5)
97
+ concurrent-ruby (~> 1.0)
98
+ unicode-display_width (2.3.0)
99
+ zeitwerk (2.6.1)
100
+
101
+ PLATFORMS
102
+ ruby
103
+ x86_64-linux
104
+
105
+ DEPENDENCIES
106
+ bundler
107
+ kraps!
108
+ rake (~> 13.0)
109
+ rspec (~> 3.0)
110
+ rubocop
111
+
112
+ BUNDLED WITH
113
+ 2.3.24
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2022 Benjamin Vetter
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,333 @@
1
+ # Kraps
2
+
3
+ **Easily process big data in ruby**
4
+
5
+ Kraps allows you to process and perform calculations on very large datasets in
6
+ parallel using a map/reduce framework and runs on a background job framework
7
+ you already have. You just need some space on your filesystem, S3 as a storage
8
+ layer with a temporary lifecycle policy enabled, the already mentioned background
9
+ job framework (like sidekiq, shoryuken, etc) and redis to keep track of the
10
+ progress. You most likely already have most of these in place anyways.
11
+
12
+ ## Installation
13
+
14
+ Install the gem and add it to the application's Gemfile by executing:
15
+
16
+ $ bundle add kraps
17
+
18
+ If bundler is not being used to manage dependencies, install the gem by executing:
19
+
20
+ $ gem install kraps
21
+
22
+ ## Usage
23
+
24
+ The first thing you need to do is to tell Kraps about your desired
25
+ configuration, for example in an initializer:
26
+
27
+ ```ruby
28
+ Kraps.configure(
29
+ driver: Kraps::Drivers::S3Driver.new(s3_client: Aws::S3::Client.new("..."), bucket: "some-bucket", prefix: "temp/kraps/"),
30
+ redis: Redis.new,
31
+ namespace: "my-application", # An optional namespace to be used for redis keys, default: nil
32
+ job_ttl: 24.hours, # Job information in redis will automatically be removed after this amount of time, default: 24 hours
33
+ show_progress: true, # Whether or not to show the progress in the terminal when executing jobs, default: true
34
+ enqueuer: ->(worker, json) { worker.perform_async(json) } # Allows customizing the enqueueing of worker jobs
35
+ )
36
+ ```
37
+
38
+ Afterwards, create a job class, which tells Kraps what your job should do.
39
+ To do so, create a class with a `call` method, and optionally some
40
+ arguments. Let's create a simple job, which reads search log files to analyze
41
+ how often search queries have been searched:
42
+
43
+ ```ruby
44
+ class SearchLogCounter
45
+ def call(start_date:, end_date:)
46
+ job = Kraps::Job.new(worker: MyKrapsWorker)
47
+
48
+ job = job.parallelize(partitions: 128) do |collector|
49
+ (Date.parse(start_date)..Date.parse(end_date)).each do |date|
50
+ collector.call(date.to_s)
51
+ end
52
+ end
53
+
54
+ job = job.map do |date, _, collector|
55
+ # fetch the log file for the date from e.g. s3 and store its local path in `logfile`
56
+
57
+ File.foreach(logfile) do |line|
58
+ data = JSON.parse(line)
59
+
60
+ collector.call(data["q"], 1)
61
+ end
62
+ end
63
+
64
+ job = job.reduce do |_, count1, count2|
65
+ count1 + count2
66
+ end
67
+
68
+ job = job.each_partition do |partition, pairs|
69
+ tempfile = Tempfile.new
70
+
71
+ pairs.each do |q, count|
72
+ tempfile.puts(JSON.generate(q: q, count: count))
73
+ end
74
+
75
+ # store tempfile on e.g. s3
76
+ ensure
77
+ tempfile.close(true)
78
+ end
79
+
80
+ job
81
+ end
82
+ end
83
+ ```
84
+
85
+ Please note that this represents a specification of your job. It should be as
86
+ free as possible from side effects, because your background jobs also load this
87
+ specification to find out what to do, and Kraps will run the job with
88
+ maximum concurrency.
89
+
90
+ Next thing you need to do: create the background worker which runs arbitrary
91
+ Kraps job steps. Assuming you have sidekiq in place:
92
+
93
+ ```ruby
94
+ class MyKrapsWorker
95
+ include Sidekiq::Worker
96
+
97
+ def perform(json)
98
+ Kraps::Worker.new(json, memory_limit: 128.megabytes, chunk_limit: 64, concurrency: 8).call(retries: 3)
99
+ end
100
+ end
101
+ ```
102
+
103
+ The `json` argument is automatically enqueued by Kraps and contains everything
104
+ it needs to know about the job and step to execute. The `memory_limit` tells
105
+ Kraps how much memory it is allowed to allocate for temporary chunks, etc. This
106
+ value depends on the memory size of your container/server and how many worker
107
+ threads your background queue spawns. Let's say your container/server has 2
108
+ gigabytes of memory and your background framework spawns 5 threads.
109
+ Theoretically, you might be able to give 300-400 megabytes to Kraps then. The
110
+ `chunk_limit` ensures that only the specified number of chunks is processed in
111
+ a single run. A run basically means: it takes up to `chunk_limit` chunks,
112
+ reduces them and pushes the result as a new chunk to the list of chunks to
113
+ process. Thus, if your number of file descriptors is unlimited, you want to set
114
+ it to a higher number to avoid the overhead of multiple runs. `concurrency`
115
+ tells Kraps how many threads to use to concurrently upload/download files from
116
+ the storage layer. Finally, `retries` specifies how often Kraps should retry
117
+ the job step in case of errors. Kraps will sleep for 5 seconds between those
118
+ retries. Please note that it's not yet possible to use the retry mechanism of
119
+ your background job framework with Kraps.
120
+
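To illustrate why a higher `chunk_limit` saves runs, here is a toy model of the description above. It is not Kraps code: `runs_needed` is a hypothetical helper that only counts runs, assuming each run takes up to `chunk_limit` chunks and pushes back exactly one merged chunk.

```ruby
# Toy model, not part of Kraps: counts how many runs are needed until a
# single chunk is left, assuming every run merges up to chunk_limit chunks
# into exactly one new chunk.
def runs_needed(chunks, chunk_limit)
  raise ArgumentError, "chunk_limit must be at least 2" if chunk_limit < 2

  runs = 0

  while chunks > 1
    taken = [chunks, chunk_limit].min
    chunks = chunks - taken + 1 # the merged result is pushed back as a new chunk
    runs += 1
  end

  runs
end

puts runs_needed(128, 64)  # 3
puts runs_needed(128, 128) # 1
```

Under these assumptions, 128 chunks with a `chunk_limit` of 64 take three runs (128 → 65 → 2 → 1), while a limit of 128 needs only one, at the cost of more simultaneously open files.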
121
+ Now, executing your job is super easy:
122
+
123
+ ```ruby
124
+ Kraps::Runner.new(SearchLogCounter).call(start_date: '2018-01-01', end_date: '2022-01-01')
125
+ ```
126
+
127
+ This will execute all steps of your job, where the parts of a step are executed
128
+ in parallel, depending on the number of background job workers you have.
129
+
130
+ The runner by default also shows the progress of the execution:
131
+
132
+ ```
133
+ SearchLogCounter: job 1/1, step 1/4, token 2407e38eb58233ae3cecaec86fa6a6ec, Time: 00:00:05, 356/356 (100%) => parallelize
134
+ SearchLogCounter: job 1/1, step 2/4, token 7f11a04c754389359f67c1e7627468c6, Time: 00:08:00, 128/128 (100%) => map
135
+ SearchLogCounter: job 1/1, step 3/4, token b602198bfeab20ff205a00af36e43402, Time: 00:03:00, 128/128 (100%) => reduce
136
+ SearchLogCounter: job 1/1, step 4/4, token d18acbb22bbd30faff7265c179d4ec5a, Time: 00:02:00, 128/128 (100%) => each_partition
137
+ ```
138
+
139
+ How many "parts" a step has mostly boils down to the number of partitions you
140
+ specify in the job or its steps. More concretely, as your data consists
141
+ of `(key, value)` pairs, the number of partitions specifies how your data gets
142
+ split. Kraps assigns every `key` to a partition, either using a custom
143
+ `partitioner` or the default built-in hash partitioner. The hash partitioner
144
+ simply calculates a hash of your key modulo the number of partitions and the
145
+ resulting partition number is the partition the respective key is
146
+ assigned to. A partitioner is a callable which gets the key as argument and
147
+ returns a partition number. The built-in hash partitioner looks similar to this
148
+ one:
149
+
150
+ ```ruby
151
+ partitioner = proc { |key| Digest::SHA1.hexdigest(key.inspect)[0..4].to_i(16) % 128 } # 128 partitions
152
+ ```
153
+
154
+ Please note, it's important that the partitioner and the specified number of
155
+ partitions stay in sync. When you use a custom partitioner, please make sure
156
+ that the partitioner operates on the same number of partitions you specify.
157
+
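One way to keep both values in sync is to derive them from a single constant. The following is a minimal sketch, not taken from Kraps itself: `PARTITIONS` and `PARTITIONER` are hypothetical names, and the partitioner assumes array keys such as `[user_id, date]` that should be partitioned by their first element only.

```ruby
require "digest"

# Hypothetical constant: the single source of truth for the partition count.
PARTITIONS = 32

# Hypothetical custom partitioner, modelled on the hash partitioner shown
# above. It hashes only the first element of an array key, so that all keys
# sharing that element end up in the same partition.
PARTITIONER = proc do |key|
  Digest::SHA1.hexdigest(Array(key).first.inspect)[0..4].to_i(16) % PARTITIONS
end

# Both values would then be passed together, e.g.
#   job.parallelize(partitions: PARTITIONS, partitioner: PARTITIONER) { |collector| ... }

first = PARTITIONER.call(["user1", "2022-01-01"])
second = PARTITIONER.call(["user1", "2022-01-02"])

puts first == second                       # true
puts first.between?(0, PARTITIONS - 1)     # true
```

Because the modulo uses the same constant that is passed as `partitions`, changing the partition count in one place cannot leave the partitioner behind.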
158
+ ## Datatypes
159
+
160
+ Be aware that Kraps converts everything you pass to it to JSON sooner or later,
161
+ i.e. symbols will be converted to strings, etc. Therefore, it is recommended to
162
+ only use JSON-compatible datatypes right from the start. However, the keys
163
+ that you pass to Kraps additionally must be properly sortable, so it is
164
+ recommended to only use strings, numbers and arrays or a combination of those
165
+ for the keys. For more information, please check out
166
+ https://github.com/mrkamel/map-reduce-ruby/#limitations-for-keys
167
+
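The effect of the JSON conversion can be seen with plain Ruby. This sketch does not use Kraps at all; it only round-trips a key and a value through the standard `json` library to show what a later step would receive.

```ruby
require "json"

key = [:search, "ruby"] # contains a symbol
value = { count: 1 }    # uses a symbol as hash key

# Serialize and parse again, similar to what happens between job steps
restored_key, restored_value = JSON.parse(JSON.generate([key, value]))

puts restored_key == ["search", "ruby"]  # true, the symbol became a string
puts restored_value == { "count" => 1 }  # true, the hash key became a string
puts restored_value.key?(:count)         # false
```

Code that reads `value[:count]` in a later step would therefore get `nil`, which is why string keys are the safer choice from the beginning.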
168
+ ## Storage
169
+
170
+ Kraps stores temporary results of steps in a storage layer. Currently, only S3
171
+ is supported besides an in-memory driver used for testing purposes. Please be
172
+ aware that Kraps does not clean up any files from the storage layer, as it
173
+ would not be a safe thing to do in case of errors anyways. Instead, Kraps relies on
174
+ lifecycle features of modern object storage systems. Therefore, it is recommended
175
+ to e.g. configure a lifecycle policy to delete any files after e.g. 7 days
176
+ either for a whole bucket or for a certain prefix like e.g. `temp/` and tell
177
+ Kraps about the prefix to use (e.g. `temp/kraps/`).
178
+
179
+ ```ruby
180
+ Kraps::Drivers::S3Driver.new(s3_client: Aws::S3::Client.new("..."), bucket: "some-bucket", prefix: "temp/kraps/")
181
+ ```
182
+
183
+ If you set up the lifecycle policy for the whole bucket instead and Kraps is
184
+ the only user of the bucket, then no prefix needs to be specified.
185
+
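Such a lifecycle policy could be described as follows. This is a sketch under assumptions, not something Kraps sets up for you: `kraps_lifecycle_params` is a hypothetical helper, the rule id is a made-up name, and the hash is shaped for `Aws::S3::Client#put_bucket_lifecycle_configuration` from the `aws-sdk-s3` gem.

```ruby
# Builds the parameters for Aws::S3::Client#put_bucket_lifecycle_configuration.
# Bucket, prefix and days mirror the examples used in this README.
def kraps_lifecycle_params(bucket:, prefix:, days:)
  {
    bucket: bucket,
    lifecycle_configuration: {
      rules: [
        {
          id: "expire-kraps-temp-files", # hypothetical rule name
          status: "Enabled",
          filter: { prefix: prefix },    # only objects below this prefix expire
          expiration: { days: days }
        }
      ]
    }
  }
end

params = kraps_lifecycle_params(bucket: "some-bucket", prefix: "temp/", days: 7)

# With the aws-sdk-s3 gem and valid credentials, this could then be applied via:
#   Aws::S3::Client.new.put_bucket_lifecycle_configuration(**params)
puts params[:lifecycle_configuration][:rules].first[:expiration][:days] # 7
```

Note that this S3 call replaces the bucket's existing lifecycle configuration, so rules that are already in place would have to be included in the `rules` array as well.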
186
+ ## API
187
+
188
+ Your jobs can use the following list of methods. Please note that you don't
189
+ always need to specify all the parameters listed here. Especially `partitions`,
190
+ `partitioner` and `worker` are used from the previous step unless changed in
191
+ the next one.
192
+
193
+ * `parallelize`: Used to seed the job with initial data
194
+
195
+ ```ruby
196
+ job.parallelize(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker) do |collector|
197
+ ["item1", "item2", "item3"].each do |item|
198
+ collector.call(item)
199
+ end
200
+ end
201
+ ```
202
+
203
+ The block must use the collector to feed Kraps with individual items. The
204
+ items are used as keys and the values are set to `nil`.
205
+
206
+ * `map`: Maps the key value pairs to other key value pairs
207
+
208
+ ```ruby
209
+ job.map(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker) do |key, value, collector|
210
+ collector.call("changed #{key}", "changed #{value}")
211
+ end
212
+ ```
213
+
214
+ The block gets each key-value pair passed and the `collector` block can be
215
+ called as often as necessary. This is also the reason why `map` cannot simply
216
+ return the new key-value pair, but the `collector` must be used instead.
217
+
218
+ * `reduce`: Reduces the values of pairs having the same key
219
+
220
+ ```ruby
221
+ job.reduce(worker: MyKrapsWorker) do |key, value1, value2|
222
+ value1 + value2
223
+ end
224
+ ```
225
+
226
+ When the same key exists multiple times in the data, Kraps feeds the values
227
+ into your reduce block and expects to get one value returned. This happens
228
+ until every key exists only once.
229
+
230
+ The `key` itself is also passed to the block for the case that you need to
231
+ customize the reduce calculation according to the value of the key. However,
232
+ most of the time, this is not necessary and the key can simply be ignored.
233
+
234
+ * `repartition`: Used to change the partitioning
235
+
236
+ ```ruby
237
+ job.repartition(partitions: 128, partitioner: partitioner, worker: MyKrapsWorker)
238
+ ```
239
+
240
+ Repartitions all data into the specified number of partitions using the
241
+ specified partitioner.
242
+
243
+ * `each_partition`: Passes the partition number and all data of each partition
244
+ as a lazy enumerable
245
+
246
+ ```ruby
247
+ job.each_partition do |partition, pairs|
248
+ pairs.each do |key, value|
249
+ # ...
250
+ end
251
+ end
252
+ ```
253
+
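The pairwise contract of the `reduce` block described above can be tried out in plain Ruby. The following is only an in-memory illustration and says nothing about how Kraps implements it internally: `reducer` stands in for the block you would pass to `job.reduce`, and the pairs are made-up sample data.

```ruby
# Sample (key, value) pairs as a map step might emit them
pairs = [["ruby", 1], ["rails", 1], ["ruby", 1], ["ruby", 1]]

# Stand-in for the block passed to job.reduce
reducer = proc { |_key, value1, value2| value1 + value2 }

# Combine the values of each key two at a time until one value per key is left
reduced = pairs.group_by(&:first).map do |key, group|
  values = group.map(&:last)

  [key, values.reduce { |value1, value2| reducer.call(key, value1, value2) }]
end

puts reduced == [["ruby", 3], ["rails", 1]] # true
```

Since the block always receives two values and returns one, it has to return a value of the same kind it accepts, e.g. a count for two counts, because its result can be fed back in as `value1` or `value2` later.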
254
+ ## More Complex Jobs
255
+
256
+ Please note that a job class can return multiple jobs and jobs can build on
257
+ each other. Let's assume that we additionally want to calculate a total number
258
+ of searches made:
259
+
260
+ ```ruby
261
+ class SearchLogCounter
262
+ def call(start_date:, end_date:)
263
+ count_job = Kraps::Job.new(worker: SomeBackgroundWorker)
264
+
265
+ count_job = count_job.parallelize(partitions: 128) do |collector|
266
+ (Date.parse(start_date)..Date.parse(end_date)).each do |date|
267
+ collector.call(date.to_s)
268
+ end
269
+ end
270
+
271
+ count_job = count_job.map do |date, _, collector|
272
+ # ...
273
+
274
+ collector.call(data["q"], 1)
275
+
276
+ # ...
277
+ end
278
+
279
+ count_job = count_job.reduce do |_, count1, count2|
280
+ count1 + count2
281
+ end
282
+
283
+ sum_job = count_job.map do |q, count, collector|
284
+ collector.call('sum', count)
285
+ end
286
+
287
+ sum_job = sum_job.reduce do |_, count1, count2|
288
+ count1 + count2
289
+ end
290
+
291
+ # ...
292
+
293
+ [count_job, sum_job]
294
+ end
295
+ end
296
+ ```
297
+
298
+ When you execute the job, Kraps will execute the jobs one after another and as
299
+ the jobs build on each other, Kraps will execute the steps shared by both
300
+ jobs only once.
301
+
302
+ ## Dependencies
303
+
304
+ Kraps is built on top of
305
+ [map-reduce-ruby](https://github.com/mrkamel/map-reduce-ruby) for the
306
+ map/reduce framework,
307
+ [distributed_job](https://github.com/mrkamel/distributed_job)
308
+ to keep track of the job/step status,
309
+ [attachie](https://github.com/mrkamel/attachie) to interact with the storage
310
+ layer (s3), and
311
+ [ruby-progressbar](https://github.com/jfelchner/ruby-progressbar) to
312
+ report the progress in the terminal.
313
+
314
+ It is highly recommended to check out `map-reduce-ruby` to dig into internals
315
+ and performance details.
316
+
317
+ ## Contributing
318
+
319
+ Bug reports and pull requests are welcome on GitHub at
320
+ https://github.com/mrkamel/kraps. This project is intended to be a safe,
321
+ welcoming space for collaboration, and contributors are expected to adhere to
322
+ the [code of conduct](https://github.com/mrkamel/kraps/blob/main/CODE_OF_CONDUCT.md).
323
+
324
+ ## License
325
+
326
+ The gem is available as open source under the terms of the
327
+ [MIT License](https://opensource.org/licenses/MIT).
328
+
329
+ ## Code of Conduct
330
+
331
+ Everyone interacting in the Kraps project's codebases, issue trackers, chat
332
+ rooms and mailing lists is expected to follow the
333
+ [code of conduct](https://github.com/mrkamel/kraps/blob/main/CODE_OF_CONDUCT.md).
data/Rakefile ADDED
@@ -0,0 +1,6 @@
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+
6
+ task default: :spec
@@ -0,0 +1,6 @@
1
+ version: '2'
2
+ services:
3
+ elasticsearch:
4
+ image: redis
5
+ ports:
6
+ - 6379:6379
@@ -0,0 +1,10 @@
1
+ module Kraps
2
+ module Actions
3
+ ALL = [
4
+ PARALLELIZE = "parallelize",
5
+ MAP = "map",
6
+ REDUCE = "reduce",
7
+ EACH_PARTITION = "each_partition"
8
+ ]
9
+ end
10
+ end