wonderdog 0.0.1

Files changed (55)
  1. data/.gitignore +49 -0
  2. data/.rspec +2 -0
  3. data/CHANGELOG.md +5 -0
  4. data/LICENSE.md +201 -0
  5. data/README.md +175 -0
  6. data/Rakefile +10 -0
  7. data/bin/estool +141 -0
  8. data/bin/estrus.rb +136 -0
  9. data/bin/wonderdog +93 -0
  10. data/config/elasticsearch-example.yml +227 -0
  11. data/config/elasticsearch.in.sh +52 -0
  12. data/config/logging.yml +43 -0
  13. data/config/more_settings.yml +60 -0
  14. data/config/run_elasticsearch-2.sh +42 -0
  15. data/config/ufo_config.json +12 -0
  16. data/lib/wonderdog.rb +14 -0
  17. data/lib/wonderdog/configuration.rb +25 -0
  18. data/lib/wonderdog/hadoop_invocation_override.rb +139 -0
  19. data/lib/wonderdog/index_and_mapping.rb +67 -0
  20. data/lib/wonderdog/timestamp.rb +43 -0
  21. data/lib/wonderdog/version.rb +3 -0
  22. data/notes/README-benchmarking.txt +272 -0
  23. data/notes/README-read_tuning.textile +74 -0
  24. data/notes/benchmarking-201011.numbers +0 -0
  25. data/notes/cluster_notes.md +17 -0
  26. data/notes/notes.txt +91 -0
  27. data/notes/pigstorefunc.pig +45 -0
  28. data/pom.xml +80 -0
  29. data/spec/spec_helper.rb +22 -0
  30. data/spec/support/driver_helper.rb +15 -0
  31. data/spec/support/integration_helper.rb +30 -0
  32. data/spec/wonderdog/hadoop_invocation_override_spec.rb +81 -0
  33. data/spec/wonderdog/index_and_type_spec.rb +73 -0
  34. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchInputFormat.java +268 -0
  35. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchOutputCommitter.java +39 -0
  36. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchOutputFormat.java +283 -0
  37. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchSplit.java +60 -0
  38. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingInputFormat.java +231 -0
  39. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingOutputCommitter.java +37 -0
  40. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingOutputFormat.java +88 -0
  41. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingRecordReader.java +176 -0
  42. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingRecordWriter.java +171 -0
  43. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingSplit.java +102 -0
  44. data/src/main/java/com/infochimps/elasticsearch/ElasticTest.java +108 -0
  45. data/src/main/java/com/infochimps/elasticsearch/hadoop/util/HadoopUtils.java +100 -0
  46. data/src/main/java/com/infochimps/elasticsearch/pig/ElasticSearchIndex.java +216 -0
  47. data/src/main/java/com/infochimps/elasticsearch/pig/ElasticSearchJsonIndex.java +235 -0
  48. data/src/main/java/com/infochimps/elasticsearch/pig/ElasticSearchStorage.java +355 -0
  49. data/test/foo.json +3 -0
  50. data/test/foo.tsv +3 -0
  51. data/test/test_dump.pig +19 -0
  52. data/test/test_json_loader.pig +21 -0
  53. data/test/test_tsv_loader.pig +16 -0
  54. data/wonderdog.gemspec +32 -0
  55. metadata +130 -0
data/lib/wonderdog/timestamp.rb
@@ -0,0 +1,43 @@
+ module Wukong
+   module Elasticsearch
+
+     # A class that makes Ruby's Time class serialize the way
+     # Elasticsearch expects.
+     #
+     # Elasticsearch's date parsing engine [expects to
+     # receive](http://www.elasticsearch.org/guide/reference/mapping/date-format.html)
+     # a date formatted according to the Java library
+     # [Joda's](http://joda-time.sourceforge.net/)
+     # [ISODateTimeFormat.dateOptionalTimeParser](http://joda-time.sourceforge.net/api-release/org/joda/time/format/ISODateTimeFormat.html#dateOptionalTimeParser())
+     # class.
+     #
+     # This format looks like this: `2012-11-30T01:15:23`.
+     #
+     # @see http://www.elasticsearch.org/guide/reference/mapping/date-format.html The Elasticsearch guide's Date Format entry
+     # @see http://joda-time.sourceforge.net/api-release/org/joda/time/format/ISODateTimeFormat.html#dateOptionalTimeParser() The Joda class's API documentation
+     class Timestamp < Time
+
+       # Parses the given `string` into a Timestamp instance.
+       #
+       # @param [String] string
+       # @return [Timestamp]
+       def self.receive string
+         return if string.nil? || string.empty?
+         begin
+           t = Time.parse(string)
+         rescue ArgumentError => e
+           return
+         end
+         new(t.year, t.month, t.day, t.hour, t.min, t.sec, t.utc_offset)
+       end
+
+       # Formats the Timestamp according to ISO 8601 rules.
+       #
+       # @param [Hash] options
+       # @return [String]
+       def to_wire(options={})
+         utc.iso8601
+       end
+     end
+   end
+ end
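
A minimal usage sketch of the Timestamp class above (it assumes `require 'time'` has already been done elsewhere in the gem, since `Time.parse` and `#iso8601` come from the stdlib time library; the sample input string is made up):

    ts = Wukong::Elasticsearch::Timestamp.receive("2012-11-30T01:15:23-06:00")
    ts.to_wire                                               # => "2012-11-30T07:15:23Z" (UTC, ISO 8601)
    Wukong::Elasticsearch::Timestamp.receive("")             # => nil (blank input)
    Wukong::Elasticsearch::Timestamp.receive("not a date")   # => nil (unparseable input)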
data/lib/wonderdog/version.rb
@@ -0,0 +1,3 @@
+ module Wonderdog
+   VERSION = '0.0.1'
+ end
data/notes/README-benchmarking.txt
@@ -0,0 +1,272 @@
+ To do a full flush, do this:
+
+ curl -XPOST host:9200/_flush?full=true
+
+ (run it every 30 min during import)
+
+ 1 c1.xl 4es/12sh 768m buffer 1400m heap
+ 2,584,346,624 255,304 0h14m25 865 295 2917
+
+ 1 m1.xl 4es/12sh 1500m buffer 3200m heap
+ 79,364,096 464,701 0h01m02 62 7495 1250
+ 210,305,024 1,250,000 0h02m39 159 7861 1291
+ 429,467,863 2,521,538 0h03m28 208 12122 2016
+
+ 1 m1.xl 4es/12sh 4hdp 1800m buffer 3200m heap 300000 tlog
+ 429,467,863 2,521,538 0h03m11 191 13201 2195
+
+ 1 m1.xl 4es/12sh 4hdp 1800m buffer 2400m heap 100000 tlog 1000 batch lzw compr ulimit-l-unlimited (and in all following)
+ 0h03m47
+
+ 1 m1.xl 4es/12sh 4hdp 1800m buffer 2400m heap 200000 tlog 1000 batch no compr
+ 0h3m22
+ again on top of data already loaded
+ 0h3m16
+
+ 1 m1.xl 4es/12sh 64hdp 1800m buffer 2800m heap 200000 tlog 50000 batch no compr
+ 433,782,784 2,250,000 0h01m17 (froze up on mass assault once 50k batch was reached)
+
+ 1 m1.xl 4es/12sh 64hdp 1800m buffer 2800m heap 200000 tlog 5000 batch no compr
+ 785,514,496 4,075,000 0h05m59 359 11350 2136 cpu 4x70%
+ 1,207,500,800 6,270,000 0h08m26 506 12391 2330
+
+ 1 m1.xl 4es/12sh 64hdp 1800m buffer 2800m heap 200000 tlog 5000 batch no compr
+ 163,512,320 845,000 0h01m49 109 7752 1464 cpu 4x75% ios 6k-8k x4 if 2800/440 ram 13257/15360MB
+ 641,990,656 3,345,000 0h04m41 281 11903 2231
+ 896,522,559 4,683,016 0h06m11 371 12622 2359
+ 1,131,916,976 5,937,895 0h07m05 425 13971 2600
+
+ 1 m1.xl 4es/12sh 16hdp 1800m buffer 2800m heap 200000 tlog 5000 batch no compr
+ 74,383,360 385,000 0h01m50 110 3500 660
+ 286,720,000 1,495,000 0h02m21 141 10602 1985
+ 461,701,120 2,410,000 0h03m30 210 11476 2147
+ 733,413,376 3,830,000 0h05m10 310 12354 2310
+ 1,131,916,976 5,937,895 0h07m16 436 13619 2535
+
+ 1 m1.xl 4es/12sh 64hdp 1800m buffer 2800m heap 200000 tlog 1000 batch no compr
+ 156,958,720 813,056 0h01m35 95 8558 1613
+ 305,135,616 1,586,176 0h02m25 145 10939 2055
+ 446,300,160 2,323,456 0h03m10 190 12228 2293
+ 690,028,544 3,594,240 0h04m40 280 12836 2406
+ 927,807,418 4,850,093 0h06m10 370 13108 2448
+ 1,131,916,976 5,937,895 0h06m55 415 14308 2663
+
+ 1 m1.xl 4es/12sh 16hdp 1800m buffer 2800m heap 200000 tlog 1024 batch no compr
+ 234,749,952 1,222,656 0h02m08 128 9552 1791
+ 713,097,216 3,723,264 0h04m56 296 12578 2352
+ 1,131,916,976 5,937,895 0h06m49 409 14518 2702
+
+ 1 m1.xl 4es/12sh 20hdp 1800m buffer 2800m heap 200000 tlog 1024 batch no compr mergefac 40
+ 190,971,904 994,304 0h01m55 115 8646 1621
+ 326,107,136 1,699,840 0h02m52 172 9882 1851
+ 707,152,365 3,709,734 0h04m51 291 12748 2373 672 files
+ again:
+ 187,170,816 973,824 0h01m49 109 8934 1676
+ 707,152,365 3,709,734 0h05m39 339 10943 2037 1440 files ; 18 *.tis typically 4.3M
+ again:
+ 707,152,365 3,709,734 0h04m54 294 12618 2348 2052 files ; 28 *.tis typically 4.3M
+
+ 1 m1.xl 4es/12sh 20hdp 1800m buffer 2800m heap 50_000 tlog 1024 batch no compr mergefac 20 (and in following)
+ 349,372,416 1,821,696 0h02m42 162 11245 2106
+ 707,152,365 3,709,734 0h04m43 283 13108 2440
+
+ 1 m1.xl 4es/4sh 20hdp 1800m buffer 2800m heap 200_000 tlog 1024 batch no compr 64m engine.ram_buffer_size -- 3s ping_interval -- oops 10s refresh
+ 253,689,856 1,321,984 0h02m48 168 7868 1474
+ 707,152,365 3,709,734 0h05m55 355 10449 1945
+
+ 1 m1.xl 4es/4sh 20hdp 1800m buffer 2800m heap 200_000 tlog 1024 batch no compr 256m engine.ram_buffer_size -- 3s ping_interval
+ 707,152,365 3,709,734 0h04m31 271 13689 2548
+
+ 1 m1.xl 4es/4sh 20hdp 1800m buffer 2800m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
+ 707,152,365 3,709,734 0h04m08 248 14958 2784
+
+ 1 m1.xl 4es/4sh 20hdp 1800m buffer 2800m heap 200_000 tlog 1024 batch no compr 768m engine.ram_buffer_size -- 3s ping_interval
+ 707,152,365 3,709,734 0h04m47 287 12925 2406
+ again
+ 707,152,365 3,709,734 0h04m27 267 13894 2586
+
+ 1 m1.xl 4es/4sh 20hdp 768m buffer 2800m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
+ 707,152,365 3,709,734 0h04m14 254 14605 2718
+
+ 1 c1.xl 4es/4sh 20hdp 768m buffer 1200m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
+ 707,152,365 3,709,734 0h02m55 175 21198 3946 ios 11282 ifstat 3696.26 695.26
+
+ 1 c1.xl 4es/4sh 40hdp 768m buffer 1200m heap 200_000 tlog 4096 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
+ 707,912,831 3,713,598 0h03m05 185 20073 3736
+
+ 1 c1.xl 4es/4sh 40hdp 768m buffer 1200m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
+ 707,912,831 3,713,598 0h02m59 179 20746 3862
+
+ 1 c1.xl 4es/4sh 20hdp 256m buffer 1200m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
+ 707,152,365 3,709,734 0h02m53 173 21443 3991
+
+ 1 c1.xl 4es/4sh 20hdp 512m buffer 1200m heap 200_000 tlog 1024 batch no compr 768m engine.ram_buffer_size -- 3s ping_interval
+ 707,152,365 3,709,734 0h03m00 180 20609 3836
+
+
+ 8 c1.xl 32es/32sh 14hdp/56 512m buffer 1200m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
+ 1,115,291,648 5,814,272 0h01m44 104 6988 1309 8 55906 10472
+ 2,779,840,512 14,540,800 0h06m34 394 4613 861 8 36905 6890
+ 6,100,156,416 32,508,928 0h14m51 891 4560 835 8 36485 6685
+ (killed)
+
+ 8 c1.xl 24es/24sh 14hdp/56 256m buffer 1200m heap 200_000 tlog 1024 batch no compr 384m engine.ram_buffer_size -- 3s ping_interval
+ 980,221,952 5,107,662 0h01m28 88 7255 1359 8 58041 10877
+ 1,815,609,344 9,483,259 0h01m59 119 9961 1862 8 79691 14899
+ 4,451,270,656 23,694,336 0h04m06 246 12039 2208 8 96318 17670
+ 6,713,269,627 35,778,171 0h06m00 360 12422 2276 8 99383 18210
+
+ 8 c1.xl 24es/24sh 14hdp/140 512m buffer 1200m heap 200_000 tlog 1024 batch no compr 384m engine.ram_buffer_size -- 3s ping_interval
+ 4,743,036,929 24,825,856 0h04m39 279 11122 2075 8 88981 16601
+ 8,119,975,937 42,889,216 0h07m00 420 12764 2360 8 102117 18880
+ 17,273,994,924 91,991,529 0h15m14 914 12580 2307 8 100647 18456
+ 23,598,696,768 123,812,641 0h24m04 1444 10717 1994 8 85742 15959
+
+
+ 8 m1.xl 32es/32sh 14hdp/53 1800m buffer 2800m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval -- merge_factor 30
+ 306,296,262 1,608,526 0h01m18 78 2577 479 8 20622 3834
+ 1,814,083,014 9,564,301 0h02m33 153 7813 1447 8 62511 11578
+ 2,837,886,406 15,030,140 0h04m49 289 6500 1198 8 52007 9589
+ 3,928,208,838 21,039,950 0h06m22 382 6884 1255 8 55078 10042
+ 6,322,378,160 33,875,546 0h11m28 688 6154 1121 8 49237 8974
+
+ 8 c1.xl 24es/24sh 14hdp/140 512m buffer 1200m heap 200_000 tlog 4096 batch no compr 256m engine.ram_buffer_size -- 3s ping_interval -- merge_factor 30
+ 4,717,346,816 24,855,996 0h04m55 295 10532 1952 8 84257 15616
+ 9,735,831,552 51,896,969 0h09m23 563 11522 2110 8 92179 16887
+
+
+ (200910)
+ 2,746,875,904 10,555,392 0h02m50 170 7761 1972 8 62090 15779
+ 43,201,339,007 166,049,864 0h35m06 2106 9855 2504 8 78846 20032
+
+
+ 2009{10,11,12}
+
+ 8 c1.xl 24es/24sh 14hdp/140 512m buffer 1200m heap 200_000 tlog 4096 batch no compr 256m engine.ram_buffer_size -- 3s ping_interval -- merge_factor 30
+ 135,555,262,283 516,220,825 2h16m13 8173 7895 2024 8 63161 16197
+
+
+ slug=tweet-2009q3pre ; curl -XGET 'http://10.99.10.113:9200/_flush/' ; curl -XPUT "http://10.99.10.113:9200/$slug/" ; rake -f ~/ics/backend/wonderdog/java/Rakefile ; ~/ics/backend/wonderdog/java/bin/wonderdog --rm --index_name=$slug --bulk_size=4096 --object_type=tweet /tmp/tweet_by_month-tumbled/"tweet-200[678]" /tmp/es_bulkload_log/$slug
+
+
+ sudo kill `ps aux | egrep '^61021' | cut -c 10-15`
+
+ for node in '' 2 3 ; do echo $node ; sudo node=$node ES_MAX_MEM=1600m ~/ics/backend/wonderdog/config/run_elasticsearch-2.sh ; done
+
+ for node in '' 2 3 4 ; do echo $node ; sudo node=$node ES_MAX_MEM=1200m ~/ics/backend/wonderdog/config/run_elasticsearch-2.sh ; done
+ sudo kill `ps aux | egrep '^61021' | cut -c 10-15` ; sleep 10 ; sudo rm -rf /mnt*/elasticsearch/* ; ps auxf | egrep '^61021' ; zero_log /var/log/elasticsearch/hoolock.log
+
+ ec2-184-73-41-228.compute-1.amazonaws.com
+
+ Query for success:
+ curl -XGET 'http://10.195.10.207:9200/tweet/tweet/_search?q=text:mrflip' | ruby -rubygems -e 'require "json" ; puts JSON.pretty_generate(JSON.load($stdin))'
+
+ Detect settings:
+ grep ' with ' /var/log/elasticsearch/hoolock.log | egrep 'DEBUG|INFO' | cut -d\] -f2,3,5- | sort | cutc | uniq -c
+
+ Example index sizes:
+ ls -lRhart /mnt*/elasticsearch/data/hoolock/nodes/*/indices/tweet/0/*/*.{tis,fdt}
+
+ # Formats a raw benchmark measurement ("bytes records elapsed machines") into
+ # a row like the tables above: bytes, records, elapsed, seconds, rec/s per
+ # machine, KB/s per machine, machines, rec/s total, KB/s total.
+ def dr(line)
+   sbytes, srecs, time, mach, *_ = line.strip.split(/\s+/)
+   bytes = sbytes.gsub(/\D/, "").to_i
+   recs  = srecs.gsub(/\D/, "").to_i
+   mach  = mach.to_i
+   mach  = 1 if mach == 0                 # single-machine lines omit the machine count
+   s, m, h = [0, 0, 0, time.split(/\D/)].flatten.reverse.map(&:to_i)
+   tm = (3600 * h) + (60 * m) + s         # elapsed time in seconds
+   results = "%14s\t%12s\t%01dh%02dm%02d\t%7d\t%7d\t%7d\t%7d\t%7d\t%7d" % [sbytes, srecs, h, m, s, tm, recs/tm/mach, bytes/tm/1024/mach, mach, recs/tm, bytes/tm/1024]
+   puts results
+   results
+ end
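+
+ A quick check of the dr helper above against the first row of the 8x c1.xl
+ table earlier in this file (the raw input format, "bytes records elapsed
+ machines", is inferred from the code, so treat it as an assumption):
+
+     dr("1,115,291,648 5,814,272 0h01m44 8")
+     # prints (tab-separated): 1,115,291,648  5,814,272  0h01m44  104  6988  1309  8  55906  10472
+     # i.e. bytes, records, elapsed, seconds, rec/s/machine, KB/s/machine, machines, rec/s total, KB/s total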
+
+ # . jack up batch size and see effect on rec/sec, find optimal
+ # . run multiple mappers with one data es_node with optimal batch size, refind if necessary
+ # . work data es_node heavily but don't drive it into the ground
+ # . tune lucene + jvm options for data es_node
+
+ 14 files, 3 hadoop nodes w/ 3 tasktrackers each 27 min
+ 14 files, 3 hadoop nodes w/ 5 tasktrackers each 22 min
+
+ 12 files @ 500k lines -- 3M rec -- 3 hdp/2 tt -- 2 esnodes -- 17m
+
+ 6 files @ 100k = 600k rec -- 3hdp/2tt -- 1 es machine/2 esnodes -- 3m30
+ 6 files @ 100k = 600k rec -- 3hdp/2tt -- 1 es machine/4 esnodes -- 3m20
+
+ 5 files, 3 nodes,
+
+ Did 2,400,000 recs 24 tasks 585,243,042 bytes -- 15:37 on 12 maps/3nodes
+
+ Did _optimize
+ real 18m29.548s user 0m0.000s sys 0m0.000s pct 0.00
+
+ java version "1.6.0_20"
+ Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
+ Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
+
+ ===========================================================================
+
+ The refresh API allows you to explicitly refresh one or more indices, making all
+ operations performed since the last refresh available for search. The (near)
+ real-time capabilities depend on the index engine used. For example, the robin
+ engine requires refresh to be called, but by default a refresh is scheduled
+ periodically.
+
+ curl -XPOST 'http://localhost:9200/twitter/_refresh'
+
+ The refresh API can be applied to more than one index with a single call, or even to _all the indices.
+
+ runs:
+ - es_machine: m1.xlarge
+   es_nodes: 1
+   es_max_mem: 1500m
+   bulk_size: 5
+   maps: 1
+   records: 100000
+   shards: 12
+   replicas: 1
+   merge_factor: 100
+   thread_count: 32
+   lucene_buffer_size: 256mb
+   runtime: 108s
+   throughput: 1000 rec/sec
+ - es_machine: m1.xlarge
+   es_nodes: 1
+   bulk_size: 5
+   maps: 1
+   records: 100000
+   shards: 12
+   replicas: 1
+   merge_factor: 1000
+   thread_count: 32
+   lucene_buffer_size: 256mb
+   runtime: 77s
+   throughput: 1300 rec/sec
+ - es_machine: m1.xlarge
+   es_nodes: 1
+   bulk_size: 5
+   maps: 1
+   records: 100000
+   shards: 12
+   replicas: 1
+   merge_factor: 10000
+   thread_count: 32
+   lucene_buffer_size: 512mb
+   runtime: 180s
+   throughput: 555 rec/sec
data/notes/README-read_tuning.textile
@@ -0,0 +1,74 @@
+ http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/8517d7ccdaa6a72b
+
+ We have 3 servers in each data center, with 28M docs consuming 170G
+ disk (soon to shrink with ES 0.14), handling about 6k req/min for
+ client queries and 195k document matches/minute for alerting purposes.
+ With our hardware, we're hardly taxing them and still averaging
+ 30-35ms response times.
+
+ :index_buffer_size => "512m",
+ :heap_size => '11000',
+ :fd_ping_interval => '2s',
+ :fd_ping_timeout => '60s',
+ :fd_ping_retries => '6',
+ :seeds => '10.116.83.97:9300,10.196.190.111:9300,10.112.45.60:9300,10.118.254.64:9300',
+ :recovery_after_time => '10m',
+ :recovery_after_nodes => 4,
+ :expected_nodes => 4,
+ :refresh_interval => 900,
+
+ with 80 primary / 160 active shards in 5 indexes, each shard sized as approx:
+
+ 14395 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-2009q3pre/10/index
+ 26615 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-2009q4/0/index
+ 9294 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201004/12/index
+ 12204 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201005/12/index
+
+ after recovering cluster most nodes were at 7.5 - 9.6 GB
+
+ http true:
+
+ 14409 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-2009q3pre/11/index
+ 26573 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-2009q4/11/index
+ 23885 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-2010q1/4/index
+ 9271 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201004/0/index
+ 12218 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201005/4/index
+
+ 13723 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201006/9/index
+ 15578 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201007/6/index
+ 1471 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201008/11/index
+ 915 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201009/1/index
+ 1908 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/13/index
+ 2026 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201011/7/index
+
+ "tweet-201010" : "num_docs" : 40272985,
+ "tweet-201011" : "num_docs" : 39012255,
+ "tweet-2009q4" : "num_docs" : 577762139,
+ "tweet-201006" : "num_docs" : 288445236,
+ "tweet-201008" : "num_docs" : 30904989,
+ "tweet-201005" : "num_docs" : 242058418,
+ "tweet-201007" : "num_docs" : 311059766,
+ "tweet-2009q3pre" : "num_docs" : 359075858,
+ "tweet-201004" : "num_docs" : 190501768,
+ "tweet-201009" : "num_docs" : 19166922,
+ "tweet-2010q1" : "num_docs" : 368031331,
+
+ 14409 tweet-2009q3pre
+ 26590 tweet-2009q4
+ 23923 tweet-2010q1
+ 9278 tweet-201004
+ 12216 tweet-201005
+ 13735 tweet-201006
+ 15580 tweet-201007
+ 1472 tweet-201008
+ 916 tweet-201009
+ 1910 tweet-201010
+ 2023 tweet-201011
+
data/notes/cluster_notes.md
@@ -0,0 +1,17 @@
+ ### How to choose shards, replicas and cluster size: Rules of Thumb.
+
+ sh = shards
+ rf = replication factor. replicas = 0 implies rf = 1, i.e. a single copy of each shard.
+
+ pm = running data_esnode processes per machine
+ N = number of machines
+
+ n_cores = number of cpu cores per machine
+ n_disks = number of disks per machine
+
+ * You must have more data_esnodes than total shard copies.
+   Mandatory: (sh * rf) < (pm * N)
+
+ Shards: shard size < 10GB
+
+ More shards = more parallel writes
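
To make the mandatory rule concrete, here is a tiny Ruby sketch of the check above (the method name and the example numbers are made up; sh, rf, pm and N follow the definitions in the note, with N written as n here):

    # Rule-of-thumb check: is (sh * rf) < (pm * N)?
    def enough_esnodes?(sh, rf, pm, n)
      sh * rf < pm * n
    end

    enough_esnodes?(12, 2, 4, 8)   # => true  (24 shard copies on 32 data_esnodes)
    enough_esnodes?(80, 2, 4, 8)   # => false (160 shard copies on 32 data_esnodes)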
data/notes/notes.txt
@@ -0,0 +1,91 @@
+ At Infochimps we recently indexed over 2.5 billion documents for a total of 4TB of indexed data. This would not have been possible without ElasticSearch and the Hadoop bulk loader we wrote, <a href="http://github.com/infochimps/wonderdog">wonderdog</a>. I'll go into the technical details in a later post, but for now here's how you can get started with ElasticSearch and Hadoop:
+
+ <h2>Getting Started with ElasticSearch</h2>
+
+ The first thing to do is actually install elasticsearch:
+
+ <pre class="brush: bash">
+ $: wget http://github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.14.2.zip
+ $: unzip elasticsearch-0.14.2.zip
+ $: sudo mv elasticsearch-0.14.2 /usr/local/share/
+ $: sudo ln -s /usr/local/share/elasticsearch-0.14.2 /usr/local/share/elasticsearch
+ </pre>
+
+ Next you'll want to make sure there is an 'elasticsearch' user and that there are suitable data, work, and log directories that 'elasticsearch' owns:
+
+ <pre class="brush: bash">
+ $: sudo useradd elasticsearch
+ $: sudo mkdir -p /var/log/elasticsearch /var/run/elasticsearch/{data,work}
+ $: sudo chown -R elasticsearch /var/{log,run}/elasticsearch
+ </pre>
+
+ Then get wonderdog (you'll have to git clone it for now) and go ahead and copy the example configuration in wonderdog/config:
+
+ <pre class="brush: bash">
+ $: sudo mkdir -p /etc/elasticsearch
+ $: sudo cp config/elasticsearch-example.yml /etc/elasticsearch/elasticsearch.yml
+ $: sudo cp config/logging.yml /etc/elasticsearch/
+ $: sudo cp config/elasticsearch.in.sh /etc/elasticsearch/
+ </pre>
+
+ Make changes to 'elasticsearch.yml' so that it points to the correct data, work, and log directories. Also, change 'recovery_after_nodes' and 'expected_nodes' in elasticsearch.yml to however many nodes (machines) you actually expect to have in your cluster. You'll probably also want to do a quick once-over of elasticsearch.in.sh and make sure the jvm settings, etc. are sane for your particular setup. Finally, to start it up, run:
+
+ <pre class="brush: bash">
+ sudo -u elasticsearch /usr/local/share/elasticsearch/bin/elasticsearch -Des.config=/etc/elasticsearch/elasticsearch.yml
+ </pre>
+
+ You should now have a happily running (reasonably configured) elasticsearch data node.
+
+ <h2>Index Some Data</h2>
+
+ Prerequisites:
+
+ <ul>
+ <li>You have a working hadoop cluster</li>
+ <li>Elasticsearch data nodes are installed and running on all your machines and they have discovered each other. See the elasticsearch documentation for details on making that actually work.</li>
+ <li>You've installed the following rubygems: 'configliere' and 'json'</li>
+ </ul>
+
+ <h3>Get Data</h3>
+
+ As an example, let's index this UFO sightings data set from Infochimps <a href="http://infochimps.com/datasets/d60000-documented-ufo-sightings-with-text-descriptions-and-metad">here</a>. (You should be familiar with this one by now...) It's mostly raw text and so it's a very reasonable thing to index. Once it's downloaded, go ahead and throw it on the HDFS:
+ <pre class="brush: bash">
+ $: hadoop fs -mkdir /data/domestic/ufo
+ $: hadoop fs -put chimps_16154-2010-10-20_14-33-35/ufo_awesome.tsv /data/domestic/ufo/
+ </pre>
+
+ <h3>Index Data</h3>
+
+ This is the easy part:
+
+ <pre class="brush: bash">
+ $: bin/wonderdog --rm --field_names=sighted_at,reported_at,location,shape,duration,description --id_field=-1 --index_name=ufo_sightings --object_type=ufo_sighting --es_config=/etc/elasticsearch/elasticsearch.yml /data/domestic/ufo/ufo_awesome.tsv /tmp/elasticsearch/aliens/out
+ </pre>
+
+ Flags:
+
+ '--rm' - Remove output on the hdfs if it exists
+ '--field_names' - A comma-separated list of the field names in the tsv, in order
+ '--id_field' - The field to use as the record id; -1 if the record has no inherent id
+ '--index_name' - The index name to bulk load into
+ '--object_type' - The type of objects we're indexing
+ '--es_config' - Points to the elasticsearch config*
+
+ *The elasticsearch config passed via --es_config must be present on all the hadoop machines and must have a 'hosts' entry listing the IPs of all the elasticsearch data nodes (see wonderdog/config/elasticsearch-example.yml). This means the hadoop job can run on a different cluster than the one the elasticsearch data nodes are running on.
+
+ The other two arguments are the input and output paths. The output path in this case only gets written to if one or more index requests fail. This way you can re-run the job on only those records that didn't make it the first time.
+
+ The indexing should go pretty quickly.
+ Next, refresh the index so we can actually query our newly indexed data. There's a tool in wonderdog's bin directory for that:
+ <pre class="brush: bash">
+ $: bin/estool --host=`hostname -i` refresh_index
+ </pre>
+
+ <h3>Query Data</h3>
+
+ Once again, use estool:
+ <pre class="brush: bash">
+ $: bin/estool --host=`hostname -i` --index_name=ufo_sightings --query_string="ufo" query
+ </pre>
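+
+ If you'd rather hit Elasticsearch from Ruby than shell out to estool, here's a minimal sketch using the standard library plus the 'json' gem listed in the prerequisites (the host convention, port 9200, and the _search?q= query string follow the examples above; treat the exact response fields as an assumption for your ES version):
+
+ <pre class="brush: ruby">
+ require 'net/http'
+ require 'uri'
+ require 'json'
+
+ # Search the freshly built index for "ufo" and report the hit count.
+ host = `hostname -i`.strip
+ uri  = URI("http://#{host}:9200/ufo_sightings/_search?q=ufo&size=1")
+ resp = JSON.parse(Net::HTTP.get(uri))
+ puts "total hits: #{resp['hits']['total']}"
+ puts JSON.pretty_generate(resp['hits']['hits'].first) unless resp['hits']['hits'].empty?
+ </pre>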
+
+ Hurray.