wonderdog 0.0.1
- data/.gitignore +49 -0
- data/.rspec +2 -0
- data/CHANGELOG.md +5 -0
- data/LICENSE.md +201 -0
- data/README.md +175 -0
- data/Rakefile +10 -0
- data/bin/estool +141 -0
- data/bin/estrus.rb +136 -0
- data/bin/wonderdog +93 -0
- data/config/elasticsearch-example.yml +227 -0
- data/config/elasticsearch.in.sh +52 -0
- data/config/logging.yml +43 -0
- data/config/more_settings.yml +60 -0
- data/config/run_elasticsearch-2.sh +42 -0
- data/config/ufo_config.json +12 -0
- data/lib/wonderdog.rb +14 -0
- data/lib/wonderdog/configuration.rb +25 -0
- data/lib/wonderdog/hadoop_invocation_override.rb +139 -0
- data/lib/wonderdog/index_and_mapping.rb +67 -0
- data/lib/wonderdog/timestamp.rb +43 -0
- data/lib/wonderdog/version.rb +3 -0
- data/notes/README-benchmarking.txt +272 -0
- data/notes/README-read_tuning.textile +74 -0
- data/notes/benchmarking-201011.numbers +0 -0
- data/notes/cluster_notes.md +17 -0
- data/notes/notes.txt +91 -0
- data/notes/pigstorefunc.pig +45 -0
- data/pom.xml +80 -0
- data/spec/spec_helper.rb +22 -0
- data/spec/support/driver_helper.rb +15 -0
- data/spec/support/integration_helper.rb +30 -0
- data/spec/wonderdog/hadoop_invocation_override_spec.rb +81 -0
- data/spec/wonderdog/index_and_type_spec.rb +73 -0
- data/src/main/java/com/infochimps/elasticsearch/ElasticSearchInputFormat.java +268 -0
- data/src/main/java/com/infochimps/elasticsearch/ElasticSearchOutputCommitter.java +39 -0
- data/src/main/java/com/infochimps/elasticsearch/ElasticSearchOutputFormat.java +283 -0
- data/src/main/java/com/infochimps/elasticsearch/ElasticSearchSplit.java +60 -0
- data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingInputFormat.java +231 -0
- data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingOutputCommitter.java +37 -0
- data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingOutputFormat.java +88 -0
- data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingRecordReader.java +176 -0
- data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingRecordWriter.java +171 -0
- data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingSplit.java +102 -0
- data/src/main/java/com/infochimps/elasticsearch/ElasticTest.java +108 -0
- data/src/main/java/com/infochimps/elasticsearch/hadoop/util/HadoopUtils.java +100 -0
- data/src/main/java/com/infochimps/elasticsearch/pig/ElasticSearchIndex.java +216 -0
- data/src/main/java/com/infochimps/elasticsearch/pig/ElasticSearchJsonIndex.java +235 -0
- data/src/main/java/com/infochimps/elasticsearch/pig/ElasticSearchStorage.java +355 -0
- data/test/foo.json +3 -0
- data/test/foo.tsv +3 -0
- data/test/test_dump.pig +19 -0
- data/test/test_json_loader.pig +21 -0
- data/test/test_tsv_loader.pig +16 -0
- data/wonderdog.gemspec +32 -0
- metadata +130 -0

data/lib/wonderdog/timestamp.rb
ADDED
@@ -0,0 +1,43 @@
module Wukong
  module Elasticsearch

    # A class that makes Ruby's Time class serialize the way
    # Elasticsearch expects.
    #
    # Elasticsearch's date parsing engine [expects to
    # receive](http://www.elasticsearch.org/guide/reference/mapping/date-format.html)
    # a date formatted according to the Java library
    # [Joda's](http://joda-time.sourceforge.net/)
    # [ISODateTimeFormat.dateOptionalTimeParser](http://joda-time.sourceforge.net/api-release/org/joda/time/format/ISODateTimeFormat.html#dateOptionalTimeParser())
    # class.
    #
    # This format looks like this: `2012-11-30T01:15:23`.
    #
    # @see http://www.elasticsearch.org/guide/reference/mapping/date-format.html The Elasticsearch guide's Date Format entry
    # @see http://joda-time.sourceforge.net/api-release/org/joda/time/format/ISODateTimeFormat.html#dateOptionalTimeParser() The Joda class's API documentation
    class Timestamp < Time

      # Parses the given `string` into a Timestamp instance.
      #
      # @param [String] string
      # @return [Timestamp]
      def self.receive string
        return if string.nil? || string.empty?
        begin
          t = Time.parse(string)
        rescue ArgumentError => e
          return
        end
        new(t.year, t.month, t.day, t.hour, t.min, t.sec, t.utc_offset)
      end

      # Formats the Timestamp according to ISO 8601 rules.
      #
      # @param [Hash] options
      # @return [String]
      def to_wire(options={})
        utc.iso8601
      end
    end
  end
end

data/notes/README-benchmarking.txt
ADDED
@@ -0,0 +1,272 @@
To do a full flush, do this:

curl -XPOST host:9200/_flush?full=true

(run it every 30 min during import)
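
A minimal way to automate that while an import runs (a sketch; `host` is a placeholder for any node in the cluster):

while true ; do curl -XPOST 'http://host:9200/_flush?full=true' ; sleep 1800 ; done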

(Numeric rows appear to be: bytes indexed, records indexed, elapsed time, elapsed seconds, records/sec per machine, KB/sec per machine; the 8-machine rows append machine count, total records/sec, and total KB/sec -- see the dr helper near the end of this file.)

1 c1.xl 4es/12sh 768m buffer 1400m heap
2,584,346,624 255,304 0h14m25 865 295 2917

1 m1.xl 4es/12sh 1500m buffer 3200m heap
79,364,096 464,701 0h01m02 62 7495 1250
210,305,024 1,250,000 0h02m39 159 7861 1291
429,467,863 2,521,538 0h03m28 208 12122 2016

1 m1.xl 4es/12sh 4hdp 1800m buffer 3200m heap 300000 tlog
429,467,863 2,521,538 0h03m11 191 13201 2195

1 m1.xl 4es/12sh 4hdp 1800m buffer 2400m heap 100000 tlog 1000 batch lzw compr ulimit-l-unlimited (and in all following)
0h03m47

1 m1.xl 4es/12sh 4hdp 1800m buffer 2400m heap 200000 tlog 1000 batch no compr
0h3m22
again on top of data already loaded
0h3m16

1 m1.xl 4es/12sh 64hdp 1800m buffer 2800m heap 200000 tlog 50000 batch no compr
433,782,784 2,250,000 0h01m17 (froze up on mass assault once 50k batch was reached)

1 m1.xl 4es/12sh 64hdp 1800m buffer 2800m heap 200000 tlog 5000 batch no compr
785,514,496 4,075,000 0h05m59 359 11350 2136 cpu 4x70%
1,207,500,800 6,270,000 0h08m26 506 12391 2330

1 m1.xl 4es/12sh 64hdp 1800m buffer 2800m heap 200000 tlog 5000 batch no compr
163,512,320 845,000 0h01m49 109 7752 1464 cpu 4x75% ios 6k-8k x4 if 2800/440 ram 13257/15360MB
641,990,656 3,345,000 0h04m41 281 11903 2231
896,522,559 4,683,016 0h06m11 371 12622 2359
1,131,916,976 5,937,895 0h07m05 425 13971 2600

1 m1.xl 4es/12sh 16hdp 1800m buffer 2800m heap 200000 tlog 5000 batch no compr
74,383,360 385,000 0h01m50 110 3500 660
286,720,000 1,495,000 0h02m21 141 10602 1985
461,701,120 2,410,000 0h03m30 210 11476 2147
733,413,376 3,830,000 0h05m10 310 12354 2310
1,131,916,976 5,937,895 0h07m16 436 13619 2535

1 m1.xl 4es/12sh 64hdp 1800m buffer 2800m heap 200000 tlog 1000 batch no compr
156,958,720 813,056 0h01m35 95 8558 1613
305,135,616 1,586,176 0h02m25 145 10939 2055
446,300,160 2,323,456 0h03m10 190 12228 2293
690,028,544 3,594,240 0h04m40 280 12836 2406
927,807,418 4,850,093 0h06m10 370 13108 2448
1,131,916,976 5,937,895 0h06m55 415 14308 2663

1 m1.xl 4es/12sh 16hdp 1800m buffer 2800m heap 200000 tlog 1024 batch no compr
234,749,952 1,222,656 0h02m08 128 9552 1791
713,097,216 3,723,264 0h04m56 296 12578 2352
1,131,916,976 5,937,895 0h06m49 409 14518 2702

1 m1.xl 4es/12sh 20hdp 1800m buffer 2800m heap 200000 tlog 1024 batch no compr mergefac 40
190,971,904 994,304 0h01m55 115 8646 1621
326,107,136 1,699,840 0h02m52 172 9882 1851
707,152,365 3,709,734 0h04m51 291 12748 2373 672 files
again:
187,170,816 973,824 0h01m49 109 8934 1676
707,152,365 3,709,734 0h05m39 339 10943 2037 1440 files ; 18 *.tis typically 4.3M
again:
707,152,365 3,709,734 0h04m54 294 12618 2348 2052 files ; 28 *.tis typically 4.3M

1 m1.xl 4es/12sh 20hdp 1800m buffer 2800m heap 50_000 tlog 1024 batch no compr mergefac 20 (and in following)
349,372,416 1,821,696 0h02m42 162 11245 2106
707,152,365 3,709,734 0h04m43 283 13108 2440

1 m1.xl 4es/4sh 20hdp 1800m buffer 2800m heap 200_000 tlog 1024 batch no compr 64m engine.ram_buffer_size -- 3s ping_interval -- oops 10s refresh
253,689,856 1,321,984 0h02m48 168 7868 1474
707,152,365 3,709,734 0h05m55 355 10449 1945

1 m1.xl 4es/4sh 20hdp 1800m buffer 2800m heap 200_000 tlog 1024 batch no compr 256m engine.ram_buffer_size -- 3s ping_interval
707,152,365 3,709,734 0h04m31 271 13689 2548

1 m1.xl 4es/4sh 20hdp 1800m buffer 2800m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
707,152,365 3,709,734 0h04m08 248 14958 2784

1 m1.xl 4es/4sh 20hdp 1800m buffer 2800m heap 200_000 tlog 1024 batch no compr 768m engine.ram_buffer_size -- 3s ping_interval
707,152,365 3,709,734 0h04m47 287 12925 2406
again
707,152,365 3,709,734 0h04m27 267 13894 2586

1 m1.xl 4es/4sh 20hdp 768m buffer 2800m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
707,152,365 3,709,734 0h04m14 254 14605 2718

1 c1.xl 4es/4sh 20hdp 768m buffer 1200m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
707,152,365 3,709,734 0h02m55 175 21198 3946 ios 11282 ifstat 3696.26 695.26

1 c1.xl 4es/4sh 40hdp 768m buffer 1200m heap 200_000 tlog 4096 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
707,912,831 3,713,598 0h03m05 185 20073 3736

1 c1.xl 4es/4sh 40hdp 768m buffer 1200m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
707,912,831 3,713,598 0h02m59 179 20746 3862

1 c1.xl 4es/4sh 20hdp 256m buffer 1200m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
707,152,365 3,709,734 0h02m53 173 21443 3991

1 c1.xl 4es/4sh 20hdp 512m buffer 1200m heap 200_000 tlog 1024 batch no compr 768m engine.ram_buffer_size -- 3s ping_interval
707,152,365 3,709,734 0h03m00 180 20609 3836


8 c1.xl 32es/32sh 14hdp/56 512m buffer 1200m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
1,115,291,648 5,814,272 0h01m44 104 6988 1309 8 55906 10472
2,779,840,512 14,540,800 0h06m34 394 4613 861 8 36905 6890
6,100,156,416 32,508,928 0h14m51 891 4560 835 8 36485 6685
(killed)

8 c1.xl 24es/24sh 14hdp/56 256m buffer 1200m heap 200_000 tlog 1024 batch no compr 384m engine.ram_buffer_size -- 3s ping_interval
980,221,952 5,107,662 0h01m28 88 7255 1359 8 58041 10877
1,815,609,344 9,483,259 0h01m59 119 9961 1862 8 79691 14899
4,451,270,656 23,694,336 0h04m06 246 12039 2208 8 96318 17670
6,713,269,627 35,778,171 0h06m00 360 12422 2276 8 99383 18210

8 c1.xl 24es/24sh 14hdp/140 512m buffer 1200m heap 200_000 tlog 1024 batch no compr 384m engine.ram_buffer_size -- 3s ping_interval
4,743,036,929 24,825,856 0h04m39 279 11122 2075 8 88981 16601
8,119,975,937 42,889,216 0h07m00 420 12764 2360 8 102117 18880
17,273,994,924 91,991,529 0h15m14 914 12580 2307 8 100647 18456
23,598,696,768 123,812,641 0h24m04 1444 10717 1994 8 85742 15959


8 m1.xl 32es/32sh 14hdp/53 1800m buffer 2800m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval -- merge_factor 30
306,296,262 1,608,526 0h01m18 78 2577 479 8 20622 3834
1,814,083,014 9,564,301 0h02m33 153 7813 1447 8 62511 11578
2,837,886,406 15,030,140 0h04m49 289 6500 1198 8 52007 9589
3,928,208,838 21,039,950 0h06m22 382 6884 1255 8 55078 10042
6,322,378,160 33,875,546 0h11m28 688 6154 1121 8 49237 8974

8 c1.xl 24es/24sh 14hdp/140 512m buffer 1200m heap 200_000 tlog 4096 batch no compr 256m engine.ram_buffer_size -- 3s ping_interval -- merge_factor 30
4,717,346,816 24,855,996 0h04m55 295 10532 1952 8 84257 15616
9,735,831,552 51,896,969 0h09m23 563 11522 2110 8 92179 16887


(200910)
2,746,875,904 10,555,392 0h02m50 170 7761 1972 8 62090 15779
43,201,339,007 166,049,864 0h35m06 2106 9855 2504 8 78846 20032


2009{10,11,12}

8 c1.xl 24es/24sh 14hdp/140 512m buffer 1200m heap 200_000 tlog 4096 batch no compr 256m engine.ram_buffer_size -- 3s ping_interval -- merge_factor 30
135,555,262,283 516,220,825 2h16m13 8173 7895 2024 8 63161 16197


slug=tweet-2009q3pre ; curl -XGET 'http://10.99.10.113:9200/_flush/' ; curl -XPUT "http://10.99.10.113:9200/$slug/" ; rake -f ~/ics/backend/wonderdog/java/Rakefile ; ~/ics/backend/wonderdog/java/bin/wonderdog --rm --index_name=$slug --bulk_size=4096 --object_type=tweet /tmp/tweet_by_month-tumbled/"tweet-200[678]" /tmp/es_bulkload_log/$slug


sudo kill `ps aux | egrep '^61021' | cut -c 10-15`

for node in '' 2 3 ; do echo $node ; sudo node=$node ES_MAX_MEM=1600m ~/ics/backend/wonderdog/config/run_elasticsearch-2.sh ; done


for node in '' 2 3 4 ; do echo $node ; sudo node=$node ES_MAX_MEM=1200m ~/ics/backend/wonderdog/config/run_elasticsearch-2.sh ; done
sudo kill `ps aux | egrep '^61021' | cut -c 10-15` ; sleep 10 ; sudo rm -rf /mnt*/elasticsearch/* ; ps auxf | egrep '^61021' ; zero_log /var/log/elasticsearch/hoolock.log

ec2-184-73-41-228.compute-1.amazonaws.com

Query for success:
curl -XGET 'http://10.195.10.207:9200/tweet/tweet/_search?q=text:mrflip' | ruby -rubygems -e 'require "json" ; puts JSON.pretty_generate(JSON.load($stdin))'

Detect settings:
grep ' with ' /var/log/elasticsearch/hoolock.log | egrep 'DEBUG|INFO' | cut -d\] -f2,3,5- | sort | cutc | uniq -c

Example index sizes:
ls -lRhart /mnt*/elasticsearch/data/hoolock/nodes/*/indices/tweet/0/*/*.{tis,fdt}


def dr(line) ; sbytes,srecs,time,mach,*_ = line.strip.split(/\s+/) ; bytes = sbytes.gsub(/\D/,"").to_i ; recs = srecs.gsub(/\D/,"").to_i ; mach=mach.to_i ; mach = 1 if mach == 0 ; s,m,h = [0,0,0,time.split(/\D/)].flatten.reverse.map(&:to_i) ; tm = (3600*h + 60*m + s) ; results = "%14s\t%12s\t%01dh%02dm%02d\t%7d\t%7d\t%7d\t%7d\t%7d\t%7d"%[sbytes, srecs, h,m,s, tm, recs/tm/mach, bytes/tm/1024/mach, mach, recs/tm, bytes/tm/1024, ] ; puts results ; results ; end


# . jack up batch size and see effect on rec/sec, find optimal
# . run multiple mappers with one data es_node with optimal batch size, refind if necessary
# . work data es_node heavily but don't drive it into the ground
# . tune lucene + jvm options for data es_node

14 files, 3 hadoop nodes w/ 3 tasktrackers each   27 min
14 files, 3 hadoop nodes w/ 5 tasktrackers each   22 min

12 files @ 500k lines -- 3M rec -- 3 hdp/2 tt -- 2 esnodes -- 17m

6 files @ 100k = 600k rec -- 3hdp/2tt -- 1 es machine/2 esnodes -- 3m30
6 files @ 100k = 600k rec -- 3hdp/2tt -- 1 es machine/4 esnodes -- 3m20

5 files, 3 nodes,

Did 2,400,000 recs 24 tasks 585,243,042 bytes -- 15:37 on 12 maps/3nodes

Did _optimize
real 18m29.548s  user 0m0.000s  sys 0m0.000s  pct 0.00

java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

===========================================================================

The refresh API allows you to explicitly refresh one or more indices, making all operations performed since the last refresh available for search. The (near) real-time capability depends on the index engine used: the robin engine, for example, requires refresh to be called, but by default a refresh is scheduled periodically.

curl -XPOST 'http://localhost:9200/twitter/_refresh'

The refresh API can be applied to more than one index with a single call, or even on _all the indices.
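
For instance, hitting the root _refresh endpoint refreshes every index in one call:

curl -XPOST 'http://localhost:9200/_refresh'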


runs:
- es_machine: m1.xlarge
  es_nodes: 1
  es_max_mem: 1500m
  bulk_size: 5
  maps: 1
  records: 100000
  shards: 12
  replicas: 1
  merge_factor: 100
  thread_count: 32
  lucene_buffer_size: 256mb
  runtime: 108s
  throughput: 1000 rec/sec
- es_machine: m1.xlarge
  es_nodes: 1
  bulk_size: 5
  maps: 1
  records: 100000
  shards: 12
  replicas: 1
  merge_factor: 1000
  thread_count: 32
  lucene_buffer_size: 256mb
  runtime: 77s
  throughput: 1300 rec/sec
- es_machine: m1.xlarge
  es_nodes: 1
  bulk_size: 5
  maps: 1
  records: 100000
  shards: 12
  replicas: 1
  merge_factor: 10000
  thread_count: 32
  lucene_buffer_size: 512mb
  runtime: 180s
  throughput: 555 rec/sec

data/notes/README-read_tuning.textile
ADDED
@@ -0,0 +1,74 @@
http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/8517d7ccdaa6a72b

We have 3 servers in each data center, with 28M docs consuming 170G
disk (soon to shrink with ES 0.14), handling about 6k req/min for
client queries and 195k document matches/minute for alerting purposes.
With our hardware, we're hardly taxing them and still averaging
30-35ms response times.


:index_buffer_size     => "512m",
:heap_size             => '11000',
:fd_ping_interval      => '2s',
:fd_ping_timeout       => '60s',
:fd_ping_retries       => '6',
:seeds                 => '10.116.83.97:9300,10.196.190.111:9300,10.112.45.60:9300,10.118.254.64:9300',
:recovery_after_time   => '10m',
:recovery_after_nodes  => 4,
:expected_nodes        => 4,
:refresh_interval      => 900,

with 80 primary / 160 active shards in 5 indexes, each shard sized as approx:

14395 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-2009q3pre/10/index
26615 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-2009q4/0/index
 9294 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201004/12/index
12204 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201005/12/index

after recovering cluster most nodes were at 7.5 - 9.6 GB

http true:

14409 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-2009q3pre/11/index
26573 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-2009q4/11/index
23885 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-2010q1/4/index
 9271 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201004/0/index
12218 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201005/4/index

13723 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201006/9/index
15578 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201007/6/index
 1471 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201008/11/index
  915 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201009/1/index
 1908 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/13/index
 2026 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201011/7/index


"tweet-201010"    : "num_docs" : 40272985,
"tweet-201011"    : "num_docs" : 39012255,
"tweet-2009q4"    : "num_docs" : 577762139,
"tweet-201006"    : "num_docs" : 288445236,
"tweet-201008"    : "num_docs" : 30904989,
"tweet-201005"    : "num_docs" : 242058418,
"tweet-201007"    : "num_docs" : 311059766,
"tweet-2009q3pre" : "num_docs" : 359075858,
"tweet-201004"    : "num_docs" : 190501768,
"tweet-201009"    : "num_docs" : 19166922,
"tweet-2010q1"    : "num_docs" : 368031331,


14409 tweet-2009q3pre
26590 tweet-2009q4
23923 tweet-2010q1
 9278 tweet-201004
12216 tweet-201005
13735 tweet-201006
15580 tweet-201007
 1472 tweet-201008
  916 tweet-201009
 1910 tweet-201010
 2023 tweet-201011


data/notes/benchmarking-201011.numbers
Binary file

data/notes/cluster_notes.md
ADDED
@@ -0,0 +1,17 @@
### How to choose shards, replicas and cluster size: Rules of Thumb.

sh = shards
rf = replication factor. replicas = 0 implies rf = 1, or 1 replica of each shard.

pm = running data_esnode processes per machine
N  = number of machines

n_cores = number of cpu cores per machine
n_disks = number of disks per machine

* You must have at least as many data_esnodes as shard copies.
  Mandatory: (sh * rf) < (pm * N)

* Shards: shard size < 10GB

* More shards = more parallel writes
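
Worked example (hypothetical numbers, just to show the arithmetic): with N = 4 machines running pm = 2 data_esnodes each, you have pm * N = 8 esnodes; an index with sh = 3 shards at rf = 2 needs sh * rf = 6 shard copies, so the mandatory condition holds. Raising rf to 3 would mean 9 shard copies against 8 esnodes and would violate it.
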
data/notes/notes.txt
ADDED
@@ -0,0 +1,91 @@
At Infochimps we recently indexed over 2.5 billion documents, for a total indexed size of 4TB. This would not have been possible without ElasticSearch and the Hadoop bulk loader we wrote, <a href="http://github.com/infochimps/wonderdog">wonderdog</a>. I'll go into the technical details in a later post, but for now here's how you can get started with ElasticSearch and Hadoop:

<h2>Getting Started with ElasticSearch</h2>

The first thing is to actually install elasticsearch:

<pre class="brush: bash">
$: wget http://github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.14.2.zip
$: unzip elasticsearch-0.14.2.zip
$: sudo mv elasticsearch-0.14.2 /usr/local/share/
$: sudo ln -s /usr/local/share/elasticsearch-0.14.2 /usr/local/share/elasticsearch
</pre>

Next you'll want to make sure there is an 'elasticsearch' user and that there are suitable data, work, and log directories that 'elasticsearch' owns:

<pre class="brush: bash">
$: sudo useradd elasticsearch
$: sudo mkdir -p /var/log/elasticsearch /var/run/elasticsearch/{data,work}
$: sudo chown -R elasticsearch /var/{log,run}/elasticsearch
</pre>

Then get wonderdog (you'll have to git clone it for now) and go ahead and copy the example configuration in wonderdog/config:

<pre class="brush: bash">
$: sudo mkdir -p /etc/elasticsearch
$: sudo cp config/elasticsearch-example.yml /etc/elasticsearch/elasticsearch.yml
$: sudo cp config/logging.yml /etc/elasticsearch/
$: sudo cp config/elasticsearch.in.sh /etc/elasticsearch/
</pre>

Make changes to 'elasticsearch.yml' so that it points to the correct data, work, and log directories. Also, change 'recovery_after_nodes' and 'expected_nodes' in elasticsearch.yml to however many nodes (machines) you actually expect to have in your cluster. You'll probably also want to do a quick once-over of elasticsearch.in.sh and make sure the jvm settings, etc. are sane for your particular setup. Finally, to start up, run:

<pre class="brush: bash">
sudo -u elasticsearch /usr/local/share/elasticsearch/bin/elasticsearch -Des.config=/etc/elasticsearch/elasticsearch.yml
</pre>

You should now have a happily running (reasonably configured) elasticsearch data node.
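
A quick way to sanity-check that the node came up (a sketch, assuming the default HTTP port of 9200):

<pre class="brush: bash">
$: curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
</pre>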

<h2>Index Some Data</h2>

Prerequisites:

<ul>
<li>You have a working hadoop cluster</li>
<li>Elasticsearch data nodes are installed and running on all your machines and they have discovered each other. See the elasticsearch documentation for details on making that actually work.</li>
<li>You've installed the following rubygems: 'configliere' and 'json' (install command just below)</li>
</ul>
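
If you don't already have those gems, something like this should do it (a sketch; adjust to your ruby/gem setup):

<pre class="brush: bash">
$: sudo gem install configliere json
</pre>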

<h3>Get Data</h3>

As an example, let's index this UFO sightings data set from Infochimps <a href="http://infochimps.com/datasets/d60000-documented-ufo-sightings-with-text-descriptions-and-metad">here</a>. (You should be familiar with this one by now...) It's mostly raw text and so it's a very reasonable thing to index. Once it's downloaded, go ahead and throw it on the HDFS:
<pre class="brush: bash">
$: hadoop fs -mkdir /data/domestic/ufo
$: hadoop fs -put chimps_16154-2010-10-20_14-33-35/ufo_awesome.tsv /data/domestic/ufo/
</pre>

<h3>Index Data</h3>

This is the easy part:

<pre class="brush: bash">
$: bin/wonderdog --rm --field_names=sighted_at,reported_at,location,shape,duration,description --id_field=-1 --index_name=ufo_sightings --object_type=ufo_sighting --es_config=/etc/elasticsearch/elasticsearch.yml /data/domestic/ufo/ufo_awesome.tsv /tmp/elasticsearch/aliens/out
</pre>

Flags:

'--rm' - Remove output on the hdfs if it exists
'--field_names' - A comma separated list of the field names in the tsv, in order
'--id_field' - The field to use as the record id, -1 if the record has no inherent id
'--index_name' - The index name to bulk load into
'--object_type' - The type of objects we're indexing
'--es_config' - Points to the elasticsearch config*

*The elasticsearch config that the hadoop machines need must be on all the hadoop machines and have a 'hosts' entry listing the ips of all the elasticsearch data nodes (see wonderdog/config/elasticsearch-example.yml). This means we can run the hadoop job on a different cluster than the one the elasticsearch data nodes are running on.

The other two arguments are the input and output paths. The output path in this case only gets written to if one or more index requests fail; this way you can re-run the job on only those records that didn't make it the first time.

The indexing should go pretty quickly.
Next, refresh the index so we can actually query our newly indexed data. There's a tool in wonderdog's bin directory for that:
<pre class="brush: bash">
$: bin/estool --host=`hostname -i` refresh_index
</pre>
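
If you'd rather hit the HTTP API directly, the same refresh can be issued with curl against any of the data nodes (assuming the default port):

<pre class="brush: bash">
$: curl -XPOST 'http://localhost:9200/ufo_sightings/_refresh'
</pre>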


<h3>Query Data</h3>

Once again, use estool:
<pre class="brush: bash">
$: bin/estool --host=`hostname -i` --index_name=ufo_sightings --query_string="ufo" query
</pre>
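
The same query can also be run straight against the REST API (a sketch; any field from --field_names works in the query string):

<pre class="brush: bash">
$: curl -XGET 'http://localhost:9200/ufo_sightings/ufo_sighting/_search?q=description:ufo'
</pre>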

Hurray.