wonderdog 0.0.1

Files changed (55)
  1. data/.gitignore +49 -0
  2. data/.rspec +2 -0
  3. data/CHANGELOG.md +5 -0
  4. data/LICENSE.md +201 -0
  5. data/README.md +175 -0
  6. data/Rakefile +10 -0
  7. data/bin/estool +141 -0
  8. data/bin/estrus.rb +136 -0
  9. data/bin/wonderdog +93 -0
  10. data/config/elasticsearch-example.yml +227 -0
  11. data/config/elasticsearch.in.sh +52 -0
  12. data/config/logging.yml +43 -0
  13. data/config/more_settings.yml +60 -0
  14. data/config/run_elasticsearch-2.sh +42 -0
  15. data/config/ufo_config.json +12 -0
  16. data/lib/wonderdog.rb +14 -0
  17. data/lib/wonderdog/configuration.rb +25 -0
  18. data/lib/wonderdog/hadoop_invocation_override.rb +139 -0
  19. data/lib/wonderdog/index_and_mapping.rb +67 -0
  20. data/lib/wonderdog/timestamp.rb +43 -0
  21. data/lib/wonderdog/version.rb +3 -0
  22. data/notes/README-benchmarking.txt +272 -0
  23. data/notes/README-read_tuning.textile +74 -0
  24. data/notes/benchmarking-201011.numbers +0 -0
  25. data/notes/cluster_notes.md +17 -0
  26. data/notes/notes.txt +91 -0
  27. data/notes/pigstorefunc.pig +45 -0
  28. data/pom.xml +80 -0
  29. data/spec/spec_helper.rb +22 -0
  30. data/spec/support/driver_helper.rb +15 -0
  31. data/spec/support/integration_helper.rb +30 -0
  32. data/spec/wonderdog/hadoop_invocation_override_spec.rb +81 -0
  33. data/spec/wonderdog/index_and_type_spec.rb +73 -0
  34. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchInputFormat.java +268 -0
  35. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchOutputCommitter.java +39 -0
  36. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchOutputFormat.java +283 -0
  37. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchSplit.java +60 -0
  38. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingInputFormat.java +231 -0
  39. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingOutputCommitter.java +37 -0
  40. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingOutputFormat.java +88 -0
  41. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingRecordReader.java +176 -0
  42. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingRecordWriter.java +171 -0
  43. data/src/main/java/com/infochimps/elasticsearch/ElasticSearchStreamingSplit.java +102 -0
  44. data/src/main/java/com/infochimps/elasticsearch/ElasticTest.java +108 -0
  45. data/src/main/java/com/infochimps/elasticsearch/hadoop/util/HadoopUtils.java +100 -0
  46. data/src/main/java/com/infochimps/elasticsearch/pig/ElasticSearchIndex.java +216 -0
  47. data/src/main/java/com/infochimps/elasticsearch/pig/ElasticSearchJsonIndex.java +235 -0
  48. data/src/main/java/com/infochimps/elasticsearch/pig/ElasticSearchStorage.java +355 -0
  49. data/test/foo.json +3 -0
  50. data/test/foo.tsv +3 -0
  51. data/test/test_dump.pig +19 -0
  52. data/test/test_json_loader.pig +21 -0
  53. data/test/test_tsv_loader.pig +16 -0
  54. data/wonderdog.gemspec +32 -0
  55. metadata +130 -0
data/lib/wonderdog/timestamp.rb
@@ -0,0 +1,43 @@
+ module Wukong
+   module Elasticsearch
+
+     # A class that makes Ruby's Time class serialize the way
+     # Elasticsearch expects.
+     #
+     # Elasticsearch's date parsing engine [expects to
+     # receive](http://www.elasticsearch.org/guide/reference/mapping/date-format.html)
+     # a date formatted according to the Java library
+     # [Joda's](http://joda-time.sourceforge.net/)
+     # [ISODateTimeFormat.dateOptionalTimeParser](http://joda-time.sourceforge.net/api-release/org/joda/time/format/ISODateTimeFormat.html#dateOptionalTimeParser())
+     # class.
+     #
+     # This format looks like this: `2012-11-30T01:15:23`.
+     #
+     # @see http://www.elasticsearch.org/guide/reference/mapping/date-format.html The Elasticsearch guide's Date Format entry
+     # @see http://joda-time.sourceforge.net/api-release/org/joda/time/format/ISODateTimeFormat.html#dateOptionalTimeParser() The Joda class's API documentation
+     class Timestamp < Time
+
+       # Parses the given `string` into a Timestamp instance.
+       #
+       # @param [String] string
+       # @return [Timestamp]
+       def self.receive string
+         return if string.nil? || string.empty?
+         begin
+           t = Time.parse(string)
+         rescue ArgumentError => e
+           return
+         end
+         new(t.year, t.month, t.day, t.hour, t.min, t.sec, t.utc_offset)
+       end
+
+       # Formats the Timestamp according to ISO 8601 rules.
+       #
+       # @param [Hash] options
+       # @return [String]
+       def to_wire(options={})
+         utc.iso8601
+       end
+     end
+   end
+ end
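
A minimal usage sketch of the Timestamp class above (it assumes `require 'time'` has already been done elsewhere in the gem, since `Time.parse` and `#iso8601` come from the stdlib time library; the sample input string is made up):

    ts = Wukong::Elasticsearch::Timestamp.receive("2012-11-30T01:15:23-06:00")
    ts.to_wire                                               # => "2012-11-30T07:15:23Z" (UTC, ISO 8601)
    Wukong::Elasticsearch::Timestamp.receive("")             # => nil (blank input)
    Wukong::Elasticsearch::Timestamp.receive("not a date")   # => nil (unparseable input)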
data/lib/wonderdog/version.rb
@@ -0,0 +1,3 @@
+ module Wonderdog
+   VERSION = '0.0.1'
+ end
data/notes/README-benchmarking.txt
@@ -0,0 +1,272 @@
+ To do a full flush, do this:
+
+ curl -XPOST host:9200/_flush?full=true
+
+ (run it every 30 min during import)
+
+ 1 c1.xl 4es/12sh 768m buffer 1400m heap
+ 2,584,346,624 255,304 0h14m25 865 295 2917
+
+ 1 m1.xl 4es/12sh 1500m buffer 3200m heap
+ 79,364,096 464,701 0h01m02 62 7495 1250
+ 210,305,024 1,250,000 0h02m39 159 7861 1291
+ 429,467,863 2,521,538 0h03m28 208 12122 2016
+
+ 1 m1.xl 4es/12sh 4hdp 1800m buffer 3200m heap 300000 tlog
+ 429,467,863 2,521,538 0h03m11 191 13201 2195
+
+ 1 m1.xl 4es/12sh 4hdp 1800m buffer 2400m heap 100000 tlog 1000 batch lzw compr ulimit-l-unlimited (and in all following)
+ 0h03m47
+
+ 1 m1.xl 4es/12sh 4hdp 1800m buffer 2400m heap 200000 tlog 1000 batch no compr
+ 0h3m22
+ again on top of data already loaded
+ 0h3m16
+
+ 1 m1.xl 4es/12sh 64hdp 1800m buffer 2800m heap 200000 tlog 50000 batch no compr
+ 433,782,784 2,250,000 0h01m17 (froze up on mass assault once 50k batch was reached)
+
+ 1 m1.xl 4es/12sh 64hdp 1800m buffer 2800m heap 200000 tlog 5000 batch no compr
+ 785,514,496 4,075,000 0h05m59 359 11350 2136 cpu 4x70%
+ 1,207,500,800 6,270,000 0h08m26 506 12391 2330
+
+ 1 m1.xl 4es/12sh 64hdp 1800m buffer 2800m heap 200000 tlog 5000 batch no compr
+ 163,512,320 845,000 0h01m49 109 7752 1464 cpu 4x75% ios 6k-8k x4 if 2800/440 ram 13257/15360MB
+ 641,990,656 3,345,000 0h04m41 281 11903 2231
+ 896,522,559 4,683,016 0h06m11 371 12622 2359
+ 1,131,916,976 5,937,895 0h07m05 425 13971 2600
+
+ 1 m1.xl 4es/12sh 16hdp 1800m buffer 2800m heap 200000 tlog 5000 batch no compr
+ 74,383,360 385,000 0h01m50 110 3500 660
+ 286,720,000 1,495,000 0h02m21 141 10602 1985
+ 461,701,120 2,410,000 0h03m30 210 11476 2147
+ 733,413,376 3,830,000 0h05m10 310 12354 2310
+ 1,131,916,976 5,937,895 0h07m16 436 13619 2535
+
+ 1 m1.xl 4es/12sh 64hdp 1800m buffer 2800m heap 200000 tlog 1000 batch no compr
+ 156,958,720 813,056 0h01m35 95 8558 1613
+ 305,135,616 1,586,176 0h02m25 145 10939 2055
+ 446,300,160 2,323,456 0h03m10 190 12228 2293
+ 690,028,544 3,594,240 0h04m40 280 12836 2406
+ 927,807,418 4,850,093 0h06m10 370 13108 2448
+ 1,131,916,976 5,937,895 0h06m55 415 14308 2663
+
+ 1 m1.xl 4es/12sh 16hdp 1800m buffer 2800m heap 200000 tlog 1024 batch no compr
+ 234,749,952 1,222,656 0h02m08 128 9552 1791
+ 713,097,216 3,723,264 0h04m56 296 12578 2352
+ 1,131,916,976 5,937,895 0h06m49 409 14518 2702
+
+ 1 m1.xl 4es/12sh 20hdp 1800m buffer 2800m heap 200000 tlog 1024 batch no compr mergefac 40
+ 190,971,904 994,304 0h01m55 115 8646 1621
+ 326,107,136 1,699,840 0h02m52 172 9882 1851
+ 707,152,365 3,709,734 0h04m51 291 12748 2373 672 files
+ again:
+ 187,170,816 973,824 0h01m49 109 8934 1676
+ 707,152,365 3,709,734 0h05m39 339 10943 2037 1440 files ; 18 *.tis typically 4.3M
+ again:
+ 707,152,365 3,709,734 0h04m54 294 12618 2348 2052 files ; 28 *.tis typically 4.3M
+
+ 1 m1.xl 4es/12sh 20hdp 1800m buffer 2800m heap 50_000 tlog 1024 batch no compr mergefac 20 (and in following)
+ 349,372,416 1,821,696 0h02m42 162 11245 2106
+ 707,152,365 3,709,734 0h04m43 283 13108 2440
+
+ 1 m1.xl 4es/4sh 20hdp 1800m buffer 2800m heap 200_000 tlog 1024 batch no compr 64m engine.ram_buffer_size -- 3s ping_interval -- oops 10s refresh
+ 253,689,856 1,321,984 0h02m48 168 7868 1474
+ 707,152,365 3,709,734 0h05m55 355 10449 1945
+
+ 1 m1.xl 4es/4sh 20hdp 1800m buffer 2800m heap 200_000 tlog 1024 batch no compr 256m engine.ram_buffer_size -- 3s ping_interval
+ 707,152,365 3,709,734 0h04m31 271 13689 2548
+
+ 1 m1.xl 4es/4sh 20hdp 1800m buffer 2800m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
+ 707,152,365 3,709,734 0h04m08 248 14958 2784
+
+ 1 m1.xl 4es/4sh 20hdp 1800m buffer 2800m heap 200_000 tlog 1024 batch no compr 768m engine.ram_buffer_size -- 3s ping_interval
+ 707,152,365 3,709,734 0h04m47 287 12925 2406
+ again
+ 707,152,365 3,709,734 0h04m27 267 13894 2586
+
+ 1 m1.xl 4es/4sh 20hdp 768m buffer 2800m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
+ 707,152,365 3,709,734 0h04m14 254 14605 2718
+
+ 1 c1.xl 4es/4sh 20hdp 768m buffer 1200m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
+ 707,152,365 3,709,734 0h02m55 175 21198 3946 ios 11282 ifstat 3696.26 695.26
+
+ 1 c1.xl 4es/4sh 40hdp 768m buffer 1200m heap 200_000 tlog 4096 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
+ 707,912,831 3,713,598 0h03m05 185 20073 3736
+
+ 1 c1.xl 4es/4sh 40hdp 768m buffer 1200m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
+ 707,912,831 3,713,598 0h02m59 179 20746 3862
+
+ 1 c1.xl 4es/4sh 20hdp 256m buffer 1200m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
+ 707,152,365 3,709,734 0h02m53 173 21443 3991
+
+ 1 c1.xl 4es/4sh 20hdp 512m buffer 1200m heap 200_000 tlog 1024 batch no compr 768m engine.ram_buffer_size -- 3s ping_interval
+ 707,152,365 3,709,734 0h03m00 180 20609 3836
+
+
+ 8 c1.xl 32es/32sh 14hdp/56 512m buffer 1200m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval
+ 1,115,291,648 5,814,272 0h01m44 104 6988 1309 8 55906 10472
+ 2,779,840,512 14,540,800 0h06m34 394 4613 861 8 36905 6890
+ 6,100,156,416 32,508,928 0h14m51 891 4560 835 8 36485 6685
+ (killed)
+
+ 8 c1.xl 24es/24sh 14hdp/56 256m buffer 1200m heap 200_000 tlog 1024 batch no compr 384m engine.ram_buffer_size -- 3s ping_interval
+ 980,221,952 5,107,662 0h01m28 88 7255 1359 8 58041 10877
+ 1,815,609,344 9,483,259 0h01m59 119 9961 1862 8 79691 14899
+ 4,451,270,656 23,694,336 0h04m06 246 12039 2208 8 96318 17670
+ 6,713,269,627 35,778,171 0h06m00 360 12422 2276 8 99383 18210
+
+ 8 c1.xl 24es/24sh 14hdp/140 512m buffer 1200m heap 200_000 tlog 1024 batch no compr 384m engine.ram_buffer_size -- 3s ping_interval
+ 4,743,036,929 24,825,856 0h04m39 279 11122 2075 8 88981 16601
+ 8,119,975,937 42,889,216 0h07m00 420 12764 2360 8 102117 18880
+ 17,273,994,924 91,991,529 0h15m14 914 12580 2307 8 100647 18456
+ 23,598,696,768 123,812,641 0h24m04 1444 10717 1994 8 85742 15959
+
+
+ 8 m1.xl 32es/32sh 14hdp/53 1800m buffer 2800m heap 200_000 tlog 1024 batch no compr 512m engine.ram_buffer_size -- 3s ping_interval -- merge_factor 30
+ 306,296,262 1,608,526 0h01m18 78 2577 479 8 20622 3834
+ 1,814,083,014 9,564,301 0h02m33 153 7813 1447 8 62511 11578
+ 2,837,886,406 15,030,140 0h04m49 289 6500 1198 8 52007 9589
+ 3,928,208,838 21,039,950 0h06m22 382 6884 1255 8 55078 10042
+ 6,322,378,160 33,875,546 0h11m28 688 6154 1121 8 49237 8974
+
+ 8 c1.xl 24es/24sh 14hdp/140 512m buffer 1200m heap 200_000 tlog 4096 batch no compr 256m engine.ram_buffer_size -- 3s ping_interval -- merge_factor 30
+ 4,717,346,816 24,855,996 0h04m55 295 10532 1952 8 84257 15616
+ 9,735,831,552 51,896,969 0h09m23 563 11522 2110 8 92179 16887
+
+
+ (200910)
+ 2,746,875,904 10,555,392 0h02m50 170 7761 1972 8 62090 15779
+ 43,201,339,007 166,049,864 0h35m06 2106 9855 2504 8 78846 20032
+
+
+ 2009{10,11,12}
+
+ 8 c1.xl 24es/24sh 14hdp/140 512m buffer 1200m heap 200_000 tlog 4096 batch no compr 256m engine.ram_buffer_size -- 3s ping_interval -- merge_factor 30
+ 135,555,262,283 516,220,825 2h16m13 8173 7895 2024 8 63161 16197
+
+
+ slug=tweet-2009q3pre ; curl -XGET 'http://10.99.10.113:9200/_flush/' ; curl -XPUT "http://10.99.10.113:9200/$slug/" ; rake -f ~/ics/backend/wonderdog/java/Rakefile ; ~/ics/backend/wonderdog/java/bin/wonderdog --rm --index_name=$slug --bulk_size=4096 --object_type=tweet /tmp/tweet_by_month-tumbled/"tweet-200[678]" /tmp/es_bulkload_log/$slug
+
+
+ sudo kill `ps aux | egrep '^61021' | cut -c 10-15`
+
+ for node in '' 2 3 ; do echo $node ; sudo node=$node ES_MAX_MEM=1600m ~/ics/backend/wonderdog/config/run_elasticsearch-2.sh ; done
+
+ for node in '' 2 3 4 ; do echo $node ; sudo node=$node ES_MAX_MEM=1200m ~/ics/backend/wonderdog/config/run_elasticsearch-2.sh ; done
+ sudo kill `ps aux | egrep '^61021' | cut -c 10-15` ; sleep 10 ; sudo rm -rf /mnt*/elasticsearch/* ; ps auxf | egrep '^61021' ; zero_log /var/log/elasticsearch/hoolock.log
+
+ ec2-184-73-41-228.compute-1.amazonaws.com
+
+ Query for success:
+ curl -XGET 'http://10.195.10.207:9200/tweet/tweet/_search?q=text:mrflip' | ruby -rubygems -e 'require "json" ; puts JSON.pretty_generate(JSON.load($stdin))'
+
+ Detect settings:
+ grep ' with ' /var/log/elasticsearch/hoolock.log | egrep 'DEBUG|INFO' | cut -d\] -f2,3,5- | sort | cutc | uniq -c
+
+ Example index sizes:
+ ls -lRhart /mnt*/elasticsearch/data/hoolock/nodes/*/indices/tweet/0/*/*.{tis,fdt}
+
+ # Formats a raw benchmark measurement ("bytes records elapsed machines") into
+ # a row like the tables above: bytes, records, elapsed, seconds, rec/s per
+ # machine, KB/s per machine, machines, rec/s total, KB/s total.
+ def dr(line)
+   sbytes, srecs, time, mach, *_ = line.strip.split(/\s+/)
+   bytes = sbytes.gsub(/\D/, "").to_i
+   recs  = srecs.gsub(/\D/, "").to_i
+   mach  = mach.to_i
+   mach  = 1 if mach == 0                 # single-machine lines omit the machine count
+   s, m, h = [0, 0, 0, time.split(/\D/)].flatten.reverse.map(&:to_i)
+   tm = (3600 * h) + (60 * m) + s         # elapsed time in seconds
+   results = "%14s\t%12s\t%01dh%02dm%02d\t%7d\t%7d\t%7d\t%7d\t%7d\t%7d" % [sbytes, srecs, h, m, s, tm, recs/tm/mach, bytes/tm/1024/mach, mach, recs/tm, bytes/tm/1024]
+   puts results
+   results
+ end
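+
+ A quick check of the dr helper above against the first row of the 8x c1.xl
+ table earlier in this file (the raw input format, "bytes records elapsed
+ machines", is inferred from the code, so treat it as an assumption):
+
+     dr("1,115,291,648 5,814,272 0h01m44 8")
+     # prints (tab-separated): 1,115,291,648  5,814,272  0h01m44  104  6988  1309  8  55906  10472
+     # i.e. bytes, records, elapsed, seconds, rec/s/machine, KB/s/machine, machines, rec/s total, KB/s total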
+
+ # . jack up batch size and see effect on rec/sec, find optimal
+ # . run multiple mappers with one data es_node with optimal batch size, refind if necessary
+ # . work data es_node heavily but don't drive it into the ground
+ # . tune lucene + jvm options for data es_node
+
+ 14 files, 3 hadoop nodes w/ 3 tasktrackers each 27 min
+ 14 files, 3 hadoop nodes w/ 5 tasktrackers each 22 min
+
+ 12 files @ 500k lines -- 3M rec -- 3 hdp/2 tt -- 2 esnodes -- 17m
+
+ 6 files @ 100k = 600k rec -- 3hdp/2tt -- 1 es machine/2 esnodes -- 3m30
+ 6 files @ 100k = 600k rec -- 3hdp/2tt -- 1 es machine/4 esnodes -- 3m20
+
+ 5 files, 3 nodes,
+
+ Did 2,400,000 recs 24 tasks 585,243,042 bytes -- 15:37 on 12 maps/3nodes
+
+ Did _optimize
+ real 18m29.548s user 0m0.000s sys 0m0.000s pct 0.00
+
+ java version "1.6.0_20"
+ Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
+ Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
+
+ ===========================================================================
+
+ The refresh API allows you to explicitly refresh one or more indices, making all
+ operations performed since the last refresh available for search. The (near)
+ real-time capabilities depend on the index engine used. For example, the robin
+ engine requires refresh to be called, but by default a refresh is scheduled
+ periodically.
+
+ curl -XPOST 'http://localhost:9200/twitter/_refresh'
+
+ The refresh API can be applied to more than one index with a single call, or even to _all the indices.
+
+ runs:
+ - es_machine: m1.xlarge
+   es_nodes: 1
+   es_max_mem: 1500m
+   bulk_size: 5
+   maps: 1
+   records: 100000
+   shards: 12
+   replicas: 1
+   merge_factor: 100
+   thread_count: 32
+   lucene_buffer_size: 256mb
+   runtime: 108s
+   throughput: 1000 rec/sec
+ - es_machine: m1.xlarge
+   es_nodes: 1
+   bulk_size: 5
+   maps: 1
+   records: 100000
+   shards: 12
+   replicas: 1
+   merge_factor: 1000
+   thread_count: 32
+   lucene_buffer_size: 256mb
+   runtime: 77s
+   throughput: 1300 rec/sec
+ - es_machine: m1.xlarge
+   es_nodes: 1
+   bulk_size: 5
+   maps: 1
+   records: 100000
+   shards: 12
+   replicas: 1
+   merge_factor: 10000
+   thread_count: 32
+   lucene_buffer_size: 512mb
+   runtime: 180s
+   throughput: 555 rec/sec
data/notes/README-read_tuning.textile
@@ -0,0 +1,74 @@
+ http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/8517d7ccdaa6a72b
+
+ We have 3 servers in each data center, with 28M docs consuming 170G
+ disk (soon to shrink with ES 0.14), handling about 6k req/min for
+ client queries and 195k document matches/minute for alerting purposes.
+ With our hardware, we're hardly taxing them and still averaging
+ 30-35ms response times.
+
+ :index_buffer_size => "512m",
+ :heap_size => '11000',
+ :fd_ping_interval => '2s',
+ :fd_ping_timeout => '60s',
+ :fd_ping_retries => '6',
+ :seeds => '10.116.83.97:9300,10.196.190.111:9300,10.112.45.60:9300,10.118.254.64:9300',
+ :recovery_after_time => '10m',
+ :recovery_after_nodes => 4,
+ :expected_nodes => 4,
+ :refresh_interval => 900,
+
+ with 80 primary / 160 active shards in 5 indexes, each shard sized as approx:
+
+ 14395 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-2009q3pre/10/index
+ 26615 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-2009q4/0/index
+ 9294 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201004/12/index
+ 12204 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201005/12/index
+
+ after recovering cluster most nodes were at 7.5 - 9.6 GB
+
+ http true:
+
+ 14409 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-2009q3pre/11/index
+ 26573 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-2009q4/11/index
+ 23885 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-2010q1/4/index
+ 9271 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201004/0/index
+ 12218 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201005/4/index
+
+ 13723 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201006/9/index
+ 15578 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201007/6/index
+ 1471 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201008/11/index
+ 915 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201009/1/index
+ 1908 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/13/index
+ 2026 /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201011/7/index
+
+ "tweet-201010" : "num_docs" : 40272985,
+ "tweet-201011" : "num_docs" : 39012255,
+ "tweet-2009q4" : "num_docs" : 577762139,
+ "tweet-201006" : "num_docs" : 288445236,
+ "tweet-201008" : "num_docs" : 30904989,
+ "tweet-201005" : "num_docs" : 242058418,
+ "tweet-201007" : "num_docs" : 311059766,
+ "tweet-2009q3pre" : "num_docs" : 359075858,
+ "tweet-201004" : "num_docs" : 190501768,
+ "tweet-201009" : "num_docs" : 19166922,
+ "tweet-2010q1" : "num_docs" : 368031331,
+
+ 14409 tweet-2009q3pre
+ 26590 tweet-2009q4
+ 23923 tweet-2010q1
+ 9278 tweet-201004
+ 12216 tweet-201005
+ 13735 tweet-201006
+ 15580 tweet-201007
+ 1472 tweet-201008
+ 916 tweet-201009
+ 1910 tweet-201010
+ 2023 tweet-201011
+
data/notes/cluster_notes.md
@@ -0,0 +1,17 @@
+ ### How to choose shards, replicas and cluster size: Rules of Thumb.
+
+ sh = shards
+ rf = replication factor. replicas = 0 implies rf = 1, i.e. a single copy of each shard.
+
+ pm = running data_esnode processes per machine
+ N = number of machines
+
+ n_cores = number of cpu cores per machine
+ n_disks = number of disks per machine
+
+ * You must have more data_esnodes than total shard copies.
+   Mandatory: (sh * rf) < (pm * N)
+
+ Shards: shard size < 10GB
+
+ More shards = more parallel writes
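
To make the mandatory rule concrete, here is a tiny Ruby sketch of the check above (the method name and the example numbers are made up; sh, rf, pm and N follow the definitions in the note, with N written as n here):

    # Rule-of-thumb check: is (sh * rf) < (pm * N)?
    def enough_esnodes?(sh, rf, pm, n)
      sh * rf < pm * n
    end

    enough_esnodes?(12, 2, 4, 8)   # => true  (24 shard copies on 32 data_esnodes)
    enough_esnodes?(80, 2, 4, 8)   # => false (160 shard copies on 32 data_esnodes)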
data/notes/notes.txt
@@ -0,0 +1,91 @@
+ At Infochimps we recently indexed over 2.5 billion documents for a total of 4TB of indexed data. This would not have been possible without ElasticSearch and the Hadoop bulk loader we wrote, <a href="http://github.com/infochimps/wonderdog">wonderdog</a>. I'll go into the technical details in a later post, but for now here's how you can get started with ElasticSearch and Hadoop:
+
+ <h2>Getting Started with ElasticSearch</h2>
+
+ The first thing to do is actually install elasticsearch:
+
+ <pre class="brush: bash">
+ $: wget http://github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.14.2.zip
+ $: unzip elasticsearch-0.14.2.zip
+ $: sudo mv elasticsearch-0.14.2 /usr/local/share/
+ $: sudo ln -s /usr/local/share/elasticsearch-0.14.2 /usr/local/share/elasticsearch
+ </pre>
+
+ Next you'll want to make sure there is an 'elasticsearch' user and that there are suitable data, work, and log directories that 'elasticsearch' owns:
+
+ <pre class="brush: bash">
+ $: sudo useradd elasticsearch
+ $: sudo mkdir -p /var/log/elasticsearch /var/run/elasticsearch/{data,work}
+ $: sudo chown -R elasticsearch /var/{log,run}/elasticsearch
+ </pre>
+
+ Then get wonderdog (you'll have to git clone it for now) and go ahead and copy the example configuration in wonderdog/config:
+
+ <pre class="brush: bash">
+ $: sudo mkdir -p /etc/elasticsearch
+ $: sudo cp config/elasticsearch-example.yml /etc/elasticsearch/elasticsearch.yml
+ $: sudo cp config/logging.yml /etc/elasticsearch/
+ $: sudo cp config/elasticsearch.in.sh /etc/elasticsearch/
+ </pre>
+
+ Make changes to 'elasticsearch.yml' so that it points to the correct data, work, and log directories. Also, change 'recovery_after_nodes' and 'expected_nodes' in elasticsearch.yml to however many nodes (machines) you actually expect to have in your cluster. You'll probably also want to do a quick once-over of elasticsearch.in.sh and make sure the jvm settings, etc. are sane for your particular setup. Finally, to start it up, run:
+
+ <pre class="brush: bash">
+ sudo -u elasticsearch /usr/local/share/elasticsearch/bin/elasticsearch -Des.config=/etc/elasticsearch/elasticsearch.yml
+ </pre>
+
+ You should now have a happily running (reasonably configured) elasticsearch data node.
+
+ <h2>Index Some Data</h2>
+
+ Prerequisites:
+
+ <ul>
+ <li>You have a working hadoop cluster</li>
+ <li>Elasticsearch data nodes are installed and running on all your machines and they have discovered each other. See the elasticsearch documentation for details on making that actually work.</li>
+ <li>You've installed the following rubygems: 'configliere' and 'json'</li>
+ </ul>
+
+ <h3>Get Data</h3>
+
+ As an example, let's index this UFO sightings data set from Infochimps <a href="http://infochimps.com/datasets/d60000-documented-ufo-sightings-with-text-descriptions-and-metad">here</a>. (You should be familiar with this one by now...) It's mostly raw text and so it's a very reasonable thing to index. Once it's downloaded, go ahead and throw it on the HDFS:
+ <pre class="brush: bash">
+ $: hadoop fs -mkdir /data/domestic/ufo
+ $: hadoop fs -put chimps_16154-2010-10-20_14-33-35/ufo_awesome.tsv /data/domestic/ufo/
+ </pre>
+
+ <h3>Index Data</h3>
+
+ This is the easy part:
+
+ <pre class="brush: bash">
+ $: bin/wonderdog --rm --field_names=sighted_at,reported_at,location,shape,duration,description --id_field=-1 --index_name=ufo_sightings --object_type=ufo_sighting --es_config=/etc/elasticsearch/elasticsearch.yml /data/domestic/ufo/ufo_awesome.tsv /tmp/elasticsearch/aliens/out
+ </pre>
+
+ Flags:
+
+ '--rm' - Remove output on the hdfs if it exists
+ '--field_names' - A comma-separated list of the field names in the tsv, in order
+ '--id_field' - The field to use as the record id; -1 if the record has no inherent id
+ '--index_name' - The index name to bulk load into
+ '--object_type' - The type of objects we're indexing
+ '--es_config' - Points to the elasticsearch config*
+
+ *The elasticsearch config passed via --es_config must be present on all the hadoop machines and must have a 'hosts' entry listing the IPs of all the elasticsearch data nodes (see wonderdog/config/elasticsearch-example.yml). This means the hadoop job can run on a different cluster than the one the elasticsearch data nodes are running on.
+
+ The other two arguments are the input and output paths. The output path in this case only gets written to if one or more index requests fail. This way you can re-run the job on only those records that didn't make it the first time.
+
+ The indexing should go pretty quickly.
+ Next, refresh the index so we can actually query our newly indexed data. There's a tool in wonderdog's bin directory for that:
+ <pre class="brush: bash">
+ $: bin/estool --host=`hostname -i` refresh_index
+ </pre>
+
+ <h3>Query Data</h3>
+
+ Once again, use estool:
+ <pre class="brush: bash">
+ $: bin/estool --host=`hostname -i` --index_name=ufo_sightings --query_string="ufo" query
+ </pre>
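+
+ If you'd rather hit Elasticsearch from Ruby than shell out to estool, here's a minimal sketch using the standard library plus the 'json' gem listed in the prerequisites (the host convention, port 9200, and the _search?q= query string follow the examples above; treat the exact response fields as an assumption for your ES version):
+
+ <pre class="brush: ruby">
+ require 'net/http'
+ require 'uri'
+ require 'json'
+
+ # Search the freshly built index for "ufo" and report the hit count.
+ host = `hostname -i`.strip
+ uri  = URI("http://#{host}:9200/ufo_sightings/_search?q=ufo&size=1")
+ resp = JSON.parse(Net::HTTP.get(uri))
+ puts "total hits: #{resp['hits']['total']}"
+ puts JSON.pretty_generate(resp['hits']['hits'].first) unless resp['hits']['hits'].empty?
+ </pre>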
+
+ Hurray.