wonderdog 0.1.0 → 0.1.1

data/README.md CHANGED
@@ -1,7 +1,6 @@
  # Wonderdog

- Wonderdog is a Hadoop interface to Elastic Search. While it is specifically intended for use with Apache Pig, it does include all the necessary Hadoop input and output formats for Elastic Search. That is, it's possible to skip Pig en
- tirely and write custom Hadoop jobs if you prefer.
+ Wonderdog is a Hadoop interface to Elastic Search. While it is specifically intended for use with Apache Pig, it does include all the necessary Hadoop input and output formats for Elastic Search. That is, it's possible to skip Pig entirely and write custom Hadoop jobs if you prefer.

  ## Requirements

@@ -18,7 +17,7 @@ This allows you to store tabular data (eg. tsv, csv) into elasticsearch.
  ```pig
  %default ES_JAR_DIR '/usr/local/share/elasticsearch/lib'
  %default INDEX 'ufo_sightings'
- %default OBJ 'sighting'
+ %default OBJ 'sighting'

  register target/wonderdog*.jar;
  register $ES_JAR_DIR/*.jar;
@@ -101,7 +100,7 @@ bin/estool refresh --index users
  You'll definitely want to do this after the bulk load finishes so you don't lose any data in case of cluster failure:

  ```
- bin/estool snapshot --index users
+ bin/estool snapshot --index users
  ```

  * Bump the replicas for the index up to at least one.
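 
  A sketch of one way to do that bump, using the raw index settings API rather than a command taken from this README (the `users` index name follows the example above):
 
  ```
  curl -XPUT 'http://localhost:9200/users/_settings' -d '{ "index": { "number_of_replicas": 1 } }'
  ```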
@@ -165,7 +164,7 @@ bin/estool optimize -c <elasticsearch_host> --index <index_name>
  * Snapshot an index

  ```
- bin/estool snapshot -c <elasticsearch_host> --index <index_name>
+ bin/estool snapshot -c <elasticsearch_host> --index <index_name>
  ```

  * Delete an index
@@ -173,3 +172,118 @@ bin/estool snapshot -c <elasticsearch_host> --index <index_name>
  ```
  bin/estool delete -c <elasticsearch_host> --index <index_name>
  ```
+
+
+ ## Bulk Loading Tips for the Risk-seeking Dangermouse
+
+ The file examples/bulkload_wp_pageviews.pig shows an example of bulk loading elasticsearch, including preparing the index.
+
+ ### Elasticsearch Setup
+
+ Some tips for an industrial-strength cluster, assuming exclusive use of machines and no read load during the job:
+
+ * Use multiple machines with a fair bit of RAM (7+GB). Heap doesn't help too much for loading though, so you don't have to go nuts: we do fine with amazon m1.large's.
+ * Allocate a sizeable heap, setting min and max equal:
+ - Turn `bootstrap.mlockall` on, and run `ulimit -l unlimited`.
+ - For example, for a 3GB heap: `-Xmx3000m -Xms3000m -Delasticsearch.bootstrap.mlockall=true`
+ - Never use a heap above 12GB or so; it's dangerous (STW compaction timeouts).
+ - You've succeeded if the full heap size is resident on startup: that is, in htop both the VMEM and RSS are 3000 MB or so.
+ * Temporarily increase the `index_buffer_size`, to say 40% (see the sketch just below this list).
+
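+ A minimal sketch of the corresponding `elasticsearch.yml` lines (setting names assumed from the 0.20-era documentation; verify them against your Elasticsearch version):
+
+     # lock the full heap into RAM and give the indexing buffer more room during the bulk load
+     bootstrap.mlockall: true
+     indices.memory.index_buffer_size: 40%
+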
+ ### Further reading
+
+ * [Elasticsearch JVM Settings, explained](http://jprante.github.com/2012/11/28/Elasticsearch-Java-Virtual-Machine-settings-explained.html)
+
+ ### Example of creating an index and mapping
+
+ Index:
+
+ curl -XPUT 'http://localhost:9200/pageviews' -d '{"settings": {
+ "index": { "number_of_shards": 12, "store.compress": { "stored": true, "tv": true } } }}'
+
+ $ curl -XPUT 'http://localhost:9200/ufo_sightings/_settings?pretty=true' -d '{"settings": {
+ "index": { "number_of_shards": 12, "store.compress": { "stored": true, "tv": true } } }}'
+
+ Mapping (elasticsearch "type"):
+
+ # Wikipedia Pageviews
+ curl -XPUT 'http://localhost:9200/pageviews/pagehour/_mapping' -d '{
+ "pagehour": { "_source": { "enabled" : true }, "properties" : {
+ "page_id" : { "type": "long", "store": "yes" },
+ "namespace": { "type": "integer", "store": "yes" },
+ "title": { "type": "string", "store": "yes" },
+ "num_visitors": { "type": "long", "store": "yes" },
+ "date": { "type": "integer", "store": "yes" },
+ "time": { "type": "long", "store": "yes" },
+ "ts": { "type": "date", "store": "yes" },
+ "day_of_week": { "type": "integer", "store": "yes" } } }}'
+
+ $ curl -XPUT 'http://localhost:9200/ufo_sightings/sighting/_mapping' -d '{ "sighting": {
+ "_source": { "enabled" : true },
+ "properties" : {
+ "sighted_at": { "type": "date", "store": "yes" },
+ "reported_at": { "type": "date", "store": "yes" },
+ "shape": { "type": "string", "store": "yes" },
+ "duration": { "type": "string", "store": "yes" },
+ "description": { "type": "string", "store": "yes" },
+ "coordinates": { "type": "geo_point", "store": "yes" },
+ "location_str": { "type": "string", "store": "no" },
+ "location": { "type": "object", "dynamic": false, "properties": {
+ "place_id": { "type": "string", "store": "yes" },
+ "place_type": { "type": "string", "store": "yes" },
+ "city": { "type": "string", "store": "yes" },
+ "county": { "type": "string", "store": "yes" },
+ "state": { "type": "string", "store": "yes" },
+ "country": { "type": "string", "store": "yes" } } }
+ } } }'
+
+
241
+ ### Temporary Bulk-load settings for an index
242
+
243
+ To prepare a database for bulk loading, the following settings may help. They are
244
+ *EXTREMELY* aggressive, and include knocking the replication factor back to 1 (zero replicas). One
245
+ false step and you've destroyed Tokyo.
246
+
247
+ Actually, you know what? Never mind. Don't apply these, they're too crazy.
248
+
249
+ curl -XPUT 'http://localhost:9200/pageviews/_settings?pretty=true' -d '{"index": {
250
+ "number_of_replicas": 0, "refresh_interval": -1, "gateway.snapshot_interval": -1,
251
+ "translog": { "flush_threshold_ops": 50000, "flush_threshold_size": "200mb", "flush_threshold_period": "300s" },
252
+ "merge.policy": { "max_merge_at_once": 30, "segments_per_tier": 30, "floor_segment": "10mb" },
253
+ "store.compress": { "stored": true, "tv": true } } }'
254
+
255
+ To restore your settings, in case you didn't destroy Tokyo:
256
+
257
+ curl -XPUT 'http://localhost:9200/pageviews/_settings?pretty=true' -d ' {"index": {
258
+ "number_of_replicas": 2, "refresh_interval": "60s", "gateway.snapshot_interval": "3600s",
259
+ "translog": { "flush_threshold_ops": 5000, "flush_threshold_size": "200mb", "flush_threshold_period": "300s" },
260
+ "merge.policy": { "max_merge_at_once": 10, "segments_per_tier": 10, "floor_segment": "10mb" },
261
+ "store.compress": { "stored": true, "tv": true } } }'
262
+
263
+ If you did destroy your database, please send your resume to jobs@infochimps.com as you begin your
264
+ job hunt. It's the reformed sinner that makes the best missionary.
265
+
266
+
267
+ ### Post-bulkrun maintenance
268
+
269
+ es_index=pageviews ; ( for foo in _flush _refresh '_optimize?max_num_segments=6&refresh=true&flush=true&wait_for_merge=true' '_gateway/snapshot' ; do echo "======= $foo" ; time curl -XPOST "http://localhost:9200/$es_index/$foo" ; done ) &
270
+
271
+ ### Full dump of cluster health
272
+
273
+ es_index=pageviews ; es_node="projectes-elasticsearch-4"
274
+ curl -XGET "http://localhost:9200/$es_index/_status?pretty=true"
275
+ curl -XGET "http://localhost:9200/_cluster/state?pretty=true"
276
+ curl -XGET "http://localhost:9200/$es_index/_stats?pretty=true&merge=true&refresh=true&flush=true&warmer=true"
277
+ curl -XGET "http://localhost:9200/_cluster/nodes/$es_node/stats?pretty=true&all=true"
278
+ curl -XGET "http://localhost:9200/_cluster/nodes/$es_node?pretty=true&all=true"
279
+ curl -XGET "http://localhost:9200/_cluster/health?pretty=true"
280
+ curl -XGET "http://localhost:9200/$es_index/_search?pretty=true&limit=3"
281
+ curl -XGET "http://localhost:9200/$es_index/_segments?pretty=true" | head -n 200
282
+
283
+ ### Decommission nodes
284
+
285
+ Run this, excluding the decommissionable nodes from the list:
286
+
287
+ curl -XPUT http://localhost:9200/pageviews/_settings -d '{
288
+ "index.routing.allocation.include.ironfan_name" :
289
+ "projectes-elasticsearch-0,projectes-elasticsearch-1,projectes-elasticsearch-2" }'
@@ -0,0 +1,70 @@
+ SET mapred.map.tasks.speculative.execution false;
+
+ -- path to wikipedia pageviews data
+ %default PAGEVIEWS 's3n://bigdata.chimpy.us/data/results/wikipedia/full/pageviews/2008/03'
+ -- the target elasticsearch index and mapping ("type"). Will be created, though you
+ -- should do it yourself first instead as shown below.
+ %default INDEX 'pageviews'
+ %default OBJ 'pagehour'
+ -- path to elasticsearch jars
+ %default ES_JAR_DIR '/usr/local/share/elasticsearch/lib'
+ -- Batch size for loading
+ %default BATCHSIZE '10000'
+
+ -- Example of bulk loading. This will easily load more than a billion documents
+ -- into a large cluster. We recommend using Ironfan to set your junk up.
+ --
+ -- Preparation:
+ --
+ -- Create the index:
+ --
+ -- curl -XPUT 'http://projectes-elasticsearch-0.test.chimpy.us:9200/pageviews' -d '{"settings": { "index": {
+ -- "number_of_shards": 12, "number_of_replicas": 0, "store.compress": { "stored": true, "tv": true } } }}'
+ --
+ -- Define the elasticsearch mapping (type):
+ --
+ -- curl -XPUT 'http://projectes-elasticsearch-0.test.chimpy.us:9200/pageviews/pagehour/_mapping' -d '{
+ -- "pagehour": {
+ -- "_source": { "enabled" : true },
+ -- "properties" : {
+ -- "page_id" : { "type": "long", "store": "yes" },
+ -- "namespace": { "type": "integer", "store": "yes" },
+ -- "title": { "type": "string", "store": "yes" },
+ -- "num_visitors": { "type": "long", "store": "yes" },
+ -- "date": { "type": "integer", "store": "yes" },
+ -- "time": { "type": "long", "store": "yes" },
+ -- "ts": { "type": "date", "store": "yes" },
+ -- "day_of_week": { "type": "integer", "store": "yes" } } }}'
+ --
+ -- For best results, see the 'Tips for Bulk Loading' in the README.
+ --
+
+ -- Always disable speculative execution when loading into a database
+ set mapred.map.tasks.speculative.execution false
+ -- Don't re-use JVM: logging gets angry
+ set mapred.job.reuse.jvm.num.tasks 1
+ -- Use large file sizes; setup/teardown time for leaving the cluster is worse
+ -- than non-local map tasks
+ set mapred.min.split.size 3000MB
+ set pig.maxCombinedSplitSize 2000MB
+ set pig.splitCombination true
+
+ register ./target/wonderdog*.jar;
+ register $ES_JAR_DIR/*.jar;
+
+ pageviews = LOAD '$PAGEVIEWS' AS (
+ page_id:long, namespace:int, title:chararray,
+ num_visitors:long, date:int, time:long,
+ epoch_time:long, day_of_week:int);
+ pageviews_fixed = FOREACH pageviews GENERATE
+ page_id, namespace, title,
+ num_visitors, date, time,
+ epoch_time * 1000L AS ts, day_of_week;
+
+ STORE pageviews_fixed INTO 'es://$INDEX/$OBJ?json=false&size=$BATCHSIZE' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage();
+
+ -- -- To instead dump the JSON data to disk (needs Pig 0.10+)
+ -- set dfs.replication 2
+ -- %default OUTDUMP '$PAGEVIEWS.json'
+ -- rmf $OUTDUMP
+ -- STORE pageviews_fixed INTO '$OUTDUMP' USING JsonStorage();
@@ -1,4 +1,3 @@
  module Wonderdog
- # The currently running Wonderdog version
- VERSION = '0.1.0'
+ VERSION = '0.1.1'
  end
@@ -40,13 +40,13 @@ import org.elasticsearch.ExceptionsHelper;
  import com.infochimps.elasticsearch.hadoop.util.HadoopUtils;

  /**
-
+
  Hadoop OutputFormat for writing arbitrary MapWritables (essentially HashMaps) into Elasticsearch. Records are batched up and sent
  in a one-hop manner to the elastic search data nodes that will index them.
-
+
  */
  public class ElasticSearchOutputFormat extends OutputFormat<NullWritable, MapWritable> implements Configurable {
-
+
  static Log LOG = LogFactory.getLog(ElasticSearchOutputFormat.class);
  private Configuration conf = null;

@@ -60,12 +60,13 @@ public class ElasticSearchOutputFormat extends OutputFormat<NullWritable, MapWri
  private String idFieldName;
  private String objType;
  private String[] fieldNames;
-
+
  // Used for bookkeeping purposes
  private AtomicLong totalBulkTime = new AtomicLong();
  private AtomicLong totalBulkItems = new AtomicLong();
- private Random randgen = new Random();
+ private Random randgen = new Random();
  private long runStartTime = System.currentTimeMillis();
+ private long lastLogTime = 0;

  // For hadoop configuration
  private static final String ES_CONFIG_NAME = "elasticsearch.yml";
@@ -82,7 +83,7 @@ public class ElasticSearchOutputFormat extends OutputFormat<NullWritable, MapWri
  private static final String COMMA = ",";
  private static final String SLASH = "/";
  private static final String NO_ID_FIELD = "-1";
-
+
  private volatile BulkRequestBuilder currentRequest;

  /**
@@ -104,7 +105,7 @@ public class ElasticSearchOutputFormat extends OutputFormat<NullWritable, MapWri
  <li><b>elasticsearch.id.field.name</b> - When <b>elasticsearch.is_json</b> is true, this is the name of a field in the json document that contains the document's id. If -1 is used then the document is assumed to have no id and one is assigned to it by elasticsearch.</li>
  <li><b>elasticsearch.field.names</b> - When <b>elasticsearch.is_json</b> is false, this is a comma separated list of field names.</li>
  <li><b>elasticsearch.id.field</b> - When <b>elasticsearch.is_json</b> is false, this is the numeric index of the field to use as the document id. If -1 is used the document is assumed to have no id and one is assigned to it by elasticsearch.</li>
- </ul>
+ </ul>
  */
  public ElasticSearchRecordWriter(TaskAttemptContext context) {
  Configuration conf = context.getConfiguration();
@@ -118,7 +119,7 @@ public class ElasticSearchOutputFormat extends OutputFormat<NullWritable, MapWri
  LOG.info("Using field:["+idFieldName+"] for document ids");
  }
  this.objType = conf.get(ES_OBJECT_TYPE);
-
+
  //
  // Fetches elasticsearch.yml and the plugins directory from the distributed cache, or
  // from the local config.
@@ -134,7 +135,7 @@ public class ElasticSearchOutputFormat extends OutputFormat<NullWritable, MapWri
  System.setProperty(ES_CONFIG,conf.get(ES_CONFIG));
  System.setProperty(ES_PLUGINS,conf.get(ES_PLUGINS));
  }
-
+
  start_embedded_client();
  initialize_index(indexName);
  currentRequest = client.prepareBulk();
@@ -144,7 +145,7 @@ public class ElasticSearchOutputFormat extends OutputFormat<NullWritable, MapWri
  Closes the connection to elasticsearch. Any documents remaining in the bulkRequest object are indexed.
  */
  public void close(TaskAttemptContext context) throws IOException {
- if (currentRequest.numberOfActions() > 0) {
+ if (currentRequest.numberOfActions() > 0) {
  try {
  BulkResponse response = currentRequest.execute().actionGet();
  } catch (Exception e) {
@@ -175,7 +176,7 @@ public class ElasticSearchOutputFormat extends OutputFormat<NullWritable, MapWri
  try {
  Text mapKey = new Text(idFieldName);
  String record_id = fields.get(mapKey).toString();
- currentRequest.add(Requests.indexRequest(indexName).id(record_id).type(objType).create(false).source(builder));
+ currentRequest.add(Requests.indexRequest(indexName).id(record_id).type(objType).create(false).source(builder));
  } catch (Exception e) {
  LOG.warn("Encountered malformed record");
  }
@@ -198,14 +199,14 @@ public class ElasticSearchOutputFormat extends OutputFormat<NullWritable, MapWri
  } else if (value instanceof FloatWritable) {
  builder.value(((FloatWritable)value).get());
  } else if (value instanceof BooleanWritable) {
- builder.value(((BooleanWritable)value).get());
+ builder.value(((BooleanWritable)value).get());
  } else if (value instanceof MapWritable) {
  builder.startObject();
  for (Map.Entry<Writable,Writable> entry : ((MapWritable)value).entrySet()) {
  if (!(entry.getValue() instanceof NullWritable)) {
  builder.field(entry.getKey().toString());
  buildContent(builder, entry.getValue());
- }
+ }
  }
  builder.endObject();
  } else if (value instanceof ArrayWritable) {
@@ -215,7 +216,7 @@ public class ElasticSearchOutputFormat extends OutputFormat<NullWritable, MapWri
  buildContent(builder, arrayOfThings[i]);
  }
  builder.endArray();
- }
+ }
  }

  /**
@@ -224,12 +225,21 @@ public class ElasticSearchOutputFormat extends OutputFormat<NullWritable, MapWri
  private void processBulkIfNeeded() {
  totalBulkItems.incrementAndGet();
  if (currentRequest.numberOfActions() >= bulkSize) {
- try {
+ boolean loggable = (System.currentTimeMillis() - lastLogTime >= 10000);
+
+ try {
  long startTime = System.currentTimeMillis();
+ if (loggable){ LOG.info("Sending [" + (currentRequest.numberOfActions()) + "]items"); }
  BulkResponse response = currentRequest.execute().actionGet();
  totalBulkTime.addAndGet(System.currentTimeMillis() - startTime);
- if (randgen.nextDouble() < 0.1) {
- LOG.info("Indexed [" + totalBulkItems.get() + "] in [" + (totalBulkTime.get()/1000) + "s] of indexing"+"[" + ((System.currentTimeMillis() - runStartTime)/1000) + "s] of wall clock"+" for ["+ (float)(1000.0*totalBulkItems.get())/(System.currentTimeMillis() - runStartTime) + "rec/s]");
+ if (loggable) {
+ LOG.info("Indexed [" + (currentRequest.numberOfActions()) + "]items " +
+ "in [" + ((System.currentTimeMillis() - startTime)/1000) + "]s; " +
+ "avg [" + (float)(1000.0*totalBulkItems.get())/(System.currentTimeMillis() - runStartTime) + "]rec/s" +
+ "(total [" + totalBulkItems.get() + "]items " +
+ "indexed in [" + (totalBulkTime.get()/1000) + "]s, " +
+ "wall clock [" + ((System.currentTimeMillis() - runStartTime)/1000) + "]s)");
+ lastLogTime = System.currentTimeMillis();
  }
  } catch (Exception e) {
  LOG.warn("Bulk request failed: " + e.getMessage());
@@ -28,5 +28,5 @@ EOF
  gem.test_files = gem.files.grep(/^spec/)
  gem.require_paths = ['lib']

- gem.add_dependency('wukong-hadoop', '0.1.0')
+ gem.add_dependency('wukong-hadoop', '0.1.1')
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: wonderdog
  version: !ruby/object:Gem::Version
- version: 0.1.0
+ version: 0.1.1
  prerelease:
  platform: ruby
  authors:
@@ -13,7 +13,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-02-20 00:00:00.000000000 Z
+ date: 2013-03-07 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: wukong-hadoop
@@ -22,7 +22,7 @@ dependencies:
  requirements:
  - - '='
  - !ruby/object:Gem::Version
- version: 0.1.0
+ version: 0.1.1
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
@@ -30,7 +30,7 @@ dependencies:
  requirements:
  - - '='
  - !ruby/object:Gem::Version
- version: 0.1.0
+ version: 0.1.1
  description: ! " Wonderdog provides code in both Ruby and Java to make Elasticsearch\n
  \ a more fully-fledged member of both the Hadoop and Wukong\n ecosystems.\n\n For
  the Java side, Wonderdog provides InputFormat and OutputFormat\n classes for use
@@ -59,6 +59,7 @@ files:
  - config/more_settings.yml
  - config/run_elasticsearch-2.sh
  - config/ufo_config.json
+ - examples/bulkload_wp_pageviews.pig
  - examples/no_wonderdog.rb
  - examples/wonderdog.rb
  - lib/wonderdog.rb
@@ -113,15 +114,21 @@ required_ruby_version: !ruby/object:Gem::Requirement
  - - ! '>='
  - !ruby/object:Gem::Version
  version: '0'
+ segments:
+ - 0
+ hash: -2901634710812664464
  required_rubygems_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
  - !ruby/object:Gem::Version
  version: '0'
+ segments:
+ - 0
+ hash: -2901634710812664464
  requirements: []
  rubyforge_project:
- rubygems_version: 1.8.23
+ rubygems_version: 1.8.24
  signing_key:
  specification_version: 3
  summary: Make Hadoop and ElasticSearch play together nicely.