wukong 1.5.4 → 2.0.0
Sign up to get free protection for your applications and to get access to all the features.
- data/CHANGELOG.textile +32 -0
- data/README.textile +58 -12
- data/TODO.textile +0 -8
- data/bin/hdp-bzip +12 -17
- data/bin/hdp-kill-task +1 -1
- data/bin/hdp-sort +7 -7
- data/bin/hdp-stream +7 -7
- data/bin/hdp-stream-flat +2 -3
- data/bin/setcat +11 -0
- data/bin/uniq-ord +59 -0
- data/examples/corpus/bucket_counter.rb +47 -0
- data/examples/corpus/dbpedia_abstract_to_sentences.rb +85 -0
- data/examples/corpus/sentence_coocurrence.rb +70 -0
- data/examples/emr/README.textile +110 -0
- data/examples/emr/dot_wukong_dir/emr_bootstrap.sh +1 -0
- data/examples/emr/elastic_mapreduce_example.rb +2 -2
- data/examples/ignore_me/counting.rb +56 -0
- data/examples/ignore_me/grouper.rb +71 -0
- data/examples/network_graph/adjacency_list.rb +2 -2
- data/examples/network_graph/breadth_first_search.rb +14 -21
- data/examples/network_graph/gen_multi_edge.rb +22 -13
- data/examples/pagerank/pagerank.rb +1 -1
- data/examples/pagerank/pagerank_initialize.rb +6 -10
- data/examples/sample_records.rb +6 -16
- data/examples/server_logs/apache_log_parser.rb +7 -22
- data/examples/server_logs/breadcrumbs.rb +39 -0
- data/examples/server_logs/logline.rb +27 -0
- data/examples/size.rb +3 -2
- data/examples/{binning_percentile_estimator.rb → stats/binning_percentile_estimator.rb} +9 -11
- data/examples/{rank_and_bin.rb → stats/rank_and_bin.rb} +2 -2
- data/examples/stupidly_simple_filter.rb +11 -14
- data/examples/word_count.rb +16 -36
- data/lib/wukong/and_pig.rb +2 -15
- data/lib/wukong/logger.rb +7 -28
- data/lib/wukong/periodic_monitor.rb +24 -9
- data/lib/wukong/script/emr_command.rb +1 -0
- data/lib/wukong/script/hadoop_command.rb +31 -29
- data/lib/wukong/script.rb +19 -14
- data/lib/wukong/store/cassandra_model.rb +2 -1
- data/lib/wukong/streamer/accumulating_reducer.rb +5 -9
- data/lib/wukong/streamer/base.rb +44 -3
- data/lib/wukong/streamer/counting_reducer.rb +12 -12
- data/lib/wukong/streamer/filter.rb +2 -2
- data/lib/wukong/streamer/list_reducer.rb +3 -3
- data/lib/wukong/streamer/reducer.rb +11 -0
- data/lib/wukong/streamer.rb +7 -3
- data/lib/wukong.rb +7 -3
- data/{examples → old}/cassandra_streaming/berlitz_for_cassandra.textile +0 -0
- data/{examples → old}/cassandra_streaming/client_interface_notes.textile +0 -0
- data/{examples → old}/cassandra_streaming/client_schema.textile +0 -0
- data/{examples → old}/cassandra_streaming/tuning.textile +0 -0
- data/wukong.gemspec +257 -285
- metadata +45 -62
- data/examples/cassandra_streaming/avromapper.rb +0 -85
- data/examples/cassandra_streaming/cassandra.avpr +0 -468
- data/examples/cassandra_streaming/cassandra_random_partitioner.rb +0 -62
- data/examples/cassandra_streaming/catter.sh +0 -45
- data/examples/cassandra_streaming/client_schema.avpr +0 -211
- data/examples/cassandra_streaming/foofile.avr +0 -0
- data/examples/cassandra_streaming/pymap.sh +0 -1
- data/examples/cassandra_streaming/pyreduce.sh +0 -1
- data/examples/cassandra_streaming/smutation.avpr +0 -188
- data/examples/cassandra_streaming/streamer.sh +0 -51
- data/examples/cassandra_streaming/struct_loader.rb +0 -24
- data/examples/count_keys.rb +0 -56
- data/examples/count_keys_at_mapper.rb +0 -57
- data/examples/emr/README-elastic_map_reduce.textile +0 -26
- data/examples/keystore/cassandra_batch_test.rb +0 -41
- data/examples/keystore/conditional_outputter_example.rb +0 -70
- data/examples/store/chunked_store_example.rb +0 -18
- data/lib/wukong/dfs.rb +0 -81
- data/lib/wukong/keystore/cassandra_conditional_outputter.rb +0 -122
- data/lib/wukong/keystore/redis_db.rb +0 -24
- data/lib/wukong/keystore/tyrant_db.rb +0 -137
- data/lib/wukong/keystore/tyrant_notes.textile +0 -145
- data/lib/wukong/models/graph.rb +0 -25
- data/lib/wukong/monitor/chunked_store.rb +0 -23
- data/lib/wukong/monitor/periodic_logger.rb +0 -34
- data/lib/wukong/monitor/periodic_monitor.rb +0 -70
- data/lib/wukong/monitor.rb +0 -7
- data/lib/wukong/rdf.rb +0 -104
- data/lib/wukong/streamer/cassandra_streamer.rb +0 -61
- data/lib/wukong/streamer/count_keys.rb +0 -30
- data/lib/wukong/streamer/count_lines.rb +0 -26
- data/lib/wukong/streamer/em_streamer.rb +0 -7
- data/lib/wukong/streamer/preprocess_with_pipe_streamer.rb +0 -22
- data/lib/wukong/wukong_class.rb +0 -21
@@ -1,211 +0,0 @@
|
|
1
|
-
{
|
2
|
-
"protocol" : "Cassandra",
|
3
|
-
"namespace" : "org.apache.cassandra.avro", "types" : [
|
4
|
-
|
5
|
-
Add/insert one value
|
6
|
-
|
7
|
-
Mutate ks, [col_ref], 'val', ts, ttl }
|
8
|
-
|
9
|
-
Add/insert multiple cols to same row
|
10
|
-
|
11
|
-
MutateRow ks, supercol_or_nil, { [col, val, ts, ttl], [col,val,ts,ttl],...}}
|
12
|
-
MutateCRow ks, { [col, val, ts, ttl], [col,val,ts,ttl],...}}
|
13
|
-
MutateSCRow ks, supercol, { [col, val, ts, ttl], [col,val,ts,ttl],...}}
|
14
|
-
|
15
|
-
Get one, many or all columns from given row
|
16
|
-
|
17
|
-
get
|
18
|
-
Multiget ks, supercol_or_nil, [col1, col2, ...] or nil
|
19
|
-
|
20
|
-
Get one, many or all columns from a slice of sequential rows
|
21
|
-
|
22
|
-
get_range
|
23
|
-
|
24
|
-
Remove one column from a row
|
25
|
-
|
26
|
-
remove
|
27
|
-
|
28
|
-
Remove many columns from a row
|
29
|
-
|
30
|
-
Remove all columns in a row
|
31
|
-
|
32
|
-
|
33
|
-
|
34
|
-
h3. Mo
|
35
|
-
|
36
|
-
|
37
|
-
{ "name" : "AccessLevel", "type" : "enum", "symbols" : [ "NONE", "READONLY", "READWRITE", "FALL" ] },
|
38
|
-
{ "name" : "ColumnPath", "type" : "record", "fields" : [
|
39
|
-
{ "name" : "column_family", "type" : "string"},
|
40
|
-
{ "name" : "super_column", "type" : [ "bytes", "null" ]},
|
41
|
-
{ "name" : "column", "type" : [ "bytes", "null" ] } ]},
|
42
|
-
{ "name" : "ColumnParent", "type" : "record", "fields" : [
|
43
|
-
{ "name" : "column_family", "type" : "string"},
|
44
|
-
{ "name" : "super_column", "type" : [ "bytes", "null" ] } ]},
|
45
|
-
{ "name" : "SliceRange", "type" : "record", "fields" : [
|
46
|
-
{ "name" : "start", "type" : "bytes"},
|
47
|
-
{ "name" : "finish", "type" : "bytes"},
|
48
|
-
{ "name" : "reversed", "type" : "boolean"},
|
49
|
-
{ "name" : "count", "type" : "int"},
|
50
|
-
{ "name" : "bitmasks", "type" : [ { "type" : "array", "items" : "bytes"}, "null" ] } ]},
|
51
|
-
{ "name" : "SlicePredicate", "type" : "record", "fields" : [
|
52
|
-
{ "name" : "column_names", "type" : [ { "type" : "array", "items" : "bytes"}, "null" ]},
|
53
|
-
{ "name" : "slice_range", "type" : [ "SliceRange", "null" ] } ]},
|
54
|
-
|
55
|
-
{ "name" : "Clock", "type" : "record", "fields" : [
|
56
|
-
{ "name" : "timestamp", "type" : "long" } ]},
|
57
|
-
{ "name" : "Column", "type" : "record", "fields" : [
|
58
|
-
{ "name" : "name", "type" : "bytes"},
|
59
|
-
{ "name" : "value", "type" : "bytes"},
|
60
|
-
{ "name" : "clock", "type" : "Clock"},
|
61
|
-
{ "name" : "ttl", "type" : "int" } ]},
|
62
|
-
{ "name" : "SuperColumn", "type" : "record", "fields" : [
|
63
|
-
{ "name" : "name", "type" : "bytes"},
|
64
|
-
{ "name" : "columns", "type" : { "type" : "array", "items" : "Column" } } ]},
|
65
|
-
{ "name" : "ColumnOrSuperColumn", "type" : "record", "fields" : [
|
66
|
-
{ "name" : "column", "type" : "Column" },
|
67
|
-
{ "name" : "super_column", "type" : "null" } ]},
|
68
|
-
{ "name" : "Deletion", "type" : "record", "fields" : [
|
69
|
-
{ "name" : "clock", "type" : "Clock"},
|
70
|
-
{ "name" : "super_column", "type" : [ "bytes", "null" ]},
|
71
|
-
{ "name" : "predicate", "type" : [ "SlicePredicate", "null" ] } ]},
|
72
|
-
{ "name" : "Mutation", "type" : "record", "fields" : [
|
73
|
-
{ "name" : "column_or_supercolumn", "type" : "ColumnOrSuperColumn" },
|
74
|
-
{ "name" : "deletion", "type" : "null" }
|
75
|
-
]},
|
76
|
-
{ "name" : "StreamingMutation", "type" : "record", "fields" : [
|
77
|
-
{ "name" : "key", "type" : "bytes" },
|
78
|
-
{ "name" : "mutation", "type" : "Mutation" } ]},
|
79
|
-
|
80
|
-
{ "name" : "IndexType", "type" : "enum", "symbols" : [ "KEYS" ]},
|
81
|
-
{ "name" : "ColumnDef", "type" : "record", "fields" : [
|
82
|
-
{ "name" : "name", "type" : "bytes"},
|
83
|
-
{ "name" : "validation_class", "type" : "string"},
|
84
|
-
{ "name" : "index_type", "type" : [ "IndexType", "null" ]},
|
85
|
-
{ "name" : "index_name", "type" : [ "string", "null" ] } ]},
|
86
|
-
{ "name" : "CfDef", "type" : "record", "fields" : [
|
87
|
-
{ "name" : "keyspace", "type" : "string"},
|
88
|
-
{ "name" : "name", "type" : "string"},
|
89
|
-
{ "name" : "column_type", "type" : [ "string", "null" ]},
|
90
|
-
{ "name" : "clock_type", "type" : [ "string", "null" ]},
|
91
|
-
{ "name" : "comparator_type", "type" : [ "string", "null" ]},
|
92
|
-
{ "name" : "subcomparator_type", "type" : [ "string", "null" ]},
|
93
|
-
{ "name" : "reconciler", "type" : [ "string", "null" ]},
|
94
|
-
{ "name" : "comment", "type" : [ "string", "null" ]},
|
95
|
-
{ "name" : "row_cache_size", "type" : [ "double", "null" ]},
|
96
|
-
{ "name" : "preload_row_cache", "type" : [ "boolean", "null" ]},
|
97
|
-
{ "name" : "key_cache_size", "type" : [ "double", "null" ]},
|
98
|
-
{ "name" : "read_repair_chance", "type" : [ "double", "null" ]},
|
99
|
-
{ "name" : "gc_grace_seconds", "type" : [ "int", "null" ]},
|
100
|
-
{ "name" : "column_metadata", "type" : [ { "type" : "array", "items" : "ColumnDef"}, "null" ]},
|
101
|
-
{ "name" : "id", "type" : [ "int", "null" ] } ]},
|
102
|
-
{ "name" : "KsDef", "type" : "record", "fields" : [
|
103
|
-
{ "name" : "name", "type" : "string"}, { "name" : "strategy_class", "type" : "string"},
|
104
|
-
{ "name" : "strategy_options", "type" : [ { "type" : "map", "values" : "string"}, "null" ]},
|
105
|
-
{ "name" : "replication_factor", "type" : "int"}, { "name" : "cf_defs", "type" : { "type" : "array", "items" : "CfDef" } } ]},
|
106
|
-
{ "name" : "MutationsMapEntry", "type" : "record", "fields" : [ { "name" : "key", "type" : "bytes"}, { "name" : "mutations", "type" : { "type" : "map", "values" : { "type" : "array", "items" : "Mutation" } } } ]},
|
107
|
-
{ "name" : "CoscsMapEntry", "type" : "record", "fields" : [ { "name" : "key", "type" : "bytes"}, { "name" : "columns", "type" : { "type" : "array", "items" : "ColumnOrSuperColumn" } } ]},
|
108
|
-
{ "name" : "ConsistencyLevel", "type" : "enum", "symbols" : [ "ZERO", "ONE", "QUORUM", "DCQUORUM", "DCQUORUMSYNC", "ALL" ]},
|
109
|
-
{ "name" : "InvalidRequestException", "type" : "error", "fields" : [ { "name" : "why", "type" : [ "string", "null" ] } ]},
|
110
|
-
{ "name" : "NotFoundException", "type" : "error", "fields" : [ { "name" : "why", "type" : [ "string", "null" ] } ]},
|
111
|
-
{ "name" : "UnavailableException", "type" : "error", "fields" : [ { "name" : "why", "type" : [ "string", "null" ] } ]},
|
112
|
-
{ "name" : "TimedOutException", "type" : "error", "fields" : [ { "name" : "why", "type" : [ "string", "null" ] } ] }
|
113
|
-
],
|
114
|
-
|
115
|
-
|
116
|
-
"messages" : { "get" : {
|
117
|
-
"request" : [ { "name" : "key", "type" : "bytes"},
|
118
|
-
{ "name" : "column_path", "type" : "ColumnPath"},
|
119
|
-
{ "name" : "consistency_level", "type" : "ConsistencyLevel"
|
120
|
-
} ],
|
121
|
-
"response" : "ColumnOrSuperColumn",
|
122
|
-
"errors" : [ "InvalidRequestException", "NotFoundException", "UnavailableException", "TimedOutException" ]
|
123
|
-
},
|
124
|
-
"get_slice" : {
|
125
|
-
"request" : [ { "name" : "key", "type" : "bytes"},
|
126
|
-
{ "name" : "column_parent", "type" : "ColumnParent"},
|
127
|
-
{ "name" : "predicate", "type" : "SlicePredicate"},
|
128
|
-
{ "name" : "consistency_level", "type" : "ConsistencyLevel"
|
129
|
-
} ],
|
130
|
-
"response" : { "type" : "array",
|
131
|
-
"items" : "ColumnOrSuperColumn"
|
132
|
-
},
|
133
|
-
"errors" : [ "InvalidRequestException", "UnavailableException", "TimedOutException" ]
|
134
|
-
},
|
135
|
-
"multiget_slice" : {
|
136
|
-
"request" : [ { "name" : "keys", "type" : { "type" : "array",
|
137
|
-
"items" : "bytes"
|
138
|
-
}},
|
139
|
-
{ "name" : "column_parent", "type" : "ColumnParent"},
|
140
|
-
{ "name" : "predicate", "type" : "SlicePredicate"},
|
141
|
-
{ "name" : "consistency_level", "type" : "ConsistencyLevel"
|
142
|
-
} ],
|
143
|
-
"response" : { "type" : "array",
|
144
|
-
"items" : "CoscsMapEntry"
|
145
|
-
},
|
146
|
-
"errors" : [ "InvalidRequestException", "UnavailableException", "TimedOutException" ]
|
147
|
-
},
|
148
|
-
"get_count" : {
|
149
|
-
"request" : [ { "name" : "key", "type" : "bytes"},
|
150
|
-
{ "name" : "column_parent", "type" : "ColumnParent"},
|
151
|
-
{ "name" : "predicate", "type" : "SlicePredicate"},
|
152
|
-
{ "name" : "consistency_level", "type" : "ConsistencyLevel"
|
153
|
-
} ],
|
154
|
-
"response" : "int",
|
155
|
-
"errors" : [ "InvalidRequestException", "UnavailableException", "TimedOutException" ]
|
156
|
-
},
|
157
|
-
"insert" : {
|
158
|
-
"request" : [ { "name" : "key", "type" : "bytes"},
|
159
|
-
{ "name" : "column_parent", "type" : "ColumnParent"},
|
160
|
-
{ "name" : "column", "type" : "Column"},
|
161
|
-
{ "name" : "consistency_level", "type" : "ConsistencyLevel"
|
162
|
-
} ],
|
163
|
-
"response" : "null",
|
164
|
-
"errors" : [ "InvalidRequestException", "UnavailableException", "TimedOutException" ]
|
165
|
-
},
|
166
|
-
"remove" : {
|
167
|
-
"request" : [ { "name" : "key", "type" : "bytes"},
|
168
|
-
{ "name" : "column_path", "type" : "ColumnPath"},
|
169
|
-
{ "name" : "clock", "type" : "Clock"},
|
170
|
-
{ "name" : "consistency_level", "type" : "ConsistencyLevel"
|
171
|
-
} ],
|
172
|
-
"response" : "null",
|
173
|
-
"errors" : [ "InvalidRequestException", "UnavailableException", "TimedOutException" ]
|
174
|
-
},
|
175
|
-
"batch_mutate" : {
|
176
|
-
"request" : [ { "name" : "mutation_map", "type" : { "type" : "array",
|
177
|
-
"items" : "MutationsMapEntry"
|
178
|
-
}},
|
179
|
-
{ "name" : "consistency_level", "type" : "ConsistencyLevel"
|
180
|
-
} ],
|
181
|
-
"response" : "null",
|
182
|
-
"errors" : [ "InvalidRequestException", "UnavailableException", "TimedOutException" ]
|
183
|
-
},
|
184
|
-
"system_add_keyspace" : {
|
185
|
-
"request" : [ { "name" : "ks_def", "type" : "KsDef"
|
186
|
-
} ],
|
187
|
-
"response" : "null",
|
188
|
-
"errors" : [ "InvalidRequestException" ]
|
189
|
-
},
|
190
|
-
"set_keyspace" : {
|
191
|
-
"request" : [ { "name" : "keyspace", "type" : "string"
|
192
|
-
} ],
|
193
|
-
"response" : "null",
|
194
|
-
"errors" : [ "InvalidRequestException" ]
|
195
|
-
},
|
196
|
-
"describe_keyspaces" : {
|
197
|
-
"request" : [ ],
|
198
|
-
"response" : { "type" : "array",
|
199
|
-
"items" : "string"
|
200
|
-
}
|
201
|
-
},
|
202
|
-
"describe_cluster_name" : {
|
203
|
-
"request" : [ ],
|
204
|
-
"response" : "string"
|
205
|
-
},
|
206
|
-
"describe_version" : {
|
207
|
-
"request" : [ ],
|
208
|
-
"response" : "string"
|
209
|
-
}
|
210
|
-
}
|
211
|
-
}
|
Binary file
|
@@ -1 +0,0 @@
|
|
1
|
-
python /usr/local/share/cassandra/contrib/hadoop_streaming_output/bin/mapper.py
|
@@ -1 +0,0 @@
|
|
1
|
-
python /usr/local/share/cassandra/contrib/hadoop_streaming_output/bin/reducer.py
|
@@ -1,188 +0,0 @@
|
|
1
|
-
{
|
2
|
-
"protocol" : "Cassandra",
|
3
|
-
"namespace" : "org.apache.cassandra.avro", "types" : [
|
4
|
-
|
5
|
-
{ "name" : "AccessLevel", "type" : "enum", "symbols" : [ "NONE", "READONLY", "READWRITE", "FALL" ] },
|
6
|
-
{ "name" : "ColumnPath", "type" : "record", "fields" : [
|
7
|
-
{ "name" : "column_family", "type" : "string"},
|
8
|
-
{ "name" : "super_column", "type" : [ "bytes", "null" ]},
|
9
|
-
{ "name" : "column", "type" : [ "bytes", "null" ] } ]},
|
10
|
-
{ "name" : "ColumnParent", "type" : "record", "fields" : [
|
11
|
-
{ "name" : "column_family", "type" : "string"},
|
12
|
-
{ "name" : "super_column", "type" : [ "bytes", "null" ] } ]},
|
13
|
-
{ "name" : "SliceRange", "type" : "record", "fields" : [
|
14
|
-
{ "name" : "start", "type" : "bytes"},
|
15
|
-
{ "name" : "finish", "type" : "bytes"},
|
16
|
-
{ "name" : "reversed", "type" : "boolean"},
|
17
|
-
{ "name" : "count", "type" : "int"},
|
18
|
-
{ "name" : "bitmasks", "type" : [ { "type" : "array", "items" : "bytes"}, "null" ] } ]},
|
19
|
-
{ "name" : "SlicePredicate", "type" : "record", "fields" : [
|
20
|
-
{ "name" : "column_names", "type" : [ { "type" : "array", "items" : "bytes"}, "null" ]},
|
21
|
-
{ "name" : "slice_range", "type" : [ "SliceRange", "null" ] } ]},
|
22
|
-
|
23
|
-
{ "name": "StupidColumnMutation", "type": "record",
|
24
|
-
"fields": [
|
25
|
-
{ "name" : "key", "type" : "bytes" },
|
26
|
-
{ "name" : "name", "type" : "bytes" },
|
27
|
-
{ "name" : "value", "type" : "bytes" },
|
28
|
-
{ "name" : "timestamp", "type" : "long" },
|
29
|
-
{ "name" : "ttl", "type" : "int" }
|
30
|
-
]},
|
31
|
-
|
32
|
-
{ "name" : "Clock", "type" : "record", "fields" : [
|
33
|
-
{ "name" : "timestamp", "type" : "long" } ]},
|
34
|
-
{ "name" : "Column", "type" : "record", "fields" : [
|
35
|
-
{ "name" : "name", "type" : "bytes"},
|
36
|
-
{ "name" : "value", "type" : "bytes"},
|
37
|
-
{ "name" : "clock", "type" : "Clock"},
|
38
|
-
{ "name" : "ttl", "type" : "int" } ]},
|
39
|
-
{ "name" : "SuperColumn", "type" : "record", "fields" : [
|
40
|
-
{ "name" : "name", "type" : "bytes"},
|
41
|
-
{ "name" : "columns", "type" : { "type" : "array", "items" : "Column" } } ]},
|
42
|
-
{ "name" : "ColumnOrSuperColumn", "type" : "record", "fields" : [
|
43
|
-
{ "name" : "column", "type" : "Column" },
|
44
|
-
{ "name" : "super_column", "type" : "null" } ]},
|
45
|
-
{ "name" : "Deletion", "type" : "record", "fields" : [
|
46
|
-
{ "name" : "clock", "type" : "Clock"},
|
47
|
-
{ "name" : "super_column", "type" : [ "bytes", "null" ]},
|
48
|
-
{ "name" : "predicate", "type" : [ "SlicePredicate", "null" ] } ]},
|
49
|
-
{ "name" : "Mutation", "type" : "record", "fields" : [
|
50
|
-
{ "name" : "column_or_supercolumn", "type" : "ColumnOrSuperColumn" },
|
51
|
-
{ "name" : "deletion", "type" : "null" }
|
52
|
-
]},
|
53
|
-
{ "name" : "StreamingMutation", "type" : "record", "fields" : [
|
54
|
-
{ "name" : "key", "type" : "bytes" },
|
55
|
-
{ "name" : "mutation", "type" : "Mutation" } ]},
|
56
|
-
|
57
|
-
{ "name" : "IndexType", "type" : "enum", "symbols" : [ "KEYS" ]},
|
58
|
-
{ "name" : "ColumnDef", "type" : "record", "fields" : [
|
59
|
-
{ "name" : "name", "type" : "bytes"},
|
60
|
-
{ "name" : "validation_class", "type" : "string"},
|
61
|
-
{ "name" : "index_type", "type" : [ "IndexType", "null" ]},
|
62
|
-
{ "name" : "index_name", "type" : [ "string", "null" ] } ]},
|
63
|
-
{ "name" : "CfDef", "type" : "record", "fields" : [
|
64
|
-
{ "name" : "keyspace", "type" : "string"},
|
65
|
-
{ "name" : "name", "type" : "string"},
|
66
|
-
{ "name" : "column_type", "type" : [ "string", "null" ]},
|
67
|
-
{ "name" : "clock_type", "type" : [ "string", "null" ]},
|
68
|
-
{ "name" : "comparator_type", "type" : [ "string", "null" ]},
|
69
|
-
{ "name" : "subcomparator_type", "type" : [ "string", "null" ]},
|
70
|
-
{ "name" : "reconciler", "type" : [ "string", "null" ]},
|
71
|
-
{ "name" : "comment", "type" : [ "string", "null" ]},
|
72
|
-
{ "name" : "row_cache_size", "type" : [ "double", "null" ]},
|
73
|
-
{ "name" : "preload_row_cache", "type" : [ "boolean", "null" ]},
|
74
|
-
{ "name" : "key_cache_size", "type" : [ "double", "null" ]},
|
75
|
-
{ "name" : "read_repair_chance", "type" : [ "double", "null" ]},
|
76
|
-
{ "name" : "gc_grace_seconds", "type" : [ "int", "null" ]},
|
77
|
-
{ "name" : "column_metadata", "type" : [ { "type" : "array", "items" : "ColumnDef"}, "null" ]},
|
78
|
-
{ "name" : "id", "type" : [ "int", "null" ] } ]},
|
79
|
-
{ "name" : "KsDef", "type" : "record", "fields" : [
|
80
|
-
{ "name" : "name", "type" : "string"}, { "name" : "strategy_class", "type" : "string"},
|
81
|
-
{ "name" : "strategy_options", "type" : [ { "type" : "map", "values" : "string"}, "null" ]},
|
82
|
-
{ "name" : "replication_factor", "type" : "int"}, { "name" : "cf_defs", "type" : { "type" : "array", "items" : "CfDef" } } ]},
|
83
|
-
{ "name" : "MutationsMapEntry", "type" : "record", "fields" : [ { "name" : "key", "type" : "bytes"}, { "name" : "mutations", "type" : { "type" : "map", "values" : { "type" : "array", "items" : "Mutation" } } } ]},
|
84
|
-
{ "name" : "CoscsMapEntry", "type" : "record", "fields" : [ { "name" : "key", "type" : "bytes"}, { "name" : "columns", "type" : { "type" : "array", "items" : "ColumnOrSuperColumn" } } ]},
|
85
|
-
{ "name" : "ConsistencyLevel", "type" : "enum", "symbols" : [ "ZERO", "ONE", "QUORUM", "DCQUORUM", "DCQUORUMSYNC", "ALL" ]},
|
86
|
-
{ "name" : "InvalidRequestException", "type" : "error", "fields" : [ { "name" : "why", "type" : [ "string", "null" ] } ]},
|
87
|
-
{ "name" : "NotFoundException", "type" : "error", "fields" : [ { "name" : "why", "type" : [ "string", "null" ] } ]},
|
88
|
-
{ "name" : "UnavailableException", "type" : "error", "fields" : [ { "name" : "why", "type" : [ "string", "null" ] } ]},
|
89
|
-
{ "name" : "TimedOutException", "type" : "error", "fields" : [ { "name" : "why", "type" : [ "string", "null" ] } ] }
|
90
|
-
],
|
91
|
-
|
92
|
-
|
93
|
-
"messages" : { "get" : {
|
94
|
-
"request" : [ { "name" : "key", "type" : "bytes"},
|
95
|
-
{ "name" : "column_path", "type" : "ColumnPath"},
|
96
|
-
{ "name" : "consistency_level", "type" : "ConsistencyLevel"
|
97
|
-
} ],
|
98
|
-
"response" : "ColumnOrSuperColumn",
|
99
|
-
"errors" : [ "InvalidRequestException", "NotFoundException", "UnavailableException", "TimedOutException" ]
|
100
|
-
},
|
101
|
-
"get_slice" : {
|
102
|
-
"request" : [ { "name" : "key", "type" : "bytes"},
|
103
|
-
{ "name" : "column_parent", "type" : "ColumnParent"},
|
104
|
-
{ "name" : "predicate", "type" : "SlicePredicate"},
|
105
|
-
{ "name" : "consistency_level", "type" : "ConsistencyLevel"
|
106
|
-
} ],
|
107
|
-
"response" : { "type" : "array",
|
108
|
-
"items" : "ColumnOrSuperColumn"
|
109
|
-
},
|
110
|
-
"errors" : [ "InvalidRequestException", "UnavailableException", "TimedOutException" ]
|
111
|
-
},
|
112
|
-
"multiget_slice" : {
|
113
|
-
"request" : [ { "name" : "keys", "type" : { "type" : "array",
|
114
|
-
"items" : "bytes"
|
115
|
-
}},
|
116
|
-
{ "name" : "column_parent", "type" : "ColumnParent"},
|
117
|
-
{ "name" : "predicate", "type" : "SlicePredicate"},
|
118
|
-
{ "name" : "consistency_level", "type" : "ConsistencyLevel"
|
119
|
-
} ],
|
120
|
-
"response" : { "type" : "array",
|
121
|
-
"items" : "CoscsMapEntry"
|
122
|
-
},
|
123
|
-
"errors" : [ "InvalidRequestException", "UnavailableException", "TimedOutException" ]
|
124
|
-
},
|
125
|
-
"get_count" : {
|
126
|
-
"request" : [ { "name" : "key", "type" : "bytes"},
|
127
|
-
{ "name" : "column_parent", "type" : "ColumnParent"},
|
128
|
-
{ "name" : "predicate", "type" : "SlicePredicate"},
|
129
|
-
{ "name" : "consistency_level", "type" : "ConsistencyLevel"
|
130
|
-
} ],
|
131
|
-
"response" : "int",
|
132
|
-
"errors" : [ "InvalidRequestException", "UnavailableException", "TimedOutException" ]
|
133
|
-
},
|
134
|
-
"insert" : {
|
135
|
-
"request" : [ { "name" : "key", "type" : "bytes"},
|
136
|
-
{ "name" : "column_parent", "type" : "ColumnParent"},
|
137
|
-
{ "name" : "column", "type" : "Column"},
|
138
|
-
{ "name" : "consistency_level", "type" : "ConsistencyLevel"
|
139
|
-
} ],
|
140
|
-
"response" : "null",
|
141
|
-
"errors" : [ "InvalidRequestException", "UnavailableException", "TimedOutException" ]
|
142
|
-
},
|
143
|
-
"remove" : {
|
144
|
-
"request" : [ { "name" : "key", "type" : "bytes"},
|
145
|
-
{ "name" : "column_path", "type" : "ColumnPath"},
|
146
|
-
{ "name" : "clock", "type" : "Clock"},
|
147
|
-
{ "name" : "consistency_level", "type" : "ConsistencyLevel"
|
148
|
-
} ],
|
149
|
-
"response" : "null",
|
150
|
-
"errors" : [ "InvalidRequestException", "UnavailableException", "TimedOutException" ]
|
151
|
-
},
|
152
|
-
"batch_mutate" : {
|
153
|
-
"request" : [ { "name" : "mutation_map", "type" : { "type" : "array",
|
154
|
-
"items" : "MutationsMapEntry"
|
155
|
-
}},
|
156
|
-
{ "name" : "consistency_level", "type" : "ConsistencyLevel"
|
157
|
-
} ],
|
158
|
-
"response" : "null",
|
159
|
-
"errors" : [ "InvalidRequestException", "UnavailableException", "TimedOutException" ]
|
160
|
-
},
|
161
|
-
"system_add_keyspace" : {
|
162
|
-
"request" : [ { "name" : "ks_def", "type" : "KsDef"
|
163
|
-
} ],
|
164
|
-
"response" : "null",
|
165
|
-
"errors" : [ "InvalidRequestException" ]
|
166
|
-
},
|
167
|
-
"set_keyspace" : {
|
168
|
-
"request" : [ { "name" : "keyspace", "type" : "string"
|
169
|
-
} ],
|
170
|
-
"response" : "null",
|
171
|
-
"errors" : [ "InvalidRequestException" ]
|
172
|
-
},
|
173
|
-
"describe_keyspaces" : {
|
174
|
-
"request" : [ ],
|
175
|
-
"response" : { "type" : "array",
|
176
|
-
"items" : "string"
|
177
|
-
}
|
178
|
-
},
|
179
|
-
"describe_cluster_name" : {
|
180
|
-
"request" : [ ],
|
181
|
-
"response" : "string"
|
182
|
-
},
|
183
|
-
"describe_version" : {
|
184
|
-
"request" : [ ],
|
185
|
-
"response" : "string"
|
186
|
-
}
|
187
|
-
}
|
188
|
-
}
|
@@ -1,51 +0,0 @@
|
|
1
|
-
#!/usr/bin/env bash
|
2
|
-
|
3
|
-
input_file="$1" ; shift
|
4
|
-
output_file="$1" ; shift
|
5
|
-
map_script=${1-/bin/cat} ; shift
|
6
|
-
reduce_script=${1-/usr/bin/uniq} ; shift
|
7
|
-
|
8
|
-
dest_keyspace=${dest_keyspace-soc_net_tw}
|
9
|
-
dest_col_family=${dest_col_family-Wordbag}
|
10
|
-
|
11
|
-
hostname=`hostname`
|
12
|
-
|
13
|
-
# Path to cassandra and hadoop dirs
|
14
|
-
script_dir=$(readlink -f `dirname $0`)
|
15
|
-
CASSANDRA_HOME=${CASSANDRA_HOME-/usr/local/share/cassandra}
|
16
|
-
HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
|
17
|
-
avro_file=${avro_file-$CASSANDRA_HOME/interface/avro/cassandra.avpr}
|
18
|
-
|
19
|
-
ARCHIVES=`/bin/ls -1 $CASSANDRA_HOME/build/apache-cassandra*.jar`
|
20
|
-
for jar in `/bin/ls -1 $CASSANDRA_HOME/build/lib/jars/*.jar $CASSANDRA_HOME/lib/*.jar`; do
|
21
|
-
ARCHIVES=$ARCHIVES,$jar
|
22
|
-
done
|
23
|
-
|
24
|
-
${HADOOP_HOME}/bin/hadoop \
|
25
|
-
jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar \
|
26
|
-
-D stream.map.output=cassandra_avro_output \
|
27
|
-
-D stream.io.identifier.resolver.class=org.apache.cassandra.hadoop.streaming.AvroResolver \
|
28
|
-
-D cassandra.output.keyspace="$dest_keyspace" \
|
29
|
-
-D cassandra.output.columnfamily="$dest_col_family" \
|
30
|
-
-D cassandra.thrift.address=10.204.41.193,10.204.30.11,10.204.58.238,10.204.239.133,10.196.191.31,10.204.103.21,10.202.74.223,10.202.143.95 \
|
31
|
-
-D cassandra.partitioner.class=org.apache.cassandra.dht.RandomPartitioner \
|
32
|
-
-D cassandra.thrift.port=9160 \
|
33
|
-
-D mapreduce.output.columnfamilyoutputformat.batch.threshold=1024 \
|
34
|
-
-D mapred.reduce.tasks=0 \
|
35
|
-
-D mapred.map.tasks.speculative.execution=false \
|
36
|
-
-libjars $ARCHIVES \
|
37
|
-
-file $avro_file \
|
38
|
-
-outputformat org.apache.cassandra.hadoop.ColumnFamilyOutputFormat \
|
39
|
-
-mapper "ruby $script_dir/avromapper.rb --map " \
|
40
|
-
-input "$input_file" \
|
41
|
-
-output "$output_file" \
|
42
|
-
"$@"
|
43
|
-
|
44
|
-
# -D cassandra.thrift.address=10.204.54.190,10.244.42.31,10.244.42.176,10.244.42.112,10.244.42.143,10.244.42.79,10.244.42.4,10.204.53.166 \
|
45
|
-
# -D cassandra.thrift.address=10.204.221.230,10.243.79.223,10.245.19.159,10.242.154.159,10.242.153.155,10.242.153.203 \
|
46
|
-
|
47
|
-
|
48
|
-
# cat /tmp/mj-flip/chimchim-info.log | cut -f5 | ruby -e 'puts $stdin.readlines.map{|l| l.chomp.gsub(/ip-([0-9\-]+)\..*/,"\\1").gsub(/-/,".") }.join(",")'
|
49
|
-
|
50
|
-
|
51
|
-
|
@@ -1,24 +0,0 @@
|
|
1
|
-
#!/usr/bin/env ruby
|
2
|
-
require 'rubygems'
|
3
|
-
require 'wukong'
|
4
|
-
require 'wukong/periodic_monitor'
|
5
|
-
require 'wukong/store/cassandra'
|
6
|
-
require 'wukong/script/cassandra_loader_script'
|
7
|
-
|
8
|
-
Settings.use :commandline
|
9
|
-
Settings.define :log_interval, :default => 1
|
10
|
-
Settings.cassandra_keyspace = 'soc_net_tw'
|
11
|
-
Settings.cassandra_col_family = 'TwitterUser'
|
12
|
-
Settings.cassandra_hosts = "ip-10-204-41-193.ec2.internal:9160,ip-10-204-30-11.ec2.internal:9160,ip-10-204-58-238.ec2.internal:9160,ip-10-204-239-133.ec2.internal:9160,ip-10-196-191-31.ec2.internal:9160,ip-10-204-103-21.ec2.internal:9160,ip-10-202-74-223.ec2.internal:9160,ip-10-202-143-95.ec2.internal:9160"
|
13
|
-
Settings.resolve!
|
14
|
-
|
15
|
-
require 'cassandra/0.7'
|
16
|
-
require 'wuclan/twitter' ; include Wuclan::Twitter
|
17
|
-
require 'wuclan/twitter/cassandra_db'
|
18
|
-
require 'wukong/store/cassandra/streaming'
|
19
|
-
|
20
|
-
# hdp-catd s3://s3hdfs.infinitemonkeys.info/data/sn/tw/fixd/objects/twitter_user | head
|
21
|
-
|
22
|
-
# CassandraScript.new(Wukong::Store::Cassandra::StructLoader, nil).run
|
23
|
-
Wukong::CassandraScript.new(Wukong::Store::Cassandra::StructLoader, nil).run
|
24
|
-
|
data/examples/count_keys.rb
DELETED
@@ -1,56 +0,0 @@
|
|
1
|
-
#!/usr/bin/env ruby
|
2
|
-
$: << File.dirname(__FILE__)+'/../lib'
|
3
|
-
require 'wukong'
|
4
|
-
require 'wukong/streamer/count_keys'
|
5
|
-
require 'wukong/streamer/count_lines'
|
6
|
-
|
7
|
-
#
|
8
|
-
#
|
9
|
-
class CountKeysReducer < Wukong::Streamer::CountLines
|
10
|
-
#
|
11
|
-
# Taken from the actionpack Rails component ('action_view/helpers/number_helper')
|
12
|
-
#
|
13
|
-
# Formats a +number+ with grouped thousands using +delimiter+. You
|
14
|
-
# can customize the format using optional <em>delimiter</em> and <em>separator</em> parameters.
|
15
|
-
# * <tt>delimiter</tt> - Sets the thousands delimiter, defaults to ","
|
16
|
-
# * <tt>separator</tt> - Sets the separator between the units, defaults to "."
|
17
|
-
#
|
18
|
-
# number_with_delimiter(12345678) => 12,345,678
|
19
|
-
# number_with_delimiter(12345678.05) => 12,345,678.05
|
20
|
-
# number_with_delimiter(12345678, ".") => 12.345.678
|
21
|
-
def number_with_delimiter(number, delimiter=",", separator=".")
|
22
|
-
begin
|
23
|
-
parts = number.to_s.split('.')
|
24
|
-
parts[0].gsub!(/(\d)(?=(\d\d\d)+(?!\d))/, "\\1#{delimiter}")
|
25
|
-
parts.join separator
|
26
|
-
rescue
|
27
|
-
number
|
28
|
-
end
|
29
|
-
end
|
30
|
-
|
31
|
-
# Override to look nice
|
32
|
-
def formatted_count item, key_count
|
33
|
-
key_count_str = number_with_delimiter(key_count.to_i)
|
34
|
-
"%-25s\t%12s" % [item, key_count_str]
|
35
|
-
end
|
36
|
-
end
|
37
|
-
|
38
|
-
#
|
39
|
-
class CountKeysScript < Wukong::Script
|
40
|
-
def map_command
|
41
|
-
# Use `cut` to extract the first field
|
42
|
-
%Q{ cut -d"\t" -f1 }
|
43
|
-
end
|
44
|
-
|
45
|
-
#
|
46
|
-
# There's just the one field
|
47
|
-
#
|
48
|
-
def default_options
|
49
|
-
super.merge :sort_fields => 1
|
50
|
-
end
|
51
|
-
end
|
52
|
-
|
53
|
-
# Executes the script when run from command line
|
54
|
-
if __FILE__ == $0
|
55
|
-
CountKeysScript.new(nil, CountKeysReducer).run
|
56
|
-
end
|
@@ -1,57 +0,0 @@
|
|
1
|
-
#!/usr/bin/env ruby
|
2
|
-
$: << File.dirname(__FILE__)+'/../lib'
|
3
|
-
require 'wukong'
|
4
|
-
|
5
|
-
#
|
6
|
-
#
|
7
|
-
module CountKeys
|
8
|
-
#
|
9
|
-
class Mapper < Wukong::Streamer::Base
|
10
|
-
attr_accessor :keys_count
|
11
|
-
def initialize *args
|
12
|
-
self.keys_count = {}
|
13
|
-
end
|
14
|
-
def process key, *args
|
15
|
-
key.gsub!(/-.*/, '') # kill off the slug
|
16
|
-
self.keys_count[key] ||= 0
|
17
|
-
self.keys_count[key] += 1
|
18
|
-
end
|
19
|
-
def stream *args
|
20
|
-
super *args
|
21
|
-
self.keys_count.each do |key, count|
|
22
|
-
emit [key, count].to_flat
|
23
|
-
end
|
24
|
-
end
|
25
|
-
end
|
26
|
-
# Identity Mapper
|
27
|
-
class Reducer < Wukong::Streamer::AccumulatingReducer
|
28
|
-
attr_accessor :key_count
|
29
|
-
require 'active_support'
|
30
|
-
require 'action_view/helpers/number_helper'; include ActionView::Helpers::NumberHelper
|
31
|
-
|
32
|
-
# Override to look nice
|
33
|
-
def formatted_count item, key_count
|
34
|
-
key_count_str = number_with_delimiter(key_count.to_i, :delimiter => ',')
|
35
|
-
"%-25s\t%12s" % [item, key_count_str]
|
36
|
-
end
|
37
|
-
def start! *args
|
38
|
-
self.key_count = 0
|
39
|
-
end
|
40
|
-
def accumulate key, count
|
41
|
-
self.key_count += count.to_i
|
42
|
-
end
|
43
|
-
def finalize
|
44
|
-
yield formatted_count(key, key_count)
|
45
|
-
end
|
46
|
-
end
|
47
|
-
|
48
|
-
#
|
49
|
-
class Script < Wukong::Script
|
50
|
-
# There's just the one field
|
51
|
-
def default_options
|
52
|
-
super.merge :sort_fields => 1, :reduce_tasks => 1
|
53
|
-
end
|
54
|
-
end
|
55
|
-
end
|
56
|
-
|
57
|
-
CountKeys::Script.new(CountKeys::Mapper, CountKeys::Reducer).run
|
@@ -1,26 +0,0 @@
|
|
1
|
-
|
2
|
-
# Download the Amazon elastic-mapreduce runner from http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
|
3
|
-
|
4
|
-
# Create a bucket and path to hold your EMR logs, scripts and other ephemera. For instance you might choose 'emr.yourdomain.com' as the bucket and '/wukong' as a scoping path within that bucket. In that case you will refer to it with a path like s3n://emr.yourdomain.com/wukong (see notes below about s3n:// vs. s3:// URLs).
|
5
|
-
|
6
|
-
# Copy the contents of wukong/examples/emr/dot_wukong_dir to ~/.wukong
|
7
|
-
# Edit emr.yaml -- it has instructions for the
|
8
|
-
|
9
|
-
|
10
|
-
|
11
|
-
|
12
|
-
|
13
|
-
h3. s3n:// vs. s3:// URLs
|
14
|
-
|
15
|
-
Many external tools use a URI convention to address files in S3; they typically use the 's3://' scheme, which makes a lot of sense:
|
16
|
-
s3://emr.yourcompany.com/wukong/happy_job_1/logs/whatever-20100808.log
|
17
|
-
|
18
|
-
Hadoop can maintain an HDFS on the Amazon S3: it uses a block structure and has optimizations for streaming, no file size limitation, and other goodness. However, only hadoop tools can interpret the contents of those blocks -- to everything else it just looks like a soup of blocks labelled block_-8675309 and so forth. Hadoop unfortunately chose the 's3://' scheme for URIs in this filesystem:
|
19
|
-
s3://s3hdfs.yourcompany.com/path/to/data
|
20
|
-
|
21
|
-
Hadoop is happy to read s3 native files -- 'native' as in, you can look at them with a browser and upload them an download them with any S3 tool out there. There's a 5GB limit on file size, and in some cases a performance hit (but not in our experience enough to worry about). You refer to these files with the 's3n://' scheme ('n' as in 'native'):
|
22
|
-
s3n://emr.yourcompany.com/wukong/happy_job_1/code/happy_job_1-mapper.rb
|
23
|
-
s3n://emr.yourcompany.com/wukong/happy_job_1/code/happy_job_1-reducer.rb
|
24
|
-
s3n://emr.yourcompany.com/wukong/happy_job_1/logs/whatever-20100808.log
|
25
|
-
|
26
|
-
Wukong will coerce things to the right scheme when it knows what that scheme should be (eg. code should be s3n://). It will otherwise leave the path alone. Specifically, if you use a URI scheme for input and output paths you must use 's3n://' for normal s3 files.
|