mosql 0.1.0

@@ -0,0 +1,2 @@
collections.yml
/.bundle/
data/Gemfile ADDED
@@ -0,0 +1,4 @@
source 'https://rubygems.org'

gemspec

@@ -0,0 +1,48 @@
GIT
  remote: git@github.com:stripe-internal/mongoriver
  revision: d5b5ca1471f9efe7c91b3abe2c26f612a2dd4e9c
  ref: d5b5ca1471f9efe7c91b3abe2c26f612a2dd4e9c
  specs:
    mongoriver (0.0.1)
      bson_ext
      log4r
      mongo (>= 1.7)

PATH
  remote: .
  specs:
    mosql (0.0.1)
      bson_ext
      json
      log4r
      mongo
      pg
      rake
      sequel

GEM
  remote: https://intgems.stripe.com:446/
  specs:
    bson (1.7.1)
    bson_ext (1.7.1)
      bson (~> 1.7.1)
    json (1.7.5)
    log4r (1.1.10)
    metaclass (0.0.1)
    minitest (3.0.0)
    mocha (0.10.5)
      metaclass (~> 0.0.1)
    mongo (1.7.1)
      bson (~> 1.7.1)
    pg (0.14.1)
    rake (10.0.2)
    sequel (3.41.0)

PLATFORMS
  ruby

DEPENDENCIES
  minitest
  mocha
  mongoriver!
  mosql!
@@ -0,0 +1,168 @@
# MoSQL: a MongoDB → SQL streaming translator

At Stripe, we love MongoDB. We love the flexibility it gives us in
changing data schemas as we grow and learn, and we love its
operational properties. We love replsets. We love the uniform query
language that doesn't require generating and parsing strings, tracking
placeholder parameters, or any of that nonsense.

The thing is, we also love SQL. We love the ease of doing ad-hoc data
analysis over small-to-mid-size datasets in SQL. We love doing JOINs
to pull together reports summarizing properties across multiple
datasets. We love the fact that virtually every employee we hire
already knows SQL and is comfortable using it to ask and answer
questions about data.

So, we thought, why can't we have the best of both worlds? Thus:
MoSQL.

# MoSQL: Put Mo' SQL in your NoSQL

![MoSQL](https://stripe.com/img/blog/posts/mosql/mosql.png)

MoSQL imports the contents of your MongoDB database cluster into a
PostgreSQL instance, using an oplog tailer to keep the SQL mirror
continuously up-to-date. This lets you run production services against
a MongoDB database, and then run offline analytics or reporting using
the full power of SQL.

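The core idea can be sketched as a toy loop. MoSQL's real tailer is built
on the mongoriver library and writes to PostgreSQL; this is only an
illustration of replaying oplog entries against a mirror:

```ruby
# A toy sketch (NOT MoSQL's implementation) of the core idea: replay
# each MongoDB oplog entry against a mirror, keyed by _id, so the
# mirror stays current as the primary database changes.
def apply_op(mirror, op)
  table = mirror[op['ns']]
  case op['op']
  when 'i' then table[op['o']['_id']] = op['o']   # insert
  when 'd' then table.delete(op['o']['_id'])      # delete
  end
end

mirror = Hash.new { |h, k| h[k] = {} }  # stands in for the SQL tables
oplog = [
  { 'op' => 'i', 'ns' => 'blog.posts', 'o' => { '_id' => 1, 'title' => 'hello' } },
  { 'op' => 'i', 'ns' => 'blog.posts', 'o' => { '_id' => 2, 'title' => 'world' } },
  { 'op' => 'd', 'ns' => 'blog.posts', 'o' => { '_id' => 1 } },
]
oplog.each { |op| apply_op(mirror, op) }
mirror['blog.posts']  # => {2=>{"_id"=>2, "title"=>"world"}}
```
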
## Installation

Install from Rubygems as:

    $ gem install mosql

Or build from source with:

    $ gem build mosql.gemspec

And then install the built gem.

## The Collection Map file

In order to define a SQL schema and import your data, MoSQL needs a
collection map file describing the schema of your MongoDB data. (Don't
worry -- MoSQL can handle it if your Mongo data doesn't always exactly
fit the stated schema. More on that later.)

The collection map is a YAML file describing the databases and
collections in Mongo that you want to import, in terms of their SQL
types. An example collection map might be:

    mongodb:
      blog_posts:
        :columns:
          - _id: TEXT
          - author: TEXT
          - title: TEXT
          - created: DOUBLE PRECISION
        :meta:
          :table: blog_posts
          :extra_props: true

Said another way, the collection map is a YAML file containing a hash
mapping

    <Mongo DB name> -> { <Mongo Collection Name> -> <Collection Definition> }

where a `<Collection Definition>` is a hash with `:columns` and
`:meta` fields. `:columns` is a list of one-element hashes, mapping
field name to SQL type. It is required to include at least an `_id`
mapping. `:meta` contains metadata about this collection/table. It is
required to include at least `:table`, naming the SQL table this
collection will be mapped to. `:extra_props` determines the handling of
unknown fields in MongoDB objects -- more about that later.

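To make the mapping concrete, here is a simplified sketch (not MoSQL's
actual transform code) of how the example map above turns a Mongo
document into a SQL row, in column order, with unmapped fields
JSON-encoded when `:extra_props` is set:

```ruby
require 'json'

# Simplified sketch of the collection-map translation: the map (shown
# here as the parsed Ruby structure) picks mapped fields into a row in
# column order; leftovers become the JSON _extra_props value.
spec = {
  :columns => [
    { '_id'     => 'TEXT' },
    { 'author'  => 'TEXT' },
    { 'title'   => 'TEXT' },
    { 'created' => 'DOUBLE PRECISION' },
  ],
  :meta => { :table => 'blog_posts', :extra_props => true },
}

columns = spec[:columns].map { |c| c.keys.first }

doc = { '_id' => 'abc123', 'author' => 'alice', 'title' => 'Hello',
        'created' => 1354.0, 'likes' => 7 }   # 'likes' is not in the map

row = columns.map { |col| doc[col] }
if spec[:meta][:extra_props]
  extra = doc.reject { |k, _| columns.include?(k) }
  row << JSON.generate(extra)
end
row  # => ["abc123", "alice", "Hello", 1354.0, "{\"likes\":7}"]
```
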
By default, `mosql` looks for a collection map in a file named
`collections.yml` in your current working directory, but you can
specify a different one with `-c` or `--collections`.

## Usage

Once you have a collection map, MoSQL usage is easy. The basic form
is:

    mosql [-c collections.yml] [--sql postgres://sql-server/sql-db] [--mongo mongodb://mongo-uri]

By default, `mosql` connects to both PostgreSQL and MongoDB instances
running on default ports on localhost without authentication. You can
point it at different targets using the `--sql` and `--mongo`
command-line parameters.

`mosql` will:

1. Create the appropriate SQL tables
2. Import data from the Mongo database
3. Start tailing the mongo oplog, propagating changes from MongoDB to SQL.

After the first run, `mosql` will store the status of the oplog tailer
in the `mosql_tailers` table in your SQL database, and automatically
resume where it left off. `mosql` uses the replset name to keep track
of which mongo database it's tailing, so that you can tail multiple
databases into the same SQL database. If you want to tail the same
replset, or multiple replsets with the same name, for some reason, you
can use the `--service` flag to change the name `mosql` uses to track
state.

You likely want to run `mosql` against a secondary node, at least for
the initial import, which will cause large amounts of disk activity on
the target node. One option is to use read preferences in your
connection URI:

    mosql --mongo mongodb://node1,node2,node3?readPreference=secondary

## Advanced usage

For advanced scenarios, you can pass options to control `mosql`'s
behavior. If you pass `--skip-tail`, `mosql` will do the initial
import, but not tail the oplog. This could be used, for example, to do
an import off of a backup snapshot, and then start the tailer on the
live cluster.

If you need to force a fresh reimport, run with `--reimport`, which
will cause `mosql` to drop tables, create them anew, and do another
import.

## Schema mismatches and _extra_props

If MoSQL encounters values in the MongoDB database that don't fit
within the stated schema (e.g. a floating-point value in an INTEGER
field), it will log a warning, ignore the entire object, and continue.

If it encounters a MongoDB object with fields not listed in the
collection map, it will discard the extra fields, unless
`:extra_props` is set in the `:meta` hash. If it is, it will collect
any missing fields, JSON-encode them in a hash, and store the
resulting text in `_extra_props` in SQL. It's up to you to do
something useful with the JSON. One option is to use [plv8][plv8] to
parse them inside PostgreSQL, or you can just pull the JSON out whole
and parse it in application code.

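For example, parsing `_extra_props` in application code might look
like this (the row hash stands in for a record fetched from
PostgreSQL, e.g. via Sequel):

```ruby
require 'json'

# Sketch of the "parse it in application code" approach: the row hash
# below stands in for a record read back out of the SQL mirror.
row = { :_id => 'abc123',
        :_extra_props => '{"likes": 7, "tags": ["mongo", "sql"]}' }

extra = JSON.parse(row[:_extra_props] || '{}')
extra['likes']  # => 7
extra['tags']   # => ["mongo", "sql"]
```
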
This is also currently the only way to handle array or object values
inside records -- specify `:extra_props`, and they'll get JSON-encoded
into `_extra_props`. There's no reason we couldn't support
JSON-encoded values for individual columns/fields, but we haven't
written that code yet.

[plv8]: http://code.google.com/p/plv8js/

## Sharded clusters

MoSQL does not have special support for sharded Mongo clusters at this
time. It should be possible to run a separate MoSQL instance against
each of the individual backend shard replica sets, streaming into
separate PostgreSQL instances, but we have not actually tested this
yet.

# Development

Patches and contributions are welcome! Please fork the project and
open a pull request on [github][github], or just report issues.

MoSQL includes a small but hopefully-growing test suite. It assumes a
running PostgreSQL and MongoDB instance on the local host; you can
point it at a different target via environment variables. See
`test/functional/_lib.rb` for more information.

[github]: https://github.com/stripe/mosql
@@ -0,0 +1,12 @@
require 'rake/testtask'

task :default
task :build

Rake::TestTask.new do |t|
  t.libs = ["lib"]
  t.verbose = true
  t.test_files = FileList['test/**/*.rb'].reject do |file|
    file.end_with?('_lib.rb')
  end
end
@@ -0,0 +1,7 @@
#!/usr/bin/env ruby

require 'rubygems'
require 'bundler/setup'
require 'mosql/cli'

MoSQL::CLI.run(ARGV)
@@ -0,0 +1,11 @@
require 'log4r'
require 'mongo'
require 'sequel'
require 'mongoriver'
require 'json'

require 'mosql/version'
require 'mosql/log'
require 'mosql/sql'
require 'mosql/schema'
require 'mosql/tailer'
@@ -0,0 +1,305 @@
require 'mosql'
require 'optparse'
require 'yaml'
require 'logger'
require 'set'

module MoSQL
  class CLI
    include MoSQL::Logging

    BATCH = 1000

    attr_reader :args, :options, :tailer

    def self.run(args)
      cli = CLI.new(args)
      cli.run
    end

    def initialize(args)
      @args    = args
      @options = {}
      @done    = false
      setup_signal_handlers
    end

    def setup_signal_handlers
      %w[TERM INT USR2].each do |sig|
        Signal.trap(sig) do
          log.info("Got SIG#{sig}. Preparing to exit...")
          @done = true
        end
      end
    end

    def parse_args
      @options = {
        :collections => 'collections.yml',
        :sql         => 'postgres:///',
        :mongo       => 'mongodb://localhost',
        :verbose     => 0
      }
      optparse = OptionParser.new do |opts|
        opts.banner = "Usage: #{$0} [options] "

        opts.on('-h', '--help', "Display this message") do
          puts opts
          exit(0)
        end

        opts.on('-v', "Increase verbosity") do
          @options[:verbose] += 1
        end

        opts.on("-c", "--collections [collections.yml]", "Collection map YAML file") do |file|
          @options[:collections] = file
        end

        opts.on("--sql [sqluri]", "SQL server to connect to") do |uri|
          @options[:sql] = uri
        end

        opts.on("--mongo [mongouri]", "Mongo connection string") do |uri|
          @options[:mongo] = uri
        end

        opts.on("--schema [schema]", "PostgreSQL 'schema' to namespace tables") do |schema|
          @options[:schema] = schema
        end

        opts.on("--ignore-delete", "Ignore delete operations when tailing") do
          @options[:ignore_delete] = true
        end

        opts.on("--tail-from [timestamp]", "Start tailing from the specified UNIX timestamp") do |ts|
          @options[:tail_from] = ts
        end

        opts.on("--service [service]", "Service name to use when storing tailing state") do |service|
          @options[:service] = service
        end

        opts.on("--skip-tail", "Don't tail the oplog, just do the initial import") do
          @options[:skip_tail] = true
        end

        opts.on("--reimport", "Force a data re-import") do
          @options[:reimport] = true
        end
      end

      optparse.parse!(@args)

      log = Log4r::Logger.new('Stripe')
      log.outputters = Log4r::StdoutOutputter.new(STDERR)
      if options[:verbose] >= 1
        log.level = Log4r::DEBUG
      else
        log.level = Log4r::INFO
      end
    end

    def connect_mongo
      @mongo = Mongo::Connection.from_uri(options[:mongo])
      config = @mongo['admin'].command(:ismaster => 1)
      if !config['setName']
        log.warn("`#{options[:mongo]}' is not a replset. Proceeding anyway...")
      end
      options[:service] ||= config['setName']
    end

    def connect_sql
      @sql = MoSQL::SQLAdapter.new(@schemamap, options[:sql], options[:schema])
      if options[:verbose] >= 2
        @sql.db.sql_log_level = :debug
        @sql.db.loggers << Logger.new($stderr)
      end
    end

    def load_collections
      collections = YAML.load(File.read(@options[:collections]))
      @schemamap = MoSQL::Schema.new(collections)
    end

    def run
      parse_args
      load_collections
      connect_sql
      connect_mongo

      metadata_table = MoSQL::Tailer.create_table(@sql.db, 'mosql_tailers')

      @tailer = MoSQL::Tailer.new([@mongo], :existing, metadata_table,
                                  :service => options[:service])

      if options[:reimport] || tailer.read_timestamp.seconds == 0
        initial_import
      end

      optail
    end

    # Helpers

    def collection_for_ns(ns)
      dbname, collection = ns.split(".", 2)
      @mongo.db(dbname).collection(collection)
    end

    def bulk_upsert(table, ns, items)
      begin
        @schemamap.copy_data(table.db, ns, items)
      rescue Sequel::DatabaseError => e
        log.debug("Bulk insert error (#{e}), attempting individual upserts...")
        cols = @schemamap.all_columns(@schemamap.find_ns(ns))
        items.each do |it|
          h = {}
          cols.zip(it).each { |k, v| h[k] = v }
          @sql.upsert(table, h)
        end
      end
    end

    def with_retries(tries=10)
      tries.times do |try|
        begin
          yield
          break  # success -- stop retrying
        rescue Mongo::ConnectionError, Mongo::ConnectionFailure, Mongo::OperationFailure => e
          # Duplicate key error
          raise if e.kind_of?(Mongo::OperationFailure) && [11000, 11001].include?(e.error_code)
          # Cursor timeout
          raise if e.kind_of?(Mongo::OperationFailure) && e.message =~ /^Query response returned CURSOR_NOT_FOUND/
          delay = 0.5 * (1.5 ** try)
          log.warn("Mongo exception: #{e}, sleeping #{delay}s...")
          sleep(delay)
        end
      end
    end

    def track_time
      start = Time.now
      yield
      Time.now - start
    end

    def initial_import
      @schemamap.create_schema(@sql.db, true)

      start_ts = @mongo['local']['oplog.rs'].find_one({}, {:sort => [['$natural', -1]]})['ts']

      want_dbs = @schemamap.all_mongo_dbs & @mongo.database_names
      want_dbs.each do |dbname|
        log.info("Importing for Mongo DB #{dbname}...")
        db = @mongo.db(dbname)
        want = Set.new(@schemamap.collections_for_mongo_db(dbname))
        db.collections.select { |c| want.include?(c.name) }.each do |collection|
          ns = "#{dbname}.#{collection.name}"
          import_collection(ns, collection)
          exit(0) if @done
        end
      end

      tailer.write_timestamp(start_ts)
    end

    def import_collection(ns, collection)
      log.info("Importing for #{ns}...")
      count = 0
      batch = []
      table = @sql.table_for_ns(ns)
      table.truncate

      start    = Time.now
      sql_time = 0
      collection.find(nil, :batch_size => BATCH) do |cursor|
        with_retries do
          cursor.each do |obj|
            batch << @schemamap.transform(ns, obj)
            count += 1

            if batch.length >= BATCH
              sql_time += track_time do
                bulk_upsert(table, ns, batch)
              end
              elapsed = Time.now - start
              log.info("Imported #{count} rows (#{elapsed}s, #{sql_time}s SQL)...")
              batch.clear
              exit(0) if @done
            end
          end
        end
      end

      unless batch.empty?
        bulk_upsert(table, ns, batch)
      end
    end

    def optail
      return if options[:skip_tail]

      tailer.tail_from(options[:tail_from] ?
                       BSON::Timestamp.new(options[:tail_from].to_i, 0) :
                       nil)
      until @done
        tailer.stream(1000) do |op|
          handle_op(op)
        end
      end
    end

    def sync_object(ns, _id)
      obj = collection_for_ns(ns).find_one({:_id => _id})
      if obj
        @sql.upsert_ns(ns, obj)
      else
        @sql.table_for_ns(ns).where(:_id => _id).delete()
      end
    end

    def handle_op(op)
      log.debug("processing op: #{op.inspect}")
      unless op['ns'] && op['op']
        log.warn("Weird op: #{op.inspect}")
        return
      end

      unless @schemamap.find_ns(op['ns'])
        log.debug("Skipping op for unknown ns #{op['ns']}...")
        return
      end

      ns = op['ns']
      dbname, collection_name = ns.split(".", 2)

      case op['op']
      when 'n'
        log.debug("Skipping no-op #{op.inspect}")
      when 'i'
        if collection_name == 'system.indexes'
          log.info("Skipping index update: #{op.inspect}")
        else
          @sql.upsert_ns(ns, op['o'])
        end
      when 'u'
        selector = op['o2']
        update   = op['o']
        if update.keys.any? { |k| k.start_with? '$' }
          log.debug("resync #{ns}: #{selector['_id']} (update was: #{update.inspect})")
          sync_object(ns, selector['_id'])
        else
          log.debug("upsert #{ns}: _id=#{update['_id']}")
          @sql.upsert_ns(ns, update)
        end
      when 'd'
        if options[:ignore_delete]
          log.debug("Ignoring delete op on #{ns} as instructed.")
        else
          @sql.table_for_ns(ns).where(:_id => op['o']['_id']).delete
        end
      else
        log.info("Skipping unknown op #{op.inspect}")
      end
    end
  end
end