mosql 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/.gitignore ADDED
@@ -0,0 +1,2 @@
+ collections.yml
+ /.bundle/
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source 'https://rubygems.org'
+
+ gemspec
+
data/Gemfile.lock ADDED
@@ -0,0 +1,48 @@
+ GIT
+   remote: git@github.com:stripe-internal/mongoriver
+   revision: d5b5ca1471f9efe7c91b3abe2c26f612a2dd4e9c
+   ref: d5b5ca1471f9efe7c91b3abe2c26f612a2dd4e9c
+   specs:
+     mongoriver (0.0.1)
+       bson_ext
+       log4r
+       mongo (>= 1.7)
+
+ PATH
+   remote: .
+   specs:
+     mosql (0.0.1)
+       bson_ext
+       json
+       log4r
+       mongo
+       pg
+       rake
+       sequel
+
+ GEM
+   remote: https://intgems.stripe.com:446/
+   specs:
+     bson (1.7.1)
+     bson_ext (1.7.1)
+       bson (~> 1.7.1)
+     json (1.7.5)
+     log4r (1.1.10)
+     metaclass (0.0.1)
+     minitest (3.0.0)
+     mocha (0.10.5)
+       metaclass (~> 0.0.1)
+     mongo (1.7.1)
+       bson (~> 1.7.1)
+     pg (0.14.1)
+     rake (10.0.2)
+     sequel (3.41.0)
+
+ PLATFORMS
+   ruby
+
+ DEPENDENCIES
+   minitest
+   mocha
+   mongoriver!
+   mosql!
data/README.md ADDED
@@ -0,0 +1,168 @@
+ # MoSQL: a MongoDB → SQL streaming translator
+
+ At Stripe, we love MongoDB. We love the flexibility it gives us in
+ changing data schemas as we grow and learn, and we love its
+ operational properties. We love replsets. We love the uniform query
+ language that doesn't require generating and parsing strings, tracking
+ placeholder parameters, or any of that nonsense.
+
+ The thing is, we also love SQL. We love the ease of doing ad-hoc data
+ analysis over small-to-mid-size datasets in SQL. We love doing JOINs
+ to pull together reports summarizing properties across multiple
+ datasets. We love the fact that virtually every employee we hire
+ already knows SQL and is comfortable using it to ask and answer
+ questions about data.
+
+ So, we thought, why can't we have the best of both worlds? Thus:
+ MoSQL.
+
+ # MoSQL: Put Mo' SQL in your NoSQL
+
+ ![MoSQL](https://stripe.com/img/blog/posts/mosql/mosql.png)
+
+ MoSQL imports the contents of your MongoDB database cluster into a
+ PostgreSQL instance, using an oplog tailer to keep the SQL mirror
+ up-to-date in real time. This lets you run production services against
+ a MongoDB database, and then run offline analytics or reporting using
+ the full power of SQL.
+
+ ## Installation
+
+ Install from Rubygems with:
+
+     $ gem install mosql
+
+ Or build from source:
+
+     $ gem build mosql.gemspec
+
+ And then install the built gem.
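+
+ For example, if the gemspec builds version 0.1.0:
+
+     $ gem install ./mosql-0.1.0.gem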
+
+ ## The Collection Map file
+
+ In order to define a SQL schema and import your data, MoSQL needs a
+ collection map file describing the schema of your MongoDB data. (Don't
+ worry -- MoSQL can handle it if your mongo data doesn't always exactly
+ fit the stated schema. More on that later.)
+
+ The collection map is a YAML file describing the databases and
+ collections in Mongo that you want to import, in terms of their SQL
+ types. An example collection map might be:
+
+     mongodb:
+       blog_posts:
+         :columns:
+           - _id: TEXT
+           - author: TEXT
+           - title: TEXT
+           - created: DOUBLE PRECISION
+         :meta:
+           :table: blog_posts
+           :extra_props: true
+
+ Said another way, the collection map is a YAML file containing a hash
+ mapping
+
+     <Mongo DB name> -> { <Mongo Collection Name> -> <Collection Definition> }
+
+ Where a `<Collection Definition>` is a hash with `:columns` and
+ `:meta` fields. `:columns` is a list of one-element hashes, mapping
+ field-name to SQL type. It is required to include at least an `_id`
+ mapping. `:meta` contains metadata about this collection/table. It is
+ required to include at least `:table`, naming the SQL table this
+ collection will be mapped to. `:extra_props` determines the handling of
+ unknown fields in MongoDB objects -- more about that later.
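+
+ Given the map above, the table MoSQL creates would look roughly like
+ the following sketch (the exact DDL is generated internally, and
+ `_extra_props` is described below):
+
+     CREATE TABLE blog_posts (
+       _id TEXT,
+       author TEXT,
+       title TEXT,
+       created DOUBLE PRECISION,
+       _extra_props TEXT
+     );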
+
+ By default, `mosql` looks for a collection map in a file named
+ `collections.yml` in your current working directory, but you can
+ specify a different one with `-c` or `--collections`.
+
+ ## Usage
+
+ Once you have a collection map, MoSQL usage is easy. The basic form
+ is:
+
+     mosql [-c collections.yml] [--sql postgres://sql-server/sql-db] [--mongo mongodb://mongo-uri]
+
+ By default, `mosql` connects to both PostgreSQL and MongoDB instances
+ running on default ports on localhost without authentication. You can
+ point it at different targets using the `--sql` and `--mongo`
+ command-line parameters.
+
+ `mosql` will:
+
+ 1. Create the appropriate SQL tables
+ 2. Import data from the Mongo database
+ 3. Start tailing the mongo oplog, propagating changes from MongoDB to SQL.
+
+ After the first run, `mosql` will store the status of the optailer in
+ the `mosql_tailers` table in your SQL database, and automatically
+ resume where it left off. `mosql` uses the replset name to keep track
+ of which mongo database it's tailing, so that you can tail multiple
+ databases into the same SQL database. If for some reason you want to
+ tail the same replset twice, or multiple replsets with the same name,
+ you can use the `--service` flag to change the name `mosql` uses to
+ track state.
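+
+ For example, to tail the same replset under a second, distinct name
+ (the service name here is illustrative):
+
+     mosql --service reporting --mongo mongodb://node1,node2,node3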
+
+ You likely want to run `mosql` against a secondary node, at least for
+ the initial import, which will cause large amounts of disk activity on
+ the target node. One option is to use read preferences in your
+ connection URI:
+
+     mosql --mongo mongodb://node1,node2,node3?readPreference=secondary
+
+ ## Advanced usage
+
+ For advanced scenarios, you can pass options to control mosql's
+ behavior. If you pass `--skip-tail`, mosql will do the initial import,
+ but not tail the oplog. This could be used, for example, to do an
+ import off of a backup snapshot, and then start the tailer on the live
+ cluster.
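+
+ A sketch of that workflow (the hosts and timestamp are illustrative;
+ `--tail-from` takes a UNIX timestamp):
+
+     # initial import from a snapshot, without tailing
+     mosql --skip-tail --mongo mongodb://snapshot-host
+
+     # later, tail the live cluster from a chosen point in time
+     mosql --tail-from 1356998400 --mongo mongodb://live-host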
+
+ If you need to force a fresh reimport, pass `--reimport`, which will
+ cause `mosql` to drop tables, create them anew, and do another import.
+
+ ## Schema mismatches and _extra_props
+
+ If MoSQL encounters values in the MongoDB database that don't fit
+ within the stated schema (e.g. a floating-point value in an INTEGER
+ field), it will log a warning, ignore the entire object, and continue.
+
+ If it encounters a MongoDB object with fields not listed in the
+ collection map, it will discard the extra fields, unless
+ `:extra_props` is set in the `:meta` hash. If it is, it will collect
+ the extra fields into a hash, JSON-encode that hash, and store the
+ resulting text in `_extra_props` in SQL. It's up to you to do
+ something useful with the JSON. One option is to use [plv8][plv8] to
+ parse it inside PostgreSQL, or you can just pull the JSON out whole
+ and parse it in application code.
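+
+ As a minimal sketch of the application-code route, using Sequel and
+ assuming the `blog_posts` table from the example map:
+
+     require 'sequel'
+     require 'json'
+
+     db = Sequel.connect('postgres:///')  # default local Postgres
+     db[:blog_posts].each do |row|
+       # _extra_props holds a JSON text blob (or NULL) per row
+       extra = JSON.parse(row[:_extra_props] || '{}')
+       # ... do something with the unmapped fields in `extra` ...
+     end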
+
+ This is also currently the only way to handle array or object values
+ inside records -- specify `:extra_props`, and they'll get JSON-encoded
+ into `_extra_props`. There's no reason we couldn't support
+ JSON-encoded values for individual columns/fields, but we haven't
+ written that code yet.
+
+ [plv8]: http://code.google.com/p/plv8js/
+
+ ## Sharded clusters
+
+ MoSQL does not have special support for sharded Mongo clusters at this
+ time. It should be possible to run a separate MoSQL instance against
+ each of the individual backend shard replica sets, streaming into
+ separate PostgreSQL instances, but we have not actually tested this
+ yet.
+
+ ## Development
+
+ Patches and contributions are welcome! Please fork the project and
+ open a pull request on [github][github], or just report issues.
+
+ MoSQL includes a small but hopefully-growing test suite. It assumes a
+ running PostgreSQL and MongoDB instance on the local host. You can
+ point it at a different target via environment variables; see
+ `test/functional/_lib.rb` for more information.
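+
+ With both services running locally, the suite can be driven through
+ the `Rake::TestTask` defined in the Rakefile:
+
+     $ rake test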
+
+ [github]: https://github.com/stripe/mosql
data/Rakefile ADDED
@@ -0,0 +1,12 @@
+ require 'rake/testtask'
+
+ task :default
+ task :build
+
+ Rake::TestTask.new do |t|
+   t.libs = ["lib"]
+   t.verbose = true
+   t.test_files = FileList['test/**/*.rb'].reject do |file|
+     file.end_with?('_lib.rb')
+   end
+ end
data/bin/mosql ADDED
@@ -0,0 +1,7 @@
+ #!/usr/bin/env ruby
+
+ require 'rubygems'
+ require 'bundler/setup'
+ require 'mosql/cli'
+
+ MoSQL::CLI.run(ARGV)
data/lib/mosql.rb ADDED
@@ -0,0 +1,11 @@
+ require 'log4r'
+ require 'mongo'
+ require 'sequel'
+ require 'mongoriver'
+ require 'json'
+
+ require 'mosql/version'
+ require 'mosql/log'
+ require 'mosql/sql'
+ require 'mosql/schema'
+ require 'mosql/tailer'
data/lib/mosql/cli.rb ADDED
@@ -0,0 +1,305 @@
+ require 'mosql'
+ require 'optparse'
+ require 'yaml'
+ require 'logger'
+
+ module MoSQL
+   class CLI
+     include MoSQL::Logging
+
+     BATCH = 1000
+
+     attr_reader :args, :options, :tailer
+
+     def self.run(args)
+       cli = CLI.new(args)
+       cli.run
+     end
+
+     def initialize(args)
+       @args = args
+       @options = {}
+       @done = false
+       setup_signal_handlers
+     end
+
+     def setup_signal_handlers
+       %w[TERM INT USR2].each do |sig|
+         Signal.trap(sig) do
+           log.info("Got SIG#{sig}. Preparing to exit...")
+           @done = true
+         end
+       end
+     end
+
+     def parse_args
+       @options = {
+         :collections => 'collections.yml',
+         :sql => 'postgres:///',
+         :mongo => 'mongodb://localhost',
+         :verbose => 0
+       }
+       optparse = OptionParser.new do |opts|
+         opts.banner = "Usage: #{$0} [options] "
+
+         opts.on('-h', '--help', "Display this message") do
+           puts opts
+           exit(0)
+         end
+
+         opts.on('-v', "Increase verbosity") do
+           @options[:verbose] += 1
+         end
+
+         opts.on("-c", "--collections [collections.yml]", "Collection map YAML file") do |file|
+           @options[:collections] = file
+         end
+
+         opts.on("--sql [sqluri]", "SQL server to connect to") do |uri|
+           @options[:sql] = uri
+         end
+
+         opts.on("--mongo [mongouri]", "Mongo connection string") do |uri|
+           @options[:mongo] = uri
+         end
+
+         opts.on("--schema [schema]", "PostgreSQL 'schema' to namespace tables") do |schema|
+           @options[:schema] = schema
+         end
+
+         opts.on("--ignore-delete", "Ignore delete operations when tailing") do
+           @options[:ignore_delete] = true
+         end
+
+         opts.on("--tail-from [timestamp]", "Start tailing from the specified UNIX timestamp") do |ts|
+           @options[:tail_from] = ts
+         end
+
+         opts.on("--service [service]", "Service name to use when storing tailing state") do |service|
+           @options[:service] = service
+         end
+
+         opts.on("--skip-tail", "Don't tail the oplog, just do the initial import") do
+           @options[:skip_tail] = true
+         end
+
+         opts.on("--reimport", "Force a data re-import") do
+           @options[:reimport] = true
+         end
+       end
+
+       optparse.parse!(@args)
+
+       log = Log4r::Logger.new('Stripe')
+       log.outputters = Log4r::StdoutOutputter.new(STDERR)
+       if options[:verbose] >= 1
+         log.level = Log4r::DEBUG
+       else
+         log.level = Log4r::INFO
+       end
+     end
+
+     def connect_mongo
+       @mongo = Mongo::Connection.from_uri(options[:mongo])
+       config = @mongo['admin'].command(:ismaster => 1)
+       if !config['setName']
+         log.warn("`#{options[:mongo]}' is not a replset. Proceeding anyways...")
+       end
+       options[:service] ||= config['setName']
+     end
+
+     def connect_sql
+       @sql = MoSQL::SQLAdapter.new(@schemamap, options[:sql], options[:schema])
+       if options[:verbose] >= 2
+         @sql.db.sql_log_level = :debug
+         @sql.db.loggers << Logger.new($stderr)
+       end
+     end
+
+     def load_collections
+       collections = YAML.load(File.read(@options[:collections]))
+       @schemamap = MoSQL::Schema.new(collections)
+     end
+
+     def run
+       parse_args
+       load_collections
+       connect_sql
+       connect_mongo
+
+       metadata_table = MoSQL::Tailer.create_table(@sql.db, 'mosql_tailers')
+
+       @tailer = MoSQL::Tailer.new([@mongo], :existing, metadata_table,
+                                   :service => options[:service])
+
+       if options[:reimport] || tailer.read_timestamp.seconds == 0
+         initial_import
+       end
+
+       optail
+     end
+
+     # Helpers
+
+     def collection_for_ns(ns)
+       dbname, collection = ns.split(".", 2)
+       @mongo.db(dbname).collection(collection)
+     end
+
+     def bulk_upsert(table, ns, items)
+       begin
+         @schemamap.copy_data(table.db, ns, items)
+       rescue Sequel::DatabaseError => e
+         log.debug("Bulk insert error (#{e}), attempting individual upserts...")
+         cols = @schemamap.all_columns(@schemamap.find_ns(ns))
+         items.each do |it|
+           h = {}
+           cols.zip(it).each { |k,v| h[k] = v }
+           @sql.upsert(table, h)
+         end
+       end
+     end
+
+     def with_retries(tries=10)
+       tries.times do |try|
+         begin
+           yield
+         rescue Mongo::ConnectionError, Mongo::ConnectionFailure, Mongo::OperationFailure => e
+           # Duplicate key error
+           raise if e.kind_of?(Mongo::OperationFailure) && [11000, 11001].include?(e.error_code)
+           # Cursor timeout
+           raise if e.kind_of?(Mongo::OperationFailure) && e.message =~ /^Query response returned CURSOR_NOT_FOUND/
+           # Exponential backoff: 0.5s, 0.75s, 1.125s, ...
+           delay = 0.5 * (1.5 ** try)
+           log.warn("Mongo exception: #{e}, sleeping #{delay}s...")
+           sleep(delay)
+         end
+       end
+     end
+
+     def track_time
+       start = Time.now
+       yield
+       Time.now - start
+     end
+
+     def initial_import
+       @schemamap.create_schema(@sql.db, true)
+
+       start_ts = @mongo['local']['oplog.rs'].find_one({}, {:sort => [['$natural', -1]]})['ts']
+
+       want_dbs = @schemamap.all_mongo_dbs & @mongo.database_names
+       want_dbs.each do |dbname|
+         log.info("Importing for Mongo DB #{dbname}...")
+         db = @mongo.db(dbname)
+         want = Set.new(@schemamap.collections_for_mongo_db(dbname))
+         db.collections.select { |c| want.include?(c.name) }.each do |collection|
+           ns = "#{dbname}.#{collection.name}"
+           import_collection(ns, collection)
+           exit(0) if @done
+         end
+       end
+
+       tailer.write_timestamp(start_ts)
+     end
+
+     def import_collection(ns, collection)
+       log.info("Importing for #{ns}...")
+       count = 0
+       batch = []
+       table = @sql.table_for_ns(ns)
+       table.truncate
+
+       start = Time.now
+       sql_time = 0
+       collection.find(nil, :batch_size => BATCH) do |cursor|
+         with_retries do
+           cursor.each do |obj|
+             batch << @schemamap.transform(ns, obj)
+             count += 1
+
+             if batch.length >= BATCH
+               sql_time += track_time do
+                 bulk_upsert(table, ns, batch)
+               end
+               elapsed = Time.now - start
+               log.info("Imported #{count} rows (#{elapsed}s, #{sql_time}s SQL)...")
+               batch.clear
+               exit(0) if @done
+             end
+           end
+         end
+       end
+
+       unless batch.empty?
+         bulk_upsert(table, ns, batch)
+       end
+     end
+
+     def optail
+       return if options[:skip_tail]
+
+       tailer.tail_from(options[:tail_from] ?
+                        BSON::Timestamp.new(options[:tail_from].to_i, 0) :
+                        nil)
+       until @done
+         tailer.stream(1000) do |op|
+           handle_op(op)
+         end
+       end
+     end
+
+     def sync_object(ns, _id)
+       obj = collection_for_ns(ns).find_one({:_id => _id})
+       if obj
+         @sql.upsert_ns(ns, obj)
+       else
+         @sql.table_for_ns(ns).where(:_id => _id).delete()
+       end
+     end
+
+     def handle_op(op)
+       log.debug("processing op: #{op.inspect}")
+       unless op['ns'] && op['op']
+         log.warn("Weird op: #{op.inspect}")
+         return
+       end
+
+       unless @schemamap.find_ns(op['ns'])
+         log.debug("Skipping op for unknown ns #{op['ns']}...")
+         return
+       end
+
+       ns = op['ns']
+       dbname, collection_name = ns.split(".", 2)
+
+       case op['op']
+       when 'n'
+         log.debug("Skipping no-op #{op.inspect}")
+       when 'i'
+         if collection_name == 'system.indexes'
+           log.info("Skipping index update: #{op.inspect}")
+         else
+           @sql.upsert_ns(ns, op['o'])
+         end
+       when 'u'
+         selector = op['o2']
+         update = op['o']
+         if update.keys.any? { |k| k.start_with? '$' }
+           log.debug("resync #{ns}: #{selector['_id']} (update was: #{update.inspect})")
+           sync_object(ns, selector['_id'])
+         else
+           log.debug("upsert #{ns}: _id=#{update['_id']}")
+           @sql.upsert_ns(ns, update)
+         end
+       when 'd'
+         if options[:ignore_delete]
+           log.debug("Ignoring delete op on #{ns} as instructed.")
+         else
+           @sql.table_for_ns(ns).where(:_id => op['o']['_id']).delete
+         end
+       else
+         log.info("Skipping unknown op #{op.inspect}")
+       end
+     end
+   end
+ end