purplelight 0.1.0 → 0.1.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +216 -3
- data/bin/purplelight +109 -0
- data/lib/purplelight/manifest.rb +12 -1
- data/lib/purplelight/snapshot.rb +36 -15
- data/lib/purplelight/version.rb +1 -1
- data/lib/purplelight/writer_csv.rb +180 -0
- data/lib/purplelight/writer_jsonl.rb +40 -11
- data/lib/purplelight/writer_parquet.rb +137 -0
- metadata +39 -7
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: '07534009e367f28d3374708991cb870f5fa168ee11a95142af8d357885af7abc'
+  data.tar.gz: e665d587dea94999326c0c42e88d2bcfd99bae01e305aee9e3051d3ddcd266e2
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e4cabf4d438a8afa0d00902aa07b01320013f4e8588630fb2d5c4f9b2432e1910ac94c19ccce78bdb396a415ac3ea949527b83d22eadfbc036520656c4273869
+  data.tar.gz: b038e1fa40f36e985571019d7b4d7fe9c5013ea17314640f8d484ca9cbbb68292af08c5f91b1a006f34c473a24f3b83b03aece33d333685afb749437ac920ca4
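
If you want to check a locally downloaded copy of the gem against the published digests above, a minimal Ruby sketch (the artifact paths are illustrative assumptions, not part of the release):

```ruby
# Sketch: compare local gem artifacts against the published SHA256 values above.
# The file names/paths are assumptions for illustration.
require 'digest'

expected_sha256 = {
  'metadata.gz' => '07534009e367f28d3374708991cb870f5fa168ee11a95142af8d357885af7abc',
  'data.tar.gz' => 'e665d587dea94999326c0c42e88d2bcfd99bae01e305aee9e3051d3ddcd266e2'
}

expected_sha256.each do |file, expected|
  actual = Digest::SHA256.file(file).hexdigest
  puts "#{file}: #{actual == expected ? 'OK' : 'MISMATCH'}"
end
```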
data/README.md
CHANGED
@@ -4,10 +4,18 @@ Snapshot MongoDB collections efficiently from Ruby with resumable, partitioned e
 
 ### Install
 
+Purplelight is published on RubyGems: [purplelight on RubyGems](https://rubygems.org/gems/purplelight).
+
 Add to your Gemfile:
 
 ```ruby
-gem 'purplelight'
+gem 'purplelight', '~> 0.1.2'
+```
+
+Or install directly:
+
+```bash
+gem install purplelight
 ```
 
 ### Quick start
@@ -33,6 +41,60 @@ Purplelight.snapshot(
 )
 ```
 
+### Filtering with `query`
+
+`query` is passed directly to MongoDB as the filter for the collection read. Use standard MongoDB query operators.
+
+Ruby examples:
+
+```ruby
+# Equality
+query: { status: 'active' }
+
+# Ranges
+query: { created_at: { '$gte' => Time.parse('2025-01-01'), '$lt' => Time.parse('2025-02-01') } }
+
+# $in / $nin
+query: { type: { '$in' => %w[user admin] } }
+
+# Nested fields (dot-notation also supported in Mongo)
+query: { 'profile.country' => 'US' }
+
+# By ObjectId boundary (works great with _id partitions)
+query: { _id: { '$gt' => BSON::ObjectId.from_time(Time.utc(2024, 1, 1)) } }
+```
+
+CLI examples (JSON):
+
+```bash
+# Equality
+--query '{"status":"active"}'
+
+# Date/time range (ISO8601 strings your app can parse downstream)
+--query '{"created_at":{"$gte":"2025-01-01T00:00:00Z","$lt":"2025-02-01T00:00:00Z"}}'
+
+# Nested field
+--query '{"profile.country":"US"}'
+
+# IN list
+--query '{"type":{"$in":["user","admin"]}}'
+```
+
+Notes:
+- Ensure values are serializable; when using Ruby, you can pass native `Time`, `BSON::ObjectId`, etc.
+- Consider adding an appropriate index to match your `query` and pass `hint:` to force indexed scans when needed:
+
+```ruby
+Purplelight.snapshot(
+  client: client,
+  collection: 'events',
+  output: '/data/exports',
+  format: :jsonl,
+  query: { created_at: { '$gte' => Time.parse('2025-01-01') } },
+  hint: { created_at: 1 }
+)
+```
+
 Outputs files like:
 
 ```
@@ -42,7 +104,158 @@ Outputs files like:
 users.manifest.json
 ```
 
-###
+### CSV usage (single-file)
+
+```ruby
+Purplelight.snapshot(
+  client: client,
+  collection: 'users',
+  output: '/data/exports',
+  format: :csv,
+  sharding: { mode: :single_file, prefix: 'users' },
+  resume: { enabled: true }
+)
+```
+
+### Parquet usage (requires Arrow and Parquet gems)
+
+Add optional dependencies:
+
+```ruby
+# Gemfile
+group :parquet do
+  gem 'red-arrow', '~> 15.0'
+  gem 'red-parquet', '~> 15.0'
+end
+```
+
+Then:
+
+```ruby
+Purplelight.snapshot(
+  client: client,
+  collection: 'users',
+  output: '/data/exports',
+  format: :parquet,
+  sharding: { mode: :single_file, prefix: 'users' },
+  resume: { enabled: true }
+)
+```
+
+### CLI
+
+```bash
+bundle exec bin/purplelight \
+  --uri "$MONGO_URL" \
+  --db mydb --collection users \
+  --output /data/exports \
+  --format jsonl --partitions 8 --by-size $((256*1024*1024)) --prefix users
+```
+
+### Architecture
+
+```mermaid
+flowchart LR
+  A[Partition planner] -->|filters/ranges| B[Reader pool - threads]
+  B -->|batches| C[Byte-bounded queue]
+  C --> D[Serializer]
+  D -->|JSONL/CSV/Parquet| E[Sink with size-based rotation]
+  E --> F[Parts + Manifest]
 
-
+  subgraph Concurrency
+    B
+    C
+    D
+  end
 
+  subgraph Resume
+    F -->|checkpoints| A
+  end
+```
+
+Key points:
+- Partitions default to contiguous `_id` ranges with sorted reads and `no_cursor_timeout`.
+- Readers stream batches into a bounded, byte-aware queue to provide backpressure.
+- Writers serialize to JSONL/CSV/Parquet with default zstd compression and rotate by target size.
+- A manifest records parts and per-partition checkpoints for safe resume.
+
+### Tuning for performance
+
+- Partitions: start with `2 × cores` (default). Increase gradually if reads are underutilized; too high can add overhead.
+- Batch size: 2k–10k usually works well. Larger batches reduce cursor roundtrips, but can raise latency/memory.
+- Queue size: increase to 256–512MB to reduce backpressure on readers for fast disks.
+- Compression: use `:zstd` for good ratio; for max speed, try `:gzip` with low level.
+- Rotation size: larger (512MB–1GB) reduces finalize overhead for many parts.
+- Read preference: offload to secondaries or tagged analytics nodes when available.
+
+Benchmarking (optional):
+
+```bash
+# 1M docs benchmark with tunables
+BENCH=1 BENCH_PARTITIONS=16 BENCH_BATCH_SIZE=8000 BENCH_QUEUE_MB=512 BENCH_ROTATE_MB=512 BENCH_COMPRESSION=gzip \
+  bundle exec rspec spec/benchmark_perf_spec.rb --format doc
+```
+
+### Read preference and node pinning
+
+You can direct reads to non-primary members or specific tagged nodes in a replica set (e.g., MongoDB Atlas analytics nodes) via `read_preference`.
+
+Programmatic examples:
+
+```ruby
+# Secondary reads
+Purplelight.snapshot(
+  client: client,
+  collection: 'events',
+  output: '/data/exports',
+  format: :jsonl,
+  read_preference: :secondary
+)
+
+# Pin to tagged nodes (Atlas analytics nodes)
+Purplelight.snapshot(
+  client: client,
+  collection: 'events',
+  output: '/data/exports',
+  format: :jsonl,
+  read_preference: { mode: :secondary, tag_sets: [{ 'nodeType' => 'ANALYTICS' }] }
+)
+```
+
+Notes:
+- `read_preference` accepts a symbol (mode) or a full hash with `mode` and optional `tag_sets`.
+- Use tags that exist on your cluster. Atlas analytics nodes can be targeted with `{ 'nodeType' => 'ANALYTICS' }`.
+
+CLI examples:
+
+```bash
+# Secondary reads
+bundle exec bin/purplelight \
+  --uri "$MONGO_URL" --db mydb --collection events --output /data/exports \
+  --format jsonl --read-preference secondary
+
+# Pin to tagged nodes (Atlas analytics nodes)
+bundle exec bin/purplelight \
+  --uri "$MONGO_URL" --db mydb --collection events --output /data/exports \
+  --format jsonl --read-preference secondary \
+  --read-tags nodeType=ANALYTICS,region=EAST
+
+# Inspect effective read preference without running
+bundle exec bin/purplelight \
+  --uri "$MONGO_URL" --db mydb --collection events --output /tmp \
+  --read-preference secondary --read-tags nodeType=ANALYTICS --dry-run
+```
+
+### Quick Benchmark
+```
+% bash -lc 'BENCH=1 BENCH_PARTITIONS=16 BENCH_BATCH_SIZE=8000 BENCH_QUEUE_MB=512 BENCH_ROTATE_MB=512 BENCH_COMPRESSION=gzip bundle exec rspec spec/benchmark_perf_spec.rb --format doc | cat'
+
+Performance benchmark (1M docs, gated by BENCH=1)
+W, [2025-09-03T16:10:40.437304 #33546] WARN -- : MONGODB | Error checking 127.0.0.1:27018: Mongo::Error::SocketError: Errno::ECONNREFUSED: Connection refused - connect(2) for 127.0.0.1:27018 (for 127.0.0.1:27018 (no TLS)) (on 127.0.0.1:27018)
+Benchmark results:
+Inserted: 1000000 docs in 8.16s
+Exported: 1000000 docs in 8.21s
+Parts: 1, Bytes: 10646279
+Throughput: 121729.17 docs/s, 1.24 MB/s
+Settings: partitions=16, batch_size=8000, queue_mb=512, rotate_mb=512, compression=gzip
+```
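
The tuning knobs listed in the README above correspond to keyword arguments visible in `snapshot.rb` later in this diff (`partitions`, `batch_size`, `queue_size_bytes`, `rotate_bytes`, `compression`). A hedged sketch of a tuned call; the values are illustrative, not recommendations:

```ruby
# Illustrative only: values are examples drawn from the tuning notes above.
Purplelight.snapshot(
  client: client,
  collection: 'events',
  output: '/data/exports',
  format: :jsonl,
  compression: :zstd,                  # or :gzip for maximum speed
  partitions: 16,                      # ~2x cores is the documented starting point
  batch_size: 8_000,                   # 2k-10k per the tuning notes
  queue_size_bytes: 512 * 1024 * 1024, # larger queue reduces reader backpressure
  rotate_bytes: 512 * 1024 * 1024      # larger parts reduce finalize overhead
)
```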
data/bin/purplelight
ADDED
@@ -0,0 +1,109 @@
+#!/usr/bin/env ruby
+
+require 'optparse'
+require 'json'
+require 'mongo'
+require_relative '../lib/purplelight'
+
+options = {
+  format: :jsonl,
+  compression: :zstd,
+  partitions: nil,
+  batch_size: 2000,
+  output: nil,
+  query: {},
+  sharding: { mode: :by_size, part_bytes: 256 * 1024 * 1024, prefix: nil },
+  resume: { enabled: true },
+  read_preference: nil,
+  read_tags: nil,
+  dry_run: false
+}
+
+parser = OptionParser.new do |opts|
+  opts.banner = "Usage: purplelight snapshot [options]"
+
+  opts.on('-u', '--uri URI', 'MongoDB connection URI (required)') { |v| options[:uri] = v }
+  opts.on('-d', '--db NAME', 'Database name (required)') { |v| options[:db] = v }
+  opts.on('-c', '--collection NAME', 'Collection name (required)') { |v| options[:collection] = v }
+  opts.on('-o', '--output PATH', 'Output directory or file (required)') { |v| options[:output] = v }
+  opts.on('-f', '--format FORMAT', 'Format: jsonl|csv|parquet (default jsonl)') { |v| options[:format] = v.to_sym }
+  opts.on('--compression NAME', 'Compression: zstd|gzip|none') { |v| options[:compression] = v.to_sym }
+  opts.on('--partitions N', Integer, 'Number of partitions') { |v| options[:partitions] = v }
+  opts.on('--batch-size N', Integer, 'Mongo batch size (default 2000)') { |v| options[:batch_size] = v }
+  opts.on('--by-size BYTES', Integer, 'Shard by size (bytes); default 268435456') { |v| options[:sharding] = { mode: :by_size, part_bytes: v } }
+  opts.on('--single-file', 'Write a single output file') { options[:sharding] = { mode: :single_file } }
+  opts.on('--prefix NAME', 'Output file prefix') do |v|
+    options[:sharding] ||= {}
+    options[:sharding][:prefix] = v
+  end
+  opts.on('-q', '--query JSON', 'Filter query as JSON') { |v| options[:query] = JSON.parse(v) }
+  opts.on('--read-preference MODE', 'Read preference mode: primary|primary_preferred|secondary|secondary_preferred|nearest') { |v| options[:read_preference] = v.to_sym }
+  opts.on('--read-tags TAGS', 'Comma-separated key=value list to target tagged nodes (e.g., nodeType=ANALYTICS,region=EAST)') do |v|
+    tags = {}
+    v.split(',').each do |pair|
+      k, val = pair.split('=', 2)
+      next if k.nil? || val.nil?
+      tags[k] = val
+    end
+    options[:read_tags] = tags unless tags.empty?
+  end
+  opts.on('--dry-run', 'Parse options and print effective read preference JSON, then exit') { options[:dry_run] = true }
+  opts.on('--version', 'Show version') do
+    puts Purplelight::VERSION
+    exit 0
+  end
+  opts.on('-h', '--help', 'Show help') do
+    puts opts
+    exit 0
+  end
+end
+
+begin
+  parser.parse!(ARGV)
+rescue OptionParser::ParseError => e
+  warn e.message
+  warn parser
+  exit 1
+end
+
+%i[uri db collection output].each do |k|
+  if options[k].nil? || options[k].to_s.empty?
+    warn "Missing required option: --#{k}"
+    warn parser
+    exit 1
+  end
+end
+
+effective_read = nil
+if options[:read_tags]
+  effective_read = { mode: (options[:read_preference] || :secondary), tag_sets: [options[:read_tags]] }
+elsif options[:read_preference]
+  effective_read = { mode: options[:read_preference] }
+end
+
+if options[:dry_run]
+  puts JSON.generate({ read_preference: effective_read })
+  exit 0
+end
+
+client = Mongo::Client.new(options[:uri])
+options[:partitions] ||= (Etc.respond_to?(:nprocessors) ? [Etc.nprocessors * 2, 4].max : 4)
+
+ok = Purplelight.snapshot(
+  client: client.use(options[:db]),
+  collection: options[:collection],
+  output: options[:output],
+  format: options[:format],
+  compression: options[:compression],
+  partitions: options[:partitions],
+  batch_size: options[:batch_size],
+  query: options[:query],
+  sharding: options[:sharding],
+  read_preference: effective_read || options[:read_preference],
+  resume: { enabled: true },
+  on_progress: ->(s) { $stderr.puts("progress: #{s.to_json}") }
+)
+
+exit(ok ? 0 : 1)
+
+
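
From the option handling above, `--dry-run` only prints the effective read preference as JSON and exits. A small Ruby sketch of that same logic; the input values stand in for parsed CLI flags:

```ruby
# Mirrors the CLI's effective read preference construction shown above.
require 'json'

read_preference = :secondary              # as if passed via --read-preference
read_tags = { 'nodeType' => 'ANALYTICS' } # as if passed via --read-tags

effective_read =
  if read_tags
    { mode: read_preference || :secondary, tag_sets: [read_tags] }
  elsif read_preference
    { mode: read_preference }
  end

puts JSON.generate({ read_preference: effective_read })
# => {"read_preference":{"mode":"secondary","tag_sets":[{"nodeType":"ANALYTICS"}]}}
```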
data/lib/purplelight/manifest.rb
CHANGED
@@ -32,6 +32,7 @@ module Purplelight
         'partitions' => []
       }
       @mutex = Mutex.new
+      @last_save_at = Time.now
     end
 
     def self.load(path)
@@ -102,7 +103,7 @@ module Purplelight
         part = @data['parts'][index]
         part['rows'] += rows_delta
         part['bytes'] += bytes_delta
-
+        save_maybe!
       end
     end
 
@@ -122,6 +123,16 @@ module Purplelight
     def partitions
       @data['partitions']
     end
+
+    private
+
+    def save_maybe!(interval_seconds: 2.0)
+      now = Time.now
+      if (now - @last_save_at) >= interval_seconds
+        save!
+        @last_save_at = now
+      end
+    end
   end
 end
 
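
The new `@last_save_at` field and `save_maybe!` helper throttle manifest writes to roughly one save every two seconds instead of one per progress update. A standalone sketch of the same pattern; the class and method names here are illustrative, not part of the gem:

```ruby
# Time-throttled persistence, as in Manifest#save_maybe! above (illustrative names).
class ThrottledSaver
  def initialize(interval_seconds: 2.0)
    @interval = interval_seconds
    @last_save_at = Time.now
  end

  def maybe_save
    now = Time.now
    return unless (now - @last_save_at) >= @interval
    save!
    @last_save_at = now
  end

  def save!
    # write state to disk here
  end
end
```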
data/lib/purplelight/snapshot.rb
CHANGED
@@ -6,6 +6,8 @@ require 'fileutils'
 require_relative 'partitioner'
 require_relative 'queue'
 require_relative 'writer_jsonl'
+require_relative 'writer_csv'
+require_relative 'writer_parquet'
 require_relative 'manifest'
 require_relative 'errors'
 
@@ -16,9 +18,9 @@ module Purplelight
       compression: :zstd,
       batch_size: 2_000,
       partitions: [Etc.respond_to?(:nprocessors) ? [Etc.nprocessors * 2, 4].max : 4, 32].min,
-      queue_size_bytes:
+      queue_size_bytes: 256 * 1024 * 1024,
       rotate_bytes: 256 * 1024 * 1024,
-      read_concern: :majority,
+      read_concern: { level: :majority },
       read_preference: :primary,
       no_cursor_timeout: true
     }
@@ -79,7 +81,7 @@ module Purplelight
       end
 
       manifest.configure!(collection: @collection.name, format: @format, compression: @compression, query_digest: query_digest, options: {
-        partitions: @partitions, batch_size: @batch_size, rotate_bytes: @rotate_bytes
+        partitions: @partitions, batch_size: @batch_size, rotate_bytes: @rotate_bytes, hint: @hint
       })
       manifest.ensure_partitions!(@partitions)
 
@@ -90,12 +92,18 @@ module Purplelight
       queue = ByteQueue.new(max_bytes: @queue_size_bytes)
 
       # Writer
-      case @format
-
-
-
-
-
+      writer = case @format
+               when :jsonl
+                 WriterJSONL.new(directory: dir, prefix: prefix, compression: @compression, rotate_bytes: @rotate_bytes, logger: @logger, manifest: manifest)
+               when :csv
+                 single_file = (@sharding && @sharding[:mode].to_s == 'single_file')
+                 WriterCSV.new(directory: dir, prefix: prefix, compression: @compression, rotate_bytes: @rotate_bytes, logger: @logger, manifest: manifest, single_file: single_file)
+               when :parquet
+                 single_file = (@sharding && @sharding[:mode].to_s == 'single_file')
+                 WriterParquet.new(directory: dir, prefix: prefix, compression: @compression, logger: @logger, manifest: manifest, single_file: single_file)
+               else
+                 raise ArgumentError, "format not implemented: #{@format}"
+               end
 
       # Start reader threads
       readers = partition_filters.each_with_index.map do |pf, idx|
@@ -151,7 +159,7 @@ module Purplelight
     def read_partition(idx:, filter_spec:, queue:, batch_size:, manifest:)
       filter = filter_spec[:filter]
       sort = filter_spec[:sort] || { _id: 1 }
-      hint = filter_spec[:hint] || { _id: 1 }
+      hint = @hint || filter_spec[:hint] || { _id: 1 }
 
       # Resume from checkpoint if present
       checkpoint = manifest.partitions[idx] && manifest.partitions[idx]['last_id_exclusive']
@@ -164,11 +172,18 @@ module Purplelight
       opts[:projection] = @projection if @projection
       opts[:batch_size] = batch_size if batch_size
       opts[:no_cursor_timeout] = @no_cursor_timeout
-
-
+      # Read preference can be a symbol (mode) or a full hash with tag_sets
+      if @read_preference
+        opts[:read] = @read_preference.is_a?(Hash) ? @read_preference : { mode: @read_preference }
+      end
+      # Mongo driver expects read_concern as a hash like { level: :majority }
+      if @read_concern
+        opts[:read_concern] = @read_concern.is_a?(Hash) ? @read_concern : { level: @read_concern }
+      end
 
       cursor = @collection.find(filter, opts)
 
+      encode_lines = (@format == :jsonl)
       buffer = []
       buffer_bytes = 0
       last_id = checkpoint
@@ -176,9 +191,15 @@ module Purplelight
       cursor.each do |doc|
         last_id = doc['_id']
         doc = @mapper.call(doc) if @mapper
-
-
-
+        if encode_lines
+          line = Oj.dump(doc, mode: :compat) + "\n"
+          bytes = line.bytesize
+          buffer << line
+        else
+          # For CSV/Parquet keep raw docs to allow schema/row building
+          bytes = (Oj.dump(doc, mode: :compat).bytesize + 1)
+          buffer << doc
+        end
         buffer_bytes += bytes
         if buffer.length >= batch_size || buffer_bytes >= 1_000_000
           queue.push(buffer, bytes: buffer_bytes)
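
The cursor-option changes above normalize `read_preference` and `read_concern` so a caller may pass either a bare symbol or a full hash. A sketch of that normalization in isolation; the helper name is illustrative:

```ruby
# Normalization as performed in read_partition above (sketch).
def normalize_read_options(read_preference, read_concern)
  opts = {}
  opts[:read] = read_preference.is_a?(Hash) ? read_preference : { mode: read_preference } if read_preference
  opts[:read_concern] = read_concern.is_a?(Hash) ? read_concern : { level: read_concern } if read_concern
  opts
end

normalize_read_options(:secondary, :majority)
# => { read: { mode: :secondary }, read_concern: { level: :majority } }
```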
data/lib/purplelight/version.rb
CHANGED

data/lib/purplelight/writer_csv.rb
ADDED
@@ -0,0 +1,180 @@
+# frozen_string_literal: true
+
+require 'csv'
+require 'oj'
+require 'zlib'
+require 'fileutils'
+
+begin
+  require 'zstds'
+rescue LoadError
+end
+
+module Purplelight
+  class WriterCSV
+    DEFAULT_ROTATE_BYTES = 256 * 1024 * 1024
+
+    def initialize(directory:, prefix:, compression: :zstd, rotate_bytes: DEFAULT_ROTATE_BYTES, logger: nil, manifest: nil, single_file: false, columns: nil, headers: true)
+      @directory = directory
+      @prefix = prefix
+      @compression = compression
+      @rotate_bytes = rotate_bytes
+      @logger = logger
+      @manifest = manifest
+      @single_file = single_file
+
+      @columns = columns&.map(&:to_s)
+      @headers = headers
+
+      @part_index = nil
+      @io = nil
+      @csv = nil
+      @bytes_written = 0
+      @rows_written = 0
+      @file_seq = 0
+      @closed = false
+
+      @effective_compression = determine_effective_compression(@compression)
+      if @effective_compression.to_s != @compression.to_s
+        @logger&.warn("requested compression '#{@compression}' not available; using '#{@effective_compression}'")
+      end
+    end
+
+    def write_many(array_of_docs)
+      ensure_open!
+
+      # infer columns if needed from docs
+      if @columns.nil?
+        sample_docs = array_of_docs.is_a?(Array) ? array_of_docs : []
+        sample_docs = sample_docs.reject { |d| d.is_a?(String) }
+        @columns = infer_columns(sample_docs)
+        @csv << @columns if @headers
+      end
+
+      array_of_docs.each do |doc|
+        next if doc.is_a?(String)
+        row = @columns.map { |k| extract_value(doc, k) }
+        @csv << row
+        @rows_written += 1
+      end
+      @manifest&.add_progress_to_part!(index: @part_index, rows_delta: array_of_docs.size, bytes_delta: 0)
+
+      rotate_if_needed
+    end
+
+    def rotate_if_needed
+      return if @single_file
+      return if @rotate_bytes.nil?
+      raw_bytes = @io.respond_to?(:pos) ? @io.pos : @bytes_written
+      return if raw_bytes < @rotate_bytes
+      rotate!
+    end
+
+    def close
+      return if @closed
+      if @csv
+        @csv.flush
+      end
+      if @io
+        finalize_current_part!
+        @io.close
+      end
+      @closed = true
+    end
+
+    private
+
+    def ensure_open!
+      return if @io
+      FileUtils.mkdir_p(@directory)
+      path = next_part_path
+      @part_index = @manifest&.open_part!(path) if @manifest
+      raw = File.open(path, 'wb')
+      @io = build_compressed_io(raw)
+      @csv = CSV.new(@io)
+      @bytes_written = 0
+      @rows_written = 0
+    end
+
+    def build_compressed_io(raw)
+      case @effective_compression.to_s
+      when 'zstd'
+        if defined?(ZSTDS)
+          return ZSTDS::Writer.open(raw, level: 10)
+        else
+          @logger&.warn("zstd gem not loaded; using gzip")
+          return Zlib::GzipWriter.new(raw)
+        end
+      when 'gzip'
+        return Zlib::GzipWriter.new(raw)
+      when 'none'
+        return raw
+      else
+        raise ArgumentError, "unknown compression: #{@effective_compression}"
+      end
+    end
+
+    def rotate!
+      return unless @io
+      finalize_current_part!
+      @io.close
+      @io = nil
+      @csv = nil
+      ensure_open!
+    end
+
+    def finalize_current_part!
+      # Avoid flushing compressed writer explicitly to prevent Zlib::BufError; close will finish the stream.
+      @manifest&.complete_part!(index: @part_index, checksum: nil)
+      @file_seq += 1 unless @single_file
+    end
+
+    def next_part_path
+      ext = 'csv'
+      if @single_file
+        filename = format("%s.%s", @prefix, ext)
+      else
+        filename = format("%s-part-%06d.%s", @prefix, @file_seq, ext)
+      end
+      filename += ".zst" if @effective_compression.to_s == 'zstd'
+      filename += ".gz" if @effective_compression.to_s == 'gzip'
+      File.join(@directory, filename)
+    end
+
+    def determine_effective_compression(requested)
+      case requested.to_s
+      when 'zstd'
+        return (defined?(ZSTDS) ? :zstd : :gzip)
+      when 'gzip'
+        return :gzip
+      when 'none'
+        return :none
+      else
+        return :gzip
+      end
+    end
+
+    def infer_columns(docs)
+      keys = {}
+      docs.each do |d|
+        (d.keys - ['_id']).each { |k| keys[k.to_s] = true }
+      end
+      # Put _id first if present, then other keys sorted
+      cols = []
+      cols << '_id' if docs.first.key?('_id') || docs.first.key?(:_id)
+      cols + keys.keys.sort
+    end
+
+    def extract_value(doc, key)
+      val = doc[key] || doc[key.to_sym]
+      case val
+      when Hash, Array
+        Oj.dump(val, mode: :compat)
+      else
+        val
+      end
+    end
+  end
+end
+
+
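
Since `extract_value` JSON-encodes nested hashes and arrays with `Oj`, nested document fields end up as JSON strings inside a single CSV cell. A rough sketch of the resulting output, using sample data and simplified column inference; the real writer above also handles compression, headers, and rotation:

```ruby
# Sketch of how WriterCSV flattens documents (illustrative input data).
require 'csv'
require 'oj'

docs = [
  { '_id' => 1, 'name' => 'Ada', 'profile' => { 'country' => 'US' } },
  { '_id' => 2, 'name' => 'Linus', 'tags' => %w[kernel git] }
]

# _id first, remaining keys sorted, as in infer_columns above.
columns = ['_id'] + (docs.flat_map(&:keys).uniq - ['_id']).sort

csv = CSV.generate do |out|
  out << columns
  docs.each do |doc|
    out << columns.map do |k|
      v = doc[k]
      v.is_a?(Hash) || v.is_a?(Array) ? Oj.dump(v, mode: :compat) : v
    end
  end
end
puts csv
```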
data/lib/purplelight/writer_jsonl.rb
CHANGED
@@ -14,13 +14,14 @@ module Purplelight
   class WriterJSONL
     DEFAULT_ROTATE_BYTES = 256 * 1024 * 1024
 
-    def initialize(directory:, prefix:, compression: :zstd, rotate_bytes: DEFAULT_ROTATE_BYTES, logger: nil, manifest: nil)
+    def initialize(directory:, prefix:, compression: :zstd, rotate_bytes: DEFAULT_ROTATE_BYTES, logger: nil, manifest: nil, compression_level: nil)
       @directory = directory
       @prefix = prefix
       @compression = compression
       @rotate_bytes = rotate_bytes
       @logger = logger
       @manifest = manifest
+      @compression_level = compression_level
 
       @part_index = nil
       @io = nil
@@ -28,14 +29,26 @@ module Purplelight
       @rows_written = 0
       @file_seq = 0
       @closed = false
+
+      @effective_compression = determine_effective_compression(@compression)
+      if @effective_compression.to_s != @compression.to_s
+        @logger&.warn("requested compression '#{@compression}' not available; using '#{@effective_compression}'")
+      end
     end
 
     def write_many(array_of_docs)
       ensure_open!
-
+      # If upstream already produced newline-terminated strings, join fast.
+      if array_of_docs.first.is_a?(String)
+        buffer = array_of_docs.join
+        rows = array_of_docs.size
+      else
+        buffer = array_of_docs.map { |doc| Oj.dump(doc, mode: :compat) + "\n" }.join
+        rows = array_of_docs.size
+      end
       write_buffer(buffer)
-      @rows_written +=
-      @manifest&.add_progress_to_part!(index: @part_index, rows_delta:
+      @rows_written += rows
+      @manifest&.add_progress_to_part!(index: @part_index, rows_delta: rows, bytes_delta: buffer.bytesize)
     end
 
     def rotate_if_needed
@@ -67,17 +80,20 @@ module Purplelight
     end
 
     def build_compressed_io(raw)
-      case @
+      case @effective_compression.to_s
       when 'zstd'
         if defined?(ZSTDS)
           # ZSTDS::Writer supports IO-like interface
-
+          level = @compression_level || 3
+          return ZSTDS::Writer.open(raw, level: level)
         else
-          @logger&.warn("zstd not
-
+          @logger&.warn("zstd gem not loaded; this should have been handled earlier")
+          level = @compression_level || Zlib::DEFAULT_COMPRESSION
+          return Zlib::GzipWriter.new(raw, level)
         end
       when 'gzip'
-
+        level = @compression_level || 1
+        return Zlib::GzipWriter.new(raw, level)
       when 'none'
         return raw
       else
@@ -109,10 +125,23 @@ module Purplelight
     def next_part_path
       ext = 'jsonl'
       filename = format("%s-part-%06d.%s", @prefix, @file_seq, ext)
-      filename += ".zst" if @
-      filename += ".gz" if @
+      filename += ".zst" if @effective_compression.to_s == 'zstd'
+      filename += ".gz" if @effective_compression.to_s == 'gzip'
      File.join(@directory, filename)
     end
+
+    def determine_effective_compression(requested)
+      case requested.to_s
+      when 'zstd'
+        return (defined?(ZSTDS) ? :zstd : :gzip)
+      when 'gzip'
+        return :gzip
+      when 'none'
+        return :none
+      else
+        return :gzip
+      end
+    end
   end
 end
 
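
`determine_effective_compression` silently downgrades `:zstd` to `:gzip` when the `zstds` gem cannot be loaded, and `next_part_path` picks the file extension from the effective codec. A free-standing sketch of that decision; the helper name and sample filename are illustrative:

```ruby
# Compression fallback as in WriterJSONL/WriterCSV above (sketch).
begin
  require 'zstds'
rescue LoadError
  # fall through; gzip will be used instead
end

def effective_compression(requested)
  case requested.to_s
  when 'zstd' then defined?(ZSTDS) ? :zstd : :gzip
  when 'none' then :none
  else :gzip
  end
end

ext = { zstd: '.zst', gzip: '.gz', none: '' }.fetch(effective_compression(:zstd))
puts "users-part-000000.jsonl#{ext}"
```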
data/lib/purplelight/writer_parquet.rb
ADDED
@@ -0,0 +1,137 @@
+# frozen_string_literal: true
+
+begin
+  require 'arrow'
+  require 'parquet'
+rescue LoadError
+  # Arrow/Parquet not available; writer will refuse to run
+end
+
+require 'fileutils'
+
+module Purplelight
+  class WriterParquet
+    DEFAULT_ROW_GROUP_SIZE = 10_000
+
+    def initialize(directory:, prefix:, compression: :zstd, row_group_size: DEFAULT_ROW_GROUP_SIZE, logger: nil, manifest: nil, single_file: true, schema: nil)
+      @directory = directory
+      @prefix = prefix
+      @compression = compression
+      @row_group_size = row_group_size
+      @logger = logger
+      @manifest = manifest
+      @single_file = single_file
+      @schema = schema
+
+      @closed = false
+      @file_seq = 0
+      @part_index = nil
+
+      ensure_dependencies!
+      reset_buffers
+    end
+
+    def write_many(array_of_docs)
+      ensure_open!
+      array_of_docs.each { |doc| @buffer_docs << doc }
+      @manifest&.add_progress_to_part!(index: @part_index, rows_delta: array_of_docs.length, bytes_delta: 0)
+    end
+
+    def close
+      return if @closed
+      ensure_open!
+      if !@buffer_docs.empty?
+        table = build_table(@buffer_docs)
+        write_table(table, @writer_path, append: false)
+      end
+      finalize_current_part!
+      @closed = true
+    end
+
+    private
+
+    def ensure_dependencies!
+      unless defined?(Arrow) && defined?(Parquet)
+        raise ArgumentError, "Parquet support requires gems: red-arrow and red-parquet. Add them to your Gemfile."
+      end
+    end
+
+    def reset_buffers
+      @buffer_docs = []
+      @columns = nil
+      @writer_path = nil
+    end
+
+    def ensure_open!
+      return if @writer_path
+      FileUtils.mkdir_p(@directory)
+      @writer_path = next_part_path
+      @part_index = @manifest&.open_part!(@writer_path) if @manifest
+    end
+
+    # No-op; we now write once on close for simplicity
+
+    def build_table(docs)
+      # Infer columns
+      @columns ||= infer_columns(docs)
+      columns = {}
+      @columns.each do |name|
+        values = docs.map { |d| extract_value(d, name) }
+        columns[name] = Arrow::ArrayBuilder.build(values)
+      end
+      Arrow::Table.new(columns)
+    end
+
+    def write_table(table, path, append: false)
+      # Prefer Arrow's save with explicit parquet format; compression defaults per build.
+      if table.respond_to?(:save)
+        table.save(path, format: :parquet)
+        return
+      end
+      # Fallback to red-parquet writer
+      if defined?(Parquet::ArrowFileWriter)
+        writer = Parquet::ArrowFileWriter.open(table.schema, path)
+        writer.write_table(table)
+        writer.close
+        return
+      end
+      raise "Parquet writer not available in this environment"
+    end
+
+    def finalize_current_part!
+      @manifest&.complete_part!(index: @part_index, checksum: nil)
+      @file_seq += 1 unless @single_file
+      @writer_path = nil
+    end
+
+    def next_part_path
+      ext = 'parquet'
+      filename = if @single_file
+                   format("%s.%s", @prefix, ext)
+                 else
+                   format("%s-part-%06d.%s", @prefix, @file_seq, ext)
+                 end
+      File.join(@directory, filename)
+    end
+
+    def infer_columns(docs)
+      keys = {}
+      docs.each do |d|
+        d.keys.each { |k| keys[k.to_s] = true }
+      end
+      keys.keys.sort
+    end
+
+    def extract_value(doc, key)
+      val = doc[key] || doc[key.to_sym]
+      case val
+      when Time
+        val
+      else
+        val
+      end
+    end
+  end
+end
+
+
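
`WriterParquet` buffers documents and materializes a single Arrow table when it closes. A minimal sketch of the same save path, assuming the optional red-arrow and red-parquet gems are installed; the sample data and output filename are illustrative:

```ruby
# Sketch: build an Arrow table from rows and save it as Parquet,
# mirroring WriterParquet#build_table / #write_table above.
require 'arrow'
require 'parquet'

docs = [
  { 'id' => 1, 'name' => 'Ada' },
  { 'id' => 2, 'name' => 'Linus' }
]

columns = docs.flat_map(&:keys).uniq.sort
arrays = columns.to_h { |name| [name, Arrow::ArrayBuilder.build(docs.map { |d| d[name] })] }

table = Arrow::Table.new(arrays)
table.save('users.parquet', format: :parquet)
```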
metadata
CHANGED
@@ -1,10 +1,10 @@
 --- !ruby/object:Gem::Specification
 name: purplelight
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.2
 platform: ruby
 authors:
--
+- Alexander Nicholson
 bindir: bin
 cert_chain: []
 date: 1980-01-02 00:00:00.000000000 Z
@@ -37,6 +37,34 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '3.16'
+- !ruby/object:Gem::Dependency
+  name: csv
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: logger
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '1.6'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '1.6'
 - !ruby/object:Gem::Dependency
   name: rspec
   requirement: !ruby/object:Gem::Requirement
@@ -68,13 +96,15 @@ dependencies:
 description: High-throughput, resumable snapshots of MongoDB collections with partitioning,
   multi-threaded readers, and size-based sharded outputs.
 email:
--
-executables:
+- rubygems-maint@ctrl.tokyo
+executables:
+- purplelight
 extensions: []
 extra_rdoc_files: []
 files:
 - README.md
 - Rakefile
+- bin/purplelight
 - lib/purplelight.rb
 - lib/purplelight/errors.rb
 - lib/purplelight/manifest.rb
@@ -82,13 +112,15 @@ files:
 - lib/purplelight/queue.rb
 - lib/purplelight/snapshot.rb
 - lib/purplelight/version.rb
+- lib/purplelight/writer_csv.rb
 - lib/purplelight/writer_jsonl.rb
+- lib/purplelight/writer_parquet.rb
 licenses:
 - MIT
 metadata:
-  homepage_uri: https://github.com/
-  source_code_uri: https://github.com/
-  changelog_uri: https://github.com/
+  homepage_uri: https://github.com/alexandernicholson/purplelight
+  source_code_uri: https://github.com/alexandernicholson/purplelight
+  changelog_uri: https://github.com/alexandernicholson/purplelight/releases
 rdoc_options: []
 require_paths:
 - lib