purplelight 0.1.0 → 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: b62be6d2a3810b6278d43fadcee4647efb1c758c9bd04ddc69051737e66d1716
-  data.tar.gz: 2950f98c90869bcc3d6619e00adf68113e12197196f43a4c129dab2c1270e47c
+  metadata.gz: '07534009e367f28d3374708991cb870f5fa168ee11a95142af8d357885af7abc'
+  data.tar.gz: e665d587dea94999326c0c42e88d2bcfd99bae01e305aee9e3051d3ddcd266e2
 SHA512:
-  metadata.gz: 635a9c3114bc1d6a017a8244dfd6b5cca15f82f19a9e79f59480af91c0e1ec61b72e40703393098f584a2a9d29f68cac9b81f8833c14ccf8d42f527c4cb40c2c
-  data.tar.gz: f24b2faaff4218481b8fa8345842dd1fadeb9174adb92d449008637e072e81861b13bb8f71cacbcb51254fbcca18ea799ca8dfa5ce5c99f418e82262dd5505c1
+  metadata.gz: e4cabf4d438a8afa0d00902aa07b01320013f4e8588630fb2d5c4f9b2432e1910ac94c19ccce78bdb396a415ac3ea949527b83d22eadfbc036520656c4273869
+  data.tar.gz: b038e1fa40f36e985571019d7b4d7fe9c5013ea17314640f8d484ca9cbbb68292af08c5f91b1a006f34c473a24f3b83b03aece33d333685afb749437ac920ca4
data/README.md CHANGED
@@ -4,10 +4,18 @@ Snapshot MongoDB collections efficiently from Ruby with resumable, partitioned e
 
 ### Install
 
+Purplelight is published on RubyGems: [purplelight on RubyGems](https://rubygems.org/gems/purplelight).
+
 Add to your Gemfile:
 
 ```ruby
-gem 'purplelight'
+gem 'purplelight', '~> 0.1.2'
+```
+
+Or install directly:
+
+```bash
+gem install purplelight
 ```
 
 ### Quick start
@@ -33,6 +41,60 @@ Purplelight.snapshot(
 )
 ```
 
+### Filtering with `query`
+
+`query` is passed directly to MongoDB as the filter for the collection read. Use standard MongoDB query operators.
+
+Ruby examples:
+
+```ruby
+# Equality
+query: { status: 'active' }
+
+# Ranges
+query: { created_at: { '$gte' => Time.parse('2025-01-01'), '$lt' => Time.parse('2025-02-01') } }
+
+# $in / $nin
+query: { type: { '$in' => %w[user admin] } }
+
+# Nested fields (dot notation is also supported by MongoDB)
+query: { 'profile.country' => 'US' }
+
+# By ObjectId boundary (works well with _id partitions)
+query: { _id: { '$gt' => BSON::ObjectId.from_time(Time.utc(2024, 1, 1)) } }
+```
+
+CLI examples (JSON):
+
+```bash
+# Equality
+--query '{"status":"active"}'
+
+# Date/time range (ISO 8601 strings your app can parse downstream)
+--query '{"created_at":{"$gte":"2025-01-01T00:00:00Z","$lt":"2025-02-01T00:00:00Z"}}'
+
+# Nested field
+--query '{"profile.country":"US"}'
+
+# IN list
+--query '{"type":{"$in":["user","admin"]}}'
+```
+
+Notes:
+- Ensure values are serializable; from Ruby you can pass native `Time`, `BSON::ObjectId`, etc.
+- Consider adding an index that matches your `query`, and pass `hint:` to force indexed scans when needed:
+
+```ruby
+Purplelight.snapshot(
+  client: client,
+  collection: 'events',
+  output: '/data/exports',
+  format: :jsonl,
+  query: { created_at: { '$gte' => Time.parse('2025-01-01') } },
+  hint: { created_at: 1 }
+)
+```
+
 Outputs files like:
 
 ```
@@ -42,7 +104,158 @@ Outputs files like:
 users.manifest.json
 ```
 
-### Status
+### CSV usage (single-file)
+
+```ruby
+Purplelight.snapshot(
+  client: client,
+  collection: 'users',
+  output: '/data/exports',
+  format: :csv,
+  sharding: { mode: :single_file, prefix: 'users' },
+  resume: { enabled: true }
+)
+```
+
+### Parquet usage (requires Arrow and Parquet gems)
+
+Add the optional dependencies:
+
+```ruby
+# Gemfile
+group :parquet do
+  gem 'red-arrow', '~> 15.0'
+  gem 'red-parquet', '~> 15.0'
+end
+```
+
+Then:
+
+```ruby
+Purplelight.snapshot(
+  client: client,
+  collection: 'users',
+  output: '/data/exports',
+  format: :parquet,
+  sharding: { mode: :single_file, prefix: 'users' },
+  resume: { enabled: true }
+)
+```
+
+### CLI
+
+```bash
+bundle exec bin/purplelight \
+  --uri "$MONGO_URL" \
+  --db mydb --collection users \
+  --output /data/exports \
+  --format jsonl --partitions 8 --by-size $((256*1024*1024)) --prefix users
+```
+
+### Architecture
+
+```mermaid
+flowchart LR
+  A[Partition planner] -->|filters/ranges| B[Reader pool - threads]
+  B -->|batches| C[Byte-bounded queue]
+  C --> D[Serializer]
+  D -->|JSONL/CSV/Parquet| E[Sink with size-based rotation]
+  E --> F[Parts + Manifest]
 
-Phase 1 (JSONL + zstd, partitioning, resume, size-based sharding) in progress.
+  subgraph Concurrency
+    B
+    C
+    D
+  end
 
+  subgraph Resume
+    F -->|checkpoints| A
+  end
+```
+
+Key points:
+- Partitions default to contiguous `_id` ranges with sorted reads and `no_cursor_timeout`.
+- Readers stream batches into a bounded, byte-aware queue to provide backpressure.
+- Writers serialize to JSONL/CSV/Parquet, compress with zstd by default, and rotate parts by target size.
+- A manifest records parts and per-partition checkpoints for safe resume.
+
+### Tuning for performance
+
+- Partitions: start with `2 × cores` (the default). Increase gradually if reads are underutilized; too many partitions adds overhead.
+- Batch size: 2k–10k usually works well. Larger batches reduce cursor round trips but can raise latency and memory use.
+- Queue size: increase to 256–512 MB to reduce backpressure on readers when disks are fast.
+- Compression: use `:zstd` for a good ratio; for maximum speed, try `:gzip` at a low level.
+- Rotation size: larger parts (512 MB–1 GB) reduce finalize overhead when there are many parts.
+- Read preference: offload to secondaries or tagged analytics nodes when available.
+
+Benchmarking (optional):
+
+```bash
+# 1M-doc benchmark with tunables
+BENCH=1 BENCH_PARTITIONS=16 BENCH_BATCH_SIZE=8000 BENCH_QUEUE_MB=512 BENCH_ROTATE_MB=512 BENCH_COMPRESSION=gzip \
+  bundle exec rspec spec/benchmark_perf_spec.rb --format doc
+```
+
+### Read preference and node pinning
+
+You can direct reads to non-primary members or to specific tagged nodes in a replica set (e.g., MongoDB Atlas analytics nodes) via `read_preference`.
+
+Programmatic examples:
+
+```ruby
+# Secondary reads
+Purplelight.snapshot(
+  client: client,
+  collection: 'events',
+  output: '/data/exports',
+  format: :jsonl,
+  read_preference: :secondary
+)
+
+# Pin to tagged nodes (Atlas analytics nodes)
+Purplelight.snapshot(
+  client: client,
+  collection: 'events',
+  output: '/data/exports',
+  format: :jsonl,
+  read_preference: { mode: :secondary, tag_sets: [{ 'nodeType' => 'ANALYTICS' }] }
+)
+```
+
+Notes:
+- `read_preference` accepts a symbol (mode) or a full hash with `mode` and optional `tag_sets`.
+- Use tags that exist on your cluster. Atlas analytics nodes can be targeted with `{ 'nodeType' => 'ANALYTICS' }`.
+
+CLI examples:
+
+```bash
+# Secondary reads
+bundle exec bin/purplelight \
+  --uri "$MONGO_URL" --db mydb --collection events --output /data/exports \
+  --format jsonl --read-preference secondary
+
+# Pin to tagged nodes (Atlas analytics nodes)
+bundle exec bin/purplelight \
+  --uri "$MONGO_URL" --db mydb --collection events --output /data/exports \
+  --format jsonl --read-preference secondary \
+  --read-tags nodeType=ANALYTICS,region=EAST
+
+# Inspect the effective read preference without running
+bundle exec bin/purplelight \
+  --uri "$MONGO_URL" --db mydb --collection events --output /tmp \
+  --read-preference secondary --read-tags nodeType=ANALYTICS --dry-run
+```
+
+### Quick Benchmark
+```
+% bash -lc 'BENCH=1 BENCH_PARTITIONS=16 BENCH_BATCH_SIZE=8000 BENCH_QUEUE_MB=512 BENCH_ROTATE_MB=512 BENCH_COMPRESSION=gzip bundle exec rspec spec/benchmark_perf_spec.rb --format doc | cat'
+
+Performance benchmark (1M docs, gated by BENCH=1)
+W, [2025-09-03T16:10:40.437304 #33546]  WARN -- : MONGODB | Error checking 127.0.0.1:27018: Mongo::Error::SocketError: Errno::ECONNREFUSED: Connection refused - connect(2) for 127.0.0.1:27018 (for 127.0.0.1:27018 (no TLS)) (on 127.0.0.1:27018)
+Benchmark results:
+  Inserted: 1000000 docs in 8.16s
+  Exported: 1000000 docs in 8.21s
+  Parts: 1, Bytes: 10646279
+  Throughput: 121729.17 docs/s, 1.24 MB/s
+  Settings: partitions=16, batch_size=8000, queue_mb=512, rotate_mb=512, compression=gzip
+```
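The README sections above lean on `resume: { enabled: true }` throughout but never show the failure path. A minimal sketch of retrying an interrupted export, using only the `Purplelight.snapshot` arguments documented above (the retry loop itself is illustrative, not part of the gem):

```ruby
require 'mongo'
require 'purplelight'

client = Mongo::Client.new(ENV.fetch('MONGO_URL')).use('mydb')

# With resume enabled, a re-run after a crash picks up from the
# per-partition checkpoints recorded in users.manifest.json instead
# of re-reading the collection from the beginning.
attempts = 0
begin
  Purplelight.snapshot(
    client: client,
    collection: 'users',
    output: '/data/exports',
    format: :jsonl,
    resume: { enabled: true }
  )
rescue StandardError
  attempts += 1
  retry if attempts < 3
  raise
end
```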
data/bin/purplelight ADDED
@@ -0,0 +1,109 @@
+#!/usr/bin/env ruby
+
+require 'optparse'
+require 'json'
+require 'mongo'
+require_relative '../lib/purplelight'
+
+options = {
+  format: :jsonl,
+  compression: :zstd,
+  partitions: nil,
+  batch_size: 2000,
+  output: nil,
+  query: {},
+  sharding: { mode: :by_size, part_bytes: 256 * 1024 * 1024, prefix: nil },
+  resume: { enabled: true },
+  read_preference: nil,
+  read_tags: nil,
+  dry_run: false
+}
+
+parser = OptionParser.new do |opts|
+  opts.banner = "Usage: purplelight snapshot [options]"
+
+  opts.on('-u', '--uri URI', 'MongoDB connection URI (required)') { |v| options[:uri] = v }
+  opts.on('-d', '--db NAME', 'Database name (required)') { |v| options[:db] = v }
+  opts.on('-c', '--collection NAME', 'Collection name (required)') { |v| options[:collection] = v }
+  opts.on('-o', '--output PATH', 'Output directory or file (required)') { |v| options[:output] = v }
+  opts.on('-f', '--format FORMAT', 'Format: jsonl|csv|parquet (default jsonl)') { |v| options[:format] = v.to_sym }
+  opts.on('--compression NAME', 'Compression: zstd|gzip|none') { |v| options[:compression] = v.to_sym }
+  opts.on('--partitions N', Integer, 'Number of partitions') { |v| options[:partitions] = v }
+  opts.on('--batch-size N', Integer, 'Mongo batch size (default 2000)') { |v| options[:batch_size] = v }
+  opts.on('--by-size BYTES', Integer, 'Shard by size (bytes); default 268435456') { |v| options[:sharding] = { mode: :by_size, part_bytes: v } }
+  opts.on('--single-file', 'Write a single output file') { options[:sharding] = { mode: :single_file } }
+  opts.on('--prefix NAME', 'Output file prefix') do |v|
+    options[:sharding] ||= {}
+    options[:sharding][:prefix] = v
+  end
+  opts.on('-q', '--query JSON', 'Filter query as JSON') { |v| options[:query] = JSON.parse(v) }
+  opts.on('--read-preference MODE', 'Read preference mode: primary|primary_preferred|secondary|secondary_preferred|nearest') { |v| options[:read_preference] = v.to_sym }
+  opts.on('--read-tags TAGS', 'Comma-separated key=value list to target tagged nodes (e.g., nodeType=ANALYTICS,region=EAST)') do |v|
+    tags = {}
+    v.split(',').each do |pair|
+      k, val = pair.split('=', 2)
+      next if k.nil? || val.nil?
+      tags[k] = val
+    end
+    options[:read_tags] = tags unless tags.empty?
+  end
+  opts.on('--dry-run', 'Parse options and print effective read preference JSON, then exit') { options[:dry_run] = true }
+  opts.on('--version', 'Show version') do
+    puts Purplelight::VERSION
+    exit 0
+  end
+  opts.on('-h', '--help', 'Show help') do
+    puts opts
+    exit 0
+  end
+end
+
+begin
+  parser.parse!(ARGV)
+rescue OptionParser::ParseError => e
+  warn e.message
+  warn parser
+  exit 1
+end
+
+%i[uri db collection output].each do |k|
+  if options[k].nil? || options[k].to_s.empty?
+    warn "Missing required option: --#{k}"
+    warn parser
+    exit 1
+  end
+end
+
+effective_read = nil
+if options[:read_tags]
+  effective_read = { mode: (options[:read_preference] || :secondary), tag_sets: [options[:read_tags]] }
+elsif options[:read_preference]
+  effective_read = { mode: options[:read_preference] }
+end
+
+if options[:dry_run]
+  puts JSON.generate({ read_preference: effective_read })
+  exit 0
+end
+
+client = Mongo::Client.new(options[:uri])
+options[:partitions] ||= (Etc.respond_to?(:nprocessors) ? [Etc.nprocessors * 2, 4].max : 4)
+
+ok = Purplelight.snapshot(
+  client: client.use(options[:db]),
+  collection: options[:collection],
+  output: options[:output],
+  format: options[:format],
+  compression: options[:compression],
+  partitions: options[:partitions],
+  batch_size: options[:batch_size],
+  query: options[:query],
+  sharding: options[:sharding],
+  read_preference: effective_read || options[:read_preference],
+  resume: { enabled: true },
+  on_progress: ->(s) { $stderr.puts("progress: #{s.to_json}") }
+)
+
+exit(ok ? 0 : 1)
+
+
data/lib/purplelight/manifest.rb CHANGED
@@ -32,6 +32,7 @@ module Purplelight
         'partitions' => []
       }
       @mutex = Mutex.new
+      @last_save_at = Time.now
     end
 
     def self.load(path)
@@ -102,7 +103,7 @@ module Purplelight
         part = @data['parts'][index]
         part['rows'] += rows_delta
        part['bytes'] += bytes_delta
-        save!
+        save_maybe!
       end
     end
 
@@ -122,6 +123,16 @@ module Purplelight
     def partitions
       @data['partitions']
     end
+
+    private
+
+    def save_maybe!(interval_seconds: 2.0)
+      now = Time.now
+      if (now - @last_save_at) >= interval_seconds
+        save!
+        @last_save_at = now
+      end
+    end
   end
 end
 
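The `save_maybe!` addition above swaps a manifest write on every progress update for a time-based debounce, cutting write amplification on hot progress paths. A self-contained sketch of the same pattern — illustrative only, not the gem's actual class:

```ruby
# Debounced persistence: accept arbitrarily frequent progress updates,
# but flush at most once per interval (mirrors save_maybe! above).
class ThrottledSaver
  def initialize(interval_seconds: 2.0, &save)
    @interval = interval_seconds
    @save = save
    @last_save_at = Time.now
  end

  def call
    now = Time.now
    return if (now - @last_save_at) < @interval

    @save.call
    @last_save_at = now
  end
end

saver = ThrottledSaver.new(interval_seconds: 2.0) { puts "saved at #{Time.now}" }
10.times { saver.call; sleep(0.5) } # ~5s of updates yields at most 2 saves
```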
data/lib/purplelight/snapshot.rb CHANGED
@@ -6,6 +6,8 @@ require 'fileutils'
 require_relative 'partitioner'
 require_relative 'queue'
 require_relative 'writer_jsonl'
+require_relative 'writer_csv'
+require_relative 'writer_parquet'
 require_relative 'manifest'
 require_relative 'errors'
 
@@ -16,9 +18,9 @@ module Purplelight
     compression: :zstd,
     batch_size: 2_000,
     partitions: [Etc.respond_to?(:nprocessors) ? [Etc.nprocessors * 2, 4].max : 4, 32].min,
-    queue_size_bytes: 128 * 1024 * 1024,
+    queue_size_bytes: 256 * 1024 * 1024,
     rotate_bytes: 256 * 1024 * 1024,
-    read_concern: :majority,
+    read_concern: { level: :majority },
     read_preference: :primary,
     no_cursor_timeout: true
   }
@@ -79,7 +81,7 @@ module Purplelight
     end
 
     manifest.configure!(collection: @collection.name, format: @format, compression: @compression, query_digest: query_digest, options: {
-      partitions: @partitions, batch_size: @batch_size, rotate_bytes: @rotate_bytes
+      partitions: @partitions, batch_size: @batch_size, rotate_bytes: @rotate_bytes, hint: @hint
     })
     manifest.ensure_partitions!(@partitions)
 
@@ -90,12 +92,18 @@ module Purplelight
     queue = ByteQueue.new(max_bytes: @queue_size_bytes)
 
     # Writer
-    case @format
-    when :jsonl
-      writer = WriterJSONL.new(directory: dir, prefix: prefix, compression: @compression, rotate_bytes: @rotate_bytes, logger: @logger, manifest: manifest)
-    else
-      raise ArgumentError, "format not implemented: #{@format}"
-    end
+    writer = case @format
+             when :jsonl
+               WriterJSONL.new(directory: dir, prefix: prefix, compression: @compression, rotate_bytes: @rotate_bytes, logger: @logger, manifest: manifest)
+             when :csv
+               single_file = (@sharding && @sharding[:mode].to_s == 'single_file')
+               WriterCSV.new(directory: dir, prefix: prefix, compression: @compression, rotate_bytes: @rotate_bytes, logger: @logger, manifest: manifest, single_file: single_file)
+             when :parquet
+               single_file = (@sharding && @sharding[:mode].to_s == 'single_file')
+               WriterParquet.new(directory: dir, prefix: prefix, compression: @compression, logger: @logger, manifest: manifest, single_file: single_file)
+             else
+               raise ArgumentError, "format not implemented: #{@format}"
+             end
 
     # Start reader threads
     readers = partition_filters.each_with_index.map do |pf, idx|
@@ -151,7 +159,7 @@ module Purplelight
     def read_partition(idx:, filter_spec:, queue:, batch_size:, manifest:)
       filter = filter_spec[:filter]
       sort = filter_spec[:sort] || { _id: 1 }
-      hint = filter_spec[:hint] || { _id: 1 }
+      hint = @hint || filter_spec[:hint] || { _id: 1 }
 
       # Resume from checkpoint if present
       checkpoint = manifest.partitions[idx] && manifest.partitions[idx]['last_id_exclusive']
@@ -164,11 +172,18 @@ module Purplelight
       opts[:projection] = @projection if @projection
       opts[:batch_size] = batch_size if batch_size
       opts[:no_cursor_timeout] = @no_cursor_timeout
-      opts[:read] = { mode: @read_preference }
-      opts[:read_concern] = @read_concern
+      # Read preference can be a symbol (mode) or a full hash with tag_sets
+      if @read_preference
+        opts[:read] = @read_preference.is_a?(Hash) ? @read_preference : { mode: @read_preference }
+      end
+      # The Mongo driver expects read_concern as a hash like { level: :majority }
+      if @read_concern
+        opts[:read_concern] = @read_concern.is_a?(Hash) ? @read_concern : { level: @read_concern }
+      end
 
       cursor = @collection.find(filter, opts)
 
+      encode_lines = (@format == :jsonl)
       buffer = []
       buffer_bytes = 0
       last_id = checkpoint
@@ -176,9 +191,15 @@ module Purplelight
       cursor.each do |doc|
         last_id = doc['_id']
         doc = @mapper.call(doc) if @mapper
-        json = Oj.dump(doc, mode: :compat)
-        bytes = json.bytesize + 1 # newline later
-        buffer << doc
+        if encode_lines
+          line = Oj.dump(doc, mode: :compat) + "\n"
+          bytes = line.bytesize
+          buffer << line
+        else
+          # For CSV/Parquet keep raw docs to allow schema/row building
+          bytes = (Oj.dump(doc, mode: :compat).bytesize + 1)
+          buffer << doc
+        end
         buffer_bytes += bytes
         if buffer.length >= batch_size || buffer_bytes >= 1_000_000
           queue.push(buffer, bytes: buffer_bytes)
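The normalization above feeds `Collection#find`, which in the Ruby driver accepts a read preference hash under `:read` and a `:read_concern` hash. A quick standalone check of those option shapes — the URI, replica-set name, and tag values are placeholders:

```ruby
require 'mongo'

client = Mongo::Client.new('mongodb://127.0.0.1:27017/mydb?replicaSet=rs0')

# Hash form with mode plus optional tag_sets: the same shape snapshot.rb
# builds from a symbol, or passes through untouched when given a hash.
view = client[:events].find(
  { status: 'active' },
  read: { mode: :secondary, tag_sets: [{ 'nodeType' => 'ANALYTICS' }] },
  read_concern: { level: :majority },
  no_cursor_timeout: true
)
puts view.first
```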
data/lib/purplelight/version.rb CHANGED
@@ -1,7 +1,7 @@
 # frozen_string_literal: true
 
 module Purplelight
-  VERSION = "0.1.0"
+  VERSION = "0.1.2"
 end
 
 
data/lib/purplelight/writer_csv.rb ADDED
@@ -0,0 +1,180 @@
+# frozen_string_literal: true
+
+require 'csv'
+require 'oj'
+require 'zlib'
+require 'fileutils'
+
+begin
+  require 'zstds'
+rescue LoadError
+end
+
+module Purplelight
+  class WriterCSV
+    DEFAULT_ROTATE_BYTES = 256 * 1024 * 1024
+
+    def initialize(directory:, prefix:, compression: :zstd, rotate_bytes: DEFAULT_ROTATE_BYTES, logger: nil, manifest: nil, single_file: false, columns: nil, headers: true)
+      @directory = directory
+      @prefix = prefix
+      @compression = compression
+      @rotate_bytes = rotate_bytes
+      @logger = logger
+      @manifest = manifest
+      @single_file = single_file
+
+      @columns = columns&.map(&:to_s)
+      @headers = headers
+
+      @part_index = nil
+      @io = nil
+      @csv = nil
+      @bytes_written = 0
+      @rows_written = 0
+      @file_seq = 0
+      @closed = false
+
+      @effective_compression = determine_effective_compression(@compression)
+      if @effective_compression.to_s != @compression.to_s
+        @logger&.warn("requested compression '#{@compression}' not available; using '#{@effective_compression}'")
+      end
+    end
+
+    def write_many(array_of_docs)
+      ensure_open!
+
+      # infer columns if needed from docs
+      if @columns.nil?
+        sample_docs = array_of_docs.is_a?(Array) ? array_of_docs : []
+        sample_docs = sample_docs.reject { |d| d.is_a?(String) }
+        @columns = infer_columns(sample_docs)
+        @csv << @columns if @headers
+      end
+
+      array_of_docs.each do |doc|
+        next if doc.is_a?(String)
+        row = @columns.map { |k| extract_value(doc, k) }
+        @csv << row
+        @rows_written += 1
+      end
+      @manifest&.add_progress_to_part!(index: @part_index, rows_delta: array_of_docs.size, bytes_delta: 0)
+
+      rotate_if_needed
+    end
+
+    def rotate_if_needed
+      return if @single_file
+      return if @rotate_bytes.nil?
+      raw_bytes = @io.respond_to?(:pos) ? @io.pos : @bytes_written
+      return if raw_bytes < @rotate_bytes
+      rotate!
+    end
+
+    def close
+      return if @closed
+      if @csv
+        @csv.flush
+      end
+      if @io
+        finalize_current_part!
+        @io.close
+      end
+      @closed = true
+    end
+
+    private
+
+    def ensure_open!
+      return if @io
+      FileUtils.mkdir_p(@directory)
+      path = next_part_path
+      @part_index = @manifest&.open_part!(path) if @manifest
+      raw = File.open(path, 'wb')
+      @io = build_compressed_io(raw)
+      @csv = CSV.new(@io)
+      @bytes_written = 0
+      @rows_written = 0
+    end
+
+    def build_compressed_io(raw)
+      case @effective_compression.to_s
+      when 'zstd'
+        if defined?(ZSTDS)
+          return ZSTDS::Writer.open(raw, level: 10)
+        else
+          @logger&.warn("zstd gem not loaded; using gzip")
+          return Zlib::GzipWriter.new(raw)
+        end
+      when 'gzip'
+        return Zlib::GzipWriter.new(raw)
+      when 'none'
+        return raw
+      else
+        raise ArgumentError, "unknown compression: #{@effective_compression}"
+      end
+    end
+
+    def rotate!
+      return unless @io
+      finalize_current_part!
+      @io.close
+      @io = nil
+      @csv = nil
+      ensure_open!
+    end
+
+    def finalize_current_part!
+      # Avoid flushing compressed writer explicitly to prevent Zlib::BufError; close will finish the stream.
+      @manifest&.complete_part!(index: @part_index, checksum: nil)
+      @file_seq += 1 unless @single_file
+    end
+
+    def next_part_path
+      ext = 'csv'
+      if @single_file
+        filename = format("%s.%s", @prefix, ext)
+      else
+        filename = format("%s-part-%06d.%s", @prefix, @file_seq, ext)
+      end
+      filename += ".zst" if @effective_compression.to_s == 'zstd'
+      filename += ".gz" if @effective_compression.to_s == 'gzip'
+      File.join(@directory, filename)
+    end
+
+    def determine_effective_compression(requested)
+      case requested.to_s
+      when 'zstd'
+        return (defined?(ZSTDS) ? :zstd : :gzip)
+      when 'gzip'
+        return :gzip
+      when 'none'
+        return :none
+      else
+        return :gzip
+      end
+    end
+
+    def infer_columns(docs)
+      keys = {}
+      docs.each do |d|
+        (d.keys - ['_id']).each { |k| keys[k.to_s] = true }
+      end
+      # Put _id first if present, then other keys sorted
+      cols = []
+      cols << '_id' if docs.first.key?('_id') || docs.first.key?(:_id)
+      cols + keys.keys.sort
+    end
+
+    def extract_value(doc, key)
+      val = doc[key] || doc[key.to_sym]
+      case val
+      when Hash, Array
+        Oj.dump(val, mode: :compat)
+      else
+        val
+      end
+    end
+  end
+end
+
+
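Reading back one of the parts this writer emits needs only the standard library; a minimal sketch assuming a gzip-compressed part at a hypothetical path (zstd parts would need the zstds gem instead):

```ruby
require 'csv'
require 'zlib'

# Stream-decode a rotated CSV part without loading the whole file.
Zlib::GzipReader.open('/data/exports/users-part-000000.csv.gz') do |gz|
  CSV.new(gz, headers: true).each do |row|
    puts row['_id']
  end
end
```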
data/lib/purplelight/writer_jsonl.rb CHANGED
@@ -14,13 +14,14 @@ module Purplelight
   class WriterJSONL
     DEFAULT_ROTATE_BYTES = 256 * 1024 * 1024
 
-    def initialize(directory:, prefix:, compression: :zstd, rotate_bytes: DEFAULT_ROTATE_BYTES, logger: nil, manifest: nil)
+    def initialize(directory:, prefix:, compression: :zstd, rotate_bytes: DEFAULT_ROTATE_BYTES, logger: nil, manifest: nil, compression_level: nil)
       @directory = directory
       @prefix = prefix
       @compression = compression
       @rotate_bytes = rotate_bytes
       @logger = logger
       @manifest = manifest
+      @compression_level = compression_level
 
       @part_index = nil
       @io = nil
@@ -28,14 +29,26 @@ module Purplelight
       @rows_written = 0
       @file_seq = 0
       @closed = false
+
+      @effective_compression = determine_effective_compression(@compression)
+      if @effective_compression.to_s != @compression.to_s
+        @logger&.warn("requested compression '#{@compression}' not available; using '#{@effective_compression}'")
+      end
     end
 
     def write_many(array_of_docs)
       ensure_open!
-      buffer = array_of_docs.map { |doc| Oj.dump(doc, mode: :compat) + "\n" }.join
+      # If upstream already produced newline-terminated strings, join fast.
+      if array_of_docs.first.is_a?(String)
+        buffer = array_of_docs.join
+        rows = array_of_docs.size
+      else
+        buffer = array_of_docs.map { |doc| Oj.dump(doc, mode: :compat) + "\n" }.join
+        rows = array_of_docs.size
+      end
       write_buffer(buffer)
-      @rows_written += array_of_docs.size
-      @manifest&.add_progress_to_part!(index: @part_index, rows_delta: array_of_docs.size, bytes_delta: buffer.bytesize)
+      @rows_written += rows
+      @manifest&.add_progress_to_part!(index: @part_index, rows_delta: rows, bytes_delta: buffer.bytesize)
     end
 
     def rotate_if_needed
@@ -67,17 +80,20 @@ module Purplelight
     end
 
     def build_compressed_io(raw)
-      case @compression.to_s
+      case @effective_compression.to_s
       when 'zstd'
         if defined?(ZSTDS)
           # ZSTDS::Writer supports IO-like interface
-          return ZSTDS::Writer.open(raw, level: 10)
+          level = @compression_level || 3
+          return ZSTDS::Writer.open(raw, level: level)
         else
-          @logger&.warn("zstd not available, falling back to gzip")
-          return Zlib::GzipWriter.new(raw)
+          @logger&.warn("zstd gem not loaded; this should have been handled earlier")
+          level = @compression_level || Zlib::DEFAULT_COMPRESSION
+          return Zlib::GzipWriter.new(raw, level)
         end
       when 'gzip'
-        return Zlib::GzipWriter.new(raw)
+        level = @compression_level || 1
+        return Zlib::GzipWriter.new(raw, level)
       when 'none'
         return raw
       else
@@ -109,10 +125,23 @@ module Purplelight
     def next_part_path
       ext = 'jsonl'
       filename = format("%s-part-%06d.%s", @prefix, @file_seq, ext)
-      filename += ".zst" if @compression.to_s == 'zstd'
-      filename += ".gz" if @compression.to_s == 'gzip'
+      filename += ".zst" if @effective_compression.to_s == 'zstd'
+      filename += ".gz" if @effective_compression.to_s == 'gzip'
       File.join(@directory, filename)
     end
+
+    def determine_effective_compression(requested)
+      case requested.to_s
+      when 'zstd'
+        return (defined?(ZSTDS) ? :zstd : :gzip)
+      when 'gzip'
+        return :gzip
+      when 'none'
+        return :none
+      else
+        return :gzip
+      end
+    end
   end
 end
 
data/lib/purplelight/writer_parquet.rb ADDED
@@ -0,0 +1,137 @@
+# frozen_string_literal: true
+
+begin
+  require 'arrow'
+  require 'parquet'
+rescue LoadError
+  # Arrow/Parquet not available; writer will refuse to run
+end
+
+require 'fileutils'
+
+module Purplelight
+  class WriterParquet
+    DEFAULT_ROW_GROUP_SIZE = 10_000
+
+    def initialize(directory:, prefix:, compression: :zstd, row_group_size: DEFAULT_ROW_GROUP_SIZE, logger: nil, manifest: nil, single_file: true, schema: nil)
+      @directory = directory
+      @prefix = prefix
+      @compression = compression
+      @row_group_size = row_group_size
+      @logger = logger
+      @manifest = manifest
+      @single_file = single_file
+      @schema = schema
+
+      @closed = false
+      @file_seq = 0
+      @part_index = nil
+
+      ensure_dependencies!
+      reset_buffers
+    end
+
+    def write_many(array_of_docs)
+      ensure_open!
+      array_of_docs.each { |doc| @buffer_docs << doc }
+      @manifest&.add_progress_to_part!(index: @part_index, rows_delta: array_of_docs.length, bytes_delta: 0)
+    end
+
+    def close
+      return if @closed
+      ensure_open!
+      if !@buffer_docs.empty?
+        table = build_table(@buffer_docs)
+        write_table(table, @writer_path, append: false)
+      end
+      finalize_current_part!
+      @closed = true
+    end
+
+    private
+
+    def ensure_dependencies!
+      unless defined?(Arrow) && defined?(Parquet)
+        raise ArgumentError, "Parquet support requires gems: red-arrow and red-parquet. Add them to your Gemfile."
+      end
+    end
+
+    def reset_buffers
+      @buffer_docs = []
+      @columns = nil
+      @writer_path = nil
+    end
+
+    def ensure_open!
+      return if @writer_path
+      FileUtils.mkdir_p(@directory)
+      @writer_path = next_part_path
+      @part_index = @manifest&.open_part!(@writer_path) if @manifest
+    end
+
+    # No-op; we now write once on close for simplicity
+
+    def build_table(docs)
+      # Infer columns
+      @columns ||= infer_columns(docs)
+      columns = {}
+      @columns.each do |name|
+        values = docs.map { |d| extract_value(d, name) }
+        columns[name] = Arrow::ArrayBuilder.build(values)
+      end
+      Arrow::Table.new(columns)
+    end
+
+    def write_table(table, path, append: false)
+      # Prefer Arrow's save with explicit parquet format; compression defaults per build.
+      if table.respond_to?(:save)
+        table.save(path, format: :parquet)
+        return
+      end
+      # Fall back to the red-parquet writer
+      if defined?(Parquet::ArrowFileWriter)
+        writer = Parquet::ArrowFileWriter.open(table.schema, path)
+        writer.write_table(table)
+        writer.close
+        return
+      end
+      raise "Parquet writer not available in this environment"
+    end
+
+    def finalize_current_part!
+      @manifest&.complete_part!(index: @part_index, checksum: nil)
+      @file_seq += 1 unless @single_file
+      @writer_path = nil
+    end
+
+    def next_part_path
+      ext = 'parquet'
+      filename = if @single_file
+                   format("%s.%s", @prefix, ext)
+                 else
+                   format("%s-part-%06d.%s", @prefix, @file_seq, ext)
+                 end
+      File.join(@directory, filename)
+    end
+
+    def infer_columns(docs)
+      keys = {}
+      docs.each do |d|
+        d.keys.each { |k| keys[k.to_s] = true }
+      end
+      keys.keys.sort
+    end
+
+    def extract_value(doc, key)
+      val = doc[key] || doc[key.to_sym]
+      case val
+      when Time
+        val
+      else
+        val
+      end
+    end
+  end
+end
+
+
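A quick way to sanity-check this writer's output is to load it back with the same red-arrow/red-parquet gems it depends on; the path assumes the single-file prefix used in the README examples:

```ruby
require 'arrow'
require 'parquet'

# Load the single-file Parquet output back into an Arrow table.
table = Arrow::Table.load('/data/exports/users.parquet', format: :parquet)
puts table.n_rows
puts table.schema
```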
metadata CHANGED
@@ -1,10 +1,10 @@
 --- !ruby/object:Gem::Specification
 name: purplelight
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.1.2
 platform: ruby
 authors:
-- Purplelight Authors
+- Alexander Nicholson
 bindir: bin
 cert_chain: []
 date: 1980-01-02 00:00:00.000000000 Z
@@ -37,6 +37,34 @@ dependencies:
   - - ">="
     - !ruby/object:Gem::Version
       version: '3.16'
+- !ruby/object:Gem::Dependency
+  name: csv
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: logger
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '1.6'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '1.6'
 - !ruby/object:Gem::Dependency
   name: rspec
   requirement: !ruby/object:Gem::Requirement
@@ -68,13 +96,15 @@ dependencies:
 description: High-throughput, resumable snapshots of MongoDB collections with partitioning,
   multi-threaded readers, and size-based sharded outputs.
 email:
-- devnull@example.com
-executables: []
+- rubygems-maint@ctrl.tokyo
+executables:
+- purplelight
 extensions: []
 extra_rdoc_files: []
 files:
 - README.md
 - Rakefile
+- bin/purplelight
 - lib/purplelight.rb
 - lib/purplelight/errors.rb
 - lib/purplelight/manifest.rb
@@ -82,13 +112,15 @@ files:
 - lib/purplelight/queue.rb
 - lib/purplelight/snapshot.rb
 - lib/purplelight/version.rb
+- lib/purplelight/writer_csv.rb
 - lib/purplelight/writer_jsonl.rb
+- lib/purplelight/writer_parquet.rb
 licenses:
 - MIT
 metadata:
-  homepage_uri: https://github.com/example/purplelight
-  source_code_uri: https://github.com/example/purplelight
-  changelog_uri: https://github.com/example/purplelight/releases
+  homepage_uri: https://github.com/alexandernicholson/purplelight
+  source_code_uri: https://github.com/alexandernicholson/purplelight
+  changelog_uri: https://github.com/alexandernicholson/purplelight/releases
 rdoc_options: []
 require_paths:
 - lib