exwiw 0.5.2 → 0.5.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 567683d65df5d9f147ab9415a67baf48a80e21ad32e1ef7635c624dfc3d28c47
4
- data.tar.gz: 1513b577f6f2368df60edc45a54c96495ece4f1ee9b453e92adb8991f182fcdf
3
+ metadata.gz: 4df9132fdf081f28d9d0c53ade6bba8267f0f5ec5c40e19ab42fa45d9cfa5a10
4
+ data.tar.gz: 4f20a649c126a4617cd5e90fd09eb2eb7f92b068408d5ef9e927eac98a5fadda
5
5
  SHA512:
6
- metadata.gz: a9680642eb34f99ed3f0c2924154a5171edf541286dfed14befe15b8c029271419a13d3295694847b1143898b0200bf6eb1d6448e6665dbad7bef026e3c3fbbb
7
- data.tar.gz: 30e6ef9f988965b85f899fdb0646b6e4e2befd68f95a247a9edfeddb8d3a6088f611f07d5376c7d3b9f4f58f147072e03294a140312dec1622994fb6da175720
6
+ metadata.gz: 309984a32b8cf593b5499427aa089c1c235332a5ca0cda584a2520d3dd0b073f33839adbe20487bce6bc31c20ba66434ce5e0cb1eab5e53bbdfa72bcf115707f
7
+ data.tar.gz: de045c9fef6510561b9fd4e713e6a9ba4be7c145f570aaf3da108cd7a4bf3512d0f8757c33e40470bf78b3e7dbebe5244013978c21a974610bca86ceeab54fd9
data/CHANGELOG.md CHANGED
@@ -2,6 +2,12 @@
2
2
 
3
3
  ## [Unreleased]
4
4
 
5
+ ## [0.5.3] - 2026-06-19
6
+
7
+ ### Changed
8
+
9
+ - **MongoDB: dumps now stream, bounding peak memory regardless of collection size** (default; no flag, byte-identical output). The adapter previously loaded each collection's entire result set into memory (`.to_a`) and built the whole collection's JSONL output as one string, so peak memory scaled with collection size. It now wraps the Mongo cursor in a lazy streaming result and writes output in chunks, so at most one chunk of documents (plus the small FK-propagation key arrays) is resident at a time. On a 20k-document × 30-embed collection this cut peak RSS by hundreds of MB and was also faster (less GC pressure). The per-document Extended-JSON masking was also precompiled per collection config, trimming per-document encoding cost. See [`docs/optimization-notes.md`](docs/optimization-notes.md) for the full investigation, and [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md) for the proposed native-encoder follow-up.
10
+
5
11
  ## [0.5.2] - 2026-06-18
6
12
 
7
13
  ### Fixed
data/README.md CHANGED
@@ -647,6 +647,7 @@ The MongoDB adapter is experimental. To use it:
647
647
  - `--ids` values are coerced to the type actually stored in `_id` before filtering: integer-looking ids become `Integer`, 24-char hex ids become `BSON::ObjectId` (Mongoid's default `_id` type — a plain String would never match an ObjectId), and any other string is left as-is.
648
648
  - `--target-collection=COLLECTION` is a mongodb-only alias of `--target-table` (use whichever reads better for MongoDB). Specifying both, or using `--target-collection` with a non-mongodb adapter, is an error.
649
649
  - `--ids-field=FIELD` matches `--ids` against `FIELD` on the target collection instead of its primary key (e.g. `--target-collection=users --ids=a@example.com --ids-field=email`). Downstream foreign-key propagation still keys off the primary key, so only the target collection's filter changes. Unlike the primary-key path, the supplied ids are **not** type-coerced (the stored type of a custom field is unknown), so pass values matching the field's actual type. This flag is **mongodb-only**; the SQL adapters use `--ids-column` instead (see below).
650
+ - Large or embedded-document-heavy dumps are streamed automatically: the adapter reads the collection through a lazy cursor (not `.to_a`) and writes JSONL in chunks, so peak memory is bounded by the chunk size rather than the collection size — no flag to set. The dominant remaining cost is encoding each document to MongoDB Extended JSON (pure Ruby); see [`docs/optimization-notes.md`](docs/optimization-notes.md) for the performance investigation and [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md) for the proposed native-encoder follow-up. Benchmark your own data with `script/bench_mongodb_dump.rb`.
650
651
  - Output is JSON Lines (`insert-{idx}-{collection}.jsonl`) using MongoDB Extended JSON (relaxed mode). Import with `mongoimport`:
651
652
  ```bash
652
653
  mongoimport --db app_dev --collection users --file dump/insert-002-users.jsonl
@@ -664,7 +665,7 @@ The MongoDB adapter is experimental. To use it:
664
665
  MongoDB models often store one-to-many relationships as embedded subdocument arrays (e.g. `users` documents with a `posts: [...]` field). To mask fields inside embedded subdocuments, declare a separate config with `embedded_in`:
665
666
 
666
667
  ```jsonc
667
- // scenario/users.json — top-level collection
668
+ // e2e/users.json — top-level collection
668
669
  {
669
670
  "name": "users",
670
671
  "primary_key": "_id",
@@ -676,7 +677,7 @@ MongoDB models often store one-to-many relationships as embedded subdocument arr
676
677
  ]
677
678
  }
678
679
 
679
- // scenario/posts.json — embedded under users.posts
680
+ // e2e/posts.json — embedded under users.posts
680
681
  {
681
682
  "name": "posts",
682
683
  "primary_key": "_id",
@@ -0,0 +1,126 @@
1
+ # MongoDB dump performance: investigation notes
2
+
3
+ This records what was learned while making the MongoDB adapter's dump faster and
4
+ lighter, **what shipped**, and **what was explored and deliberately removed**.
5
+ It exists so the removed work isn't re-discovered from scratch and so the
6
+ trade-offs behind the current design are legible.
7
+
8
+ The reproducible harness is `script/bench_mongodb_dump.rb` (seeds a synthetic
9
+ large/embed-heavy dataset and measures the dump phases). The correctness anchor
10
+ throughout is `spec/insert_output_snapshot_spec.rb` — a **byte-exact** snapshot
11
+ of the dump output; every change below was required to keep it green.
12
+
13
+ ## The two hotspots
14
+
15
+ On an embed-heavy benchmark (20k users × 30 embedded posts → ~154 MB JSONL):
16
+
17
+ 1. **Memory.** Two compounding costs. `MongodbAdapter#execute` did `.to_a`,
18
+ loading the entire result set onto the heap (~600–900 MB / ~9.5M Ruby
19
+ objects for 20k docs). Separately, with no chunking the Runner built the whole
20
+ collection's JSONL output as **one giant string** before writing, held
21
+ simultaneously with the result set.
22
+ 2. **CPU.** `doc.as_extended_json(mode: :relaxed)` is ~82% of per-document
23
+ serialization (~104µs of ~124µs for a 30-post doc). It recursively rebuilds
24
+ the document into a new intermediate Hash tree, so cost scales with embedding
25
+ depth/count; `JSON.generate` over that tree is comparatively cheap (~10µs).
26
+
27
+ ## What shipped (default, no flags, byte-identical)
28
+
29
+ - **Chunked output streaming.** The Runner writes each bulk-insert chunk straight
30
+ to the file instead of joining the whole table's output into one string.
31
+ `MongodbAdapter` sets a positive `default_bulk_insert_chunk_size` (1000) so
32
+ MongoDB output is chunked by default while SQL adapters keep one statement per
33
+ table. Cut peak RSS ~112 MB and was ~30% faster, byte-identical.
34
+ - **Streaming result set.** `#execute` returns a lazy `StreamingResult` wrapping
35
+ the Mongo cursor instead of `.to_a`. The Runner pulls documents through
36
+ `each_slice`, so only one chunk is resident at a time. `#size` is answered with
37
+ a cheap `count_documents` (index-only) rather than draining the cursor, and the
38
+ FK-propagation `@state` is captured *as the cursor streams* and published once
39
+ the pass completes (the Runner always fully consumes a non-empty result, so
40
+ propagation is unaffected). Cut peak RSS growth ~360 MB and wall time ~40%.
41
+ - **Precompiled masking (`MaskPlan`).** Masking runs over every document **and**
42
+ every embedded subdocument, so per-config decisions (which fields carry a
43
+ `replace_with`, how each template splits, where embedded children live) were
44
+ recomputed many times per document. Compiling a `MaskPlan` once per collection
45
+ config dropped per-document masking ~17–22% and ~35 allocations/doc, scaling
46
+ down with embedding count. Byte-identical.
47
+
48
+ Net default result: memory is bounded by chunk size rather than collection size,
49
+ with a meaningful wall-time improvement and no API/flag surface.
50
+
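The two memory changes above can be illustrated with a minimal sketch. It assumes nothing about exwiw's real classes: `fake_cursor` and `stream_jsonl` are hypothetical names, and an `Enumerator` stands in for the lazy Mongo cursor. It only shows the general shape of pulling documents through `each_slice` and writing each chunk as it is produced.

```ruby
require "json"
require "stringio"

# Hypothetical stand-in for a lazy Mongo cursor: yields one document at a time
# and never materializes the whole collection.
def fake_cursor(count)
  Enumerator.new do |yielder|
    count.times { |i| yielder << { "_id" => i, "name" => "user#{i}" } }
  end
end

# Writes documents as JSONL one chunk at a time and returns the largest number
# of documents that were resident at once.
def stream_jsonl(docs, io, chunk_size:)
  peak_resident = 0
  docs.each_slice(chunk_size).with_index do |chunk, index|
    peak_resident = [peak_resident, chunk.size].max
    io << "\n" unless index.zero? # same separator as between documents
    io << chunk.map { |doc| JSON.generate(doc) }.join("\n")
  end
  peak_resident
end

# Eager baseline: the whole result set and the whole output string at once.
eager = fake_cursor(2_500).to_a.map { |doc| JSON.generate(doc) }.join("\n")

out = StringIO.new
peak = stream_jsonl(fake_cursor(2_500), out, chunk_size: 1_000)
puts "peak resident docs: #{peak}, identical bytes: #{out.string == eager}"
```

In this toy setup the streamed output matches the eager baseline for any chunk size, while the number of documents held at once never exceeds the chunk size.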
51
+ ## What was explored and removed
52
+
53
+ After the memory work, the remaining cost was almost entirely the pure-Ruby
54
+ `as_extended_json`. Threads give **zero** speedup (it holds the GVL), and a
55
+ hand-rolled pure-Ruby fused encoder is **slower** than `as_extended_json +
56
+ JSON.generate` (per-leaf `.to_json` C-call overhead; `JSON.generate` does the
57
+ whole tree in one C pass). `bson` 5.2.0 has no native Extended-JSON serializer to
58
+ borrow (`to_extended_json` is literally `as_extended_json(**opts).to_json`, all
59
+ pure Ruby). That left two levers, both of which were built, measured, and then
60
+ **removed for being disproportionately complex**:
61
+
62
+ ### Fork-parallel serialization (`--parallel-workers=N`)
63
+
64
+ Forked `N` worker processes to serialize contiguous document slices in parallel,
65
+ parent concatenating parts in order (byte-identical). The *serialization step*
66
+ parallelized ~2–2.5× at 4–8 workers, **but the end-to-end dump speedup was only
67
+ ~1.1–1.4×** on embed-heavy data. Reason (Amdahl): ~40% of dump wall time is the
68
+ **serial Mongo cursor BSON→Ruby decode** in the parent, which serialization
69
+ parallelism cannot touch — capping the win — and fork/concat overhead eroded most
70
+ of the rest.
71
+
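The Amdahl argument can be checked with a few lines of arithmetic. This is a rough sketch under a simplifying assumption that is not in the notes: the 60% of wall time outside the serial decode is treated as perfectly parallelizable serialization, with no fork or concatenation overhead. `amdahl_speedup` is a hypothetical helper name.

```ruby
# Amdahl's law: overall speedup when only part of the run is accelerated.
def amdahl_speedup(serial_fraction, parallel_speedup)
  1.0 / (serial_fraction + (1.0 - serial_fraction) / parallel_speedup)
end

# Share of dump wall time spent in the serial BSON→Ruby decode (from the notes).
SERIAL_DECODE = 0.40

[2.0, 2.5, Float::INFINITY].each do |s|
  puts format("serialization %.1fx -> end-to-end %.2fx", s, amdahl_speedup(SERIAL_DECODE, s))
end
```

Under these idealized assumptions a 2–2.5× serialization speedup gives at most about 1.4–1.6× end to end, and even infinitely fast serialization tops out at 2.5×. The measured ~1.1–1.4× sits below the ideal because of the fork/concat overhead described above.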
72
+ ### Cursor-parallel fetch (`--cursor-parallel`)
73
+
74
+ Went further: split each collection into `N` disjoint `_id` ranges, each fetched
75
+ by a forked worker with its own connection+cursor, so the **decode** was
76
+ parallelized too. Measured byte-identical at ~2.5–5.5× depending on dataset size
77
+ — a real, larger win. But it required a lot of machinery:
78
+
79
+ - `_id`-range partitioning (an index-only scan + range split) — `MongoIdPartitioner`.
80
+ - A fork orchestrator writing ordered part files + Marshal'd state sidecars — `ForkedPartWriter`.
81
+ - Distributed FK-propagation: each worker captures its range's `@state` slice and
82
+ the parent merges them in range order — `PropagationCapture`.
83
+ - A per-worker fresh-connection builder (the Mongo driver is not fork-safe).
84
+ - New adapter seams (`write_bulk_insert`, `write_inserts`) and CLI→Runner→Adapter
85
+ threading for two flags + their validations.
86
+ - A user-visible caveat: per-range cursors must `sort(_id)`, so the output is
87
+ **ordered by `_id` rather than natural order** — a different byte stream
88
+ (semantically equivalent re-import), so it could not be the snapshot-tested
89
+ default.
90
+
91
+ ### Why both were removed
92
+
93
+ The cursor-parallel win was real but bought with multi-process orchestration,
94
+ IPC, distributed state, a fresh-connection-per-worker requirement, fork
95
+ fallbacks for Windows/JRuby, and a non-default output ordering — a large,
96
+ permanently-maintained surface for a single adapter's export path. The
97
+ maintainer's call was that this is **over-engineered** for the benefit, and that
98
+ the CPU hotspot is better addressed by the lever every earlier iteration kept
99
+ pointing at: a native (C) Extended-JSON encoder. The memory wins above are
100
+ unrelated to the parallelism and were kept.
101
+
102
+ ## Where the speedup goes next
103
+
104
+ A C extension can collapse `as_extended_json + JSON.generate` into one native
105
+ tree-walk (no intermediate tree, no second pass), as a flag-free, fork-free,
106
+ single-process win — bounded by the same serial-decode ceiling (~2.5×) the
107
+ `--parallel-workers` path hit, since it also doesn't touch the driver's decode.
108
+ The full design (byte-identity strategy, fast-path vs Ruby-delegate types,
109
+ optional-load + pure-Ruby fallback, packaging) is in
110
+ [`optimize-mongodb-export-with-native-ext.md`](./optimize-mongodb-export-with-native-ext.md).
111
+
112
+ ## Methodology notes (for re-running)
113
+
114
+ - The CPU hotspot reproduces **with no database**: the Mongo driver hands back
115
+ plain Ruby `Hash` + `BSON::ObjectId`/`Time`, so synthesizing that shape in
116
+ memory yields the exact `as_extended_json + JSON.generate` cost and runs under
117
+ the normal sandbox. DB-touching measurement needs live mongo on `localhost:27017`
118
+ (the dev sandbox blocks it — disable the sandbox for those runs).
119
+ - In-process sequential bench passes accumulate RSS (Ruby reclaims to the OS
120
+ lazily), which inflates a later serial baseline and overstates a parallel
121
+ speedup; isolate sections in fresh processes for defensible numbers.
122
+ - Chunk size never changes output **bytes** — the Runner inserts the same `"\n"`
123
+ between chunks that `to_bulk_insert` inserts between documents — so it is purely
124
+ a memory/throughput knob, safe against the snapshot guard.
125
+ - Ruby 4.0 removed the `benchmark` stdlib from default gems; use
126
+ `Process.clock_gettime(Process::CLOCK_MONOTONIC)`.
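For the last point, a timing helper built on the monotonic clock might look like the sketch below. `measure` is a hypothetical helper name, not part of `script/bench_mongodb_dump.rb`; the summed range is only a placeholder workload.

```ruby
# Times a block with the monotonic clock and returns [result, elapsed_seconds].
def measure
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

sum, seconds = measure { (1..100_000).sum }
puts format("sum=%d in %.6fs", sum, seconds)
```

The monotonic clock is not affected by wall-clock adjustments, which is why it is the usual choice for elapsed-time measurement.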
@@ -0,0 +1,229 @@
1
+ # Design: optional native (C) extension for the MongoDB Extended-JSON encoder
2
+
3
+ Status: **proposed / not implemented.** This document captures the design for a
4
+ future change. It is the planned successor to the fork/cursor parallelism that
5
+ was removed (see [`optimization-notes.md`](./optimization-notes.md)).
6
+
7
+ ## Motivation
8
+
9
+ When the MongoDB adapter dumps embed-heavy documents, the dominant CPU cost is
10
+ turning each decoded Mongo document (a Ruby `Hash` containing `BSON::ObjectId` /
11
+ `Time` / nested `Hash`+`Array`) into one JSONL line of MongoDB **Relaxed
12
+ Extended JSON**. Today that is:
13
+
14
+ ```ruby
15
+ # lib/exwiw/adapter/mongodb_adapter.rb
16
+ JSON.generate(doc.as_extended_json(mode: :relaxed))
17
+ ```
18
+
19
+ `as_extended_json` (in the pure-Ruby `bson` gem) **recursively rebuilds the
20
+ whole document into a new intermediate Hash tree** (`ObjectId -> {"$oid"=>…}`,
21
+ `Time -> {"$date"=>…}`, every subdoc/array re-allocated), and then
22
+ `JSON.generate` walks that tree a *second* time. For a 30-embedded-post doc this
23
+ was measured at ~130µs/doc and is ~82% of per-document serialization cost.
24
+
25
+ Earlier experiments established the levers:
26
+
27
+ - Threads give **zero** speedup — `as_extended_json` is pure Ruby and holds the GVL.
28
+ - A pure-Ruby fused single-pass encoder is **slower** (per-leaf `.to_json` C-call
29
+ overhead beats it; `JSON.generate` does the whole tree in one C pass).
30
+ - Multi-process parallelism worked but was judged over-engineered and removed.
31
+
32
+ The remaining lever is a **C extension**: one native walk that emits the
33
+ Extended-JSON text directly — no intermediate transformed-Hash tree, no second
34
+ JSON pass.
35
+
36
+ ## Goals / non-goals
37
+
38
+ - **Goal:** a native encoder that is **byte-for-byte identical** to the current
39
+ pure-Ruby path, behind an **optional** load with a pure-Ruby fallback so
40
+ exwiw stays installable as a pure-Ruby gem (JRuby/TruffleRuby, or any host
41
+ where compilation fails, keep working).
42
+ - **Non-goal:** speeding up the Mongo cursor's BSON→Ruby *decode*. That lives in
43
+ the `mongo`/`bson` driver and is ~40% of total dump wall time. A serialization
44
+ C extension is therefore bounded to roughly the same end-to-end ceiling
45
+ (~2.5×) the removed `--parallel-workers` path had — it does **not** reach the
46
+ removed cursor-parallel path's 3.4–5.5×. This is an accepted trade for a far
47
+ simpler, flag-free, fork-free implementation.
48
+
49
+ ## Exact serialization semantics to reproduce (verified against bson 5.2.0)
50
+
51
+ The byte-exact anchor is `spec/insert_output_snapshot_spec.rb` (committed
52
+ `spec/insert_output_snapshots/mongodb/*.jsonl` fixtures). The C encoder must
53
+ reproduce the following exactly. All rows below were verified empirically and,
54
+ for `Time`, against `bson-5.2.0/lib/bson/time.rb:72-89`.
55
+
56
+ | Ruby value | Relaxed Extended JSON output | Notes |
57
+ |---|---|---|
58
+ | `BSON::ObjectId` | `{"$oid":"<24 lowercase hex>"}` | hex via `ObjectId#to_s` |
59
+ | `Time` (year 1970..9999, sub-second) | `{"$date":"2021-01-02T03:04:05.678Z"}` | floor to ms, `strftime('%Y-%m-%dT%H:%M:%S.%LZ')` |
60
+ | `Time` (year 1970..9999, whole second) | `{"$date":"2021-01-02T03:04:05Z"}` | **no** fraction when `usec == 0` |
61
+ | `Time` (year <1970 or >9999) | `{"$date":{"$numberLong":"<ms>"}}` | `ms = sec*1000 + usec.divmod(1000).first` |
62
+ | `Integer` (fits int64) | bare `42` / `9000000000` | |
63
+ | `Integer` (outside int64) | **raises `RangeError`** | `"Integer … too big to be represented as a MongoDB integer"` |
64
+ | `Float` | `JSON.generate(float)` form | `1e20 → 1e+20` (**not** `Float#to_s`'s `1.0e+20`); `100.0 → 100.0`; `-0.0 → -0.0` |
65
+ | `String` | JSON string | escape only `\b \t \n \f \r \" \\`; other `<0x20` as lowercase `\u00xx`; `/`, DEL, U+2028/U+2029, non-ASCII left raw |
66
+ | `true` / `false` / `nil` | `true` / `false` / `null` | |
67
+ | `Hash` | `{…}` | **insertion order** preserved; keys are Strings (JSON-escaped) |
68
+ | `Array` | `[…]` | |
69
+
70
+ `bson/time.rb` boundary (the highest-risk piece), verified verbatim:
71
+
72
+ ```ruby
73
+ def as_extended_json(**options)
74
+ if options[:mode] == :relaxed && (1970..9999).include?(utc_time.year)
75
+ if utc_time.usec != 0
76
+ utc_time = utc_time.floor(3) # floor to millisecond
77
+ {'$date' => utc_time.strftime('%Y-%m-%dT%H:%M:%S.%LZ')}
78
+ else
79
+ {'$date' => utc_time.strftime('%Y-%m-%dT%H:%M:%SZ')}
80
+ end
81
+ else
82
+ msec = utc_time.usec.divmod(1000).first
83
+ {'$date' => {'$numberLong' => (sec * 1000 + msec).to_s}}
84
+ end
85
+ end
86
+ ```
87
+
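The three `Time` rows of the table can be exercised without the `bson` gem. The sketch below is a stand-alone paraphrase of the branch logic quoted above, not the gem's code; `relaxed_date` is a hypothetical name and the sample times are illustrative.

```ruby
require "json"

# Stand-alone paraphrase of the relaxed-mode Time rules in the table above.
def relaxed_date(time)
  utc = time.getutc
  if (1970..9999).cover?(utc.year)
    if utc.usec != 0
      # Sub-second: floor to milliseconds and always print three digits.
      { "$date" => utc.floor(3).strftime("%Y-%m-%dT%H:%M:%S.%LZ") }
    else
      # Whole second: no fractional part at all.
      { "$date" => utc.strftime("%Y-%m-%dT%H:%M:%SZ") }
    end
  else
    # Outside 1970..9999: milliseconds since the epoch as a $numberLong string.
    msec = utc.usec.divmod(1000).first
    { "$date" => { "$numberLong" => (utc.to_i * 1000 + msec).to_s } }
  end
end

[
  Time.utc(2021, 1, 2, 3, 4, 5, 678_900), # sub-second
  Time.utc(2021, 1, 2, 3, 4, 5),          # whole second
  Time.utc(1969, 12, 31, 23, 59, 59)      # before 1970
].each { |t| puts JSON.generate(relaxed_date(t)) }
```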
88
+ ## Fast-path vs delegate (the byte-identity strategy)
89
+
90
+ The encoder splits values into a **native fast path** and a **Ruby delegate**:
91
+
92
+ - **Native (in C):** `Hash`, `Array`, `String`, `Integer` within int64,
93
+ `true`/`false`/`nil`, and `BSON::ObjectId`. These are the structural bulk plus
94
+ the single most common leaf (`_id`), and their formatting is simple and stable.
95
+ - **Delegate to Ruby** — call back into
96
+ `JSON.generate(value.as_extended_json(mode: :relaxed))` for the individual
97
+ value and splice the returned fragment into the buffer:
98
+ - `Float` — `Float#to_s` diverges from `JSON.generate` for scientific notation
99
+ (`1e20`), so never reformat floats in C.
100
+ - `Time` — variable fractional digits + the `[1970,9999]` `$numberLong`
101
+ boundary + ms flooring; too fragile to risk in C v1.
102
+ - out-of-int64 `Integer` — must surface the identical `RangeError`.
103
+ - any unrecognized class — `Decimal128`, `BSON::Binary`, `Symbol`, `Regexp`,
104
+ `Date`, `BSON::Timestamp`, etc.
105
+
106
+ **Why delegating is provably byte-identical:** `Hash#as_extended_json` and
107
+ `Array#as_extended_json` are *non-transforming structural recursion* — they map
108
+ over children and call `as_extended_json` on each. So the bytes produced for any
109
+ sub-value `v` by `JSON.generate(v.as_extended_json(mode: :relaxed))` are exactly
110
+ the bytes that the whole-document `JSON.generate(doc.as_extended_json(...))`
111
+ would have produced for that position. The native walk can therefore hand any
112
+ value it does not want to format to Ruby and splice the result, with no
113
+ divergence.
114
+
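The splice argument can be demonstrated in pure Ruby, using plain JSON values so that no `bson` types are needed. This is a sketch of the idea only: `encode_spliced` is a hypothetical stand-in for the native walk, and `encode_fragment` here is a simplified delegate that calls `JSON.generate` directly.

```ruby
require "json"

# Simplified delegate: encodes one value on its own.
def encode_fragment(value)
  JSON.generate(value)
end

# Structural walk: containers are emitted directly, every leaf is delegated
# and its fragment is spliced into the shared buffer.
def encode_spliced(value, out = +"")
  case value
  when Hash
    out << "{"
    value.each_with_index do |(key, child), index|
      out << "," unless index.zero?
      out << encode_fragment(key.to_s) << ":"
      encode_spliced(child, out)
    end
    out << "}"
  when Array
    out << "["
    value.each_with_index do |child, index|
      out << "," unless index.zero?
      encode_spliced(child, out)
    end
    out << "]"
  else
    out << encode_fragment(value)
  end
  out
end

doc = { "_id" => 1, "name" => "a\"b", "posts" => [{ "n" => 1.5, "tags" => [] }, nil, true] }
puts encode_spliced(doc) == JSON.generate(doc)
```

Because the containers contribute only brackets, commas, and keys, the per-leaf fragments line up with what a whole-document `JSON.generate` emits at the same positions.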
115
+ `Time` and `Float` are candidates for later promotion into the native path if a
116
+ benchmark shows the per-leaf delegate call is a meaningful fraction of the win
117
+ (timestamp-heavy docs call out to Ruby once per `Time` field).
118
+
119
+ ## C source & buffer design
120
+
121
+ - `Exwiw::ExtJson.encode_native(doc) -> String` — returns one JSONL line, **no**
122
+ trailing `\n` (the caller/Runner owns separators).
123
+ - Recursive emitter writing into a single growing buffer (`rb_str_buf_new` +
124
+ `rb_str_cat`/`rb_str_buf_cat`, or a `malloc` buffer finalized to an
125
+ `rb_utf8_str_new`). Result string is UTF-8.
126
+ - Type dispatch via `TYPE()` for the immediates/`T_HASH`/`T_ARRAY`/`T_STRING`/
127
+ `T_FLOAT`/`T_FIXNUM`/`T_BIGNUM`, and a cached `BSON::ObjectId` class reference
128
+ (`rb_const_get`) compared with `rb_obj_is_kind_of` for ObjectId.
129
+ - `Hash`: `rb_hash_foreach` preserves insertion order; emit `key:value` pairs;
130
+ keys are Strings run through the same string escaper.
131
+ - Delegate path: a cached `ID` for a Ruby helper (e.g.
132
+ `Exwiw::ExtJson.encode_fragment(v)`), `rb_funcall`'d, returning the JSON
133
+ fragment String to splice. The `RangeError` for oversized integers propagates
134
+ naturally through the delegate.
135
+ - String escaper implemented in C to match the table above (no per-leaf Ruby
136
+ call for the common String case).
137
+
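The string-escaping rule from the semantics table is small enough to prototype in Ruby before committing to C. The sketch below is an illustration of that rule, not the proposed C code; `escape_json_string` and `SHORT_ESCAPES` are hypothetical names, and the sample strings are arbitrary.

```ruby
require "json"

# Characters that get a short two-character escape.
SHORT_ESCAPES = {
  "\b" => '\b', "\t" => '\t', "\n" => '\n', "\f" => '\f', "\r" => '\r',
  '"' => '\"', "\\" => '\\\\'
}.freeze

# Escapes a String as a JSON string literal following the table's rules:
# short escapes, lowercase \u00xx for other control characters, all else raw.
def escape_json_string(str)
  body = str.each_char.map do |ch|
    if SHORT_ESCAPES.key?(ch)
      SHORT_ESCAPES[ch]
    elsif ch.ord < 0x20
      format('\u%04x', ch.ord)
    else
      ch
    end
  end.join
  %("#{body}")
end

["tab\there", "quote\"back\\slash", "ctl\u0001\u001f", "slash/", "sep\u2028", "日本語"].each do |s|
  puts "#{escape_json_string(s)} matches JSON.generate: #{escape_json_string(s) == JSON.generate(s)}"
end
```

A comparison loop like this is the same shape as the fuzz check proposed for `spec/ext_json_spec.rb`, with the Ruby prototype standing in for the native escaper.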
138
+ ## Packaging, optional load, fallback
139
+
140
+ - **gemspec:** `spec.extensions = ["ext/exwiw/ext_json/extconf.rb"]`. With
141
+ `extensions` set, `gem install exwiw` compiles automatically; hosts that can't
142
+ compile fall back at runtime (below).
143
+ - **`ext/exwiw/ext_json/extconf.rb`:** `require "mkmf"` (stdlib) +
144
+ `create_makefile("exwiw/ext_json_native")`. The compiled lib is named
145
+ `ext_json_native` (distinct from the `ext_json.rb` shim) to avoid a
146
+ `require` self-collision.
147
+ - **`ext/exwiw/ext_json/ext_json.c`:** the emitter; defines
148
+ `Exwiw::ExtJson.encode_native`.
149
+ - **`lib/exwiw/ext_json.rb`** (the shim, always loaded):
150
+
151
+ ```ruby
152
+ require "json"
153
+
154
+ module Exwiw
155
+ module ExtJson
156
+ module_function
157
+
158
+ # Pure-Ruby fragment encoder used by both the fallback and the native
159
+ # delegate path. Byte-identical to today's behavior.
160
+ def encode_fragment(value)
161
+ JSON.generate(value.respond_to?(:as_extended_json) ? value.as_extended_json(mode: :relaxed) : value)
162
+ end
163
+
164
+ begin
165
+ require "exwiw/ext_json_native" # defines Exwiw::ExtJson.encode_native
166
+ def encode(doc) = encode_native(doc)
167
+ rescue LoadError
168
+ def encode(doc) = encode_fragment(doc) # exact current behavior
169
+ end
170
+ end
171
+ end
172
+ ```
173
+
174
+ - **`Rakefile`:** `require "rake/extensiontask"`;
175
+ `Rake::ExtensionTask.new("ext_json_native") { |e| e.ext_dir = "ext/exwiw/ext_json" }`;
176
+ make the `spec` task depend on `compile`.
177
+ - **Dev dependency:** add `rake-compiler` (Gemfile / gemspec dev deps). `mkmf`
178
+ is stdlib, no runtime dep added.
179
+ - **`.gitignore`:** ignore built artifacts (`lib/exwiw/*.bundle`,
180
+ `lib/exwiw/*.so`, `ext/**/*.o`, `ext/**/Makefile`). Commit only the `ext/`
181
+ sources; the gemspec ships files via `git ls-files`.
182
+ - **`lib/exwiw.rb`:** add `require_relative "exwiw/ext_json"`.
183
+
184
+ ## Integration point
185
+
186
+ In `lib/exwiw/adapter/mongodb_adapter.rb`, the per-document serialize step
187
+ becomes (masking still runs in Ruby first; only the encode changes):
188
+
189
+ ```ruby
190
+ def to_bulk_insert(rows, config)
191
+ plan = mask_plan(config)
192
+ rows.map do |doc|
193
+ apply_mask_plan!(doc, plan)
194
+ Exwiw::ExtJson.encode(doc) # was: JSON.generate(extended_json(doc))
195
+ end.join("\n")
196
+ end
197
+ ```
198
+
199
+ The private `#extended_json` helper is removed — its logic (including the
200
+ `respond_to?(:as_extended_json)` guard) moves into `ExtJson.encode_fragment`.
201
+
202
+ ## Test & benchmark strategy
203
+
204
+ - **`spec/ext_json_spec.rb`** (DB-free; the primary byte-identity guard, runs in
205
+ normal CI): assert `encode_native(doc) == encode_fragment(doc)` over a fuzz of
206
+ representative shapes — ObjectId; nested hashes/arrays; `Time` across the year
207
+ boundary, whole-second (no fraction), and sub-second; strings with control
208
+ chars / quotes / backslashes / non-ASCII / U+2028; ints, bignums (assert the
209
+ same `RangeError`), floats (`1e20`, `-0.0`, `100.0`); `nil`; empty
210
+ hash/array/string. Skip the native half with a clear message when the lib
211
+ isn't compiled, so the suite still passes on a fallback-only host.
212
+ - **`spec/insert_output_snapshot_spec.rb`** (live mongo on 27017): the byte-exact
213
+ fixtures must stay green with the native encoder built.
214
+ - **Microbench** (extend `script/bench_mongodb_dump.rb`): native-encode vs
215
+ Ruby-fallback throughput on DB-free synthesized embed-heavy docs, plus the live
216
+ path on 20k×30, to quantify the real speedup.
217
+
218
+ ## Risk register
219
+
220
+ 1. **Time formatting** — variable fraction + ms flooring + `$numberLong`
221
+ boundary. Mitigated by delegating `Time` to Ruby in v1.
222
+ 2. **Float formatting** — `Float#to_s` ≠ `JSON.generate`. Mitigated by delegating.
223
+ 3. **String escaping** — must match JSON exactly. Implemented in C, fuzz-tested
224
+ vs the Ruby fallback.
225
+ 4. **Hash key order** — preserved via `rb_hash_foreach`.
226
+ 5. **Oversized integers** — delegate so the same `RangeError` surfaces.
227
+ 6. **Encoding** — emit UTF-8; pass non-ASCII bytes through unescaped (matches JSON).
228
+ 7. **Build/portability** — optional load + pure-Ruby fallback keeps non-CRuby and
229
+ no-compiler installs working.
@@ -117,7 +117,7 @@ Error: `pg_dump` not found in PATH. exwiw needs pg_dump to generate insert-000-s
117
117
  | `lib/exwiw/ddl_postprocessor.rb` (new) | `IF NOT EXISTS` rewriting / DO-block wrapping |
118
118
  | `lib/exwiw.rb` | `require` the new files |
119
119
  | `README.md` | Add `insert-000-schema.{sql,js}` to the documented `dump/` output; update the import steps |
120
- | `spec/adapter/sqlite3_adapter_spec.rb` | `dump_schema` integration test (runs against `scenario/initdb/init.sqlite3` and asserts that the output contains `CREATE TABLE IF NOT EXISTS`) |
120
+ | `spec/adapter/sqlite3_adapter_spec.rb` | `dump_schema` integration test (runs against `e2e/initdb/init.sqlite3` and asserts that the output contains `CREATE TABLE IF NOT EXISTS`) |
121
121
  | `spec/adapter/mongodb_adapter_spec.rb` | `dump_schema` test (a stubbed db returns `listIndexes`; asserts the output JS) |
122
122
  | `spec/runner_spec.rb` | Asserts that `insert-000-schema.sql` is written to `output_dir` (confirms the flow actually runs end to end via Sqlite3) |
123
123
 
@@ -134,9 +134,9 @@ Error: `pg_dump` not found in PATH. exwiw needs pg_dump to generate insert-000-s
134
134
  2. `bundle exec rspec spec/adapter/mongodb_adapter_spec.rb`: stub the mongo client and confirm that the JS output contains `db.createCollection("users")` and the `createIndex(...)` calls for that collection.
135
135
 
136
136
  ### E2E (via the scenario scripts)
137
- 3. Run `scenario/test_with_sqlite3.sh` and confirm that `dump/insert-000-schema.sql` is generated and that, against an empty DB, `sqlite3 empty.db < dump/insert-000-schema.sql && for f in dump/insert-*.sql; do sqlite3 empty.db < $f; done` succeeds.
138
- 4. Likewise for `scenario/test_with_mysql2.sh` and `scenario/test_with_postgresql.sh`: confirm that `mysql empty_db < dump/insert-000-schema.sql` / `psql empty_db -f dump/insert-000-schema.sql` succeeds and that the insert files can then be loaded. **If `mysqldump` / `pg_dump` has to be run inside a docker compose container (the DB container started by `compose.yml`), update the scenario scripts.**
139
- 5. Run `scenario/test_with_mongodb.sh` and confirm that `dump/insert-000-schema.js` is written, that `mongosh "mongodb://localhost/empty_db" < dump/insert-000-schema.js` succeeds against an empty DB, and that each jsonl file can then be loaded with `mongoimport`.
137
+ 3. Run `e2e/test_with_sqlite3.sh` and confirm that `dump/insert-000-schema.sql` is generated and that, against an empty DB, `sqlite3 empty.db < dump/insert-000-schema.sql && for f in dump/insert-*.sql; do sqlite3 empty.db < $f; done` succeeds.
138
+ 4. Likewise for `e2e/test_with_mysql2.sh` and `e2e/test_with_postgresql.sh`: confirm that `mysql empty_db < dump/insert-000-schema.sql` / `psql empty_db -f dump/insert-000-schema.sql` succeeds and that the insert files can then be loaded. **If `mysqldump` / `pg_dump` has to be run inside a docker compose container (the DB container started by `compose.yml`), update the scenario scripts.**
139
+ 5. Run `e2e/test_with_mongodb.sh` and confirm that `dump/insert-000-schema.js` is written, that `mongosh "mongodb://localhost/empty_db" < dump/insert-000-schema.js` succeeds against an empty DB, and that each jsonl file can then be loaded with `mongoimport`.
140
140
  6. **Idempotency check**: applying the same schema file twice must not raise an error (the `IF NOT EXISTS` / `DO $$ EXCEPTION WHEN duplicate_object` / `try/catch on createCollection` guards take effect).
141
141
 
142
142
  ### Manual verification checkpoints
@@ -6,9 +6,9 @@
6
6
  already has an implementation that writes out `createCollection` / `createIndex`, but the scenario side
7
7
  had no path that applies it, so it was never verified in CI either. Concrete gaps:
8
8
 
9
- 1. `scenario/setup_with_mongodb.rb` only loads the seed data with `insert_many` and creates no indexes at all
9
+ 1. `e2e/setup_with_mongodb.rb` only loads the seed data with `insert_many` and creates no indexes at all
10
10
  2. As a result, `tmp/mongodb/insert-000-schema.js` contains only `createCollection` lines and zero `createIndex` lines
11
- 3. `scenario/import_with_mongodb.rb` globs and processes only `insert-*.jsonl` and never executes `insert-000-schema.js`
11
+ 3. `e2e/import_with_mongodb.rb` globs and processes only `insert-*.jsonl` and never executes `insert-000-schema.js`
12
12
 
13
13
  The "bring up from a clean DB" flow already introduced for sqlite3 / mysql2 / postgresql and
14
14
  MongoDB's `insert-000-schema.js` were not linked together (issue #16).
@@ -25,10 +25,10 @@ MongoDB's `insert-000-schema.js` were not linked together (issu
25
25
  ### scenario layer
26
26
  | Path | Change |
27
27
  |---|---|
28
- | `scenario/setup_with_mongodb.rb` | After loading the seed data, create three representative kinds of index (unique `shops.name` / plain `users.email` / compound `orders.shop_id+user_id`) |
29
- | `scenario/import_with_mongodb.rb` | Add the `--no-drop` and `--input-dir DIR` flags, because in the from-clean flow a drop would also delete the indexes that schema.js created |
30
- | `scenario/verify_with_mongodb.rb` | With `--with-indexes`, assert the target collections' indexes (skipped in the default scenario because they are dropped at import time) |
31
- | `scenario/test_with_mongodb_from_clean.sh` (new) | `mongosh dropDatabase` → run exwiw → `mongosh insert-000-schema.js` → `import --no-drop --input-dir tmp/mongodb-clean` → `verify --with-indexes` |
28
+ | `e2e/setup_with_mongodb.rb` | After loading the seed data, create three representative kinds of index (unique `shops.name` / plain `users.email` / compound `orders.shop_id+user_id`) |
29
+ | `e2e/import_with_mongodb.rb` | Add the `--no-drop` and `--input-dir DIR` flags, because in the from-clean flow a drop would also delete the indexes that schema.js created |
30
+ | `e2e/verify_with_mongodb.rb` | With `--with-indexes`, assert the target collections' indexes (skipped in the default scenario because they are dropped at import time) |
31
+ | `e2e/test_with_mongodb_from_clean.sh` (new) | `mongosh dropDatabase` → run exwiw → `mongosh insert-000-schema.js` → `import --no-drop --input-dir tmp/mongodb-clean` → `verify --with-indexes` |
32
32
  | `.github/workflows/scenario.yml` | Add a `mongodb-mongosh` install step and a `test_with_mongodb_from_clean.sh` run step to the with_mongodb job. The apt repo codename is pinned to `jammy` (on the assumption that ubuntu-latest moves up to noble) |
33
33
 
34
34
  ### snapshot test layer
@@ -56,8 +56,8 @@ MongoDB's `insert-000-schema.js` were not linked together (issu
56
56
 
57
57
  ## Verification
58
58
 
59
- - `bash scenario/test_with_mongodb.sh`: confirmed the existing scenario still passes ✓
60
- - `bash scenario/test_with_mongodb_from_clean.sh`: confirmed the new scenario passes
59
+ - `bash e2e/test_with_mongodb.sh`: confirmed the existing scenario still passes ✓
60
+ - `bash e2e/test_with_mongodb_from_clean.sh`: confirmed the new scenario passes
61
61
  (indexes round-trip OK) ✓
62
62
  - `bundle exec rspec`: all 153 examples / 0 failures ✓
63
63
  - Inspected `tmp/mongodb-clean/insert-000-schema.js` by eye:
@@ -9,22 +9,22 @@
9
9
 
10
10
  However, **there is no end-to-end test that verifies whether the generated COPY-mode SQL can actually be imported with `psql -f`**. The user suspects that COPY mode may be emitting invalid SQL and wants to verify this against a real DB.
11
11
 
12
- For the existing INSERT mode, `scenario/test_with_postgresql.sh` verifies everything up to and including re-import with `psql -f`. There is no corresponding COPY-mode version.
12
+ For the existing INSERT mode, `e2e/test_with_postgresql.sh` verifies everything up to and including re-import with `psql -f`. There is no corresponding COPY-mode version.
13
13
 
14
14
  Goal: add an E2E scenario that actually feeds COPY-mode output to psql, plus a snapshot regression test, to surface any latent invalid SQL.
15
15
 
16
16
  ## Changed files
17
17
 
18
- 1. **New** `scenario/test_with_postgresql_copy.sh`: E2E shell script
18
+ 1. **New** `e2e/test_with_postgresql_copy.sh`: E2E shell script
19
19
  2. **Modified** `spec/insert_output_snapshot_spec.rb`: SCENARIOS entry for COPY and `snapshot_subdir` support
20
20
  3. **Modified** `.github/workflows/scenario.yml`: new step in the `with_postgres` job
21
21
  4. **New** `spec/insert_output_snapshots/postgresql-copy/insert-*.sql`: auto-generated with `UPDATE_SNAPSHOTS=1`
22
22
 
23
23
  ## Details
24
24
 
25
- ### 1. `scenario/test_with_postgresql_copy.sh`
25
+ ### 1. `e2e/test_with_postgresql_copy.sh`
26
26
 
27
- Use `scenario/test_with_postgresql.sh` as the template and change only the following:
27
+ Use `e2e/test_with_postgresql.sh` as the template and change only the following:
28
28
 
29
29
  - `FROM_DATABASE_NAME="exwiw_scenario_prod_db_copy"`
30
30
  - `TO_DATABASE_NAME="exwiw_scenario_dev_db_copy"` (a name that does not collide with the existing scenario even when run in parallel)
@@ -47,7 +47,7 @@
47
47
  ```ruby
48
48
  {
49
49
  adapter: "postgresql",
50
- config_dir: "scenario/postgresql-schema",
50
+ config_dir: "e2e/postgresql-schema",
51
51
  output_format: "copy",
52
52
  snapshot_subdir: "postgresql-copy",
53
53
  connection: { adapter: "postgresql", database_name: "exwiw_test",
@@ -64,7 +64,7 @@
64
64
 
65
65
  ```yaml
66
66
  - name: Run exwiw (copy mode)
67
- run: scenario/test_with_postgresql_copy.sh
67
+ run: e2e/test_with_postgresql_copy.sh
68
68
  ```
69
69
 
70
70
  The `postgres:17-alpine` service and the `postgresql-client-17` install are already handled by existing steps, so nothing needs to be added.
@@ -80,7 +80,7 @@ UPDATE_SNAPSHOTS=1 bundle exec rspec spec/insert_output_snapshot_spec.rb
80
80
  ## Verification steps
81
81
 
82
82
  1. Start `docker compose up -d postgres` locally
83
- 2. Run `bash scenario/test_with_postgresql_copy.sh`. Exit 0 means the COPY-mode SQL is valid via psql; non-zero means invalid SQL has surfaced (at that point, identify the cause and fix it separately)
83
+ 2. Run `bash e2e/test_with_postgresql_copy.sh`. Exit 0 means the COPY-mode SQL is valid via psql; non-zero means invalid SQL has surfaced (at that point, identify the cause and fix it separately)
84
84
  3. Generate the snapshots with `UPDATE_SNAPSHOTS=1 bundle exec rspec spec/insert_output_snapshot_spec.rb`
85
85
  4. Re-run `bundle exec rspec spec/insert_output_snapshot_spec.rb` without `UPDATE_SNAPSHOTS` and confirm that all scenarios (sqlite3 / mysql2 / postgresql / postgresql-copy / mongodb) pass
86
86
  5. Confirm on CI that the new `Run exwiw (copy mode)` step of the `with_postgres` job passes (or detects invalid SQL)
@@ -81,7 +81,7 @@ handles indirect / polymorphic uniformly and correctly.
81
81
  Add a case confirming that, when `ids_field` is specified, the target table's WHERE
82
82
  uses that column rather than the primary key.
83
83
  - End-to-end check with the `explain` subcommand (SQL only):
84
- Against an existing scenario (e.g. `scenario/sqlite3-schema`),
84
+ Against an existing scenario (e.g. `e2e/sqlite3-schema`),
85
85
  pass `--target-table=... --ids=... --ids-column=<col>` and visually confirm that the output SQL's WHERE is
86
86
  `<table>.<col> IN (...)`.
87
87
  - No regressions under `bundle exec rspec` (full suite).
@@ -0,0 +1,70 @@
1
+ # Plan: Back out MongoDB fork/cursor parallelism → optional Extended-JSON C extension
2
+
3
+ ## Context (why)
4
+
5
+ This branch (`gnhf/rt-rails-mongodb-dum-ed518c`) shipped a large MongoDB-dump perf effort across 18 iterations. The memory wins (iter 2–5: streaming result set + chunked output + precompiled `MaskPlan`) are clean, byte-identical, no-flag defaults and stay. But the CPU/throughput half (iter 6–18) grew into heavy multi-process machinery: two CLI flags (`--parallel-workers`, `--cursor-parallel`), four components (`ParallelSerializer`, `MongoIdPartitioner`, `PropagationCapture`, `ForkedPartWriter`), adapter `write_bulk_insert`/`write_inserts` seams, CLI→Runner→Adapter threading, fork orchestration, Marshal IPC, `_id`-range partitioning, distributed `@state` merge, and a sorted-output caveat — to buy ~1.1–1.4× (serialize-fork) / ~2.5–5.5× (cursor-parallel).
6
+
7
+ The maintainer's decision: **this is over-engineered.** Preserve the findings as `docs/optimization-notes.md`, **remove the parallelism**, and address the CPU hotspot with a much simpler lever the earlier iterations kept pointing at — a **C extension** for the dominant `as_extended_json(mode: :relaxed) + JSON.generate` cost.
8
+
9
+ **Honest scope of the win:** a C extension speeds only the per-document Extended-JSON *serialization* (~82% of per-doc serialize cost). It cannot touch the Mongo cursor's BSON→Ruby *decode* (~40% of total wall time, inside the driver) — that was cursor-parallel's job and is being removed. So end-to-end is bounded ~2.5× (the same Amdahl ceiling `--parallel-workers` hit), not cursor-parallel's 3.4–5.5×. The trade is deliberate: far simpler code + a no-flag, no-fork, single-process CPU win, for a smaller peak speedup. Memory behavior is unchanged (streaming stays the default).
10
+
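The ~2.5× bound quoted above follows from Amdahl's law. As a worked sketch, take the ~40% cursor-decode share as the part that stays serial and assume, optimistically, that all remaining work could be made free:

```latex
S_{\max} = \frac{1}{f_{\text{decode}}} \approx \frac{1}{0.40} = 2.5
\qquad\text{and, for a finite speedup } k \text{ on the rest,}\qquad
S(k) = \frac{1}{f_{\text{decode}} + \dfrac{1 - f_{\text{decode}}}{k}} < 2.5
```

The symbol names here are illustrative; only the ~40% figure and the ~2.5× ceiling come from the plan.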
11
+ ## Part 1 — Remove the fork/cursor parallelism (keep streaming + MaskPlan)
12
+
13
+ Verified: `git diff b389204..HEAD` on the core lib files is **entirely** parallelism (`b389204` = iter-5 "MaskPlan" commit). So the iter-5 versions are exactly the "keep streaming, drop parallel" baseline.
14
+
15
+ **Restore to `b389204` (clean streaming baseline):**
16
+ - `lib/exwiw/runner.rb` — back to the inline `each_slice` chunk loop (`file.print(adapter.to_bulk_insert(chunk))`), no `write_inserts`/`parallel_workers`/`cursor_parallel`.
17
+ - `lib/exwiw/adapter.rb` — `Base#initialize(connection_config, logger)` (drop the two kwargs + ivars), `self.build(connection_config, logger)`; drop the `write_bulk_insert`/`write_inserts` seams (they exist only for parallelism). Keep `default_bulk_insert_chunk_size` (streaming).
18
+ - `lib/exwiw/adapter/mongodb_adapter.rb` — iter-5 form: `StreamingResult` without `query`/`keys` readers; `db` with inline client construction (drop `build_client`); no `write_bulk_insert`/`write_inserts`/`cursor_parallel`/`parallel_workers`/partitioner/capture. Keep `StreamingResult`, `MaskPlan`, chunking. (Then patch `serialize_document` in Part 3.)
19
+ - `lib/exwiw/cli.rb` — drop `--parallel-workers`/`--cursor-parallel` flags, ivars, `validate_parallel_workers!`/`validate_cursor_parallel!`, and the Runner kwargs.
20
+ - `lib/exwiw.rb` — drop the four parallel `require_relative` lines (Part 3 adds the `ext_json` require).
21
+
22
+ **Delete (components + specs + probes):**
23
+ - `lib/exwiw/{parallel_serializer,forked_part_writer,mongo_id_partitioner,propagation_capture}.rb`
24
+ - `spec/{parallel_serializer,forked_part_writer,mongo_id_partitioner,propagation_capture}_spec.rb`
25
+ - `script/bench_mongodb_parallel_probe.rb`, `script/bench_mongodb_cursor_parallel_probe.rb`
26
+
27
+ **Restore to `b389204`:** `spec/adapter_spec.rb`, `spec/cli_spec.rb`, `spec/adapter/mongodb_adapter_spec.rb`, `script/bench_mongodb_dump.rb` (all post-iter-5 changes there are parallel-only).
28
+
29
+ **Edit, don't restore:** `README.md`, `CHANGELOG.md` — remove the `--parallel-workers` / `--cursor-parallel` / `EXWIW_MONGODB_*` sections; the `[Unreleased]` entry becomes "streaming/chunked MongoDB dump by default + optional Extended-JSON C extension", pointing to `docs/optimization-notes.md`.
30
+
31
+ ## Part 2 — `docs/optimization-notes.md`
32
+
33
+ Distill the 18-iteration log (`.gnhf/runs/.../notes.md`) into a durable doc: the two hotspots (result-set memory; `as_extended_json` CPU), what shipped by default (streaming + chunking + `MaskPlan`), and the **explored-and-removed** parallelism — fork serialize (~1.1–1.4×), cursor-parallel (~3.4–5.5× but heavy: IPC, `_id` partitioning, distributed `@state`, sorted output) — with the Amdahl reasoning (serial cursor decode floor) and *why* it was removed in favor of the C extension. This is the knowledge-preservation the maintainer asked for.
34
+
35
+ ## Part 3 — Optional Extended-JSON C extension — DOCUMENT ONLY (not implemented now)
36
+
37
+ **Revised scope (per maintainer):** Part 3 is NOT implemented in this pass. Instead, write the design below into `docs/optimize-mongodb-export-with-native-ext.md` as a future-work / design doc. Only Part 1 and Part 2 are executed now.
38
+
39
+ Design to capture: replaces `JSON.generate(doc.as_extended_json(mode: :relaxed))` with one native tree-walk that emits the Relaxed Extended JSON line directly — no intermediate transformed-Hash tree, no second JSON pass.
40
+
41
+ **New files**
42
+ - `ext/exwiw/ext_json/extconf.rb` — `mkmf` (stdlib), `create_makefile("exwiw/ext_json_native")`.
43
+ - `ext/exwiw/ext_json/ext_json.c` — defines `Exwiw::ExtJson.encode_native(doc) -> String` (one JSONL line, no trailing `\n`). Recursive emitter into a growth buffer.
44
+ - `lib/exwiw/ext_json.rb` — shim: `require "exwiw/ext_json_native"` (distinct name avoids self-collision); on `LoadError`, define a pure-Ruby `encode`. Exposes one stable `Exwiw::ExtJson.encode(doc)`. Fallback is **byte-for-byte today's code**:
45
+ ```ruby
46
+ JSON.generate(doc.respond_to?(:as_extended_json) ? doc.as_extended_json(mode: :relaxed) : doc)
47
+ ```
48
+
49
+ **Native fast-path vs delegate (byte-identity strategy, from the Plan agent's findings)**
50
+ - Native in C: `Hash` (insertion order via `rb_hash_foreach`; String keys), `Array`, `String` (escape only `\b\t\n\f\r\"\\`; lowercase `\u00xx` for other <0x20; leave `/`, DEL, U+2028/9, non-ASCII raw), `Integer` within int64, `true`/`false`/`nil`, and `BSON::ObjectId` → `{"$oid":"<24 hex>"}` (hex via `to_s`).
51
+ - **Delegate to Ruby** (call back to a fallback helper returning the JSON fragment for that value): `Float` (`Float#to_s` ≠ `JSON.generate` for sci-notation, e.g. `1e20`), `Time` (variable fractional digits + the years-[1970,9999] `$numberLong` boundary — highest risk), out-of-int64 `Integer` (must preserve the existing `RangeError`), and any unrecognized class (Decimal128, Binary, Symbol, Regexp, Date, …). This is provably byte-identical because `Hash/Array#as_extended_json` are non-transforming structural recursion, so `JSON.generate(v.as_extended_json(mode: :relaxed))` on any sub-value matches the whole-tree output. Time/Float are candidates for later native promotion if the benchmark justifies the added risk.
52
+
53
+ **Packaging / wiring**
54
+ - `exwiw.gemspec` — `spec.extensions = ["ext/exwiw/ext_json/extconf.rb"]` (auto-compiles on `gem install`; fallback covers platforms that can't).
55
+ - `Rakefile` — `Rake::ExtensionTask.new("ext_json_native")`; make `spec` depend on `compile`.
56
+ - Add `rake-compiler` as a dev dep (Gemfile/gemspec). `mkmf` is stdlib.
57
+ - `.gitignore` — ignore built artifacts (`lib/exwiw/*.bundle`, `lib/exwiw/*.so`, `ext/**/*.o`, `ext/**/Makefile`); commit only `ext/` sources (gemspec ships via `git ls-files`).
58
+ - `lib/exwiw.rb` — `require_relative "exwiw/ext_json"`.
59
+ - `lib/exwiw/adapter/mongodb_adapter.rb` — `serialize_document`/`to_bulk_insert` inner call becomes `ExtJson.encode(doc)` (after `apply_mask_plan!`); remove the now-unused private `#extended_json`.
60
+
61
+ ## Verification (Part 1 + Part 2 scope)
62
+
63
+ - **`bundle exec rspec`** — full suite green after the revert. The parallel specs are deleted; the restored specs match iter-5. (`spec/insert_output_snapshot_spec.rb` and other mongodb-touching specs need live mongo on 27017 → sandbox disabled; the rest run normally.)
64
+ - **`spec/insert_output_snapshot_spec.rb`** (mongodb fixtures, live mongo) — the byte-exact guard; output bytes must be unchanged by the revert (streaming default is byte-identical to iter-5).
65
+ - `git grep -nE 'parallel_workers|cursor_parallel|ParallelSerializer|MongoIdPartitioner|PropagationCapture|ForkedPartWriter'` returns nothing in `lib/`, `spec/`, `exe/`, `README.md` after the revert (confirms full removal).
66
+ - `docs/optimize-mongodb-export-with-native-ext.md` and `docs/optimization-notes.md` exist and read coherently.
67
+
68
+ ## Notes
69
+ - No `git rebase`/`push -f` (history may stay messy; backing out via forward edits + deletions, not history rewrite).
70
+ - After implementation, offer to commit the plan via the remember-plan skill.
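The load-or-fall-back shim described in Part 3 could look roughly like the sketch below. This is an illustration under assumptions, not the planned implementation: `ExtJsonSketch` is a hypothetical module name, and the `exwiw/ext_json_native` require path is the one proposed in the plan, which is not expected to be present, so the pure-Ruby branch is what actually runs.

```ruby
require "json"

# Hypothetical stand-in for the planned Exwiw::ExtJson shim.
module ExtJsonSketch
  begin
    # Proposed native extension name from the plan; normally absent,
    # in which case LoadError selects the pure-Ruby fallback.
    require "exwiw/ext_json_native"
    NATIVE_LOADED = true
  rescue LoadError
    NATIVE_LOADED = false
  end

  # One stable entry point regardless of which implementation is available.
  def self.encode(doc)
    if NATIVE_LOADED && respond_to?(:encode_native)
      encode_native(doc)
    else
      # Same expression the plan quotes as today's behaviour; plain Hashes
      # without #as_extended_json are passed to JSON.generate unchanged.
      JSON.generate(doc.respond_to?(:as_extended_json) ? doc.as_extended_json(mode: :relaxed) : doc)
    end
  end
end
```

Keeping a single `encode` entry point means callers never branch on whether the native library compiled, which is what lets the fallback cover platforms that cannot build it.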
@@ -14,6 +14,57 @@ module Exwiw
14
14
  Exwiw::MongodbCollectionConfig
15
15
  end
16
16
 
17
+ # A lazy, streaming stand-in for the materialized result array #execute
18
+ # used to return. Wrapping the Mongo cursor (instead of `.to_a`) keeps the
19
+ # dump's dominant memory cost — the full result set — off the heap: the
20
+ # Runner pulls documents through `each_slice`, so at most one chunk of
21
+ # documents (plus the small propagation-key arrays) is resident at a time,
22
+ # even for large or embed-heavy collections.
23
+ #
24
+ # It satisfies the two things the Runner asks of an execute result:
25
+ # - #size: the record count, used to skip empty collections and to log.
26
+ # Answered with a `count_documents` query (which only walks index
27
+ # entries, far cheaper than fetching every document) rather than by
28
+ # draining the cursor.
29
+ # - #each (via Enumerable / each_slice): a single streaming pass over the
30
+ # cursor. While streaming it captures — per propagation key, BEFORE
31
+ # handing the document to the caller's masking — the values downstream
32
+ # children will `$in`-match against, publishing them into @state once
33
+ # the pass completes.
34
+ #
35
+ # Contract note: unlike the old `.to_a` execute, which populated @state
36
+ # eagerly, this defers state capture until the result is consumed. The
37
+ # Runner always fully consumes a non-empty result before any child
38
+ # collection is processed, so propagation is unaffected; a caller that only
39
+ # needs @state must iterate the result (e.g. `.to_a`).
40
+ class StreamingResult
41
+ include Enumerable
42
+
43
+ def initialize(view:, collection:, keys:, state:)
44
+ @view = view
45
+ @collection = collection
46
+ @keys = keys
47
+ @state = state
48
+ end
49
+
50
+ def size
51
+ @size ||= @view.count_documents
52
+ end
53
+ alias length size
54
+
55
+ def each
56
+ return enum_for(:each) { size } unless block_given?
57
+
58
+ captured = @keys.each_with_object({}) { |key, acc| acc[key] = [] }
59
+ @view.each do |doc|
60
+ @keys.each { |key| captured[key] << doc[key] }
61
+ yield doc
62
+ end
63
+ @state[@collection] = captured
64
+ self
65
+ end
66
+ end
67
+
17
68
  def initialize(connection_config, logger)
18
69
  super
19
70
  @state = {}
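The deferred-capture contract of the streaming result can be illustrated in isolation. The sketch below is a simplified stand-in, not the adapter's class: `LazyResultSketch` and its `source:` / `name:` parameters are hypothetical, and a plain Ruby enumerator takes the place of the Mongo cursor.

```ruby
# Hypothetical, simplified stand-in for a cursor-backed streaming result.
class LazyResultSketch
  include Enumerable

  def initialize(source:, name:, keys:, state:)
    @source = source # anything responding to #each (a cursor in the real adapter)
    @name = name
    @keys = keys
    @state = state
  end

  def each
    return enum_for(:each) unless block_given?

    captured = @keys.to_h { |key| [key, []] }
    @source.each do |doc|
      # Record key values before the caller gets a chance to mutate the doc.
      @keys.each { |key| captured[key] << doc[key] }
      yield doc
    end
    # State is published only after the full pass has completed.
    @state[@name] = captured
    self
  end
end

state = {}
docs = (1..5).map { |i| { "_id" => i } }
result = LazyResultSketch.new(source: docs.each, name: "users", keys: ["_id"], state: state)
chunk_sizes = result.each_slice(2).map(&:size)
```

Until `each_slice` has drained the source, `state` stays empty, which is why a caller that only needs the propagation keys still has to iterate the result.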
@@ -86,22 +137,24 @@ module Exwiw
86
137
  def execute(query)
87
138
  @logger.debug(" Executing Mongo find on '#{query.collection}': filter=#{query.filter.inspect} projection=#{query.projection.inspect}")
88
139
 
89
- docs = db[query.collection]
140
+ view = db[query.collection]
90
141
  .find(query.filter)
91
142
  .projection(query.projection)
92
143
  .comment(query_comment_text("collection=#{query.collection}"))
93
- .to_a
94
144
 
95
- # Stash, per referenced field, the values children will `$in`-match
96
- # against. @propagation_keys is set by the build_query call for this same
145
+ # Per referenced field, the values children will `$in`-match against.
146
+ # @propagation_keys is set by the build_query call for this same
97
147
  # collection; fall back to the primary key if execute is driven without a
98
148
  # preceding build_query (e.g. in isolation from a test).
99
149
  keys = @propagation_keys || [query.primary_key]
100
- @state[query.collection] = keys.each_with_object({}) do |key, acc|
101
- acc[key] = docs.map { |doc| doc[key] }
102
- end
103
150
 
104
- docs
151
+ # Return a streaming view of the result set rather than `.to_a`-ing the
152
+ # whole collection into memory. The Runner pulls documents through
153
+ # `each_slice`, so only one chunk's worth is resident at a time even for
154
+ # large / embed-heavy collections — the dump's dominant memory cost. The
155
+ # propagation-key values are captured as the cursor streams and published
156
+ # into @state once the pass completes (see StreamingResult).
157
+ StreamingResult.new(view: view, collection: query.collection, keys: keys, state: @state)
105
158
  end
106
159
 
107
160
  # NOTE: relies on @embedded_children_by_parent set by a prior build_query
@@ -110,9 +163,9 @@ module Exwiw
110
163
  # to_bulk_insert (SQL adapters don't need it). Safe in Runner, fragile in
111
164
  # tests — call build_query first.
112
165
  def to_bulk_insert(rows, config)
166
+ plan = mask_plan(config)
113
167
  rows.map do |doc|
114
- apply_replace_with!(doc, config)
115
- apply_embedded_masking!(doc, config)
168
+ apply_mask_plan!(doc, plan)
116
169
  JSON.generate(extended_json(doc))
117
170
  end.join("\n")
118
171
  end
@@ -135,6 +188,20 @@ module Exwiw
135
188
  'jsonl'
136
189
  end
137
190
 
191
+ # Bound how many documents are serialized at once when a collection config
192
+ # carries no explicit bulk_insert_chunk_size. A MongoDB dump is one JSONL
193
+ # line per document and, without chunking, the Runner would materialize the
194
+ # entire collection's output as a single giant string while the full
195
+ # in-memory result set is still alive — doubling peak memory on large or
196
+ # embed-heavy collections. Chunking lets the Runner stream each slice to the
197
+ # file and release its serialized string (and the transient extended-JSON
198
+ # trees) before building the next.
199
+ DEFAULT_BULK_INSERT_CHUNK_SIZE = 1_000
200
+
201
+ def default_bulk_insert_chunk_size
202
+ DEFAULT_BULK_INSERT_CHUNK_SIZE
203
+ end
204
+
138
205
  def schema_output_extension
139
206
  'js'
140
207
  end
@@ -325,41 +392,101 @@ module Exwiw
325
392
  parent_fields[reference_field]
326
393
  end
327
394
 
328
- private def apply_replace_with!(doc, config)
329
- config.fields.each do |field|
395
+ # A masking plan compiled once per collection config and reused for every
396
+ # document of that collection. `masked_fields` is `[field_name,
397
+ # template_segments]` for each field carrying a `replace_with`;
398
+ # `embedded` is one EmbeddedMask per embedded child.
399
+ MaskPlan = Struct.new(:masked_fields, :embedded)
400
+
401
+ # A pre-resolved embedded-child mask: the parent path split once into
402
+ # `prefix` (the containers to descend into) and `last` (the field holding
403
+ # the subdocument(s)), plus the child's own MaskPlan.
404
+ EmbeddedMask = Struct.new(:prefix, :last, :plan)
405
+
406
+ # Build (or fetch) the cached MaskPlan for `config`. Masking runs over every
407
+ # document AND every embedded subdocument, so for an embed-heavy collection
408
+ # the same per-config decisions — which fields carry a `replace_with`, how
409
+ # each template splits into segments, where the embedded children live —
410
+ # were previously recomputed tens of times per document. Compiling them once
411
+ # per config lets #apply_mask_plan! do nothing but the work that actually
412
+ # varies per document (rendering templates, descending into subdocuments),
413
+ # so the saved per-subdocument overhead scales down with embedding count.
414
+ #
415
+ # Cached by config name: names are unique within a run and the configs do
416
+ # not mutate mid-dump. Relies on @embedded_children_by_parent, set by the
417
+ # build_query call that always precedes to_bulk_insert (see #to_bulk_insert).
418
+ private def mask_plan(config)
419
+ (@mask_plans ||= {})[config.name] ||= build_mask_plan(config)
420
+ end
421
+
422
+ private def build_mask_plan(config)
423
+ masked_fields = config.fields.each_with_object([]) do |field, acc|
330
424
  next unless field.replace_with
331
425
 
332
- doc[field.name] = field.replace_with.gsub(/\{([^{}]+)\}/) do
333
- ref = Regexp.last_match(1)
334
- (doc.key?(ref) ? doc[ref] : nil).to_s
335
- end
426
+ acc << [field.name, compile_template(field.replace_with)]
427
+ end
428
+ embedded = embedded_children_of(config).map do |child|
429
+ *prefix, last = child.embedded_in.path.split(".")
430
+ EmbeddedMask.new(prefix, last, build_mask_plan(child))
336
431
  end
432
+ MaskPlan.new(masked_fields, embedded)
337
433
  end
338
434
 
339
- private def apply_embedded_masking!(doc, parent_config)
340
- embedded_children_of(parent_config).each do |child|
341
- walk(doc, child.embedded_in.path) do |subdoc|
342
- apply_replace_with!(subdoc, child)
343
- apply_embedded_masking!(subdoc, child)
435
+ # Apply a precompiled MaskPlan to a document in place: render each masked
436
+ # field, then descend into each embedded child (recursing into its own
437
+ # plan). Equivalent to the old apply_replace_with! + apply_embedded_masking!
438
+ # pair, with all per-config lookups hoisted into the plan.
439
+ private def apply_mask_plan!(doc, plan)
440
+ plan.masked_fields.each do |name, segments|
441
+ doc[name] = render_template(segments, doc)
442
+ end
443
+ plan.embedded.each do |child|
444
+ container = child.prefix.reduce(doc) { |acc, seg| acc.is_a?(Hash) ? acc[seg] : nil }
445
+ next unless container.is_a?(Hash)
446
+
447
+ case (value = container[child.last])
448
+ when Array then value.each { |sub| apply_mask_plan!(sub, child.plan) if sub.is_a?(Hash) }
449
+ when Hash then apply_mask_plan!(value, child.plan)
344
450
  end
345
451
  end
346
452
  end
347
453
 
348
- private def embedded_children_of(parent_config)
349
- @embedded_children_by_parent.fetch(parent_config.name, [])
454
+ PLACEHOLDER_PATTERN = /\{([^{}]+)\}/
455
+
456
+ # Split a `replace_with` template into a flat list of segments (called once
457
+ # per masked field at plan-build time, see #build_mask_plan). A segment is
458
+ # either a literal String or a 1-element Array `[ref]` marking a `{ref}`
459
+ # placeholder. #render_template then concatenates them, skipping the regex
460
+ # scan / block / `Regexp.last_match` a per-document `gsub` would repeat (~2.5x
461
+ # faster per field). The segment walk reproduces the old gsub byte-for-byte
462
+ # (missing keys render as "", literals pass through unchanged).
463
+ private def compile_template(template)
464
+ segments = []
465
+ pos = 0
466
+ while (md = PLACEHOLDER_PATTERN.match(template, pos))
467
+ segments << template[pos...md.begin(0)] if md.begin(0) > pos
468
+ segments << [md[1]]
469
+ pos = md.end(0)
470
+ end
471
+ segments << template[pos..] if pos < template.length
472
+ segments
350
473
  end
351
474
 
352
- private def walk(doc, dotted_path)
353
- segments = dotted_path.split(".")
354
- *prefix, last = segments
355
- container = prefix.reduce(doc) { |acc, seg| acc.is_a?(Hash) ? acc[seg] : nil }
356
- return unless container.is_a?(Hash)
357
-
358
- value = container[last]
359
- case value
360
- when Array then value.each { |sub| yield sub if sub.is_a?(Hash) }
361
- when Hash then yield value
475
+ private def render_template(segments, doc)
476
+ out = +''
477
+ segments.each do |seg|
478
+ if seg.is_a?(Array)
479
+ ref = seg[0]
480
+ out << (doc.key?(ref) ? doc[ref] : nil).to_s
481
+ else
482
+ out << seg
483
+ end
362
484
  end
485
+ out
486
+ end
487
+
488
+ private def embedded_children_of(parent_config)
489
+ @embedded_children_by_parent.fetch(parent_config.name, [])
363
490
  end
364
491
 
365
492
  private def extended_json(doc)
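To see why precompiling a `replace_with` template can be byte-identical to the per-document `gsub`, the two approaches can be compared side by side. This is a standalone sketch modelled on the methods in the hunk above; the top-level method names and the sample document are illustrative only.

```ruby
PLACEHOLDER = /\{([^{}]+)\}/

# Split a template once into literal Strings and [ref] placeholder markers.
def compile_template_sketch(template)
  segments = []
  pos = 0
  while (md = PLACEHOLDER.match(template, pos))
    segments << template[pos...md.begin(0)] if md.begin(0) > pos
    segments << [md[1]]
    pos = md.end(0)
  end
  segments << template[pos..] if pos < template.length
  segments
end

# Concatenate segments for one document; missing keys render as "".
def render_template_sketch(segments, doc)
  segments.each_with_object(+"") do |seg, out|
    out << (seg.is_a?(Array) ? (doc.key?(seg[0]) ? doc[seg[0]] : nil).to_s : seg)
  end
end

# The per-document regex approach that the precompiled form replaces.
def render_with_gsub_sketch(template, doc)
  template.gsub(PLACEHOLDER) do
    ref = Regexp.last_match(1)
    (doc.key?(ref) ? doc[ref] : nil).to_s
  end
end

template = "user-{id}@{missing}example.com"
doc = { "id" => 7 }
compiled = compile_template_sketch(template)
```

The compile step runs once per field, so the per-document work reduces to string appends.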
data/lib/exwiw/adapter.rb CHANGED
@@ -113,6 +113,16 @@ module Exwiw
113
113
  raise NotImplementedError, "COPY format is not supported by #{self.class.name}"
114
114
  end
115
115
 
116
+ # Default bulk-insert chunk size when a table config does not set one.
117
+ # The Runner streams each chunk straight to the output file, so a non-nil
118
+ # value here bounds how much serialized output (and how many transient
119
+ # intermediate objects) live in memory at once. SQL adapters keep nil
120
+ # (one statement per table, as before); adapters whose output is large
121
+ # and built per-row (e.g. MongoDB JSONL) override with a positive value.
122
+ def default_bulk_insert_chunk_size
123
+ nil
124
+ end
125
+
116
126
  # Run the database-specific EXPLAIN for the given query and return the
117
127
  # output as a single string for `explain` subcommand to print.
118
128
  # SQL adapters override; MongodbAdapter currently raises.
data/lib/exwiw/runner.rb CHANGED
@@ -97,18 +97,36 @@ module Exwiw
97
97
  else
98
98
  phase = "generating INSERT statement"
99
99
  @logger.debug(" Generate INSERT statement...")
100
- chunk_size = table.bulk_insert_chunk_size
101
- chunks = chunk_size ? results.each_slice(chunk_size).to_a : [results]
102
- insert_sql = chunks.map { |chunk_rows| adapter.to_bulk_insert(chunk_rows, table) }.join("\n")
103
-
104
- @logger.info(" Generated INSERT statement for #{record_num} records (#{chunks.size} statement(s)).")
100
+ # Stream each chunk straight to the file instead of building the whole
101
+ # table's INSERT/JSONL output as one string first. This keeps only a
102
+ # single chunk's serialized text (and its transient intermediate
103
+ # objects) in memory at a time — important for large MongoDB
104
+ # collections, whose one-giant-chunk JSONL would otherwise be held in
105
+ # full alongside the already-large in-memory result set.
106
+ #
107
+ # The chunk size falls back to the adapter's default when the table
108
+ # config does not set one (SQL adapters: nil -> one statement, as
109
+ # before; MongoDB: a positive default so the output is chunked). The
110
+ # bytes written are identical to joining the chunks with "\n" and
111
+ # appending a trailing newline, matching the previous `file.puts`.
112
+ chunk_size = table.bulk_insert_chunk_size || adapter.default_bulk_insert_chunk_size
113
+ chunks = chunk_size ? results.each_slice(chunk_size) : [results]
114
+
115
+ statement_count = 0
105
116
  File.open(File.join(@output_dir, "insert-#{insert_idx}-#{table_name}.#{adapter.output_extension}"), 'w') do |file|
106
117
  pre = adapter.pre_insert_sql(table)
107
118
  file.puts(pre) if pre
108
- file.puts(insert_sql)
119
+ chunks.each do |chunk_rows|
120
+ file.print("\n") if statement_count.positive?
121
+ file.print(adapter.to_bulk_insert(chunk_rows, table))
122
+ statement_count += 1
123
+ end
124
+ file.print("\n")
109
125
  post = adapter.post_insert_sql(table)
110
126
  file.puts(post) if post
111
127
  end
128
+
129
+ @logger.info(" Generated INSERT statement for #{record_num} records (#{statement_count} statement(s)).")
112
130
  end
113
131
 
114
132
  if adapter.supports_bulk_delete? && !@insert_only && !(table.respond_to?(:rails_managed?) && table.rails_managed?)
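The comment in the hunk above claims the chunked writes produce the same bytes as the previous join-then-`puts`. A minimal check of that claim, using `StringIO` in place of the output file and `chunk.join("\n")` as a stand-in for `adapter.to_bulk_insert` (both are assumptions of this sketch, as are the method names):

```ruby
require "stringio"

# Chunked variant: print each chunk, separated by "\n", then a final newline.
def write_chunked_sketch(io, rows, chunk_size)
  chunks = chunk_size ? rows.each_slice(chunk_size) : [rows]
  count = 0
  chunks.each do |chunk|
    io.print("\n") if count.positive?
    io.print(chunk.join("\n")) # stand-in for adapter.to_bulk_insert(chunk, table)
    count += 1
  end
  io.print("\n")
  count
end

# Previous variant: build the whole output, then a single puts.
def write_joined_sketch(io, rows)
  io.puts(rows.join("\n"))
end

rows = %w[a b c d e]
chunked = StringIO.new
joined = StringIO.new
statement_count = write_chunked_sketch(chunked, rows, 2)
write_joined_sketch(joined, rows)
```

Only one chunk's string exists at a time in the chunked variant, while the joined variant holds the full output before writing.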
data/lib/exwiw/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Exwiw
4
- VERSION = "0.5.2"
4
+ VERSION = "0.5.3"
5
5
  end
data/mise.toml CHANGED
@@ -1,6 +1,6 @@
1
1
  [env]
2
- # Prepend scenario/bin so `pg_dump` resolves to the wrapper that delegates to
2
+ # Prepend e2e/bin so `pg_dump` resolves to the wrapper that delegates to
3
3
  # the postgres container (compose.yml). exwiw's PostgreSQL adapter shells out
4
4
  # to pg_dump, which requires a server/client major-version match — the dev DB
5
5
  # is postgres:17 while host clients are often older (e.g. Homebrew pg14).
6
- _.path = ["./scenario/bin"]
6
+ _.path = ["./e2e/bin"]
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: exwiw
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.5.2
4
+ version: 0.5.3
5
5
  platform: ruby
6
6
  authors:
7
7
  - Shia
@@ -35,12 +35,15 @@ files:
35
35
  - CHANGELOG.md
36
36
  - LICENSE.txt
37
37
  - README.md
38
+ - docs/optimization-notes.md
39
+ - docs/optimize-mongodb-export-with-native-ext.md
38
40
  - docs/plans/2026-05-15-insert-000-schema-file.md
39
41
  - docs/plans/2026-05-16-mongodb-from-clean-scenario.md
40
42
  - docs/plans/2026-05-22-after-insert-hook.md
41
43
  - docs/plans/2026-05-22-postgres-copy-mode-scenario-test.md
42
44
  - docs/plans/2026-05-29-rails-managed-tables.md
43
45
  - docs/plans/2026-05-31-ids-column-for-sql-adapters.md
46
+ - docs/plans/2026-06-19-mongodb-export-remove-parallelism-native-ext.md
44
47
  - exe/exwiw
45
48
  - lib/exwiw.rb
46
49
  - lib/exwiw/adapter.rb