exwiw 0.5.3 → 0.6.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 4df9132fdf081f28d9d0c53ade6bba8267f0f5ec5c40e19ab42fa45d9cfa5a10
4
- data.tar.gz: 4f20a649c126a4617cd5e90fd09eb2eb7f92b068408d5ef9e927eac98a5fadda
3
+ metadata.gz: b66bdb8ccf3d2043c0a903c70071e0f070b41b9fd89dbc09e3c5385c9b4916c1
4
+ data.tar.gz: cf1224f47635113127f2e1fcf38f161c539f7b68f0bb8c85ba0c428fa6a202d1
5
5
  SHA512:
6
- metadata.gz: 309984a32b8cf593b5499427aa089c1c235332a5ca0cda584a2520d3dd0b073f33839adbe20487bce6bc31c20ba66434ce5e0cb1eab5e53bbdfa72bcf115707f
7
- data.tar.gz: de045c9fef6510561b9fd4e713e6a9ba4be7c145f570aaf3da108cd7a4bf3512d0f8757c33e40470bf78b3e7dbebe5244013978c21a974610bca86ceeab54fd9
6
+ metadata.gz: 41a4ef3c8a6da41ccbb92d7e42a519c5260a1bbcf53ce9cdcfe3a18dbca55d509c7ca1492e726bdb82bfc8c101a0a6bcee1dec1c051ceae44910009801d4b64d
7
+ data.tar.gz: ce5cdc2e96fcacbfc2a7c3ca23c72a22a1c3e765da2e69ad21fd9ebd87bbfdc3619e54a0c6eb5ae94321e4dc1b8651ca73a965cf33509ecafba9349b8a894883
data/CHANGELOG.md CHANGED
@@ -2,6 +2,15 @@
2
2
 
3
3
  ## [Unreleased]
4
4
 
5
+ ## [0.6.1] - 2026-06-20
6
+
7
+ ## [0.6.0] - 2026-06-20
8
+
9
+ ### Added
10
+
11
+ - Optimize memory usage https://github.com/heyinc/exwiw/pull/118
12
+ - **MongoDB: optional native (C) encoder for the Extended-JSON dump path** (no flag, byte-identical output, pure-Ruby fallback). Encoding each document to MongoDB Relaxed Extended JSON — previously `JSON.generate(doc.as_extended_json(mode: :relaxed))`, which rebuilds the whole document into an intermediate transformed Hash tree and then walks it again — was the dominant per-document CPU cost (~82% of serialization on embed-heavy data). A new C extension (`ext/exwiw/ext_json/`) emits the JSONL line in a single native tree-walk. It formats the structural bulk plus the leaves that dominate a dumped document — `Hash`, `Array`, `String`, fixnum `Integer`, `true`/`false`/`nil`, `BSON::ObjectId` (`_id`), and in-range `Time` (the Mongoid `created_at`/`updated_at` timestamps) — and delegates everything else (`Float`, out-of-int64 `Integer`, out-of-range `Time`, `Symbol`, `Decimal128`, …) back to the exact pure-Ruby path, so the output is provably byte-for-byte identical. On a 30-embedded-post timestamp-heavy document this serializes ~2.8× faster. With `gem install exwiw` the extension compiles automatically; hosts that cannot compile (JRuby/TruffleRuby, no toolchain) fall back to the pure-Ruby encoder, so exwiw stays installable as a pure-Ruby gem. See [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md).
13
+
5
14
  ## [0.5.3] - 2026-06-19
6
15
 
7
16
  ### Changed
data/README.md CHANGED
@@ -647,7 +647,7 @@ The MongoDB adapter is experimental. To use it:
647
647
  - `--ids` values are coerced to the type actually stored in `_id` before filtering: integer-looking ids become `Integer`, 24-char hex ids become `BSON::ObjectId` (Mongoid's default `_id` type — a plain String would never match an ObjectId), and any other string is left as-is.
648
648
  - `--target-collection=COLLECTION` is a mongodb-only alias of `--target-table` (use whichever reads better for MongoDB). Specifying both, or using `--target-collection` with a non-mongodb adapter, is an error.
649
649
  - `--ids-field=FIELD` matches `--ids` against `FIELD` on the target collection instead of its primary key (e.g. `--target-collection=users --ids=a@example.com --ids-field=email`). Downstream foreign-key propagation still keys off the primary key, so only the target collection's filter changes. Unlike the primary-key path, the supplied ids are **not** type-coerced (the stored type of a custom field is unknown), so pass values matching the field's actual type. This flag is **mongodb-only**; the SQL adapters use `--ids-column` instead (see below).
650
- - Large or embedded-document-heavy dumps are streamed automatically: the adapter reads the collection through a lazy cursor (not `.to_a`) and writes JSONL in chunks, so peak memory is bounded by the chunk size rather than the collection size — no flag to set. The dominant remaining cost is encoding each document to MongoDB Extended JSON (pure Ruby); see [`docs/optimization-notes.md`](docs/optimization-notes.md) for the performance investigation and [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md) for the proposed native-encoder follow-up. Benchmark your own data with `script/bench_mongodb_dump.rb`.
650
+ - Large or embedded-document-heavy dumps are streamed automatically: the adapter reads the collection through a lazy cursor (not `.to_a`) and writes JSONL in chunks, so peak memory is bounded by the chunk size rather than the collection size — no flag to set. Encoding each document to MongoDB Extended JSON is accelerated by an **optional native (C) extension** that compiles automatically on `gem install`; where it cannot compile, exwiw falls back to a byte-identical pure-Ruby encoder. See [`docs/optimization-notes.md`](docs/optimization-notes.md) for the performance investigation and [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md) for the native encoder's design. Benchmark your own data with `script/bench_mongodb_dump.rb`.
651
651
  - Output is JSON Lines (`insert-{idx}-{collection}.jsonl`) using MongoDB Extended JSON (relaxed mode). Import with `mongoimport`:
652
652
  ```bash
653
653
  mongoimport --db app_dev --collection users --file dump/insert-002-users.jsonl
@@ -0,0 +1,145 @@
1
+ # MongoDB scoping full-scan diagnosis (nullable-FK belongs_to)
2
+
3
+ Why a related collection's dump can be "especially slow" — dumping far more
4
+ records than the scope implies and scanning the whole collection — when running
5
+ an exwiw config against a MongoDB backup. The headline finding: the slowness is a
6
+ **symptom of a scoping bug**, not of serialization/decode cost.
7
+
8
+ ## Reproduction setup
9
+
10
+ - Backup: serve a raw WiredTiger dbpath with a standalone `mongod` (the same
11
+ `mongo:7` image the repo's `compose.yml` uses) on a spare port:
12
+
13
+ ```
14
+ docker run -d --name exwiw-restore-mongo --user 0:0 --entrypoint mongod \
15
+ -p 27018:27017 -v "<backup-dbpath>:/data/db" \
16
+ mongo:7 --dbpath /data/db --bind_ip_all
17
+ ```
18
+
19
+ Notes: run as root (`--user 0:0`) so mongod can read a backup's `0600` files;
20
+ bypass the image entrypoint (`--entrypoint mongod`) so it does not `gosu`-drop
21
+ back to the `mongodb` user. A backup carrying a `WiredTiger.backup` marker runs
22
+ recovery on the first start and **writes into the backup dir** (expected for a
23
+ restore). Starting a standalone from a replica-set backup yields a harmless
24
+ `system.replset` warning. Local DB connections may be blocked by a dev sandbox —
25
+ run measurement commands with the sandbox disabled.
26
+
27
+ - Run: `bundle exec exwiw export --config <app>/exwiw/exwiw.yml
28
+ --adapter=mongodb --host=localhost --port=27018 --database=<database>
29
+ --ids=<target-id> --output-dir=… --log-level=debug`.
30
+ (The optional-argument CLI flags — `--ids`, `--output-dir` — must use `=`; the
31
+ space form passes `nil` and crashes in the option callback.)
32
+
33
+ ## Baseline measurement (worked example, warm cache)
34
+
35
+ Full run ≈ **69s** over a few hundred non-empty collections. Per-collection wall
36
+ time (gap between consecutive `Processing table` log markers) was dominated by a
37
+ single collection:
38
+
39
+ | collection | time | records |
40
+ |---|---|---|
41
+ | **items** | **41.0s (59% of the whole run)** | 18,739 |
42
+ | (other collections) | ~5s and below each | — |
43
+
44
+ A "full run ≈ 10 min" figure is the **cold-cache** version (first read pages the
45
+ `items` data off the backup over the bind mount). The *relative* shape (items
46
+ dominates) is the same warm or cold.
47
+
48
+ ## Root cause: nullable belongs_to FK used as a hard `$in` AND constraint
49
+
50
+ The scope is tiny: one parent entity (`<target-id>`) → **1 store** (linked by
51
+ `business_entity_id` = the entity's **`uuid`**, correctly configured with
52
+ `references: uuid`) → **127 items** (`{store_id: <that store>}`, indexed, ~2ms).
53
+
54
+ But the run dumped **18,739** items, not 127, and scanned the whole collection.
55
+ Why:
56
+
57
+ 1. `MongodbAdapter#related_collection_filter` ANDs **every** belongs_to whose
58
+ parent produced ids. The `stores` filter became:
59
+
60
+ ```
61
+ { user_id: {$in: [580 ids]},
62
+ deleted_user_id: {$in: [96 ids]}, # nullable FK
63
+ business_entity_id: {$in: ["<target-id>"]} }
64
+ ```
65
+
66
+ The one matching store has **`deleted_user_id` absent (null)**, so it can never
67
+ satisfy `deleted_user_id ∈ {96 ids}`. The AND yields **0 stores** → `stores`
68
+ logs "No records matched. skip this table."
69
+
70
+ 2. With `stores` empty, `@state["stores"]` carries no ids, so when `items` is
71
+ built its `store_id` belongs_to contributes nothing. The remaining belongs_tos
72
+ are to **reference/master data that is dumped in full** —
73
+ `large_categories` and `medium_categories`. So the items filter degenerated to:
74
+
75
+ ```
76
+ { large_category_id: {$in: [98 ids]},
77
+ medium_category_id: {$in: [846 ids]} }
78
+ ```
79
+
80
+ 3. `items` has **no index on `large_category_id` / `medium_category_id`** (it does
81
+ have `store_id_1`). So this filter forces a **full COLLSCAN of all 2.43M items**
82
+ — and the Runner scans it **twice**: once for `StreamingResult#size`
83
+ (`count_documents`) and again for the fetch.
84
+
85
+ Isolated phase breakdown of the degenerate items query (warm): `count_documents`
86
+ 3.58s + fetch/decode 1.47s + serialize 0.26s. Serialization (the native
87
+ `Exwiw::ExtJson` C ext) is **not** the bottleneck — the COLLSCAN is, and it is far
88
+ worse cold.
89
+
90
+ The same nullable-FK problem applies one level down: the store's 127 items
91
+ themselves have `*_category_id` = null, so even `{store_id, large_category,
92
+ medium_category}` ANDed returns **0**. The only filter that yields the correct 127
93
+ is `{store_id: {$in: [store]}}` **alone**.
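The failure mode described above can be reproduced without a database. The following is a minimal sketch, assuming a hypothetical in-memory `matches?` helper that mimics only the `$in` behaviour discussed here (a missing field reads as `nil`). It is not the adapter's code, and the ids are made up for illustration.

```ruby
# Hypothetical stand-in for how MongoDB evaluates {field: {"$in" => [...]}}.
# A missing field reads as nil, so "$in" => [nil, ...] also matches absent fields.
def matches?(doc, filter)
  filter.all? { |field, cond| cond.fetch("$in").include?(doc[field]) }
end

# The one store in scope: it has no deleted_user_id at all.
store = { "_id" => 1, "business_entity_id" => "entity-uuid", "user_id" => 10 }

# Strict AND over every belongs_to, including the nullable FK.
strict_and = {
  "user_id" => { "$in" => [10, 11] },
  "deleted_user_id" => { "$in" => [20, 21] },
  "business_entity_id" => { "$in" => ["entity-uuid"] }
}

# Only the constraint that carries the real scope.
anchor_only = { "business_entity_id" => { "$in" => ["entity-uuid"] } }

# The nullable FK made null-aware instead of strict.
null_aware = strict_and.merge("deleted_user_id" => { "$in" => [nil, 20, 21] })

strict_result = matches?(store, strict_and)
anchor_result = matches?(store, anchor_only)
null_aware_result = matches?(store, null_aware)
```

With these inputs the strict AND rejects the store purely because of the absent nullable FK, while the anchor-only and null-aware filters accept it.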
94
+
95
+ ## Implemented fix: genuine-anchor scoping (MongodbAdapter#related_collection_filter)
96
+
97
+ Scope flows from the dump target along belongs_to edges. The fix classifies each
98
+ belongs_to parent of a non-target collection by whether it is **genuinely scoped**
99
+ — reachable back to the dump target through belongs_to chains
100
+ (`#genuine_scope_set`, a fixpoint over the configs) — and applies the constraint
101
+ accordingly:
102
+
103
+ - **Anchor (strict).** Among the genuine parents, the most selective one (fewest
104
+ captured ids) is applied strictly (`{fk: {$in: [...]}}`). It carries the real
105
+ narrowing and, being strict, bounds the result to a small set — which keeps both
106
+ this query and the `$in` sets it feeds downstream from ballooning.
107
+ - **Other genuine parents (null-aware).** `{fk: {$in: [nil, ...]}}` (Mongo's
108
+ `$in: [nil]` matches both explicit nulls and missing fields), so a row whose
109
+ nullable refinement FK is null is not excluded by it.
110
+ - **Reference parents (dropped).** A parent NOT reachable to the dump target is
111
+ reference/master data dumped in full; its id set is "all/most of a table" and is
112
+ not a real scope, so when a genuine anchor exists it is dropped entirely.
113
+ - **No genuine parent:** fall back to the historical strict-AND of whatever
114
+ constraints exist (preserves prior behaviour for unreachable collections).
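The four rules above can be sketched as a small filter builder. This is an illustration under stated assumptions, not the body of `MongodbAdapter#related_collection_filter`: the method name `related_filter_sketch`, the `{fk:, ids:, genuine:}` hash shape, and the choice to ignore parents with no captured ids are all hypothetical.

```ruby
# parents: [{ fk: "store_id", ids: [...], genuine: true/false }, ...]
# Hypothetical input shape; `genuine` marks a parent reachable back to the dump target.
def related_filter_sketch(parents)
  # Assumption: only parents that captured ids contribute a constraint.
  constrained = parents.reject { |parent| parent[:ids].empty? }
  genuine = constrained.select { |parent| parent[:genuine] }

  # No genuine parent: fall back to the strict AND of whatever exists.
  if genuine.empty?
    return constrained.to_h { |parent| [parent[:fk], { "$in" => parent[:ids] }] }
  end

  # Anchor: the most selective genuine parent (fewest captured ids), applied strictly.
  anchor = genuine.min_by { |parent| parent[:ids].size }

  # Other genuine parents become null-aware; reference parents are dropped.
  genuine.to_h do |parent|
    ids = parent.equal?(anchor) ? parent[:ids] : [nil, *parent[:ids]]
    [parent[:fk], { "$in" => ids }]
  end
end

items_parents = [
  { fk: "store_id", ids: [1], genuine: true },
  { fk: "brand_id", ids: [7, 8], genuine: true },            # hypothetical refinement FK
  { fk: "large_category_id", ids: [1, 2, 3], genuine: false } # reference data, dropped
]
items_filter = related_filter_sketch(items_parents)

reference_only = [{ fk: "large_category_id", ids: [1, 2, 3], genuine: false }]
fallback_filter = related_filter_sketch(reference_only)
```

Keeping the anchor strict is what bounds the result set; only the remaining genuine parents are relaxed with `nil`.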
115
+
116
+ For this extraction: `stores` → `{business_entity_id ∈ {<target-id>}}` (anchor;
117
+ `user_id`/`deleted_user_id` → reference leaks, dropped) → **1 store**; `items` →
118
+ `{store_id ∈ {store}}` strict anchor with the nullable refinement FKs null-aware
119
+ and the `*_category` references dropped → **127** via the `store_id_1` index.
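The "genuinely scoped" classification is described as a fixpoint over the configs. A minimal sketch of such a fixpoint follows, assuming a hypothetical `belongs_to` map from each collection to its parent collections; the real `#genuine_scope_set` signature and config shape are not shown in this document, and `users` is assumed here to have no path back to the target.

```ruby
require "set"

# belongs_to: { "collection" => ["parent collection", ...] } (hypothetical shape).
# Returns every collection reachable back to `target` through belongs_to chains.
def genuine_scope_set_sketch(belongs_to, target)
  genuine = Set[target]
  loop do
    added = belongs_to.select do |collection, parents|
      !genuine.include?(collection) && parents.any? { |parent| genuine.include?(parent) }
    end.keys
    break if added.empty? # fixpoint reached: nothing new became genuine

    genuine.merge(added)
  end
  genuine
end

belongs_to = {
  "stores" => ["users", "business_entities"],
  "items" => ["stores", "large_categories", "medium_categories"],
  "users" => [],
  "large_categories" => [],
  "medium_categories" => []
}
genuine = genuine_scope_set_sketch(belongs_to, "business_entities")
```

In this toy graph `stores` and `items` become genuine, while `users` and the category collections stay reference data.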
120
+
121
+ ### Measured result (warm, same cache)
122
+
123
+ Full run **58.8s → 11.0s ≈ 5.4×**; `items` 41s double COLLSCAN → ~11ms indexed
124
+ (≈3700×). Correctness also fixed: `stores` 0→1, `items` 18,739 (leaked COLLSCAN)
125
+ →127. Byte-identical existing snapshots (the seed graph is fully genuine and has no
126
+ null FKs, so anchor-strict + null-aware ≡ the prior strict-AND).
127
+
128
+ ### Approaches considered and rejected
129
+
130
+ - **Unconditional null-aware on every belongs_to** (the original iter-1 direction):
131
+ catastrophic. A collection that belongs_to only reference data dumped in full
132
+ becomes ~the whole table once null-aware; the resulting child `$in` then exceeds
133
+ Mongo's **48 MB max message size** and the run crashes. Null-awareness must NOT
134
+ be applied to a collection's only/anchor scope.
135
+ - **Null-aware on all genuine parents (no anchor distinction):** makes the genuine
136
+ *anchor* itself null-aware too — `stores` then matched every store with a null
137
+ `business_entity_id` (a not-fully-backfilled column) → hundreds of thousands of
138
+ stores → a ~39 MB child filter on a downstream collection (**MaxBSONSize**).
139
+ Hence the anchor stays strict.
140
+ - **Scope by the single most-selective genuine parent alone (drop other genuine):**
141
+ fast and correct here, but drops legitimate AND-narrowing for multi-parent
142
+ collections (e.g. `order_items` ∈ orders AND products) and moves seed snapshots.
143
+ - Pure-performance tweaks that keep the (incorrect) 18,739-row output —
144
+ `--cursor-parallel` (changes row order, treats the symptom) or skipping the
145
+ redundant `count_documents` scan (~½ only) — were rejected as the primary fix.
@@ -1,8 +1,11 @@
1
1
  # Design: optional native (C) extension for the MongoDB Extended-JSON encoder
2
2
 
3
- Status: **proposed / not implemented.** This document captures the design for a
4
- future change. It is the planned successor to the fork/cursor parallelism that
5
- was removed (see [`optimization-notes.md`](./optimization-notes.md)).
3
+ Status: **implemented.** This document captured the design; it now describes the
4
+ shipped encoder. Source: `ext/exwiw/ext_json/ext_json.c` (native emitter) and
5
+ `lib/exwiw/ext_json.rb` (the optional-load shim + pure-Ruby fallback); the
6
+ byte-identity guard is `spec/ext_json_spec.rb`. It is the successor to the
7
+ fork/cursor parallelism that was removed (see
8
+ [`optimization-notes.md`](./optimization-notes.md)).
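The optional-load pattern the status note refers to can be sketched as follows. This is an assumption-laden illustration, not the contents of `lib/exwiw/ext_json.rb`: the module name `ExtJsonSketch`, the require path `ext_json_sketch_native`, and the `NativeEncoderSketch` constant are all hypothetical. The fallback uses the `JSON.generate(doc.as_extended_json(mode: :relaxed))` form quoted in the changelog when the bson gem's `as_extended_json` is available.

```ruby
require "json"

module ExtJsonSketch
  # Try the compiled extension; on hosts that cannot build it, fall back silently.
  NATIVE_AVAILABLE =
    begin
      require "ext_json_sketch_native" # hypothetical require path
      true
    rescue LoadError
      false
    end

  def self.encode(doc)
    if NATIVE_AVAILABLE
      NativeEncoderSketch.encode(doc) # hypothetical native entry point
    else
      pure_ruby_encode(doc)
    end
  end

  # Pure-Ruby path: transform to Extended JSON when bson is loaded, then serialize.
  def self.pure_ruby_encode(doc)
    plain = doc.respond_to?(:as_extended_json) ? doc.as_extended_json(mode: :relaxed) : doc
    JSON.generate(plain)
  end
end

line = ExtJsonSketch.encode({ "a" => 1, "b" => [true, nil] })
```

Because the decision is made once at load time, callers use a single `encode` entry point regardless of whether the extension compiled.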
6
9
 
7
10
  ## Motivation
8
11
 
@@ -90,15 +93,23 @@ end
90
93
  The encoder splits values into a **native fast path** and a **Ruby delegate**:
91
94
 
92
95
  - **Native (in C):** `Hash`, `Array`, `String`, `Integer` within int64,
93
- `true`/`false`/`nil`, and `BSON::ObjectId`. These are the structural bulk plus
94
- the single most common leaf (`_id`), and their formatting is simple and stable.
96
+ `true`/`false`/`nil`, `BSON::ObjectId`, and **in-range `Time`** (years
97
+ 1970..9999). These are the structural bulk plus the two most common leaves in
98
+ a dumped document — `_id` and the Mongoid `created_at`/`updated_at` timestamps.
99
+ The in-range Time path resolves the absolute instant with `rb_time_timespec`
100
+ (epoch seconds + nanoseconds, no `rb_funcall`), formats with `gmtime_r` +
101
+ `snprintf`, and reproduces bson's rule exactly: a `.mmm` fraction iff
102
+ `nsec >= 1000` (i.e. `usec != 0`), with the millisecond floored to
103
+ `nsec / 1e6`. The in-range window is the half-open epoch-second range
104
+ `[0, 253402300800)`.
95
105
  - **Delegate to Ruby** — call back into
96
106
  `JSON.generate(value.as_extended_json(mode: :relaxed))` for the individual
97
107
  value and splice the returned fragment into the buffer:
98
108
  - `Float` — `Float#to_s` diverges from `JSON.generate` for scientific notation
99
109
  (`1e20`), so never reformat floats in C.
100
- - `Time` variable fractional digits + the `[1970,9999]` `$numberLong`
101
- boundary + ms flooring; too fragile to risk in C v1.
110
+ - **out-of-range `Time`** (year < 1970 or > 9999) — its `$numberLong` form
111
+ involves negative-epoch arithmetic, is vanishingly rare in dumped data, and
112
+ is left to Ruby. The in-range ISO branch is handled natively (above).
102
113
  - out-of-int64 `Integer` — must surface the identical `RangeError`.
103
114
  - any unrecognized class — `Decimal128`, `BSON::Binary`, `Symbol`, `Regexp`,
104
115
  `Date`, `BSON::Timestamp`, etc.
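The fast-path/delegate split can be expressed as a predicate. The sketch below is written in Ruby for readability (the shipped classifier lives in C), and `native_fast_path?` is a hypothetical name. It checks `BSON::ObjectId` by class name so that it runs without the bson gem, which is an assumption of this example only.

```ruby
INT64_RANGE = (-(2**63)..(2**63 - 1))
IN_RANGE_EPOCH_SECONDS = (0...253_402_300_800) # years 1970..9999, half-open

# true  => value would be formatted natively
# false => value would be delegated to the pure-Ruby encoder
def native_fast_path?(value)
  case value
  when Hash, Array, String, true, false, nil
    true
  when Integer
    INT64_RANGE.cover?(value)
  when Time
    IN_RANGE_EPOCH_SECONDS.cover?(value.to_i)
  else
    value.class.name == "BSON::ObjectId"
  end
end

fast = [{}, [], "s", 42, 2**63 - 1, true, nil, Time.at(0)].map { |v| native_fast_path?(v) }
slow = [1.5, 2**63, :sym, Time.at(-1), Time.at(253_402_300_800)].map { |v| native_fast_path?(v) }
```

Anything the predicate rejects is handled by the delegate, so adding a class to the fast path is an optimization rather than a behaviour change, provided the native output matches byte for byte.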
@@ -112,9 +123,14 @@ would have produced for that position. The native walk can therefore hand any
112
123
  value it does not want to format to Ruby and splice the result, with no
113
124
  divergence.
114
125
 
115
- `Time` and `Float` are candidates for later promotion into the native path if a
116
- benchmark shows the per-leaf delegate call is a meaningful fraction of the win
117
- (timestamp-heavy docs call out to Ruby once per `Time` field).
126
+ `Time` was promoted into the native path because the benchmark showed it was
127
+ decisive: with `Time` delegated, a 30-embedded-post timestamp-heavy document
128
+ (32 `Time` fields) sped up only ~1.03× — the per-`Time` `rb_funcall` +
129
+ `as_extended_json` Hash allocation + second `JSON.generate` pass erased the win.
130
+ Formatting in-range `Time` natively brings the same document to ~2.8× (the
131
+ serialization-step ceiling). `Float` remains delegated: matching
132
+ `JSON.generate`'s shortest-round-trip float formatting in C (not `Float#to_s`)
133
+ is not worth the risk for the few floats a typical document carries.
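The in-range `Time` rule stated earlier (a `.mmm` fraction only when `nsec >= 1000`, milliseconds floored, window `[0, 253402300800)`) can be sketched in Ruby. The helper name `relaxed_date_sketch` is hypothetical, and it returns only the ISO string rather than the full Extended JSON wrapper; returning `nil` stands in for "delegate to Ruby".

```ruby
IN_RANGE_WINDOW = (0...253_402_300_800) # epoch seconds, half-open

def relaxed_date_sketch(time)
  return nil unless IN_RANGE_WINDOW.cover?(time.to_i) # out of range: delegate

  base = time.getutc.strftime("%Y-%m-%dT%H:%M:%S")
  if time.nsec >= 1_000
    # Fraction present: floor nanoseconds to whole milliseconds.
    format("%s.%03dZ", base, time.nsec / 1_000_000)
  else
    "#{base}Z"
  end
end

epoch = relaxed_date_sketch(Time.at(0))
sub_microsecond = relaxed_date_sketch(Time.at(1, 999, :nsec))
one_microsecond = relaxed_date_sketch(Time.at(1, 1_000, :nsec))
floored = relaxed_date_sketch(Time.at(1, 123_999_999, :nsec))
upper_edge = relaxed_date_sketch(Time.at(253_402_300_799))
past_window = relaxed_date_sketch(Time.at(253_402_300_800))
```

The boundary cases exercised here (sub-microsecond, exactly one microsecond, and both window edges) are the kind of edges a byte-identity spec would need to pin down.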
118
134
 
119
135
  ## C source & buffer design
120
136
 
@@ -218,7 +234,11 @@ The private `#extended_json` helper is removed — its logic (including the
218
234
  ## Risk register
219
235
 
220
236
  1. **Time formatting** — variable fraction + ms flooring + `$numberLong`
221
- boundary. Mitigated by delegating `Time` to Ruby in v1.
237
+ boundary. In-range years (1970..9999) are formatted natively via
238
+ `rb_time_timespec` + `gmtime_r`; the rare out-of-range `$numberLong` form is
239
+ delegated. Mitigated by a dense byte-identity fuzz over the whole in-range
240
+ epoch span with mixed nanosecond precision, plus the boundary/sub-ms edges in
241
+ `spec/ext_json_spec.rb`.
222
242
  2. **Float formatting** — `Float#to_s` ≠ `JSON.generate`. Mitigated by delegating.
223
243
  3. **String escaping** — must match JSON exactly. Implemented in C, fuzz-tested
224
244
  vs the Ruby fallback.
@@ -0,0 +1,278 @@
1
+ # SQL dump performance: investigation notes
2
+
3
+ Companion to [`optimization-notes.md`](./optimization-notes.md) (which covers the
4
+ MongoDB adapter). This records the speed/memory bottlenecks of the **SQL**
5
+ adapters' dump path (mysql / postgresql / sqlite), measured against a baseline,
6
+ so a later iteration could address them. It was written as the
7
+ measurement + bottleneck-analysis step; the status note below records the fixes.
8
+
9
+ The reproducible harness is `script/bench_sql_dump.rb`. It seeds a synthetic
10
+ table and measures the two Runner phases per table; it also measures the
11
+ serialization step with no DB at all. The correctness anchor for any future fix
12
+ is `spec/insert_output_snapshot_spec.rb` — the **byte-exact** snapshot of dump
13
+ output.
14
+
15
+ > **Status:** **both hotspots are fixed for all three SQL adapters.** Hotspot #2
16
+ > (the whole-table INSERT string) — see
17
+ > [Resolution #2](#resolution-hotspot-2-streamed-single-insert). Hotspot #1
18
+ > (full result-set materialization in `execute`) — postgresql, mysql, and sqlite
19
+ > all stream the fetch now; see
20
+ > [Resolution #1](#resolution-hotspot-1-streaming-fetch-postgresql--mysql).
21
+
22
+ ## The two hotspots (same shape as MongoDB had pre-optimization)
23
+
24
+ The Runner drives, per table:
25
+
26
+ 1. **execute** — the adapter materializes the **entire** result set into a Ruby
27
+ array-of-arrays before anything is written:
28
+ - postgresql: `connection.exec(sql).values`
29
+ - mysql: `res.to_a.map { |row| row.map { stringify } }` (also re-allocates
30
+ every value as a normalized String)
31
+ - sqlite: `connection.execute(sql)`
32
+
33
+ Memory here is proportional to the **table size**, independent of any chunking
34
+ downstream.
35
+
36
+ 2. **to_bulk_insert** — SQL adapters set **no** `default_bulk_insert_chunk_size`
37
+ (it is `nil`), so the Runner treats the whole table as one chunk and
38
+ `to_bulk_insert` builds the **entire** `INSERT INTO ... VALUES (...),(...);`
39
+ as one giant String — first an `Array` of N per-row tuple strings, then the
40
+ joined result — held simultaneously with the result set from step 1.
41
+
42
+ ## Baseline (200,000 rows, 8 columns, ~41.2 MB output)
43
+
44
+ Measured via `bench_sql_dump.rb` (sandbox disabled — it needs `ps` for RSS and
45
+ localhost for the live DB). RSS is sticky across in-process phases, so read the
46
+ peaks as upper bounds and the *deltas* as the signal.
47
+
48
+ | adapter | execute peak (Δ) | + whole-string write peak | result-set objs |
49
+ |---------|------------------|---------------------------|-----------------|
50
+ | postgresql | 471.7 MB (+65) | 494.0 MB | 1.8M |
51
+ | mysql | 413.4 MB (+52) | 554.9 MB | 2.4M |
52
+ | sqlite | 434.5 MB | 526.6 MB | 1.4M |
53
+
54
+ For a **41 MB** dump the process peaks near **0.5 GB** — ~12× the output size —
55
+ because the full result set *and* the whole-table INSERT string are resident at
56
+ once. Both costs scale linearly with the table, so a large table OOMs the same
57
+ way the embed-heavy MongoDB collection did.
58
+
59
+ Per-value serialization is cheap and not the bottleneck: `escape_value` is
60
+ ~0.4–1.3 µs/op and a one-row `to_bulk_insert` ~5.5 µs/op. The cost is **memory**
61
+ (holding everything at once), not CPU.
62
+
63
+ ## The byte-identity catch (why this differs from MongoDB)
64
+
65
+ MongoDB's fix was a `default_bulk_insert_chunk_size`, which is byte-identical
66
+ because JSONL chunks join with the same `"\n"` `to_bulk_insert` already inserts
67
+ between docs. **That does not transfer to SQL.** Each `to_bulk_insert` call wraps
68
+ its rows in its own `INSERT INTO ... VALUES ...;` statement, so a chunk size > 0
69
+ turns one INSERT into *many* INSERT statements — semantically equivalent on
70
+ re-import, but a **different byte stream** that breaks the snapshot guard.
71
+
72
+ The byte-identical lever for SQL is instead to **stream the tuples into a single
73
+ INSERT statement**: emit the adapter's exact `INSERT ... VALUES\n` header once,
74
+ then write each row's `(...)` tuple (reusing the adapter's own `escape_value`)
75
+ separated by `",\n"`, then `;`. The bench implements this as `write_streamed`
76
+ and asserts byte-for-byte identity with the whole-string path — confirmed
77
+ **true** for all three adapters (the header must reuse the adapter's quoting,
78
+ e.g. MySQL's backticks, or it diverges).
79
+
80
+ `write_streamed` cuts the to_bulk_insert peak by ~110–120 MB, **but** naive
81
+ per-row `IO#print` makes it ~2–2.4× slower than building one string and writing
82
+ it once. So the production fix wants **chunk-buffered streaming**: build an
83
+ N-row substring in memory, write it, repeat — managing the `",\n"` separator
84
+ across chunk boundaries — to bound memory *without* the per-row IO penalty,
85
+ while still emitting a single INSERT statement (byte-identical).
86
+
87
+ ## Resolution: hotspot #2 (streamed single INSERT)
88
+
89
+ Implemented. The Runner no longer builds the per-table INSERT as one giant
90
+ String; it delegates writing to a new adapter seam `Adapter#write_inserts(io,
91
+ results, table, chunk_size)`:
92
+
93
+ - `Adapter::Base#write_inserts` keeps the old behavior (write `to_bulk_insert`
94
+ per chunk, joined by `"\n"`), so MongoDB and any future adapter are unchanged.
95
+ - The SQL adapters mix in `Adapter::SqlBulkInsert`, which **streams** the single
96
+ `INSERT INTO ... VALUES <tuples>;` statement to the file `STREAM_FLUSH_ROWS`
97
+ (2,000) tuples at a time. Each flush is one fast `map`+`join` (the same path
98
+ `to_bulk_insert` uses), and the `",\n"` printed between slices reproduces the
99
+ exact separator between tuples — so the bytes are **identical** to the
100
+ whole-table build. The three duplicate `to_bulk_insert` methods collapsed into
101
+ the shared module; each adapter now only supplies `insert_header` (its
102
+ identifier quoting) and `escape_value`.
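The slice-and-flush technique can be sketched independently of any adapter. Everything named here is hypothetical: `escape_sketch`, `tuple_sketch`, and `write_streamed_sketch` stand in for the adapter's `escape_value` and the `SqlBulkInsert` module, and the tuple layout is an assumption. The point is only that printing `",\n"` between slices yields the same bytes as one whole-table `map` + `join`.

```ruby
require "stringio"

# Illustrative escaping only; each real adapter supplies its own escape_value.
def escape_sketch(value)
  value.nil? ? "NULL" : "'#{value.to_s.gsub("'", "''")}'"
end

def tuple_sketch(row)
  "(#{row.map { |value| escape_sketch(value) }.join(', ')})"
end

# Whole-table build: one giant String.
def whole_string_insert(header, rows)
  header + rows.map { |row| tuple_sketch(row) }.join(",\n") + ";"
end

# Streamed build: at most flush_rows tuples are held in memory at a time.
def write_streamed_sketch(io, header, rows, flush_rows: 2_000)
  wrote_any = false
  rows.each_slice(flush_rows) do |slice|
    io.print(wrote_any ? ",\n" : header) # header once, then the inter-slice separator
    io.print(slice.map { |row| tuple_sketch(row) }.join(",\n"))
    wrote_any = true
  end
  io.print(";") if wrote_any
end

header = "INSERT INTO users (id, name) VALUES\n"
rows = [[1, "a"], [2, nil], [3, "o'b"], [4, "d"], [5, "e"]]

io = StringIO.new
write_streamed_sketch(io, header, rows, flush_rows: 2)
streamed = io.string
whole = whole_string_insert(header, rows)
```

Using a small `flush_rows` in the example forces several slice boundaries, which is where a separator bug would show up.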
103
+
104
+ Verified byte-for-byte by `spec/insert_output_snapshot_spec.rb` (live DB, all
105
+ three adapters) and a flush-boundary sanity check; full suite green.
106
+
107
+ Measured (`bench_sql_dump.rb`, 200k rows / ~41.2 MB output):
108
+
109
+ | adapter | whole-string peak | streamed peak | Δ peak |
110
+ |---------|-------------------|---------------|--------|
111
+ | postgresql | 367 MB | 226 MB | −141 MB |
112
+ | mysql | 325 MB | 214 MB | −111 MB |
113
+ | sqlite | 353 MB | 207 MB | −146 MB |
114
+
115
+ So hotspot #2's contribution (~110–145 MB on a 41 MB dump — the whole INSERT
116
+ string *plus* the transient 200k-tuple `Array` and its join) is gone; the write
117
+ buffer is now bounded to ~2,000 tuples regardless of table size.
118
+
119
+ **Speed:** streaming is *not* slower than the whole-string build. Measured in
120
+ isolation (post-`GC.start`, no sampler thread) the streamed write at
121
+ `flush_rows=2000` was ~0.67 s vs ~0.83 s for the one giant `map`+`join` — small
122
+ chunks stay in cache and avoid repeatedly growing/copying a 41 MB String. The
123
+ ~1.3× "slowdown" the in-process bench shows is an artifact of its background RSS
124
+ sampler thread (`ps` every 10 ms) plus run-ordering (streamed runs first, cold),
125
+ not the algorithm. The earlier worry about a per-row `IO#print` penalty only
126
+ applied to the naive row-at-a-time prototype, which `flush_rows` slicing avoids.
127
+
128
+ ## Resolution: hotspot #1 (streaming fetch, postgresql + mysql)
129
+
130
+ Implemented for **postgresql**. `PostgresqlAdapter#execute` no longer returns
131
+ `connection.exec(sql).values` (the whole result set as a Ruby array-of-arrays);
132
+ it returns a lazy `PostgresqlAdapter::StreamingResult` that pulls rows off the
133
+ wire one at a time via libpq **single-row mode** (`send_query` +
134
+ `set_single_row_mode` + a `get_result` loop, each yielding one row's
135
+ text-format `Array<String|nil>`). The Runner drives it exactly like the old
136
+ array — `#size` then a single `each_slice` pass — so nothing else changed and
137
+ the output is byte-identical (verified by `insert_output_snapshot_spec`, both
138
+ the `insert` and `copy` pg scenarios).
139
+
140
+ It mirrors `MongodbAdapter::StreamingResult`, with two SQL-specific points:
141
+
142
+ - **`#size`** can't be answered cheaply from the cursor, so it runs a separate
143
+ `SELECT COUNT(*) FROM (<query>) AS exwiw_count_src` (comment-prefixed, like
144
+ the data query). Postgres prunes the wrapped subquery's unused projection, so
145
+ the COUNT transfers no row data — but it does re-run the query plan. This is
146
+ the deliberate cost of keeping the Runner contract (`#size` before iteration,
147
+ used to skip empty tables and log the count) unchanged, so MongoDB and the
148
+ other SQL adapters are untouched. (MongoDB's `count_documents` is an
149
+ index-only walk and cheaper; the SQL COUNT is the analogue.)
150
+ - the streaming pass ties up the connection until fully drained. The Runner
151
+ always drains it (`write_inserts`) before issuing `post_insert_sql` / DELETE
152
+ on the same connection, so the ordering holds. `StreamingResult#each` also
153
+ drains any queued results if iteration is abandoned mid-stream (a SQL error
154
+ surfaced by `#check`, or the consumer raising), so the connection stays
155
+ usable.
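The contract described above (`#size` answered by a separate count, then a single lazy pass that drains the source if iteration is abandoned) can be sketched without a database. This is not `PostgresqlAdapter::StreamingResult`: the class name, the `count_query:` and `row_source:` callables, and the use of an `Enumerator` in place of the wire protocol are assumptions made for illustration.

```ruby
class StreamingResultSketch
  include Enumerable

  # count_query: callable returning the row count (stands in for SELECT COUNT(*)).
  # row_source:  callable returning an Enumerator of rows (stands in for the wire).
  def initialize(count_query:, row_source:)
    @count_query = count_query
    @row_source = row_source
  end

  def size
    @count_query.call
  end

  def each
    return to_enum(:each) unless block_given?

    source = @row_source.call
    begin
      loop { yield source.next } # loop stops quietly at StopIteration
    ensure
      # Drain whatever is left so the "connection" is usable afterwards.
      loop { source.next }
    end
  end
end

pulled = []
source = -> { Enumerator.new { |y| 5.times { |i| pulled << i; y << [i.to_s] } } }
result = StreamingResultSketch.new(count_query: -> { 5 }, row_source: source)

slices = result.each_slice(2).to_a

pulled.clear
begin
  result.each { |row| raise "consumer failed" if row == ["1"] }
rescue RuntimeError
  # The consumer's error still propagates; the ensure block has already drained.
end
pulled_after_abort = pulled.dup
```

After the aborted pass, `pulled_after_abort` shows that every row was still read off the source, which is the property that keeps the next query from failing.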
156
+
157
+ Measured in **isolated fresh processes** (one per path, so the peak is not
158
+ polluted by other phases — RSS is sticky), 200k rows / ~41.2 MB output:
159
+
160
+ | pg fetch path | peak RSS | Δ over baseline |
161
+ |---------------|----------|-----------------|
162
+ | materialize (`exec(sql).values`) + streamed write (OLD) | ~360 MB | ~320 MB |
163
+ | **single-row stream + streamed write (NEW)** | **~48 MB** | **~12 MB** |
164
+
165
+ So the full result set (~320 MB of Ruby strings/arrays for 200k×8) is no longer
166
+ resident: peak drops ~**310 MB (~87%)** to ~48 MB, just above the 41 MB output
167
+ size, because rows are pulled one at a time and the write buffer is bounded to
168
+ ~2,000 tuples (hotspot #2's fix). Speed is unchanged (~1.8 s both paths on the
169
+ in-process bench; the COUNT is cheap on an indexed seed). The reproducible A/B
170
+ is in `bench_sql_dump.rb` Part B (`execute(stream)` vs `execute(materialize)`,
171
+ with a byte-identity assertion).
172
+
173
+ Implemented for **mysql** too. `MysqlAdapter#execute` now returns a
174
+ `MysqlAdapter::StreamingResult` (an Enumerable mirroring the pg one) instead of
175
+ `connection.query(sql).rows`. The new `MysqlClient#stream_rows` pulls rows off
176
+ the wire one at a time via mysql2's server-side stream (`stream: true` +
177
+ `cache_rows: false`), yielding the same `Array<String|nil>` rows `#query`
178
+ buffered — so the generated INSERT is byte-identical (verified by
179
+ `insert_output_snapshot_spec`).
180
+
181
+ Two MySQL specifics differ from the pg path:
182
+
183
+ - **`#size`** is a separate `SELECT COUNT(*)` of the same query, but **not** a
184
+ subquery wrap. MySQL rejects a derived table with duplicate column names,
185
+ which a rails-managed `SELECT *` joined to another table produces
186
+ (`Duplicate column name 'id'`); Postgres tolerates it, MySQL does not. So
187
+ mysql replaces the projection with `COUNT(*)` instead
188
+ (`compile_ast(count_only: true)`) — exact because exwiw's extraction queries
189
+ have no DISTINCT/GROUP BY/LIMIT, so the count is independent of the projected
190
+ columns (confirmed against live data: `COUNT(*)` over the bare FROM/JOIN/WHERE
191
+ equals the streamed row count for both plain and `SELECT *`+join queries).
192
+ - **abandoned streams.** mysql2 requires a streamed result to be fully consumed
193
+ before the next query on the connection, or it raises "Commands out of sync".
194
+ `stream_rows` drains the remainder (re-entering `res.each`, which continues
195
+ from where it stopped) if the consumer block raises, so the connection stays
196
+ usable for the next table. `trilogy` has no streaming cursor
197
+ (no `QUERY_FLAGS_STREAMING`), so it buffers and yields — parity, no memory
198
+ win; trilogy is a test-only driver, production uses mysql2.
199
+
200
+ Measured in **isolated fresh processes** (one per path), 200k rows / ~40.7 MB
201
+ output:
202
+
203
+ | mysql fetch path | peak RSS | Δ over baseline |
204
+ |------------------|----------|-----------------|
205
+ | materialize (`query(sql).rows`) + streamed write (OLD) | ~340 MB | ~300 MB |
206
+ | **single-row stream + streamed write (NEW)** | **~50 MB** | **~10 MB** |
207
+
208
+ So peak drops ~**290 MB (~85%)**, now just above the 40.7 MB output — the same
209
+ shape as the pg result. Speed is unchanged-to-faster (the materialize path also
210
+ builds the whole array first). `bench_sql_dump.rb` Part B now shows the delta
211
+ for mysql too (it was equivalent before, when mysql still materialized).
212
+
213
+ Implemented for **sqlite** too, closing hotspot #1 for all three SQL adapters.
214
+ `SqliteAdapter#execute` no longer returns `connection.execute(sql)` (which
215
+ buffers the whole result into a Ruby array); it returns a
216
+ `SqliteAdapter::StreamingResult` (Enumerable, mirroring the pg/mysql ones) whose
217
+ `#each` walks the result one row at a time through SQLite's **statement cursor**
218
+ — `connection.prepare(data_sql)` then `Statement#each` (which maps to
219
+ `sqlite3_step`), closing the statement in an `ensure` so an abandoned mid-stream
220
+ iteration still releases the cursor. The rows are the same `Array` of
221
+ native-typed values `Database#execute` produced, so the generated INSERT is
222
+ byte-identical (verified by `insert_output_snapshot_spec` and a direct cursor
223
+ vs. `#execute` comparison).
224
+
225
+ SQLite specifics vs. the pg/mysql paths:
226
+
227
+ - **`#size`** runs a separate `SELECT COUNT(*)` of the same query with the
228
+ projection replaced by `COUNT(*)` (`compile_ast(count_only: true)`, the same
229
+ trick mysql uses) — exact because exwiw's extraction queries have no
230
+ DISTINCT/GROUP BY/LIMIT. SQLite would also tolerate a duplicate-column
231
+ subquery wrap (unlike mysql), but the `count_only` form is shared and avoids
232
+ the extra subquery.
233
+ - **no connection contention.** SQLite is an embedded, single-connection engine
234
+ that allows multiple active prepared statements at once, so the `#size` COUNT
235
+ and the data cursor don't fight over the connection the way the pg/mysql
236
+ single-row streams tie up the wire. No drain dance is needed; just close the
237
+ statement.
238
+
239
+ Measured in **isolated fresh processes** (one per path), 200k rows / ~40.5 MB
240
+ output:
241
+
242
+ | sqlite fetch path | peak RSS | Δ over baseline |
243
+ |-------------------|----------|-----------------|
244
+ | materialize (`Database#execute`) + streamed write (OLD) | ~298 MB | ~257 MB |
245
+ | **statement-cursor stream + streamed write (NEW)** | **~59 MB** | **~18 MB** |
246
+
247
+ So peak drops ~**240 MB (~80%)**, the same shape as pg/mysql, and it is
248
+ **faster** (~0.84 s vs ~1.68 s) — the materialize path pays to build the whole
249
+ Ruby array up front, the cursor does not. `bench_sql_dump.rb` Part B now shows a
250
+ real delta for sqlite too (it was equivalent before, when sqlite still
251
+ materialized).
252
+
253
+ ## Status: both hotspots closed for all three SQL adapters
254
+
255
+ 1. **Bounded-memory write** (hotspot #2) — done for mysql / postgresql / sqlite;
256
+ see [Resolution #2](#resolution-hotspot-2-streamed-single-insert).
257
+ 2. **Streaming result fetch** (hotspot #1) — done for postgresql (libpq
258
+ single-row mode), mysql (mysql2 `stream: true`), and sqlite (statement
259
+ cursor); see Resolution #1 above.
260
+
261
+ There is no remaining materialization hotspot in the SQL dump path: peak RSS is
262
+ now bounded (rising ~10–18 MB over baseline, well below the output size) and independent of table size for every
263
+ SQL adapter, the same property the MongoDB streaming work achieved. The
264
+ `trilogy` driver still buffers (it has no streaming cursor flag), but it is a
265
+ test-only driver — production mysql uses mysql2.
266
+
267
+ ## Methodology notes
268
+
269
+ - The serialization hotspot reproduces **with no database** (Part A): synthesize
270
+ the array-of-String-arrays the drivers return and measure `to_bulk_insert`.
271
+ The live-DB part (Part B) measures `execute` and needs a reachable DB; the dev
272
+ sandbox blocks localhost (and `ps`), so disable the sandbox for bench runs.
273
+ - Run order matters: the bench measures the STREAMED path **before** the WHOLE
274
+ path so the transient giant String doesn't pollute the streamed peak (RSS is
275
+ reclaimed lazily). For defensible absolute numbers, isolate phases in fresh
276
+ processes.
277
+ - Ruby 4.0 no longer ships `benchmark` as a default gem (it is now a bundled gem); the harness uses
278
+ `Process.clock_gettime(Process::CLOCK_MONOTONIC)`.
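A monotonic-clock timer of the kind mentioned in the last note can be written in a few lines. This is a minimal sketch with a hypothetical helper name (`measure_seconds`), not an excerpt from `script/bench_sql_dump.rb`.

```ruby
# Returns [block result, elapsed seconds] using the monotonic clock,
# which is unaffected by wall-clock adjustments during the run.
def measure_seconds
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

joined, elapsed = measure_seconds { (1..1_000).map(&:to_s).join(",") }
```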