exwiw 0.5.3 → 0.6.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +9 -0
- data/README.md +1 -1
- data/docs/mongodb-scoping-fullscan-notes.md +145 -0
- data/docs/optimize-mongodb-export-with-native-ext.md +31 -11
- data/docs/sql-dump-optimization-notes.md +278 -0
- data/ext/exwiw/ext_json/ext_json.c +274 -0
- data/ext/exwiw/ext_json/extconf.rb +8 -0
- data/lib/exwiw/adapter/mongodb_adapter.rb +90 -22
- data/lib/exwiw/adapter/mysql_adapter.rb +70 -18
- data/lib/exwiw/adapter/mysql_client.rb +43 -0
- data/lib/exwiw/adapter/postgresql_adapter.rb +85 -15
- data/lib/exwiw/adapter/sql_bulk_insert.rb +71 -0
- data/lib/exwiw/adapter/sqlite_adapter.rb +75 -18
- data/lib/exwiw/adapter.rb +28 -0
- data/lib/exwiw/ext_json.rb +33 -0
- data/lib/exwiw/runner.rb +10 -16
- data/lib/exwiw/version.rb +1 -1
- data/lib/exwiw.rb +2 -0
- metadata +9 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: b66bdb8ccf3d2043c0a903c70071e0f070b41b9fd89dbc09e3c5385c9b4916c1
|
|
4
|
+
data.tar.gz: cf1224f47635113127f2e1fcf38f161c539f7b68f0bb8c85ba0c428fa6a202d1
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 41a4ef3c8a6da41ccbb92d7e42a519c5260a1bbcf53ce9cdcfe3a18dbca55d509c7ca1492e726bdb82bfc8c101a0a6bcee1dec1c051ceae44910009801d4b64d
|
|
7
|
+
data.tar.gz: ce5cdc2e96fcacbfc2a7c3ca23c72a22a1c3e765da2e69ad21fd9ebd87bbfdc3619e54a0c6eb5ae94321e4dc1b8651ca73a965cf33509ecafba9349b8a894883
|
data/CHANGELOG.md
CHANGED
|
@@ -2,6 +2,15 @@
|
|
|
2
2
|
|
|
3
3
|
## [Unreleased]
|
|
4
4
|
|
|
5
|
+
## [0.6.1] - 2026-06-20
|
|
6
|
+
|
|
7
|
+
## [0.6.0] - 2026-06-20
|
|
8
|
+
|
|
9
|
+
### Added
|
|
10
|
+
|
|
11
|
+
- Optimize memory usage https://github.com/heyinc/exwiw/pull/118
|
|
12
|
+
- **MongoDB: optional native (C) encoder for the Extended-JSON dump path** (no flag, byte-identical output, pure-Ruby fallback). Encoding each document to MongoDB Relaxed Extended JSON — previously `JSON.generate(doc.as_extended_json(mode: :relaxed))`, which rebuilds the whole document into an intermediate transformed Hash tree and then walks it again — was the dominant per-document CPU cost (~82% of serialization on embed-heavy data). A new C extension (`ext/exwiw/ext_json/`) emits the JSONL line in a single native tree-walk. It formats the structural bulk plus the leaves that dominate a dumped document — `Hash`, `Array`, `String`, fixnum `Integer`, `true`/`false`/`nil`, `BSON::ObjectId` (`_id`), and in-range `Time` (the Mongoid `created_at`/`updated_at` timestamps) — and delegates everything else (`Float`, out-of-int64 `Integer`, out-of-range `Time`, `Symbol`, `Decimal128`, …) back to the exact pure-Ruby path, so the output is provably byte-for-byte identical. On a 30-embedded-post timestamp-heavy document this serializes ~2.8× faster. With `gem install exwiw` the extension compiles automatically; hosts that cannot compile (JRuby/TruffleRuby, no toolchain) fall back to the pure-Ruby encoder, so exwiw stays installable as a pure-Ruby gem. See [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md).
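The optional-load behaviour this entry describes can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the gem's actual shim: the require path `exwiw_hypothetical_native_encoder`, the module name `EncoderSketch`, and the `native_encode` method are hypothetical, and the plain `JSON.generate` call only stands in for the real pure-Ruby path (which goes through `as_extended_json(mode: :relaxed)` and needs the bson gem).

```ruby
require "json"

module EncoderSketch
  begin
    # Hypothetical native library name. On a host without the compiled
    # extension this raises LoadError and the fallback below is used.
    require "exwiw_hypothetical_native_encoder"
    NATIVE = true
  rescue LoadError
    NATIVE = false
  end

  # Encode one document as a single JSONL line (without the newline).
  def self.encode(doc)
    if NATIVE && respond_to?(:native_encode)
      native_encode(doc) # hypothetical method a native extension would define
    else
      # Stand-in for the pure-Ruby encoder.
      JSON.generate(doc)
    end
  end
end

puts EncoderSketch.encode({ "name" => "alice", "tags" => ["a", "b"] })
```

Because the caller only ever sees `EncoderSketch.encode`, a host that cannot compile the extension behaves the same as one that can, just slower.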
|
|
13
|
+
|
|
5
14
|
## [0.5.3] - 2026-06-19
|
|
6
15
|
|
|
7
16
|
### Changed
|
data/README.md
CHANGED
|
@@ -647,7 +647,7 @@ The MongoDB adapter is experimental. To use it:
|
|
|
647
647
|
- `--ids` values are coerced to the type actually stored in `_id` before filtering: integer-looking ids become `Integer`, 24-char hex ids become `BSON::ObjectId` (Mongoid's default `_id` type — a plain String would never match an ObjectId), and any other string is left as-is.
|
|
648
648
|
- `--target-collection=COLLECTION` is a mongodb-only alias of `--target-table` (use whichever reads better for MongoDB). Specifying both, or using `--target-collection` with a non-mongodb adapter, is an error.
|
|
649
649
|
- `--ids-field=FIELD` matches `--ids` against `FIELD` on the target collection instead of its primary key (e.g. `--target-collection=users --ids=a@example.com --ids-field=email`). Downstream foreign-key propagation still keys off the primary key, so only the target collection's filter changes. Unlike the primary-key path, the supplied ids are **not** type-coerced (the stored type of a custom field is unknown), so pass values matching the field's actual type. This flag is **mongodb-only**; the SQL adapters use `--ids-column` instead (see below).
|
|
650
|
-
- Large or embedded-document-heavy dumps are streamed automatically: the adapter reads the collection through a lazy cursor (not `.to_a`) and writes JSONL in chunks, so peak memory is bounded by the chunk size rather than the collection size — no flag to set.
|
|
650
|
+
- Large or embedded-document-heavy dumps are streamed automatically: the adapter reads the collection through a lazy cursor (not `.to_a`) and writes JSONL in chunks, so peak memory is bounded by the chunk size rather than the collection size — no flag to set. Encoding each document to MongoDB Extended JSON is accelerated by an **optional native (C) extension** that compiles automatically on `gem install`; where it cannot compile, exwiw falls back to a byte-identical pure-Ruby encoder. See [`docs/optimization-notes.md`](docs/optimization-notes.md) for the performance investigation and [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md) for the native encoder's design. Benchmark your own data with `script/bench_mongodb_dump.rb`.
|
|
651
651
|
- Output is JSON Lines (`insert-{idx}-{collection}.jsonl`) using MongoDB Extended JSON (relaxed mode). Import with `mongoimport`:
|
|
652
652
|
```bash
|
|
653
653
|
mongoimport --db app_dev --collection users --file dump/insert-002-users.jsonl
|
data/docs/mongodb-scoping-fullscan-notes.md
ADDED
|
@@ -0,0 +1,145 @@
|
|
|
1
|
+
# MongoDB scoping full-scan diagnosis (nullable-FK belongs_to)
|
|
2
|
+
|
|
3
|
+
Why a related collection's dump can be "especially slow" — dumping far more
|
|
4
|
+
records than the scope implies and scanning the whole collection — when running
|
|
5
|
+
an exwiw config against a MongoDB backup. The headline finding: the slowness is a
|
|
6
|
+
**symptom of a scoping bug**, not of serialization/decode cost.
|
|
7
|
+
|
|
8
|
+
## Reproduction setup
|
|
9
|
+
|
|
10
|
+
- Backup: serve a raw WiredTiger dbpath with a standalone `mongod` (the same
|
|
11
|
+
`mongo:7` image the repo's `compose.yml` uses) on a spare port:
|
|
12
|
+
|
|
13
|
+
```
|
|
14
|
+
docker run -d --name exwiw-restore-mongo --user 0:0 --entrypoint mongod \
|
|
15
|
+
-p 27018:27017 -v "<backup-dbpath>:/data/db" \
|
|
16
|
+
mongo:7 --dbpath /data/db --bind_ip_all
|
|
17
|
+
```
|
|
18
|
+
|
|
19
|
+
Notes: run as root (`--user 0:0`) so mongod can read a backup's `0600` files;
|
|
20
|
+
bypass the image entrypoint (`--entrypoint mongod`) so it does not `gosu`-drop
|
|
21
|
+
back to the `mongodb` user. A backup carrying a `WiredTiger.backup` marker runs
|
|
22
|
+
recovery on the first start and **writes into the backup dir** (expected for a
|
|
23
|
+
restore). Starting a standalone from a replica-set backup yields a harmless
|
|
24
|
+
`system.replset` warning. Local DB connections may be blocked by a dev sandbox —
|
|
25
|
+
run measurement commands with the sandbox disabled.
|
|
26
|
+
|
|
27
|
+
- Run: `bundle exec exwiw export --config <app>/exwiw/exwiw.yml
|
|
28
|
+
--adapter=mongodb --host=localhost --port=27018 --database=<database>
|
|
29
|
+
--ids=<target-id> --output-dir=… --log-level=debug`.
|
|
30
|
+
(The optional-argument CLI flags — `--ids`, `--output-dir` — must use `=`; the
|
|
31
|
+
space form passes `nil` and crashes in the option callback.)
|
|
32
|
+
|
|
33
|
+
## Baseline measurement (worked example, warm cache)
|
|
34
|
+
|
|
35
|
+
Full run ≈ **69s** over a few hundred non-empty collections. Per-collection wall
|
|
36
|
+
time (gap between consecutive `Processing table` log markers) was dominated by a
|
|
37
|
+
single collection:
|
|
38
|
+
|
|
39
|
+
| collection | time | records |
|
|
40
|
+
|---|---|---|
|
|
41
|
+
| **items** | **41.0s (59% of the whole run)** | 18,739 |
|
|
42
|
+
| (other collections) | ~5s and below each | — |
|
|
43
|
+
|
|
44
|
+
A "full run ≈ 10 min" figure is the **cold-cache** version (first read pages the
|
|
45
|
+
`items` data off the backup over the bind mount). The *relative* shape (items
|
|
46
|
+
dominates) is the same warm or cold.
|
|
47
|
+
|
|
48
|
+
## Root cause: nullable belongs_to FK used as a hard `$in` AND constraint
|
|
49
|
+
|
|
50
|
+
The scope is tiny: one parent entity (`<target-id>`) → **1 store** (linked by
|
|
51
|
+
`business_entity_id` = the entity's **`uuid`**, correctly configured with
|
|
52
|
+
`references: uuid`) → **127 items** (`{store_id: <that store>}`, indexed, ~2ms).
|
|
53
|
+
|
|
54
|
+
But the run dumped **18,739** items, not 127, and scanned the whole collection.
|
|
55
|
+
Why:
|
|
56
|
+
|
|
57
|
+
1. `MongodbAdapter#related_collection_filter` ANDs **every** belongs_to whose
|
|
58
|
+
parent produced ids. The `stores` filter became:
|
|
59
|
+
|
|
60
|
+
```
|
|
61
|
+
{ user_id: {$in: [580 ids]},
|
|
62
|
+
deleted_user_id: {$in: [96 ids]}, # nullable FK
|
|
63
|
+
business_entity_id: {$in: ["<target-id>"]} }
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
The one matching store has **`deleted_user_id` absent (null)**, so it can never
|
|
67
|
+
satisfy `deleted_user_id ∈ {96 ids}`. The AND yields **0 stores** → `stores`
|
|
68
|
+
logs "No records matched. skip this table."
|
|
69
|
+
|
|
70
|
+
2. With `stores` empty, `@state["stores"]` carries no ids, so when `items` is
|
|
71
|
+
built its `store_id` belongs_to contributes nothing. The remaining belongs_tos
|
|
72
|
+
are to **reference/master data that is dumped in full** —
|
|
73
|
+
`large_categories` and `medium_categories`. So the items filter degenerated to:
|
|
74
|
+
|
|
75
|
+
```
|
|
76
|
+
{ large_category_id: {$in: [98 ids]},
|
|
77
|
+
medium_category_id: {$in: [846 ids]} }
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
3. `items` has **no index on `large_category_id` / `medium_category_id`** (it does
|
|
81
|
+
have `store_id_1`). So this filter forces a **full COLLSCAN of all 2.43M items**
|
|
82
|
+
— and the Runner scans it **twice**: once for `StreamingResult#size`
|
|
83
|
+
(`count_documents`) and again for the fetch.
|
|
84
|
+
|
|
85
|
+
Isolated phase breakdown of the degenerate items query (warm): `count_documents`
|
|
86
|
+
3.58s + fetch/decode 1.47s + serialize 0.26s. Serialization (the native
|
|
87
|
+
`Exwiw::ExtJson` C ext) is **not** the bottleneck — the COLLSCAN is, and it is far
|
|
88
|
+
worse cold.
|
|
89
|
+
|
|
90
|
+
The same nullable-FK problem applies one level down: the store's 127 items
|
|
91
|
+
themselves have `*_category_id` = null, so even `{store_id, large_category,
|
|
92
|
+
medium_category}` ANDed returns **0**. The only filter that yields the correct 127
|
|
93
|
+
is `{store_id: {$in: [store]}}` **alone**.
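The effect of a nullable FK inside a strict AND can be reproduced without a database. The sketch below is a pure-Ruby stand-in for `$in` matching on top-level fields only; the documents, ids, and helper names (`matches?`, `matching_ids`) are hypothetical and chosen to mirror the `stores` case above.

```ruby
# Pure-Ruby stand-in for matching a document against ANDed `$in` conditions.
# `doc[field]` returns nil for an absent key, which mirrors how MongoDB's
# `$in: [nil]` matches both explicit nulls and missing fields.
def matches?(doc, filter)
  filter.all? { |field, cond| cond.fetch("$in").include?(doc[field]) }
end

def matching_ids(docs, filter)
  docs.select { |doc| matches?(doc, filter) }.map { |doc| doc["_id"] }
end

# Hypothetical data: the one store in scope (s1) has no `deleted_user_id`.
STORES = [
  { "_id" => "s1", "business_entity_id" => "e1", "user_id" => "u1" },
  { "_id" => "s2", "business_entity_id" => "e2", "user_id" => "u2",
    "deleted_user_id" => "u9" },
].freeze

# Every belongs_to ANDed strictly, as in the degenerate filter.
STRICT_AND = {
  "user_id" => { "$in" => %w[u1 u2] },
  "deleted_user_id" => { "$in" => %w[u9] }, # nullable FK
  "business_entity_id" => { "$in" => %w[e1] },
}.freeze

# Only the constraint that actually carries the scope.
ANCHOR_ONLY = { "business_entity_id" => { "$in" => %w[e1] } }.freeze

p matching_ids(STORES, STRICT_AND)  # s1 fails on the absent deleted_user_id
p matching_ids(STORES, ANCHOR_ONLY) # s1 matches
```

The strict AND returns nothing because `s1` can never satisfy the `deleted_user_id` condition, while the single anchoring constraint returns the expected store.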
|
|
94
|
+
|
|
95
|
+
## Implemented fix: genuine-anchor scoping (MongodbAdapter#related_collection_filter)
|
|
96
|
+
|
|
97
|
+
Scope flows from the dump target along belongs_to edges. The fix classifies each
|
|
98
|
+
belongs_to parent of a non-target collection by whether it is **genuinely scoped**
|
|
99
|
+
— reachable back to the dump target through belongs_to chains
|
|
100
|
+
(`#genuine_scope_set`, a fixpoint over the configs) — and applies the constraint
|
|
101
|
+
accordingly:
|
|
102
|
+
|
|
103
|
+
- **Anchor (strict).** Among the genuine parents, the most selective one (fewest
|
|
104
|
+
captured ids) is applied strictly (`{fk: {$in: [...]}}`). It carries the real
|
|
105
|
+
narrowing and, being strict, bounds the result to a small set — which keeps both
|
|
106
|
+
this query and the `$in` sets it feeds downstream from ballooning.
|
|
107
|
+
- **Other genuine parents (null-aware).** `{fk: {$in: [nil, ...]}}` (Mongo's
|
|
108
|
+
`$in: [nil]` matches both explicit nulls and missing fields), so a row whose
|
|
109
|
+
nullable refinement FK is null is not excluded by it.
|
|
110
|
+
- **Reference parents (dropped).** A parent NOT reachable to the dump target is
|
|
111
|
+
reference/master data dumped in full; its id set is "all/most of a table" and is
|
|
112
|
+
not a real scope, so when a genuine anchor exists it is dropped entirely.
|
|
113
|
+
- **No genuine parent:** fall back to the historical strict-AND of whatever
|
|
114
|
+
constraints exist (preserves prior behaviour for unreachable collections).
|
|
115
|
+
|
|
116
|
+
For this extraction: `stores` → `{business_entity_id ∈ {<target-id>}}` (anchor;
|
|
117
|
+
`user_id`/`deleted_user_id` → reference leaks, dropped) → **1 store**; `items` →
|
|
118
|
+
`{store_id ∈ {store}}` strict anchor with the nullable refinement FKs null-aware
|
|
119
|
+
and the `*_category` references dropped → **127** via the `store_id_1` index.
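The classification rules above can be sketched as two small functions: a fixpoint that finds the genuinely scoped collections, and a filter builder that applies anchor / null-aware / dropped. This is an illustrative sketch, not the adapter's code: the data shapes, the `shelves` collection, and the `related_filter` name are assumptions made for the example; only `genuine_scope_set` borrows its name from the text.

```ruby
require "set"

# Hypothetical belongs_to graph: collection name => list of { fk:, parent: }.
EDGES = {
  "business_entities" => [],
  "users" => [],
  "large_categories" => [],
  "stores" => [
    { fk: "business_entity_id", parent: "business_entities" },
    { fk: "user_id", parent: "users" },
  ],
  "shelves" => [{ fk: "store_id", parent: "stores" }],
  "items" => [
    { fk: "store_id", parent: "stores" },
    { fk: "shelf_id", parent: "shelves" },
    { fk: "large_category_id", parent: "large_categories" },
  ],
}.freeze

# Fixpoint: a collection is genuinely scoped if it is the dump target or
# belongs_to a collection that is already in the set.
def genuine_scope_set(edges, target)
  genuine = Set[target]
  loop do
    added = edges.keys.select do |name|
      !genuine.include?(name) &&
        edges[name].any? { |e| genuine.include?(e[:parent]) }
    end
    break if added.empty?
    genuine.merge(added)
  end
  genuine
end

# Build the filter for one collection from the ids captured per parent.
def related_filter(collection_edges, captured_ids, genuine)
  constrained = collection_edges.select { |e| captured_ids.key?(e[:parent]) }
  scoped = constrained.select { |e| genuine.include?(e[:parent]) }
  if scoped.empty?
    # No genuine parent: fall back to the strict AND of whatever exists.
    return constrained.to_h { |e| [e[:fk], { "$in" => captured_ids[e[:parent]] }] }
  end
  # Anchor: the genuine parent with the fewest captured ids stays strict.
  anchor = scoped.min_by { |e| captured_ids[e[:parent]].size }
  scoped.to_h do |e|
    ids = captured_ids[e[:parent]]
    # Other genuine parents become null-aware; reference parents are dropped.
    [e[:fk], { "$in" => e.equal?(anchor) ? ids : [nil, *ids] }]
  end
end

GENUINE = genuine_scope_set(EDGES, "business_entities")
CAPTURED = {
  "stores" => ["s1"],
  "shelves" => %w[h1 h2],
  "large_categories" => %w[c1 c2 c3],
}.freeze

p related_filter(EDGES["items"], CAPTURED, GENUINE)
```

In this toy graph `store_id` is the strict anchor, `shelf_id` is null-aware, and `large_category_id` disappears from the filter because `large_categories` never reaches the dump target.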
|
|
120
|
+
|
|
121
|
+
### Measured result (warm, same cache)
|
|
122
|
+
|
|
123
|
+
Full run **58.8s → 11.0s ≈ 5.4×**; `items` 41s double COLLSCAN → ~11ms indexed
|
|
124
|
+
(≈3700×). Correctness also fixed: `stores` 0→1, `items` 18,739 (leaked COLLSCAN)
|
|
125
|
+
→127. Byte-identical existing snapshots (the seed graph is fully genuine and has no
|
|
126
|
+
null FKs, so anchor-strict + null-aware ≡ the prior strict-AND).
|
|
127
|
+
|
|
128
|
+
### Approaches considered and rejected
|
|
129
|
+
|
|
130
|
+
- **Unconditional null-aware on every belongs_to** (the original iter-1 direction):
|
|
131
|
+
catastrophic. A collection that belongs_to only reference data dumped in full
|
|
132
|
+
becomes ~the whole table once null-aware; the resulting child `$in` then exceeds
|
|
133
|
+
Mongo's **48 MB max message size** and the run crashes. Null-awareness must NOT
|
|
134
|
+
be applied to a collection's only/anchor scope.
|
|
135
|
+
- **Null-aware on all genuine parents (no anchor distinction):** makes the genuine
|
|
136
|
+
*anchor* itself null-aware too — `stores` then matched every store with a null
|
|
137
|
+
`business_entity_id` (a not-fully-backfilled column) → hundreds of thousands of
|
|
138
|
+
stores → a ~39 MB child filter on a downstream collection (**MaxBSONSize**).
|
|
139
|
+
Hence the anchor stays strict.
|
|
140
|
+
- **Scope by the single most-selective genuine parent alone (drop other genuine):**
|
|
141
|
+
fast and correct here, but drops legitimate AND-narrowing for multi-parent
|
|
142
|
+
collections (e.g. `order_items` ∈ orders AND products) and moves seed snapshots.
|
|
143
|
+
- Pure-performance tweaks that keep the (incorrect) 18,739-row output —
|
|
144
|
+
`--cursor-parallel` (changes row order, treats the symptom) or skipping the
|
|
145
|
+
redundant `count_documents` scan (~½ only) — were rejected as the primary fix.
|
data/docs/optimize-mongodb-export-with-native-ext.md
CHANGED
|
@@ -1,8 +1,11 @@
|
|
|
1
1
|
# Design: optional native (C) extension for the MongoDB Extended-JSON encoder
|
|
2
2
|
|
|
3
|
-
Status: **
|
|
4
|
-
|
|
5
|
-
|
|
3
|
+
Status: **implemented.** This document captured the design; it now describes the
|
|
4
|
+
shipped encoder. Source: `ext/exwiw/ext_json/ext_json.c` (native emitter) and
|
|
5
|
+
`lib/exwiw/ext_json.rb` (the optional-load shim + pure-Ruby fallback); the
|
|
6
|
+
byte-identity guard is `spec/ext_json_spec.rb`. It is the successor to the
|
|
7
|
+
fork/cursor parallelism that was removed (see
|
|
8
|
+
[`optimization-notes.md`](./optimization-notes.md)).
|
|
6
9
|
|
|
7
10
|
## Motivation
|
|
8
11
|
|
|
@@ -90,15 +93,23 @@ end
|
|
|
90
93
|
The encoder splits values into a **native fast path** and a **Ruby delegate**:
|
|
91
94
|
|
|
92
95
|
- **Native (in C):** `Hash`, `Array`, `String`, `Integer` within int64,
|
|
93
|
-
`true`/`false`/`nil`,
|
|
94
|
-
|
|
96
|
+
`true`/`false`/`nil`, `BSON::ObjectId`, and **in-range `Time`** (years
|
|
97
|
+
1970..9999). These are the structural bulk plus the two most common leaves in
|
|
98
|
+
a dumped document — `_id` and the Mongoid `created_at`/`updated_at` timestamps.
|
|
99
|
+
The in-range Time path resolves the absolute instant with `rb_time_timespec`
|
|
100
|
+
(epoch seconds + nanoseconds, no `rb_funcall`), formats with `gmtime_r` +
|
|
101
|
+
`snprintf`, and reproduces bson's rule exactly: a `.mmm` fraction iff
|
|
102
|
+
`nsec >= 1000` (i.e. `usec != 0`), with the millisecond floored to
|
|
103
|
+
`nsec / 1e6`. The in-range window is the half-open epoch-second range
|
|
104
|
+
`[0, 253402300800)`.
|
|
95
105
|
- **Delegate to Ruby** — call back into
|
|
96
106
|
`JSON.generate(value.as_extended_json(mode: :relaxed))` for the individual
|
|
97
107
|
value and splice the returned fragment into the buffer:
|
|
98
108
|
- `Float` — `Float#to_s` diverges from `JSON.generate` for scientific notation
|
|
99
109
|
(`1e20`), so never reformat floats in C.
|
|
100
|
-
- `Time
|
|
101
|
-
|
|
110
|
+
- **out-of-range `Time`** (year < 1970 or > 9999) — its `$numberLong` form
|
|
111
|
+
involves negative-epoch arithmetic, is vanishingly rare in dumped data, and
|
|
112
|
+
is left to Ruby. The in-range ISO branch is handled natively (above).
|
|
102
113
|
- out-of-int64 `Integer` — must surface the identical `RangeError`.
|
|
103
114
|
- any unrecognized class — `Decimal128`, `BSON::Binary`, `Symbol`, `Regexp`,
|
|
104
115
|
`Date`, `BSON::Timestamp`, etc.
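The in-range `Time` rule stated in the native bullet above can be written out as a small pure-Ruby reference. This is a sketch of the rule as the document describes it (fraction iff `nsec >= 1000`, milliseconds floored, half-open window `[0, 253402300800)`), not the C source or the bson gem's code; the method name `relaxed_date_fragment` and the choice to return `nil` for delegated values are assumptions for illustration.

```ruby
# Half-open epoch-second window for years 1970..9999.
IN_RANGE_SECONDS = (0...253_402_300_800).freeze

# Returns the relaxed Extended JSON fragment for an in-range Time, or nil
# when the value is out of range and would be delegated to the Ruby path.
def relaxed_date_fragment(time)
  return nil unless IN_RANGE_SECONDS.cover?(time.to_i)

  stamp = time.getutc.strftime("%Y-%m-%dT%H:%M:%S")
  if time.nsec >= 1000
    # At least one whole microsecond: emit a .mmm fraction, floored.
    millis = time.nsec / 1_000_000
    format('{"$date":"%s.%03dZ"}', stamp, millis)
  else
    # Sub-microsecond remainder only: no fraction at all.
    format('{"$date":"%sZ"}', stamp)
  end
end

puts relaxed_date_fragment(Time.at(1, 123_456_789, :nsec))
puts relaxed_date_fragment(Time.at(1, 999, :nsec))
```

A reference like this is mainly useful as the oracle side of a byte-identity check: the native output for any in-range `Time` should equal what the rule produces.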
|
|
@@ -112,9 +123,14 @@ would have produced for that position. The native walk can therefore hand any
|
|
|
112
123
|
value it does not want to format to Ruby and splice the result, with no
|
|
113
124
|
divergence.
|
|
114
125
|
|
|
115
|
-
`Time`
|
|
116
|
-
|
|
117
|
-
(
|
|
126
|
+
`Time` was promoted into the native path because the benchmark showed it was
|
|
127
|
+
decisive: with `Time` delegated, a 30-embedded-post timestamp-heavy document
|
|
128
|
+
(32 `Time` fields) sped up only ~1.03× — the per-`Time` `rb_funcall` +
|
|
129
|
+
`as_extended_json` Hash allocation + second `JSON.generate` pass erased the win.
|
|
130
|
+
Formatting in-range `Time` natively brings the same document to ~2.8× (the
|
|
131
|
+
serialization-step ceiling). `Float` remains delegated: matching
|
|
132
|
+
`JSON.generate`'s shortest-round-trip float formatting in C (not `Float#to_s`)
|
|
133
|
+
is not worth the risk for the few floats a typical document carries.
|
|
118
134
|
|
|
119
135
|
## C source & buffer design
|
|
120
136
|
|
|
@@ -218,7 +234,11 @@ The private `#extended_json` helper is removed — its logic (including the
|
|
|
218
234
|
## Risk register
|
|
219
235
|
|
|
220
236
|
1. **Time formatting** — variable fraction + ms flooring + `$numberLong`
|
|
221
|
-
boundary.
|
|
237
|
+
boundary. In-range years (1970..9999) are formatted natively via
|
|
238
|
+
`rb_time_timespec` + `gmtime_r`; the rare out-of-range `$numberLong` form is
|
|
239
|
+
delegated. Mitigated by a dense byte-identity fuzz over the whole in-range
|
|
240
|
+
epoch span with mixed nanosecond precision, plus the boundary/sub-ms edges in
|
|
241
|
+
`spec/ext_json_spec.rb`.
|
|
222
242
|
2. **Float formatting** — `Float#to_s` ≠ `JSON.generate`. Mitigated by delegating.
|
|
223
243
|
3. **String escaping** — must match JSON exactly. Implemented in C, fuzz-tested
|
|
224
244
|
vs the Ruby fallback.
|
data/docs/sql-dump-optimization-notes.md
ADDED
|
@@ -0,0 +1,278 @@
|
|
|
1
|
+
# SQL dump performance: investigation notes
|
|
2
|
+
|
|
3
|
+
Companion to [`optimization-notes.md`](./optimization-notes.md) (which covers the
|
|
4
|
+
MongoDB adapter). This records the speed/memory bottlenecks of the **SQL**
|
|
5
|
+
adapters' dump path (mysql / postgresql / sqlite), measured against a baseline,
|
|
6
|
+
so a future iteration can address them. **Nothing is fixed yet** — this is the
|
|
7
|
+
measurement + bottleneck-analysis step.
|
|
8
|
+
|
|
9
|
+
The reproducible harness is `script/bench_sql_dump.rb`. It seeds a synthetic
|
|
10
|
+
table and measures the two Runner phases per table; it also measures the
|
|
11
|
+
serialization step with no DB at all. The correctness anchor for any future fix
|
|
12
|
+
is `spec/insert_output_snapshot_spec.rb` — the **byte-exact** snapshot of dump
|
|
13
|
+
output.
|
|
14
|
+
|
|
15
|
+
> **Status:** **both hotspots are fixed for all three SQL adapters.** Hotspot #2
|
|
16
|
+
> (the whole-table INSERT string) — see
|
|
17
|
+
> [Resolution #2](#resolution-hotspot-2-streamed-single-insert). Hotspot #1
|
|
18
|
+
> (full result-set materialization in `execute`) — postgresql, mysql, and sqlite
|
|
19
|
+
> all stream the fetch now; see
|
|
20
|
+
> [Resolution #1](#resolution-hotspot-1-streaming-fetch-postgresql--mysql).
|
|
21
|
+
|
|
22
|
+
## The two hotspots (same shape as MongoDB had pre-optimization)
|
|
23
|
+
|
|
24
|
+
The Runner drives, per table:
|
|
25
|
+
|
|
26
|
+
1. **execute** — the adapter materializes the **entire** result set into a Ruby
|
|
27
|
+
array-of-arrays before anything is written:
|
|
28
|
+
- postgresql: `connection.exec(sql).values`
|
|
29
|
+
- mysql: `res.to_a.map { |row| row.map { stringify } }` (also re-allocates
|
|
30
|
+
every value as a normalized String)
|
|
31
|
+
- sqlite: `connection.execute(sql)`
|
|
32
|
+
|
|
33
|
+
Memory here is proportional to the **table size**, independent of any chunking
|
|
34
|
+
downstream.
|
|
35
|
+
|
|
36
|
+
2. **to_bulk_insert** — SQL adapters set **no** `default_bulk_insert_chunk_size`
|
|
37
|
+
(it is `nil`), so the Runner treats the whole table as one chunk and
|
|
38
|
+
`to_bulk_insert` builds the **entire** `INSERT INTO ... VALUES (...),(...);`
|
|
39
|
+
as one giant String — first an `Array` of N per-row tuple strings, then the
|
|
40
|
+
joined result — held simultaneously with the result set from step 1.
|
|
41
|
+
|
|
42
|
+
## Baseline (200,000 rows, 8 columns, ~41.2 MB output)
|
|
43
|
+
|
|
44
|
+
Measured via `bench_sql_dump.rb` (sandbox disabled — it needs `ps` for RSS and
|
|
45
|
+
localhost for the live DB). RSS is sticky across in-process phases, so read the
|
|
46
|
+
peaks as upper bounds and the *deltas* as the signal.
|
|
47
|
+
|
|
48
|
+
| adapter | execute peak (Δ) | + whole-string write peak | result-set objs |
|
|
49
|
+
|---------|------------------|---------------------------|-----------------|
|
|
50
|
+
| postgresql | 471.7 MB (+65) | 494.0 MB | 1.8M |
|
|
51
|
+
| mysql | 413.4 MB (+52) | 554.9 MB | 2.4M |
|
|
52
|
+
| sqlite | 434.5 MB | 526.6 MB | 1.4M |
|
|
53
|
+
|
|
54
|
+
For a **41 MB** dump the process peaks near **0.5 GB** — ~12× the output size —
|
|
55
|
+
because the full result set *and* the whole-table INSERT string are resident at
|
|
56
|
+
once. Both costs scale linearly with the table, so a large table OOMs the same
|
|
57
|
+
way the embed-heavy MongoDB collection did.
|
|
58
|
+
|
|
59
|
+
Per-value serialization is cheap and not the bottleneck: `escape_value` is
|
|
60
|
+
~0.4–1.3 µs/op and a one-row `to_bulk_insert` ~5.5 µs/op. The cost is **memory**
|
|
61
|
+
(holding everything at once), not CPU.
|
|
62
|
+
|
|
63
|
+
## The byte-identity catch (why this differs from MongoDB)
|
|
64
|
+
|
|
65
|
+
MongoDB's fix was a `default_bulk_insert_chunk_size`, which is byte-identical
|
|
66
|
+
because JSONL chunks join with the same `"\n"` `to_bulk_insert` already inserts
|
|
67
|
+
between docs. **That does not transfer to SQL.** Each `to_bulk_insert` call wraps
|
|
68
|
+
its rows in its own `INSERT INTO ... VALUES ...;` statement, so a chunk size > 0
|
|
69
|
+
turns one INSERT into *many* INSERT statements — semantically equivalent on
|
|
70
|
+
re-import, but a **different byte stream** that breaks the snapshot guard.
|
|
71
|
+
|
|
72
|
+
The byte-identical lever for SQL is instead to **stream the tuples into a single
|
|
73
|
+
INSERT statement**: emit the adapter's exact `INSERT ... VALUES\n` header once,
|
|
74
|
+
then write each row's `(...)` tuple (reusing the adapter's own `escape_value`)
|
|
75
|
+
separated by `",\n"`, then `;`. The bench implements this as `write_streamed`
|
|
76
|
+
and asserts byte-for-byte identity with the whole-string path — confirmed
|
|
77
|
+
**true** for all three adapters (the header must reuse the adapter's quoting,
|
|
78
|
+
e.g. MySQL's backticks, or it diverges).
|
|
79
|
+
|
|
80
|
+
`write_streamed` cuts the to_bulk_insert peak by ~110–120 MB, **but** naive
|
|
81
|
+
per-row `IO#print` makes it ~2–2.4× slower than building one string and writing
|
|
82
|
+
it once. So the production fix wants **chunk-buffered streaming**: build an
|
|
83
|
+
N-row substring in memory, write it, repeat — managing the `",\n"` separator
|
|
84
|
+
across chunk boundaries — to bound memory *without* the per-row IO penalty,
|
|
85
|
+
while still emitting a single INSERT statement (byte-identical).
|
|
86
|
+
|
|
87
|
+
## Resolution: hotspot #2 (streamed single INSERT)
|
|
88
|
+
|
|
89
|
+
Implemented. The Runner no longer builds the per-table INSERT as one giant
|
|
90
|
+
String; it delegates writing to a new adapter seam `Adapter#write_inserts(io,
|
|
91
|
+
results, table, chunk_size)`:
|
|
92
|
+
|
|
93
|
+
- `Adapter::Base#write_inserts` keeps the old behavior (write `to_bulk_insert`
|
|
94
|
+
per chunk, joined by `"\n"`), so MongoDB and any future adapter are unchanged.
|
|
95
|
+
- The SQL adapters mix in `Adapter::SqlBulkInsert`, which **streams** the single
|
|
96
|
+
`INSERT INTO ... VALUES <tuples>;` statement to the file `STREAM_FLUSH_ROWS`
|
|
97
|
+
(2,000) tuples at a time. Each flush is one fast `map`+`join` (the same path
|
|
98
|
+
`to_bulk_insert` uses), and the `",\n"` printed between slices reproduces the
|
|
99
|
+
exact separator between tuples — so the bytes are **identical** to the
|
|
100
|
+
whole-table build. The three duplicate `to_bulk_insert` methods collapsed into
|
|
101
|
+
the shared module; each adapter now only supplies `insert_header` (its
|
|
102
|
+
identifier quoting) and `escape_value`.
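The chunk-buffered streaming described above can be illustrated with a small standalone sketch. It is not the `SqlBulkInsert` module itself: the header text, the simplified `escape_value`, and the helper names are hypothetical, and the only point is that writing slices joined by `",\n"` yields the same bytes as the whole-table build.

```ruby
require "stringio"

# Simplified, hypothetical escaping: NULL for nil, single quotes doubled.
def escape_value(value)
  return "NULL" if value.nil?
  "'" + value.to_s.gsub("'", "''") + "'"
end

def tuple(row)
  "(" + row.map { |v| escape_value(v) }.join(", ") + ")"
end

# Whole-table build: one Array of tuple strings, one giant joined String.
def whole_string_insert(header, rows)
  header + rows.map { |row| tuple(row) }.join(",\n") + ";"
end

# Streamed build: at most `flush_rows` tuples are held in memory at once.
def write_streamed(io, header, rows, flush_rows: 2000)
  io.write(header)
  first = true
  rows.each_slice(flush_rows) do |slice|
    # The separator between slices is the same one used between tuples.
    io.write(",\n") unless first
    io.write(slice.map { |row| tuple(row) }.join(",\n"))
    first = false
  end
  io.write(";")
end

HEADER = "INSERT INTO users (id, name) VALUES\n" # hypothetical header
ROWS = [[1, "ann"], [2, "o'neil"], [3, nil], [4, "dee"], [5, "eve"]].freeze

out = StringIO.new
write_streamed(out, HEADER, ROWS, flush_rows: 2)
puts out.string == whole_string_insert(HEADER, ROWS)
```

Using a flush size that does not divide the row count (2 over 5 rows here) exercises the slice boundary, which is where a misplaced separator would show up.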
|
|
103
|
+
|
|
104
|
+
Verified byte-for-byte by `spec/insert_output_snapshot_spec.rb` (live DB, all
|
|
105
|
+
three adapters) and a flush-boundary sanity check; full suite green.
|
|
106
|
+
|
|
107
|
+
Measured (`bench_sql_dump.rb`, 200k rows / ~41.2 MB output):
|
|
108
|
+
|
|
109
|
+
| adapter | whole-string peak | streamed peak | Δ peak |
|
|
110
|
+
|---------|-------------------|---------------|--------|
|
|
111
|
+
| postgresql | 367 MB | 226 MB | −141 MB |
|
|
112
|
+
| mysql | 325 MB | 214 MB | −111 MB |
|
|
113
|
+
| sqlite | 353 MB | 207 MB | −146 MB |
|
|
114
|
+
|
|
115
|
+
So hotspot #2's contribution (~110–145 MB on a 41 MB dump — the whole INSERT
|
|
116
|
+
string *plus* the transient 200k-tuple `Array` and its join) is gone; the write
|
|
117
|
+
buffer is now bounded to ~2,000 tuples regardless of table size.
|
|
118
|
+
|
|
119
|
+
**Speed:** streaming is *not* slower than the whole-string build. Measured in
|
|
120
|
+
isolation (post-`GC.start`, no sampler thread) the streamed write at
|
|
121
|
+
`flush_rows=2000` was ~0.67 s vs ~0.83 s for the one giant `map`+`join` — small
|
|
122
|
+
chunks stay in cache and avoid repeatedly growing/copying a 41 MB String. The
|
|
123
|
+
~1.3× "slowdown" the in-process bench shows is an artifact of its background RSS
|
|
124
|
+
sampler thread (`ps` every 10 ms) plus run-ordering (streamed runs first, cold),
|
|
125
|
+
not the algorithm. The earlier worry about a per-row `IO#print` penalty only
|
|
126
|
+
applied to the naive row-at-a-time prototype, which `flush_rows` slicing avoids.
|
|
127
|
+
|
|
128
|
+
## Resolution: hotspot #1 (streaming fetch, postgresql + mysql)
|
|
129
|
+
|
|
130
|
+
Implemented for **postgresql**. `PostgresqlAdapter#execute` no longer returns
|
|
131
|
+
`connection.exec(sql).values` (the whole result set as a Ruby array-of-arrays);
|
|
132
|
+
it returns a lazy `PostgresqlAdapter::StreamingResult` that pulls rows off the
|
|
133
|
+
wire one at a time via libpq **single-row mode** (`send_query` +
|
|
134
|
+
`set_single_row_mode` + a `get_result` loop, each yielding one row's
|
|
135
|
+
text-format `Array<String|nil>`). The Runner drives it exactly like the old
|
|
136
|
+
array — `#size` then a single `each_slice` pass — so nothing else changed and
|
|
137
|
+
the output is byte-identical (verified by `insert_output_snapshot_spec`, both
|
|
138
|
+
the `insert` and `copy` pg scenarios).
|
|
139
|
+
|
|
140
|
+
It mirrors `MongodbAdapter::StreamingResult`, with two SQL-specific points:
|
|
141
|
+
|
|
142
|
+
- **`#size`** can't be answered cheaply from the cursor, so it runs a separate
|
|
143
|
+
`SELECT COUNT(*) FROM (<query>) AS exwiw_count_src` (comment-prefixed, like
|
|
144
|
+
the data query). Postgres prunes the wrapped subquery's unused projection, so
|
|
145
|
+
the COUNT transfers no row data — but it does re-run the query plan. This is
|
|
146
|
+
the deliberate cost of keeping the Runner contract (`#size` before iteration,
|
|
147
|
+
used to skip empty tables and log the count) unchanged, so MongoDB and the
|
|
148
|
+
other SQL adapters are untouched. (MongoDB's `count_documents` is an
|
|
149
|
+
index-only walk and cheaper; the SQL COUNT is the analogue.)
|
|
150
|
+
- the streaming pass ties up the connection until fully drained. The Runner
|
|
151
|
+
always drains it (`write_inserts`) before issuing `post_insert_sql` / DELETE
|
|
152
|
+
on the same connection, so the ordering holds. `StreamingResult#each` also
|
|
153
|
+
drains any queued results if iteration is abandoned mid-stream (a SQL error
|
|
154
|
+
surfaced by `#check`, or the consumer raising), so the connection stays
|
|
155
|
+
usable.
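The contract described above (`#size` answered by a separate count, one lazy pass for the rows, and a drain when iteration is abandoned) can be sketched without a database. The class below is a hypothetical stand-in: `StreamingResultSketch`, `count_source`, and `row_source` are names invented for the example, and an `Enumerator` plays the role of the wire.

```ruby
# Lazy result wrapper: rows are pulled one at a time, never materialized.
class StreamingResultSketch
  include Enumerable

  # count_source: callable returning the row count (the separate COUNT).
  # row_source:   callable returning an Enumerator over the rows.
  def initialize(count_source:, row_source:)
    @count_source = count_source
    @row_source = row_source
  end

  def size
    @size ||= @count_source.call
  end

  def each
    return enum_for(:each) unless block_given?

    rows = @row_source.call
    finished = false
    begin
      loop do
        row = rows.next # raises StopIteration when exhausted, ending the loop
        yield row
      end
      finished = true
    ensure
      # If the consumer raised or broke out early, pull the remaining rows
      # so the underlying connection would be left in a usable state.
      loop { rows.next } unless finished
    end
    self
  end
end

DATA = [[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]].freeze
result = StreamingResultSketch.new(
  count_source: -> { DATA.size },
  row_source: -> { DATA.each }
)
puts result.size
result.each_slice(2) { |slice| p slice }
```

Because the class only needs `each`, `Enumerable#each_slice` works unchanged, which is the shape of the Runner contract the text describes (`#size` first, then a single `each_slice` pass).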
|
|
156
|
+
|
|
157
|
+
Measured in **isolated fresh processes** (one per path, so the peak is not
|
|
158
|
+
polluted by other phases — RSS is sticky), 200k rows / ~41.2 MB output:
|
|
159
|
+
|
|
160
|
+
| pg fetch path | peak RSS | Δ over baseline |
|
|
161
|
+
|---------------|----------|-----------------|
|
|
162
|
+
| materialize (`exec(sql).values`) + streamed write (OLD) | ~360 MB | ~320 MB |
|
|
163
|
+
| **single-row stream + streamed write (NEW)** | **~48 MB** | **~12 MB** |
|
|
164
|
+
|
|
165
|
+
So the full result set (~320 MB of Ruby strings/arrays for 200k×8) is no longer
|
|
166
|
+
resident: peak drops ~**310 MB (~87%)**, and its ~12 MB rise over baseline is now *below* the 41 MB output
|
|
167
|
+
size, because rows are pulled one at a time and the write buffer is bounded to
|
|
168
|
+
~2,000 tuples (hotspot #2's fix). Speed is unchanged (~1.8 s both paths on the
|
|
169
|
+
in-process bench; the COUNT is cheap on an indexed seed). The reproducible A/B
|
|
170
|
+
is in `bench_sql_dump.rb` Part B (`execute(stream)` vs `execute(materialize)`,
|
|
171
|
+
with a byte-identity assertion).
|
|
172
|
+
|
|
173
|
+
Implemented for **mysql** too. `MysqlAdapter#execute` now returns a
|
|
174
|
+
`MysqlAdapter::StreamingResult` (an Enumerable mirroring the pg one) instead of
|
|
175
|
+
`connection.query(sql).rows`. The new `MysqlClient#stream_rows` pulls rows off
|
|
176
|
+
the wire one at a time via mysql2's server-side stream (`stream: true` +
|
|
177
|
+
`cache_rows: false`), yielding the same `Array<String|nil>` rows `#query`
|
|
178
|
+
buffered — so the generated INSERT is byte-identical (verified by
|
|
179
|
+
`insert_output_snapshot_spec`).
|
|
180
|
+
|
|
181
|
+
Two MySQL specifics differ from the pg path:
|
|
182
|
+
|
|
183
|
+
- **`#size`** is a separate `SELECT COUNT(*)` of the same query, but **not** a
|
|
184
|
+
subquery wrap. MySQL rejects a derived table with duplicate column names,
|
|
185
|
+
which a rails-managed `SELECT *` joined to another table produces
|
|
186
|
+
(`Duplicate column name 'id'`); Postgres tolerates it, MySQL does not. So
|
|
187
|
+
mysql replaces the projection with `COUNT(*)` instead
|
|
188
|
+
(`compile_ast(count_only: true)`) — exact because exwiw's extraction queries
|
|
189
|
+
have no DISTINCT/GROUP BY/LIMIT, so the count is independent of the projected
|
|
190
|
+
columns (confirmed against live data: `COUNT(*)` over the bare FROM/JOIN/WHERE
|
|
191
|
+
equals the streamed row count for both plain and `SELECT *`+join queries).
|
|
192
|
+
- **abandoned streams.** mysql2 requires a streamed result to be fully consumed
|
|
193
|
+
before the next query on the connection, or it raises "Commands out of sync".
|
|
194
|
+
`stream_rows` drains the remainder (re-entering `res.each`, which continues
|
|
195
|
+
from where it stopped) if the consumer block raises, so the connection stays
|
|
196
|
+
usable for the next table. `trilogy` has no streaming cursor
|
|
197
|
+
(no `QUERY_FLAGS_STREAMING`), so it buffers and yields — parity, no memory
|
|
198
|
+
win; trilogy is a test-only driver, production uses mysql2.
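
The drain behaviour described in the second bullet can be sketched without a database. `FakeStreamedResult` is a hypothetical stand-in for a streamed driver result whose `each` resumes where it stopped, and `stream_rows_with_drain` is an illustrative name rather than the real `MysqlClient#stream_rows`. This sketch uses `ensure`, so it also drains when the consumer breaks out early, not only when it raises.

```ruby
# Hypothetical stand-in for a streamed result: each call to #each continues
# from wherever the previous iteration stopped.
class FakeStreamedResult
  attr_reader :remaining

  def initialize(rows)
    @remaining = rows.dup
  end

  def each
    yield @remaining.shift until @remaining.empty?
  end
end

# Yield rows to the consumer; whatever happens, discard the unread remainder
# so the (imaginary) connection is ready for the next query.
def stream_rows_with_drain(result)
  result.each { |row| yield row }
ensure
  result.each { |_row| } # drain: a no-op when everything was already consumed
end

result = FakeStreamedResult.new([[1], [2], [3], [4]])
begin
  stream_rows_with_drain(result) { |row| raise "consumer failed" if row == [2] }
rescue RuntimeError => e
  puts "#{e.message}; rows left unread: #{result.remaining.size}"
end
```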
|
|
199
|
+
|
|
200
|
+
Measured in **isolated fresh processes** (one per path), 200k rows / ~40.7 MB
|
|
201
|
+
output:
|
|
202
|
+
|
|
203
|
+
| mysql fetch path | peak RSS | Δ over baseline |
|
|
204
|
+
|------------------|----------|-----------------|
|
|
205
|
+
| materialize (`query(sql).rows`) + streamed write (OLD) | ~340 MB | ~300 MB |
|
|
206
|
+
| **single-row stream + streamed write (NEW)** | **~50 MB** | **~10 MB** |
|
|
207
|
+
|
|
208
|
+
So peak drops ~**290 MB (~85%)**, now just above the 40.7 MB output — the same
|
|
209
|
+
shape as the pg result. Speed is unchanged-to-faster (the materialize path also
|
|
210
|
+
builds the whole array first). `bench_sql_dump.rb` Part B now shows the delta
|
|
211
|
+
for mysql too (it was equivalent before, when mysql still materialized).
|
|
212
|
+
|
|
213
|
+
Implemented for **sqlite** too, closing hotspot #1 for all three SQL adapters.
|
|
214
|
+
`SqliteAdapter#execute` no longer returns `connection.execute(sql)` (which
|
|
215
|
+
buffers the whole result into a Ruby array); it returns a
|
|
216
|
+
`SqliteAdapter::StreamingResult` (Enumerable, mirroring the pg/mysql ones) whose
|
|
217
|
+
`#each` walks the result one row at a time through SQLite's **statement cursor**
|
|
218
|
+
— `connection.prepare(data_sql)` then `Statement#each` (which maps to
|
|
219
|
+
`sqlite3_step`), closing the statement in an `ensure` so an abandoned mid-stream
|
|
220
|
+
iteration still releases the cursor. The rows are the same `Array` of
|
|
221
|
+
native-typed values `Database#execute` produced, so the generated INSERT is
|
|
222
|
+
byte-identical (verified by `insert_output_snapshot_spec` and a direct cursor
|
|
223
|
+
vs. `#execute` comparison).
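
A minimal sketch of the cursor-with-`ensure` shape, using a fake statement so it runs without the sqlite3 gem. `FakeStatement` and `CursorResult` are hypothetical names; the real `SqliteAdapter::StreamingResult` may differ in detail.

```ruby
# Hypothetical stand-in for a prepared statement: yields rows one at a time
# and records whether it has been closed.
class FakeStatement
  attr_reader :closed

  def initialize(rows)
    @rows = rows
    @closed = false
  end

  def each
    @rows.each { |row| yield row }
  end

  def close
    @closed = true
  end
end

# Enumerable wrapper: prepares the statement lazily on each iteration and
# always closes it, even if the consumer stops or raises mid-stream.
class CursorResult
  include Enumerable

  def initialize(&prepare)
    @prepare = prepare
  end

  def each
    return enum_for(:each) unless block_given?

    statement = @prepare.call
    begin
      statement.each { |row| yield row }
    ensure
      statement.close
    end
  end
end

statement = FakeStatement.new([[1, "a"], [2, "b"], [3, "c"]])
result = CursorResult.new { statement }
first_row = result.first # abandons the iteration after one row
puts "first=#{first_row.inspect} closed=#{statement.closed}"
```

Because `Enumerable#first` stops the iteration early, this exercises the abandoned mid-stream case: the `ensure` still runs and releases the cursor.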
|
|
224
|
+
|
|
225
|
+
SQLite specifics vs. the pg/mysql paths:
|
|
226
|
+
|
|
227
|
+
- **`#size`** runs a separate `SELECT COUNT(*)` of the same query with the
|
|
228
|
+
projection replaced by `COUNT(*)` (`compile_ast(count_only: true)`, the same
|
|
229
|
+
trick mysql uses) — exact because exwiw's extraction queries have no
|
|
230
|
+
DISTINCT/GROUP BY/LIMIT. SQLite would also tolerate a duplicate-column
|
|
231
|
+
subquery wrap (unlike mysql), but the `count_only` form is shared and avoids
|
|
232
|
+
the extra subquery.
|
|
233
|
+
- **no connection contention.** SQLite is an embedded, single-connection engine
|
|
234
|
+
that allows multiple active prepared statements at once, so the `#size` COUNT
|
|
235
|
+
and the data cursor don't fight over the connection the way the pg/mysql
|
|
236
|
+
single-row streams tie up the wire. No drain dance is needed; just close the
|
|
237
|
+
statement.
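
The projection swap behind `#size` can be illustrated with a toy query builder. `ExtractQuery` and `build_sql` are hypothetical and far simpler than exwiw's `compile_ast`; the point is only that with no DISTINCT/GROUP BY/LIMIT, replacing the column list with `COUNT(*)` leaves the FROM/JOIN/WHERE (and therefore the row count) untouched.

```ruby
# Hypothetical, simplified query description (not exwiw's AST).
ExtractQuery = Struct.new(:columns, :from, :joins, :where, keyword_init: true)

# Build the data query, or the matching COUNT(*) query when count_only is set.
# Only the projection changes; no subquery wrap is involved.
def build_sql(query, count_only: false)
  projection = count_only ? "COUNT(*)" : query.columns.join(", ")
  sql = +"SELECT #{projection} FROM #{query.from}"
  query.joins.each { |join| sql << " JOIN #{join}" }
  sql << " WHERE #{query.where}" if query.where
  sql
end

query = ExtractQuery.new(
  columns: ["posts.*", "users.id"],
  from: "posts",
  joins: ["users ON users.id = posts.user_id"],
  where: "users.shop_id = 1"
)

puts build_sql(query)
puts build_sql(query, count_only: true)
```

Wrapping the first query as a derived table is what would trip MySQL's duplicate-column check; the second form sidesteps it because the duplicate `id` columns are never projected.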
|
|
238
|
+
|
|
239
|
+
Measured in **isolated fresh processes** (one per path), 200k rows / ~40.5 MB
|
|
240
|
+
output:
|
|
241
|
+
|
|
242
|
+
| sqlite fetch path | peak RSS | Δ over baseline |
|
|
243
|
+
|-------------------|----------|-----------------|
|
|
244
|
+
| materialize (`Database#execute`) + streamed write (OLD) | ~298 MB | ~257 MB |
|
|
245
|
+
| **statement-cursor stream + streamed write (NEW)** | **~59 MB** | **~18 MB** |
|
|
246
|
+
|
|
247
|
+
So peak drops ~**240 MB (~80%)**, the same shape as pg/mysql, and it is
|
|
248
|
+
**faster** (~0.84 s vs ~1.68 s) — the materialize path pays to build the whole
|
|
249
|
+
Ruby array up front, the cursor does not. `bench_sql_dump.rb` Part B now shows a
|
|
250
|
+
real delta for sqlite too (it was equivalent before, when sqlite still
|
|
251
|
+
materialized).
|
|
252
|
+
|
|
253
|
+
## Status: both hotspots closed for all three SQL adapters
|
|
254
|
+
|
|
255
|
+
1. **Bounded-memory write** (hotspot #2) — done for mysql / postgresql / sqlite;
|
|
256
|
+
see [Resolution #2](#resolution-hotspot-2-streamed-single-insert).
|
|
257
|
+
2. **Streaming result fetch** (hotspot #1) — done for postgresql (libpq
|
|
258
|
+
single-row mode), mysql (mysql2 `stream: true`), and sqlite (statement
|
|
259
|
+
cursor); see Resolution #1 above.
|
|
260
|
+
|
|
261
|
+
There is no remaining materialization hotspot in the SQL dump path: peak RSS is
|
|
262
|
+
now bounded (the Δ over baseline stays well below the output size) and independent of table size for every
|
|
263
|
+
SQL adapter, the same property the MongoDB streaming work achieved. The
|
|
264
|
+
`trilogy` driver still buffers (it has no streaming cursor flag), but it is a
|
|
265
|
+
test-only driver — production mysql uses mysql2.
|
|
266
|
+
|
|
267
|
+
## Methodology notes
|
|
268
|
+
|
|
269
|
+
- The serialization hotspot reproduces **with no database** (Part A): synthesize
|
|
270
|
+
the array-of-String-arrays the drivers return and measure `to_bulk_insert`.
|
|
271
|
+
The live-DB part (Part B) measures `execute` and needs a reachable DB; the dev
|
|
272
|
+
sandbox blocks localhost (and `ps`), so disable the sandbox for bench runs.
|
|
273
|
+
- Run order matters: the bench measures the STREAMED path **before** the WHOLE
|
|
274
|
+
path so the transient giant String doesn't pollute the streamed peak (RSS is
|
|
275
|
+
reclaimed lazily). For defensible absolute numbers, isolate phases in fresh
|
|
276
|
+
processes.
|
|
277
|
+
- Ruby 4.0 moved `benchmark` out of the default gems (it is now a bundled gem); the harness uses
|
|
278
|
+
`Process.clock_gettime(Process::CLOCK_MONOTONIC)`.
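
For reference, a timing helper in the spirit of the last bullet needs only the monotonic clock. `measure` is a hypothetical name; the actual harness in `bench_sql_dump.rb` may structure this differently.

```ruby
# Run the block and return [result, elapsed_seconds] using the monotonic
# clock, which is unaffected by wall-clock adjustments.
def measure
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [result, elapsed]
end

rows, seconds = measure { Array.new(200_000) { |i| [i.to_s, "name-#{i}"] } }
puts format("built %d rows in %.3f s", rows.size, seconds)
```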
|