RubyGems - exwiw - Versions diffs - 0.5.2 → 0.6.0 - Mend

exwiw 0.5.2 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (26)

checksums.yaml +4 -4
data/CHANGELOG.md +13 -0
data/README.md +3 -2
data/docs/optimization-notes.md +126 -0
data/docs/optimize-mongodb-export-with-native-ext.md +249 -0
data/docs/plans/2026-05-15-insert-000-schema-file.md +4 -4
data/docs/plans/2026-05-16-mongodb-from-clean-scenario.md +8 -8
data/docs/plans/2026-05-22-postgres-copy-mode-scenario-test.md +7 -7
data/docs/plans/2026-05-31-ids-column-for-sql-adapters.md +1 -1
data/docs/plans/2026-06-19-mongodb-export-remove-parallelism-native-ext.md +70 -0
data/docs/sql-dump-optimization-notes.md +278 -0
data/ext/exwiw/ext_json/ext_json.c +274 -0
data/ext/exwiw/ext_json/extconf.rb +8 -0
data/lib/exwiw/adapter/mongodb_adapter.rb +159 -40
data/lib/exwiw/adapter/mysql_adapter.rb +70 -18
data/lib/exwiw/adapter/mysql_client.rb +43 -0
data/lib/exwiw/adapter/postgresql_adapter.rb +85 -15
data/lib/exwiw/adapter/sql_bulk_insert.rb +71 -0
data/lib/exwiw/adapter/sqlite_adapter.rb +75 -18
data/lib/exwiw/adapter.rb +38 -0
data/lib/exwiw/ext_json.rb +33 -0
data/lib/exwiw/runner.rb +18 -6
data/lib/exwiw/version.rb +1 -1
data/lib/exwiw.rb +2 -0
data/mise.toml +2 -2
metadata +11 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 567683d65df5d9f147ab9415a67baf48a80e21ad32e1ef7635c624dfc3d28c47
-  data.tar.gz: 1513b577f6f2368df60edc45a54c96495ece4f1ee9b453e92adb8991f182fcdf
+  metadata.gz: 39362410df244fffa463a86c845062f0e9bacac723e15a6697a50c631db0d5cd
+  data.tar.gz: 9670677e7c822886ed6268e476008c81a1caba4e2eacb286bf0e747fef5d3f3c
 SHA512:
-  metadata.gz: a9680642eb34f99ed3f0c2924154a5171edf541286dfed14befe15b8c029271419a13d3295694847b1143898b0200bf6eb1d6448e6665dbad7bef026e3c3fbbb
-  data.tar.gz: 30e6ef9f988965b85f899fdb0646b6e4e2befd68f95a247a9edfeddb8d3a6088f611f07d5376c7d3b9f4f58f147072e03294a140312dec1622994fb6da175720
+  metadata.gz: 4af4db84210fbd9da8b6b32e136c1f12af370bc719c5aeee2a1f7183046f1ffae5e8dcdc0bb6d856e30fe800a5febe12aef4d4ff2d2e8c27b162fcc4a056c7f4
+  data.tar.gz: 5064dfee83c653ae0c73ae07edf33299752c748c6f9efc5e413424e7b3a92b769f6a99901688a2d1d3415ad3a0939ee807095b1cf24b2aab46fcf89402113a90

data/CHANGELOG.md CHANGED

@@ -2,6 +2,19 @@
 ## [Unreleased]
+## [0.6.0] - 2026-06-20
+### Added
+- Optimize memory usage https://github.com/heyinc/exwiw/pull/118
+- **MongoDB: optional native (C) encoder for the Extended-JSON dump path** (no flag, byte-identical output, pure-Ruby fallback). Encoding each document to MongoDB Relaxed Extended JSON — previously `JSON.generate(doc.as_extended_json(mode: :relaxed))`, which rebuilds the whole document into an intermediate transformed Hash tree and then walks it again — was the dominant per-document CPU cost (~82% of serialization on embed-heavy data). A new C extension (`ext/exwiw/ext_json/`) emits the JSONL line in a single native tree-walk. It formats the structural bulk plus the leaves that dominate a dumped document — `Hash`, `Array`, `String`, fixnum `Integer`, `true`/`false`/`nil`, `BSON::ObjectId` (`_id`), and in-range `Time` (the Mongoid `created_at`/`updated_at` timestamps) — and delegates everything else (`Float`, out-of-int64 `Integer`, out-of-range `Time`, `Symbol`, `Decimal128`, …) back to the exact pure-Ruby path, so the output is provably byte-for-byte identical. On a 30-embedded-post timestamp-heavy document this serializes ~2.8× faster. With `gem install exwiw` the extension compiles automatically; hosts that cannot compile (JRuby/TruffleRuby, no toolchain) fall back to the pure-Ruby encoder, so exwiw stays installable as a pure-Ruby gem. See [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md).
+## [0.5.3] - 2026-06-19
+### Changed
+- **MongoDB: dumps now stream, bounding peak memory regardless of collection size** (default; no flag, byte-identical output). The adapter previously loaded each collection's entire result set into memory (`.to_a`) and built the whole collection's JSONL output as one string, so peak memory scaled with collection size. It now wraps the Mongo cursor in a lazy streaming result and writes output in chunks, so at most one chunk of documents (plus the small FK-propagation key arrays) is resident at a time. On a 20k-document × 30-embed collection this cut peak RSS by hundreds of MB and was also faster (less GC pressure). The per-document Extended-JSON masking was also precompiled per collection config, trimming per-document encoding cost. See [`docs/optimization-notes.md`](docs/optimization-notes.md) for the full investigation, and [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md) for the proposed native-encoder follow-up.
 ## [0.5.2] - 2026-06-18
 ### Fixed
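
The streaming behaviour described in the 0.5.3 entry (a lazy result consumed in chunks instead of `.to_a`) can be illustrated without MongoDB. This is a minimal sketch under stated assumptions: `LazyResult` is a hypothetical stand-in, not the gem's `StreamingResult`, and an endless `Enumerator` plays the role of the Mongo cursor.

```ruby
# Hypothetical stand-in for a lazy streaming result: it wraps any object that
# responds to #each (a Mongo cursor in the adapter) without materializing it.
class LazyResult
  include Enumerable

  def initialize(source)
    @source = source
  end

  def each(&block)
    return enum_for(:each) unless block

    @source.each(&block)
  end
end

pulled = 0
# An endless generator standing in for a cursor; calling .to_a on it would
# never return, so any eager load is immediately visible.
cursor = Enumerator.new do |yielder|
  loop { yielder << { "_id" => (pulled += 1) } }
end

# Taking the first chunk pulls exactly one chunk of documents from the source.
first_chunk = LazyResult.new(cursor).each_slice(3).first
```

Because `each_slice` only asks the source for as many documents as the current chunk needs, the number of resident documents depends on the chunk size and not on the collection size.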

data/README.md CHANGED

@@ -647,6 +647,7 @@ The MongoDB adapter is experimental. To use it:
 - `--ids` values are coerced to the type actually stored in `_id` before filtering: integer-looking ids become `Integer`, 24-char hex ids become `BSON::ObjectId` (Mongoid's default `_id` type — a plain String would never match an ObjectId), and any other string is left as-is.
 - `--target-collection=COLLECTION` is a mongodb-only alias of `--target-table` (use whichever reads better for MongoDB). Specifying both, or using `--target-collection` with a non-mongodb adapter, is an error.
 - `--ids-field=FIELD` matches `--ids` against `FIELD` on the target collection instead of its primary key (e.g. `--target-collection=users --ids=a@example.com --ids-field=email`). Downstream foreign-key propagation still keys off the primary key, so only the target collection's filter changes. Unlike the primary-key path, the supplied ids are **not** type-coerced (the stored type of a custom field is unknown), so pass values matching the field's actual type. This flag is **mongodb-only**; the SQL adapters use `--ids-column` instead (see below).
+- Large or embedded-document-heavy dumps are streamed automatically: the adapter reads the collection through a lazy cursor (not `.to_a`) and writes JSONL in chunks, so peak memory is bounded by the chunk size rather than the collection size — no flag to set. Encoding each document to MongoDB Extended JSON is accelerated by an **optional native (C) extension** that compiles automatically on `gem install`; where it cannot compile, exwiw falls back to a byte-identical pure-Ruby encoder. See [`docs/optimization-notes.md`](docs/optimization-notes.md) for the performance investigation and [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md) for the native encoder's design. Benchmark your own data with `script/bench_mongodb_dump.rb`.
 - Output is JSON Lines (`insert-{idx}-{collection}.jsonl`) using MongoDB Extended JSON (relaxed mode). Import with `mongoimport`:
   ```bash
   mongoimport --db app_dev --collection users --file dump/insert-002-users.jsonl
@@ -664,7 +665,7 @@ The MongoDB adapter is experimental. To use it:
 MongoDB models often store one-to-many relationships as embedded subdocument arrays (e.g. `users` documents with a `posts: [...]` field). To mask fields inside embedded subdocuments, declare a separate config with `embedded_in`:
 ```jsonc
-// scenario/users.json — top-level collection
+// e2e/users.json — top-level collection
 {
   "name": "users",
   "primary_key": "_id",
@@ -676,7 +677,7 @@ MongoDB models often store one-to-many relationships as embedded subdocument arr
   ]
 }
-// scenario/posts.json — embedded under users.posts
+// e2e/posts.json — embedded under users.posts
 {
   "name": "posts",
   "primary_key": "_id",

data/docs/optimization-notes.md ADDED

@@ -0,0 +1,126 @@
+# MongoDB dump performance: investigation notes
+This records what was learned while making the MongoDB adapter's dump faster and
+lighter, **what shipped**, and **what was explored and deliberately removed**.
+It exists so the removed work isn't re-discovered from scratch and so the
+trade-offs behind the current design are legible.
+The reproducible harness is `script/bench_mongodb_dump.rb` (seeds a synthetic
+large/embed-heavy dataset and measures the dump phases). The correctness anchor
+throughout is `spec/insert_output_snapshot_spec.rb` — a **byte-exact** snapshot
+of the dump output; every change below was required to keep it green.
+## The two hotspots
+On an embed-heavy benchmark (20k users × 30 embedded posts → ~154 MB JSONL):
+1. **Memory.** Two compounding costs. `MongodbAdapter#execute` did `.to_a`,
+   loading the entire result set onto the heap (~600–900 MB / ~9.5M Ruby
+   objects for 20k docs). Separately, with no chunking the Runner built the whole
+   collection's JSONL output as **one giant string** before writing, held
+   simultaneously with the result set.
+2. **CPU.** `doc.as_extended_json(mode: :relaxed)` is ~82% of per-document
+   serialization (~104µs of ~124µs for a 30-post doc). It recursively rebuilds
+   the document into a new intermediate Hash tree, so cost scales with embedding
+   depth/count; `JSON.generate` over that tree is comparatively cheap (~10µs).
+## What shipped (default, no flags, byte-identical)
+- **Chunked output streaming.** The Runner writes each bulk-insert chunk straight
+  to the file instead of joining the whole table's output into one string.
+  `MongodbAdapter` sets a positive `default_bulk_insert_chunk_size` (1000) so
+  MongoDB output is chunked by default while SQL adapters keep one statement per
+  table. Cut peak RSS ~112 MB and was ~30% faster, byte-identical.
+- **Streaming result set.** `#execute` returns a lazy `StreamingResult` wrapping
+  the Mongo cursor instead of `.to_a`. The Runner pulls documents through
+  `each_slice`, so only one chunk is resident at a time. `#size` is answered with
+  a cheap `count_documents` (index-only) rather than draining the cursor, and the
+  FK-propagation `@state` is captured *as the cursor streams* and published once
+  the pass completes (the Runner always fully consumes a non-empty result, so
+  propagation is unaffected). Cut peak RSS growth ~360 MB and wall time ~40%.
+- **Precompiled masking (`MaskPlan`).** Masking runs over every document **and**
+  every embedded subdocument, so per-config decisions (which fields carry a
+  `replace_with`, how each template splits, where embedded children live) were
+  recomputed many times per document. Compiling a `MaskPlan` once per collection
+  config dropped per-document masking ~17–22% and ~35 allocations/doc, scaling
+  down with embedding count. Byte-identical.
+Net default result: memory is bounded by chunk size rather than collection size,
+with a meaningful wall-time improvement and no API/flag surface.
+## What was explored and removed
+After the memory work, the remaining cost was almost entirely the pure-Ruby
+`as_extended_json`. Threads give **zero** speedup (it holds the GVL), and a
+hand-rolled pure-Ruby fused encoder is **slower** than `as_extended_json +
+JSON.generate` (per-leaf `.to_json` C-call overhead; `JSON.generate` does the
+whole tree in one C pass). `bson` 5.2.0 has no native Extended-JSON serializer to
+borrow (`to_extended_json` is literally `as_extended_json(**opts).to_json`, all
+pure Ruby). That left two levers, both of which were built, measured, and then
+**removed for being disproportionately complex**:
+### Fork-parallel serialization (`--parallel-workers=N`)
+Forked `N` worker processes to serialize contiguous document slices in parallel,
+parent concatenating parts in order (byte-identical). The *serialization step*
+parallelized ~2–2.5× at 4–8 workers, **but the end-to-end dump speedup was only
+~1.1–1.4×** on embed-heavy data. Reason (Amdahl): ~40% of dump wall time is the
+**serial Mongo cursor BSON→Ruby decode** in the parent, which serialization
+parallelism cannot touch — capping the win — and fork/concat overhead eroded most
+of the rest.
+### Cursor-parallel fetch (`--cursor-parallel`)
+Went further: split each collection into `N` disjoint `_id` ranges, each fetched
+by a forked worker with its own connection+cursor, so the **decode** was
+parallelized too. Measured byte-identical at ~2.5–5.5× depending on dataset size
+— a real, larger win. But it required a lot of machinery:
+- `_id`-range partitioning (an index-only scan + range split) — `MongoIdPartitioner`.
+- A fork orchestrator writing ordered part files + Marshal'd state sidecars — `ForkedPartWriter`.
+- Distributed FK-propagation: each worker captures its range's `@state` slice and
+  the parent merges them in range order — `PropagationCapture`.
+- A per-worker fresh-connection builder (the Mongo driver is not fork-safe).
+- New adapter seams (`write_bulk_insert`, `write_inserts`) and CLI→Runner→Adapter
+  threading for two flags + their validations.
+- A user-visible caveat: per-range cursors must `sort(_id)`, so the output is
+  **ordered by `_id` rather than natural order** — a different byte stream
+  (semantically equivalent re-import), so it could not be the snapshot-tested
+  default.
+### Why both were removed
+The cursor-parallel win was real but bought with multi-process orchestration,
+IPC, distributed state, a fresh-connection-per-worker requirement, fork
+fallbacks for Windows/JRuby, and a non-default output ordering — a large,
+permanently-maintained surface for a single adapter's export path. The
+maintainer's call was that this is **over-engineered** for the benefit, and that
+the CPU hotspot is better addressed by the lever every earlier iteration kept
+pointing at: a native (C) Extended-JSON encoder. The memory wins above are
+unrelated to the parallelism and were kept.
+## Where the speedup goes next
+A C extension can collapse `as_extended_json + JSON.generate` into one native
+tree-walk (no intermediate tree, no second pass), as a flag-free, fork-free,
+single-process win — bounded by the same serial-decode ceiling (~2.5×) the
+`--parallel-workers` path hit, since it also doesn't touch the driver's decode.
+The full design (byte-identity strategy, fast-path vs Ruby-delegate types,
+optional-load + pure-Ruby fallback, packaging) is in
+[`optimize-mongodb-export-with-native-ext.md`](./optimize-mongodb-export-with-native-ext.md).
+## Methodology notes (for re-running)
+- The CPU hotspot reproduces **with no database**: the Mongo driver hands back
+  plain Ruby `Hash` + `BSON::ObjectId`/`Time`, so synthesizing that shape in
+  memory yields the exact `as_extended_json + JSON.generate` cost and runs under
+  the normal sandbox. DB-touching measurement needs live mongo on `localhost:27017`
+  (the dev sandbox blocks it — disable the sandbox for those runs).
+- In-process sequential bench passes accumulate RSS (Ruby reclaims to the OS
+  lazily), which inflates a later serial baseline and overstates a parallel
+  speedup; isolate sections in fresh processes for defensible numbers.
+- Chunk size never changes output **bytes** — the Runner inserts the same `"\n"`
+  between chunks that `to_bulk_insert` inserts between documents — so it is purely
+  a memory/throughput knob, safe against the snapshot guard.
+- Ruby 4.0 removed the `benchmark` stdlib from default gems; use
+  `Process.clock_gettime(Process::CLOCK_MONOTONIC)`.
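
The chunk-size invariance stated in the methodology notes (chunking never changes output bytes) can be checked without a database. A minimal sketch, assuming documents are encoded one JSON line each; `bulk_insert` and `write_in_chunks` are hypothetical stand-ins for the adapter's `to_bulk_insert` and the Runner's write loop, not exwiw's actual code.

```ruby
require "json"
require "stringio"

# Hypothetical stand-in for the adapter's to_bulk_insert: one JSON line per
# document, joined with "\n" and no trailing newline.
def bulk_insert(docs)
  docs.map { |doc| JSON.generate(doc) }.join("\n")
end

# Hypothetical stand-in for the Runner's write loop: emit one chunk at a time,
# separating chunks with the same "\n" that separates documents.
def write_in_chunks(docs, chunk_size, io)
  docs.each_slice(chunk_size).with_index do |chunk, index|
    io << "\n" unless index.zero?
    io << bulk_insert(chunk)
  end
  io
end

docs = Array.new(10) { |i| { "_id" => i, "posts" => [{ "n" => i }] } }
whole = bulk_insert(docs)

# For each chunk size, record whether the chunked bytes equal the unchunked bytes.
chunk_results = [1, 3, 4, 10, 1000].to_h do |size|
  [size, write_in_chunks(docs, size, StringIO.new).string == whole]
end
```

Since the separator between chunks is the same byte as the separator between documents, the chunk size only affects how much is held in memory at once.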

data/docs/optimize-mongodb-export-with-native-ext.md ADDED

@@ -0,0 +1,249 @@
+# Design: optional native (C) extension for the MongoDB Extended-JSON encoder
+Status: **implemented.** This document captured the design; it now describes the
+shipped encoder. Source: `ext/exwiw/ext_json/ext_json.c` (native emitter) and
+`lib/exwiw/ext_json.rb` (the optional-load shim + pure-Ruby fallback); the
+byte-identity guard is `spec/ext_json_spec.rb`. It is the successor to the
+fork/cursor parallelism that was removed (see
+[`optimization-notes.md`](./optimization-notes.md)).
+## Motivation
+When the MongoDB adapter dumps embed-heavy documents, the dominant CPU cost is
+turning each decoded Mongo document (a Ruby `Hash` containing `BSON::ObjectId` /
+`Time` / nested `Hash`+`Array`) into one JSONL line of MongoDB **Relaxed
+Extended JSON**. Today that is:
+```ruby
+# lib/exwiw/adapter/mongodb_adapter.rb
+JSON.generate(doc.as_extended_json(mode: :relaxed))
+```
+`as_extended_json` (in the pure-Ruby `bson` gem) **recursively rebuilds the
+whole document into a new intermediate Hash tree** (`ObjectId -> {"$oid"=>…}`,
+`Time -> {"$date"=>…}`, every subdoc/array re-allocated), and then
+`JSON.generate` walks that tree a *second* time. For a 30-embedded-post doc this
+was measured at ~130µs/doc and is ~82% of per-document serialization cost.
+Earlier experiments established the levers:
+- Threads give **zero** speedup — `as_extended_json` is pure Ruby and holds the GVL.
+- A pure-Ruby fused single-pass encoder is **slower** (its per-leaf `.to_json` C-call
+  overhead outweighs the saved pass; `JSON.generate` does the whole tree in one C pass).
+- Multi-process parallelism worked but was judged over-engineered and removed.
+The remaining lever is a **C extension**: one native walk that emits the
+Extended-JSON text directly — no intermediate transformed-Hash tree, no second
+JSON pass.
+## Goals / non-goals
+- **Goal:** a native encoder that is **byte-for-byte identical** to the current
+  pure-Ruby path, behind an **optional** load with a pure-Ruby fallback so
+  exwiw stays installable as a pure-Ruby gem (JRuby/TruffleRuby, or any host
+  where compilation fails, keep working).
+- **Non-goal:** speeding up the Mongo cursor's BSON→Ruby *decode*. That lives in
+  the `mongo`/`bson` driver and is ~40% of total dump wall time. A serialization
+  C extension is therefore bounded to roughly the same end-to-end ceiling
+  (~2.5×) the removed `--parallel-workers` path had — it does **not** reach the
+  removed cursor-parallel path's 3.4–5.5×. This is an accepted trade for a far
+  simpler, flag-free, fork-free implementation.
+## Exact serialization semantics to reproduce (verified against bson 5.2.0)
+The byte-exact anchor is `spec/insert_output_snapshot_spec.rb` (committed
+`spec/insert_output_snapshots/mongodb/*.jsonl` fixtures). The C encoder must
+reproduce the following exactly. All rows below were verified empirically and,
+for `Time`, against `bson-5.2.0/lib/bson/time.rb:72-89`.
+| Ruby value | Relaxed Extended JSON output | Notes |
+|---|---|---|
+| `BSON::ObjectId` | `{"$oid":"<24 lowercase hex>"}` | hex via `ObjectId#to_s` |
+| `Time` (year 1970..9999, sub-second) | `{"$date":"2021-01-02T03:04:05.678Z"}` | floor to ms, `strftime('%Y-%m-%dT%H:%M:%S.%LZ')` |
+| `Time` (year 1970..9999, whole second) | `{"$date":"2021-01-02T03:04:05Z"}` | **no** fraction when `usec == 0` |
+| `Time` (year <1970 or >9999) | `{"$date":{"$numberLong":"<ms>"}}` | `ms = sec*1000 + usec.divmod(1000).first` |
+| `Integer` (fits int64) | bare `42` / `9000000000` | |
+| `Integer` (outside int64) | **raises `RangeError`** | `"Integer … too big to be represented as a MongoDB integer"` |
+| `Float` | `JSON.generate(float)` form | `1e20 → 1e+20` (**not** `Float#to_s`'s `1.0e+20`); `100.0 → 100.0`; `-0.0 → -0.0` |
+| `String` | JSON string | escape only `\b \t \n \f \r \" \\`; other `<0x20` as lowercase `\u00xx`; `/`, DEL, U+2028/U+2029, non-ASCII left raw |
+| `true` / `false` / `nil` | `true` / `false` / `null` | |
+| `Hash` | `{…}` | **insertion order** preserved; keys are Strings (JSON-escaped) |
+| `Array` | `[…]` | |
+`bson/time.rb` boundary (the highest-risk piece), verified verbatim:
+```ruby
+def as_extended_json(**options)
+  if options[:mode] == :relaxed && (1970..9999).include?(utc_time.year)
+    if utc_time.usec != 0
+      utc_time = utc_time.floor(3)            # floor to millisecond
+      {'$date' => utc_time.strftime('%Y-%m-%dT%H:%M:%S.%LZ')}
+    else
+      {'$date' => utc_time.strftime('%Y-%m-%dT%H:%M:%SZ')}
+    end
+  else
+    msec = utc_time.usec.divmod(1000).first
+    {'$date' => {'$numberLong' => (sec * 1000 + msec).to_s}}
+  end
+end
+```
+## Fast-path vs delegate (the byte-identity strategy)
+The encoder splits values into a **native fast path** and a **Ruby delegate**:
+- **Native (in C):** `Hash`, `Array`, `String`, `Integer` within int64,
+  `true`/`false`/`nil`, `BSON::ObjectId`, and **in-range `Time`** (years
+  1970..9999). These are the structural bulk plus the two most common leaves in
+  a dumped document — `_id` and the Mongoid `created_at`/`updated_at` timestamps.
+  The in-range Time path resolves the absolute instant with `rb_time_timespec`
+  (epoch seconds + nanoseconds, no `rb_funcall`), formats with `gmtime_r` +
+  `snprintf`, and reproduces bson's rule exactly: a `.mmm` fraction iff
+  `nsec >= 1000` (i.e. `usec != 0`), with the millisecond floored to
+  `nsec / 1e6`. The in-range window is the half-open epoch-second range
+  `[0, 253402300800)`.
+- **Delegate to Ruby** — call back into
+  `JSON.generate(value.as_extended_json(mode: :relaxed))` for the individual
+  value and splice the returned fragment into the buffer:
+  - `Float` — `Float#to_s` diverges from `JSON.generate` for scientific notation
+    (`1e20`), so never reformat floats in C.
+  - **out-of-range `Time`** (year < 1970 or > 9999) — its `$numberLong` form
+    involves negative-epoch arithmetic, is vanishingly rare in dumped data, and
+    is left to Ruby. The in-range ISO branch is handled natively (above).
+  - out-of-int64 `Integer` — must surface the identical `RangeError`.
+  - any unrecognized class — `Decimal128`, `BSON::Binary`, `Symbol`, `Regexp`,
+    `Date`, `BSON::Timestamp`, etc.
+**Why delegating is provably byte-identical:** `Hash#as_extended_json` and
+`Array#as_extended_json` are *non-transforming structural recursion* — they map
+over children and call `as_extended_json` on each. So the bytes produced for any
+sub-value `v` by `JSON.generate(v.as_extended_json(mode: :relaxed))` are exactly
+the bytes that the whole-document `JSON.generate(doc.as_extended_json(...))`
+would have produced for that position. The native walk can therefore hand any
+value it does not want to format to Ruby and splice the result, with no
+divergence.
+`Time` was promoted into the native path because the benchmark showed it was
+decisive: with `Time` delegated, a 30-embedded-post timestamp-heavy document
+(32 `Time` fields) sped up only ~1.03× — the per-`Time` `rb_funcall` +
+`as_extended_json` Hash allocation + second `JSON.generate` pass erased the win.
+Formatting in-range `Time` natively brings the same document to ~2.8× (the
+serialization-step ceiling). `Float` remains delegated: matching
+`JSON.generate`'s shortest-round-trip float formatting in C (not `Float#to_s`)
+is not worth the risk for the few floats a typical document carries.
+## C source & buffer design
+- `Exwiw::ExtJson.encode_native(doc) -> String` — returns one JSONL line, **no**
+  trailing `\n` (the caller/Runner owns separators).
+- Recursive emitter writing into a single growing buffer (`rb_str_buf_new` +
+  `rb_str_cat`/`rb_str_buf_cat`, or a `malloc` buffer finalized to an
+  `rb_utf8_str_new`). Result string is UTF-8.
+- Type dispatch via `TYPE()` for the immediates/`T_HASH`/`T_ARRAY`/`T_STRING`/
+  `T_FLOAT`/`T_FIXNUM`/`T_BIGNUM`, and a cached `BSON::ObjectId` class reference
+  (`rb_const_get`) compared with `rb_obj_is_kind_of` for ObjectId.
+- `Hash`: `rb_hash_foreach` preserves insertion order; emit `key:value` pairs;
+  keys are Strings run through the same string escaper.
+- Delegate path: a cached `ID` for a Ruby helper (e.g.
+  `Exwiw::ExtJson.encode_fragment(v)`), `rb_funcall`'d, returning the JSON
+  fragment String to splice. The `RangeError` for oversized integers propagates
+  naturally through the delegate.
+- String escaper implemented in C to match the table above (no per-leaf Ruby
+  call for the common String case).
+## Packaging, optional load, fallback
+- **gemspec:** `spec.extensions = ["ext/exwiw/ext_json/extconf.rb"]`. With
+  `extensions` set, `gem install exwiw` compiles automatically; hosts that can't
+  compile fall back at runtime (below).
+- **`ext/exwiw/ext_json/extconf.rb`:** `require "mkmf"` (stdlib) +
+  `create_makefile("exwiw/ext_json_native")`. The compiled lib is named
+  `ext_json_native` (distinct from the `ext_json.rb` shim) to avoid a
+  `require` self-collision.
+- **`ext/exwiw/ext_json/ext_json.c`:** the emitter; defines
+  `Exwiw::ExtJson.encode_native`.
+- **`lib/exwiw/ext_json.rb`** (the shim, always loaded):
+  ```ruby
+  require "json"
+  module Exwiw
+    module ExtJson
+      module_function
+      # Pure-Ruby fragment encoder used by both the fallback and the native
+      # delegate path. Byte-identical to today's behavior.
+      def encode_fragment(value)
+        JSON.generate(value.respond_to?(:as_extended_json) ? value.as_extended_json(mode: :relaxed) : value)
+      end
+      begin
+        require "exwiw/ext_json_native"   # defines Exwiw::ExtJson.encode_native
+        def encode(doc) = encode_native(doc)
+      rescue LoadError
+        def encode(doc) = encode_fragment(doc)   # exact current behavior
+      end
+    end
+  end
+  ```
+- **`Rakefile`:** `require "rake/extensiontask"`;
+  `Rake::ExtensionTask.new("ext_json_native") { |e| e.ext_dir = "ext/exwiw/ext_json" }`;
+  make the `spec` task depend on `compile`.
+- **Dev dependency:** add `rake-compiler` (Gemfile / gemspec dev deps). `mkmf`
+  is stdlib, no runtime dep added.
+- **`.gitignore`:** ignore built artifacts (`lib/exwiw/*.bundle`,
+  `lib/exwiw/*.so`, `ext/**/*.o`, `ext/**/Makefile`). Commit only the `ext/`
+  sources; the gemspec ships files via `git ls-files`.
+- **`lib/exwiw.rb`:** add `require_relative "exwiw/ext_json"`.
+## Integration point
+In `lib/exwiw/adapter/mongodb_adapter.rb`, the per-document serialize step
+becomes (masking still runs in Ruby first; only the encode changes):
+```ruby
+def to_bulk_insert(rows, config)
+  plan = mask_plan(config)
+  rows.map do |doc|
+    apply_mask_plan!(doc, plan)
+    Exwiw::ExtJson.encode(doc)   # was: JSON.generate(extended_json(doc))
+  end.join("\n")
+end
+```
+The private `#extended_json` helper is removed — its logic (including the
+`respond_to?(:as_extended_json)` guard) moves into `ExtJson.encode_fragment`.
+## Test & benchmark strategy
+- **`spec/ext_json_spec.rb`** (DB-free; the primary byte-identity guard, runs in
+  normal CI): assert `encode_native(doc) == encode_fragment(doc)` over a fuzz of
+  representative shapes — ObjectId; nested hashes/arrays; `Time` across the year
+  boundary, whole-second (no fraction), and sub-second; strings with control
+  chars / quotes / backslashes / non-ASCII / U+2028; ints, bignums (assert the
+  same `RangeError`), floats (`1e20`, `-0.0`, `100.0`); `nil`; empty
+  hash/array/string. Skip the native half with a clear message when the lib
+  isn't compiled, so the suite still passes on a fallback-only host.
+- **`spec/insert_output_snapshot_spec.rb`** (live mongo on 27017): the byte-exact
+  fixtures must stay green with the native encoder built.
+- **Microbench** (extend `script/bench_mongodb_dump.rb`): native-encode vs
+  Ruby-fallback throughput on DB-free synthesized embed-heavy docs, plus the live
+  path on 20k×30, to quantify the real speedup.
+## Risk register
+1. **Time formatting** — variable fraction + ms flooring + `$numberLong`
+   boundary. In-range years (1970..9999) are formatted natively via
+   `rb_time_timespec` + `gmtime_r`; the rare out-of-range `$numberLong` form is
+   delegated. Mitigated by a dense byte-identity fuzz over the whole in-range
+   epoch span with mixed nanosecond precision, plus the boundary/sub-ms edges in
+   `spec/ext_json_spec.rb`.
+2. **Float formatting** — `Float#to_s` ≠ `JSON.generate`. Mitigated by delegating.
+3. **String escaping** — must match JSON exactly. Implemented in C, fuzz-tested
+   vs the Ruby fallback.
+4. **Hash key order** — preserved via `rb_hash_foreach`.
+5. **Oversized integers** — delegate so the same `RangeError` surfaces.
+6. **Encoding** — emit UTF-8; pass non-ASCII bytes through unescaped (matches JSON).
+7. **Build/portability** — optional load + pure-Ruby fallback keeps non-CRuby and
+   no-compiler installs working.
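
The relaxed-mode `Time` rule from the semantics table above (risk 1) can be restated in a few lines of pure Ruby, which is handy as a reference when reading the C formatter. This is a sketch only: `relaxed_date` is a hypothetical helper name, not part of `bson` or exwiw, and it assumes Ruby 2.7+ for `Time#floor`.

```ruby
# Pure-Ruby restatement of the relaxed Extended JSON rule for Time values.
# relaxed_date is a hypothetical name used only for this illustration.
def relaxed_date(time)
  utc = time.getutc
  if (1970..9999).cover?(utc.year)
    if utc.usec != 0
      # Sub-second value: floor to milliseconds, print a three-digit fraction.
      { "$date" => utc.floor(3).strftime("%Y-%m-%dT%H:%M:%S.%LZ") }
    else
      # Whole second (usec == 0): no fraction at all.
      { "$date" => utc.strftime("%Y-%m-%dT%H:%M:%SZ") }
    end
  else
    # Out-of-range year: epoch milliseconds as a string under $numberLong.
    ms = utc.to_i * 1000 + utc.usec.divmod(1000).first
    { "$date" => { "$numberLong" => ms.to_s } }
  end
end

sub_second   = relaxed_date(Time.utc(2021, 1, 2, 3, 4, 5, 678_900))
whole_second = relaxed_date(Time.utc(2021, 1, 2, 3, 4, 5))
sub_milli    = relaxed_date(Time.utc(2021, 1, 2, 3, 4, 5, 500))
before_epoch = relaxed_date(Time.utc(1969, 12, 31, 23, 59, 59))
```

The `sub_milli` case is the edge the design calls out: a non-zero microsecond count below one millisecond still produces a fraction, and that fraction is `.000`.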

data/docs/plans/2026-05-15-insert-000-schema-file.md CHANGED

@@ -117,7 +117,7 @@ Error: `pg_dump` not found in PATH. exwiw needs pg_dump to generate insert-000-s
 | `lib/exwiw/ddl_postprocessor.rb` (new) | `IF NOT EXISTS` rewriting / DO-block wrapping |
 | `lib/exwiw.rb` | `require` the new files |
 | `README.md` | Add `insert-000-schema.{sql,js}` to the documented `dump/` output; update the import steps |
-| `spec/adapter/sqlite3_adapter_spec.rb` | `dump_schema` integration test (run against `scenario/initdb/init.sqlite3` and assert that the output contains `CREATE TABLE IF NOT EXISTS`) |
+| `spec/adapter/sqlite3_adapter_spec.rb` | `dump_schema` integration test (run against `e2e/initdb/init.sqlite3` and assert that the output contains `CREATE TABLE IF NOT EXISTS`) |
 | `spec/adapter/mongodb_adapter_spec.rb` | `dump_schema` test (a db stub returns the `listIndexes` result; assert on the output JS) |
 | `spec/runner_spec.rb` | Assert that `insert-000-schema.sql` is written to `output_dir` (confirming it actually flows through via Sqlite3) |
@@ -134,9 +134,9 @@ Error: `pg_dump` not found in PATH. exwiw needs pg_dump to generate insert-000-s
 2. `bundle exec rspec spec/adapter/mongodb_adapter_spec.rb`: stub the mongo client and confirm that the JS output contains `db.createCollection("users")` and the `createIndex(...)` calls for that collection.
 ### E2E (via the scenario scripts)
-3. Run `scenario/test_with_sqlite3.sh` and confirm that `dump/insert-000-schema.sql` is generated and that `sqlite3 empty.db < dump/insert-000-schema.sql && for f in dump/insert-*.sql; do sqlite3 empty.db < $f; done` succeeds against an empty DB.
-4. Likewise for `scenario/test_with_mysql2.sh` and `scenario/test_with_postgresql.sh`: confirm that `mysql empty_db < dump/insert-000-schema.sql` / `psql empty_db -f dump/insert-000-schema.sql` succeeds and that the insert files can then be loaded. **If `mysqldump` / `pg_dump` must run inside a docker compose container (the DB container started by `compose.yml`), update the scenario scripts.**
-5. Run `scenario/test_with_mongodb.sh` and confirm that `dump/insert-000-schema.js` is written, that `mongosh "mongodb://localhost/empty_db" < dump/insert-000-schema.js` succeeds against an empty DB, and that each jsonl file can then be loaded with `mongoimport`.
+3. Run `e2e/test_with_sqlite3.sh` and confirm that `dump/insert-000-schema.sql` is generated and that `sqlite3 empty.db < dump/insert-000-schema.sql && for f in dump/insert-*.sql; do sqlite3 empty.db < $f; done` succeeds against an empty DB.
+4. Likewise for `e2e/test_with_mysql2.sh` and `e2e/test_with_postgresql.sh`: confirm that `mysql empty_db < dump/insert-000-schema.sql` / `psql empty_db -f dump/insert-000-schema.sql` succeeds and that the insert files can then be loaded. **If `mysqldump` / `pg_dump` must run inside a docker compose container (the DB container started by `compose.yml`), update the scenario scripts.**
+5. Run `e2e/test_with_mongodb.sh` and confirm that `dump/insert-000-schema.js` is written, that `mongosh "mongodb://localhost/empty_db" < dump/insert-000-schema.js` succeeds against an empty DB, and that each jsonl file can then be loaded with `mongoimport`.
 6. **Idempotency check**: applying the same schema file twice must not raise an error (`IF NOT EXISTS` / `DO $$ EXCEPTION WHEN duplicate_object` / `try/catch on createCollection` are taking effect).
 ### Manual verification checkpoints

data/docs/plans/2026-05-16-mongodb-from-clean-scenario.md CHANGED

@@ -6,9 +6,9 @@
 `createCollection` / `createIndex` を書き出す実装を既に持っているが、scenario 側で
 これを apply するパスが無く、CI でも検証できていなかった。具体的なギャップ:
-1. `scenario/setup_with_mongodb.rb` は seed を `insert_many` で流すだけで、index を一切作っていない
+1. `e2e/setup_with_mongodb.rb` は seed を `insert_many` で流すだけで、index を一切作っていない
 2. その結果 `tmp/mongodb/insert-000-schema.js` は `createCollection` 行のみで `createIndex` が 0 行
-3. `scenario/import_with_mongodb.rb` は `insert-*.jsonl` だけを glob して処理しており、`insert-000-schema.js` を一切実行しない
+3. `e2e/import_with_mongodb.rb` は `insert-*.jsonl` だけを glob して処理しており、`insert-000-schema.js` を一切実行しない
 sqlite3 / mysql2 / postgresql で導入済みの「from clean DB から立ち上げる」流れと
 MongoDB の `insert-000-schema.js` が連動していない状態だった (issue #16)。
@@ -25,10 +25,10 @@ MongoDB's `insert-000-schema.js` were not connected (issu
 ### Scenario layer
 | Path | Change |
 |---|---|
-| `scenario/setup_with_mongodb.rb` | After loading the seed, create three representative indexes (unique `shops.name` / plain `users.email` / compound `orders.shop_id+user_id`) |
-| `scenario/import_with_mongodb.rb` | Add the `--no-drop` and `--input-dir DIR` flags. In the from-clean flow, dropping would also delete the indexes created by schema.js |
-| `scenario/verify_with_mongodb.rb` | With `--with-indexes`, assert the indexes of the target collection (skipped in the default scenario because collections are dropped at import time) |
-| `scenario/test_with_mongodb_from_clean.sh` (new) | `mongosh dropDatabase` → run exwiw → `mongosh insert-000-schema.js` → `import --no-drop --input-dir tmp/mongodb-clean` → `verify --with-indexes` |
+| `e2e/setup_with_mongodb.rb` | After loading the seed, create three representative indexes (unique `shops.name` / plain `users.email` / compound `orders.shop_id+user_id`) |
+| `e2e/import_with_mongodb.rb` | Add the `--no-drop` and `--input-dir DIR` flags. In the from-clean flow, dropping would also delete the indexes created by schema.js |
+| `e2e/verify_with_mongodb.rb` | With `--with-indexes`, assert the indexes of the target collection (skipped in the default scenario because collections are dropped at import time) |
+| `e2e/test_with_mongodb_from_clean.sh` (new) | `mongosh dropDatabase` → run exwiw → `mongosh insert-000-schema.js` → `import --no-drop --input-dir tmp/mongodb-clean` → `verify --with-indexes` |
 | `.github/workflows/scenario.yml` | Add a `mongodb-mongosh` install step and a `test_with_mongodb_from_clean.sh` run step to the with_mongodb job. The apt repo codename is pinned to `jammy` (on the assumption that ubuntu-latest moves up to noble) |
 ### Snapshot test layer
@@ -56,8 +56,8 @@ MongoDB's `insert-000-schema.js` were not connected (issu
 ## Verification
-- `bash scenario/test_with_mongodb.sh`: confirmed the existing scenario still passes ✓
-- `bash scenario/test_with_mongodb_from_clean.sh`: confirmed the new scenario passes
+- `bash e2e/test_with_mongodb.sh`: confirmed the existing scenario still passes ✓
+- `bash e2e/test_with_mongodb_from_clean.sh`: confirmed the new scenario passes
   (indexes round-trip OK) ✓
 - `bundle exec rspec`: all 153 examples / 0 failures ✓
 - Visually inspected `tmp/mongodb-clean/insert-000-schema.js`:

data/docs/plans/2026-05-22-postgres-copy-mode-scenario-test.md CHANGED Viewed

@@ -9,22 +9,22 @@
 However, **there is no end-to-end test that verifies whether the generated COPY-mode SQL can actually be loaded with `psql -f`**. The user suspects that COPY mode is emitting invalid SQL and wants to verify this against a real DB.
-The existing INSERT mode is verified by `scenario/test_with_postgresql.sh`, including re-import via `psql -f`. There is no COPY-mode counterpart.
+The existing INSERT mode is verified by `e2e/test_with_postgresql.sh`, including re-import via `psql -f`. There is no COPY-mode counterpart.
 Goal: add an E2E scenario that actually feeds COPY-mode output to psql, plus a snapshot regression test, to surface any latent invalid SQL.
 ## Changed files
-1. **New** `scenario/test_with_postgresql_copy.sh`: E2E shell script
+1. **New** `e2e/test_with_postgresql_copy.sh`: E2E shell script
 2. **Modified** `spec/insert_output_snapshot_spec.rb`: SCENARIOS entry for COPY and `snapshot_subdir` support
 3. **Modified** `.github/workflows/scenario.yml`: new step in the `with_postgres` job
 4. **New** `spec/insert_output_snapshots/postgresql-copy/insert-*.sql`: auto-generated with `UPDATE_SNAPSHOTS=1`
 ## Details
-### 1. `scenario/test_with_postgresql_copy.sh`
+### 1. `e2e/test_with_postgresql_copy.sh`
-Use `scenario/test_with_postgresql.sh` as the template and replace only the following:
+Use `e2e/test_with_postgresql.sh` as the template and replace only the following:
 - `FROM_DATABASE_NAME="exwiw_scenario_prod_db_copy"`
 - `TO_DATABASE_NAME="exwiw_scenario_dev_db_copy"` (names that do not collide with the existing scenario even when run in parallel)
@@ -47,7 +47,7 @@
 ```ruby
 {
   adapter: "postgresql",
-  config_dir: "scenario/postgresql-schema",
+  config_dir: "e2e/postgresql-schema",
   output_format: "copy",
   snapshot_subdir: "postgresql-copy",
   connection: { adapter: "postgresql", database_name: "exwiw_test",
@@ -64,7 +64,7 @@
 ```yaml
       - name: Run exwiw (copy mode)
-        run: scenario/test_with_postgresql_copy.sh
+        run: e2e/test_with_postgresql_copy.sh
 ```
 The `postgres:17-alpine` service and the `postgresql-client-17` install are already handled by existing steps, so nothing needs to be added.
@@ -80,7 +80,7 @@ UPDATE_SNAPSHOTS=1 bundle exec rspec spec/insert_output_snapshot_spec.rb
 ## Verification steps
 1. Start the database locally with `docker compose up -d postgres`
-2. Run `bash scenario/test_with_postgresql_copy.sh`. Exit 0 means the COPY-mode SQL is valid via psql; non-zero means invalid SQL has surfaced (identify the cause at that point and fix it separately)
+2. Run `bash e2e/test_with_postgresql_copy.sh`. Exit 0 means the COPY-mode SQL is valid via psql; non-zero means invalid SQL has surfaced (identify the cause at that point and fix it separately)
 3. Generate the snapshots with `UPDATE_SNAPSHOTS=1 bundle exec rspec spec/insert_output_snapshot_spec.rb`
 4. Re-run `bundle exec rspec spec/insert_output_snapshot_spec.rb` without `UPDATE_SNAPSHOTS` and confirm that all scenarios (sqlite3 / mysql2 / postgresql / postgresql-copy / mongodb) pass
 5. Confirm on CI that the new `Run exwiw (copy mode)` step in the `with_postgres` job passes (or detects invalid SQL)
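
The record-then-compare snapshot workflow in steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the code in `spec/insert_output_snapshot_spec.rb`: the `check_snapshot` helper, its return values, and the sample file name are hypothetical, and only the `UPDATE_SNAPSHOTS=1` switch comes from the plan itself.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: compare generated output with a stored snapshot,
# or (re)write the snapshot when UPDATE_SNAPSHOTS=1 is set.
def check_snapshot(snapshot_path, actual, env: ENV)
  if env["UPDATE_SNAPSHOTS"] == "1"
    FileUtils.mkdir_p(File.dirname(snapshot_path))
    File.write(snapshot_path, actual)
    return :updated
  end
  return :missing unless File.exist?(snapshot_path)

  File.read(snapshot_path) == actual ? :match : :mismatch
end

Dir.mktmpdir do |dir|
  # Illustrative path and content only; real snapshots come from exwiw output.
  path = File.join(dir, "postgresql-copy", "insert-001-shops.sql")
  sql  = "COPY shops (id, name) FROM stdin;\n1\tAcme\n\\.\n"

  p check_snapshot(path, sql, env: { "UPDATE_SNAPSHOTS" => "1" }) # record the snapshot
  p check_snapshot(path, sql, env: {})                            # compare against it
  p check_snapshot(path, sql + "-- drift\n", env: {})             # a difference is reported
end
```

Passing the environment as an argument keeps the sketch testable without mutating the process-wide `ENV`.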

data/docs/plans/2026-05-31-ids-column-for-sql-adapters.md CHANGED Viewed

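@@ -81,7 +81,7 @@ indirect / polymorphic cases are handled uniformly and correctly.
   add a case confirming that, when `ids_field` is specified, the target table's WHERE
   uses that column instead of the primary key.
 - End-to-end check with the `explain` subcommand (SQL only):
-  against an existing scenario (e.g. `scenario/sqlite3-schema`),
+  against an existing scenario (e.g. `e2e/sqlite3-schema`),
   pass `--target-table=... --ids=... --ids-column=<col>` and visually confirm that the WHERE in the output SQL
   is `<table>.<col> IN (...)`.
 - No regressions under `bundle exec rspec` (full suite).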
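
The expected `<table>.<col> IN (...)` shape can be illustrated with a small sketch. The `ids_where_clause` helper below is hypothetical and is not exwiw's implementation; it assumes integer ids and a default primary key named `id`, and only shows how the filter column switches from the primary key to the given ids column.

```ruby
# Hypothetical helper, not exwiw's implementation: build the WHERE clause for
# the target table, filtering on ids_column when given, else on the primary key.
def ids_where_clause(table:, ids:, primary_key: "id", ids_column: nil)
  raise ArgumentError, "ids must not be empty" if ids.empty?

  column = ids_column || primary_key
  # Assumption: ids are integers. Integer() raises on non-numeric strings,
  # so only numbers are inlined into the SQL text.
  list = ids.map { |id| Integer(id) }.join(", ")
  "WHERE #{table}.#{column} IN (#{list})"
end

puts ids_where_clause(table: "orders", ids: [1, 2])
# prints: WHERE orders.id IN (1, 2)
puts ids_where_clause(table: "orders", ids: [1, 2], ids_column: "shop_id")
# prints: WHERE orders.shop_id IN (1, 2)
```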