RubyGems - exwiw - Versions diffs - 0.7.0 → 0.8.1 - Mend

exwiw 0.7.0 → 0.8.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (17) hide show

checksums.yaml +4 -4
data/CHANGELOG.md +20 -1
data/README.md +63 -24
data/docs/mongodb-dump-parallelism-2x-notes.md +146 -0
data/docs/optimization-notes.md +12 -0
data/docs/scope-column-redesign.md +107 -0
data/lib/exwiw/adapter/sql_bulk_insert.rb +25 -7
data/lib/exwiw/adapter.rb +33 -4
data/lib/exwiw/cli.rb +25 -43
data/lib/exwiw/query_ast.rb +1 -1
data/lib/exwiw/query_ast_builder.rb +34 -20
data/lib/exwiw/runner.rb +24 -9
data/lib/exwiw/schema_generator.rb +81 -3
data/lib/exwiw/table_config.rb +5 -4
data/lib/exwiw/version.rb +1 -1
data/lib/tasks/exwiw.rake +17 -1
metadata +3 -1

checksums.yaml CHANGED Viewed

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 12765a8f130dec0055149beb7ba69b37c9b548758cfbd8dd807c1499a9fbf0b5
-  data.tar.gz: 9165cda22d6e4eb2ff97a6d386510a04f37d353e2b31ea4b6bada2fc1ad2adb7
+  metadata.gz: 80610cc2d13a87793171563b2b8e4ff0568135eb987b550dd9f89bffe89b1d67
+  data.tar.gz: 5e9f4976043571647163e9f743c3c8fb709a29944b590c55424dce374cca3269
 SHA512:
-  metadata.gz: eb9ad3e65e4b8756574b5c97fec28be38cc708604cbdaffcb772bd8958b183a067dffafa44078705ed00346be27a9aa81bd68a212b7bcecf635a84d6a7f86adc
-  data.tar.gz: fd47a8459abec5a56ab2c64c7cecc4ffa916608b2cf9278a68a2985defa18812b4f09b70a54b8d6ac498a305022644a0a0d923358a5a06143dedcb550aab7e84
+  metadata.gz: 8508aaa2d9cba3310a9ee4d4b940682b894bbafcab9672e4ccca5fb8f1a4b43e6fe9c36cbfcd0efe8f12c5308fa9be33aeb16d53b6d98ef71692ec2e30076be0
+  data.tar.gz: 8a1a8187e7547f4ce0eeaee458d36b16e56c08c1f2275977127de56a49eb2a3b25e22d8c52d3678fa6ea91e553ebf6c69bf943e5bb108e3984fa8fc34f03a2f0

data/CHANGELOG.md CHANGED Viewed

@@ -2,6 +2,25 @@
 ## [Unreleased]
+## [0.8.1] - 2026-06-24
+### Fixed
+- **`schema:generate` skips a `belongs_to` whose target is not an ActiveRecord model instead of crashing.** A `belongs_to` can point at a non-ActiveRecord class — most commonly an ActiveHash/ActiveYaml master (`belongs_to :equipment, class_name: "SomeActiveYamlModel"`). active_hash registers these as ordinary `belongs_to` reflections, but the target class has no database table, so resolving its `table_name` raised and aborted generation. Such a relation is not a DB edge exwiw can join or extract across, so it is now dropped from the generated belongs_tos; the underlying foreign-key column is still emitted as a plain column. A bare `belongs_to` to a plain non-AR class — which makes ActiveRecord raise while resolving the target — is treated the same way. Polymorphic associations are unaffected.
+## [0.8.0] - 2026-06-24
+### Added
+- **Scope-column mode is now declared per table.** A table that should be filtered by a shared scope/tenant column declares `scope_column: <column>` in its schema config. Naming any such table as `--target-table` then runs in scope-column mode: `--ids` are values of that shared column (not primary keys), and every table is filtered by its own `scope_column` — tables that lack one are reached via `belongs_to`, `scope_exempt: true` tables are dumped in full, and a table reachable by none aborts the run (so nothing is silently dumped unscoped). This is the primary way to extract across a foreign key that cannot be joined — most importantly a cross-database `belongs_to` (its join is impossible, but the FK column is still filterable): declare `scope_column: "<that foreign key>"` on the owning table and it is filtered by the FK value directly, with no join. SQL adapters only.
+- **`schema:generate` detects a cross-database `belongs_to` and ignores just the relation — not the table, not the foreign-key column.** A `belongs_to` whose target model lives in a different database (Rails multi-database `connects_to`) cannot be joined — each database is exported separately — so the *belongs_to entry* is emitted with `ignore: true` and `ignore_type: "cross_database"` and a `comment` recording why and pointing at the per-table `scope_column` declaration for cross-boundary extraction. The owning table is still exported normally and its foreign-key column is still exported as a plain column; only the join/dependency edge is dropped at load (otherwise the dangling cross-database target would crash dependency resolution). Polymorphic associations are handled per target. The task prints a summary of every cross-database relation it ignored. (`ignore_type` is now also preserved across regeneration by `TableConfig#merge`.) Single-database applications are unaffected.
+### Changed (breaking)
+- **`--ids-column` is removed.** The SQL-adapter flag that matched `--ids` against a non primary-key column on the target table has no remaining use case; for a scoped table, the per-table `scope_column` is the column `--ids` filter against. (The mongodb-only `--ids-field` is unaffected.)
+- **`--scope-column` is deprecated.** The global flag still selects scope-column mode (SQL-only, mutually exclusive with `--target-table`) but now emits a deprecation warning. Prefer declaring a per-table `scope_column:` in the schema config and running with `--target-table`. A per-table `scope_column` takes precedence over the flag for any table that sets both.
+- **A scoped target's `--ids` now mean scope values, not primary keys.** When `--target-table` names a table that declares a `scope_column`, exwiw runs in scope-column mode and `--ids` are matched against that shared column; the target is scoped like any other table rather than anchored by primary key. A table that declares a `scope_column` can therefore no longer be single-extracted by primary key.
 ## [0.7.0] - 2026-06-23
 ### Changed
@@ -16,7 +35,7 @@
 ### Added
-- Optimize memory usage https://github.com/heyinc/exwiw/pull/118
+- Optimize memory usage https://github.com/heyinc/exwiw/pull/118
 - **MongoDB: optional native (C) encoder for the Extended-JSON dump path** (no flag, byte-identical output, pure-Ruby fallback). Encoding each document to MongoDB Relaxed Extended JSON — previously `JSON.generate(doc.as_extended_json(mode: :relaxed))`, which rebuilds the whole document into an intermediate transformed Hash tree and then walks it again — was the dominant per-document CPU cost (~82% of serialization on embed-heavy data). A new C extension (`ext/exwiw/ext_json/`) emits the JSONL line in a single native tree-walk. It formats the structural bulk plus the leaves that dominate a dumped document — `Hash`, `Array`, `String`, fixnum `Integer`, `true`/`false`/`nil`, `BSON::ObjectId` (`_id`), and in-range `Time` (the Mongoid `created_at`/`updated_at` timestamps) — and delegates everything else (`Float`, out-of-int64 `Integer`, out-of-range `Time`, `Symbol`, `Decimal128`, …) back to the exact pure-Ruby path, so the output is provably byte-for-byte identical. On a 30-embedded-post timestamp-heavy document this serializes ~2.8× faster. With `gem install exwiw` the extension compiles automatically; hosts that cannot compile (JRuby/TruffleRuby, no toolchain) fall back to the pure-Ruby encoder, so exwiw stays installable as a pure-Ruby gem. See [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md).
 ## [0.5.3] - 2026-06-19

data/README.md CHANGED Viewed

@@ -79,7 +79,7 @@ exwiw \
   --log-level=info
 ```
-By default `--ids` are matched against the target table's primary key. `--ids-column=COLUMN` matches them against a different column instead (e.g. `--target-table=users --ids=alice@example.com --ids-column=email`). Related tables are still extracted correctly: their foreign keys are resolved through the target via a subquery (`WHERE fk IN (SELECT pk FROM target WHERE COLUMN IN (...))`), so only the target table's filter column changes. This is the SQL-adapter counterpart of the mongodb `--ids-field`; the two are mutually exclusive and each is rejected by the other adapter family. Note: if `COLUMN` is itself masked, re-running `delete-*` against an already-imported (masked) dump won't match, so prefer a stable natural key.
+By default `--ids` are matched against the target table's primary key. If the target table declares a per-table `scope_column`, exwiw runs in [scope-column mode](#scope-column-mode) instead — `--ids` are then values of that shared column, and the table is scoped like any other rather than anchored by primary key.
 When `--target-table` and `--ids` are omitted, exwiw dumps all tables defined in `--schema-dir`:
@@ -129,18 +129,31 @@ exwiw explain \
 The `--output-dir`, `--output-format`, `--insert-only`, and `--after-insert-hook` options are dump-specific and rejected when used with `explain`.
-### Scope-column mode (`--scope-column`)
+### Scope-column mode
 The default `--target-table` extraction assumes the schema converges on a single
 root: every table is reached by walking `belongs_to` toward that one table. Some
 schemas are not shaped that way — many independent top-level tables each carry the
-*same* scope/tenant column (e.g. `tenant_id`, `account_uuid`) and there is no
-single root. Choosing one of them as `--target-table` would leave the others
-unrelated to it, and an unrelated table is dumped in full — a problem if it holds
-personal data.
+*same* scope/tenant column (e.g. `tenant_id`, `business_entity_id`), and a foreign
+key that **cannot be joined** (most importantly a cross-database `belongs_to`,
+whose join is impossible but whose FK column is still filterable) is not reached at
+all. Choosing one table as `--target-table` would leave the others unrelated to it,
+and an unrelated table is dumped in full — a problem if it holds personal data.
-`--scope-column` handles this shape: instead of one anchor table, **every table is
-filtered by a shared column** whose values are `--ids`.
+Scope-column mode handles this shape: instead of anchoring on one table's primary
+key, **every table is filtered by a shared column** whose values are `--ids`.
+Declare that column per table in the schema config with `scope_column:`:
+```json
+{
+  "name": "shops",
+  "primary_key": "id",
+  "scope_column": "business_entity_id",
+  "columns": [{ "name": "id" }, { "name": "name" }, { "name": "business_entity_id" }]
+}
+```
+Then name any scoped table as `--target-table` and pass the scope values as `--ids`:
 ```bash
 exwiw \
@@ -148,15 +161,21 @@ exwiw \
   --host=localhost --port=5432 --user=reader \
   --database=app_production \
   --schema-dir=exwiw/schema \
-  --scope-column=tenant_id \
-  --ids=42,43 \
+  --target-table=shops --ids=42,43 \
   --output-dir=dump
 ```
+Because `shops` declares a `scope_column`, exwiw switches to scope-column mode: the
+`--ids` (`42,43`) are **`business_entity_id` values, not shop primary keys**, and
+`shops` itself is scoped by `business_entity_id IN (42,43)` like every other scoped
+table — it is *not* used as a primary-key anchor. (A table that declares a
+`scope_column` therefore can no longer be single-extracted by primary key.)
 Each table is resolved as follows:
-- **Carries the scope column** → `WHERE scope_column IN (ids)`.
-- **Lacks it but `belongs_to` reaches a table that has it** → exwiw joins up to the
+- **Declares the scope column** (`scope_column:`, or carries the global column of
+  the deprecated `--scope-column` flag) → `WHERE scope_column IN (ids)`.
+- **Does not, but `belongs_to` reaches a table that does** → exwiw joins up to the
   nearest such table and applies the scope filter there (the same join machinery
   the single-target mode uses).
 - **`belongs_to` a parent that is itself scoped but carries no scope column of its
@@ -168,13 +187,22 @@ Each table is resolved as follows:
   a single forward hop and a single unambiguous scopable parent.
 - **Cannot be scoped at all** (no scope column and no path to one) → exwiw
   **aborts** and lists the offending tables, so an unscoped table is never silently
-  dumped in full. For each, either add a `belongs_to` path, set `ignore: true` to
-  skip it, or mark it `scope_exempt: true` (below) to export it in full.
+  dumped in full. For each, either declare a `scope_column`, add a `belongs_to`
+  path, set `ignore: true` to skip it, or mark it `scope_exempt: true` (below) to
+  export it in full.
+Scope-column mode is SQL-only (mysql / postgresql / sqlite). It works with `exwiw
+explain` too, which is the recommended way to preview the queries before exporting.
+#### Cross-database foreign keys
-`--scope-column` is SQL-only (mysql / postgresql / sqlite) and mutually exclusive
-with `--target-table`, `--target-collection`, `--ids-column`, and `--ids-field`.
-It works with `exwiw explain` too, which is the recommended way to preview the
-queries before exporting.
+The motivating case for declaring a `scope_column` is a foreign key that cannot be
+joined: when a `belongs_to` target lives in a different database (see the
+cross-database `belongs_to` note under the generator), that join is impossible, but
+the foreign-key *column* is still present and can be filtered directly. Declaring
+`scope_column: "<that foreign key>"` on the owning table scopes it by the column
+value, with no join — `schema:generate` points this out in the ignored relation's
+`comment`.
 #### `scope_exempt` (intentional full dump)
@@ -193,11 +221,11 @@ opt out of the strict check and be exported in full:
 Rails-managed tables (`schema_migrations`, `ar_internal_metadata`) are treated as
 exempt automatically.
-#### Per-table `scope_column` override
+#### Per-table `scope_column` and the value space
-scope-column mode assumes a single shared **value** space — the same `--ids` apply
-to every scoped table. If a table stores that same value under a differently named
-column, override the column name for that table:
+Scope-column mode assumes a single shared **value** space — the same `--ids` apply
+to every scoped table. Each table names its own column, so a table that stores that
+same value under a differently named column simply declares that name:
 ```json
 {
@@ -211,6 +239,15 @@ column, override the column name for that table:
 Both `scope_exempt` and `scope_column` are user-maintained and preserved across
 `schema:generate` regeneration (the generators never emit them).
+#### Deprecated: the `--scope-column` flag
+Before per-table declarations, scope-column mode was selected with a global
+`--scope-column=COLUMN` flag (every table filtered by that one column, `--ids` its
+values, no `--target-table`). The flag still works — SQL-only and mutually
+exclusive with `--target-table` — but is **deprecated** and emits a warning; prefer
+declaring a per-table `scope_column` and running with `--target-table`. A per-table
+`scope_column` takes precedence over the flag for any table that sets both.
 ### Config file (`exwiw.yml`)
 Options you would otherwise repeat on every run can be kept in a YAML config file. Pass it with `--config=PATH`; when `--config` is omitted, exwiw automatically loads `exwiw.yml` (or `exwiw.yaml`) from the current directory if present.
@@ -226,7 +263,7 @@ output_format: insert        # insert | copy
 insert_only: false
 after_insert_hook: hooks/seed.rb
 log_level: info              # debug | info
-# target_table / ids / ids_field / ids_column / scope_column may also be set here
+# target_table / ids / ids_field / scope_column may also be set here
 ```
 With the file above, only the connection details need to be supplied on the CLI:
@@ -298,6 +335,8 @@ exwiw/schema/
 Each database keeps its own Rails migration history, so a `schema_migrations` (and `ar_internal_metadata`) entry is emitted under every database that contains one — the example above shows `primary/schema_migrations.json` and would also produce `analytics/schema_migrations.json` when the analytics database has its own migration table. Single-database applications are unaffected and continue to write files flat into the output directory.
+A `belongs_to` whose target model lives in a *different* database (e.g. a `primary` model referencing an `analytics` one) cannot be joined: each database is exported on its own connection and into its own subdirectory, so the target table is absent from the directory this config is loaded with. `schema:generate` detects such a relation (by comparing the owning and target models' database config names) and emits it with `ignore: true` and `ignore_type: "cross_database"`, recording why in the `comment`; the relation is then dropped from extraction at load time, while the foreign-key column itself is still exported as a plain column. Polymorphic associations are handled per target, so only the targets that cross a database boundary are ignored. The task also prints a summary of every cross-database `belongs_to` it ignored. **To extract across such a boundary, declare `scope_column: "<foreign_key>"` on the owning table (see [scope-column mode](#scope-column-mode)) so its rows are filtered by the foreign-key value directly** — there is no join, so the cross-database boundary is not a problem there.
 **Limitations**
 - The rails-managed table *names* are resolved from the global `ActiveRecord::Base.schema_migrations_table_name` / `internal_metadata_table_name` accessors, which are shared across all connections. A per-database override of these names is not detected, so such a table will be missing from that database's generated configs.
@@ -658,7 +697,7 @@ The MongoDB adapter is experimental. To use it:
 - The MongoDB adapter consumes a separate config type, `MongodbCollectionConfig`, with MongoDB-native naming. Use `fields` (instead of the SQL adapters' `columns`), and set `"primary_key": "_id"`. Foreign keys (`shop_id`, `user_id`, ...) stay as ordinary fields.
 - `--ids` values are coerced to the type actually stored in `_id` before filtering: integer-looking ids become `Integer`, 24-char hex ids become `BSON::ObjectId` (Mongoid's default `_id` type — a plain String would never match an ObjectId), and any other string is left as-is.
 - `--target-collection=COLLECTION` is a mongodb-only alias of `--target-table` (use whichever reads better for MongoDB). Specifying both, or using `--target-collection` with a non-mongodb adapter, is an error.
-- `--ids-field=FIELD` matches `--ids` against `FIELD` on the target collection instead of its primary key (e.g. `--target-collection=users --ids=a@example.com --ids-field=email`). Downstream foreign-key propagation still keys off the primary key, so only the target collection's filter changes. Unlike the primary-key path, the supplied ids are **not** type-coerced (the stored type of a custom field is unknown), so pass values matching the field's actual type. This flag is **mongodb-only**; the SQL adapters use `--ids-column` instead (see below).
+- `--ids-field=FIELD` matches `--ids` against `FIELD` on the target collection instead of its primary key (e.g. `--target-collection=users --ids=a@example.com --ids-field=email`). Downstream foreign-key propagation still keys off the primary key, so only the target collection's filter changes. Unlike the primary-key path, the supplied ids are **not** type-coerced (the stored type of a custom field is unknown), so pass values matching the field's actual type. This flag is **mongodb-only** (the SQL adapters have no equivalent).
 - Large or embedded-document-heavy dumps are streamed automatically: the adapter reads the collection through a lazy cursor (not `.to_a`) and writes JSONL in chunks, so peak memory is bounded by the chunk size rather than the collection size — no flag to set. Encoding each document to MongoDB Extended JSON is accelerated by an **optional native (C) extension** that compiles automatically on `gem install`; where it cannot compile, exwiw falls back to a byte-identical pure-Ruby encoder. See [`docs/optimization-notes.md`](docs/optimization-notes.md) for the performance investigation and [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md) for the native encoder's design. Benchmark your own data with `script/bench_mongodb_dump.rb`.
 - Output is JSON Lines (`insert-{idx}-{collection}.jsonl`) using MongoDB Extended JSON (relaxed mode). Import with `mongoimport`:
   ```bash

data/docs/mongodb-dump-parallelism-2x-notes.md ADDED Viewed

@@ -0,0 +1,146 @@
+# MongoDB dump: reaching the 2× target with inter-collection fork parallelism
+This continues [`optimization-notes.md`](./optimization-notes.md). That document
+ends at "the remaining cost is the serial BSON→Ruby decode, and the lever left is
+a native encoder bounded by the same serial-decode ceiling." This note records a
+**different** lever that was measured to clear a **2× end-to-end** target on a
+large real extraction, **byte-identically**, and explains why it succeeds where
+the previously-removed `--parallel-workers` did not.
+The measurements below are from a large staging extraction (~300 non-empty
+collections, target → 1 store → ~hundreds of scoped rows, plus reference/master
+data dumped in full). Reproduce with the bench harness against a restored backup
+(see [`mongodb-scoping-fullscan-notes.md`](./mongodb-scoping-fullscan-notes.md)
+for serving a WiredTiger backup as a standalone `mongod`).
+## Where the time goes (warm cache, native ext compiled)
+Serial baseline ≈ **6.8–7.0 s**. Per-collection instrumentation (wrapping
+`build_query` / `execute` / `write_inserts`) attributes ~6.0 s to the write pass.
+Splitting one collection's write pass into "drain the cursor only" vs
+"drain + native-encode" shows **decode is 64–92 %** of per-collection cost — the
+Mongo driver's BSON→Ruby-Hash decode dominates; the native Extended-JSON encode is
+already cheap. The single heaviest collection (a 237k-doc reference table) is
+**1.31 s by itself** and cannot be split without intra-collection partitioning.
+Classifying every processed collection by how it is scoped (using the
+genuine-anchor model from `mongodb-scoping-fullscan-notes.md`) gives three groups
+with very different dependency shapes:
+| group | meaning | count | time |
+|---|---|---|---|
+| **leaf** | no `belongs_to` → dumped in full, no input deps | 54 | ~3.5 s |
+| **genuine** | reachable to the dump target via `belongs_to` (the scoped DAG) | 245 | ~0.8 s |
+| **ref_bt** | has `belongs_to` but NOT reachable to target (reference data scoped by strict-AND fallback) | 23 | ~2.1 s |
+The crucial structural facts, all derivable from the configs:
+- **genuine never consumes leaf/ref_bt `@state`.** A genuine collection drops its
+  reference (leaf) parents when it has a genuine anchor, and a ref_bt parent is by
+  definition not genuine. So the whole scoped DAG is independent of the heavy
+  reference dumps — it can run **concurrently** with them.
+- **leaf has no input deps at all** (no `belongs_to`) → embarrassingly parallel.
+- **ref_bt consumes only leaf `@state` and other ref_bt `@state`** (never genuine).
+  Its internal edges form shallow weakly-connected components.
+## Why process-level (not thread/serialization) parallelism is the right lever
+`optimization-notes.md` removed two parallelism attempts. The distinction that
+matters:
+- `--parallel-workers` parallelized **serialization only** (the parent still did
+  the single serial cursor decode), so it was capped at ~1.1–1.4× — decode is the
+  bottleneck and it was untouched.
+- `--cursor-parallel` parallelized **decode within a collection** (≈2.5–5.5×) but
+  forced `sort(_id)`, changing row order → not byte-identical, could not be the
+  default.
+Forking **whole collections** across processes parallelizes the decode (each
+worker decodes its own collections) **while preserving each collection's natural
+order** — so it is both decode-parallel and byte-identical. That is the property
+neither removed approach had.
+## The schedule that was measured at ≥2×
+One parent process plus a fork pool of `N` workers:
+1. **Phase 1 (concurrent).** Fork the leaf pool (`N` workers, leaves assigned by
+   LPT bin-packing on output-size weight so the 1.31 s collection sits alone).
+   Concurrently the parent (a) dumps the schema (parent-only, needs no `@state`)
+   and (b) processes the **whole genuine DAG optimistically** — without leaf
+   `@state` yet — recording each collection's row count.
+2. **Barrier + leaf `@state`.** Wait for the leaf pool; load the small Marshal
+   sidecars each consumed leaf wrote (only the ~11 leaves a downstream collection
+   references need one; the heavy 1.31 s table is not referenced, so no sidecar).
+3. **Phase 2 (cascade reprocess).** Only **8** genuine collections reference a leaf
+   at all, and **0** are statically forced to use it — a genuine collection only
+   falls back to a leaf clause at runtime when its genuine anchor turns out empty
+   (e.g. an empty intermediate parent). Reprocess those 8 with leaf `@state` now
+   present; if a reprocessed collection's row count changes, enqueue its genuine
+   children and repeat. Non-fallback collections (e.g. the DAG root one level below
+   the dump target, which has a genuine anchor and drops its leaf parents)
+   reprocess to an identical result and stop the cascade,
+   so this terminates after re-touching only the genuinely affected handful.
+4. **Phase 3 (ref_bt components).** Fork the 23 ref_bt collections as
+   **dependency-closed weakly-connected components** (over intra-ref_bt edges) in a
+   single pool — no level barriers, no cross-worker IPC. Each worker owns whole
+   components and processes their members in topological order, seeded with the
+   (sliced) leaf `@state` its members reference. Components are assigned by LPT.
+`@state` IPC is just Marshal sidecars for the ~11 referenced leaves; everything
+else is COW-inherited at fork or kept inside a component.
+## Measured result (warm cache, same backup)
+Internal compute timer (excludes the fixed ~0.5 s Ruby/bundler startup both paths
+pay):
+| workers | compute | speedup |
+|---|---|---|
+| serial | 7.01 s | 1.00× |
+| N=2 | 3.87 s | 1.81× |
+| N=4 | 2.80 s | **2.50×** |
+| N=6 | 2.79 s | **2.51×** |
+Full wall-clock (includes Ruby startup — the honest end-to-end number):
+| workers | wall | speedup |
+|---|---|---|
+| serial | 6.79 s | 1.00× |
+| N=2 | 4.06 s | 1.67× |
+| N=4 | 3.24 s | **2.10×** |
+| N=6 | 3.21 s | **2.12×** |
+**Byte-identical:** all 189 output files (schema + inserts) match the serial run
+exactly — same filenames (the insert index is taken over the full ordering,
+including `ignore:true` collections, exactly as the Runner numbers them) and same
+content (0/189 cmp mismatches).
+The curve **saturates at N≈4**: past that the wall time is bounded by the single
+1.31 s leaf decode plus the longest ref_bt chain (~0.6 s) plus startup. Going
+beyond ~2.5× needs **intra-collection** decode parallelism (the `_id`-range cursor
+split that was removed for changing row order) or a **native BSON→Extended-JSON
+transcoder** that skips the Ruby-Hash decode for unmasked reference collections
+(decode being 64–92 % of the cost, this is the largest single-process lever left
+and composes with the fork approach).
+## Operational note: this needs vCPUs to spend
+The win is real cores doing real decode in parallel. On the current ECS task
+(`cpu: 2048` = **2 vCPU**) the schedule reaches only ~1.67× (N=2). Clearing 2×
+requires scaling to **≥4 vCPU** (`cpu: 4096`); 8 vCPU does not help further given
+the N≈4 saturation. Memory is unchanged — the existing streaming keeps each worker
+bounded by chunk size, and `N×` that stays well under the 7 GiB container limit.
+## Status
+This is a **measured, byte-identical proof** (bench prototype, not yet integrated
+into the Runner/CLI). Integrating it means re-introducing process orchestration
+and `@state` sidecar IPC — the machinery `optimization-notes.md` deliberately
+removed as over-engineered for a flag. It is recorded here because, unlike that
+removed work, this schedule is (a) byte-identical by construction, (b) measured
+past the 2× target on a real extraction, and (c) the lever the task explicitly
+invited (scale the task to go faster). A production version would gate it behind a
+worker-count option, fall back to serial where `fork` is unavailable
+(Windows/JRuby), and reuse the existing genuine-anchor classification to derive
+the three groups.

data/docs/optimization-notes.md CHANGED Viewed

@@ -109,6 +109,18 @@ The full design (byte-identity strategy, fast-path vs Ruby-delegate types,
 optional-load + pure-Ruby fallback, packaging) is in
 [`optimize-mongodb-export-with-native-ext.md`](./optimize-mongodb-export-with-native-ext.md).
+The native ext shipped (0.6.1+) and the dump is now **decode-bound** (the driver's
+BSON→Ruby decode is 64–92 % of per-collection cost). The single-process levers are
+exhausted short of 2×. The lever that *was* measured past 2× — **byte-identical
+inter-collection fork parallelism** (parallelize whole collections across
+processes, so the decode itself runs in parallel while each collection keeps its
+natural order) — is written up in
+[`mongodb-dump-parallelism-2x-notes.md`](./mongodb-dump-parallelism-2x-notes.md).
+Unlike the `--parallel-workers` attempt removed above (which parallelized
+serialization only and so was capped at ~1.4×), this parallelizes the actual
+bottleneck and reaches ~2.1× wall / ~2.5× compute on a real extraction, given
+≥4 vCPU to spend.
 ## Methodology notes (for re-running)
 - The CPU hotspot reproduces **with no database**: the Mongo driver hands back

data/docs/scope-column-redesign.md ADDED Viewed

@@ -0,0 +1,107 @@
+# scope-column まわりの再設計メモ（2026-06-23）
+> ステータス: **(a) で決定**。hybrid は畳む（PR #128 の hybrid 中核を revert し、per-table scope-column モードに作り直す）。確定仕様は末尾「決定: (a) の確定仕様」を参照。
+## 背景 / 解きたい問題
+Rails の multi-database（`connects_to`）で、別 DB のモデルを指す `belongs_to`（cross-database）は **join できない**。
+```ruby
+# main DB
+class Customer < ApplicationRecord
+  belongs_to :tenant, class_name: "Org::Tenant"   # tenants は別 DB
+end
+# org DB
+class Org::Tenant < OrgApplicationRecord
+end
+```
+`customers.tenant_id` は join では絞れないが、**列の値（tenant_id）で直接フィルタ**はできる。これを使ってテナント単位 / 対象単位でデータ抽出したい。
+## いまある実装（PR #128, draft / branch `feat/hybrid-target-scope-column`）
+- **hybrid モード**: `--target-table` と `--scope-column` を併用。各テーブルを「target 到達」OR「scope 列到達」の**和集合（OR）**で解決。scope 値は target から**導出**する
+  （`scope_column IN (SELECT target.scope_column FROM target WHERE <target ids>)`）。SQL only / insert-only / 単一アンカーは従来と byte 同一。
+- **generator**: cross-database `belongs_to` を検出し、**relation だけ** `ignore: true` + `ignore_type: "cross_database"`（テーブル本体・FK 列は維持）。生成時にサマリ出力。
+- **R2**: per-table `scope_column` があるのに `--scope-column` 未指定ならエラー。
+## 再設計の方針（この議論での決定）
+1. **per-table `scope_column` を唯一の宣言にする。** scoped なテーブルは config に `scope_column: <列>` を明示。
+   グローバル `--scope-column`（正準名）＋ per-table オーバーライド、という二層と「オーバーライド」概念は**廃止**。
+   各テーブルが自分の列を宣言する（差異名も自然に表現できる。全 scope 列は同じ値空間を持つ前提）。
+2. **`--scope-column` フラグは deprecated**（警告を出して per-table 宣言へ誘導）。
+3. **`--ids-column` は削除**。「自然キー（email 等）で target を指定する」ユースケースが無いため。
+   scoped な target では `scope_column` が「`--ids` を当てる列」を兼ねる。
+4. **generator の誘導文を更新**: 「越境は `--scope-column` を使え」→「**このテーブルに `scope_column:` を宣言しろ**」。
+   検出・relation ignore・FK 列維持・サマリ自体は据え置き。
+5. **R2 は撤回**（グローバルフラグ概念が無くなるため、per-table 宣言が正道になる）。
+ここまでは hybrid の有無に関わらず確定。
+## 未解決: hybrid は要るのか？
+抽出ニーズが2種類あり、片方は pure scope-column モードでは実現できない。
+### (a) テナント丸ごと抽出 — pure scope で足りる
+「tenant 42 のデータを全部」。各 scoped テーブルを自分の `scope_column IN (42)` で絞るだけ。未 scoped テーブルは belongs_to 経由（via_path / referenced_by）。→ **hybrid 不要**。
+### (b) 対象を絞った最小抽出 + cross-DB 行 — hybrid が必要
+「shop 1 のデータだけ最小で欲しい。ただし join できない cross-DB の `customers`（shop 1 のテナントに属する）も含めたい」。
+- pure scope（テナント単位）だと **tenant 42 全体**を引いてしまい多すぎる（shop 1 以外の shop も customers も入る）。
+- 素の `--target-table=shops`（join 抽出）だと、cross-DB の `customers` には**到達できない**（join 不可）。
+- → **shop 1 を主キーで anchor しつつ、その tenant_id を導出して cross-DB テーブルを scope する hybrid でしか実現できない**。
+テスト用データ抽出は (b)（最小・対象限定）になりがちなので、hybrid は捨てがたい。
+### CLI 上の衝突（ここが論点）
+`--ids-column` を消したので、**scoped な target の `--ids` を 2 通りに解釈できない**:
+- 解釈A: `--target-table=shops --ids=42` の 42 = `scope_column` の値（テナント抽出）
+- 解釈B: `--target-table=shops --ids=1` の 1 = 主キー（hybrid: shop 1 を anchor して導出）
+同じ CLI 形では両立しない。→ どちらかを選ぶ:
+- **解釈A を採る**なら hybrid は畳む（pure scope-column モードに作り直す）。(b) は失われる。
+- **hybrid を残す**なら scoped target の `--ids` は**主キー**（解釈B）に戻し、テナント抽出（a）は **target 無しの pure scope**（または deprecated な `--scope-column`）で行う。
+### 推奨（たたき台）
+**hybrid を残す**案。理由: 元々の動機（cross-DB FK を含めて1回で対象を抽出）と (b) のニーズが一致し、すでに #128 で動いている。
+per-table `scope_column` 化とも両立できる ―― hybrid の導出元 scope 列を「`--scope-column` フラグ」ではなく **target の config の `scope_column`** から取れば、フラグ無し・`--ids-column` 無しで成立する:
+| やりたいこと | コマンド | ids の意味 | 挙動 |
+|---|---|---|---|
+| shop 1 を最小抽出（cross-DB 含む） | `--target-table=shops --ids=1` | 主キー | shop 1 + 到達分。cross-DB/scoped テーブルは shops.`scope_column` から導出した tenant に scope（hybrid） |
+| tenant 42 を丸ごと | `--ids=42`（target 無し） | scope 値 | 各 scoped テーブルを自分の `scope_column IN (42)`（pure scope） |
+| 通常の単体抽出（scope 無し） | `--target-table=orders --ids=1`（orders に scope_column 無し） | 主キー | 従来どおり |
+この案だと「解釈A（scoped target で ids=scope 値）」は捨て、テナント抽出は no-target の pure scope に寄せることになる。
+## 決定: (a) の確定仕様
+(a) のみ採用。hybrid（(b)）は今回入れない。
+### モデル
+- scoped なテーブルは config に `scope_column: <列>` を宣言（per-table のみ。グローバル正準名もオーバーライドも無し。全 scope 列は同じ値空間）。
+### 起動と挙動
+| コマンド | 条件 | ids の意味 | 挙動 |
+|---|---|---|---|
+| `--target-table=shops --ids=42` | shops が `scope_column` を宣言 | **scope 値** | scope-column モード。`scope_column` を宣言する全テーブルが自分の `scope_column IN (42)`。未宣言テーブルは belongs_to 経由（via_path / referenced_by）。`scope_exempt` は full dump。解決不能は abort（validate_scope!）。target 名は PK フィルタには使わない（scoped テーブルの1つとして scope される） |
+| `--target-table=orders --ids=1` | orders は `scope_column` 無し | 主キー | 従来の単体 target 抽出（変更なし） |
+| `--scope-column=C --ids=42` | deprecated | scope 値 | 旧 pure scope（no target、global 列 C）。**警告**を出す。`--target-table` とは排他（pre-#128 に戻す） |
+| `--ids=42` のみ | target も flag も無し | — | エラー（従来どおり） |
+- `--ids-column` は**削除**。scoped target では `scope_column` が「`--ids` を当てる列」を兼ねる。
+- scoped なテーブルを主キーで単体抽出することはできなくなる（解釈A の代償。受容済み）。
+### 実装スコープ（PR #128 を作り直す）
+1. builder: #128 の hybrid 一式（build_hybrid / compose_or / pk_in_subquery / mode 引数 / 導出版 scope_where_clause / validate_hybrid! / R2=validate_scope_column_usage!）と `:or`（query_ast + 3 adapter）を **revert**。`scope_where_clause` はリテラル `scope_column IN (ids)` に戻す。
+2. builder: `scope_mode?` を「`--scope-column` フラグ or **target が `scope_column` を宣言**」で判定するよう拡張。`resolved_scope_column = table.scope_column || （deprecated flag）`。
+3. CLI: `--ids-column` 削除、`--scope-column` を deprecated 警告化 + `--target-table` と排他（pre-#128 に戻す）、hybrid 用の `--insert-only` 必須を撤去。
+4. runner/explain: `validate_hybrid!` / `validate_scope_column_usage!` 呼び出しを撤去。`validate_scope!` の起動条件を新 `scope_mode?` に合わせる。
+5. generator: cross-DB 検出は維持。comment / サマリの誘導を「`--scope-column` を使え」→「このテーブルに `scope_column:` を宣言しろ」に更新。
+6. docs / CHANGELOG: hybrid 記述を撤去し、本モデルで書き直す。breaking（`--ids-column` 削除・`--scope-column` deprecated・scoped target の `--ids` 意味変更）を明記。tests も差し替え。

data/lib/exwiw/adapter/sql_bulk_insert.rb CHANGED Viewed

@@ -35,32 +35,50 @@ module Exwiw
       # #to_bulk_insert per chunk joined by "\n" (verified by
       # insert_output_snapshot_spec), but only one ~STREAM_FLUSH_BYTES buffer is
       # resident at a time rather than the entire table's INSERT string. Returns
-      # the number of statements written.
+      # [statement_count, record_count]; record_count is tallied during the single
+      # streaming drain so the Runner needs no separate SELECT COUNT(*) pass.
       def write_inserts(io, results, table, chunk_size)
         chunks = chunk_size ? results.each_slice(chunk_size) : [results]
         statement_count = 0
+        record_count = 0
         chunks.each do |chunk_rows|
           io.print("\n") if statement_count.positive?
-          stream_single_insert(io, chunk_rows, table)
+          record_count += stream_single_insert(io, chunk_rows, table)
           statement_count += 1
         end
-        statement_count
+        [statement_count, record_count]
       end
       # Emit one `INSERT INTO ... VALUES <tuples>;` statement to `io`, building
       # and flushing the value tuples STREAM_FLUSH_ROWS at a time so the full
-      # statement text is never held in memory at once. Each slice is one fast
-      # map+join; the ",\n" between slices reproduces the same separator
+      # statement text is never held in memory at once. Each flush is one fast
+      # map+join; the ",\n" between flushes reproduces the same separator
       # #to_bulk_insert puts between every tuple, so the bytes are identical.
+      # Returns the number of rows written, tallied as the stream is drained.
+      #
+      # Rows are buffered off `#each` rather than `rows.each_slice(...)`:
+      # each_slice queries the receiver's `#size`, which on a streaming result
+      # issues a redundant `SELECT COUNT(*)` (a second full pass over the same
+      # filter). Manual buffering walks the cursor exactly once.
       private def stream_single_insert(io, rows, table)
         io.print(insert_header(table))
         first = true
-        rows.each_slice(STREAM_FLUSH_ROWS) do |slice|
+        record_count = 0
+        buffer = []
+        flush = lambda do
           io.print(",\n") unless first
           first = false
-          io.print(slice.map { |row| insert_tuple(row) }.join(",\n"))
+          io.print(buffer.map { |row| insert_tuple(row) }.join(",\n"))
+          record_count += buffer.size
+          buffer.clear
         end
+        rows.each do |row|
+          buffer << row
+          flush.call if buffer.size >= STREAM_FLUSH_ROWS
+        end
+        flush.call unless buffer.empty?
         io.print(";")
+        record_count
       end
       private def insert_tuple(row)

data/lib/exwiw/adapter.rb CHANGED Viewed

@@ -140,15 +140,44 @@ module Exwiw
       # @param results [Enumerable] rows/documents from #execute
       # @param table the table/collection config
       # @param chunk_size [Integer, nil] rows per statement (nil => one statement)
+      # @return [Array(Integer, Integer)] [statement_count, record_count]
+      #
+      # record_count is tallied from the rows actually streamed here so the
+      # Runner no longer needs a separate upfront count query (MongoDB's
+      # count_documents / the SQL adapters' SELECT COUNT(*)) just to log the row
+      # count and decide whether an empty table can be skipped. That count was a
+      # second full pass over the same filter — a wasted COLLSCAN when the scope
+      # is unindexed; counting during the single streaming pass removes it.
+      #
+      # Batches rows by accumulating into a buffer and flushing every chunk_size
+      # rows, rather than `results.each_slice(chunk_size)`. This is deliberate:
+      # `Enumerable#each_slice` calls `#size` on the receiver as an allocation
+      # hint, which for a streaming result (MongoDB's StreamingResult) issues a
+      # `count_documents` — the very redundant count this single-pass design
+      # removes. Driving the buffer off `#each` keeps the result's `#size`
+      # untouched, so the cursor is walked exactly once. The chunk boundaries and
+      # "\n" separators reproduce the each_slice output byte-for-byte.
+      #
+      # chunk_size is always positive for callers of this default (MongoDB); the
+      # SQL adapters pass nil and override #write_inserts, so the unbounded
+      # nil-branch buffer is never reached here in practice.
       def write_inserts(io, results, table, chunk_size)
-        chunks = chunk_size ? results.each_slice(chunk_size) : [results]
         statement_count = 0
-        chunks.each do |chunk_rows|
+        record_count = 0
+        buffer = []
+        flush = lambda do
           io.print("\n") if statement_count.positive?
-          io.print(to_bulk_insert(chunk_rows, table))
+          io.print(to_bulk_insert(buffer, table))
           statement_count += 1
+          record_count += buffer.size
+          buffer.clear
+        end
+        results.each do |row|
+          buffer << row
+          flush.call if chunk_size && buffer.size >= chunk_size
         end
-        statement_count
+        flush.call unless buffer.empty?
+        [statement_count, record_count]
       end
       # Run the database-specific EXPLAIN for the given query and return the

data/lib/exwiw/cli.rb CHANGED Viewed

@@ -34,7 +34,6 @@ module Exwiw
       target_collection
       ids
       ids_field
-      ids_column
       scope_column
     ].freeze
@@ -77,7 +76,6 @@ module Exwiw
       @target_collection_name = nil
       @ids = []
       @ids_field = nil
-      @ids_column = nil
       @scope_column = nil
       @output_format = nil
       @insert_only = nil
@@ -165,7 +163,7 @@ module Exwiw
       resolve_target_collection_alias!
       resolve_scope_column!
-      resolve_ids_column_alias!
+      resolve_ids_field!
       resolve_uri_option!
       if @subcommand == "explain"
@@ -317,7 +315,6 @@ module Exwiw
         @ids = (raw.is_a?(String) ? raw.split(",") : Array(raw)).map(&:to_s)
       end
       @ids_field ||= config["ids_field"]
-      @ids_column ||= config["ids_column"]
       @scope_column ||= config["scope_column"]
     end
@@ -349,49 +346,33 @@ module Exwiw
       @target_table_name = @target_collection_name
     end
-    # `--ids-column` is the sql-adapter spelling of `--ids-field` (the mongodb
-    # spelling). Both override which column/field `--ids` is matched against on
-    # the target table; internally they fold into the single @ids_field carried
-    # by DumpTarget. Mirror the --target-table/--target-collection split: each
-    # flag is restricted to its adapter family and the two are mutually
-    # exclusive. Runs after resolve_target_collection_alias! so
+    # `--ids-field` overrides which field `--ids` is matched against on the target
+    # collection (defaulting to its primary key). It is mongodb-only and
+    # meaningless without a target collection to constrain. (The SQL adapters have
+    # no equivalent: a scoped target's `scope_column` is the column `--ids` filter
+    # against there.) Runs after resolve_target_collection_alias! so
     # @target_table_name already reflects the collection alias.
-    private def resolve_ids_column_alias!
-      if @ids_field && @ids_column
-        $stderr.puts "Specify only one of --ids-field and --ids-column"
-        exit 1
-      end
+    private def resolve_ids_field!
+      return if @ids_field.nil?
-      if @ids_field && @database_adapter != "mongodb"
-        $stderr.puts "--ids-field is only supported by the mongodb adapter (use --ids-column)"
+      if @database_adapter != "mongodb"
+        $stderr.puts "--ids-field is only supported by the mongodb adapter"
         exit 1
       end
-      if @ids_column
-        sql_adapters = ["mysql", "postgresql", "sqlite"]
-        unless sql_adapters.include?(@database_adapter)
-          $stderr.puts "--ids-column is only supported by the sql adapters (use --ids-field)"
-          exit 1
-        end
-        @ids_field = @ids_column
-      end
-      # --ids-field/--ids-column override the column --ids filters against on
-      # the target table; meaningless without a target table to constrain.
-      if @ids_field && !@target_table_name
-        flag = @ids_column ? "--ids-column" : "--ids-field"
-        $stderr.puts "--target-table is required when #{flag} is specified"
+      unless @target_table_name
+        $stderr.puts "--target-table is required when --ids-field is specified"
         exit 1
       end
     end
-    # `--scope-column` switches to scope-column mode: every table is filtered by a
-    # shared column (`--ids` are its values) instead of anchoring on one
-    # `--target-table`. It is SQL-only and mutually exclusive with the single-target
-    # flags. Runs after resolve_target_collection_alias! (so --target-collection is
-    # already folded into @target_table_name) and before resolve_ids_column_alias!
-    # so the clearer "cannot combine" message wins over the generic ids-column one.
+    # `--scope-column` is **deprecated**: it selected scope-column mode with a
+    # single global column for every table. The preferred way is to declare a
+    # per-table `scope_column:` in the schema config and pass `--target-table`
+    # (the target is then scoped like any other table). The flag still works as
+    # before — SQL-only and mutually exclusive with `--target-table` — but emits a
+    # deprecation warning. Runs after resolve_target_collection_alias! so
+    # --target-collection is already folded into @target_table_name.
     private def resolve_scope_column!
       return if @scope_column.nil?
@@ -406,11 +387,13 @@ module Exwiw
         exit 1
       end
-      if @ids_field || @ids_column
-        flag = @ids_column ? "--ids-column" : "--ids-field"
-        $stderr.puts "--scope-column cannot be combined with #{flag}"
+      if @ids_field
+        $stderr.puts "--scope-column cannot be combined with --ids-field"
         exit 1
       end
+      $stderr.puts "warning: --scope-column is deprecated; declare a per-table `scope_column:` " \
+                   "in the schema config and run with --target-table instead."
     end
     # `--uri` supplies a full connection string (e.g. `mongodb+srv://...`) and is
@@ -537,8 +520,7 @@ module Exwiw
         opts.on("--target-collection=[COLLECTION]", "Alias of --target-table for the mongodb adapter.") { |v| @target_collection_name = v }
         opts.on("--ids=[IDS]", "Comma-separated list of identifiers. Required when --target-table is given.") { |v| @ids = v.split(',') }
         opts.on("--ids-field=[FIELD]", "Field on the target collection that --ids is matched against. Defaults to the primary key. (mongodb adapter only)") { |v| @ids_field = v }
-        opts.on("--ids-column=[COLUMN]", "Column on the target table that --ids is matched against. Defaults to the primary key. (sql adapters only)") { |v| @ids_column = v }
-        opts.on("--scope-column=[COLUMN]", "Filter every table by this shared column (--ids are its values) instead of a single --target-table. Tables lacking it are reached via belongs_to. SQL adapters only; mutually exclusive with --target-table.") { |v| @scope_column = v }
+        opts.on("--scope-column=[COLUMN]", "DEPRECATED. Filter every table by this shared global column (--ids are its values) instead of a single --target-table. SQL adapters only; mutually exclusive with --target-table. Prefer declaring a per-table `scope_column:` in the schema config and running with --target-table.") { |v| @scope_column = v }
         opts.on("--output-format=[FORMAT]", "Output format: insert (default) or copy (PostgreSQL only, export subcommand only)") { |v| @output_format = v }
         opts.on("--insert-only", "Do not generate DELETE SQL files (export subcommand only)") { @insert_only = true }
         opts.on("--after-insert-hook=PATH", "Path to a .rb or .sh post-processing hook executed after all insert/delete files are written (export subcommand only)") do |v|

data/lib/exwiw/query_ast.rb CHANGED Viewed

@@ -48,7 +48,7 @@ module Exwiw
     # Resolves a set of values on `where_column` to the rows' `select_column`
     # via a nested SELECT. Used as the `value` of a WhereClause whose operator
-    # is `:in_subquery`, so `--ids-column`/`--ids-field` can filter related
+    # is `:in_subquery`, so a non primary-key `ids_field` can filter related
     # tables through the target table's primary key:
     #
     #   <table>.<fk> IN (SELECT <table_name>.<select_column>

data/lib/exwiw/query_ast_builder.rb CHANGED Viewed

@@ -12,12 +12,26 @@ module Exwiw
       new(table_name, table_by_name, dump_target, logger).scope_category
     end
+    # Scope-column mode is active when EITHER the named `--target-table` declares a
+    # per-table `scope_column` (the preferred trigger: the target is then scoped
+    # like any other table — its `--ids` are scope-column values, not primary
+    # keys), OR the deprecated `--scope-column` flag is set (a global column with no
+    # target). In both cases every table is filtered by a shared column instead of
+    # being anchored on one named target's primary key.
+    def self.scope_mode?(table_by_name, dump_target)
+      return true unless dump_target.scope_column.nil?
+      return false if dump_target.table_name.nil?
+      target = table_by_name[dump_target.table_name]
+      !!(target && target.respond_to?(:scope_column) && target.scope_column)
+    end
     # Strict pre-flight for scope-column mode: abort if any extractable table
     # cannot be scoped, so an unscoped (potentially sensitive) table is never
     # silently dumped in full. No-op outside scope mode. `tables` is the set of
     # dumpable configs (ignore:true tables are skipped — they are not extracted).
     def self.validate_scope!(tables, table_by_name, dump_target, logger)
-      return if dump_target.scope_column.nil?
+      return unless scope_mode?(table_by_name, dump_target)
       unscopable =
         tables.reject(&:ignore).select do |table|
@@ -27,11 +41,10 @@ module Exwiw
       names = unscopable.map(&:name).sort.join(", ")
       raise ArgumentError,
-            "scope-column mode: #{unscopable.size} table(s) cannot be scoped by " \
-            "'#{dump_target.scope_column}': #{names}. For each, add `scope_exempt: true` " \
-            "to export it in full, set `ignore: true` to skip it, or add a belongs_to path " \
-            "to a table that carries the scope column (use a per-table `scope_column` if the " \
-            "column name differs on that table)."
+            "scope-column mode: #{unscopable.size} table(s) cannot be scoped: #{names}. " \
+            "For each, declare `scope_column: <column>` on the table to filter it directly, " \
+            "add a belongs_to path to a table that carries the scope column, mark it " \
+            "`scope_exempt: true` to export it in full, or set `ignore: true` to skip it."
     end
     attr_reader :table_name, :table_by_name, :dump_target
@@ -268,10 +281,10 @@ module Exwiw
       clauses = []
       if table.name == dump_target.table_name
-        # `--ids-column` (folded into dump_target.ids_field by the CLI) lets
-        # `--ids` match a non primary-key column on the target table; otherwise
-        # fall back to the primary key. Only the target table's filter changes —
-        # downstream foreign-key propagation still keys off the primary key.
+        # When `dump_target.ids_field` is set, `--ids` match a non primary-key
+        # column on the target table; otherwise fall back to the primary key.
+        # Only the target table's filter changes — downstream foreign-key
+        # propagation still keys off the primary key.
         clauses.push Exwiw::QueryAst::WhereClause.new(
           column_name: dump_target.ids_field || table.primary_key,
           operator: :eq,
@@ -306,9 +319,9 @@ module Exwiw
     # Builds the WHERE clause that constrains a `foreign_key` pointing at the
     # dump target. Normally `--ids` are the target's primary keys, so a plain
-    # `foreign_key IN (ids)` suffices. When `--ids-column`/`--ids-field` is set
-    # (dump_target.ids_field), `--ids` match a non primary-key column instead,
-    # so the foreign key must be resolved through the target table:
+    # `foreign_key IN (ids)` suffices. When `dump_target.ids_field` is set, `--ids`
+    # match a non primary-key column instead, so the foreign key must be resolved
+    # through the target table:
     # `foreign_key IN (SELECT pk FROM target WHERE ids_field IN (ids))`.
     # This keeps related-table extraction correct regardless of whether the
     # relation is direct, indirect, or polymorphic.
@@ -370,7 +383,7 @@ module Exwiw
     # ------------------------------------------------------------------
     private def scope_mode?
-      !dump_target.scope_column.nil?
+      self.class.scope_mode?(table_by_name, dump_target)
     end
     # Classifier used by validate_scope! and mirrored by build_scoped below.
@@ -449,8 +462,8 @@ module Exwiw
       ast
     end
-    # The shared column this table is filtered on: a per-table `scope_column`
-    # override when present, otherwise the global `--scope-column`.
+    # The shared column this table is filtered on: a per-table `scope_column` when
+    # declared, otherwise the deprecated global `--scope-column` flag.
     private def resolved_scope_column(table)
       table.scope_column || dump_target.scope_column
     end
@@ -557,10 +570,11 @@ module Exwiw
     end
     private def scope_unscopable_message(table)
-      "Table '#{table.name}' cannot be scoped in scope-column mode: it has no " \
-        "'#{dump_target.scope_column}' column (nor a per-table scope_column override) and no " \
-        "belongs_to path to a table that does. Add `scope_exempt: true` to export it in full, " \
-        "set `ignore: true` to skip it, or add the missing belongs_to."
+      "Table '#{table.name}' cannot be scoped in scope-column mode: it carries no scope " \
+        "column (no per-table `scope_column` is declared on it) and has no belongs_to path " \
+        "to a table that does. Declare `scope_column: <column>` on it, mark it " \
+        "`scope_exempt: true` to export it in full, set `ignore: true` to skip it, or add " \
+        "the missing belongs_to."
     end
   end
 end

data/lib/exwiw/runner.rb CHANGED Viewed

@@ -74,15 +74,16 @@ module Exwiw
         phase = "executing extraction query"
         begin
           results = adapter.execute(query_ast)
-          record_num = results.size
-          if record_num.zero?
-            @logger.info("  No records matched. skip this table.")
-            next
-          end
           insert_idx = (idx + 1).to_s.rjust(3, '0')
           if @output_format == 'copy'
+            # COPY mode (PostgreSQL only) builds the whole body up front rather
+            # than streaming, so it keeps the explicit count + early skip.
+            record_num = results.size
+            if record_num.zero?
+              @logger.info("  No records matched. skip this table.")
+              next
+            end
             phase = "generating COPY statement"
             @logger.debug("  Generate COPY statement...")
             copy_sql = adapter.to_copy_from_stdin(results, table)
@@ -108,19 +109,33 @@ module Exwiw
             # config does not set one (SQL adapters: nil -> one statement, but
             # streamed in bounded buffers; MongoDB: a positive default so the
             # JSONL is chunked). #write_inserts emits bytes identical to the
-            # previous inline chunk loop and returns the statement count.
+            # previous inline chunk loop and returns [statement_count, record_num].
+            #
+            # The row count comes from this single streaming pass, so empty
+            # tables are detected here (and the just-opened file removed) rather
+            # than via a separate upfront count query — eliminating a redundant
+            # second scan of the same filter (e.g. MongoDB count_documents / SQL
+            # SELECT COUNT(*)), which is a full COLLSCAN for an unindexed scope.
             chunk_size = table.bulk_insert_chunk_size || adapter.default_bulk_insert_chunk_size
+            insert_path = File.join(@output_dir, "insert-#{insert_idx}-#{table_name}.#{adapter.output_extension}")
             statement_count = 0
-            File.open(File.join(@output_dir, "insert-#{insert_idx}-#{table_name}.#{adapter.output_extension}"), 'w') do |file|
+            record_num = 0
+            File.open(insert_path, 'w') do |file|
               pre = adapter.pre_insert_sql(table)
               file.puts(pre) if pre
-              statement_count = adapter.write_inserts(file, results, table, chunk_size)
+              statement_count, record_num = adapter.write_inserts(file, results, table, chunk_size)
               file.print("\n")
               post = adapter.post_insert_sql(table)
               file.puts(post) if post
             end
+            if record_num.zero?
+              File.delete(insert_path)
+              @logger.info("  No records matched. skip this table.")
+              next
+            end
             @logger.info("  Generated INSERT statement for #{record_num} records (#{statement_count} statement(s)).")
           end

data/lib/exwiw/schema_generator.rb CHANGED Viewed

@@ -59,6 +59,20 @@ module Exwiw
       groups
     end
+    # Flatten the generated `groups` (the Hash returned by generate! /
+    # build_table_groups) into the list of cross-database belongs_tos the
+    # generator auto-ignored, so a caller (the rake task) can surface them. Each
+    # entry is `{ table:, foreign_key:, target: }`. Empty for single-database apps.
+    def self.cross_database_belongs_tos(groups)
+      groups.values.flatten.flat_map do |table|
+        next [] unless table.respond_to?(:belongs_tos)
+        table.belongs_tos
+             .select { |bt| bt.ignore_type == CROSS_DATABASE_IGNORE_TYPE }
+             .map { |bt| { table: table.name, foreign_key: bt.foreign_key, target: bt.table_name } }
+      end
+    end
     # Reconcile the config files already on disk against the live database,
     # removing only what no longer exists there:
     #
@@ -277,11 +291,17 @@ module Exwiw
     end
     private def aggregate_belongs_tos(models)
-      belongs_to_assocs = models.flat_map { |m| belongs_to_associations_for(m) }
+      belongs_to_assocs = models
+        .flat_map { |m| belongs_to_associations_for(m) }
+        .select { |assoc| assoc.polymorphic? || active_record_target?(assoc) }
+      owner_db = database_name_for(models.first)
       non_polymorphic = belongs_to_assocs
         .reject(&:polymorphic?)
-        .map { |assoc| { table_name: assoc.table_name, foreign_key: assoc.foreign_key } }
+        .map do |assoc|
+          entry = { table_name: assoc.table_name, foreign_key: assoc.foreign_key }
+          annotate_cross_database(entry, owner_db, assoc.klass)
+        end
       # A polymorphic belongs_to (`belongs_to :reviewable, polymorphic: true`)
       # has no single target table. The candidate tables are found by looking up
@@ -292,18 +312,51 @@ module Exwiw
         .select(&:polymorphic?)
         .flat_map do |assoc|
           polymorphic_target_models(assoc.name).map do |target_model|
-            {
+            entry = {
               table_name: target_model.table_name,
               foreign_key: assoc.foreign_key,
               foreign_type: assoc.foreign_type,
               type_value: target_model.polymorphic_name,
             }
+            annotate_cross_database(entry, owner_db, target_model)
           end
         end
       (non_polymorphic + polymorphic).uniq
     end
+    CROSS_DATABASE_IGNORE_TYPE = "cross_database"
+    # A belongs_to whose target model lives in a *different* database than the
+    # owning table cannot be joined: in a Rails multi-database (`connects_to`)
+    # setup each database is exported on its own connection and into its own
+    # per-database config directory, so the target table is absent from the
+    # directory this config is loaded with, and there is no single connection to
+    # join the two on. Leaving the relation live would emit a dangling belongs_to
+    # whose target is never present at extraction time (a nil-target crash in
+    # dependency resolution). So emit it with `ignore: true` (dropped from
+    # extraction at load via TableConfig#reject_ignored_members!) tagged
+    # `ignore_type: "cross_database"`, with a `comment` recording why and pointing
+    # at the recovery path. The foreign-key column itself is still exported as a
+    # plain column; only the join/dependency edge is dropped. Polymorphic
+    # associations are annotated per target, so only the targets that cross a
+    # database boundary are ignored. Single-database apps are unaffected.
+    private def annotate_cross_database(entry, owner_db, target_model)
+      target_db = database_name_for(target_model)
+      return entry if target_db == owner_db
+      entry.merge(
+        ignore: true,
+        ignore_type: CROSS_DATABASE_IGNORE_TYPE,
+        comment: "Cross-database belongs_to: target '#{entry[:table_name]}' is in database " \
+                 "'#{target_db}', not '#{owner_db}'. exwiw exports each database separately and " \
+                 "cannot join across them, so this relation is ignored during extraction; its " \
+                 "foreign-key column '#{entry[:foreign_key]}' is still exported. To extract across " \
+                 "this boundary, declare `scope_column: \"#{entry[:foreign_key]}\"` on this table's " \
+                 "config so its rows are filtered by that foreign-key value directly (scope-column mode).",
+      )
+    end
     # `belongs_to` reflections for a model, with the synthetic HABTM left-side
     # association removed.
     #
@@ -325,6 +378,31 @@ module Exwiw
       assocs.reject { |assoc| assoc.equal?(left) }
     end
+    # Whether a (non-polymorphic) belongs_to points at an ActiveRecord model.
+    #
+    # A belongs_to can target a non-ActiveRecord class — most commonly an
+    # ActiveHash/ActiveYaml master (`belongs_to :equipment, class_name:
+    # "SomeActiveYamlModel"`). active_hash registers these as ordinary
+    # `belongs_to` reflections, yet the target class has no database table, so
+    # `assoc.table_name` (which delegates to `klass.table_name`) raises. Such a
+    # relation is not a DB edge exwiw can join or extract across, so it is
+    # dropped from the generated belongs_tos; the underlying foreign-key column
+    # is still emitted as a plain column. Polymorphic associations cannot be
+    # `klass`-resolved, so callers must screen those out before calling this.
+    #
+    # Resolving the target class behaves differently per non-AR shape: an
+    # ActiveHash reflection returns the class fine (the crash is later, at
+    # `table_name`), while a bare `belongs_to` to a plain class makes AR raise
+    # ArgumentError ("... is not an ActiveRecord::Base subclass") right here when
+    # the klass is computed. Both mean "not a DB relation", so rescue the lookup
+    # and treat either as a non-AR target to skip.
+    private def active_record_target?(assoc)
+      klass = assoc.klass
+      klass.is_a?(Class) && klass < ActiveRecord::Base ? true : false
+    rescue StandardError
+      false
+    end
     # Enumerate the concrete models that can be targets of the polymorphic
     # association `association_name`, by looking them up from every model's
     # `has_many` / `has_one` `as:` option. The order of `concrete_models` depends

data/lib/exwiw/table_config.rb CHANGED Viewed

@@ -154,10 +154,10 @@ module Exwiw
         merged_table.scope_column = scope_column
         # Structural facts of each belongs_to come from the freshly generated
-        # config, but the user-owned `comment`/`ignore`/`references` carry over
-        # when the same relation still exists. (`references` is only consumed by
-        # the MongodbAdapter, but it lives on the shared BelongsTo, so preserve
-        # it here too rather than silently dropping a hand-added value.)
+        # config, but the user-owned `comment`/`ignore`/`ignore_type`/`references`
+        # carry over when the same relation still exists. (`references` is only
+        # consumed by the MongodbAdapter, but it lives on the shared BelongsTo, so
+        # preserve it here too rather than silently dropping a hand-added value.)
         receiver_belongs_to_by_identity = belongs_tos.each_with_object({}) { |bt, hash| hash[bt.identity] = bt }
         merged_table.belongs_tos =
           passed_table.belongs_tos.map do |passed_belongs_to|
@@ -165,6 +165,7 @@ module Exwiw
             if receiver_belongs_to
               passed_belongs_to.comment = receiver_belongs_to.comment if receiver_belongs_to.comment
               passed_belongs_to.ignore = receiver_belongs_to.ignore unless receiver_belongs_to.ignore.nil?
+              passed_belongs_to.ignore_type = receiver_belongs_to.ignore_type if receiver_belongs_to.ignore_type
               passed_belongs_to.references = receiver_belongs_to.references if receiver_belongs_to.references
             end
             passed_belongs_to

data/lib/exwiw/version.rb CHANGED Viewed

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module Exwiw
-  VERSION = "0.7.0"
+  VERSION = "0.8.1"
 end

data/lib/tasks/exwiw.rake CHANGED Viewed

@@ -17,9 +17,25 @@ namespace :exwiw do
     task generate: :environment do
       require "exwiw"
-      Exwiw::SchemaGenerator.from_rails_application(
+      groups = Exwiw::SchemaGenerator.from_rails_application(
         output_dir: resolve_schema_dir.call,
       ).generate!
+      # Surface cross-database belongs_tos the generator auto-ignored: these
+      # cannot be joined (each database is exported separately), so they were
+      # emitted with ignore:true and need a decision from the user.
+      cross = Exwiw::SchemaGenerator.cross_database_belongs_tos(groups)
+      unless cross.empty?
+        $stderr.puts "exwiw: detected #{cross.size} cross-database belongs_to(s), " \
+                     "each emitted with ignore:true (exwiw cannot join across databases). " \
+                     "The foreign-key column is still exported."
+        cross.each do |c|
+          $stderr.puts "  - #{c[:table]}.#{c[:foreign_key]} -> #{c[:target]} (in another database)"
+        end
+        $stderr.puts "  To extract across a boundary, declare `scope_column: <foreign_key>` on the " \
+                     "owning table's config (scope-column mode). Otherwise the relation stays ignored " \
+                     "and the foreign key is exported as a plain value."
+      end
     end
     desc "Remove tables/columns from the schema config that no longer exist in the application"

metadata CHANGED Viewed

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: exwiw
 version: !ruby/object:Gem::Version
-  version: 0.7.0
+  version: 0.8.1
 platform: ruby
 authors:
 - Shia
@@ -36,6 +36,7 @@ files:
 - CHANGELOG.md
 - LICENSE.txt
 - README.md
+- docs/mongodb-dump-parallelism-2x-notes.md
 - docs/mongodb-scoping-fullscan-notes.md
 - docs/optimization-notes.md
 - docs/optimize-mongodb-export-with-native-ext.md
@@ -46,6 +47,7 @@ files:
 - docs/plans/2026-05-29-rails-managed-tables.md
 - docs/plans/2026-05-31-ids-column-for-sql-adapters.md
 - docs/plans/2026-06-19-mongodb-export-remove-parallelism-native-ext.md
+- docs/scope-column-redesign.md
 - docs/sql-dump-optimization-notes.md
 - exe/exwiw
 - ext/exwiw/ext_json/ext_json.c