exwiw 0.7.0 → 0.8.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 12765a8f130dec0055149beb7ba69b37c9b548758cfbd8dd807c1499a9fbf0b5
4
- data.tar.gz: 9165cda22d6e4eb2ff97a6d386510a04f37d353e2b31ea4b6bada2fc1ad2adb7
3
+ metadata.gz: 80610cc2d13a87793171563b2b8e4ff0568135eb987b550dd9f89bffe89b1d67
4
+ data.tar.gz: 5e9f4976043571647163e9f743c3c8fb709a29944b590c55424dce374cca3269
5
5
  SHA512:
6
- metadata.gz: eb9ad3e65e4b8756574b5c97fec28be38cc708604cbdaffcb772bd8958b183a067dffafa44078705ed00346be27a9aa81bd68a212b7bcecf635a84d6a7f86adc
7
- data.tar.gz: fd47a8459abec5a56ab2c64c7cecc4ffa916608b2cf9278a68a2985defa18812b4f09b70a54b8d6ac498a305022644a0a0d923358a5a06143dedcb550aab7e84
6
+ metadata.gz: 8508aaa2d9cba3310a9ee4d4b940682b894bbafcab9672e4ccca5fb8f1a4b43e6fe9c36cbfcd0efe8f12c5308fa9be33aeb16d53b6d98ef71692ec2e30076be0
7
+ data.tar.gz: 8a1a8187e7547f4ce0eeaee458d36b16e56c08c1f2275977127de56a49eb2a3b25e22d8c52d3678fa6ea91e553ebf6c69bf943e5bb108e3984fa8fc34f03a2f0
data/CHANGELOG.md CHANGED
@@ -2,6 +2,25 @@
2
2
 
3
3
  ## [Unreleased]
4
4
 
5
+ ## [0.8.1] - 2026-06-24
6
+
7
+ ### Fixed
8
+
9
+ - **`schema:generate` skips a `belongs_to` whose target is not an ActiveRecord model instead of crashing.** A `belongs_to` can point at a non-ActiveRecord class — most commonly an ActiveHash/ActiveYaml master (`belongs_to :equipment, class_name: "SomeActiveYamlModel"`). active_hash registers these as ordinary `belongs_to` reflections, but the target class has no database table, so resolving its `table_name` raised and aborted generation. Such a relation is not a DB edge exwiw can join or extract across, so it is now dropped from the generated belongs_tos; the underlying foreign-key column is still emitted as a plain column. A bare `belongs_to` to a plain non-AR class — which makes ActiveRecord raise while resolving the target — is treated the same way. Polymorphic associations are unaffected.
10
+
11
+ ## [0.8.0] - 2026-06-24
12
+
13
+ ### Added
14
+
15
+ - **Scope-column mode is now declared per table.** A table that should be filtered by a shared scope/tenant column declares `scope_column: <column>` in its schema config. Naming any such table as `--target-table` then runs in scope-column mode: `--ids` are values of that shared column (not primary keys), and every table is filtered by its own `scope_column` — tables that lack one are reached via `belongs_to`, `scope_exempt: true` tables are dumped in full, and a table reachable by none aborts the run (so nothing is silently dumped unscoped). This is the primary way to extract across a foreign key that cannot be joined — most importantly a cross-database `belongs_to` (its join is impossible, but the FK column is still filterable): declare `scope_column: "<that foreign key>"` on the owning table and it is filtered by the FK value directly, with no join. SQL adapters only.
16
+ - **`schema:generate` detects a cross-database `belongs_to` and ignores just the relation — not the table, not the foreign-key column.** A `belongs_to` whose target model lives in a different database (Rails multi-database `connects_to`) cannot be joined — each database is exported separately — so the *belongs_to entry* is emitted with `ignore: true` and `ignore_type: "cross_database"` and a `comment` recording why and pointing at the per-table `scope_column` declaration for cross-boundary extraction. The owning table is still exported normally and its foreign-key column is still exported as a plain column; only the join/dependency edge is dropped at load (otherwise the dangling cross-database target would crash dependency resolution). Polymorphic associations are handled per target. The task prints a summary of every cross-database relation it ignored. (`ignore_type` is now also preserved across regeneration by `TableConfig#merge`.) Single-database applications are unaffected.
17
+
18
+ ### Changed (breaking)
19
+
20
+ - **`--ids-column` is removed.** The SQL-adapter flag that matched `--ids` against a non primary-key column on the target table has no remaining use case; for a scoped table, the per-table `scope_column` is the column `--ids` filter against. (The mongodb-only `--ids-field` is unaffected.)
21
+ - **`--scope-column` is deprecated.** The global flag still selects scope-column mode (SQL-only, mutually exclusive with `--target-table`) but now emits a deprecation warning. Prefer declaring a per-table `scope_column:` in the schema config and running with `--target-table`. A per-table `scope_column` takes precedence over the flag for any table that sets both.
22
+ - **A scoped target's `--ids` now mean scope values, not primary keys.** When `--target-table` names a table that declares a `scope_column`, exwiw runs in scope-column mode and `--ids` are matched against that shared column; the target is scoped like any other table rather than anchored by primary key. A table that declares a `scope_column` can therefore no longer be single-extracted by primary key.
23
+
5
24
  ## [0.7.0] - 2026-06-23
6
25
 
7
26
  ### Changed
@@ -16,7 +35,7 @@
16
35
 
17
36
  ### Added
18
37
 
19
- - Optimize memory usage https://github.com/heyinc/exwiw/pull/118
38
+ - Optimize memory usage https://github.com/heyinc/exwiw/pull/118
20
39
  - **MongoDB: optional native (C) encoder for the Extended-JSON dump path** (no flag, byte-identical output, pure-Ruby fallback). Encoding each document to MongoDB Relaxed Extended JSON — previously `JSON.generate(doc.as_extended_json(mode: :relaxed))`, which rebuilds the whole document into an intermediate transformed Hash tree and then walks it again — was the dominant per-document CPU cost (~82% of serialization on embed-heavy data). A new C extension (`ext/exwiw/ext_json/`) emits the JSONL line in a single native tree-walk. It formats the structural bulk plus the leaves that dominate a dumped document — `Hash`, `Array`, `String`, fixnum `Integer`, `true`/`false`/`nil`, `BSON::ObjectId` (`_id`), and in-range `Time` (the Mongoid `created_at`/`updated_at` timestamps) — and delegates everything else (`Float`, out-of-int64 `Integer`, out-of-range `Time`, `Symbol`, `Decimal128`, …) back to the exact pure-Ruby path, so the output is provably byte-for-byte identical. On a 30-embedded-post timestamp-heavy document this serializes ~2.8× faster. With `gem install exwiw` the extension compiles automatically; hosts that cannot compile (JRuby/TruffleRuby, no toolchain) fall back to the pure-Ruby encoder, so exwiw stays installable as a pure-Ruby gem. See [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md).
21
40
 
22
41
  ## [0.5.3] - 2026-06-19
data/README.md CHANGED
@@ -79,7 +79,7 @@ exwiw \
79
79
  --log-level=info
80
80
  ```
81
81
 
82
- By default `--ids` are matched against the target table's primary key. `--ids-column=COLUMN` matches them against a different column instead (e.g. `--target-table=users --ids=alice@example.com --ids-column=email`). Related tables are still extracted correctly: their foreign keys are resolved through the target via a subquery (`WHERE fk IN (SELECT pk FROM target WHERE COLUMN IN (...))`), so only the target table's filter column changes. This is the SQL-adapter counterpart of the mongodb `--ids-field`; the two are mutually exclusive and each is rejected by the other adapter family. Note: if `COLUMN` is itself masked, re-running `delete-*` against an already-imported (masked) dump won't match, so prefer a stable natural key.
82
+ By default `--ids` are matched against the target table's primary key. If the target table declares a per-table `scope_column`, exwiw runs in [scope-column mode](#scope-column-mode) instead `--ids` are then values of that shared column, and the table is scoped like any other rather than anchored by primary key.
83
83
 
84
84
  When `--target-table` and `--ids` are omitted, exwiw dumps all tables defined in `--schema-dir`:
85
85
 
@@ -129,18 +129,31 @@ exwiw explain \
129
129
 
130
130
  The `--output-dir`, `--output-format`, `--insert-only`, and `--after-insert-hook` options are dump-specific and rejected when used with `explain`.
131
131
 
132
- ### Scope-column mode (`--scope-column`)
132
+ ### Scope-column mode
133
133
 
134
134
  The default `--target-table` extraction assumes the schema converges on a single
135
135
  root: every table is reached by walking `belongs_to` toward that one table. Some
136
136
  schemas are not shaped that way — many independent top-level tables each carry the
137
- *same* scope/tenant column (e.g. `tenant_id`, `account_uuid`) and there is no
138
- single root. Choosing one of them as `--target-table` would leave the others
139
- unrelated to it, and an unrelated table is dumped in full a problem if it holds
140
- personal data.
137
+ *same* scope/tenant column (e.g. `tenant_id`, `business_entity_id`), and a foreign
138
+ key that **cannot be joined** (most importantly a cross-database `belongs_to`,
139
+ whose join is impossible but whose FK column is still filterable) is not reached at
140
+ all. Choosing one table as `--target-table` would leave the others unrelated to it,
141
+ and an unrelated table is dumped in full — a problem if it holds personal data.
141
142
 
142
- `--scope-column` handles this shape: instead of one anchor table, **every table is
143
- filtered by a shared column** whose values are `--ids`.
143
+ Scope-column mode handles this shape: instead of anchoring on one table's primary
144
+ key, **every table is filtered by a shared column** whose values are `--ids`.
145
+ Declare that column per table in the schema config with `scope_column:`:
146
+
147
+ ```json
148
+ {
149
+ "name": "shops",
150
+ "primary_key": "id",
151
+ "scope_column": "business_entity_id",
152
+ "columns": [{ "name": "id" }, { "name": "name" }, { "name": "business_entity_id" }]
153
+ }
154
+ ```
155
+
156
+ Then name any scoped table as `--target-table` and pass the scope values as `--ids`:
144
157
 
145
158
  ```bash
146
159
  exwiw \
@@ -148,15 +161,21 @@ exwiw \
148
161
  --host=localhost --port=5432 --user=reader \
149
162
  --database=app_production \
150
163
  --schema-dir=exwiw/schema \
151
- --scope-column=tenant_id \
152
- --ids=42,43 \
164
+ --target-table=shops --ids=42,43 \
153
165
  --output-dir=dump
154
166
  ```
155
167
 
168
+ Because `shops` declares a `scope_column`, exwiw switches to scope-column mode: the
169
+ `--ids` (`42,43`) are **`business_entity_id` values, not shop primary keys**, and
170
+ `shops` itself is scoped by `business_entity_id IN (42,43)` like every other scoped
171
+ table — it is *not* used as a primary-key anchor. (A table that declares a
172
+ `scope_column` therefore can no longer be single-extracted by primary key.)
173
+
156
174
  Each table is resolved as follows:
157
175
 
158
- - **Carries the scope column** `WHERE scope_column IN (ids)`.
159
- - **Lacks it but `belongs_to` reaches a table that has it** exwiw joins up to the
176
+ - **Declares the scope column** (`scope_column:`, or carries the global column of
177
+ the deprecated `--scope-column` flag)`WHERE scope_column IN (ids)`.
178
+ - **Does not, but `belongs_to` reaches a table that does** → exwiw joins up to the
160
179
  nearest such table and applies the scope filter there (the same join machinery
161
180
  the single-target mode uses).
162
181
  - **`belongs_to` a parent that is itself scoped but carries no scope column of its
@@ -168,13 +187,22 @@ Each table is resolved as follows:
168
187
  a single forward hop and a single unambiguous scopable parent.
169
188
  - **Cannot be scoped at all** (no scope column and no path to one) → exwiw
170
189
  **aborts** and lists the offending tables, so an unscoped table is never silently
171
- dumped in full. For each, either add a `belongs_to` path, set `ignore: true` to
172
- skip it, or mark it `scope_exempt: true` (below) to export it in full.
190
+ dumped in full. For each, either declare a `scope_column`, add a `belongs_to`
191
+ path, set `ignore: true` to skip it, or mark it `scope_exempt: true` (below) to
192
+ export it in full.
193
+
194
+ Scope-column mode is SQL-only (mysql / postgresql / sqlite). It works with `exwiw
195
+ explain` too, which is the recommended way to preview the queries before exporting.
196
+
197
+ #### Cross-database foreign keys
173
198
 
174
- `--scope-column` is SQL-only (mysql / postgresql / sqlite) and mutually exclusive
175
- with `--target-table`, `--target-collection`, `--ids-column`, and `--ids-field`.
176
- It works with `exwiw explain` too, which is the recommended way to preview the
177
- queries before exporting.
199
+ The motivating case for declaring a `scope_column` is a foreign key that cannot be
200
+ joined: when a `belongs_to` target lives in a different database (see the
201
+ cross-database `belongs_to` note under the generator), that join is impossible, but
202
+ the foreign-key *column* is still present and can be filtered directly. Declaring
203
+ `scope_column: "<that foreign key>"` on the owning table scopes it by the column
204
+ value, with no join — `schema:generate` points this out in the ignored relation's
205
+ `comment`.
178
206
 
179
207
  #### `scope_exempt` (intentional full dump)
180
208
 
@@ -193,11 +221,11 @@ opt out of the strict check and be exported in full:
193
221
  Rails-managed tables (`schema_migrations`, `ar_internal_metadata`) are treated as
194
222
  exempt automatically.
195
223
 
196
- #### Per-table `scope_column` override
224
+ #### Per-table `scope_column` and the value space
197
225
 
198
- scope-column mode assumes a single shared **value** space — the same `--ids` apply
199
- to every scoped table. If a table stores that same value under a differently named
200
- column, override the column name for that table:
226
+ Scope-column mode assumes a single shared **value** space — the same `--ids` apply
227
+ to every scoped table. Each table names its own column, so a table that stores that
228
+ same value under a differently named column simply declares that name:
201
229
 
202
230
  ```json
203
231
  {
@@ -211,6 +239,15 @@ column, override the column name for that table:
211
239
  Both `scope_exempt` and `scope_column` are user-maintained and preserved across
212
240
  `schema:generate` regeneration (the generators never emit them).
213
241
 
242
+ #### Deprecated: the `--scope-column` flag
243
+
244
+ Before per-table declarations, scope-column mode was selected with a global
245
+ `--scope-column=COLUMN` flag (every table filtered by that one column, `--ids` its
246
+ values, no `--target-table`). The flag still works — SQL-only and mutually
247
+ exclusive with `--target-table` — but is **deprecated** and emits a warning; prefer
248
+ declaring a per-table `scope_column` and running with `--target-table`. A per-table
249
+ `scope_column` takes precedence over the flag for any table that sets both.
250
+
214
251
  ### Config file (`exwiw.yml`)
215
252
 
216
253
  Options you would otherwise repeat on every run can be kept in a YAML config file. Pass it with `--config=PATH`; when `--config` is omitted, exwiw automatically loads `exwiw.yml` (or `exwiw.yaml`) from the current directory if present.
@@ -226,7 +263,7 @@ output_format: insert # insert | copy
226
263
  insert_only: false
227
264
  after_insert_hook: hooks/seed.rb
228
265
  log_level: info # debug | info
229
- # target_table / ids / ids_field / ids_column / scope_column may also be set here
266
+ # target_table / ids / ids_field / scope_column may also be set here
230
267
  ```
231
268
 
232
269
  With the file above, only the connection details need to be supplied on the CLI:
@@ -298,6 +335,8 @@ exwiw/schema/
298
335
 
299
336
  Each database keeps its own Rails migration history, so a `schema_migrations` (and `ar_internal_metadata`) entry is emitted under every database that contains one — the example above shows `primary/schema_migrations.json` and would also produce `analytics/schema_migrations.json` when the analytics database has its own migration table. Single-database applications are unaffected and continue to write files flat into the output directory.
300
337
 
338
+ A `belongs_to` whose target model lives in a *different* database (e.g. a `primary` model referencing an `analytics` one) cannot be joined: each database is exported on its own connection and into its own subdirectory, so the target table is absent from the directory this config is loaded with. `schema:generate` detects such a relation (by comparing the owning and target models' database config names) and emits it with `ignore: true` and `ignore_type: "cross_database"`, recording why in the `comment`; the relation is then dropped from extraction at load time, while the foreign-key column itself is still exported as a plain column. Polymorphic associations are handled per target, so only the targets that cross a database boundary are ignored. The task also prints a summary of every cross-database `belongs_to` it ignored. **To extract across such a boundary, declare `scope_column: "<foreign_key>"` on the owning table (see [scope-column mode](#scope-column-mode)) so its rows are filtered by the foreign-key value directly** — there is no join, so the cross-database boundary is not a problem there.
339
+
301
340
  **Limitations**
302
341
 
303
342
  - The rails-managed table *names* are resolved from the global `ActiveRecord::Base.schema_migrations_table_name` / `internal_metadata_table_name` accessors, which are shared across all connections. A per-database override of these names is not detected, so such a table will be missing from that database's generated configs.
@@ -658,7 +697,7 @@ The MongoDB adapter is experimental. To use it:
658
697
  - The MongoDB adapter consumes a separate config type, `MongodbCollectionConfig`, with MongoDB-native naming. Use `fields` (instead of the SQL adapters' `columns`), and set `"primary_key": "_id"`. Foreign keys (`shop_id`, `user_id`, ...) stay as ordinary fields.
659
698
  - `--ids` values are coerced to the type actually stored in `_id` before filtering: integer-looking ids become `Integer`, 24-char hex ids become `BSON::ObjectId` (Mongoid's default `_id` type — a plain String would never match an ObjectId), and any other string is left as-is.
660
699
  - `--target-collection=COLLECTION` is a mongodb-only alias of `--target-table` (use whichever reads better for MongoDB). Specifying both, or using `--target-collection` with a non-mongodb adapter, is an error.
661
- - `--ids-field=FIELD` matches `--ids` against `FIELD` on the target collection instead of its primary key (e.g. `--target-collection=users --ids=a@example.com --ids-field=email`). Downstream foreign-key propagation still keys off the primary key, so only the target collection's filter changes. Unlike the primary-key path, the supplied ids are **not** type-coerced (the stored type of a custom field is unknown), so pass values matching the field's actual type. This flag is **mongodb-only**; the SQL adapters use `--ids-column` instead (see below).
700
+ - `--ids-field=FIELD` matches `--ids` against `FIELD` on the target collection instead of its primary key (e.g. `--target-collection=users --ids=a@example.com --ids-field=email`). Downstream foreign-key propagation still keys off the primary key, so only the target collection's filter changes. Unlike the primary-key path, the supplied ids are **not** type-coerced (the stored type of a custom field is unknown), so pass values matching the field's actual type. This flag is **mongodb-only** (the SQL adapters have no equivalent).
662
701
  - Large or embedded-document-heavy dumps are streamed automatically: the adapter reads the collection through a lazy cursor (not `.to_a`) and writes JSONL in chunks, so peak memory is bounded by the chunk size rather than the collection size — no flag to set. Encoding each document to MongoDB Extended JSON is accelerated by an **optional native (C) extension** that compiles automatically on `gem install`; where it cannot compile, exwiw falls back to a byte-identical pure-Ruby encoder. See [`docs/optimization-notes.md`](docs/optimization-notes.md) for the performance investigation and [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md) for the native encoder's design. Benchmark your own data with `script/bench_mongodb_dump.rb`.
663
702
  - Output is JSON Lines (`insert-{idx}-{collection}.jsonl`) using MongoDB Extended JSON (relaxed mode). Import with `mongoimport`:
664
703
  ```bash
@@ -0,0 +1,146 @@
1
+ # MongoDB dump: reaching the 2× target with inter-collection fork parallelism
2
+
3
+ This continues [`optimization-notes.md`](./optimization-notes.md). That document
4
+ ends at "the remaining cost is the serial BSON→Ruby decode, and the lever left is
5
+ a native encoder bounded by the same serial-decode ceiling." This note records a
6
+ **different** lever that was measured to clear a **2× end-to-end** target on a
7
+ large real extraction, **byte-identically**, and explains why it succeeds where
8
+ the previously-removed `--parallel-workers` did not.
9
+
10
+ The measurements below are from a large staging extraction (~300 non-empty
11
+ collections, target → 1 store → ~hundreds of scoped rows, plus reference/master
12
+ data dumped in full). Reproduce with the bench harness against a restored backup
13
+ (see [`mongodb-scoping-fullscan-notes.md`](./mongodb-scoping-fullscan-notes.md)
14
+ for serving a WiredTiger backup as a standalone `mongod`).
15
+
16
+ ## Where the time goes (warm cache, native ext compiled)
17
+
18
+ Serial baseline ≈ **6.8–7.0 s**. Per-collection instrumentation (wrapping
19
+ `build_query` / `execute` / `write_inserts`) attributes ~6.0 s to the write pass.
20
+ Splitting one collection's write pass into "drain the cursor only" vs
21
+ "drain + native-encode" shows **decode is 64–92 %** of per-collection cost — the
22
+ Mongo driver's BSON→Ruby-Hash decode dominates; the native Extended-JSON encode is
23
+ already cheap. The single heaviest collection (a 237k-doc reference table) is
24
+ **1.31 s by itself** and cannot be split without intra-collection partitioning.
25
+
26
+ Classifying every processed collection by how it is scoped (using the
27
+ genuine-anchor model from `mongodb-scoping-fullscan-notes.md`) gives three groups
28
+ with very different dependency shapes:
29
+
30
+ | group | meaning | count | time |
31
+ |---|---|---|---|
32
+ | **leaf** | no `belongs_to` → dumped in full, no input deps | 54 | ~3.5 s |
33
+ | **genuine** | reachable to the dump target via `belongs_to` (the scoped DAG) | 245 | ~0.8 s |
34
+ | **ref_bt** | has `belongs_to` but NOT reachable to target (reference data scoped by strict-AND fallback) | 23 | ~2.1 s |
35
+
36
+ The crucial structural facts, all derivable from the configs:
37
+
38
+ - **genuine never consumes leaf/ref_bt `@state`.** A genuine collection drops its
39
+ reference (leaf) parents when it has a genuine anchor, and a ref_bt parent is by
40
+ definition not genuine. So the whole scoped DAG is independent of the heavy
41
+ reference dumps — it can run **concurrently** with them.
42
+ - **leaf has no input deps at all** (no `belongs_to`) → embarrassingly parallel.
43
+ - **ref_bt consumes only leaf `@state` and other ref_bt `@state`** (never genuine).
44
+ Its internal edges form shallow weakly-connected components.
45
+
46
+ ## Why process-level (not thread/serialization) parallelism is the right lever
47
+
48
+ `optimization-notes.md` removed two parallelism attempts. The distinction that
49
+ matters:
50
+
51
+ - `--parallel-workers` parallelized **serialization only** (the parent still did
52
+ the single serial cursor decode), so it was capped at ~1.1–1.4× — decode is the
53
+ bottleneck and it was untouched.
54
+ - `--cursor-parallel` parallelized **decode within a collection** (≈2.5–5.5×) but
55
+ forced `sort(_id)`, changing row order → not byte-identical, could not be the
56
+ default.
57
+
58
+ Forking **whole collections** across processes parallelizes the decode (each
59
+ worker decodes its own collections) **while preserving each collection's natural
60
+ order** — so it is both decode-parallel and byte-identical. That is the property
61
+ neither removed approach had.
62
+
63
+ ## The schedule that was measured at ≥2×
64
+
65
+ One parent process plus a fork pool of `N` workers:
66
+
67
+ 1. **Phase 1 (concurrent).** Fork the leaf pool (`N` workers, leaves assigned by
68
+ LPT bin-packing on output-size weight so the 1.31 s collection sits alone).
69
+ Concurrently the parent (a) dumps the schema (parent-only, needs no `@state`)
70
+ and (b) processes the **whole genuine DAG optimistically** — without leaf
71
+ `@state` yet — recording each collection's row count.
72
+ 2. **Barrier + leaf `@state`.** Wait for the leaf pool; load the small Marshal
73
+ sidecars each consumed leaf wrote (only the ~11 leaves a downstream collection
74
+ references need one; the heavy 1.31 s table is not referenced, so no sidecar).
75
+ 3. **Phase 2 (cascade reprocess).** Only **8** genuine collections reference a leaf
76
+ at all, and **0** are statically forced to use it — a genuine collection only
77
+ falls back to a leaf clause at runtime when its genuine anchor turns out empty
78
+ (e.g. an empty intermediate parent). Reprocess those 8 with leaf `@state` now
79
+ present; if a reprocessed collection's row count changes, enqueue its genuine
80
+ children and repeat. Non-fallback collections (e.g. the DAG root one level below
81
+ the dump target, which has a genuine anchor and drops its leaf parents)
82
+ reprocess to an identical result and stop the cascade,
83
+ so this terminates after re-touching only the genuinely affected handful.
84
+ 4. **Phase 3 (ref_bt components).** Fork the 23 ref_bt collections as
85
+ **dependency-closed weakly-connected components** (over intra-ref_bt edges) in a
86
+ single pool — no level barriers, no cross-worker IPC. Each worker owns whole
87
+ components and processes their members in topological order, seeded with the
88
+ (sliced) leaf `@state` its members reference. Components are assigned by LPT.
89
+
90
+ `@state` IPC is just Marshal sidecars for the ~11 referenced leaves; everything
91
+ else is COW-inherited at fork or kept inside a component.
92
+
93
+ ## Measured result (warm cache, same backup)
94
+
95
+ Internal compute timer (excludes the fixed ~0.5 s Ruby/bundler startup both paths
96
+ pay):
97
+
98
+ | workers | compute | speedup |
99
+ |---|---|---|
100
+ | serial | 7.01 s | 1.00× |
101
+ | N=2 | 3.87 s | 1.81× |
102
+ | N=4 | 2.80 s | **2.50×** |
103
+ | N=6 | 2.79 s | **2.51×** |
104
+
105
+ Full wall-clock (includes Ruby startup — the honest end-to-end number):
106
+
107
+ | workers | wall | speedup |
108
+ |---|---|---|
109
+ | serial | 6.79 s | 1.00× |
110
+ | N=2 | 4.06 s | 1.67× |
111
+ | N=4 | 3.24 s | **2.10×** |
112
+ | N=6 | 3.21 s | **2.12×** |
113
+
114
+ **Byte-identical:** all 189 output files (schema + inserts) match the serial run
115
+ exactly — same filenames (the insert index is taken over the full ordering,
116
+ including `ignore:true` collections, exactly as the Runner numbers them) and same
117
+ content (0/189 cmp mismatches).
118
+
119
+ The curve **saturates at N≈4**: past that the wall time is bounded by the single
120
+ 1.31 s leaf decode plus the longest ref_bt chain (~0.6 s) plus startup. Going
121
+ beyond ~2.5× needs **intra-collection** decode parallelism (the `_id`-range cursor
122
+ split that was removed for changing row order) or a **native BSON→Extended-JSON
123
+ transcoder** that skips the Ruby-Hash decode for unmasked reference collections
124
+ (decode being 64–92 % of the cost, this is the largest single-process lever left
125
+ and composes with the fork approach).
126
+
127
+ ## Operational note: this needs vCPUs to spend
128
+
129
+ The win is real cores doing real decode in parallel. On the current ECS task
130
+ (`cpu: 2048` = **2 vCPU**) the schedule reaches only ~1.67× (N=2). Clearing 2×
131
+ requires scaling to **≥4 vCPU** (`cpu: 4096`); 8 vCPU does not help further given
132
+ the N≈4 saturation. Memory is unchanged — the existing streaming keeps each worker
133
+ bounded by chunk size, and `N×` that stays well under the 7 GiB container limit.
134
+
135
+ ## Status
136
+
137
+ This is a **measured, byte-identical proof** (bench prototype, not yet integrated
138
+ into the Runner/CLI). Integrating it means re-introducing process orchestration
139
+ and `@state` sidecar IPC — the machinery `optimization-notes.md` deliberately
140
+ removed as over-engineered for a flag. It is recorded here because, unlike that
141
+ removed work, this schedule is (a) byte-identical by construction, (b) measured
142
+ past the 2× target on a real extraction, and (c) the lever the task explicitly
143
+ invited (scale the task to go faster). A production version would gate it behind a
144
+ worker-count option, fall back to serial where `fork` is unavailable
145
+ (Windows/JRuby), and reuse the existing genuine-anchor classification to derive
146
+ the three groups.
@@ -109,6 +109,18 @@ The full design (byte-identity strategy, fast-path vs Ruby-delegate types,
109
109
  optional-load + pure-Ruby fallback, packaging) is in
110
110
  [`optimize-mongodb-export-with-native-ext.md`](./optimize-mongodb-export-with-native-ext.md).
111
111
 
112
+ The native ext shipped (0.6.1+) and the dump is now **decode-bound** (the driver's
113
+ BSON→Ruby decode is 64–92 % of per-collection cost). The single-process levers are
114
+ exhausted short of 2×. The lever that *was* measured past 2× — **byte-identical
115
+ inter-collection fork parallelism** (parallelize whole collections across
116
+ processes, so the decode itself runs in parallel while each collection keeps its
117
+ natural order) — is written up in
118
+ [`mongodb-dump-parallelism-2x-notes.md`](./mongodb-dump-parallelism-2x-notes.md).
119
+ Unlike the `--parallel-workers` attempt removed above (which parallelized
120
+ serialization only and so was capped at ~1.4×), this parallelizes the actual
121
+ bottleneck and reaches ~2.1× wall / ~2.5× compute on a real extraction, given
122
+ ≥4 vCPU to spend.
123
+
112
124
  ## Methodology notes (for re-running)
113
125
 
114
126
  - The CPU hotspot reproduces **with no database**: the Mongo driver hands back
@@ -0,0 +1,107 @@
1
+ # scope-column まわりの再設計メモ(2026-06-23)
2
+
3
+ > ステータス: **(a) で決定**。hybrid は畳む(PR #128 の hybrid 中核を revert し、per-table scope-column モードに作り直す)。確定仕様は末尾「決定: (a) の確定仕様」を参照。
4
+
5
+ ## 背景 / 解きたい問題
6
+
7
+ Rails の multi-database(`connects_to`)で、別 DB のモデルを指す `belongs_to`(cross-database)は **join できない**。
8
+
9
+ ```ruby
10
+ # main DB
11
+ class Customer < ApplicationRecord
12
+ belongs_to :tenant, class_name: "Org::Tenant" # tenants は別 DB
13
+ end
14
+ # org DB
15
+ class Org::Tenant < OrgApplicationRecord
16
+ end
17
+ ```
18
+
19
+ `customers.tenant_id` は join では絞れないが、**列の値(tenant_id)で直接フィルタ**はできる。これを使ってテナント単位 / 対象単位でデータ抽出したい。
20
+
21
+ ## いまある実装(PR #128, draft / branch `feat/hybrid-target-scope-column`)
22
+
23
+ - **hybrid モード**: `--target-table` と `--scope-column` を併用。各テーブルを「target 到達」OR「scope 列到達」の**和集合(OR)**で解決。scope 値は target から**導出**する
24
+ (`scope_column IN (SELECT target.scope_column FROM target WHERE <target ids>)`)。SQL only / insert-only / 単一アンカーは従来と byte 同一。
25
+ - **generator**: cross-database `belongs_to` を検出し、**relation だけ** `ignore: true` + `ignore_type: "cross_database"`(テーブル本体・FK 列は維持)。生成時にサマリ出力。
26
+ - **R2**: per-table `scope_column` があるのに `--scope-column` 未指定ならエラー。
27
+
28
+ ## 再設計の方針(この議論での決定)
29
+
30
+ 1. **per-table `scope_column` を唯一の宣言にする。** scoped なテーブルは config に `scope_column: <列>` を明示。
31
+ グローバル `--scope-column`(正準名)+ per-table オーバーライド、という二層と「オーバーライド」概念は**廃止**。
32
+ 各テーブルが自分の列を宣言する(差異名も自然に表現できる。全 scope 列は同じ値空間を持つ前提)。
33
+ 2. **`--scope-column` フラグは deprecated**(警告を出して per-table 宣言へ誘導)。
34
+ 3. **`--ids-column` は削除**。「自然キー(email 等)で target を指定する」ユースケースが無いため。
35
+ scoped な target では `scope_column` が「`--ids` を当てる列」を兼ねる。
36
+ 4. **generator の誘導文を更新**: 「越境は `--scope-column` を使え」→「**このテーブルに `scope_column:` を宣言しろ**」。
37
+ 検出・relation ignore・FK 列維持・サマリ自体は据え置き。
38
+ 5. **R2 は撤回**(グローバルフラグ概念が無くなるため、per-table 宣言が正道になる)。
39
+
40
+ ここまでは hybrid の有無に関わらず確定。
41
+
42
+ ## 未解決: hybrid は要るのか?
43
+
44
+ 抽出ニーズが2種類あり、片方は pure scope-column モードでは実現できない。
45
+
46
+ ### (a) テナント丸ごと抽出 — pure scope で足りる
47
+ 「tenant 42 のデータを全部」。各 scoped テーブルを自分の `scope_column IN (42)` で絞るだけ。未 scoped テーブルは belongs_to 経由(via_path / referenced_by)。→ **hybrid 不要**。
48
+
49
+ ### (b) 対象を絞った最小抽出 + cross-DB 行 — hybrid が必要
50
+ 「shop 1 のデータだけ最小で欲しい。ただし join できない cross-DB の `customers`(shop 1 のテナントに属する)も含めたい」。
51
+
52
+ - pure scope(テナント単位)だと **tenant 42 全体**を引いてしまい多すぎる(shop 1 以外の shop も customers も入る)。
53
+ - 素の `--target-table=shops`(join 抽出)だと、cross-DB の `customers` には**到達できない**(join 不可)。
54
+ - → **shop 1 を主キーで anchor しつつ、その tenant_id を導出して cross-DB テーブルを scope する hybrid でしか実現できない**。
55
+
56
+ テスト用データ抽出は (b)(最小・対象限定)になりがちなので、hybrid は捨てがたい。
57
+
58
+ ### CLI 上の衝突(ここが論点)
59
+
60
+ `--ids-column` を消したので、**scoped な target の `--ids` を 2 通りに解釈できない**:
61
+
62
+ - 解釈A: `--target-table=shops --ids=42` の 42 = `scope_column` の値(テナント抽出)
63
+ - 解釈B: `--target-table=shops --ids=1` の 1 = 主キー(hybrid: shop 1 を anchor して導出)
64
+
65
+ 同じ CLI 形では両立しない。→ どちらかを選ぶ:
66
+
67
+ - **解釈A を採る**なら hybrid は畳む(pure scope-column モードに作り直す)。(b) は失われる。
68
+ - **hybrid を残す**なら scoped target の `--ids` は**主キー**(解釈B)に戻し、テナント抽出(a)は **target 無しの pure scope**(または deprecated な `--scope-column`)で行う。
69
+
70
+ ### 推奨(たたき台)
71
+
72
+ **hybrid を残す**案。理由: 元々の動機(cross-DB FK を含めて1回で対象を抽出)と (b) のニーズが一致し、すでに #128 で動いている。
73
+ per-table `scope_column` 化とも両立できる ―― hybrid の導出元 scope 列を「`--scope-column` フラグ」ではなく **target の config の `scope_column`** から取れば、フラグ無し・`--ids-column` 無しで成立する:
74
+
75
+ | やりたいこと | コマンド | ids の意味 | 挙動 |
76
+ |---|---|---|---|
77
+ | shop 1 を最小抽出(cross-DB 含む) | `--target-table=shops --ids=1` | 主キー | shop 1 + 到達分。cross-DB/scoped テーブルは shops.`scope_column` から導出した tenant に scope(hybrid) |
78
+ | tenant 42 を丸ごと | `--ids=42`(target 無し) | scope 値 | 各 scoped テーブルを自分の `scope_column IN (42)`(pure scope) |
79
+ | 通常の単体抽出(scope 無し) | `--target-table=orders --ids=1`(orders に scope_column 無し) | 主キー | 従来どおり |
80
+
81
+ この案だと「解釈A(scoped target で ids=scope 値)」は捨て、テナント抽出は no-target の pure scope に寄せることになる。
82
+
83
+ ## 決定: (a) の確定仕様
84
+
85
+ (a) のみ採用。hybrid((b))は今回入れない。
86
+
87
+ ### モデル
88
+ - scoped なテーブルは config に `scope_column: <列>` を宣言(per-table のみ。グローバル正準名もオーバーライドも無し。全 scope 列は同じ値空間)。
89
+
90
+ ### 起動と挙動
91
+ | コマンド | 条件 | ids の意味 | 挙動 |
92
+ |---|---|---|---|
93
+ | `--target-table=shops --ids=42` | shops が `scope_column` を宣言 | **scope 値** | scope-column モード。`scope_column` を宣言する全テーブルが自分の `scope_column IN (42)`。未宣言テーブルは belongs_to 経由(via_path / referenced_by)。`scope_exempt` は full dump。解決不能は abort(validate_scope!)。target 名は PK フィルタには使わない(scoped テーブルの1つとして scope される) |
94
+ | `--target-table=orders --ids=1` | orders は `scope_column` 無し | 主キー | 従来の単体 target 抽出(変更なし) |
95
+ | `--scope-column=C --ids=42` | deprecated | scope 値 | 旧 pure scope(no target、global 列 C)。**警告**を出す。`--target-table` とは排他(pre-#128 に戻す) |
96
+ | `--ids=42` のみ | target も flag も無し | — | エラー(従来どおり) |
97
+
98
+ - `--ids-column` は**削除**。scoped target では `scope_column` が「`--ids` を当てる列」を兼ねる。
99
+ - scoped なテーブルを主キーで単体抽出することはできなくなる(解釈A の代償。受容済み)。
100
+
101
+ ### 実装スコープ(PR #128 を作り直す)
102
+ 1. builder: #128 の hybrid 一式(build_hybrid / compose_or / pk_in_subquery / mode 引数 / 導出版 scope_where_clause / validate_hybrid! / R2=validate_scope_column_usage!)と `:or`(query_ast + 3 adapter)を **revert**。`scope_where_clause` はリテラル `scope_column IN (ids)` に戻す。
103
+ 2. builder: `scope_mode?` を「`--scope-column` フラグ or **target が `scope_column` を宣言**」で判定するよう拡張。`resolved_scope_column = table.scope_column || (deprecated flag)`。
104
+ 3. CLI: `--ids-column` 削除、`--scope-column` を deprecated 警告化 + `--target-table` と排他(pre-#128 に戻す)、hybrid 用の `--insert-only` 必須を撤去。
105
+ 4. runner/explain: `validate_hybrid!` / `validate_scope_column_usage!` 呼び出しを撤去。`validate_scope!` の起動条件を新 `scope_mode?` に合わせる。
106
+ 5. generator: cross-DB 検出は維持。comment / サマリの誘導を「`--scope-column` を使え」→「このテーブルに `scope_column:` を宣言しろ」に更新。
107
+ 6. docs / CHANGELOG: hybrid 記述を撤去し、本モデルで書き直す。breaking(`--ids-column` 削除・`--scope-column` deprecated・scoped target の `--ids` 意味変更)を明記。tests も差し替え。
@@ -35,32 +35,50 @@ module Exwiw
35
35
  # #to_bulk_insert per chunk joined by "\n" (verified by
36
36
  # insert_output_snapshot_spec), but only one ~STREAM_FLUSH_BYTES buffer is
37
37
  # resident at a time rather than the entire table's INSERT string. Returns
38
- # the number of statements written.
38
+ # [statement_count, record_count]; record_count is tallied during the single
39
+ # streaming drain so the Runner needs no separate SELECT COUNT(*) pass.
39
40
  def write_inserts(io, results, table, chunk_size)
40
41
  chunks = chunk_size ? results.each_slice(chunk_size) : [results]
41
42
  statement_count = 0
43
+ record_count = 0
42
44
  chunks.each do |chunk_rows|
43
45
  io.print("\n") if statement_count.positive?
44
- stream_single_insert(io, chunk_rows, table)
46
+ record_count += stream_single_insert(io, chunk_rows, table)
45
47
  statement_count += 1
46
48
  end
47
- statement_count
49
+ [statement_count, record_count]
48
50
  end
49
51
 
50
52
  # Emit one `INSERT INTO ... VALUES <tuples>;` statement to `io`, building
51
53
  # and flushing the value tuples STREAM_FLUSH_ROWS at a time so the full
52
- # statement text is never held in memory at once. Each slice is one fast
53
- # map+join; the ",\n" between slices reproduces the same separator
54
+ # statement text is never held in memory at once. Each flush is one fast
55
+ # map+join; the ",\n" between flushes reproduces the same separator
54
56
  # #to_bulk_insert puts between every tuple, so the bytes are identical.
57
+ # Returns the number of rows written, tallied as the stream is drained.
58
+ #
59
+ # Rows are buffered off `#each` rather than `rows.each_slice(...)`:
60
+ # each_slice queries the receiver's `#size`, which on a streaming result
61
+ # issues a redundant `SELECT COUNT(*)` (a second full pass over the same
62
+ # filter). Manual buffering walks the cursor exactly once.
55
63
  private def stream_single_insert(io, rows, table)
56
64
  io.print(insert_header(table))
57
65
  first = true
58
- rows.each_slice(STREAM_FLUSH_ROWS) do |slice|
66
+ record_count = 0
67
+ buffer = []
68
+ flush = lambda do
59
69
  io.print(",\n") unless first
60
70
  first = false
61
- io.print(slice.map { |row| insert_tuple(row) }.join(",\n"))
71
+ io.print(buffer.map { |row| insert_tuple(row) }.join(",\n"))
72
+ record_count += buffer.size
73
+ buffer.clear
62
74
  end
75
+ rows.each do |row|
76
+ buffer << row
77
+ flush.call if buffer.size >= STREAM_FLUSH_ROWS
78
+ end
79
+ flush.call unless buffer.empty?
63
80
  io.print(";")
81
+ record_count
64
82
  end
65
83
 
66
84
  private def insert_tuple(row)
data/lib/exwiw/adapter.rb CHANGED
@@ -140,15 +140,44 @@ module Exwiw
140
140
  # @param results [Enumerable] rows/documents from #execute
141
141
  # @param table the table/collection config
142
142
  # @param chunk_size [Integer, nil] rows per statement (nil => one statement)
143
+ # @return [Array(Integer, Integer)] [statement_count, record_count]
144
+ #
145
+ # record_count is tallied from the rows actually streamed here so the
146
+ # Runner no longer needs a separate upfront count query (MongoDB's
147
+ # count_documents / the SQL adapters' SELECT COUNT(*)) just to log the row
148
+ # count and decide whether an empty table can be skipped. That count was a
149
+ # second full pass over the same filter — a wasted COLLSCAN when the scope
150
+ # is unindexed; counting during the single streaming pass removes it.
151
+ #
152
+ # Batches rows by accumulating into a buffer and flushing every chunk_size
153
+ # rows, rather than `results.each_slice(chunk_size)`. This is deliberate:
154
+ # `Enumerable#each_slice` calls `#size` on the receiver as an allocation
155
+ # hint, which for a streaming result (MongoDB's StreamingResult) issues a
156
+ # `count_documents` — the very redundant count this single-pass design
157
+ # removes. Driving the buffer off `#each` keeps the result's `#size`
158
+ # untouched, so the cursor is walked exactly once. The chunk boundaries and
159
+ # "\n" separators reproduce the each_slice output byte-for-byte.
160
+ #
161
+ # chunk_size is always positive for callers of this default (MongoDB); the
162
+ # SQL adapters pass nil and override #write_inserts, so the unbounded
163
+ # nil-branch buffer is never reached here in practice.
143
164
  def write_inserts(io, results, table, chunk_size)
144
- chunks = chunk_size ? results.each_slice(chunk_size) : [results]
145
165
  statement_count = 0
146
- chunks.each do |chunk_rows|
166
+ record_count = 0
167
+ buffer = []
168
+ flush = lambda do
147
169
  io.print("\n") if statement_count.positive?
148
- io.print(to_bulk_insert(chunk_rows, table))
170
+ io.print(to_bulk_insert(buffer, table))
149
171
  statement_count += 1
172
+ record_count += buffer.size
173
+ buffer.clear
174
+ end
175
+ results.each do |row|
176
+ buffer << row
177
+ flush.call if chunk_size && buffer.size >= chunk_size
150
178
  end
151
- statement_count
179
+ flush.call unless buffer.empty?
180
+ [statement_count, record_count]
152
181
  end
153
182
 
154
183
  # Run the database-specific EXPLAIN for the given query and return the
data/lib/exwiw/cli.rb CHANGED
@@ -34,7 +34,6 @@ module Exwiw
34
34
  target_collection
35
35
  ids
36
36
  ids_field
37
- ids_column
38
37
  scope_column
39
38
  ].freeze
40
39
 
@@ -77,7 +76,6 @@ module Exwiw
77
76
  @target_collection_name = nil
78
77
  @ids = []
79
78
  @ids_field = nil
80
- @ids_column = nil
81
79
  @scope_column = nil
82
80
  @output_format = nil
83
81
  @insert_only = nil
@@ -165,7 +163,7 @@ module Exwiw
165
163
 
166
164
  resolve_target_collection_alias!
167
165
  resolve_scope_column!
168
- resolve_ids_column_alias!
166
+ resolve_ids_field!
169
167
  resolve_uri_option!
170
168
 
171
169
  if @subcommand == "explain"
@@ -317,7 +315,6 @@ module Exwiw
317
315
  @ids = (raw.is_a?(String) ? raw.split(",") : Array(raw)).map(&:to_s)
318
316
  end
319
317
  @ids_field ||= config["ids_field"]
320
- @ids_column ||= config["ids_column"]
321
318
  @scope_column ||= config["scope_column"]
322
319
  end
323
320
 
@@ -349,49 +346,33 @@ module Exwiw
349
346
  @target_table_name = @target_collection_name
350
347
  end
351
348
 
352
- # `--ids-column` is the sql-adapter spelling of `--ids-field` (the mongodb
353
- # spelling). Both override which column/field `--ids` is matched against on
354
- # the target table; internally they fold into the single @ids_field carried
355
- # by DumpTarget. Mirror the --target-table/--target-collection split: each
356
- # flag is restricted to its adapter family and the two are mutually
357
- # exclusive. Runs after resolve_target_collection_alias! so
349
+ # `--ids-field` overrides which field `--ids` is matched against on the target
350
+ # collection (defaulting to its primary key). It is mongodb-only and
351
+ # meaningless without a target collection to constrain. (The SQL adapters have
352
+ # no equivalent: a scoped target's `scope_column` is the column `--ids` filter
353
+ # against there.) Runs after resolve_target_collection_alias! so
358
354
  # @target_table_name already reflects the collection alias.
359
- private def resolve_ids_column_alias!
360
- if @ids_field && @ids_column
361
- $stderr.puts "Specify only one of --ids-field and --ids-column"
362
- exit 1
363
- end
355
+ private def resolve_ids_field!
356
+ return if @ids_field.nil?
364
357
 
365
- if @ids_field && @database_adapter != "mongodb"
366
- $stderr.puts "--ids-field is only supported by the mongodb adapter (use --ids-column)"
358
+ if @database_adapter != "mongodb"
359
+ $stderr.puts "--ids-field is only supported by the mongodb adapter"
367
360
  exit 1
368
361
  end
369
362
 
370
- if @ids_column
371
- sql_adapters = ["mysql", "postgresql", "sqlite"]
372
- unless sql_adapters.include?(@database_adapter)
373
- $stderr.puts "--ids-column is only supported by the sql adapters (use --ids-field)"
374
- exit 1
375
- end
376
-
377
- @ids_field = @ids_column
378
- end
379
-
380
- # --ids-field/--ids-column override the column --ids filters against on
381
- # the target table; meaningless without a target table to constrain.
382
- if @ids_field && !@target_table_name
383
- flag = @ids_column ? "--ids-column" : "--ids-field"
384
- $stderr.puts "--target-table is required when #{flag} is specified"
363
+ unless @target_table_name
364
+ $stderr.puts "--target-table is required when --ids-field is specified"
385
365
  exit 1
386
366
  end
387
367
  end
388
368
 
389
- # `--scope-column` switches to scope-column mode: every table is filtered by a
390
- # shared column (`--ids` are its values) instead of anchoring on one
391
- # `--target-table`. It is SQL-only and mutually exclusive with the single-target
392
- # flags. Runs after resolve_target_collection_alias! (so --target-collection is
393
- # already folded into @target_table_name) and before resolve_ids_column_alias!
394
- # so the clearer "cannot combine" message wins over the generic ids-column one.
369
+ # `--scope-column` is **deprecated**: it selected scope-column mode with a
370
+ # single global column for every table. The preferred way is to declare a
371
+ # per-table `scope_column:` in the schema config and pass `--target-table`
372
+ # (the target is then scoped like any other table). The flag still works as
373
+ # before SQL-only and mutually exclusive with `--target-table` — but emits a
374
+ # deprecation warning. Runs after resolve_target_collection_alias! so
375
+ # --target-collection is already folded into @target_table_name.
395
376
  private def resolve_scope_column!
396
377
  return if @scope_column.nil?
397
378
 
@@ -406,11 +387,13 @@ module Exwiw
406
387
  exit 1
407
388
  end
408
389
 
409
- if @ids_field || @ids_column
410
- flag = @ids_column ? "--ids-column" : "--ids-field"
411
- $stderr.puts "--scope-column cannot be combined with #{flag}"
390
+ if @ids_field
391
+ $stderr.puts "--scope-column cannot be combined with --ids-field"
412
392
  exit 1
413
393
  end
394
+
395
+ $stderr.puts "warning: --scope-column is deprecated; declare a per-table `scope_column:` " \
396
+ "in the schema config and run with --target-table instead."
414
397
  end
415
398
 
416
399
  # `--uri` supplies a full connection string (e.g. `mongodb+srv://...`) and is
@@ -537,8 +520,7 @@ module Exwiw
537
520
  opts.on("--target-collection=[COLLECTION]", "Alias of --target-table for the mongodb adapter.") { |v| @target_collection_name = v }
538
521
  opts.on("--ids=[IDS]", "Comma-separated list of identifiers. Required when --target-table is given.") { |v| @ids = v.split(',') }
539
522
  opts.on("--ids-field=[FIELD]", "Field on the target collection that --ids is matched against. Defaults to the primary key. (mongodb adapter only)") { |v| @ids_field = v }
540
- opts.on("--ids-column=[COLUMN]", "Column on the target table that --ids is matched against. Defaults to the primary key. (sql adapters only)") { |v| @ids_column = v }
541
- opts.on("--scope-column=[COLUMN]", "Filter every table by this shared column (--ids are its values) instead of a single --target-table. Tables lacking it are reached via belongs_to. SQL adapters only; mutually exclusive with --target-table.") { |v| @scope_column = v }
523
+ opts.on("--scope-column=[COLUMN]", "DEPRECATED. Filter every table by this shared global column (--ids are its values) instead of a single --target-table. SQL adapters only; mutually exclusive with --target-table. Prefer declaring a per-table `scope_column:` in the schema config and running with --target-table.") { |v| @scope_column = v }
542
524
  opts.on("--output-format=[FORMAT]", "Output format: insert (default) or copy (PostgreSQL only, export subcommand only)") { |v| @output_format = v }
543
525
  opts.on("--insert-only", "Do not generate DELETE SQL files (export subcommand only)") { @insert_only = true }
544
526
  opts.on("--after-insert-hook=PATH", "Path to a .rb or .sh post-processing hook executed after all insert/delete files are written (export subcommand only)") do |v|
@@ -48,7 +48,7 @@ module Exwiw
48
48
 
49
49
  # Resolves a set of values on `where_column` to the rows' `select_column`
50
50
  # via a nested SELECT. Used as the `value` of a WhereClause whose operator
51
- # is `:in_subquery`, so `--ids-column`/`--ids-field` can filter related
51
+ # is `:in_subquery`, so a non primary-key `ids_field` can filter related
52
52
  # tables through the target table's primary key:
53
53
  #
54
54
  # <table>.<fk> IN (SELECT <table_name>.<select_column>
@@ -12,12 +12,26 @@ module Exwiw
12
12
  new(table_name, table_by_name, dump_target, logger).scope_category
13
13
  end
14
14
 
15
+ # Scope-column mode is active when EITHER the named `--target-table` declares a
16
+ # per-table `scope_column` (the preferred trigger: the target is then scoped
17
+ # like any other table — its `--ids` are scope-column values, not primary
18
+ # keys), OR the deprecated `--scope-column` flag is set (a global column with no
19
+ # target). In both cases every table is filtered by a shared column instead of
20
+ # being anchored on one named target's primary key.
21
+ def self.scope_mode?(table_by_name, dump_target)
22
+ return true unless dump_target.scope_column.nil?
23
+ return false if dump_target.table_name.nil?
24
+
25
+ target = table_by_name[dump_target.table_name]
26
+ !!(target && target.respond_to?(:scope_column) && target.scope_column)
27
+ end
28
+
15
29
  # Strict pre-flight for scope-column mode: abort if any extractable table
16
30
  # cannot be scoped, so an unscoped (potentially sensitive) table is never
17
31
  # silently dumped in full. No-op outside scope mode. `tables` is the set of
18
32
  # dumpable configs (ignore:true tables are skipped — they are not extracted).
19
33
  def self.validate_scope!(tables, table_by_name, dump_target, logger)
20
- return if dump_target.scope_column.nil?
34
+ return unless scope_mode?(table_by_name, dump_target)
21
35
 
22
36
  unscopable =
23
37
  tables.reject(&:ignore).select do |table|
@@ -27,11 +41,10 @@ module Exwiw
27
41
 
28
42
  names = unscopable.map(&:name).sort.join(", ")
29
43
  raise ArgumentError,
30
- "scope-column mode: #{unscopable.size} table(s) cannot be scoped by " \
31
- "'#{dump_target.scope_column}': #{names}. For each, add `scope_exempt: true` " \
32
- "to export it in full, set `ignore: true` to skip it, or add a belongs_to path " \
33
- "to a table that carries the scope column (use a per-table `scope_column` if the " \
34
- "column name differs on that table)."
44
+ "scope-column mode: #{unscopable.size} table(s) cannot be scoped: #{names}. " \
45
+ "For each, declare `scope_column: <column>` on the table to filter it directly, " \
46
+ "add a belongs_to path to a table that carries the scope column, mark it " \
47
+ "`scope_exempt: true` to export it in full, or set `ignore: true` to skip it."
35
48
  end
36
49
 
37
50
  attr_reader :table_name, :table_by_name, :dump_target
@@ -268,10 +281,10 @@ module Exwiw
268
281
  clauses = []
269
282
 
270
283
  if table.name == dump_target.table_name
271
- # `--ids-column` (folded into dump_target.ids_field by the CLI) lets
272
- # `--ids` match a non primary-key column on the target table; otherwise
273
- # fall back to the primary key. Only the target table's filter changes —
274
- # downstream foreign-key propagation still keys off the primary key.
284
+ # When `dump_target.ids_field` is set, `--ids` match a non primary-key
285
+ # column on the target table; otherwise fall back to the primary key.
286
+ # Only the target table's filter changes — downstream foreign-key
287
+ # propagation still keys off the primary key.
275
288
  clauses.push Exwiw::QueryAst::WhereClause.new(
276
289
  column_name: dump_target.ids_field || table.primary_key,
277
290
  operator: :eq,
@@ -306,9 +319,9 @@ module Exwiw
306
319
 
307
320
  # Builds the WHERE clause that constrains a `foreign_key` pointing at the
308
321
  # dump target. Normally `--ids` are the target's primary keys, so a plain
309
- # `foreign_key IN (ids)` suffices. When `--ids-column`/`--ids-field` is set
310
- # (dump_target.ids_field), `--ids` match a non primary-key column instead,
311
- # so the foreign key must be resolved through the target table:
322
+ # `foreign_key IN (ids)` suffices. When `dump_target.ids_field` is set, `--ids`
323
+ # match a non primary-key column instead, so the foreign key must be resolved
324
+ # through the target table:
312
325
  # `foreign_key IN (SELECT pk FROM target WHERE ids_field IN (ids))`.
313
326
  # This keeps related-table extraction correct regardless of whether the
314
327
  # relation is direct, indirect, or polymorphic.
@@ -370,7 +383,7 @@ module Exwiw
370
383
  # ------------------------------------------------------------------
371
384
 
372
385
  private def scope_mode?
373
- !dump_target.scope_column.nil?
386
+ self.class.scope_mode?(table_by_name, dump_target)
374
387
  end
375
388
 
376
389
  # Classifier used by validate_scope! and mirrored by build_scoped below.
@@ -449,8 +462,8 @@ module Exwiw
449
462
  ast
450
463
  end
451
464
 
452
- # The shared column this table is filtered on: a per-table `scope_column`
453
- # override when present, otherwise the global `--scope-column`.
465
+ # The shared column this table is filtered on: a per-table `scope_column` when
466
+ # declared, otherwise the deprecated global `--scope-column` flag.
454
467
  private def resolved_scope_column(table)
455
468
  table.scope_column || dump_target.scope_column
456
469
  end
@@ -557,10 +570,11 @@ module Exwiw
557
570
  end
558
571
 
559
572
  private def scope_unscopable_message(table)
560
- "Table '#{table.name}' cannot be scoped in scope-column mode: it has no " \
561
- "'#{dump_target.scope_column}' column (nor a per-table scope_column override) and no " \
562
- "belongs_to path to a table that does. Add `scope_exempt: true` to export it in full, " \
563
- "set `ignore: true` to skip it, or add the missing belongs_to."
573
+ "Table '#{table.name}' cannot be scoped in scope-column mode: it carries no scope " \
574
+ "column (no per-table `scope_column` is declared on it) and has no belongs_to path " \
575
+ "to a table that does. Declare `scope_column: <column>` on it, mark it " \
576
+ "`scope_exempt: true` to export it in full, set `ignore: true` to skip it, or add " \
577
+ "the missing belongs_to."
564
578
  end
565
579
  end
566
580
  end
data/lib/exwiw/runner.rb CHANGED
@@ -74,15 +74,16 @@ module Exwiw
74
74
  phase = "executing extraction query"
75
75
  begin
76
76
  results = adapter.execute(query_ast)
77
- record_num = results.size
78
-
79
- if record_num.zero?
80
- @logger.info(" No records matched. skip this table.")
81
- next
82
- end
83
77
  insert_idx = (idx + 1).to_s.rjust(3, '0')
84
78
 
85
79
  if @output_format == 'copy'
80
+ # COPY mode (PostgreSQL only) builds the whole body up front rather
81
+ # than streaming, so it keeps the explicit count + early skip.
82
+ record_num = results.size
83
+ if record_num.zero?
84
+ @logger.info(" No records matched. skip this table.")
85
+ next
86
+ end
86
87
  phase = "generating COPY statement"
87
88
  @logger.debug(" Generate COPY statement...")
88
89
  copy_sql = adapter.to_copy_from_stdin(results, table)
@@ -108,19 +109,33 @@ module Exwiw
108
109
  # config does not set one (SQL adapters: nil -> one statement, but
109
110
  # streamed in bounded buffers; MongoDB: a positive default so the
110
111
  # JSONL is chunked). #write_inserts emits bytes identical to the
111
- # previous inline chunk loop and returns the statement count.
112
+ # previous inline chunk loop and returns [statement_count, record_num].
113
+ #
114
+ # The row count comes from this single streaming pass, so empty
115
+ # tables are detected here (and the just-opened file removed) rather
116
+ # than via a separate upfront count query — eliminating a redundant
117
+ # second scan of the same filter (e.g. MongoDB count_documents / SQL
118
+ # SELECT COUNT(*)), which is a full COLLSCAN for an unindexed scope.
112
119
  chunk_size = table.bulk_insert_chunk_size || adapter.default_bulk_insert_chunk_size
113
120
 
121
+ insert_path = File.join(@output_dir, "insert-#{insert_idx}-#{table_name}.#{adapter.output_extension}")
114
122
  statement_count = 0
115
- File.open(File.join(@output_dir, "insert-#{insert_idx}-#{table_name}.#{adapter.output_extension}"), 'w') do |file|
123
+ record_num = 0
124
+ File.open(insert_path, 'w') do |file|
116
125
  pre = adapter.pre_insert_sql(table)
117
126
  file.puts(pre) if pre
118
- statement_count = adapter.write_inserts(file, results, table, chunk_size)
127
+ statement_count, record_num = adapter.write_inserts(file, results, table, chunk_size)
119
128
  file.print("\n")
120
129
  post = adapter.post_insert_sql(table)
121
130
  file.puts(post) if post
122
131
  end
123
132
 
133
+ if record_num.zero?
134
+ File.delete(insert_path)
135
+ @logger.info(" No records matched. skip this table.")
136
+ next
137
+ end
138
+
124
139
  @logger.info(" Generated INSERT statement for #{record_num} records (#{statement_count} statement(s)).")
125
140
  end
126
141
 
@@ -59,6 +59,20 @@ module Exwiw
59
59
  groups
60
60
  end
61
61
 
62
+ # Flatten the generated `groups` (the Hash returned by generate! /
63
+ # build_table_groups) into the list of cross-database belongs_tos the
64
+ # generator auto-ignored, so a caller (the rake task) can surface them. Each
65
+ # entry is `{ table:, foreign_key:, target: }`. Empty for single-database apps.
66
+ def self.cross_database_belongs_tos(groups)
67
+ groups.values.flatten.flat_map do |table|
68
+ next [] unless table.respond_to?(:belongs_tos)
69
+
70
+ table.belongs_tos
71
+ .select { |bt| bt.ignore_type == CROSS_DATABASE_IGNORE_TYPE }
72
+ .map { |bt| { table: table.name, foreign_key: bt.foreign_key, target: bt.table_name } }
73
+ end
74
+ end
75
+
62
76
  # Reconcile the config files already on disk against the live database,
63
77
  # removing only what no longer exists there:
64
78
  #
@@ -277,11 +291,17 @@ module Exwiw
277
291
  end
278
292
 
279
293
  private def aggregate_belongs_tos(models)
280
- belongs_to_assocs = models.flat_map { |m| belongs_to_associations_for(m) }
294
+ belongs_to_assocs = models
295
+ .flat_map { |m| belongs_to_associations_for(m) }
296
+ .select { |assoc| assoc.polymorphic? || active_record_target?(assoc) }
297
+ owner_db = database_name_for(models.first)
281
298
 
282
299
  non_polymorphic = belongs_to_assocs
283
300
  .reject(&:polymorphic?)
284
- .map { |assoc| { table_name: assoc.table_name, foreign_key: assoc.foreign_key } }
301
+ .map do |assoc|
302
+ entry = { table_name: assoc.table_name, foreign_key: assoc.foreign_key }
303
+ annotate_cross_database(entry, owner_db, assoc.klass)
304
+ end
285
305
 
286
306
  # A polymorphic belongs_to (`belongs_to :reviewable, polymorphic: true`)
287
307
  # has no single target table. The candidate tables are found by looking up
@@ -292,18 +312,51 @@ module Exwiw
292
312
  .select(&:polymorphic?)
293
313
  .flat_map do |assoc|
294
314
  polymorphic_target_models(assoc.name).map do |target_model|
295
- {
315
+ entry = {
296
316
  table_name: target_model.table_name,
297
317
  foreign_key: assoc.foreign_key,
298
318
  foreign_type: assoc.foreign_type,
299
319
  type_value: target_model.polymorphic_name,
300
320
  }
321
+ annotate_cross_database(entry, owner_db, target_model)
301
322
  end
302
323
  end
303
324
 
304
325
  (non_polymorphic + polymorphic).uniq
305
326
  end
306
327
 
328
+ CROSS_DATABASE_IGNORE_TYPE = "cross_database"
329
+
330
+ # A belongs_to whose target model lives in a *different* database than the
331
+ # owning table cannot be joined: in a Rails multi-database (`connects_to`)
332
+ # setup each database is exported on its own connection and into its own
333
+ # per-database config directory, so the target table is absent from the
334
+ # directory this config is loaded with, and there is no single connection to
335
+ # join the two on. Leaving the relation live would emit a dangling belongs_to
336
+ # whose target is never present at extraction time (a nil-target crash in
337
+ # dependency resolution). So emit it with `ignore: true` (dropped from
338
+ # extraction at load via TableConfig#reject_ignored_members!) tagged
339
+ # `ignore_type: "cross_database"`, with a `comment` recording why and pointing
340
+ # at the recovery path. The foreign-key column itself is still exported as a
341
+ # plain column; only the join/dependency edge is dropped. Polymorphic
342
+ # associations are annotated per target, so only the targets that cross a
343
+ # database boundary are ignored. Single-database apps are unaffected.
344
+ private def annotate_cross_database(entry, owner_db, target_model)
345
+ target_db = database_name_for(target_model)
346
+ return entry if target_db == owner_db
347
+
348
+ entry.merge(
349
+ ignore: true,
350
+ ignore_type: CROSS_DATABASE_IGNORE_TYPE,
351
+ comment: "Cross-database belongs_to: target '#{entry[:table_name]}' is in database " \
352
+ "'#{target_db}', not '#{owner_db}'. exwiw exports each database separately and " \
353
+ "cannot join across them, so this relation is ignored during extraction; its " \
354
+ "foreign-key column '#{entry[:foreign_key]}' is still exported. To extract across " \
355
+ "this boundary, declare `scope_column: \"#{entry[:foreign_key]}\"` on this table's " \
356
+ "config so its rows are filtered by that foreign-key value directly (scope-column mode).",
357
+ )
358
+ end
359
+
307
360
  # `belongs_to` reflections for a model, with the synthetic HABTM left-side
308
361
  # association removed.
309
362
  #
@@ -325,6 +378,31 @@ module Exwiw
325
378
  assocs.reject { |assoc| assoc.equal?(left) }
326
379
  end
327
380
 
381
+ # Whether a (non-polymorphic) belongs_to points at an ActiveRecord model.
382
+ #
383
+ # A belongs_to can target a non-ActiveRecord class — most commonly an
384
+ # ActiveHash/ActiveYaml master (`belongs_to :equipment, class_name:
385
+ # "SomeActiveYamlModel"`). active_hash registers these as ordinary
386
+ # `belongs_to` reflections, yet the target class has no database table, so
387
+ # `assoc.table_name` (which delegates to `klass.table_name`) raises. Such a
388
+ # relation is not a DB edge exwiw can join or extract across, so it is
389
+ # dropped from the generated belongs_tos; the underlying foreign-key column
390
+ # is still emitted as a plain column. Polymorphic associations cannot be
391
+ # `klass`-resolved, so callers must screen those out before calling this.
392
+ #
393
+ # Resolving the target class behaves differently per non-AR shape: an
394
+ # ActiveHash reflection returns the class fine (the crash is later, at
395
+ # `table_name`), while a bare `belongs_to` to a plain class makes AR raise
396
+ # ArgumentError ("... is not an ActiveRecord::Base subclass") right here when
397
+ # the klass is computed. Both mean "not a DB relation", so rescue the lookup
398
+ # and treat either as a non-AR target to skip.
399
+ private def active_record_target?(assoc)
400
+ klass = assoc.klass
401
+ klass.is_a?(Class) && klass < ActiveRecord::Base ? true : false
402
+ rescue StandardError
403
+ false
404
+ end
405
+
328
406
  # Enumerate the concrete models that can be targets of the polymorphic
329
407
  # association `association_name`, by looking them up from every model's
330
408
  # `has_many` / `has_one` `as:` option. The order of `concrete_models` depends
@@ -154,10 +154,10 @@ module Exwiw
154
154
  merged_table.scope_column = scope_column
155
155
 
156
156
  # Structural facts of each belongs_to come from the freshly generated
157
- # config, but the user-owned `comment`/`ignore`/`references` carry over
158
- # when the same relation still exists. (`references` is only consumed by
159
- # the MongodbAdapter, but it lives on the shared BelongsTo, so preserve
160
- # it here too rather than silently dropping a hand-added value.)
157
+ # config, but the user-owned `comment`/`ignore`/`ignore_type`/`references`
158
+ # carry over when the same relation still exists. (`references` is only
159
+ # consumed by the MongodbAdapter, but it lives on the shared BelongsTo, so
160
+ # preserve it here too rather than silently dropping a hand-added value.)
161
161
  receiver_belongs_to_by_identity = belongs_tos.each_with_object({}) { |bt, hash| hash[bt.identity] = bt }
162
162
  merged_table.belongs_tos =
163
163
  passed_table.belongs_tos.map do |passed_belongs_to|
@@ -165,6 +165,7 @@ module Exwiw
165
165
  if receiver_belongs_to
166
166
  passed_belongs_to.comment = receiver_belongs_to.comment if receiver_belongs_to.comment
167
167
  passed_belongs_to.ignore = receiver_belongs_to.ignore unless receiver_belongs_to.ignore.nil?
168
+ passed_belongs_to.ignore_type = receiver_belongs_to.ignore_type if receiver_belongs_to.ignore_type
168
169
  passed_belongs_to.references = receiver_belongs_to.references if receiver_belongs_to.references
169
170
  end
170
171
  passed_belongs_to
data/lib/exwiw/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Exwiw
4
- VERSION = "0.7.0"
4
+ VERSION = "0.8.1"
5
5
  end
data/lib/tasks/exwiw.rake CHANGED
@@ -17,9 +17,25 @@ namespace :exwiw do
17
17
  task generate: :environment do
18
18
  require "exwiw"
19
19
 
20
- Exwiw::SchemaGenerator.from_rails_application(
20
+ groups = Exwiw::SchemaGenerator.from_rails_application(
21
21
  output_dir: resolve_schema_dir.call,
22
22
  ).generate!
23
+
24
+ # Surface cross-database belongs_tos the generator auto-ignored: these
25
+ # cannot be joined (each database is exported separately), so they were
26
+ # emitted with ignore:true and need a decision from the user.
27
+ cross = Exwiw::SchemaGenerator.cross_database_belongs_tos(groups)
28
+ unless cross.empty?
29
+ $stderr.puts "exwiw: detected #{cross.size} cross-database belongs_to(s), " \
30
+ "each emitted with ignore:true (exwiw cannot join across databases). " \
31
+ "The foreign-key column is still exported."
32
+ cross.each do |c|
33
+ $stderr.puts " - #{c[:table]}.#{c[:foreign_key]} -> #{c[:target]} (in another database)"
34
+ end
35
+ $stderr.puts " To extract across a boundary, declare `scope_column: <foreign_key>` on the " \
36
+ "owning table's config (scope-column mode). Otherwise the relation stays ignored " \
37
+ "and the foreign key is exported as a plain value."
38
+ end
23
39
  end
24
40
 
25
41
  desc "Remove tables/columns from the schema config that no longer exist in the application"
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: exwiw
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.7.0
4
+ version: 0.8.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Shia
@@ -36,6 +36,7 @@ files:
36
36
  - CHANGELOG.md
37
37
  - LICENSE.txt
38
38
  - README.md
39
+ - docs/mongodb-dump-parallelism-2x-notes.md
39
40
  - docs/mongodb-scoping-fullscan-notes.md
40
41
  - docs/optimization-notes.md
41
42
  - docs/optimize-mongodb-export-with-native-ext.md
@@ -46,6 +47,7 @@ files:
46
47
  - docs/plans/2026-05-29-rails-managed-tables.md
47
48
  - docs/plans/2026-05-31-ids-column-for-sql-adapters.md
48
49
  - docs/plans/2026-06-19-mongodb-export-remove-parallelism-native-ext.md
50
+ - docs/scope-column-redesign.md
49
51
  - docs/sql-dump-optimization-notes.md
50
52
  - exe/exwiw
51
53
  - ext/exwiw/ext_json/ext_json.c