exwiw 0.6.0 → 0.6.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 39362410df244fffa463a86c845062f0e9bacac723e15a6697a50c631db0d5cd
4
- data.tar.gz: 9670677e7c822886ed6268e476008c81a1caba4e2eacb286bf0e747fef5d3f3c
3
+ metadata.gz: b66bdb8ccf3d2043c0a903c70071e0f070b41b9fd89dbc09e3c5385c9b4916c1
4
+ data.tar.gz: cf1224f47635113127f2e1fcf38f161c539f7b68f0bb8c85ba0c428fa6a202d1
5
5
  SHA512:
6
- metadata.gz: 4af4db84210fbd9da8b6b32e136c1f12af370bc719c5aeee2a1f7183046f1ffae5e8dcdc0bb6d856e30fe800a5febe12aef4d4ff2d2e8c27b162fcc4a056c7f4
7
- data.tar.gz: 5064dfee83c653ae0c73ae07edf33299752c748c6f9efc5e413424e7b3a92b769f6a99901688a2d1d3415ad3a0939ee807095b1cf24b2aab46fcf89402113a90
6
+ metadata.gz: 41a4ef3c8a6da41ccbb92d7e42a519c5260a1bbcf53ce9cdcfe3a18dbca55d509c7ca1492e726bdb82bfc8c101a0a6bcee1dec1c051ceae44910009801d4b64d
7
+ data.tar.gz: ce5cdc2e96fcacbfc2a7c3ca23c72a22a1c3e765da2e69ad21fd9ebd87bbfdc3619e54a0c6eb5ae94321e4dc1b8651ca73a965cf33509ecafba9349b8a894883
data/CHANGELOG.md CHANGED
@@ -2,6 +2,8 @@
2
2
 
3
3
  ## [Unreleased]
4
4
 
5
+ ## [0.6.1] - 2026-06-20
6
+
5
7
  ## [0.6.0] - 2026-06-20
6
8
 
7
9
  ### Added
data/docs/mongodb-scoping-fullscan-notes.md ADDED
@@ -0,0 +1,145 @@
1
+ # MongoDB scoping full-scan diagnosis (nullable-FK belongs_to)
2
+
3
+ Why a related collection's dump can be "especially slow" — dumping far more
4
+ records than the scope implies and scanning the whole collection — when running
5
+ an exwiw config against a MongoDB backup. The headline finding: the slowness is a
6
+ **symptom of a scoping bug**, not of serialization/decode cost.
7
+
8
+ ## Reproduction setup
9
+
10
+ - Backup: serve a raw WiredTiger dbpath with a standalone `mongod` (the same
11
+ `mongo:7` image the repo's `compose.yml` uses) on a spare port:
12
+
13
+ ```
14
+ docker run -d --name exwiw-restore-mongo --user 0:0 --entrypoint mongod \
15
+ -p 27018:27017 -v "<backup-dbpath>:/data/db" \
16
+ mongo:7 --dbpath /data/db --bind_ip_all
17
+ ```
18
+
19
+ Notes: run as root (`--user 0:0`) so mongod can read a backup's `0600` files;
20
+ bypass the image entrypoint (`--entrypoint mongod`) so it does not `gosu`-drop
21
+ back to the `mongodb` user. A backup carrying a `WiredTiger.backup` marker runs
22
+ recovery on the first start and **writes into the backup dir** (expected for a
23
+ restore). Starting a standalone from a replica-set backup yields a harmless
24
+ `system.replset` warning. Local DB connections may be blocked by a dev sandbox —
25
+ run measurement commands with the sandbox disabled.
26
+
27
+ - Run: `bundle exec exwiw export --config <app>/exwiw/exwiw.yml
28
+ --adapter=mongodb --host=localhost --port=27018 --database=<database>
29
+ --ids=<target-id> --output-dir=… --log-level=debug`.
30
+ (The optional-argument CLI flags — `--ids`, `--output-dir` — must use `=`; the
31
+ space form passes `nil` and crashes in the option callback.)
32
+
33
+ ## Baseline measurement (worked example, warm cache)
34
+
35
+ Full run ≈ **69s** over a few hundred non-empty collections. Per-collection wall
36
+ time (gap between consecutive `Processing table` log markers) was dominated by a
37
+ single collection:
38
+
39
+ | collection | time | records |
40
+ |---|---|---|
41
+ | **items** | **41.0s (59% of the whole run)** | 18,739 |
42
+ | (other collections) | ~5s and below each | — |
43
+
44
+ A "full run ≈ 10 min" figure is the **cold-cache** version (first read pages the
45
+ `items` data off the backup over the bind mount). The *relative* shape (items
46
+ dominates) is the same warm or cold.
47
+
48
+ ## Root cause: nullable belongs_to FK used as a hard `$in` AND constraint
49
+
50
+ The scope is tiny: one parent entity (`<target-id>`) → **1 store** (linked by
51
+ `business_entity_id` = the entity's **`uuid`**, correctly configured with
52
+ `references: uuid`) → **127 items** (`{store_id: <that store>}`, indexed, ~2ms).
53
+
54
+ But the run dumped **18,739** items, not 127, and scanned the whole collection.
55
+ Why:
56
+
57
+ 1. `MongodbAdapter#related_collection_filter` ANDs **every** belongs_to whose
58
+ parent produced ids. The `stores` filter became:
59
+
60
+ ```
61
+ { user_id: {$in: [580 ids]},
62
+ deleted_user_id: {$in: [96 ids]}, # nullable FK
63
+ business_entity_id: {$in: ["<target-id>"]} }
64
+ ```
65
+
66
+ The one matching store has **`deleted_user_id` absent (null)**, so it can never
67
+ satisfy `deleted_user_id ∈ {96 ids}`. The AND yields **0 stores** → `stores`
68
+ logs "No records matched. skip this table."
69
+
70
+ 2. With `stores` empty, `@state["stores"]` carries no ids, so when `items` is
71
+ built its `store_id` belongs_to contributes nothing. The remaining belongs_tos
72
+ are to **reference/master data that is dumped in full** —
73
+ `large_categories` and `medium_categories`. So the items filter degenerated to:
74
+
75
+ ```
76
+ { large_category_id: {$in: [98 ids]},
77
+ medium_category_id: {$in: [846 ids]} }
78
+ ```
79
+
80
+ 3. `items` has **no index on `large_category_id` / `medium_category_id`** (it does
81
+ have `store_id_1`). So this filter forces a **full COLLSCAN of all 2.43M items**
82
+ — and the Runner scans it **twice**: once for `StreamingResult#size`
83
+ (`count_documents`) and again for the fetch.
84
+
85
+ Isolated phase breakdown of the degenerate items query (warm): `count_documents`
86
+ 3.58s + fetch/decode 1.47s + serialize 0.26s. Serialization (the native
87
+ `Exwiw::ExtJson` C ext) is **not** the bottleneck — the COLLSCAN is, and it is far
88
+ worse cold.
89
+
90
+ The same nullable-FK problem applies one level down: the store's 127 items
91
+ themselves have `*_category_id` = null, so even `{store_id, large_category,
92
+ medium_category}` ANDed returns **0**. The only filter that yields the correct 127
93
+ is `{store_id: {$in: [store]}}` **alone**.
94
+
95
+ ## Implemented fix: genuine-anchor scoping (MongodbAdapter#related_collection_filter)
96
+
97
+ Scope flows from the dump target along belongs_to edges. The fix classifies each
98
+ belongs_to parent of a non-target collection by whether it is **genuinely scoped**
99
+ — reachable back to the dump target through belongs_to chains
100
+ (`#genuine_scope_set`, a fixpoint over the configs) — and applies the constraint
101
+ accordingly:
102
+
103
+ - **Anchor (strict).** Among the genuine parents, the most selective one (fewest
104
+ captured ids) is applied strictly (`{fk: {$in: [...]}}`). It carries the real
105
+ narrowing and, being strict, bounds the result to a small set — which keeps both
106
+ this query and the `$in` sets it feeds downstream from ballooning.
107
+ - **Other genuine parents (null-aware).** `{fk: {$in: [nil, ...]}}` (Mongo's
108
+ `$in: [nil]` matches both explicit nulls and missing fields), so a row whose
109
+ nullable refinement FK is null is not excluded by it.
110
+ - **Reference parents (dropped).** A parent NOT reachable to the dump target is
111
+ reference/master data dumped in full; its id set is "all/most of a table" and is
112
+ not a real scope, so when a genuine anchor exists it is dropped entirely.
113
+ - **No genuine parent:** fall back to the historical strict-AND of whatever
114
+ constraints exist (preserves prior behaviour for unreachable collections).
115
+
116
+ For this extraction: `stores` → `{business_entity_id ∈ {<target-id>}}` (anchor;
117
+ `user_id`/`deleted_user_id` → reference leaks, dropped) → **1 store**; `items` →
118
+ `{store_id ∈ {store}}` strict anchor with the nullable refinement FKs null-aware
119
+ and the `*_category` references dropped → **127** via the `store_id_1` index.
120
+
121
+ ### Measured result (warm, same cache)
122
+
123
+ Full run **58.8s → 11.0s ≈ 5.4×**; `items` 41s double COLLSCAN → ~11ms indexed
124
+ (≈3700×). Correctness also fixed: `stores` 0→1, `items` 18,739 (leaked COLLSCAN)
125
+ →127. Existing snapshots stay byte-identical (the seed graph is fully genuine and has no
126
+ null FKs, so anchor-strict + null-aware ≡ the prior strict-AND).
127
+
128
+ ### Approaches considered and rejected
129
+
130
+ - **Unconditional null-aware on every belongs_to** (the original iter-1 direction):
131
+ catastrophic. A collection that belongs_to only reference data dumped in full
132
+ becomes ~the whole table once null-aware; the resulting child `$in` then exceeds
133
+ Mongo's **48 MB max message size** and the run crashes. Null-awareness must NOT
134
+ be applied to a collection's only/anchor scope.
135
+ - **Null-aware on all genuine parents (no anchor distinction):** makes the genuine
136
+ *anchor* itself null-aware too — `stores` then matched every store with a null
137
+ `business_entity_id` (a not-fully-backfilled column) → hundreds of thousands of
138
+ stores → a ~39 MB child filter on a downstream collection (**MaxBSONSize**).
139
+ Hence the anchor stays strict.
140
+ - **Scope by the single most-selective genuine parent alone (drop other genuine):**
141
+ fast and correct here, but drops legitimate AND-narrowing for multi-parent
142
+ collections (e.g. `order_items` ∈ orders AND products) and changes the existing seed snapshots.
143
+ - Pure-performance tweaks that keep the (incorrect) 18,739-row output, such as
144
+ `--cursor-parallel` (changes row order, treats the symptom) or skipping the
145
+ redundant `count_documents` scan (removes only one of the two COLLSCANs), were rejected as the primary fix.
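The classification described in the notes above can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the gem's implementation: `build_scope_filter`, `matches?`, and the `[foreign_key, parent_collection, captured_ids]` triples are hypothetical names, the ids are made up, and `matches?` only approximates Mongo's `$in` on scalar fields in memory.

```ruby
require "set"

# Hypothetical helper (not the gem's API). Each clause is a
# [foreign_key, parent_collection, captured_ids] triple.
def build_scope_filter(clauses, genuine_parents)
  usable = clauses.reject { |_fk, _parent, ids| ids.nil? || ids.empty? }
  genuine, reference = usable.partition { |_fk, parent, _ids| genuine_parents.include?(parent) }

  # No genuine parent: fall back to a strict AND of whatever constraints exist.
  return reference.to_h { |fk, _parent, ids| [fk, { "$in" => ids }] } if genuine.empty?

  # The genuine parent with the fewest ids is the strict anchor; the other
  # genuine parents also accept nil. Reference parents are dropped.
  anchor = genuine.min_by { |_fk, _parent, ids| ids.size }
  genuine.to_h do |clause|
    fk, _parent, ids = clause
    [fk, { "$in" => clause.equal?(anchor) ? ids : [nil] + ids }]
  end
end

# In-memory approximation of `$in` on scalar fields: a missing field reads as
# nil, so it matches only when nil is listed.
def matches?(doc, filter)
  filter.all? { |field, cond| cond["$in"].include?(doc[field]) }
end

# Made-up ids standing in for the `stores` case from the notes.
clauses = [
  ["user_id", "users", [1, 2, 3]],
  ["deleted_user_id", "users", [2]], # nullable FK
  ["business_entity_id", "business_entities", ["target-uuid"]],
]
strict_and = clauses.to_h { |fk, _parent, ids| [fk, { "$in" => ids }] }
scoped = build_scope_filter(clauses, Set["business_entities"])

# The one in-scope store has no deleted_user_id field at all.
store = { "_id" => "store-1", "user_id" => 2, "business_entity_id" => "target-uuid" }

puts matches?(store, strict_and) # prints false
puts matches?(store, scoped)     # prints true
```

With the historical strict AND, the absent `deleted_user_id` excludes the only in-scope store. With the sketch's anchor rule, only `business_entity_id` constrains the query, so the store is kept.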
@@ -123,7 +123,7 @@ module Exwiw
123
123
  { config.primary_key => { "$in" => coerce_ids(dump_target.ids) } }
124
124
  end
125
125
  else
126
- related_collection_filter(config, config_by_name)
126
+ related_collection_filter(config, config_by_name, dump_target)
127
127
  end
128
128
 
129
129
  Exwiw::MongoQuery::Find.new(
@@ -352,23 +352,74 @@ module Exwiw
352
352
  # the values were captured from that field in #execute, so their BSON type
353
353
  # already matches the stored FK — no coercion.
354
354
  #
355
- # A belongs_to whose parent produced no ids contributes no constraint:
356
- # either the parent matched nothing, or it is not dumped here (e.g. an
357
- # embedded collection, or one excluded from the run). If that leaves the
358
- # filter empty even though the collection HAS belongs_to, the collection
359
- # cannot be scoped from the dump target — and falling back to an empty `{}`
360
- # filter would scan and dump the ENTIRE collection across every scope. That
361
- # is never what a scoped extraction wants, so constrain it to match nothing
362
- # and warn instead. (A collection with no belongs_to at all is genuine
363
- # reference/master data and is still dumped in full via `{}`.)
364
- private def related_collection_filter(config, config_by_name)
365
- filter = config.belongs_tos.each_with_object({}) do |relation, acc|
355
+ # Scope flows from the dump target along belongs_to edges. A belongs_to is
356
+ # classified by whether its parent is *genuinely scoped* (reachable back to
357
+ # the dump target through belongs_to chains; see #genuine_scope_set), which
358
+ # determines how its constraint is applied:
359
+ #
360
+ # - Among the genuine parents, the most selective one (fewest captured ids)
361
+ # is the ANCHOR and is applied strictly. It carries the real narrowing and,
362
+ # being strict, bounds the result to a small set, which keeps both this
363
+ # query and the `$in` sets it feeds downstream from ballooning.
364
+ #
365
+ # - The OTHER genuine parents are applied null-aware: a row whose (nullable)
366
+ # FK is null/absent has no reference through that relation and must not be
367
+ # excluded by it. `nil` is added to the `$in` set (Mongo's `$in: [nil]`
368
+ # matches both explicit nulls and missing fields). Without this, a nullable
369
+ # genuine FK that is null on otherwise in-scope rows ANDs the result to
370
+ # empty — dropping legitimate rows, and (when it zeroes a parent) making
371
+ # children lose that parent's selective+indexed scope and degenerate to a
372
+ # full COLLSCAN. See docs/mongodb-scoping-fullscan-notes.md. Null-aware is
373
+ # applied to non-anchor parents only: making the sole/anchor scope itself
374
+ # null-aware would match every row whose FK is null (e.g. a not-yet-
375
+ # backfilled column), ballooning the result instead of scoping it.
376
+ #
377
+ # - Reference parents (NOT reachable to the dump target — master/reference
378
+ # data dumped in full, or only reachable via such data) produce a non-
379
+ # scoping id set: "all/most of a reference table", which neither narrows
380
+ # meaningfully nor, made null-aware, stays bounded. So when the collection
381
+ # has a genuine parent to anchor on, reference-parent constraints are
382
+ # dropped entirely.
383
+ #
384
+ # When NO genuine parent produced ids, the collection is not reachable from
385
+ # the dump target; fall back to the historical strict-AND of whatever
386
+ # constraints exist (bounded, preserves prior behavior).
387
+ #
388
+ # A belongs_to whose parent produced no ids contributes no constraint: either
389
+ # the parent matched nothing, or it is not dumped here (e.g. an embedded
390
+ # collection, or one excluded from the run). If that leaves the filter empty
391
+ # even though the collection HAS belongs_to, the collection cannot be scoped
392
+ # from the dump target — and an empty `{}` filter would scan and dump the
393
+ # ENTIRE collection across every scope. That is never what a scoped
394
+ # extraction wants, so constrain it to match nothing and warn instead. (A
395
+ # collection with no belongs_to at all is genuine reference/master data and
396
+ # is still dumped in full via `{}`.)
397
+ private def related_collection_filter(config, config_by_name, dump_target)
398
+ genuine = genuine_scope_set(config_by_name, dump_target.table_name)
399
+
400
+ genuine_clauses = []
401
+ reference_clauses = []
402
+ config.belongs_tos.each do |relation|
366
403
  values = parent_state_for(relation, config_by_name)
367
404
  next if values.nil? || values.empty?
368
405
 
369
- acc[relation.foreign_key] = { "$in" => values }
406
+ target = genuine.include?(relation.table_name) ? genuine_clauses : reference_clauses
407
+ target << [relation.foreign_key, values]
370
408
  end
371
409
 
410
+ filter =
411
+ if genuine_clauses.any?
412
+ anchor_index = (0...genuine_clauses.size).min_by { |i| genuine_clauses[i][1].size }
413
+ genuine_clauses.each_with_index.each_with_object({}) do |((foreign_key, values), index), acc|
414
+ acc[foreign_key] =
415
+ index == anchor_index ? { "$in" => values } : { "$in" => [nil] + values }
416
+ end
417
+ else
418
+ reference_clauses.each_with_object({}) do |(foreign_key, values), acc|
419
+ acc[foreign_key] = { "$in" => values }
420
+ end
421
+ end
422
+
372
423
  return filter unless filter.empty? && config.belongs_tos.any?
373
424
 
374
425
  @logger.warn(
@@ -379,6 +430,31 @@ module Exwiw
379
430
  { config.primary_key => { "$in" => [] } }
380
431
  end
381
432
 
433
+ # The set of collection names *genuinely scoped* by the dump target: the
434
+ # target itself, plus every collection that can reach it by following
435
+ # belongs_to edges (child -> parent) transitively. Computed by fixpoint over
436
+ # the configs. Everything outside this set is reference/master data (or only
437
+ # reachable through it) whose belongs_to id sets do not represent a real
438
+ # scope. Memoized per target name; the configs do not mutate mid-run.
439
+ private def genuine_scope_set(config_by_name, target_name)
440
+ (@genuine_scope_set_cache ||= {})[target_name] ||=
441
+ begin
442
+ reachable = Set.new([target_name])
443
+ loop do
444
+ added = false
445
+ config_by_name.each_value do |cfg|
446
+ next if cfg.embedded? || reachable.include?(cfg.name)
447
+ next unless cfg.belongs_tos.any? { |relation| reachable.include?(relation.table_name) }
448
+
449
+ reachable << cfg.name
450
+ added = true
451
+ end
452
+ break unless added
453
+ end
454
+ reachable
455
+ end
456
+ end
457
+
382
458
  # The captured parent-collection values a child belongs_to should be
383
459
  # constrained by: the values of the parent field the FK references
384
460
  # (`relation.references`, default the parent primary_key). nil when the
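The fixpoint behind `#genuine_scope_set` can be tried in isolation. The sketch below is a simplified stand-in, assuming a hypothetical `Collection` struct (a name plus its parents' names) in place of the gem's config objects and a hypothetical `reachable_from` helper. It adds newly reachable collections in batches per pass rather than one at a time, which reaches the same fixpoint.

```ruby
require "set"

# Hypothetical stand-in for a collection config: a name plus the names of the
# parents it belongs_to.
Collection = Struct.new(:name, :parents)

# Collect every collection that can reach target_name by following
# child -> parent edges transitively. The set only grows, so the loop
# terminates even if the belongs_to graph contains cycles.
def reachable_from(target_name, collections)
  reachable = Set[target_name]
  loop do
    added = collections
      .reject { |c| reachable.include?(c.name) }
      .select { |c| c.parents.any? { |parent| reachable.include?(parent) } }
    break if added.empty?

    added.each { |c| reachable << c.name }
  end
  reachable
end

collections = [
  Collection.new("business_entities", []),
  Collection.new("users", []),
  Collection.new("large_categories", []),
  Collection.new("stores", %w[users business_entities]),
  Collection.new("items", %w[stores large_categories]),
]

scope = reachable_from("business_entities", collections)
puts scope.to_a.sort.inspect # prints ["business_entities", "items", "stores"]
```

In this toy graph `users` and `large_categories` stay outside the set, so their id sets would be treated as reference constraints rather than as scope.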
data/lib/exwiw/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Exwiw
4
- VERSION = "0.6.0"
4
+ VERSION = "0.6.1"
5
5
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: exwiw
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.6.0
4
+ version: 0.6.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Shia
@@ -36,6 +36,7 @@ files:
36
36
  - CHANGELOG.md
37
37
  - LICENSE.txt
38
38
  - README.md
39
+ - docs/mongodb-scoping-fullscan-notes.md
39
40
  - docs/optimization-notes.md
40
41
  - docs/optimize-mongodb-export-with-native-ext.md
41
42
  - docs/plans/2026-05-15-insert-000-schema-file.md