parse-stack-next 5.4.1 → 5.5.1

Files changed (41)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +489 -0
  3. data/Gemfile.lock +1 -1
  4. data/README.md +61 -9
  5. data/docs/atlas_vector_search_guide.md +318 -19
  6. data/lib/parse/acl_scope.rb +11 -0
  7. data/lib/parse/agent/mcp_rack_app.rb +53 -14
  8. data/lib/parse/agent/mcp_server.rb +19 -0
  9. data/lib/parse/api/path_segment.rb +31 -0
  10. data/lib/parse/api/users.rb +13 -0
  11. data/lib/parse/cache/redis.rb +55 -11
  12. data/lib/parse/client/caching.rb +12 -3
  13. data/lib/parse/client/logging.rb +9 -0
  14. data/lib/parse/client.rb +37 -3
  15. data/lib/parse/embeddings/batch_embedder.rb +188 -0
  16. data/lib/parse/embeddings/cache.rb +374 -0
  17. data/lib/parse/embeddings/cohere.rb +31 -18
  18. data/lib/parse/embeddings/image_fetch.rb +347 -0
  19. data/lib/parse/embeddings/provider.rb +17 -11
  20. data/lib/parse/embeddings/spend_cap.rb +117 -3
  21. data/lib/parse/embeddings/voyage.rb +34 -25
  22. data/lib/parse/embeddings.rb +40 -3
  23. data/lib/parse/model/acl.rb +15 -11
  24. data/lib/parse/model/core/embed_managed.rb +243 -14
  25. data/lib/parse/model/core/properties.rb +42 -5
  26. data/lib/parse/model/core/vector_searchable.rb +157 -8
  27. data/lib/parse/mongodb.rb +12 -0
  28. data/lib/parse/pipeline_security.rb +81 -15
  29. data/lib/parse/query/constraint.rb +22 -0
  30. data/lib/parse/query/constraints.rb +271 -250
  31. data/lib/parse/query.rb +284 -43
  32. data/lib/parse/retrieval/agent_tool.rb +21 -14
  33. data/lib/parse/retrieval/retriever.rb +84 -0
  34. data/lib/parse/schema/search_index_migrator.rb +48 -1
  35. data/lib/parse/stack/version.rb +1 -1
  36. data/lib/parse/stack.rb +12 -1
  37. data/lib/parse/vector_search/hybrid.rb +39 -1
  38. data/lib/parse/vector_search.rb +34 -0
  39. data/lib/parse/webhooks/payload.rb +7 -1
  40. data/lib/parse/webhooks.rb +107 -21
  41. metadata +4 -1
data/README.md CHANGED
@@ -4,6 +4,13 @@
4
4
 
5
5
  A full-featured Ruby client SDK for [Parse Server](http://parseplatform.org/). [parse-stack-next](https://github.com/neurosynq/parse-stack-next) is a Ruby client SDK, REST client, and Active Model ORM for [Parse Server](http://parseplatform.org/), combining a low-level API client, a query engine, an object-relational mapper (ORM), and a Cloud Code Webhooks rack application in a single gem.
6
6
 
7
+ ### What's new in 5.5
8
+
9
+ - **5.5.0 — Multimodal bytes-fetch with magic-byte MIME verification** — `embed_image ..., source: :bytes` has the SDK download an image itself through the `Parse::File.safe_open_url` SSRF primitive, verify the content by **magic-byte sniff** (the `Content-Type` header is never consulted — a `.jpg` URL serving HTML is refused), cross-check the URL extension, enforce a `Parse::Embeddings.allowed_image_types` allowlist, strip EXIF/XMP metadata **by default** (JPEG APP1, PNG `eXIf`, WebP `EXIF`/`XMP ` chunks; opt out with `exif_strip: false`), and forward the verified bytes to Voyage/Cohere as a base64 data URI. No provider-side URL fetch occurs, so the `trust_provider_url_fetch` sentinel is not required — the host allowlist still applies. See [CHANGELOG.md](./CHANGELOG.md)
10
+ - **5.5.0 — Embedding-model migration tooling** — `Class.reembed!(only_stale: true)` bulk re-embeds rows through the current provider/model (resumable; skips rows already current), driven by the new auto-declared `<into>_meta` provenance sibling (`{provider, model, dimensions, modality, embedded_at}`, stamped on every recompute). `Parse::Embeddings::BatchEmbedder` adds batch-level requests-per-minute pacing and exponential backoff for bulk jobs; `Parse::Embeddings::Cache.enable!` adds an opt-in query-embed cache keyed by `(provider, model, input_type, input-hash)` so repeated identical queries skip the provider round-trip. See [CHANGELOG.md](./CHANGELOG.md)
11
+ - **5.5.0 — Vector index drift detection** — on first auto-discovered use of an Atlas vectorSearch index, the SDK verifies the deployed index's `numDimensions`/`similarity` against the `:vector` property declaration and confirms a registered `agent_tenant_scope` field is covered as a `type: "filter"` path. Policy via `Parse::VectorSearch.index_drift_policy` (`:warn` default / `:raise` / `:ignore`). `Parse::Schema::SearchIndexMigrator` now auto-includes the tenant-scope field in `vectorSearch` declarations, so newly created indexes support tenant-scoped pre-filtering out of the box. See [CHANGELOG.md](./CHANGELOG.md)
12
+ - **5.5.0 — Retrieval spend-cap and filter hardening** — the per-tenant embedding spend cap now covers every query-embed path (`find_similar(text:)`, `hybrid_search(text:)`, `Parse::Retrieval.retrieve`), not just the `semantic_search` agent tool; tenant identity resolves through the ambient `Parse.with_cache_tenant` scope. Caller-supplied retrieval filters now translate Parse pointer values to storage form (`{ owner: user }` → `{ "_p_owner" => "_User$id" }`), so pointer filters match rows instead of silently matching nothing. See [CHANGELOG.md](./CHANGELOG.md)
13
+
7
14
  ### What's new in 5.4
8
15
 
9
16
  - **5.4.0 — Hybrid search + reranking for RAG** — `Class.hybrid_search(text:, lexical:, vector:, k:, fusion:)` fuses a lexical Atlas Search branch with a `$vectorSearch` branch using reciprocal-rank fusion (RRF): lexical search nails exact tokens (codes, proper nouns), vector search nails paraphrase, and fusing the two beats either alone. Each branch enforces ACL/CLP independently before fusion (no separate hydration fetch to secure); results carry `#hybrid_score` / `#hybrid_ranks`. `Parse::VectorSearch::Hybrid.rank_fusion_supported?` detects Atlas 8.0+ native `$rankFusion` by a cached behavioural probe (native execution is opt-in; client-side RRF is the always-enforced default). `Parse::Retrieval::Reranker` adds cross-encoder reranking (`Reranker::Cohere` over `/v2/rerank`, plus a deterministic `Reranker::Fixture`), wired into `Parse::Retrieval.retrieve(hybrid:, rerank:)`. `Parse::Embeddings::SpendCap` adds an opt-in per-tenant embedding token cap (hard-refuse) at the `semantic_search` agent-tool boundary. See [CHANGELOG.md](./CHANGELOG.md) and [`docs/atlas_vector_search_guide.md`](./docs/atlas_vector_search_guide.md)
@@ -38,7 +45,7 @@ See [CHANGELOG.md](./CHANGELOG.md) for the full 5.2 entry.
38
45
  - **`Parse::File` URL normalization + presigned-URL stash** — `Parse::File#url=` and `attributes=` now strip signed-URL query parameters (`X-Amz-Signature`, `AWSAccessKeyId`, `Key-Pair-Id`, etc.) before storage; the bare canonical URL lands in `@url`, and the original signed URL is stashed in `file.presigned_url` with a data-driven expiry in `file.presigned_url_expires_at`. New `file.presigned_url_valid?(buffer: 60)` predicate, configurable `Parse::File.signed_url_policy = :strip | :raise`, and `Parse::File.log_filter` / `log_filter_strict` regexes for `lograge` / Sentry / Honeybadger scrubbers. `Parse::File#inspect` no longer emits the URL — see CHANGELOG for the error-reporter payload migration callout
39
46
  - **`Parse::Lock` — public TTL-bounded mutual-exclusion primitive** — `Parse::Lock.acquire(key, ttl:, wait:) { … }` exposes the Redis-backed lock previously hidden inside `first_or_create!` as a first-class API. In-process `Mutex` fallback for memory-backed caches, fails closed on backend errors, HMAC-keyed via `PARSE_STACK_LOCK_SECRET`, namespace-separated from `first_or_create!` so the two cannot collide
40
47
  - **LiveQuery ergonomics** — autoloaded (no explicit `require 'parse/live_query'`); connections are **ACL-scoped by default** (build an admin, ACL-bypassing connection explicitly with `Parse::LiveQuery::Client.new(use_master_key: true)` — master-key authorization is per-connection, not per-subscription); `Query#subscribe` / `Klass.subscribe` accept a block yielded the `Subscription` *before* the subscribe frame is sent so `sub.on(:create) { … }` callbacks are wired before any server event can arrive; `Parse::LiveQuery.run_until_signal!(client:) { … }` is a signal-safe shutdown helper for long-running consumers
41
- - **Image embeddings** — new `embed_image` class macro for `:file`-typed source properties plus `Voyage#embed_image` (`voyage-multimodal-3`, 1024-dim) and `Cohere#embed_image` (`embed-v4.0`, 1536-dim). URL-only routing in v5.1 (bytes-fetch with MIME-sniff lands later); operator-gated via the `Parse::Embeddings.trust_provider_url_fetch = "PROVIDER_EGRESS_VERIFIED"` sentinel plus a `Parse::Embeddings.allowed_image_hosts` CDN allowlist
48
+ - **Image embeddings** — new `embed_image` class macro for `:file`-typed source properties plus `Voyage#embed_image` (`voyage-multimodal-3`, 1024-dim) and `Cohere#embed_image` (`embed-v4.0`, 1536-dim). URL-only routing in v5.1 (the bytes-fetch path with MIME-sniff shipped in v5.5 as `source: :bytes`); operator-gated via the `Parse::Embeddings.trust_provider_url_fetch = "PROVIDER_EGRESS_VERIFIED"` sentinel plus a `Parse::Embeddings.allowed_image_hosts` CDN allowlist
42
49
  - **Tenant-aware cache namespacing** — `Parse.with_cache_tenant(scope) { … }` composes the tenant into the response-cache key as `<base>:T:<tenant>:…` so a multi-tenant app sharing one Redis gets per-tenant key isolation and per-tenant SCAN-delete eviction without per-tenant `Parse::Client.new` plumbing. Fiber-local, restored on block exit, AS::N payloads carry `:cache_tenant`
43
50
  - **`_User` field-visibility DSL** — `Parse::User.master_only_fields(*fields)` and `Parse::User.self_visible_fields(*fields, via: :self)` declare admin-only and owner-only field protections on `_User`. Requires Parse Server's `protectedFieldsOwnerExempt: false` server option (the SDK emits a one-time advisory at class declaration so the dependency is surfaced before deploy). Parse Server's default for this option is changing to `false` in a future version; until your server adopts that default, set it explicitly
44
51
  - **`Parse::Installation` `belongs_to :user`** — read `installation.user` to find which user a device is currently signed in as. Symmetric `Parse::User#has_many :installations` for targeted-push grouping (master-key-only by Parse Server design; see the YARD for the owner-identity caveat)
@@ -64,6 +71,16 @@ See [CHANGELOG.md](./CHANGELOG.md) for the full 5.0 entry, including security-ha
64
71
 
65
72
  ### Core capabilities
66
73
 
74
+ > **Vector search requires MongoDB Atlas (or Atlas Local).** The `:vector`
75
+ > property, `find_similar`, `hybrid_search`, and `Parse::Retrieval` all
76
+ > execute Atlas `$vectorSearch` / `$search` aggregation stages, which exist
77
+ > only on Atlas clusters and the Atlas Local container — community/self-hosted
78
+ > MongoDB is not supported and there is no in-process fallback (a pure-Ruby
79
+ > cosine scan over a real collection is a silent performance cliff, so the
80
+ > SDK refuses rather than degrades). This is a closed design decision.
81
+ > Everything else in this list works against any MongoDB that Parse Server
82
+ > supports.
83
+
67
84
  - MongoDB Aggregation Framework support
68
85
  - **MongoDB Atlas Search** — full-text search, autocomplete, faceted search with direct MongoDB access
69
86
  - **Direct MongoDB Queries** — bypass Parse Server's REST surface for high-performance reads, with SDK-side ACL/CLP/`protectedFields` enforcement for scoped agents
@@ -611,12 +628,21 @@ If `faraday-net_http_persistent` is not available, Parse Stack automatically fal
611
628
  A caching adapter of type `Moneta::Transformer`. Caching queries and object fetches can help improve the performance of your application, even if it is for a few seconds. Only successful `GET` object fetches and queries (non-empty) will be cached. You may set the default expiration time with the `expires` option. See related: [Moneta](https://github.com/minad/moneta). At any point in time you may clear the cache by calling the `clear_cache!` method on the client connection.
612
629
 
613
630
  ```ruby
614
- store = Moneta.new :Redis, url: 'redis://localhost:6379'
631
+ # Use the bundled Parse::Cache::Redis wrapper for a Redis-backed cache. It
632
+ # serializes cached responses as JSON (never Marshal): a raw
633
+ # `Moneta.new(:Redis, ...)` store Marshals values by default, so a cache
634
+ # read would `Marshal.load` bytes from Redis — an RCE vector if that Redis
635
+ # is shared, unauthenticated, or reachable over a plaintext `redis://` MITM.
636
+ store = Parse::Cache::Redis.new(url: 'redis://localhost:6379')
615
637
  # use a Redis cache store with an automatic expire of 10 seconds.
616
638
  Parse.setup(cache: store, expires: 10, ...)
617
639
  ```
618
640
 
619
- As a shortcut, if you are planning on using REDIS and have configured the use of `redis` in your `Gemfile`, you can just pass the REDIS connection string directly to the cache option.
641
+ If you supply your own raw `Moneta.new(:Redis, ...)` store instead of the
642
+ wrapper, build it with `value_serializer: nil` to keep Marshal off the cache
643
+ read path.
644
+
645
+ As a shortcut, if you are planning on using REDIS and have configured the use of `redis` in your `Gemfile`, you can just pass the REDIS connection string directly to the cache option. The string form builds a `Parse::Cache::Redis` wrapper for you, so it is JSON-serialized and safe by default.
620
646
 
621
647
  ```ruby
622
648
  Parse.setup(cache: 'redis://localhost:6379', ...)
@@ -5325,7 +5351,11 @@ If you are already have setup a client that is being used by your defined models
5325
5351
  For high-traffic applications that may be performing several server tasks on similar objects, you may utilize request caching. Caching is provided by the `Parse::Middleware::Caching` class, which utilizes a [Moneta store](https://github.com/minad/moneta) object to cache GET URL requests that have allowable status codes (e.g. HTTP 200). The cache entry for the URL will be removed when it is either considered expired (based on the `expires` option) or if a non-GET request is made with the same URL. Using this feature appropriately can dramatically reduce your API request usage.
5326
5352
 
5327
5353
  ```ruby
5328
- store = Moneta.new :Redis, url: 'redis://localhost:6379'
5354
+ # Parse::Cache::Redis serializes cached responses as JSON, not Marshal — a raw
5355
+ # Moneta.new(:Redis) store Marshals values by default and a cache read would
5356
+ # Marshal.load Redis bytes (RCE if the cache is shared/untrusted). Prefer the
5357
+ # wrapper; if you supply a raw Moneta-Redis store, pass value_serializer: nil.
5358
+ store = Parse::Cache::Redis.new(url: 'redis://localhost:6379')
5329
5359
  # use a Redis cache store with an automatic expire of 10 seconds.
5330
5360
  Parse.setup(cache: store, expires: 10, ...)
5331
5361
 
@@ -5533,13 +5563,24 @@ pipeline = [
5533
5563
 
5534
5564
  Filter objects by ACL permissions using MongoDB's `_rperm` and `_wperm` fields:
5535
5565
 
5536
- **`readable_by` / `writable_by`** - Exact permission strings:
5566
+ **`readable_by` / `writable_by`** - filter by principal:
5537
5567
  ```ruby
5538
5568
  Song.query.readable_by("user123").results(mongo_direct: true) # User ID
5539
5569
  Song.query.readable_by("role:Admin").results(mongo_direct: true) # Role (explicit prefix)
5540
- Song.query.readable_by(current_user).results(mongo_direct: true) # User object
5541
- Song.query.readable_by("public").results(mongo_direct: true) # Public access (alias for "*")
5542
- Song.query.readable_by("none").results(mongo_direct: true) # Empty _rperm (master key only)
5570
+ Song.query.readable_by(current_user).results(mongo_direct: true) # User object (roles expanded)
5571
+ Song.query.readable_by(:public).results(mongo_direct: true) # Public access (maps to "*")
5572
+ Song.query.readable_by([]).results(mongo_direct: true) # No read perms (empty _rperm)
5573
+ ```
5574
+
5575
+ By default the match is **inclusive** — it ALSO returns publicly-readable rows
5576
+ (`_rperm` contains `"*"`) and rows with a missing `_rperm` (public by absence),
5577
+ because those are genuinely readable by the principal (access-simulation
5578
+ semantics). For an **exact** match — only rows whose `_rperm` literally grants
5579
+ the principal, with no public/missing rows — pass `strict: true`. This is what
5580
+ an ownership or security audit wants:
5581
+
5582
+ ```ruby
5583
+ Song.query.readable_by("role:Admin", strict: true).results # ONLY rows that explicitly grant Admin
5543
5584
  ```
5544
5585
 
5545
5586
  **`readable_by_role` / `writable_by_role`** - Adds "role:" prefix automatically:
@@ -5549,7 +5590,18 @@ Song.query.readable_by_role(admin_role).results(mongo_direct: true) #
5549
5590
  Song.query.writable_by_role(["Admin", "Editor"]).results(mongo_direct: true) # Multiple roles
5550
5591
  ```
5551
5592
 
5552
- **Note:** Requires the `mongo` gem. Add `gem 'mongo'` to your Gemfile.
5593
+ **Convenience and negation:** `publicly_readable` / `publicly_writable`,
5594
+ `privately_readable` / `private_acl` (master-key-only), `not_readable_by` /
5595
+ `not_writable_by`, and `not_publicly_readable` / `not_publicly_writable`.
5596
+ "Not readable by X" excludes rows readable by X directly, via any role X
5597
+ inherits, or publicly.
5598
+
5599
+ **Note:** These constraints compile to an aggregation `$match` on the internal
5600
+ `_rperm` / `_wperm` columns, so they auto-route to the direct-MongoDB path
5601
+ (requires the `mongo` gem and `Parse::MongoDB.configure(...)`). For a scoped
5602
+ query (`scope_to_user` / `scope_to_role` / `session_token`) the SDK enforces
5603
+ ACL/CLP on that path; a scoped aggregate fails closed if mongo-direct is not
5604
+ configured rather than running unscoped.
5553
5605
 
5554
5606
  ### ACL Dirty Tracking
5555
5607
 
@@ -288,6 +288,89 @@ declared `dimensions:` before sending the pipeline. A mismatch raises
288
288
  it — callers get "expected 1536, got 768" instead of a server-side
289
289
  error after a round-trip.
290
290
 
291
+ ### Index drift verification (v5.5)
292
+
293
+ On the first auto-discovered use of a vectorSearch index per
294
+ (class, field, index) per process, the SDK compares the deployed
295
+ index's `latestDefinition` against the model declaration:
296
+
297
+ * `numDimensions` vs the property's declared `dimensions:` — a
298
+ mismatch means every query will be rejected or return nonsense
299
+ (usually an index that predates a model change).
300
+ * `similarity` vs the property's declared `similarity:` (checked only
301
+ when both sides declare one).
302
+ * When the class registers an `agent_tenant_scope`, the scope field
303
+ must appear among the index's `type: "filter"` paths — without it,
304
+ every tenant-scoped `$vectorSearch.filter` fails Atlas-side at
305
+ query time.
306
+
307
+ Findings are computed once per (class, field, index) per process and
308
+ governed by `Parse::VectorSearch.index_drift_policy`:
309
+
310
+ ```ruby
311
+ Parse::VectorSearch.index_drift_policy = :warn # default — [Parse::VectorSearch:DRIFT] warning on first check
312
+ Parse::VectorSearch.index_drift_policy = :raise # IndexDriftError on EVERY query against a drifted index
313
+ Parse::VectorSearch.index_drift_policy = :ignore # skip verification
314
+ ```
315
+
316
+ Under `:raise` the cached findings keep raising — strict mode means a
317
+ drifted index never serves results, not "fails once, then passes".
318
+ Auto-discovery verification costs no extra round-trip (the definition
319
+ is already in hand from index discovery). An explicit `index:` kwarg
320
+ is verified best-effort: when the catalog's covering index for the
321
+ field carries the same name, its definition is checked too; catalog
322
+ lookup failures never fail the query.
323
+
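The three drift checks listed above can be pictured as a pure comparison between an index definition and a model declaration. The sketch below is illustrative only and is not the SDK's verifier: `drift_findings` and its keyword arguments are hypothetical names, and the `fields` array with `type` / `path` / `numDimensions` / `similarity` keys is assumed to follow the Atlas vectorSearch index definition shape.

```ruby
# Hypothetical helper (not part of parse-stack-next): compare an Atlas
# vectorSearch index definition against what the model declares.
def drift_findings(definition, field:, dimensions:, similarity: nil, tenant_field: nil)
  fields = definition.fetch("fields", [])
  vector = fields.find { |f| f["type"] == "vector" && f["path"] == field }
  return ["index does not cover #{field}"] if vector.nil?

  findings = []
  if vector["numDimensions"] != dimensions
    findings << "index numDimensions=#{vector['numDimensions']} but property declares #{dimensions}"
  end
  # Similarity is compared only when both sides declare one.
  if similarity && vector["similarity"] && vector["similarity"] != similarity.to_s
    findings << "index similarity=#{vector['similarity']} but property declares #{similarity}"
  end
  # A registered tenant-scope field must be present as a filter path.
  if tenant_field
    filter_paths = fields.select { |f| f["type"] == "filter" }.map { |f| f["path"] }
    unless filter_paths.include?(tenant_field)
      findings << "tenant scope field #{tenant_field} is not a filter path"
    end
  end
  findings
end

definition = {
  "fields" => [
    { "type" => "vector", "path" => "body_embedding", "numDimensions" => 768, "similarity" => "cosine" },
    { "type" => "filter", "path" => "category" },
  ],
}
puts drift_findings(definition, field: "body_embedding", dimensions: 1536,
                    similarity: :cosine, tenant_field: "org_id")
```

Under a `:warn`-style policy the findings would be logged once; under a `:raise`-style policy a non-empty list would be cached and raised on every query.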
324
+ ### Query-embed caching and spend caps (v5.5)
325
+
326
+ Every `text:`-overload query funnels through one embed path
327
+ (`find_similar(text:)`, `hybrid_search(text:)`,
328
+ `Parse::Retrieval.retrieve` all share it), which gives two controls:
329
+
330
+ ```ruby
331
+ # Opt-in query-embed cache: repeated identical queries skip the
332
+ # provider round-trip. Keyed by (provider, model, dimensions,
333
+ # input_type, SHA-256(input)) — plaintext never lands in the store.
334
+ Parse::Embeddings::Cache.enable!(max_entries: 2048, ttl: 600)
335
+ Parse::Embeddings::Cache.stats # => { enabled:, hits:, misses:, size: }
336
+
337
+ # Per-tenant spend cap now covers DIRECT callers too, not just the
338
+ # semantic_search agent tool. Tenant identity resolves to the ambient
339
+ # Parse.with_cache_tenant scope when set, else a shared default bucket.
340
+ # warn_at: adds a soft cap — crossing 80% of the limit emits a
341
+ # parse.embeddings.spend_cap_warning AS::N event (alert, never refuse).
342
+ Parse::Embeddings::SpendCap.configure(limit_tokens: 1_000_000, window: 3600,
343
+ warn_at: 0.8)
344
+ Parse.with_cache_tenant("tenant_abc") do
345
+ Document.find_similar(text: query) # charged against tenant_abc
346
+ end
347
+ ```
348
+
349
+ Cache hits emit the standard `parse.embeddings.embed` notification
350
+ with `cached: true`, so existing spend subscribers see hits and misses
351
+ on one stream. The cache is in-process by default; for a persistent
352
+ layer shared across processes, wrap any Moneta-compatible backend in
353
+ the bundled adapter:
354
+
355
+ ```ruby
356
+ # Build the Moneta store with value_serializer: nil. MonetaStore JSON-encodes
357
+ # vectors itself; without value_serializer: nil, Moneta would additionally
358
+ # Marshal the values, and a cache read would Marshal.load bytes from a shared
359
+ # Redis — an RCE vector if that Redis is untrusted or MITM'd over redis://.
360
+ moneta = Moneta.new(:Redis, url: ENV["REDIS_URL"], value_serializer: nil)
361
+ Parse::Embeddings::Cache.enable!(
362
+ store: Parse::Embeddings::Cache::MonetaStore.new(moneta, ttl: 30 * 24 * 3600),
363
+ )
364
+ ```
365
+
366
+ `MonetaStore` namespaces keys, forwards TTL via Moneta's `expires:`,
367
+ and fails open (a backend error is a cache miss, never a failed
368
+ embed). Keys are input hashes — plaintext queries never land in the
369
+ shared store; the VALUES are embeddings, so give the store the same
370
+ access controls as the database. A query the agent tool already
371
+ charged per-tenant is not double-billed (`SpendCap.with_precharged`
372
+ wraps the tool's retrieval).
373
+
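To make the "plaintext never lands in the store" property concrete, here is a minimal sketch of how a key with the tuple shape described above could be derived. `embed_cache_key` is a hypothetical helper and the `:`-joined layout is an assumption; only the tuple members (provider, model, dimensions, input type, SHA-256 of the input) come from the text.

```ruby
require "digest"

# Hypothetical key builder (the real key format is internal to the SDK).
# The query text contributes only through its SHA-256 digest.
def embed_cache_key(provider:, model:, dimensions:, input_type:, input:)
  digest = Digest::SHA256.hexdigest(input.to_s)
  [provider, model, dimensions, input_type, digest].join(":")
end

key = embed_cache_key(provider: :openai, model: "text-embedding-3-small",
                      dimensions: 1536, input_type: :search_query,
                      input: "how do I reset my password?")
puts key
```

Because the model and dimensions are part of the key, switching either one naturally misses the old entries instead of returning vectors from the previous model.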
291
374
  ### ACL/CLP inheritance
292
375
 
293
376
  Vector search routes through `Parse::MongoDB.aggregate`. Every layer
@@ -405,6 +488,18 @@ branch — see [Hybrid search](#hybrid-search-vector--lexical) below) and
405
488
  chunking — see [Reranking](#reranking)). Both were reserved in earlier
406
489
  releases and now ship in 5.4.0.
407
490
 
491
+ **Pointer values in filters translate automatically (v5.5).** A filter
492
+ like `{ owner: some_user }` (a `Parse::Pointer` / `Parse::Object`, or a
493
+ wire-form `{"__type" => "Pointer", ...}` hash — including inside `$in`
494
+ / `$eq` / `$ne` operator hashes) is rewritten to its MongoDB storage
495
+ form `{ "_p_owner" => "_User$abc123" }` before the `$match` /
496
+ `$vectorSearch.filter` is built, so pointer filters match rows instead
497
+ of silently matching nothing. Translation runs after the
498
+ underscore-key gate (callers still cannot name `_p_*` columns
499
+ directly) and before the tenant-scope fold; the `semantic_search`
500
+ agent tool inherits it. For `vector_filter:` use, the pointer column
501
+ (`_p_owner`) must be declared `type: "filter"` in the index.
502
+
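The rewrite described above can be sketched for the wire-form case. This is an illustration under stated assumptions, not the SDK's translator: `translate_filter` and `storage_value` are hypothetical names, only wire-form `{"__type" => "Pointer", ...}` hashes are handled, and the underscore-key gate and tenant-scope fold are left out.

```ruby
# Hypothetical helpers (not the SDK's code): rewrite wire-form pointers
# into the "_p_<field>" => "<ClassName>$<objectId>" storage form.
def storage_value(value)
  case value
  when Hash
    if value["__type"] == "Pointer"
      "#{value['className']}$#{value['objectId']}"
    else
      # Operator hashes such as {"$in" => [...]} are translated recursively.
      value.transform_values { |v| storage_value(v) }
    end
  when Array
    value.map { |v| storage_value(v) }
  else
    value
  end
end

def translate_filter(filter)
  filter.to_h do |key, value|
    translated = storage_value(value)
    # Rename the column only when a pointer was actually rewritten.
    [translated == value ? key.to_s : "_p_#{key}", translated]
  end
end

owner = { "__type" => "Pointer", "className" => "_User", "objectId" => "abc123" }
p translate_filter(owner: owner, published: true)
```

Non-pointer values pass through unchanged, so a mixed filter keeps its scalar constraints as written.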
408
503
  ### Hybrid search (vector + lexical)
409
504
 
410
505
  `Class.hybrid_search` runs a lexical Atlas Search (`$search`) branch and a
@@ -556,13 +651,26 @@ envelope. See the [MCP guide's Token Economy section](./mcp_guide.md#token-econo
556
651
 
557
652
  ---
558
653
 
559
- ## Image embedding: `embed_image` macro (v5.1)
654
+ ## Image embedding: `embed_image` macro (v5.1 URL mode, v5.5 bytes mode)
560
655
 
561
656
  `embed_image` is the image-source counterpart to `embed`. The source
562
657
  property must be `:file`-typed; the target must be a `:vector` property
563
658
  whose declared `provider:` supports multimodal input (currently
564
659
  `:voyage` with `voyage-multimodal-3`, or `:cohere` with `embed-v4.0`).
565
660
 
661
+ Two fetch modes, selected per declaration with `source:`:
662
+
663
+ * **`source: :url`** (default) — the SDK validates the file's URL and
664
+ forwards it; the **provider** performs the fetch from its own
665
+ network. Requires the `trust_provider_url_fetch` sentinel (see
666
+ operator setup below).
667
+ * **`source: :bytes`** (v5.5) — the **SDK** downloads the image
668
+ through `Parse::File.safe_open_url`, verifies the content by
669
+ magic-byte sniff, strips EXIF/XMP metadata, and forwards the bytes
670
+ to the provider as a base64 data URI. No provider-side URL fetch
671
+ occurs, so the sentinel is NOT required — the
672
+ `allowed_image_hosts` allowlist still is.
673
+
566
674
  ```ruby
567
675
  class Post < Parse::Object
568
676
  property :cover_image, :file
@@ -621,6 +729,57 @@ with `Parse::File`, not parallelized). Failures raise
621
729
  (`:scheme`, `:port`, `:userinfo`, `:host_blocked`,
622
730
  `:host_not_allowlisted`, `:parse`).
623
731
 
732
+ ### Bytes mode (`source: :bytes`, v5.5)
733
+
734
+ ```ruby
735
+ # Operator setup — only the host allowlist is required (the sentinel
736
+ # applies to URL forwarding, not SDK-side fetches):
737
+ Parse::Embeddings.allowed_image_hosts = [".cloudfront.net"]
738
+
739
+ class Post < Parse::Object
740
+ property :cover_image, :file
741
+ property :cover_image_embedding, :vector,
742
+ dimensions: 1024, provider: :voyage, model: "voyage-multimodal-3"
743
+
744
+ embed_image :cover_image, into: :cover_image_embedding,
745
+ source: :bytes # exif_strip: true is the default
746
+ end
747
+ ```
748
+
749
+ What happens on each (digest-miss) save:
750
+
751
+ 1. The file URL is validated through
752
+ `Parse::Embeddings.validate_image_url!(url, mode: :fetch)` — the
753
+ same host allowlist (deny-all when empty), obfuscated-IP screen,
754
+ port allowlist, and CIDR resolution check as URL mode, minus the
755
+ provider-egress sentinel.
756
+ 2. `Parse::File.safe_open_url` downloads the bytes — CIDR blocks,
757
+ DNS-rebinding re-check, port allowlist, `max_remote_size` cap,
758
+ timeouts. No parallel fetch mechanism exists.
759
+ 3. **Magic-byte verification** (`Parse::Embeddings::ImageFetch`):
760
+ the MIME type is determined exclusively from the leading bytes
761
+ (JPEG / PNG / GIF / WebP). The HTTP `Content-Type` header is never
762
+ consulted. The sniffed type must be in
763
+ `Parse::Embeddings.allowed_image_types` (default those four; SVG is
764
+ deliberately excluded as script-capable active content), and when
765
+ the URL carries a recognized image extension, the extension must
766
+ AGREE with the magic bytes — a `.jpg` URL serving PNG bytes (or
767
+ HTML) is refused as MIME laundering
768
+ (`ImageFetch::InvalidImageType`, with a `:reason` tag).
769
+ 4. **EXIF/XMP stripping, default ON.** JPEG APP1 segments (Exif and
770
+ XMP), PNG `eXIf` chunks, and WebP `EXIF`/`XMP ` RIFF chunks (with
771
+ the VP8X flag bits cleared) are removed before the bytes leave the
772
+ process — user photos commonly carry GPS coordinates and device
773
+ serials. Opt out per declaration with `exif_strip: false` when
774
+ orientation metadata must survive.
775
+ 5. The verified bytes ride to the provider as a base64 data URI
776
+ (Voyage `image_base64` content row; Cohere `image_url` data-URI
777
+ form).
778
+
779
+ Direct provider calls accept the same shape:
780
+ `provider.embed_image([Parse::Embeddings::ImageFetch.fetch!(url)])` —
781
+ `FetchedImage` sources and URL Strings may be mixed in one batch.
782
+
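Step 3 above (magic-byte verification plus the extension cross-check) can be illustrated in a few lines. The sketch below is not `Parse::Embeddings::ImageFetch`; `sniff_image_type`, `extension_type`, and `verify_image!` are hypothetical names, and it raises a plain `ArgumentError` where the SDK is documented to raise `ImageFetch::InvalidImageType`. The signatures used are the standard leading bytes for JPEG, PNG, GIF, and WebP.

```ruby
require "uri"

# Hypothetical sniffer: the type comes only from the leading bytes,
# never from a Content-Type header.
def sniff_image_type(bytes)
  head = bytes.to_s.byteslice(0, 12).to_s.b
  return "image/jpeg" if head.start_with?("\xFF\xD8\xFF".b)
  return "image/png"  if head.start_with?("\x89PNG\r\n\x1A\n".b)
  return "image/gif"  if head.start_with?("GIF87a".b, "GIF89a".b)
  return "image/webp" if head.byteslice(0, 4) == "RIFF".b && head.byteslice(8, 4) == "WEBP".b
  nil
end

# Maps a recognized image extension to its MIME type (nil otherwise).
def extension_type(url)
  case File.extname(URI(url).path).downcase
  when ".jpg", ".jpeg" then "image/jpeg"
  when ".png"          then "image/png"
  when ".gif"          then "image/gif"
  when ".webp"         then "image/webp"
  end
end

def verify_image!(url, bytes, allowed: %w[image/jpeg image/png image/gif image/webp])
  sniffed = sniff_image_type(bytes)
  raise ArgumentError, "unrecognized or disallowed image content" unless allowed.include?(sniffed)
  expected = extension_type(url)
  if expected && expected != sniffed
    raise ArgumentError, "extension says #{expected}, bytes say #{sniffed}"
  end
  sniffed
end

png = "\x89PNG\r\n\x1A\n".b + "rest-of-file".b
puts verify_image!("https://d111.cloudfront.net/cover.png", png)
```

SVG has no entry in the sniffer at all, which mirrors the default allowlist described above: content that is not one of the four raster types is refused before any extension logic runs.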
624
783
  ### Save-side semantics
625
784
 
626
785
  * Digest is the **SHA-256 of the URL String**, not the file bytes.
@@ -641,24 +800,102 @@ with `Parse::File`, not parallelized). Failures raise
641
800
 
642
801
  ## Re-embedding existing rows
643
802
 
644
- Changing `model:`, `dimensions:`, or `provider:` on an existing
645
- `:vector` property is a migration regardless of whether the source is
646
- text or images. Workflow:
803
+ ### Provenance: the `<into>_meta` sibling (v5.5)
804
+
805
+ Every `embed` / `embed_image` declaration auto-declares an
806
+ `<into>_meta` `:object` sibling (override with `meta_field:`) stamped
807
+ on each recompute and cleared with the vector:
808
+
809
+ ```ruby
810
+ doc.body_embedding_meta
811
+ # => { "provider" => "openai",
812
+ # "model" => "text-embedding-3-small",
813
+ # "dimensions" => 1536,
814
+ # "modality" => "text",
815
+ # "embedded_at" => "2026-06-09T17:32:11Z" }
816
+ ```
817
+
818
+ This is the record migration tooling reads to know which model
819
+ produced any stored vector.
820
+
821
+ ### Same-shape migrations: `Class.reembed!` (v5.5)
822
+
823
+ When the new model has the **same dimensions** (e.g. swapping
824
+ `text-embedding-3-small` for a same-width replacement, or a provider
825
+ change at equal width), re-embed in place:
826
+
827
+ ```ruby
828
+ # Re-embed every row through the CURRENT provider/model declaration.
829
+ Document.reembed!(batch_size: 100)
830
+
831
+ # Resumable: skip rows whose <into>_meta already matches the current
832
+ # provider + model + dimensions (rows with no meta count as stale).
833
+ Document.reembed!(only_stale: true)
834
+
835
+ # Scope it
836
+ Document.reembed!(field: :body_embedding, where: { published: true }, limit: 10_000)
837
+ ```
838
+
839
+ `reembed!` walks the class with objectId-cursor pagination, clears
840
+ each row's digest sibling (so the save-path recompute cannot elide the
841
+ provider call), and saves. Unlike `embed_pending!` — which only fills
842
+ NULL vectors — `reembed!` recomputes populated rows too. Run it with a
843
+ master-key client (or pass `save_opts:` with a session token that can
844
+ write every row). Each row's save makes one provider call; pace bulk
845
+ runs against provider rate limits (see `BatchEmbedder` below for the
846
+ pattern, or just throttle the loop).
847
+
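The `only_stale: true` rule described above (skip rows whose meta matches the current provider, model, and dimensions; treat rows with no meta as stale) reduces to a small predicate. `stale_embedding?` is a hypothetical name used for illustration; the hash keys match the `<into>_meta` example shown earlier, but the SDK's own check may differ.

```ruby
# Hypothetical predicate (not the SDK's implementation): decide whether a
# row needs re-embedding based on its provenance hash.
def stale_embedding?(meta, provider:, model:, dimensions:)
  return true if meta.nil? || meta.empty? # no provenance counts as stale

  meta["provider"] != provider.to_s ||
    meta["model"] != model ||
    meta["dimensions"] != dimensions
end

meta = { "provider" => "openai", "model" => "text-embedding-3-small",
         "dimensions" => 1536, "modality" => "text" }

puts stale_embedding?(meta, provider: :openai, model: "text-embedding-3-small", dimensions: 1536)
puts stale_embedding?(meta, provider: :openai, model: "text-embedding-3-large", dimensions: 1536)
puts stale_embedding?(nil, provider: :openai, model: "text-embedding-3-small", dimensions: 1536)
```

This is what makes an interrupted run resumable: rows already stamped with the current declaration are skipped on the next pass.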
848
+ ### Changed-width migrations: dual-field workflow
849
+
850
+ Changing `dimensions:` is a different beast — the existing
851
+ vectorSearch index can't serve the new width. Use the shadow-field
852
+ workflow:
647
853
 
648
854
  1. Add the new property alongside the old one
649
855
  (`property :body_embedding_v2, :vector, ...`) and an `embed` or
650
856
  `embed_image` block targeting it.
651
- 2. Backfill: iterate existing rows, force a save (or null+save) to
652
- trigger the new directive. The old field stays valid for reads.
653
- 3. Once backfill completes, deploy a new vectorSearch index covering
654
- the new field and migrate `find_similar` callers.
655
- 4. Drop the old property.
656
-
657
- Do NOT mutate the model in place — the digest mechanism will see
658
- unchanged source text / unchanged source URL and skip recompute,
659
- leaving stale vectors. For `embed_image`, also remember the digest is
660
- over the URL String: if you replace bytes at the same URL (PUT-replace
661
- on S3 without renaming), null the digest field to force re-embed.
857
+ 2. Backfill with `embed_pending!(field: :body_embedding_v2)`: the new
858
+ field is null everywhere, so the null-filling walk is exactly right.
859
+ 3. Deploy a new vectorSearch index covering the new field and migrate
860
+ `find_similar` callers.
861
+ 4. Drop the old property and index.
862
+
863
+ Do NOT mutate a model's `dimensions:` in place — the digest mechanism
864
+ will see unchanged source text and skip recompute, leaving stale
865
+ vectors, and the drift verifier will flag every query against the old
866
+ index (`index numDimensions=1536 but property declares ...`). For
867
+ `embed_image`, also remember the digest is over the URL String: if you
868
+ replace bytes at the same URL (PUT-replace on S3 without renaming),
869
+ null the digest field — or run `reembed!` — to force re-embed.
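
To make the stale-vector failure mode concrete, here is a minimal sketch of a digest-gated recompute. It assumes the digest is a SHA-256 over the source text only; the `Row` struct and `recompute_needed?` are hypothetical stand-ins, and the gem's real digest inputs and field names may differ.

```ruby
require "digest"

# Hypothetical row with a vector and its digest sibling.
Row = Struct.new(:body, :body_embedding, :body_embedding_digest)

# Assumed gate: recompute only when the stored digest no longer matches
# the digest of the current source text.
def recompute_needed?(row)
  row.body_embedding_digest != Digest::SHA256.hexdigest(row.body.to_s)
end

row = Row.new("hello world", Array.new(1536, 0.0),
              Digest::SHA256.hexdigest("hello world"))

# Changing the declared width does not touch the source text, so the
# digest still matches and the old 1536-wide vector would be kept.
stale_kept = !recompute_needed?(row)

# Clearing the digest forces the next save to recompute.
row.body_embedding_digest = nil
forced = recompute_needed?(row)
```

Under that assumption nothing about the declared width enters the gate, which is why an in-place width change leaves old vectors behind.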
870
+
871
+ ---
872
+
873
+ ## Bulk embedding: `BatchEmbedder` (v5.5)
874
+
875
+ `Provider#embed_text_batched` only slices input into provider-sized
876
+ chunks; retry lives inside each provider's single HTTP call. For bulk
877
+ jobs (ingest pipelines, chunk-corpus embedding) use
878
+ `Parse::Embeddings::BatchEmbedder`, which adds batch-level pacing and
879
+ backoff:
880
+
881
+ ```ruby
882
+ embedder = Parse::Embeddings::BatchEmbedder.new(
883
+ Parse::Embeddings.provider(:openai),
884
+ requests_per_minute: 60, # inter-batch pacing
885
+ max_attempts: 5, # per-batch tries (exponential backoff + jitter)
886
+ on_progress: ->(done:, total:, batch_index:, batch_count:) {
887
+ puts "#{done}/#{total}"
888
+ },
889
+ )
890
+ vectors = embedder.embed_text(texts, input_type: :search_document)
891
+ ```
892
+
893
+ Rate-limit and transient errors (any provider error class ending in
894
+ `RateLimitError` / `TransientError`; override with `retry_on:`) retry
895
+ with exponential backoff; other errors propagate immediately. A batch
896
+ that exhausts its attempts raises `BatchEmbedder::BatchFailed`
897
+ carrying `batch_index` and `completed_count`, so a resumable job knows
898
+ exactly where to pick up.
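
The retry and resume behaviour described above can be sketched in plain Ruby. This is not `BatchEmbedder`'s implementation: `embed_in_batches`, `SketchTransientError`, and `SketchBatchFailed` are hypothetical names, and the block stands in for one provider call per batch.

```ruby
# Hypothetical error types for the sketch.
class SketchTransientError < StandardError; end

class SketchBatchFailed < StandardError
  attr_reader :batch_index, :completed_count

  def initialize(batch_index, completed_count, last_error)
    @batch_index = batch_index
    @completed_count = completed_count
    super("batch #{batch_index} gave up: #{last_error.message}")
  end
end

# Calls the block once per batch, retrying transient failures with
# exponential backoff plus jitter. Other errors propagate immediately.
def embed_in_batches(texts, batch_size:, max_attempts: 5, base_delay: 0.5,
                     sleeper: Kernel.method(:sleep))
  results = []
  texts.each_slice(batch_size).with_index do |batch, batch_index|
    attempts = 0
    begin
      attempts += 1
      results.concat(yield(batch))
    rescue SketchTransientError => e
      raise SketchBatchFailed.new(batch_index, results.size, e) if attempts >= max_attempts

      # Delay doubles per attempt; the random factor spreads out retries.
      sleeper.call(base_delay * (2**(attempts - 1)) * (1 + rand))
      retry
    end
  end
  results
end

# Demo: the first two provider calls fail, the rest succeed.
calls = 0
delays = []
vectors = embed_in_batches(%w[a b c], batch_size: 2, sleeper: ->(s) { delays << s }) do |batch|
  calls += 1
  raise SketchTransientError, "429" if calls <= 2

  batch.map { |text| [text.length.to_f] } # fake one-dimensional vectors
end
```

A resumable job in this sketch would rescue `SketchBatchFailed` and restart from `texts.drop(error.completed_count)`.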
662
899
 
663
900
  ---
664
901
 
@@ -728,6 +965,54 @@ floats out). Vectors only flow through the Parse↔Mongo path, where
728
965
  the body builder's `<vector dims=N>` compaction prevents them from
729
966
  landing in stdout / error trackers.
730
967
 
968
+ ### When the embedded source is PII: deployment checklist
969
+
970
+ An embedding of PII is PII-equivalent. Inversion attacks reconstruct
971
+ substantial source text from dense embeddings, and a vector's nearest
972
+ neighbors leak the source's meaning even without reconstruction. If
973
+ the fields you `embed` contain personal data (names, addresses, health
974
+ or financial details, free-text user messages), treat the vector
975
+ column with the same handling as the source column:
976
+
977
+ 1. **Provider contract.** You are sending the raw source text (and in
978
+ bytes mode, image content) to the embedding provider on every
979
+ recompute. Confirm the provider's data-retention and training-use
980
+ terms cover PII, and that a DPA is in place where required.
981
+ Self-hosting via `LocalHTTP` (Ollama / vLLM / TEI) keeps the text
982
+ in your network.
983
+ 2. **Keep vectors off the wire.** Leave `vector_visibility` at its
984
+ `:owner_only` default so vectors are omitted from `as_json` and
985
+ webhook payloads. Do not flip a PII class to `:public`.
986
+ 3. **Row ACL still governs.** Vector hits route mongo-direct with
987
+ `_rperm` enforcement — verify your rows carry real ACLs and that
988
+ callers use scoped credentials (`session_token:` / `acl_user:`),
989
+ not blanket master key.
990
+ 4. **Tenant isolation.** Multi-tenant deployments must declare
991
+ `agent_tenant_scope` on searchable classes; the scope folds into
992
+ `$vectorSearch.filter` (and v5.5's drift verification confirms the
993
+ index covers it). Without it, similarity scores leak cross-tenant
994
+ document existence.
995
+ 5. **Score exposure.** Keep score quantization on for non-admin agent
996
+ contexts (the default) — full-precision scores enable
997
+ membership-inference probing.
998
+ 6. **EXIF stays stripped.** For image embedding, keep the bytes-mode
999
+ default `exif_strip: true`; user photos carry GPS coordinates and
1000
+ device serials that would otherwise reach the provider.
1001
+ 7. **Log and cache hygiene.** Redact query text at the Faraday layer
1002
+ (above); if you enable the persistent L2 cache, note that cache
1003
+ KEYS are hashes (no plaintext) but cache VALUES are the embeddings
1004
+ themselves — point `MonetaStore` at a store with the same access
1005
+ controls as the database.
1006
+ 8. **Deletion propagation.** When a user exercises erasure rights,
1007
+ the vector, its `<field>_digest`, and its `<field>_meta` siblings
1008
+ live on the same row and delete with it — but check external
1009
+ copies: provider-side logs (their retention policy), your L2
1010
+ embedding cache (TTL or explicit flush), and any analytics sink
1011
+ subscribed to embedding events.
1012
+ 9. **Migration hygiene.** `reembed!` re-sends every row's source text
1013
+ to the provider — schedule PII-class migrations under the same
1014
+ approvals as a data export.
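
For item 5, a small sketch shows why coarse scores blunt probing: nearby raw scores collapse to the same reported value. `quantize_score` and the `0.05` step are assumptions for illustration; the gem's actual quantization rule is not described in this guide.

```ruby
# Hypothetical helper: snaps a similarity score to the nearest multiple of
# `step`, then rounds away floating-point noise.
def quantize_score(score, step: 0.05)
  ((score / step).round * step).round(4)
end

# Two slightly different raw scores become indistinguishable to the caller.
first = quantize_score(0.8731)
second = quantize_score(0.8712)
same_bucket = first == second
```

The trade-off is ranking resolution: ties within a bucket need some other ordering rule.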
1015
+
731
1016
  ---
732
1017
 
733
1018
  ## Troubleshooting
@@ -775,10 +1060,20 @@ on every poll) rather than a `until index_ready?; sleep` loop.
775
1060
  Key files:
776
1061
 
777
1062
  * `lib/parse/embeddings.rb` — registry, `Configuration`, `register`,
778
- `provider`, `configure`, `validate_image_url!`,
779
- `trust_provider_url_fetch=`, `allowed_image_hosts=`.
1063
+ `provider`, `configure`, `validate_image_url!` (`mode: :forward | :fetch`),
1064
+ `trust_provider_url_fetch=`, `allowed_image_hosts=`,
1065
+ `allowed_image_types=`.
780
1066
  * `lib/parse/embeddings/provider.rb` — abstract base, `validate_response!`,
781
1067
  `instrument_embed`, AS::N payload contract.
1068
+ * `lib/parse/embeddings/image_fetch.rb` — bytes-fetch path:
1069
+ `ImageFetch.fetch!`, magic-byte `sniff_mime`/`verify!`, EXIF/XMP
1070
+ stripping, `FetchedImage`.
1071
+ * `lib/parse/embeddings/batch_embedder.rb` — `BatchEmbedder` bulk
1072
+ orchestration (pacing, batch-level backoff, `BatchFailed`).
1073
+ * `lib/parse/embeddings/cache.rb` — opt-in query-embed cache
1074
+ (`Cache.enable!` / `fetch_vector` / `stats`).
1075
+ * `lib/parse/embeddings/spend_cap.rb` — per-tenant token cap
1076
+ (`charge!`, `charge_query!`, `with_precharged`).
782
1077
  * `lib/parse/embeddings/openai.rb` — OpenAI provider.
783
1078
  * `lib/parse/embeddings/cohere.rb` — Cohere v3 + v4.0 text-mode provider.
784
1079
  * `lib/parse/embeddings/voyage.rb` — Voyage text + multimodal-3
@@ -788,9 +1083,13 @@ Key files:
788
1083
  * `lib/parse/embeddings/local_http.rb` — generic OpenAI-compatible
789
1084
  local-gateway client.
790
1085
  * `lib/parse/embeddings/fixture.rb` — deterministic test provider.
791
- * `lib/parse/model/core/vector_searchable.rb` — `find_similar`.
1086
+ * `lib/parse/model/core/vector_searchable.rb` — `find_similar`,
1087
+ `hybrid_search`, index drift verification
1088
+ (`Parse::VectorSearch.index_drift_policy`).
792
1089
  * `lib/parse/model/core/embed_managed.rb` — `embed` and `embed_image`
793
- macros, `EmbedDirective` (carries `modality:`, `allow_insecure:`).
1090
+ macros, `EmbedDirective` (carries `modality:`, `allow_insecure:`,
1091
+ `source_mode:`, `exif_strip:`, `meta_field:`), `embed_pending!`,
1092
+ `reembed!`.
794
1093
  * `lib/parse/vector_search.rb` — low-level `Parse::VectorSearch.search`.
795
1094
  * `lib/parse/atlas_search/index_manager.rb` — `IndexCatalog.create_index`,
796
1095
  `find_vector_index`, `wait_for_ready`.
@@ -336,6 +336,17 @@ module Parse
336
336
  return if target.nil?
337
337
  target_str = target.to_s
338
338
  return if target_str.empty?
339
+ # RT-7 / NEW-4: hard internal-collection floor FIRST, independent of
340
+ # CLP. This must run on EVERY join target on the direct
341
+ # Parse::MongoDB.aggregate path. LookupRewriter.auto_rewrite (the other
342
+ # caller of assert_collection_allowed!) is skipped when rewrite_lookups
343
+ # is off or the root class can't be resolved, so relying on it alone
344
+ # leaves a gap: an internal collection (`_SCHEMA`/`_Hooks`/`_Audit`/
345
+ # `_GlobalConfig`/...) whose CLP fetch returns :no_clp would pass the
346
+ # permits? check below. The floor refuses those outright while still
347
+ # admitting the SDK data classes (`_User`/`_Role`/`_Installation`/
348
+ # `_Session`), which then face the per-scope CLP `find` gate.
349
+ Parse::PipelineSecurity.assert_collection_allowed!(target_str)
339
350
  return if Parse::CLPScope.permits?(target_str, :find, perms)
340
351
  raise Parse::CLPScope::Denied.new(
341
352
  target_str, :find,