RubyGems - parse-stack-next - Versions diffs - 5.4.1 → 5.5.1 - Mend

parse-stack-next 5.4.1 → 5.5.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (41)

checksums.yaml +4 -4
data/CHANGELOG.md +489 -0
data/Gemfile.lock +1 -1
data/README.md +61 -9
data/docs/atlas_vector_search_guide.md +318 -19
data/lib/parse/acl_scope.rb +11 -0
data/lib/parse/agent/mcp_rack_app.rb +53 -14
data/lib/parse/agent/mcp_server.rb +19 -0
data/lib/parse/api/path_segment.rb +31 -0
data/lib/parse/api/users.rb +13 -0
data/lib/parse/cache/redis.rb +55 -11
data/lib/parse/client/caching.rb +12 -3
data/lib/parse/client/logging.rb +9 -0
data/lib/parse/client.rb +37 -3
data/lib/parse/embeddings/batch_embedder.rb +188 -0
data/lib/parse/embeddings/cache.rb +374 -0
data/lib/parse/embeddings/cohere.rb +31 -18
data/lib/parse/embeddings/image_fetch.rb +347 -0
data/lib/parse/embeddings/provider.rb +17 -11
data/lib/parse/embeddings/spend_cap.rb +117 -3
data/lib/parse/embeddings/voyage.rb +34 -25
data/lib/parse/embeddings.rb +40 -3
data/lib/parse/model/acl.rb +15 -11
data/lib/parse/model/core/embed_managed.rb +243 -14
data/lib/parse/model/core/properties.rb +42 -5
data/lib/parse/model/core/vector_searchable.rb +157 -8
data/lib/parse/mongodb.rb +12 -0
data/lib/parse/pipeline_security.rb +81 -15
data/lib/parse/query/constraint.rb +22 -0
data/lib/parse/query/constraints.rb +271 -250
data/lib/parse/query.rb +284 -43
data/lib/parse/retrieval/agent_tool.rb +21 -14
data/lib/parse/retrieval/retriever.rb +84 -0
data/lib/parse/schema/search_index_migrator.rb +48 -1
data/lib/parse/stack/version.rb +1 -1
data/lib/parse/stack.rb +12 -1
data/lib/parse/vector_search/hybrid.rb +39 -1
data/lib/parse/vector_search.rb +34 -0
data/lib/parse/webhooks/payload.rb +7 -1
data/lib/parse/webhooks.rb +107 -21
metadata +4 -1

data/README.md CHANGED

@@ -4,6 +4,13 @@
 A full-featured Ruby client SDK for [Parse Server](http://parseplatform.org/). [parse-stack-next](https://github.com/neurosynq/parse-stack-next) is a Ruby client SDK, REST client, and Active Model ORM for [Parse Server](http://parseplatform.org/), combining a low-level API client, a query engine, an object-relational mapper (ORM), and a Cloud Code Webhooks rack application in a single gem.
+### What's new in 5.5
+- **5.5.0 — Multimodal bytes-fetch with magic-byte MIME verification** — `embed_image ..., source: :bytes` has the SDK download an image itself through the `Parse::File.safe_open_url` SSRF primitive, verify the content by **magic-byte sniff** (the `Content-Type` header is never consulted — a `.jpg` URL serving HTML is refused), cross-check the URL extension, enforce a `Parse::Embeddings.allowed_image_types` allowlist, strip EXIF/XMP metadata **by default** (JPEG APP1, PNG `eXIf`, WebP `EXIF`/`XMP ` chunks; opt out with `exif_strip: false`), and forward the verified bytes to Voyage/Cohere as a base64 data URI. No provider-side URL fetch occurs, so the `trust_provider_url_fetch` sentinel is not required — the host allowlist still applies. See [CHANGELOG.md](./CHANGELOG.md)
+- **5.5.0 — Embedding-model migration tooling** — `Class.reembed!(only_stale: true)` bulk re-embeds rows through the current provider/model (resumable; skips rows already current), driven by the new auto-declared `<into>_meta` provenance sibling (`{provider, model, dimensions, modality, embedded_at}`, stamped on every recompute). `Parse::Embeddings::BatchEmbedder` adds batch-level requests-per-minute pacing and exponential backoff for bulk jobs; `Parse::Embeddings::Cache.enable!` adds an opt-in query-embed cache keyed by `(provider, model, input_type, input-hash)` so repeated identical queries skip the provider round-trip. See [CHANGELOG.md](./CHANGELOG.md)
+- **5.5.0 — Vector index drift detection** — on first auto-discovered use of an Atlas vectorSearch index, the SDK verifies the deployed index's `numDimensions`/`similarity` against the `:vector` property declaration and confirms a registered `agent_tenant_scope` field is covered as a `type: "filter"` path. Policy via `Parse::VectorSearch.index_drift_policy` (`:warn` default / `:raise` / `:ignore`). `Parse::Schema::SearchIndexMigrator` now auto-includes the tenant-scope field in `vectorSearch` declarations, so newly created indexes support tenant-scoped pre-filtering out of the box. See [CHANGELOG.md](./CHANGELOG.md)
+- **5.5.0 — Retrieval spend-cap and filter hardening** — the per-tenant embedding spend cap now covers every query-embed path (`find_similar(text:)`, `hybrid_search(text:)`, `Parse::Retrieval.retrieve`), not just the `semantic_search` agent tool; tenant identity resolves through the ambient `Parse.with_cache_tenant` scope. Caller-supplied retrieval filters now translate Parse pointer values to storage form (`{ owner: user }` → `{ "_p_owner" => "_User$id" }`), so pointer filters match rows instead of silently matching nothing. See [CHANGELOG.md](./CHANGELOG.md)
 ### What's new in 5.4
 - **5.4.0 — Hybrid search + reranking for RAG** — `Class.hybrid_search(text:, lexical:, vector:, k:, fusion:)` fuses a lexical Atlas Search branch with a `$vectorSearch` branch using reciprocal-rank fusion (RRF): lexical search nails exact tokens (codes, proper nouns), vector search nails paraphrase, and fusing the two beats either alone. Each branch enforces ACL/CLP independently before fusion (no separate hydration fetch to secure); results carry `#hybrid_score` / `#hybrid_ranks`. `Parse::VectorSearch::Hybrid.rank_fusion_supported?` detects Atlas 8.0+ native `$rankFusion` by a cached behavioural probe (native execution is opt-in; client-side RRF is the always-enforced default). `Parse::Retrieval::Reranker` adds cross-encoder reranking (`Reranker::Cohere` over `/v2/rerank`, plus a deterministic `Reranker::Fixture`), wired into `Parse::Retrieval.retrieve(hybrid:, rerank:)`. `Parse::Embeddings::SpendCap` adds an opt-in per-tenant embedding token cap (hard-refuse) at the `semantic_search` agent-tool boundary. See [CHANGELOG.md](./CHANGELOG.md) and [`docs/atlas_vector_search_guide.md`](./docs/atlas_vector_search_guide.md)
@@ -38,7 +45,7 @@ See [CHANGELOG.md](./CHANGELOG.md) for the full 5.2 entry.
 - **`Parse::File` URL normalization + presigned-URL stash** — `Parse::File#url=` and `attributes=` now strip signed-URL query parameters (`X-Amz-Signature`, `AWSAccessKeyId`, `Key-Pair-Id`, etc.) before storage; the bare canonical URL lands in `@url`, and the original signed URL is stashed in `file.presigned_url` with a data-driven expiry in `file.presigned_url_expires_at`. New `file.presigned_url_valid?(buffer: 60)` predicate, configurable `Parse::File.signed_url_policy = :strip | :raise`, and `Parse::File.log_filter` / `log_filter_strict` regexes for `lograge` / Sentry / Honeybadger scrubbers. `Parse::File#inspect` no longer emits the URL — see CHANGELOG for the error-reporter payload migration callout
 - **`Parse::Lock` — public TTL-bounded mutual-exclusion primitive** — `Parse::Lock.acquire(key, ttl:, wait:) { … }` exposes the Redis-backed lock previously hidden inside `first_or_create!` as a first-class API. In-process `Mutex` fallback for memory-backed caches, fails closed on backend errors, HMAC-keyed via `PARSE_STACK_LOCK_SECRET`, namespace-separated from `first_or_create!` so the two cannot collide
 - **LiveQuery ergonomics** — autoloaded (no explicit `require 'parse/live_query'`); connections are **ACL-scoped by default** (build an admin, ACL-bypassing connection explicitly with `Parse::LiveQuery::Client.new(use_master_key: true)` — master-key authorization is per-connection, not per-subscription); `Query#subscribe` / `Klass.subscribe` accept a block yielded the `Subscription` *before* the subscribe frame is sent so `sub.on(:create) { … }` callbacks are wired before any server event can arrive; `Parse::LiveQuery.run_until_signal!(client:) { … }` is a signal-safe shutdown helper for long-running consumers
-- **Image embeddings** — new `embed_image` class macro for `:file`-typed source properties plus `Voyage#embed_image` (`voyage-multimodal-3`, 1024-dim) and `Cohere#embed_image` (`embed-v4.0`, 1536-dim). URL-only routing in v5.1 (bytes-fetch with MIME-sniff lands later); operator-gated via the `Parse::Embeddings.trust_provider_url_fetch = "PROVIDER_EGRESS_VERIFIED"` sentinel plus a `Parse::Embeddings.allowed_image_hosts` CDN allowlist
+- **Image embeddings** — new `embed_image` class macro for `:file`-typed source properties plus `Voyage#embed_image` (`voyage-multimodal-3`, 1024-dim) and `Cohere#embed_image` (`embed-v4.0`, 1536-dim). URL-only routing in v5.1 (the bytes-fetch path with MIME-sniff shipped in v5.5 as `source: :bytes`); operator-gated via the `Parse::Embeddings.trust_provider_url_fetch = "PROVIDER_EGRESS_VERIFIED"` sentinel plus a `Parse::Embeddings.allowed_image_hosts` CDN allowlist
 - **Tenant-aware cache namespacing** — `Parse.with_cache_tenant(scope) { … }` composes the tenant into the response-cache key as `<base>:T:<tenant>:…` so a multi-tenant app sharing one Redis gets per-tenant key isolation and per-tenant SCAN-delete eviction without per-tenant `Parse::Client.new` plumbing. Fiber-local, restored on block exit, AS::N payloads carry `:cache_tenant`
 - **`_User` field-visibility DSL** — `Parse::User.master_only_fields(*fields)` and `Parse::User.self_visible_fields(*fields, via: :self)` declare admin-only and owner-only field protections on `_User`. Requires Parse Server's `protectedFieldsOwnerExempt: false` server option (the SDK emits a one-time advisory at class declaration so the dependency is surfaced before deploy). Parse Server's default for this option is changing to `false` in a future version; until your server adopts that default, set it explicitly
 - **`Parse::Installation` `belongs_to :user`** — read `installation.user` to find which user a device is currently signed in as. Symmetric `Parse::User#has_many :installations` for targeted-push grouping (master-key-only by Parse Server design; see the YARD for the owner-identity caveat)
@@ -64,6 +71,16 @@ See [CHANGELOG.md](./CHANGELOG.md) for the full 5.0 entry, including security-ha
 ### Core capabilities
+> **Vector search requires MongoDB Atlas (or Atlas Local).** The `:vector`
+> property, `find_similar`, `hybrid_search`, and `Parse::Retrieval` all
+> execute Atlas `$vectorSearch` / `$search` aggregation stages, which exist
+> only on Atlas clusters and the Atlas Local container — community/self-hosted
+> MongoDB is not supported and there is no in-process fallback (a pure-Ruby
+> cosine scan over a real collection is a silent performance cliff, so the
+> SDK refuses rather than degrades). This is a closed design decision.
+> Everything else in this list works against any MongoDB that Parse Server
+> supports.
 - MongoDB Aggregation Framework support
 - **MongoDB Atlas Search** — full-text search, autocomplete, faceted search with direct MongoDB access
 - **Direct MongoDB Queries** — bypass Parse Server's REST surface for high-performance reads, with SDK-side ACL/CLP/`protectedFields` enforcement for scoped agents
@@ -611,12 +628,21 @@ If `faraday-net_http_persistent` is not available, Parse Stack automatically fal
 A caching adapter of type `Moneta::Transformer`. Caching queries and object fetches can help improve the performance of your application, even if it is for a few seconds. Only successful `GET` object fetches and queries (non-empty) will be cached. You may set the default expiration time with the `expires` option. See related: [Moneta](https://github.com/minad/moneta). At any point in time you may clear the cache by calling the `clear_cache!` method on the client connection.
 ```ruby
-  store = Moneta.new :Redis, url: 'redis://localhost:6379'
+  # Use the bundled Parse::Cache::Redis wrapper for a Redis-backed cache. It
+  # serializes cached responses as JSON (never Marshal): a raw
+  # `Moneta.new(:Redis, ...)` store Marshals values by default, so a cache
+  # read would `Marshal.load` bytes from Redis — an RCE vector if that Redis
+  # is shared, unauthenticated, or reachable over a plaintext `redis://` MITM.
+  store = Parse::Cache::Redis.new(url: 'redis://localhost:6379')
    # use a Redis cache store with an automatic expire of 10 seconds.
   Parse.setup(cache: store, expires: 10, ...)
 ```
-As a shortcut, if you are planning on using REDIS and have configured the use of `redis` in your `Gemfile`, you can just pass the REDIS connection string directly to the cache option.
+If you supply your own raw `Moneta.new(:Redis, ...)` store instead of the
+wrapper, build it with `value_serializer: nil` to keep Marshal off the cache
+read path.
+As a shortcut, if you are planning on using REDIS and have configured the use of `redis` in your `Gemfile`, you can just pass the REDIS connection string directly to the cache option. The string form builds a `Parse::Cache::Redis` wrapper for you, so it is JSON-serialized and safe by default.
 ```ruby
   Parse.setup(cache: 'redis://localhost:6379', ...)
@@ -5325,7 +5351,11 @@ If you are already have setup a client that is being used by your defined models
 For high traffic applications that may be performing several server tasks on similar objects, you may utilize request caching. Caching is provided by a the `Parse::Middleware::Caching` class which utilizes a [Moneta store](https://github.com/minad/moneta) object to cache GET url requests that have allowable status codes (ex. HTTP 200, etc). The cache entry for the url will be removed when it is either considered expired (based on the `expires` option) or if a non-GET request is made with the same url. Using this feature appropriately can dramatically reduce your API request usage.
 ```ruby
-store = Moneta.new :Redis, url: 'redis://localhost:6379'
+# Parse::Cache::Redis serializes cached responses as JSON, not Marshal — a raw
+# Moneta.new(:Redis) store Marshals values by default and a cache read would
+# Marshal.load Redis bytes (RCE if the cache is shared/untrusted). Prefer the
+# wrapper; if you supply a raw Moneta-Redis store, pass value_serializer: nil.
+store = Parse::Cache::Redis.new(url: 'redis://localhost:6379')
  # use a Redis cache store with an automatic expire of 10 seconds.
 Parse.setup(cache: store, expires: 10, ...)
@@ -5533,13 +5563,24 @@ pipeline = [
 Filter objects by ACL permissions using MongoDB's `_rperm` and `_wperm` fields:
-**`readable_by` / `writable_by`** - Exact permission strings:
+**`readable_by` / `writable_by`** - filter by principal:
 ```ruby
 Song.query.readable_by("user123").results(mongo_direct: true)       # User ID
 Song.query.readable_by("role:Admin").results(mongo_direct: true)    # Role (explicit prefix)
-Song.query.readable_by(current_user).results(mongo_direct: true)    # User object
-Song.query.readable_by("public").results(mongo_direct: true)        # Public access (alias for "*")
-Song.query.readable_by("none").results(mongo_direct: true)          # Empty _rperm (master key only)
+Song.query.readable_by(current_user).results(mongo_direct: true)    # User object (roles expanded)
+Song.query.readable_by(:public).results(mongo_direct: true)         # Public access (maps to "*")
+Song.query.readable_by([]).results(mongo_direct: true)              # No read perms (empty _rperm)
+```
+By default the match is **inclusive** — it ALSO returns publicly-readable rows
+(`_rperm` contains `"*"`) and rows with a missing `_rperm` (public by absence),
+because those are genuinely readable by the principal (access-simulation
+semantics). For an **exact** match — only rows whose `_rperm` literally grants
+the principal, with no public/missing rows — pass `strict: true`. This is what
+an ownership or security audit wants:
+```ruby
+Song.query.readable_by("role:Admin", strict: true).results   # ONLY rows that explicitly grant Admin
 ```
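The inclusive and strict behaviours described in this hunk can be pictured as two different `$match` shapes on `_rperm`. This is a minimal sketch, not the gem's compiled query: `rperm_match` is a hypothetical helper, and the exact operators the SDK emits are an assumption based only on the prose above.

```ruby
# Hypothetical helper (not part of parse-stack-next): builds a $match
# condition on _rperm for one or more principals.
def rperm_match(principals, strict: false)
  grants = Array(principals)

  # Strict: only rows whose _rperm literally lists the principal.
  return { "_rperm" => { "$in" => grants } } if strict

  # Inclusive: also rows readable by everyone ("*") and rows with no
  # _rperm at all (public by absence).
  {
    "$or" => [
      { "_rperm" => { "$in" => (grants + ["*"]).uniq } },
      { "_rperm" => { "$exists" => false } },
    ],
  }
end
```

The strict form is the narrower of the two, which is why it suits an ownership audit: a public row never appears in its result.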
 **`readable_by_role` / `writable_by_role`** - Adds "role:" prefix automatically:
@@ -5549,7 +5590,18 @@ Song.query.readable_by_role(admin_role).results(mongo_direct: true)           #
 Song.query.writable_by_role(["Admin", "Editor"]).results(mongo_direct: true)  # Multiple roles
 ```
-**Note:** Requires the `mongo` gem. Add `gem 'mongo'` to your Gemfile.
+**Convenience and negation:** `publicly_readable` / `publicly_writable`,
+`privately_readable` / `private_acl` (master-key-only), `not_readable_by` /
+`not_writable_by`, and `not_publicly_readable` / `not_publicly_writable`.
+"Not readable by X" excludes rows readable by X directly, via any role X
+inherits, or publicly.
+**Note:** These constraints compile to an aggregation `$match` on the internal
+`_rperm` / `_wperm` columns, so they auto-route to the direct-MongoDB path
+(requires the `mongo` gem and `Parse::MongoDB.configure(...)`). For a scoped
+query (`scope_to_user` / `scope_to_role` / `session_token`) the SDK enforces
+ACL/CLP on that path; a scoped aggregate fails closed if mongo-direct is not
+configured rather than running unscoped.
 ### ACL Dirty Tracking

data/docs/atlas_vector_search_guide.md CHANGED

@@ -288,6 +288,89 @@ declared `dimensions:` before sending the pipeline. A mismatch raises
 it — callers get "expected 1536, got 768" instead of a server-side
 error after a round-trip.
+### Index drift verification (v5.5)
+On the first auto-discovered use of a vectorSearch index per
+(class, field, index) per process, the SDK compares the deployed
+index's `latestDefinition` against the model declaration:
+* `numDimensions` vs the property's declared `dimensions:` — a
+  mismatch means every query will be rejected or return nonsense
+  (usually an index that predates a model change).
+* `similarity` vs the property's declared `similarity:` (checked only
+  when both sides declare one).
+* When the class registers an `agent_tenant_scope`, the scope field
+  must appear among the index's `type: "filter"` paths — without it,
+  every tenant-scoped `$vectorSearch.filter` fails Atlas-side at
+  query time.
+Findings are computed once per (class, field, index) per process and
+governed by `Parse::VectorSearch.index_drift_policy`:
+```ruby
+Parse::VectorSearch.index_drift_policy = :warn   # default — [Parse::VectorSearch:DRIFT] warning on first check
+Parse::VectorSearch.index_drift_policy = :raise  # IndexDriftError on EVERY query against a drifted index
+Parse::VectorSearch.index_drift_policy = :ignore # skip verification
+```
+Under `:raise` the cached findings keep raising — strict mode means a
+drifted index never serves results, not "fails once, then passes".
+Auto-discovery verification costs no extra round-trip (the definition
+is already in hand from index discovery). An explicit `index:` kwarg
+is verified best-effort: when the catalog's covering index for the
+field carries the same name, its definition is checked too; catalog
+lookup failures never fail the query.
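The three checks and the policy switch described above can be sketched in plain Ruby. This is an illustration under stated assumptions, not the gem's verifier: `index_drift_findings`, `apply_drift_policy`, and `SketchIndexDriftError` are hypothetical names, and the input is assumed to be an Atlas vectorSearch definition with a `fields` array of `vector` and `filter` entries.

```ruby
# Hypothetical stand-in for the gem's drift error class.
class SketchIndexDriftError < StandardError; end

# Compare a deployed index definition against the model declaration and
# return a list of human-readable findings (empty when nothing drifted).
def index_drift_findings(definition, dimensions:, similarity: nil, tenant_field: nil)
  fields  = definition.fetch("fields", [])
  vector  = fields.find { |f| f["type"] == "vector" } || {}
  filters = fields.select { |f| f["type"] == "filter" }.map { |f| f["path"] }

  findings = []
  if vector["numDimensions"] != dimensions
    findings << "index numDimensions=#{vector["numDimensions"]} but property declares #{dimensions}"
  end
  # Similarity is compared only when both sides declare one.
  if similarity && vector["similarity"] && vector["similarity"] != similarity.to_s
    findings << "index similarity=#{vector["similarity"]} but property declares #{similarity}"
  end
  if tenant_field && !filters.include?(tenant_field.to_s)
    findings << "tenant scope field #{tenant_field} is not a filter path"
  end
  findings
end

# :ignore skips, :raise raises on every call, anything else warns.
def apply_drift_policy(findings, policy)
  return if policy == :ignore || findings.empty?
  raise SketchIndexDriftError, findings.join("; ") if policy == :raise

  warn "[drift] #{findings.join("; ")}"
end
```

Because findings are a plain array, caching them per (class, field, index) and re-applying the policy on each query gives the "keeps raising" behaviour described for `:raise`.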
+### Query-embed caching and spend caps (v5.5)
+Every `text:`-overload query funnels through one embed path
+(`find_similar(text:)`, `hybrid_search(text:)`,
+`Parse::Retrieval.retrieve` all share it), which gives two controls:
+```ruby
+# Opt-in query-embed cache: repeated identical queries skip the
+# provider round-trip. Keyed by (provider, model, dimensions,
+# input_type, SHA-256(input)) — plaintext never lands in the store.
+Parse::Embeddings::Cache.enable!(max_entries: 2048, ttl: 600)
+Parse::Embeddings::Cache.stats   # => { enabled:, hits:, misses:, size: }
+# Per-tenant spend cap now covers DIRECT callers too, not just the
+# semantic_search agent tool. Tenant identity resolves to the ambient
+# Parse.with_cache_tenant scope when set, else a shared default bucket.
+# warn_at: adds a soft cap — crossing 80% of the limit emits a
+# parse.embeddings.spend_cap_warning AS::N event (alert, never refuse).
+Parse::Embeddings::SpendCap.configure(limit_tokens: 1_000_000, window: 3600,
+                                      warn_at: 0.8)
+Parse.with_cache_tenant("tenant_abc") do
+  Document.find_similar(text: query)   # charged against tenant_abc
+end
+```
+Cache hits emit the standard `parse.embeddings.embed` notification
+with `cached: true`, so existing spend subscribers see hits and misses
+on one stream. The cache is in-process by default; for a persistent
+layer shared across processes, wrap any Moneta-compatible backend in
+the bundled adapter:
+```ruby
+# Build the Moneta store with value_serializer: nil. MonetaStore JSON-encodes
+# vectors itself; without value_serializer: nil, Moneta would additionally
+# Marshal the values, and a cache read would Marshal.load bytes from a shared
+# Redis — an RCE vector if that Redis is untrusted or MITM'd over redis://.
+moneta = Moneta.new(:Redis, url: ENV["REDIS_URL"], value_serializer: nil)
+Parse::Embeddings::Cache.enable!(
+  store: Parse::Embeddings::Cache::MonetaStore.new(moneta, ttl: 30 * 24 * 3600),
+)
+```
+`MonetaStore` namespaces keys, forwards TTL via Moneta's `expires:`,
+and fails open (a backend error is a cache miss, never a failed
+embed). Keys are input hashes — plaintext queries never land in the
+shared store; the VALUES are embeddings, so give the store the same
+access controls as the database. A query the agent tool already
+charged per-tenant is not double-billed (`SpendCap.with_precharged`
+wraps the tool's retrieval).
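Two properties from this section, hashed keys and fail-open reads, can be sketched without the gem. The helper names (`embed_cache_key`, `cached_embed`) and the `:`-joined key layout are assumptions for illustration; only the key components and the fail-open rule come from the text above.

```ruby
require "digest"

# Hypothetical key builder: the input is hashed with SHA-256, so the
# plaintext query never appears in the key.
def embed_cache_key(provider:, model:, dimensions:, input_type:, input:)
  digest = Digest::SHA256.hexdigest(input)
  [provider, model, dimensions, input_type, digest].join(":")
end

# Hypothetical read-through wrapper. `store` is anything that responds to
# [] and []=. A backend error on read is treated as a miss, and a backend
# error on write is swallowed, so the embed itself never fails because of
# the cache.
def cached_embed(store, key)
  hit = begin
    store[key]
  rescue StandardError
    nil
  end
  return hit if hit

  vector = yield
  begin
    store[key] = vector
  rescue StandardError
    # Fail open: keep the freshly computed vector even if caching failed.
  end
  vector
end
```

A plain `Hash` is enough to try it: the second call with the same key returns the stored vector without running the block.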
 ### ACL/CLP inheritance
 Vector search routes through `Parse::MongoDB.aggregate`. Every layer
@@ -405,6 +488,18 @@ branch — see [Hybrid search](#hybrid-search-vector--lexical) below) and
 chunking — see [Reranking](#reranking)). Both were reserved in earlier
 releases and now ship in 5.4.0.
+**Pointer values in filters translate automatically (v5.5).** A filter
+like `{ owner: some_user }` (a `Parse::Pointer` / `Parse::Object`, or a
+wire-form `{"__type" => "Pointer", ...}` hash — including inside `$in`
+/ `$eq` / `$ne` operator hashes) is rewritten to its MongoDB storage
+form `{ "_p_owner" => "_User$abc123" }` before the `$match` /
+`$vectorSearch.filter` is built, so pointer filters match rows instead
+of silently matching nothing. Translation runs after the
+underscore-key gate (callers still cannot name `_p_*` columns
+directly) and before the tenant-scope fold; the `semantic_search`
+agent tool inherits it. For `vector_filter:` use, the pointer column
+(`_p_owner`) must be declared `type: "filter"` in the index.
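The rewrite described above can be sketched for wire-form pointer hashes. This is a simplified illustration, not the SDK's translator: `translate_pointer_filter` and its helpers are hypothetical, and it handles only `{"__type" => "Pointer", ...}` hashes (bare or inside operator hashes such as `$in`), not `Parse::Pointer` or `Parse::Object` instances.

```ruby
# Hypothetical helpers, for illustration only.
def wire_pointer?(value)
  value.is_a?(Hash) && value["__type"] == "Pointer"
end

# Storage form is "<className>$<objectId>", e.g. "_User$abc123".
def pointer_storage_value(pointer)
  [pointer.fetch("className"), pointer.fetch("objectId")].join("$")
end

def translate_operand(value)
  return pointer_storage_value(value) if wire_pointer?(value)
  return value.map { |v| translate_operand(v) } if value.is_a?(Array)

  value
end

def contains_pointer?(value)
  return true if wire_pointer?(value)
  return value.any? { |v| contains_pointer?(v) } if value.is_a?(Array)
  return value.values.any? { |v| contains_pointer?(v) } if value.is_a?(Hash)

  false
end

# Rewrites { owner: <pointer> } to { "_p_owner" => "_User$abc123" } and
# leaves non-pointer conditions untouched.
def translate_pointer_filter(filter)
  filter.each_with_object({}) do |(key, value), out|
    if wire_pointer?(value)
      out["_p_#{key}"] = pointer_storage_value(value)
    elsif value.is_a?(Hash) && contains_pointer?(value)
      out["_p_#{key}"] = value.transform_values { |v| translate_operand(v) }
    else
      out[key.to_s] = value
    end
  end
end
```

Without a step like this, a filter on `owner` is compared against a column that stores nothing under that name, which is how pointer filters end up matching no rows.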
 ### Hybrid search (vector + lexical)
 `Class.hybrid_search` runs a lexical Atlas Search (`$search`) branch and a
@@ -556,13 +651,26 @@ envelope. See the [MCP guide's Token Economy section](./mcp_guide.md#token-econo
 ---
-## Image embedding: `embed_image` macro (v5.1)
+## Image embedding: `embed_image` macro (v5.1 URL mode, v5.5 bytes mode)
 `embed_image` is the image-source counterpart to `embed`. The source
 property must be `:file`-typed; the target must be a `:vector` property
 whose declared `provider:` supports multimodal input (currently
 `:voyage` with `voyage-multimodal-3`, or `:cohere` with `embed-v4.0`).
+Two fetch modes, selected per declaration with `source:`:
+* **`source: :url`** (default) — the SDK validates the file's URL and
+  forwards it; the **provider** performs the fetch from its own
+  network. Requires the `trust_provider_url_fetch` sentinel (see
+  operator setup below).
+* **`source: :bytes`** (v5.5) — the **SDK** downloads the image
+  through `Parse::File.safe_open_url`, verifies the content by
+  magic-byte sniff, strips EXIF/XMP metadata, and forwards the bytes
+  to the provider as a base64 data URI. No provider-side URL fetch
+  occurs, so the sentinel is NOT required — the
+  `allowed_image_hosts` allowlist still is.
 ```ruby
 class Post < Parse::Object
   property :cover_image,            :file
@@ -621,6 +729,57 @@ with `Parse::File`, not parallelized). Failures raise
 (`:scheme`, `:port`, `:userinfo`, `:host_blocked`,
 `:host_not_allowlisted`, `:parse`).
+### Bytes mode (`source: :bytes`, v5.5)
+```ruby
+# Operator setup — only the host allowlist is required (the sentinel
+# applies to URL forwarding, not SDK-side fetches):
+Parse::Embeddings.allowed_image_hosts = [".cloudfront.net"]
+class Post < Parse::Object
+  property :cover_image,           :file
+  property :cover_image_embedding, :vector,
+           dimensions: 1024, provider: :voyage, model: "voyage-multimodal-3"
+  embed_image :cover_image, into: :cover_image_embedding,
+              source: :bytes            # exif_strip: true is the default
+end
+```
+What happens on each (digest-miss) save:
+1. The file URL is validated through
+   `Parse::Embeddings.validate_image_url!(url, mode: :fetch)` — the
+   same host allowlist (deny-all when empty), obfuscated-IP screen,
+   port allowlist, and CIDR resolution check as URL mode, minus the
+   provider-egress sentinel.
+2. `Parse::File.safe_open_url` downloads the bytes — CIDR blocks,
+   DNS-rebinding re-check, port allowlist, `max_remote_size` cap,
+   timeouts. No parallel fetch mechanism exists.
+3. **Magic-byte verification** (`Parse::Embeddings::ImageFetch`):
+   the MIME type is determined exclusively from the leading bytes
+   (JPEG / PNG / GIF / WebP). The HTTP `Content-Type` header is never
+   consulted. The sniffed type must be in
+   `Parse::Embeddings.allowed_image_types` (default those four; SVG is
+   deliberately excluded as script-capable active content), and when
+   the URL carries a recognized image extension, the extension must
+   AGREE with the magic bytes — a `.jpg` URL serving PNG bytes (or
+   HTML) is refused as MIME laundering
+   (`ImageFetch::InvalidImageType`, with a `:reason` tag).
+4. **EXIF/XMP stripping, default ON.** JPEG APP1 segments (Exif and
+   XMP), PNG `eXIf` chunks, and WebP `EXIF`/`XMP ` RIFF chunks (with
+   the VP8X flag bits cleared) are removed before the bytes leave the
+   process — user photos commonly carry GPS coordinates and device
+   serials. Opt out per declaration with `exif_strip: false` when
+   orientation metadata must survive.
+5. The verified bytes ride to the provider as a base64 data URI
+   (Voyage `image_base64` content row; Cohere `image_url` data-URI
+   form).
+Direct provider calls accept the same shape:
+`provider.embed_image([Parse::Embeddings::ImageFetch.fetch!(url)])` —
+`FetchedImage` sources and URL Strings may be mixed in one batch.
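Step 3 above (magic-byte verification with an extension cross-check) can be sketched in plain Ruby. This is a minimal illustration, not the code in `Parse::Embeddings::ImageFetch`: `sniff_mime`, `verify_image`, and the reason symbols are hypothetical, and only the four signatures named in the list are recognized.

```ruby
require "uri"

# Leading-byte signatures for the four default image types.
MAGIC_SIGNATURES = {
  "image/jpeg" => ->(b) { b.start_with?("\xFF\xD8\xFF".b) },
  "image/png"  => ->(b) { b.start_with?("\x89PNG\r\n\x1A\n".b) },
  "image/gif"  => ->(b) { b.start_with?("GIF87a".b) || b.start_with?("GIF89a".b) },
  "image/webp" => ->(b) { b.byteslice(0, 4) == "RIFF".b && b.byteslice(8, 4) == "WEBP".b },
}.freeze

EXTENSION_TYPES = {
  ".jpg" => "image/jpeg", ".jpeg" => "image/jpeg",
  ".png" => "image/png", ".gif" => "image/gif", ".webp" => "image/webp",
}.freeze

# Determine the MIME type from the bytes alone; nil when unrecognized.
def sniff_mime(bytes)
  data = bytes.b
  MAGIC_SIGNATURES.each { |mime, check| return mime if check.call(data) }
  nil
end

# Returns [:ok, mime] or [:rejected, reason]. No HTTP header is consulted.
def verify_image(bytes, url, allowed: MAGIC_SIGNATURES.keys)
  mime = sniff_mime(bytes)
  return [:rejected, :unrecognized] if mime.nil?
  return [:rejected, :type_not_allowed] unless allowed.include?(mime)

  # A recognized image extension must agree with the sniffed type.
  expected = EXTENSION_TYPES[File.extname(URI.parse(url).path).downcase]
  return [:rejected, :extension_mismatch] if expected && expected != mime

  [:ok, mime]
end
```

An HTML body served from a `.jpg` URL matches no signature, so it is rejected before the extension check is even reached.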
 ### Save-side semantics
 * Digest is the **SHA-256 of the URL String**, not the file bytes.
@@ -641,24 +800,102 @@ with `Parse::File`, not parallelized). Failures raise
 ## Re-embedding existing rows
-Changing `model:`, `dimensions:`, or `provider:` on an existing
-`:vector` property is a migration regardless of whether the source is
-text or images. Workflow:
+### Provenance: the `<into>_meta` sibling (v5.5)
+Every `embed` / `embed_image` declaration auto-declares an
+`<into>_meta` `:object` sibling (override with `meta_field:`) stamped
+on each recompute and cleared with the vector:
+```ruby
+doc.body_embedding_meta
+# => { "provider"    => "openai",
+#      "model"       => "text-embedding-3-small",
+#      "dimensions"  => 1536,
+#      "modality"    => "text",
+#      "embedded_at" => "2026-06-09T17:32:11Z" }
+```
+This is the record migration tooling reads to know which model
+produced any stored vector.
+### Same-shape migrations: `Class.reembed!` (v5.5)
+When the new model has the **same dimensions** (e.g. swapping
+`text-embedding-3-small` for a same-width replacement, or a provider
+change at equal width), re-embed in place:
+```ruby
+# Re-embed every row through the CURRENT provider/model declaration.
+Document.reembed!(batch_size: 100)
+# Resumable: skip rows whose <into>_meta already matches the current
+# provider + model + dimensions (rows with no meta count as stale).
+Document.reembed!(only_stale: true)
+# Scope it
+Document.reembed!(field: :body_embedding, where: { published: true }, limit: 10_000)
+```
+`reembed!` walks the class with objectId-cursor pagination, clears
+each row's digest sibling (so the save-path recompute cannot elide the
+provider call), and saves. Unlike `embed_pending!` — which only fills
+NULL vectors — `reembed!` recomputes populated rows too. Run it with a
+master-key client (or pass `save_opts:` with a session token that can
+write every row). Each row's save makes one provider call; pace bulk
+runs against provider rate limits (see `BatchEmbedder` below for the
+pattern, or just throttle the loop).
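The `only_stale: true` rule (skip rows whose meta already matches, treat rows with no meta as stale) amounts to a small predicate. The sketch below is illustrative: `stale_embedding?` is a hypothetical name, and the meta hash shape is taken from the `<into>_meta` example earlier in this guide.

```ruby
# Hypothetical predicate: true when a row should be re-embedded.
def stale_embedding?(meta, provider:, model:, dimensions:)
  # Rows that were never stamped count as stale.
  return true if meta.nil? || meta.empty?

  meta["provider"] != provider.to_s ||
    meta["model"] != model.to_s ||
    meta["dimensions"] != dimensions
end
```

Since a re-embedded row is stamped with the current provider and model, a second pass over the same rows finds nothing stale, which is what makes the run resumable.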
+### Changed-width migrations: dual-field workflow
+Changing `dimensions:` is a different beast — the existing
+vectorSearch index can't serve the new width. Use the shadow-field
+workflow:
 1. Add the new property alongside the old one
    (`property :body_embedding_v2, :vector, ...`) and an `embed` or
    `embed_image` block targeting it.
-2. Backfill: iterate existing rows, force a save (or null+save) to
-   trigger the new directive. The old field stays valid for reads.
-3. Once backfill completes, deploy a new vectorSearch index covering
-   the new field and migrate `find_similar` callers.
-4. Drop the old property.
-Do NOT mutate the model in place — the digest mechanism will see
-unchanged source text / unchanged source URL and skip recompute,
-leaving stale vectors. For `embed_image`, also remember the digest is
-over the URL String: if you replace bytes at the same URL (PUT-replace
-on S3 without renaming), null the digest field to force re-embed.
+2. Backfill with `embed_pending!(field: :body_embedding_v2)` — the new
+   field is null everywhere, so the null-filling walk is exactly right.
+3. Deploy a new vectorSearch index covering the new field and migrate
+   `find_similar` callers.
+4. Drop the old property and index.
+Do NOT mutate a model's `dimensions:` in place — the digest mechanism
+will see unchanged source text and skip recompute, leaving stale
+vectors, and the drift verifier will flag every query against the old
+index (`index numDimensions=1536 but property declares ...`). For
+`embed_image`, also remember the digest is over the URL String: if you
+replace bytes at the same URL (PUT-replace on S3 without renaming),
+null the digest field — or run `reembed!` — to force re-embed.
+---
+## Bulk embedding: `BatchEmbedder` (v5.5)
+`Provider#embed_text_batched` only slices input into provider-sized
+chunks; retry lives inside each provider's single HTTP call. For bulk
+jobs (ingest pipelines, chunk-corpus embedding) use
+`Parse::Embeddings::BatchEmbedder`, which adds batch-level pacing and
+backoff:
+```ruby
+embedder = Parse::Embeddings::BatchEmbedder.new(
+  Parse::Embeddings.provider(:openai),
+  requests_per_minute: 60,        # inter-batch pacing
+  max_attempts: 5,                # per-batch tries (exponential backoff + jitter)
+  on_progress: ->(done:, total:, batch_index:, batch_count:) {
+    puts "#{done}/#{total}"
+  },
+)
+vectors = embedder.embed_text(texts, input_type: :search_document)
+```
+Rate-limit and transient errors (any provider error class ending in
+`RateLimitError` / `TransientError`; override with `retry_on:`) retry
+with exponential backoff; other errors propagate immediately. A batch
+that exhausts its attempts raises `BatchEmbedder::BatchFailed`
+carrying `batch_index` and `completed_count`, so a resumable job knows
+exactly where to pick up.
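The retry behaviour described above (per-batch attempts, exponential backoff with jitter, and a failure that reports where to resume) can be sketched as follows. This is not `BatchEmbedder` itself: `embed_in_batches`, `SketchTransientError`, `SketchBatchFailed`, and the delay constants are hypothetical, and requests-per-minute pacing is left out for brevity.

```ruby
# Hypothetical error classes standing in for provider and batch errors.
class SketchTransientError < StandardError; end

class SketchBatchFailed < StandardError
  attr_reader :batch_index, :completed_count

  def initialize(batch_index, completed_count)
    @batch_index = batch_index
    @completed_count = completed_count
    super("batch #{batch_index} failed after #{completed_count} inputs completed")
  end
end

# Yields each batch to the block, which must return one result per input.
# Errors listed in retry_on are retried with exponential backoff plus
# jitter; any other error propagates immediately.
def embed_in_batches(inputs, batch_size:, max_attempts:, base_delay: 0.5,
                     retry_on: [SketchTransientError],
                     sleeper: ->(seconds) { sleep(seconds) })
  results = []
  inputs.each_slice(batch_size).with_index do |batch, index|
    attempt = 0
    begin
      attempt += 1
      results.concat(yield(batch))
    rescue *retry_on
      raise SketchBatchFailed.new(index, results.size) if attempt >= max_attempts

      # Delay doubles per attempt; the random term spreads out retries.
      sleeper.call(base_delay * (2**(attempt - 1)) + rand * base_delay)
      retry
    end
  end
  results
end
```

Injecting `sleeper` keeps the sketch testable without real waiting; a job that rescues `SketchBatchFailed` can restart from `completed_count`.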
 ---
@@ -728,6 +965,54 @@ floats out). Vectors only flow through the Parse↔Mongo path, where
 the body builder's `<vector dims=N>` compaction prevents them from
 landing in stdout / error trackers.
+### When the embedded source is PII: deployment checklist
+An embedding of PII is PII-equivalent. Inversion attacks reconstruct
+substantial source text from dense embeddings, and a vector's nearest
+neighbors leak the source's meaning even without reconstruction. If
+the fields you `embed` contain personal data (names, addresses, health
+or financial details, free-text user messages), treat the vector
+column with the same handling as the source column:
+1. **Provider contract.** You are sending the raw source text (and in
+   bytes mode, image content) to the embedding provider on every
+   recompute. Confirm the provider's data-retention and training-use
+   terms cover PII, and that a DPA is in place where required.
+   Self-hosting via `LocalHTTP` (Ollama / vLLM / TEI) keeps the text
+   in your network.
+2. **Keep vectors off the wire.** Leave `vector_visibility` at its
+   `:owner_only` default so vectors are omitted from `as_json` and
+   webhook payloads. Do not flip a PII class to `:public`.
+3. **Row ACL still governs.** Vector hits route mongo-direct with
+   `_rperm` enforcement — verify your rows carry real ACLs and that
+   callers use scoped credentials (`session_token:` / `acl_user:`),
+   not a blanket master key.
+4. **Tenant isolation.** Multi-tenant deployments must declare
+   `agent_tenant_scope` on searchable classes; the scope folds into
+   `$vectorSearch.filter` (and v5.5's drift verification confirms the
+   index covers it). Without it, similarity scores leak cross-tenant
+   document existence.
+5. **Score exposure.** Keep score quantization on for non-admin agent
+   contexts (the default) — full-precision scores enable
+   membership-inference probing.
+6. **EXIF stays stripped.** For image embedding, keep the bytes-mode
+   default `exif_strip: true`; user photos carry GPS coordinates and
+   device serials that would otherwise reach the provider.
+7. **Log and cache hygiene.** Redact query text at the Faraday layer
+   (above); if you enable the persistent L2 cache, note that cache
+   KEYS are hashes (no plaintext) but cache VALUES are the embeddings
+   themselves — point `MonetaStore` at a store with the same access
+   controls as the database.
+8. **Deletion propagation.** When a user exercises erasure rights,
+   the vector, its `<field>_digest`, and its `<field>_meta` siblings
+   live on the same row and delete with it — but check external
+   copies: provider-side logs (their retention policy), your L2
+   embedding cache (TTL or explicit flush), and any analytics sink
+   subscribed to embedding events.
+9. **Migration hygiene.** `reembed!` re-sends every row's source text
+   to the provider — schedule PII-class migrations under the same
+   approvals as a data export.
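Item 5's point about score exposure can be made concrete with a small sketch. The quantization scheme the gem applies is not shown here, so the `quantize_score` helper and the 0.05 step are assumptions for illustration only.

```ruby
# Hypothetical quantizer: snap a similarity score to the nearest multiple
# of `step`, so small score differences are not visible to the caller.
def quantize_score(score, step: 0.05)
  ((score / step).round * step).round(4)
end

hits = [
  { id: "a", score: 0.91342 },
  { id: "b", score: 0.90871 },
  { id: "c", score: 0.73310 },
]

# Replace each full-precision score with its coarse bucket.
coarse = hits.map { |h| h.merge(score: quantize_score(h[:score])) }
coarse.each { |h| puts "#{h[:id]} #{h[:score]}" }
```

With this step, hits `a` and `b` land in the same bucket, so a caller probing with near-duplicate queries cannot use the low-order digits to tell whether a specific document is present.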
 ---
 ## Troubleshooting
@@ -775,10 +1060,20 @@ on every poll) rather than a `until index_ready?; sleep` loop.
 Key files:
 * `lib/parse/embeddings.rb` — registry, `Configuration`, `register`,
-  `provider`, `configure`, `validate_image_url!`,
-  `trust_provider_url_fetch=`, `allowed_image_hosts=`.
+  `provider`, `configure`, `validate_image_url!` (`mode: :forward | :fetch`),
+  `trust_provider_url_fetch=`, `allowed_image_hosts=`,
+  `allowed_image_types=`.
 * `lib/parse/embeddings/provider.rb` — abstract base, `validate_response!`,
   `instrument_embed`, AS::N payload contract.
+* `lib/parse/embeddings/image_fetch.rb` — bytes-fetch path:
+  `ImageFetch.fetch!`, magic-byte `sniff_mime`/`verify!`, EXIF/XMP
+  stripping, `FetchedImage`.
+* `lib/parse/embeddings/batch_embedder.rb` — `BatchEmbedder` bulk
+  orchestration (pacing, batch-level backoff, `BatchFailed`).
+* `lib/parse/embeddings/cache.rb` — opt-in query-embed cache
+  (`Cache.enable!` / `fetch_vector` / `stats`).
+* `lib/parse/embeddings/spend_cap.rb` — per-tenant token cap
+  (`charge!`, `charge_query!`, `with_precharged`).
 * `lib/parse/embeddings/openai.rb` — OpenAI provider.
 * `lib/parse/embeddings/cohere.rb` — Cohere v3 + v4.0 text-mode provider.
 * `lib/parse/embeddings/voyage.rb` — Voyage text + multimodal-3
@@ -788,9 +1083,13 @@ Key files:
 * `lib/parse/embeddings/local_http.rb` — generic OpenAI-compatible
   local-gateway client.
 * `lib/parse/embeddings/fixture.rb` — deterministic test provider.
-* `lib/parse/model/core/vector_searchable.rb` — `find_similar`.
+* `lib/parse/model/core/vector_searchable.rb` — `find_similar`,
+  `hybrid_search`, index drift verification
+  (`Parse::VectorSearch.index_drift_policy`).
 * `lib/parse/model/core/embed_managed.rb` — `embed` and `embed_image`
-  macros, `EmbedDirective` (carries `modality:`, `allow_insecure:`).
+  macros, `EmbedDirective` (carries `modality:`, `allow_insecure:`,
+  `source_mode:`, `exif_strip:`, `meta_field:`), `embed_pending!`,
+  `reembed!`.
 * `lib/parse/vector_search.rb` — low-level `Parse::VectorSearch.search`.
 * `lib/parse/atlas_search/index_manager.rb` — `IndexCatalog.create_index`,
   `find_vector_index`, `wait_for_ready`.

data/lib/parse/acl_scope.rb CHANGED

@@ -336,6 +336,17 @@ module Parse
         return if target.nil?
         target_str = target.to_s
         return if target_str.empty?
+        # RT-7 / NEW-4: hard internal-collection floor FIRST, independent of
+        # CLP. This must run on EVERY join target on the direct
+        # Parse::MongoDB.aggregate path. LookupRewriter.auto_rewrite (the other
+        # caller of assert_collection_allowed!) is skipped when rewrite_lookups
+        # is off or the root class can't be resolved, so relying on it alone
+        # leaves a gap: an internal collection (`_SCHEMA`/`_Hooks`/`_Audit`/
+        # `_GlobalConfig`/...) whose CLP fetch returns :no_clp would pass the
+        # permits? check below. The floor refuses those outright while still
+        # admitting the SDK data classes (`_User`/`_Role`/`_Installation`/
+        # `_Session`), which then face the per-scope CLP `find` gate.
+        Parse::PipelineSecurity.assert_collection_allowed!(target_str)
         return if Parse::CLPScope.permits?(target_str, :find, perms)
         raise Parse::CLPScope::Denied.new(
           target_str, :find,