RubyGems - pgoutput-parser - Versions diffs - 0.1.1 → 0.2.0 - Mend

pgoutput-parser 0.1.1 → 0.2.0

Files changed (14)

checksums.yaml +4 -4
data/CHANGELOG.md +36 -4
data/README.md +69 -7
data/docs/glossary.md +145 -0
data/docs/index.md +239 -0
data/docs/relation_tracker.md +168 -0
data/lib/pgoutput/binary_parser.rb +47 -11
data/lib/pgoutput/errors.rb +10 -4
data/lib/pgoutput/messages.rb +40 -2
data/lib/pgoutput/relation_tracker.rb +79 -18
data/lib/pgoutput/version.rb +1 -1
data/lib/pgoutput.rb +1 -1
data/sig/pgoutput.rbs +85 -30
metadata +7 -102

data/docs/relation_tracker.md ADDED

@@ -0,0 +1,168 @@
+# RelationTracker Guide
+`Pgoutput::RelationTracker` is the stream-order parser wrapper for callers that
+want DML tuple values annotated with PostgreSQL type OIDs.
+Use `Pgoutput::BinaryParser` when each payload can be decoded independently. Use
+`RelationTracker` when Insert, Update, and Delete messages need the Relation
+metadata that PostgreSQL sent earlier in the same logical replication stream.
+## Why Relation Tracking Exists
+pgoutput row-change messages reference a relation ID, but tuple values do not
+repeat column names or PostgreSQL type OIDs. PostgreSQL sends that metadata in a
+Relation (`R`) message.
+The tracker caches Relation messages:
+```text
+R users(id int4, email text, active bool)
+I relation_id=42 tuple=["7", "dev@example.test", "t"]
+U relation_id=42 tuple=["7", "ops@example.test", "t"]
+D relation_id=42 old_key=["7"]
+```
+After the Relation message has been seen, later DML tuple values can be
+annotated:
+```ruby
+stream = Pgoutput::RelationTracker.new
+stream.process(relation_payload)
+insert = stream.process(insert_payload)
+insert.tuple.map(&:oid)
+# => [23, 25, 16]
+```
+The raw tuple bytes are unchanged. The tracker only attaches metadata.
+## Stream-Order Contract
+`RelationTracker` assumes payloads are processed in pgoutput stream order.
+This order matters because DML messages depend on earlier Relation messages. If
+an Insert, Update, or Delete references a relation ID that has not been cached,
+the tracker raises `Pgoutput::UnknownRelationError`.
+```ruby
+stream = Pgoutput::RelationTracker.new
+stream.process(insert_payload)
+# raises Pgoutput::UnknownRelationError
+```
+The tracker does not reorder messages, buffer future DML, deduplicate events, or
+validate row lifecycle semantics such as whether an Insert occurred before a
+Delete for the same primary key. Those guarantees belong to higher CDC pipeline
+layers.
+## Tuple Arity Validation
+The tracker validates tuple arity before annotating OIDs. If PostgreSQL sends a
+tuple with a different number of values than the cached Relation column count,
+the tracker raises `Pgoutput::TupleArityError`.
+This avoids silently assigning the wrong type OIDs to tuple positions.
+```ruby
+stream = Pgoutput::RelationTracker.new
+stream.process(relation_payload)
+stream.process(malformed_insert_payload)
+# raises Pgoutput::TupleArityError
+```
+## Default Relation Cache
+By default, each tracker owns a plain Ruby `Hash`:
+```ruby
+stream = Pgoutput::RelationTracker.new
+```
+That is the right default for a single stream owner. The tracker instance itself
+is mutable and should be scoped to the code path that processes that logical
+replication stream.
+Parsed message objects returned from `process` are Ractor-shareable. The mutable
+tracker is not the shareable value; the parsed messages are.
+## Swappable Relation Cache
+`RelationTracker` accepts a `relation_cache:` object:
+```ruby
+stream = Pgoutput::RelationTracker.new(relation_cache: {})
+```
+The cache object must support:
+- `#[]=` for storing Relation messages by relation ID
+- `#fetch` for reading Relation messages and raising through the provided block
+  when a relation ID is unknown
+This keeps `RelationTracker` independent from a specific cache implementation.
+## Ratomic::Map Cache
+For experimental or parallel Ruby 4 setups, callers can inject
+`Ratomic::Map`:
+```ruby
+require "ratomic"
+require "pgoutput"
+relation_cache = Ratomic::Map.new
+stream = Pgoutput::RelationTracker.new(relation_cache: relation_cache)
+stream.process(relation_payload)
+insert = stream.process(insert_payload)
+insert.tuple.map(&:oid)
+# => [23, 25, 16]
+```
+`Ratomic::Map` is useful when relation metadata must live in a Ractor-safe cache.
+This gem keeps Ratomic as an optional development/benchmark dependency rather
+than a runtime dependency; applications that want this cache backend should add
+Ratomic directly.
+Prefer the default Hash unless a pipeline design specifically needs a shared
+Ractor-safe relation metadata cache.
+## Ractor Pattern
+A common pattern is to keep one tracker per stream-processing lane and pass only
+parsed immutable messages across Ractors:
+```ruby
+stream = Pgoutput::RelationTracker.new
+stream.process(relation_payload)
+message = stream.process(update_payload)
+worker = Ractor.new(message) do |event|
+  event.new_tuple.map(&:raw)
+end
+worker.take
+```
+If relation metadata itself must be shared across lanes, use an explicit
+Ractor-safe cache such as `Ratomic::Map` and benchmark the result for the target
+workload.
+## Boundary
+`RelationTracker` is still a parser-layer utility. It does not perform:
+- PostgreSQL value decoding
+- application-level type conversion
+- event ordering across sink workers
+- checkpointing
+- retry coordination
+- row lifecycle validation
+Its job is narrow: remember Relation metadata, annotate DML tuple values with
+type OIDs, validate tuple arity, and return immutable protocol messages.
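
The cache contract described in the guide above (`#[]=` for storing, `#fetch` with a block for lookups) can be met by a very small object. The following is a minimal sketch, assuming a hypothetical `SynchronizedRelationCache` class that is not part of pgoutput-parser. It guards a plain Hash with a Mutex, which suits thread-based sharing; it is not a substitute for a Ractor-safe structure such as `Ratomic::Map`.

```ruby
# Hypothetical cache class for illustration; not shipped with pgoutput-parser.
# It implements only the two methods the guide lists: #[]= and #fetch.
class SynchronizedRelationCache
  # Sentinel that distinguishes "no entry" from a stored nil.
  MISSING = Object.new.freeze
  private_constant :MISSING

  def initialize
    @relations = {}
    @lock = Mutex.new
  end

  # Store relation metadata under its relation ID.
  def []=(relation_id, relation)
    @lock.synchronize { @relations[relation_id] = relation }
  end

  # Return the cached value. For an unknown ID, yield to the caller's block
  # (outside the lock) so the caller decides which error to raise.
  def fetch(relation_id)
    found = @lock.synchronize { @relations.fetch(relation_id, MISSING) }
    return found unless found.equal?(MISSING)
    raise KeyError, "key not found: #{relation_id.inspect}" unless block_given?

    yield relation_id
  end
end

cache = SynchronizedRelationCache.new
cache[42] = :users_relation
cache.fetch(42)                          # returns the stored value
cache.fetch(7) { |id| "unknown #{id}" }  # returns the block's result
```

Per the guide, such an object would be handed to the tracker as `Pgoutput::RelationTracker.new(relation_cache: cache)`.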

data/lib/pgoutput/binary_parser.rb CHANGED

@@ -8,10 +8,11 @@ module Pgoutput
   # single payload. Its returned message object is deeply frozen/shareable and may
   # cross Ractor boundaries safely.
   #
-  # @api public
+  # @api public Public parser for decoding one pgoutput protocol message payload.
+  # rubocop:disable Metrics/ClassLength
   class BinaryParser
     # @param payload [String] one pgoutput message payload from a CopyData frame.
-    # @return [void]
+    # @return [void] initializes parser state for the supplied payload.
     def initialize(payload)
       @payload = payload.b
       @offset = 0
@@ -19,25 +20,34 @@ module Pgoutput
     # Parse one supported pgoutput message.
     #
-    # Supported MVP tags are `B`, `R`, `I`, `U`, `D`, and `C`.
+    # Supported MVP tags are `B`, `M`, `O`, `R`, `Y`, `I`, `U`, `D`, `T`, and `C`.
     #
-    # @return [Pgoutput::Messages::Begin, Pgoutput::Messages::Relation,
+    # @return [Pgoutput::Messages::Begin, Pgoutput::Messages::Message,
+    #   Pgoutput::Messages::Origin, Pgoutput::Messages::Relation,
+    #   Pgoutput::Messages::Type, Pgoutput::Messages::Truncate,
     #   Pgoutput::Messages::Insert, Pgoutput::Messages::Update,
-    #   Pgoutput::Messages::Delete, Pgoutput::Messages::Commit]
+    #   Pgoutput::Messages::Delete, Pgoutput::Messages::Commit] parsed immutable
+    #   message object for the payload tag.
     # @raise [UnsupportedMessageError] if the message tag is unsupported.
     # @raise [TruncatedMessageError] if the payload is incomplete.
+    # rubocop:disable Metrics/CyclomaticComplexity
     def parse
       case read_byte_chr
       when "B" then parse_begin
+      when "M" then parse_message
+      when "O" then parse_origin
       when "R" then parse_relation
+      when "Y" then parse_type
       when "I" then parse_insert
       when "U" then parse_update
       when "D" then parse_delete
+      when "T" then parse_truncate
       when "C" then parse_commit
       else
         raise UnsupportedMessageError, "unsupported pgoutput message tag"
       end
     end
+    # rubocop:enable Metrics/CyclomaticComplexity
     private
@@ -45,6 +55,19 @@ module Pgoutput
       share(Messages::Begin.new(read_uint64, read_uint64, read_uint32))
     end
+    def parse_message
+      flags = read_uint8
+      lsn = read_uint64
+      prefix = read_cstring
+      content = read_bytes(read_int32).freeze
+      share(Messages::Message.new(flags, lsn, prefix, content))
+    end
+    def parse_origin
+      share(Messages::Origin.new(read_uint64, read_cstring))
+    end
     def parse_relation
       relation_id = read_uint32
       schema = read_cstring
@@ -59,6 +82,10 @@ module Pgoutput
       share(Messages::Relation.new(relation_id, schema, table, replica_identity, columns))
     end
+    def parse_type
+      share(Messages::Type.new(read_uint32, read_cstring, read_cstring))
+    end
     def parse_insert
       relation_id = read_uint32
       tuple_tag = read_byte_chr
@@ -105,6 +132,14 @@ module Pgoutput
       end
     end
+    def parse_truncate
+      relation_count = read_uint32
+      options = read_uint8
+      relation_ids = Array.new(relation_count) { read_uint32 }.freeze
+      share(Messages::Truncate.new(relation_ids, options))
+    end
     def parse_commit
       share(Messages::Commit.new(read_uint8, read_uint64, read_uint64, read_uint64))
     end
@@ -129,18 +164,18 @@ module Pgoutput
       end.freeze
     end
-    def read_uint8 = read_bytes(1).unpack1("C")
+    def read_uint8 = Integer(read_bytes(1).unpack1("C"))
-    def read_uint16 = read_bytes(2).unpack1("n")
+    def read_uint16 = Integer(read_bytes(2).unpack1("n"))
-    def read_uint32 = read_bytes(4).unpack1("N")
+    def read_uint32 = Integer(read_bytes(4).unpack1("N"))
     def read_int32
       value = read_uint32
       value >= 0x8000_0000 ? value - 0x1_0000_0000 : value
     end
-    def read_uint64 = read_bytes(8).unpack1("Q>")
+    def read_uint64 = Integer(read_bytes(8).unpack1("Q>"))
     def read_byte_chr = read_bytes(1)
@@ -148,7 +183,7 @@ module Pgoutput
       zero = @payload.index("\0", @offset)
       raise TruncatedMessageError, "unterminated cstring at offset #{@offset}" unless zero
-      value = @payload.byteslice(@offset, zero - @offset).freeze
+      value = String(@payload.byteslice(@offset, zero - @offset)).freeze
       @offset = zero + 1
       value
     end
@@ -159,7 +194,7 @@ module Pgoutput
         raise TruncatedMessageError, "need #{length} bytes at offset #{@offset}, payload has #{@payload.bytesize} bytes"
       end
-      value = @payload.byteslice(@offset, length)
+      value = String(@payload.byteslice(@offset, length))
       @offset += length
       value
     end
@@ -168,4 +203,5 @@ module Pgoutput
       Ractor.make_shareable(message)
     end
   end
+  # rubocop:enable Metrics/ClassLength
 end
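
To make the field order read by the new `parse_message` method concrete, here is a minimal sketch that builds and decodes a logical decoding Message (`M`) payload with `Array#pack` and `String#unpack`. The helper names `build_message_payload` and `decode_message_payload` are hypothetical and not part of pgoutput-parser; the gem itself reads the same fields through its cursor-based `read_*` methods.

```ruby
# Hypothetical helpers for illustration; they mirror the field order read by
# parse_message in the diff above but are not part of pgoutput-parser.

# Build an `M` payload: tag, flags (uint8), LSN (uint64, big-endian),
# prefix (NUL-terminated string), content length (32-bit, big-endian),
# then the raw content bytes.
def build_message_payload(flags:, lsn:, prefix:, content:)
  body = content.b
  ["M", flags, lsn, prefix, body.bytesize].pack("a1CQ>Z*N") + body
end

# Decode the same layout in one pass over the leading fields.
def decode_message_payload(payload)
  bytes = payload.b
  tag, flags, lsn, prefix = bytes.unpack("a1CQ>Z*")
  offset = 1 + 1 + 8 + prefix.bytesize + 1 # tag + flags + LSN + prefix + NUL
  length = bytes.byteslice(offset, 4).unpack1("N")
  {
    tag: tag,
    flags: flags,
    lsn: lsn,
    prefix: prefix,
    content: bytes.byteslice(offset + 4, length).freeze
  }
end

payload = build_message_payload(flags: 1, lsn: 0x16B3748, prefix: "audit", content: "hello")
decoded = decode_message_payload(payload)
decoded[:prefix]  # => "audit"
decoded[:content] # => "hello"
```

The sketch assumes a non-negative content length and does no truncation checks; the gem's reader raises `TruncatedMessageError` when a payload ends early.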

data/lib/pgoutput/errors.rb CHANGED

@@ -3,22 +3,28 @@
 module Pgoutput
   # Base error for all parser failures.
   #
-  # @api public
+  # @api public Public base class for rescuing all pgoutput-parser errors.
   class Error < StandardError; end
   # Raised when a payload ends before the requested protocol field can be read.
   #
-  # @api public
+  # @api public Public parser error for incomplete binary payloads.
   class TruncatedMessageError < Error; end
   # Raised when the parser sees a message or tuple tag outside this MVP scope.
   #
-  # @api public
+  # @api public Public parser error for pgoutput protocol features outside this scope.
   class UnsupportedMessageError < Error; end
   # Raised when row data references a relation id that has not been observed via
   # a preceding Relation (`R`) message in the current stream decoder.
   #
-  # @api public
+  # @api public Public tracker error for DML messages missing relation metadata.
   class UnknownRelationError < Error; end
+  # Raised when tuple data does not match the column count advertised by the
+  # cached Relation (`R`) message.
+  #
+  # @api public Public tracker error for malformed tuple/relation metadata pairs.
+  class TupleArityError < Error; end
 end
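
Because every error above inherits from `Pgoutput::Error`, callers can rescue the tracker-specific failures first and fall back to the base class for the rest. The sketch below is one possible arrangement, not the gem's own guidance: it mirrors the hierarchy in a stand-in `PgoutputSketch` module so it runs without the gem, and `classify_failure` with its return symbols is a hypothetical policy.

```ruby
# Stand-in module mirroring the hierarchy in the diff so the sketch runs
# without the gem installed; real code would rescue Pgoutput::* instead.
module PgoutputSketch
  class Error < StandardError; end
  class TruncatedMessageError < Error; end
  class UnsupportedMessageError < Error; end
  class UnknownRelationError < Error; end
  class TupleArityError < Error; end
end

# Hypothetical policy: map each failure to a symbol a pipeline could act on.
# Specific subclasses are listed before the base class so they match first.
def classify_failure
  yield
  :ok
rescue PgoutputSketch::UnknownRelationError
  :missing_relation_metadata
rescue PgoutputSketch::TupleArityError
  :tuple_relation_mismatch
rescue PgoutputSketch::Error
  :parser_failure
end

classify_failure { raise PgoutputSketch::TupleArityError, "3 values, 2 columns" }
# => :tuple_relation_mismatch
```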

data/lib/pgoutput/messages.rb CHANGED

@@ -4,11 +4,11 @@ module Pgoutput
   # Immutable message model classes for the PostgreSQL pgoutput protocol.
   #
   # Every value returned by the parser is deeply shareable via
-  # {Ractor.make_shareable}. These classes are protocol-level structures only;
+  # `Ractor.make_shareable`. These classes are protocol-level structures only;
   # they preserve tuple bytes and metadata but do not convert PostgreSQL values
   # into application-specific Ruby types.
   #
-  # @api public
+  # @api public Public immutable message model namespace returned by parsers.
   module Messages
     # Transaction begin message.
     #
@@ -20,6 +20,26 @@ module Pgoutput
     #   @return [Integer] transaction id.
     Begin = Data.define(:final_lsn, :commit_timestamp, :xid)
+    # Logical decoding message.
+    #
+    # @!attribute [r] flags
+    #   @return [Integer] message flags; bit 0 marks transactional messages.
+    # @!attribute [r] lsn
+    #   @return [Integer] LSN of the logical decoding message.
+    # @!attribute [r] prefix
+    #   @return [String] message prefix.
+    # @!attribute [r] content
+    #   @return [String] immutable raw message content.
+    Message = Data.define(:flags, :lsn, :prefix, :content)
+    # Replication origin message.
+    #
+    # @!attribute [r] origin_lsn
+    #   @return [Integer] commit LSN on the origin server.
+    # @!attribute [r] name
+    #   @return [String] origin name.
+    Origin = Data.define(:origin_lsn, :name)
     # Relation column metadata.
     #
     # @!attribute [r] flags
@@ -46,6 +66,16 @@ module Pgoutput
     #   @return [Array<Column>] immutable column metadata.
     Relation = Data.define(:relation_id, :schema, :table, :replica_identity, :columns)
+    # PostgreSQL type metadata message.
+    #
+    # @!attribute [r] oid
+    #   @return [Integer] PostgreSQL type OID.
+    # @!attribute [r] schema
+    #   @return [String] namespace name.
+    # @!attribute [r] name
+    #   @return [String] type name.
+    Type = Data.define(:oid, :schema, :name)
     # One tuple column value.
     #
     # @!attribute [r] format
@@ -91,6 +121,14 @@ module Pgoutput
     #   @return [Array<TupleValue>, nil] full old tuple when replica identity is FULL.
     Delete = Data.define(:relation_id, :old_key_tuple, :old_tuple)
+    # Truncate DML message.
+    #
+    # @!attribute [r] relation_ids
+    #   @return [Array<Integer>] relation OIDs affected by the truncate.
+    # @!attribute [r] options
+    #   @return [Integer] option bits; 1 is CASCADE, 2 is RESTART IDENTITY.
+    Truncate = Data.define(:relation_ids, :options)
     # Transaction commit message.
     #
     # @!attribute [r] flags
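
The `options` attribute on the new `Truncate` message is a bit field (1 is CASCADE, 2 is RESTART IDENTITY, per the attribute comment above). A minimal sketch of expanding it into named flags follows; the constant names and the `truncate_flags` helper are hypothetical and not part of pgoutput-parser.

```ruby
# Bit values as documented on the Truncate message in the diff above.
TRUNCATE_CASCADE = 1
TRUNCATE_RESTART_IDENTITY = 2

# Hypothetical helper: expand the option bits into readable booleans.
def truncate_flags(options)
  {
    cascade: (options & TRUNCATE_CASCADE) != 0,
    restart_identity: (options & TRUNCATE_RESTART_IDENTITY) != 0
  }
end

truncate_flags(0) # => { cascade: false, restart_identity: false }
truncate_flags(3) # => { cascade: true, restart_identity: true }
```

With the gem loaded, the argument would be the `options` value of a parsed `Pgoutput::Messages::Truncate`.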

data/lib/pgoutput/relation_tracker.rb CHANGED

@@ -8,23 +8,68 @@ module Pgoutput
   # It only adds protocol metadata to tuple values while keeping returned objects
   # deeply shareable.
   #
-  # The instance contains mutable relation-cache state and should not be shared
-  # across Ractors. Returned message objects are Ractor-safe.
+  # pgoutput DML messages carry a relation id and tuple values, but they do not
+  # repeat column names or type OIDs. PostgreSQL sends that metadata separately
+  # in Relation (`R`) messages. Call {#process} with payloads in stream order so
+  # Relation messages are cached before the Insert, Update, or Delete messages
+  # that reference them.
   #
-  # @api public
+  # The relation cache is injectable. The default cache is a plain Hash, which is
+  # appropriate when one stream owner processes payloads sequentially. Callers
+  # with an explicit Ractor-oriented design can supply a compatible cache object,
+  # such as `Ratomic::Map`, through the `relation_cache:` keyword.
+  #
+  # A custom relation cache must implement `#[]=` and `#fetch`. The tracker
+  # stores cached Relation messages by relation id and uses `#fetch` with a block
+  # so unknown relation ids still raise {UnknownRelationError}.
+  #
+  # `RelationTracker` does not reorder messages, buffer DML until metadata
+  # arrives, enforce per-record lifecycle ordering, or coordinate sink retries.
+  # Those guarantees belong to higher CDC pipeline layers. This class only
+  # preserves parser-layer stream semantics and validates tuple arity against
+  # cached Relation metadata.
+  #
+  # Returned message objects are Ractor-safe.
+  #
+  # @example Default Hash-backed relation cache
+  #   stream = Pgoutput::RelationTracker.new
+  #   stream.process(relation_payload)
+  #   insert = stream.process(insert_payload)
+  #   insert.tuple.map(&:oid)
+  #
+  # @example Ractor-safe relation cache with Ratomic::Map
+  #   require "ratomic"
+  #
+  #   relation_cache = Ratomic::Map.new
+  #   stream = Pgoutput::RelationTracker.new(relation_cache: relation_cache)
+  #   stream.process(relation_payload)
+  #   update = stream.process(update_payload)
+  #   update.new_tuple.map(&:oid)
+  #
+  # @api public Public stream-order decoder that annotates DML with relation OIDs.
   class RelationTracker
-    # @return [void]
-    def initialize
-      @relations = {}
+    # Create a tracker with an optional relation cache.
+    #
+    # @param relation_cache [Hash, #fetch, #[]=] cache for relation metadata,
+    #   keyed by pgoutput relation id. The default Hash is suitable for one
+    #   stream owner; callers may inject `Ratomic::Map` or another compatible
+    #   cache for explicit Ractor-safe relation metadata sharing.
+    # @return [void] initializes an empty tracker using the supplied cache object.
+    def initialize(relation_cache: {})
+      @relations = relation_cache
     end
     # Process one pgoutput payload in stream order.
     #
     # @param payload [String] one pgoutput logical replication message payload.
-    # @return [Pgoutput::Messages::Begin, Pgoutput::Messages::Relation,
+    # @return [Pgoutput::Messages::Begin, Pgoutput::Messages::Message,
+    #   Pgoutput::Messages::Origin, Pgoutput::Messages::Relation,
+    #   Pgoutput::Messages::Type, Pgoutput::Messages::Truncate,
     #   Pgoutput::Messages::Insert, Pgoutput::Messages::Update,
-    #   Pgoutput::Messages::Delete, Pgoutput::Messages::Commit]
+    #   Pgoutput::Messages::Delete, Pgoutput::Messages::Commit] parsed immutable
+    #   message object, with DML tuple OIDs annotated when relation metadata exists.
     # @raise [UnknownRelationError] if DML references an unseen relation id.
+    # @raise [TupleArityError] if DML tuple data does not match relation metadata.
     def process(payload)
       message = BinaryParser.new(payload).parse
@@ -43,12 +88,15 @@ module Pgoutput
       end
     end
-    # Backwards-compatible alias for callers migrating from RelationTracker.
+    # Backwards-compatible alias for callers migrating to `process`.
     #
     # @param payload [String] one pgoutput logical replication message payload.
-    # @return [Pgoutput::Messages::Begin, Pgoutput::Messages::Relation,
+    # @return [Pgoutput::Messages::Begin, Pgoutput::Messages::Message,
+    #   Pgoutput::Messages::Origin, Pgoutput::Messages::Relation,
+    #   Pgoutput::Messages::Type, Pgoutput::Messages::Truncate,
     #   Pgoutput::Messages::Insert, Pgoutput::Messages::Update,
-    #   Pgoutput::Messages::Delete, Pgoutput::Messages::Commit]
+    #   Pgoutput::Messages::Delete, Pgoutput::Messages::Commit] parsed immutable
+    #   message object, with DML tuple OIDs annotated when relation metadata exists.
     def decode(payload)
       process(payload)
     end
@@ -72,8 +120,8 @@ module Pgoutput
       Ractor.make_shareable(
         Messages::Update.new(
           message.relation_id,
-          annotate_tuple(message.old_key_tuple, relation),
-          annotate_tuple(message.old_tuple, relation),
+          annotate_optional_tuple(message.old_key_tuple, relation),
+          annotate_optional_tuple(message.old_tuple, relation),
           annotate_tuple(message.new_tuple, relation)
         )
       )
@@ -85,21 +133,34 @@ module Pgoutput
       Ractor.make_shareable(
         Messages::Delete.new(
           message.relation_id,
-          annotate_tuple(message.old_key_tuple, relation),
-          annotate_tuple(message.old_tuple, relation)
+          annotate_optional_tuple(message.old_key_tuple, relation),
+          annotate_optional_tuple(message.old_tuple, relation)
         )
       )
     end
-    def annotate_tuple(tuple, relation)
+    def annotate_optional_tuple(tuple, relation)
       return nil if tuple.nil?
+      annotate_tuple(tuple, relation)
+    end
+    def annotate_tuple(tuple, relation)
+      validate_tuple_arity!(tuple, relation)
       tuple.each_with_index.map do |value, index|
-        column = relation.columns[index]
-        Messages::TupleValue.new(value.format, value.raw, column&.oid)
+        Messages::TupleValue.new(value.format, value.raw, relation.columns.fetch(index).oid)
       end.freeze
     end
+    def validate_tuple_arity!(tuple, relation)
+      return if tuple.length == relation.columns.length
+      raise TupleArityError,
+            "tuple has #{tuple.length} values but relation #{relation.relation_id} " \
+            "has #{relation.columns.length} columns"
+    end
     def relation_for(relation_id)
       @relations.fetch(relation_id) do
         raise UnknownRelationError, "unknown relation id #{relation_id}; parse Relation message first"
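
The behavioral change in this file is that `annotate_tuple` now validates arity before pairing values with columns, where 0.1.1 used `column&.oid` and silently produced a `nil` OID for any extra value. The sketch below reproduces that logic in isolation. The `Column`, `Relation`, `TupleValue`, and `TupleArityError` definitions are stand-ins so the example runs without the gem (the gem defines its models with `Data.define`), and the `"t"` format value is an assumption made for illustration.

```ruby
# Stand-ins for the gem's message models so the sketch runs without the gem.
Column = Struct.new(:name, :oid)
Relation = Struct.new(:relation_id, :columns)
TupleValue = Struct.new(:format, :raw, :oid)
class TupleArityError < StandardError; end

# Reject tuples whose length differs from the cached column count, then pair
# each value with the type OID at the same position.
def annotate_tuple(tuple, relation)
  unless tuple.length == relation.columns.length
    raise TupleArityError,
          "tuple has #{tuple.length} values but relation #{relation.relation_id} " \
          "has #{relation.columns.length} columns"
  end

  tuple.each_with_index.map do |value, index|
    TupleValue.new(value.format, value.raw, relation.columns.fetch(index).oid)
  end.freeze
end

users = Relation.new(42, [Column.new("id", 23), Column.new("email", 25), Column.new("active", 16)])
row = [
  TupleValue.new("t", "7", nil),
  TupleValue.new("t", "dev@example.test", nil),
  TupleValue.new("t", "t", nil)
]

annotate_tuple(row, users).map(&:oid)
# => [23, 25, 16]

# A two-value tuple against the three-column relation raises TupleArityError.
```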

data/lib/pgoutput/version.rb CHANGED

@@ -4,5 +4,5 @@ module Pgoutput
   # Current gem version.
   #
   # @return [String] semantic version string.
-  VERSION = "0.1.1"
+  VERSION = "0.2.0"
 end

data/lib/pgoutput.rb CHANGED

@@ -12,6 +12,6 @@ require_relative "pgoutput/relation_tracker"
 # payloads into immutable Ruby protocol message objects. The namespace is kept
 # short as `Pgoutput`, while the RubyGems package name is `pgoutput-parser`.
 #
-# @api public
+# @api public Public namespace for parser entry points, message models, and errors.
 module Pgoutput
 end