RubyGems - data_drain - Versions diffs - 0.1.18 → 0.2.0 - Mend

data_drain 0.1.18 → 0.2.0

Files changed (16)

checksums.yaml +4 -4
data/CHANGELOG.md +20 -0
data/CLAUDE.md +22 -0
data/README.md +69 -169
data/lib/data_drain/engine.rb +53 -40
data/lib/data_drain/file_ingestor.rb +40 -25
data/lib/data_drain/record.rb +26 -5
data/lib/data_drain/storage/s3.rb +48 -6
data/lib/data_drain/validations.rb +17 -0
data/lib/data_drain/version.rb +1 -1
data/lib/data_drain.rb +2 -0
data/skill/SKILL.md +215 -0
data/skill/references/antipatrones.md +242 -0
data/skill/references/api-detallada.md +257 -0
data/skill/references/eventos-telemetria.md +154 -0
metadata +7 -2

data/lib/data_drain/file_ingestor.rb CHANGED

@@ -6,6 +6,8 @@ module DataDrain
   # applying ZSTD compression and Hive partitioning.
   class FileIngestor
     include Observability
+    # rubocop:disable Metrics/AbcSize, Metrics/CyclomaticComplexity, Metrics/PerceivedComplexity,
+    #   Metrics/MethodLength
     # @param options [Hash] Ingestion options.
     # @option options [String] :source_path Absolute path to the local file.
@@ -14,19 +16,20 @@ module DataDrain
     # @option options [String] :select_sql (Optional) SELECT statement to transform data on the fly.
     # @option options [Boolean] :delete_after_upload (Optional) Deletes the local file when finished. Defaults to true.
     def initialize(options)
-      @source_path         = options.fetch(:source_path)
-      @folder_name         = options.fetch(:folder_name)
-      @partition_keys      = options.fetch(:partition_keys, [])
-      @select_sql          = options.fetch(:select_sql, "*")
+      @source_path = options.fetch(:source_path)
+      @folder_name = options.fetch(:folder_name)
+      Validations.validate_identifier!(:folder_name, @folder_name)
+      @partition_keys = options.fetch(:partition_keys, [])
+      @select_sql = options.fetch(:select_sql, "*")
       @delete_after_upload = options.fetch(:delete_after_upload, true)
-      @bucket              = options[:bucket]
+      @bucket = options[:bucket]
-      @config  = DataDrain.configuration
-      @logger  = @config.logger
+      @config = DataDrain.configuration
+      @logger = @config.logger
       @adapter = DataDrain::Storage.adapter
       database = DuckDB::Database.open(":memory:")
-      @duckdb  = database.connect
+      @duckdb = database.connect
     end
     # Runs the ingestion flow.
@@ -52,7 +55,11 @@ module DataDrain
       step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
       source_count = @duckdb.query("SELECT COUNT(*) FROM #{reader_function}").first.first
       source_query_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
-      safe_log(:info, "file_ingestor.count", { source_path: @source_path, count: source_count, source_query_duration_s: source_query_duration.round(2) })
+      safe_log(:info, "file_ingestor.count", {
+                 source_path: @source_path,
+                 count: source_count,
+                 source_query_duration_s: source_query_duration.round(2)
+               })
       if source_count.zero?
         cleanup_local_file
@@ -63,9 +70,14 @@ module DataDrain
       # 2. Export / Upload
       @adapter.prepare_export_path(@bucket, @folder_name)
-      dest_path = @config.storage_mode.to_sym == :s3 ? "s3://#{@bucket}/#{@folder_name}/" : File.join(@bucket, @folder_name, "")
+      dest_path = if @config.storage_mode.to_sym == :s3
+                    "s3://#{@bucket}/#{@folder_name}/"
+                  else
+                    File.join(@bucket,
+                              @folder_name, "")
+                  end
-      partition_clause = @partition_keys.any? ? "PARTITION_BY (#{@partition_keys.join(', ')})," : ""
+      partition_clause = @partition_keys.any? ? "PARTITION_BY (#{@partition_keys.join(", ")})," : ""
       query = <<~SQL
         COPY (
@@ -87,18 +99,19 @@ module DataDrain
       duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
       safe_log(:info, "file_ingestor.complete", {
-        source_path: @source_path,
-        duration_s: duration.round(2),
-        source_query_duration_s: source_query_duration.round(2),
-        export_duration_s: export_duration.round(2),
-        count: source_count
-      })
+                 source_path: @source_path,
+                 duration_s: duration.round(2),
+                 source_query_duration_s: source_query_duration.round(2),
+                 export_duration_s: export_duration.round(2),
+                 count: source_count
+               })
       cleanup_local_file
       true
     rescue DuckDB::Error => e
       duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
-      safe_log(:error, "file_ingestor.duckdb_error", { source_path: @source_path }.merge(exception_metadata(e)).merge(duration_s: duration.round(2)))
+      safe_log(:error, "file_ingestor.duckdb_error",
+               { source_path: @source_path }.merge(exception_metadata(e)).merge(duration_s: duration.round(2)))
       false
     ensure
       @duckdb&.close
@@ -109,11 +122,11 @@ module DataDrain
     # @api private
     def determine_reader
       case File.extname(@source_path).downcase
-      when '.csv'
+      when ".csv"
         "read_csv_auto('#{@source_path}')"
-      when '.json'
+      when ".json"
         "read_json_auto('#{@source_path}')"
-      when '.parquet'
+      when ".parquet"
         "read_parquet('#{@source_path}')"
       else
         raise DataDrain::Error, "Formato de archivo no soportado para ingestión: #{@source_path}"
@@ -122,10 +135,12 @@ module DataDrain
     # @api private
     def cleanup_local_file
-      if @delete_after_upload && File.exist?(@source_path)
-        File.delete(@source_path)
-        safe_log(:info, "file_ingestor.cleanup", { source_path: @source_path })
-      end
+      return unless @delete_after_upload && File.exist?(@source_path)
+      File.delete(@source_path)
+      safe_log(:info, "file_ingestor.cleanup", { source_path: @source_path })
     end
   end
+  # rubocop:enable Metrics/AbcSize, Metrics/CyclomaticComplexity, Metrics/PerceivedComplexity,
+  #   Metrics/MethodLength
 end
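
Review note on `determine_reader`: the extension-to-reader dispatch above can be sketched in isolation. This is a minimal, standalone illustration, not the gem's code: `READERS` and `reader_for` are hypothetical names, `ArgumentError` stands in for `DataDrain::Error`, and the single-quote escaping of the path is an addition that the hunk itself does not perform.

```ruby
# Hypothetical lookup table; the three DuckDB reader functions are the ones
# named in the hunk above.
READERS = {
  ".csv" => "read_csv_auto",
  ".json" => "read_json_auto",
  ".parquet" => "read_parquet"
}.freeze

# Builds the DuckDB table-function call for a local file path.
# Assumption: single quotes in the path are doubled before interpolation.
def reader_for(path)
  function = READERS.fetch(File.extname(path).downcase) do
    raise ArgumentError, "Unsupported file format: #{path}"
  end
  "#{function}('#{path.gsub("'", "''")}')"
end

puts reader_for("/tmp/events.CSV")
puts reader_for("/tmp/o'brien.json")
```

The extension match is case-insensitive, as in the original `case` statement, while the path itself is passed through unchanged apart from the quote doubling.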

data/lib/data_drain/record.rb CHANGED

@@ -11,7 +11,7 @@ module DataDrain
   # @example
   #   class ArchivedVersion < DataDrain::Record
   #     self.folder_name = 'versions'
-  #     self.partition_keys = [:year, :month, :isp_id]
+  #     self.partition_keys = [:isp_id, :year, :month]
   #     attribute :event, :string
   #   end
   class Record
@@ -24,10 +24,28 @@ module DataDrain
     class_attribute :folder_name
     class_attribute :partition_keys
+    # Closes the current thread's DuckDB connection and clears Thread.current.
+    # Idempotent: calling it several times does not raise.
+    #
+    # Useful in Sidekiq/Puma middleware to avoid memory leaks in long-lived
+    # threads.
+    #
+    # @return [void]
+    def self.disconnect!
+      entry = Thread.current[:data_drain_duckdb]
+      Thread.current[:data_drain_duckdb] = nil
+      return unless entry
+      entry[:conn]&.close
+      entry[:db]&.close
+    rescue StandardError # rubocop:disable Lint/SuppressedException
+    end
     # Returns the persistent in-memory DuckDB connection for the current thread.
     # This avoids reloading extensions (such as httpfs) on every query.
     #
     # @return [DuckDB::Connection] Active DuckDB connection.
+    # rubocop:disable Metrics/AbcSize
     def self.connection
       Thread.current[:data_drain_duckdb] ||= begin
         db = DuckDB::Database.open(":memory:")
@@ -42,6 +60,7 @@ module DataDrain
       end
       Thread.current[:data_drain_duckdb][:conn]
     end
+    # rubocop:enable Metrics/AbcSize
     # Queries records in the Data Lake, filtering by partition keys.
     #
@@ -52,7 +71,7 @@ module DataDrain
       path = build_query_path(partitions)
       sql = <<~SQL
-        SELECT #{attribute_names.join(', ')}
+        SELECT #{attribute_names.join(", ")}
         FROM read_parquet('#{path}')
         ORDER BY created_at DESC
         LIMIT #{limit}
@@ -73,7 +92,7 @@ module DataDrain
       safe_id = id.to_s.gsub("'", "''")
       sql = <<~SQL
-        SELECT #{attribute_names.join(', ')}
+        SELECT #{attribute_names.join(", ")}
         FROM read_parquet('#{path}')
         WHERE id = '#{safe_id}'
         LIMIT 1
@@ -97,7 +116,7 @@ module DataDrain
     # @return [String] Human-readable console representation.
     def inspect
       inspection = attributes.map do |name, value|
-        "#{name}: #{value.nil? ? 'nil' : value.inspect}"
+        "#{name}: #{value.nil? ? "nil" : value.inspect}"
       end.compact.join(", ")
       "#<#{self.class} #{inspection}>"
@@ -110,7 +129,7 @@ module DataDrain
       # @param partitions [Hash]
       # @return [String]
       def build_query_path(partitions)
-        partition_path = partitions.map { |k, v| "#{k}=#{v}" }.join("/")
+        partition_path = partition_keys.map { |k| "#{k}=#{partitions[k.to_sym] || partitions[k.to_s]}" }.join("/")
         DataDrain::Storage.adapter.build_path(bucket, folder_name, partition_path)
       end
@@ -118,6 +137,7 @@ module DataDrain
       # @param sql [String]
       # @param columns [Array<String>]
       # @return [Array<DataDrain::Record>]
+      # rubocop:disable Metrics/MethodLength
       def execute_and_instantiate(sql, columns)
         @logger = DataDrain.configuration.logger
         begin
@@ -133,5 +153,6 @@ module DataDrain
         end
       end
     end
+    # rubocop:enable Metrics/MethodLength
   end
 end
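
Review note on `build_query_path`: the new version walks the declared `partition_keys` instead of the caller's hash, so the path order no longer depends on argument order. The sketch below reproduces only that expression in a standalone method; `partition_path` is a hypothetical name and no DataDrain classes are loaded.

```ruby
# Stand-in for the expression added in the hunk above: iterate the declared
# keys and look each one up by symbol first, then by string.
def partition_path(partition_keys, partitions)
  partition_keys.map { |k| "#{k}=#{partitions[k.to_sym] || partitions[k.to_s]}" }.join("/")
end

keys = %i[isp_id year month]

# Caller order does not matter; declared order wins.
puts partition_path(keys, { month: 3, year: 2026, isp_id: 42 })

# String keys are accepted as a fallback.
puts partition_path(keys, { "isp_id" => 42, "year" => 2026, "month" => 3 })

# A key missing from the hash yields an empty value segment.
puts partition_path(keys, { isp_id: 42 })
```

One consequence worth checking against real call sites: as written, a partial filter produces segments such as `year=` rather than omitting them.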

data/lib/data_drain/storage/s3.rb CHANGED

@@ -4,21 +4,59 @@ module DataDrain
   module Storage
     # Storage adapter implementation for Amazon S3.
     class S3 < Base
+      # rubocop:disable Metrics/AbcSize, Metrics/CyclomaticComplexity, Metrics/MethodLength
       # Loads the httpfs extension in DuckDB and injects the AWS credentials.
+      # If aws_access_key_id and aws_secret_access_key are set, it uses
+      # explicit credentials. Otherwise it uses credential_chain (IAM role, env vars,
+      # ~/.aws/credentials).
       # @param connection [DuckDB::Connection]
+      # @raise [DataDrain::ConfigurationError] if aws_region is not configured
       def setup_duckdb(connection)
         connection.query("INSTALL httpfs; LOAD httpfs;")
-        connection.query("SET s3_region='#{@config.aws_region}';")
-        connection.query("SET s3_access_key_id='#{@config.aws_access_key_id}';")
-        connection.query("SET s3_secret_access_key='#{@config.aws_secret_access_key}';")
+        create_s3_secret(connection)
       end
+      private
+      # @param connection [DuckDB::Connection]
+      # @raise [DataDrain::ConfigurationError]
+      def create_s3_secret(connection)
+        region = @config.aws_region
+        raise DataDrain::ConfigurationError, "aws_region es obligatorio para storage_mode=:s3" if region.nil?
+        if @config.aws_access_key_id && @config.aws_secret_access_key
+          connection.query(<<~SQL)
+            CREATE OR REPLACE SECRET data_drain_s3 (
+              TYPE S3,
+              KEY_ID '#{escape_sql(@config.aws_access_key_id)}',
+              SECRET '#{escape_sql(@config.aws_secret_access_key)}',
+              REGION '#{escape_sql(region)}'
+            );
+          SQL
+        else
+          connection.query(<<~SQL)
+            CREATE OR REPLACE SECRET data_drain_s3 (
+              TYPE S3,
+              PROVIDER credential_chain,
+              REGION '#{escape_sql(region)}'
+            );
+          SQL
+        end
+      end
+      # @param value [String]
+      # @return [String]
+      def escape_sql(value)
+        value.to_s.gsub("'", "''")
+      end
+      public
       # @param bucket [String]
       # @param folder_name [String]
       # @param partition_path [String, nil]
       # @return [String]
       def build_path(bucket, folder_name, partition_path)
-      # In S3, base_path acts as the bucket name
         base = File.join(bucket, folder_name)
         base = File.join(base, partition_path) if partition_path && !partition_path.empty?
         "s3://#{base}/**/*.parquet"
@@ -40,7 +78,7 @@ module DataDrain
           val = partitions[key]
           val.nil? || val.to_s.empty? ? "#{key}=[^/]+" : "#{key}=#{val}"
         end
-        pattern_regex = Regexp.new("^#{folder_name}/#{regex_parts.join('/')}")
+        pattern_regex = Regexp.new("^#{folder_name}/#{regex_parts.join("/")}")
         objects_to_delete = []
         prefix = "#{folder_name}/"
@@ -58,7 +96,10 @@ module DataDrain
       private
-      # @api private
+      # @param client [Aws::S3::Client]
+      # @param bucket [String]
+      # @param objects_to_delete [Array<Hash>]
+      # @return [Integer]
       def delete_in_batches(client, bucket, objects_to_delete)
         return 0 if objects_to_delete.empty?
@@ -70,5 +111,6 @@ module DataDrain
         deleted_count
       end
     end
+    # rubocop:enable Metrics/AbcSize, Metrics/CyclomaticComplexity, Metrics/MethodLength
   end
 end
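
Review note on `create_s3_secret`: the two branches differ only in which fields go into the `CREATE OR REPLACE SECRET` statement. The sketch below builds the statement as a string so the branching and the quote doubling can be checked without a DuckDB connection. `s3_secret_sql` is a hypothetical helper, `ArgumentError` stands in for `DataDrain::ConfigurationError`, and the single-line layout is a simplification of the heredocs in the hunk.

```ruby
# Same quoting rule as escape_sql in the hunk above: double every single quote.
def escape_sql(value)
  value.to_s.gsub("'", "''")
end

# Hypothetical helper: returns the SQL text instead of executing it.
def s3_secret_sql(region:, key_id: nil, secret: nil)
  raise ArgumentError, "region is required" if region.nil?

  fields = ["TYPE S3"]
  if key_id && secret
    # Explicit credentials were configured.
    fields << "KEY_ID '#{escape_sql(key_id)}'"
    fields << "SECRET '#{escape_sql(secret)}'"
  else
    # Fall back to the provider chain (IAM role, env vars, ~/.aws/credentials).
    fields << "PROVIDER credential_chain"
  end
  fields << "REGION '#{escape_sql(region)}'"
  "CREATE OR REPLACE SECRET data_drain_s3 (#{fields.join(", ")});"
end

puts s3_secret_sql(region: "us-east-1")
puts s3_secret_sql(region: "us-east-1", key_id: "AKIA'X", secret: "s3cr3t")
```

Supplying only one of the two credentials takes the `credential_chain` branch, which matches the `&&` condition in the diff.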

data/lib/data_drain/validations.rb ADDED

@@ -0,0 +1,17 @@
+# frozen_string_literal: true
+module DataDrain
+  # Configuration validation module to prevent usage errors.
+  module Validations
+    IDENTIFIER_REGEX = /\A[a-zA-Z_][a-zA-Z0-9_]*\z/
+    module_function
+    def validate_identifier!(name, value)
+      return if IDENTIFIER_REGEX.match?(value.to_s)
+      raise DataDrain::ConfigurationError,
+            "#{name} '#{value}' no es un identificador SQL válido"
+    end
+  end
+end
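
Review note on `IDENTIFIER_REGEX`: a quick way to see what the new `folder_name` check accepts is to run the same pattern against a few inputs. The sketch below copies the regex from the file above into a hypothetical `IdentifierCheck` module that returns a boolean instead of raising, so it runs without the gem.

```ruby
# Illustration only; the regex is copied from validations.rb above.
module IdentifierCheck
  IDENTIFIER_REGEX = /\A[a-zA-Z_][a-zA-Z0-9_]*\z/

  module_function

  # Returns true when the value looks like a plain SQL identifier.
  def valid_identifier?(value)
    IDENTIFIER_REGEX.match?(value.to_s)
  end
end

samples = ["versions", :audit_logs_2026, "2026_logs", "logs/2026", "x'; DROP TABLE y", "ok\nDROP", ""]
samples.each do |sample|
  puts "#{sample.inspect}: #{IdentifierCheck.valid_identifier?(sample)}"
end
```

Because the pattern is anchored with `\A` and `\z`, a value containing a newline is rejected as a whole. It also means a nested folder name containing `/` would not pass this check.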

data/lib/data_drain/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module DataDrain
-  VERSION = "0.1.18"
+  VERSION = "0.2.0"
 end

data/lib/data_drain.rb CHANGED

@@ -4,6 +4,7 @@ require "active_model"
 require_relative "data_drain/version"
 require_relative "data_drain/errors"
 require_relative "data_drain/configuration"
+require_relative "data_drain/validations"
 require_relative "data_drain/storage"
 require_relative "data_drain/observability"
 require_relative "data_drain/engine"
@@ -15,6 +16,7 @@ require_relative "data_drain/glue_runner"
 require_relative "data_drain/types/json_type"
 ActiveModel::Type.register(:json, DataDrain::Types::JsonType)
+# DSL for extracting, archiving, and purging data between PostgreSQL and a Parquet Data Lake.
 module DataDrain
   class << self
     # @return [DataDrain::Configuration]

data/skill/SKILL.md ADDED

@@ -0,0 +1,215 @@
+# DataDrain Expert
+Complete knowledge skill for DataDrain. Consult me for any question about integration, architecture, API, errors, and antipatterns.
+## Glossary
+- **DataDrain**: Ruby micro-framework for ETL: extract historical data from PostgreSQL → Parquet (S3/Local) → verify integrity → purge the source.
+- **Engine**: Main engine that orchestrates the Count → Export → Verify → Purge flow.
+- **FileIngestor**: Converts raw files (CSV/JSON/Parquet) into partitioned Parquet in the Data Lake.
+- **Record**: Base class for a read-only analytical ORM (ActiveRecord-style) over Parquet via DuckDB.
+- **GlueRunner**: Orchestrator of AWS Glue Jobs for high-volume tables (>500GB-1TB).
+- **Storage Adapter**: Strategy pattern with two implementations: `Storage::Local` and `Storage::S3`. Cached in `Storage.adapter`.
+- **Observability**: Mixin module (`include`/`extend`) with a resilient `safe_log` and structured KV logging.
+- **Hive Partitioning**: Folder structure `key1=val1/key2=val2/...` that DuckDB generates and consumes natively for efficient prefix scans.
+- **Half-open**: Range convention `[start, end)` using `<` (not `<=`) to avoid losing microseconds at date boundaries.
+- **skip_export**: Engine mode that delegates the export to an external tool (Glue/EMR) and only verifies and purges.
+- **Heartbeat**: Progress log emitted every 100 batches during massive purges (1TB tables).
+- **Wispro-Observability-Spec v1**: KV logging standard: `component=` and `event=` first, `_s` suffix for float durations, `_count` for integers, no units in values.
+## Architecture
+### Core responsibility
+DataDrain handles the lifecycle of historical data in hot relational databases: it archives to a Data Lake with a mathematical guarantee of integrity before purging the source.
+### Components
+```
+┌──────────────┐    ┌──────────────┐    ┌──────────────┐
+│  PostgreSQL  │───>│    Engine    │───>│  Data Lake   │
+└──────────────┘    │  (DuckDB)    │    │ (S3 / Local) │
+       ▲            └──────────────┘    └──────────────┘
+       │                   │                   ▲
+       │                   ▼                   │
+       │            ┌──────────────┐           │
+       └────purge───│  Verify OK?  │           │
+                   └──────────────┘           │
+                                              │
+                          ┌──────────────┐    │
+                          │ FileIngestor │────┘
+                          └──────────────┘
+                                              │
+                          ┌──────────────┐    │
+                          │   Record     │<───┘
+                          │ (queries)    │
+                          └──────────────┘
+```
+### Engine runtime flow
+```
+1. setup_duckdb     → ATTACH Postgres + setup adapter (httpfs si S3)
+2. get_postgres_count → if 0, return true (skip)
+3. export_to_parquet → COPY ... TO ... PARTITION_BY (...) ZSTD  [skipped if skip_export]
+4. verify_integrity  → COUNT(*) Parquet == COUNT(*) Postgres
+5. purge_from_postgres → DELETE in throttled batches + heartbeat
+```
+### Design decisions
+- **In-memory DuckDB** processes millions of records without loading objects into Ruby RAM. It uses `ATTACH POSTGRES READ_ONLY` to read the source and `COPY ... TO` to write Parquet.
+- **Thread-local DuckDB connection** in `Record`: each thread initializes a persistent connection cached in `Thread.current[:data_drain_duckdb] = { db:, conn: }`. The hash retains the `Database` to prevent premature GC of the connection.
+- **Verify is the only safety gate** before purging. If it returns `false` (including on a `DuckDB::Error` while reading Parquet), the purge is aborted.
+- **Cached Storage Adapter**: `Storage.adapter` memoizes the instance. If `storage_mode` changes at runtime, call `Storage.reset_adapter!`.
+- **Half-open ranges**: `created_at >= start AND created_at < end_boundary` where `end_boundary = end_date.next_day.beginning_of_day`. Never `<= end_of_day`.
+### Stack and dependencies
+- Ruby `>= 3.0.0`
+- Runtime: `activemodel >= 6.0`, `duckdb ~> 1.4`, `pg >= 1.2`, `aws-sdk-s3 ~> 1.114`, `aws-sdk-glue ~> 1.0`
+- Current version: `0.1.19`
+## Public API (summary)
+### Global configuration
+```ruby
+DataDrain.configure do |config|
+  config.storage_mode = :local | :s3
+  config.aws_region, .aws_access_key_id, .aws_secret_access_key
+  config.db_host, .db_port, .db_user, .db_pass, .db_name
+  config.batch_size = 5000
+  config.throttle_delay = 0.5
+  config.idle_in_transaction_session_timeout = 0  # 0 = DISABLED
+  config.limit_ram = "2GB"
+  config.tmp_directory = "/tmp/duckdb_work"
+  config.logger = Rails.logger
+end
+```
+### Main operations
+```ruby
+# 1. Full ETL (Engine)
+DataDrain::Engine.new(
+  bucket:, start_date:, end_date:, table_name:,
+  partition_keys: %w[isp_id year month],
+  primary_key: "id",            # optional
+  where_clause: nil,             # optional, extra SQL
+  skip_export: false,            # true delegates export to Glue
+  folder_name: nil,              # default = table_name
+  select_sql: "*"                # default
+).call  # => true (ok) | false (integrity fail)
+# 2. Raw file ingestion
+DataDrain::FileIngestor.new(
+  bucket:, source_path:, folder_name:,
+  partition_keys: [],            # optional
+  select_sql: "*",               # optional
+  delete_after_upload: true      # optional
+).call
+# 3. Analytical ORM
+class ArchivedX < DataDrain::Record
+  self.bucket = "..."
+  self.folder_name = "..."
+  self.partition_keys = [:isp_id, :year, :month]  # ORDER = Hive hierarchy
+  attribute :id, :string
+end
+ArchivedX.where(limit: 10, isp_id: 42, year: 2026, month: 3)  # => Array
+ArchivedX.find("uuid", isp_id: 42, year: 2026, month: 3)       # => instance | nil
+ArchivedX.destroy_all(isp_id: 42)                               # => Integer (partitions deleted)
+# 4. Glue for 1TB+ tables
+DataDrain::GlueRunner.run_and_wait("job-name", { "--key" => "val" }, polling_interval: 30)
+```
+Full details of signatures, parameters, return values, and behaviors are in [Detailed API](references/api-detallada.md).
+## FAQ
+### When should I use `Engine` directly vs `GlueRunner` + `Engine(skip_export: true)`?
+`Engine` used directly comfortably handles up to ~10-50GB. For tables >500GB-1TB, delegate the export to AWS Glue (distributed Apache Spark) and use `Engine(skip_export: true)` only to verify integrity and purge Postgres. In this mode DataDrain only reads Parquet (it does not export) and deletes the source once the counts are confirmed.
+### What happens if `verify_integrity` fails?
+`Engine#call` returns `false` and **does not run the purge**. It emits the `engine.integrity_error` log. If the failure comes from being unable to read the Parquet (`DuckDB::Error`), it emits `engine.parquet_read_error` and also returns `false`. It is the system's only mathematical safeguard.
+### How do I change `storage_mode` at runtime?
+```ruby
+DataDrain.configure { |c| c.storage_mode = :s3 }
+DataDrain::Storage.reset_adapter!  # REQUIRED, otherwise the cached adapter keeps being used
+```
+### Why `idle_in_transaction_session_timeout = 0`?
+`0` **disables** the timeout (no time limit). It is mandatory for high-volume purges where a batch can take seconds. Internally it is validated with `!nil?` (not `.present?`) because `0.present?` is `false` in Rails.
+### Does the order of `partition_keys` matter?
+Yes, **critically**. It determines the Hive hierarchy on disk. The order used when **writing** (Engine/FileIngestor) must be identical to the one declared in the `Record` model that reads. Mismatch → DuckDB returns empty results with no error. Canonical convention: `[dimension_principal, year, month]` (highest cardinality or most-used filter first).
+### Is the DuckDB connection thread-safe?
+Yes. `Record.connection` keeps one connection per thread via `Thread.current`. In Puma/Sidekiq each worker thread has its own. The connection is never closed explicitly (it persists for the life of the thread). `Engine` and `FileIngestor` create their own ephemeral connection per instance and close it in `ensure`.
+### Does DataDrain validate table names?
+No. `table_name`, `select_sql`, and `where_clause` are interpolated directly into SQL. The gem assumes these values come from application code (not user input). In `Record.find` the `id` is sanitized (single quotes are escaped).
+### How do I avoid OOM with large tables?
+Set `limit_ram` (e.g. `"2GB"`) and `tmp_directory` (on SSD). DuckDB will spill to disk automatically. For tables >500GB, delegate to Glue.
+### Do the logs include `source=`?
+No. The gem does NOT emit `source=` manually; it is injected automatically by `exis_ray` (an external logger middleware) when present. If you do not use `exis_ray`, add it with a logger wrapper.
+### What format do the logs use?
+`component=data_drain event=<class>.<event> [KV fields]`. Durations use the `_s` suffix with a float value. Counters use `_count` with an integer value. No units in values. Details in [Events and Telemetry](references/eventos-telemetria.md).
+## Errors
+Catalog of the main errors. Full details and resolution in [Detailed API](references/api-detallada.md).
+### `DataDrain::Error`
+Base class. Every framework exception inherits from it.
+### `DataDrain::ConfigurationError`
+Raised when required configuration is missing. **Typical cause:** forgetting `aws_*` with `storage_mode = :s3`. **Resolution:** complete the `DataDrain.configure` block.
+### `DataDrain::IntegrityError`
+Reserved for mathematical verification failures. Currently `Engine#call` returns `false` instead of raising it. **Resolution:** investigate the mismatch between the Postgres count and the Parquet count.
+### `DataDrain::StorageError`
+Problems interacting with local disk, S3, or DuckDB. **Typical cause:** invalid AWS credentials, nonexistent bucket, insufficient S3 permissions.
+### `DataDrain::Storage::InvalidAdapterError`
+Unrecognized `storage_mode`. **Cause:** a value other than `:local` or `:s3`. **Resolution:** fix the configuration.
+### `DuckDB::Error` (not wrapped)
+DuckDB query errors. In `Engine#verify_integrity` it is rescued and logged as `engine.parquet_read_error`, returning `false`. In `FileIngestor#call` it is rescued and logged as `file_ingestor.duckdb_error`, returning `false`. In `Record` it is rescued in `execute_and_instantiate`, which returns `[]`.
+### `RuntimeError` from `GlueRunner`
+Raised when a Glue Job ends with status `FAILED`, `STOPPED`, or `TIMEOUT`. **Message:** `"Glue Job <name> (Run ID: <id>) falló con estado <status>."`
+## Antipatterns
+Full catalog in [Antipatterns](references/antipatrones.md). Summary of the most critical:
+1. **Bypassing `verify_integrity`** by calling `purge_from_postgres` directly: breaks the only safety guarantee.
+2. **Mismatched `partition_keys` order** between write and read: DuckDB returns empty results with no error.
+3. **Changing `storage_mode` without `reset_adapter!`**: the old cached adapter keeps being used.
+4. **Validating `idle_in_transaction_session_timeout` with `.present?`**: `0.present?` is `false`, so the setting is ignored.
+5. **Using `<= end_of_day`** in date ranges: loses records with microseconds.
+6. **Logging `source=`** manually: duplicates the field that `exis_ray` injects.
+## References
+- [Detailed API](references/api-detallada.md): Full signatures, parameters, return values, and behaviors of every public class.
+- [Events and Telemetry](references/eventos-telemetria.md): Full catalog of KV events emitted by the gem.
+- [Antipatterns](references/antipatrones.md): What NOT to do and the correct alternatives.
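
Review note on the half-open range convention in SKILL.md: the rule `created_at >= start AND created_at < end_boundary` can be checked with plain Ruby. SKILL.md writes the boundary as `end_date.next_day.beginning_of_day`, which relies on ActiveSupport; the sketch below is a standard-library approximation in UTC, and `in_archive_window?` is a hypothetical helper, not part of the gem.

```ruby
require "date"

# Hypothetical predicate: true when the timestamp falls in [start, end_boundary),
# where end_boundary is midnight (UTC) of the day after end_date.
def in_archive_window?(timestamp, start_date, end_date)
  window_start = Time.utc(start_date.year, start_date.month, start_date.day)
  next_day     = end_date.next_day
  end_boundary = Time.utc(next_day.year, next_day.month, next_day.day)
  timestamp >= window_start && timestamp < end_boundary
end

start_date = Date.new(2026, 3, 1)
end_date   = Date.new(2026, 3, 31)

# Last representable microsecond of the final day: still inside the window.
last_microsecond = Time.utc(2026, 3, 31, 23, 59, 59, 999_999)
# Midnight of the following day: belongs to the next window, not this one.
next_midnight = Time.utc(2026, 4, 1)

puts in_archive_window?(last_microsecond, start_date, end_date)
puts in_archive_window?(next_midnight, start_date, end_date)
```

With a strict `<` on the upper bound, consecutive windows neither overlap nor leave a gap, whatever the timestamp precision of the source column.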