RubyGems - data_drain - Versions diffs - 0.3.2 → 0.5.0 - Mend

data_drain 0.3.2 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (20)

checksums.yaml +4 -4
data/.rubocop.yml +12 -0
data/CHANGELOG.md +43 -0
data/README.md +30 -0
data/docs/IMPROVEMENT_PLAN.md +114 -0
data/docs/execution/v0.4.0-OBSERVACIONES.md +144 -0
data/docs/execution/v0.4.0.md +1216 -0
data/docs/execution/v0.5.0-OBSERVACIONES.md +167 -0
data/docs/execution/v0.5.0.md +900 -0
data/docs/glue-jobs-lifecycle.md +330 -0
data/docs/glue_pyspark_example.py +49 -19
data/lib/data_drain/glue_runner.rb +236 -1
data/lib/data_drain/storage/base.rb +12 -0
data/lib/data_drain/storage/local.rb +13 -0
data/lib/data_drain/storage/s3.rb +17 -0
data/lib/data_drain/validations.rb +8 -0
data/lib/data_drain/version.rb +1 -1
data/skill/SKILL.md +64 -3
data/skill/references/eventos-telemetria.md +8 -0
metadata +6 -1

data/lib/data_drain/storage/base.rb CHANGED

@@ -55,6 +55,18 @@ module DataDrain
         raise NotImplementedError, "#{self.class} debe implementar #destroy_partitions"
       end
+      # Uploads a local file to storage.
+      #
+      # @param local_path [String]
+      # @param bucket [String]
+      # @param s3_key [String] relative key (e.g. "scripts/export.py")
+      # @param content_type [String, nil]
+      # @return [String] full URI of the uploaded file
+      # @raise [NotImplementedError]
+      def upload_file(local_path, bucket, s3_key, content_type: nil)
+        raise NotImplementedError, "#{self.class} debe implementar #upload_file"
+      end
       protected
       # @param bucket [String]

data/lib/data_drain/storage/local.rb CHANGED

@@ -27,6 +27,19 @@ module DataDrain
         "#{build_path_base(bucket, folder_name, partition_path)}/**/*.parquet"
       end
+      # @param local_path [String]
+      # @param bucket [String] Destination directory
+      # @param s3_key [String] Relative path inside the bucket
+      # @param content_type [String, nil] Ignored in local mode
+      # @return [String] Absolute path to the destination file
+      def upload_file(local_path, bucket, s3_key, content_type: nil)
+        _ = content_type
+        dest_path = File.join(bucket, s3_key)
+        FileUtils.mkdir_p(File.dirname(dest_path))
+        FileUtils.cp(local_path, dest_path)
+        dest_path
+      end
       # @param bucket [String]
       # @param folder_name [String]
       # @param partition_keys [Array<Symbol>]
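The local adapter's upload shown in this hunk amounts to a directory-creating copy. A minimal sketch of that behavior follows, assuming a free-standing method named `copy_into_bucket` (a hypothetical name, not part of the gem) and a temporary directory standing in for the bucket:

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical stand-in for the local upload shown in the hunk above:
# treats `bucket` as a directory and `key` as a relative path inside it.
def copy_into_bucket(local_path, bucket, key)
  dest_path = File.join(bucket, key)
  # Create any intermediate "folders" implied by the key before copying.
  FileUtils.mkdir_p(File.dirname(dest_path))
  FileUtils.cp(local_path, dest_path)
  dest_path
end

Dir.mktmpdir do |dir|
  source = File.join(dir, "export.py")
  File.write(source, "print('hello')\n")

  dest = copy_into_bucket(source, File.join(dir, "bucket"), "scripts/export.py")
  puts dest
  puts File.read(dest) == File.read(source)
end
```

Note that the returned path is only absolute when the bucket directory passed in is itself absolute.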

data/lib/data_drain/storage/s3.rb CHANGED

@@ -38,6 +38,23 @@ module DataDrain
         delete_in_batches(client, bucket, objects)
       end
+      # @param local_path [String]
+      # @param bucket [String]
+      # @param s3_key [String]
+      # @param content_type [String, nil]
+      # @return [String] "s3://bucket/key"
+      def upload_file(local_path, bucket, s3_key, content_type: nil)
+        client = s3_client
+        File.open(local_path, "rb") do |file|
+          params = { bucket: bucket, key: s3_key, body: file }
+          params[:content_type] = content_type if content_type
+          client.put_object(**params)
+        end
+        "s3://#{bucket}/#{s3_key}"
+      end
       private
       # @return [Aws::S3::Client]
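The S3 upload above only adds `content_type` to the request when the caller supplies one. The sketch below illustrates that parameter-building step without touching AWS; `RecordingClient` and `upload_with` are hypothetical names introduced for illustration, and the fake client merely records what a real `put_object` call would receive:

```ruby
# Hypothetical stand-in for an S3 client: it records calls instead of sending them.
class RecordingClient
  attr_reader :calls

  def initialize
    @calls = []
  end

  def put_object(**params)
    @calls << params
  end
end

# Hypothetical helper mirroring the hunk's logic: build the params hash,
# add :content_type only when given, then return the s3:// URI.
def upload_with(client, body, bucket, key, content_type: nil)
  params = { bucket: bucket, key: key, body: body }
  params[:content_type] = content_type if content_type
  client.put_object(**params)
  "s3://#{bucket}/#{key}"
end

client = RecordingClient.new
puts upload_with(client, "print(1)", "my-bucket", "scripts/export.py")
puts client.calls.last.key?(:content_type)

upload_with(client, "print(1)", "my-bucket", "scripts/export.py", content_type: "text/x-python")
puts client.calls.last[:content_type]
```

Leaving the key out entirely, rather than passing `nil`, lets the service apply its own default content type.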

data/lib/data_drain/validations.rb CHANGED

@@ -6,9 +6,17 @@ module DataDrain
     # Regex that validates SQL identifiers (tables, columns, etc.).
     # Allows letters, underscores, and digits (digits not at the start).
     IDENTIFIER_REGEX = /\A[a-zA-Z_][a-zA-Z0-9_]*\z/
+    GLUE_NAME_REGEX = /\A(?![_-])[a-zA-Z0-9_-]+\z/
     module_function
+    def validate_glue_name!(name, value)
+      return if GLUE_NAME_REGEX.match?(value.to_s)
+      raise DataDrain::ConfigurationError,
+            "#{name} '#{value}' no es un nombre válido para Glue Job (usa solo letras, números, '-' y '_')"
+    end
     def validate_identifier!(name, value)
       return if IDENTIFIER_REGEX.match?(value.to_s)
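To see what the new `GLUE_NAME_REGEX` accepts, the sketch below copies the pattern from the hunk and runs it against a few sample names. The helper `valid_glue_name?` and the sample names are assumptions made for illustration; only the regex itself comes from the diff:

```ruby
# Pattern copied from the hunk above: one or more of letters, digits, '_' or '-',
# with a negative lookahead that forbids '_' or '-' as the first character.
GLUE_NAME_REGEX = /\A(?![_-])[a-zA-Z0-9_-]+\z/

# Hypothetical helper for illustration; the gem raises ConfigurationError instead.
def valid_glue_name?(value)
  GLUE_NAME_REGEX.match?(value.to_s)
end

["my-job", "my_job", "9job", "_my-job", "-my-job", "my.job", ""].each do |name|
  puts "#{name.inspect} valid=#{valid_glue_name?(name)}"
end
```

The lookahead rejects a leading hyphen as well as a leading underscore, while a leading digit is allowed, unlike `IDENTIFIER_REGEX`.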

data/lib/data_drain/version.rb CHANGED

@@ -2,5 +2,5 @@
 module DataDrain
   # @return [String] semver version of the gem
-  VERSION = "0.3.2"
+  VERSION = "0.5.0"
 end

data/skill/SKILL.md CHANGED

@@ -8,12 +8,14 @@ Skill de conocimiento completo sobre DataDrain. Consultame para cualquier pregun
 - **Engine**: Main engine that orchestrates the Count → Export → Verify → Purge flow.
 - **FileIngestor**: Converts raw files (CSV/JSON/Parquet) to partitioned Parquet in the Data Lake.
 - **Record**: Read-only analytical ORM base class (ActiveRecord-like) over Parquet via DuckDB.
-- **GlueRunner**: AWS Glue Jobs orchestrator for high-volume tables (>500GB-1TB).
+- **GlueRunner**: AWS Glue Jobs orchestrator for high-volume tables (>500GB-1TB). Supports the full lifecycle: create, update, delete, and check jobs.
 - **Storage Adapter**: Strategy pattern with two implementations: `Storage::Local` and `Storage::S3`. Cached in `Storage.adapter`.
 - **Observability**: Mixin module (`include`/`extend`) with a resilient `safe_log` and structured KV logging.
 - **Hive Partitioning**: `key1=val1/key2=val2/...` folder structure that DuckDB generates and consumes natively for efficient prefix scans.
 - **Half-open**: Range convention `[start, end)` using `<` (not `<=`) to avoid losing microseconds at date boundaries.
 - **skip_export**: Engine mode that delegates the export to an external tool (Glue/EMR) and only verifies and purges.
+- **ensure_job**: Idempotent GlueRunner wrapper that creates or updates a job to match the desired config. Includes config diffing to avoid unnecessary API calls.
+- **changed_fields**: Private helper of ensure_job that compares the desired config against a Glue Job's current config and returns which fields differ.
 - **Heartbeat**: Progress log emitted every 100 batches during massive purges (1TB tables).
 - **Wispro-Observability-Spec v1**: KV log standard: `component=` and `event=` first, `_s` suffix for float durations, `_count` for integers, no units in values.
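The config diffing described for `ensure_job` and `changed_fields` can be pictured with plain hashes. This is a minimal sketch under stated assumptions: the real implementation lives in `glue_runner.rb`, which is not shown in this diff, so the method bodies, the hash-based config shape, and the `plan_action` helper below are all hypothetical:

```ruby
# Hypothetical diff helper: returns the keys whose desired value differs
# from the current value (keys absent from `current` count as changed).
def changed_fields(desired, current)
  desired.each_with_object([]) do |(key, value), diff|
    diff << key unless current[key] == value
  end
end

# Hypothetical upsert decision: create when the job is missing, update only
# when at least one field differs, otherwise make no API call at all.
def plan_action(desired, current)
  return :create if current.nil?

  changed_fields(desired, current).empty? ? :noop : :update
end

desired = { timeout: 1440, max_retries: 2 }
current = { timeout: 1440, max_retries: 0 }

puts changed_fields(desired, current).inspect
puts plan_action(desired, nil)
puts plan_action(desired, current)
puts plan_action(desired, desired.dup)
```

Comparing before calling the API is what makes repeated invocations cheap: an unchanged config resolves to `:noop`.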
@@ -66,9 +68,9 @@ DataDrain resuelve el ciclo de vida de datos históricos en bases relacionales c
 ### Stack and dependencies
-- Ruby `>= 3.0.0`
+- Ruby `>= 3.2.0`
 - Runtime: `activemodel >= 6.0`, `duckdb ~> 1.4`, `pg >= 1.2`, `aws-sdk-s3 ~> 1.114`, `aws-sdk-glue ~> 1.0`
-- Current version: `0.1.19`
+- Current version: `0.5.0`
 ## Public API (summary)
@@ -123,6 +125,35 @@ ArchivedX.destroy_all(isp_id: 42)                               # => Integer (pa
 # 4. Glue for 1TB+ tables
 DataDrain::GlueRunner.run_and_wait("job-name", { "--key" => "val" }, polling_interval: 30)
+# 4b. Glue Jobs Lifecycle (v0.4.0+)
+# Check whether the job exists
+DataDrain::GlueRunner.job_exists?("my-job")  # => true/false
+# Fetch the full config
+job = DataDrain::GlueRunner.get_job("my-job")  # => Aws::Glue::Types::Job
+# Create a job from a local script (v0.5.0+)
+job = DataDrain::GlueRunner.create_job(
+  "my-job",
+  role_arn: "arn:aws:iam::123:role/GlueRole",
+  script_path: "scripts/glue/export.py",  # local → S3, uploaded automatically
+  script_bucket: "my-bucket",
+  script_folder: "scripts",
+  timeout: 1440,
+  max_retries: 2
+)
+# Idempotent upsert with config diffing
+job = DataDrain::GlueRunner.ensure_job(
+  "my-job",
+  role_arn: "arn:aws:iam::123:role/GlueRole",
+  script_path: "scripts/glue/export.py",
+  script_bucket: "my-bucket"
+)
+# Delete job (idempotent)
+DataDrain::GlueRunner.delete_job("my-job")  # => true/false
 ```
 Full details on signatures, parameters, return values, and behaviors in [Detailed API](references/api-detallada.md).
@@ -172,6 +203,28 @@ No. La gema NO emite `source=` manualmente — lo inyecta automáticamente `exis
 `component=data_drain event=<class>.<event> [KV fields]`. Durations use the `_s` suffix with a float value. Counters use `_count` with an integer value. No units in values. Details in [Events and Telemetry](references/eventos-telemetria.md).
+### How do I upload a Glue script from my repo?
+Since v0.5.0 you can use `script_path:` instead of `script_location:`:
+```ruby
+DataDrain::GlueRunner.ensure_job(
+  "my-export-job",
+  script_path: "scripts/glue/export.py",
+  script_bucket: "my-bucket",
+  script_folder: "scripts",
+  role_arn: ENV["GLUE_ROLE_ARN"]
+)
+```
+The gem uploads the script to S3 using the existing `Storage::S3` adapter
+(with `credential_chain` if you have an IAM role). **Requires `storage_mode = :s3`**.
+If `storage_mode = :local`, it raises `ConfigurationError`.
+**Overwrite:** each invocation overwrites the file in S3. This is useful for keeping
+the script in sync with the code in the repo. If you need versioning, use `script_filename:`
+with a hash or timestamp.
 ## Errors
 Catalog of the top errors. Full details and resolution in [Detailed API](references/api-detallada.md).
@@ -197,6 +250,12 @@ Errores de query DuckDB. En `Engine#verify_integrity` se captura y se loguea com
 ### `RuntimeError` from `GlueRunner`
 Raised when a Glue Job finishes with status `FAILED`, `STOPPED`, or `TIMEOUT`. **Message:** `"Glue Job <name> (Run ID: <id>) falló con estado <status>."`
+### `Aws::Glue::Errors::EntityNotFoundException`
+The Glue Job does not exist. `job_exists?` rescues it and returns `false`. `get_job`, `update_job`, and `delete_job` propagate it.
+### `Aws::Glue::Errors::ServiceError`
+Generic AWS Glue error. It propagates from every lifecycle method. The methods emit `glue_runner.job_*_error` before propagating it.
 ## Antipatterns
 Full catalog in [Antipatterns](references/antipatrones.md). Summary of the most critical ones:
@@ -207,10 +266,12 @@ Catálogo completo en [Antipatrones](references/antipatrones.md). Resumen de los
 4. **Validating `idle_in_transaction_session_timeout` with `.present?`**: `0.present?` is `false`, so the config is ignored.
 5. **Using `<= end_of_day`** in date ranges: loses records with microseconds.
 6. **Logging `source=`** manually: duplicates the field that `exis_ray` injects.
+7. **Using Glue Job names that start with an underscore**: `validate_glue_name!` rejects `_my-job`. Use `my-job` or `my_job` (no leading underscore).
 ## References
 - [Detailed API](references/api-detallada.md): Full signatures, parameters, return values, and behaviors of each public class.
+- [Glue Jobs Lifecycle](https://github.com/gedera/data_drain/blob/main/docs/glue-jobs-lifecycle.md): Complete guide to managing AWS Glue Jobs: create, update, delete, check, and run jobs idempotently.
 - [Events and Telemetry](references/eventos-telemetria.md): Full catalog of KV events emitted by the gem.
 - [Antipatterns](references/antipatrones.md): What NOT to do and the correct alternatives.
 - [Postgres Tuning](references/postgres-tuning.md): Indexes, VACUUM, partitioning, and diagnostics by table size.

data/skill/references/eventos-telemetria.md CHANGED

@@ -115,6 +115,14 @@ Catálogo completo de eventos KV emitidos por DataDrain. Formato Wispro-Observab
 **Level:** INFO. Emitted before `start_job_run`.
 **Fields:** `job`.
+### `glue_runner.job_exists`
+**Level:** INFO. Emitted in `ensure_job` when the job already exists and is updated.
+**Fields:** `job`.
+### `glue_runner.job_created`
+**Level:** INFO. Emitted in `ensure_job` when the job is created.
+**Fields:** `job`.
 ### `glue_runner.polling`
 **Level:** INFO. Emitted on every status check while the Job has not finished.
 **Fields:** `job`, `run_id`, `status`, `next_check_in_s`.
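The KV event format used throughout this catalog (`component=` and `event=` first, then the event's fields) can be sketched as a small formatter. The method name `kv_line` and the sample values are assumptions for illustration; the gem's own `safe_log` is not shown in this diff and may work differently:

```ruby
# Hypothetical formatter: puts component= and event= first, then the
# event-specific fields in the order they were given.
def kv_line(event, **fields)
  pairs = { component: "data_drain", event: event }.merge(fields)
  pairs.map { |key, value| "#{key}=#{value}" }.join(" ")
end

# Sample values are made up; next_check_in_s is a float, per the `_s` convention.
puts kv_line(
  "glue_runner.polling",
  job: "my-job",
  run_id: "jr_123",
  status: "RUNNING",
  next_check_in_s: 30.0
)
```

Because Ruby hashes preserve insertion order, building the hash with `component` and `event` first is enough to keep them at the front of the line.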

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: data_drain
 version: !ruby/object:Gem::Version
-  version: 0.3.2
+  version: 0.5.0
 platform: ruby
 authors:
 - Gabriel
@@ -105,6 +105,11 @@ files:
 - docs/execution/archive/v0.3.1-OBSERVACIONES.md
 - docs/execution/archive/v0.3.1.md
 - docs/execution/v0.2.2.md
+- docs/execution/v0.4.0-OBSERVACIONES.md
+- docs/execution/v0.4.0.md
+- docs/execution/v0.5.0-OBSERVACIONES.md
+- docs/execution/v0.5.0.md
+- docs/glue-jobs-lifecycle.md
 - docs/glue_pyspark_example.py
 - lib/data_drain.rb
 - lib/data_drain/configuration.rb