data_drain 0.3.1 → 0.4.0

@@ -0,0 +1,159 @@
1
+ # Glue Jobs Lifecycle
2
+
3
+ Full lifecycle management of AWS Glue Jobs from the gem.
4
+
5
+ ## Methods
6
+
7
+ ### `job_exists?(job_name)` → Boolean
8
+
9
+ Checks whether a job exists in Glue.
10
+
11
+ ```ruby
12
+ DataDrain::GlueRunner.job_exists?("my-job")
13
+ # => true
14
+ ```
15
+
16
+ - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
17
+ - Returns `false` if the job does not exist (`Aws::Glue::Errors::EntityNotFoundException` is rescued internally).
18
+ - Other AWS errors propagate uncaught.
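In the implementation shown later in this diff, `job_exists?` rescues the not-found error and returns `false`, while any other error propagates. The following is a minimal sketch of that pattern without AWS access. `FakeGlueClient` and `EntityNotFound` are hypothetical stand-ins, not part of the gem or the AWS SDK.

```ruby
# Stand-in for Aws::Glue::Errors::EntityNotFoundException (hypothetical).
class EntityNotFound < StandardError; end

# Hypothetical in-memory client; the real gem talks to Aws::Glue::Client.
class FakeGlueClient
  def initialize(jobs)
    @jobs = jobs
  end

  # Mirrors the shape of get_job: raises when the job is unknown.
  def get_job(job_name:)
    @jobs.fetch(job_name) { raise EntityNotFound, "job #{job_name} not found" }
  end
end

# Existence check: a missing job becomes false; other errors propagate.
def job_exists?(client, job_name)
  client.get_job(job_name: job_name)
  true
rescue EntityNotFound
  false
end

client = FakeGlueClient.new({ "my-job" => { name: "my-job" } })
puts job_exists?(client, "my-job")    # prints "true"
puts job_exists?(client, "other-job") # prints "false"
```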
19
+
20
+ ### `get_job(job_name)` → Aws::Glue::Types::Job
21
+
22
+ Fetches the full configuration of a job.
23
+
24
+ ```ruby
25
+ job = DataDrain::GlueRunner.get_job("my-job")
26
+ job.name # => "my-job"
27
+ job.role # => "arn:aws:iam::123:role/GlueRole"
28
+ job.command # => { name: "glueetl", python_version: "3", script_location: "s3://..." }
29
+ job.default_arguments # => { "--extra-files" => "s3://..." }
30
+ ```
31
+
32
+ - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
33
+ - Raises `Aws::Glue::Errors::EntityNotFoundException` if the job does not exist.
34
+
35
+ ### `create_job(job_name, role_arn:, script_location:, ...)` → Aws::Glue::Types::Job
36
+
37
+ Creates a new job in Glue and returns the created job.
38
+
39
+ **Required parameters:**
40
+ - `job_name` (String): name of the job
41
+ - `role_arn` (String): ARN of the Glue IAM role
42
+ - `script_location` (String): S3 path of the Python script
43
+
44
+ **Optional parameters:**
45
+ - `command_name` (String): command name (`"glueetl"`, `"pythonshell"`). Default: `"glueetl"`.
46
+ - `default_arguments` (Hash): default arguments for the job
47
+ - `description` (String): job description
48
+ - `timeout` (Integer): timeout in minutes. Default: `2880` (48h)
49
+ - `max_retries` (Integer): number of retries. Default: `0`
50
+ - `allocated_capacity` (Integer): legacy DPU setting. Prefer `worker_type` + `number_of_workers`
51
+ - `worker_type` (String): `"Standard"`, `"G.1X"`, `"G.2X"`, `"G.4X"`, `"G.8X"`
52
+ - `number_of_workers` (Integer): number of workers (requires `worker_type`)
53
+ - `glue_version` (String): Glue version (e.g. `"4.0"`)
54
+
55
+ ```ruby
56
+ job = DataDrain::GlueRunner.create_job(
57
+ "my-job",
58
+ role_arn: "arn:aws:iam::123:role/GlueServiceRole",
59
+ script_location: "s3://my-bucket/scripts/export.py",
60
+ default_arguments: { "--extra-files" => "s3://my-bucket/scripts/udf.py" },
61
+ timeout: 1440,
62
+ max_retries: 2,
63
+ worker_type: "G.1X",
64
+ number_of_workers: 10
65
+ )
66
+ ```
67
+
68
+ - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
69
+ - AWS errors propagate uncaught (duplicate name, invalid role, etc.)
70
+
71
+ ### `update_job(job_name, ...)` → Aws::Glue::Types::Job
72
+
73
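+ Updates an existing job and returns the updated job.
74
+
75
+ Same parameters as `create_job`, all optional. Only the parameters provided are sent in the update. The command (and therefore `script_location`) is only included when both `command_name` and `script_location` are given.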
76
+
77
+ ```ruby
78
+ job = DataDrain::GlueRunner.update_job(
79
+ "my-job",
80
+ command_name: "glueetl", script_location: "s3://my-bucket/scripts/export-v2.py", # the command is only updated when both are given
81
+ timeout: 720
82
+ )
83
+ ```
84
+
85
+ - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
86
+ - Raises `Aws::Glue::Errors::EntityNotFoundException` if the job does not exist.
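The way `update_job` assembles its payload in this diff can be sketched in isolation. `build_job_update` below is a hypothetical helper covering only a subset of the parameters. It mirrors the conditional assignments visible in the source, including the detail that the command is dropped unless both `command_name` and `script_location` are present.

```ruby
# Hypothetical helper mirroring how update_job assembles its payload:
# nil arguments are skipped, and the command is only included when both
# command_name and script_location are present.
def build_job_update(role_arn: nil, command_name: nil, script_location: nil,
                     timeout: nil, max_retries: nil)
  job_update = {}
  job_update[:role] = role_arn if role_arn
  if command_name && script_location
    job_update[:command] = { name: command_name, python_version: "3",
                             script_location: script_location }
  end
  job_update[:timeout] = timeout if timeout
  job_update[:max_retries] = max_retries if max_retries
  job_update
end

# script_location alone is dropped; only timeout survives.
p build_job_update(script_location: "s3://my-bucket/scripts/export-v2.py", timeout: 720)

# With command_name the command hash is included.
p build_job_update(command_name: "glueetl",
                   script_location: "s3://my-bucket/scripts/export-v2.py")
```

Because `0` is truthy in Ruby, `max_retries: 0` is still included in the payload.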
87
+
88
+ ### `delete_job(job_name)` → nil
89
+
90
+ Deletes a job from Glue.
91
+
92
+ ```ruby
93
+ DataDrain::GlueRunner.delete_job("my-job")
94
+ # => nil
95
+ ```
96
+
97
+ - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
98
+ - Raises `Aws::Glue::Errors::EntityNotFoundException` if the job does not exist.
99
+
100
+ ### `ensure_job(job_name, role_arn:, script_location:, ...)` → Aws::Glue::Types::Job
101
+
102
+ Creates or updates a job idempotently.
103
+
104
+ - If the job exists → `update_job`
105
+ - If the job does not exist → `create_job`
106
+
107
+ ```ruby
108
+ job = DataDrain::GlueRunner.ensure_job(
109
+ "my-job",
110
+ role_arn: "arn:aws:iam::123:role/GlueServiceRole",
111
+ script_location: "s3://my-bucket/scripts/export.py",
112
+ timeout: 1440
113
+ )
114
+ ```
115
+
116
+ - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
117
+ - AWS errors propagate uncaught.
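The create-or-update flow of `ensure_job` can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions: `FakeGlue` is hypothetical, and the `:created` / `:updated` return values exist only to make the branch visible (the gem returns the job itself).

```ruby
# Hypothetical in-memory registry standing in for AWS Glue.
class FakeGlue
  attr_reader :jobs

  def initialize
    @jobs = {}
  end

  def exists?(name)
    @jobs.key?(name)
  end

  def create(name, attrs)
    @jobs[name] = attrs.dup
  end

  def update(name, attrs)
    @jobs[name] = @jobs.fetch(name).merge(attrs)
  end
end

# Create-or-update: calling it twice with the same input leaves the same state.
def ensure_job(glue, name, **attrs)
  if glue.exists?(name)
    glue.update(name, attrs)
    :updated
  else
    glue.create(name, attrs)
    :created
  end
end

glue = FakeGlue.new
first  = ensure_job(glue, "my-job", timeout: 1440)
second = ensure_job(glue, "my-job", timeout: 1440)
puts first, second # prints "created" then "updated"
```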
118
+
119
+ ### `run_and_wait(job_name, arguments = {}, ...)` → Boolean
120
+
121
+ Runs an existing job and waits for it to complete.
122
+
123
+ ```ruby
124
+ DataDrain::GlueRunner.run_and_wait(
125
+ "my-job",
126
+ { "--start_date" => "2025-01-01", "--end_date" => "2025-02-01" },
127
+ polling_interval: 60,
128
+ max_wait_seconds: 7200
129
+ )
130
+ # => true (SUCCEEDED)
131
+ ```
132
+
133
+ - Raises `RuntimeError` if the job fails (`FAILED`, `STOPPED`, `TIMEOUT`).
134
+ - Raises `DataDrain::Error` if `max_wait_seconds` is exceeded.
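The wait-and-poll behaviour described above can be sketched as a small loop around a monotonic clock. This is not the gem's implementation: `wait_for_run`, `fetch_status`, `sleeper`, and `WaitTimeout` are hypothetical names, and the order of the checks is an assumption.

```ruby
# Stand-in for DataDrain::Error (hypothetical).
class WaitTimeout < StandardError; end

TERMINAL_FAILURES = %w[FAILED STOPPED TIMEOUT].freeze

# fetch_status: callable returning the current run state (hypothetical hook).
# sleeper: callable used to wait between polls, injectable so examples run instantly.
def wait_for_run(fetch_status, polling_interval: 30, max_wait_seconds: nil,
                 sleeper: ->(s) { sleep(s) })
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  loop do
    status = fetch_status.call
    return true if status == "SUCCEEDED"
    # A bare string raise produces a RuntimeError.
    raise "Glue job ended with #{status}" if TERMINAL_FAILURES.include?(status)

    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    if max_wait_seconds && elapsed > max_wait_seconds
      raise WaitTimeout, "exceeded #{max_wait_seconds}s"
    end
    sleeper.call(polling_interval)
  end
end

states = %w[RUNNING RUNNING SUCCEEDED]
result = wait_for_run(-> { states.shift }, polling_interval: 60, sleeper: ->(_s) {})
puts result # prints "true"
```

Using `Process::CLOCK_MONOTONIC` keeps the elapsed-time check unaffected by wall-clock adjustments, which matches the clock the diff uses in `run_and_wait`.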
135
+
136
+ ## Naming conventions
137
+
138
+ The gem validates job names before calling AWS. It accepts letters (`a-zA-Z`), digits (`0-9`), and hyphens (`-`), and the first character must be a letter or digit. Underscores and spaces are rejected.
139
+
140
+ ```ruby
141
+ # Valid
142
+ DataDrain::GlueRunner.job_exists?("my-export-job-v2")
143
+
144
+ # Invalid: raises ConfigurationError
145
+ DataDrain::GlueRunner.job_exists?("my_export_job")
146
+ # DataDrain::ConfigurationError: job_name 'my_export_job' no es un nombre válido para Glue Job
147
+ ```
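The rule can be checked locally with the same pattern the gem defines as `GLUE_NAME_REGEX`. The sketch below copies that regex from the diff. `to_glue_name` is a hypothetical helper, not part of the gem, that shows one way to derive an accepted name from an arbitrary label.

```ruby
# Pattern taken from the gem's GLUE_NAME_REGEX: must start with a letter or digit,
# followed by letters, digits, or hyphens.
GLUE_NAME_REGEX = /\A[a-zA-Z0-9][a-zA-Z0-9-]*\z/

# Hypothetical helper that turns an arbitrary label into a name the regex accepts.
def to_glue_name(label)
  name = label.to_s.gsub(/[^a-zA-Z0-9-]+/, "-").sub(/\A-+/, "")
  raise ArgumentError, "cannot derive a Glue job name from #{label.inspect}" if name.empty?

  name
end

puts GLUE_NAME_REGEX.match?("my-export-job-v2") # prints "true"
puts GLUE_NAME_REGEX.match?("my_export_job")    # prints "false"
puts to_glue_name("my_export_job")              # prints "my-export-job"
```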
148
+
149
+ ## Telemetry events
150
+
151
+ | Event | Level | Description |
152
+ |--------|-------|-------------|
153
+ | `glue_runner.start` | INFO | Before `start_job_run` |
154
+ | `glue_runner.job_exists` | INFO | Job found in `ensure_job` |
155
+ | `glue_runner.job_created` | INFO | Job created in `ensure_job` |
156
+ | `glue_runner.polling` | INFO | Status check during `run_and_wait` |
157
+ | `glue_runner.complete` | INFO | Job finished with `SUCCEEDED` |
158
+ | `glue_runner.failed` | ERROR | Job ended with `FAILED\|STOPPED\|TIMEOUT` |
159
+ | `glue_runner.timeout` | ERROR | `max_wait_seconds` exceeded |
@@ -159,7 +159,7 @@ module DataDrain
159
159
  # @api private
160
160
  # @return [Integer]
161
161
  def get_postgres_count
162
- pg_sql = "SELECT count() AS row_count FROM public.#{@table_name} WHERE #{base_where_sql}"
162
+ pg_sql = "SELECT COUNT(*) AS row_count FROM public.#{@table_name} WHERE #{base_where_sql}"
163
163
  pg_sql = pg_sql.gsub("'", "''")
164
164
  query = "SELECT row_count FROM postgres_query('pg_source', '#{pg_sql}')"
165
165
  @duckdb.query(query).first.first
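The hunk above embeds a Postgres query as a string literal inside a DuckDB `postgres_query` call, doubling single quotes so the inner SQL survives the nesting. A minimal sketch of that escaping step follows. `wrap_for_postgres_query` is a hypothetical helper, and the table name and `WHERE` clause are made up for illustration.

```ruby
# Hypothetical helper: wraps a Postgres query so it can be passed as a string
# literal to DuckDB's postgres_query table function. Single quotes inside the
# inner SQL are doubled, the standard SQL escape for string literals.
def wrap_for_postgres_query(source_alias, pg_sql)
  escaped = pg_sql.gsub("'", "''")
  "SELECT row_count FROM postgres_query('#{source_alias}', '#{escaped}')"
end

# Illustrative inner query; 'events' and the status filter are assumptions.
inner = "SELECT COUNT(*) AS row_count FROM public.events WHERE status = 'done'"
puts wrap_for_postgres_query("pg_source", inner)
```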
@@ -204,7 +204,7 @@ module DataDrain
204
204
 
205
205
  begin
206
206
  query = <<~SQL
207
- SELECT count()
207
+ SELECT COUNT(*)
208
208
  FROM read_parquet('#{archive_path}')
209
209
  WHERE #{base_where_sql}
210
210
  SQL
@@ -82,7 +82,7 @@ module DataDrain
82
82
 
83
83
  # @api private
84
84
  def step_count_source
85
- source_count = timed(:source_query) { @duckdb.query("SELECT count() FROM #{@reader_function}").first.first }
85
+ source_count = timed(:source_query) { @duckdb.query("SELECT COUNT(*) FROM #{@reader_function}").first.first }
86
86
  safe_log(:info, "file_ingestor.count", {
87
87
  source_path: @source_path,
88
88
  count: source_count,
@@ -19,10 +19,109 @@ module DataDrain
19
19
  # @return [Boolean] true if the Job finished successfully (SUCCEEDED).
20
20
  # @raise [DataDrain::Error] if max_wait_seconds is exceeded before SUCCEEDED.
21
21
  # @raise [RuntimeError] if the Job fails or is stopped.
22
+ def self.client
23
+ @client ||= Aws::Glue::Client.new(region: DataDrain.configuration.aws_region)
24
+ end
25
+
26
+ class << self
27
+ attr_writer :client
28
+ end
29
+
30
+ def self.job_exists?(job_name)
31
+ DataDrain::Validations.validate_glue_name!(:job_name, job_name)
32
+ get_job(job_name)
33
+ true
34
+ rescue Aws::Glue::Errors::EntityNotFoundException
35
+ false
36
+ end
37
+
38
+ def self.get_job(job_name)
39
+ DataDrain::Validations.validate_glue_name!(:job_name, job_name)
40
+ client.get_job(job_name: job_name).job
41
+ end
42
+
43
+ def self.create_job(job_name, role_arn:, script_location:, command_name: "glueetl",
44
+ default_arguments: {}, description: nil, worker_type: nil, number_of_workers: nil,
45
+ timeout: 2880, max_retries: 0, allocated_capacity: nil, glue_version: nil)
46
+ DataDrain::Validations.validate_glue_name!(:job_name, job_name)
47
+ opts = {
48
+ name: job_name,
49
+ role: role_arn,
50
+ command: {
51
+ name: command_name,
52
+ python_version: "3",
53
+ script_location: script_location
54
+ }
55
+ }
56
+ opts[:default_arguments] = default_arguments unless default_arguments.empty?
57
+ opts[:description] = description if description
58
+ opts[:timeout] = timeout if timeout
59
+ opts[:max_retries] = max_retries if max_retries
60
+ opts[:allocated_capacity] = allocated_capacity if allocated_capacity
61
+ opts[:worker_type] = worker_type if worker_type
62
+ opts[:number_of_workers] = number_of_workers if number_of_workers
63
+ opts[:glue_version] = glue_version if glue_version
64
+
65
+ client.create_job(**opts)
66
+ get_job(job_name)
67
+ end
68
+
69
+ def self.update_job(job_name, role_arn: nil, command_name: nil, script_location: nil,
70
+ default_arguments: nil, description: nil, worker_type: nil,
71
+ number_of_workers: nil, timeout: nil, max_retries: nil, allocated_capacity: nil,
72
+ glue_version: nil)
73
+ DataDrain::Validations.validate_glue_name!(:job_name, job_name)
74
+ job_update = {}
75
+ job_update[:role] = role_arn if role_arn
76
+ if command_name && script_location
77
+ job_update[:command] =
78
+ { name: command_name, python_version: "3", script_location: script_location }
79
+ end
80
+ job_update[:default_arguments] = default_arguments if default_arguments
81
+ job_update[:description] = description if description
82
+ job_update[:timeout] = timeout if timeout
83
+ job_update[:max_retries] = max_retries if max_retries
84
+ job_update[:allocated_capacity] = allocated_capacity if allocated_capacity
85
+ job_update[:worker_type] = worker_type if worker_type
86
+ job_update[:number_of_workers] = number_of_workers if number_of_workers
87
+ job_update[:glue_version] = glue_version if glue_version
88
+
89
+ client.update_job(job_name: job_name, job_update: job_update)
90
+ get_job(job_name)
91
+ end
92
+
93
+ def self.delete_job(job_name)
94
+ DataDrain::Validations.validate_glue_name!(:job_name, job_name)
95
+ client.delete_job(job_name: job_name)
96
+ nil
97
+ end
98
+
99
+ def self.ensure_job(job_name, role_arn:, script_location:, command_name: "glueetl",
100
+ default_arguments: {}, description: nil, worker_type: nil,
101
+ number_of_workers: nil, timeout: 2880, max_retries: 0,
102
+ allocated_capacity: nil, glue_version: nil)
103
+ if job_exists?(job_name)
104
+ safe_log(:info, "glue_runner.job_exists", { job: job_name })
105
+ update_job(job_name, role_arn: role_arn, command_name: command_name,
106
+ script_location: script_location, default_arguments: default_arguments,
107
+ description: description, worker_type: worker_type,
108
+ number_of_workers: number_of_workers, timeout: timeout,
109
+ max_retries: max_retries, allocated_capacity: allocated_capacity,
110
+ glue_version: glue_version)
111
+ else
112
+ safe_log(:info, "glue_runner.job_created", { job: job_name })
113
+ create_job(job_name, role_arn: role_arn, script_location: script_location,
114
+ command_name: command_name, default_arguments: default_arguments,
115
+ description: description, worker_type: worker_type,
116
+ number_of_workers: number_of_workers, timeout: timeout,
117
+ max_retries: max_retries, allocated_capacity: allocated_capacity,
118
+ glue_version: glue_version)
119
+ end
120
+ end
121
+
22
122
  def self.run_and_wait(job_name, arguments = {}, polling_interval: 30, max_wait_seconds: nil)
23
123
  config = DataDrain.configuration
24
124
  config.validate!
25
- client = Aws::Glue::Client.new(region: config.aws_region)
26
125
  start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
27
126
 
28
127
  @logger = config.logger
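The new `self.client` accessor memoizes the Glue client and exposes a writer, which makes it possible to swap in a test double. The pattern can be sketched as follows. `Runner` and `build_client` are hypothetical names, and `Object.new` stands in for `Aws::Glue::Client.new(region: ...)`.

```ruby
# Hypothetical module showing the memoize-plus-writer pattern used for the client.
module Runner
  class << self
    attr_writer :client

    # Builds the client once and reuses it; tests can assign a double beforehand.
    def client
      @client ||= build_client
    end

    private

    # Placeholder for Aws::Glue::Client.new(region: ...); returns a plain object here.
    def build_client
      Object.new
    end
  end
end

fake = Struct.new(:name).new("fake-glue")
Runner.client = fake
puts Runner.client.equal?(fake) # prints "true"
```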
@@ -6,9 +6,17 @@ module DataDrain
6
6
  # Regex that validates SQL identifiers (tables, columns, etc.).
7
7
  # Allows letters, underscores, and digits (digits not in first position).
8
8
  IDENTIFIER_REGEX = /\A[a-zA-Z_][a-zA-Z0-9_]*\z/
9
+ GLUE_NAME_REGEX = /\A[a-zA-Z0-9][a-zA-Z0-9-]*\z/
9
10
 
10
11
  module_function
11
12
 
13
+ def validate_glue_name!(name, value)
14
+ return if GLUE_NAME_REGEX.match?(value.to_s)
15
+
16
+ raise DataDrain::ConfigurationError,
17
+ "#{name} '#{value}' no es un nombre válido para Glue Job (usa solo letras, números y guiones)"
18
+ end
19
+
12
20
  def validate_identifier!(name, value)
13
21
  return if IDENTIFIER_REGEX.match?(value.to_s)
14
22
 
@@ -2,5 +2,5 @@
2
2
 
3
3
  module DataDrain
4
4
  # @return [String] semver version of the gem
5
- VERSION = "0.3.1"
5
+ VERSION = "0.4.0"
6
6
  end
@@ -115,6 +115,14 @@ Catálogo completo de eventos KV emitidos por DataDrain. Formato Wispro-Observab
115
115
  **Level:** INFO. Emitted before `start_job_run`.
116
116
  **Fields:** `job`.
117
117
 
118
+ ### `glue_runner.job_exists`
119
+ **Level:** INFO. Emitted in `ensure_job` when the job already exists, just before it is updated.
120
+ **Fields:** `job`.
121
+
122
+ ### `glue_runner.job_created`
123
+ **Level:** INFO. Emitted in `ensure_job` when the job does not exist, just before it is created.
124
+ **Fields:** `job`.
125
+
118
126
  ### `glue_runner.polling`
119
127
  **Level:** INFO. Emitted on every status check while the Job has not finished.
120
128
  **Fields:** `job`, `run_id`, `status`, `next_check_in_s`.
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: data_drain
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.3.1
4
+ version: 0.4.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Gabriel
@@ -102,9 +102,12 @@ files:
102
102
  - docs/execution/archive/v0.2.0.md
103
103
  - docs/execution/archive/v0.3.0-OBSERVACIONES.md
104
104
  - docs/execution/archive/v0.3.0.md
105
+ - docs/execution/archive/v0.3.1-OBSERVACIONES.md
106
+ - docs/execution/archive/v0.3.1.md
105
107
  - docs/execution/v0.2.2.md
106
- - docs/execution/v0.3.1-OBSERVACIONES.md
107
- - docs/execution/v0.3.1.md
108
+ - docs/execution/v0.4.0-OBSERVACIONES.md
109
+ - docs/execution/v0.4.0.md
110
+ - docs/glue-jobs-lifecycle.md
108
111
  - docs/glue_pyspark_example.py
109
112
  - lib/data_drain.rb
110
113
  - lib/data_drain/configuration.rb
File without changes