data_drain 0.3.2 → 0.5.0

@@ -0,0 +1,330 @@
+ # Glue Jobs Lifecycle
+
+ Full management of AWS Glue Jobs from the gem.
+
+ ## Methods
+
+ ### `job_exists?(job_name)` → Boolean
+
+ Checks whether a job exists in Glue.
+
+ ```ruby
+ DataDrain::GlueRunner.job_exists?("my-job")
+ # => true
+ ```
+
+ - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
+ - Returns `false` if the job does not exist (`Aws::Glue::Errors::EntityNotFoundException` is rescued internally).
+ - Other AWS errors propagate without being rescued.
+
+ ### `get_job(job_name)` → Aws::Glue::Types::Job
+
+ Fetches the full configuration of a job.
+
+ ```ruby
+ job = DataDrain::GlueRunner.get_job("my-job")
+ job.name # => "my-job"
+ job.role # => "arn:aws:iam::123:role/GlueRole"
+ job.command # => { name: "glueetl", python_version: "3", script_location: "s3://..." }
+ job.default_arguments # => { "--extra-files" => "s3://..." }
+ ```
+
+ - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
+ - Raises `Aws::Glue::Errors::EntityNotFoundException` if the job does not exist.
+
+ ### Uploading local scripts (v0.5.0+)
+
+ Since v0.5.0 the gem can upload PySpark scripts to S3 automatically.
+
+ ```ruby
+ # Modern option: local script uploaded by the gem
+ DataDrain::GlueRunner.create_job(
+   "my-job",
+   script_path: "scripts/glue/export.py", # local
+   script_bucket: "my-bucket",
+   script_folder: "scripts",
+   role_arn: "arn:aws:iam::123:role/GlueRole"
+ )
+ # → Uploads scripts/glue/export.py to s3://my-bucket/scripts/export.py
+ # → Creates the job
+ ```
+
+ **Upload parameters:**
+ - `script_path` (String): local path to the Python script.
+ - `script_bucket` (String): destination S3 bucket. **Required when `script_path` is used.**
+ - `script_folder` (String): folder inside the bucket. Default: `"scripts"`.
+ - `script_filename` (String, nil): overrides the file name in S3. Default: the file's basename.
+
+ **`script_location` vs `script_path`:**
+ - `script_location:` → previous behavior, nothing is uploaded.
+ - `script_path:` + `script_bucket:` → the gem uploads to S3 first, then creates the Job.
+ - Passing both → `DataDrain::ConfigurationError`.
+ - Passing neither → `ArgumentError`.
+
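The rules above can be sketched in isolation as follows. This is a minimal illustration, not the gem's code: `resolve_script_source`, `ConfigError`, and the `uploader` callable are hypothetical names, and the fake uploader only builds the destination path instead of talking to S3.

```ruby
# Hypothetical stand-in for DataDrain::ConfigurationError.
class ConfigError < StandardError; end

# Returns the S3 location that would be registered in the Glue Job.
# `uploader` is any callable that uploads the file and returns its s3:// path.
def resolve_script_source(script_location: nil, script_path: nil, script_bucket: nil,
                          script_folder: "scripts", uploader:)
  raise ConfigError, "pass script_location or script_path, not both" if script_location && script_path
  return script_location if script_location
  raise ArgumentError, "script_location or script_path is required" unless script_path
  raise ConfigError, "script_path requires script_bucket" unless script_bucket

  uploader.call(script_path, script_bucket, script_folder)
end

# Fake uploader (assumption): computes the destination without touching S3.
fake_uploader = lambda do |path, bucket, folder|
  "s3://#{bucket}/#{folder.chomp('/')}/#{File.basename(path)}"
end

puts resolve_script_source(script_path: "scripts/glue/export.py",
                           script_bucket: "my-bucket", uploader: fake_uploader)
# prints: s3://my-bucket/scripts/export.py
```

Injecting the uploader keeps the branching testable without AWS credentials; the order of the checks mirrors the list above (both set, location set, neither set, missing bucket).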
+ **Important:** the upload **overwrites** any existing file at the same path, so
+ repeated calls are not strictly idempotent. Use `script_filename:` with a hash or
+ timestamp if you need versioning.
+
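One way to get a versioned name is to derive it from the file contents. The helper below is a hypothetical sketch (the name `versioned_script_filename` and the 12-character digest length are assumptions, not part of the gem); its result could be passed as `script_filename:`.

```ruby
require "digest"
require "tempfile"

# Builds "<basename>-<first 12 hex chars of the SHA-256><ext>" for a local file,
# so a changed script gets a new S3 key instead of overwriting the old one.
def versioned_script_filename(local_path)
  digest = Digest::SHA256.file(local_path).hexdigest[0, 12]
  ext = File.extname(local_path)
  "#{File.basename(local_path, ext)}-#{digest}#{ext}"
end

# Demo with a temporary file standing in for a real PySpark script.
Tempfile.create(["export", ".py"]) do |file|
  file.write("print('hello')\n")
  file.flush
  puts versioned_script_filename(file.path)
end
```

A content hash gives the same name for identical content, so re-uploading an unchanged script stays harmless, while a timestamp would produce a new object on every call.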
+ ### Concurrency (known limitation)
+
+ There is no distributed lock. If two processes call `upload_script` with the same
+ destination at the same time, the last `put_object` to reach S3 wins. For PySpark
+ scripts this is typically low risk (scripts are small, and concurrent writes to the
+ same path are rare).
+
+ ### Minimum IAM permissions
+
+ The IAM role/user that runs `upload_script` needs:
+
+ ```json
+ {
+   "Effect": "Allow",
+   "Action": ["s3:PutObject"],
+   "Resource": "arn:aws:s3:::my-bucket/scripts/*"
+ }
+ ```
+
+ Using it with `create_job`/`ensure_job` also requires the Glue permissions
+ (see the "Permisos Glue" section at the beginning of this document), plus a permission
+ that lets the Glue Job's IAM role read the script:
+
+ ```json
+ {
+   "Effect": "Allow",
+   "Action": ["s3:GetObject"],
+   "Resource": "arn:aws:s3:::my-bucket/scripts/*"
+ }
+ ```
+
+ (This last statement goes on the Glue Job's role, not on the Ruby application's role.)
+
+ ### Standalone API: `upload_script`
+
+ For cases where you only want to upload (without creating a Job):
+
+ ```ruby
+ s3_path = DataDrain::GlueRunner.upload_script(
+   local_path: "scripts/glue/export.py",
+   bucket: "my-bucket",
+   folder: "scripts"
+ )
+ # => "s3://my-bucket/scripts/export.py"
+ ```
+
+ Requires `storage_mode = :s3`; otherwise it raises `DataDrain::ConfigurationError`.
+
+ ### `create_job(job_name, role_arn:, ...)` → Aws::Glue::Types::Job
+
+ Creates a new job in Glue and returns the created job.
+
+ **Required parameters:**
+ - `job_name` (String): job name
+ - `role_arn` (String): ARN of the Glue IAM role
+
+ **Script parameters (mutually exclusive):**
+ - `script_location` (String): S3 path of the Python script (previous behavior)
+ - `script_path` + `script_bucket` (String): uploads the local file to S3 first (v0.5.0+)
+
+ **Optional parameters:**
+ - `script_folder` (String): S3 folder. Default: `"scripts"`.
+ - `script_filename` (String, nil): overrides the file name in S3.
+ - `command_name` (String): command name (`"glueetl"`, `"pythonshell"`). Default: `"glueetl"`.
+ - `default_arguments` (Hash): default arguments for the job
+ - `description` (String): job description
+ - `timeout` (Integer): timeout in minutes. Default: `2880` (48h)
+ - `max_retries` (Integer): retries. Default: `0`
+ - `allocated_capacity` (Integer): legacy DPU setting. Prefer `worker_type` + `number_of_workers`
+ - `worker_type` (String): `"Standard"`, `"G.1X"`, `"G.2X"`, `"G.4X"`, `"G.8X"`
+ - `number_of_workers` (Integer): number of workers (requires `worker_type`)
+ - `glue_version` (String): Glue version (e.g. `"4.0"`)
+
+ ```ruby
+ job = DataDrain::GlueRunner.create_job(
+   "my-job",
+   role_arn: "arn:aws:iam::123:role/GlueServiceRole",
+   script_location: "s3://my-bucket/scripts/export.py",
+   default_arguments: { "--extra-files" => "s3://my-bucket/scripts/udf.py" },
+   timeout: 1440,
+   max_retries: 2,
+   worker_type: "G.1X",
+   number_of_workers: 10
+ )
+ ```
+
+ - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
+ - Re-raises AWS errors after logging them (duplicate name, invalid role, etc.)
+
+ ### `update_job(job_name, ...)` → Aws::Glue::Types::Job
+
+ Updates an existing job and returns the updated job.
+
+ Accepts `role_arn` and the same job settings as `create_job` (not the `script_path` upload options), all optional. Only the parameters you pass are sent in the update, and the command (`command_name` + `script_location`) is only sent when both are given.
+
+ ```ruby
+ job = DataDrain::GlueRunner.update_job(
+   "my-job",
+   command_name: "glueetl", script_location: "s3://my-bucket/scripts/export-v2.py",
+   timeout: 720
+ )
+ ```
+
+ - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
+ - Raises `Aws::Glue::Errors::EntityNotFoundException` if the job does not exist.
+
+ ### `delete_job(job_name)` → Boolean
+
+ Deletes a job from Glue. It is idempotent.
+
+ ```ruby
+ DataDrain::GlueRunner.delete_job("my-job")
+ # => true (the job existed and was deleted)
+
+ DataDrain::GlueRunner.delete_job("nonexistent")
+ # => false (the job did not exist)
+ ```
+
+ - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
+ - Other AWS errors are logged and re-raised.
+
+ ### `ensure_job(job_name, role_arn:, ...)` → Aws::Glue::Types::Job
+
+ Creates or updates a job idempotently by diffing the desired configuration against the current one.
+
+ - If the job does not exist → `create_job`
+ - If the job exists with a different config → `update_job`
+ - If the job exists with an identical config → no-op; returns the current job and logs `glue_runner.job_unchanged`
+
+ ```ruby
+ job = DataDrain::GlueRunner.ensure_job(
+   "my-job",
+   role_arn: "arn:aws:iam::123:role/GlueServiceRole",
+   script_location: "s3://my-bucket/scripts/export.py",
+   timeout: 1440
+ )
+ ```
+
+ It also supports uploading a local script (v0.5.0+):
+
+ ```ruby
+ job = DataDrain::GlueRunner.ensure_job(
+   "my-job",
+   script_path: "scripts/glue/export.py",
+   script_bucket: "my-bucket",
+   script_folder: "scripts",
+   role_arn: "arn:aws:iam::123:role/GlueServiceRole",
+   timeout: 1440
+ )
+ ```
+
+ - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
+ - AWS errors propagate to the caller.
+
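The create / update / no-op decision can be illustrated with a small self-contained sketch. `FakeJob`, `differing_fields`, and `plan_for` are hypothetical names used only for this example; `FakeJob` stands in for the job object returned by Glue and compares only a handful of fields.

```ruby
# Hypothetical stand-in for the job object returned by Glue.
FakeJob = Struct.new(:role, :timeout, :max_retries, :worker_type, :number_of_workers,
                     keyword_init: true)

# Returns the names of the fields whose desired value differs from the current job.
def differing_fields(desired, current)
  desired.keys.reject { |field| current[field] == desired[field] }
end

# Decides which action an ensure-style call would take.
def plan_for(desired, current)
  return :create if current.nil?

  differing_fields(desired, current).empty? ? :unchanged : :update
end

current = FakeJob.new(role: "arn:aws:iam::123:role/GlueServiceRole", timeout: 2880,
                      max_retries: 0, worker_type: "G.1X", number_of_workers: 10)
desired = { role: "arn:aws:iam::123:role/GlueServiceRole", timeout: 1440,
            max_retries: 0, worker_type: "G.1X", number_of_workers: 10 }

p differing_fields(desired, current) # => [:timeout]
p plan_for(desired, current)         # => :update
p plan_for(desired, nil)             # => :create
```

Comparing field by field means a value the caller leaves at its default still takes part in the diff, so defaults that differ from what Glue stores will show up as changes.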
+ ### `run_and_wait(job_name, arguments = {}, ...)` → Boolean
+
+ Runs an existing job and waits for it to complete.
+
+ ```ruby
+ DataDrain::GlueRunner.run_and_wait(
+   "my-job",
+   { "--start_date" => "2025-01-01", "--end_date" => "2025-02-01" },
+   polling_interval: 60,
+   max_wait_seconds: 7200
+ )
+ # => true (SUCCEEDED)
+ ```
+
+ - Raises `RuntimeError` if the job fails (`FAILED`, `STOPPED`, `TIMEOUT`).
+ - Raises `DataDrain::Error` if `max_wait_seconds` is exceeded.
+
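A polling loop with these semantics could look roughly like the sketch below. It is not the gem's implementation: `wait_for_run`, `WaitTimeout`, and the `fetch_state` callable are hypothetical, and the run state comes from an in-memory enumerator rather than from Glue.

```ruby
# Hypothetical error class standing in for DataDrain::Error.
class WaitTimeout < StandardError; end

TERMINAL_FAILURES = %w[FAILED STOPPED TIMEOUT].freeze

# Polls `fetch_state` (any callable returning a run state string) until the run
# succeeds, ends in a failure state, or `max_wait_seconds` elapses.
def wait_for_run(fetch_state, polling_interval:, max_wait_seconds: nil)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  loop do
    state = fetch_state.call
    return true if state == "SUCCEEDED"
    raise "Glue run ended in state #{state}" if TERMINAL_FAILURES.include?(state)

    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    raise WaitTimeout, "exceeded #{max_wait_seconds}s" if max_wait_seconds && elapsed > max_wait_seconds

    sleep(polling_interval)
  end
end

# Simulated run: two RUNNING polls, then SUCCEEDED.
states = %w[RUNNING RUNNING SUCCEEDED].each
puts wait_for_run(-> { states.next }, polling_interval: 0, max_wait_seconds: 60)
# prints: true
```

The monotonic clock is used for the elapsed time so that wall-clock adjustments do not shorten or extend the wait.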
+ ## Full pattern: ensure_job + run_and_wait + PySpark
+
+ End-to-end workflow for archiving and purging PostgreSQL tables using AWS Glue:
+
+ ```ruby
+ # 1. Make sure the Glue Job exists with the desired config (idempotent)
+ DataDrain::GlueRunner.ensure_job(
+   "my-export-job",
+   role_arn: "arn:aws:iam::123:role/GlueServiceRole",
+   script_location: "s3://my-bucket/scripts/glue_pyspark_export.py",
+   glue_version: "4.0",
+   worker_type: "G.1X",
+   number_of_workers: 10,
+   timeout: 1440
+ )
+
+ # 2. Run the export (delegated to distributed Spark on Glue)
+ DataDrain::GlueRunner.run_and_wait(
+   "my-export-job",
+   {
+     "--start_date" => start_date.to_fs(:db),
+     "--end_date" => end_date.to_fs(:db),
+     "--s3_bucket" => bucket,
+     "--s3_folder" => table,
+     "--db_url" => "jdbc:postgresql://#{host}:#{port}/#{db}",
+     "--db_user" => db_user,
+     "--db_password" => db_password,
+     "--db_table" => table,
+     "--partition_by" => partition_keys.join(",")
+   },
+   polling_interval: 60,
+   max_wait_seconds: 7200
+ )
+
+ # 3. Verify integrity and purge Postgres (DataDrain only reads the Parquet output)
+ DataDrain::Engine.new(
+   bucket: bucket,
+   folder_name: table,
+   start_date: start_date,
+   end_date: end_date,
+   table_name: table,
+   partition_keys: partition_keys,
+   skip_export: true # the export was already done by Glue
+ ).call
+ ```
+
+ ### Prerequisites
+
+ 1. **Upload the script to S3:**
+    ```bash
+    aws s3 cp glue_pyspark_export.py s3://my-bucket/scripts/
+    ```
+
+ 2. **IAM Role** with permissions for: Glue, S3 (read access to the script + write access to the destination bucket), RDS/Postgres (via JDBC)
+
+ 3. **PySpark script** at `s3://my-bucket/scripts/glue_pyspark_export.py` (see the [example](../glue_pyspark_example.py))
+
+ ## Naming conventions
+
+ AWS Glue allows letters (`a-zA-Z`), digits (`0-9`), hyphens (`-`), and underscores (`_`). Spaces and special characters are not allowed; as the example below shows, a name starting with a hyphen is also rejected.
+
+ ```ruby
+ # Valid
+ DataDrain::GlueRunner.job_exists?("my-export-job-v2")
+ DataDrain::GlueRunner.job_exists?("my_export_job")
+
+ # Invalid: raises ConfigurationError
+ DataDrain::GlueRunner.job_exists?("-starts-with-dash")
+ # DataDrain::ConfigurationError: job_name '-starts-with-dash' no es un nombre válido para Glue Job
+ ```
+
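The exact rule applied by `DataDrain::Validations.validate_glue_name!` is not shown here. As a rough sketch under that caveat, a check consistent with the examples above might look like this; `GLUE_NAME_PATTERN` and `valid_glue_name?` are hypothetical, and the pattern (alphanumeric first character, then letters, digits, `-` or `_`) is an assumption.

```ruby
# Assumed pattern: first character alphanumeric, then letters, digits, "-" or "_".
GLUE_NAME_PATTERN = /\A[a-zA-Z0-9][a-zA-Z0-9_-]*\z/

# Returns true when `name` is a String matching the assumed pattern.
def valid_glue_name?(name)
  name.is_a?(String) && GLUE_NAME_PATTERN.match?(name)
end

p valid_glue_name?("my-export-job-v2")  # => true
p valid_glue_name?("my_export_job")     # => true
p valid_glue_name?("-starts-with-dash") # => false
p valid_glue_name?("has space")         # => false
```

Anchoring with `\A` and `\z` rather than `^` and `$` matters in Ruby, since the latter match at line boundaries and would accept a name containing a newline.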
+ ## Telemetry events
+
+ | Event | Level | Description |
+ |-------|-------|-------------|
+ | `glue_runner.start` | INFO | Before `start_job_run` |
+ | `glue_runner.job_create` | INFO | Job created successfully |
+ | `glue_runner.job_update` | INFO | Job updated (includes `changed_fields`) |
+ | `glue_runner.job_delete` | INFO | Job deleted successfully |
+ | `glue_runner.job_delete_skipped` | INFO | `delete_job` called on a nonexistent job |
+ | `glue_runner.job_exists` | INFO | Job found in `ensure_job` (and it differs) |
+ | `glue_runner.job_created` | INFO | Job missing in `ensure_job`; logged before calling `create_job` |
+ | `glue_runner.job_unchanged` | INFO | Job exists with identical config in `ensure_job` |
+ | `glue_runner.job_create_error` | ERROR | Error in `create_job` |
+ | `glue_runner.job_update_error` | ERROR | Error in `update_job` |
+ | `glue_runner.job_delete_error` | ERROR | Error in `delete_job` |
+ | `glue_runner.script_uploaded` | INFO | Script uploaded to S3 (v0.5.0+) |
+ | `glue_runner.script_upload_error` | ERROR | Error uploading the script to S3 (v0.5.0+) |
+ | `glue_runner.polling` | INFO | Status check during `run_and_wait` |
+ | `glue_runner.complete` | INFO | Job finished with `SUCCEEDED` |
+ | `glue_runner.failed` | ERROR | Job failed with `FAILED\|STOPPED\|TIMEOUT` |
+ | `glue_runner.timeout` | ERROR | `max_wait_seconds` exceeded |
@@ -1,11 +1,29 @@
  """
  AWS Glue (PySpark) script compatible with DataDrain::GlueRunner.
 
- Create the Job in the AWS Glue console (Spark 4.0+) and use this script as a base.
- Required arguments: JOB_NAME, start_date, end_date, s3_bucket, s3_folder,
+ To create the Glue Job programmatically (instead of through the console):
+
+ # Modern option: local script uploaded by the gem (v0.5.0+)
+ DataDrain::GlueRunner.ensure_job(
+   "my-export-job",
+   script_path: "docs/glue_pyspark_example.py",
+   script_bucket: "my-bucket",
+   script_folder: "scripts",
+   role_arn: "arn:aws:iam::123:role/GlueServiceRole",
+   worker_type: "G.1X",
+   number_of_workers: 10,
+   timeout: 1440
+ )
+ # -> Uploads this file to s3://my-bucket/scripts/glue_pyspark_example.py
+ # -> Creates the Job pointing at that path
+
+ # Run it
+ DataDrain::GlueRunner.run_and_wait("my-export-job", { "--start_date" => "2025-01-01", ... })
+
+ Required job arguments: JOB_NAME, start_date, end_date, s3_bucket, s3_folder,
  db_url, db_user, db_password, db_table, partition_by.
 
- Customize the derived-columns section according to each table's partition_keys.
+ Customize the derived-columns section according to each table's partition_keys.
  """
 
  import sys
@@ -15,27 +33,38 @@ from awsglue.context import GlueContext
  from awsglue.job import Job
  from pyspark.sql.functions import col, year, month
 
- args = getResolvedOptions(sys.argv, [
-     'JOB_NAME', 'start_date', 'end_date', 's3_bucket', 's3_folder',
-     'db_url', 'db_user', 'db_password', 'db_table', 'partition_by'
- ])
+ args = getResolvedOptions(
+     sys.argv,
+     [
+         "JOB_NAME",
+         "start_date",
+         "end_date",
+         "s3_bucket",
+         "s3_folder",
+         "db_url",
+         "db_user",
+         "db_password",
+         "db_table",
+         "partition_by",
+     ],
+ )
 
  sc = SparkContext()
  glueContext = GlueContext(sc)
  spark = glueContext.spark_session
  job = Job(glueContext)
- job.init(args['JOB_NAME'], args)
+ job.init(args["JOB_NAME"], args)
 
  options = {
-     "url": args['db_url'],
-     "dbtable": args['db_table'],
-     "user": args['db_user'],
-     "password": args['db_password'],
+     "url": args["db_url"],
+     "dbtable": args["db_table"],
+     "user": args["db_user"],
+     "password": args["db_password"],
      "sampleQuery": (
          f"SELECT * FROM {args['db_table']} "
          f"WHERE created_at >= '{args['start_date']}' "
          f"AND created_at < '{args['end_date']}'"
-     )
+     ),
  }
 
  df = spark.read.format("jdbc").options(**options).load()
@@ -43,18 +72,19 @@ df = spark.read.format("jdbc").options(**options).load()
  # Add the derived columns needed for the partitions.
  # isp_id already exists in the source table; only add the computed ones.
  # Customize this section according to each table's partition_keys.
- df_final = (
-     df.withColumn("year", year(col("created_at")))
-     .withColumn("month", month(col("created_at")))
+ df_final = df.withColumn("year", year(col("created_at"))).withColumn(
+     "month", month(col("created_at"))
  )
 
  output_path = f"s3://{args['s3_bucket']}/{args['s3_folder']}/"
- partitions = args['partition_by'].split(",")
+ partitions = args["partition_by"].split(",")
 
- (df_final.write.mode("overwrite")
+ (
+     df_final.write.mode("overwrite")
      .partitionBy(*partitions)
      .format("parquet")
      .option("compression", "zstd")
-     .save(output_path))
+     .save(output_path)
+ )
 
  job.commit()
@@ -19,10 +19,245 @@ module DataDrain
  # @return [Boolean] true if the Job finished successfully (SUCCEEDED).
  # @raise [DataDrain::Error] if max_wait_seconds is exceeded before SUCCEEDED.
  # @raise [RuntimeError] if the Job fails or is stopped.
+ def self.client
+   @client ||= Aws::Glue::Client.new(region: DataDrain.configuration.aws_region)
+ end
+
+ class << self
+   attr_writer :client
+ end
+
+ def self.job_exists?(job_name)
+   DataDrain::Validations.validate_glue_name!(:job_name, job_name)
+   get_job(job_name)
+   true
+ rescue Aws::Glue::Errors::EntityNotFoundException
+   false
+ end
+
+ def self.get_job(job_name)
+   DataDrain::Validations.validate_glue_name!(:job_name, job_name)
+   client.get_job(job_name: job_name).job
+ end
+
+ def self.create_job(job_name, role_arn:, script_location: nil, script_path: nil,
+                     script_bucket: nil, script_folder: "scripts", script_filename: nil,
+                     command_name: "glueetl", default_arguments: {}, description: nil,
+                     worker_type: nil, number_of_workers: nil, timeout: 2880,
+                     max_retries: 0, allocated_capacity: nil, glue_version: nil)
+   @logger = DataDrain.configuration.logger
+   DataDrain::Validations.validate_glue_name!(:job_name, job_name)
+
+   final_script_location = resolve_script_location(
+     script_location: script_location,
+     script_path: script_path,
+     script_bucket: script_bucket,
+     script_folder: script_folder,
+     script_filename: script_filename
+   )
+
+   opts = {
+     name: job_name,
+     role: role_arn,
+     command: {
+       name: command_name,
+       python_version: "3",
+       script_location: final_script_location
+     }
+   }
+   opts[:default_arguments] = default_arguments unless default_arguments.empty?
+   opts[:description] = description if description
+   opts[:timeout] = timeout if timeout
+   opts[:max_retries] = max_retries if max_retries
+   opts[:allocated_capacity] = allocated_capacity if allocated_capacity
+   opts[:worker_type] = worker_type if worker_type
+   opts[:number_of_workers] = number_of_workers if number_of_workers
+   opts[:glue_version] = glue_version if glue_version
+
+   client.create_job(**opts)
+   safe_log(:info, "glue_runner.job_create", {
+     job: job_name,
+     glue_version: glue_version,
+     worker_type: worker_type,
+     number_of_workers: number_of_workers
+   })
+   get_job(job_name)
+ rescue Aws::Glue::Errors::ServiceError => e
+   safe_log(:error, "glue_runner.job_create_error",
+            { job: job_name }.merge(exception_metadata(e)))
+   raise
+ end
+
+ def self.update_job(job_name, role_arn: nil, command_name: nil, script_location: nil,
+                     default_arguments: nil, description: nil, worker_type: nil,
+                     number_of_workers: nil, timeout: nil, max_retries: nil, allocated_capacity: nil,
+                     glue_version: nil)
+   @logger = DataDrain.configuration.logger
+   DataDrain::Validations.validate_glue_name!(:job_name, job_name)
+   job_update = {}
+   job_update[:role] = role_arn if role_arn
+   if command_name && script_location
+     job_update[:command] =
+       { name: command_name, python_version: "3", script_location: script_location }
+   end
+   job_update[:default_arguments] = default_arguments if default_arguments
+   job_update[:description] = description if description
+   job_update[:timeout] = timeout if timeout
+   job_update[:max_retries] = max_retries if max_retries
+   job_update[:allocated_capacity] = allocated_capacity if allocated_capacity
+   job_update[:worker_type] = worker_type if worker_type
+   job_update[:number_of_workers] = number_of_workers if number_of_workers
+   job_update[:glue_version] = glue_version if glue_version
+
+   client.update_job(job_name: job_name, job_update: job_update)
+   safe_log(:info, "glue_runner.job_update", {
+     job: job_name,
+     changed_fields: job_update.keys.map(&:to_s)
+   })
+   get_job(job_name)
+ rescue Aws::Glue::Errors::ServiceError => e
+   safe_log(:error, "glue_runner.job_update_error",
+            { job: job_name }.merge(exception_metadata(e)))
+   raise
+ end
+
+ def self.delete_job(job_name)
+   @logger = DataDrain.configuration.logger
+   DataDrain::Validations.validate_glue_name!(:job_name, job_name)
+   client.delete_job(job_name: job_name)
+   safe_log(:info, "glue_runner.job_delete", { job: job_name })
+   true
+ rescue Aws::Glue::Errors::EntityNotFoundException
+   safe_log(:info, "glue_runner.job_delete_skipped", { job: job_name, reason: "not_found" })
+   false
+ rescue Aws::Glue::Errors::ServiceError => e
+   safe_log(:error, "glue_runner.job_delete_error",
+            { job: job_name }.merge(exception_metadata(e)))
+   raise
+ end
+
+ def self.ensure_job(job_name, role_arn:, script_location: nil, script_path: nil,
+                     script_bucket: nil, script_folder: "scripts", script_filename: nil,
+                     command_name: "glueetl", default_arguments: {}, description: nil,
+                     worker_type: nil, number_of_workers: nil, timeout: 2880,
+                     max_retries: 0, allocated_capacity: nil, glue_version: nil)
+   @logger = DataDrain.configuration.logger
+
+   final_script_location = resolve_script_location(
+     script_location: script_location,
+     script_path: script_path,
+     script_bucket: script_bucket,
+     script_folder: script_folder,
+     script_filename: script_filename
+   )
+
+   if job_exists?(job_name)
+     current = get_job(job_name)
+     desired = {
+       role: role_arn,
+       command_name: command_name,
+       script_location: final_script_location,
+       default_arguments: default_arguments,
+       description: description,
+       worker_type: worker_type,
+       number_of_workers: number_of_workers,
+       timeout: timeout,
+       max_retries: max_retries,
+       glue_version: glue_version
+     }
+     changed = changed_fields(desired, current)
+     if changed.empty?
+       safe_log(:info, "glue_runner.job_unchanged", { job: job_name })
+       current
+     else
+       safe_log(:info, "glue_runner.job_exists", { job: job_name })
+       update_job(job_name, role_arn: role_arn, command_name: command_name,
+                  script_location: final_script_location, default_arguments: default_arguments,
+                  description: description, worker_type: worker_type,
+                  number_of_workers: number_of_workers, timeout: timeout,
+                  max_retries: max_retries, allocated_capacity: allocated_capacity,
+                  glue_version: glue_version)
+     end
+   else
+     safe_log(:info, "glue_runner.job_created", { job: job_name })
+     create_job(job_name, role_arn: role_arn, script_location: final_script_location,
+                command_name: command_name, default_arguments: default_arguments,
+                description: description, worker_type: worker_type,
+                number_of_workers: number_of_workers, timeout: timeout,
+                max_retries: max_retries, allocated_capacity: allocated_capacity,
+                glue_version: glue_version)
+   end
+ end
+
+ def self.changed_fields(desired, current)
+   changed = []
+   changed << :role if current.role != desired[:role]
+   changed << :command if current.command.name != desired[:command_name] ||
+                          current.command.script_location != desired[:script_location]
+   changed << :default_arguments if current.default_arguments != desired[:default_arguments]
+   changed << :description if current.description != desired[:description]
+   changed << :worker_type if current.worker_type != desired[:worker_type]
+   changed << :number_of_workers if current.number_of_workers != desired[:number_of_workers]
+   changed << :timeout if current.timeout != desired[:timeout]
+   changed << :max_retries if current.max_retries != desired[:max_retries]
+   changed << :glue_version if current.glue_version != desired[:glue_version]
+   changed
+ end
+ private_class_method :changed_fields
+
+ def self.resolve_script_location(script_location:, script_path:, script_bucket:, script_folder:, script_filename:)
+   both_set = script_location && script_path
+   raise DataDrain::ConfigurationError, "provee script_location o script_path, no ambos" if both_set
+
+   return script_location if script_location
+   raise ArgumentError, "script_location o script_path es requerido" unless script_path
+   raise DataDrain::ConfigurationError, "script_path requiere script_bucket" unless script_bucket
+
+   upload_script(
+     local_path: script_path,
+     bucket: script_bucket,
+     folder: script_folder,
+     filename: script_filename
+   )
+ end
+ private_class_method :resolve_script_location
+
+ def self.upload_script(local_path:, bucket:, folder: "scripts", filename: nil)
+   @logger = DataDrain.configuration.logger
+
+   unless File.exist?(local_path)
+     raise DataDrain::ConfigurationError,
+           "Script local '#{local_path}' no existe"
+   end
+
+   actual_filename = filename || File.basename(local_path)
+   s3_key = "#{folder.chomp("/")}/#{actual_filename}"
+   bytes = File.size(local_path)
+
+   adapter = DataDrain::Storage.adapter
+   unless adapter.is_a?(DataDrain::Storage::S3)
+     raise DataDrain::ConfigurationError,
+           "upload_script requiere storage_mode = :s3, actual: #{DataDrain.configuration.storage_mode}"
+   end
+
+   s3_path = adapter.upload_file(local_path, bucket, s3_key, content_type: "text/x-python")
+
+   safe_log(:info, "glue_runner.script_uploaded", {
+     local_path: local_path,
+     s3_path: s3_path,
+     bytes: bytes
+   })
+
+   s3_path
+ rescue Aws::S3::Errors::ServiceError => e
+   safe_log(:error, "glue_runner.script_upload_error",
+            { local_path: local_path, bucket: bucket }.merge(exception_metadata(e)))
+   raise
+ end
+
  def self.run_and_wait(job_name, arguments = {}, polling_interval: 30, max_wait_seconds: nil)
    config = DataDrain.configuration
    config.validate!
-   client = Aws::Glue::Client.new(region: config.aws_region)
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
 
    @logger = config.logger