data_drain 0.4.0 → 0.5.0

@@ -32,16 +32,102 @@ job.default_arguments # => { "--extra-files" => "s3://..." }
32
32
  - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
33
33
  - Raises `Aws::Glue::Errors::EntityNotFoundException` if the job does not exist.
34
34
 
35
- ### `create_job(job_name, role_arn:, script_location:, ...)` → Aws::Glue::Types::Job
35
+ ### Uploading local scripts (v0.5.0+)
36
+
37
+ Since v0.5.0 the gem can upload PySpark scripts to S3 automatically.
38
+
39
+ ```ruby
40
+ # Modern option: local script uploaded by the gem
41
+ DataDrain::GlueRunner.create_job(
42
+ "my-job",
43
+ script_path: "scripts/glue/export.py", # local
44
+ script_bucket: "my-bucket",
45
+ script_folder: "scripts",
46
+ role_arn: "arn:aws:iam::123:role/GlueRole"
47
+ )
48
+ # → Uploads scripts/glue/export.py to s3://my-bucket/scripts/export.py
49
+ # → Creates the job
50
+ ```
51
+
52
+ **Upload parameters:**
53
+ - `script_path` (String): local path to the Python script.
54
+ - `script_bucket` (String): destination S3 bucket. **Required when `script_path` is used.**
55
+ - `script_folder` (String): folder inside the bucket. Default: `"scripts"`.
56
+ - `script_filename` (String, nil): overrides the object name in S3. Default: the file's basename.
57
+
58
+ **`script_location` vs `script_path`:**
59
+ - `script_location:` → previous behavior, no upload.
60
+ - `script_path:` + `script_bucket:` → the gem uploads to S3 first, then creates the Job.
61
+ - If both are passed → `DataDrain::ConfigurationError`.
62
+ - If neither is passed → `ArgumentError`.
63
+
64
+ **Important:** the upload **overwrites** any existing file at the same path.
65
+ It is not idempotent in the strict sense. Pass `script_filename:` with a hash or timestamp
66
+ if you need versioning.
67
+
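One way to follow the versioning advice above is to derive the S3 object name from the script's content. The sketch below is illustrative only: `versioned_script_filename` is a hypothetical helper, not part of the gem, and the 12-character digest length is an arbitrary assumption. Its return value could be passed as `script_filename:`.

```ruby
require "digest"
require "tempfile"

# Hypothetical helper (not part of data_drain): builds
# "<basename>-<first N hex chars of the SHA-256 of the file><ext>".
def versioned_script_filename(local_path, digest_length: 12)
  ext = File.extname(local_path)
  base = File.basename(local_path, ext)
  digest = Digest::SHA256.file(local_path).hexdigest[0, digest_length]
  "#{base}-#{digest}#{ext}"
end

# Usage with a throwaway file standing in for a real PySpark script.
Tempfile.create(["export", ".py"]) do |file|
  file.write("print('hello')\n")
  file.flush
  puts versioned_script_filename(file.path)
end
```

Because the name changes only when the content changes, re-uploading an unmodified script writes the same key, while an edited script lands under a new key and leaves the previous object in place.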
68
+ ### Concurrency (known limitation)
69
+
70
+ There is no distributed lock. If two processes call `upload_script` with the same destination
71
+ at the same time, the last `put_object` to reach S3 wins. For PySpark scripts
72
+ this is typically low risk (scripts are small and concurrent writes to the same
73
+ path are rare).
74
+
75
+ ### Minimum IAM permissions
76
+
77
+ The IAM role/user that runs `upload_script` needs:
78
+
79
+ ```json
80
+ {
81
+ "Effect": "Allow",
82
+ "Action": ["s3:PutObject"],
83
+ "Resource": "arn:aws:s3:::my-bucket/scripts/*"
84
+ }
85
+ ```
86
+
87
+ Using it with `create_job`/`ensure_job` also requires the Glue permissions
88
+ (see the "Permisos Glue" section at the top of this document) plus permission for the
89
+ Glue Job's IAM role to read the script:
90
+
91
+ ```json
92
+ {
93
+ "Effect": "Allow",
94
+ "Action": ["s3:GetObject"],
95
+ "Resource": "arn:aws:s3:::my-bucket/scripts/*"
96
+ }
97
+ ```
98
+
99
+ (The latter goes on the Glue Job's role, not on the Ruby application's role.)
100
+
101
+ ### Standalone API: `upload_script`
102
+
103
+ For cases where you only want to upload (without creating a Job):
104
+
105
+ ```ruby
106
+ s3_path = DataDrain::GlueRunner.upload_script(
107
+ local_path: "scripts/glue/export.py",
108
+ bucket: "my-bucket",
109
+ folder: "scripts"
110
+ )
111
+ # => "s3://my-bucket/scripts/export.py"
112
+ ```
113
+
114
+ Requires `storage_mode = :s3`.
115
+
116
+ ### `create_job(job_name, role_arn:, ...)` → Aws::Glue::Types::Job
36
117
 
37
118
  Creates a new job in Glue and returns the created job.
38
119
 
39
120
  **Required parameters:**
40
121
  - `job_name` (String): job name
41
122
  - `role_arn` (String): ARN of the Glue IAM role
42
- - `script_location` (String): S3 path of the Python script
123
+
124
+ **Script parameters (mutually exclusive):**
125
+ - `script_location` (String): S3 path of the Python script (previous behavior)
126
+ - `script_path` + `script_bucket` (String): uploads the local file to S3 first (v0.5.0+)
43
127
 
44
128
  **Optional parameters:**
129
+ - `script_folder` (String): S3 folder. Default: `"scripts"`.
130
+ - `script_filename` (String, nil): overrides the object name in S3.
45
131
  - `command_name` (String): command name (`"glueetl"`, `"pythonshell"`). Default: `"glueetl"`.
46
132
  - `default_arguments` (Hash): default arguments for the job
47
133
  - `description` (String): job description
@@ -85,24 +171,28 @@ job = DataDrain::GlueRunner.update_job(
85
171
  - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
86
172
  - Raises `Aws::Glue::Errors::EntityNotFoundException` if the job does not exist.
87
173
 
88
- ### `delete_job(job_name)` → nil
174
+ ### `delete_job(job_name)` → Boolean
89
175
 
90
- Deletes a job from Glue.
176
+ Deletes a job from Glue. It is idempotent.
91
177
 
92
178
  ```ruby
93
179
  DataDrain::GlueRunner.delete_job("my-job")
94
- # => nil
180
+ # => true (the job existed and was deleted)
181
+
182
+ DataDrain::GlueRunner.delete_job("nonexistent")
183
+ # => false (the job did not exist)
95
184
  ```
96
185
 
97
186
  - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
98
- - Raises `Aws::Glue::Errors::EntityNotFoundException` if the job does not exist.
187
+ - Other AWS errors propagate to the caller.
99
188
 
100
- ### `ensure_job(job_name, role_arn:, script_location:, ...)` → Aws::Glue::Types::Job
189
+ ### `ensure_job(job_name, role_arn:, ...)` → Aws::Glue::Types::Job
101
190
 
102
- Creates or updates a job idempotently.
191
+ Creates or updates a job idempotently, diffing the configuration first.
103
192
 
104
- - If the job exists → `update_job`
105
193
  - If the job does not exist → `create_job`
194
+ - If the job exists with a different config → `update_job`
195
+ - If the job exists with an identical config → no-op; returns the current job and logs `glue_runner.job_unchanged`
106
196
 
107
197
  ```ruby
108
198
  job = DataDrain::GlueRunner.ensure_job(
@@ -113,6 +203,19 @@ job = DataDrain::GlueRunner.ensure_job(
113
203
  )
114
204
  ```
115
205
 
206
+ Local script upload is also supported (v0.5.0+):
207
+
208
+ ```ruby
209
+ job = DataDrain::GlueRunner.ensure_job(
210
+ "my-job",
211
+ script_path: "scripts/glue/export.py",
212
+ script_bucket: "my-bucket",
213
+ script_folder: "scripts",
214
+ role_arn: "arn:aws:iam::123:role/GlueServiceRole",
215
+ timeout: 1440
216
+ )
217
+ ```
218
+
116
219
  - Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
117
220
  - AWS errors propagate to the caller.
118
221
 
@@ -133,17 +236,75 @@ DataDrain::GlueRunner.run_and_wait(
133
236
  - Raises `RuntimeError` if the job fails (`FAILED`, `STOPPED`, `TIMEOUT`).
134
237
  - Raises `DataDrain::Error` if `max_wait_seconds` is exceeded.
135
238
 
239
+ ## Full pattern: ensure_job + run_and_wait + PySpark
240
+
241
+ End-to-end workflow for archiving and purging PostgreSQL tables with AWS Glue:
242
+
243
+ ```ruby
244
+ # 1. Ensure the Glue Job exists with the desired config (idempotent)
245
+ DataDrain::GlueRunner.ensure_job(
246
+ "my-export-job",
247
+ role_arn: "arn:aws:iam::123:role/GlueServiceRole",
248
+ script_location: "s3://my-bucket/scripts/glue_pyspark_export.py",
249
+ glue_version: "4.0",
250
+ worker_type: "G.1X",
251
+ number_of_workers: 10,
252
+ timeout: 1440
253
+ )
254
+
255
+ # 2. Run the export (delegated to distributed Spark on Glue)
256
+ DataDrain::GlueRunner.run_and_wait(
257
+ "my-export-job",
258
+ {
259
+ "--start_date" => start_date.to_fs(:db),
260
+ "--end_date" => end_date.to_fs(:db),
261
+ "--s3_bucket" => bucket,
262
+ "--s3_folder" => table,
263
+ "--db_url" => "jdbc:postgresql://#{host}:#{port}/#{db}",
264
+ "--db_user" => db_user,
265
+ "--db_password" => db_password,
266
+ "--db_table" => table,
267
+ "--partition_by" => partition_keys.join(",")
268
+ },
269
+ polling_interval: 60,
270
+ max_wait_seconds: 7200
271
+ )
272
+
273
+ # 3. Verify integrity and purge Postgres (DataDrain only reads the Parquet output)
274
+ DataDrain::Engine.new(
275
+ bucket: bucket,
276
+ folder_name: table,
277
+ start_date: start_date,
278
+ end_date: end_date,
279
+ table_name: table,
280
+ partition_keys: partition_keys,
281
+ skip_export: true # Glue already did the export
282
+ ).call
283
+ ```
284
+
285
+ ### Prerequisites
286
+
287
+ 1. **Upload the script to S3:**
288
+ ```bash
289
+ aws s3 cp glue_pyspark_export.py s3://my-bucket/scripts/
290
+ ```
291
+
292
+ 2. **IAM Role** with permissions for Glue, S3 (read the script, write to the destination bucket), and RDS/Postgres (via JDBC)
293
+
294
+ 3. **PySpark script** at `s3://my-bucket/scripts/glue_pyspark_export.py` (see the [example](../glue_pyspark_example.py))
295
+
136
296
  ## Naming conventions
137
297
 
138
- AWS Glue allows letters (`a-zA-Z`), digits (`0-9`), and hyphens (`-`). Underscores and spaces are not allowed.
298
+ AWS Glue allows letters (`a-zA-Z`), digits (`0-9`), hyphens (`-`), and underscores (`_`). Spaces and special characters are not allowed.
139
299
 
140
300
  ```ruby
141
301
  # Valid
142
302
  DataDrain::GlueRunner.job_exists?("my-export-job-v2")
303
+ DataDrain::GlueRunner.job_exists?("my_export_job")
143
304
 
144
305
  # Invalid: raises ConfigurationError
145
- DataDrain::GlueRunner.job_exists?("my_export_job")
146
- # DataDrain::ConfigurationError: job_name 'my_export_job' no es un nombre válido para Glue Job
306
+ DataDrain::GlueRunner.job_exists?("-starts-with-dash")
307
+ # DataDrain::ConfigurationError: job_name '-starts-with-dash' no es un nombre válido para Glue Job
147
308
  ```
148
309
 
149
310
  ## Telemetry events
@@ -151,8 +312,18 @@ DataDrain::GlueRunner.job_exists?("my_export_job")
151
312
  | Event | Level | Description |
152
313
  |--------|-------|-------------|
153
314
  | `glue_runner.start` | INFO | Before `start_job_run` |
154
- | `glue_runner.job_exists` | INFO | Job found in `ensure_job` |
315
+ | `glue_runner.job_create` | INFO | Job created successfully |
316
+ | `glue_runner.job_update` | INFO | Job updated (includes `changed_fields`) |
317
+ | `glue_runner.job_delete` | INFO | Job deleted successfully |
318
+ | `glue_runner.job_delete_skipped` | INFO | `delete_job` called on a nonexistent job |
319
+ | `glue_runner.job_exists` | INFO | Job found in `ensure_job` (and it differs) |
155
320
  | `glue_runner.job_created` | INFO | Job created in `ensure_job` |
321
+ | `glue_runner.job_unchanged` | INFO | Job exists with identical config in `ensure_job` |
322
+ | `glue_runner.job_create_error` | ERROR | Error in `create_job` |
323
+ | `glue_runner.job_update_error` | ERROR | Error in `update_job` |
324
+ | `glue_runner.job_delete_error` | ERROR | Error in `delete_job` |
325
+ | `glue_runner.script_uploaded` | INFO | Script uploaded to S3 (v0.5.0+) |
326
+ | `glue_runner.script_upload_error` | ERROR | Error uploading the script to S3 (v0.5.0+) |
156
327
  | `glue_runner.polling` | INFO | Status check during `run_and_wait` |
157
328
  | `glue_runner.complete` | INFO | Job finished with `SUCCEEDED` |
158
329
  | `glue_runner.failed` | ERROR | Job failed with `FAILED\|STOPPED\|TIMEOUT` |
@@ -1,11 +1,29 @@
1
1
  """
2
2
  AWS Glue (PySpark) script compatible with DataDrain::GlueRunner.
3
3
 
4
- Create the Job in the AWS Glue console (Spark 4.0+) and use this script as a starting point.
5
- Required arguments: JOB_NAME, start_date, end_date, s3_bucket, s3_folder,
4
+ To create the Glue Job programmatically (instead of via the console):
5
+
6
+ # Modern option: local script uploaded by the gem (v0.5.0+)
7
+ DataDrain::GlueRunner.ensure_job(
8
+ "my-export-job",
9
+ script_path: "docs/glue_pyspark_example.py",
10
+ script_bucket: "my-bucket",
11
+ script_folder: "scripts",
12
+ role_arn: "arn:aws:iam::123:role/GlueServiceRole",
13
+ worker_type: "G.1X",
14
+ number_of_workers: 10,
15
+ timeout: 1440
16
+ )
17
+ # -> Uploads this file to s3://my-bucket/scripts/glue_pyspark_example.py
18
+ # -> Creates the Job pointing at that path
19
+
20
+ # Run
21
+ DataDrain::GlueRunner.run_and_wait("my-export-job", { "--start_date" => "2025-01-01", ... })
22
+
23
+ Required job arguments: JOB_NAME, start_date, end_date, s3_bucket, s3_folder,
6
24
  db_url, db_user, db_password, db_table, partition_by.
7
25
 
8
- Personalizar la sección de columnas derivadas según las partition_keys de cada tabla.
26
+ Personalizar la seccion de columnas derivadas segun las partition_keys de cada tabla.
9
27
  """
10
28
 
11
29
  import sys
@@ -15,27 +33,38 @@ from awsglue.context import GlueContext
15
33
  from awsglue.job import Job
16
34
  from pyspark.sql.functions import col, year, month
17
35
 
18
- args = getResolvedOptions(sys.argv, [
19
- 'JOB_NAME', 'start_date', 'end_date', 's3_bucket', 's3_folder',
20
- 'db_url', 'db_user', 'db_password', 'db_table', 'partition_by'
21
- ])
36
+ args = getResolvedOptions(
37
+ sys.argv,
38
+ [
39
+ "JOB_NAME",
40
+ "start_date",
41
+ "end_date",
42
+ "s3_bucket",
43
+ "s3_folder",
44
+ "db_url",
45
+ "db_user",
46
+ "db_password",
47
+ "db_table",
48
+ "partition_by",
49
+ ],
50
+ )
22
51
 
23
52
  sc = SparkContext()
24
53
  glueContext = GlueContext(sc)
25
54
  spark = glueContext.spark_session
26
55
  job = Job(glueContext)
27
- job.init(args['JOB_NAME'], args)
56
+ job.init(args["JOB_NAME"], args)
28
57
 
29
58
  options = {
30
- "url": args['db_url'],
31
- "dbtable": args['db_table'],
32
- "user": args['db_user'],
33
- "password": args['db_password'],
59
+ "url": args["db_url"],
60
+ "dbtable": args["db_table"],
61
+ "user": args["db_user"],
62
+ "password": args["db_password"],
34
63
  "sampleQuery": (
35
64
  f"SELECT * FROM {args['db_table']} "
36
65
  f"WHERE created_at >= '{args['start_date']}' "
37
66
  f"AND created_at < '{args['end_date']}'"
38
- )
67
+ ),
39
68
  }
40
69
 
41
70
  df = spark.read.format("jdbc").options(**options).load()
@@ -43,18 +72,19 @@ df = spark.read.format("jdbc").options(**options).load()
43
72
  # Add the derived columns needed for partitioning.
44
73
  # isp_id already exists in the source table; only add the computed ones.
45
74
  # Customize this section to match each table's partition_keys.
46
- df_final = (
47
- df.withColumn("year", year(col("created_at")))
48
- .withColumn("month", month(col("created_at")))
75
+ df_final = df.withColumn("year", year(col("created_at"))).withColumn(
76
+ "month", month(col("created_at"))
49
77
  )
50
78
 
51
79
  output_path = f"s3://{args['s3_bucket']}/{args['s3_folder']}/"
52
- partitions = args['partition_by'].split(",")
80
+ partitions = args["partition_by"].split(",")
53
81
 
54
- (df_final.write.mode("overwrite")
82
+ (
83
+ df_final.write.mode("overwrite")
55
84
  .partitionBy(*partitions)
56
85
  .format("parquet")
57
86
  .option("compression", "zstd")
58
- .save(output_path))
87
+ .save(output_path)
88
+ )
59
89
 
60
90
  job.commit()
@@ -40,17 +40,29 @@ module DataDrain
40
40
  client.get_job(job_name: job_name).job
41
41
  end
42
42
 
43
- def self.create_job(job_name, role_arn:, script_location:, command_name: "glueetl",
44
- default_arguments: {}, description: nil, worker_type: nil, number_of_workers: nil,
45
- timeout: 2880, max_retries: 0, allocated_capacity: nil, glue_version: nil)
43
+ def self.create_job(job_name, role_arn:, script_location: nil, script_path: nil,
44
+ script_bucket: nil, script_folder: "scripts", script_filename: nil,
45
+ command_name: "glueetl", default_arguments: {}, description: nil,
46
+ worker_type: nil, number_of_workers: nil, timeout: 2880,
47
+ max_retries: 0, allocated_capacity: nil, glue_version: nil)
48
+ @logger = DataDrain.configuration.logger
46
49
  DataDrain::Validations.validate_glue_name!(:job_name, job_name)
50
+
51
+ final_script_location = resolve_script_location(
52
+ script_location: script_location,
53
+ script_path: script_path,
54
+ script_bucket: script_bucket,
55
+ script_folder: script_folder,
56
+ script_filename: script_filename
57
+ )
58
+
47
59
  opts = {
48
60
  name: job_name,
49
61
  role: role_arn,
50
62
  command: {
51
63
  name: command_name,
52
64
  python_version: "3",
53
- script_location: script_location
65
+ script_location: final_script_location
54
66
  }
55
67
  }
56
68
  opts[:default_arguments] = default_arguments unless default_arguments.empty?
@@ -63,13 +75,24 @@ module DataDrain
63
75
  opts[:glue_version] = glue_version if glue_version
64
76
 
65
77
  client.create_job(**opts)
78
+ safe_log(:info, "glue_runner.job_create", {
79
+ job: job_name,
80
+ glue_version: glue_version,
81
+ worker_type: worker_type,
82
+ number_of_workers: number_of_workers
83
+ })
66
84
  get_job(job_name)
85
+ rescue Aws::Glue::Errors::ServiceError => e
86
+ safe_log(:error, "glue_runner.job_create_error",
87
+ { job: job_name }.merge(exception_metadata(e)))
88
+ raise
67
89
  end
68
90
 
69
91
  def self.update_job(job_name, role_arn: nil, command_name: nil, script_location: nil,
70
92
  default_arguments: nil, description: nil, worker_type: nil,
71
93
  number_of_workers: nil, timeout: nil, max_retries: nil, allocated_capacity: nil,
72
94
  glue_version: nil)
95
+ @logger = DataDrain.configuration.logger
73
96
  DataDrain::Validations.validate_glue_name!(:job_name, job_name)
74
97
  job_update = {}
75
98
  job_update[:role] = role_arn if role_arn
@@ -87,30 +110,77 @@ module DataDrain
87
110
  job_update[:glue_version] = glue_version if glue_version
88
111
 
89
112
  client.update_job(job_name: job_name, job_update: job_update)
113
+ safe_log(:info, "glue_runner.job_update", {
114
+ job: job_name,
115
+ changed_fields: job_update.keys.map(&:to_s)
116
+ })
90
117
  get_job(job_name)
118
+ rescue Aws::Glue::Errors::ServiceError => e
119
+ safe_log(:error, "glue_runner.job_update_error",
120
+ { job: job_name }.merge(exception_metadata(e)))
121
+ raise
91
122
  end
92
123
 
93
124
  def self.delete_job(job_name)
125
+ @logger = DataDrain.configuration.logger
94
126
  DataDrain::Validations.validate_glue_name!(:job_name, job_name)
95
127
  client.delete_job(job_name: job_name)
96
- nil
128
+ safe_log(:info, "glue_runner.job_delete", { job: job_name })
129
+ true
130
+ rescue Aws::Glue::Errors::EntityNotFoundException
131
+ safe_log(:info, "glue_runner.job_delete_skipped", { job: job_name, reason: "not_found" })
132
+ false
133
+ rescue Aws::Glue::Errors::ServiceError => e
134
+ safe_log(:error, "glue_runner.job_delete_error",
135
+ { job: job_name }.merge(exception_metadata(e)))
136
+ raise
97
137
  end
98
138
 
99
- def self.ensure_job(job_name, role_arn:, script_location:, command_name: "glueetl",
100
- default_arguments: {}, description: nil, worker_type: nil,
101
- number_of_workers: nil, timeout: 2880, max_retries: 0,
102
- allocated_capacity: nil, glue_version: nil)
139
+ def self.ensure_job(job_name, role_arn:, script_location: nil, script_path: nil,
140
+ script_bucket: nil, script_folder: "scripts", script_filename: nil,
141
+ command_name: "glueetl", default_arguments: {}, description: nil,
142
+ worker_type: nil, number_of_workers: nil, timeout: 2880,
143
+ max_retries: 0, allocated_capacity: nil, glue_version: nil)
144
+ @logger = DataDrain.configuration.logger
145
+
146
+ final_script_location = resolve_script_location(
147
+ script_location: script_location,
148
+ script_path: script_path,
149
+ script_bucket: script_bucket,
150
+ script_folder: script_folder,
151
+ script_filename: script_filename
152
+ )
153
+
103
154
  if job_exists?(job_name)
104
- safe_log(:info, "glue_runner.job_exists", { job: job_name })
105
- update_job(job_name, role_arn: role_arn, command_name: command_name,
106
- script_location: script_location, default_arguments: default_arguments,
107
- description: description, worker_type: worker_type,
108
- number_of_workers: number_of_workers, timeout: timeout,
109
- max_retries: max_retries, allocated_capacity: allocated_capacity,
110
- glue_version: glue_version)
155
+ current = get_job(job_name)
156
+ desired = {
157
+ role: role_arn,
158
+ command_name: command_name,
159
+ script_location: final_script_location,
160
+ default_arguments: default_arguments,
161
+ description: description,
162
+ worker_type: worker_type,
163
+ number_of_workers: number_of_workers,
164
+ timeout: timeout,
165
+ max_retries: max_retries,
166
+ glue_version: glue_version
167
+ }
168
+ changed = changed_fields(desired, current)
169
+ if changed.empty?
170
+ safe_log(:info, "glue_runner.job_unchanged", { job: job_name })
171
+ current
172
+ else
173
+ safe_log(:info, "glue_runner.job_exists", { job: job_name })
174
+ update_job(job_name, role_arn: role_arn, command_name: command_name,
175
+ script_location: final_script_location, default_arguments: default_arguments,
176
+ description: description, worker_type: worker_type,
177
+ number_of_workers: number_of_workers, timeout: timeout,
178
+ max_retries: max_retries, allocated_capacity: allocated_capacity,
179
+ glue_version: glue_version)
180
+ end
111
181
  else
112
182
  safe_log(:info, "glue_runner.job_created", { job: job_name })
113
- create_job(job_name, role_arn: role_arn, script_location: script_location,
183
+ create_job(job_name, role_arn: role_arn, script_location: final_script_location,
114
184
  command_name: command_name, default_arguments: default_arguments,
115
185
  description: description, worker_type: worker_type,
116
186
  number_of_workers: number_of_workers, timeout: timeout,
@@ -119,6 +189,72 @@ module DataDrain
119
189
  end
120
190
  end
121
191
 
192
+ def self.changed_fields(desired, current)
193
+ changed = []
194
+ changed << :role if current.role != desired[:role]
195
+ changed << :command if current.command.name != desired[:command_name] ||
196
+ current.command.script_location != desired[:script_location]
197
+ changed << :default_arguments if current.default_arguments != desired[:default_arguments]
198
+ changed << :description if current.description != desired[:description]
199
+ changed << :worker_type if current.worker_type != desired[:worker_type]
200
+ changed << :number_of_workers if current.number_of_workers != desired[:number_of_workers]
201
+ changed << :timeout if current.timeout != desired[:timeout]
202
+ changed << :max_retries if current.max_retries != desired[:max_retries]
203
+ changed << :glue_version if current.glue_version != desired[:glue_version]
204
+ changed
205
+ end
206
+ private_class_method :changed_fields
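To see how this field-by-field comparison behaves, here is a self-contained sketch. It assumes `Struct` stand-ins (`Command`, `Job`) in place of `Aws::Glue::Types::Job`, and it reports `:command` last rather than second; otherwise it mirrors the comparisons above.

```ruby
# Stand-ins for the Aws::Glue::Types::Job fields being compared (assumption).
Command = Struct.new(:name, :script_location, keyword_init: true)
Job = Struct.new(:role, :command, :default_arguments, :description, :worker_type,
                 :number_of_workers, :timeout, :max_retries, :glue_version,
                 keyword_init: true)

SIMPLE_FIELDS = %i[role default_arguments description worker_type
                   number_of_workers timeout max_retries glue_version].freeze

# Returns the list of fields whose desired value differs from the stored job.
def changed_fields(desired, current)
  changed = SIMPLE_FIELDS.reject { |field| current[field] == desired[field] }
  command_differs = current.command.name != desired[:command_name] ||
                    current.command.script_location != desired[:script_location]
  changed << :command if command_differs
  changed
end

current = Job.new(
  role: "arn:aws:iam::123:role/GlueServiceRole",
  command: Command.new(name: "glueetl", script_location: "s3://my-bucket/scripts/export.py"),
  default_arguments: {}, description: nil, worker_type: "G.1X",
  number_of_workers: 10, timeout: 1440, max_retries: 0, glue_version: "4.0"
)
desired = {
  role: current.role, command_name: "glueetl",
  script_location: "s3://my-bucket/scripts/export.py",
  default_arguments: {}, description: nil, worker_type: "G.1X",
  number_of_workers: 10, timeout: 1440, max_retries: 0, glue_version: "4.0"
}

p changed_fields(desired, current)                           # => []
p changed_fields(desired.merge(timeout: 2880), current)      # => [:timeout]
p changed_fields(desired.merge(glue_version: nil), current)  # => [:glue_version]
```

The last call shows a property of plain `!=` comparison in this sketch: an option left at `nil` counts as a change whenever the stored job carries a value for it.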
207
+
208
+ def self.resolve_script_location(script_location:, script_path:, script_bucket:, script_folder:, script_filename:)
209
+ both_set = script_location && script_path
210
+ raise DataDrain::ConfigurationError, "provee script_location o script_path, no ambos" if both_set
211
+
212
+ return script_location if script_location
213
+ raise ArgumentError, "script_location o script_path es requerido" unless script_path
214
+ raise DataDrain::ConfigurationError, "script_path requiere script_bucket" unless script_bucket
215
+
216
+ upload_script(
217
+ local_path: script_path,
218
+ bucket: script_bucket,
219
+ folder: script_folder,
220
+ filename: script_filename
221
+ )
222
+ end
223
+ private_class_method :resolve_script_location
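The decision order in `resolve_script_location` can be exercised in isolation. The sketch below is a dry-run variant under stated assumptions: `plan_script_source` is a hypothetical name, `ConfigurationError` stands in for `DataDrain::ConfigurationError`, and instead of uploading it only computes the S3 URI using the same `folder`/`basename` rule as `upload_script`.

```ruby
# Stand-in for DataDrain::ConfigurationError so the sketch runs without the gem.
class ConfigurationError < StandardError; end

# Hypothetical dry-run helper: decides where the script comes from without touching S3.
# Returns [:existing, uri] or [:upload, "s3://bucket/key"].
def plan_script_source(script_location: nil, script_path: nil, script_bucket: nil,
                       script_folder: "scripts", script_filename: nil)
  if script_location && script_path
    raise ConfigurationError, "pass script_location or script_path, not both"
  end
  return [:existing, script_location] if script_location
  raise ArgumentError, "script_location or script_path is required" unless script_path
  raise ConfigurationError, "script_path requires script_bucket" unless script_bucket

  filename = script_filename || File.basename(script_path)
  key = "#{script_folder.chomp("/")}/#{filename}"
  [:upload, "s3://#{script_bucket}/#{key}"]
end

p plan_script_source(script_path: "scripts/glue/export.py", script_bucket: "my-bucket")
# => [:upload, "s3://my-bucket/scripts/export.py"]
p plan_script_source(script_location: "s3://my-bucket/scripts/export.py")
# => [:existing, "s3://my-bucket/scripts/export.py"]
```

Note that only the basename of `script_path` survives in the key, so `scripts/glue/export.py` and `other/export.py` map to the same object unless `script_filename` or `script_folder` differs.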
224
+
225
+ def self.upload_script(local_path:, bucket:, folder: "scripts", filename: nil)
226
+ @logger = DataDrain.configuration.logger
227
+
228
+ unless File.exist?(local_path)
229
+ raise DataDrain::ConfigurationError,
230
+ "Script local '#{local_path}' no existe"
231
+ end
232
+
233
+ actual_filename = filename || File.basename(local_path)
234
+ s3_key = "#{folder.chomp("/")}/#{actual_filename}"
235
+ bytes = File.size(local_path)
236
+
237
+ adapter = DataDrain::Storage.adapter
238
+ unless adapter.is_a?(DataDrain::Storage::S3)
239
+ raise DataDrain::ConfigurationError,
240
+ "upload_script requiere storage_mode = :s3, actual: #{DataDrain.configuration.storage_mode}"
241
+ end
242
+
243
+ s3_path = adapter.upload_file(local_path, bucket, s3_key, content_type: "text/x-python")
244
+
245
+ safe_log(:info, "glue_runner.script_uploaded", {
246
+ local_path: local_path,
247
+ s3_path: s3_path,
248
+ bytes: bytes
249
+ })
250
+
251
+ s3_path
252
+ rescue Aws::S3::Errors::ServiceError => e
253
+ safe_log(:error, "glue_runner.script_upload_error",
254
+ { local_path: local_path, bucket: bucket }.merge(exception_metadata(e)))
255
+ raise
256
+ end
257
+
122
258
  def self.run_and_wait(job_name, arguments = {}, polling_interval: 30, max_wait_seconds: nil)
123
259
  config = DataDrain.configuration
124
260
  config.validate!
@@ -55,6 +55,18 @@ module DataDrain
55
55
  raise NotImplementedError, "#{self.class} debe implementar #destroy_partitions"
56
56
  end
57
57
 
58
+ # Uploads a local file to the storage backend.
59
+ #
60
+ # @param local_path [String]
61
+ # @param bucket [String]
62
+ # @param s3_key [String] relative key (e.g. "scripts/export.py")
63
+ # @param content_type [String, nil]
64
+ # @return [String] full URI of the uploaded file
65
+ # @raise [NotImplementedError]
66
+ def upload_file(local_path, bucket, s3_key, content_type: nil)
67
+ raise NotImplementedError, "#{self.class} debe implementar #upload_file"
68
+ end
69
+
58
70
  protected
59
71
 
60
72
  # @param bucket [String]
@@ -27,6 +27,19 @@ module DataDrain
27
27
  "#{build_path_base(bucket, folder_name, partition_path)}/**/*.parquet"
28
28
  end
29
29
 
30
+ # @param local_path [String]
31
+ # @param bucket [String] Destination directory
32
+ # @param s3_key [String] Relative path inside the bucket
33
+ # @param content_type [String, nil] Ignored in local mode
34
+ # @return [String] Path to the destination file (`File.join(bucket, s3_key)`)
35
+ def upload_file(local_path, bucket, s3_key, content_type: nil)
36
+ _ = content_type
37
+ dest_path = File.join(bucket, s3_key)
38
+ FileUtils.mkdir_p(File.dirname(dest_path))
39
+ FileUtils.cp(local_path, dest_path)
40
+ dest_path
41
+ end
42
+
30
43
  # @param bucket [String]
31
44
  # @param folder_name [String]
32
45
  # @param partition_keys [Array<Symbol>]
@@ -38,6 +38,23 @@ module DataDrain
38
38
  delete_in_batches(client, bucket, objects)
39
39
  end
40
40
 
41
+ # @param local_path [String]
42
+ # @param bucket [String]
43
+ # @param s3_key [String]
44
+ # @param content_type [String, nil]
45
+ # @return [String] "s3://bucket/key"
46
+ def upload_file(local_path, bucket, s3_key, content_type: nil)
47
+ client = s3_client
48
+
49
+ File.open(local_path, "rb") do |file|
50
+ params = { bucket: bucket, key: s3_key, body: file }
51
+ params[:content_type] = content_type if content_type
52
+ client.put_object(**params)
53
+ end
54
+
55
+ "s3://#{bucket}/#{s3_key}"
56
+ end
57
+
41
58
  private
42
59
 
43
60
  # @return [Aws::S3::Client]
@@ -6,7 +6,7 @@ module DataDrain
6
6
  # Regex that validates SQL identifiers (tables, columns, etc.).
7
7
  # Allows letters, underscores, and digits (digits not in first position).
8
8
  IDENTIFIER_REGEX = /\A[a-zA-Z_][a-zA-Z0-9_]*\z/
9
- GLUE_NAME_REGEX = /\A[a-zA-Z0-9][a-zA-Z0-9-]*\z/
9
+ GLUE_NAME_REGEX = /\A(?![_-])[a-zA-Z0-9_-]+\z/
10
10
 
11
11
  module_function
12
12
 
@@ -14,7 +14,7 @@ module DataDrain
14
14
  return if GLUE_NAME_REGEX.match?(value.to_s)
15
15
 
16
16
  raise DataDrain::ConfigurationError,
17
- "#{name} '#{value}' no es un nombre válido para Glue Job (usa solo letras, números y guiones)"
17
+ "#{name} '#{value}' no es un nombre válido para Glue Job (usa solo letras, números, '-' y '_')"
18
18
  end
19
19
 
20
20
  def validate_identifier!(name, value)
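A quick way to check what the new `GLUE_NAME_REGEX` accepts is to run it over a few sample names. The pattern below is copied from the 0.5.0 side of the hunk; `glue_name_valid?` and the sample list are hypothetical helpers added for illustration, not part of the gem.

```ruby
# Pattern copied from the 0.5.0 side of the diff.
GLUE_NAME_REGEX = /\A(?![_-])[a-zA-Z0-9_-]+\z/

# Hypothetical predicate for illustration; the gem raises via validate_glue_name! instead.
def glue_name_valid?(value)
  GLUE_NAME_REGEX.match?(value.to_s)
end

SAMPLES = [
  "my-export-job-v2",
  "my_export_job",
  "-starts-with-dash",
  "_leading_underscore",
  "has space",
  ""
].freeze

RESULTS = SAMPLES.to_h { |name| [name, glue_name_valid?(name)] }
RESULTS.each { |name, ok| puts format("%-22s %s", name.inspect, ok ? "valid" : "invalid") }
```

The negative lookahead `(?![_-])` is what rejects a leading hyphen or underscore, while both characters remain allowed in later positions.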
@@ -2,5 +2,5 @@
2
2
 
3
3
  module DataDrain
4
4
  # @return [String] semver version of the gem
5
- VERSION = "0.4.0"
5
+ VERSION = "0.5.0"
6
6
  end