data_drain 0.4.0 → 0.5.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +25 -0
- data/README.md +8 -4
- data/docs/execution/archives/v0.5.0-OBSERVACIONES.md +167 -0
- data/docs/execution/archives/v0.5.0.md +900 -0
- data/docs/glue-jobs-lifecycle.md +184 -13
- data/docs/glue_pyspark_example.py +49 -19
- data/lib/data_drain/glue_runner.rb +153 -17
- data/lib/data_drain/storage/base.rb +12 -0
- data/lib/data_drain/storage/local.rb +13 -0
- data/lib/data_drain/storage/s3.rb +17 -0
- data/lib/data_drain/validations.rb +2 -2
- data/lib/data_drain/version.rb +1 -1
- data/skill/SKILL.md +64 -3
- data/skill/references/eventos-telemetria.md +9 -0
- metadata +3 -1
data/docs/glue-jobs-lifecycle.md
CHANGED
|
@@ -32,16 +32,102 @@ job.default_arguments # => { "--extra-files" => "s3://..." }
|
|
|
32
32
|
- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
|
|
33
33
|
- Raises `Aws::Glue::Errors::EntityNotFoundException` if the job does not exist.
|
|
34
34
|
|
|
35
|
-
###
|
|
35
|
+
### Uploading local scripts (v0.5.0+)
|
|
36
|
+
|
|
37
|
+
Since v0.5.0 the gem can upload PySpark scripts to S3 automatically.
|
|
38
|
+
|
|
39
|
+
```ruby
|
|
40
|
+
# Modern option: local script uploaded by the gem
|
|
41
|
+
DataDrain::GlueRunner.create_job(
|
|
42
|
+
"my-job",
|
|
43
|
+
script_path: "scripts/glue/export.py", # local
|
|
44
|
+
script_bucket: "my-bucket",
|
|
45
|
+
script_folder: "scripts",
|
|
46
|
+
role_arn: "arn:aws:iam::123:role/GlueRole"
|
|
47
|
+
)
|
|
48
|
+
# → Uploads scripts/glue/export.py to s3://my-bucket/scripts/export.py
|
|
49
|
+
# → Creates the job
|
|
50
|
+
```
|
|
51
|
+
|
|
52
|
+
**Upload parameters:**
|
|
53
|
+
- `script_path` (String): local path to the Python script.
|
|
54
|
+
- `script_bucket` (String): destination S3 bucket. **Required when `script_path` is used.**
|
|
55
|
+
- `script_folder` (String): folder inside the bucket. Default: `"scripts"`.
|
|
56
|
+
- `script_filename` (String, nil): overrides the file name in S3. Default: the basename of the local file.
|
|
57
|
+
|
|
58
|
+
**`script_location` vs `script_path`:**
|
|
59
|
+
- `script_location:` → previous behavior, no upload.
|
|
60
|
+
- `script_path:` + `script_bucket:` → the gem uploads to S3 first, then creates the Job.
|
|
61
|
+
- If both are passed → `DataDrain::ConfigurationError`.
|
|
62
|
+
- If neither is passed → `ArgumentError`.
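The rules above can be sketched as a small standalone function. This is only an illustration: `script_source` and the top-level `ConfigurationError` class are hypothetical stand-ins so the example runs without the gem, and the missing-bucket check reflects the "required" note on `script_bucket`.

```ruby
# Stand-in for DataDrain::ConfigurationError so the sketch runs without the gem.
class ConfigurationError < StandardError; end

# Hypothetical helper: decides where the script comes from, or raises,
# following the script_location / script_path rules listed above.
def script_source(script_location: nil, script_path: nil, script_bucket: nil)
  if script_location && script_path
    raise ConfigurationError, "pass script_location or script_path, not both"
  end
  return :s3_location if script_location
  raise ArgumentError, "script_location or script_path is required" unless script_path
  raise ConfigurationError, "script_path requires script_bucket" unless script_bucket

  :local_upload
end
```

Checking the mutually exclusive case first keeps the error for "both passed" distinct from the error for "neither passed".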
|
|
63
|
+
|
|
64
|
+
**Important:** the upload **overwrites** any existing file at the same path.
|
|
65
|
+
It is not strictly idempotent. Use `script_filename:` with a hash or timestamp
|
|
66
|
+
if you need versioning.
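One way to get a versioned name is to derive it from the file contents. The sketch below is an assumption, not part of the gem: `versioned_script_filename` is a hypothetical helper, and the 12-character digest length is arbitrary.

```ruby
require "digest"

# Hypothetical helper (not part of data_drain): builds a content-addressed
# file name such as "export-<12 hex chars>.py" from a local script.
def versioned_script_filename(local_path, digest_length: 12)
  ext = File.extname(local_path)          # e.g. ".py"
  base = File.basename(local_path, ext)   # e.g. "export"
  sha = Digest::SHA256.file(local_path).hexdigest[0, digest_length]
  "#{base}-#{sha}#{ext}"
end

# Usage with the gem (commented out: needs AWS credentials and storage_mode = :s3):
# DataDrain::GlueRunner.upload_script(
#   local_path: "scripts/glue/export.py",
#   bucket: "my-bucket",
#   folder: "scripts",
#   filename: versioned_script_filename("scripts/glue/export.py")
# )
```

Because the name depends only on the contents, re-uploading an unchanged script targets the same key, while an edited script gets a new one.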
|
|
67
|
+
|
|
68
|
+
### Concurrency (known limitation)
|
|
69
|
+
|
|
70
|
+
There is no distributed lock. If two processes call `upload_script` with the same destination
|
|
71
|
+
at the same time, the last `put_object` to reach S3 wins. For PySpark scripts
|
|
72
|
+
this is typically low risk (scripts are small, and concurrent writes to the same
|
|
73
|
+
path are rare).
|
|
74
|
+
|
|
75
|
+
### Minimum IAM permissions
|
|
76
|
+
|
|
77
|
+
The IAM role/user that runs `upload_script` needs:
|
|
78
|
+
|
|
79
|
+
```json
|
|
80
|
+
{
|
|
81
|
+
"Effect": "Allow",
|
|
82
|
+
"Action": ["s3:PutObject"],
|
|
83
|
+
"Resource": "arn:aws:s3:::my-bucket/scripts/*"
|
|
84
|
+
}
|
|
85
|
+
```
|
|
86
|
+
|
|
87
|
+
To use it with `create_job`/`ensure_job` you also need the Glue permissions
|
|
88
|
+
(see the "Permisos Glue" section at the start of this document), plus permission for the
|
|
89
|
+
Glue Job's IAM role to read the script:
|
|
90
|
+
|
|
91
|
+
```json
|
|
92
|
+
{
|
|
93
|
+
"Effect": "Allow",
|
|
94
|
+
"Action": ["s3:GetObject"],
|
|
95
|
+
"Resource": "arn:aws:s3:::my-bucket/scripts/*"
|
|
96
|
+
}
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
(The latter goes on the Glue Job's role, not on the Ruby application's role.)
|
|
100
|
+
|
|
101
|
+
### Standalone API: `upload_script`
|
|
102
|
+
|
|
103
|
+
For cases where you only want to upload (without creating a Job):
|
|
104
|
+
|
|
105
|
+
```ruby
|
|
106
|
+
s3_path = DataDrain::GlueRunner.upload_script(
|
|
107
|
+
local_path: "scripts/glue/export.py",
|
|
108
|
+
bucket: "my-bucket",
|
|
109
|
+
folder: "scripts"
|
|
110
|
+
)
|
|
111
|
+
# => "s3://my-bucket/scripts/export.py"
|
|
112
|
+
```
|
|
113
|
+
|
|
114
|
+
Requires `storage_mode = :s3`.
|
|
115
|
+
|
|
116
|
+
### `create_job(job_name, role_arn:, ...)` → Aws::Glue::Types::Job
|
|
36
117
|
|
|
37
118
|
Creates a new job in Glue and returns the created job.
|
|
38
119
|
|
|
39
120
|
**Required parameters:**
|
|
40
121
|
- `job_name` (String): name of the job
|
|
41
122
|
- `role_arn` (String): ARN of the Glue IAM role
|
|
42
|
-
|
|
123
|
+
|
|
124
|
+
**Script parameters (mutually exclusive):**
|
|
125
|
+
- `script_location` (String): S3 path of the Python script (previous behavior)
|
|
126
|
+
- `script_path` + `script_bucket` (String): uploads the local script to S3 first (v0.5.0+)
|
|
43
127
|
|
|
44
128
|
**Optional parameters:**
|
|
129
|
+
- `script_folder` (String): S3 folder. Default: `"scripts"`.
|
|
130
|
+
- `script_filename` (String, nil): overrides the file name in S3.
|
|
45
131
|
- `command_name` (String): command name (`"glueetl"`, `"pythonshell"`). Default: `"glueetl"`.
|
|
46
132
|
- `default_arguments` (Hash): default arguments of the job
|
|
47
133
|
- `description` (String): description of the job
|
|
@@ -85,24 +171,28 @@ job = DataDrain::GlueRunner.update_job(
|
|
|
85
171
|
- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
|
|
86
172
|
- Raises `Aws::Glue::Errors::EntityNotFoundException` if the job does not exist.
|
|
87
173
|
|
|
88
|
-
### `delete_job(job_name)` →
|
|
174
|
+
### `delete_job(job_name)` → Boolean
|
|
89
175
|
|
|
90
|
-
Deletes a Glue job.
|
|
176
|
+
Deletes a Glue job. It is idempotent.
|
|
91
177
|
|
|
92
178
|
```ruby
|
|
93
179
|
DataDrain::GlueRunner.delete_job("my-job")
|
|
94
|
-
# =>
|
|
180
|
+
# => true (the job existed and was deleted)
|
|
181
|
+
|
|
182
|
+
DataDrain::GlueRunner.delete_job("nonexistent")
|
|
183
|
+
# => false (the job did not exist)
|
|
95
184
|
```
|
|
96
185
|
|
|
97
186
|
- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
|
|
98
|
-
- Lanza
|
|
187
|
+
- Other AWS errors propagate without being rescued.
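The idempotent delete contract (true when something was deleted, false when there was nothing to delete) can be illustrated without AWS. In this sketch `FakeGlueClient`, `NotFoundError` and `delete_job_idempotently` are hypothetical names; `NotFoundError` stands in for `Aws::Glue::Errors::EntityNotFoundException`.

```ruby
# Stand-in for Aws::Glue::Errors::EntityNotFoundException.
class NotFoundError < StandardError; end

# Minimal in-memory fake of a Glue client, used only for illustration.
class FakeGlueClient
  def initialize(job_names)
    @jobs = job_names.dup
  end

  # Removes the job, or raises NotFoundError when it is not registered.
  def delete_job(job_name:)
    raise NotFoundError, job_name unless @jobs.delete(job_name)
  end
end

# Returns true when the job was deleted and false when it did not exist.
# Any other error raised by the client propagates to the caller.
def delete_job_idempotently(client, job_name)
  client.delete_job(job_name: job_name)
  true
rescue NotFoundError
  false
end
```

Calling it twice for the same job is safe: the first call returns `true` and the second returns `false`.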
|
|
99
188
|
|
|
100
|
-
### `ensure_job(job_name, role_arn:,
|
|
189
|
+
### `ensure_job(job_name, role_arn:, ...)` → Aws::Glue::Types::Job
|
|
101
190
|
|
|
102
|
-
Creates or updates a job idempotently.
|
|
191
|
+
Creates or updates a job idempotently, diffing the configuration.
|
|
103
192
|
|
|
104
|
-
- If the job exists → `update_job`
|
|
105
193
|
- If the job does not exist → `create_job`
|
|
194
|
+
- If the job exists with a different config → `update_job`
|
|
195
|
+
- If the job exists with an identical config → no-op, returns the current job (logged as `glue_runner.job_unchanged`)
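The three branches above amount to a small decision over the desired and current configuration. The sketch below uses plain hashes instead of `Aws::Glue::Types::Job`, and `changed_keys` / `plan_action` are hypothetical names chosen for illustration.

```ruby
# Returns the keys whose desired value differs from the current value.
def changed_keys(desired, current)
  desired.keys.reject { |key| desired[key] == current[key] }
end

# Decides what an ensure-style call would do:
#   :create    when there is no current job,
#   :unchanged when every desired field already matches,
#   :update    otherwise.
def plan_action(desired, current)
  return :create if current.nil?

  changed_keys(desired, current).empty? ? :unchanged : :update
end
```

Only keys present in `desired` are compared, so fields the caller does not manage never trigger an update in this sketch.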
|
|
106
196
|
|
|
107
197
|
```ruby
|
|
108
198
|
job = DataDrain::GlueRunner.ensure_job(
|
|
@@ -113,6 +203,19 @@ job = DataDrain::GlueRunner.ensure_job(
|
|
|
113
203
|
)
|
|
114
204
|
```
|
|
115
205
|
|
|
206
|
+
It also supports uploading a local script (v0.5.0+):
|
|
207
|
+
|
|
208
|
+
```ruby
|
|
209
|
+
job = DataDrain::GlueRunner.ensure_job(
|
|
210
|
+
"my-job",
|
|
211
|
+
script_path: "scripts/glue/export.py",
|
|
212
|
+
script_bucket: "my-bucket",
|
|
213
|
+
script_folder: "scripts",
|
|
214
|
+
role_arn: "arn:aws:iam::123:role/GlueServiceRole",
|
|
215
|
+
timeout: 1440
|
|
216
|
+
)
|
|
217
|
+
```
|
|
218
|
+
|
|
116
219
|
- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
|
|
117
220
|
- AWS errors propagate without being rescued.
|
|
118
221
|
|
|
@@ -133,17 +236,75 @@ DataDrain::GlueRunner.run_and_wait(
|
|
|
133
236
|
- Raises `RuntimeError` if the job fails (`FAILED`, `STOPPED`, `TIMEOUT`).
|
|
134
237
|
- Raises `DataDrain::Error` if `max_wait_seconds` is exceeded.
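The error contract above matches a common poll-until-terminal loop. The body of `run_and_wait` is not shown in this diff, so the following is only a generic sketch of that pattern: `wait_for_job`, `WaitTimeoutError` and the injectable `sleeper` are hypothetical, and the block stands in for a call that fetches the job run state.

```ruby
# Stand-in for DataDrain::Error in the timeout case.
class WaitTimeoutError < StandardError; end

TERMINAL_FAILURES = %w[FAILED STOPPED TIMEOUT].freeze

# Polls the given block until it reports SUCCEEDED, a failure state,
# or until max_wait_seconds (when given) has elapsed.
def wait_for_job(polling_interval:, max_wait_seconds: nil, sleeper: ->(s) { sleep(s) })
  waited = 0
  loop do
    state = yield
    return state if state == "SUCCEEDED"
    raise "job ended in state #{state}" if TERMINAL_FAILURES.include?(state)
    if max_wait_seconds && waited >= max_wait_seconds
      raise WaitTimeoutError, "exceeded #{max_wait_seconds}s"
    end

    sleeper.call(polling_interval)
    waited += polling_interval
  end
end
```

Injecting `sleeper` keeps the loop testable without real delays; in production code the default `sleep` would be used.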
|
|
135
238
|
|
|
239
|
+
## Full pattern: ensure_job + run_and_wait + PySpark
|
|
240
|
+
|
|
241
|
+
End-to-end workflow to archive and purge PostgreSQL tables using AWS Glue:
|
|
242
|
+
|
|
243
|
+
```ruby
|
|
244
|
+
# 1. Make sure the Glue Job exists with the desired config (idempotent)
|
|
245
|
+
DataDrain::GlueRunner.ensure_job(
|
|
246
|
+
"my-export-job",
|
|
247
|
+
role_arn: "arn:aws:iam::123:role/GlueServiceRole",
|
|
248
|
+
script_location: "s3://my-bucket/scripts/glue_pyspark_export.py",
|
|
249
|
+
glue_version: "4.0",
|
|
250
|
+
worker_type: "G.1X",
|
|
251
|
+
number_of_workers: 10,
|
|
252
|
+
timeout: 1440
|
|
253
|
+
)
|
|
254
|
+
|
|
255
|
+
# 2. Run the export (delegated to distributed Spark on Glue)
|
|
256
|
+
DataDrain::GlueRunner.run_and_wait(
|
|
257
|
+
"my-export-job",
|
|
258
|
+
{
|
|
259
|
+
"--start_date" => start_date.to_fs(:db),
|
|
260
|
+
"--end_date" => end_date.to_fs(:db),
|
|
261
|
+
"--s3_bucket" => bucket,
|
|
262
|
+
"--s3_folder" => table,
|
|
263
|
+
"--db_url" => "jdbc:postgresql://#{host}:#{port}/#{db}",
|
|
264
|
+
"--db_user" => db_user,
|
|
265
|
+
"--db_password" => db_password,
|
|
266
|
+
"--db_table" => table,
|
|
267
|
+
"--partition_by" => partition_keys.join(",")
|
|
268
|
+
},
|
|
269
|
+
polling_interval: 60,
|
|
270
|
+
max_wait_seconds: 7200
|
|
271
|
+
)
|
|
272
|
+
|
|
273
|
+
# 3. Verify integrity and purge Postgres (DataDrain only reads Parquet)
|
|
274
|
+
DataDrain::Engine.new(
|
|
275
|
+
bucket: bucket,
|
|
276
|
+
folder_name: table,
|
|
277
|
+
start_date: start_date,
|
|
278
|
+
end_date: end_date,
|
|
279
|
+
table_name: table,
|
|
280
|
+
partition_keys: partition_keys,
|
|
281
|
+
skip_export: true # the export was already done by Glue
|
|
282
|
+
).call
|
|
283
|
+
```
|
|
284
|
+
|
|
285
|
+
### Prerequisites
|
|
286
|
+
|
|
287
|
+
1. **Upload the script to S3:**
|
|
288
|
+
```bash
|
|
289
|
+
aws s3 cp glue_pyspark_export.py s3://my-bucket/scripts/
|
|
290
|
+
```
|
|
291
|
+
|
|
292
|
+
2. **IAM Role** with permissions for: Glue, S3 (read the script + write to the destination bucket), RDS/Postgres (via JDBC)
|
|
293
|
+
|
|
294
|
+
3. **PySpark script** at `s3://my-bucket/scripts/glue_pyspark_export.py` (see [example](../glue_pyspark_example.py))
|
|
295
|
+
|
|
136
296
|
## Naming conventions
|
|
137
297
|
|
|
138
|
-
AWS Glue allows: letters (`a-zA-Z`), numbers (`0-9`), hyphens (`-`). It does not allow
|
|
298
|
+
AWS Glue allows: letters (`a-zA-Z`), numbers (`0-9`), hyphens (`-`), underscores (`_`). It does not allow spaces or special characters.
|
|
139
299
|
|
|
140
300
|
```ruby
|
|
141
301
|
# Valid
|
|
142
302
|
DataDrain::GlueRunner.job_exists?("my-export-job-v2")
|
|
303
|
+
DataDrain::GlueRunner.job_exists?("my_export_job")
|
|
143
304
|
|
|
144
305
|
# Invalid: raises ConfigurationError
|
|
145
|
-
DataDrain::GlueRunner.job_exists?("
|
|
146
|
-
# DataDrain::ConfigurationError: job_name '
|
|
306
|
+
DataDrain::GlueRunner.job_exists?("-starts-with-dash")
|
|
307
|
+
# DataDrain::ConfigurationError: job_name '-starts-with-dash' no es un nombre válido para Glue Job
|
|
147
308
|
```
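The same check can be reproduced outside the gem. The sketch below copies the pattern that appears as `GLUE_NAME_REGEX` in `validations.rb` in this diff; `GLUE_NAME_PATTERN` and `valid_glue_name?` are hypothetical names used only here.

```ruby
# Same pattern as GLUE_NAME_REGEX in this diff: letters, digits, '-' and '_',
# with a negative lookahead that rejects a leading '-' or '_'.
GLUE_NAME_PATTERN = /\A(?![_-])[a-zA-Z0-9_-]+\z/

# Returns true when the value would pass the gem's name validation.
def valid_glue_name?(value)
  GLUE_NAME_PATTERN.match?(value.to_s)
end
```

Note that the lookahead also rejects names that start with an underscore, which the prose above does not mention.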
|
|
148
309
|
|
|
149
310
|
## Telemetry events
|
|
@@ -151,8 +312,18 @@ DataDrain::GlueRunner.job_exists?("my_export_job")
|
|
|
151
312
|
| Event | Level | Description |
|
|
152
313
|
|--------|-------|-------------|
|
|
153
314
|
| `glue_runner.start` | INFO | Before `start_job_run` |
|
|
154
|
-
| `glue_runner.
|
|
315
|
+
| `glue_runner.job_create` | INFO | Job created successfully |
|
|
316
|
+
| `glue_runner.job_update` | INFO | Job updated (includes `changed_fields`) |
|
|
317
|
+
| `glue_runner.job_delete` | INFO | Job deleted successfully |
|
|
318
|
+
| `glue_runner.job_delete_skipped` | INFO | `delete_job` on a nonexistent job |
|
|
319
|
+
| `glue_runner.job_exists` | INFO | Job found in `ensure_job` (and it differs) |
|
|
155
320
|
| `glue_runner.job_created` | INFO | Job created in `ensure_job` |
|
|
321
|
+
| `glue_runner.job_unchanged` | INFO | Job exists with identical config in `ensure_job` |
|
|
322
|
+
| `glue_runner.job_create_error` | ERROR | Error in `create_job` |
|
|
323
|
+
| `glue_runner.job_update_error` | ERROR | Error in `update_job` |
|
|
324
|
+
| `glue_runner.job_delete_error` | ERROR | Error in `delete_job` |
|
|
325
|
+
| `glue_runner.script_uploaded` | INFO | Script uploaded to S3 (v0.5.0+) |
|
|
326
|
+
| `glue_runner.script_upload_error` | ERROR | Error uploading script to S3 (v0.5.0+) |
|
|
156
327
|
| `glue_runner.polling` | INFO | Status check during `run_and_wait` |
|
|
157
328
|
| `glue_runner.complete` | INFO | Job finished `SUCCEEDED` |
|
|
158
329
|
| `glue_runner.failed` | ERROR | Job failed with `FAILED\|STOPPED\|TIMEOUT` |
|
|
data/docs/glue_pyspark_example.py
CHANGED
|
@@ -1,11 +1,29 @@
|
|
|
1
1
|
"""
|
|
2
2
|
AWS Glue (PySpark) script compatible with DataDrain::GlueRunner.
|
|
3
3
|
|
|
4
|
-
|
|
5
|
-
|
|
4
|
+
To create the Glue Job programmatically (instead of through the console):
|
|
5
|
+
|
|
6
|
+
# Modern option: local script uploaded by the gem (v0.5.0+)
|
|
7
|
+
DataDrain::GlueRunner.ensure_job(
|
|
8
|
+
"my-export-job",
|
|
9
|
+
script_path: "docs/glue_pyspark_example.py",
|
|
10
|
+
script_bucket: "my-bucket",
|
|
11
|
+
script_folder: "scripts",
|
|
12
|
+
role_arn: "arn:aws:iam::123:role/GlueServiceRole",
|
|
13
|
+
worker_type: "G.1X",
|
|
14
|
+
number_of_workers: 10,
|
|
15
|
+
timeout: 1440
|
|
16
|
+
)
|
|
17
|
+
# -> Uploads this file to s3://my-bucket/scripts/glue_pyspark_example.py
|
|
18
|
+
# -> Creates the Job pointing at that path
|
|
19
|
+
|
|
20
|
+
# Run
|
|
21
|
+
DataDrain::GlueRunner.run_and_wait("my-export-job", { "--start_date" => "2025-01-01", ... })
|
|
22
|
+
|
|
23
|
+
Required job arguments: JOB_NAME, start_date, end_date, s3_bucket, s3_folder,
|
|
6
24
|
db_url, db_user, db_password, db_table, partition_by.
|
|
7
25
|
|
|
8
|
-
Personalizar la
|
|
26
|
+
Customize the derived-columns section according to each table's partition_keys.
|
|
9
27
|
"""
|
|
10
28
|
|
|
11
29
|
import sys
|
|
@@ -15,27 +33,38 @@ from awsglue.context import GlueContext
|
|
|
15
33
|
from awsglue.job import Job
|
|
16
34
|
from pyspark.sql.functions import col, year, month
|
|
17
35
|
|
|
18
|
-
args = getResolvedOptions(
|
|
19
|
-
|
|
20
|
-
|
|
21
|
-
|
|
36
|
+
args = getResolvedOptions(
|
|
37
|
+
sys.argv,
|
|
38
|
+
[
|
|
39
|
+
"JOB_NAME",
|
|
40
|
+
"start_date",
|
|
41
|
+
"end_date",
|
|
42
|
+
"s3_bucket",
|
|
43
|
+
"s3_folder",
|
|
44
|
+
"db_url",
|
|
45
|
+
"db_user",
|
|
46
|
+
"db_password",
|
|
47
|
+
"db_table",
|
|
48
|
+
"partition_by",
|
|
49
|
+
],
|
|
50
|
+
)
|
|
22
51
|
|
|
23
52
|
sc = SparkContext()
|
|
24
53
|
glueContext = GlueContext(sc)
|
|
25
54
|
spark = glueContext.spark_session
|
|
26
55
|
job = Job(glueContext)
|
|
27
|
-
job.init(args[
|
|
56
|
+
job.init(args["JOB_NAME"], args)
|
|
28
57
|
|
|
29
58
|
options = {
|
|
30
|
-
"url": args[
|
|
31
|
-
"dbtable": args[
|
|
32
|
-
"user": args[
|
|
33
|
-
"password": args[
|
|
59
|
+
"url": args["db_url"],
|
|
60
|
+
"dbtable": args["db_table"],
|
|
61
|
+
"user": args["db_user"],
|
|
62
|
+
"password": args["db_password"],
|
|
34
63
|
"sampleQuery": (
|
|
35
64
|
f"SELECT * FROM {args['db_table']} "
|
|
36
65
|
f"WHERE created_at >= '{args['start_date']}' "
|
|
37
66
|
f"AND created_at < '{args['end_date']}'"
|
|
38
|
-
)
|
|
67
|
+
),
|
|
39
68
|
}
|
|
40
69
|
|
|
41
70
|
df = spark.read.format("jdbc").options(**options).load()
|
|
@@ -43,18 +72,19 @@ df = spark.read.format("jdbc").options(**options).load()
|
|
|
43
72
|
# Add the derived columns needed for the partitions.
|
|
44
73
|
# isp_id already exists in the source table; only add the computed ones.
|
|
45
74
|
# Customize this section according to each table's partition_keys.
|
|
46
|
-
df_final = (
|
|
47
|
-
|
|
48
|
-
.withColumn("month", month(col("created_at")))
|
|
75
|
+
df_final = df.withColumn("year", year(col("created_at"))).withColumn(
|
|
76
|
+
"month", month(col("created_at"))
|
|
49
77
|
)
|
|
50
78
|
|
|
51
79
|
output_path = f"s3://{args['s3_bucket']}/{args['s3_folder']}/"
|
|
52
|
-
partitions = args[
|
|
80
|
+
partitions = args["partition_by"].split(",")
|
|
53
81
|
|
|
54
|
-
(
|
|
82
|
+
(
|
|
83
|
+
df_final.write.mode("overwrite")
|
|
55
84
|
.partitionBy(*partitions)
|
|
56
85
|
.format("parquet")
|
|
57
86
|
.option("compression", "zstd")
|
|
58
|
-
.save(output_path)
|
|
87
|
+
.save(output_path)
|
|
88
|
+
)
|
|
59
89
|
|
|
60
90
|
job.commit()
|
|
data/lib/data_drain/glue_runner.rb
CHANGED
|
@@ -40,17 +40,29 @@ module DataDrain
|
|
|
40
40
|
client.get_job(job_name: job_name).job
|
|
41
41
|
end
|
|
42
42
|
|
|
43
|
-
def self.create_job(job_name, role_arn:, script_location
|
|
44
|
-
|
|
45
|
-
|
|
43
|
+
def self.create_job(job_name, role_arn:, script_location: nil, script_path: nil,
|
|
44
|
+
script_bucket: nil, script_folder: "scripts", script_filename: nil,
|
|
45
|
+
command_name: "glueetl", default_arguments: {}, description: nil,
|
|
46
|
+
worker_type: nil, number_of_workers: nil, timeout: 2880,
|
|
47
|
+
max_retries: 0, allocated_capacity: nil, glue_version: nil)
|
|
48
|
+
@logger = DataDrain.configuration.logger
|
|
46
49
|
DataDrain::Validations.validate_glue_name!(:job_name, job_name)
|
|
50
|
+
|
|
51
|
+
final_script_location = resolve_script_location(
|
|
52
|
+
script_location: script_location,
|
|
53
|
+
script_path: script_path,
|
|
54
|
+
script_bucket: script_bucket,
|
|
55
|
+
script_folder: script_folder,
|
|
56
|
+
script_filename: script_filename
|
|
57
|
+
)
|
|
58
|
+
|
|
47
59
|
opts = {
|
|
48
60
|
name: job_name,
|
|
49
61
|
role: role_arn,
|
|
50
62
|
command: {
|
|
51
63
|
name: command_name,
|
|
52
64
|
python_version: "3",
|
|
53
|
-
script_location:
|
|
65
|
+
script_location: final_script_location
|
|
54
66
|
}
|
|
55
67
|
}
|
|
56
68
|
opts[:default_arguments] = default_arguments unless default_arguments.empty?
|
|
@@ -63,13 +75,24 @@ module DataDrain
|
|
|
63
75
|
opts[:glue_version] = glue_version if glue_version
|
|
64
76
|
|
|
65
77
|
client.create_job(**opts)
|
|
78
|
+
safe_log(:info, "glue_runner.job_create", {
|
|
79
|
+
job: job_name,
|
|
80
|
+
glue_version: glue_version,
|
|
81
|
+
worker_type: worker_type,
|
|
82
|
+
number_of_workers: number_of_workers
|
|
83
|
+
})
|
|
66
84
|
get_job(job_name)
|
|
85
|
+
rescue Aws::Glue::Errors::ServiceError => e
|
|
86
|
+
safe_log(:error, "glue_runner.job_create_error",
|
|
87
|
+
{ job: job_name }.merge(exception_metadata(e)))
|
|
88
|
+
raise
|
|
67
89
|
end
|
|
68
90
|
|
|
69
91
|
def self.update_job(job_name, role_arn: nil, command_name: nil, script_location: nil,
|
|
70
92
|
default_arguments: nil, description: nil, worker_type: nil,
|
|
71
93
|
number_of_workers: nil, timeout: nil, max_retries: nil, allocated_capacity: nil,
|
|
72
94
|
glue_version: nil)
|
|
95
|
+
@logger = DataDrain.configuration.logger
|
|
73
96
|
DataDrain::Validations.validate_glue_name!(:job_name, job_name)
|
|
74
97
|
job_update = {}
|
|
75
98
|
job_update[:role] = role_arn if role_arn
|
|
@@ -87,30 +110,77 @@ module DataDrain
|
|
|
87
110
|
job_update[:glue_version] = glue_version if glue_version
|
|
88
111
|
|
|
89
112
|
client.update_job(job_name: job_name, job_update: job_update)
|
|
113
|
+
safe_log(:info, "glue_runner.job_update", {
|
|
114
|
+
job: job_name,
|
|
115
|
+
changed_fields: job_update.keys.map(&:to_s)
|
|
116
|
+
})
|
|
90
117
|
get_job(job_name)
|
|
118
|
+
rescue Aws::Glue::Errors::ServiceError => e
|
|
119
|
+
safe_log(:error, "glue_runner.job_update_error",
|
|
120
|
+
{ job: job_name }.merge(exception_metadata(e)))
|
|
121
|
+
raise
|
|
91
122
|
end
|
|
92
123
|
|
|
93
124
|
def self.delete_job(job_name)
|
|
125
|
+
@logger = DataDrain.configuration.logger
|
|
94
126
|
DataDrain::Validations.validate_glue_name!(:job_name, job_name)
|
|
95
127
|
client.delete_job(job_name: job_name)
|
|
96
|
-
|
|
128
|
+
safe_log(:info, "glue_runner.job_delete", { job: job_name })
|
|
129
|
+
true
|
|
130
|
+
rescue Aws::Glue::Errors::EntityNotFoundException
|
|
131
|
+
safe_log(:info, "glue_runner.job_delete_skipped", { job: job_name, reason: "not_found" })
|
|
132
|
+
false
|
|
133
|
+
rescue Aws::Glue::Errors::ServiceError => e
|
|
134
|
+
safe_log(:error, "glue_runner.job_delete_error",
|
|
135
|
+
{ job: job_name }.merge(exception_metadata(e)))
|
|
136
|
+
raise
|
|
97
137
|
end
|
|
98
138
|
|
|
99
|
-
def self.ensure_job(job_name, role_arn:, script_location
|
|
100
|
-
|
|
101
|
-
|
|
102
|
-
|
|
139
|
+
def self.ensure_job(job_name, role_arn:, script_location: nil, script_path: nil,
|
|
140
|
+
script_bucket: nil, script_folder: "scripts", script_filename: nil,
|
|
141
|
+
command_name: "glueetl", default_arguments: {}, description: nil,
|
|
142
|
+
worker_type: nil, number_of_workers: nil, timeout: 2880,
|
|
143
|
+
max_retries: 0, allocated_capacity: nil, glue_version: nil)
|
|
144
|
+
@logger = DataDrain.configuration.logger
|
|
145
|
+
|
|
146
|
+
final_script_location = resolve_script_location(
|
|
147
|
+
script_location: script_location,
|
|
148
|
+
script_path: script_path,
|
|
149
|
+
script_bucket: script_bucket,
|
|
150
|
+
script_folder: script_folder,
|
|
151
|
+
script_filename: script_filename
|
|
152
|
+
)
|
|
153
|
+
|
|
103
154
|
if job_exists?(job_name)
|
|
104
|
-
|
|
105
|
-
|
|
106
|
-
|
|
107
|
-
|
|
108
|
-
|
|
109
|
-
|
|
110
|
-
|
|
155
|
+
current = get_job(job_name)
|
|
156
|
+
desired = {
|
|
157
|
+
role: role_arn,
|
|
158
|
+
command_name: command_name,
|
|
159
|
+
script_location: final_script_location,
|
|
160
|
+
default_arguments: default_arguments,
|
|
161
|
+
description: description,
|
|
162
|
+
worker_type: worker_type,
|
|
163
|
+
number_of_workers: number_of_workers,
|
|
164
|
+
timeout: timeout,
|
|
165
|
+
max_retries: max_retries,
|
|
166
|
+
glue_version: glue_version
|
|
167
|
+
}
|
|
168
|
+
changed = changed_fields(desired, current)
|
|
169
|
+
if changed.empty?
|
|
170
|
+
safe_log(:info, "glue_runner.job_unchanged", { job: job_name })
|
|
171
|
+
current
|
|
172
|
+
else
|
|
173
|
+
safe_log(:info, "glue_runner.job_exists", { job: job_name })
|
|
174
|
+
update_job(job_name, role_arn: role_arn, command_name: command_name,
|
|
175
|
+
script_location: final_script_location, default_arguments: default_arguments,
|
|
176
|
+
description: description, worker_type: worker_type,
|
|
177
|
+
number_of_workers: number_of_workers, timeout: timeout,
|
|
178
|
+
max_retries: max_retries, allocated_capacity: allocated_capacity,
|
|
179
|
+
glue_version: glue_version)
|
|
180
|
+
end
|
|
111
181
|
else
|
|
112
182
|
safe_log(:info, "glue_runner.job_created", { job: job_name })
|
|
113
|
-
create_job(job_name, role_arn: role_arn, script_location:
|
|
183
|
+
create_job(job_name, role_arn: role_arn, script_location: final_script_location,
|
|
114
184
|
command_name: command_name, default_arguments: default_arguments,
|
|
115
185
|
description: description, worker_type: worker_type,
|
|
116
186
|
number_of_workers: number_of_workers, timeout: timeout,
|
|
@@ -119,6 +189,72 @@ module DataDrain
|
|
|
119
189
|
end
|
|
120
190
|
end
|
|
121
191
|
|
|
192
|
+
def self.changed_fields(desired, current)
|
|
193
|
+
changed = []
|
|
194
|
+
changed << :role if current.role != desired[:role]
|
|
195
|
+
changed << :command if current.command.name != desired[:command_name] ||
|
|
196
|
+
current.command.script_location != desired[:script_location]
|
|
197
|
+
changed << :default_arguments if current.default_arguments != desired[:default_arguments]
|
|
198
|
+
changed << :description if current.description != desired[:description]
|
|
199
|
+
changed << :worker_type if current.worker_type != desired[:worker_type]
|
|
200
|
+
changed << :number_of_workers if current.number_of_workers != desired[:number_of_workers]
|
|
201
|
+
changed << :timeout if current.timeout != desired[:timeout]
|
|
202
|
+
changed << :max_retries if current.max_retries != desired[:max_retries]
|
|
203
|
+
changed << :glue_version if current.glue_version != desired[:glue_version]
|
|
204
|
+
changed
|
|
205
|
+
end
|
|
206
|
+
private_class_method :changed_fields
|
|
207
|
+
|
|
208
|
+
def self.resolve_script_location(script_location:, script_path:, script_bucket:, script_folder:, script_filename:)
|
|
209
|
+
both_set = script_location && script_path
|
|
210
|
+
raise DataDrain::ConfigurationError, "provee script_location o script_path, no ambos" if both_set
|
|
211
|
+
|
|
212
|
+
return script_location if script_location
|
|
213
|
+
raise ArgumentError, "script_location o script_path es requerido" unless script_path
|
|
214
|
+
raise DataDrain::ConfigurationError, "script_path requiere script_bucket" unless script_bucket
|
|
215
|
+
|
|
216
|
+
upload_script(
|
|
217
|
+
local_path: script_path,
|
|
218
|
+
bucket: script_bucket,
|
|
219
|
+
folder: script_folder,
|
|
220
|
+
filename: script_filename
|
|
221
|
+
)
|
|
222
|
+
end
|
|
223
|
+
private_class_method :resolve_script_location
|
|
224
|
+
|
|
225
|
+
def self.upload_script(local_path:, bucket:, folder: "scripts", filename: nil)
|
|
226
|
+
@logger = DataDrain.configuration.logger
|
|
227
|
+
|
|
228
|
+
unless File.exist?(local_path)
|
|
229
|
+
raise DataDrain::ConfigurationError,
|
|
230
|
+
"Script local '#{local_path}' no existe"
|
|
231
|
+
end
|
|
232
|
+
|
|
233
|
+
actual_filename = filename || File.basename(local_path)
|
|
234
|
+
s3_key = "#{folder.chomp("/")}/#{actual_filename}"
|
|
235
|
+
bytes = File.size(local_path)
|
|
236
|
+
|
|
237
|
+
adapter = DataDrain::Storage.adapter
|
|
238
|
+
unless adapter.is_a?(DataDrain::Storage::S3)
|
|
239
|
+
raise DataDrain::ConfigurationError,
|
|
240
|
+
"upload_script requiere storage_mode = :s3, actual: #{DataDrain.configuration.storage_mode}"
|
|
241
|
+
end
|
|
242
|
+
|
|
243
|
+
s3_path = adapter.upload_file(local_path, bucket, s3_key, content_type: "text/x-python")
|
|
244
|
+
|
|
245
|
+
safe_log(:info, "glue_runner.script_uploaded", {
|
|
246
|
+
local_path: local_path,
|
|
247
|
+
s3_path: s3_path,
|
|
248
|
+
bytes: bytes
|
|
249
|
+
})
|
|
250
|
+
|
|
251
|
+
s3_path
|
|
252
|
+
rescue Aws::S3::Errors::ServiceError => e
|
|
253
|
+
safe_log(:error, "glue_runner.script_upload_error",
|
|
254
|
+
{ local_path: local_path, bucket: bucket }.merge(exception_metadata(e)))
|
|
255
|
+
raise
|
|
256
|
+
end
|
|
257
|
+
|
|
122
258
|
def self.run_and_wait(job_name, arguments = {}, polling_interval: 30, max_wait_seconds: nil)
|
|
123
259
|
config = DataDrain.configuration
|
|
124
260
|
config.validate!
|
|
data/lib/data_drain/storage/base.rb
CHANGED
|
@@ -55,6 +55,18 @@ module DataDrain
|
|
|
55
55
|
raise NotImplementedError, "#{self.class} debe implementar #destroy_partitions"
|
|
56
56
|
end
|
|
57
57
|
|
|
58
|
+
# Uploads a local file to the storage.
|
|
59
|
+
#
|
|
60
|
+
# @param local_path [String]
|
|
61
|
+
# @param bucket [String]
|
|
62
|
+
# @param s3_key [String] relative key (e.g. "scripts/export.py")
|
|
63
|
+
# @param content_type [String, nil]
|
|
64
|
+
# @return [String] full URI of the uploaded file
|
|
65
|
+
# @raise [NotImplementedError]
|
|
66
|
+
def upload_file(local_path, bucket, s3_key, content_type: nil)
|
|
67
|
+
raise NotImplementedError, "#{self.class} debe implementar #upload_file"
|
|
68
|
+
end
|
|
69
|
+
|
|
58
70
|
protected
|
|
59
71
|
|
|
60
72
|
# @param bucket [String]
|
|
data/lib/data_drain/storage/local.rb
CHANGED
|
@@ -27,6 +27,19 @@ module DataDrain
|
|
|
27
27
|
"#{build_path_base(bucket, folder_name, partition_path)}/**/*.parquet"
|
|
28
28
|
end
|
|
29
29
|
|
|
30
|
+
# @param local_path [String]
|
|
31
|
+
# @param bucket [String] Destination directory
|
|
32
|
+
# @param s3_key [String] Relative path inside the bucket
|
|
33
|
+
# @param content_type [String, nil] Ignored in local mode
|
|
34
|
+
# @return [String] Absolute path to the destination file
|
|
35
|
+
def upload_file(local_path, bucket, s3_key, content_type: nil)
|
|
36
|
+
_ = content_type
|
|
37
|
+
dest_path = File.join(bucket, s3_key)
|
|
38
|
+
FileUtils.mkdir_p(File.dirname(dest_path))
|
|
39
|
+
FileUtils.cp(local_path, dest_path)
|
|
40
|
+
dest_path
|
|
41
|
+
end
|
|
42
|
+
|
|
30
43
|
# @param bucket [String]
|
|
31
44
|
# @param folder_name [String]
|
|
32
45
|
# @param partition_keys [Array<Symbol>]
|
|
data/lib/data_drain/storage/s3.rb
CHANGED
|
@@ -38,6 +38,23 @@ module DataDrain
|
|
|
38
38
|
delete_in_batches(client, bucket, objects)
|
|
39
39
|
end
|
|
40
40
|
|
|
41
|
+
# @param local_path [String]
|
|
42
|
+
# @param bucket [String]
|
|
43
|
+
# @param s3_key [String]
|
|
44
|
+
# @param content_type [String, nil]
|
|
45
|
+
# @return [String] "s3://bucket/key"
|
|
46
|
+
def upload_file(local_path, bucket, s3_key, content_type: nil)
|
|
47
|
+
client = s3_client
|
|
48
|
+
|
|
49
|
+
File.open(local_path, "rb") do |file|
|
|
50
|
+
params = { bucket: bucket, key: s3_key, body: file }
|
|
51
|
+
params[:content_type] = content_type if content_type
|
|
52
|
+
client.put_object(**params)
|
|
53
|
+
end
|
|
54
|
+
|
|
55
|
+
"s3://#{bucket}/#{s3_key}"
|
|
56
|
+
end
|
|
57
|
+
|
|
41
58
|
private
|
|
42
59
|
|
|
43
60
|
# @return [Aws::S3::Client]
|
|
data/lib/data_drain/validations.rb
CHANGED
|
@@ -6,7 +6,7 @@ module DataDrain
|
|
|
6
6
|
# Regex that validates SQL identifiers (tables, columns, etc.).
|
|
7
7
|
# Allows letters, underscores and numbers (not at the start).
|
|
8
8
|
IDENTIFIER_REGEX = /\A[a-zA-Z_][a-zA-Z0-9_]*\z/
|
|
9
|
-
GLUE_NAME_REGEX = /\A[
|
|
9
|
+
GLUE_NAME_REGEX = /\A(?![_-])[a-zA-Z0-9_-]+\z/
|
|
10
10
|
|
|
11
11
|
module_function
|
|
12
12
|
|
|
@@ -14,7 +14,7 @@ module DataDrain
|
|
|
14
14
|
return if GLUE_NAME_REGEX.match?(value.to_s)
|
|
15
15
|
|
|
16
16
|
raise DataDrain::ConfigurationError,
|
|
17
|
-
"#{name} '#{value}' no es un nombre válido para Glue Job (usa solo letras, números y
|
|
17
|
+
"#{name} '#{value}' no es un nombre válido para Glue Job (usa solo letras, números, '-' y '_')"
|
|
18
18
|
end
|
|
19
19
|
|
|
20
20
|
def validate_identifier!(name, value)
|
data/lib/data_drain/version.rb
CHANGED