data_drain 0.3.2 → 0.5.0
- checksums.yaml +4 -4
- data/.rubocop.yml +12 -0
- data/CHANGELOG.md +43 -0
- data/README.md +30 -0
- data/docs/IMPROVEMENT_PLAN.md +114 -0
- data/docs/execution/v0.4.0-OBSERVACIONES.md +144 -0
- data/docs/execution/v0.4.0.md +1216 -0
- data/docs/execution/v0.5.0-OBSERVACIONES.md +167 -0
- data/docs/execution/v0.5.0.md +900 -0
- data/docs/glue-jobs-lifecycle.md +330 -0
- data/docs/glue_pyspark_example.py +49 -19
- data/lib/data_drain/glue_runner.rb +236 -1
- data/lib/data_drain/storage/base.rb +12 -0
- data/lib/data_drain/storage/local.rb +13 -0
- data/lib/data_drain/storage/s3.rb +17 -0
- data/lib/data_drain/validations.rb +8 -0
- data/lib/data_drain/version.rb +1 -1
- data/skill/SKILL.md +64 -3
- data/skill/references/eventos-telemetria.md +8 -0
- metadata +6 -1
@@ -0,0 +1,330 @@
# Glue Jobs Lifecycle

Full lifecycle management of AWS Glue Jobs from the gem.

## Methods

### `job_exists?(job_name)` → Boolean

Checks whether a job exists in Glue.

```ruby
DataDrain::GlueRunner.job_exists?("my-job")
# => true
```

- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
- Returns `false` if the job does not exist (`Aws::Glue::Errors::EntityNotFoundException` is rescued internally).
- Lets any other AWS error propagate uncaught.

### `get_job(job_name)` → Aws::Glue::Types::Job

Fetches the full configuration of a job.

```ruby
job = DataDrain::GlueRunner.get_job("my-job")
job.name # => "my-job"
job.role # => "arn:aws:iam::123:role/GlueRole"
job.command # => { name: "glueetl", python_version: "3", script_location: "s3://..." }
job.default_arguments # => { "--extra-files" => "s3://..." }
```

- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
- Raises `Aws::Glue::Errors::EntityNotFoundException` if the job does not exist.

### Uploading local scripts (v0.5.0+)

Since v0.5.0 the gem can upload PySpark scripts to S3 automatically.

```ruby
# Modern option: local script uploaded by the gem
DataDrain::GlueRunner.create_job(
  "my-job",
  script_path: "scripts/glue/export.py", # local
  script_bucket: "my-bucket",
  script_folder: "scripts",
  role_arn: "arn:aws:iam::123:role/GlueRole"
)
# → Uploads scripts/glue/export.py to s3://my-bucket/scripts/export.py
# → Creates the job
```

**Upload parameters:**
- `script_path` (String): local path to the Python script.
- `script_bucket` (String): destination S3 bucket. **Required when `script_path` is used.**
- `script_folder` (String): folder inside the bucket. Default: `"scripts"`.
- `script_filename` (String, nil): overrides the file name in S3. Default: the basename of the local file.

**`script_location` vs `script_path`:**
- `script_location:` → previous behavior, nothing is uploaded.
- `script_path:` + `script_bucket:` → the gem uploads to S3 first, then creates the Job.
- If both are passed → `DataDrain::ConfigurationError`.
- If neither is passed → `ArgumentError`.

**Important:** the upload **overwrites** any existing file at the same path.
It is not idempotent in the strict sense. Use `script_filename:` with a hash or timestamp
if you need versioning.
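
One way to get the versioning mentioned above is to derive the S3 file name from the script's content. The following is a minimal sketch; `versioned_script_filename` is a hypothetical helper, not part of data_drain, and the 12-character digest length is an arbitrary choice.

```ruby
require "digest"

# Hypothetical helper (not part of data_drain): builds a content-addressed
# file name so every distinct script body maps to its own S3 key.
def versioned_script_filename(local_path, content)
  ext    = File.extname(local_path)        # e.g. ".py"
  base   = File.basename(local_path, ext)  # e.g. "export"
  digest = Digest::SHA256.hexdigest(content)[0, 12]
  "#{base}-#{digest}#{ext}"
end

puts versioned_script_filename("scripts/glue/export.py", "print('hello')\n")
```

The result could then be passed as `script_filename:` so that uploading a changed script never overwrites the key an existing job points at.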

### Concurrency (known limitation)

There is no distributed lock. If two processes call `upload_script` with the same destination
at the same time, the last `put_object` to reach S3 wins. For PySpark scripts
this is typically low risk (scripts are small, and concurrent writes to the same path
are rare).

### Minimum IAM permissions

The IAM role/user that runs `upload_script` needs:

```json
{
  "Effect": "Allow",
  "Action": ["s3:PutObject"],
  "Resource": "arn:aws:s3:::my-bucket/scripts/*"
}
```

To use it with `create_job`/`ensure_job`, the Glue permissions are also required
(see the "Glue permissions" section at the beginning of this document), plus permission for the
Glue Job's IAM role to read the script:

```json
{
  "Effect": "Allow",
  "Action": ["s3:GetObject"],
  "Resource": "arn:aws:s3:::my-bucket/scripts/*"
}
```

(The latter goes on the Glue Job's role, not on the Ruby application's role.)
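
The two fragments above are single statements; an attachable policy also needs the `Version` and `Statement` wrapper. As a rough sketch, the full documents could be generated like this. `script_policy` is a hypothetical helper, and the bucket and folder names are the placeholders used in this document.

```ruby
require "json"

# Hypothetical helper: wraps one S3 statement in a complete IAM policy document.
def script_policy(bucket:, folder:, action:)
  {
    "Version" => "2012-10-17",
    "Statement" => [
      {
        "Effect" => "Allow",
        "Action" => [action],
        "Resource" => "arn:aws:s3:::#{bucket}/#{folder}/*"
      }
    ]
  }
end

# Policy for the Ruby application that uploads the script.
uploader_policy = script_policy(bucket: "my-bucket", folder: "scripts", action: "s3:PutObject")
# Policy for the Glue Job role that reads the script.
glue_role_policy = script_policy(bucket: "my-bucket", folder: "scripts", action: "s3:GetObject")

puts JSON.pretty_generate(uploader_policy)
puts JSON.pretty_generate(glue_role_policy)
```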

### Standalone API: `upload_script`

For cases where you only want to upload (without creating a Job):

```ruby
s3_path = DataDrain::GlueRunner.upload_script(
  local_path: "scripts/glue/export.py",
  bucket: "my-bucket",
  folder: "scripts"
)
# => "s3://my-bucket/scripts/export.py"
```

Requires `storage_mode = :s3`; otherwise it raises `DataDrain::ConfigurationError`.
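
To preview where a script will land before uploading, the key construction can be reproduced locally. This sketch mirrors the key-building lines of `upload_script` in the `glue_runner.rb` hunk of this diff; `script_destination` is a hypothetical name, and the `s3://bucket/key` return shape is assumed from the example above rather than from the storage adapter's source.

```ruby
# Hypothetical helper: computes the destination without touching S3.
def script_destination(local_path:, bucket:, folder: "scripts", filename: nil)
  actual_filename = filename || File.basename(local_path)
  # chomp("/") drops one trailing slash so "scripts/" and "scripts" behave the same.
  key = "#{folder.chomp("/")}/#{actual_filename}"
  "s3://#{bucket}/#{key}"
end

puts script_destination(local_path: "scripts/glue/export.py", bucket: "my-bucket")
puts script_destination(local_path: "scripts/glue/export.py", bucket: "my-bucket",
                        folder: "glue/", filename: "export-v2.py")
```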

### `create_job(job_name, role_arn:, ...)` → Aws::Glue::Types::Job

Creates a new job in Glue and returns the created job.

**Required parameters:**
- `job_name` (String): job name
- `role_arn` (String): ARN of the Glue IAM role

**Script parameters (mutually exclusive):**
- `script_location` (String): S3 path of the Python script (previous behavior)
- `script_path` + `script_bucket` (String): uploads the local script to S3 first (v0.5.0+)

**Optional parameters:**
- `script_folder` (String): S3 folder. Default: `"scripts"`.
- `script_filename` (String, nil): overrides the file name in S3.
- `command_name` (String): command name (`"glueetl"`, `"pythonshell"`). Default: `"glueetl"`.
- `default_arguments` (Hash): default job arguments
- `description` (String): job description
- `timeout` (Integer): timeout in minutes. Default: `2880` (48h)
- `max_retries` (Integer): retries. Default: `0`
- `allocated_capacity` (Integer): legacy DPU setting. Prefer `worker_type` + `number_of_workers`
- `worker_type` (String): `"Standard"`, `"G.1X"`, `"G.2X"`, `"G.4X"`, `"G.8X"`
- `number_of_workers` (Integer): number of workers (requires `worker_type`)
- `glue_version` (String): Glue version (e.g. `"4.0"`)

```ruby
job = DataDrain::GlueRunner.create_job(
  "my-job",
  role_arn: "arn:aws:iam::123:role/GlueServiceRole",
  script_location: "s3://my-bucket/scripts/export.py",
  default_arguments: { "--extra-files" => "s3://my-bucket/scripts/udf.py" },
  timeout: 1440,
  max_retries: 2,
  worker_type: "G.1X",
  number_of_workers: 10
)
```

- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
- Lets AWS errors propagate uncaught (duplicate name, invalid role, etc.)

### `update_job(job_name, ...)` → Aws::Glue::Types::Job

Updates an existing job and returns the updated job.

Accepts the same job settings as `create_job` (`role_arn`, `command_name`, `script_location`, `default_arguments`, `description`, `timeout`, `max_retries`, `allocated_capacity`, `worker_type`, `number_of_workers`, `glue_version`), all optional. The local-upload parameters (`script_path`, `script_bucket`, `script_folder`, `script_filename`) are not accepted. Only the parameters provided are sent in the update, and the command is only updated when both `command_name` and `script_location` are given.

```ruby
job = DataDrain::GlueRunner.update_job(
  "my-job",
  command_name: "glueetl", # required together with script_location
  script_location: "s3://my-bucket/scripts/export-v2.py",
  timeout: 720
)
```

- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
- Raises `Aws::Glue::Errors::EntityNotFoundException` if the job does not exist.
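
In the `glue_runner.rb` source included in this diff, `update_job` only sets `:command` when both `command_name` and `script_location` are given. The sketch below reproduces that hash-building logic in isolation to make the consequence visible. `build_job_update` is a hypothetical stand-in that covers only a few of the real parameters.

```ruby
# Hypothetical stand-in for the hash-building part of update_job.
def build_job_update(role_arn: nil, command_name: nil, script_location: nil,
                     timeout: nil, max_retries: nil)
  update = {}
  update[:role] = role_arn if role_arn
  # The command is included only when BOTH values are present.
  if command_name && script_location
    update[:command] = { name: command_name, python_version: "3", script_location: script_location }
  end
  update[:timeout] = timeout if timeout
  update[:max_retries] = max_retries if max_retries
  update
end

without_command = build_job_update(script_location: "s3://my-bucket/scripts/export-v2.py", timeout: 720)
with_command = build_job_update(command_name: "glueetl",
                                script_location: "s3://my-bucket/scripts/export-v2.py", timeout: 720)

puts without_command.keys.inspect # [:timeout]
puts with_command.keys.inspect    # [:command, :timeout]
```

Passing `script_location:` alone therefore leaves the job's script untouched, so callers that want to move a job to a new script should pass `command_name:` as well.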

### `delete_job(job_name)` → Boolean

Deletes a job from Glue. It is idempotent.

```ruby
DataDrain::GlueRunner.delete_job("my-job")
# => true (the job existed and was deleted)

DataDrain::GlueRunner.delete_job("nonexistent")
# => false (the job did not exist)
```

- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
- Lets any other AWS error propagate uncaught.

### `ensure_job(job_name, role_arn:, ...)` → Aws::Glue::Types::Job

Creates or updates a job idempotently, diffing the desired configuration against the current one.

- If the job does not exist → `create_job`
- If the job exists with a different config → `update_job`
- If the job exists with an identical config → no-op, returns the current job (logged as `glue_runner.job_unchanged`)

```ruby
job = DataDrain::GlueRunner.ensure_job(
  "my-job",
  role_arn: "arn:aws:iam::123:role/GlueServiceRole",
  script_location: "s3://my-bucket/scripts/export.py",
  timeout: 1440
)
```

It also supports uploading a local script (v0.5.0+):

```ruby
job = DataDrain::GlueRunner.ensure_job(
  "my-job",
  script_path: "scripts/glue/export.py",
  script_bucket: "my-bucket",
  script_folder: "scripts",
  role_arn: "arn:aws:iam::123:role/GlueServiceRole",
  timeout: 1440
)
```

- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
- Lets AWS errors propagate uncaught.
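
The diffing step can be illustrated without AWS. This is a simplified sketch: the gem's own `changed_fields` (shown in the `glue_runner.rb` hunk of this diff) compares a fixed list of fields, while the generic version below compares whatever keys are passed. `CurrentJob` is a hypothetical stand-in for `Aws::Glue::Types::Job`.

```ruby
# Hypothetical stand-in for the job object returned by Glue.
CurrentJob = Struct.new(:role, :timeout, :worker_type, :glue_version, keyword_init: true)

# Returns the desired keys whose value differs from the current job.
def changed_fields(desired, current)
  desired.keys.reject { |field| current[field] == desired[field] }
end

current = CurrentJob.new(
  role: "arn:aws:iam::123:role/GlueServiceRole",
  timeout: 1440,
  worker_type: "G.1X",
  glue_version: "4.0"
)

same = { role: "arn:aws:iam::123:role/GlueServiceRole", timeout: 1440,
         worker_type: "G.1X", glue_version: "4.0" }
p changed_fields(same, current) # [] -> no-op

# A nil desired value counts as a change when the current job reports a value.
partial = same.merge(timeout: 720, glue_version: nil)
p changed_fields(partial, current) # [:timeout, :glue_version]
```

Because a `nil` desired value differs from any value the current job reports, leaving a field unset may trigger an update on every call if the service fills in its own default for that field. Passing explicit values for such fields keeps `ensure_job` a no-op on repeat runs.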

### `run_and_wait(job_name, arguments = {}, ...)` → Boolean

Runs an existing job and waits for it to complete.

```ruby
DataDrain::GlueRunner.run_and_wait(
  "my-job",
  { "--start_date" => "2025-01-01", "--end_date" => "2025-02-01" },
  polling_interval: 60,
  max_wait_seconds: 7200
)
# => true (SUCCEEDED)
```

- Raises `RuntimeError` if the job fails (`FAILED`, `STOPPED`, `TIMEOUT`).
- Raises `DataDrain::Error` if `max_wait_seconds` is exceeded.
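
The poll-until-terminal behavior described above can be sketched without AWS. This is not the gem's implementation, which is only partly visible in this diff. `wait_for_run` and `WaitTimeout` are hypothetical names, with `WaitTimeout` standing in for `DataDrain::Error`, and the job state comes from an injected callable instead of the Glue client.

```ruby
# Hypothetical stand-in for DataDrain::Error.
class WaitTimeout < StandardError; end

TERMINAL_FAILURES = %w[FAILED STOPPED TIMEOUT].freeze

# fetch_state is any callable returning the current run state as a String.
def wait_for_run(fetch_state, polling_interval: 30, max_wait_seconds: nil)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  loop do
    state = fetch_state.call
    return true if state == "SUCCEEDED"
    raise "Glue job ended in state #{state}" if TERMINAL_FAILURES.include?(state)

    # The monotonic clock is unaffected by wall-clock adjustments.
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    raise WaitTimeout, "exceeded #{max_wait_seconds}s" if max_wait_seconds && elapsed >= max_wait_seconds

    sleep polling_interval
  end
end

states = %w[RUNNING RUNNING SUCCEEDED]
puts wait_for_run(-> { states.shift }, polling_interval: 0) # true
```

Injecting the state source keeps the loop testable: a lambda over a fixed list of states exercises success, failure and timeout paths without starting a real job run.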

## Full pattern: ensure_job + run_and_wait + PySpark

End-to-end workflow for archiving and purging PostgreSQL tables using AWS Glue:

```ruby
# 1. Make sure the Glue Job exists with the desired config (idempotent)
DataDrain::GlueRunner.ensure_job(
  "my-export-job",
  role_arn: "arn:aws:iam::123:role/GlueServiceRole",
  script_location: "s3://my-bucket/scripts/glue_pyspark_export.py",
  glue_version: "4.0",
  worker_type: "G.1X",
  number_of_workers: 10,
  timeout: 1440
)

# 2. Run the export (delegated to distributed Spark on Glue)
DataDrain::GlueRunner.run_and_wait(
  "my-export-job",
  {
    "--start_date" => start_date.to_fs(:db),
    "--end_date" => end_date.to_fs(:db),
    "--s3_bucket" => bucket,
    "--s3_folder" => table,
    "--db_url" => "jdbc:postgresql://#{host}:#{port}/#{db}",
    "--db_user" => db_user,
    "--db_password" => db_password,
    "--db_table" => table,
    "--partition_by" => partition_keys.join(",")
  },
  polling_interval: 60,
  max_wait_seconds: 7200
)

# 3. Verify integrity and purge Postgres (DataDrain only reads Parquet)
DataDrain::Engine.new(
  bucket: bucket,
  folder_name: table,
  start_date: start_date,
  end_date: end_date,
  table_name: table,
  partition_keys: partition_keys,
  skip_export: true # the export was already done by Glue
).call
```
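
The workflow above calls `to_fs(:db)`, which comes from ActiveSupport. Outside Rails, the date arguments can be built with plain Ruby. A minimal sketch follows; `glue_export_arguments` is a hypothetical helper, and it leaves out the database connection arguments, which would be merged in separately.

```ruby
require "date"

# Hypothetical helper: builds the non-credential job arguments with plain Ruby.
def glue_export_arguments(start_date:, end_date:, bucket:, table:, partition_keys:)
  {
    "--start_date" => start_date.strftime("%Y-%m-%d"),
    "--end_date" => end_date.strftime("%Y-%m-%d"),
    "--s3_bucket" => bucket,
    "--s3_folder" => table,
    "--db_table" => table,
    "--partition_by" => partition_keys.join(",")
  }
end

args = glue_export_arguments(
  start_date: Date.new(2025, 1, 1),
  end_date: Date.new(2025, 2, 1),
  bucket: "my-bucket",
  table: "events",
  partition_keys: %w[year month]
)
puts args["--start_date"]   # 2025-01-01
puts args["--partition_by"] # year,month
```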

### Prerequisites

1. **Upload the script to S3:**
   ```bash
   aws s3 cp glue_pyspark_export.py s3://my-bucket/scripts/
   ```

2. **IAM Role** with permissions for: Glue, S3 (read the script + write to the destination bucket), RDS/Postgres (via JDBC)

3. **PySpark script** at `s3://my-bucket/scripts/glue_pyspark_export.py` (see [example](../glue_pyspark_example.py))

## Naming conventions

AWS Glue allows letters (`a-zA-Z`), digits (`0-9`), hyphens (`-`) and underscores (`_`). Spaces and special characters are not allowed.

```ruby
# Valid
DataDrain::GlueRunner.job_exists?("my-export-job-v2")
DataDrain::GlueRunner.job_exists?("my_export_job")

# Invalid: raises ConfigurationError
DataDrain::GlueRunner.job_exists?("-starts-with-dash")
# DataDrain::ConfigurationError: job_name '-starts-with-dash' no es un nombre válido para Glue Job
```
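
A name check consistent with the rules and examples above could look like the sketch below. The pattern is an assumption: the actual regex lives in `validations.rb`, which is not shown in this diff, so `GLUE_NAME_PATTERN` and `valid_glue_name?` are hypothetical.

```ruby
# Assumed pattern: letters, digits, "-" and "_", not starting with a hyphen.
GLUE_NAME_PATTERN = /\A[a-zA-Z0-9_][a-zA-Z0-9_-]*\z/

def valid_glue_name?(name)
  name.is_a?(String) && GLUE_NAME_PATTERN.match?(name)
end

puts valid_glue_name?("my-export-job-v2")  # true
puts valid_glue_name?("-starts-with-dash") # false
puts valid_glue_name?("has space")         # false
```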

## Telemetry events

| Event | Level | Description |
|-------|-------|-------------|
| `glue_runner.start` | INFO | Before `start_job_run` |
| `glue_runner.job_create` | INFO | Job created successfully |
| `glue_runner.job_update` | INFO | Job updated (includes `changed_fields`) |
| `glue_runner.job_delete` | INFO | Job deleted successfully |
| `glue_runner.job_delete_skipped` | INFO | `delete_job` called on a nonexistent job |
| `glue_runner.job_exists` | INFO | Job found in `ensure_job` (and it differs) |
| `glue_runner.job_created` | INFO | Job created in `ensure_job` |
| `glue_runner.job_unchanged` | INFO | Job exists with an identical config in `ensure_job` |
| `glue_runner.job_create_error` | ERROR | Error in `create_job` |
| `glue_runner.job_update_error` | ERROR | Error in `update_job` |
| `glue_runner.job_delete_error` | ERROR | Error in `delete_job` |
| `glue_runner.script_uploaded` | INFO | Script uploaded to S3 (v0.5.0+) |
| `glue_runner.script_upload_error` | ERROR | Error uploading the script to S3 (v0.5.0+) |
| `glue_runner.polling` | INFO | Status check during `run_and_wait` |
| `glue_runner.complete` | INFO | Job finished `SUCCEEDED` |
| `glue_runner.failed` | ERROR | Job failed with `FAILED\|STOPPED\|TIMEOUT` |
| `glue_runner.timeout` | ERROR | `max_wait_seconds` exceeded |
@@ -1,11 +1,29 @@
 """
 AWS Glue (PySpark) script compatible with DataDrain::GlueRunner.
 
-
-
+To create the Glue Job programmatically (instead of via the console):
+
+# Modern option: local script uploaded by the gem (v0.5.0+)
+DataDrain::GlueRunner.ensure_job(
+  "my-export-job",
+  script_path: "docs/glue_pyspark_example.py",
+  script_bucket: "my-bucket",
+  script_folder: "scripts",
+  role_arn: "arn:aws:iam::123:role/GlueServiceRole",
+  worker_type: "G.1X",
+  number_of_workers: 10,
+  timeout: 1440
+)
+# -> Uploads this file to s3://my-bucket/scripts/glue_pyspark_example.py
+# -> Creates the Job pointing at that path
+
+# Run it
+DataDrain::GlueRunner.run_and_wait("my-export-job", { "--start_date" => "2025-01-01", ... })
+
+Required job arguments: JOB_NAME, start_date, end_date, s3_bucket, s3_folder,
 db_url, db_user, db_password, db_table, partition_by.
 
-Customize the
+Customize the derived-columns section according to each table's partition_keys.
 """
 
 import sys
@@ -15,27 +33,38 @@ from awsglue.context import GlueContext
 from awsglue.job import Job
 from pyspark.sql.functions import col, year, month
 
-args = getResolvedOptions(
-
-
-
+args = getResolvedOptions(
+    sys.argv,
+    [
+        "JOB_NAME",
+        "start_date",
+        "end_date",
+        "s3_bucket",
+        "s3_folder",
+        "db_url",
+        "db_user",
+        "db_password",
+        "db_table",
+        "partition_by",
+    ],
+)
 
 sc = SparkContext()
 glueContext = GlueContext(sc)
 spark = glueContext.spark_session
 job = Job(glueContext)
-job.init(args[
+job.init(args["JOB_NAME"], args)
 
 options = {
-    "url": args[
-    "dbtable": args[
-    "user": args[
-    "password": args[
+    "url": args["db_url"],
+    "dbtable": args["db_table"],
+    "user": args["db_user"],
+    "password": args["db_password"],
     "sampleQuery": (
         f"SELECT * FROM {args['db_table']} "
         f"WHERE created_at >= '{args['start_date']}' "
         f"AND created_at < '{args['end_date']}'"
-    )
+    ),
 }
 
 df = spark.read.format("jdbc").options(**options).load()
@@ -43,18 +72,19 @@ df = spark.read.format("jdbc").options(**options).load()
 # Add the derived columns needed for the partitions.
 # isp_id already exists in the source table; only add the computed ones.
 # Customize this section according to each table's partition_keys.
-df_final = (
-
-    .withColumn("month", month(col("created_at")))
+df_final = df.withColumn("year", year(col("created_at"))).withColumn(
+    "month", month(col("created_at"))
 )
 
 output_path = f"s3://{args['s3_bucket']}/{args['s3_folder']}/"
-partitions = args[
+partitions = args["partition_by"].split(",")
 
-(
+(
+    df_final.write.mode("overwrite")
     .partitionBy(*partitions)
     .format("parquet")
     .option("compression", "zstd")
-    .save(output_path)
+    .save(output_path)
+)
 
 job.commit()
@@ -19,10 +19,245 @@ module DataDrain
     # @return [Boolean] true if the Job finished successfully (SUCCEEDED).
     # @raise [DataDrain::Error] if max_wait_seconds is exceeded before SUCCEEDED.
     # @raise [RuntimeError] if the Job fails or is stopped.
+    def self.client
+      @client ||= Aws::Glue::Client.new(region: DataDrain.configuration.aws_region)
+    end
+
+    class << self
+      attr_writer :client
+    end
+
+    def self.job_exists?(job_name)
+      DataDrain::Validations.validate_glue_name!(:job_name, job_name)
+      get_job(job_name)
+      true
+    rescue Aws::Glue::Errors::EntityNotFoundException
+      false
+    end
+
+    def self.get_job(job_name)
+      DataDrain::Validations.validate_glue_name!(:job_name, job_name)
+      client.get_job(job_name: job_name).job
+    end
+
+    def self.create_job(job_name, role_arn:, script_location: nil, script_path: nil,
+                        script_bucket: nil, script_folder: "scripts", script_filename: nil,
+                        command_name: "glueetl", default_arguments: {}, description: nil,
+                        worker_type: nil, number_of_workers: nil, timeout: 2880,
+                        max_retries: 0, allocated_capacity: nil, glue_version: nil)
+      @logger = DataDrain.configuration.logger
+      DataDrain::Validations.validate_glue_name!(:job_name, job_name)
+
+      final_script_location = resolve_script_location(
+        script_location: script_location,
+        script_path: script_path,
+        script_bucket: script_bucket,
+        script_folder: script_folder,
+        script_filename: script_filename
+      )
+
+      opts = {
+        name: job_name,
+        role: role_arn,
+        command: {
+          name: command_name,
+          python_version: "3",
+          script_location: final_script_location
+        }
+      }
+      opts[:default_arguments] = default_arguments unless default_arguments.empty?
+      opts[:description] = description if description
+      opts[:timeout] = timeout if timeout
+      opts[:max_retries] = max_retries if max_retries
+      opts[:allocated_capacity] = allocated_capacity if allocated_capacity
+      opts[:worker_type] = worker_type if worker_type
+      opts[:number_of_workers] = number_of_workers if number_of_workers
+      opts[:glue_version] = glue_version if glue_version
+
+      client.create_job(**opts)
+      safe_log(:info, "glue_runner.job_create", {
+        job: job_name,
+        glue_version: glue_version,
+        worker_type: worker_type,
+        number_of_workers: number_of_workers
+      })
+      get_job(job_name)
+    rescue Aws::Glue::Errors::ServiceError => e
+      safe_log(:error, "glue_runner.job_create_error",
+               { job: job_name }.merge(exception_metadata(e)))
+      raise
+    end
+
+    def self.update_job(job_name, role_arn: nil, command_name: nil, script_location: nil,
+                        default_arguments: nil, description: nil, worker_type: nil,
+                        number_of_workers: nil, timeout: nil, max_retries: nil, allocated_capacity: nil,
+                        glue_version: nil)
+      @logger = DataDrain.configuration.logger
+      DataDrain::Validations.validate_glue_name!(:job_name, job_name)
+      job_update = {}
+      job_update[:role] = role_arn if role_arn
+      if command_name && script_location
+        job_update[:command] =
+          { name: command_name, python_version: "3", script_location: script_location }
+      end
+      job_update[:default_arguments] = default_arguments if default_arguments
+      job_update[:description] = description if description
+      job_update[:timeout] = timeout if timeout
+      job_update[:max_retries] = max_retries if max_retries
+      job_update[:allocated_capacity] = allocated_capacity if allocated_capacity
+      job_update[:worker_type] = worker_type if worker_type
+      job_update[:number_of_workers] = number_of_workers if number_of_workers
+      job_update[:glue_version] = glue_version if glue_version
+
+      client.update_job(job_name: job_name, job_update: job_update)
+      safe_log(:info, "glue_runner.job_update", {
+        job: job_name,
+        changed_fields: job_update.keys.map(&:to_s)
+      })
+      get_job(job_name)
+    rescue Aws::Glue::Errors::ServiceError => e
+      safe_log(:error, "glue_runner.job_update_error",
+               { job: job_name }.merge(exception_metadata(e)))
+      raise
+    end
+
+    def self.delete_job(job_name)
+      @logger = DataDrain.configuration.logger
+      DataDrain::Validations.validate_glue_name!(:job_name, job_name)
+      client.delete_job(job_name: job_name)
+      safe_log(:info, "glue_runner.job_delete", { job: job_name })
+      true
+    rescue Aws::Glue::Errors::EntityNotFoundException
+      safe_log(:info, "glue_runner.job_delete_skipped", { job: job_name, reason: "not_found" })
+      false
+    rescue Aws::Glue::Errors::ServiceError => e
+      safe_log(:error, "glue_runner.job_delete_error",
+               { job: job_name }.merge(exception_metadata(e)))
+      raise
+    end
+
+    def self.ensure_job(job_name, role_arn:, script_location: nil, script_path: nil,
+                        script_bucket: nil, script_folder: "scripts", script_filename: nil,
+                        command_name: "glueetl", default_arguments: {}, description: nil,
+                        worker_type: nil, number_of_workers: nil, timeout: 2880,
+                        max_retries: 0, allocated_capacity: nil, glue_version: nil)
+      @logger = DataDrain.configuration.logger
+
+      final_script_location = resolve_script_location(
+        script_location: script_location,
+        script_path: script_path,
+        script_bucket: script_bucket,
+        script_folder: script_folder,
+        script_filename: script_filename
+      )
+
+      if job_exists?(job_name)
+        current = get_job(job_name)
+        desired = {
+          role: role_arn,
+          command_name: command_name,
+          script_location: final_script_location,
+          default_arguments: default_arguments,
+          description: description,
+          worker_type: worker_type,
+          number_of_workers: number_of_workers,
+          timeout: timeout,
+          max_retries: max_retries,
+          glue_version: glue_version
+        }
+        changed = changed_fields(desired, current)
+        if changed.empty?
+          safe_log(:info, "glue_runner.job_unchanged", { job: job_name })
+          current
+        else
+          safe_log(:info, "glue_runner.job_exists", { job: job_name })
+          update_job(job_name, role_arn: role_arn, command_name: command_name,
+                     script_location: final_script_location, default_arguments: default_arguments,
+                     description: description, worker_type: worker_type,
+                     number_of_workers: number_of_workers, timeout: timeout,
+                     max_retries: max_retries, allocated_capacity: allocated_capacity,
+                     glue_version: glue_version)
+        end
+      else
+        safe_log(:info, "glue_runner.job_created", { job: job_name })
+        create_job(job_name, role_arn: role_arn, script_location: final_script_location,
+                   command_name: command_name, default_arguments: default_arguments,
+                   description: description, worker_type: worker_type,
+                   number_of_workers: number_of_workers, timeout: timeout,
+                   max_retries: max_retries, allocated_capacity: allocated_capacity,
+                   glue_version: glue_version)
+      end
+    end
+
+    def self.changed_fields(desired, current)
+      changed = []
+      changed << :role if current.role != desired[:role]
+      changed << :command if current.command.name != desired[:command_name] ||
+                             current.command.script_location != desired[:script_location]
+      changed << :default_arguments if current.default_arguments != desired[:default_arguments]
+      changed << :description if current.description != desired[:description]
+      changed << :worker_type if current.worker_type != desired[:worker_type]
+      changed << :number_of_workers if current.number_of_workers != desired[:number_of_workers]
+      changed << :timeout if current.timeout != desired[:timeout]
+      changed << :max_retries if current.max_retries != desired[:max_retries]
+      changed << :glue_version if current.glue_version != desired[:glue_version]
+      changed
+    end
+    private_class_method :changed_fields
+
+    def self.resolve_script_location(script_location:, script_path:, script_bucket:, script_folder:, script_filename:)
+      both_set = script_location && script_path
+      raise DataDrain::ConfigurationError, "provee script_location o script_path, no ambos" if both_set
+
+      return script_location if script_location
+      raise ArgumentError, "script_location o script_path es requerido" unless script_path
+      raise DataDrain::ConfigurationError, "script_path requiere script_bucket" unless script_bucket
+
+      upload_script(
+        local_path: script_path,
+        bucket: script_bucket,
+        folder: script_folder,
+        filename: script_filename
+      )
+    end
+    private_class_method :resolve_script_location
+
+    def self.upload_script(local_path:, bucket:, folder: "scripts", filename: nil)
+      @logger = DataDrain.configuration.logger
+
+      unless File.exist?(local_path)
+        raise DataDrain::ConfigurationError,
+              "Script local '#{local_path}' no existe"
+      end
+
+      actual_filename = filename || File.basename(local_path)
+      s3_key = "#{folder.chomp("/")}/#{actual_filename}"
+      bytes = File.size(local_path)
+
+      adapter = DataDrain::Storage.adapter
+      unless adapter.is_a?(DataDrain::Storage::S3)
+        raise DataDrain::ConfigurationError,
+              "upload_script requiere storage_mode = :s3, actual: #{DataDrain.configuration.storage_mode}"
+      end
+
+      s3_path = adapter.upload_file(local_path, bucket, s3_key, content_type: "text/x-python")
+
+      safe_log(:info, "glue_runner.script_uploaded", {
+        local_path: local_path,
+        s3_path: s3_path,
+        bytes: bytes
+      })
+
+      s3_path
+    rescue Aws::S3::Errors::ServiceError => e
+      safe_log(:error, "glue_runner.script_upload_error",
+               { local_path: local_path, bucket: bucket }.merge(exception_metadata(e)))
+      raise
+    end
+
     def self.run_and_wait(job_name, arguments = {}, polling_interval: 30, max_wait_seconds: nil)
       config = DataDrain.configuration
       config.validate!
-      client = Aws::Glue::Client.new(region: config.aws_region)
       start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
 
       @logger = config.logger