data_drain 0.4.0 → 0.5.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +18 -0
- data/docs/execution/v0.5.0-OBSERVACIONES.md +167 -0
- data/docs/execution/v0.5.0.md +900 -0
- data/docs/glue-jobs-lifecycle.md +184 -13
- data/docs/glue_pyspark_example.py +49 -19
- data/lib/data_drain/glue_runner.rb +153 -17
- data/lib/data_drain/storage/base.rb +12 -0
- data/lib/data_drain/storage/local.rb +13 -0
- data/lib/data_drain/storage/s3.rb +17 -0
- data/lib/data_drain/validations.rb +2 -2
- data/lib/data_drain/version.rb +1 -1
- data/skill/SKILL.md +64 -3
- metadata +3 -1
@@ -0,0 +1,900 @@
# Execution Plan: v0.5.0

**Feature:** Script upload to S3 from `GlueRunner`
**Roadmap items:** 37 (new, post-roadmap follow-up)
**Suggested branch:** `feature/v0.5.0`
**Base:** `main` (contains v0.4.0)
**Status:** Pending review (version 2, after review)
**Last updated:** 2026-04-15

---

## 1. Context

As of v0.4.0 the gem manages the full lifecycle of Glue Jobs (`create_job`, `update_job`, `delete_job`, `ensure_job`), but it assumes the PySpark script is already in S3 (decision B1 of the v0.4.0 plan). The caller has to upload it manually:

```bash
aws s3 cp scripts/glue/export.py s3://bucket/scripts/
```

**Problem observed in use:** every service that uses the gem repeats the upload, duplicates logic, and can forget to re-upload the script when the code changes. The file `docs/glue_pyspark_example.py` already exists as a reusable template, but there is no ergonomic way to upload it from Ruby.

**Solution:** add upload support to the gem using the existing `Storage::S3` adapter (do not create a parallel client).

---

## 2. Design decision (re-approved after review)

### Option A: `script_path` in `create_job`/`ensure_job` (RECOMMENDED)

- If `script_location:` (S3 path) is passed → current behavior, no upload.
- If `script_path:` (local file) + `script_bucket:` are passed → the gem uploads to S3 first, then creates the Job.
- If both are passed → `DataDrain::ConfigurationError` (ambiguous).
- If neither is passed → `ArgumentError` (required).

```ruby
# Current usage (unchanged)
DataDrain::GlueRunner.create_job(
  "my-job",
  script_location: "s3://bucket/scripts/export.py",
  role_arn: "arn:aws:iam::123:role/GlueRole"
)

# New usage
DataDrain::GlueRunner.create_job(
  "my-job",
  script_path: "scripts/glue/export.py",
  script_bucket: "my-bucket",
  script_folder: "scripts",
  role_arn: "arn:aws:iam::123:role/GlueRole"
)
# → Uploads scripts/glue/export.py to s3://my-bucket/scripts/export.py
# → Creates the Job pointing at that S3 path
```

### Agent review: v2 incorporated

Review by **big-pickle** on 2026-04-15 (`docs/execution/v0.5.0-OBSERVACIONES.md`). 8 observations verified against the actual code on `main`:

| # | Sev | Finding | Status |
|---|-----|---------|--------|
| 1 | High | Does `s3_client` exist in `Storage::S3`? | ✅ **Verified: it exists** at `storage/s3.rb:44`. Plan v2 reuses it correctly. Note: the SDK accepts `nil` for `access_key_id`/`secret_access_key` and resolves the credential chain automatically. |
| 2 | High | Does `changed_fields` include `script_location`? | ✅ **Verified: yes**, via the `:command` key (lines 174-175 of `glue_runner.rb`): `current.command.script_location != desired[:script_location]`. Plan v2 OK. |
| 3 | Medium | Does `Configuration#local_bucket` exist? | ✅ **Verified: it does not exist**. Plan v2 takes `bucket` as an explicit argument in `Storage::Local#upload_file`. Documented. |
| 4 | Low | Windows paths | Low risk (Glue runs on Linux). Optional, non-blocking test. |
| 5 | Medium | Minimum IAM permissions not documented | **Applied:** add a "Minimum IAM permissions" section in Phase 5.1 (`docs/glue-jobs-lifecycle.md`). |
| 6 | Low | Clear error in `:local` mode | Already covered by the plan v2 test "levanta ConfigurationError si storage_mode es :local". |
| 7 | Info | Explicit content_type `text/x-python` | OK. More predictable than letting S3 infer it. |
| 8 | Info | Concurrency without a lock | **Applied:** document as a limitation in Phase 5.1. Low risk (scripts are small; concurrent writes to the same path are rare). |

**Material changes to plan v2:**
- Phase 5.1 adds a "Minimum IAM permissions" section (obs 5)
- Phase 5.1 adds a "Concurrency" section documenting the limitation (obs 8)

### Second pass by big-pickle: rejected findings

Big-pickle made a second pass, expanding to 11 observations. 2 of them contain verified factual errors and are rejected:

**❌ Obs 1 (2nd pass): "SyntaxError in the `create_job` signature"**

Big-pickle claimed this signature is invalid Ruby:

    def self.create_job(job_name, role_arn:, script_location: nil, ...)

It cited: *"required keyword argument cannot have a default value"*.

**Rejected:** Ruby allows required and optional kwargs to be mixed freely. The cited error applies to a SINGLE kwarg that tries to be required and optional at the same time (`def foo(x:, x: 1)`), not to signatures with several kwargs where some are required and others optional.

Verification:

    $ ruby -e 'def test(a:, b: nil, c:, d: 1); "OK"; end; puts test(a: 1, c: 3)'
    # => OK

The plan keeps `role_arn:` as required (no default). `script_location:`, `script_path:`, etc. are optional with default `nil`. This syntax is valid since Ruby 2.1, the release that introduced required keyword arguments.
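
The point can also be checked by introspection. The sketch below uses a hypothetical stand-in method, `create_job_sketch`, with the same shape as the planned signature (it is not the gem's method), and asks Ruby which kind each parameter is:

```ruby
# Hypothetical stand-in with the same shape as the planned signature.
def create_job_sketch(job_name, role_arn:, script_location: nil, script_path: nil)
  [job_name, role_arn, script_location, script_path]
end

# Ruby reports each parameter kind:
#   :req = required positional, :keyreq = required keyword, :key = optional keyword
params = method(:create_job_sketch).parameters
p params
# => [[:req, :job_name], [:keyreq, :role_arn], [:key, :script_location], [:key, :script_path]]

# Omitting the required keyword fails at call time, not at parse time.
missing_keyword_error =
  begin
    create_job_sketch("my-job")
    nil
  rescue ArgumentError => e
    e
  end
puts missing_keyword_error.class
```

A missing `role_arn:` therefore surfaces as an `ArgumentError` when the method is called, which is a different failure from the `SyntaxError` claimed in the review.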

**❌ Obs 6 (2nd pass): "`spec/data_drain/storage/local_spec.rb` does not exist"**

Big-pickle claimed the file does not exist and asked for it to be created.

**Rejected:** the file has existed since v0.2.0 (item 4, P0 coverage). Verified:

    $ ls spec/data_drain/storage/
    local_spec.rb s3_spec.rb

Plan v2 adds tests to the existing file; it does not create a new one.

**✅ Valid observations (2nd pass):**

- 2, 3: confirmations of obs 1 and 2 from the first pass (already applied).
- 4: already applied in plan v2 (`Storage::Local` needs the directory supplied manually).
- 5, 7, 8, 10, 11: info/OK, no change required.
- 9: debatable decision, `bytes` vs `byte_count`. Keep `bytes` per AWS convention (ContentLength). Documented as a conscious deviation from the Wispro spec.

### Changes vs the original plan (v1)

Key changes from v1 → v2:

| Review finding | Change applied |
|---------------|-----------------|
| Plan v1 created an `Aws::S3::Client.new` parallel to the Storage adapter | **v2 uses the existing `Storage::S3` adapter** with `credential_chain` already configured |
| Plan v1 used a `bucket:` kwarg that conflicts with the one in Engine/FileIngestor/Record | **v2 renames it to `script_bucket:`** for clarity |
| Plan v1 emitted no structured logs | **v2 adds `glue_runner.script_uploaded` + `script_upload_error`** |
| Plan v1 said "idempotent" but `put_object` always overwrites | **v2 documents it as "overwrite always"**, with an optional ETag check in the future |
| Plan v1 mitigated name collisions with docs only | **v2 allows a `script_filename:` override** (`filename:` in `upload_script`; default basename) |
| Plan v1 had no Plan B and no estimate | **v2 adds both** |
| Plan v1 did not update `docs/glue_pyspark_example.py` | **v2 includes it** in Phase 5 |

### Discarded alternatives

- **Separate `upload_script` method with no auto-wiring into `create_job`**: requires 2 steps in the caller. It is still offered as a public API for standalone cases (see Phase 2), but `create_job` calls it internally when applicable.
- **Global bucket in `Configuration`**: adds coupling, and the bucket can vary per job. Discarded.
- **Upload to `bucket/folder_name/` (reuse the Data Lake bucket)**: mixes scripts with archived data. Poor separation.

---

## 3. Proposed API

### New in `Storage::S3`: `upload_file`

A generic primitive. It lives in the adapter, not in `GlueRunner`.

```ruby
# lib/data_drain/storage/s3.rb
def upload_file(local_path, bucket, s3_key, content_type: nil)
  # ...
end
```

In `Storage::Local` it is implemented as a copy to disk (for interface consistency; useful for local tests). In `Storage::Base` it is abstract.

### New in `GlueRunner`: `upload_script`

A semantic wrapper over `Storage.adapter.upload_file` with conventions for Python scripts.

```ruby
DataDrain::GlueRunner.upload_script(
  local_path:,        # Local path to the file (String)
  bucket:,            # Destination S3 bucket (String)
  folder: "scripts",  # S3 folder (default: "scripts")
  filename: nil       # Override for the S3 name (default: basename)
)
# => "s3://bucket/scripts/filename.py"
```

**Behavior:**
- Validates that `local_path` exists (`File.exist?`).
- Delegates to `Storage.adapter.upload_file` with `content_type: "text/x-python"`.
- Returns the full S3 path.
- Emits `glue_runner.script_uploaded` with `local_path`, `s3_path`, `bytes`.
- **Not strictly idempotent:** `put_object` always overwrites. Document this.
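
To make the naming convention concrete, here is a minimal sketch of how the destination key and returned URI are derived from the arguments. `script_s3_uri` is a hypothetical helper written only for illustration; it performs no upload and is not part of the gem.

```ruby
# Hypothetical helper: computes the URI that an upload with these
# arguments would return, following the folder + basename convention.
def script_s3_uri(local_path, bucket, folder: "scripts", filename: nil)
  name = filename || File.basename(local_path)
  key = "#{folder.chomp('/')}/#{name}" # a single trailing slash on the folder is dropped
  "s3://#{bucket}/#{key}"
end

default_uri = script_s3_uri("scripts/glue/export.py", "my-bucket")
override_uri = script_s3_uri("scripts/glue/export.py", "my-bucket",
                             folder: "jobs/", filename: "export_v2.py")

puts default_uri  # s3://my-bucket/scripts/export.py
puts override_uri # s3://my-bucket/jobs/export_v2.py
```

Only the basename of the local path survives, so two local scripts with the same file name in different directories map to the same key unless `filename:` is set.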

### Modification: `create_job`

Signature (new params marked with ⬆):

```ruby
def self.create_job(
  job_name,
  role_arn:,
  script_location: nil,      # ⬆ now optional (previously required)
  script_path: nil,          # ⬆ new
  script_bucket: nil,        # ⬆ new (required if script_path is given)
  script_folder: "scripts",  # ⬆ new
  script_filename: nil,      # ⬆ new (basename override)
  command_name: "glueetl",
  default_arguments: {},
  description: nil,
  worker_type: nil,
  number_of_workers: nil,
  timeout: 2880,
  max_retries: 0,
  allocated_capacity: nil,
  glue_version: nil
)
```

**Resolution logic (private helper):**

```ruby
# @api private
# Error messages stay in Spanish: the specs in 4.7 match on their text.
def self.resolve_script_location(script_location:, script_path:, script_bucket:, script_folder:, script_filename:)
  both_set = script_location && script_path
  raise DataDrain::ConfigurationError, "provee script_location o script_path, no ambos" if both_set

  return script_location if script_location
  raise ArgumentError, "script_location o script_path es requerido" unless script_path
  raise DataDrain::ConfigurationError, "script_path requiere script_bucket" unless script_bucket

  upload_script(
    local_path: script_path,
    bucket: script_bucket,
    folder: script_folder,
    filename: script_filename
  )
end
```
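
The four outcomes of the resolution rules can be exercised without AWS. The following is a self-contained sketch, not the gem's code: `ConfigurationError` is a local stand-in for `DataDrain::ConfigurationError`, the `uploader:` callable replaces `upload_script`, and the error messages are in English purely for this illustration.

```ruby
# Local stand-in for DataDrain::ConfigurationError (assumption for this sketch).
class ConfigurationError < StandardError; end

# Same decision order as the helper above; `uploader` is any callable that
# takes (path, bucket) and returns the uploaded script's URI.
def resolve_script_location_sketch(uploader:, script_location: nil, script_path: nil, script_bucket: nil)
  raise ConfigurationError, "provide script_location or script_path, not both" if script_location && script_path
  return script_location if script_location
  raise ArgumentError, "script_location or script_path is required" unless script_path
  raise ConfigurationError, "script_path requires script_bucket" unless script_bucket

  uploader.call(script_path, script_bucket)
end

# Fake upload: builds the URI without touching S3.
fake_upload = ->(path, bucket) { "s3://#{bucket}/scripts/#{File.basename(path)}" }

passthrough = resolve_script_location_sketch(uploader: fake_upload,
                                             script_location: "s3://bucket/scripts/export.py")
uploaded = resolve_script_location_sketch(uploader: fake_upload,
                                          script_path: "scripts/glue/export.py",
                                          script_bucket: "my-bucket")
puts passthrough
puts uploaded

# The three invalid combinations, in the order the checks fire.
failures = [
  { script_location: "s3://b/s.py", script_path: "x.py", script_bucket: "b" },
  {},
  { script_path: "x.py" }
].map do |kwargs|
  begin
    resolve_script_location_sketch(uploader: fake_upload, **kwargs)
  rescue ConfigurationError, ArgumentError => e
    e.class
  end
end
p failures
```

Checking the "both set" case first means an ambiguous call never triggers an upload as a side effect.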

Called at the start of `create_job`:

```ruby
def self.create_job(job_name, role_arn:, script_location: nil, script_path: nil,
                    script_bucket: nil, script_folder: "scripts", script_filename: nil,
                    ...)
  @logger = DataDrain.configuration.logger
  DataDrain::Validations.validate_glue_name!(:job_name, job_name)

  final_script_location = resolve_script_location(
    script_location: script_location,
    script_path: script_path,
    script_bucket: script_bucket,
    script_folder: script_folder,
    script_filename: script_filename
  )
  # ... the rest of the method uses final_script_location
end
```

### Modification: `ensure_job`

Same extended signature. It passes the kwargs through to `create_job`/`update_job` (`update_job` does not support `script_path` for now; see Phase 3).

---

## 4. Implementation

### 4.1 `Storage::Base#upload_file` (abstract)

**File:** `lib/data_drain/storage/base.rb`

```ruby
# Uploads a local file to the storage backend.
#
# @param local_path [String]
# @param bucket [String]
# @param s3_key [String] relative key (e.g. "scripts/export.py")
# @param content_type [String, nil]
# @return [String] full URI of the uploaded file
# @raise [NotImplementedError]
def upload_file(local_path, bucket, s3_key, content_type: nil)
  raise NotImplementedError, "#{self.class} debe implementar #upload_file"
end
```

### 4.2 `Storage::S3#upload_file`

**File:** `lib/data_drain/storage/s3.rb`

```ruby
# @param local_path [String]
# @param bucket [String]
# @param s3_key [String]
# @param content_type [String, nil]
# @return [String] "s3://bucket/key"
def upload_file(local_path, bucket, s3_key, content_type: nil)
  client = s3_client # already exists, uses the cached config

  File.open(local_path, "rb") do |file|
    params = { bucket: bucket, key: s3_key, body: file }
    params[:content_type] = content_type if content_type
    client.put_object(**params)
  end

  "s3://#{bucket}/#{s3_key}"
end
```

**Note:** this reuses the existing private `s3_client` method (present since v0.3.0), which uses `credential_chain` when applicable.
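
The conditional `content_type` handling is easy to see with a recording fake in place of the real client. This is a sketch under stated assumptions: `RecordingClient` and `upload_file_sketch` are hypothetical names, and the fake only mimics the keyword-argument shape of `put_object`; no AWS SDK is involved.

```ruby
require "tempfile"

# Hypothetical fake: records the params of each put_object call (minus the body).
class RecordingClient
  attr_reader :calls

  def initialize
    @calls = []
  end

  def put_object(**params)
    @calls << params.reject { |key, _| key == :body }
    {}
  end
end

# Same body as the adapter method, with the client injected for the demo.
def upload_file_sketch(client, local_path, bucket, s3_key, content_type: nil)
  File.open(local_path, "rb") do |file|
    params = { bucket: bucket, key: s3_key, body: file }
    params[:content_type] = content_type if content_type # key omitted entirely when nil
    client.put_object(**params)
  end
  "s3://#{bucket}/#{s3_key}"
end

client = RecordingClient.new
Tempfile.create(["export", ".py"]) do |file|
  file.write("# python")
  file.flush
  upload_file_sketch(client, file.path, "my-bucket", "scripts/export.py", content_type: "text/x-python")
  upload_file_sketch(client, file.path, "my-bucket", "export.py")
end
p client.calls
```

The second recorded call has no `:content_type` key at all, which is what the "omite content_type" spec in 4.7 asserts.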

### 4.3 `Storage::Local#upload_file`

**File:** `lib/data_drain/storage/local.rb`

```ruby
# @param local_path [String]
# @param bucket [String] Destination directory
# @param s3_key [String] Relative path inside the bucket
# @param content_type [String, nil] Ignored in local mode
# @return [String] Path to the destination file (absolute if `bucket` is absolute)
#
# The keyword keeps the name `content_type:` to match Base and S3. Declaring it
# as `_content_type:` would make calls that pass `content_type:` raise ArgumentError.
def upload_file(local_path, bucket, s3_key, content_type: nil) # rubocop:disable Lint/UnusedMethodArgument
  dest_path = File.join(bucket, s3_key)
  FileUtils.mkdir_p(File.dirname(dest_path))
  FileUtils.cp(local_path, dest_path)
  dest_path
end
```

Useful for tests in `:local` mode without mocking S3.
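
As a quick check of interface parity, the sketch below runs the copy-to-disk behavior standalone and passes `content_type:` the same way an S3 caller would. `local_upload_file` is a hypothetical free function that assumes the local adapter accepts the same keyword name as the S3 adapter; it is not the adapter class itself.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical standalone version of the local adapter method.
def local_upload_file(local_path, bucket, s3_key, content_type: nil)
  _ = content_type # accepted for interface parity; nothing to do with it on disk
  dest_path = File.join(bucket, s3_key)
  FileUtils.mkdir_p(File.dirname(dest_path))
  FileUtils.cp(local_path, dest_path)
  dest_path
end

copied_content = nil
relative_dest = nil
Dir.mktmpdir do |tmpdir|
  source = File.join(tmpdir, "source.py")
  File.write(source, "# python")

  bucket_dir = File.join(tmpdir, "bucket")
  dest = local_upload_file(source, bucket_dir, "scripts/export.py", content_type: "text/x-python")

  copied_content = File.read(dest)
  relative_dest = dest.delete_prefix("#{bucket_dir}/")
end
puts copied_content
puts relative_dest
```

The intermediate `scripts/` directory is created on demand, so the "bucket" directory does not need to exist beforehand.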

### 4.4 `GlueRunner.upload_script`

**File:** `lib/data_drain/glue_runner.rb`

```ruby
# Uploads a local Python script to the given S3 bucket and returns its S3 path.
#
# @param local_path [String] Local path to the file.
# @param bucket [String] Destination S3 bucket.
# @param folder [String] Folder inside the bucket. Default: "scripts".
# @param filename [String, nil] Override for the name in S3. Default: basename(local_path).
# @return [String] S3 path of the uploaded file (e.g. "s3://bucket/scripts/export.py").
# @raise [DataDrain::ConfigurationError] if local_path does not exist or storage_mode is not :s3.
# @raise [Aws::S3::Errors::ServiceError] if the upload fails.
#
# @note NOT strictly idempotent. `put_object` always overwrites
#   without comparing content. If two callers upload to the same path with
#   different content, the last one wins. Use a different `filename:` or version
#   manually if the files need to coexist.
def self.upload_script(local_path:, bucket:, folder: "scripts", filename: nil)
  @logger = DataDrain.configuration.logger

  unless File.exist?(local_path)
    raise DataDrain::ConfigurationError,
          "Script local '#{local_path}' no existe"
  end

  actual_filename = filename || File.basename(local_path)
  s3_key = "#{folder.chomp("/")}/#{actual_filename}"
  bytes = File.size(local_path)

  adapter = DataDrain::Storage.adapter
  unless adapter.is_a?(DataDrain::Storage::S3)
    raise DataDrain::ConfigurationError,
          "upload_script requiere storage_mode = :s3, actual: #{DataDrain.configuration.storage_mode}"
  end

  s3_path = adapter.upload_file(local_path, bucket, s3_key, content_type: "text/x-python")

  safe_log(:info, "glue_runner.script_uploaded", {
    local_path: local_path,
    s3_path: s3_path,
    bytes: bytes
  })

  s3_path
rescue Aws::S3::Errors::ServiceError => e
  safe_log(:error, "glue_runner.script_upload_error",
           { local_path: local_path, bucket: bucket }.merge(exception_metadata(e)))
  raise
end
```
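
One way to follow the advice in the `@note` about distinct names is to derive the S3 file name from the script's content, so different versions never collide. The sketch below is an illustration only: `versioned_script_filename` is a hypothetical helper, and the 12-character digest prefix is an arbitrary choice, not something the plan specifies.

```ruby
require "digest"
require "tmpdir"

# Hypothetical helper (not part of the gem): derives a content-addressed
# name such as "export-<12 hex chars>.py" from the file's SHA-256 digest.
def versioned_script_filename(local_path)
  ext = File.extname(local_path)
  base = File.basename(local_path, ext)
  digest = Digest::SHA256.file(local_path).hexdigest[0, 12]
  "#{base}-#{digest}#{ext}"
end

first_name = nil
second_name = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "export.py")

  File.write(path, "# python v1")
  first_name = versioned_script_filename(path)

  File.write(path, "# python v2")
  second_name = versioned_script_filename(path)
end
puts first_name
puts second_name
```

The result would be passed as `filename:` (or `script_filename:` on `create_job`). The trade-off is that the job's `script_location` changes on every content change, and old objects accumulate in the bucket unless something cleans them up.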

### 4.5 `GlueRunner.create_job`: signature + resolution

Diff against the current version:

```ruby
def self.create_job(job_name, role_arn:, script_location: nil, script_path: nil,
                    script_bucket: nil, script_folder: "scripts", script_filename: nil,
                    command_name: "glueetl", default_arguments: {}, description: nil,
                    worker_type: nil, number_of_workers: nil, timeout: 2880,
                    max_retries: 0, allocated_capacity: nil, glue_version: nil)
  @logger = DataDrain.configuration.logger
  DataDrain::Validations.validate_glue_name!(:job_name, job_name)

  final_script = resolve_script_location(
    script_location: script_location,
    script_path: script_path,
    script_bucket: script_bucket,
    script_folder: script_folder,
    script_filename: script_filename
  )

  opts = {
    name: job_name,
    role: role_arn,
    command: {
      name: command_name,
      python_version: "3",
      script_location: final_script
    }
  }
  # ... the rest is the same as the current version ...
```

### 4.6 `GlueRunner.ensure_job`: passthrough

Add the same new kwargs and pass them through to `create_job`. **Limitation:** `update_job` does not support `script_path` in this version (it only accepts `script_location`). To update the script of an existing job via `ensure_job`, the caller passes `script_path:`; `ensure_job` uploads it internally (via `upload_script`) and hands the resulting URI to `update_job` as `script_location`. This requires extra logic in `ensure_job`.

**Decision:** `ensure_job` also uploads the script when it receives `script_path`, before comparing `changed_fields`. `current.command.script_location` is compared against the `script_location` returned by the upload.

```ruby
def self.ensure_job(job_name, role_arn:, script_location: nil, script_path: nil,
                    script_bucket: nil, script_folder: "scripts", script_filename: nil,
                    ...)
  @logger = DataDrain.configuration.logger

  final_script = resolve_script_location(
    script_location: script_location,
    script_path: script_path,
    script_bucket: script_bucket,
    script_folder: script_folder,
    script_filename: script_filename
  )

  if job_exists?(job_name)
    current = get_job(job_name)
    desired = { ..., script_location: final_script, ... }
    changed = changed_fields(desired, current)
    # ...
  else
    create_job(job_name, role_arn: role_arn, script_location: final_script, ...)
  end
end
```

**Edge case:** if the caller invokes `ensure_job` with `script_path:` several times with the same content, every call uploads to S3 again (the upload is not idempotent). Document this in the skill.
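
The plan mentions an optional ETag check as future work. A minimal sketch of what such a check could look like follows; it is not part of this release. It relies on the documented S3 behavior that, for single-part uploads without SSE-KMS or SSE-C, the ETag is the hex MD5 of the object body, returned wrapped in double quotes. `upload_needed?` is a hypothetical helper, and in real use the remote ETag would have to be fetched first (for example with the SDK's `head_object`), which this sketch leaves out.

```ruby
require "digest"
require "tmpdir"

# Hypothetical helper: true when the local file differs from the object
# whose ETag is given (nil means the object does not exist yet).
def upload_needed?(local_path, remote_etag)
  return true if remote_etag.nil?

  Digest::MD5.file(local_path).hexdigest != remote_etag.delete('"')
end

results = {}
Dir.mktmpdir do |dir|
  path = File.join(dir, "export.py")
  File.write(path, "# python")
  same_etag = %("#{Digest::MD5.file(path).hexdigest}") # S3 returns the ETag quoted

  results[:missing] = upload_needed?(path, nil)
  results[:unchanged] = upload_needed?(path, same_etag)
  results[:changed] = upload_needed?(path, %("0123456789abcdef0123456789abcdef"))
end
p results
```

Because the equality only holds under the encryption and single-part assumptions above, a mismatch should be treated as "upload again" and never as an error.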

### 4.7 Tests

#### `spec/data_drain/storage/s3_spec.rb`

```ruby
describe "#upload_file" do
  let(:temp_script) { Tempfile.new(["export", ".py"], binmode: true) }

  before do
    temp_script.write("# python content")
    temp_script.close
  end

  after { temp_script.unlink }

  it "sube el archivo a S3 y retorna el path completo" do
    uploaded_params = nil
    s3_client.stub_responses(:put_object, lambda { |context|
      uploaded_params = context.params
      {}
    })

    result = adapter.upload_file(temp_script.path, "my-bucket", "scripts/export.py",
                                 content_type: "text/x-python")

    expect(result).to eq("s3://my-bucket/scripts/export.py")
    expect(uploaded_params[:bucket]).to eq("my-bucket")
    expect(uploaded_params[:key]).to eq("scripts/export.py")
    expect(uploaded_params[:content_type]).to eq("text/x-python")
  end

  it "omite content_type si no se provee" do
    uploaded_params = nil
    s3_client.stub_responses(:put_object, lambda { |context|
      uploaded_params = context.params
      {}
    })

    adapter.upload_file(temp_script.path, "my-bucket", "export.py")

    expect(uploaded_params).not_to have_key(:content_type)
  end
end
```

#### `spec/data_drain/storage/local_spec.rb`

```ruby
describe "#upload_file" do
  it "copia el archivo al destino local y retorna el path absoluto" do
    Dir.mktmpdir do |tmpdir|
      source = File.join(tmpdir, "source.py")
      File.write(source, "# python")

      dest_dir = File.join(tmpdir, "dest")
      result = adapter.upload_file(source, dest_dir, "scripts/export.py")

      expected = File.join(dest_dir, "scripts/export.py")
      expect(result).to eq(expected)
      expect(File.read(expected)).to eq("# python")
    end
  end
end
```

#### `spec/data_drain/glue_runner_spec.rb`

```ruby
describe ".upload_script" do
  let(:temp_script) { Tempfile.new(["export", ".py"], binmode: true) }

  before do
    temp_script.write("# python")
    temp_script.close
    DataDrain.configure { |c| c.storage_mode = :s3; c.aws_region = "us-east-1" }
    DataDrain::Storage.reset_adapter!
  end

  after do
    temp_script.unlink
    DataDrain::Storage.reset_adapter!
  end

  it "sube el script y retorna S3 path" do
    allow(DataDrain::Storage.adapter).to receive(:upload_file)
      .and_return("s3://my-bucket/scripts/export.py")

    result = described_class.upload_script(
      local_path: temp_script.path,
      bucket: "my-bucket",
      folder: "scripts"
    )

    expect(result).to eq("s3://my-bucket/scripts/export.py")
  end

  it "usa filename override si se provee" do
    expected_key = "scripts/custom_name.py"
    expect(DataDrain::Storage.adapter).to receive(:upload_file)
      .with(temp_script.path, "my-bucket", expected_key, content_type: "text/x-python")
      .and_return("s3://my-bucket/#{expected_key}")

    described_class.upload_script(
      local_path: temp_script.path,
      bucket: "my-bucket",
      filename: "custom_name.py"
    )
  end

  it "levanta ConfigurationError si local_path no existe" do
    expect {
      described_class.upload_script(local_path: "/nonexistent.py", bucket: "b")
    }.to raise_error(DataDrain::ConfigurationError, /no existe/)
  end

  it "levanta ConfigurationError si storage_mode es :local" do
    DataDrain.configure { |c| c.storage_mode = :local }
    DataDrain::Storage.reset_adapter!

    expect {
      described_class.upload_script(local_path: temp_script.path, bucket: "b")
    }.to raise_error(DataDrain::ConfigurationError, /storage_mode/)
  end

  it "emite glue_runner.script_uploaded con bytes + s3_path" do
    allow(DataDrain::Storage.adapter).to receive(:upload_file)
      .and_return("s3://my-bucket/scripts/export.py")

    logs = capture_logs {
      described_class.upload_script(local_path: temp_script.path, bucket: "my-bucket")
    }
    expect(logs.find { |l| l.include?("glue_runner.script_uploaded") }).to include("bytes=")
  end

  it "emite glue_runner.script_upload_error y propaga si falla" do
    allow(DataDrain::Storage.adapter).to receive(:upload_file)
      .and_raise(Aws::S3::Errors::ServiceError.new(nil, "AccessDenied"))

    expect {
      described_class.upload_script(local_path: temp_script.path, bucket: "b")
    }.to raise_error(Aws::S3::Errors::ServiceError)
  end
end

describe ".create_job" do
  context "con script_path local" do
    before do
      allow(described_class).to receive(:upload_script)
        .and_return("s3://my-bucket/scripts/export.py")
      glue_client.stub_responses(:create_job, {})
      glue_client.stub_responses(:get_job, { job: { name: "my-job" } })
    end

    it "sube script y crea job con script_location resultante" do
      expect(described_class).to receive(:upload_script).with(
        local_path: "scripts/glue/export.py",
        bucket: "my-bucket",
        folder: "scripts",
        filename: nil
      )

      described_class.create_job(
        "my-job",
        role_arn: "arn:aws:iam::123:role/GlueRole",
        script_path: "scripts/glue/export.py",
        script_bucket: "my-bucket"
      )
    end
  end

  it "levanta ConfigurationError si script_path sin script_bucket" do
    expect {
      described_class.create_job(
        "my-job",
        role_arn: "arn:aws:iam::123:role/GlueRole",
        script_path: "/local/script.py"
      )
    }.to raise_error(DataDrain::ConfigurationError, /script_bucket/)
  end

  it "levanta ConfigurationError si script_path y script_location ambos" do
    expect {
      described_class.create_job(
        "my-job",
        role_arn: "arn:aws:iam::123:role/GlueRole",
        script_path: "/local/script.py",
        script_location: "s3://b/s.py",
        script_bucket: "b"
      )
    }.to raise_error(DataDrain::ConfigurationError, /no ambos/)
  end

  it "levanta ArgumentError si no se provee ni script_location ni script_path" do
    expect {
      described_class.create_job("my-job", role_arn: "arn:aws:iam::123:role/GlueRole")
    }.to raise_error(ArgumentError, /requerido/)
  end
end

describe ".ensure_job con script_path" do
  # ... full passthrough test ...
end
```

---
|
|
632
|
+
|
|
633
|
+
## 5. Docs
|
|
634
|
+
|
|
635
|
+
### 5.1 `docs/glue-jobs-lifecycle.md`
|
|
636
|
+
|
|
637
|
+
Agregar sección "Subir scripts locales (v0.5.0+)":
|
|
638
|
+
|
|
639
|
+
```markdown
|
|
640
|
+
## Subir scripts locales
|
|
641
|
+
|
|
642
|
+
Desde v0.5.0 la gema puede subir scripts PySpark a S3 automáticamente. En lugar de:
|
|
643
|
+
|
|
644
|
+
# 1. Upload manual
|
|
645
|
+
# aws s3 cp scripts/glue/export.py s3://bucket/scripts/
|
|
646
|
+
|
|
647
|
+
# 2. Create job
|
|
648
|
+
DataDrain::GlueRunner.create_job(
|
|
649
|
+
"my-job",
|
|
650
|
+
script_location: "s3://bucket/scripts/export.py",
|
|
651
|
+
role_arn: "arn:aws:iam::123:role/GlueRole"
|
|
652
|
+
)
|
|
653
|
+
|
|
654
|
+
Usar `script_path:` local:
|
|
655
|
+
|
|
656
|
+
DataDrain::GlueRunner.create_job(
|
|
657
|
+
"my-job",
|
|
658
|
+
script_path: "scripts/glue/export.py", # local
|
|
659
|
+
script_bucket: "my-bucket",
|
|
660
|
+
script_folder: "scripts",
|
|
661
|
+
role_arn: "arn:aws:iam::123:role/GlueRole"
|
|
662
|
+
)
|
|
663
|
+
# → Sube a s3://my-bucket/scripts/export.py
|
|
664
|
+
# → Crea el job
|
|
665
|
+
|
|
666
|
+
**`script_filename:`** opcional para override del nombre en S3 (default: basename).
|
|
667
|
+
|
|
668
|
+
**Importante:** el upload **sobrescribe** cualquier archivo existente en el mismo path.
|
|
669
|
+
No es idempotente en sentido estricto. Si varios callers usan el mismo `script_bucket`
|
|
670
|
+
+ `script_folder` + nombre, gana el último. Usar nombres únicos o buckets separados
|
|
671
|
+
por servicio.
|
|
672
|
+
|
|
673
|
+
### Concurrencia (limitación conocida)
|
|
674
|
+
|
|
675
|
+
No hay lock distribuido. Si dos procesos llaman `upload_script` con el mismo destino
|
|
676
|
+
simultáneamente, el último `put_object` en llegar a S3 gana. Para scripts PySpark
|
|
677
|
+
esto es típicamente bajo riesgo (scripts son pequeños, rara vez hay writes
|
|
678
|
+
concurrentes al mismo path). Si necesitás coordinación estricta:
|
|
679
|
+
|
|
680
|
+
- Usar `filename:` con identificador único (hash del contenido, timestamp, run_id)
|
|
681
|
+
- O lock externo (DynamoDB, Redis) antes de `upload_script`
|
|
682
|
+
|
|
683
|
+
### Permisos IAM mínimos
|
|
684
|
+
|
|
685
|
+
El IAM role/user que ejecuta `upload_script` necesita:
|
|
686
|
+
|
|
687
|
+
{
|
|
688
|
+
"Effect": "Allow",
|
|
689
|
+
"Action": ["s3:PutObject"],
|
|
690
|
+
"Resource": "arn:aws:s3:::my-bucket/scripts/*"
|
|
691
|
+
}
|
|
692
|
+
|
|
693
|
+
Using it with `create_job`/`ensure_job` also requires the Glue permissions
|
|
694
|
+
(see the "Permisos Glue" section at the start of this document), plus permission for the
|
|
695
|
+
Glue Job's IAM role to read the script:
|
|
696
|
+
|
|
697
|
+
{
|
|
698
|
+
"Effect": "Allow",
|
|
699
|
+
"Action": ["s3:GetObject"],
|
|
700
|
+
"Resource": "arn:aws:s3:::my-bucket/scripts/*"
|
|
701
|
+
}
|
|
702
|
+
|
|
703
|
+
(The latter goes on the Glue Job's role, not on the Ruby application's role.)
|
|
704
|
+
|
|
705
|
+
### Standalone API: `upload_script`
|
|
706
|
+
|
|
707
|
+
For cases where you only want to upload (without creating a Job):
|
|
708
|
+
|
|
709
|
+
s3_path = DataDrain::GlueRunner.upload_script(
|
|
710
|
+
local_path: "scripts/glue/export.py",
|
|
711
|
+
bucket: "my-bucket",
|
|
712
|
+
folder: "scripts"
|
|
713
|
+
)
|
|
714
|
+
# => "s3://my-bucket/scripts/export.py"
|
|
715
|
+
```
|
|
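The examples above map a local path plus `script_bucket` / `script_folder` (and an optional filename override) to an `s3://` URI. A minimal sketch of that mapping, assuming the default name is the basename of the local file as the docs state; `script_s3_uri` is a hypothetical helper written for illustration, not the gem's implementation:

```ruby
# Hypothetical helper (not part of data_drain): derives the destination
# URI from bucket + folder + filename, defaulting to the local basename.
def script_s3_uri(local_path:, bucket:, folder:, filename: nil)
  name = filename || File.basename(local_path)
  # Drop a trailing slash on the folder and skip it entirely when empty.
  key = [folder.to_s.delete_suffix("/"), name].reject(&:empty?).join("/")
  "s3://#{bucket}/#{key}"
end

puts script_s3_uri(
  local_path: "scripts/glue/export.py",
  bucket: "my-bucket",
  folder: "scripts"
)
```

Two callers that pass the same bucket, folder and filename resolve to the same URI, which is why the last upload wins.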
716
|
+
|
|
717
|
+
### 5.2 `skill/SKILL.md` FAQ
|
|
718
|
+
|
|
719
|
+
Add:
|
|
720
|
+
|
|
721
|
+
```markdown
|
|
722
|
+
### How do I upload a Glue script from my repo?
|
|
723
|
+
|
|
724
|
+
Since v0.5.0 you can use `script_path:` instead of `script_location:`:
|
|
725
|
+
|
|
726
|
+
DataDrain::GlueRunner.ensure_job(
|
|
727
|
+
"my-export-job",
|
|
728
|
+
script_path: "scripts/glue/export.py",
|
|
729
|
+
script_bucket: "my-bucket",
|
|
730
|
+
script_folder: "scripts",
|
|
731
|
+
role_arn: ENV["GLUE_ROLE_ARN"]
|
|
732
|
+
)
|
|
733
|
+
|
|
734
|
+
The gem uploads the script to S3 using the existing `Storage::S3` adapter
|
|
735
|
+
(with `credential_chain` if you have an IAM role). **Requires `storage_mode = :s3`**.
|
|
736
|
+
With `storage_mode = :local`, it raises `ConfigurationError`.
|
|
737
|
+
|
|
738
|
+
**Overwrite:** every invocation overwrites the file in S3. Useful so that
|
|
739
|
+
the script tracks the code in the repo. If you need versioning, use `script_filename:`
|
|
740
|
+
with a hash or timestamp.
|
|
741
|
+
```
|
|
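The FAQ entry suggests passing a hash or timestamp through `script_filename:` when versioning is needed. A minimal sketch of a content-hash name, assuming a 12-character SHA-256 prefix is enough to tell versions apart; `versioned_script_filename` is a hypothetical helper, not something the gem provides:

```ruby
require "digest"

# Hypothetical helper: "export.py" becomes "export-<12 hex chars>.py".
# The same file contents always produce the same name.
def versioned_script_filename(local_path)
  ext = File.extname(local_path)
  base = File.basename(local_path, ext)
  digest = Digest::SHA256.file(local_path).hexdigest[0, 12]
  "#{base}-#{digest}#{ext}"
end
```

The result could then be passed as `script_filename:` to `create_job`/`ensure_job`. Note that each new version leaves the previous object in the bucket, so some cleanup policy would still be needed.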
742
|
+
|
|
743
|
+
### 5.3 `skill/references/api-detallada.md`
|
|
744
|
+
|
|
745
|
+
Add a GlueRunner section covering the 3 new/modified methods (`upload_script`, `create_job`, `ensure_job`).
|
|
746
|
+
|
|
747
|
+
### 5.4 `skill/references/eventos-telemetria.md`
|
|
748
|
+
|
|
749
|
+
Add:
|
|
750
|
+
|
|
751
|
+
```markdown
|
|
752
|
+
### `glue_runner.script_uploaded`
|
|
753
|
+
**Level:** INFO.
|
|
754
|
+
**Fields:** `local_path`, `s3_path`, `bytes`.
|
|
755
|
+
|
|
756
|
+
### `glue_runner.script_upload_error`
|
|
757
|
+
**Level:** ERROR.
|
|
758
|
+
**Fields:** `local_path`, `bucket`, `error_class`, `error_message`.
|
|
759
|
+
**Consequence:** the `Aws::S3::Errors::ServiceError` propagates to the caller.
|
|
760
|
+
```
|
|
761
|
+
|
|
762
|
+
### 5.5 `docs/glue_pyspark_example.py`
|
|
763
|
+
|
|
764
|
+
The file's docstring already has an `ensure_job` example with `script_location`. Add a variant with `script_path`:
|
|
765
|
+
|
|
766
|
+
```python
|
|
767
|
+
"""
|
|
768
|
+
# Modern option: local script uploaded by the gem (v0.5.0+)
|
|
769
|
+
DataDrain::GlueRunner.ensure_job(
|
|
770
|
+
"my-export-job",
|
|
771
|
+
script_path: "docs/glue_pyspark_example.py",
|
|
772
|
+
script_bucket: "my-bucket",
|
|
773
|
+
script_folder: "scripts",
|
|
774
|
+
role_arn: "arn:aws:iam::123:role/GlueServiceRole",
|
|
775
|
+
worker_type: "G.1X",
|
|
776
|
+
number_of_workers: 10,
|
|
777
|
+
timeout: 1440
|
|
778
|
+
)
|
|
779
|
+
# → Uploads this file to s3://my-bucket/scripts/glue_pyspark_example.py
|
|
780
|
+
# → Creates the Job pointing at that path
|
|
781
|
+
"""
|
|
782
|
+
```
|
|
783
|
+
|
|
784
|
+
### 5.6 README
|
|
785
|
+
|
|
786
|
+
Add 3 lines to the "Orquestación con AWS Glue" section pointing to the detailed docs.
|
|
787
|
+
|
|
788
|
+
### 5.7 CHANGELOG
|
|
789
|
+
|
|
790
|
+
```markdown
|
|
791
|
+
## [0.5.0] - 2026-04-XX
|
|
792
|
+
|
|
793
|
+
### Features
|
|
794
|
+
- `Storage::S3#upload_file` and `Storage::Local#upload_file`: primitive for uploading files to the configured storage. (item 37)
|
|
795
|
+
- `GlueRunner.upload_script(local_path:, bucket:, folder:, filename:)`: uploads a local Python script to S3 using the existing `Storage::S3` adapter. Emits `glue_runner.script_uploaded` (INFO) and `glue_runner.script_upload_error` (ERROR). (item 37)
|
|
796
|
+
- `GlueRunner.create_job` and `GlueRunner.ensure_job` accept `script_path:` + `script_bucket:` + `script_folder:` + `script_filename:` to upload local scripts automatically. When `script_location:` is used, behavior is identical to before. (item 37)
|
|
797
|
+
|
|
798
|
+
### Docs
|
|
799
|
+
- `docs/glue-jobs-lifecycle.md`: "Subir scripts locales" section with the complete pattern.
|
|
800
|
+
- `docs/glue_pyspark_example.py`: usage example with `script_path`.
|
|
801
|
+
|
|
802
|
+
### Notes
|
|
803
|
+
- **Upload is NOT idempotent in the strict sense:** `put_object` always overwrites. Documented.
|
|
804
|
+
- `upload_script` requires `storage_mode = :s3`. With `:local` it raises `ConfigurationError`.
|
|
805
|
+
```
|
|
806
|
+
|
|
807
|
+
---
|
|
808
|
+
|
|
809
|
+
## 6. Estimate
|
|
810
|
+
|
|
811
|
+
| Phase | Summary | Estimate |
|
|
812
|
+
|------|---------|------------|
|
|
813
|
+
| 0 | Setup + mark item 37 `[~]` in the roadmap | 10min |
|
|
814
|
+
| 1 | Abstract `Storage::Base#upload_file` + specs | 30min |
|
|
815
|
+
| 2 | `Storage::S3#upload_file` + specs | 1h |
|
|
816
|
+
| 3 | `Storage::Local#upload_file` + specs | 30min |
|
|
817
|
+
| 4 | `GlueRunner.upload_script` + specs (happy, error, storage_mode, bytes, filename) | 1.5h |
|
|
818
|
+
| 5 | `resolve_script_location` helper + integrate into `create_job` and `ensure_job` + specs | 1.5h |
|
|
819
|
+
| 6 | Docs (`glue-jobs-lifecycle.md`, SKILL FAQ, `api-detallada.md`, events, PySpark example, README) | 1-1.5h |
|
|
820
|
+
| 7 | Release (CHANGELOG, version bump, tag) | 30min |
|
|
821
|
+
|
|
822
|
+
**Total:** 6-7h, one focused day.
|
|
823
|
+
|
|
824
|
+
**Breaking:** none. `script_location:` is still accepted; it is now optional, with `script_path:` as the alternative. Existing callers need no changes.
|
|
825
|
+
|
|
826
|
+
---
|
|
827
|
+
|
|
828
|
+
## 7. Plan B: blocking scenarios
|
|
829
|
+
|
|
830
|
+
| If... | Then... |
|
|
831
|
+
|-------|-------------|
|
|
832
|
+
| `Storage::S3#upload_file` cannot handle large binary files (OOM) | Stream in chunks with `Aws::S3::Object#upload_file` (multipart). Out of scope for the initial v0.5.0; open as item 39 post-release. |
|
|
833
|
+
| Missing S3 permissions (`s3:PutObject`) | AWS error propagated, `glue_runner.script_upload_error` logged. Caller adjusts the IAM policy. Document the minimum set in the skill. |
|
|
834
|
+
| `content_type: "text/x-python"` rejected by S3 | S3 accepts any content_type. If an issue shows up, make it optional (default nil → S3 stores the object as `binary/octet-stream`). |
|
|
835
|
+
| An existing Glue Job points to an S3 path, and changing `script_path` to another name leaves an orphaned file | Document it. Possible future work: cleanup with `delete_object` when the filename changes. |
|
|
836
|
+
| `ensure_job` with `script_path` runs twice in a row → 2 unnecessary uploads | Acceptable in v0.5.0. Future (item 38): check ETag/MD5 before uploading to short-circuit. |
|
|
837
|
+
| `put_object` tests with a stubbed File body do not work correctly | Use `stub_responses` with a lambda that inspects the params, as in the v0.3.1 S3 specs. |
|
|
838
|
+
| `Storage::Local#upload_file` is "cheating" (a pointless local copy) | If it bothers anyone, leave it as `raise NotImplementedError` and have `upload_script` validate `storage_mode` strictly. Not blocking. |
|
|
839
|
+
|
|
840
|
+
---
|
|
841
|
+
|
|
842
|
+
## 8. Execution order
|
|
843
|
+
|
|
844
|
+
```
|
|
845
|
+
Phase 0: setup
|
|
846
|
+
│
|
|
847
|
+
▼
|
|
848
|
+
Phase 1: abstract Storage::Base#upload_file
|
|
849
|
+
│
|
|
850
|
+
▼
|
|
851
|
+
Phase 2: Storage::S3#upload_file + tests
|
|
852
|
+
│
|
|
853
|
+
▼
|
|
854
|
+
Phase 3: Storage::Local#upload_file + tests
|
|
855
|
+
│
|
|
856
|
+
▼
|
|
857
|
+
Phase 4: GlueRunner.upload_script + tests (uses the Storage adapter)
|
|
858
|
+
│
|
|
859
|
+
▼
|
|
860
|
+
Phase 5: resolve_script_location + integrate into create_job + ensure_job + tests
|
|
861
|
+
│
|
|
862
|
+
▼
|
|
863
|
+
Phase 6: Docs (lifecycle, SKILL, api, events, pyspark example, README)
|
|
864
|
+
│
|
|
865
|
+
▼
|
|
866
|
+
Phase 7: Release (CHANGELOG + bump + tag)
|
|
867
|
+
```
|
|
868
|
+
|
|
869
|
+
Each phase closes with `bundle exec rspec` + `bundle exec rubocop` green before moving on.
|
|
870
|
+
|
|
871
|
+
---
|
|
872
|
+
|
|
873
|
+
## 9. Roadmap: new items
|
|
874
|
+
|
|
875
|
+
Add to the "Follow-ups post-roadmap" section of `docs/IMPROVEMENT_PLAN.md`:
|
|
876
|
+
|
|
877
|
+
- **Item 37** (v0.5.0): script upload to S3 from `GlueRunner`. `Storage::*#upload_file` primitive + `GlueRunner.upload_script` + integration into `create_job`/`ensure_job` via `script_path:`.
|
|
878
|
+
- **Item 38** (future): idempotent `upload_script` with an ETag/MD5 check.
|
|
879
|
+
- **Item 39** (future): multipart upload for large scripts (>5MB).
|
|
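The ETag/MD5 check proposed for item 38 could look roughly like the sketch below. It assumes the remote object was written with a single-part `put_object` and is not KMS-encrypted, the case in which S3's ETag is the hex MD5 of the body; `upload_needed?` is a hypothetical name, and how the remote ETag is fetched (for example via `head_object`) is left out:

```ruby
require "digest"

# Hypothetical helper for the future item 38, not part of v0.5.0.
# remote_etag: the ETag string S3 returned for the existing object
# (normally wrapped in double quotes), or nil if there is no object yet.
def upload_needed?(local_path, remote_etag)
  return true if remote_etag.nil?

  # Strip the surrounding quotes before comparing against the local MD5.
  Digest::MD5.file(local_path).hexdigest != remote_etag.delete('"')
end
```

For multipart or KMS-encrypted objects the ETag is not a plain MD5, so this check would always report a difference and fall back to uploading, which is safe but not a short-circuit.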
880
|
+
|
|
881
|
+
---
|
|
882
|
+
|
|
883
|
+
## 10. What the reviewer should verify
|
|
884
|
+
|
|
885
|
+
1. **Does `Storage::S3#upload_file` reuse the existing private `s3_client`?** (it must, rather than creating a parallel client)
|
|
886
|
+
2. **`script_bucket` does not collide with the `bucket` of Engine/FileIngestor/Record.** (rename applied)
|
|
887
|
+
3. **`glue_runner.script_uploaded` / `script_upload_error` follow Wispro-Observability-Spec v1** (component/event first, snake_case, `bytes` as Integer, no units)
|
|
888
|
+
4. **`resolve_script_location` is `@api private`** with `private_class_method`
|
|
889
|
+
5. **Tests cover the 4 validation errors** (script_path without bucket, both set, neither set, local file missing)
|
|
890
|
+
6. **`ensure_job` test with script_path** verifies passthrough and a correct diff post-upload
|
|
891
|
+
7. **`docs/glue_pyspark_example.py` updated** with a script_path example
|
|
892
|
+
8. **Do not break existing callers:** `script_location:` keeps working identically
|
|
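The four validation errors in point 5 can be sketched as a small guard. This is an illustration only: `validate_script_args!` is a hypothetical stand-in for the plan's private `resolve_script_location`, and `ArgumentError` is an assumption, since this section does not name the error class the gem raises:

```ruby
# Hypothetical guard (not the gem's code). Returns :remote when the
# caller already has an S3 location, :upload when a local file must be sent.
def validate_script_args!(script_location: nil, script_path: nil, script_bucket: nil)
  if script_location && script_path
    raise ArgumentError, "pass either script_location: or script_path:, not both"
  end
  unless script_location || script_path
    raise ArgumentError, "script_location: or script_path: is required"
  end
  return :remote if script_location

  raise ArgumentError, "script_bucket: is required with script_path:" unless script_bucket
  raise ArgumentError, "local script not found: #{script_path}" unless File.file?(script_path)

  :upload
end
```

Checking "both set" and "neither set" before touching the filesystem keeps the `script_location:` path free of any local I/O, which matches the requirement that existing callers behave identically.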
893
|
+
|
|
894
|
+
---
|
|
895
|
+
|
|
896
|
+
## 11. Related items
|
|
897
|
+
|
|
898
|
+
- **v0.4.0 items 32-36** (Glue Jobs Lifecycle): this feature extends `create_job`/`ensure_job`.
|
|
899
|
+
- **v0.3.0 item 6** (Record sandboxing): not directly related, but confirm that `Storage::S3#upload_file` does not break the sandbox (it does not use DuckDB, OK).
|
|
900
|
+
- **Storage Adapter pattern**: established in v0.2.0. This plan adds a new operation without breaking the interface.
|