data_drain 0.3.1 → 0.4.0
- checksums.yaml +4 -4
- data/.rubocop.yml +12 -0
- data/CHANGELOG.md +31 -0
- data/README.md +30 -0
- data/docs/IMPROVEMENT_PLAN.md +114 -0
- data/docs/execution/v0.4.0-OBSERVACIONES.md +144 -0
- data/docs/execution/v0.4.0.md +1216 -0
- data/docs/glue-jobs-lifecycle.md +159 -0
- data/lib/data_drain/engine.rb +2 -2
- data/lib/data_drain/file_ingestor.rb +1 -1
- data/lib/data_drain/glue_runner.rb +100 -1
- data/lib/data_drain/validations.rb +8 -0
- data/lib/data_drain/version.rb +1 -1
- data/skill/references/eventos-telemetria.md +8 -0
- metadata +6 -3
- /data/docs/execution/{v0.3.1-OBSERVACIONES.md → archive/v0.3.1-OBSERVACIONES.md} +0 -0
- /data/docs/execution/{v0.3.1.md → archive/v0.3.1.md} +0 -0
data/docs/glue-jobs-lifecycle.md
ADDED
@@ -0,0 +1,159 @@
+# Glue Jobs Lifecycle
+
+Full lifecycle management of AWS Glue Jobs from the gem.
+
+## Methods
+
+### `job_exists?(job_name)` → Boolean
+
+Checks whether a job exists in Glue.
+
+```ruby
+DataDrain::GlueRunner.job_exists?("my-job")
+# => true
+```
+
+- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
+- Returns `false` if the job does not exist (`Aws::Glue::Errors::EntityNotFoundException` is rescued internally).
+- Lets other AWS errors propagate uncaught.
+
+### `get_job(job_name)` → Aws::Glue::Types::Job
+
+Fetches the full configuration of a job.
+
+```ruby
+job = DataDrain::GlueRunner.get_job("my-job")
+job.name # => "my-job"
+job.role # => "arn:aws:iam::123:role/GlueRole"
+job.command # => { name: "glueetl", python_version: "3", script_location: "s3://..." }
+job.default_arguments # => { "--extra-files" => "s3://..." }
+```
+
+- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
+- Raises `Aws::Glue::Errors::EntityNotFoundException` if the job does not exist.
+
+### `create_job(job_name, role_arn:, script_location:, ...)` → Aws::Glue::Types::Job
+
+Creates a new job in Glue and returns the created job.
+
+**Required parameters:**
+- `job_name` (String): name of the job
+- `role_arn` (String): ARN of the Glue IAM role
+- `script_location` (String): S3 path of the Python script
+
+**Optional parameters:**
+- `command_name` (String): command name (`"glueetl"`, `"pythonshell"`). Default: `"glueetl"`.
+- `default_arguments` (Hash): default arguments for the job
+- `description` (String): description of the job
+- `timeout` (Integer): timeout in minutes. Default: `2880` (48h)
+- `max_retries` (Integer): number of retries. Default: `0`
+- `allocated_capacity` (Integer): legacy DPU setting. Prefer `worker_type` + `number_of_workers`
+- `worker_type` (String): `"Standard"`, `"G.1X"`, `"G.2X"`, `"G.4X"`, `"G.8X"`
+- `number_of_workers` (Integer): number of workers (requires `worker_type`)
+- `glue_version` (String): Glue version (e.g. `"4.0"`)
+
+```ruby
+job = DataDrain::GlueRunner.create_job(
+  "my-job",
+  role_arn: "arn:aws:iam::123:role/GlueServiceRole",
+  script_location: "s3://my-bucket/scripts/export.py",
+  default_arguments: { "--extra-files" => "s3://my-bucket/scripts/udf.py" },
+  timeout: 1440,
+  max_retries: 2,
+  worker_type: "G.1X",
+  number_of_workers: 10
+)
+```
+
+- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
+- Lets AWS errors propagate uncaught (duplicate name, invalid role, etc.)
+
+### `update_job(job_name, ...)` → Aws::Glue::Types::Job
+
+Updates an existing job and returns the updated job.
+
+Same parameters as `create_job`, all optional. Only the parameters provided are updated.
+
+```ruby
+job = DataDrain::GlueRunner.update_job(
+  "my-job",
+  script_location: "s3://my-bucket/scripts/export-v2.py",
+  timeout: 720
+)
+```
+
+- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
+- Raises `Aws::Glue::Errors::EntityNotFoundException` if the job does not exist.
+
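The "only the parameters provided are updated" rule can be sketched as a hash built from the non-nil keyword arguments. `build_job_update` below is a hypothetical helper, not part of the gem; it only mirrors the pattern visible in the `glue_runner.rb` hunk of this diff, where the `command` entry is included only when both `command_name` and `script_location` are given.

```ruby
# Hypothetical helper (not part of data_drain): builds a sparse update
# payload that contains only the options the caller actually passed.
def build_job_update(role_arn: nil, command_name: nil, script_location: nil,
                     timeout: nil, max_retries: nil)
  update = {
    role: role_arn,
    timeout: timeout,
    max_retries: max_retries
  }.compact # drop every key whose value is nil

  # The command is a nested structure, so it is only sent when both parts are known.
  if command_name && script_location
    update[:command] = { name: command_name, python_version: "3", script_location: script_location }
  end

  update
end

build_job_update(timeout: 720)
# => { timeout: 720 }
```

Under this rule, passing `script_location` without `command_name` leaves the command out of the payload, so a call shaped like the documented example may also need `command_name:` for the new script path to take effect.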
+### `delete_job(job_name)` → nil
+
+Deletes a job from Glue.
+
+```ruby
+DataDrain::GlueRunner.delete_job("my-job")
+# => nil
+```
+
+- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
+- Raises `Aws::Glue::Errors::EntityNotFoundException` if the job does not exist.
+
+### `ensure_job(job_name, role_arn:, script_location:, ...)` → Aws::Glue::Types::Job
+
+Creates or updates a job idempotently.
+
+- If the job exists → `update_job`
+- If the job does not exist → `create_job`
+
+```ruby
+job = DataDrain::GlueRunner.ensure_job(
+  "my-job",
+  role_arn: "arn:aws:iam::123:role/GlueServiceRole",
+  script_location: "s3://my-bucket/scripts/export.py",
+  timeout: 1440
+)
+```
+
+- Raises `DataDrain::ConfigurationError` if `job_name` is invalid.
+- Lets AWS errors propagate uncaught.
+
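The create-or-update branching described for `ensure_job` can be illustrated without AWS. This is a minimal sketch, assuming an in-memory `FakeGlue` store (a made-up stand-in, not a real client) and a simplified top-level `ensure_job` that takes the store as its first argument.

```ruby
# In-memory stand-in for a Glue client; an assumption for illustration only.
class FakeGlue
  def initialize
    @jobs = {}
  end

  def exists?(name)
    @jobs.key?(name)
  end

  def create(name, attrs)
    raise ArgumentError, "job #{name} already exists" if exists?(name)

    @jobs[name] = attrs.dup
  end

  def update(name, attrs)
    @jobs.fetch(name).merge!(attrs)
  end

  def fetch(name)
    @jobs.fetch(name)
  end
end

# Create-or-update: safe to call repeatedly with the same arguments.
def ensure_job(glue, name, **attrs)
  if glue.exists?(name)
    glue.update(name, attrs)
  else
    glue.create(name, attrs)
  end
  glue.fetch(name)
end

glue = FakeGlue.new
first  = ensure_job(glue, "my-job", timeout: 1440)
second = ensure_job(glue, "my-job", timeout: 1440)
first == second # => true, the second call converges on the same definition
```

The check and the write are two separate steps, so two callers racing on a job that does not exist yet could both take the create branch; with a real service, one of them would likely see a duplicate-name error.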
+### `run_and_wait(job_name, arguments = {}, ...)` → Boolean
+
+Runs an existing job and waits for it to complete.
+
+```ruby
+DataDrain::GlueRunner.run_and_wait(
+  "my-job",
+  { "--start_date" => "2025-01-01", "--end_date" => "2025-02-01" },
+  polling_interval: 60,
+  max_wait_seconds: 7200
+)
+# => true (SUCCEEDED)
+```
+
+- Raises `RuntimeError` if the job fails (`FAILED`, `STOPPED`, `TIMEOUT`).
+- Raises `DataDrain::Error` if `max_wait_seconds` is exceeded.
+
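The body of `run_and_wait` is not part of this diff, so the following is only a sketch of how a poll-until-terminal-state loop with a wall-clock budget could look. `wait_for_run`, `WaitTimeout`, and the `fetch_status` callable are hypothetical names; the terminal states and the `RuntimeError` on failure follow the bullets above.

```ruby
# Hypothetical timeout error; the gem documents DataDrain::Error for this case.
class WaitTimeout < StandardError; end

# Polls `fetch_status` (any callable returning the run state as a String)
# until the run succeeds, fails, or the time budget is used up.
def wait_for_run(fetch_status, polling_interval: 30, max_wait_seconds: nil, sleeper: method(:sleep))
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  loop do
    status = fetch_status.call
    return true if status == "SUCCEEDED"
    raise "run ended with status #{status}" if %w[FAILED STOPPED TIMEOUT].include?(status)

    # A monotonic clock is unaffected by system clock adjustments.
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    if max_wait_seconds && elapsed > max_wait_seconds
      raise WaitTimeout, "gave up after #{elapsed.round}s"
    end

    sleeper.call(polling_interval)
  end
end

states = %w[RUNNING RUNNING SUCCEEDED].each
wait_for_run(-> { states.next }, polling_interval: 0)
# => true
```

Passing the sleeper as a parameter is a design choice for testability: a test can hand in a no-op lambda instead of waiting for real.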
+## Naming conventions
+
+AWS Glue allows: letters (`a-zA-Z`), digits (`0-9`), hyphens (`-`). Underscores and spaces are not allowed.
+
+```ruby
+# Valid
+DataDrain::GlueRunner.job_exists?("my-export-job-v2")
+
+# Invalid: raises ConfigurationError
+DataDrain::GlueRunner.job_exists?("my_export_job")
+# DataDrain::ConfigurationError: job_name 'my_export_job' no es un nombre válido para Glue Job
+```
+
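The naming rule can be checked in isolation. The pattern below is copied from `GLUE_NAME_REGEX` in the `validations.rb` hunk of this diff; the constant name `GLUE_NAME` and the predicate `valid_glue_name?` are hypothetical wrappers added for illustration.

```ruby
# Same pattern as GLUE_NAME_REGEX in the validations.rb hunk of this diff.
GLUE_NAME = /\A[a-zA-Z0-9][a-zA-Z0-9-]*\z/

# Hypothetical predicate wrapper, for illustration.
def valid_glue_name?(value)
  GLUE_NAME.match?(value.to_s)
end

valid_glue_name?("my-export-job-v2") # => true
valid_glue_name?("my_export_job")    # => false (underscore)
valid_glue_name?("-leading-hyphen")  # => false (must start with a letter or digit)
valid_glue_name?("")                 # => false (at least one character required)
valid_glue_name?("job\nname")        # => false (\A and \z anchor the whole string)
```

Note that the pattern is stricter than the prose: besides limiting the character set, it also rejects a leading hyphen and the empty string.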
+## Telemetry events
+
+| Event | Level | Description |
+|--------|-------|-------------|
+| `glue_runner.start` | INFO | Before `start_job_run` |
+| `glue_runner.job_exists` | INFO | Job found in `ensure_job` |
+| `glue_runner.job_created` | INFO | Job created in `ensure_job` |
+| `glue_runner.polling` | INFO | Status check during `run_and_wait` |
+| `glue_runner.complete` | INFO | Job finished as `SUCCEEDED` |
+| `glue_runner.failed` | ERROR | Job failed with `FAILED\|STOPPED\|TIMEOUT` |
+| `glue_runner.timeout` | ERROR | `max_wait_seconds` exceeded |
data/lib/data_drain/engine.rb
CHANGED
@@ -159,7 +159,7 @@ module DataDrain
 # @api private
 # @return [Integer]
 def get_postgres_count
-  pg_sql = "SELECT
+  pg_sql = "SELECT COUNT(*) AS row_count FROM public.#{@table_name} WHERE #{base_where_sql}"
   pg_sql = pg_sql.gsub("'", "''")
   query = "SELECT row_count FROM postgres_query('pg_source', '#{pg_sql}')"
   @duckdb.query(query).first.first
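The `gsub("'", "''")` step in `get_postgres_count` exists because the inner query travels as a string literal inside the outer query, so every single quote in it has to be doubled. A minimal sketch of that nesting follows; `sql_string_literal`, the table name `events`, and the `status = 'done'` filter are hypothetical and chosen only to make the quoting visible.

```ruby
# Doubles single quotes so the text can sit inside a '...' SQL string literal.
def sql_string_literal(text)
  "'#{text.gsub("'", "''")}'"
end

inner = "SELECT COUNT(*) AS row_count FROM public.events WHERE status = 'done'"
outer = "SELECT row_count FROM postgres_query('pg_source', #{sql_string_literal(inner)})"
puts outer
# SELECT row_count FROM postgres_query('pg_source', 'SELECT COUNT(*) AS row_count FROM public.events WHERE status = ''done''')
```

Doubling quotes only protects the string-literal boundary; identifiers interpolated into the inner query (such as a table name) still need their own validation.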
@@ -204,7 +204,7 @@ module DataDrain

 begin
   query = <<~SQL
-    SELECT
+    SELECT COUNT(*)
     FROM read_parquet('#{archive_path}')
     WHERE #{base_where_sql}
   SQL
data/lib/data_drain/file_ingestor.rb
CHANGED
@@ -82,7 +82,7 @@ module DataDrain

 # @api private
 def step_count_source
-  source_count = timed(:source_query) { @duckdb.query("SELECT
+  source_count = timed(:source_query) { @duckdb.query("SELECT COUNT(*) FROM #{@reader_function}").first.first }
   safe_log(:info, "file_ingestor.count", {
     source_path: @source_path,
     count: source_count,
data/lib/data_drain/glue_runner.rb
CHANGED
@@ -19,10 +19,109 @@ module DataDrain
 # @return [Boolean] true if the Job finished successfully (SUCCEEDED).
 # @raise [DataDrain::Error] if max_wait_seconds is exceeded before SUCCEEDED.
 # @raise [RuntimeError] if the Job fails or is stopped.
+def self.client
+  @client ||= Aws::Glue::Client.new(region: DataDrain.configuration.aws_region)
+end
+
+class << self
+  attr_writer :client
+end
+
+def self.job_exists?(job_name)
+  DataDrain::Validations.validate_glue_name!(:job_name, job_name)
+  get_job(job_name)
+  true
+rescue Aws::Glue::Errors::EntityNotFoundException
+  false
+end
+
+def self.get_job(job_name)
+  DataDrain::Validations.validate_glue_name!(:job_name, job_name)
+  client.get_job(job_name: job_name).job
+end
+
+def self.create_job(job_name, role_arn:, script_location:, command_name: "glueetl",
+                    default_arguments: {}, description: nil, worker_type: nil, number_of_workers: nil,
+                    timeout: 2880, max_retries: 0, allocated_capacity: nil, glue_version: nil)
+  DataDrain::Validations.validate_glue_name!(:job_name, job_name)
+  opts = {
+    name: job_name,
+    role: role_arn,
+    command: {
+      name: command_name,
+      python_version: "3",
+      script_location: script_location
+    }
+  }
+  opts[:default_arguments] = default_arguments unless default_arguments.empty?
+  opts[:description] = description if description
+  opts[:timeout] = timeout if timeout
+  opts[:max_retries] = max_retries if max_retries
+  opts[:allocated_capacity] = allocated_capacity if allocated_capacity
+  opts[:worker_type] = worker_type if worker_type
+  opts[:number_of_workers] = number_of_workers if number_of_workers
+  opts[:glue_version] = glue_version if glue_version
+
+  client.create_job(**opts)
+  get_job(job_name)
+end
+
+def self.update_job(job_name, role_arn: nil, command_name: nil, script_location: nil,
+                    default_arguments: nil, description: nil, worker_type: nil,
+                    number_of_workers: nil, timeout: nil, max_retries: nil, allocated_capacity: nil,
+                    glue_version: nil)
+  DataDrain::Validations.validate_glue_name!(:job_name, job_name)
+  job_update = {}
+  job_update[:role] = role_arn if role_arn
+  if command_name && script_location
+    job_update[:command] =
+      { name: command_name, python_version: "3", script_location: script_location }
+  end
+  job_update[:default_arguments] = default_arguments if default_arguments
+  job_update[:description] = description if description
+  job_update[:timeout] = timeout if timeout
+  job_update[:max_retries] = max_retries if max_retries
+  job_update[:allocated_capacity] = allocated_capacity if allocated_capacity
+  job_update[:worker_type] = worker_type if worker_type
+  job_update[:number_of_workers] = number_of_workers if number_of_workers
+  job_update[:glue_version] = glue_version if glue_version
+
+  client.update_job(job_name: job_name, job_update: job_update)
+  get_job(job_name)
+end
+
+def self.delete_job(job_name)
+  DataDrain::Validations.validate_glue_name!(:job_name, job_name)
+  client.delete_job(job_name: job_name)
+  nil
+end
+
+def self.ensure_job(job_name, role_arn:, script_location:, command_name: "glueetl",
+                    default_arguments: {}, description: nil, worker_type: nil,
+                    number_of_workers: nil, timeout: 2880, max_retries: 0,
+                    allocated_capacity: nil, glue_version: nil)
+  if job_exists?(job_name)
+    safe_log(:info, "glue_runner.job_exists", { job: job_name })
+    update_job(job_name, role_arn: role_arn, command_name: command_name,
+               script_location: script_location, default_arguments: default_arguments,
+               description: description, worker_type: worker_type,
+               number_of_workers: number_of_workers, timeout: timeout,
+               max_retries: max_retries, allocated_capacity: allocated_capacity,
+               glue_version: glue_version)
+  else
+    safe_log(:info, "glue_runner.job_created", { job: job_name })
+    create_job(job_name, role_arn: role_arn, script_location: script_location,
+               command_name: command_name, default_arguments: default_arguments,
+               description: description, worker_type: worker_type,
+               number_of_workers: number_of_workers, timeout: timeout,
+               max_retries: max_retries, allocated_capacity: allocated_capacity,
+               glue_version: glue_version)
+  end
+end
+
 def self.run_and_wait(job_name, arguments = {}, polling_interval: 30, max_wait_seconds: nil)
   config = DataDrain.configuration
   config.validate!
-  client = Aws::Glue::Client.new(region: config.aws_region)
   start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)

   @logger = config.logger
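Moving the client into a memoized class-level reader with an `attr_writer` means a test double can be assigned before any call is made. The sketch below shows that injection pattern together with the rescue-to-`false` behavior of `job_exists?`. `Runner`, `StubClient`, `NotFound`, and `build_default_client` are hypothetical stand-ins so the example runs without the AWS SDK.

```ruby
# Stand-in for Aws::Glue::Errors::EntityNotFoundException (assumption for illustration).
class NotFound < StandardError; end

class Runner
  class << self
    # Tests can assign a double here before any call is made.
    attr_writer :client

    # Memoized: the default client is built at most once.
    def client
      @client ||= build_default_client
    end

    # A missing job is reported as false instead of raising.
    def job_exists?(name)
      client.get_job(job_name: name)
      true
    rescue NotFound
      false
    end

    private

    def build_default_client
      raise NotImplementedError, "a real client would be built here"
    end
  end
end

# Minimal double that knows about exactly one job.
class StubClient
  def get_job(job_name:)
    raise NotFound, job_name unless job_name == "my-job"

    { name: job_name }
  end
end

Runner.client = StubClient.new
Runner.job_exists?("my-job")  # => true
Runner.job_exists?("missing") # => false
```

One consequence of memoizing at class level is that the first client built is reused for the life of the process, so a later change to the configured region would not be picked up unless the client is reassigned.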
data/lib/data_drain/validations.rb
CHANGED
@@ -6,9 +6,17 @@ module DataDrain
 # Regex that validates SQL identifiers (tables, columns, etc.).
 # Allows letters, underscores, and digits (digits not as the first character).
 IDENTIFIER_REGEX = /\A[a-zA-Z_][a-zA-Z0-9_]*\z/
+GLUE_NAME_REGEX = /\A[a-zA-Z0-9][a-zA-Z0-9-]*\z/

 module_function

+def validate_glue_name!(name, value)
+  return if GLUE_NAME_REGEX.match?(value.to_s)
+
+  raise DataDrain::ConfigurationError,
+        "#{name} '#{value}' no es un nombre válido para Glue Job (usa solo letras, números y guiones)"
+end
+
 def validate_identifier!(name, value)
   return if IDENTIFIER_REGEX.match?(value.to_s)

data/lib/data_drain/version.rb
CHANGED
data/skill/references/eventos-telemetria.md
CHANGED
@@ -115,6 +115,14 @@ Full catalog of KV events emitted by DataDrain. Format: Wispro-Observab
 **Level:** INFO. Emitted before `start_job_run`.
 **Fields:** `job`.

+### `glue_runner.job_exists`
+**Level:** INFO. Emitted in `ensure_job` when the job already exists and is updated.
+**Fields:** `job`.
+
+### `glue_runner.job_created`
+**Level:** INFO. Emitted in `ensure_job` when the job is created.
+**Fields:** `job`.
+
 ### `glue_runner.polling`
 **Level:** INFO. Emitted on every status check while the Job has not finished.
 **Fields:** `job`, `run_id`, `status`, `next_check_in_s`.
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: data_drain
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.4.0
 platform: ruby
 authors:
 - Gabriel
@@ -102,9 +102,12 @@ files:
 - docs/execution/archive/v0.2.0.md
 - docs/execution/archive/v0.3.0-OBSERVACIONES.md
 - docs/execution/archive/v0.3.0.md
+- docs/execution/archive/v0.3.1-OBSERVACIONES.md
+- docs/execution/archive/v0.3.1.md
 - docs/execution/v0.2.2.md
-- docs/execution/v0.
-- docs/execution/v0.
+- docs/execution/v0.4.0-OBSERVACIONES.md
+- docs/execution/v0.4.0.md
+- docs/glue-jobs-lifecycle.md
 - docs/glue_pyspark_example.py
 - lib/data_drain.rb
 - lib/data_drain/configuration.rb
File without changes
File without changes