data_drain 0.1.9 → 0.1.14
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.claude/settings.local.json +24 -0
- data/CHANGELOG.md +24 -0
- data/README.md +101 -22
- data/data_drain.gemspec +1 -0
- data/lib/data_drain/engine.rb +16 -10
- data/lib/data_drain/file_ingestor.rb +7 -7
- data/lib/data_drain/glue_runner.rb +43 -0
- data/lib/data_drain/record.rb +2 -2
- data/lib/data_drain/storage.rb +17 -9
- data/lib/data_drain/version.rb +1 -1
- data/lib/data_drain.rb +2 -0
- metadata +18 -2
checksums.yaml
CHANGED

```diff
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 97d660cb624931d75d6f39e51527c58faf180b7ab727d9c85a7fa44079dc76a0
+  data.tar.gz: 932c85dcf3542e52b0f3981281e6a93a757ac194153c8b0b7080a79857613ed5
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d30e7aaf152e576821b2b2c9a3a68cba01a4c3db6941209e0d0ad0ffb7f69f763e5cf93bd90ac0964a4a2b9b5a5582e348c6f9f5599a5c3ddb24df45168e6418
+  data.tar.gz: f71de76a5075e99eea50a83d0c0d1831091c011a2a64e17b4f3ea206fe8f50ec4bcd2309dfb3096478995c75b4bbfc384431af0d5a5bf3ff446522fa06857891
```
data/.claude/settings.local.json
ADDED

```diff
@@ -0,0 +1,24 @@
+{
+  "hooks": {
+    "Notification": [
+      {
+        "hooks": [
+          {
+            "type": "command",
+            "command": "curl -sf -X POST -H \"Content-Type: application/json\" -H \"X-Emdash-Token: $EMDASH_HOOK_TOKEN\" -H \"X-Emdash-Pty-Id: $EMDASH_PTY_ID\" -H \"X-Emdash-Event-Type: notification\" -d @- \"http://127.0.0.1:$EMDASH_HOOK_PORT/hook\" || true"
+          }
+        ]
+      }
+    ],
+    "Stop": [
+      {
+        "hooks": [
+          {
+            "type": "command",
+            "command": "curl -sf -X POST -H \"Content-Type: application/json\" -H \"X-Emdash-Token: $EMDASH_HOOK_TOKEN\" -H \"X-Emdash-Pty-Id: $EMDASH_PTY_ID\" -H \"X-Emdash-Event-Type: stop\" -d @- \"http://127.0.0.1:$EMDASH_HOOK_PORT/hook\" || true"
+          }
+        ]
+      }
+    ]
+  }
+}
```
data/CHANGELOG.md
CHANGED

```diff
@@ -1,5 +1,29 @@
 ## [Unreleased]
 
+## [0.1.14] - 2026-03-17
+
+- Feature: **structured logging** (`key=value`) across the whole gem, for better observability in production.
+- Optimization: automatic caching of storage adapters to speed up repeated queries.
+- Testing: made the `Engine` tests more robust by decoupling them from minor changes in the DuckDB setup.
+
+## [0.1.13] - 2026-03-17
+
+- Feature: fully parameterized Glue orchestration. Added `s3_bucket`, `s3_folder`, and `partition_by` as dynamic arguments, so the same Glue Job can serve multiple tables and destinations.
+
+## [0.1.12] - 2026-03-17
+
+- Feature: dynamic database parameterization in `GlueRunner` and the PySpark script. `db_url`, `db_user`, `db_password`, and `db_table` are now passed as arguments to the Glue Job.
+
+## [0.1.11] - 2026-03-17
+
+- Feature: added `DataDrain::GlueRunner` to orchestrate AWS Glue Jobs.
+- Feature: official support for Big Data processing (e.g. 1 TB tables) by delegating to AWS Glue.
+- Documentation: included a master PySpark script in the README compatible with the gem's format.
+
+## [0.1.10] - 2026-03-17
+
+- Feature: added the `skip_export` option to `DataDrain::Engine`. It lets external tools (such as AWS Glue) handle the data export, leaving DataDrain responsible only for integrity validation and the PostgreSQL purge.
+
 ## [0.1.9] - 2026-03-17
 
 - Fix: improved date-range precision in SQL queries by using half-open bounds (<) to avoid losing records to microseconds.
```
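The 0.1.14 entry above mentions structured `key=value` logging. The shape of those log lines can be sketched in plain Ruby; the `kv_log` helper below is hypothetical, written only to illustrate the format visible in this diff, and is not part of the gem:

```ruby
# Hypothetical helper showing the key=value log format the gem's diffs use.
# In the gem itself these strings are built inline and sent to the
# configured Logger; this is only the formatting idea in isolation.
def kv_log(event, **fields)
  pairs = { component: "data_drain", event: event }.merge(fields)
  pairs.map { |k, v| "#{k}=#{v}" }.join(" ")
end

line = kv_log("engine.start", table: "versions", start_date: "2025-09-01")
puts line
# => component=data_drain event=engine.start table=versions start_date=2025-09-01
```

Lines in this shape are trivially parseable by log aggregators, which is the observability benefit the changelog refers to.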
data/README.md
CHANGED

````diff
@@ -98,39 +98,118 @@ ingestor.call
 
 ### 2. DB Extraction and Purge (Engine)
 
-Ideal for building rolling retention windows (e.g. keep only 6 months of live data in Postgres and archive the rest).
+Ideal for building rolling retention windows (e.g. keep only 6 months of live data in Postgres and archive the rest).
 
-
-
-task versions: :environment do
-  target_date = 6.months.ago.beginning_of_month
-
-  select_sql = <<~SQL
-    id, item_type, item_id, event, whodunnit,
-    object::VARCHAR AS object,
-    object_changes::VARCHAR AS object_changes,
-    created_at,
-    EXTRACT(YEAR FROM created_at)::INT AS year,
-    EXTRACT(MONTH FROM created_at)::INT AS month,
-    isp_id
-  SQL
+**Purge Mode with External Export (AWS Glue):**
+If your architecture already uses **AWS Glue** or **AWS EMR** to move heavy data, you can configure DataDrain to act purely as an **integrity guarantor**. In this mode the engine skips the export step but mathematically verifies that the data exists in the Data Lake before deleting it from PostgreSQL.
 
+```ruby
+# lib/tasks/archive_with_glue.rake
+task purge_only: :environment do
   engine = DataDrain::Engine.new(
     bucket: 'my-bucket-store',
-    start_date:
-    end_date:
+    start_date: 6.months.ago.beginning_of_month,
+    end_date: 6.months.ago.end_of_month,
     table_name: 'versions',
-
-
-    where_clause: "event = 'update'"
+    partition_keys: %w[year month],
+    skip_export: true # ⚡️ Exports nothing; only validates S3 and purges Postgres
   )
 
-  # Counts, exports to Parquet, verifies integrity, and purges Postgres.
   engine.call
 end
 ```
 
-### 3.
+### 3. Orchestration with AWS Glue (Big Data)
+
+For very large tables (**e.g. > 500 GB or 1 TB**), it is recommended to delegate the data movement to **AWS Glue** (built on Apache Spark) to avoid saturating the Ruby server. `DataDrain` acts as the orchestrator that triggers the Job, waits for it to finish, and then performs the validation and purge.
+
+```ruby
+# 1. Trigger the Glue Job and wait for it to finish successfully
+config = DataDrain.configuration
+bucket = "my-bucket"
+table = "versions"
+
+DataDrain::GlueRunner.run_and_wait(
+  "my-glue-export-job",
+  {
+    "--start_date" => start_date.to_fs(:db),
+    "--end_date" => end_date.to_fs(:db),
+    "--s3_bucket" => bucket,
+    "--s3_folder" => table,
+    "--db_url" => "jdbc:postgresql://#{config.db_host}:#{config.db_port}/#{config.db_name}",
+    "--db_user" => config.db_user,
+    "--db_password" => config.db_pass,
+    "--db_table" => table,
+    "--partition_by" => "year,month,isp_id" # <--- Dynamic columns
+  }
+)
+
+# 2. Once Glue has exported the TB, DataDrain validates integrity and purges Postgres
+DataDrain::Engine.new(
+  bucket: bucket,
+  folder_name: table,
+  start_date: start_date,
+  end_date: end_date,
+  table_name: table,
+  partition_keys: %w[year month isp_id],
+  skip_export: true # <--- Validation + Purge mode
+).call
+```
+
+#### AWS Glue (PySpark) script compatible with DataDrain
+
+Create a Job in the AWS Glue console (Spark 4.0+) and use this script as a starting point. It is designed to extract data from PostgreSQL dynamically:
+
+```python
+import sys
+from awsglue.utils import getResolvedOptions
+from pyspark.context import SparkContext
+from awsglue.context import GlueContext
+from awsglue.job import Job
+from pyspark.sql.functions import col, year, month
+
+# Parameters received from DataDrain::GlueRunner
+args = getResolvedOptions(sys.argv, [
+    'JOB_NAME', 'start_date', 'end_date', 's3_bucket', 's3_folder',
+    'db_url', 'db_user', 'db_password', 'db_table', 'partition_by'
+])
+
+sc = SparkContext()
+glueContext = GlueContext(sc)
+spark = glueContext.spark_session
+job = Job(glueContext)
+job.init(args['JOB_NAME'], args)
+
+# 1. Read from PostgreSQL (via dynamic JDBC)
+options = {
+    "url": args['db_url'],
+    "dbtable": args['db_table'],
+    "user": args['db_user'],
+    "password": args['db_password'],
+    "sampleQuery": f"SELECT * FROM {args['db_table']} WHERE created_at >= '{args['start_date']}' AND created_at < '{args['end_date']}'"
+}
+
+df = spark.read.format("jdbc").options(**options).load()
+
+# 2. Add temporary partition columns (Hive partitioning)
+df_final = df.withColumn("year", year(col("created_at"))) \
+             .withColumn("month", month(col("created_at")))
+
+# 3. Write to S3 as Parquet with ZSTD compression
+# Build the path dynamically: s3://bucket/folder/
+output_path = f"s3://{args['s3_bucket']}/{args['s3_folder']}/"
+partitions = args['partition_by'].split(",")
+
+df_final.write.mode("overwrite") \
+    .partitionBy(*partitions) \
+    .format("parquet") \
+    .option("compression", "zstd") \
+    .save(output_path)
+
+job.commit()
+```
+
 ### 4. Querying the Data Lake (Record)
 
 To query the archived data without leaving Ruby, create a model that inherits from `DataDrain::Record`.
 
````
data/data_drain.gemspec
CHANGED

```diff
@@ -26,6 +26,7 @@ Gem::Specification.new do |spec|
 
   # 💡 Core gem dependencies
   spec.add_dependency "activemodel", ">= 6.0"
+  spec.add_dependency "aws-sdk-glue", "~> 1.0"
   spec.add_dependency "aws-sdk-s3", "~> 1.114"
   spec.add_dependency "duckdb", "~> 1.4"
   spec.add_dependency "pg", ">= 1.2"
```
data/lib/data_drain/engine.rb
CHANGED

```diff
@@ -20,6 +20,7 @@ module DataDrain
     # @option options [Array<String, Symbol>] :partition_keys Columns to partition by.
     # @option options [String] :primary_key (Optional) Primary key used for deletion. Defaults to 'id'.
     # @option options [String] :where_clause (Optional) Extra SQL condition.
+    # @option options [Boolean] :skip_export (Optional) When true, skips the Parquet export and performs only validation and purge.
     def initialize(options)
       @start_date = options.fetch(:start_date).beginning_of_day
 
@@ -34,6 +35,7 @@ module DataDrain
       @primary_key = options.fetch(:primary_key, "id")
       @where_clause = options[:where_clause]
       @bucket = options[:bucket]
+      @skip_export = options.fetch(:skip_export, false)
 
       @config = DataDrain.configuration
       @logger = @config.logger
@@ -43,30 +45,34 @@
       @duckdb = database.connect
     end
 
-    # Runs the full engine flow: Setup, Count, Export, Verification, and Purge.
+    # Runs the full engine flow: Setup, Count, Export (optional), Verification, and Purge.
     #
     # @return [Boolean] `true` if the process finished successfully, `false` if the integrity check failed.
     def call
-      @logger.info "
+      @logger.info "component=data_drain event=engine.start table=#{@table_name} start_date=#{@start_date.to_date} end_date=#{@end_date.to_date}"
 
       setup_duckdb
 
       @pg_count = get_postgres_count
 
       if @pg_count.zero?
-        @logger.info "
+        @logger.info "component=data_drain event=engine.skip_empty table=#{@table_name}"
         return true
       end
 
-
-
+      if @skip_export
+        @logger.info "component=data_drain event=engine.skip_export table=#{@table_name}"
+      else
+        @logger.info "component=data_drain event=engine.export_start table=#{@table_name} count=#{@pg_count}"
+        export_to_parquet
+      end
 
       if verify_integrity
         purge_from_postgres
-        @logger.info "
+        @logger.info "component=data_drain event=engine.complete table=#{@table_name}"
         true
       else
-        @logger.error "
+        @logger.error "component=data_drain event=engine.integrity_error table=#{@table_name}"
         false
       end
     end
@@ -141,17 +147,17 @@
       SQL
       parquet_result = @duckdb.query(query).first.first
     rescue DuckDB::Error => e
-      @logger.error "
+      @logger.error "component=data_drain event=engine.parquet_read_error table=#{@table_name} error=#{e.message}"
       return false
     end
 
-    @logger.info "
+    @logger.info "component=data_drain event=engine.integrity_check table=#{@table_name} pg_count=#{@pg_count} parquet_count=#{parquet_result}"
     @pg_count == parquet_result
   end
 
   # @api private
   def purge_from_postgres
-    @logger.info "
+    @logger.info "component=data_drain event=engine.purge_start table=#{@table_name} batch_size=#{@config.batch_size}"
 
     conn = PG.connect(
       host: @config.db_host,
```
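The `skip_export` change turns `Engine#call` into a guarded pipeline: count, export only when not skipped, then verify and purge. A self-contained sketch of that control flow, with every collaborator stubbed and all names invented for illustration (this is not the gem's actual class):

```ruby
# Minimal model of the new branching in Engine#call: the export step
# runs only when skip_export is false; verification and purge always run.
class MiniEngine
  attr_reader :steps

  def initialize(skip_export: false)
    @skip_export = skip_export
    @steps = []  # records which pipeline stages executed, for inspection
  end

  def call
    @steps << :count
    @steps << :export unless @skip_export
    @steps << :verify
    @steps << :purge
    true
  end
end

MiniEngine.new(skip_export: true).call  # steps: [:count, :verify, :purge]
MiniEngine.new.call                     # steps: [:count, :export, :verify, :purge]
```

The point of the design is that an external exporter (such as a Glue Job) can own the `:export` stage while the integrity check still gates the destructive purge.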
data/lib/data_drain/file_ingestor.rb
CHANGED

```diff
@@ -30,10 +30,10 @@ module DataDrain
     # Runs the ingestion flow.
     # @return [Boolean] true if the process succeeded.
     def call
-      @logger.info "
+      @logger.info "component=data_drain event=file_ingestor.start source_path=#{@source_path}"
 
       unless File.exist?(@source_path)
-        @logger.error "
+        @logger.error "component=data_drain event=file_ingestor.file_not_found source_path=#{@source_path}"
         return false
       end
 
@@ -47,7 +47,7 @@
 
       # 1. Safety count
       source_count = @duckdb.query("SELECT COUNT(*) FROM #{reader_function}").first.first
-      @logger.info "
+      @logger.info "component=data_drain event=file_ingestor.count source_path=#{@source_path} count=#{source_count}"
 
       if source_count.zero?
         cleanup_local_file
@@ -73,15 +73,15 @@
         );
       SQL
 
-      @logger.info "
+      @logger.info "component=data_drain event=file_ingestor.export_start dest_path=#{dest_path}"
       @duckdb.query(query)
 
-      @logger.info "
+      @logger.info "component=data_drain event=file_ingestor.complete source_path=#{@source_path}"
 
       cleanup_local_file
       true
     rescue DuckDB::Error => e
-      @logger.error "
+      @logger.error "component=data_drain event=file_ingestor.duckdb_error source_path=#{@source_path} error=#{e.message}"
       false
     ensure
       @duckdb&.close
@@ -107,7 +107,7 @@
     def cleanup_local_file
       if @delete_after_upload && File.exist?(@source_path)
         File.delete(@source_path)
-        @logger.info "
+        @logger.info "component=data_drain event=file_ingestor.cleanup source_path=#{@source_path}"
       end
     end
   end
```
data/lib/data_drain/glue_runner.rb
ADDED

```diff
@@ -0,0 +1,43 @@
+# frozen_string_literal: true
+
+require "aws-sdk-glue"
+
+module DataDrain
+  # Orchestrator for AWS Glue. Triggers and monitors Jobs in AWS
+  # to delegate bulk data movement (e.g. 1 TB tables).
+  class GlueRunner
+    # Triggers a Glue Job and waits for it to finish successfully.
+    #
+    # @param job_name [String] Name of the Job in the AWS console.
+    # @param arguments [Hash] Run arguments (must start with --).
+    # @param polling_interval [Integer] Seconds to wait between status checks.
+    # @return [Boolean] true if the Job finished successfully (SUCCEEDED).
+    # @raise [RuntimeError] If the Job fails or is stopped.
+    def self.run_and_wait(job_name, arguments = {}, polling_interval: 30)
+      config = DataDrain.configuration
+      client = Aws::Glue::Client.new(region: config.aws_region)
+
+      config.logger.info "component=data_drain event=glue_runner.start job=#{job_name}"
+      resp = client.start_job_run(job_name: job_name, arguments: arguments)
+      run_id = resp.job_run_id
+
+      loop do
+        run_info = client.get_job_run(job_name: job_name, run_id: run_id).job_run
+        status = run_info.job_run_state
+
+        case status
+        when "SUCCEEDED"
+          config.logger.info "component=data_drain event=glue_runner.complete job=#{job_name} run_id=#{run_id}"
+          return true
+        when "FAILED", "STOPPED", "TIMEOUT"
+          error_msg = run_info.error_message || "No error message available."
+          config.logger.error "component=data_drain event=glue_runner.failed job=#{job_name} run_id=#{run_id} status=#{status} error=#{error_msg}"
+          raise "Glue Job #{job_name} (Run ID: #{run_id}) failed with status #{status}."
+        else
+          config.logger.info "component=data_drain event=glue_runner.polling job=#{job_name} run_id=#{run_id} status=#{status} next_check_in=#{polling_interval}s"
+          sleep polling_interval
+        end
+      end
+    end
+  end
+end
```
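`GlueRunner.run_and_wait` is a start-then-poll loop over `start_job_run`/`get_job_run`. That loop can be exercised without AWS by injecting a stub client. Everything below (`StubGlueClient`, the standalone `run_and_wait`) is invented for illustration; it only mimics the response shapes the real code reads, minus logging and real AWS calls:

```ruby
# Minimal structs standing in for the AWS SDK response objects.
StubRun   = Struct.new(:job_run_state, :error_message)
StartResp = Struct.new(:job_run_id)
GetResp   = Struct.new(:job_run)

# Stub client that walks through a fixed sequence of job states.
class StubGlueClient
  def initialize(states)
    @states = states.dup
  end

  def start_job_run(**_opts)
    StartResp.new("run-1")
  end

  def get_job_run(**_opts)
    GetResp.new(StubRun.new(@states.shift, nil))
  end
end

# Same shape as the gem's polling loop: return on SUCCEEDED,
# raise on terminal failure states, otherwise sleep and re-poll.
def run_and_wait(client, polling_interval: 0)
  run_id = client.start_job_run(job_name: "demo").job_run_id
  loop do
    status = client.get_job_run(job_name: "demo", run_id: run_id).job_run.job_run_state
    case status
    when "SUCCEEDED"
      return true
    when "FAILED", "STOPPED", "TIMEOUT"
      raise "Job run #{run_id} failed with status #{status}."
    else
      sleep polling_interval
    end
  end
end

run_and_wait(StubGlueClient.new(%w[RUNNING RUNNING SUCCEEDED]))  # => true
```

Injecting the client this way is also how the gem's polling logic could be unit-tested without hitting AWS.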
data/lib/data_drain/record.rb
CHANGED

```diff
@@ -85,7 +85,7 @@ module DataDrain
     # @return [Integer] Number of physical partitions removed.
     def self.destroy_all(**partitions)
       adapter = DataDrain::Storage.adapter
-      DataDrain.configuration.logger.info "
+      DataDrain.configuration.logger.info "component=data_drain event=record.destroy_all folder=#{folder_name} partitions=#{partitions.inspect}"
 
       adapter.destroy_partitions(bucket, folder_name, partition_keys, partitions)
     end
@@ -118,7 +118,7 @@
       begin
         result = connection.query(sql)
       rescue DuckDB::Error => e
-        DataDrain.configuration.logger.warn "
+        DataDrain.configuration.logger.warn "component=data_drain event=record.parquet_not_found error=#{e.message}"
         return []
       end
 
```
data/lib/data_drain/storage.rb
CHANGED

```diff
@@ -11,20 +11,28 @@ module DataDrain
     class InvalidAdapterError < DataDrain::Error; end
 
     # Resolves and instantiates the matching storage adapter
-    # based on the framework's current configuration.
+    # based on the framework's current configuration. The instance is
+    # cached to avoid unnecessary allocations between queries.
     #
     # @return [DataDrain::Storage::Base] An instance of Local or S3.
     # @raise [InvalidAdapterError] If the storage_mode is not valid.
     def self.adapter
-
-
-
-
-
-
-
-
+      @adapter ||= begin
+        mode = DataDrain.configuration.storage_mode
+        case mode.to_sym
+        when :local
+          Local.new(DataDrain.configuration)
+        when :s3
+          S3.new(DataDrain.configuration)
+        else
+          raise InvalidAdapterError, "Storage mode '#{mode}' is not supported."
+        end
       end
     end
+
+    # Discards the cached adapter. Call this when storage_mode changes.
+    def self.reset_adapter!
+      @adapter = nil
+    end
   end
 end
```
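`Storage.adapter` now memoizes with `@adapter ||= begin ... end` at the class level, which is why a `reset_adapter!` escape hatch is needed whenever the configuration changes. The cache-and-reset pattern in miniature, using a hypothetical `AdapterCache` class that is not part of the gem:

```ruby
# Class-level memoization with an explicit reset, mirroring
# Storage.adapter / Storage.reset_adapter! without any gem dependencies.
class AdapterCache
  def self.build_count
    @build_count ||= 0  # how many times the expensive build ran
  end

  def self.adapter
    @adapter ||= begin
      @build_count = build_count + 1
      Object.new  # stand-in for Local.new / S3.new
    end
  end

  def self.reset_adapter!
    @adapter = nil
  end
end

a = AdapterCache.adapter
AdapterCache.adapter            # same cached instance; no new allocation
AdapterCache.reset_adapter!
AdapterCache.adapter.equal?(a)  # => false; rebuilt after reset
```

The trade-off of this pattern is stale state: without the `reset_adapter!` call that `reset_configuration!` now makes, a changed `storage_mode` would keep serving the old adapter.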
data/lib/data_drain/version.rb
CHANGED
data/lib/data_drain.rb
CHANGED

```diff
@@ -8,6 +8,7 @@ require_relative "data_drain/storage"
 require_relative "data_drain/engine"
 require_relative "data_drain/record"
 require_relative "data_drain/file_ingestor"
+require_relative "data_drain/glue_runner"
 
 # Register the custom ActiveModel JSON type
 require_relative "data_drain/types/json_type"
@@ -28,6 +29,7 @@ module DataDrain
   # @api private
   def reset_configuration!
     @configuration = Configuration.new
+    DataDrain::Storage.reset_adapter!
   end
 end
 end
```
metadata
CHANGED

```diff
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: data_drain
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.14
 platform: ruby
 authors:
 - Gabriel
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2026-03-
+date: 2026-03-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activemodel
@@ -24,6 +24,20 @@ dependencies:
     - - ">="
     - !ruby/object:Gem::Version
       version: '6.0'
+- !ruby/object:Gem::Dependency
+  name: aws-sdk-glue
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+    - !ruby/object:Gem::Version
+      version: '1.0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+    - !ruby/object:Gem::Version
+      version: '1.0'
 - !ruby/object:Gem::Dependency
   name: aws-sdk-s3
   requirement: !ruby/object:Gem::Requirement
@@ -74,6 +88,7 @@ executables: []
 extensions: []
 extra_rdoc_files: []
 files:
+- ".claude/settings.local.json"
 - ".rspec"
 - ".rubocop.yml"
 - CHANGELOG.md
@@ -87,6 +102,7 @@ files:
 - lib/data_drain/engine.rb
 - lib/data_drain/errors.rb
 - lib/data_drain/file_ingestor.rb
+- lib/data_drain/glue_runner.rb
 - lib/data_drain/record.rb
 - lib/data_drain/storage.rb
 - lib/data_drain/storage/base.rb
```