data_drain 0.1.15 → 0.1.18
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +20 -1
- data/CLAUDE.md +13 -10
- data/README.md +92 -75
- data/lib/data_drain/engine.rb +52 -11
- data/lib/data_drain/file_ingestor.rb +20 -8
- data/lib/data_drain/glue_runner.rb +17 -5
- data/lib/data_drain/observability.rb +48 -0
- data/lib/data_drain/record.rb +6 -2
- data/lib/data_drain/version.rb +1 -1
- data/lib/data_drain.rb +1 -0
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: '09d58bbf9060fa6fb61ddeff5e43f020168280d9487726912c25deda6b1a2a45'
+  data.tar.gz: e8d13997382a5b9c69031406450ff579f01afe9593b1b9edee28546944b9faee
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: de7135c83eb0d5cbdc018cf965d974ccc449ae9c74166868914b4f73e5c775ea9bc39c80bee0ada779b7cafeb313c4cdde7b20b454cfab7b415d9cb7e25ff815
+  data.tar.gz: de65115bbb65cfe1ef4ae035c2c7c644027109fb485e2b0e9e17b079b15595ad2ce015ffd4771432551e314ac7bd42cedb014f907cbd690d669d9a7166a79625
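For context, verifying the published digests locally is a plain SHA256 hexdigest comparison over the archive bytes; this is a generic sketch (the helper name and file path are illustrative, not part of the gem):

```ruby
require "digest"

# Generic digest check: true when the file's SHA256 matches the published hex.
def checksum_ok?(path, expected_hex)
  Digest::SHA256.hexdigest(File.binread(path)) == expected_hex
end
```

Running this against a downloaded `data.tar.gz` with the hash above confirms the archive is untampered.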
data/CHANGELOG.md
CHANGED
@@ -1,6 +1,25 @@
 ## [Unreleased]
 
-## [0.1.
+## [0.1.18] - 2026-03-23
+
+- Feature: `Observability` module centralizes structured logging across the gem.
+- Feature: Progress heartbeat for bulk purges (`engine.purge_heartbeat`).
+- Telemetry: Separate error context (`error_class`, `error_message`) on every failure event.
+- Resilience: Failures in the logging system never interrupt the main data flow.
+
+## [0.1.17] - 2026-03-17
+
+- Feature: Granular per-phase telemetry (Performance Engineering).
+- Telemetry: Added phase-specific metrics `db_query_duration_s`, `export_duration_s`, `integrity_duration_s` and `purge_duration_s` to the `engine.complete` event.
+- Telemetry: Added `source_query_duration_s` and `export_duration_s` to `file_ingestor.complete`.
+
+## [0.1.16] - 2026-03-17
+
+- Refactor: Compliance with the **Wispro-Observability-Spec (v1)** standard.
+- Telemetry: Renamed timing metrics to `duration_s` and `next_check_in_s`, removing unit suffixes from values.
+- Observability: Guaranteed plain numeric values for counters and timings, easing processing by `exis_ray`.
+
 ## [0.1.15] - 2026-03-17
 
 - Performance: Durations measured with a monotonic clock (`Process.clock_gettime`) in terminal events of `Engine`, `FileIngestor` and `GlueRunner`.
 - Fix: `idle_in_transaction_session_timeout` is now applied correctly when the value is `0` (disables the timeout). Previously `0.present?` evaluated to `false` and the setting was ignored.
data/CLAUDE.md
CHANGED
@@ -25,16 +25,19 @@ Exports use DuckDB's `OVERWRITE_OR_IGNORE 1`. Processes are safe
 ### `idle_in_transaction_session_timeout`
 The value `0` **disables** the timeout (no limit). For high-volume purges this is mandatory. Internally, validation must use `!nil?`, since `0.present?` is false.
 
-## Logging
-
-
-
-- Formato
--
--
--
--
--
+## Logging (Wispro-Observability-Spec v1)
+
+Telemetry must be structured (KV) so `exis_ray` can process it.
+
+- **Format:** `component=data_drain event=<class>.<occurrence> [fields]`
+- **Units:** Never include units in values (e.g. do NOT use "0.5s").
+- **Timings:** Use the `_s` suffix in the key and a `Float` value. E.g. `duration_s=0.57`.
+- **Counters:** Use the word `count` in the key and an `Integer` value. E.g. `pg_count=100`.
+- **Naming:** All keys must be `snake_case`.
+- **Automation:** The `source` field is injected automatically by `exis_ray`; do not include it manually.
+- **DEBUG:** Always in block form: `logger.debug { "k=#{v}" }`.
+- **Durations:** Always use `Process.clock_gettime(Process::CLOCK_MONOTONIC)`.
+- **Sensitivity:** Filter sensitive data (`password`, `token`, `secret`) → `[FILTERED]`.
 
 ## Ruby Code
 
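The KV rules added above can be sketched as a tiny formatter; `spec_log_line` is an illustrative helper, not part of the gem:

```ruby
# Illustrative formatter for the spec above: component and event first,
# snake_case keys, bare numeric values (no unit suffixes).
def spec_log_line(event, fields = {})
  { component: "data_drain", event: event }.merge(fields)
    .map { |k, v| "#{k}=#{v}" }
    .join(" ")
end

spec_log_line("engine.complete", table: "versions", duration_s: 0.57, pg_count: 100)
# => "component=data_drain event=engine.complete table=versions duration_s=0.57 pg_count=100"
```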
data/README.md
CHANGED
@@ -1,4 +1,4 @@
-# DataDrain
+# DataDrain
 
 DataDrain is an enterprise-grade micro-framework designed to extract, archive and purge historical data from transactional PostgreSQL databases, as well as to **ingest raw files (CSV, JSON, Parquet)**, into an analytical Data Lake.
 
@@ -12,13 +12,14 @@ Uses in-memory **DuckDB** to achieve high processing and compression speeds
 * **Storage Adapters:** Native, transparent support for Local Disk and AWS S3 storage.
 * **Guaranteed Integrity:** Mathematically verifies that exported data matches the source exactly before executing `DELETE` statements.
 * **Built-in Analytical ORM:** Includes an `ActiveModel`-compatible base class (`DataDrain::Record`) to query and destroy historical partitions idiomatically.
+* **Structured Observability:** Every event emits `key=value` logs compatible with Datadog, CloudWatch and `exis_ray`. Logging failures never interrupt the main flow.
 
 ## Installation
 
 Add this line to your application's or microservice's `Gemfile`:
 
 ```ruby
-gem 'data_drain', git: '
+gem 'data_drain', git: 'https://github.com/gedera/data_drain.git', branch: 'main'
 ```
 
 And run:
@@ -50,47 +51,42 @@ DataDrain.configure do |config|
   # Postgres performance and tuning
   config.batch_size = 5000 # Rows deleted per transaction
   config.throttle_delay = 0.5 # Seconds to pause between delete batches
-
+
   # PostgreSQL idle-in-transaction timeout (in milliseconds).
-  #
-  #
+  # A value of 0 DISABLES the timeout (no time limit).
+  # Mandatory for high-volume purges where each batch can take seconds.
   config.idle_in_transaction_session_timeout = 0
-
-  config.logger
+
+  config.logger = Rails.logger
 
   # DuckDB tuning
   # Maximum RAM for DuckDB's in-memory queries (e.g. '2GB', '512MB').
-  # Prevents the process
-  config.limit_ram
-
+  # Prevents the process from dying from OOM in memory-constrained containers.
+  config.limit_ram = '2GB'
+
   # DuckDB temp directory for spilling to disk during heavy transformations
   # or creation of massive Parquet files.
-  #
-  config.tmp_directory
+  # This directory should reside on a fast SSD/NVMe disk.
+  config.tmp_directory = '/tmp/duckdb_work'
 end
 ```
 
 ## Usage
 
-The framework provides
+The framework provides four main tools: **File Ingestor**, **Database Drain**, **Analytical ORM** and **AWS Glue Orchestration**.
 
 ### 1. Raw File Ingestion (FileIngestor)
 
 Ideal for services that generate large volumes of data (e.g. Netflow metrics). Takes a local file, transforms it, compresses it to Parquet and uploads it partitioned to S3.
 
 ```ruby
-# A file temporarily generated by your service
-archivo_temporal = "/tmp/netflow_metrics_1600.csv"
-
 ingestor = DataDrain::FileIngestor.new(
-  bucket:
-  source_path:
-  folder_name:
-
-
-
-  select_sql: "*, EXTRACT(YEAR FROM timestamp) AS year, EXTRACT(MONTH FROM timestamp) AS month",
-  delete_after_upload: true # Removes the temporary file when done
+  bucket: 'my-bucket-store',
+  source_path: '/tmp/netflow_metrics_1600.csv',
+  folder_name: 'netflow',
+  partition_keys: %w[year month isp_id],
+  select_sql: "*, EXTRACT(YEAR FROM timestamp) AS year, EXTRACT(MONTH FROM timestamp) AS month",
+  delete_after_upload: true
 )
 
 ingestor.call
@@ -98,25 +94,37 @@ ingestor.call
 
 ### 2. DB Extraction and Purge (Engine)
 
-Ideal for creating
+Ideal for creating rolling retention windows (e.g. keep only 6 months of live data in Postgres and archive the rest).
 
-**
-If your architecture already uses **AWS Glue** or **AWS EMR** to move heavy data, you can configure DataDrain to act solely as an **Integrity Guarantor**. In this mode the engine skips the export step but mathematically verifies that the data exists in the Data Lake before deleting it from PostgreSQL.
+**Full flow (Export + Verify + Purge):**
 
 ```ruby
-
-
-
-
-
-
-
-
-
-
-
-
-
+engine = DataDrain::Engine.new(
+  bucket: 'my-bucket-store',
+  start_date: 6.months.ago.beginning_of_month,
+  end_date: 6.months.ago.end_of_month,
+  table_name: 'versions',
+  partition_keys: %w[year month]
+)
+
+engine.call
+```
+
+**Purge Mode with External Export (skip_export):**
+
+If your architecture already uses **AWS Glue** or **AWS EMR** to move heavy data, you can configure DataDrain to act solely as an integrity guarantor. In this mode it skips the export but mathematically verifies that the data exists in the Data Lake before deleting it from PostgreSQL.
+
+```ruby
+engine = DataDrain::Engine.new(
+  bucket: 'my-bucket-store',
+  start_date: 6.months.ago.beginning_of_month,
+  end_date: 6.months.ago.end_of_month,
+  table_name: 'versions',
+  partition_keys: %w[year month],
+  skip_export: true
+)
+
+engine.call
 ```
 
 ### 3. AWS Glue Orchestration (Big Data)
@@ -124,23 +132,23 @@ end
 For very large tables (**e.g. > 500GB or 1TB**), delegating data movement to **AWS Glue** (based on Apache Spark) is recommended to avoid saturating the Ruby server. `DataDrain` acts as the orchestrator that fires the Job, waits for it to finish, then performs validation and purge.
 
 ```ruby
-# 1. Fire the Glue Job and wait for it to finish successfully
 config = DataDrain.configuration
 bucket = "my-bucket"
 table = "versions"
 
+# 1. Fire the Glue Job and wait for it to finish successfully
 DataDrain::GlueRunner.run_and_wait(
   "my-glue-export-job",
   {
-    "--start_date"
-    "--end_date"
-    "--s3_bucket"
-    "--s3_folder"
-    "--db_url"
-    "--db_user"
-    "--db_password"
-    "--db_table"
-    "--partition_by"
+    "--start_date" => start_date.to_fs(:db),
+    "--end_date" => end_date.to_fs(:db),
+    "--s3_bucket" => bucket,
+    "--s3_folder" => table,
+    "--db_url" => "jdbc:postgresql://#{config.db_host}:#{config.db_port}/#{config.db_name}",
+    "--db_user" => config.db_user,
+    "--db_password" => config.db_pass,
+    "--db_table" => table,
+    "--partition_by" => "year,month,isp_id"
   }
 )
 
@@ -152,13 +160,13 @@ DataDrain::Engine.new(
   end_date: end_date,
   table_name: table,
   partition_keys: %w[year month isp_id],
-  skip_export: true
+  skip_export: true
 ).call
 ```
 
 #### AWS Glue (PySpark) script compatible with DataDrain
 
-Create a Job in the AWS Glue console (Spark 4.0+) and use this script as a base
+Create a Job in the AWS Glue console (Spark 4.0+) and use this script as a base:
 
 ```python
 import sys
@@ -168,7 +176,6 @@ from awsglue.context import GlueContext
 from awsglue.job import Job
 from pyspark.sql.functions import col, year, month
 
-# Parameters received from DataDrain::GlueRunner
 args = getResolvedOptions(sys.argv, [
     'JOB_NAME', 'start_date', 'end_date', 's3_bucket', 's3_folder',
     'db_url', 'db_user', 'db_password', 'db_table', 'partition_by'
@@ -180,7 +187,6 @@ spark = glueContext.spark_session
 job = Job(glueContext)
 job.init(args['JOB_NAME'], args)
 
-# 1. Read from PostgreSQL (via dynamic JDBC)
 options = {
     "url": args['db_url'],
     "dbtable": args['db_table'],
@@ -191,12 +197,9 @@ options = {
 
 df = spark.read.format("jdbc").options(**options).load()
 
-# 2. Add temporary partition columns (Hive Partitioning)
 df_final = df.withColumn("year", year(col("created_at"))) \
              .withColumn("month", month(col("created_at")))
 
-# 3. Write to S3 as Parquet with ZSTD compression
-# Build the path dynamically: s3://bucket/folder/
 output_path = f"s3://{args['s3_bucket']}/{args['s3_folder']}/"
 partitions = args['partition_by'].split(",")
 
@@ -216,27 +219,25 @@ To query archived data without leaving Ruby, create a model that inherits
 ```ruby
 # app/models/archived_version.rb
 class ArchivedVersion < DataDrain::Record
-  self.bucket
-  self.folder_name
+  self.bucket = 'my-bucket-storage'
+  self.folder_name = 'versions'
   self.partition_keys = [:year, :month, :isp_id]
 
-  attribute :id,
-  attribute :item_type,
-  attribute :item_id,
-  attribute :event,
-  attribute :whodunnit,
-  attribute :created_at,
-
-  # Uses the :json type provided by the gem to hydrate Hashes
-  attribute :object, :json
+  attribute :id, :string
+  attribute :item_type, :string
+  attribute :item_id, :string
+  attribute :event, :string
+  attribute :whodunnit, :string
+  attribute :created_at, :datetime
+  attribute :object, :json
   attribute :object_changes, :json
 end
 ```
 
-Queries
+Queries optimized via Hive Partitioning:
 
 ```ruby
-# Point lookup
+# Point lookup isolating the exact partition
 version = ArchivedVersion.find("un-uuid", year: 2026, month: 3, isp_id: 42)
 puts version.object_changes # => {"status" => ["active", "suspended"]}
 
@@ -244,12 +245,12 @@ puts version.object_changes # => {"status" => ["active", "suspended"]}
 history = ArchivedVersion.where(limit: 10, year: 2026, month: 3, isp_id: 42)
 ```
 
-###
+### 5. Data Destruction (Retention and Compliance)
 
 The framework can physically delete entire folders in S3 or locally using wildcards.
 
 ```ruby
-# Deletes a customer's entire history
+# Deletes a customer's entire history across all years
 ArchivedVersion.destroy_all(isp_id: 42)
 
 # Deletes all of March 2024's data globally
@@ -258,9 +259,25 @@ ArchivedVersion.destroy_all(year: 2024, month: 3)
 
 ## Architecture
 
-DataDrain implements the **Storage Adapter** pattern, fully isolating filesystem logic from the processing engines.
-
-*
+DataDrain implements the **Storage Adapter** pattern, fully isolating filesystem logic from the processing engines.
+
+* **Thread-local DuckDB connection:** `DataDrain::Record` keeps one DuckDB connection per thread (`Thread.current[:data_drain_duckdb]`). Each thread initializes its own connection exactly once, including loading extensions such as `httpfs`. Keep this in mind under Puma or Sidekiq.
+* **Cached Storage Adapter:** `DataDrain::Storage.adapter` caches the adapter instance. If `storage_mode` changes at runtime, call `DataDrain::Storage.reset_adapter!` to invalidate the cache.
+* **Analytical ORM with sanitization:** `DataDrain::Record` sanitizes parameters to prevent SQL injection when querying Parquet files.
+
+## Observability
+
+Every event emits structured `key=value` logs that tools like Datadog, CloudWatch Logs Insights or `exis_ray` can process:
+
+```
+component=data_drain event=engine.complete table=versions duration_s=12.4 export_duration_s=8.1 purge_duration_s=3.9 count=150000
+component=data_drain event=engine.integrity_error table=versions duration_s=5.2 count=150000
+component=data_drain event=engine.purge_heartbeat table=versions batches_processed_count=100 rows_deleted_count=500000
+component=data_drain event=file_ingestor.complete source_path=/tmp/data.csv duration_s=2.1 count=85000
+component=data_drain event=glue_runner.failed job=my-export-job run_id=jr_abc123 status=FAILED duration_s=301.0
+```
+
+Internal failures of the logging system never interrupt the main data flow.
 
 ## License
 
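The thread-local DuckDB connection described under the README's Architecture section can be sketched with a plain object standing in for a real connection handle (an assumption for illustration):

```ruby
# One handle per thread, initialized lazily, as the README describes.
def connection
  Thread.current[:data_drain_duckdb] ||= Object.new
end

same_thread  = connection.equal?(connection)          # true: the handle is reused
other        = Thread.new { connection }.value
cross_thread = connection.equal?(other)               # false: each thread gets its own
```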
data/lib/data_drain/engine.rb
CHANGED
@@ -9,6 +9,7 @@ module DataDrain
   # Orchestrates the ETL flow from PostgreSQL to an analytical Data Lake,
   # delegating storage interaction to the configured adapter.
   class Engine
+    include Observability
     # Initializes a new instance of the extraction engine.
     #
     # @param options [Hash] Configuration dictionary for the extraction.
@@ -50,33 +51,57 @@ module DataDrain
     # @return [Boolean] `true` if the process finished successfully, `false` if integrity failed.
     def call
       start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
-
+      safe_log(:info, "engine.start", { table: @table_name, start_date: @start_date.to_date, end_date: @end_date.to_date })
 
       setup_duckdb
 
+      # 1. Initial count in Postgres
+      step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
       @pg_count = get_postgres_count
+      db_query_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
 
       if @pg_count.zero?
         duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
-
+        safe_log(:info, "engine.skip_empty", { table: @table_name, duration_s: duration.round(2), db_query_duration_s: db_query_duration.round(2) })
         return true
       end
 
+      # 2. Export
+      export_duration = 0.0
       if @skip_export
-
+        safe_log(:info, "engine.skip_export", { table: @table_name })
       else
-
+        safe_log(:info, "engine.export_start", { table: @table_name, count: @pg_count })
+        step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
         export_to_parquet
+        export_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
       end
 
-
+      # 3. Integrity check
+      step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+      integrity_ok = verify_integrity
+      integrity_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
+
+      if integrity_ok
+        # 4. Purge in Postgres
+        step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
         purge_from_postgres
+        purge_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
+
         duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
-
+        safe_log(:info, "engine.complete", {
+          table: @table_name,
+          duration_s: duration.round(2),
+          db_query_duration_s: db_query_duration.round(2),
+          export_duration_s: export_duration.round(2),
+          integrity_duration_s: integrity_duration.round(2),
+          purge_duration_s: purge_duration.round(2),
+          count: @pg_count
+        })
         true
       else
         duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
-
+        safe_log(:error, "engine.integrity_error", { table: @table_name, duration_s: duration.round(2), count: @pg_count })
         false
       end
     end
@@ -151,17 +176,17 @@ module DataDrain
       SQL
       parquet_result = @duckdb.query(query).first.first
     rescue DuckDB::Error => e
-
+      safe_log(:error, "engine.parquet_read_error", { table: @table_name }.merge(exception_metadata(e)))
       return false
     end
 
-
+    safe_log(:info, "engine.integrity_check", { table: @table_name, pg_count: @pg_count, parquet_count: parquet_result })
     @pg_count == parquet_result
   end
 
   # @api private
   def purge_from_postgres
-
+    safe_log(:info, "engine.purge_start", { table: @table_name, batch_size: @config.batch_size })
 
     conn = PG.connect(
       host: @config.db_host,
@@ -175,6 +200,9 @@ module DataDrain
       conn.exec("SET idle_in_transaction_session_timeout = #{@config.idle_in_transaction_session_timeout};")
     end
 
+    batches_processed = 0
+    total_deleted = 0
+
     loop do
       sql = <<~SQL
         DELETE FROM #{@table_name}
@@ -186,7 +214,20 @@ module DataDrain
       SQL
 
       result = conn.exec(sql)
-
+      count = result.cmd_tuples
+      break if count.zero?
+
+      batches_processed += 1
+      total_deleted += count
+
+      # Heartbeat every 100 batches to monitor long 1TB runs
+      if (batches_processed % 100).zero?
+        safe_log(:info, "engine.purge_heartbeat", {
+          table: @table_name,
+          batches_processed_count: batches_processed,
+          rows_deleted_count: total_deleted
+        })
+      end
 
       sleep(@config.throttle_delay) if @config.throttle_delay.positive?
     end
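The heartbeat cadence added to `purge_from_postgres` (emit every 100 batches, stop on an empty batch) can be isolated as a small function; the batch counts here are synthetic:

```ruby
# Returns the [batches_processed, rows_deleted] pairs at which a heartbeat
# would fire, given a sequence of per-batch delete counts.
def heartbeats(batch_counts, every: 100)
  emitted = []
  processed = 0
  total = 0
  batch_counts.each do |count|
    break if count.zero?           # empty batch ends the purge loop
    processed += 1
    total += count
    emitted << [processed, total] if (processed % every).zero?
  end
  emitted
end

# 250 batches of 5000 rows, then an empty batch ends the loop
heartbeats(Array.new(250, 5000) + [0])
# => [[100, 500000], [200, 1000000]]
```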
data/lib/data_drain/file_ingestor.rb
CHANGED
@@ -5,6 +5,8 @@ module DataDrain
   # generated by other services (e.g. Netflow) and upload them to the Data Lake,
   # applying ZSTD compression and Hive partitioning.
   class FileIngestor
+    include Observability
+
     # @param options [Hash] Ingestion options.
     # @option options [String] :source_path Absolute path to the local file.
    # @option options [String] :folder_name Destination folder name in the Data Lake.
@@ -31,10 +33,10 @@ module DataDrain
     # @return [Boolean] true if the process succeeded.
     def call
       start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
-
+      safe_log(:info, "file_ingestor.start", { source_path: @source_path })
 
       unless File.exist?(@source_path)
-
+        safe_log(:error, "file_ingestor.file_not_found", { source_path: @source_path })
         return false
       end
 
@@ -47,13 +49,15 @@ module DataDrain
       reader_function = determine_reader
 
       # 1. Safety count
+      step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
       source_count = @duckdb.query("SELECT COUNT(*) FROM #{reader_function}").first.first
-
+      source_query_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
+      safe_log(:info, "file_ingestor.count", { source_path: @source_path, count: source_count, source_query_duration_s: source_query_duration.round(2) })
 
       if source_count.zero?
         cleanup_local_file
         duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
-
+        safe_log(:info, "file_ingestor.skip_empty", { source_path: @source_path, duration_s: duration.round(2) })
         return true
       end
 
@@ -76,17 +80,25 @@ module DataDrain
       );
       SQL
 
-
+      safe_log(:info, "file_ingestor.export_start", { dest_path: dest_path })
+      step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
       @duckdb.query(query)
+      export_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
 
       duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
-
+      safe_log(:info, "file_ingestor.complete", {
+        source_path: @source_path,
+        duration_s: duration.round(2),
+        source_query_duration_s: source_query_duration.round(2),
+        export_duration_s: export_duration.round(2),
+        count: source_count
+      })
 
       cleanup_local_file
       true
     rescue DuckDB::Error => e
       duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
-
+      safe_log(:error, "file_ingestor.duckdb_error", { source_path: @source_path }.merge(exception_metadata(e)).merge(duration_s: duration.round(2)))
       false
     ensure
       @duckdb&.close
@@ -112,7 +124,7 @@ module DataDrain
     def cleanup_local_file
       if @delete_after_upload && File.exist?(@source_path)
         File.delete(@source_path)
-
+        safe_log(:info, "file_ingestor.cleanup", { source_path: @source_path })
       end
     end
   end
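`determine_reader` itself is not shown in this diff; a plausible extension-based dispatch to DuckDB's real reader functions (`read_csv_auto`, `read_json_auto`, `read_parquet`) might look like the following (this helper and its name are an assumption, not the gem's actual code):

```ruby
# Hypothetical reader dispatch: map a file extension to the DuckDB table
# function that can scan it.
def reader_for(path)
  case File.extname(path)
  when ".csv"     then "read_csv_auto('#{path}')"
  when ".json"    then "read_json_auto('#{path}')"
  when ".parquet" then "read_parquet('#{path}')"
  else raise ArgumentError, "unsupported file type: #{path}"
  end
end

reader_for("/tmp/netflow_metrics_1600.csv")
# => "read_csv_auto('/tmp/netflow_metrics_1600.csv')"
```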
data/lib/data_drain/glue_runner.rb
CHANGED
@@ -6,6 +6,9 @@ module DataDrain
   # Orchestrator for AWS Glue. Fires and monitors Jobs in AWS to delegate
   # massive data movement (e.g. 1TB tables).
   class GlueRunner
+    extend Observability
+    private_class_method :safe_log, :exception_metadata, :observability_name
+
     # Fires a Glue Job and waits for it to finish successfully.
     #
     # @param job_name [String] Job name in the AWS console.
@@ -18,7 +21,11 @@ module DataDrain
       client = Aws::Glue::Client.new(region: config.aws_region)
       start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
 
-
+      # Since we extend Observability, safe_log is available directly;
+      # the configured logger just needs to be assigned first.
+      @logger = config.logger
+
+      safe_log(:info, "glue_runner.start", { job: job_name })
       resp = client.start_job_run(job_name: job_name, arguments: arguments)
       run_id = resp.job_run_id
 
@@ -29,15 +36,20 @@ module DataDrain
       case status
       when "SUCCEEDED"
         duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
-
+        safe_log(:info, "glue_runner.complete", { job: job_name, run_id: run_id, duration_s: duration.round(2) })
         return true
       when "FAILED", "STOPPED", "TIMEOUT"
-        error_msg = run_info.error_message || "No error message available."
         duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
-
+        error_metadata = { job: job_name, run_id: run_id, status: status, duration_s: duration.round(2) }
+
+        if run_info.error_message
+          error_metadata[:error_message] = run_info.error_message.gsub("\"", "'")[0, 200]
+        end
+
+        safe_log(:error, "glue_runner.failed", error_metadata)
         raise "Glue Job #{job_name} (Run ID: #{run_id}) failed with status #{status}."
       else
-
+        safe_log(:info, "glue_runner.polling", { job: job_name, run_id: run_id, status: status, next_check_in_s: polling_interval })
         sleep polling_interval
       end
     end
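The `run_and_wait` polling pattern (succeed, fail on terminal status, otherwise sleep and re-check) can be sketched generically, with a stubbed status sequence in place of a real `Aws::Glue::Client`:

```ruby
# Walks a sequence of statuses the way run_and_wait polls Glue:
# returns true on SUCCEEDED, raises on terminal failure, sleeps otherwise.
def wait_for(statuses, interval: 0)
  statuses.each do |status|
    case status
    when "SUCCEEDED"                     then return true
    when "FAILED", "STOPPED", "TIMEOUT"  then raise "terminal status: #{status}"
    else sleep interval                  # still RUNNING: wait and poll again
    end
  end
  false
end

wait_for(%w[RUNNING RUNNING SUCCEEDED])  # => true
```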
data/lib/data_drain/observability.rb
ADDED
@@ -0,0 +1,48 @@
+# frozen_string_literal: true
+
+module DataDrain
+  # Internal module that guarantees telemetry complies with the
+  # Global-Observability-Standards: resilience, KV-structured and precision.
+  #
+  # This module is generic and can be reused in other gems.
+  # @api private
+  module Observability
+    private
+
+    # Emits a structured log safely.
+    # Guarantees that logging never interrupts the main process (Resilience).
+    def safe_log(level, event, metadata = {})
+      return unless @logger
+
+      # component and event always first, then the context
+      fields = { component: observability_name, event: event }.merge(metadata)
+
+      # Preventive masking of secrets (Security)
+      log_line = fields.map do |k, v|
+        val = %i[password token secret api_key auth].include?(k.to_sym) ? "[FILTERED]" : v
+        "#{k}=#{val}"
+      end.join(" ")
+
+      @logger.send(level) { log_line }
+    rescue StandardError
+      # Absolute silence on log failures so critical processes are never stopped
+    end
+
+    # Formats exceptions following the Standard Error Context.
+    def exception_metadata(error)
+      {
+        error_class: error.class.name,
+        error_message: error.message.gsub("\"", "'")[0, 200]
+      }
+    end
+
+    # Component name for logs.
+    # Works both in instance methods (self = object) and class methods (self = Class).
+    def observability_name
+      klass = is_a?(Class) ? self : self.class
+      klass.name.split("::").first.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
+    rescue StandardError
+      "unknown"
+    end
+  end
+end
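The two key behaviors of the new module, secret masking and KV formatting, can be demonstrated standalone with Ruby's stdlib `Logger` writing into a `StringIO` (the `masked_line` helper is illustrative, mirroring the diff above):

```ruby
require "logger"
require "stringio"

SENSITIVE = %i[password token secret api_key auth].freeze

# KV line with secret masking, as in Observability#safe_log.
def masked_line(fields)
  fields.map { |k, v| "#{k}=#{SENSITIVE.include?(k.to_sym) ? '[FILTERED]' : v}" }.join(" ")
end

io = StringIO.new
Logger.new(io).info { masked_line(component: "data_drain", event: "engine.start", token: "abc123") }
io.string  # contains "token=[FILTERED]" and never the raw "abc123"
```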
data/lib/data_drain/record.rb
CHANGED
@@ -17,6 +17,8 @@ module DataDrain
   class Record
     include ActiveModel::Model
     include ActiveModel::Attributes
+    extend Observability
+    private_class_method :safe_log, :exception_metadata, :observability_name
 
     class_attribute :bucket
     class_attribute :folder_name
@@ -86,7 +88,8 @@ module DataDrain
   # @return [Integer] Number of physical partitions deleted.
   def self.destroy_all(**partitions)
     adapter = DataDrain::Storage.adapter
-
+    @logger = DataDrain.configuration.logger
+    safe_log(:info, "record.destroy_all", { folder: folder_name, partitions: partitions.inspect })
 
     adapter.destroy_partitions(bucket, folder_name, partition_keys, partitions)
   end
@@ -116,10 +119,11 @@ module DataDrain
   # @param columns [Array<String>]
   # @return [Array<DataDrain::Record>]
   def execute_and_instantiate(sql, columns)
+    @logger = DataDrain.configuration.logger
    begin
      result = connection.query(sql)
    rescue DuckDB::Error => e
-
+      safe_log(:warn, "record.parquet_not_found", exception_metadata(e))
      return []
    end
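The Hive-style partition pruning behind `Record.find`/`Record.where` (fixing a partition key narrows the Parquet glob; leaving one out widens it to `*`) can be sketched as follows; the helper and path layout are illustrative, since the gem's actual path-building code is not shown in this diff:

```ruby
# Hypothetical glob builder: fixed partition keys become key=value path
# segments, unfixed keys become wildcards.
def partition_glob(bucket, folder, keys, values)
  parts = keys.map { |k| values.key?(k) ? "#{k}=#{values[k]}" : "#{k}=*" }
  "s3://#{bucket}/#{folder}/#{parts.join('/')}/*.parquet"
end

partition_glob("my-bucket", "versions", %i[year month isp_id], { year: 2026, month: 3 })
# => "s3://my-bucket/versions/year=2026/month=3/isp_id=*/*.parquet"
```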
data/lib/data_drain/version.rb
CHANGED
data/lib/data_drain.rb
CHANGED
@@ -5,6 +5,7 @@ require_relative "data_drain/version"
 require_relative "data_drain/errors"
 require_relative "data_drain/configuration"
 require_relative "data_drain/storage"
+require_relative "data_drain/observability"
 require_relative "data_drain/engine"
 require_relative "data_drain/record"
 require_relative "data_drain/file_ingestor"
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: data_drain
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.18
 platform: ruby
 authors:
 - Gabriel
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2026-03-
+date: 2026-03-24 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activemodel
@@ -103,6 +103,7 @@ files:
 - lib/data_drain/errors.rb
 - lib/data_drain/file_ingestor.rb
 - lib/data_drain/glue_runner.rb
+- lib/data_drain/observability.rb
 - lib/data_drain/record.rb
 - lib/data_drain/storage.rb
 - lib/data_drain/storage/base.rb