data_drain 0.2.2 → 0.3.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +23 -0
- data/docs/IMPROVEMENT_PLAN.md +10 -5
- data/lib/data_drain/configuration.rb +7 -1
- data/lib/data_drain/engine.rb +182 -72
- data/lib/data_drain/file_ingestor.rb +64 -47
- data/lib/data_drain/observability/timing.rb +23 -0
- data/lib/data_drain/record.rb +8 -15
- data/lib/data_drain/storage/s3.rb +30 -11
- data/lib/data_drain/version.rb +1 -1
- data/lib/data_drain.rb +1 -0
- metadata +3 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 8fc6bdf2a5a21825f9bc07185c0fc259056646c6084de2a42fd2d4e3198af4f2
|
|
4
|
+
data.tar.gz: aa7623b2ec910da4491c38904aa4623fd48a466ccc97a38124f29aafd9677655
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: '0201838e917434b529250689a5987008f9e8895366b56a78522752fe5b490fb6ab3d0b4c40dc81bcc830af3132c005fc8d136ab719093a16f0c9de5221e3421c'
|
|
7
|
+
data.tar.gz: 5be0c9c4a47d07f479f49c977d0490aad909a33fbc0c9292594bd51163132ee5229e0131f23d253120732d530930e5474bd31abbef7a5289f3022055799b5525
|
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,28 @@
|
|
|
1
1
|
## [Unreleased]
|
|
2
2
|
|
|
3
|
+
## [0.3.0] - 2026-04-15
|
|
4
|
+
|
|
5
|
+
### Refactor
|
|
6
|
+
- `Engine#call` refactored: extracted `step_count`, `step_export`, `step_verify`, `step_purge` as private methods that use the `timed` helper. Cyclomatic complexity dropped from 13 to 5. Emitted events are identical to the previous behavior. (item 10)
|
|
7
|
+
- Extracted the `DataDrain::Observability::Timing` mixin, shared between Engine and FileIngestor. (item 20)
|
|
8
|
+
- `FileIngestor#call` refactored along the same lines as Engine. (item 20)
|
|
9
|
+
- Removed every `# rubocop:disable Metrics/*` in `lib/`. (item 20)
|
|
10
|
+
|
|
11
|
+
### Features
|
|
12
|
+
- `config.vacuum_after_purge = false` (default). If `true`, runs `VACUUM ANALYZE` after the purge when rows were deleted. Emits `engine.vacuum_complete` with dead_tuples before/after and the duration. PG errors are caught and logged as `engine.vacuum_error` WARN. (item 5)
|
|
13
|
+
- `config.slow_batch_threshold_s = 30` and `config.slow_batch_alert_after = 5`. Detects slow purge batches. Emits `engine.slow_batch` WARN for each slow batch and `engine.purge_degraded` WARN once per streak. Includes a hint pointing to the tuning docs. (item 11b)
|
|
14
|
+
|
|
15
|
+
### Security
|
|
16
|
+
- `Record.connection` applies `SET lock_configuration=true` after setup. This freezes any future SET on the connection (defense in depth). It does NOT affect secrets or extensions that are already loaded. (item 6)
|
|
17
|
+
|
|
18
|
+
### New telemetry
|
|
19
|
+
- `engine.vacuum_complete`, `engine.vacuum_error`, `engine.slow_batch`, `engine.purge_degraded`.
|
|
20
|
+
|
|
21
|
+
### Tests
|
|
22
|
+
- Coverage stays ≥ 80%.
|
|
23
|
+
- New equivalence test for Engine (identical events pre/post refactor).
|
|
24
|
+
- Timecop added for timing tests (item 11b).
|
|
25
|
+
|
|
3
26
|
## [0.2.2] - 2026-04-14
|
|
4
27
|
|
|
5
28
|
### Security
|
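The slow-batch detection described in the 0.3.0 entry can be sketched in isolation. This is a minimal illustration rather than the gem's code: `SlowBatchTracker`, its `events` array and its constructor arguments are hypothetical names, and only the event names and the defaults (30 s threshold, alert after 5 batches) come from the changelog.

```ruby
# Hypothetical stand-in for the streak logic; not part of data_drain.
class SlowBatchTracker
  attr_reader :events, :streak

  # Defaults mirror config.slow_batch_threshold_s and config.slow_batch_alert_after.
  def initialize(threshold_s: 30, alert_after: 5)
    @threshold_s = threshold_s
    @alert_after = alert_after
    @streak = 0
    @events = []
  end

  # Records one batch duration (in seconds) and returns the current streak.
  def record(batch_duration_s)
    if batch_duration_s > @threshold_s
      @streak += 1
      @events << "engine.slow_batch"
      # Fires only when the streak first reaches the limit, so once per streak.
      @events << "engine.purge_degraded" if @streak == @alert_after
    else
      # Any fast batch resets the streak.
      @streak = 0
    end
    @streak
  end
end

tracker = SlowBatchTracker.new(threshold_s: 30, alert_after: 3)
[31, 45, 40, 50, 2, 35].each { |seconds| tracker.record(seconds) }
```

Because the degraded alert uses an equality check instead of `>=`, a long run of slow batches produces one `engine.purge_degraded` event, while every slow batch still produces its own `engine.slow_batch` event.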
data/docs/IMPROVEMENT_PLAN.md
CHANGED
|
@@ -433,12 +433,13 @@ spec/
|
|
|
433
433
|
|
|
434
434
|
#### Item 5 — Optional VACUUM ANALYZE after purge
|
|
435
435
|
|
|
436
|
-
**Status:** `[
|
|
436
|
+
**Status:** `[x]`
|
|
437
437
|
**Priority:** P1
|
|
438
438
|
**Type:** `feat` `perf`
|
|
439
439
|
**Compatibility:** backward-compatible (default `false`, opt-in)
|
|
440
440
|
**Estimate:** S (2-3h)
|
|
441
441
|
**Release:** v0.2.1
|
|
442
|
+
**Commit:** `93bf8a8`
|
|
442
443
|
|
|
443
444
|
##### Context
|
|
444
445
|
|
|
@@ -754,12 +755,13 @@ Contents:
|
|
|
754
755
|
|
|
755
756
|
#### Item 6 — Sandboxing of `Record.connection`
|
|
756
757
|
|
|
757
|
-
**Status:** `[
|
|
758
|
+
**Status:** `[x]`
|
|
758
759
|
**Priority:** P1
|
|
759
760
|
**Type:** `security`
|
|
760
761
|
**Compatibility:** backward-compatible (with a risk of breakage if a caller relied on unusual workarounds)
|
|
761
762
|
**Estimate:** M (3-4h)
|
|
762
763
|
**Release:** v0.3.0
|
|
764
|
+
**Commit:** `f042c56`
|
|
763
765
|
|
|
764
766
|
##### Context
|
|
765
767
|
|
|
@@ -800,12 +802,13 @@ Reduces the blast radius if someone tries to inject malicious SQL via `where_clause
|
|
|
800
802
|
|
|
801
803
|
#### Item 10 — Refactor `Engine#call` (CC=13 → ~5)
|
|
802
804
|
|
|
803
|
-
**Status:** `[
|
|
805
|
+
**Status:** `[x]`
|
|
804
806
|
**Priority:** P1
|
|
805
807
|
**Type:** `refactor`
|
|
806
808
|
**Compatibility:** backward-compatible
|
|
807
809
|
**Estimate:** M (4-6h)
|
|
808
810
|
**Release:** v0.3.0
|
|
811
|
+
**Commit:** `6a06850`
|
|
809
812
|
|
|
810
813
|
##### Context
|
|
811
814
|
|
|
@@ -878,12 +881,13 @@ Reduces the blast radius if someone tries to inject malicious SQL via `where_clause
|
|
|
878
881
|
|
|
879
882
|
#### Item 11b — Runtime warning for a slow purge that is not making progress
|
|
880
883
|
|
|
881
|
-
**Status:** `[
|
|
884
|
+
**Status:** `[x]`
|
|
882
885
|
**Priority:** P1
|
|
883
886
|
**Type:** `feat` `perf`
|
|
884
887
|
**Compatibility:** backward-compatible
|
|
885
888
|
**Estimate:** M (3-4h)
|
|
886
889
|
**Release:** v0.3.0
|
|
890
|
+
**Commit:** `d72ec0a`
|
|
887
891
|
|
|
888
892
|
##### Context
|
|
889
893
|
|
|
@@ -1206,12 +1210,13 @@ v0.2.1 only runs CI on Ruby 3.4.4. The gem declares `required_ruby_version = ">=
|
|
|
1206
1210
|
|
|
1207
1211
|
#### Item 20 — Clean up the `rubocop:disable` directives in `lib/` added in v0.2.0
|
|
1208
1212
|
|
|
1209
|
-
**Status:** `[
|
|
1213
|
+
**Status:** `[x]`
|
|
1210
1214
|
**Priority:** P2
|
|
1211
1215
|
**Type:** `refactor`
|
|
1212
1216
|
**Compatibility:** N/A
|
|
1213
1217
|
**Estimate:** Depends on item 10 (Engine#call refactor)
|
|
1214
1218
|
**Suggested release:** v0.3.0 (together with item 10)
|
|
1219
|
+
**Commit:** `f6f4ddc` (FileIngestor), `02d207c` (S3 refactor), `5522c79` (Timing mixin)
|
|
1215
1220
|
|
|
1216
1221
|
##### Context
|
|
1217
1222
|
|
data/lib/data_drain/configuration.rb
CHANGED
|
@@ -9,7 +9,10 @@ module DataDrain
|
|
|
9
9
|
:aws_access_key_id, :aws_secret_access_key,
|
|
10
10
|
:db_host, :db_port, :db_user, :db_pass, :db_name,
|
|
11
11
|
:batch_size, :throttle_delay, :logger, :limit_ram, :tmp_directory,
|
|
12
|
-
:idle_in_transaction_session_timeout
|
|
12
|
+
:idle_in_transaction_session_timeout,
|
|
13
|
+
:vacuum_after_purge,
|
|
14
|
+
:slow_batch_threshold_s,
|
|
15
|
+
:slow_batch_alert_after
|
|
13
16
|
|
|
14
17
|
def initialize
|
|
15
18
|
@storage_mode = :local
|
|
@@ -20,6 +23,9 @@ module DataDrain
|
|
|
20
23
|
@limit_ram = nil # eg 2GB
|
|
21
24
|
@tmp_directory = nil # eg /tmp/duckdb_work
|
|
22
25
|
@idle_in_transaction_session_timeout = 0
|
|
26
|
+
@vacuum_after_purge = false
|
|
27
|
+
@slow_batch_threshold_s = 30
|
|
28
|
+
@slow_batch_alert_after = 5
|
|
23
29
|
@logger = Logger.new($stdout)
|
|
24
30
|
end
|
|
25
31
|
|
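The three settings added to the configuration are opt-in or carry conservative defaults. The sketch below shows those defaults and how a caller might override them. `SketchConfig` and `with_defaults` are hypothetical names used so the example runs without the gem; only the attribute names and default values are taken from the diff.

```ruby
# Hypothetical stand-in for DataDrain::Configuration, limited to the new 0.3.0 settings.
SketchConfig = Struct.new(:vacuum_after_purge, :slow_batch_threshold_s, :slow_batch_alert_after,
                          keyword_init: true) do
  # Defaults copied from the initializer shown in the diff.
  def self.with_defaults
    new(vacuum_after_purge: false, slow_batch_threshold_s: 30, slow_batch_alert_after: 5)
  end
end

config = SketchConfig.with_defaults
config.vacuum_after_purge = true    # opt in to VACUUM ANALYZE after a purge
config.slow_batch_threshold_s = 10  # treat batches slower than 10 seconds as slow
```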
data/lib/data_drain/engine.rb
CHANGED
|
@@ -5,12 +5,12 @@ require "pg"
|
|
|
5
5
|
|
|
6
6
|
module DataDrain
|
|
7
7
|
# Main data extraction and purge engine (DataDrain).
|
|
8
|
-
# rubocop:disable Metrics/ClassLength, Metrics/AbcSize, Metrics/MethodLength, Naming/AccessorMethodName
|
|
9
8
|
#
|
|
10
9
|
# Orchestrates the ETL flow from PostgreSQL to an analytical Data Lake,
|
|
11
10
|
# delegating storage interaction to the configured adapter.
|
|
12
11
|
class Engine
|
|
13
12
|
include Observability
|
|
13
|
+
include Observability::Timing
|
|
14
14
|
# Initializes a new instance of the extraction engine.
|
|
15
15
|
#
|
|
16
16
|
# @param options [Hash] Configuration hash for the extraction.
|
|
@@ -50,70 +50,91 @@ module DataDrain
|
|
|
50
50
|
@duckdb = database.connect
|
|
51
51
|
end
|
|
52
52
|
|
|
53
|
-
# Runs the engine's full flow: Setup, Count, Export (optional), Verification and Purge.
|
|
54
|
-
#
|
|
55
|
-
# @return [Boolean] `true` if the process finished successfully, `false` if the integrity check failed.
|
|
56
53
|
def call
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
|
|
54
|
+
@durations = {}
|
|
55
|
+
start_time = monotonic
|
|
56
|
+
log_start
|
|
60
57
|
|
|
61
58
|
setup_duckdb
|
|
59
|
+
return skip_empty(start_time) if step_count.zero?
|
|
62
60
|
|
|
63
|
-
# 1. Initial count in Postgres
|
|
64
|
-
step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
|
|
65
|
-
@pg_count = get_postgres_count
|
|
66
|
-
db_query_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
|
|
67
|
-
|
|
68
|
-
if @pg_count.zero?
|
|
69
|
-
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
|
|
70
|
-
safe_log(:info, "engine.skip_empty",
|
|
71
|
-
{ table: @table_name, duration_s: duration.round(2), db_query_duration_s: db_query_duration.round(2) })
|
|
72
|
-
return true
|
|
73
|
-
end
|
|
74
|
-
|
|
75
|
-
# 2. Export
|
|
76
|
-
export_duration = 0.0
|
|
77
61
|
if @skip_export
|
|
78
62
|
safe_log(:info, "engine.skip_export", { table: @table_name })
|
|
79
63
|
else
|
|
80
|
-
|
|
81
|
-
step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
|
|
82
|
-
export_to_parquet
|
|
83
|
-
export_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
|
|
64
|
+
step_export
|
|
84
65
|
end
|
|
66
|
+
return integrity_failed(start_time) unless step_verify
|
|
85
67
|
|
|
86
|
-
|
|
87
|
-
|
|
88
|
-
|
|
89
|
-
|
|
68
|
+
step_purge
|
|
69
|
+
log_complete(start_time)
|
|
70
|
+
true
|
|
71
|
+
end
|
|
90
72
|
|
|
91
|
-
|
|
92
|
-
# 4. Purge in Postgres
|
|
93
|
-
step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
|
|
94
|
-
purge_from_postgres
|
|
95
|
-
purge_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
|
|
73
|
+
private
|
|
96
74
|
|
|
97
|
-
|
|
98
|
-
|
|
99
|
-
|
|
100
|
-
|
|
101
|
-
db_query_duration_s: db_query_duration.round(2),
|
|
102
|
-
export_duration_s: export_duration.round(2),
|
|
103
|
-
integrity_duration_s: integrity_duration.round(2),
|
|
104
|
-
purge_duration_s: purge_duration.round(2),
|
|
105
|
-
count: @pg_count
|
|
106
|
-
})
|
|
107
|
-
true
|
|
108
|
-
else
|
|
109
|
-
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
|
|
110
|
-
safe_log(:error, "engine.integrity_error",
|
|
111
|
-
{ table: @table_name, duration_s: duration.round(2), count: @pg_count })
|
|
112
|
-
false
|
|
113
|
-
end
|
|
75
|
+
# @api private
|
|
76
|
+
def log_start
|
|
77
|
+
safe_log(:info, "engine.start",
|
|
78
|
+
{ table: @table_name, start_date: @start_date.to_date, end_date: @end_date.to_date })
|
|
114
79
|
end
|
|
115
80
|
|
|
116
|
-
private
|
|
81
|
+
# @api private
|
|
82
|
+
def step_count
|
|
83
|
+
@pg_count = timed(:db_query) { get_postgres_count }
|
|
84
|
+
@pg_count
|
|
85
|
+
end
|
|
86
|
+
|
|
87
|
+
# @api private
|
|
88
|
+
def skip_empty(start_time)
|
|
89
|
+
duration = monotonic - start_time
|
|
90
|
+
safe_log(:info, "engine.skip_empty", {
|
|
91
|
+
table: @table_name,
|
|
92
|
+
duration_s: duration.round(2),
|
|
93
|
+
db_query_duration_s: @durations.fetch(:db_query, 0).round(2)
|
|
94
|
+
})
|
|
95
|
+
true
|
|
96
|
+
end
|
|
97
|
+
|
|
98
|
+
# @api private
|
|
99
|
+
def step_export
|
|
100
|
+
safe_log(:info, "engine.export_start", { table: @table_name, count: @pg_count })
|
|
101
|
+
timed(:export) { export_to_parquet }
|
|
102
|
+
end
|
|
103
|
+
|
|
104
|
+
# @api private
|
|
105
|
+
def step_verify
|
|
106
|
+
timed(:integrity) { verify_integrity }
|
|
107
|
+
end
|
|
108
|
+
|
|
109
|
+
# @api private
|
|
110
|
+
def step_purge
|
|
111
|
+
timed(:purge) { purge_from_postgres }
|
|
112
|
+
end
|
|
113
|
+
|
|
114
|
+
# @api private
|
|
115
|
+
def log_complete(start_time)
|
|
116
|
+
duration = monotonic - start_time
|
|
117
|
+
safe_log(:info, "engine.complete", {
|
|
118
|
+
table: @table_name,
|
|
119
|
+
duration_s: duration.round(2),
|
|
120
|
+
db_query_duration_s: @durations.fetch(:db_query, 0).round(2),
|
|
121
|
+
export_duration_s: @durations.fetch(:export, 0).round(2),
|
|
122
|
+
integrity_duration_s: @durations.fetch(:integrity, 0).round(2),
|
|
123
|
+
purge_duration_s: @durations.fetch(:purge, 0).round(2),
|
|
124
|
+
count: @pg_count
|
|
125
|
+
})
|
|
126
|
+
end
|
|
127
|
+
|
|
128
|
+
# @api private
|
|
129
|
+
def integrity_failed(start_time)
|
|
130
|
+
duration = monotonic - start_time
|
|
131
|
+
safe_log(:error, "engine.integrity_error", {
|
|
132
|
+
table: @table_name,
|
|
133
|
+
duration_s: duration.round(2),
|
|
134
|
+
count: @pg_count
|
|
135
|
+
})
|
|
136
|
+
false
|
|
137
|
+
end
|
|
117
138
|
|
|
118
139
|
# @api private
|
|
119
140
|
# @return [String]
|
|
@@ -213,40 +234,129 @@ module DataDrain
|
|
|
213
234
|
conn.exec("SET idle_in_transaction_session_timeout = #{@config.idle_in_transaction_session_timeout};")
|
|
214
235
|
end
|
|
215
236
|
|
|
237
|
+
total_deleted = purge_loop(conn)
|
|
238
|
+
|
|
239
|
+
vacuum_if_needed(conn, total_deleted)
|
|
240
|
+
ensure
|
|
241
|
+
conn&.close
|
|
242
|
+
end
|
|
243
|
+
|
|
244
|
+
# @api private
|
|
245
|
+
def vacuum_if_needed(conn, total_deleted)
|
|
246
|
+
return unless @config.vacuum_after_purge
|
|
247
|
+
return if total_deleted.zero?
|
|
248
|
+
|
|
249
|
+
vacuum_start = monotonic
|
|
250
|
+
dead_before = fetch_dead_tuple_count(conn)
|
|
251
|
+
|
|
252
|
+
begin
|
|
253
|
+
conn.exec("VACUUM ANALYZE #{@table_name};")
|
|
254
|
+
rescue PG::Error => e
|
|
255
|
+
safe_log(:warn, "engine.vacuum_error", {
|
|
256
|
+
table: @table_name,
|
|
257
|
+
dead_tuples_before: dead_before,
|
|
258
|
+
rows_deleted_count: total_deleted,
|
|
259
|
+
duration_s: (monotonic - vacuum_start).round(2)
|
|
260
|
+
}.merge(exception_metadata(e)))
|
|
261
|
+
return
|
|
262
|
+
end
|
|
263
|
+
|
|
264
|
+
dead_after = fetch_dead_tuple_count(conn)
|
|
265
|
+
vacuum_duration = monotonic - vacuum_start
|
|
266
|
+
|
|
267
|
+
safe_log(:info, "engine.vacuum_complete", {
|
|
268
|
+
table: @table_name,
|
|
269
|
+
duration_s: vacuum_duration.round(2),
|
|
270
|
+
dead_tuples_before: dead_before,
|
|
271
|
+
dead_tuples_after: dead_after,
|
|
272
|
+
rows_deleted_count: total_deleted
|
|
273
|
+
})
|
|
274
|
+
end
|
|
275
|
+
|
|
276
|
+
# @api private
|
|
277
|
+
def fetch_dead_tuple_count(conn)
|
|
278
|
+
result = conn.exec_params(
|
|
279
|
+
"SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = $1",
|
|
280
|
+
[@table_name]
|
|
281
|
+
)
|
|
282
|
+
result.first&.dig("n_dead_tup")&.to_i || 0
|
|
283
|
+
rescue PG::Error
|
|
284
|
+
-1
|
|
285
|
+
end
|
|
286
|
+
|
|
287
|
+
# @api private
|
|
288
|
+
# @param conn [PG::Connection]
|
|
289
|
+
# @return [Integer] total rows deleted
|
|
290
|
+
def purge_loop(conn)
|
|
216
291
|
batches_processed = 0
|
|
217
292
|
total_deleted = 0
|
|
293
|
+
slow_batch_streak = 0
|
|
218
294
|
|
|
219
295
|
loop do
|
|
220
|
-
|
|
221
|
-
|
|
222
|
-
|
|
223
|
-
SELECT #{@primary_key} FROM #{@table_name}
|
|
224
|
-
WHERE #{base_where_sql}
|
|
225
|
-
LIMIT #{@config.batch_size}
|
|
226
|
-
)
|
|
227
|
-
SQL
|
|
228
|
-
|
|
229
|
-
result = conn.exec(sql)
|
|
296
|
+
batch_start = monotonic
|
|
297
|
+
result = conn.exec(build_delete_sql)
|
|
298
|
+
batch_duration = monotonic - batch_start
|
|
230
299
|
count = result.cmd_tuples
|
|
231
300
|
break if count.zero?
|
|
232
301
|
|
|
233
302
|
batches_processed += 1
|
|
234
303
|
total_deleted += count
|
|
235
304
|
|
|
236
|
-
|
|
237
|
-
|
|
238
|
-
|
|
305
|
+
slow_batch_streak = handle_batch_timing(batch_duration, count, slow_batch_streak)
|
|
306
|
+
emit_heartbeat_if_due(batches_processed, total_deleted)
|
|
307
|
+
|
|
308
|
+
sleep(@config.throttle_delay) if @config.throttle_delay.positive?
|
|
309
|
+
end
|
|
310
|
+
|
|
311
|
+
total_deleted
|
|
312
|
+
end
|
|
313
|
+
|
|
314
|
+
# @api private
|
|
315
|
+
def handle_batch_timing(batch_duration, count, streak)
|
|
316
|
+
if batch_duration > @config.slow_batch_threshold_s
|
|
317
|
+
streak += 1
|
|
318
|
+
safe_log(:warn, "engine.slow_batch", {
|
|
319
|
+
table: @table_name,
|
|
320
|
+
batch_duration_s: batch_duration.round(2),
|
|
321
|
+
batch_size: count,
|
|
322
|
+
streak: streak,
|
|
323
|
+
threshold_s: @config.slow_batch_threshold_s
|
|
324
|
+
})
|
|
325
|
+
|
|
326
|
+
if streak == @config.slow_batch_alert_after
|
|
327
|
+
safe_log(:warn, "engine.purge_degraded", {
|
|
239
328
|
table: @table_name,
|
|
240
|
-
|
|
241
|
-
|
|
329
|
+
consecutive_slow_batches: streak,
|
|
330
|
+
hint: "considerar índice composite o particionamiento (ver postgres-tuning.md)"
|
|
242
331
|
})
|
|
243
332
|
end
|
|
244
|
-
|
|
245
|
-
|
|
333
|
+
streak
|
|
334
|
+
else
|
|
335
|
+
0
|
|
246
336
|
end
|
|
247
|
-
|
|
248
|
-
|
|
337
|
+
end
|
|
338
|
+
|
|
339
|
+
# @api private
|
|
340
|
+
def emit_heartbeat_if_due(batches_processed, total_deleted)
|
|
341
|
+
return unless (batches_processed % 100).zero?
|
|
342
|
+
|
|
343
|
+
safe_log(:info, "engine.purge_heartbeat", {
|
|
344
|
+
table: @table_name,
|
|
345
|
+
batches_processed_count: batches_processed,
|
|
346
|
+
rows_deleted_count: total_deleted
|
|
347
|
+
})
|
|
348
|
+
end
|
|
349
|
+
|
|
350
|
+
# @api private
|
|
351
|
+
def build_delete_sql
|
|
352
|
+
<<~SQL
|
|
353
|
+
DELETE FROM #{@table_name}
|
|
354
|
+
WHERE #{@primary_key} IN (
|
|
355
|
+
SELECT #{@primary_key} FROM #{@table_name}
|
|
356
|
+
WHERE #{base_where_sql}
|
|
357
|
+
LIMIT #{@config.batch_size}
|
|
358
|
+
)
|
|
359
|
+
SQL
|
|
249
360
|
end
|
|
250
361
|
end
|
|
251
|
-
# rubocop:enable Metrics/ClassLength, Metrics/AbcSize, Metrics/MethodLength, Naming/AccessorMethodName
|
|
252
362
|
end
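The guard order in `vacuum_if_needed` (config flag first, then the deleted-row count, then a rescued `VACUUM ANALYZE`) can be exercised without a database. The following is a sketch under stated assumptions: `FakeConnection` and `vacuum_sketch` are hypothetical, the return symbols stand in for the events the real method logs, and `StandardError` is rescued here only to avoid depending on the `pg` gem, whereas the code in the diff rescues `PG::Error`.

```ruby
# Hypothetical fake connection: records SQL and can be told to fail.
class FakeConnection
  attr_reader :statements

  def initialize(fail_with: nil)
    @fail_with = fail_with
    @statements = []
  end

  def exec(sql)
    raise @fail_with if @fail_with

    @statements << sql
  end
end

# Mirrors the guard order of vacuum_if_needed and returns a symbol
# describing the outcome (the real method logs events instead).
def vacuum_sketch(conn, table_name, enabled:, total_deleted:)
  return :disabled unless enabled
  return :nothing_deleted if total_deleted.zero?

  conn.exec("VACUUM ANALYZE #{table_name};")
  :vacuum_complete
rescue StandardError
  # A failed vacuum is reported but never aborts the purge result.
  :vacuum_error
end

vacuum_sketch(FakeConnection.new, "events", enabled: true, total_deleted: 10)
```

Treating a vacuum failure as a warning fits its role as an optimization: by the time it runs, the rows are already deleted, so the purge outcome does not depend on it.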
|
data/lib/data_drain/file_ingestor.rb
CHANGED
|
@@ -6,8 +6,7 @@ module DataDrain
|
|
|
6
6
|
# applying ZSTD compression and Hive partitioning.
|
|
7
7
|
class FileIngestor
|
|
8
8
|
include Observability
|
|
9
|
-
|
|
10
|
-
# Metrics/MethodLength
|
|
9
|
+
include Observability::Timing
|
|
11
10
|
|
|
12
11
|
# @param options [Hash] Ingestion options.
|
|
13
12
|
# @option options [String] :source_path Absolute path to the local file.
|
|
@@ -36,46 +35,77 @@ module DataDrain
|
|
|
36
35
|
# Runs the ingestion flow.
|
|
37
36
|
# @return [Boolean] true if the process succeeded.
|
|
38
37
|
def call
|
|
39
|
-
|
|
38
|
+
@durations = {}
|
|
39
|
+
start_time = monotonic
|
|
40
40
|
safe_log(:info, "file_ingestor.start", { source_path: @source_path })
|
|
41
41
|
|
|
42
|
-
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
42
|
+
return file_not_found(start_time) unless step_validate_file
|
|
43
|
+
|
|
44
|
+
step_setup_duckdb
|
|
45
|
+
@reader_function = determine_reader
|
|
46
|
+
@source_count = step_count_source
|
|
47
|
+
|
|
48
|
+
return skip_empty(start_time) if @source_count.zero?
|
|
49
|
+
|
|
50
|
+
step_export
|
|
51
|
+
log_complete(start_time)
|
|
52
|
+
cleanup_local_file
|
|
53
|
+
true
|
|
54
|
+
rescue DuckDB::Error => e
|
|
55
|
+
duration = monotonic - start_time
|
|
56
|
+
safe_log(:error, "file_ingestor.duckdb_error",
|
|
57
|
+
{ source_path: @source_path }.merge(exception_metadata(e)).merge(duration_s: duration.round(2)))
|
|
58
|
+
false
|
|
59
|
+
ensure
|
|
60
|
+
@duckdb&.close
|
|
61
|
+
end
|
|
62
|
+
|
|
63
|
+
private
|
|
64
|
+
|
|
65
|
+
# @api private
|
|
66
|
+
def file_not_found(_start_time)
|
|
67
|
+
safe_log(:error, "file_ingestor.file_not_found", { source_path: @source_path })
|
|
68
|
+
false
|
|
69
|
+
end
|
|
70
|
+
|
|
71
|
+
# @api private
|
|
72
|
+
def step_validate_file
|
|
73
|
+
File.exist?(@source_path)
|
|
74
|
+
end
|
|
46
75
|
|
|
76
|
+
# @api private
|
|
77
|
+
def step_setup_duckdb
|
|
47
78
|
@duckdb.query("SET max_memory='#{@config.limit_ram}';") if @config.limit_ram.present?
|
|
48
79
|
@duckdb.query("SET temp_directory='#{@config.tmp_directory}'") if @config.tmp_directory.present?
|
|
49
|
-
|
|
50
80
|
@adapter.setup_duckdb(@duckdb)
|
|
81
|
+
end
|
|
51
82
|
|
|
52
|
-
|
|
53
|
-
|
|
54
|
-
|
|
55
|
-
# 1. Safety count
|
|
56
|
-
step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
|
|
57
|
-
source_count = @duckdb.query("SELECT COUNT(*) FROM #{reader_function}").first.first
|
|
58
|
-
source_query_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
|
|
83
|
+
# @api private
|
|
84
|
+
def step_count_source
|
|
85
|
+
source_count = timed(:source_query) { @duckdb.query("SELECT COUNT(*) FROM #{@reader_function}").first.first }
|
|
59
86
|
safe_log(:info, "file_ingestor.count", {
|
|
60
87
|
source_path: @source_path,
|
|
61
88
|
count: source_count,
|
|
62
|
-
source_query_duration_s:
|
|
89
|
+
source_query_duration_s: @durations.fetch(:source_query, 0).round(2)
|
|
63
90
|
})
|
|
91
|
+
source_count
|
|
92
|
+
end
|
|
64
93
|
|
|
65
|
-
|
|
66
|
-
|
|
67
|
-
|
|
68
|
-
|
|
69
|
-
|
|
70
|
-
|
|
94
|
+
# @api private
|
|
95
|
+
def skip_empty(start_time)
|
|
96
|
+
cleanup_local_file
|
|
97
|
+
duration = monotonic - start_time
|
|
98
|
+
safe_log(:info, "file_ingestor.skip_empty", { source_path: @source_path, duration_s: duration.round(2) })
|
|
99
|
+
true
|
|
100
|
+
end
|
|
71
101
|
|
|
72
|
-
|
|
102
|
+
# @api private
|
|
103
|
+
def step_export
|
|
73
104
|
@adapter.prepare_export_path(@bucket, @folder_name)
|
|
74
105
|
dest_path = if @config.storage_mode.to_sym == :s3
|
|
75
106
|
"s3://#{@bucket}/#{@folder_name}/"
|
|
76
107
|
else
|
|
77
|
-
File.join(@bucket,
|
|
78
|
-
@folder_name, "")
|
|
108
|
+
File.join(@bucket, @folder_name, "")
|
|
79
109
|
end
|
|
80
110
|
|
|
81
111
|
partition_clause = @partition_keys.any? ? "PARTITION_BY (#{@partition_keys.join(", ")})," : ""
|
|
@@ -83,7 +113,7 @@ module DataDrain
|
|
|
83
113
|
query = <<~SQL
|
|
84
114
|
COPY (
|
|
85
115
|
SELECT #{@select_sql}
|
|
86
|
-
FROM #{reader_function}
|
|
116
|
+
FROM #{@reader_function}
|
|
87
117
|
) TO '#{dest_path}'
|
|
88
118
|
(
|
|
89
119
|
FORMAT PARQUET,
|
|
@@ -94,32 +124,21 @@ module DataDrain
|
|
|
94
124
|
SQL
|
|
95
125
|
|
|
96
126
|
safe_log(:info, "file_ingestor.export_start", { dest_path: dest_path })
|
|
97
|
-
|
|
98
|
-
|
|
99
|
-
export_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
|
|
127
|
+
timed(:export) { @duckdb.query(query) }
|
|
128
|
+
end
|
|
100
129
|
|
|
101
|
-
|
|
130
|
+
# @api private
|
|
131
|
+
def log_complete(start_time)
|
|
132
|
+
duration = monotonic - start_time
|
|
102
133
|
safe_log(:info, "file_ingestor.complete", {
|
|
103
134
|
source_path: @source_path,
|
|
104
135
|
duration_s: duration.round(2),
|
|
105
|
-
source_query_duration_s:
|
|
106
|
-
export_duration_s:
|
|
107
|
-
count: source_count
|
|
136
|
+
source_query_duration_s: @durations.fetch(:source_query, 0).round(2),
|
|
137
|
+
export_duration_s: @durations.fetch(:export, 0).round(2),
|
|
138
|
+
count: @source_count
|
|
108
139
|
})
|
|
109
|
-
|
|
110
|
-
cleanup_local_file
|
|
111
|
-
true
|
|
112
|
-
rescue DuckDB::Error => e
|
|
113
|
-
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
|
|
114
|
-
safe_log(:error, "file_ingestor.duckdb_error",
|
|
115
|
-
{ source_path: @source_path }.merge(exception_metadata(e)).merge(duration_s: duration.round(2)))
|
|
116
|
-
false
|
|
117
|
-
ensure
|
|
118
|
-
@duckdb&.close
|
|
119
140
|
end
|
|
120
141
|
|
|
121
|
-
private
|
|
122
|
-
|
|
123
142
|
# @api private
|
|
124
143
|
def determine_reader
|
|
125
144
|
case File.extname(@source_path).downcase
|
|
@@ -142,6 +161,4 @@ module DataDrain
|
|
|
142
161
|
safe_log(:info, "file_ingestor.cleanup", { source_path: @source_path })
|
|
143
162
|
end
|
|
144
163
|
end
|
|
145
|
-
# rubocop:enable Metrics/AbcSize, Metrics/CyclomaticComplexity, Metrics/PerceivedComplexity,
|
|
146
|
-
# Metrics/MethodLength
|
|
147
164
|
end
|
data/lib/data_drain/observability/timing.rb
ADDED
|
@@ -0,0 +1,23 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module DataDrain
|
|
4
|
+
module Observability
|
|
5
|
+
# Helper for measuring how long operations take.
|
|
6
|
+
# @api private
|
|
7
|
+
module Timing
|
|
8
|
+
private
|
|
9
|
+
|
|
10
|
+
def monotonic
|
|
11
|
+
Process.clock_gettime(Process::CLOCK_MONOTONIC)
|
|
12
|
+
end
|
|
13
|
+
|
|
14
|
+
def timed(step_name)
|
|
15
|
+
t = monotonic
|
|
16
|
+
result = yield
|
|
17
|
+
@durations ||= {}
|
|
18
|
+
@durations[step_name] = monotonic - t
|
|
19
|
+
result
|
|
20
|
+
end
|
|
21
|
+
end
|
|
22
|
+
end
|
|
23
|
+
end
|
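The `timed` helper stores the elapsed time under a step name and passes the block's result through. A minimal usage sketch follows, assuming a copy of the helper in a top-level module so it runs without the gem; `TimingSketch` and `SampleJob` are hypothetical names and not part of data_drain.

```ruby
# Copy of the Timing helpers from the diff, placed in a top-level module for the sketch.
module TimingSketch
  private

  def monotonic
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end

  def timed(step_name)
    started = monotonic
    result = yield
    @durations ||= {}
    @durations[step_name] = monotonic - started
    result
  end
end

# Hypothetical consumer of the mixin.
class SampleJob
  include TimingSketch

  attr_reader :durations

  def call
    @durations = {}
    total = timed(:sum) { (1..1_000).sum } # the block's value is returned unchanged
    timed(:noop) { nil }
    total
  end
end

job = SampleJob.new
job.call
job.durations.keys # the step names that were timed, in call order
```

Since `timed` has no `ensure`, a step that raises or is skipped leaves no entry in `@durations`. That is consistent with the log methods reading values through `@durations.fetch(key, 0)`.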
data/lib/data_drain/record.rb
CHANGED
|
@@ -46,7 +46,6 @@ module DataDrain
|
|
|
46
46
|
# This avoids having to reload extensions (such as httpfs) on every query.
|
|
47
47
|
#
|
|
48
48
|
# @return [DuckDB::Connection] Active DuckDB connection.
|
|
49
|
-
# rubocop:disable Metrics/AbcSize
|
|
50
49
|
def self.connection
|
|
51
50
|
Thread.current[:data_drain_duckdb] ||= begin
|
|
52
51
|
db = DuckDB::Database.open(":memory:")
|
|
@@ -57,11 +56,13 @@ module DataDrain
|
|
|
57
56
|
conn.query("SET temp_directory='#{config.tmp_directory}'") if config.tmp_directory.present?
|
|
58
57
|
|
|
59
58
|
DataDrain::Storage.adapter.setup_duckdb(conn)
|
|
59
|
+
|
|
60
|
+
conn.query("SET lock_configuration=true;")
|
|
61
|
+
|
|
60
62
|
{ db: db, conn: conn }
|
|
61
63
|
end
|
|
62
64
|
Thread.current[:data_drain_duckdb][:conn]
|
|
63
65
|
end
|
|
64
|
-
# rubocop:enable Metrics/AbcSize
|
|
65
66
|
|
|
66
67
|
# Queries records in the Data Lake, filtering by partition keys.
|
|
67
68
|
#
|
|
@@ -138,22 +139,14 @@ module DataDrain
|
|
|
138
139
|
# @param sql [String]
|
|
139
140
|
# @param columns [Array<String>]
|
|
140
141
|
# @return [Array<DataDrain::Record>]
|
|
141
|
-
# rubocop:disable Metrics/MethodLength
|
|
142
142
|
def execute_and_instantiate(sql, columns)
|
|
143
143
|
@logger = DataDrain.configuration.logger
|
|
144
|
-
|
|
145
|
-
|
|
146
|
-
|
|
147
|
-
|
|
148
|
-
|
|
149
|
-
end
|
|
150
|
-
|
|
151
|
-
result.map do |row|
|
|
152
|
-
attributes_hash = columns.zip(row).to_h
|
|
153
|
-
new(attributes_hash)
|
|
154
|
-
end
|
|
144
|
+
result = connection.query(sql)
|
|
145
|
+
result.map { |row| new(columns.zip(row).to_h) }
|
|
146
|
+
rescue DuckDB::Error => e
|
|
147
|
+
safe_log(:warn, "record.parquet_not_found", exception_metadata(e))
|
|
148
|
+
[]
|
|
155
149
|
end
|
|
156
150
|
end
|
|
157
|
-
# rubocop:enable Metrics/MethodLength
|
|
158
151
|
end
|
|
159
152
|
end
|
data/lib/data_drain/storage/s3.rb
CHANGED
|
@@ -3,8 +3,6 @@
|
|
|
3
3
|
module DataDrain
|
|
4
4
|
module Storage
|
|
5
5
|
class S3 < Base
|
|
6
|
-
# rubocop:disable Metrics/AbcSize, Metrics/CyclomaticComplexity, Metrics/MethodLength
|
|
7
|
-
|
|
8
6
|
# Loads the httpfs extension into DuckDB and injects the AWS credentials.
|
|
9
7
|
# If aws_access_key_id and aws_secret_access_key are set, uses
|
|
10
8
|
# explicit credentials. Otherwise, uses credential_chain (IAM role, env vars,
|
|
@@ -32,34 +30,56 @@ module DataDrain
|
|
|
32
30
|
# @param partitions [Hash]
|
|
33
31
|
# @return [Integer]
|
|
34
32
|
def destroy_partitions(bucket, folder_name, partition_keys, partitions)
|
|
35
|
-
client =
|
|
33
|
+
client = s3_client
|
|
34
|
+
prefix, pattern_regex = build_destroy_pattern(folder_name, partition_keys, partitions)
|
|
35
|
+
objects = collect_matching_objects(client, bucket, prefix, pattern_regex)
|
|
36
|
+
delete_in_batches(client, bucket, objects)
|
|
37
|
+
end
|
|
38
|
+
|
|
39
|
+
private
|
|
40
|
+
|
|
41
|
+
# @return [Aws::S3::Client]
|
|
42
|
+
def s3_client
|
|
43
|
+
Aws::S3::Client.new(
|
|
36
44
|
region: @config.aws_region,
|
|
37
45
|
access_key_id: @config.aws_access_key_id,
|
|
38
46
|
secret_access_key: @config.aws_secret_access_key
|
|
39
47
|
)
|
|
48
|
+
end
|
|
40
49
|
|
|
50
|
+
# @param folder_name [String]
|
|
51
|
+
# @param partition_keys [Array<Symbol>]
|
|
52
|
+
# @param partitions [Hash]
|
|
53
|
+
# @return [Array(String, Regexp)] prefix and pattern_regex
|
|
54
|
+
def build_destroy_pattern(folder_name, partition_keys, partitions)
|
|
41
55
|
regex_parts = partition_keys.map do |key|
|
|
42
56
|
val = partitions[key]
|
|
43
57
|
val.nil? || val.to_s.empty? ? "#{key}=[^/]+" : "#{key}=#{val}"
|
|
44
58
|
end
|
|
45
|
-
|
|
59
|
+
pattern = Regexp.new("^#{folder_name}/#{regex_parts.join("/")}")
|
|
46
60
|
|
|
47
|
-
objects_to_delete = []
|
|
48
61
|
prefix = "#{folder_name}/"
|
|
49
62
|
first_key = partition_keys.first
|
|
50
63
|
prefix += "#{first_key}=#{partitions[first_key]}/" if partitions[first_key]
|
|
51
64
|
|
|
65
|
+
[prefix, pattern]
|
|
66
|
+
end
|
|
67
|
+
|
|
68
|
+
# @param client [Aws::S3::Client]
|
|
69
|
+
# @param bucket [String]
|
|
70
|
+
# @param prefix [String]
|
|
71
|
+
# @param pattern_regex [Regexp]
|
|
72
|
+
# @return [Array<Hash>]
|
|
73
|
+
def collect_matching_objects(client, bucket, prefix, pattern_regex)
|
|
74
|
+
objects = []
|
|
52
75
|
client.list_objects_v2(bucket: bucket, prefix: prefix).each do |response|
|
|
53
76
|
response.contents.each do |obj|
|
|
54
|
-
|
|
77
|
+
objects << { key: obj.key } if obj.key.match?(pattern_regex)
|
|
55
78
|
end
|
|
56
79
|
end
|
|
57
|
-
|
|
58
|
-
delete_in_batches(client, bucket, objects_to_delete)
|
|
80
|
+
objects
|
|
59
81
|
end
|
|
60
82
|
|
|
61
|
-
private
|
|
62
|
-
|
|
63
83
|
# @param connection [DuckDB::Connection]
|
|
64
84
|
# @raise [DataDrain::ConfigurationError]
|
|
65
85
|
def create_s3_secret(connection)
|
|
@@ -107,6 +127,5 @@ module DataDrain
|
|
|
107
127
|
deleted_count
|
|
108
128
|
end
|
|
109
129
|
end
|
|
110
|
-
# rubocop:enable Metrics/AbcSize, Metrics/CyclomaticComplexity, Metrics/MethodLength
|
|
111
130
|
end
|
|
112
131
|
end
|
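The prefix and regex produced by `build_destroy_pattern` decide which object keys are deleted, so it helps to see them for a concrete input. The sketch below copies that logic into a standalone method; `destroy_pattern_sketch`, the `events` folder and the `year`/`month` keys are hypothetical examples.

```ruby
# Standalone copy of the pattern-building logic, for illustration only.
def destroy_pattern_sketch(folder_name, partition_keys, partitions)
  regex_parts = partition_keys.map do |key|
    val = partitions[key]
    # A missing or empty value becomes a wildcard for that partition level.
    val.nil? || val.to_s.empty? ? "#{key}=[^/]+" : "#{key}=#{val}"
  end
  pattern = Regexp.new("^#{folder_name}/#{regex_parts.join("/")}")

  # The listing prefix is narrowed only when the first partition key has a value.
  prefix = "#{folder_name}/"
  first_key = partition_keys.first
  prefix += "#{first_key}=#{partitions[first_key]}/" if partitions[first_key]

  [prefix, pattern]
end

prefix, pattern = destroy_pattern_sketch("events", %i[year month], { year: 2024 })
"events/year=2024/month=3/data.parquet".match?(pattern) # matches: month is a wildcard
"events/year=2023/month=3/data.parquet".match?(pattern) # does not match: year differs
```

Partition values are interpolated into the regex as-is. If values could contain regex metacharacters, wrapping them in `Regexp.escape` may be worth considering; that is an observation about this sketch, not something the diff does.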
data/lib/data_drain/version.rb
CHANGED
data/lib/data_drain.rb
CHANGED
|
@@ -7,6 +7,7 @@ require_relative "data_drain/configuration"
|
|
|
7
7
|
require_relative "data_drain/validations"
|
|
8
8
|
require_relative "data_drain/storage"
|
|
9
9
|
require_relative "data_drain/observability"
|
|
10
|
+
require_relative "data_drain/observability/timing"
|
|
10
11
|
require_relative "data_drain/engine"
|
|
11
12
|
require_relative "data_drain/record"
|
|
12
13
|
require_relative "data_drain/file_ingestor"
|
metadata
CHANGED
|
@@ -1,14 +1,14 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: data_drain
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.
|
|
4
|
+
version: 0.3.0
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Gabriel
|
|
8
8
|
autorequire:
|
|
9
9
|
bindir: exe
|
|
10
10
|
cert_chain: []
|
|
11
|
-
date: 2026-04-
|
|
11
|
+
date: 2026-04-15 00:00:00.000000000 Z
|
|
12
12
|
dependencies:
|
|
13
13
|
- !ruby/object:Gem::Dependency
|
|
14
14
|
name: activemodel
|
|
@@ -109,6 +109,7 @@ files:
|
|
|
109
109
|
- lib/data_drain/file_ingestor.rb
|
|
110
110
|
- lib/data_drain/glue_runner.rb
|
|
111
111
|
- lib/data_drain/observability.rb
|
|
112
|
+
- lib/data_drain/observability/timing.rb
|
|
112
113
|
- lib/data_drain/record.rb
|
|
113
114
|
- lib/data_drain/storage.rb
|
|
114
115
|
- lib/data_drain/storage/base.rb
|