data_drain 0.2.2 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: ce021179463aeff31239f3bba35fc8b663633bfd648a2584a723b64b5d444387
4
- data.tar.gz: b41a91f9a7792f952402c7ac2c9dc981e8372178df200c66e9d7fd8c63753c9c
3
+ metadata.gz: 8fc6bdf2a5a21825f9bc07185c0fc259056646c6084de2a42fd2d4e3198af4f2
4
+ data.tar.gz: aa7623b2ec910da4491c38904aa4623fd48a466ccc97a38124f29aafd9677655
5
5
  SHA512:
6
- metadata.gz: a23130f9892dfc34dfa748a50de6de5f768135d9b3bf012da0f648d322be6b11d0b20bbaf7e783d92a189cfc6e459cb7dc885800ea436f10d177f434e084e018
7
- data.tar.gz: acbe4ed9caf168c7519881bce74c8ca90d8392bc8ee1f3718af600e64cfdacd3991f8860691db2b2ad9bb0c949aa88f0015f21e847f540030489951ba81f9ef0
6
+ metadata.gz: '0201838e917434b529250689a5987008f9e8895366b56a78522752fe5b490fb6ab3d0b4c40dc81bcc830af3132c005fc8d136ab719093a16f0c9de5221e3421c'
7
+ data.tar.gz: 5be0c9c4a47d07f479f49c977d0490aad909a33fbc0c9292594bd51163132ee5229e0131f23d253120732d530930e5474bd31abbef7a5289f3022055799b5525
data/CHANGELOG.md CHANGED
@@ -1,5 +1,28 @@
1
1
  ## [Unreleased]
2
2
 
3
+ ## [0.3.0] - 2026-04-15
4
+
5
+ ### Refactor
6
+ - `Engine#call` refactored: `step_count`, `step_export`, `step_verify`, and `step_purge` extracted as private methods that use the `timed` helper. Cyclomatic complexity dropped from 13 to 5. Emitted events are identical to the previous behavior. (item 10)
7
+ - Extracted the `DataDrain::Observability::Timing` mixin, shared by Engine and FileIngestor. (item 20)
8
+ - `FileIngestor#call` refactored along the same lines as Engine. (item 20)
9
+ - Removed all `# rubocop:disable Metrics/*` directives in `lib/`. (item 20)
10
+
11
+ ### Features
12
+ - `config.vacuum_after_purge = false` (default). When `true`, runs `VACUUM ANALYZE` after the purge if any rows were deleted. Emits `engine.vacuum_complete` with dead_tuples before/after and the duration. PG errors are caught and logged as `engine.vacuum_error` at WARN level. (item 5)
13
+ - `config.slow_batch_threshold_s = 30` and `config.slow_batch_alert_after = 5`. Detects slow purge batches. Emits an `engine.slow_batch` WARN for each slow batch and an `engine.purge_degraded` WARN once per streak. Includes a hint pointing to the tuning docs. (item 11b)
14
+
15
+ ### Security
16
+ - `Record.connection` applies `SET lock_configuration=true` after setup. This freezes any later SET on the connection (defense in depth). It does NOT affect secrets or extensions that are already loaded. (item 6)
17
+
18
+ ### New telemetry
19
+ - `engine.vacuum_complete`, `engine.vacuum_error`, `engine.slow_batch`, `engine.purge_degraded`.
20
+
21
+ ### Tests
22
+ - Coverage stays at ≥ 80%.
23
+ - New equivalence test for Engine (identical events before and after the refactor).
24
+ - Timecop added for timing tests (item 11b).
25
+
3
26
  ## [0.2.2] - 2026-04-14
4
27
 
5
28
  ### Security
@@ -433,12 +433,13 @@ spec/
433
433
 
434
434
  #### Item 5 — Optional VACUUM ANALYZE after purge
435
435
 
436
- **Status:** `[ ]`
436
+ **Status:** `[x]`
437
437
  **Priority:** P1
438
438
  **Type:** `feat` `perf`
439
439
  **Compatibility:** backward-compatible (default `false`, opt-in)
440
440
  **Estimate:** S (2-3h)
441
441
  **Release:** v0.2.1
442
+ **Commit:** `93bf8a8`
442
443
 
443
444
  ##### Context
444
445
 
@@ -754,12 +755,13 @@ Contenido:
754
755
 
755
756
  #### Item 6 — Sandboxing of `Record.connection`
756
757
 
757
- **Status:** `[ ]`
758
+ **Status:** `[x]`
758
759
  **Priority:** P1
759
760
  **Type:** `security`
760
761
  **Compatibility:** backward-compatible (with a risk of breakage if the caller relied on unusual workarounds)
761
762
  **Estimate:** M (3-4h)
762
763
  **Release:** v0.3.0
764
+ **Commit:** `f042c56`
763
765
 
764
766
  ##### Context
765
767
 
@@ -800,12 +802,13 @@ Reduce blast radius si alguien intenta inyectar SQL malicioso vía `where_clause
800
802
 
801
803
  #### Item 10 — Refactor `Engine#call` (CC=13 → ~5)
802
804
 
803
- **Status:** `[ ]`
805
+ **Status:** `[x]`
804
806
  **Priority:** P1
805
807
  **Type:** `refactor`
806
808
  **Compatibility:** backward-compatible
807
809
  **Estimate:** M (4-6h)
808
810
  **Release:** v0.3.0
811
+ **Commit:** `6a06850`
809
812
 
810
813
  ##### Context
811
814
 
@@ -878,12 +881,13 @@ Reduce blast radius si alguien intenta inyectar SQL malicioso vía `where_clause
878
881
 
879
882
  #### Item 11b — Runtime warning for a slow purge that makes no progress
880
883
 
881
- **Status:** `[ ]`
884
+ **Status:** `[x]`
882
885
  **Priority:** P1
883
886
  **Type:** `feat` `perf`
884
887
  **Compatibility:** backward-compatible
885
888
  **Estimate:** M (3-4h)
886
889
  **Release:** v0.3.0
890
+ **Commit:** `d72ec0a`
887
891
 
888
892
  ##### Context
889
893
 
@@ -1206,12 +1210,13 @@ v0.2.1 solo corre CI en Ruby 3.4.4. La gema declara `required_ruby_version = ">=
1206
1210
 
1207
1211
  #### Item 20 — Clean up the `rubocop:disable` directives in `lib/` added in v0.2.0
1208
1212
 
1209
- **Status:** `[ ]`
1213
+ **Status:** `[x]`
1210
1214
  **Priority:** P2
1211
1215
  **Type:** `refactor`
1212
1216
  **Compatibility:** N/A
1213
1217
  **Estimate:** Depends on item 10 (Engine#call refactor)
1214
1218
  **Suggested release:** v0.3.0 (together with item 10)
1219
+ **Commit:** `f6f4ddc` (FileIngestor), `02d207c` (S3 refactor), `5522c79` (Timing mixin)
1215
1220
 
1216
1221
  ##### Context
1217
1222
 
data/lib/data_drain/configuration.rb CHANGED
@@ -9,7 +9,10 @@ module DataDrain
9
9
  :aws_access_key_id, :aws_secret_access_key,
10
10
  :db_host, :db_port, :db_user, :db_pass, :db_name,
11
11
  :batch_size, :throttle_delay, :logger, :limit_ram, :tmp_directory,
12
- :idle_in_transaction_session_timeout
12
+ :idle_in_transaction_session_timeout,
13
+ :vacuum_after_purge,
14
+ :slow_batch_threshold_s,
15
+ :slow_batch_alert_after
13
16
 
14
17
  def initialize
15
18
  @storage_mode = :local
@@ -20,6 +23,9 @@ module DataDrain
20
23
  @limit_ram = nil # eg 2GB
21
24
  @tmp_directory = nil # eg /tmp/duckdb_work
22
25
  @idle_in_transaction_session_timeout = 0
26
+ @vacuum_after_purge = false
27
+ @slow_batch_threshold_s = 30
28
+ @slow_batch_alert_after = 5
23
29
  @logger = Logger.new($stdout)
24
30
  end
25
31
 
data/lib/data_drain/engine.rb CHANGED
@@ -5,12 +5,12 @@ require "pg"
5
5
 
6
6
  module DataDrain
7
7
  # Main data extraction and purge engine (DataDrain).
8
- # rubocop:disable Metrics/ClassLength, Metrics/AbcSize, Metrics/MethodLength, Naming/AccessorMethodName
9
8
  #
10
9
  # Orchestrates the ETL flow from PostgreSQL to an analytical Data Lake,
11
10
  # delegating storage interaction to the configured adapter.
12
11
  class Engine
13
12
  include Observability
13
+ include Observability::Timing
14
14
  # Initializes a new instance of the extraction engine.
15
15
  #
16
16
  # @param options [Hash] Configuration hash for the extraction.
@@ -50,70 +50,91 @@ module DataDrain
50
50
  @duckdb = database.connect
51
51
  end
52
52
 
53
- # Runs the engine's full flow: Setup, Count, Export (optional), Verification, and Purge.
54
- #
55
- # @return [Boolean] `true` if the process finished successfully, `false` if the integrity check failed.
56
53
  def call
57
- start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
58
- safe_log(:info, "engine.start",
59
- { table: @table_name, start_date: @start_date.to_date, end_date: @end_date.to_date })
54
+ @durations = {}
55
+ start_time = monotonic
56
+ log_start
60
57
 
61
58
  setup_duckdb
59
+ return skip_empty(start_time) if step_count.zero?
62
60
 
63
- # 1. Initial count in Postgres
64
- step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
65
- @pg_count = get_postgres_count
66
- db_query_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
67
-
68
- if @pg_count.zero?
69
- duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
70
- safe_log(:info, "engine.skip_empty",
71
- { table: @table_name, duration_s: duration.round(2), db_query_duration_s: db_query_duration.round(2) })
72
- return true
73
- end
74
-
75
- # 2. Export
76
- export_duration = 0.0
77
61
  if @skip_export
78
62
  safe_log(:info, "engine.skip_export", { table: @table_name })
79
63
  else
80
- safe_log(:info, "engine.export_start", { table: @table_name, count: @pg_count })
81
- step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
82
- export_to_parquet
83
- export_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
64
+ step_export
84
65
  end
66
+ return integrity_failed(start_time) unless step_verify
85
67
 
86
- # 3. Integrity verification
87
- step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
88
- integrity_ok = verify_integrity
89
- integrity_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
68
+ step_purge
69
+ log_complete(start_time)
70
+ true
71
+ end
90
72
 
91
- if integrity_ok
92
- # 4. Purge in Postgres
93
- step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
94
- purge_from_postgres
95
- purge_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
73
+ private
96
74
 
97
- duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
98
- safe_log(:info, "engine.complete", {
99
- table: @table_name,
100
- duration_s: duration.round(2),
101
- db_query_duration_s: db_query_duration.round(2),
102
- export_duration_s: export_duration.round(2),
103
- integrity_duration_s: integrity_duration.round(2),
104
- purge_duration_s: purge_duration.round(2),
105
- count: @pg_count
106
- })
107
- true
108
- else
109
- duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
110
- safe_log(:error, "engine.integrity_error",
111
- { table: @table_name, duration_s: duration.round(2), count: @pg_count })
112
- false
113
- end
75
+ # @api private
76
+ def log_start
77
+ safe_log(:info, "engine.start",
78
+ { table: @table_name, start_date: @start_date.to_date, end_date: @end_date.to_date })
114
79
  end
115
80
 
116
- private
81
+ # @api private
82
+ def step_count
83
+ @pg_count = timed(:db_query) { get_postgres_count }
84
+ @pg_count
85
+ end
86
+
87
+ # @api private
88
+ def skip_empty(start_time)
89
+ duration = monotonic - start_time
90
+ safe_log(:info, "engine.skip_empty", {
91
+ table: @table_name,
92
+ duration_s: duration.round(2),
93
+ db_query_duration_s: @durations.fetch(:db_query, 0).round(2)
94
+ })
95
+ true
96
+ end
97
+
98
+ # @api private
99
+ def step_export
100
+ safe_log(:info, "engine.export_start", { table: @table_name, count: @pg_count })
101
+ timed(:export) { export_to_parquet }
102
+ end
103
+
104
+ # @api private
105
+ def step_verify
106
+ timed(:integrity) { verify_integrity }
107
+ end
108
+
109
+ # @api private
110
+ def step_purge
111
+ timed(:purge) { purge_from_postgres }
112
+ end
113
+
114
+ # @api private
115
+ def log_complete(start_time)
116
+ duration = monotonic - start_time
117
+ safe_log(:info, "engine.complete", {
118
+ table: @table_name,
119
+ duration_s: duration.round(2),
120
+ db_query_duration_s: @durations.fetch(:db_query, 0).round(2),
121
+ export_duration_s: @durations.fetch(:export, 0).round(2),
122
+ integrity_duration_s: @durations.fetch(:integrity, 0).round(2),
123
+ purge_duration_s: @durations.fetch(:purge, 0).round(2),
124
+ count: @pg_count
125
+ })
126
+ end
127
+
128
+ # @api private
129
+ def integrity_failed(start_time)
130
+ duration = monotonic - start_time
131
+ safe_log(:error, "engine.integrity_error", {
132
+ table: @table_name,
133
+ duration_s: duration.round(2),
134
+ count: @pg_count
135
+ })
136
+ false
137
+ end
117
138
 
118
139
  # @api private
119
140
  # @return [String]
@@ -213,40 +234,129 @@ module DataDrain
213
234
  conn.exec("SET idle_in_transaction_session_timeout = #{@config.idle_in_transaction_session_timeout};")
214
235
  end
215
236
 
237
+ total_deleted = purge_loop(conn)
238
+
239
+ vacuum_if_needed(conn, total_deleted)
240
+ ensure
241
+ conn&.close
242
+ end
243
+
244
+ # @api private
245
+ def vacuum_if_needed(conn, total_deleted)
246
+ return unless @config.vacuum_after_purge
247
+ return if total_deleted.zero?
248
+
249
+ vacuum_start = monotonic
250
+ dead_before = fetch_dead_tuple_count(conn)
251
+
252
+ begin
253
+ conn.exec("VACUUM ANALYZE #{@table_name};")
254
+ rescue PG::Error => e
255
+ safe_log(:warn, "engine.vacuum_error", {
256
+ table: @table_name,
257
+ dead_tuples_before: dead_before,
258
+ rows_deleted_count: total_deleted,
259
+ duration_s: (monotonic - vacuum_start).round(2)
260
+ }.merge(exception_metadata(e)))
261
+ return
262
+ end
263
+
264
+ dead_after = fetch_dead_tuple_count(conn)
265
+ vacuum_duration = monotonic - vacuum_start
266
+
267
+ safe_log(:info, "engine.vacuum_complete", {
268
+ table: @table_name,
269
+ duration_s: vacuum_duration.round(2),
270
+ dead_tuples_before: dead_before,
271
+ dead_tuples_after: dead_after,
272
+ rows_deleted_count: total_deleted
273
+ })
274
+ end
275
+
276
+ # @api private
277
+ def fetch_dead_tuple_count(conn)
278
+ result = conn.exec_params(
279
+ "SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = $1",
280
+ [@table_name]
281
+ )
282
+ result.first&.dig("n_dead_tup")&.to_i || 0
283
+ rescue PG::Error
284
+ -1
285
+ end
286
+
287
+ # @api private
288
+ # @param conn [PG::Connection]
289
+ # @return [Integer] total number of rows deleted
290
+ def purge_loop(conn)
216
291
  batches_processed = 0
217
292
  total_deleted = 0
293
+ slow_batch_streak = 0
218
294
 
219
295
  loop do
220
- sql = <<~SQL
221
- DELETE FROM #{@table_name}
222
- WHERE #{@primary_key} IN (
223
- SELECT #{@primary_key} FROM #{@table_name}
224
- WHERE #{base_where_sql}
225
- LIMIT #{@config.batch_size}
226
- )
227
- SQL
228
-
229
- result = conn.exec(sql)
296
+ batch_start = monotonic
297
+ result = conn.exec(build_delete_sql)
298
+ batch_duration = monotonic - batch_start
230
299
  count = result.cmd_tuples
231
300
  break if count.zero?
232
301
 
233
302
  batches_processed += 1
234
303
  total_deleted += count
235
304
 
236
- # Heartbeat every 100 batches to monitor long-running 1TB processes
237
- if (batches_processed % 100).zero?
238
- safe_log(:info, "engine.purge_heartbeat", {
305
+ slow_batch_streak = handle_batch_timing(batch_duration, count, slow_batch_streak)
306
+ emit_heartbeat_if_due(batches_processed, total_deleted)
307
+
308
+ sleep(@config.throttle_delay) if @config.throttle_delay.positive?
309
+ end
310
+
311
+ total_deleted
312
+ end
313
+
314
+ # @api private
315
+ def handle_batch_timing(batch_duration, count, streak)
316
+ if batch_duration > @config.slow_batch_threshold_s
317
+ streak += 1
318
+ safe_log(:warn, "engine.slow_batch", {
319
+ table: @table_name,
320
+ batch_duration_s: batch_duration.round(2),
321
+ batch_size: count,
322
+ streak: streak,
323
+ threshold_s: @config.slow_batch_threshold_s
324
+ })
325
+
326
+ if streak == @config.slow_batch_alert_after
327
+ safe_log(:warn, "engine.purge_degraded", {
239
328
  table: @table_name,
240
- batches_processed_count: batches_processed,
241
- rows_deleted_count: total_deleted
329
+ consecutive_slow_batches: streak,
330
+ hint: "considerar índice composite o particionamiento (ver postgres-tuning.md)"
242
331
  })
243
332
  end
244
-
245
- sleep(@config.throttle_delay) if @config.throttle_delay.positive?
333
+ streak
334
+ else
335
+ 0
246
336
  end
247
- ensure
248
- conn&.close
337
+ end
338
+
339
+ # @api private
340
+ def emit_heartbeat_if_due(batches_processed, total_deleted)
341
+ return unless (batches_processed % 100).zero?
342
+
343
+ safe_log(:info, "engine.purge_heartbeat", {
344
+ table: @table_name,
345
+ batches_processed_count: batches_processed,
346
+ rows_deleted_count: total_deleted
347
+ })
348
+ end
349
+
350
+ # @api private
351
+ def build_delete_sql
352
+ <<~SQL
353
+ DELETE FROM #{@table_name}
354
+ WHERE #{@primary_key} IN (
355
+ SELECT #{@primary_key} FROM #{@table_name}
356
+ WHERE #{base_where_sql}
357
+ LIMIT #{@config.batch_size}
358
+ )
359
+ SQL
249
360
  end
250
361
  end
251
- # rubocop:enable Metrics/ClassLength, Metrics/AbcSize, Metrics/MethodLength, Naming/AccessorMethodName
252
362
  end
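The streak handling in `handle_batch_timing` above can be tried out without a PostgreSQL connection. The following is only a sketch under stated assumptions: `SlowBatchConfig`, `track_batch`, and `simulate` are hypothetical names introduced here for illustration and are not part of the gem, and event names are collected in an array instead of being passed to `safe_log`. The defaults 30 and 5 are the ones set in the configuration hunk.

```ruby
# Stand-in for the two new settings (hypothetical; the gem uses its own
# Configuration class). Defaults mirror the diff: 30 seconds and 5 batches.
SlowBatchConfig = Struct.new(:slow_batch_threshold_s, :slow_batch_alert_after)

# Hypothetical helper mirroring the branch structure of handle_batch_timing.
# Returns the updated streak plus the names of the events that would be logged.
def track_batch(duration_s, streak, config)
  # A batch at or under the threshold resets the streak and logs nothing.
  return [0, []] unless duration_s > config.slow_batch_threshold_s

  streak += 1
  events = ["engine.slow_batch"]
  # Equality (not >=) means the degraded warning fires once per streak.
  events << "engine.purge_degraded" if streak == config.slow_batch_alert_after
  [streak, events]
end

# Feeds a list of batch durations through track_batch and collects all events.
def simulate(durations, config)
  streak = 0
  durations.flat_map do |duration_s|
    streak, events = track_batch(duration_s, streak, config)
    events
  end
end

CONFIG = SlowBatchConfig.new(30, 5)
# Six slow batches in a row, one fast batch, then one more slow batch.
EVENTS = simulate([31, 35, 40, 33, 32, 45, 2, 31], CONFIG)
puts EVENTS.tally
```

With these sample durations, `engine.purge_degraded` is recorded on the fifth consecutive slow batch and is not repeated on the sixth. The fast batch resets the streak, so the final slow batch starts a new streak at 1.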
data/lib/data_drain/file_ingestor.rb CHANGED
@@ -6,8 +6,7 @@ module DataDrain
6
6
  # applying ZSTD compression and Hive partitioning.
7
7
  class FileIngestor
8
8
  include Observability
9
- # rubocop:disable Metrics/AbcSize, Metrics/CyclomaticComplexity, Metrics/PerceivedComplexity,
10
- # Metrics/MethodLength
9
+ include Observability::Timing
11
10
 
12
11
  # @param options [Hash] Ingestion options.
13
12
  # @option options [String] :source_path Absolute path to the local file.
@@ -36,46 +35,77 @@ module DataDrain
36
35
  # Runs the ingestion flow.
37
36
  # @return [Boolean] true if the process succeeded.
38
37
  def call
39
- start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
38
+ @durations = {}
39
+ start_time = monotonic
40
40
  safe_log(:info, "file_ingestor.start", { source_path: @source_path })
41
41
 
42
- unless File.exist?(@source_path)
43
- safe_log(:error, "file_ingestor.file_not_found", { source_path: @source_path })
44
- return false
45
- end
42
+ return file_not_found(start_time) unless step_validate_file
43
+
44
+ step_setup_duckdb
45
+ @reader_function = determine_reader
46
+ @source_count = step_count_source
47
+
48
+ return skip_empty(start_time) if @source_count.zero?
49
+
50
+ step_export
51
+ log_complete(start_time)
52
+ cleanup_local_file
53
+ true
54
+ rescue DuckDB::Error => e
55
+ duration = monotonic - start_time
56
+ safe_log(:error, "file_ingestor.duckdb_error",
57
+ { source_path: @source_path }.merge(exception_metadata(e)).merge(duration_s: duration.round(2)))
58
+ false
59
+ ensure
60
+ @duckdb&.close
61
+ end
62
+
63
+ private
64
+
65
+ # @api private
66
+ def file_not_found(_start_time)
67
+ safe_log(:error, "file_ingestor.file_not_found", { source_path: @source_path })
68
+ false
69
+ end
70
+
71
+ # @api private
72
+ def step_validate_file
73
+ File.exist?(@source_path)
74
+ end
46
75
 
76
+ # @api private
77
+ def step_setup_duckdb
47
78
  @duckdb.query("SET max_memory='#{@config.limit_ram}';") if @config.limit_ram.present?
48
79
  @duckdb.query("SET temp_directory='#{@config.tmp_directory}'") if @config.tmp_directory.present?
49
-
50
80
  @adapter.setup_duckdb(@duckdb)
81
+ end
51
82
 
52
- # Determine the DuckDB reader function from the file extension
53
- reader_function = determine_reader
54
-
55
- # 1. Safety count
56
- step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
57
- source_count = @duckdb.query("SELECT COUNT(*) FROM #{reader_function}").first.first
58
- source_query_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
83
+ # @api private
84
+ def step_count_source
85
+ source_count = timed(:source_query) { @duckdb.query("SELECT COUNT(*) FROM #{@reader_function}").first.first }
59
86
  safe_log(:info, "file_ingestor.count", {
60
87
  source_path: @source_path,
61
88
  count: source_count,
62
- source_query_duration_s: source_query_duration.round(2)
89
+ source_query_duration_s: @durations.fetch(:source_query, 0).round(2)
63
90
  })
91
+ source_count
92
+ end
64
93
 
65
- if source_count.zero?
66
- cleanup_local_file
67
- duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
68
- safe_log(:info, "file_ingestor.skip_empty", { source_path: @source_path, duration_s: duration.round(2) })
69
- return true
70
- end
94
+ # @api private
95
+ def skip_empty(start_time)
96
+ cleanup_local_file
97
+ duration = monotonic - start_time
98
+ safe_log(:info, "file_ingestor.skip_empty", { source_path: @source_path, duration_s: duration.round(2) })
99
+ true
100
+ end
71
101
 
72
- # 2. Export / Upload
102
+ # @api private
103
+ def step_export
73
104
  @adapter.prepare_export_path(@bucket, @folder_name)
74
105
  dest_path = if @config.storage_mode.to_sym == :s3
75
106
  "s3://#{@bucket}/#{@folder_name}/"
76
107
  else
77
- File.join(@bucket,
78
- @folder_name, "")
108
+ File.join(@bucket, @folder_name, "")
79
109
  end
80
110
 
81
111
  partition_clause = @partition_keys.any? ? "PARTITION_BY (#{@partition_keys.join(", ")})," : ""
@@ -83,7 +113,7 @@ module DataDrain
83
113
  query = <<~SQL
84
114
  COPY (
85
115
  SELECT #{@select_sql}
86
- FROM #{reader_function}
116
+ FROM #{@reader_function}
87
117
  ) TO '#{dest_path}'
88
118
  (
89
119
  FORMAT PARQUET,
@@ -94,32 +124,21 @@ module DataDrain
94
124
  SQL
95
125
 
96
126
  safe_log(:info, "file_ingestor.export_start", { dest_path: dest_path })
97
- step_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
98
- @duckdb.query(query)
99
- export_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - step_start
127
+ timed(:export) { @duckdb.query(query) }
128
+ end
100
129
 
101
- duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
130
+ # @api private
131
+ def log_complete(start_time)
132
+ duration = monotonic - start_time
102
133
  safe_log(:info, "file_ingestor.complete", {
103
134
  source_path: @source_path,
104
135
  duration_s: duration.round(2),
105
- source_query_duration_s: source_query_duration.round(2),
106
- export_duration_s: export_duration.round(2),
107
- count: source_count
136
+ source_query_duration_s: @durations.fetch(:source_query, 0).round(2),
137
+ export_duration_s: @durations.fetch(:export, 0).round(2),
138
+ count: @source_count
108
139
  })
109
-
110
- cleanup_local_file
111
- true
112
- rescue DuckDB::Error => e
113
- duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
114
- safe_log(:error, "file_ingestor.duckdb_error",
115
- { source_path: @source_path }.merge(exception_metadata(e)).merge(duration_s: duration.round(2)))
116
- false
117
- ensure
118
- @duckdb&.close
119
140
  end
120
141
 
121
- private
122
-
123
142
  # @api private
124
143
  def determine_reader
125
144
  case File.extname(@source_path).downcase
@@ -142,6 +161,4 @@ module DataDrain
142
161
  safe_log(:info, "file_ingestor.cleanup", { source_path: @source_path })
143
162
  end
144
163
  end
145
- # rubocop:enable Metrics/AbcSize, Metrics/CyclomaticComplexity, Metrics/PerceivedComplexity,
146
- # Metrics/MethodLength
147
164
  end
data/lib/data_drain/observability/timing.rb ADDED
@@ -0,0 +1,23 @@
1
+ # frozen_string_literal: true
2
+
3
+ module DataDrain
4
+ module Observability
5
+ # Helper for measuring how long operations take.
6
+ # @api private
7
+ module Timing
8
+ private
9
+
10
+ def monotonic
11
+ Process.clock_gettime(Process::CLOCK_MONOTONIC)
12
+ end
13
+
14
+ def timed(step_name)
15
+ t = monotonic
16
+ result = yield
17
+ @durations ||= {}
18
+ @durations[step_name] = monotonic - t
19
+ result
20
+ end
21
+ end
22
+ end
23
+ end
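As a rough illustration of how the mixin above is consumed, the sketch below copies `monotonic` and `timed` into a standalone `Timing` module and includes it in a small class. `DemoJob` and its step names are hypothetical and exist only for this example. The real consumers in this release are `Engine` and `FileIngestor`, which reset `@durations` at the start of `call` in the same way.

```ruby
# Standalone copy of the mixin from the diff, without the DataDrain namespace.
module Timing
  private

  def monotonic
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end

  # Runs the block, stores its elapsed time under step_name, and returns
  # the block's own result so callers can keep using it.
  def timed(step_name)
    t = monotonic
    result = yield
    @durations ||= {}
    @durations[step_name] = monotonic - t
    result
  end
end

# Hypothetical consumer, for illustration only.
class DemoJob
  include Timing

  attr_reader :durations

  def call
    @durations = {}                 # reset per run, as Engine#call does
    count = timed(:count) { 42 }    # the block's value passes through
    timed(:work) { sleep(0.01) }
    count
  end
end

JOB = DemoJob.new
RESULT = JOB.call
puts "result=#{RESULT} steps=#{JOB.durations.keys.inspect}"
```

Because `timed` returns the block's value, a step such as `step_verify` can be used directly in a condition while its duration is still recorded. Reading a step that never ran needs a default, which is why the log methods in the diff use `@durations.fetch(:export, 0)`.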
data/lib/data_drain/record.rb CHANGED
@@ -46,7 +46,6 @@ module DataDrain
46
46
  # This avoids having to reload extensions (such as httpfs) on every query.
47
47
  #
48
48
  # @return [DuckDB::Connection] Active DuckDB connection.
49
- # rubocop:disable Metrics/AbcSize
50
49
  def self.connection
51
50
  Thread.current[:data_drain_duckdb] ||= begin
52
51
  db = DuckDB::Database.open(":memory:")
@@ -57,11 +56,13 @@ module DataDrain
57
56
  conn.query("SET temp_directory='#{config.tmp_directory}'") if config.tmp_directory.present?
58
57
 
59
58
  DataDrain::Storage.adapter.setup_duckdb(conn)
59
+
60
+ conn.query("SET lock_configuration=true;")
61
+
60
62
  { db: db, conn: conn }
61
63
  end
62
64
  Thread.current[:data_drain_duckdb][:conn]
63
65
  end
64
- # rubocop:enable Metrics/AbcSize
65
66
 
66
67
  # Queries records in the Data Lake, filtering by partition keys.
67
68
  #
@@ -138,22 +139,14 @@ module DataDrain
138
139
  # @param sql [String]
139
140
  # @param columns [Array<String>]
140
141
  # @return [Array<DataDrain::Record>]
141
- # rubocop:disable Metrics/MethodLength
142
142
  def execute_and_instantiate(sql, columns)
143
143
  @logger = DataDrain.configuration.logger
144
- begin
145
- result = connection.query(sql)
146
- rescue DuckDB::Error => e
147
- safe_log(:warn, "record.parquet_not_found", exception_metadata(e))
148
- return []
149
- end
150
-
151
- result.map do |row|
152
- attributes_hash = columns.zip(row).to_h
153
- new(attributes_hash)
154
- end
144
+ result = connection.query(sql)
145
+ result.map { |row| new(columns.zip(row).to_h) }
146
+ rescue DuckDB::Error => e
147
+ safe_log(:warn, "record.parquet_not_found", exception_metadata(e))
148
+ []
155
149
  end
156
150
  end
157
- # rubocop:enable Metrics/MethodLength
158
151
  end
159
152
  end
@@ -3,8 +3,6 @@
3
3
  module DataDrain
4
4
  module Storage
5
5
  class S3 < Base
6
- # rubocop:disable Metrics/AbcSize, Metrics/CyclomaticComplexity, Metrics/MethodLength
7
-
8
6
  # Loads the httpfs extension into DuckDB and injects the AWS credentials.
9
7
  # If aws_access_key_id and aws_secret_access_key are set, it uses
10
8
  # explicit credentials. Otherwise it uses credential_chain (IAM role, env vars,
@@ -32,34 +30,56 @@ module DataDrain
32
30
  # @param partitions [Hash]
33
31
  # @return [Integer]
34
32
  def destroy_partitions(bucket, folder_name, partition_keys, partitions)
35
- client = Aws::S3::Client.new(
33
+ client = s3_client
34
+ prefix, pattern_regex = build_destroy_pattern(folder_name, partition_keys, partitions)
35
+ objects = collect_matching_objects(client, bucket, prefix, pattern_regex)
36
+ delete_in_batches(client, bucket, objects)
37
+ end
38
+
39
+ private
40
+
41
+ # @return [Aws::S3::Client]
42
+ def s3_client
43
+ Aws::S3::Client.new(
36
44
  region: @config.aws_region,
37
45
  access_key_id: @config.aws_access_key_id,
38
46
  secret_access_key: @config.aws_secret_access_key
39
47
  )
48
+ end
40
49
 
50
+ # @param folder_name [String]
51
+ # @param partition_keys [Array<Symbol>]
52
+ # @param partitions [Hash]
53
+ # @return [Array(String, Regexp)] prefix and pattern_regex
54
+ def build_destroy_pattern(folder_name, partition_keys, partitions)
41
55
  regex_parts = partition_keys.map do |key|
42
56
  val = partitions[key]
43
57
  val.nil? || val.to_s.empty? ? "#{key}=[^/]+" : "#{key}=#{val}"
44
58
  end
45
- pattern_regex = Regexp.new("^#{folder_name}/#{regex_parts.join("/")}")
59
+ pattern = Regexp.new("^#{folder_name}/#{regex_parts.join("/")}")
46
60
 
47
- objects_to_delete = []
48
61
  prefix = "#{folder_name}/"
49
62
  first_key = partition_keys.first
50
63
  prefix += "#{first_key}=#{partitions[first_key]}/" if partitions[first_key]
51
64
 
65
+ [prefix, pattern]
66
+ end
67
+
68
+ # @param client [Aws::S3::Client]
69
+ # @param bucket [String]
70
+ # @param prefix [String]
71
+ # @param pattern_regex [Regexp]
72
+ # @return [Array<Hash>]
73
+ def collect_matching_objects(client, bucket, prefix, pattern_regex)
74
+ objects = []
52
75
  client.list_objects_v2(bucket: bucket, prefix: prefix).each do |response|
53
76
  response.contents.each do |obj|
54
- objects_to_delete << { key: obj.key } if obj.key.match?(pattern_regex)
77
+ objects << { key: obj.key } if obj.key.match?(pattern_regex)
55
78
  end
56
79
  end
57
-
58
- delete_in_batches(client, bucket, objects_to_delete)
80
+ objects
59
81
  end
60
82
 
61
- private
62
-
63
83
  # @param connection [DuckDB::Connection]
64
84
  # @raise [DataDrain::ConfigurationError]
65
85
  def create_s3_secret(connection)
@@ -107,6 +127,5 @@ module DataDrain
107
127
  deleted_count
108
128
  end
109
129
  end
110
- # rubocop:enable Metrics/AbcSize, Metrics/CyclomaticComplexity, Metrics/MethodLength
111
130
  end
112
131
  end
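To see what the extracted `build_destroy_pattern` produces, the sketch below lifts its body into a top-level method and applies it to a few made-up object keys. The folder name, partition keys, and sample keys are all assumptions for illustration, and no S3 call is made. In the gem the method is private on the S3 adapter and its result is passed to `collect_matching_objects`.

```ruby
# Body taken from the diff and placed in a top-level method so it can run alone.
def build_destroy_pattern(folder_name, partition_keys, partitions)
  # A missing or empty partition value becomes a wildcard for that path segment.
  regex_parts = partition_keys.map do |key|
    val = partitions[key]
    val.nil? || val.to_s.empty? ? "#{key}=[^/]+" : "#{key}=#{val}"
  end
  pattern = Regexp.new("^#{folder_name}/#{regex_parts.join("/")}")

  # The listing prefix is narrowed only by the first partition key, when given.
  prefix = "#{folder_name}/"
  first_key = partition_keys.first
  prefix += "#{first_key}=#{partitions[first_key]}/" if partitions[first_key]

  [prefix, pattern]
end

# Hypothetical inputs: year is pinned, month is left open.
PREFIX, PATTERN = build_destroy_pattern("events", %i[year month], { year: 2024 })

# Made-up object keys standing in for a list_objects_v2 response.
SAMPLE_KEYS = [
  "events/year=2024/month=1/data_0.parquet",
  "events/year=2023/month=1/data_0.parquet",
  "events/other/file.parquet"
].freeze

MATCHED = SAMPLE_KEYS.select { |key| key.match?(PATTERN) }
puts PREFIX
puts MATCHED
```

The prefix limits what is listed from the bucket, and the regex then filters the listed keys, so only objects under `year=2024` with some `month=` segment are selected. Partition values are interpolated into the regex as-is, so a value containing regex metacharacters would be read as a pattern.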
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module DataDrain
4
- VERSION = "0.2.2"
4
+ VERSION = "0.3.0"
5
5
  end
data/lib/data_drain.rb CHANGED
@@ -7,6 +7,7 @@ require_relative "data_drain/configuration"
7
7
  require_relative "data_drain/validations"
8
8
  require_relative "data_drain/storage"
9
9
  require_relative "data_drain/observability"
10
+ require_relative "data_drain/observability/timing"
10
11
  require_relative "data_drain/engine"
11
12
  require_relative "data_drain/record"
12
13
  require_relative "data_drain/file_ingestor"
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: data_drain
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.2.2
4
+ version: 0.3.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Gabriel
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2026-04-14 00:00:00.000000000 Z
11
+ date: 2026-04-15 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: activemodel
@@ -109,6 +109,7 @@ files:
109
109
  - lib/data_drain/file_ingestor.rb
110
110
  - lib/data_drain/glue_runner.rb
111
111
  - lib/data_drain/observability.rb
112
+ - lib/data_drain/observability/timing.rb
112
113
  - lib/data_drain/record.rb
113
114
  - lib/data_drain/storage.rb
114
115
  - lib/data_drain/storage/base.rb