data_drain 0.1.13 → 0.1.15
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +14 -0
- data/CLAUDE.md +56 -0
- data/lib/data_drain/engine.rb +14 -10
- data/lib/data_drain/file_ingestor.rb +12 -7
- data/lib/data_drain/glue_runner.rb +7 -4
- data/lib/data_drain/record.rb +5 -4
- data/lib/data_drain/storage.rb +17 -9
- data/lib/data_drain/version.rb +1 -1
- data/lib/data_drain.rb +1 -0
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c5a97927218d94763cdead9362a4a0a0a40fe4a1b8b327f0f074117a66a10a46
+  data.tar.gz: a5f28048457a43d86942472b36946955e0aa88c9d75cb85158d27f44c986aec2
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5f538227b8eda210214fa448ede9f3247fa73bf997a2cfb04ca0a1b37c81b096198ad250b4e3dcf2f902a4f54eee16cb69233e68e03d65aca68c6bf497de72e4
+  data.tar.gz: 93f7b591e556713614c0310415301787605cf3cefc26718e94b74ee1ab60cec17ffaeaaec3d7b8ce5bb31deb40a51b1007d60bd9cdcba1211e4a8e06f1079293
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,19 @@
 ## [Unreleased]
 
+## [0.1.15] - 2026-03-23
+
+- Performance: Durations are measured with a monotonic clock (`Process.clock_gettime`) in the terminal events of `Engine`, `FileIngestor`, and `GlueRunner`.
+- Fix: `idle_in_transaction_session_timeout` is now applied correctly when the value is `0` (which disables the timeout). Previously `0.present?` evaluated to `false` and the setting was ignored.
+- Fix: The `DuckDB::Database` object in `Record` is now anchored in the thread-local alongside the connection, preventing premature garbage collection.
+- Fix: `Storage.adapter` caches the instance instead of creating one on every call.
+- Documentation: Added `CLAUDE.md` with an architecture guide and project standards.
+
+## [0.1.14] - 2026-03-17
+
+- Feature: **Structured Logging** (`key=value`) implemented across the whole gem for better observability in production.
+- Optimization: Automatic caching of storage adapters to improve the performance of repeated queries.
+- Testing: Improved robustness of the `Engine` tests by decoupling them from minor changes in the DuckDB setup.
+
 ## [0.1.13] - 2026-03-17
 
 - Feature: Full parameterization of the Glue orchestration. Added `s3_bucket`, `s3_folder`, and `partition_by` as dynamic arguments, allowing the same Glue Job to serve multiple tables and destinations.
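The structured-logging format introduced in 0.1.14, together with the sensitive-field filtering rule documented below in CLAUDE.md, can be sketched as follows; `log_line` and its field set are an illustrative stand-in, not the gem's actual helper:

```ruby
# Illustrative key=value log-line builder with sensitive-field filtering
# (hypothetical log_line helper, not part of data_drain's API).
SENSITIVE = /password|token|secret|api_key|auth/i

def log_line(event, **fields)
  rendered = fields.map do |key, value|
    # Any field whose name matches the sensitive list is masked.
    shown = key.to_s.match?(SENSITIVE) ? "[FILTERED]" : value
    "#{key}=#{shown}"
  end
  (["component=data_drain", "event=#{event}"] + rendered).join(" ")
end

log_line("engine.start", table: "events", api_key: "abc123")
# => "component=data_drain event=engine.start table=events api_key=[FILTERED]"
```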
data/CLAUDE.md
ADDED
@@ -0,0 +1,56 @@
+# DataDrain - Development Context
+
+## Architecture and Core Patterns
+
+- **Engine (`DataDrain::Engine`):** Orchestrates the ETL flow: Count → Export → Verify → Purge. The export step can be skipped with `skip_export: true` (to delegate it to AWS Glue).
+- **Storage Adapters (`DataDrain::Storage`):** Strategy pattern. The instance is cached in `DataDrain::Storage.adapter`. If `storage_mode` changes at runtime, call `DataDrain::Storage.reset_adapter!` before the next operation.
+- **Analytical ORM (`DataDrain::Record`):** Read-only ActiveRecord-style interface over Parquet via DuckDB. Uses one DuckDB connection per thread (anchored together with its `DuckDB::Database` in `Thread.current[:data_drain_duckdb]`), initialized once and reused; it is never explicitly closed. Keep this in mind under Puma/Sidekiq.
+- **Glue Orchestrator (`DataDrain::GlueRunner`):** For 1TB+ tables. Pattern: `GlueRunner.run_and_wait(...)` followed by `Engine.new(..., skip_export: true).call` to verify + purge.
+
+## Critical Conventions
+
+### Purge Safety
+`purge_from_postgres` must never run if `verify_integrity` returns `false`. The mathematical count verification (Postgres vs Parquet) is the only safety gate before data is deleted.
+
+### Date Precision
+Range SQL queries must always use **half-open bounds**:
+```sql
+created_at >= 'START' AND created_at < 'END_BOUNDARY'
+```
+Where `END_BOUNDARY` is the start of the next period (e.g. `next_day.beginning_of_day`). Never use `<= end_of_day`: microseconds at the boundary can fall outside the range.
+
+### Idempotency
+Exports use DuckDB's `OVERWRITE_OR_IGNORE 1`. The processes are safe to retry.
+
+### `idle_in_transaction_session_timeout`
+The value `0` **disables** the timeout (no limit). For high-volume purges this is mandatory. Internally, validate with `!nil?`, since `0.present?` is false.
+
+## Logging
+
+Follow the global standards defined in `~/.claude/CLAUDE.md`. Project-specific rules:
+
+- Mandatory format: `component=data_drain event=<class>.<event> [fields]`
+- The `source` field is injected automatically by `exis_ray` via `ExisRay::Tracer`; DataDrain must not include it or accept it as a parameter
+- No purely descriptive logs, no emojis, no bracketed prefixes
+- DEBUG always in block form: `logger.debug { "k=#{v}" }`
+- Durations with a monotonic clock: `Process.clock_gettime(Process::CLOCK_MONOTONIC)`
+- Filter sensitive data (`password`, `token`, `secret`, `api_key`, `auth`) → `[FILTERED]`
+
+## Ruby Code
+
+- All new or modified code must pass `bundle exec rubocop` with no offenses
+- Public documentation in YARD (`@param`, `@return`, `@raise`, `@example`)
+- Do not modify or add YARD/comments to existing untouched code
+
+## Commands
+
+```bash
+bundle exec rspec    # tests
+bundle exec rubocop  # linting
+bin/console          # development REPL
+```
+
+## Performance
+
+- `limit_ram` and `tmp_directory` in the configuration prevent OOM in containers
+- DuckDB spills to disk automatically when `tmp_directory` is set
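The half-open bound rule from the Date Precision section above can be sketched in plain Ruby; `day_range_clause` is a hypothetical helper (the gem presumably derives the boundary with ActiveSupport's `next_day.beginning_of_day`):

```ruby
require "date"

# Half-open range clause: the end boundary is the first day AFTER the range,
# compared with <, so boundary-microsecond rows are never silently dropped.
def day_range_clause(start_date, end_date)
  end_boundary = end_date + 1 # Date#+ gives the day after the range
  "created_at >= '#{start_date}' AND created_at < '#{end_boundary}'"
end

day_range_clause(Date.new(2026, 3, 1), Date.new(2026, 3, 22))
# => "created_at >= '2026-03-01' AND created_at < '2026-03-23'"
```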
data/lib/data_drain/engine.rb
CHANGED
@@ -49,30 +49,34 @@ module DataDrain
     #
     # @return [Boolean] `true` if the process completed successfully, `false` if integrity verification failed.
     def call
-
+      start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+      @logger.info "component=data_drain event=engine.start table=#{@table_name} start_date=#{@start_date.to_date} end_date=#{@end_date.to_date}"
 
       setup_duckdb
 
       @pg_count = get_postgres_count
 
       if @pg_count.zero?
-
+        duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
+        @logger.info "component=data_drain event=engine.skip_empty table=#{@table_name} duration=#{duration.round(2)}s"
         return true
       end
 
       if @skip_export
-        @logger.info "
+        @logger.info "component=data_drain event=engine.skip_export table=#{@table_name}"
       else
-        @logger.info "
+        @logger.info "component=data_drain event=engine.export_start table=#{@table_name} count=#{@pg_count}"
         export_to_parquet
       end
 
       if verify_integrity
         purge_from_postgres
-
+        duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
+        @logger.info "component=data_drain event=engine.complete table=#{@table_name} duration=#{duration.round(2)}s"
        true
      else
-
+        duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
+        @logger.error "component=data_drain event=engine.integrity_error table=#{@table_name} duration=#{duration.round(2)}s"
        false
      end
    end
@@ -147,17 +151,17 @@ module DataDrain
       SQL
       parquet_result = @duckdb.query(query).first.first
     rescue DuckDB::Error => e
-      @logger.error "
+      @logger.error "component=data_drain event=engine.parquet_read_error table=#{@table_name} error=#{e.message}"
       return false
     end
 
-    @logger.info "
+    @logger.info "component=data_drain event=engine.integrity_check table=#{@table_name} pg_count=#{@pg_count} parquet_count=#{parquet_result}"
     @pg_count == parquet_result
   end
 
   # @api private
   def purge_from_postgres
-    @logger.info "
+    @logger.info "component=data_drain event=engine.purge_start table=#{@table_name} batch_size=#{@config.batch_size}"
 
     conn = PG.connect(
       host: @config.db_host,
@@ -167,7 +171,7 @@ module DataDrain
       dbname: @config.db_name
     )
 
-
+    unless @config.idle_in_transaction_session_timeout.nil?
      conn.exec("SET idle_in_transaction_session_timeout = #{@config.idle_in_transaction_session_timeout};")
    end
 
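The nil-guard added to `purge_from_postgres` above can be shown in isolation; `timeout_statement` is a hypothetical stand-in (the real code passes the statement to `conn.exec`):

```ruby
# A nil check lets an explicit 0 through to Postgres, where
# idle_in_transaction_session_timeout = 0 means "no timeout".
def timeout_statement(value)
  return nil if value.nil? # not configured: keep the server default

  "SET idle_in_transaction_session_timeout = #{Integer(value)};"
end

timeout_statement(0)      # => "SET idle_in_transaction_session_timeout = 0;"
timeout_statement(nil)    # => nil
timeout_statement(30_000) # => "SET idle_in_transaction_session_timeout = 30000;"
```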
data/lib/data_drain/file_ingestor.rb
CHANGED
@@ -30,10 +30,11 @@ module DataDrain
     # Runs the ingestion flow.
     # @return [Boolean] true if the process succeeded.
     def call
-
+      start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+      @logger.info "component=data_drain event=file_ingestor.start source_path=#{@source_path}"
 
       unless File.exist?(@source_path)
-        @logger.error "
+        @logger.error "component=data_drain event=file_ingestor.file_not_found source_path=#{@source_path}"
         return false
       end
 
@@ -47,10 +48,12 @@ module DataDrain
 
       # 1. Safety count
       source_count = @duckdb.query("SELECT COUNT(*) FROM #{reader_function}").first.first
-      @logger.info "
+      @logger.info "component=data_drain event=file_ingestor.count source_path=#{@source_path} count=#{source_count}"
 
       if source_count.zero?
         cleanup_local_file
+        duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
+        @logger.info "component=data_drain event=file_ingestor.skip_empty source_path=#{@source_path} duration=#{duration.round(2)}s"
         return true
       end
 
@@ -73,15 +76,17 @@ module DataDrain
       );
       SQL
 
-      @logger.info "
+      @logger.info "component=data_drain event=file_ingestor.export_start dest_path=#{dest_path}"
       @duckdb.query(query)
 
-
+      duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
+      @logger.info "component=data_drain event=file_ingestor.complete source_path=#{@source_path} duration=#{duration.round(2)}s"
 
       cleanup_local_file
       true
     rescue DuckDB::Error => e
-
+      duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
+      @logger.error "component=data_drain event=file_ingestor.duckdb_error source_path=#{@source_path} error=#{e.message} duration=#{duration.round(2)}s"
      false
    ensure
      @duckdb&.close
@@ -107,7 +112,7 @@ module DataDrain
     def cleanup_local_file
       if @delete_after_upload && File.exist?(@source_path)
         File.delete(@source_path)
-        @logger.info "
+        @logger.info "component=data_drain event=file_ingestor.cleanup source_path=#{@source_path}"
       end
     end
   end
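The duration pattern repeated across the `Engine` and `FileIngestor` terminal events can be sketched as a standalone wrapper; `timed` is hypothetical, not part of the gem:

```ruby
# CLOCK_MONOTONIC never jumps with NTP or DST adjustments, so subtracting
# two readings gives a reliable elapsed time, unlike Time.now arithmetic.
def timed
  start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
  [result, duration.round(2)] # rounded, as in the gem's log fields
end

value, seconds = timed { sleep 0.05; :done }
# value is :done; seconds is a non-negative Float, roughly 0.05
```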
data/lib/data_drain/glue_runner.rb
CHANGED
@@ -16,8 +16,9 @@ module DataDrain
     def self.run_and_wait(job_name, arguments = {}, polling_interval: 30)
       config = DataDrain.configuration
       client = Aws::Glue::Client.new(region: config.aws_region)
+      start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
 
-      config.logger.info "
+      config.logger.info "component=data_drain event=glue_runner.start job=#{job_name}"
       resp = client.start_job_run(job_name: job_name, arguments: arguments)
       run_id = resp.job_run_id
 
@@ -27,14 +28,16 @@ module DataDrain
 
         case status
         when "SUCCEEDED"
-
+          duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
+          config.logger.info "component=data_drain event=glue_runner.complete job=#{job_name} run_id=#{run_id} duration=#{duration.round(2)}s"
          return true
        when "FAILED", "STOPPED", "TIMEOUT"
          error_msg = run_info.error_message || "No error message available."
-
+          duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
+          config.logger.error "component=data_drain event=glue_runner.failed job=#{job_name} run_id=#{run_id} status=#{status} error=#{error_msg} duration=#{duration.round(2)}s"
          raise "Glue Job #{job_name} (Run ID: #{run_id}) failed with status #{status}."
        else
-          config.logger.info "
+          config.logger.info "component=data_drain event=glue_runner.polling job=#{job_name} run_id=#{run_id} status=#{status} next_check_in=#{polling_interval}s"
          sleep polling_interval
        end
      end
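The `run_and_wait` polling loop above reduces to the following sketch; `wait_for` takes an injected status lambda instead of the AWS SDK client (the gem polls `client.get_job_run`):

```ruby
# Poll until a terminal status: SUCCEEDED returns true, any terminal
# failure raises, anything else sleeps and polls again.
TERMINAL_FAILURES = %w[FAILED STOPPED TIMEOUT].freeze

def wait_for(fetch_status, polling_interval: 0)
  loop do
    status = fetch_status.call
    return true if status == "SUCCEEDED"
    raise "Glue job failed with status #{status}." if TERMINAL_FAILURES.include?(status)

    sleep polling_interval # back off before the next poll
  end
end

statuses = %w[RUNNING RUNNING SUCCEEDED]
wait_for(-> { statuses.shift }) # => true
```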
data/lib/data_drain/record.rb
CHANGED
@@ -27,7 +27,7 @@ module DataDrain
     #
     # @return [DuckDB::Connection] Active DuckDB connection.
     def self.connection
-      Thread.current[:
+      Thread.current[:data_drain_duckdb] ||= begin
         db = DuckDB::Database.open(":memory:")
         conn = db.connect
 
@@ -36,8 +36,9 @@ module DataDrain
       conn.query("SET temp_directory='#{config.tmp_directory}'") if config.tmp_directory.present?
 
       DataDrain::Storage.adapter.setup_duckdb(conn)
-      conn
+        { db: db, conn: conn }
       end
+      Thread.current[:data_drain_duckdb][:conn]
     end
 
     # Queries records in the Data Lake, filtering by partition keys.
@@ -85,7 +86,7 @@ module DataDrain
     # @return [Integer] Number of physical partitions removed.
     def self.destroy_all(**partitions)
       adapter = DataDrain::Storage.adapter
-      DataDrain.configuration.logger.info "
+      DataDrain.configuration.logger.info "component=data_drain event=record.destroy_all folder=#{folder_name} partitions=#{partitions.inspect}"
 
       adapter.destroy_partitions(bucket, folder_name, partition_keys, partitions)
     end
@@ -118,7 +119,7 @@ module DataDrain
       begin
         result = connection.query(sql)
       rescue DuckDB::Error => e
-        DataDrain.configuration.logger.warn "
+        DataDrain.configuration.logger.warn "component=data_drain event=record.parquet_not_found error=#{e.message}"
         return []
       end
 
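The garbage-collection fix in `Record.connection` can be demonstrated with dummy classes standing in for the `duckdb` gem: storing only the connection would leave the database object unreferenced, while one thread-local hash anchors both.

```ruby
# Dummy stand-ins for DuckDB::Database / DuckDB::Connection (assumption:
# simplified shapes, not the duckdb gem's API).
Database = Struct.new(:name) do
  def connect
    Connection.new(self)
  end
end
Connection = Struct.new(:db)

def connection
  # Keeping database and connection in the same thread-local hash keeps the
  # database reachable, so the GC cannot finalize it while the connection is
  # still in use.
  Thread.current[:sketch_duckdb] ||= begin
    db = Database.new(":memory:")
    { db: db, conn: db.connect }
  end
  Thread.current[:sketch_duckdb][:conn]
end

conn = connection
conn.equal?(connection)              # the per-thread connection is reused
Thread.current[:sketch_duckdb][:db]  # the database stays anchored
```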
data/lib/data_drain/storage.rb
CHANGED
@@ -11,20 +11,28 @@ module DataDrain
     class InvalidAdapterError < DataDrain::Error; end
 
     # Resolves and instantiates the corresponding storage adapter
-    # based on the framework's current configuration.
+    # based on the framework's current configuration. The instance is
+    # cached to avoid unnecessary allocations between queries.
     #
     # @return [DataDrain::Storage::Base] A Local or S3 instance.
     # @raise [InvalidAdapterError] If the storage_mode is not valid.
     def self.adapter
-
-
-
-
-
-
-
-
+      @adapter ||= begin
+        mode = DataDrain.configuration.storage_mode
+        case mode.to_sym
+        when :local
+          Local.new(DataDrain.configuration)
+        when :s3
+          S3.new(DataDrain.configuration)
+        else
+          raise InvalidAdapterError, "Storage mode '#{mode}' is not supported."
+        end
       end
     end
+
+    # Discards the cached adapter. Call when storage_mode changes.
+    def self.reset_adapter!
+      @adapter = nil
+    end
   end
 end
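The memoization and reset added to `Storage` can be sketched with stand-in adapter classes; the `Storage` module below is hypothetical, not the gem's namespace:

```ruby
module Storage
  Local = Class.new
  S3 = Class.new

  class << self
    attr_accessor :storage_mode

    # Built once; ||= skips the case dispatch on subsequent calls.
    def adapter
      @adapter ||= case storage_mode.to_sym
                   when :local then Local.new
                   when :s3 then S3.new
                   else raise ArgumentError, "Storage mode '#{storage_mode}' is not supported."
                   end
    end

    # Drop the cache so a runtime storage_mode change takes effect.
    def reset_adapter!
      @adapter = nil
    end
  end
end

Storage.storage_mode = :local
first = Storage.adapter
first.equal?(Storage.adapter)  # cached: the same instance is returned
Storage.storage_mode = :s3
Storage.adapter.equal?(first)  # still the stale Local adapter
Storage.reset_adapter!
Storage.adapter                # now a Storage::S3 instance
```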
data/lib/data_drain/version.rb
CHANGED
data/lib/data_drain.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: data_drain
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.15
 platform: ruby
 authors:
 - Gabriel
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2026-03-
+date: 2026-03-23 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activemodel
@@ -91,6 +91,7 @@ files:
 - ".rspec"
 - ".rubocop.yml"
 - CHANGELOG.md
+- CLAUDE.md
 - CODE_OF_CONDUCT.md
 - LICENSE.txt
 - README.md