data_drain 0.1.19 → 0.2.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +13 -0
- data/CLAUDE.md +4 -0
- data/README.md +67 -170
- data/lib/data_drain/engine.rb +53 -40
- data/lib/data_drain/file_ingestor.rb +40 -25
- data/lib/data_drain/record.rb +24 -3
- data/lib/data_drain/storage/s3.rb +48 -6
- data/lib/data_drain/validations.rb +17 -0
- data/lib/data_drain/version.rb +1 -1
- data/lib/data_drain.rb +2 -0
- data/skill/SKILL.md +215 -0
- data/skill/references/antipatrones.md +242 -0
- data/skill/references/api-detallada.md +257 -0
- data/skill/references/eventos-telemetria.md +154 -0
- metadata +7 -2
|
@@ -0,0 +1,242 @@
|
|
|
1
|
+
# Anti-patterns
|
|
2
|
+
|
|
3
|
+
What NOT to do in DataDrain. Each anti-pattern includes incorrect code, the reason, and the correct alternative.
|
|
4
|
+
|
|
5
|
+
## 1. Bypassing `verify_integrity` to purge faster
|
|
6
|
+
|
|
7
|
+
**Incorrect:**
|
|
8
|
+
```ruby
|
|
9
|
+
engine = DataDrain::Engine.new(...)
|
|
10
|
+
engine.send(:setup_duckdb)
|
|
11
|
+
engine.send(:purge_from_postgres) # WITHOUT verifying first
|
|
12
|
+
```
|
|
13
|
+
|
|
14
|
+
**Reason:** `verify_integrity` is the **only mathematical safeguard** between the export and the final `DELETE`. If it is skipped, you can delete data that was never archived (silent corruption, an empty Parquet file, a date mismatch, etc.).
|
|
15
|
+
|
|
16
|
+
**Alternative:** Always use `Engine#call`. If you only need to verify and purge (because Glue/EMR performed the export), use `skip_export: true`; verification remains mandatory within the flow.
|
|
17
|
+
|
|
18
|
+
---
|
|
19
|
+
|
|
20
|
+
## 2. Mismatched `partition_keys` order between write and read
|
|
21
|
+
|
|
22
|
+
**Incorrect:**
|
|
23
|
+
```ruby
|
|
24
|
+
# Engine writes with order A
|
|
25
|
+
Engine.new(partition_keys: %w[year month isp_id], ...).call
|
|
26
|
+
|
|
27
|
+
# Record reads with order B
|
|
28
|
+
class ArchivedX < DataDrain::Record
|
|
29
|
+
self.partition_keys = [:isp_id, :year, :month] # MISMATCH
|
|
30
|
+
end
|
|
31
|
+
```
|
|
32
|
+
|
|
33
|
+
**Reason:** The order of `partition_keys` determines the Hive hierarchy on disk (`year=X/month=Y/isp_id=Z`). If Record reads with a different order, the generated path does not match and **DuckDB returns `[]` without an error**. The failure is silent.
|
|
34
|
+
|
|
35
|
+
**Alternative:** Keep the order identical for writing (Engine/FileIngestor) and reading (Record). Canonical convention: `[dimension_principal, year, month]`, where `dimension_principal` is the main dimension (highest cardinality or most frequently used filter first).
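
To make the silent mismatch concrete, here is a minimal sketch of how a Hive-style path depends on key order. `build_partition_path` is a hypothetical helper written for this illustration, not part of the gem; it only assumes that segments are joined as `key=value` in the order of `partition_keys`.

```ruby
# Hypothetical helper (not the gem's API): joins key=value segments
# in the order given by partition_keys.
def build_partition_path(partition_keys, values)
  partition_keys.map { |key| "#{key}=#{values.fetch(key.to_sym)}" }.join("/")
end

values = { isp_id: 42, year: 2025, month: 10 }

written_path = build_partition_path(%w[year month isp_id], values)
read_path    = build_partition_path(%i[isp_id year month], values)

puts written_path # year=2025/month=10/isp_id=42
puts read_path    # isp_id=42/year=2025/month=10
puts written_path == read_path # false: the reader looks where nothing was written
```

Both paths are well-formed, so nothing raises; the reader simply finds no files.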
|
|
36
|
+
|
|
37
|
+
---
|
|
38
|
+
|
|
39
|
+
## 3. Changing `storage_mode` without resetting the adapter
|
|
40
|
+
|
|
41
|
+
**Incorrect:**
|
|
42
|
+
```ruby
|
|
43
|
+
DataDrain.configure { |c| c.storage_mode = :s3 }
|
|
44
|
+
DataDrain::Engine.new(...).call # Still uses the cached Local adapter if it was already initialized
|
|
45
|
+
```
|
|
46
|
+
|
|
47
|
+
**Reason:** `Storage.adapter` is memoized (`@adapter ||= ...`). Changing `storage_mode` after the first call has no effect.
|
|
48
|
+
|
|
49
|
+
**Alternative:**
|
|
50
|
+
```ruby
|
|
51
|
+
DataDrain.configure { |c| c.storage_mode = :s3 }
|
|
52
|
+
DataDrain::Storage.reset_adapter!
|
|
53
|
+
```
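
The memoization effect can be reproduced in isolation. The sketch below uses a hypothetical `SketchStorage` module with symbols standing in for adapter objects; it is an illustration of the `@adapter ||= ...` pattern under those assumptions, not the gem's implementation.

```ruby
# Hypothetical stand-in for a module-level memoized adapter (not the gem's code).
module SketchStorage
  class << self
    attr_accessor :mode

    # Memoized: the mode is only consulted on the first call.
    def adapter
      @adapter ||= (mode == :s3 ? :s3_adapter : :local_adapter)
    end

    # Clears the memoized value so the next call re-reads the mode.
    def reset_adapter!
      @adapter = nil
    end
  end
end

SketchStorage.mode = :local
first = SketchStorage.adapter

SketchStorage.mode = :s3
stale = SketchStorage.adapter # still the adapter built for :local

SketchStorage.reset_adapter!
fresh = SketchStorage.adapter # rebuilt for :s3
```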
|
|
54
|
+
|
|
55
|
+
---
|
|
56
|
+
|
|
57
|
+
## 4. Validating `idle_in_transaction_session_timeout` with `.present?`
|
|
58
|
+
|
|
59
|
+
**Incorrect:**
|
|
60
|
+
```ruby
|
|
61
|
+
if @config.idle_in_transaction_session_timeout.present? # needs ActiveSupport; note 0.present? == true (Numeric#blank? is always false)
|
|
62
|
+
conn.exec("SET ... = #{...};")
|
|
63
|
+
end
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
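**Reason:** The value `0` means **timeout disabled** (no limit), which is exactly what you want for bulk purges, so it must always reach Postgres. Note that ActiveSupport defines `Numeric#blank?` as always `false`, so `0.present?` is actually `true`; `.present?` does not drop `0`, but it only exists when ActiveSupport is loaded and it hides the real contract, which is `nil` = do not set and any Integer (including `0`) = set. A truthiness-style check such as `.positive?` would silently ignore `0` and leave Postgres on its default timeout.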
|
|
67
|
+
|
|
68
|
+
**Alternative:** Check with `nil?` instead:
|
|
69
|
+
```ruby
|
|
70
|
+
unless @config.idle_in_transaction_session_timeout.nil?
|
|
71
|
+
conn.exec("SET ... = #{@config.idle_in_transaction_session_timeout};")
|
|
72
|
+
end
|
|
73
|
+
```
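
The `nil` versus `0` contract can be sketched without Rails. `idle_timeout_statement` is a hypothetical helper for illustration only; the full `SET` text is an assumption based on the Postgres parameter name, since the original snippet elides it.

```ruby
# Hypothetical helper: returns the SET statement, or nil when the
# setting should be left untouched. Only nil means "do not set".
def idle_timeout_statement(timeout_ms)
  return nil if timeout_ms.nil?

  "SET idle_in_transaction_session_timeout = #{Integer(timeout_ms)};"
end

puts idle_timeout_statement(0).inspect   # 0 is forwarded (timeout disabled)
puts idle_timeout_statement(nil).inspect # nil: nothing is sent to Postgres
```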
|
|
74
|
+
|
|
75
|
+
---
|
|
76
|
+
|
|
77
|
+
## 5. Using `<= end_of_day` in date ranges
|
|
78
|
+
|
|
79
|
+
**Incorrect:**
|
|
80
|
+
```ruby
|
|
81
|
+
"created_at >= '#{start.beginning_of_day}' AND created_at <= '#{end_date.end_of_day}'"
|
|
82
|
+
```
|
|
83
|
+
|
|
84
|
+
**Reason:** `end_of_day` returns `23:59:59.999999`. Records whose timestamps fall in the fractions of a microsecond that follow (`23:59:59.9999995`) are excluded or land in the wrong range, depending on how the client rounds (floor/ceil). With `BETWEEN` or `<=`, the row loss is silent.
|
|
85
|
+
|
|
86
|
+
**Alternative:** Use a half-open range with `<` and the boundary of the next period:
|
|
87
|
+
```ruby
|
|
88
|
+
"created_at >= '#{start.beginning_of_day}' AND created_at < '#{end_date.next_day.beginning_of_day}'"
|
|
89
|
+
```
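
The boundary effect can be checked with plain Ruby `Time` values. This is a sketch with illustrative dates; it compares in-memory timestamps only and makes no claim about how a given database rounds sub-microsecond values.

```ruby
end_of_day     = Time.utc(2025, 1, 31, 23, 59, 59, 999_999)       # 23:59:59.999999
next_day_start = Time.utc(2025, 2, 1)                              # 00:00:00 of the next day
late_record    = Time.at(end_of_day.to_i, 999_999_500, :nsec).utc  # 23:59:59.9999995

closed_range_matches    = late_record <= end_of_day     # false: the row is skipped
half_open_range_matches = late_record < next_day_start  # true: the row is included

puts closed_range_matches
puts half_open_range_matches
```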
|
|
90
|
+
|
|
91
|
+
---
|
|
92
|
+
|
|
93
|
+
## 6. Logging `source=` manually
|
|
94
|
+
|
|
95
|
+
**Incorrect:**
|
|
96
|
+
```ruby
|
|
97
|
+
safe_log(:info, "engine.start", { source: "data_drain", table: @table_name })
|
|
98
|
+
```
|
|
99
|
+
|
|
100
|
+
**Reason:** The `source=` field is injected automatically by the `exis_ray` middleware (it identifies the entrypoint: `http`, `sidekiq`, `task`, `system`). Emitting it manually duplicates it or overwrites it with an incorrect value.
|
|
101
|
+
|
|
102
|
+
**Alternative:** Never include `source` in the metadata. Use only `component` (automatic via `observability_name`) + `event` + business fields.
|
|
103
|
+
|
|
104
|
+
---
|
|
105
|
+
|
|
106
|
+
## 7. Forgetting `private_class_method` when using `extend Observability`
|
|
107
|
+
|
|
108
|
+
**Incorrect:**
|
|
109
|
+
```ruby
|
|
110
|
+
class GlueRunner
|
|
111
|
+
extend Observability
|
|
112
|
+
# safe_log stays public: anyone can call GlueRunner.safe_log(...)
|
|
113
|
+
end
|
|
114
|
+
```
|
|
115
|
+
|
|
116
|
+
**Reason:** `extend` exposes the module's methods as **public** class methods. That leaks an internal API and breaks encapsulation.
|
|
117
|
+
|
|
118
|
+
**Alternative:**
|
|
119
|
+
```ruby
|
|
120
|
+
class GlueRunner
|
|
121
|
+
extend Observability
|
|
122
|
+
private_class_method :safe_log, :exception_metadata, :observability_name
|
|
123
|
+
end
|
|
124
|
+
```
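
The visibility difference can be observed with `respond_to?`. The sketch below uses a hypothetical `SketchObservability` mixin whose single method is public, which is an assumption made for the illustration; the runner class names are also invented.

```ruby
# Hypothetical mixin with one public method, standing in for the real module.
module SketchObservability
  def safe_log(level, event)
    "#{level} event=#{event}"
  end
end

class LeakyRunner
  extend SketchObservability
end

class SealedRunner
  extend SketchObservability
  private_class_method :safe_log

  def self.run
    safe_log(:info, "runner.start") # still callable from inside the class
  end
end

puts LeakyRunner.respond_to?(:safe_log)  # true
puts SealedRunner.respond_to?(:safe_log) # false
puts SealedRunner.run
```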
|
|
125
|
+
|
|
126
|
+
---
|
|
127
|
+
|
|
128
|
+
## 8. Forgetting `include Observability` in instance-based classes
|
|
129
|
+
|
|
130
|
+
**Incorrect:**
|
|
131
|
+
```ruby
|
|
132
|
+
class Engine
|
|
133
|
+
# missing include
|
|
134
|
+
def call
|
|
135
|
+
safe_log(:info, "engine.start", {}) # NoMethodError
|
|
136
|
+
end
|
|
137
|
+
end
|
|
138
|
+
```
|
|
139
|
+
|
|
140
|
+
**Reason:** Without `include`, `safe_log` does not exist in the class. It fails at runtime on the first event.
|
|
141
|
+
|
|
142
|
+
**Alternative:**
|
|
143
|
+
```ruby
|
|
144
|
+
class Engine
|
|
145
|
+
include Observability
|
|
146
|
+
def call
|
|
147
|
+
safe_log(:info, "engine.start", {})
|
|
148
|
+
end
|
|
149
|
+
end
|
|
150
|
+
```
|
|
151
|
+
|
|
152
|
+
---
|
|
153
|
+
|
|
154
|
+
## 9. Adding infrastructure logic to `Observability`
|
|
155
|
+
|
|
156
|
+
**Incorrect:** Adding methods such as `current_memory_mb` to the `Observability` module that use backticks (`` `ps` ``) or `Process` to infer system metrics.
|
|
157
|
+
|
|
158
|
+
**Reason:** `Observability` is a **generic logging module**, reusable in other gems. Mixing infrastructure logic into it couples it to a specific runtime and breaks portability.
|
|
159
|
+
|
|
160
|
+
**Alternative:** Infrastructure metrics belong in another module (e.g. `Telemetry::Process`) or in the caller. `Observability` only formats and emits logs.
|
|
161
|
+
|
|
162
|
+
---
|
|
163
|
+
|
|
164
|
+
## 10. Using `Time.now` to measure durations
|
|
165
|
+
|
|
166
|
+
**Incorrect:**
|
|
167
|
+
```ruby
|
|
168
|
+
start = Time.now
|
|
169
|
+
do_work
|
|
170
|
+
duration = Time.now - start
|
|
171
|
+
```
|
|
172
|
+
|
|
173
|
+
**Reason:** `Time.now` is the wall clock: it shifts with NTP adjustments, manual clock changes, and leap-second handling, so a difference of two readings can be negative or jump. It is not suitable for measuring latency.
|
|
174
|
+
|
|
175
|
+
**Alternative:**
|
|
176
|
+
```ruby
|
|
177
|
+
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
|
|
178
|
+
do_work
|
|
179
|
+
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
|
|
180
|
+
```
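
The monotonic pattern is easy to wrap once and reuse. `measure` below is a hypothetical helper, not part of the gem; the rounding to 2 decimals is an assumption that mirrors the `_s` field convention described in the telemetry reference.

```ruby
# Hypothetical helper: returns the block's result and the elapsed seconds,
# measured on the monotonic clock.
def measure
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [result, elapsed.round(2)]
end

result, duration_s = measure { (1..1_000).sum }
puts "result=#{result} duration_s=#{duration_s}"
```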
|
|
181
|
+
|
|
182
|
+
---
|
|
183
|
+
|
|
184
|
+
## 11. Logging at DEBUG without a block
|
|
185
|
+
|
|
186
|
+
**Incorrect:**
|
|
187
|
+
```ruby
|
|
188
|
+
logger.debug("query=#{expensive_serialize(obj)}") # Always evaluated, even when DEBUG is off
|
|
189
|
+
```
|
|
190
|
+
|
|
191
|
+
**Reason:** Without a block, the string is always built, even when the DEBUG level is disabled in production. It is an invisible cost.
|
|
192
|
+
|
|
193
|
+
**Alternative:**
|
|
194
|
+
```ruby
|
|
195
|
+
logger.debug { "query=#{expensive_serialize(obj)}" }
|
|
196
|
+
```
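
The difference is observable with the standard library `Logger`. In this sketch `expensive_serialize` is a hypothetical lambda that counts its own calls, and the logger writes to an in-memory `StringIO` so nothing is printed by the logger itself.

```ruby
require "logger"
require "stringio"

logger = Logger.new(StringIO.new)
logger.level = Logger::INFO # DEBUG is disabled, as it usually is in production

calls = 0
expensive_serialize = lambda do
  calls += 1
  "serialized"
end

logger.debug("query=#{expensive_serialize.call}") # argument is built before debug runs
calls_after_string_form = calls

logger.debug { "query=#{expensive_serialize.call}" } # block is never run at INFO level
calls_after_block_form = calls

puts "string form calls=#{calls_after_string_form} block form calls=#{calls_after_block_form}"
```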
|
|
197
|
+
|
|
198
|
+
---
|
|
199
|
+
|
|
200
|
+
## 12. Assuming `Record.connection` can be closed manually
|
|
201
|
+
|
|
202
|
+
**Incorrect:**
|
|
203
|
+
```ruby
|
|
204
|
+
ArchivedX.where(...)
|
|
205
|
+
ArchivedX.connection.close # Breaks the next query on the same thread
|
|
206
|
+
```
|
|
207
|
+
|
|
208
|
+
**Reason:** `Record.connection` is thread-local and persistent, designed to amortize the cost of loading `httpfs` and credentials. Closing it forces a full reconnect on the next query and can leave `Thread.current` pointing at a dead connection (`Database` GC'd).
|
|
209
|
+
|
|
210
|
+
**Alternative:** Do not close it manually. It lives as long as the thread does.
|
|
211
|
+
|
|
212
|
+
---
|
|
213
|
+
|
|
214
|
+
## 13. Passing user input to `select_sql` or `where_clause`
|
|
215
|
+
|
|
216
|
+
**Incorrect:**
|
|
217
|
+
```ruby
|
|
218
|
+
DataDrain::Engine.new(
|
|
219
|
+
table_name: params[:table], # user input interpolated into SQL
|
|
220
|
+
where_clause: params[:filter],
|
|
221
|
+
...
|
|
222
|
+
).call
|
|
223
|
+
```
|
|
224
|
+
|
|
225
|
+
**Reason:** `select_sql` and `where_clause` are interpolated **directly into SQL** (they are not prepared statements). User input opens a SQL injection vector.
|
|
226
|
+
|
|
227
|
+
**Note:** `table_name` and `primary_key` are validated against the regex `\A[a-zA-Z_][a-zA-Z0-9_]*\z` in `Engine#initialize`. If the value does not match, `DataDrain::ConfigurationError` is raised. `select_sql` and `where_clause` are still treated as trusted (they are not validated).
|
|
228
|
+
|
|
229
|
+
**Alternative:** `table_name` and `primary_key` are now protected against trivial injection. `select_sql` and `where_clause` must come from application code (constants, configuration, jobs with fixed values).
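
The identifier check described in the note can be sketched as follows. `validate_identifier!` and the use of `ArgumentError` are assumptions made for this illustration; only the regex itself comes from the text above, and the gem is documented to raise its own `DataDrain::ConfigurationError`.

```ruby
IDENTIFIER_PATTERN = /\A[a-zA-Z_][a-zA-Z0-9_]*\z/

# Hypothetical validator; the gem raises its own error class instead of ArgumentError.
def validate_identifier!(name, value)
  return value if value.to_s.match?(IDENTIFIER_PATTERN)

  raise ArgumentError, "#{name} is not a valid SQL identifier: #{value.inspect}"
end

puts validate_identifier!(:table_name, "versions")

begin
  validate_identifier!(:table_name, "versions; DROP TABLE users")
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

A check like this only covers bare identifiers; free-form SQL fragments cannot be validated this way, which is why they must come from trusted code.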
|
|
230
|
+
|
|
231
|
+
---
|
|
232
|
+
|
|
233
|
+
## 14. Trusting that `GlueRunner` has a maximum timeout
|
|
234
|
+
|
|
235
|
+
**Incorrect:**
|
|
236
|
+
```ruby
|
|
237
|
+
DataDrain::GlueRunner.run_and_wait("job", args) # Assumes it returns within X minutes
|
|
238
|
+
```
|
|
239
|
+
|
|
240
|
+
**Reason:** The polling loop has no maximum timeout. If Glue stays stuck in `RUNNING` indefinitely, `run_and_wait` blocks forever.
|
|
241
|
+
|
|
242
|
+
**Alternative:** Wrap the call in `Timeout.timeout(N)` in the caller, or monitor the job externally (CloudWatch alarm). Better still, a future improvement to the gem would add `max_wait_seconds`.
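
A caller-side deadline might look like the sketch below. `with_deadline` is a hypothetical wrapper, and the block stands in for a blocking call such as `run_and_wait`; here it is simulated with `sleep` so the example runs without AWS.

```ruby
require "timeout"

# Hypothetical wrapper: the block stands in for a blocking call such as run_and_wait.
def with_deadline(max_wait_seconds)
  Timeout.timeout(max_wait_seconds) { yield }
rescue Timeout::Error
  :timed_out
end

outcome = with_deadline(0.1) { sleep 1; :finished } # simulated stuck job
quick   = with_deadline(1) { :finished }            # job that returns in time

puts outcome
puts quick
```

Note that a timeout of this kind only stops the local wait; the remote job keeps running and would still need to be stopped or monitored separately.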
|
|
@@ -0,0 +1,257 @@
|
|
|
1
|
+
# Detailed API
|
|
2
|
+
|
|
3
|
+
Full signatures, parameters, return values, and behaviors of each public class in DataDrain.
|
|
4
|
+
|
|
5
|
+
## `DataDrain` (module)
|
|
6
|
+
|
|
7
|
+
### `DataDrain.configure { |config| ... }`
|
|
8
|
+
Global configuration block. `config` is a singleton instance of `Configuration`.
|
|
9
|
+
|
|
10
|
+
### `DataDrain.configuration`
|
|
11
|
+
Returns the singleton `Configuration` (lazy init).
|
|
12
|
+
|
|
13
|
+
### `DataDrain.reset_configuration!`
|
|
14
|
+
Resets the configuration and `Storage.adapter`. Useful in tests.
|
|
15
|
+
|
|
16
|
+
---
|
|
17
|
+
|
|
18
|
+
## `DataDrain::Configuration`
|
|
19
|
+
|
|
20
|
+
Attributes (`attr_accessor`):
|
|
21
|
+
|
|
22
|
+
| Attribute | Type | Default | Description |
|
|
23
|
+
|----------|------|---------|-------------|
|
|
24
|
+
| `storage_mode` | Symbol | `:local` | `:local` or `:s3` |
|
|
25
|
+
| `aws_region` | String | `nil` | Required if `:s3` |
|
|
26
|
+
| `aws_access_key_id` | String | `nil` | Required if `:s3` |
|
|
27
|
+
| `aws_secret_access_key` | String | `nil` | Required if `:s3` |
|
|
28
|
+
| `db_host` | String | `"127.0.0.1"` | PostgreSQL host |
|
|
29
|
+
| `db_port` | Integer | `5432` | PostgreSQL port |
|
|
30
|
+
| `db_user` | String | `nil` | PostgreSQL user |
|
|
31
|
+
| `db_pass` | String | `nil` | PostgreSQL password |
|
|
32
|
+
| `db_name` | String | `nil` | Database name |
|
|
33
|
+
| `batch_size` | Integer | `5000` | Records per DELETE during the purge |
|
|
34
|
+
| `throttle_delay` | Float | `0.5` | Seconds to pause between batches |
|
|
35
|
+
| `idle_in_transaction_session_timeout` | Integer | `0` | Milliseconds. `0` = DISABLED. `nil` = do not set |
|
|
36
|
+
| `limit_ram` | String | `nil` | DuckDB memory limit (e.g. `"2GB"`) |
|
|
37
|
+
| `tmp_directory` | String | `nil` | DuckDB spill-to-disk directory |
|
|
38
|
+
| `logger` | Logger | `Logger.new($stdout)` | Logger |
|
|
39
|
+
|
|
40
|
+
### `#duckdb_connection_string`
|
|
41
|
+
Returns a URI: `postgresql://user:pass@host:port/db?options=-c%20idle_in_transaction_session_timeout%3D<val>`
|
|
42
|
+
|
|
43
|
+
**There are no automatic validations.** The caller must guarantee consistency (e.g. AWS credentials when `storage_mode = :s3`).
|
|
44
|
+
|
|
45
|
+
---
|
|
46
|
+
|
|
47
|
+
## `DataDrain::Engine`
|
|
48
|
+
|
|
49
|
+
### `#initialize(options)`
|
|
50
|
+
|
|
51
|
+
| Option | Type | Required | Description |
|
|
52
|
+
|--------|------|-----------|-------------|
|
|
53
|
+
| `:start_date` | Time/DateTime/Date | yes | Converted to `beginning_of_day` |
|
|
54
|
+
| `:end_date` | Time/DateTime/Date | yes | Converted to `next_day.beginning_of_day` (half-open `<`) |
|
|
55
|
+
| `:table_name` | String | yes | Source table in `public` |
|
|
56
|
+
| `:partition_keys` | Array<String, Symbol> | yes | Order = Hive hierarchy |
|
|
57
|
+
| `:bucket` | String | no | Local path or S3 bucket name |
|
|
58
|
+
| `:folder_name` | String | no | Default = `table_name` |
|
|
59
|
+
| `:select_sql` | String | no | Default = `"*"` |
|
|
60
|
+
| `:primary_key` | String | no | Default = `"id"`. Used in DELETE |
|
|
61
|
+
| `:where_clause` | String | no | Extra SQL (joined with `AND`) |
|
|
62
|
+
| `:skip_export` | Boolean | no | Default `false`. `true` skips the export |
|
|
63
|
+
|
|
64
|
+
Internally: creates `DuckDB::Database.open(":memory:")` and captures `@config`, `@logger`, `@adapter`.
|
|
65
|
+
|
|
66
|
+
### `#call → Boolean`
|
|
67
|
+
Runs the full flow: setup → count → [export] → verify → [purge].
|
|
68
|
+
|
|
69
|
+
- Returns `true` if the flow completed (including the `pg_count == 0` case).
|
|
70
|
+
- Returns `false` if verification fails or if reading the Parquet output raises an error.
|
|
71
|
+
- When it returns `false`, it does **NOT run the purge** (safety guarantee).
|
|
72
|
+
|
|
73
|
+
Emitted events: see [eventos-telemetria.md](eventos-telemetria.md).
|
|
74
|
+
|
|
75
|
+
### Notable private methods
|
|
76
|
+
|
|
77
|
+
- `#base_where_sql`: `created_at >= START AND created_at < END_BOUNDARY [AND where_clause]`. Half-open.
|
|
78
|
+
- `#setup_duckdb`: INSTALL/LOAD postgres, set max_memory/temp_directory, ATTACH pg_source READ_ONLY, then delegates `setup_duckdb` to the adapter.
|
|
79
|
+
- `#get_postgres_count → Integer`: via `postgres_query('pg_source', ...)`.
|
|
80
|
+
- `#export_to_parquet`: `COPY ... TO ... PARTITION_BY (...) COMPRESSION 'ZSTD' OVERWRITE_OR_IGNORE 1`.
|
|
81
|
+
- `#verify_integrity → Boolean`: `COUNT(*) read_parquet(...) == @pg_count`. Rescues `DuckDB::Error` → `false`.
|
|
82
|
+
- `#purge_from_postgres`: loops `DELETE WHERE pk IN (SELECT pk ... LIMIT batch_size)` until `cmd_tuples == 0`. Heartbeat every 100 batches. `sleep(throttle_delay)` if `> 0`. Closes the PG connection in `ensure`.
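
The batched purge loop can be sketched without a database. `purge_in_batches` is a hypothetical helper written for this illustration; the block stands in for one batched `DELETE` and returns the number of rows it removed, and the in-memory counter replaces the table.

```ruby
# Hypothetical loop: the block stands in for one batched DELETE and
# returns how many rows it deleted. The loop stops on the first empty batch.
def purge_in_batches(batch_size:, throttle_delay: 0, heartbeat_every: 100)
  batches = 0
  rows_deleted = 0

  loop do
    deleted = yield(batch_size)
    break if deleted.zero?

    batches += 1
    rows_deleted += deleted
    if (batches % heartbeat_every).zero?
      puts "batches_processed_count=#{batches} rows_deleted_count=#{rows_deleted}"
    end
    sleep(throttle_delay) if throttle_delay.positive?
  end

  { batches_processed_count: batches, rows_deleted_count: rows_deleted }
end

remaining = 12 # pretend 12 rows match the purge filter
summary = purge_in_batches(batch_size: 5) do |limit|
  deleted = [remaining, limit].min
  remaining -= deleted
  deleted
end

puts summary.inspect
```

Small batches with a pause between them keep each transaction short, at the cost of a longer total run.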
|
|
83
|
+
|
|
84
|
+
---
|
|
85
|
+
|
|
86
|
+
## `DataDrain::FileIngestor`
|
|
87
|
+
|
|
88
|
+
### `#initialize(options)`
|
|
89
|
+
|
|
90
|
+
| Option | Type | Required | Description |
|
|
91
|
+
|--------|------|-----------|-------------|
|
|
92
|
+
| `:source_path` | String | yes | Absolute path to the file |
|
|
93
|
+
| `:folder_name` | String | yes | Destination folder in the Data Lake |
|
|
94
|
+
| `:bucket` | String | no | Local path or S3 bucket |
|
|
95
|
+
| `:partition_keys` | Array | no | Default `[]` (no partitioning) |
|
|
96
|
+
| `:select_sql` | String | no | Default `"*"`. Useful for extracting derived columns (`EXTRACT(YEAR FROM ts) AS year`) |
|
|
97
|
+
| `:delete_after_upload` | Boolean | no | Default `true` |
|
|
98
|
+
|
|
99
|
+
### `#call → Boolean`
|
|
100
|
+
|
|
101
|
+
Flow:
|
|
102
|
+
1. Validates that the file exists. If not, logs `file_ingestor.file_not_found` and returns `false`.
|
|
103
|
+
2. Applies `limit_ram` and `tmp_directory`, then delegates `setup_duckdb` to the adapter.
|
|
104
|
+
3. Picks the reader based on the extension (`.csv`, `.json`, `.parquet`). Other extensions → `raise DataDrain::Error`.
|
|
105
|
+
4. Safety count. If `0`, cleans up and returns `true`.
|
|
106
|
+
5. `COPY ... TO ... [PARTITION_BY (...)] COMPRESSION 'ZSTD' OVERWRITE_OR_IGNORE 1`.
|
|
107
|
+
6. `cleanup_local_file` if `delete_after_upload`.
|
|
108
|
+
7. Returns `true`.
|
|
109
|
+
|
|
110
|
+
`rescue DuckDB::Error` → logs `file_ingestor.duckdb_error` and returns `false`. `ensure` closes the DuckDB connection.
|
|
111
|
+
|
|
112
|
+
### Supported formats
|
|
113
|
+
|
|
114
|
+
- `.csv` → `read_csv_auto`
|
|
115
|
+
- `.json` → `read_json_auto`
|
|
116
|
+
- `.parquet` → `read_parquet`
|
|
117
|
+
- Others → `raise DataDrain::Error, "Formato de archivo no soportado para ingestión: ..."`
|
|
118
|
+
|
|
119
|
+
---
|
|
120
|
+
|
|
121
|
+
## `DataDrain::Record`
|
|
122
|
+
|
|
123
|
+
Abstract class. Subclass it for each archived table.
|
|
124
|
+
|
|
125
|
+
```ruby
|
|
126
|
+
class ArchivedX < DataDrain::Record
|
|
127
|
+
self.bucket = "..."
|
|
128
|
+
self.folder_name = "..."
|
|
129
|
+
self.partition_keys = [:isp_id, :year, :month] # ORDER IS CRITICAL
|
|
130
|
+
|
|
131
|
+
attribute :id, :string
|
|
132
|
+
attribute :created_at, :datetime
|
|
133
|
+
attribute :payload, :json # uses DataDrain::Types::JsonType
|
|
134
|
+
end
|
|
135
|
+
```
|
|
136
|
+
|
|
137
|
+
Includes `ActiveModel::Model` + `ActiveModel::Attributes`.
|
|
138
|
+
|
|
139
|
+
### `class_attribute`s
|
|
140
|
+
- `bucket` — String
|
|
141
|
+
- `folder_name` — String
|
|
142
|
+
- `partition_keys` — Array<Symbol>
|
|
143
|
+
|
|
144
|
+
### `.connection → DuckDB::Connection`
|
|
145
|
+
Persistent per-thread connection (cached in `Thread.current[:data_drain_duckdb] = { db:, conn: }`). The hash anchors the `Database` to prevent GC. The connection is never closed explicitly.
|
|
146
|
+
|
|
147
|
+
### `.where(limit: 50, **partitions) → Array<self>`
|
|
148
|
+
Builds the Hive path in `partition_keys` order (not in kwargs order). SQL: `SELECT ... FROM read_parquet(path) ORDER BY created_at DESC LIMIT n`. If the Parquet data does not exist, returns `[]` and logs `record.parquet_not_found` at WARN.
|
|
149
|
+
|
|
150
|
+
### `.find(id, **partitions) → self | nil`
|
|
151
|
+
SQL: `WHERE id = '<safe_id>' LIMIT 1`. Sanitizes `id` with `gsub("'", "''")` (single-quote escaping). Returns the first row or `nil`.
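
The quote-doubling step can be tried in isolation. `escape_sql_literal` is a hypothetical name for the `gsub` call described above; the sketch only shows the string transformation and does not run any query.

```ruby
# Hypothetical helper wrapping the gsub described above:
# doubles every single quote so the value stays inside the SQL literal.
def escape_sql_literal(value)
  value.to_s.gsub("'", "''")
end

safe_id = escape_sql_literal("abc' OR '1'='1")
sql = "WHERE id = '#{safe_id}' LIMIT 1"

puts sql # WHERE id = 'abc'' OR ''1''=''1' LIMIT 1
```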
|
|
152
|
+
|
|
153
|
+
### `.destroy_all(**partitions) → Integer`
|
|
154
|
+
Delegates to `Storage.adapter#destroy_partitions`. Logs `record.destroy_all`. If a key from `partition_keys` is not specified, a wildcard is used (`*` in Local, regex `[^/]+` in S3). Returns the number of partitions/objects deleted.
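
How unspecified keys turn into wildcards can be sketched as follows. `local_glob` and `s3_key_pattern` are hypothetical helpers for illustration; they only assume the `key=*` and `key=[^/]+` conventions mentioned above and do not touch the filesystem or S3.

```ruby
# Hypothetical helpers (not the gem's API): unspecified partition keys become wildcards.
def local_glob(partition_keys, partitions)
  partition_keys.map { |key| "#{key}=#{partitions.fetch(key, '*')}" }.join("/")
end

def s3_key_pattern(partition_keys, partitions)
  segments = partition_keys.map do |key|
    partitions.key?(key) ? Regexp.escape("#{key}=#{partitions[key]}") : "#{key}=[^/]+"
  end
  Regexp.new(segments.join("/"))
end

keys = %i[isp_id year month]
glob = local_glob(keys, { isp_id: 42, year: 2025 })
pattern = s3_key_pattern(keys, { isp_id: 42, year: 2025 })

puts glob # isp_id=42/year=2025/month=*
puts pattern.match?("isp_id=42/year=2025/month=10/data.parquet") # true
puts pattern.match?("isp_id=7/year=2025/month=10/data.parquet")  # false
```

Because omitted keys widen the match, calling a delete with no partitions at all would match every partition under the folder.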
|
|
155
|
+
|
|
156
|
+
### `#inspect → String`
|
|
157
|
+
Format: `#<Class attr1: val1, attr2: val2, ...>`.
|
|
158
|
+
|
|
159
|
+
### Private methods
|
|
160
|
+
- `.build_query_path(partitions)`: iterates over `partition_keys` (not kwargs) and builds `key=val/key=val/...`. Accepts keys as Symbol or String in `partitions`.
|
|
161
|
+
- `.execute_and_instantiate(sql, columns)`: runs the query, rescues `DuckDB::Error` → `[]` with a WARN, and maps rows to instances.
|
|
162
|
+
|
|
163
|
+
---
|
|
164
|
+
|
|
165
|
+
## `DataDrain::GlueRunner`
|
|
166
|
+
|
|
167
|
+
### `.run_and_wait(job_name, arguments = {}, polling_interval: 30) → true`
|
|
168
|
+
|
|
169
|
+
| Parameter | Type | Description |
|
|
170
|
+
|-----------|------|-------------|
|
|
171
|
+
| `job_name` | String | Job name in the AWS console |
|
|
172
|
+
| `arguments` | Hash | Args with a `--` prefix (e.g. `"--start_date" => "..."`) |
|
|
173
|
+
| `polling_interval` | Integer | Seconds between checks. Default `30` |
|
|
174
|
+
|
|
175
|
+
Flow:
|
|
176
|
+
1. `Aws::Glue::Client.new(region: config.aws_region)`
|
|
177
|
+
2. `start_job_run` → captures `run_id`
|
|
178
|
+
3. Loop: `get_job_run`, evaluates `job_run_state`:
|
|
179
|
+
- `SUCCEEDED` → logs `glue_runner.complete`, returns `true`
|
|
180
|
+
- `FAILED|STOPPED|TIMEOUT` → logs `glue_runner.failed` (includes `error_message` truncated to 200 chars), `raise RuntimeError`
|
|
181
|
+
- Anything else → logs `glue_runner.polling`, `sleep polling_interval`
|
|
182
|
+
|
|
183
|
+
There is no maximum timeout. If Glue stays stuck in `RUNNING`, this blocks indefinitely.
|
|
184
|
+
|
|
185
|
+
---
|
|
186
|
+
|
|
187
|
+
## `DataDrain::Storage`
|
|
188
|
+
|
|
189
|
+
### `.adapter → Storage::Base`
|
|
190
|
+
Memoized. Returns `Local.new` or `S3.new` depending on `config.storage_mode`. Raises `InvalidAdapterError` if the mode is unknown.
|
|
191
|
+
|
|
192
|
+
### `.reset_adapter!`
|
|
193
|
+
Clears the memoization. **Mandatory** if `storage_mode` changes at runtime.
|
|
194
|
+
|
|
195
|
+
### `Storage::Base` (interface)
|
|
196
|
+
Abstract methods:
|
|
197
|
+
- `#setup_duckdb(connection)`: `raise NotImplementedError`
|
|
198
|
+
- `#prepare_export_path(bucket, folder_name)`: no-op by default
|
|
199
|
+
- `#build_path(bucket, folder_name, partition_path) → String`: `raise NotImplementedError`
|
|
200
|
+
- `#destroy_partitions(bucket, folder_name, partition_keys, partitions) → Integer`: `raise NotImplementedError`
|
|
201
|
+
|
|
202
|
+
### `Storage::Local`
|
|
203
|
+
- `#setup_duckdb`: no-op (native to DuckDB)
|
|
204
|
+
- `#prepare_export_path`: `FileUtils.mkdir_p`
|
|
205
|
+
- `#build_path`: `"<bucket>/<folder>/<partition_path>/**/*.parquet"`
|
|
206
|
+
- `#destroy_partitions`: builds a glob with `key=*` for missing values, runs `Dir.glob`, and calls `FileUtils.rm_rf` on each match
|
|
207
|
+
|
|
208
|
+
### `Storage::S3`
|
|
209
|
+
- `#setup_duckdb`: `INSTALL httpfs; LOAD httpfs;` + `SET s3_region/s3_access_key_id/s3_secret_access_key`
|
|
210
|
+
- `#prepare_export_path`: no-op (S3 does not require mkdir)
|
|
211
|
+
- `#build_path`: `"s3://<bucket>/<folder>/<partition_path>/**/*.parquet"`
|
|
212
|
+
- `#destroy_partitions`: `Aws::S3::Client.list_objects_v2` with an optimized prefix (the first key, if it is not nil), filters with a regex (`key=[^/]+` for nil values), then `delete_objects` in batches of 1000
|
|
213
|
+
|
|
214
|
+
---
|
|
215
|
+
|
|
216
|
+
## `DataDrain::Observability` (mixin)
|
|
217
|
+
|
|
218
|
+
Designed for `include` (instance methods, requires `@logger`) or `extend` (class methods, requires a class-level `@logger`).
|
|
219
|
+
|
|
220
|
+
### `#safe_log(level, event, metadata = {})` (private)
|
|
221
|
+
- If `@logger` is nil, no-op.
|
|
222
|
+
- Builds `fields = { component: observability_name, event: event }.merge(metadata)`.
|
|
223
|
+
- Filters values whose keys match `:password|:token|:secret|:api_key|:auth` → `[FILTERED]`.
|
|
224
|
+
- Emits `@logger.send(level) { "k1=v1 k2=v2 ..." }`.
|
|
225
|
+
- Silent `rescue StandardError` (resilience).
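
The field-building and filtering steps above can be sketched as a pure function. `format_event` and `SENSITIVE_KEY` are hypothetical names; matching secrets by case-insensitive substring on the key is an assumption of this sketch, and the logger call and error handling are left out.

```ruby
SENSITIVE_KEY = /password|token|secret|api_key|auth/i

# Hypothetical formatter: builds the "k1=v1 k2=v2" line with secrets masked.
def format_event(component, event, metadata = {})
  fields = { component: component, event: event }.merge(metadata)
  fields.map do |key, value|
    shown = key.to_s.match?(SENSITIVE_KEY) ? "[FILTERED]" : value
    "#{key}=#{shown}"
  end.join(" ")
end

line = format_event("data_drain", "engine.start", { table: "versions", db_password: "hunter2" })
puts line
```

Because `component` and `event` are placed in the hash before the merge, they stay first in the output line.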
|
|
226
|
+
|
|
227
|
+
### `#exception_metadata(error)` (private)
|
|
228
|
+
Returns `{ error_class: error.class.name, error_message: error.message.gsub('"', "'")[0, 200] }`.
|
|
229
|
+
|
|
230
|
+
### `#observability_name` (private)
|
|
231
|
+
Extracts the first namespace of the class name and converts it to snake_case. E.g. `DataDrain::Engine` → `data_drain`.
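
One way to derive such a name without ActiveSupport is sketched below. `observability_name_for` is a hypothetical function, and the two regex substitutions are an assumption about how the snake_case conversion could be done, not the gem's code.

```ruby
# Hypothetical sketch: take the first namespace and convert it to snake_case.
def observability_name_for(class_name)
  class_name.split("::").first
            .gsub(/([A-Z]+)([A-Z][a-z])/, '\1_\2')
            .gsub(/([a-z\d])([A-Z])/, '\1_\2')
            .downcase
end

puts observability_name_for("DataDrain::Engine") # data_drain
```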
|
|
232
|
+
|
|
233
|
+
**Important:** when using `extend`, mark the methods as `private_class_method :safe_log, :exception_metadata, :observability_name`.
|
|
234
|
+
|
|
235
|
+
---
|
|
236
|
+
|
|
237
|
+
## `DataDrain::Types::JsonType`
|
|
238
|
+
|
|
239
|
+
`ActiveModel::Type::Value` registrado como `:json`. `#cast`:
|
|
240
|
+
- If the value is a Hash, Array, or nil → returned as is.
|
|
241
|
+
- If it is a String → `JSON.parse(value)`.
|
|
242
|
+
- If parsing fails → returns the original value (does not raise).
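
The casting rules above can be sketched as a standalone function. `cast_json` is a hypothetical name; returning any non-String value unchanged (not only Hash, Array, and nil) is an assumption of this sketch.

```ruby
require "json"

# Hypothetical sketch of the cast rules: parse Strings, pass everything else through,
# and fall back to the original value when parsing fails.
def cast_json(value)
  return value unless value.is_a?(String)

  JSON.parse(value)
rescue JSON::ParserError
  value
end

puts cast_json('{"a": 1}').inspect
puts cast_json("not json").inspect
puts cast_json(nil).inspect
```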
|
|
243
|
+
|
|
244
|
+
---
|
|
245
|
+
|
|
246
|
+
## `DataDrain::Error` hierarchy
|
|
247
|
+
|
|
248
|
+
```
|
|
249
|
+
StandardError
|
|
250
|
+
└── DataDrain::Error
|
|
251
|
+
├── DataDrain::ConfigurationError
|
|
252
|
+
├── DataDrain::IntegrityError
|
|
253
|
+
├── DataDrain::StorageError
|
|
254
|
+
└── DataDrain::Storage::InvalidAdapterError
|
|
255
|
+
```
|
|
256
|
+
|
|
257
|
+
`DuckDB::Error` and `Aws::S3::Errors::*` are currently NOT wrapped in `StorageError`; they are rescued at specific points or propagated.
|
|
@@ -0,0 +1,154 @@
|
|
|
1
|
+
# Events and Telemetry
|
|
2
|
+
|
|
3
|
+
Complete catalog of KV events emitted by DataDrain. Format: Wispro-Observability-Spec v1.
|
|
4
|
+
|
|
5
|
+
## Conventions
|
|
6
|
+
|
|
7
|
+
- Format: `component=data_drain event=<class>.<event_name> [fields]`
|
|
8
|
+
- `component=` and `event=` always come first
|
|
9
|
+
- Durations: `_s` suffix, Float value rounded to 2 decimals
|
|
10
|
+
- Counters: `_count` suffix, Integer value
|
|
11
|
+
- No units in values (NOT `"0.5s"`, YES `0.5`)
|
|
12
|
+
- snake_case for all keys
|
|
13
|
+
- `source=` is injected automatically by `exis_ray`; DataDrain does NOT emit it
|
|
14
|
+
- Secrets (`password|token|secret|api_key|auth`) → `[FILTERED]`
|
|
15
|
+
- Logger failures never break the main flow
|
|
16
|
+
|
|
17
|
+
## Engine
|
|
18
|
+
|
|
19
|
+
### `engine.start`
|
|
20
|
+
**Level:** INFO. Emitted at the start of `#call`.
|
|
21
|
+
**Fields:** `table`, `start_date`, `end_date`.
|
|
22
|
+
|
|
23
|
+
### `engine.skip_empty`
|
|
24
|
+
**Level:** INFO. Emitted when `pg_count == 0`.
|
|
25
|
+
**Fields:** `table`, `duration_s`, `db_query_duration_s`.
|
|
26
|
+
|
|
27
|
+
### `engine.skip_export`
|
|
28
|
+
**Level:** INFO. Emitted when `skip_export: true`.
|
|
29
|
+
**Fields:** `table`.
|
|
30
|
+
|
|
31
|
+
### `engine.export_start`
|
|
32
|
+
**Level:** INFO. Emitted before the `COPY ... TO`.
|
|
33
|
+
**Fields:** `table`, `count`.
|
|
34
|
+
|
|
35
|
+
### `engine.integrity_check`
|
|
36
|
+
**Level:** INFO. Emitted after the Parquet count (whenever it can be read).
|
|
37
|
+
**Fields:** `table`, `pg_count`, `parquet_count`.
|
|
38
|
+
|
|
39
|
+
### `engine.parquet_read_error`
|
|
40
|
+
**Level:** ERROR. Emitted if `read_parquet` raises `DuckDB::Error` during verification.
|
|
41
|
+
**Fields:** `table`, `error_class`, `error_message` (truncated to 200 chars).
|
|
42
|
+
**Consequence:** `verify_integrity` returns `false`; the purge is aborted.
|
|
43
|
+
|
|
44
|
+
### `engine.purge_start`
|
|
45
|
+
**Level:** INFO. Emitted before the first DELETE.
|
|
46
|
+
**Fields:** `table`, `batch_size`.
|
|
47
|
+
|
|
48
|
+
### `engine.purge_heartbeat`
|
|
49
|
+
**Level:** INFO. Emitted every 100 batches during the purge.
|
|
50
|
+
**Fields:** `table`, `batches_processed_count`, `rows_deleted_count`.
|
|
51
|
+
|
|
52
|
+
### `engine.complete`
|
|
53
|
+
**Level:** INFO. Emitted at the successful end of the flow.
|
|
54
|
+
**Fields:** `table`, `duration_s`, `db_query_duration_s`, `export_duration_s`, `integrity_duration_s`, `purge_duration_s`, `count`.
|
|
55
|
+
|
|
56
|
+
### `engine.integrity_error`
|
|
57
|
+
**Level:** ERROR. Emitted if `pg_count != parquet_count`.
|
|
58
|
+
**Fields:** `table`, `duration_s`, `count`.
|
|
59
|
+
**Consequence:** Returns `false`; the purge is aborted.
|
|
60
|
+
|
|
61
|
+
---
|
|
62
|
+
|
|
63
|
+
## FileIngestor
|
|
64
|
+
|
|
65
|
+
### `file_ingestor.start`
|
|
66
|
+
**Level:** INFO. Emitted at the start of `#call`.
|
|
67
|
+
**Fields:** `source_path`.
|
|
68
|
+
|
|
69
|
+
### `file_ingestor.file_not_found`
|
|
70
|
+
**Level:** ERROR. Emitted if `File.exist?(source_path) == false`.
|
|
71
|
+
**Fields:** `source_path`.
|
|
72
|
+
|
|
73
|
+
### `file_ingestor.count`
|
|
74
|
+
**Level:** INFO. Emitted after counting the source file.
|
|
75
|
+
**Fields:** `source_path`, `count`, `source_query_duration_s`.
|
|
76
|
+
|
|
77
|
+
### `file_ingestor.skip_empty`
|
|
78
|
+
**Level:** INFO. Emitted if the count is `0`.
|
|
79
|
+
**Fields:** `source_path`, `duration_s`.
|
|
80
|
+
|
|
81
|
+
### `file_ingestor.export_start`
|
|
82
|
+
**Level:** INFO. Emitted before the `COPY ... TO`.
|
|
83
|
+
**Fields:** `dest_path`.
|
|
84
|
+
|
|
85
|
+
### `file_ingestor.complete`
|
|
86
|
+
**Level:** INFO. Emitted at the successful end.
|
|
87
|
+
**Fields:** `source_path`, `duration_s`, `source_query_duration_s`, `export_duration_s`, `count`.
|
|
88
|
+
|
|
89
|
+
### `file_ingestor.duckdb_error`
|
|
90
|
+
**Level:** ERROR. Emitted if a `DuckDB::Error` occurs during the process.
|
|
91
|
+
**Fields:** `source_path`, `error_class`, `error_message`, `duration_s`.
|
|
92
|
+
|
|
93
|
+
### `file_ingestor.cleanup`
|
|
94
|
+
**Level:** INFO. Emitted after deleting the local file (if `delete_after_upload`).
|
|
95
|
+
**Fields:** `source_path`.
|
|
96
|
+
|
|
97
|
+
---
|
|
98
|
+
|
|
99
|
+
## Record
|
|
100
|
+
|
|
101
|
+
### `record.destroy_all`
|
|
102
|
+
**Level:** INFO. Emitted at the start of `.destroy_all`.
|
|
103
|
+
**Fields:** `folder`, `partitions` (inspect of the hash).
|
|
104
|
+
|
|
105
|
+
### `record.parquet_not_found`
|
|
106
|
+
**Level:** WARN. Emitted if `read_parquet` raises `DuckDB::Error` in `where`/`find` queries.
|
|
107
|
+
**Fields:** `error_class`, `error_message`.
|
|
108
|
+
**Consequence:** Returns `[]` (does not raise).
|
|
109
|
+
|
|
110
|
+
---
|
|
111
|
+
|
|
112
|
+
## GlueRunner
|
|
113
|
+
|
|
114
|
+
### `glue_runner.start`
|
|
115
|
+
**Level:** INFO. Emitted before `start_job_run`.
|
|
116
|
+
**Fields:** `job`.
|
|
117
|
+
|
|
118
|
+
### `glue_runner.polling`
|
|
119
|
+
**Level:** INFO. Emitted on every status check while the Job has not finished.
|
|
120
|
+
**Fields:** `job`, `run_id`, `status`, `next_check_in_s`.
|
|
121
|
+
|
|
122
|
+
### `glue_runner.complete`
|
|
123
|
+
**Level:** INFO. Emitted when the state is `SUCCEEDED`.
|
|
124
|
+
**Fields:** `job`, `run_id`, `duration_s`.
|
|
125
|
+
|
|
126
|
+
### `glue_runner.failed`
|
|
127
|
+
**Level:** ERROR. Emitted when the state is `FAILED|STOPPED|TIMEOUT`.
|
|
128
|
+
**Fields:** `job`, `run_id`, `status`, `duration_s`, `error_message` (if Glue provides it, truncated to 200 chars).
|
|
129
|
+
**Consequence:** `raise RuntimeError`.
|
|
130
|
+
|
|
131
|
+
---
|
|
132
|
+
|
|
133
|
+
## Real examples
|
|
134
|
+
|
|
135
|
+
```
|
|
136
|
+
component=data_drain event=engine.start table=versions start_date=2025-10-01 end_date=2025-11-01
|
|
137
|
+
component=data_drain event=engine.export_start table=versions count=1500000
|
|
138
|
+
component=data_drain event=engine.integrity_check table=versions pg_count=1500000 parquet_count=1500000
|
|
139
|
+
component=data_drain event=engine.purge_heartbeat table=versions batches_processed_count=100 rows_deleted_count=500000
|
|
140
|
+
component=data_drain event=engine.complete table=versions duration_s=185.4 db_query_duration_s=2.1 export_duration_s=42.7 integrity_duration_s=18.3 purge_duration_s=122.3 count=1500000
|
|
141
|
+
|
|
142
|
+
component=data_drain event=file_ingestor.complete source_path=/tmp/netflow.csv duration_s=12.4 source_query_duration_s=0.8 export_duration_s=11.2 count=4500000
|
|
143
|
+
|
|
144
|
+
component=data_drain event=glue_runner.polling job=my-export-job run_id=jr_abc123 status=RUNNING next_check_in_s=30
|
|
145
|
+
component=data_drain event=glue_runner.failed job=my-export-job run_id=jr_abc123 status=FAILED duration_s=301.0 error_message='Out of memory in executor 4'
|
|
146
|
+
```
|
|
147
|
+
|
|
148
|
+
## How to add a new event
|
|
149
|
+
|
|
150
|
+
1. In the class, ensure `include Observability` (instance) or `extend Observability` + `private_class_method :safe_log, :exception_metadata, :observability_name` (class).
|
|
151
|
+
2. Ensure `@logger = config.logger` (instance: in `initialize`; class: before the first `safe_log`).
|
|
152
|
+
3. Call `safe_log(:level, "class.event_name", { field1: val1, field2: val2 })`.
|
|
153
|
+
4. Validate: snake_case keys, `_s` Float durations, `_count` Integer counters, no units in values, no `source=`.
|
|
154
|
+
5. For errors: merge `exception_metadata(e)` into the fields hash.
|
metadata
CHANGED
|
@@ -1,14 +1,14 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: data_drain
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.
|
|
4
|
+
version: 0.2.0
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Gabriel
|
|
8
8
|
autorequire:
|
|
9
9
|
bindir: exe
|
|
10
10
|
cert_chain: []
|
|
11
|
-
date: 2026-
|
|
11
|
+
date: 2026-04-14 00:00:00.000000000 Z
|
|
12
12
|
dependencies:
|
|
13
13
|
- !ruby/object:Gem::Dependency
|
|
14
14
|
name: activemodel
|
|
@@ -110,8 +110,13 @@ files:
|
|
|
110
110
|
- lib/data_drain/storage/local.rb
|
|
111
111
|
- lib/data_drain/storage/s3.rb
|
|
112
112
|
- lib/data_drain/types/json_type.rb
|
|
113
|
+
- lib/data_drain/validations.rb
|
|
113
114
|
- lib/data_drain/version.rb
|
|
114
115
|
- sig/data_drain.rbs
|
|
116
|
+
- skill/SKILL.md
|
|
117
|
+
- skill/references/antipatrones.md
|
|
118
|
+
- skill/references/api-detallada.md
|
|
119
|
+
- skill/references/eventos-telemetria.md
|
|
115
120
|
homepage: https://github.com/gedera/data_drain
|
|
116
121
|
licenses: []
|
|
117
122
|
metadata: {}
|