data_drain 0.5.1 → 0.6.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +29 -0
- data/README.md +22 -2
- data/data_drain.gemspec +1 -1
- data/lib/data_drain/engine.rb +40 -4
- data/lib/data_drain/record.rb +4 -1
- data/lib/data_drain/version.rb +1 -1
- data/skill/SKILL.md +14 -3
- metadata +3 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 918cc35413b6b09496ce15c782b6334dea8d515d9972bcf6a00a1e052140db67
+  data.tar.gz: 62a8278a7d7d4ba064b5bc4c33b63b846745d59c58a2cb77806dc8dc3ba677a5
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c9fecbd5cfdf411b6032ca39bed25e4a12a518d246a3591d84c524e4d2950b5045c0dcb20a1de471600a3b37145d3f6c05b4f5a14c86103a1109fd25f05ba494
+  data.tar.gz: 7c1c6c951e880e2d512599aa90fea56f80fffca0315ab4fba7fc179c4e305c5a27e5ef8af0608459650598c051b945d56790b38e279320ee49db1e1897dd077b
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,34 @@
 ## [Unreleased]
 
+## [0.6.0] - 2026-04-16
+
+### Features
+
+- New `purge_where_clause` option in `Engine#initialize`. It lets you specify an independent SQL condition for the DELETE, separate from `where_clause` (which applies to export/verify). Use case: archive a subset (`isp_id IS NOT NULL`) but purge the superset (the whole range). Values: `nil` = no purge, `""` = purge the whole range, `"x"` = range AND x. Backwards compatible via `fetch(:purge_where_clause, @where_clause)`. Fixes #3.
+
+### Refactor
+
+- Extracted the `date_range_sql` helper in Engine to remove duplication between `base_where_sql` and `purge_where_sql`.
+
+### YARD
+
+- Updated documentation in `Engine#initialize` for the three `purge_where_clause` cases.
+- `Engine#build_delete_sql` now documents a `String|nil` return.
+
+### Telemetry
+
+- New `engine.purge_skipped` event when there is no purge clause (`delete_sql.nil?`).
+
+### Tests
+
+- 5 new tests for `purge_where_clause`: backwards compatible, empty string purges all, integrity uses base_where_sql, independent of where_clause, and the primary use case (archive subset / purge superset).
+
+## [0.5.2] - 2026-04-16
+
+### Fixes
+
+- `Record#where()` now uses wildcards (`key=*`) for unspecified partition keys, instead of empty values (`key=`). Consistent with `destroy_partitions`. Fixes #1.
+
 ## [0.5.1] - 2026-04-15
 
 ### Docs
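To make the archive-subset / purge-superset use case from the 0.6.0 entry concrete, here is a minimal sketch over in-memory rows. It does not use DataDrain at all: the `Row` struct, the sample data, and the lambdas standing in for `where_clause` and `purge_where_clause` are assumptions made purely for illustration.

```ruby
require "date"

# Hypothetical row shape; the real table and its columns are not shown in this diff.
Row = Struct.new(:id, :isp_id, :created_at)

rows = [
  Row.new(1, 10,  Date.new(2025, 10, 3)),
  Row.new(2, nil, Date.new(2025, 10, 9)),  # "trash" row: no isp_id
  Row.new(3, 11,  Date.new(2025, 10, 20)),
  Row.new(4, 12,  Date.new(2025, 11, 1))   # outside the range
]

start_date = Date.new(2025, 10, 1)
end_date   = Date.new(2025, 11, 1)          # exclusive upper bound

in_range      = ->(r) { r.created_at >= start_date && r.created_at < end_date }
archive_match = ->(r) { in_range.(r) && !r.isp_id.nil? } # stands in for where_clause
purge_match   = ->(r) { in_range.(r) }                   # stands in for purge_where_clause: ""

archived_ids = rows.select(&archive_match).map(&:id)
purged_ids   = rows.select(&purge_match).map(&:id)
purged_without_archive = purged_ids - archived_ids

puts "archived: #{archived_ids.inspect}"
puts "purged: #{purged_ids.inspect}"
puts "purged without archive: #{purged_without_archive.inspect}"
```

Row 2 is the case the changelog warns about implicitly: it falls inside the purge condition but outside the archive condition, so in this model it is deleted without ever being exported.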
data/README.md
CHANGED
@@ -1,6 +1,6 @@
 # DataDrain
 
-[](https://github.com/sequre/data_drain/actions/workflows/main.yml)
 
 Ruby micro-framework to extract, archive and purge historical data from PostgreSQL into a Data Lake (S3 or local disk) in Parquet format, using in-memory DuckDB.
 
@@ -18,7 +18,7 @@ Ruby micro-framework to extract, archive and purge historical data from Postgr
 
 ```ruby
 # Gemfile
-gem 'data_drain', git: 'https://github.com/
+gem 'data_drain', git: 'https://github.com/sequre/data_drain.git', branch: 'main'
 ```
 
 ```bash
@@ -104,6 +104,26 @@ DataDrain::Engine.new(
 ).call
 ```
 
+### Purge subset vs archive superset
+
+Common case: archive valid rows (`isp_id IS NOT NULL`) but delete the superset (valid + trash).
+
+```ruby
+# Archives only isp_id NOT NULL, verifies integrity only over those rows,
+# but purges the WHOLE month (NULL + NOT NULL) with batching/throttling/vacuum
+DataDrain::Engine.new(
+  bucket: 'my-bucket-store',
+  start_date: 6.months.ago.beginning_of_month,
+  end_date: 6.months.ago.end_of_month,
+  table_name: 'versions',
+  partition_keys: %w[year month],
+  where_clause: 'isp_id IS NOT NULL', # filters what gets archived
+  purge_where_clause: '' # purges the WHOLE month (empty = no additional filter)
+).call
+```
+
+**Result:** Export/verify count and compare only `isp_id NOT NULL`. Purge deletes the full month with the batching, throttling and vacuum of `purge_loop`.
+
 ### Orchestration with AWS Glue (1TB+ tables)
 
 ```ruby
data/data_drain.gemspec
CHANGED
@@ -11,7 +11,7 @@ Gem::Specification.new do |spec|
   spec.summary = "Micro-framework to drain data from PostgreSQL to Parquet via DuckDB."
   spec.description = "Extracts transactional data, archives it in a Data Lake (S3/Local) " \
                      "in Parquet format using Hive Partitioning, and safely purges the source."
-  spec.homepage = "https://github.com/
+  spec.homepage = "https://github.com/sequre/data_drain"
   spec.required_ruby_version = ">= 3.2"
 
   spec.files = Dir.chdir(__dir__) do
data/lib/data_drain/engine.rb
CHANGED
@@ -21,7 +21,16 @@ module DataDrain
     # @option options [String] :select_sql (Optional) Custom SELECT statement.
     # @option options [Array<String, Symbol>] :partition_keys Columns to partition by.
     # @option options [String] :primary_key (Optional) Primary key used for deletion. Defaults to 'id'.
-    # @option options [String] :where_clause (Optional) Extra SQL condition
+    # @option options [String] :where_clause (Optional) Extra SQL condition
+    #   that filters export, count and integrity check. Defines "what gets archived".
+    # @option options [String] :purge_where_clause (Optional) SQL condition
+    #   for the DELETE. If omitted, :where_clause is used (backwards compatible).
+    #   Pass an explicit nil to disable purging. Pass '' (empty) to purge
+    #   the whole date range with no additional filter (useful to archive a subset
+    #   and delete the superset).
+    #   May be broader than :where_clause; rows that match
+    #   :purge_where_clause but not :where_clause are deleted without being archived or
+    #   verified. Useful for cleaning up orphans/trash that should not be backed up.
     # @option options [Boolean] :skip_export (Optional) If true, does not export
     #   to Parquet; only validates and purges (for use with GlueRunner).
     def initialize(options)
@@ -38,6 +47,7 @@ module DataDrain
       @primary_key = options.fetch(:primary_key, "id")
       Validations.validate_identifier!(:primary_key, @primary_key)
       @where_clause = options[:where_clause]
+      @purge_where_clause = options.fetch(:purge_where_clause, @where_clause)
       @bucket = options[:bucket]
       @skip_export = options.fetch(:skip_export, false)
 
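The backwards-compatible default relies on how `Hash#fetch` differs from `Hash#[]`: `fetch` only uses its fallback when the key is absent, so an explicit `nil` survives. The sketch below isolates that behavior; `resolve_clauses` is a hypothetical helper that mirrors the two assignments in the hunk above and is not part of the gem.

```ruby
# Hypothetical helper mirroring the two assignments in Engine#initialize.
def resolve_clauses(options)
  where_clause = options[:where_clause]
  purge_where_clause = options.fetch(:purge_where_clause, where_clause)
  [where_clause, purge_where_clause]
end

omitted  = resolve_clauses(where_clause: "isp_id IS NOT NULL")
explicit = resolve_clauses(where_clause: "isp_id IS NOT NULL", purge_where_clause: nil)
empty    = resolve_clauses(where_clause: "isp_id IS NOT NULL", purge_where_clause: "")

p omitted   # key absent: purge clause falls back to where_clause
p explicit  # key present with nil: purge clause stays nil
p empty     # key present with "": purge clause stays ""
```

Using `options[:purge_where_clause] || @where_clause` instead would collapse the explicit-`nil` case into the fallback, which is presumably why `fetch` was chosen here.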
@@ -140,11 +150,27 @@ module DataDrain
     # @api private
     # @return [String]
     def base_where_sql
-      sql =
+      sql = date_range_sql
       sql += " AND #{@where_clause}" if @where_clause && !@where_clause.empty?
       sql
     end
 
+    # @api private
+    # @return [String]
+    def purge_where_sql
+      return nil if @purge_where_clause.nil?
+
+      sql = date_range_sql
+      sql += " AND #{@purge_where_clause}" unless @purge_where_clause.empty?
+      sql
+    end
+
+    # @api private
+    # @return [String]
+    def date_range_sql
+      "created_at >= '#{@start_date.to_fs(:db)}' AND created_at < '#{@end_date.to_fs(:db)}'"
+    end
+
     # @api private
     def setup_duckdb
       @duckdb.query("INSTALL postgres; LOAD postgres;")
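The three outcomes of `purge_where_sql` can be checked in isolation. The class below is a hypothetical stand-in, not the gem's `Engine`: it copies the two helper bodies from the hunk above and swaps ActiveSupport's `to_fs(:db)` for a plain `strftime` call, on the assumption that the `:db` format is `%Y-%m-%d %H:%M:%S`, so that it runs without Rails.

```ruby
# Hypothetical stand-in for the Engine's private helpers.
class PurgeSqlSketch
  DB_FORMAT = "%Y-%m-%d %H:%M:%S" # assumed equivalent of to_fs(:db)

  def initialize(start_date:, end_date:, purge_where_clause:)
    @start_date = start_date
    @end_date = end_date
    @purge_where_clause = purge_where_clause
  end

  def date_range_sql
    "created_at >= '#{@start_date.strftime(DB_FORMAT)}' " \
      "AND created_at < '#{@end_date.strftime(DB_FORMAT)}'"
  end

  def purge_where_sql
    return nil if @purge_where_clause.nil? # nil: purge disabled

    sql = date_range_sql
    sql += " AND #{@purge_where_clause}" unless @purge_where_clause.empty?
    sql
  end
end

range = { start_date: Time.utc(2025, 10, 1), end_date: Time.utc(2025, 11, 1) }

skip     = PurgeSqlSketch.new(**range, purge_where_clause: nil).purge_where_sql
full     = PurgeSqlSketch.new(**range, purge_where_clause: "").purge_where_sql
filtered = PurgeSqlSketch.new(**range, purge_where_clause: "isp_id IS NULL").purge_where_sql

p skip
puts full
puts filtered
```

Two things are worth noting from the hunk itself: the YARD tag on `purge_where_sql` reads `@return [String]` although the first branch returns `nil`, and the clause is interpolated into the SQL verbatim, so it should only come from trusted code.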
@@ -289,13 +315,19 @@ module DataDrain
     # @param conn [PG::Connection]
     # @return [Integer] total rows deleted
     def purge_loop(conn)
+      delete_sql = build_delete_sql
+      if delete_sql.nil?
+        safe_log(:info, "engine.purge_skipped", { table: @table_name, reason: "no_purge_clause" })
+        return 0
+      end
+
       batches_processed = 0
       total_deleted = 0
       slow_batch_streak = 0
 
       loop do
         batch_start = monotonic
-        result = conn.exec(
+        result = conn.exec(delete_sql)
         batch_duration = monotonic - batch_start
         count = result.cmd_tuples
         break if count.zero?
@@ -349,12 +381,16 @@ module DataDrain
     end
 
     # @api private
+    # @return [String, nil] SQL DELETE statement or nil if no purge clause
     def build_delete_sql
+      where = purge_where_sql
+      return nil if where.nil?
+
       <<~SQL
         DELETE FROM #{@table_name}
         WHERE #{@primary_key} IN (
           SELECT #{@primary_key} FROM #{@table_name}
-          WHERE #{
+          WHERE #{where}
           LIMIT #{@config.batch_size}
         )
       SQL
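The interaction between `build_delete_sql` returning `nil` and the batched loop in `purge_loop` can be sketched without a database. Everything here is a simplified assumption: `FakeConn` and `FakeResult` are hypothetical stand-ins that model only `exec` and `cmd_tuples`, and the loop omits the throttling, heartbeat logging and vacuum logic that the real method contains outside this diff.

```ruby
# Hypothetical stand-ins for PG::Connection and PG::Result.
FakeResult = Struct.new(:cmd_tuples)

class FakeConn
  attr_reader :statements

  def initialize(total_rows, batch_size)
    @remaining = total_rows
    @batch_size = batch_size
    @statements = []
  end

  # Pretends each DELETE removes up to batch_size of the remaining rows.
  def exec(sql)
    @statements << sql
    deleted = [@remaining, @batch_size].min
    @remaining -= deleted
    FakeResult.new(deleted)
  end
end

# Simplified loop: repeat the same LIMIT-ed DELETE until it affects no rows.
def purge_loop_sketch(conn, delete_sql)
  return 0 if delete_sql.nil? # no purge clause: skip without touching the connection

  total_deleted = 0
  loop do
    count = conn.exec(delete_sql).cmd_tuples
    break if count.zero?

    total_deleted += count
  end
  total_deleted
end

delete_sql = <<~SQL
  DELETE FROM versions
  WHERE id IN (
    SELECT id FROM versions
    WHERE created_at >= '2025-10-01 00:00:00' AND created_at < '2025-11-01 00:00:00'
    LIMIT 1000
  )
SQL

conn = FakeConn.new(2500, 1000)
deleted = purge_loop_sketch(conn, delete_sql)

idle_conn = FakeConn.new(2500, 1000)
skipped = purge_loop_sketch(idle_conn, nil)

puts "deleted=#{deleted} statements=#{conn.statements.size} skipped=#{skipped}"
```

Building the statement once before the loop, as the 0.6.0 hunk does, also means the `nil` check happens before any batch is attempted.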
data/lib/data_drain/record.rb
CHANGED
@@ -131,7 +131,10 @@ module DataDrain
     # @param partitions [Hash]
     # @return [String]
     def build_query_path(partitions)
-      partition_path = partition_keys.map
+      partition_path = partition_keys.map do |k|
+        val = partitions.key?(k.to_sym) ? partitions[k.to_sym] : partitions[k.to_s]
+        val.nil? || val.to_s.empty? ? "#{k}=*" : "#{k}=#{val}"
+      end.join("/")
       DataDrain::Storage.adapter.build_path(bucket, folder_name, partition_path)
     end
 
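The wildcard behavior added to `build_query_path` can be exercised on its own. The function below is a hypothetical extraction of the `map` block from the hunk above; the call to `DataDrain::Storage.adapter.build_path` is left out because the adapter is not part of this diff.

```ruby
# Hypothetical free function containing only the partition path logic.
def partition_path_for(partition_keys, partitions)
  partition_keys.map do |k|
    # Accept both symbol and string keys in the partitions hash.
    val = partitions.key?(k.to_sym) ? partitions[k.to_sym] : partitions[k.to_s]
    # Missing, nil or empty values become a wildcard segment.
    val.nil? || val.to_s.empty? ? "#{k}=*" : "#{k}=#{val}"
  end.join("/")
end

keys = %w[isp_id year month]

full    = partition_path_for(keys, { isp_id: 42, year: 2025, month: 10 })
partial = partition_path_for(keys, { "year" => 2025 })
blank   = partition_path_for(keys, { isp_id: "", year: 2025, month: nil })

puts full
puts partial
puts blank
```

Unspecified keys now produce `key=*` segments, which matches the 0.5.2 changelog entry about replacing empty `key=` values.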
data/lib/data_drain/version.rb
CHANGED
data/skill/SKILL.md
CHANGED
@@ -14,6 +14,7 @@ Complete knowledge skill about DataDrain. Consult me for any questi
 - **Hive Partitioning**: Folder structure `key1=val1/key2=val2/...` that DuckDB generates and consumes natively for efficient prefix scans.
 - **Half-open**: Range convention `[start, end)` with `<` (not `<=`) to avoid losing microseconds at date boundaries.
 - **skip_export**: Engine mode that delegates the export to an external tool (Glue/EMR) and only verifies + purges.
+- **purge_where_clause**: Independent SQL condition for the DELETE. Allows archiving a subset and purging the superset. nil = skip, "" = purge the whole range, "x" = range AND x.
 - **ensure_job**: Idempotent GlueRunner wrapper that creates or updates a job according to the desired config. Includes configuration diffing to avoid unnecessary API calls.
 - **changed_fields**: Private helper of ensure_job that compares the desired config against the current config of a Glue Job and returns which fields differ.
 - **Heartbeat**: Progress log emitted every 100 batches in massive purges (1TB tables).
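The `[start, end)` range convention listed in the glossary hunk above is easiest to see with a timestamp at the very end of a month. This is a plain-Ruby sketch with made-up timestamps; it makes no claim about how the gem builds its bounds beyond the `>=` / `<` comparison shown in `date_range_sql`.

```ruby
start_time = Time.utc(2025, 10, 1)
end_time   = Time.utc(2025, 11, 1) # exclusive upper bound

# A row written in the last microsecond of October.
last_instant = Time.utc(2025, 10, 31, 23, 59, 59, 999_999)

# Closed range with an upper bound truncated to whole seconds: the row is missed.
closed_upper = Time.utc(2025, 10, 31, 23, 59, 59)
in_closed = last_instant >= start_time && last_instant <= closed_upper

# Half-open range [start, end): the row is included.
in_half_open = last_instant >= start_time && last_instant < end_time

# The first instant of November belongs to the next range, not this one.
boundary_in_half_open = end_time >= start_time && end_time < end_time

puts "closed: #{in_closed}, half-open: #{in_half_open}, boundary: #{boundary_in_half_open}"
```

Because each range excludes its own upper bound, consecutive monthly ranges neither overlap nor leave a gap.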
@@ -70,7 +71,7 @@ DataDrain solves the lifecycle of historical data in relational databases c
 
 - Ruby `>= 3.2.0`
 - Runtime: `activemodel >= 6.0`, `duckdb ~> 1.4`, `pg >= 1.2`, `aws-sdk-s3 ~> 1.114`, `aws-sdk-glue ~> 1.0`
-- Current version: `0.
+- Current version: `0.6.0`
 
 ## Public API (summary)
 
@@ -98,12 +99,22 @@ DataDrain::Engine.new(
   bucket:, start_date:, end_date:, table_name:,
   partition_keys: %w[isp_id year month],
   primary_key: "id", # optional
-  where_clause: nil, # optional, extra SQL
+  where_clause: nil, # optional, extra SQL for export/verify
+  purge_where_clause: nil, # optional, SQL for DELETE (nil=skip, ""=full range, "x"=range+x)
   skip_export: false, # true delegates export to Glue
   folder_name: nil, # default = table_name
   select_sql: "*" # default
 ).call # => true (ok) | false (integrity fail)
 
+# Purge subset vs archive superset (v0.6.0+)
+DataDrain::Engine.new(
+  bucket:, start_date:, end_date:, table_name:,
+  partition_keys: %w[year month],
+  where_clause: "isp_id IS NOT NULL", # filters what gets archived
+  purge_where_clause: "" # purges the WHOLE range (empty = no additional filter)
+).call
+# Result: export/verify over isp_id NOT NULL, purge over the whole range
+
 # 2. Ingestion of raw files
 DataDrain::FileIngestor.new(
   bucket:, source_path:, folder_name:,
@@ -271,7 +282,7 @@ Full catalog in [Antipatterns](references/antipatrones.md). Summary of the
 ## References
 
 - [Detailed API](references/api-detallada.md): Full signatures, parameters, return values and behaviors of each public class.
-- [Glue Jobs Lifecycle](https://github.com/
+- [Glue Jobs Lifecycle](https://github.com/sequre/data_drain/blob/main/docs/glue-jobs-lifecycle.md): Complete guide to managing AWS Glue Jobs: create, update, delete, verify and run jobs idempotently.
 - [Events and Telemetry](references/eventos-telemetria.md): Full catalog of KV events emitted by the gem.
 - [Antipatterns](references/antipatrones.md): What NOT to do and the correct alternatives.
 - [Postgres Tuning](references/postgres-tuning.md): Indexes, VACUUM, partitioning and diagnostics by table size.
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: data_drain
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.6.0
 platform: ruby
 authors:
 - Gabriel
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2026-04-
+date: 2026-04-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activemodel
@@ -133,7 +133,7 @@ files:
 - skill/references/api-detallada.md
 - skill/references/eventos-telemetria.md
 - skill/references/postgres-tuning.md
-homepage: https://github.com/
+homepage: https://github.com/sequre/data_drain
 licenses: []
 metadata: {}
 post_install_message: