data_drain 0.1.18 → 0.2.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +20 -0
- data/CLAUDE.md +22 -0
- data/README.md +69 -169
- data/lib/data_drain/engine.rb +53 -40
- data/lib/data_drain/file_ingestor.rb +40 -25
- data/lib/data_drain/record.rb +26 -5
- data/lib/data_drain/storage/s3.rb +48 -6
- data/lib/data_drain/validations.rb +17 -0
- data/lib/data_drain/version.rb +1 -1
- data/lib/data_drain.rb +2 -0
- data/skill/SKILL.md +215 -0
- data/skill/references/antipatrones.md +242 -0
- data/skill/references/api-detallada.md +257 -0
- data/skill/references/eventos-telemetria.md +154 -0
- metadata +7 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: e121779f19f63fc4209e8c8393155a96403f4516dc62c285e90cebf244b3548e
|
|
4
|
+
data.tar.gz: 8e48a3a12f6b901030ce570b97ebd71999daceaa2b562f94980f85c414f1eea6
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: e20f0cc6586c0a1bed1281eae429ed5747b09cd8bf008b5fc996c7e3690f6a56a14083debd41744440f190c3521a8135c6822bc19c9faca1c33b3dd1507b67c2
|
|
7
|
+
data.tar.gz: c2c3333e2b3938431c8732ea3662cbfae00bd2fcb78519a4a5e9ec5d953c00ec3235ed134cc480b9029ced947b232394b575c60bf24eacbfac301cb32668c6ee
|
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,25 @@
|
|
|
1
1
|
## [Unreleased]
|
|
2
2
|
|
|
3
|
+
## [0.2.0] - 2026-04-13
|
|
4
|
+
|
|
5
|
+
### Security
|
|
6
|
+
- **BREAKING (preventive):** `table_name` and `primary_key` are validated against the regex `\A[a-zA-Z_][a-zA-Z0-9_]*\z`. Identifiers with special characters (dots, spaces, quotes) now raise `DataDrain::ConfigurationError`. (item 2)
|
|
7
|
+
- Storage::S3 migrates to `CREATE SECRET (TYPE S3, PROVIDER credential_chain)`. If `aws_access_key_id`/`aws_secret_access_key` are set, the explicit-credentials behavior is kept; otherwise it uses the AWS credential chain (IAM roles, env vars, ~/.aws/credentials). `aws_region` is now escaped with `''` in the SQL. (item 1)
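
The credential selection described in this entry can be sketched as follows. This is a minimal illustration and not the gem's code: `sql_quote` and `build_s3_secret_sql` are hypothetical names, and the sketch assumes DuckDB accepts `REGION` alongside `PROVIDER credential_chain`.

```ruby
# Hypothetical helper: doubles single quotes so a value is safe inside a SQL string literal.
def sql_quote(value)
  value.to_s.gsub("'", "''")
end

# Hypothetical sketch: explicit keys win; otherwise fall back to the AWS credential chain.
def build_s3_secret_sql(region:, access_key_id: nil, secret_access_key: nil)
  parts = ["TYPE S3"]
  if access_key_id && secret_access_key
    parts << "KEY_ID '#{sql_quote(access_key_id)}'"
    parts << "SECRET '#{sql_quote(secret_access_key)}'"
  else
    parts << "PROVIDER credential_chain"
  end
  parts << "REGION '#{sql_quote(region)}'" if region
  "CREATE SECRET (#{parts.join(', ')});"
end
```

Doubling the quote keeps a region value that contains `'` inside the string literal instead of terminating it.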
|
|
8
|
+
|
|
9
|
+
### Features
|
|
10
|
+
- `Record.disconnect!` closes and clears the thread-local DuckDB connection. Recommended in Sidekiq/Puma middlewares to avoid memory leaks. Idempotent. (item 3)
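
One way to wire this into Sidekiq is a server middleware that always disconnects after a job. The class below is a hypothetical sketch: `DuckDBCleanupMiddleware` is not part of the gem, and the disconnect call is injected so the example runs without `data_drain` loaded.

```ruby
# Hypothetical Sidekiq server middleware. `disconnector` is injectable so the sketch
# runs standalone; by default it would call DataDrain::Record.disconnect!.
class DuckDBCleanupMiddleware
  def initialize(disconnector = -> { DataDrain::Record.disconnect! })
    @disconnector = disconnector
  end

  # Sidekiq server middleware signature: call(job_instance, job_payload, queue) { ... }
  def call(_job_instance, _job_payload, _queue)
    yield
  ensure
    # Runs whether the job succeeded or raised, releasing the thread-local connection.
    @disconnector.call
  end
end
```

Because the changelog describes `disconnect!` as idempotent, calling it after every job should be safe even when the job never opened a connection.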
|
|
11
|
+
|
|
12
|
+
### Tests
|
|
13
|
+
- Coverage: 112 specs, line coverage 97.37% (SimpleCov).
|
|
14
|
+
- New specs: Record, Storage::Local, Storage::S3, Storage factory, GlueRunner, Observability, Configuration, JsonType, Validations, Engine (validation), FileIngestor (validation + CSV/JSON/Parquet ingestion).
|
|
15
|
+
|
|
16
|
+
## [0.1.19] - 2026-03-30
|
|
17
|
+
|
|
18
|
+
- Fix: `Record.build_query_path` now uses `partition_keys` as the source of truth for ordering, ignoring the order of the caller's kwargs. Previously, passing `where(year: 2026, isp_id: 42)` in a different order produced a path that did not match the Hive layout on disk.
|
|
19
|
+
- Fix: `GlueRunner` replaces ActiveSupport's `.truncate(200)` with plain Ruby `[0, 200]`, removing the implicit dependency.
|
|
20
|
+
- Convention: the canonical order of `partition_keys` is `[dimension_principal, year, month]` (e.g. `isp_id` first). Documented in CLAUDE.md and updated in the README, specs, and PySpark examples.
|
|
21
|
+
- Docs: README updated with correct production examples for Glue + Engine + Record.
|
|
22
|
+
|
|
3
23
|
## [0.1.18] - 2026-03-23
|
|
4
24
|
|
|
5
25
|
- Feature: the `Observability` module centralizes structured logging across the whole gem.
|
data/CLAUDE.md
CHANGED
|
@@ -19,9 +19,31 @@ created_at >= 'START' AND created_at < 'END_BOUNDARY'
|
|
|
19
19
|
```
|
|
20
20
|
Where `END_BOUNDARY` is the start of the following period (e.g. `next_day.beginning_of_day`). Never use `<= end_of_day`: microseconds at the boundary can be left out.
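
The half-open rule above can be sketched in plain Ruby, without ActiveSupport. The helper names `half_open_bounds` and `range_where_sql` are hypothetical and only illustrate the boundary arithmetic.

```ruby
require "date"

# Hypothetical helper: returns [start, end_boundary] for a half-open range that
# covers every instant from start_date through the last instant of end_date.
def half_open_bounds(start_date, end_date)
  [start_date.to_date, end_date.to_date.next_day]
end

# Builds the WHERE fragment; the upper bound is exclusive, so no end_of_day is needed.
def range_where_sql(column, start_date, end_date)
  from, to = half_open_bounds(start_date, end_date)
  "#{column} >= '#{from.iso8601}' AND #{column} < '#{to.iso8601}'"
end
```

For March 2026 this yields `created_at >= '2026-03-01' AND created_at < '2026-04-01'`, which includes any timestamp on March 31 regardless of its fractional seconds.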
|
|
21
21
|
|
|
22
|
+
### Partition Keys: Order and Contract
|
|
23
|
+
|
|
24
|
+
The `partition_keys` array is **fully dynamic**: each table/model defines its own. The library has no standard order.
|
|
25
|
+
|
|
26
|
+
**Critical rule:** the order of `partition_keys` when **writing** (Engine/FileIngestor) must be identical to the order declared in the **Record** model that reads those files. A mismatch produces paths that do not match, and DuckDB returns an empty result without an error.
|
|
27
|
+
|
|
28
|
+
```ruby
|
|
29
|
+
# Write
|
|
30
|
+
Engine.new(partition_keys: %w[isp_id year month], ...)
|
|
31
|
+
|
|
32
|
+
# Read: must match
|
|
33
|
+
class ArchivedVersion < DataDrain::Record
|
|
34
|
+
self.partition_keys = [:isp_id, :year, :month]
|
|
35
|
+
end
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
**Design criterion for the order:** the first key should be the highest-cardinality dimension or the one most often used as a filter (e.g. `isp_id` if queries are always per ISP). This determines the Hive folder hierarchy and the performance of the S3 prefix scan.
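
To see why the declared order matters, here is a minimal sketch of a path builder that follows `partition_keys` rather than the caller's argument order. `hive_path` is a hypothetical name, and using `*` for missing keys is an assumption made for illustration.

```ruby
# Hypothetical sketch: build a Hive-style path following the declared key order,
# regardless of the order in which the caller passes filters.
def hive_path(partition_keys, filters)
  partition_keys.map do |key|
    value = filters.fetch(key.to_sym, "*") # assumption: missing keys become a wildcard
    "#{key}=#{value}"
  end.join("/")
end
```

With `partition_keys` of `[:isp_id, :year, :month]`, the filters `{ year: 2026, isp_id: 42 }` produce `isp_id=42/year=2026/month=*`. If the writer had used a different key order, that path would simply not exist on disk.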
|
|
39
|
+
|
|
22
40
|
### Idempotency
|
|
23
41
|
Exports use DuckDB's `OVERWRITE_OR_IGNORE 1`. Processes are safe to retry.
|
|
24
42
|
|
|
43
|
+
### SQL identifier validation
|
|
44
|
+
|
|
45
|
+
`Engine#initialize` and `FileIngestor#initialize` validate `table_name`, `primary_key`, and `folder_name` against the regex `\A[a-zA-Z_][a-zA-Z0-9_]*\z`. Values with special characters (`.`, `;`, spaces, quotes) raise `DataDrain::ConfigurationError`. `select_sql` and `where_clause` are still treated as trusted input.
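
A check of this kind can be sketched as follows. The names below are stand-ins so the example runs on its own; the gem's actual `Validations` module and error class are not shown in this diff.

```ruby
# Hypothetical stand-in for DataDrain::ConfigurationError.
class ConfigurationError < StandardError; end

# Same pattern quoted in the text: a letter or underscore, then letters, digits, underscores.
IDENTIFIER_REGEX = /\A[a-zA-Z_][a-zA-Z0-9_]*\z/

# Hypothetical helper: returns the value when valid, raises otherwise.
def validate_identifier!(name, value)
  return value if value.to_s.match?(IDENTIFIER_REGEX)

  raise ConfigurationError, "#{name} is not a valid SQL identifier: #{value.inspect}"
end
```

Note that a schema-qualified name such as `public.versions` is rejected by this pattern because of the dot.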
|
|
46
|
+
|
|
25
47
|
### `idle_in_transaction_session_timeout`
|
|
26
48
|
The value `0` **disables** the timeout (no limit). This is mandatory for high-volume purges. Internally, check it with `!nil?` so that an explicit `0` is still applied.
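
The distinction between "not configured" and "explicitly zero" can be sketched like this. `idle_timeout_statement` is a hypothetical helper; the `SET` statement is standard PostgreSQL, but how the gem issues it is not shown here.

```ruby
# Hypothetical helper: returns the SET statement to run, or nil when nothing is configured.
def idle_timeout_statement(timeout)
  return nil if timeout.nil? # not configured: keep the server default

  # 0 is a meaningful value (timeout disabled), so it must reach this line.
  "SET idle_in_transaction_session_timeout = #{Integer(timeout)};"
end
```

A guard such as `timeout.positive?` would wrongly skip the `0` case, leaving the server default in place during a long purge.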
|
|
27
49
|
|
data/README.md
CHANGED
|
@@ -1,142 +1,107 @@
|
|
|
1
1
|
# DataDrain
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
Ruby micro-framework to extract, archive, and purge historical data from PostgreSQL into a Data Lake (S3 or local disk) in Parquet format, using in-memory DuckDB.
|
|
4
4
|
|
|
5
|
-
|
|
5
|
+
## Features
|
|
6
6
|
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
|
|
14
|
-
* **Integrated analytical ORM:** Includes a base class (`DataDrain::Record`) compatible with `ActiveModel` to query and destroy historical partitions idiomatically.
|
|
15
|
-
* **Structured observability:** All events emit logs in `key=value` format compatible with Datadog, CloudWatch, and `exis_ray`. Logging failures never interrupt the main flow.
|
|
7
|
+
- **High-performance ETL:** millions of records from Postgres to Parquet without loading objects into Ruby RAM.
|
|
8
|
+
- **File ingestion:** converts local CSV, JSON, or Parquet files to partitioned Parquet (ZSTD) and uploads them to S3.
|
|
9
|
+
- **Hive partitioning:** organizes files as `key=val/key=val/...` for efficient prefix scans.
|
|
10
|
+
- **Storage adapters:** transparent support for local disk and AWS S3.
|
|
11
|
+
- **Guaranteed integrity:** count-based Postgres vs Parquet verification before any `DELETE`.
|
|
12
|
+
- **Analytical ORM:** base class `DataDrain::Record` (`ActiveModel`-compatible) to query and purge historical partitions.
|
|
13
|
+
- **Structured observability:** `key=value` logs compatible with Datadog, CloudWatch, and `exis_ray`. Logger failures never interrupt the main flow.
|
|
16
14
|
|
|
17
15
|
## Installation
|
|
18
16
|
|
|
19
|
-
Add this line to the `Gemfile` of your application or microservice:
|
|
20
|
-
|
|
21
17
|
```ruby
|
|
18
|
+
# Gemfile
|
|
22
19
|
gem 'data_drain', git: 'https://github.com/gedera/data_drain.git', branch: 'main'
|
|
23
20
|
```
|
|
24
21
|
|
|
25
|
-
And run:
|
|
26
22
|
```bash
|
|
27
|
-
|
|
23
|
+
bundle install
|
|
28
24
|
```
|
|
29
25
|
|
|
30
26
|
## Configuration
|
|
31
27
|
|
|
32
|
-
Create an initializer in your application (e.g. `config/initializers/data_drain.rb`) to configure the credentials and the engine's behavior:
|
|
33
|
-
|
|
34
28
|
```ruby
|
|
29
|
+
# config/initializers/data_drain.rb
|
|
35
30
|
DataDrain.configure do |config|
|
|
36
|
-
|
|
37
|
-
config.storage_mode = ENV.fetch('STORAGE_MODE', 'local').to_sym
|
|
31
|
+
config.storage_mode = ENV.fetch('STORAGE_MODE', 'local').to_sym # :local or :s3
|
|
38
32
|
|
|
39
|
-
# AWS S3 (
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
33
|
+
# AWS S3 (only if storage_mode == :s3)
|
|
34
|
+
config.aws_region = ENV['AWS_REGION']
|
|
35
|
+
config.aws_access_key_id = ENV['AWS_ACCESS_KEY_ID']
|
|
36
|
+
config.aws_secret_access_key = ENV['AWS_SECRET_ACCESS_KEY']
|
|
43
37
|
|
|
44
|
-
#
|
|
38
|
+
# Source PostgreSQL (Engine only)
|
|
45
39
|
config.db_host = ENV.fetch('DB_HOST', '127.0.0.1')
|
|
46
40
|
config.db_port = ENV.fetch('DB_PORT', '5432')
|
|
47
41
|
config.db_user = ENV.fetch('DB_USER', 'postgres')
|
|
48
42
|
config.db_pass = ENV.fetch('DB_PASS', '')
|
|
49
43
|
config.db_name = ENV.fetch('DB_NAME', 'core_production')
|
|
50
44
|
|
|
51
|
-
#
|
|
52
|
-
config.batch_size
|
|
53
|
-
config.throttle_delay
|
|
45
|
+
# Purge tuning
|
|
46
|
+
config.batch_size = 5000 # records per DELETE
|
|
47
|
+
config.throttle_delay = 0.5 # seconds between batches
|
|
48
|
+
config.idle_in_transaction_session_timeout = 0 # 0 = DISABLED (mandatory for massive purges)
|
|
54
49
|
|
|
55
|
-
#
|
|
56
|
-
#
|
|
57
|
-
|
|
58
|
-
config.idle_in_transaction_session_timeout = 0
|
|
50
|
+
# DuckDB tuning
|
|
51
|
+
config.limit_ram = '2GB' # avoids OOM in containers
|
|
52
|
+
config.tmp_directory = '/tmp/duckdb_work' # spill-to-disk (prefer SSD/NVMe)
|
|
59
53
|
|
|
60
54
|
config.logger = Rails.logger
|
|
61
|
-
|
|
62
|
-
# DuckDB tuning
|
|
63
|
-
# Maximum RAM limit for DuckDB's in-memory queries (e.g. '2GB', '512MB').
|
|
64
|
-
# Prevents the process from being OOM-killed in memory-limited containers.
|
|
65
|
-
config.limit_ram = '2GB'
|
|
66
|
-
|
|
67
|
-
# DuckDB temporary directory for spilling memory to disk during
|
|
68
|
-
# heavy transformations or creation of massive Parquet files.
|
|
69
|
-
# This directory should reside on a fast SSD/NVMe disk.
|
|
70
|
-
config.tmp_directory = '/tmp/duckdb_work'
|
|
71
55
|
end
|
|
72
56
|
```
|
|
73
57
|
|
|
74
58
|
## Usage
|
|
75
59
|
|
|
76
|
-
|
|
77
|
-
|
|
78
|
-
### 1. Raw File Ingestion (FileIngestor)
|
|
79
|
-
|
|
80
|
-
Ideal for services that generate large volumes of data (e.g. Netflow metrics). It takes a local file, transforms it, compresses it to Parquet, and uploads it partitioned to S3.
|
|
60
|
+
### Raw file ingestion (FileIngestor)
|
|
81
61
|
|
|
82
62
|
```ruby
|
|
83
|
-
|
|
63
|
+
DataDrain::FileIngestor.new(
|
|
84
64
|
bucket: 'my-bucket-store',
|
|
85
|
-
source_path: '/tmp/
|
|
65
|
+
source_path: '/tmp/netflow_metrics.csv',
|
|
86
66
|
folder_name: 'netflow',
|
|
87
|
-
partition_keys: %w[year month
|
|
67
|
+
partition_keys: %w[isp_id year month],
|
|
88
68
|
select_sql: "*, EXTRACT(YEAR FROM timestamp) AS year, EXTRACT(MONTH FROM timestamp) AS month",
|
|
89
69
|
delete_after_upload: true
|
|
90
|
-
)
|
|
91
|
-
|
|
92
|
-
ingestor.call
|
|
70
|
+
).call
|
|
93
71
|
```
|
|
94
72
|
|
|
95
|
-
###
|
|
96
|
-
|
|
97
|
-
Ideal for creating rolling retention windows (e.g. keeping only 6 months of live data in Postgres and archiving the rest).
|
|
73
|
+
### Extraction and purge (Engine)
|
|
98
74
|
|
|
99
|
-
|
|
75
|
+
Rolling retention windows: archive the month that is 6 months back and purge it from the source.
|
|
100
76
|
|
|
101
77
|
```ruby
|
|
102
|
-
|
|
78
|
+
DataDrain::Engine.new(
|
|
103
79
|
bucket: 'my-bucket-store',
|
|
104
80
|
start_date: 6.months.ago.beginning_of_month,
|
|
105
81
|
end_date: 6.months.ago.end_of_month,
|
|
106
82
|
table_name: 'versions',
|
|
107
83
|
partition_keys: %w[year month]
|
|
108
|
-
)
|
|
109
|
-
|
|
110
|
-
engine.call
|
|
84
|
+
).call
|
|
111
85
|
```
|
|
112
86
|
|
|
113
|
-
|
|
87
|
+
### `skip_export` mode (delegate export to Glue/EMR)
|
|
114
88
|
|
|
115
|
-
|
|
89
|
+
DataDrain only verifies integrity and purges; the export was already done by another tool.
|
|
116
90
|
|
|
117
91
|
```ruby
|
|
118
|
-
|
|
92
|
+
DataDrain::Engine.new(
|
|
119
93
|
bucket: 'my-bucket-store',
|
|
120
94
|
start_date: 6.months.ago.beginning_of_month,
|
|
121
95
|
end_date: 6.months.ago.end_of_month,
|
|
122
96
|
table_name: 'versions',
|
|
123
97
|
partition_keys: %w[year month],
|
|
124
98
|
skip_export: true
|
|
125
|
-
)
|
|
126
|
-
|
|
127
|
-
engine.call
|
|
99
|
+
).call
|
|
128
100
|
```
|
|
129
101
|
|
|
130
|
-
###
|
|
131
|
-
|
|
132
|
-
For high-volume tables (**e.g. > 500GB or 1TB**), it is recommended to delegate data movement to **AWS Glue** (based on Apache Spark) to avoid saturating the Ruby server. `DataDrain` acts as the orchestrator that triggers the Job, waits for it to finish, and then performs validation and purge.
|
|
102
|
+
### Orchestration with AWS Glue (1TB+ tables)
|
|
133
103
|
|
|
134
104
|
```ruby
|
|
135
|
-
config = DataDrain.configuration
|
|
136
|
-
bucket = "my-bucket"
|
|
137
|
-
table = "versions"
|
|
138
|
-
|
|
139
|
-
# 1. Trigger the Glue Job and wait for it to finish successfully
|
|
140
105
|
DataDrain::GlueRunner.run_and_wait(
|
|
141
106
|
"my-glue-export-job",
|
|
142
107
|
{
|
|
@@ -148,137 +113,72 @@ DataDrain::GlueRunner.run_and_wait(
|
|
|
148
113
|
"--db_user" => config.db_user,
|
|
149
114
|
"--db_password" => config.db_pass,
|
|
150
115
|
"--db_table" => table,
|
|
151
|
-
"--partition_by" => "year,month
|
|
116
|
+
"--partition_by" => "isp_id,year,month"
|
|
152
117
|
}
|
|
153
118
|
)
|
|
154
119
|
|
|
155
|
-
# 2. Once Glue has exported the TB, DataDrain validates integrity and purges Postgres
|
|
156
120
|
DataDrain::Engine.new(
|
|
157
|
-
bucket:
|
|
158
|
-
|
|
159
|
-
|
|
160
|
-
end_date: end_date,
|
|
161
|
-
table_name: table,
|
|
162
|
-
partition_keys: %w[year month isp_id],
|
|
163
|
-
skip_export: true
|
|
121
|
+
bucket:, folder_name: table, start_date:, end_date:,
|
|
122
|
+
table_name: table, partition_keys: %w[isp_id year month],
|
|
123
|
+
skip_export: true
|
|
164
124
|
).call
|
|
165
125
|
```
|
|
166
126
|
|
|
167
|
-
|
|
168
|
-
|
|
169
|
-
Create a Job in the AWS Glue console (Spark 4.0+) and use this script as a base:
|
|
170
|
-
|
|
171
|
-
```python
|
|
172
|
-
import sys
|
|
173
|
-
from awsglue.utils import getResolvedOptions
|
|
174
|
-
from pyspark.context import SparkContext
|
|
175
|
-
from awsglue.context import GlueContext
|
|
176
|
-
from awsglue.job import Job
|
|
177
|
-
from pyspark.sql.functions import col, year, month
|
|
127
|
+
Reference PySpark script: [`docs/glue_pyspark_example.py`](docs/glue_pyspark_example.py).
|
|
178
128
|
|
|
179
|
-
|
|
180
|
-
'JOB_NAME', 'start_date', 'end_date', 's3_bucket', 's3_folder',
|
|
181
|
-
'db_url', 'db_user', 'db_password', 'db_table', 'partition_by'
|
|
182
|
-
])
|
|
183
|
-
|
|
184
|
-
sc = SparkContext()
|
|
185
|
-
glueContext = GlueContext(sc)
|
|
186
|
-
spark = glueContext.spark_session
|
|
187
|
-
job = Job(glueContext)
|
|
188
|
-
job.init(args['JOB_NAME'], args)
|
|
189
|
-
|
|
190
|
-
options = {
|
|
191
|
-
"url": args['db_url'],
|
|
192
|
-
"dbtable": args['db_table'],
|
|
193
|
-
"user": args['db_user'],
|
|
194
|
-
"password": args['db_password'],
|
|
195
|
-
"sampleQuery": f"SELECT * FROM {args['db_table']} WHERE created_at >= '{args['start_date']}' AND created_at < '{args['end_date']}'"
|
|
196
|
-
}
|
|
197
|
-
|
|
198
|
-
df = spark.read.format("jdbc").options(**options).load()
|
|
199
|
-
|
|
200
|
-
df_final = df.withColumn("year", year(col("created_at"))) \
|
|
201
|
-
.withColumn("month", month(col("created_at")))
|
|
202
|
-
|
|
203
|
-
output_path = f"s3://{args['s3_bucket']}/{args['s3_folder']}/"
|
|
204
|
-
partitions = args['partition_by'].split(",")
|
|
205
|
-
|
|
206
|
-
df_final.write.mode("overwrite") \
|
|
207
|
-
.partitionBy(*partitions) \
|
|
208
|
-
.format("parquet") \
|
|
209
|
-
.option("compression", "zstd") \
|
|
210
|
-
.save(output_path)
|
|
211
|
-
|
|
212
|
-
job.commit()
|
|
213
|
-
```
|
|
214
|
-
|
|
215
|
-
### 4. Query the Data Lake (Record)
|
|
216
|
-
|
|
217
|
-
To query archived data without leaving Ruby, create a model that inherits from `DataDrain::Record`.
|
|
129
|
+
### Query the Data Lake (Record)
|
|
218
130
|
|
|
219
131
|
```ruby
|
|
220
|
-
# app/models/archived_version.rb
|
|
221
132
|
class ArchivedVersion < DataDrain::Record
|
|
222
|
-
self.bucket
|
|
223
|
-
self.folder_name
|
|
224
|
-
self.partition_keys = [:
|
|
133
|
+
self.bucket = 'my-bucket-storage'
|
|
134
|
+
self.folder_name = 'versions'
|
|
135
|
+
self.partition_keys = [:isp_id, :year, :month] # order = Hive hierarchy
|
|
225
136
|
|
|
226
137
|
attribute :id, :string
|
|
227
138
|
attribute :item_type, :string
|
|
228
|
-
attribute :item_id, :string
|
|
229
139
|
attribute :event, :string
|
|
230
|
-
attribute :whodunnit, :string
|
|
231
140
|
attribute :created_at, :datetime
|
|
232
141
|
attribute :object, :json
|
|
233
142
|
attribute :object_changes, :json
|
|
234
143
|
end
|
|
235
|
-
```
|
|
236
|
-
|
|
237
|
-
Queries optimized through Hive Partitioning:
|
|
238
144
|
|
|
239
|
-
```ruby
|
|
240
145
|
# Point lookup isolating the exact partition
|
|
241
|
-
|
|
242
|
-
puts version.object_changes # => {"status" => ["active", "suspended"]}
|
|
146
|
+
ArchivedVersion.find("uuid", isp_id: 42, year: 2026, month: 3)
|
|
243
147
|
|
|
244
148
|
# Collections
|
|
245
|
-
|
|
246
|
-
```
|
|
247
|
-
|
|
248
|
-
### 5. Data Destruction (Retention and Compliance)
|
|
249
|
-
|
|
250
|
-
The framework can physically delete entire folders in S3 or local storage using wildcards.
|
|
251
|
-
|
|
252
|
-
```ruby
|
|
253
|
-
# Deletes a customer's entire history across all years
|
|
254
|
-
ArchivedVersion.destroy_all(isp_id: 42)
|
|
149
|
+
ArchivedVersion.where(limit: 10, isp_id: 42, year: 2026, month: 3)
|
|
255
150
|
|
|
256
|
-
#
|
|
257
|
-
ArchivedVersion.destroy_all(
|
|
151
|
+
# Deletion (retention and compliance)
|
|
152
|
+
ArchivedVersion.destroy_all(isp_id: 42) # a customer's entire history
|
|
153
|
+
ArchivedVersion.destroy_all(year: 2024, month: 3) # one month, globally
|
|
258
154
|
```
|
|
259
155
|
|
|
260
|
-
##
|
|
156
|
+
## Critical conventions
|
|
261
157
|
|
|
262
|
-
|
|
263
|
-
|
|
264
|
-
|
|
265
|
-
|
|
266
|
-
* **Analytical ORM with sanitization:** `DataDrain::Record` sanitizes parameters to prevent SQL injection when querying Parquet files.
|
|
158
|
+
- **Half-open date ranges:** always `created_at >= START AND created_at < END_BOUNDARY`. Never `<= end_of_day`.
|
|
159
|
+
- **Order of `partition_keys`:** must match between writing (Engine/FileIngestor) and reading (Record). Mismatch → DuckDB returns an empty result without an error.
|
|
160
|
+
- **Changing `storage_mode` at runtime:** call `DataDrain::Storage.reset_adapter!` afterwards.
|
|
161
|
+
- **`verify_integrity`** is the only safeguard before purging. If it fails, the flow returns `false` and aborts the `DELETE`.
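
The integrity gate in the last convention can be pictured with a minimal sketch. `purge_if_verified` is a hypothetical function; it only mirrors the count comparison visible in `engine.rb` and is not the gem's implementation.

```ruby
# Hypothetical sketch: run the destructive step only when source and archive counts agree.
def purge_if_verified(pg_count:, parquet_count:)
  return false unless pg_count == parquet_count # mismatch: abort, nothing is deleted

  yield # the purge (DELETE batches) would run here
  true
end
```

The block is never called on a mismatch, so a partial or failed export cannot lead to deleted source rows.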
|
|
267
162
|
|
|
268
163
|
## Observability
|
|
269
164
|
|
|
270
|
-
All events emit structured logs in `key=value` format that can be processed by tools such as Datadog, CloudWatch Logs Insights, or `exis_ray`:
|
|
271
|
-
|
|
272
165
|
```
|
|
273
166
|
component=data_drain event=engine.complete table=versions duration_s=12.4 export_duration_s=8.1 purge_duration_s=3.9 count=150000
|
|
274
|
-
component=data_drain event=engine.integrity_error table=versions duration_s=5.2 count=150000
|
|
275
167
|
component=data_drain event=engine.purge_heartbeat table=versions batches_processed_count=100 rows_deleted_count=500000
|
|
276
|
-
component=data_drain event=file_ingestor.complete source_path=/tmp/data.csv duration_s=2.1 count=85000
|
|
277
168
|
component=data_drain event=glue_runner.failed job=my-export-job run_id=jr_abc123 status=FAILED duration_s=301.0
|
|
278
169
|
```
|
|
279
170
|
|
|
280
|
-
|
|
171
|
+
`key=value` format. Durations use the `_s` suffix (Float). Counters use `_count` (Integer). No units in values. Internal logger failures never interrupt the main flow.
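
A formatter that produces lines in this shape could look like the sketch below. `format_event` is a hypothetical name; the gem's `Observability` module is not shown in this diff, so this only reproduces the output format of the sample log lines.

```ruby
# Hypothetical sketch: render an event as space-separated key=value pairs,
# always leading with the component and event name.
def format_event(event, fields = {})
  pairs = { component: "data_drain", event: event }.merge(fields)
  pairs.map { |key, value| "#{key}=#{value}" }.join(" ")
end
```

Calling `format_event("engine.purge_heartbeat", table: "versions", batches_processed_count: 100, rows_deleted_count: 500000)` yields a line in the same shape as the heartbeat sample in the log excerpt.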
|
|
172
|
+
|
|
173
|
+
## Contributing
|
|
174
|
+
|
|
175
|
+
```bash
|
|
176
|
+
bundle install
|
|
177
|
+
bundle exec rspec # tests
|
|
178
|
+
bundle exec rubocop # linting
|
|
179
|
+
bin/console # REPL
|
|
180
|
+
```
|
|
281
181
|
|
|
282
182
|
## License
|
|
283
183
|
|
|
284
|
-
|
|
184
|
+
MIT.
|
data/lib/data_drain/engine.rb
CHANGED
|
@@ -5,6 +5,7 @@ require "pg"
|
|
|
5
5
|
|
|
6
6
|
module DataDrain
|
|
7
7
|
# Main data extraction and purge engine (DataDrain).
|
|
8
|
+
# rubocop:disable Metrics/ClassLength, Metrics/AbcSize, Metrics/MethodLength, Naming/AccessorMethodName
|
|
8
9
|
#
|
|
9
10
|
# Orchestrates the ETL flow from PostgreSQL to an analytical Data Lake,
|
|
10
11
|
# delegating storage interaction to the configured adapter.
|
|
@@ -21,29 +22,31 @@ module DataDrain
|
|
|
21
22
|
# @option options [Array<String, Symbol>] :partition_keys Columns to partition by.
|
|
22
23
|
# @option options [String] :primary_key (Optional) Primary key used for deletion. Defaults to 'id'.
|
|
23
24
|
# @option options [String] :where_clause (Optional) Extra SQL condition.
|
|
24
|
-
# @option options [Boolean] :skip_export (Opcional) Si
|
|
25
|
+
# @option options [Boolean] :skip_export (Optional) If true, skips the export
|
|
26
|
+
# to Parquet and only validates and purges (for use with GlueRunner).
|
|
25
27
|
def initialize(options)
|
|
26
|
-
@start_date
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
# This avoids precision problems with microseconds when using end_of_day
|
|
30
|
-
@end_date = options.fetch(:end_date).to_date.next_day.beginning_of_day
|
|
31
|
-
|
|
32
|
-
@table_name = options.fetch(:table_name)
|
|
33
|
-
@folder_name = options.fetch(:folder_name, @table_name)
|
|
34
|
-
@select_sql = options.fetch(:select_sql, "*")
|
|
35
|
-
@partition_keys = options.fetch(:partition_keys)
|
|
36
|
-
@primary_key = options.fetch(:primary_key, "id")
|
|
37
|
-
@where_clause = options[:where_clause]
|
|
38
|
-
@bucket = options[:bucket]
|
|
39
|
-
@skip_export = options.fetch(:skip_export, false)
|
|
28
|
+
@start_date = options.fetch(:start_date).beginning_of_day
|
|
29
|
+
|
|
30
|
+
@end_date = options.fetch(:end_date).to_date.next_day.beginning_of_day
|
|
40
31
|
|
|
41
|
-
@
|
|
42
|
-
|
|
32
|
+
@table_name = options.fetch(:table_name)
|
|
33
|
+
Validations.validate_identifier!(:table_name, @table_name)
|
|
34
|
+
|
|
35
|
+
@folder_name = options.fetch(:folder_name, @table_name)
|
|
36
|
+
@select_sql = options.fetch(:select_sql, "*")
|
|
37
|
+
@partition_keys = options.fetch(:partition_keys)
|
|
38
|
+
@primary_key = options.fetch(:primary_key, "id")
|
|
39
|
+
Validations.validate_identifier!(:primary_key, @primary_key)
|
|
40
|
+
@where_clause = options[:where_clause]
|
|
41
|
+
@bucket = options[:bucket]
|
|
42
|
+
@skip_export = options.fetch(:skip_export, false)
|
|
43
|
+
|
|
44
|
+
@config = DataDrain.configuration
|
|
45
|
+
@logger = @config.logger
|
|
43
46
|
@adapter = DataDrain::Storage.adapter
|
|
44
47
|
|
|
45
48
|
database = DuckDB::Database.open(":memory:")
|
|
46
|
-
@duckdb
|
|
49
|
+
@duckdb = database.connect
|
|
47
50
|
end
|
|
48
51
|
|
|
49
52
|
# Runs the engine's full flow: Setup, Count, Export (optional), Verification, and Purge.
|
|
@@ -51,7 +54,8 @@ module DataDrain
|
|
|
51
54
|
# @return [Boolean] `true` if the process finished successfully, `false` if the integrity check failed.
|
|
52
55
|
def call
|
|
53
56
|
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
|
|
54
|
-
safe_log(:info, "engine.start",
|
|
57
|
+
safe_log(:info, "engine.start",
|
|
58
|
+
{ table: @table_name, start_date: @start_date.to_date, end_date: @end_date.to_date })
|
|
55
59
|
|
|
56
60
|
setup_duckdb
|
|
57
61
|
|
|
@@ -62,7 +66,8 @@ module DataDrain
|
|
|
62
66
|
|
|
63
67
|
if @pg_count.zero?
|
|
64
68
|
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
|
|
65
|
-
safe_log(:info, "engine.skip_empty",
|
|
69
|
+
safe_log(:info, "engine.skip_empty",
|
|
70
|
+
{ table: @table_name, duration_s: duration.round(2), db_query_duration_s: db_query_duration.round(2) })
|
|
66
71
|
return true
|
|
67
72
|
end
|
|
68
73
|
|
|
@@ -90,18 +95,19 @@ module DataDrain
|
|
|
90
95
|
|
|
91
96
|
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
|
|
92
97
|
safe_log(:info, "engine.complete", {
|
|
93
|
-
|
|
94
|
-
|
|
95
|
-
|
|
96
|
-
|
|
97
|
-
|
|
98
|
-
|
|
99
|
-
|
|
100
|
-
|
|
98
|
+
table: @table_name,
|
|
99
|
+
duration_s: duration.round(2),
|
|
100
|
+
db_query_duration_s: db_query_duration.round(2),
|
|
101
|
+
export_duration_s: export_duration.round(2),
|
|
102
|
+
integrity_duration_s: integrity_duration.round(2),
|
|
103
|
+
purge_duration_s: purge_duration.round(2),
|
|
104
|
+
count: @pg_count
|
|
105
|
+
})
|
|
101
106
|
true
|
|
102
107
|
else
|
|
103
108
|
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
|
|
104
|
-
safe_log(:error, "engine.integrity_error",
|
|
109
|
+
safe_log(:error, "engine.integrity_error",
|
|
110
|
+
{ table: @table_name, duration_s: duration.round(2), count: @pg_count })
|
|
105
111
|
false
|
|
106
112
|
end
|
|
107
113
|
end
|
|
@@ -142,7 +148,12 @@ module DataDrain
|
|
|
142
148
|
@adapter.prepare_export_path(@bucket, @folder_name)
|
|
143
149
|
|
|
144
150
|
# Determine the destination base path according to the adapter
|
|
145
|
-
dest_path = @config.storage_mode.to_sym == :s3
|
|
151
|
+
dest_path = if @config.storage_mode.to_sym == :s3
|
|
152
|
+
"s3://#{@bucket}/#{@folder_name}/"
|
|
153
|
+
else
|
|
154
|
+
File.join(@bucket,
|
|
155
|
+
@folder_name, "")
|
|
156
|
+
end
|
|
146
157
|
|
|
147
158
|
pg_sql = "SELECT #{@select_sql} FROM public.#{@table_name} WHERE #{base_where_sql}"
|
|
148
159
|
pg_sql = pg_sql.gsub("'", "''")
|
|
@@ -154,7 +165,7 @@ module DataDrain
|
|
|
154
165
|
) TO '#{dest_path}'
|
|
155
166
|
(
|
|
156
167
|
FORMAT PARQUET,
|
|
157
|
-
PARTITION_BY (#{@partition_keys.join(
|
|
168
|
+
PARTITION_BY (#{@partition_keys.join(", ")}),
|
|
158
169
|
COMPRESSION 'ZSTD',
|
|
159
170
|
OVERWRITE_OR_IGNORE 1
|
|
160
171
|
);
|
|
@@ -180,7 +191,8 @@ module DataDrain
|
|
|
180
191
|
return false
|
|
181
192
|
end
|
|
182
193
|
|
|
183
|
-
safe_log(:info, "engine.integrity_check",
|
|
194
|
+
safe_log(:info, "engine.integrity_check",
|
|
195
|
+
{ table: @table_name, pg_count: @pg_count, parquet_count: parquet_result })
|
|
184
196
|
@pg_count == parquet_result
|
|
185
197
|
end
|
|
186
198
|
|
|
@@ -189,11 +201,11 @@ module DataDrain
|
|
|
189
201
|
safe_log(:info, "engine.purge_start", { table: @table_name, batch_size: @config.batch_size })
|
|
190
202
|
|
|
191
203
|
conn = PG.connect(
|
|
192
|
-
host:
|
|
193
|
-
port:
|
|
194
|
-
user:
|
|
204
|
+
host: @config.db_host,
|
|
205
|
+
port: @config.db_port,
|
|
206
|
+
user: @config.db_user,
|
|
195
207
|
password: @config.db_pass,
|
|
196
|
-
dbname:
|
|
208
|
+
dbname: @config.db_name
|
|
197
209
|
)
|
|
198
210
|
|
|
199
211
|
unless @config.idle_in_transaction_session_timeout.nil?
|
|
@@ -223,10 +235,10 @@ module DataDrain
|
|
|
223
235
|
# Heartbeat every 100 batches to monitor long-running 1TB processes
|
|
224
236
|
if (batches_processed % 100).zero?
|
|
225
237
|
safe_log(:info, "engine.purge_heartbeat", {
|
|
226
|
-
|
|
227
|
-
|
|
228
|
-
|
|
229
|
-
|
|
238
|
+
table: @table_name,
|
|
239
|
+
batches_processed_count: batches_processed,
|
|
240
|
+
rows_deleted_count: total_deleted
|
|
241
|
+
})
|
|
230
242
|
end
|
|
231
243
|
|
|
232
244
|
sleep(@config.throttle_delay) if @config.throttle_delay.positive?
|
|
@@ -235,4 +247,5 @@ module DataDrain
|
|
|
235
247
|
conn&.close
|
|
236
248
|
end
|
|
237
249
|
end
|
|
250
|
+
# rubocop:enable Metrics/ClassLength, Metrics/AbcSize, Metrics/MethodLength, Naming/AccessorMethodName
|
|
238
251
|
end
|