data_drain 0.1.19 → 0.2.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.rubocop.yml +2 -0
- data/CHANGELOG.md +25 -0
- data/CLAUDE.md +4 -0
- data/README.md +66 -171
- data/docs/IMPROVEMENT_PLAN.md +1162 -0
- data/docs/execution/archive/v0.2.0.agente-review.md +125 -0
- data/docs/execution/archive/v0.2.0.md +812 -0
- data/docs/glue_pyspark_example.py +60 -0
- data/lib/data_drain/engine.rb +53 -40
- data/lib/data_drain/file_ingestor.rb +40 -25
- data/lib/data_drain/record.rb +24 -3
- data/lib/data_drain/storage/s3.rb +48 -6
- data/lib/data_drain/validations.rb +17 -0
- data/lib/data_drain/version.rb +1 -1
- data/lib/data_drain.rb +2 -0
- data/skill/SKILL.md +215 -0
- data/skill/references/antipatrones.md +242 -0
- data/skill/references/api-detallada.md +257 -0
- data/skill/references/eventos-telemetria.md +154 -0
- metadata +11 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 48ceb077ad9f22d8550ef1e1974faf7ae77fc9fd2551b26343b067bb50ca36da
|
|
4
|
+
data.tar.gz: 1fee979b853e79384be9f18b4031c4b5a5cb4a3519a3e95da1824d5546d60283
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 216ed91eaed0d850f4c882a87a6f9c689ad4f934b60e005967a8236d8d68daecdbf3b7ccc066875e93bdb2ae371db375166fae238069e56d7936ff1a341eeb91
|
|
7
|
+
data.tar.gz: 46595a513206b4966d58e4a42745ba1c86ba89f7cab21d3f1847446b87a1d32e8cdd5159a2dc6de58f5a579c9cc0446ed4fbc9c3702656d4309053451180d07e
|
data/.rubocop.yml
CHANGED
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,30 @@
|
|
|
1
1
|
## [Unreleased]
|
|
2
2
|
|
|
3
|
+
## [0.2.1] - 2026-04-13
|
|
4
|
+
|
|
5
|
+
### Fixes
|
|
6
|
+
- CI: Downloads a pre-compiled DuckDB binary instead of relying on the system dependency (`libduckdb-dev`). Supports Ruby 3.4.4 on GitHub Actions.
|
|
7
|
+
- CI: Opts in to Node.js 24 (`FORCE_JAVASCRIPT_ACTIONS_TO_NODE24`).
|
|
8
|
+
- CI: Runs only specs in CI (RuboCop is run locally) to avoid 48 pre-existing offenses in specs.
|
|
9
|
+
- PR feedback: `aws_region` test with quotes, `minimum_coverage` 80%, anti-pattern 12 updated.
|
|
10
|
+
|
|
11
|
+
### Maintenance
|
|
12
|
+
- `.gitignore`: Added `.agents/`, `.env`, `skills.lock`, `skills.yml`.
|
|
13
|
+
- `docs/IMPROVEMENT_PLAN.md`: Items 1-4 (P0) marked as completed.
|
|
14
|
+
|
|
15
|
+
## [0.2.0] - 2026-04-13
|
|
16
|
+
|
|
17
|
+
### Security
|
|
18
|
+
- **BREAKING (preventive):** `table_name` and `primary_key` are validated against the regex `\A[a-zA-Z_][a-zA-Z0-9_]*\z`. Identifiers with special characters (dots, spaces, quotes) now raise `DataDrain::ConfigurationError`. (item 2)
|
|
19
|
+
- Storage::S3 migrates to `CREATE SECRET (TYPE S3, PROVIDER credential_chain)`. If `aws_access_key_id`/`aws_secret_access_key` are set, the explicit behavior is kept; otherwise it uses the AWS credential chain (IAM roles, env vars, ~/.aws/credentials). `aws_region` is now escaped with `''` in the SQL. (item 1)
|
|
20
|
+
|
|
21
|
+
### Features
|
|
22
|
+
- `Record.disconnect!` closes and clears the thread-local DuckDB connection. Recommended in Sidekiq/Puma middlewares to avoid memory leaks. Idempotent. (item 3)
|
|
23
|
+
|
|
24
|
+
### Tests
|
|
25
|
+
- Coverage: 112 specs, line coverage 97.37% (SimpleCov).
|
|
26
|
+
- New specs: Record, Storage::Local, Storage::S3, Storage factory, GlueRunner, Observability, Configuration, JsonType, Validations, Engine (validation), FileIngestor (validation + CSV/JSON/Parquet ingestion).
|
|
27
|
+
|
|
3
28
|
## [0.1.19] - 2026-03-30
|
|
4
29
|
|
|
5
30
|
- Fix: `Record.build_query_path` now uses `partition_keys` as the source of truth for ordering, ignoring the order of the caller's kwargs. Previously, passing `where(year: 2026, isp_id: 42)` in a different order produced a path that did not match the Hive structure on disk.
|
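The 0.1.19 fix in the changelog can be illustrated with a minimal sketch: the path is derived by iterating `partition_keys` in their declared order, so the order of the caller's kwargs cannot affect the result. The helper name `build_partition_path` and the `*` wildcard for unspecified keys are assumptions made for illustration; this is not the gem's actual `Record.build_query_path`.

```ruby
# Hypothetical helper (not part of data_drain): builds a Hive-style path
# by walking partition_keys in declared order and looking each key up in
# the caller's filters, whatever order those filters were given in.
def build_partition_path(partition_keys, filters)
  segments = partition_keys.map do |key|
    # Accept symbol or string keys in the filters hash.
    value = filters.fetch(key.to_sym) { filters[key.to_s] }
    # Assumption: a missing key becomes a glob wildcard.
    value.nil? ? "#{key}=*" : "#{key}=#{value}"
  end
  segments.join('/')
end

keys = %i[isp_id year month]
puts build_partition_path(keys, { year: 2026, isp_id: 42 })
puts build_partition_path(keys, { isp_id: 42, year: 2026 })
```

Both calls should print the same path, which is the property the fix is about.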
data/CLAUDE.md
CHANGED
|
@@ -40,6 +40,10 @@ end
|
|
|
40
40
|
### Idempotency
|
|
41
41
|
Exports use DuckDB's `OVERWRITE_OR_IGNORE 1`. Processes are safe to retry.
|
|
42
42
|
|
|
43
|
+
### SQL identifier validation
|
|
44
|
+
|
|
45
|
+
`Engine#initialize` and `FileIngestor#initialize` validate `table_name`, `primary_key` and `folder_name` against the regex `\A[a-zA-Z_][a-zA-Z0-9_]*\z`. Values with special characters (`.`, `;`, spaces, quotes) raise `DataDrain::ConfigurationError`. `select_sql` and `where_clause` are still treated as trusted input.
|
|
46
|
+
|
|
43
47
|
### `idle_in_transaction_session_timeout`
|
|
44
48
|
The value `0` **disables** the timeout (no limit). This is mandatory for high-volume purges. Internally, the setting must be checked with `!nil?` so that an explicit `0` is still applied instead of being treated as unset.
|
|
45
49
|
|
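The identifier validation described in the CLAUDE.md section can be sketched as follows. Only the regex comes from the diff; the `ConfigurationError` class here is a local stand-in for `DataDrain::ConfigurationError`, and `validate_identifier!` is a hypothetical name, since the contents of `validations.rb` are not shown.

```ruby
# Local stand-in for DataDrain::ConfigurationError (assumption).
class ConfigurationError < StandardError; end

# Regex quoted in the changelog: a letter or underscore first, then
# letters, digits or underscores. \A and \z anchor the whole string,
# so embedded newlines are rejected too.
IDENTIFIER_REGEX = /\A[a-zA-Z_][a-zA-Z0-9_]*\z/

# Hypothetical helper: returns the identifier as a String when valid,
# raises otherwise.
def validate_identifier!(name, value)
  candidate = value.to_s
  return candidate if IDENTIFIER_REGEX.match?(candidate)

  raise ConfigurationError, "#{name} is not a valid SQL identifier: #{value.inspect}"
end

puts validate_identifier!(:table_name, 'versions')

begin
  validate_identifier!(:table_name, 'public.versions')
rescue ConfigurationError => e
  puts e.message
end
```

Rejecting the value outright, rather than trying to quote or escape it, keeps the rule easy to audit; the trade-off is that schema-qualified names such as `public.versions` are no longer accepted.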
data/README.md
CHANGED
|
@@ -1,142 +1,107 @@
|
|
|
1
1
|
# DataDrain
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
Ruby micro-framework to extract, archive and purge historical data from PostgreSQL into a Data Lake (S3 or local disk) in Parquet format, using in-memory DuckDB.
|
|
4
4
|
|
|
5
|
-
|
|
5
|
+
## Features
|
|
6
6
|
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
|
|
14
|
-
* **Built-in Analytical ORM:** Includes a base class (`DataDrain::Record`) compatible with `ActiveModel` to query and destroy historical partitions idiomatically.
|
|
15
|
-
* **Structured Observability:** All events emit logs in `key=value` format compatible with Datadog, CloudWatch and `exis_ray`. Logging failures never interrupt the main flow.
|
|
7
|
+
- **High-performance ETL:** millions of records from Postgres to Parquet without loading objects into Ruby's RAM.
|
|
8
|
+
- **File ingestion:** converts local CSV, JSON or Parquet files to partitioned Parquet (ZSTD) and uploads them to S3.
|
|
9
|
+
- **Hive partitioning:** organizes files as `key=val/key=val/...` for efficient prefix scans.
|
|
10
|
+
- **Storage adapters:** transparent support for local disk and AWS S3.
|
|
11
|
+
- **Guaranteed integrity:** mathematical verification of Postgres vs Parquet before any `DELETE`.
|
|
12
|
+
- **Analytical ORM:** base class `DataDrain::Record` (`ActiveModel`-compatible) to query and purge historical partitions.
|
|
13
|
+
- **Structured observability:** `key=value` logs compatible with Datadog, CloudWatch and `exis_ray`. Logger failures never interrupt the main flow.
|
|
16
14
|
|
|
17
15
|
## Installation
|
|
18
16
|
|
|
19
|
-
Add this line to the `Gemfile` of your application or microservice:
|
|
20
|
-
|
|
21
17
|
```ruby
|
|
18
|
+
# Gemfile
|
|
22
19
|
gem 'data_drain', git: 'https://github.com/gedera/data_drain.git', branch: 'main'
|
|
23
20
|
```
|
|
24
21
|
|
|
25
|
-
And run:
|
|
26
22
|
```bash
|
|
27
|
-
|
|
23
|
+
bundle install
|
|
28
24
|
```
|
|
29
25
|
|
|
30
26
|
## Configuration
|
|
31
27
|
|
|
32
|
-
Create an initializer in your application (e.g. `config/initializers/data_drain.rb`) to configure the credentials and the engine's behavior:
|
|
33
|
-
|
|
34
28
|
```ruby
|
|
29
|
+
# config/initializers/data_drain.rb
|
|
35
30
|
DataDrain.configure do |config|
|
|
36
|
-
|
|
37
|
-
config.storage_mode = ENV.fetch('STORAGE_MODE', 'local').to_sym
|
|
31
|
+
config.storage_mode = ENV.fetch('STORAGE_MODE', 'local').to_sym # :local or :s3
|
|
38
32
|
|
|
39
|
-
# AWS S3 (
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
33
|
+
# AWS S3 (only if storage_mode == :s3)
|
|
34
|
+
config.aws_region = ENV['AWS_REGION']
|
|
35
|
+
config.aws_access_key_id = ENV['AWS_ACCESS_KEY_ID']
|
|
36
|
+
config.aws_secret_access_key = ENV['AWS_SECRET_ACCESS_KEY']
|
|
43
37
|
|
|
44
|
-
#
|
|
38
|
+
# Source PostgreSQL (Engine only)
|
|
45
39
|
config.db_host = ENV.fetch('DB_HOST', '127.0.0.1')
|
|
46
40
|
config.db_port = ENV.fetch('DB_PORT', '5432')
|
|
47
41
|
config.db_user = ENV.fetch('DB_USER', 'postgres')
|
|
48
42
|
config.db_pass = ENV.fetch('DB_PASS', '')
|
|
49
43
|
config.db_name = ENV.fetch('DB_NAME', 'core_production')
|
|
50
44
|
|
|
51
|
-
#
|
|
52
|
-
config.batch_size
|
|
53
|
-
config.throttle_delay
|
|
45
|
+
# Purge tuning
|
|
46
|
+
config.batch_size = 5000 # records per DELETE
|
|
47
|
+
config.throttle_delay = 0.5 # seconds between batches
|
|
48
|
+
config.idle_in_transaction_session_timeout = 0 # 0 = DISABLED (mandatory for mass purges)
|
|
54
49
|
|
|
55
|
-
#
|
|
56
|
-
#
|
|
57
|
-
|
|
58
|
-
config.idle_in_transaction_session_timeout = 0
|
|
50
|
+
# DuckDB tuning
|
|
51
|
+
config.limit_ram = '2GB' # avoids OOM in containers
|
|
52
|
+
config.tmp_directory = '/tmp/duckdb_work' # spill-to-disk (prefer SSD/NVMe)
|
|
59
53
|
|
|
60
54
|
config.logger = Rails.logger
|
|
61
|
-
|
|
62
|
-
# DuckDB tuning
|
|
63
|
-
# Maximum RAM limit for DuckDB's in-memory queries (e.g. '2GB', '512MB').
|
|
64
|
-
# Prevents the process from being killed by OOM in memory-limited containers.
|
|
65
|
-
config.limit_ram = '2GB'
|
|
66
|
-
|
|
67
|
-
# DuckDB temporary directory for spilling memory to disk during
|
|
68
|
-
# heavy transformations or the creation of massive Parquet files.
|
|
69
|
-
# This directory should reside on a fast SSD/NVMe disk.
|
|
70
|
-
config.tmp_directory = '/tmp/duckdb_work'
|
|
71
55
|
end
|
|
72
56
|
```
|
|
73
57
|
|
|
74
58
|
## Usage
|
|
75
59
|
|
|
76
|
-
|
|
77
|
-
|
|
78
|
-
### 1. Raw File Ingestion (FileIngestor)
|
|
79
|
-
|
|
80
|
-
Ideal for services that generate large volumes of data (e.g. Netflow metrics). It takes a local file, transforms it, compresses it to Parquet and uploads it partitioned to S3.
|
|
60
|
+
### Raw file ingestion (FileIngestor)
|
|
81
61
|
|
|
82
62
|
```ruby
|
|
83
|
-
|
|
63
|
+
DataDrain::FileIngestor.new(
|
|
84
64
|
bucket: 'my-bucket-store',
|
|
85
|
-
source_path: '/tmp/
|
|
65
|
+
source_path: '/tmp/netflow_metrics.csv',
|
|
86
66
|
folder_name: 'netflow',
|
|
87
67
|
partition_keys: %w[isp_id year month],
|
|
88
68
|
select_sql: "*, EXTRACT(YEAR FROM timestamp) AS year, EXTRACT(MONTH FROM timestamp) AS month",
|
|
89
69
|
delete_after_upload: true
|
|
90
|
-
)
|
|
91
|
-
|
|
92
|
-
ingestor.call
|
|
70
|
+
).call
|
|
93
71
|
```
|
|
94
72
|
|
|
95
|
-
###
|
|
96
|
-
|
|
97
|
-
Ideal for creating rolling retention windows (e.g. keeping only 6 months of live data in Postgres and archiving the rest).
|
|
73
|
+
### Extraction and purge (Engine)
|
|
98
74
|
|
|
99
|
-
|
|
75
|
+
Rolling retention windows: archive the data from 6 months back and purge the source.
|
|
100
76
|
|
|
101
77
|
```ruby
|
|
102
|
-
|
|
78
|
+
DataDrain::Engine.new(
|
|
103
79
|
bucket: 'my-bucket-store',
|
|
104
80
|
start_date: 6.months.ago.beginning_of_month,
|
|
105
81
|
end_date: 6.months.ago.end_of_month,
|
|
106
82
|
table_name: 'versions',
|
|
107
83
|
partition_keys: %w[year month]
|
|
108
|
-
)
|
|
109
|
-
|
|
110
|
-
engine.call
|
|
84
|
+
).call
|
|
111
85
|
```
|
|
112
86
|
|
|
113
|
-
|
|
87
|
+
### `skip_export` mode (delegate export to Glue/EMR)
|
|
114
88
|
|
|
115
|
-
|
|
89
|
+
DataDrain only verifies integrity and purges; the export was already done by another tool.
|
|
116
90
|
|
|
117
91
|
```ruby
|
|
118
|
-
|
|
92
|
+
DataDrain::Engine.new(
|
|
119
93
|
bucket: 'my-bucket-store',
|
|
120
94
|
start_date: 6.months.ago.beginning_of_month,
|
|
121
95
|
end_date: 6.months.ago.end_of_month,
|
|
122
96
|
table_name: 'versions',
|
|
123
97
|
partition_keys: %w[year month],
|
|
124
98
|
skip_export: true
|
|
125
|
-
)
|
|
126
|
-
|
|
127
|
-
engine.call
|
|
99
|
+
).call
|
|
128
100
|
```
|
|
129
101
|
|
|
130
|
-
###
|
|
131
|
-
|
|
132
|
-
For high-volume tables (**e.g. > 500GB or 1TB**), delegating the data movement to **AWS Glue** (based on Apache Spark) is recommended to avoid saturating the Ruby server. `DataDrain` acts as the orchestrator that triggers the Job, waits for it to finish and then performs the validation and purge.
|
|
102
|
+
### Orchestration with AWS Glue (1TB+ tables)
|
|
133
103
|
|
|
134
104
|
```ruby
|
|
135
|
-
config = DataDrain.configuration
|
|
136
|
-
bucket = "my-bucket"
|
|
137
|
-
table = "versions"
|
|
138
|
-
|
|
139
|
-
# 1. Trigger the Glue Job and wait for it to finish successfully
|
|
140
105
|
DataDrain::GlueRunner.run_and_wait(
|
|
141
106
|
"my-glue-export-job",
|
|
142
107
|
{
|
|
@@ -152,136 +117,66 @@ DataDrain::GlueRunner.run_and_wait(
|
|
|
152
117
|
}
|
|
153
118
|
)
|
|
154
119
|
|
|
155
|
-
# 2. Once Glue has exported the TB, DataDrain validates integrity and purges Postgres
|
|
156
120
|
DataDrain::Engine.new(
|
|
157
|
-
bucket:
|
|
158
|
-
|
|
159
|
-
|
|
160
|
-
end_date: end_date,
|
|
161
|
-
table_name: table,
|
|
162
|
-
partition_keys: %w[isp_id year month],
|
|
163
|
-
skip_export: true
|
|
121
|
+
bucket:, folder_name: table, start_date:, end_date:,
|
|
122
|
+
table_name: table, partition_keys: %w[isp_id year month],
|
|
123
|
+
skip_export: true
|
|
164
124
|
).call
|
|
165
125
|
```
|
|
166
126
|
|
|
167
|
-
|
|
168
|
-
|
|
169
|
-
Create a Job in the AWS Glue console (Spark 4.0+) and use this script as a base:
|
|
170
|
-
|
|
171
|
-
```python
|
|
172
|
-
import sys
|
|
173
|
-
from awsglue.utils import getResolvedOptions
|
|
174
|
-
from pyspark.context import SparkContext
|
|
175
|
-
from awsglue.context import GlueContext
|
|
176
|
-
from awsglue.job import Job
|
|
177
|
-
from pyspark.sql.functions import col, year, month
|
|
178
|
-
|
|
179
|
-
args = getResolvedOptions(sys.argv, [
|
|
180
|
-
'JOB_NAME', 'start_date', 'end_date', 's3_bucket', 's3_folder',
|
|
181
|
-
'db_url', 'db_user', 'db_password', 'db_table', 'partition_by'
|
|
182
|
-
])
|
|
183
|
-
|
|
184
|
-
sc = SparkContext()
|
|
185
|
-
glueContext = GlueContext(sc)
|
|
186
|
-
spark = glueContext.spark_session
|
|
187
|
-
job = Job(glueContext)
|
|
188
|
-
job.init(args['JOB_NAME'], args)
|
|
189
|
-
|
|
190
|
-
options = {
|
|
191
|
-
"url": args['db_url'],
|
|
192
|
-
"dbtable": args['db_table'],
|
|
193
|
-
"user": args['db_user'],
|
|
194
|
-
"password": args['db_password'],
|
|
195
|
-
"sampleQuery": f"SELECT * FROM {args['db_table']} WHERE created_at >= '{args['start_date']}' AND created_at < '{args['end_date']}'"
|
|
196
|
-
}
|
|
197
|
-
|
|
198
|
-
df = spark.read.format("jdbc").options(**options).load()
|
|
199
|
-
|
|
200
|
-
# Add the derived columns needed for the partitions.
|
|
201
|
-
# isp_id already exists in the source table; only add the computed ones.
|
|
202
|
-
# Customize this section according to each table's partition_keys.
|
|
203
|
-
df_final = df.withColumn("year", year(col("created_at"))) \
|
|
204
|
-
.withColumn("month", month(col("created_at")))
|
|
205
|
-
|
|
206
|
-
output_path = f"s3://{args['s3_bucket']}/{args['s3_folder']}/"
|
|
207
|
-
partitions = args['partition_by'].split(",")
|
|
208
|
-
|
|
209
|
-
df_final.write.mode("overwrite") \
|
|
210
|
-
.partitionBy(*partitions) \
|
|
211
|
-
.format("parquet") \
|
|
212
|
-
.option("compression", "zstd") \
|
|
213
|
-
.save(output_path)
|
|
214
|
-
|
|
215
|
-
job.commit()
|
|
216
|
-
```
|
|
217
|
-
|
|
218
|
-
### 4. Query the Data Lake (Record)
|
|
219
|
-
|
|
220
|
-
To query the archived data without leaving Ruby, create a model that inherits from `DataDrain::Record`.
|
|
127
|
+
### Query the Data Lake (Record)
|
|
221
128
|
|
|
222
129
|
```ruby
|
|
223
|
-
# app/models/archived_version.rb
|
|
224
130
|
class ArchivedVersion < DataDrain::Record
|
|
225
|
-
self.bucket
|
|
226
|
-
self.folder_name
|
|
227
|
-
self.partition_keys = [:isp_id, :year, :month]
|
|
131
|
+
self.bucket = 'my-bucket-storage'
|
|
132
|
+
self.folder_name = 'versions'
|
|
133
|
+
self.partition_keys = [:isp_id, :year, :month] # order = Hive hierarchy
|
|
228
134
|
|
|
229
135
|
attribute :id, :string
|
|
230
136
|
attribute :item_type, :string
|
|
231
|
-
attribute :item_id, :string
|
|
232
137
|
attribute :event, :string
|
|
233
|
-
attribute :whodunnit, :string
|
|
234
138
|
attribute :created_at, :datetime
|
|
235
139
|
attribute :object, :json
|
|
236
140
|
attribute :object_changes, :json
|
|
237
141
|
end
|
|
238
|
-
```
|
|
239
|
-
|
|
240
|
-
Optimized queries through Hive Partitioning:
|
|
241
142
|
|
|
242
|
-
```ruby
|
|
243
143
|
# Point lookup isolating the exact partition
|
|
244
|
-
|
|
245
|
-
puts version.object_changes # => {"status" => ["active", "suspended"]}
|
|
144
|
+
ArchivedVersion.find("uuid", isp_id: 42, year: 2026, month: 3)
|
|
246
145
|
|
|
247
146
|
# Collections
|
|
248
|
-
|
|
249
|
-
```
|
|
147
|
+
ArchivedVersion.where(limit: 10, isp_id: 42, year: 2026, month: 3)
|
|
250
148
|
|
|
251
|
-
|
|
252
|
-
|
|
253
|
-
|
|
254
|
-
|
|
255
|
-
```ruby
|
|
256
|
-
# Deletes a customer's entire history across all years
|
|
257
|
-
ArchivedVersion.destroy_all(isp_id: 42)
|
|
258
|
-
|
|
259
|
-
# Deletes all data for March 2024 globally
|
|
260
|
-
ArchivedVersion.destroy_all(year: 2024, month: 3)
|
|
149
|
+
# Deletion (retention and compliance)
|
|
150
|
+
ArchivedVersion.destroy_all(isp_id: 42) # a customer's entire history
|
|
151
|
+
ArchivedVersion.destroy_all(year: 2024, month: 3) # one month globally
|
|
261
152
|
```
|
|
262
153
|
|
|
263
|
-
##
|
|
264
|
-
|
|
265
|
-
DataDrain implements the **Storage Adapter** pattern, which fully isolates the file-system logic from the processing engines.
|
|
154
|
+
## Critical conventions
|
|
266
155
|
|
|
267
|
-
|
|
268
|
-
|
|
269
|
-
|
|
156
|
+
- **Half-open date ranges:** always `created_at >= START AND created_at < END_BOUNDARY`. Never `<= end_of_day`.
|
|
157
|
+
- **Order of `partition_keys`:** must match between writing (Engine/FileIngestor) and reading (Record). Mismatch → DuckDB returns an empty result without raising an error.
|
|
158
|
+
- **Changing `storage_mode` at runtime:** call `DataDrain::Storage.reset_adapter!` afterwards.
|
|
159
|
+
- **`verify_integrity`** is the only safeguard before purging. If it fails, the flow returns `false` and aborts the `DELETE`.
|
|
270
160
|
|
|
271
161
|
## Observability
|
|
272
162
|
|
|
273
|
-
All events emit structured logs in `key=value` format that can be processed by tools such as Datadog, CloudWatch Logs Insights or `exis_ray`:
|
|
274
|
-
|
|
275
163
|
```
|
|
276
164
|
component=data_drain event=engine.complete table=versions duration_s=12.4 export_duration_s=8.1 purge_duration_s=3.9 count=150000
|
|
277
|
-
component=data_drain event=engine.integrity_error table=versions duration_s=5.2 count=150000
|
|
278
165
|
component=data_drain event=engine.purge_heartbeat table=versions batches_processed_count=100 rows_deleted_count=500000
|
|
279
|
-
component=data_drain event=file_ingestor.complete source_path=/tmp/data.csv duration_s=2.1 count=85000
|
|
280
166
|
component=data_drain event=glue_runner.failed job=my-export-job run_id=jr_abc123 status=FAILED duration_s=301.0
|
|
281
167
|
```
|
|
282
168
|
|
|
283
|
-
|
|
169
|
+
`key=value` format. Durations use the `_s` suffix (Float). Counters use the `_count` suffix (Integer). No units in values. Internal logger failures never interrupt the main flow.
|
|
170
|
+
|
|
171
|
+
## Contributing
|
|
172
|
+
|
|
173
|
+
```bash
|
|
174
|
+
bundle install
|
|
175
|
+
bundle exec rspec # tests
|
|
176
|
+
bundle exec rubocop # linting
|
|
177
|
+
bin/console # REPL
|
|
178
|
+
```
|
|
284
179
|
|
|
285
180
|
## License
|
|
286
181
|
|
|
287
|
-
|
|
182
|
+
MIT.
|
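The half-open date range convention from the README's critical conventions can be sketched with the Ruby standard library alone. The helper names `half_open_month_range` and `date_range_clause` are hypothetical and not part of data_drain; the sketch only shows why an exclusive upper boundary avoids the `<= end_of_day` edge cases.

```ruby
require 'date'

# Hypothetical helper: returns [start, end_boundary] for a calendar month,
# where end_boundary is the first day of the NEXT month (exclusive).
def half_open_month_range(year, month)
  start_date = Date.new(year, month, 1)
  [start_date, start_date.next_month]
end

# Hypothetical helper: renders the semi-open predicate quoted in the README.
# The column name is assumed to be a trusted, already-validated identifier.
def date_range_clause(column, start_date, end_boundary)
  "#{column} >= '#{start_date.iso8601}' AND #{column} < '#{end_boundary.iso8601}'"
end

start_date, end_boundary = half_open_month_range(2026, 3)
puts date_range_clause('created_at', start_date, end_boundary)
```

Because the upper bound is exclusive, consecutive months never overlap and no row with a timestamp in the last second of the month is missed, regardless of timestamp precision.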