data_drain 0.1.18 → 0.1.19
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +7 -0
- data/CLAUDE.md +18 -0
- data/README.md +9 -6
- data/lib/data_drain/record.rb +2 -2
- data/lib/data_drain/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c3b2ce171059217fbb96cf1d1f93e9bce121b31e0afdf73eaa3889d5dca38d5c
+  data.tar.gz: 14600532ba59fd8daf0ec7e1890175211402172d643481539980da8f54799f9b
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d08d3a7391a2b1ec4ab4b5e9c6f3d894bd5a8d1f46cc1d93f4324559f7a92e9a4150e689f3ca990afecdf33817cbfd3259f9c6bd7162040742ad2fdda3ae3661
+  data.tar.gz: 863f1be6a3e391fe32c63b88a2d944443159d984fbf74f598dba58cbc44ffd8c4a5dc14cafcede6182c620d8b8580f9bec2225d298e75c50215766a87b56cb4a
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,12 @@
 ## [Unreleased]
 
+## [0.1.19] - 2026-03-30
+
+- Fix: `Record.build_query_path` now uses `partition_keys` as the source of truth for ordering, ignoring the order of the caller's kwargs. Previously, passing `where(year: 2026, isp_id: 42)` in a different order generated a path that did not match the Hive structure on disk.
+- Fix: `GlueRunner` replaces ActiveSupport's `.truncate(200)` with pure Ruby's `[0, 200]`, removing the implicit dependency.
+- Convention: the canonical `partition_keys` order is `[primary_dimension, year, month]` (e.g. `isp_id` first). Documented in CLAUDE.md and updated in the README, specs, and PySpark examples.
+- Docs: README updated with correct production examples for Glue + Engine + Record.
+
 ## [0.1.18] - 2026-03-23
 
 - Feature: `Observability` module centralizes structured logging across the gem.
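The `.truncate(200)` to `[0, 200]` change in the changelog leans on plain `String#[]`; a minimal sketch of the slicing behavior it relies on (the `msg` strings are illustrative, not from the gem):

```ruby
# String#[] with (start, length) needs no ActiveSupport and is safe
# when the requested length runs past the end of the string.
msg = "x" * 500
puts msg[0, 200].length # => 200
puts "short"[0, 200]    # => short
```

Note that ActiveSupport's `truncate(200)` appends an omission marker ("...") when it shortens a string, so the two are not byte-identical on long inputs; the fix trades that marker for dropping the dependency.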
data/CLAUDE.md
CHANGED
@@ -19,6 +19,24 @@ created_at >= 'START' AND created_at < 'END_BOUNDARY'
 ```
 Where `END_BOUNDARY` is the start of the next period (e.g. `next_day.beginning_of_day`). Never use `<= end_of_day`: microseconds at the boundary can fall outside it.
 
+### Partition Keys: Order and Contract
+
+The `partition_keys` array is **fully dynamic**: each table/model defines its own. There is no standard order in the library.
+
+**Critical rule:** the order of `partition_keys` when **writing** (Engine/FileIngestor) must be identical to the one declared in the **Record** model that reads those files. A mismatch produces paths that do not line up, and DuckDB returns empty results without raising an error.
+
+```ruby
+# Writing
+Engine.new(partition_keys: %w[isp_id year month], ...)
+
+# Reading: must match
+class ArchivedVersion < DataDrain::Record
+  self.partition_keys = [:isp_id, :year, :month]
+end
+```
+
+**Design criterion for the order:** the first key should be the highest-cardinality dimension, or the one most often used as a filter (e.g. `isp_id` if queries are always per ISP). This determines the Hive folder hierarchy and S3 prefix-scan performance.
+
 ### Idempotency
 Exports use DuckDB's `OVERWRITE_OR_IGNORE 1`. Processes are safe to retry.
 
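The write/read contract above can be illustrated with a small pure-Ruby sketch (`hive_path` is a hypothetical helper, not part of the gem's internals): the Hive prefix is just the keys joined in order, so a different order yields a different path.

```ruby
# Build a Hive-style partition prefix from an ordered key list.
def hive_path(partition_keys, values)
  partition_keys.map { |k| "#{k}=#{values[k.to_sym]}" }.join("/")
end

values = { isp_id: 42, year: 2026, month: 3 }

write_path = hive_path(%w[isp_id year month], values)
read_path  = hive_path(%w[year month isp_id], values)

puts write_path              # => isp_id=42/year=2026/month=3
puts read_path               # => year=2026/month=3/isp_id=42
puts write_path == read_path # => false
```

A reader declaring the second order would scan `year=.../month=.../isp_id=...` prefixes that were never written, which is exactly the silent-empty-result failure the rule guards against.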
data/README.md
CHANGED
@@ -84,7 +84,7 @@ ingestor = DataDrain::FileIngestor.new(
   bucket: 'my-bucket-store',
   source_path: '/tmp/netflow_metrics_1600.csv',
   folder_name: 'netflow',
-  partition_keys: %w[year month
+  partition_keys: %w[isp_id year month],
   select_sql: "*, EXTRACT(YEAR FROM timestamp) AS year, EXTRACT(MONTH FROM timestamp) AS month",
   delete_after_upload: true
 )

@@ -148,7 +148,7 @@ DataDrain::GlueRunner.run_and_wait(
   "--db_user" => config.db_user,
   "--db_password" => config.db_pass,
   "--db_table" => table,
-  "--partition_by" => "year,month
+  "--partition_by" => "isp_id,year,month"
 }
 )

@@ -159,7 +159,7 @@ DataDrain::Engine.new(
   start_date: start_date,
   end_date: end_date,
   table_name: table,
-  partition_keys: %w[year month
+  partition_keys: %w[isp_id year month],
   skip_export: true
 ).call

@@ -197,6 +197,9 @@ options = {
 
 df = spark.read.format("jdbc").options(**options).load()
 
+# Add the derived columns needed for the partitions.
+# isp_id already exists in the source table; only add the computed ones.
+# Customize this section based on each table's partition_keys.
 df_final = df.withColumn("year", year(col("created_at"))) \
     .withColumn("month", month(col("created_at")))

@@ -221,7 +224,7 @@ To query the archived data without leaving Ruby, create a model that inherits
 class ArchivedVersion < DataDrain::Record
   self.bucket = 'my-bucket-storage'
   self.folder_name = 'versions'
-  self.partition_keys = [:
+  self.partition_keys = [:isp_id, :year, :month]
 
   attribute :id, :string
   attribute :item_type, :string

@@ -238,11 +241,11 @@ Queries optimized via Hive Partitioning:
 
 ```ruby
 # Point lookup isolating the exact partition
-version = ArchivedVersion.find("un-uuid", year: 2026, month: 3
+version = ArchivedVersion.find("un-uuid", isp_id: 42, year: 2026, month: 3)
 puts version.object_changes # => {"status" => ["active", "suspended"]}
 
 # Collections
-history = ArchivedVersion.where(limit: 10, year: 2026, month: 3
+history = ArchivedVersion.where(limit: 10, isp_id: 42, year: 2026, month: 3)
 ```
 
 ### 5. Data Destruction (Retention and Compliance)
data/lib/data_drain/record.rb
CHANGED
@@ -11,7 +11,7 @@ module DataDrain
   # @example
   #   class ArchivedVersion < DataDrain::Record
   #     self.folder_name = 'versions'
-  #     self.partition_keys = [:
+  #     self.partition_keys = [:isp_id, :year, :month]
   #     attribute :event, :string
   #   end
   class Record

@@ -110,7 +110,7 @@ module DataDrain
   # @param partitions [Hash]
   # @return [String]
   def build_query_path(partitions)
-    partition_path =
+    partition_path = partition_keys.map { |k| "#{k}=#{partitions[k.to_sym] || partitions[k.to_s]}" }.join("/")
     DataDrain::Storage.adapter.build_path(bucket, folder_name, partition_path)
   end
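The ordering guarantee of the fixed `build_query_path` can be sketched with a stand-in class (`FakeRecord` is hypothetical; the real `DataDrain::Record` also wires up a storage adapter, which this sketch omits): the path segments follow `partition_keys`, not the caller's hash order.

```ruby
# Stand-in reproducing only the path-segment logic from the diff above.
class FakeRecord
  def self.partition_keys
    [:isp_id, :year, :month]
  end

  def self.build_query_path(partitions)
    partition_keys
      .map { |k| "#{k}=#{partitions[k.to_sym] || partitions[k.to_s]}" }
      .join("/")
  end
end

# Same result regardless of kwargs order:
a = FakeRecord.build_query_path(year: 2026, month: 3, isp_id: 42)
b = FakeRecord.build_query_path(isp_id: 42, year: 2026, month: 3)
puts a      # => isp_id=42/year=2026/month=3
puts a == b # => true
```

The `partitions[k.to_sym] || partitions[k.to_s]` fallback also lets callers pass string or symbol keys interchangeably.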
data/lib/data_drain/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: data_drain
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.19
 platform: ruby
 authors:
 - Gabriel
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2026-03-
+date: 2026-03-30 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activemodel