@revos/cli 0.1.0 → 0.1.1
- package/bin/revos.js +1 -1
- package/dist/adapters/oclif/commands/auth/login.mjs +2 -2
- package/dist/adapters/oclif/commands/auth/logout.mjs +2 -2
- package/dist/adapters/oclif/commands/auth/status.mjs +2 -2
- package/dist/adapters/oclif/commands/init.mjs +5 -4
- package/dist/adapters/oclif/commands/org/current.mjs +3 -3
- package/dist/adapters/oclif/commands/org/list.mjs +3 -3
- package/dist/adapters/oclif/commands/org/switch.mjs +3 -3
- package/dist/adapters/oclif/commands/overlays/diff.mjs +3 -3
- package/dist/adapters/oclif/commands/overlays/pull.mjs +3 -3
- package/dist/adapters/oclif/commands/overlays/push.mjs +3 -3
- package/dist/adapters/oclif/commands/overlays/status.mjs +3 -3
- package/dist/{base.command-BGM225ik.mjs → base.command-DlYMawJ6.mjs} +1 -1
- package/dist/{core-Bif-kxlo.mjs → core-Dq15hO6f.mjs} +70 -208
- package/dist/{index-C0e8MXGP.d.mts → index-DuqD2b_7.d.mts} +2 -8
- package/dist/index.d.mts +1 -1
- package/dist/index.mjs +1 -1
- package/dist/templates/.devcontainer/Dockerfile +14 -0
- package/dist/templates/.devcontainer/devcontainer.json +54 -0
- package/dist/templates/.devcontainer/setup.sh +32 -0
- package/dist/templates/AGENTS.md +2 -3
- package/dist/templates/CLAUDE.md +0 -16
- package/dist/templates/README.md +23 -0
- package/dist/templates/dbt/dbt_project.yml +22 -0
- package/dist/templates/index.ts +4 -0
- package/dist/templates/skills/create-semantic-model/SKILL.md +1611 -0
- package/dist/templates/skills/explore-lakehouse/SKILL.md +131 -0
- package/package.json +1 -3
@@ -0,0 +1,1611 @@
---
name: create-semantic-model
description: Create semantic models (Cube.dev cubes) from existing RevOS dbt gold models. Use when asked to build a semantic layer, create cubes, or generate Cube definitions from dbt.
---

@.claude/skills/create_dbt_transformations/SKILL.md

# Create Semantic Model

Use this skill when the user asks to create a semantic model or semantic overlay.

Typical user requests:

> Create a new semantic overlay for `<transformation>`.

> Create a semantic model from the available gold models.

> Build a Cube semantic layer for these gold tables.

This skill analyzes existing dbt gold models, asks the user which models should participate, detects keys and relationships, validates join candidates with SQL before proposing them to the user, asks the user to confirm the semantic structure and measures, and generates Cube.dev semantic overlays under `semantic/cubes/`.

## Skill Dependencies

This skill delegates dbt-related knowledge to `create_dbt_transformations` (loaded above):

- Project layout and how to find gold models
- Resolving a dbt model name to its physical BigQuery table reference
- Creating new dbt support models (bridge models from JSON arrays)
- dbt validation commands

If BigQuery exploration is needed (listing tables, inspecting schemas, previewing rows, checking null rates beyond what is required for join validation), use the `explore-lakehouse` skill at `.claude/skills/explore-lakehouse/SKILL.md`. Load it only when those capabilities are actually required.

---

## Purpose

Expose existing dbt gold models as queryable Cube.dev semantic models without manually writing YAML boilerplate.

Gold models may be dbt tables or dbt views. Treat both as valid semantic sources.

This skill does not build gold models from silver. If a needed gold model is missing, hand off to `create_dbt_transformations`.

The expected flow is:

```text
dbt/models/gold/
  -> discover available gold models (via create_dbt_transformations)
  -> ask the user which gold models should participate
  -> inspect selected model schemas from the lakehouse
  -> detect primary keys, secondary keys, foreign keys, JSON/array keys, and candidate relationships
  -> if selected models are not directly connected, search remaining gold models for connector/intermediate models
  -> ask the user to approve connector or bridge/support models when needed
  -> validate candidate joins with SQL before proposing them as relationships
  -> present validated relationships and join directions to the user
  -> ask the user to confirm relationships
  -> generate dimensions from all selected columns by default
  -> generate default count measure
  -> suggest useful additional measures
  -> ask the user to confirm or define custom measures
  -> generate Cube.dev semantic overlays in semantic/cubes/
  -> validate generated files
```

---

## Execution Order

1. Discover available gold models (use `create_dbt_transformations` for navigation).
2. If the user named a specific transformation/model, find that gold model first.
3. Ask the user which gold models should participate, unless the request clearly targets one specific model only.
4. Resolve `$GOOGLE_CLOUD_PROJECT` and `$REVOS_BQ_DATASET` to literal values via `echo` and keep them for the rest of the session.
5. Inspect schemas and column types for selected gold models.
6. Detect primary keys, secondary keys, foreign keys, JSON/array keys, and candidate relationships.
7. If selected models are disconnected, search for connector models among the remaining gold models.
8. Ask the user to approve connector or bridge/support models before adding them to the working scope.
9. Validate candidate joins with SQL before proposing them as relationships.
10. Present validated relationships, join directions, cardinality, and validation evidence to the user.
11. Ask the user to confirm or modify relationships.
12. Generate dimensions from all selected gold model columns by default.
13. Create the default `count` measure.
14. Suggest useful additional measures.
15. Ask the user to confirm suggested measures or define custom measures.
16. Generate Cube semantic overlays in `semantic/cubes/`.
17. Validate generated files.

Do not skip ahead.

---

## Naming Convention: Cube Files and Cube Names

Gold models in `dbt/models/gold/` use the `gold_` prefix in their file name (for example, `gold_hubspot_companies.sql`). The materialized table in BigQuery keeps the same name.

The semantic overlay layer drops the `gold_` prefix:

1. The gold SQL file: `gold_<entity>` (for example, `gold_hubspot_companies.sql`).
2. The materialized BigQuery table: `gold_<entity>` (for example, `gold_hubspot_companies`).
3. The semantic overlay YAML file: `<entity>.yml` (for example, `hubspot_companies.yml`).
4. The cube `name` inside that YAML file: `<entity>` (for example, `name: hubspot_companies`).
5. References to that cube from joins: `${<entity>}` (for example, `${hubspot_companies}`).

Mapping example:

```text
gold SQL file:        dbt/models/gold/gold_hubspot_companies.sql
BigQuery table:       gold_hubspot_companies
overlay YAML file:    semantic/cubes/hubspot_companies.yml
cube name:            hubspot_companies
join reference:       ${hubspot_companies}
```

Naming rules:

1. Always strip the `gold_` prefix when generating the overlay YAML file name.
2. The cube `name` must match the overlay YAML file name (without extension).
3. The cube `name` is what appears in `${...}` references inside join SQL.
4. The physical table reference in `sql_table` always uses the full BigQuery name with `gold_` prefix.
5. Bridge and support cube names follow the same rule. If the gold support model is `gold_deals_companies`, the cube name is `deals_companies` and the file is `deals_companies.yml`.
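
The naming rules above can be sketched as a pair of small shell helpers. This is an illustrative sketch only: the function names `cube_name_from_gold` and `overlay_path_from_gold` are hypothetical and are not part of the skill or the CLI.

```bash
# Hypothetical helper: derive the cube name from a gold model file or table name.
cube_name_from_gold() {
  # basename drops the directory and the .sql extension (if any);
  # the sed expression then strips the leading gold_ prefix.
  basename "$1" .sql | sed 's/^gold_//'
}

# Hypothetical helper: the overlay file path uses the same prefix-free name.
overlay_path_from_gold() {
  printf 'semantic/cubes/%s.yml\n' "$(cube_name_from_gold "$1")"
}

cube_name_from_gold dbt/models/gold/gold_hubspot_companies.sql
overlay_path_from_gold gold_deals_companies
```

The two calls print `hubspot_companies` and `semantic/cubes/deals_companies.yml`, matching the mapping example and rule 5.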

---

## Cube `sql_table` Reference

Cube overlays must reference the physical warehouse table directly, not through `dbt ref()`. Cube does not understand Jinja.

Use the literal `project` and `dataset` values captured during the "Resolve Environment Variables First" step at the start of Phase 2 (see Phase 2 below). If you reach Phase 8 without those literals already in memory, go back and run the resolution step first.

Wrap the literal in BigQuery backticks and use it as `sql_table`:

```yaml
sql_table: "`<resolved_project>.<resolved_dataset>.<gold_model_name>`"
```

Concrete example:

```yaml
sql_table: "`revos-dev.revos_1737556292084.gold_hubspot_companies`"
```

The cube `name` is `hubspot_companies` (no `gold_` prefix), but `sql_table` keeps the `gold_` prefix because that is the physical table.

Apply the same fully qualified format anywhere a physical table name is needed in a Cube overlay (notably in `refresh_key.sql`).
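
As a minimal sketch of the quoting involved, the helper below assembles the `sql_table` line from already-resolved literals. The function name `sql_table_ref` is hypothetical. The single-quoted `printf` format keeps the BigQuery backticks literal, so the shell does not treat them as command substitution.

```bash
# Hypothetical helper: build the sql_table line from resolved literals.
sql_table_ref() {
  # $1 = project, $2 = dataset, $3 = gold model name (keeps the gold_ prefix)
  printf 'sql_table: "`%s.%s.%s`"\n' "$1" "$2" "$3"
}

sql_table_ref revos-dev revos_1737556292084 gold_hubspot_companies
```

With the example values used in this section, the call reproduces the concrete `sql_table` line shown above.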

---

## Non-Negotiable Rules

1. Always start from existing dbt gold models in `dbt/models/gold/`. If no usable gold models exist, hand off to `create_dbt_transformations` and stop.
2. Gold models may be tables or views.
3. Always discover available gold models before doing anything else.
4. Always ask the user which discovered gold models should participate, unless the user clearly requested one specific model only.
5. Do not inspect selected schemas, analyze joins, generate overlays, or create support models until the model scope is clear.
6. Only selected models may be used for the main semantic model.
7. If selected models are not directly connected, search remaining gold models for connector/intermediate models.
8. During connector search, Claude may inspect column names and key-like fields of non-selected gold models only for relationship discovery.
9. Inspecting a non-selected model does not add it to the semantic model.
10. Connector models can only be added after explicit user approval.
11. If a many-to-many relationship requires a bridge/support model and no suitable gold model exists, ask the user whether to create one. If the user approves, hand off the bridge model creation to `create_dbt_transformations`.
12. Do not invent joins. If models cannot be connected directly or through approved connector/bridge models, generate separate semantic subgraphs and clearly report that they are disconnected.
13. Validate candidate joins with SQL before proposing them to the user as confirmed relationship options.
14. Ask the user to confirm validated relationships before generating semantic overlays.
15. For every confirmed relationship, generate joins in both directions where both cubes exist.
16. Validate joins in both directions where possible.
17. Relationship direction must be correct from the perspective of the current cube.
18. Preserve every business column from each selected gold model as a Cube dimension by default.
19. From technical Airbyte metadata columns (`_airbyte_*`), include only `_airbyte_extracted_at` as a dimension named `airbyte_extracted_at`. Exclude all other `_airbyte_*` columns by default.
20. Do not remove non-Airbyte columns just because they do not look immediately useful.
21. Measures are not all numeric columns by default. Always create `count`, suggest useful additional measures, and ask the user to confirm or define custom measures.
22. Keys may be stored in normal columns, JSON strings, JSON arrays, repeated fields, or nested structures. Always check for hidden relationship keys.
23. For JSON arrays, use `UNNEST(JSON_VALUE_ARRAY(...))`.
24. Bridge and junction cubes should use `public: false` where the project convention supports it.
25. Every generated Cube overlay must include `refresh_key`.
26. Prefer `_airbyte_extracted_at` for refresh keys when available. Otherwise fall back to `every: 1 hour`.
27. New Cube overlays must reference the physical warehouse table directly using the fully qualified name. Do not use `dbt ref()` in Cube overlays.
28. Cube `name` and overlay file name must drop the `gold_` prefix. The physical table reference in `sql_table` keeps the `gold_` prefix.
29. Follow the existing `semantic/cubes/` style in the repository. If existing overlays use map-style `dimensions`, `measures`, and `joins`, use map-style.

---

## Mandatory User Checkpoints

### Checkpoint 1: Gold Model Selection

After discovering gold models, show the available models and ask which ones should participate.

Do not proceed until the user selects models.

Exception: if the user explicitly requested one specific model, and that model exists in `dbt/models/gold/`, treat that model as selected. If the user appears to want a multi-model semantic layer, still ask whether related models should be included.

### Checkpoint 2a: Connector Model Approval

Triggered when selected models are not directly connected.

Search remaining gold models for connector/intermediate models. During this search, Claude may inspect column names and key-like fields of non-selected gold models only for relationship discovery. Inspecting a model does not add it to the semantic model.

If a connector path is found, explain why it is needed and ask whether to add it.

Example:

```text
You selected products and addresses.

I do not see a direct relationship between products and addresses.

However, I found a possible connector path:

products -> clients -> addresses

The clients model appears to connect products to addresses.

Should I include gold_clients as a connector model in this semantic model?
```

Do not use connector models until the user approves them.

### Checkpoint 2b: Bridge / Support Model Approval

Triggered when a many-to-many relationship is detected and no suitable existing gold bridge model exists.

Ask whether to create one:

```text
The relationship between deals and contacts appears to be many-to-many,
likely stored as a JSON array on hubspot_deals.

I did not find an existing gold bridge model for this relationship.

Should I create a support model named gold_deals_contacts?
If yes, I will hand off bridge model creation to the create_dbt_transformations skill,
then continue with the semantic model.
```

If approved, invoke `create_dbt_transformations` with the bridge model template (it knows the standard JSON-array bridge pattern). Once that skill confirms the bridge model is built and tested, return here and continue.

Do not create or use bridge/support models until the user approves them.

### Checkpoint 3: Relationship Confirmation

After candidate relationships are detected and validated with SQL, present the validated relationship options, join directions, cardinality, and validation evidence.

Ask the user to confirm or modify relationships before generating semantic overlays.

Do not present unvalidated joins as confirmed relationships.

If validation could not be executed, clearly mark the relationship as `validation pending` and ask the user whether to proceed with that assumption. If the user proceeds, the join is generated but flagged inside the YAML with a comment (see Phase 8).

### Checkpoint 4: Measures Confirmation

After schema and relationship analysis, generate default dimensions and default `count` measure, then propose useful additional measures.

Ask the user to confirm suggested measures or define custom measures.

Do not create ambiguous custom measures. If the user asks for a custom measure but the definition is unclear, ask for more detail.

---

# Workflow

Follow these phases in order.

---

## Phase 1: Discover Gold Models and Select Scope

### Goal

Find available prepared gold models, detect existing overlay conventions, and determine which models participate in the semantic model.

### Steps

1. Discover gold models. Use the navigation commands documented in `create_dbt_transformations` ("Model Navigation"). Specifically:

   ```bash
   find dbt/models/gold -name "*.sql"
   ```

2. If no usable gold models exist, stop and tell the user:

   ```text
   No gold models were found in dbt/models/gold/. The required dbt transformation
   must exist before I can build a semantic model on top of it.

   Use the create_dbt_transformations skill to create the gold model first, then
   run semantic model generation again.
   ```

3. Inspect existing overlays in `semantic/cubes/` to detect conventions used in this repository.

   Look at one or two existing files and check for:

   - Whether `extends:` is used to inherit from base cubes.
   - Whether existing cubes use map-style or list-style for `dimensions`, `measures`, and `joins`.
   - The `public:` flag pattern (true / false / omitted).
   - The `refresh_key` style (`sql:` based or `every:` based).
   - Naming patterns for dimensions and measures.
   - Whether there are base cubes intended for `extends:`.

   Apply the detected conventions to the new overlays. If the repository is empty or conventions are inconsistent, follow the defaults defined in this skill.

4. If the user named a specific transformation/model, search for it. If the model exists, treat it as the selected model. If it does not exist, stop:

   ```text
   I could not find the requested gold model in dbt/models/gold/. The gold model
   must exist before I can create a semantic overlay for it.
   ```

5. If the user did not name a specific model, list all discovered gold models.

6. Infer likely business entities from file names.

   Examples:

   ```text
   gold_hubspot_companies -> companies
   gold_hubspot_deals -> deals
   gold_hubspot_contacts -> contacts
   gold_hubspot_users -> users
   gold_products -> products
   gold_clients -> clients
   gold_addresses -> addresses
   gold_invoices -> invoices
   gold_payments -> payments
   gold_active_users_last_month -> active users
   ```

7. Present discovered models and ask the user which ones should be included.

8. Stop and wait for user selection if no specific model was provided.

9. Confirm selected models back to the user.

10. Keep the full discovered gold model list available for connector search.

11. If the user asks to include all models, warn that this may create a larger and more complex semantic model before proceeding.
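
The discovery step can be sketched as a small shell function that lists each gold model next to the cube name it would receive. This is an assumption-laden sketch: `list_gold_models` is a hypothetical name, and it only strips the `gold_` prefix. It does not attempt the business-entity inference from step 6 (for example, it would print `hubspot_companies`, not `companies`).

```bash
# Hypothetical helper: list gold models with their prefix-free cube names.
list_gold_models() {
  # $1 = gold models directory (defaults to dbt/models/gold)
  find "${1:-dbt/models/gold}" -name '*.sql' | sort | while IFS= read -r path; do
    model="$(basename "$path" .sql)"
    # ${model#gold_} removes the leading gold_ prefix, if present
    printf '%s -> %s\n' "$model" "${model#gold_}"
  done
}
```

Running `list_gold_models` from the repository root prints one `gold_<entity> -> <entity>` line per model, sorted by path, which can be shown to the user at the selection checkpoint.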

---

## Phase 2: Analyze Selected Model Schemas and Keys

### Goal

Inspect selected gold models and identify primary keys, secondary keys, foreign keys, JSON keys, array keys, time columns, and metric-like columns.

### Resolve Environment Variables First

Before running any SQL or generating any YAML in this session, resolve `$GOOGLE_CLOUD_PROJECT` and `$REVOS_BQ_DATASET` to their literal values **once**, and use those literals everywhere downstream. SQL placeholders like `<project>` are not valid BigQuery syntax, and Cube YAML does not interpolate env variables, so both contexts need real values.

Run:

```bash
echo "PROJECT=$GOOGLE_CLOUD_PROJECT"
echo "DATASET=$REVOS_BQ_DATASET"
```

Record the two literal values returned (for example, `revos-dev` and `revos_1737556292084`). For the rest of this session, every time you see `<project>` or `<dataset>` in a SQL template or YAML example, substitute these literals.

If either variable is empty, stop and tell the user:

```text
The environment variable $GOOGLE_CLOUD_PROJECT or $REVOS_BQ_DATASET is not set.
I cannot resolve physical BigQuery table references without them. Please set
them and try again.
```

Do not proceed to schema inspection or join validation until the variables are resolved.
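
The resolve-or-stop behavior described above could be expressed as a guard like the following. It is a sketch, not the skill's prescribed command: `resolve_bq_target` is a hypothetical function name, and only the two environment variable names come from this document.

```bash
# Hypothetical guard: print "<project>.<dataset>" or fail if either variable is empty.
resolve_bq_target() {
  if [ -z "${GOOGLE_CLOUD_PROJECT:-}" ] || [ -z "${REVOS_BQ_DATASET:-}" ]; then
    # Report on stderr and return non-zero so callers can stop early.
    echo "GOOGLE_CLOUD_PROJECT or REVOS_BQ_DATASET is not set" >&2
    return 1
  fi
  printf '%s.%s\n' "$GOOGLE_CLOUD_PROJECT" "$REVOS_BQ_DATASET"
}
```

A caller would typically run `target="$(resolve_bq_target)" || exit 1` once and reuse `$target` as the literal prefix for every table reference afterwards.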

### Schema Discovery

For each selected gold model, inspect available columns and types. If BigQuery exploration commands are needed (listing tables, inspecting schemas, previewing rows), refer to the `explore-lakehouse` skill at `.claude/skills/explore-lakehouse/SKILL.md`.

When inspecting schema output, check whether `_airbyte_extracted_at` exists (it will be needed for `refresh_key` and as the only Airbyte dimension). Other `_airbyte_*` columns can be noted but are not used by default.

Validation queries reference physical tables using the resolved literal values (see "Resolve Environment Variables First" above). Substitute `<project>` and `<dataset>` placeholders in the SQL templates throughout this skill with the literals captured at the start of Phase 2.

### Primary Key Detection

Look for stable unique identifiers.

Common primary key patterns:

```text
id
<entity>_id
<model_name>_id
uuid
unique_id
external_id
source_id
```

Examples:

```text
companies.id
companies.company_id
companies.office_unique_id
hubspot_companies.properties_company_unique_id
```

Primary key rules:

1. Prefer a known business or platform identifier over a generated row number.
2. Prefer stable IDs over names or labels.
3. Do not mark a column as primary key only because it looks unique by name.
4. Validate uniqueness where possible.

Validation:

```sql
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT <candidate_pk>) AS distinct_keys,
  COUNT(*) - COUNT(DISTINCT <candidate_pk>) AS duplicate_count
FROM `<project>.<dataset>.<gold_model>`;
```

A primary key should normally have `total_rows = distinct_keys`, or a clearly explained reason why duplicates are expected.
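
To illustrate substituting the resolved literals into the validation template, the sketch below renders the query text from four arguments. The function name `pk_check_sql` is hypothetical. The single-quoted `printf` formats keep the BigQuery backticks literal.

```bash
# Hypothetical helper: render the primary key uniqueness check with real values.
pk_check_sql() {
  # $1 = project, $2 = dataset, $3 = gold model, $4 = candidate primary key column
  printf 'SELECT\n'
  printf '  COUNT(*) AS total_rows,\n'
  printf '  COUNT(DISTINCT %s) AS distinct_keys,\n' "$4"
  printf '  COUNT(*) - COUNT(DISTINCT %s) AS duplicate_count\n' "$4"
  printf 'FROM `%s.%s.%s`;\n' "$1" "$2" "$3"
}

pk_check_sql revos-dev revos_1737556292084 gold_hubspot_deals deal_id
```

Note that `COUNT(DISTINCT ...)` ignores NULLs, so rows with a NULL candidate key also contribute to `duplicate_count`. A non-zero result therefore means duplicates, NULL keys, or both.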

### Secondary Key Detection

Secondary keys are identifiers that are not the table primary key but can be used for joins, grouping, lookup, or `count_distinct` measures.

Common secondary key patterns:

```text
office_unique_id
company_id
customer_id
client_id
deal_id
contact_id
owner_id
user_id
account_id
product_id
address_id
external_id
source_id
```

Secondary key rules:

1. Track secondary keys explicitly.
2. Secondary keys may be foreign keys to another entity.
3. Secondary keys should usually become Cube dimensions.
4. Secondary keys may support `count_distinct` measures if analytically useful, but only inside the cube that owns the FK (see Phase 7 caution about fan-out).

### JSON / Array Key Detection

Keys may be hidden inside JSON strings, JSON arrays, repeated fields, or nested structures. This is especially common for one-to-many and many-to-many relationships.

Common JSON / array relationship patterns:

```text
companies
deals
contacts
users
owners
clients
products
addresses
associations
associated_companies
associated_deals
associated_contacts
associated_clients
associated_products
company_ids
deal_ids
contact_ids
client_ids
product_ids
address_ids
```

Example: `gold_hubspot_deals.companies` may contain an array of company IDs.

For JSON arrays, use `UNNEST(JSON_VALUE_ARRAY(...))`.

JSON / array rules:

1. Always inspect JSON, array, repeated, and nested fields for hidden relationship keys.
2. Do not assume relationship keys only exist as flat columns.
3. If JSON structure is unknown, inspect sample values first:

   ```sql
   SELECT
     <json_or_array_column>
   FROM `<project>.<dataset>.<gold_model>`
   WHERE <json_or_array_column> IS NOT NULL
   LIMIT 20;
   ```

4. If a relationship is stored as an array of IDs, use or create an approved bridge/support model. Bridge model creation is delegated to `create_dbt_transformations` (which has the standard JSON-array bridge template).
5. Bridge models should preserve both sides of the relationship as keys.
6. Bridge and junction cubes should use `public: false` where the project convention supports it.
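
As a rough illustration of the `UNNEST(JSON_VALUE_ARRAY(...))` pattern, the sketch below renders a query that flattens a JSON array of IDs into one row per source key and target key. It assumes the array column holds a JSON array of scalar IDs. The function name `unnest_keys_sql` and the output aliases `source_key` and `target_key` are hypothetical, and the standard bridge template itself lives in `create_dbt_transformations`, not here.

```bash
# Hypothetical helper: render a query that flattens a JSON array of IDs.
unnest_keys_sql() {
  # $1 = project, $2 = dataset, $3 = gold model
  # $4 = key column on the source model, $5 = JSON array column holding target IDs
  printf 'SELECT\n'
  printf '  src.%s AS source_key,\n' "$4"
  printf '  target_key\n'
  printf 'FROM `%s.%s.%s` AS src,\n' "$1" "$2" "$3"
  printf '  UNNEST(JSON_VALUE_ARRAY(src.%s)) AS target_key;\n' "$5"
}

unnest_keys_sql revos-dev revos_1737556292084 gold_hubspot_deals deal_id companies
```

Each output row pairs one source key with one element of the array, which is the shape a bridge model needs in order to preserve both sides of the relationship.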

### Foreign Key Detection

Look for columns that reference other entities.

Common patterns:

```text
<entity>_id
<entity>Id
fk_<entity>
associated_<entity>_id
parent_<entity>_id
owner_id
created_by_user_id
updated_by_user_id
```

Also check JSON and array-based foreign keys:

```text
deals.companies -> companies.id
companies.deals -> deals.id
contacts.associated_company_ids -> companies.id
```

### Schema Summary Output

After analysis, summarize each selected model.

Example:

```text
Model: gold_hubspot_deals
Columns: 18
Candidate primary key: deal_id
Secondary keys: company_id, owner_id
JSON / array relationship columns: companies, contacts
Time columns: created_at, updated_at, closed_at
Numeric metric-like columns: amount
Airbyte columns present: _airbyte_extracted_at (will be exposed), _airbyte_raw_id, _airbyte_meta, _airbyte_generation_id (will be excluded by default)
```
|
|
540
|
+
|
|
541
|
+
---
|
|
542
|
+
|
|
543
|
+
## Phase 3: Detect Candidate Relationships
|
|
544
|
+
|
|
545
|
+
### Goal
|
|
546
|
+
|
|
547
|
+
Build a candidate relationship graph between selected models and approved connector/bridge models.
|
|
548
|
+
|
|
549
|
+
These relationships are candidates only. They must be validated with SQL before being proposed to the user as relationship options.
|
|
550
|
+
|
|
551
|
+
### Single-Model Case
|
|
552
|
+
|
|
553
|
+
If only one gold model was selected in Phase 1 **and** that model has no JSON/array relationship columns identified in Phase 2, skip Phase 3 entirely. There is nothing to relate. Proceed directly to Phase 6 (dimensions) and Phase 7 (measures). Phase 4 (validation) is also skipped since there are no candidate joins.
|
|
554
|
+
|
|
555
|
+
If only one model is selected but it has JSON/array relationship columns (for example, `gold_hubspot_deals` with a `companies` array), Phase 3 still applies — but only the bridge/junction detection branch. Treat the JSON-array target entity as a candidate model and ask the user whether to add it to the scope (and, if needed, to create a bridge support model).
|
|
556
|
+
|
|
557
|
+
### Relationship Types
|
|
558
|
+
|
|
559
|
+
Use Cube relationship types: `one_to_one`, `one_to_many`, `many_to_one`, `many_to_many`.
|
|
560
|
+
|
|
561
|
+
### Relationship Direction
|
|
562
|
+
|
|
563
|
+
Relationship direction is from the perspective of the current cube.
|
|
564
|
+
|
|
565
|
+
```text
|
|
566
|
+
Deal belongs to one company:
|
|
567
|
+
deals.company_id -> companies.id
|
|
568
|
+
relationship from deals to companies: many_to_one
|
|
569
|
+
|
|
570
|
+
Company has many deals:
|
|
571
|
+
companies.id -> deals.company_id
|
|
572
|
+
relationship from companies to deals: one_to_many
|
|
573
|
+
```

### Cardinality Rules

1. If the source foreign key can repeat and the target key is unique → `many_to_one` / reverse `one_to_many`.
2. If both sides are unique → `one_to_one`.
3. If both sides can repeat or the relationship is represented through a bridge/junction model → `many_to_many` through bridge/junction model.
4. If cardinality is unclear, report uncertainty and validate before proposing the relationship.
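
As a rough illustration of the cardinality rules, the Python sketch below maps two uniqueness checks onto a forward and a reverse relationship. The function name `classify_cardinality` and its inputs are hypothetical and not part of the skill; `None` stands for "not yet validated" (rule 4).

```python
from typing import Optional, Tuple


def classify_cardinality(
    source_key_unique: Optional[bool],
    target_key_unique: Optional[bool],
) -> Optional[Tuple[str, str]]:
    """Return (forward, reverse) relationship labels for source -> target.

    Returns None when either uniqueness check is still unknown, so the
    caller reports uncertainty instead of guessing.
    """
    if source_key_unique is None or target_key_unique is None:
        return None  # rule 4: unclear, validate first
    if source_key_unique and target_key_unique:
        return ("one_to_one", "one_to_one")  # rule 2
    if not source_key_unique and target_key_unique:
        return ("many_to_one", "one_to_many")  # rule 1
    if source_key_unique and not target_key_unique:
        return ("one_to_many", "many_to_one")  # rule 1, seen from the other side
    # rule 3: both sides repeat, so the relationship needs a bridge/junction model
    return ("many_to_many", "many_to_many")
```

For `deals.company_id -> companies.id`, where `company_id` repeats and `id` is unique, the sketch returns `("many_to_one", "one_to_many")`.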

### Direct Join Detection

Analyze direct joins between selected models first.

Look for:

1. Foreign key to primary key matches.
2. Secondary key matches.
3. Known entity conventions.
4. Existing junction/bridge models among selected models.
5. JSON / array relationship fields among selected models.
6. Nested association structures.

### Connector Model Search

Run connector search when selected models are not directly connected.

Rules:

1. Search all discovered gold models, including non-selected models.
2. During connector search, inspect column names and key-like fields of non-selected gold models only for relationship discovery.
3. This lightweight schema discovery does not add the connector model to the semantic model.
4. Do not automatically add connector models.
5. Look for paths of length 2 first.
6. If needed, look for paths of length 3.
7. Do not create long or speculative chains without user confirmation.
8. Use connector models only after user approval.

Example: the user selected `gold_products` and `gold_addresses`, and no direct relationship was found. Searching the remaining gold models shows that both have a `client_id` matching `gold_clients.id`. Proposed connector path: `products -> clients -> addresses`.
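
The search order in the rules (shortest paths first, no long chains) can be sketched as a bounded breadth-first search. This is an illustration only: `find_connector_path`, the `graph` dictionary of candidate key matches, and the reading of "path length" as the number of joins are all assumptions, not part of the skill.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def find_connector_path(
    graph: Dict[str, Set[str]],
    start: str,
    goal: str,
    max_joins: int = 3,
) -> Optional[List[str]]:
    """Return the shortest model path from start to goal, or None.

    Breadth-first search visits 2-join paths before 3-join paths and
    never proposes a chain longer than max_joins.
    """
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        if len(path) - 1 >= max_joins:
            continue  # do not extend speculative long chains
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in path:  # avoid cycles
                queue.append(path + [neighbor])
    return None


# Hypothetical candidate key matches found during lightweight schema discovery.
candidate_matches = {
    "gold_products": {"gold_clients"},
    "gold_addresses": {"gold_clients"},
    "gold_clients": {"gold_products", "gold_addresses"},
}
path = find_connector_path(candidate_matches, "gold_products", "gold_addresses")
```

A path found this way is still only a proposal: the connector model is used only after the user approves it.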

If the user approves the connector model:

1. Add it to the working model scope.
2. Run full schema discovery for it.
3. Analyze its keys and relationships.
4. Validate the connector path with SQL before proposing it as a relationship.
5. Generate a semantic overlay for it.

If the user rejects the connector model:

1. Continue only with directly connected selected models.
2. Clearly document disconnected selected models.
3. Do not invent joins.

### Bridge / Junction Relationship Detection

Use existing bridge/junction gold models when available, but only after confirming they are usable.

If an existing bridge model is found in `dbt/models/gold/`:

1. Inspect its schema (use `explore-lakehouse` for schema commands).
2. Verify it contains the expected key columns for the relationship, typically `<entity_a>_id` and `<entity_b>_id`. The exact column names may differ from the convention (for example, `company_uuid` instead of `company_id`).
3. Check whether `_airbyte_extracted_at` is preserved. If not, the bridge cube will have to fall back to `every: 1 hour` for `refresh_key`.
4. Run a quick row-count and sample-rows check to confirm the bridge has data.

If the existing bridge model fits the relationship cleanly, use it.

If it does not fit (key columns named unexpectedly, missing one side of the relationship, has unrelated extra columns that change cardinality, or is empty), present the situation to the user and ask:

```text
I found an existing bridge model `<model_name>` in dbt/models/gold/, but it does not fit the relationship cleanly:
- <specific issue, e.g. "key column is named company_uuid, not company_id">
- <specific issue>

Options:
- use as-is (I will adapt the cube join SQL to match the existing column names)
- create a new bridge model via create_dbt_transformations (I will use the standard JSON-array template and a new name)
- abort and let me revisit this later
```

Wait for the user's decision. Do not silently adapt to a mismatched bridge: surfacing the mismatch is more important than the fix being automatic.

If no existing bridge model is available and the user approves creating one (Checkpoint 2b), delegate the bridge model creation to `create_dbt_transformations`. That skill has the standard JSON-array bridge template and the naming convention (`gold_<entity_a>_<entity_b>`, with cube name `<entity_a>_<entity_b>` after dropping `gold_`).

Once the bridge model exists and is materialized, return here and continue:

- Generate a Cube overlay for the bridge as a junction cube with `public: false`.
- Generate joins in both directions through the bridge.

---

## Phase 4: Validate Candidate Relationships Before Proposal

### Goal

Verify with SQL that candidate joins actually work and that relationship direction is correct before presenting them to the user for confirmation.

This validation runs against BigQuery directly. It is not a YAML-level check. The point is to verify, with real data, that:

- Candidate primary keys are unique.
- Foreign keys actually match target primary keys at acceptable rates.
- Reverse aggregations produce sensible counts.
- Many-to-many bridge edges resolve correctly on both sides.
- JSON / array relationships extract to keys that exist in the target table.
- Join column types are compatible.

Validation must check both directions of every candidate relationship where possible.

If validation cannot be executed because the environment is incomplete, clearly mark validation as pending and explain what must be run later.

Validation queries reference physical tables using the literal `project` and `dataset` values resolved at the start of Phase 2. Substitute `<project>` and `<dataset>` placeholders in the SQL templates below with those literals before executing.
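
One way to picture the placeholder substitution is the small Python sketch below. The helper `render_sql` is hypothetical, and the project and dataset literals are example values only; the sketch refuses to return SQL that still contains an unresolved `<placeholder>`.

```python
import re
from typing import Dict


def render_sql(template: str, values: Dict[str, str]) -> str:
    """Replace <name> placeholders with literals and fail on leftovers."""
    sql = template
    for name, literal in values.items():
        sql = sql.replace(f"<{name}>", literal)
    # Placeholders are lowercase words in angle brackets, so comparison
    # operators such as "> 1" are not mistaken for them.
    leftover = re.findall(r"<[a-z_]+>", sql)
    if leftover:
        raise ValueError(f"unresolved placeholders: {sorted(set(leftover))}")
    return sql


template = "SELECT COUNT(*) FROM `<project>.<dataset>.<gold_model>`"
query = render_sql(
    template,
    {
        "project": "revos-dev",              # example literal
        "dataset": "revos_1737556292084",    # example literal
        "gold_model": "gold_hubspot_deals",
    },
)
```

Failing loudly on a leftover placeholder keeps a half-substituted template from ever reaching BigQuery.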

Do not present an unvalidated join as a confirmed relationship.

### 4.1 Validate Key Uniqueness

```sql
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT <candidate_pk>) AS distinct_keys,
  COUNT(*) - COUNT(DISTINCT <candidate_pk>) AS duplicate_count
FROM `<project>.<dataset>.<gold_model>`;
```

For a primary key, `duplicate_count` should normally be `0`. Rows with a `NULL` key also add to `duplicate_count`, because `COUNT(DISTINCT ...)` ignores `NULL` values. If duplicates exist, do not mark the column as `primary_key: true` unless there is a clearly documented reason.

### 4.2 Validate Many-to-One Direction

Example relationship: `deals.company_id -> companies.id`.

Expected direction: `deals -> companies: many_to_one`, `companies -> deals: one_to_many`.

Validate the many-to-one side:

```sql
SELECT
  COUNT(*) AS total_rows_with_fk,
  COUNT(c.id) AS matched_rows,
  COUNT(*) - COUNT(c.id) AS unmatched_rows,
  -- SAFE_DIVIDE returns NULL instead of failing when no row has a foreign key
  ROUND(SAFE_DIVIDE(100.0 * COUNT(c.id), COUNT(*)), 2) AS match_percentage
FROM `<project>.<dataset>.gold_hubspot_deals` d
LEFT JOIN `<project>.<dataset>.gold_hubspot_companies` c
  ON d.company_id = c.id
WHERE d.company_id IS NOT NULL;
```

Also check whether a single source row matches multiple target rows:

```sql
SELECT
  d.deal_id,
  COUNT(c.id) AS matched_companies
FROM `<project>.<dataset>.gold_hubspot_deals` d
LEFT JOIN `<project>.<dataset>.gold_hubspot_companies` c
  ON d.company_id = c.id
WHERE d.company_id IS NOT NULL
GROUP BY d.deal_id
HAVING COUNT(c.id) > 1
LIMIT 20;
```

For a valid many-to-one relationship, this result should normally be empty.

### 4.3 Validate Reverse One-to-Many Direction

For the same relationship, validate the reverse direction using aggregation:

```sql
SELECT
  c.id AS company_id,
  COUNT(d.deal_id) AS deal_count
FROM `<project>.<dataset>.gold_hubspot_companies` c
LEFT JOIN `<project>.<dataset>.gold_hubspot_deals` d
  ON c.id = d.company_id
GROUP BY c.id
ORDER BY deal_count DESC
LIMIT 20;
```

Cross-check sampled counts directly from the child/source table:

```sql
SELECT
  company_id,
  COUNT(*) AS expected_deal_count
FROM `<project>.<dataset>.gold_hubspot_deals`
WHERE company_id IN (<sample_company_ids>)
GROUP BY company_id;
```

For each sampled company, `expected_deal_count` must match the `deal_count` from the aggregation query.

### 4.4 Validate One-to-One Relationships

For a one-to-one relationship, validate uniqueness on both sides:

```sql
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT <left_key>) AS distinct_left_keys
FROM `<project>.<dataset>.<left_model>`;
```

Run the same query for the right side. Then validate the join:

```sql
SELECT
  COUNT(*) AS total_rows,
  COUNT(r.<right_key>) AS matched_rows,
  COUNT(*) - COUNT(r.<right_key>) AS unmatched_rows,
  -- SAFE_DIVIDE returns NULL instead of failing when no row has a left key
  ROUND(SAFE_DIVIDE(100.0 * COUNT(r.<right_key>), COUNT(*)), 2) AS match_percentage
FROM `<project>.<dataset>.<left_model>` l
LEFT JOIN `<project>.<dataset>.<right_model>` r
  ON l.<left_key> = r.<right_key>
WHERE l.<left_key> IS NOT NULL;
```

Also validate the reverse direction. If either side has duplicate keys, the relationship is not one-to-one.

### 4.5 Validate Many-to-Many Through Bridge/Junction Models

For a many-to-many relationship, validate both bridge edges.

Example: `companies <-> deals through gold_companies_deals`.

Validate bridge to companies:

```sql
SELECT
  COUNT(*) AS total_bridge_rows,
  COUNT(c.id) AS matched_companies,
  COUNT(*) - COUNT(c.id) AS unmatched_companies,
  -- SAFE_DIVIDE returns NULL instead of failing when the bridge has no rows
  ROUND(SAFE_DIVIDE(100.0 * COUNT(c.id), COUNT(*)), 2) AS match_percentage
FROM `<project>.<dataset>.gold_companies_deals` b
LEFT JOIN `<project>.<dataset>.gold_hubspot_companies` c
  ON b.company_id = c.id
WHERE b.company_id IS NOT NULL;
```

Validate the bridge to deals with the analogous query, swapping `c.id` for `d.id` and the companies table for the deals table.

Validate reverse aggregations from each parent through the bridge:

```sql
SELECT
  c.id AS company_id,
  COUNT(b.deal_id) AS related_deals
FROM `<project>.<dataset>.gold_hubspot_companies` c
LEFT JOIN `<project>.<dataset>.gold_companies_deals` b
  ON c.id = b.company_id
GROUP BY c.id
ORDER BY related_deals DESC
LIMIT 20;
```

Run the same query in the other direction: deals → bridge → companies.

Report sampled counts to the user.

### 4.6 Validate JSON / Array Relationships

For JSON or array-based relationships, validate extracted keys:

```sql
WITH extracted AS (
  SELECT DISTINCT
    src.<source_pk> AS source_id,
    extracted_id
  FROM `<project>.<dataset>.<source_model>` src,
    UNNEST(JSON_VALUE_ARRAY(src.<json_array_column>)) AS extracted_id
)

SELECT
  COUNT(*) AS total_relationships,
  COUNT(tgt.<target_pk>) AS matched_relationships,
  COUNT(*) - COUNT(tgt.<target_pk>) AS unmatched_relationships,
  -- SAFE_DIVIDE returns NULL instead of failing when nothing was extracted
  ROUND(SAFE_DIVIDE(100.0 * COUNT(tgt.<target_pk>), COUNT(*)), 2) AS match_percentage
FROM extracted e
LEFT JOIN `<project>.<dataset>.<target_model>` tgt
  ON e.extracted_id = tgt.<target_pk>;
```

Sample the extracted join to inspect actual matched values:

```sql
WITH extracted AS (
  SELECT DISTINCT
    src.<source_pk> AS source_id,
    extracted_id
  FROM `<project>.<dataset>.<source_model>` src,
    UNNEST(JSON_VALUE_ARRAY(src.<json_array_column>)) AS extracted_id
)

SELECT
  e.source_id,
  e.extracted_id,
  tgt.<target_pk>,
  tgt.<display_column>
FROM extracted e
LEFT JOIN `<project>.<dataset>.<target_model>` tgt
  ON e.extracted_id = tgt.<target_pk>
LIMIT 10;
```

### 4.7 Validate Type Compatibility

Check that join columns have compatible types using `INFORMATION_SCHEMA`:

```sql
-- table_name is selected so the two sides stay distinguishable
-- when both key columns share the same name
SELECT table_name, column_name, data_type
FROM `<project>.<dataset>.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name IN ('<source_model>', '<target_model>')
  AND column_name IN ('<foreign_key>', '<target_pk>');
```

For JSON / array extracted keys, check the extracted key type against the target key type.

If types differ:

1. Report the mismatch.
2. Prefer fixing type alignment in the dbt model or approved support model.
3. Only cast in Cube join SQL when necessary.
4. Prefer casting the foreign-key side to match the primary-key side.

### 4.8 Validation Rules

Do not present relationships to the user as valid options until:

1. Candidate joins have been validated, or validation is explicitly marked as pending.
2. Both directions of every candidate relationship have been validated where possible.
3. Primary key uniqueness has been checked, or marked as pending.
4. Type compatibility has been checked, or marked as pending.
5. Match rates are reported, or marked as pending.
6. Reverse one-to-many aggregations are checked with sampled counts where applicable.
7. Many-to-many bridge edges are validated in both directions where applicable.
8. JSON / array relationships have been extracted and validated where present, or marked as pending.
9. Low match rates or suspicious results are explained.
10. Sample joined data or sampled aggregate counts look reasonable, or limitations are documented.

---

## Phase 5: Present Validated Relationships and Ask for Confirmation

### Goal

Show the detected and validated semantic structure before generating files.

Present selected models, approved connector models, candidate keys, JSON/array keys, connector paths, bridge models, validated joins, cardinality, and validation evidence.

Example:

```text
Selected gold models:
- gold_hubspot_companies
- gold_hubspot_deals

Approved connector models:
- gold_hubspot_users

Entity: deals
Source model: gold_hubspot_deals
Cube name (overlay): hubspot_deals
Candidate primary key: deal_id
Secondary keys: owner_id
JSON / array keys:
- companies -> company_id

Validated joins:
- deals.owner_id -> users.user_id (many_to_one)
  Match rate: 99.8%
  Reverse direction: users -> deals (one_to_many)
  Sample reverse counts checked.

- deals.companies[] -> companies.id through bridge gold_companies_deals
  Relationship: many_to_many
  Bridge edges validated.
```

Then ask:

```text
Please confirm these selected models, connector models, bridge/support models, and validated relationships, or tell me what to change before I generate the semantic overlays.
```

Do not generate final files until the user confirms or corrects the relationship model.

If a relationship could not be validated but the user still wants to proceed, mark it clearly as an assumption in the generated summary, and tag it with a YAML comment in the generated overlay (see Phase 8).

---

## Phase 6: Generate Dimensions

### Goal

Expose all selected gold model business columns as Cube dimensions, plus `_airbyte_extracted_at`.

### Type Mapping

```text
STRING / VARCHAR / TEXT             -> string
INTEGER / FLOAT / NUMERIC / DECIMAL -> number
BOOLEAN / BOOL                      -> boolean
DATE / DATETIME / TIMESTAMP         -> time
JSON / ARRAY / STRUCT               -> string; skip only if not directly queryable
```
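
The mapping can be expressed as a lookup, sketched here in Python. The names `CUBE_TYPE` and `cube_dimension_type` are hypothetical, and the `INT64` and `FLOAT64` entries are additions not listed in the table (they are the spellings BigQuery reports for integer and float columns). Type parameters such as `ARRAY<STRING>` or `NUMERIC(10, 2)` are reduced to their base type before the lookup.

```python
import re
from typing import Optional

CUBE_TYPE = {
    "STRING": "string", "VARCHAR": "string", "TEXT": "string",
    "INTEGER": "number", "FLOAT": "number", "NUMERIC": "number", "DECIMAL": "number",
    "INT64": "number", "FLOAT64": "number",  # assumption: BigQuery spellings, not in the table
    "BOOLEAN": "boolean", "BOOL": "boolean",
    "DATE": "time", "DATETIME": "time", "TIMESTAMP": "time",
    "JSON": "string", "ARRAY": "string", "STRUCT": "string",
}


def cube_dimension_type(data_type: str) -> Optional[str]:
    """Map a warehouse column type to a Cube dimension type.

    Returns None for types outside the mapping so the caller can
    decide whether to skip the column and document why.
    """
    base = re.split(r"[<(]", data_type.strip().upper(), maxsplit=1)[0].strip()
    return CUBE_TYPE.get(base)
```

Returning `None` rather than guessing keeps unmapped types visible, which matches the rule that skipped columns must be documented.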

### Dimension Rules

1. Include primary keys as dimensions with `primary_key: true`.
2. Include secondary keys as dimensions.
3. Include human-readable names and statuses as string dimensions.
4. Include timestamps as time dimensions.
5. Include numeric attributes as number dimensions.
6. Expose every business column from each selected gold model as a dimension by default.
7. From technical Airbyte columns, include only `_airbyte_extracted_at`. Name the dimension `airbyte_extracted_at` (without the leading underscore) and reference the column as `${CUBE}._airbyte_extracted_at`.
8. Exclude all other `_airbyte_*` columns by default (`_airbyte_raw_id`, `_airbyte_meta`, `_airbyte_generation_id`, etc.). Do not include them unless the user explicitly asks.
9. Skip or transform only columns that cannot be represented safely in Cube, and document why.
10. JSON / array columns used for relationships should usually be represented through bridge/support models.
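
Rules 6 to 8 amount to a small naming filter, sketched below in Python. The function `dimension_name` is hypothetical; it covers only the default behavior and ignores the case where the user explicitly asks for other `_airbyte_*` columns.

```python
from typing import Optional


def dimension_name(column: str) -> Optional[str]:
    """Return the Cube dimension name for a gold column, or None to exclude it."""
    if column == "_airbyte_extracted_at":
        # Rule 7: expose this one technical column without the leading underscore.
        return "airbyte_extracted_at"
    if column.startswith("_airbyte_"):
        # Rule 8: all other Airbyte metadata is excluded by default.
        return None
    # Rule 6: business columns keep their own name.
    return column


columns = ["deal_id", "_airbyte_raw_id", "_airbyte_extracted_at", "amount"]
dimensions = [name for name in map(dimension_name, columns) if name is not None]
```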

### Composite Primary Keys

Cube allows exactly one dimension flagged with `primary_key: true` per cube. When a gold model has a composite primary key (multiple columns that together uniquely identify a row, for example `(office_unique_id, month)` in a monthly snapshot model), do not flag any of the component columns as primary key directly. Instead:

1. Keep each component column as a regular dimension (no `primary_key` flag).
2. Add an additional synthetic primary-key dimension that concatenates the components, using the same `CONCAT` pattern used for bridge cubes.

Example for a monthly active users model with composite PK `(office_unique_id, month)`:

```yaml
dimensions:
  id:
    sql: "CONCAT(${CUBE}.office_unique_id, '-', ${CUBE}.month)"
    type: string
    primary_key: true

  office_unique_id:
    sql: "${CUBE}.office_unique_id"
    type: string

  month:
    sql: "${CUBE}.month"
    type: time
```

Choose a separator that does not appear in either component value. `-` is usually safe for IDs and dates; if components may contain `-`, use `||` or another unambiguous separator.

The synthetic `id` dimension is the cube's primary key for Cube's purposes. Joins to this cube must reference that synthetic `id`, not individual components, unless the joining cube has the same composite key columns and uses a parallel `CONCAT` in the join SQL.
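
To keep the synthetic key and any parallel `CONCAT` in join SQL identical, both can be produced by one helper. The Python sketch below is illustrative only; `synthetic_key_sql` is a hypothetical name, and the separator is assumed to contain no single quote.

```python
from typing import List


def synthetic_key_sql(columns: List[str], separator: str = "-") -> str:
    """Build the CONCAT expression for a composite primary key."""
    if len(columns) < 2:
        raise ValueError("a composite key needs at least two columns")
    if "'" in separator:
        raise ValueError("separator must not contain a single quote")
    parts: List[str] = []
    for index, column in enumerate(columns):
        if index > 0:
            parts.append(f"'{separator}'")   # quoted separator literal
        parts.append(f"${{CUBE}}.{column}")  # renders as ${CUBE}.<column>
    return "CONCAT(" + ", ".join(parts) + ")"


key_sql = synthetic_key_sql(["office_unique_id", "month"])
```

Called with `["office_unique_id", "month"]`, the helper reproduces the `sql` value of the `id` dimension in the example.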

### Large Column Count Warning

If a model has more than 50 business columns, inform the user before generating:

```text
This model has <N> business columns. I will generate <N> dimensions by default,
plus airbyte_extracted_at if the column exists.
Other _airbyte_* columns will be excluded by default.

Proceed with all business columns, or should I skip any groups?
```

Do not skip business columns without explicit user instruction.

Example dimensions block:

```yaml
dimensions:
  deal_id:
    sql: "${CUBE}.deal_id"
    type: string
    primary_key: true

  company_id:
    sql: "${CUBE}.company_id"
    type: string

  deal_name:
    sql: "${CUBE}.deal_name"
    type: string

  amount:
    sql: "${CUBE}.amount"
    type: number

  created_at:
    sql: "${CUBE}.created_at"
    type: time

  airbyte_extracted_at:
    sql: "${CUBE}._airbyte_extracted_at"
    type: time
```

---

## Phase 7: Suggest and Confirm Measures

### Goal

Create default and useful measures without inventing ambiguous business logic.

### Default Measure

Always include a row count measure unless project convention says otherwise.

```yaml
measures:
  count:
    type: count
```

### Suggested Measures

Suggest useful measures based on model schema and column names.

Common suggestions:

```text
amount                         -> total_amount (sum), average_amount (avg), min_amount, max_amount
revenue                        -> total_revenue (sum)
price                          -> total_price or average_price depending on context
cost                           -> total_cost (sum)
quantity / qty                 -> total_quantity (sum)
duration                       -> total_duration or average_duration
deal_id / company_id / user_id -> count_distinct
created_at                     -> first_created_at (min), last_created_at (max)
closed_at                      -> first_closed_at (min), last_closed_at (max)
updated_at                     -> last_updated_at (max)
```
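
The unambiguous rows of the table can be turned into a lookup, as in this Python sketch. `SUGGESTED` and `suggest_measures` are hypothetical names, and exact column-name matching is an assumption. `price` and `duration` are left out because their aggregation depends on context, and the `*_id` rows are left out because the table gives no measure name for them.

```python
from typing import Dict, List, Tuple

# (measure name, Cube measure type) per column name.
SUGGESTED: Dict[str, List[Tuple[str, str]]] = {
    "amount": [("total_amount", "sum"), ("average_amount", "avg"),
               ("min_amount", "min"), ("max_amount", "max")],
    "revenue": [("total_revenue", "sum")],
    "cost": [("total_cost", "sum")],
    "quantity": [("total_quantity", "sum")],
    "qty": [("total_quantity", "sum")],
    "created_at": [("first_created_at", "min"), ("last_created_at", "max")],
    "closed_at": [("first_closed_at", "min"), ("last_closed_at", "max")],
    "updated_at": [("last_updated_at", "max")],
}


def suggest_measures(columns: List[str]) -> List[Tuple[str, str, str]]:
    """Return (measure_name, measure_type, source_column) candidates.

    The result is a list of suggestions for the user to confirm,
    not a list of measures to generate unconditionally.
    """
    suggestions = []
    for column in columns:
        for name, measure_type in SUGGESTED.get(column, []):
            suggestions.append((name, measure_type, column))
    return suggestions


candidates = suggest_measures(["deal_id", "amount", "updated_at"])
```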

Examples:

```yaml
measures:
  total_amount:
    sql: "${CUBE}.amount"
    type: sum

  unique_companies:
    sql: "${CUBE}.company_id"
    type: count_distinct

  last_closed_at:
    sql: "${CUBE}.closed_at"
    type: max
```

### `count_distinct` on Foreign Keys

`count_distinct` on a foreign-key column counts unique values within the cube that owns the FK. For example, `count_distinct(company_id)` inside the `hubspot_deals` cube answers "how many distinct companies have deals."

Be careful when this measure is used in queries that join multiple cubes. Joins can produce row fan-out, which inflates or distorts distinct counts. Rules:

1. Define `count_distinct` on FK columns inside the cube that owns the FK.
2. Add a brief description in the measure (or note it for the user) so consumers know the cube boundary.
3. Avoid suggesting `count_distinct` on the matching key inside the parent cube (the cube that owns the PK). The key is unique there, so `count` of rows usually answers the same question more reliably.

### Measure Confirmation

After suggesting measures, ask the user:

```text
I will create the default count measure.

I also found these possible additional measures:
- total_amount: sum(amount)
- average_amount: avg(amount)
- unique_companies: count_distinct(company_id)
- last_closed_at: max(closed_at)

Which of these should I include?

Do you want to define any custom measures?
```

### Custom Measures

If the user requests a custom measure, ask for a precise definition if needed.

A custom measure definition should include:

```text
measure name
source column(s)
aggregation type
filters or conditions
business meaning
```

Custom measure rules:

1. Do not create ambiguous custom measures.
2. If the definition is clear, generate the measure.
3. If the definition is unclear, ask for clarification.
4. If the requested measure cannot be created from available columns, explain why and list the missing data.

---

## Phase 8: Generate Cube Semantic Overlays

### Goal

Create Cube.dev semantic overlay YAML files for selected and approved gold models.

### Output Location and File Naming

Create files under `semantic/cubes/`. The overlay file name strips the `gold_` prefix. The cube `name` matches the file name (without extension).

Examples:

```text
gold_hubspot_deals.sql     -> semantic/cubes/hubspot_deals.yml     (name: hubspot_deals)
gold_hubspot_companies.sql -> semantic/cubes/hubspot_companies.yml (name: hubspot_companies)
gold_companies_deals.sql   -> semantic/cubes/companies_deals.yml   (name: companies_deals)
gold_clients.sql           -> semantic/cubes/clients.yml           (name: clients)
```
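
The naming convention is mechanical enough to sketch in a few lines of Python. `overlay_target` is a hypothetical helper; it assumes the input is a dbt model file name or path ending in `.sql`.

```python
from pathlib import PurePosixPath
from typing import Tuple


def overlay_target(model_file: str) -> Tuple[str, str]:
    """Return (overlay_path, cube_name) for a gold dbt model file."""
    stem = PurePosixPath(model_file).stem  # drops directories and ".sql"
    prefix = "gold_"
    cube_name = stem[len(prefix):] if stem.startswith(prefix) else stem
    return f"semantic/cubes/{cube_name}.yml", cube_name


overlay_path, cube_name = overlay_target("dbt/models/gold/gold_hubspot_deals.sql")
```

Only the leading `gold_` is removed, so a model such as `gold_companies_deals` keeps the rest of its name intact.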

### Required Source Reference

Use the fully qualified BigQuery table reference. Resolve env variables to literals via `echo` (see "Cube `sql_table` Reference" section above).

Example:

```yaml
sql_table: "`revos-dev.revos_1737556292084.gold_hubspot_deals`"
```

### Overlay Style

Follow the existing `semantic/cubes/` style detected in Phase 1.

If existing overlays use map-style `dimensions`, `measures`, and `joins`, use map-style.

If existing overlays use `extends:` to inherit from base cubes, follow that convention. Otherwise, generate self-contained cubes.

### Canonical Example: Standard Cube

A complete example of a standard (non-bridge) cube:

```yaml
cubes:
  - name: hubspot_companies
    sql_table: "`revos-dev.revos_1737556292084.gold_hubspot_companies`"

    joins:
      companies_deals:
        sql: "${CUBE}.id = ${companies_deals}.company_id"
        relationship: one_to_many

    measures:
      count:
        type: count

      total_deal_value:
        sql: "${CUBE}.properties_hs_total_deal_value"
        type: sum

      num_open_deals:
        sql: "${CUBE}.properties_hs_num_open_deals"
        type: sum

    dimensions:
      id:
        sql: "${CUBE}.id"
        type: string
        primary_key: true

      airbyte_extracted_at:
        sql: "${CUBE}._airbyte_extracted_at"
        type: time

    refresh_key:
      sql: "SELECT MAX(_airbyte_extracted_at) FROM `revos-dev.revos_1737556292084.gold_hubspot_companies`"
```

Notes:

1. Top-level `cubes:` array is required.
2. Cube `name` is `hubspot_companies` (no `gold_` prefix).
3. `sql_table` references `gold_hubspot_companies` (with `gold_` prefix), in backticks.
4. The join references `${companies_deals}`, the cube name of a bridge cube defined in `semantic/cubes/companies_deals.yml`.
5. Only `_airbyte_extracted_at` is exposed from Airbyte metadata, as `airbyte_extracted_at`.
6. `refresh_key.sql` uses the same fully qualified table name as `sql_table`.

### Unvalidated Joins

If the user chose to proceed with a join that could not be validated, generate the join but tag it with a YAML comment:

```yaml
joins:
  hubspot_companies:
    # UNVALIDATED: match rate could not be measured because gold_hubspot_companies was not yet materialized in BigQuery
    sql: "${CUBE}.company_id = ${hubspot_companies}.id"
    relationship: many_to_one
```

Use a short, factual reason after `UNVALIDATED:`.

---

## Cube Overlay Requirements

### Joins

Every confirmed relationship must be represented in both directions where both cubes exist.

Direct many-to-one example:

```yaml
# In hubspot_deals.yml
joins:
  hubspot_companies:
    sql: "${CUBE}.company_id = ${hubspot_companies}.id"
    relationship: many_to_one
```

Reverse one-to-many:

```yaml
# In hubspot_companies.yml
joins:
  hubspot_deals:
    sql: "${CUBE}.id = ${hubspot_deals}.company_id"
    relationship: one_to_many
```

Connector path. For `products -> clients -> addresses`, create joins for each edge in both directions:

```yaml
# In products.yml
joins:
  clients:
    sql: "${CUBE}.client_id = ${clients}.id"
    relationship: many_to_one
```

```yaml
# In clients.yml
joins:
  products:
    sql: "${CUBE}.id = ${products}.client_id"
    relationship: one_to_many

  addresses:
    sql: "${CUBE}.id = ${addresses}.client_id"
    relationship: one_to_many
```

```yaml
# In addresses.yml
joins:
  clients:
    sql: "${CUBE}.client_id = ${clients}.id"
    relationship: many_to_one
```

Bridge join. The bridge cube joins to both parents:

```yaml
# In companies_deals.yml
joins:
  hubspot_companies:
    relationship: many_to_one
    sql: "${CUBE}.company_id = ${hubspot_companies}.id"

  hubspot_deals:
    relationship: many_to_one
    sql: "${CUBE}.deal_id = ${hubspot_deals}.id"
```

Reverse joins from each parent to the bridge:

```yaml
# In hubspot_companies.yml
joins:
  companies_deals:
    sql: "${CUBE}.id = ${companies_deals}.company_id"
    relationship: one_to_many
```

```yaml
# In hubspot_deals.yml
joins:
  companies_deals:
    sql: "${CUBE}.id = ${companies_deals}.deal_id"
    relationship: one_to_many
```

Join rules:

1. Use validated keys.
2. Use the correct relationship direction from the current cube.
3. Generate both directions for every confirmed relationship.
4. Reference other cubes by their cube `name` (without `gold_` prefix) in `${...}`.
5. Prefer joins between gold cubes.
6. Keep join SQL readable and explicit.
7. If key casting is required, prefer fixing it in the dbt model or support model first.
8. Tag unvalidated joins with `# UNVALIDATED: <reason>` instead of silently emitting them.
9. For JSON / array relationships, use or create approved bridge/support models (creation delegated to `create_dbt_transformations`).
10. For connector paths, join through the connector model instead of inventing a direct join.
11. Bridge and junction cubes should use `public: false` where the project convention supports it.
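
Rule 3 (both directions) can be spot-checked at the text level before handing the overlays over. The sketch below is an approximation under stated assumptions: one cube per file, each file named after its cube, and overlays stored in `semantic/cubes/`. It treats any `${other_cube}` reference as a join target, so it does not replace a real YAML parse. The function name `check_reverse_joins` is hypothetical.

```bash
# Hypothetical helper: for every ${other_cube} reference in an overlay, check
# that other_cube's overlay references this cube back.
check_reverse_joins() {
  local dir="${1:-semantic/cubes}" file cube target status=0
  for file in "$dir"/*.yml; do
    [ -e "$file" ] || continue                 # directory has no overlays yet
    cube=$(basename "$file" .yml)
    # Collect the distinct names used inside ${...} in this overlay.
    for target in $(grep -oE '\$\{[A-Za-z_][A-Za-z0-9_]*\}' "$file" | tr -d '${}' | sort -u); do
      [ "$target" = "CUBE" ] && continue       # self-reference, not a join target
      [ -f "$dir/$target.yml" ] || continue    # only applies where both cubes exist
      # -F: match the literal string ${<cube>} in the target overlay.
      if ! grep -qF "\${$cube}" "$dir/$target.yml"; then
        echo "MISSING REVERSE JOIN: $target -> $cube"
        status=1
      fi
    done
  done
  return "$status"
}
```

Calling `check_reverse_joins semantic/cubes` prints one line per missing reverse edge and returns a non-zero status if any are found.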

### Bridge / Junction Cubes

Bridge and junction cubes should use `public: false` where the project convention supports it.

Example:

```yaml
cubes:
  - name: companies_deals
    sql_table: "`revos-dev.revos_1737556292084.gold_companies_deals`"
    public: false

    joins:
      hubspot_companies:
        relationship: many_to_one
        sql: "${CUBE}.company_id = ${hubspot_companies}.id"

      hubspot_deals:
        relationship: many_to_one
        sql: "${CUBE}.deal_id = ${hubspot_deals}.id"

    measures:
      count:
        type: count

    dimensions:
      id:
        sql: "CONCAT(${CUBE}.deal_id, '-', ${CUBE}.company_id)"
        type: string
        primary_key: true

      deal_id:
        sql: "${CUBE}.deal_id"
        type: string

      company_id:
        sql: "${CUBE}.company_id"
        type: string

      airbyte_extracted_at:
        sql: "${CUBE}._airbyte_extracted_at"
        type: time

    refresh_key:
      sql: "SELECT MAX(_airbyte_extracted_at) FROM `revos-dev.revos_1737556292084.gold_companies_deals`"
```

If the bridge model does not have `_airbyte_extracted_at`, omit that dimension and use the default time-based refresh key:

```yaml
refresh_key:
  every: 1 hour
```

### Refresh Key

Every generated Cube overlay must include `refresh_key`.

Priority order:

1. If the gold model has `_airbyte_extracted_at`, use it for a SQL-based refresh key.
2. If the gold model has another reliable ingestion or update timestamp (`updated_at`, `modified_at`, `loaded_at`, `synced_at`), use that.
3. Otherwise use the default time-based refresh key.

`refresh_key.sql` must reference the same fully qualified BigQuery table as the cube's own `sql_table`, never a different table.

Airbyte refresh key:

```yaml
refresh_key:
  sql: "SELECT MAX(_airbyte_extracted_at) FROM `<project>.<dataset>.<gold_model>`"
```

Other timestamp-based refresh key:

```yaml
refresh_key:
  sql: "SELECT MAX(updated_at) FROM `<project>.<dataset>.<gold_model>`"
```

Default:

```yaml
refresh_key:
  every: 1 hour
```

Refresh key rules:

1. Always include `refresh_key` in every generated Cube overlay.
2. Prefer `_airbyte_extracted_at` when it exists.
3. Prefer reliable timestamp-based refresh keys over fixed interval refresh keys.
4. Use `every: 1 hour` as the default fallback.
5. If a timestamp column exists but is not reliable, explain the assumption and use the default fallback.
6. Bridge and junction cubes must also include `refresh_key`.
7. `refresh_key.sql` must reference the same fully qualified BigQuery table as the cube's own `sql_table`.
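
Rule 7 is easy to get wrong when an overlay is copied from another cube. A rough check, sketched below, compares the backticked table on the `sql_table` line with the backticked table in the `SELECT MAX(...)` refresh query. It assumes one cube per file and the exact layout used in the examples in this section. The function name `check_refresh_key_table` is hypothetical.

```bash
# Hypothetical helper: compare the table in sql_table with the table queried
# by a SQL-based refresh_key in a single overlay file.
check_refresh_key_table() {
  local file="$1" cube_table refresh_table
  # Text between the pair of backticks on the first sql_table line.
  cube_table=$(grep -m1 'sql_table:' "$file" | sed -n 's/.*`\([^`]*\)`.*/\1/p')
  # Text between the pair of backticks on the first "SELECT MAX(" line.
  refresh_table=$(grep -m1 'SELECT MAX(' "$file" | sed -n 's/.*`\([^`]*\)`.*/\1/p')
  if [ -z "$refresh_table" ]; then
    echo "SKIP $file: no SQL-based refresh_key found"
  elif [ "$cube_table" = "$refresh_table" ]; then
    echo "OK $file"
  else
    echo "MISMATCH $file: sql_table=$cube_table refresh_key=$refresh_table"
    return 1
  fi
}
```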

---

## Phase 9: Validate Generated Files

### Goal

Validate generated semantic files and any approved support models.

### dbt Validation

If `create_dbt_transformations` was invoked during this run (for example, to create a bridge model), it has already validated the new dbt models with `revos dbt run` and `revos dbt test`. No re-validation is needed here for those.

If only existing gold models were used, run a basic syntax check via the dbt skill's standard command:

```bash
revos dbt parse
```

### Verify Physical Tables Exist in BigQuery

For each generated cube, confirm the physical table referenced in `sql_table` actually exists in BigQuery. Cube does not catch a missing table at YAML parse time; it only fails at first query.

```bash
bq show <project>:<dataset>.<table_name>
```

Example:

```bash
bq show revos-dev:revos_1737556292084.gold_hubspot_companies
```

If the table does not exist, the gold model is not yet materialized. Either materialize it first (run the gold dbt build via `create_dbt_transformations`), or document this as a pending item before handing the overlay over.
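
When several cubes were generated, the same check can be looped. The sketch below assumes `bq show` exits with a non-zero status when the table cannot be read. That also covers permission or authentication errors, so a `MISSING` line should be confirmed by hand before it is recorded as a pending item. The function name `check_tables_exist` is hypothetical.

```bash
# Hypothetical helper: run `bq show` for each <project>:<dataset>.<table> id
# passed as an argument and report which ones could not be found.
check_tables_exist() {
  local table status=0
  for table in "$@"; do
    # Output is discarded; only the exit status of `bq show` is used.
    if bq show "$table" >/dev/null 2>&1; then
      echo "OK $table"
    else
      echo "MISSING $table"
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_tables_exist revos-dev:revos_1737556292084.gold_hubspot_companies` checks the table from the example above.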

### Semantic Validation

Run available project commands to validate Cube YAML.

Placeholder:

```bash
<cube-validation-command>
```

### Manual Validation Checklist

1. Semantic overlays were created in `semantic/cubes/`.
2. Overlay file names drop the `gold_` prefix (`hubspot_companies.yml`, not `gold_hubspot_companies.yml`).
3. Each cube `name` matches the overlay file name (without extension) and has no `gold_` prefix.
4. Cube overlays reference the gold tables using ``sql_table: "`<project>.<dataset>.gold_<entity>`"``.
5. The physical table name in `sql_table` keeps the `gold_` prefix.
6. No `dbt ref()` syntax is used in new overlays.
7. Each cube has dimensions.
8. Every business column from each selected gold model is exposed as a dimension unless documented otherwise.
9. From `_airbyte_*` columns, only `_airbyte_extracted_at` is exposed (as dimension `airbyte_extracted_at`); other `_airbyte_*` columns are excluded.
10. Each cube has a `count` measure.
11. Suggested and approved additional measures are included.
12. Ambiguous custom measures are not created without clarification.
13. `count_distinct` measures on FK columns are defined inside the cube that owns the FK, not the parent cube.
14. Primary keys are marked correctly. For composite primary keys, a synthetic `CONCAT(...)` dimension is the one flagged with `primary_key: true`; individual components are kept as regular dimensions.
15. Secondary keys are preserved as dimensions.
16. JSON arrays use `UNNEST(JSON_VALUE_ARRAY(...))`.
17. JSON / array relationship keys are extracted into approved bridge/support models where needed.
18. Bridge and junction cubes use `public: false` where the project convention supports it.
19. Each cube has a `refresh_key`.
20. Cubes with `_airbyte_extracted_at` use it in a SQL-based `refresh_key` where possible.
21. Cubes without `_airbyte_extracted_at` but with another reliable timestamp use that timestamp where possible.
22. Cubes without a reliable timestamp use `every: 1 hour`.
23. `refresh_key.sql` references the same fully qualified BigQuery table as the cube's own `sql_table`.
24. Joins use validated columns.
25. Joins reference other cubes by their cube `name` (without `gold_` prefix), e.g. `${hubspot_companies}`.
26. Join relationship types are correct from the current cube perspective.
27. Every confirmed relationship has joins in both directions where both cubes exist.
28. Candidate joins were validated before being proposed to the user.
29. Join validation was performed in both directions where possible.
30. Reverse one-to-many aggregation checks were performed where applicable.
31. Many-to-many bridge edges were validated in both directions where applicable.
32. Joins that could not be validated and were generated anyway are tagged with `# UNVALIDATED: <reason>`.
33. Connector models are only used if the user approved them.
34. Selected models that remain disconnected are documented.
35. Physical tables referenced by `sql_table` exist in BigQuery, or the missing tables are clearly listed as pending items.
36. Placeholder commands or assumptions are clearly marked.
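
A few of these items (2, 3 and 19) are mechanical enough to script. The sketch below is one possible text-level check, assuming one cube per file declared as `- name: <cube>` and overlays stored in `semantic/cubes/`. The function name `check_overlay_basics` is hypothetical, and the remaining checklist items still need manual review.

```bash
# Hypothetical helper covering checklist items 2, 3 and 19 only.
check_overlay_basics() {
  local dir="${1:-semantic/cubes}" file base name status=0
  for file in "$dir"/*.yml; do
    [ -e "$file" ] || continue                 # directory has no overlays yet
    base=$(basename "$file" .yml)
    # First "- name: <cube>" entry in the file is taken as the cube name.
    name=$(sed -n 's/^[[:space:]]*- name:[[:space:]]*//p' "$file" | head -n 1)
    case "$base" in
      gold_*) echo "FILE NAME HAS gold_ PREFIX: $file"; status=1 ;;   # item 2
    esac
    if [ "$name" != "$base" ]; then                                   # item 3
      echo "CUBE NAME MISMATCH: $file declares '$name'"
      status=1
    fi
    if ! grep -q 'refresh_key:' "$file"; then                         # item 19
      echo "NO refresh_key: $file"
      status=1
    fi
  done
  return "$status"
}
```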

---

## Final Response Format

After generation, summarize what was created:

```text
Created semantic model draft.

Selected gold models:
- dbt/models/gold/<gold_model_1>.sql
- dbt/models/gold/<gold_model_2>.sql

Approved connector models:
- dbt/models/gold/<connector_model>.sql

Bridge/support models created during this run (via create_dbt_transformations):
- dbt/models/gold/<bridge_model>.sql

Semantic overlays:
- semantic/cubes/<entity_1>.yml (cube name: <entity_1>)
- semantic/cubes/<entity_2>.yml (cube name: <entity_2>)
- semantic/cubes/<bridge_entity>.yml (cube name: <bridge_entity>, public: false)

Detected and validated relationships:
- <entity_a>.<key> -> <entity_b>.<key> (<relationship_type>)
- <entity_b>.<key> -> <entity_a>.<key> (<reverse_relationship_type>)

Connector paths:
- <selected_entity_a> -> <connector_entity> -> <selected_entity_b>

Bridge relationships:
- <source_entity>.<json_array_column>[] -> <target_entity>.<target_key> through <bridge_entity>

Measures:
- count
- <approved_measure_1>
- <approved_measure_2>

Validation:
- dbt validation: <passed / failed / pending / not run (only existing models used)>
- physical table existence in BigQuery: <passed / failed / pending>
- join candidate validation before proposal: <passed / failed / pending>
- reverse join validation: <passed / failed / pending>
- semantic validation: <passed / failed / pending>

Unvalidated joins (tagged in YAML with # UNVALIDATED):
- <cube>.<join_target>: <reason>

Assumptions:
- <assumption_1>
- <assumption_2>

Pending items:
- <pending_item_1>
- <pending_item_2>
```

If validation is incomplete, say exactly what remains pending.