@revos/cli 0.1.0 → 0.1.1

Files changed (28)
  1. package/bin/revos.js +1 -1
  2. package/dist/adapters/oclif/commands/auth/login.mjs +2 -2
  3. package/dist/adapters/oclif/commands/auth/logout.mjs +2 -2
  4. package/dist/adapters/oclif/commands/auth/status.mjs +2 -2
  5. package/dist/adapters/oclif/commands/init.mjs +5 -4
  6. package/dist/adapters/oclif/commands/org/current.mjs +3 -3
  7. package/dist/adapters/oclif/commands/org/list.mjs +3 -3
  8. package/dist/adapters/oclif/commands/org/switch.mjs +3 -3
  9. package/dist/adapters/oclif/commands/overlays/diff.mjs +3 -3
  10. package/dist/adapters/oclif/commands/overlays/pull.mjs +3 -3
  11. package/dist/adapters/oclif/commands/overlays/push.mjs +3 -3
  12. package/dist/adapters/oclif/commands/overlays/status.mjs +3 -3
  13. package/dist/{base.command-BGM225ik.mjs → base.command-DlYMawJ6.mjs} +1 -1
  14. package/dist/{core-Bif-kxlo.mjs → core-Dq15hO6f.mjs} +70 -208
  15. package/dist/{index-C0e8MXGP.d.mts → index-DuqD2b_7.d.mts} +2 -8
  16. package/dist/index.d.mts +1 -1
  17. package/dist/index.mjs +1 -1
  18. package/dist/templates/.devcontainer/Dockerfile +14 -0
  19. package/dist/templates/.devcontainer/devcontainer.json +54 -0
  20. package/dist/templates/.devcontainer/setup.sh +32 -0
  21. package/dist/templates/AGENTS.md +2 -3
  22. package/dist/templates/CLAUDE.md +0 -16
  23. package/dist/templates/README.md +23 -0
  24. package/dist/templates/dbt/dbt_project.yml +22 -0
  25. package/dist/templates/index.ts +4 -0
  26. package/dist/templates/skills/create-semantic-model/SKILL.md +1611 -0
  27. package/dist/templates/skills/explore-lakehouse/SKILL.md +131 -0
  28. package/package.json +1 -3
@@ -0,0 +1,1611 @@
1
+ ---
2
+ name: create-semantic-model
3
+ description: Create semantic models (Cube.dev cubes) from existing RevOS dbt gold models. Use when asked to build a semantic layer, create cubes, or generate Cube definitions from dbt.
4
+ ---
5
+
6
+ @.claude/skills/create_dbt_transformations/SKILL.md
7
+
8
+ # Create Semantic Model
9
+
10
+ Use this skill when the user asks to create a semantic model or semantic overlay.
11
+
12
+ Typical user requests:
13
+
14
+ > Create a new semantic overlay for `<transformation>`.
15
+
16
+ > Create a semantic model from the available gold models.
17
+
18
+ > Build a Cube semantic layer for these gold tables.
19
+
20
+ This skill analyzes existing dbt gold models, asks the user which models should participate, detects keys and relationships, validates join candidates with SQL before proposing them to the user, asks the user to confirm the semantic structure and measures, and generates Cube.dev semantic overlays under `semantic/cubes/`.
21
+
22
+ ## Skill Dependencies
23
+
24
+ This skill delegates dbt-related knowledge to `create_dbt_transformations` (loaded above):
25
+
26
+ - Project layout and how to find gold models
27
+ - Resolving a dbt model name to its physical BigQuery table reference
28
+ - Creating new dbt support models (bridge models from JSON arrays)
29
+ - dbt validation commands
30
+
31
+ If BigQuery exploration is needed (listing tables, inspecting schemas, previewing rows, checking null rates beyond what is required for join validation), use the `explore-lakehouse` skill at `.claude/skills/explore-lakehouse/SKILL.md`. Load it only when those capabilities are actually required.
32
+
33
+ ---
34
+
35
+ ## Purpose
36
+
37
+ Expose existing dbt gold models as queryable Cube.dev semantic models without manually writing YAML boilerplate.
38
+
39
+ Gold models may be dbt tables or dbt views. Treat both as valid semantic sources.
40
+
41
+ This skill does not build gold models from silver. If a needed gold model is missing, hand off to `create_dbt_transformations`.
42
+
43
+ The expected flow is:
44
+
45
+ ```text
46
+ dbt/models/gold/
47
+ -> discover available gold models (via create_dbt_transformations)
48
+ -> ask the user which gold models should participate
49
+ -> inspect selected model schemas from the lakehouse
50
+ -> detect primary keys, secondary keys, foreign keys, JSON/array keys, and candidate relationships
51
+ -> if selected models are not directly connected, search remaining gold models for connector/intermediate models
52
+ -> ask the user to approve connector or bridge/support models when needed
53
+ -> validate candidate joins with SQL before proposing them as relationships
54
+ -> present validated relationships and join directions to the user
55
+ -> ask the user to confirm relationships
56
+ -> generate dimensions from all selected columns by default
57
+ -> generate default count measure
58
+ -> suggest useful additional measures
59
+ -> ask the user to confirm or define custom measures
60
+ -> generate Cube.dev semantic overlays in semantic/cubes/
61
+ -> validate generated files
62
+ ```
63
+
64
+ ---
65
+
66
+ ## Execution Order
67
+
68
+ 1. Discover available gold models (use `create_dbt_transformations` for navigation).
69
+ 2. If the user named a specific transformation/model, find that gold model first.
70
+ 3. Ask the user which gold models should participate, unless the request clearly targets one specific model only.
71
+ 4. Resolve `$GOOGLE_CLOUD_PROJECT` and `$REVOS_BQ_DATASET` to literal values via `echo` and keep them for the rest of the session.
72
+ 5. Inspect schemas and column types for selected gold models.
73
+ 6. Detect primary keys, secondary keys, foreign keys, JSON/array keys, and candidate relationships.
74
+ 7. If selected models are disconnected, search for connector models among the remaining gold models.
75
+ 8. Ask the user to approve connector or bridge/support models before adding them to the working scope.
76
+ 9. Validate candidate joins with SQL before proposing them as relationships.
77
+ 10. Present validated relationships, join directions, cardinality, and validation evidence to the user.
78
+ 11. Ask the user to confirm or modify relationships.
79
+ 12. Generate dimensions from all selected gold model columns by default.
80
+ 13. Create the default `count` measure.
81
+ 14. Suggest useful additional measures.
82
+ 15. Ask the user to confirm suggested measures or define custom measures.
83
+ 16. Generate Cube semantic overlays in `semantic/cubes/`.
84
+ 17. Validate generated files.
85
+
86
+ Do not skip ahead.
87
+
88
+ ---
89
+
90
+ ## Naming Convention: Cube Files and Cube Names
91
+
92
+ Gold models in `dbt/models/gold/` use the `gold_` prefix in their file name (for example, `gold_hubspot_companies.sql`). The materialized table in BigQuery keeps the same name.
93
+
94
+ The semantic overlay layer drops the `gold_` prefix:
95
+
96
+ 1. The gold SQL file: `gold_<entity>` (for example, `gold_hubspot_companies.sql`).
97
+ 2. The materialized BigQuery table: `gold_<entity>` (for example, `gold_hubspot_companies`).
98
+ 3. The semantic overlay YAML file: `<entity>.yml` (for example, `hubspot_companies.yml`).
99
+ 4. The cube `name` inside that YAML file: `<entity>` (for example, `name: hubspot_companies`).
100
+ 5. References to that cube from joins: `${<entity>}` (for example, `${hubspot_companies}`).
101
+
102
+ Mapping example:
103
+
104
+ ```text
105
+ gold SQL file: dbt/models/gold/gold_hubspot_companies.sql
106
+ BigQuery table: gold_hubspot_companies
107
+ overlay YAML file: semantic/cubes/hubspot_companies.yml
108
+ cube name: hubspot_companies
109
+ join reference: ${hubspot_companies}
110
+ ```
111
+
112
+ Naming rules:
113
+
114
+ 1. Always strip the `gold_` prefix when generating the overlay YAML file name.
115
+ 2. The cube `name` must match the overlay YAML file name (without extension).
116
+ 3. The cube `name` is what appears in `${...}` references inside join SQL.
117
+ 4. The physical table reference in `sql_table` always uses the full BigQuery name with `gold_` prefix.
118
+ 5. Bridge and support cube names follow the same rule. If the gold support model is `gold_deals_companies`, the cube name is `deals_companies` and the file is `deals_companies.yml`.
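
The naming rules above can be sketched as three small shell helpers. This is only an illustration: `cube_name`, `overlay_path`, and `join_reference` are hypothetical names that do not come from the RevOS CLI, and the sketch assumes the gold model is passed either as a bare model name or as the path to its `.sql` file.

```bash
# Hypothetical helpers (not part of the RevOS CLI): derive overlay names
# from a gold model name or from the path of its SQL file.

cube_name() {
  local model
  model="${1##*/}"       # drop any directory part
  model="${model%.sql}"  # drop the .sql extension if a file name was passed
  printf '%s\n' "${model#gold_}"  # strip the leading gold_ prefix
}

overlay_path() {
  # Overlay file name matches the cube name, under semantic/cubes/.
  printf 'semantic/cubes/%s.yml\n' "$(cube_name "$1")"
}

join_reference() {
  # The single quotes keep ${...} literal; only %s is substituted.
  printf '${%s}\n' "$(cube_name "$1")"
}
```

For example, `cube_name gold_deals_companies` prints `deals_companies`, and `overlay_path gold_deals_companies` prints `semantic/cubes/deals_companies.yml`. The physical table name used in `sql_table` is deliberately not touched by these helpers, because it keeps the `gold_` prefix.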
119
+
120
+ ---
121
+
122
+ ## Cube `sql_table` Reference
123
+
124
+ Cube overlays must reference the physical warehouse table directly, not through `dbt ref()`. Cube does not understand Jinja.
125
+
126
+ Use the literal `project` and `dataset` values captured during the "Resolve Environment Variables First" step at the start of Phase 2 (see Phase 2 below). If you reach Phase 8 without those literals already in memory, go back and run the resolution step first.
127
+
128
+ Wrap the literal in BigQuery backticks and use it as `sql_table`:
129
+
130
+ ```yaml
131
+ sql_table: "`<resolved_project>.<resolved_dataset>.<gold_model_name>`"
132
+ ```
133
+
134
+ Concrete example:
135
+
136
+ ```yaml
137
+ sql_table: "`revos-dev.revos_1737556292084.gold_hubspot_companies`"
138
+ ```
139
+
140
+ The cube `name` is `hubspot_companies` (no `gold_` prefix), but `sql_table` keeps the `gold_` prefix because that is the physical table.
141
+
142
+ Apply the same fully qualified format anywhere a physical table name is needed in a Cube overlay (notably in `refresh_key.sql`).
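
As a rough sketch of how the `sql_table` line could be assembled from the resolved literals, the helper below prints it in the format shown above. The function name `sql_table_line` is hypothetical, and the `gold_` prefix check is an assumption drawn from the naming rules in this skill rather than anything Cube enforces.

```bash
# Hypothetical helper: print the sql_table line for a Cube overlay.
# Arguments: <resolved_project> <resolved_dataset> <gold_model_name>
sql_table_line() {
  local project="$1" dataset="$2" model="$3"

  if [ -z "$project" ] || [ -z "$dataset" ] || [ -z "$model" ]; then
    echo "sql_table_line: project, dataset and model are all required" >&2
    return 1
  fi

  # The physical table keeps the gold_ prefix; only the cube name drops it.
  case "$model" in
    gold_*) ;;
    *)
      echo "sql_table_line: expected a gold_ model name, got '$model'" >&2
      return 1
      ;;
  esac

  # Backticks are literal inside the single-quoted printf format.
  printf 'sql_table: "`%s.%s.%s`"\n' "$project" "$dataset" "$model"
}
```

Failing on an empty argument matches the instruction to go back and run the resolution step rather than emit a placeholder into the YAML.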
143
+
144
+ ---
145
+
146
+ ## Non-Negotiable Rules
147
+
148
+ 1. Always start from existing dbt gold models in `dbt/models/gold/`. If no usable gold models exist, hand off to `create_dbt_transformations` and stop.
149
+ 2. Gold models may be tables or views.
150
+ 3. Always discover available gold models before doing anything else.
151
+ 4. Always ask the user which discovered gold models should participate, unless the user clearly requested one specific model only.
152
+ 5. Do not inspect selected schemas, analyze joins, generate overlays, or create support models until the model scope is clear.
153
+ 6. Only selected models may be used for the main semantic model.
154
+ 7. If selected models are not directly connected, search remaining gold models for connector/intermediate models.
155
+ 8. During connector search, Claude may inspect column names and key-like fields of non-selected gold models only for relationship discovery.
156
+ 9. Inspecting a non-selected model does not add it to the semantic model.
157
+ 10. Connector models can only be added after explicit user approval.
158
+ 11. If a many-to-many relationship requires a bridge/support model and no suitable gold model exists, ask the user whether to create one. If the user approves, hand off the bridge model creation to `create_dbt_transformations`.
159
+ 12. Do not invent joins. If models cannot be connected directly or through approved connector/bridge models, generate separate semantic subgraphs and clearly report that they are disconnected.
160
+ 13. Validate candidate joins with SQL before proposing them to the user as confirmed relationship options.
161
+ 14. Ask the user to confirm validated relationships before generating semantic overlays.
162
+ 15. For every confirmed relationship, generate joins in both directions where both cubes exist.
163
+ 16. Validate joins in both directions where possible.
164
+ 17. Relationship direction must be correct from the perspective of the current cube.
165
+ 18. Preserve every business column from each selected gold model as a Cube dimension by default.
166
+ 19. From technical Airbyte metadata columns (`_airbyte_*`), include only `_airbyte_extracted_at` as a dimension named `airbyte_extracted_at`. Exclude all other `_airbyte_*` columns by default.
167
+ 20. Do not remove non-Airbyte columns just because they do not look immediately useful.
168
+ 21. Measures are not all numeric columns by default. Always create `count`, suggest useful additional measures, and ask the user to confirm or define custom measures.
169
+ 22. Keys may be stored in normal columns, JSON strings, JSON arrays, repeated fields, or nested structures. Always check for hidden relationship keys.
170
+ 23. For JSON arrays, use `UNNEST(JSON_VALUE_ARRAY(...))`.
171
+ 24. Bridge and junction cubes should use `public: false` where the project convention supports it.
172
+ 25. Every generated Cube overlay must include `refresh_key`.
173
+ 26. Prefer `_airbyte_extracted_at` for refresh keys when available. Otherwise fall back to `every: 1 hour`.
174
+ 27. New Cube overlays must reference the physical warehouse table directly using the fully qualified name. Do not use `dbt ref()` in Cube overlays.
175
+ 28. Cube `name` and overlay file name must drop the `gold_` prefix. The physical table reference in `sql_table` keeps the `gold_` prefix.
176
+ 29. Follow the existing `semantic/cubes/` style in the repository. If existing overlays use map-style `dimensions`, `measures`, and `joins`, use map-style.
177
+
178
+ ---
179
+
180
+ ## Mandatory User Checkpoints
181
+
182
+ ### Checkpoint 1: Gold Model Selection
183
+
184
+ After discovering gold models, show the available models and ask which ones should participate.
185
+
186
+ Do not proceed until the user selects models.
187
+
188
+ Exception: if the user explicitly requested one specific model, and that model exists in `dbt/models/gold/`, treat that model as selected. If the user appears to want a multi-model semantic layer, still ask whether related models should be included.
189
+
190
+ ### Checkpoint 2a: Connector Model Approval
191
+
192
+ Triggered when selected models are not directly connected.
193
+
194
+ Search remaining gold models for connector/intermediate models. During this search, Claude may inspect column names and key-like fields of non-selected gold models only for relationship discovery. Inspecting a model does not add it to the semantic model.
195
+
196
+ If a connector path is found, explain why it is needed and ask whether to add it.
197
+
198
+ Example:
199
+
200
+ ```text
201
+ You selected products and addresses.
202
+
203
+ I do not see a direct relationship between products and addresses.
204
+
205
+ However, I found a possible connector path:
206
+
207
+ products -> clients -> addresses
208
+
209
+ The clients model appears to connect products to addresses.
210
+
211
+ Should I include gold_clients as a connector model in this semantic model?
212
+ ```
213
+
214
+ Do not use connector models until the user approves them.
215
+
216
+ ### Checkpoint 2b: Bridge / Support Model Approval
217
+
218
+ Triggered when a many-to-many relationship is detected and no suitable existing gold bridge model exists.
219
+
220
+ Ask whether to create one:
221
+
222
+ ```text
223
+ The relationship between deals and contacts appears to be many-to-many,
224
+ likely stored as a JSON array on hubspot_deals.
225
+
226
+ I did not find an existing gold bridge model for this relationship.
227
+
228
+ Should I create a support model named gold_deals_contacts?
229
+ If yes, I will hand off bridge model creation to the create_dbt_transformations skill,
230
+ then continue with the semantic model.
231
+ ```
232
+
233
+ If approved, invoke `create_dbt_transformations` with the bridge model template (it knows the standard JSON-array bridge pattern). Once that skill confirms the bridge model is built and tested, return here and continue.
234
+
235
+ Do not create or use bridge/support models until the user approves them.
236
+
237
+ ### Checkpoint 3: Relationship Confirmation
238
+
239
+ After candidate relationships are detected and validated with SQL, present the validated relationship options, join directions, cardinality, and validation evidence.
240
+
241
+ Ask the user to confirm or modify relationships before generating semantic overlays.
242
+
243
+ Do not present unvalidated joins as confirmed relationships.
244
+
245
+ If validation could not be executed, clearly mark the relationship as `validation pending` and ask the user whether to proceed with that assumption. If the user proceeds, the join is generated but flagged inside the YAML with a comment (see Phase 8).
246
+
247
+ ### Checkpoint 4: Measures Confirmation
248
+
249
+ After schema and relationship analysis, generate default dimensions and default `count` measure, then propose useful additional measures.
250
+
251
+ Ask the user to confirm suggested measures or define custom measures.
252
+
253
+ Do not create ambiguous custom measures. If the user asks for a custom measure but the definition is unclear, ask for more detail.
254
+
255
+ ---
256
+
257
+ # Workflow
258
+
259
+ Follow these phases in order.
260
+
261
+ ---
262
+
263
+ ## Phase 1: Discover Gold Models and Select Scope
264
+
265
+ ### Goal
266
+
267
+ Find available prepared gold models, detect existing overlay conventions, and determine which models participate in the semantic model.
268
+
269
+ ### Steps
270
+
271
+ 1. Discover gold models. Use the navigation commands documented in `create_dbt_transformations` ("Model Navigation"). Specifically:
272
+
273
+ ```bash
274
+ find dbt/models/gold -name "*.sql"
275
+ ```
276
+
277
+ 2. If no usable gold models exist, stop and tell the user:
278
+
279
+ ```text
280
+ No gold models were found in dbt/models/gold/. The required dbt transformation
281
+ must exist before I can build a semantic model on top of it.
282
+
283
+ Use the create_dbt_transformations skill to create the gold model first, then
284
+ run semantic model generation again.
285
+ ```
286
+
287
+ 3. Inspect existing overlays in `semantic/cubes/` to detect conventions used in this repository.
288
+
289
+ Look at one or two existing files and check for:
290
+
291
+ - Whether `extends:` is used to inherit from base cubes.
292
+ - Whether existing cubes use map-style or list-style for `dimensions`, `measures`, and `joins`.
293
+ - The `public:` flag pattern (true / false / omitted).
294
+ - The `refresh_key` style (`sql:` based or `every:` based).
295
+ - Naming patterns for dimensions and measures.
296
+ - Whether there are base cubes intended for `extends:`.
297
+
298
+ Apply the detected conventions to the new overlays. If the repository is empty or conventions are inconsistent, follow the defaults defined in this skill.
299
+
300
+ 4. If the user named a specific transformation/model, search for it. If the model exists, treat it as the selected model. If it does not exist, stop:
301
+
302
+ ```text
303
+ I could not find the requested gold model in dbt/models/gold/. The gold model
304
+ must exist before I can create a semantic overlay for it.
305
+ ```
306
+
307
+ 5. If the user did not name a specific model, list all discovered gold models.
308
+
309
+ 6. Infer likely business entities from file names.
310
+
311
+ Examples:
312
+
313
+ ```text
314
+ gold_hubspot_companies -> companies
315
+ gold_hubspot_deals -> deals
316
+ gold_hubspot_contacts -> contacts
317
+ gold_hubspot_users -> users
318
+ gold_products -> products
319
+ gold_clients -> clients
320
+ gold_addresses -> addresses
321
+ gold_invoices -> invoices
322
+ gold_payments -> payments
323
+ gold_active_users_last_month -> active users
324
+ ```
325
+
326
+ 7. Present discovered models and ask the user which ones should be included.
327
+
328
+ 8. Stop and wait for user selection if no specific model was provided.
329
+
330
+ 9. Confirm selected models back to the user.
331
+
332
+ 10. Keep the full discovered gold model list available for connector search.
333
+
334
+ 11. If the user asks to include all models, warn that this may create a larger and more complex semantic model before proceeding.
335
+
336
+ ---
337
+
338
+ ## Phase 2: Analyze Selected Model Schemas and Keys
339
+
340
+ ### Goal
341
+
342
+ Inspect selected gold models and identify primary keys, secondary keys, foreign keys, JSON keys, array keys, time columns, and metric-like columns.
343
+
344
+ ### Resolve Environment Variables First
345
+
346
+ Before running any SQL or generating any YAML in this session, resolve `$GOOGLE_CLOUD_PROJECT` and `$REVOS_BQ_DATASET` to their literal values **once**, and use those literals everywhere downstream. SQL placeholders like `<project>` are not valid BigQuery syntax, and Cube YAML does not interpolate env variables — both contexts need real values.
347
+
348
+ Run:
349
+
350
+ ```bash
351
+ echo "PROJECT=$GOOGLE_CLOUD_PROJECT"
352
+ echo "DATASET=$REVOS_BQ_DATASET"
353
+ ```
354
+
355
+ Record the two literal values returned (for example, `revos-dev` and `revos_1737556292084`). For the rest of this session, every time you see `<project>` or `<dataset>` in a SQL template or YAML example, substitute these literals.
356
+
357
+ If either variable is empty, stop and tell the user:
358
+
359
+ ```text
360
+ The environment variable $GOOGLE_CLOUD_PROJECT or $REVOS_BQ_DATASET is not set.
361
+ I cannot resolve physical BigQuery table references without them. Please set
362
+ them and try again.
363
+ ```
364
+
365
+ Do not proceed to schema inspection or join validation until the variables are resolved.
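
The resolve-then-stop behaviour described above could be wrapped in a single guard, sketched here under the assumption that both variables are visible to the shell running the command. The function name `resolve_bq_env` is hypothetical.

```bash
# Hypothetical guard: print the resolved literals, or fail if either
# variable is unset or empty.
resolve_bq_env() {
  if [ -z "${GOOGLE_CLOUD_PROJECT:-}" ] || [ -z "${REVOS_BQ_DATASET:-}" ]; then
    echo 'The environment variable $GOOGLE_CLOUD_PROJECT or $REVOS_BQ_DATASET is not set.' >&2
    return 1
  fi
  echo "PROJECT=$GOOGLE_CLOUD_PROJECT"
  echo "DATASET=$REVOS_BQ_DATASET"
}
```

A non-zero exit status is the signal to stop and show the message above to the user instead of continuing to schema inspection.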
366
+
367
+ ### Schema Discovery
368
+
369
+ For each selected gold model, inspect available columns and types. If BigQuery exploration commands are needed (listing tables, inspecting schemas, previewing rows), refer to the `explore-lakehouse` skill at `.claude/skills/explore-lakehouse/SKILL.md`.
370
+
371
+ When inspecting schema output, check whether `_airbyte_extracted_at` exists (it will be needed for `refresh_key` and as the only Airbyte dimension). Other `_airbyte_*` columns can be noted but are not used by default.
372
+
373
+ Validation queries reference physical tables using the resolved literal values (see "Resolve Environment Variables First" above). Substitute `<project>` and `<dataset>` placeholders in the SQL templates throughout this skill with the literals captured at the start of Phase 2.
374
+
375
+ ### Primary Key Detection
376
+
377
+ Look for stable unique identifiers.
378
+
379
+ Common primary key patterns:
380
+
381
+ ```text
382
+ id
383
+ <entity>_id
384
+ <model_name>_id
385
+ uuid
386
+ unique_id
387
+ external_id
388
+ source_id
389
+ ```
390
+
391
+ Examples:
392
+
393
+ ```text
394
+ companies.id
395
+ companies.company_id
396
+ companies.office_unique_id
397
+ hubspot_companies.properties_company_unique_id
398
+ ```
399
+
400
+ Primary key rules:
401
+
402
+ 1. Prefer a known business or platform identifier over a generated row number.
403
+ 2. Prefer stable IDs over names or labels.
404
+ 3. Do not mark a column as primary key only because it looks unique by name.
405
+ 4. Validate uniqueness where possible.
406
+
407
+ Validation:
408
+
409
+ ```sql
410
+ SELECT
411
+ COUNT(*) AS total_rows,
412
+ COUNT(DISTINCT <candidate_pk>) AS distinct_keys,
413
+ COUNT(*) - COUNT(DISTINCT <candidate_pk>) AS duplicate_count
414
+ FROM `<project>.<dataset>.<gold_model>`;
415
+ ```
416
+
417
+ A primary key should normally have `total_rows = distinct_keys`, or a clearly explained reason why duplicates are expected.
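
One detail worth keeping in mind when reading the result: `COUNT(DISTINCT ...)` ignores NULLs, so `duplicate_count` also includes rows where the candidate key is NULL. The sketch below interprets the two counts returned by the query. The function name `pk_check_verdict` and its output labels are hypothetical and only illustrate the rule.

```bash
# Hypothetical helper: interpret total_rows and distinct_keys from the
# uniqueness query. Arguments: <total_rows> <distinct_keys>
pk_check_verdict() {
  local total="$1" distinct="$2"

  if [ "$total" -eq 0 ]; then
    # An empty table proves nothing about uniqueness.
    echo "empty"
  elif [ "$total" -eq "$distinct" ]; then
    echo "unique"
  else
    # The difference covers repeated values and NULL keys alike.
    echo "not_unique: $((total - distinct)) rows are duplicates or NULL keys"
  fi
}
```

A `not_unique` result does not rule the column out by itself, but it does mean the duplicates or NULLs need an explanation before the column is marked as the primary key.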
418
+
419
+ ### Secondary Key Detection
420
+
421
+ Secondary keys are identifiers that are not the table primary key but can be used for joins, grouping, lookup, or `count_distinct` measures.
422
+
423
+ Common secondary key patterns:
424
+
425
+ ```text
426
+ office_unique_id
427
+ company_id
428
+ customer_id
429
+ client_id
430
+ deal_id
431
+ contact_id
432
+ owner_id
433
+ user_id
434
+ account_id
435
+ product_id
436
+ address_id
437
+ external_id
438
+ source_id
439
+ ```
440
+
441
+ Secondary key rules:
442
+
443
+ 1. Track secondary keys explicitly.
444
+ 2. Secondary keys may be foreign keys to another entity.
445
+ 3. Secondary keys should usually become Cube dimensions.
446
+ 4. Secondary keys may support `count_distinct` measures if analytically useful, but only inside the cube that owns the FK (see Phase 7 caution about fan-out).
447
+
448
+ ### JSON / Array Key Detection
449
+
450
+ Keys may be hidden inside JSON strings, JSON arrays, repeated fields, or nested structures. This is especially common for one-to-many and many-to-many relationships.
451
+
452
+ Common JSON / array relationship patterns:
453
+
454
+ ```text
455
+ companies
456
+ deals
457
+ contacts
458
+ users
459
+ owners
460
+ clients
461
+ products
462
+ addresses
463
+ associations
464
+ associated_companies
465
+ associated_deals
466
+ associated_contacts
467
+ associated_clients
468
+ associated_products
469
+ company_ids
470
+ deal_ids
471
+ contact_ids
472
+ client_ids
473
+ product_ids
474
+ address_ids
475
+ ```
476
+
477
+ Example: `gold_hubspot_deals.companies` may contain an array of company IDs.
478
+
479
+ For JSON arrays, use `UNNEST(JSON_VALUE_ARRAY(...))`.
480
+
481
+ JSON / array rules:
482
+
483
+ 1. Always inspect JSON, array, repeated, and nested fields for hidden relationship keys.
484
+ 2. Do not assume relationship keys only exist as flat columns.
485
+ 3. If JSON structure is unknown, inspect sample values first:
486
+
487
+ ```sql
488
+ SELECT
489
+ <json_or_array_column>
490
+ FROM `<project>.<dataset>.<gold_model>`
491
+ WHERE <json_or_array_column> IS NOT NULL
492
+ LIMIT 20;
493
+ ```
494
+
495
+ 4. If a relationship is stored as an array of IDs, use or create an approved bridge/support model. Bridge model creation is delegated to `create_dbt_transformations` (which has the standard JSON-array bridge template).
496
+ 5. Bridge models should preserve both sides of the relationship as keys.
497
+ 6. Bridge and junction cubes should use `public: false` where the project convention supports it.
498
+
499
+ ### Foreign Key Detection
500
+
501
+ Look for columns that reference other entities.
502
+
503
+ Common patterns:
504
+
505
+ ```text
506
+ <entity>_id
507
+ <entity>Id
508
+ fk_<entity>
509
+ associated_<entity>_id
510
+ parent_<entity>_id
511
+ owner_id
512
+ created_by_user_id
513
+ updated_by_user_id
514
+ ```
515
+
516
+ Also check JSON and array-based foreign keys:
517
+
518
+ ```text
519
+ deals.companies -> companies.id
520
+ companies.deals -> deals.id
521
+ contacts.associated_company_ids -> companies.id
522
+ ```
523
+
524
+ ### Schema Summary Output
525
+
526
+ After analysis, summarize each selected model.
527
+
528
+ Example:
529
+
530
+ ```text
531
+ Model: gold_hubspot_deals
532
+ Columns: 18
533
+ Candidate primary key: deal_id
534
+ Secondary keys: company_id, owner_id
535
+ JSON / array relationship columns: companies, contacts
536
+ Time columns: created_at, updated_at, closed_at
537
+ Numeric metric-like columns: amount
538
+ Airbyte columns present: _airbyte_extracted_at (will be exposed), _airbyte_raw_id, _airbyte_meta, _airbyte_generation_id (will be excluded by default)
539
+ ```
540
+
541
+ ---
542
+
543
+ ## Phase 3: Detect Candidate Relationships
544
+
545
+ ### Goal
546
+
547
+ Build a candidate relationship graph between selected models and approved connector/bridge models.
548
+
549
+ These relationships are candidates only. They must be validated with SQL before being proposed to the user as relationship options.
550
+
551
+ ### Single-Model Case
552
+
553
+ If only one gold model was selected in Phase 1 **and** that model has no JSON/array relationship columns identified in Phase 2, skip Phase 3 entirely. There is nothing to relate. Proceed directly to Phase 6 (dimensions) and Phase 7 (measures). Phase 4 (validation) is also skipped since there are no candidate joins.
554
+
555
+ If only one model is selected but it has JSON/array relationship columns (for example, `gold_hubspot_deals` with a `companies` array), Phase 3 still applies — but only the bridge/junction detection branch. Treat the JSON-array target entity as a candidate model and ask the user whether to add it to the scope (and, if needed, to create a bridge support model).
556
+
557
+ ### Relationship Types
558
+
559
+ Use Cube relationship types: `one_to_one`, `one_to_many`, `many_to_one`, `many_to_many`.
560
+
561
+ ### Relationship Direction
562
+
563
+ Relationship direction is from the perspective of the current cube.
564
+
565
+ ```text
566
+ Deal belongs to one company:
567
+ deals.company_id -> companies.id
568
+ relationship from deals to companies: many_to_one
569
+
570
+ Company has many deals:
571
+ companies.id -> deals.company_id
572
+ relationship from companies to deals: one_to_many
573
+ ```
574
+
575
+ ### Cardinality Rules
576
+
577
+ 1. If the source foreign key can repeat and the target key is unique → `many_to_one`; the reverse direction is `one_to_many`.
578
+ 2. If both sides are unique → `one_to_one`.
579
+ 3. If both sides can repeat, or the relationship is represented through a bridge/junction model → `many_to_many` through a bridge/junction model.
580
+ 4. If cardinality is unclear, report uncertainty and validate before proposing the relationship.
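
The cardinality rules above reduce to a small lookup once the uniqueness of each join key is known. The sketch below assumes each side has already been labelled `unique` or `repeats` by a uniqueness check. The function name `classify_cardinality` and those two labels are hypothetical.

```bash
# Hypothetical helper: map key uniqueness on each side to a Cube
# relationship type, from the perspective of the current (source) cube.
# Arguments: <source_key: unique|repeats> <target_key: unique|repeats>
classify_cardinality() {
  case "$1:$2" in
    repeats:unique)  echo "many_to_one" ;;
    unique:repeats)  echo "one_to_many" ;;
    unique:unique)   echo "one_to_one" ;;
    repeats:repeats) echo "many_to_many" ;;  # needs a bridge/junction model
    *)
      # Anything else is unclear: report it rather than guess.
      echo "unclear"
      return 1
      ;;
  esac
}
```

For the deals and companies example, `classify_cardinality repeats unique` prints `many_to_one` from the deals side, and swapping the arguments prints `one_to_many` from the companies side.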
581
+
582
+ ### Direct Join Detection
583
+
584
+ Analyze direct joins between selected models first.
585
+
586
+ Look for:
587
+
588
+ 1. Foreign key to primary key matches.
589
+ 2. Secondary key matches.
590
+ 3. Known entity conventions.
591
+ 4. Existing junction/bridge models among selected models.
592
+ 5. JSON / array relationship fields among selected models.
593
+ 6. Nested association structures.
594
+
595
+ ### Connector Model Search
596
+
597
+ Run connector search when selected models are not directly connected.
598
+
599
+ Rules:
600
+
601
+ 1. Search all discovered gold models, including non-selected models.
602
+ 2. During connector search, inspect column names and key-like fields of non-selected gold models only for relationship discovery.
603
+ 3. This lightweight schema discovery does not add the connector model to the semantic model.
604
+ 4. Do not automatically add connector models.
605
+ 5. Look for paths of length 2 first.
606
+ 6. If needed, look for paths of length 3.
607
+ 7. Do not create long or speculative chains without user confirmation.
608
+ 8. Use connector models only after user approval.
609
+
610
+ Example. User selected `gold_products` and `gold_addresses`. No direct relationship found. Search remaining gold models and find that both have `client_id` matching `gold_clients.id`. Proposed connector path: `products -> clients -> addresses`.
611
+
612
+ If the user approves the connector model:
613
+
614
+ 1. Add it to the working model scope.
615
+ 2. Run full schema discovery for it.
616
+ 3. Analyze its keys and relationships.
617
+ 4. Validate the connector path with SQL before proposing it as a relationship.
618
+ 5. Generate a semantic overlay for it.
619
+
620
+ If the user rejects the connector model:
621
+
622
+ 1. Continue only with directly connected selected models.
623
+ 2. Clearly document disconnected selected models.
624
+ 3. Do not invent joins.
625
+
626
+ ### Bridge / Junction Relationship Detection
627
+
628
+ Use existing bridge/junction gold models when available — but only after confirming they are usable.
629
+
630
+ If an existing bridge model is found in `dbt/models/gold/`:
631
+
632
+ 1. Inspect its schema (use `explore-lakehouse` for schema commands).
633
+ 2. Verify it contains the expected key columns for the relationship — typically `<entity_a>_id` and `<entity_b>_id`. The exact column names may differ from the convention (for example, `company_uuid` instead of `company_id`).
634
+ 3. Check whether `_airbyte_extracted_at` is preserved. If not, the bridge cube will have to fall back to `every: 1 hour` for `refresh_key`.
635
+ 4. Run a quick row-count and sample-rows check to confirm the bridge has data.
636
+
637
+ If the existing bridge model fits the relationship cleanly, use it.
638
+
639
+ If it does not fit (key columns named unexpectedly, missing one side of the relationship, has unrelated extra columns that change cardinality, or is empty), present the situation to the user and ask:
640
+
641
+ ```text
+ I found an existing bridge model `<model_name>` in dbt/models/gold/, but it does not fit the relationship cleanly:
+ - <specific issue, e.g. "key column is named company_uuid, not company_id">
+ - <specific issue>
+
+ Options:
+ - use as-is (I will adapt the cube join SQL to match the existing column names)
+ - create a new bridge model via create_dbt_transformations (I will use the standard JSON-array template and a new name)
+ - abort and let me revisit this later
+ ```
651
+
652
+ Wait for the user's decision. Do not silently adapt to a mismatched bridge — surfacing the mismatch is more important than the fix being automatic.
653
+
654
+ If no existing bridge model is available and the user approves creating one (Checkpoint 2b), delegate the bridge model creation to `create_dbt_transformations`. That skill has the standard JSON-array bridge template and the naming convention (`gold_<entity_a>_<entity_b>`, with cube name `<entity_a>_<entity_b>` after dropping `gold_`).
655
+
656
+ Once the bridge model exists and is materialized, return here and continue:
657
+
658
+ - Generate a Cube overlay for the bridge as a junction cube with `public: false`.
659
+ - Generate joins in both directions through the bridge.
660
+
661
+ ---
662
+
663
+ ## Phase 4: Validate Candidate Relationships Before Proposal
664
+
665
+ ### Goal
666
+
667
+ Verify with SQL that candidate joins actually work and that relationship direction is correct before presenting them to the user for confirmation.
668
+
669
+ This validation runs against BigQuery directly. It is not a YAML-level check. The point is to verify, with real data, that:
670
+
671
+ - Candidate primary keys are unique.
672
+ - Foreign keys actually match target primary keys at acceptable rates.
673
+ - Reverse aggregations produce sensible counts.
674
+ - Many-to-many bridge edges resolve correctly on both sides.
675
+ - JSON / array relationships extract to keys that exist in the target table.
676
+ - Join column types are compatible.
677
+
678
+ Validation must check both directions of every candidate relationship where possible.
679
+
680
+ If validation cannot be executed because the environment is incomplete, clearly mark validation as pending and explain what must be run later.
681
+
682
+ Validation queries reference physical tables using the literal `project` and `dataset` values resolved at the start of Phase 2. Substitute `<project>` and `<dataset>` placeholders in the SQL templates below with those literals before executing.
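Placeholder substitution can be sketched as follows. This is an illustration only: `render`, the character whitelist, and the leftover-placeholder check are assumptions of the example, not something the skill prescribes.

```python
import re

SAFE_VALUE = re.compile(r"[A-Za-z0-9_-]+")
PLACEHOLDER = re.compile(r"<[a-z_]+>")


def render(template, **values):
    """Replace <name> placeholders with literal values and reject leftovers."""
    sql = template
    for name, value in values.items():
        if not SAFE_VALUE.fullmatch(value):
            raise ValueError(f"unsafe value for <{name}>: {value!r}")
        sql = sql.replace(f"<{name}>", value)
    leftover = PLACEHOLDER.findall(sql)
    if leftover:
        raise ValueError(f"unresolved placeholders: {leftover}")
    return sql


UNIQUENESS = (
    "SELECT COUNT(*) - COUNT(DISTINCT <candidate_pk>) AS duplicate_count "
    "FROM `<project>.<dataset>.<gold_model>`"
)
print(render(
    UNIQUENESS,
    project="revos-dev",
    dataset="revos_1737556292084",
    gold_model="gold_hubspot_deals",
    candidate_pk="deal_id",
))
```

Failing on unresolved placeholders keeps a half-substituted template from reaching BigQuery.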
683
+
684
+ Do not present an unvalidated join as a confirmed relationship.
685
+
686
+ ### 4.1 Validate Key Uniqueness
687
+
688
+ ```sql
+ SELECT
+   COUNT(*) AS total_rows,
+   COUNT(DISTINCT <candidate_pk>) AS distinct_keys,
+   COUNT(*) - COUNT(DISTINCT <candidate_pk>) AS duplicate_count
+ FROM `<project>.<dataset>.<gold_model>`;
+ ```
695
+
696
+ For a primary key, `duplicate_count` should normally be `0`. If duplicates exist, do not mark the column as `primary_key: true` unless there is a clearly documented reason.
697
+
698
+ ### 4.2 Validate Many-to-One Direction
699
+
700
+ Example relationship: `deals.company_id -> companies.id`.
701
+
702
+ Expected direction: `deals -> companies: many_to_one`, `companies -> deals: one_to_many`.
703
+
704
+ Validate the many-to-one side:
705
+
706
+ ```sql
+ SELECT
+   COUNT(*) AS total_rows_with_fk,
+   COUNT(c.id) AS matched_rows,
+   COUNT(*) - COUNT(c.id) AS unmatched_rows,
+   ROUND(100.0 * COUNT(c.id) / COUNT(*), 2) AS match_percentage
+ FROM `<project>.<dataset>.gold_hubspot_deals` d
+ LEFT JOIN `<project>.<dataset>.gold_hubspot_companies` c
+   ON d.company_id = c.id
+ WHERE d.company_id IS NOT NULL;
+ ```
717
+
718
+ Also check whether a single source row matches multiple target rows:
719
+
720
+ ```sql
+ SELECT
+   d.deal_id,
+   COUNT(c.id) AS matched_companies
+ FROM `<project>.<dataset>.gold_hubspot_deals` d
+ LEFT JOIN `<project>.<dataset>.gold_hubspot_companies` c
+   ON d.company_id = c.id
+ WHERE d.company_id IS NOT NULL
+ GROUP BY d.deal_id
+ HAVING COUNT(c.id) > 1
+ LIMIT 20;
+ ```
732
+
733
+ For a valid many-to-one relationship, this result should normally be empty.
734
+
735
+ ### 4.3 Validate Reverse One-to-Many Direction
736
+
737
+ For the same relationship, validate the reverse direction using aggregation:
738
+
739
+ ```sql
+ SELECT
+   c.id AS company_id,
+   COUNT(d.deal_id) AS deal_count
+ FROM `<project>.<dataset>.gold_hubspot_companies` c
+ LEFT JOIN `<project>.<dataset>.gold_hubspot_deals` d
+   ON c.id = d.company_id
+ GROUP BY c.id
+ ORDER BY deal_count DESC
+ LIMIT 20;
+ ```
750
+
751
+ Cross-check sampled counts directly from the child/source table:
752
+
753
+ ```sql
+ SELECT
+   company_id,
+   COUNT(*) AS expected_deal_count
+ FROM `<project>.<dataset>.gold_hubspot_deals`
+ WHERE company_id IN (<sample_company_ids>)
+ GROUP BY company_id;
+ ```
761
+
762
+ The counts must match.
763
+
764
+ ### 4.4 Validate One-to-One Relationships
765
+
766
+ For a one-to-one relationship, validate uniqueness on both sides:
767
+
768
+ ```sql
+ SELECT
+   COUNT(*) AS total_rows,
+   COUNT(DISTINCT <left_key>) AS distinct_left_keys
+ FROM `<project>.<dataset>.<left_model>`;
+ ```
774
+
775
+ Run the same check for the right side with `<right_key>` and `<right_model>`. Then validate the join:
776
+
777
+ ```sql
+ SELECT
+   COUNT(*) AS total_rows,
+   COUNT(r.<right_key>) AS matched_rows,
+   COUNT(*) - COUNT(r.<right_key>) AS unmatched_rows,
+   ROUND(100.0 * COUNT(r.<right_key>) / COUNT(*), 2) AS match_percentage
+ FROM `<project>.<dataset>.<left_model>` l
+ LEFT JOIN `<project>.<dataset>.<right_model>` r
+   ON l.<left_key> = r.<right_key>
+ WHERE l.<left_key> IS NOT NULL;
+ ```
788
+
789
+ Also validate the reverse direction. If either side has duplicate keys, the relationship is not one-to-one.
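As a rough illustration of how the uniqueness results map to a cardinality, the sketch below classifies a relationship from the `COUNT(*)` and `COUNT(DISTINCT ...)` values on each side. `classify` is a hypothetical helper, and it is only a heuristic: `COUNT(DISTINCT ...)` ignores NULLs, so rows with NULL keys make a side look non-unique.

```python
def classify(left_total, left_distinct, right_total, right_distinct):
    """Infer the cardinality of left -> right from key uniqueness on each side."""
    left_unique = left_total == left_distinct
    right_unique = right_total == right_distinct
    if left_unique and right_unique:
        return "one_to_one"
    if right_unique:
        return "many_to_one"
    if left_unique:
        return "one_to_many"
    return "many_to_many"


print(classify(500, 500, 500, 500))   # unique on both sides
print(classify(1200, 300, 300, 300))  # repeated keys on the left only
```

The result is a starting hypothesis; the match-rate and reverse-direction queries still decide whether the join is usable.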
790
+
791
+ ### 4.5 Validate Many-to-Many Through Bridge/Junction Models
792
+
793
+ For a many-to-many relationship, validate both bridge edges.
794
+
795
+ Example: `companies <-> deals through gold_companies_deals`.
796
+
797
+ Validate bridge to companies:
798
+
799
+ ```sql
+ SELECT
+   COUNT(*) AS total_bridge_rows,
+   COUNT(c.id) AS matched_companies,
+   COUNT(*) - COUNT(c.id) AS unmatched_companies,
+   ROUND(100.0 * COUNT(c.id) / COUNT(*), 2) AS match_percentage
+ FROM `<project>.<dataset>.gold_companies_deals` b
+ LEFT JOIN `<project>.<dataset>.gold_hubspot_companies` c
+   ON b.company_id = c.id
+ WHERE b.company_id IS NOT NULL;
+ ```
810
+
811
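+ Validate bridge to deals with the analogous query: join `gold_hubspot_deals` as `d` on `b.deal_id = d.id`, and count `d.id` instead of `c.id`. If the deals model names its key differently (for example `deal_id`), substitute that column.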
812
+
813
+ Validate reverse aggregations from each parent through the bridge:
814
+
815
+ ```sql
+ SELECT
+   c.id AS company_id,
+   COUNT(b.deal_id) AS related_deals
+ FROM `<project>.<dataset>.gold_hubspot_companies` c
+ LEFT JOIN `<project>.<dataset>.gold_companies_deals` b
+   ON c.id = b.company_id
+ GROUP BY c.id
+ ORDER BY related_deals DESC
+ LIMIT 20;
+ ```
826
+
827
+ Run the same query in the other direction: start from deals, join the bridge on its `deal_id`, and count `b.company_id`.
828
+
829
+ Report sampled counts to the user.
830
+
831
+ ### 4.6 Validate JSON / Array Relationships
832
+
833
+ For JSON or array-based relationships, validate extracted keys:
834
+
835
+ ```sql
+ WITH extracted AS (
+   SELECT DISTINCT
+     src.<source_pk> AS source_id,
+     extracted_id
+   FROM `<project>.<dataset>.<source_model>` src,
+   UNNEST(JSON_VALUE_ARRAY(src.<json_array_column>)) AS extracted_id
+ )
+
+ SELECT
+   COUNT(*) AS total_relationships,
+   COUNT(tgt.<target_pk>) AS matched_relationships,
+   COUNT(*) - COUNT(tgt.<target_pk>) AS unmatched_relationships,
+   ROUND(100.0 * COUNT(tgt.<target_pk>) / COUNT(*), 2) AS match_percentage
+ FROM extracted e
+ LEFT JOIN `<project>.<dataset>.<target_model>` tgt
+   ON e.extracted_id = tgt.<target_pk>;
+ ```
853
+
854
+ Sample the extracted join to inspect actual matched values:
855
+
856
+ ```sql
+ WITH extracted AS (
+   SELECT DISTINCT
+     src.<source_pk> AS source_id,
+     extracted_id
+   FROM `<project>.<dataset>.<source_model>` src,
+   UNNEST(JSON_VALUE_ARRAY(src.<json_array_column>)) AS extracted_id
+ )
+
+ SELECT
+   e.source_id,
+   e.extracted_id,
+   tgt.<target_pk>,
+   tgt.<display_column>
+ FROM extracted e
+ LEFT JOIN `<project>.<dataset>.<target_model>` tgt
+   ON e.extracted_id = tgt.<target_pk>
+ LIMIT 10;
+ ```
875
+
876
+ ### 4.7 Validate Type Compatibility
877
+
878
+ Check that join columns have compatible types using INFORMATION_SCHEMA:
879
+
880
+ ```sql
+ -- table_name is selected so the two sides can be told apart in the result
+ SELECT table_name, column_name, data_type
+ FROM `<project>.<dataset>.INFORMATION_SCHEMA.COLUMNS`
+ WHERE table_name IN ('<source_model>', '<target_model>')
+   AND column_name IN ('<foreign_key>', '<target_pk>');
+ ```
886
+
887
+ For JSON / array extracted keys, check the extracted key type against the target key type.
888
+
889
+ If types differ:
890
+
891
+ 1. Report the mismatch.
892
+ 2. Prefer fixing type alignment in the dbt model or approved support model.
893
+ 3. Only cast in Cube join SQL when necessary.
894
+ 4. Prefer casting the foreign-key side to match the primary-key side.
895
+
896
+ ### 4.8 Validation Rules
897
+
898
+ Do not present relationships to the user as valid options until:
899
+
900
+ 1. Candidate joins have been validated, or validation is explicitly marked as pending.
901
+ 2. Both directions of every candidate relationship have been validated where possible.
902
+ 3. Primary key uniqueness has been checked, or marked as pending.
903
+ 4. Type compatibility has been checked, or marked as pending.
904
+ 5. Match rates are reported, or marked as pending.
905
+ 6. Reverse one-to-many aggregations are checked with sampled counts where applicable.
906
+ 7. Many-to-many bridge edges are validated in both directions where applicable.
907
+ 8. JSON / array relationships have been extracted and validated where present, or marked as pending.
908
+ 9. Low match rates or suspicious results are explained.
909
+ 10. Sample joined data or sampled aggregate counts look reasonable, or limitations are documented.
910
+
911
+ ---
912
+
913
+ ## Phase 5: Present Validated Relationships and Ask for Confirmation
914
+
915
+ ### Goal
916
+
917
+ Show the detected and validated semantic structure before generating files.
918
+
919
+ Present selected models, approved connector models, candidate keys, JSON/array keys, connector paths, bridge models, validated joins, cardinality, and validation evidence.
920
+
921
+ Example:
922
+
923
+ ```text
+ Selected gold models:
+ - gold_hubspot_companies
+ - gold_hubspot_deals
+
+ Approved connector models:
+ - gold_hubspot_users
+
+ Entity: deals
+ Source model: gold_hubspot_deals
+ Cube name (overlay): hubspot_deals
+ Candidate primary key: deal_id
+ Secondary keys: owner_id
+ JSON / array keys:
+ - companies -> company_id
+
+ Validated joins:
+ - deals.owner_id -> users.user_id (many_to_one)
+   Match rate: 99.8%
+   Reverse direction: users -> deals (one_to_many)
+   Sample reverse counts checked.
+
+ - deals.companies[] -> companies.id through bridge gold_companies_deals
+   Relationship: many_to_many
+   Bridge edges validated.
+ ```
949
+
950
+ Then ask:
951
+
952
+ ```text
+ Please confirm these selected models, connector models, bridge/support models, and validated relationships, or tell me what to change before I generate the semantic overlays.
+ ```
955
+
956
+ Do not generate final files until the user confirms or corrects the relationship model.
957
+
958
+ If a relationship could not be validated but the user still wants to proceed, mark it clearly as an assumption in the generated summary, and tag it with a YAML comment in the generated overlay (see Phase 8).
959
+
960
+ ---
961
+
962
+ ## Phase 6: Generate Dimensions
963
+
964
+ ### Goal
965
+
966
+ Expose all selected gold model business columns as Cube dimensions, plus `_airbyte_extracted_at`.
967
+
968
+ ### Type Mapping
969
+
970
+ ```text
+ STRING / VARCHAR / TEXT -> string
+ INTEGER / FLOAT / NUMERIC / DECIMAL -> number
+ BOOLEAN / BOOL -> boolean
+ DATE / DATETIME / TIMESTAMP -> time
+ JSON / ARRAY / STRUCT -> string, or skip only if it cannot be queried directly
+ ```
977
+
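The mapping above could be applied programmatically along these lines. This is a sketch, not the skill's implementation: `cube_type` is a hypothetical helper, and the `INT64` / `FLOAT64` entries are an added assumption to cover the type names BigQuery reports.

```python
import re

CUBE_TYPES = {
    "STRING": "string", "VARCHAR": "string", "TEXT": "string",
    "INTEGER": "number", "FLOAT": "number", "NUMERIC": "number", "DECIMAL": "number",
    "INT64": "number", "FLOAT64": "number",  # assumed additions, not in the table above
    "BOOLEAN": "boolean", "BOOL": "boolean",
    "DATE": "time", "DATETIME": "time", "TIMESTAMP": "time",
    "JSON": "string", "ARRAY": "string", "STRUCT": "string",
}


def cube_type(data_type):
    """Map a warehouse data type to a Cube dimension type, or None if unknown."""
    # Drop parameters such as NUMERIC(10, 2) or ARRAY<STRING> before the lookup.
    base = re.split(r"[<(]", data_type.strip().upper(), maxsplit=1)[0].strip()
    return CUBE_TYPES.get(base)


for name in ("STRING", "NUMERIC(10, 2)", "ARRAY<STRING>", "GEOGRAPHY"):
    print(name, "->", cube_type(name))
```

Returning `None` for unknown types leaves the skip-or-transform decision, and its documentation, to the caller.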
978
+ ### Dimension Rules
979
+
980
+ 1. Include primary keys as dimensions with `primary_key: true`.
981
+ 2. Include secondary keys as dimensions.
982
+ 3. Include human-readable names and statuses as string dimensions.
983
+ 4. Include timestamps as time dimensions.
984
+ 5. Include numeric attributes as number dimensions.
985
+ 6. Expose every business column from each selected gold model as a dimension by default.
986
+ 7. From technical Airbyte columns, include only `_airbyte_extracted_at`. Name the dimension `airbyte_extracted_at` (without the leading underscore) and reference the column as `${CUBE}._airbyte_extracted_at`.
987
+ 8. Exclude all other `_airbyte_*` columns by default (`_airbyte_raw_id`, `_airbyte_meta`, `_airbyte_generation_id`, etc.). Do not include them unless the user explicitly asks.
988
+ 9. Skip or transform only columns that cannot be represented safely in Cube, and document why.
989
+ 10. JSON / array columns used for relationships should usually be represented through bridge/support models.
990
+
991
+ ### Composite Primary Keys
992
+
993
+ Cube allows exactly one dimension flagged with `primary_key: true` per cube. When a gold model has a composite primary key (multiple columns that together uniquely identify a row, for example `(office_unique_id, month)` in a monthly snapshot model), do not flag any of the component columns as primary key directly. Instead:
994
+
995
+ 1. Keep each component column as a regular dimension (no `primary_key` flag).
996
+ 2. Add an additional synthetic primary-key dimension that concatenates the components, using the same `CONCAT` pattern used for bridge cubes.
997
+
998
+ Example for a monthly active users model with composite PK `(office_unique_id, month)`:
999
+
1000
+ ```yaml
+ dimensions:
+   id:
+     sql: "CONCAT(${CUBE}.office_unique_id, '-', ${CUBE}.month)"
+     type: string
+     primary_key: true
+
+   office_unique_id:
+     sql: "${CUBE}.office_unique_id"
+     type: string
+
+   month:
+     sql: "${CUBE}.month"
+     type: time
+ ```
1015
+
1016
+ Choose a separator that does not appear in either component value. `-` is usually safe for IDs and dates; if components may contain `-`, use `||` or another unambiguous separator.
1017
+
1018
+ The synthetic `id` dimension is the cube's primary key for Cube's purposes. Joins to this cube must reference that synthetic `id`, not individual components — unless the joining cube has the same composite key columns and uses a parallel `CONCAT` in the join SQL.
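The synthetic key expression and the separator concern can be illustrated with a short sketch. `synthetic_pk_sql` and `joined` are hypothetical helpers written for this example; the separator is assumed not to contain a single quote.

```python
def synthetic_pk_sql(components, separator="-"):
    """Build the CONCAT expression for a synthetic primary-key dimension."""
    parts = f", '{separator}', ".join(f"${{CUBE}}.{column}" for column in components)
    return f"CONCAT({parts})"


def joined(values, separator="-"):
    """Mimic what CONCAT produces for one row of component values."""
    return separator.join(values)


print(synthetic_pk_sql(["office_unique_id", "month"]))
# Two different key pairs collapse to one string when a component contains the separator.
print(joined(["a-b", "c"]) == joined(["a", "b-c"]))
# A separator that does not occur in the components keeps them distinct.
print(joined(["a-b", "c"], "||") == joined(["a", "b-c"], "||"))
```

The second line of output shows why a colliding separator breaks uniqueness, which is the reason to pick one that cannot occur in the component values.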
1019
+
1020
+ ### Large Column Count Warning
1021
+
1022
+ If a model has more than 50 business columns, inform the user before generating:
1023
+
1024
+ ```text
+ This model has <N> business columns. I will generate <N> dimensions by default,
+ plus airbyte_extracted_at if the column exists.
+ Other _airbyte_* columns will be excluded by default.
+
+ Proceed with all business columns, or should I skip any groups?
+ ```
1031
+
1032
+ Do not skip business columns without explicit user instruction.
1033
+
1034
+ Example dimensions block:
1035
+
1036
+ ```yaml
+ dimensions:
+   deal_id:
+     sql: "${CUBE}.deal_id"
+     type: string
+     primary_key: true
+
+   company_id:
+     sql: "${CUBE}.company_id"
+     type: string
+
+   deal_name:
+     sql: "${CUBE}.deal_name"
+     type: string
+
+   amount:
+     sql: "${CUBE}.amount"
+     type: number
+
+   created_at:
+     sql: "${CUBE}.created_at"
+     type: time
+
+   airbyte_extracted_at:
+     sql: "${CUBE}._airbyte_extracted_at"
+     type: time
+ ```
1063
+
1064
+ ---
1065
+
1066
+ ## Phase 7: Suggest and Confirm Measures
1067
+
1068
+ ### Goal
1069
+
1070
+ Create default and useful measures without inventing ambiguous business logic.
1071
+
1072
+ ### Default Measure
1073
+
1074
+ Always include a row count measure unless project convention says otherwise.
1075
+
1076
+ ```yaml
+ measures:
+   count:
+     type: count
+ ```
1081
+
1082
+ ### Suggested Measures
1083
+
1084
+ Suggest useful measures based on model schema and column names.
1085
+
1086
+ Common suggestions:
1087
+
1088
+ ```text
+ amount -> total_amount (sum), average_amount (avg), min_amount, max_amount
+ revenue -> total_revenue (sum)
+ price -> total_price or average_price depending on context
+ cost -> total_cost (sum)
+ quantity / qty -> total_quantity (sum)
+ duration -> total_duration or average_duration
+ deal_id / company_id / user_id -> count_distinct
+ created_at -> first_created_at (min), last_created_at (max)
+ closed_at -> first_closed_at (min), last_closed_at (max)
+ updated_at -> last_updated_at (max)
+ ```
1100
+
1101
+ Examples:
1102
+
1103
+ ```yaml
+ measures:
+   total_amount:
+     sql: "${CUBE}.amount"
+     type: sum
+
+   unique_companies:
+     sql: "${CUBE}.company_id"
+     type: count_distinct
+
+   last_closed_at:
+     sql: "${CUBE}.closed_at"
+     type: max
+ ```
1117
+
1118
+ ### `count_distinct` on Foreign Keys
1119
+
1120
+ `count_distinct` on a foreign-key column counts unique values within the cube that owns the FK. For example, `count_distinct(company_id)` inside the `hubspot_deals` cube answers "how many distinct companies have deals."
1121
+
1122
+ Be careful when this measure is used in queries that join multiple cubes. Joins can produce row fan-out, which inflates or distorts distinct counts. Rules:
1123
+
1124
+ 1. Define `count_distinct` on FK columns inside the cube that owns the FK.
1125
+ 2. Add a brief description in the measure (or note it for the user) so consumers know the cube boundary.
1126
+ 3. Avoid suggesting `count_distinct` on FK columns inside the parent cube (the cube that owns the PK) — `count` of rows there usually answers the same question more reliably.
1127
+
1128
+ ### Measure Confirmation
1129
+
1130
+ After suggesting measures, ask the user:
1131
+
1132
+ ```text
+ I will create the default count measure.
+
+ I also found these possible additional measures:
+ - total_amount: sum(amount)
+ - average_amount: avg(amount)
+ - unique_companies: count_distinct(company_id)
+ - last_closed_at: max(closed_at)
+
+ Which of these should I include?
+
+ Do you want to define any custom measures?
+ ```
1145
+
1146
+ ### Custom Measures
1147
+
1148
+ If the user requests a custom measure and its definition is not yet precise, ask for one.
1149
+
1150
+ A custom measure definition should include:
1151
+
1152
+ ```text
+ measure name
+ source column(s)
+ aggregation type
+ filters or conditions
+ business meaning
+ ```
1159
+
1160
+ Custom measure rules:
1161
+
1162
+ 1. Do not create ambiguous custom measures.
1163
+ 2. If the definition is clear, generate the measure.
1164
+ 3. If the definition is unclear, ask for clarification.
1165
+ 4. If the requested measure cannot be created from available columns, explain why and list the missing data.
1166
+
1167
+ ---
1168
+
1169
+ ## Phase 8: Generate Cube Semantic Overlays
1170
+
1171
+ ### Goal
1172
+
1173
+ Create Cube.dev semantic overlay YAML files for selected and approved gold models.
1174
+
1175
+ ### Output Location and File Naming
1176
+
1177
+ Create files under `semantic/cubes/`. The overlay file name strips the `gold_` prefix. The cube `name` matches the file name (without extension).
1178
+
1179
+ Examples:
1180
+
1181
+ ```text
+ gold_hubspot_deals.sql -> semantic/cubes/hubspot_deals.yml (name: hubspot_deals)
+ gold_hubspot_companies.sql -> semantic/cubes/hubspot_companies.yml (name: hubspot_companies)
+ gold_companies_deals.sql -> semantic/cubes/companies_deals.yml (name: companies_deals)
+ gold_clients.sql -> semantic/cubes/clients.yml (name: clients)
+ ```
1187
+
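The naming rule in the examples above amounts to a small path transformation. The sketch below is illustrative; `overlay_target` is a hypothetical helper, not something the skill defines.

```python
from pathlib import PurePosixPath


def overlay_target(model_file):
    """Return (overlay_path, cube_name) for a gold dbt model file name."""
    stem = PurePosixPath(model_file).stem
    if not stem.startswith("gold_"):
        raise ValueError(f"not a gold model: {model_file}")
    name = stem[len("gold_"):]
    return f"semantic/cubes/{name}.yml", name


print(overlay_target("gold_hubspot_deals.sql"))
print(overlay_target("dbt/models/gold/gold_clients.sql"))
```

Only the overlay file and cube name drop the prefix; the physical table referenced by `sql_table` keeps it.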
1188
+ ### Required Source Reference
1189
+
1190
+ Use the fully qualified BigQuery table reference. Resolve environment variables to literals via `echo` (see the "Cube `sql_table` Reference" section above).
1191
+
1192
+ Example:
1193
+
1194
+ ```yaml
+ sql_table: "`revos-dev.revos_1737556292084.gold_hubspot_deals`"
+ ```
1197
+
1198
+ ### Overlay Style
1199
+
1200
+ Follow the existing `semantic/cubes/` style detected in Phase 1.
1201
+
1202
+ If existing overlays use map-style `dimensions`, `measures`, and `joins`, use map-style.
1203
+
1204
+ If existing overlays use `extends:` to inherit from base cubes, follow that convention. Otherwise, generate self-contained cubes.
1205
+
1206
+ ### Canonical Example: Standard Cube
1207
+
1208
+ A complete example of a standard (non-bridge) cube:
1209
+
1210
+ ```yaml
+ cubes:
+   - name: hubspot_companies
+     sql_table: "`revos-dev.revos_1737556292084.gold_hubspot_companies`"
+
+     joins:
+       companies_deals:
+         sql: "${CUBE}.id = ${companies_deals}.company_id"
+         relationship: one_to_many
+
+     measures:
+       count:
+         type: count
+
+       total_deal_value:
+         sql: "${CUBE}.properties_hs_total_deal_value"
+         type: sum
+
+       num_open_deals:
+         sql: "${CUBE}.properties_hs_num_open_deals"
+         type: sum
+
+     dimensions:
+       id:
+         sql: "${CUBE}.id"
+         type: string
+         primary_key: true
+
+       airbyte_extracted_at:
+         sql: "${CUBE}._airbyte_extracted_at"
+         type: time
+
+     refresh_key:
+       sql: "SELECT MAX(_airbyte_extracted_at) FROM `revos-dev.revos_1737556292084.gold_hubspot_companies`"
+ ```
1245
+
1246
+ Notes:
1247
+
1248
+ 1. Top-level `cubes:` array is required.
1249
+ 2. Cube `name` is `hubspot_companies` (no `gold_` prefix).
1250
+ 3. `sql_table` references `gold_hubspot_companies` (with `gold_` prefix), in backticks.
1251
+ 4. The join references `${companies_deals}` — the cube name of a bridge cube defined in `semantic/cubes/companies_deals.yml`.
1252
+ 5. Only `_airbyte_extracted_at` is exposed from Airbyte metadata, as `airbyte_extracted_at`.
1253
+ 6. `refresh_key.sql` uses the same fully qualified table name as `sql_table`.
1254
+
1255
+ ### Unvalidated Joins
1256
+
1257
+ If the user chose to proceed with a join that could not be validated, generate the join but tag it with a YAML comment:
1258
+
1259
+ ```yaml
+ joins:
+   hubspot_companies:
+     # UNVALIDATED: match rate could not be measured because gold_hubspot_companies was not yet materialized in BigQuery
+     sql: "${CUBE}.company_id = ${hubspot_companies}.id"
+     relationship: many_to_one
+ ```
1266
+
1267
+ Use a short, factual reason after `UNVALIDATED:`.
1268
+
1269
+ ---
1270
+
1271
+ ## Cube Overlay Requirements
1272
+
1273
+ ### Joins
1274
+
1275
+ Every confirmed relationship must be represented in both directions where both cubes exist.
1276
+
1277
+ Direct many-to-one example:
1278
+
1279
+ ```yaml
+ # In hubspot_deals.yml
+ joins:
+   hubspot_companies:
+     sql: "${CUBE}.company_id = ${hubspot_companies}.id"
+     relationship: many_to_one
+ ```
1286
+
1287
+ Reverse one-to-many:
1288
+
1289
+ ```yaml
+ # In hubspot_companies.yml
+ joins:
+   hubspot_deals:
+     sql: "${CUBE}.id = ${hubspot_deals}.company_id"
+     relationship: one_to_many
+ ```
1296
+
1297
+ Connector path. For `products -> clients -> addresses`, create joins for each edge in both directions:
1298
+
1299
+ ```yaml
+ # In products.yml
+ joins:
+   clients:
+     sql: "${CUBE}.client_id = ${clients}.id"
+     relationship: many_to_one
+ ```
1306
+
1307
+ ```yaml
+ # In clients.yml
+ joins:
+   products:
+     sql: "${CUBE}.id = ${products}.client_id"
+     relationship: one_to_many
+
+   addresses:
+     sql: "${CUBE}.id = ${addresses}.client_id"
+     relationship: one_to_many
+ ```
1318
+
1319
+ ```yaml
+ # In addresses.yml
+ joins:
+   clients:
+     sql: "${CUBE}.client_id = ${clients}.id"
+     relationship: many_to_one
+ ```
1326
+
1327
+ Bridge join. The bridge cube joins to both parents:
1328
+
1329
+ ```yaml
+ # In companies_deals.yml
+ joins:
+   hubspot_companies:
+     relationship: many_to_one
+     sql: "${CUBE}.company_id = ${hubspot_companies}.id"
+
+   hubspot_deals:
+     relationship: many_to_one
+     sql: "${CUBE}.deal_id = ${hubspot_deals}.id"
+ ```
1340
+
1341
+ Reverse joins from each parent to the bridge:
1342
+
1343
+ ```yaml
+ # In hubspot_companies.yml
+ joins:
+   companies_deals:
+     sql: "${CUBE}.id = ${companies_deals}.company_id"
+     relationship: one_to_many
+ ```
1350
+
1351
+ ```yaml
+ # In hubspot_deals.yml
+ joins:
+   companies_deals:
+     sql: "${CUBE}.id = ${companies_deals}.deal_id"
+     relationship: one_to_many
+ ```
1358
+
1359
+ Join rules:
1360
+
1361
+ 1. Use validated keys.
1362
+ 2. Use the correct relationship direction from the current cube.
1363
+ 3. Generate both directions for every confirmed relationship.
1364
+ 4. Reference other cubes by their cube `name` (without `gold_` prefix) in `${...}`.
1365
+ 5. Prefer joins between gold cubes.
1366
+ 6. Keep join SQL readable and explicit.
1367
+ 7. If key casting is required, prefer fixing it in the dbt model or support model first.
1368
+ 8. Tag unvalidated joins with `# UNVALIDATED: <reason>` instead of silently emitting them.
1369
+ 9. For JSON / array relationships, use or create approved bridge/support models (creation delegated to `create_dbt_transformations`).
1370
+ 10. For connector paths, join through the connector model instead of inventing a direct join.
1371
+ 11. Bridge and junction cubes should use `public: false` where the project convention supports it.
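Rules 2 to 4 above (correct direction, both directions, cube names without `gold_`) can be sketched as a helper that emits the forward and reverse join together. `both_directions` and its return shape are assumptions made for illustration; it does not handle self-joins.

```python
REVERSE = {
    "many_to_one": "one_to_many",
    "one_to_many": "many_to_one",
    "one_to_one": "one_to_one",
}


def both_directions(source, source_key, target, target_key, relationship):
    """Return {cube_name: {joined_cube: join_definition}} for both sides of a relationship."""
    return {
        source: {target: {
            "sql": f"${{CUBE}}.{source_key} = ${{{target}}}.{target_key}",
            "relationship": relationship,
        }},
        target: {source: {
            "sql": f"${{CUBE}}.{target_key} = ${{{source}}}.{source_key}",
            "relationship": REVERSE[relationship],
        }},
    }


joins = both_directions("hubspot_deals", "company_id", "hubspot_companies", "id", "many_to_one")
for cube, definition in joins.items():
    print(cube, definition)
```

Deriving the reverse join from the forward one keeps the two overlay files from drifting apart.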
1372
+
1373
+ ### Bridge / Junction Cubes
1374
+
1375
+ Bridge and junction cubes should use `public: false` where the project convention supports it.
1376
+
1377
+ Example:
1378
+
1379
+ ```yaml
+ cubes:
+   - name: companies_deals
+     sql_table: "`revos-dev.revos_1737556292084.gold_companies_deals`"
+     public: false
+
+     joins:
+       hubspot_companies:
+         relationship: many_to_one
+         sql: "${CUBE}.company_id = ${hubspot_companies}.id"
+
+       hubspot_deals:
+         relationship: many_to_one
+         sql: "${CUBE}.deal_id = ${hubspot_deals}.id"
+
+     measures:
+       count:
+         type: count
+
+     dimensions:
+       id:
+         sql: "CONCAT(${CUBE}.deal_id, '-', ${CUBE}.company_id)"
+         type: string
+         primary_key: true
+
+       deal_id:
+         sql: "${CUBE}.deal_id"
+         type: string
+
+       company_id:
+         sql: "${CUBE}.company_id"
+         type: string
+
+       airbyte_extracted_at:
+         sql: "${CUBE}._airbyte_extracted_at"
+         type: time
+
+     refresh_key:
+       sql: "SELECT MAX(_airbyte_extracted_at) FROM `revos-dev.revos_1737556292084.gold_companies_deals`"
+ ```
1419
+
1420
+ If the bridge model does not have `_airbyte_extracted_at`, omit that dimension and use the default time-based refresh key:
1421
+
1422
+ ```yaml
+ refresh_key:
+   every: 1 hour
+ ```
1426
+
1427
+ ### Refresh Key
1428
+
1429
+ Every generated Cube overlay must include `refresh_key`.
1430
+
1431
+ Priority order:
1432
+
1433
+ 1. If the gold model has `_airbyte_extracted_at`, use it for a SQL-based refresh key.
1434
+ 2. If the gold model has another reliable ingestion or update timestamp (`updated_at`, `modified_at`, `loaded_at`, `synced_at`), use that.
1435
+ 3. Otherwise use the default time-based refresh key.
1436
+
1437
+ `refresh_key.sql` must reference the same fully qualified BigQuery table as the cube's own `sql_table`, never a different table.
1438
+
1439
+ Airbyte refresh key:
1440
+
1441
+ ```yaml
+ refresh_key:
+   sql: "SELECT MAX(_airbyte_extracted_at) FROM `<project>.<dataset>.<gold_model>`"
+ ```
1445
+
1446
+ Other timestamp-based refresh key:
1447
+
1448
+ ```yaml
+ refresh_key:
+   sql: "SELECT MAX(updated_at) FROM `<project>.<dataset>.<gold_model>`"
+ ```
1452
+
1453
+ Default:
1454
+
1455
+ ```yaml
+ refresh_key:
+   every: 1 hour
+ ```
1459
+
1460
+ Refresh key rules:
1461
+
1462
+ 1. Always include `refresh_key` in every generated Cube overlay.
1463
+ 2. Prefer `_airbyte_extracted_at` when it exists.
1464
+ 3. Prefer reliable timestamp-based refresh keys over fixed interval refresh keys.
1465
+ 4. Use `every: 1 hour` as the default fallback.
1466
+ 5. If a timestamp column exists but is not reliable, explain the assumption and use the default fallback.
1467
+ 6. Bridge and junction cubes must also include `refresh_key`.
1468
+ 7. `refresh_key.sql` must reference the same fully qualified BigQuery table as the cube's own `sql_table`.
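The priority order can be sketched as a small selection function. This is a hedged illustration: `refresh_key` is a hypothetical helper, and it assumes the caller passes only columns already judged reliable, since reliability cannot be inferred from a column name.

```python
FALLBACK_TIMESTAMPS = ("updated_at", "modified_at", "loaded_at", "synced_at")


def refresh_key(columns, table):
    """Pick a refresh key: Airbyte timestamp first, then another timestamp, then the interval."""
    candidates = ("_airbyte_extracted_at",) + FALLBACK_TIMESTAMPS
    column = next((name for name in candidates if name in columns), None)
    if column is None:
        return {"every": "1 hour"}
    # The SQL form always targets the cube's own table.
    return {"sql": f"SELECT MAX({column}) FROM `{table}`"}


table = "revos-dev.revos_1737556292084.gold_hubspot_companies"
print(refresh_key({"id", "_airbyte_extracted_at", "updated_at"}, table))
print(refresh_key({"id", "name"}, table))
```

Passing the same `table` string that is used for `sql_table` is what keeps the two references identical.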
1469
+
1470
+ ---
1471
+
1472
+ ## Phase 9: Validate Generated Files
1473
+
1474
+ ### Goal
1475
+
1476
+ Validate generated semantic files and any approved support models.
1477
+
1478
+ ### dbt Validation
1479
+
1480
+ If `create_dbt_transformations` was invoked during this run (for example, to create a bridge model), it has already validated the new dbt models with `revos dbt run` and `revos dbt test`. No re-validation is needed here for those.
1481
+
1482
+ If only existing gold models were used, run a basic syntax check via the dbt skill's standard command:
1483
+
1484
+ ```bash
+ revos dbt parse
+ ```
1487
+
1488
+ ### Verify Physical Tables Exist in BigQuery
1489
+
1490
+ For each generated cube, confirm the physical table referenced in `sql_table` actually exists in BigQuery. Cube does not catch a missing table at YAML parse time; it only fails at first query.
1491
+
1492
+ ```bash
+ bq show <project>:<dataset>.<table_name>
+ ```
1495
+
1496
+ Example:
1497
+
1498
+ ```bash
+ bq show revos-dev:revos_1737556292084.gold_hubspot_companies
+ ```
1501
+
1502
+ If the table does not exist, the gold model is not yet materialized. Either materialize it first (run the gold dbt build via `create_dbt_transformations`), or document this as a pending item before handing the overlay over.
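Because `sql_table` uses the backticked `project.dataset.table` form while `bq show` expects `project:dataset.table`, the conversion is easy to get wrong by hand. The sketch below shows one way to derive the command arguments; `bq_show_args` is a hypothetical helper and only builds the argument list, it does not call `bq`.

```python
import re


def bq_show_args(sql_table):
    """Turn a backticked project.dataset.table reference into arguments for bq show."""
    match = re.fullmatch(r"`?([^.`]+)\.([^.`]+)\.([^.`]+)`?", sql_table.strip())
    if match is None:
        raise ValueError(f"not a fully qualified table reference: {sql_table!r}")
    project, dataset, table = match.groups()
    return ["bq", "show", f"{project}:{dataset}.{table}"]


print(bq_show_args("`revos-dev.revos_1737556292084.gold_hubspot_companies`"))
```

The resulting list could be passed to `subprocess.run`; treating a non-zero exit code as "table missing" is an assumption that should be checked against the installed `bq` version.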
1503
+
1504
+ ### Semantic Validation
1505
+
1506
+ Run available project commands to validate Cube YAML.
1507
+
1508
+ Placeholder:
1509
+
1510
+ ```bash
+ <cube-validation-command>
+ ```
1513
+
1514
+ ### Manual Validation Checklist
1515
+
1516
+ 1. Semantic overlays were created in `semantic/cubes/`.
1517
+ 2. Overlay file names drop the `gold_` prefix (`hubspot_companies.yml`, not `gold_hubspot_companies.yml`).
1518
+ 3. Each cube `name` matches the overlay file name (without extension) and has no `gold_` prefix.
1519
+ 4. Cube overlays reference the gold tables using ``sql_table: "`<project>.<dataset>.gold_<entity>`"``.
1520
+ 5. The physical table name in `sql_table` keeps the `gold_` prefix.
1521
+ 6. No `dbt ref()` syntax is used in new overlays.
1522
+ 7. Each cube has dimensions.
1523
+ 8. Every business column from each selected gold model is exposed as a dimension unless documented otherwise.
1524
+ 9. From `_airbyte_*` columns, only `_airbyte_extracted_at` is exposed (as dimension `airbyte_extracted_at`); other `_airbyte_*` columns are excluded.
1525
+ 10. Each cube has a `count` measure.
1526
+ 11. Suggested and approved additional measures are included.
1527
+ 12. Ambiguous custom measures are not created without clarification.
1528
+ 13. `count_distinct` measures on FK columns are defined inside the cube that owns the FK, not the parent cube.
1529
+ 14. Primary keys are marked correctly. For composite primary keys, a synthetic `CONCAT(...)` dimension is the one flagged with `primary_key: true`; individual components are kept as regular dimensions.
1530
+ 15. Secondary keys are preserved as dimensions.
1531
+ 16. JSON arrays use `UNNEST(JSON_VALUE_ARRAY(...))`.
1532
+ 17. JSON / array relationship keys are extracted into approved bridge/support models where needed.
1533
+ 18. Bridge and junction cubes use `public: false` where the project convention supports it.
1534
+ 19. Each cube has a `refresh_key`.
1535
+ 20. Cubes with `_airbyte_extracted_at` use it in a SQL-based `refresh_key` where possible.
1536
+ 21. Cubes without `_airbyte_extracted_at` but with another reliable timestamp use that timestamp where possible.
1537
+ 22. Cubes without a reliable timestamp use `every: 1 hour`.
1538
+ 23. `refresh_key.sql` references the same fully qualified BigQuery table as the cube's own `sql_table`.
1539
+ 24. Joins use validated columns.
1540
+ 25. Joins reference other cubes by their cube `name` (without `gold_` prefix), e.g. `${hubspot_companies}`.
1541
+ 26. Join relationship types are correct from the current cube perspective.
1542
+ 27. Every confirmed relationship has joins in both directions where both cubes exist.
1543
+ 28. Candidate joins were validated before being proposed to the user.
1544
+ 29. Join validation was performed in both directions where possible.
1545
+ 30. Reverse one-to-many aggregation checks were performed where applicable.
1546
+ 31. Many-to-many bridge edges were validated in both directions where applicable.
1547
+ 32. Joins that could not be validated and were generated anyway are tagged with `# UNVALIDATED: <reason>`.
1548
+ 33. Connector models are only used if the user approved them.
1549
+ 34. Selected models that remain disconnected are documented.
1550
+ 35. Physical tables referenced by `sql_table` exist in BigQuery, or the missing tables are clearly listed as pending items.
1551
+ 36. Placeholder commands or assumptions are clearly marked.
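
A few of the naming items in the checklist above (2, 3, 5 and 6) are mechanical enough to script. The sketch below is illustrative only: `lint_overlay` is a hypothetical helper, and it assumes one cube per file with `- name:` as the first key of the cube entry. The remaining items still need manual review.

```shell
# Lint one overlay file against checklist items 2, 3, 5 and 6.
# Prints one line per problem and returns 1 if any check fails.
lint_overlay() {
  file="$1"
  status=0
  base="$(basename "$file" .yml)"
  # Assumption: the first "- name:" entry in the file is the cube name.
  name="$(sed -n 's/^[[:space:]]*-[[:space:]]*name:[[:space:]]*//p' "$file" | head -n 1 | tr -d '" ')"
  table="$(sed -n 's/^[[:space:]]*sql_table:[[:space:]]*//p' "$file" | head -n 1 | tr -d '`" ')"

  # Item 2: overlay file names drop the gold_ prefix.
  case "$base" in
    gold_*) echo "$file: file name keeps the gold_ prefix"; status=1 ;;
  esac
  # Item 3: cube name matches the file name without extension.
  if [ "$name" != "$base" ]; then
    echo "$file: cube name '$name' does not match file name '$base'"
    status=1
  fi
  # Item 5: the physical table name keeps the gold_ prefix.
  case "${table##*.}" in
    gold_*) ;;
    *) echo "$file: sql_table '$table' does not point at a gold_ table"; status=1 ;;
  esac
  # Item 6: no dbt ref() syntax in overlays.
  if grep -q 'ref(' "$file"; then
    echo "$file: contains dbt ref() syntax"
    status=1
  fi
  return "$status"
}
```

Running `for f in semantic/cubes/*.yml; do lint_overlay "$f"; done` reports each violation with the file it came from.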
1552
+
1553
+ ---
1554
+
1555
+ ## Final Response Format
1556
+
1557
+ After generation, summarize what was created:
1558
+
1559
+ ```text
1560
+ Created semantic model draft.
1561
+
1562
+ Selected gold models:
1563
+ - dbt/models/gold/<gold_model_1>.sql
1564
+ - dbt/models/gold/<gold_model_2>.sql
1565
+
1566
+ Approved connector models:
1567
+ - dbt/models/gold/<connector_model>.sql
1568
+
1569
+ Bridge/support models created during this run (via create_dbt_transformations):
1570
+ - dbt/models/gold/<bridge_model>.sql
1571
+
1572
+ Semantic overlays:
1573
+ - semantic/cubes/<entity_1>.yml (cube name: <entity_1>)
1574
+ - semantic/cubes/<entity_2>.yml (cube name: <entity_2>)
1575
+ - semantic/cubes/<bridge_entity>.yml (cube name: <bridge_entity>, public: false)
1576
+
1577
+ Detected and validated relationships:
1578
+ - <entity_a>.<key> -> <entity_b>.<key> (<relationship_type>)
1579
+ - <entity_b>.<key> -> <entity_a>.<key> (<reverse_relationship_type>)
1580
+
1581
+ Connector paths:
1582
+ - <selected_entity_a> -> <connector_entity> -> <selected_entity_b>
1583
+
1584
+ Bridge relationships:
1585
+ - <source_entity>.<json_array_column>[] -> <target_entity>.<target_key> through <bridge_entity>
1586
+
1587
+ Measures:
1588
+ - count
1589
+ - <approved_measure_1>
1590
+ - <approved_measure_2>
1591
+
1592
+ Validation:
1593
+ - dbt validation: <passed / failed / pending / not run — only existing models used>
1594
+ - physical table existence in BigQuery: <passed / failed / pending>
1595
+ - join candidate validation before proposal: <passed / failed / pending>
1596
+ - reverse join validation: <passed / failed / pending>
1597
+ - semantic validation: <passed / failed / pending>
1598
+
1599
+ Unvalidated joins (tagged in YAML with # UNVALIDATED):
1600
+ - <cube>.<join_target>: <reason>
1601
+
1602
+ Assumptions:
1603
+ - <assumption_1>
1604
+ - <assumption_2>
1605
+
1606
+ Pending items:
1607
+ - <pending_item_1>
1608
+ - <pending_item_2>
1609
+ ```
1610
+
1611
+ If validation is incomplete, say exactly what remains pending.