@revos/cli 0.1.2 → 0.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (33)
  1. package/README.md +17 -13
  2. package/dist/adapters/oclif/commands/auth/login.mjs +2 -2
  3. package/dist/adapters/oclif/commands/auth/logout.mjs +2 -2
  4. package/dist/adapters/oclif/commands/auth/status.mjs +2 -2
  5. package/dist/adapters/oclif/commands/init.mjs +2 -2
  6. package/dist/adapters/oclif/commands/org/current.mjs +2 -2
  7. package/dist/adapters/oclif/commands/org/list.mjs +2 -2
  8. package/dist/adapters/oclif/commands/org/switch.mjs +2 -2
  9. package/dist/adapters/oclif/commands/overlays/diff.d.mts +1 -1
  10. package/dist/adapters/oclif/commands/overlays/diff.mjs +3 -3
  11. package/dist/adapters/oclif/commands/overlays/pull.d.mts +1 -1
  12. package/dist/adapters/oclif/commands/overlays/pull.mjs +3 -3
  13. package/dist/adapters/oclif/commands/overlays/push.d.mts +1 -1
  14. package/dist/adapters/oclif/commands/overlays/push.mjs +3 -3
  15. package/dist/adapters/oclif/commands/overlays/status.d.mts +1 -1
  16. package/dist/adapters/oclif/commands/overlays/status.mjs +3 -3
  17. package/dist/{base.command-DDSLyx5v.mjs → base.command-CaFn9EwG.mjs} +1 -1
  18. package/dist/{core-EJgxP-x5.mjs → core-BSLZ9hQU.mjs} +42 -17
  19. package/dist/{index-DH6vy050.d.mts → index-B8n2GxTc.d.mts} +1 -1
  20. package/dist/index.d.mts +3 -3
  21. package/dist/index.mjs +1 -1
  22. package/dist/templates/AGENTS.md +1 -1
  23. package/dist/templates/skills/create-dbt-transformations/SKILL.md +214 -0
  24. package/dist/templates/skills/create-dbt-transformations/references/edge-cases.md +46 -0
  25. package/dist/templates/skills/create-dbt-transformations/references/schema-conventions.md +128 -0
  26. package/dist/templates/skills/create-dbt-transformations/references/sql-templates.md +73 -0
  27. package/dist/templates/skills/create-semantic-model/SKILL.md +126 -1432
  28. package/dist/templates/skills/create-semantic-model/references/cube-examples.md +267 -0
  29. package/dist/templates/skills/create-semantic-model/references/key-patterns.md +150 -0
  30. package/dist/templates/skills/create-semantic-model/references/validation-queries.md +209 -0
  31. package/dist/templates/skills/explore-lakehouse/SKILL.md +8 -1
  32. package/dist/{types-DZssnweO.d.mts → types-DmuJzN0Z.d.mts} +5 -1
  33. package/package.json +2 -1
@@ -1,1611 +1,305 @@
1
1
  ---
2
2
  name: create-semantic-model
3
- description: Create semantic models (Cube.dev cubes) from existing RevOS dbt gold models. Use when asked to build a semantic layer, create cubes, or generate Cube definitions from dbt.
3
+ description: >
4
+ Create semantic models (Cube.dev cubes) from existing RevOS dbt gold models.
5
+ Use when asked to: build a semantic layer, create cubes, generate Cube definitions from dbt,
6
+ create a semantic overlay, or create a semantic model from gold models.
4
7
  ---
5
8
 
6
- @.claude/skills/create_dbt_transformations/SKILL.md
7
-
8
- # Create Semantical Model
9
-
10
- Use this skill when the user asks to create a semantic model or semantic overlay.
11
-
12
- Typical user requests:
13
-
14
- > Create a new semantic overlay for `<transformation>`.
15
-
16
- > Create a semantic model from the available gold models.
17
-
18
- > Build a Cube semantic layer for these gold tables.
19
-
20
- This skill analyzes existing dbt gold models, asks the user which models should participate, detects keys and relationships, validates join candidates with SQL before proposing them to the user, asks the user to confirm the semantic structure and measures, and generates Cube.dev semantic overlays under `semantic/cubes/`.
9
+ # Create Semantic Model
21
10
 
22
11
  ## Skill Dependencies
23
12
 
24
- This skill delegates dbt-related knowledge to `create_dbt_transformations` (loaded above):
13
+ This skill delegates dbt-related knowledge to `create-dbt-transformations`:
14
+ project layout, finding gold models, resolving dbt model names to BigQuery table references,
15
+ creating bridge models from JSON arrays, and dbt validation commands.
25
16
 
26
- - Project layout and how to find gold models
27
- - Resolving a dbt model name to its physical BigQuery table reference
28
- - Creating new dbt support models (bridge models from JSON arrays)
29
- - dbt validation commands
17
+ If `create-dbt-transformations` is not installed: discover gold models directly via
18
+ `find dbt/models/gold -name "*.sql"`, run dbt commands without the `revos` wrapper
19
+ (`dbt parse`, `dbt run`, `dbt test`), and skip bridge model delegation.
20
+ Warn the user: "The `create-dbt-transformations` skill is not installed — bridge model creation and dbt validation are limited."
30
21
 
31
- If BigQuery exploration is needed (listing tables, inspecting schemas, previewing rows, checking null rates beyond what is required for join validation), use the `explore-lakehouse` skill at `.claude/skills/explore-lakehouse/SKILL.md`. Load it only when those capabilities are actually required.
22
+ If BigQuery exploration is needed (listing tables, inspecting schemas, previewing rows),
23
+ load the `explore-lakehouse` skill on demand.
32
24
 
33
25
  ---
34
26
 
35
27
  ## Purpose
36
28
 
37
- Expose existing dbt gold models as queryable Cube.dev semantic models without manually writing YAML boilerplate.
29
+ Expose existing dbt gold models as queryable Cube.dev semantic models without manually writing YAML boilerplate. Gold models may be tables or views.
38
30
 
39
- Gold models may be dbt tables or dbt views. Treat both as valid semantic sources.
40
-
41
- This skill does not build gold models from silver. If a needed gold model is missing, hand off to `create_dbt_transformations`.
42
-
43
- The expected flow is:
44
-
45
- ```text
46
- dbt/models/gold/
47
- -> discover available gold models (via create_dbt_transformations)
48
- -> ask the user which gold models should participate
49
- -> inspect selected model schemas from the lakehouse
50
- -> detect primary keys, secondary keys, foreign keys, JSON/array keys, and candidate relationships
51
- -> if selected models are not directly connected, search remaining gold models for connector/intermediate models
52
- -> ask the user to approve connector or bridge/support models when needed
53
- -> validate candidate joins with SQL before proposing them as relationships
54
- -> present validated relationships and join directions to the user
55
- -> ask the user to confirm relationships
56
- -> generate dimensions from all selected columns by default
57
- -> generate default count measure
58
- -> suggest useful additional measures
59
- -> ask the user to confirm or define custom measures
60
- -> generate Cube.dev semantic overlays in semantic/cubes/
61
- -> validate generated files
62
- ```
31
+ This skill does not build gold models. If a needed gold model is missing, hand off to `create-dbt-transformations`.
63
32
 
64
33
  ---
65
34
 
66
- ## Execution Order
67
-
68
- 1. Discover available gold models (use `create_dbt_transformations` for navigation).
69
- 2. If the user named a specific transformation/model, find that gold model first.
70
- 3. Ask the user which gold models should participate, unless the request clearly targets one specific model only.
71
- 4. Resolve `$GOOGLE_CLOUD_PROJECT` and `$REVOS_BQ_DATASET` to literal values via `echo` and keep them for the rest of the session.
72
- 5. Inspect schemas and column types for selected gold models.
73
- 6. Detect primary keys, secondary keys, foreign keys, JSON/array keys, and candidate relationships.
74
- 7. If selected models are disconnected, search for connector models among the remaining gold models.
75
- 8. Ask the user to approve connector or bridge/support models before adding them to the working scope.
76
- 9. Validate candidate joins with SQL before proposing them as relationships.
77
- 10. Present validated relationships, join directions, cardinality, and validation evidence to the user.
78
- 11. Ask the user to confirm or modify relationships.
79
- 12. Generate dimensions from all selected gold model columns by default.
80
- 13. Create the default `count` measure.
81
- 14. Suggest useful additional measures.
82
- 15. Ask the user to confirm suggested measures or define custom measures.
83
- 16. Generate Cube semantic overlays in `semantic/cubes/`.
84
- 17. Validate generated files.
85
-
86
- Do not skip ahead.
87
-
88
- ---
35
+ ## Naming Convention
89
36
 
90
- ## Naming Convention: Cube Files and Cube Names
91
-
92
- Gold models in `dbt/models/gold/` use the `gold_` prefix in their file name (for example, `gold_hubspot_companies.sql`). The materialized table in BigQuery keeps the same name.
93
-
94
- The semantic overlay layer drops the `gold_` prefix:
95
-
96
- 1. The gold SQL file: `gold_<entity>` (for example, `gold_hubspot_companies.sql`).
97
- 2. The materialized BigQuery table: `gold_<entity>` (for example, `gold_hubspot_companies`).
98
- 3. The semantic overlay YAML file: `<entity>.yml` (for example, `hubspot_companies.yml`).
99
- 4. The cube `name` inside that YAML file: `<entity>` (for example, `name: hubspot_companies`).
100
- 5. References to that cube from joins: `${<entity>}` (for example, `${hubspot_companies}`).
101
-
102
- Mapping example:
37
+ Strip the `gold_` prefix for cube names and file names. Keep `gold_` in `sql_table` (physical table).
103
38
 
104
39
  ```text
105
- gold SQL file: dbt/models/gold/gold_hubspot_companies.sql
106
- BigQuery table: gold_hubspot_companies
107
- overlay YAML file: semantic/cubes/hubspot_companies.yml
108
- cube name: hubspot_companies
109
- join reference: ${hubspot_companies}
40
+ gold SQL file: dbt/models/gold/gold_hubspot_companies.sql
41
+ BigQuery table: gold_hubspot_companies
42
+ overlay file: semantic/hubspot_companies.yml
43
+ cube name: hubspot_companies
44
+ join reference: ${hubspot_companies}
45
+ sql_table: "`<dataset>.gold_hubspot_companies`"
110
46
  ```
111
47
 
112
- Naming rules:
113
-
114
- 1. Always strip the `gold_` prefix when generating the overlay YAML file name.
115
- 2. The cube `name` must match the overlay YAML file name (without extension).
116
- 3. The cube `name` is what appears in `${...}` references inside join SQL.
117
- 4. The physical table reference in `sql_table` always uses the full BigQuery name with `gold_` prefix.
118
- 5. Bridge and support cube names follow the same rule. If the gold support model is `gold_deals_companies`, the cube name is `deals_companies` and the file is `deals_companies.yml`.
119
-
120
- ---
48
+ Same rule for bridge cubes: `gold_deals_companies` -> cube name `deals_companies`, file `deals_companies.yml`.
121
49
 
122
50
  ## Cube `sql_table` Reference
123
51
 
124
- Cube overlays must reference the physical warehouse table directly, not through `dbt ref()`. Cube does not understand Jinja.
52
+ Reference the physical warehouse table directly. Cube does not understand Jinja — never use `dbt ref()`.
125
53
 
126
- Use the literal `project` and `dataset` values captured during the "Resolve Environment Variables First" step at the start of Phase 2 (see Phase 2 below). If you reach Phase 8 without those literals already in memory, go back and run the resolution step first.
127
-
128
- Wrap the literal in BigQuery backticks and use it as `sql_table`:
129
-
130
- ```yaml
131
- sql_table: "`<resolved_project>.<resolved_dataset>.<gold_model_name>`"
132
- ```
133
-
134
- Concrete example:
54
+ Use the literal `dataset` value resolved in Phase 2. Wrap in BigQuery backticks:
135
55
 
136
56
  ```yaml
137
- sql_table: "`revos-dev.revos_1737556292084.gold_hubspot_companies`"
57
+ sql_table: "`revos_1737556292084.gold_hubspot_companies`"
138
58
  ```
139
59
 
140
- The cube `name` is `hubspot_companies` (no `gold_` prefix), but `sql_table` keeps the `gold_` prefix because that is the physical table.
141
-
142
- Apply the same fully qualified format anywhere a physical table name is needed in a Cube overlay (notably in `refresh_key.sql`).
60
+ Apply the same fully qualified format in `refresh_key.sql`.
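The naming and `sql_table` rules in the added lines above can be sketched in Python as follows. This is an illustrative helper, not part of `@revos/cli`: the function name `overlay_names` and the dictionary keys are hypothetical, and the `semantic/` output path is taken from the new naming example in this diff.

```python
def overlay_names(gold_model: str, dataset: str) -> dict[str, str]:
    """Derive cube naming from a gold model name (hypothetical helper)."""
    prefix = "gold_"
    if not gold_model.startswith(prefix):
        raise ValueError(f"expected a gold_ model name, got {gold_model!r}")
    # Cube name and overlay file name drop the gold_ prefix.
    cube = gold_model[len(prefix):]
    return {
        "cube_name": cube,
        "overlay_file": f"semantic/{cube}.yml",
        "join_reference": "${" + cube + "}",
        # sql_table keeps the gold_ prefix because it names the physical table,
        # and is wrapped in BigQuery backticks.
        "sql_table": f"`{dataset}.{gold_model}`",
    }


names = overlay_names("gold_hubspot_companies", "revos_1737556292084")
print(names["sql_table"])
```

The same function covers bridge cubes, since `gold_deals_companies` maps to `deals_companies` under the same rule.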
143
61
 
144
62
  ---
145
63
 
146
- ## Non-Negotiable Rules
147
-
148
- 1. Always start from existing dbt gold models in `dbt/models/gold/`. If no usable gold models exist, hand off to `create_dbt_transformations` and stop.
149
- 2. Gold models may be tables or views.
150
- 3. Always discover available gold models before doing anything else.
151
- 4. Always ask the user which discovered gold models should participate, unless the user clearly requested one specific model only.
152
- 5. Do not inspect selected schemas, analyze joins, generate overlays, or create support models until the model scope is clear.
153
- 6. Only selected models may be used for the main semantic model.
154
- 7. If selected models are not directly connected, search remaining gold models for connector/intermediate models.
155
- 8. During connector search, Claude may inspect column names and key-like fields of non-selected gold models only for relationship discovery.
156
- 9. Inspecting a non-selected model does not add it to the semantic model.
157
- 10. Connector models can only be added after explicit user approval.
158
- 11. If a many-to-many relationship requires a bridge/support model and no suitable gold model exists, ask the user whether to create one. If the user approves, hand off the bridge model creation to `create_dbt_transformations`.
159
- 12. Do not invent joins. If models cannot be connected directly or through approved connector/bridge models, generate separate semantic subgraphs and clearly report that they are disconnected.
160
- 13. Validate candidate joins with SQL before proposing them to the user as confirmed relationship options.
161
- 14. Ask the user to confirm validated relationships before generating semantic overlays.
162
- 15. For every confirmed relationship, generate joins in both directions where both cubes exist.
163
- 16. Validate joins in both directions where possible.
164
- 17. Relationship direction must be correct from the perspective of the current cube.
165
- 18. Preserve every business column from each selected gold model as a Cube dimension by default.
166
- 19. From technical Airbyte metadata columns (`_airbyte_*`), include only `_airbyte_extracted_at` as a dimension named `airbyte_extracted_at`. Exclude all other `_airbyte_*` columns by default.
167
- 20. Do not remove non-Airbyte columns just because they do not look immediately useful.
168
- 21. Measures are not all numeric columns by default. Always create `count`, suggest useful additional measures, and ask the user to confirm or define custom measures.
169
- 22. Keys may be stored in normal columns, JSON strings, JSON arrays, repeated fields, or nested structures. Always check for hidden relationship keys.
170
- 23. For JSON arrays, use `UNNEST(JSON_VALUE_ARRAY(...))`.
171
- 24. Bridge and junction cubes should use `public: false` where the project convention supports it.
172
- 25. Every generated Cube overlay must include `refresh_key`.
173
- 26. Prefer `_airbyte_extracted_at` for refresh keys when available. Otherwise fall back to `every: 1 hour`.
174
- 27. New Cube overlays must reference the physical warehouse table directly using the fully qualified name. Do not use `dbt ref()` in Cube overlays.
175
- 28. Cube `name` and overlay file name must drop the `gold_` prefix. The physical table reference in `sql_table` keeps the `gold_` prefix.
176
- 29. Follow the existing `semantic/cubes/` style in the repository. If existing overlays use map-style `dimensions`, `measures`, and `joins`, use map-style.
177
-
178
- ---
179
-
180
- ## Mandatory User Checkpoints
64
+ ## User Checkpoints
181
65
 
182
66
  ### Checkpoint 1: Gold Model Selection
183
67
 
184
- After discovering gold models, show the available models and ask which ones should participate.
185
-
186
- Do not proceed until the user selects models.
187
-
188
- Exception: if the user explicitly requested one specific model, and that model exists in `dbt/models/gold/`, treat that model as selected. If the user appears to want a multi-model semantic layer, still ask whether related models should be included.
68
+ After discovering gold models, show available models and ask which should participate. Do not proceed until the user selects. Exception: if the user named one specific model, treat it as selected.
189
69
 
190
70
  ### Checkpoint 2a: Connector Model Approval
191
71
 
192
- Triggered when selected models are not directly connected.
193
-
194
- Search remaining gold models for connector/intermediate models. During this search, Claude may inspect column names and key-like fields of non-selected gold models only for relationship discovery. Inspecting a model does not add it to the semantic model.
195
-
196
- If a connector path is found, explain why it is needed and ask whether to add it.
197
-
198
- Example:
199
-
200
- ```text
201
- You selected products and addresses.
202
-
203
- I do not see a direct relationship between products and addresses.
204
-
205
- However, I found a possible connector path:
206
-
207
- products -> clients -> addresses
208
-
209
- The clients model appears to connect products to addresses.
210
-
211
- Should I include gold_clients as a connector model in this semantic model?
212
- ```
213
-
214
- Do not use connector models until the user approves them.
72
+ When selected models are not directly connected, search remaining gold models for connectors. Present the connector path and ask approval before adding.
215
73
 
216
74
  ### Checkpoint 2b: Bridge / Support Model Approval
217
75
 
218
- Triggered when a many-to-many relationship is detected and no suitable existing gold bridge model exists.
219
-
220
- Ask whether to create one:
221
-
222
- ```text
223
- The relationship between deals and contacts appears to be many-to-many,
224
- likely stored as a JSON array on hubspot_deals.
225
-
226
- I did not find an existing gold bridge model for this relationship.
227
-
228
- Should I create a support model named gold_deals_contacts?
229
- If yes, I will hand off bridge model creation to the create_dbt_transformations skill,
230
- then continue with the semantic model.
231
- ```
232
-
233
- If approved, invoke `create_dbt_transformations` with the bridge model template (it knows the standard JSON-array bridge pattern). Once that skill confirms the bridge model is built and tested, return here and continue.
234
-
235
- Do not create or use bridge/support models until the user approves them.
76
+ When a many-to-many relationship is detected and no suitable bridge model exists, ask whether to create one. If approved, delegate to `create-dbt-transformations`.
236
77
 
237
78
  ### Checkpoint 3: Relationship Confirmation
238
79
 
239
- After candidate relationships are detected and validated with SQL, present the validated relationship options, join directions, cardinality, and validation evidence.
240
-
241
- Ask the user to confirm or modify relationships before generating semantic overlays.
242
-
243
- Do not present unvalidated joins as confirmed relationships.
244
-
245
- If validation could not be executed, clearly mark the relationship as `validation pending` and ask the user whether to proceed with that assumption. If the user proceeds, the join is generated but flagged inside the YAML with a comment (see Phase 8).
80
+ Present validated relationships with join directions, cardinality, and match rates. Ask the user to confirm before generating overlays. Do not present unvalidated joins as confirmed; mark them as `validation pending`.
246
81
 
247
82
  ### Checkpoint 4: Measures Confirmation
248
83
 
249
- After schema and relationship analysis, generate default dimensions and default `count` measure, then propose useful additional measures.
250
-
251
- Ask the user to confirm suggested measures or define custom measures.
252
-
253
- Do not create ambiguous custom measures. If the user asks for a custom measure but the definition is unclear, ask for more detail.
84
+ Generate default `count` measure, suggest additional measures, and ask the user to confirm or define custom measures.
254
85
 
255
86
  ---
256
87
 
257
88
  # Workflow
258
89
 
259
- Follow these phases in order.
90
+ Follow these phases in order. Do not skip ahead.
260
91
 
261
92
  ---
262
93
 
263
94
  ## Phase 1: Discover Gold Models and Select Scope
264
95
 
265
- ### Goal
266
-
267
- Find available prepared gold models, detect existing overlay conventions, and determine which models participate in the semantic model.
268
-
269
- ### Steps
270
-
271
- 1. Discover gold models. Use the navigation commands documented in `create_dbt_transformations` ("Model Navigation"). Specifically:
272
-
273
- ```bash
274
- find dbt/models/gold -name "*.sql"
275
- ```
276
-
277
- 2. If no usable gold models exist, stop and tell the user:
278
-
279
- ```text
280
- No gold models were found in dbt/models/gold/. The required dbt transformation
281
- must exist before I can build a semantic model on top of it.
282
-
283
- Use the create_dbt_transformations skill to create the gold model first, then
284
- run semantic model generation again.
285
- ```
286
-
287
- 3. Inspect existing overlays in `semantic/cubes/` to detect conventions used in this repository.
288
-
289
- Look at one or two existing files and check for:
290
-
291
- - Whether `extends:` is used to inherit from base cubes.
292
- - Whether existing cubes use map-style or list-style for `dimensions`, `measures`, and `joins`.
293
- - The `public:` flag pattern (true / false / omitted).
294
- - The `refresh_key` style (`sql:` based or `every:` based).
295
- - Naming patterns for dimensions and measures.
296
- - Whether there are base cubes intended for `extends:`.
297
-
298
- Apply the detected conventions to the new overlays. If the repository is empty or conventions are inconsistent, follow the defaults defined in this skill.
299
-
300
- 4. If the user named a specific transformation/model, search for it. If the model exists, treat it as the selected model. If it does not exist, stop:
301
-
302
- ```text
303
- I could not find the requested gold model in dbt/models/gold/. The gold model
304
- must exist before I can create a semantic overlay for it.
305
- ```
306
-
307
- 5. If the user did not name a specific model, list all discovered gold models.
308
-
309
- 6. Infer likely business entities from file names.
310
-
311
- Examples:
312
-
313
- ```text
314
- gold_hubspot_companies -> companies
315
- gold_hubspot_deals -> deals
316
- gold_hubspot_contacts -> contacts
317
- gold_hubspot_users -> users
318
- gold_products -> products
319
- gold_clients -> clients
320
- gold_addresses -> addresses
321
- gold_invoices -> invoices
322
- gold_payments -> payments
323
- gold_active_users_last_month -> active users
324
- ```
325
-
326
- 7. Present discovered models and ask the user which ones should be included.
327
-
328
- 8. Stop and wait for user selection if no specific model was provided.
329
-
330
- 9. Confirm selected models back to the user.
331
-
332
- 10. Keep the full discovered gold model list available for connector search.
333
-
334
- 11. If the user asks to include all models, warn that this may create a larger and more complex semantic model before proceeding.
96
+ 1. Discover gold models via `find dbt/models/gold -name "*.sql"`.
97
+ 2. If none exist, stop and tell the user to create gold models first via `create-dbt-transformations`.
98
+ 3. Inspect 1-2 existing overlays in `semantic/` to detect conventions (map-style vs list-style, `extends:`, `public:`, `refresh_key` style). Apply detected conventions to new overlays.
99
+ 4. If the user named a specific model, find it. If not found, stop.
100
+ 5. Otherwise list all discovered gold models and ask which should participate (Checkpoint 1).
101
+ 6. Keep the full discovered list available for connector search in Phase 3.
335
102
 
336
103
  ---
337
104
 
338
105
  ## Phase 2: Analyze Selected Model Schemas and Keys
339
106
 
340
- ### Goal
341
-
342
- Inspect selected gold models and identify primary keys, secondary keys, foreign keys, JSON keys, array keys, time columns, and metric-like columns.
343
-
344
107
  ### Resolve Environment Variables First
345
108
 
346
- Before running any SQL or generating any YAML in this session, resolve `$GOOGLE_CLOUD_PROJECT` and `$REVOS_BQ_DATASET` to their literal values **once**, and use those literals everywhere downstream. SQL placeholders like `<project>` are not valid BigQuery syntax, and Cube YAML does not interpolate env variables — both contexts need real values.
347
-
348
- Run:
109
+ Before any SQL or YAML generation, resolve `$REVOS_BQ_DATASET` to a literal:
349
110
 
350
111
  ```bash
351
- echo "PROJECT=$GOOGLE_CLOUD_PROJECT"
352
112
  echo "DATASET=$REVOS_BQ_DATASET"
353
113
  ```
354
114
 
355
- Record the two literal values returned (for example, `revos-dev` and `revos_1737556292084`). For the rest of this session, every time you see `<project>` or `<dataset>` in a SQL template or YAML example, substitute these literals.
356
-
357
- If either variable is empty, stop and tell the user:
358
-
359
- ```text
360
- The environment variable $GOOGLE_CLOUD_PROJECT or $REVOS_BQ_DATASET is not set.
361
- I cannot resolve physical BigQuery table references without them. Please set
362
- them and try again.
363
- ```
364
-
365
- Do not proceed to schema inspection or join validation until the variables are resolved.
115
+ If empty, stop and ask the user to set it. Use this literal everywhere downstream.
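As a minimal sketch of the "resolve once, stop if empty" rule, assuming the check is done from Python rather than with `echo`: the function name `resolve_dataset` and the error message are hypothetical, and only the variable name `REVOS_BQ_DATASET` comes from the skill text.

```python
import os


def resolve_dataset(env=None) -> str:
    """Return the literal dataset name, or fail if it is unset or empty."""
    env = os.environ if env is None else env
    dataset = env.get("REVOS_BQ_DATASET", "").strip()
    if not dataset:
        # Mirrors the instruction to stop and ask the user to set the variable.
        raise RuntimeError("REVOS_BQ_DATASET is not set; set it and try again.")
    return dataset


# Example with an explicit mapping instead of the real environment.
print(resolve_dataset({"REVOS_BQ_DATASET": "revos_1737556292084"}))
```

Resolving the value once and passing the literal around keeps SQL and YAML free of unexpanded placeholders.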
366
116
 
367
117
  ### Schema Discovery
368
118
 
369
- For each selected gold model, inspect available columns and types. If BigQuery exploration commands are needed (listing tables, inspecting schemas, previewing rows), refer to the `explore-lakehouse` skill at `.claude/skills/explore-lakehouse/SKILL.md`.
370
-
371
- When inspecting schema output, check whether `_airbyte_extracted_at` exists (it will be needed for `refresh_key` and as the only Airbyte dimension). Other `_airbyte_*` columns can be noted but are not used by default.
372
-
373
- Validation queries reference physical tables using the resolved literal values (see "Resolve Environment Variables First" above). Substitute `<project>` and `<dataset>` placeholders in the SQL templates throughout this skill with the literals captured at the start of Phase 2.
374
-
375
- ### Primary Key Detection
376
-
377
- Look for stable unique identifiers.
378
-
379
- Common primary key patterns:
380
-
381
- ```text
382
- id
383
- <entity>_id
384
- <model_name>_id
385
- uuid
386
- unique_id
387
- external_id
388
- source_id
389
- ```
390
-
391
- Examples:
392
-
393
- ```text
394
- companies.id
395
- companies.company_id
396
- companies.office_unique_id
397
- hubspot_companies.properties_company_unique_id
398
- ```
399
-
400
- Primary key rules:
401
-
402
- 1. Prefer a known business or platform identifier over a generated row number.
403
- 2. Prefer stable IDs over names or labels.
404
- 3. Do not mark a column as primary key only because it looks unique by name.
405
- 4. Validate uniqueness where possible.
406
-
407
- Validation:
408
-
409
- ```sql
410
- SELECT
411
- COUNT(*) AS total_rows,
412
- COUNT(DISTINCT <candidate_pk>) AS distinct_keys,
413
- COUNT(*) - COUNT(DISTINCT <candidate_pk>) AS duplicate_count
414
- FROM `<project>.<dataset>.<gold_model>`;
415
- ```
416
-
417
- A primary key should normally have `total_rows = distinct_keys`, or a clearly explained reason why duplicates are expected.
418
-
419
- ### Secondary Key Detection
420
-
421
- Secondary keys are identifiers that are not the table primary key but can be used for joins, grouping, lookup, or `count_distinct` measures.
422
-
423
- Common secondary key patterns:
424
-
425
- ```text
426
- office_unique_id
427
- company_id
428
- customer_id
429
- client_id
430
- deal_id
431
- contact_id
432
- owner_id
433
- user_id
434
- account_id
435
- product_id
436
- address_id
437
- external_id
438
- source_id
439
- ```
119
+ For each selected gold model, inspect columns and types. Use `explore-lakehouse` if needed.
440
120
 
441
- Secondary key rules:
442
-
443
- 1. Track secondary keys explicitly.
444
- 2. Secondary keys may be foreign keys to another entity.
445
- 3. Secondary keys should usually become Cube dimensions.
446
- 4. Secondary keys may support `count_distinct` measures if analytically useful, but only inside the cube that owns the FK (see Phase 7 caution about fan-out).
447
-
448
- ### JSON / Array Key Detection
449
-
450
- Keys may be hidden inside JSON strings, JSON arrays, repeated fields, or nested structures. This is especially common for one-to-many and many-to-many relationships.
451
-
452
- Common JSON / array relationship patterns:
453
-
454
- ```text
455
- companies
456
- deals
457
- contacts
458
- users
459
- owners
460
- clients
461
- products
462
- addresses
463
- associations
464
- associated_companies
465
- associated_deals
466
- associated_contacts
467
- associated_clients
468
- associated_products
469
- company_ids
470
- deal_ids
471
- contact_ids
472
- client_ids
473
- product_ids
474
- address_ids
475
- ```
476
-
477
- Example: `gold_hubspot_deals.companies` may contain an array of company IDs.
478
-
479
- For JSON arrays, use `UNNEST(JSON_VALUE_ARRAY(...))`.
480
-
481
- JSON / array rules:
482
-
483
- 1. Always inspect JSON, array, repeated, and nested fields for hidden relationship keys.
484
- 2. Do not assume relationship keys only exist as flat columns.
485
- 3. If JSON structure is unknown, inspect sample values first:
486
-
487
- ```sql
488
- SELECT
489
- <json_or_array_column>
490
- FROM `<project>.<dataset>.<gold_model>`
491
- WHERE <json_or_array_column> IS NOT NULL
492
- LIMIT 20;
493
- ```
121
+ Check whether `_airbyte_extracted_at` exists (needed for `refresh_key` and as the only Airbyte dimension to expose).
494
122
 
495
- 4. If a relationship is stored as an array of IDs, use or create an approved bridge/support model. Bridge model creation is delegated to `create_dbt_transformations` (which has the standard JSON-array bridge template).
496
- 5. Bridge models should preserve both sides of the relationship as keys.
497
- 6. Bridge and junction cubes should use `public: false` where the project convention supports it.
123
+ ### Key Detection
498
124
 
499
- ### Foreign Key Detection
125
+ Detect primary keys, secondary keys, foreign keys, and JSON/array keys.
126
+ See [references/key-patterns.md](references/key-patterns.md) for common patterns and detection rules.
500
127
 
501
- Look for columns that reference other entities.
128
+ Validate primary key uniqueness with SQL. See [references/validation-queries.md](references/validation-queries.md), section 1.
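The uniqueness check can be sketched as a small query builder. This is adapted from the query in the removed 0.1.2 text, using the dataset-only table reference that the 0.1.3 text adopts. The contents of `references/validation-queries.md` are not shown in this diff, so the helper name `pk_uniqueness_sql` and the exact query shape are assumptions.

```python
def pk_uniqueness_sql(dataset: str, gold_model: str, candidate_pk: str) -> str:
    """Build a BigQuery query that compares row count to distinct key count."""
    table = f"`{dataset}.{gold_model}`"
    return (
        "SELECT\n"
        "  COUNT(*) AS total_rows,\n"
        f"  COUNT(DISTINCT {candidate_pk}) AS distinct_keys,\n"
        # COUNT(DISTINCT ...) ignores NULLs, so NULL keys also show up here.
        f"  COUNT(*) - COUNT(DISTINCT {candidate_pk}) AS duplicate_count\n"
        f"FROM {table};"
    )


print(pk_uniqueness_sql("revos_1737556292084", "gold_hubspot_deals", "deal_id"))
```

A candidate key passes when `total_rows` equals `distinct_keys`, which is the criterion stated in the removed text.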
502
129
 
503
- Common patterns:
504
-
505
- ```text
506
- <entity>_id
507
- <entity>Id
508
- fk_<entity>
509
- associated_<entity>_id
510
- parent_<entity>_id
511
- owner_id
512
- created_by_user_id
513
- updated_by_user_id
514
- ```
515
-
516
- Also check JSON and array-based foreign keys:
517
-
518
- ```text
519
- deals.companies -> companies.id
520
- companies.deals -> deals.id
521
- contacts.associated_company_ids -> companies.id
522
- ```
523
-
524
- ### Schema Summary Output
525
-
526
- After analysis, summarize each selected model.
527
-
528
- Example:
529
-
530
- ```text
531
- Model: gold_hubspot_deals
532
- Columns: 18
533
- Candidate primary key: deal_id
534
- Secondary keys: company_id, owner_id
535
- JSON / array relationship columns: companies, contacts
536
- Time columns: created_at, updated_at, closed_at
537
- Numeric metric-like columns: amount
538
- Airbyte columns present: _airbyte_extracted_at (will be exposed), _airbyte_raw_id, _airbyte_meta, _airbyte_generation_id (will be excluded by default)
539
- ```
130
+ Output a schema summary for each model (format in key-patterns.md).
540
131
 
541
132
  ---
542
133
 
543
134
  ## Phase 3: Detect Candidate Relationships
544
135
 
545
- ### Goal
546
-
547
- Build a candidate relationship graph between selected models and approved connector/bridge models.
548
-
549
- These relationships are candidates only. They must be validated with SQL before being proposed to the user as relationship options.
136
+ Build a candidate relationship graph. These are candidates only — they must be validated in Phase 4.
550
137
 
551
138
  ### Single-Model Case
552
139
 
553
- If only one gold model was selected in Phase 1 **and** that model has no JSON/array relationship columns identified in Phase 2, skip Phase 3 entirely. There is nothing to relate. Proceed directly to Phase 6 (dimensions) and Phase 7 (measures). Phase 4 (validation) is also skipped since there are no candidate joins.
554
-
555
- If only one model is selected but it has JSON/array relationship columns (for example, `gold_hubspot_deals` with a `companies` array), Phase 3 still applies — but only the bridge/junction detection branch. Treat the JSON-array target entity as a candidate model and ask the user whether to add it to the scope (and, if needed, to create a bridge support model).
556
-
557
- ### Relationship Types
140
+ If only one model was selected and it has no JSON/array relationship columns, skip to Phase 6. If it has JSON/array columns, check for bridge/junction needs.
558
141
 
559
- Use Cube relationship types: `one_to_one`, `one_to_many`, `many_to_one`, `many_to_many`.
142
+ ### Relationship Types and Direction
560
143
 
561
- ### Relationship Direction
562
-
563
- Relationship direction is from the perspective of the current cube.
564
-
565
- ```text
566
- Deal belongs to one company:
567
- deals.company_id -> companies.id
568
- relationship from deals to companies: many_to_one
569
-
570
- Company has many deals:
571
- companies.id -> deals.company_id
572
- relationship from companies to deals: one_to_many
573
- ```
144
+ Use Cube types: `one_to_one`, `one_to_many`, `many_to_one`, `many_to_many`. Direction is always from the perspective of the current cube.
574
145
 
575
146
  ### Cardinality Rules
576
147
 
577
- 1. If source foreign key can repeat and target key is unique -> `many_to_one` / reverse `one_to_many`.
578
- 2. If both sides are unique -> `one_to_one`.
579
- 3. If both sides can repeat or the relationship is represented through a bridge/junction model -> `many_to_many` through bridge/junction model.
580
- 4. If cardinality is unclear, report uncertainty and validate before proposing the relationship.
148
+ 1. Source FK can repeat, target key unique -> `many_to_one` / reverse `one_to_many`.
149
+ 2. Both sides unique -> `one_to_one`.
150
+ 3. Both can repeat or through bridge -> `many_to_many`.
151
+ 4. If unclear, validate before proposing.
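The four cardinality rules above can be read as a small decision table. The following is only an illustrative sketch, not part of the package: the function name `classify_cardinality` and its parameters are hypothetical, and the uniqueness flags are assumed to come from key-uniqueness checks run beforehand.

```python
from typing import Optional


def classify_cardinality(
    source_unique: Optional[bool],
    target_unique: Optional[bool],
    via_bridge: bool = False,
) -> Optional[str]:
    """Return a Cube relationship type from the source cube's perspective.

    source_unique / target_unique say whether the join key is unique on each
    side; None means "not known yet". Returns None when cardinality is
    unclear, so the caller can validate before proposing anything (rule 4).
    """
    if via_bridge:
        return "many_to_many"  # rule 3: relationship goes through a bridge
    if source_unique is None or target_unique is None:
        return None  # rule 4: unclear, validate first
    if source_unique and target_unique:
        return "one_to_one"  # rule 2
    if not source_unique and target_unique:
        return "many_to_one"  # rule 1: FK repeats, target key is unique
    if source_unique and not target_unique:
        return "one_to_many"  # rule 1, seen from the other side
    return "many_to_many"  # rule 3: both sides can repeat


print(classify_cardinality(source_unique=False, target_unique=True))
```

Returning `None` rather than a guess keeps the "validate before proposing" rule explicit in the calling code.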
581
152
 
582
153
  ### Direct Join Detection
583
154
 
584
- Analyze direct joins between selected models first.
585
-
586
- Look for:
587
-
588
- 1. Foreign key to primary key matches.
589
- 2. Secondary key matches.
590
- 3. Known entity conventions.
591
- 4. Existing junction/bridge models among selected models.
592
- 5. JSON / array relationship fields among selected models.
593
- 6. Nested association structures.
155
+ Look for FK-to-PK matches, secondary key matches, existing bridge models, and JSON/array relationship fields.
594
156
 
595
157
  ### Connector Model Search
596
158
 
597
- Run connector search when selected models are not directly connected.
598
-
599
- Rules:
600
-
601
- 1. Search all discovered gold models, including non-selected models.
602
- 2. During connector search, inspect column names and key-like fields of non-selected gold models only for relationship discovery.
603
- 3. This lightweight schema discovery does not add the connector model to the semantic model.
604
- 4. Do not automatically add connector models.
605
- 5. Look for paths of length 2 first.
606
- 6. If needed, look for paths of length 3.
607
- 7. Do not create long or speculative chains without user confirmation.
608
- 8. Use connector models only after user approval.
159
+ When selected models are disconnected:
609
160
 
610
- Example. User selected `gold_products` and `gold_addresses`. No direct relationship found. Search remaining gold models and find that both have `client_id` matching `gold_clients.id`. Proposed connector path: `products -> clients -> addresses`.
161
+ 1. Search all discovered gold models for connector paths (length 2, then 3).
162
+ 2. Inspect non-selected models for relationship discovery only — this does not add them to the scope.
163
+ 3. Present the connector path and ask for user approval (Checkpoint 2a).
164
+ 4. If approved, add to scope and run full schema discovery. If rejected, document disconnected models.
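As an illustration of the connector search (shortest paths first, capped at three joins), here is a minimal breadth-first sketch. It is not taken from the package: `find_connector_path`, the `links` adjacency map, and `max_edges` are hypothetical names, and the map is assumed to have been built beforehand from matching key columns.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def find_connector_path(
    links: Dict[str, Set[str]],
    start: str,
    goal: str,
    max_edges: int = 3,
) -> Optional[List[str]]:
    """Breadth-first search for the shortest join path of at most max_edges.

    Because BFS explores shorter paths first, a length-2 path is always
    found before a length-3 one. Returns None if no path fits the limit.
    """
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) - 1 >= max_edges:
            continue  # do not build longer, speculative chains
        for neighbour in sorted(links.get(path[-1], ())):
            if neighbour not in path:  # avoid cycles
                queue.append(path + [neighbour])
    return None


# Hypothetical candidate links discovered from shared client_id columns.
links = {
    "gold_products": {"gold_clients"},
    "gold_clients": {"gold_products", "gold_addresses"},
    "gold_addresses": {"gold_clients"},
}
print(find_connector_path(links, "gold_products", "gold_addresses"))
```

The returned path is only a proposal. Per the steps above, it would still be shown to the user for approval before any model is added to the scope.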
611
165
 
612
- If the user approves the connector model:
613
-
614
- 1. Add it to the working model scope.
615
- 2. Run full schema discovery for it.
616
- 3. Analyze its keys and relationships.
617
- 4. Validate the connector path with SQL before proposing it as a relationship.
618
- 5. Generate a semantic overlay for it.
619
-
620
- If the user rejects the connector model:
621
-
622
- 1. Continue only with directly connected selected models.
623
- 2. Clearly document disconnected selected models.
624
- 3. Do not invent joins.
625
-
626
- ### Bridge / Junction Relationship Detection
627
-
628
- Use existing bridge/junction gold models when available — but only after confirming they are usable.
166
+ ### Bridge / Junction Detection
629
167
 
630
168
  If an existing bridge model is found in `dbt/models/gold/`:
631
169
 
632
- 1. Inspect its schema (use `explore-lakehouse` for schema commands).
633
- 2. Verify it contains the expected key columns for the relationship, typically `<entity_a>_id` and `<entity_b>_id`. The exact column names may differ from the convention (for example, `company_uuid` instead of `company_id`).
634
- 3. Check whether `_airbyte_extracted_at` is preserved. If not, the bridge cube will have to fall back to `every: 1 hour` for `refresh_key`.
635
- 4. Run a quick row-count and sample-rows check to confirm the bridge has data.
636
-
637
- If the existing bridge model fits the relationship cleanly, use it.
638
-
639
- If it does not fit (key columns named unexpectedly, missing one side of the relationship, has unrelated extra columns that change cardinality, or is empty), present the situation to the user and ask:
170
+ 1. Inspect its schema and verify it has the expected key columns.
171
+ 2. If it fits, use it. If it doesn't fit cleanly, present the mismatch and offer options (use as-is, create new, or abort).
640
172
 
641
- ```text
642
- I found an existing bridge model `<model_name>` in dbt/models/gold/, but it does not fit the relationship cleanly:
643
- - <specific issue, e.g. "key column is named company_uuid, not company_id">
644
- - <specific issue>
645
-
646
- Options:
647
- - use as-is (I will adapt the cube join SQL to match the existing column names)
648
- - create a new bridge model via create_dbt_transformations (I will use the standard JSON-array template and a new name)
649
- - abort and let me revisit this later
650
- ```
651
-
652
- Wait for the user's decision. Do not silently adapt to a mismatched bridge — surfacing the mismatch is more important than the fix being automatic.
653
-
654
- If no existing bridge model is available and the user approves creating one (Checkpoint 2b), delegate the bridge model creation to `create_dbt_transformations`. That skill has the standard JSON-array bridge template and the naming convention (`gold_<entity_a>_<entity_b>`, with cube name `<entity_a>_<entity_b>` after dropping `gold_`).
655
-
656
- Once the bridge model exists and is materialized, return here and continue:
657
-
658
- - Generate a Cube overlay for the bridge as a junction cube with `public: false`.
659
- - Generate joins in both directions through the bridge.
173
+ If no bridge exists and user approves creating one (Checkpoint 2b), delegate to `create-dbt-transformations`. Once built, generate a bridge cube with `public: false`.
660
174
 
661
175
  ---
662
176
 
663
- ## Phase 4: Validate Candidate Relationships Before Proposal
664
-
665
- ### Goal
666
-
667
- Verify with SQL that candidate joins actually work and that relationship direction is correct before presenting them to the user for confirmation.
668
-
669
- This validation runs against BigQuery directly. It is not a YAML-level check. The point is to verify, with real data, that:
670
-
671
- - Candidate primary keys are unique.
672
- - Foreign keys actually match target primary keys at acceptable rates.
673
- - Reverse aggregations produce sensible counts.
674
- - Many-to-many bridge edges resolve correctly on both sides.
675
- - JSON / array relationships extract to keys that exist in the target table.
676
- - Join column types are compatible.
677
-
678
- Validation must check both directions of every candidate relationship where possible.
679
-
680
- If validation cannot be executed because the environment is incomplete, clearly mark validation as pending and explain what must be run later.
681
-
682
- Validation queries reference physical tables using the literal `project` and `dataset` values resolved at the start of Phase 2. Substitute `<project>` and `<dataset>` placeholders in the SQL templates below with those literals before executing.
683
-
684
- Do not present an unvalidated join as a confirmed relationship.
685
-
686
- ### 4.1 Validate Key Uniqueness
687
-
688
- ```sql
689
- SELECT
690
- COUNT(*) AS total_rows,
691
- COUNT(DISTINCT <candidate_pk>) AS distinct_keys,
692
- COUNT(*) - COUNT(DISTINCT <candidate_pk>) AS duplicate_count
693
- FROM `<project>.<dataset>.<gold_model>`;
694
- ```
695
-
696
- For a primary key, `duplicate_count` should normally be `0`. If duplicates exist, do not mark the column as `primary_key: true` unless there is a clearly documented reason.
697
-
698
- ### 4.2 Validate Many-to-One Direction
699
-
700
- Example relationship: `deals.company_id -> companies.id`.
701
-
702
- Expected direction: `deals -> companies: many_to_one`, `companies -> deals: one_to_many`.
703
-
704
- Validate the many-to-one side:
705
-
706
- ```sql
707
- SELECT
708
- COUNT(*) AS total_rows_with_fk,
709
- COUNT(c.id) AS matched_rows,
710
- COUNT(*) - COUNT(c.id) AS unmatched_rows,
711
- ROUND(100.0 * COUNT(c.id) / COUNT(*), 2) AS match_percentage
712
- FROM `<project>.<dataset>.gold_hubspot_deals` d
713
- LEFT JOIN `<project>.<dataset>.gold_hubspot_companies` c
714
- ON d.company_id = c.id
715
- WHERE d.company_id IS NOT NULL;
716
- ```
717
-
718
- Also check whether a single source row matches multiple target rows:
719
-
720
- ```sql
721
- SELECT
722
- d.deal_id,
723
- COUNT(c.id) AS matched_companies
724
- FROM `<project>.<dataset>.gold_hubspot_deals` d
725
- LEFT JOIN `<project>.<dataset>.gold_hubspot_companies` c
726
- ON d.company_id = c.id
727
- WHERE d.company_id IS NOT NULL
728
- GROUP BY d.deal_id
729
- HAVING COUNT(c.id) > 1
730
- LIMIT 20;
731
- ```
732
-
733
- For a valid many-to-one relationship, this result should normally be empty.
734
-
735
- ### 4.3 Validate Reverse One-to-Many Direction
736
-
737
- For the same relationship, validate the reverse direction using aggregation:
738
-
739
- ```sql
740
- SELECT
741
- c.id AS company_id,
742
- COUNT(d.deal_id) AS deal_count
743
- FROM `<project>.<dataset>.gold_hubspot_companies` c
744
- LEFT JOIN `<project>.<dataset>.gold_hubspot_deals` d
745
- ON c.id = d.company_id
746
- GROUP BY c.id
747
- ORDER BY deal_count DESC
748
- LIMIT 20;
749
- ```
750
-
751
- Cross-check sampled counts directly from the child/source table:
752
-
753
- ```sql
754
- SELECT
755
- company_id,
756
- COUNT(*) AS expected_deal_count
757
- FROM `<project>.<dataset>.gold_hubspot_deals`
758
- WHERE company_id IN (<sample_company_ids>)
759
- GROUP BY company_id;
760
- ```
761
-
762
- The counts must match.
763
-
764
- ### 4.4 Validate One-to-One Relationships
765
-
766
- For a one-to-one relationship, validate uniqueness on both sides:
767
-
768
- ```sql
769
- SELECT
770
- COUNT(*) AS total_rows,
771
- COUNT(DISTINCT <left_key>) AS distinct_left_keys
772
- FROM `<project>.<dataset>.<left_model>`;
773
- ```
774
-
775
- Same for the right side. Then validate the join:
776
-
777
- ```sql
778
- SELECT
779
- COUNT(*) AS total_rows,
780
- COUNT(r.<right_key>) AS matched_rows,
781
- COUNT(*) - COUNT(r.<right_key>) AS unmatched_rows,
782
- ROUND(100.0 * COUNT(r.<right_key>) / COUNT(*), 2) AS match_percentage
783
- FROM `<project>.<dataset>.<left_model>` l
784
- LEFT JOIN `<project>.<dataset>.<right_model>` r
785
- ON l.<left_key> = r.<right_key>
786
- WHERE l.<left_key> IS NOT NULL;
787
- ```
788
-
789
- Also validate the reverse direction. If either side has duplicate keys, the relationship is not one-to-one.
790
-
791
- ### 4.5 Validate Many-to-Many Through Bridge/Junction Models
792
-
793
- For a many-to-many relationship, validate both bridge edges.
794
-
795
- Example: `companies <-> deals through gold_companies_deals`.
796
-
797
- Validate bridge to companies:
177
+ ## Phase 4: Validate Candidate Relationships
798
178
 
799
- ```sql
800
- SELECT
801
- COUNT(*) AS total_bridge_rows,
802
- COUNT(c.id) AS matched_companies,
803
- COUNT(*) - COUNT(c.id) AS unmatched_companies,
804
- ROUND(100.0 * COUNT(c.id) / COUNT(*), 2) AS match_percentage
805
- FROM `<project>.<dataset>.gold_companies_deals` b
806
- LEFT JOIN `<project>.<dataset>.gold_hubspot_companies` c
807
- ON b.company_id = c.id
808
- WHERE b.company_id IS NOT NULL;
809
- ```
810
-
811
- Validate bridge to deals (analogous query, swapping `c.id` for `d.id` and the source table).
812
-
813
- Validate reverse aggregations from each parent through the bridge:
814
-
815
- ```sql
816
- SELECT
817
- c.id AS company_id,
818
- COUNT(b.deal_id) AS related_deals
819
- FROM `<project>.<dataset>.gold_hubspot_companies` c
820
- LEFT JOIN `<project>.<dataset>.gold_companies_deals` b
821
- ON c.id = b.company_id
822
- GROUP BY c.id
823
- ORDER BY related_deals DESC
824
- LIMIT 20;
825
- ```
826
-
827
- Same query swapped for deals → bridge → companies.
828
-
829
- Report sampled counts to the user.
179
+ Verify with SQL against BigQuery that candidate joins work and direction is correct.
830
180
 
831
- ### 4.6 Validate JSON / Array Relationships
181
+ For each candidate relationship, validate:
832
182
 
833
- For JSON or array-based relationships, validate extracted keys:
183
+ - Key uniqueness
184
+ - FK match rates (LEFT JOIN + match percentage)
185
+ - Reverse direction aggregation counts
186
+ - Bridge edge integrity (both sides)
187
+ - JSON array extraction match rates
188
+ - Type compatibility via INFORMATION_SCHEMA
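To make the FK match-rate check concrete, the sketch below runs a LEFT JOIN match-percentage query against an in-memory SQLite database. This is an assumption-heavy illustration: the real validation targets BigQuery with fully qualified table names, and the sample rows here are invented purely to show what the four result columns mean.

```python
import sqlite3

# In-memory stand-in for the warehouse; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gold_hubspot_companies (id TEXT);
    CREATE TABLE gold_hubspot_deals (deal_id TEXT, company_id TEXT);
    INSERT INTO gold_hubspot_companies VALUES ('c1'), ('c2');
    INSERT INTO gold_hubspot_deals VALUES
      ('d1', 'c1'), ('d2', 'c1'), ('d3', 'c2'), ('d4', 'c9'), ('d5', NULL);
    """
)

# Match rate of deals.company_id against companies.id (many_to_one side).
MATCH_RATE_SQL = """
SELECT
  COUNT(*) AS total_rows_with_fk,
  COUNT(c.id) AS matched_rows,
  COUNT(*) - COUNT(c.id) AS unmatched_rows,
  ROUND(100.0 * COUNT(c.id) / COUNT(*), 2) AS match_percentage
FROM gold_hubspot_deals d
LEFT JOIN gold_hubspot_companies c
  ON d.company_id = c.id
WHERE d.company_id IS NOT NULL
"""

total, matched, unmatched, pct = conn.execute(MATCH_RATE_SQL).fetchone()
print(total, matched, unmatched, pct)
```

In this sample, deal `d4` points at a company that does not exist and `d5` has no foreign key at all. Only the former counts as unmatched, because rows with a NULL key are filtered out before the rate is computed.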
834
189
 
835
- ```sql
836
- WITH extracted AS (
837
- SELECT DISTINCT
838
- src.<source_pk> AS source_id,
839
- extracted_id
840
- FROM `<project>.<dataset>.<source_model>` src,
841
- UNNEST(JSON_VALUE_ARRAY(src.<json_array_column>)) AS extracted_id
842
- )
190
+ See [references/validation-queries.md](references/validation-queries.md) for all SQL templates.
843
191
 
844
- SELECT
845
- COUNT(*) AS total_relationships,
846
- COUNT(tgt.<target_pk>) AS matched_relationships,
847
- COUNT(*) - COUNT(tgt.<target_pk>) AS unmatched_relationships,
848
- ROUND(100.0 * COUNT(tgt.<target_pk>) / COUNT(*), 2) AS match_percentage
849
- FROM extracted e
850
- LEFT JOIN `<project>.<dataset>.<target_model>` tgt
851
- ON e.extracted_id = tgt.<target_pk>;
852
- ```
853
-
854
- Sample the extracted join to inspect actual matched values:
855
-
856
- ```sql
857
- WITH extracted AS (
858
- SELECT DISTINCT
859
- src.<source_pk> AS source_id,
860
- extracted_id
861
- FROM `<project>.<dataset>.<source_model>` src,
862
- UNNEST(JSON_VALUE_ARRAY(src.<json_array_column>)) AS extracted_id
863
- )
864
-
865
- SELECT
866
- e.source_id,
867
- e.extracted_id,
868
- tgt.<target_pk>,
869
- tgt.<display_column>
870
- FROM extracted e
871
- LEFT JOIN `<project>.<dataset>.<target_model>` tgt
872
- ON e.extracted_id = tgt.<target_pk>
873
- LIMIT 10;
874
- ```
875
-
876
- ### 4.7 Validate Type Compatibility
877
-
878
- Check that join columns have compatible types using INFORMATION_SCHEMA:
879
-
880
- ```sql
881
- SELECT column_name, data_type
882
- FROM `<project>.<dataset>.INFORMATION_SCHEMA.COLUMNS`
883
- WHERE table_name IN ('<source_model>', '<target_model>')
884
- AND column_name IN ('<foreign_key>', '<target_pk>');
885
- ```
192
+ Validate both directions of every candidate relationship where possible.
886
193
 
887
- For JSON / array extracted keys, check the extracted key type against the target key type.
888
-
889
- If types differ:
890
-
891
- 1. Report the mismatch.
892
- 2. Prefer fixing type alignment in the dbt model or approved support model.
893
- 3. Only cast in Cube join SQL when necessary.
894
- 4. Prefer casting the foreign-key side to match the primary-key side.
895
-
896
- ### 4.8 Validation Rules
897
-
898
- Do not present relationships to the user as valid options until:
899
-
900
- 1. Candidate joins have been validated, or validation is explicitly marked as pending.
901
- 2. Both directions of every candidate relationship have been validated where possible.
902
- 3. Primary key uniqueness has been checked, or marked as pending.
903
- 4. Type compatibility has been checked, or marked as pending.
904
- 5. Match rates are reported, or marked as pending.
905
- 6. Reverse one-to-many aggregations are checked with sampled counts where applicable.
906
- 7. Many-to-many bridge edges are validated in both directions where applicable.
907
- 8. JSON / array relationships have been extracted and validated where present, or marked as pending.
908
- 9. Low match rates or suspicious results are explained.
909
- 10. Sample joined data or sampled aggregate counts look reasonable, or limitations are documented.
194
+ Do not present unvalidated joins as confirmed. If validation cannot run, mark as `validation pending`.
910
195
 
911
196
  ---
912
197
 
913
- ## Phase 5: Present Validated Relationships and Ask for Confirmation
914
-
915
- ### Goal
916
-
917
- Show the detected and validated semantic structure before generating files.
198
+ ## Phase 5: Present Validated Relationships
918
199
 
919
- Present selected models, approved connector models, candidate keys, JSON/array keys, connector paths, bridge models, validated joins, cardinality, and validation evidence.
200
+ Present all validated relationships to the user: selected models, approved connectors, keys, joins with cardinality, match rates, and validation evidence.
920
201
 
921
- Example:
922
-
923
- ```text
924
- Selected gold models:
925
- - gold_hubspot_companies
926
- - gold_hubspot_deals
927
-
928
- Approved connector models:
929
- - gold_hubspot_users
930
-
931
- Entity: deals
932
- Source model: gold_hubspot_deals
933
- Cube name (overlay): hubspot_deals
934
- Candidate primary key: deal_id
935
- Secondary keys: owner_id
936
- JSON / array keys:
937
- - companies -> company_id
938
-
939
- Validated joins:
940
- - deals.owner_id -> users.user_id (many_to_one)
941
- Match rate: 99.8%
942
- Reverse direction: users -> deals (one_to_many)
943
- Sample reverse counts checked.
944
-
945
- - deals.companies[] -> companies.id through bridge gold_companies_deals
946
- Relationship: many_to_many
947
- Bridge edges validated.
948
- ```
202
+ Ask user to confirm or modify (Checkpoint 3). Do not generate files until confirmed.
949
203
 
950
- Then ask:
951
-
952
- ```text
953
- Please confirm these selected models, connector models, bridge/support models, and validated relationships, or tell me what to change before I generate the semantic overlays.
954
- ```
955
-
956
- Do not generate final files until the user confirms or corrects the relationship model.
957
-
958
- If a relationship could not be validated but the user still wants to proceed, mark it clearly as an assumption in the generated summary, and tag it with a YAML comment in the generated overlay (see Phase 8).
204
+ If a relationship could not be validated but user proceeds, tag it with `# UNVALIDATED: <reason>` in the generated YAML.
959
205
 
960
206
  ---
961
207
 
962
208
  ## Phase 6: Generate Dimensions
963
209
 
964
- ### Goal
965
-
966
- Expose all selected gold model business columns as Cube dimensions, plus `_airbyte_extracted_at`.
967
-
968
- ### Type Mapping
969
-
970
- ```text
971
- STRING / VARCHAR / TEXT -> string
972
- INTEGER / FLOAT / NUMERIC / DECIMAL -> number
973
- BOOLEAN / BOOL -> boolean
974
- DATE / DATETIME / TIMESTAMP -> time
975
- JSON / ARRAY / STRUCT -> string or skip only if not queryable directly
976
- ```
977
-
978
- ### Dimension Rules
979
-
980
- 1. Include primary keys as dimensions with `primary_key: true`.
981
- 2. Include secondary keys as dimensions.
982
- 3. Include human-readable names and statuses as string dimensions.
983
- 4. Include timestamps as time dimensions.
984
- 5. Include numeric attributes as number dimensions.
985
- 6. Expose every business column from each selected gold model as a dimension by default.
986
- 7. From technical Airbyte columns, include only `_airbyte_extracted_at`. Name the dimension `airbyte_extracted_at` (without the leading underscore) and reference the column as `${CUBE}._airbyte_extracted_at`.
987
- 8. Exclude all other `_airbyte_*` columns by default (`_airbyte_raw_id`, `_airbyte_meta`, `_airbyte_generation_id`, etc.). Do not include them unless the user explicitly asks.
988
- 9. Skip or transform only columns that cannot be represented safely in Cube, and document why.
989
- 10. JSON / array columns used for relationships should usually be represented through bridge/support models.
990
-
991
- ### Composite Primary Keys
992
-
993
- Cube allows exactly one dimension flagged with `primary_key: true` per cube. When a gold model has a composite primary key (multiple columns that together uniquely identify a row, for example `(office_unique_id, month)` in a monthly snapshot model), do not flag any of the component columns as primary key directly. Instead:
994
-
995
- 1. Keep each component column as a regular dimension (no `primary_key` flag).
996
- 2. Add an additional synthetic primary-key dimension that concatenates the components, using the same `CONCAT` pattern used for bridge cubes.
997
-
998
- Example for a monthly active users model with composite PK `(office_unique_id, month)`:
999
-
1000
- ```yaml
1001
- dimensions:
1002
- id:
1003
- sql: "CONCAT(${CUBE}.office_unique_id, '-', ${CUBE}.month)"
1004
- type: string
1005
- primary_key: true
1006
-
1007
- office_unique_id:
1008
- sql: "${CUBE}.office_unique_id"
1009
- type: string
1010
-
1011
- month:
1012
- sql: "${CUBE}.month"
1013
- type: time
1014
- ```
1015
-
1016
- Choose a separator that does not appear in either component value. `-` is usually safe for IDs and dates; if components may contain `-`, use `||` or another unambiguous separator.
1017
-
1018
- The synthetic `id` dimension is the cube's primary key for Cube's purposes. Joins to this cube must reference that synthetic `id`, not individual components — unless the joining cube has the same composite key columns and uses a parallel `CONCAT` in the join SQL.
1019
-
1020
- ### Large Column Count Warning
1021
-
1022
- If a model has more than 50 business columns, inform the user before generating:
1023
-
1024
- ```text
1025
- This model has <N> business columns. I will generate <N> dimensions by default,
1026
- plus airbyte_extracted_at if the column exists.
1027
- Other _airbyte_* columns will be excluded by default.
1028
-
1029
- Proceed with all business columns, or should I skip any groups?
1030
- ```
210
+ Expose all business columns from each selected gold model as Cube dimensions.
1031
211
 
1032
- Do not skip business columns without explicit user instruction.
212
+ Key rules:
1033
213
 
1034
- Example dimensions block:
214
+ 1. Include PKs with `primary_key: true`. For composite PKs, use a synthetic `CONCAT` dimension — see [references/cube-examples.md](references/cube-examples.md), Composite Primary Key section.
215
+ 2. Include secondary keys, names, statuses, timestamps, and numeric attributes as dimensions.
216
+ 3. From `_airbyte_*` columns, include only `_airbyte_extracted_at` as dimension `airbyte_extracted_at` (reference as `${CUBE}._airbyte_extracted_at`). Exclude all other `_airbyte_*` columns.
217
+ 4. Do not remove business columns just because they don't look immediately useful.
218
+ 5. JSON/array columns used for relationships should be represented through bridge models.
219
+ 6. If a model has >50 business columns, ask user before generating.
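The dimension rules above can be sketched as a small generator over a model's column list. This is a hedged illustration only: `build_dimensions` and `TYPE_MAP` are hypothetical names, the type names are generic warehouse types rather than an exhaustive BigQuery list, and defaulting unknown types to `string` is an assumption made for the sketch.

```python
from typing import Dict, List, Tuple

# Generic warehouse type -> Cube dimension type (assumed, not exhaustive).
TYPE_MAP = {
    "STRING": "string", "VARCHAR": "string", "TEXT": "string",
    "INTEGER": "number", "FLOAT": "number", "NUMERIC": "number", "DECIMAL": "number",
    "BOOLEAN": "boolean", "BOOL": "boolean",
    "DATE": "time", "DATETIME": "time", "TIMESTAMP": "time",
}


def build_dimensions(columns: List[Tuple[str, str]], primary_key: str) -> Dict[str, dict]:
    """Build a map-style Cube `dimensions` block from (name, type) pairs."""
    dimensions: Dict[str, dict] = {}
    for name, warehouse_type in columns:
        # Keep only _airbyte_extracted_at from the Airbyte metadata columns.
        if name.startswith("_airbyte_") and name != "_airbyte_extracted_at":
            continue
        dimension = {
            "sql": "${CUBE}." + name,
            "type": TYPE_MAP.get(warehouse_type.upper(), "string"),  # assumed default
        }
        if name == primary_key:
            dimension["primary_key"] = True
        # Expose the Airbyte timestamp without its leading underscore.
        key = "airbyte_extracted_at" if name == "_airbyte_extracted_at" else name
        dimensions[key] = dimension
    return dimensions


dims = build_dimensions(
    [
        ("deal_id", "STRING"),
        ("amount", "NUMERIC"),
        ("created_at", "TIMESTAMP"),
        ("_airbyte_raw_id", "STRING"),
        ("_airbyte_extracted_at", "TIMESTAMP"),
    ],
    primary_key="deal_id",
)
print(list(dims))
```

Composite primary keys and the >50-column confirmation are deliberately left out of this sketch. Both need a decision that the rules above route through the referenced examples or the user.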
1035
220
 
1036
- ```yaml
1037
- dimensions:
1038
- deal_id:
1039
- sql: "${CUBE}.deal_id"
1040
- type: string
1041
- primary_key: true
1042
-
1043
- company_id:
1044
- sql: "${CUBE}.company_id"
1045
- type: string
1046
-
1047
- deal_name:
1048
- sql: "${CUBE}.deal_name"
1049
- type: string
1050
-
1051
- amount:
1052
- sql: "${CUBE}.amount"
1053
- type: number
1054
-
1055
- created_at:
1056
- sql: "${CUBE}.created_at"
1057
- type: time
1058
-
1059
- airbyte_extracted_at:
1060
- sql: "${CUBE}._airbyte_extracted_at"
1061
- type: time
1062
- ```
221
+ See [references/cube-examples.md](references/cube-examples.md) for type mapping and dimension examples.
1063
222
 
1064
223
  ---
1065
224
 
1066
225
  ## Phase 7: Suggest and Confirm Measures
1067
226
 
1068
- ### Goal
1069
-
1070
- Create default and useful measures without inventing ambiguous business logic.
1071
-
1072
- ### Default Measure
1073
-
1074
- Always include a row count measure unless project convention says otherwise.
1075
-
1076
- ```yaml
1077
- measures:
1078
- count:
1079
- type: count
1080
- ```
1081
-
1082
- ### Suggested Measures
1083
-
1084
- Suggest useful measures based on model schema and column names.
1085
-
1086
- Common suggestions:
1087
-
1088
- ```text
1089
- amount -> total_amount (sum), average_amount (avg), min_amount, max_amount
1090
- revenue -> total_revenue (sum)
1091
- price -> total_price or average_price depending on context
1092
- cost -> total_cost (sum)
1093
- quantity / qty -> total_quantity (sum)
1094
- duration -> total_duration or average_duration
1095
- deal_id / company_id / user_id -> count_distinct
1096
- created_at -> first_created_at (min), last_created_at (max)
1097
- closed_at -> first_closed_at (min), last_closed_at (max)
1098
- updated_at -> last_updated_at (max)
1099
- ```
1100
-
1101
- Examples:
1102
-
1103
- ```yaml
1104
- measures:
1105
- total_amount:
1106
- sql: "${CUBE}.amount"
1107
- type: sum
1108
-
1109
- unique_companies:
1110
- sql: "${CUBE}.company_id"
1111
- type: count_distinct
1112
-
1113
- last_closed_at:
1114
- sql: "${CUBE}.closed_at"
1115
- type: max
1116
- ```
1117
-
1118
- ### `count_distinct` on Foreign Keys
1119
-
1120
- `count_distinct` on a foreign-key column counts unique values within the cube that owns the FK. For example, `count_distinct(company_id)` inside the `hubspot_deals` cube answers "how many distinct companies have deals."
1121
-
1122
- Be careful when this measure is used in queries that join multiple cubes. Joins can produce row fan-out, which inflates or distorts distinct counts. Rules:
1123
-
1124
- 1. Define `count_distinct` on FK columns inside the cube that owns the FK.
1125
- 2. Add a brief description in the measure (or note it for the user) so consumers know the cube boundary.
1126
- 3. Avoid suggesting `count_distinct` on FK columns inside the parent cube (the cube that owns the PK) — `count` of rows there usually answers the same question more reliably.
1127
-
1128
- ### Measure Confirmation
1129
-
1130
- After suggesting measures, ask the user:
1131
-
1132
- ```text
1133
- I will create the default count measure.
1134
-
1135
- I also found these possible additional measures:
1136
- - total_amount: sum(amount)
1137
- - average_amount: avg(amount)
1138
- - unique_companies: count_distinct(company_id)
1139
- - last_closed_at: max(closed_at)
1140
-
1141
- Which of these should I include?
1142
-
1143
- Do you want to define any custom measures?
1144
- ```
1145
-
1146
- ### Custom Measures
1147
-
1148
- If the user requests a custom measure, ask for a precise definition if needed.
1149
-
1150
- A custom measure definition should include:
1151
-
1152
- ```text
1153
- measure name
1154
- source column(s)
1155
- aggregation type
1156
- filters or conditions
1157
- business meaning
1158
- ```
1159
-
1160
- Custom measure rules:
1161
-
1162
- 1. Do not create ambiguous custom measures.
1163
- 2. If the definition is clear, generate the measure.
1164
- 3. If the definition is unclear, ask for clarification.
1165
- 4. If the requested measure cannot be created from available columns, explain why and list the missing data.
227
+ 1. Always include a default `count` measure.
228
+ 2. Suggest useful additional measures based on column names. See [references/cube-examples.md](references/cube-examples.md), Measure Suggestions section.
229
+ 3. `count_distinct` on FK columns: define inside the cube that owns the FK, not the parent cube. Joins produce fan-out that distorts distinct counts.
230
+ 4. Ask user to confirm suggested measures or define custom ones (Checkpoint 4).
231
+ 5. If custom measure definition is unclear, ask for clarification.
1166
232
 
1167
233
  ---
1168
234
 
1169
235
  ## Phase 8: Generate Cube Semantic Overlays
1170
236
 
1171
- ### Goal
1172
-
1173
- Create Cube.dev semantic overlay YAML files for selected and approved gold models.
1174
-
1175
- ### Output Location and File Naming
1176
-
1177
- Create files under `semantic/cubes/`. The overlay file name strips the `gold_` prefix. The cube `name` matches the file name (without extension).
1178
-
1179
- Examples:
1180
-
1181
- ```text
1182
- gold_hubspot_deals.sql -> semantic/cubes/hubspot_deals.yml (name: hubspot_deals)
1183
- gold_hubspot_companies.sql -> semantic/cubes/hubspot_companies.yml (name: hubspot_companies)
1184
- gold_companies_deals.sql -> semantic/cubes/companies_deals.yml (name: companies_deals)
1185
- gold_clients.sql -> semantic/cubes/clients.yml (name: clients)
1186
- ```
1187
-
1188
- ### Required Source Reference
1189
-
1190
- Use the fully qualified BigQuery table reference. Resolve env variables to literals via `echo` (see "Cube `sql_table` Reference" section above).
1191
-
1192
- Example:
1193
-
1194
- ```yaml
1195
- sql_table: "`revos-dev.revos_1737556292084.gold_hubspot_deals`"
1196
- ```
1197
-
1198
- ### Overlay Style
237
+ Create Cube.dev YAML files in `semantic/`. Follow the existing style detected in Phase 1.
1199
238
 
1200
- Follow the existing `semantic/cubes/` style detected in Phase 1.
239
+ Key rules:
1201
240
 
1202
- If existing overlays use map-style `dimensions`, `measures`, and `joins`, use map-style.
241
+ 1. File name = cube `name` (no `gold_` prefix) + `.yml`.
242
+ 2. `sql_table` uses fully qualified BigQuery reference with `gold_` prefix.
243
+ 3. Every confirmed relationship gets joins in both directions.
244
+ 4. Bridge/junction cubes use `public: false`.
245
+ 5. Every overlay includes `refresh_key`. Prefer `_airbyte_extracted_at`, fall back to other timestamps, then `every: 1 hour`.
246
+ 6. `refresh_key.sql` references the same table as `sql_table`.
247
+ 7. Tag unvalidated joins with `# UNVALIDATED: <reason>`.
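Rules 1, 5, and 6 above are mechanical enough to sketch in a few lines. The helper names (`cube_name`, `overlay_file`, `refresh_key`), the default `overlay_dir`, and the fallback timestamp column names are all assumptions made for illustration. The project's real conventions should take precedence.

```python
from typing import Dict, Iterable, Sequence


def cube_name(gold_model: str) -> str:
    """Cube name is the model name without the gold_ prefix (rule 1)."""
    prefix = "gold_"
    return gold_model[len(prefix):] if gold_model.startswith(prefix) else gold_model


def overlay_file(gold_model: str, overlay_dir: str = "semantic") -> str:
    """Overlay path: <overlay_dir>/<cube name>.yml (directory is an assumption)."""
    return f"{overlay_dir}/{cube_name(gold_model)}.yml"


def refresh_key(
    sql_table: str,
    columns: Iterable[str],
    fallback_timestamps: Sequence[str] = ("updated_at", "created_at"),  # assumed names
) -> Dict[str, str]:
    """Pick a refresh_key: Airbyte timestamp, then other timestamps, then a timer.

    The SQL variant reuses the exact sql_table string, so both always
    reference the same table (rule 6).
    """
    available = set(columns)
    for candidate in ("_airbyte_extracted_at", *fallback_timestamps):
        if candidate in available:
            return {"sql": f"SELECT MAX({candidate}) FROM {sql_table}"}
    return {"every": "1 hour"}


table = "`revos-dev.revos_1737556292084.gold_hubspot_deals`"
print(overlay_file("gold_hubspot_deals"))
print(refresh_key(table, ["deal_id", "_airbyte_extracted_at"]))
```

Passing the already-resolved `sql_table` literal into `refresh_key` is one way to avoid the two references drifting apart.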
1203
248
 
1204
- If existing overlays use `extends:` to inherit from base cubes, follow that convention. Otherwise, generate self-contained cubes.
-
- ### Canonical Example: Standard Cube
-
- A complete example of a standard (non-bridge) cube:
-
- ```yaml
- cubes:
-   - name: hubspot_companies
-     sql_table: "`revos-dev.revos_1737556292084.gold_hubspot_companies`"
-
-     joins:
-       companies_deals:
-         sql: "${CUBE}.id = ${companies_deals}.company_id"
-         relationship: one_to_many
-
-     measures:
-       count:
-         type: count
-
-       total_deal_value:
-         sql: "${CUBE}.properties_hs_total_deal_value"
-         type: sum
-
-       num_open_deals:
-         sql: "${CUBE}.properties_hs_num_open_deals"
-         type: sum
-
-     dimensions:
-       id:
-         sql: "${CUBE}.id"
-         type: string
-         primary_key: true
-
-       airbyte_extracted_at:
-         sql: "${CUBE}._airbyte_extracted_at"
-         type: time
-
-     refresh_key:
-       sql: "SELECT MAX(_airbyte_extracted_at) FROM `revos-dev.revos_1737556292084.gold_hubspot_companies`"
- ```
-
- Notes:
-
- 1. Top-level `cubes:` array is required.
- 2. Cube `name` is `hubspot_companies` (no `gold_` prefix).
- 3. `sql_table` references `gold_hubspot_companies` (with `gold_` prefix), in backticks.
- 4. The join references `${companies_deals}` — the cube name of a bridge cube defined in `semantic/cubes/companies_deals.yml`.
- 5. Only `_airbyte_extracted_at` is exposed from Airbyte metadata, as `airbyte_extracted_at`.
- 6. `refresh_key.sql` uses the same fully qualified table name as `sql_table`.
-
- ### Unvalidated Joins
-
- If the user chose to proceed with a join that could not be validated, generate the join but tag it with a YAML comment:
-
- ```yaml
- joins:
-   hubspot_companies:
-     # UNVALIDATED: match rate could not be measured because gold_hubspot_companies was not yet materialized in BigQuery
-     sql: "${CUBE}.company_id = ${hubspot_companies}.id"
-     relationship: many_to_one
- ```
-
- Use a short, factual reason after `UNVALIDATED:`.
-
- ---
-
- ## Cube Overlay Requirements
-
- ### Joins
-
- Every confirmed relationship must be represented in both directions where both cubes exist.
-
- Direct many-to-one example:
-
- ```yaml
- # In hubspot_deals.yml
- joins:
-   hubspot_companies:
-     sql: "${CUBE}.company_id = ${hubspot_companies}.id"
-     relationship: many_to_one
- ```
-
- Reverse one-to-many:
-
- ```yaml
- # In hubspot_companies.yml
- joins:
-   hubspot_deals:
-     sql: "${CUBE}.id = ${hubspot_deals}.company_id"
-     relationship: one_to_many
- ```
-
- Connector path. For `products -> clients -> addresses`, create joins for each edge in both directions:
-
- ```yaml
- # In products.yml
- joins:
-   clients:
-     sql: "${CUBE}.client_id = ${clients}.id"
-     relationship: many_to_one
- ```
-
- ```yaml
- # In clients.yml
- joins:
-   products:
-     sql: "${CUBE}.id = ${products}.client_id"
-     relationship: one_to_many
-
-   addresses:
-     sql: "${CUBE}.id = ${addresses}.client_id"
-     relationship: one_to_many
- ```
-
- ```yaml
- # In addresses.yml
- joins:
-   clients:
-     sql: "${CUBE}.client_id = ${clients}.id"
-     relationship: many_to_one
- ```
-
- Bridge join. The bridge cube joins to both parents:
-
- ```yaml
- # In companies_deals.yml
- joins:
-   hubspot_companies:
-     relationship: many_to_one
-     sql: "${CUBE}.company_id = ${hubspot_companies}.id"
-
-   hubspot_deals:
-     relationship: many_to_one
-     sql: "${CUBE}.deal_id = ${hubspot_deals}.id"
- ```
-
- Reverse joins from each parent to the bridge:
-
- ```yaml
- # In hubspot_companies.yml
- joins:
-   companies_deals:
-     sql: "${CUBE}.id = ${companies_deals}.company_id"
-     relationship: one_to_many
- ```
-
- ```yaml
- # In hubspot_deals.yml
- joins:
-   companies_deals:
-     sql: "${CUBE}.id = ${companies_deals}.deal_id"
-     relationship: one_to_many
- ```
-
- Join rules:
-
- 1. Use validated keys.
- 2. Use the correct relationship direction from the current cube.
- 3. Generate both directions for every confirmed relationship.
- 4. Reference other cubes by their cube `name` (without `gold_` prefix) in `${...}`.
- 5. Prefer joins between gold cubes.
- 6. Keep join SQL readable and explicit.
- 7. If key casting is required, prefer fixing it in the dbt model or support model first.
- 8. Tag unvalidated joins with `# UNVALIDATED: <reason>` instead of silently emitting them.
- 9. For JSON / array relationships, use or create approved bridge/support models (creation delegated to `create_dbt_transformations`).
- 10. For connector paths, join through the connector model instead of inventing a direct join.
- 11. Bridge and junction cubes should use `public: false` where the project convention supports it.
-
- ### Bridge / Junction Cubes
-
- Bridge and junction cubes should use `public: false` where the project convention supports it.
-
- Example:
-
- ```yaml
- cubes:
-   - name: companies_deals
-     sql_table: "`revos-dev.revos_1737556292084.gold_companies_deals`"
-     public: false
-
-     joins:
-       hubspot_companies:
-         relationship: many_to_one
-         sql: "${CUBE}.company_id = ${hubspot_companies}.id"
-
-       hubspot_deals:
-         relationship: many_to_one
-         sql: "${CUBE}.deal_id = ${hubspot_deals}.id"
-
-     measures:
-       count:
-         type: count
-
-     dimensions:
-       id:
-         sql: "CONCAT(${CUBE}.deal_id, '-', ${CUBE}.company_id)"
-         type: string
-         primary_key: true
-
-       deal_id:
-         sql: "${CUBE}.deal_id"
-         type: string
-
-       company_id:
-         sql: "${CUBE}.company_id"
-         type: string
-
-       airbyte_extracted_at:
-         sql: "${CUBE}._airbyte_extracted_at"
-         type: time
-
-     refresh_key:
-       sql: "SELECT MAX(_airbyte_extracted_at) FROM `revos-dev.revos_1737556292084.gold_companies_deals`"
- ```
-
- If the bridge model does not have `_airbyte_extracted_at`, omit that dimension and use the default time-based refresh key:
-
- ```yaml
- refresh_key:
-   every: 1 hour
- ```
-
- ### Refresh Key
-
- Every generated Cube overlay must include `refresh_key`.
-
- Priority order:
-
- 1. If the gold model has `_airbyte_extracted_at`, use it for a SQL-based refresh key.
- 2. If the gold model has another reliable ingestion or update timestamp (`updated_at`, `modified_at`, `loaded_at`, `synced_at`), use that.
- 3. Otherwise use the default time-based refresh key.
-
- `refresh_key.sql` must reference the same fully qualified BigQuery table as the cube's own `sql_table`. Never a different table.
-
- Airbyte refresh key:
-
- ```yaml
- refresh_key:
-   sql: "SELECT MAX(_airbyte_extracted_at) FROM `<project>.<dataset>.<gold_model>`"
- ```
-
- Other timestamp-based refresh key:
-
- ```yaml
- refresh_key:
-   sql: "SELECT MAX(updated_at) FROM `<project>.<dataset>.<gold_model>`"
- ```
-
- Default:
-
- ```yaml
- refresh_key:
-   every: 1 hour
- ```
-
- Refresh key rules:
-
- 1. Always include `refresh_key` in every generated Cube overlay.
- 2. Prefer `_airbyte_extracted_at` when it exists.
- 3. Prefer reliable timestamp-based refresh keys over fixed interval refresh keys.
- 4. Use `every: 1 hour` as the default fallback.
- 5. If a timestamp column exists but is not reliable, explain the assumption and use the default fallback.
- 6. Bridge and junction cubes must also include `refresh_key`.
- 7. `refresh_key.sql` must reference the same fully qualified BigQuery table as the cube's own `sql_table`.
+ See [references/cube-examples.md](references/cube-examples.md) for canonical standard cube, bridge cube, join direction examples, and refresh key variants.
 
  ---
 
  ## Phase 9: Validate Generated Files
 
- ### Goal
-
- Validate generated semantic files and any approved support models.
-
- ### dbt Validation
-
- If `create_dbt_transformations` was invoked during this run (for example, to create a bridge model), it has already validated the new dbt models with `revos dbt run` and `revos dbt test`. No re-validation is needed here for those.
-
- If only existing gold models were used, run a basic syntax check via the dbt skill's standard command:
-
- ```bash
- revos dbt parse
- ```
-
- ### Verify Physical Tables Exist in BigQuery
-
- For each generated cube, confirm the physical table referenced in `sql_table` actually exists in BigQuery. Cube does not catch a missing table at YAML parse time; it only fails at first query.
-
- ```bash
- bq show <project>:<dataset>.<table_name>
- ```
-
- Example:
-
- ```bash
- bq show revos-dev:revos_1737556292084.gold_hubspot_companies
- ```
-
- If the table does not exist, the gold model is not yet materialized. Either materialize it first (run the gold dbt build via `create_dbt_transformations`), or document this as a pending item before handing the overlay over.
-
- ### Semantic Validation
-
- Run available project commands to validate Cube YAML.
-
- Placeholder:
-
- ```bash
- <cube-validation-command>
- ```
-
- ### Manual Validation Checklist
-
- 1. Semantic overlays were created in `semantic/cubes/`.
- 2. Overlay file names drop the `gold_` prefix (`hubspot_companies.yml`, not `gold_hubspot_companies.yml`).
- 3. Each cube `name` matches the overlay file name (without extension) and has no `gold_` prefix.
- 4. Cube overlays reference the gold tables using `sql_table: "`<project>.<dataset>.gold_<entity>`"`.
- 5. The physical table name in `sql_table` keeps the `gold_` prefix.
- 6. No `dbt ref()` syntax is used in new overlays.
- 7. Each cube has dimensions.
- 8. Every business column from each selected gold model is exposed as a dimension unless documented otherwise.
- 9. From `_airbyte_*` columns, only `_airbyte_extracted_at` is exposed (as dimension `airbyte_extracted_at`); other `_airbyte_*` columns are excluded.
- 10. Each cube has a `count` measure.
- 11. Suggested and approved additional measures are included.
- 12. Ambiguous custom measures are not created without clarification.
- 13. `count_distinct` measures on FK columns are defined inside the cube that owns the FK, not the parent cube.
- 14. Primary keys are marked correctly. For composite primary keys, a synthetic `CONCAT(...)` dimension is the one flagged with `primary_key: true`; individual components are kept as regular dimensions.
- 15. Secondary keys are preserved as dimensions.
- 16. JSON arrays use `UNNEST(JSON_VALUE_ARRAY(...))`.
- 17. JSON / array relationship keys are extracted into approved bridge/support models where needed.
- 18. Bridge and junction cubes use `public: false` where the project convention supports it.
- 19. Each cube has a `refresh_key`.
- 20. Cubes with `_airbyte_extracted_at` use it in a SQL-based `refresh_key` where possible.
- 21. Cubes without `_airbyte_extracted_at` but with another reliable timestamp use that timestamp where possible.
- 22. Cubes without a reliable timestamp use `every: 1 hour`.
- 23. `refresh_key.sql` references the same fully qualified BigQuery table as the cube's own `sql_table`.
- 24. Joins use validated columns.
- 25. Joins reference other cubes by their cube `name` (without `gold_` prefix), e.g. `${hubspot_companies}`.
- 26. Join relationship types are correct from the current cube perspective.
- 27. Every confirmed relationship has joins in both directions where both cubes exist.
- 28. Candidate joins were validated before being proposed to the user.
- 29. Join validation was performed in both directions where possible.
- 30. Reverse one-to-many aggregation checks were performed where applicable.
- 31. Many-to-many bridge edges were validated in both directions where applicable.
- 32. Joins that could not be validated and were generated anyway are tagged with `# UNVALIDATED: <reason>`.
- 33. Connector models are only used if the user approved them.
- 34. Selected models that remain disconnected are documented.
- 35. Physical tables referenced by `sql_table` exist in BigQuery, or the missing tables are clearly listed as pending items.
- 36. Placeholder commands or assumptions are clearly marked.
+ 1. If `create-dbt-transformations` was invoked (bridge model), it already validated dbt models. Otherwise run `revos dbt parse`.
+ 2. Verify physical tables exist in BigQuery: `bq show <dataset>.<table_name>`. If missing, document as pending.
+ 3. Verify generated overlays match conventions: flat YAML, correct naming, correct `sql_table`, all dimensions present, `refresh_key` included, joins in both directions.
 
  ---
 
  ## Final Response Format
 
- After generation, summarize what was created:
-
  ```text
  Created semantic model draft.
 
  Selected gold models:
  - dbt/models/gold/<gold_model_1>.sql
- - dbt/models/gold/<gold_model_2>.sql
 
  Approved connector models:
  - dbt/models/gold/<connector_model>.sql
 
- Bridge/support models created during this run (via create_dbt_transformations):
+ Bridge/support models created (via create-dbt-transformations):
  - dbt/models/gold/<bridge_model>.sql
 
  Semantic overlays:
- - semantic/cubes/<entity_1>.yml (cube name: <entity_1>)
- - semantic/cubes/<entity_2>.yml (cube name: <entity_2>)
- - semantic/cubes/<bridge_entity>.yml (cube name: <bridge_entity>, public: false)
+ - semantic/<entity_1>.yml (cube name: <entity_1>)
+ - semantic/<bridge_entity>.yml (cube name: <bridge_entity>, public: false)
 
- Detected and validated relationships:
+ Validated relationships:
  - <entity_a>.<key> -> <entity_b>.<key> (<relationship_type>)
- - <entity_b>.<key> -> <entity_a>.<key> (<reverse_relationship_type>)
-
- Connector paths:
- - <selected_entity_a> -> <connector_entity> -> <selected_entity_b>
-
- Bridge relationships:
- - <source_entity>.<json_array_column>[] -> <target_entity>.<target_key> through <bridge_entity>
 
  Measures:
  - count
  - <approved_measure_1>
- - <approved_measure_2>
 
  Validation:
- - dbt validation: <passed / failed / pending / not run — only existing models used>
- - physical table existence in BigQuery: <passed / failed / pending>
- - join candidate validation before proposal: <passed / failed / pending>
- - reverse join validation: <passed / failed / pending>
- - semantic validation: <passed / failed / pending>
+ - dbt: <passed / pending / not run>
+ - physical tables: <passed / pending>
+ - join validation: <passed / pending>
+ - semantic validation: <passed / pending>
 
- Unvalidated joins (tagged in YAML with # UNVALIDATED):
+ Unvalidated joins (tagged with # UNVALIDATED):
  - <cube>.<join_target>: <reason>
 
  Assumptions:
- - <assumption_1>
- - <assumption_2>
+ - <assumption>
 
  Pending items:
- - <pending_item_1>
- - <pending_item_2>
+ - <pending_item>
+
+ Next step:
+ revos overlays push -d ./semantic
  ```
 
  If validation is incomplete, say exactly what remains pending.