@rishildi/ldi-process-skills 0.1.6 → 0.1.8
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/build/skills/embedded.js +8 -8
- package/package.json +1 -1
package/build/skills/embedded.js
CHANGED
@@ -1,5 +1,5 @@
 // AUTO-GENERATED by scripts/embed-skills.ts — do not edit
-// Generated at: 2026-04-
+// Generated at: 2026-04-04T22:28:17.621Z
 export const EMBEDDED_SKILLS = [
   {
     name: "create-fabric-lakehouses",
@@ -7,7 +7,7 @@ export const EMBEDDED_SKILLS = [
     files: [
       {
         relativePath: "SKILL.md",
-
content: "---\nname: create-fabric-lakehouses\ndescription: >\n Use this skill when asked to create, provision, or set up one or more\n Lakehouse items in existing Microsoft Fabric workspaces. Triggers on:\n \"create a lakehouse\", \"provision lakehouses\", \"set up a Fabric lakehouse\",\n \"create lakehouse in Fabric\", \"new lakehouse\", \"create lakehouses across\n workspaces\". Does NOT trigger for: creating workspaces (use\n generate-fabric-workspace), querying lakehouse data, managing tables,\n uploading files, creating shortcuts, or general Fabric workspace management.\nlicense: MIT\ncompatibility: Fabric CLI (fab) installed and authenticated; Python 3.10+ for notebook approach\n---\n\n# Create Fabric Lakehouse\n\n> ⚠️ **GOVERNANCE**: This skill produces notebooks and scripts for the operator to\n> review and run — it never executes commands directly against a live Fabric environment.\n> Present each generated artefact to the operator before they run it.\n\nProvisions one or more empty Lakehouse items across one or more existing\nMicrosoft Fabric workspaces, using a user-chosen approach, and produces an\naudit-trail definition file.\n\n**Companion skills:** Workspace creation is handled by the\n`generate-fabric-workspace` skill. Shortcut creation between lakehouses is\na separate skill / manual step. This skill assumes target workspaces already\nexist.\n\n## Prerequisites\n\nBefore starting, ask the operator to run the following and share the output:\n\n```bash\nfab auth status # Must show authenticated\nfab ls # Must return workspace list\n```\n\nIf not authenticated, ask the operator to run `fab auth login` first.\n\n## Workflow\n\nExecute these steps in order.\n\n### Step 1 — Choose Provisioning Approach\n\nAsk the user which approach they want to follow:\n\n| Approach | Description | Best for |\n|----------|-------------|----------|\n| **A — PySpark Notebook** | Generates a `.py` notebook script that installs `ms-fabric-cli` and uses `!fab` commands. 
Output for the user to run in their Fabric workspace. | Users who want a reusable notebook artefact in Fabric |\n| **B — PowerShell Script** | Generates a PowerShell script containing `fab` CLI commands. Output for user validation before execution. | Users who prefer a single script to review and run locally |\n| **C — Interactive CLI** | Runs `fab` commands one-by-one in the terminal, pausing for user validation after each step. | Users who want maximum control and visibility |\n\n### Step 2 — Collect Workspace & Lakehouse Definitions (Sequential)\n\nCollect definitions **one workspace at a time**. For each workspace, gather:\n\n#### 2a — Target Workspace\n\n- [ ] **Workspace name** — must already exist. Verify with:\n ```bash\n fab exists \"<WorkspaceName>.Workspace\"\n ```\n If the workspace does not exist, inform the user and suggest they run\n the `generate-fabric-workspace` skill first. Do not proceed for that\n workspace until it exists.\n\n#### 2b — Naming Convention\n\nSuggest the default naming pattern: `{Prefix}_{CoreName}_{Suffix}`\n\n| Component | Description | Default | Example |\n|-----------|-------------|---------|---------|\n| **Prefix** | Item type indicator | `LH` | `LH` |\n| **CoreName** | Business/project name | *(user provides)* | `LANDONREVENUE` |\n| **Suffix** | Medallion layer or purpose | `BRONZE`, `SILVER`, `GOLD` | `BRONZE` |\n| **Separator** | Character between components | `_` | `_` |\n\nExample result: `LH_LANDONREVENUE_BRONZE`\n\nPresent the suggested defaults and **ask the user to confirm or override**\neach component. The user may change any component or use fully custom names\nthat don't follow the pattern at all.\n\n#### 2c — Lakehouse Definitions\n\nFor each lakehouse in this workspace, collect:\n\n- [ ] **Name** — generated from the naming convention, or custom\n- [ ] **Description** — optional text describing the lakehouse's purpose\n- [ ] **Schema-enabled** — yes/no (default: no). 
See\n `references/schema-enabled.md` for guidance.\n\n#### 2d — More Workspaces?\n\nAfter finishing one workspace, ask:\n\n> \"Do you have another workspace to provision lakehouses in, or are we done?\"\n\nIf yes, loop back to Step 2a. If done, proceed to Step 3.\n\n### Step 3 — Validate Inputs\n\nBefore generating anything, validate **all** lakehouse definitions:\n\n1. For each workspace, confirm it exists:\n ```bash\n fab exists \"<WorkspaceName>.Workspace\"\n ```\n — if it does not exist, stop and direct the user to create it first\n\n2. For each lakehouse, check it doesn't already exist:\n ```bash\n fab exists \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\"\n ```\n — if it already exists, warn the user and ask whether to **skip** or\n **rename**\n\n3. Validate lakehouse names against naming constraints (see Gotchas)\n\n### Step 4 — Generate & Execute\n\nBranch by the approach chosen in Step 1. Process workspaces sequentially.\n\n**Maintain an audit log** throughout execution — record every command run and\nits outcome. This log feeds into the definition file in Step 6.\n\n#### Approach A — PySpark Notebook\n\n1. Generate a PySpark notebook using the template in\n `references/notebook-template.py`\n2. The notebook pattern is:\n - Install `ms-fabric-cli` via `%pip install ms-fabric-cli -q`\n - Authenticate using `notebookutils.credentials.getToken('pbi')` for `FAB_TOKEN`\n and `FAB_TOKEN_AZURE`, and `notebookutils.credentials.getToken('storage')` for\n `FAB_TOKEN_ONELAKE` (OneLake requires the storage-scope token)\n - Add pip's scripts directory to `PATH` so `!fab` works\n - Use `!fab mkdir` shell commands for standard lakehouses\n - Use `!fab api` with REST payload for schema-enabled lakehouses\n3. The notebook must include:\n - A configuration cell with all workspace/lakehouse definitions\n - Existence checks before each creation\n - A summary cell at the end\n4. Save to `/home/claude/<workspace>_create_lakehouses.py`\n5. Present to user for review\n6. 
Optionally upload:\n ```bash\n fab import \"<Workspace>.Workspace/<Name>.Notebook\" -i <path> --format py -f\n ```\n\n#### Approach B — PowerShell Script\n\n1. Generate a PowerShell script with the following structure:\n2. The script must:\n - Use `fab mkdir` for standard lakehouses\n - Handle schema-enabled lakehouses via the Fabric REST API\n (`fab api` wrapper — see `references/fabric-api-lakehouse.md`)\n - Include `fab exists` checks before each creation\n - Track created items for potential rollback\n - Include error handling and summary output\n3. Save to `/home/claude/create_lakehouses.ps1`\n4. Present the script and **wait for explicit approval** before running\n\n#### Approach C — Interactive CLI\n\nExecute commands one-by-one per workspace, pausing after each:\n\n1. **For each lakehouse** — check then create:\n ```bash\n fab exists \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\"\n ```\n — if not exists, create. For standard lakehouses:\n ```bash\n fab mkdir \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\"\n ```\n — for schema-enabled lakehouses, use the REST API:\n ```bash\n WS_ID=$(fab get \"<WorkspaceName>.Workspace\" -q \"id\" | tr -d '\"')\n fab api \"workspaces/$WS_ID/lakehouses\" -X post \\\n -i '{\"displayName\":\"<Name>\",\"description\":\"<Desc>\",\"creationPayload\":{\"enableSchemas\":true}}'\n ```\n — wait for user confirmation after each\n\n2. **Verification** after all lakehouses in a workspace:\n ```bash\n fab ls \"<WorkspaceName>.Workspace\" -l\n ```\n\n3. Move to next workspace or proceed to Step 5.\n\n### Step 4a — Failure Handling\n\nIf any lakehouse creation fails during execution:\n\n1. **Stop immediately** — do not proceed to the next lakehouse\n2. **Report** what succeeded and what failed\n3. 
**Ask the user** how to proceed:\n\n| Option | Action |\n|--------|--------|\n| **Retry** | Re-attempt the failed lakehouse creation |\n| **Skip** | Skip the failed item and continue with remaining |\n| **Rollback & Abort** | Delete all lakehouses created *in this run*, then stop |\n| **Abort (keep)** | Stop but leave already-created lakehouses in place |\n\nIf the user chooses **Rollback & Abort**:\n```bash\nfab rm \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\" -f\n```\n— for each lakehouse created in this run (tracked in the audit log).\nConfirm each deletion with the user before executing.\n\n### Step 5 — Verify Creation\n\nRegardless of approach, verify every lakehouse across all workspaces:\n\n```bash\nfab exists \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\"\n```\n\nCollect the lakehouse ID for each:\n```bash\nfab get \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\" -q \"id\"\n```\n\nIf any verification fails, report and ask the user how to proceed (same\noptions as Step 4a).\n\n### Step 6 — Generate Definition File\n\nAfter all lakehouses are verified, generate a Lakehouse Definition markdown\nfile using the template in `references/definition-template.md`.\n\nThe definition file must include:\n\n- **Per workspace:** name, ID\n- **Per lakehouse:** name, ID, description, schema-enabled status, naming\n convention used, creation timestamp\n- **Overall:** approach used, naming convention applied, full audit trail of\n commands/API calls executed, any warnings, skipped items, or rollback actions\n\nSave to `/home/claude/lakehouse_definition.md` and present to user.\n\n## Gotchas\n\n- `fab mkdir` creates a standard lakehouse but does NOT support the\n `enableSchemas` property. 
To create a schema-enabled lakehouse, use\n the Fabric REST API: `POST workspaces/{workspaceId}/lakehouses` with\n `{\"displayName\":\"<n>\",\"creationPayload\":{\"enableSchemas\":true}}`\n- Always use `-f` flag with `fab` commands in scripts to avoid interactive\n prompts that block execution\n- Lakehouse names must be unique within a workspace\n- Workspace names are case-sensitive in `fab` paths\n- Always quote paths containing spaces: `\"My Workspace.Workspace\"`\n- The Fabric REST API requires workspace ID (GUID), not display name —\n extract with `fab get \"<n>.Workspace\" -q \"id\"`\n- In notebooks, `ms-fabric-cli` must be installed via `%pip install` and\n the scripts directory added to `PATH` before `!fab` commands work\n- Token audiences for notebook auth: `'pbi'` for `FAB_TOKEN` and `FAB_TOKEN_AZURE`,\n `'storage'` for `FAB_TOKEN_ONELAKE` (OneLake requires the storage-scope token)\n- `fab auth status` must show a valid token before any operations; tokens\n expire and may need refresh\n- Lakehouse names cannot contain: `/`, `\\`, `#`, `%`, `?` or\n leading/trailing spaces. Max length: 256 characters\n- When rolling back, always confirm each deletion with the user — `fab rm`\n with `-f` is irreversible\n- This skill does NOT create workspaces — if a workspace is missing, direct\n the user to the `generate-fabric-workspace` skill\n- This skill does NOT create shortcuts between lakehouses — that is a\n separate step\n\n## Output Format\n\nSee `references/definition-template.md` for the full template.\n\n## Available References\n\n- **`references/notebook-template.py`** — PySpark notebook template for Approach A\n- **`references/definition-template.md`** — Lakehouse definition output template\n- **`references/schema-enabled.md`** — How schema-enabled lakehouses work\n- **`references/fabric-api-lakehouse.md`** — Fabric REST API reference for\n lakehouse creation\n",
+
content: "---\nname: create-fabric-lakehouses\ndescription: >\n Use this skill when asked to create, provision, or set up one or more\n Lakehouse items in existing Microsoft Fabric workspaces. Triggers on:\n \"create a lakehouse\", \"provision lakehouses\", \"set up a Fabric lakehouse\",\n \"create lakehouse in Fabric\", \"new lakehouse\", \"create lakehouses across\n workspaces\". Does NOT trigger for: creating workspaces (use\n generate-fabric-workspace), querying lakehouse data, managing tables,\n uploading files, creating shortcuts, or general Fabric workspace management.\nlicense: MIT\ncompatibility: Fabric CLI (fab) installed and authenticated; Python 3.10+ for notebook approach\n---\n\n# Create Fabric Lakehouse\n\n> ⚠️ **GOVERNANCE**: This skill produces notebooks and scripts for the operator to\n> review and run — it never executes commands directly against a live Fabric environment.\n> Present each generated artefact to the operator before they run it.\n\nProvisions one or more empty Lakehouse items across one or more existing\nMicrosoft Fabric workspaces, using a user-chosen approach, and produces an\naudit-trail definition file.\n\n**Companion skills:** Workspace creation is handled by the\n`generate-fabric-workspace` skill. Shortcut creation between lakehouses is\na separate skill / manual step. 
This skill assumes target workspaces already\nexist.\n\n## Orchestrated Context\n\nWhen invoked from a workflow agent, read `00-environment-discovery/environment-profile.md`\nand the SOP before asking the user anything.\n\n| Parameter | Source when orchestrated |\n|---|---|\n| Deployment approach (notebook / PowerShell / terminal) | Environment profile |\n| Workspace name(s) | Environment profile or implementation plan |\n| Naming convention / prefix | Implementation plan or SOP |\n| Medallion layer(s) to create (Bronze / Silver / Gold) | SOP shared parameters |\n| Schema-enabled preference | SOP or implementation plan |\n\n**Only ask for parameters not found in these documents.** Summarise what was resolved\nautomatically, then ask for what remains (e.g. lakehouse core name, description).\n\n## Prerequisites\n\nBefore starting, ask the operator to run the following and share the output:\n\n```bash\nfab auth status # Must show authenticated\nfab ls # Must return workspace list\n```\n\nIf not authenticated, ask the operator to run `fab auth login` first.\n\n## Workflow\n\nExecute these steps in order.\n\n### Step 1 — Choose Provisioning Approach\n\nAsk the user which approach they want to follow:\n\n| Approach | Description | Best for |\n|----------|-------------|----------|\n| **A — PySpark Notebook** | Generates a `.py` notebook script that installs `ms-fabric-cli` and uses `!fab` commands. Output for the user to run in their Fabric workspace. | Users who want a reusable notebook artefact in Fabric |\n| **B — PowerShell Script** | Generates a PowerShell script containing `fab` CLI commands. Output for user validation before execution. | Users who prefer a single script to review and run locally |\n| **C — Interactive CLI** | Runs `fab` commands one-by-one in the terminal, pausing for user validation after each step. 
| Users who want maximum control and visibility |\n\n### Step 2 — Collect Workspace & Lakehouse Definitions (Sequential)\n\nCollect definitions **one workspace at a time**. For each workspace, gather:\n\n#### 2a — Target Workspace\n\n- [ ] **Workspace name** — must already exist. Verify with:\n ```bash\n fab exists \"<WorkspaceName>.Workspace\"\n ```\n If the workspace does not exist, inform the user and suggest they run\n the `generate-fabric-workspace` skill first. Do not proceed for that\n workspace until it exists.\n\n#### 2b — Naming Convention\n\nSuggest the default naming pattern: `{Prefix}_{CoreName}_{Suffix}`\n\n| Component | Description | Default | Example |\n|-----------|-------------|---------|---------|\n| **Prefix** | Item type indicator | `LH` | `LH` |\n| **CoreName** | Business/project name | *(user provides)* | `LANDONREVENUE` |\n| **Suffix** | Medallion layer or purpose | `BRONZE`, `SILVER`, `GOLD` | `BRONZE` |\n| **Separator** | Character between components | `_` | `_` |\n\nExample result: `LH_LANDONREVENUE_BRONZE`\n\nPresent the suggested defaults and **ask the user to confirm or override**\neach component. The user may change any component or use fully custom names\nthat don't follow the pattern at all.\n\n#### 2c — Lakehouse Definitions\n\nFor each lakehouse in this workspace, collect:\n\n- [ ] **Name** — generated from the naming convention, or custom\n- [ ] **Description** — optional text describing the lakehouse's purpose\n- [ ] **Schema-enabled** — yes/no (default: no). See\n `references/schema-enabled.md` for guidance.\n\n#### 2d — More Workspaces?\n\nAfter finishing one workspace, ask:\n\n> \"Do you have another workspace to provision lakehouses in, or are we done?\"\n\nIf yes, loop back to Step 2a. If done, proceed to Step 3.\n\n### Step 3 — Validate Inputs\n\nBefore generating anything, validate **all** lakehouse definitions:\n\n1. 
For each workspace, confirm it exists:\n ```bash\n fab exists \"<WorkspaceName>.Workspace\"\n ```\n — if it does not exist, stop and direct the user to create it first\n\n2. For each lakehouse, check it doesn't already exist:\n ```bash\n fab exists \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\"\n ```\n — if it already exists, warn the user and ask whether to **skip** or\n **rename**\n\n3. Validate lakehouse names against naming constraints (see Gotchas)\n\n### Step 4 — Generate & Execute\n\nBranch by the approach chosen in Step 1. Process workspaces sequentially.\n\n**Maintain an audit log** throughout execution — record every command run and\nits outcome. This log feeds into the definition file in Step 6.\n\n#### Approach A — PySpark Notebook\n\n1. Generate a PySpark notebook using the template in\n `references/notebook-template.py`\n2. The notebook pattern is:\n - Install `ms-fabric-cli` via `%pip install ms-fabric-cli -q`\n - Authenticate using `notebookutils.credentials.getToken('pbi')` for `FAB_TOKEN`\n and `FAB_TOKEN_AZURE`, and `notebookutils.credentials.getToken('storage')` for\n `FAB_TOKEN_ONELAKE` (OneLake requires the storage-scope token)\n - Add pip's scripts directory to `PATH` so `!fab` works\n - Use `!fab mkdir` shell commands for standard lakehouses\n - Use `!fab api` with REST payload for schema-enabled lakehouses\n3. The notebook must include:\n - A configuration cell with all workspace/lakehouse definitions\n - Existence checks before each creation\n - A summary cell at the end\n4. Save to `/home/claude/<workspace>_create_lakehouses.py`\n5. Present to user for review\n6. Optionally upload:\n ```bash\n fab import \"<Workspace>.Workspace/<Name>.Notebook\" -i <path> --format py -f\n ```\n\n#### Approach B — PowerShell Script\n\n1. Generate a PowerShell script with the following structure:\n2. 
The script must:\n - Use `fab mkdir` for standard lakehouses\n - Handle schema-enabled lakehouses via the Fabric REST API\n (`fab api` wrapper — see `references/fabric-api-lakehouse.md`)\n - Include `fab exists` checks before each creation\n - Track created items for potential rollback\n - Include error handling and summary output\n3. Save to `/home/claude/create_lakehouses.ps1`\n4. Present the script and **wait for explicit approval** before running\n\n#### Approach C — Interactive CLI\n\nExecute commands one-by-one per workspace, pausing after each:\n\n1. **For each lakehouse** — check then create:\n ```bash\n fab exists \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\"\n ```\n — if not exists, create. For standard lakehouses:\n ```bash\n fab mkdir \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\"\n ```\n — for schema-enabled lakehouses, use the REST API:\n ```bash\n WS_ID=$(fab get \"<WorkspaceName>.Workspace\" -q \"id\" | tr -d '\"')\n fab api \"workspaces/$WS_ID/lakehouses\" -X post \\\n -i '{\"displayName\":\"<Name>\",\"description\":\"<Desc>\",\"creationPayload\":{\"enableSchemas\":true}}'\n ```\n — wait for user confirmation after each\n\n2. **Verification** after all lakehouses in a workspace:\n ```bash\n fab ls \"<WorkspaceName>.Workspace\" -l\n ```\n\n3. Move to next workspace or proceed to Step 5.\n\n### Step 4a — Failure Handling\n\nIf any lakehouse creation fails during execution:\n\n1. **Stop immediately** — do not proceed to the next lakehouse\n2. **Report** what succeeded and what failed\n3. 
**Ask the user** how to proceed:\n\n| Option | Action |\n|--------|--------|\n| **Retry** | Re-attempt the failed lakehouse creation |\n| **Skip** | Skip the failed item and continue with remaining |\n| **Rollback & Abort** | Delete all lakehouses created *in this run*, then stop |\n| **Abort (keep)** | Stop but leave already-created lakehouses in place |\n\nIf the user chooses **Rollback & Abort**:\n```bash\nfab rm \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\" -f\n```\n— for each lakehouse created in this run (tracked in the audit log).\nConfirm each deletion with the user before executing.\n\n### Step 5 — Verify Creation\n\nRegardless of approach, verify every lakehouse across all workspaces:\n\n```bash\nfab exists \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\"\n```\n\nCollect the lakehouse ID for each:\n```bash\nfab get \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\" -q \"id\"\n```\n\nIf any verification fails, report and ask the user how to proceed (same\noptions as Step 4a).\n\n### Step 6 — Generate Definition File\n\nAfter all lakehouses are verified, generate a Lakehouse Definition markdown\nfile using the template in `references/definition-template.md`.\n\nThe definition file must include:\n\n- **Per workspace:** name, ID\n- **Per lakehouse:** name, ID, description, schema-enabled status, naming\n convention used, creation timestamp\n- **Overall:** approach used, naming convention applied, full audit trail of\n commands/API calls executed, any warnings, skipped items, or rollback actions\n\nSave to `/home/claude/lakehouse_definition.md` and present to user.\n\n## Gotchas\n\n- `fab mkdir` creates a standard lakehouse but does NOT support the\n `enableSchemas` property. 
To create a schema-enabled lakehouse, use\n the Fabric REST API: `POST workspaces/{workspaceId}/lakehouses` with\n `{\"displayName\":\"<n>\",\"creationPayload\":{\"enableSchemas\":true}}`\n- Always use `-f` flag with `fab` commands in scripts to avoid interactive\n prompts that block execution\n- Lakehouse names must be unique within a workspace\n- Workspace names are case-sensitive in `fab` paths\n- Always quote paths containing spaces: `\"My Workspace.Workspace\"`\n- The Fabric REST API requires workspace ID (GUID), not display name —\n extract with `fab get \"<n>.Workspace\" -q \"id\"`\n- In notebooks, `ms-fabric-cli` must be installed via `%pip install` and\n the scripts directory added to `PATH` before `!fab` commands work\n- Token audiences for notebook auth: `'pbi'` for `FAB_TOKEN` and `FAB_TOKEN_AZURE`,\n `'storage'` for `FAB_TOKEN_ONELAKE` (OneLake requires the storage-scope token)\n- `fab auth status` must show a valid token before any operations; tokens\n expire and may need refresh\n- Lakehouse names cannot contain: `/`, `\\`, `#`, `%`, `?` or\n leading/trailing spaces. Max length: 256 characters\n- When rolling back, always confirm each deletion with the user — `fab rm`\n with `-f` is irreversible\n- This skill does NOT create workspaces — if a workspace is missing, direct\n the user to the `generate-fabric-workspace` skill\n- This skill does NOT create shortcuts between lakehouses — that is a\n separate step\n\n## Output Format\n\nSee `references/definition-template.md` for the full template.\n\n## Available References\n\n- **`references/notebook-template.py`** — PySpark notebook template for Approach A\n- **`references/definition-template.md`** — Lakehouse definition output template\n- **`references/schema-enabled.md`** — How schema-enabled lakehouses work\n- **`references/fabric-api-lakehouse.md`** — Fabric REST API reference for\n lakehouse creation\n",
       },
       {
         relativePath: "references/agent.md",
@@ -75,7 +75,7 @@ export const EMBEDDED_SKILLS = [
       },
       {
         relativePath: "assets/agent-template.md",
-
content: "# Orchestration Agent: {PROCESS_NAME}\r\n\r\n## Context\r\n\r\n**Process**: {PROCESS_NAME}\r\n**Requirements**: {REQUIREMENTS_SUMMARY}\r\n\r\n---\r\n\r\n## How to Run This Agent\r\n\r\n**Start with Sub-Agent 0 (Environment Discovery).** This gathers the user's\r\npermissions, tooling, and preferences so that every subsequent sub-agent produces\r\nplans tailored to their actual environment. Do not skip this step.\r\n\r\nThen execute each remaining sub-agent in sequence:\r\n\r\n1. Use only the inputs and instructions provided in this file.\r\n2. Produce the specified output document in the designated subfolder.\r\n3. Present the output to the user; ask clarifying questions if anything is unclear.\r\n4. Refine until the user explicitly confirms the output.\r\n5. Append a timestamped entry to `CHANGE_LOG.md` recording what was produced or decided.\r\n6. Pass the confirmed output as the primary input to the next sub-agent.\r\n **Every sub-agent must also read `00-environment-discovery/environment-profile.md`**\r\n and respect the path decisions recorded there.\r\n\r\n**Do not proceed to the next sub-agent without explicit user confirmation.**\r\n**Do not produce code, scripts, or data artefacts not described in each sub-agent below.**\r\n\r\n### Notebook Documentation Standard\r\n\r\nEvery Fabric notebook produced by any skill **must** include a numbered markdown cell\r\nimmediately above each code cell. Each markdown cell must:\r\n\r\n1. State the cell number and a short title (e.g. `## Cell 1 — Install dependencies`).\r\n2. Explain **what** the code cell does in 1–2 sentences.\r\n3. Explain **how to use it**: variables to change, flags to toggle, prerequisites.\r\n\r\nAll transformation logic and design rationale must be **embedded as markdown cells inside\r\nthe notebook** — not maintained as separate documentation files. The notebook is the single\r\nsource of truth. 
A reader must be able to understand what each cell does, why the logic was\r\nchosen, and how to run it without opening any other file.\r\n\r\n### Output Conventions\r\n\r\n- Each sub-agent writes to its own **numbered subfolder** (`01-implementation-plan/`,\r\n `02-business-process/`, etc.). Execution steps continue the numbering (e.g.,\r\n `05-execution/`, `06-gold-layer/`).\r\n- Within each subfolder, only present **final deliverables** to the user: notebooks,\r\n SQL scripts, and documentation they run or deploy. Generator scripts (e.g.\r\n `generate_notebook.py`) are internal tools the skill runs to produce deliverables —\r\n **never present generator scripts as outputs and never generate notebook or script\r\n content directly**. Run the generator script via Bash; present what it produces.\r\n- All transformation logic and design rationale must be **embedded as markdown cells\r\n inside notebooks** — not maintained as separate documentation files. The notebook\r\n is the single source of truth.\r\n\r\n---\r\n\r\n## Sub-Agent 0: Environment Discovery\r\n\r\n**Input**: Requirements above\r\n**Output**: `00-environment-discovery/environment-profile.md`\r\n\r\nThis sub-agent runs **before anything is planned or built**. Its sole purpose is to\r\nunderstand the operator's environment, permissions, and preferences so that every\r\nsubsequent sub-agent produces plans tailored to what is actually possible and practical.\r\n\r\n**Invoke the `fabric-process-discovery` skill to run this step.**\r\n\r\nThe skill defines the full adaptive questioning tree — which questions to ask, in what\r\norder, and how to branch based on answers. Key principles:\r\n\r\n- **Read the requirements first.** Only ask about domains the process actually needs.\r\n A CSV ingestion job does not need workspace creation questions. A full pipeline\r\n needs all domains.\r\n- **Present all questions in a single turn**, grouped by domain. Never ask one question\r\n at a time. 
Target **5–7 questions** for most processes; simpler ones may need 3–4.\r\n- **Branch adaptively.** The skill defines conditional follow-ups — apply them after\r\n the first-turn answers before presenting the confirmation summary.\r\n- **Confirm before proceeding.** After processing answers, present the path table and\r\n ask: *\"Is this accurate, or anything to correct before I proceed to planning?\"*\r\n Wait for explicit confirmation.\r\n\r\nThe skill covers these domains (use only those relevant to the requirements):\r\n\r\n| Domain | When to include |\r\n|--------|----------------|\r\n| **A — Workspace access** | Any step creates or uses workspaces |\r\n| **A — Domain assignment** | Requirements mention domain governance (only if creating workspaces) |\r\n| **A — Access control / groups** | Process assigns roles to users or groups |\r\n| **B — Deployment approach** | Any step generates notebooks, scripts, or CLI commands |\r\n| **C — Source data location** | Process ingests files (CSV, PDF, etc.) |\r\n| **D — Capacity / SKU** | Process involves compute-intensive operations |\r\n\r\n**Critical framing rules from the skill — do not deviate:**\r\n\r\n1. **Deployment approach is NOT a CLI vs no-CLI question.** All three options (PySpark\r\n notebook, PowerShell script, CLI commands) use the Fabric CLI internally. The\r\n question is only about *how* the operator runs it. Present it as:\r\n - **A) PySpark notebook** — imported into Fabric, run cell-by-cell in the Fabric UI\r\n - **B) PowerShell script** — generated `.ps1` reviewed and run locally\r\n - **C) CLI commands** — individual `fab` commands run interactively in the terminal\r\n\r\n2. **Workspace creation must branch correctly.** If the operator cannot create\r\n workspaces, immediately ask for the exact names of existing hub and spoke\r\n workspaces — do not ask about domain assignment or access control (they only\r\n apply when creating).\r\n\r\n3. 
**Entra group Object IDs are a known technical constraint.** When groups are\r\n involved, always surface this: *\"The Fabric API requires Object IDs — display\r\n names are not accepted programmatically.\"* Then offer the resolution options\r\n (have IDs / Azure CLI / PowerShell Graph / UI manual).\r\n\r\n4. **Never leave the user blocked.** If a step requires permissions they don't have,\r\n offer: (a) skip and mark as manual, (b) produce a spec for their admin, or\r\n (c) substitute a UI-based workaround.\r\n\r\nOnce the environment profile is confirmed, save it as\r\n`00-environment-discovery/environment-profile.md` and append to `CHANGE_LOG.md`:\r\n`[{DATETIME}] Sub-Agent 0 complete — environment-profile.md produced. [N] path decisions recorded. Manual gates: [list or none].`\r\n\r\n**Confirm the environment profile with the user before proceeding to Sub-Agent 1.**\r\n\r\n---\r\n\r\n## Sub-Agent 1: Implementation Plan\r\n\r\n**Input**: Requirements above\r\n**Output**: `01-implementation-plan/implementation-plan.md`\r\n\r\nProduce a phased implementation plan using the structure below. 
Keep ≤50 lines.\r\nUpdate the RAID log whenever a later sub-agent raises a new risk or dependency.\r\n\r\n```markdown\r\n---\r\ngoal: {PROCESS_NAME} — Implementation Plan\r\nstatus: Planned\r\ndate_created: {DATE}\r\n---\r\n\r\n# Implementation Plan: {PROCESS_NAME}\r\n\r\n## Requirements & Constraints\r\n- REQ-001: [Requirement drawn from the context above]\r\n- CON-001: [Key constraint]\r\n\r\n## Phases\r\n\r\n### Phase 1: [Phase name]\r\n| Task | Description | Status |\r\n|----------|-------------|---------|\r\n| TASK-001 | [Task] | Planned |\r\n| TASK-002 | [Task] | Planned |\r\n\r\n### Phase 2: [Phase name]\r\n| Task | Description | Status |\r\n|----------|-------------|---------|\r\n| TASK-003 | [Task] | Planned |\r\n\r\n## RAID Log\r\n| Type | ID | Description | Mitigation / Action | Status |\r\n|------------|-------|--------------|---------------------|--------|\r\n| Risk | R-001 | [Risk] | [Mitigation] | Open |\r\n| Assumption | A-001 | [Assumption] | [Validation] | Open |\r\n| Issue | I-001 | [Issue] | [Resolution] | Open |\r\n| Dependency | D-001 | [Dependency] | [Owner] | Open |\r\n```\r\n\r\nRules:\r\n- Use REQ-, CON-, TASK-, R-, A-, I-, D- prefixes consistently.\r\n- Task status values: Planned / In Progress / Done.\r\n- Do not include implementation code or scripts.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 1 complete — implementation-plan.md produced.`\r\n- **Confirm with user before proceeding to Sub-Agent 2.**\r\n\r\n---\r\n\r\n## Sub-Agent 2: Business Process Mapping\r\n\r\n**Input**: Confirmed output of Sub-Agent 1 + Requirements above\r\n**Output**: `02-business-process/sop.md`\r\n\r\nThis sub-agent maps requirements to process skills, creates any that are missing,\r\nand produces a Standard Operating Procedure. Work through the three steps below.\r\n\r\n### Step 1 — Decompose requirements into process steps\r\n\r\nRead the requirements and break them into discrete, ordered steps. 
For each step,\r\nwrite a one-line description of what it needs to do and what its output is.\r\n\r\n### Step 2 — Map each step to a process skill\r\n\r\nFor each step, search the skills directory for a matching process skill\r\n(a skill whose description covers the same action and output).\r\n\r\nFor every step, one of three outcomes applies:\r\n\r\n**A — Skill found**: Read the skill's `SKILL.md`. Note its inputs, outputs, and\r\nany parameters it needs from earlier steps. Mark the step as covered.\r\n\r\n**B — Skill not found**: Determine the deterministic logic needed to automate\r\nthis step (the specific inputs, the repeatable actions, and the expected output).\r\nInvoke `create-fabric-process-skill` to create a new skill definition for this step.\r\nOnce created, read its `SKILL.md` and mark the step as covered.\r\nAppend to `CHANGE_LOG.md`:\r\n`[{DATETIME}] New skill created: [skill-name] — [one-line description of what it does].`\r\nAdd the new skill as a dependency in the RAID log from Sub-Agent 1.\r\n\r\n**C — Step must be manual**: If the step cannot be automated (e.g. requires human\r\njudgement or a physical action), document it as a manual step with exact operator\r\ninstructions and mark it accordingly.\r\n\r\nRepeat until every step is either covered by a skill or accepted as manual.\r\nAsk the user to confirm the skill list before proceeding to Step 3.\r\n\r\n### Step 3 — Produce the SOP\r\n\r\n```markdown\r\n# SOP: {PROCESS_NAME}\r\n\r\n## Step Sequence\r\n| Step | Skill / Action | Input Parameters | Output | Manual? 
|\r\n|------|---------------------|--------------------|-------------------|---------|\r\n| 1 | [skill-name] | param=value | [output artefact] | No |\r\n| 2 | [skill-name] | output from step 1 | [output artefact] | No |\r\n| 3 | [Manual: action] | — | — | Yes |\r\n\r\n## Shared Parameters\r\n| Parameter | Source | Passed to steps |\r\n|-----------|------------|-----------------|\r\n| [param] | User input | 1, 3 |\r\n\r\n## Newly Created Skills\r\n| Skill name | Step | Description |\r\n|--------------|------|------------------------------------|\r\n| [skill-name] | 2 | [What it does — one line] |\r\n\r\n## Manual Steps\r\n- MANUAL-001: [Step] — [Reason] — [Exact operator instructions]\r\n```\r\n\r\nRules:\r\n- If requirements are unclear for any step, ask a targeted question and update\r\n requirements before continuing.\r\n- New skills created in this sub-agent are a permanent addition to the skills\r\n library and will be available for future agents.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 2 complete — sop.md produced. [N] new skills created.`\r\n- **Confirm with user before proceeding to Sub-Agent 3.**\r\n\r\n---\r\n\r\n## Sub-Agent 3: Solution Architecture\r\n\r\n**Input**: Confirmed output of Sub-Agent 2\r\n**Output**: `03-solution-architecture/specification.md`\r\n\r\nProduce a plain-language specification. Keep total length ≤50 lines.\r\nWrite for a non-technical reader — no code, no implementation detail.\r\n\r\n```markdown\r\n---\r\ntitle: {PROCESS_NAME} — Solution Specification\r\nstatus: Draft\r\ndate_created: {DATE}\r\n---\r\n\r\n# Specification: {PROCESS_NAME}\r\n\r\n## Purpose\r\n[One paragraph: what this solution does and what problem it solves.]\r\n\r\n## Scope\r\n[What is included and what is explicitly excluded.]\r\n\r\n## How It Works\r\n| Step | What happens | Automated? 
| Notes |\r\n|------|-------------------------------|------------|-----------------|\r\n| 1 | [Plain-language description] | Yes | |\r\n| 2 | [Plain-language description] | No | See MANUAL-001 |\r\n\r\n## Manual Steps\r\n- MANUAL-001: [Step] — [Reason] — [Exact operator instructions]\r\n\r\n## Acceptance Criteria\r\n- AC-001: Given [context], when [action], then [expected outcome].\r\n\r\n## Dependencies\r\n- DEP-001: [External system, file, or service] — [Purpose]\r\n```\r\n\r\nRules:\r\n- Write for a non-technical reader. No jargon without explanation.\r\n- Every manual step must include exact operator instructions.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 3 complete — specification.md produced.`\r\n- **Confirm with user before proceeding to Sub-Agent 4.**\r\n\r\n---\r\n\r\n## Sub-Agent 4: Security, Testing and Governance\r\n\r\n**Input**: Confirmed output of Sub-Agent 3\r\n**Output**: `04-governance/governance-plan.md`\r\n\r\nProduce a governance and deployment plan. Keep total length ≤45 lines.\r\n\r\n```markdown\r\n---\r\ntitle: {PROCESS_NAME} — Governance Plan\r\ndate_created: {DATE}\r\n---\r\n\r\n# Governance Plan: {PROCESS_NAME}\r\n\r\n## Agent Boundaries\r\n| Boundary | Rule |\r\n|-------------------------|--------------------------------------------|\r\n| Allowed actions | [Permitted operations] |\r\n| Blocked actions | [Prohibited operations] |\r\n| Requires human approval | [Steps needing explicit sign-off] |\r\n\r\n## Testing Checklist\r\n- [ ] Validate each sub-agent output before passing it to the next\r\n- [ ] Test all manual steps with a real operator before production use\r\n- [ ] Run against a minimal test dataset before using real data\r\n- [ ] Review CHANGE_LOG.md to confirm all new skills are correct\r\n- [ ] Verify the output folder structure after scaffolding\r\n\r\n## Microsoft Responsible AI Alignment\r\n| Principle | How Applied |\r\n|----------------|--------------------------------------------------------|\r\n| Fairness | 
[How bias is avoided in outputs and decisions] |\r\n| Reliability | [Validation steps, error handling, new skill review] |\r\n| Privacy | [Data handling — no PII retained in output files] |\r\n| Inclusiveness | [Plain language; no domain assumptions made] |\r\n| Transparency | [User validates every sub-agent output; CHANGE_LOG] |\r\n| Accountability | [Human sign-off required before production execution] |\r\n\r\n## Deployment Guidance\r\n- Review `CHANGE_LOG.md` to verify all newly created skills before first run.\r\n- Store `agent.md`, all outputs, and new skills in version control.\r\n- Review the RAID log from Sub-Agent 1 before each new run.\r\n- Human sign-off required before running against production systems.\r\n```\r\n\r\nRules:\r\n- Every RAI principle row must be completed — state explicitly if not applicable and why.\r\n- Human approval must be required for any step that modifies production systems.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 4 complete — governance-plan.md produced. Agent definition finalised.`\r\n- **Confirm with user before finalising.**\r\n",
+
content: "# Orchestration Agent: {PROCESS_NAME}\r\n\r\n## Context\r\n\r\n**Process**: {PROCESS_NAME}\r\n**Requirements**: {REQUIREMENTS_SUMMARY}\r\n\r\n---\r\n\r\n## How to Run This Agent\r\n\r\n**Start with Sub-Agent 0 (Environment Discovery).** This gathers the user's\r\npermissions, tooling, and preferences so that every subsequent sub-agent produces\r\nplans tailored to their actual environment. Do not skip this step.\r\n\r\nThen execute each remaining sub-agent in sequence:\r\n\r\n1. Use only the inputs and instructions provided in this file.\r\n2. Produce the specified output document in the designated subfolder.\r\n3. Present the output to the user; ask clarifying questions if anything is unclear.\r\n4. Refine until the user explicitly confirms the output.\r\n5. Append a timestamped entry to `CHANGE_LOG.md` recording what was produced or decided.\r\n6. Pass the confirmed output as the primary input to the next sub-agent.\r\n **Every sub-agent must also read `00-environment-discovery/environment-profile.md`**\r\n and respect the path decisions recorded there.\r\n\r\n**Do not proceed to the next sub-agent without explicit user confirmation.**\r\n**Do not produce code, scripts, or data artefacts not described in each sub-agent below.**\r\n\r\n### Parameter Resolution Protocol\r\n\r\nWhen invoking any skill, **always resolve parameters from existing documents before\r\nasking the user**. Check in this order:\r\n\r\n1. `00-environment-discovery/environment-profile.md` — provides: deployment approach,\r\n capacity name, workspace names, access control method, Object ID resolution approach,\r\n environment (dev/prod), credential management approach, available tooling\r\n2. The confirmed SOP (`02-business-process/sop.md`) — provides: lakehouse names,\r\n schema names, shared parameters, step inputs and outputs\r\n3. 
The implementation plan (`01-implementation-plan/implementation-plan.md`) — provides:\r\n naming conventions, task-level decisions\r\n\r\n**Only ask the user for parameters not found in any of these documents.** Summarise\r\nwhat was resolved automatically before asking for what remains. Never ask for a\r\nparameter that was explicitly captured during environment discovery or planning.\r\n\r\n### Notebook Documentation Standard\r\n\r\nEvery Fabric notebook produced by any skill **must** include a numbered markdown cell\r\nimmediately above each code cell. Each markdown cell must:\r\n\r\n1. State the cell number and a short title (e.g. `## Cell 1 — Install dependencies`).\r\n2. Explain **what** the code cell does in 1–2 sentences.\r\n3. Explain **how to use it**: variables to change, flags to toggle, prerequisites.\r\n\r\nAll transformation logic and design rationale must be **embedded as markdown cells inside\r\nthe notebook** — not maintained as separate documentation files. The notebook is the single\r\nsource of truth. A reader must be able to understand what each cell does, why the logic was\r\nchosen, and how to run it without opening any other file.\r\n\r\n### Output Conventions\r\n\r\n- Each sub-agent writes to its own **numbered subfolder** (`01-implementation-plan/`,\r\n `02-business-process/`, etc.). Execution steps continue the numbering (e.g.,\r\n `05-execution/`, `06-gold-layer/`).\r\n- Within each subfolder, only present **final deliverables** to the user: notebooks,\r\n SQL scripts, and documentation they run or deploy. Generator scripts (e.g.\r\n `generate_notebook.py`) are internal tools the skill runs to produce deliverables —\r\n **never present generator scripts as outputs and never generate notebook or script\r\n content directly**. 
Run the generator script via Bash; present what it produces.\r\n- All transformation logic and design rationale must be **embedded as markdown cells\r\n inside notebooks** — not maintained as separate documentation files. The notebook\r\n is the single source of truth.\r\n\r\n---\r\n\r\n## Sub-Agent 0: Environment Discovery\r\n\r\n**Input**: Requirements above\r\n**Output**: `00-environment-discovery/environment-profile.md`\r\n\r\nThis sub-agent runs **before anything is planned or built**. Its sole purpose is to\r\nunderstand the operator's environment, permissions, and preferences so that every\r\nsubsequent sub-agent produces plans tailored to what is actually possible and practical.\r\n\r\n**Invoke the `fabric-process-discovery` skill to run this step.**\r\n\r\nThe skill defines the full adaptive questioning tree — which questions to ask, in what\r\norder, and how to branch based on answers. Key principles:\r\n\r\n- **Read the requirements first.** Only ask about domains the process actually needs.\r\n A CSV ingestion job does not need workspace creation questions. A full pipeline\r\n needs all domains.\r\n- **Present all questions in a single turn**, grouped by domain. Never ask one question\r\n at a time. 
Target **5–7 questions** for most processes; simpler ones may need 3–4.\r\n- **Branch adaptively.** The skill defines conditional follow-ups — apply them after\r\n the first-turn answers before presenting the confirmation summary.\r\n- **Confirm before proceeding.** After processing answers, present the path table and\r\n ask: *\"Is this accurate, or anything to correct before I proceed to planning?\"*\r\n Wait for explicit confirmation.\r\n\r\nThe skill covers these domains (use only those relevant to the requirements):\r\n\r\n| Domain | When to include |\r\n|--------|----------------|\r\n| **A — Workspace access** | Any step creates or uses workspaces |\r\n| **A — Domain assignment** | Requirements mention domain governance (only if creating workspaces) |\r\n| **A — Access control / groups** | Process assigns roles to users or groups |\r\n| **B — Deployment approach** | Any step generates notebooks, scripts, or CLI commands |\r\n| **C — Source data location** | Process ingests files (CSV, PDF, etc.) |\r\n| **D — Capacity / SKU** | Process involves compute-intensive operations |\r\n\r\n**Critical framing rules from the skill — do not deviate:**\r\n\r\n1. **Deployment approach is NOT a CLI vs no-CLI question.** All three options (PySpark\r\n notebook, PowerShell script, CLI commands) use the Fabric CLI internally. The\r\n question is only about *how* the operator runs it. Present it as:\r\n - **A) PySpark notebook** — imported into Fabric, run cell-by-cell in the Fabric UI\r\n - **B) PowerShell script** — generated `.ps1` reviewed and run locally\r\n - **C) CLI commands** — individual `fab` commands run interactively in the terminal\r\n\r\n2. **Workspace creation must branch correctly.** If the operator cannot create\r\n workspaces, immediately ask for the exact names of existing hub and spoke\r\n workspaces — do not ask about domain assignment or access control (they only\r\n apply when creating).\r\n\r\n3. 
**Entra group Object IDs are a known technical constraint.** When groups are\r\n involved, always surface this: *\"The Fabric API requires Object IDs — display\r\n names are not accepted programmatically.\"* Then offer the resolution options\r\n (have IDs / Azure CLI / PowerShell Graph / UI manual).\r\n\r\n4. **Never leave the user blocked.** If a step requires permissions they don't have,\r\n offer: (a) skip and mark as manual, (b) produce a spec for their admin, or\r\n (c) substitute a UI-based workaround.\r\n\r\nOnce the environment profile is confirmed, save it as\r\n`00-environment-discovery/environment-profile.md` and append to `CHANGE_LOG.md`:\r\n`[{DATETIME}] Sub-Agent 0 complete — environment-profile.md produced. [N] path decisions recorded. Manual gates: [list or none].`\r\n\r\n**Confirm the environment profile with the user before proceeding to Sub-Agent 1.**\r\n\r\n---\r\n\r\n## Sub-Agent 1: Implementation Plan\r\n\r\n**Input**: Requirements above\r\n**Output**: `01-implementation-plan/implementation-plan.md`\r\n\r\nProduce a phased implementation plan using the structure below. 
Keep ≤50 lines.\r\nUpdate the RAID log whenever a later sub-agent raises a new risk or dependency.\r\n\r\n```markdown\r\n---\r\ngoal: {PROCESS_NAME} — Implementation Plan\r\nstatus: Planned\r\ndate_created: {DATE}\r\n---\r\n\r\n# Implementation Plan: {PROCESS_NAME}\r\n\r\n## Requirements & Constraints\r\n- REQ-001: [Requirement drawn from the context above]\r\n- CON-001: [Key constraint]\r\n\r\n## Phases\r\n\r\n### Phase 1: [Phase name]\r\n| Task | Description | Status |\r\n|----------|-------------|---------|\r\n| TASK-001 | [Task] | Planned |\r\n| TASK-002 | [Task] | Planned |\r\n\r\n### Phase 2: [Phase name]\r\n| Task | Description | Status |\r\n|----------|-------------|---------|\r\n| TASK-003 | [Task] | Planned |\r\n\r\n## RAID Log\r\n| Type | ID | Description | Mitigation / Action | Status |\r\n|------------|-------|--------------|---------------------|--------|\r\n| Risk | R-001 | [Risk] | [Mitigation] | Open |\r\n| Assumption | A-001 | [Assumption] | [Validation] | Open |\r\n| Issue | I-001 | [Issue] | [Resolution] | Open |\r\n| Dependency | D-001 | [Dependency] | [Owner] | Open |\r\n```\r\n\r\nRules:\r\n- Use REQ-, CON-, TASK-, R-, A-, I-, D- prefixes consistently.\r\n- Task status values: Planned / In Progress / Done.\r\n- Do not include implementation code or scripts.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 1 complete — implementation-plan.md produced.`\r\n- **Confirm with user before proceeding to Sub-Agent 2.**\r\n\r\n---\r\n\r\n## Sub-Agent 2: Business Process Mapping\r\n\r\n**Input**: Confirmed output of Sub-Agent 1 + Requirements above\r\n**Output**: `02-business-process/sop.md`\r\n\r\nThis sub-agent maps requirements to process skills, creates any that are missing,\r\nand produces a Standard Operating Procedure. Work through the three steps below.\r\n\r\n### Step 1 — Decompose requirements into process steps\r\n\r\nRead the requirements and break them into discrete, ordered steps. 
For each step,\r\nwrite a one-line description of what it needs to do and what its output is.\r\n\r\n### Step 2 — Map each step to a process skill\r\n\r\nFor each step, search the skills directory for a matching process skill\r\n(a skill whose description covers the same action and output).\r\n\r\nFor every step, one of three outcomes applies:\r\n\r\n**A — Skill found**: Read the skill's `SKILL.md`. Note its inputs, outputs, and\r\nany parameters it needs from earlier steps. Mark the step as covered.\r\n\r\n**B — Skill not found**: Determine the deterministic logic needed to automate\r\nthis step (the specific inputs, the repeatable actions, and the expected output).\r\nInvoke `create-fabric-process-skill` to create a new skill definition for this step.\r\nOnce created, read its `SKILL.md` and mark the step as covered.\r\nAppend to `CHANGE_LOG.md`:\r\n`[{DATETIME}] New skill created: [skill-name] — [one-line description of what it does].`\r\nAdd the new skill as a dependency in the RAID log from Sub-Agent 1.\r\n\r\n**C — Step must be manual**: If the step cannot be automated (e.g. requires human\r\njudgement or a physical action), document it as a manual step with exact operator\r\ninstructions and mark it accordingly.\r\n\r\nRepeat until every step is either covered by a skill or accepted as manual.\r\nAsk the user to confirm the skill list before proceeding to Step 3.\r\n\r\n### Step 3 — Produce the SOP\r\n\r\n```markdown\r\n# SOP: {PROCESS_NAME}\r\n\r\n## Step Sequence\r\n| Step | Skill / Action | Input Parameters | Output | Manual? 
|\r\n|------|---------------------|--------------------|-------------------|---------|\r\n| 1 | [skill-name] | param=value | [output artefact] | No |\r\n| 2 | [skill-name] | output from step 1 | [output artefact] | No |\r\n| 3 | [Manual: action] | — | — | Yes |\r\n\r\n## Shared Parameters\r\n| Parameter | Source | Passed to steps |\r\n|-----------|------------|-----------------|\r\n| [param] | User input | 1, 3 |\r\n\r\n## Newly Created Skills\r\n| Skill name | Step | Description |\r\n|--------------|------|------------------------------------|\r\n| [skill-name] | 2 | [What it does — one line] |\r\n\r\n## Manual Steps\r\n- MANUAL-001: [Step] — [Reason] — [Exact operator instructions]\r\n```\r\n\r\nRules:\r\n- If requirements are unclear for any step, ask a targeted question and update\r\n requirements before continuing.\r\n- New skills created in this sub-agent are a permanent addition to the skills\r\n library and will be available for future agents.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 2 complete — sop.md produced. [N] new skills created.`\r\n- **Confirm with user before proceeding to Sub-Agent 3.**\r\n\r\n---\r\n\r\n## Sub-Agent 3: Solution Architecture\r\n\r\n**Input**: Confirmed output of Sub-Agent 2\r\n**Output**: `03-solution-architecture/specification.md`\r\n\r\nProduce a plain-language specification. Keep total length ≤50 lines.\r\nWrite for a non-technical reader — no code, no implementation detail.\r\n\r\n```markdown\r\n---\r\ntitle: {PROCESS_NAME} — Solution Specification\r\nstatus: Draft\r\ndate_created: {DATE}\r\n---\r\n\r\n# Specification: {PROCESS_NAME}\r\n\r\n## Purpose\r\n[One paragraph: what this solution does and what problem it solves.]\r\n\r\n## Scope\r\n[What is included and what is explicitly excluded.]\r\n\r\n## How It Works\r\n| Step | What happens | Automated? 
| Notes |\r\n|------|-------------------------------|------------|-----------------|\r\n| 1 | [Plain-language description] | Yes | |\r\n| 2 | [Plain-language description] | No | See MANUAL-001 |\r\n\r\n## Manual Steps\r\n- MANUAL-001: [Step] — [Reason] — [Exact operator instructions]\r\n\r\n## Acceptance Criteria\r\n- AC-001: Given [context], when [action], then [expected outcome].\r\n\r\n## Dependencies\r\n- DEP-001: [External system, file, or service] — [Purpose]\r\n```\r\n\r\nRules:\r\n- Write for a non-technical reader. No jargon without explanation.\r\n- Every manual step must include exact operator instructions.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 3 complete — specification.md produced.`\r\n- **Confirm with user before proceeding to Sub-Agent 4.**\r\n\r\n---\r\n\r\n## Sub-Agent 4: Security, Testing and Governance\r\n\r\n**Input**: Confirmed output of Sub-Agent 3\r\n**Output**: `04-governance/governance-plan.md`\r\n\r\nProduce a governance and deployment plan. Keep total length ≤45 lines.\r\n\r\n```markdown\r\n---\r\ntitle: {PROCESS_NAME} — Governance Plan\r\ndate_created: {DATE}\r\n---\r\n\r\n# Governance Plan: {PROCESS_NAME}\r\n\r\n## Agent Boundaries\r\n| Boundary | Rule |\r\n|-------------------------|--------------------------------------------|\r\n| Allowed actions | [Permitted operations] |\r\n| Blocked actions | [Prohibited operations] |\r\n| Requires human approval | [Steps needing explicit sign-off] |\r\n\r\n## Testing Checklist\r\n- [ ] Validate each sub-agent output before passing it to the next\r\n- [ ] Test all manual steps with a real operator before production use\r\n- [ ] Run against a minimal test dataset before using real data\r\n- [ ] Review CHANGE_LOG.md to confirm all new skills are correct\r\n- [ ] Verify the output folder structure after scaffolding\r\n\r\n## Microsoft Responsible AI Alignment\r\n| Principle | How Applied |\r\n|----------------|--------------------------------------------------------|\r\n| Fairness | 
[How bias is avoided in outputs and decisions] |\r\n| Reliability | [Validation steps, error handling, new skill review] |\r\n| Privacy | [Data handling — no PII retained in output files] |\r\n| Inclusiveness | [Plain language; no domain assumptions made] |\r\n| Transparency | [User validates every sub-agent output; CHANGE_LOG] |\r\n| Accountability | [Human sign-off required before production execution] |\r\n\r\n## Deployment Guidance\r\n- Review `CHANGE_LOG.md` to verify all newly created skills before first run.\r\n- Store `agent.md`, all outputs, and new skills in version control.\r\n- Review the RAID log from Sub-Agent 1 before each new run.\r\n- Human sign-off required before running against production systems.\r\n```\r\n\r\nRules:\r\n- Every RAI principle row must be completed — state explicitly if not applicable and why.\r\n- Human approval must be required for any step that modifies production systems.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 4 complete — governance-plan.md produced. Agent definition finalised.`\r\n- **Confirm with user before finalising.**\r\n",
},
{
relativePath: "references/section-descriptions.md",
@@ -93,7 +93,7 @@ export const EMBEDDED_SKILLS = [
files: [
{
relativePath: "SKILL.md",
-
content: "---\r\nname: create-lakehouse-schemas-and-shortcuts\r\ndescription: >\r\n Use this skill to create schemas in schema-enabled Microsoft Fabric lakehouses\r\n and create cross-lakehouse table shortcuts using the Fabric CLI. Triggers on:\r\n \"create lakehouse shortcuts\", \"create schema in lakehouse\", \"shortcut tables\r\n between lakehouses\", \"cross-lakehouse shortcuts\", \"surface bronze tables in\r\n silver\". Does NOT trigger for: creating lakehouses (use create-fabric-lakehouse),\r\n uploading files, creating delta tables from CSV/PDF, or generating MLV scripts.\r\nlicense: MIT\r\ncompatibility: Python 3.8+ for scripts/. Fabric CLI (fab) installed and authenticated.\r\n---\r\n\r\n# Create Lakehouse Schemas and Shortcuts\r\n\r\nCreates schemas in schema-enabled Fabric lakehouses and creates cross-lakehouse\r\ntable shortcuts using `fab ln --type oneLake`. Schemas and shortcuts are\r\ncreated in the same run. Source and target lakehouses must already exist.\r\n\r\n> **GOVERNANCE**: This skill generates commands — it does not execute them.\r\n> All `fab` commands are presented for the operator to review and run.\r\n\r\n## Inputs\r\n\r\n| Parameter | Description | Example |\r\n|-----------|-------------|---------|\r\n| `--source-workspace` | Source Fabric workspace name (exact, case-sensitive) | `\"LANDON_TEST_20260402_HUB\"` |\r\n| `--source-lakehouse` | Source lakehouse name (exact, case-sensitive) | `\"LANDON_FINANCE_BRONZE\"` |\r\n| `--source-schema` | Schema in source lakehouse. 
Use `dbo` for non-schema-enabled | `\"dbo\"` |\r\n| `--target-workspace` | Target Fabric workspace name (exact, case-sensitive) | `\"LANDON_TEST_20260402_FINANCE_SPOKE\"` |\r\n| `--target-lakehouse` | Target lakehouse name (exact, case-sensitive) | `\"LANDON_FINANCE_SILVER\"` |\r\n| `--target-schema` | Schema to create in target and place shortcuts into | `\"bronze\"` |\r\n| `--tables` | Comma-separated table names, or output of `fab ls` | `\"bookings,events\"` |\r\n\r\n## Workflow\r\n\r\n- [ ] **Step 1 — Collect parameters**: Ask the user for all inputs listed above.\r\n If source and target are in the same workspace, both workspace parameters will\r\n be the same value.\r\n\r\n- [ ] **Step 2 — Discover tables**: Ask the user to either:\r\n - Provide an explicit comma-separated list of table names, **or**\r\n - Run this command and share the output:\r\n ```\r\n fab ls \"<SOURCE_WORKSPACE>.Workspace/<SOURCE_LAKEHOUSE>.Lakehouse/Tables/\" -l\r\n ```\r\n Parse table names from the output or list. 
Present them back and confirm.\r\n\r\n- [ ] **Step 3 — Generate commands**: Run the script:\r\n ```\r\n python scripts/generate_schema_shortcut_commands.py \\\r\n --source-workspace \"<SOURCE_WORKSPACE>\" \\\r\n --source-lakehouse \"<SOURCE_LAKEHOUSE>\" \\\r\n --source-schema \"<SOURCE_SCHEMA>\" \\\r\n --target-workspace \"<TARGET_WORKSPACE>\" \\\r\n --target-lakehouse \"<TARGET_LAKEHOUSE>\" \\\r\n --target-schema \"<TARGET_SCHEMA>\" \\\r\n --tables \"<TABLE1>,<TABLE2>,...\"\r\n ```\r\n The script outputs JSON to stdout with sections: `schema_sql`,\r\n `schema_shortcut_test`, `shortcut_commands`, and `validation_command`.\r\n\r\n- [ ] **Step 4 — (Optional) Test schema-level shortcut**: Before creating\r\n individual table shortcuts, optionally test whether a single schema-level\r\n shortcut captures all tables (see \"Schema-Level Shortcut Hypothesis\" below).\r\n Use the `schema_shortcut_test` command from the script output.\r\n If the test succeeds and all tables appear, skip Step 5.\r\n\r\n- [ ] **Step 5 — Choose deployment approach**: Present these options:\r\n\r\n **Option A — Notebook Cells (Recommended for pipeline integration)**\r\n Append two cells to an existing notebook attached to the target lakehouse:\r\n 1. **Spark SQL cell**: Contains `CREATE SCHEMA IF NOT EXISTS <schema>;`\r\n from the `schema_sql` output.\r\n 2. **Code cell**: Contains each command from `shortcut_commands` prefixed\r\n with `!` (one per line).\r\n If no existing notebook is available, create a new one and note that it\r\n will need its own Spark session and `fab` authentication.\r\n\r\n **Option B — PowerShell Script**\r\n Write the `fab ln` commands from `shortcut_commands` to a `.ps1` file.\r\n Add a comment at the top reminding the user to create the schema first\r\n via a Spark SQL notebook cell (`fab` CLI cannot create schemas).\r\n\r\n **Option C — Interactive Terminal**\r\n Present each command one at a time for the operator to run. 
Start with the\r\n schema creation SQL (must run in a notebook), then present `fab ln` commands.\r\n\r\n- [ ] **Step 6 — Validate**: Ask the user to run:\r\n ```\r\n fab ls \"<TARGET_WORKSPACE>.Workspace/<TARGET_LAKEHOUSE>.Lakehouse/Tables/\" -l\r\n ```\r\n Confirm the expected shortcuts appear under the target schema.\r\n\r\n## Schema-Level Shortcut Hypothesis\r\n\r\nWhen creating shortcuts through the Fabric **UI**, connecting to a schema\r\nautomatically surfaces all tables in that schema as shortcuts. It is unknown\r\nwhether this works programmatically via `fab ln`. To test, use the\r\n`schema_shortcut_test` command from the script output, e.g.:\r\n\r\n```\r\nfab ln \"<TARGET_WS>.Workspace/<TARGET_LH>.Lakehouse/Tables/<TARGET_SCHEMA>/Shortcut\" \\\r\n --type oneLake \\\r\n --target ../../<SOURCE_WS_URL>.Workspace/<SOURCE_LH>.Lakehouse/Tables -f\r\n```\r\n\r\nIf this succeeds and all source tables appear in the target schema, use this\r\none-command approach instead of individual table shortcuts. Document the result\r\nfor future runs.\r\n\r\nIf the source is non-schema-enabled, test with `Tables` as the target path\r\n(no schema segment). If schema-enabled, use `Tables/<source_schema>`.\r\n\r\n## fab ln Syntax Reference\r\n\r\n### Shortcut naming convention (FIXED)\r\n\r\nShortcuts in schema-enabled lakehouses use **slash notation** for the schema path:\r\n```\r\nTables/<Schema>/<table_name>.Shortcut\r\n```\r\nExample: `Tables/Bronze/revenue_raw.Shortcut`\r\n\r\n**Periods (`.`) are FORBIDDEN in shortcut names.** Dot notation like\r\n`Tables/bronze.revenue_raw.Shortcut` will fail with:\r\n`[InvalidPath] Invalid shortcut name. 
The name should not include any of the following characters: [\"\\:|<>*?.%+]`\r\n\r\n### Cross-lakehouse: non-schema source → schema-enabled target\r\n\r\n```\r\nfab ln \"<TARGET_WS>.Workspace/<TARGET_LH>.Lakehouse/Tables/<TARGET_SCHEMA>/<TABLE>.Shortcut\" \\\r\n --type oneLake \\\r\n --target ../../<SOURCE_WS_URL>.Workspace/<SOURCE_LH>.Lakehouse/Tables/<TABLE> -f\r\n```\r\n\r\n### Cross-lakehouse: schema-enabled source → schema-enabled target\r\n\r\n```\r\nfab ln \"<TARGET_WS>.Workspace/<TARGET_LH>.Lakehouse/Tables/<TARGET_SCHEMA>/<TABLE>.Shortcut\" \\\r\n --type oneLake \\\r\n --target ../../<SOURCE_WS_URL>.Workspace/<SOURCE_LH>.Lakehouse/Tables/<SOURCE_SCHEMA>/<TABLE> -f\r\n```\r\n\r\n### Key rules\r\n\r\n- **Type**: Always `--type oneLake` for cross-lakehouse table shortcuts.\r\n Valid `fab ln` types are: `adlsGen2`, `amazonS3`, `dataverse`, `googleCloudStorage`,\r\n `oneLake`, `s3Compatible`. There is no `lakehouseTable` type.\r\n- **Slash notation**: Shortcut path uses `Tables/<Schema>/<table>.Shortcut` (slash, NOT dot)\r\n- **Periods forbidden**: `.` is not allowed in shortcut names — will error with `[InvalidPath]`\r\n- **`-f` flag**: Always include `-f` to skip the \"Are you sure?\" confirmation prompt\r\n (terminals that don't support CPR will hang without it)\r\n- **Source path**: Schema-enabled sources use `Tables/<schema>/<table>` (slash).\r\n Non-schema sources use `Tables/<table>` (no schema segment)\r\n- **URL encoding**: Workspace names with spaces use `%20` in the `--target` path\r\n- **`../../` prefix**: Required for cross-workspace targets to navigate to OneLake root\r\n- **Display names**: Shortcut destination path uses plain workspace/lakehouse names\r\n (no URL encoding); only the `--target` path is URL-encoded\r\n\r\n## Gotchas\r\n\r\n- **Slash NOT dot in shortcut paths**: The shortcut destination uses slash notation\r\n (`Tables/Bronze/revenue_raw.Shortcut`), NOT dot notation. 
Periods (`.`) are\r\n **forbidden** in shortcut names and will cause `[InvalidPath]` errors.\r\n- **Always use `-f` flag**: Without `-f`, `fab ln` prompts \"Are you sure? (Y/n)\".\r\n Terminals that don't support cursor position requests (CPR) will hang. Always\r\n append `-f` to force creation without confirmation.\r\n- **`--type oneLake` not `--type lakehouseTable`**: Cross-lakehouse table shortcuts\r\n require `--type oneLake`. The type `lakehouseTable` does not exist in the `fab ln`\r\n CLI. Valid types are: `adlsGen2`, `amazonS3`, `dataverse`, `googleCloudStorage`,\r\n `oneLake`, `s3Compatible`.\r\n- **Schema creation requires Spark SQL**: The `fab` CLI cannot create schemas.\r\n Schemas must be created via `CREATE SCHEMA IF NOT EXISTS <name>` in a Spark SQL\r\n cell in a notebook attached to the target lakehouse.\r\n- **Schema names are case-sensitive** in Fabric. Use exact casing consistently.\r\n- **Viewer access required**: Cross-workspace shortcuts require at least Viewer\r\n access on the source workspace.\r\n- **Existing shortcuts fail**: If a shortcut with the same name already exists,\r\n `fab ln` will error. Skip or delete existing ones before rerunning.\r\n- **Same-workspace shortcuts**: When source and target are in the same workspace,\r\n the `../../` prefix and URL encoding still apply in the `--target` path.\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_schema_shortcut_commands.py`** — Generates structured JSON\r\n containing schema SQL, `fab ln` shortcut commands, a schema-level shortcut test\r\n command, and a validation command.\r\n Run: `python scripts/generate_schema_shortcut_commands.py --help`\r\n",
+
content: "---\r\nname: create-lakehouse-schemas-and-shortcuts\r\ndescription: >\r\n Use this skill to create schemas in schema-enabled Microsoft Fabric lakehouses\r\n and create cross-lakehouse table shortcuts using the Fabric CLI. Triggers on:\r\n \"create lakehouse shortcuts\", \"create schema in lakehouse\", \"shortcut tables\r\n between lakehouses\", \"cross-lakehouse shortcuts\", \"surface bronze tables in\r\n silver\". Does NOT trigger for: creating lakehouses (use create-fabric-lakehouse),\r\n uploading files, creating delta tables from CSV/PDF, or generating MLV scripts.\r\nlicense: MIT\r\ncompatibility: Python 3.8+ for scripts/. Fabric CLI (fab) installed and authenticated.\r\n---\r\n\r\n# Create Lakehouse Schemas and Shortcuts\r\n\r\nCreates schemas in schema-enabled Fabric lakehouses and creates cross-lakehouse\r\ntable shortcuts using `fab ln --type oneLake`. Schemas and shortcuts are\r\ncreated in the same run. Source and target lakehouses must already exist.\r\n\r\n> **GOVERNANCE**: This skill generates commands — it does not execute them.\r\n> All `fab` commands are presented for the operator to review and run.\r\n\r\n## Orchestrated Context\r\n\r\nWhen invoked from a workflow agent, read `00-environment-discovery/environment-profile.md`\r\nand the SOP before asking the user anything.\r\n\r\n| Parameter | Source when orchestrated |\r\n|---|---|\r\n| Source and target workspace names | Environment profile or implementation plan |\r\n| Source and target lakehouse names | SOP shared parameters (from lakehouse creation step) |\r\n| Source schema | SOP shared parameters |\r\n\r\n**Only ask for parameters not found in these documents** (e.g. 
target schema name,\r\nspecific tables to shortcut if not listed in the SOP).\r\n\r\n## Inputs\r\n\r\n| Parameter | Description | Example |\r\n|-----------|-------------|---------|\r\n| `--source-workspace` | Source Fabric workspace name (exact, case-sensitive) | `\"LANDON_TEST_20260402_HUB\"` |\r\n| `--source-lakehouse` | Source lakehouse name (exact, case-sensitive) | `\"LANDON_FINANCE_BRONZE\"` |\r\n| `--source-schema` | Schema in source lakehouse. Use `dbo` for non-schema-enabled | `\"dbo\"` |\r\n| `--target-workspace` | Target Fabric workspace name (exact, case-sensitive) | `\"LANDON_TEST_20260402_FINANCE_SPOKE\"` |\r\n| `--target-lakehouse` | Target lakehouse name (exact, case-sensitive) | `\"LANDON_FINANCE_SILVER\"` |\r\n| `--target-schema` | Schema to create in target and place shortcuts into | `\"bronze\"` |\r\n| `--tables` | Comma-separated table names, or output of `fab ls` | `\"bookings,events\"` |\r\n\r\n## Workflow\r\n\r\n- [ ] **Step 1 — Collect parameters**: Ask the user for all inputs listed above.\r\n If source and target are in the same workspace, both workspace parameters will\r\n be the same value.\r\n\r\n- [ ] **Step 2 — Discover tables**: Ask the user to either:\r\n - Provide an explicit comma-separated list of table names, **or**\r\n - Run this command and share the output:\r\n ```\r\n fab ls \"<SOURCE_WORKSPACE>.Workspace/<SOURCE_LAKEHOUSE>.Lakehouse/Tables/\" -l\r\n ```\r\n Parse table names from the output or list. 
Present them back and confirm.\r\n\r\n- [ ] **Step 3 — Generate commands**: Run the script:\r\n ```\r\n python scripts/generate_schema_shortcut_commands.py \\\r\n --source-workspace \"<SOURCE_WORKSPACE>\" \\\r\n --source-lakehouse \"<SOURCE_LAKEHOUSE>\" \\\r\n --source-schema \"<SOURCE_SCHEMA>\" \\\r\n --target-workspace \"<TARGET_WORKSPACE>\" \\\r\n --target-lakehouse \"<TARGET_LAKEHOUSE>\" \\\r\n --target-schema \"<TARGET_SCHEMA>\" \\\r\n --tables \"<TABLE1>,<TABLE2>,...\"\r\n ```\r\n The script outputs JSON to stdout with sections: `schema_sql`,\r\n `schema_shortcut_test`, `shortcut_commands`, and `validation_command`.\r\n\r\n- [ ] **Step 4 — (Optional) Test schema-level shortcut**: Before creating\r\n individual table shortcuts, optionally test whether a single schema-level\r\n shortcut captures all tables (see \"Schema-Level Shortcut Hypothesis\" below).\r\n Use the `schema_shortcut_test` command from the script output.\r\n If the test succeeds and all tables appear, skip Step 5.\r\n\r\n- [ ] **Step 5 — Choose deployment approach**: Present these options:\r\n\r\n **Option A — Notebook Cells (Recommended for pipeline integration)**\r\n Append two cells to an existing notebook attached to the target lakehouse:\r\n 1. **Spark SQL cell**: Contains `CREATE SCHEMA IF NOT EXISTS <schema>;`\r\n from the `schema_sql` output.\r\n 2. **Code cell**: Contains each command from `shortcut_commands` prefixed\r\n with `!` (one per line).\r\n If no existing notebook is available, create a new one and note that it\r\n will need its own Spark session and `fab` authentication.\r\n\r\n **Option B — PowerShell Script**\r\n Write the `fab ln` commands from `shortcut_commands` to a `.ps1` file.\r\n Add a comment at the top reminding the user to create the schema first\r\n via a Spark SQL notebook cell (`fab` CLI cannot create schemas).\r\n\r\n **Option C — Interactive Terminal**\r\n Present each command one at a time for the operator to run. 
Start with the\r\n schema creation SQL (must run in a notebook), then present `fab ln` commands.\r\n\r\n- [ ] **Step 6 — Validate**: Ask the user to run:\r\n ```\r\n fab ls \"<TARGET_WORKSPACE>.Workspace/<TARGET_LAKEHOUSE>.Lakehouse/Tables/\" -l\r\n ```\r\n Confirm the expected shortcuts appear under the target schema.\r\n\r\n## Schema-Level Shortcut Hypothesis\r\n\r\nWhen creating shortcuts through the Fabric **UI**, connecting to a schema\r\nautomatically surfaces all tables in that schema as shortcuts. It is unknown\r\nwhether this works programmatically via `fab ln`. To test, use the\r\n`schema_shortcut_test` command from the script output, e.g.:\r\n\r\n```\r\nfab ln \"<TARGET_WS>.Workspace/<TARGET_LH>.Lakehouse/Tables/<TARGET_SCHEMA>/Shortcut\" \\\r\n --type oneLake \\\r\n --target ../../<SOURCE_WS_URL>.Workspace/<SOURCE_LH>.Lakehouse/Tables -f\r\n```\r\n\r\nIf this succeeds and all source tables appear in the target schema, use this\r\none-command approach instead of individual table shortcuts. Document the result\r\nfor future runs.\r\n\r\nIf the source is non-schema-enabled, test with `Tables` as the target path\r\n(no schema segment). If schema-enabled, use `Tables/<source_schema>`.\r\n\r\n## fab ln Syntax Reference\r\n\r\n### Shortcut naming convention (FIXED)\r\n\r\nShortcuts in schema-enabled lakehouses use **slash notation** for the schema path:\r\n```\r\nTables/<Schema>/<table_name>.Shortcut\r\n```\r\nExample: `Tables/Bronze/revenue_raw.Shortcut`\r\n\r\n**Periods (`.`) are FORBIDDEN in shortcut names.** Dot notation like\r\n`Tables/bronze.revenue_raw.Shortcut` will fail with:\r\n`[InvalidPath] Invalid shortcut name. 
The name should not include any of the following characters: [\"\\:|<>*?.%+]`\r\n\r\n### Cross-lakehouse: non-schema source → schema-enabled target\r\n\r\n```\r\nfab ln \"<TARGET_WS>.Workspace/<TARGET_LH>.Lakehouse/Tables/<TARGET_SCHEMA>/<TABLE>.Shortcut\" \\\r\n --type oneLake \\\r\n --target ../../<SOURCE_WS_URL>.Workspace/<SOURCE_LH>.Lakehouse/Tables/<TABLE> -f\r\n```\r\n\r\n### Cross-lakehouse: schema-enabled source → schema-enabled target\r\n\r\n```\r\nfab ln \"<TARGET_WS>.Workspace/<TARGET_LH>.Lakehouse/Tables/<TARGET_SCHEMA>/<TABLE>.Shortcut\" \\\r\n --type oneLake \\\r\n --target ../../<SOURCE_WS_URL>.Workspace/<SOURCE_LH>.Lakehouse/Tables/<SOURCE_SCHEMA>/<TABLE> -f\r\n```\r\n\r\n### Key rules\r\n\r\n- **Type**: Always `--type oneLake` for cross-lakehouse table shortcuts.\r\n Valid `fab ln` types are: `adlsGen2`, `amazonS3`, `dataverse`, `googleCloudStorage`,\r\n `oneLake`, `s3Compatible`. There is no `lakehouseTable` type.\r\n- **Slash notation**: Shortcut path uses `Tables/<Schema>/<table>.Shortcut` (slash, NOT dot)\r\n- **Periods forbidden**: `.` is not allowed in shortcut names — will error with `[InvalidPath]`\r\n- **`-f` flag**: Always include `-f` to skip the \"Are you sure?\" confirmation prompt\r\n (terminals that don't support CPR will hang without it)\r\n- **Source path**: Schema-enabled sources use `Tables/<schema>/<table>` (slash).\r\n Non-schema sources use `Tables/<table>` (no schema segment)\r\n- **URL encoding**: Workspace names with spaces use `%20` in the `--target` path\r\n- **`../../` prefix**: Required for cross-workspace targets to navigate to OneLake root\r\n- **Display names**: Shortcut destination path uses plain workspace/lakehouse names\r\n (no URL encoding); only the `--target` path is URL-encoded\r\n\r\n## Gotchas\r\n\r\n- **Slash NOT dot in shortcut paths**: The shortcut destination uses slash notation\r\n (`Tables/Bronze/revenue_raw.Shortcut`), NOT dot notation. 
Periods (`.`) are\r\n **forbidden** in shortcut names and will cause `[InvalidPath]` errors.\r\n- **Always use `-f` flag**: Without `-f`, `fab ln` prompts \"Are you sure? (Y/n)\".\r\n Terminals that don't support cursor position requests (CPR) will hang. Always\r\n append `-f` to force creation without confirmation.\r\n- **`--type oneLake` not `--type lakehouseTable`**: Cross-lakehouse table shortcuts\r\n require `--type oneLake`. The type `lakehouseTable` does not exist in the `fab ln`\r\n CLI. Valid types are: `adlsGen2`, `amazonS3`, `dataverse`, `googleCloudStorage`,\r\n `oneLake`, `s3Compatible`.\r\n- **Schema creation requires Spark SQL**: The `fab` CLI cannot create schemas.\r\n Schemas must be created via `CREATE SCHEMA IF NOT EXISTS <name>` in a Spark SQL\r\n cell in a notebook attached to the target lakehouse.\r\n- **Schema names are case-sensitive** in Fabric. Use exact casing consistently.\r\n- **Viewer access required**: Cross-workspace shortcuts require at least Viewer\r\n access on the source workspace.\r\n- **Existing shortcuts fail**: If a shortcut with the same name already exists,\r\n `fab ln` will error. Skip or delete existing ones before rerunning.\r\n- **Same-workspace shortcuts**: When source and target are in the same workspace,\r\n the `../../` prefix and URL encoding still apply in the `--target` path.\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_schema_shortcut_commands.py`** — Generates structured JSON\r\n containing schema SQL, `fab ln` shortcut commands, a schema-level shortcut test\r\n command, and a validation command.\r\n Run: `python scripts/generate_schema_shortcut_commands.py --help`\r\n",
},
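The `fab ln` rules embedded in the SKILL.md above (slash-notation destination path, URL-encoded `--target`, `../../` prefix, trailing `-f`) can be sketched as a small helper. This is a minimal illustration of the command shape that `scripts/generate_schema_shortcut_commands.py` is described as emitting — the function name and signature here are hypothetical, not the script's actual API:

```python
from urllib.parse import quote

def shortcut_command(target_ws, target_lh, target_schema, table,
                     source_ws, source_lh, source_schema=None):
    # Destination uses plain display names and slash notation:
    # Tables/<Schema>/<table>.Shortcut (periods are forbidden in shortcut names).
    dest = (f'"{target_ws}.Workspace/{target_lh}.Lakehouse/'
            f'Tables/{target_schema}/{table}.Shortcut"')
    # Only the --target path is URL-encoded (spaces become %20);
    # the ../../ prefix navigates up to the OneLake root for cross-workspace targets.
    src_ws = quote(source_ws)
    tables_path = (f"Tables/{source_schema}/{table}" if source_schema
                   else f"Tables/{table}")  # non-schema sources omit the schema segment
    target = f"../../{src_ws}.Workspace/{source_lh}.Lakehouse/{tables_path}"
    # -f skips the confirmation prompt (terminals without CPR support hang otherwise).
    return f"fab ln {dest} --type oneLake --target {target} -f"
```

Per the skill's governance note, a command assembled this way is still only presented to the operator for review, never executed directly.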
{
relativePath: "scripts/generate_schema_shortcut_commands.py",
@@ -107,7 +107,7 @@ export const EMBEDDED_SKILLS = [
files: [
{
relativePath: "SKILL.md",
-
content: "---\nname: create-materialised-lakeview-scripts\ndescription: >\n Use this skill when asked to generate Spark SQL Materialized Lake View (MLV)\n scripts for Microsoft Fabric Lakehouse transformations. Triggers on: \"generate\n MLV\", \"create silver layer\", \"create gold layer\", \"bronze to silver\", \"silver\n to gold\", \"star schema\", \"lakehouse transformation\", \"materialized lake view\".\n Supports two layers (bronze→silver, silver→gold) and two approaches each\n (schema-driven with source+target CSVs, or pattern-driven with source-only CSVs).\n Does NOT trigger for general SQL writing, Power BI semantic model creation,\n notebook authoring, or Fabric workspace/lakehouse provisioning.\nlicense: MIT\ncompatibility: Python 3.8+ with pandas (for profiling script)\n---\n\n# Fabric Lakehouse MLV Generator\n\n> ⚠️ **GOVERNANCE**: This skill produces Spark SQL notebooks and scripts for the\n> operator to review and run — it never executes queries or deploys notebooks\n> autonomously. Present each generated artefact to the operator before they run it.\n\nGenerates `CREATE OR REPLACE MATERIALIZED LAKE VIEW` scripts that transform data\nbetween lakehouse layers in Microsoft Fabric. 
Supports bronze→silver (cleaning,\nconforming, restructuring) and silver→gold (Power BI-optimised star schema).\n\n## Inputs\n\n| Parameter | Description | Example |\n|---|---|---|\n| Layer | Bronze→Silver or Silver→Gold | \"bronze to silver\" |\n| Approach | Schema-driven (source+target CSVs) or Pattern-driven (source CSVs only) | \"schema-driven\" |\n| Source CSVs | CSV exports of the source layer tables | `/mnt/user-data/uploads/*.csv` |\n| Target CSVs | (Schema-driven only) CSV exports of the target layer tables | `/mnt/user-data/uploads/silver_*.csv` |\n| Source schema | Schema name for source tables in SQL | `bronze` |\n| Target schema | Schema name for target views in SQL | `silver` or `gold` |\n| Fiscal year start | (Gold layer only) Month number 1–12 | `3` (March) |\n| Currency code | (Gold layer only) Base currency for measure suffixes | `GBP` |\n\n## Workflow\n\n### Phase 1 — Route the request\n\n- [ ] **1.1** Ask the user: **What layer transformation is this?**\n - Bronze → Silver\n - Silver → Gold\n\n- [ ] **1.2** Ask the user: **Which approach?**\n - **Schema-driven** — \"I have both source and target CSV files\"\n - **Pattern-driven** — \"I only have source CSV files; suggest transformations\"\n\n- [ ] **1.3** Based on answers, load the appropriate reference file:\n\n| Layer | Approach | Reference to load |\n|---|---|---|\n| Bronze → Silver | Schema-driven | `references/bronze-to-silver-schema-driven.md` |\n| Bronze → Silver | Pattern-driven | `references/bronze-to-silver-pattern-driven.md` |\n| Silver → Gold | Schema-driven | `references/silver-to-gold-schema-driven.md` |\n| Silver → Gold | Pattern-driven | `references/silver-to-gold-pattern-driven.md` |\n\nRead the full reference file with the `view` tool before proceeding. 
The reference\ncontains the detailed transformation catalogue, SQL patterns, and validation rules\nfor this specific layer+approach combination.\n\n- [ ] **1.4** Ask the user to confirm:\n - Source schema name (default: `bronze` for B→S, `silver` for S→G)\n - Target schema name (default: `silver` for B→S, `gold` for S→G)\n - If Silver→Gold: fiscal year start month and base currency code\n\n### Phase 2 — Inventory and profile\n\n- [ ] **2.1** List all CSV files in `/mnt/user-data/uploads/`.\n\n- [ ] **2.2** Ask the user to identify which CSVs are **source** and which (if\n schema-driven) are **target**. If file naming makes this obvious, propose the\n split and ask for confirmation.\n\n- [ ] **2.3** Run the profiler against every CSV:\n\n```bash\npython references/profile_csvs.py --dir /mnt/user-data/uploads/ --files <file1.csv> <file2.csv> ...\n```\n\nThe profiler outputs a JSON report per file with: column names, inferred dtypes,\nrow count, unique counts, null counts, sample values, and pattern flags (dates,\ncurrency, booleans, commas-in-numbers, whitespace). Store this output for use in\nsubsequent steps.\n\n> **Column naming in Fabric delta tables:** When CSVs are loaded into Fabric\n> Lakehouse delta tables (e.g., via the `csv-to-bronze-delta-tables` skill), a\n> `clean_columns()` function is applied that lowercases all column names and\n> replaces spaces and special characters with underscores. For example,\n> `Hotel ID` becomes `hotel_id` and `No_of_Rooms` becomes `no_of_rooms`.\n> PDF-extracted tables (from the `pdf-to-bronze-delta-tables` skill) may have\n> **entirely different column schemas** since fields are AI-extracted strings.\n> Always verify actual delta table column names — do NOT assume they match the\n> original CSV file headers.\n\n- [ ] **2.4** If schema-driven: profile both source and target CSVs. Map each\n target file to its source file(s) by column overlap. 
Present the mapping and\n ask the user to confirm.\n\n- [ ] **2.5** If pattern-driven: classify each source file by archetype (see\n reference file for the classification table). Present the classification and\n ask the user to confirm.\n\n### Phase 3 — Detect and plan transformations\n\nFollow the reference file's Step 3 (schema-driven) or Step 3 + Step 4\n(pattern-driven) exactly. The reference contains the full transformation detection\nlogic and catalogue.\n\n- [ ] **3.1** For each source→target pair (schema-driven) or each source file\n (pattern-driven), detect all applicable transformations.\n\n- [ ] **3.2** Present a **transformation plan** to the user — a table showing\n each output view, its sources, the transformations that will be applied, and\n any assumptions.\n\n- [ ] **3.3** If Silver→Gold: run the **anti-pattern check** from the reference:\n - No table mixes dimensions and measures\n - No dimension references another dimension via FK (no snowflaking)\n - Consistent grain within each fact\n - Degenerate dimensions stay in facts\n - Flag junk dimension candidates\n\n- [ ] **3.4** Wait for user confirmation before generating SQL.\n\n### Phase 4 — Generate the SQL\n\nFollow the reference file's SQL generation step exactly (Step 4 or Step 5,\ndepending on reference). Key rules that apply to ALL layer+approach combinations:\n\n**File structure:**\n1. `CREATE SCHEMA IF NOT EXISTS <target_schema>;`\n2. Comment header with assumptions (layer, approach, fiscal year, currency, grain)\n3. Views ordered by dependency (dimensions/independent views first, then dependents)\n4. 
Each view: `CREATE OR REPLACE MATERIALIZED LAKE VIEW <schema>.<view_name> AS`\n\n**Notebook documentation (when delivering as .ipynb):**\nLoad `references/notebook-standard.md` for the required markdown cell structure.\nWhen delivering as a notebook, the per-view markdown cells replace the separate\nlogic file — the notebook is the single source of truth.\n\n**MLV-to-MLV dependency pattern:**\nMaterialized Lake Views in Fabric can reference other Materialized Lake Views.\nThis is the **standard layered pattern** — build dimensions and independent\nfacts first, then create dependent views that JOIN to them. For example:\n- `silver.room_rate` joins to `silver.hotel_dim` via a fuzzy/normalised key\n- `silver.forecast_monthly` reads from `silver.revenue_monthly` for weight calculation\n- `silver.expenses_monthly` reads from `silver.revenue_monthly` for proportional allocation\n\nAlways order views by dependency: independent views first, dependent views last.\n\nLoad `references/sql-conventions.md` for naming conventions, CTE patterns,\ntype casting rules, and non-obvious Spark SQL syntax before writing any SQL.\n\n- [ ] **4.1** Write the SQL to `/home/claude/mlv_output.sql`.\n\n### Phase 4a — Generate T-SQL validation queries\n\nBefore converting to MLV format, generate a set of plain `SELECT` queries that\nthe user can run against the Fabric SQL Analytics Endpoint to validate the\ntransformation logic independently.\n\n- [ ] **4a.1** For each MLV definition, extract the CTE + SELECT logic and wrap\n it as a standalone `SELECT` statement (removing the `CREATE OR REPLACE\n MATERIALIZED LAKE VIEW` wrapper).\n\n- [ ] **4a.2** Write the validation queries to a separate file:\n - Bronze→Silver: `bronze_to_silver_validation.sql`\n - Silver→Gold: `silver_to_gold_validation.sql`\n\n- [ ] **4a.3** For each query, add a `LIMIT 20` clause and a `-- Expected: ...`\n comment indicating the expected row count and key column values.\n\n- [ ] **4a.4** Present the validation file to 
the user. The user can run these\n queries in the Fabric SQL Analytics Endpoint (T-SQL mode) to inspect outputs\n before committing to the MLV definitions.\n\n> **Why T-SQL first?** MLV creation is an all-or-nothing operation. If a column\n> name is wrong or a date format doesn't parse, the entire MLV fails. Running\n> validation SELECTs first catches these issues with clear error messages and\n> lets the user inspect sample data before committing.\n\n### Phase 5 — Validate\n\n- [ ] **5.1** Run the **data validation** from the reference file's validation\n step. Load source (and target, if schema-driven) CSVs in pandas and verify:\n - Column names match the target / expected output\n - Row counts are within tolerance (exact for dims, ±5% for facts)\n - Numeric columns: values within tolerance\n - Date columns: all parse correctly\n\n- [ ] **5.2** If Silver→Gold, run the **star schema structural checklist**:\n - [ ] Every table is clearly a dimension or a fact\n - [ ] Every fact has FKs to all related dimensions\n - [ ] Every dimension has a unique primary key\n - [ ] A date dimension exists spanning the full fact date range\n - [ ] Date dimension has display + sort column pairs for Power BI\n - [ ] Every dimension has an unknown/unassigned member row\n - [ ] No snowflaking (no dim-to-dim FK references)\n - [ ] No fact embeds descriptive attributes belonging in a dimension\n - [ ] Consistent grain within each fact table\n - [ ] Consistent naming: `dim_` for dimensions, `fact_` for facts\n - [ ] Surrogate key DENSE_RANK ORDER BY identical in dim views and fact CTEs\n - [ ] Role-playing dimensions documented\n - [ ] Degenerate dimensions remain in facts\n\n- [ ] **5.3** Fix any issues found. 
Re-validate until clean.\n\n### Phase 6 — Deliver\n\n- [ ] **6.1** Copy the validated SQL to `/mnt/user-data/outputs/` with a\n descriptive filename:\n - Bronze→Silver: `bronze_to_silver_mlv.sql`\n - Silver→Gold: `silver_to_gold_mlv.sql`\n\n- [ ] **6.2** Generate a **transformation logic document** alongside the SQL:\n - Bronze→Silver: `silver_logic.md`\n - Silver→Gold: `gold_logic.md`\n\n This file MUST contain:\n - **Per-view section** with: source table(s), transformations applied (reference\n T-codes), column mapping (bronze name → silver alias + type), any data quality\n issues detected (nulls, artifacts, dirty data, ambiguous formats) and how they\n were handled.\n - **Cross-view dependencies**: which MLVs reference other MLVs and why.\n - **Dropped/excluded data**: columns or rows removed, with rationale.\n - **Domain context**: any business-domain knowledge that informed the design\n (e.g., location hierarchies, currency conventions, fiscal calendars).\n - **Assumptions**: anything not explicitly confirmed by the user.\n\n If delivering as a notebook (`.ipynb`), the per-view markdown cells serve as\n the inline documentation — no separate logic file is needed, since the same\n information is embedded directly in the notebook.\n\n- [ ] **6.3** Present both files to the user.\n\n- [ ] **6.4** Summarise:\n - Number of views created\n - Key transformation patterns applied\n - (Gold) Number of dimensions vs facts, fiscal year config, currency\n - Any warnings or assumptions\n\n## Output Format\n\n```sql\n-- <Layer> layer Spark SQL MLV definitions\n-- Generated by fabric-lakehouse-mlv skill\n-- Source schema: <source_schema> | Target schema: <target_schema>\n-- Assumptions: <fiscal year, currency, grain, etc.>\n\nCREATE SCHEMA IF NOT EXISTS <target_schema>;\n\n-- <View description>\nCREATE OR REPLACE MATERIALIZED LAKE VIEW <target_schema>.<view_name> AS\nWITH cleaned AS (\n ...\n)\nSELECT ...\nFROM cleaned;\n```\n\n## Gotchas\n\n- **BOM characters**: 
Bronze/silver CSVs often have UTF-8 BOM. Always use\n `encoding='utf-8-sig'` in pandas.\n- **Date format ambiguity**: If all day values ≤ 12, `dd/MM/yyyy` vs `MM/dd/yyyy`\n is ambiguous. Default to `dd/MM/yyyy` for UK/EU data. Ask the user if unsure.\n- **Unpivot STACK count**: The integer N in `LATERAL VIEW STACK(N, ...)` must\n exactly match the number of column pairs. Off-by-one causes silent data loss.\n- **Surrogate key determinism**: `DENSE_RANK(ORDER BY col)` in a gold dimension\n and the matching CTE in a fact MUST use the exact same ORDER BY or keys diverge.\n- **SCD fan-out**: Overlapping date ranges in SCD tables duplicate fact rows.\n Validate non-overlap in silver before building gold.\n- **COALESCE placement**: Apply in the final SELECT of gold facts, never in the\n JOIN condition. Joining `ON fk = 'UNKNOWN'` would incorrectly match the\n unknown dimension row.\n- **Revenue-weighted allocation**: Only use when a revenue table exists. Fall back\n to equal split (`amount / 12.0`) when revenue is zero for a period.\n- **Power BI sort columns**: In the gold date dimension, always pair display\n columns (MonthName, DayOfWeekName, FiscalPeriodLabel) with numeric sort\n columns (MonthNumber, DayOfWeekNumber, FiscalPeriodNumber). Without these,\n months sort alphabetically in Power BI.\n- **No snowflaking in gold**: Flatten all dimension attributes. `dim_hotel`\n should contain City and Country directly, not reference a `dim_geography`.\n- **dayofweek() in Spark**: Returns 1=Sunday, 7=Saturday. Weekend = `IN (1,7)`.\n- **Fiscal year formula**: `((month + (12 - start_month)) % 12) + 1`. Test at\n January and at the start month for off-by-one errors.\n- **MLV-to-MLV references**: Materialized Lake Views in Fabric CAN reference\n other Materialized Lake Views. This is the preferred layered pattern. 
Always\n create referenced views before referencing views (dependency ordering).\n Use `silver.view_name` (not `bronze.view_name`) when joining to a silver\n view from another silver view.\n- **Column naming mismatch**: Bronze delta table columns may differ from the\n original CSV file headers. The `csv-to-bronze-delta-tables` skill applies\n `clean_columns()` which lowercases all names and replaces spaces/special\n characters with underscores (e.g., `Hotel ID` → `hotel_id`). PDF-extracted\n tables (from `pdf-to-bronze-delta-tables`) have AI-determined field names\n that may not match any CSV. Always verify actual lakehouse column names\n before writing SQL.\n\n## Available References\n\n- **`references/profile_csvs.py`** — Profiles uploaded CSV files and outputs a JSON\n report with column metadata, type flags, and pattern detection.\n Run: `python references/profile_csvs.py --help`\n- **`references/sql-conventions.md`** — Naming, CTE patterns, type casting, and Spark SQL syntax. Load during Phase 4.\n- **`references/notebook-standard.md`** — Required markdown cell structure when delivering output as a `.ipynb` notebook. Load when user requests notebook output.\n- **`references/bronze-to-silver-schema-driven.md`** — Transformation catalogue for bronze→silver schema-driven approach.\n- **`references/bronze-to-silver-pattern-driven.md`** — Transformation catalogue for bronze→silver pattern-driven approach.\n- **`references/silver-to-gold-schema-driven.md`** — Transformation catalogue for silver→gold schema-driven approach.\n- **`references/silver-to-gold-pattern-driven.md`** — Transformation catalogue for silver→gold pattern-driven approach.\n- **`references/output-template.sql`** — SQL output template.\n",
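The `clean_columns()` behaviour called out in the column-naming gotcha above can be approximated in a few lines. This is a sketch of the documented normalisation (lowercase, spaces and special characters replaced with underscores), not the actual implementation from the `csv-to-bronze-delta-tables` skill:

```python
import re

def clean_columns(columns):
    # Lowercase each name, then collapse any run of non-alphanumeric
    # characters into a single underscore, matching the documented
    # examples: "Hotel ID" -> "hotel_id", "No_of_Rooms" -> "no_of_rooms".
    return [re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
            for name in columns]
```

Running the original CSV headers through a sketch like this before writing SQL is one way to predict the delta-table column names — though, as the gotcha stresses, the actual lakehouse columns should still be verified.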
+
content: "---\nname: create-materialised-lakeview-scripts\ndescription: >\n Use this skill when asked to generate Spark SQL Materialized Lake View (MLV)\n scripts for Microsoft Fabric Lakehouse transformations. Triggers on: \"generate\n MLV\", \"create silver layer\", \"create gold layer\", \"bronze to silver\", \"silver\n to gold\", \"star schema\", \"lakehouse transformation\", \"materialized lake view\".\n Supports two layers (bronze→silver, silver→gold) and two approaches each\n (schema-driven with source+target CSVs, or pattern-driven with source-only CSVs).\n Does NOT trigger for general SQL writing, Power BI semantic model creation,\n notebook authoring, or Fabric workspace/lakehouse provisioning.\nlicense: MIT\ncompatibility: Python 3.8+ with pandas (for profiling script)\n---\n\n# Fabric Lakehouse MLV Generator\n\n> ⚠️ **GOVERNANCE**: This skill produces Spark SQL notebooks and scripts for the\n> operator to review and run — it never executes queries or deploys notebooks\n> autonomously. Present each generated artefact to the operator before they run it.\n\nGenerates `CREATE OR REPLACE MATERIALIZED LAKE VIEW` scripts that transform data\nbetween lakehouse layers in Microsoft Fabric. Supports bronze→silver (cleaning,\nconforming, restructuring) and silver→gold (Power BI-optimised star schema).\n\n## Orchestrated Context\n\nWhen invoked from a workflow agent, read `00-environment-discovery/environment-profile.md`\nand the SOP before asking the user anything.\n\n| Parameter | Source when orchestrated |\n|---|---|\n| Source and target schema names | SOP shared parameters or implementation plan |\n| Medallion layer (Bronze→Silver or Silver→Gold) | SOP step sequence |\n| Fiscal year start, currency code | Environment profile (organisation-wide settings) |\n\n**Only ask for parameters not found in these documents** (e.g. 
source/target CSV\nuploads, approach choice, any business-specific transformation rules).\n\n## Inputs\n\n| Parameter | Description | Example |\n|---|---|---|\n| Layer | Bronze→Silver or Silver→Gold | \"bronze to silver\" |\n| Approach | Schema-driven (source+target CSVs) or Pattern-driven (source CSVs only) | \"schema-driven\" |\n| Source CSVs | CSV exports of the source layer tables | `/mnt/user-data/uploads/*.csv` |\n| Target CSVs | (Schema-driven only) CSV exports of the target layer tables | `/mnt/user-data/uploads/silver_*.csv` |\n| Source schema | Schema name for source tables in SQL | `bronze` |\n| Target schema | Schema name for target views in SQL | `silver` or `gold` |\n| Fiscal year start | (Gold layer only) Month number 1–12 | `3` (March) |\n| Currency code | (Gold layer only) Base currency for measure suffixes | `GBP` |\n\n## Workflow\n\n### Phase 1 — Route the request\n\n- [ ] **1.1** Ask the user: **What layer transformation is this?**\n - Bronze → Silver\n - Silver → Gold\n\n- [ ] **1.2** Ask the user: **Which approach?**\n - **Schema-driven** — \"I have both source and target CSV files\"\n - **Pattern-driven** — \"I only have source CSV files; suggest transformations\"\n\n- [ ] **1.3** Based on answers, load the appropriate reference file:\n\n| Layer | Approach | Reference to load |\n|---|---|---|\n| Bronze → Silver | Schema-driven | `references/bronze-to-silver-schema-driven.md` |\n| Bronze → Silver | Pattern-driven | `references/bronze-to-silver-pattern-driven.md` |\n| Silver → Gold | Schema-driven | `references/silver-to-gold-schema-driven.md` |\n| Silver → Gold | Pattern-driven | `references/silver-to-gold-pattern-driven.md` |\n\nRead the full reference file with the `view` tool before proceeding. 
The reference\ncontains the detailed transformation catalogue, SQL patterns, and validation rules\nfor this specific layer+approach combination.\n\n- [ ] **1.4** Ask the user to confirm:\n - Source schema name (default: `bronze` for B→S, `silver` for S→G)\n - Target schema name (default: `silver` for B→S, `gold` for S→G)\n - If Silver→Gold: fiscal year start month and base currency code\n\n### Phase 2 — Inventory and profile\n\n- [ ] **2.1** List all CSV files in `/mnt/user-data/uploads/`.\n\n- [ ] **2.2** Ask the user to identify which CSVs are **source** and which (if\n schema-driven) are **target**. If file naming makes this obvious, propose the\n split and ask for confirmation.\n\n- [ ] **2.3** Run the profiler against every CSV:\n\n```bash\npython references/profile_csvs.py --dir /mnt/user-data/uploads/ --files <file1.csv> <file2.csv> ...\n```\n\nThe profiler outputs a JSON report per file with: column names, inferred dtypes,\nrow count, unique counts, null counts, sample values, and pattern flags (dates,\ncurrency, booleans, commas-in-numbers, whitespace). Store this output for use in\nsubsequent steps.\n\n> **Column naming in Fabric delta tables:** When CSVs are loaded into Fabric\n> Lakehouse delta tables (e.g., via the `csv-to-bronze-delta-tables` skill), a\n> `clean_columns()` function is applied that lowercases all column names and\n> replaces spaces and special characters with underscores. For example,\n> `Hotel ID` becomes `hotel_id` and `No_of_Rooms` becomes `no_of_rooms`.\n> PDF-extracted tables (from the `pdf-to-bronze-delta-tables` skill) may have\n> **entirely different column schemas** since fields are AI-extracted strings.\n> Always verify actual delta table column names — do NOT assume they match the\n> original CSV file headers.\n\n- [ ] **2.4** If schema-driven: profile both source and target CSVs. Map each\n target file to its source file(s) by column overlap. 
Present the mapping and\n ask the user to confirm.\n\n- [ ] **2.5** If pattern-driven: classify each source file by archetype (see\n reference file for the classification table). Present the classification and\n ask the user to confirm.\n\n### Phase 3 — Detect and plan transformations\n\nFollow the reference file's Step 3 (schema-driven) or Step 3 + Step 4\n(pattern-driven) exactly. The reference contains the full transformation detection\nlogic and catalogue.\n\n- [ ] **3.1** For each source→target pair (schema-driven) or each source file\n (pattern-driven), detect all applicable transformations.\n\n- [ ] **3.2** Present a **transformation plan** to the user — a table showing\n each output view, its sources, the transformations that will be applied, and\n any assumptions.\n\n- [ ] **3.3** If Silver→Gold: run the **anti-pattern check** from the reference:\n - No table mixes dimensions and measures\n - No dimension references another dimension via FK (no snowflaking)\n - Consistent grain within each fact\n - Degenerate dimensions stay in facts\n - Flag junk dimension candidates\n\n- [ ] **3.4** Wait for user confirmation before generating SQL.\n\n### Phase 4 — Generate the SQL\n\nFollow the reference file's SQL generation step exactly (Step 4 or Step 5,\ndepending on reference). Key rules that apply to ALL layer+approach combinations:\n\n**File structure:**\n1. `CREATE SCHEMA IF NOT EXISTS <target_schema>;`\n2. Comment header with assumptions (layer, approach, fiscal year, currency, grain)\n3. Views ordered by dependency (dimensions/independent views first, then dependents)\n4. 
Each view: `CREATE OR REPLACE MATERIALIZED LAKE VIEW <schema>.<view_name> AS`\n\n**Notebook documentation (when delivering as .ipynb):**\nLoad `references/notebook-standard.md` for the required markdown cell structure.\nWhen delivering as a notebook, the per-view markdown cells replace the separate\nlogic file — the notebook is the single source of truth.\n\n**MLV-to-MLV dependency pattern:**\nMaterialized Lake Views in Fabric can reference other Materialized Lake Views.\nThis is the **standard layered pattern** — build dimensions and independent\nfacts first, then create dependent views that JOIN to them. For example:\n- `silver.room_rate` joins to `silver.hotel_dim` via a fuzzy/normalised key\n- `silver.forecast_monthly` reads from `silver.revenue_monthly` for weight calculation\n- `silver.expenses_monthly` reads from `silver.revenue_monthly` for proportional allocation\n\nAlways order views by dependency: independent views first, dependent views last.\n\nLoad `references/sql-conventions.md` for naming conventions, CTE patterns,\ntype casting rules, and non-obvious Spark SQL syntax before writing any SQL.\n\n- [ ] **4.1** Write the SQL to `/home/claude/mlv_output.sql`.\n\n### Phase 4a — Generate T-SQL validation queries\n\nBefore converting to MLV format, generate a set of plain `SELECT` queries that\nthe user can run against the Fabric SQL Analytics Endpoint to validate the\ntransformation logic independently.\n\n- [ ] **4a.1** For each MLV definition, extract the CTE + SELECT logic and wrap\n it as a standalone `SELECT` statement (removing the `CREATE OR REPLACE\n MATERIALIZED LAKE VIEW` wrapper).\n\n- [ ] **4a.2** Write the validation queries to a separate file:\n - Bronze→Silver: `bronze_to_silver_validation.sql`\n - Silver→Gold: `silver_to_gold_validation.sql`\n\n- [ ] **4a.3** For each query, add a `LIMIT 20` clause and a `-- Expected: ...`\n comment indicating the expected row count and key column values.\n\n- [ ] **4a.4** Present the validation file to 
the user. The user can run these\n queries in the Fabric SQL Analytics Endpoint (T-SQL mode) to inspect outputs\n before committing to the MLV definitions.\n\n> **Why T-SQL first?** MLV creation is an all-or-nothing operation. If a column\n> name is wrong or a date format doesn't parse, the entire MLV fails. Running\n> validation SELECTs first catches these issues with clear error messages and\n> lets the user inspect sample data before committing.\n\n### Phase 5 — Validate\n\n- [ ] **5.1** Run the **data validation** from the reference file's validation\n step. Load source (and target, if schema-driven) CSVs in pandas and verify:\n - Column names match the target / expected output\n - Row counts are within tolerance (exact for dims, ±5% for facts)\n - Numeric columns: values within tolerance\n - Date columns: all parse correctly\n\n- [ ] **5.2** If Silver→Gold, run the **star schema structural checklist**:\n - [ ] Every table is clearly a dimension or a fact\n - [ ] Every fact has FKs to all related dimensions\n - [ ] Every dimension has a unique primary key\n - [ ] A date dimension exists spanning the full fact date range\n - [ ] Date dimension has display + sort column pairs for Power BI\n - [ ] Every dimension has an unknown/unassigned member row\n - [ ] No snowflaking (no dim-to-dim FK references)\n - [ ] No fact embeds descriptive attributes belonging in a dimension\n - [ ] Consistent grain within each fact table\n - [ ] Consistent naming: `dim_` for dimensions, `fact_` for facts\n - [ ] Surrogate key DENSE_RANK ORDER BY identical in dim views and fact CTEs\n - [ ] Role-playing dimensions documented\n - [ ] Degenerate dimensions remain in facts\n\n- [ ] **5.3** Fix any issues found. 
Re-validate until clean.\n\n### Phase 6 — Deliver\n\n- [ ] **6.1** Copy the validated SQL to `/mnt/user-data/outputs/` with a\n descriptive filename:\n - Bronze→Silver: `bronze_to_silver_mlv.sql`\n - Silver→Gold: `silver_to_gold_mlv.sql`\n\n- [ ] **6.2** Generate a **transformation logic document** alongside the SQL:\n - Bronze→Silver: `silver_logic.md`\n - Silver→Gold: `gold_logic.md`\n\n This file MUST contain:\n - **Per-view section** with: source table(s), transformations applied (reference\n T-codes), column mapping (bronze name → silver alias + type), any data quality\n issues detected (nulls, artifacts, dirty data, ambiguous formats) and how they\n were handled.\n - **Cross-view dependencies**: which MLVs reference other MLVs and why.\n - **Dropped/excluded data**: columns or rows removed, with rationale.\n - **Domain context**: any business-domain knowledge that informed the design\n (e.g., location hierarchies, currency conventions, fiscal calendars).\n - **Assumptions**: anything not explicitly confirmed by the user.\n\n If delivering as a notebook (`.ipynb`), the per-view markdown cells serve as\n the inline documentation — no separate logic file is needed, since the same\n information is embedded directly in the notebook.\n\n- [ ] **6.3** Present both files to the user.\n\n- [ ] **6.4** Summarise:\n - Number of views created\n - Key transformation patterns applied\n - (Gold) Number of dimensions vs facts, fiscal year config, currency\n - Any warnings or assumptions\n\n## Output Format\n\n```sql\n-- <Layer> layer Spark SQL MLV definitions\n-- Generated by fabric-lakehouse-mlv skill\n-- Source schema: <source_schema> | Target schema: <target_schema>\n-- Assumptions: <fiscal year, currency, grain, etc.>\n\nCREATE SCHEMA IF NOT EXISTS <target_schema>;\n\n-- <View description>\nCREATE OR REPLACE MATERIALIZED LAKE VIEW <target_schema>.<view_name> AS\nWITH cleaned AS (\n ...\n)\nSELECT ...\nFROM cleaned;\n```\n\n## Gotchas\n\n- **BOM characters**: 
Bronze/silver CSVs often have UTF-8 BOM. Always use\n  `encoding='utf-8-sig'` in pandas.\n- **Date format ambiguity**: If all day values ≤ 12, `dd/MM/yyyy` vs `MM/dd/yyyy`\n  is ambiguous. Default to `dd/MM/yyyy` for UK/EU data. Ask the user if unsure.\n- **Unpivot STACK count**: The integer N in `LATERAL VIEW STACK(N, ...)` must\n  exactly match the number of column pairs. Off-by-one causes silent data loss.\n- **Surrogate key determinism**: `DENSE_RANK() OVER (ORDER BY col)` in a gold dimension\n  and the matching CTE in a fact MUST use the exact same ORDER BY or keys diverge.\n- **SCD fan-out**: Overlapping date ranges in SCD tables duplicate fact rows.\n  Validate non-overlap in silver before building gold.\n- **COALESCE placement**: Apply in the final SELECT of gold facts, never in the\n  JOIN condition. Joining `ON fk = 'UNKNOWN'` would incorrectly match the\n  unknown dimension row.\n- **Revenue-weighted allocation**: Only use when a revenue table exists. Fall back\n  to equal split (`amount / 12.0`) when revenue is zero for a period.\n- **Power BI sort columns**: In the gold date dimension, always pair display\n  columns (MonthName, DayOfWeekName, FiscalPeriodLabel) with numeric sort\n  columns (MonthNumber, DayOfWeekNumber, FiscalPeriodNumber). Without these,\n  months sort alphabetically in Power BI.\n- **No snowflaking in gold**: Flatten all dimension attributes. `dim_hotel`\n  should contain City and Country directly, not reference a `dim_geography`.\n- **dayofweek() in Spark**: Returns 1=Sunday, 7=Saturday. Weekend = `IN (1,7)`.\n- **Fiscal year formula**: `((month + (12 - start_month)) % 12) + 1`. Test at\n  January and at the start month for off-by-one errors.\n- **MLV-to-MLV references**: Materialized Lake Views in Fabric CAN reference\n  other Materialized Lake Views. This is the preferred layered pattern. 
Always\n create referenced views before referencing views (dependency ordering).\n Use `silver.view_name` (not `bronze.view_name`) when joining to a silver\n view from another silver view.\n- **Column naming mismatch**: Bronze delta table columns may differ from the\n original CSV file headers. The `csv-to-bronze-delta-tables` skill applies\n `clean_columns()` which lowercases all names and replaces spaces/special\n characters with underscores (e.g., `Hotel ID` → `hotel_id`). PDF-extracted\n tables (from `pdf-to-bronze-delta-tables`) have AI-determined field names\n that may not match any CSV. Always verify actual lakehouse column names\n before writing SQL.\n\n## Available References\n\n- **`references/profile_csvs.py`** — Profiles uploaded CSV files and outputs a JSON\n report with column metadata, type flags, and pattern detection.\n Run: `python references/profile_csvs.py --help`\n- **`references/sql-conventions.md`** — Naming, CTE patterns, type casting, and Spark SQL syntax. Load during Phase 4.\n- **`references/notebook-standard.md`** — Required markdown cell structure when delivering output as a `.ipynb` notebook. Load when user requests notebook output.\n- **`references/bronze-to-silver-schema-driven.md`** — Transformation catalogue for bronze→silver schema-driven approach.\n- **`references/bronze-to-silver-pattern-driven.md`** — Transformation catalogue for bronze→silver pattern-driven approach.\n- **`references/silver-to-gold-schema-driven.md`** — Transformation catalogue for silver→gold schema-driven approach.\n- **`references/silver-to-gold-pattern-driven.md`** — Transformation catalogue for silver→gold pattern-driven approach.\n- **`references/output-template.sql`** — SQL output template.\n",
},
{
relativePath: "references/agent.md",
@@ -153,7 +153,7 @@ export const EMBEDDED_SKILLS = [
files: [
{
relativePath: "SKILL.md",
-
content: "---\r\nname: csv-to-bronze-delta-tables\r\ndescription: >\r\n Use this skill to upload CSV files from a local machine into a Microsoft Fabric\r\n bronze lakehouse and convert them to delta tables. Triggers on: \"create delta\r\n tables from CSV files\", \"load CSVs into bronze lakehouse\", \"upload CSV to Fabric\r\n and create tables\", \"ingest CSV files to delta format in Fabric\", \"create bronze\r\n tables from local CSV\". Does NOT trigger for creating lakehouses, transforming\r\n existing delta tables, or non-Fabric storage targets.\r\nlicense: MIT\r\ncompatibility: Python 3.8+ required for scripts/. Fabric CLI (fab) must be installed for the CLI upload option.\r\n---\r\n\r\n# CSV to Bronze Delta Tables\r\n\r\nUploads CSV files from an operator's local machine to a Microsoft Fabric bronze\r\nlakehouse and converts them to delta tables. The lakehouse must already exist.\r\n\r\n> ⚠️ **GOVERNANCE RULE**: This skill **never executes `fab` CLI commands directly**.\r\n> All `fab cp`, `fab ln`, and `fab ls` commands are presented to the operator as\r\n> script blocks for them to run. 
The agent only generates and presents commands.\r\n\r\n## Inputs\r\n\r\n| Parameter | Description | Example |\r\n|-----------|-------------|---------|\r\n| `WORKSPACE_NAME` | Fabric workspace name (exact, case-sensitive) | `\"Landon Finance Month End\"` |\r\n| `LAKEHOUSE_NAME` | Bronze lakehouse name (exact, case-sensitive) | `\"Lh_landon_finance_bronze\"` |\r\n| `LOCAL_CSV_FOLDER` | Relative path to local folder containing CSV files (CLI upload only) | `\"./Data\"` |\r\n| `LAKEHOUSE_FILES_FOLDER` | Folder name under the Files section of the lakehouse | `\"raw\"` |\r\n\r\n## Workflow\r\n\r\n- [ ] **Collect parameters** — If `WORKSPACE_NAME` or `LAKEHOUSE_NAME` are not\r\n provided, ask the operator for them before proceeding.\r\n\r\n- [ ] **Upload CSV files** — Present these three options and ask the operator to\r\n choose one:\r\n\r\n **Option 1 — OneLake File Explorer (Manual)**\r\n Open the OneLake File Explorer desktop app and drag-and-drop the CSV files into\r\n the target folder under the lakehouse Files section. No agent action required.\r\n\r\n **Option 2 — Fabric UI (Manual)**\r\n In the Fabric browser UI navigate to the lakehouse → Files section, open or create\r\n the target folder, click **Upload** and select the CSV files. No agent action required.\r\n\r\n **Option 3 — Fabric CLI (Automated)**\r\n > ⚠️ **Requires PowerShell** — generates a `.ps1` script. PowerShell is available\r\n > on Windows natively and on Mac/Linux via `brew install powershell`. If PowerShell\r\n > is not available and the operator does not want to install it, use Option 1 or 2.\r\n > Do not substitute a bash or shell script.\r\n >\r\n > ⚠️ **Performance note**: The CLI uploads files one at a time. 
For large\r\n > batches (50+ files) this is significantly slower than Options 1 or 2.\r\n > Recommend Options 1 or 2 for bulk uploads.\r\n\r\n Ask for `LOCAL_CSV_FOLDER` as the **exact absolute path** to the local folder\r\n and `LAKEHOUSE_FILES_FOLDER` (the destination folder name under Files). Then run:\r\n ```\r\n python scripts/generate_upload_commands.py \\\r\n --local-folder \"<LOCAL_CSV_FOLDER>\" \\\r\n --workspace \"<WORKSPACE_NAME>\" \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-script \"<OUTPUT_FOLDER>/upload_csv_files.ps1\"\r\n ```\r\n The script generates a PowerShell `.ps1` file saved directly to the outputs folder.\r\n Present the script path to the operator and ask them to run it with `pwsh upload_csv_files.ps1`.\r\n\r\n## Output Folder\r\n\r\nBefore beginning the workflow, create the output folder:\r\n```\r\noutputs/csv-to-bronze-delta-tables_{YYYY-MM-DD_HH-MM}_{USERNAME}/\r\n```\r\nAll scripts produced during the run are saved here.\r\n\r\n- [ ] **Confirm upload** — Ask the operator to confirm the CSV files are visible\r\n in the Files section of the lakehouse before proceeding.\r\n\r\n- [ ] **Create delta tables** — If `LAKEHOUSE_FILES_FOLDER` was not already\r\n captured above, ask for it now. Present these two options:\r\n\r\n **Option 1 — Fabric UI (Manual)**\r\n > Quick and easy — recommended for most users.\r\n In the Fabric browser UI navigate to the lakehouse → Files →\r\n `<LAKEHOUSE_FILES_FOLDER>`. For each CSV file: click the three-dot menu →\r\n **Load to Tables** → **New Table**. Accept the suggested table name (Fabric\r\n applies it automatically). 
No agent action required.\r\n\r\n **Option 2 — PySpark notebook (Automated)**\r\n Run:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\csv_to_delta_tables.ipynb\"\r\n ```\r\n This writes a ready-to-run `.ipynb` file to the outputs folder. Tell the operator:\r\n 1. In the Fabric UI go to the workspace → **New** → **Import notebook**\r\n 2. Select `csv_to_delta_tables.ipynb` from the outputs folder\r\n 3. Click **Run All** — the notebook attaches the lakehouse automatically\r\n **Validate**: confirm every cell printed `✅ Created table: <table_name>` with\r\n no errors. If any `❌` lines appear, report the error message to the operator.\r\n\r\n## Table Naming Convention\r\n\r\nCSV filename → delta table name:\r\n- Strip `.csv` extension\r\n- Convert to lowercase\r\n- Replace any non-alphanumeric characters (spaces, hyphens, dots) with underscores\r\n- Strip leading/trailing underscores\r\n\r\nExamples:\r\n| CSV filename | Delta table name |\r\n|---|---|\r\n| `Revenue Data.csv` | `revenue_data` |\r\n| `Landon hotel revenue data.csv` | `landon_hotel_revenue_data` |\r\n| `Q1-Sales.csv` | `q1_sales` |\r\n\r\n## Column Naming Convention\r\n\r\nWhen CSVs are loaded into delta tables via the PySpark notebook (Option 2 of\r\ndelta table creation), a `clean_columns()` function transforms every column name:\r\n\r\n- Convert to lowercase\r\n- Replace spaces, hyphens, and other non-alphanumeric characters with underscores\r\n- Strip leading/trailing underscores\r\n\r\n| CSV column header | Delta table column name |\r\n|---|---|\r\n| `Hotel ID` | `hotel_id` |\r\n| `No_of_Rooms` | `no_of_rooms` |\r\n| `Total Revenue (GBP)` | `total_revenue_gbp` |\r\n| `First Name` | `first_name` |\r\n\r\n> **Important for downstream skills:** When writing SQL queries against bronze\r\n> delta tables (e.g., in the `create-materialised-lakeview-scripts` 
skill),\r\n> always use the cleaned column names — not the original CSV headers.\r\n\r\n## Output Format\r\n\r\nDelta tables appear under the **Tables** section of the bronze lakehouse in the\r\nFabric UI, named according to the convention above. Each table is queryable via\r\nthe lakehouse SQL endpoint and PySpark.\r\n\r\n## Gotchas\r\n\r\n- `fab cp` uses the path prefix to identify local vs OneLake paths. **Absolute\r\n Windows paths (`C:\\...`) are not recognised as local** and cause a\r\n `[NotSupported] Source and destination must be of the same type` error. Always\r\n use `Push-Location` into the source folder and `./filename` (forward slash,\r\n not backslash) syntax — confirmed working pattern.\r\n- **The destination folder must exist before running `fab cp`.** Always run\r\n `fab mkdir \"{WORKSPACE}.Workspace/{LAKEHOUSE}.Lakehouse/Files/{FOLDER}\"` first.\r\n Running `fab mkdir` on an already-existing folder is safe and does not error.\r\n- `WORKSPACE_NAME` and `LAKEHOUSE_NAME` are case-sensitive and must exactly match\r\n what appears in the Fabric UI.\r\n- Shortcuts (Option 1 for delta table creation) use Fabric's automatic schema\r\n inference. They may fail if column names contain spaces or if data types are\r\n inconsistent. 
Switch to Option 2 (PySpark notebook) in those cases.\r\n- The PySpark notebook attaches the lakehouse automatically via `%%configure` in\r\n Cell 1 — no manual attachment needed before running.\r\n- When using the Fabric CLI, run all commands from the directory that\r\n `LOCAL_CSV_FOLDER` is relative to (typically the project root).\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_upload_commands.py`** — Scans a local CSV folder and outputs\r\n `fab cp` commands to upload each file to the lakehouse Files section.\r\n Run: `python scripts/generate_upload_commands.py --help`\r\n- **`scripts/generate_notebook.py`** — Generates a Fabric-compatible `.ipynb`\r\n notebook pre-configured with the correct lakehouse and `FILES_FOLDER`. The\r\n notebook attaches the lakehouse automatically via `%%configure`. Import into\r\n Fabric via **New → Import notebook**.\r\n Run: `python scripts/generate_notebook.py --help`\r\n",
+
content: "---\r\nname: csv-to-bronze-delta-tables\r\ndescription: >\r\n Use this skill to upload CSV files from a local machine into a Microsoft Fabric\r\n bronze lakehouse and convert them to delta tables. Triggers on: \"create delta\r\n tables from CSV files\", \"load CSVs into bronze lakehouse\", \"upload CSV to Fabric\r\n and create tables\", \"ingest CSV files to delta format in Fabric\", \"create bronze\r\n tables from local CSV\". Does NOT trigger for creating lakehouses, transforming\r\n existing delta tables, or non-Fabric storage targets.\r\nlicense: MIT\r\ncompatibility: Python 3.8+ required for scripts/. Fabric CLI (fab) must be installed for the CLI upload option.\r\n---\r\n\r\n# CSV to Bronze Delta Tables\r\n\r\nUploads CSV files from an operator's local machine to a Microsoft Fabric bronze\r\nlakehouse and converts them to delta tables. The lakehouse must already exist.\r\n\r\n> ⚠️ **GOVERNANCE RULE**: This skill **never executes `fab` CLI commands directly**.\r\n> All `fab cp`, `fab ln`, and `fab ls` commands are presented to the operator as\r\n> script blocks for them to run. The agent only generates and presents commands.\r\n>\r\n> ⚠️ **GENERATION**: Always run `scripts/generate_notebook.py` via Bash to produce\r\n> the `.ipynb` notebook — never generate notebook cell content directly. The\r\n> generated notebook uses native PySpark (`%%configure`, `spark.read.csv`,\r\n> `df.write.format(\"delta\")`) — it does not use `fab` CLI or `FAB_TOKEN` auth.\r\n\r\n## Orchestrated Context\r\n\r\nWhen invoked from a workflow agent, read `00-environment-discovery/environment-profile.md`\r\nand the SOP before asking the user anything.\r\n\r\n| Parameter | Source when orchestrated |\r\n|---|---|\r\n| Workspace name | Environment profile or implementation plan |\r\n| Lakehouse name | SOP shared parameters (from lakehouse creation step) |\r\n\r\n**Only ask for parameters not found in these documents** (e.g. 
local CSV folder path,\r\ndestination folder name in OneLake, table naming preferences).\r\n\r\n## Inputs\r\n\r\n| Parameter | Description | Example |\r\n|-----------|-------------|---------|\r\n| `WORKSPACE_NAME` | Fabric workspace name (exact, case-sensitive) | `\"Landon Finance Month End\"` |\r\n| `LAKEHOUSE_NAME` | Bronze lakehouse name (exact, case-sensitive) | `\"Lh_landon_finance_bronze\"` |\r\n| `LOCAL_CSV_FOLDER` | Relative path to local folder containing CSV files (CLI upload only) | `\"./Data\"` |\r\n| `LAKEHOUSE_FILES_FOLDER` | Folder name under the Files section of the lakehouse | `\"raw\"` |\r\n\r\n## Workflow\r\n\r\n- [ ] **Collect parameters** — If `WORKSPACE_NAME` or `LAKEHOUSE_NAME` are not\r\n provided, ask the operator for them before proceeding.\r\n\r\n- [ ] **Upload CSV files** — Present these three options and ask the operator to\r\n choose one:\r\n\r\n **Option 1 — OneLake File Explorer (Manual)**\r\n Open the OneLake File Explorer desktop app and drag-and-drop the CSV files into\r\n the target folder under the lakehouse Files section. No agent action required.\r\n\r\n **Option 2 — Fabric UI (Manual)**\r\n In the Fabric browser UI navigate to the lakehouse → Files section, open or create\r\n the target folder, click **Upload** and select the CSV files. No agent action required.\r\n\r\n **Option 3 — Fabric CLI (Automated)**\r\n > ⚠️ **Requires PowerShell** — generates a `.ps1` script. PowerShell is available\r\n > on Windows natively and on Mac/Linux via `brew install powershell`. If PowerShell\r\n > is not available and the operator does not want to install it, use Option 1 or 2.\r\n > Do not substitute a bash or shell script.\r\n >\r\n > ⚠️ **Performance note**: The CLI uploads files one at a time. 
For large\r\n > batches (50+ files) this is significantly slower than Options 1 or 2.\r\n > Recommend Options 1 or 2 for bulk uploads.\r\n\r\n Ask for `LOCAL_CSV_FOLDER` as the **exact absolute path** to the local folder\r\n and `LAKEHOUSE_FILES_FOLDER` (the destination folder name under Files). Then run:\r\n ```\r\n python scripts/generate_upload_commands.py \\\r\n --local-folder \"<LOCAL_CSV_FOLDER>\" \\\r\n --workspace \"<WORKSPACE_NAME>\" \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-script \"<OUTPUT_FOLDER>/upload_csv_files.ps1\"\r\n ```\r\n The script generates a PowerShell `.ps1` file saved directly to the outputs folder.\r\n Present the script path to the operator and ask them to run it with `pwsh upload_csv_files.ps1`.\r\n\r\n## Output Folder\r\n\r\nBefore beginning the workflow, create the output folder:\r\n```\r\noutputs/csv-to-bronze-delta-tables_{YYYY-MM-DD_HH-MM}_{USERNAME}/\r\n```\r\nAll scripts produced during the run are saved here.\r\n\r\n- [ ] **Confirm upload** — Ask the operator to confirm the CSV files are visible\r\n in the Files section of the lakehouse before proceeding.\r\n\r\n- [ ] **Create delta tables** — If `LAKEHOUSE_FILES_FOLDER` was not already\r\n captured above, ask for it now. Present these two options:\r\n\r\n **Option 1 — Fabric UI (Manual)**\r\n > Quick and easy — recommended for most users.\r\n In the Fabric browser UI navigate to the lakehouse → Files →\r\n `<LAKEHOUSE_FILES_FOLDER>`. For each CSV file: click the three-dot menu →\r\n **Load to Tables** → **New Table**. Accept the suggested table name (Fabric\r\n applies it automatically). 
No agent action required.\r\n\r\n **Option 2 — PySpark notebook (Automated)**\r\n Run:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\csv_to_delta_tables.ipynb\"\r\n ```\r\n This writes a ready-to-run `.ipynb` file to the outputs folder. Tell the operator:\r\n 1. In the Fabric UI go to the workspace → **New** → **Import notebook**\r\n 2. Select `csv_to_delta_tables.ipynb` from the outputs folder\r\n 3. Click **Run All** — the notebook attaches the lakehouse automatically\r\n **Validate**: confirm every cell printed `✅ Created table: <table_name>` with\r\n no errors. If any `❌` lines appear, report the error message to the operator.\r\n\r\n## Table Naming Convention\r\n\r\nCSV filename → delta table name:\r\n- Strip `.csv` extension\r\n- Convert to lowercase\r\n- Replace any non-alphanumeric characters (spaces, hyphens, dots) with underscores\r\n- Strip leading/trailing underscores\r\n\r\nExamples:\r\n| CSV filename | Delta table name |\r\n|---|---|\r\n| `Revenue Data.csv` | `revenue_data` |\r\n| `Landon hotel revenue data.csv` | `landon_hotel_revenue_data` |\r\n| `Q1-Sales.csv` | `q1_sales` |\r\n\r\n## Column Naming Convention\r\n\r\nWhen CSVs are loaded into delta tables via the PySpark notebook (Option 2 of\r\ndelta table creation), a `clean_columns()` function transforms every column name:\r\n\r\n- Convert to lowercase\r\n- Replace spaces, hyphens, and other non-alphanumeric characters with underscores\r\n- Strip leading/trailing underscores\r\n\r\n| CSV column header | Delta table column name |\r\n|---|---|\r\n| `Hotel ID` | `hotel_id` |\r\n| `No_of_Rooms` | `no_of_rooms` |\r\n| `Total Revenue (GBP)` | `total_revenue_gbp` |\r\n| `First Name` | `first_name` |\r\n\r\n> **Important for downstream skills:** When writing SQL queries against bronze\r\n> delta tables (e.g., in the `create-materialised-lakeview-scripts` 
skill),\r\n> always use the cleaned column names — not the original CSV headers.\r\n\r\n## Output Format\r\n\r\nDelta tables appear under the **Tables** section of the bronze lakehouse in the\r\nFabric UI, named according to the convention above. Each table is queryable via\r\nthe lakehouse SQL endpoint and PySpark.\r\n\r\n## Gotchas\r\n\r\n- `fab cp` uses the path prefix to identify local vs OneLake paths. **Absolute\r\n Windows paths (`C:\\...`) are not recognised as local** and cause a\r\n `[NotSupported] Source and destination must be of the same type` error. Always\r\n use `Push-Location` into the source folder and `./filename` (forward slash,\r\n not backslash) syntax — confirmed working pattern.\r\n- **The destination folder must exist before running `fab cp`.** Always run\r\n `fab mkdir \"{WORKSPACE}.Workspace/{LAKEHOUSE}.Lakehouse/Files/{FOLDER}\"` first.\r\n Running `fab mkdir` on an already-existing folder is safe and does not error.\r\n- `WORKSPACE_NAME` and `LAKEHOUSE_NAME` are case-sensitive and must exactly match\r\n what appears in the Fabric UI.\r\n- Shortcuts (Option 1 for delta table creation) use Fabric's automatic schema\r\n inference. They may fail if column names contain spaces or if data types are\r\n inconsistent. 
Switch to Option 2 (PySpark notebook) in those cases.\r\n- The PySpark notebook attaches the lakehouse automatically via `%%configure` in\r\n Cell 1 — no manual attachment needed before running.\r\n- When using the Fabric CLI, run all commands from the directory that\r\n `LOCAL_CSV_FOLDER` is relative to (typically the project root).\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_upload_commands.py`** — Scans a local CSV folder and outputs\r\n `fab cp` commands to upload each file to the lakehouse Files section.\r\n Run: `python scripts/generate_upload_commands.py --help`\r\n- **`scripts/generate_notebook.py`** — Generates a Fabric-compatible `.ipynb`\r\n notebook pre-configured with the correct lakehouse and `FILES_FOLDER`. The\r\n notebook attaches the lakehouse automatically via `%%configure`. Import into\r\n Fabric via **New → Import notebook**.\r\n Run: `python scripts/generate_notebook.py --help`\r\n",
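The column- and table-naming conventions documented in this SKILL.md can be sketched as a small helper. This is a hypothetical reconstruction from the stated rules (lowercase, collapse non-alphanumerics to underscores, strip edge underscores), not the skill's actual `clean_columns()` implementation:

```python
import re

def clean_column(name: str) -> str:
    """Normalise a CSV header to the documented delta-table column style."""
    # Lowercase, replace runs of non-alphanumerics with a single underscore,
    # then strip any leading/trailing underscores.
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")

# Column examples from the skill's conversion table:
assert clean_column("Hotel ID") == "hotel_id"
assert clean_column("No_of_Rooms") == "no_of_rooms"
assert clean_column("Total Revenue (GBP)") == "total_revenue_gbp"
assert clean_column("First Name") == "first_name"

# The table-naming convention applies the same rules after dropping ".csv":
fname = "Q1-Sales.csv"
assert clean_column(fname[:-len(".csv")]) == "q1_sales"
```

Downstream SQL (e.g. the MLV skill) should be written against these cleaned names, never the raw CSV headers.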
},
{
relativePath: "assets/pyspark_notebook_template.py",
@@ -201,7 +201,7 @@ export const EMBEDDED_SKILLS = [
files: [
{
relativePath: "SKILL.md",
-
content: "---\r\nname: generate-fabric-workspace\r\ndescription: >\r\n Use this skill when asked to create, provision, or set up a Microsoft Fabric\r\n workspace. Triggers on: \"create a Fabric workspace\", \"provision a workspace\r\n in Fabric\", \"set up a new Fabric workspace\", \"generate a workspace with\r\n capacity and permissions\", \"create workspace and assign roles in Fabric\".\r\n Collects workspace name, capacity, principals/roles, and optional domain\r\n settings, then creates the workspace using one of three approaches: PySpark\r\n notebook, PowerShell script, or interactive terminal commands. Produces a\r\n workspace definition markdown as a creation audit record. Does NOT trigger\r\n for general Fabric questions, item creation within a workspace, or\r\n workspace deletion tasks.\r\nlicense: MIT\r\ncompatibility: >\r\n ms-fabric-cli required (pip install ms-fabric-cli). Approach 1 requires a\r\n Fabric notebook environment. Approaches 2 and 3 require fab CLI installed\r\n locally with network access to Microsoft Fabric.\r\n---\r\n\r\n# Generate Fabric Workspace\r\n\r\n> ⚠️ **GOVERNANCE**: This skill produces notebooks and scripts for the operator to\r\n> review and run — it never executes commands directly against a live Fabric environment.\r\n> Present each generated artefact to the operator before they run it.\r\n>\r\n> ⚠️ **GENERATION**: Always run the generator scripts (`scripts/generate_notebook.py`,\r\n> `scripts/generate_ps1.py`) via Bash to produce artefacts — never generate notebook\r\n> or script content directly. Do not present generator scripts themselves as outputs.\r\n>\r\n> **Canonical notebook pattern** — every generated PySpark notebook follows this\r\n> exact cell structure. Do not deviate:\r\n> 1. `%pip install ms-fabric-cli -q --no-warn-conflicts` (Cell 1 — restart kernel after)\r\n> 2. 
`notebookutils.credentials.getToken('pbi')` and `getToken('storage')` → set as\r\n> `os.environ['FAB_TOKEN']`, `FAB_TOKEN_ONELAKE`, `FAB_TOKEN_AZURE` (Cell 2 — auth)\r\n> 3. All workspace operations use `!fab` shell commands — `!fab mkdir`, `!fab get`,\r\n> `!fab acl set`, `!fab api`, etc. Python subprocess is never used.\r\n\r\nCreates a Microsoft Fabric workspace assigned to a specified capacity, with\r\naccess roles and optional domain assignment. If the workspace already exists,\r\ncreation is skipped and roles/domain are updated. Outputs a workspace\r\ndefinition markdown as an audit trail.\r\n\r\n## Step 1 — Choose Approach\r\n\r\nAsk the user:\r\n\r\n> \"Which approach would you like to use?\r\n> 1. **PySpark Notebook** — generates a notebook to run inside Fabric\r\n> (authenticated automatically via the notebook environment)\r\n> 2. **PowerShell Script** — generates a `.ps1` for your review before execution\r\n> (requires fab CLI installed locally)\r\n> 3. **Interactive Terminal** — runs fab CLI commands one by one in the terminal,\r\n> with your confirmation at each step (requires fab CLI installed locally)\"\r\n\r\n### Authentication by approach\r\n\r\n| Approach | Authentication |\r\n|---|---|\r\n| PySpark Notebook | Auto via `notebookutils.credentials.getToken('pbi')` inside Fabric |\r\n| PowerShell / Terminal | `fab auth login` (browser pop-up) or set `$env:FAB_TOKEN` / `FAB_TOKEN` |\r\n\r\n## Step 2 — Domain Handling\r\n\r\nAsk the user:\r\n\r\n> \"Would you like to:\r\n> A. **Create a new domain** and assign the workspace to it\r\n> ⚠️ Requires **Fabric Admin** tenant-level permissions.\r\n> You will also need to specify an **Entra group** that will be allowed to\r\n> add/remove workspaces from this domain (the domain contributor group).\r\n> B. **Assign the workspace to an existing domain**\r\n> C. 
**Skip domain assignment**\"\r\n\r\n- If **A**: collect `DOMAIN_NAME` and `DOMAIN_CONTRIBUTOR_GROUP` (the Entra\r\n group display name allowed to add/remove workspaces from the domain). Confirm\r\n the user has Fabric Admin rights.\r\n- If **B**: collect `DOMAIN_NAME` only.\r\n- If **C**: no domain parameters needed.\r\n\r\n## Step 3 — Collect Parameters\r\n\r\nCollect these values from the user:\r\n\r\n| Parameter | Required | Description |\r\n|---|---|---|\r\n| `WORKSPACE_NAME` | Yes | Display name for the workspace |\r\n| `CAPACITY_NAME` | Yes | Exact name of the Fabric capacity to assign |\r\n| `DOMAIN_NAME` | If A or B | Name of the domain (new or existing) |\r\n| `DOMAIN_CONTRIBUTOR_GROUP` | If A | Display name of the Entra group that manages the domain |\r\n| `WORKSPACE_ROLES` | Conditional | Additional principals + roles (see approach-specific guidance below) |\r\n\r\n### Workspace roles — approach-specific guidance\r\n\r\nThe workspace creator is **automatically assigned as Admin**. Before collecting\r\nadditional roles, ask:\r\n\r\n> \"You (the creator) will be automatically assigned as workspace Admin. Do you\r\n> want to assign additional roles to other users or groups?\"\r\n\r\nIf **no**, skip role collection entirely. If **yes**, load\r\n`references/role-assignment.md` for approach-specific guidance on collecting\r\nprincipals, group resolution requirements, and Service Principal prerequisites.\r\n\r\nFor each additional principal, collect:\r\n- User **email address (UPN)** or Entra **group display name** — do NOT ask for Object IDs\r\n- Principal type: `User` or `Group` (or `ServicePrincipal`)\r\n- Role: `Admin`, `Member`, `Contributor`, or `Viewer`\r\n\r\n## Step 4 — Execute\r\n\r\n### Approach 1: PySpark Notebook\r\n\r\nIf role assignment includes Entra groups, `TENANT_ID`, `CLIENT_ID`, and `CLIENT_SECRET`\r\nare required — entered directly into Cell 1 of the generated notebook. 
See\r\n`references/role-assignment.md` for prerequisite details.\r\n\r\nRun `scripts/generate_notebook.py` with the collected parameters:\r\n\r\n```bash\r\npython scripts/generate_notebook.py \\\r\n --workspace-name \"WORKSPACE_NAME\" \\\r\n --capacity-name \"CAPACITY_NAME\" \\\r\n --roles \"user@corp.com:User:Admin,Finance Team:Group:Member\" \\\r\n [--domain-name \"DOMAIN_NAME\"] \\\r\n [--create-domain] \\\r\n [--domain-contributor-group \"DOMAIN_CONTRIBUTOR_GROUP\"] \\\r\n --output workspace_setup.ipynb\r\n```\r\n\r\nPresent the generated `workspace_setup.ipynb` to the user and instruct them to:\r\n1. Upload to any Fabric workspace as a notebook\r\n2. Run each cell **one at a time**, reading the output before proceeding\r\n3. ✅ Verification cells are clearly marked — confirm output before moving on\r\n4. Share the output of Cell 7 (`fab ls`) and Cell 9 (`fab acl ls`)\r\n\r\n### Approach 2: PowerShell Script\r\n\r\nRun `scripts/generate_ps1.py` with the collected parameters:\r\n\r\n```bash\r\npython scripts/generate_ps1.py \\\r\n --workspace-name \"WORKSPACE_NAME\" \\\r\n --capacity-name \"CAPACITY_NAME\" \\\r\n --roles \"user@corp.com:User:Admin,Finance Team:Group:Member\" \\\r\n [--domain-name \"DOMAIN_NAME\"] \\\r\n [--create-domain] \\\r\n [--domain-contributor-group \"DOMAIN_CONTRIBUTOR_GROUP\"] \\\r\n --output workspace_setup.ps1\r\n```\r\n\r\nShow `workspace_setup.ps1` to the user for review. **Do not execute until the\r\nuser confirms.** Then run:\r\n\r\n```powershell\r\n.\\workspace_setup.ps1\r\n```\r\n\r\n### Approach 3: Interactive Terminal\r\n\r\nRun these commands in sequence. 
Show output after each and ask the user to\r\nconfirm before continuing.\r\n\r\n**Install and authenticate:**\r\n```bash\r\npip install ms-fabric-cli\r\nfab auth login\r\n```\r\n\r\n**Check if workspace already exists:**\r\n```bash\r\nfab exists \"WORKSPACE_NAME.Workspace\"\r\n```\r\n- Exit code 0 → workspace exists → skip creation, go to role assignment\r\n- Non-zero → proceed to create\r\n\r\n**Create workspace:**\r\n```bash\r\nfab mkdir \"WORKSPACE_NAME.Workspace\" -P capacityName=CAPACITY_NAME\r\n```\r\n\r\n**Verify creation:**\r\n```bash\r\nfab exists \"WORKSPACE_NAME.Workspace\"\r\nfab ls \"WORKSPACE_NAME.Workspace\"\r\n```\r\n\r\n**Resolve principal IDs** (before assigning roles — repeat for each principal):\r\n```bash\r\n# For a user (by UPN / email):\r\naz ad user show --id user@corp.com --query id -o tsv\r\n\r\n# For a group (by display name):\r\naz ad group show --group \"Finance Team\" --query id -o tsv\r\n\r\n# For a service principal (by display name or app ID):\r\naz ad sp show --id \"My App Name\" --query id -o tsv\r\n```\r\n\r\n**Assign roles** (use the resolved Object ID, role in lowercase):\r\n```bash\r\nfab acl set \"WORKSPACE_NAME.Workspace\" -I <RESOLVED_OBJECT_ID> -R role\r\n```\r\n\r\n**Verify roles:**\r\n```bash\r\nfab acl ls \"WORKSPACE_NAME.Workspace\"\r\n```\r\n\r\n**Create domain** (if Step 2 = A):\r\n```bash\r\n# Resolve domain contributor group ID:\r\naz ad group show --group \"DOMAIN_CONTRIBUTOR_GROUP\" --query id -o tsv\r\n\r\nfab mkdir \"DOMAIN_NAME.domain\"\r\nfab acl set \".domains/DOMAIN_NAME.Domain\" -I <RESOLVED_GROUP_ID> -R contributor\r\n```\r\n\r\n**Assign workspace to domain** (if Step 2 = A or B):\r\n```bash\r\nfab assign \".domains/DOMAIN_NAME.Domain\" -W \"WORKSPACE_NAME.Workspace\"\r\n```\r\n\r\n## Step 5 — Generate Workspace Definition\r\n\r\nCollect from the command output (or ask the user):\r\n- Workspace ID (appears in `fab ls` output)\r\n- Tenant name or tenant ID\r\n- Confirmed principals and roles\r\n- Domain 
name (if assigned)\r\n\r\nRun `scripts/generate_definition.py`:\r\n\r\n```bash\r\npython scripts/generate_definition.py \\\r\n --workspace-name \"WORKSPACE_NAME\" \\\r\n --workspace-id \"WORKSPACE_ID\" \\\r\n --capacity-name \"CAPACITY_NAME\" \\\r\n --tenant \"TENANT_NAME\" \\\r\n --roles \"user@corp.com:User:Admin,Finance Team:Group:Member\" \\\r\n [--domain-name \"DOMAIN_NAME\"] \\\r\n --approach \"notebook|powershell|terminal\" \\\r\n --output workspace_definition.md\r\n```\r\n\r\nPresent `workspace_definition.md` to the user.\r\n\r\n## Gotchas\r\n\r\n- Workspace path format is `WorkspaceName.Workspace` — the `.Workspace` suffix is required.\r\n- The capacity must be **Active** before `fab mkdir`. If you see `CapacityNotInActiveState`,\r\n ask the user to resume the capacity in the Azure portal before retrying.\r\n- `notebookutils.credentials.getToken()` in Fabric notebooks **does not support Microsoft Graph**.\r\n The notebook approach requires a Service Principal with `Group.Read.All` + `User.Read.All`\r\n application permissions and admin consent. The SP credentials are entered in Cell 1 of\r\n the generated notebook. If the user doesn't have an SP, direct them to the PowerShell\r\n or Interactive Terminal approach instead.\r\n- Domain creation requires Fabric Administrator tenant-level rights. 
If the user cannot\r\n create a domain, fall back to assigning an existing one or skipping.\r\n- `fab exists` uses exit code (0 = exists, non-zero = not found) — do not rely on stdout text alone.\r\n- In the notebook approach, `notebookutils` is only available inside a Fabric notebook.\r\n The generated script must not be run as a plain Python script outside Fabric.\r\n- The `.domain` suffix (lowercase) is used in `fab mkdir`; `.Domain` (capitalised) is\r\n used in `fab assign` and `fab acl set` — these are different and both matter.\r\n- Role values passed to `fab acl set` must be **lowercase** (`admin`, `member`, `contributor`, `viewer`).\r\n The scripts handle this conversion automatically.\r\n- For PowerShell/terminal approaches, `az login` must be completed before `az ad user/group show` will work.\r\n This is separate from `fab auth login` — both are required.\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_notebook.py`** — Generates PySpark notebook. Run: `python scripts/generate_notebook.py --help`\r\n- **`scripts/generate_ps1.py`** — Generates PowerShell script. Run: `python scripts/generate_ps1.py --help`\r\n- **`scripts/generate_definition.py`** — Generates workspace definition markdown. Run: `python scripts/generate_definition.py --help`\r\n\r\n## Available References\r\n\r\n- **`references/role-assignment.md`** — Approach-specific guidance for assigning roles to users and Entra groups. Load when user wants to assign additional workspace roles.\r\n- **`references/fabric-cli-reference.md`** — Fabric CLI command reference.\r\n",
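The `.domain` / `.Domain` suffix distinction called out in the gotchas above is easy to get wrong. As an illustrative sketch (these helper names are hypothetical and not part of the skill's scripts), the two path forms could be built like this:

```python
def domain_mkdir_path(domain_name: str) -> str:
    # `fab mkdir` expects the lowercase ".domain" suffix
    return f"{domain_name}.domain"

def domain_ref_path(domain_name: str) -> str:
    # `fab assign` and `fab acl set` expect ".domains/<name>.Domain" (capitalised)
    return f".domains/{domain_name}.Domain"

print(domain_mkdir_path("Finance"))  # Finance.domain
print(domain_ref_path("Finance"))    # .domains/Finance.Domain
```

Keeping both forms behind named helpers avoids mixing the suffixes up when composing the terminal commands by hand.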
204 | +

content: "---\r\nname: generate-fabric-workspace\r\ndescription: >\r\n Use this skill when asked to create, provision, or set up a Microsoft Fabric\r\n workspace. Triggers on: \"create a Fabric workspace\", \"provision a workspace\r\n in Fabric\", \"set up a new Fabric workspace\", \"generate a workspace with\r\n capacity and permissions\", \"create workspace and assign roles in Fabric\".\r\n Collects workspace name, capacity, principals/roles, and optional domain\r\n settings, then creates the workspace using one of three approaches: PySpark\r\n notebook, PowerShell script, or interactive terminal commands. Produces a\r\n workspace definition markdown as a creation audit record. Does NOT trigger\r\n for general Fabric questions, item creation within a workspace, or\r\n workspace deletion tasks.\r\nlicense: MIT\r\ncompatibility: >\r\n ms-fabric-cli required (pip install ms-fabric-cli). Approach 1 requires a\r\n Fabric notebook environment. Approaches 2 and 3 require fab CLI installed\r\n locally with network access to Microsoft Fabric.\r\n---\r\n\r\n# Generate Fabric Workspace\r\n\r\n> ⚠️ **GOVERNANCE**: This skill produces notebooks and scripts for the operator to\r\n> review and run — it never executes commands directly against a live Fabric environment.\r\n> Present each generated artefact to the operator before they run it.\r\n>\r\n> ⚠️ **GENERATION**: Always run the generator scripts (`scripts/generate_notebook.py`,\r\n> `scripts/generate_ps1.py`) via Bash to produce artefacts — never generate notebook\r\n> or script content directly. Do not present generator scripts themselves as outputs.\r\n>\r\n> **Canonical notebook pattern** — every generated PySpark notebook follows this\r\n> exact cell structure. Do not deviate:\r\n> 1. `%pip install ms-fabric-cli -q --no-warn-conflicts` (Cell 1 — restart kernel after)\r\n> 2. 
`notebookutils.credentials.getToken('pbi')` and `getToken('storage')` → set as\r\n> `os.environ['FAB_TOKEN']`, `FAB_TOKEN_ONELAKE`, `FAB_TOKEN_AZURE` (Cell 2 — auth)\r\n> 3. All workspace operations use `!fab` shell commands — `!fab mkdir`, `!fab get`,\r\n> `!fab acl set`, `!fab api`, etc. Python subprocess is never used.\r\n\r\nCreates a Microsoft Fabric workspace assigned to a specified capacity, with\r\naccess roles and optional domain assignment. If the workspace already exists,\r\ncreation is skipped and roles/domain are updated. Outputs a workspace\r\ndefinition markdown as an audit trail.\r\n\r\n## Orchestrated Context\r\n\r\nWhen invoked from a workflow agent, read `00-environment-discovery/environment-profile.md`\r\nand the SOP before asking the user anything.\r\n\r\n| Parameter | Source when orchestrated |\r\n|---|---|\r\n| Deployment approach (notebook / PowerShell / terminal) | Environment profile |\r\n| Capacity name | Environment profile |\r\n| Workspace name(s) | Environment profile or implementation plan |\r\n| Access control method + Object ID resolution | Environment profile |\r\n| Domain assignment approach | Environment profile |\r\n| Credential management approach (Key Vault / runtime) | Environment profile |\r\n| Domain name, role assignments, group names | SOP shared parameters |\r\n\r\n**Only ask for parameters not found in these documents.** Summarise what was resolved\r\nautomatically, then ask for what remains.\r\n\r\n## Step 1 — Choose Approach\r\n\r\nAsk the user:\r\n\r\n> \"Which approach would you like to use?\r\n> 1. **PySpark Notebook** — generates a notebook to run inside Fabric\r\n> (authenticated automatically via the notebook environment)\r\n> 2. **PowerShell Script** — generates a `.ps1` for your review before execution\r\n> (requires fab CLI installed locally)\r\n> 3. 
**Interactive Terminal** — runs fab CLI commands one by one in the terminal,\r\n> with your confirmation at each step (requires fab CLI installed locally)\"\r\n\r\n### Authentication by approach\r\n\r\n| Approach | Authentication |\r\n|---|---|\r\n| PySpark Notebook | Auto via `notebookutils.credentials.getToken('pbi')` inside Fabric |\r\n| PowerShell / Terminal | `fab auth login` (browser pop-up) or set `$env:FAB_TOKEN` / `FAB_TOKEN` |\r\n\r\n## Step 2 — Domain Handling\r\n\r\nAsk the user:\r\n\r\n> \"Would you like to:\r\n> A. **Create a new domain** and assign the workspace to it\r\n> ⚠️ Requires **Fabric Admin** tenant-level permissions.\r\n> You will also need to specify an **Entra group** that will be allowed to\r\n> add/remove workspaces from this domain (the domain contributor group).\r\n> B. **Assign the workspace to an existing domain**\r\n> C. **Skip domain assignment**\"\r\n\r\n- If **A**: collect `DOMAIN_NAME` and `DOMAIN_CONTRIBUTOR_GROUP` (the Entra\r\n group display name allowed to add/remove workspaces from the domain). Confirm\r\n the user has Fabric Admin rights.\r\n- If **B**: collect `DOMAIN_NAME` only.\r\n- If **C**: no domain parameters needed.\r\n\r\n## Step 3 — Collect Parameters\r\n\r\nCollect these values from the user:\r\n\r\n| Parameter | Required | Description |\r\n|---|---|---|\r\n| `WORKSPACE_NAME` | Yes | Display name for the workspace |\r\n| `CAPACITY_NAME` | Yes | Exact name of the Fabric capacity to assign |\r\n| `DOMAIN_NAME` | If A or B | Name of the domain (new or existing) |\r\n| `DOMAIN_CONTRIBUTOR_GROUP` | If A | Display name of the Entra group that manages the domain |\r\n| `WORKSPACE_ROLES` | Conditional | Additional principals + roles (see approach-specific guidance below) |\r\n\r\n### Workspace roles — approach-specific guidance\r\n\r\nThe workspace creator is **automatically assigned as Admin**. Before collecting\r\nadditional roles, ask:\r\n\r\n> \"You (the creator) will be automatically assigned as workspace Admin. 
Do you\r\n> want to assign additional roles to other users or groups?\"\r\n\r\nIf **no**, skip role collection entirely. If **yes**, load\r\n`references/role-assignment.md` for approach-specific guidance on collecting\r\nprincipals, group resolution requirements, and Service Principal prerequisites.\r\n\r\nFor each additional principal, collect:\r\n- User **email address (UPN)** or Entra **group display name** — do NOT ask for Object IDs\r\n- Principal type: `User` or `Group` (or `ServicePrincipal`)\r\n- Role: `Admin`, `Member`, `Contributor`, or `Viewer`\r\n\r\n## Step 4 — Execute\r\n\r\n### Approach 1: PySpark Notebook\r\n\r\nIf role assignment includes Entra groups, `TENANT_ID`, `CLIENT_ID`, and `CLIENT_SECRET`\r\nare required — entered directly into Cell 1 of the generated notebook. See\r\n`references/role-assignment.md` for prerequisite details.\r\n\r\nRun `scripts/generate_notebook.py` with the collected parameters:\r\n\r\n```bash\r\npython scripts/generate_notebook.py \\\r\n --workspace-name \"WORKSPACE_NAME\" \\\r\n --capacity-name \"CAPACITY_NAME\" \\\r\n --roles \"user@corp.com:User:Admin,Finance Team:Group:Member\" \\\r\n [--domain-name \"DOMAIN_NAME\"] \\\r\n [--create-domain] \\\r\n [--domain-contributor-group \"DOMAIN_CONTRIBUTOR_GROUP\"] \\\r\n --output workspace_setup.ipynb\r\n```\r\n\r\nPresent the generated `workspace_setup.ipynb` to the user and instruct them to:\r\n1. Upload to any Fabric workspace as a notebook\r\n2. Run each cell **one at a time**, reading the output before proceeding\r\n3. ✅ Verification cells are clearly marked — confirm output before moving on\r\n4. 
Share the output of Cell 7 (`fab ls`) and Cell 9 (`fab acl ls`)\r\n\r\n### Approach 2: PowerShell Script\r\n\r\nRun `scripts/generate_ps1.py` with the collected parameters:\r\n\r\n```bash\r\npython scripts/generate_ps1.py \\\r\n --workspace-name \"WORKSPACE_NAME\" \\\r\n --capacity-name \"CAPACITY_NAME\" \\\r\n --roles \"user@corp.com:User:Admin,Finance Team:Group:Member\" \\\r\n [--domain-name \"DOMAIN_NAME\"] \\\r\n [--create-domain] \\\r\n [--domain-contributor-group \"DOMAIN_CONTRIBUTOR_GROUP\"] \\\r\n --output workspace_setup.ps1\r\n```\r\n\r\nShow `workspace_setup.ps1` to the user for review. **Do not execute until the\r\nuser confirms.** Then run:\r\n\r\n```powershell\r\n.\\workspace_setup.ps1\r\n```\r\n\r\n### Approach 3: Interactive Terminal\r\n\r\nRun these commands in sequence. Show output after each and ask the user to\r\nconfirm before continuing.\r\n\r\n**Install and authenticate:**\r\n```bash\r\npip install ms-fabric-cli\r\nfab auth login\r\n```\r\n\r\n**Check if workspace already exists:**\r\n```bash\r\nfab exists \"WORKSPACE_NAME.Workspace\"\r\n```\r\n- Exit code 0 → workspace exists → skip creation, go to role assignment\r\n- Non-zero → proceed to create\r\n\r\n**Create workspace:**\r\n```bash\r\nfab mkdir \"WORKSPACE_NAME.Workspace\" -P capacityName=CAPACITY_NAME\r\n```\r\n\r\n**Verify creation:**\r\n```bash\r\nfab exists \"WORKSPACE_NAME.Workspace\"\r\nfab ls \"WORKSPACE_NAME.Workspace\"\r\n```\r\n\r\n**Resolve principal IDs** (before assigning roles — repeat for each principal):\r\n```bash\r\n# For a user (by UPN / email):\r\naz ad user show --id user@corp.com --query id -o tsv\r\n\r\n# For a group (by display name):\r\naz ad group show --group \"Finance Team\" --query id -o tsv\r\n\r\n# For a service principal (by display name or app ID):\r\naz ad sp show --id \"My App Name\" --query id -o tsv\r\n```\r\n\r\n**Assign roles** (use the resolved Object ID, role in lowercase):\r\n```bash\r\nfab acl set \"WORKSPACE_NAME.Workspace\" -I 
<RESOLVED_OBJECT_ID> -R role\r\n```\r\n\r\n**Verify roles:**\r\n```bash\r\nfab acl ls \"WORKSPACE_NAME.Workspace\"\r\n```\r\n\r\n**Create domain** (if Step 2 = A):\r\n```bash\r\n# Resolve domain contributor group ID:\r\naz ad group show --group \"DOMAIN_CONTRIBUTOR_GROUP\" --query id -o tsv\r\n\r\nfab mkdir \"DOMAIN_NAME.domain\"\r\nfab acl set \".domains/DOMAIN_NAME.Domain\" -I <RESOLVED_GROUP_ID> -R contributor\r\n```\r\n\r\n**Assign workspace to domain** (if Step 2 = A or B):\r\n```bash\r\nfab assign \".domains/DOMAIN_NAME.Domain\" -W \"WORKSPACE_NAME.Workspace\"\r\n```\r\n\r\n## Step 5 — Generate Workspace Definition\r\n\r\nCollect from the command output (or ask the user):\r\n- Workspace ID (appears in `fab ls` output)\r\n- Tenant name or tenant ID\r\n- Confirmed principals and roles\r\n- Domain name (if assigned)\r\n\r\nRun `scripts/generate_definition.py`:\r\n\r\n```bash\r\npython scripts/generate_definition.py \\\r\n --workspace-name \"WORKSPACE_NAME\" \\\r\n --workspace-id \"WORKSPACE_ID\" \\\r\n --capacity-name \"CAPACITY_NAME\" \\\r\n --tenant \"TENANT_NAME\" \\\r\n --roles \"user@corp.com:User:Admin,Finance Team:Group:Member\" \\\r\n [--domain-name \"DOMAIN_NAME\"] \\\r\n --approach \"notebook|powershell|terminal\" \\\r\n --output workspace_definition.md\r\n```\r\n\r\nPresent `workspace_definition.md` to the user.\r\n\r\n## Gotchas\r\n\r\n- Workspace path format is `WorkspaceName.Workspace` — the `.Workspace` suffix is required.\r\n- The capacity must be **Active** before `fab mkdir`. If you see `CapacityNotInActiveState`,\r\n ask the user to resume the capacity in the Azure portal before retrying.\r\n- `notebookutils.credentials.getToken()` in Fabric notebooks **does not support Microsoft Graph**.\r\n The notebook approach requires a Service Principal with `Group.Read.All` + `User.Read.All`\r\n application permissions and admin consent. The SP credentials are entered in Cell 1 of\r\n the generated notebook. 
If the user doesn't have an SP, direct them to the PowerShell\r\n or Interactive Terminal approach instead.\r\n- Domain creation requires Fabric Administrator tenant-level rights. If the user cannot\r\n create a domain, fall back to assigning an existing one or skipping.\r\n- `fab exists` uses exit code (0 = exists, non-zero = not found) — do not rely on stdout text alone.\r\n- In the notebook approach, `notebookutils` is only available inside a Fabric notebook.\r\n The generated script must not be run as a plain Python script outside Fabric.\r\n- The `.domain` suffix (lowercase) is used in `fab mkdir`; `.Domain` (capitalised) is\r\n used in `fab assign` and `fab acl set` — these are different and both matter.\r\n- Role values passed to `fab acl set` must be **lowercase** (`admin`, `member`, `contributor`, `viewer`).\r\n The scripts handle this conversion automatically.\r\n- For PowerShell/terminal approaches, `az login` must be completed before `az ad user/group show` will work.\r\n This is separate from `fab auth login` — both are required.\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_notebook.py`** — Generates PySpark notebook. Run: `python scripts/generate_notebook.py --help`\r\n- **`scripts/generate_ps1.py`** — Generates PowerShell script. Run: `python scripts/generate_ps1.py --help`\r\n- **`scripts/generate_definition.py`** — Generates workspace definition markdown. Run: `python scripts/generate_definition.py --help`\r\n\r\n## Available References\r\n\r\n- **`references/role-assignment.md`** — Approach-specific guidance for assigning roles to users and Entra groups. Load when user wants to assign additional workspace roles.\r\n- **`references/fabric-cli-reference.md`** — Fabric CLI command reference.\r\n",
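The `--roles` value shared by all three generator scripts is a comma-separated list of `principal:Type:Role` triples. A minimal sketch of how such a string might be parsed, with the lowercase conversion `fab acl set` needs (assuming principals contain no commas or colons; the generator scripts' actual parsing may differ):

```python
def parse_roles(spec: str) -> list:
    """Split 'principal:Type:Role' triples; lowercase roles for `fab acl set`."""
    entries = []
    for item in spec.split(","):
        principal, ptype, role = (part.strip() for part in item.split(":"))
        entries.append((principal, ptype, role.lower()))
    return entries

print(parse_roles("user@corp.com:User:Admin,Finance Team:Group:Member"))
```

Group display names such as `Finance Team` are kept verbatim here; they still need resolving to Object IDs (via `az ad group show`) before any `fab acl set` call.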
205 205 | },
206 206 | {
207 207 | relativePath: "references/fabric-cli-reference.md",
@@ -231,7 +231,7 @@ export const EMBEDDED_SKILLS = [
231 231 | files: [
232 232 | {
233 233 | relativePath: "SKILL.md",
234 | -
content: "---\r\nname: pdf-to-bronze-delta-tables\r\ndescription: >\r\n Use this skill to extract structured data from PDF files on an operator's\r\n local machine, upload them to a Microsoft Fabric bronze lakehouse, and convert\r\n them to a delta table using AI-powered field extraction. Triggers on: \"create\r\n delta tables from PDFs\", \"extract data from PDF invoices to Fabric\", \"load\r\n PDFs into bronze lakehouse\", \"parse PDF documents to delta format\", \"ingest\r\n PDF files to Fabric tables\". Does NOT trigger for CSV/Excel ingestion,\r\n transforming existing delta tables, or non-Fabric storage targets.\r\nlicense: MIT\r\ncompatibility: >\r\n Python 3.8+ for scripts/. Fabric CLI (fab) for CLI upload option.\r\n Fabric notebook runtime 1.3 required (for synapse.ml.aifunc).\r\n---\r\n\r\n# PDF to Bronze Delta Tables\r\n\r\nUploads PDF files from a local machine to a Microsoft Fabric bronze lakehouse\r\nand converts each PDF into a row in a delta table using AI field extraction.\r\nThe lakehouse must already exist.\r\n\r\n> ⚠️ **GOVERNANCE RULE**: This skill **never executes `fab` CLI commands directly**.\r\n> All `fab` commands are written to a PowerShell script for the operator to run.\r\n\r\n## Inputs\r\n\r\n| Parameter | Description | Example |\r\n|-----------|-------------|---------|\r\n| `WORKSPACE_NAME` | Fabric workspace name (exact, case-sensitive) | `\"Landon Finance Month End\"` |\r\n| `LAKEHOUSE_NAME` | Bronze lakehouse name (exact, case-sensitive) | `\"Lh_landon_finance_bronze\"` |\r\n| `LAKEHOUSE_FILES_FOLDER` | Folder name under lakehouse Files section | `\"Booking PDFs\"` |\r\n| `TABLE_NAME` | Target delta table name (snake_case) | `\"booking_invoices\"` |\r\n| `LOCAL_PDF_FOLDER` | Exact absolute path to local PDF folder (CLI upload only) | `\"C:\\Users\\rishi\\Data\\Booking PDFs\"` |\r\n| `FIELDS` | Fields to extract from each PDF — collected in Step 2 | See workflow |\r\n\r\n## Workflow\r\n\r\n- [ ] **Collect parameters** — If 
`WORKSPACE_NAME` or `LAKEHOUSE_NAME` are not\r\n provided, ask the operator for them before proceeding.\r\n\r\n- [ ] **Suggest and confirm extraction fields** — Before asking the operator to\r\n define fields from scratch, the agent should **read a sample PDF** to understand\r\n the document structure and proactively suggest fields:\r\n\r\n 1. Use `pdfplumber` (or equivalent) to extract text from 1–2 sample PDFs in\r\n `LOCAL_PDF_FOLDER`. If a second PDF is from a different sub-group (e.g.\r\n different property/entity), include it to confirm layout consistency.\r\n 2. Identify all extractable fields from the document structure (headers, labels,\r\n line items, totals, payment details, etc.).\r\n 3. Present the suggested fields to the operator in a table format, split into:\r\n - **Header-level fields** (one row per PDF) — for the main table\r\n - **Line-item fields** (multiple rows per PDF) — for the detail table, if\r\n the document contains repeating line items\r\n 4. For each field, show: `snake_case` name, extraction hint for the AI, and an\r\n example value from the sample PDF.\r\n 5. Ask the operator:\r\n - \"Do these fields look right? Anything to add, remove, or rename?\"\r\n - \"What should the main delta table be named?\" → `TABLE_NAME`\r\n - \"Do you want a second table for line/detail items?\" If yes:\r\n → `LINE_ITEMS_TABLE_NAME` and confirm the line-item fields\r\n - \"What folder name will the PDFs be stored in under the lakehouse Files\r\n section?\" → `LAKEHOUSE_FILES_FOLDER`\r\n 6. 
**Do not proceed until the operator confirms the fields.**\r\n\r\n Build `FIELDS` as a JSON array: `[{\"name\": \"...\", \"description\": \"...\"}, ...]`\r\n\r\n If the operator confirmed a second line-items table, build `LINE_ITEMS_FIELDS`\r\n as a JSON array: `[{\"name\": \"...\", \"description\": \"...\"}, ...]`\r\n\r\n- [ ] **Upload PDFs** — Present these three options and ask the operator to choose:\r\n\r\n **Option 1 — OneLake File Explorer (Manual)**\r\n Drag-and-drop the PDFs into the target folder under the lakehouse Files section\r\n using the OneLake File Explorer desktop app. No agent action required.\r\n\r\n **Option 2 — Fabric UI (Manual)**\r\n In the Fabric browser UI navigate to the lakehouse → Files section → open or\r\n create the `LAKEHOUSE_FILES_FOLDER` folder → click **Upload** and select the\r\n PDF files. No agent action required.\r\n\r\n **Option 3 — Fabric CLI (Automated)**\r\n > ⚠️ **Requires PowerShell** — generates a `.ps1` script. PowerShell is available\r\n > on Windows natively and on Mac/Linux via `brew install powershell`. If PowerShell\r\n > is not available and the operator does not want to install it, use Option 1 or 2.\r\n > Do not substitute a bash or shell script.\r\n >\r\n > ⚠️ **Performance note**: The CLI uploads files one at a time. For large\r\n > batches (50+ files) this is significantly slower than Options 1 or 2.\r\n > Recommend Options 1 or 2 for bulk uploads.\r\n\r\n Ask for `LOCAL_PDF_FOLDER` (exact absolute path). 
Then run:\r\n ```\r\n python scripts/generate_upload_commands.py \\\r\n --local-folder \"<LOCAL_PDF_FOLDER>\" \\\r\n --workspace \"<WORKSPACE_NAME>\" \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-script \"<OUTPUT_FOLDER>/upload_pdf_files.ps1\"\r\n ```\r\n Present the script path to the operator and ask them to run it with `pwsh upload_pdf_files.ps1`.\r\n\r\n## Output Folder\r\n\r\nBefore beginning, create the output folder:\r\n```\r\noutputs/pdf-to-bronze-delta-tables_{YYYY-MM-DD_HH-MM}_{USERNAME}/\r\n```\r\nAll generated scripts and notebooks for this run are saved here.\r\n\r\n- [ ] **Confirm upload** — Ask the operator to confirm all PDFs are visible in the\r\n lakehouse Files section before proceeding.\r\n\r\n- [ ] **Generate TEST notebook** — Run:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --table-name \"<TABLE_NAME>\" \\\r\n --fields-json \"<FIELDS_JSON>\" \\\r\n [--line-items-table-name \"<LINE_ITEMS_TABLE_NAME>\"] \\\r\n [--line-items-fields-json \"<LINE_ITEMS_FIELDS_JSON>\"] \\\r\n --test-mode \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\pdf_to_delta_TEST.ipynb\"\r\n ```\r\n Where `<FIELDS_JSON>` is the JSON array built from `FIELDS` above, as a\r\n single-line string (e.g. `'[{\"name\":\"invoice_number\",\"description\":\"...\"}]'`).\r\n Include `--line-items-table-name` and `--line-items-fields-json` if a second\r\n line-items table was requested — both must be provided together.\r\n\r\n Tell the operator:\r\n 1. Go to the workspace → **New** → **Import notebook**\r\n 2. Select `pdf_to_delta_TEST.ipynb`\r\n 3. Click **Run All** — the notebook attaches the lakehouse automatically and\r\n processes **one PDF only**\r\n 4. 
Share the output row displayed at the end of the notebook\r\n\r\n- [ ] **Validate and iterate** — Review the output row the operator shares:\r\n - Check each field has a value and it looks correct\r\n - If a field is missing or wrong: update its description in `FIELDS_JSON`,\r\n regenerate the TEST notebook, and ask the operator to re-run it\r\n - Repeat until all fields are correct\r\n - **Do not proceed to full run until the test row is confirmed correct**\r\n\r\n- [ ] **Generate FULL notebook** — Once test output is confirmed, run the same\r\n command **without** `--test-mode`:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --table-name \"<TABLE_NAME>\" \\\r\n --fields-json \"<FIELDS_JSON>\" \\\r\n [--line-items-table-name \"<LINE_ITEMS_TABLE_NAME>\"] \\\r\n [--line-items-fields-json \"<LINE_ITEMS_FIELDS_JSON>\"] \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\pdf_to_delta_FULL.ipynb\"\r\n ```\r\n Tell the operator to import and run `pdf_to_delta_FULL.ipynb`. This processes\r\n all PDFs in the folder.\r\n\r\n- [ ] **Validate final table** — Ask the operator to confirm:\r\n - Delta table `<TABLE_NAME>` appears in the Tables section of the lakehouse\r\n - Row count matches the number of PDFs uploaded\r\n - Spot-check a few rows for data quality\r\n\r\n## Table Naming\r\n\r\n- Use a descriptive `snake_case` name based on the document type, not the filename\r\n- PDFs are individual records — do not derive table name from filenames\r\n- Ask the operator to confirm the table name before generating any notebook\r\n\r\n## Gotchas\r\n\r\n- **AI features must be enabled on the capacity.** `synapse.ml.aifunc` uses Fabric's\r\n built-in AI endpoint — no Azure OpenAI key needed. 
Prerequisites: (1) paid Fabric\r\n capacity F2 or higher, (2) tenant admin must enable \"Copilot and other features\r\n powered by Azure OpenAI\" in Admin portal → Tenant settings, (3) if capacity is\r\n outside an Azure OpenAI region, also enable the cross-geo processing toggle.\r\n- **Default model is `gpt-4.1-mini`.** If the notebook throws `DeploymentConfigNotFound`,\r\n the `MODEL_DEPLOYMENT_NAME` in the configuration cell doesn't match a model on\r\n the built-in endpoint. Check supported models at\r\n https://learn.microsoft.com/en-us/fabric/data-science/ai-services/ai-services-overview\r\n- `fab cp` requires `./filename` (forward slash) syntax. Absolute Windows paths\r\n (`C:\\...`) cause `[NotSupported]` errors. The generated script uses `Push-Location`\r\n to work around this — do not modify this pattern.\r\n- **Destination folder must exist before uploading.** The script runs `fab mkdir` first.\r\n Running `fab mkdir` on an existing folder is safe.\r\n- `WORKSPACE_NAME` and `LAKEHOUSE_NAME` are case-sensitive.\r\n- The notebook uses `synapse.ml.aifunc` which requires Fabric **runtime 1.3**.\r\n If the operator sees import errors, check runtime version in notebook settings.\r\n- The `%%configure` cell attaches the lakehouse automatically — no manual\r\n attachment needed before clicking Run All.\r\n- AI extraction temperature is set to `0.0` for consistency, but it is still\r\n non-deterministic across different PDF layouts. Always validate with TEST mode first.\r\n- All extracted fields are written as strings. If the operator needs typed columns\r\n (dates, numbers), add a post-processing step after confirming extraction is correct.\r\n- **Column names come from AI extraction.** The delta table column names match\r\n the `name` field in the `FIELDS` JSON array provided during setup. 
These are\r\n `snake_case` names chosen by the operator (e.g., `invoice_number`, `hotel_name`).\r\n They do NOT follow the same `clean_columns()` convention used by the\r\n `csv-to-bronze-delta-tables` skill. Downstream skills (e.g.,\r\n `create-materialised-lakeview-scripts`) must verify actual delta table column\r\n names rather than assuming any naming convention.\r\n- The notebook installs `openai` and `pymupdf4llm` at runtime. The `synapse.ml.aifunc`\r\n package is pre-installed in Fabric Runtime 1.3+.\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_upload_commands.py`** — Scans a local folder for PDFs and\r\n writes a PowerShell script of `fab cp` upload commands.\r\n Run: `python scripts/generate_upload_commands.py --help`\r\n- **`scripts/generate_notebook.py`** — Generates a Fabric-compatible `.ipynb`\r\n notebook with the AI extraction prompt pre-populated from the supplied fields.\r\n Supports `--test-mode` for single-PDF validation runs.\r\n Run: `python scripts/generate_notebook.py --help`\r\n",
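The `--fields-json` argument expected by `scripts/generate_notebook.py` is a single-line JSON array of name/description objects. A small sketch for building it from the operator-confirmed field pairs (this helper is illustrative, not one of the skill's scripts):

```python
import json

def build_fields_json(fields: list) -> str:
    # fields: (snake_case_name, extraction_hint) pairs confirmed by the operator
    return json.dumps(
        [{"name": name, "description": desc} for name, desc in fields],
        separators=(",", ":"),  # compact single-line form for --fields-json
    )

print(build_fields_json([("invoice_number", "The unique invoice number")]))
```

Using `json.dumps` rather than hand-assembling the string keeps quoting safe when extraction hints themselves contain quotes or colons.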
234 | +
content: "---\r\nname: pdf-to-bronze-delta-tables\r\ndescription: >\r\n Use this skill to extract structured data from PDF files on an operator's\r\n local machine, upload them to a Microsoft Fabric bronze lakehouse, and convert\r\n them to a delta table using AI-powered field extraction. Triggers on: \"create\r\n delta tables from PDFs\", \"extract data from PDF invoices to Fabric\", \"load\r\n PDFs into bronze lakehouse\", \"parse PDF documents to delta format\", \"ingest\r\n PDF files to Fabric tables\". Does NOT trigger for CSV/Excel ingestion,\r\n transforming existing delta tables, or non-Fabric storage targets.\r\nlicense: MIT\r\ncompatibility: >\r\n Python 3.8+ for scripts/. Fabric CLI (fab) for CLI upload option.\r\n Fabric notebook runtime 1.3 required (for synapse.ml.aifunc).\r\n---\r\n\r\n# PDF to Bronze Delta Tables\r\n\r\nUploads PDF files from a local machine to a Microsoft Fabric bronze lakehouse\r\nand converts each PDF into a row in a delta table using AI field extraction.\r\nThe lakehouse must already exist.\r\n\r\n> ⚠️ **GOVERNANCE RULE**: This skill **never executes `fab` CLI commands directly**.\r\n> All `fab` commands are written to a PowerShell script for the operator to run.\r\n>\r\n> ⚠️ **GENERATION**: Always run `scripts/generate_notebook.py` via Bash to produce\r\n> the `.ipynb` notebook — never generate notebook cell content directly. The\r\n> generated notebook uses native PySpark with `synapse.ml.aifunc` for AI extraction\r\n> — it does not use `fab` CLI or `FAB_TOKEN` auth.\r\n\r\n## Orchestrated Context\r\n\r\nWhen invoked from a workflow agent, read `00-environment-discovery/environment-profile.md`\r\nand the SOP before asking the user anything.\r\n\r\n| Parameter | Source when orchestrated |\r\n|---|---|\r\n| Workspace name | Environment profile or implementation plan |\r\n| Lakehouse name | SOP shared parameters (from lakehouse creation step) |\r\n\r\n**Only ask for parameters not found in these documents** (e.g. 
local PDF folder path,\r\ndestination folder, table name, extraction field definitions).\r\n\r\n## Inputs\r\n\r\n| Parameter | Description | Example |\r\n|-----------|-------------|---------|\r\n| `WORKSPACE_NAME` | Fabric workspace name (exact, case-sensitive) | `\"Landon Finance Month End\"` |\r\n| `LAKEHOUSE_NAME` | Bronze lakehouse name (exact, case-sensitive) | `\"Lh_landon_finance_bronze\"` |\r\n| `LAKEHOUSE_FILES_FOLDER` | Folder name under lakehouse Files section | `\"Booking PDFs\"` |\r\n| `TABLE_NAME` | Target delta table name (snake_case) | `\"booking_invoices\"` |\r\n| `LOCAL_PDF_FOLDER` | Exact absolute path to local PDF folder (CLI upload only) | `\"C:\\Users\\rishi\\Data\\Booking PDFs\"` |\r\n| `FIELDS` | Fields to extract from each PDF — collected in Step 2 | See workflow |\r\n\r\n## Workflow\r\n\r\n- [ ] **Collect parameters** — If `WORKSPACE_NAME` or `LAKEHOUSE_NAME` are not\r\n provided, ask the operator for them before proceeding.\r\n\r\n- [ ] **Suggest and confirm extraction fields** — Before asking the operator to\r\n define fields from scratch, the agent should **read a sample PDF** to understand\r\n the document structure and proactively suggest fields:\r\n\r\n 1. Use `pdfplumber` (or equivalent) to extract text from 1–2 sample PDFs in\r\n `LOCAL_PDF_FOLDER`. If a second PDF is from a different sub-group (e.g.\r\n different property/entity), include it to confirm layout consistency.\r\n 2. Identify all extractable fields from the document structure (headers, labels,\r\n line items, totals, payment details, etc.).\r\n 3. Present the suggested fields to the operator in a table format, split into:\r\n - **Header-level fields** (one row per PDF) — for the main table\r\n - **Line-item fields** (multiple rows per PDF) — for the detail table, if\r\n the document contains repeating line items\r\n 4. For each field, show: `snake_case` name, extraction hint for the AI, and an\r\n example value from the sample PDF.\r\n 5. 
Ask the operator:\r\n - \"Do these fields look right? Anything to add, remove, or rename?\"\r\n - \"What should the main delta table be named?\" → `TABLE_NAME`\r\n - \"Do you want a second table for line/detail items?\" If yes:\r\n → `LINE_ITEMS_TABLE_NAME` and confirm the line-item fields\r\n - \"What folder name will the PDFs be stored in under the lakehouse Files\r\n section?\" → `LAKEHOUSE_FILES_FOLDER`\r\n 6. **Do not proceed until the operator confirms the fields.**\r\n\r\n Build `FIELDS` as a JSON array: `[{\"name\": \"...\", \"description\": \"...\"}, ...]`\r\n\r\n If the operator confirmed a second line-items table, build `LINE_ITEMS_FIELDS`\r\n as a JSON array: `[{\"name\": \"...\", \"description\": \"...\"}, ...]`\r\n\r\n- [ ] **Upload PDFs** — Present these three options and ask the operator to choose:\r\n\r\n **Option 1 — OneLake File Explorer (Manual)**\r\n Drag-and-drop the PDFs into the target folder under the lakehouse Files section\r\n using the OneLake File Explorer desktop app. No agent action required.\r\n\r\n **Option 2 — Fabric UI (Manual)**\r\n In the Fabric browser UI navigate to the lakehouse → Files section → open or\r\n create the `LAKEHOUSE_FILES_FOLDER` folder → click **Upload** and select the\r\n PDF files. No agent action required.\r\n\r\n **Option 3 — Fabric CLI (Automated)**\r\n > ⚠️ **Requires PowerShell** — generates a `.ps1` script. PowerShell is available\r\n > on Windows natively and on Mac/Linux via `brew install powershell`. If PowerShell\r\n > is not available and the operator does not want to install it, use Option 1 or 2.\r\n > Do not substitute a bash or shell script.\r\n >\r\n > ⚠️ **Performance note**: The CLI uploads files one at a time. For large\r\n > batches (50+ files) this is significantly slower than Options 1 or 2.\r\n > Recommend Options 1 or 2 for bulk uploads.\r\n\r\n Ask for `LOCAL_PDF_FOLDER` (exact absolute path). 
Then run:\r\n ```\r\n python scripts/generate_upload_commands.py \\\r\n --local-folder \"<LOCAL_PDF_FOLDER>\" \\\r\n --workspace \"<WORKSPACE_NAME>\" \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-script \"<OUTPUT_FOLDER>/upload_pdf_files.ps1\"\r\n ```\r\n Present the script path to the operator and ask them to run it with `pwsh upload_pdf_files.ps1`.\r\n\r\n## Output Folder\r\n\r\nBefore beginning, create the output folder:\r\n```\r\noutputs/pdf-to-bronze-delta-tables_{YYYY-MM-DD_HH-MM}_{USERNAME}/\r\n```\r\nAll generated scripts and notebooks for this run are saved here.\r\n\r\n- [ ] **Confirm upload** — Ask the operator to confirm all PDFs are visible in the\r\n lakehouse Files section before proceeding.\r\n\r\n- [ ] **Generate TEST notebook** — Run:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --table-name \"<TABLE_NAME>\" \\\r\n --fields-json \"<FIELDS_JSON>\" \\\r\n [--line-items-table-name \"<LINE_ITEMS_TABLE_NAME>\"] \\\r\n [--line-items-fields-json \"<LINE_ITEMS_FIELDS_JSON>\"] \\\r\n --test-mode \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\pdf_to_delta_TEST.ipynb\"\r\n ```\r\n Where `<FIELDS_JSON>` is the JSON array built from `FIELDS` above, as a\r\n single-line string (e.g. `'[{\"name\":\"invoice_number\",\"description\":\"...\"}]'`).\r\n Include `--line-items-table-name` and `--line-items-fields-json` if a second\r\n line-items table was requested — both must be provided together.\r\n\r\n Tell the operator:\r\n 1. Go to the workspace → **New** → **Import notebook**\r\n 2. Select `pdf_to_delta_TEST.ipynb`\r\n 3. Click **Run All** — the notebook attaches the lakehouse automatically and\r\n processes **one PDF only**\r\n 4. 
Share the output row displayed at the end of the notebook\r\n\r\n- [ ] **Validate and iterate** — Review the output row the operator shares:\r\n - Check each field has a value and it looks correct\r\n - If a field is missing or wrong: update its description in `FIELDS_JSON`,\r\n regenerate the TEST notebook, and ask the operator to re-run it\r\n - Repeat until all fields are correct\r\n - **Do not proceed to full run until the test row is confirmed correct**\r\n\r\n- [ ] **Generate FULL notebook** — Once test output is confirmed, run the same\r\n command **without** `--test-mode`:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --table-name \"<TABLE_NAME>\" \\\r\n --fields-json \"<FIELDS_JSON>\" \\\r\n [--line-items-table-name \"<LINE_ITEMS_TABLE_NAME>\"] \\\r\n [--line-items-fields-json \"<LINE_ITEMS_FIELDS_JSON>\"] \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\pdf_to_delta_FULL.ipynb\"\r\n ```\r\n Tell the operator to import and run `pdf_to_delta_FULL.ipynb`. This processes\r\n all PDFs in the folder.\r\n\r\n- [ ] **Validate final table** — Ask the operator to confirm:\r\n - Delta table `<TABLE_NAME>` appears in the Tables section of the lakehouse\r\n - Row count matches the number of PDFs uploaded\r\n - Spot-check a few rows for data quality\r\n\r\n## Table Naming\r\n\r\n- Use a descriptive `snake_case` name based on the document type, not the filename\r\n- PDFs are individual records — do not derive table name from filenames\r\n- Ask the operator to confirm the table name before generating any notebook\r\n\r\n## Gotchas\r\n\r\n- **AI features must be enabled on the capacity.** `synapse.ml.aifunc` uses Fabric's\r\n built-in AI endpoint — no Azure OpenAI key needed. 
Prerequisites: (1) paid Fabric\r\n capacity F2 or higher, (2) tenant admin must enable \"Copilot and other features\r\n powered by Azure OpenAI\" in Admin portal → Tenant settings, (3) if capacity is\r\n outside an Azure OpenAI region, also enable the cross-geo processing toggle.\r\n- **Default model is `gpt-4.1-mini`.** If the notebook throws `DeploymentConfigNotFound`,\r\n the `MODEL_DEPLOYMENT_NAME` in the configuration cell doesn't match a model on\r\n the built-in endpoint. Check supported models at\r\n https://learn.microsoft.com/en-us/fabric/data-science/ai-services/ai-services-overview\r\n- `fab cp` requires `./filename` (forward slash) syntax. Absolute Windows paths\r\n (`C:\\...`) cause `[NotSupported]` errors. The generated script uses `Push-Location`\r\n to work around this — do not modify this pattern.\r\n- **Destination folder must exist before uploading.** The script runs `fab mkdir` first.\r\n Running `fab mkdir` on an existing folder is safe.\r\n- `WORKSPACE_NAME` and `LAKEHOUSE_NAME` are case-sensitive.\r\n- The notebook uses `synapse.ml.aifunc` which requires Fabric **runtime 1.3**.\r\n If the operator sees import errors, check runtime version in notebook settings.\r\n- The `%%configure` cell attaches the lakehouse automatically — no manual\r\n attachment needed before clicking Run All.\r\n- AI extraction temperature is set to `0.0` for consistency, but it is still\r\n non-deterministic across different PDF layouts. Always validate with TEST mode first.\r\n- All extracted fields are written as strings. If the operator needs typed columns\r\n (dates, numbers), add a post-processing step after confirming extraction is correct.\r\n- **Column names come from AI extraction.** The delta table column names match\r\n the `name` field in the `FIELDS` JSON array provided during setup. 
These are\r\n `snake_case` names chosen by the operator (e.g., `invoice_number`, `hotel_name`).\r\n They do NOT follow the same `clean_columns()` convention used by the\r\n `csv-to-bronze-delta-tables` skill. Downstream skills (e.g.,\r\n `create-materialised-lakeview-scripts`) must verify actual delta table column\r\n names rather than assuming any naming convention.\r\n- The notebook installs `openai` and `pymupdf4llm` at runtime. The `synapse.ml.aifunc`\r\n package is pre-installed in Fabric Runtime 1.3+.\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_upload_commands.py`** — Scans a local folder for PDFs and\r\n writes a PowerShell script of `fab cp` upload commands.\r\n Run: `python scripts/generate_upload_commands.py --help`\r\n- **`scripts/generate_notebook.py`** — Generates a Fabric-compatible `.ipynb`\r\n notebook with the AI extraction prompt pre-populated from the supplied fields.\r\n Supports `--test-mode` for single-PDF validation runs.\r\n Run: `python scripts/generate_notebook.py --help`\r\n",
235
235
},
236
236
{
237
237
relativePath: "references/notebook-cells-reference.md",