@rishildi/ldi-process-skills-test 0.0.23 → 0.0.24
- package/build/skills/embedded.js +14 -14
- package/package.json +1 -1
package/build/skills/embedded.js
CHANGED
@@ -1,5 +1,5 @@
 // AUTO-GENERATED by scripts/embed-skills.ts — do not edit
-// Generated at: 2026-04-
+// Generated at: 2026-04-05T20:27:26.598Z
 export const EMBEDDED_SKILLS = [
 {
 name: "create-fabric-lakehouses",
@@ -7,7 +7,7 @@ export const EMBEDDED_SKILLS = [
 files: [
 {
 relativePath: "SKILL.md",
-
content: "---\nname: create-fabric-lakehouses\ndescription: >\n Use this skill when asked to create, provision, or set up one or more\n Lakehouse items in existing Microsoft Fabric workspaces. Triggers on:\n \"create a lakehouse\", \"provision lakehouses\", \"set up a Fabric lakehouse\",\n \"create lakehouse in Fabric\", \"new lakehouse\", \"create lakehouses across\n workspaces\". Does NOT trigger for: creating workspaces (use\n generate-fabric-workspace), querying lakehouse data, managing tables,\n uploading files, creating shortcuts, or general Fabric workspace management.\nlicense: MIT\ncompatibility: Fabric CLI (fab) installed and authenticated; Python 3.10+ for notebook approach\n---\n\n# Create Fabric Lakehouse\n\n> ⚠️ **GOVERNANCE**: This skill produces notebooks and scripts for the operator to\n> review and run — it never executes commands directly against a live Fabric environment.\n> Present each generated artefact to the operator before they run it.\n\nProvisions one or more empty Lakehouse items across one or more existing\nMicrosoft Fabric workspaces, using a user-chosen approach, and produces an\naudit-trail definition file.\n\n**Companion skills:** Workspace creation is handled by the\n`generate-fabric-workspace` skill. Shortcut creation between lakehouses is\na separate skill / manual step. 
This skill assumes target workspaces already\nexist.\n\n## Orchestrated Context\n\nWhen invoked from a workflow agent, read `00-environment-discovery/environment-profile.md`\nand the SOP before asking the user anything.\n\n| Parameter | Source when orchestrated |\n|---|---|\n| Deployment approach (notebook / PowerShell / terminal) | Environment profile |\n| Workspace name(s) | Environment profile or implementation plan |\n| Naming convention / prefix | Implementation plan or SOP |\n| Medallion layer(s) to create (Bronze / Silver / Gold) | SOP shared parameters |\n| Schema-enabled preference | SOP or implementation plan |\n\n**Only ask for parameters not found in these documents.** Summarise what was resolved\nautomatically, then ask for what remains (e.g. lakehouse core name, description).\n\n## Prerequisites\n\nBefore starting, ask the operator to run the following and share the output:\n\n```bash\nfab auth status # Must show authenticated\nfab ls # Must return workspace list\n```\n\nIf not authenticated, ask the operator to run `fab auth login` first.\n\n## Workflow\n\nExecute these steps in order.\n\n### Step 1 — Choose Provisioning Approach\n\nAsk the user which approach they want to follow:\n\n| Approach | Description | Best for |\n|----------|-------------|----------|\n| **A — PySpark Notebook** | Generates a `.py` notebook script that installs `ms-fabric-cli` and uses `!fab` commands. Output for the user to run in their Fabric workspace. | Users who want a reusable notebook artefact in Fabric |\n| **B — PowerShell Script** | Generates a PowerShell script containing `fab` CLI commands. Output for user validation before execution. | Users who prefer a single script to review and run locally |\n| **C — Interactive CLI** | Runs `fab` commands one-by-one in the terminal, pausing for user validation after each step. 
| Users who want maximum control and visibility |\n\n### Step 2 — Collect Workspace & Lakehouse Definitions (Sequential)\n\nCollect definitions **one workspace at a time**. For each workspace, gather:\n\n#### 2a — Target Workspace\n\n- [ ] **Workspace name** — must already exist. Verify with:\n ```bash\n fab exists \"<WorkspaceName>.Workspace\"\n ```\n If the workspace does not exist, inform the user and suggest they run\n the `generate-fabric-workspace` skill first. Do not proceed for that\n workspace until it exists.\n\n#### 2b — Naming Convention\n\nSuggest the default naming pattern: `{Prefix}_{CoreName}_{Suffix}`\n\n| Component | Description | Default | Example |\n|-----------|-------------|---------|---------|\n| **Prefix** | Item type indicator | `LH` | `LH` |\n| **CoreName** | Business/project name | *(user provides)* | `LANDONREVENUE` |\n| **Suffix** | Medallion layer or purpose | `BRONZE`, `SILVER`, `GOLD` | `BRONZE` |\n| **Separator** | Character between components | `_` | `_` |\n\nExample result: `LH_LANDONREVENUE_BRONZE`\n\nPresent the suggested defaults and **ask the user to confirm or override**\neach component. The user may change any component or use fully custom names\nthat don't follow the pattern at all.\n\n#### 2c — Lakehouse Definitions\n\nFor each lakehouse in this workspace, collect:\n\n- [ ] **Name** — generated from the naming convention, or custom\n- [ ] **Description** — optional text describing the lakehouse's purpose\n- [ ] **Schema-enabled** — yes/no (default: no). See\n `references/schema-enabled.md` for guidance.\n\n#### 2d — More Workspaces?\n\nAfter finishing one workspace, ask:\n\n> \"Do you have another workspace to provision lakehouses in, or are we done?\"\n\nIf yes, loop back to Step 2a. If done, proceed to Step 3.\n\n### Step 3 — Validate Inputs\n\nBefore generating anything, validate **all** lakehouse definitions:\n\n1. 
For each workspace, confirm it exists:\n ```bash\n fab exists \"<WorkspaceName>.Workspace\"\n ```\n — if it does not exist, stop and direct the user to create it first\n\n2. For each lakehouse, check it doesn't already exist:\n ```bash\n fab exists \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\"\n ```\n — if it already exists, warn the user and ask whether to **skip** or\n **rename**\n\n3. Validate lakehouse names against naming constraints (see Gotchas)\n\n### Step 4 — Generate & Execute\n\nBranch by the approach chosen in Step 1. Process workspaces sequentially.\n\n**Maintain an audit log** throughout execution — record every command run and\nits outcome. This log feeds into the definition file in Step 6.\n\n#### Approach A — PySpark Notebook\n\n1. Generate a PySpark notebook using the template in\n `references/notebook-template.py`\n2. The notebook pattern is:\n - Install `ms-fabric-cli` via `%pip install ms-fabric-cli -q`\n - Authenticate using `notebookutils.credentials.getToken('pbi')` for `FAB_TOKEN`\n and `FAB_TOKEN_AZURE`, and `notebookutils.credentials.getToken('storage')` for\n `FAB_TOKEN_ONELAKE` (OneLake requires the storage-scope token)\n - Add pip's scripts directory to `PATH` so `!fab` works\n - Use `!fab mkdir` shell commands for standard lakehouses\n - Use `!fab api` with REST payload for schema-enabled lakehouses\n3. The notebook must include:\n - A configuration cell with all workspace/lakehouse definitions\n - Existence checks before each creation\n - A summary cell at the end\n4. Save to `/home/claude/<workspace>_create_lakehouses.py`\n5. Present to user for review\n6. Optionally upload:\n ```bash\n fab import \"<Workspace>.Workspace/<Name>.Notebook\" -i <path> --format py -f\n ```\n\n#### Approach B — PowerShell Script\n\n1. Generate a PowerShell script with the following structure:\n2. 
The script must:\n - Use `fab mkdir` for standard lakehouses\n - Handle schema-enabled lakehouses via the Fabric REST API\n (`fab api` wrapper — see `references/fabric-api-lakehouse.md`)\n - Include `fab exists` checks before each creation\n - Track created items for potential rollback\n - Include error handling and summary output\n3. Save to `/home/claude/create_lakehouses.ps1`\n4. Present the script and **wait for explicit approval** before running\n\n#### Approach C — Interactive CLI\n\nExecute commands one-by-one per workspace, pausing after each:\n\n1. **For each lakehouse** — check then create:\n ```bash\n fab exists \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\"\n ```\n — if not exists, create. For standard lakehouses:\n ```bash\n fab mkdir \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\"\n ```\n — for schema-enabled lakehouses, use the REST API:\n ```bash\n WS_ID=$(fab get \"<WorkspaceName>.Workspace\" -q \"id\" | tr -d '\"')\n fab api \"workspaces/$WS_ID/lakehouses\" -X post \\\n -i '{\"displayName\":\"<Name>\",\"description\":\"<Desc>\",\"creationPayload\":{\"enableSchemas\":true}}'\n ```\n — wait for user confirmation after each\n\n2. **Verification** after all lakehouses in a workspace:\n ```bash\n fab ls \"<WorkspaceName>.Workspace\" -l\n ```\n\n3. Move to next workspace or proceed to Step 5.\n\n### Step 4a — Failure Handling\n\nIf any lakehouse creation fails during execution:\n\n1. **Stop immediately** — do not proceed to the next lakehouse\n2. **Report** what succeeded and what failed\n3. 
**Ask the user** how to proceed:\n\n| Option | Action |\n|--------|--------|\n| **Retry** | Re-attempt the failed lakehouse creation |\n| **Skip** | Skip the failed item and continue with remaining |\n| **Rollback & Abort** | Delete all lakehouses created *in this run*, then stop |\n| **Abort (keep)** | Stop but leave already-created lakehouses in place |\n\nIf the user chooses **Rollback & Abort**:\n```bash\nfab rm \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\" -f\n```\n— for each lakehouse created in this run (tracked in the audit log).\nConfirm each deletion with the user before executing.\n\n### Step 5 — Verify Creation\n\nRegardless of approach, verify every lakehouse across all workspaces:\n\n```bash\nfab exists \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\"\n```\n\nCollect the lakehouse ID for each:\n```bash\nfab get \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\" -q \"id\"\n```\n\nIf any verification fails, report and ask the user how to proceed (same\noptions as Step 4a).\n\n### Step 6 — Generate Definition File\n\nAfter all lakehouses are verified, generate a Lakehouse Definition markdown\nfile using the template in `references/definition-template.md`.\n\nThe definition file must include:\n\n- **Per workspace:** name, ID\n- **Per lakehouse:** name, ID, description, schema-enabled status, naming\n convention used, creation timestamp\n- **Overall:** approach used, naming convention applied, full audit trail of\n commands/API calls executed, any warnings, skipped items, or rollback actions\n\nSave to `/home/claude/lakehouse_definition.md` and present to user.\n\n## Gotchas\n\n- `fab mkdir` creates a standard lakehouse but does NOT support the\n `enableSchemas` property. 
To create a schema-enabled lakehouse, use\n the Fabric REST API: `POST workspaces/{workspaceId}/lakehouses` with\n `{\"displayName\":\"<n>\",\"creationPayload\":{\"enableSchemas\":true}}`\n- Always use `-f` flag with `fab` commands in scripts to avoid interactive\n prompts that block execution\n- Lakehouse names must be unique within a workspace\n- Workspace names are case-sensitive in `fab` paths\n- Always quote paths containing spaces: `\"My Workspace.Workspace\"`\n- The Fabric REST API requires workspace ID (GUID), not display name —\n extract with `fab get \"<n>.Workspace\" -q \"id\"`\n- In notebooks, `ms-fabric-cli` must be installed via `%pip install` and\n the scripts directory added to `PATH` before `!fab` commands work\n- Token audiences for notebook auth: `'pbi'` for `FAB_TOKEN` and `FAB_TOKEN_AZURE`,\n `'storage'` for `FAB_TOKEN_ONELAKE` (OneLake requires the storage-scope token)\n- `fab auth status` must show a valid token before any operations; tokens\n expire and may need refresh\n- Lakehouse names cannot contain: `/`, `\\`, `#`, `%`, `?` or\n leading/trailing spaces. Max length: 256 characters\n- When rolling back, always confirm each deletion with the user — `fab rm`\n with `-f` is irreversible\n- This skill does NOT create workspaces — if a workspace is missing, direct\n the user to the `generate-fabric-workspace` skill\n- This skill does NOT create shortcuts between lakehouses — that is a\n separate step\n\n## Output Format\n\nSee `references/definition-template.md` for the full template.\n\n## Available References\n\n- **`references/notebook-template.py`** — PySpark notebook template for Approach A\n- **`references/definition-template.md`** — Lakehouse definition output template\n- **`references/schema-enabled.md`** — How schema-enabled lakehouses work\n- **`references/fabric-api-lakehouse.md`** — Fabric REST API reference for\n lakehouse creation\n",
+
content: "---\nname: create-fabric-lakehouses\ndescription: >\n Use this skill when asked to create, provision, or set up one or more\n Lakehouse items in existing Microsoft Fabric workspaces. Triggers on:\n \"create a lakehouse\", \"provision lakehouses\", \"set up a Fabric lakehouse\",\n \"create lakehouse in Fabric\", \"new lakehouse\", \"create lakehouses across\n workspaces\". Does NOT trigger for: creating workspaces (use\n generate-fabric-workspace), querying lakehouse data, managing tables,\n uploading files, creating shortcuts, or general Fabric workspace management.\nlicense: MIT\ncompatibility: Fabric CLI (fab) installed and authenticated; Python 3.10+ for notebook approach\n---\n\n# Create Fabric Lakehouse\n\n> ⚠️ **GOVERNANCE**: This skill produces notebooks and scripts for the operator to\n> review and run — it never executes commands directly against a live Fabric environment.\n> Present each generated artefact to the operator before they run it.\n\nProvisions one or more empty Lakehouse items across one or more existing\nMicrosoft Fabric workspaces, using a user-chosen approach, and produces an\naudit-trail definition file.\n\n**Companion skills:** Workspace creation is handled by the\n`generate-fabric-workspace` skill. Shortcut creation between lakehouses is\na separate skill / manual step. 
This skill assumes target workspaces already\nexist.\n\n## Orchestrated Context\n\nWhen invoked from a workflow agent, read `00-environment-discovery/environment-profile.md`\nand the SOP before asking the user anything.\n\n| Parameter | Source when orchestrated |\n|---|---|\n| Deployment approach (notebook / PowerShell / terminal) | Environment profile |\n| Workspace name(s) | Environment profile or implementation plan |\n| Naming convention / prefix | Implementation plan or SOP |\n| Medallion layer(s) to create (Bronze / Silver / Gold) | SOP shared parameters |\n| Schema-enabled preference | SOP or implementation plan |\n\n**Only ask for parameters not found in these documents.** Summarise what was resolved\nautomatically, then ask for what remains (e.g. lakehouse core name, description).\n\n## Prerequisites\n\nBefore starting, ask the operator to run the following and share the output:\n\n```bash\nfab auth status # Must show authenticated\nfab ls # Must return workspace list\n```\n\nIf not authenticated, ask the operator to run `fab auth login` first.\n\n## Workflow\n\nExecute these steps in order.\n\n### Step 1 — Choose Provisioning Approach\n\nAsk the user which approach they want to follow:\n\n| Approach | Description | Best for |\n|----------|-------------|----------|\n| **A — PySpark Notebook** | Generates a `.py` notebook script that installs `ms-fabric-cli` and uses `!fab` commands. Output for the user to run in their Fabric workspace. | Users who want a reusable notebook artefact in Fabric |\n| **B — PowerShell Script** | Generates a PowerShell script containing `fab` CLI commands. Output for user validation before execution. | Users who prefer a single script to review and run locally |\n| **C — Interactive CLI** | Runs `fab` commands one-by-one in the terminal, pausing for user validation after each step. 
| Users who want maximum control and visibility |\n\n### Step 2 — Collect Workspace & Lakehouse Definitions (Sequential)\n\nCollect definitions **one workspace at a time**. For each workspace, gather:\n\n#### 2a — Target Workspace\n\n- [ ] **Workspace name** — must already exist. Verify with:\n ```bash\n fab exists \"<WorkspaceName>.Workspace\"\n ```\n If the workspace does not exist, inform the user and suggest they run\n the `generate-fabric-workspace` skill first. Do not proceed for that\n workspace until it exists.\n\n#### 2b — Naming Convention\n\nSuggest the default naming pattern: `{Prefix}_{CoreName}_{Suffix}`\n\n| Component | Description | Default | Example |\n|-----------|-------------|---------|---------|\n| **Prefix** | Item type indicator | `LH` | `LH` |\n| **CoreName** | Business/project name | *(user provides)* | `LANDONREVENUE` |\n| **Suffix** | Medallion layer or purpose | `BRONZE`, `SILVER`, `GOLD` | `BRONZE` |\n| **Separator** | Character between components | `_` | `_` |\n\nExample result: `LH_LANDONREVENUE_BRONZE`\n\nPresent the suggested defaults and **ask the user to confirm or override**\neach component. The user may change any component or use fully custom names\nthat don't follow the pattern at all.\n\n#### 2c — Lakehouse Definitions\n\nFor each lakehouse in this workspace, collect:\n\n- [ ] **Name** — generated from the naming convention, or custom\n- [ ] **Description** — optional text describing the lakehouse's purpose\n- [ ] **Schema-enabled** — yes/no (default: no). See\n `references/schema-enabled.md` for guidance.\n\n#### 2d — More Workspaces?\n\nAfter finishing one workspace, ask:\n\n> \"Do you have another workspace to provision lakehouses in, or are we done?\"\n\nIf yes, loop back to Step 2a. If done, proceed to Step 3.\n\n### Step 3 — Validate Inputs\n\nBefore generating anything, validate **all** lakehouse definitions:\n\n1. 
For each workspace, confirm it exists:\n ```bash\n fab exists \"<WorkspaceName>.Workspace\"\n ```\n — if it does not exist, stop and direct the user to create it first\n\n2. For each lakehouse, check it doesn't already exist:\n ```bash\n fab exists \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\"\n ```\n — if it already exists, warn the user and ask whether to **skip** or\n **rename**\n\n3. Validate lakehouse names against naming constraints (see Gotchas)\n\n### Step 4 — Generate & Execute\n\nBranch by the approach chosen in Step 1. Process workspaces sequentially.\n\n**Maintain an audit log** throughout execution — record every command run and\nits outcome. This log feeds into the definition file in Step 6.\n\n#### Approach A — PySpark Notebook\n\n1. Generate a PySpark notebook using the template in\n `references/notebook-template.py`\n2. The notebook pattern is:\n - Install `ms-fabric-cli` via `%pip install ms-fabric-cli -q`\n - Authenticate using `notebookutils.credentials.getToken('pbi')` for `FAB_TOKEN`\n and `FAB_TOKEN_AZURE`, and `notebookutils.credentials.getToken('storage')` for\n `FAB_TOKEN_ONELAKE` (OneLake requires the storage-scope token)\n - Add pip's scripts directory to `PATH` so `!fab` works\n - Use `!fab mkdir` shell commands for standard lakehouses\n - Use `!fab api` with REST payload for schema-enabled lakehouses\n3. The notebook must include:\n - A configuration cell with all workspace/lakehouse definitions\n - Existence checks before each creation\n - A summary cell at the end\n4. Save to `outputs/create-fabric-lakehouses_{YYYY-MM-DD_HH-MM}_{USERNAME}/<workspace>_create_lakehouses.py`\n5. Present to user for review\n6. Optionally upload:\n ```bash\n fab import \"<Workspace>.Workspace/<Name>.Notebook\" -i <path> --format py -f\n ```\n\n#### Approach B — PowerShell Script\n\n1. Generate a PowerShell script with the following structure:\n2. 
The script must:\n - Use `fab mkdir` for standard lakehouses\n - Handle schema-enabled lakehouses via the Fabric REST API\n (`fab api` wrapper — see `references/fabric-api-lakehouse.md`)\n - Include `fab exists` checks before each creation\n - Track created items for potential rollback\n - Include error handling and summary output\n3. Save to `outputs/create-fabric-lakehouses_{YYYY-MM-DD_HH-MM}_{USERNAME}/create_lakehouses.ps1`\n4. Present the script and **wait for explicit approval** before running\n\n#### Approach C — Interactive CLI\n\nExecute commands one-by-one per workspace, pausing after each:\n\n1. **For each lakehouse** — check then create:\n ```bash\n fab exists \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\"\n ```\n — if not exists, create. For standard lakehouses:\n ```bash\n fab mkdir \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\" -f\n ```\n — for schema-enabled lakehouses, use the REST API:\n ```bash\n WS_ID=$(fab get \"<WorkspaceName>.Workspace\" -q \"id\" | tr -d '\"')\n fab api \"workspaces/$WS_ID/lakehouses\" -X post \\\n -i '{\"displayName\":\"<Name>\",\"description\":\"<Desc>\",\"creationPayload\":{\"enableSchemas\":true}}'\n ```\n — wait for user confirmation after each\n\n2. **Verification** after all lakehouses in a workspace:\n ```bash\n fab ls \"<WorkspaceName>.Workspace\" -l\n ```\n\n3. Move to next workspace or proceed to Step 5.\n\n### Step 4a — Failure Handling\n\nIf any lakehouse creation fails during execution:\n\n1. **Stop immediately** — do not proceed to the next lakehouse\n2. **Report** what succeeded and what failed\n3. 
**Ask the user** how to proceed:\n\n| Option | Action |\n|--------|--------|\n| **Retry** | Re-attempt the failed lakehouse creation |\n| **Skip** | Skip the failed item and continue with remaining |\n| **Rollback & Abort** | Delete all lakehouses created *in this run*, then stop |\n| **Abort (keep)** | Stop but leave already-created lakehouses in place |\n\nIf the user chooses **Rollback & Abort**:\n```bash\nfab rm \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\" -f\n```\n— for each lakehouse created in this run (tracked in the audit log).\nConfirm each deletion with the user before executing.\n\n### Step 5 — Verify Creation\n\nRegardless of approach, verify every lakehouse across all workspaces:\n\n```bash\nfab exists \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\"\n```\n\nCollect the lakehouse ID for each:\n```bash\nfab get \"<WorkspaceName>.Workspace/<LakehouseName>.Lakehouse\" -q \"id\"\n```\n\nIf any verification fails, report and ask the user how to proceed (same\noptions as Step 4a).\n\n### Step 6 — Generate Definition File\n\nAfter all lakehouses are verified, generate a Lakehouse Definition markdown\nfile using the template in `references/definition-template.md`.\n\nThe definition file must include:\n\n- **Per workspace:** name, ID\n- **Per lakehouse:** name, ID, description, schema-enabled status, naming\n convention used, creation timestamp\n- **Overall:** approach used, naming convention applied, full audit trail of\n commands/API calls executed, any warnings, skipped items, or rollback actions\n\nSave to `outputs/create-fabric-lakehouses_{YYYY-MM-DD_HH-MM}_{USERNAME}/lakehouse_definition.md` and present to user.\n\n## Gotchas\n\n- `fab mkdir` creates a standard lakehouse but does NOT support the\n `enableSchemas` property. 
To create a schema-enabled lakehouse, use\n the Fabric REST API: `POST workspaces/{workspaceId}/lakehouses` with\n `{\"displayName\":\"<n>\",\"creationPayload\":{\"enableSchemas\":true}}`\n- Always use `-f` flag with `fab` commands in scripts to avoid interactive\n prompts that block execution\n- Lakehouse names must be unique within a workspace\n- Workspace names are case-sensitive in `fab` paths\n- Always quote paths containing spaces: `\"My Workspace.Workspace\"`\n- The Fabric REST API requires workspace ID (GUID), not display name —\n extract with `fab get \"<n>.Workspace\" -q \"id\"`\n- In notebooks, `ms-fabric-cli` must be installed via `%pip install` and\n the scripts directory added to `PATH` before `!fab` commands work\n- Token audiences for notebook auth: `'pbi'` for `FAB_TOKEN` and `FAB_TOKEN_AZURE`,\n `'storage'` for `FAB_TOKEN_ONELAKE` (OneLake requires the storage-scope token)\n- `fab auth status` must show a valid token before any operations; tokens\n expire and may need refresh\n- Lakehouse names cannot contain: `/`, `\\`, `#`, `%`, `?` or\n leading/trailing spaces. Max length: 256 characters\n- When rolling back, always confirm each deletion with the user — `fab rm`\n with `-f` is irreversible\n- This skill does NOT create workspaces — if a workspace is missing, direct\n the user to the `generate-fabric-workspace` skill\n- This skill does NOT create shortcuts between lakehouses — that is a\n separate step\n\n## Output Format\n\nSee `references/definition-template.md` for the full template.\n\n## Available References\n\n- **`references/notebook-template.py`** — PySpark notebook template for Approach A\n- **`references/definition-template.md`** — Lakehouse definition output template\n- **`references/schema-enabled.md`** — How schema-enabled lakehouses work\n- **`references/fabric-api-lakehouse.md`** — Fabric REST API reference for\n lakehouse creation\n",
 },
 {
 relativePath: "references/agent.md",
@@ -37,11 +37,11 @@ export const EMBEDDED_SKILLS = [
 files: [
 {
 relativePath: "SKILL.md",
-
content: "---\nname: create-fabric-process-workflow-agent\ndescription: >\n Use this skill to create an orchestration agent definition (agent.md) for any\n Microsoft Fabric technical process. The user describes what they want to automate;\n the skill produces a self-contained agent.md. When run, the agent maps each process\n step to an available Fabric process skill, flags any steps with no matching skill\n as UNMAPPED, and at execution time offers three options for unmapped steps: perform\n manually, build a lightweight skill on the fly (saved locally), or engage the LDI\n Skills Creation Framework. Logs all decisions to an audit trail and orchestrates\n the full process end-to-end.\n Triggers on: \"create a process workflow agent\", \"build an orchestration agent\n for [process]\", \"create an agent that automates [process]\", \"orchestrate\n [process] into an agent\". Does NOT trigger for creating individual process\n skills, running an agent, writing code, or one-off analysis.\nlicense: MIT\ncompatibility: Python 3.8+ required for scripts/\n---\n\n# Create Fabric Process Workflow Agent\n\nCreates a concise, self-contained `agent.md` that defines an orchestration agent\nfor a Microsoft Fabric technical process. When run, the agent maps each process step\nto an available skill (COVERED), marks steps with no matching skill as UNMAPPED, and\noffers three options at execution time for unmapped steps: manual, on-the-fly skill\n(saved locally in a `skills/` folder), or the LDI Skills Creation Framework.\n\n## Core Governance Rules\n\nThese rules are non-negotiable. They must be embedded verbatim in every generated\n`agent.md` so they are active at runtime.\n\n- **RULE 1 — Never execute autonomously.** Never run terminal commands, API calls,\n or scripts directly. Present every command in a fenced code block with the\n insert-into-terminal icon. 
The user runs it and reports back before proceeding.\n- **RULE 2 — Parameter gate before every execution step.** Before generating any\n artefact for a step, verify every required parameter is resolved. Any parameter\n deferred during discovery (marked `[TBC]`) must be asked for explicitly before\n proceeding. Never silently skip a parameter or substitute an empty value.\n- **RULE 3 — No silent approach changes.** If a blocker is found with the chosen\n approach, surface it and present alternatives. Let the user decide. Never switch\n silently. Approach constraints by step type:\n - Local file upload (CSV/PDF from operator's machine): **notebook not possible** —\n options are script, CLI commands, or manual. For 50+ files, note that script/CLI\n is sequential and slow; suggest manual upload via Fabric Files UI instead.\n - Schema creation: notebook (Spark SQL) or CLI; no native Fabric UI for lakehouses.\n - Shortcuts: CLI (`fab ln`) or script; notebook cannot run `fab ln` natively.\n- **RULE 4 — No inference from context.** Collect all parameters from the user or\n the current prompt. Do not pre-populate from prior chat history, previous runs,\n or attached files not explicitly part of the current request.\n- **RULE 5 — Respect the user's skill level and environment.** Do not steer toward\n an approach the agent finds easier to generate. Match the user's comfort level,\n installed tooling, and stated preferences.\n- **RULE 6 — Stay within skill boundaries.** Generate only what skill definitions\n describe. 
On any failure: explain the cause from the error, offer the simplest\n manual or UI fallback, ask whether to skip.\n- **RULE 7 — Append to CHANGE_LOG.md after every step.** Include: step number,\n what was done, outcome (success/failure/skipped), and any notable decisions.\n- **RULE 8 — Two-question post-step pattern.** After each execution step: (Q1) ask\n whether the previous artefact ran correctly — if not, get the error and resolve it\n before proceeding; (Q2) propose the next step by name, state the planned approach\n and any implications, offer Yes (generate it) or No (choose a different approach\n or manual). Update the SOP and CHANGE_LOG to reflect any runtime decisions.\n\n## Inputs\n\n| Parameter | Description | Example |\n|-----------|-------------|---------|\n| `PROCESS_NAME` | Short name for the process (lowercase, hyphens) | `monthly-budget-consolidation` |\n| `REQUIREMENTS` | Full description of the process and each of its steps | `\"1) Collect data from five Excel files... 2) Summarise by category...\"` |\n| `SECTIONS` | Sub-agent sections to include (default: all four) | `impl-plan, biz-process, architecture, governance` |\n| `USERNAME` | Used in output folder naming | `rishi` |\n\n## Workflow\n\n- [ ] **Collect** — If `PROCESS_NAME`, `REQUIREMENTS`, or `USERNAME` are missing, ask for them.\n\n- [ ] **Analyse discovery questions** — Read the requirements and identify the\n environment-specific questions that determine which approaches are viable. For each question:\n - Name the specific activity that needs the permission or tool\n - Offer concrete options (not yes/no)\n - State what the agent does differently based on the answer\n Group questions by domain (permissions, tooling, execution preferences, data access,\n existing infrastructure). 
Ask only about domains the requirements actually need.\n Embed the questionnaire as **Sub-Agent 0: Environment Discovery** in the generated agent.md.\n\n- [ ] **Confirm sections** — Present the four standard sections with descriptions\n (see `references/section-descriptions.md`). Ask which to include. Default: all four.\n Wait for explicit confirmation before drafting.\n\n- [ ] **Draft agent.md** — Use `assets/agent-template.md` as the base.\n - Substitute `{PROCESS_NAME}` and a ≤3-sentence `{REQUIREMENTS_SUMMARY}`.\n - Remove excluded sections. Keep each sub-agent block ≤25 lines.\n - Do not name any specific process skill or technology — all resolved at runtime.\n - Do not hardcode company names, specific values, or environment paths.\n\n- [ ] **Validate** — Present the draft. Ask: *\"Does this accurately reflect the process? Anything unclear?\"*\n Refine until the user confirms.\n\n- [ ] **Scaffold** — Run `python scripts/scaffold_output.py --process-name $PROCESS_NAME --username $USERNAME --sections $SECTIONS`.\n Write the confirmed agent.md to the returned `agent_md_path`.\n\n- [ ] **Confirm** — Report the output root path and list all created subfolders.\n\n## Output Format\n\n```\noutputs/\n└── {process-name}_{YYYY-MM-DD_HH-MM}_{username}/\n ├── agent.md ← self-contained orchestration agent definition\n ├── CHANGE_LOG.md ← audit trail; updated as agent runs\n ├── 01-implementation-plan/ ← empty; populated when agent runs\n ├── 02-business-process/ ← empty; populated when agent runs\n ├── 03-solution-architecture/ ← empty; populated when agent runs\n ├── 04-governance/ ← empty; populated when agent runs\n ├── 05-step-name/ ← execution step 1 (numbered from 05)\n │ └── thing.ipynb ← deliverable only (.ipynb / .ps1 / cli-commands.md)\n ├── 06-step-name/ ← execution step 2\n │ └── thing.ps1\n └── skills/ ← on-the-fly skills (created at runtime for UNMAPPED steps)\n └── [skill-name]/\n └── SKILL.md ← lightweight skill definition for local 
reference\n```\n\n`CHANGE_LOG.md` is initialised empty and updated by the agent each time it runs.\n\n### Intermediate vs. final artefacts\n\n| Classification | Description | Examples |\n|----------------|-------------|----------|\n| **Final** | The deliverable the user runs or deploys | `.ipynb` notebooks, `.sql` scripts, `.md` documentation |\n| **Intermediate** | Scripts that generate the final artefacts | `generate_*.py`, `generate_*.ps1` |\n\n- Intermediate artefacts live alongside their final outputs (same subfolder).\n- Label both types clearly when presenting outputs to the user.\n- Intermediate scripts must be deterministic and re-runnable.\n\n### Sub-agents in the generated agent.md\n\n| # | Section | Output document |\n|---|---------|-----------------|\n| 0 | Environment Discovery | `00-environment-discovery/environment-profile.md` |\n| 1 | Implementation Plan | `01-implementation-plan/implementation-plan.md` |\n| 2 | Business Process Mapping | `02-business-process/sop.md` |\n| 3 | Solution Architecture | `03-solution-architecture/specification.md` |\n| 4 | Security, Testing & Governance | `04-governance/governance-plan.md` |\n| — | **Execution Phase** | `05-[step]/`, `06-[step]/` ... + `COMPLETION_SUMMARY.md` |\n\nThe execution phase runs after all planning sub-agents are reviewed and confirmed.\nEach SOP step becomes a numbered execution subfolder. The SOP is updated in place\nthroughout execution to reflect runtime decisions (approach changes, errors, manual\nselections). CHANGE_LOG.md is updated after every step.\n\n## Gotchas\n\n- **Do not attempt to create process skills during skill execution.** Skill mapping\n happens inside Sub-Agent 2 when the generated agent.md is run. 
UNMAPPED steps are\n resolved by the operator at execution time, not by this skill upfront.\n- **Do not execute sub-agents** during skill execution — `agent.md` is a definition only.\n- Do not name specific tools, technologies, or process skills in the generated agent.md.\n- **Environment discovery must be contextual, not generic.** Derive questions from the\n requirements. If the process doesn't involve workspaces, don't ask about workspace\n creation permissions. The questionnaire should read like a knowledgeable consultant\n scoping a project, not a bureaucratic form.\n- Confirm sections **before** drafting, not after.\n- Keep each sub-agent block ≤25 lines to avoid context overload when the agent runs.\n\n## Available Scripts\n\n- **`scripts/scaffold_output.py`** — Creates the dated output folder structure including\n an empty `CHANGE_LOG.md`. Run: `python scripts/scaffold_output.py --help`\n",
|
|
40
|
+
content: "---\nname: create-fabric-process-workflow-agent\ndescription: >\n Use this skill to create an orchestration agent definition (agent.md) for any\n Microsoft Fabric technical process. The user describes what they want to automate;\n the skill produces a self-contained agent.md. When run, the agent maps each process\n step to an available Fabric process skill, flags any steps with no matching skill\n as UNMAPPED, and at execution time offers three options for unmapped steps: perform\n manually, build a lightweight skill on the fly (saved locally), or engage the LDI\n Skills Creation Framework. Logs all decisions to an audit trail and orchestrates\n the full process end-to-end.\n Triggers on: \"create a process workflow agent\", \"build an orchestration agent\n for [process]\", \"create an agent that automates [process]\", \"orchestrate\n [process] into an agent\". Does NOT trigger for creating individual process\n skills, running an agent, writing code, or one-off analysis.\nlicense: MIT\ncompatibility: Python 3.8+ required for scripts/\n---\n\n# Create Fabric Process Workflow Agent\n\nCreates a concise, self-contained `agent.md` that defines an orchestration agent\nfor a Microsoft Fabric technical process. When run, the agent maps each process step\nto an available skill (COVERED), marks steps with no matching skill as UNMAPPED, and\noffers three options at execution time for unmapped steps: manual, on-the-fly skill\n(saved locally in a `skills/` folder), or the LDI Skills Creation Framework.\n\n## Core Governance Rules\n\nThese rules are non-negotiable. They must be embedded verbatim in every generated\n`agent.md` so they are active at runtime.\n\n- **RULE 1 — Never execute autonomously.** Never run terminal commands, API calls,\n or scripts directly. Present every command in a fenced code block with the\n insert-into-terminal icon. 
The user runs it and reports back before proceeding.\n- **RULE 2 — Parameter gate before every execution step.** Before generating any\n artefact for a step, verify every required parameter is resolved. Any parameter\n deferred during discovery (marked `[TBC]`) must be asked for explicitly before\n proceeding. Never silently skip a parameter or substitute an empty value.\n- **RULE 3 — No silent approach changes.** If a blocker is found with the chosen\n approach, surface it and present alternatives. Let the user decide. Never switch\n silently. Approach constraints by step type:\n - Local file upload (CSV/PDF from operator's machine): **notebook not possible** —\n options are script, CLI commands, or manual. For 50+ files, note that script/CLI\n is sequential and slow; suggest manual upload via Fabric Files UI instead.\n - Schema creation: notebook (Spark SQL) or CLI; no native Fabric UI for lakehouses.\n - Shortcuts: notebook (`!fab ln` cell), PowerShell script, or interactive CLI all work.\n- **RULE 4 — No inference from context.** Collect all parameters from the user or\n the current prompt. Do not pre-populate from prior chat history, previous runs,\n or attached files not explicitly part of the current request.\n- **RULE 5 — Respect the user's skill level and environment.** Do not steer toward\n an approach the agent finds easier to generate. Match the user's comfort level,\n installed tooling, and stated preferences.\n- **RULE 6 — Stay within skill boundaries.** Generate only what skill definitions\n describe. 
On any failure: explain the cause from the error, offer the simplest\n manual or UI fallback, ask whether to skip.\n- **RULE 7 — Append to CHANGE_LOG.md after every step.** Include: step number,\n what was done, outcome (success/failure/skipped), and any notable decisions.\n- **RULE 8 — Two-question post-step pattern.** After each execution step: (Q1) ask\n whether the previous artefact ran correctly — if not, get the error and resolve it\n before proceeding; (Q2) propose the next step by name, state the planned approach\n and any implications, offer Yes (generate it) or No (choose a different approach\n or manual). Update the SOP and CHANGE_LOG to reflect any runtime decisions.\n\n## Inputs\n\n| Parameter | Description | Example |\n|-----------|-------------|---------|\n| `PROCESS_NAME` | Short name for the process (lowercase, hyphens) | `monthly-budget-consolidation` |\n| `REQUIREMENTS` | Full description of the process and each of its steps | `\"1) Collect data from five Excel files... 2) Summarise by category...\"` |\n| `SECTIONS` | Sub-agent sections to include (default: all four) | `impl-plan, biz-process, architecture, governance` |\n| `USERNAME` | Used in output folder naming | `rishi` |\n\n## Workflow\n\n- [ ] **Collect** — If `PROCESS_NAME`, `REQUIREMENTS`, or `USERNAME` are missing, ask for them.\n\n- [ ] **Confirm sections** — Present the four standard sections with descriptions\n (see `references/section-descriptions.md`). Ask which to include. Default: all four.\n Wait for explicit confirmation before drafting.\n\n- [ ] **Draft agent.md** — Use `assets/agent-template.md` as the base.\n - Substitute `{PROCESS_NAME}` and a ≤3-sentence `{REQUIREMENTS_SUMMARY}`.\n - Remove excluded sections. Keep each sub-agent block ≤25 lines.\n - Do not name any specific process skill or technology — all resolved at runtime.\n - Do not hardcode company names, specific values, or environment paths.\n\n- [ ] **Validate** — Present the draft. 
Ask: *\"Does this accurately reflect the process? Anything unclear?\"*\n Refine until the user confirms.\n\n- [ ] **Scaffold** — Run `python scripts/scaffold_output.py --process-name $PROCESS_NAME --username $USERNAME --sections $SECTIONS`.\n Write the confirmed agent.md to the returned `agent_md_path`.\n\n- [ ] **Confirm** — Report the output root path and list all created subfolders.\n\n## Output Format\n\n```\noutputs/\n└── {process-name}_{YYYY-MM-DD_HH-MM}_{username}/\n ├── agent.md ← self-contained orchestration agent definition\n ├── CHANGE_LOG.md ← audit trail; updated as agent runs\n ├── 01-implementation-plan/ ← empty; populated when agent runs\n ├── 02-business-process/ ← empty; populated when agent runs\n ├── 03-solution-architecture/ ← empty; populated when agent runs\n ├── 04-governance/ ← empty; populated when agent runs\n ├── 05-step-name/ ← execution step 1 (numbered from 05)\n │ └── thing.ipynb ← deliverable only (.ipynb / .ps1 / cli-commands.md)\n ├── 06-step-name/ ← execution step 2\n │ └── thing.ps1\n └── skills/ ← on-the-fly skills (created at runtime for UNMAPPED steps)\n └── [skill-name]/\n └── SKILL.md ← lightweight skill definition for local reference\n```\n\n`CHANGE_LOG.md` is initialised empty and updated by the agent each time it runs.\n\n### Intermediate vs. 
final artefacts\n\n| Classification | Description | Examples |\n|----------------|-------------|----------|\n| **Final** | The deliverable the user runs or deploys | `.ipynb` notebooks, `.sql` scripts, `.md` documentation |\n| **Intermediate** | Scripts that generate the final artefacts | `generate_*.py`, `generate_*.ps1` |\n\n- Intermediate artefacts live alongside their final outputs (same subfolder).\n- Label both types clearly when presenting outputs to the user.\n- Intermediate scripts must be deterministic and re-runnable.\n\n### Sub-agents in the generated agent.md\n\n| # | Section | Output document |\n|---|---------|-----------------|\n| 0 | Environment Discovery | `00-environment-discovery/environment-profile.md` |\n| 1 | Implementation Plan | `01-implementation-plan/implementation-plan.md` |\n| 2 | Business Process Mapping | `02-business-process/sop.md` |\n| 3 | Solution Architecture | `03-solution-architecture/specification.md` |\n| 4 | Security, Testing & Governance | `04-governance/governance-plan.md` |\n| — | **Execution Phase** | `05-[step]/`, `06-[step]/` ... + `COMPLETION_SUMMARY.md` |\n\nThe execution phase runs after all planning sub-agents are reviewed and confirmed.\nEach SOP step becomes a numbered execution subfolder. The SOP is updated in place\nthroughout execution to reflect runtime decisions (approach changes, errors, manual\nselections). CHANGE_LOG.md is updated after every step.\n\n## Gotchas\n\n- **Do not attempt to create process skills during skill execution.** Skill mapping\n happens inside Sub-Agent 2 when the generated agent.md is run. 
UNMAPPED steps are\n resolved by the operator at execution time, not by this skill upfront.\n- **Do not execute sub-agents** during skill execution — `agent.md` is a definition only.\n- Do not name specific tools, technologies, or process skills in the generated agent.md.\n- **Sub-Agent 0 always invokes `fabric-process-discovery` — never generate discovery questions ad-hoc.** The skill defines 6 fixed questions covering all Fabric process prerequisites: tenant access, workspace creation, item creation, domain assignment, Entra group visibility, and deployment preference. Do not derive or generate questions from the requirements.\n- Confirm sections **before** drafting, not after.\n- Keep each sub-agent block ≤25 lines to avoid context overload when the agent runs.\n\n## Available Scripts\n\n- **`scripts/scaffold_output.py`** — Creates the dated output folder structure including\n an empty `CHANGE_LOG.md`. Run: `python scripts/scaffold_output.py --help`\n",
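Reviewer note: this diff does not include `scripts/scaffold_output.py`, so the Scaffold step in the skill text above can only be illustrated, not quoted. The following is a minimal sketch of what such a scaffold could do, based only on the "Output Format" tree and the Inputs table. The function name `scaffold`, the `SECTION_FOLDERS` mapping, the `base_dir` and `now` parameters, and the shape of the returned dictionary (apart from the `agent_md_path` key mentioned in the workflow) are assumptions made for illustration and may differ from the real script.

```python
from datetime import datetime
from pathlib import Path

# Assumed mapping from the SECTIONS keys in the Inputs table example to the
# planning subfolders listed under "Output Format". Hypothetical, not taken
# from the real script.
SECTION_FOLDERS = {
    "impl-plan": "01-implementation-plan",
    "biz-process": "02-business-process",
    "architecture": "03-solution-architecture",
    "governance": "04-governance",
}


def scaffold(base_dir, process_name, username, sections=None, now=None):
    """Create the dated output folder and return the paths that were created."""
    now = now or datetime.now()
    # Default to all four sections when none are given.
    keys = list(sections) if sections else list(SECTION_FOLDERS)
    unknown = [key for key in keys if key not in SECTION_FOLDERS]
    if unknown:
        raise ValueError(f"Unknown sections: {unknown}")

    # Folder name pattern: {process-name}_{YYYY-MM-DD_HH-MM}_{username}
    stamp = f"{now:%Y-%m-%d_%H-%M}"
    root = Path(base_dir) / "outputs" / f"{process_name}_{stamp}_{username}"
    root.mkdir(parents=True, exist_ok=False)

    subfolders = []
    for key in keys:
        folder = root / SECTION_FOLDERS[key]
        folder.mkdir()
        subfolders.append(str(folder))

    # CHANGE_LOG.md is initialised empty; the agent appends to it at runtime.
    (root / "CHANGE_LOG.md").write_text("", encoding="utf-8")

    # agent.md is not written here; the caller writes the confirmed draft
    # to the returned agent_md_path.
    return {
        "output_root": str(root),
        "agent_md_path": str(root / "agent.md"),
        "subfolders": subfolders,
    }
```

In this sketch the execution folders (`05-...`, `06-...`) and `skills/` are deliberately not created, since the skill text describes them as produced at runtime.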
|
|
41
41
|
},
|
|
42
42
|
{
|
|
43
43
|
relativePath: "assets/agent-template.md",
|
|
44
|
-
content: "# Orchestration Agent: {PROCESS_NAME}\r\n\r\n## Context\r\n\r\n**Process**: {PROCESS_NAME}\r\n**Requirements**: {REQUIREMENTS_SUMMARY}\r\n\r\n---\r\n\r\n## How to Run This Agent\r\n\r\n**Start with Sub-Agent 0 (Environment Discovery).** This gathers the user's\r\npermissions, tooling, and preferences so that every subsequent sub-agent produces\r\nplans tailored to their actual environment. Do not skip this step.\r\n\r\nThen execute each remaining sub-agent in sequence:\r\n\r\n1. Use only the inputs and instructions provided in this file.\r\n2. Produce the specified output document in the designated subfolder.\r\n3. Present the output to the user; ask clarifying questions if anything is unclear.\r\n4. Refine until the user explicitly confirms the output.\r\n5. Append a timestamped entry to `CHANGE_LOG.md` recording what was produced or decided.\r\n6. Pass the confirmed output as the primary input to the next sub-agent.\r\n **Every sub-agent must also read `00-environment-discovery/environment-profile.md`**\r\n and respect the path decisions recorded there.\r\n\r\n> 🛑 **HARD STOP RULE — applies to every sub-agent and every execution step:**\r\n> After producing any output, you MUST stop and wait. Do not proceed to the next\r\n> step until the user responds with explicit confirmation (e.g. \"confirmed\",\r\n> \"looks good\", \"proceed\"). A lack of objection is NOT confirmation. Never\r\n> self-confirm or assume approval. Never run two steps in the same turn.\r\n\r\n**Do not produce code, scripts, or data artefacts not described in each sub-agent below.**\r\n\r\n### Parameter Resolution Protocol\r\n\r\nWhen invoking any skill, **always resolve parameters from existing sources before\r\nasking the user**. Check in this order:\r\n\r\n1. **The original requirements** (the user's prompt that started this agent) —\r\n names, values, and preferences stated upfront should be used directly and never\r\n re-asked. 
If the user said \"create a workspace called Finance Hub\", use that name.\r\n2. `00-environment-discovery/environment-profile.md` — deployment approach, permissions\r\n profile, deployment preference\r\n3. The confirmed SOP (`02-business-process/sop.md`) — shared parameters, step inputs\r\n and outputs, lakehouse names, schema names\r\n4. The implementation plan (`01-implementation-plan/implementation-plan.md`) — naming\r\n conventions, task-level decisions\r\n\r\n**Only ask the user for parameters not found in any of these sources.** Summarise\r\nwhat was resolved automatically before asking for what remains. Never re-ask for\r\nsomething the user already provided.\r\n\r\n### Notebook Documentation Standard\r\n\r\nEvery Fabric notebook produced by any skill **must** include a numbered markdown cell\r\nimmediately above each code cell. Each markdown cell must:\r\n\r\n1. State the cell number and a short title (e.g. `## Cell 1 — Install dependencies`).\r\n2. Explain **what** the code cell does in 1–2 sentences.\r\n3. Explain **how to use it**: variables to change, flags to toggle, prerequisites.\r\n\r\nAll transformation logic and design rationale must be **embedded as markdown cells inside\r\nthe notebook** — not maintained as separate documentation files. The notebook is the single\r\nsource of truth. A reader must be able to understand what each cell does, why the logic was\r\nchosen, and how to run it without opening any other file.\r\n\r\n### Output Conventions\r\n\r\n- Each sub-agent writes to its own **numbered subfolder** (`01-implementation-plan/`,\r\n `02-business-process/`, etc.). Execution steps continue the numbering (e.g.,\r\n `05-execution/`, `06-gold-layer/`).\r\n- Within each subfolder, only present **final deliverables** to the user: notebooks,\r\n SQL scripts, and documentation they run or deploy. 
Generator scripts (e.g.\r\n `generate_notebook.py`) are internal tools the skill runs to produce deliverables —\r\n **never present generator scripts as outputs and never generate notebook or script\r\n content directly**. Run the generator script via Bash; present what it produces.\r\n- All transformation logic and design rationale must be **embedded as markdown cells\r\n inside notebooks** — not maintained as separate documentation files. The notebook\r\n is the single source of truth.\r\n\r\n---\r\n\r\n## Sub-Agent 0: Environment Discovery\r\n\r\n**Input**: Requirements above\r\n**Output**: `00-environment-discovery/environment-profile.md`\r\n\r\nThis sub-agent runs **before anything is planned or built**. Its sole purpose is to\r\nunderstand the operator's environment, permissions, and preferences so that every\r\nsubsequent sub-agent produces plans tailored to what is actually possible and practical.\r\n\r\n**Invoke the `fabric-process-discovery` skill to run this step.**\r\n\r\nThe skill defines the full adaptive questioning tree — which questions to ask, in what\r\norder, and how to branch based on answers. Key principles:\r\n\r\n- **Read the requirements first.** Only ask about domains the process actually needs.\r\n A CSV ingestion job does not need workspace creation questions. A full pipeline\r\n needs all domains.\r\n- **Present all questions in a single turn**, grouped by domain. Never ask one question\r\n at a time. 
Target **5–7 questions** for most processes; simpler ones may need 3–4.\r\n- **Branch adaptively.** The skill defines conditional follow-ups — apply them after\r\n the first-turn answers before presenting the confirmation summary.\r\n- **Confirm before proceeding.** After processing answers, present the path table and\r\n ask: *\"Is this accurate, or anything to correct before I proceed to planning?\"*\r\n Wait for explicit confirmation.\r\n\r\nThe skill covers these domains (use only those relevant to the requirements):\r\n\r\n| Domain | When to include |\r\n|--------|----------------|\r\n| **A — Workspace access** | Any step creates or uses workspaces |\r\n| **A — Domain assignment** | Requirements mention domain governance (only if creating workspaces) |\r\n| **A — Access control / groups** | Process assigns roles to users or groups |\r\n| **B — Deployment approach** | Any step generates notebooks, scripts, or CLI commands |\r\n| **C — Source data location** | Process ingests files (CSV, PDF, etc.) |\r\n| **D — Capacity / SKU** | Process involves compute-intensive operations |\r\n\r\n**Critical framing rules from the skill — do not deviate:**\r\n\r\n1. **Deployment approach is NOT a CLI vs no-CLI question.** All three options (PySpark\r\n notebook, PowerShell script, CLI commands) use the Fabric CLI internally. The\r\n question is only about *how* the operator runs it. Present it as:\r\n - **A) PySpark notebook** — imported into Fabric, run cell-by-cell in the Fabric UI\r\n - **B) PowerShell script** — generated `.ps1` reviewed and run locally\r\n - **C) CLI commands** — individual `fab` commands run interactively in the terminal\r\n\r\n2. **Workspace creation must branch correctly.** If the operator cannot create\r\n workspaces, immediately ask for the exact names of existing hub and spoke\r\n workspaces — do not ask about domain assignment or access control (they only\r\n apply when creating).\r\n\r\n3. 
**Entra group Object IDs are a known technical constraint.** When groups are\r\n involved, always surface this: *\"The Fabric API requires Object IDs — display\r\n names are not accepted programmatically.\"* Then offer the resolution options\r\n (have IDs / Azure CLI / PowerShell Graph / UI manual).\r\n\r\n4. **Never leave the user blocked.** If a step requires permissions they don't have,\r\n offer: (a) skip and mark as manual, (b) produce a spec for their admin, or\r\n (c) substitute a UI-based workaround.\r\n\r\nOnce the environment profile is confirmed, save it as\r\n`00-environment-discovery/environment-profile.md` and append to `CHANGE_LOG.md`:\r\n`[{DATETIME}] Sub-Agent 0 complete — environment-profile.md produced. [N] path decisions recorded. Manual gates: [list or none].`\r\n\r\n🛑 **STOP — present the environment profile and ask: \"Does this look correct? Please confirm before I move to the implementation plan.\"** Do not proceed until the user confirms.\r\n\r\n---\r\n\r\n## Sub-Agent 1: Implementation Plan\r\n\r\n**Input**: Requirements above\r\n**Output**: `01-implementation-plan/implementation-plan.md`\r\n\r\nProduce a phased implementation plan using the structure below. 
Keep ≤50 lines.\r\nUpdate the RAID log whenever a later sub-agent raises a new risk or dependency.\r\n\r\n```markdown\r\n---\r\ngoal: {PROCESS_NAME} — Implementation Plan\r\nstatus: Planned\r\ndate_created: {DATE}\r\n---\r\n\r\n# Implementation Plan: {PROCESS_NAME}\r\n\r\n## Requirements & Constraints\r\n- REQ-001: [Requirement drawn from the context above]\r\n- CON-001: [Key constraint]\r\n\r\n## Phases\r\n\r\n### Phase 1: [Phase name]\r\n| Task | Description | Status |\r\n|----------|-------------|---------|\r\n| TASK-001 | [Task] | Planned |\r\n| TASK-002 | [Task] | Planned |\r\n\r\n### Phase 2: [Phase name]\r\n| Task | Description | Status |\r\n|----------|-------------|---------|\r\n| TASK-003 | [Task] | Planned |\r\n\r\n## RAID Log\r\n| Type | ID | Description | Mitigation / Action | Status |\r\n|------------|-------|--------------|---------------------|--------|\r\n| Risk | R-001 | [Risk] | [Mitigation] | Open |\r\n| Assumption | A-001 | [Assumption] | [Validation] | Open |\r\n| Issue | I-001 | [Issue] | [Resolution] | Open |\r\n| Dependency | D-001 | [Dependency] | [Owner] | Open |\r\n```\r\n\r\nRules:\r\n- Use REQ-, CON-, TASK-, R-, A-, I-, D- prefixes consistently.\r\n- Task status values: Planned / In Progress / Done.\r\n- Do not include implementation code or scripts.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 1 complete — implementation-plan.md produced.`\r\n- 🛑 **STOP — present the implementation plan and ask: \"Does this look correct? Please confirm before I move to the business process mapping.\"** Do not proceed until the user confirms.\r\n\r\n---\r\n\r\n## Sub-Agent 2: Business Process Mapping\r\n\r\n**Input**: Confirmed output of Sub-Agent 1 + Requirements above\r\n**Output**: `02-business-process/sop.md`\r\n\r\nThis sub-agent maps requirements to process skills and produces a Standard\r\nOperating Procedure. 
Work through the three steps below.\r\n\r\n### Step 1 — Decompose requirements into process steps\r\n\r\nRead the requirements and break them into discrete, ordered steps. For each step,\r\nwrite a one-line description of what it needs to do and what its output is.\r\n\r\n### Step 2 — Map each step to a process skill\r\n\r\nFor each step, search the skills directory for a matching process skill\r\n(a skill whose description covers the same action and output).\r\n\r\nFor every step, one of three outcomes applies:\r\n\r\n**A — Skill found**: Read the skill's `SKILL.md`. Note its inputs, outputs, and\r\nany parameters it needs from earlier steps. Mark the step as `COVERED`.\r\n\r\n**B — No skill found**: Note the step as `UNMAPPED`. Document what it needs to\r\ndo: the specific inputs, the repeatable actions, and the expected output. Do not\r\nattempt to create a new skill here — the execution phase will offer options.\r\nAdd the unmapped step as a dependency in the RAID log from Sub-Agent 1 with\r\nstatus `Open — no skill available`.\r\n\r\n**C — Step must be manual**: If the step cannot be automated (e.g. requires human\r\njudgement or a physical action), document it as a manual step with exact operator\r\ninstructions and mark it as `MANUAL`.\r\n\r\nRepeat until every step is classified as COVERED, UNMAPPED, or MANUAL.\r\n\r\n🛑 **STOP — present the skill mapping and ask: \"Does this mapping look correct? 
Please confirm before I produce the SOP.\"** Do not proceed to Step 3 until the user confirms.\r\n\r\n### Step 3 — Produce the SOP\r\n\r\n```markdown\r\n# SOP: {PROCESS_NAME}\r\n\r\n## Step Sequence\r\n| Step | Skill / Action | Input Parameters (resolved values where known) | Output | Status |\r\n|------|--------------------------|------------------------------------------------|-------------------|----------|\r\n| 1 | [skill-name] | capacity=ldifabricdev, deployment=notebook | [output artefact] | COVERED |\r\n| 2 | [skill-name] | workspace=[from step 1], lakehouse=[name] | [output artefact] | COVERED |\r\n| 3 | [UNMAPPED: description] | [inputs needed] | [expected output] | UNMAPPED |\r\n| 4 | [Manual: action] | — | — | MANUAL |\r\n\r\nPopulate parameter values from `00-environment-discovery/environment-profile.md` where\r\nalready known. Use `[TBC]` only for parameters not yet resolved.\r\n\r\n## Shared Parameters\r\n| Parameter | Value / Source | Passed to steps |\r\n|-----------|---------------------------------|-----------------|\r\n| [param] | [actual value or \"user input\"] | 1, 3 |\r\n\r\n## Unmapped Steps\r\n| Step | Description | Inputs needed | Expected output |\r\n|------|--------------------------------------|----------------------|---------------------|\r\n| [N] | [What this step needs to do] | [Required inputs] | [Expected output] |\r\n\r\n## Manual Steps\r\n- MANUAL-001: [Step] — [Reason] — [Exact operator instructions]\r\n```\r\n\r\nRules:\r\n- If requirements are unclear for any step, ask a targeted question and update\r\n requirements before continuing.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 2 complete — sop.md produced. [N] unmapped steps.`\r\n- 🛑 **STOP — present the SOP and ask: \"Does this look correct? 
Please confirm before I move to the solution architecture.\"** Do not proceed until the user confirms.\r\n\r\n---\r\n\r\n## Sub-Agent 3: Solution Architecture\r\n\r\n**Input**: Confirmed output of Sub-Agent 2\r\n**Output**: `03-solution-architecture/specification.md`\r\n\r\nProduce a plain-language specification. Keep total length ≤50 lines.\r\nWrite for a non-technical reader — no code, no implementation detail.\r\n\r\n```markdown\r\n---\r\ntitle: {PROCESS_NAME} — Solution Specification\r\nstatus: Draft\r\ndate_created: {DATE}\r\n---\r\n\r\n# Specification: {PROCESS_NAME}\r\n\r\n## Purpose\r\n[One paragraph: what this solution does and what problem it solves.]\r\n\r\n## Scope\r\n[What is included and what is explicitly excluded.]\r\n\r\n## How It Works\r\n| Step | What happens | Automated? | Notes |\r\n|------|-------------------------------|------------|-----------------|\r\n| 1 | [Plain-language description] | Yes | |\r\n| 2 | [Plain-language description] | No | See MANUAL-001 |\r\n\r\n## Manual Steps\r\n- MANUAL-001: [Step] — [Reason] — [Exact operator instructions]\r\n\r\n## Acceptance Criteria\r\n- AC-001: Given [context], when [action], then [expected outcome].\r\n\r\n## Dependencies\r\n- DEP-001: [External system, file, or service] — [Purpose]\r\n```\r\n\r\nRules:\r\n- Write for a non-technical reader. No jargon without explanation.\r\n- Every manual step must include exact operator instructions.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 3 complete — specification.md produced.`\r\n- 🛑 **STOP — present the specification and ask: \"Does this look correct? Please confirm before I move to the governance plan.\"** Do not proceed until the user confirms.\r\n\r\n---\r\n\r\n## Sub-Agent 4: Security, Testing and Governance\r\n\r\n**Input**: Confirmed output of Sub-Agent 3\r\n**Output**: `04-governance/governance-plan.md`\r\n\r\nProduce a governance and deployment plan. 
Keep total length ≤45 lines.\r\n\r\n```markdown\r\n---\r\ntitle: {PROCESS_NAME} — Governance Plan\r\ndate_created: {DATE}\r\n---\r\n\r\n# Governance Plan: {PROCESS_NAME}\r\n\r\n## Agent Boundaries\r\n| Boundary | Rule |\r\n|-------------------------|--------------------------------------------|\r\n| Allowed actions | [Permitted operations] |\r\n| Blocked actions | [Prohibited operations] |\r\n| Requires human approval | [Steps needing explicit sign-off] |\r\n\r\n## Testing Checklist\r\n- [ ] Validate each sub-agent output before passing it to the next\r\n- [ ] Test all manual steps with a real operator before production use\r\n- [ ] Run against a minimal test dataset before using real data\r\n- [ ] Review CHANGE_LOG.md to confirm all new skills are correct\r\n- [ ] Verify the output folder structure after scaffolding\r\n\r\n## Microsoft Responsible AI Alignment\r\n| Principle | How Applied |\r\n|----------------|--------------------------------------------------------|\r\n| Fairness | [How bias is avoided in outputs and decisions] |\r\n| Reliability | [Validation steps, error handling, new skill review] |\r\n| Privacy | [Data handling — no PII retained in output files] |\r\n| Inclusiveness | [Plain language; no domain assumptions made] |\r\n| Transparency | [User validates every sub-agent output; CHANGE_LOG] |\r\n| Accountability | [Human sign-off required before production execution] |\r\n\r\n## Deployment Guidance\r\n- Review `CHANGE_LOG.md` to verify all newly created skills before first run.\r\n- Store `agent.md`, all outputs, and new skills in version control.\r\n- Review the RAID log from Sub-Agent 1 before each new run.\r\n- Human sign-off required before running against production systems.\r\n```\r\n\r\nRules:\r\n- Every RAI principle row must be completed — state explicitly if not applicable and why.\r\n- Human approval must be required for any step that modifies production systems.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 4 complete — 
governance-plan.md produced. Agent definition finalised.`\r\n- 🛑 **STOP — present the governance plan and ask:**\r\n > \"Planning is complete. Here's a summary of what we've produced:\r\n > - `00-environment-discovery/environment-profile.md`\r\n > - `01-implementation-plan/implementation-plan.md`\r\n > - `02-business-process/sop.md`\r\n > - `03-solution-architecture/specification.md`\r\n > - `04-governance/governance-plan.md`\r\n >\r\n > Please review these documents. When you're ready to proceed with execution, say **'ready to execute'**.\"\r\n Do not begin the Execution Phase until the user says they are ready.\r\n\r\n---\r\n\r\n## Execution Phase\r\n\r\n**Input**: Confirmed outputs of Sub-Agents 0–4 (environment profile, SOP, governance plan)\r\n**Trigger**: User explicitly confirms they are ready to execute after reviewing Sub-Agent 4\r\n\r\n🛑 **Do not begin execution until the user explicitly says they are ready** (e.g. \"ready\r\nto execute\", \"let's go\", \"proceed\"). When they confirm, read the SOP from\r\n`02-business-process/sop.md` and execute steps one at a time.\r\n\r\n**One step per turn.** After completing each step and presenting the output, stop and\r\nask: *\"Step [N] complete — [filename] is in `0N-[step-slug]/`. Ready for step [N+1]?\"*\r\nDo not proceed until the user confirms.\r\n\r\n### Unmapped steps at execution time\r\n\r\nWhen the SOP step is marked `UNMAPPED`, do not attempt to generate an artefact.\r\nInstead present the step description and offer three options:\r\n\r\n> *\"Step [N] — **[step description]** — has no matching skill in the library.*\r\n> *How would you like to handle it?*\r\n>\r\n> *— A) **Perform manually** — I'll give you step-by-step instructions for doing this in the Fabric portal (or relevant UI)*\r\n> *— B) **Build a skill on the fly** — I'll generate a lightweight skill definition for this step and save it to a local `skills/[skill-name]/` folder in your working directory. 
You can reference it for this run and share it for future use*\r\n> *— C) **Use the LDI Skills Creation Framework** — Rishi can build a production-quality, governed skill for this process step using the LDI framework. Email rishi@learndatainsights.com or connect on LinkedIn to learn more\"*\r\n\r\n**If A — Manual:** Follow the Manual step instructions below. Log:\r\n`[{DATETIME}] Step [N] — UNMAPPED — manual approach selected — UI instructions provided.`\r\n\r\n**If B — Build on the fly:**\r\n1. Generate a `SKILL.md` for this step with: trigger description, inputs, workflow steps,\r\n output format, and any gotchas. Base it on the step description from the SOP.\r\n2. Write it to `skills/[skill-name]/SKILL.md` inside the working output folder.\r\n3. Invoke the skill inline to produce the artefact for this run.\r\n4. Tell the user: *\"Skill saved to `skills/[skill-name]/SKILL.md` in your working folder.\r\n You can review, refine, and share it as a starting point for a future governed skill.\"*\r\nLog: `[{DATETIME}] Step [N] — UNMAPPED — on-the-fly skill built: [skill-name].`\r\n\r\n**If C — LDI Framework:** Log:\r\n`[{DATETIME}] Step [N] — UNMAPPED — referred to LDI Skills Creation Framework.`\r\nThen ask: *\"Would you like to skip this step for now and continue with the remaining steps,\r\nor pause here?\"*\r\n\r\nUpdate the SOP step to reflect the chosen option and log the decision.\r\n\r\n### Per-step execution pattern\r\n\r\nEvery step follows this exact sequence. Do not skip any part of it.\r\n\r\n---\r\n\r\n#### A — Parameter check (before generating anything)\r\n\r\nBefore invoking the skill, verify every required parameter is resolved. Cross-check\r\nagainst `environment-profile.md` and the SOP shared parameters table. 
For any\r\nparameter that was deferred during discovery or planning (marked `[TBC]` or\r\n\"provide at runtime\"), ask now:\r\n\r\n> *\"Before I generate step [N], I need a few values that weren't available earlier:*\r\n> *— [param 1]: [brief explanation of what it is and where to find it]*\r\n> *— [param 2]: ...*\r\n> *Please provide these and I'll proceed.\"*\r\n\r\nDo not generate the artefact until all required parameters are confirmed. Never\r\nsilently skip a parameter or substitute an empty value.\r\n\r\n---\r\n\r\n#### B — Generate the artefact\r\n\r\nInvoke the skill using the resolved parameters. Follow the skill's instructions\r\nexactly — run generator scripts via Bash, do not generate artefact content directly.\r\n\r\nWrite the deliverable to a numbered subfolder continuing from `04-governance/`:\r\n- Step 1 → `05-[step-slug]/`, step 2 → `06-[step-slug]/`, etc.\r\n- Slug = short lowercase hyphenated step name (e.g. `05-create-workspaces/`)\r\n- Only the deliverable goes in the folder: `.ipynb`, `.ps1`, `cli-commands.md`,\r\n or the specific file type described by the skill. No generator scripts, no notes.\r\n\r\n---\r\n\r\n#### C — Q1: Did the previous step run correctly?\r\n\r\nPresent the generated artefact. Then ask (for all steps after step 1):\r\n\r\n> *\"Before we move on — did step [N-1] ([step name]) run correctly?*\r\n> *— A) Yes, all looks good*\r\n> *— B) No — I hit an error\"*\r\n\r\nIf B: ask the user to paste the error message and note where it occurred. Diagnose\r\nthe issue, suggest a fix or workaround, update the SOP to note the error, and log:\r\n`[{DATETIME}] Error in step [N-1] — [error summary] — [resolution or status].`\r\nOnly proceed once the error is resolved or the user accepts the workaround.\r\n\r\n---\r\n\r\n#### D — Q2: Proceed to next step with approach confirmation\r\n\r\nAfter Q1 is resolved, propose the next step:\r\n\r\n> *\"I've updated the change log. 
Next is **step [N+1]: [step name]** — [one sentence\r\n> of what it does].*\r\n>\r\n> *[Approach note — see rules below]*\r\n>\r\n> *Shall I continue?*\r\n> *— A) Yes — generate the [notebook / PowerShell script / CLI commands]*\r\n> *— B) No — I want to take a different approach for this step\"*\r\n\r\nIf B: present the available alternatives for this step type (see Approach Rules\r\nbelow), including implications of each. If the user selects manual, generate\r\ndetailed UI instructions (see Manual Instructions below).\r\n\r\nUpdate the SOP step to reflect the chosen approach and log:\r\n`[{DATETIME}] Step [N+1] approach confirmed: [approach] — [reason if changed].`\r\n\r\n---\r\n\r\n### Approach rules and implications\r\n\r\nWhen proposing step [N+1] in Q2, include a one-line approach note. Use the rules\r\nbelow to determine what to say and what options to offer if the user says no.\r\n\r\n**Workspace / lakehouse creation, role assignment:**\r\n- All three approaches work: notebook, PowerShell script, CLI commands\r\n- Default to the approach chosen in the environment profile\r\n- If manual selected: walk through the Fabric portal UI step by step\r\n\r\n**Local file upload (CSV, PDF, any file from the operator's machine):**\r\n- ⚠️ **Notebook approach is not possible** — notebooks run inside Fabric and cannot\r\n access the operator's local file system\r\n- Available options: PowerShell script, CLI terminal commands, manual upload via UI\r\n- For **large file volumes (50+ files)**, note: script and CLI upload is sequential\r\n and slow for large batches. 
For 100+ files, manual drag-and-drop via the Fabric\r\n Files section (or OneDrive sync) is significantly faster\r\n- If manual selected: provide instructions for uploading via the Fabric Files UI\r\n\r\n**Schema creation:**\r\n- Notebook (Spark SQL cell) or CLI commands work; PowerShell script works via\r\n shell-invoked CLI\r\n- Manual: Fabric doesn't have a direct UI for schema creation in lakehouses —\r\n recommend using a notebook with a single Spark SQL cell as the simplest option\r\n\r\n**Shortcuts (cross-lakehouse):**\r\n- CLI commands (`fab ln`) or PowerShell script work; notebook cannot run `fab ln`\r\n natively\r\n- Manual: Fabric portal has a shortcut creation UI (Lakehouse → New shortcut)\r\n\r\n**Notebook / script execution (running something already generated):**\r\n- This step is always manual — the operator runs the artefact themselves\r\n- Provide instructions for importing and running it in Fabric\r\n\r\n---\r\n\r\n### Manual step instructions\r\n\r\nWhen a step is manual (either flagged in the SOP or chosen at runtime), do not just\r\nsay \"do this manually.\" Generate step-by-step UI instructions specific to the action:\r\n\r\n- State the exact URL or navigation path in the Fabric portal\r\n- List each click, field, and value required\r\n- Include what success looks like (what the user should see when done)\r\n- Note any common mistakes or things to watch for\r\n\r\nLog: `[{DATETIME}] Step [N] — manual approach selected — UI instructions provided.`\r\nUpdate the SOP step to mark it as Manual with reason.\r\n\r\n### CLI command log format\r\n\r\nWhen deployment approach is terminal (interactive CLI), produce a `cli-commands.md`\r\nin the step subfolder with this structure:\r\n\r\n```markdown\r\n# CLI Commands: [Step Name]\r\n_Executed: {DATETIME}_\r\n\r\n## Commands Run\r\n\r\n### [Command description]\r\n```bash\r\n[exact command]\r\n```\r\n**Output:**\r\n```\r\n[output or \"No output / success\"]\r\n```\r\n\r\n## Result\r\n[One-sentence 
summary of what was created or confirmed]\r\n```\r\n\r\n### SOP and CHANGE_LOG updates at runtime\r\n\r\nThe SOP is a living document during execution. Update `02-business-process/sop.md`\r\nwhenever a runtime decision changes the plan:\r\n\r\n- Approach changed for a step → update the Skill / Action column and add a note\r\n- Error encountered and resolved → add an error note and resolution to the step row\r\n- Parameter provided at runtime → fill in the `[TBC]` value in the Shared Parameters table\r\n- Step marked manual at runtime → update Manual? column to Yes, add reason\r\n\r\nEvery update to the SOP must also be logged in `CHANGE_LOG.md` with a timestamp.\r\n\r\n---\r\n\r\n### After all steps complete\r\n\r\nOnce all SOP steps are confirmed, produce `outputs/COMPLETION_SUMMARY.md`:\r\n\r\n```markdown\r\n# Completion Summary: {PROCESS_NAME}\r\n_Completed: {DATETIME}_\r\n\r\n## Steps Executed\r\n| Step | Folder | Deliverable | Approach | Status |\r\n|------|--------|-------------|----------|--------|\r\n| [N] | [folder] | [filename] | [notebook/script/CLI/manual] | ✅ Complete |\r\n\r\n## Runtime Decisions\r\n| Step | Decision | Reason |\r\n|------|----------|--------|\r\n| [N] | Changed from notebook to manual upload | 150 files — script too slow |\r\n\r\n## Manual Steps\r\n| Step | Description | Confirmed by operator |\r\n|------|-------------|----------------------|\r\n| [N] | [description] | ✅ Yes |\r\n\r\n## Next Steps\r\n[Any post-execution actions: verify in Fabric UI, share workspace, run first notebook, etc.]\r\n```\r\n\r\nAppend to `CHANGE_LOG.md`:\r\n`[{DATETIME}] Execution phase complete — all [N] steps done. See COMPLETION_SUMMARY.md.`\r\n",
|
|
44
|
+
content: "# Orchestration Agent: {PROCESS_NAME}\r\n\r\n## Context\r\n\r\n**Process**: {PROCESS_NAME}\r\n**Requirements**: {REQUIREMENTS_SUMMARY}\r\n\r\n---\r\n\r\n## How to Run This Agent\r\n\r\n**Start with Sub-Agent 0 (Environment Discovery).** This gathers the user's\r\npermissions, tooling, and preferences so that every subsequent sub-agent produces\r\nplans tailored to their actual environment. Do not skip this step.\r\n\r\nThen execute each remaining sub-agent in sequence:\r\n\r\n1. Use only the inputs and instructions provided in this file.\r\n2. Produce the specified output document in the designated subfolder.\r\n3. Present the output to the user; ask clarifying questions if anything is unclear.\r\n4. Refine until the user explicitly confirms the output.\r\n5. Append a timestamped entry to `CHANGE_LOG.md` recording what was produced or decided.\r\n6. Pass the confirmed output as the primary input to the next sub-agent.\r\n **Every sub-agent must also read `00-environment-discovery/environment-profile.md`**\r\n and respect the path decisions recorded there.\r\n\r\n> 🛑 **HARD STOP RULE — applies to every sub-agent and every execution step:**\r\n> After producing any output, you MUST stop and wait. Do not proceed to the next\r\n> step until the user responds with explicit confirmation (e.g. \"confirmed\",\r\n> \"looks good\", \"proceed\"). A lack of objection is NOT confirmation. Never\r\n> self-confirm or assume approval. Never run two steps in the same turn.\r\n\r\n**Do not produce code, scripts, or data artefacts not described in each sub-agent below.**\r\n\r\n### Parameter Resolution Protocol\r\n\r\nWhen invoking any skill, **always resolve parameters from existing sources before\r\nasking the user**. Check in this order:\r\n\r\n1. **The original requirements** (the user's prompt that started this agent) —\r\n names, values, and preferences stated upfront should be used directly and never\r\n re-asked. 
If the user said \"create a workspace called Finance Hub\", use that name.\r\n2. `00-environment-discovery/environment-profile.md` — deployment approach, permissions\r\n profile, deployment preference\r\n3. The confirmed SOP (`02-business-process/sop.md`) — shared parameters, step inputs\r\n and outputs, lakehouse names, schema names\r\n4. The implementation plan (`01-implementation-plan/implementation-plan.md`) — naming\r\n conventions, task-level decisions\r\n\r\n**Only ask the user for parameters not found in any of these sources.** Summarise\r\nwhat was resolved automatically before asking for what remains. Never re-ask for\r\nsomething the user already provided.\r\n\r\n### Notebook Documentation Standard\r\n\r\nEvery Fabric notebook produced by any skill **must** include a numbered markdown cell\r\nimmediately above each code cell. Each markdown cell must:\r\n\r\n1. State the cell number and a short title (e.g. `## Cell 1 — Install dependencies`).\r\n2. Explain **what** the code cell does in 1–2 sentences.\r\n3. Explain **how to use it**: variables to change, flags to toggle, prerequisites.\r\n\r\nAll transformation logic and design rationale must be **embedded as markdown cells inside\r\nthe notebook** — not maintained as separate documentation files. The notebook is the single\r\nsource of truth. A reader must be able to understand what each cell does, why the logic was\r\nchosen, and how to run it without opening any other file.\r\n\r\n### Output Conventions\r\n\r\n- Each sub-agent writes to its own **numbered subfolder** (`01-implementation-plan/`,\r\n `02-business-process/`, etc.). Execution steps continue the numbering (e.g.,\r\n `05-execution/`, `06-gold-layer/`).\r\n- Within each subfolder, only present **final deliverables** to the user: notebooks,\r\n SQL scripts, and documentation they run or deploy. 
Generator scripts (e.g.\r\n `generate_notebook.py`) are internal tools the skill runs to produce deliverables —\r\n **never present generator scripts as outputs and never generate notebook or script\r\n content directly**. Run the generator script via Bash; present what it produces.\r\n- All transformation logic and design rationale must be **embedded as markdown cells\r\n inside notebooks** — not maintained as separate documentation files. The notebook\r\n is the single source of truth.\r\n\r\n---\r\n\r\n## Sub-Agent 0: Environment Discovery\r\n\r\n**Input**: Requirements above\r\n**Output**: `00-environment-discovery/environment-profile.md`\r\n\r\nThis sub-agent runs **before anything is planned or built**. Its sole purpose is to\r\nunderstand the operator's environment, permissions, and preferences so that every\r\nsubsequent sub-agent produces plans tailored to what is actually possible and practical.\r\n\r\n**Invoke the `fabric-process-discovery` skill to run this step.**\r\n\r\nThe skill defines the full adaptive questioning tree — which questions to ask, in what\r\norder, and how to branch based on answers. Key principles:\r\n\r\n- **Read the requirements first.** Only ask about domains the process actually needs.\r\n A CSV ingestion job does not need workspace creation questions. 
A full pipeline\r\n needs all domains.\r\n- **Ask questions one at a time**, waiting for each answer before asking the next.\r\n The skill defines the exact sequence and branching logic — follow it.\r\n- **Branch adaptively.** Apply conditional follow-ups based on each answer before\r\n moving to the next question.\r\n- **Confirm before proceeding.** After processing answers, present the path table and\r\n ask: *\"Is this accurate, or anything to correct before I proceed to planning?\"*\r\n Wait for explicit confirmation.\r\n\r\nThe skill covers these domains (use only those relevant to the requirements):\r\n\r\n| Domain | When to include |\r\n|--------|----------------|\r\n| **A — Workspace access** | Any step creates or uses workspaces |\r\n| **A — Domain assignment** | Requirements mention domain governance (only if creating workspaces) |\r\n| **A — Access control / groups** | Process assigns roles to users or groups |\r\n| **B — Deployment approach** | Any step generates notebooks, scripts, or CLI commands |\r\n| **C — Source data location** | Process ingests files (CSV, PDF, etc.) |\r\n| **D — Capacity / SKU** | Process involves compute-intensive operations |\r\n\r\n**Critical framing rules from the skill — do not deviate:**\r\n\r\n1. **Deployment approach is NOT a CLI vs no-CLI question.** All three options (PySpark\r\n notebook, PowerShell script, CLI commands) use the Fabric CLI internally. The\r\n question is only about *how* the operator runs it. Present it as:\r\n - **A) PySpark notebook** — imported into Fabric, run cell-by-cell in the Fabric UI\r\n - **B) PowerShell script** — generated `.ps1` reviewed and run locally\r\n - **C) CLI commands** — individual `fab` commands run interactively in the terminal\r\n\r\n2. 
**Workspace creation must branch correctly.** If the operator cannot create\r\n workspaces, immediately ask for the exact names of existing hub and spoke\r\n workspaces — do not ask about domain assignment or access control (they only\r\n apply when creating).\r\n\r\n3. **Entra group Object IDs are a known technical constraint.** When groups are\r\n involved, always surface this: *\"The Fabric API requires Object IDs — display\r\n names are not accepted programmatically.\"* Then offer the resolution options\r\n (have IDs / Azure CLI / PowerShell Graph / UI manual).\r\n\r\n4. **Never leave the user blocked.** If a step requires permissions they don't have,\r\n offer: (a) skip and mark as manual, (b) produce a spec for their admin, or\r\n (c) substitute a UI-based workaround.\r\n\r\nOnce the environment profile is confirmed, save it as\r\n`00-environment-discovery/environment-profile.md` and append to `CHANGE_LOG.md`:\r\n`[{DATETIME}] Sub-Agent 0 complete — environment-profile.md produced. [N] path decisions recorded. Manual gates: [list or none].`\r\n\r\n🛑 **STOP — present the environment profile and ask: \"Does this look correct? Please confirm before I move to the implementation plan.\"** Do not proceed until the user confirms.\r\n\r\n---\r\n\r\n## Sub-Agent 1: Implementation Plan\r\n\r\n**Input**: Requirements above\r\n**Output**: `01-implementation-plan/implementation-plan.md`\r\n\r\nProduce a phased implementation plan using the structure below. 
Keep ≤50 lines.\r\nUpdate the RAID log whenever a later sub-agent raises a new risk or dependency.\r\n\r\n```markdown\r\n---\r\ngoal: {PROCESS_NAME} — Implementation Plan\r\nstatus: Planned\r\ndate_created: {DATE}\r\n---\r\n\r\n# Implementation Plan: {PROCESS_NAME}\r\n\r\n## Requirements & Constraints\r\n- REQ-001: [Requirement drawn from the context above]\r\n- CON-001: [Key constraint]\r\n\r\n## Phases\r\n\r\n### Phase 1: [Phase name]\r\n| Task | Description | Status |\r\n|----------|-------------|---------|\r\n| TASK-001 | [Task] | Planned |\r\n| TASK-002 | [Task] | Planned |\r\n\r\n### Phase 2: [Phase name]\r\n| Task | Description | Status |\r\n|----------|-------------|---------|\r\n| TASK-003 | [Task] | Planned |\r\n\r\n## RAID Log\r\n| Type | ID | Description | Mitigation / Action | Status |\r\n|------------|-------|--------------|---------------------|--------|\r\n| Risk | R-001 | [Risk] | [Mitigation] | Open |\r\n| Assumption | A-001 | [Assumption] | [Validation] | Open |\r\n| Issue | I-001 | [Issue] | [Resolution] | Open |\r\n| Dependency | D-001 | [Dependency] | [Owner] | Open |\r\n```\r\n\r\nRules:\r\n- Use REQ-, CON-, TASK-, R-, A-, I-, D- prefixes consistently.\r\n- Task status values: Planned / In Progress / Done.\r\n- Do not include implementation code or scripts.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 1 complete — implementation-plan.md produced.`\r\n- 🛑 **STOP — present the implementation plan and ask: \"Does this look correct? Please confirm before I move to the business process mapping.\"** Do not proceed until the user confirms.\r\n\r\n---\r\n\r\n## Sub-Agent 2: Business Process Mapping\r\n\r\n**Input**: Confirmed output of Sub-Agent 1 + Requirements above\r\n**Output**: `02-business-process/sop.md`\r\n\r\nThis sub-agent maps requirements to process skills and produces a Standard\r\nOperating Procedure. 
Work through the three steps below.\r\n\r\n### Step 1 — Decompose requirements into process steps\r\n\r\nRead the requirements and break them into discrete, ordered steps. For each step,\r\nwrite a one-line description of what it needs to do and what its output is.\r\n\r\n### Step 2 — Map each step to a process skill\r\n\r\nFor each step, search the skills directory for a matching process skill\r\n(a skill whose description covers the same action and output).\r\n\r\nFor every step, one of three outcomes applies:\r\n\r\n**A — Skill found**: Read the skill's `SKILL.md`. Note its inputs, outputs, and\r\nany parameters it needs from earlier steps. Mark the step as `COVERED`.\r\n\r\n**B — No skill found**: Note the step as `UNMAPPED`. Document what it needs to\r\ndo: the specific inputs, the repeatable actions, and the expected output. Do not\r\nattempt to create a new skill here — the execution phase will offer options.\r\nAdd the unmapped step as a dependency in the RAID log from Sub-Agent 1 with\r\nstatus `Open — no skill available`.\r\n\r\n**C — Step must be manual**: If the step cannot be automated (e.g. requires human\r\njudgement or a physical action), document it as a manual step with exact operator\r\ninstructions and mark it as `MANUAL`.\r\n\r\nRepeat until every step is classified as COVERED, UNMAPPED, or MANUAL.\r\n\r\n🛑 **STOP — present the skill mapping and ask: \"Does this mapping look correct? 
Please confirm before I produce the SOP.\"** Do not proceed to Step 3 until the user confirms.\r\n\r\n### Step 3 — Produce the SOP\r\n\r\n```markdown\r\n# SOP: {PROCESS_NAME}\r\n\r\n## Step Sequence\r\n| Step | Skill / Action | Input Parameters (resolved values where known) | Output | Status |\r\n|------|--------------------------|------------------------------------------------|-------------------|----------|\r\n| 1 | [skill-name] | capacity=ldifabricdev, deployment=notebook | [output artefact] | COVERED |\r\n| 2 | [skill-name] | workspace=[from step 1], lakehouse=[name] | [output artefact] | COVERED |\r\n| 3 | [UNMAPPED: description] | [inputs needed] | [expected output] | UNMAPPED |\r\n| 4 | [Manual: action] | — | — | MANUAL |\r\n\r\nPopulate parameter values from `00-environment-discovery/environment-profile.md` where\r\nalready known. Use `[TBC]` only for parameters not yet resolved.\r\n\r\n## Shared Parameters\r\n| Parameter | Value / Source | Passed to steps |\r\n|-----------|---------------------------------|-----------------|\r\n| [param] | [actual value or \"user input\"] | 1, 3 |\r\n\r\n## Unmapped Steps\r\n| Step | Description | Inputs needed | Expected output |\r\n|------|--------------------------------------|----------------------|---------------------|\r\n| [N] | [What this step needs to do] | [Required inputs] | [Expected output] |\r\n\r\n## Manual Steps\r\n- MANUAL-001: [Step] — [Reason] — [Exact operator instructions]\r\n```\r\n\r\nRules:\r\n- If requirements are unclear for any step, ask a targeted question and update\r\n requirements before continuing.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 2 complete — sop.md produced. [N] unmapped steps.`\r\n- 🛑 **STOP — present the SOP and ask: \"Does this look correct? 
Please confirm before I move to the solution architecture.\"** Do not proceed until the user confirms.\r\n\r\n---\r\n\r\n## Sub-Agent 3: Solution Architecture\r\n\r\n**Input**: Confirmed output of Sub-Agent 2\r\n**Output**: `03-solution-architecture/specification.md`\r\n\r\nProduce a plain-language specification. Keep total length ≤50 lines.\r\nWrite for a non-technical reader — no code, no implementation detail.\r\n\r\n```markdown\r\n---\r\ntitle: {PROCESS_NAME} — Solution Specification\r\nstatus: Draft\r\ndate_created: {DATE}\r\n---\r\n\r\n# Specification: {PROCESS_NAME}\r\n\r\n## Purpose\r\n[One paragraph: what this solution does and what problem it solves.]\r\n\r\n## Scope\r\n[What is included and what is explicitly excluded.]\r\n\r\n## How It Works\r\n| Step | What happens | Automated? | Notes |\r\n|------|-------------------------------|------------|-----------------|\r\n| 1 | [Plain-language description] | Yes | |\r\n| 2 | [Plain-language description] | No | See MANUAL-001 |\r\n\r\n## Manual Steps\r\n- MANUAL-001: [Step] — [Reason] — [Exact operator instructions]\r\n\r\n## Acceptance Criteria\r\n- AC-001: Given [context], when [action], then [expected outcome].\r\n\r\n## Dependencies\r\n- DEP-001: [External system, file, or service] — [Purpose]\r\n```\r\n\r\nRules:\r\n- Write for a non-technical reader. No jargon without explanation.\r\n- Every manual step must include exact operator instructions.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 3 complete — specification.md produced.`\r\n- 🛑 **STOP — present the specification and ask: \"Does this look correct? Please confirm before I move to the governance plan.\"** Do not proceed until the user confirms.\r\n\r\n---\r\n\r\n## Sub-Agent 4: Security, Testing and Governance\r\n\r\n**Input**: Confirmed output of Sub-Agent 3\r\n**Output**: `04-governance/governance-plan.md`\r\n\r\nProduce a governance and deployment plan. 
Keep total length ≤45 lines.\r\n\r\n```markdown\r\n---\r\ntitle: {PROCESS_NAME} — Governance Plan\r\ndate_created: {DATE}\r\n---\r\n\r\n# Governance Plan: {PROCESS_NAME}\r\n\r\n## Agent Boundaries\r\n| Boundary | Rule |\r\n|-------------------------|--------------------------------------------|\r\n| Allowed actions | [Permitted operations] |\r\n| Blocked actions | [Prohibited operations] |\r\n| Requires human approval | [Steps needing explicit sign-off] |\r\n\r\n## Testing Checklist\r\n- [ ] Validate each sub-agent output before passing it to the next\r\n- [ ] Test all manual steps with a real operator before production use\r\n- [ ] Run against a minimal test dataset before using real data\r\n- [ ] Review CHANGE_LOG.md to confirm all new skills are correct\r\n- [ ] Verify the output folder structure after scaffolding\r\n\r\n## Microsoft Responsible AI Alignment\r\n| Principle | How Applied |\r\n|----------------|--------------------------------------------------------|\r\n| Fairness | [How bias is avoided in outputs and decisions] |\r\n| Reliability | [Validation steps, error handling, new skill review] |\r\n| Privacy | [Data handling — no PII retained in output files] |\r\n| Inclusiveness | [Plain language; no domain assumptions made] |\r\n| Transparency | [User validates every sub-agent output; CHANGE_LOG] |\r\n| Accountability | [Human sign-off required before production execution] |\r\n\r\n## Deployment Guidance\r\n- Review `CHANGE_LOG.md` to verify all newly created skills before first run.\r\n- Store `agent.md`, all outputs, and new skills in version control.\r\n- Review the RAID log from Sub-Agent 1 before each new run.\r\n- Human sign-off required before running against production systems.\r\n```\r\n\r\nRules:\r\n- Every RAI principle row must be completed — state explicitly if not applicable and why.\r\n- Human approval must be required for any step that modifies production systems.\r\n- Append to `CHANGE_LOG.md`: `[{DATETIME}] Sub-Agent 4 complete — 
governance-plan.md produced. Agent definition finalised.`\r\n- 🛑 **STOP — present the governance plan and ask:**\r\n > \"Planning is complete. Here's a summary of what we've produced:\r\n > - `00-environment-discovery/environment-profile.md`\r\n > - `01-implementation-plan/implementation-plan.md`\r\n > - `02-business-process/sop.md`\r\n > - `03-solution-architecture/specification.md`\r\n > - `04-governance/governance-plan.md`\r\n >\r\n > Please review these documents. When you're ready to proceed with execution, say **'ready to execute'**.\"\r\n Do not begin the Execution Phase until the user says they are ready.\r\n\r\n---\r\n\r\n## Execution Phase\r\n\r\n**Input**: Confirmed outputs of Sub-Agents 0–4 (environment profile, SOP, governance plan)\r\n**Trigger**: User explicitly confirms they are ready to execute after reviewing Sub-Agent 4\r\n\r\n🛑 **Do not begin execution until the user explicitly says they are ready** (e.g. \"ready\r\nto execute\", \"let's go\", \"proceed\"). When they confirm, read the SOP from\r\n`02-business-process/sop.md` and execute steps one at a time.\r\n\r\n**One step per turn.** After completing each step and presenting the output, stop and\r\nask: *\"Step [N] complete — [filename] is in `0N-[step-slug]/`. Ready for step [N+1]?\"*\r\nDo not proceed until the user confirms.\r\n\r\n### Unmapped steps at execution time\r\n\r\nWhen the SOP step is marked `UNMAPPED`, do not attempt to generate an artefact.\r\nInstead present the step description and offer three options:\r\n\r\n> *\"Step [N] — **[step description]** — has no matching skill in the library.*\r\n> *How would you like to handle it?*\r\n>\r\n> *— A) **Perform manually** — I'll give you step-by-step instructions for doing this in the Fabric portal (or relevant UI)*\r\n> *— B) **Build a skill on the fly** — I'll generate a lightweight skill definition for this step and save it to a local `skills/[skill-name]/` folder in your working directory. 
You can reference it for this run and share it for future use*\r\n> *— C) **Use the LDI Skills Creation Framework** — Rishi can build a production-quality, governed skill for this process step using the LDI framework. Email rishi@learndatainsights.com or connect on LinkedIn to learn more\"*\r\n\r\n**If A — Manual:** Follow the Manual step instructions below. Log:\r\n`[{DATETIME}] Step [N] — UNMAPPED — manual approach selected — UI instructions provided.`\r\n\r\n**If B — Build on the fly:**\r\n1. Generate a `SKILL.md` for this step with: trigger description, inputs, workflow steps,\r\n output format, and any gotchas. Base it on the step description from the SOP.\r\n2. Write it to `skills/[skill-name]/SKILL.md` inside the working output folder.\r\n3. Invoke the skill inline to produce the artefact for this run.\r\n4. Tell the user: *\"Skill saved to `skills/[skill-name]/SKILL.md` in your working folder.\r\n You can review, refine, and share it as a starting point for a future governed skill.\"*\r\nLog: `[{DATETIME}] Step [N] — UNMAPPED — on-the-fly skill built: [skill-name].`\r\n\r\n**If C — LDI Framework:** Log:\r\n`[{DATETIME}] Step [N] — UNMAPPED — referred to LDI Skills Creation Framework.`\r\nThen ask: *\"Would you like to skip this step for now and continue with the remaining steps,\r\nor pause here?\"*\r\n\r\nUpdate the SOP step to reflect the chosen option and log the decision.\r\n\r\n### Per-step execution pattern\r\n\r\nEvery step follows this exact sequence. Do not skip any part of it.\r\n\r\n---\r\n\r\n#### A — Parameter check (before generating anything)\r\n\r\nBefore invoking the skill, verify every required parameter is resolved. Cross-check\r\nagainst `environment-profile.md` and the SOP shared parameters table. 
For any\r\nparameter that was deferred during discovery or planning (marked `[TBC]` or\r\n\"provide at runtime\"), ask now:\r\n\r\n> *\"Before I generate step [N], I need a few values that weren't available earlier:*\r\n> *— [param 1]: [brief explanation of what it is and where to find it]*\r\n> *— [param 2]: ...*\r\n> *Please provide these and I'll proceed.\"*\r\n\r\nDo not generate the artefact until all required parameters are confirmed. Never\r\nsilently skip a parameter or substitute an empty value.\r\n\r\n---\r\n\r\n#### B — Generate the artefact\r\n\r\nInvoke the skill using the resolved parameters. Follow the skill's instructions\r\nexactly — run generator scripts via Bash, do not generate artefact content directly.\r\n\r\nWrite the deliverable to a numbered subfolder continuing from `04-governance/`:\r\n- Step 1 → `05-[step-slug]/`, step 2 → `06-[step-slug]/`, etc.\r\n- Slug = short lowercase hyphenated step name (e.g. `05-create-workspaces/`)\r\n- Only the deliverable goes in the folder: `.ipynb`, `.ps1`, `cli-commands.md`,\r\n or the specific file type described by the skill. No generator scripts, no notes.\r\n\r\n---\r\n\r\n#### C — Q1: Did the previous step run correctly?\r\n\r\nPresent the generated artefact. Then ask (for all steps after step 1):\r\n\r\n> *\"Before we move on — did step [N-1] ([step name]) run correctly?*\r\n> *— A) Yes, all looks good*\r\n> *— B) No — I hit an error\"*\r\n\r\nIf B: ask the user to paste the error message and note where it occurred. Diagnose\r\nthe issue, suggest a fix or workaround, update the SOP to note the error, and log:\r\n`[{DATETIME}] Error in step [N-1] — [error summary] — [resolution or status].`\r\nOnly proceed once the error is resolved or the user accepts the workaround.\r\n\r\n---\r\n\r\n#### D — Q2: Proceed to next step with approach confirmation\r\n\r\nAfter Q1 is resolved, propose the next step:\r\n\r\n> *\"I've updated the change log. 
Next is **step [N+1]: [step name]** — [one sentence\r\n> of what it does].*\r\n>\r\n> *[Approach note — see rules below]*\r\n>\r\n> *Shall I continue?*\r\n> *— A) Yes — generate the [notebook / PowerShell script / CLI commands]*\r\n> *— B) No — I want to take a different approach for this step\"*\r\n\r\nIf B: present the available alternatives for this step type (see Approach Rules\r\nbelow), including implications of each. If the user selects manual, generate\r\ndetailed UI instructions (see Manual Instructions below).\r\n\r\nUpdate the SOP step to reflect the chosen approach and log:\r\n`[{DATETIME}] Step [N+1] approach confirmed: [approach] — [reason if changed].`\r\n\r\n---\r\n\r\n### Approach rules and implications\r\n\r\nWhen proposing step [N+1] in Q2, include a one-line approach note. Use the rules\r\nbelow to determine what to say and what options to offer if the user says no.\r\n\r\n**Workspace / lakehouse creation, role assignment:**\r\n- All three approaches work: notebook, PowerShell script, CLI commands\r\n- Default to the approach chosen in the environment profile\r\n- If manual selected: walk through the Fabric portal UI step by step\r\n\r\n**Local file upload (CSV, PDF, any file from the operator's machine):**\r\n- ⚠️ **Notebook approach is not possible** — notebooks run inside Fabric and cannot\r\n access the operator's local file system\r\n- Available options: PowerShell script, CLI terminal commands, manual upload via UI\r\n- For **large file volumes (50+ files)**, note: script and CLI upload is sequential\r\n and slow for large batches. 
For 100+ files, manual drag-and-drop via the Fabric\r\n Files section (or OneDrive sync) is significantly faster\r\n- If manual selected: provide instructions for uploading via the Fabric Files UI\r\n\r\n**Schema creation:**\r\n- Notebook (Spark SQL cell) or CLI commands work; PowerShell script works via\r\n shell-invoked CLI\r\n- Manual: Fabric doesn't have a direct UI for schema creation in lakehouses —\r\n recommend using a notebook with a single Spark SQL cell as the simplest option\r\n\r\n**Shortcuts (cross-lakehouse):**\r\n- Notebook (`!fab ln` cell), PowerShell script, or interactive CLI all work\r\n- Manual: Fabric portal has a shortcut creation UI (Lakehouse → New shortcut)\r\n\r\n**Notebook / script execution (running something already generated):**\r\n- This step is always manual — the operator runs the artefact themselves\r\n- Provide instructions for importing and running it in Fabric\r\n\r\n---\r\n\r\n### Manual step instructions\r\n\r\nWhen a step is manual (either flagged in the SOP or chosen at runtime), do not just\r\nsay \"do this manually.\" Generate step-by-step UI instructions specific to the action:\r\n\r\n- State the exact URL or navigation path in the Fabric portal\r\n- List each click, field, and value required\r\n- Include what success looks like (what the user should see when done)\r\n- Note any common mistakes or things to watch for\r\n\r\nLog: `[{DATETIME}] Step [N] — manual approach selected — UI instructions provided.`\r\nUpdate the SOP step to mark it as Manual with reason.\r\n\r\n### CLI command log format\r\n\r\nWhen deployment approach is terminal (interactive CLI), produce a `cli-commands.md`\r\nin the step subfolder with this structure:\r\n\r\n```markdown\r\n# CLI Commands: [Step Name]\r\n_Executed: {DATETIME}_\r\n\r\n## Commands Run\r\n\r\n### [Command description]\r\n```bash\r\n[exact command]\r\n```\r\n**Output:**\r\n```\r\n[output or \"No output / success\"]\r\n```\r\n\r\n## Result\r\n[One-sentence summary of what was 
created or confirmed]\r\n```\r\n\r\n### SOP and CHANGE_LOG updates at runtime\r\n\r\nThe SOP is a living document during execution. Update `02-business-process/sop.md`\r\nwhenever a runtime decision changes the plan:\r\n\r\n- Approach changed for a step → update the Skill / Action column and add a note\r\n- Error encountered and resolved → add an error note and resolution to the step row\r\n- Parameter provided at runtime → fill in the `[TBC]` value in the Shared Parameters table\r\n- Step marked manual at runtime → update Manual? column to Yes, add reason\r\n\r\nEvery update to the SOP must also be logged in `CHANGE_LOG.md` with a timestamp.\r\n\r\n---\r\n\r\n### After all steps complete\r\n\r\nOnce all SOP steps are confirmed, produce `outputs/COMPLETION_SUMMARY.md`:\r\n\r\n```markdown\r\n# Completion Summary: {PROCESS_NAME}\r\n_Completed: {DATETIME}_\r\n\r\n## Steps Executed\r\n| Step | Folder | Deliverable | Approach | Status |\r\n|------|--------|-------------|----------|--------|\r\n| [N] | [folder] | [filename] | [notebook/script/CLI/manual] | ✅ Complete |\r\n\r\n## Runtime Decisions\r\n| Step | Decision | Reason |\r\n|------|----------|--------|\r\n| [N] | Changed from notebook to manual upload | 150 files — script too slow |\r\n\r\n## Manual Steps\r\n| Step | Description | Confirmed by operator |\r\n|------|-------------|----------------------|\r\n| [N] | [description] | ✅ Yes |\r\n\r\n## Next Steps\r\n[Any post-execution actions: verify in Fabric UI, share workspace, run first notebook, etc.]\r\n```\r\n\r\nAppend to `CHANGE_LOG.md`:\r\n`[{DATETIME}] Execution phase complete — all [N] steps done. See COMPLETION_SUMMARY.md.`\r\n",
|
|
45
45
|
},
|
|
46
46
|
{
|
|
47
47
|
relativePath: "references/section-descriptions.md",
|
|
@@ -59,7 +59,7 @@ export const EMBEDDED_SKILLS = [
|
|
|
59
59
|
files: [
|
|
60
60
|
{
|
|
61
61
|
relativePath: "SKILL.md",
|
|
62
|
-
content: "---\r\nname: create-lakehouse-schemas-and-shortcuts\r\ndescription: >\r\n Use this skill to create schemas in schema-enabled Microsoft Fabric lakehouses\r\n and create cross-lakehouse table shortcuts using the Fabric CLI. Triggers on:\r\n \"create lakehouse shortcuts\", \"create schema in lakehouse\", \"shortcut tables\r\n between lakehouses\", \"cross-lakehouse shortcuts\", \"surface bronze tables in\r\n silver\". Does NOT trigger for: creating lakehouses (use create-fabric-lakehouse),\r\n uploading files, creating delta tables from CSV/PDF, or generating MLV scripts.\r\nlicense: MIT\r\ncompatibility: Python 3.8+ for scripts/. Fabric CLI (fab) installed and authenticated.\r\n---\r\n\r\n# Create Lakehouse Schemas and Shortcuts\r\n\r\nCreates schemas in schema-enabled Fabric lakehouses and creates cross-lakehouse\r\ntable shortcuts using `fab ln --type oneLake`. Schemas and shortcuts are\r\ncreated in the same run. Source and target lakehouses must already exist.\r\n\r\n> **GOVERNANCE**: This skill generates commands — it does not execute them.\r\n> All `fab` commands are presented for the operator to review and run.\r\n\r\n## Orchestrated Context\r\n\r\nWhen invoked from a workflow agent, read `00-environment-discovery/environment-profile.md`\r\nand the SOP before asking the user anything.\r\n\r\n| Parameter | Source when orchestrated |\r\n|---|---|\r\n| Source and target workspace names | Environment profile or implementation plan |\r\n| Source and target lakehouse names | SOP shared parameters (from lakehouse creation step) |\r\n| Source schema | SOP shared parameters |\r\n\r\n**Only ask for parameters not found in these documents** (e.g. 
target schema name,\r\nspecific tables to shortcut if not listed in the SOP).\r\n\r\n## Inputs\r\n\r\n| Parameter | Description | Example |\r\n|-----------|-------------|---------|\r\n| `--source-workspace` | Source Fabric workspace name (exact, case-sensitive) | `\"LANDON_TEST_20260402_HUB\"` |\r\n| `--source-lakehouse` | Source lakehouse name (exact, case-sensitive) | `\"LANDON_FINANCE_BRONZE\"` |\r\n| `--source-schema` | Schema in source lakehouse. Use `dbo` for non-schema-enabled | `\"dbo\"` |\r\n| `--target-workspace` | Target Fabric workspace name (exact, case-sensitive) | `\"LANDON_TEST_20260402_FINANCE_SPOKE\"` |\r\n| `--target-lakehouse` | Target lakehouse name (exact, case-sensitive) | `\"LANDON_FINANCE_SILVER\"` |\r\n| `--target-schema` | Schema to create in target and place shortcuts into | `\"bronze\"` |\r\n| `--tables` | Comma-separated table names, or output of `fab ls` | `\"bookings,events\"` |\r\n\r\n## Workflow\r\n\r\n- [ ] **Step 1 — Collect parameters**: Ask the user for all inputs listed above.\r\n If source and target are in the same workspace, both workspace parameters will\r\n be the same value.\r\n\r\n- [ ] **Step 2 — Discover tables**: Ask the user to either:\r\n - Provide an explicit comma-separated list of table names, **or**\r\n - Run this command and share the output:\r\n ```\r\n fab ls \"<SOURCE_WORKSPACE>.Workspace/<SOURCE_LAKEHOUSE>.Lakehouse/Tables/\" -l\r\n ```\r\n Parse table names from the output or list. 
Present them back and confirm.\r\n\r\n- [ ] **Step 3 — Generate commands**: Run the script:\r\n ```\r\n python scripts/generate_schema_shortcut_commands.py \\\r\n --source-workspace \"<SOURCE_WORKSPACE>\" \\\r\n --source-lakehouse \"<SOURCE_LAKEHOUSE>\" \\\r\n --source-schema \"<SOURCE_SCHEMA>\" \\\r\n --target-workspace \"<TARGET_WORKSPACE>\" \\\r\n --target-lakehouse \"<TARGET_LAKEHOUSE>\" \\\r\n --target-schema \"<TARGET_SCHEMA>\" \\\r\n --tables \"<TABLE1>,<TABLE2>,...\"\r\n ```\r\n The script outputs JSON to stdout with sections: `schema_sql`,\r\n `schema_shortcut_test`, `shortcut_commands`, and `validation_command`.\r\n\r\n- [ ] **Step 4 — (Optional) Test schema-level shortcut**: Before creating\r\n individual table shortcuts, optionally test whether a single schema-level\r\n shortcut captures all tables (see \"Schema-Level Shortcut Hypothesis\" below).\r\n Use the `schema_shortcut_test` command from the script output.\r\n If the test succeeds and all tables appear, skip Step 5.\r\n\r\n- [ ] **Step 5 — Choose deployment approach**: Present these options:\r\n\r\n **Option A — Notebook Cells (Recommended for pipeline integration)**\r\n Append two cells to an existing notebook attached to the target lakehouse:\r\n 1. **Spark SQL cell**: Contains `CREATE SCHEMA IF NOT EXISTS <schema>;`\r\n from the `schema_sql` output.\r\n 2. **Code cell**: Contains each command from `shortcut_commands` prefixed\r\n with `!` (one per line).\r\n If no existing notebook is available, create a new one and note that it\r\n will need its own Spark session and `fab` authentication.\r\n\r\n **Option B — PowerShell Script**\r\n Write the `fab ln` commands from `shortcut_commands` to a `.ps1` file.\r\n Add a comment at the top reminding the user to create the schema first\r\n via a Spark SQL notebook cell (`fab` CLI cannot create schemas).\r\n\r\n **Option C — Interactive Terminal**\r\n Present each command one at a time for the operator to run. 
Start with the\r\n schema creation SQL (must run in a notebook), then present `fab ln` commands.\r\n\r\n- [ ] **Step 6 — Validate**: Ask the user to run:\r\n ```\r\n fab ls \"<TARGET_WORKSPACE>.Workspace/<TARGET_LAKEHOUSE>.Lakehouse/Tables/\" -l\r\n ```\r\n Confirm the expected shortcuts appear under the target schema.\r\n\r\n## Schema-Level Shortcut Hypothesis\r\n\r\nWhen creating shortcuts through the Fabric **UI**, connecting to a schema\r\nautomatically surfaces all tables in that schema as shortcuts. It is unknown\r\nwhether this works programmatically via `fab ln`. To test, use the\r\n`schema_shortcut_test` command from the script output, e.g.:\r\n\r\n```\r\nfab ln \"<TARGET_WS>.Workspace/<TARGET_LH>.Lakehouse/Tables/<TARGET_SCHEMA>/Shortcut\" \\\r\n --type oneLake \\\r\n --target ../../<SOURCE_WS_URL>.Workspace/<SOURCE_LH>.Lakehouse/Tables -f\r\n```\r\n\r\nIf this succeeds and all source tables appear in the target schema, use this\r\none-command approach instead of individual table shortcuts. Document the result\r\nfor future runs.\r\n\r\nIf the source is non-schema-enabled, test with `Tables` as the target path\r\n(no schema segment). If schema-enabled, use `Tables/<source_schema>`.\r\n\r\n## fab ln Syntax Reference\r\n\r\n### Shortcut naming convention (FIXED)\r\n\r\nShortcuts in schema-enabled lakehouses use **slash notation** for the schema path:\r\n```\r\nTables/<Schema>/<table_name>.Shortcut\r\n```\r\nExample: `Tables/Bronze/revenue_raw.Shortcut`\r\n\r\n**Periods (`.`) are FORBIDDEN in shortcut names.** Dot notation like\r\n`Tables/bronze.revenue_raw.Shortcut` will fail with:\r\n`[InvalidPath] Invalid shortcut name. 
The name should not include any of the following characters: [\"\\:|<>*?.%+]`\r\n\r\n### Cross-lakehouse: non-schema source → schema-enabled target\r\n\r\n```\r\nfab ln \"<TARGET_WS>.Workspace/<TARGET_LH>.Lakehouse/Tables/<TARGET_SCHEMA>/<TABLE>.Shortcut\" \\\r\n --type oneLake \\\r\n --target ../../<SOURCE_WS_URL>.Workspace/<SOURCE_LH>.Lakehouse/Tables/<TABLE> -f\r\n```\r\n\r\n### Cross-lakehouse: schema-enabled source → schema-enabled target\r\n\r\n```\r\nfab ln \"<TARGET_WS>.Workspace/<TARGET_LH>.Lakehouse/Tables/<TARGET_SCHEMA>/<TABLE>.Shortcut\" \\\r\n --type oneLake \\\r\n --target ../../<SOURCE_WS_URL>.Workspace/<SOURCE_LH>.Lakehouse/Tables/<SOURCE_SCHEMA>/<TABLE> -f\r\n```\r\n\r\n### Key rules\r\n\r\n- **Type**: Always `--type oneLake` for cross-lakehouse table shortcuts.\r\n Valid `fab ln` types are: `adlsGen2`, `amazonS3`, `dataverse`, `googleCloudStorage`,\r\n `oneLake`, `s3Compatible`. There is no `lakehouseTable` type.\r\n- **Slash notation**: Shortcut path uses `Tables/<Schema>/<table>.Shortcut` (slash, NOT dot)\r\n- **Periods forbidden**: `.` is not allowed in shortcut names — will error with `[InvalidPath]`\r\n- **`-f` flag**: Always include `-f` to skip the \"Are you sure?\" confirmation prompt\r\n (terminals that don't support CPR will hang without it)\r\n- **Source path**: Schema-enabled sources use `Tables/<schema>/<table>` (slash).\r\n Non-schema sources use `Tables/<table>` (no schema segment)\r\n- **URL encoding**: Workspace names with spaces use `%20` in the `--target` path\r\n- **`../../` prefix**: Required for cross-workspace targets to navigate to OneLake root\r\n- **Display names**: Shortcut destination path uses plain workspace/lakehouse names\r\n (no URL encoding); only the `--target` path is URL-encoded\r\n\r\n## Gotchas\r\n\r\n- **Slash NOT dot in shortcut paths**: The shortcut destination uses slash notation\r\n (`Tables/Bronze/revenue_raw.Shortcut`), NOT dot notation. 
Periods (`.`) are\r\n **forbidden** in shortcut names and will cause `[InvalidPath]` errors.\r\n- **Always use `-f` flag**: Without `-f`, `fab ln` prompts \"Are you sure? (Y/n)\".\r\n Terminals that don't support cursor position requests (CPR) will hang. Always\r\n append `-f` to force creation without confirmation.\r\n- **`--type oneLake` not `--type lakehouseTable`**: Cross-lakehouse table shortcuts\r\n require `--type oneLake`. The type `lakehouseTable` does not exist in the `fab ln`\r\n CLI. Valid types are: `adlsGen2`, `amazonS3`, `dataverse`, `googleCloudStorage`,\r\n `oneLake`, `s3Compatible`.\r\n- **Schema creation requires Spark SQL**: The `fab` CLI cannot create schemas.\r\n Schemas must be created via `CREATE SCHEMA IF NOT EXISTS <name>` in a Spark SQL\r\n cell in a notebook attached to the target lakehouse.\r\n- **Schema names are case-sensitive** in Fabric. Use exact casing consistently.\r\n- **Viewer access required**: Cross-workspace shortcuts require at least Viewer\r\n access on the source workspace.\r\n- **Existing shortcuts fail**: If a shortcut with the same name already exists,\r\n `fab ln` will error. Skip or delete existing ones before rerunning.\r\n- **Same-workspace shortcuts**: When source and target are in the same workspace,\r\n the `../../` prefix and URL encoding still apply in the `--target` path.\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_schema_shortcut_commands.py`** — Generates structured JSON\r\n containing schema SQL, `fab ln` shortcut commands, a schema-level shortcut test\r\n command, and a validation command.\r\n Run: `python scripts/generate_schema_shortcut_commands.py --help`\r\n",
|
|
62
|
+
content: "---\r\nname: create-lakehouse-schemas-and-shortcuts\r\ndescription: >\r\n Use this skill to create schemas in schema-enabled Microsoft Fabric lakehouses\r\n and create cross-lakehouse table shortcuts using the Fabric CLI. Triggers on:\r\n \"create lakehouse shortcuts\", \"create schema in lakehouse\", \"shortcut tables\r\n between lakehouses\", \"cross-lakehouse shortcuts\", \"surface bronze tables in\r\n silver\". Does NOT trigger for: creating lakehouses (use create-fabric-lakehouse),\r\n uploading files, creating delta tables from CSV/PDF, or generating MLV scripts.\r\nlicense: MIT\r\ncompatibility: Python 3.8+ for scripts/. Fabric CLI (fab) installed and authenticated.\r\n---\r\n\r\n# Create Lakehouse Schemas and Shortcuts\r\n\r\nCreates schemas in schema-enabled Fabric lakehouses and creates cross-lakehouse\r\ntable shortcuts using `fab ln --type oneLake`. Schemas and shortcuts are\r\ncreated in the same run. Source and target lakehouses must already exist.\r\n\r\n> ⚠️ **GOVERNANCE**: This skill produces notebooks and scripts for the operator to\r\n> review and run — it never executes commands directly against a live Fabric environment.\r\n> Present each generated artefact to the operator before they run it.\r\n\r\n## Orchestrated Context\r\n\r\nWhen invoked from a workflow agent, read `00-environment-discovery/environment-profile.md`\r\nand the SOP before asking the user anything.\r\n\r\n| Parameter | Source when orchestrated |\r\n|---|---|\r\n| Source and target workspace names | Environment profile or implementation plan |\r\n| Source and target lakehouse names | SOP shared parameters (from lakehouse creation step) |\r\n| Source schema | SOP shared parameters |\r\n\r\n**Only ask for parameters not found in these documents** (e.g. 
target schema name,\r\nspecific tables to shortcut if not listed in the SOP).\r\n\r\n## Inputs\r\n\r\n| Parameter | Description | Example |\r\n|-----------|-------------|---------|\r\n| `--source-workspace` | Source Fabric workspace name (exact, case-sensitive) | `\"LANDON_TEST_20260402_HUB\"` |\r\n| `--source-lakehouse` | Source lakehouse name (exact, case-sensitive) | `\"LANDON_FINANCE_BRONZE\"` |\r\n| `--source-schema` | Schema in source lakehouse. Use `dbo` for non-schema-enabled | `\"dbo\"` |\r\n| `--target-workspace` | Target Fabric workspace name (exact, case-sensitive) | `\"LANDON_TEST_20260402_FINANCE_SPOKE\"` |\r\n| `--target-lakehouse` | Target lakehouse name (exact, case-sensitive) | `\"LANDON_FINANCE_SILVER\"` |\r\n| `--target-schema` | Schema to create in target and place shortcuts into | `\"bronze\"` |\r\n| `--tables` | Comma-separated table names, or output of `fab ls` | `\"bookings,events\"` |\r\n\r\n## Workflow\r\n\r\n- [ ] **Step 1 — Collect parameters**: Ask the user for all inputs listed above.\r\n If source and target are in the same workspace, both workspace parameters will\r\n be the same value.\r\n\r\n- [ ] **Step 2 — Discover tables**: Ask the user to either:\r\n - Provide an explicit comma-separated list of table names, **or**\r\n - Run this command and share the output:\r\n ```\r\n fab ls \"<SOURCE_WORKSPACE>.Workspace/<SOURCE_LAKEHOUSE>.Lakehouse/Tables/\" -l\r\n ```\r\n Parse table names from the output or list. 
Present them back and confirm.\r\n\r\n- [ ] **Step 3 — Generate commands**: Run the script:\r\n ```\r\n python scripts/generate_schema_shortcut_commands.py \\\r\n --source-workspace \"<SOURCE_WORKSPACE>\" \\\r\n --source-lakehouse \"<SOURCE_LAKEHOUSE>\" \\\r\n --source-schema \"<SOURCE_SCHEMA>\" \\\r\n --target-workspace \"<TARGET_WORKSPACE>\" \\\r\n --target-lakehouse \"<TARGET_LAKEHOUSE>\" \\\r\n --target-schema \"<TARGET_SCHEMA>\" \\\r\n --tables \"<TABLE1>,<TABLE2>,...\"\r\n ```\r\n The script outputs JSON to stdout with sections: `schema_sql`,\r\n `schema_shortcut_test`, `shortcut_commands`, and `validation_command`.\r\n\r\n- [ ] **Step 4 — (Optional) Test schema-level shortcut**: Before creating\r\n individual table shortcuts, optionally test whether a single schema-level\r\n shortcut captures all tables (see \"Schema-Level Shortcut Hypothesis\" below).\r\n Use the `schema_shortcut_test` command from the script output.\r\n If the test succeeds and all tables appear, skip Step 5.\r\n\r\n- [ ] **Step 5 — Choose deployment approach**: Present these options:\r\n\r\n **Option A — Notebook Cells (Recommended for pipeline integration)**\r\n Append two cells to an existing notebook attached to the target lakehouse:\r\n 1. **Spark SQL cell**: Contains `CREATE SCHEMA IF NOT EXISTS <schema>;`\r\n from the `schema_sql` output.\r\n 2. **Code cell**: Contains each command from `shortcut_commands` prefixed\r\n with `!` (one per line).\r\n If no existing notebook is available, create a new one and note that it\r\n will need its own Spark session and `fab` authentication.\r\n\r\n **Option B — PowerShell Script**\r\n Write the `fab ln` commands from `shortcut_commands` to a `.ps1` file.\r\n Add a comment at the top reminding the user to create the schema first\r\n via a Spark SQL notebook cell (`fab` CLI cannot create schemas).\r\n\r\n **Option C — Interactive Terminal**\r\n Present each command one at a time for the operator to run. 
Start with the\r\n schema creation SQL (must run in a notebook), then present `fab ln` commands.\r\n\r\n- [ ] **Step 6 — Validate**: Ask the user to run:\r\n ```\r\n fab ls \"<TARGET_WORKSPACE>.Workspace/<TARGET_LAKEHOUSE>.Lakehouse/Tables/\" -l\r\n ```\r\n Confirm the expected shortcuts appear under the target schema.\r\n\r\n## Schema-Level Shortcut Hypothesis\r\n\r\nWhen creating shortcuts through the Fabric **UI**, connecting to a schema\r\nautomatically surfaces all tables in that schema as shortcuts. It is unknown\r\nwhether this works programmatically via `fab ln`. To test, use the\r\n`schema_shortcut_test` command from the script output, e.g.:\r\n\r\n```\r\nfab ln \"<TARGET_WS>.Workspace/<TARGET_LH>.Lakehouse/Tables/<TARGET_SCHEMA>/Shortcut\" \\\r\n --type oneLake \\\r\n --target ../../<SOURCE_WS_URL>.Workspace/<SOURCE_LH>.Lakehouse/Tables -f\r\n```\r\n\r\nIf this succeeds and all source tables appear in the target schema, use this\r\none-command approach instead of individual table shortcuts. Document the result\r\nfor future runs.\r\n\r\nIf the source is non-schema-enabled, test with `Tables` as the target path\r\n(no schema segment). If schema-enabled, use `Tables/<source_schema>`.\r\n\r\n## fab ln Syntax Reference\r\n\r\n### Shortcut naming convention (FIXED)\r\n\r\nShortcuts in schema-enabled lakehouses use **slash notation** for the schema path:\r\n```\r\nTables/<Schema>/<table_name>.Shortcut\r\n```\r\nExample: `Tables/Bronze/revenue_raw.Shortcut`\r\n\r\n**Periods (`.`) are FORBIDDEN in shortcut names.** Dot notation like\r\n`Tables/bronze.revenue_raw.Shortcut` will fail with:\r\n`[InvalidPath] Invalid shortcut name. 
The name should not include any of the following characters: [\"\\:|<>*?.%+]`\r\n\r\n### Cross-lakehouse: non-schema source → schema-enabled target\r\n\r\n```\r\nfab ln \"<TARGET_WS>.Workspace/<TARGET_LH>.Lakehouse/Tables/<TARGET_SCHEMA>/<TABLE>.Shortcut\" \\\r\n --type oneLake \\\r\n --target ../../<SOURCE_WS_URL>.Workspace/<SOURCE_LH>.Lakehouse/Tables/<TABLE> -f\r\n```\r\n\r\n### Cross-lakehouse: schema-enabled source → schema-enabled target\r\n\r\n```\r\nfab ln \"<TARGET_WS>.Workspace/<TARGET_LH>.Lakehouse/Tables/<TARGET_SCHEMA>/<TABLE>.Shortcut\" \\\r\n --type oneLake \\\r\n --target ../../<SOURCE_WS_URL>.Workspace/<SOURCE_LH>.Lakehouse/Tables/<SOURCE_SCHEMA>/<TABLE> -f\r\n```\r\n\r\n### Key rules\r\n\r\n- **Type**: Always `--type oneLake` for cross-lakehouse table shortcuts.\r\n Valid `fab ln` types are: `adlsGen2`, `amazonS3`, `dataverse`, `googleCloudStorage`,\r\n `oneLake`, `s3Compatible`. There is no `lakehouseTable` type.\r\n- **Slash notation**: Shortcut path uses `Tables/<Schema>/<table>.Shortcut` (slash, NOT dot)\r\n- **Periods forbidden**: `.` is not allowed in shortcut names — will error with `[InvalidPath]`\r\n- **`-f` flag**: Always include `-f` to skip the \"Are you sure?\" confirmation prompt\r\n (terminals that don't support CPR will hang without it)\r\n- **Source path**: Schema-enabled sources use `Tables/<schema>/<table>` (slash).\r\n Non-schema sources use `Tables/<table>` (no schema segment)\r\n- **URL encoding**: Workspace names with spaces use `%20` in the `--target` path\r\n- **`../../` prefix**: Required for cross-workspace targets to navigate to OneLake root\r\n- **Display names**: Shortcut destination path uses plain workspace/lakehouse names\r\n (no URL encoding); only the `--target` path is URL-encoded\r\n\r\n## Gotchas\r\n\r\n- **Slash NOT dot in shortcut paths**: The shortcut destination uses slash notation\r\n (`Tables/Bronze/revenue_raw.Shortcut`), NOT dot notation. 
Periods (`.`) are\r\n **forbidden** in shortcut names and will cause `[InvalidPath]` errors.\r\n- **Always use `-f` flag**: Without `-f`, `fab ln` prompts \"Are you sure? (Y/n)\".\r\n Terminals that don't support cursor position requests (CPR) will hang. Always\r\n append `-f` to force creation without confirmation.\r\n- **`--type oneLake` not `--type lakehouseTable`**: Cross-lakehouse table shortcuts\r\n require `--type oneLake`. The type `lakehouseTable` does not exist in the `fab ln`\r\n CLI. Valid types are: `adlsGen2`, `amazonS3`, `dataverse`, `googleCloudStorage`,\r\n `oneLake`, `s3Compatible`.\r\n- **Schema creation requires Spark SQL**: The `fab` CLI cannot create schemas.\r\n Schemas must be created via `CREATE SCHEMA IF NOT EXISTS <name>` in a Spark SQL\r\n cell in a notebook attached to the target lakehouse.\r\n- **Schema names are case-sensitive** in Fabric. Use exact casing consistently.\r\n- **Viewer access required**: Cross-workspace shortcuts require at least Viewer\r\n access on the source workspace.\r\n- **Existing shortcuts fail**: If a shortcut with the same name already exists,\r\n `fab ln` will error. Skip or delete existing ones before rerunning.\r\n- **Same-workspace shortcuts**: When source and target are in the same workspace,\r\n the `../../` prefix and URL encoding still apply in the `--target` path.\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_schema_shortcut_commands.py`** — Generates structured JSON\r\n containing schema SQL, `fab ln` shortcut commands, a schema-level shortcut test\r\n command, and a validation command.\r\n Run: `python scripts/generate_schema_shortcut_commands.py --help`\r\n",
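As an illustration of the `fab ln` rules in the SKILL.md content above (slash notation for the destination, `%20` for spaces in the `--target` workspace name only, the `../../` prefix, and a trailing `-f`), the following is a minimal sketch of how one command string could be assembled. It is not the packaged `scripts/generate_schema_shortcut_commands.py`, whose implementation is not shown in this diff. The function name `build_fab_ln`, its parameters, and the optional `source_schema` handling are hypothetical.

```python
from typing import Optional

# Characters the quoted [InvalidPath] error lists as not allowed in shortcut names.
FORBIDDEN_IN_SHORTCUT_NAME = set('"\\:|<>*?.%+')


def build_fab_ln(
    target_ws: str,
    target_lh: str,
    target_schema: str,
    source_ws: str,
    source_lh: str,
    table: str,
    source_schema: Optional[str] = None,  # hypothetical: None means non-schema source
) -> str:
    """Return one `fab ln` command string following the rules quoted above."""
    bad = FORBIDDEN_IN_SHORTCUT_NAME.intersection(table)
    if bad:
        raise ValueError(f"shortcut name {table!r} contains forbidden characters: {sorted(bad)}")

    # Destination path: plain display names, slash notation for the schema.
    dest = f"{target_ws}.Workspace/{target_lh}.Lakehouse/Tables/{target_schema}/{table}.Shortcut"

    # --target path: only spaces in the workspace name are encoded here (assumption:
    # the source text mentions %20 for spaces and nothing else).
    source_ws_url = source_ws.replace(" ", "%20")
    source_tables = f"Tables/{source_schema}/{table}" if source_schema else f"Tables/{table}"
    target = f"../../{source_ws_url}.Workspace/{source_lh}.Lakehouse/{source_tables}"

    return f'fab ln "{dest}" --type oneLake --target {target} -f'


if __name__ == "__main__":
    print(build_fab_ln("Finance Spoke", "LH_SILVER", "bronze", "Hub WS", "LH_BRONZE", "bookings"))
```

Rejecting forbidden characters before building the string surfaces the dot-notation mistake locally instead of at `fab ln` run time.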
|
|
63
63
|
},
|
|
64
64
|
{
|
|
65
65
|
relativePath: "scripts/generate_schema_shortcut_commands.py",
|
|
@@ -119,7 +119,7 @@ export const EMBEDDED_SKILLS = [
|
|
|
119
119
|
files: [
|
|
120
120
|
{
|
|
121
121
|
relativePath: "SKILL.md",
|
|
122
|
-
content: "---\r\nname: csv-to-bronze-delta-tables\r\ndescription: >\r\n Use this skill to upload CSV files from a local machine into a Microsoft Fabric\r\n bronze lakehouse and convert them to delta tables. Triggers on: \"create delta\r\n tables from CSV files\", \"load CSVs into bronze lakehouse\", \"upload CSV to Fabric\r\n and create tables\", \"ingest CSV files to delta format in Fabric\", \"create bronze\r\n tables from local CSV\". Does NOT trigger for creating lakehouses, transforming\r\n existing delta tables, or non-Fabric storage targets.\r\nlicense: MIT\r\ncompatibility: Python 3.8+ required for scripts/. Fabric CLI (fab) must be installed for the CLI upload option.\r\n---\r\n\r\n# CSV to Bronze Delta Tables\r\n\r\nUploads CSV files from an operator's local machine to a Microsoft Fabric bronze\r\nlakehouse and converts them to delta tables. The lakehouse must already exist.\r\n\r\n> ⚠️ **GOVERNANCE
|
|
122
|
+
content: "---\r\nname: csv-to-bronze-delta-tables\r\ndescription: >\r\n Use this skill to upload CSV files from a local machine into a Microsoft Fabric\r\n bronze lakehouse and convert them to delta tables. Triggers on: \"create delta\r\n tables from CSV files\", \"load CSVs into bronze lakehouse\", \"upload CSV to Fabric\r\n and create tables\", \"ingest CSV files to delta format in Fabric\", \"create bronze\r\n tables from local CSV\". Does NOT trigger for creating lakehouses, transforming\r\n existing delta tables, or non-Fabric storage targets.\r\nlicense: MIT\r\ncompatibility: Python 3.8+ required for scripts/. Fabric CLI (fab) must be installed for the CLI upload option.\r\n---\r\n\r\n# CSV to Bronze Delta Tables\r\n\r\nUploads CSV files from an operator's local machine to a Microsoft Fabric bronze\r\nlakehouse and converts them to delta tables. The lakehouse must already exist.\r\n\r\n> ⚠️ **GOVERNANCE**: This skill produces notebooks and scripts for the operator to\r\n> review and run — it never executes commands directly against a live Fabric environment.\r\n> Present each generated artefact to the operator before they run it.\r\n>\r\n> ⚠️ **GENERATION**: Always run `scripts/generate_notebook.py` via Bash to produce\r\n> the `.ipynb` notebook — never generate notebook cell content directly. The\r\n> generated notebook uses native PySpark (`spark.read.csv`, `df.write.format(\"delta\")`)\r\n> — it does not use `fab` CLI or `FAB_TOKEN` auth.\r\n\r\n## Orchestrated Context\r\n\r\nWhen invoked from a workflow agent, read `00-environment-discovery/environment-profile.md`\r\nand the SOP before asking the user anything.\r\n\r\n| Parameter | Source when orchestrated |\r\n|---|---|\r\n| Workspace name | Environment profile or implementation plan |\r\n| Lakehouse name | SOP shared parameters (from lakehouse creation step) |\r\n\r\n**Only ask for parameters not found in these documents** (e.g. 
local CSV folder path,\r\ndestination folder name in OneLake, table naming preferences).\r\n\r\n## Inputs\r\n\r\n| Parameter | Description | Example |\r\n|-----------|-------------|---------|\r\n| `WORKSPACE_NAME` | Fabric workspace name (exact, case-sensitive) | `\"Landon Finance Month End\"` |\r\n| `LAKEHOUSE_NAME` | Bronze lakehouse name (exact, case-sensitive) | `\"Lh_landon_finance_bronze\"` |\r\n| `LOCAL_CSV_FOLDER` | Relative path to local folder containing CSV files (CLI upload only) | `\"./Data\"` |\r\n| `LAKEHOUSE_FILES_FOLDER` | Folder name under the Files section of the lakehouse | `\"raw\"` |\r\n\r\n## Workflow\r\n\r\n- [ ] **Collect parameters** — If `WORKSPACE_NAME` or `LAKEHOUSE_NAME` are not\r\n provided, ask the operator for them before proceeding.\r\n\r\n- [ ] **Upload CSV files** — Present these three options and ask the operator to\r\n choose one:\r\n\r\n **Option A — OneLake File Explorer (Manual)**\r\n Open the OneLake File Explorer desktop app and drag-and-drop the CSV files into\r\n the target folder under the lakehouse Files section. No agent action required.\r\n\r\n **Option B — Fabric UI (Manual)**\r\n In the Fabric browser UI navigate to the lakehouse → Files section, open or create\r\n the target folder, click **Upload** and select the CSV files. No agent action required.\r\n\r\n **Option C — Fabric CLI (Automated)**\r\n > ⚠️ **Requires PowerShell** — generates a `.ps1` script. PowerShell is available\r\n > on Windows natively and on Mac/Linux via `brew install powershell`. If PowerShell\r\n > is not available and the operator does not want to install it, use Option A or B.\r\n > Do not substitute a bash or shell script.\r\n >\r\n > ⚠️ **Performance note**: The CLI uploads files one at a time. 
For large\r\n > batches (50+ files) this is significantly slower than Options A or B.\r\n > Recommend Options A or B for bulk uploads.\r\n\r\n Ask for `LOCAL_CSV_FOLDER` as the **exact absolute path** to the local folder\r\n and `LAKEHOUSE_FILES_FOLDER` (the destination folder name under Files). Then run:\r\n ```\r\n python scripts/generate_upload_commands.py \\\r\n --local-folder \"<LOCAL_CSV_FOLDER>\" \\\r\n --workspace \"<WORKSPACE_NAME>\" \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-script \"<OUTPUT_FOLDER>/upload_csv_files.ps1\"\r\n ```\r\n The script generates a PowerShell `.ps1` file saved directly to the outputs folder.\r\n Present the script path to the operator and ask them to run it with `pwsh upload_csv_files.ps1`.\r\n\r\n## Output Folder\r\n\r\nBefore beginning the workflow, create the output folder:\r\n```\r\noutputs/csv-to-bronze-delta-tables_{YYYY-MM-DD_HH-MM}_{USERNAME}/\r\n```\r\nAll scripts produced during the run are saved here.\r\n\r\n- [ ] **Confirm upload** — Ask the operator to confirm the CSV files are visible\r\n in the Files section of the lakehouse before proceeding.\r\n\r\n- [ ] **Create delta tables** — If `LAKEHOUSE_FILES_FOLDER` was not already\r\n captured above, ask for it now. Present these two options:\r\n\r\n **Option A — Fabric UI (Manual)**\r\n > Quick and easy — recommended for most users.\r\n In the Fabric browser UI navigate to the lakehouse → Files →\r\n `<LAKEHOUSE_FILES_FOLDER>`. For each CSV file: click the three-dot menu →\r\n **Load to Tables** → **New Table**. Accept the suggested table name (Fabric\r\n applies it automatically). 
No agent action required.\r\n\r\n **Option B — PySpark notebook (Automated)**\r\n Run:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\csv_to_delta_tables.ipynb\"\r\n ```\r\n This writes a ready-to-run `.ipynb` file to the outputs folder. Tell the operator:\r\n 1. In the Fabric UI go to the workspace → **New** → **Import notebook**\r\n 2. Select `csv_to_delta_tables.ipynb` from the outputs folder\r\n 3. Follow the instructions in **Cell 1** to manually attach the lakehouse before running\r\n 4. Click **Run All**\r\n **Validate**: confirm every cell printed `✅ Created table: <table_name>` with\r\n no errors. If any `❌` lines appear, report the error message to the operator.\r\n\r\n## Table Naming Convention\r\n\r\nCSV filename → delta table name:\r\n- Strip `.csv` extension\r\n- Convert to lowercase\r\n- Replace any non-alphanumeric characters (spaces, hyphens, dots) with underscores\r\n- Strip leading/trailing underscores\r\n\r\nExamples:\r\n| CSV filename | Delta table name |\r\n|---|---|\r\n| `Revenue Data.csv` | `revenue_data` |\r\n| `Landon hotel revenue data.csv` | `landon_hotel_revenue_data` |\r\n| `Q1-Sales.csv` | `q1_sales` |\r\n\r\n## Column Naming Convention\r\n\r\nWhen CSVs are loaded into delta tables via the PySpark notebook (Option B of\r\ndelta table creation), a `clean_columns()` function transforms every column name:\r\n\r\n- Convert to lowercase\r\n- Replace spaces, hyphens, and other non-alphanumeric characters with underscores\r\n- Strip leading/trailing underscores\r\n\r\n| CSV column header | Delta table column name |\r\n|---|---|\r\n| `Hotel ID` | `hotel_id` |\r\n| `No_of_Rooms` | `no_of_rooms` |\r\n| `Total Revenue (GBP)` | `total_revenue_gbp` |\r\n| `First Name` | `first_name` |\r\n\r\n> **Important for downstream skills:** When writing SQL queries against bronze\r\n> delta tables (e.g., in the 
`create-materialised-lakeview-scripts` skill),\r\n> always use the cleaned column names — not the original CSV headers.\r\n\r\n## Output Format\r\n\r\nDelta tables appear under the **Tables** section of the bronze lakehouse in the\r\nFabric UI, named according to the convention above. Each table is queryable via\r\nthe lakehouse SQL endpoint and PySpark.\r\n\r\n## Gotchas\r\n\r\n- `fab cp` uses the path prefix to identify local vs OneLake paths. **Absolute\r\n Windows paths (`C:\\...`) are not recognised as local** and cause a\r\n `[NotSupported] Source and destination must be of the same type` error. Always\r\n use `Push-Location` into the source folder and `./filename` (forward slash,\r\n not backslash) syntax — confirmed working pattern.\r\n- **The destination folder must exist before running `fab cp`.** Always run\r\n `fab mkdir \"{WORKSPACE}.Workspace/{LAKEHOUSE}.Lakehouse/Files/{FOLDER}\"` first.\r\n Running `fab mkdir` on an already-existing folder is safe and does not error.\r\n- `WORKSPACE_NAME` and `LAKEHOUSE_NAME` are case-sensitive and must exactly match\r\n what appears in the Fabric UI.\r\n- The Fabric UI shortcut approach (Option A for delta table creation) uses Fabric's\r\n automatic schema inference. It may fail if column names contain spaces or data types\r\n are inconsistent. Switch to Option B (PySpark notebook) in those cases.\r\n- The PySpark notebook requires the lakehouse to be manually attached before running —\r\n Cell 1 contains step-by-step instructions. 
If you skip this, `saveAsTable()` will fail.\r\n- When using the Fabric CLI, run all commands from the directory that\r\n `LOCAL_CSV_FOLDER` is relative to (typically the project root).\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_upload_commands.py`** — Scans a local CSV folder and outputs\r\n `fab cp` commands to upload each file to the lakehouse Files section.\r\n Run: `python scripts/generate_upload_commands.py --help`\r\n- **`scripts/generate_notebook.py`** — Generates a Fabric-compatible `.ipynb`\r\n notebook pre-configured with the correct lakehouse name and `FILES_FOLDER`.\r\n Cell 1 instructs the operator to manually attach the lakehouse before running.\r\n Import into Fabric via **New → Import notebook**.\r\n Run: `python scripts/generate_notebook.py --help`\r\n",
|
|
123
123
|
},
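Reading aid for the "Table Naming Convention" and "Column Naming Convention" sections of the SKILL.md content above: the rules can be sketched as a standalone Python snippet. The regular expression mirrors the one that appears in `scripts/generate_notebook.py` in this diff. The helper name `to_identifier` is an assumption made for illustration and is not part of the package.

```python
import re


def to_identifier(raw: str) -> str:
    """Hypothetical helper: lowercase the text, replace each run of
    non-alphanumeric characters with one underscore, trim edge underscores."""
    return re.sub(r"[^a-zA-Z0-9]+", "_", raw).lower().strip("_")


def to_table_name(filename: str) -> str:
    """Drop a trailing .csv (case-insensitive), then normalise the remainder."""
    stem = filename[:-4] if filename.lower().endswith(".csv") else filename
    return to_identifier(stem)


if __name__ == "__main__":
    # Filenames and headers taken from the tables in the SKILL.md content.
    for name in ["Revenue Data.csv", "Landon hotel revenue data.csv", "Q1-Sales.csv"]:
        print(f"{name} -> {to_table_name(name)}")
    for header in ["Hotel ID", "No_of_Rooms", "Total Revenue (GBP)", "First Name"]:
        print(f"{header} -> {to_identifier(header)}")
```

Because the same normalisation is applied to filenames and column headers, downstream SQL can derive the cleaned names from the original CSV names without inspecting the lakehouse.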
|
|
124
124
|
{
|
|
125
125
|
relativePath: "assets/pyspark_notebook_template.py",
|
|
@@ -131,7 +131,7 @@ export const EMBEDDED_SKILLS = [
|
|
|
131
131
|
},
|
|
132
132
|
{
|
|
133
133
|
relativePath: "scripts/generate_notebook.py",
|
|
134
|
-
content: "# /// script\r\n# requires-python = \">=3.8\"\r\n# dependencies = []\r\n# ///\r\n\"\"\"\r\nGenerate a Fabric-compatible PySpark notebook (.ipynb) that reads CSV files\r\nfrom a lakehouse Files section and writes each one as a delta table.\r\n\r\nUsage example:\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"LH_MY_BRONZE\" \\\r\n --lakehouse-folder \"CSV\" \\\r\n --output-notebook \"outputs/my-run/csv_to_delta_tables.ipynb\"\r\n\"\"\"\r\nimport argparse\r\nimport json\r\nimport os\r\nimport sys\r\n\r\n\r\ndef make_cell(source_lines, cell_type=\"code\"):\r\n \"\"\"Build a notebook cell dict from a list of source lines.\"\"\"\r\n source = [line + \"\\n\" for line in source_lines[:-1]] + [source_lines[-1]]\r\n cell = {\r\n \"cell_type\": cell_type,\r\n \"metadata\": {},\r\n \"source\": source,\r\n \"outputs\": [],\r\n \"execution_count\": None,\r\n }\r\n if cell_type == \"markdown\":\r\n del cell[\"outputs\"]\r\n del cell[\"execution_count\"]\r\n return cell\r\n\r\n\r\ndef build_notebook(lakehouse_name: str, lakehouse_folder: str) -> dict:\r\n cells = []\r\n\r\n # ── Cell 1: manual lakehouse attachment instructions ─────────────────────\r\n cells.append(make_cell([\r\n f\"## ⚠️ Step 1: Attach the Lakehouse BEFORE Running\",\r\n \"\",\r\n \"Before clicking **Run All**, attach the bronze lakehouse:\",\r\n \"\",\r\n \"1. In the left panel of the notebook, click **Add data items** (the database icon)\",\r\n \"2. Click **Add lakehouse**\",\r\n \"3. Select **Existing lakehouse**\",\r\n f\"4. Choose **{lakehouse_name}**\",\r\n \"5. 
Click **Confirm**\",\r\n \"\",\r\n f\"The lakehouse **{lakehouse_name}** must appear in the left panel before you proceed.\",\r\n \"If you skip this step, `saveAsTable()` will fail with a lakehouse not found error.\",\r\n ], cell_type=\"markdown\"))\r\n\r\n # ── Cell 2: markdown header ──────────────────────────────────────────────\r\n cells.append(make_cell([\r\n \"## CSV to Bronze Delta Tables\",\r\n \"\",\r\n f\"Reads every CSV from `Files/{lakehouse_folder}` in **{lakehouse_name}** and writes\",\r\n \"each one as a managed delta table in the **Tables** section.\",\r\n ], cell_type=\"markdown\"))\r\n\r\n # ── Cell 3: configuration ────────────────────────────────────────────────\r\n cells.append(make_cell([\r\n \"# ── CONFIGURE ────────────────────────────────────────────────────────\",\r\n f'FILES_FOLDER = \"{lakehouse_folder}\" # folder under Files/ containing the CSVs',\r\n \"# ─────────────────────────────────────────────────────────────────────\",\r\n ]))\r\n\r\n # ── Cell 4: imports ──────────────────────────────────────────────────────\r\n cells.append(make_cell([\r\n \"import re\",\r\n ]))\r\n\r\n # ── Cell 5: helper — table name ──────────────────────────────────────────\r\n cells.append(make_cell([\r\n \"def to_table_name(filename: str) -> str:\",\r\n ' \"\"\"CSV filename → delta table name: lowercase, non-alphanumeric → underscores.\"\"\"',\r\n ' name = filename[:-4] if filename.lower().endswith(\".csv\") else filename',\r\n ' return re.sub(r\"[^a-zA-Z0-9]+\", \"_\", name).lower().strip(\"_\")',\r\n ]))\r\n\r\n # ── Cell 6: helper — clean columns ───────────────────────────────────────\r\n cells.append(make_cell([\r\n \"def clean_columns(df):\",\r\n ' \"\"\"Rename columns to be delta-safe: lowercase, special chars → underscores.\"\"\"',\r\n \" for col in df.columns:\",\r\n ' clean = re.sub(r\"[^a-zA-Z0-9]+\", \"_\", col).lower().strip(\"_\")',\r\n \" if clean != col:\",\r\n \" df = df.withColumnRenamed(col, clean)\",\r\n \" return df\",\r\n 
]))\r\n\r\n # ── Cell 7: find CSV files ───────────────────────────────────────────────\r\n cells.append(make_cell([\r\n \"files_path = f\\\"Files/{FILES_FOLDER}\\\"\",\r\n \"all_files = mssparkutils.fs.ls(files_path)\",\r\n 'csv_files = [f for f in all_files if f.name.lower().endswith(\".csv\")]',\r\n \"\",\r\n \"if not csv_files:\",\r\n \" raise ValueError(\",\r\n \" f\\\"No CSV files found in '{files_path}'. \\\"\",\r\n \" \\\"Check FILES_FOLDER is correct and files have been uploaded.\\\"\",\r\n \" )\",\r\n \"\",\r\n \"print(f\\\"Found {len(csv_files)} CSV file(s) in '{files_path}':\\\")\",\r\n \"for f in csv_files:\",\r\n \" print(f\\\" - {f.name}\\\")\",\r\n ]))\r\n\r\n # ── Cell 8: create delta tables ──────────────────────────────────────────\r\n cells.append(make_cell([\r\n \"created, failed = [], []\",\r\n \"\",\r\n \"for f in csv_files:\",\r\n \" table_name = to_table_name(f.name)\",\r\n \" try:\",\r\n \" df = (\",\r\n \" spark.read\",\r\n ' .option(\"header\", \"true\")',\r\n ' .option(\"inferSchema\", \"true\")',\r\n \" .csv(f.path)\",\r\n \" )\",\r\n \" df = clean_columns(df)\",\r\n ' df.write.format(\"delta\").mode(\"overwrite\").saveAsTable(table_name)',\r\n \" print(f\\\"\\\\u2705 Created table: {table_name} ({df.count()} rows, {len(df.columns)} columns)\\\")\",\r\n \" created.append(table_name)\",\r\n \" except Exception as e:\",\r\n \" print(f\\\"\\\\u274c Failed: {f.name} \\\\u2192 {table_name}: {e}\\\")\",\r\n \" failed.append({\\\"file\\\": f.name, \\\"table\\\": table_name, \\\"error\\\": str(e)})\",\r\n ]))\r\n\r\n # ── Cell 9: summary ──────────────────────────────────────────────────────\r\n cells.append(make_cell([\r\n \"print(f\\\"{'=' * 60}\\\")\",\r\n \"print(f\\\"Summary: {len(created)} table(s) created, {len(failed)} failed.\\\")\",\r\n \"if failed:\",\r\n \" print(\\\"\\\\nFailed files:\\\")\",\r\n \" for item in failed:\",\r\n \" print(f\\\" - {item['file']} \\\\u2192 {item['table']}: {item['error']}\\\")\",\r\n ]))\r\n\r\n return 
{\r\n \"nbformat\": 4,\r\n \"nbformat_minor\": 5,\r\n \"metadata\": {\r\n \"kernelspec\": {\r\n \"display_name\": \"PySpark\",\r\n \"language\": \"python\",\r\n \"name\": \"synapse_pyspark\",\r\n },\r\n \"language_info\": {\r\n \"name\": \"python\",\r\n },\r\n \"trident\": {\r\n \"lakehouse\": {},\r\n },\r\n },\r\n \"cells\": cells,\r\n }\r\n\r\n\r\ndef main():\r\n parser = argparse.ArgumentParser(\r\n description=(\r\n \"Generate a Fabric-compatible PySpark notebook (.ipynb) that reads CSV \"\r\n \"files from a lakehouse Files folder and writes each as a delta table.\"\r\n ),\r\n formatter_class=argparse.RawDescriptionHelpFormatter,\r\n epilog=(\r\n \"Example:\\n\"\r\n \" python scripts/generate_notebook.py \\\\\\n\"\r\n ' --lakehouse \"LH_MY_BRONZE\" \\\\\\n'\r\n ' --lakehouse-folder \"CSV\" \\\\\\n'\r\n ' --output-notebook \"outputs/my-run/csv_to_delta_tables.ipynb\"\\n'\r\n ),\r\n )\r\n parser.add_argument(\r\n \"--lakehouse\", required=True,\r\n help='Name of the bronze lakehouse to attach. E.g. \"LH_MY_BRONZE\"',\r\n )\r\n parser.add_argument(\r\n \"--lakehouse-folder\", required=True,\r\n help='Folder under Files/ in the lakehouse containing the CSV files. E.g. 
\"CSV\"',\r\n )\r\n parser.add_argument(\r\n \"--output-notebook\", required=True,\r\n help='Path where the .ipynb file should be saved.',\r\n )\r\n args = parser.parse_args()\r\n\r\n notebook = build_notebook(args.lakehouse, args.lakehouse_folder)\r\n\r\n out_path = os.path.abspath(args.output_notebook)\r\n os.makedirs(os.path.dirname(out_path), exist_ok=True)\r\n with open(out_path, \"w\", encoding=\"utf-8\") as f:\r\n json.dump(notebook, f, indent=2, ensure_ascii=False)\r\n\r\n print(f\"Notebook written to: {out_path}\", file=sys.stderr)\r\n print(\"Import into Fabric: Workspace → New → Import notebook → select this .ipynb file.\", file=sys.stderr)\r\n print(\"Lakehouse is attached automatically via %%configure — just click Run All.\", file=sys.stderr)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n main()\r\n",
|
|
134
|
+
content: "# /// script\r\n# requires-python = \">=3.8\"\r\n# dependencies = []\r\n# ///\r\n\"\"\"\r\nGenerate a Fabric-compatible PySpark notebook (.ipynb) that reads CSV files\r\nfrom a lakehouse Files section and writes each one as a delta table.\r\n\r\nUsage example:\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"LH_MY_BRONZE\" \\\r\n --lakehouse-folder \"CSV\" \\\r\n --output-notebook \"outputs/my-run/csv_to_delta_tables.ipynb\"\r\n\"\"\"\r\nimport argparse\r\nimport json\r\nimport os\r\nimport sys\r\n\r\n\r\ndef make_cell(source_lines, cell_type=\"code\"):\r\n \"\"\"Build a notebook cell dict from a list of source lines.\"\"\"\r\n source = [line + \"\\n\" for line in source_lines[:-1]] + [source_lines[-1]]\r\n cell = {\r\n \"cell_type\": cell_type,\r\n \"metadata\": {},\r\n \"source\": source,\r\n \"outputs\": [],\r\n \"execution_count\": None,\r\n }\r\n if cell_type == \"markdown\":\r\n del cell[\"outputs\"]\r\n del cell[\"execution_count\"]\r\n return cell\r\n\r\n\r\ndef build_notebook(lakehouse_name: str, lakehouse_folder: str) -> dict:\r\n cells = []\r\n\r\n # ── Cell 1: manual lakehouse attachment instructions ─────────────────────\r\n cells.append(make_cell([\r\n f\"## ⚠️ Step 1: Attach the Lakehouse BEFORE Running\",\r\n \"\",\r\n \"Before clicking **Run All**, attach the bronze lakehouse:\",\r\n \"\",\r\n \"1. In the left panel of the notebook, click **Add data items** (the database icon)\",\r\n \"2. Click **Add lakehouse**\",\r\n \"3. Select **Existing lakehouse**\",\r\n f\"4. Choose **{lakehouse_name}**\",\r\n \"5. 
Click **Confirm**\",\r\n \"\",\r\n f\"The lakehouse **{lakehouse_name}** must appear in the left panel before you proceed.\",\r\n \"If you skip this step, `saveAsTable()` will fail with a lakehouse not found error.\",\r\n ], cell_type=\"markdown\"))\r\n\r\n # ── Cell 2: markdown header ──────────────────────────────────────────────\r\n cells.append(make_cell([\r\n \"## CSV to Bronze Delta Tables\",\r\n \"\",\r\n f\"Reads every CSV from `Files/{lakehouse_folder}` in **{lakehouse_name}** and writes\",\r\n \"each one as a managed delta table in the **Tables** section.\",\r\n ], cell_type=\"markdown\"))\r\n\r\n # ── Cell 3: configuration ────────────────────────────────────────────────\r\n cells.append(make_cell([\r\n \"## Cell 1 — Configuration\",\r\n \"\",\r\n \"Sets the source folder under `Files/` that contains the CSV files to ingest.\",\r\n \"\",\r\n f\"**How to use**: Change `FILES_FOLDER` to the folder name you uploaded your CSVs to.\",\r\n f\"The current value `\\\"{lakehouse_folder}\\\"` was set when this notebook was generated.\",\r\n ], cell_type=\"markdown\"))\r\n cells.append(make_cell([\r\n \"# ── CONFIGURE ────────────────────────────────────────────────────────\",\r\n f'FILES_FOLDER = \"{lakehouse_folder}\" # folder under Files/ containing the CSVs',\r\n \"# ─────────────────────────────────────────────────────────────────────\",\r\n ]))\r\n\r\n # ── Cell 4: imports ──────────────────────────────────────────────────────\r\n cells.append(make_cell([\r\n \"## Cell 2 — Imports\",\r\n \"\",\r\n \"Imports the `re` module used for sanitising filenames and column names into\",\r\n \"valid delta table and column identifiers.\",\r\n \"\",\r\n \"**How to use**: No changes needed. 
Run once before the helper functions.\",\r\n ], cell_type=\"markdown\"))\r\n cells.append(make_cell([\r\n \"import re\",\r\n ]))\r\n\r\n # ── Cell 5: helper — table name ──────────────────────────────────────────\r\n cells.append(make_cell([\r\n \"## Cell 3 — Helper: Table Name\",\r\n \"\",\r\n \"Defines `to_table_name()`, which converts a CSV filename into a valid delta\",\r\n \"table name: strips the `.csv` extension, lowercases everything, and replaces\",\r\n \"any non-alphanumeric characters (spaces, hyphens, dots) with underscores.\",\r\n \"\",\r\n \"**How to use**: No changes needed. Examples: `Revenue Data.csv` → `revenue_data`,\",\r\n \"`Q1-Sales.csv` → `q1_sales`.\",\r\n ], cell_type=\"markdown\"))\r\n cells.append(make_cell([\r\n \"def to_table_name(filename: str) -> str:\",\r\n ' \"\"\"CSV filename → delta table name: lowercase, non-alphanumeric → underscores.\"\"\"',\r\n ' name = filename[:-4] if filename.lower().endswith(\".csv\") else filename',\r\n ' return re.sub(r\"[^a-zA-Z0-9]+\", \"_\", name).lower().strip(\"_\")',\r\n ]))\r\n\r\n # ── Cell 6: helper — clean columns ───────────────────────────────────────\r\n cells.append(make_cell([\r\n \"## Cell 4 — Helper: Clean Column Names\",\r\n \"\",\r\n \"Defines `clean_columns()`, which renames every column in a DataFrame to be\",\r\n \"delta-safe: lowercased, with spaces and special characters replaced by underscores.\",\r\n \"\",\r\n \"**How to use**: No changes needed. This is applied automatically to every CSV\",\r\n \"before writing. 
Example: `Hotel ID` → `hotel_id`, `Total Revenue (GBP)` → `total_revenue_gbp`.\",\r\n ], cell_type=\"markdown\"))\r\n cells.append(make_cell([\r\n \"def clean_columns(df):\",\r\n ' \"\"\"Rename columns to be delta-safe: lowercase, special chars → underscores.\"\"\"',\r\n \" for col in df.columns:\",\r\n ' clean = re.sub(r\"[^a-zA-Z0-9]+\", \"_\", col).lower().strip(\"_\")',\r\n \" if clean != col:\",\r\n \" df = df.withColumnRenamed(col, clean)\",\r\n \" return df\",\r\n ]))\r\n\r\n # ── Cell 7: find CSV files ───────────────────────────────────────────────\r\n cells.append(make_cell([\r\n \"## Cell 5 — Find CSV Files\",\r\n \"\",\r\n f\"Lists all `.csv` files in `Files/{lakehouse_folder}` inside **{lakehouse_name}**.\",\r\n \"Raises an error if no files are found so you know immediately if the upload\",\r\n \"step was skipped or the folder name is wrong.\",\r\n \"\",\r\n \"**How to use**: If this cell raises a `ValueError`, check that:\",\r\n \"1. CSV files have been uploaded to the lakehouse (see the setup cell above)\",\r\n f\"2. `FILES_FOLDER` in Cell 1 matches the exact folder name (currently `\\\"{lakehouse_folder}\\\"`)\",\r\n ], cell_type=\"markdown\"))\r\n cells.append(make_cell([\r\n \"files_path = f\\\"Files/{FILES_FOLDER}\\\"\",\r\n \"all_files = mssparkutils.fs.ls(files_path)\",\r\n 'csv_files = [f for f in all_files if f.name.lower().endswith(\".csv\")]',\r\n \"\",\r\n \"if not csv_files:\",\r\n \" raise ValueError(\",\r\n \" f\\\"No CSV files found in '{files_path}'. 
\\\"\",\r\n \" \\\"Check FILES_FOLDER is correct and files have been uploaded.\\\"\",\r\n \" )\",\r\n \"\",\r\n \"print(f\\\"Found {len(csv_files)} CSV file(s) in '{files_path}':\\\")\",\r\n \"for f in csv_files:\",\r\n \" print(f\\\" - {f.name}\\\")\",\r\n ]))\r\n\r\n # ── Cell 8: create delta tables ──────────────────────────────────────────\r\n cells.append(make_cell([\r\n \"## Cell 6 — Create Delta Tables\",\r\n \"\",\r\n \"Reads each CSV with automatic header detection and schema inference, applies\",\r\n \"`clean_columns()` to sanitise column names, then writes each file as a managed\",\r\n \"delta table in the **Tables** section of the lakehouse using `overwrite` mode.\",\r\n \"\",\r\n \"**How to use**: Run after confirming Cell 5 found your files. Each successful\",\r\n \"table prints `✅ Created table: <name> (<rows> rows, <cols> columns)`.\",\r\n \"Any failures print `❌ Failed: <file> → <table>: <error>` and are collected\",\r\n \"in `failed` for the summary cell — they do not stop the loop.\",\r\n ], cell_type=\"markdown\"))\r\n cells.append(make_cell([\r\n \"created, failed = [], []\",\r\n \"\",\r\n \"for f in csv_files:\",\r\n \" table_name = to_table_name(f.name)\",\r\n \" try:\",\r\n \" df = (\",\r\n \" spark.read\",\r\n ' .option(\"header\", \"true\")',\r\n ' .option(\"inferSchema\", \"true\")',\r\n \" .csv(f.path)\",\r\n \" )\",\r\n \" df = clean_columns(df)\",\r\n ' df.write.format(\"delta\").mode(\"overwrite\").saveAsTable(table_name)',\r\n \" print(f\\\"\\\\u2705 Created table: {table_name} ({df.count()} rows, {len(df.columns)} columns)\\\")\",\r\n \" created.append(table_name)\",\r\n \" except Exception as e:\",\r\n \" print(f\\\"\\\\u274c Failed: {f.name} \\\\u2192 {table_name}: {e}\\\")\",\r\n \" failed.append({\\\"file\\\": f.name, \\\"table\\\": table_name, \\\"error\\\": str(e)})\",\r\n ]))\r\n\r\n # ── Cell 9: summary ──────────────────────────────────────────────────────\r\n cells.append(make_cell([\r\n \"## Cell 7 — Summary\",\r\n 
\"\",\r\n \"Prints a final count of tables created vs failed, and lists any failed files\",\r\n \"with their error messages so you can diagnose and re-run individual files if needed.\",\r\n \"\",\r\n \"**How to use**: Review the output. If all tables show `✅`, the ingestion is\",\r\n \"complete. For any `❌` entries, fix the underlying issue (e.g. schema conflict,\",\r\n \"permissions) and re-run just Cell 6 for those files, or re-run the full notebook.\",\r\n ], cell_type=\"markdown\"))\r\n cells.append(make_cell([\r\n \"print(f\\\"{'=' * 60}\\\")\",\r\n \"print(f\\\"Summary: {len(created)} table(s) created, {len(failed)} failed.\\\")\",\r\n \"if failed:\",\r\n \" print(\\\"\\\\nFailed files:\\\")\",\r\n \" for item in failed:\",\r\n \" print(f\\\" - {item['file']} \\\\u2192 {item['table']}: {item['error']}\\\")\",\r\n ]))\r\n\r\n return {\r\n \"nbformat\": 4,\r\n \"nbformat_minor\": 5,\r\n \"metadata\": {\r\n \"kernelspec\": {\r\n \"display_name\": \"PySpark\",\r\n \"language\": \"python\",\r\n \"name\": \"synapse_pyspark\",\r\n },\r\n \"language_info\": {\r\n \"name\": \"python\",\r\n },\r\n \"trident\": {\r\n \"lakehouse\": {},\r\n },\r\n },\r\n \"cells\": cells,\r\n }\r\n\r\n\r\ndef main():\r\n parser = argparse.ArgumentParser(\r\n description=(\r\n \"Generate a Fabric-compatible PySpark notebook (.ipynb) that reads CSV \"\r\n \"files from a lakehouse Files folder and writes each as a delta table.\"\r\n ),\r\n formatter_class=argparse.RawDescriptionHelpFormatter,\r\n epilog=(\r\n \"Example:\\n\"\r\n \" python scripts/generate_notebook.py \\\\\\n\"\r\n ' --lakehouse \"LH_MY_BRONZE\" \\\\\\n'\r\n ' --lakehouse-folder \"CSV\" \\\\\\n'\r\n ' --output-notebook \"outputs/my-run/csv_to_delta_tables.ipynb\"\\n'\r\n ),\r\n )\r\n parser.add_argument(\r\n \"--lakehouse\", required=True,\r\n help='Name of the bronze lakehouse to attach. E.g. 
\"LH_MY_BRONZE\"',\r\n )\r\n parser.add_argument(\r\n \"--lakehouse-folder\", required=True,\r\n help='Folder under Files/ in the lakehouse containing the CSV files. E.g. \"CSV\"',\r\n )\r\n parser.add_argument(\r\n \"--output-notebook\", required=True,\r\n help='Path where the .ipynb file should be saved.',\r\n )\r\n args = parser.parse_args()\r\n\r\n notebook = build_notebook(args.lakehouse, args.lakehouse_folder)\r\n\r\n out_path = os.path.abspath(args.output_notebook)\r\n os.makedirs(os.path.dirname(out_path), exist_ok=True)\r\n with open(out_path, \"w\", encoding=\"utf-8\") as f:\r\n json.dump(notebook, f, indent=2, ensure_ascii=False)\r\n\r\n print(f\"Notebook written to: {out_path}\", file=sys.stderr)\r\n print(\"Import into Fabric: Workspace → New → Import notebook → select this .ipynb file.\", file=sys.stderr)\r\n print(\"Lakehouse is attached automatically via %%configure — just click Run All.\", file=sys.stderr)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n main()\r\n",
|
|
135
135
|
},
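Reading aid for the `+` side of the `scripts/generate_notebook.py` hunk above: the change places a markdown cell in front of each code cell. A minimal sketch of that pairing follows. `make_cell` is copied from the script as shown in the diff; `documented_cell` is a hypothetical helper introduced only for illustration, since the package itself appends the two cells with separate `cells.append(...)` calls.

```python
def make_cell(source_lines, cell_type="code"):
    """Build a notebook cell dict from a list of source lines (as in the diff)."""
    # Every line except the last gets a trailing newline, per the .ipynb layout.
    source = [line + "\n" for line in source_lines[:-1]] + [source_lines[-1]]
    cell = {
        "cell_type": cell_type,
        "metadata": {},
        "source": source,
        "outputs": [],
        "execution_count": None,
    }
    if cell_type == "markdown":
        # Markdown cells carry no outputs or execution count.
        del cell["outputs"]
        del cell["execution_count"]
    return cell


def documented_cell(title, description, code_lines):
    """Hypothetical helper: return [markdown cell, code cell] for one step."""
    doc = make_cell([f"## {title}", "", description], cell_type="markdown")
    return [doc, make_cell(code_lines)]


if __name__ == "__main__":
    cells = []
    cells += documented_cell(
        "Cell 2 - Imports",
        "Imports the `re` module used for sanitising names.",
        ["import re"],
    )
    for cell in cells:
        print(cell["cell_type"], cell["source"])
```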
|
|
136
136
|
{
|
|
137
137
|
relativePath: "scripts/generate_shortcut_commands.py",
|
|
@@ -139,7 +139,7 @@ export const EMBEDDED_SKILLS = [
|
|
|
139
139
|
},
|
|
140
140
|
{
|
|
141
141
|
relativePath: "scripts/generate_upload_commands.py",
|
|
142
|
-
content: "# /// script\r\n# requires-python = \">=3.8\"\r\n# dependencies = []\r\n# ///\r\n\"\"\"\r\nGenerate a PowerShell script containing `fab cp` commands to upload CSV files\r\nfrom a local folder to a Microsoft Fabric lakehouse Files section.\r\n\r\nThe generated script uses $PSScriptRoot to resolve paths relative to where\r\nthe .ps1 file is saved, so it works from any working directory.\r\n\r\nUsage example:\r\n python scripts/generate_upload_commands.py \\\r\n --local-folder \"C:\\\\Users\\\\rishi\\\\source\\\\Data\\\\CSV\" \\\r\n --workspace \"Landon Finance Month End\" \\\r\n --lakehouse \"Lh_landon_finance_bronze\" \\\r\n --lakehouse-folder \"raw\" \\\r\n --output-script \"outputs/my-run/upload_csv_files.ps1\"\r\n\"\"\"\r\nimport argparse\r\nimport os\r\nimport sys\r\n\r\n\r\ndef main():\r\n parser = argparse.ArgumentParser(\r\n description=(\r\n \"Scan a local folder for CSV files and write a PowerShell script \"\r\n \"containing `fab cp` commands to upload each one to a Microsoft \"\r\n \"Fabric lakehouse Files section. Paths in the script are resolved \"\r\n \"relative to the script's saved location via $PSScriptRoot.\"\r\n ),\r\n formatter_class=argparse.RawDescriptionHelpFormatter,\r\n epilog=(\r\n \"Example:\\n\"\r\n ' python scripts/generate_upload_commands.py \\\\\\n'\r\n ' --local-folder \"C:\\\\\\\\Users\\\\\\\\rishi\\\\\\\\source\\\\\\\\Data\\\\\\\\CSV\" \\\\\\n'\r\n ' --workspace \"Landon Finance Month End\" \\\\\\n'\r\n ' --lakehouse \"Lh_landon_finance_bronze\" \\\\\\n'\r\n ' --lakehouse-folder \"raw\" \\\\\\n'\r\n ' --output-script \"outputs/my-run/upload_csv_files.ps1\"\\n'\r\n ),\r\n )\r\n parser.add_argument(\r\n \"--local-folder\", required=True,\r\n help=\"Exact absolute path to the local folder containing CSV files. \"\r\n 'E.g. \"C:\\\\Users\\\\rishi\\\\source\\\\Data\\\\CSV\"',\r\n )\r\n parser.add_argument(\r\n \"--workspace\", required=True,\r\n help='Fabric workspace name (exact, case-sensitive). E.g. 
\"Landon Finance Month End\"',\r\n )\r\n parser.add_argument(\r\n \"--lakehouse\", required=True,\r\n help='Lakehouse name (exact, case-sensitive). E.g. \"Lh_landon_finance_bronze\"',\r\n )\r\n parser.add_argument(\r\n \"--lakehouse-folder\", required=True,\r\n help='Destination folder under the Files section of the lakehouse. E.g. \"raw\"',\r\n )\r\n parser.add_argument(\r\n \"--output-script\", required=True,\r\n help='Path where the generated .ps1 file should be saved. '\r\n 'E.g. \"outputs/my-run/upload_csv_files.ps1\"',\r\n )\r\n args = parser.parse_args()\r\n\r\n local_folder = os.path.abspath(args.local_folder)\r\n if not os.path.isdir(local_folder):\r\n print(f\"ERROR: Local folder not found: {local_folder}\", file=sys.stderr)\r\n print(\"Expected: a valid absolute directory path containing .csv files.\", file=sys.stderr)\r\n sys.exit(1)\r\n\r\n csv_files = sorted(f for f in os.listdir(local_folder) if f.lower().endswith(\".csv\"))\r\n if not csv_files:\r\n print(f\"ERROR: No CSV files found in: {local_folder}\", file=sys.stderr)\r\n sys.exit(1)\r\n\r\n # Compute the relative path from the output script's directory to the CSV folder\r\n script_dir = os.path.abspath(os.path.dirname(args.output_script))\r\n rel_csv_path = os.path.relpath(local_folder, script_dir)\r\n\r\n lines = [\r\n \"# \" + \"=\" * 77,\r\n f\"# Upload CSV Files to Bronze Lakehouse - Fabric CLI Script\",\r\n f\"# Workspace : {args.workspace}\",\r\n f\"# Lakehouse : {args.lakehouse}\",\r\n f\"# Destination: Files/{args.lakehouse_folder}\",\r\n f\"# CSV source : {local_folder}\",\r\n \"# \" + \"=\" * 77,\r\n \"# NOTE: This script uploads files one at a time via the Fabric CLI.\",\r\n \"# For 50+ files, Options 1 (OneLake File Explorer) or 2 (Fabric UI)\",\r\n \"# are significantly faster.\",\r\n \"# \" + \"=\" * 77,\r\n \"\",\r\n f'$csvFolder = \"{local_folder}\"',\r\n 'Write-Host \"CSV source folder: $csvFolder\" -ForegroundColor Cyan',\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 1 - 
Install the Fabric CLI\",\r\n \"# Comment out this line if you already have the Fabric CLI installed.\",\r\n \"# \" + \"-\" * 77,\r\n \"# pip install ms-fabric-cli\",\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 2 - Authenticate with Fabric\",\r\n \"# Comment out this line if you are already authenticated.\",\r\n \"# \" + \"-\" * 77,\r\n \"# fab auth login\",\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 3 - Create the destination folder in the lakehouse (if it doesn't exist)\",\r\n \"# \" + \"-\" * 77,\r\n \"\",\r\n f'fab mkdir \"{args.workspace}.Workspace/{args.lakehouse}.Lakehouse/Files/{args.lakehouse_folder}\"',\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 4 - Upload CSV files\",\r\n \"# fab cp requires a ./ prefix to identify the source as a local file.\",\r\n \"# Push-Location sets the working directory to the CSV folder so ./filename works.\",\r\n \"# \" + \"-\" * 77,\r\n \"\",\r\n \"Push-Location $csvFolder\",\r\n \"\",\r\n ]\r\n\r\n for filename in csv_files:\r\n dest = (\r\n f\"{args.workspace}.Workspace/\"\r\n f\"{args.lakehouse}.Lakehouse/\"\r\n f\"Files/{args.lakehouse_folder}/{filename}\"\r\n )\r\n lines.append(f'fab cp \"./{filename}\" `')\r\n lines.append(f' \"{dest}\"')\r\n lines.append(\"\")\r\n\r\n lines += [\r\n \"Pop-Location\",\r\n \"\",\r\n 'Write-Host \"Upload complete. Please verify the files are visible in the '\r\n 'lakehouse Files/' + args.lakehouse_folder + ' section before proceeding.\" '\r\n '-ForegroundColor Green',\r\n ]\r\n\r\n script_content = \"\\n\".join(lines)\r\n\r\n os.makedirs(os.path.dirname(os.path.abspath(args.output_script)), exist_ok=True)\r\n with open(args.output_script, \"w\", encoding=\"utf-8\") as f:\r\n f.write(script_content)\r\n\r\n print(f\"Script written to: {os.path.abspath(args.output_script)}\", file=sys.stderr)\r\n print(f\"{len(csv_files)} file(s) included.\", file=sys.stderr)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n main()\r\n\r\n",
|
|
142
|
+
content: "# /// script\r\n# requires-python = \">=3.8\"\r\n# dependencies = []\r\n# ///\r\n\"\"\"\r\nGenerate a PowerShell script containing `fab cp` commands to upload CSV files\r\nfrom a local folder to a Microsoft Fabric lakehouse Files section.\r\n\r\nThe generated script uses $PSScriptRoot to resolve paths relative to where\r\nthe .ps1 file is saved, so it works from any working directory.\r\n\r\nUsage example:\r\n python scripts/generate_upload_commands.py \\\r\n --local-folder \"C:\\\\Users\\\\rishi\\\\source\\\\Data\\\\CSV\" \\\r\n --workspace \"Landon Finance Month End\" \\\r\n --lakehouse \"Lh_landon_finance_bronze\" \\\r\n --lakehouse-folder \"raw\" \\\r\n --output-script \"outputs/my-run/upload_csv_files.ps1\"\r\n\"\"\"\r\nimport argparse\r\nimport os\r\nimport sys\r\n\r\n\r\ndef main():\r\n parser = argparse.ArgumentParser(\r\n description=(\r\n \"Scan a local folder for CSV files and write a PowerShell script \"\r\n \"containing `fab cp` commands to upload each one to a Microsoft \"\r\n \"Fabric lakehouse Files section. Paths in the script are resolved \"\r\n \"relative to the script's saved location via $PSScriptRoot.\"\r\n ),\r\n formatter_class=argparse.RawDescriptionHelpFormatter,\r\n epilog=(\r\n \"Example:\\n\"\r\n ' python scripts/generate_upload_commands.py \\\\\\n'\r\n ' --local-folder \"C:\\\\\\\\Users\\\\\\\\rishi\\\\\\\\source\\\\\\\\Data\\\\\\\\CSV\" \\\\\\n'\r\n ' --workspace \"Landon Finance Month End\" \\\\\\n'\r\n ' --lakehouse \"Lh_landon_finance_bronze\" \\\\\\n'\r\n ' --lakehouse-folder \"raw\" \\\\\\n'\r\n ' --output-script \"outputs/my-run/upload_csv_files.ps1\"\\n'\r\n ),\r\n )\r\n parser.add_argument(\r\n \"--local-folder\", required=True,\r\n help=\"Exact absolute path to the local folder containing CSV files. \"\r\n 'E.g. \"C:\\\\Users\\\\rishi\\\\source\\\\Data\\\\CSV\"',\r\n )\r\n parser.add_argument(\r\n \"--workspace\", required=True,\r\n help='Fabric workspace name (exact, case-sensitive). E.g. 
\"Landon Finance Month End\"',\r\n )\r\n parser.add_argument(\r\n \"--lakehouse\", required=True,\r\n help='Lakehouse name (exact, case-sensitive). E.g. \"Lh_landon_finance_bronze\"',\r\n )\r\n parser.add_argument(\r\n \"--lakehouse-folder\", required=True,\r\n help='Destination folder under the Files section of the lakehouse. E.g. \"raw\"',\r\n )\r\n parser.add_argument(\r\n \"--output-script\", required=True,\r\n help='Path where the generated .ps1 file should be saved. '\r\n 'E.g. \"outputs/my-run/upload_csv_files.ps1\"',\r\n )\r\n args = parser.parse_args()\r\n\r\n local_folder = os.path.abspath(args.local_folder)\r\n if not os.path.isdir(local_folder):\r\n print(f\"ERROR: Local folder not found: {local_folder}\", file=sys.stderr)\r\n print(\"Expected: a valid absolute directory path containing .csv files.\", file=sys.stderr)\r\n sys.exit(1)\r\n\r\n csv_files = sorted(f for f in os.listdir(local_folder) if f.lower().endswith(\".csv\"))\r\n if not csv_files:\r\n print(f\"ERROR: No CSV files found in: {local_folder}\", file=sys.stderr)\r\n sys.exit(1)\r\n\r\n # Compute the relative path from the output script's directory to the CSV folder\r\n script_dir = os.path.abspath(os.path.dirname(args.output_script))\r\n rel_csv_path = os.path.relpath(local_folder, script_dir)\r\n\r\n lines = [\r\n \"# \" + \"=\" * 77,\r\n f\"# Upload CSV Files to Bronze Lakehouse - Fabric CLI Script\",\r\n f\"# Workspace : {args.workspace}\",\r\n f\"# Lakehouse : {args.lakehouse}\",\r\n f\"# Destination: Files/{args.lakehouse_folder}\",\r\n f\"# CSV source : {local_folder}\",\r\n \"# \" + \"=\" * 77,\r\n \"# NOTE: This script uploads files one at a time via the Fabric CLI.\",\r\n \"# For 50+ files, Options 1 (OneLake File Explorer) or 2 (Fabric UI)\",\r\n \"# are significantly faster.\",\r\n \"# \" + \"=\" * 77,\r\n \"\",\r\n f'$csvFolder = \"{local_folder}\"',\r\n 'Write-Host \"CSV source folder: $csvFolder\" -ForegroundColor Cyan',\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 1 - 
Install the Fabric CLI\",\r\n \"# Comment out this line if you already have the Fabric CLI installed.\",\r\n \"# \" + \"-\" * 77,\r\n \"# pip install ms-fabric-cli\",\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 2 - Authenticate with Fabric\",\r\n \"# Comment out this line if you are already authenticated.\",\r\n \"# \" + \"-\" * 77,\r\n \"# fab auth login\",\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 3 - Create the destination folder in the lakehouse (if it doesn't exist)\",\r\n \"# \" + \"-\" * 77,\r\n \"\",\r\n f'fab mkdir \"{args.workspace}.Workspace/{args.lakehouse}.Lakehouse/Files/{args.lakehouse_folder}\" -f',\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 4 - Upload CSV files\",\r\n \"# fab cp requires a ./ prefix to identify the source as a local file.\",\r\n \"# Push-Location sets the working directory to the CSV folder so ./filename works.\",\r\n \"# \" + \"-\" * 77,\r\n \"\",\r\n \"Push-Location $csvFolder\",\r\n \"\",\r\n ]\r\n\r\n for filename in csv_files:\r\n dest = (\r\n f\"{args.workspace}.Workspace/\"\r\n f\"{args.lakehouse}.Lakehouse/\"\r\n f\"Files/{args.lakehouse_folder}/{filename}\"\r\n )\r\n lines.append(f'fab cp \"./{filename}\" `')\r\n lines.append(f' \"{dest}\"')\r\n lines.append(\"\")\r\n\r\n lines += [\r\n \"Pop-Location\",\r\n \"\",\r\n 'Write-Host \"Upload complete. Please verify the files are visible in the '\r\n 'lakehouse Files/' + args.lakehouse_folder + ' section before proceeding.\" '\r\n '-ForegroundColor Green',\r\n ]\r\n\r\n script_content = \"\\n\".join(lines)\r\n\r\n os.makedirs(os.path.dirname(os.path.abspath(args.output_script)), exist_ok=True)\r\n with open(args.output_script, \"w\", encoding=\"utf-8\") as f:\r\n f.write(script_content)\r\n\r\n print(f\"Script written to: {os.path.abspath(args.output_script)}\", file=sys.stderr)\r\n print(f\"{len(csv_files)} file(s) included.\", file=sys.stderr)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n main()\r\n\r\n",
|
|
143
143
|
},
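Reading aid for the `scripts/generate_upload_commands.py` hunk above: the only change on the `+` side is the `-f` suffix on the generated `fab mkdir` line. The sketch below reproduces just the command-building part so the shape of the generated PowerShell is easier to see. The function name `upload_commands` is an assumption for illustration; the command strings are copied from the diff, and what `-f` does is not stated there, so it is passed through unchanged.

```python
def upload_commands(workspace, lakehouse, folder, filenames):
    """Hypothetical condensed version of the line-building logic in the diff."""
    root = f"{workspace}.Workspace/{lakehouse}.Lakehouse/Files/{folder}"
    lines = [
        # Destination folder is created before any copy, as on the + side.
        f'fab mkdir "{root}" -f',
        "",
        # The generated script defines $csvFolder earlier; ./name needs this cwd.
        "Push-Location $csvFolder",
        "",
    ]
    for name in filenames:
        # The trailing backtick is PowerShell's line-continuation character.
        lines.append(f'fab cp "./{name}" `')
        lines.append(f'    "{root}/{name}"')
        lines.append("")
    lines.append("Pop-Location")
    return lines


if __name__ == "__main__":
    print("\n".join(upload_commands(
        "Landon Finance Month End", "Lh_landon_finance_bronze", "raw",
        ["Revenue Data.csv"],
    )))
```

The `./` prefix and `Push-Location` pairing follows the gotcha recorded in the SKILL.md content: the source path is written relative to the CSV folder rather than as an absolute Windows path.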
|
|
144
144
|
],
|
|
145
145
|
},
|
|
@@ -167,7 +167,7 @@ export const EMBEDDED_SKILLS = [
files: [
{
relativePath: "SKILL.md",
-
content: "---\r\nname: generate-fabric-workspace\r\ndescription: >\r\n Use this skill when asked to create, provision, or set up a Microsoft Fabric\r\n workspace. Triggers on: \"create a Fabric workspace\", \"provision a workspace\r\n in Fabric\", \"set up a new Fabric workspace\", \"generate a workspace with\r\n capacity and permissions\", \"create workspace and assign roles in Fabric\".\r\n Collects workspace name, capacity, principals/roles, and optional domain\r\n settings, then creates the workspace using one of three approaches: PySpark\r\n notebook, PowerShell script, or interactive terminal commands. Produces a\r\n workspace definition markdown as a creation audit record. Does NOT trigger\r\n for general Fabric questions, item creation within a workspace, or\r\n workspace deletion tasks.\r\nlicense: MIT\r\ncompatibility: >\r\n ms-fabric-cli required (pip install ms-fabric-cli). Approach 1 requires a\r\n Fabric notebook environment. Approaches 2 and 3 require fab CLI installed\r\n locally with network access to Microsoft Fabric.\r\n---\r\n\r\n# Generate Fabric Workspace\r\n\r\n> ⚠️ **GOVERNANCE**: This skill produces notebooks and scripts for the operator to\r\n> review and run — it never executes commands directly against a live Fabric environment.\r\n> Present each generated artefact to the operator before they run it.\r\n>\r\n> ⚠️ **GENERATION**: Always run the generator scripts (`scripts/generate_notebook.py`,\r\n> `scripts/generate_ps1.py`) via Bash to produce artefacts — never generate notebook\r\n> or script content directly. Do not present generator scripts themselves as outputs.\r\n>\r\n> **Canonical notebook pattern** — every generated PySpark notebook follows this\r\n> exact cell structure. Do not deviate:\r\n> 1. `%pip install ms-fabric-cli -q --no-warn-conflicts` (Cell 1 — restart kernel after)\r\n> 2. 
`notebookutils.credentials.getToken('pbi')` and `getToken('storage')` → set as\r\n> `os.environ['FAB_TOKEN']`, `FAB_TOKEN_ONELAKE`, `FAB_TOKEN_AZURE` (Cell 2 — auth)\r\n> 3. All workspace operations use `!fab` shell commands — `!fab mkdir`, `!fab get`,\r\n> `!fab acl set`, `!fab api`, etc. Python subprocess is never used.\r\n\r\nCreates a Microsoft Fabric workspace assigned to a specified capacity, with\r\naccess roles and optional domain assignment. If the workspace already exists,\r\ncreation is skipped and roles/domain are updated. Outputs a workspace\r\ndefinition markdown as an audit trail.\r\n\r\n## Orchestrated Context\r\n\r\nWhen invoked from a workflow agent, read `00-environment-discovery/environment-profile.md`\r\nand the SOP before asking the user anything.\r\n\r\n| Parameter | Source when orchestrated |\r\n|---|---|\r\n| Deployment approach (notebook / PowerShell / terminal) | Environment profile |\r\n| Capacity name | Environment profile |\r\n| Workspace name(s) | Environment profile or implementation plan |\r\n| Access control method + Object ID resolution | Environment profile |\r\n| Domain assignment approach | Environment profile |\r\n| Credential management approach (Key Vault / runtime) | Environment profile |\r\n| Domain name, role assignments, group names | SOP shared parameters |\r\n\r\n**Only ask for parameters not found in these documents.** Summarise what was resolved\r\nautomatically, then ask for what remains.\r\n\r\n## Step 1 — Choose Approach\r\n\r\nAsk the user:\r\n\r\n> \"Which approach would you like to use?\r\n> 1. **PySpark Notebook** — generates a notebook to run inside Fabric\r\n> (authenticated automatically via the notebook environment)\r\n> 2. **PowerShell Script** — generates a `.ps1` for your review before execution\r\n> (requires fab CLI installed locally)\r\n> 3. 
**Interactive Terminal** — runs fab CLI commands one by one in the terminal,\r\n> with your confirmation at each step (requires fab CLI installed locally)\"\r\n\r\n### Authentication by approach\r\n\r\n| Approach | Authentication |\r\n|---|---|\r\n| PySpark Notebook | Auto via `notebookutils.credentials.getToken('pbi')` inside Fabric |\r\n| PowerShell / Terminal | `fab auth login` (browser pop-up) or set `$env:FAB_TOKEN` / `FAB_TOKEN` |\r\n\r\n## Step 2 — Domain Handling\r\n\r\nAsk the user:\r\n\r\n> \"Would you like to:\r\n> A. **Create a new domain** and assign the workspace to it\r\n> ⚠️ Requires **Fabric Admin** tenant-level permissions.\r\n> You will also need to specify an **Entra group** that will be allowed to\r\n> add/remove workspaces from this domain (the domain contributor group).\r\n> B. **Assign the workspace to an existing domain**\r\n> C. **Skip domain assignment**\"\r\n\r\n- If **A**: collect `DOMAIN_NAME` and `DOMAIN_CONTRIBUTOR_GROUP` (the Entra\r\n group display name allowed to add/remove workspaces from the domain). Confirm\r\n the user has Fabric Admin rights.\r\n- If **B**: collect `DOMAIN_NAME` only.\r\n- If **C**: no domain parameters needed.\r\n\r\n## Step 3 — Collect Parameters\r\n\r\nCollect these values from the user:\r\n\r\n| Parameter | Required | Description |\r\n|---|---|---|\r\n| `WORKSPACE_NAME` | Yes | Display name for the workspace |\r\n| `CAPACITY_NAME` | Yes | Exact name of the Fabric capacity to assign |\r\n| `DOMAIN_NAME` | If A or B | Name of the domain (new or existing) |\r\n| `DOMAIN_CONTRIBUTOR_GROUP` | If A | Display name of the Entra group that manages the domain |\r\n| `WORKSPACE_ROLES` | Conditional | Additional principals + roles (see approach-specific guidance below) |\r\n\r\n### Workspace roles — approach-specific guidance\r\n\r\nThe workspace creator is **automatically assigned as Admin**. Before collecting\r\nadditional roles, ask:\r\n\r\n> \"You (the creator) will be automatically assigned as workspace Admin. 
Do you\r\n> want to assign additional roles to other users or groups?\"\r\n\r\nIf **no**, skip role collection entirely. If **yes**, load\r\n`references/role-assignment.md` for approach-specific guidance on collecting\r\nprincipals, group resolution requirements, and Service Principal prerequisites.\r\n\r\nFor each additional principal, collect:\r\n- User **email address (UPN)** or Entra **group display name** — do NOT ask for Object IDs\r\n- Principal type: `User` or `Group` (or `ServicePrincipal`)\r\n- Role: `Admin`, `Member`, `Contributor`, or `Viewer`\r\n\r\n## Step 4 — Execute\r\n\r\n### Approach 1: PySpark Notebook\r\n\r\nIf role assignment includes Entra groups, `TENANT_ID`, `CLIENT_ID`, and `CLIENT_SECRET`\r\nare required — entered directly into Cell 1 of the generated notebook. See\r\n`references/role-assignment.md` for prerequisite details.\r\n\r\nRun `scripts/generate_notebook.py` with the collected parameters:\r\n\r\n```bash\r\npython scripts/generate_notebook.py \\\r\n --workspace-name \"WORKSPACE_NAME\" \\\r\n --capacity-name \"CAPACITY_NAME\" \\\r\n --roles \"user@corp.com:User:Admin,Finance Team:Group:Member\" \\\r\n [--domain-name \"DOMAIN_NAME\"] \\\r\n [--create-domain] \\\r\n [--domain-contributor-group \"DOMAIN_CONTRIBUTOR_GROUP\"] \\\r\n --output workspace_setup.ipynb\r\n```\r\n\r\nPresent the generated `workspace_setup.ipynb` to the user and instruct them to:\r\n1. Upload to any Fabric workspace as a notebook\r\n2. Run each cell **one at a time**, reading the output before proceeding\r\n3. ✅ Verification cells are clearly marked — confirm output before moving on\r\n4. 
Share the output of Cell 7 (`fab ls`) and Cell 9 (`fab acl ls`)\r\n\r\n### Approach 2: PowerShell Script\r\n\r\nRun `scripts/generate_ps1.py` with the collected parameters:\r\n\r\n```bash\r\npython scripts/generate_ps1.py \\\r\n --workspace-name \"WORKSPACE_NAME\" \\\r\n --capacity-name \"CAPACITY_NAME\" \\\r\n --roles \"user@corp.com:User:Admin,Finance Team:Group:Member\" \\\r\n [--domain-name \"DOMAIN_NAME\"] \\\r\n [--create-domain] \\\r\n [--domain-contributor-group \"DOMAIN_CONTRIBUTOR_GROUP\"] \\\r\n --output workspace_setup.ps1\r\n```\r\n\r\nShow `workspace_setup.ps1` to the user for review. **Do not execute until the\r\nuser confirms.** Then run:\r\n\r\n```powershell\r\n.\\workspace_setup.ps1\r\n```\r\n\r\n### Approach 3: Interactive Terminal\r\n\r\nRun these commands in sequence. Show output after each and ask the user to\r\nconfirm before continuing.\r\n\r\n**Install and authenticate:**\r\n```bash\r\npip install ms-fabric-cli\r\nfab auth login\r\n```\r\n\r\n**Check if workspace already exists:**\r\n```bash\r\nfab exists \"WORKSPACE_NAME.Workspace\"\r\n```\r\n- Exit code 0 → workspace exists → skip creation, go to role assignment\r\n- Non-zero → proceed to create\r\n\r\n**Create workspace:**\r\n```bash\r\nfab mkdir \"WORKSPACE_NAME.Workspace\" -P capacityName=CAPACITY_NAME\r\n```\r\n\r\n**Verify creation:**\r\n```bash\r\nfab exists \"WORKSPACE_NAME.Workspace\"\r\nfab ls \"WORKSPACE_NAME.Workspace\"\r\n```\r\n\r\n**Resolve principal IDs** (before assigning roles — repeat for each principal):\r\n```bash\r\n# For a user (by UPN / email):\r\naz ad user show --id user@corp.com --query id -o tsv\r\n\r\n# For a group (by display name):\r\naz ad group show --group \"Finance Team\" --query id -o tsv\r\n\r\n# For a service principal (by display name or app ID):\r\naz ad sp show --id \"My App Name\" --query id -o tsv\r\n```\r\n\r\n**Assign roles** (use the resolved Object ID, role in lowercase):\r\n```bash\r\nfab acl set \"WORKSPACE_NAME.Workspace\" -I 
<RESOLVED_OBJECT_ID> -R role\r\n```\r\n\r\n**Verify roles:**\r\n```bash\r\nfab acl ls \"WORKSPACE_NAME.Workspace\"\r\n```\r\n\r\n**Create domain** (if Step 2 = A):\r\n```bash\r\n# Resolve domain contributor group ID:\r\naz ad group show --group \"DOMAIN_CONTRIBUTOR_GROUP\" --query id -o tsv\r\n\r\nfab mkdir \"DOMAIN_NAME.domain\"\r\nfab acl set \".domains/DOMAIN_NAME.Domain\" -I <RESOLVED_GROUP_ID> -R contributor\r\n```\r\n\r\n**Assign workspace to domain** (if Step 2 = A or B):\r\n```bash\r\nfab assign \".domains/DOMAIN_NAME.Domain\" -W \"WORKSPACE_NAME.Workspace\"\r\n```\r\n\r\n## Step 5 — Generate Workspace Definition\r\n\r\nCollect from the command output (or ask the user):\r\n- Workspace ID (appears in `fab ls` output)\r\n- Tenant name or tenant ID\r\n- Confirmed principals and roles\r\n- Domain name (if assigned)\r\n\r\nRun `scripts/generate_definition.py`:\r\n\r\n```bash\r\npython scripts/generate_definition.py \\\r\n --workspace-name \"WORKSPACE_NAME\" \\\r\n --workspace-id \"WORKSPACE_ID\" \\\r\n --capacity-name \"CAPACITY_NAME\" \\\r\n --tenant \"TENANT_NAME\" \\\r\n --roles \"user@corp.com:User:Admin,Finance Team:Group:Member\" \\\r\n [--domain-name \"DOMAIN_NAME\"] \\\r\n --approach \"notebook|powershell|terminal\" \\\r\n --output workspace_definition.md\r\n```\r\n\r\nPresent `workspace_definition.md` to the user.\r\n\r\n## Gotchas\r\n\r\n- Workspace path format is `WorkspaceName.Workspace` — the `.Workspace` suffix is required.\r\n- The capacity must be **Active** before `fab mkdir`. If you see `CapacityNotInActiveState`,\r\n ask the user to resume the capacity in the Azure portal before retrying.\r\n- `notebookutils.credentials.getToken()` in Fabric notebooks **does not support Microsoft Graph**.\r\n The notebook approach requires a Service Principal with `Group.Read.All` + `User.Read.All`\r\n application permissions and admin consent. The SP credentials are entered in Cell 1 of\r\n the generated notebook. 
If the user doesn't have an SP, direct them to the PowerShell\r\n or Interactive Terminal approach instead.\r\n- Domain creation requires Fabric Administrator tenant-level rights. If the user cannot\r\n create a domain, fall back to assigning an existing one or skipping.\r\n- `fab exists` uses exit code (0 = exists, non-zero = not found) — do not rely on stdout text alone.\r\n- In the notebook approach, `notebookutils` is only available inside a Fabric notebook.\r\n The generated script must not be run as a plain Python script outside Fabric.\r\n- The `.domain` suffix (lowercase) is used in `fab mkdir`; `.Domain` (capitalised) is\r\n used in `fab assign` and `fab acl set` — these are different and both matter.\r\n- Role values passed to `fab acl set` must be **lowercase** (`admin`, `member`, `contributor`, `viewer`).\r\n The scripts handle this conversion automatically.\r\n- For PowerShell/terminal approaches, `az login` must be completed before `az ad user/group show` will work.\r\n This is separate from `fab auth login` — both are required.\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_notebook.py`** — Generates PySpark notebook. Run: `python scripts/generate_notebook.py --help`\r\n- **`scripts/generate_ps1.py`** — Generates PowerShell script. Run: `python scripts/generate_ps1.py --help`\r\n- **`scripts/generate_definition.py`** — Generates workspace definition markdown. Run: `python scripts/generate_definition.py --help`\r\n\r\n## Available References\r\n\r\n- **`references/role-assignment.md`** — Approach-specific guidance for assigning roles to users and Entra groups. Load when user wants to assign additional workspace roles.\r\n- **`references/fabric-cli-reference.md`** — Fabric CLI command reference.\r\n",
+
content: "---\r\nname: generate-fabric-workspace\r\ndescription: >\r\n Use this skill when asked to create, provision, or set up a Microsoft Fabric\r\n workspace. Triggers on: \"create a Fabric workspace\", \"provision a workspace\r\n in Fabric\", \"set up a new Fabric workspace\", \"generate a workspace with\r\n capacity and permissions\", \"create workspace and assign roles in Fabric\".\r\n Collects workspace name, capacity, principals/roles, and optional domain\r\n settings, then creates the workspace using one of three approaches: PySpark\r\n notebook, PowerShell script, or interactive terminal commands. Produces a\r\n workspace definition markdown as a creation audit record. Does NOT trigger\r\n for general Fabric questions, item creation within a workspace, or\r\n workspace deletion tasks.\r\nlicense: MIT\r\ncompatibility: >\r\n ms-fabric-cli required (pip install ms-fabric-cli). Approach A requires a\r\n Fabric notebook environment. Approaches B and C require fab CLI installed\r\n locally with network access to Microsoft Fabric.\r\n---\r\n\r\n# Generate Fabric Workspace\r\n\r\n> ⚠️ **GOVERNANCE**: This skill produces notebooks and scripts for the operator to\r\n> review and run — it never executes commands directly against a live Fabric environment.\r\n> Present each generated artefact to the operator before they run it.\r\n>\r\n> ⚠️ **GENERATION**: Always run the generator scripts (`scripts/generate_notebook.py`,\r\n> `scripts/generate_ps1.py`) via Bash to produce artefacts — never generate notebook\r\n> or script content directly. Do not present generator scripts themselves as outputs.\r\n>\r\n> **Canonical notebook pattern** — every generated PySpark notebook follows this\r\n> exact cell structure. Do not deviate:\r\n> 1. `%pip install ms-fabric-cli -q --no-warn-conflicts` (Cell 1 — no kernel restart needed)\r\n> 2. 
`notebookutils.credentials.getToken('pbi')` and `getToken('storage')` → set as\r\n> `os.environ['FAB_TOKEN']`, `FAB_TOKEN_ONELAKE`, `FAB_TOKEN_AZURE` (Cell 2 — auth)\r\n> 3. All workspace operations use `!fab` shell commands — `!fab mkdir`, `!fab get`,\r\n> `!fab acl set`, `!fab api`, etc. Python subprocess is never used.\r\n\r\nCreates a Microsoft Fabric workspace assigned to a specified capacity, with\r\naccess roles and optional domain assignment. If the workspace already exists,\r\ncreation is skipped and roles/domain are updated. Outputs a workspace\r\ndefinition markdown as an audit trail.\r\n\r\n## Orchestrated Context\r\n\r\nWhen invoked from a workflow agent, read `00-environment-discovery/environment-profile.md`\r\nand the SOP before asking the user anything.\r\n\r\n| Parameter | Source when orchestrated |\r\n|---|---|\r\n| Deployment approach (notebook / PowerShell / terminal) | Environment profile |\r\n| Capacity name | Environment profile |\r\n| Workspace name(s) | Environment profile or implementation plan |\r\n| Access control method + Object ID resolution | Environment profile |\r\n| Domain assignment approach | Environment profile |\r\n| Credential management approach (Key Vault / runtime) | Environment profile |\r\n| Domain name, role assignments, group names | SOP shared parameters |\r\n\r\n**Only ask for parameters not found in these documents.** Summarise what was resolved\r\nautomatically, then ask for what remains.\r\n\r\n## Step 1 — Choose Approach\r\n\r\nAsk the user:\r\n\r\n> \"Which approach would you like to use?\r\n> A. **PySpark Notebook** — generates a notebook to run inside Fabric\r\n> (authenticated automatically via the notebook environment)\r\n> B. **PowerShell Script** — generates a `.ps1` for your review before execution\r\n> (requires fab CLI installed locally)\r\n> C. 
**Interactive Terminal** — runs fab CLI commands one by one in the terminal,\r\n> with your confirmation at each step (requires fab CLI installed locally)\"\r\n\r\n### Authentication by approach\r\n\r\n| Approach | Authentication |\r\n|---|---|\r\n| PySpark Notebook | Auto via `notebookutils.credentials.getToken('pbi')` inside Fabric |\r\n| PowerShell / Terminal | `fab auth login` (browser pop-up) or set `$env:FAB_TOKEN` / `FAB_TOKEN` |\r\n\r\n## Step 2 — Domain Handling\r\n\r\nAsk the user:\r\n\r\n> \"Would you like to:\r\n> A. **Create a new domain** and assign the workspace to it\r\n> ⚠️ Requires **Fabric Admin** tenant-level permissions.\r\n> You will also need to specify an **Entra group** that will be allowed to\r\n> add/remove workspaces from this domain (the domain contributor group).\r\n> B. **Assign the workspace to an existing domain**\r\n> C. **Skip domain assignment**\"\r\n\r\n- If **A**: collect `DOMAIN_NAME` and `DOMAIN_CONTRIBUTOR_GROUP` (the Entra\r\n group display name allowed to add/remove workspaces from the domain). Confirm\r\n the user has Fabric Admin rights.\r\n- If **B**: collect `DOMAIN_NAME` only.\r\n- If **C**: no domain parameters needed.\r\n\r\n## Step 3 — Collect Parameters\r\n\r\nCollect these values from the user:\r\n\r\n| Parameter | Required | Description |\r\n|---|---|---|\r\n| `WORKSPACE_NAME` | Yes | Display name for the workspace |\r\n| `CAPACITY_NAME` | Yes | Exact name of the Fabric capacity to assign |\r\n| `DOMAIN_NAME` | If A or B | Name of the domain (new or existing) |\r\n| `DOMAIN_CONTRIBUTOR_GROUP` | If A | Display name of the Entra group that manages the domain |\r\n| `WORKSPACE_ROLES` | Conditional | Additional principals + roles (see approach-specific guidance below) |\r\n\r\n### Workspace roles — approach-specific guidance\r\n\r\nThe workspace creator is **automatically assigned as Admin**. Before collecting\r\nadditional roles, ask:\r\n\r\n> \"You (the creator) will be automatically assigned as workspace Admin. 
Do you\r\n> want to assign additional roles to other users or groups?\"\r\n\r\nIf **no**, skip role collection entirely. If **yes**, load\r\n`references/role-assignment.md` for approach-specific guidance on collecting\r\nprincipals, group resolution requirements, and Service Principal prerequisites.\r\n\r\nFor each additional principal, collect:\r\n- User **email address (UPN)** or Entra **group display name** — do NOT ask for Object IDs\r\n- Principal type: `User` or `Group` (or `ServicePrincipal`)\r\n- Role: `Admin`, `Member`, `Contributor`, or `Viewer`\r\n\r\n## Step 4 — Execute\r\n\r\n### Approach A: PySpark Notebook\r\n\r\nIf role assignment includes Entra groups, `TENANT_ID`, `CLIENT_ID`, and `CLIENT_SECRET`\r\nare required — entered directly into Cell 1 of the generated notebook. See\r\n`references/role-assignment.md` for prerequisite details.\r\n\r\nRun `scripts/generate_notebook.py` with the collected parameters:\r\n\r\n```bash\r\npython scripts/generate_notebook.py \\\r\n --workspace-name \"WORKSPACE_NAME\" \\\r\n --capacity-name \"CAPACITY_NAME\" \\\r\n --roles \"user@corp.com:User:Admin,Finance Team:Group:Member\" \\\r\n [--domain-name \"DOMAIN_NAME\"] \\\r\n [--create-domain] \\\r\n [--domain-contributor-group \"DOMAIN_CONTRIBUTOR_GROUP\"] \\\r\n --output workspace_setup.ipynb\r\n```\r\n\r\nPresent the generated `workspace_setup.ipynb` to the user and instruct them to:\r\n1. Upload to any Fabric workspace as a notebook\r\n2. Run each cell **one at a time**, reading the output before proceeding\r\n3. ✅ Verification cells are clearly marked — confirm output before moving on\r\n4. 
Share the output of Cell 7 (`fab ls`) and Cell 9 (`fab acl ls`)\r\n\r\n### Approach B: PowerShell Script\r\n\r\nRun `scripts/generate_ps1.py` with the collected parameters:\r\n\r\n```bash\r\npython scripts/generate_ps1.py \\\r\n --workspace-name \"WORKSPACE_NAME\" \\\r\n --capacity-name \"CAPACITY_NAME\" \\\r\n --roles \"user@corp.com:User:Admin,Finance Team:Group:Member\" \\\r\n [--domain-name \"DOMAIN_NAME\"] \\\r\n [--create-domain] \\\r\n [--domain-contributor-group \"DOMAIN_CONTRIBUTOR_GROUP\"] \\\r\n --output workspace_setup.ps1\r\n```\r\n\r\nShow `workspace_setup.ps1` to the user for review. **Do not execute until the\r\nuser confirms.** Then run:\r\n\r\n```powershell\r\n.\\workspace_setup.ps1\r\n```\r\n\r\n### Approach C: Interactive Terminal\r\n\r\nRun these commands in sequence. Show output after each and ask the user to\r\nconfirm before continuing.\r\n\r\n**Install and authenticate:**\r\n```bash\r\npip install ms-fabric-cli\r\nfab auth login\r\n```\r\n\r\n**Check if workspace already exists:**\r\n```bash\r\nfab exists \"WORKSPACE_NAME.Workspace\"\r\n```\r\n- Exit code 0 → workspace exists → skip creation, go to role assignment\r\n- Non-zero → proceed to create\r\n\r\n**Create workspace:**\r\n```bash\r\nfab mkdir \"WORKSPACE_NAME.Workspace\" -P capacityName=CAPACITY_NAME -f\r\n```\r\n\r\n**Verify creation:**\r\n```bash\r\nfab exists \"WORKSPACE_NAME.Workspace\"\r\nfab ls \"WORKSPACE_NAME.Workspace\"\r\n```\r\n\r\n**Resolve principal IDs** (before assigning roles — repeat for each principal):\r\n```bash\r\n# For a user (by UPN / email):\r\naz ad user show --id user@corp.com --query id -o tsv\r\n\r\n# For a group (by display name):\r\naz ad group show --group \"Finance Team\" --query id -o tsv\r\n\r\n# For a service principal (by display name or app ID):\r\naz ad sp show --id \"My App Name\" --query id -o tsv\r\n```\r\n\r\n**Assign roles** (use the resolved Object ID, role in lowercase):\r\n```bash\r\nfab acl set \"WORKSPACE_NAME.Workspace\" -I 
<RESOLVED_OBJECT_ID> -R role\r\n```\r\n\r\n**Verify roles:**\r\n```bash\r\nfab acl ls \"WORKSPACE_NAME.Workspace\"\r\n```\r\n\r\n**Create domain** (if Step 2 = A):\r\n```bash\r\nfab create \".domains/DOMAIN_NAME.Domain\" -f\r\n```\r\n⚠️ After creation, set domain contributors manually in the Fabric Admin portal\r\n(admin.powerbi.com → Domains → DOMAIN_NAME → Manage contributors).\r\n`fab acl set` is not supported on `.domains/` paths.\r\n\r\n**Assign workspace to domain** (if Step 2 = A or B):\r\n```bash\r\nfab assign \".domains/DOMAIN_NAME.Domain\" -W \"WORKSPACE_NAME.Workspace\" -f\r\n```\r\n\r\n## Step 5 — Generate Workspace Definition\r\n\r\nCollect from the command output (or ask the user):\r\n- Workspace ID (appears in `fab ls` output)\r\n- Tenant name or tenant ID\r\n- Confirmed principals and roles\r\n- Domain name (if assigned)\r\n\r\nRun `scripts/generate_definition.py`:\r\n\r\n```bash\r\npython scripts/generate_definition.py \\\r\n --workspace-name \"WORKSPACE_NAME\" \\\r\n --workspace-id \"WORKSPACE_ID\" \\\r\n --capacity-name \"CAPACITY_NAME\" \\\r\n --tenant \"TENANT_NAME\" \\\r\n --roles \"user@corp.com:User:Admin,Finance Team:Group:Member\" \\\r\n [--domain-name \"DOMAIN_NAME\"] \\\r\n --approach \"notebook|powershell|terminal\" \\\r\n --output workspace_definition.md\r\n```\r\n\r\nPresent `workspace_definition.md` to the user.\r\n\r\n## Gotchas\r\n\r\n- Workspace path format is `WorkspaceName.Workspace` — the `.Workspace` suffix is required.\r\n- The capacity must be **Active** before `fab mkdir`. If you see `CapacityNotInActiveState`,\r\n ask the user to resume the capacity in the Azure portal before retrying.\r\n- `notebookutils.credentials.getToken()` in Fabric notebooks **does not support Microsoft Graph**.\r\n The notebook approach requires a Service Principal with `Group.Read.All` + `User.Read.All`\r\n application permissions and admin consent. The SP credentials are entered in Cell 1 of\r\n the generated notebook. 
If the user doesn't have an SP, direct them to the PowerShell\r\n or Interactive Terminal approach instead.\r\n- Domain creation requires Fabric Administrator tenant-level rights. If the user cannot\r\n create a domain, fall back to assigning an existing one or skipping.\r\n- `fab exists` uses exit code (0 = exists, non-zero = not found) — do not rely on stdout text alone.\r\n- In the notebook approach, `notebookutils` is only available inside a Fabric notebook.\r\n The generated script must not be run as a plain Python script outside Fabric.\r\n- The `.domain` suffix (lowercase) is used in `fab mkdir`; `.Domain` (capitalised) is\r\n used in `fab assign` and `fab acl set` — these are different and both matter.\r\n- Role values passed to `fab acl set` must be **lowercase** (`admin`, `member`, `contributor`, `viewer`).\r\n The scripts handle this conversion automatically.\r\n- For PowerShell/terminal approaches, `az login` must be completed before `az ad user/group show` will work.\r\n This is separate from `fab auth login` — both are required.\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_notebook.py`** — Generates PySpark notebook. Run: `python scripts/generate_notebook.py --help`\r\n- **`scripts/generate_ps1.py`** — Generates PowerShell script. Run: `python scripts/generate_ps1.py --help`\r\n- **`scripts/generate_definition.py`** — Generates workspace definition markdown. Run: `python scripts/generate_definition.py --help`\r\n\r\n## Available References\r\n\r\n- **`references/role-assignment.md`** — Approach-specific guidance for assigning roles to users and Entra groups. Load when user wants to assign additional workspace roles.\r\n- **`references/fabric-cli-reference.md`** — Fabric CLI command reference.\r\n",
},
{
relativePath: "references/fabric-cli-reference.md",
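Note on the `SKILL.md` hunk above: the skill text documents a `--roles` value of the form `PRINCIPAL:TYPE:ROLE` (for example `user@corp.com:User:Admin,Finance Team:Group:Member`) and states that role values passed to `fab acl set` must be lowercase. The following is a minimal Python sketch of parsing that format under those stated rules. The names `RoleEntry` and `parse_role_spec` are hypothetical and are not taken from the package's generator scripts; the actual scripts may handle this differently.

```python
from __future__ import annotations

from dataclasses import dataclass

# Allowed values as listed in the SKILL.md text (Step 3).
VALID_TYPES = {"User", "Group", "ServicePrincipal"}
VALID_ROLES = {"Admin", "Member", "Contributor", "Viewer"}


@dataclass(frozen=True)
class RoleEntry:
    """One principal/role pair (hypothetical helper type)."""

    principal: str
    principal_type: str
    role: str

    @property
    def cli_role(self) -> str:
        # SKILL.md notes that `fab acl set` expects lowercase role values.
        return self.role.lower()


def parse_role_spec(spec: str) -> list[RoleEntry]:
    """Parse 'PRINCIPAL:TYPE:ROLE,PRINCIPAL:TYPE:ROLE' into RoleEntry items."""
    entries: list[RoleEntry] = []
    for raw in spec.split(","):
        parts = [part.strip() for part in raw.split(":")]
        if len(parts) != 3:
            raise ValueError(f"'{raw.strip()}' must be PRINCIPAL:TYPE:ROLE")
        principal, principal_type, role = parts
        if not principal:
            raise ValueError(f"'{raw.strip()}' has an empty principal")
        if principal_type not in VALID_TYPES:
            raise ValueError(f"unknown principal type '{principal_type}'")
        if role not in VALID_ROLES:
            raise ValueError(f"unknown role '{role}'")
        entries.append(RoleEntry(principal, principal_type, role))
    return entries


if __name__ == "__main__":
    spec = "user@corp.com:User:Admin,Finance Team:Group:Member"
    for entry in parse_role_spec(spec):
        print(entry.principal, entry.principal_type, entry.cli_role)
```

Raising `ValueError` instead of calling `sys.exit` keeps the sketch easy to test. A two-part value such as `alice@corp.com:Admin` is rejected here, since the skill text describes three fields.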
@@ -183,11 +183,11 @@ export const EMBEDDED_SKILLS = [
},
{
relativePath: "scripts/generate_notebook.py",
-
content: "#!/usr/bin/env python3\r\n# /// script\r\n# requires-python = \">=3.8\"\r\n# dependencies = []\r\n# ///\r\n\"\"\"\r\nGenerate a Jupyter (.ipynb) notebook for creating a Microsoft Fabric workspace.\r\nEach step is a separate cell so it can be run one at a time inside Fabric.\r\n\r\nPasses names/emails directly to `fab acl set -I`. This works for users (UPN/email)\r\nand may work for groups depending on your tenant configuration. If a group assignment\r\nfails, the cell reports the error and continues — re-run that assignment with the\r\ngroup's Entra Object ID instead.\r\n\r\nUsage:\r\n python scripts/generate_notebook.py --workspace-name NAME --capacity-name CAP --roles ROLES [OPTIONS]\r\n\r\nOptions:\r\n --workspace-name TEXT Workspace display name (required)\r\n --capacity-name TEXT Fabric capacity name (required)\r\n --roles TEXT Comma-separated EMAIL:ROLE list (required)\r\n e.g. \"alice@corp.com:Admin,bob@corp.com:Member\"\r\n Roles: Admin, Member, Contributor, Viewer\r\n Note: user UPNs (email addresses) only.\r\n For group assignments, use the PowerShell approach instead.\r\n --domain-name TEXT Domain name to create or assign (optional)\r\n --create-domain Create the domain before assigning (optional)\r\n --domain-contributor-group TEXT Display name or email of Entra group that manages the domain\r\n (required if --create-domain)\r\n --output TEXT Output file path (default: workspace_setup.ipynb)\r\n --help Show this message and exit\r\n\r\nExamples:\r\n python scripts/generate_notebook.py \\\\\r\n --workspace-name \"Finance-Reporting\" \\\\\r\n --capacity-name \"fabriccapacity01\" \\\\\r\n --roles \"alice@corp.com:User:Admin,Finance Team:Group:Member\" \\\\\r\n --output workspace_setup.ipynb\r\n\"\"\"\r\n\r\nimport argparse\r\nimport json\r\nimport sys\r\nfrom datetime import datetime, timezone\r\n\r\n\r\ndef parse_roles(roles_str: str) -> list[dict]:\r\n entries = []\r\n for entry in roles_str.split(\",\"):\r\n parts = 
entry.strip().split(\":\")\r\n if len(parts) != 2:\r\n print(f\"Error: '{entry}' must be EMAIL:ROLE\", file=sys.stderr)\r\n print(\" e.g. alice@corp.com:Admin\", file=sys.stderr)\r\n sys.exit(1)\r\n email, role = parts[0].strip(), parts[1].strip()\r\n if role not in {\"Admin\", \"Member\", \"Contributor\", \"Viewer\"}:\r\n print(f\"Error: role '{role}' must be Admin, Member, Contributor, or Viewer\", file=sys.stderr)\r\n sys.exit(1)\r\n entries.append({\"email\": email, \"role\": role})\r\n return entries\r\n\r\n\r\ndef md_cell(lines: list[str]) -> dict:\r\n return {\"cell_type\": \"markdown\", \"metadata\": {}, \"source\": [l + \"\\n\" for l in lines]}\r\n\r\n\r\ndef code_cell(lines: list[str]) -> dict:\r\n return {\"cell_type\": \"code\", \"execution_count\": None, \"metadata\": {}, \"outputs\": [], \"source\": [l + \"\\n\" for l in lines]}\r\n\r\n\r\ndef build_notebook(ws_name: str, cap_name: str, roles: list[dict],\r\n domain_name: str | None, create_domain: bool,\r\n domain_contributor_group: str | None) -> dict:\r\n ws_path = f\"{ws_name}.Workspace\"\r\n now = datetime.now(timezone.utc).strftime(\"%Y-%m-%d %H:%M UTC\")\r\n\r\n cells = []\r\n\r\n # ── Title ──────────────────────────────────────────────────────────────────\r\n cells.append(md_cell([\r\n f\"# Fabric Workspace Setup: {ws_name}\",\r\n f\"_Generated: {now}_\",\r\n \"\",\r\n \"**Run each cell one at a time. Read the output before running the next cell.**\",\r\n \"\",\r\n f\"**Prerequisite:** Capacity `{cap_name}` must be in **Active** state.\",\r\n ]))\r\n\r\n # ── Cell 1: Install ────────────────────────────────────────────────────────\r\n cells.append(md_cell([\r\n \"## Cell 1 — Install Fabric CLI\",\r\n \"\",\r\n \"⚠️ **The kernel restarts after `%pip install`.** Run this cell first,\",\r\n \"then continue from Cell 2. 
**Skip if `ms-fabric-cli` is already installed.**\",\r\n ]))\r\n cells.append(code_cell([\r\n \"%pip install ms-fabric-cli -q --no-warn-conflicts\",\r\n ]))\r\n\r\n # ── Cell 2: Authenticate + Parameters ─────────────────────────────────────\r\n cells.append(md_cell([\r\n \"## Cell 2 — Authenticate & Set Parameters\",\r\n \"\",\r\n \"Sets Fabric CLI tokens using the notebook user's identity and defines workspace parameters.\",\r\n \"**Start here if you skipped Cell 1.**\",\r\n ]))\r\n cells.append(code_cell([\r\n \"import os, sysconfig, json\",\r\n \"\",\r\n \"# ── Parameters ──────────────────────────────────────────────\",\r\n f'ws_name = \"{ws_name}\"',\r\n f'ws_path = \"{ws_path}\"',\r\n f'cap_name = \"{cap_name}\"',\r\n \"\",\r\n \"# ── Auth ────────────────────────────────────────────────────\",\r\n \"scripts_dir = sysconfig.get_path('scripts')\",\r\n \"os.environ['PATH'] = scripts_dir + os.pathsep + os.environ.get('PATH', '')\",\r\n \"\",\r\n \"token = notebookutils.credentials.getToken('pbi')\",\r\n \"storage_token = notebookutils.credentials.getToken('storage')\",\r\n \"os.environ['FAB_TOKEN'] = token\",\r\n \"os.environ['FAB_TOKEN_ONELAKE'] = storage_token # OneLake needs storage scope\",\r\n \"os.environ['FAB_TOKEN_AZURE'] = token\",\r\n \"print(f'Authenticated. 
Workspace: {ws_name} Capacity: {cap_name}')\",\r\n ]))\r\n\r\n # ── Cell 3: Create workspace ───────────────────────────────────────────────\r\n cells.append(md_cell([\r\n \"## Cell 3 — Create Workspace\",\r\n \"\",\r\n f\"Creates `{ws_name}` on capacity `{cap_name}`.\",\r\n \"If the workspace already exists, fab will report it — that is fine, continue to Cell 4.\",\r\n f\"> ⚠️ If you see `CapacityNotInActiveState`, resume `{cap_name}` in the Azure portal first.\",\r\n ]))\r\n cells.append(code_cell([\r\n f'print(f\"=== Creating workspace: {{ws_name}} ===\")',\r\n f'!fab mkdir \"{{ws_path}}\" -P capacityName={{cap_name}}',\r\n ]))\r\n\r\n # ── Cell 4: Verify workspace + capture WS_ID ──────────────────────────────\r\n cells.append(md_cell([\r\n \"## Cell 4 — Verify Workspace\",\r\n \"\",\r\n \"✅ Confirm the workspace was created. `WS_ID` is captured for use in later cells.\",\r\n ]))\r\n cells.append(code_cell([\r\n f'print(f\"=== Workspace details: {{ws_name}} ===\")',\r\n f'!fab get \"{{ws_path}}\"',\r\n \"\",\r\n \"ws_id_out = !fab get \\\"{ws_path}\\\" -q \\\"id\\\"\",\r\n \"WS_ID = ws_id_out[0].strip('\\\"')\",\r\n 'print(f\"\\\\nWorkspace ID: {WS_ID}\")',\r\n ]))\r\n\r\n # ── Cell 5: Assign roles ───────────────────────────────────────────────────\r\n cells.append(md_cell([\r\n \"## Cell 5 — Assign Workspace Roles\",\r\n \"\",\r\n \"Writes each role payload to a temp file then calls `fab api` to POST it.\",\r\n \"Valid roles: `Admin`, `Member`, `Contributor`, `Viewer`.\",\r\n ]))\r\n role_lines = [\r\n 'print(\"=== Assigning workspace roles ===\")',\r\n \"\",\r\n \"roles = [\",\r\n ]\r\n for r in roles:\r\n role_lines.append(f' (\"{r[\"email\"]}\", \"{r[\"role\"]}\"),')\r\n role_lines += [\r\n \"]\",\r\n \"\",\r\n \"for email, role in roles:\",\r\n \" payload = json.dumps({'emailAddress': email, 'groupUserAccessRight': role})\",\r\n \" tmp = f\\\"/tmp/role_{email.replace('@','_').replace('.','_')}.json\\\"\",\r\n \" with open(tmp, 'w') as f:\",\r\n \" 
f.write(payload)\",\r\n \" print(f\\\" {role} -> {email}\\\")\",\r\n \" !fab api -A powerbi \\\"groups/{WS_ID}/users\\\" -X post -i {tmp}\",\r\n \"\",\r\n 'print(\"Done.\")',\r\n ]\r\n cells.append(code_cell(role_lines))\r\n\r\n # ── Cell 6: Verify roles ───────────────────────────────────────────────────\r\n cells.append(md_cell([\r\n \"## Cell 6 — Verify Roles\",\r\n \"\",\r\n \"✅ Confirm all expected users and roles appear.\",\r\n ]))\r\n cells.append(code_cell([\r\n 'print(f\"=== Users in workspace: {ws_name} ===\")',\r\n '!fab api -A powerbi \"groups/{WS_ID}/users\"',\r\n ]))\r\n\r\n # ── Domain cells (optional) ────────────────────────────────────────────────\r\n cell_num = 7\r\n if domain_name:\r\n domain_path = f\".domains/{domain_name}.Domain\"\r\n\r\n if create_domain:\r\n cells.append(md_cell([\r\n f\"## Cell {cell_num} — Create Domain\",\r\n \"\",\r\n f\"Creates domain `{domain_name}`.\",\r\n \"⚠️ Requires Fabric Admin role.\",\r\n ]))\r\n cells.append(code_cell([\r\n f'domain_path = \"{domain_path}\"',\r\n f'print(f\"=== Creating domain: {domain_name} ===\")',\r\n f'!fab create \"{{domain_path}}\"',\r\n \"\",\r\n \"domain_id_out = !fab get \\\"{domain_path}\\\" -q \\\"id\\\"\",\r\n \"DOMAIN_ID = domain_id_out[0].strip('\\\"')\",\r\n 'print(f\"Domain ID: {DOMAIN_ID}\")',\r\n ]))\r\n cell_num += 1\r\n\r\n cells.append(md_cell([\r\n f\"## Cell {cell_num} — Assign Workspace to Domain\",\r\n \"\",\r\n f\"Links `{ws_name}` to domain `{domain_name}` using the Fabric admin API.\",\r\n ]))\r\n cells.append(code_cell([\r\n f'print(f\"=== Assigning {{ws_name}} to domain: {domain_name} ===\")',\r\n \"payload = json.dumps({'workspacesIds': [WS_ID]})\",\r\n \"with open('/tmp/domain_assign.json', 'w') as f:\",\r\n \" f.write(payload)\",\r\n '!fab api -X post \"admin/domains/{DOMAIN_ID}/assignWorkspaces\" -i /tmp/domain_assign.json',\r\n 'print(\"✅ Done.\")',\r\n ]))\r\n cell_num += 1\r\n\r\n # ── Final cell 
─────────────────────────────────────────────────────────────\r\n cells.append(md_cell([\r\n f\"## Cell {cell_num} — Setup Complete\",\r\n \"\",\r\n \"✅ Workspace setup complete.\",\r\n ]))\r\n\r\n return {\r\n \"nbformat\": 4,\r\n \"nbformat_minor\": 5,\r\n \"metadata\": {\r\n \"kernelspec\": {\"display_name\": \"Python 3\", \"language\": \"python\", \"name\": \"python3\"},\r\n \"language_info\": {\"name\": \"python\"},\r\n },\r\n \"cells\": cells,\r\n }\r\n\r\n\r\ndef main():\r\n parser = argparse.ArgumentParser(\r\n description=\"Generate a Jupyter notebook (.ipynb) for Fabric workspace creation.\",\r\n formatter_class=argparse.RawDescriptionHelpFormatter,\r\n epilog=__doc__,\r\n )\r\n parser.add_argument(\"--workspace-name\", required=True)\r\n parser.add_argument(\"--capacity-name\", required=True)\r\n parser.add_argument(\"--roles\", required=True)\r\n parser.add_argument(\"--domain-name\", default=None)\r\n parser.add_argument(\"--create-domain\", action=\"store_true\")\r\n parser.add_argument(\"--domain-contributor-group\", default=None)\r\n parser.add_argument(\"--output\", default=\"workspace_setup.ipynb\")\r\n args = parser.parse_args()\r\n\r\n if args.create_domain and not args.domain_contributor_group:\r\n print(\"Error: --domain-contributor-group is required when --create-domain is set.\", file=sys.stderr)\r\n sys.exit(1)\r\n\r\n roles = parse_roles(args.roles)\r\n notebook = build_notebook(\r\n args.workspace_name, args.capacity_name, roles,\r\n args.domain_name, args.create_domain, args.domain_contributor_group,\r\n )\r\n\r\n output = args.output if args.output.endswith(\".ipynb\") else args.output.replace(\".py\", \".ipynb\")\r\n with open(output, \"w\", encoding=\"utf-8\") as f:\r\n json.dump(notebook, f, indent=2)\r\n\r\n print(f'{{\"status\": \"ok\", \"output\": \"{output}\", \"cells\": {len(notebook[\"cells\"])}, \"roles\": {len(roles)}}}')\r\n print(f\"✅ Notebook written to: {output}\", file=sys.stderr)\r\n\r\n\r\nif __name__ == 
\"__main__\":\r\n main()\r\n",
|
|
186
|
+
content: "#!/usr/bin/env python3\r\n# /// script\r\n# requires-python = \">=3.8\"\r\n# dependencies = []\r\n# ///\r\n\"\"\"\r\nGenerate a Jupyter (.ipynb) notebook for creating a Microsoft Fabric workspace.\r\nEach step is a separate cell so it can be run one at a time inside Fabric.\r\n\r\nPasses names/emails directly to `fab acl set -I`. This works for users (UPN/email)\r\nand may work for groups depending on your tenant configuration. If a group assignment\r\nfails, the cell reports the error and continues — re-run that assignment with the\r\ngroup's Entra Object ID instead.\r\n\r\nUsage:\r\n python scripts/generate_notebook.py --workspace-name NAME --capacity-name CAP --roles ROLES [OPTIONS]\r\n\r\nOptions:\r\n --workspace-name TEXT Workspace display name (required)\r\n --capacity-name TEXT Fabric capacity name (required)\r\n --roles TEXT Comma-separated EMAIL:ROLE list (required)\r\n e.g. \"alice@corp.com:Admin,bob@corp.com:Member\"\r\n Roles: Admin, Member, Contributor, Viewer\r\n Note: user UPNs (email addresses) only.\r\n For group assignments, use the PowerShell approach instead.\r\n --domain-name TEXT Domain name to create or assign (optional)\r\n --create-domain Create the domain before assigning (optional)\r\n --domain-contributor-group TEXT Display name or email of Entra group that manages the domain\r\n (required if --create-domain)\r\n --output TEXT Output file path (default: workspace_setup.ipynb)\r\n --help Show this message and exit\r\n\r\nExamples:\r\n python scripts/generate_notebook.py \\\\\r\n --workspace-name \"Finance-Reporting\" \\\\\r\n --capacity-name \"fabriccapacity01\" \\\\\r\n --roles \"alice@corp.com:User:Admin,Finance Team:Group:Member\" \\\\\r\n --output workspace_setup.ipynb\r\n\"\"\"\r\n\r\nimport argparse\r\nimport json\r\nimport sys\r\nfrom datetime import datetime, timezone\r\n\r\n\r\ndef parse_roles(roles_str: str) -> list[dict]:\r\n entries = []\r\n for entry in roles_str.split(\",\"):\r\n parts = 
entry.strip().split(\":\")\r\n if len(parts) != 2:\r\n print(f\"Error: '{entry}' must be EMAIL:ROLE\", file=sys.stderr)\r\n print(\" e.g. alice@corp.com:Admin\", file=sys.stderr)\r\n sys.exit(1)\r\n email, role = parts[0].strip(), parts[1].strip()\r\n if role not in {\"Admin\", \"Member\", \"Contributor\", \"Viewer\"}:\r\n print(f\"Error: role '{role}' must be Admin, Member, Contributor, or Viewer\", file=sys.stderr)\r\n sys.exit(1)\r\n entries.append({\"email\": email, \"role\": role})\r\n return entries\r\n\r\n\r\ndef md_cell(lines: list[str]) -> dict:\r\n source = [l + \"\\n\" for l in lines[:-1]] + [lines[-1]]\r\n return {\"cell_type\": \"markdown\", \"metadata\": {}, \"source\": source}\r\n\r\n\r\ndef code_cell(lines: list[str]) -> dict:\r\n source = [l + \"\\n\" for l in lines[:-1]] + [lines[-1]]\r\n return {\"cell_type\": \"code\", \"execution_count\": None, \"metadata\": {}, \"outputs\": [], \"source\": source}\r\n\r\n\r\ndef build_notebook(ws_name: str, cap_name: str, roles: list[dict],\r\n domain_name: str | None, create_domain: bool,\r\n domain_contributor_group: str | None) -> dict:\r\n ws_path = f\"{ws_name}.Workspace\"\r\n now = datetime.now(timezone.utc).strftime(\"%Y-%m-%d %H:%M UTC\")\r\n\r\n cells = []\r\n\r\n # ── Title ──────────────────────────────────────────────────────────────────\r\n cells.append(md_cell([\r\n f\"# Fabric Workspace Setup: {ws_name}\",\r\n f\"_Generated: {now}_\",\r\n \"\",\r\n \"**Run each cell one at a time. Read the output before running the next cell.**\",\r\n \"\",\r\n f\"**Prerequisite:** Capacity `{cap_name}` must be in **Active** state.\",\r\n ]))\r\n\r\n # ── Cell 1: Install ────────────────────────────────────────────────────────\r\n cells.append(md_cell([\r\n \"## Cell 1 — Install Fabric CLI\",\r\n \"\",\r\n \"Installs `ms-fabric-cli` into the notebook's Python environment. 
The `-q` flag\",\r\n \"suppresses output and `--no-warn-conflicts` prevents dependency conflict warnings.\",\r\n \"\",\r\n \"**How to use**: Run once per session. If `ms-fabric-cli` is already installed,\",\r\n \"skip to Cell 2. No kernel restart needed — continue straight to Cell 2.\",\r\n ]))\r\n cells.append(code_cell([\r\n \"%pip install ms-fabric-cli -q --no-warn-conflicts\",\r\n ]))\r\n\r\n # ── Cell 2: Authenticate + Parameters ─────────────────────────────────────\r\n cells.append(md_cell([\r\n \"## Cell 2 — Authenticate & Set Parameters\",\r\n \"\",\r\n \"Sets workspace parameters and authenticates the Fabric CLI using the notebook\",\r\n \"user's identity. Tokens are fetched via `notebookutils` and injected into\",\r\n \"environment variables that `fab` reads automatically.\",\r\n \"\",\r\n \"**How to use**: No changes needed — values were pre-populated when this notebook\",\r\n \"was generated. If you see an auth error, ensure the notebook is running inside\",\r\n f\"a Fabric workspace and that the capacity `{cap_name}` is Active.\",\r\n ]))\r\n cells.append(code_cell([\r\n \"import os, sysconfig, json\",\r\n \"\",\r\n \"# ── Parameters ──────────────────────────────────────────────\",\r\n f'ws_name = \"{ws_name}\"',\r\n f'ws_path = \"{ws_path}\"',\r\n f'cap_name = \"{cap_name}\"',\r\n \"\",\r\n \"# ── Auth ────────────────────────────────────────────────────\",\r\n \"scripts_dir = sysconfig.get_path('scripts')\",\r\n \"os.environ['PATH'] = scripts_dir + os.pathsep + os.environ.get('PATH', '')\",\r\n \"\",\r\n \"token = notebookutils.credentials.getToken('pbi')\",\r\n \"storage_token = notebookutils.credentials.getToken('storage')\",\r\n \"os.environ['FAB_TOKEN'] = token\",\r\n \"os.environ['FAB_TOKEN_ONELAKE'] = storage_token # OneLake needs storage scope\",\r\n \"os.environ['FAB_TOKEN_AZURE'] = token\",\r\n \"print(f'Authenticated. 
Workspace: {ws_name} Capacity: {cap_name}')\",\r\n ]))\r\n\r\n # ── Cell 3: Create workspace ───────────────────────────────────────────────\r\n cells.append(md_cell([\r\n \"## Cell 3 — Create Workspace\",\r\n \"\",\r\n f\"Creates workspace `{ws_name}` on capacity `{cap_name}` using `fab mkdir`.\",\r\n \"If the workspace already exists, `fab` will report it — that is safe, continue to Cell 4.\",\r\n \"\",\r\n f\"**How to use**: If you see `CapacityNotInActiveState`, resume `{cap_name}` in\",\r\n \"the Azure portal (Fabric Capacities → Resume) then re-run this cell.\",\r\n ]))\r\n cells.append(code_cell([\r\n f'print(f\"=== Creating workspace: {{ws_name}} ===\")',\r\n f'!fab mkdir \"{{ws_path}}\" -P capacityName={{cap_name}} -f',\r\n ]))\r\n\r\n # ── Cell 4: Verify workspace + capture WS_ID ──────────────────────────────\r\n cells.append(md_cell([\r\n \"## Cell 4 — Verify Workspace\",\r\n \"\",\r\n \"Confirms the workspace was created and captures its ID into `WS_ID`.\",\r\n \"`WS_ID` is used in later cells for role assignment and domain operations.\",\r\n \"\",\r\n \"**How to use**: Confirm the output shows the workspace name and a valid GUID.\",\r\n \"If `WS_ID` is empty or shows an error, re-run Cell 3 and check for errors.\",\r\n ]))\r\n cells.append(code_cell([\r\n f'print(f\"=== Workspace details: {{ws_name}} ===\")',\r\n f'!fab get \"{{ws_path}}\"',\r\n \"\",\r\n \"ws_id_out = !fab get \\\"{ws_path}\\\" -q \\\"id\\\"\",\r\n \"WS_ID = ws_id_out[0].strip('\\\"')\",\r\n 'print(f\"\\\\nWorkspace ID: {WS_ID}\")',\r\n ]))\r\n\r\n # ── Cell 5: Assign roles ───────────────────────────────────────────────────\r\n cells.append(md_cell([\r\n \"## Cell 5 — Assign Workspace Roles\",\r\n \"\",\r\n \"Assigns each user/group to the workspace by POSTing to the Power BI Groups API.\",\r\n \"Each role payload is written to a temp file and passed to `fab api`.\",\r\n \"The workspace creator is already Admin — this cell assigns any additional principals.\",\r\n \"\",\r\n \"**How 
to use**: Review the role list below before running. If a role assignment\",\r\n \"fails with a 404, the user email may be wrong. If it fails for a group, you\",\r\n \"may need to use the group's Entra Object ID instead of the display name.\",\r\n \"Valid roles: `Admin`, `Member`, `Contributor`, `Viewer`.\",\r\n ]))\r\n role_lines = [\r\n 'print(\"=== Assigning workspace roles ===\")',\r\n \"\",\r\n \"roles = [\",\r\n ]\r\n for r in roles:\r\n role_lines.append(f' (\"{r[\"email\"]}\", \"{r[\"role\"]}\"),')\r\n role_lines += [\r\n \"]\",\r\n \"\",\r\n \"for email, role in roles:\",\r\n \" payload = json.dumps({'emailAddress': email, 'groupUserAccessRight': role})\",\r\n \" tmp = f\\\"/tmp/role_{email.replace('@','_').replace('.','_')}.json\\\"\",\r\n \" with open(tmp, 'w') as f:\",\r\n \" f.write(payload)\",\r\n \" print(f\\\" {role} -> {email}\\\")\",\r\n \" !fab api -A powerbi \\\"groups/{WS_ID}/users\\\" -X post -i {tmp}\",\r\n \"\",\r\n 'print(\"Done.\")',\r\n ]\r\n cells.append(code_cell(role_lines))\r\n\r\n # ── Cell 6: Verify roles ───────────────────────────────────────────────────\r\n cells.append(md_cell([\r\n \"## Cell 6 — Verify Roles\",\r\n \"\",\r\n \"Lists all users and groups currently assigned to the workspace.\",\r\n \"\",\r\n \"**How to use**: Confirm every expected principal appears with the correct role.\",\r\n \"If anyone is missing, re-run Cell 5 after correcting their email or Object ID.\",\r\n ]))\r\n cells.append(code_cell([\r\n 'print(f\"=== Users in workspace: {ws_name} ===\")',\r\n '!fab api -A powerbi \"groups/{WS_ID}/users\"',\r\n ]))\r\n\r\n # ── Domain cells (optional) ────────────────────────────────────────────────\r\n cell_num = 7\r\n if domain_name:\r\n domain_path = f\".domains/{domain_name}.Domain\"\r\n\r\n if create_domain:\r\n cells.append(md_cell([\r\n f\"## Cell {cell_num} — Create Domain\",\r\n \"\",\r\n f\"Creates domain `{domain_name}`. 
Requires **Fabric Admin** role.\",\r\n \"\",\r\n \"**How to use**: If this cell errors with a permissions failure, you do not\",\r\n \"have Fabric Admin rights — ask your admin to create the domain instead.\",\r\n \"After creation, set domain contributors manually via the Fabric Admin portal:\",\r\n f\"admin.powerbi.com → Domains → {domain_name} → Manage contributors.\",\r\n \"(The `fab acl set` command is not supported on `.domains/` paths.)\",\r\n ]))\r\n cells.append(code_cell([\r\n f'domain_path = \"{domain_path}\"',\r\n f'print(f\"=== Creating domain: {domain_name} ===\")',\r\n f'!fab create \"{{domain_path}}\" -f',\r\n \"\",\r\n \"domain_id_out = !fab get \\\"{domain_path}\\\" -q \\\"id\\\"\",\r\n \"DOMAIN_ID = domain_id_out[0].strip('\\\"')\",\r\n 'print(f\"Domain ID: {DOMAIN_ID}\")',\r\n ]))\r\n cell_num += 1\r\n\r\n cells.append(md_cell([\r\n f\"## Cell {cell_num} — Assign Workspace to Domain\",\r\n \"\",\r\n f\"Assigns workspace `{ws_name}` to domain `{domain_name}` via the Fabric admin API.\",\r\n \"Requires Domain Contributor or Fabric Admin rights.\",\r\n \"\",\r\n \"**How to use**: If this cell errors with a 403, you do not have domain assignment\",\r\n \"rights — ask a Domain Contributor or Fabric Admin to run this step.\",\r\n ]))\r\n cells.append(code_cell([\r\n f'print(f\"=== Assigning {{ws_name}} to domain: {domain_name} ===\")',\r\n \"payload = json.dumps({'workspacesIds': [WS_ID]})\",\r\n \"with open('/tmp/domain_assign.json', 'w') as f:\",\r\n \" f.write(payload)\",\r\n '!fab api -X post \"admin/domains/{DOMAIN_ID}/assignWorkspaces\" -i /tmp/domain_assign.json',\r\n 'print(\"✅ Done.\")',\r\n ]))\r\n cell_num += 1\r\n\r\n # ── Final cell ─────────────────────────────────────────────────────────────\r\n cells.append(md_cell([\r\n f\"## Cell {cell_num} — Setup Complete\",\r\n \"\",\r\n \"✅ Workspace setup complete.\",\r\n ]))\r\n\r\n return {\r\n \"nbformat\": 4,\r\n \"nbformat_minor\": 5,\r\n \"metadata\": {\r\n \"kernelspec\": 
{\"display_name\": \"Python 3\", \"language\": \"python\", \"name\": \"python3\"},\r\n \"language_info\": {\"name\": \"python\"},\r\n },\r\n \"cells\": cells,\r\n }\r\n\r\n\r\ndef main():\r\n parser = argparse.ArgumentParser(\r\n description=\"Generate a Jupyter notebook (.ipynb) for Fabric workspace creation.\",\r\n formatter_class=argparse.RawDescriptionHelpFormatter,\r\n epilog=__doc__,\r\n )\r\n parser.add_argument(\"--workspace-name\", required=True)\r\n parser.add_argument(\"--capacity-name\", required=True)\r\n parser.add_argument(\"--roles\", required=True)\r\n parser.add_argument(\"--domain-name\", default=None)\r\n parser.add_argument(\"--create-domain\", action=\"store_true\")\r\n parser.add_argument(\"--domain-contributor-group\", default=None)\r\n parser.add_argument(\"--output\", default=\"workspace_setup.ipynb\")\r\n args = parser.parse_args()\r\n\r\n if args.create_domain and not args.domain_contributor_group:\r\n print(\"Error: --domain-contributor-group is required when --create-domain is set.\", file=sys.stderr)\r\n sys.exit(1)\r\n\r\n roles = parse_roles(args.roles)\r\n notebook = build_notebook(\r\n args.workspace_name, args.capacity_name, roles,\r\n args.domain_name, args.create_domain, args.domain_contributor_group,\r\n )\r\n\r\n output = args.output if args.output.endswith(\".ipynb\") else args.output.replace(\".py\", \".ipynb\")\r\n with open(output, \"w\", encoding=\"utf-8\") as f:\r\n json.dump(notebook, f, indent=2)\r\n\r\n print(f'{{\"status\": \"ok\", \"output\": \"{output}\", \"cells\": {len(notebook[\"cells\"])}, \"roles\": {len(roles)}}}')\r\n print(f\"✅ Notebook written to: {output}\", file=sys.stderr)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n main()\r\n",
|
|
187
187
|
},
|
|
188
188
|
{
|
|
189
189
|
relativePath: "scripts/generate_ps1.py",
|
|
190
|
-
content: "#!/usr/bin/env python3\r\n# /// script\r\n# requires-python = \">=3.8\"\r\n# dependencies = []\r\n# ///\r\n\"\"\"\r\nGenerate a PowerShell script for creating a Microsoft Fabric workspace.\r\nResolves user emails and group display names to Entra Object IDs via Azure CLI (az).\r\n\r\nUsage:\r\n python scripts/generate_ps1.py --workspace-name NAME --capacity-name CAP --roles ROLES [OPTIONS]\r\n\r\nOptions:\r\n --workspace-name TEXT Workspace display name (required)\r\n --capacity-name TEXT Fabric capacity name (required)\r\n --roles TEXT Comma-separated NAME_OR_EMAIL:TYPE:ROLE list (required)\r\n e.g. \"alice@corp.com:User:Admin,Finance Team:Group:Member\"\r\n Types: User, Group, ServicePrincipal\r\n Roles: Admin, Member, Contributor, Viewer\r\n --domain-name TEXT Domain name to create or assign (optional)\r\n --create-domain Create the domain before assigning (optional)\r\n --domain-contributor-group TEXT Display name of Entra group that manages domain\r\n (required if --create-domain)\r\n --output TEXT Output file path (default: workspace_setup.ps1)\r\n --help Show this message and exit\r\n\r\nExamples:\r\n python scripts/generate_ps1.py \\\\\r\n --workspace-name \"Finance-Reporting\" \\\\\r\n --capacity-name \"fabriccapacity01\" \\\\\r\n --roles \"alice@corp.com:User:Admin,Finance Team:Group:Member\" \\\\\r\n --output workspace_setup.ps1\r\n\"\"\"\r\n\r\nimport argparse\r\nimport sys\r\nfrom datetime import datetime, timezone\r\n\r\n\r\ndef parse_roles(roles_str: str) -> list[dict]:\r\n entries = []\r\n for entry in roles_str.split(\",\"):\r\n parts = entry.strip().split(\":\")\r\n if len(parts) != 3:\r\n print(f\"Error: '{entry}' must be NAME_OR_EMAIL:TYPE:ROLE\", file=sys.stderr)\r\n print(\" e.g. 
alice@corp.com:User:Admin or Finance Team:Group:Member\", file=sys.stderr)\r\n sys.exit(1)\r\n name, ptype, role = parts[0].strip(), parts[1].strip(), parts[2].strip()\r\n if ptype not in {\"User\", \"Group\", \"ServicePrincipal\"}:\r\n print(f\"Error: type '{ptype}' must be User, Group, or ServicePrincipal\", file=sys.stderr)\r\n sys.exit(1)\r\n if role not in {\"Admin\", \"Member\", \"Contributor\", \"Viewer\"}:\r\n print(f\"Error: role '{role}' must be Admin, Member, Contributor, or Viewer\", file=sys.stderr)\r\n sys.exit(1)\r\n entries.append({\"name\": name, \"type\": ptype, \"role\": role})\r\n return entries\r\n\r\n\r\nPRINCIPAL_TYPE_MAP = {\"User\": \"User\", \"Group\": \"Group\", \"ServicePrincipal\": \"App\"}\r\n\r\n\r\ndef safe_var(name: str) -> str:\r\n \"\"\"Convert a display name/email to a safe PowerShell variable suffix.\"\"\"\r\n import re\r\n return re.sub(r\"[^A-Za-z0-9]\", \"\", name.title())[:20] or \"Principal\"\r\n\r\n\r\ndef generate_ps1(ws_name: str, cap_name: str, roles: list[dict],\r\n domain_name: str | None, create_domain: bool,\r\n domain_contributor_group: str | None) -> str:\r\n ws_path = f\"{ws_name}.Workspace\"\r\n now = datetime.now(timezone.utc).strftime(\"%Y-%m-%d %H:%M UTC\")\r\n\r\n # ── Resolve IDs (one-liners) ──────────────────────────────────────────────\r\n resolve_lines = []\r\n assign_lines = []\r\n for i, r in enumerate(roles):\r\n var = f\"Id_{safe_var(r['name'])}\"\r\n if r[\"type\"] == \"User\":\r\n id_expr = f'(az ad user show --id \"{r[\"name\"]}\" --query id -o tsv).Trim()'\r\n elif r[\"type\"] == \"Group\":\r\n id_expr = f'(az ad group show --group \"{r[\"name\"]}\" --query id -o tsv).Trim()'\r\n else:\r\n id_expr = f'(az ad sp show --id \"{r[\"name\"]}\" --query id -o tsv).Trim()'\r\n resolve_lines.append(f'${var} = {id_expr}')\r\n assign_lines.append(f'fab acl set \"$WorkspacePath\" -I ${var} -R {r[\"role\"].lower()} -f')\r\n\r\n # ── Domain ────────────────────────────────────────────────────────────────\r\n 
domain_lines = []\r\n if domain_name:\r\n domain_path = f\".domains/{domain_name}.Domain\"\r\n if create_domain:\r\n
|
|
190
|
+
content: "#!/usr/bin/env python3\r\n# /// script\r\n# requires-python = \">=3.8\"\r\n# dependencies = []\r\n# ///\r\n\"\"\"\r\nGenerate a PowerShell script for creating a Microsoft Fabric workspace.\r\nResolves user emails and group display names to Entra Object IDs via Azure CLI (az).\r\n\r\nUsage:\r\n python scripts/generate_ps1.py --workspace-name NAME --capacity-name CAP --roles ROLES [OPTIONS]\r\n\r\nOptions:\r\n --workspace-name TEXT Workspace display name (required)\r\n --capacity-name TEXT Fabric capacity name (required)\r\n --roles TEXT Comma-separated NAME_OR_EMAIL:TYPE:ROLE list (required)\r\n e.g. \"alice@corp.com:User:Admin,Finance Team:Group:Member\"\r\n Types: User, Group, ServicePrincipal\r\n Roles: Admin, Member, Contributor, Viewer\r\n --domain-name TEXT Domain name to create or assign (optional)\r\n --create-domain Create the domain before assigning (optional)\r\n --domain-contributor-group TEXT Display name of Entra group that manages domain\r\n (required if --create-domain)\r\n --output TEXT Output file path (default: workspace_setup.ps1)\r\n --help Show this message and exit\r\n\r\nExamples:\r\n python scripts/generate_ps1.py \\\\\r\n --workspace-name \"Finance-Reporting\" \\\\\r\n --capacity-name \"fabriccapacity01\" \\\\\r\n --roles \"alice@corp.com:User:Admin,Finance Team:Group:Member\" \\\\\r\n --output workspace_setup.ps1\r\n\"\"\"\r\n\r\nimport argparse\r\nimport sys\r\nfrom datetime import datetime, timezone\r\n\r\n\r\ndef parse_roles(roles_str: str) -> list[dict]:\r\n entries = []\r\n for entry in roles_str.split(\",\"):\r\n parts = entry.strip().split(\":\")\r\n if len(parts) != 3:\r\n print(f\"Error: '{entry}' must be NAME_OR_EMAIL:TYPE:ROLE\", file=sys.stderr)\r\n print(\" e.g. 
alice@corp.com:User:Admin or Finance Team:Group:Member\", file=sys.stderr)\r\n sys.exit(1)\r\n name, ptype, role = parts[0].strip(), parts[1].strip(), parts[2].strip()\r\n if ptype not in {\"User\", \"Group\", \"ServicePrincipal\"}:\r\n print(f\"Error: type '{ptype}' must be User, Group, or ServicePrincipal\", file=sys.stderr)\r\n sys.exit(1)\r\n if role not in {\"Admin\", \"Member\", \"Contributor\", \"Viewer\"}:\r\n print(f\"Error: role '{role}' must be Admin, Member, Contributor, or Viewer\", file=sys.stderr)\r\n sys.exit(1)\r\n entries.append({\"name\": name, \"type\": ptype, \"role\": role})\r\n return entries\r\n\r\n\r\nPRINCIPAL_TYPE_MAP = {\"User\": \"User\", \"Group\": \"Group\", \"ServicePrincipal\": \"App\"}\r\n\r\n\r\ndef safe_var(name: str) -> str:\r\n \"\"\"Convert a display name/email to a safe PowerShell variable suffix.\"\"\"\r\n import re\r\n return re.sub(r\"[^A-Za-z0-9]\", \"\", name.title())[:20] or \"Principal\"\r\n\r\n\r\ndef generate_ps1(ws_name: str, cap_name: str, roles: list[dict],\r\n domain_name: str | None, create_domain: bool,\r\n domain_contributor_group: str | None) -> str:\r\n ws_path = f\"{ws_name}.Workspace\"\r\n now = datetime.now(timezone.utc).strftime(\"%Y-%m-%d %H:%M UTC\")\r\n\r\n # ── Resolve IDs (one-liners) ──────────────────────────────────────────────\r\n resolve_lines = []\r\n assign_lines = []\r\n for i, r in enumerate(roles):\r\n var = f\"Id_{safe_var(r['name'])}\"\r\n if r[\"type\"] == \"User\":\r\n id_expr = f'(az ad user show --id \"{r[\"name\"]}\" --query id -o tsv).Trim()'\r\n elif r[\"type\"] == \"Group\":\r\n id_expr = f'(az ad group show --group \"{r[\"name\"]}\" --query id -o tsv).Trim()'\r\n else:\r\n id_expr = f'(az ad sp show --id \"{r[\"name\"]}\" --query id -o tsv).Trim()'\r\n resolve_lines.append(f'${var} = {id_expr}')\r\n assign_lines.append(f'fab acl set \"$WorkspacePath\" -I ${var} -R {r[\"role\"].lower()} -f')\r\n\r\n # ── Domain ────────────────────────────────────────────────────────────────\r\n 
domain_lines = []\r\n if domain_name:\r\n domain_path = f\".domains/{domain_name}.Domain\"\r\n if create_domain:\r\n domain_lines += [\r\n f'fab create \"{domain_path}\" -f',\r\n f'# NOTE: Set domain contributors manually via the Fabric Admin portal',\r\n f'# (admin.powerbi.com -> Domains -> {domain_name} -> Manage contributors)',\r\n f'# fab acl set is NOT supported on .domains/ paths.',\r\n ]\r\n domain_lines.append(f'fab assign \"{domain_path}\" -W \"$WorkspacePath\" -f')\r\n\r\n return f'''# ============================================================\r\n# Fabric Workspace Setup: {ws_name}\r\n# Generated: {now}\r\n# ============================================================\r\n\r\n# ── Prerequisites ─────────────────────────────────────────────────────────────\r\npip install ms-fabric-cli -q\r\naz login\r\nfab auth login\r\n\r\nSet-StrictMode -Version Latest\r\n$ErrorActionPreference = \"Stop\"\r\n\r\n# ── Parameters ────────────────────────────────────────────────────────────────\r\n$WorkspaceName = \"{ws_name}\"\r\n$WorkspacePath = \"{ws_path}\"\r\n$CapacityName = \"{cap_name}\"\r\n\r\n# ── Resolve Entra Object IDs ──────────────────────────────────────────────────\r\n{chr(10).join(resolve_lines)}\r\n\r\n# ── Create workspace ──────────────────────────────────────────────────────────\r\nfab mkdir \"$WorkspacePath\" -P capacityName=$CapacityName -f\r\n\r\n# ── Assign workspace roles ────────────────────────────────────────────────────\r\n{chr(10).join(assign_lines)}\r\n\r\n# ── Verify ────────────────────────────────────────────────────────────────────\r\nfab acl ls \"$WorkspacePath\"\r\n{(chr(10) + \"# ── Domain ──────────────────────────────────────────────────────────────────\" + chr(10) + chr(10).join(domain_lines)) if domain_lines else \"\"}\r\nWrite-Host \"Setup complete. 
Workspace: $WorkspaceName\" -ForegroundColor Green\r\n'''\r\n\r\n\r\ndef main():\r\n parser = argparse.ArgumentParser(\r\n description=\"Generate a PowerShell script for Fabric workspace creation.\",\r\n formatter_class=argparse.RawDescriptionHelpFormatter,\r\n epilog=__doc__,\r\n )\r\n parser.add_argument(\"--workspace-name\", required=True)\r\n parser.add_argument(\"--capacity-name\", required=True)\r\n parser.add_argument(\"--roles\", required=True)\r\n parser.add_argument(\"--domain-name\", default=None)\r\n parser.add_argument(\"--create-domain\", action=\"store_true\")\r\n parser.add_argument(\"--domain-contributor-group\", default=None)\r\n parser.add_argument(\"--output\", default=\"workspace_setup.ps1\")\r\n args = parser.parse_args()\r\n\r\n if args.create_domain and not args.domain_contributor_group:\r\n print(\"Error: --domain-contributor-group is required when --create-domain is set.\", file=sys.stderr)\r\n sys.exit(1)\r\n\r\n roles = parse_roles(args.roles)\r\n script = generate_ps1(\r\n args.workspace_name, args.capacity_name, roles,\r\n args.domain_name, args.create_domain, args.domain_contributor_group\r\n )\r\n\r\n with open(args.output, \"w\", encoding=\"utf-8\") as f:\r\n f.write(script)\r\n\r\n print(f\"✅ Script written to: {args.output}\")\r\n\r\n\r\nif __name__ == \"__main__\":\r\n main()\r\n",
|
|
191
191
|
},
|
|
192
192
|
],
|
|
193
193
|
},
|
|
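Note on the `--roles` formats in the hunk above: `scripts/generate_notebook.py` accepts two-part `EMAIL:ROLE` entries, while `scripts/generate_ps1.py` accepts three-part `NAME_OR_EMAIL:TYPE:ROLE` entries. The `--roles` example in the new `generate_notebook.py` docstring uses the three-part form, which its own `parse_roles` would reject because it requires exactly two parts. The sketch below is a minimal illustration of both shapes, not the packaged code: the single `parse_roles(roles_str, with_type)` helper and its `with_type` flag are assumptions made for this example, and it raises `ValueError` where the packaged scripts print to stderr and call `sys.exit(1)`.

```python
VALID_ROLES = {"Admin", "Member", "Contributor", "Viewer"}
VALID_TYPES = {"User", "Group", "ServicePrincipal"}


def parse_roles(roles_str: str, with_type: bool) -> list:
    """Parse a comma-separated role list (hypothetical combined helper).

    with_type=False expects EMAIL:ROLE (the notebook generator's format).
    with_type=True expects NAME_OR_EMAIL:TYPE:ROLE (the PowerShell generator's format).
    """
    expected = 3 if with_type else 2
    shape = "NAME_OR_EMAIL:TYPE:ROLE" if with_type else "EMAIL:ROLE"
    entries = []
    for entry in roles_str.split(","):
        parts = [p.strip() for p in entry.strip().split(":")]
        if len(parts) != expected:
            raise ValueError(f"'{entry.strip()}' must be {shape}")
        role = parts[-1]
        if role not in VALID_ROLES:
            raise ValueError(f"role '{role}' must be one of {sorted(VALID_ROLES)}")
        if with_type:
            if parts[1] not in VALID_TYPES:
                raise ValueError(f"type '{parts[1]}' must be one of {sorted(VALID_TYPES)}")
            entries.append({"name": parts[0], "type": parts[1], "role": role})
        else:
            entries.append({"email": parts[0], "role": role})
    return entries


if __name__ == "__main__":
    # Two-part form, as the notebook generator expects.
    print(parse_roles("alice@corp.com:Admin,bob@corp.com:Member", with_type=False))
    # Three-part form, as the PowerShell generator expects.
    print(parse_roles("alice@corp.com:User:Admin,Finance Team:Group:Member", with_type=True))
    # A three-part entry handed to the two-part parser is rejected.
    try:
        parse_roles("alice@corp.com:User:Admin", with_type=False)
    except ValueError as err:
        print(f"rejected: {err}")
```

Raising an exception instead of exiting keeps the helper easy to test. A command-line wrapper could catch `ValueError` and exit with status 1 to match the behaviour of the packaged scripts.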
@@ -197,7 +197,7 @@ export const EMBEDDED_SKILLS = [
|
|
|
197
197
|
files: [
|
|
198
198
|
{
|
|
199
199
|
relativePath: "SKILL.md",
|
|
200
|
-
content: "---\r\nname: pdf-to-bronze-delta-tables\r\ndescription: >\r\n Use this skill to extract structured data from PDF files on an operator's\r\n local machine, upload them to a Microsoft Fabric bronze lakehouse, and convert\r\n them to a delta table using AI-powered field extraction. Triggers on: \"create\r\n delta tables from PDFs\", \"extract data from PDF invoices to Fabric\", \"load\r\n PDFs into bronze lakehouse\", \"parse PDF documents to delta format\", \"ingest\r\n PDF files to Fabric tables\". Does NOT trigger for CSV/Excel ingestion,\r\n transforming existing delta tables, or non-Fabric storage targets.\r\nlicense: MIT\r\ncompatibility: >\r\n Python 3.8+ for scripts/. Fabric CLI (fab) for CLI upload option.\r\n Fabric notebook runtime 1.3 required (for synapse.ml.aifunc).\r\n---\r\n\r\n# PDF to Bronze Delta Tables\r\n\r\nUploads PDF files from a local machine to a Microsoft Fabric bronze lakehouse\r\nand converts each PDF into a row in a delta table using AI field extraction.\r\nThe lakehouse must already exist.\r\n\r\n> ⚠️ **GOVERNANCE RULE**: This skill **never executes `fab` CLI commands directly**.\r\n> All `fab` commands are written to a PowerShell script for the operator to run.\r\n>\r\n> ⚠️ **GENERATION**: Always run `scripts/generate_notebook.py` via Bash to produce\r\n> the `.ipynb` notebook — never generate notebook cell content directly. The\r\n> generated notebook uses native PySpark with `synapse.ml.aifunc` for AI extraction\r\n> — it does not use `fab` CLI or `FAB_TOKEN` auth.\r\n\r\n## Orchestrated Context\r\n\r\nWhen invoked from a workflow agent, read `00-environment-discovery/environment-profile.md`\r\nand the SOP before asking the user anything.\r\n\r\n| Parameter | Source when orchestrated |\r\n|---|---|\r\n| Workspace name | Environment profile or implementation plan |\r\n| Lakehouse name | SOP shared parameters (from lakehouse creation step) |\r\n\r\n**Only ask for parameters not found in these documents** (e.g. 
local PDF folder path,\r\ndestination folder, table name, extraction field definitions).\r\n\r\n## Inputs\r\n\r\n| Parameter | Description | Example |\r\n|-----------|-------------|---------|\r\n| `WORKSPACE_NAME` | Fabric workspace name (exact, case-sensitive) | `\"Landon Finance Month End\"` |\r\n| `LAKEHOUSE_NAME` | Bronze lakehouse name (exact, case-sensitive) | `\"Lh_landon_finance_bronze\"` |\r\n| `LAKEHOUSE_FILES_FOLDER` | Folder name under lakehouse Files section | `\"Booking PDFs\"` |\r\n| `TABLE_NAME` | Target delta table name (snake_case) | `\"booking_invoices\"` |\r\n| `LOCAL_PDF_FOLDER` | Exact absolute path to local PDF folder (CLI upload only) | `\"C:\\Users\\rishi\\Data\\Booking PDFs\"` |\r\n| `FIELDS` | Fields to extract from each PDF — collected in Step 2 | See workflow |\r\n\r\n## Workflow\r\n\r\n- [ ] **Collect parameters** — If `WORKSPACE_NAME` or `LAKEHOUSE_NAME` are not\r\n provided, ask the operator for them before proceeding.\r\n\r\n- [ ] **Suggest and confirm extraction fields** — Before asking the operator to\r\n define fields from scratch, the agent should **read a sample PDF** to understand\r\n the document structure and proactively suggest fields:\r\n\r\n 1. Use `pdfplumber` (or equivalent) to extract text from 1–2 sample PDFs in\r\n `LOCAL_PDF_FOLDER`. If a second PDF is from a different sub-group (e.g.\r\n different property/entity), include it to confirm layout consistency.\r\n 2. Identify all extractable fields from the document structure (headers, labels,\r\n line items, totals, payment details, etc.).\r\n 3. Present the suggested fields to the operator in a table format, split into:\r\n - **Header-level fields** (one row per PDF) — for the main table\r\n - **Line-item fields** (multiple rows per PDF) — for the detail table, if\r\n the document contains repeating line items\r\n 4. For each field, show: `snake_case` name, extraction hint for the AI, and an\r\n example value from the sample PDF.\r\n 5. 
Ask the operator:\r\n - \"Do these fields look right? Anything to add, remove, or rename?\"\r\n - \"What should the main delta table be named?\" → `TABLE_NAME`\r\n - \"Do you want a second table for line/detail items?\" If yes:\r\n → `LINE_ITEMS_TABLE_NAME` and confirm the line-item fields\r\n - \"What folder name will the PDFs be stored in under the lakehouse Files\r\n section?\" → `LAKEHOUSE_FILES_FOLDER`\r\n 6. **Do not proceed until the operator confirms the fields.**\r\n\r\n Build `FIELDS` as a JSON array: `[{\"name\": \"...\", \"description\": \"...\"}, ...]`\r\n\r\n If the operator confirmed a second line-items table, build `LINE_ITEMS_FIELDS`\r\n as a JSON array: `[{\"name\": \"...\", \"description\": \"...\"}, ...]`\r\n\r\n- [ ] **Upload PDFs** — Present these three options and ask the operator to choose:\r\n\r\n **Option 1 — OneLake File Explorer (Manual)**\r\n Drag-and-drop the PDFs into the target folder under the lakehouse Files section\r\n using the OneLake File Explorer desktop app. No agent action required.\r\n\r\n **Option 2 — Fabric UI (Manual)**\r\n In the Fabric browser UI navigate to the lakehouse → Files section → open or\r\n create the `LAKEHOUSE_FILES_FOLDER` folder → click **Upload** and select the\r\n PDF files. No agent action required.\r\n\r\n **Option 3 — Fabric CLI (Automated)**\r\n > ⚠️ **Requires PowerShell** — generates a `.ps1` script. PowerShell is available\r\n > on Windows natively and on Mac/Linux via `brew install powershell`. If PowerShell\r\n > is not available and the operator does not want to install it, use Option 1 or 2.\r\n > Do not substitute a bash or shell script.\r\n >\r\n > ⚠️ **Performance note**: The CLI uploads files one at a time. For large\r\n > batches (50+ files) this is significantly slower than Options 1 or 2.\r\n > Recommend Options 1 or 2 for bulk uploads.\r\n\r\n Ask for `LOCAL_PDF_FOLDER` (exact absolute path). 
Then run:\r\n ```\r\n python scripts/generate_upload_commands.py \\\r\n --local-folder \"<LOCAL_PDF_FOLDER>\" \\\r\n --workspace \"<WORKSPACE_NAME>\" \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-script \"<OUTPUT_FOLDER>/upload_pdf_files.ps1\"\r\n ```\r\n Present the script path to the operator and ask them to run it with `pwsh upload_pdf_files.ps1`.\r\n\r\n## Output Folder\r\n\r\nBefore beginning, create the output folder:\r\n```\r\noutputs/pdf-to-bronze-delta-tables_{YYYY-MM-DD_HH-MM}_{USERNAME}/\r\n```\r\nAll generated scripts and notebooks for this run are saved here.\r\n\r\n- [ ] **Confirm upload** — Ask the operator to confirm all PDFs are visible in the\r\n lakehouse Files section before proceeding.\r\n\r\n- [ ] **Generate TEST notebook** — Run:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --table-name \"<TABLE_NAME>\" \\\r\n --fields-json \"<FIELDS_JSON>\" \\\r\n [--line-items-table-name \"<LINE_ITEMS_TABLE_NAME>\"] \\\r\n [--line-items-fields-json \"<LINE_ITEMS_FIELDS_JSON>\"] \\\r\n --test-mode \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\pdf_to_delta_TEST.ipynb\"\r\n ```\r\n Where `<FIELDS_JSON>` is the JSON array built from `FIELDS` above, as a\r\n single-line string (e.g. `'[{\"name\":\"invoice_number\",\"description\":\"...\"}]'`).\r\n Include `--line-items-table-name` and `--line-items-fields-json` if a second\r\n line-items table was requested — both must be provided together.\r\n\r\n Tell the operator:\r\n 1. Go to the workspace → **New** → **Import notebook**\r\n 2. Select `pdf_to_delta_TEST.ipynb`\r\n 3. Click **Run All** — the notebook attaches the lakehouse automatically and\r\n processes **one PDF only**\r\n 4. 
Share the output row displayed at the end of the notebook\r\n\r\n- [ ] **Validate and iterate** — Review the output row the operator shares:\r\n - Check each field has a value and it looks correct\r\n - If a field is missing or wrong: update its description in `FIELDS_JSON`,\r\n regenerate the TEST notebook, and ask the operator to re-run it\r\n - Repeat until all fields are correct\r\n - **Do not proceed to full run until the test row is confirmed correct**\r\n\r\n- [ ] **Generate FULL notebook** — Once test output is confirmed, run the same\r\n command **without** `--test-mode`:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --table-name \"<TABLE_NAME>\" \\\r\n --fields-json \"<FIELDS_JSON>\" \\\r\n [--line-items-table-name \"<LINE_ITEMS_TABLE_NAME>\"] \\\r\n [--line-items-fields-json \"<LINE_ITEMS_FIELDS_JSON>\"] \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\pdf_to_delta_FULL.ipynb\"\r\n ```\r\n Tell the operator to import and run `pdf_to_delta_FULL.ipynb`. This processes\r\n all PDFs in the folder.\r\n\r\n- [ ] **Validate final table** — Ask the operator to confirm:\r\n - Delta table `<TABLE_NAME>` appears in the Tables section of the lakehouse\r\n - Row count matches the number of PDFs uploaded\r\n - Spot-check a few rows for data quality\r\n\r\n## Table Naming\r\n\r\n- Use a descriptive `snake_case` name based on the document type, not the filename\r\n- PDFs are individual records — do not derive table name from filenames\r\n- Ask the operator to confirm the table name before generating any notebook\r\n\r\n## Gotchas\r\n\r\n- **AI features must be enabled on the capacity.** `synapse.ml.aifunc` uses Fabric's\r\n built-in AI endpoint — no Azure OpenAI key needed. 
Prerequisites: (1) paid Fabric\r\n capacity F2 or higher, (2) tenant admin must enable \"Copilot and other features\r\n powered by Azure OpenAI\" in Admin portal → Tenant settings, (3) if capacity is\r\n outside an Azure OpenAI region, also enable the cross-geo processing toggle.\r\n- **Default model is `gpt-4.1-mini`.** If the notebook throws `DeploymentConfigNotFound`,\r\n the `MODEL_DEPLOYMENT_NAME` in the configuration cell doesn't match a model on\r\n the built-in endpoint. Check supported models at\r\n https://learn.microsoft.com/en-us/fabric/data-science/ai-services/ai-services-overview\r\n- `fab cp` requires `./filename` (forward slash) syntax. Absolute Windows paths\r\n (`C:\\...`) cause `[NotSupported]` errors. The generated script uses `Push-Location`\r\n to work around this — do not modify this pattern.\r\n- **Destination folder must exist before uploading.** The script runs `fab mkdir` first.\r\n Running `fab mkdir` on an existing folder is safe.\r\n- `WORKSPACE_NAME` and `LAKEHOUSE_NAME` are case-sensitive.\r\n- The notebook uses `synapse.ml.aifunc` which requires Fabric **runtime 1.3**.\r\n If the operator sees import errors, check runtime version in notebook settings.\r\n- The `%%configure` cell attaches the lakehouse automatically — no manual\r\n attachment needed before clicking Run All.\r\n- AI extraction temperature is set to `0.0` for consistency, but it is still\r\n non-deterministic across different PDF layouts. Always validate with TEST mode first.\r\n- All extracted fields are written as strings. If the operator needs typed columns\r\n (dates, numbers), add a post-processing step after confirming extraction is correct.\r\n- **Column names come from AI extraction.** The delta table column names match\r\n the `name` field in the `FIELDS` JSON array provided during setup. 
These are\r\n `snake_case` names chosen by the operator (e.g., `invoice_number`, `hotel_name`).\r\n They do NOT follow the same `clean_columns()` convention used by the\r\n `csv-to-bronze-delta-tables` skill. Downstream skills (e.g.,\r\n `create-materialised-lakeview-scripts`) must verify actual delta table column\r\n names rather than assuming any naming convention.\r\n- The notebook installs `openai` and `pymupdf4llm` at runtime. The `synapse.ml.aifunc`\r\n package is pre-installed in Fabric Runtime 1.3+.\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_upload_commands.py`** — Scans a local folder for PDFs and\r\n writes a PowerShell script of `fab cp` upload commands.\r\n Run: `python scripts/generate_upload_commands.py --help`\r\n- **`scripts/generate_notebook.py`** — Generates a Fabric-compatible `.ipynb`\r\n notebook with the AI extraction prompt pre-populated from the supplied fields.\r\n Supports `--test-mode` for single-PDF validation runs.\r\n Run: `python scripts/generate_notebook.py --help`\r\n",
|
|
200
|
+
content: "---\r\nname: pdf-to-bronze-delta-tables\r\ndescription: >\r\n Use this skill to extract structured data from PDF files on an operator's\r\n local machine, upload them to a Microsoft Fabric bronze lakehouse, and convert\r\n them to a delta table using AI-powered field extraction. Triggers on: \"create\r\n delta tables from PDFs\", \"extract data from PDF invoices to Fabric\", \"load\r\n PDFs into bronze lakehouse\", \"parse PDF documents to delta format\", \"ingest\r\n PDF files to Fabric tables\". Does NOT trigger for CSV/Excel ingestion,\r\n transforming existing delta tables, or non-Fabric storage targets.\r\nlicense: MIT\r\ncompatibility: >\r\n Python 3.8+ for scripts/. Fabric CLI (fab) for CLI upload option.\r\n Fabric notebook runtime 1.3 required (for synapse.ml.aifunc).\r\n---\r\n\r\n# PDF to Bronze Delta Tables\r\n\r\nUploads PDF files from a local machine to a Microsoft Fabric bronze lakehouse\r\nand converts each PDF into a row in a delta table using AI field extraction.\r\nThe lakehouse must already exist.\r\n\r\n> ⚠️ **GOVERNANCE**: This skill produces notebooks and scripts for the operator to\r\n> review and run — it never executes commands directly against a live Fabric environment.\r\n> Present each generated artefact to the operator before they run it.\r\n>\r\n> ⚠️ **GENERATION**: Always run `scripts/generate_notebook.py` via Bash to produce\r\n> the `.ipynb` notebook — never generate notebook cell content directly. 
The\r\n> generated notebook uses native PySpark with `synapse.ml.aifunc` for AI extraction\r\n> — it does not use `fab` CLI or `FAB_TOKEN` auth.\r\n\r\n## Orchestrated Context\r\n\r\nWhen invoked from a workflow agent, read `00-environment-discovery/environment-profile.md`\r\nand the SOP before asking the user anything.\r\n\r\n| Parameter | Source when orchestrated |\r\n|---|---|\r\n| Workspace name | Environment profile or implementation plan |\r\n| Lakehouse name | SOP shared parameters (from lakehouse creation step) |\r\n\r\n**Only ask for parameters not found in these documents** (e.g. local PDF folder path,\r\ndestination folder, table name, extraction field definitions).\r\n\r\n## Inputs\r\n\r\n| Parameter | Description | Example |\r\n|-----------|-------------|---------|\r\n| `WORKSPACE_NAME` | Fabric workspace name (exact, case-sensitive) | `\"Landon Finance Month End\"` |\r\n| `LAKEHOUSE_NAME` | Bronze lakehouse name (exact, case-sensitive) | `\"Lh_landon_finance_bronze\"` |\r\n| `LAKEHOUSE_FILES_FOLDER` | Folder name under lakehouse Files section | `\"Booking PDFs\"` |\r\n| `TABLE_NAME` | Target delta table name (snake_case) | `\"booking_invoices\"` |\r\n| `LOCAL_PDF_FOLDER` | Exact absolute path to local PDF folder (CLI upload only) | `\"C:\\Users\\rishi\\Data\\Booking PDFs\"` |\r\n| `FIELDS` | Fields to extract from each PDF — collected in Step 2 | See workflow |\r\n\r\n## Workflow\r\n\r\n- [ ] **Collect parameters** — If `WORKSPACE_NAME` or `LAKEHOUSE_NAME` are not\r\n provided, ask the operator for them before proceeding.\r\n\r\n- [ ] **Suggest and confirm extraction fields** — Before asking the operator to\r\n define fields from scratch, the agent should **read a sample PDF** to understand\r\n the document structure and proactively suggest fields:\r\n\r\n 1. Use `pdfplumber` (or equivalent) to extract text from 1–2 sample PDFs in\r\n `LOCAL_PDF_FOLDER`. 
If a second PDF is from a different sub-group (e.g.\r\n different property/entity), include it to confirm layout consistency.\r\n 2. Identify all extractable fields from the document structure (headers, labels,\r\n line items, totals, payment details, etc.).\r\n 3. Present the suggested fields to the operator in a table format, split into:\r\n - **Header-level fields** (one row per PDF) — for the main table\r\n - **Line-item fields** (multiple rows per PDF) — for the detail table, if\r\n the document contains repeating line items\r\n 4. For each field, show: `snake_case` name, extraction hint for the AI, and an\r\n example value from the sample PDF.\r\n 5. Ask the operator:\r\n - \"Do these fields look right? Anything to add, remove, or rename?\"\r\n - \"What should the main delta table be named?\" → `TABLE_NAME`\r\n - \"Do you want a second table for line/detail items?\" If yes:\r\n → `LINE_ITEMS_TABLE_NAME` and confirm the line-item fields\r\n - \"What folder name will the PDFs be stored in under the lakehouse Files\r\n section?\" → `LAKEHOUSE_FILES_FOLDER`\r\n 6. **Do not proceed until the operator confirms the fields.**\r\n\r\n Build `FIELDS` as a JSON array: `[{\"name\": \"...\", \"description\": \"...\"}, ...]`\r\n\r\n If the operator confirmed a second line-items table, build `LINE_ITEMS_FIELDS`\r\n as a JSON array: `[{\"name\": \"...\", \"description\": \"...\"}, ...]`\r\n\r\n- [ ] **Upload PDFs** — Present these three options and ask the operator to choose:\r\n\r\n **Option A — OneLake File Explorer (Manual)**\r\n Drag-and-drop the PDFs into the target folder under the lakehouse Files section\r\n using the OneLake File Explorer desktop app. No agent action required.\r\n\r\n **Option B — Fabric UI (Manual)**\r\n In the Fabric browser UI navigate to the lakehouse → Files section → open or\r\n create the `LAKEHOUSE_FILES_FOLDER` folder → click **Upload** and select the\r\n PDF files. 
No agent action required.\r\n\r\n **Option C — Fabric CLI (Automated)**\r\n > ⚠️ **Requires PowerShell** — generates a `.ps1` script. PowerShell is available\r\n > on Windows natively and on Mac/Linux via `brew install powershell`. If PowerShell\r\n > is not available and the operator does not want to install it, use Option A or B.\r\n > Do not substitute a bash or shell script.\r\n >\r\n > ⚠️ **Performance note**: The CLI uploads files one at a time. For large\r\n > batches (50+ files) this is significantly slower than Options A or B.\r\n > Recommend Options A or B for bulk uploads.\r\n\r\n Ask for `LOCAL_PDF_FOLDER` (exact absolute path). Then run:\r\n ```\r\n python scripts/generate_upload_commands.py \\\r\n --local-folder \"<LOCAL_PDF_FOLDER>\" \\\r\n --workspace \"<WORKSPACE_NAME>\" \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-script \"<OUTPUT_FOLDER>/upload_pdf_files.ps1\"\r\n ```\r\n Present the script path to the operator and ask them to run it with `pwsh upload_pdf_files.ps1`.\r\n\r\n## Output Folder\r\n\r\nBefore beginning, create the output folder:\r\n```\r\noutputs/pdf-to-bronze-delta-tables_{YYYY-MM-DD_HH-MM}_{USERNAME}/\r\n```\r\nAll generated scripts and notebooks for this run are saved here.\r\n\r\n- [ ] **Confirm upload** — Ask the operator to confirm all PDFs are visible in the\r\n lakehouse Files section before proceeding.\r\n\r\n- [ ] **Generate TEST notebook** — Run:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --table-name \"<TABLE_NAME>\" \\\r\n --fields-json \"<FIELDS_JSON>\" \\\r\n [--line-items-table-name \"<LINE_ITEMS_TABLE_NAME>\"] \\\r\n [--line-items-fields-json \"<LINE_ITEMS_FIELDS_JSON>\"] \\\r\n --test-mode \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\pdf_to_delta_TEST.ipynb\"\r\n ```\r\n Where `<FIELDS_JSON>` is the JSON array built from `FIELDS` above, as a\r\n 
single-line string (e.g. `'[{\"name\":\"invoice_number\",\"description\":\"...\"}]'`).\r\n Include `--line-items-table-name` and `--line-items-fields-json` if a second\r\n line-items table was requested — both must be provided together.\r\n\r\n Tell the operator:\r\n 1. Go to the workspace → **New** → **Import notebook**\r\n 2. Select `pdf_to_delta_TEST.ipynb`\r\n 3. Click **Run All** — the notebook attaches the lakehouse automatically and\r\n processes **one PDF only**\r\n 4. Share the output row displayed at the end of the notebook\r\n\r\n- [ ] **Validate and iterate** — Review the output row the operator shares:\r\n - Check each field has a value and it looks correct\r\n - If a field is missing or wrong: update its description in `FIELDS_JSON`,\r\n regenerate the TEST notebook, and ask the operator to re-run it\r\n - Repeat until all fields are correct\r\n - **Do not proceed to full run until the test row is confirmed correct**\r\n\r\n- [ ] **Generate FULL notebook** — Once test output is confirmed, run the same\r\n command **without** `--test-mode`:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --table-name \"<TABLE_NAME>\" \\\r\n --fields-json \"<FIELDS_JSON>\" \\\r\n [--line-items-table-name \"<LINE_ITEMS_TABLE_NAME>\"] \\\r\n [--line-items-fields-json \"<LINE_ITEMS_FIELDS_JSON>\"] \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\pdf_to_delta_FULL.ipynb\"\r\n ```\r\n Tell the operator to import and run `pdf_to_delta_FULL.ipynb`. 
This processes\r\n all PDFs in the folder.\r\n\r\n- [ ] **Validate final table** — Ask the operator to confirm:\r\n - Delta table `<TABLE_NAME>` appears in the Tables section of the lakehouse\r\n - Row count matches the number of PDFs uploaded\r\n - Spot-check a few rows for data quality\r\n\r\n## Table Naming\r\n\r\n- Use a descriptive `snake_case` name based on the document type, not the filename\r\n- PDFs are individual records — do not derive table name from filenames\r\n- Ask the operator to confirm the table name before generating any notebook\r\n\r\n## Gotchas\r\n\r\n- **AI features must be enabled on the capacity.** `synapse.ml.aifunc` uses Fabric's\r\n built-in AI endpoint — no Azure OpenAI key needed. Prerequisites: (1) paid Fabric\r\n capacity F2 or higher, (2) tenant admin must enable \"Copilot and other features\r\n powered by Azure OpenAI\" in Admin portal → Tenant settings, (3) if capacity is\r\n outside an Azure OpenAI region, also enable the cross-geo processing toggle.\r\n- **Default model is `gpt-4.1-mini`.** If the notebook throws `DeploymentConfigNotFound`,\r\n the `MODEL_DEPLOYMENT_NAME` in the configuration cell doesn't match a model on\r\n the built-in endpoint. Check supported models at\r\n https://learn.microsoft.com/en-us/fabric/data-science/ai-services/ai-services-overview\r\n- `fab cp` requires `./filename` (forward slash) syntax. Absolute Windows paths\r\n (`C:\\...`) cause `[NotSupported]` errors. 
The generated script uses `Push-Location`\r\n to work around this — do not modify this pattern.\r\n- **Destination folder must exist before uploading.** The script runs `fab mkdir` first.\r\n Running `fab mkdir` on an existing folder is safe.\r\n- `WORKSPACE_NAME` and `LAKEHOUSE_NAME` are case-sensitive.\r\n- The notebook uses `synapse.ml.aifunc` which requires Fabric **runtime 1.3**.\r\n If the operator sees import errors, check runtime version in notebook settings.\r\n- The `%%configure` cell attaches the lakehouse automatically — no manual\r\n attachment needed before clicking Run All.\r\n- AI extraction temperature is set to `0.0` for consistency, but it is still\r\n non-deterministic across different PDF layouts. Always validate with TEST mode first.\r\n- All extracted fields are written as strings. If the operator needs typed columns\r\n (dates, numbers), add a post-processing step after confirming extraction is correct.\r\n- **Column names come from AI extraction.** The delta table column names match\r\n the `name` field in the `FIELDS` JSON array provided during setup. These are\r\n `snake_case` names chosen by the operator (e.g., `invoice_number`, `hotel_name`).\r\n They do NOT follow the same `clean_columns()` convention used by the\r\n `csv-to-bronze-delta-tables` skill. Downstream skills (e.g.,\r\n `create-materialised-lakeview-scripts`) must verify actual delta table column\r\n names rather than assuming any naming convention.\r\n- The notebook installs `openai` and `pymupdf4llm` at runtime. 
The `synapse.ml.aifunc`\r\n package is pre-installed in Fabric Runtime 1.3+.\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_upload_commands.py`** — Scans a local folder for PDFs and\r\n writes a PowerShell script of `fab cp` upload commands.\r\n Run: `python scripts/generate_upload_commands.py --help`\r\n- **`scripts/generate_notebook.py`** — Generates a Fabric-compatible `.ipynb`\r\n notebook with the AI extraction prompt pre-populated from the supplied fields.\r\n Supports `--test-mode` for single-PDF validation runs.\r\n Run: `python scripts/generate_notebook.py --help`\r\n",
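Editor's note on the SKILL.md content above: the workflow builds `FIELDS` as a JSON array of `{"name", "description"}` objects and passes it to `scripts/generate_notebook.py` as a single-line `--fields-json` string. The sketch below shows one way that serialisation step might look. The helper `fields_to_cli_arg` and the `snake_case` check are hypothetical illustrations, not part of the package, and the quoting shown is POSIX-shell quoting only.

```python
import json
import re
import shlex

# Assumption: field names should be lowercase snake_case, as the SKILL.md asks.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")


def fields_to_cli_arg(fields):
    """Hypothetical helper: serialise field definitions to one-line JSON."""
    for field in fields:
        # Each entry is expected to carry exactly a name and a description.
        if set(field) != {"name", "description"}:
            raise ValueError(f"unexpected keys in field: {sorted(field)}")
        if not SNAKE_CASE.match(field["name"]):
            raise ValueError(f"field name is not snake_case: {field['name']!r}")
    # Compact separators keep the result on a single line with no padding.
    return json.dumps(fields, separators=(",", ":"), ensure_ascii=False)


fields = [
    {"name": "invoice_number", "description": "invoice number after no."},
    {"name": "hotel_name", "description": "name of the hotel"},
]
fields_json = fields_to_cli_arg(fields)

# shlex.quote produces POSIX-shell quoting; PowerShell quoting rules differ.
print("--fields-json", shlex.quote(fields_json))
```

Validating names before generating the notebook may be worthwhile because, per the Gotchas list, the delta table column names are taken directly from the `name` values.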
|
|
201
201
|
},
|
|
202
202
|
{
|
|
203
203
|
relativePath: "references/notebook-cells-reference.md",
|
|
@@ -205,11 +205,11 @@ export const EMBEDDED_SKILLS = [
|
|
|
205
205
|
},
|
|
206
206
|
{
|
|
207
207
|
relativePath: "scripts/generate_notebook.py",
|
|
208
|
-
content: "# /// script\r\n# requires-python = \">=3.8\"\r\n# dependencies = []\r\n# ///\r\n\"\"\"\r\nGenerate a Fabric-compatible PySpark notebook (.ipynb) that reads PDF files\r\nfrom a lakehouse Files section, extracts structured fields using AI, and writes\r\nthe results to one or two delta tables (header + optional line items).\r\n\r\nThe notebook follows the structure of NB_ConvertPDFToDelta.ipynb:\r\n %%configure -> pip installs -> imports/helpers -> config -> load PDFs ->\r\n AI prompt -> run extraction -> parse output -> display -> write delta table(s)\r\n\r\nSingle-table usage:\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"Lh_landon_finance_bronze\" \\\r\n --lakehouse-folder \"Booking PDFs\" \\\r\n --table-name \"booking_invoices\" \\\r\n --fields-json '[{\"name\":\"invoice_number\",\"description\":\"invoice number after no.\"}]' \\\r\n --test-mode \\\r\n --output-notebook \"outputs/my-run/pdf_to_delta_TEST.ipynb\"\r\n\r\nTwo-table usage (header + line items):\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"Lh_landon_finance_bronze\" \\\r\n --lakehouse-folder \"Booking PDFs\" \\\r\n --table-name \"booking_invoices\" \\\r\n --fields-json '[{\"name\":\"invoice_number\",\"description\":\"invoice number after no.\"}]' \\\r\n --line-items-table-name \"booking_invoice_line_items\" \\\r\n --line-items-fields-json '[{\"name\":\"item_date\",\"description\":\"date as YYYY-MM-DD\"},{\"name\":\"description\",\"description\":\"line item description\"},{\"name\":\"charge_gbp\",\"description\":\"charge as float, null if empty\"}]' \\\r\n --test-mode \\\r\n --output-notebook \"outputs/my-run/pdf_to_delta_TEST.ipynb\"\r\n\"\"\"\r\nimport argparse\r\nimport json\r\nimport os\r\nimport sys\r\n\r\n\r\ndef cell(source: str, cell_type: str = \"code\") -> dict:\r\n \"\"\"Build a notebook cell dict from a multi-line string.\"\"\"\r\n lines = source.split(\"\\n\")\r\n source_list = [line + \"\\n\" for line in lines[:-1]] + [lines[-1]]\r\n c = {\r\n 
\"cell_type\": cell_type,\r\n \"metadata\": {},\r\n \"source\": source_list,\r\n \"outputs\": [],\r\n \"execution_count\": None,\r\n }\r\n if cell_type == \"markdown\":\r\n del c[\"outputs\"]\r\n del c[\"execution_count\"]\r\n return c\r\n\r\n\r\ndef build_prompt_json(fields: list, line_items_fields: list = None) -> str:\r\n \"\"\"Build the JSON template block for the AI extraction prompt.\"\"\"\r\n lines = [\"{\"]\r\n for i, f in enumerate(fields):\r\n comma = \",\" if (i < len(fields) - 1 or line_items_fields) else \"\"\r\n lines.append(f' \"{f[\"name\"]}\": \"{f[\"description\"]}\"{comma}')\r\n if line_items_fields:\r\n lines.append(' \"line_items\": [')\r\n lines.append(\" {\")\r\n for i, f in enumerate(line_items_fields):\r\n comma = \",\" if i < len(line_items_fields) - 1 else \"\"\r\n lines.append(f' \"{f[\"name\"]}\": \"{f[\"description\"]}\"{comma}')\r\n lines.append(\" }\")\r\n lines.append(\" ]\")\r\n lines.append(\"}\")\r\n return \"\\n\".join(lines)\r\n\r\n\r\ndef build_notebook(\r\n lakehouse_name: str,\r\n lakehouse_folder: str,\r\n table_name: str,\r\n fields: list,\r\n test_mode: bool,\r\n line_items_table_name: str = None,\r\n line_items_fields: list = None,\r\n) -> dict:\r\n two_tables = bool(line_items_table_name and line_items_fields)\r\n mode_label = \"TEST — 1 PDF only\" if test_mode else \"FULL — all PDFs\"\r\n test_mode_str = \"True\" if test_mode else \"False\"\r\n prompt_json = build_prompt_json(fields, line_items_fields if two_tables else None)\r\n tables_note = (\r\n f\"Writes **two** delta tables: `{table_name}` (header) and \"\r\n f\"`{line_items_table_name}` (line items).\"\r\n if two_tables\r\n else f\"Writes delta table **`{table_name}`**.\"\r\n )\r\n\r\n cells = []\r\n\r\n # ── Cell 1: manual lakehouse attachment instructions ─────────────────────\r\n cells.append(cell(\r\n f'## ⚠️ Before Running: Setup Steps Required\\n'\r\n f'\\n'\r\n f'### Step 1 — Attach the Lakehouse\\n'\r\n f'1. 
In the left panel, click **Add data items** (database icon)\\n'\r\n f'2. Click **Add lakehouse** → **Existing lakehouse**\\n'\r\n f'3. Choose **{lakehouse_name}** → **Confirm**\\n'\r\n f'\\n'\r\n f'### Step 2 — AI Features\\n'\r\n f'`synapse.ml.aifunc` uses Fabric\\'s built-in AI — no Azure OpenAI key or workspace\\n'\r\n f'settings change needed. It works automatically on capacities with AI/Copilot features\\n'\r\n f'enabled (F64+ or a trial with AI enabled).\\n'\r\n f'\\n'\r\n f'If you see `AuthenticationError: Authentication failed for all authenticators`,\\n'\r\n f'the workspace is on a capacity without AI features. Move the notebook to a workspace\\n'\r\n f'on an AI-enabled capacity and re-run.',\r\n cell_type=\"markdown\",\r\n ))\r\n\r\n # ── Cell 2: markdown header ──────────────────────────────────────────────\r\n cells.append(cell(\r\n f'## PDF to Bronze Delta Tables\\n'\r\n f'\\n'\r\n f'Extracts structured fields from PDFs in `Files/{lakehouse_folder}` using AI. \\n'\r\n f'{tables_note}\\n'\r\n f'\\n'\r\n f'**Mode**: `{mode_label}` \\n'\r\n f'To switch mode, change `TEST_MODE` in the configuration cell and re-run.',\r\n cell_type=\"markdown\",\r\n ))\r\n # ── Cell 3: pip installs ─────────────────────────────────────────────────\r\n # Per https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/overview\r\n # Pandas on PySpark runtime requires openai package. 
pymupdf4llm for PDF→markdown.\r\n cells.append(cell(\r\n '%pip install -q openai pymupdf4llm 2>/dev/null'\r\n ))\r\n\r\n # ── Cell 4: imports ──────────────────────────────────────────────────────\r\n cells.append(cell(\r\n 'import re\\n'\r\n 'import json\\n'\r\n 'import pandas as pd\\n'\r\n 'import synapse.ml.aifunc as aifunc\\n'\r\n 'from synapse.ml.aifunc import Conf\\n'\r\n 'from notebookutils import fs'\r\n ))\r\n\r\n # ── Cell 5: helper — PDF to markdown ────────────────────────────────────\r\n cells.append(cell(\r\n 'def create_mkd(path: str):\\n'\r\n ' \"\"\"Convert a PDF file to markdown text using pymupdf4llm.\"\"\"\\n'\r\n ' try:\\n'\r\n ' import pymupdf4llm\\n'\r\n ' return pymupdf4llm.to_markdown(path), None\\n'\r\n ' except Exception as e:\\n'\r\n ' return None, str(e)'\r\n ))\r\n\r\n # ── Cell 6: helper — load all PDFs from lakehouse ────────────────────────\r\n cells.append(cell(\r\n 'def process_pdfs(files_folder: str, test_mode: bool = False):\\n'\r\n ' \"\"\"List PDFs from the lakehouse Files section and convert each to markdown.\"\"\"\\n'\r\n ' files = fs.ls(files_folder)\\n'\r\n ' pdf_files = [f for f in files if f.name.lower().endswith(\".pdf\")]\\n'\r\n ' if not pdf_files:\\n'\r\n ' raise ValueError(f\"No PDF files found in \\'{files_folder}\\'. 
\"\\n'\r\n ' \"Check LAKEHOUSE_FILES_FOLDER and upload step.\")\\n'\r\n ' if test_mode:\\n'\r\n ' pdf_files = pdf_files[:1]\\n'\r\n ' print(f\"TEST MODE: processing 1 PDF \\u2014 {pdf_files[0].name}\")\\n'\r\n ' data = []\\n'\r\n ' for f in pdf_files:\\n'\r\n ' local_path = \"/lakehouse/default/Files\" + f.path.split(\"Files\")[1]\\n'\r\n ' mkdown_text, error = create_mkd(local_path)\\n'\r\n ' if error:\\n'\r\n ' print(f\"\\u26a0\\ufe0f Failed to convert {f.name}: {error}\")\\n'\r\n ' data.append((f.name, local_path, mkdown_text, error))\\n'\r\n ' return pd.DataFrame(data, columns=[\"filename\", \"file_path\", \"mkdown_text\", \"error\"])'\r\n ))\r\n\r\n # ── Cell 7: configuration ────────────────────────────────────────────────\r\n if two_tables:\r\n table_config = (\r\n f'HEADER_TABLE_NAME = \"{table_name}\" # header delta table\\n'\r\n f'LINE_ITEMS_TABLE_NAME = \"{line_items_table_name}\" # line items delta table'\r\n )\r\n else:\r\n table_config = f'TABLE_NAME = \"{table_name}\" # delta table name'\r\n\r\n cells.append(cell(\r\n '# \\u2500\\u2500 CONFIGURE \\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\n'\r\n f'LAKEHOUSE_FILES_FOLDER = \"{lakehouse_folder}\" # folder under Files/ containing PDFs\\n'\r\n f'{table_config}\\n'\r\n f'TEST_MODE = {test_mode_str} # True = process 1 PDF only\\n'\r\n '# AI model deployment — must match a model supported by the Fabric built-in AI endpoint.\\n'\r\n '# Default: \"gpt-4.1-mini\". 
Other options: \"gpt-4.1\", \"gpt-4o\", \"gpt-5\"\\n'\r\n 'MODEL_DEPLOYMENT_NAME = \"gpt-4.1-mini\" # update if DeploymentNotFound error\\n'\r\n '# \\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500'\r\n ))\r\n\r\n # ── Cell 8: load PDFs and convert to markdown ────────────────────────────\r\n cells.append(cell(\r\n 'pdf_df = process_pdfs(f\"Files/{LAKEHOUSE_FILES_FOLDER}\", test_mode=TEST_MODE)\\n'\r\n 'print(f\"Loaded {len(pdf_df)} PDF(s) from Files/{LAKEHOUSE_FILES_FOLDER}\")\\n'\r\n 'display(pdf_df[[\"filename\", \"error\"]])'\r\n ))\r\n\r\n # ── Cell 9: AI extraction prompt ─────────────────────────────────────────\r\n cells.append(cell(\r\n '# Edit field descriptions below to tune extraction for your PDFs\\n'\r\n 'EXTRACTION_PROMPT = \"\"\"\\n'\r\n 'Extract the following fields from this document and return ONLY a valid JSON object.\\n'\r\n 'No explanation, no markdown fences, no additional text.\\n'\r\n '\\n'\r\n f'{prompt_json}\\n'\r\n '\\n'\r\n 'Document:\\n'\r\n '\"\"\"'\r\n ))\r\n\r\n # ── Cell 10: run AI extraction ───────────────────────────────────────────\r\n cells.append(cell(\r\n 'pdf_df[\"output\"] = pdf_df[[\"mkdown_text\"]].ai.generate_response(\\n'\r\n ' EXTRACTION_PROMPT,\\n'\r\n ' conf=Conf(model_deployment_name=MODEL_DEPLOYMENT_NAME, temperature=0.0, top_p=1.0, concurrency=25)\\n'\r\n ')\\n'\r\n 'print(\"AI extraction complete.\")'\r\n ))\r\n\r\n # ── Cell 11: parse output ────────────────────────────────────────────────\r\n if two_tables:\r\n 
cells.append(cell(\r\n 'def parse_output(df, json_column):\\n'\r\n ' \"\"\"Parse LLM JSON output into header_df and line_items_df.\"\"\"\\n'\r\n ' def parse_row(val):\\n'\r\n ' if not isinstance(val, str):\\n'\r\n ' return {}\\n'\r\n ' try:\\n'\r\n ' cleaned = val.strip().replace(\"```json\", \"\").replace(\"```\", \"\").strip()\\n'\r\n ' return json.loads(cleaned)\\n'\r\n ' except Exception:\\n'\r\n ' return {}\\n'\r\n '\\n'\r\n ' header_rows, all_line_items = [], []\\n'\r\n ' for val, filename in zip(df[json_column], df[\"filename\"]):\\n'\r\n ' row = parse_row(val)\\n'\r\n ' line_items = row.pop(\"line_items\", None) or []\\n'\r\n ' invoice_number = row.get(\"invoice_number\", \"\")\\n'\r\n ' header_rows.append({\"source_filename\": filename, **row})\\n'\r\n ' for item in line_items:\\n'\r\n ' all_line_items.append({\"source_filename\": filename,\\n'\r\n ' \"invoice_number\": invoice_number,\\n'\r\n ' **item})\\n'\r\n '\\n'\r\n ' header_df = pd.DataFrame(header_rows)\\n'\r\n ' line_items_df = pd.DataFrame(all_line_items) if all_line_items else pd.DataFrame()\\n'\r\n ' return header_df, line_items_df\\n'\r\n '\\n'\r\n 'header_df, line_items_df = parse_output(pdf_df, \"output\")'\r\n ))\r\n else:\r\n cells.append(cell(\r\n 'def parse_output(df, json_column):\\n'\r\n ' def parse_row(val):\\n'\r\n ' if not isinstance(val, str):\\n'\r\n ' return {}\\n'\r\n ' try:\\n'\r\n ' cleaned = val.strip().replace(\"```json\", \"\").replace(\"```\", \"\").strip()\\n'\r\n ' return json.loads(cleaned)\\n'\r\n ' except Exception:\\n'\r\n ' return {}\\n'\r\n ' extracted = [pd.json_normalize(parse_row(v)) for v in df[json_column]]\\n'\r\n ' result = pd.concat(extracted, ignore_index=True)\\n'\r\n ' result.insert(0, \"source_filename\", df[\"filename\"].values)\\n'\r\n ' return result\\n'\r\n '\\n'\r\n 'final_df = parse_output(pdf_df, \"output\")'\r\n ))\r\n\r\n # ── Cell 12: display header results ─────────────────────────────────────\r\n if two_tables:\r\n cells.append(cell(\r\n 
'print(f\"Header rows: {len(header_df)}\")\\n'\r\n 'if TEST_MODE:\\n'\r\n ' print(\"\\u2705 TEST MODE \\u2014 review header row below.\")\\n'\r\n ' print(\" If all fields look correct, scroll down to check line items too.\")\\n'\r\n ' print(\" If any field is wrong, update EXTRACTION_PROMPT and re-run.\")\\n'\r\n 'display(header_df)'\r\n ))\r\n else:\r\n cells.append(cell(\r\n 'print(f\"Extracted {len(final_df)} row(s) from {len(pdf_df)} PDF(s).\")\\n'\r\n 'if TEST_MODE:\\n'\r\n ' print(\"\\u2705 TEST MODE \\u2014 review the row below.\")\\n'\r\n ' print(\" If all fields look correct, set TEST_MODE = False and re-run.\")\\n'\r\n ' print(\" If any field is wrong, update EXTRACTION_PROMPT and re-run.\")\\n'\r\n 'display(final_df)'\r\n ))\r\n\r\n # ── Cell 13: display line items (two-table mode only) ───────────────────\r\n if two_tables:\r\n cells.append(cell(\r\n 'print(f\"Line item rows: {len(line_items_df)}\")\\n'\r\n 'if TEST_MODE:\\n'\r\n ' print(\"\\u2705 TEST MODE \\u2014 review line items below.\")\\n'\r\n ' print(\" If correct, set TEST_MODE = False and re-run for all PDFs.\")\\n'\r\n 'display(line_items_df)'\r\n ))\r\n\r\n # ── Cell 14: write header table ──────────────────────────────────────────\r\n if two_tables:\r\n cells.append(cell(\r\n 'spark_df = spark.createDataFrame(header_df.astype(str).fillna(\"\"))\\n'\r\n 'spark_df.write.format(\"delta\").mode(\"overwrite\").saveAsTable(HEADER_TABLE_NAME)\\n'\r\n 'print(f\"\\u2705 Written {len(header_df)} row(s) to delta table: {HEADER_TABLE_NAME}\")'\r\n ))\r\n else:\r\n cells.append(cell(\r\n 'spark_df = spark.createDataFrame(final_df.astype(str).fillna(\"\"))\\n'\r\n 'spark_df.write.format(\"delta\").mode(\"overwrite\").saveAsTable(TABLE_NAME)\\n'\r\n 'print(f\"\\u2705 Written {len(final_df)} row(s) to delta table: {TABLE_NAME}\")'\r\n ))\r\n\r\n # ── Cell 15: write line items table (two-table mode only) ────────────────\r\n if two_tables:\r\n cells.append(cell(\r\n 'if not line_items_df.empty:\\n'\r\n ' 
spark_li = spark.createDataFrame(line_items_df.astype(str).fillna(\"\"))\\n'\r\n ' spark_li.write.format(\"delta\").mode(\"overwrite\").saveAsTable(LINE_ITEMS_TABLE_NAME)\\n'\r\n ' print(f\"\\u2705 Written {len(line_items_df)} row(s) to delta table: {LINE_ITEMS_TABLE_NAME}\")\\n'\r\n 'else:\\n'\r\n ' print(\"\\u26a0\\ufe0f No line items extracted \\u2014 check the line_items field in EXTRACTION_PROMPT.\")'\r\n ))\r\n\r\n return {\r\n \"nbformat\": 4,\r\n \"nbformat_minor\": 5,\r\n \"metadata\": {\r\n \"kernelspec\": {\r\n \"display_name\": \"synapse_pyspark\",\r\n \"language\": \"python\",\r\n \"name\": \"synapse_pyspark\",\r\n },\r\n \"language_info\": {\"name\": \"python\"},\r\n \"trident\": {\"lakehouse\": {}},\r\n },\r\n \"cells\": cells,\r\n }\r\n\r\n\r\ndef main():\r\n parser = argparse.ArgumentParser(\r\n description=(\r\n \"Generate a Fabric-compatible PySpark notebook (.ipynb) that extracts \"\r\n \"structured fields from PDFs using AI and writes one or two delta tables.\"\r\n ),\r\n formatter_class=argparse.RawDescriptionHelpFormatter,\r\n epilog=(\r\n \"Single-table example:\\n\"\r\n \" python scripts/generate_notebook.py \\\\\\n\"\r\n ' --lakehouse \"Lh_landon_finance_bronze\" \\\\\\n'\r\n ' --lakehouse-folder \"Booking PDFs\" \\\\\\n'\r\n ' --table-name \"booking_invoices\" \\\\\\n'\r\n \" --fields-json '[{\\\"name\\\":\\\"invoice_number\\\",\\\"description\\\":\\\"...\\\"}]' \\\\\\n\"\r\n \" --test-mode \\\\\\n\"\r\n ' --output-notebook \"outputs/my-run/pdf_to_delta_TEST.ipynb\"\\n'\r\n ),\r\n )\r\n parser.add_argument(\"--lakehouse\", required=True,\r\n help=\"Name of the bronze lakehouse to attach.\")\r\n parser.add_argument(\"--lakehouse-folder\", required=True,\r\n help=\"Folder under Files/ in the lakehouse containing PDFs.\")\r\n parser.add_argument(\"--table-name\", required=True,\r\n help=\"Delta table name for header/main rows.\")\r\n parser.add_argument(\"--fields-json\", required=True,\r\n help='JSON array: [{\"name\": \"...\", 
\"description\": \"...\"}, ...]')\r\n parser.add_argument(\"--line-items-table-name\", default=None,\r\n help=\"(Optional) Delta table name for line items rows.\")\r\n parser.add_argument(\"--line-items-fields-json\", default=None,\r\n help=\"(Optional) JSON array of line item field definitions.\")\r\n parser.add_argument(\"--test-mode\", action=\"store_true\",\r\n help=\"If set, notebook processes only the first PDF.\")\r\n parser.add_argument(\"--output-notebook\", required=True,\r\n help=\"Path where the .ipynb file should be saved.\")\r\n args = parser.parse_args()\r\n\r\n def parse_fields(raw, label):\r\n try:\r\n fields = json.loads(raw)\r\n except json.JSONDecodeError as e:\r\n print(f\"ERROR: {label} is not valid JSON: {e}\", file=sys.stderr)\r\n sys.exit(1)\r\n if not isinstance(fields, list) or not all(\r\n \"name\" in f and \"description\" in f for f in fields\r\n ):\r\n print(f'ERROR: {label} must be a JSON array of {{\"name\":..., \"description\":...}} objects.',\r\n file=sys.stderr)\r\n sys.exit(1)\r\n return fields\r\n\r\n fields = parse_fields(args.fields_json, \"--fields-json\")\r\n line_items_fields = (\r\n parse_fields(args.line_items_fields_json, \"--line-items-fields-json\")\r\n if args.line_items_fields_json else None\r\n )\r\n\r\n if bool(args.line_items_table_name) != bool(line_items_fields):\r\n print(\"ERROR: --line-items-table-name and --line-items-fields-json must both be provided together.\",\r\n file=sys.stderr)\r\n sys.exit(1)\r\n\r\n notebook = build_notebook(\r\n lakehouse_name=args.lakehouse,\r\n lakehouse_folder=args.lakehouse_folder,\r\n table_name=args.table_name,\r\n fields=fields,\r\n test_mode=args.test_mode,\r\n line_items_table_name=args.line_items_table_name,\r\n line_items_fields=line_items_fields,\r\n )\r\n\r\n out_path = os.path.abspath(args.output_notebook)\r\n os.makedirs(os.path.dirname(out_path), exist_ok=True)\r\n with open(out_path, \"w\", encoding=\"utf-8\") as f:\r\n json.dump(notebook, f, indent=2, 
ensure_ascii=False)\r\n\r\n two_tables = bool(args.line_items_table_name)\r\n mode_label = \"TEST (1 PDF)\" if args.test_mode else \"FULL (all PDFs)\"\r\n print(f\"Notebook written to: {out_path}\", file=sys.stderr)\r\n print(f\"Mode: {mode_label} | Tables: {'2 (header + line items)' if two_tables else '1'}\", file=sys.stderr)\r\n print(f\"Header fields: {len(fields)}\" + (f\" | Line item fields: {len(line_items_fields)}\" if line_items_fields else \"\"), file=sys.stderr)\r\n print(\"Import into Fabric: Workspace -> New -> Import notebook -> select this .ipynb file.\", file=sys.stderr)\r\n print(\"Lakehouse is attached automatically via %%configure -- just click Run All.\", file=sys.stderr)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n main()\r\n",
|
|
208
|
+
content: "# /// script\r\n# requires-python = \">=3.8\"\r\n# dependencies = []\r\n# ///\r\n\"\"\"\r\nGenerate a Fabric-compatible PySpark notebook (.ipynb) that reads PDF files\r\nfrom a lakehouse Files section, extracts structured fields using AI, and writes\r\nthe results to one or two delta tables (header + optional line items).\r\n\r\nThe notebook follows the structure of NB_ConvertPDFToDelta.ipynb:\r\n %%configure -> pip installs -> imports/helpers -> config -> load PDFs ->\r\n AI prompt -> run extraction -> parse output -> display -> write delta table(s)\r\n\r\nSingle-table usage:\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"Lh_landon_finance_bronze\" \\\r\n --lakehouse-folder \"Booking PDFs\" \\\r\n --table-name \"booking_invoices\" \\\r\n --fields-json '[{\"name\":\"invoice_number\",\"description\":\"invoice number after no.\"}]' \\\r\n --test-mode \\\r\n --output-notebook \"outputs/my-run/pdf_to_delta_TEST.ipynb\"\r\n\r\nTwo-table usage (header + line items):\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"Lh_landon_finance_bronze\" \\\r\n --lakehouse-folder \"Booking PDFs\" \\\r\n --table-name \"booking_invoices\" \\\r\n --fields-json '[{\"name\":\"invoice_number\",\"description\":\"invoice number after no.\"}]' \\\r\n --line-items-table-name \"booking_invoice_line_items\" \\\r\n --line-items-fields-json '[{\"name\":\"item_date\",\"description\":\"date as YYYY-MM-DD\"},{\"name\":\"description\",\"description\":\"line item description\"},{\"name\":\"charge_gbp\",\"description\":\"charge as float, null if empty\"}]' \\\r\n --test-mode \\\r\n --output-notebook \"outputs/my-run/pdf_to_delta_TEST.ipynb\"\r\n\"\"\"\r\nimport argparse\r\nimport json\r\nimport os\r\nimport sys\r\n\r\n\r\ndef cell(source: str, cell_type: str = \"code\") -> dict:\r\n \"\"\"Build a notebook cell dict from a multi-line string.\"\"\"\r\n lines = source.split(\"\\n\")\r\n source_list = [line + \"\\n\" for line in lines[:-1]] + [lines[-1]]\r\n c = {\r\n 
\"cell_type\": cell_type,\r\n \"metadata\": {},\r\n \"source\": source_list,\r\n \"outputs\": [],\r\n \"execution_count\": None,\r\n }\r\n if cell_type == \"markdown\":\r\n del c[\"outputs\"]\r\n del c[\"execution_count\"]\r\n return c\r\n\r\n\r\ndef build_prompt_json(fields: list, line_items_fields: list = None) -> str:\r\n \"\"\"Build the JSON template block for the AI extraction prompt.\"\"\"\r\n lines = [\"{\"]\r\n for i, f in enumerate(fields):\r\n comma = \",\" if (i < len(fields) - 1 or line_items_fields) else \"\"\r\n lines.append(f' \"{f[\"name\"]}\": \"{f[\"description\"]}\"{comma}')\r\n if line_items_fields:\r\n lines.append(' \"line_items\": [')\r\n lines.append(\" {\")\r\n for i, f in enumerate(line_items_fields):\r\n comma = \",\" if i < len(line_items_fields) - 1 else \"\"\r\n lines.append(f' \"{f[\"name\"]}\": \"{f[\"description\"]}\"{comma}')\r\n lines.append(\" }\")\r\n lines.append(\" ]\")\r\n lines.append(\"}\")\r\n return \"\\n\".join(lines)\r\n\r\n\r\ndef build_notebook(\r\n lakehouse_name: str,\r\n lakehouse_folder: str,\r\n table_name: str,\r\n fields: list,\r\n test_mode: bool,\r\n line_items_table_name: str = None,\r\n line_items_fields: list = None,\r\n) -> dict:\r\n two_tables = bool(line_items_table_name and line_items_fields)\r\n mode_label = \"TEST — 1 PDF only\" if test_mode else \"FULL — all PDFs\"\r\n test_mode_str = \"True\" if test_mode else \"False\"\r\n prompt_json = build_prompt_json(fields, line_items_fields if two_tables else None)\r\n tables_note = (\r\n f\"Writes **two** delta tables: `{table_name}` (header) and \"\r\n f\"`{line_items_table_name}` (line items).\"\r\n if two_tables\r\n else f\"Writes delta table **`{table_name}`**.\"\r\n )\r\n\r\n cells = []\r\n\r\n # ── Cell 1: manual lakehouse attachment instructions ─────────────────────\r\n cells.append(cell(\r\n f'## ⚠️ Before Running: Setup Steps Required\\n'\r\n f'\\n'\r\n f'### Step 1 — Attach the Lakehouse\\n'\r\n f'1. 
In the left panel, click **Add data items** (database icon)\\n'\r\n f'2. Click **Add lakehouse** → **Existing lakehouse**\\n'\r\n f'3. Choose **{lakehouse_name}** → **Confirm**\\n'\r\n f'\\n'\r\n f'### Step 2 — AI Features\\n'\r\n f'`synapse.ml.aifunc` uses Fabric\\'s built-in AI — no Azure OpenAI key or workspace\\n'\r\n f'settings change needed. It works automatically on capacities with AI/Copilot features\\n'\r\n f'enabled (F64+ or a trial with AI enabled).\\n'\r\n f'\\n'\r\n f'If you see `AuthenticationError: Authentication failed for all authenticators`,\\n'\r\n f'the workspace is on a capacity without AI features. Move the notebook to a workspace\\n'\r\n f'on an AI-enabled capacity and re-run.',\r\n cell_type=\"markdown\",\r\n ))\r\n\r\n # ── Cell 2: markdown header ──────────────────────────────────────────────\r\n cells.append(cell(\r\n f'## PDF to Bronze Delta Tables\\n'\r\n f'\\n'\r\n f'Extracts structured fields from PDFs in `Files/{lakehouse_folder}` using AI. \\n'\r\n f'{tables_note}\\n'\r\n f'\\n'\r\n f'**Mode**: `{mode_label}` \\n'\r\n f'To switch mode, change `TEST_MODE` in the configuration cell and re-run.',\r\n cell_type=\"markdown\",\r\n ))\r\n # ── Cell 3: pip installs ─────────────────────────────────────────────────\r\n # Per https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/overview\r\n # Pandas on PySpark runtime requires openai package. pymupdf4llm for PDF→markdown.\r\n cells.append(cell(\r\n '## Cell 1 — Install Dependencies\\n'\r\n '\\n'\r\n 'Installs `openai` (required by the Fabric AI functions runtime on PySpark) and\\n'\r\n '`pymupdf4llm` (used to convert each PDF page to clean markdown text for the AI).\\n'\r\n '\\n'\r\n '**How to use**: Run this cell first. The kernel does **not** restart after\\n'\r\n '`%pip install -q` — continue straight to Cell 2. 
Skip if already installed.',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n '%pip install -q openai pymupdf4llm 2>/dev/null'\r\n ))\r\n\r\n # ── Cell 4: imports ──────────────────────────────────────────────────────\r\n cells.append(cell(\r\n '## Cell 2 — Imports\\n'\r\n '\\n'\r\n 'Imports standard libraries (`re`, `json`, `pandas`) and the Fabric AI functions\\n'\r\n 'module (`synapse.ml.aifunc`) used to run batch AI extraction against the PDFs.\\n'\r\n '\\n'\r\n '**How to use**: No changes needed. Run once before the helper functions.',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n 'import re\\n'\r\n 'import json\\n'\r\n 'import pandas as pd\\n'\r\n 'import synapse.ml.aifunc as aifunc\\n'\r\n 'from synapse.ml.aifunc import Conf\\n'\r\n 'from notebookutils import fs'\r\n ))\r\n\r\n # ── Cell 5: helper — PDF to markdown ────────────────────────────────────\r\n cells.append(cell(\r\n '## Cell 3 — Helper: PDF to Markdown\\n'\r\n '\\n'\r\n 'Defines `create_mkd()`, which converts a single PDF file to markdown text using\\n'\r\n '`pymupdf4llm`. The markdown is then passed to the AI extraction prompt in Cell 6.\\n'\r\n '\\n'\r\n '**How to use**: No changes needed. 
Returns `(text, None)` on success or\\n'\r\n '`(None, error_message)` on failure — failures are logged but do not stop the run.',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n 'def create_mkd(path: str):\\n'\r\n ' \"\"\"Convert a PDF file to markdown text using pymupdf4llm.\"\"\"\\n'\r\n ' try:\\n'\r\n ' import pymupdf4llm\\n'\r\n ' return pymupdf4llm.to_markdown(path), None\\n'\r\n ' except Exception as e:\\n'\r\n ' return None, str(e)'\r\n ))\r\n\r\n # ── Cell 6: helper — load all PDFs from lakehouse ────────────────────────\r\n cells.append(cell(\r\n '## Cell 4 — Helper: Load PDFs from Lakehouse\\n'\r\n '\\n'\r\n 'Defines `process_pdfs()`, which lists all `.pdf` files in the configured lakehouse\\n'\r\n 'folder, converts each to markdown via `create_mkd()`, and returns a DataFrame with\\n'\r\n 'columns `filename`, `file_path`, `mkdown_text`, and `error`.\\n'\r\n '\\n'\r\n '**How to use**: No changes needed. In TEST_MODE only the first PDF is processed\\n'\r\n '— use this to validate the extraction prompt before running against all files.',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n 'def process_pdfs(files_folder: str, test_mode: bool = False):\\n'\r\n ' \"\"\"List PDFs from the lakehouse Files section and convert each to markdown.\"\"\"\\n'\r\n ' files = fs.ls(files_folder)\\n'\r\n ' pdf_files = [f for f in files if f.name.lower().endswith(\".pdf\")]\\n'\r\n ' if not pdf_files:\\n'\r\n ' raise ValueError(f\"No PDF files found in \\'{files_folder}\\'. 
\"\\n'\r\n ' \"Check LAKEHOUSE_FILES_FOLDER and upload step.\")\\n'\r\n ' if test_mode:\\n'\r\n ' pdf_files = pdf_files[:1]\\n'\r\n ' print(f\"TEST MODE: processing 1 PDF \\u2014 {pdf_files[0].name}\")\\n'\r\n ' data = []\\n'\r\n ' for f in pdf_files:\\n'\r\n ' local_path = \"/lakehouse/default/Files\" + f.path.split(\"Files\")[1]\\n'\r\n ' mkdown_text, error = create_mkd(local_path)\\n'\r\n ' if error:\\n'\r\n ' print(f\"\\u26a0\\ufe0f Failed to convert {f.name}: {error}\")\\n'\r\n ' data.append((f.name, local_path, mkdown_text, error))\\n'\r\n ' return pd.DataFrame(data, columns=[\"filename\", \"file_path\", \"mkdown_text\", \"error\"])'\r\n ))\r\n\r\n # ── Cell 7: configuration ────────────────────────────────────────────────\r\n if two_tables:\r\n table_config = (\r\n f'HEADER_TABLE_NAME = \"{table_name}\" # header delta table\\n'\r\n f'LINE_ITEMS_TABLE_NAME = \"{line_items_table_name}\" # line items delta table'\r\n )\r\n table_desc = f'`{table_name}` (header rows) and `{line_items_table_name}` (line items)'\r\n else:\r\n table_config = f'TABLE_NAME = \"{table_name}\" # delta table name'\r\n table_desc = f'`{table_name}`'\r\n\r\n cells.append(cell(\r\n '## Cell 5 — Configuration\\n'\r\n '\\n'\r\n f'Sets the source folder, output delta table name(s), run mode, and AI model.\\n'\r\n f'Values were pre-populated when this notebook was generated.\\n'\r\n '\\n'\r\n f'**How to use**:\\n'\r\n f'- `LAKEHOUSE_FILES_FOLDER`: folder under `Files/` containing your PDFs (currently `\"{lakehouse_folder}\"`)\\n'\r\n f'- Output table(s): {table_desc}\\n'\r\n f'- `TEST_MODE = True` processes only the first PDF — use this to validate the prompt before a full run\\n'\r\n f'- `MODEL_DEPLOYMENT_NAME`: change if you get a `DeploymentNotFound` error (e.g. 
try `\"gpt-4.1\"` or `\"gpt-4o\"`)',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n '# \\u2500\\u2500 CONFIGURE \\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\n'\r\n f'LAKEHOUSE_FILES_FOLDER = \"{lakehouse_folder}\" # folder under Files/ containing PDFs\\n'\r\n f'{table_config}\\n'\r\n f'TEST_MODE = {test_mode_str} # True = process 1 PDF only\\n'\r\n '# AI model deployment — must match a model supported by the Fabric built-in AI endpoint.\\n'\r\n '# Default: \"gpt-4.1-mini\". Other options: \"gpt-4.1\", \"gpt-4o\", \"gpt-5\"\\n'\r\n 'MODEL_DEPLOYMENT_NAME = \"gpt-4.1-mini\" # update if DeploymentNotFound error\\n'\r\n '# \\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500\\u2500'\r\n ))\r\n\r\n # ── Cell 8: load PDFs and convert to markdown ────────────────────────────\r\n cells.append(cell(\r\n '## Cell 6 — Load PDFs\\n'\r\n '\\n'\r\n f'Calls `process_pdfs()` to list all PDFs in `Files/{lakehouse_folder}`,\\n'\r\n 'convert each to markdown text, and collect the results into a DataFrame.\\n'\r\n 'Displays a summary table showing filenames and any conversion errors.\\n'\r\n '\\n'\r\n '**How to use**: Run after confirming the lakehouse is attached. 
If `error` is\\n'\r\n 'non-null for any file, the PDF could not be converted — check it opens correctly\\n'\r\n 'in a PDF viewer before re-running. In TEST_MODE, only one PDF is processed.',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n 'pdf_df = process_pdfs(f\"Files/{LAKEHOUSE_FILES_FOLDER}\", test_mode=TEST_MODE)\\n'\r\n 'print(f\"Loaded {len(pdf_df)} PDF(s) from Files/{LAKEHOUSE_FILES_FOLDER}\")\\n'\r\n 'display(pdf_df[[\"filename\", \"error\"]])'\r\n ))\r\n\r\n # ── Cell 9: AI extraction prompt ─────────────────────────────────────────\r\n cells.append(cell(\r\n '## Cell 7 — AI Extraction Prompt\\n'\r\n '\\n'\r\n 'Defines the prompt sent to the AI model for each PDF. The JSON template below\\n'\r\n 'was generated from the field definitions you provided — each key is a field name\\n'\r\n 'and its value describes what the AI should extract.\\n'\r\n '\\n'\r\n '**How to use**: If any fields are extracted incorrectly, edit the description\\n'\r\n 'values in the JSON template (e.g. add more context or examples) and re-run\\n'\r\n 'from this cell. Do not change the outer prompt structure.',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n '# Edit field descriptions below to tune extraction for your PDFs\\n'\r\n 'EXTRACTION_PROMPT = \"\"\"\\n'\r\n 'Extract the following fields from this document and return ONLY a valid JSON object.\\n'\r\n 'No explanation, no markdown fences, no additional text.\\n'\r\n '\\n'\r\n f'{prompt_json}\\n'\r\n '\\n'\r\n 'Document:\\n'\r\n '\"\"\"'\r\n ))\r\n\r\n # ── Cell 10: run AI extraction ───────────────────────────────────────────\r\n cells.append(cell(\r\n '## Cell 8 — Run AI Extraction\\n'\r\n '\\n'\r\n 'Sends each PDF\\'s markdown text to the AI model using the prompt above, with\\n'\r\n '`concurrency=25` for parallel processing. Results are stored in `pdf_df[\"output\"]`.\\n'\r\n '\\n'\r\n '**How to use**: This cell may take 30–120 seconds depending on the number and\\n'\r\n 'size of PDFs. 
If you get `AuthenticationError`, the workspace capacity does not\\n'\r\n 'have AI features enabled — move the notebook to an AI-enabled capacity (F64+ or trial).',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n 'pdf_df[\"output\"] = pdf_df[[\"mkdown_text\"]].ai.generate_response(\\n'\r\n ' EXTRACTION_PROMPT,\\n'\r\n ' conf=Conf(model_deployment_name=MODEL_DEPLOYMENT_NAME, temperature=0.0, top_p=1.0, concurrency=25)\\n'\r\n ')\\n'\r\n 'print(\"AI extraction complete.\")'\r\n ))\r\n\r\n # ── Cell 11: parse output ────────────────────────────────────────────────\r\n if two_tables:\r\n cells.append(cell(\r\n '## Cell 9 — Parse AI Output\\n'\r\n '\\n'\r\n 'Parses the raw JSON strings returned by the AI into two structured DataFrames:\\n'\r\n '`header_df` (one row per PDF with top-level fields) and `line_items_df`\\n'\r\n '(one row per line item, joined by `invoice_number`).\\n'\r\n '\\n'\r\n '**How to use**: No changes needed. Rows with unparseable JSON are silently\\n'\r\n 'dropped — if `header_df` has fewer rows than expected, check `pdf_df[\"output\"]`\\n'\r\n 'for malformed JSON responses and refine the extraction prompt.',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n 'def parse_output(df, json_column):\\n'\r\n ' \"\"\"Parse LLM JSON output into header_df and line_items_df.\"\"\"\\n'\r\n ' def parse_row(val):\\n'\r\n ' if not isinstance(val, str):\\n'\r\n ' return {}\\n'\r\n ' try:\\n'\r\n ' cleaned = val.strip().replace(\"```json\", \"\").replace(\"```\", \"\").strip()\\n'\r\n ' return json.loads(cleaned)\\n'\r\n ' except Exception:\\n'\r\n ' return {}\\n'\r\n '\\n'\r\n ' header_rows, all_line_items = [], []\\n'\r\n ' for val, filename in zip(df[json_column], df[\"filename\"]):\\n'\r\n ' row = parse_row(val)\\n'\r\n ' line_items = row.pop(\"line_items\", None) or []\\n'\r\n ' invoice_number = row.get(\"invoice_number\", \"\")\\n'\r\n ' header_rows.append({\"source_filename\": filename, **row})\\n'\r\n ' for item in 
line_items:\\n'\r\n ' all_line_items.append({\"source_filename\": filename,\\n'\r\n ' \"invoice_number\": invoice_number,\\n'\r\n ' **item})\\n'\r\n '\\n'\r\n ' header_df = pd.DataFrame(header_rows)\\n'\r\n ' line_items_df = pd.DataFrame(all_line_items) if all_line_items else pd.DataFrame()\\n'\r\n ' return header_df, line_items_df\\n'\r\n '\\n'\r\n 'header_df, line_items_df = parse_output(pdf_df, \"output\")'\r\n ))\r\n else:\r\n cells.append(cell(\r\n '## Cell 9 — Parse AI Output\\n'\r\n '\\n'\r\n 'Parses the raw JSON strings returned by the AI into a structured DataFrame\\n'\r\n '`final_df` with one row per PDF and a `source_filename` column prepended.\\n'\r\n '\\n'\r\n '**How to use**: No changes needed. Rows with unparseable JSON result in\\n'\r\n 'empty dicts — if any columns are missing data, check `pdf_df[\"output\"]`\\n'\r\n 'for malformed JSON and refine the extraction prompt.',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n 'def parse_output(df, json_column):\\n'\r\n ' def parse_row(val):\\n'\r\n ' if not isinstance(val, str):\\n'\r\n ' return {}\\n'\r\n ' try:\\n'\r\n ' cleaned = val.strip().replace(\"```json\", \"\").replace(\"```\", \"\").strip()\\n'\r\n ' return json.loads(cleaned)\\n'\r\n ' except Exception:\\n'\r\n ' return {}\\n'\r\n ' extracted = [pd.json_normalize(parse_row(v)) for v in df[json_column]]\\n'\r\n ' result = pd.concat(extracted, ignore_index=True)\\n'\r\n ' result.insert(0, \"source_filename\", df[\"filename\"].values)\\n'\r\n ' return result\\n'\r\n '\\n'\r\n 'final_df = parse_output(pdf_df, \"output\")'\r\n ))\r\n\r\n # ── Cell 12: display header results ─────────────────────────────────────\r\n if two_tables:\r\n cells.append(cell(\r\n '## Cell 10 — Review Header Results\\n'\r\n '\\n'\r\n 'Displays the extracted header rows so you can verify fields are correct\\n'\r\n 'before writing to the delta table. 
In TEST_MODE this is your quality gate.\\n'\r\n '\\n'\r\n '**How to use**: Check that all fields are populated correctly. If any field\\n'\r\n 'is wrong or missing, update `EXTRACTION_PROMPT` in Cell 7 and re-run from there.\\n'\r\n 'When satisfied, scroll down to check line items, then set `TEST_MODE = False`\\n'\r\n 'in Cell 5 and re-run the notebook for all PDFs.',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n 'print(f\"Header rows: {len(header_df)}\")\\n'\r\n 'if TEST_MODE:\\n'\r\n ' print(\"\\u2705 TEST MODE \\u2014 review header row below.\")\\n'\r\n ' print(\" If all fields look correct, scroll down to check line items too.\")\\n'\r\n ' print(\" If any field is wrong, update EXTRACTION_PROMPT and re-run.\")\\n'\r\n 'display(header_df)'\r\n ))\r\n else:\r\n cells.append(cell(\r\n '## Cell 10 — Review Extracted Results\\n'\r\n '\\n'\r\n 'Displays all extracted rows so you can verify fields are correct before\\n'\r\n 'writing to the delta table. In TEST_MODE this is your quality gate.\\n'\r\n '\\n'\r\n '**How to use**: Check that all fields are populated correctly for the test PDF.\\n'\r\n 'If any field is wrong or missing, update `EXTRACTION_PROMPT` in Cell 7 and\\n'\r\n 're-run from there. 
When satisfied, set `TEST_MODE = False` in Cell 5 and\\n'\r\n 're-run the notebook to process all PDFs.',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n 'print(f\"Extracted {len(final_df)} row(s) from {len(pdf_df)} PDF(s).\")\\n'\r\n 'if TEST_MODE:\\n'\r\n ' print(\"\\u2705 TEST MODE \\u2014 review the row below.\")\\n'\r\n ' print(\" If all fields look correct, set TEST_MODE = False and re-run.\")\\n'\r\n ' print(\" If any field is wrong, update EXTRACTION_PROMPT and re-run.\")\\n'\r\n 'display(final_df)'\r\n ))\r\n\r\n # ── Cell 13: display line items (two-table mode only) ───────────────────\r\n if two_tables:\r\n cells.append(cell(\r\n '## Cell 11 — Review Line Item Results\\n'\r\n '\\n'\r\n 'Displays the extracted line items DataFrame so you can verify the nested\\n'\r\n 'fields are parsed correctly before writing to the delta table.\\n'\r\n '\\n'\r\n '**How to use**: Check that each line item row has the expected columns and\\n'\r\n 'values. If any field is wrong, update the `line_items` block in `EXTRACTION_PROMPT`\\n'\r\n 'in Cell 7 and re-run from there. When satisfied, set `TEST_MODE = False` in\\n'\r\n 'Cell 5 and re-run the notebook to process all PDFs.',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n 'print(f\"Line item rows: {len(line_items_df)}\")\\n'\r\n 'if TEST_MODE:\\n'\r\n ' print(\"\\u2705 TEST MODE \\u2014 review line items below.\")\\n'\r\n ' print(\" If correct, set TEST_MODE = False and re-run for all PDFs.\")\\n'\r\n 'display(line_items_df)'\r\n ))\r\n\r\n # ── Cell 14: write header table ──────────────────────────────────────────\r\n if two_tables:\r\n cells.append(cell(\r\n f'## Cell {\"12\" if two_tables else \"11\"} — Write Header Delta Table\\n'\r\n '\\n'\r\n f'Converts `header_df` to a Spark DataFrame (casting all columns to string\\n'\r\n 'to handle mixed types) and writes it as a managed delta table using `overwrite`\\n'\r\n f'mode. 
The table `{table_name}` will appear in the **Tables** section of the lakehouse.\\n'\r\n '\\n'\r\n '**How to use**: Only run this cell after reviewing Cell 10. `overwrite` mode\\n'\r\n 'replaces the table on each run — this is intentional for re-runs.',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n 'spark_df = spark.createDataFrame(header_df.astype(str).fillna(\"\"))\\n'\r\n 'spark_df.write.format(\"delta\").mode(\"overwrite\").saveAsTable(HEADER_TABLE_NAME)\\n'\r\n 'print(f\"\\u2705 Written {len(header_df)} row(s) to delta table: {HEADER_TABLE_NAME}\")'\r\n ))\r\n else:\r\n cells.append(cell(\r\n f'## Cell 11 — Write Delta Table\\n'\r\n '\\n'\r\n f'Converts `final_df` to a Spark DataFrame (casting all columns to string\\n'\r\n 'to handle mixed types) and writes it as a managed delta table using `overwrite`\\n'\r\n f'mode. The table `{table_name}` will appear in the **Tables** section of the lakehouse.\\n'\r\n '\\n'\r\n '**How to use**: Only run this cell after reviewing Cell 10. `overwrite` mode\\n'\r\n 'replaces the table on each run — this is intentional for re-runs.',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n 'spark_df = spark.createDataFrame(final_df.astype(str).fillna(\"\"))\\n'\r\n 'spark_df.write.format(\"delta\").mode(\"overwrite\").saveAsTable(TABLE_NAME)\\n'\r\n 'print(f\"\\u2705 Written {len(final_df)} row(s) to delta table: {TABLE_NAME}\")'\r\n ))\r\n\r\n # ── Cell 15: write line items table (two-table mode only) ────────────────\r\n if two_tables:\r\n cells.append(cell(\r\n f'## Cell 13 — Write Line Items Delta Table\\n'\r\n '\\n'\r\n f'Writes `line_items_df` as a separate managed delta table `{line_items_table_name}`.\\n'\r\n 'Each row represents one line item, linked to its parent header row via\\n'\r\n '`source_filename` and `invoice_number`.\\n'\r\n '\\n'\r\n '**How to use**: Only run after reviewing Cell 11. 
If `line_items_df` was empty\\n'\r\n '(⚠️ warning printed), refine the `line_items` extraction block in the prompt.',\r\n cell_type=\"markdown\",\r\n ))\r\n cells.append(cell(\r\n 'if not line_items_df.empty:\\n'\r\n ' spark_li = spark.createDataFrame(line_items_df.astype(str).fillna(\"\"))\\n'\r\n ' spark_li.write.format(\"delta\").mode(\"overwrite\").saveAsTable(LINE_ITEMS_TABLE_NAME)\\n'\r\n ' print(f\"\\u2705 Written {len(line_items_df)} row(s) to delta table: {LINE_ITEMS_TABLE_NAME}\")\\n'\r\n 'else:\\n'\r\n ' print(\"\\u26a0\\ufe0f No line items extracted \\u2014 check the line_items field in EXTRACTION_PROMPT.\")'\r\n ))\r\n\r\n return {\r\n \"nbformat\": 4,\r\n \"nbformat_minor\": 5,\r\n \"metadata\": {\r\n \"kernelspec\": {\r\n \"display_name\": \"synapse_pyspark\",\r\n \"language\": \"python\",\r\n \"name\": \"synapse_pyspark\",\r\n },\r\n \"language_info\": {\"name\": \"python\"},\r\n \"trident\": {\"lakehouse\": {}},\r\n },\r\n \"cells\": cells,\r\n }\r\n\r\n\r\ndef main():\r\n parser = argparse.ArgumentParser(\r\n description=(\r\n \"Generate a Fabric-compatible PySpark notebook (.ipynb) that extracts \"\r\n \"structured fields from PDFs using AI and writes one or two delta tables.\"\r\n ),\r\n formatter_class=argparse.RawDescriptionHelpFormatter,\r\n epilog=(\r\n \"Single-table example:\\n\"\r\n \" python scripts/generate_notebook.py \\\\\\n\"\r\n ' --lakehouse \"Lh_landon_finance_bronze\" \\\\\\n'\r\n ' --lakehouse-folder \"Booking PDFs\" \\\\\\n'\r\n ' --table-name \"booking_invoices\" \\\\\\n'\r\n \" --fields-json '[{\\\"name\\\":\\\"invoice_number\\\",\\\"description\\\":\\\"...\\\"}]' \\\\\\n\"\r\n \" --test-mode \\\\\\n\"\r\n ' --output-notebook \"outputs/my-run/pdf_to_delta_TEST.ipynb\"\\n'\r\n ),\r\n )\r\n parser.add_argument(\"--lakehouse\", required=True,\r\n help=\"Name of the bronze lakehouse to attach.\")\r\n parser.add_argument(\"--lakehouse-folder\", required=True,\r\n help=\"Folder under Files/ in the lakehouse containing PDFs.\")\r\n 
parser.add_argument(\"--table-name\", required=True,\r\n help=\"Delta table name for header/main rows.\")\r\n parser.add_argument(\"--fields-json\", required=True,\r\n help='JSON array: [{\"name\": \"...\", \"description\": \"...\"}, ...]')\r\n parser.add_argument(\"--line-items-table-name\", default=None,\r\n help=\"(Optional) Delta table name for line items rows.\")\r\n parser.add_argument(\"--line-items-fields-json\", default=None,\r\n help=\"(Optional) JSON array of line item field definitions.\")\r\n parser.add_argument(\"--test-mode\", action=\"store_true\",\r\n help=\"If set, notebook processes only the first PDF.\")\r\n parser.add_argument(\"--output-notebook\", required=True,\r\n help=\"Path where the .ipynb file should be saved.\")\r\n args = parser.parse_args()\r\n\r\n def parse_fields(raw, label):\r\n try:\r\n fields = json.loads(raw)\r\n except json.JSONDecodeError as e:\r\n print(f\"ERROR: {label} is not valid JSON: {e}\", file=sys.stderr)\r\n sys.exit(1)\r\n if not isinstance(fields, list) or not all(\r\n \"name\" in f and \"description\" in f for f in fields\r\n ):\r\n print(f'ERROR: {label} must be a JSON array of {{\"name\":..., \"description\":...}} objects.',\r\n file=sys.stderr)\r\n sys.exit(1)\r\n return fields\r\n\r\n fields = parse_fields(args.fields_json, \"--fields-json\")\r\n line_items_fields = (\r\n parse_fields(args.line_items_fields_json, \"--line-items-fields-json\")\r\n if args.line_items_fields_json else None\r\n )\r\n\r\n if bool(args.line_items_table_name) != bool(line_items_fields):\r\n print(\"ERROR: --line-items-table-name and --line-items-fields-json must both be provided together.\",\r\n file=sys.stderr)\r\n sys.exit(1)\r\n\r\n notebook = build_notebook(\r\n lakehouse_name=args.lakehouse,\r\n lakehouse_folder=args.lakehouse_folder,\r\n table_name=args.table_name,\r\n fields=fields,\r\n test_mode=args.test_mode,\r\n line_items_table_name=args.line_items_table_name,\r\n line_items_fields=line_items_fields,\r\n )\r\n\r\n 
out_path = os.path.abspath(args.output_notebook)\r\n os.makedirs(os.path.dirname(out_path), exist_ok=True)\r\n with open(out_path, \"w\", encoding=\"utf-8\") as f:\r\n json.dump(notebook, f, indent=2, ensure_ascii=False)\r\n\r\n two_tables = bool(args.line_items_table_name)\r\n mode_label = \"TEST (1 PDF)\" if args.test_mode else \"FULL (all PDFs)\"\r\n print(f\"Notebook written to: {out_path}\", file=sys.stderr)\r\n print(f\"Mode: {mode_label} | Tables: {'2 (header + line items)' if two_tables else '1'}\", file=sys.stderr)\r\n print(f\"Header fields: {len(fields)}\" + (f\" | Line item fields: {len(line_items_fields)}\" if line_items_fields else \"\"), file=sys.stderr)\r\n print(\"Import into Fabric: Workspace -> New -> Import notebook -> select this .ipynb file.\", file=sys.stderr)\r\n print(\"Lakehouse is attached automatically via %%configure -- just click Run All.\", file=sys.stderr)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n main()\r\n",
},
{
relativePath: "scripts/generate_upload_commands.py",
-
content: "# /// script\r\n# requires-python = \">=3.8\"\r\n# dependencies = []\r\n# ///\r\n\"\"\"\r\nScan a local folder for PDF files and write a PowerShell script containing\r\n`fab cp` commands to upload each one to a Microsoft Fabric lakehouse Files section.\r\n\r\nThe generated script uses $PSScriptRoot to resolve paths relative to where\r\nthe .ps1 file is saved, so it works from any working directory.\r\n\r\nUsage example:\r\n python scripts/generate_upload_commands.py \\\r\n --local-folder \"C:\\\\Users\\\\rishi\\\\Data\\\\Booking PDFs\" \\\r\n --workspace \"Landon Finance Month End\" \\\r\n --lakehouse \"Lh_landon_finance_bronze\" \\\r\n --lakehouse-folder \"Booking PDFs\" \\\r\n --output-script \"outputs/my-run/upload_pdf_files.ps1\"\r\n\"\"\"\r\nimport argparse\r\nimport os\r\nimport sys\r\n\r\n\r\ndef main():\r\n parser = argparse.ArgumentParser(\r\n description=(\r\n \"Scan a local folder for PDF files and write a PowerShell script \"\r\n \"containing `fab cp` commands to upload each one to a Microsoft \"\r\n \"Fabric lakehouse Files section. 
Paths in the script are resolved \"\r\n \"relative to the script's saved location via $PSScriptRoot.\"\r\n ),\r\n formatter_class=argparse.RawDescriptionHelpFormatter,\r\n epilog=(\r\n \"Example:\\n\"\r\n ' python scripts/generate_upload_commands.py \\\\\\n'\r\n ' --local-folder \"C:\\\\\\\\Users\\\\\\\\rishi\\\\\\\\Data\\\\\\\\Booking PDFs\" \\\\\\n'\r\n ' --workspace \"Landon Finance Month End\" \\\\\\n'\r\n ' --lakehouse \"Lh_landon_finance_bronze\" \\\\\\n'\r\n ' --lakehouse-folder \"Booking PDFs\" \\\\\\n'\r\n ' --output-script \"outputs/my-run/upload_pdf_files.ps1\"\\n'\r\n ),\r\n )\r\n parser.add_argument(\r\n \"--local-folder\", required=True,\r\n help='Exact absolute path to the local folder containing PDF files.',\r\n )\r\n parser.add_argument(\r\n \"--workspace\", required=True,\r\n help='Fabric workspace name (exact, case-sensitive).',\r\n )\r\n parser.add_argument(\r\n \"--lakehouse\", required=True,\r\n help='Lakehouse name (exact, case-sensitive).',\r\n )\r\n parser.add_argument(\r\n \"--lakehouse-folder\", required=True,\r\n help='Destination folder under the Files section of the lakehouse.',\r\n )\r\n parser.add_argument(\r\n \"--output-script\", required=True,\r\n help='Path where the generated .ps1 file should be saved.',\r\n )\r\n args = parser.parse_args()\r\n\r\n local_folder = os.path.abspath(args.local_folder)\r\n if not os.path.isdir(local_folder):\r\n print(f\"ERROR: Local folder not found: {local_folder}\", file=sys.stderr)\r\n print(\"Expected: a valid absolute directory path containing .pdf files.\", file=sys.stderr)\r\n sys.exit(1)\r\n\r\n pdf_files = sorted(f for f in os.listdir(local_folder) if f.lower().endswith(\".pdf\"))\r\n if not pdf_files:\r\n print(f\"ERROR: No PDF files found in: {local_folder}\", file=sys.stderr)\r\n sys.exit(1)\r\n\r\n script_dir = os.path.abspath(os.path.dirname(args.output_script))\r\n rel_pdf_path = os.path.relpath(local_folder, script_dir)\r\n\r\n lines = [\r\n \"# \" + \"=\" * 77,\r\n \"# Upload PDF 
Files to Bronze Lakehouse — Fabric CLI Script\",\r\n f\"# Workspace : {args.workspace}\",\r\n f\"# Lakehouse : {args.lakehouse}\",\r\n f\"# Destination: Files/{args.lakehouse_folder}\",\r\n f\"# PDF source : {local_folder}\",\r\n \"# \" + \"=\" * 77,\r\n \"# Paths are resolved relative to this script's saved location.\",\r\n \"# You can run this script from any working directory.\",\r\n \"# \" + \"=\" * 77,\r\n \"# NOTE: This script uploads files one at a time via the Fabric CLI.\",\r\n \"# For 50+ files, Options 1 (OneLake File Explorer) or 2 (Fabric UI)\",\r\n \"# are significantly faster.\",\r\n \"# \" + \"=\" * 77,\r\n \"\",\r\n \"# Absolute path to the local PDF folder\",\r\n f'$pdfFolder = \"{local_folder}\"',\r\n 'Write-Host \"PDF source folder: $pdfFolder\" -ForegroundColor Cyan',\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 1 - Install the Fabric CLI\",\r\n \"# Comment out this line if you already have the Fabric CLI installed.\",\r\n \"# \" + \"-\" * 77,\r\n \"# pip install ms-fabric-cli\",\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 2 - Authenticate with Fabric\",\r\n \"# Comment out this line if you are already authenticated.\",\r\n \"# \" + \"-\" * 77,\r\n \"# fab auth login\",\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 3 - Create the destination folder in the lakehouse (if it doesn't exist)\",\r\n \"# \" + \"-\" * 77,\r\n \"\",\r\n f'fab mkdir \"{args.workspace}.Workspace/{args.lakehouse}.Lakehouse/Files/{args.lakehouse_folder}\"',\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 4 - Upload PDF files\",\r\n \"# fab cp requires a ./ prefix to identify the source as a local file.\",\r\n \"# Push-Location sets the working directory so ./filename works correctly.\",\r\n \"# \" + \"-\" * 77,\r\n \"\",\r\n \"Push-Location $pdfFolder\",\r\n \"\",\r\n ]\r\n\r\n for filename in pdf_files:\r\n dest = (\r\n f\"{args.workspace}.Workspace/\"\r\n f\"{args.lakehouse}.Lakehouse/\"\r\n f\"Files/{args.lakehouse_folder}/{filename}\"\r\n )\r\n 
lines.append(f'fab cp \"./{filename}\" `')\r\n lines.append(f' \"{dest}\"')\r\n lines.append(\"\")\r\n\r\n lines += [\r\n \"Pop-Location\",\r\n \"\",\r\n 'Write-Host \"Upload complete. Please verify the files are visible in the '\r\n f'lakehouse Files/{args.lakehouse_folder} section before proceeding.\" '\r\n '-ForegroundColor Green',\r\n ]\r\n\r\n os.makedirs(os.path.dirname(os.path.abspath(args.output_script)), exist_ok=True)\r\n with open(args.output_script, \"w\", encoding=\"utf-8\") as f:\r\n f.write(\"\\n\".join(lines))\r\n\r\n print(f\"Script written to: {os.path.abspath(args.output_script)}\", file=sys.stderr)\r\n print(f\"{len(pdf_files)} PDF file(s) included.\", file=sys.stderr)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n main()\r\n",
+
content: "# /// script\r\n# requires-python = \">=3.8\"\r\n# dependencies = []\r\n# ///\r\n\"\"\"\r\nScan a local folder for PDF files and write a PowerShell script containing\r\n`fab cp` commands to upload each one to a Microsoft Fabric lakehouse Files section.\r\n\r\nThe generated script uses $PSScriptRoot to resolve paths relative to where\r\nthe .ps1 file is saved, so it works from any working directory.\r\n\r\nUsage example:\r\n python scripts/generate_upload_commands.py \\\r\n --local-folder \"C:\\\\Users\\\\rishi\\\\Data\\\\Booking PDFs\" \\\r\n --workspace \"Landon Finance Month End\" \\\r\n --lakehouse \"Lh_landon_finance_bronze\" \\\r\n --lakehouse-folder \"Booking PDFs\" \\\r\n --output-script \"outputs/my-run/upload_pdf_files.ps1\"\r\n\"\"\"\r\nimport argparse\r\nimport os\r\nimport sys\r\n\r\n\r\ndef main():\r\n parser = argparse.ArgumentParser(\r\n description=(\r\n \"Scan a local folder for PDF files and write a PowerShell script \"\r\n \"containing `fab cp` commands to upload each one to a Microsoft \"\r\n \"Fabric lakehouse Files section. 
Paths in the script are resolved \"\r\n \"relative to the script's saved location via $PSScriptRoot.\"\r\n ),\r\n formatter_class=argparse.RawDescriptionHelpFormatter,\r\n epilog=(\r\n \"Example:\\n\"\r\n ' python scripts/generate_upload_commands.py \\\\\\n'\r\n ' --local-folder \"C:\\\\\\\\Users\\\\\\\\rishi\\\\\\\\Data\\\\\\\\Booking PDFs\" \\\\\\n'\r\n ' --workspace \"Landon Finance Month End\" \\\\\\n'\r\n ' --lakehouse \"Lh_landon_finance_bronze\" \\\\\\n'\r\n ' --lakehouse-folder \"Booking PDFs\" \\\\\\n'\r\n ' --output-script \"outputs/my-run/upload_pdf_files.ps1\"\\n'\r\n ),\r\n )\r\n parser.add_argument(\r\n \"--local-folder\", required=True,\r\n help='Exact absolute path to the local folder containing PDF files.',\r\n )\r\n parser.add_argument(\r\n \"--workspace\", required=True,\r\n help='Fabric workspace name (exact, case-sensitive).',\r\n )\r\n parser.add_argument(\r\n \"--lakehouse\", required=True,\r\n help='Lakehouse name (exact, case-sensitive).',\r\n )\r\n parser.add_argument(\r\n \"--lakehouse-folder\", required=True,\r\n help='Destination folder under the Files section of the lakehouse.',\r\n )\r\n parser.add_argument(\r\n \"--output-script\", required=True,\r\n help='Path where the generated .ps1 file should be saved.',\r\n )\r\n args = parser.parse_args()\r\n\r\n local_folder = os.path.abspath(args.local_folder)\r\n if not os.path.isdir(local_folder):\r\n print(f\"ERROR: Local folder not found: {local_folder}\", file=sys.stderr)\r\n print(\"Expected: a valid absolute directory path containing .pdf files.\", file=sys.stderr)\r\n sys.exit(1)\r\n\r\n pdf_files = sorted(f for f in os.listdir(local_folder) if f.lower().endswith(\".pdf\"))\r\n if not pdf_files:\r\n print(f\"ERROR: No PDF files found in: {local_folder}\", file=sys.stderr)\r\n sys.exit(1)\r\n\r\n script_dir = os.path.abspath(os.path.dirname(args.output_script))\r\n rel_pdf_path = os.path.relpath(local_folder, script_dir)\r\n\r\n lines = [\r\n \"# \" + \"=\" * 77,\r\n \"# Upload PDF 
Files to Bronze Lakehouse — Fabric CLI Script\",\r\n f\"# Workspace : {args.workspace}\",\r\n f\"# Lakehouse : {args.lakehouse}\",\r\n f\"# Destination: Files/{args.lakehouse_folder}\",\r\n f\"# PDF source : {local_folder}\",\r\n \"# \" + \"=\" * 77,\r\n \"# Paths are resolved relative to this script's saved location.\",\r\n \"# You can run this script from any working directory.\",\r\n \"# \" + \"=\" * 77,\r\n \"# NOTE: This script uploads files one at a time via the Fabric CLI.\",\r\n \"# For 50+ files, Options 1 (OneLake File Explorer) or 2 (Fabric UI)\",\r\n \"# are significantly faster.\",\r\n \"# \" + \"=\" * 77,\r\n \"\",\r\n \"# Absolute path to the local PDF folder\",\r\n f'$pdfFolder = \"{local_folder}\"',\r\n 'Write-Host \"PDF source folder: $pdfFolder\" -ForegroundColor Cyan',\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 1 - Install the Fabric CLI\",\r\n \"# Comment out this line if you already have the Fabric CLI installed.\",\r\n \"# \" + \"-\" * 77,\r\n \"# pip install ms-fabric-cli\",\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 2 - Authenticate with Fabric\",\r\n \"# Comment out this line if you are already authenticated.\",\r\n \"# \" + \"-\" * 77,\r\n \"# fab auth login\",\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 3 - Create the destination folder in the lakehouse (if it doesn't exist)\",\r\n \"# \" + \"-\" * 77,\r\n \"\",\r\n f'fab mkdir \"{args.workspace}.Workspace/{args.lakehouse}.Lakehouse/Files/{args.lakehouse_folder}\" -f',\r\n \"\",\r\n \"# \" + \"-\" * 77,\r\n \"# STEP 4 - Upload PDF files\",\r\n \"# fab cp requires a ./ prefix to identify the source as a local file.\",\r\n \"# Push-Location sets the working directory so ./filename works correctly.\",\r\n \"# \" + \"-\" * 77,\r\n \"\",\r\n \"Push-Location $pdfFolder\",\r\n \"\",\r\n ]\r\n\r\n for filename in pdf_files:\r\n dest = (\r\n f\"{args.workspace}.Workspace/\"\r\n f\"{args.lakehouse}.Lakehouse/\"\r\n f\"Files/{args.lakehouse_folder}/{filename}\"\r\n )\r\n 
lines.append(f'fab cp \"./{filename}\" `')\r\n lines.append(f' \"{dest}\"')\r\n lines.append(\"\")\r\n\r\n lines += [\r\n \"Pop-Location\",\r\n \"\",\r\n 'Write-Host \"Upload complete. Please verify the files are visible in the '\r\n f'lakehouse Files/{args.lakehouse_folder} section before proceeding.\" '\r\n '-ForegroundColor Green',\r\n ]\r\n\r\n os.makedirs(os.path.dirname(os.path.abspath(args.output_script)), exist_ok=True)\r\n with open(args.output_script, \"w\", encoding=\"utf-8\") as f:\r\n f.write(\"\\n\".join(lines))\r\n\r\n print(f\"Script written to: {os.path.abspath(args.output_script)}\", file=sys.stderr)\r\n print(f\"{len(pdf_files)} PDF file(s) included.\", file=sys.stderr)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n main()\r\n",
},
],
},
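The only functional difference between the removed and added `content` strings for `scripts/generate_upload_commands.py` is that the generated `fab mkdir` line now ends with ` -f`. The sketch below isolates how steps 3 and 4 of the generated PowerShell script are assembled, so the change is easier to see. The helper name `build_upload_lines` and the four-space continuation indent are illustrative assumptions, not part of the package; the command strings themselves follow the 0.0.24 text shown in the diff.

```python
def build_upload_lines(workspace, lakehouse, lakehouse_folder, pdf_files):
    """Return the fab command lines for steps 3 and 4 of the generated script.

    Hypothetical helper for illustration; the packaged script builds these
    lines inline inside main().
    """
    base = f"{workspace}.Workspace/{lakehouse}.Lakehouse/Files/{lakehouse_folder}"

    # Step 3: create the destination folder.
    # 0.0.24 appends " -f" here; 0.0.23 emitted the same command without it.
    lines = [f'fab mkdir "{base}" -f', ""]

    # Step 4: one `fab cp` per PDF, in sorted order as in the original script.
    for filename in sorted(pdf_files):
        # The trailing backtick is PowerShell's line-continuation character.
        lines.append(f'fab cp "./{filename}" `')
        lines.append(f'    "{base}/{filename}"')
        lines.append("")
    return lines


if __name__ == "__main__":
    demo = build_upload_lines(
        "Landon Finance Month End",
        "Lh_landon_finance_bronze",
        "Booking PDFs",
        ["invoice_002.pdf", "invoice_001.pdf"],  # made-up file names
    )
    print("\n".join(demo))
```

The `./` prefix on the source path and the surrounding `Push-Location` / `Pop-Location` pair are unchanged between the two versions, so they are omitted from this sketch.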
package/package.json
CHANGED