@rishildi/ldi-process-skills 0.1.6 → 0.1.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,5 +1,5 @@
  // AUTO-GENERATED by scripts/embed-skills.ts — do not edit
- // Generated at: 2026-04-04T21:59:19.892Z
+ // Generated at: 2026-04-04T22:01:57.693Z
  export const EMBEDDED_SKILLS = [
  {
  name: "create-fabric-lakehouses",
@@ -153,7 +153,7 @@ export const EMBEDDED_SKILLS = [
  files: [
  {
  relativePath: "SKILL.md",
- content: "---\r\nname: csv-to-bronze-delta-tables\r\ndescription: >\r\n Use this skill to upload CSV files from a local machine into a Microsoft Fabric\r\n bronze lakehouse and convert them to delta tables. Triggers on: \"create delta\r\n tables from CSV files\", \"load CSVs into bronze lakehouse\", \"upload CSV to Fabric\r\n and create tables\", \"ingest CSV files to delta format in Fabric\", \"create bronze\r\n tables from local CSV\". Does NOT trigger for creating lakehouses, transforming\r\n existing delta tables, or non-Fabric storage targets.\r\nlicense: MIT\r\ncompatibility: Python 3.8+ required for scripts/. Fabric CLI (fab) must be installed for the CLI upload option.\r\n---\r\n\r\n# CSV to Bronze Delta Tables\r\n\r\nUploads CSV files from an operator's local machine to a Microsoft Fabric bronze\r\nlakehouse and converts them to delta tables. The lakehouse must already exist.\r\n\r\n> ⚠️ **GOVERNANCE RULE**: This skill **never executes `fab` CLI commands directly**.\r\n> All `fab cp`, `fab ln`, and `fab ls` commands are presented to the operator as\r\n> script blocks for them to run. 
The agent only generates and presents commands.\r\n\r\n## Inputs\r\n\r\n| Parameter | Description | Example |\r\n|-----------|-------------|---------|\r\n| `WORKSPACE_NAME` | Fabric workspace name (exact, case-sensitive) | `\"Landon Finance Month End\"` |\r\n| `LAKEHOUSE_NAME` | Bronze lakehouse name (exact, case-sensitive) | `\"Lh_landon_finance_bronze\"` |\r\n| `LOCAL_CSV_FOLDER` | Relative path to local folder containing CSV files (CLI upload only) | `\"./Data\"` |\r\n| `LAKEHOUSE_FILES_FOLDER` | Folder name under the Files section of the lakehouse | `\"raw\"` |\r\n\r\n## Workflow\r\n\r\n- [ ] **Collect parameters** — If `WORKSPACE_NAME` or `LAKEHOUSE_NAME` are not\r\n provided, ask the operator for them before proceeding.\r\n\r\n- [ ] **Upload CSV files** — Present these three options and ask the operator to\r\n choose one:\r\n\r\n **Option 1 — OneLake File Explorer (Manual)**\r\n Open the OneLake File Explorer desktop app and drag-and-drop the CSV files into\r\n the target folder under the lakehouse Files section. No agent action required.\r\n\r\n **Option 2 — Fabric UI (Manual)**\r\n In the Fabric browser UI navigate to the lakehouse → Files section, open or create\r\n the target folder, click **Upload** and select the CSV files. No agent action required.\r\n\r\n **Option 3 — Fabric CLI (Automated)**\r\n > ⚠️ **Requires PowerShell** — generates a `.ps1` script. PowerShell is available\r\n > on Windows natively and on Mac/Linux via `brew install powershell`. If PowerShell\r\n > is not available and the operator does not want to install it, use Option 1 or 2.\r\n > Do not substitute a bash or shell script.\r\n >\r\n > ⚠️ **Performance note**: The CLI uploads files one at a time. 
For large\r\n > batches (50+ files) this is significantly slower than Options 1 or 2.\r\n > Recommend Options 1 or 2 for bulk uploads.\r\n\r\n Ask for `LOCAL_CSV_FOLDER` as the **exact absolute path** to the local folder\r\n and `LAKEHOUSE_FILES_FOLDER` (the destination folder name under Files). Then run:\r\n ```\r\n python scripts/generate_upload_commands.py \\\r\n --local-folder \"<LOCAL_CSV_FOLDER>\" \\\r\n --workspace \"<WORKSPACE_NAME>\" \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-script \"<OUTPUT_FOLDER>/upload_csv_files.ps1\"\r\n ```\r\n The script generates a PowerShell `.ps1` file saved directly to the outputs folder.\r\n Present the script path to the operator and ask them to run it with `pwsh upload_csv_files.ps1`.\r\n\r\n## Output Folder\r\n\r\nBefore beginning the workflow, create the output folder:\r\n```\r\noutputs/csv-to-bronze-delta-tables_{YYYY-MM-DD_HH-MM}_{USERNAME}/\r\n```\r\nAll scripts produced during the run are saved here.\r\n\r\n- [ ] **Confirm upload** — Ask the operator to confirm the CSV files are visible\r\n in the Files section of the lakehouse before proceeding.\r\n\r\n- [ ] **Create delta tables** — If `LAKEHOUSE_FILES_FOLDER` was not already\r\n captured above, ask for it now. Present these two options:\r\n\r\n **Option 1 — Fabric UI (Manual)**\r\n > Quick and easy — recommended for most users.\r\n In the Fabric browser UI navigate to the lakehouse → Files →\r\n `<LAKEHOUSE_FILES_FOLDER>`. For each CSV file: click the three-dot menu →\r\n **Load to Tables** → **New Table**. Accept the suggested table name (Fabric\r\n applies it automatically). 
No agent action required.\r\n\r\n **Option 2 — PySpark notebook (Automated)**\r\n Run:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\csv_to_delta_tables.ipynb\"\r\n ```\r\n This writes a ready-to-run `.ipynb` file to the outputs folder. Tell the operator:\r\n 1. In the Fabric UI go to the workspace → **New** → **Import notebook**\r\n 2. Select `csv_to_delta_tables.ipynb` from the outputs folder\r\n 3. Click **Run All** — the notebook attaches the lakehouse automatically\r\n **Validate**: confirm every cell printed `✅ Created table: <table_name>` with\r\n no errors. If any `❌` lines appear, report the error message to the operator.\r\n\r\n## Table Naming Convention\r\n\r\nCSV filename → delta table name:\r\n- Strip `.csv` extension\r\n- Convert to lowercase\r\n- Replace any non-alphanumeric characters (spaces, hyphens, dots) with underscores\r\n- Strip leading/trailing underscores\r\n\r\nExamples:\r\n| CSV filename | Delta table name |\r\n|---|---|\r\n| `Revenue Data.csv` | `revenue_data` |\r\n| `Landon hotel revenue data.csv` | `landon_hotel_revenue_data` |\r\n| `Q1-Sales.csv` | `q1_sales` |\r\n\r\n## Column Naming Convention\r\n\r\nWhen CSVs are loaded into delta tables via the PySpark notebook (Option 2 of\r\ndelta table creation), a `clean_columns()` function transforms every column name:\r\n\r\n- Convert to lowercase\r\n- Replace spaces, hyphens, and other non-alphanumeric characters with underscores\r\n- Strip leading/trailing underscores\r\n\r\n| CSV column header | Delta table column name |\r\n|---|---|\r\n| `Hotel ID` | `hotel_id` |\r\n| `No_of_Rooms` | `no_of_rooms` |\r\n| `Total Revenue (GBP)` | `total_revenue_gbp` |\r\n| `First Name` | `first_name` |\r\n\r\n> **Important for downstream skills:** When writing SQL queries against bronze\r\n> delta tables (e.g., in the `create-materialised-lakeview-scripts` 
skill),\r\n> always use the cleaned column names — not the original CSV headers.\r\n\r\n## Output Format\r\n\r\nDelta tables appear under the **Tables** section of the bronze lakehouse in the\r\nFabric UI, named according to the convention above. Each table is queryable via\r\nthe lakehouse SQL endpoint and PySpark.\r\n\r\n## Gotchas\r\n\r\n- `fab cp` uses the path prefix to identify local vs OneLake paths. **Absolute\r\n Windows paths (`C:\\...`) are not recognised as local** and cause a\r\n `[NotSupported] Source and destination must be of the same type` error. Always\r\n use `Push-Location` into the source folder and `./filename` (forward slash,\r\n not backslash) syntax — confirmed working pattern.\r\n- **The destination folder must exist before running `fab cp`.** Always run\r\n `fab mkdir \"{WORKSPACE}.Workspace/{LAKEHOUSE}.Lakehouse/Files/{FOLDER}\"` first.\r\n Running `fab mkdir` on an already-existing folder is safe and does not error.\r\n- `WORKSPACE_NAME` and `LAKEHOUSE_NAME` are case-sensitive and must exactly match\r\n what appears in the Fabric UI.\r\n- Shortcuts (Option 1 for delta table creation) use Fabric's automatic schema\r\n inference. They may fail if column names contain spaces or if data types are\r\n inconsistent. 
Switch to Option 2 (PySpark notebook) in those cases.\r\n- The PySpark notebook attaches the lakehouse automatically via `%%configure` in\r\n Cell 1 — no manual attachment needed before running.\r\n- When using the Fabric CLI, run all commands from the directory that\r\n `LOCAL_CSV_FOLDER` is relative to (typically the project root).\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_upload_commands.py`** — Scans a local CSV folder and outputs\r\n `fab cp` commands to upload each file to the lakehouse Files section.\r\n Run: `python scripts/generate_upload_commands.py --help`\r\n- **`scripts/generate_notebook.py`** — Generates a Fabric-compatible `.ipynb`\r\n notebook pre-configured with the correct lakehouse and `FILES_FOLDER`. The\r\n notebook attaches the lakehouse automatically via `%%configure`. Import into\r\n Fabric via **New → Import notebook**.\r\n Run: `python scripts/generate_notebook.py --help`\r\n",
+ content: "---\r\nname: csv-to-bronze-delta-tables\r\ndescription: >\r\n Use this skill to upload CSV files from a local machine into a Microsoft Fabric\r\n bronze lakehouse and convert them to delta tables. Triggers on: \"create delta\r\n tables from CSV files\", \"load CSVs into bronze lakehouse\", \"upload CSV to Fabric\r\n and create tables\", \"ingest CSV files to delta format in Fabric\", \"create bronze\r\n tables from local CSV\". Does NOT trigger for creating lakehouses, transforming\r\n existing delta tables, or non-Fabric storage targets.\r\nlicense: MIT\r\ncompatibility: Python 3.8+ required for scripts/. Fabric CLI (fab) must be installed for the CLI upload option.\r\n---\r\n\r\n# CSV to Bronze Delta Tables\r\n\r\nUploads CSV files from an operator's local machine to a Microsoft Fabric bronze\r\nlakehouse and converts them to delta tables. The lakehouse must already exist.\r\n\r\n> ⚠️ **GOVERNANCE RULE**: This skill **never executes `fab` CLI commands directly**.\r\n> All `fab cp`, `fab ln`, and `fab ls` commands are presented to the operator as\r\n> script blocks for them to run. The agent only generates and presents commands.\r\n>\r\n> ⚠️ **GENERATION**: Always run `scripts/generate_notebook.py` via Bash to produce\r\n> the `.ipynb` notebook — never generate notebook cell content directly. 
The\r\n> generated notebook uses native PySpark (`%%configure`, `spark.read.csv`,\r\n> `df.write.format(\"delta\")`) — it does not use `fab` CLI or `FAB_TOKEN` auth.\r\n\r\n## Inputs\r\n\r\n| Parameter | Description | Example |\r\n|-----------|-------------|---------|\r\n| `WORKSPACE_NAME` | Fabric workspace name (exact, case-sensitive) | `\"Landon Finance Month End\"` |\r\n| `LAKEHOUSE_NAME` | Bronze lakehouse name (exact, case-sensitive) | `\"Lh_landon_finance_bronze\"` |\r\n| `LOCAL_CSV_FOLDER` | Relative path to local folder containing CSV files (CLI upload only) | `\"./Data\"` |\r\n| `LAKEHOUSE_FILES_FOLDER` | Folder name under the Files section of the lakehouse | `\"raw\"` |\r\n\r\n## Workflow\r\n\r\n- [ ] **Collect parameters** — If `WORKSPACE_NAME` or `LAKEHOUSE_NAME` are not\r\n provided, ask the operator for them before proceeding.\r\n\r\n- [ ] **Upload CSV files** — Present these three options and ask the operator to\r\n choose one:\r\n\r\n **Option 1 — OneLake File Explorer (Manual)**\r\n Open the OneLake File Explorer desktop app and drag-and-drop the CSV files into\r\n the target folder under the lakehouse Files section. No agent action required.\r\n\r\n **Option 2 — Fabric UI (Manual)**\r\n In the Fabric browser UI navigate to the lakehouse → Files section, open or create\r\n the target folder, click **Upload** and select the CSV files. No agent action required.\r\n\r\n **Option 3 — Fabric CLI (Automated)**\r\n > ⚠️ **Requires PowerShell** — generates a `.ps1` script. PowerShell is available\r\n > on Windows natively and on Mac/Linux via `brew install powershell`. If PowerShell\r\n > is not available and the operator does not want to install it, use Option 1 or 2.\r\n > Do not substitute a bash or shell script.\r\n >\r\n > ⚠️ **Performance note**: The CLI uploads files one at a time. 
For large\r\n > batches (50+ files) this is significantly slower than Options 1 or 2.\r\n > Recommend Options 1 or 2 for bulk uploads.\r\n\r\n Ask for `LOCAL_CSV_FOLDER` as the **exact absolute path** to the local folder\r\n and `LAKEHOUSE_FILES_FOLDER` (the destination folder name under Files). Then run:\r\n ```\r\n python scripts/generate_upload_commands.py \\\r\n --local-folder \"<LOCAL_CSV_FOLDER>\" \\\r\n --workspace \"<WORKSPACE_NAME>\" \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-script \"<OUTPUT_FOLDER>/upload_csv_files.ps1\"\r\n ```\r\n The script generates a PowerShell `.ps1` file saved directly to the outputs folder.\r\n Present the script path to the operator and ask them to run it with `pwsh upload_csv_files.ps1`.\r\n\r\n## Output Folder\r\n\r\nBefore beginning the workflow, create the output folder:\r\n```\r\noutputs/csv-to-bronze-delta-tables_{YYYY-MM-DD_HH-MM}_{USERNAME}/\r\n```\r\nAll scripts produced during the run are saved here.\r\n\r\n- [ ] **Confirm upload** — Ask the operator to confirm the CSV files are visible\r\n in the Files section of the lakehouse before proceeding.\r\n\r\n- [ ] **Create delta tables** — If `LAKEHOUSE_FILES_FOLDER` was not already\r\n captured above, ask for it now. Present these two options:\r\n\r\n **Option 1 — Fabric UI (Manual)**\r\n > Quick and easy — recommended for most users.\r\n In the Fabric browser UI navigate to the lakehouse → Files →\r\n `<LAKEHOUSE_FILES_FOLDER>`. For each CSV file: click the three-dot menu →\r\n **Load to Tables** → **New Table**. Accept the suggested table name (Fabric\r\n applies it automatically). 
No agent action required.\r\n\r\n **Option 2 — PySpark notebook (Automated)**\r\n Run:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\csv_to_delta_tables.ipynb\"\r\n ```\r\n This writes a ready-to-run `.ipynb` file to the outputs folder. Tell the operator:\r\n 1. In the Fabric UI go to the workspace → **New** → **Import notebook**\r\n 2. Select `csv_to_delta_tables.ipynb` from the outputs folder\r\n 3. Click **Run All** — the notebook attaches the lakehouse automatically\r\n **Validate**: confirm every cell printed `✅ Created table: <table_name>` with\r\n no errors. If any `❌` lines appear, report the error message to the operator.\r\n\r\n## Table Naming Convention\r\n\r\nCSV filename → delta table name:\r\n- Strip `.csv` extension\r\n- Convert to lowercase\r\n- Replace any non-alphanumeric characters (spaces, hyphens, dots) with underscores\r\n- Strip leading/trailing underscores\r\n\r\nExamples:\r\n| CSV filename | Delta table name |\r\n|---|---|\r\n| `Revenue Data.csv` | `revenue_data` |\r\n| `Landon hotel revenue data.csv` | `landon_hotel_revenue_data` |\r\n| `Q1-Sales.csv` | `q1_sales` |\r\n\r\n## Column Naming Convention\r\n\r\nWhen CSVs are loaded into delta tables via the PySpark notebook (Option 2 of\r\ndelta table creation), a `clean_columns()` function transforms every column name:\r\n\r\n- Convert to lowercase\r\n- Replace spaces, hyphens, and other non-alphanumeric characters with underscores\r\n- Strip leading/trailing underscores\r\n\r\n| CSV column header | Delta table column name |\r\n|---|---|\r\n| `Hotel ID` | `hotel_id` |\r\n| `No_of_Rooms` | `no_of_rooms` |\r\n| `Total Revenue (GBP)` | `total_revenue_gbp` |\r\n| `First Name` | `first_name` |\r\n\r\n> **Important for downstream skills:** When writing SQL queries against bronze\r\n> delta tables (e.g., in the `create-materialised-lakeview-scripts` 
skill),\r\n> always use the cleaned column names — not the original CSV headers.\r\n\r\n## Output Format\r\n\r\nDelta tables appear under the **Tables** section of the bronze lakehouse in the\r\nFabric UI, named according to the convention above. Each table is queryable via\r\nthe lakehouse SQL endpoint and PySpark.\r\n\r\n## Gotchas\r\n\r\n- `fab cp` uses the path prefix to identify local vs OneLake paths. **Absolute\r\n Windows paths (`C:\\...`) are not recognised as local** and cause a\r\n `[NotSupported] Source and destination must be of the same type` error. Always\r\n use `Push-Location` into the source folder and `./filename` (forward slash,\r\n not backslash) syntax — confirmed working pattern.\r\n- **The destination folder must exist before running `fab cp`.** Always run\r\n `fab mkdir \"{WORKSPACE}.Workspace/{LAKEHOUSE}.Lakehouse/Files/{FOLDER}\"` first.\r\n Running `fab mkdir` on an already-existing folder is safe and does not error.\r\n- `WORKSPACE_NAME` and `LAKEHOUSE_NAME` are case-sensitive and must exactly match\r\n what appears in the Fabric UI.\r\n- Shortcuts (Option 1 for delta table creation) use Fabric's automatic schema\r\n inference. They may fail if column names contain spaces or if data types are\r\n inconsistent. 
Switch to Option 2 (PySpark notebook) in those cases.\r\n- The PySpark notebook attaches the lakehouse automatically via `%%configure` in\r\n Cell 1 — no manual attachment needed before running.\r\n- When using the Fabric CLI, run all commands from the directory that\r\n `LOCAL_CSV_FOLDER` is relative to (typically the project root).\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_upload_commands.py`** — Scans a local CSV folder and outputs\r\n `fab cp` commands to upload each file to the lakehouse Files section.\r\n Run: `python scripts/generate_upload_commands.py --help`\r\n- **`scripts/generate_notebook.py`** — Generates a Fabric-compatible `.ipynb`\r\n notebook pre-configured with the correct lakehouse and `FILES_FOLDER`. The\r\n notebook attaches the lakehouse automatically via `%%configure`. Import into\r\n Fabric via **New → Import notebook**.\r\n Run: `python scripts/generate_notebook.py --help`\r\n",
  },
  {
  relativePath: "assets/pyspark_notebook_template.py",
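The table and column naming convention described in the embedded SKILL.md above (strip the `.csv` extension, lowercase, collapse runs of non-alphanumeric characters to underscores, strip leading/trailing underscores) can be sketched as a small Python helper. This is a hypothetical illustration of the stated rules, not the packaged `clean_columns()` implementation:

```python
import re

def clean_name(raw: str) -> str:
    """Normalise a CSV filename or column header to a delta-safe name:
    drop a trailing .csv extension, lowercase, replace runs of
    non-alphanumeric characters with single underscores, and strip
    leading/trailing underscores."""
    raw = re.sub(r"\.csv$", "", raw, flags=re.IGNORECASE)
    name = re.sub(r"[^a-z0-9]+", "_", raw.lower())
    return name.strip("_")

# Examples taken from the convention tables in SKILL.md
print(clean_name("Revenue Data.csv"))     # revenue_data
print(clean_name("Q1-Sales.csv"))         # q1_sales
print(clean_name("Total Revenue (GBP)"))  # total_revenue_gbp
print(clean_name("Hotel ID"))             # hotel_id
```

Applying one rule to both filenames and headers keeps table and column names consistent with what the generated PySpark notebook writes.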
@@ -231,7 +231,7 @@ export const EMBEDDED_SKILLS = [
  files: [
  {
  relativePath: "SKILL.md",
- content: "---\r\nname: pdf-to-bronze-delta-tables\r\ndescription: >\r\n Use this skill to extract structured data from PDF files on an operator's\r\n local machine, upload them to a Microsoft Fabric bronze lakehouse, and convert\r\n them to a delta table using AI-powered field extraction. Triggers on: \"create\r\n delta tables from PDFs\", \"extract data from PDF invoices to Fabric\", \"load\r\n PDFs into bronze lakehouse\", \"parse PDF documents to delta format\", \"ingest\r\n PDF files to Fabric tables\". Does NOT trigger for CSV/Excel ingestion,\r\n transforming existing delta tables, or non-Fabric storage targets.\r\nlicense: MIT\r\ncompatibility: >\r\n Python 3.8+ for scripts/. Fabric CLI (fab) for CLI upload option.\r\n Fabric notebook runtime 1.3 required (for synapse.ml.aifunc).\r\n---\r\n\r\n# PDF to Bronze Delta Tables\r\n\r\nUploads PDF files from a local machine to a Microsoft Fabric bronze lakehouse\r\nand converts each PDF into a row in a delta table using AI field extraction.\r\nThe lakehouse must already exist.\r\n\r\n> ⚠️ **GOVERNANCE RULE**: This skill **never executes `fab` CLI commands directly**.\r\n> All `fab` commands are written to a PowerShell script for the operator to run.\r\n\r\n## Inputs\r\n\r\n| Parameter | Description | Example |\r\n|-----------|-------------|---------|\r\n| `WORKSPACE_NAME` | Fabric workspace name (exact, case-sensitive) | `\"Landon Finance Month End\"` |\r\n| `LAKEHOUSE_NAME` | Bronze lakehouse name (exact, case-sensitive) | `\"Lh_landon_finance_bronze\"` |\r\n| `LAKEHOUSE_FILES_FOLDER` | Folder name under lakehouse Files section | `\"Booking PDFs\"` |\r\n| `TABLE_NAME` | Target delta table name (snake_case) | `\"booking_invoices\"` |\r\n| `LOCAL_PDF_FOLDER` | Exact absolute path to local PDF folder (CLI upload only) | `\"C:\\Users\\rishi\\Data\\Booking PDFs\"` |\r\n| `FIELDS` | Fields to extract from each PDF — collected in Step 2 | See workflow |\r\n\r\n## Workflow\r\n\r\n- [ ] **Collect parameters** — If 
`WORKSPACE_NAME` or `LAKEHOUSE_NAME` are not\r\n provided, ask the operator for them before proceeding.\r\n\r\n- [ ] **Suggest and confirm extraction fields** — Before asking the operator to\r\n define fields from scratch, the agent should **read a sample PDF** to understand\r\n the document structure and proactively suggest fields:\r\n\r\n 1. Use `pdfplumber` (or equivalent) to extract text from 1–2 sample PDFs in\r\n `LOCAL_PDF_FOLDER`. If a second PDF is from a different sub-group (e.g.\r\n different property/entity), include it to confirm layout consistency.\r\n 2. Identify all extractable fields from the document structure (headers, labels,\r\n line items, totals, payment details, etc.).\r\n 3. Present the suggested fields to the operator in a table format, split into:\r\n - **Header-level fields** (one row per PDF) — for the main table\r\n - **Line-item fields** (multiple rows per PDF) — for the detail table, if\r\n the document contains repeating line items\r\n 4. For each field, show: `snake_case` name, extraction hint for the AI, and an\r\n example value from the sample PDF.\r\n 5. Ask the operator:\r\n - \"Do these fields look right? Anything to add, remove, or rename?\"\r\n - \"What should the main delta table be named?\" → `TABLE_NAME`\r\n - \"Do you want a second table for line/detail items?\" If yes:\r\n → `LINE_ITEMS_TABLE_NAME` and confirm the line-item fields\r\n - \"What folder name will the PDFs be stored in under the lakehouse Files\r\n section?\" → `LAKEHOUSE_FILES_FOLDER`\r\n 6. 
**Do not proceed until the operator confirms the fields.**\r\n\r\n Build `FIELDS` as a JSON array: `[{\"name\": \"...\", \"description\": \"...\"}, ...]`\r\n\r\n If the operator confirmed a second line-items table, build `LINE_ITEMS_FIELDS`\r\n as a JSON array: `[{\"name\": \"...\", \"description\": \"...\"}, ...]`\r\n\r\n- [ ] **Upload PDFs** — Present these three options and ask the operator to choose:\r\n\r\n **Option 1 — OneLake File Explorer (Manual)**\r\n Drag-and-drop the PDFs into the target folder under the lakehouse Files section\r\n using the OneLake File Explorer desktop app. No agent action required.\r\n\r\n **Option 2 — Fabric UI (Manual)**\r\n In the Fabric browser UI navigate to the lakehouse → Files section → open or\r\n create the `LAKEHOUSE_FILES_FOLDER` folder → click **Upload** and select the\r\n PDF files. No agent action required.\r\n\r\n **Option 3 — Fabric CLI (Automated)**\r\n > ⚠️ **Requires PowerShell** — generates a `.ps1` script. PowerShell is available\r\n > on Windows natively and on Mac/Linux via `brew install powershell`. If PowerShell\r\n > is not available and the operator does not want to install it, use Option 1 or 2.\r\n > Do not substitute a bash or shell script.\r\n >\r\n > ⚠️ **Performance note**: The CLI uploads files one at a time. For large\r\n > batches (50+ files) this is significantly slower than Options 1 or 2.\r\n > Recommend Options 1 or 2 for bulk uploads.\r\n\r\n Ask for `LOCAL_PDF_FOLDER` (exact absolute path). 
Then run:\r\n ```\r\n python scripts/generate_upload_commands.py \\\r\n --local-folder \"<LOCAL_PDF_FOLDER>\" \\\r\n --workspace \"<WORKSPACE_NAME>\" \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-script \"<OUTPUT_FOLDER>/upload_pdf_files.ps1\"\r\n ```\r\n Present the script path to the operator and ask them to run it with `pwsh upload_pdf_files.ps1`.\r\n\r\n## Output Folder\r\n\r\nBefore beginning, create the output folder:\r\n```\r\noutputs/pdf-to-bronze-delta-tables_{YYYY-MM-DD_HH-MM}_{USERNAME}/\r\n```\r\nAll generated scripts and notebooks for this run are saved here.\r\n\r\n- [ ] **Confirm upload** — Ask the operator to confirm all PDFs are visible in the\r\n lakehouse Files section before proceeding.\r\n\r\n- [ ] **Generate TEST notebook** — Run:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --table-name \"<TABLE_NAME>\" \\\r\n --fields-json \"<FIELDS_JSON>\" \\\r\n [--line-items-table-name \"<LINE_ITEMS_TABLE_NAME>\"] \\\r\n [--line-items-fields-json \"<LINE_ITEMS_FIELDS_JSON>\"] \\\r\n --test-mode \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\pdf_to_delta_TEST.ipynb\"\r\n ```\r\n Where `<FIELDS_JSON>` is the JSON array built from `FIELDS` above, as a\r\n single-line string (e.g. `'[{\"name\":\"invoice_number\",\"description\":\"...\"}]'`).\r\n Include `--line-items-table-name` and `--line-items-fields-json` if a second\r\n line-items table was requested — both must be provided together.\r\n\r\n Tell the operator:\r\n 1. Go to the workspace → **New** → **Import notebook**\r\n 2. Select `pdf_to_delta_TEST.ipynb`\r\n 3. Click **Run All** — the notebook attaches the lakehouse automatically and\r\n processes **one PDF only**\r\n 4. 
Share the output row displayed at the end of the notebook\r\n\r\n- [ ] **Validate and iterate** — Review the output row the operator shares:\r\n - Check each field has a value and it looks correct\r\n - If a field is missing or wrong: update its description in `FIELDS_JSON`,\r\n regenerate the TEST notebook, and ask the operator to re-run it\r\n - Repeat until all fields are correct\r\n - **Do not proceed to full run until the test row is confirmed correct**\r\n\r\n- [ ] **Generate FULL notebook** — Once test output is confirmed, run the same\r\n command **without** `--test-mode`:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --table-name \"<TABLE_NAME>\" \\\r\n --fields-json \"<FIELDS_JSON>\" \\\r\n [--line-items-table-name \"<LINE_ITEMS_TABLE_NAME>\"] \\\r\n [--line-items-fields-json \"<LINE_ITEMS_FIELDS_JSON>\"] \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\pdf_to_delta_FULL.ipynb\"\r\n ```\r\n Tell the operator to import and run `pdf_to_delta_FULL.ipynb`. This processes\r\n all PDFs in the folder.\r\n\r\n- [ ] **Validate final table** — Ask the operator to confirm:\r\n - Delta table `<TABLE_NAME>` appears in the Tables section of the lakehouse\r\n - Row count matches the number of PDFs uploaded\r\n - Spot-check a few rows for data quality\r\n\r\n## Table Naming\r\n\r\n- Use a descriptive `snake_case` name based on the document type, not the filename\r\n- PDFs are individual records — do not derive table name from filenames\r\n- Ask the operator to confirm the table name before generating any notebook\r\n\r\n## Gotchas\r\n\r\n- **AI features must be enabled on the capacity.** `synapse.ml.aifunc` uses Fabric's\r\n built-in AI endpoint — no Azure OpenAI key needed. 
Prerequisites: (1) paid Fabric\r\n capacity F2 or higher, (2) tenant admin must enable \"Copilot and other features\r\n powered by Azure OpenAI\" in Admin portal → Tenant settings, (3) if capacity is\r\n outside an Azure OpenAI region, also enable the cross-geo processing toggle.\r\n- **Default model is `gpt-4.1-mini`.** If the notebook throws `DeploymentConfigNotFound`,\r\n the `MODEL_DEPLOYMENT_NAME` in the configuration cell doesn't match a model on\r\n the built-in endpoint. Check supported models at\r\n https://learn.microsoft.com/en-us/fabric/data-science/ai-services/ai-services-overview\r\n- `fab cp` requires `./filename` (forward slash) syntax. Absolute Windows paths\r\n (`C:\\...`) cause `[NotSupported]` errors. The generated script uses `Push-Location`\r\n to work around this — do not modify this pattern.\r\n- **Destination folder must exist before uploading.** The script runs `fab mkdir` first.\r\n Running `fab mkdir` on an existing folder is safe.\r\n- `WORKSPACE_NAME` and `LAKEHOUSE_NAME` are case-sensitive.\r\n- The notebook uses `synapse.ml.aifunc` which requires Fabric **runtime 1.3**.\r\n If the operator sees import errors, check runtime version in notebook settings.\r\n- The `%%configure` cell attaches the lakehouse automatically — no manual\r\n attachment needed before clicking Run All.\r\n- AI extraction temperature is set to `0.0` for consistency, but it is still\r\n non-deterministic across different PDF layouts. Always validate with TEST mode first.\r\n- All extracted fields are written as strings. If the operator needs typed columns\r\n (dates, numbers), add a post-processing step after confirming extraction is correct.\r\n- **Column names come from AI extraction.** The delta table column names match\r\n the `name` field in the `FIELDS` JSON array provided during setup. 
These are\r\n `snake_case` names chosen by the operator (e.g., `invoice_number`, `hotel_name`).\r\n They do NOT follow the same `clean_columns()` convention used by the\r\n `csv-to-bronze-delta-tables` skill. Downstream skills (e.g.,\r\n `create-materialised-lakeview-scripts`) must verify actual delta table column\r\n names rather than assuming any naming convention.\r\n- The notebook installs `openai` and `pymupdf4llm` at runtime. The `synapse.ml.aifunc`\r\n package is pre-installed in Fabric Runtime 1.3+.\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_upload_commands.py`** — Scans a local folder for PDFs and\r\n writes a PowerShell script of `fab cp` upload commands.\r\n Run: `python scripts/generate_upload_commands.py --help`\r\n- **`scripts/generate_notebook.py`** — Generates a Fabric-compatible `.ipynb`\r\n notebook with the AI extraction prompt pre-populated from the supplied fields.\r\n Supports `--test-mode` for single-PDF validation runs.\r\n Run: `python scripts/generate_notebook.py --help`\r\n",
+ content: "---\r\nname: pdf-to-bronze-delta-tables\r\ndescription: >\r\n Use this skill to extract structured data from PDF files on an operator's\r\n local machine, upload them to a Microsoft Fabric bronze lakehouse, and convert\r\n them to a delta table using AI-powered field extraction. Triggers on: \"create\r\n delta tables from PDFs\", \"extract data from PDF invoices to Fabric\", \"load\r\n PDFs into bronze lakehouse\", \"parse PDF documents to delta format\", \"ingest\r\n PDF files to Fabric tables\". Does NOT trigger for CSV/Excel ingestion,\r\n transforming existing delta tables, or non-Fabric storage targets.\r\nlicense: MIT\r\ncompatibility: >\r\n Python 3.8+ for scripts/. Fabric CLI (fab) for CLI upload option.\r\n Fabric notebook runtime 1.3 required (for synapse.ml.aifunc).\r\n---\r\n\r\n# PDF to Bronze Delta Tables\r\n\r\nUploads PDF files from a local machine to a Microsoft Fabric bronze lakehouse\r\nand converts each PDF into a row in a delta table using AI field extraction.\r\nThe lakehouse must already exist.\r\n\r\n> ⚠️ **GOVERNANCE RULE**: This skill **never executes `fab` CLI commands directly**.\r\n> All `fab` commands are written to a PowerShell script for the operator to run.\r\n>\r\n> ⚠️ **GENERATION**: Always run `scripts/generate_notebook.py` via Bash to produce\r\n> the `.ipynb` notebook — never generate notebook cell content directly. 
The\r\n> generated notebook uses native PySpark with `synapse.ml.aifunc` for AI extraction\r\n> — it does not use `fab` CLI or `FAB_TOKEN` auth.\r\n\r\n## Inputs\r\n\r\n| Parameter | Description | Example |\r\n|-----------|-------------|---------|\r\n| `WORKSPACE_NAME` | Fabric workspace name (exact, case-sensitive) | `\"Landon Finance Month End\"` |\r\n| `LAKEHOUSE_NAME` | Bronze lakehouse name (exact, case-sensitive) | `\"Lh_landon_finance_bronze\"` |\r\n| `LAKEHOUSE_FILES_FOLDER` | Folder name under lakehouse Files section | `\"Booking PDFs\"` |\r\n| `TABLE_NAME` | Target delta table name (snake_case) | `\"booking_invoices\"` |\r\n| `LOCAL_PDF_FOLDER` | Exact absolute path to local PDF folder (CLI upload only) | `\"C:\\Users\\rishi\\Data\\Booking PDFs\"` |\r\n| `FIELDS` | Fields to extract from each PDF — collected in Step 2 | See workflow |\r\n\r\n## Workflow\r\n\r\n- [ ] **Collect parameters** — If `WORKSPACE_NAME` or `LAKEHOUSE_NAME` are not\r\n provided, ask the operator for them before proceeding.\r\n\r\n- [ ] **Suggest and confirm extraction fields** — Before asking the operator to\r\n define fields from scratch, the agent should **read a sample PDF** to understand\r\n the document structure and proactively suggest fields:\r\n\r\n 1. Use `pdfplumber` (or equivalent) to extract text from 1–2 sample PDFs in\r\n `LOCAL_PDF_FOLDER`. If a second PDF is from a different sub-group (e.g.\r\n different property/entity), include it to confirm layout consistency.\r\n 2. Identify all extractable fields from the document structure (headers, labels,\r\n line items, totals, payment details, etc.).\r\n 3. Present the suggested fields to the operator in a table format, split into:\r\n - **Header-level fields** (one row per PDF) — for the main table\r\n - **Line-item fields** (multiple rows per PDF) — for the detail table, if\r\n the document contains repeating line items\r\n 4. 
For each field, show: `snake_case` name, extraction hint for the AI, and an\r\n example value from the sample PDF.\r\n 5. Ask the operator:\r\n - \"Do these fields look right? Anything to add, remove, or rename?\"\r\n - \"What should the main delta table be named?\" → `TABLE_NAME`\r\n - \"Do you want a second table for line/detail items?\" If yes:\r\n → `LINE_ITEMS_TABLE_NAME` and confirm the line-item fields\r\n - \"What folder name will the PDFs be stored in under the lakehouse Files\r\n section?\" → `LAKEHOUSE_FILES_FOLDER`\r\n 6. **Do not proceed until the operator confirms the fields.**\r\n\r\n Build `FIELDS` as a JSON array: `[{\"name\": \"...\", \"description\": \"...\"}, ...]`\r\n\r\n If the operator confirmed a second line-items table, build `LINE_ITEMS_FIELDS`\r\n as a JSON array: `[{\"name\": \"...\", \"description\": \"...\"}, ...]`\r\n\r\n- [ ] **Upload PDFs** — Present these three options and ask the operator to choose:\r\n\r\n **Option 1 — OneLake File Explorer (Manual)**\r\n Drag-and-drop the PDFs into the target folder under the lakehouse Files section\r\n using the OneLake File Explorer desktop app. No agent action required.\r\n\r\n **Option 2 — Fabric UI (Manual)**\r\n In the Fabric browser UI navigate to the lakehouse → Files section → open or\r\n create the `LAKEHOUSE_FILES_FOLDER` folder → click **Upload** and select the\r\n PDF files. No agent action required.\r\n\r\n **Option 3 — Fabric CLI (Automated)**\r\n > ⚠️ **Requires PowerShell** — generates a `.ps1` script. PowerShell is available\r\n > on Windows natively and on Mac/Linux via `brew install powershell`. If PowerShell\r\n > is not available and the operator does not want to install it, use Option 1 or 2.\r\n > Do not substitute a bash or shell script.\r\n >\r\n > ⚠️ **Performance note**: The CLI uploads files one at a time. 
For large\r\n > batches (50+ files) this is significantly slower than Options 1 or 2.\r\n > Recommend Options 1 or 2 for bulk uploads.\r\n\r\n Ask for `LOCAL_PDF_FOLDER` (exact absolute path). Then run:\r\n ```\r\n python scripts/generate_upload_commands.py \\\r\n --local-folder \"<LOCAL_PDF_FOLDER>\" \\\r\n --workspace \"<WORKSPACE_NAME>\" \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --output-script \"<OUTPUT_FOLDER>/upload_pdf_files.ps1\"\r\n ```\r\n Present the script path to the operator and ask them to run it with `pwsh upload_pdf_files.ps1`.\r\n\r\n## Output Folder\r\n\r\nBefore beginning, create the output folder:\r\n```\r\noutputs/pdf-to-bronze-delta-tables_{YYYY-MM-DD_HH-MM}_{USERNAME}/\r\n```\r\nAll generated scripts and notebooks for this run are saved here.\r\n\r\n- [ ] **Confirm upload** — Ask the operator to confirm all PDFs are visible in the\r\n lakehouse Files section before proceeding.\r\n\r\n- [ ] **Generate TEST notebook** — Run:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --table-name \"<TABLE_NAME>\" \\\r\n --fields-json \"<FIELDS_JSON>\" \\\r\n [--line-items-table-name \"<LINE_ITEMS_TABLE_NAME>\"] \\\r\n [--line-items-fields-json \"<LINE_ITEMS_FIELDS_JSON>\"] \\\r\n --test-mode \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\pdf_to_delta_TEST.ipynb\"\r\n ```\r\n Where `<FIELDS_JSON>` is the JSON array built from `FIELDS` above, as a\r\n single-line string (e.g. `'[{\"name\":\"invoice_number\",\"description\":\"...\"}]'`).\r\n Include `--line-items-table-name` and `--line-items-fields-json` if a second\r\n line-items table was requested — both must be provided together.\r\n\r\n Tell the operator:\r\n 1. Go to the workspace → **New** → **Import notebook**\r\n 2. Select `pdf_to_delta_TEST.ipynb`\r\n 3. 
Click **Run All** — the notebook attaches the lakehouse automatically and\r\n processes **one PDF only**\r\n 4. Share the output row displayed at the end of the notebook\r\n\r\n- [ ] **Validate and iterate** — Review the output row the operator shares:\r\n - Check each field has a value and it looks correct\r\n - If a field is missing or wrong: update its description in `FIELDS_JSON`,\r\n regenerate the TEST notebook, and ask the operator to re-run it\r\n - Repeat until all fields are correct\r\n - **Do not proceed to full run until the test row is confirmed correct**\r\n\r\n- [ ] **Generate FULL notebook** — Once test output is confirmed, run the same\r\n command **without** `--test-mode`:\r\n ```\r\n python scripts/generate_notebook.py \\\r\n --lakehouse \"<LAKEHOUSE_NAME>\" \\\r\n --lakehouse-folder \"<LAKEHOUSE_FILES_FOLDER>\" \\\r\n --table-name \"<TABLE_NAME>\" \\\r\n --fields-json \"<FIELDS_JSON>\" \\\r\n [--line-items-table-name \"<LINE_ITEMS_TABLE_NAME>\"] \\\r\n [--line-items-fields-json \"<LINE_ITEMS_FIELDS_JSON>\"] \\\r\n --output-notebook \"<OUTPUT_FOLDER>\\pdf_to_delta_FULL.ipynb\"\r\n ```\r\n Tell the operator to import and run `pdf_to_delta_FULL.ipynb`. This processes\r\n all PDFs in the folder.\r\n\r\n- [ ] **Validate final table** — Ask the operator to confirm:\r\n - Delta table `<TABLE_NAME>` appears in the Tables section of the lakehouse\r\n - Row count matches the number of PDFs uploaded\r\n - Spot-check a few rows for data quality\r\n\r\n## Table Naming\r\n\r\n- Use a descriptive `snake_case` name based on the document type, not the filename\r\n- PDFs are individual records — do not derive table name from filenames\r\n- Ask the operator to confirm the table name before generating any notebook\r\n\r\n## Gotchas\r\n\r\n- **AI features must be enabled on the capacity.** `synapse.ml.aifunc` uses Fabric's\r\n built-in AI endpoint — no Azure OpenAI key needed. 
Prerequisites: (1) paid Fabric\r\n capacity F2 or higher, (2) tenant admin must enable \"Copilot and other features\r\n powered by Azure OpenAI\" in Admin portal → Tenant settings, (3) if capacity is\r\n outside an Azure OpenAI region, also enable the cross-geo processing toggle.\r\n- **Default model is `gpt-4.1-mini`.** If the notebook throws `DeploymentConfigNotFound`,\r\n the `MODEL_DEPLOYMENT_NAME` in the configuration cell doesn't match a model on\r\n the built-in endpoint. Check supported models at\r\n https://learn.microsoft.com/en-us/fabric/data-science/ai-services/ai-services-overview\r\n- `fab cp` requires `./filename` (forward slash) syntax. Absolute Windows paths\r\n (`C:\\...`) cause `[NotSupported]` errors. The generated script uses `Push-Location`\r\n to work around this — do not modify this pattern.\r\n- **Destination folder must exist before uploading.** The script runs `fab mkdir` first.\r\n Running `fab mkdir` on an existing folder is safe.\r\n- `WORKSPACE_NAME` and `LAKEHOUSE_NAME` are case-sensitive.\r\n- The notebook uses `synapse.ml.aifunc` which requires Fabric **runtime 1.3**.\r\n If the operator sees import errors, check runtime version in notebook settings.\r\n- The `%%configure` cell attaches the lakehouse automatically — no manual\r\n attachment needed before clicking Run All.\r\n- AI extraction temperature is set to `0.0` for consistency, but it is still\r\n non-deterministic across different PDF layouts. Always validate with TEST mode first.\r\n- All extracted fields are written as strings. If the operator needs typed columns\r\n (dates, numbers), add a post-processing step after confirming extraction is correct.\r\n- **Column names come from AI extraction.** The delta table column names match\r\n the `name` field in the `FIELDS` JSON array provided during setup. 
These are\r\n `snake_case` names chosen by the operator (e.g., `invoice_number`, `hotel_name`).\r\n They do NOT follow the same `clean_columns()` convention used by the\r\n `csv-to-bronze-delta-tables` skill. Downstream skills (e.g.,\r\n `create-materialised-lakeview-scripts`) must verify actual delta table column\r\n names rather than assuming any naming convention.\r\n- The notebook installs `openai` and `pymupdf4llm` at runtime. The `synapse.ml.aifunc`\r\n package is pre-installed in Fabric Runtime 1.3+.\r\n\r\n## Available Scripts\r\n\r\n- **`scripts/generate_upload_commands.py`** — Scans a local folder for PDFs and\r\n writes a PowerShell script of `fab cp` upload commands.\r\n Run: `python scripts/generate_upload_commands.py --help`\r\n- **`scripts/generate_notebook.py`** — Generates a Fabric-compatible `.ipynb`\r\n notebook with the AI extraction prompt pre-populated from the supplied fields.\r\n Supports `--test-mode` for single-PDF validation runs.\r\n Run: `python scripts/generate_notebook.py --help`\r\n",
  },
  {
  relativePath: "references/notebook-cells-reference.md",
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@rishildi/ldi-process-skills",
- "version": "0.1.6",
+ "version": "0.1.7",
  "description": "LDI Process Skills MCP Server — brings curated, step-by-step process skills to your AI agents.",
  "type": "module",
  "bin": {