dojo.md 0.2.2 → 0.2.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/courses/GENERATION_LOG.md +20 -0
- package/courses/api-documentation-writing/course.yaml +12 -0
- package/courses/api-documentation-writing/scenarios/level-1/authentication-basics.yaml +46 -0
- package/courses/api-documentation-writing/scenarios/level-1/data-types-formats.yaml +45 -0
- package/courses/api-documentation-writing/scenarios/level-1/endpoint-description.yaml +45 -0
- package/courses/api-documentation-writing/scenarios/level-1/error-documentation.yaml +45 -0
- package/courses/api-documentation-writing/scenarios/level-1/first-documentation-shift.yaml +47 -0
- package/courses/api-documentation-writing/scenarios/level-1/getting-started-guide.yaml +42 -0
- package/courses/api-documentation-writing/scenarios/level-1/pagination-docs.yaml +51 -0
- package/courses/api-documentation-writing/scenarios/level-1/request-parameters.yaml +46 -0
- package/courses/api-documentation-writing/scenarios/level-1/request-response-examples.yaml +48 -0
- package/courses/api-documentation-writing/scenarios/level-1/status-codes.yaml +45 -0
- package/courses/api-documentation-writing/scenarios/level-2/error-patterns.yaml +48 -0
- package/courses/api-documentation-writing/scenarios/level-2/intermediate-documentation-shift.yaml +48 -0
- package/courses/api-documentation-writing/scenarios/level-2/oauth-documentation.yaml +47 -0
- package/courses/api-documentation-writing/scenarios/level-2/openapi-specification.yaml +46 -0
- package/courses/api-documentation-writing/scenarios/level-2/rate-limiting-docs.yaml +45 -0
- package/courses/api-documentation-writing/scenarios/level-2/request-body-schemas.yaml +46 -0
- package/courses/api-documentation-writing/scenarios/level-2/schema-definitions.yaml +41 -0
- package/courses/api-documentation-writing/scenarios/level-2/swagger-redoc-rendering.yaml +43 -0
- package/courses/api-documentation-writing/scenarios/level-2/validation-documentation.yaml +47 -0
- package/courses/api-documentation-writing/scenarios/level-2/versioning-changelog.yaml +42 -0
- package/courses/api-documentation-writing/scenarios/level-3/advanced-documentation-shift.yaml +43 -0
- package/courses/api-documentation-writing/scenarios/level-3/api-style-guide.yaml +40 -0
- package/courses/api-documentation-writing/scenarios/level-3/code-samples-multilang.yaml +40 -0
- package/courses/api-documentation-writing/scenarios/level-3/content-architecture.yaml +47 -0
- package/courses/api-documentation-writing/scenarios/level-3/deprecation-communication.yaml +44 -0
- package/courses/api-documentation-writing/scenarios/level-3/interactive-api-explorer.yaml +42 -0
- package/courses/api-documentation-writing/scenarios/level-3/migration-guides.yaml +42 -0
- package/courses/api-documentation-writing/scenarios/level-3/sdk-documentation.yaml +40 -0
- package/courses/api-documentation-writing/scenarios/level-3/webhook-documentation.yaml +48 -0
- package/courses/api-documentation-writing/scenarios/level-3/websocket-sse-docs.yaml +47 -0
- package/courses/api-documentation-writing/scenarios/level-4/api-changelog-management.yaml +44 -0
- package/courses/api-documentation-writing/scenarios/level-4/api-governance-standards.yaml +41 -0
- package/courses/api-documentation-writing/scenarios/level-4/api-product-strategy.yaml +41 -0
- package/courses/api-documentation-writing/scenarios/level-4/developer-portal-design.yaml +48 -0
- package/courses/api-documentation-writing/scenarios/level-4/docs-as-code.yaml +41 -0
- package/courses/api-documentation-writing/scenarios/level-4/documentation-localization.yaml +46 -0
- package/courses/api-documentation-writing/scenarios/level-4/documentation-metrics.yaml +45 -0
- package/courses/api-documentation-writing/scenarios/level-4/documentation-testing.yaml +41 -0
- package/courses/api-documentation-writing/scenarios/level-4/expert-documentation-shift.yaml +45 -0
- package/courses/api-documentation-writing/scenarios/level-4/multi-audience-docs.yaml +46 -0
- package/courses/api-documentation-writing/scenarios/level-5/ai-powered-documentation.yaml +44 -0
- package/courses/api-documentation-writing/scenarios/level-5/api-first-documentation.yaml +45 -0
- package/courses/api-documentation-writing/scenarios/level-5/api-marketplace-docs.yaml +42 -0
- package/courses/api-documentation-writing/scenarios/level-5/board-api-strategy.yaml +48 -0
- package/courses/api-documentation-writing/scenarios/level-5/documentation-program-strategy.yaml +42 -0
- package/courses/api-documentation-writing/scenarios/level-5/documentation-team-structure.yaml +47 -0
- package/courses/api-documentation-writing/scenarios/level-5/dx-competitive-advantage.yaml +46 -0
- package/courses/api-documentation-writing/scenarios/level-5/ecosystem-documentation.yaml +45 -0
- package/courses/api-documentation-writing/scenarios/level-5/industry-documentation-patterns.yaml +46 -0
- package/courses/api-documentation-writing/scenarios/level-5/master-documentation-shift.yaml +46 -0
- package/courses/code-review-feedback-writing/course.yaml +12 -0
- package/courses/code-review-feedback-writing/scenarios/level-1/approve-vs-request-changes.yaml +48 -0
- package/courses/code-review-feedback-writing/scenarios/level-1/asking-questions.yaml +50 -0
- package/courses/code-review-feedback-writing/scenarios/level-1/clear-comment-writing.yaml +45 -0
- package/courses/code-review-feedback-writing/scenarios/level-1/constructive-tone.yaml +43 -0
- package/courses/code-review-feedback-writing/scenarios/level-1/first-review-shift.yaml +46 -0
- package/courses/code-review-feedback-writing/scenarios/level-1/giving-praise.yaml +44 -0
- package/courses/code-review-feedback-writing/scenarios/level-1/nitpick-etiquette.yaml +44 -0
- package/courses/code-review-feedback-writing/scenarios/level-1/providing-context.yaml +46 -0
- package/courses/code-review-feedback-writing/scenarios/level-1/reviewing-small-prs.yaml +43 -0
- package/courses/code-review-feedback-writing/scenarios/level-1/style-vs-logic.yaml +48 -0
- package/courses/code-review-feedback-writing/scenarios/level-2/architectural-feedback.yaml +52 -0
- package/courses/code-review-feedback-writing/scenarios/level-2/intermediate-review-shift.yaml +46 -0
- package/courses/code-review-feedback-writing/scenarios/level-2/performance-feedback.yaml +50 -0
- package/courses/code-review-feedback-writing/scenarios/level-2/reviewing-breaking-changes.yaml +44 -0
- package/courses/code-review-feedback-writing/scenarios/level-2/reviewing-complex-prs.yaml +43 -0
- package/courses/code-review-feedback-writing/scenarios/level-2/reviewing-documentation.yaml +47 -0
- package/courses/code-review-feedback-writing/scenarios/level-2/reviewing-error-handling.yaml +50 -0
- package/courses/code-review-feedback-writing/scenarios/level-2/reviewing-tests.yaml +53 -0
- package/courses/code-review-feedback-writing/scenarios/level-2/security-review-comments.yaml +50 -0
- package/courses/code-review-feedback-writing/scenarios/level-2/suggesting-alternatives.yaml +42 -0
- package/courses/code-review-feedback-writing/scenarios/level-3/cross-team-review.yaml +45 -0
- package/courses/code-review-feedback-writing/scenarios/level-3/mentoring-through-review.yaml +46 -0
- package/courses/code-review-feedback-writing/scenarios/level-3/reviewing-unfamiliar-code.yaml +43 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-1/first-debugging-shift.yaml +66 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-1/plan-output-reading.yaml +71 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-1/resource-creation-failures.yaml +54 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-1/resource-references.yaml +70 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-1/state-file-basics.yaml +73 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-1/terraform-fmt-validate.yaml +58 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-2/count-vs-for-each.yaml +58 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-2/dependency-management.yaml +80 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-2/intermediate-debugging-shift.yaml +66 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-2/lifecycle-rules.yaml +51 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-2/locals-and-expressions.yaml +58 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-2/module-structure.yaml +75 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-2/provisioner-pitfalls.yaml +64 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-2/remote-state-backend.yaml +55 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-2/terraform-import.yaml +55 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-2/workspace-management.yaml +51 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-3/advanced-debugging-shift.yaml +63 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-3/api-rate-limiting.yaml +50 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-3/conditional-resources.yaml +66 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-3/drift-detection.yaml +66 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-3/dynamic-blocks.yaml +71 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-3/large-scale-refactoring.yaml +59 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-3/multi-provider-config.yaml +69 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-3/state-surgery.yaml +57 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-3/terraform-cloud-enterprise.yaml +59 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-3/terraform-debugging.yaml +51 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-4/blast-radius-management.yaml +51 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-4/cicd-pipeline-design.yaml +50 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-4/compliance-as-code.yaml +46 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-4/cost-estimation-governance.yaml +42 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-4/expert-debugging-shift.yaml +51 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-4/iac-organization-strategy.yaml +45 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-4/incident-response-iac.yaml +47 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-4/infrastructure-testing.yaml +41 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-4/module-registry-design.yaml +45 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-4/multi-account-strategy.yaml +57 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-5/board-infrastructure-investment.yaml +53 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-5/disaster-recovery-iac.yaml +47 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-5/enterprise-iac-transformation.yaml +48 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-5/iac-technology-evolution.yaml +49 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-5/ma-infrastructure-consolidation.yaml +54 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-5/master-debugging-shift.yaml +53 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-5/multi-cloud-strategy.yaml +49 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-5/platform-engineering.yaml +47 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-5/regulatory-compliance-automation.yaml +47 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-5/terraform-vs-alternatives.yaml +46 -0
- package/dist/cli/commands/generate.d.ts.map +1 -1
- package/dist/cli/commands/generate.js +2 -1
- package/dist/cli/commands/generate.js.map +1 -1
- package/dist/cli/commands/train.d.ts.map +1 -1
- package/dist/cli/commands/train.js +6 -3
- package/dist/cli/commands/train.js.map +1 -1
- package/dist/cli/index.js +9 -6
- package/dist/cli/index.js.map +1 -1
- package/dist/cli/run-demo.js +3 -2
- package/dist/cli/run-demo.js.map +1 -1
- package/dist/engine/model-utils.d.ts +6 -0
- package/dist/engine/model-utils.d.ts.map +1 -1
- package/dist/engine/model-utils.js +28 -1
- package/dist/engine/model-utils.js.map +1 -1
- package/dist/engine/training.d.ts.map +1 -1
- package/dist/engine/training.js +4 -3
- package/dist/engine/training.js.map +1 -1
- package/dist/generator/course-generator.d.ts.map +1 -1
- package/dist/generator/course-generator.js +4 -3
- package/dist/generator/course-generator.js.map +1 -1
- package/dist/mcp/server.d.ts.map +1 -1
- package/dist/mcp/server.js +7 -3
- package/dist/mcp/server.js.map +1 -1
- package/dist/mcp/session-manager.d.ts.map +1 -1
- package/dist/mcp/session-manager.js +3 -2
- package/dist/mcp/session-manager.js.map +1 -1
- package/package.json +1 -1
package/courses/terraform-infrastructure-setup/scenarios/level-3/large-scale-refactoring.yaml
ADDED
@@ -0,0 +1,59 @@
+meta:
+  id: large-scale-refactoring
+  level: 3
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Refactor large Terraform codebases — split monoliths into modules, migrate between state files, and use moved blocks for safe resource reorganization"
+  tags: [Terraform, refactoring, modules, moved-blocks, migration, advanced]
+
+state: {}
+
+trigger: |
+  Your organization's Terraform codebase has grown organically over
+  3 years into a monolith:
+
+  ```
+  infrastructure/
+  ├── main.tf (3500 lines, 180 resources)
+  ├── variables.tf (800 lines, 95 variables)
+  ├── outputs.tf (200 lines)
+  └── terraform.tfstate (25MB, all resources in one state)
+  ```
+
+  Problems:
+  - terraform plan takes 8 minutes (refreshes all 180 resources)
+  - Any change risks all resources (blast radius = everything)
+  - 5 teams touch the same files, causing merge conflicts
+  - Lock contention: only one person can run terraform at a time
+
+  Target architecture:
+  ```
+  infrastructure/
+  ├── foundation/ (VPC, DNS, IAM — Platform team)
+  │   └── terraform.tfstate
+  ├── database/ (RDS, ElastiCache — Database team)
+  │   └── terraform.tfstate
+  ├── compute/ (ECS, ALB — App team)
+  │   └── terraform.tfstate
+  ├── monitoring/ (CloudWatch, Alarms — SRE team)
+  │   └── terraform.tfstate
+  └── modules/ (Shared modules)
+  ```
+
+  Task: Design the migration strategy from monolith to modular
+  Terraform, covering state splitting, moved blocks, cross-state
+  references, testing the migration, and rollback planning.
+
+assertions:
+  - type: llm_judge
+    criteria: "Migration strategy is phased — Phase 1: catalog all resources by team/domain. Phase 2: create module structure and write configurations for each domain. Phase 3: use moved blocks within the monolith to reorganize into modules (no state split yet). Phase 4: split state files using state mv or state rm + import. Phase 5: establish cross-state references using terraform_remote_state data sources. Each phase is independently verifiable: plan should show no changes after each phase. Never do everything at once — incremental migration with verification"
+    weight: 0.35
+    description: "Migration strategy"
+  - type: llm_judge
+    criteria: "State splitting mechanics are covered — approach 1 (state mv): (1) backup state, (2) create new backend configs, (3) terraform state mv resources to new state files. Approach 2 (state rm + import): (1) remove resources from monolith state, (2) import into new domain state files. Approach 3 (manual): (1) state pull, (2) edit JSON to split resources, (3) state push to new backends. Cross-state references: foundation outputs VPC ID, compute reads it via terraform_remote_state. IAM and dependency order: foundation first (VPC, IAM), then database (needs VPC), then compute (needs both)"
+    weight: 0.35
+    description: "State splitting"
+  - type: llm_judge
+    criteria: "Testing and rollback are practical — testing: after each migration step, terraform plan must show zero changes in all state files. If plan shows changes, something was migrated incorrectly — fix before proceeding. Rollback: keep the original monolith state backup throughout migration. If anything goes wrong, restore from backup and restart the phase. Timeline: for 180 resources, plan 2-4 weeks. Risk mitigation: migrate non-production first, then production during maintenance window. Communication: notify all teams of the plan, freeze non-essential changes during migration"
+    weight: 0.30
+    description: "Testing and rollback"
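The cross-state reference pattern named in the state-splitting criteria (foundation publishes the VPC ID, compute reads it) can be sketched in HCL. The bucket name, key, and output names below are hypothetical, not taken from the package:

```hcl
# foundation/outputs.tf: publish values that other state files will need
output "vpc_id" {
  value = aws_vpc.main.id
}

# compute/data.tf: read the foundation state (backend details are hypothetical)
data "terraform_remote_state" "foundation" {
  backend = "s3"
  config = {
    bucket = "acme-terraform-state"
    key    = "foundation/terraform.tfstate"
    region = "us-east-1"
  }
}

# Reference the published output instead of hardcoding a VPC ID
locals {
  vpc_id = data.terraform_remote_state.foundation.outputs.vpc_id
}
```

Each domain only sees the outputs the upstream state chooses to export, which keeps the dependency surface between teams explicit.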
package/courses/terraform-infrastructure-setup/scenarios/level-3/multi-provider-config.yaml
ADDED
@@ -0,0 +1,69 @@
+meta:
+  id: multi-provider-config
+  level: 3
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Configure multi-provider setups — manage multi-region, multi-account, and multi-cloud deployments with provider aliases and assume_role"
+  tags: [Terraform, providers, multi-region, multi-account, cross-account, advanced]
+
+state: {}
+
+trigger: |
+  Your organization needs infrastructure across multiple AWS accounts
+  and regions:
+
+  ```
+  Production Account (111111111111) - us-east-1
+  Staging Account (222222222222) - us-east-1
+  DR Account (111111111111) - us-west-2
+  Shared Services (333333333333) - us-east-1
+  ```
+
+  Your Terraform configuration:
+
+  ```hcl
+  provider "aws" {
+    region = "us-east-1"
+  }
+
+  provider "aws" {
+    alias  = "dr"
+    region = "us-west-2"
+  }
+
+  provider "aws" {
+    alias  = "staging"
+    region = "us-east-1"
+    assume_role {
+      role_arn = "arn:aws:iam::222222222222:role/TerraformRole"
+    }
+  }
+  ```
+
+  Error when deploying to staging:
+  ```
+  Error: error configuring Terraform AWS Provider: IAM Role
+  (arn:aws:iam::222222222222:role/TerraformRole) cannot be assumed.
+
+  There are a number of possible causes:
+  - The credentials used do not have permission to assume the role
+  - The role's trust policy does not allow the current identity
+  ```
+
+  Task: Explain multi-provider configuration, assume_role for
+  cross-account access, passing providers to modules, provider
+  configuration best practices, and debugging cross-account issues.
+
+assertions:
+  - type: llm_judge
+    criteria: "Multi-provider setup is explained — provider aliases allow multiple configurations of the same provider. Default provider (no alias) used when provider isn't specified on a resource. Aliased providers: specify with provider = aws.dr on each resource. assume_role: Terraform assumes an IAM role in another account. Requirements: (1) trust policy on target role must allow the source account/role, (2) source must have sts:AssumeRole permission, (3) external_id for additional security. The error: trust policy or permissions issue — check both sides"
+    weight: 0.35
+    description: "Multi-provider setup"
+  - type: llm_judge
+    criteria: "Provider passing to modules is covered — modules don't inherit provider aliases automatically. Pass explicitly: module 'dr_vpc' { source = './modules/vpc', providers = { aws = aws.dr } }. Module must declare required providers: terraform { required_providers { aws = { source = 'hashicorp/aws' } } }. For modules needing multiple providers: providers = { aws = aws, aws.secondary = aws.dr }. Anti-pattern: configuring providers inside modules — always configure in root and pass down"
+    weight: 0.35
+    description: "Module providers"
+  - type: llm_judge
+    criteria: "Cross-account debugging is practical — debugging assume_role: (1) verify trust policy on target role allows the source identity, (2) verify source has sts:AssumeRole permission, (3) check for external_id requirement, (4) test manually: aws sts assume-role --role-arn ... (5) enable TF_LOG=DEBUG to see the exact API call. IAM role trust policy must include the specific ARN (account, user, or role). Session duration: default 1 hour, can increase with duration_seconds. MFA: if required, must be handled outside Terraform. Best practice: use separate state files per account for blast radius isolation"
+    weight: 0.30
+    description: "Cross-account debugging"
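The provider-passing pattern the module-providers criteria describe can be sketched as a root configuration plus a module stub, reusing the `dr` alias from the scenario; the module path and file layout are illustrative, not from the package:

```hcl
# Root module: one default and one aliased AWS provider configuration
provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# Modules do not inherit aliases, so pass the DR provider explicitly
module "dr_vpc" {
  source    = "./modules/vpc"
  providers = { aws = aws.dr }
}

# modules/vpc/versions.tf: declare the requirement; never configure
# a provider inside the module itself
terraform {
  required_providers {
    aws = { source = "hashicorp/aws" }
  }
}
```

Every resource in `modules/vpc` then runs against us-west-2 without the module knowing anything about regions or accounts.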
package/courses/terraform-infrastructure-setup/scenarios/level-3/state-surgery.yaml
ADDED
@@ -0,0 +1,57 @@
+meta:
+  id: state-surgery
+  level: 3
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Perform state surgery — use state mv, rm, pull, push for complex migrations, module extraction, and resource address changes"
+  tags: [Terraform, state, migration, state-mv, state-rm, advanced]
+
+state: {}
+
+trigger: |
+  Your monolithic Terraform configuration with 200 resources needs
+  to be split into separate modules. Current flat structure:
+
+  ```hcl
+  # main.tf (2000 lines)
+  resource "aws_vpc" "main" { ... }
+  resource "aws_subnet" "public" { ... }
+  resource "aws_instance" "web" { ... }
+  resource "aws_rds_instance" "db" { ... }
+  ```
+
+  Target: split into modules/networking, modules/compute, modules/database.
+
+  Attempt 1 — Just move code into modules:
+  ```
+  $ terraform plan
+  # aws_vpc.main will be destroyed
+  # module.networking.aws_vpc.main will be created
+  # aws_instance.web will be destroyed
+  # module.compute.aws_instance.web will be created
+  # aws_rds_instance.db will be destroyed (!!!)
+  # module.database.aws_rds_instance.db will be created
+  Plan: 6 to add, 0 to change, 6 to destroy.
+  ```
+
+  All resources will be destroyed and recreated — unacceptable for
+  production! The database would be lost.
+
+  Task: Explain state surgery operations (mv, rm, pull, push),
+  how to migrate resources between modules without recreation,
+  moved blocks (Terraform 1.1+), state backup best practices,
+  and complex migration strategies.
+
+assertions:
+  - type: llm_judge
+    criteria: "State mv migration is explained — terraform state mv moves a resource from one address to another in state without modifying infrastructure. To migrate to modules: terraform state mv aws_vpc.main module.networking.aws_vpc.main, terraform state mv aws_instance.web module.compute.aws_instance.web, etc. After all moves: terraform plan should show no changes. Always backup state first: terraform state pull > backup.tfstate. State mv is atomic per resource — if interrupted, some resources moved, others not. Plan carefully and script the moves"
+    weight: 0.35
+    description: "State mv"
+  - type: llm_judge
+    criteria: "Moved blocks are covered as the modern alternative — moved { from = aws_vpc.main, to = module.networking.aws_vpc.main }. Benefits over state mv: (1) declarative and code-reviewable, (2) handled during plan/apply, (3) no manual state manipulation, (4) works across plan/apply workflow. Multiple moved blocks can coexist. Moved blocks are removed after successful apply. Supports: resource address changes, module refactoring, count to for_each migration. Terraform 1.1+ required. Preferred over state mv for most migrations"
+    weight: 0.35
+    description: "Moved blocks"
+  - type: llm_judge
+    criteria: "State rm and complex operations are practical — terraform state rm: removes resource from state without destroying it. Use when: (1) resource should no longer be managed by Terraform, (2) moving resource to different state file, (3) removing accidentally imported resource. terraform state pull/push: download/upload entire state file. Use for: manual state repair, migrating between backends, debugging. Complex migration: for splitting state files, (1) state pull, (2) manipulate JSON, (3) state push to new backend. Always: backup before surgery, verify with plan after, use -dry-run where available"
+    weight: 0.30
+    description: "Complex operations"
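The moved-block migration the criteria prefer over manual state mv could look like this for the scenario's resources (Terraform 1.1+; the blocks are deleted from the code after a successful apply):

```hcl
# Recorded alongside the refactored module code so terraform plan
# reports address moves instead of destroy/create pairs
moved {
  from = aws_vpc.main
  to   = module.networking.aws_vpc.main
}

moved {
  from = aws_instance.web
  to   = module.compute.aws_instance.web
}

moved {
  from = aws_rds_instance.db
  to   = module.database.aws_rds_instance.db
}
```

Because the blocks live in version control, the refactor is reviewable in the same PR that moves the resource code, and a plan showing anything other than moves is an immediate signal that an address was mapped wrong.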
package/courses/terraform-infrastructure-setup/scenarios/level-3/terraform-cloud-enterprise.yaml
ADDED
@@ -0,0 +1,59 @@
+meta:
+  id: terraform-cloud-enterprise
+  level: 3
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Use Terraform Cloud/Enterprise — configure remote execution, VCS integration, workspace management, and Sentinel policies"
+  tags: [Terraform, Terraform-Cloud, Enterprise, remote-execution, Sentinel, advanced]
+
+state: {}
+
+trigger: |
+  Your team is migrating from local Terraform execution to Terraform
+  Cloud. Current pain points:
+  - Engineers run terraform from laptops with different provider versions
+  - No audit trail of who applied what
+  - State files stored in S3 with overly permissive access
+  - No policy enforcement (anyone can create m5.24xlarge instances)
+
+  Migration configuration:
+
+  ```hcl
+  terraform {
+    cloud {
+      organization = "acme-corp"
+      workspaces {
+        name = "production"
+      }
+    }
+  }
+  ```
+
+  After migration:
+  ```
+  $ terraform plan
+
+  Running plan in Terraform Cloud. Output will stream here.
+
+  Error: Terraform Cloud returned an unexpected error
+  UNAUTHORIZED: You are not authorized to perform this action.
+  ```
+
+  Task: Explain Terraform Cloud features (remote execution, VCS
+  integration, workspace management), Sentinel policies for
+  governance, migration from local/S3 to Terraform Cloud, and
+  when to use Cloud vs Enterprise vs self-hosted.
+
+assertions:
+  - type: llm_judge
+    criteria: "Terraform Cloud features are explained — remote execution: plan and apply run on Terraform Cloud's infrastructure (consistent environment, no laptop dependencies). VCS integration: connect to GitHub/GitLab, automatic plans on PRs, apply on merge. Workspace management: each workspace has its own state, variables, and permissions. Variable sets: share variables across workspaces. Run triggers: chain workspaces (VPC workspace triggers EKS workspace). The auth error: need to run terraform login first, or set TF_TOKEN_app_terraform_io environment variable. Team permissions control who can plan vs apply"
+    weight: 0.35
+    description: "Cloud features"
+  - type: llm_judge
+    criteria: "Sentinel policies are covered — Sentinel: policy-as-code framework for governance. Policy sets: attach to workspaces. Enforcement levels: advisory (warn), soft-mandatory (override with approval), hard-mandatory (no override). Example policies: restrict instance types (no m5.24xlarge), require tags on all resources, enforce encryption, restrict regions. Policy workflow: plan → Sentinel check → cost estimation → apply. Policies written in Sentinel language (not HCL). OPA (Open Policy Agent) also supported as alternative"
+    weight: 0.35
+    description: "Sentinel policies"
+  - type: llm_judge
+    criteria: "Migration and comparison are practical — migration from S3: (1) add cloud block to config, (2) terraform login, (3) terraform init to migrate state. Cloud vs Enterprise vs self-hosted: Cloud (SaaS, free tier available, easiest setup), Enterprise (self-hosted, air-gapped support, custom agents), self-hosted agents with Cloud (hybrid — control plane in Cloud, execution on your infrastructure). When Cloud: most teams. When Enterprise: regulatory requirements for air-gapped, very large scale, custom integrations. Cost: Cloud free for small teams, Enterprise starts at $70K+/year"
+    weight: 0.30
+    description: "Migration and comparison"
@@ -0,0 +1,51 @@
|
|
|
1
|
+
meta:
|
|
2
|
+
id: terraform-debugging
|
|
3
|
+
level: 3
|
|
4
|
+
course: terraform-infrastructure-setup
|
|
5
|
+
type: output
|
|
6
|
+
description: "Debug Terraform with TF_LOG — use log levels, provider-specific debugging, crash logs, and systematic troubleshooting for complex failures"
|
|
7
|
+
tags: [Terraform, debugging, TF_LOG, crash-logs, troubleshooting, advanced]
|
|
8
|
+
|
|
9
|
+
state: {}
|
|
10
|
+
|
|
11
|
+
trigger: |
|
|
12
|
+
A terraform apply fails with a cryptic error that gives no useful
|
|
13
|
+
information:
|
|
14
|
+
|
|
15
|
+
```
|
|
16
|
+
Error: error creating ECS Service (my-service): InvalidParameterException:
|
|
17
|
+
Unable to assume the provided role.
|
|
18
|
+
|
|
19
|
+
with aws_ecs_service.web,
|
|
20
|
+
on ecs.tf line 15, in resource "aws_ecs_service" "web":
|
|
21
|
+
15: resource "aws_ecs_service" "web" {
|
|
22
|
+
```
|
|
23
|
+
|
|
24
|
+
  The IAM role exists and looks correct. You need to dig deeper.

  You also encounter a Terraform crash:

  ```
  !!!!!!!!!!!!!!!!!!!!!!!!!!! TERRAFORM CRASH !!!!!!!!!!!!!!!!!!!!!!!!!

  Terraform crashed! This is always indicative of a bug within
  Terraform or a provider. Crash log saved to: crash.log
  ```

  Task: Explain Terraform debugging techniques, TF_LOG levels and
  environment variables, provider-specific debugging, crash log
  analysis, and systematic troubleshooting methodology for complex
  infrastructure failures.

assertions:
  - type: llm_judge
    criteria: "TF_LOG debugging is explained — levels (most to least verbose): TRACE, DEBUG, INFO, WARN, ERROR. Set: TF_LOG=DEBUG terraform apply. Save to file: TF_LOG_PATH=./debug.log. Component-specific: TF_LOG_CORE=WARN TF_LOG_PROVIDER=DEBUG (provider operations verbose, core quiet). The ECS error: TF_LOG=DEBUG reveals the actual API request/response — likely IAM role trust policy doesn't include ecs.amazonaws.com, or there's an IAM propagation delay. DEBUG shows: HTTP requests, API responses, retry attempts, timing. TRACE shows everything including internal state operations"
    weight: 0.35
    description: "TF_LOG debugging"
  - type: llm_judge
    criteria: "Crash logs and provider debugging are covered — crash log: contains Go stack trace, panic message, provider version. Report to: provider GitHub issues if provider crash, Terraform core GitHub if core crash. Include: Terraform version, provider versions, sanitized config, crash.log. Provider debugging: check provider changelog for known bugs, try upgrading/downgrading provider version, reproduce with minimal configuration. AWS-specific: decode authorization failure messages with aws sts decode-authorization-message. Eventual consistency: IAM changes can take seconds to propagate — add depends_on or retry"
    weight: 0.35
    description: "Crash and provider"
  - type: llm_judge
    criteria: "Systematic troubleshooting is practical — methodology: (1) read the error message carefully (resource, file, line), (2) check provider documentation for the resource, (3) enable TF_LOG=DEBUG and search for the actual API error, (4) reproduce with minimal configuration (isolate the issue), (5) check for known issues on provider GitHub, (6) verify cloud-side (correct permissions, quotas, resource limits). Common hidden causes: IAM propagation delay, API rate limiting (429 errors hidden in retries), eventual consistency, stale provider cache (terraform init -upgrade). terraform plan -refresh-only to verify state matches reality"
    weight: 0.30
    description: "Troubleshooting method"
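The trust-policy failure mode named in the first criterion can be illustrated in HCL. This is a hypothetical sketch (role and resource names are invented); a role whose trust policy omits the ECS service principal fails only at the cloud API, which is exactly what TF_LOG=DEBUG surfaces:

```hcl
# Hypothetical example — names are illustrative, not from the scenario.
# If the "identifiers" list below omits the ECS principal, terraform apply
# fails with an opaque ECS/IAM error; the real API response appears only
# in TF_LOG=DEBUG output.
data "aws_iam_policy_document" "ecs_trust" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type = "Service"
      # ecs-tasks.amazonaws.com for task roles; ecs.amazonaws.com for the
      # ECS service role, as the criterion mentions.
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "task" {
  name               = "example-task-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_trust.json
}
```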
@@ -0,0 +1,51 @@
meta:
  id: blast-radius-management
  level: 4
  course: terraform-infrastructure-setup
  type: output
  description: "Manage Terraform blast radius — design state boundaries, implement approval workflows, and prevent large-scale outages from single changes"
  tags: [Terraform, blast-radius, state-separation, approvals, risk, expert]

state: {}

trigger: |
  A single terraform apply destroyed your production database, two
  load balancers, and a VPN connection. Root cause: all 300 resources
  were in one state file. The engineer intended to modify a CloudWatch
  alarm, but a provider upgrade changed the behavior of unrelated
  resources.

  Impact:
  - 4 hours of downtime
  - Database restored from backup (30 minutes of data loss)
  - Post-mortem found: blast radius = 300 resources per apply
  - Board asked: "How do we prevent this from happening again?"

  Current state architecture:
  ```
  Single state: 300 resources
  - VPC, subnets, NAT gateways
  - RDS, ElastiCache
  - ECS services, ALBs
  - CloudWatch, SNS, SQS
  - IAM roles, policies
  - S3 buckets, CloudFront
  ```

  Task: Design the blast radius management strategy covering: state
  file boundaries, change classification (risk levels), approval
  workflows, provider upgrade safety, and recovery procedures.

assertions:
  - type: llm_judge
    criteria: "State boundaries reduce blast radius — split 300 resources into isolated state files: foundation (VPC, subnets, NAT — rarely changes, ~20 resources), database (RDS, ElastiCache — critical, ~10 resources), compute (ECS, ALB — frequently changes, ~50 resources), messaging (SQS, SNS — moderate, ~30 resources), monitoring (CloudWatch, alarms — frequent, ~40 resources), IAM (roles, policies — sensitive, ~30 resources), CDN (CloudFront, S3 — moderate, ~20 resources). Each state file limits the blast radius. Maximum 50-80 resources per state. Cross-state references via terraform_remote_state"
    weight: 0.35
    description: "State boundaries"
  - type: llm_judge
    criteria: "Change classification and approvals are defined — risk levels: Low (monitoring, tags, non-destructive updates — auto-approve in CI), Medium (security group changes, scaling modifications — 1 approval), High (database changes, network topology, IAM — 2 approvals + change window), Critical (provider upgrades, state operations, foundation changes — team lead + SRE approval). Implement via: Terraform Cloud workspace-level permissions, GitHub environment protection rules, or Atlantis apply requirements. Provider upgrades: pin exact versions, upgrade in dev first, review changelog for breaking changes, upgrade one state file at a time"
    weight: 0.35
    description: "Classification and approvals"
  - type: llm_judge
    criteria: "Recovery procedures are practical — immediate response: (1) don't run terraform apply again, (2) assess damage scope from state and CloudTrail, (3) restore from backups (RDS snapshots, S3 versioning). Recovery: (1) if resources destroyed but state intact: terraform apply recreates, (2) if state corrupted: restore from S3 versioned state backup. Prevention: prevent_destroy on databases and critical resources, separate state files limit collateral damage, terraform plan -detailed-exitcode in CI catches unexpected destroys, plan output review required before apply. Provider upgrades: test in isolated environment first, upgrade one service domain at a time, maintain rollback plan (pin to previous version)"
    weight: 0.30
    description: "Recovery"
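Two of the mechanisms the assertions rely on, terraform_remote_state for cross-state references and prevent_destroy as a guardrail, can be sketched together. Bucket, key, and resource names here are hypothetical:

```hcl
# Hypothetical sketch — state bucket, key, and names are illustrative.
# The compute/database state reads outputs published by the foundation state
# instead of sharing one 300-resource state file.
data "terraform_remote_state" "foundation" {
  backend = "s3"
  config = {
    bucket = "example-tf-state"
    key    = "foundation/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_db_instance" "main" {
  identifier           = "example-prod-db"
  db_subnet_group_name = data.terraform_remote_state.foundation.outputs.db_subnet_group
  # (engine, instance_class, and other required arguments omitted for brevity)

  # Guardrail: any plan that would destroy this resource fails outright.
  lifecycle {
    prevent_destroy = true
  }
}
```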
@@ -0,0 +1,50 @@
meta:
  id: cicd-pipeline-design
  level: 4
  course: terraform-infrastructure-setup
  type: output
  description: "Design CI/CD pipelines for Terraform — implement GitOps workflows with Atlantis, GitHub Actions, or Terraform Cloud for safe infrastructure deployment"
  tags: [Terraform, CI/CD, GitOps, Atlantis, GitHub-Actions, expert]

state: {}

trigger: |
  Your team deploys Terraform from individual laptops. Last month:
  - An engineer applied to production instead of staging (wrong workspace)
  - Two engineers ran apply simultaneously, causing state corruption
  - An apply failed halfway but no one noticed for 3 hours
  - No record of who deployed what or when

  You need to design a CI/CD pipeline for Terraform that prevents
  all of these issues. Options on the table:

  1. GitHub Actions with custom workflow
  2. Atlantis (pull request automation)
  3. Terraform Cloud/Enterprise
  4. Spacelift

  Requirements:
  - Plan on every PR
  - Apply only after approval and merge
  - Environment protection (can't accidentally apply to prod)
  - Cost estimation before apply
  - Security scanning (tfsec/checkov)
  - Slack notifications for plan/apply results

  Task: Design the CI/CD pipeline for Terraform, compare the tool
  options, show a complete workflow from code change to production
  deployment, and address security considerations.

assertions:
  - type: llm_judge
    criteria: "Complete pipeline workflow is designed — code change → PR opened → automated pipeline: (1) terraform fmt -check (formatting), (2) terraform validate (syntax), (3) tfsec/checkov scan (security), (4) terraform plan (preview changes), (5) Infracost estimate (cost), (6) post results as PR comment. On merge to main: (7) terraform plan again (detect drift since PR), (8) approval gate (manual for prod), (9) terraform apply, (10) post-apply verification, (11) Slack notification. Environment promotion: dev auto-apply, staging auto-apply, prod manual approval"
    weight: 0.35
    description: "Pipeline workflow"
  - type: llm_judge
    criteria: "Tool comparison is practical — Atlantis: open-source, PR automation, self-hosted, lightweight. Best for: teams wanting simple PR-based workflow. GitHub Actions: flexible, native GitHub integration, custom workflows. Best for: teams already on GitHub wanting full control. Terraform Cloud: managed service, built-in Sentinel, cost estimation, team management. Best for: organizations wanting managed solution. Spacelift: multi-tool support, advanced policies, drift detection. Best for: enterprises with complex requirements. Recommendation depends on: team size, budget, compliance needs, multi-tool requirements"
    weight: 0.35
    description: "Tool comparison"
  - type: llm_judge
    criteria: "Security considerations are covered — credentials: use OIDC for cloud authentication (no static keys in CI). GitHub Actions: aws-actions/configure-aws-credentials with OIDC. State access: CI role has minimal permissions (plan role vs apply role). Secrets: never echo credentials, use GitHub encrypted secrets or Terraform Cloud variables. Branch protection: require PR reviews, no direct pushes to main. Environment protection: GitHub environments with required reviewers for prod. Audit: log all plan/apply with outputs. Network: CI runner in private network if accessing private resources"
    weight: 0.30
    description: "Security"
@@ -0,0 +1,46 @@
meta:
  id: compliance-as-code
  level: 4
  course: terraform-infrastructure-setup
  type: output
  description: "Implement compliance as code — enforce SOC2, HIPAA, and PCI-DSS requirements through Terraform policies, scanning, and automated remediation"
  tags: [Terraform, compliance, SOC2, HIPAA, PCI-DSS, policy-as-code, expert]

state: {}

trigger: |
  Your healthtech company needs SOC2 Type II and HIPAA compliance.
  The compliance auditor found these Terraform-managed infrastructure
  gaps:

  1. S3 buckets without encryption at rest (5 of 30 buckets)
  2. Security groups allowing 0.0.0.0/0 ingress on non-HTTP ports
  3. RDS instances without encryption or automated backups
  4. CloudTrail not enabled in all regions
  5. No log retention policy (CloudWatch logs kept indefinitely)
  6. IAM users with programmatic access keys older than 90 days
  7. EBS volumes not encrypted by default

  The auditor needs evidence that:
  - These controls are enforced automatically (not just documented)
  - Non-compliant resources cannot be deployed
  - Continuous monitoring detects and alerts on compliance drift

  Task: Design the compliance-as-code strategy covering: policy
  enforcement (prevent non-compliant deployments), automated scanning,
  remediation patterns, audit evidence generation, and continuous
  compliance monitoring.

assertions:
  - type: llm_judge
    criteria: "Policy enforcement prevents non-compliant deployments — pre-deployment: checkov/tfsec in CI catches violations before apply. Sentinel policies (Terraform Cloud): hard-mandatory rules that block non-compliant applies. Example policies: all S3 buckets must have server_side_encryption_configuration, all RDS instances must have storage_encrypted = true and backup_retention_period >= 7, security groups cannot have cidr_blocks = ['0.0.0.0/0'] except on ports 80 and 443. Module library: compliant-by-default modules that enforce encryption, logging, and access controls"
    weight: 0.35
    description: "Policy enforcement"
  - type: llm_judge
    criteria: "Remediation and audit evidence are covered — remediation: (1) update Terraform modules to include compliance requirements by default (encryption, backups, logging), (2) apply changes across all environments using shared module updates, (3) for existing non-compliant resources: plan and apply to add encryption/backups. Audit evidence: (1) Terraform Cloud audit logs showing who approved what, (2) Git history showing code reviews for all infrastructure changes, (3) checkov reports stored as CI artifacts, (4) AWS Config compliance dashboard. Compliance as code = evidence generated automatically from deployment pipeline"
    weight: 0.35
    description: "Remediation and audit"
  - type: llm_judge
    criteria: "Continuous monitoring is practical — AWS Config rules: detect non-compliant resources in real-time (encrypted-volumes, s3-bucket-server-side-encryption-enabled, rds-storage-encrypted). Config remediation: auto-remediate with SSM Automation (e.g., enable encryption on new unencrypted volumes). Terraform Cloud drift detection: scheduled plans detect unauthorized changes. Alert pipeline: Config finding → SNS → Lambda → Slack/PagerDuty. Quarterly compliance review: run full checkov scan, compare against SOC2/HIPAA control matrix, generate compliance report. Map each Terraform policy to specific compliance control (SOC2 CC6.1, HIPAA §164.312(a)(2)(iv))"
    weight: 0.30
    description: "Continuous monitoring"
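The "compliant-by-default module" idea in the first criterion can be sketched as a module fragment where encryption and versioning are wired in unconditionally rather than exposed as optional inputs. Names here are hypothetical:

```hcl
# Hypothetical compliant-by-default S3 module fragment — consumers cannot
# opt out of encryption or versioning because neither is a variable.
resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_versioning" "this" {
  bucket = aws_s3_bucket.this.id

  versioning_configuration {
    status = "Enabled"
  }
}
```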
package/courses/terraform-infrastructure-setup/scenarios/level-4/cost-estimation-governance.yaml ADDED
@@ -0,0 +1,42 @@
meta:
  id: cost-estimation-governance
  level: 4
  course: terraform-infrastructure-setup
  type: output
  description: "Implement cost governance with Terraform — integrate Infracost for pre-deployment estimation, set budget alerts, and enforce cost policies"
  tags: [Terraform, cost, Infracost, FinOps, governance, budgets, expert]

state: {}

trigger: |
  Your monthly AWS bill is $150K and growing 15% month-over-month.
  Nobody knows the cost impact of Terraform changes until the bill
  arrives. Recent surprises:

  - Engineer created a NAT Gateway in 4 AZs ($576/month) when 1 was
    sufficient ($144/month)
  - A for_each over 50 items created 50 CloudWatch dashboards at
    $3/each ($150/month) — nobody realized
  - An RDS upgrade from db.r5.large to db.r5.4xlarge increased cost
    from $260/month to $2,080/month
  - Dev environment running same instance types as production ($8K/month
    wasted)

  Task: Design the cost governance strategy covering: Infracost
  integration in CI/CD, policy-based cost controls, environment-specific
  sizing, tagging for cost allocation, and ongoing cost optimization
  practices.

assertions:
  - type: llm_judge
    criteria: "Infracost integration is designed — Infracost estimates cost changes in PRs before deployment. CI integration: infracost breakdown --path . shows total monthly cost, infracost diff shows cost change from PR. PR comment: shows cost increase/decrease per resource. Setup: install Infracost in CI, generate plan JSON (terraform plan -out=plan.tfplan && terraform show -json plan.tfplan), run infracost diff --path plan.json. Thresholds: alert if monthly cost increase > $100, block if > $500 (configurable). Free tier available for open source and small teams"
    weight: 0.35
    description: "Infracost"
  - type: llm_judge
    criteria: "Policy-based cost controls are implemented — Sentinel/OPA policies: restrict expensive instance types (no db.r5.4xlarge in non-prod), limit resource counts (max 3 NAT Gateways), require cost justification for changes over threshold. Variable validation: variable 'instance_type' { validation { condition = !contains(['r5.4xlarge','r5.8xlarge'], var.instance_type) || var.environment == 'prod' } }. Environment sizing: locals { env_sizing = { dev = 't3.small', staging = 't3.medium', prod = 't3.large' } }. Enforce via policy: dev instances must be t3.small or smaller"
    weight: 0.35
    description: "Cost policies"
  - type: llm_judge
    criteria: "Tagging and optimization are practical — mandatory tags for cost allocation: Team, Environment, CostCenter, Service. AWS Cost Explorer uses tags for breakdown. Enforce tags via Sentinel policy or AWS SCP (deny resource creation without required tags). Cost optimization: (1) right-size instances (use AWS Compute Optimizer data), (2) Reserved Instances or Savings Plans for steady-state, (3) spot instances for non-critical workloads, (4) auto-shutdown dev environments off-hours (Lambda + CloudWatch Events). Monthly cost review: compare actual vs Infracost estimates, identify optimization opportunities"
    weight: 0.30
    description: "Tagging and optimization"
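The inline HCL quoted in the cost-policies criterion can be expanded into a complete, valid fragment. Instance types and sizes are illustrative, not recommendations:

```hcl
# Hypothetical sketch of the validation and sizing patterns from the
# criterion above — type names and thresholds are illustrative.
variable "environment" {
  type = string
}

variable "db_instance_type" {
  type    = string
  default = "db.r5.large"

  validation {
    condition = (
      !contains(["db.r5.4xlarge", "db.r5.8xlarge"], var.db_instance_type)
      || var.environment == "prod"
    )
    error_message = "Large RDS instance types are only permitted in prod."
  }
}

locals {
  # Environment-specific sizing so dev never runs prod-sized instances.
  env_sizing = {
    dev     = "t3.small"
    staging = "t3.medium"
    prod    = "t3.large"
  }
  instance_type = local.env_sizing[var.environment]
}
```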
@@ -0,0 +1,51 @@
meta:
  id: expert-debugging-shift
  level: 4
  course: terraform-infrastructure-setup
  type: output
  description: "Combined expert shift — advise on organizational IaC strategy while handling CI/CD pipeline failures and compliance audit findings"
  tags: [Terraform, troubleshooting, combined, shift-simulation, expert]

state: {}

trigger: |
  As infrastructure lead, you face three organizational challenges:

  Challenge 1 — CI/CD pipeline reliability:
  Your Atlantis-based pipeline has been flaky:
  - 20% of plans timeout (state lock contention with 8 teams)
  - Plans show different results on retry (eventual consistency)
  - Apply fails intermittently (API rate limiting)
  Teams are losing trust and starting to apply from laptops again.

  Challenge 2 — Compliance audit preparation:
  SOC2 auditor arrives in 6 weeks. They need to see:
  - Evidence that all infrastructure changes go through code review
  - Proof that production access is restricted
  - Encryption enforcement across all resources
  - Automated security scanning results

  Challenge 3 — Cost optimization mandate:
  CFO mandates 25% cost reduction ($37.5K/month savings from $150K).
  Current waste identified:
  - Dev environments running 24/7 ($20K/month)
  - Oversized RDS instances ($15K/month excess)
  - Unused EBS volumes and snapshots ($8K/month)
  - NAT Gateway in all AZs for non-prod ($5K/month excess)

  Task: Address all three challenges with actionable plans and
  timelines.

assertions:
  - type: llm_judge
    criteria: "CI/CD pipeline reliability is addressed — state lock contention: split monolith states into per-team states (each team's plan/apply doesn't block others). Timeout: increase lock timeout (-lock-timeout=5m), investigate which team's applies are long-running. Eventual consistency: add terraform plan -refresh-only before plan to ensure consistent state. Rate limiting: reduce parallelism (-parallelism=5), stagger team deployments. Trust recovery: show teams metrics (success rate improvement), ensure fast feedback loops. Consider: Terraform Cloud for managed execution (handles locking, retries, queueing)"
    weight: 0.35
    description: "Pipeline reliability"
  - type: llm_judge
    criteria: "Compliance preparation has a timeline — weeks 1-2: implement checkov/tfsec in all CI pipelines, generate baseline compliance reports. Weeks 2-3: remediate findings (add encryption to all S3 buckets, RDS, EBS; restrict security groups; configure CloudTrail). Weeks 3-4: implement Sentinel policies to prevent future violations. Weeks 4-5: generate audit evidence (Git logs showing all changes reviewed, CI scan reports, Terraform Cloud audit logs). Week 6: dry-run audit with compliance team. Evidence portfolio: PR review logs, automated scan reports, policy enforcement logs, access control documentation (IAM policy)"
    weight: 0.35
    description: "Compliance"
  - type: llm_judge
    criteria: "Cost optimization targets specific savings — dev environments 24/7 → schedule off-hours (Lambda + EventBridge, save $15K): terraform manages the schedule. RDS right-sizing: use Performance Insights data, downsize dev/staging instances (save $10K). EBS cleanup: terraform state list to find managed volumes, delete unattached ones, manage snapshot lifecycle (save $5K). NAT Gateway: single NAT Gateway per non-prod VPC instead of per-AZ (save $4K). Total: ~$34K savings (23%, close to 25% target). Implementation: Infracost in all PRs to prevent future waste, monthly cost review meeting, per-team cost dashboards using tags"
    weight: 0.30
    description: "Cost optimization"
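The NAT Gateway saving in the cost criterion (one gateway per non-prod VPC, one per AZ in prod) is a small conditional in HCL. Variable and resource names below are hypothetical:

```hcl
# Hypothetical sketch — one NAT Gateway per AZ in prod, a single shared
# one in non-prod VPCs. Variable names are illustrative.
locals {
  nat_count = var.environment == "prod" ? length(var.public_subnet_ids) : 1
}

resource "aws_eip" "nat" {
  count  = local.nat_count
  domain = "vpc"
}

resource "aws_nat_gateway" "this" {
  count         = local.nat_count
  subnet_id     = var.public_subnet_ids[count.index]
  allocation_id = aws_eip.nat[count.index].id
}
```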
package/courses/terraform-infrastructure-setup/scenarios/level-4/iac-organization-strategy.yaml ADDED
@@ -0,0 +1,45 @@
meta:
  id: iac-organization-strategy
  level: 4
  course: terraform-infrastructure-setup
  type: output
  description: "Design IaC organization strategy — choose between mono-repo and multi-repo, design state architecture, and establish team ownership boundaries"
  tags: [Terraform, organization, mono-repo, multi-repo, strategy, expert]

state: {}

trigger: |
  You're the infrastructure architect for a company with 80 engineers
  across 8 teams. Current state:
  - 3 separate Terraform repositories with inconsistent patterns
  - 15 state files with no naming convention
  - No shared modules — each team copy-pastes configurations
  - Teams frequently conflict when deploying overlapping resources
  - No visibility into who owns what infrastructure

  You need to design the Terraform organization strategy for the
  company. Leadership wants:
  - Clear team ownership boundaries
  - Reusable modules (stop copy-paste)
  - Safe deployment workflows
  - Audit trail for all changes
  - Cost visibility per team

  Task: Design the IaC organization strategy covering: repository
  structure (mono-repo vs multi-repo trade-offs), state architecture
  (how to partition state files), module library design, team
  ownership model, and governance policies.

assertions:
  - type: llm_judge
    criteria: "Repository and state architecture are designed — mono-repo: single repo with directories per team/service. Benefits: unified modules, single PR workflow, easy cross-team visibility. Challenges: large repo, team coupling, complex CI/CD. Multi-repo: separate repos per team or service domain. Benefits: team autonomy, independent versioning, isolated CI/CD. Challenges: module sharing harder, cross-repo coordination. Recommended for 8 teams: hybrid — shared modules repo + per-team repos. State architecture: partition by (1) environment (dev/staging/prod), (2) team/service domain, (3) blast radius. Naming: s3://state/<team>/<env>/<service>.tfstate"
    weight: 0.35
    description: "Repo and state"
  - type: llm_judge
    criteria: "Module library and team ownership are designed — internal module registry: centralized repo with versioned, tested modules (VPC, EKS, RDS, S3). Module standards: README, input/output documentation, examples, tests (terraform test or Terratest). Publishing: git tags for versioning, semantic versioning (major.minor.patch). Team ownership: CODEOWNERS file mapping directories to teams. Platform team owns shared modules and foundation infrastructure. Service teams own their application infrastructure. Tagging strategy: mandatory tags for team, cost-center, environment on all resources"
    weight: 0.35
    description: "Modules and ownership"
  - type: llm_judge
    criteria: "Governance and deployment are practical — deployment workflow: feature branch → PR → automated plan → code review → merge → automated apply. Policy enforcement: pre-commit hooks (fmt, validate), CI checks (tflint, tfsec, checkov), Sentinel/OPA policies in Terraform Cloud. Cost visibility: Infracost in PR comments, AWS Cost Explorer tags. Audit: Terraform Cloud audit logs or CloudTrail for API calls. Change management: production changes require 2 approvals, blast radius classification (high-risk changes need additional review). Onboarding: documentation, module catalog, self-service templates"
    weight: 0.30
    description: "Governance"
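The versioned internal module registry described in the second criterion looks like this from a consuming team's side. Registry address, module name, version, and tag values are hypothetical:

```hcl
# Hypothetical consumer of a versioned internal module — the registry
# address, module name, version, and tag values are illustrative.
# An exact version pin makes module upgrades an explicit, reviewable
# change in the consuming repo rather than an implicit one.
module "vpc" {
  source  = "app.terraform.io/example-org/vpc/aws"
  version = "2.4.1"

  # Mandatory cost-allocation tags, as the ownership criterion requires.
  tags = {
    Team        = "platform"
    Environment = "prod"
    CostCenter  = "cc-1234"
  }
}
```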