dojo.md 0.2.2 → 0.2.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (196)

package/courses/terraform-infrastructure-setup/scenarios/level-4/compliance-as-code.yaml ADDED

@@ -0,0 +1,46 @@
+meta:
+  id: compliance-as-code
+  level: 4
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Implement compliance as code — enforce SOC2, HIPAA, and PCI-DSS requirements through Terraform policies, scanning, and automated remediation"
+  tags: [Terraform, compliance, SOC2, HIPAA, PCI-DSS, policy-as-code, expert]
+state: {}
+trigger: |
+  Your healthtech company needs SOC2 Type II and HIPAA compliance.
+  The compliance auditor found these Terraform-managed infrastructure
+  gaps:
+  1. S3 buckets without encryption at rest (5 of 30 buckets)
+  2. Security groups allowing 0.0.0.0/0 ingress on non-HTTP ports
+  3. RDS instances without encryption or automated backups
+  4. CloudTrail not enabled in all regions
+  5. No log retention policy (CloudWatch logs kept indefinitely)
+  6. IAM users with programmatic access keys older than 90 days
+  7. EBS volumes not encrypted by default
+  The auditor needs evidence that:
+  - These controls are enforced automatically (not just documented)
+  - Non-compliant resources cannot be deployed
+  - Continuous monitoring detects and alerts on compliance drift
+  Task: Design the compliance-as-code strategy covering: policy
+  enforcement (prevent non-compliant deployments), automated scanning,
+  remediation patterns, audit evidence generation, and continuous
+  compliance monitoring.
+assertions:
+  - type: llm_judge
+    criteria: "Policy enforcement prevents non-compliant deployments — pre-deployment: checkov/tfsec in CI catches violations before apply. Sentinel policies (Terraform Cloud): hard-mandatory rules that block non-compliant applies. Example policies: all S3 buckets must have server_side_encryption_configuration, all RDS instances must have storage_encrypted = true and backup_retention_period >= 7, security groups cannot have cidr_blocks = ['0.0.0.0/0'] except on ports 80 and 443. Module library: compliant-by-default modules that enforce encryption, logging, and access controls"
+    weight: 0.35
+    description: "Policy enforcement"
+  - type: llm_judge
+    criteria: "Remediation and audit evidence are covered — remediation: (1) update Terraform modules to include compliance requirements by default (encryption, backups, logging), (2) apply changes across all environments using shared module updates, (3) for existing non-compliant resources: plan and apply to add encryption/backups. Audit evidence: (1) Terraform Cloud audit logs showing who approved what, (2) Git history showing code reviews for all infrastructure changes, (3) checkov reports stored as CI artifacts, (4) AWS Config compliance dashboard. Compliance as code = evidence generated automatically from deployment pipeline"
+    weight: 0.35
+    description: "Remediation and audit"
+  - type: llm_judge
+    criteria: "Continuous monitoring is practical — AWS Config rules: detect non-compliant resources in real-time (encrypted-volumes, s3-bucket-server-side-encryption-enabled, rds-storage-encrypted). Config remediation: auto-remediate with SSM Automation (e.g., enable encryption on new unencrypted volumes). Terraform Cloud drift detection: scheduled plans detect unauthorized changes. Alert pipeline: Config finding → SNS → Lambda → Slack/PagerDuty. Quarterly compliance review: run full checkov scan, compare against SOC2/HIPAA control matrix, generate compliance report. Map each Terraform policy to specific compliance control (SOC2 CC6.1, HIPAA §164.312(a)(2)(iv))"
+    weight: 0.30
+    description: "Continuous monitoring"
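
The pre-deployment gate described in the first assertion is normally handled by checkov, tfsec, or Sentinel. As a rough illustration of what such a gate checks, the sketch below walks a JSON plan and flags two of the auditor's findings. It assumes the plan was exported with `terraform show -json` and reads `resource_changes[].change.after`; the function name `find_violations` and the sample plan are hypothetical, and standalone `aws_security_group_rule` resources are not covered.

```python
ALLOWED_OPEN_PORTS = {80, 443}  # ports that may be open to 0.0.0.0/0, per the scenario


def find_violations(plan: dict) -> list[str]:
    """Return human-readable violations found in a Terraform JSON plan (hypothetical helper)."""
    violations = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        addr = rc.get("address", "<unknown>")
        if rc.get("type") == "aws_db_instance":
            if not after.get("storage_encrypted"):
                violations.append(f"{addr}: storage_encrypted must be true")
            if (after.get("backup_retention_period") or 0) < 7:
                violations.append(f"{addr}: backup_retention_period must be >= 7")
        elif rc.get("type") == "aws_security_group":
            for rule in after.get("ingress") or []:
                if "0.0.0.0/0" not in (rule.get("cidr_blocks") or []):
                    continue
                lo, hi = rule.get("from_port"), rule.get("to_port")
                # Only a single-port rule on 80 or 443 may be world-open.
                if not (lo == hi and lo in ALLOWED_OPEN_PORTS):
                    violations.append(f"{addr}: 0.0.0.0/0 ingress on ports {lo}-{hi}")
    return violations


# Illustrative plan fragment, not taken from the package.
sample_plan = {
    "resource_changes": [
        {"address": "aws_db_instance.main", "type": "aws_db_instance",
         "change": {"after": {"storage_encrypted": False, "backup_retention_period": 7}}},
        {"address": "aws_security_group.web", "type": "aws_security_group",
         "change": {"after": {"ingress": [
             {"from_port": 443, "to_port": 443, "cidr_blocks": ["0.0.0.0/0"]},
             {"from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]},
         ]}}},
    ]
}

for violation in find_violations(sample_plan):
    print(violation)
```

In CI, a non-empty result would fail the job before `terraform apply` runs, which is the "non-compliant resources cannot be deployed" evidence the auditor asks for.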

package/courses/terraform-infrastructure-setup/scenarios/level-4/cost-estimation-governance.yaml ADDED

@@ -0,0 +1,42 @@
+meta:
+  id: cost-estimation-governance
+  level: 4
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Implement cost governance with Terraform — integrate Infracost for pre-deployment estimation, set budget alerts, and enforce cost policies"
+  tags: [Terraform, cost, Infracost, FinOps, governance, budgets, expert]
+state: {}
+trigger: |
+  Your monthly AWS bill is $150K and growing 15% month-over-month.
+  Nobody knows the cost impact of Terraform changes until the bill
+  arrives. Recent surprises:
+  - Engineer created a NAT Gateway in 4 AZs ($576/month) when 1 was
+    sufficient ($144/month)
+  - A for_each over 50 items created 50 CloudWatch dashboards at
+    $3/each ($150/month) — nobody realized
+  - An RDS upgrade from db.r5.large to db.r5.4xlarge increased cost
+    from $260/month to $2,080/month
+  - Dev environment running same instance types as production ($8K/month
+    wasted)
+  Task: Design the cost governance strategy covering: Infracost
+  integration in CI/CD, policy-based cost controls, environment-specific
+  sizing, tagging for cost allocation, and ongoing cost optimization
+  practices.
+assertions:
+  - type: llm_judge
+    criteria: "Infracost integration is designed — Infracost estimates cost changes in PRs before deployment. CI integration: infracost breakdown --path . shows total monthly cost, infracost diff shows cost change from PR. PR comment: shows cost increase/decrease per resource. Setup: install Infracost in CI, generate plan JSON (terraform plan -out=plan.tfplan && terraform show -json plan.tfplan > plan.json), run infracost diff --path plan.json. Thresholds: alert if monthly cost increase > $100, block if > $500 (configurable). Free tier available for open source and small teams"
+    weight: 0.35
+    description: "Infracost"
+  - type: llm_judge
+    criteria: "Policy-based cost controls are implemented — Sentinel/OPA policies: restrict expensive instance types (no db.r5.4xlarge in non-prod), limit resource counts (max 3 NAT Gateways), require cost justification for changes over threshold. Variable validation: variable 'instance_type' { validation { condition = !contains(['r5.4xlarge','r5.8xlarge'], var.instance_type) || var.environment == 'prod' } } (referencing var.environment from another variable's validation requires Terraform 1.9 or later). Environment sizing: locals { env_sizing = { dev = 't3.small', staging = 't3.medium', prod = 't3.large' } }. Enforce via policy: dev instances must be t3.small or smaller"
+    weight: 0.35
+    description: "Cost policies"
+  - type: llm_judge
+    criteria: "Tagging and optimization are practical — mandatory tags for cost allocation: Team, Environment, CostCenter, Service. AWS Cost Explorer uses tags for breakdown. Enforce tags via Sentinel policy or AWS SCP (deny resource creation without required tags). Cost optimization: (1) right-size instances (use AWS Compute Optimizer data), (2) Reserved Instances or Savings Plans for steady-state, (3) spot instances for non-critical workloads, (4) auto-shutdown dev environments off-hours (Lambda + CloudWatch Events). Monthly cost review: compare actual vs Infracost estimates, identify optimization opportunities"
+    weight: 0.30
+    description: "Tagging and optimization"
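
The alert and block thresholds in the Infracost assertion amount to a three-way classification of the monthly cost delta. A minimal sketch follows, assuming the delta has already been extracted from Infracost's output as a number. The function name `classify_cost_change` is hypothetical, and the sample deltas are the surprises listed in the trigger.

```python
def classify_cost_change(monthly_delta_usd: float,
                         warn_above: float = 100.0,
                         block_above: float = 500.0) -> str:
    """Map a monthly cost increase to a CI outcome: 'pass', 'alert', or 'block'."""
    if monthly_delta_usd > block_above:
        return "block"
    if monthly_delta_usd > warn_above:
        return "alert"
    return "pass"


# Deltas taken from the scenario trigger (USD per month).
nat_gateway_delta = 576 - 144   # four NAT Gateways instead of one
dashboards_delta = 50 * 3       # fifty CloudWatch dashboards at $3 each
rds_upgrade_delta = 2080 - 260  # db.r5.large to db.r5.4xlarge

for name, delta in [("NAT Gateways", nat_gateway_delta),
                    ("dashboards", dashboards_delta),
                    ("RDS upgrade", rds_upgrade_delta)]:
    print(name, delta, classify_cost_change(delta))
```

With the default thresholds, the NAT Gateway and dashboard changes raise an alert, and only the RDS upgrade is blocked.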

package/courses/terraform-infrastructure-setup/scenarios/level-4/expert-debugging-shift.yaml ADDED

@@ -0,0 +1,51 @@
+meta:
+  id: expert-debugging-shift
+  level: 4
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Combined expert shift — advise on organizational IaC strategy while handling CI/CD pipeline failures and compliance audit findings"
+  tags: [Terraform, troubleshooting, combined, shift-simulation, expert]
+state: {}
+trigger: |
+  As infrastructure lead, you face three organizational challenges:
+  Challenge 1 — CI/CD pipeline reliability:
+  Your Atlantis-based pipeline has been flaky:
+  - 20% of plans timeout (state lock contention with 8 teams)
+  - Plans show different results on retry (eventual consistency)
+  - Apply fails intermittently (API rate limiting)
+  Teams are losing trust and starting to apply from laptops again.
+  Challenge 2 — Compliance audit preparation:
+  SOC2 auditor arrives in 6 weeks. They need to see:
+  - Evidence that all infrastructure changes go through code review
+  - Proof that production access is restricted
+  - Encryption enforcement across all resources
+  - Automated security scanning results
+  Challenge 3 — Cost optimization mandate:
+  CFO mandates 25% cost reduction ($37.5K/month savings from $150K).
+  Current waste identified:
+  - Dev environments running 24/7 ($20K/month)
+  - Oversized RDS instances ($15K/month excess)
+  - Unused EBS volumes and snapshots ($8K/month)
+  - NAT Gateway in all AZs for non-prod ($5K/month excess)
+  Task: Address all three challenges with actionable plans and
+  timelines.
+assertions:
+  - type: llm_judge
+    criteria: "CI/CD pipeline reliability is addressed — state lock contention: split monolith states into per-team states (each team's plan/apply doesn't block others). Timeout: increase lock timeout (-lock-timeout=5m), investigate which team's applies are long-running. Eventual consistency: add terraform plan -refresh-only before plan to ensure consistent state. Rate limiting: reduce parallelism (-parallelism=5), stagger team deployments. Trust recovery: show teams metrics (success rate improvement), ensure fast feedback loops. Consider: Terraform Cloud for managed execution (handles locking, retries, queueing)"
+    weight: 0.35
+    description: "Pipeline reliability"
+  - type: llm_judge
+    criteria: "Compliance preparation has a timeline — weeks 1-2: implement checkov/tfsec in all CI pipelines, generate baseline compliance reports. Weeks 2-3: remediate findings (add encryption to all S3 buckets, RDS, EBS; restrict security groups; configure CloudTrail). Weeks 3-4: implement Sentinel policies to prevent future violations. Weeks 4-5: generate audit evidence (Git logs showing all changes reviewed, CI scan reports, Terraform Cloud audit logs). Week 6: dry-run audit with compliance team. Evidence portfolio: PR review logs, automated scan reports, policy enforcement logs, access control documentation (IAM policy)"
+    weight: 0.35
+    description: "Compliance"
+  - type: llm_judge
+    criteria: "Cost optimization targets specific savings — dev environments 24/7 → schedule off-hours (Lambda + EventBridge, save $15K): terraform manages the schedule. RDS right-sizing: use Performance Insights data, downsize dev/staging instances (save $10K). EBS cleanup: terraform state list to find managed volumes, delete unattached ones, manage snapshot lifecycle (save $5K). NAT Gateway: single NAT Gateway per non-prod VPC instead of per-AZ (save $4K). Total: ~$34K savings (23%, close to 25% target). Implementation: Infracost in all PRs to prevent future waste, monthly cost review meeting, per-team cost dashboards using tags"
+    weight: 0.30
+    description: "Cost optimization"
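
The savings total in the cost-optimization assertion can be checked with simple arithmetic. The sketch below uses only figures from the scenario; the variable names are illustrative.

```python
current_monthly_bill = 150_000
target_savings = current_monthly_bill * 25 // 100  # CFO mandate: 25%

# Per-item savings as stated in the assertion (USD per month).
planned_savings = {
    "dev environment scheduling": 15_000,
    "RDS right-sizing": 10_000,
    "EBS and snapshot cleanup": 5_000,
    "single NAT Gateway in non-prod": 4_000,
}

total_savings = sum(planned_savings.values())
savings_pct = 100 * total_savings / current_monthly_bill
shortfall = target_savings - total_savings

print(f"target:    ${target_savings:,}")
print(f"planned:   ${total_savings:,} ({savings_pct:.1f}%)")
print(f"shortfall: ${shortfall:,}")
```

The planned items fall $3,500 per month short of the mandate, so a complete answer would need either one more savings source or an explicit case for accepting roughly 23%.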

package/courses/terraform-infrastructure-setup/scenarios/level-4/iac-organization-strategy.yaml ADDED

@@ -0,0 +1,45 @@
+meta:
+  id: iac-organization-strategy
+  level: 4
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Design IaC organization strategy — choose between mono-repo and multi-repo, design state architecture, and establish team ownership boundaries"
+  tags: [Terraform, organization, mono-repo, multi-repo, strategy, expert]
+state: {}
+trigger: |
+  You're the infrastructure architect for a company with 80 engineers
+  across 8 teams. Current state:
+  - 3 separate Terraform repositories with inconsistent patterns
+  - 15 state files with no naming convention
+  - No shared modules — each team copy-pastes configurations
+  - Teams frequently conflict when deploying overlapping resources
+  - No visibility into who owns what infrastructure
+  You need to design the Terraform organization strategy for the
+  company. Leadership wants:
+  - Clear team ownership boundaries
+  - Reusable modules (stop copy-paste)
+  - Safe deployment workflows
+  - Audit trail for all changes
+  - Cost visibility per team
+  Task: Design the IaC organization strategy covering: repository
+  structure (mono-repo vs multi-repo trade-offs), state architecture
+  (how to partition state files), module library design, team
+  ownership model, and governance policies.
+assertions:
+  - type: llm_judge
+    criteria: "Repository and state architecture are designed — mono-repo: single repo with directories per team/service. Benefits: unified modules, single PR workflow, easy cross-team visibility. Challenges: large repo, team coupling, complex CI/CD. Multi-repo: separate repos per team or service domain. Benefits: team autonomy, independent versioning, isolated CI/CD. Challenges: module sharing harder, cross-repo coordination. Recommended for 8 teams: hybrid — shared modules repo + per-team repos. State architecture: partition by (1) environment (dev/staging/prod), (2) team/service domain, (3) blast radius. Naming: s3://state/<team>/<env>/<service>.tfstate"
+    weight: 0.35
+    description: "Repo and state"
+  - type: llm_judge
+    criteria: "Module library and team ownership are designed — internal module registry: centralized repo with versioned, tested modules (VPC, EKS, RDS, S3). Module standards: README, input/output documentation, examples, tests (terraform test or Terratest). Publishing: git tags for versioning, semantic versioning (major.minor.patch). Team ownership: CODEOWNERS file mapping directories to teams. Platform team owns shared modules and foundation infrastructure. Service teams own their application infrastructure. Tagging strategy: mandatory tags for team, cost-center, environment on all resources"
+    weight: 0.35
+    description: "Modules and ownership"
+  - type: llm_judge
+    criteria: "Governance and deployment are practical — deployment workflow: feature branch → PR → automated plan → code review → merge → automated apply. Policy enforcement: pre-commit hooks (fmt, validate), CI checks (tflint, tfsec, checkov), Sentinel/OPA policies in Terraform Cloud. Cost visibility: Infracost in PR comments, AWS Cost Explorer tags. Audit: Terraform Cloud audit logs or CloudTrail for API calls. Change management: production changes require 2 approvals, blast radius classification (high-risk changes need additional review). Onboarding: documentation, module catalog, self-service templates"
+    weight: 0.30
+    description: "Governance"
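
The state naming convention in the first assertion (`<team>/<env>/<service>.tfstate`) is easiest to keep consistent when keys are generated rather than typed by hand. A minimal sketch follows. The helper `state_key`, the allowed environment list, and the lowercase-with-hyphens rule are assumptions for illustration, not part of the package.

```python
import re

ENVIRONMENTS = ("dev", "staging", "prod")          # assumed environment set
_SEGMENT = re.compile(r"^[a-z0-9][a-z0-9-]*$")     # assumed naming rule


def state_key(team: str, env: str, service: str) -> str:
    """Build the backend key for one state file, rejecting off-convention names."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    for segment in (team, service):
        if not _SEGMENT.match(segment):
            raise ValueError(f"invalid name segment: {segment!r}")
    return f"{team}/{env}/{service}.tfstate"


print(state_key("payments", "prod", "api"))
```

A wrapper like this could run in CI to reject backend configurations whose key does not match the convention, which addresses the "15 state files with no naming convention" problem in the trigger.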

package/courses/terraform-infrastructure-setup/scenarios/level-4/incident-response-iac.yaml ADDED

@@ -0,0 +1,47 @@
+meta:
+  id: incident-response-iac
+  level: 4
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Handle infrastructure incidents with Terraform — implement emergency change procedures, rollback strategies, and post-incident IaC reconciliation"
+  tags: [Terraform, incident-response, rollback, emergency, recovery, expert]
+state: {}
+trigger: |
+  Production is down. The sequence of events:
+  10:00 — Engineer deploys Terraform changes (new ECS task definition)
+  10:05 — Health checks start failing on 3 of 8 ECS services
+  10:10 — ALB marks targets unhealthy, 502 errors spike
+  10:15 — Pager fires, incident declared
+  10:20 — Investigation: new task definition has wrong environment
+           variable pointing to staging database
+  10:25 — Need to rollback immediately
+  Options debated during the incident:
+  1. Revert the git commit and re-apply
+  2. terraform apply -target to fix just the task definition
+  3. Manually update ECS in console
+  4. terraform state replace-provider (someone suggested this randomly)
+  After the immediate fix, terraform plan shows drift because someone
+  made emergency changes in the console during the incident.
+  Task: Design the incident response procedure for Terraform-managed
+  infrastructure covering: rollback strategies, emergency change
+  procedures, post-incident reconciliation, and runbook development.
+assertions:
+  - type: llm_judge
+    criteria: "Rollback strategies are evaluated — Option 1 (git revert + apply): safest, maintains IaC integrity, but slow (5-10 minutes for plan+apply). Option 2 (targeted apply): faster, fixes specific resource, but skips normal review process. Option 3 (console change): fastest (30 seconds), but creates drift. Recommendation: (1) for critical outages: fix in console first to restore service, then reconcile Terraform. (2) for moderate issues: git revert + targeted apply. (3) for minor issues: normal git revert + full apply. Speed of recovery matters more than IaC purity during incidents"
+    weight: 0.35
+    description: "Rollback strategies"
+  - type: llm_judge
+    criteria: "Emergency change procedure is defined — emergency procedure: (1) declare incident, (2) fix immediately using fastest safe method (console if needed), (3) document all manual changes made, (4) after incident resolved: create PR with Terraform changes matching manual fixes, (5) run terraform plan to verify no drift, (6) apply to reconcile state. Emergency access: pre-configured 'break glass' IAM role with broad permissions, used only during incidents, logged via CloudTrail. Never run terraform destroy during an incident. Incident commander approves all infrastructure changes"
+    weight: 0.35
+    description: "Emergency procedure"
+  - type: llm_judge
+    criteria: "Post-incident reconciliation is practical — after incident: (1) terraform plan -refresh-only to see all drift, (2) review each drift: accept intentional changes (update .tf), revert accidental changes (apply). (3) Update IaC to prevent recurrence (add validation, pre-deploy checks). Runbook: document common failure scenarios with exact rollback commands. Example runbooks: bad ECS deployment (terraform apply -target=aws_ecs_task_definition.app -var='image_tag=v1.2.3'), database connection issue (terraform apply -target=aws_security_group.db). Store runbooks alongside Terraform code. Post-mortem: identify what IaC improvements would have prevented the incident"
+    weight: 0.30
+    description: "Reconciliation"

package/courses/terraform-infrastructure-setup/scenarios/level-4/infrastructure-testing.yaml ADDED

@@ -0,0 +1,41 @@
+meta:
+  id: infrastructure-testing
+  level: 4
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Test Terraform infrastructure — implement unit tests with terraform test, integration tests with Terratest, and policy testing with checkov and tfsec"
+  tags: [Terraform, testing, Terratest, checkov, tfsec, terraform-test, expert]
+state: {}
+trigger: |
+  Your Terraform modules have no tests. Last month, three incidents
+  were caused by module changes that broke consumers:
+  - VPC module changed output name (vpc_id → id), breaking 5 services
+  - RDS module removed a variable, breaking all callers
+  - Security group module allowed 0.0.0.0/0 ingress by default
+  Your testing strategy needs to cover:
+  1. Module contract validation (inputs/outputs don't break)
+  2. Security compliance (no open security groups, encryption enabled)
+  3. Integration testing (resources actually work in AWS)
+  4. Cost validation (changes don't blow budget)
+  Task: Design the infrastructure testing strategy covering: terraform
+  test (native, Terraform 1.6+), Terratest (Go-based integration),
+  policy scanning (checkov, tfsec), testing pyramid for infrastructure,
+  and CI integration for automated testing.
+assertions:
+  - type: llm_judge
+    criteria: "terraform test (native testing) is explained — terraform test runs .tftest.hcl files. command = plan: validates without creating resources (fast, free). command = apply: creates real resources (slow, costs money, thorough). Assert conditions: assert { condition = output.vpc_id != '', error_message = 'VPC ID must not be empty' }. Variables block for test inputs. Run blocks chain: create VPC → verify VPC → create subnet using VPC. Module contract testing: verify required outputs exist and have correct types. Run with: terraform test. Best for: unit testing modules without external dependencies"
+    weight: 0.35
+    description: "terraform test"
+  - type: llm_judge
+    criteria: "Terratest and policy scanning are covered — Terratest (Go): creates real infrastructure, validates properties, destroys after test. Pattern: InitAndApply → verify outputs → verify cloud resources (SDK calls) → Destroy. Example: deploy VPC, verify CIDR block matches, verify subnets are in correct AZs, destroy. Best for: integration testing that verifies real cloud behavior. Policy scanning: checkov scans for security misconfigurations (CIS benchmarks, HIPAA, PCI-DSS). tfsec: Terraform-specific security scanner. OPA/Conftest: custom policy validation. Run in CI before plan to catch issues early"
+    weight: 0.35
+    description: "Terratest and policy"
+  - type: llm_judge
+    criteria: "Testing pyramid and CI integration are practical — testing pyramid for infrastructure: base = static analysis (fmt, validate, tfsec, checkov — fast, run on every PR), middle = plan-based tests (terraform test with command = plan — moderate speed, no cost), top = integration tests (Terratest/terraform test with command = apply — slow, costly, run nightly or on release). CI integration: every PR gets static analysis + plan tests. Nightly: full integration tests against ephemeral AWS account. Release: full integration suite. Cost management: use smallest instance types in tests, set up auto-cleanup for failed test runs, dedicated test AWS account with budget alerts"
+    weight: 0.30
+    description: "Pyramid and CI"
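
The module contract failure in the trigger (`vpc_id` renamed to `id`) can be caught statically, before any `terraform test` run, by comparing the outputs declared in two versions of a module. The sketch below is a simplified illustration: it uses a regular expression rather than a real HCL parser, and the helper names and the two module snippets are hypothetical.

```python
import re

# Matches the name in lines such as: output "vpc_id" {
OUTPUT_RE = re.compile(r'^\s*output\s+"([^"]+)"', re.MULTILINE)


def declared_outputs(tf_source: str) -> set[str]:
    """Collect output names declared in a Terraform source string."""
    return set(OUTPUT_RE.findall(tf_source))


def removed_outputs(old_source: str, new_source: str) -> set[str]:
    """Outputs present in the old module version but missing from the new one."""
    return declared_outputs(old_source) - declared_outputs(new_source)


OLD_MODULE = '''
output "vpc_id" {
  value = aws_vpc.this.id
}
output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}
'''

NEW_MODULE = '''
output "id" {
  value = aws_vpc.this.id
}
output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}
'''

print(sorted(removed_outputs(OLD_MODULE, NEW_MODULE)))  # ['vpc_id']
```

A check like this belongs at the base of the testing pyramid: it is fast, free, and can fail a PR that removes an output without a major version bump.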

package/courses/terraform-infrastructure-setup/scenarios/level-4/module-registry-design.yaml ADDED

@@ -0,0 +1,45 @@
+meta:
+  id: module-registry-design
+  level: 4
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Design a private module registry — create versioned, tested, documented modules with governance for enterprise consumption"
+  tags: [Terraform, modules, registry, versioning, governance, enterprise, expert]
+state: {}
+trigger: |
+  Your organization has 200+ Terraform modules scattered across 15
+  repositories with no versioning, testing, or documentation standards.
+  Teams duplicate effort building similar modules. There's no way to
+  know which modules are safe, maintained, or compliant.
+  A recent incident: a team used an outdated VPC module that created
+  security groups without logging — violating SOC2 controls. The module
+  had been "fixed" months ago but the team was using an old copy.
+  You need to design a private module registry that:
+  - Provides a catalog of approved, tested modules
+  - Enforces versioning and deprecation
+  - Includes compliance-checked modules
+  - Has clear ownership and support model
+  - Prevents use of unapproved or outdated modules
+  Task: Design the private module registry covering: registry platform
+  choice, module lifecycle (create, review, publish, deprecate),
+  versioning strategy, testing requirements, documentation standards,
+  and consumption governance.
+assertions:
+  - type: llm_judge
+    criteria: "Registry platform and module lifecycle are designed — platform options: Terraform Cloud private registry (built-in, easy), self-hosted registry (terraform-registry-address), Git-based with tags (simple, no separate infrastructure). Module lifecycle: (1) proposal: RFC for new module, (2) development: follow template structure, (3) review: platform team reviews for compliance and quality, (4) testing: automated tests must pass (terraform test, checkov), (5) publish: tagged release with changelog, (6) maintenance: active, deprecated, archived states. Module maturity levels: experimental, supported, certified"
+    weight: 0.35
+    description: "Registry and lifecycle"
+  - type: llm_judge
+    criteria: "Versioning and testing are enforced — semantic versioning: MAJOR (breaking changes), MINOR (new features, backward compatible), PATCH (bug fixes). Version constraints: consumers use ~> 2.0 (allows 2.x, not 3.0). Breaking change policy: major version bump, migration guide, deprecation notice 2 releases before removal. Testing requirements before publish: (1) terraform fmt -check passes, (2) terraform validate passes, (3) checkov scan clean, (4) terraform test plan-level tests pass, (5) integration tests pass (for certified modules). CI pipeline: on tag creation, run all tests, publish to registry if passing"
+    weight: 0.35
+    description: "Versioning and testing"
+  - type: llm_judge
+    criteria: "Documentation and consumption governance are practical — documentation requirements: README with description, usage examples, inputs table, outputs table, requirements (provider versions). Generated with terraform-docs. Examples directory with working configurations. Consumption governance: Sentinel policy requiring modules from approved registry sources only. Module pinning: all consumers must use version constraints (not latest). Upgrade process: platform team announces new versions, teams have 90 days to upgrade deprecated versions. Metrics: module adoption rate, version currency (% on latest), support ticket volume per module"
+    weight: 0.30
+    description: "Docs and governance"
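
The `~> 2.0` constraint in the versioning assertion lets only the rightmost listed component increase. The sketch below re-implements that rule for plain numeric versions to make the boundary explicit. It is an illustration of the semantics, not Terraform's own resolver: it does not handle pre-release suffixes, and the function names are hypothetical.

```python
def _parse(version: str) -> tuple[int, ...]:
    """Split '2.5.1' into (2, 5, 1). Numeric components only."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Return True if `version` satisfies the constraint '~> <constraint>'."""
    lower = _parse(constraint)
    if len(lower) < 2:
        raise ValueError("this sketch needs at least MAJOR.MINOR in the constraint")
    # Upper bound: bump the second-to-last component and drop the last one.
    upper = lower[:-2] + (lower[-2] + 1,)
    candidate = _parse(version)
    width = max(len(candidate), len(lower))

    def pad(parts: tuple[int, ...]) -> tuple[int, ...]:
        return parts + (0,) * (width - len(parts))

    return pad(lower) <= pad(candidate) < pad(upper)


print(satisfies_pessimistic("2.5.1", "2.0"))    # True: any 2.x is allowed
print(satisfies_pessimistic("3.0.0", "2.0"))    # False: next major is excluded
print(satisfies_pessimistic("2.1.0", "2.0.1"))  # False: only 2.0.x is allowed
```

The third case shows why the number of components in the constraint matters: `~> 2.0.1` pins consumers to patch releases, while `~> 2.0` also admits new minor versions.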

package/courses/terraform-infrastructure-setup/scenarios/level-4/multi-account-strategy.yaml ADDED

@@ -0,0 +1,57 @@
+meta:
+  id: multi-account-strategy
+  level: 4
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Design multi-account Terraform strategy — implement AWS Organizations landing zone, cross-account roles, and account vending with Terraform"
+  tags: [Terraform, multi-account, AWS-Organizations, landing-zone, cross-account, expert]
+state: {}
+trigger: |
+  Your company is moving from a single AWS account (everything in one
+  account) to a multi-account strategy using AWS Organizations. Plan:
+  ```
+  Management Account (root)
+  ├── Security OU
+  │   ├── Security Account (GuardDuty, SecurityHub)
+  │   └── Log Archive Account (CloudTrail, Config)
+  ├── Infrastructure OU
+  │   ├── Shared Services (DNS, VPN, Transit Gateway)
+  │   └── Network Hub (centralized networking)
+  ├── Workloads OU
+  │   ├── Production OU
+  │   │   ├── App1-Prod
+  │   │   └── App2-Prod
+  │   └── Non-Production OU
+  │       ├── App1-Dev
+  │       └── App2-Staging
+  └── Sandbox OU
+      └── Developer Sandboxes
+  ```
+  Terraform needs to:
+  1. Create and manage the Organization structure
+  2. Provision new accounts automatically (account vending)
+  3. Apply baseline security controls to every account
+  4. Manage cross-account networking (Transit Gateway)
+  Task: Design the multi-account Terraform strategy covering:
+  Organization management, account vending machine, baseline
+  security controls (SCPs, GuardDuty, Config), cross-account
+  IAM, and state management across accounts.
+assertions:
+  - type: llm_judge
+    criteria: "Organization management with Terraform is designed — aws_organizations_organization for the org, aws_organizations_organizational_unit for OUs, aws_organizations_account for member accounts, aws_organizations_policy (SCP) for guardrails. Account vending: module that creates account, configures baseline (IAM roles, logging, security), outputs account ID. SCPs: restrict allowed services and regions, deny root user actions, enforce encryption. Terraform runs from management account with OrganizationAccountAccessRole to configure member accounts"
+    weight: 0.35
+    description: "Organization management"
+  - type: llm_judge
+    criteria: "Baseline security and cross-account are covered — baseline module per account: (1) CloudTrail → Log Archive bucket, (2) AWS Config → centralized rules, (3) GuardDuty member enrollment, (4) IAM password policy, (5) EBS default encryption, (6) S3 Block Public Access account-level setting. Cross-account IAM: create TerraformRole in each account with trust to management/CI account. Cross-account networking: Transit Gateway in network hub, RAM sharing to workload accounts. VPC peering or Transit Gateway attachments managed by network team's Terraform"
+    weight: 0.35
+    description: "Baseline and cross-account"
+  - type: llm_judge
+    criteria: "State management across accounts is practical — one state file per account per domain (not one giant state). State bucket: centralized in management or shared services account with cross-account access policies. State architecture: management-account/org.tfstate, security-account/baseline.tfstate, each-workload-account/baseline.tfstate + app.tfstate. CI/CD: pipeline assumes different roles per account. terraform_remote_state: network team's outputs consumed by workload teams. DynamoDB locking table: centralized with per-state granularity. Least privilege: each team's CI role can only access their account's state"
+    weight: 0.30
+    description: "State management"

package/courses/terraform-infrastructure-setup/scenarios/level-5/board-infrastructure-investment.yaml ADDED

@@ -0,0 +1,53 @@
+meta:
+  id: board-infrastructure-investment
+  level: 5
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Present infrastructure investment to the board — justify IaC platform spend, demonstrate ROI, and align technology strategy with business outcomes"
+  tags: [Terraform, board, ROI, investment, business-case, strategy, master]
+state: {}
+trigger: |
+  You're preparing a board presentation for a mid-market SaaS company
+  ($50M ARR, 150 engineers, Series C). The board is questioning the
+  proposed $2M infrastructure platform investment over 2 years:
+  Investment breakdown:
+  - Terraform Cloud Enterprise: $200K/year
+  - Platform team (4 new hires): $800K/year
+  - Module library development: $300K (one-time)
+  - Training program: $100K/year
+  - Infrastructure testing: $100K/year
+  Total: $800K Year 1, $1.2M Year 2
+  Board concerns:
+  - "We're not an infrastructure company — why spend $2M on plumbing?"
+  - "Our competitors seem to manage without this investment"
+  - "Can't we just hire more DevOps engineers instead?"
+  - "What's the ROI timeline?"
+  Current pain:
+  - 3 production outages/quarter (average $50K revenue impact each)
+  - 2-week deployment cycle (competitors deploy daily)
+  - 30% of engineering time on infrastructure ops
+  - Failed enterprise deals due to security/compliance gaps
+  - $1.8M/year in AWS costs, growing 20% YoY with no optimization
+  Task: Build the board-level business case for IaC investment
+  covering: ROI analysis, competitive positioning, risk mitigation,
+  and the executive narrative.
+assertions:
+  - type: llm_judge
+    criteria: "ROI analysis is quantified — costs: $2M over 2 years. Benefits Year 1: reduced outages (from 12 to 3/year at $50K each = $450K saved), deployment acceleration (engineering productivity: 30% ops → 15% = 22 engineer-months freed at $15K/month = $330K), AWS cost optimization (20% reduction = $360K/year from current $1.8M). Benefits Year 2: additional enterprise deals enabled by compliance ($2M+ ARR pipeline), further ops reduction (10%), hiring efficiency (fewer DevOps needed). Total 2-year benefit: $3-5M. ROI: 150-250% over 2 years. Payback period: 12-15 months. Frame as: infrastructure investment SAVES money, it doesn't just cost money"
+    weight: 0.35
+    description: "ROI analysis"
+  - type: llm_judge
+    criteria: "Competitive positioning addresses board concerns — 'not an infrastructure company': infrastructure is a competitive moat (faster deployments = faster features = win customers). Competitors DO invest (they just don't talk about it publicly). 'hire more DevOps': doesn't scale — each new DevOps engineer adds linear capacity, platform adds exponential (150 engineers benefit from 4 platform engineers). 'ROI timeline': infrastructure investment front-loads cost but compounds returns. Year 1: breakeven. Year 2+: net positive and accelerating. Enterprise readiness: SOC2/HIPAA compliance unlocks enterprise market ($10M+ ARR opportunity), impossible without proper IaC governance"
+    weight: 0.35
+    description: "Competitive positioning"
+  - type: llm_judge
+    criteria: "Risk and narrative are compelling — risk of NOT investing: continued outages erode customer trust, enterprise deals lost to competitors, increasing AWS bill without optimization, engineering velocity gap widens. Risk mitigation of investment: phased approach (spend $200K first 3 months, validate before full commitment), measurable milestones (if no improvement by month 6, adjust strategy). Executive narrative: 'We're investing in engineering velocity. Every dollar spent on infrastructure automation generates $3 in engineering productivity and $5 in enterprise revenue opportunity. This isn't plumbing — it's the engine that powers our product velocity and enterprise readiness.' Board metrics to track: deployment frequency, outage count, AWS cost trend, enterprise deal closure rate"
+    weight: 0.30
+    description: "Risk and narrative"
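
Note on the ROI criteria in this file: the arithmetic can be checked with a short sketch. The dollar figures are copied from the criteria text; the two ROI formulas (net return versus benefit-to-cost ratio) and all function and variable names are assumptions added for illustration, not part of the package.

```python
# Worked check of the Year 1 figures quoted in the ROI criteria.
# All names here are hypothetical; only the dollar amounts come from the scenario.

COST_2YR = 2_000_000  # "costs: $2M over 2 years"

outage_savings = (12 - 3) * 50_000        # 9 fewer outages at $50K each
productivity_savings = 22 * 15_000        # 22 engineer-months at $15K/month
aws_savings = 1_800_000 * 20 // 100       # 20% of the $1.8M AWS bill

year1_benefit = outage_savings + productivity_savings + aws_savings


def net_roi(total_benefit: int, cost: int) -> float:
    """Net return: (benefit - cost) / cost. Assumed definition."""
    return (total_benefit - cost) / cost


def benefit_ratio(total_benefit: int, cost: int) -> float:
    """Benefit divided by cost. Assumed definition."""
    return total_benefit / cost


if __name__ == "__main__":
    print(f"Year 1 benefit: ${year1_benefit:,}")
    for benefit in (3_000_000, 5_000_000):  # "Total 2-year benefit: $3-5M"
        print(
            f"${benefit:,}: ratio {benefit_ratio(benefit, COST_2YR):.0%}, "
            f"net {net_roi(benefit, COST_2YR):.0%}"
        )
```

Under these assumptions, the quoted 150-250% corresponds to benefit divided by cost; measured net of cost, the same inputs give 50-150%.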

package/courses/terraform-infrastructure-setup/scenarios/level-5/disaster-recovery-iac.yaml ADDED

@@ -0,0 +1,47 @@
+meta:
+  id: disaster-recovery-iac
+  level: 5
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Design disaster recovery with Terraform — implement multi-region failover, state backup, infrastructure rebuilding, and DR testing strategies"
+  tags: [Terraform, disaster-recovery, multi-region, failover, backup, master]
+state: {}
+trigger: |
+  Your client's us-east-1 region experienced a 4-hour outage affecting
+  their production environment. Post-mortem revealed:
+  - No multi-region deployment (single region)
+  - Terraform state stored only in us-east-1 S3 bucket (inaccessible
+    during outage)
+  - No runbook for rebuilding infrastructure from scratch
+  - RTO target: 1 hour. Actual recovery: 4 hours
+  - RPO target: 15 minutes. Actual data loss: 2 hours
+  The CEO demands a DR strategy that meets:
+  - RTO: < 30 minutes for critical services
+  - RPO: < 5 minutes for transactional data
+  - Annual DR testing: full failover drill
+  - Cost: additional spend < 30% of current infrastructure cost
+  Current infrastructure: $200K/month
+  DR budget: up to $60K/month additional
+  Task: Design the DR strategy using Terraform covering: multi-region
+  architecture, state backup strategy, infrastructure-as-code for
+  rapid rebuilding, DR testing automation, and cost optimization
+  for standby infrastructure.
+assertions:
+  - type: llm_judge
+    criteria: "Multi-region architecture meets RTO/RPO — active-passive architecture: primary (us-east-1) handles all traffic, secondary (us-west-2) has warm standby. Critical tier (RTO < 30 min): multi-AZ RDS with cross-region read replica (RPO: seconds), S3 cross-region replication, Route53 health checks with automated failover. Important tier (RTO < 2 hours): AMI/container image replication, Terraform can provision compute in 15 minutes. Non-critical tier (RTO < 8 hours): rebuild from Terraform on demand, no standby resources. Terraform manages all regions: provider aliases for us-east-1 and us-west-2, shared modules deployed to both"
+    weight: 0.35
+    description: "Multi-region architecture"
+  - type: llm_judge
+    criteria: "State backup and rebuilding strategy are robust — state backup: S3 bucket with versioning + cross-region replication to us-west-2 bucket. DynamoDB global table for lock table (accessible in both regions). If primary S3 inaccessible: terraform init with backend pointing to replicated bucket. Infrastructure rebuilding: Terraform code can recreate all infrastructure from scratch. Test this quarterly — run terraform plan in a clean account to verify. Runbook: step-by-step DR activation (1) activate Route53 failover, (2) promote RDS read replica, (3) terraform apply in DR region for compute, (4) verify application health. Automated: Lambda triggered by CloudWatch alarm runs DR activation script"
+    weight: 0.35
+    description: "State and rebuilding"
+  - type: llm_judge
+    criteria: "DR testing and cost optimization are practical — DR testing: quarterly full failover drill using Terraform. Automation: (1) terraform workspace for DR drill, (2) apply creates full DR infrastructure, (3) automated tests verify failover works, (4) terraform destroy after drill. GameDay approach: simulate failures (kill primary, verify automatic failover). Document: recovery time achieved, data loss measured, issues found. Cost optimization: warm standby only for critical tier (RDS replica: ~$3K/month, S3 replication: minimal, Route53 health checks: minimal). Compute on-demand in DR region (no standby instances — provision with Terraform during failover, 10-15 min). Estimated DR cost: $15-20K/month (7.5-10% of the $200K/month infrastructure cost), well within $60K budget"
+    weight: 0.30
+    description: "Testing and cost"
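
Note on the DR cost figures in this file: a minimal sketch of the budget check, using only the amounts stated in the trigger and criteria ($200K/month infrastructure, a 30% cap, and a $15-20K/month estimate). The function names are hypothetical and exist only for this illustration.

```python
# Budget check for the DR scenario. Names are hypothetical; amounts are from the scenario.

MONTHLY_INFRA = 200_000   # "Current infrastructure: $200K/month"
BUDGET_CAP_PCT = 30       # "additional spend < 30% of current infrastructure cost"


def dr_budget_cap(monthly_infra: int, cap_pct: int) -> int:
    """Maximum additional monthly DR spend allowed by the percentage cap."""
    return monthly_infra * cap_pct // 100


def share_of_infra(dr_cost: int, monthly_infra: int) -> float:
    """DR cost expressed as a percentage of current infrastructure spend."""
    return 100 * dr_cost / monthly_infra


if __name__ == "__main__":
    cap = dr_budget_cap(MONTHLY_INFRA, BUDGET_CAP_PCT)
    print(f"Budget cap: ${cap:,}/month")
    for estimate in (15_000, 20_000):  # "Estimated DR cost: $15-20K/month"
        pct = share_of_infra(estimate, MONTHLY_INFRA)
        print(f"${estimate:,}/month is {pct}% of infrastructure; within cap: {estimate < cap}")
```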

package/courses/terraform-infrastructure-setup/scenarios/level-5/enterprise-iac-transformation.yaml ADDED

@@ -0,0 +1,48 @@
+meta:
+  id: enterprise-iac-transformation
+  level: 5
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Lead enterprise IaC transformation — design the organizational change management strategy for adopting Terraform across a 500-person engineering organization"
+  tags: [Terraform, enterprise, transformation, change-management, adoption, master]
+state: {}
+trigger: |
+  You're a consulting CTO advising a Fortune 500 financial services
+  company (500 engineers, $5M/month AWS bill) on IaC transformation.
+  Current state:
+  - 80% of infrastructure provisioned via console or manual scripts
+  - 5 teams use Terraform (inconsistent patterns, no governance)
+  - 20 teams use no IaC at all
+  - Manual change management process (tickets, approvals, 3-day SLA)
+  - 3 compliance frameworks (SOC2, PCI-DSS, GDPR)
+  - 2 failed IaC adoption attempts in the past 3 years
+  Why previous attempts failed:
+  - Attempt 1: mandated Terraform for everyone, no training → engineers
+    created broken configurations, lost trust
+  - Attempt 2: platform team built perfect modules → too rigid, teams
+    couldn't customize, abandoned within 6 months
+  CEO: "We need infrastructure as code. Our competitors deploy daily,
+  we deploy monthly. But it has to actually stick this time."
+  Task: Design the enterprise IaC transformation strategy that
+  addresses why previous attempts failed. Cover: phased adoption,
+  team enablement, governance model, success metrics, executive
+  communication, and risk mitigation.
+assertions:
+  - type: llm_judge
+    criteria: "Phased adoption addresses previous failures — why mandates fail: forced adoption without enablement creates resistance. Why rigid platforms fail: one-size-fits-all ignores team autonomy. Better approach: Phase 1 (months 1-3): select 3 willing champion teams, co-develop patterns with them (not for them). Phase 2 (months 4-6): expand to 8-10 teams using champion-developed patterns, champions become mentors. Phase 3 (months 7-12): organization-wide with self-service platform, remaining teams onboard with support. Phase 4 (months 13-18): optimize, advanced features, measure ROI. Key: teams choose when to adopt (within a deadline), not forced on day 1"
+    weight: 0.35
+    description: "Phased adoption"
+  - type: llm_judge
+    criteria: "Enablement and governance balance autonomy with guardrails — enablement: 2-week IaC bootcamp per team (not just Terraform syntax — include workflow, patterns, debugging). Pair programming: platform engineers embed in teams for first 2 months. Self-service catalog: modules teams can use immediately (VPC, ECS, RDS) with sensible defaults but configurable. Governance: guardrails not gates. Pre-commit hooks (fmt, validate, scan) — fast feedback. CI pipeline (plan, security scan) — catch issues before merge. Policy enforcement (Sentinel/OPA) — prevent non-compliant deployments. Allow teams to write custom modules within compliance boundaries"
+    weight: 0.35
+    description: "Enablement and governance"
+  - type: llm_judge
+    criteria: "Metrics and executive communication are practical — success metrics: (1) adoption rate (% of infrastructure managed by IaC, target: 50% in 6 months, 80% in 12 months), (2) deployment frequency (monthly → weekly → daily), (3) change failure rate (failed deployments ÷ total deployments), (4) MTTR (time to recover from failures), (5) compliance score (automated scan pass rate), (6) team satisfaction (survey). Executive dashboard: IaC coverage, deployment velocity, cost savings, compliance posture. Board narrative: IaC is competitive advantage — competitors deploy 10x faster, compliance is automated not manual, risk is reduced through automation. ROI: reduced manual work ($500K/year), faster deployments ($2M opportunity cost), compliance automation ($300K/year audit savings)"
+    weight: 0.30
+    description: "Metrics and communication"
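
Note on the success metrics in this file: two of them are simple ratios, sketched below. The change failure rate formula (failed deployments ÷ total deployments) is from the criteria text; the adoption-rate helper, the zero-deployment handling, and the sample numbers are assumptions for illustration only.

```python
# Sketch of two metrics named in the criteria. Sample inputs are hypothetical.


def change_failure_rate(failed: int, total: int) -> float:
    """Failed deployments divided by total deployments (0.0 when nothing was deployed)."""
    if total == 0:
        return 0.0  # Assumed convention: no deployments means no failures to report.
    return failed / total


def adoption_rate(iac_managed: int, total_resources: int) -> float:
    """Percentage of infrastructure resources managed by IaC."""
    if total_resources == 0:
        return 0.0
    return 100 * iac_managed / total_resources


if __name__ == "__main__":
    # Hypothetical month: 5 failed out of 20 deployments, 400 of 800 resources under IaC.
    print(f"Change failure rate: {change_failure_rate(5, 20):.0%}")
    print(f"Adoption rate: {adoption_rate(400, 800):.0f}% (6-month target in the criteria: 50%)")
```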

package/courses/terraform-infrastructure-setup/scenarios/level-5/iac-technology-evolution.yaml ADDED

@@ -0,0 +1,49 @@
+meta:
+  id: iac-technology-evolution
+  level: 5
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Navigate IaC technology evolution — evaluate emerging trends like AI-assisted infrastructure, ephemeral environments, and the future of infrastructure management"
+  tags: [Terraform, future, AI, evolution, trends, ephemeral, master]
+state: {}
+trigger: |
+  You're presenting at a CTO Summit on "The Future of Infrastructure
+  as Code." Your audience: 200 CTOs from mid-market to enterprise
+  companies. They want to understand where IaC is heading and how
+  to prepare.
+  Trends to address:
+  1. AI-assisted infrastructure: AI generating Terraform code, AI
+     reviewing plans, AI optimizing resource configurations
+  2. Ephemeral infrastructure: environments created on-demand per PR,
+     destroyed after merge (preview environments)
+  3. Policy-as-code maturity: from basic scanning to real-time
+     enforcement with AI-enhanced policies
+  4. Platform engineering explosion: internal developer platforms
+     abstracting Terraform entirely from developers
+  5. License and governance changes: BSL, OpenTofu, commercial model
+     evolution
+  6. Serverless/managed services reducing IaC scope: less infrastructure
+     to manage as services become more managed
+  7. GitOps and declarative infrastructure beyond Terraform
+  Task: Present the IaC evolution roadmap covering: where we are today,
+  where we're heading (2-5 year horizon), what CTOs should invest in
+  now, what to watch but not invest in yet, and what's hype vs reality.
+assertions:
+  - type: llm_judge
+    criteria: "Current state and near-term evolution are grounded — today: Terraform dominates multi-cloud IaC, Pulumi growing for developer-centric teams, policy-as-code established but underutilized. Near-term (1-2 years): AI-assisted IaC code generation is real and useful for boilerplate but requires human review for correctness and security. Ephemeral environments becoming standard (Terraform + CI/CD creates preview envs per PR, destroys on merge). Platform engineering is the dominant pattern — developers interact with catalogs, not with Terraform. Invest now: platform engineering (high ROI, solves real bottleneck), policy-as-code (compliance requirement growing), CI/CD for Terraform (table stakes)"
+    weight: 0.35
+    description: "Current and near-term"
+  - type: llm_judge
+    criteria: "Medium-term trends are realistic — 2-5 year horizon: AI will handle 60-70% of routine IaC (module creation, configuration, troubleshooting) but humans still needed for architecture decisions and complex debugging. Self-healing infrastructure: Terraform + AI detects drift, proposes fixes, applies after approval. Serverless/managed services continue reducing IaC scope — less EC2, more Lambda/Fargate/managed databases. Watch but don't over-invest: fully autonomous infrastructure (AI managing everything without human oversight — too risky for production). License evolution: open-source IaC will always exist (OpenTofu guarantees), but commercial features diverge. Multi-tool strategies: organizations use Terraform + CDK + Helm, not just one tool"
+    weight: 0.35
+    description: "Medium-term trends"
+  - type: llm_judge
+    criteria: "CTO action items separate hype from reality — invest now: (1) platform engineering team (highest ROI, immediate impact), (2) compliance automation (regulatory pressure increasing), (3) cost governance in IaC pipeline (FinOps). Watch carefully: (1) AI-generated IaC (useful for acceleration, not for replacement), (2) OpenTofu stability and enterprise readiness, (3) serverless-first architecture reducing IaC needs. Hype to ignore: (1) 'no-code infrastructure' (always needs expert oversight), (2) 'universal cloud abstraction' (cloud-agnostic = lowest common denominator), (3) 'AI replaces DevOps engineers' (AI augments, doesn't replace infrastructure expertise). Key advice: the IaC tool matters less than the process — invest in workflows, governance, and developer experience rather than chasing the latest tool"
+    weight: 0.30
+    description: "CTO actions"