npm - dojo.md - Versions diffs - 0.2.2 → 0.2.4 - Mend

dojo.md 0.2.2 → 0.2.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (196)

package/courses/terraform-infrastructure-setup/scenarios/level-3/conditional-resources.yaml ADDED

@@ -0,0 +1,66 @@
+meta:
+  id: conditional-resources
+  level: 3
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Create conditional resources — use count, for_each, and ternary expressions to conditionally create infrastructure based on environment or feature flags"
+  tags: [Terraform, conditional, feature-flags, count, ternary, advanced]
+state: {}
+trigger: |
+  Your infrastructure needs different resources per environment:
+  - Prod: NAT Gateway (expensive), multi-AZ RDS, CloudFront
+  - Dev: NAT Instance (cheap), single-AZ RDS, no CloudFront
+  - All: VPC, subnets, security groups
+  Your initial attempt:
+  ```hcl
+  resource "aws_nat_gateway" "main" {
+    count = var.environment == "prod" ? 1 : 0
+    # ...
+  }
+  resource "aws_instance" "nat" {
+    count = var.environment == "prod" ? 0 : 1
+    # ...
+  }
+  resource "aws_db_instance" "main" {
+    multi_az = var.environment == "prod" ? true : false
+    # ...
+  }
+  resource "aws_cloudfront_distribution" "cdn" {
+    count = var.environment == "prod" ? 1 : 0
+    origin {
+      domain_name = aws_lb.web.dns_name
+    }
+  }
+  ```
+  Problem: other resources reference aws_nat_gateway.main.id but it
+  might not exist (count = 0):
+  ```
+  Error: Invalid index
+    aws_nat_gateway.main[0] does not exist
+  ```
+  Task: Explain conditional resource creation patterns, how to
+  safely reference conditional resources, feature flags in Terraform,
+  and environment-specific configuration strategies.
+assertions:
+  - type: llm_judge
+    criteria: "Conditional creation patterns are explained — count = condition ? 1 : 0 is the primary pattern. When count = 0, resource doesn't exist. Reference safely: one(aws_nat_gateway.main[*].id) or try(aws_nat_gateway.main[0].id, null). Splat expression: aws_nat_gateway.main[*].id returns empty list when count = 0 (no error). Conditional with for_each: for_each = var.enable_cdn ? { cdn = true } : {} — resource exists when map is non-empty. Boolean attributes need no ternary: multi_az = var.environment == 'prod' (the comparison is already a boolean)"
+    weight: 0.35
+    description: "Conditional patterns"
+  - type: llm_judge
+    criteria: "Safe referencing is covered — problem: aws_nat_gateway.main[0].id errors when count = 0. Solutions: (1) use splat: aws_nat_gateway.main[*].id (returns list, empty if count=0). (2) use try(): try(aws_nat_gateway.main[0].id, ''). (3) use one(): one(aws_nat_gateway.main[*].id) (returns single value or null). (4) use locals to abstract: locals { nat_gateway_id = length(aws_nat_gateway.main) > 0 ? aws_nat_gateway.main[0].id : null }. Then reference local.nat_gateway_id. For route tables: use coalesce() or conditional in the referencing resource's count"
+    weight: 0.35
+    description: "Safe referencing"
+  - type: llm_judge
+    criteria: "Feature flags and environment strategy are practical — feature flags: variable 'features' { type = object({ enable_cdn = bool, enable_waf = bool, enable_monitoring = bool }), default = { enable_cdn = false, enable_waf = false, enable_monitoring = true } }. Each feature conditionally creates resources. Environment strategy: (1) simple: ternary on var.environment, (2) structured: locals map with per-environment settings: locals { env_config = { prod = { nat_type = 'gateway', rds_multi_az = true }, dev = { nat_type = 'instance', rds_multi_az = false } } }. Reference: local.env_config[var.environment].rds_multi_az. Keeps all environment differences in one place"
+    weight: 0.30
+    description: "Feature flags"
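
The safe-referencing pattern named in the assertions above could look roughly like the following. This is a minimal sketch, not part of the package: it assumes Terraform 0.15 or later (for `one()`), and `aws_route_table.private` and the route resource are hypothetical names introduced for illustration. The two NAT resources are the ones from the trigger.

```hcl
locals {
  # one() returns the single element of a list, or null when the list is
  # empty, so neither expression fails when the resource has count = 0.
  nat_gateway_id      = one(aws_nat_gateway.main[*].id)
  nat_instance_eni_id = one(aws_instance.nat[*].primary_network_interface_id)
}

# Hypothetical default route for the private subnets.
resource "aws_route" "private_default" {
  route_table_id         = aws_route_table.private.id # assumed to exist elsewhere
  destination_cidr_block = "0.0.0.0/0"

  # Only one of these is non-null in any environment; a null argument is
  # treated as if it had been omitted.
  nat_gateway_id       = local.nat_gateway_id
  network_interface_id = local.nat_instance_eni_id
}
```

Keeping the `one()` calls in `locals` means every consumer reads `local.nat_gateway_id` instead of indexing `[0]` directly.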

package/courses/terraform-infrastructure-setup/scenarios/level-3/drift-detection.yaml ADDED

@@ -0,0 +1,66 @@
+meta:
+  id: drift-detection
+  level: 3
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Detect and remediate infrastructure drift — diagnose out-of-band changes, implement continuous drift detection, and establish drift prevention workflows"
+  tags: [Terraform, drift, detection, remediation, refresh, advanced]
+state: {}
+trigger: |
+  Your security team ran an emergency script that modified security
+  groups directly in AWS. Monday morning, terraform plan shows:
+  ```
+  Note: Objects have changed outside of Terraform
+  Terraform detected the following changes made outside of Terraform:
+    # aws_security_group.web has been changed
+    ~ resource "aws_security_group" "web" {
+        ~ ingress = [
+            # New rule added outside Terraform:
+            + {
+                cidr_blocks = ["0.0.0.0/0"]
+                from_port   = 22
+                to_port     = 22
+                protocol    = "tcp"
+              }
+          ]
+      }
+    # aws_instance.web has been changed
+    ~ resource "aws_instance" "web" {
+        ~ instance_type = "t3.micro" -> "t3.large"
+      }
+    # aws_db_instance.main has been changed
+    ~ resource "aws_db_instance" "main" {
+        ~ backup_retention_period = 7 -> 30
+      }
+  Plan: 0 to add, 3 to change, 0 to destroy.
+  ```
+  Terraform wants to revert ALL changes. But some changes were
+  intentional (backup retention increase) and others were dangerous
+  (SSH open to 0.0.0.0/0).
+  Task: Explain drift detection mechanics, selective drift handling
+  (accept some, revert others), continuous drift detection strategies,
+  and prevention workflows to minimize drift.
+assertions:
+  - type: llm_judge
+    criteria: "Drift detection mechanics are explained — Terraform refreshes state before plan/apply by querying cloud APIs. Compares: real infrastructure vs state vs configuration. Drift = real infrastructure differs from state. Plan shows drift as 'Objects have changed outside of Terraform'. Terraform then plans to reconcile real infrastructure with configuration (not state). terraform plan -refresh-only: shows drift without planning changes. terraform apply -refresh-only: updates state to match reality without changing infrastructure. terraform refresh (deprecated): equivalent to terraform apply -refresh-only -auto-approve, so it updates state without a review step"
+    weight: 0.35
+    description: "Detection mechanics"
+  - type: llm_judge
+    criteria: "Selective drift handling is covered — to accept some drift and revert others: (1) SSH rule (dangerous): revert by applying — Terraform removes the unauthorized 0.0.0.0/0 SSH rule. (2) Instance type change: if intentional, update .tf file to t3.large, plan shows no changes. (3) Backup retention: if intentional, update .tf to backup_retention_period = 30. Workflow: review each drift item, decide accept or revert, update code for accepted changes, apply to revert rejected changes. ignore_changes lifecycle: for attributes managed externally (not recommended as default — masks problems)"
+    weight: 0.35
+    description: "Selective handling"
+  - type: llm_judge
+    criteria: "Prevention and continuous detection are practical — continuous detection: scheduled terraform plan -detailed-exitcode in CI (exit code 2 = drift detected, alert the team). Terraform Cloud: drift detection feature runs plans automatically. Prevention: (1) all changes through Terraform (enforce via IAM — deny console modifications), (2) SCP policies in AWS Organizations to restrict manual changes, (3) AWS Config rules to detect non-Terraform changes, (4) establish an emergency change process that includes updating Terraform code after. Culture: drift is a process problem, not a tool problem"
+    weight: 0.30
+    description: "Prevention"
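
The selective handling described in these assertions might translate into configuration changes along these lines. This is a sketch under stated assumptions: the `# ...` placeholders stand for the existing arguments, which the scenario does not show, and the `ignore_changes` block is only appropriate if instance sizing is deliberately managed outside Terraform.

```hcl
resource "aws_db_instance" "main" {
  # ... existing arguments unchanged ...

  # Accepted drift: raise the configured value from 7 to match what is live,
  # so the next plan no longer proposes reverting it.
  backup_retention_period = 30
}

resource "aws_instance" "web" {
  # ... existing arguments unchanged ...
  instance_type = "t3.micro"

  lifecycle {
    # Optional, and easy to overuse: tells Terraform to stop reconciling
    # this one attribute. Prefer updating instance_type in code when the
    # resize is meant to be permanent.
    ignore_changes = [instance_type]
  }
}
```

The SSH rule needs no code change: it is absent from the configuration, so the next apply removes it.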

package/courses/terraform-infrastructure-setup/scenarios/level-3/dynamic-blocks.yaml ADDED

@@ -0,0 +1,71 @@
+meta:
+  id: dynamic-blocks
+  level: 3
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Use dynamic blocks — generate repetitive configuration blocks from variables, handle nested dynamics, and avoid over-engineering"
+  tags: [Terraform, dynamic-blocks, for-each, iteration, DRY, advanced]
+state: {}
+trigger: |
+  Your security group configuration has grown unwieldy with 20 inline
+  ingress rules hardcoded:
+  ```hcl
+  resource "aws_security_group" "web" {
+    name = "web-sg"
+    ingress {
+      from_port   = 80
+      to_port     = 80
+      protocol    = "tcp"
+      cidr_blocks = ["0.0.0.0/0"]
+    }
+    ingress {
+      from_port   = 443
+      to_port     = 443
+      protocol    = "tcp"
+      cidr_blocks = ["0.0.0.0/0"]
+    }
+    ingress {
+      from_port   = 22
+      to_port     = 22
+      protocol    = "tcp"
+      cidr_blocks = ["10.0.0.0/8"]
+    }
+    # ... 17 more rules
+  }
+  ```
+  You want to make this data-driven using variables and dynamic blocks.
+  But your first attempt creates confusing, deeply nested code:
+  ```hcl
+  dynamic "ingress" {
+    for_each = var.rules
+    content {
+      dynamic "self" {  # ERROR: can't nest dynamic within dynamic
+                        # for the same block type
+      }
+    }
+  }
+  ```
+  Task: Explain dynamic blocks, when to use them, nested dynamic
+  blocks, iterator naming, the content block, and when dynamic blocks
+  hurt readability (over-engineering).
+assertions:
+  - type: llm_judge
+    criteria: "Dynamic blocks are explained — dynamic 'ingress' { for_each = var.ingress_rules, content { from_port = ingress.value.from_port, to_port = ingress.value.to_port, protocol = ingress.value.protocol, cidr_blocks = ingress.value.cidr_blocks } }. The iterator name defaults to the block label (ingress). Custom iterator: dynamic 'ingress' { iterator = rule, for_each = ..., content { from_port = rule.value.port } }. for_each accepts: list, set, map. Within content, use iterator.key and iterator.value. Dynamic blocks can only generate repeatable nested blocks, not top-level arguments"
+    weight: 0.35
+    description: "Dynamic blocks"
+  - type: llm_judge
+    criteria: "Nested dynamics and practical patterns are covered — nested dynamic blocks: dynamic blocks can be nested when a block contains another repeatable block. Counter-example: cidr_blocks inside an aws_security_group ingress block is a list attribute, not a block, so it cannot be generated with dynamic. Real nested example: aws_lb_listener_rule with dynamic condition { dynamic host_header { ... } }. Variable structure: use list of objects for simple rules, map of objects when you need keys. Flatten complex structures with locals before passing to dynamic blocks"
+    weight: 0.35
+    description: "Nested dynamics"
+  - type: llm_judge
+    criteria: "Over-engineering warnings are given — when NOT to use dynamic blocks: (1) fewer than 3-4 rules — just write them out, (2) deeply nested dynamics (3+ levels) — becomes unreadable, (3) when separate resources are clearer (aws_security_group_rule instead of inline ingress blocks). Dynamic blocks trade readability for DRY — sometimes repetition is more maintainable. Alternative to dynamic: use for_each on separate resource blocks (aws_security_group_rule with for_each). This is often clearer and gives each rule its own lifecycle. Rule of thumb: if a new team member can't understand it in 30 seconds, simplify"
+    weight: 0.30
+    description: "Over-engineering"
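
A data-driven version of the security group from the trigger could be sketched as follows. The variable name `ingress_rules` and the iterator name `rule` are illustrative choices, and only the three rules shown in the scenario are included as defaults.

```hcl
variable "ingress_rules" {
  description = "One object per ingress rule (hypothetical variable name)."
  type = list(object({
    from_port   = number
    to_port     = number
    protocol    = string
    cidr_blocks = list(string)
  }))
  default = [
    { from_port = 80, to_port = 80, protocol = "tcp", cidr_blocks = ["0.0.0.0/0"] },
    { from_port = 443, to_port = 443, protocol = "tcp", cidr_blocks = ["0.0.0.0/0"] },
    { from_port = 22, to_port = 22, protocol = "tcp", cidr_blocks = ["10.0.0.0/8"] },
  ]
}

resource "aws_security_group" "web" {
  name = "web-sg"

  # Generates one ingress block per element of var.ingress_rules.
  dynamic "ingress" {
    for_each = var.ingress_rules
    iterator = rule # without this line the iterator would be named "ingress"

    content {
      from_port   = rule.value.from_port
      to_port     = rule.value.to_port
      protocol    = rule.value.protocol
      cidr_blocks = rule.value.cidr_blocks # a list attribute, assigned directly
    }
  }
}
```

With only three rules, writing the blocks out by hand would arguably be clearer; the pattern pays off at the scale of the 20 rules the scenario describes.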

package/courses/terraform-infrastructure-setup/scenarios/level-3/large-scale-refactoring.yaml ADDED

@@ -0,0 +1,59 @@
+meta:
+  id: large-scale-refactoring
+  level: 3
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Refactor large Terraform codebases — split monoliths into modules, migrate between state files, and use moved blocks for safe resource reorganization"
+  tags: [Terraform, refactoring, modules, moved-blocks, migration, advanced]
+state: {}
+trigger: |
+  Your organization's Terraform codebase has grown organically over
+  3 years into a monolith:
+  ```
+  infrastructure/
+  ├── main.tf           (3500 lines, 180 resources)
+  ├── variables.tf      (800 lines, 95 variables)
+  ├── outputs.tf        (200 lines)
+  └── terraform.tfstate (25MB, all resources in one state)
+  ```
+  Problems:
+  - terraform plan takes 8 minutes (refreshes all 180 resources)
+  - Any change risks all resources (blast radius = everything)
+  - 5 teams touch the same files, causing merge conflicts
+  - Lock contention: only one person can run terraform at a time
+  Target architecture:
+  ```
+  infrastructure/
+  ├── foundation/          (VPC, DNS, IAM — Platform team)
+  │   └── terraform.tfstate
+  ├── database/            (RDS, ElastiCache — Database team)
+  │   └── terraform.tfstate
+  ├── compute/             (ECS, ALB — App team)
+  │   └── terraform.tfstate
+  ├── monitoring/          (CloudWatch, Alarms — SRE team)
+  │   └── terraform.tfstate
+  └── modules/             (Shared modules)
+  ```
+  Task: Design the migration strategy from monolith to modular
+  Terraform, covering state splitting, moved blocks, cross-state
+  references, testing the migration, and rollback planning.
+assertions:
+  - type: llm_judge
+    criteria: "Migration strategy is phased — Phase 1: catalog all resources by team/domain. Phase 2: create module structure and write configurations for each domain. Phase 3: use moved blocks within the monolith to reorganize into modules (no state split yet). Phase 4: split state files using state mv or state rm + import. Phase 5: establish cross-state references using terraform_remote_state data sources. Each phase is independently verifiable: plan should show no changes after each phase. Never do everything at once — incremental migration with verification"
+    weight: 0.35
+    description: "Migration strategy"
+  - type: llm_judge
+    criteria: "State splitting mechanics are covered — approach 1 (state mv): (1) backup state, (2) create new backend configs, (3) terraform state mv resources to new state files. Approach 2 (state rm + import): (1) remove resources from monolith state, (2) import into new domain state files. Approach 3 (manual): (1) state pull, (2) edit JSON to split resources, (3) state push to new backends. Cross-state references: foundation outputs VPC ID, compute reads it via terraform_remote_state. IAM and dependency order: foundation first (VPC, IAM), then database (needs VPC), then compute (needs both)"
+    weight: 0.35
+    description: "State splitting"
+  - type: llm_judge
+    criteria: "Testing and rollback are practical — testing: after each migration step, terraform plan must show zero changes in all state files. If plan shows changes, something was migrated incorrectly — fix before proceeding. Rollback: keep the original monolith state backup throughout migration. If anything goes wrong, restore from backup and restart the phase. Timeline: for 180 resources, plan 2-4 weeks. Risk mitigation: migrate non-production first, then production during maintenance window. Communication: notify all teams of the plan, freeze non-essential changes during migration"
+    weight: 0.30
+    description: "Testing and rollback"
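
The cross-state reference mentioned in Phase 5 could be wired up roughly like this. It is a sketch that assumes an S3 backend; the bucket name, state key, and the `aws_security_group.app` consumer are hypothetical and do not come from the package.

```hcl
# foundation/outputs.tf: the foundation state publishes the value.
output "vpc_id" {
  value = aws_vpc.main.id
}

# compute/data.tf: the compute state reads it without managing the VPC.
data "terraform_remote_state" "foundation" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state" # hypothetical bucket name
    key    = "foundation/terraform.tfstate"
    region = "us-east-1"
  }
}

# Hypothetical consumer in the compute domain.
resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = data.terraform_remote_state.foundation.outputs.vpc_id
}
```

Only root-module outputs are visible through `terraform_remote_state`, so each domain has to declare explicitly what it exposes to the others.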

package/courses/terraform-infrastructure-setup/scenarios/level-3/multi-provider-config.yaml ADDED

@@ -0,0 +1,69 @@
+meta:
+  id: multi-provider-config
+  level: 3
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Configure multi-provider setups — manage multi-region, multi-account, and multi-cloud deployments with provider aliases and assume_role"
+  tags: [Terraform, providers, multi-region, multi-account, cross-account, advanced]
+state: {}
+trigger: |
+  Your organization needs infrastructure across multiple AWS accounts
+  and regions:
+  ```
+  Production Account (111111111111) - us-east-1
+  Staging Account (222222222222) - us-east-1
+  DR Account (111111111111) - us-west-2
+  Shared Services (333333333333) - us-east-1
+  ```
+  Your Terraform configuration:
+  ```hcl
+  provider "aws" {
+    region = "us-east-1"
+  }
+  provider "aws" {
+    alias  = "dr"
+    region = "us-west-2"
+  }
+  provider "aws" {
+    alias = "staging"
+    region = "us-east-1"
+    assume_role {
+      role_arn = "arn:aws:iam::222222222222:role/TerraformRole"
+    }
+  }
+  ```
+  Error when deploying to staging:
+  ```
+  Error: error configuring Terraform AWS Provider: IAM Role
+  (arn:aws:iam::222222222222:role/TerraformRole) cannot be assumed.
+  There are a number of possible causes:
+  - The credentials used do not have permission to assume the role
+  - The role's trust policy does not allow the current identity
+  ```
+  Task: Explain multi-provider configuration, assume_role for
+  cross-account access, passing providers to modules, provider
+  configuration best practices, and debugging cross-account issues.
+assertions:
+  - type: llm_judge
+    criteria: "Multi-provider setup is explained — provider aliases allow multiple configurations of the same provider. Default provider (no alias) used when provider isn't specified on a resource. Aliased providers: specify with provider = aws.dr on each resource. assume_role: Terraform assumes an IAM role in another account. Requirements: (1) trust policy on target role must allow the source account/role, (2) source must have sts:AssumeRole permission, (3) external_id for additional security. The error: trust policy or permissions issue — check both sides"
+    weight: 0.35
+    description: "Multi-provider setup"
+  - type: llm_judge
+    criteria: "Provider passing to modules is covered — modules don't inherit provider aliases automatically. Pass explicitly: module 'dr_vpc' { source = './modules/vpc', providers = { aws = aws.dr } }. Module must declare required providers: terraform { required_providers { aws = { source = 'hashicorp/aws' } } }. For modules needing multiple providers: providers = { aws = aws, aws.secondary = aws.dr }, with the module declaring the extra alias via configuration_aliases = [aws.secondary] in its required_providers entry. Anti-pattern: configuring providers inside modules — always configure in root and pass down"
+    weight: 0.35
+    description: "Module providers"
+  - type: llm_judge
+    criteria: "Cross-account debugging is practical — debugging assume_role: (1) verify trust policy on target role allows the source identity, (2) verify source has sts:AssumeRole permission, (3) check for external_id requirement, (4) test manually: aws sts assume-role --role-arn ... (5) enable TF_LOG=DEBUG to see the exact API call. IAM role trust policy must include the specific ARN (account, user, or role). Session duration: default 1 hour, can increase with the assume_role duration argument (duration_seconds in older AWS provider versions). MFA: if required, must be handled outside Terraform. Best practice: use separate state files per account for blast radius isolation"
+    weight: 0.30
+    description: "Cross-account debugging"
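
Passing aliased providers into modules, as the second assertion describes, might look like the sketch below. The `./modules/replication` module and the `aws.secondary` alias are hypothetical; `aws.dr` is the alias defined in the trigger.

```hcl
# Root module: hand the DR-region provider to a single-provider module.
module "dr_vpc" {
  source = "./modules/vpc"

  providers = {
    aws = aws.dr
  }
}

# Root module: a hypothetical module that needs two regions at once.
module "replication" {
  source = "./modules/replication"

  providers = {
    aws           = aws
    aws.secondary = aws.dr
  }
}

# modules/replication/versions.tf: the module declares the alias it expects.
terraform {
  required_providers {
    aws = {
      source                = "hashicorp/aws"
      configuration_aliases = [aws.secondary]
    }
  }
}
```

Inside the replication module, resources that belong in the second region would then set `provider = aws.secondary`.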

package/courses/terraform-infrastructure-setup/scenarios/level-3/state-surgery.yaml ADDED

@@ -0,0 +1,57 @@
+meta:
+  id: state-surgery
+  level: 3
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Perform state surgery — use state mv, rm, pull, push for complex migrations, module extraction, and resource address changes"
+  tags: [Terraform, state, migration, state-mv, state-rm, advanced]
+state: {}
+trigger: |
+  Your monolithic Terraform configuration with 200 resources needs
+  to be split into separate modules. Current flat structure:
+  ```hcl
+  # main.tf (2000 lines)
+  resource "aws_vpc" "main" { ... }
+  resource "aws_subnet" "public" { ... }
+  resource "aws_instance" "web" { ... }
+  resource "aws_db_instance" "db" { ... }
+  ```
+  Target: split into modules/networking, modules/compute, modules/database.
+  Attempt 1 — Just move code into modules:
+  ```
+  $ terraform plan
+  # aws_vpc.main will be destroyed
+  # module.networking.aws_vpc.main will be created
+  # aws_instance.web will be destroyed
+  # module.compute.aws_instance.web will be created
+  # aws_db_instance.db will be destroyed (!!!)
+  # module.database.aws_db_instance.db will be created
+  Plan: 6 to add, 0 to change, 6 to destroy.
+  ```
+  All resources will be destroyed and recreated — unacceptable for
+  production! The database would be lost.
+  Task: Explain state surgery operations (mv, rm, pull, push),
+  how to migrate resources between modules without recreation,
+  moved blocks (Terraform 1.1+), state backup best practices,
+  and complex migration strategies.
+assertions:
+  - type: llm_judge
+    criteria: "State mv migration is explained — terraform state mv moves a resource from one address to another in state without modifying infrastructure. To migrate to modules: terraform state mv aws_vpc.main module.networking.aws_vpc.main, terraform state mv aws_instance.web module.compute.aws_instance.web, etc. After all moves: terraform plan should show no changes. Always backup state first: terraform state pull > backup.tfstate. State mv is atomic per resource — if interrupted, some resources moved, others not. Plan carefully and script the moves"
+    weight: 0.35
+    description: "State mv"
+  - type: llm_judge
+    criteria: "Moved blocks are covered as the modern alternative — moved { from = aws_vpc.main, to = module.networking.aws_vpc.main }. Benefits over state mv: (1) declarative and code-reviewable, (2) handled during plan/apply, (3) no manual state manipulation, (4) works across plan/apply workflow. Multiple moved blocks can coexist. Moved blocks can be deleted from the configuration once every environment has applied them (Terraform does not remove them automatically). Supports: resource address changes, module refactoring, count to for_each migration. Terraform 1.1+ required. Preferred over state mv for most migrations"
+    weight: 0.35
+    description: "Moved blocks"
+  - type: llm_judge
+    criteria: "State rm and complex operations are practical — terraform state rm: removes resource from state without destroying it. Use when: (1) resource should no longer be managed by Terraform, (2) moving resource to different state file, (3) removing accidentally imported resource. terraform state pull/push: download/upload entire state file. Use for: manual state repair, migrating between backends, debugging. Complex migration: for splitting state files, (1) state pull, (2) manipulate JSON, (3) state push to new backend. Always: backup before surgery, verify with plan after, use -dry-run where available"
+    weight: 0.30
+    description: "Complex operations"
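
For the module extraction in this scenario, the moved-block approach could be sketched as below, assuming Terraform 1.1 or later. The addresses follow the trigger; the database entry uses `aws_db_instance`, the AWS provider's resource type for RDS instances.

```hcl
# Placed in the root module next to the new module calls.
moved {
  from = aws_vpc.main
  to   = module.networking.aws_vpc.main
}

moved {
  from = aws_subnet.public
  to   = module.networking.aws_subnet.public
}

moved {
  from = aws_instance.web
  to   = module.compute.aws_instance.web
}

moved {
  from = aws_db_instance.db
  to   = module.database.aws_db_instance.db
}
```

With these in place, `terraform plan` should report the resources as moved rather than as destroy-and-create pairs; any remaining create or destroy action suggests an address that was missed.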

package/courses/terraform-infrastructure-setup/scenarios/level-3/terraform-cloud-enterprise.yaml ADDED

@@ -0,0 +1,59 @@
+meta:
+  id: terraform-cloud-enterprise
+  level: 3
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Use Terraform Cloud/Enterprise — configure remote execution, VCS integration, workspace management, and Sentinel policies"
+  tags: [Terraform, Terraform-Cloud, Enterprise, remote-execution, Sentinel, advanced]
+state: {}
+trigger: |
+  Your team is migrating from local Terraform execution to Terraform
+  Cloud. Current pain points:
+  - Engineers run terraform from laptops with different provider versions
+  - No audit trail of who applied what
+  - State files stored in S3 with overly permissive access
+  - No policy enforcement (anyone can create m5.24xlarge instances)
+  Migration configuration:
+  ```hcl
+  terraform {
+    cloud {
+      organization = "acme-corp"
+      workspaces {
+        name = "production"
+      }
+    }
+  }
+  ```
+  After migration:
+  ```
+  $ terraform plan
+  Running plan in Terraform Cloud. Output will stream here.
+  Error: Terraform Cloud returned an unexpected error
+  UNAUTHORIZED: You are not authorized to perform this action.
+  ```
+  Task: Explain Terraform Cloud features (remote execution, VCS
+  integration, workspace management), Sentinel policies for
+  governance, migration from local/S3 to Terraform Cloud, and
+  when to use Cloud vs Enterprise vs self-hosted.
+assertions:
+  - type: llm_judge
+    criteria: "Terraform Cloud features are explained — remote execution: plan and apply run on Terraform Cloud's infrastructure (consistent environment, no laptop dependencies). VCS integration: connect to GitHub/GitLab, automatic plans on PRs, apply on merge. Workspace management: each workspace has its own state, variables, and permissions. Variable sets: share variables across workspaces. Run triggers: chain workspaces (VPC workspace triggers EKS workspace). The auth error: need to run terraform login first, or set TF_TOKEN_app_terraform_io environment variable. Team permissions control who can plan vs apply"
+    weight: 0.35
+    description: "Cloud features"
+  - type: llm_judge
+    criteria: "Sentinel policies are covered — Sentinel: policy-as-code framework for governance. Policy sets: attach to workspaces. Enforcement levels: advisory (warn), soft-mandatory (override with approval), hard-mandatory (no override). Example policies: restrict instance types (no m5.24xlarge), require tags on all resources, enforce encryption, restrict regions. Policy workflow: plan → cost estimation → Sentinel check → apply (cost estimation runs first so policies can evaluate the estimate). Policies written in Sentinel language (not HCL). OPA (Open Policy Agent) also supported as alternative"
+    weight: 0.35
+    description: "Sentinel policies"
+  - type: llm_judge
+    criteria: "Migration and comparison are practical — migration from S3: (1) add cloud block to config, (2) terraform login, (3) terraform init to migrate state. Cloud vs Enterprise vs self-hosted: Cloud (SaaS, free tier available, easiest setup), Enterprise (self-hosted, air-gapped support, custom agents), self-hosted agents with Cloud (hybrid — control plane in Cloud, execution on your infrastructure). When Cloud: most teams. When Enterprise: regulatory requirements for air-gapped, very large scale, custom integrations. Cost: Cloud free for small teams, Enterprise starts at $70K+/year"
+    weight: 0.30
+    description: "Migration and comparison"

package/courses/terraform-infrastructure-setup/scenarios/level-3/terraform-debugging.yaml ADDED

@@ -0,0 +1,51 @@
+meta:
+  id: terraform-debugging
+  level: 3
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Debug Terraform with TF_LOG — use log levels, provider-specific debugging, crash logs, and systematic troubleshooting for complex failures"
+  tags: [Terraform, debugging, TF_LOG, crash-logs, troubleshooting, advanced]
+state: {}
+trigger: |
+  A terraform apply fails with a cryptic error that gives no useful
+  information:
+  ```
+  Error: error creating ECS Service (my-service): InvalidParameterException:
+  Unable to assume the provided role.
+  with aws_ecs_service.web,
+  on ecs.tf line 15, in resource "aws_ecs_service" "web":
+  15: resource "aws_ecs_service" "web" {
+  ```
+  The IAM role exists and looks correct. You need to dig deeper.
+  You also encounter a Terraform crash:
+  ```
+  !!!!!!!!!!!!!!!!!!!!!!!!!!! TERRAFORM CRASH !!!!!!!!!!!!!!!!!!!!!!!!!
+  Terraform crashed! This is always indicative of a bug within
+  Terraform or a provider. Crash log saved to: crash.log
+  ```
+  Task: Explain Terraform debugging techniques, TF_LOG levels and
+  environment variables, provider-specific debugging, crash log
+  analysis, and systematic troubleshooting methodology for complex
+  infrastructure failures.
+assertions:
+  - type: llm_judge
+    criteria: "TF_LOG debugging is explained — levels (most to least verbose): TRACE, DEBUG, INFO, WARN, ERROR. Set: TF_LOG=DEBUG terraform apply. Save to file: TF_LOG_PATH=./debug.log. Component-specific: TF_LOG_CORE=WARN TF_LOG_PROVIDER=DEBUG (provider operations verbose, core quiet). The ECS error: TF_LOG=DEBUG reveals the actual API request/response — likely IAM role trust policy doesn't include ecs.amazonaws.com, or there's an IAM propagation delay. DEBUG shows: HTTP requests, API responses, retry attempts, timing. TRACE shows everything including internal state operations"
+    weight: 0.35
+    description: "TF_LOG debugging"
+  - type: llm_judge
+    criteria: "Crash logs and provider debugging are covered — crash log: contains Go stack trace, panic message, provider version. Report to: provider GitHub issues if provider crash, Terraform core GitHub if core crash. Include: Terraform version, provider versions, sanitized config, crash.log. Provider debugging: check provider changelog for known bugs, try upgrading/downgrading provider version, reproduce with minimal configuration. AWS-specific: decode authorization failure messages with aws sts decode-authorization-message. Eventual consistency: IAM changes can take seconds to propagate — add depends_on or retry"
+    weight: 0.35
+    description: "Crash and provider"
+  - type: llm_judge
+    criteria: "Systematic troubleshooting is practical — methodology: (1) read the error message carefully (resource, file, line), (2) check provider documentation for the resource, (3) enable TF_LOG=DEBUG and search for the actual API error, (4) reproduce with minimal configuration (isolate the issue), (5) check for known issues on provider GitHub, (6) verify cloud-side (correct permissions, quotas, resource limits). Common hidden causes: IAM propagation delay, API rate limiting (429 errors hidden in retries), eventual consistency, stale provider cache (terraform init -upgrade). terraform plan -refresh-only to verify state matches reality"
+    weight: 0.30
+    description: "Troubleshooting method"
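
If the debug log confirms the hypothesis in the first assertion (a trust policy that omits `ecs.amazonaws.com`), the fix might look like this sketch. The role name and resource labels are hypothetical; the scenario does not show the actual role definition.

```hcl
# Trust policy allowing the ECS service to assume the role.
data "aws_iam_policy_document" "ecs_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ecs_service" {
  name               = "ecs-service-role" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.ecs_assume.json
}
```

If the trust policy is already correct, the other cause the assertions name, IAM propagation delay, is worth checking before changing the role.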

package/courses/terraform-infrastructure-setup/scenarios/level-4/blast-radius-management.yaml ADDED

@@ -0,0 +1,51 @@
+meta:
+  id: blast-radius-management
+  level: 4
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Manage Terraform blast radius — design state boundaries, implement approval workflows, and prevent large-scale outages from single changes"
+  tags: [Terraform, blast-radius, state-separation, approvals, risk, expert]
+state: {}
+trigger: |
+  A single terraform apply destroyed your production database, two
+  load balancers, and a VPN connection. Root cause: all 300 resources
+  were in one state file. The engineer intended to modify a CloudWatch
+  alarm but a provider upgrade changed the behavior of unrelated
+  resources.
+  Impact:
+  - 4 hours of downtime
+  - Database restored from backup (30 minutes of data loss)
+  - Post-mortem found: blast radius = 300 resources per apply
+  - Board asked: "How do we prevent this from happening again?"
+  Current state architecture:
+  ```
+  Single state: 300 resources
+  - VPC, subnets, NAT gateways
+  - RDS, ElastiCache
+  - ECS services, ALBs
+  - CloudWatch, SNS, SQS
+  - IAM roles, policies
+  - S3 buckets, CloudFront
+  ```
+  Task: Design the blast radius management strategy covering: state
+  file boundaries, change classification (risk levels), approval
+  workflows, provider upgrade safety, and recovery procedures.
+assertions:
+  - type: llm_judge
+    criteria: "State boundaries reduce blast radius — split 300 resources into isolated state files: foundation (VPC, subnets, NAT — rarely changes, ~20 resources), database (RDS, ElastiCache — critical, ~10 resources), compute (ECS, ALB — frequently changes, ~50 resources), messaging (SQS, SNS — moderate, ~30 resources), monitoring (CloudWatch, alarms — frequent, ~40 resources), IAM (roles, policies — sensitive, ~30 resources), CDN (CloudFront, S3 — moderate, ~20 resources). Each state file limits the blast radius. Maximum 50-80 resources per state. Cross-state references via terraform_remote_state"
+    weight: 0.35
+    description: "State boundaries"
+  - type: llm_judge
+    criteria: "Change classification and approvals are defined — risk levels: Low (monitoring, tags, non-destructive updates — auto-approve in CI), Medium (security group changes, scaling modifications — 1 approval), High (database changes, network topology, IAM — 2 approvals + change window), Critical (provider upgrades, state operations, foundation changes — team lead + SRE approval). Implement via: Terraform Cloud workspace-level permissions, GitHub environment protection rules, or Atlantis apply requirements. Provider upgrades: pin exact versions, upgrade in dev first, review changelog for breaking changes, upgrade one state file at a time"
+    weight: 0.35
+    description: "Classification and approvals"
+  - type: llm_judge
+    criteria: "Recovery procedures are practical — immediate response: (1) don't run terraform apply again, (2) assess damage scope from state and CloudTrail, (3) restore from backups (RDS snapshots, S3 versioning). Recovery: (1) if resources destroyed but state intact: terraform apply recreates, (2) if state corrupted: restore from S3 versioned state backup. Prevention: prevent_destroy on databases and critical resources, separate state files limit collateral damage, a CI check on the saved plan (terraform show -json, failing on unexpected delete actions) catches unexpected destroys, since terraform plan -detailed-exitcode only signals that changes exist (exit code 2), plan output review required before apply. Provider upgrades: test in isolated environment first, upgrade one service domain at a time, maintain rollback plan (pin to previous version)"
+    weight: 0.30
+    description: "Recovery"
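
The state-boundary and prevention criteria above can be illustrated with a minimal HCL sketch. It is not part of the package. The bucket name, state key, region, output name `public_subnet_ids`, and the pinned provider version are all hypothetical placeholders. It assumes an S3 backend and a foundation state that exports its subnet IDs as an output.

```hcl
# compute/main.tf (hypothetical layout): the compute state reads
# outputs from the separate foundation state instead of owning the VPC.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "= 5.31.0" # example of an exact pin; choose your own version
    }
  }
}

data "terraform_remote_state" "foundation" {
  backend = "s3"
  config = {
    bucket = "example-tf-state"             # assumed bucket name
    key    = "foundation/terraform.tfstate" # assumed state key
    region = "us-east-1"                    # assumed region
  }
}

resource "aws_lb" "web" {
  name               = "web"
  load_balancer_type = "application"
  # Assumes the foundation state declares an output named public_subnet_ids.
  subnets = data.terraform_remote_state.foundation.outputs.public_subnet_ids
}

# database/main.tf (hypothetical layout): Terraform rejects any plan
# that would destroy this resource while prevent_destroy is set.
resource "aws_db_instance" "main" {
  # ...
  lifecycle {
    prevent_destroy = true
  }
}
```

With this split, an apply in the compute state can only touch compute resources. The foundation state is read-only from its point of view, which is the blast radius limit the first criterion describes.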

package/courses/terraform-infrastructure-setup/scenarios/level-4/cicd-pipeline-design.yaml ADDED

@@ -0,0 +1,50 @@
+meta:
+  id: cicd-pipeline-design
+  level: 4
+  course: terraform-infrastructure-setup
+  type: output
+  description: "Design CI/CD pipelines for Terraform — implement GitOps workflows with Atlantis, GitHub Actions, or Terraform Cloud for safe infrastructure deployment"
+  tags: [Terraform, CI/CD, GitOps, Atlantis, GitHub-Actions, expert]
+state: {}
+trigger: |
+  Your team deploys Terraform from individual laptops. Last month:
+  - An engineer applied to production instead of staging (wrong workspace)
+  - Two engineers ran apply simultaneously, causing state corruption
+  - An apply failed halfway but no one noticed for 3 hours
+  - No record of who deployed what or when
+  You need to design a CI/CD pipeline for Terraform that prevents
+  all of these issues. Options on the table:
+  1. GitHub Actions with custom workflow
+  2. Atlantis (pull request automation)
+  3. Terraform Cloud/Enterprise
+  4. Spacelift
+  Requirements:
+  - Plan on every PR
+  - Apply only after approval and merge
+  - Environment protection (can't accidentally apply to prod)
+  - Cost estimation before apply
+  - Security scanning (tfsec/checkov)
+  - Slack notifications for plan/apply results
+  Task: Design the CI/CD pipeline for Terraform, compare the tool
+  options, show a complete workflow from code change to production
+  deployment, and address security considerations.
+assertions:
+  - type: llm_judge
+    criteria: "Complete pipeline workflow is designed — code change → PR opened → automated pipeline: (1) terraform fmt -check (formatting), (2) terraform validate (syntax), (3) tfsec/checkov scan (security), (4) terraform plan (preview changes), (5) Infracost estimate (cost), (6) post results as PR comment. On merge to main: (7) terraform plan again (detect drift since PR), (8) approval gate (manual for prod), (9) terraform apply, (10) post-apply verification, (11) Slack notification. Environment promotion: dev auto-apply, staging auto-apply, prod manual approval"
+    weight: 0.35
+    description: "Pipeline workflow"
+  - type: llm_judge
+    criteria: "Tool comparison is practical — Atlantis: open-source, PR automation, self-hosted, lightweight. Best for: teams wanting simple PR-based workflow. GitHub Actions: flexible, native GitHub integration, custom workflows. Best for: teams already on GitHub wanting full control. Terraform Cloud: managed service, built-in Sentinel, cost estimation, team management. Best for: organizations wanting managed solution. Spacelift: multi-tool support, advanced policies, drift detection. Best for: enterprises with complex requirements. Recommendation depends on: team size, budget, compliance needs, multi-tool requirements"
+    weight: 0.35
+    description: "Tool comparison"
+  - type: llm_judge
+    criteria: "Security considerations are covered — credentials: use OIDC for cloud authentication (no static keys in CI). GitHub Actions: aws-actions/configure-aws-credentials with OIDC. State access: CI role has minimal permissions (plan role vs apply role). Secrets: never echo credentials, use GitHub encrypted secrets or Terraform Cloud variables. Branch protection: require PR reviews, no direct pushes to main. Environment protection: GitHub environments with required reviewers for prod. Audit: log all plan/apply with outputs. Network: CI runner in private network if accessing private resources"
+    weight: 0.30
+    description: "Security"
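
The OIDC and plan-versus-apply role points in the security criterion can be sketched in HCL as follows. This is an illustration under stated assumptions, not the package's content. It assumes the GitHub OIDC provider is already registered in the AWS account. The repository `example-org/infra`, the GitHub environment name `production`, and the role name are hypothetical.

```hcl
# Look up the existing GitHub Actions OIDC provider (assumed to exist).
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

# Trust policy for the apply role: no static keys, only short-lived
# tokens issued to workflow jobs that match both conditions.
data "aws_iam_policy_document" "apply_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only jobs running in the protected "production" GitHub environment
    # of the hypothetical repository may assume this role.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/infra:environment:production"]
    }
  }
}

resource "aws_iam_role" "terraform_apply" {
  name               = "terraform-apply-prod" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.apply_trust.json
}
```

A separate plan role would follow the same shape, with a `sub` condition matching pull request runs and read-only permissions attached. Because the apply role is tied to a GitHub environment, the environment's required reviewers gate who can trigger a production apply.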