npm - dojo.md - Versions diffs - 0.2.0 → 0.2.1 - Mend

dojo.md 0.2.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (222)

package/courses/aws-lambda-debugging/scenarios/level-2/sam-local-debugging.yaml ADDED

@@ -0,0 +1,74 @@
+meta:
+  id: sam-local-debugging
+  level: 2
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda locally with SAM CLI — use sam local invoke, start-api, and generate-event for local development and testing"
+  tags: [AWS, Lambda, SAM, local-testing, debugging, development, intermediate]
+state: {}
+trigger: |
+  Your development workflow is painful: make a change, deploy to AWS
+  (2 minutes), test, see error, repeat. You need local debugging.
+  Setting up SAM CLI:
+  $ sam init --runtime python3.12 --app-template hello-world
+  $ sam build
+  $ sam local invoke HelloWorldFunction
+  Invoking app.lambda_handler (python3.12)
+  Building image...
+  START RequestId: local-123
+  {"statusCode": 200, "body": "{\"message\": \"hello world\"}"}
+  END RequestId: local-123
+  REPORT RequestId: local-123 Duration: 5.23 ms
+  Works! But now testing with a real event:
+  $ sam local generate-event apigateway aws-proxy > event.json
+  $ sam local invoke HelloWorldFunction -e event.json
+  Error: Template file not found: template.yaml
+  Wrong directory — SAM needs template.yaml (infrastructure definition).
+  Testing API locally:
+  $ sam local start-api
+  Mounting HelloWorldFunction at http://127.0.0.1:3000/hello [GET]
+  $ curl http://localhost:3000/hello
+  {"message": "hello world"}
+  But environment variables are missing:
+  $ sam local invoke -n env.json HelloWorldFunction
+  # env.json: {"HelloWorldFunction": {"TABLE_NAME": "local-table"}}
+  Debugging with breakpoints:
+  $ sam local invoke -d 5858 HelloWorldFunction
+  Debugger listening on ws://127.0.0.1:5858
+  (attach VS Code debugger to port 5858)
+  Limitations:
+  - Local execution uses Docker — slightly different from actual Lambda
+  - IAM permissions not enforced locally
+  - VPC networking not simulated
+  - Event source mappings not simulated (must provide events manually)
+  Task: Explain SAM CLI local debugging. Write: sam local invoke
+  (function testing), sam local start-api (API testing), generating
+  test events, environment variables and configuration, debugging
+  with breakpoints (VS Code), limitations of local testing, and
+  complementary tools (LocalStack, moto for AWS mocking).
+assertions:
+  - type: llm_judge
+    criteria: "SAM local commands are explained. sam local invoke: runs function once with optional event file (-e event.json). sam local start-api: starts local HTTP server that routes to Lambda functions (mirrors API Gateway). sam local start-lambda: starts local Lambda endpoint for SDK testing. sam local generate-event: creates sample events for various triggers (apigateway, s3, sqs, dynamodb, etc.). sam build: compiles/packages code before local testing. sam local invoke, start-api, and start-lambda require Docker running locally; generate-event does not"
+    weight: 0.35
+    description: "SAM commands"
+  - type: llm_judge
+    criteria: "Debugging and configuration are covered. Breakpoint debugging: sam local invoke -d <port> pauses function until debugger attaches. VS Code: launch.json configuration for SAM debugging. Environment variables: -n env.json file or --parameter-overrides. Docker network: --docker-network to connect Lambda to local services (local DynamoDB, PostgreSQL). Faster iteration: --skip-pull-image skips re-pulling the runtime image on each run. Layer testing: layers specified in template.yaml are automatically included in local invocations"
+    weight: 0.35
+    description: "Debugging"
+  - type: llm_judge
+    criteria: "Limitations and alternatives are practical — SAM local limitations: no IAM enforcement (permission errors won't appear), no VPC simulation, no event source mapping simulation, Docker-based execution has slight differences from real Lambda. Alternatives: LocalStack (full AWS service emulation), moto (Python AWS mocking), DynamoDB Local (official local DynamoDB). Best practice: local testing for rapid development iteration, deploy to a dev AWS account for integration testing, use CI/CD for automated testing against real services. Unit tests mock AWS SDK calls for fast feedback"
+    weight: 0.30
+    description: "Limitations"
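
Reviewer sketch (not part of the package): a minimal Python handler that the `sam local invoke -n env.json` step in this scenario could exercise. It assumes the `TABLE_NAME` variable from the scenario's env.json; the `table` field in the response and the `"unset"` fallback are hypothetical additions for illustration.

```python
import json
import os


def lambda_handler(event, context):
    """Return an API Gateway style response that echoes TABLE_NAME."""
    # TABLE_NAME is supplied locally via `sam local invoke -n env.json`.
    # "unset" is a hypothetical fallback so a missing variable is visible
    # in the response instead of raising a KeyError.
    table_name = os.environ.get("TABLE_NAME", "unset")
    body = {"message": "hello world", "table": table_name}
    return {"statusCode": 200, "body": json.dumps(body)}
```

Echoing the variable makes it easy to confirm that the env.json mapping is keyed by the right function name.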

package/courses/aws-lambda-debugging/scenarios/level-2/sqs-event-source.yaml ADDED

@@ -0,0 +1,72 @@
+meta:
+  id: sqs-event-source
+  level: 2
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda SQS event source mapping — diagnose message processing failures, batch errors, visibility timeout issues, and dead-letter queue configuration"
+  tags: [AWS, Lambda, SQS, event-source-mapping, DLQ, batch, intermediate]
+state: {}
+trigger: |
+  Your Lambda function processes SQS messages but messages keep
+  reappearing in the queue and eventually go to the DLQ:
+  $ aws sqs get-queue-attributes --queue-url $QUEUE_URL \
+    --attribute-names ApproximateNumberOfMessages \
+    ApproximateNumberOfMessagesNotVisible
+  {
+    "ApproximateNumberOfMessages": "0",
+    "ApproximateNumberOfMessagesNotVisible": "150"
+  }
+  150 messages are "in flight" (not visible) — they're being
+  processed but not being deleted.
+  CloudWatch Logs:
+  Processing batch of 10 messages...
+  Error processing message 7: Invalid JSON in message body
+  ERROR: Unhandled exception
+  The problem: when ANY message in a batch fails, Lambda reports the
+  entire batch as failed. All 10 messages return to the queue, even
+  the 9 that processed successfully. This causes:
+  - Successfully processed messages are processed again (duplicates!)
+  - The bad message keeps failing, blocking the batch
+  - After maxReceiveCount (e.g., 3), ALL messages go to DLQ
+  Fix — Partial batch failure reporting:
+  Return failed message IDs in the response:
+  {
+    "batchItemFailures": [
+      { "itemIdentifier": "message-id-7" }
+    ]
+  }
+  Enable ReportBatchItemFailures in the event source mapping.
+  Now only message 7 retries; messages 1-6 and 8-10 are deleted.
+  Another issue — visibility timeout:
+  If Lambda takes longer to process than the queue's visibility
+  timeout (default 30s), the message becomes visible again and
+  another Lambda instance picks it up → duplicate processing.
+  Rule: visibility timeout should be at least 6x the Lambda timeout.
+  Task: Explain SQS + Lambda debugging. Write: how event source
+  mapping works (polling, batching), batch failure handling (partial
+  vs full), visibility timeout tuning, DLQ configuration, message
+  deduplication, and common SQS + Lambda anti-patterns.
+assertions:
+  - type: llm_judge
+    criteria: "Event source mapping mechanics are explained. Lambda polls SQS queue automatically (long polling). Messages delivered in batches (configurable: 1-10000 for standard, 1-10 for FIFO). Lambda deletes messages automatically on successful processing. If function fails or times out, messages return to queue after visibility timeout. Batch window: collect messages for up to 5 minutes before invoking. FIFO queues: messages processed in order within message group, concurrency limited by number of active message groups"
+    weight: 0.35
+    description: "Event source mapping"
+  - type: llm_judge
+    criteria: "Batch failure and DLQ are covered — default: entire batch fails if function returns error. With ReportBatchItemFailures: return batchItemFailures array with failed message IDs — only those retry. Without this, one bad message blocks entire batch. DLQ: messages exceeding maxReceiveCount sent to DLQ. Configure DLQ on the source queue (not the Lambda). DLQ inspection: replay messages after fix. Lambda Destinations: alternative to DLQ, provides more context (function response, error details). Idempotency: design functions to handle duplicate processing safely"
+    weight: 0.35
+    description: "Batch failure and DLQ"
+  - type: llm_judge
+    criteria: "Visibility timeout and anti-patterns are practical — set visibility timeout to at least 6x Lambda function timeout. If Lambda timeout = 60s, visibility timeout ≥ 360s. Otherwise: message reappears while still being processed → duplicate processing. Anti-patterns: not handling partial failures (entire batch retries), not implementing idempotency (duplicate processing causes bugs), processing too long for visibility timeout, no DLQ (messages retry forever), not monitoring ApproximateAgeOfOldestMessage (indicates processing lag)"
+    weight: 0.30
+    description: "Timeout and anti-patterns"
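
Reviewer sketch (not part of the package): one way the partial batch failure response described in this scenario could look in a Python handler. The `process` function and its `orderId` check are hypothetical stand-ins for real business logic, and the sketch assumes `ReportBatchItemFailures` is enabled on the event source mapping.

```python
import json


def process(body):
    """Hypothetical business logic; raises on messages it cannot handle."""
    if "orderId" not in body:
        raise ValueError("missing orderId")


def lambda_handler(event, context):
    """Process each SQS record and report only the failed message IDs."""
    failures = []
    for record in event.get("Records", []):
        try:
            # Invalid JSON raises here and is treated as a per-message failure.
            process(json.loads(record["body"]))
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    # An empty list tells Lambda the whole batch succeeded.
    return {"batchItemFailures": failures}
```

Catching per record keeps one malformed message from forcing the other records in the batch to be retried.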

package/courses/aws-lambda-debugging/scenarios/level-2/vpc-networking-issues.yaml ADDED

@@ -0,0 +1,71 @@
+meta:
+  id: vpc-networking-issues
+  level: 2
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda VPC networking issues — diagnose ENI provisioning failures, NAT Gateway requirements, and VPC endpoint configuration"
+  tags: [AWS, Lambda, VPC, networking, NAT-Gateway, endpoints, intermediate]
+state: {}
+trigger: |
+  Your Lambda function was recently moved into a VPC to access an
+  RDS database. Now it can reach the database but can't reach any
+  AWS services or external APIs:
+  $ aws lambda invoke --function-name order-api output.json
+  {
+    "errorType": "TimeoutError",
+    "errorMessage": "connect ETIMEDOUT 52.119.225.0:443"
+  }
+  The function tries to call DynamoDB and S3, but these are public
+  AWS endpoints. A Lambda in a VPC private subnet has NO internet
+  access by default.
+  Network architecture:
+  Before VPC: Lambda → Internet → AWS Services (S3, DynamoDB) ✓
+  After VPC (private subnet, no NAT): Lambda → VPC → ??? → No route
+  Solutions:
+  Option A — NAT Gateway ($32/month + data processing):
+  Private subnet → route to NAT Gateway in public subnet → Internet
+  Gateway → Internet → AWS services.
+  Adds internet access but costs money and adds latency.
+  Option B — VPC Endpoints (recommended for AWS services):
+  Gateway endpoints (free): S3, DynamoDB
+  Interface endpoints ($7.20/month each): Secrets Manager, SQS,
+  SNS, KMS, and 100+ other services.
+  Traffic stays within AWS network. No internet required.
+  Another issue discovered:
+  ENILimitReachedException — Lambda creates Elastic Network Interfaces
+  (ENIs) in your VPC subnets. If the subnet runs out of available
+  IPs or you hit the ENI quota, new Lambda instances can't start.
+  $ aws ec2 describe-network-interfaces \
+    --filters Name=requester-id,Values=*lambda*
+  (showing 250 ENIs — hitting the limit)
+  Task: Explain Lambda VPC networking. Write: why VPC Lambda loses
+  internet access, NAT Gateway vs VPC endpoints (cost, performance),
+  ENI management (Hyperplane ENIs, subnet sizing, IP exhaustion),
+  security group configuration for Lambda, and when to (and not to)
+  put Lambda in a VPC.
+assertions:
+  - type: llm_judge
+    criteria: "VPC networking problem is explained — Lambda in VPC gets a network interface in your subnet but private subnets have no internet route. Two solutions: NAT Gateway (routes through internet — needed for external APIs, costs ~$32/month + data charges) or VPC Endpoints (direct AWS service access without internet — Gateway endpoints for S3/DynamoDB are free, Interface endpoints cost $7.20/month each). Best practice: use VPC endpoints for AWS services, NAT Gateway only for external API access. VPC adds cold start latency (ENI attachment)"
+    weight: 0.35
+    description: "VPC networking"
+  - type: llm_judge
+    criteria: "ENI and subnet management are covered — Lambda uses Hyperplane ENIs (shared across functions with same subnet + security group). ENI limit: default 250 per VPC (request increase). Subnet sizing: /24 gives ~251 usable IPs, adequate for most. Multiple subnets across AZs for availability. Security groups: Lambda needs outbound rules to reach databases/services. Inbound rules typically not needed (Lambda initiates connections). Check: aws ec2 describe-network-interfaces to see Lambda ENIs"
+    weight: 0.35
+    description: "ENI and subnet"
+  - type: llm_judge
+    criteria: "When to use VPC is practical — put Lambda in VPC ONLY when it needs to access VPC resources (RDS, ElastiCache, EC2 instances). Don't use VPC if Lambda only accesses public AWS services (S3, DynamoDB, SQS) — VPC adds complexity, cost, and cold start latency without benefit. If you need both VPC resources AND internet: use VPC endpoints for AWS services + NAT Gateway for external. IAM role needs AWSLambdaVPCAccessExecutionRole for ENI management permissions. Consider RDS Proxy for database connection pooling in VPC"
+    weight: 0.30
+    description: "When to use VPC"
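
Reviewer sketch (not part of the package): the subnet sizing figure in the assertions can be checked with Python's `ipaddress` module. The CIDR blocks are made-up examples; the constant reflects the five addresses AWS reserves in every subnet.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr):
    """Return how many addresses in a subnet are available for ENIs."""
    total = ipaddress.ip_network(cidr).num_addresses
    return max(total - AWS_RESERVED_PER_SUBNET, 0)
```

For example, `usable_ips("10.0.1.0/24")` gives 251, which matches the "/24 gives ~251 usable IPs" figure above, while a /28 leaves only 11.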

package/courses/aws-lambda-debugging/scenarios/level-2/xray-tracing.yaml ADDED

@@ -0,0 +1,62 @@
+meta:
+  id: xray-tracing
+  level: 2
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda with X-Ray tracing — implement distributed tracing to identify performance bottlenecks across multiple services"
+  tags: [AWS, Lambda, X-Ray, tracing, distributed, performance, intermediate]
+state: {}
+trigger: |
+  Your API has intermittent slow responses. CloudWatch shows p99
+  latency of 4.5 seconds but p50 is 200ms. You need to find what
+  causes the occasional slowness. CloudWatch Logs alone can't tell
+  you which downstream call is slow.
+  Enable X-Ray tracing:
+  $ aws lambda update-function-configuration \
+    --function-name order-api \
+    --tracing-config Mode=Active
+  After enabling X-Ray, the service map shows:
+  API Gateway → order-api (Lambda) → DynamoDB (getItem)
+                                   → payment-service (Lambda)
+                                   → SES (sendEmail)
+  X-Ray trace for a slow request (4.2s total):
+  ├─ Lambda Init: 450ms (cold start)
+  ├─ DynamoDB getItem: 15ms
+  ├─ payment-service invoke: 3,500ms ← BOTTLENECK
+  │  └─ payment-service Lambda:
+  │     ├─ Stripe API call: 3,200ms ← ROOT CAUSE
+  │     └─ DynamoDB putItem: 50ms
+  └─ SES sendEmail: 180ms
+  X-Ray reveals: the payment-service makes a synchronous call to
+  Stripe's API that occasionally takes 3+ seconds (Stripe rate
+  limiting or connectivity issue).
+  Fix: Make the payment processing asynchronous — invoke payment-
+  service with InvocationType=Event, confirm the order immediately,
+  process payment in the background.
+  Task: Explain X-Ray tracing for Lambda. Write: how to enable and
+  configure X-Ray, reading service maps and traces, adding custom
+  subsegments for detailed breakdown, correlating X-Ray traces with
+  CloudWatch Logs (trace ID), common performance patterns revealed
+  by tracing, and the cost of X-Ray tracing.
+assertions:
+  - type: llm_judge
+    criteria: "X-Ray setup and traces are explained — enable: set tracing mode to Active on Lambda function. Lambda execution role needs xray:PutTraceSegments and xray:PutTelemetryRecords. Service map: visual representation of all services and their connections. Traces: timeline view of a single request showing each segment (service call) with duration. Subsegments: detailed breakdown within a segment (database queries, HTTP calls). X-Amzn-Trace-Id header propagates trace context across services automatically"
+    weight: 0.35
+    description: "X-Ray setup"
+  - type: llm_judge
+    criteria: "Reading traces and debugging are covered — trace view shows: total duration, each service call's duration, errors/faults at each step. Identify bottleneck: longest segment in the trace. Annotations: key-value pairs for filtering traces (e.g., userId, orderId). Metadata: additional data not indexed. Custom subsegments: wrap code blocks to track internal operations (e.g., database query, API call, data processing). AWS SDK calls automatically create subsegments. Correlate with CloudWatch Logs: trace ID appears in both for cross-reference"
+    weight: 0.35
+    description: "Reading traces"
+  - type: llm_judge
+    criteria: "Patterns and cost are practical — common patterns: cold start visible as Init segment, slow external API calls, N+1 query patterns (many small DB calls), synchronous chains (each service waits for the next). Solutions: async invocation for non-critical work, caching for repeated lookups, connection reuse. X-Ray cost: $5 per million traces recorded, $0.50 per million traces retrieved. First 100K traces/month free. Sampling: X-Ray samples 1 request/second + 5% of additional requests by default. Custom sampling rules to control cost"
+    weight: 0.30
+    description: "Patterns and cost"
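
Reviewer sketch (not part of the package): a rough estimate of monthly X-Ray recording cost using the default sampling rule and prices quoted in the assertions. It assumes steady traffic and a single reservoir of 1 request per second, so treat the result as an order-of-magnitude figure rather than a bill prediction; all names are hypothetical.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month
FREE_TRACES = 100_000               # free tier quoted in the scenario
PRICE_PER_MILLION = 5.0             # USD per million traces recorded


def sampled_per_second(rps, reservoir=1.0, rate=0.05):
    """Default rule: the first request each second plus 5% of the rest."""
    first = min(rps, reservoir)
    return first + (rps - first) * rate


def monthly_recording_cost(rps):
    """Estimated USD cost of recorded traces for steady traffic."""
    traces = sampled_per_second(rps) * SECONDS_PER_MONTH
    billable = max(traces - FREE_TRACES, 0.0)
    return billable * PRICE_PER_MILLION / 1_000_000
```

At a steady 100 requests per second this comes out to roughly 77 USD per month, which shows why custom sampling rules matter for high-traffic functions.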

package/courses/aws-lambda-debugging/scenarios/level-3/advanced-debugging-shift.yaml ADDED

@@ -0,0 +1,72 @@
+meta:
+  id: advanced-debugging-shift
+  level: 3
+  course: aws-lambda-debugging
+  type: output
+  description: "Combined advanced debugging shift — diagnose a complex serverless architecture with Step Functions, EventBridge, cross-account, and observability issues simultaneously"
+  tags: [AWS, Lambda, troubleshooting, combined, shift-simulation, advanced]
+state: {}
+trigger: |
+  You're the on-call engineer for a serverless e-commerce platform.
+  The PagerDuty alert fires at 2 AM: "Order completion rate dropped
+  from 98% to 72%."
+  Architecture:
+  API Gateway → order-api (Lambda)
+  → EventBridge (OrderCreated event)
+  → Step Functions workflow:
+    ValidateOrder → CheckInventory → ProcessPayment →
+    FulfillOrder → NotifyCustomer
+  → DynamoDB Streams → sync-to-analytics (Lambda, cross-account)
+  Investigation reveals 5 interconnected issues:
+  1. Step Functions: ProcessPayment fails for ~28% of orders.
+     The payment Lambda throws PaymentGatewayError but the Retry
+     config has MaxAttempts: 3 with BackoffRate: 10.0 — retries
+     at 5s, 50s, 500s. The total retry time exceeds the Step
+     Functions execution timeout of 60 seconds. Execution times out
+     during the third retry.
+  2. EventBridge: A new event pattern was deployed that's too
+     restrictive. OrderCreated events with amount > $1000 require
+     a "highValue" field that doesn't exist yet in the event
+     schema. These orders silently drop — no DLQ configured on
+     the rule.
+  3. Lambda@Edge: An A/B test function on CloudFront is throwing
+     errors for 5% of requests. The logs are scattered across
+     multiple regions. The function exceeds the 5-second viewer
+     request timeout during cold starts.
+  4. Cross-account sync: The analytics Lambda in the data account
+     stopped receiving DynamoDB Stream events. The event source
+     mapping shows "PROBLEM: Function call failed." The Lambda's
+     IAM role lost cross-account DynamoDB Stream permissions after
+     a recent IAM policy cleanup.
+  5. Observability gap: none of these issues triggered alerts.
+     CloudWatch alarms exist for Lambda Errors but not for Step
+     Functions failures, EventBridge delivery failures, or business
+     metrics (order completion rate).
+  Task: Walk through the complete incident response. Write: the
+  triage approach (business impact first), fix for each issue,
+  the monitoring gaps that allowed silent failures, and the
+  post-incident improvements to prevent recurrence.
+assertions:
+  - type: llm_judge
+    criteria: "All five issues are diagnosed — (1) Step Functions: reduce BackoffRate from 10 to 2, or increase execution timeout, or set MaxAttempts to 2. (2) EventBridge: fix pattern to not require 'highValue' field, add DLQ to EventBridge rule. (3) Lambda@Edge: reduce package size for faster cold starts, or move to origin request (30s timeout), check logs in execution regions. (4) Cross-account: restore IAM permissions for DynamoDB Stream access, configure MaximumRetryAttempts and failure destination. (5) Add alerts for Step Functions failures, EventBridge FailedInvocations, business metric (order completion rate)"
+    weight: 0.35
+    description: "All issues diagnosed"
+  - type: llm_judge
+    criteria: "Triage approach is business-focused — start with business impact: which orders are failing? (28% failure rate). Identify the data flow: API → EventBridge → Step Functions → completion. Check each step for errors. Step Functions execution history shows exactly which state failed. EventBridge: check FailedInvocations and InvocationsCreatedByRule metrics. For Lambda@Edge: check CloudFront error rate metric (4xx/5xx). Priority: fix Step Functions first (directly causing 28% failure), then EventBridge, then Lambda@Edge, then cross-account sync"
+    weight: 0.35
+    description: "Triage approach"
+  - type: llm_judge
+    criteria: "Post-incident improvements are comprehensive — monitoring gaps: alert on Step Functions ExecutionsFailed, EventBridge FailedInvocations, and business metrics (order completion rate is the most important metric, not Lambda errors). Add DLQ/destination to every async integration. Add observability: end-to-end tracing with X-Ray across all services, business metric dashboards, SLO definition (order completion rate > 99%). Process improvements: require DLQ configuration in IaC reviews, test event patterns with EventBridge sandbox before deploying, document cross-account dependencies, run chaos engineering to discover silent failures"
+    weight: 0.30
+    description: "Post-incident improvements"
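
Reviewer sketch (not part of the package): a quick check of the retry arithmetic in issue 1. It assumes `IntervalSeconds` is 5 (inferred from the 5s first retry; the scenario does not state it) and ignores the time each attempt spends running, so it is a lower bound on elapsed time rather than a model of Step Functions.

```python
from itertools import accumulate


def retry_delays(interval_seconds, backoff_rate, max_attempts):
    """Wait before each retry: interval * backoff_rate ** retry_index."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


def retries_within_timeout(delays, timeout_seconds):
    """Count retries whose cumulative wait still fits inside the timeout."""
    return sum(1 for elapsed in accumulate(delays) if elapsed <= timeout_seconds)
```

With a backoff rate of 10.0 the waits are 5, 50, and 500 seconds, so only two retries start within a 60-second timeout. With a rate of 2.0 the waits are 5, 10, and 20 seconds and all three fit.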

package/courses/aws-lambda-debugging/scenarios/level-3/container-image-lambda.yaml ADDED

@@ -0,0 +1,79 @@
+meta:
+  id: container-image-lambda
+  level: 3
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda container image deployments — diagnose ECR issues, runtime interface client errors, and container-based Lambda configuration problems"
+  tags: [AWS, Lambda, container-image, ECR, Docker, runtime-interface, advanced]
+state: {}
+trigger: |
+  Your team migrated a Lambda function from zip deployment to
+  container image for larger dependency support. The function fails:
+  $ aws lambda invoke --function-name ml-model output.json
+  {
+    "errorType": "Runtime.ExitError",
+    "errorMessage": "RequestId: abc-123 Error: Runtime exited
+    without providing a response"
+  }
+  The Dockerfile:
+  FROM python:3.12-slim
+  COPY requirements.txt .
+  RUN pip install -r requirements.txt
+  COPY app/ /app/
+  WORKDIR /app
+  CMD ["python", "handler.py"]
+  Problem 1: Missing Lambda Runtime Interface Client (RIC).
+  Container image Lambda requires the AWS Lambda Runtime Interface
+  Client to communicate with the Lambda service:
+  FROM python:3.12-slim
+  RUN pip install awslambdaric
+  COPY app/ /app/
+  WORKDIR /app
+  ENTRYPOINT ["python", "-m", "awslambdaric"]
+  CMD ["handler.lambda_handler"]
+  Or use the AWS-provided base image (includes RIC):
+  FROM public.ecr.aws/lambda/python:3.12
+  Problem 2: Image in ECR but Lambda can't pull it:
+  "ImageNotFoundException: The image with imageId ... does not exist"
+  ECR image URI must match exactly: account.dkr.ecr.region.amazonaws.com/
+  repo:tag. Digest-based reference is more reliable than tags.
+  Problem 3: Cold start is 15+ seconds.
+  Container image: 2.5GB (ML model + dependencies). Lambda must
+  download and extract the entire image on cold start.
+  Fix: use multi-stage build, strip unnecessary files, and consider
+  provisioned concurrency (SnapStart does not support container images).
+  Problem 4: Working locally but not in Lambda.
+  Local: docker run --rm -p 9000:8080 my-func
+  curl -d '{}' http://localhost:9000/2015-03-31/functions/function/invocations
+  Works! But Lambda has: read-only filesystem (except /tmp),
+  different user, and networking restrictions.
+  Task: Explain Lambda container image debugging. Write: AWS base
+  images vs custom images (with RIC), Dockerfile best practices for
+  Lambda, ECR configuration and permissions, cold start impact of
+  image size, local testing with RIE (Runtime Interface Emulator),
+  and when to use container images vs zip packages.
+assertions:
+  - type: llm_judge
+    criteria: "Runtime Interface Client is explained — Lambda container images need RIC (Runtime Interface Client) to communicate with Lambda service. Two approaches: (1) AWS base images (public.ecr.aws/lambda/python:3.12) include RIC and RIE, simplest setup. (2) Custom base images: install awslambdaric manually, set ENTRYPOINT to RIC, CMD to handler. ENTRYPOINT + CMD pattern: ENTRYPOINT runs RIC, CMD specifies handler. Without RIC: function exits immediately without responding"
+    weight: 0.35
+    description: "Runtime Interface"
+  - type: llm_judge
+    criteria: "ECR and cold starts are covered. ECR permissions: the ECR repository policy must grant ecr:GetDownloadUrlForLayer and ecr:BatchGetImage to the Lambda service principal (lambda.amazonaws.com). Cross-account ECR: set ECR repository policy to allow Lambda's account. Image size directly impacts cold start: 2GB image = 10+ second cold start. Optimization: use slim base images, multi-stage builds, remove build tools/caches from final image. Lambda caches frequently-used images (reduced cold start on subsequent invocations). Max image size: 10GB. Consider: if image is > 1GB, cold starts will be significant"
+    weight: 0.35
+    description: "ECR and cold starts"
+  - type: llm_judge
+    criteria: "Local testing and decision-making are practical — RIE (Runtime Interface Emulator): test container images locally. docker run -p 9000:8080 my-func, then curl localhost:9000/.../invocations. AWS base images include RIE; custom images need manual install. When to use container images: dependencies > 250MB, need specific OS packages, existing Docker workflow, ML models. When to use zip: smaller packages, faster cold starts, simpler deployment. Container images: more flexibility, larger size allowed, familiar Docker workflow. Zip: faster deployment, smaller cold starts, simpler"
+    weight: 0.30
+    description: "Local testing"
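
Reviewer sketch (not part of the package): calling the Runtime Interface Emulator from Python instead of curl. It assumes the container was started with the `-p 9000:8080` mapping shown in the scenario; the helper names are hypothetical. The invocation endpoint expects a POST with a JSON payload, which `urllib` sends whenever `data` is set.

```python
import json
import urllib.request

RIE_PATH = "/2015-03-31/functions/function/invocations"


def build_rie_request(payload, host="localhost", port=9000):
    """Build the POST request for a locally running RIE container."""
    data = json.dumps(payload).encode("utf-8")
    url = f"http://{host}:{port}{RIE_PATH}"
    # Supplying data makes urllib issue a POST rather than a GET.
    return urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )


def invoke_local(payload, timeout=10.0):
    """Send the event to the emulator and decode the JSON response."""
    with urllib.request.urlopen(build_rie_request(payload), timeout=timeout) as resp:
        return json.loads(resp.read())
```

Keeping request construction separate from the network call lets the URL and payload be checked without a running container.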

package/courses/aws-lambda-debugging/scenarios/level-3/cross-account-invocation.yaml ADDED

@@ -0,0 +1,72 @@
+meta:
+  id: cross-account-invocation
+  level: 3
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug cross-account Lambda invocation — diagnose permission issues, resource policies, and multi-account architecture patterns"
+  tags: [AWS, Lambda, cross-account, IAM, resource-policy, multi-account, advanced]
+state: {}
+trigger: |
+  Your organization uses separate AWS accounts for dev, staging, and
+  production. A shared services account hosts common Lambda functions
+  that other accounts invoke. Cross-account invocation fails:
+  Account A (application) tries to invoke a function in Account B (shared):
+  $ aws lambda invoke --function-name \
+    arn:aws:lambda:us-east-1:222222222222:function:shared-auth \
+    --region us-east-1 output.json
+  Error:
+  An error occurred (AccessDeniedException): User:
+  arn:aws:sts::111111111111:assumed-role/app-role/... is not authorized
+  to perform: lambda:InvokeFunction on resource:
+  arn:aws:lambda:us-east-1:222222222222:function:shared-auth
+  Cross-account invocation requires TWO things:
+  1. Account B (target) — Resource-based policy on the Lambda function:
+  $ aws lambda add-permission \
+    --function-name shared-auth \
+    --statement-id AllowAccountA \
+    --action lambda:InvokeFunction \
+    --principal 111111111111
+  2. Account A (caller) — IAM policy allowing the role to invoke:
+  {
+    "Effect": "Allow",
+    "Action": "lambda:InvokeFunction",
+    "Resource": "arn:aws:lambda:us-east-1:222222222222:function:shared-auth"
+  }
+  After fixing, another issue: the shared function uses $LATEST but
+  different accounts see different versions because $LATEST was
+  updated mid-deployment. Use a specific version or alias for
+  consistency across accounts.
+  Additional complexity: cross-account event source mappings.
+  Account A's SQS queue triggering Account B's Lambda requires the Lambda
+  execution role to have sqs:ReceiveMessage, sqs:DeleteMessage, and
+  sqs:GetQueueAttributes on that queue, AND the queue policy must allow Account B's role.
+  Task: Explain cross-account Lambda debugging. Write: resource
+  policies vs IAM policies (both needed), cross-account invocation
+  setup, cross-account event sources, version/alias management
+  across accounts, AWS Organizations for centralized governance,
+  and multi-account architecture patterns.
+assertions:
+  - type: llm_judge
+    criteria: "Cross-account permissions are explained — two policies needed: (1) resource-based policy on target Lambda (allows the calling account/role to invoke), (2) IAM policy on calling role (allows invoking the specific function ARN). Either alone is insufficient. Resource policy: aws lambda add-permission with --principal (account or role ARN). IAM policy: lambda:InvokeFunction on the cross-account function ARN. For event sources: both the Lambda execution role and the source resource policy must allow cross-account access"
+    weight: 0.35
+    description: "Cross-account permissions"
+  - type: llm_judge
+    criteria: "Multi-account patterns are covered — common architecture: shared services account (auth, logging, notification), application accounts (dev, staging, prod), security account (audit, compliance). Use AWS Organizations for account management. Share Lambda functions via: cross-account invocation (resource policy), or deploy same function to each account via CI/CD. Version management: use aliases (prod, staging) and specific versions — never $LATEST across accounts. AWS RAM (Resource Access Manager) for sharing other resources"
+    weight: 0.35
+    description: "Multi-account patterns"
+  - type: llm_judge
+    criteria: "Debugging and governance are practical — debugging cross-account: check both IAM policy AND resource policy (common to fix only one). Use aws iam simulate-principal-policy for permission testing. CloudTrail logs in both accounts for access tracking. Governance: use AWS Organizations SCPs to restrict which functions can be invoked cross-account. Tag functions with account access level. Document all cross-account dependencies. Use IaC (CDK/Terraform) to manage resource policies consistently. Cross-account X-Ray: traces can span accounts for end-to-end visibility"
+    weight: 0.30
+    description: "Debugging and governance"
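
Reviewer sketch (not part of the package): the "both policies needed" rule from this scenario expressed as a small decision function. It is a simplification that ignores explicit denies, SCPs, and permission boundaries, and the function and parameter names are hypothetical.

```python
def invoke_allowed(caller_account, target_account,
                   identity_policy_allows, resource_policy_allows):
    """Simplified allow decision for lambda:InvokeFunction."""
    if caller_account == target_account:
        # Within one account, an allow in either policy is enough.
        return identity_policy_allows or resource_policy_allows
    # Across accounts, the caller's IAM policy and the function's
    # resource-based policy must both allow the call.
    return identity_policy_allows and resource_policy_allows
```

In the failing case above, account 111111111111 calling account 222222222222 with only one of the two policies in place evaluates to `False`, which matches the AccessDeniedException.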

package/courses/aws-lambda-debugging/scenarios/level-3/eventbridge-patterns.yaml ADDED

@@ -0,0 +1,79 @@
+meta:
+  id: eventbridge-patterns
+  level: 3
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug EventBridge and Lambda integration — diagnose rule matching failures, event pattern issues, and event-driven architecture debugging"
+  tags: [AWS, Lambda, EventBridge, events, patterns, event-driven, advanced]
+state: {}
+trigger: |
+  Your event-driven architecture uses EventBridge to route events to
+  Lambda functions. A new event type isn't being processed:
+  Event published:
+  {
+    "source": "myapp.orders",
+    "detail-type": "OrderCreated",
+    "detail": {
+      "orderId": "ORD-123",
+      "status": "created",
+      "amount": 99.99,
+      "customer": { "tier": "premium" }
+    }
+  }
+  EventBridge rule pattern:
+  {
+    "source": ["myapp.orders"],
+    "detail-type": ["OrderCreated"],
+    "detail": {
+      "status": ["created"],
+      "customer": {
+        "tier": ["premium", "enterprise"]
+      }
+    }
+  }
+  The event matches but the Lambda isn't invoked. Investigation:
+  1. The rule exists but the target Lambda function ARN has a typo.
+     No error raised — EventBridge silently fails to invoke the target.
+     EventBridge metrics: FailedInvocations shows the failures.
+  2. After fixing the ARN, the Lambda resource policy doesn't allow
+     EventBridge to invoke it. Need:
+     $ aws lambda add-permission \
+       --function-name process-premium-order \
+       --statement-id EventBridgeInvoke \
+       --action lambda:InvokeFunction \
+       --principal events.amazonaws.com \
+       --source-arn arn:aws:events:us-east-1:...:rule/premium-order-rule
+  3. Some events match but the Lambda receives a different format
+     than expected. EventBridge wraps the detail in its envelope:
+     { "source": "...", "detail-type": "...", "detail": { ... }, ... }
+     The Lambda must extract data from event.detail, not event directly.
+  4. Events from another account don't arrive. Cross-account
+     EventBridge requires: event bus policy + rule on the receiving bus.
+  Task: Explain EventBridge + Lambda debugging. Write: event patterns
+  (matching rules, content filtering), common pattern mistakes, rule
+  and target configuration, cross-account/cross-region events, dead-
+  letter queues for failed deliveries, and EventBridge testing tools.
+assertions:
+  - type: llm_judge
+    criteria: "Event patterns are explained — patterns match on exact values (array of allowed values), prefix, numeric comparison, exists check. Common mistakes: putting a string instead of array ('created' vs ['created']), wrong nesting level, case sensitivity. Pattern testing: use EventBridge sandbox to test patterns before deploying. Content-based filtering: only process events matching specific criteria. All fields in pattern must match — fields not in pattern are ignored (open matching). Input transformation: reshape the event before sending to target"
+    weight: 0.35
+    description: "Event patterns"
+  - type: llm_judge
+    criteria: "Target and delivery issues are covered — target Lambda must have resource policy allowing events.amazonaws.com. Misconfigured targets fail silently — check FailedInvocations metric. DLQ for rules: configure DLQ to capture events that fail to deliver to target. EventBridge invokes Lambda asynchronously — standard async retry behavior applies (2 retries). Maximum event size: 256KB. Event bus has a throughput limit (default varies by region). Cross-account: sender's event bus → receiver's event bus (requires event bus policy)"
+    weight: 0.35
+    description: "Target and delivery"
+  - type: llm_judge
+    criteria: "Testing and debugging are practical — EventBridge sandbox: test event patterns without deploying. Archive and replay: archive events for debugging, replay to test fixes. aws events put-events: manually send test events. CloudWatch Logs: enable EventBridge logging to see which rules matched. Metrics: Invocations, FailedInvocations, TriggeredRules, ThrottledRules. Use EventBridge Schema Registry to document event formats. Design events with versioning (include version field) for backward compatibility. Event catalog: document all event types, sources, and consumers"
+    weight: 0.30
+    description: "Testing and debugging"
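
Note on the scenario above: the matching rules it describes (leaf values given as arrays of allowed values, nesting that mirrors the event, case-sensitive comparison, fields absent from the pattern ignored) can be sketched in Python. This is a minimal illustration under stated assumptions, not EventBridge's implementation: `matches` and `lambda_handler` are hypothetical names, only exact-value matching is covered (no prefix, numeric, or exists filters), and raising `ValueError` on a bare string is a choice made for this sketch.

```python
from typing import Any

# Event and rule pattern copied from the scenario's trigger text.
EVENT = {
    "source": "myapp.orders",
    "detail-type": "OrderCreated",
    "detail": {
        "orderId": "ORD-123",
        "status": "created",
        "amount": 99.99,
        "customer": {"tier": "premium"},
    },
}

RULE_PATTERN = {
    "source": ["myapp.orders"],
    "detail-type": ["OrderCreated"],
    "detail": {
        "status": ["created"],
        "customer": {"tier": ["premium", "enterprise"]},
    },
}


def matches(pattern: dict[str, Any], event: dict[str, Any]) -> bool:
    """Simplified exact-value matcher (hypothetical helper, not an AWS API)."""
    for key, expected in pattern.items():
        # Every field named in the pattern must be present in the event.
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: the event must have an object at the same level.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif isinstance(expected, list):
            # Leaf: the event value must equal one of the allowed values.
            if actual not in expected:
                return False
        else:
            # Bare scalar such as "created" instead of ["created"].
            raise ValueError(f"pattern value for {key!r} must be a list")
    # Event fields that the pattern does not mention are ignored.
    return True


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """Read business data from the envelope's "detail" key, not the top level."""
    detail = event["detail"]
    return {"orderId": detail["orderId"], "tier": detail["customer"]["tier"]}
```

With these definitions, `matches(RULE_PATTERN, EVENT)` returns `True`, while a pattern that places `tier` directly under `detail` (wrong nesting level) or uses `["Created"]` (wrong case) returns `False`.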

package/courses/aws-lambda-debugging/scenarios/level-3/iac-deployment-debugging.yaml ADDED

@@ -0,0 +1,68 @@
+meta:
+  id: iac-deployment-debugging
+  level: 3
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda IaC deployments — diagnose SAM, CDK, and CloudFormation deployment failures, rollbacks, and configuration drift"
+  tags: [AWS, Lambda, SAM, CDK, CloudFormation, IaC, deployment, advanced]
+state: {}
+trigger: |
+  Your SAM deployment fails during CloudFormation stack creation:
+  $ sam deploy --stack-name order-api --guided
+  CloudFormation events:
+  CREATE_IN_PROGRESS  AWS::Lambda::Function  OrderFunction
+  CREATE_FAILED       AWS::Lambda::Function  OrderFunction
+    "Resource handler returned message: 'Lambda function
+    order-api-OrderFunction-abc123 could not be found'
+    (Service: Lambda, Status Code: 404)"
+  ROLLBACK_IN_PROGRESS
+  The stack is now in ROLLBACK_IN_PROGRESS. Investigation:
+  1. The function code was uploaded to S3 (CodeUri) but the S3
+     bucket was in a different region than the stack. Lambda and
+     the code bucket must be in the same region.
+  2. After fixing the region, deployment succeeds but the function
+     uses the old code. SAM didn't detect code changes because the
+     zip file hash didn't change (build artifacts weren't cleaned):
+     $ sam build --no-cached  # Force rebuild
+  3. CDK deployment fails differently:
+     $ cdk deploy
+     "UPDATE_ROLLBACK_IN_PROGRESS": the failed update is still
+     rolling back, and the next deploy fails because CloudFormation
+     rejects updates until the rollback completes. Wait for
+     UPDATE_ROLLBACK_COMPLETE, then fix the issue and redeploy.
+  4. Configuration drift: someone manually changed the Lambda
+     function's memory via console from 256MB to 512MB. The next
+     SAM deploy reverts it to 256MB (as defined in template.yaml).
+     The team is confused: "Who changed the memory back?"
+  5. CloudFormation stack stuck in DELETE_FAILED because the Lambda
+     function has a resource-based policy referencing a deleted
+     resource. Manual intervention needed.
+  Task: Explain IaC deployment debugging for Lambda. Write: common
+  SAM/CDK deployment errors, CloudFormation stack states (and how
+  to recover from each), configuration drift detection, rollback
+  strategies, blue-green deployments with SAM/CDK, and IaC best
+  practices for serverless applications.
+assertions:
+  - type: llm_judge
+    criteria: "Deployment errors are explained — common SAM errors: S3 bucket region mismatch, stale build artifacts (use sam build --no-cached), template validation errors (sam validate). CDK errors: synthesis failures (TypeScript errors), asset upload failures, stack update conflicts. CloudFormation: CREATE_FAILED (resource creation error), UPDATE_ROLLBACK_COMPLETE (update failed and rolled back), DELETE_FAILED (can't delete dependent resources). Always check CloudFormation events for the specific error message — it's usually very descriptive"
+    weight: 0.35
+    description: "Deployment errors"
+  - type: llm_judge
+    criteria: "Stack states and recovery are covered — ROLLBACK_IN_PROGRESS: wait for completion, then fix and redeploy. UPDATE_ROLLBACK_COMPLETE: stack is stable at previous version, safe to deploy again. DELETE_FAILED: manually delete problematic resources, then retry delete. ROLLBACK_FAILED: most dangerous — may need AWS support or manual resource cleanup. Configuration drift: use CloudFormation drift detection (aws cloudformation detect-stack-drift). Prevent: use IaC for all changes, restrict console access for production, use AWS Config rules to detect manual changes"
+    weight: 0.35
+    description: "Stack recovery"
+  - type: llm_judge
+    criteria: "Best practices are practical — use sam deploy --no-fail-on-empty-changeset for CI/CD (no error if nothing changed). SAM safe deployments: AutoPublishAlias + DeploymentPreference (Canary, Linear) for gradual rollouts with automatic rollback on CloudWatch alarm. CDK: use cdk diff before deploy to review changes. Pin dependency versions in requirements.txt/package-lock.json for reproducible builds. Use parameterized templates for multi-environment deployment. Never manually modify resources managed by CloudFormation. Use CloudFormation change sets to preview changes before applying"
+    weight: 0.30
+    description: "Best practices"
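
Note on the scenario above: the stack-state guidance in its assertions can be sketched as a small lookup, together with a helper that picks the earliest failed event out of a stack's event list. This is an illustrative sketch only. `RECOVERY_ACTIONS`, `recovery_action`, and `first_failure` are hypothetical names, the event dictionaries assume the field names returned by `describe-stack-events` (`Timestamp`, `LogicalResourceId`, `ResourceStatus`, `ResourceStatusReason`), and the timestamps in the sample data are invented.

```python
from datetime import datetime
from typing import Any, Optional

# Guidance paraphrased from the scenario's "Stack recovery" criteria.
RECOVERY_ACTIONS = {
    "ROLLBACK_IN_PROGRESS": "wait for the rollback to finish, then fix and redeploy",
    "UPDATE_ROLLBACK_COMPLETE": "stable at the previous version; safe to deploy again",
    "DELETE_FAILED": "manually delete the problematic resources, then retry the delete",
    "ROLLBACK_FAILED": "may need AWS Support or manual resource cleanup",
}


def recovery_action(stack_status: str) -> str:
    """Return the recorded guidance for a stack status (hypothetical helper)."""
    return RECOVERY_ACTIONS.get(
        stack_status,
        "no guidance recorded; check the CloudFormation events for the error",
    )


def first_failure(events: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Return the earliest event whose status ends in _FAILED, or None."""
    for event in sorted(events, key=lambda e: e["Timestamp"]):
        if event["ResourceStatus"].endswith("_FAILED"):
            return event
    return None


# Sample data modeled on the scenario's event listing, newest first.
# The timestamps are invented for illustration.
SAMPLE_EVENTS = [
    {
        "Timestamp": datetime(2024, 1, 1, 12, 0, 30),
        "LogicalResourceId": "order-api",
        "ResourceStatus": "ROLLBACK_IN_PROGRESS",
    },
    {
        "Timestamp": datetime(2024, 1, 1, 12, 0, 20),
        "LogicalResourceId": "OrderFunction",
        "ResourceStatus": "CREATE_FAILED",
        "ResourceStatusReason": "Lambda function order-api-OrderFunction-abc123 could not be found",
    },
    {
        "Timestamp": datetime(2024, 1, 1, 12, 0, 10),
        "LogicalResourceId": "OrderFunction",
        "ResourceStatus": "CREATE_IN_PROGRESS",
    },
]
```

Calling `first_failure(SAMPLE_EVENTS)` returns the `OrderFunction` event with status `CREATE_FAILED`, which is the one whose reason text is worth reading first. Sorting by timestamp matters because later failures in a rollback are often consequences of the first one.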