aigroup-workflow 2.2.0 → 2.2.2
- package/.claude/commands/fix-build.md +10 -5
- package/.claude/commands/init-project.md +13 -8
- package/.claude/commands/plan.md +15 -8
- package/.claude/commands/review.md +12 -6
- package/.claude/commands/tdd.md +11 -5
- package/.claude/commands/workflow-start.md +20 -11
- package/.claude/settings.json +28 -0
- package/.codex/agents/architect.toml +207 -0
- package/.codex/agents/build-error-resolver.toml +110 -0
- package/.codex/agents/code-reviewer.toml +233 -0
- package/.codex/agents/doc-updater.toml +103 -0
- package/.codex/agents/e2e-runner.toml +103 -0
- package/.codex/agents/get-current-datetime.toml +23 -0
- package/.codex/agents/init-architect.toml +181 -0
- package/.codex/agents/planner.toml +208 -0
- package/.codex/agents/refactor-cleaner.toml +81 -0
- package/.codex/agents/rust-reviewer.toml +90 -0
- package/.codex/agents/security-reviewer.toml +104 -0
- package/.codex/agents/tdd-guide.toml +87 -0
- package/AGENTS.md +2 -2
- package/CLAUDE.md +23 -1
- package/LICENSE +20 -20
- package/README.md +333 -333
- package/agents/a11y-architect.md +141 -141
- package/agents/architect.md +211 -211
- package/agents/build-error-resolver.md +114 -114
- package/agents/chief-of-staff.md +151 -151
- package/agents/code-architect.md +71 -71
- package/agents/code-explorer.md +69 -69
- package/agents/code-reviewer.md +237 -237
- package/agents/code-simplifier.md +47 -47
- package/agents/comment-analyzer.md +45 -45
- package/agents/conversation-analyzer.md +52 -52
- package/agents/cpp-build-resolver.md +90 -90
- package/agents/cpp-reviewer.md +72 -72
- package/agents/csharp-reviewer.md +101 -101
- package/agents/dart-build-resolver.md +201 -201
- package/agents/database-reviewer.md +91 -91
- package/agents/doc-updater.md +107 -107
- package/agents/docs-lookup.md +68 -68
- package/agents/e2e-runner.md +107 -107
- package/agents/flutter-reviewer.md +243 -243
- package/agents/gan-evaluator.md +209 -209
- package/agents/gan-generator.md +131 -131
- package/agents/gan-planner.md +99 -99
- package/agents/get-current-datetime.md +26 -26
- package/agents/go-build-resolver.md +94 -94
- package/agents/go-reviewer.md +76 -76
- package/agents/harness-optimizer.md +35 -35
- package/agents/healthcare-reviewer.md +83 -83
- package/agents/java-build-resolver.md +153 -153
- package/agents/java-reviewer.md +92 -92
- package/agents/kotlin-build-resolver.md +118 -118
- package/agents/kotlin-reviewer.md +159 -159
- package/agents/loop-operator.md +36 -36
- package/agents/opensource-forker.md +198 -198
- package/agents/opensource-packager.md +249 -249
- package/agents/opensource-sanitizer.md +188 -188
- package/agents/performance-optimizer.md +446 -446
- package/agents/planner.md +212 -212
- package/agents/pr-test-analyzer.md +45 -45
- package/agents/python-reviewer.md +98 -98
- package/agents/pytorch-build-resolver.md +120 -120
- package/agents/refactor-cleaner.md +85 -85
- package/agents/rust-build-resolver.md +148 -148
- package/agents/rust-reviewer.md +94 -94
- package/agents/security-reviewer.md +108 -108
- package/agents/seo-specialist.md +59 -59
- package/agents/silent-failure-hunter.md +50 -50
- package/agents/tdd-guide.md +91 -91
- package/agents/type-design-analyzer.md +41 -41
- package/agents/typescript-reviewer.md +112 -112
- package/cli/commands/update.mjs +1 -1
- package/cli/utils/scaffold.mjs +53 -0
- package/docs/rules/agents.md +166 -50
- package/docs/rules/cpp/coding-style.md +44 -44
- package/docs/rules/cpp/hooks.md +39 -39
- package/docs/rules/cpp/patterns.md +51 -51
- package/docs/rules/cpp/security.md +51 -51
- package/docs/rules/cpp/testing.md +44 -44
- package/docs/rules/csharp/coding-style.md +72 -72
- package/docs/rules/csharp/hooks.md +25 -25
- package/docs/rules/csharp/patterns.md +50 -50
- package/docs/rules/csharp/security.md +58 -58
- package/docs/rules/csharp/testing.md +46 -46
- package/docs/rules/dart/coding-style.md +159 -159
- package/docs/rules/dart/hooks.md +66 -66
- package/docs/rules/dart/patterns.md +261 -261
- package/docs/rules/dart/security.md +135 -135
- package/docs/rules/dart/testing.md +215 -215
- package/docs/rules/golang/coding-style.md +32 -32
- package/docs/rules/golang/hooks.md +17 -17
- package/docs/rules/golang/patterns.md +45 -45
- package/docs/rules/golang/security.md +34 -34
- package/docs/rules/golang/testing.md +31 -31
- package/docs/rules/java/coding-style.md +114 -114
- package/docs/rules/java/hooks.md +18 -18
- package/docs/rules/java/patterns.md +146 -146
- package/docs/rules/java/security.md +100 -100
- package/docs/rules/java/testing.md +131 -131
- package/docs/rules/kotlin/coding-style.md +86 -86
- package/docs/rules/kotlin/hooks.md +17 -17
- package/docs/rules/kotlin/patterns.md +146 -146
- package/docs/rules/kotlin/security.md +82 -82
- package/docs/rules/kotlin/testing.md +128 -128
- package/docs/rules/perl/coding-style.md +46 -46
- package/docs/rules/perl/hooks.md +22 -22
- package/docs/rules/perl/patterns.md +76 -76
- package/docs/rules/perl/security.md +69 -69
- package/docs/rules/perl/testing.md +54 -54
- package/docs/rules/php/coding-style.md +40 -40
- package/docs/rules/php/hooks.md +24 -24
- package/docs/rules/php/patterns.md +33 -33
- package/docs/rules/php/security.md +37 -37
- package/docs/rules/php/testing.md +39 -39
- package/docs/rules/python/coding-style.md +42 -42
- package/docs/rules/python/hooks.md +19 -19
- package/docs/rules/python/patterns.md +39 -39
- package/docs/rules/python/security.md +30 -30
- package/docs/rules/python/testing.md +38 -38
- package/docs/rules/rust/coding-style.md +151 -151
- package/docs/rules/rust/hooks.md +16 -16
- package/docs/rules/rust/patterns.md +168 -168
- package/docs/rules/rust/security.md +141 -141
- package/docs/rules/rust/testing.md +154 -154
- package/docs/rules/swift/coding-style.md +47 -47
- package/docs/rules/swift/hooks.md +20 -20
- package/docs/rules/swift/patterns.md +66 -66
- package/docs/rules/swift/security.md +33 -33
- package/docs/rules/swift/testing.md +45 -45
- package/docs/rules/typescript/coding-style.md +199 -199
- package/docs/rules/typescript/hooks.md +22 -22
- package/docs/rules/typescript/patterns.md +52 -52
- package/docs/rules/typescript/security.md +28 -28
- package/docs/rules/typescript/testing.md +18 -18
- package/docs/rules/web/coding-style.md +96 -96
- package/docs/rules/web/design-quality.md +62 -62
- package/docs/rules/web/hooks.md +120 -120
- package/docs/rules/web/patterns.md +79 -79
- package/docs/rules/web/performance.md +64 -64
- package/docs/rules/web/security.md +57 -57
- package/docs/rules/web/testing.md +55 -55
- package/docs/templates/README.md +36 -36
- package/docs/templates/ai-project-final.md +124 -124
- package/docs/templates/ai-project.md +105 -105
- package/docs/templates/api.md +157 -157
- package/docs/templates/bug.md +62 -62
- package/docs/templates/code-review.md +87 -87
- package/docs/templates/generic.md +116 -116
- package/docs/templates/implementation-plan.md +1 -1
- package/docs/templates/meeting.md +68 -68
- package/docs/templates/prd.md +98 -98
- package/docs/templates/ui.md +134 -134
- package/docs/workflow-pipeline.md +11 -10
- package/package.json +40 -39
- package/scripts/hooks/checks/orchestration-artifacts.cjs +28 -23
- package/scripts/hooks/checks/workflow-state.cjs +4 -5
- package/scripts/orchestration/lib/orchestrator.cjs +344 -117
- package/scripts/orchestration/lib/validate.cjs +145 -0
- package/scripts/orchestration/session.cjs +88 -44
- package/skills/SUPERPOWERS-LICENSE +21 -21
- package/skills/ai-ml/fine-tuning-expert/SKILL.md +162 -162
- package/skills/ai-ml/fine-tuning-expert/references/dataset-preparation.md +540 -540
- package/skills/ai-ml/fine-tuning-expert/references/deployment-optimization.md +673 -673
- package/skills/ai-ml/fine-tuning-expert/references/evaluation-metrics.md +597 -597
- package/skills/ai-ml/fine-tuning-expert/references/hyperparameter-tuning.md +565 -565
- package/skills/ai-ml/fine-tuning-expert/references/lora-peft.md +347 -347
- package/skills/ai-ml/ml-pipeline/SKILL.md +159 -159
- package/skills/ai-ml/ml-pipeline/references/experiment-tracking.md +833 -833
- package/skills/ai-ml/ml-pipeline/references/feature-engineering.md +631 -631
- package/skills/ai-ml/ml-pipeline/references/model-validation.md +978 -978
- package/skills/ai-ml/ml-pipeline/references/pipeline-orchestration.md +907 -907
- package/skills/ai-ml/ml-pipeline/references/training-pipelines.md +782 -782
- package/skills/ai-ml/rag-architect/SKILL.md +194 -194
- package/skills/ai-ml/rag-architect/references/chunking-strategies.md +878 -878
- package/skills/ai-ml/rag-architect/references/embedding-models.md +561 -561
- package/skills/ai-ml/rag-architect/references/rag-evaluation.md +833 -833
- package/skills/ai-ml/rag-architect/references/retrieval-optimization.md +795 -795
- package/skills/ai-ml/rag-architect/references/vector-databases.md +589 -589
- package/skills/ai-ml/spark-engineer/SKILL.md +148 -148
- package/skills/ai-ml/spark-engineer/references/partitioning-caching.md +543 -543
- package/skills/ai-ml/spark-engineer/references/performance-tuning.md +544 -544
- package/skills/ai-ml/spark-engineer/references/rdd-operations.md +599 -599
- package/skills/ai-ml/spark-engineer/references/spark-sql-dataframes.md +474 -474
- package/skills/ai-ml/spark-engineer/references/streaming-patterns.md +786 -786
- package/skills/backend/api-designer/SKILL.md +217 -217
- package/skills/backend/api-designer/references/error-handling.md +541 -541
- package/skills/backend/api-designer/references/openapi.md +824 -824
- package/skills/backend/api-designer/references/pagination.md +494 -494
- package/skills/backend/api-designer/references/rest-patterns.md +335 -335
- package/skills/backend/api-designer/references/versioning.md +391 -391
- package/skills/backend/architecture-designer/SKILL.md +117 -117
- package/skills/backend/architecture-designer/references/adr-template.md +116 -116
- package/skills/backend/architecture-designer/references/architecture-patterns.md +111 -111
- package/skills/backend/architecture-designer/references/database-selection.md +102 -102
- package/skills/backend/architecture-designer/references/nfr-checklist.md +112 -112
- package/skills/backend/architecture-designer/references/system-design.md +100 -100
- package/skills/backend/code-documenter/SKILL.md +147 -147
- package/skills/backend/code-documenter/references/api-docs-fastapi-django.md +166 -166
- package/skills/backend/code-documenter/references/api-docs-nestjs-express.md +220 -220
- package/skills/backend/code-documenter/references/coverage-reports.md +125 -125
- package/skills/backend/code-documenter/references/documentation-systems.md +333 -333
- package/skills/backend/code-documenter/references/interactive-api-docs.md +531 -531
- package/skills/backend/code-documenter/references/python-docstrings.md +121 -121
- package/skills/backend/code-documenter/references/typescript-jsdoc.md +145 -145
- package/skills/backend/code-documenter/references/user-guides-tutorials.md +530 -530
- package/skills/backend/debugging-wizard/SKILL.md +105 -105
- package/skills/backend/debugging-wizard/references/common-patterns.md +132 -132
- package/skills/backend/debugging-wizard/references/debugging-tools.md +140 -140
- package/skills/backend/debugging-wizard/references/quick-fixes.md +177 -177
- package/skills/backend/debugging-wizard/references/strategies.md +142 -142
- package/skills/backend/debugging-wizard/references/systematic-debugging.md +367 -367
- package/skills/backend/feature-forge/SKILL.md +98 -98
- package/skills/backend/feature-forge/references/acceptance-criteria.md +104 -104
- package/skills/backend/feature-forge/references/ears-syntax.md +99 -99
- package/skills/backend/feature-forge/references/interview-questions.md +150 -150
- package/skills/backend/feature-forge/references/pre-discovery-subagents.md +54 -54
- package/skills/backend/feature-forge/references/specification-template.md +103 -103
- package/skills/backend/fullstack-guardian/SKILL.md +105 -105
- package/skills/backend/fullstack-guardian/references/api-design-standards.md +307 -307
- package/skills/backend/fullstack-guardian/references/architecture-decisions.md +350 -350
- package/skills/backend/fullstack-guardian/references/backend-patterns.md +237 -237
- package/skills/backend/fullstack-guardian/references/common-patterns.md +134 -134
- package/skills/backend/fullstack-guardian/references/deliverables-checklist.md +354 -354
- package/skills/backend/fullstack-guardian/references/design-template.md +91 -91
- package/skills/backend/fullstack-guardian/references/error-handling.md +135 -135
- package/skills/backend/fullstack-guardian/references/frontend-patterns.md +340 -340
- package/skills/backend/fullstack-guardian/references/integration-patterns.md +333 -333
- package/skills/backend/fullstack-guardian/references/security-checklist.md +106 -106
- package/skills/backend/graphql-architect/SKILL.md +146 -146
- package/skills/backend/graphql-architect/references/federation.md +418 -418
- package/skills/backend/graphql-architect/references/migration-from-rest.md +1141 -1141
- package/skills/backend/graphql-architect/references/resolvers.md +425 -425
- package/skills/backend/graphql-architect/references/schema-design.md +393 -393
- package/skills/backend/graphql-architect/references/security.md +569 -569
- package/skills/backend/graphql-architect/references/subscriptions.md +510 -510
- package/skills/backend/legacy-modernizer/SKILL.md +137 -137
- package/skills/backend/legacy-modernizer/references/legacy-testing.md +381 -381
- package/skills/backend/legacy-modernizer/references/migration-strategies.md +423 -423
- package/skills/backend/legacy-modernizer/references/refactoring-patterns.md +395 -395
- package/skills/backend/legacy-modernizer/references/strangler-fig-pattern.md +281 -281
- package/skills/backend/legacy-modernizer/references/system-assessment.md +487 -487
- package/skills/backend/microservices-architect/SKILL.md +164 -164
- package/skills/backend/microservices-architect/references/communication.md +499 -499
- package/skills/backend/microservices-architect/references/data.md +721 -721
- package/skills/backend/microservices-architect/references/decomposition.md +344 -344
- package/skills/backend/microservices-architect/references/observability.md +805 -805
- package/skills/backend/microservices-architect/references/patterns.md +603 -603
- package/skills/database/database-optimizer/SKILL.md +147 -147
- package/skills/database/database-optimizer/references/index-strategies.md +331 -331
- package/skills/database/database-optimizer/references/monitoring-analysis.md +501 -501
- package/skills/database/database-optimizer/references/mysql-tuning.md +452 -452
- package/skills/database/database-optimizer/references/postgresql-tuning.md +413 -413
- package/skills/database/database-optimizer/references/query-optimization.md +251 -251
- package/skills/database/postgres-pro/SKILL.md +152 -152
- package/skills/database/postgres-pro/references/extensions.md +404 -404
- package/skills/database/postgres-pro/references/jsonb.md +321 -321
- package/skills/database/postgres-pro/references/maintenance.md +481 -481
- package/skills/database/postgres-pro/references/performance.md +265 -265
- package/skills/database/postgres-pro/references/replication.md +446 -446
- package/skills/database/sql-pro/SKILL.md +129 -129
- package/skills/database/sql-pro/references/database-design.md +402 -402
- package/skills/database/sql-pro/references/dialect-differences.md +419 -419
- package/skills/database/sql-pro/references/optimization.md +384 -384
- package/skills/database/sql-pro/references/query-patterns.md +285 -285
- package/skills/database/sql-pro/references/window-functions.md +328 -328
- package/skills/dotnet/csharp-developer/SKILL.md +125 -125
- package/skills/dotnet/csharp-developer/references/aspnet-core.md +394 -394
- package/skills/dotnet/csharp-developer/references/blazor.md +553 -553
- package/skills/dotnet/csharp-developer/references/entity-framework.md +409 -409
- package/skills/dotnet/csharp-developer/references/modern-csharp.md +248 -248
- package/skills/dotnet/csharp-developer/references/performance.md +498 -498
- package/skills/dotnet/dotnet-core-expert/SKILL.md +138 -138
- package/skills/dotnet/dotnet-core-expert/references/authentication.md +546 -546
- package/skills/dotnet/dotnet-core-expert/references/clean-architecture.md +455 -455
- package/skills/dotnet/dotnet-core-expert/references/cloud-native.md +548 -548
- package/skills/dotnet/dotnet-core-expert/references/entity-framework.md +440 -440
- package/skills/dotnet/dotnet-core-expert/references/minimal-apis.md +319 -319
- package/skills/frontend/angular-architect/SKILL.md +152 -152
- package/skills/frontend/angular-architect/references/components.md +297 -297
- package/skills/frontend/angular-architect/references/ngrx.md +401 -401
- package/skills/frontend/angular-architect/references/routing.md +361 -361
- package/skills/frontend/angular-architect/references/rxjs.md +319 -319
- package/skills/frontend/angular-architect/references/testing.md +405 -405
- package/skills/frontend/design-commands/design.md +91 -91
- package/skills/frontend/design-commands/handoff.md +97 -97
- package/skills/frontend/design-commands/prototype.md +120 -120
- package/skills/frontend/design-commands/spec.md +160 -160
- package/skills/frontend/design-commands/style.md +78 -78
- package/skills/frontend/flutter-expert/SKILL.md +138 -138
- package/skills/frontend/flutter-expert/references/bloc-state.md +259 -259
- package/skills/frontend/flutter-expert/references/gorouter-navigation.md +119 -119
- package/skills/frontend/flutter-expert/references/performance.md +99 -99
- package/skills/frontend/flutter-expert/references/project-structure.md +118 -118
- package/skills/frontend/flutter-expert/references/riverpod-state.md +130 -130
- package/skills/frontend/flutter-expert/references/widget-patterns.md +123 -123
- package/skills/frontend/nextjs-developer/SKILL.md +143 -143
- package/skills/frontend/nextjs-developer/references/app-router.md +311 -311
- package/skills/frontend/nextjs-developer/references/data-fetching.md +482 -482
- package/skills/frontend/nextjs-developer/references/deployment.md +545 -545
- package/skills/frontend/nextjs-developer/references/server-actions.md +462 -462
- package/skills/frontend/nextjs-developer/references/server-components.md +384 -384
- package/skills/frontend/react-expert/SKILL.md +149 -149
- package/skills/frontend/react-expert/references/hooks-patterns.md +162 -162
- package/skills/frontend/react-expert/references/migration-class-to-modern.md +1119 -1119
- package/skills/frontend/react-expert/references/performance.md +168 -168
- package/skills/frontend/react-expert/references/react-19-features.md +174 -174
- package/skills/frontend/react-expert/references/server-components.md +143 -143
- package/skills/frontend/react-expert/references/state-management.md +171 -171
- package/skills/frontend/react-expert/references/testing-react.md +174 -174
- package/skills/frontend/react-native-expert/SKILL.md +185 -185
- package/skills/frontend/react-native-expert/references/expo-router.md +187 -187
- package/skills/frontend/react-native-expert/references/list-optimization.md +204 -204
- package/skills/frontend/react-native-expert/references/platform-handling.md +188 -188
- package/skills/frontend/react-native-expert/references/project-structure.md +171 -171
- package/skills/frontend/react-native-expert/references/storage-hooks.md +173 -173
- package/skills/frontend/senior-frontend/SKILL.md +477 -477
- package/skills/frontend/senior-frontend/references/frontend_best_practices.md +806 -806
- package/skills/frontend/senior-frontend/references/nextjs_optimization_guide.md +724 -724
- package/skills/frontend/senior-frontend/references/react_patterns.md +746 -746
- package/skills/frontend/senior-frontend/scripts/bundle_analyzer.py +407 -407
- package/skills/frontend/senior-frontend/scripts/component_generator.py +329 -329
- package/skills/frontend/senior-frontend/scripts/frontend_scaffolder.py +1005 -1005
- package/skills/frontend/ui-ux-pro-max/SKILL.md +386 -386
- package/skills/frontend/ui-ux-pro-max/data/charts.csv +26 -26
- package/skills/frontend/ui-ux-pro-max/data/colors.csv +97 -97
- package/skills/frontend/ui-ux-pro-max/data/icons.csv +101 -101
- package/skills/frontend/ui-ux-pro-max/data/landing.csv +31 -31
- package/skills/frontend/ui-ux-pro-max/data/products.csv +96 -96
- package/skills/frontend/ui-ux-pro-max/data/react-performance.csv +45 -45
- package/skills/frontend/ui-ux-pro-max/data/stacks/astro.csv +54 -54
- package/skills/frontend/ui-ux-pro-max/data/stacks/flutter.csv +53 -53
- package/skills/frontend/ui-ux-pro-max/data/stacks/html-tailwind.csv +56 -56
- package/skills/frontend/ui-ux-pro-max/data/stacks/jetpack-compose.csv +53 -53
- package/skills/frontend/ui-ux-pro-max/data/stacks/nextjs.csv +53 -53
- package/skills/frontend/ui-ux-pro-max/data/stacks/nuxt-ui.csv +51 -51
- package/skills/frontend/ui-ux-pro-max/data/stacks/nuxtjs.csv +59 -59
- package/skills/frontend/ui-ux-pro-max/data/stacks/react-native.csv +52 -52
- package/skills/frontend/ui-ux-pro-max/data/stacks/react.csv +54 -54
- package/skills/frontend/ui-ux-pro-max/data/stacks/shadcn.csv +61 -61
- package/skills/frontend/ui-ux-pro-max/data/stacks/svelte.csv +54 -54
- package/skills/frontend/ui-ux-pro-max/data/stacks/swiftui.csv +51 -51
- package/skills/frontend/ui-ux-pro-max/data/stacks/vue.csv +50 -50
- package/skills/frontend/ui-ux-pro-max/data/styles.csv +68 -68
- package/skills/frontend/ui-ux-pro-max/data/typography.csv +57 -57
- package/skills/frontend/ui-ux-pro-max/data/ui-reasoning.csv +101 -101
- package/skills/frontend/ui-ux-pro-max/data/ux-guidelines.csv +99 -99
- package/skills/frontend/ui-ux-pro-max/data/web-interface.csv +31 -31
- package/skills/frontend/ui-ux-pro-max/scripts/core.py +253 -253
- package/skills/frontend/ui-ux-pro-max/scripts/design_system.py +1067 -1067
- package/skills/frontend/ui-ux-pro-max/scripts/search.py +114 -114
- package/skills/frontend/vue-expert/SKILL.md +98 -98
- package/skills/frontend/vue-expert/references/build-tooling.md +480 -480
- package/skills/frontend/vue-expert/references/components.md +448 -448
- package/skills/frontend/vue-expert/references/composition-api.md +299 -299
- package/skills/frontend/vue-expert/references/mobile-hybrid.md +636 -636
- package/skills/frontend/vue-expert/references/nuxt.md +669 -669
- package/skills/frontend/vue-expert/references/state-management.md +449 -449
- package/skills/frontend/vue-expert/references/typescript.md +584 -584
- package/skills/frontend/vue-expert-js/SKILL.md +167 -167
- package/skills/frontend/vue-expert-js/references/component-architecture.md +219 -219
- package/skills/frontend/vue-expert-js/references/composables-patterns.md +183 -183
- package/skills/frontend/vue-expert-js/references/jsdoc-typing.md +535 -535
- package/skills/frontend/vue-expert-js/references/state-management.md +249 -249
- package/skills/frontend/vue-expert-js/references/testing-patterns.md +237 -237
- package/skills/go-rust-cpp/cpp-pro/SKILL.md +115 -115
- package/skills/go-rust-cpp/cpp-pro/references/build-tooling.md +440 -440
- package/skills/go-rust-cpp/cpp-pro/references/concurrency.md +437 -437
- package/skills/go-rust-cpp/cpp-pro/references/memory-performance.md +397 -397
- package/skills/go-rust-cpp/cpp-pro/references/modern-cpp.md +304 -304
- package/skills/go-rust-cpp/cpp-pro/references/templates.md +357 -357
- package/skills/go-rust-cpp/golang-pro/SKILL.md +122 -122
- package/skills/go-rust-cpp/golang-pro/references/concurrency.md +329 -329
- package/skills/go-rust-cpp/golang-pro/references/generics.md +442 -442
- package/skills/go-rust-cpp/golang-pro/references/interfaces.md +432 -432
- package/skills/go-rust-cpp/golang-pro/references/project-structure.md +477 -477
- package/skills/go-rust-cpp/golang-pro/references/testing.md +451 -451
- package/skills/go-rust-cpp/rust-engineer/SKILL.md +167 -167
- package/skills/go-rust-cpp/rust-engineer/references/async.md +458 -458
- package/skills/go-rust-cpp/rust-engineer/references/error-handling.md +334 -334
- package/skills/go-rust-cpp/rust-engineer/references/ownership.md +278 -278
- package/skills/go-rust-cpp/rust-engineer/references/testing.md +470 -470
- package/skills/go-rust-cpp/rust-engineer/references/traits.md +413 -413
- package/skills/infra/cli-developer/SKILL.md +113 -113
- package/skills/infra/cli-developer/references/design-patterns.md +221 -221
- package/skills/infra/cli-developer/references/go-cli.md +540 -540
- package/skills/infra/cli-developer/references/node-cli.md +383 -383
- package/skills/infra/cli-developer/references/python-cli.md +422 -422
- package/skills/infra/cli-developer/references/ux-patterns.md +448 -448
- package/skills/infra/cloud-architect/SKILL.md +216 -216
- package/skills/infra/cloud-architect/references/aws.md +394 -394
- package/skills/infra/cloud-architect/references/azure.md +562 -562
- package/skills/infra/cloud-architect/references/cost.md +582 -582
- package/skills/infra/cloud-architect/references/gcp.md +633 -633
- package/skills/infra/cloud-architect/references/multi-cloud.md +483 -483
- package/skills/infra/devops-engineer/SKILL.md +144 -144
- package/skills/infra/devops-engineer/references/deployment-strategies.md +241 -241
- package/skills/infra/devops-engineer/references/docker-patterns.md +113 -113
- package/skills/infra/devops-engineer/references/github-actions.md +139 -139
- package/skills/infra/devops-engineer/references/incident-response.md +331 -331
- package/skills/infra/devops-engineer/references/kubernetes.md +154 -154
- package/skills/infra/devops-engineer/references/platform-engineering.md +417 -417
- package/skills/infra/devops-engineer/references/release-automation.md +527 -527
- package/skills/infra/devops-engineer/references/terraform-iac.md +141 -141
- package/skills/infra/kubernetes-specialist/SKILL.md +241 -241
- package/skills/infra/kubernetes-specialist/references/configuration.md +452 -452
- package/skills/infra/kubernetes-specialist/references/cost-optimization.md +458 -458
- package/skills/infra/kubernetes-specialist/references/custom-operators.md +563 -563
- package/skills/infra/kubernetes-specialist/references/gitops.md +530 -530
- package/skills/infra/kubernetes-specialist/references/helm-charts.md +912 -912
- package/skills/infra/kubernetes-specialist/references/multi-cluster.md +507 -507
- package/skills/infra/kubernetes-specialist/references/networking.md +447 -447
- package/skills/infra/kubernetes-specialist/references/service-mesh.md +459 -459
- package/skills/infra/kubernetes-specialist/references/storage.md +535 -535
- package/skills/infra/kubernetes-specialist/references/troubleshooting.md +414 -414
- package/skills/infra/kubernetes-specialist/references/workloads.md +377 -377
- package/skills/infra/mcp-developer/SKILL.md +143 -143
- package/skills/infra/mcp-developer/references/protocol.md +244 -244
- package/skills/infra/mcp-developer/references/python-sdk.md +367 -367
- package/skills/infra/mcp-developer/references/resources.md +554 -554
- package/skills/infra/mcp-developer/references/tools.md +480 -480
- package/skills/infra/mcp-developer/references/typescript-sdk.md +350 -350
- package/skills/infra/monitoring-expert/SKILL.md +176 -176
- package/skills/infra/monitoring-expert/references/alerting-rules.md +141 -141
- package/skills/infra/monitoring-expert/references/application-profiling.md +331 -331
- package/skills/infra/monitoring-expert/references/capacity-planning.md +344 -344
- package/skills/infra/monitoring-expert/references/dashboards.md +126 -126
- package/skills/infra/monitoring-expert/references/opentelemetry.md +123 -123
- package/skills/infra/monitoring-expert/references/performance-testing.md +269 -269
- package/skills/infra/monitoring-expert/references/prometheus-metrics.md +136 -136
- package/skills/infra/monitoring-expert/references/structured-logging.md +142 -142
- package/skills/infra/sre-engineer/SKILL.md +181 -181
- package/skills/infra/sre-engineer/references/automation-toil.md +492 -492
- package/skills/infra/sre-engineer/references/error-budget-policy.md +334 -334
- package/skills/infra/sre-engineer/references/incident-chaos.md +576 -576
- package/skills/infra/sre-engineer/references/monitoring-alerting.md +424 -424
- package/skills/infra/sre-engineer/references/slo-sli-management.md +238 -238
- package/skills/infra/terraform-engineer/SKILL.md +143 -143
- package/skills/infra/terraform-engineer/references/best-practices.md +583 -583
- package/skills/infra/terraform-engineer/references/module-patterns.md +297 -297
- package/skills/infra/terraform-engineer/references/providers.md +452 -452
- package/skills/infra/terraform-engineer/references/state-management.md +371 -371
- package/skills/infra/terraform-engineer/references/testing.md +486 -486
- package/skills/infra/websocket-engineer/SKILL.md +168 -168
- package/skills/infra/websocket-engineer/references/alternatives.md +391 -391
- package/skills/infra/websocket-engineer/references/patterns.md +400 -400
- package/skills/infra/websocket-engineer/references/protocol.md +195 -195
- package/skills/infra/websocket-engineer/references/scaling.md +333 -333
- package/skills/infra/websocket-engineer/references/security.md +474 -474
- package/skills/java/java-architect/SKILL.md +132 -132
- package/skills/java/java-architect/references/jpa-optimization.md +393 -393
- package/skills/java/java-architect/references/reactive-webflux.md +356 -356
- package/skills/java/java-architect/references/spring-boot-setup.md +269 -269
- package/skills/java/java-architect/references/spring-security.md +445 -445
- package/skills/java/java-architect/references/testing-patterns.md +500 -500
- package/skills/java/kotlin-specialist/SKILL.md +147 -147
- package/skills/java/kotlin-specialist/references/android-compose.md +419 -419
- package/skills/java/kotlin-specialist/references/coroutines-flow.md +276 -276
- package/skills/java/kotlin-specialist/references/dsl-idioms.md +421 -421
- package/skills/java/kotlin-specialist/references/ktor-server.md +426 -426
- package/skills/java/kotlin-specialist/references/multiplatform-kmp.md +380 -380
- package/skills/java/spring-boot-engineer/SKILL.md +195 -195
- package/skills/java/spring-boot-engineer/references/cloud.md +498 -498
- package/skills/java/spring-boot-engineer/references/data.md +381 -381
- package/skills/java/spring-boot-engineer/references/security.md +459 -459
- package/skills/java/spring-boot-engineer/references/testing.md +545 -545
- package/skills/java/spring-boot-engineer/references/web.md +295 -295
- package/skills/javascript/javascript-pro/SKILL.md +132 -132
- package/skills/javascript/javascript-pro/references/async-patterns.md +334 -334
- package/skills/javascript/javascript-pro/references/browser-apis.md +398 -398
- package/skills/javascript/javascript-pro/references/modern-syntax.md +272 -272
- package/skills/javascript/javascript-pro/references/modules.md +357 -357
- package/skills/javascript/javascript-pro/references/node-essentials.md +471 -471
- package/skills/javascript/nestjs-expert/SKILL.md +206 -206
- package/skills/javascript/nestjs-expert/references/authentication.md +166 -166
- package/skills/javascript/nestjs-expert/references/controllers-routing.md +111 -111
- package/skills/javascript/nestjs-expert/references/dtos-validation.md +153 -153
- package/skills/javascript/nestjs-expert/references/migration-from-express.md +1237 -1237
- package/skills/javascript/nestjs-expert/references/services-di.md +140 -140
- package/skills/javascript/nestjs-expert/references/testing-patterns.md +186 -186
- package/skills/javascript/typescript-pro/SKILL.md +145 -145
- package/skills/javascript/typescript-pro/references/advanced-types.md +259 -259
- package/skills/javascript/typescript-pro/references/configuration.md +445 -445
- package/skills/javascript/typescript-pro/references/patterns.md +484 -484
- package/skills/javascript/typescript-pro/references/type-guards.md +352 -352
- package/skills/javascript/typescript-pro/references/utility-types.md +329 -329
- package/skills/php/laravel-specialist/SKILL.md +262 -262
- package/skills/php/laravel-specialist/references/eloquent.md +351 -351
- package/skills/php/laravel-specialist/references/livewire.md +512 -512
- package/skills/php/laravel-specialist/references/queues.md +423 -423
- package/skills/php/laravel-specialist/references/routing.md +362 -362
- package/skills/php/laravel-specialist/references/testing.md +522 -522
- package/skills/php/php-pro/SKILL.md +206 -206
- package/skills/php/php-pro/references/async-patterns.md +412 -412
- package/skills/php/php-pro/references/laravel-patterns.md +377 -377
- package/skills/php/php-pro/references/modern-php-features.md +323 -323
- package/skills/php/php-pro/references/symfony-patterns.md +466 -466
- package/skills/php/php-pro/references/testing-quality.md +466 -466
- package/skills/product/competitive-analysis/SKILL.md +257 -257
- package/skills/product/meeting-notes/SKILL.md +266 -266
- package/skills/product/prd-template/SKILL.md +150 -150
- package/skills/product/stakeholder-update/SKILL.md +225 -225
- package/skills/product/user-research-synthesis/SKILL.md +235 -235
- package/skills/python/django-expert/SKILL.md +162 -162
- package/skills/python/django-expert/references/authentication.md +145 -145
- package/skills/python/django-expert/references/drf-serializers.md +148 -148
- package/skills/python/django-expert/references/models-orm.md +151 -151
- package/skills/python/django-expert/references/testing-django.md +204 -204
- package/skills/python/django-expert/references/viewsets-views.md +153 -153
- package/skills/python/fastapi-expert/SKILL.md +185 -185
- package/skills/python/fastapi-expert/references/async-sqlalchemy.md +146 -146
- package/skills/python/fastapi-expert/references/authentication.md +159 -159
- package/skills/python/fastapi-expert/references/endpoints-routing.md +142 -142
- package/skills/python/fastapi-expert/references/migration-from-django.md +996 -996
- package/skills/python/fastapi-expert/references/pydantic-v2.md +135 -135
- package/skills/python/fastapi-expert/references/testing-async.md +159 -159
- package/skills/python/pandas-pro/SKILL.md +178 -178
- package/skills/python/pandas-pro/references/aggregation-groupby.md +545 -545
- package/skills/python/pandas-pro/references/data-cleaning.md +500 -500
- package/skills/python/pandas-pro/references/dataframe-operations.md +420 -420
- package/skills/python/pandas-pro/references/merging-joining.md +596 -596
- package/skills/python/pandas-pro/references/performance-optimization.md +597 -597
- package/skills/python/python-pro/SKILL.md +177 -177
- package/skills/python/python-pro/references/async-patterns.md +356 -356
- package/skills/python/python-pro/references/packaging.md +460 -460
- package/skills/python/python-pro/references/standard-library.md +378 -378
- package/skills/python/python-pro/references/testing.md +404 -404
- package/skills/python/python-pro/references/type-system.md +290 -290
- package/skills/quality/chaos-engineer/SKILL.md +182 -182
- package/skills/quality/chaos-engineer/references/chaos-tools.md +511 -511
- package/skills/quality/chaos-engineer/references/experiment-design.md +229 -229
- package/skills/quality/chaos-engineer/references/game-days.md +434 -434
- package/skills/quality/chaos-engineer/references/infrastructure-chaos.md +348 -348
- package/skills/quality/chaos-engineer/references/kubernetes-chaos.md +432 -432
- package/skills/quality/code-reviewer/SKILL.md +119 -119
- package/skills/quality/code-reviewer/references/common-issues.md +142 -142
- package/skills/quality/code-reviewer/references/feedback-examples.md +144 -144
- package/skills/quality/code-reviewer/references/receiving-feedback.md +238 -238
- package/skills/quality/code-reviewer/references/report-template.md +109 -109
- package/skills/quality/code-reviewer/references/review-checklist.md +88 -88
- package/skills/quality/code-reviewer/references/spec-compliance-review.md +258 -258
- package/skills/quality/playwright-expert/SKILL.md +169 -169
- package/skills/quality/playwright-expert/references/api-mocking.md +140 -140
- package/skills/quality/playwright-expert/references/configuration.md +155 -155
- package/skills/quality/playwright-expert/references/debugging-flaky.md +150 -150
- package/skills/quality/playwright-expert/references/page-object-model.md +152 -152
- package/skills/quality/playwright-expert/references/selectors-locators.md +119 -119
- package/skills/quality/secure-code-guardian/SKILL.md +191 -191
- package/skills/quality/secure-code-guardian/references/authentication.md +136 -136
- package/skills/quality/secure-code-guardian/references/input-validation.md +146 -146
- package/skills/quality/secure-code-guardian/references/owasp-prevention.md +135 -135
- package/skills/quality/secure-code-guardian/references/security-headers.md +133 -133
- package/skills/quality/secure-code-guardian/references/xss-csrf.md +157 -157
- package/skills/quality/security-reviewer/SKILL.md +103 -103
- package/skills/quality/security-reviewer/references/infrastructure-security.md +268 -268
- package/skills/quality/security-reviewer/references/penetration-testing.md +268 -268
- package/skills/quality/security-reviewer/references/report-template.md +170 -170
- package/skills/quality/security-reviewer/references/sast-tools.md +117 -117
- package/skills/quality/security-reviewer/references/secret-scanning.md +125 -125
- package/skills/quality/security-reviewer/references/vulnerability-patterns.md +152 -152
- package/skills/quality/senior-qa/README.md +196 -196
- package/skills/quality/senior-qa/SKILL.md +399 -399
- package/skills/quality/senior-qa/references/qa_best_practices.md +964 -964
- package/skills/quality/senior-qa/references/test_automation_patterns.md +1009 -1009
- package/skills/quality/senior-qa/references/testing_strategies.md +649 -649
- package/skills/quality/senior-qa/scripts/coverage_analyzer.py +836 -836
- package/skills/quality/senior-qa/scripts/e2e_test_scaffolder.py +820 -820
- package/skills/quality/senior-qa/scripts/test_suite_generator.py +605 -605
- package/skills/quality/tdd-guide/HOW_TO_USE.md +313 -313
- package/skills/quality/tdd-guide/README.md +680 -680
- package/skills/quality/tdd-guide/SKILL.md +122 -122
- package/skills/quality/tdd-guide/assets/expected_output.json +77 -77
- package/skills/quality/tdd-guide/assets/sample_input_python.json +39 -39
- package/skills/quality/tdd-guide/assets/sample_input_typescript.json +36 -36
- package/skills/quality/tdd-guide/references/ci-integration.md +195 -195
- package/skills/quality/tdd-guide/references/framework-guide.md +206 -206
- package/skills/quality/tdd-guide/references/tdd-best-practices.md +128 -128
- package/skills/quality/tdd-guide/scripts/coverage_analyzer.py +434 -434
- package/skills/quality/tdd-guide/scripts/fixture_generator.py +440 -440
- package/skills/quality/tdd-guide/scripts/format_detector.py +384 -384
- package/skills/quality/tdd-guide/scripts/framework_adapter.py +428 -428
- package/skills/quality/tdd-guide/scripts/metrics_calculator.py +456 -456
- package/skills/quality/tdd-guide/scripts/output_formatter.py +354 -354
- package/skills/quality/tdd-guide/scripts/tdd_workflow.py +474 -474
- package/skills/quality/tdd-guide/scripts/test_generator.py +438 -438
- package/skills/quality/test-master/SKILL.md +94 -94
- package/skills/quality/test-master/references/automation-frameworks.md +294 -294
- package/skills/quality/test-master/references/e2e-testing.md +128 -128
- package/skills/quality/test-master/references/integration-testing.md +120 -120
- package/skills/quality/test-master/references/performance-testing.md +118 -118
- package/skills/quality/test-master/references/qa-methodology.md +247 -247
- package/skills/quality/test-master/references/security-testing.md +127 -127
- package/skills/quality/test-master/references/tdd-iron-laws.md +174 -174
- package/skills/quality/test-master/references/test-reports.md +104 -104
- package/skills/quality/test-master/references/testing-anti-patterns.md +231 -231
- package/skills/quality/test-master/references/unit-testing.md +113 -113
- package/skills/ruby/rails-expert/SKILL.md +154 -154
- package/skills/ruby/rails-expert/references/active-record.md +244 -244
- package/skills/ruby/rails-expert/references/api-development.md +401 -401
- package/skills/ruby/rails-expert/references/background-jobs.md +272 -272
- package/skills/ruby/rails-expert/references/hotwire-turbo.md +228 -228
- package/skills/ruby/rails-expert/references/rspec-testing.md +367 -367
- package/skills/swift/swift-expert/SKILL.md +163 -163
- package/skills/swift/swift-expert/references/async-concurrency.md +360 -360
- package/skills/swift/swift-expert/references/memory-performance.md +377 -377
- package/skills/swift/swift-expert/references/protocol-oriented.md +354 -354
- package/skills/swift/swift-expert/references/swiftui-patterns.md +291 -291
- package/skills/swift/swift-expert/references/testing-patterns.md +399 -399
- package/skills/workflow/brainstorming/SKILL.md +164 -164
- package/skills/workflow/brainstorming/scripts/frame-template.html +214 -214
- package/skills/workflow/brainstorming/scripts/helper.js +88 -88
- package/skills/workflow/brainstorming/scripts/server.cjs +354 -354
- package/skills/workflow/brainstorming/scripts/start-server.sh +148 -148
- package/skills/workflow/brainstorming/scripts/stop-server.sh +56 -56
- package/skills/workflow/brainstorming/spec-document-reviewer-prompt.md +49 -49
- package/skills/workflow/brainstorming/visual-companion.md +287 -287
- package/skills/workflow/documentation/SKILL.md +45 -45
- package/skills/workflow/entropy-management/SKILL.md +115 -115
- package/skills/workflow/executing-plans/SKILL.md +70 -70
- package/skills/workflow/finishing-a-development-branch/SKILL.md +200 -200
- package/skills/workflow/receiving-code-review/SKILL.md +213 -213
- package/skills/workflow/requesting-code-review/SKILL.md +105 -105
- package/skills/workflow/requesting-code-review/code-reviewer.md +146 -146
- package/skills/workflow/requirement-engineering/SKILL.md +111 -111
- package/skills/workflow/systematic-debugging/CREATION-LOG.md +119 -119
- package/skills/workflow/systematic-debugging/SKILL.md +296 -296
- package/skills/workflow/systematic-debugging/condition-based-waiting-example.ts +158 -158
- package/skills/workflow/systematic-debugging/condition-based-waiting.md +115 -115
- package/skills/workflow/systematic-debugging/defense-in-depth.md +122 -122
- package/skills/workflow/systematic-debugging/find-polluter.sh +63 -63
- package/skills/workflow/systematic-debugging/root-cause-tracing.md +169 -169
- package/skills/workflow/systematic-debugging/test-academic.md +14 -14
- package/skills/workflow/systematic-debugging/test-pressure-1.md +58 -58
- package/skills/workflow/systematic-debugging/test-pressure-2.md +68 -68
- package/skills/workflow/systematic-debugging/test-pressure-3.md +69 -69
- package/skills/workflow/using-git-worktrees/SKILL.md +218 -218
- package/skills/workflow/verification-before-completion/SKILL.md +139 -139
- package/skills/workflow/writing-plans/SKILL.md +151 -151
- package/skills/workflow/writing-plans/plan-document-reviewer-prompt.md +49 -49
- package/skills/workflow/writing-skills/SKILL.md +655 -655
- package/skills/workflow/writing-skills/anthropic-best-practices.md +1150 -1150
- package/skills/workflow/writing-skills/examples/CLAUDE_MD_TESTING.md +189 -189
- package/skills/workflow/writing-skills/persuasion-principles.md +187 -187
- package/skills/workflow/writing-skills/render-graphs.js +168 -168
- package/skills/workflow/writing-skills/testing-skills-with-subagents.md +384 -384
@@ -1,474 +1,474 @@
-# Spark SQL and DataFrame API
-
----
-
-## When to Use DataFrames vs RDDs
-
-**Use DataFrames when:**
-- Processing structured or semi-structured data (JSON, Parquet, CSV, Avro)
-- Performing SQL-like operations (joins, aggregations, filters)
-- Need Catalyst optimizer benefits (predicate pushdown, column pruning)
-- Working with columnar formats for better compression
-
-**Use RDDs when:**
-- Need fine-grained control over physical data distribution
-- Working with unstructured data (text processing, custom binary formats)
-- Implementing custom partitioning logic
-- Legacy code migration (prefer DataFrame migration when possible)
-
----
-
-## Schema Definition
-
-### Explicit Schema (Production Required)
-
-```python
-# PySpark - Explicit schema definition
-from pyspark.sql.types import (
-    StructType, StructField, StringType, IntegerType,
-    DoubleType, TimestampType, ArrayType, MapType
-)
-
-# Define schema explicitly - ALWAYS do this in production
-user_schema = StructType([
-    StructField("user_id", StringType(), nullable=False),
-    StructField("name", StringType(), nullable=True),
-    StructField("age", IntegerType(), nullable=True),
-    StructField("email", StringType(), nullable=True),
-    StructField("created_at", TimestampType(), nullable=False),
-    StructField("tags", ArrayType(StringType()), nullable=True),
-    StructField("metadata", MapType(StringType(), StringType()), nullable=True)
-])
-
-# Read with explicit schema - no inference overhead
-df = spark.read.schema(user_schema).json("s3://bucket/users/")
-```
-
-```scala
-// Scala - Explicit schema definition
-import org.apache.spark.sql.types._
-
-val userSchema = StructType(Seq(
-  StructField("user_id", StringType, nullable = false),
-  StructField("name", StringType, nullable = true),
-  StructField("age", IntegerType, nullable = true),
-  StructField("email", StringType, nullable = true),
-  StructField("created_at", TimestampType, nullable = false),
-  StructField("tags", ArrayType(StringType), nullable = true),
-  StructField("metadata", MapType(StringType, StringType), nullable = true)
-))
-
-val df = spark.read.schema(userSchema).json("s3://bucket/users/")
-```
-
-### Schema Inference Pitfalls
-
-```python
-# AVOID in production - causes full data scan
-df = spark.read.json("s3://bucket/users/") # Infers schema - slow!
-
-# If you must infer, sample a small portion
-df = spark.read.option("samplingRatio", 0.01).json("s3://bucket/users/")
-```
-
----
-
-## Column Operations and Expressions
-
-### Built-in Functions (Always Prefer Over UDFs)
-
-```python
-from pyspark.sql import functions as F
-from pyspark.sql.window import Window
-
-# Column transformations - use built-in functions
-df = df.withColumn("name_upper", F.upper(F.col("name")))
-df = df.withColumn("email_domain", F.split(F.col("email"), "@")[1])
-df = df.withColumn("age_group",
-    F.when(F.col("age") < 18, "minor")
-    .when(F.col("age") < 65, "adult")
-    .otherwise("senior")
-)
-
-# Date/time operations
-df = df.withColumn("year", F.year("created_at"))
-df = df.withColumn("date_str", F.date_format("created_at", "yyyy-MM-dd"))
-df = df.withColumn("days_since", F.datediff(F.current_date(), "created_at"))
-
-# Array operations
-df = df.withColumn("first_tag", F.col("tags")[0])
-df = df.withColumn("tag_count", F.size("tags"))
-df = df.withColumn("has_premium", F.array_contains("tags", "premium"))
-
-# Null handling
-df = df.withColumn("name_clean", F.coalesce("name", F.lit("Unknown")))
-df = df.filter(F.col("email").isNotNull())
-```
-
-### Window Functions
-
-```python
-from pyspark.sql.window import Window
-from pyspark.sql import functions as F
-
-# Define window specifications
-user_window = Window.partitionBy("user_id").orderBy("created_at")
-category_window = Window.partitionBy("category")
-
-# Ranking functions
-df = df.withColumn("row_num", F.row_number().over(user_window))
-df = df.withColumn("rank", F.rank().over(user_window))
-df = df.withColumn("dense_rank", F.dense_rank().over(user_window))
-
-# Analytic functions
-df = df.withColumn("prev_value", F.lag("amount", 1).over(user_window))
-df = df.withColumn("next_value", F.lead("amount", 1).over(user_window))
-df = df.withColumn("running_total", F.sum("amount").over(user_window))
-
-# Aggregations over windows
-df = df.withColumn("category_avg", F.avg("amount").over(category_window))
-df = df.withColumn("category_max", F.max("amount").over(category_window))
-
-# Rolling windows
-rolling_7day = Window.partitionBy("user_id") \
-    .orderBy(F.col("created_at").cast("long")) \
-    .rangeBetween(-7*86400, 0) # 7 days in seconds
-
-df = df.withColumn("rolling_7d_sum", F.sum("amount").over(rolling_7day))
-```
-
-```scala
-// Scala window functions
-import org.apache.spark.sql.expressions.Window
-import org.apache.spark.sql.functions._
-
-val userWindow = Window.partitionBy("user_id").orderBy("created_at")
-val categoryWindow = Window.partitionBy("category")
-
-val result = df
-  .withColumn("row_num", row_number().over(userWindow))
-  .withColumn("running_total", sum("amount").over(userWindow))
-  .withColumn("category_avg", avg("amount").over(categoryWindow))
-```
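
The `rolling_7day` specification in the hunk above depends on `rangeBetween` comparing the ordering value (epoch seconds), not row positions. A minimal plain-Python sketch of that frame follows. It assumes hypothetical `(user_id, created_at_epoch, amount)` tuples and needs no Spark session; `rolling_sum` is an illustrative helper, not a Spark API.

```python
from collections import defaultdict

SEVEN_DAYS = 7 * 86400  # same bound as rangeBetween(-7*86400, 0)
DAY = 86400


def rolling_sum(rows, span=SEVEN_DAYS):
    """For each (user_id, ts, amount) row, sum the amounts of the same user
    whose ts lies in [ts - span, ts], mirroring a RANGE frame."""
    by_user = defaultdict(list)
    for user_id, ts, amount in rows:
        by_user[user_id].append((ts, amount))

    result = []
    for user_id, ts, amount in rows:
        total = sum(a for t, a in by_user[user_id] if ts - span <= t <= ts)
        result.append((user_id, ts, amount, total))
    return result


# Hypothetical sample data: timestamps expressed as whole days since epoch.
events = [
    ("u1", 0 * DAY, 10.0),
    ("u1", 3 * DAY, 5.0),
    ("u1", 8 * DAY, 2.0),  # day 0 is more than 7 days back, so it drops out
    ("u2", 1 * DAY, 7.0),
]
rolled = rolling_sum(events)
```

Rows that share a timestamp are peers in a range frame, so each one includes the other's amount; a `rowsBetween` frame would count physical rows instead.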
-
----
-
-## Spark SQL Queries
-
-### Registering DataFrames as Views
-
-```python
-# Temporary view - session scoped
-df.createOrReplaceTempView("users")
-
-# Global temporary view - application scoped
-df.createOrReplaceGlobalTempView("users")
-# Access via: global_temp.users
-
-# Execute SQL
-result = spark.sql("""
-    SELECT
-        user_id,
-        name,
-        COUNT(*) as order_count,
-        SUM(amount) as total_spent
-    FROM users u
-    JOIN orders o ON u.user_id = o.user_id
-    WHERE u.created_at >= '2024-01-01'
-    GROUP BY user_id, name
-    HAVING total_spent > 1000
-    ORDER BY total_spent DESC
-""")
-```
-
-### CTEs and Subqueries
-
-```python
-result = spark.sql("""
-    WITH user_stats AS (
-        SELECT
-            user_id,
-            COUNT(*) as order_count,
-            SUM(amount) as total_spent,
-            AVG(amount) as avg_order
-        FROM orders
-        WHERE order_date >= '2024-01-01'
-        GROUP BY user_id
-    ),
-    ranked_users AS (
-        SELECT
-            *,
-            PERCENT_RANK() OVER (ORDER BY total_spent) as spend_percentile
-        FROM user_stats
-    )
-    SELECT *
-    FROM ranked_users
-    WHERE spend_percentile >= 0.9
-""")
-```
-
----
-
-## Join Strategies
-
-### Join Types and When to Use
-
-```python
-# Inner join - matching records only
-result = orders.join(users, orders.user_id == users.user_id, "inner")
-
-# Left outer - all from left, matching from right
-result = orders.join(users, "user_id", "left")
-
-# Right outer - all from right, matching from left
-result = orders.join(users, "user_id", "right")
-
-# Full outer - all records from both
-result = orders.join(users, "user_id", "full")
-
-# Left anti - records in left NOT in right
-new_users = all_users.join(existing_users, "user_id", "left_anti")
-
-# Left semi - records in left that have match in right (no columns from right)
-active_users = users.join(orders, "user_id", "left_semi")
-
-# Cross join - cartesian product (use carefully!)
-result = df1.crossJoin(df2)
-```
-
-### Broadcast Join (Small Table Optimization)
-
-```python
-from pyspark.sql.functions import broadcast
-
-# Explicit broadcast hint - join small table to large table
-# Broadcasts entire small_df to all executors (must fit in memory)
-result = large_df.join(broadcast(small_df), "join_key")
-
-# Auto broadcast threshold (default 10MB)
-spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 200 * 1024 * 1024) # 200MB
-
-# Disable auto broadcast for specific query
-spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
-```
-
-**Spark UI Check:** In SQL tab, look for "BroadcastHashJoin" vs "SortMergeJoin". Broadcast should show quick exchange, while sort-merge shows shuffle.
-
-### Handling Skewed Joins (Spark 3.x AQE)
-
-```python
-# Enable Adaptive Query Execution (Spark 3.0+)
-spark.conf.set("spark.sql.adaptive.enabled", "true")
-spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
-spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", 5)
-spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
-
-# Manual skew handling with salting
-from pyspark.sql.functions import monotonically_increasing_id, explode, array, lit
-
-# Add salt to skewed key in large table
-salt_count = 10
-large_df_salted = large_df.withColumn(
-    "join_key_salted",
-    F.concat(F.col("join_key"), F.lit("_"), (F.monotonically_increasing_id() % salt_count).cast("string"))
-)
-
-# Explode small table to match salted keys
-small_df_exploded = small_df.withColumn(
-    "salt", F.explode(F.array([F.lit(i) for i in range(salt_count)]))
-).withColumn(
-    "join_key_salted",
-    F.concat(F.col("join_key"), F.lit("_"), F.col("salt").cast("string"))
-)
-
-# Join on salted key
-result = large_df_salted.join(small_df_exploded, "join_key_salted")
-```
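
The salting steps in the hunk above spread one hot key across several join keys while leaving the join result unchanged. The sketch below checks that property in plain Python on lists of dicts, with no Spark session. The helpers `hash_join` and `salted_join` and the sample rows are hypothetical, and the salt is derived from the row position as a stand-in for `monotonically_increasing_id()`.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner join of two lists of dicts on `key`."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def salted_join(large, small, key, salt_count):
    # Large side: append a salt derived from the row position.
    large_salted = [
        {**row, "salted": f"{row[key]}_{i % salt_count}"}
        for i, row in enumerate(large)
    ]
    # Small side: replicate every row once per possible salt value.
    small_exploded = [
        {**row, "salted": f"{row[key]}_{s}"}
        for row in small
        for s in range(salt_count)
    ]
    joined = hash_join(large_salted, small_exploded, "salted")
    # Drop the helper column so the output matches an unsalted join.
    return [{k: v for k, v in row.items() if k != "salted"} for row in joined]


# Hypothetical data: "hot" is the skewed key.
large = [{"join_key": "hot", "n": i} for i in range(6)]
large.append({"join_key": "cold", "n": 99})
small = [{"join_key": "hot", "label": "H"}, {"join_key": "cold", "label": "C"}]

plain = hash_join(large, small, "join_key")
salted = salted_join(large, small, "join_key", salt_count=3)
```

The price of salting is that the small side grows by a factor of `salt_count`, so the factor is usually kept modest.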
-
----
-
-## Aggregations
-
-### GroupBy Operations
-
-```python
-from pyspark.sql import functions as F
-
-# Basic aggregations
-stats = df.groupBy("category").agg(
-    F.count("*").alias("count"),
-    F.sum("amount").alias("total"),
-    F.avg("amount").alias("average"),
-    F.min("amount").alias("minimum"),
-    F.max("amount").alias("maximum"),
-    F.stddev("amount").alias("std_dev"),
-    F.countDistinct("user_id").alias("unique_users"),
-    F.collect_list("product_id").alias("products"), # Caution: can OOM
-    F.collect_set("product_id").alias("unique_products")
-)
-
-# Multiple grouping sets (Spark SQL)
-result = spark.sql("""
-    SELECT
-        category,
-        region,
-        SUM(amount) as total
-    FROM sales
-    GROUP BY GROUPING SETS (
-        (category, region),
-        (category),
-        (region),
-        ()
-    )
-""")
-
-# Equivalent with rollup/cube
-rollup_df = df.rollup("category", "region").agg(F.sum("amount"))
-cube_df = df.cube("category", "region").agg(F.sum("amount"))
-```
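
The explicit `GROUPING SETS` list in the hunk above enumerates all four combinations of the two columns. That matches what `cube` produces, whereas `rollup` yields only the hierarchical prefixes and omits `(region)`. A small plain-Python sketch that lists the grouping sets each construct implies; the helper names are illustrative, not Spark APIs.

```python
from itertools import combinations


def rollup_sets(cols):
    """Grouping sets implied by ROLLUP: every prefix of cols, down to ()."""
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]


def cube_sets(cols):
    """Grouping sets implied by CUBE: every subset of cols."""
    return [
        combo
        for n in range(len(cols), -1, -1)
        for combo in combinations(cols, n)
    ]


cols = ["category", "region"]
# The sets written out by hand in the GROUPING SETS query.
explicit = [("category", "region"), ("category",), ("region",), ()]

rollup_result = rollup_sets(cols)
cube_result = cube_sets(cols)
```

With n grouping columns, `rollup` produces n + 1 grouping sets and `cube` produces 2^n, which is worth keeping in mind before cubing many columns.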
-
-### Pivot Tables
-
-```python
-# Pivot - turn row values into columns
-pivot_df = df.groupBy("user_id").pivot("category", ["electronics", "clothing", "food"]) \
-    .agg(F.sum("amount"))
-
-# Result columns: user_id, electronics, clothing, food
-
-# Unpivot (melt) - turn columns into rows
-from pyspark.sql.functions import expr
-
-unpivot_df = pivot_df.select(
-    "user_id",
-    expr("stack(3, 'electronics', electronics, 'clothing', clothing, 'food', food) as (category, amount)")
-).filter("amount is not null")
-```
-
----
-
-## Catalyst Optimizer Tips
-
-### Predicate Pushdown
-
-```python
-# Good - filter pushed down to data source
-df = spark.read.parquet("s3://bucket/data/").filter(F.col("date") == "2024-01-01")
-
-# Check physical plan for PushedFilters
-df.explain(True)
-```
-
-### Column Pruning
-
-```python
-# Good - only read required columns
-df = spark.read.parquet("s3://bucket/data/").select("id", "name", "amount")
-
-# Bad - reads all columns then filters
-df = spark.read.parquet("s3://bucket/data/")
-result = df.select("id", "name", "amount")
-```
-
-### Partition Pruning
-
-```python
-# Data partitioned by date
-# Good - only reads matching partitions
-df = spark.read.parquet("s3://bucket/data/") \
-    .filter(F.col("date").between("2024-01-01", "2024-01-31"))
-
-# Verify partition pruning in Spark UI - Files Read should be reduced
-```
-
----
-
-## Common Anti-Patterns
-
-### Avoid These Patterns
-
-```python
-# BAD: Using Python UDF when built-in exists
-from pyspark.sql.functions import udf
-@udf("string")
-def upper_udf(s):
-    return s.upper() if s else None
-df.withColumn("name", upper_udf("name")) # 10-100x slower!
-
-# GOOD: Use built-in function
-df.withColumn("name", F.upper("name"))
-
-# BAD: Collect large data to driver
-all_data = df.collect() # OOM risk!
-for row in all_data:
-    process(row)
-
-# GOOD: Process distributed or use take/limit
-sample = df.take(100) # Small sample
-df.foreach(process_partition) # Distributed processing
-
-# BAD: Multiple actions triggering recomputation
-count = df.count()
-total = df.agg(F.sum("amount")).collect()
-# Two full scans of data!
-
-# GOOD: Cache if multiple actions needed
-df.cache()
-count = df.count()
-total = df.agg(F.sum("amount")).collect()
-df.unpersist()
-
-# BAD: String column used in filter (case sensitivity issues)
-df.filter(df.status == "ACTIVE") # May miss "active", "Active"
-
-# GOOD: Normalize or use case-insensitive comparison
-df.filter(F.upper("status") == "ACTIVE")
-```
-
----
-
-## Spark UI Analysis for DataFrames
-
-### SQL Tab Metrics to Monitor
-
-1. **Duration** - Long stages indicate optimization opportunities
-2. **Input Size** - Verify partition pruning reduced data read
-3. **Shuffle Write/Read** - Large shuffles suggest join/aggregation issues
-4. **Spill (Memory/Disk)** - Indicates memory pressure, increase executor memory
-
-### Physical Plan Analysis
-
-```python
-# View physical plan
-df.explain(True)
-
-# Look for:
-# - FileScan with PushedFilters (predicate pushdown working)
-# - BroadcastHashJoin vs SortMergeJoin (broadcast optimization)
-# - Exchange (shuffle operations)
-# - WholeStageCodegen (Tungsten optimization active)
-```
-
-### Key Metrics in Stages Tab
-
-| Metric | Healthy Range | Action if High |
-|--------|---------------|----------------|
-| Shuffle Read Size | < 1GB per task | Increase partitions, add filter |
-| Spill (Disk) | 0 | Increase executor memory |
-| GC Time | < 10% of task time | Tune memory fractions |
-| Task Duration Variance | < 2x median | Address data skew |
-
----
-
-## Best Practices Summary
-
-1. **Always define explicit schemas** - No inference in production
-2. **Use built-in functions** - Avoid UDFs when possible
-3. **Broadcast small tables** - Tables under 200MB
-4. **Filter early** - Push filters before joins and aggregations
-5. **Select only needed columns** - Enable column pruning
-6. **Partition by common filter columns** - Enable partition pruning
-7. **Cache strategically** - Only reused DataFrames
-8. **Monitor Spark UI** - Check shuffle, spill, and GC metrics
-9. **Enable AQE in Spark 3.x** - Automatic optimization for skew and partitions
-10. **Test with production data volume** - Performance varies with scale
+# Spark SQL and DataFrame API
+
+---
+
+## When to Use DataFrames vs RDDs
+
+**Use DataFrames when:**
+- Processing structured or semi-structured data (JSON, Parquet, CSV, Avro)
+- Performing SQL-like operations (joins, aggregations, filters)
+- Need Catalyst optimizer benefits (predicate pushdown, column pruning)
+- Working with columnar formats for better compression
+
+**Use RDDs when:**
+- Need fine-grained control over physical data distribution
+- Working with unstructured data (text processing, custom binary formats)
+- Implementing custom partitioning logic
+- Legacy code migration (prefer DataFrame migration when possible)
+
+---
+
+## Schema Definition
+
+### Explicit Schema (Production Required)
+
+```python
+# PySpark - Explicit schema definition
+from pyspark.sql.types import (
+    StructType, StructField, StringType, IntegerType,
+    DoubleType, TimestampType, ArrayType, MapType
+)
+
+# Define schema explicitly - ALWAYS do this in production
+user_schema = StructType([
+    StructField("user_id", StringType(), nullable=False),
+    StructField("name", StringType(), nullable=True),
+    StructField("age", IntegerType(), nullable=True),
+    StructField("email", StringType(), nullable=True),
+    StructField("created_at", TimestampType(), nullable=False),
+    StructField("tags", ArrayType(StringType()), nullable=True),
+    StructField("metadata", MapType(StringType(), StringType()), nullable=True)
+])
+
+# Read with explicit schema - no inference overhead
+df = spark.read.schema(user_schema).json("s3://bucket/users/")
+```
+
+```scala
+// Scala - Explicit schema definition
+import org.apache.spark.sql.types._
+
+val userSchema = StructType(Seq(
+  StructField("user_id", StringType, nullable = false),
+  StructField("name", StringType, nullable = true),
+  StructField("age", IntegerType, nullable = true),
+  StructField("email", StringType, nullable = true),
+  StructField("created_at", TimestampType, nullable = false),
+  StructField("tags", ArrayType(StringType), nullable = true),
+  StructField("metadata", MapType(StringType, StringType), nullable = true)
+))
+
+val df = spark.read.schema(userSchema).json("s3://bucket/users/")
+```
+
+### Schema Inference Pitfalls
+
+```python
+# AVOID in production - causes full data scan
+df = spark.read.json("s3://bucket/users/") # Infers schema - slow!
+
+# If you must infer, sample a small portion
|
|
71
|
+
df = spark.read.option("samplingRatio", 0.01).json("s3://bucket/users/")
|
|
72
|
+
```
|
|
73
|
+
|
|
74
|
+
---
|
|
75
|
+
|
|
76
|
+
## Column Operations and Expressions
|
|
77
|
+
|
|
78
|
+
### Built-in Functions (Always Prefer Over UDFs)
|
|
79
|
+
|
|
80
|
+
```python
|
|
81
|
+
from pyspark.sql import functions as F
|
|
82
|
+
from pyspark.sql.window import Window
|
|
83
|
+
|
|
84
|
+
# Column transformations - use built-in functions
|
|
85
|
+
df = df.withColumn("name_upper", F.upper(F.col("name")))
|
|
86
|
+
df = df.withColumn("email_domain", F.split(F.col("email"), "@")[1])
|
|
87
|
+
df = df.withColumn("age_group",
|
|
88
|
+
F.when(F.col("age") < 18, "minor")
|
|
89
|
+
.when(F.col("age") < 65, "adult")
|
|
90
|
+
.otherwise("senior")
|
|
91
|
+
)
|
|
92
|
+
|
|
93
|
+
# Date/time operations
|
|
94
|
+
df = df.withColumn("year", F.year("created_at"))
|
|
95
|
+
df = df.withColumn("date_str", F.date_format("created_at", "yyyy-MM-dd"))
|
|
96
|
+
df = df.withColumn("days_since", F.datediff(F.current_date(), "created_at"))
|
|
97
|
+
|
|
98
|
+
# Array operations
|
|
99
|
+
df = df.withColumn("first_tag", F.col("tags")[0])
|
|
100
|
+
df = df.withColumn("tag_count", F.size("tags"))
|
|
101
|
+
df = df.withColumn("has_premium", F.array_contains("tags", "premium"))
|
|
102
|
+
|
|
103
|
+
# Null handling
|
|
104
|
+
df = df.withColumn("name_clean", F.coalesce("name", F.lit("Unknown")))
|
|
105
|
+
df = df.filter(F.col("email").isNotNull())
|
|
106
|
+
```
|
|
107
|
+
|
|
108
|
+
### Window Functions
|
|
109
|
+
|
|
110
|
+
```python
|
|
111
|
+
from pyspark.sql.window import Window
|
|
112
|
+
from pyspark.sql import functions as F
|
|
113
|
+
|
|
114
|
+
# Define window specifications
|
|
115
|
+
user_window = Window.partitionBy("user_id").orderBy("created_at")
|
|
116
|
+
category_window = Window.partitionBy("category")
|
|
117
|
+
|
|
118
|
+
# Ranking functions
|
|
119
|
+
df = df.withColumn("row_num", F.row_number().over(user_window))
|
|
120
|
+
df = df.withColumn("rank", F.rank().over(user_window))
|
|
121
|
+
df = df.withColumn("dense_rank", F.dense_rank().over(user_window))
|
|
122
|
+
|
|
123
|
+
# Analytic functions
|
|
124
|
+
df = df.withColumn("prev_value", F.lag("amount", 1).over(user_window))
|
|
125
|
+
df = df.withColumn("next_value", F.lead("amount", 1).over(user_window))
|
|
126
|
+
df = df.withColumn("running_total", F.sum("amount").over(user_window))
|
|
127
|
+
|
|
128
|
+
# Aggregations over windows
|
|
129
|
+
df = df.withColumn("category_avg", F.avg("amount").over(category_window))
|
|
130
|
+
df = df.withColumn("category_max", F.max("amount").over(category_window))
|
|
131
|
+
|
|
132
|
+
# Rolling windows
|
|
133
|
+
rolling_7day = Window.partitionBy("user_id") \
|
|
134
|
+
.orderBy(F.col("created_at").cast("long")) \
|
|
135
|
+
.rangeBetween(-7*86400, 0) # 7 days in seconds
|
|
136
|
+
|
|
137
|
+
df = df.withColumn("rolling_7d_sum", F.sum("amount").over(rolling_7day))
|
|
138
|
+
```
|
|
139
|
+
|
|
140
|
+
```scala
|
|
141
|
+
// Scala window functions
|
|
142
|
+
import org.apache.spark.sql.expressions.Window
|
|
143
|
+
import org.apache.spark.sql.functions._
|
|
144
|
+
|
|
145
|
+
val userWindow = Window.partitionBy("user_id").orderBy("created_at")
|
|
146
|
+
val categoryWindow = Window.partitionBy("category")
|
|
147
|
+
|
|
148
|
+
val result = df
|
|
149
|
+
.withColumn("row_num", row_number().over(userWindow))
|
|
150
|
+
.withColumn("running_total", sum("amount").over(userWindow))
|
|
151
|
+
.withColumn("category_avg", avg("amount").over(categoryWindow))
|
|
152
|
+
```
|
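The `rangeBetween(-7*86400, 0)` frame above is inclusive at both ends, so a row exactly seven days older than the current row still contributes to the sum. The following is a minimal plain-Python model of that frame for a single user, not Spark code; the function name `rolling_sum` and the sample events are hypothetical.

```python
from datetime import datetime, timedelta

def rolling_sum(events, window=timedelta(days=7)):
    """For each (timestamp, amount), sum amounts whose timestamp lies in [t - window, t]."""
    events = sorted(events)
    result = []
    for ts, _ in events:
        # Both bounds are inclusive, mirroring a RANGE frame of (-7 days, 0).
        total = sum(a for t, a in events if ts - window <= t <= ts)
        result.append((ts, total))
    return result

# Hypothetical events for one user_id (the partitionBy step is not modelled here).
events = [
    (datetime(2024, 1, 1), 10.0),
    (datetime(2024, 1, 5), 20.0),
    (datetime(2024, 1, 8), 30.0),
    (datetime(2024, 1, 9), 40.0),
]
for ts, total in rolling_sum(events):
    print(ts.date(), total)
```

The 8 January row still includes the 1 January amount, while the 9 January row no longer does.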
|
153
|
+
|
|
154
|
+
---
|
|
155
|
+
|
|
156
|
+
## Spark SQL Queries
|
|
157
|
+
|
|
158
|
+
### Registering DataFrames as Views
|
|
159
|
+
|
|
160
|
+
```python
|
|
161
|
+
# Temporary view - session scoped
|
|
162
|
+
df.createOrReplaceTempView("users")
|
|
163
|
+
|
|
164
|
+
# Global temporary view - application scoped
|
|
165
|
+
df.createOrReplaceGlobalTempView("users")
|
|
166
|
+
# Access via: global_temp.users
|
|
167
|
+
|
|
168
|
+
# Execute SQL
|
|
169
|
+
result = spark.sql("""
|
|
170
|
+
SELECT
|
|
171
|
+
u.user_id,
|
|
172
|
+
u.name,
|
|
173
|
+
COUNT(*) as order_count,
|
|
174
|
+
SUM(amount) as total_spent
|
|
175
|
+
FROM users u
|
|
176
|
+
JOIN orders o ON u.user_id = o.user_id
|
|
177
|
+
WHERE u.created_at >= '2024-01-01'
|
|
178
|
+
GROUP BY u.user_id, u.name
|
|
179
|
+
HAVING total_spent > 1000
|
|
180
|
+
ORDER BY total_spent DESC
|
|
181
|
+
""")
|
|
182
|
+
```
|
|
183
|
+
|
|
184
|
+
### CTEs and Subqueries
|
|
185
|
+
|
|
186
|
+
```python
|
|
187
|
+
result = spark.sql("""
|
|
188
|
+
WITH user_stats AS (
|
|
189
|
+
SELECT
|
|
190
|
+
user_id,
|
|
191
|
+
COUNT(*) as order_count,
|
|
192
|
+
SUM(amount) as total_spent,
|
|
193
|
+
AVG(amount) as avg_order
|
|
194
|
+
FROM orders
|
|
195
|
+
WHERE order_date >= '2024-01-01'
|
|
196
|
+
GROUP BY user_id
|
|
197
|
+
),
|
|
198
|
+
ranked_users AS (
|
|
199
|
+
SELECT
|
|
200
|
+
*,
|
|
201
|
+
PERCENT_RANK() OVER (ORDER BY total_spent) as spend_percentile
|
|
202
|
+
FROM user_stats
|
|
203
|
+
)
|
|
204
|
+
SELECT *
|
|
205
|
+
FROM ranked_users
|
|
206
|
+
WHERE spend_percentile >= 0.9
|
|
207
|
+
""")
|
|
208
|
+
```
|
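To make the `spend_percentile >= 0.9` filter concrete: `PERCENT_RANK()` is computed as `(rank - 1) / (n - 1)`, where tied values share the lowest rank. The sketch below models that formula in plain Python; the function name and the sample totals are hypothetical.

```python
def percent_rank(values):
    """PERCENT_RANK per value: (rank - 1) / (n - 1); ties share the lowest rank."""
    n = len(values)
    if n <= 1:
        return [0.0 for _ in values]
    ordered = sorted(values)
    # ordered.index(v) is the number of strictly smaller values, i.e. rank - 1.
    return [ordered.index(v) / (n - 1) for v in values]

# Hypothetical total_spent values for five users.
total_spent = [100, 200, 300, 400, 500]
ranks = percent_rank(total_spent)
top_spenders = [v for v, p in zip(total_spent, ranks) if p >= 0.9]
print(ranks, top_spenders)
```

With only five users the ranks move in steps of 0.25, so the 0.9 cutoff keeps just the single highest spender.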
|
209
|
+
|
|
210
|
+
---
|
|
211
|
+
|
|
212
|
+
## Join Strategies
|
|
213
|
+
|
|
214
|
+
### Join Types and When to Use
|
|
215
|
+
|
|
216
|
+
```python
|
|
217
|
+
# Inner join - matching records only
|
|
218
|
+
result = orders.join(users, orders.user_id == users.user_id, "inner")
|
|
219
|
+
|
|
220
|
+
# Left outer - all from left, matching from right
|
|
221
|
+
result = orders.join(users, "user_id", "left")
|
|
222
|
+
|
|
223
|
+
# Right outer - all from right, matching from left
|
|
224
|
+
result = orders.join(users, "user_id", "right")
|
|
225
|
+
|
|
226
|
+
# Full outer - all records from both
|
|
227
|
+
result = orders.join(users, "user_id", "full")
|
|
228
|
+
|
|
229
|
+
# Left anti - records in left NOT in right
|
|
230
|
+
new_users = all_users.join(existing_users, "user_id", "left_anti")
|
|
231
|
+
|
|
232
|
+
# Left semi - records in left that have match in right (no columns from right)
|
|
233
|
+
active_users = users.join(orders, "user_id", "left_semi")
|
|
234
|
+
|
|
235
|
+
# Cross join - cartesian product (use carefully!)
|
|
236
|
+
result = df1.crossJoin(df2)
|
|
237
|
+
```
|
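Left semi and left anti joins are the least familiar of the join types above. As a rough plain-Python model (not Spark code, and ignoring NULL keys, which never match in SQL), they behave like membership filters on the left side; the helper names and sample rows are hypothetical.

```python
def left_semi(left_rows, right_rows, key):
    """Left rows whose key appears on the right; each left row is returned at most once."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] in right_keys]

def left_anti(left_rows, right_rows, key):
    """Left rows whose key does not appear on the right."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] not in right_keys]

users = [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u3"}]
orders = [
    {"user_id": "u1", "amount": 5},
    {"user_id": "u1", "amount": 7},
    {"user_id": "u3", "amount": 2},
]
print(left_semi(users, orders, "user_id"))  # users with at least one order
print(left_anti(users, orders, "user_id"))  # users with no orders
```

Unlike an inner join, the semi join does not duplicate `u1` even though it has two orders, and neither result carries any column from `orders`.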
|
238
|
+
|
|
239
|
+
### Broadcast Join (Small Table Optimization)
|
|
240
|
+
|
|
241
|
+
```python
|
|
242
|
+
from pyspark.sql.functions import broadcast
|
|
243
|
+
|
|
244
|
+
# Explicit broadcast hint - join small table to large table
|
|
245
|
+
# Broadcasts entire small_df to all executors (must fit in memory)
|
|
246
|
+
result = large_df.join(broadcast(small_df), "join_key")
|
|
247
|
+
|
|
248
|
+
# Auto broadcast threshold (default 10MB)
|
|
249
|
+
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 200 * 1024 * 1024) # 200MB
|
|
250
|
+
|
|
251
|
+
# Disable auto broadcast (the setting applies to the whole session until changed back)
|
|
252
|
+
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
|
|
253
|
+
```
|
|
254
|
+
|
|
255
|
+
**Spark UI Check:** In the SQL tab, look for "BroadcastHashJoin" versus "SortMergeJoin". A broadcast join shows a BroadcastExchange on the small side and no shuffle of the large side, while a sort-merge join shuffles (Exchange) and sorts both sides.
|
|
256
|
+
|
|
257
|
+
### Handling Skewed Joins (Spark 3.x AQE)
|
|
258
|
+
|
|
259
|
+
```python
|
|
260
|
+
# Enable Adaptive Query Execution (Spark 3.0+)
|
|
261
|
+
spark.conf.set("spark.sql.adaptive.enabled", "true")
|
|
262
|
+
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
|
|
263
|
+
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", 5)
|
|
264
|
+
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
|
|
265
|
+
|
|
266
|
+
# Manual skew handling with salting
|
|
267
|
+
from pyspark.sql import functions as F
|
|
268
|
+
|
|
269
|
+
# Add salt to skewed key in large table
|
|
270
|
+
salt_count = 10
|
|
271
|
+
large_df_salted = large_df.withColumn(
|
|
272
|
+
"join_key_salted",
|
|
273
|
+
F.concat(F.col("join_key"), F.lit("_"), (F.monotonically_increasing_id() % salt_count).cast("string"))
|
|
274
|
+
)
|
|
275
|
+
|
|
276
|
+
# Explode small table to match salted keys
|
|
277
|
+
small_df_exploded = small_df.withColumn(
|
|
278
|
+
"salt", F.explode(F.array([F.lit(i) for i in range(salt_count)]))
|
|
279
|
+
).withColumn(
|
|
280
|
+
"join_key_salted",
|
|
281
|
+
F.concat(F.col("join_key"), F.lit("_"), F.col("salt").cast("string"))
|
|
282
|
+
)
|
|
283
|
+
|
|
284
|
+
# Join on salted key
|
|
285
|
+
result = large_df_salted.join(small_df_exploded, "join_key_salted")
|
|
286
|
+
```
|
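The effect of salting can be checked without a cluster. The sketch below is a plain-Python model, not Spark code: it salts the large side with a random suffix (rather than `monotonically_increasing_id`, which is an assumption made for simplicity), replicates the small side once per salt, and confirms that the join result is unchanged while the hottest key is split into smaller groups. All names and data are hypothetical.

```python
import random
from collections import Counter

SALT_COUNT = 10  # same illustrative value as salt_count above

def salt_large_side(rows, salt_count, rng):
    """Append a random salt in [0, salt_count) to the key of each (key, value) row."""
    return [(f"{key}_{rng.randrange(salt_count)}", value) for key, value in rows]

def explode_small_side(rows, salt_count):
    """Replicate each (key, value) row once per salt so every salted key can match."""
    return [(f"{key}_{salt}", value) for key, value in rows for salt in range(salt_count)]

def hash_join(left, right):
    """Inner join of two (key, value) lists on key."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in lookup.get(key, [])]

rng = random.Random(0)
large = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]
small = [("hot", "H"), ("cold", "C")]

salted = salt_large_side(large, SALT_COUNT, rng)
joined = hash_join(salted, explode_small_side(small, SALT_COUNT))

rows_per_key_before = Counter(key for key, _ in large)
rows_per_key_after = Counter(key for key, _ in salted)
print(max(rows_per_key_before.values()), max(rows_per_key_after.values()))
```

The cost is that the small side grows by a factor of `SALT_COUNT`, so the salt count is a trade-off between skew relief and replication.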
|
287
|
+
|
|
288
|
+
---
|
|
289
|
+
|
|
290
|
+
## Aggregations
|
|
291
|
+
|
|
292
|
+
### GroupBy Operations
|
|
293
|
+
|
|
294
|
+
```python
|
|
295
|
+
from pyspark.sql import functions as F
|
|
296
|
+
|
|
297
|
+
# Basic aggregations
|
|
298
|
+
stats = df.groupBy("category").agg(
|
|
299
|
+
F.count("*").alias("count"),
|
|
300
|
+
F.sum("amount").alias("total"),
|
|
301
|
+
F.avg("amount").alias("average"),
|
|
302
|
+
F.min("amount").alias("minimum"),
|
|
303
|
+
F.max("amount").alias("maximum"),
|
|
304
|
+
F.stddev("amount").alias("std_dev"),
|
|
305
|
+
F.countDistinct("user_id").alias("unique_users"),
|
|
306
|
+
F.collect_list("product_id").alias("products"), # Caution: can OOM
|
|
307
|
+
F.collect_set("product_id").alias("unique_products")
|
|
308
|
+
)
|
|
309
|
+
|
|
310
|
+
# Multiple grouping sets (Spark SQL)
|
|
311
|
+
result = spark.sql("""
|
|
312
|
+
SELECT
|
|
313
|
+
category,
|
|
314
|
+
region,
|
|
315
|
+
SUM(amount) as total
|
|
316
|
+
FROM sales
|
|
317
|
+
GROUP BY GROUPING SETS (
|
|
318
|
+
(category, region),
|
|
319
|
+
(category),
|
|
320
|
+
(region),
|
|
321
|
+
()
|
|
322
|
+
)
|
|
323
|
+
""")
|
|
324
|
+
|
|
325
|
+
# DataFrame API: cube() matches the four grouping sets above; rollup() omits (region)
|
|
326
|
+
rollup_df = df.rollup("category", "region").agg(F.sum("amount"))
|
|
327
|
+
cube_df = df.cube("category", "region").agg(F.sum("amount"))
|
|
328
|
+
```
|
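The difference between `rollup` and `cube` comes down to which grouping sets each one expands to. A small plain-Python sketch (helper names are hypothetical) that enumerates them:

```python
from itertools import combinations

def rollup_sets(cols):
    """Grouping sets produced by ROLLUP: every prefix of cols, longest first."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    """Grouping sets produced by CUBE: every subset of cols, largest first."""
    return [
        subset
        for size in range(len(cols), -1, -1)
        for subset in combinations(cols, size)
    ]

cols = ["category", "region"]
print(rollup_sets(cols))
print(cube_sets(cols))
```

For two columns, `cube` yields the same four sets as the `GROUPING SETS` query above, while `rollup` yields three because it never groups by `region` alone.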
|
329
|
+
|
|
330
|
+
### Pivot Tables
|
|
331
|
+
|
|
332
|
+
```python
|
|
333
|
+
# Pivot - turn row values into columns
|
|
334
|
+
pivot_df = df.groupBy("user_id").pivot("category", ["electronics", "clothing", "food"]) \
|
|
335
|
+
.agg(F.sum("amount"))
|
|
336
|
+
|
|
337
|
+
# Result columns: user_id, electronics, clothing, food
|
|
338
|
+
|
|
339
|
+
# Unpivot (melt) - turn columns into rows
|
|
340
|
+
from pyspark.sql.functions import expr
|
|
341
|
+
|
|
342
|
+
unpivot_df = pivot_df.select(
|
|
343
|
+
"user_id",
|
|
344
|
+
expr("stack(3, 'electronics', electronics, 'clothing', clothing, 'food', food) as (category, amount)")
|
|
345
|
+
).filter("amount is not null")
|
|
346
|
+
```
|
|
347
|
+
|
|
348
|
+
---
|
|
349
|
+
|
|
350
|
+
## Catalyst Optimizer Tips
|
|
351
|
+
|
|
352
|
+
### Predicate Pushdown
|
|
353
|
+
|
|
354
|
+
```python
|
|
355
|
+
# Good - filter pushed down to data source
|
|
356
|
+
df = spark.read.parquet("s3://bucket/data/").filter(F.col("date") == "2024-01-01")
|
|
357
|
+
|
|
358
|
+
# Check physical plan for PushedFilters
|
|
359
|
+
df.explain(True)
|
|
360
|
+
```
|
|
361
|
+
|
|
362
|
+
### Column Pruning
|
|
363
|
+
|
|
364
|
+
```python
|
|
365
|
+
# Good - only read required columns
|
|
366
|
+
df = spark.read.parquet("s3://bucket/data/").select("id", "name", "amount")
|
|
367
|
+
|
|
368
|
+
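# Same plan once optimized - Spark is lazy, so Catalyst pushes this later select into the scan; selecting early mainly keeps intent explicit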
|
|
369
|
+
df = spark.read.parquet("s3://bucket/data/")
|
|
370
|
+
result = df.select("id", "name", "amount")
|
|
371
|
+
```
|
|
372
|
+
|
|
373
|
+
### Partition Pruning
|
|
374
|
+
|
|
375
|
+
```python
|
|
376
|
+
# Data partitioned by date
|
|
377
|
+
# Good - only reads matching partitions
|
|
378
|
+
df = spark.read.parquet("s3://bucket/data/") \
|
|
379
|
+
.filter(F.col("date").between("2024-01-01", "2024-01-31"))
|
|
380
|
+
|
|
381
|
+
# Verify partition pruning in Spark UI - Files Read should be reduced
|
|
382
|
+
```
|
|
383
|
+
|
|
384
|
+
---
|
|
385
|
+
|
|
386
|
+
## Common Anti-Patterns
|
|
387
|
+
|
|
388
|
+
### Avoid These Patterns
|
|
389
|
+
|
|
390
|
+
```python
|
|
391
|
+
# BAD: Using Python UDF when built-in exists
|
|
392
|
+
from pyspark.sql.functions import udf
|
|
393
|
+
@udf("string")
|
|
394
|
+
def upper_udf(s):
|
|
395
|
+
    return s.upper() if s else None
|
|
396
|
+
df.withColumn("name", upper_udf("name"))  # Often far slower: rows round-trip through a Python worker and Catalyst cannot optimize the UDF
|
|
397
|
+
|
|
398
|
+
# GOOD: Use built-in function
|
|
399
|
+
df.withColumn("name", F.upper("name"))
|
|
400
|
+
|
|
401
|
+
# BAD: Collect large data to driver
|
|
402
|
+
all_data = df.collect() # OOM risk!
|
|
403
|
+
for row in all_data:
|
|
404
|
+
    process(row)
|
|
405
|
+
|
|
406
|
+
# GOOD: Process distributed or use take/limit
|
|
407
|
+
sample = df.take(100) # Small sample
|
|
408
|
+
df.foreachPartition(process_partition)  # Distributed processing, one call per partition
|
|
409
|
+
|
|
410
|
+
# BAD: Multiple actions triggering recomputation
|
|
411
|
+
count = df.count()
|
|
412
|
+
total = df.agg(F.sum("amount")).collect()
|
|
413
|
+
# Two full scans of data!
|
|
414
|
+
|
|
415
|
+
# GOOD: Cache if multiple actions needed
|
|
416
|
+
df.cache()
|
|
417
|
+
count = df.count()
|
|
418
|
+
total = df.agg(F.sum("amount")).collect()
|
|
419
|
+
df.unpersist()
|
|
420
|
+
|
|
421
|
+
# BAD: String column used in filter (case sensitivity issues)
|
|
422
|
+
df.filter(df.status == "ACTIVE") # May miss "active", "Active"
|
|
423
|
+
|
|
424
|
+
# GOOD: Normalize or use case-insensitive comparison
|
|
425
|
+
df.filter(F.upper("status") == "ACTIVE")
|
|
426
|
+
```
|
|
427
|
+
|
|
428
|
+
---
|
|
429
|
+
|
|
430
|
+
## Spark UI Analysis for DataFrames
|
|
431
|
+
|
|
432
|
+
### SQL Tab Metrics to Monitor
|
|
433
|
+
|
|
434
|
+
1. **Duration** - Long stages indicate optimization opportunities
|
|
435
|
+
2. **Input Size** - Verify partition pruning reduced data read
|
|
436
|
+
3. **Shuffle Write/Read** - Large shuffles suggest join/aggregation issues
|
|
437
|
+
4. **Spill (Memory/Disk)** - Indicates memory pressure; consider increasing executor memory
|
|
438
|
+
|
|
439
|
+
### Physical Plan Analysis
|
|
440
|
+
|
|
441
|
+
```python
|
|
442
|
+
# View physical plan
|
|
443
|
+
df.explain(True)
|
|
444
|
+
|
|
445
|
+
# Look for:
|
|
446
|
+
# - FileScan with PushedFilters (predicate pushdown working)
|
|
447
|
+
# - BroadcastHashJoin vs SortMergeJoin (broadcast optimization)
|
|
448
|
+
# - Exchange (shuffle operations)
|
|
449
|
+
# - WholeStageCodegen (Tungsten optimization active)
|
|
450
|
+
```
|
|
451
|
+
|
|
452
|
+
### Key Metrics in Stages Tab
|
|
453
|
+
|
|
454
|
+
| Metric | Healthy Range | Action if High |
|
|
455
|
+
|--------|---------------|----------------|
|
|
456
|
+
| Shuffle Read Size | < 1GB per task | Increase partitions, add filter |
|
|
457
|
+
| Spill (Disk) | 0 | Increase executor memory |
|
|
458
|
+
| GC Time | < 10% of task time | Tune memory fractions |
|
|
459
|
+
| Task Duration Variance | < 2x median | Address data skew |
|
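The thresholds in the table can be turned into a quick check over numbers copied from the Stages tab. This is a minimal sketch under stated assumptions: the function name and inputs are hypothetical, GC time is compared against total task time for the stage, and "Task Duration Variance" is read as the slowest task versus the median task.

```python
from statistics import median

def stage_warnings(task_durations_s, gc_times_s, spill_disk_bytes):
    """Return the table's suggested actions for every threshold that is exceeded."""
    warnings = []
    if spill_disk_bytes > 0:
        warnings.append("Increase executor memory")
    total_task_time = sum(task_durations_s)
    # Assumption: GC share is measured over the whole stage, not per task.
    if total_task_time > 0 and sum(gc_times_s) / total_task_time >= 0.10:
        warnings.append("Tune memory fractions")
    # Assumption: skew means the slowest task takes at least 2x the median task.
    if task_durations_s and max(task_durations_s) >= 2 * median(task_durations_s):
        warnings.append("Address data skew")
    return warnings

print(stage_warnings([10, 11, 12, 60], [0.5, 0.5, 0.5, 1.0], 0))
```

In this sample one task takes 60 seconds against a median of 11.5, so only the skew warning is raised.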
|
460
|
+
|
|
461
|
+
---
|
|
462
|
+
|
|
463
|
+
## Best Practices Summary
|
|
464
|
+
|
|
465
|
+
1. **Always define explicit schemas** - No inference in production
|
|
466
|
+
2. **Use built-in functions** - Avoid UDFs when possible
|
|
467
|
+
3. **Broadcast small tables** - Tables under 200MB
|
|
468
|
+
4. **Filter early** - Push filters before joins and aggregations
|
|
469
|
+
5. **Select only needed columns** - Enable column pruning
|
|
470
|
+
6. **Partition by common filter columns** - Enable partition pruning
|
|
471
|
+
7. **Cache strategically** - Only reused DataFrames
|
|
472
|
+
8. **Monitor Spark UI** - Check shuffle, spill, and GC metrics
|
|
473
|
+
9. **Enable AQE in Spark 3.x** - Automatic optimization for skew and partitions
|
|
474
|
+
10. **Test with production data volume** - Performance varies with scale
|