aigroup-workflow 2.2.1 → 2.2.2
This diff shows the contents of publicly released package versions from a supported registry. It is provided for informational purposes only and reflects the changes between versions as they appear in their public registries.
- package/.claude/commands/fix-build.md +10 -5
- package/.claude/commands/init-project.md +13 -8
- package/.claude/commands/plan.md +15 -8
- package/.claude/commands/review.md +12 -6
- package/.claude/commands/tdd.md +11 -5
- package/.claude/commands/workflow-start.md +20 -11
- package/.claude/settings.json +28 -0
- package/.codex/agents/architect.toml +207 -0
- package/.codex/agents/build-error-resolver.toml +110 -0
- package/.codex/agents/code-reviewer.toml +233 -0
- package/.codex/agents/doc-updater.toml +103 -0
- package/.codex/agents/e2e-runner.toml +103 -0
- package/.codex/agents/get-current-datetime.toml +23 -0
- package/.codex/agents/init-architect.toml +181 -0
- package/.codex/agents/planner.toml +208 -0
- package/.codex/agents/refactor-cleaner.toml +81 -0
- package/.codex/agents/rust-reviewer.toml +90 -0
- package/.codex/agents/security-reviewer.toml +104 -0
- package/.codex/agents/tdd-guide.toml +87 -0
- package/AGENTS.md +2 -2
- package/CLAUDE.md +23 -1
- package/LICENSE +20 -20
- package/README.md +333 -333
- package/agents/a11y-architect.md +141 -141
- package/agents/architect.md +211 -211
- package/agents/build-error-resolver.md +114 -114
- package/agents/chief-of-staff.md +151 -151
- package/agents/code-architect.md +71 -71
- package/agents/code-explorer.md +69 -69
- package/agents/code-reviewer.md +237 -237
- package/agents/code-simplifier.md +47 -47
- package/agents/comment-analyzer.md +45 -45
- package/agents/conversation-analyzer.md +52 -52
- package/agents/cpp-build-resolver.md +90 -90
- package/agents/cpp-reviewer.md +72 -72
- package/agents/csharp-reviewer.md +101 -101
- package/agents/dart-build-resolver.md +201 -201
- package/agents/database-reviewer.md +91 -91
- package/agents/doc-updater.md +107 -107
- package/agents/docs-lookup.md +68 -68
- package/agents/e2e-runner.md +107 -107
- package/agents/flutter-reviewer.md +243 -243
- package/agents/gan-evaluator.md +209 -209
- package/agents/gan-generator.md +131 -131
- package/agents/gan-planner.md +99 -99
- package/agents/get-current-datetime.md +26 -26
- package/agents/go-build-resolver.md +94 -94
- package/agents/go-reviewer.md +76 -76
- package/agents/harness-optimizer.md +35 -35
- package/agents/healthcare-reviewer.md +83 -83
- package/agents/java-build-resolver.md +153 -153
- package/agents/java-reviewer.md +92 -92
- package/agents/kotlin-build-resolver.md +118 -118
- package/agents/kotlin-reviewer.md +159 -159
- package/agents/loop-operator.md +36 -36
- package/agents/opensource-forker.md +198 -198
- package/agents/opensource-packager.md +249 -249
- package/agents/opensource-sanitizer.md +188 -188
- package/agents/performance-optimizer.md +446 -446
- package/agents/planner.md +212 -212
- package/agents/pr-test-analyzer.md +45 -45
- package/agents/python-reviewer.md +98 -98
- package/agents/pytorch-build-resolver.md +120 -120
- package/agents/refactor-cleaner.md +85 -85
- package/agents/rust-build-resolver.md +148 -148
- package/agents/rust-reviewer.md +94 -94
- package/agents/security-reviewer.md +108 -108
- package/agents/seo-specialist.md +59 -59
- package/agents/silent-failure-hunter.md +50 -50
- package/agents/tdd-guide.md +91 -91
- package/agents/type-design-analyzer.md +41 -41
- package/agents/typescript-reviewer.md +112 -112
- package/cli/commands/update.mjs +1 -1
- package/cli/utils/scaffold.mjs +53 -0
- package/docs/rules/agents.md +166 -50
- package/docs/rules/cpp/coding-style.md +44 -44
- package/docs/rules/cpp/hooks.md +39 -39
- package/docs/rules/cpp/patterns.md +51 -51
- package/docs/rules/cpp/security.md +51 -51
- package/docs/rules/cpp/testing.md +44 -44
- package/docs/rules/csharp/coding-style.md +72 -72
- package/docs/rules/csharp/hooks.md +25 -25
- package/docs/rules/csharp/patterns.md +50 -50
- package/docs/rules/csharp/security.md +58 -58
- package/docs/rules/csharp/testing.md +46 -46
- package/docs/rules/dart/coding-style.md +159 -159
- package/docs/rules/dart/hooks.md +66 -66
- package/docs/rules/dart/patterns.md +261 -261
- package/docs/rules/dart/security.md +135 -135
- package/docs/rules/dart/testing.md +215 -215
- package/docs/rules/golang/coding-style.md +32 -32
- package/docs/rules/golang/hooks.md +17 -17
- package/docs/rules/golang/patterns.md +45 -45
- package/docs/rules/golang/security.md +34 -34
- package/docs/rules/golang/testing.md +31 -31
- package/docs/rules/java/coding-style.md +114 -114
- package/docs/rules/java/hooks.md +18 -18
- package/docs/rules/java/patterns.md +146 -146
- package/docs/rules/java/security.md +100 -100
- package/docs/rules/java/testing.md +131 -131
- package/docs/rules/kotlin/coding-style.md +86 -86
- package/docs/rules/kotlin/hooks.md +17 -17
- package/docs/rules/kotlin/patterns.md +146 -146
- package/docs/rules/kotlin/security.md +82 -82
- package/docs/rules/kotlin/testing.md +128 -128
- package/docs/rules/perl/coding-style.md +46 -46
- package/docs/rules/perl/hooks.md +22 -22
- package/docs/rules/perl/patterns.md +76 -76
- package/docs/rules/perl/security.md +69 -69
- package/docs/rules/perl/testing.md +54 -54
- package/docs/rules/php/coding-style.md +40 -40
- package/docs/rules/php/hooks.md +24 -24
- package/docs/rules/php/patterns.md +33 -33
- package/docs/rules/php/security.md +37 -37
- package/docs/rules/php/testing.md +39 -39
- package/docs/rules/python/coding-style.md +42 -42
- package/docs/rules/python/hooks.md +19 -19
- package/docs/rules/python/patterns.md +39 -39
- package/docs/rules/python/security.md +30 -30
- package/docs/rules/python/testing.md +38 -38
- package/docs/rules/rust/coding-style.md +151 -151
- package/docs/rules/rust/hooks.md +16 -16
- package/docs/rules/rust/patterns.md +168 -168
- package/docs/rules/rust/security.md +141 -141
- package/docs/rules/rust/testing.md +154 -154
- package/docs/rules/swift/coding-style.md +47 -47
- package/docs/rules/swift/hooks.md +20 -20
- package/docs/rules/swift/patterns.md +66 -66
- package/docs/rules/swift/security.md +33 -33
- package/docs/rules/swift/testing.md +45 -45
- package/docs/rules/typescript/coding-style.md +199 -199
- package/docs/rules/typescript/hooks.md +22 -22
- package/docs/rules/typescript/patterns.md +52 -52
- package/docs/rules/typescript/security.md +28 -28
- package/docs/rules/typescript/testing.md +18 -18
- package/docs/rules/web/coding-style.md +96 -96
- package/docs/rules/web/design-quality.md +62 -62
- package/docs/rules/web/hooks.md +120 -120
- package/docs/rules/web/patterns.md +79 -79
- package/docs/rules/web/performance.md +64 -64
- package/docs/rules/web/security.md +57 -57
- package/docs/rules/web/testing.md +55 -55
- package/docs/templates/README.md +36 -36
- package/docs/templates/ai-project-final.md +124 -124
- package/docs/templates/ai-project.md +105 -105
- package/docs/templates/api.md +157 -157
- package/docs/templates/bug.md +62 -62
- package/docs/templates/code-review.md +87 -87
- package/docs/templates/generic.md +116 -116
- package/docs/templates/implementation-plan.md +1 -1
- package/docs/templates/meeting.md +68 -68
- package/docs/templates/prd.md +98 -98
- package/docs/templates/ui.md +134 -134
- package/docs/workflow-pipeline.md +5 -5
- package/package.json +40 -39
- package/skills/SUPERPOWERS-LICENSE +21 -21
- package/skills/ai-ml/fine-tuning-expert/SKILL.md +162 -162
- package/skills/ai-ml/fine-tuning-expert/references/dataset-preparation.md +540 -540
- package/skills/ai-ml/fine-tuning-expert/references/deployment-optimization.md +673 -673
- package/skills/ai-ml/fine-tuning-expert/references/evaluation-metrics.md +597 -597
- package/skills/ai-ml/fine-tuning-expert/references/hyperparameter-tuning.md +565 -565
- package/skills/ai-ml/fine-tuning-expert/references/lora-peft.md +347 -347
- package/skills/ai-ml/ml-pipeline/SKILL.md +159 -159
- package/skills/ai-ml/ml-pipeline/references/experiment-tracking.md +833 -833
- package/skills/ai-ml/ml-pipeline/references/feature-engineering.md +631 -631
- package/skills/ai-ml/ml-pipeline/references/model-validation.md +978 -978
- package/skills/ai-ml/ml-pipeline/references/pipeline-orchestration.md +907 -907
- package/skills/ai-ml/ml-pipeline/references/training-pipelines.md +782 -782
- package/skills/ai-ml/rag-architect/SKILL.md +194 -194
- package/skills/ai-ml/rag-architect/references/chunking-strategies.md +878 -878
- package/skills/ai-ml/rag-architect/references/embedding-models.md +561 -561
- package/skills/ai-ml/rag-architect/references/rag-evaluation.md +833 -833
- package/skills/ai-ml/rag-architect/references/retrieval-optimization.md +795 -795
- package/skills/ai-ml/rag-architect/references/vector-databases.md +589 -589
- package/skills/ai-ml/spark-engineer/SKILL.md +148 -148
- package/skills/ai-ml/spark-engineer/references/partitioning-caching.md +543 -543
- package/skills/ai-ml/spark-engineer/references/performance-tuning.md +544 -544
- package/skills/ai-ml/spark-engineer/references/rdd-operations.md +599 -599
- package/skills/ai-ml/spark-engineer/references/spark-sql-dataframes.md +474 -474
- package/skills/ai-ml/spark-engineer/references/streaming-patterns.md +786 -786
- package/skills/backend/api-designer/SKILL.md +217 -217
- package/skills/backend/api-designer/references/error-handling.md +541 -541
- package/skills/backend/api-designer/references/openapi.md +824 -824
- package/skills/backend/api-designer/references/pagination.md +494 -494
- package/skills/backend/api-designer/references/rest-patterns.md +335 -335
- package/skills/backend/api-designer/references/versioning.md +391 -391
- package/skills/backend/architecture-designer/SKILL.md +117 -117
- package/skills/backend/architecture-designer/references/adr-template.md +116 -116
- package/skills/backend/architecture-designer/references/architecture-patterns.md +111 -111
- package/skills/backend/architecture-designer/references/database-selection.md +102 -102
- package/skills/backend/architecture-designer/references/nfr-checklist.md +112 -112
- package/skills/backend/architecture-designer/references/system-design.md +100 -100
- package/skills/backend/code-documenter/SKILL.md +147 -147
- package/skills/backend/code-documenter/references/api-docs-fastapi-django.md +166 -166
- package/skills/backend/code-documenter/references/api-docs-nestjs-express.md +220 -220
- package/skills/backend/code-documenter/references/coverage-reports.md +125 -125
- package/skills/backend/code-documenter/references/documentation-systems.md +333 -333
- package/skills/backend/code-documenter/references/interactive-api-docs.md +531 -531
- package/skills/backend/code-documenter/references/python-docstrings.md +121 -121
- package/skills/backend/code-documenter/references/typescript-jsdoc.md +145 -145
- package/skills/backend/code-documenter/references/user-guides-tutorials.md +530 -530
- package/skills/backend/debugging-wizard/SKILL.md +105 -105
- package/skills/backend/debugging-wizard/references/common-patterns.md +132 -132
- package/skills/backend/debugging-wizard/references/debugging-tools.md +140 -140
- package/skills/backend/debugging-wizard/references/quick-fixes.md +177 -177
- package/skills/backend/debugging-wizard/references/strategies.md +142 -142
- package/skills/backend/debugging-wizard/references/systematic-debugging.md +367 -367
- package/skills/backend/feature-forge/SKILL.md +98 -98
- package/skills/backend/feature-forge/references/acceptance-criteria.md +104 -104
- package/skills/backend/feature-forge/references/ears-syntax.md +99 -99
- package/skills/backend/feature-forge/references/interview-questions.md +150 -150
- package/skills/backend/feature-forge/references/pre-discovery-subagents.md +54 -54
- package/skills/backend/feature-forge/references/specification-template.md +103 -103
- package/skills/backend/fullstack-guardian/SKILL.md +105 -105
- package/skills/backend/fullstack-guardian/references/api-design-standards.md +307 -307
- package/skills/backend/fullstack-guardian/references/architecture-decisions.md +350 -350
- package/skills/backend/fullstack-guardian/references/backend-patterns.md +237 -237
- package/skills/backend/fullstack-guardian/references/common-patterns.md +134 -134
- package/skills/backend/fullstack-guardian/references/deliverables-checklist.md +354 -354
- package/skills/backend/fullstack-guardian/references/design-template.md +91 -91
- package/skills/backend/fullstack-guardian/references/error-handling.md +135 -135
- package/skills/backend/fullstack-guardian/references/frontend-patterns.md +340 -340
- package/skills/backend/fullstack-guardian/references/integration-patterns.md +333 -333
- package/skills/backend/fullstack-guardian/references/security-checklist.md +106 -106
- package/skills/backend/graphql-architect/SKILL.md +146 -146
- package/skills/backend/graphql-architect/references/federation.md +418 -418
- package/skills/backend/graphql-architect/references/migration-from-rest.md +1141 -1141
- package/skills/backend/graphql-architect/references/resolvers.md +425 -425
- package/skills/backend/graphql-architect/references/schema-design.md +393 -393
- package/skills/backend/graphql-architect/references/security.md +569 -569
- package/skills/backend/graphql-architect/references/subscriptions.md +510 -510
- package/skills/backend/legacy-modernizer/SKILL.md +137 -137
- package/skills/backend/legacy-modernizer/references/legacy-testing.md +381 -381
- package/skills/backend/legacy-modernizer/references/migration-strategies.md +423 -423
- package/skills/backend/legacy-modernizer/references/refactoring-patterns.md +395 -395
- package/skills/backend/legacy-modernizer/references/strangler-fig-pattern.md +281 -281
- package/skills/backend/legacy-modernizer/references/system-assessment.md +487 -487
- package/skills/backend/microservices-architect/SKILL.md +164 -164
- package/skills/backend/microservices-architect/references/communication.md +499 -499
- package/skills/backend/microservices-architect/references/data.md +721 -721
- package/skills/backend/microservices-architect/references/decomposition.md +344 -344
- package/skills/backend/microservices-architect/references/observability.md +805 -805
- package/skills/backend/microservices-architect/references/patterns.md +603 -603
- package/skills/database/database-optimizer/SKILL.md +147 -147
- package/skills/database/database-optimizer/references/index-strategies.md +331 -331
- package/skills/database/database-optimizer/references/monitoring-analysis.md +501 -501
- package/skills/database/database-optimizer/references/mysql-tuning.md +452 -452
- package/skills/database/database-optimizer/references/postgresql-tuning.md +413 -413
- package/skills/database/database-optimizer/references/query-optimization.md +251 -251
- package/skills/database/postgres-pro/SKILL.md +152 -152
- package/skills/database/postgres-pro/references/extensions.md +404 -404
- package/skills/database/postgres-pro/references/jsonb.md +321 -321
- package/skills/database/postgres-pro/references/maintenance.md +481 -481
- package/skills/database/postgres-pro/references/performance.md +265 -265
- package/skills/database/postgres-pro/references/replication.md +446 -446
- package/skills/database/sql-pro/SKILL.md +129 -129
- package/skills/database/sql-pro/references/database-design.md +402 -402
- package/skills/database/sql-pro/references/dialect-differences.md +419 -419
- package/skills/database/sql-pro/references/optimization.md +384 -384
- package/skills/database/sql-pro/references/query-patterns.md +285 -285
- package/skills/database/sql-pro/references/window-functions.md +328 -328
- package/skills/dotnet/csharp-developer/SKILL.md +125 -125
- package/skills/dotnet/csharp-developer/references/aspnet-core.md +394 -394
- package/skills/dotnet/csharp-developer/references/blazor.md +553 -553
- package/skills/dotnet/csharp-developer/references/entity-framework.md +409 -409
- package/skills/dotnet/csharp-developer/references/modern-csharp.md +248 -248
- package/skills/dotnet/csharp-developer/references/performance.md +498 -498
- package/skills/dotnet/dotnet-core-expert/SKILL.md +138 -138
- package/skills/dotnet/dotnet-core-expert/references/authentication.md +546 -546
- package/skills/dotnet/dotnet-core-expert/references/clean-architecture.md +455 -455
- package/skills/dotnet/dotnet-core-expert/references/cloud-native.md +548 -548
- package/skills/dotnet/dotnet-core-expert/references/entity-framework.md +440 -440
- package/skills/dotnet/dotnet-core-expert/references/minimal-apis.md +319 -319
- package/skills/frontend/angular-architect/SKILL.md +152 -152
- package/skills/frontend/angular-architect/references/components.md +297 -297
- package/skills/frontend/angular-architect/references/ngrx.md +401 -401
- package/skills/frontend/angular-architect/references/routing.md +361 -361
- package/skills/frontend/angular-architect/references/rxjs.md +319 -319
- package/skills/frontend/angular-architect/references/testing.md +405 -405
- package/skills/frontend/design-commands/design.md +91 -91
- package/skills/frontend/design-commands/handoff.md +97 -97
- package/skills/frontend/design-commands/prototype.md +120 -120
- package/skills/frontend/design-commands/spec.md +160 -160
- package/skills/frontend/design-commands/style.md +78 -78
- package/skills/frontend/flutter-expert/SKILL.md +138 -138
- package/skills/frontend/flutter-expert/references/bloc-state.md +259 -259
- package/skills/frontend/flutter-expert/references/gorouter-navigation.md +119 -119
- package/skills/frontend/flutter-expert/references/performance.md +99 -99
- package/skills/frontend/flutter-expert/references/project-structure.md +118 -118
- package/skills/frontend/flutter-expert/references/riverpod-state.md +130 -130
- package/skills/frontend/flutter-expert/references/widget-patterns.md +123 -123
- package/skills/frontend/nextjs-developer/SKILL.md +143 -143
- package/skills/frontend/nextjs-developer/references/app-router.md +311 -311
- package/skills/frontend/nextjs-developer/references/data-fetching.md +482 -482
- package/skills/frontend/nextjs-developer/references/deployment.md +545 -545
- package/skills/frontend/nextjs-developer/references/server-actions.md +462 -462
- package/skills/frontend/nextjs-developer/references/server-components.md +384 -384
- package/skills/frontend/react-expert/SKILL.md +149 -149
- package/skills/frontend/react-expert/references/hooks-patterns.md +162 -162
- package/skills/frontend/react-expert/references/migration-class-to-modern.md +1119 -1119
- package/skills/frontend/react-expert/references/performance.md +168 -168
- package/skills/frontend/react-expert/references/react-19-features.md +174 -174
- package/skills/frontend/react-expert/references/server-components.md +143 -143
- package/skills/frontend/react-expert/references/state-management.md +171 -171
- package/skills/frontend/react-expert/references/testing-react.md +174 -174
- package/skills/frontend/react-native-expert/SKILL.md +185 -185
- package/skills/frontend/react-native-expert/references/expo-router.md +187 -187
- package/skills/frontend/react-native-expert/references/list-optimization.md +204 -204
- package/skills/frontend/react-native-expert/references/platform-handling.md +188 -188
- package/skills/frontend/react-native-expert/references/project-structure.md +171 -171
- package/skills/frontend/react-native-expert/references/storage-hooks.md +173 -173
- package/skills/frontend/senior-frontend/SKILL.md +477 -477
- package/skills/frontend/senior-frontend/references/frontend_best_practices.md +806 -806
- package/skills/frontend/senior-frontend/references/nextjs_optimization_guide.md +724 -724
- package/skills/frontend/senior-frontend/references/react_patterns.md +746 -746
- package/skills/frontend/senior-frontend/scripts/bundle_analyzer.py +407 -407
- package/skills/frontend/senior-frontend/scripts/component_generator.py +329 -329
- package/skills/frontend/senior-frontend/scripts/frontend_scaffolder.py +1005 -1005
- package/skills/frontend/ui-ux-pro-max/SKILL.md +386 -386
- package/skills/frontend/ui-ux-pro-max/data/charts.csv +26 -26
- package/skills/frontend/ui-ux-pro-max/data/colors.csv +97 -97
- package/skills/frontend/ui-ux-pro-max/data/icons.csv +101 -101
- package/skills/frontend/ui-ux-pro-max/data/landing.csv +31 -31
- package/skills/frontend/ui-ux-pro-max/data/products.csv +96 -96
- package/skills/frontend/ui-ux-pro-max/data/react-performance.csv +45 -45
- package/skills/frontend/ui-ux-pro-max/data/stacks/astro.csv +54 -54
- package/skills/frontend/ui-ux-pro-max/data/stacks/flutter.csv +53 -53
- package/skills/frontend/ui-ux-pro-max/data/stacks/html-tailwind.csv +56 -56
- package/skills/frontend/ui-ux-pro-max/data/stacks/jetpack-compose.csv +53 -53
- package/skills/frontend/ui-ux-pro-max/data/stacks/nextjs.csv +53 -53
- package/skills/frontend/ui-ux-pro-max/data/stacks/nuxt-ui.csv +51 -51
- package/skills/frontend/ui-ux-pro-max/data/stacks/nuxtjs.csv +59 -59
- package/skills/frontend/ui-ux-pro-max/data/stacks/react-native.csv +52 -52
- package/skills/frontend/ui-ux-pro-max/data/stacks/react.csv +54 -54
- package/skills/frontend/ui-ux-pro-max/data/stacks/shadcn.csv +61 -61
- package/skills/frontend/ui-ux-pro-max/data/stacks/svelte.csv +54 -54
- package/skills/frontend/ui-ux-pro-max/data/stacks/swiftui.csv +51 -51
- package/skills/frontend/ui-ux-pro-max/data/stacks/vue.csv +50 -50
- package/skills/frontend/ui-ux-pro-max/data/styles.csv +68 -68
- package/skills/frontend/ui-ux-pro-max/data/typography.csv +57 -57
- package/skills/frontend/ui-ux-pro-max/data/ui-reasoning.csv +101 -101
- package/skills/frontend/ui-ux-pro-max/data/ux-guidelines.csv +99 -99
- package/skills/frontend/ui-ux-pro-max/data/web-interface.csv +31 -31
- package/skills/frontend/ui-ux-pro-max/scripts/core.py +253 -253
- package/skills/frontend/ui-ux-pro-max/scripts/design_system.py +1067 -1067
- package/skills/frontend/ui-ux-pro-max/scripts/search.py +114 -114
- package/skills/frontend/vue-expert/SKILL.md +98 -98
- package/skills/frontend/vue-expert/references/build-tooling.md +480 -480
- package/skills/frontend/vue-expert/references/components.md +448 -448
- package/skills/frontend/vue-expert/references/composition-api.md +299 -299
- package/skills/frontend/vue-expert/references/mobile-hybrid.md +636 -636
- package/skills/frontend/vue-expert/references/nuxt.md +669 -669
- package/skills/frontend/vue-expert/references/state-management.md +449 -449
- package/skills/frontend/vue-expert/references/typescript.md +584 -584
- package/skills/frontend/vue-expert-js/SKILL.md +167 -167
- package/skills/frontend/vue-expert-js/references/component-architecture.md +219 -219
- package/skills/frontend/vue-expert-js/references/composables-patterns.md +183 -183
- package/skills/frontend/vue-expert-js/references/jsdoc-typing.md +535 -535
- package/skills/frontend/vue-expert-js/references/state-management.md +249 -249
- package/skills/frontend/vue-expert-js/references/testing-patterns.md +237 -237
- package/skills/go-rust-cpp/cpp-pro/SKILL.md +115 -115
- package/skills/go-rust-cpp/cpp-pro/references/build-tooling.md +440 -440
- package/skills/go-rust-cpp/cpp-pro/references/concurrency.md +437 -437
- package/skills/go-rust-cpp/cpp-pro/references/memory-performance.md +397 -397
- package/skills/go-rust-cpp/cpp-pro/references/modern-cpp.md +304 -304
- package/skills/go-rust-cpp/cpp-pro/references/templates.md +357 -357
- package/skills/go-rust-cpp/golang-pro/SKILL.md +122 -122
- package/skills/go-rust-cpp/golang-pro/references/concurrency.md +329 -329
- package/skills/go-rust-cpp/golang-pro/references/generics.md +442 -442
- package/skills/go-rust-cpp/golang-pro/references/interfaces.md +432 -432
- package/skills/go-rust-cpp/golang-pro/references/project-structure.md +477 -477
- package/skills/go-rust-cpp/golang-pro/references/testing.md +451 -451
- package/skills/go-rust-cpp/rust-engineer/SKILL.md +167 -167
- package/skills/go-rust-cpp/rust-engineer/references/async.md +458 -458
- package/skills/go-rust-cpp/rust-engineer/references/error-handling.md +334 -334
- package/skills/go-rust-cpp/rust-engineer/references/ownership.md +278 -278
- package/skills/go-rust-cpp/rust-engineer/references/testing.md +470 -470
- package/skills/go-rust-cpp/rust-engineer/references/traits.md +413 -413
- package/skills/infra/cli-developer/SKILL.md +113 -113
- package/skills/infra/cli-developer/references/design-patterns.md +221 -221
- package/skills/infra/cli-developer/references/go-cli.md +540 -540
- package/skills/infra/cli-developer/references/node-cli.md +383 -383
- package/skills/infra/cli-developer/references/python-cli.md +422 -422
- package/skills/infra/cli-developer/references/ux-patterns.md +448 -448
- package/skills/infra/cloud-architect/SKILL.md +216 -216
- package/skills/infra/cloud-architect/references/aws.md +394 -394
- package/skills/infra/cloud-architect/references/azure.md +562 -562
- package/skills/infra/cloud-architect/references/cost.md +582 -582
- package/skills/infra/cloud-architect/references/gcp.md +633 -633
- package/skills/infra/cloud-architect/references/multi-cloud.md +483 -483
- package/skills/infra/devops-engineer/SKILL.md +144 -144
- package/skills/infra/devops-engineer/references/deployment-strategies.md +241 -241
- package/skills/infra/devops-engineer/references/docker-patterns.md +113 -113
- package/skills/infra/devops-engineer/references/github-actions.md +139 -139
- package/skills/infra/devops-engineer/references/incident-response.md +331 -331
- package/skills/infra/devops-engineer/references/kubernetes.md +154 -154
- package/skills/infra/devops-engineer/references/platform-engineering.md +417 -417
- package/skills/infra/devops-engineer/references/release-automation.md +527 -527
- package/skills/infra/devops-engineer/references/terraform-iac.md +141 -141
- package/skills/infra/kubernetes-specialist/SKILL.md +241 -241
- package/skills/infra/kubernetes-specialist/references/configuration.md +452 -452
- package/skills/infra/kubernetes-specialist/references/cost-optimization.md +458 -458
- package/skills/infra/kubernetes-specialist/references/custom-operators.md +563 -563
- package/skills/infra/kubernetes-specialist/references/gitops.md +530 -530
- package/skills/infra/kubernetes-specialist/references/helm-charts.md +912 -912
- package/skills/infra/kubernetes-specialist/references/multi-cluster.md +507 -507
- package/skills/infra/kubernetes-specialist/references/networking.md +447 -447
- package/skills/infra/kubernetes-specialist/references/service-mesh.md +459 -459
- package/skills/infra/kubernetes-specialist/references/storage.md +535 -535
- package/skills/infra/kubernetes-specialist/references/troubleshooting.md +414 -414
- package/skills/infra/kubernetes-specialist/references/workloads.md +377 -377
- package/skills/infra/mcp-developer/SKILL.md +143 -143
- package/skills/infra/mcp-developer/references/protocol.md +244 -244
- package/skills/infra/mcp-developer/references/python-sdk.md +367 -367
- package/skills/infra/mcp-developer/references/resources.md +554 -554
- package/skills/infra/mcp-developer/references/tools.md +480 -480
- package/skills/infra/mcp-developer/references/typescript-sdk.md +350 -350
- package/skills/infra/monitoring-expert/SKILL.md +176 -176
- package/skills/infra/monitoring-expert/references/alerting-rules.md +141 -141
- package/skills/infra/monitoring-expert/references/application-profiling.md +331 -331
- package/skills/infra/monitoring-expert/references/capacity-planning.md +344 -344
- package/skills/infra/monitoring-expert/references/dashboards.md +126 -126
- package/skills/infra/monitoring-expert/references/opentelemetry.md +123 -123
- package/skills/infra/monitoring-expert/references/performance-testing.md +269 -269
- package/skills/infra/monitoring-expert/references/prometheus-metrics.md +136 -136
- package/skills/infra/monitoring-expert/references/structured-logging.md +142 -142
- package/skills/infra/sre-engineer/SKILL.md +181 -181
- package/skills/infra/sre-engineer/references/automation-toil.md +492 -492
- package/skills/infra/sre-engineer/references/error-budget-policy.md +334 -334
- package/skills/infra/sre-engineer/references/incident-chaos.md +576 -576
- package/skills/infra/sre-engineer/references/monitoring-alerting.md +424 -424
- package/skills/infra/sre-engineer/references/slo-sli-management.md +238 -238
- package/skills/infra/terraform-engineer/SKILL.md +143 -143
- package/skills/infra/terraform-engineer/references/best-practices.md +583 -583
- package/skills/infra/terraform-engineer/references/module-patterns.md +297 -297
- package/skills/infra/terraform-engineer/references/providers.md +452 -452
- package/skills/infra/terraform-engineer/references/state-management.md +371 -371
- package/skills/infra/terraform-engineer/references/testing.md +486 -486
- package/skills/infra/websocket-engineer/SKILL.md +168 -168
- package/skills/infra/websocket-engineer/references/alternatives.md +391 -391
- package/skills/infra/websocket-engineer/references/patterns.md +400 -400
- package/skills/infra/websocket-engineer/references/protocol.md +195 -195
- package/skills/infra/websocket-engineer/references/scaling.md +333 -333
- package/skills/infra/websocket-engineer/references/security.md +474 -474
- package/skills/java/java-architect/SKILL.md +132 -132
- package/skills/java/java-architect/references/jpa-optimization.md +393 -393
- package/skills/java/java-architect/references/reactive-webflux.md +356 -356
- package/skills/java/java-architect/references/spring-boot-setup.md +269 -269
- package/skills/java/java-architect/references/spring-security.md +445 -445
- package/skills/java/java-architect/references/testing-patterns.md +500 -500
- package/skills/java/kotlin-specialist/SKILL.md +147 -147
- package/skills/java/kotlin-specialist/references/android-compose.md +419 -419
- package/skills/java/kotlin-specialist/references/coroutines-flow.md +276 -276
- package/skills/java/kotlin-specialist/references/dsl-idioms.md +421 -421
- package/skills/java/kotlin-specialist/references/ktor-server.md +426 -426
- package/skills/java/kotlin-specialist/references/multiplatform-kmp.md +380 -380
- package/skills/java/spring-boot-engineer/SKILL.md +195 -195
- package/skills/java/spring-boot-engineer/references/cloud.md +498 -498
- package/skills/java/spring-boot-engineer/references/data.md +381 -381
- package/skills/java/spring-boot-engineer/references/security.md +459 -459
- package/skills/java/spring-boot-engineer/references/testing.md +545 -545
- package/skills/java/spring-boot-engineer/references/web.md +295 -295
- package/skills/javascript/javascript-pro/SKILL.md +132 -132
- package/skills/javascript/javascript-pro/references/async-patterns.md +334 -334
- package/skills/javascript/javascript-pro/references/browser-apis.md +398 -398
- package/skills/javascript/javascript-pro/references/modern-syntax.md +272 -272
- package/skills/javascript/javascript-pro/references/modules.md +357 -357
- package/skills/javascript/javascript-pro/references/node-essentials.md +471 -471
- package/skills/javascript/nestjs-expert/SKILL.md +206 -206
- package/skills/javascript/nestjs-expert/references/authentication.md +166 -166
- package/skills/javascript/nestjs-expert/references/controllers-routing.md +111 -111
- package/skills/javascript/nestjs-expert/references/dtos-validation.md +153 -153
- package/skills/javascript/nestjs-expert/references/migration-from-express.md +1237 -1237
- package/skills/javascript/nestjs-expert/references/services-di.md +140 -140
- package/skills/javascript/nestjs-expert/references/testing-patterns.md +186 -186
- package/skills/javascript/typescript-pro/SKILL.md +145 -145
- package/skills/javascript/typescript-pro/references/advanced-types.md +259 -259
- package/skills/javascript/typescript-pro/references/configuration.md +445 -445
- package/skills/javascript/typescript-pro/references/patterns.md +484 -484
- package/skills/javascript/typescript-pro/references/type-guards.md +352 -352
- package/skills/javascript/typescript-pro/references/utility-types.md +329 -329
- package/skills/php/laravel-specialist/SKILL.md +262 -262
- package/skills/php/laravel-specialist/references/eloquent.md +351 -351
- package/skills/php/laravel-specialist/references/livewire.md +512 -512
- package/skills/php/laravel-specialist/references/queues.md +423 -423
- package/skills/php/laravel-specialist/references/routing.md +362 -362
- package/skills/php/laravel-specialist/references/testing.md +522 -522
- package/skills/php/php-pro/SKILL.md +206 -206
- package/skills/php/php-pro/references/async-patterns.md +412 -412
- package/skills/php/php-pro/references/laravel-patterns.md +377 -377
- package/skills/php/php-pro/references/modern-php-features.md +323 -323
- package/skills/php/php-pro/references/symfony-patterns.md +466 -466
- package/skills/php/php-pro/references/testing-quality.md +466 -466
- package/skills/product/competitive-analysis/SKILL.md +257 -257
- package/skills/product/meeting-notes/SKILL.md +266 -266
- package/skills/product/prd-template/SKILL.md +150 -150
- package/skills/product/stakeholder-update/SKILL.md +225 -225
- package/skills/product/user-research-synthesis/SKILL.md +235 -235
- package/skills/python/django-expert/SKILL.md +162 -162
- package/skills/python/django-expert/references/authentication.md +145 -145
- package/skills/python/django-expert/references/drf-serializers.md +148 -148
- package/skills/python/django-expert/references/models-orm.md +151 -151
- package/skills/python/django-expert/references/testing-django.md +204 -204
- package/skills/python/django-expert/references/viewsets-views.md +153 -153
- package/skills/python/fastapi-expert/SKILL.md +185 -185
- package/skills/python/fastapi-expert/references/async-sqlalchemy.md +146 -146
- package/skills/python/fastapi-expert/references/authentication.md +159 -159
- package/skills/python/fastapi-expert/references/endpoints-routing.md +142 -142
- package/skills/python/fastapi-expert/references/migration-from-django.md +996 -996
- package/skills/python/fastapi-expert/references/pydantic-v2.md +135 -135
- package/skills/python/fastapi-expert/references/testing-async.md +159 -159
- package/skills/python/pandas-pro/SKILL.md +178 -178
- package/skills/python/pandas-pro/references/aggregation-groupby.md +545 -545
- package/skills/python/pandas-pro/references/data-cleaning.md +500 -500
- package/skills/python/pandas-pro/references/dataframe-operations.md +420 -420
- package/skills/python/pandas-pro/references/merging-joining.md +596 -596
- package/skills/python/pandas-pro/references/performance-optimization.md +597 -597
- package/skills/python/python-pro/SKILL.md +177 -177
- package/skills/python/python-pro/references/async-patterns.md +356 -356
- package/skills/python/python-pro/references/packaging.md +460 -460
- package/skills/python/python-pro/references/standard-library.md +378 -378
- package/skills/python/python-pro/references/testing.md +404 -404
- package/skills/python/python-pro/references/type-system.md +290 -290
- package/skills/quality/chaos-engineer/SKILL.md +182 -182
- package/skills/quality/chaos-engineer/references/chaos-tools.md +511 -511
- package/skills/quality/chaos-engineer/references/experiment-design.md +229 -229
- package/skills/quality/chaos-engineer/references/game-days.md +434 -434
- package/skills/quality/chaos-engineer/references/infrastructure-chaos.md +348 -348
- package/skills/quality/chaos-engineer/references/kubernetes-chaos.md +432 -432
- package/skills/quality/code-reviewer/SKILL.md +119 -119
- package/skills/quality/code-reviewer/references/common-issues.md +142 -142
- package/skills/quality/code-reviewer/references/feedback-examples.md +144 -144
- package/skills/quality/code-reviewer/references/receiving-feedback.md +238 -238
- package/skills/quality/code-reviewer/references/report-template.md +109 -109
- package/skills/quality/code-reviewer/references/review-checklist.md +88 -88
- package/skills/quality/code-reviewer/references/spec-compliance-review.md +258 -258
- package/skills/quality/playwright-expert/SKILL.md +169 -169
- package/skills/quality/playwright-expert/references/api-mocking.md +140 -140
- package/skills/quality/playwright-expert/references/configuration.md +155 -155
- package/skills/quality/playwright-expert/references/debugging-flaky.md +150 -150
- package/skills/quality/playwright-expert/references/page-object-model.md +152 -152
- package/skills/quality/playwright-expert/references/selectors-locators.md +119 -119
- package/skills/quality/secure-code-guardian/SKILL.md +191 -191
- package/skills/quality/secure-code-guardian/references/authentication.md +136 -136
- package/skills/quality/secure-code-guardian/references/input-validation.md +146 -146
- package/skills/quality/secure-code-guardian/references/owasp-prevention.md +135 -135
- package/skills/quality/secure-code-guardian/references/security-headers.md +133 -133
- package/skills/quality/secure-code-guardian/references/xss-csrf.md +157 -157
- package/skills/quality/security-reviewer/SKILL.md +103 -103
- package/skills/quality/security-reviewer/references/infrastructure-security.md +268 -268
- package/skills/quality/security-reviewer/references/penetration-testing.md +268 -268
- package/skills/quality/security-reviewer/references/report-template.md +170 -170
- package/skills/quality/security-reviewer/references/sast-tools.md +117 -117
- package/skills/quality/security-reviewer/references/secret-scanning.md +125 -125
- package/skills/quality/security-reviewer/references/vulnerability-patterns.md +152 -152
- package/skills/quality/senior-qa/README.md +196 -196
- package/skills/quality/senior-qa/SKILL.md +399 -399
- package/skills/quality/senior-qa/references/qa_best_practices.md +964 -964
- package/skills/quality/senior-qa/references/test_automation_patterns.md +1009 -1009
- package/skills/quality/senior-qa/references/testing_strategies.md +649 -649
- package/skills/quality/senior-qa/scripts/coverage_analyzer.py +836 -836
- package/skills/quality/senior-qa/scripts/e2e_test_scaffolder.py +820 -820
- package/skills/quality/senior-qa/scripts/test_suite_generator.py +605 -605
- package/skills/quality/tdd-guide/HOW_TO_USE.md +313 -313
- package/skills/quality/tdd-guide/README.md +680 -680
- package/skills/quality/tdd-guide/SKILL.md +122 -122
- package/skills/quality/tdd-guide/assets/expected_output.json +77 -77
- package/skills/quality/tdd-guide/assets/sample_input_python.json +39 -39
- package/skills/quality/tdd-guide/assets/sample_input_typescript.json +36 -36
- package/skills/quality/tdd-guide/references/ci-integration.md +195 -195
- package/skills/quality/tdd-guide/references/framework-guide.md +206 -206
- package/skills/quality/tdd-guide/references/tdd-best-practices.md +128 -128
- package/skills/quality/tdd-guide/scripts/coverage_analyzer.py +434 -434
- package/skills/quality/tdd-guide/scripts/fixture_generator.py +440 -440
- package/skills/quality/tdd-guide/scripts/format_detector.py +384 -384
- package/skills/quality/tdd-guide/scripts/framework_adapter.py +428 -428
- package/skills/quality/tdd-guide/scripts/metrics_calculator.py +456 -456
- package/skills/quality/tdd-guide/scripts/output_formatter.py +354 -354
- package/skills/quality/tdd-guide/scripts/tdd_workflow.py +474 -474
- package/skills/quality/tdd-guide/scripts/test_generator.py +438 -438
- package/skills/quality/test-master/SKILL.md +94 -94
- package/skills/quality/test-master/references/automation-frameworks.md +294 -294
- package/skills/quality/test-master/references/e2e-testing.md +128 -128
- package/skills/quality/test-master/references/integration-testing.md +120 -120
- package/skills/quality/test-master/references/performance-testing.md +118 -118
- package/skills/quality/test-master/references/qa-methodology.md +247 -247
- package/skills/quality/test-master/references/security-testing.md +127 -127
- package/skills/quality/test-master/references/tdd-iron-laws.md +174 -174
- package/skills/quality/test-master/references/test-reports.md +104 -104
- package/skills/quality/test-master/references/testing-anti-patterns.md +231 -231
- package/skills/quality/test-master/references/unit-testing.md +113 -113
- package/skills/ruby/rails-expert/SKILL.md +154 -154
- package/skills/ruby/rails-expert/references/active-record.md +244 -244
- package/skills/ruby/rails-expert/references/api-development.md +401 -401
- package/skills/ruby/rails-expert/references/background-jobs.md +272 -272
- package/skills/ruby/rails-expert/references/hotwire-turbo.md +228 -228
- package/skills/ruby/rails-expert/references/rspec-testing.md +367 -367
- package/skills/swift/swift-expert/SKILL.md +163 -163
- package/skills/swift/swift-expert/references/async-concurrency.md +360 -360
- package/skills/swift/swift-expert/references/memory-performance.md +377 -377
- package/skills/swift/swift-expert/references/protocol-oriented.md +354 -354
- package/skills/swift/swift-expert/references/swiftui-patterns.md +291 -291
- package/skills/swift/swift-expert/references/testing-patterns.md +399 -399
- package/skills/workflow/brainstorming/SKILL.md +164 -164
- package/skills/workflow/brainstorming/scripts/frame-template.html +214 -214
- package/skills/workflow/brainstorming/scripts/helper.js +88 -88
- package/skills/workflow/brainstorming/scripts/server.cjs +354 -354
- package/skills/workflow/brainstorming/scripts/start-server.sh +148 -148
- package/skills/workflow/brainstorming/scripts/stop-server.sh +56 -56
- package/skills/workflow/brainstorming/spec-document-reviewer-prompt.md +49 -49
- package/skills/workflow/brainstorming/visual-companion.md +287 -287
- package/skills/workflow/documentation/SKILL.md +45 -45
- package/skills/workflow/entropy-management/SKILL.md +115 -115
- package/skills/workflow/executing-plans/SKILL.md +70 -70
- package/skills/workflow/finishing-a-development-branch/SKILL.md +200 -200
- package/skills/workflow/receiving-code-review/SKILL.md +213 -213
- package/skills/workflow/requesting-code-review/SKILL.md +105 -105
- package/skills/workflow/requesting-code-review/code-reviewer.md +146 -146
- package/skills/workflow/requirement-engineering/SKILL.md +111 -111
- package/skills/workflow/systematic-debugging/CREATION-LOG.md +119 -119
- package/skills/workflow/systematic-debugging/SKILL.md +296 -296
- package/skills/workflow/systematic-debugging/condition-based-waiting-example.ts +158 -158
- package/skills/workflow/systematic-debugging/condition-based-waiting.md +115 -115
- package/skills/workflow/systematic-debugging/defense-in-depth.md +122 -122
- package/skills/workflow/systematic-debugging/find-polluter.sh +63 -63
- package/skills/workflow/systematic-debugging/root-cause-tracing.md +169 -169
- package/skills/workflow/systematic-debugging/test-academic.md +14 -14
- package/skills/workflow/systematic-debugging/test-pressure-1.md +58 -58
- package/skills/workflow/systematic-debugging/test-pressure-2.md +68 -68
- package/skills/workflow/systematic-debugging/test-pressure-3.md +69 -69
- package/skills/workflow/using-git-worktrees/SKILL.md +218 -218
- package/skills/workflow/verification-before-completion/SKILL.md +139 -139
- package/skills/workflow/writing-plans/SKILL.md +151 -151
- package/skills/workflow/writing-plans/plan-document-reviewer-prompt.md +49 -49
- package/skills/workflow/writing-skills/SKILL.md +655 -655
- package/skills/workflow/writing-skills/anthropic-best-practices.md +1150 -1150
- package/skills/workflow/writing-skills/examples/CLAUDE_MD_TESTING.md +189 -189
- package/skills/workflow/writing-skills/persuasion-principles.md +187 -187
- package/skills/workflow/writing-skills/render-graphs.js +168 -168
- package/skills/workflow/writing-skills/testing-skills-with-subagents.md +384 -384
@@ -1,599 +1,599 @@
-# RDD Operations
-
----
-
-## When to Use RDDs
-
-**Use RDDs when:**
-- Processing unstructured data (raw text, custom binary formats)
-- Need fine-grained control over physical data placement
-- Implementing custom partitioning logic for specific access patterns
-- Working with legacy Spark code that needs maintenance
-- Building custom data structures not expressible as DataFrames
-
-**Prefer DataFrames when:**
-- Processing structured/semi-structured data
-- Performing SQL-like operations
-- Need Catalyst optimizer benefits
-- Working with standard file formats (Parquet, JSON, ORC)
-
----
-
-## RDD Creation
-
-### From Collections
-
-```python
-# PySpark - Create RDD from Python collection
-data = [1, 2, 3, 4, 5]
-rdd = spark.sparkContext.parallelize(data, numSlices=4)  # 4 partitions
-
-# From key-value pairs
-pairs = [("a", 1), ("b", 2), ("c", 3)]
-pair_rdd = spark.sparkContext.parallelize(pairs)
-```
-
-```scala
-// Scala - Create RDD from collection
-val data = Seq(1, 2, 3, 4, 5)
-val rdd = spark.sparkContext.parallelize(data, numSlices = 4)
-
-// From key-value pairs
-val pairs = Seq(("a", 1), ("b", 2), ("c", 3))
-val pairRdd = spark.sparkContext.parallelize(pairs)
-```
-
-### From Files
-
-```python
-# Text files - each line becomes an element
-text_rdd = spark.sparkContext.textFile("hdfs://path/to/files/*.txt")
-
-# Whole text files - each file as (filename, content) pair
-files_rdd = spark.sparkContext.wholeTextFiles("hdfs://path/to/files/")
-
-# Binary files
-binary_rdd = spark.sparkContext.binaryFiles("hdfs://path/to/files/")
-
-# Sequence files (Hadoop)
-seq_rdd = spark.sparkContext.sequenceFile("hdfs://path/to/seqfile",
-                                          "org.apache.hadoop.io.Text",
-                                          "org.apache.hadoop.io.IntWritable")
-```
-
-### From DataFrame
-
-```python
-# Convert DataFrame to RDD of Rows
-df = spark.read.parquet("s3://bucket/data/")
-row_rdd = df.rdd
-
-# Access Row fields
-result_rdd = row_rdd.map(lambda row: (row.user_id, row.amount))
-
-# Convert back to DataFrame
-from pyspark.sql import Row
-df_new = result_rdd.map(lambda x: Row(user_id=x[0], amount=x[1])).toDF()
-```
-
----
-
-## Transformations (Lazy)
-
-Transformations return a new RDD and are not executed until an action is called.
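The laziness described above can be illustrated without a cluster. The sketch below is a plain-Python stand-in, not Spark code: `LazyDataset` is a hypothetical class that only records functions in `map` and `filter` and runs them when the `collect` action is called.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # pending transformations, in order

    def map(self, fn):
        # Transformation: return a new dataset with one more recorded step
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded, never executed here
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded pipeline actually executed
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []

def traced_square(x):
    calls.append(x)  # record every invocation to show when work happens
    return x * x

squares = LazyDataset([1, 2, 3, 4]).map(traced_square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # nothing has run yet
result = squares.collect()        # the action triggers the whole pipeline
```

In this sketch `traced_square` is not invoked until `collect()` runs, which mirrors why a chain of RDD transformations costs nothing until an action such as `count()` or `collect()` is called.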
-
-### Basic Transformations
-
-```python
-# map - apply function to each element
-squares = rdd.map(lambda x: x ** 2)
-
-# flatMap - apply function returning iterator, flatten results
-words = text_rdd.flatMap(lambda line: line.split(" "))
-
-# filter - keep elements matching predicate
-evens = rdd.filter(lambda x: x % 2 == 0)
-
-# distinct - remove duplicates (causes shuffle)
-unique = rdd.distinct()
-
-# sample - random sample
-sampled = rdd.sample(withReplacement=False, fraction=0.1, seed=42)
-
-# union - combine two RDDs
-combined = rdd1.union(rdd2)
-
-# intersection - elements in both RDDs (causes shuffle)
-common = rdd1.intersection(rdd2)
-
-# subtract - elements in rdd1 not in rdd2 (causes shuffle)
-diff = rdd1.subtract(rdd2)
-
-# cartesian - all pairs (expensive!)
-product = rdd1.cartesian(rdd2)
-```
-
-```scala
-// Scala transformations
-val squares = rdd.map(x => x * x)
-val words = textRdd.flatMap(line => line.split(" "))
-val evens = rdd.filter(_ % 2 == 0)
-val unique = rdd.distinct()
-val sampled = rdd.sample(withReplacement = false, fraction = 0.1, seed = 42L)
-```
-
-### MapPartitions (Efficient Batch Processing)
-
-```python
-# Process entire partition at once - more efficient than map
-# Good for: database connections, expensive initialization, batch operations
-
-def process_partition(iterator):
-    # Initialize expensive resource once per partition
-    connection = create_database_connection()
-    try:
-        for record in iterator:
-            result = connection.process(record)
-            yield result
-    finally:
-        connection.close()
-
-result_rdd = rdd.mapPartitions(process_partition)
-
-# With partition index
-def process_with_index(partition_index, iterator):
-    for record in iterator:
-        yield (partition_index, record)
-
-result_rdd = rdd.mapPartitionsWithIndex(process_with_index)
-```
-
-```scala
-// Scala mapPartitions
-val result = rdd.mapPartitions { iterator =>
-  val connection = createDatabaseConnection()
-  try {
-    // iterator.map is lazy: materialize the results before close() runs,
-    // otherwise records would be processed on a closed connection
-    iterator.map(record => connection.process(record)).toVector.iterator
-  } finally {
-    connection.close()
-  }
-}
-```
-
-### Repartition and Coalesce
-
-```python
-# repartition - increase or decrease partitions (full shuffle)
-rdd_repart = rdd.repartition(100)
-
-# coalesce - decrease partitions only (avoids full shuffle)
-rdd_coalesced = rdd.coalesce(10)  # Efficient reduction
-
-# glom - collect each partition into a list
-partitions = rdd.glom()  # RDD of lists in PySpark (RDD[Array[T]] in Scala)
-```
-
-**When to use:**
-- `repartition(n)`: When increasing partitions or when you need an even distribution
-- `coalesce(n)`: When decreasing partitions (for example, after a filter has reduced the data)
-
----
-
-## Pair RDD Operations
-
-Pair RDDs (RDDs of key-value tuples) unlock key-based transformations such as per-key aggregation, joins, and sorting by key.
-
-### Creating Pair RDDs
-
-```python
-# From tuples
-pair_rdd = rdd.map(lambda x: (x.key, x.value))
-
-# keyBy - create pairs from existing elements
-pair_rdd = rdd.keyBy(lambda x: x.user_id)
-```
-
-### Transformations on Pair RDDs
-
-```python
-# reduceByKey - aggregate values by key (more efficient than groupByKey)
-counts = pair_rdd.reduceByKey(lambda a, b: a + b)
-
-# groupByKey - group all values for each key (shuffles all data!)
-grouped = pair_rdd.groupByKey()  # Avoid when possible
-
-# aggregateByKey - combine with different local/global combiners
-sum_count = pair_rdd.aggregateByKey(
-    zeroValue=(0, 0),  # (sum, count)
-    seqFunc=lambda acc, v: (acc[0] + v, acc[1] + 1),  # within partition
-    combFunc=lambda a, b: (a[0] + b[0], a[1] + b[1])  # across partitions
-)
-
-# combineByKey - most general aggregation
-averages = pair_rdd.combineByKey(
-    createCombiner=lambda v: (v, 1),
-    mergeValue=lambda acc, v: (acc[0] + v, acc[1] + 1),
-    mergeCombiners=lambda a, b: (a[0] + b[0], a[1] + b[1])
-).mapValues(lambda x: x[0] / x[1])
-
-# mapValues - transform values only (preserves partitioning)
-doubled = pair_rdd.mapValues(lambda v: v * 2)
-
-# flatMapValues - flatMap on values only
-expanded = pair_rdd.flatMapValues(lambda v: range(v))
-
-# keys and values
-keys_rdd = pair_rdd.keys()
-values_rdd = pair_rdd.values()
-
-# sortByKey
-sorted_rdd = pair_rdd.sortByKey(ascending=True)
-
-# join operations (shuffle unless both sides are already co-partitioned)
-joined = rdd1.join(rdd2)  # inner join
-left = rdd1.leftOuterJoin(rdd2)  # left outer
-right = rdd1.rightOuterJoin(rdd2)  # right outer
-full = rdd1.fullOuterJoin(rdd2)  # full outer
-cogroup = rdd1.cogroup(rdd2)  # group by key from both RDDs
-
-# subtractByKey - remove keys present in other RDD
-filtered = rdd1.subtractByKey(rdd2)
-```
-
-```scala
-// Scala pair RDD operations
-val counts = pairRdd.reduceByKey(_ + _)
-val grouped = pairRdd.groupByKey()  // Avoid when possible
-
-val averages = pairRdd.combineByKey(
-  (v: Int) => (v, 1),
-  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
-  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)
-).mapValues { case (sum, count) => sum.toDouble / count }
-
-val joined = rdd1.join(rdd2)
-```
-
-### reduceByKey vs groupByKey
-
-```python
-# BAD: groupByKey shuffles all values
-# Memory-intensive, can cause OOM
-word_counts = words.map(lambda w: (w, 1)).groupByKey().mapValues(sum)
-
-# GOOD: reduceByKey combines locally first
-# Much more efficient, less data shuffled
-word_counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
-```
-
-**Spark UI Check:** Compare shuffle write sizes. `reduceByKey` should show much smaller shuffle than `groupByKey` for the same operation.
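To make the shuffle-size difference concrete, here is a minimal plain-Python sketch of map-side combining. It is an illustration under stated assumptions, not Spark's implementation: the helper names and the two input partitions are hypothetical, and "records shuffled" is simply counted rather than measured in bytes.

```python
def shuffle_records_group_by_key(partitions):
    """groupByKey-style: every (key, value) record crosses the shuffle."""
    return sum(len(part) for part in partitions)

def shuffle_records_reduce_by_key(partitions, func):
    """reduceByKey-style: combine inside each partition first,
    then shuffle one partial result per key per partition."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    return shuffled

def merge_after_shuffle(shuffled, func):
    """Reduce side: merge the partial results for each key."""
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Two hypothetical input partitions of (word, 1) pairs
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("c", 1)],
]

def add(x, y):
    return x + y

group_shuffled = shuffle_records_group_by_key(partitions)   # all 7 records move
partials = shuffle_records_reduce_by_key(partitions, add)
reduce_shuffled = len(partials)                              # only 4 partial sums move
word_counts = merge_after_shuffle(partials, add)
```

The gap widens as keys repeat more often within a partition, which is the pattern to look for when comparing shuffle write sizes in the Spark UI.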
-
----
-
-## Actions (Trigger Execution)
-
-Actions return values to the driver or write to storage.
-
-### Collection Actions
-
-```python
-# collect - return all elements to driver (OOM risk!)
-all_data = rdd.collect()  # Use carefully on large RDDs
-
-# take - return first n elements
-first_10 = rdd.take(10)
-
-# takeOrdered - return smallest/largest n elements
-smallest_5 = rdd.takeOrdered(5)  # ascending
-largest_5 = rdd.takeOrdered(5, key=lambda x: -x)
-
-# takeSample - random sample
-sample = rdd.takeSample(withReplacement=False, num=100, seed=42)
-
-# first - return first element
-first = rdd.first()
-
-# top - return largest n elements
-top_5 = rdd.top(5)
-
-# count - count elements
-total = rdd.count()
-
-# countByKey - count elements per key (returns dict to driver)
-key_counts = pair_rdd.countByKey()
-
-# countByValue - count occurrences of each value
-value_counts = rdd.countByValue()
-```
-
-### Aggregation Actions
-
-```python
-# reduce - aggregate all elements
-total = rdd.reduce(lambda a, b: a + b)
-
-# fold - reduce with zero value
-total = rdd.fold(0, lambda a, b: a + b)
-
-# aggregate - combine with different types
-stats = rdd.aggregate(
-    zeroValue=(0, 0),  # (sum, count)
-    seqOp=lambda acc, v: (acc[0] + v, acc[1] + 1),
-    combOp=lambda a, b: (a[0] + b[0], a[1] + b[1])
-)
-average = stats[0] / stats[1]
-```
-
-### Output Actions
-
-```python
-# saveAsTextFile - save as text files
-rdd.saveAsTextFile("hdfs://path/output/")
-
-# saveAsSequenceFile - save as Hadoop sequence file
-pair_rdd.saveAsSequenceFile("hdfs://path/output/")
-
-# saveAsPickleFile - Python pickle format
-rdd.saveAsPickleFile("hdfs://path/output/")
-
-# foreach - apply function to each element (side effects)
-rdd.foreach(lambda x: print(x))  # Runs on executors; output lands in executor logs, not on the driver
-
-# foreachPartition - apply function to each partition
-def save_partition(iterator):
-    connection = create_connection()
-    try:
-        for record in iterator:
-            connection.save(record)
-    finally:
-        connection.close()  # Close even if a save fails
-
-rdd.foreachPartition(save_partition)
-```
-
----
-
-## Custom Partitioners
-
-### Implementing Custom Partitioner
-
-```python
-# PySpark does not export a Partitioner base class to subclass (unlike Scala);
-# partitionBy() takes a partition count and a key -> partition-index function
-
-class RangePartitioner:
-    def __init__(self, ranges):
-        """
-        ranges: list of (min, max) tuples defining partition boundaries
-        """
-        self.ranges = ranges
-
-    def numPartitions(self):
-        return len(self.ranges)
-
-    def getPartition(self, key):
-        for i, (min_val, max_val) in enumerate(self.ranges):
-            if min_val <= key < max_val:
-                return i
-        return len(self.ranges) - 1  # Default to last partition
-
-# Use custom partitioner
-ranges = [(0, 100), (100, 500), (500, 1000), (1000, float('inf'))]
-partitioner = RangePartitioner(ranges)
-partitioned_rdd = pair_rdd.partitionBy(partitioner.numPartitions(), partitioner.getPartition)
-```
-
-```scala
-// Scala custom partitioner
-import org.apache.spark.Partitioner
-
-class DomainPartitioner(numParts: Int) extends Partitioner {
-  override def numPartitions: Int = numParts
-
-  override def getPartition(key: Any): Int = {
-    val domain = key.asInstanceOf[String].split("@")(1)
-    math.abs(domain.hashCode % numPartitions)
-  }
-
-  override def equals(other: Any): Boolean = other match {
-    case p: DomainPartitioner => p.numPartitions == numPartitions
-    case _ => false
-  }
-
-  // Keep hashCode consistent with equals
-  override def hashCode: Int = numPartitions
-}
-
-val partitioned = pairRdd.partitionBy(new DomainPartitioner(10))
-```
-
-### Hash Partitioner (Default)
-
-```python
-# No import needed: PySpark does not export a HashPartitioner class.
-# partitionBy(n) hashes keys with portable_hash by default.
-
-# Repartition with hash partitioning
-partitioned_rdd = pair_rdd.partitionBy(100)  # Default hash partitioning
-
-# Preserve partitioning across transformations
-# mapValues and flatMapValues preserve partitioner
-preserved = partitioned_rdd.mapValues(lambda v: v * 2)
-assert preserved.partitioner == partitioned_rdd.partitioner
-
-# map does NOT preserve partitioner
-not_preserved = partitioned_rdd.map(lambda x: (x[0], x[1] * 2))
-assert not_preserved.partitioner is None
-```
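Why a shared partitioner matters can be sketched in plain Python. This is an illustration only, assuming a stable hash (`zlib.crc32` here stands in for Spark's hashing; it is not what PySpark uses) and two small hypothetical datasets; `partition_index` and `partition_by` are made-up helper names.

```python
import zlib

def partition_index(key, num_partitions):
    """Deterministic stand-in for hash partitioning: stable hash modulo partition count."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

def partition_by(pairs, num_partitions):
    """Place each (key, value) pair in the bucket chosen by its key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition_index(key, num_partitions)].append((key, value))
    return buckets

users = [("u1", "Ann"), ("u2", "Bob"), ("u3", "Cy")]
orders = [("u1", 30), ("u3", 12), ("u1", 7)]

user_parts = partition_by(users, 4)
order_parts = partition_by(orders, 4)

# With the same function and partition count, matching keys share a partition index,
# so a join can pair bucket i with bucket i without moving any records.
joined = []
for u_bucket, o_bucket in zip(user_parts, order_parts):
    names = dict(u_bucket)
    joined.extend((k, (names[k], amount)) for k, amount in o_bucket if k in names)
```

If the key-to-partition mapping changes, as it can after a plain `map`, this bucket-by-bucket pairing no longer holds, which is why operations that keep the partitioner are worth preferring.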
-
----
-
-## Broadcast Variables and Accumulators
-
-### Broadcast Variables
-
-```python
-# Broadcast large read-only data to all executors
-lookup_table = {"a": 1, "b": 2, "c": 3}  # Small example
-lookup_broadcast = spark.sparkContext.broadcast(lookup_table)
-
-def enrich_record(record):
-    table = lookup_broadcast.value  # Access broadcast value
-    return (record, table.get(record, 0))
-
-enriched_rdd = rdd.map(enrich_record)
-enriched_rdd.count()  # Run the job while the broadcast is still available
-
-# Clean up only after every action that uses the broadcast has finished
-lookup_broadcast.unpersist()
-lookup_broadcast.destroy()
-```
-
-```scala
-// Scala broadcast
-val lookupTable = Map("a" -> 1, "b" -> 2, "c" -> 3)
-val lookupBroadcast = spark.sparkContext.broadcast(lookupTable)
-
-val enriched = rdd.map { record =>
-  val table = lookupBroadcast.value
-  (record, table.getOrElse(record, 0))
-}
-```
-
-### Accumulators
-
-```python
-# Numeric accumulator (PySpark has no longAccumulator; use accumulator(0))
-error_count = spark.sparkContext.accumulator(0)
-
-def process_record(record):
-    try:
-        return transform(record)
-    except Exception:
-        error_count.add(1)
-        return None
-
-result_rdd = rdd.map(process_record).filter(lambda x: x is not None)
-result_rdd.count()  # Trigger execution
-
-print(f"Errors encountered: {error_count.value}")
-
-# Collection accumulator
-from pyspark import AccumulatorParam
-
-class SetAccumulatorParam(AccumulatorParam):
-    def zero(self, initial_value):
-        return set()
-
-    def addInPlace(self, v1, v2):
-        return v1.union(v2)
-
-error_types = spark.sparkContext.accumulator(set(), SetAccumulatorParam())
-
-def track_errors(record):
-    try:
-        return process(record)
-    except ValueError:
-        error_types.add({"ValueError"})
-        return None
-    except TypeError:
-        error_types.add({"TypeError"})
-        return None
-```
-
-**Caution:** Accumulator updates made inside transformations may be applied more than once when a task is retried or a stage is recomputed; only updates made inside actions are applied exactly once per task. Use accumulators for debugging/monitoring, not business logic.
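The over-counting behind this caution can be reproduced in a few lines of plain Python. The sketch below models only one cause, re-evaluation of the same lineage (for example two actions on an uncached RDD, or a recomputed stage); it does not model Spark's scheduler. `CountingAccumulator` and `make_pipeline` are hypothetical names, not PySpark classes.

```python
class CountingAccumulator:
    """Hypothetical driver-side counter; not the PySpark Accumulator class."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n


def make_pipeline(records, acc):
    """Return a zero-argument 'lineage' that re-runs the transformation
    every time it is evaluated, as an uncached RDD would."""
    def evaluate():
        out = []
        for rec in records:
            if rec < 0:
                acc.add(1)  # side effect inside a transformation
            else:
                out.append(rec)
        return out
    return evaluate


records = [1, -2, 3, -4, 5]
errors = CountingAccumulator()
lineage = make_pipeline(records, errors)

first = lineage()   # e.g. a first action
second = lineage()  # e.g. a second action on the same uncached data

true_bad = sum(1 for r in records if r < 0)
```

Both evaluations return the same data, but the counter ends at twice the true number of bad records, which is why accumulator values are best treated as approximate diagnostics.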
|
|
497
|
-
|
|
498
|
-
---
|
|
499
|
-
|
|
500
|
-
## Performance Patterns
|
|
501
|
-
|
|
502
|
-
### Avoiding Shuffle
|
|
503
|
-
|
|
504
|
-
```python
|
|
505
|
-
# BAD: Multiple shuffles
|
|
506
|
-
result = rdd.groupByKey().mapValues(sum).reduceByKey(max)
|
|
507
|
-
|
|
508
|
-
# GOOD: Single shuffle with combineByKey
|
|
509
|
-
result = rdd.combineByKey(
|
|
510
|
-
createCombiner=lambda v: v,
|
|
511
|
-
mergeValue=lambda acc, v: acc + v,
|
|
512
|
-
mergeCombiners=lambda a, b: max(a, b)
|
|
513
|
-
)
|
|
514
|
-
|
|
515
|
-
# Co-partition related RDDs to avoid join shuffles
|
|
516
|
-
partitioned_users = users.partitionBy(100)
|
|
517
|
-
partitioned_orders = orders.partitionBy(100) # Same partitioner
|
|
518
|
-
joined = partitioned_users.join(partitioned_orders) # No shuffle if same partitioner
|
|
519
|
-
```
|
|
520
|
-
|
|
521
|
-
### Efficient Serialization
|
|
522
|
-
|
|
523
|
-
```python
|
|
524
|
-
# Use Kryo serialization for better performance
|
|
525
|
-
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
|
|
526
|
-
spark.conf.set("spark.kryo.registrationRequired", "false")
|
|
527
|
-
|
|
528
|
-
# Register custom classes for Kryo (Scala)
|
|
529
|
-
# spark.conf.set("spark.kryo.classesToRegister", "com.example.MyClass")
|
|
530
|
-
```
|
|
531
|
-
|
|
532
|
-
### Memory-Efficient Operations
|
|
533
|
-
|
|
534
|
-
```python
|
|
535
|
-
# Prefer iterator-based operations
|
|
536
|
-
def efficient_processing(iterator):
|
|
537
|
-
for record in iterator:
|
|
538
|
-
# Process one at a time, don't collect
|
|
539
|
-
yield transform(record)
|
|
540
|
-
|
|
541
|
-
result = rdd.mapPartitions(efficient_processing)
|
|
542
|
-
|
|
543
|
-
# Avoid collecting large data to driver
|
|
544
|
-
# BAD
|
|
545
|
-
all_keys = rdd.keys().collect() # Could be millions!
|
|
546
|
-
|
|
547
|
-
# GOOD
|
|
548
|
-
key_sample = rdd.keys().take(1000)  # First 1000 keys only; use takeSample for a random sample
|
|
549
|
-
```
|
|
550
|
-
|
|
551
|
-
---
|
|
552
|
-
|
|
553
|
-
## Spark UI Analysis for RDDs
|
|
554
|
-
|
|
555
|
-
### Stages Tab Metrics
|
|
556
|
-
|
|
557
|
-
| Metric | What to Check |
|
|
558
|
-
|--------|---------------|
|
|
559
|
-
| Shuffle Write | Minimize with reduceByKey over groupByKey |
|
|
560
|
-
| Shuffle Read | Large reads indicate join/aggregation overhead |
|
|
561
|
-
| Spill (Memory) | Indicates partition too large for memory |
|
|
562
|
-
| Spill (Disk) | Data being written to disk - increase memory |
|
|
563
|
-
| GC Time | Should be < 10% of task time |
|
|
564
|
-
|
|
565
|
-
### Common Issues
|
|
566
|
-
|
|
567
|
-
1. **Uneven partition sizes**: Look for tasks taking much longer than others
|
|
568
|
-
2. **Data skew**: One partition has much more data than others
|
|
569
|
-
3. **Straggler tasks**: A few tasks taking 10x longer than median
|
|
570
|
-
|
|
571
|
-
### Debugging Tips
|
|
572
|
-
|
|
573
|
-
```python
|
|
574
|
-
# Check partition sizes
|
|
575
|
-
partition_sizes = rdd.glom().map(len).collect()
|
|
576
|
-
print(f"Partition sizes: min={min(partition_sizes)}, max={max(partition_sizes)}, avg={sum(partition_sizes)/len(partition_sizes)}")
|
|
577
|
-
|
|
578
|
-
# Check partitioner
|
|
579
|
-
print(f"Partitioner: {rdd.partitioner}")
|
|
580
|
-
print(f"Num partitions: {rdd.getNumPartitions()}")
|
|
581
|
-
|
|
582
|
-
# Debug lineage
|
|
583
|
-
print(rdd.toDebugString())
|
|
584
|
-
```
|
|
585
|
-
|
|
586
|
-
---
|
|
587
|
-
|
|
588
|
-
## Best Practices Summary
|
|
589
|
-
|
|
590
|
-
1. **Prefer DataFrames** - Use RDDs only when the DataFrame API is insufficient
|
|
591
|
-
2. **Use reduceByKey over groupByKey** - Combines locally first, reduces shuffle
|
|
592
|
-
3. **Preserve partitioning** - Use mapValues/flatMapValues to keep partitioner
|
|
593
|
-
4. **Minimize shuffles** - Co-partition related RDDs, use broadcast for small data
|
|
594
|
-
5. **Use mapPartitions** - For expensive initialization (DB connections, etc.)
|
|
595
|
-
6. **Avoid collect on large data** - Use take, takeSample, or foreachPartition
|
|
596
|
-
7. **Broadcast lookup tables** - Avoid shuffle for small reference data
|
|
597
|
-
8. **Monitor accumulators** - Use for debugging, not business logic
|
|
598
|
-
9. **Check partition distribution** - Avoid skew with custom partitioners
|
|
599
|
-
10. **Profile with Spark UI** - Identify shuffle, spill, and GC issues
|
|
1
|
+
# RDD Operations
|
|
2
|
+
|
|
3
|
+
---
|
|
4
|
+
|
|
5
|
+
## When to Use RDDs
|
|
6
|
+
|
|
7
|
+
**Use RDDs when:**
|
|
8
|
+
- Processing unstructured data (raw text, custom binary formats)
|
|
9
|
+
- Need fine-grained control over physical data placement
|
|
10
|
+
- Implementing custom partitioning logic for specific access patterns
|
|
11
|
+
- Working with legacy Spark code that needs maintenance
|
|
12
|
+
- Building custom data structures not expressible as DataFrames
|
|
13
|
+
|
|
14
|
+
**Prefer DataFrames when:**
|
|
15
|
+
- Processing structured/semi-structured data
|
|
16
|
+
- Performing SQL-like operations
|
|
17
|
+
- Need Catalyst optimizer benefits
|
|
18
|
+
- Working with standard file formats (Parquet, JSON, ORC)
|
|
19
|
+
|
|
20
|
+
---
|
|
21
|
+
|
|
22
|
+
## RDD Creation
|
|
23
|
+
|
|
24
|
+
### From Collections
|
|
25
|
+
|
|
26
|
+
```python
|
|
27
|
+
# PySpark - Create RDD from Python collection
|
|
28
|
+
data = [1, 2, 3, 4, 5]
|
|
29
|
+
rdd = spark.sparkContext.parallelize(data, numSlices=4) # 4 partitions
|
|
30
|
+
|
|
31
|
+
# From key-value pairs
|
|
32
|
+
pairs = [("a", 1), ("b", 2), ("c", 3)]
|
|
33
|
+
pair_rdd = spark.sparkContext.parallelize(pairs)
|
|
34
|
+
```
|
|
35
|
+
|
|
36
|
+
```scala
|
|
37
|
+
// Scala - Create RDD from collection
|
|
38
|
+
val data = Seq(1, 2, 3, 4, 5)
|
|
39
|
+
val rdd = spark.sparkContext.parallelize(data, numSlices = 4)
|
|
40
|
+
|
|
41
|
+
// From key-value pairs
|
|
42
|
+
val pairs = Seq(("a", 1), ("b", 2), ("c", 3))
|
|
43
|
+
val pairRdd = spark.sparkContext.parallelize(pairs)
|
|
44
|
+
```
|
|
45
|
+
|
|
46
|
+
### From Files
|
|
47
|
+
|
|
48
|
+
```python
|
|
49
|
+
# Text files - each line becomes an element
|
|
50
|
+
text_rdd = spark.sparkContext.textFile("hdfs://path/to/files/*.txt")
|
|
51
|
+
|
|
52
|
+
# Whole text files - each file as (filename, content) pair
|
|
53
|
+
files_rdd = spark.sparkContext.wholeTextFiles("hdfs://path/to/files/")
|
|
54
|
+
|
|
55
|
+
# Binary files
|
|
56
|
+
binary_rdd = spark.sparkContext.binaryFiles("hdfs://path/to/files/")
|
|
57
|
+
|
|
58
|
+
# Sequence files (Hadoop)
|
|
59
|
+
seq_rdd = spark.sparkContext.sequenceFile("hdfs://path/to/seqfile",
|
|
60
|
+
"org.apache.hadoop.io.Text",
|
|
61
|
+
"org.apache.hadoop.io.IntWritable")
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
### From DataFrame
|
|
65
|
+
|
|
66
|
+
```python
|
|
67
|
+
# Convert DataFrame to RDD of Rows
|
|
68
|
+
df = spark.read.parquet("s3://bucket/data/")
|
|
69
|
+
row_rdd = df.rdd
|
|
70
|
+
|
|
71
|
+
# Access Row fields
|
|
72
|
+
result_rdd = row_rdd.map(lambda row: (row.user_id, row.amount))
|
|
73
|
+
|
|
74
|
+
# Convert back to DataFrame
|
|
75
|
+
from pyspark.sql import Row
|
|
76
|
+
df_new = result_rdd.map(lambda x: Row(user_id=x[0], amount=x[1])).toDF()
|
|
77
|
+
```
|
|
78
|
+
|
|
79
|
+
---
|
|
80
|
+
|
|
81
|
+
## Transformations (Lazy)
|
|
82
|
+
|
|
83
|
+
Transformations return a new RDD and are not executed until an action is called.
|
|
84
|
+
|
|
85
|
+
### Basic Transformations
|
|
86
|
+
|
|
87
|
+
```python
|
|
88
|
+
# map - apply function to each element
|
|
89
|
+
squares = rdd.map(lambda x: x ** 2)
|
|
90
|
+
|
|
91
|
+
# flatMap - apply function returning iterator, flatten results
|
|
92
|
+
words = text_rdd.flatMap(lambda line: line.split(" "))
|
|
93
|
+
|
|
94
|
+
# filter - keep elements matching predicate
|
|
95
|
+
evens = rdd.filter(lambda x: x % 2 == 0)
|
|
96
|
+
|
|
97
|
+
# distinct - remove duplicates (causes shuffle)
|
|
98
|
+
unique = rdd.distinct()
|
|
99
|
+
|
|
100
|
+
# sample - random sample
|
|
101
|
+
sampled = rdd.sample(withReplacement=False, fraction=0.1, seed=42)
|
|
102
|
+
|
|
103
|
+
# union - combine two RDDs
|
|
104
|
+
combined = rdd1.union(rdd2)
|
|
105
|
+
|
|
106
|
+
# intersection - elements in both RDDs (causes shuffle)
|
|
107
|
+
common = rdd1.intersection(rdd2)
|
|
108
|
+
|
|
109
|
+
# subtract - elements in rdd1 not in rdd2 (causes shuffle)
|
|
110
|
+
diff = rdd1.subtract(rdd2)
|
|
111
|
+
|
|
112
|
+
# cartesian - all pairs (expensive!)
|
|
113
|
+
product = rdd1.cartesian(rdd2)
|
|
114
|
+
```
|
|
115
|
+
|
|
116
|
+
```scala
|
|
117
|
+
// Scala transformations
|
|
118
|
+
val squares = rdd.map(x => x * x)
|
|
119
|
+
val words = textRdd.flatMap(line => line.split(" "))
|
|
120
|
+
val evens = rdd.filter(_ % 2 == 0)
|
|
121
|
+
val unique = rdd.distinct()
|
|
122
|
+
val sampled = rdd.sample(withReplacement = false, fraction = 0.1, seed = 42L)
|
|
123
|
+
```
|
|
124
|
+
|
|
125
|
+
### MapPartitions (Efficient Batch Processing)
|
|
126
|
+
|
|
127
|
+
```python
|
|
128
|
+
# Process entire partition at once - more efficient than map
|
|
129
|
+
# Good for: database connections, expensive initialization, batch operations
|
|
130
|
+
|
|
131
|
+
def process_partition(iterator):
|
|
132
|
+
# Initialize expensive resource once per partition
|
|
133
|
+
connection = create_database_connection()
|
|
134
|
+
try:
|
|
135
|
+
for record in iterator:
|
|
136
|
+
result = connection.process(record)
|
|
137
|
+
yield result
|
|
138
|
+
finally:
|
|
139
|
+
connection.close()
|
|
140
|
+
|
|
141
|
+
result_rdd = rdd.mapPartitions(process_partition)
|
|
142
|
+
|
|
143
|
+
# With partition index
|
|
144
|
+
def process_with_index(partition_index, iterator):
|
|
145
|
+
for record in iterator:
|
|
146
|
+
yield (partition_index, record)
|
|
147
|
+
|
|
148
|
+
result_rdd = rdd.mapPartitionsWithIndex(process_with_index)
|
|
149
|
+
```
|
|
150
|
+
|
|
151
|
+
```scala
|
|
152
|
+
// Scala mapPartitions
|
|
153
|
+
val result = rdd.mapPartitions { iterator =>
|
|
154
|
+
val connection = createDatabaseConnection()
|
|
155
|
+
try {
|
|
156
|
+
    iterator.map(record => connection.process(record)).toList.iterator  // Materialize before close(): Iterator.map is lazy
|
|
157
|
+
} finally {
|
|
158
|
+
connection.close()
|
|
159
|
+
}
|
|
160
|
+
}
|
|
161
|
+
```
|
|
162
|
+
|
|
163
|
+
### Repartition and Coalesce
|
|
164
|
+
|
|
165
|
+
```python
|
|
166
|
+
# repartition - increase or decrease partitions (full shuffle)
|
|
167
|
+
rdd_repart = rdd.repartition(100)
|
|
168
|
+
|
|
169
|
+
# coalesce - decrease partitions only (avoids full shuffle)
|
|
170
|
+
rdd_coalesced = rdd.coalesce(10) # Efficient reduction
|
|
171
|
+
|
|
172
|
+
# glom - collect each partition into an array
|
|
173
|
+
partitions = rdd.glom() # RDD[Array[T]]
|
|
174
|
+
```
|
|
175
|
+
|
|
176
|
+
**When to use:**
|
|
177
|
+
- `repartition(n)`: When increasing partitions or need even distribution
|
|
178
|
+
- `coalesce(n)`: When decreasing partitions (after filter reduced data)
|
|
179
|
+
|
|
180
|
+
---
|
|
181
|
+
|
|
182
|
+
## Pair RDD Operations
|
|
183
|
+
|
|
184
|
+
Pair RDDs (key-value pairs) enable powerful transformations.
|
|
185
|
+
|
|
186
|
+
### Creating Pair RDDs
|
|
187
|
+
|
|
188
|
+
```python
|
|
189
|
+
# From tuples
|
|
190
|
+
pair_rdd = rdd.map(lambda x: (x.key, x.value))
|
|
191
|
+
|
|
192
|
+
# keyBy - create pairs from existing elements
|
|
193
|
+
pair_rdd = rdd.keyBy(lambda x: x.user_id)
|
|
194
|
+
```
|
|
195
|
+
|
|
196
|
+
### Transformations on Pair RDDs
|
|
197
|
+
|
|
198
|
+
```python
|
|
199
|
+
# reduceByKey - aggregate values by key (more efficient than groupByKey)
|
|
200
|
+
counts = pair_rdd.reduceByKey(lambda a, b: a + b)
|
|
201
|
+
|
|
202
|
+
# groupByKey - group all values for each key (shuffles all data!)
|
|
203
|
+
grouped = pair_rdd.groupByKey() # Avoid when possible
|
|
204
|
+
|
|
205
|
+
# aggregateByKey - combine with different local/global combiners
|
|
206
|
+
sum_count = pair_rdd.aggregateByKey(
|
|
207
|
+
zeroValue=(0, 0), # (sum, count)
|
|
208
|
+
seqFunc=lambda acc, v: (acc[0] + v, acc[1] + 1), # within partition
|
|
209
|
+
combFunc=lambda a, b: (a[0] + b[0], a[1] + b[1]) # across partitions
|
|
210
|
+
)
|
|
211
|
+
|
|
212
|
+
# combineByKey - most general aggregation
|
|
213
|
+
averages = pair_rdd.combineByKey(
|
|
214
|
+
createCombiner=lambda v: (v, 1),
|
|
215
|
+
mergeValue=lambda acc, v: (acc[0] + v, acc[1] + 1),
|
|
216
|
+
mergeCombiners=lambda a, b: (a[0] + b[0], a[1] + b[1])
|
|
217
|
+
).mapValues(lambda x: x[0] / x[1])
|
|
218
|
+
|
|
219
|
+
# mapValues - transform values only (preserves partitioning)
|
|
220
|
+
doubled = pair_rdd.mapValues(lambda v: v * 2)
|
|
221
|
+
|
|
222
|
+
# flatMapValues - flatMap on values only
|
|
223
|
+
expanded = pair_rdd.flatMapValues(lambda v: range(v))
|
|
224
|
+
|
|
225
|
+
# keys and values
|
|
226
|
+
keys_rdd = pair_rdd.keys()
|
|
227
|
+
values_rdd = pair_rdd.values()
|
|
228
|
+
|
|
229
|
+
# sortByKey
|
|
230
|
+
sorted_rdd = pair_rdd.sortByKey(ascending=True)
|
|
231
|
+
|
|
232
|
+
# join operations (all cause shuffle)
|
|
233
|
+
joined = rdd1.join(rdd2) # inner join
|
|
234
|
+
left = rdd1.leftOuterJoin(rdd2) # left outer
|
|
235
|
+
right = rdd1.rightOuterJoin(rdd2) # right outer
|
|
236
|
+
full = rdd1.fullOuterJoin(rdd2) # full outer
|
|
237
|
+
cogroup = rdd1.cogroup(rdd2) # group by key from both RDDs
|
|
238
|
+
|
|
239
|
+
# subtractByKey - remove keys present in other RDD
|
|
240
|
+
filtered = rdd1.subtractByKey(rdd2)
|
|
241
|
+
```
|
|
242
|
+
|
|
243
|
+
```scala
|
|
244
|
+
// Scala pair RDD operations
|
|
245
|
+
val counts = pairRdd.reduceByKey(_ + _)
|
|
246
|
+
val grouped = pairRdd.groupByKey() // Avoid when possible
|
|
247
|
+
|
|
248
|
+
val averages = pairRdd.combineByKey(
|
|
249
|
+
(v: Int) => (v, 1),
|
|
250
|
+
(acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
|
|
251
|
+
(a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)
|
|
252
|
+
).mapValues { case (sum, count) => sum.toDouble / count }
|
|
253
|
+
|
|
254
|
+
val joined = rdd1.join(rdd2)
|
|
255
|
+
```
|
|
256
|
+
|
|
257
|
+
### reduceByKey vs groupByKey
|
|
258
|
+
|
|
259
|
+
```python
|
|
260
|
+
# BAD: groupByKey shuffles all values
|
|
261
|
+
# Memory-intensive, can cause OOM
|
|
262
|
+
word_counts = words.map(lambda w: (w, 1)).groupByKey().mapValues(sum)
|
|
263
|
+
|
|
264
|
+
# GOOD: reduceByKey combines locally first
|
|
265
|
+
# Much more efficient, less data shuffled
|
|
266
|
+
word_counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
|
|
267
|
+
```
|
|
268
|
+
|
|
269
|
+
**Spark UI Check:** Compare shuffle write sizes. `reduceByKey` should show much smaller shuffle than `groupByKey` for the same operation.
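
The difference in shuffle size can be illustrated without a cluster. The following is a plain-Python sketch, not Spark code: the helper names (`values_shuffled_group_by_key`, `values_shuffled_reduce_by_key`, `word_count`) and the two-partition input are hypothetical, and the model assumes that map-side combining emits one partial result per distinct key per partition.

```python
from collections import Counter


def values_shuffled_group_by_key(partitions):
    """groupByKey-style: every value has to cross the shuffle."""
    return sum(len(part) for part in partitions)


def values_shuffled_reduce_by_key(partitions):
    """reduceByKey-style: each partition pre-combines, so it ships one value per distinct key."""
    return sum(len({key for key, _ in part}) for part in partitions)


def word_count(partitions):
    """Final per-key totals; both strategies must agree on this result."""
    totals = Counter()
    for part in partitions:
        for key, value in part:
            totals[key] += value
    return dict(totals)


# Hypothetical input: two partitions of (word, 1) pairs
partitions = [
    [("spark", 1), ("rdd", 1), ("spark", 1), ("spark", 1)],
    [("rdd", 1), ("rdd", 1), ("spark", 1)],
]

grouped = values_shuffled_group_by_key(partitions)
reduced = values_shuffled_reduce_by_key(partitions)
print(grouped, reduced, word_count(partitions))
```

In this toy input the gap is small; it grows with the number of repeated keys per partition, which is the pattern to look for in the shuffle write column.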
|
|
270
|
+
|
|
271
|
+
---
|
|
272
|
+
|
|
273
|
+
## Actions (Trigger Execution)
|
|
274
|
+
|
|
275
|
+
Actions return values to the driver or write to storage.
|
|
276
|
+
|
|
277
|
+
### Collection Actions
|
|
278
|
+
|
|
279
|
+
```python
|
|
280
|
+
# collect - return all elements to driver (OOM risk!)
|
|
281
|
+
all_data = rdd.collect() # Use carefully on large RDDs
|
|
282
|
+
|
|
283
|
+
# take - return first n elements
|
|
284
|
+
first_10 = rdd.take(10)
|
|
285
|
+
|
|
286
|
+
# takeOrdered - return smallest/largest n elements
|
|
287
|
+
smallest_5 = rdd.takeOrdered(5) # ascending
|
|
288
|
+
largest_5 = rdd.takeOrdered(5, key=lambda x: -x)
|
|
289
|
+
|
|
290
|
+
# takeSample - random sample
|
|
291
|
+
sample = rdd.takeSample(withReplacement=False, num=100, seed=42)
|
|
292
|
+
|
|
293
|
+
# first - return first element
|
|
294
|
+
first = rdd.first()
|
|
295
|
+
|
|
296
|
+
# top - return largest n elements
|
|
297
|
+
top_5 = rdd.top(5)
|
|
298
|
+
|
|
299
|
+
# count - count elements
|
|
300
|
+
total = rdd.count()
|
|
301
|
+
|
|
302
|
+
# countByKey - count elements per key (returns dict to driver)
|
|
303
|
+
key_counts = pair_rdd.countByKey()
|
|
304
|
+
|
|
305
|
+
# countByValue - count occurrences of each value
|
|
306
|
+
value_counts = rdd.countByValue()
|
|
307
|
+
```
|
|
308
|
+
|
|
309
|
+
### Aggregation Actions
|
|
310
|
+
|
|
311
|
+
```python
|
|
312
|
+
# reduce - aggregate all elements
|
|
313
|
+
total = rdd.reduce(lambda a, b: a + b)
|
|
314
|
+
|
|
315
|
+
# fold - reduce with zero value
|
|
316
|
+
total = rdd.fold(0, lambda a, b: a + b)
|
|
317
|
+
|
|
318
|
+
# aggregate - combine with different types
|
|
319
|
+
stats = rdd.aggregate(
|
|
320
|
+
zeroValue=(0, 0), # (sum, count)
|
|
321
|
+
seqOp=lambda acc, v: (acc[0] + v, acc[1] + 1),
|
|
322
|
+
combOp=lambda a, b: (a[0] + b[0], a[1] + b[1])
|
|
323
|
+
)
|
|
324
|
+
average = stats[0] / stats[1]
|
|
325
|
+
```
|
|
326
|
+
|
|
327
|
+
### Output Actions
|
|
328
|
+
|
|
329
|
+
```python
|
|
330
|
+
# saveAsTextFile - save as text files
|
|
331
|
+
rdd.saveAsTextFile("hdfs://path/output/")
|
|
332
|
+
|
|
333
|
+
# saveAsSequenceFile - save as Hadoop sequence file
|
|
334
|
+
pair_rdd.saveAsSequenceFile("hdfs://path/output/")
|
|
335
|
+
|
|
336
|
+
# saveAsPickleFile - Python pickle format
|
|
337
|
+
rdd.saveAsPickleFile("hdfs://path/output/")
|
|
338
|
+
|
|
339
|
+
# foreach - apply function to each element (side effects)
|
|
340
|
+
rdd.foreach(lambda x: print(x))  # Runs on executors; on a cluster the output lands in executor logs, not the driver console
|
|
341
|
+
|
|
342
|
+
# foreachPartition - apply function to each partition
|
|
343
|
+
def save_partition(iterator):
|
|
344
|
+
connection = create_connection()
|
|
345
|
+
for record in iterator:
|
|
346
|
+
connection.save(record)
|
|
347
|
+
connection.close()
|
|
348
|
+
|
|
349
|
+
rdd.foreachPartition(save_partition)
|
|
350
|
+
```
|
|
351
|
+
|
|
352
|
+
---
|
|
353
|
+
|
|
354
|
+
## Custom Partitioners
|
|
355
|
+
|
|
356
|
+
### Implementing Custom Partitioner
|
|
357
|
+
|
|
358
|
+
```python
|
|
359
|
+
from pyspark.rdd import Partitioner  # Partitioner lives in pyspark.rdd; it is not exported from the top-level package
|
|
360
|
+
|
|
361
|
+
class RangePartitioner(Partitioner):
|
|
362
|
+
def __init__(self, ranges):
|
|
363
|
+
"""
|
|
364
|
+
ranges: list of (min, max) tuples defining partition boundaries
|
|
365
|
+
"""
|
|
366
|
+
self.ranges = ranges
|
|
367
|
+
|
|
368
|
+
def numPartitions(self):
|
|
369
|
+
return len(self.ranges)
|
|
370
|
+
|
|
371
|
+
def getPartition(self, key):
|
|
372
|
+
for i, (min_val, max_val) in enumerate(self.ranges):
|
|
373
|
+
if min_val <= key < max_val:
|
|
374
|
+
return i
|
|
375
|
+
return len(self.ranges) - 1 # Default to last partition
|
|
376
|
+
|
|
377
|
+
# Use custom partitioner
|
|
378
|
+
ranges = [(0, 100), (100, 500), (500, 1000), (1000, float('inf'))]
|
|
379
|
+
partitioner = RangePartitioner(ranges)
|
|
380
|
+
partitioned_rdd = pair_rdd.partitionBy(partitioner.numPartitions(), partitioner.getPartition)
|
|
381
|
+
```
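
Since PySpark's `partitionBy` takes a partition count and a plain function, the range lookup can also be written as a standalone function. The sketch below is one possible variant under stated assumptions: `make_range_partition_func` and its `upper_bounds` argument are hypothetical names, and it uses binary search from the standard library rather than the linear scan shown in the class. Unlike the class, keys below the first bound land in partition 0 rather than the last partition.

```python
from bisect import bisect_right


def make_range_partition_func(upper_bounds):
    """Return a function mapping a numeric key to a partition index.

    upper_bounds: ascending exclusive upper bounds for every partition except
    the last, e.g. [100, 500, 1000] yields four partitions:
    below 100, 100 to 499, 500 to 999, and 1000 or above.
    """
    bounds = list(upper_bounds)
    if bounds != sorted(bounds):
        raise ValueError("upper_bounds must be sorted ascending")

    def partition_func(key):
        # bisect_right counts how many bounds are <= key, which is the partition index
        return bisect_right(bounds, key)

    return partition_func


partition_func = make_range_partition_func([100, 500, 1000])
num_partitions = 4  # len(upper_bounds) + 1

assignments = [partition_func(k) for k in (0, 99, 100, 499, 500, 1000, 10**6)]
print(assignments)
```

With a pair RDD this would be passed as `pair_rdd.partitionBy(num_partitions, partition_func)`.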
|
|
382
|
+
|
|
383
|
+
```scala
|
|
384
|
+
// Scala custom partitioner
|
|
385
|
+
import org.apache.spark.Partitioner
|
|
386
|
+
|
|
387
|
+
class DomainPartitioner(numParts: Int) extends Partitioner {
|
|
388
|
+
override def numPartitions: Int = numParts
|
|
389
|
+
|
|
390
|
+
override def getPartition(key: Any): Int = {
|
|
391
|
+
val domain = key.asInstanceOf[String].split("@")(1)
|
|
392
|
+
math.abs(domain.hashCode % numPartitions)
|
|
393
|
+
}
|
|
394
|
+
|
|
395
|
+
override def equals(other: Any): Boolean = other match {
|
|
396
|
+
case p: DomainPartitioner => p.numPartitions == numPartitions
|
|
397
|
+
case _ => false
|
|
398
|
+
}
|
|
399
|
+
}
|
|
400
|
+
|
|
401
|
+
val partitioned = pairRdd.partitionBy(new DomainPartitioner(10))
|
|
402
|
+
```
|
|
403
|
+
|
|
404
|
+
### Hash Partitioner (Default)
|
|
405
|
+
|
|
406
|
+
```python
|
|
407
|
+
# PySpark has no HashPartitioner class to import; partitionBy() hashes keys with portable_hash by default
|
|
408
|
+
|
|
409
|
+
# Repartition with hash partitioner
|
|
410
|
+
partitioned_rdd = pair_rdd.partitionBy(100)  # Hash partitioning via the default partitionFunc (portable_hash)
|
|
411
|
+
|
|
412
|
+
# Preserve partitioning across transformations
|
|
413
|
+
# mapValues and flatMapValues preserve partitioner
|
|
414
|
+
preserved = partitioned_rdd.mapValues(lambda v: v * 2)
|
|
415
|
+
assert preserved.partitioner == partitioned_rdd.partitioner
|
|
416
|
+
|
|
417
|
+
# map does NOT preserve partitioner
|
|
418
|
+
not_preserved = partitioned_rdd.map(lambda x: (x[0], x[1] * 2))
|
|
419
|
+
assert not_preserved.partitioner is None
|
|
420
|
+
```
|
|
421
|
+
|
|
422
|
+
---
|
|
423
|
+
|
|
424
|
+
## Broadcast Variables and Accumulators
|
|
425
|
+
|
|
426
|
+
### Broadcast Variables
|
|
427
|
+
|
|
428
|
+
```python
|
|
429
|
+
# Broadcast large read-only data to all executors
|
|
430
|
+
lookup_table = {"a": 1, "b": 2, "c": 3} # Small example
|
|
431
|
+
lookup_broadcast = spark.sparkContext.broadcast(lookup_table)
|
|
432
|
+
|
|
433
|
+
def enrich_record(record):
|
|
434
|
+
table = lookup_broadcast.value # Access broadcast value
|
|
435
|
+
return (record, table.get(record, 0))
|
|
436
|
+
|
|
437
|
+
enriched_rdd = rdd.map(enrich_record)
|
|
438
|
+
|
|
439
|
+
# Clean up only after every action that uses enriched_rdd has finished; map is lazy and still needs the broadcast
|
|
440
|
+
lookup_broadcast.unpersist()
|
|
441
|
+
lookup_broadcast.destroy()
|
|
442
|
+
```
|
|
443
|
+
|
|
444
|
+
```scala
|
|
445
|
+
// Scala broadcast
|
|
446
|
+
val lookupTable = Map("a" -> 1, "b" -> 2, "c" -> 3)
|
|
447
|
+
val lookupBroadcast = spark.sparkContext.broadcast(lookupTable)
|
|
448
|
+
|
|
449
|
+
val enriched = rdd.map { record =>
|
|
450
|
+
val table = lookupBroadcast.value
|
|
451
|
+
(record, table.getOrElse(record, 0))
|
|
452
|
+
}
|
|
453
|
+
```
|
|
454
|
+
|
|
455
|
+
### Accumulators
|
|
456
|
+
|
|
457
|
+
```python
|
|
458
|
+
# Numeric accumulator (counter)
|
|
459
|
+
error_count = spark.sparkContext.accumulator(0)  # PySpark has no longAccumulator; that is the Scala/Java API
|
|
460
|
+
|
|
461
|
+
def process_record(record):
|
|
462
|
+
try:
|
|
463
|
+
return transform(record)
|
|
464
|
+
except Exception:
|
|
465
|
+
error_count.add(1)
|
|
466
|
+
return None
|
|
467
|
+
|
|
468
|
+
result_rdd = rdd.map(process_record).filter(lambda x: x is not None)
|
|
469
|
+
result_rdd.count() # Trigger execution
|
|
470
|
+
|
|
471
|
+
print(f"Errors encountered: {error_count.value}")
|
|
472
|
+
|
|
473
|
+
# Collection accumulator
|
|
474
|
+
from pyspark import AccumulatorParam
|
|
475
|
+
|
|
476
|
+
class SetAccumulatorParam(AccumulatorParam):
|
|
477
|
+
def zero(self, initial_value):
|
|
478
|
+
return set()
|
|
479
|
+
|
|
480
|
+
def addInPlace(self, v1, v2):
|
|
481
|
+
return v1.union(v2)
|
|
482
|
+
|
|
483
|
+
error_types = spark.sparkContext.accumulator(set(), SetAccumulatorParam())
|
|
484
|
+
|
|
485
|
+
def track_errors(record):
|
|
486
|
+
try:
|
|
487
|
+
return process(record)
|
|
488
|
+
except ValueError:
|
|
489
|
+
error_types.add({"ValueError"})
|
|
490
|
+
return None
|
|
491
|
+
except TypeError:
|
|
492
|
+
error_types.add({"TypeError"})
|
|
493
|
+
return None
|
|
494
|
+
```
|
|
495
|
+
|
|
496
|
+
**Caution:** Accumulators may be updated more than once if tasks are retried. Use only for debugging/monitoring, not business logic.
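
The over-counting can be reproduced in miniature. This is a plain-Python simulation rather than Spark code: `CounterAccumulator` and `run_task` are hypothetical stand-ins, and the second call models a task that is re-executed after its first attempt was lost.

```python
class CounterAccumulator:
    """Stand-in for a Spark accumulator: a shared counter that tasks add to."""

    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount


def run_task(records, bad_records):
    """Simulated map task: drop negative records and count them as errors."""
    kept = []
    for record in records:
        if record < 0:
            bad_records.add(1)
        else:
            kept.append(record)
    return kept


partition = [3, -1, 7, -5, 2]  # hypothetical partition with two bad records
bad_records = CounterAccumulator()

first_attempt = run_task(partition, bad_records)   # attempt whose output is lost
second_attempt = run_task(partition, bad_records)  # re-execution of the same task

print(second_attempt)     # the data is still correct
print(bad_records.value)  # the counter saw both attempts
```

The output data is unaffected by the retry, but the counter reports twice the real number of bad records. Spark applies accumulator updates made inside actions once per task; updates made inside transformations such as `map` carry no such guarantee.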
|
|
497
|
+
|
|
498
|
+
---
|
|
499
|
+
|
|
500
|
+
## Performance Patterns
|
|
501
|
+
|
|
502
|
+
### Avoiding Shuffle
|
|
503
|
+
|
|
504
|
+
```python
|
|
505
|
+
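# BAD: groupByKey ships every value across the shuffle before summing; the trailing reduceByKey(max) then sees one value per key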
|
|
506
|
+
result = rdd.groupByKey().mapValues(sum).reduceByKey(max)
|
|
507
|
+
|
|
508
|
+
# GOOD: Single shuffle with combineByKey
|
|
509
|
+
result = rdd.combineByKey(
|
|
510
|
+
createCombiner=lambda v: v,
|
|
511
|
+
mergeValue=lambda acc, v: acc + v,
|
|
512
|
+
    mergeCombiners=lambda a, b: a + b  # Merge per-partition partial sums (max here would make the result depend on partitioning)
|
|
513
|
+
)
|
|
514
|
+
|
|
515
|
+
# Co-partition related RDDs to avoid join shuffles
|
|
516
|
+
partitioned_users = users.partitionBy(100)
|
|
517
|
+
partitioned_orders = orders.partitionBy(100) # Same partitioner
|
|
518
|
+
joined = partitioned_users.join(partitioned_orders) # No shuffle if same partitioner
|
|
519
|
+
```
|
|
520
|
+
|
|
521
|
+
### Efficient Serialization
|
|
522
|
+
|
|
523
|
+
```python
|
|
524
|
+
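# Use Kryo for JVM-side serialization; it must be configured before the session starts (spark.conf.set cannot change it at runtime)
|
|
525
|
+
from pyspark.sql import SparkSession
|
|
526
|
+
spark = SparkSession.builder.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").config("spark.kryo.registrationRequired", "false").getOrCreate()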
|
|
527
|
+
|
|
528
|
+
# Register custom classes for Kryo (Scala)
|
|
529
|
+
# SparkSession.builder.config("spark.kryo.classesToRegister", "com.example.MyClass")
|
|
530
|
+
```
|
|
531
|
+
|
|
532
|
+
### Memory-Efficient Operations
|
|
533
|
+
|
|
534
|
+
```python
|
|
535
|
+
# Prefer iterator-based operations
|
|
536
|
+
def efficient_processing(iterator):
|
|
537
|
+
for record in iterator:
|
|
538
|
+
# Process one at a time, don't collect
|
|
539
|
+
yield transform(record)
|
|
540
|
+
|
|
541
|
+
result = rdd.mapPartitions(efficient_processing)
|
|
542
|
+
|
|
543
|
+
# Avoid collecting large data to driver
|
|
544
|
+
# BAD
|
|
545
|
+
all_keys = rdd.keys().collect() # Could be millions!
|
|
546
|
+
|
|
547
|
+
# GOOD
|
|
548
|
+
key_sample = rdd.keys().take(1000)  # First 1000 keys only; use takeSample for a random sample
|
|
549
|
+
```
|
|
550
|
+
|
|
551
|
+
---
|
|
552
|
+
|
|
553
|
+
## Spark UI Analysis for RDDs
|
|
554
|
+
|
|
555
|
+
### Stages Tab Metrics
|
|
556
|
+
|
|
557
|
+
| Metric | What to Check |
|
|
558
|
+
|--------|---------------|
|
|
559
|
+
| Shuffle Write | Minimize with reduceByKey over groupByKey |
|
|
560
|
+
| Shuffle Read | Large reads indicate join/aggregation overhead |
|
|
561
|
+
| Spill (Memory) | Indicates partition too large for memory |
|
|
562
|
+
| Spill (Disk) | Data being written to disk - increase memory |
|
|
563
|
+
| GC Time | Should be < 10% of task time |
|
|
564
|
+
|
|
565
|
+
### Common Issues
|
|
566
|
+
|
|
567
|
+
1. **Uneven partition sizes**: Look for tasks taking much longer than others
|
|
568
|
+
2. **Data skew**: One partition has much more data than others
|
|
569
|
+
3. **Straggler tasks**: A few tasks taking 10x longer than median
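
These symptoms can be quantified from per-partition record counts. The helper below is a minimal sketch: `skew_report` and the sample sizes are hypothetical, and the counts are assumed to come from something like `rdd.glom().map(len).collect()`. The largest partition is compared with the median because a single oversized partition distorts the mean.

```python
from statistics import median


def skew_report(partition_sizes):
    """Summarize how unevenly records are spread across partitions."""
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    med = median(partition_sizes)
    largest = max(partition_sizes)
    return {
        "min": min(partition_sizes),
        "median": med,
        "max": largest,
        # Ratio of the largest partition to the median; inf when the median is 0
        "max_over_median": largest / med if med else float("inf"),
        "empty_partitions": sum(1 for size in partition_sizes if size == 0),
    }


# Hypothetical sizes for a six-partition RDD with one hot partition
sizes = [1000, 1100, 950, 12000, 0, 1050]
report = skew_report(sizes)
print(report)
```

A `max_over_median` well above 1 points to the skew and straggler patterns listed above; what counts as too high depends on the job.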
|
|
570
|
+
|
|
571
|
+
### Debugging Tips
|
|
572
|
+
|
|
573
|
+
```python
|
|
574
|
+
# Check partition sizes
|
|
575
|
+
partition_sizes = rdd.glom().map(len).collect()
|
|
576
|
+
print(f"Partition sizes: min={min(partition_sizes)}, max={max(partition_sizes)}, avg={sum(partition_sizes)/len(partition_sizes)}")
|
|
577
|
+
|
|
578
|
+
# Check partitioner
|
|
579
|
+
print(f"Partitioner: {rdd.partitioner}")
|
|
580
|
+
print(f"Num partitions: {rdd.getNumPartitions()}")
|
|
581
|
+
|
|
582
|
+
# Debug lineage
|
|
583
|
+
print(rdd.toDebugString())
|
|
584
|
+
```
|
|
585
|
+
|
|
586
|
+
---
|
|
587
|
+
|
|
588
|
+
## Best Practices Summary
|
|
589
|
+
|
|
590
|
+
1. **Prefer DataFrames** - Use RDDs only when the DataFrame API is insufficient
|
|
591
|
+
2. **Use reduceByKey over groupByKey** - Combines locally first, reduces shuffle
|
|
592
|
+
3. **Preserve partitioning** - Use mapValues/flatMapValues to keep partitioner
|
|
593
|
+
4. **Minimize shuffles** - Co-partition related RDDs, use broadcast for small data
|
|
594
|
+
5. **Use mapPartitions** - For expensive initialization (DB connections, etc.)
|
|
595
|
+
6. **Avoid collect on large data** - Use take, takeSample, or foreachPartition
|
|
596
|
+
7. **Broadcast lookup tables** - Avoid shuffle for small reference data
|
|
597
|
+
8. **Monitor accumulators** - Use for debugging, not business logic
|
|
598
|
+
9. **Check partition distribution** - Avoid skew with custom partitioners
|
|
599
|
+
10. **Profile with Spark UI** - Identify shuffle, spill, and GC issues
|