cuda-engine 1.0.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- cuda_engine-1.0.0/.github/workflows/eval.yml +123 -0
- cuda_engine-1.0.0/.github/workflows/nightly.yml +58 -0
- cuda_engine-1.0.0/.github/workflows/pr.yml +16 -0
- cuda_engine-1.0.0/.gitignore +28 -0
- cuda_engine-1.0.0/CHANGELOG.md +39 -0
- cuda_engine-1.0.0/LICENSE +21 -0
- cuda_engine-1.0.0/PKG-INFO +266 -0
- cuda_engine-1.0.0/README.md +230 -0
- cuda_engine-1.0.0/docs/brainstorming-notes.md +116 -0
- cuda_engine-1.0.0/docs/colab.md +339 -0
- cuda_engine-1.0.0/docs/cost.md +124 -0
- cuda_engine-1.0.0/docs/milestones/M2-evidence.md +186 -0
- cuda_engine-1.0.0/docs/milestones/M3-evidence.md +535 -0
- cuda_engine-1.0.0/docs/milestones/M3-internal-eval-colab-runbook.md +235 -0
- cuda_engine-1.0.0/docs/milestones/M3-task-4.3-colab-runbook.md +124 -0
- cuda_engine-1.0.0/docs/milestones/M4-evidence.md +69 -0
- cuda_engine-1.0.0/docs/privacy.md +73 -0
- cuda_engine-1.0.0/docs/superpowers/plans/2026-04-26-cuda-synthesis-engine-plan.md +1757 -0
- cuda_engine-1.0.0/docs/superpowers/plans/2026-05-07-stage-escalation-plan.md +1305 -0
- cuda_engine-1.0.0/docs/superpowers/plans/2026-05-11-fast1-lift-plan.md +709 -0
- cuda_engine-1.0.0/docs/superpowers/plans/2026-05-11-torch-compile-baseline-plan.md +1575 -0
- cuda_engine-1.0.0/docs/superpowers/specs/2026-04-26-cuda-synthesis-engine-design.md +456 -0
- cuda_engine-1.0.0/docs/superpowers/specs/2026-05-07-stage-escalation-design.md +333 -0
- cuda_engine-1.0.0/docs/superpowers/specs/2026-05-11-fast1-lift-design.md +247 -0
- cuda_engine-1.0.0/docs/superpowers/specs/2026-05-11-torch-compile-baseline-design.md +412 -0
- cuda_engine-1.0.0/evals/__init__.py +1 -0
- cuda_engine-1.0.0/evals/internal/add_relu_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/add_relu_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/add_relu_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/add_relu_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/argmax_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/argmax_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/argmax_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/argmax_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/bias_gelu_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/bias_gelu_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/bias_gelu_fp16/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/bias_gelu_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/clamp_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/clamp_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/clamp_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/clamp_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/cumulative_max_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/cumulative_max_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/cumulative_max_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/cumulative_max_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/dropout_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/dropout_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/dropout_fp16/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/dropout_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/geglu_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/geglu_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/geglu_fp16/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/geglu_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/gelu_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/gelu_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/gelu_fp16/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/gelu_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/l2_norm_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/l2_norm_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/l2_norm_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/l2_norm_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/layernorm_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/layernorm_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/layernorm_fp16/reference.py +5 -0
- cuda_engine-1.0.0/evals/internal/layernorm_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/layernorm_silu_fused_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/layernorm_silu_fused_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/layernorm_silu_fused_fp16/reference.py +6 -0
- cuda_engine-1.0.0/evals/internal/layernorm_silu_fused_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/masked_mean_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/masked_mean_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/masked_mean_fp16/reference.py +6 -0
- cuda_engine-1.0.0/evals/internal/masked_mean_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/max_lastdim_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/max_lastdim_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/max_lastdim_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/max_lastdim_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/mean_lastdim_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/mean_lastdim_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/mean_lastdim_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/mean_lastdim_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/min_lastdim_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/min_lastdim_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/min_lastdim_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/min_lastdim_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/prefix_sum_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/prefix_sum_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/prefix_sum_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/prefix_sum_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/relu_bias_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/relu_bias_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/relu_bias_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/relu_bias_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/rms_norm_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/rms_norm_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/rms_norm_fp16/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/rms_norm_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/rmsnorm_silu_fused_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/rmsnorm_silu_fused_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/rmsnorm_silu_fused_fp16/reference.py +4 -0
- cuda_engine-1.0.0/evals/internal/rmsnorm_silu_fused_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/scalar_multiply_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/scalar_multiply_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/scalar_multiply_fp32/reference.py +2 -0
- cuda_engine-1.0.0/evals/internal/scalar_multiply_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/segment_sum_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/segment_sum_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/segment_sum_fp32/reference.py +4 -0
- cuda_engine-1.0.0/evals/internal/segment_sum_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/sigmoid_mul_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/sigmoid_mul_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/sigmoid_mul_fp16/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/sigmoid_mul_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/silu_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/silu_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/silu_fp16/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/silu_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/softmax_lastdim_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/softmax_lastdim_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/softmax_lastdim_fp16/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/softmax_lastdim_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/softmax_numerator_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/softmax_numerator_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/softmax_numerator_fp16/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/softmax_numerator_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/sum_reduction_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/sum_reduction_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/sum_reduction_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/sum_reduction_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/swiglu_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/swiglu_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/swiglu_fp16/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/swiglu_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/tanh_add_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/tanh_add_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/tanh_add_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/tanh_add_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/topk_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/topk_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/topk_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/internal/topk_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/internal/vector_add_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/internal/vector_add_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/internal/vector_add_fp32/reference.py +2 -0
- cuda_engine-1.0.0/evals/internal/vector_add_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/kernelbench/README.md +78 -0
- cuda_engine-1.0.0/evals/kernelbench/__init__.py +24 -0
- cuda_engine-1.0.0/evals/kernelbench/filter.py +384 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/.gitkeep +0 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/argmin_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/argmin_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/argmin_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/argmin_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/elu_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/elu_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/elu_fp16/reference.py +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/elu_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/frobenius_norm_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/frobenius_norm_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/frobenius_norm_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/frobenius_norm_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/l1_norm_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/l1_norm_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/l1_norm_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/l1_norm_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/leaky_relu_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/leaky_relu_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/leaky_relu_fp16/reference.py +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/leaky_relu_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/log_softmax_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/log_softmax_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/log_softmax_fp16/reference.py +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/log_softmax_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/masked_cumsum_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/masked_cumsum_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/masked_cumsum_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/masked_cumsum_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/mingpt_gelu_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/mingpt_gelu_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/mingpt_gelu_fp16/reference.py +6 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/mingpt_gelu_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/mse_loss_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/mse_loss_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/mse_loss_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/mse_loss_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/reverse_cumsum_fp32/notes.md +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/reverse_cumsum_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/reverse_cumsum_fp32/reference.py +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/reverse_cumsum_fp32/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/softplus_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/softplus_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/softplus_fp16/reference.py +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/softplus_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/softsign_fp16/notes.md +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/softsign_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/softsign_fp16/reference.py +3 -0
- cuda_engine-1.0.0/evals/kernelbench/filtered/softsign_fp16/shapes.yaml +3 -0
- cuda_engine-1.0.0/evals/runner.py +497 -0
- cuda_engine-1.0.0/examples/kernels/README.md +63 -0
- cuda_engine-1.0.0/examples/kernels/rmsnorm_silu_fp16/notes.md +24 -0
- cuda_engine-1.0.0/examples/kernels/rmsnorm_silu_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/examples/kernels/rmsnorm_silu_fp16/reference.py +5 -0
- cuda_engine-1.0.0/examples/kernels/rmsnorm_silu_fp16/shapes.yaml +4 -0
- cuda_engine-1.0.0/examples/kernels/softmax_lastdim_fp16/notes.md +20 -0
- cuda_engine-1.0.0/examples/kernels/softmax_lastdim_fp16/prompt.txt +1 -0
- cuda_engine-1.0.0/examples/kernels/softmax_lastdim_fp16/reference.py +3 -0
- cuda_engine-1.0.0/examples/kernels/softmax_lastdim_fp16/shapes.yaml +4 -0
- cuda_engine-1.0.0/examples/kernels/topk_fp32/notes.md +20 -0
- cuda_engine-1.0.0/examples/kernels/topk_fp32/prompt.txt +1 -0
- cuda_engine-1.0.0/examples/kernels/topk_fp32/reference.py +3 -0
- cuda_engine-1.0.0/examples/kernels/topk_fp32/shapes.yaml +4 -0
- cuda_engine-1.0.0/examples/notebook.ipynb +258 -0
- cuda_engine-1.0.0/examples/web_demo.py +261 -0
- cuda_engine-1.0.0/pyproject.toml +65 -0
- cuda_engine-1.0.0/ruff.toml +6 -0
- cuda_engine-1.0.0/src/cuda_engine/__init__.py +24 -0
- cuda_engine-1.0.0/src/cuda_engine/api.py +39 -0
- cuda_engine-1.0.0/src/cuda_engine/cli.py +485 -0
- cuda_engine-1.0.0/src/cuda_engine/config.py +32 -0
- cuda_engine-1.0.0/src/cuda_engine/models/__init__.py +27 -0
- cuda_engine-1.0.0/src/cuda_engine/models/artifact.py +12 -0
- cuda_engine-1.0.0/src/cuda_engine/models/reports.py +106 -0
- cuda_engine-1.0.0/src/cuda_engine/models/spec.py +45 -0
- cuda_engine-1.0.0/src/cuda_engine/orchestrator.py +352 -0
- cuda_engine-1.0.0/src/cuda_engine/prompts/__init__.py +8 -0
- cuda_engine-1.0.0/src/cuda_engine/prompts/codegen.md +29 -0
- cuda_engine-1.0.0/src/cuda_engine/prompts/interview.md +30 -0
- cuda_engine-1.0.0/src/cuda_engine/prompts/perf_fix.md +56 -0
- cuda_engine-1.0.0/src/cuda_engine/prompts/polish.md +13 -0
- cuda_engine-1.0.0/src/cuda_engine/services/__init__.py +1 -0
- cuda_engine-1.0.0/src/cuda_engine/services/gpu/__init__.py +3 -0
- cuda_engine-1.0.0/src/cuda_engine/services/gpu/_run_kernel_child.py +305 -0
- cuda_engine-1.0.0/src/cuda_engine/services/gpu/base.py +88 -0
- cuda_engine-1.0.0/src/cuda_engine/services/gpu/local.py +451 -0
- cuda_engine-1.0.0/src/cuda_engine/services/gpu/mocks.py +85 -0
- cuda_engine-1.0.0/src/cuda_engine/services/llm/__init__.py +3 -0
- cuda_engine-1.0.0/src/cuda_engine/services/llm/anthropic.py +71 -0
- cuda_engine-1.0.0/src/cuda_engine/services/llm/base.py +35 -0
- cuda_engine-1.0.0/src/cuda_engine/services/llm/mocks.py +38 -0
- cuda_engine-1.0.0/src/cuda_engine/services/llm/tools.py +64 -0
- cuda_engine-1.0.0/src/cuda_engine/services/store/__init__.py +3 -0
- cuda_engine-1.0.0/src/cuda_engine/services/store/base.py +24 -0
- cuda_engine-1.0.0/src/cuda_engine/services/store/local_dir.py +42 -0
- cuda_engine-1.0.0/src/cuda_engine/services/store/mocks.py +27 -0
- cuda_engine-1.0.0/src/cuda_engine/stages/__init__.py +1 -0
- cuda_engine-1.0.0/src/cuda_engine/stages/base.py +41 -0
- cuda_engine-1.0.0/src/cuda_engine/stages/codegen.py +193 -0
- cuda_engine-1.0.0/src/cuda_engine/stages/correctness.py +241 -0
- cuda_engine-1.0.0/src/cuda_engine/stages/interview.py +117 -0
- cuda_engine-1.0.0/src/cuda_engine/stages/performance.py +424 -0
- cuda_engine-1.0.0/src/cuda_engine/stages/polish.py +152 -0
- cuda_engine-1.0.0/src/cuda_engine/targets/__init__.py +7 -0
- cuda_engine-1.0.0/src/cuda_engine/targets/sm_100.py +2 -0
- cuda_engine-1.0.0/src/cuda_engine/targets/sm_80.py +18 -0
- cuda_engine-1.0.0/src/cuda_engine/targets/sm_90.py +2 -0
- cuda_engine-1.0.0/tests/fixtures/ncu_basic_vector_add.csv +50 -0
- cuda_engine-1.0.0/tests/integration/test_baseline_torch_compile.py +119 -0
- cuda_engine-1.0.0/tests/integration/test_e2e_argmax.py +32 -0
- cuda_engine-1.0.0/tests/integration/test_e2e_hard_gate_sad_path.py +42 -0
- cuda_engine-1.0.0/tests/integration/test_e2e_perf_loop_escalation.py +89 -0
- cuda_engine-1.0.0/tests/integration/test_e2e_rms_norm.py +36 -0
- cuda_engine-1.0.0/tests/integration/test_e2e_scalar_multiply.py +31 -0
- cuda_engine-1.0.0/tests/integration/test_e2e_sum_reduction.py +32 -0
- cuda_engine-1.0.0/tests/integration/test_e2e_vector_add.py +28 -0
- cuda_engine-1.0.0/tests/integration/test_local_compile_vector_add.py +24 -0
- cuda_engine-1.0.0/tests/integration/test_local_profile_vector_add.py +68 -0
- cuda_engine-1.0.0/tests/integration/test_run_kernel_custom_op.py +74 -0
- cuda_engine-1.0.0/tests/unit/models/__init__.py +1 -0
- cuda_engine-1.0.0/tests/unit/models/test_models.py +78 -0
- cuda_engine-1.0.0/tests/unit/services/__init__.py +1 -0
- cuda_engine-1.0.0/tests/unit/services/gpu/__init__.py +1 -0
- cuda_engine-1.0.0/tests/unit/services/gpu/test_local.py +377 -0
- cuda_engine-1.0.0/tests/unit/services/gpu/test_mocks.py +34 -0
- cuda_engine-1.0.0/tests/unit/services/gpu/test_ncu_parser.py +93 -0
- cuda_engine-1.0.0/tests/unit/services/gpu/test_run_kernel_child.py +205 -0
- cuda_engine-1.0.0/tests/unit/services/llm/__init__.py +1 -0
- cuda_engine-1.0.0/tests/unit/services/llm/test_anthropic.py +107 -0
- cuda_engine-1.0.0/tests/unit/services/llm/test_mocks.py +28 -0
- cuda_engine-1.0.0/tests/unit/services/llm/test_tools.py +26 -0
- cuda_engine-1.0.0/tests/unit/services/store/__init__.py +1 -0
- cuda_engine-1.0.0/tests/unit/services/store/test_local_dir.py +40 -0
- cuda_engine-1.0.0/tests/unit/services/store/test_mocks.py +11 -0
- cuda_engine-1.0.0/tests/unit/services/test_abcs_exist.py +47 -0
- cuda_engine-1.0.0/tests/unit/stages/test_codegen.py +216 -0
- cuda_engine-1.0.0/tests/unit/stages/test_correctness.py +368 -0
- cuda_engine-1.0.0/tests/unit/stages/test_interview.py +135 -0
- cuda_engine-1.0.0/tests/unit/stages/test_performance.py +768 -0
- cuda_engine-1.0.0/tests/unit/stages/test_polish.py +113 -0
- cuda_engine-1.0.0/tests/unit/stages/test_stages_pass_through.py +20 -0
- cuda_engine-1.0.0/tests/unit/test_api.py +50 -0
- cuda_engine-1.0.0/tests/unit/test_cli.py +439 -0
- cuda_engine-1.0.0/tests/unit/test_config.py +47 -0
- cuda_engine-1.0.0/tests/unit/test_eval_runner.py +583 -0
- cuda_engine-1.0.0/tests/unit/test_internal_eval_suite.py +85 -0
- cuda_engine-1.0.0/tests/unit/test_kernelbench_filter.py +264 -0
- cuda_engine-1.0.0/tests/unit/test_kernelbench_suite.py +66 -0
- cuda_engine-1.0.0/tests/unit/test_orchestrator.py +501 -0
- cuda_engine-1.0.0/tests/unit/test_prompts.py +37 -0
- cuda_engine-1.0.0/tests/unit/test_prompts_interview.py +12 -0
- cuda_engine-1.0.0/tests/unit/test_targets.py +16 -0
- cuda_engine-1.0.0/tests/unit/test_web_demo.py +75 -0
@@ -0,0 +1,123 @@
+name: Pre-release Eval
+
+# Manual trigger only. Runs the full internal eval suite on a self-hosted
+# A100 runner and posts results as both an artifact and a PR comment.
+# Used as the v1.0 release gate per M4 Task 5.9.
+
+on:
+  workflow_dispatch:
+    inputs:
+      ref:
+        description: "Branch or commit to evaluate (default: current)"
+        required: false
+        default: ""
+      suite:
+        description: "Eval suite to run"
+        required: false
+        default: "internal"
+        type: choice
+        options:
+          - internal
+          - kernelbench
+
+permissions:
+  contents: read
+  pull-requests: write
+
+jobs:
+  pre-release-eval:
+    name: Pre-release eval on A100
+    runs-on: [self-hosted, linux, x64, a100]
+    timeout-minutes: 480
+    env:
+      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+      CUDA_ENGINE_EVAL_OUT: evals/results/prerelease-${{ github.run_id }}
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          ref: ${{ inputs.ref || github.ref }}
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Verify GPU tooling
+        run: |
+          nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv,noheader
+          which nvcc && nvcc --version
+          which ncu && ncu --version
+
+      - name: Install package
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install -e ".[dev]"
+
+      - name: Run pre-release eval
+        run: |
+          mkdir -p "$CUDA_ENGINE_EVAL_OUT"
+          cuda-engine eval \
+            --suite ${{ inputs.suite }} \
+            --out "$CUDA_ENGINE_EVAL_OUT" \
+            --resume
+
+      - name: Summarize eval gate
+        id: gate
+        if: always()
+        run: |
+          SUMMARY="$CUDA_ENGINE_EVAL_OUT/summary.md"
+          if [ ! -f "$SUMMARY" ]; then
+            echo "summary_exists=false" >> "$GITHUB_OUTPUT"
+            exit 0
+          fi
+          echo "summary_exists=true" >> "$GITHUB_OUTPUT"
+          {
+            echo "## Pre-release Eval Results"
+            echo
+            echo "**Ref:** \`${{ inputs.ref || github.ref }}\`"
+            echo "**Suite:** \`${{ inputs.suite }}\`"
+            echo "**Run ID:** \`${{ github.run_id }}\`"
+            echo
+            cat "$SUMMARY"
+          } > "$CUDA_ENGINE_EVAL_OUT/pr-comment.md"
+
+      - name: Upload eval artifacts
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: cuda-engine-prerelease-eval-${{ github.run_id }}
+          path: |
+            ${{ env.CUDA_ENGINE_EVAL_OUT }}/results.csv
+            ${{ env.CUDA_ENGINE_EVAL_OUT }}/summary.md
+            ${{ env.CUDA_ENGINE_EVAL_OUT }}/pr-comment.md
+            ${{ env.CUDA_ENGINE_EVAL_OUT }}/kernels/**/*.json
+            ${{ env.CUDA_ENGINE_EVAL_OUT }}/artifacts/**
+          if-no-files-found: warn
+
+      - name: Comment on associated PR
+        if: |
+          always()
+          && steps.gate.outputs.summary_exists == 'true'
+          && github.event.workflow_run.event == 'pull_request'
+        uses: actions/github-script@v7
+        with:
+          script: |
+            const fs = require('fs');
+            const body = fs.readFileSync(process.env.CUDA_ENGINE_EVAL_OUT + '/pr-comment.md', 'utf8');
+            // Find a PR associated with the ref
+            const prs = await github.rest.pulls.list({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              state: 'open',
+            });
+            const inputRef = '${{ inputs.ref || github.ref }}'.replace('refs/heads/', '');
+            const pr = prs.data.find(p => p.head.ref === inputRef);
+            if (pr) {
+              await github.rest.issues.createComment({
+                owner: context.repo.owner,
+                repo: context.repo.repo,
+                issue_number: pr.number,
+                body,
+              });
+            } else {
+              core.info('No open PR found for ref ' + inputRef + '; skipping comment.');
+            }
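The PR lookup in the workflow's final step can be hard to follow inside an inline script, so here is a minimal Python sketch of the same matching rule. It is an illustration only: `find_pr_for_ref` and the sample PR dictionaries are hypothetical, and the only fields modeled are the ones the script reads (`number` and `head.ref`).

```python
from typing import Optional


def find_pr_for_ref(ref: str, open_prs: list) -> Optional[dict]:
    """Return the first open PR whose head branch equals `ref`, else None."""
    # Mirror the JavaScript String.replace call with a string pattern:
    # only the first occurrence of 'refs/heads/' is removed.
    branch = ref.replace("refs/heads/", "", 1)
    for pr in open_prs:
        if pr["head"]["ref"] == branch:
            return pr
    return None


# Hypothetical sample data, shaped like the fields the workflow script reads.
open_prs = [
    {"number": 7, "head": {"ref": "feature/fast1-lift"}},
    {"number": 9, "head": {"ref": "docs/cost"}},
]

match = find_pr_for_ref("refs/heads/docs/cost", open_prs)
print(match["number"] if match else "no open PR")
```

Because the comparison is against the head branch name, a commit SHA passed as `ref` would not normally match any PR under this rule, and the lookup falls through to the "no open PR" branch.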
@@ -0,0 +1,58 @@
+name: Nightly A100 Eval
+
+on:
+  schedule:
+    - cron: "17 8 * * *"
+  workflow_dispatch:
+
+jobs:
+  a100-eval:
+    name: Integration and internal eval on A100
+    runs-on: [self-hosted, linux, x64, a100]
+    timeout-minutes: 360
+    env:
+      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+      CUDA_ENGINE_EVAL_OUT: evals/results/nightly-${{ github.run_id }}
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Verify GPU tooling
+        run: |
+          nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv,noheader
+          which nvcc
+          nvcc --version
+          which ncu
+          ncu --version
+
+      - name: Install package
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install -e ".[dev]"
+
+      - name: Run integration tests
+        run: |
+          python -m pytest tests/integration -v -s --tb=short -m integration
+
+      - name: Run internal eval suite
+        run: |
+          mkdir -p "$CUDA_ENGINE_EVAL_OUT"
+          cuda-engine eval \
+            --suite internal \
+            --out "$CUDA_ENGINE_EVAL_OUT" \
+            --resume
+
+      - name: Upload eval results
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: cuda-engine-nightly-eval-${{ github.run_id }}
+          path: |
+            evals/results/nightly-${{ github.run_id }}/results.csv
+            evals/results/nightly-${{ github.run_id }}/summary.md
+            evals/results/nightly-${{ github.run_id }}/kernels/**/*.json
+            evals/results/nightly-${{ github.run_id }}/artifacts/**
+          if-no-files-found: warn
@@ -0,0 +1,16 @@
+name: PR
+
+on: [pull_request, push]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - run: pip install -e ".[dev]"
+      - run: ruff check src tests
+      - run: mypy src/
+      - run: pytest tests/unit -v --cov=cuda_engine --cov-report=term-missing -m "not integration"
@@ -0,0 +1,28 @@
+__pycache__/
+*.py[cod]
+.venv/
+.pytest_cache/
+pytest-cache-files-*/
+.mypy_cache/
+.ruff_cache/
+*.egg-info/
+dist/
+build/
+.coverage
+htmlcov/
+.cache/
+.tmp/
+.test_artifacts/
+runs/
+evals/results/
+.evaldiag/
+*.zip
+examples/kernels/*/run_dir/
+*.so
+*.cubin
+*.ptx
+.DS_Store
+.idea/
+.vscode/
+.agentbridge/
+.claude/settings.local.json
@@ -0,0 +1,39 @@
+# Changelog
+
+All notable changes to **cuda-engine** are documented here.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [1.0.0] - 2026-06-28
+
+First public release. Turns a plain-English prompt plus a PyTorch reference
+function into a verified, benchmarked, annotated CUDA kernel through a
+five-stage, Claude-driven agent loop.
+
+### Added
+- **Five-stage synthesis pipeline**: Interview → Codegen → Correctness
+  (hard gate) → Performance (soft gate, Nsight-guided) → Polish.
+- **`cuda-engine` CLI**: kernel synthesis plus a resumable `eval` runner for the
+  internal and KernelBench suites.
+- **LLM backend**: Claude Sonnet 4.6 default with Opus escalation, prompt
+  caching, and tool use.
+- **Service interfaces**: `LLMClient`, `GPURunner`, and `ArtifactStore`, each
+  with a single v1 implementation.
+- **Streamlit demo** (`examples/web_demo.py`) and a Colab quickstart notebook.
+- **Three worked examples**: `rmsnorm_silu_fp16`, `softmax_lastdim_fp16`,
+  `topk_fp32`.
+- **Evaluation suites**: 30 hand-curated internal kernels and a 12-kernel
+  hand-translated KernelBench external subset (no overlap with internal).
+- **Docs**: README quickstart with honest eval numbers, privacy and cost guides.
+
+### Verified (A100, sm_80)
+- **Internal suite**: 30/30 functional, median 1.04× and p25 1.00× vs the
+  fastest torch.compile mode (N=16M), fast_1 24/30 (80%).
+- **KernelBench external subset**: 12/12 functional, median 1.05×, p25 1.03×,
+  fast_1 11/12 (92%).
+
+### Scope
+- v1 targets elementwise ops, simple fused ops, and reductions/scans.
+  GEMM and attention are out of scope.
+
+[1.0.0]: https://github.com/shivnarainms22/Cuda-Engine/releases/tag/v1.0.0
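The "Verified" figures in the changelog above (functional count, median, p25, fast_1) are summary statistics over per-kernel speedups. The sketch below shows one way such a summary could be computed. It is not the package's eval runner: `summarize` and the sample speedups are hypothetical, and it assumes fast_1 means the fraction of all kernels that are functional and strictly faster than the baseline (speedup > 1.0), following the usual KernelBench fast_p convention. The diff does not confirm which threshold or quantile method the package uses.

```python
import statistics
from typing import Optional


def summarize(speedups: list) -> dict:
    """Summarize per-kernel speedups; None marks a kernel that failed correctness."""
    total = len(speedups)
    ok = [s for s in speedups if s is not None]
    # quantiles(n=4) returns the three quartile cut points; index 0 is p25.
    p25 = statistics.quantiles(ok, n=4, method="inclusive")[0]
    return {
        "functional": f"{len(ok)}/{total}",
        "median": statistics.median(ok),
        "p25": p25,
        # Assumption: fast_1 counts kernels with speedup strictly above 1.0,
        # divided by the total number of kernels in the suite.
        "fast_1": sum(1 for s in ok if s > 1.0) / total,
    }


# Hypothetical speedups versus the baseline, not real eval data.
sample: list = [0.9, 1.0, 1.1, 1.2, None]
print(summarize(sample))
```

Under this definition a kernel that exactly ties the baseline does not count toward fast_1, so fast_1 can sit below 100% even when every kernel is functional.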
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Shivnarain
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,266 @@
+Metadata-Version: 2.4
+Name: cuda-engine
+Version: 1.0.0
+Summary: Plain-English -> verified CUDA kernels via Claude-driven agent loop
+Project-URL: Homepage, https://github.com/shivnarainms22/Cuda-Engine
+Project-URL: Repository, https://github.com/shivnarainms22/Cuda-Engine
+Project-URL: Issues, https://github.com/shivnarainms22/Cuda-Engine/issues
+License: MIT
+License-File: LICENSE
+Keywords: claude,code-generation,cuda,gpu,kernel-generation,llm,pytorch
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Environment :: GPU :: NVIDIA CUDA
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Topic :: Scientific/Engineering
+Classifier: Topic :: Software Development :: Code Generators
+Requires-Python: >=3.11
+Requires-Dist: anthropic>=0.40
+Requires-Dist: pydantic>=2.7
+Requires-Dist: pyyaml>=6
+Requires-Dist: rich>=13
+Requires-Dist: torch>=2.4
+Requires-Dist: typer>=0.12
+Provides-Extra: demo
+Requires-Dist: streamlit>=1.36; extra == 'demo'
+Provides-Extra: dev
+Requires-Dist: mypy>=1.10; extra == 'dev'
+Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
+Requires-Dist: pytest-cov>=5; extra == 'dev'
+Requires-Dist: pytest>=8; extra == 'dev'
+Requires-Dist: ruff>=0.6; extra == 'dev'
+Description-Content-Type: text/markdown
+
+# cuda-engine
+
+> Plain English + a slow PyTorch reference → a verified, benchmarked, annotated CUDA kernel.
+
+`cuda-engine` is a Python library and CLI that turns a natural-language description and a reference PyTorch function into a CUDA kernel that compiles, matches the reference within tolerance on a real GPU, and benchmarks against `torch.compile`. It uses Claude (Anthropic) for a 5-stage agent loop (interview → codegen → correctness → performance → polish) with Nsight-driven perf repair and Sonnet→Opus escalation when budgets bust.
|
|
42
|
+
|
|
43
|
+
**Status:** pre-1.0. Implementation is feature-complete through M3; v1.0 release gate (M4) is pending. Internal regression suite, eval runner, nightly CI, and `torch.compile` baseline measurement are all in place. See [docs/milestones/M3-evidence.md](docs/milestones/M3-evidence.md) for the most recent eval results.
|
|
44
|
+
|
|
45
|
+
---
|
|
46
|
+
|
|
47
|
+
## What it does
|
|
48
|
+
|
|
49
|
+
```python
import torch
from cuda_engine import synthesize

def rms_norm(x):
    return x * (x.float().pow(2).mean(dim=-1, keepdim=True) + 1e-5).rsqrt().to(x.dtype)

result = synthesize(
    prompt="Generate a fp16 RMSNorm kernel without gamma over the last dimension.",
    reference=rms_norm,
    target="sm_80",
)

assert result.passed
assert result.correctness.passed                 # verified vs the reference
assert result.performance.below_target is False  # ≥1.0× torch.compile
print(f"Speedup: {result.performance.speedup_vs_torch_compile:.2f}x")
print(f"Kernel:  {result.artifacts_dir}/stage5_polish/final/kernel.cu")
```
|
|
68
|
+
|
|
69
|
+
Each `synthesize()` call produces a run directory under `~/.cache/cuda_engine/runs/<run_id>/` containing every prompt sent, every LLM response, every kernel attempt, the final kernel source, the compiled shared object, and the full synthesis trace.
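As a rough sketch of how one might find the newest run directory and its polished kernel, assuming only the layout mentioned above (`runs/<run_id>/` and `stage5_polish/final/kernel.cu`): the helper names `latest_run` and `final_kernel` are hypothetical and not part of the package, and the demo builds a throwaway tree instead of touching the real cache.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional

def latest_run(runs_root: Path) -> Optional[Path]:
    """Return the most recently modified run directory, or None if there is none."""
    if not runs_root.is_dir():
        return None
    runs = [p for p in runs_root.iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)

def final_kernel(run_dir: Path) -> Path:
    """Location of the polished kernel inside a run directory (path taken from the example above)."""
    return run_dir / "stage5_polish" / "final" / "kernel.cu"

# Demo on a temporary tree; the real root would be
# Path.home() / ".cache" / "cuda_engine" / "runs".
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "runs"
    for i, run_id in enumerate(["run_a", "run_b"]):
        kernel = final_kernel(root / run_id)
        kernel.parent.mkdir(parents=True)
        kernel.write_text("// placeholder kernel\n")
        # Pin modification times so run_b is unambiguously the newest.
        os.utime(root / run_id, (1_000 + i, 1_000 + i))
    newest = latest_run(root)
    print(newest.name, final_kernel(newest).exists())  # run_b True
```

Sorting by modification time is a convenience here; if run IDs encode a timestamp, sorting by name may be more robust.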
|
|
70
|
+
|
|
71
|
+
---
|
|
72
|
+
|
|
73
|
+
## Quickstart
|
|
74
|
+
|
|
75
|
+
### Install
|
|
76
|
+
|
|
77
|
+
Requires Python 3.11+, a CUDA 12.x toolchain (`nvcc`), PyTorch 2.4+, and an A100-class GPU for end-to-end runs.
|
|
78
|
+
|
|
79
|
+
```bash
pip install cuda-engine        # post-v1.0 release
# or, from source:
git clone https://github.com/shivnarainms22/Cuda-Engine.git
cd Cuda-Engine
pip install -e ".[dev]"
```
|
|
86
|
+
|
|
87
|
+
Set your Anthropic key:
|
|
88
|
+
|
|
89
|
+
```bash
export ANTHROPIC_API_KEY=sk-ant-...
```
|
|
92
|
+
|
|
93
|
+
### CLI
|
|
94
|
+
|
|
95
|
+
```bash
# Synthesize a single kernel
cuda-engine synthesize \
  --prompt "Generate a fp16 RMSNorm kernel without gamma over the last dimension." \
  --reference path/to/rms_norm.py \
  --target sm_80

# Inspect a previous run
cuda-engine inspect <run_id>

# Run the internal eval suite (30 kernels)
cuda-engine eval --suite internal --out evals/results/2026-05-12 --resume
```
|
|
108
|
+
|
|
109
|
+
`path/to/rms_norm.py` should define either a top-level `REFERENCE` variable or a top-level `reference()` function.
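To illustrate that contract, here is a minimal sketch of how a loader could resolve either name from a file. The CLI's actual resolution logic is not shown in this README, so `load_reference` and the precedence (`REFERENCE` before `reference()`) are assumptions, and the reference body is a pure-Python stand-in so the example runs without PyTorch.

```python
import importlib.util
import tempfile
from pathlib import Path

def load_reference(path: Path):
    """Import a file and return its REFERENCE variable or reference() function."""
    spec = importlib.util.spec_from_file_location("user_reference", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # Assumption: REFERENCE wins if both names are defined.
    for name in ("REFERENCE", "reference"):
        candidate = getattr(module, name, None)
        if callable(candidate):
            return candidate
    raise AttributeError(f"{path} defines neither REFERENCE nor reference()")

SOURCE = '''
def reference(x):
    # Stand-in body; a real file would use PyTorch ops here.
    return [v * 2 for v in x]
'''

with tempfile.TemporaryDirectory() as tmp:
    ref_file = Path(tmp) / "double.py"
    ref_file.write_text(SOURCE)
    fn = load_reference(ref_file)
    print(fn([1, 2, 3]))  # [2, 4, 6]
```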
|
|
110
|
+
|
|
111
|
+
### Library
|
|
112
|
+
|
|
113
|
+
```python
from cuda_engine import SynthesisConfig, synthesize
from cuda_engine.config import RetryBudgets

result = synthesize(
    prompt="...",
    reference=my_pytorch_fn,
    target="sm_80",
    config=SynthesisConfig(
        retry_budgets=RetryBudgets(codegen=3, performance=2),
        escalate_to_opus_on_bust=True,
        perf_target_speedup_vs_torch_compile=1.0,
    ),
)
```
|
|
128
|
+
|
|
129
|
+
See [`docs/cost.md`](docs/cost.md) for tuning retry budgets to bound API spend.
|
|
130
|
+
|
|
131
|
+
---
|
|
132
|
+
|
|
133
|
+
## How it works
|
|
134
|
+
|
|
135
|
+
```
 prompt + reference.py
           │
           ▼
┌─────────────────────┐
│ Stage 1: Interview  │  → KernelSpec (frozen contract)
└─────────────────────┘
           │
           ▼
┌─────────────────────┐
│ Stage 2: Codegen    │  → kernel.cu + compile.log (hard retry budget)
└─────────────────────┘
           │
           ▼
┌─────────────────────┐    fail → repair via Stage 2
│ Stage 3: Correct.   │ ──────────┐
│ HARD GATE           │           │
└─────────────────────┘           │
           │ pass                 ▼
           ▼  (loop until pass or budget exhausted)
┌─────────────────────┐
│ Stage 4: Perf       │  → benchmark vs torch.compile
│ SOFT GATE           │    Nsight-driven repair loop
│                     │    Sonnet → Opus escalation
└─────────────────────┘
           │
           ▼
┌─────────────────────┐
│ Stage 5: Polish     │  → annotated kernel.cu (re-verified)
└─────────────────────┘
           │
           ▼
 SynthesisResult + run_dir
```
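The control flow in the diagram can be approximated in a few lines. This is a sketch under stated assumptions, not the package's implementation: `run_gates`, `Outcome`, and the four callables are hypothetical, the default budgets simply mirror the library example above, and Opus escalation is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Outcome:
    passed: bool        # False only when the hard gate is never cleared
    below_target: bool  # True when the soft gate was missed
    speedup: float

def run_gates(
    generate: Callable[[int], str],
    is_correct: Callable[[str], bool],
    measure_speedup: Callable[[str], float],
    optimize: Callable[[str], str],
    codegen_budget: int = 3,
    perf_budget: int = 2,
    target: float = 1.0,
) -> Outcome:
    # Hard gate: regenerate until the kernel matches the reference or the budget runs out.
    kernel = None
    for attempt in range(codegen_budget):
        candidate = generate(attempt)
        if is_correct(candidate):
            kernel = candidate
            break
    if kernel is None:
        # No correct kernel, so there is no speedup to report.
        return Outcome(passed=False, below_target=True, speedup=0.0)

    # Soft gate: try to reach the target, but ship the best correct kernel regardless.
    best, best_speedup = kernel, measure_speedup(kernel)
    for _ in range(perf_budget):
        if best_speedup >= target:
            break
        candidate = optimize(best)
        if not is_correct(candidate):
            continue  # never trade correctness for speed
        speedup = measure_speedup(candidate)
        if speedup > best_speedup:
            best, best_speedup = candidate, speedup
    return Outcome(True, best_speedup < target, best_speedup)

# Toy stand-ins: the first attempt is wrong, one optimization pass reaches the target.
speeds = {"v1": 0.9, "v1+opt": 1.1}
outcome = run_gates(
    generate=lambda attempt: f"v{attempt}",
    is_correct=lambda k: not k.startswith("v0"),
    measure_speedup=lambda k: speeds.get(k, 0.5),
    optimize=lambda k: k + "+opt",
)
print(outcome)  # Outcome(passed=True, below_target=False, speedup=1.1)
```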
|
|
169
|
+
|
|
170
|
+
- **Hard gate (Stage 3):** kernels that don't match the reference within tolerance fail outright. No exceptions.
|
|
171
|
+
- **Soft gate (Stage 4):** kernels below the perf target still ship, but with `below_target=True` and a warning. Stage 4 burns its retry budget on Nsight-driven optimizations, then optionally escalates to Opus.
|
|
172
|
+
- **Subprocess isolation:** all GPU work happens in a child subprocess. Crashes in user kernels (segfaults, illegal memory accesses, OOM) don't take down the orchestrator.
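The isolation idea can be sketched with the standard library alone. The helper `run_isolated` is hypothetical and the child here is a plain Python interpreter rather than the package's GPU worker; `os._exit(1)` stands in for a crashing kernel.

```python
import subprocess
import sys

def run_isolated(code: str, timeout_s: float = 30.0) -> tuple[bool, str]:
    """Run `code` in a fresh Python interpreter and report (ok, detail)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0:
        # On POSIX, a negative code means the child was killed by a signal (e.g. SIGSEGV).
        return False, f"child exited with code {proc.returncode}"
    return True, proc.stdout.strip()

print(run_isolated("print('kernel ok')"))      # (True, 'kernel ok')
print(run_isolated("import os; os._exit(1)"))  # (False, 'child exited with code 1')
```

Because the parent only inspects the return code and captured output, a fault in the child cannot corrupt the parent's own process state.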
|
|
173
|
+
|
|
174
|
+
Design document: [`docs/superpowers/specs/2026-04-26-cuda-synthesis-engine-design.md`](docs/superpowers/specs/2026-04-26-cuda-synthesis-engine-design.md).
|
|
175
|
+
|
|
176
|
+
---
|
|
177
|
+
|
|
178
|
+
## Eval results
|
|
179
|
+
|
|
180
|
+
The internal regression suite has 30 hand-curated kernels covering elementwise ops, reductions, and simple fused kernels. All speedups are measured on an A100 (sm_80) against the **fastest** `torch.compile` mode (best of `default` / `max-autotune` / `reduce-overhead`) at N≈16M, so a win means beating torch.compile at its best.
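A small sketch of the best-of-modes comparison described above. The function name and the timings are hypothetical and for illustration only; they are not measured results.

```python
def speedup_vs_best_mode(kernel_ms: float, baseline_ms: dict[str, float]) -> tuple[str, float]:
    """Compare against the fastest baseline mode, i.e. the hardest one to beat."""
    best_mode = min(baseline_ms, key=baseline_ms.get)
    return best_mode, baseline_ms[best_mode] / kernel_ms

# Hypothetical timings in milliseconds.
timings = {"default": 0.50, "max-autotune": 0.40, "reduce-overhead": 0.80}
mode, speedup = speedup_vs_best_mode(0.40, timings)
print(mode, round(speedup, 2))  # max-autotune 1.0
# Dividing by the slowest mode instead would report 0.80 / 0.40 = 2.0x for the same kernel.
```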
|
|
181
|
+
|
|
182
|
+
**Internal suite — 30/30, A100, 2026-06-01** ([M3-evidence.md](docs/milestones/M3-evidence.md)):
|
|
183
|
+
- Functional pass rate: **30/30 (100%)**.
|
|
184
|
+
- Median speedup vs torch.compile: **1.04×**; p25: **1.00×**.
|
|
185
|
+
- **fast_1: 24/30 (80%)** kernels strictly faster than torch.compile.
|
|
186
|
+
- Biggest wins: `topk_fp32` 12.5× (inductor falls back to a slow sort), `masked_mean` 2.6×, `cumulative_max` 1.45×, `softmax_lastdim` 1.33×. As expected, bandwidth-bound elementwise ops sit at parity (torch.compile is already at the HBM roofline), so the wins come from reductions/scans.
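For readers who want to reproduce this kind of summary from their own per-kernel speedups, a minimal sketch follows. The input list is hypothetical, and the quartile method (`inclusive` linear interpolation) is an assumption, since the README does not say how p25 is computed.

```python
import statistics

def summarize(speedups: list[float]) -> dict[str, float]:
    """Median, 25th percentile, and fast_1 rate for functional kernels."""
    p25 = statistics.quantiles(speedups, n=4, method="inclusive")[0]
    fast_1 = sum(1 for s in speedups if s > 1.0)  # strictly faster than the baseline
    return {
        "median": statistics.median(speedups),
        "p25": p25,
        "fast_1_rate": fast_1 / len(speedups),
    }

# Hypothetical speedups, not the suite's real numbers.
stats = summarize([0.98, 1.0, 1.0, 1.04, 1.3, 2.6])
print({k: round(v, 2) for k, v in stats.items()})
# {'median': 1.02, 'p25': 1.0, 'fast_1_rate': 0.5}
```

Note that a kernel at exactly 1.00× does not count toward fast_1, which is why the p25 can sit at parity while the fast_1 rate stays below 100%.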
|
|
187
|
+
|
|
188
|
+
**KernelBench external subset** (12 unseen, in-scope level1 ops): 9/9 functional on the kernels run so far (remaining 3 pending a credit top-up).
|
|
189
|
+
|
|
190
|
+
> An earlier baseline bug measured against `torch.compile`'s *slowest* mode (reduce-overhead) at too-small N, which inflated speedups (one kernel read 9.7× when the honest number is ~parity). Fixed in commit `21f3b2b`; all numbers above use the corrected best-mode baseline.
|
|
191
|
+
|
|
192
|
+
---
|
|
193
|
+
|
|
194
|
+
## Scope
|
|
195
|
+
|
|
196
|
+
### In scope for v1
|
|
197
|
+
- **Kernel categories:** elementwise + simple fused (RMSNorm, layernorm, GELU/SiLU/sigmoid variants, GLU/SwiGLU/GEGLU fusions, dropout-fused) and reductions/scans (sum, mean, argmax, top-k, prefix-sum, masked-mean).
|
|
198
|
+
- **Targets:** codegen for `sm_80` / `sm_90` / `sm_100`; runtime verification on `sm_80` only.
|
|
199
|
+
- **LLM:** Anthropic Claude Sonnet 4.6 default, Opus 4.7 escalation. Prompt caching enabled.
|
|
200
|
+
- **Eval suites:** 30-kernel internal regression + filtered KernelBench subset.
|
|
201
|
+
|
|
202
|
+
### Out of scope for v1
|
|
203
|
+
- GEMM, matmul, attention kernels (CUTLASS and FlashAttention dominate; deferred to v2/v3).
|
|
204
|
+
- Multi-GPU, multi-node, rack-scale orchestration.
|
|
205
|
+
- Formal verification (SMT race-freedom proofs).
|
|
206
|
+
- Cross-LLM-provider support (Anthropic-only behind a single seam).
|
|
207
|
+
- Backward-pass kernel synthesis, autograd custom ops.
|
|
208
|
+
- VS Code / IDE integrations.
|
|
209
|
+
|
|
210
|
+
---
|
|
211
|
+
|
|
212
|
+
## Cost
|
|
213
|
+
|
|
214
|
+
Per-kernel envelope under default config:
|
|
215
|
+
|
|
216
|
+
| Scenario | USD |
|---|---|
| Happy path | ~$0.10–0.20 |
| Typical with retries | ~$0.15–0.40 |
| Hard kernel | ~$0.30–0.80 |
| With Opus escalation | ~$0.80–2.00 |
|
|
222
|
+
|
|
223
|
+
Full eval suite (30 kernels): ~$5–20 depending on retries. See [`docs/cost.md`](docs/cost.md) for the per-stage breakdown and the four config knobs to bound spend.
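As a back-of-the-envelope check on a suite budget, the per-kernel ranges from the table can be combined with an assumed scenario mix. The mix below is hypothetical (the README does not state how the 30 kernels distribute across scenarios); amounts are kept in integer cents to avoid floating-point noise.

```python
# Per-kernel cost ranges from the table above, in cents (low, high).
RANGES_CENTS = {
    "happy": (10, 20),
    "typical": (15, 40),
    "hard": (30, 80),
    "opus": (80, 200),
}

def suite_bounds(mix: dict[str, int]) -> tuple[int, int]:
    """Lower and upper bound on total cost, in cents, for a given scenario mix."""
    low = sum(count * RANGES_CENTS[name][0] for name, count in mix.items())
    high = sum(count * RANGES_CENTS[name][1] for name, count in mix.items())
    return low, high

# Hypothetical mix of 30 kernels.
mix = {"happy": 10, "typical": 12, "hard": 6, "opus": 2}
low, high = suite_bounds(mix)
print(f"${low / 100:.2f} to ${high / 100:.2f}")  # $6.20 to $15.60
```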
|
|
224
|
+
|
|
225
|
+
---
|
|
226
|
+
|
|
227
|
+
## Privacy
|
|
228
|
+
|
|
229
|
+
`cuda-engine` writes full LLM transcripts and reference source code to `~/.cache/cuda_engine/runs/<run_id>/`. No telemetry, no third-party logging. All network traffic is to `api.anthropic.com` over TLS. See [`docs/privacy.md`](docs/privacy.md) for how to keep proprietary references out of artifact directories.
|
|
230
|
+
|
|
231
|
+
---
|
|
232
|
+
|
|
233
|
+
## Examples
|
|
234
|
+
|
|
235
|
+
- [`examples/notebook.ipynb`](examples/notebook.ipynb) — Colab quickstart (5 cells).
|
|
236
|
+
- [`examples/web_demo.py`](examples/web_demo.py) — Streamlit live demo.
|
|
237
|
+
- [`examples/kernels/`](examples/kernels/) — worked examples with prompt, reference, generated kernel, and synthesis report.
|
|
238
|
+
|
|
239
|
+
---
|
|
240
|
+
|
|
241
|
+
## Development
|
|
242
|
+
|
|
243
|
+
```bash
pip install -e ".[dev]"
ruff check src tests evals
mypy src
pytest tests/unit -v
pytest tests/integration -v -m integration   # requires CUDA + ANTHROPIC_API_KEY
```
|
|
250
|
+
|
|
251
|
+
CI:
|
|
252
|
+
- **PR workflow** ([.github/workflows/pr.yml](.github/workflows/pr.yml)) — unit tests, ruff, mypy on every push/PR.
|
|
253
|
+
- **Nightly workflow** ([.github/workflows/nightly.yml](.github/workflows/nightly.yml)) — full integration suite + eval on self-hosted A100, daily cron.
|
|
254
|
+
- **Pre-release workflow** ([.github/workflows/eval.yml](.github/workflows/eval.yml)) — manual trigger, gates v1.0 release.
|
|
255
|
+
|
|
256
|
+
---
|
|
257
|
+
|
|
258
|
+
## License
|
|
259
|
+
|
|
260
|
+
MIT. See [`LICENSE`](LICENSE).
|
|
261
|
+
|
|
262
|
+
---
|
|
263
|
+
|
|
264
|
+
## Acknowledgements
|
|
265
|
+
|
|
266
|
+
Built on top of Anthropic's Claude API, PyTorch's `torch.utils.cpp_extension`, NVIDIA's CUDA toolkit and Nsight Compute. Internal regression kernels draw inspiration from [KernelBench](https://github.com/ScalingIntelligence/KernelBench).
|