cuda-engine 1.0.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (302)
  1. cuda_engine-1.0.0/.github/workflows/eval.yml +123 -0
  2. cuda_engine-1.0.0/.github/workflows/nightly.yml +58 -0
  3. cuda_engine-1.0.0/.github/workflows/pr.yml +16 -0
  4. cuda_engine-1.0.0/.gitignore +28 -0
  5. cuda_engine-1.0.0/CHANGELOG.md +39 -0
  6. cuda_engine-1.0.0/LICENSE +21 -0
  7. cuda_engine-1.0.0/PKG-INFO +266 -0
  8. cuda_engine-1.0.0/README.md +230 -0
  9. cuda_engine-1.0.0/docs/brainstorming-notes.md +116 -0
  10. cuda_engine-1.0.0/docs/colab.md +339 -0
  11. cuda_engine-1.0.0/docs/cost.md +124 -0
  12. cuda_engine-1.0.0/docs/milestones/M2-evidence.md +186 -0
  13. cuda_engine-1.0.0/docs/milestones/M3-evidence.md +535 -0
  14. cuda_engine-1.0.0/docs/milestones/M3-internal-eval-colab-runbook.md +235 -0
  15. cuda_engine-1.0.0/docs/milestones/M3-task-4.3-colab-runbook.md +124 -0
  16. cuda_engine-1.0.0/docs/milestones/M4-evidence.md +69 -0
  17. cuda_engine-1.0.0/docs/privacy.md +73 -0
  18. cuda_engine-1.0.0/docs/superpowers/plans/2026-04-26-cuda-synthesis-engine-plan.md +1757 -0
  19. cuda_engine-1.0.0/docs/superpowers/plans/2026-05-07-stage-escalation-plan.md +1305 -0
  20. cuda_engine-1.0.0/docs/superpowers/plans/2026-05-11-fast1-lift-plan.md +709 -0
  21. cuda_engine-1.0.0/docs/superpowers/plans/2026-05-11-torch-compile-baseline-plan.md +1575 -0
  22. cuda_engine-1.0.0/docs/superpowers/specs/2026-04-26-cuda-synthesis-engine-design.md +456 -0
  23. cuda_engine-1.0.0/docs/superpowers/specs/2026-05-07-stage-escalation-design.md +333 -0
  24. cuda_engine-1.0.0/docs/superpowers/specs/2026-05-11-fast1-lift-design.md +247 -0
  25. cuda_engine-1.0.0/docs/superpowers/specs/2026-05-11-torch-compile-baseline-design.md +412 -0
  26. cuda_engine-1.0.0/evals/__init__.py +1 -0
  27. cuda_engine-1.0.0/evals/internal/add_relu_fp32/notes.md +3 -0
  28. cuda_engine-1.0.0/evals/internal/add_relu_fp32/prompt.txt +1 -0
  29. cuda_engine-1.0.0/evals/internal/add_relu_fp32/reference.py +3 -0
  30. cuda_engine-1.0.0/evals/internal/add_relu_fp32/shapes.yaml +3 -0
  31. cuda_engine-1.0.0/evals/internal/argmax_fp32/notes.md +3 -0
  32. cuda_engine-1.0.0/evals/internal/argmax_fp32/prompt.txt +1 -0
  33. cuda_engine-1.0.0/evals/internal/argmax_fp32/reference.py +3 -0
  34. cuda_engine-1.0.0/evals/internal/argmax_fp32/shapes.yaml +3 -0
  35. cuda_engine-1.0.0/evals/internal/bias_gelu_fp16/notes.md +3 -0
  36. cuda_engine-1.0.0/evals/internal/bias_gelu_fp16/prompt.txt +1 -0
  37. cuda_engine-1.0.0/evals/internal/bias_gelu_fp16/reference.py +3 -0
  38. cuda_engine-1.0.0/evals/internal/bias_gelu_fp16/shapes.yaml +3 -0
  39. cuda_engine-1.0.0/evals/internal/clamp_fp32/notes.md +3 -0
  40. cuda_engine-1.0.0/evals/internal/clamp_fp32/prompt.txt +1 -0
  41. cuda_engine-1.0.0/evals/internal/clamp_fp32/reference.py +3 -0
  42. cuda_engine-1.0.0/evals/internal/clamp_fp32/shapes.yaml +3 -0
  43. cuda_engine-1.0.0/evals/internal/cumulative_max_fp32/notes.md +3 -0
  44. cuda_engine-1.0.0/evals/internal/cumulative_max_fp32/prompt.txt +1 -0
  45. cuda_engine-1.0.0/evals/internal/cumulative_max_fp32/reference.py +3 -0
  46. cuda_engine-1.0.0/evals/internal/cumulative_max_fp32/shapes.yaml +3 -0
  47. cuda_engine-1.0.0/evals/internal/dropout_fp16/notes.md +3 -0
  48. cuda_engine-1.0.0/evals/internal/dropout_fp16/prompt.txt +1 -0
  49. cuda_engine-1.0.0/evals/internal/dropout_fp16/reference.py +3 -0
  50. cuda_engine-1.0.0/evals/internal/dropout_fp16/shapes.yaml +3 -0
  51. cuda_engine-1.0.0/evals/internal/geglu_fp16/notes.md +3 -0
  52. cuda_engine-1.0.0/evals/internal/geglu_fp16/prompt.txt +1 -0
  53. cuda_engine-1.0.0/evals/internal/geglu_fp16/reference.py +3 -0
  54. cuda_engine-1.0.0/evals/internal/geglu_fp16/shapes.yaml +3 -0
  55. cuda_engine-1.0.0/evals/internal/gelu_fp16/notes.md +3 -0
  56. cuda_engine-1.0.0/evals/internal/gelu_fp16/prompt.txt +1 -0
  57. cuda_engine-1.0.0/evals/internal/gelu_fp16/reference.py +3 -0
  58. cuda_engine-1.0.0/evals/internal/gelu_fp16/shapes.yaml +3 -0
  59. cuda_engine-1.0.0/evals/internal/l2_norm_fp32/notes.md +3 -0
  60. cuda_engine-1.0.0/evals/internal/l2_norm_fp32/prompt.txt +1 -0
  61. cuda_engine-1.0.0/evals/internal/l2_norm_fp32/reference.py +3 -0
  62. cuda_engine-1.0.0/evals/internal/l2_norm_fp32/shapes.yaml +3 -0
  63. cuda_engine-1.0.0/evals/internal/layernorm_fp16/notes.md +3 -0
  64. cuda_engine-1.0.0/evals/internal/layernorm_fp16/prompt.txt +1 -0
  65. cuda_engine-1.0.0/evals/internal/layernorm_fp16/reference.py +5 -0
  66. cuda_engine-1.0.0/evals/internal/layernorm_fp16/shapes.yaml +3 -0
  67. cuda_engine-1.0.0/evals/internal/layernorm_silu_fused_fp16/notes.md +3 -0
  68. cuda_engine-1.0.0/evals/internal/layernorm_silu_fused_fp16/prompt.txt +1 -0
  69. cuda_engine-1.0.0/evals/internal/layernorm_silu_fused_fp16/reference.py +6 -0
  70. cuda_engine-1.0.0/evals/internal/layernorm_silu_fused_fp16/shapes.yaml +3 -0
  71. cuda_engine-1.0.0/evals/internal/masked_mean_fp16/notes.md +3 -0
  72. cuda_engine-1.0.0/evals/internal/masked_mean_fp16/prompt.txt +1 -0
  73. cuda_engine-1.0.0/evals/internal/masked_mean_fp16/reference.py +6 -0
  74. cuda_engine-1.0.0/evals/internal/masked_mean_fp16/shapes.yaml +3 -0
  75. cuda_engine-1.0.0/evals/internal/max_lastdim_fp32/notes.md +3 -0
  76. cuda_engine-1.0.0/evals/internal/max_lastdim_fp32/prompt.txt +1 -0
  77. cuda_engine-1.0.0/evals/internal/max_lastdim_fp32/reference.py +3 -0
  78. cuda_engine-1.0.0/evals/internal/max_lastdim_fp32/shapes.yaml +3 -0
  79. cuda_engine-1.0.0/evals/internal/mean_lastdim_fp32/notes.md +3 -0
  80. cuda_engine-1.0.0/evals/internal/mean_lastdim_fp32/prompt.txt +1 -0
  81. cuda_engine-1.0.0/evals/internal/mean_lastdim_fp32/reference.py +3 -0
  82. cuda_engine-1.0.0/evals/internal/mean_lastdim_fp32/shapes.yaml +3 -0
  83. cuda_engine-1.0.0/evals/internal/min_lastdim_fp32/notes.md +3 -0
  84. cuda_engine-1.0.0/evals/internal/min_lastdim_fp32/prompt.txt +1 -0
  85. cuda_engine-1.0.0/evals/internal/min_lastdim_fp32/reference.py +3 -0
  86. cuda_engine-1.0.0/evals/internal/min_lastdim_fp32/shapes.yaml +3 -0
  87. cuda_engine-1.0.0/evals/internal/prefix_sum_fp32/notes.md +3 -0
  88. cuda_engine-1.0.0/evals/internal/prefix_sum_fp32/prompt.txt +1 -0
  89. cuda_engine-1.0.0/evals/internal/prefix_sum_fp32/reference.py +3 -0
  90. cuda_engine-1.0.0/evals/internal/prefix_sum_fp32/shapes.yaml +3 -0
  91. cuda_engine-1.0.0/evals/internal/relu_bias_fp32/notes.md +3 -0
  92. cuda_engine-1.0.0/evals/internal/relu_bias_fp32/prompt.txt +1 -0
  93. cuda_engine-1.0.0/evals/internal/relu_bias_fp32/reference.py +3 -0
  94. cuda_engine-1.0.0/evals/internal/relu_bias_fp32/shapes.yaml +3 -0
  95. cuda_engine-1.0.0/evals/internal/rms_norm_fp16/notes.md +3 -0
  96. cuda_engine-1.0.0/evals/internal/rms_norm_fp16/prompt.txt +1 -0
  97. cuda_engine-1.0.0/evals/internal/rms_norm_fp16/reference.py +3 -0
  98. cuda_engine-1.0.0/evals/internal/rms_norm_fp16/shapes.yaml +3 -0
  99. cuda_engine-1.0.0/evals/internal/rmsnorm_silu_fused_fp16/notes.md +3 -0
  100. cuda_engine-1.0.0/evals/internal/rmsnorm_silu_fused_fp16/prompt.txt +1 -0
  101. cuda_engine-1.0.0/evals/internal/rmsnorm_silu_fused_fp16/reference.py +4 -0
  102. cuda_engine-1.0.0/evals/internal/rmsnorm_silu_fused_fp16/shapes.yaml +3 -0
  103. cuda_engine-1.0.0/evals/internal/scalar_multiply_fp32/notes.md +3 -0
  104. cuda_engine-1.0.0/evals/internal/scalar_multiply_fp32/prompt.txt +1 -0
  105. cuda_engine-1.0.0/evals/internal/scalar_multiply_fp32/reference.py +2 -0
  106. cuda_engine-1.0.0/evals/internal/scalar_multiply_fp32/shapes.yaml +3 -0
  107. cuda_engine-1.0.0/evals/internal/segment_sum_fp32/notes.md +3 -0
  108. cuda_engine-1.0.0/evals/internal/segment_sum_fp32/prompt.txt +1 -0
  109. cuda_engine-1.0.0/evals/internal/segment_sum_fp32/reference.py +4 -0
  110. cuda_engine-1.0.0/evals/internal/segment_sum_fp32/shapes.yaml +3 -0
  111. cuda_engine-1.0.0/evals/internal/sigmoid_mul_fp16/notes.md +3 -0
  112. cuda_engine-1.0.0/evals/internal/sigmoid_mul_fp16/prompt.txt +1 -0
  113. cuda_engine-1.0.0/evals/internal/sigmoid_mul_fp16/reference.py +3 -0
  114. cuda_engine-1.0.0/evals/internal/sigmoid_mul_fp16/shapes.yaml +3 -0
  115. cuda_engine-1.0.0/evals/internal/silu_fp16/notes.md +3 -0
  116. cuda_engine-1.0.0/evals/internal/silu_fp16/prompt.txt +1 -0
  117. cuda_engine-1.0.0/evals/internal/silu_fp16/reference.py +3 -0
  118. cuda_engine-1.0.0/evals/internal/silu_fp16/shapes.yaml +3 -0
  119. cuda_engine-1.0.0/evals/internal/softmax_lastdim_fp16/notes.md +3 -0
  120. cuda_engine-1.0.0/evals/internal/softmax_lastdim_fp16/prompt.txt +1 -0
  121. cuda_engine-1.0.0/evals/internal/softmax_lastdim_fp16/reference.py +3 -0
  122. cuda_engine-1.0.0/evals/internal/softmax_lastdim_fp16/shapes.yaml +3 -0
  123. cuda_engine-1.0.0/evals/internal/softmax_numerator_fp16/notes.md +3 -0
  124. cuda_engine-1.0.0/evals/internal/softmax_numerator_fp16/prompt.txt +1 -0
  125. cuda_engine-1.0.0/evals/internal/softmax_numerator_fp16/reference.py +3 -0
  126. cuda_engine-1.0.0/evals/internal/softmax_numerator_fp16/shapes.yaml +3 -0
  127. cuda_engine-1.0.0/evals/internal/sum_reduction_fp32/notes.md +3 -0
  128. cuda_engine-1.0.0/evals/internal/sum_reduction_fp32/prompt.txt +1 -0
  129. cuda_engine-1.0.0/evals/internal/sum_reduction_fp32/reference.py +3 -0
  130. cuda_engine-1.0.0/evals/internal/sum_reduction_fp32/shapes.yaml +3 -0
  131. cuda_engine-1.0.0/evals/internal/swiglu_fp16/notes.md +3 -0
  132. cuda_engine-1.0.0/evals/internal/swiglu_fp16/prompt.txt +1 -0
  133. cuda_engine-1.0.0/evals/internal/swiglu_fp16/reference.py +3 -0
  134. cuda_engine-1.0.0/evals/internal/swiglu_fp16/shapes.yaml +3 -0
  135. cuda_engine-1.0.0/evals/internal/tanh_add_fp32/notes.md +3 -0
  136. cuda_engine-1.0.0/evals/internal/tanh_add_fp32/prompt.txt +1 -0
  137. cuda_engine-1.0.0/evals/internal/tanh_add_fp32/reference.py +3 -0
  138. cuda_engine-1.0.0/evals/internal/tanh_add_fp32/shapes.yaml +3 -0
  139. cuda_engine-1.0.0/evals/internal/topk_fp32/notes.md +3 -0
  140. cuda_engine-1.0.0/evals/internal/topk_fp32/prompt.txt +1 -0
  141. cuda_engine-1.0.0/evals/internal/topk_fp32/reference.py +3 -0
  142. cuda_engine-1.0.0/evals/internal/topk_fp32/shapes.yaml +3 -0
  143. cuda_engine-1.0.0/evals/internal/vector_add_fp32/notes.md +3 -0
  144. cuda_engine-1.0.0/evals/internal/vector_add_fp32/prompt.txt +1 -0
  145. cuda_engine-1.0.0/evals/internal/vector_add_fp32/reference.py +2 -0
  146. cuda_engine-1.0.0/evals/internal/vector_add_fp32/shapes.yaml +3 -0
  147. cuda_engine-1.0.0/evals/kernelbench/README.md +78 -0
  148. cuda_engine-1.0.0/evals/kernelbench/__init__.py +24 -0
  149. cuda_engine-1.0.0/evals/kernelbench/filter.py +384 -0
  150. cuda_engine-1.0.0/evals/kernelbench/filtered/.gitkeep +0 -0
  151. cuda_engine-1.0.0/evals/kernelbench/filtered/argmin_fp32/notes.md +3 -0
  152. cuda_engine-1.0.0/evals/kernelbench/filtered/argmin_fp32/prompt.txt +1 -0
  153. cuda_engine-1.0.0/evals/kernelbench/filtered/argmin_fp32/reference.py +3 -0
  154. cuda_engine-1.0.0/evals/kernelbench/filtered/argmin_fp32/shapes.yaml +3 -0
  155. cuda_engine-1.0.0/evals/kernelbench/filtered/elu_fp16/notes.md +3 -0
  156. cuda_engine-1.0.0/evals/kernelbench/filtered/elu_fp16/prompt.txt +1 -0
  157. cuda_engine-1.0.0/evals/kernelbench/filtered/elu_fp16/reference.py +3 -0
  158. cuda_engine-1.0.0/evals/kernelbench/filtered/elu_fp16/shapes.yaml +3 -0
  159. cuda_engine-1.0.0/evals/kernelbench/filtered/frobenius_norm_fp32/notes.md +3 -0
  160. cuda_engine-1.0.0/evals/kernelbench/filtered/frobenius_norm_fp32/prompt.txt +1 -0
  161. cuda_engine-1.0.0/evals/kernelbench/filtered/frobenius_norm_fp32/reference.py +3 -0
  162. cuda_engine-1.0.0/evals/kernelbench/filtered/frobenius_norm_fp32/shapes.yaml +3 -0
  163. cuda_engine-1.0.0/evals/kernelbench/filtered/l1_norm_fp32/notes.md +3 -0
  164. cuda_engine-1.0.0/evals/kernelbench/filtered/l1_norm_fp32/prompt.txt +1 -0
  165. cuda_engine-1.0.0/evals/kernelbench/filtered/l1_norm_fp32/reference.py +3 -0
  166. cuda_engine-1.0.0/evals/kernelbench/filtered/l1_norm_fp32/shapes.yaml +3 -0
  167. cuda_engine-1.0.0/evals/kernelbench/filtered/leaky_relu_fp16/notes.md +3 -0
  168. cuda_engine-1.0.0/evals/kernelbench/filtered/leaky_relu_fp16/prompt.txt +1 -0
  169. cuda_engine-1.0.0/evals/kernelbench/filtered/leaky_relu_fp16/reference.py +3 -0
  170. cuda_engine-1.0.0/evals/kernelbench/filtered/leaky_relu_fp16/shapes.yaml +3 -0
  171. cuda_engine-1.0.0/evals/kernelbench/filtered/log_softmax_fp16/notes.md +3 -0
  172. cuda_engine-1.0.0/evals/kernelbench/filtered/log_softmax_fp16/prompt.txt +1 -0
  173. cuda_engine-1.0.0/evals/kernelbench/filtered/log_softmax_fp16/reference.py +3 -0
  174. cuda_engine-1.0.0/evals/kernelbench/filtered/log_softmax_fp16/shapes.yaml +3 -0
  175. cuda_engine-1.0.0/evals/kernelbench/filtered/masked_cumsum_fp32/notes.md +3 -0
  176. cuda_engine-1.0.0/evals/kernelbench/filtered/masked_cumsum_fp32/prompt.txt +1 -0
  177. cuda_engine-1.0.0/evals/kernelbench/filtered/masked_cumsum_fp32/reference.py +3 -0
  178. cuda_engine-1.0.0/evals/kernelbench/filtered/masked_cumsum_fp32/shapes.yaml +3 -0
  179. cuda_engine-1.0.0/evals/kernelbench/filtered/mingpt_gelu_fp16/notes.md +3 -0
  180. cuda_engine-1.0.0/evals/kernelbench/filtered/mingpt_gelu_fp16/prompt.txt +1 -0
  181. cuda_engine-1.0.0/evals/kernelbench/filtered/mingpt_gelu_fp16/reference.py +6 -0
  182. cuda_engine-1.0.0/evals/kernelbench/filtered/mingpt_gelu_fp16/shapes.yaml +3 -0
  183. cuda_engine-1.0.0/evals/kernelbench/filtered/mse_loss_fp32/notes.md +3 -0
  184. cuda_engine-1.0.0/evals/kernelbench/filtered/mse_loss_fp32/prompt.txt +1 -0
  185. cuda_engine-1.0.0/evals/kernelbench/filtered/mse_loss_fp32/reference.py +3 -0
  186. cuda_engine-1.0.0/evals/kernelbench/filtered/mse_loss_fp32/shapes.yaml +3 -0
  187. cuda_engine-1.0.0/evals/kernelbench/filtered/reverse_cumsum_fp32/notes.md +3 -0
  188. cuda_engine-1.0.0/evals/kernelbench/filtered/reverse_cumsum_fp32/prompt.txt +1 -0
  189. cuda_engine-1.0.0/evals/kernelbench/filtered/reverse_cumsum_fp32/reference.py +3 -0
  190. cuda_engine-1.0.0/evals/kernelbench/filtered/reverse_cumsum_fp32/shapes.yaml +3 -0
  191. cuda_engine-1.0.0/evals/kernelbench/filtered/softplus_fp16/notes.md +3 -0
  192. cuda_engine-1.0.0/evals/kernelbench/filtered/softplus_fp16/prompt.txt +1 -0
  193. cuda_engine-1.0.0/evals/kernelbench/filtered/softplus_fp16/reference.py +3 -0
  194. cuda_engine-1.0.0/evals/kernelbench/filtered/softplus_fp16/shapes.yaml +3 -0
  195. cuda_engine-1.0.0/evals/kernelbench/filtered/softsign_fp16/notes.md +3 -0
  196. cuda_engine-1.0.0/evals/kernelbench/filtered/softsign_fp16/prompt.txt +1 -0
  197. cuda_engine-1.0.0/evals/kernelbench/filtered/softsign_fp16/reference.py +3 -0
  198. cuda_engine-1.0.0/evals/kernelbench/filtered/softsign_fp16/shapes.yaml +3 -0
  199. cuda_engine-1.0.0/evals/runner.py +497 -0
  200. cuda_engine-1.0.0/examples/kernels/README.md +63 -0
  201. cuda_engine-1.0.0/examples/kernels/rmsnorm_silu_fp16/notes.md +24 -0
  202. cuda_engine-1.0.0/examples/kernels/rmsnorm_silu_fp16/prompt.txt +1 -0
  203. cuda_engine-1.0.0/examples/kernels/rmsnorm_silu_fp16/reference.py +5 -0
  204. cuda_engine-1.0.0/examples/kernels/rmsnorm_silu_fp16/shapes.yaml +4 -0
  205. cuda_engine-1.0.0/examples/kernels/softmax_lastdim_fp16/notes.md +20 -0
  206. cuda_engine-1.0.0/examples/kernels/softmax_lastdim_fp16/prompt.txt +1 -0
  207. cuda_engine-1.0.0/examples/kernels/softmax_lastdim_fp16/reference.py +3 -0
  208. cuda_engine-1.0.0/examples/kernels/softmax_lastdim_fp16/shapes.yaml +4 -0
  209. cuda_engine-1.0.0/examples/kernels/topk_fp32/notes.md +20 -0
  210. cuda_engine-1.0.0/examples/kernels/topk_fp32/prompt.txt +1 -0
  211. cuda_engine-1.0.0/examples/kernels/topk_fp32/reference.py +3 -0
  212. cuda_engine-1.0.0/examples/kernels/topk_fp32/shapes.yaml +4 -0
  213. cuda_engine-1.0.0/examples/notebook.ipynb +258 -0
  214. cuda_engine-1.0.0/examples/web_demo.py +261 -0
  215. cuda_engine-1.0.0/pyproject.toml +65 -0
  216. cuda_engine-1.0.0/ruff.toml +6 -0
  217. cuda_engine-1.0.0/src/cuda_engine/__init__.py +24 -0
  218. cuda_engine-1.0.0/src/cuda_engine/api.py +39 -0
  219. cuda_engine-1.0.0/src/cuda_engine/cli.py +485 -0
  220. cuda_engine-1.0.0/src/cuda_engine/config.py +32 -0
  221. cuda_engine-1.0.0/src/cuda_engine/models/__init__.py +27 -0
  222. cuda_engine-1.0.0/src/cuda_engine/models/artifact.py +12 -0
  223. cuda_engine-1.0.0/src/cuda_engine/models/reports.py +106 -0
  224. cuda_engine-1.0.0/src/cuda_engine/models/spec.py +45 -0
  225. cuda_engine-1.0.0/src/cuda_engine/orchestrator.py +352 -0
  226. cuda_engine-1.0.0/src/cuda_engine/prompts/__init__.py +8 -0
  227. cuda_engine-1.0.0/src/cuda_engine/prompts/codegen.md +29 -0
  228. cuda_engine-1.0.0/src/cuda_engine/prompts/interview.md +30 -0
  229. cuda_engine-1.0.0/src/cuda_engine/prompts/perf_fix.md +56 -0
  230. cuda_engine-1.0.0/src/cuda_engine/prompts/polish.md +13 -0
  231. cuda_engine-1.0.0/src/cuda_engine/services/__init__.py +1 -0
  232. cuda_engine-1.0.0/src/cuda_engine/services/gpu/__init__.py +3 -0
  233. cuda_engine-1.0.0/src/cuda_engine/services/gpu/_run_kernel_child.py +305 -0
  234. cuda_engine-1.0.0/src/cuda_engine/services/gpu/base.py +88 -0
  235. cuda_engine-1.0.0/src/cuda_engine/services/gpu/local.py +451 -0
  236. cuda_engine-1.0.0/src/cuda_engine/services/gpu/mocks.py +85 -0
  237. cuda_engine-1.0.0/src/cuda_engine/services/llm/__init__.py +3 -0
  238. cuda_engine-1.0.0/src/cuda_engine/services/llm/anthropic.py +71 -0
  239. cuda_engine-1.0.0/src/cuda_engine/services/llm/base.py +35 -0
  240. cuda_engine-1.0.0/src/cuda_engine/services/llm/mocks.py +38 -0
  241. cuda_engine-1.0.0/src/cuda_engine/services/llm/tools.py +64 -0
  242. cuda_engine-1.0.0/src/cuda_engine/services/store/__init__.py +3 -0
  243. cuda_engine-1.0.0/src/cuda_engine/services/store/base.py +24 -0
  244. cuda_engine-1.0.0/src/cuda_engine/services/store/local_dir.py +42 -0
  245. cuda_engine-1.0.0/src/cuda_engine/services/store/mocks.py +27 -0
  246. cuda_engine-1.0.0/src/cuda_engine/stages/__init__.py +1 -0
  247. cuda_engine-1.0.0/src/cuda_engine/stages/base.py +41 -0
  248. cuda_engine-1.0.0/src/cuda_engine/stages/codegen.py +193 -0
  249. cuda_engine-1.0.0/src/cuda_engine/stages/correctness.py +241 -0
  250. cuda_engine-1.0.0/src/cuda_engine/stages/interview.py +117 -0
  251. cuda_engine-1.0.0/src/cuda_engine/stages/performance.py +424 -0
  252. cuda_engine-1.0.0/src/cuda_engine/stages/polish.py +152 -0
  253. cuda_engine-1.0.0/src/cuda_engine/targets/__init__.py +7 -0
  254. cuda_engine-1.0.0/src/cuda_engine/targets/sm_100.py +2 -0
  255. cuda_engine-1.0.0/src/cuda_engine/targets/sm_80.py +18 -0
  256. cuda_engine-1.0.0/src/cuda_engine/targets/sm_90.py +2 -0
  257. cuda_engine-1.0.0/tests/fixtures/ncu_basic_vector_add.csv +50 -0
  258. cuda_engine-1.0.0/tests/integration/test_baseline_torch_compile.py +119 -0
  259. cuda_engine-1.0.0/tests/integration/test_e2e_argmax.py +32 -0
  260. cuda_engine-1.0.0/tests/integration/test_e2e_hard_gate_sad_path.py +42 -0
  261. cuda_engine-1.0.0/tests/integration/test_e2e_perf_loop_escalation.py +89 -0
  262. cuda_engine-1.0.0/tests/integration/test_e2e_rms_norm.py +36 -0
  263. cuda_engine-1.0.0/tests/integration/test_e2e_scalar_multiply.py +31 -0
  264. cuda_engine-1.0.0/tests/integration/test_e2e_sum_reduction.py +32 -0
  265. cuda_engine-1.0.0/tests/integration/test_e2e_vector_add.py +28 -0
  266. cuda_engine-1.0.0/tests/integration/test_local_compile_vector_add.py +24 -0
  267. cuda_engine-1.0.0/tests/integration/test_local_profile_vector_add.py +68 -0
  268. cuda_engine-1.0.0/tests/integration/test_run_kernel_custom_op.py +74 -0
  269. cuda_engine-1.0.0/tests/unit/models/__init__.py +1 -0
  270. cuda_engine-1.0.0/tests/unit/models/test_models.py +78 -0
  271. cuda_engine-1.0.0/tests/unit/services/__init__.py +1 -0
  272. cuda_engine-1.0.0/tests/unit/services/gpu/__init__.py +1 -0
  273. cuda_engine-1.0.0/tests/unit/services/gpu/test_local.py +377 -0
  274. cuda_engine-1.0.0/tests/unit/services/gpu/test_mocks.py +34 -0
  275. cuda_engine-1.0.0/tests/unit/services/gpu/test_ncu_parser.py +93 -0
  276. cuda_engine-1.0.0/tests/unit/services/gpu/test_run_kernel_child.py +205 -0
  277. cuda_engine-1.0.0/tests/unit/services/llm/__init__.py +1 -0
  278. cuda_engine-1.0.0/tests/unit/services/llm/test_anthropic.py +107 -0
  279. cuda_engine-1.0.0/tests/unit/services/llm/test_mocks.py +28 -0
  280. cuda_engine-1.0.0/tests/unit/services/llm/test_tools.py +26 -0
  281. cuda_engine-1.0.0/tests/unit/services/store/__init__.py +1 -0
  282. cuda_engine-1.0.0/tests/unit/services/store/test_local_dir.py +40 -0
  283. cuda_engine-1.0.0/tests/unit/services/store/test_mocks.py +11 -0
  284. cuda_engine-1.0.0/tests/unit/services/test_abcs_exist.py +47 -0
  285. cuda_engine-1.0.0/tests/unit/stages/test_codegen.py +216 -0
  286. cuda_engine-1.0.0/tests/unit/stages/test_correctness.py +368 -0
  287. cuda_engine-1.0.0/tests/unit/stages/test_interview.py +135 -0
  288. cuda_engine-1.0.0/tests/unit/stages/test_performance.py +768 -0
  289. cuda_engine-1.0.0/tests/unit/stages/test_polish.py +113 -0
  290. cuda_engine-1.0.0/tests/unit/stages/test_stages_pass_through.py +20 -0
  291. cuda_engine-1.0.0/tests/unit/test_api.py +50 -0
  292. cuda_engine-1.0.0/tests/unit/test_cli.py +439 -0
  293. cuda_engine-1.0.0/tests/unit/test_config.py +47 -0
  294. cuda_engine-1.0.0/tests/unit/test_eval_runner.py +583 -0
  295. cuda_engine-1.0.0/tests/unit/test_internal_eval_suite.py +85 -0
  296. cuda_engine-1.0.0/tests/unit/test_kernelbench_filter.py +264 -0
  297. cuda_engine-1.0.0/tests/unit/test_kernelbench_suite.py +66 -0
  298. cuda_engine-1.0.0/tests/unit/test_orchestrator.py +501 -0
  299. cuda_engine-1.0.0/tests/unit/test_prompts.py +37 -0
  300. cuda_engine-1.0.0/tests/unit/test_prompts_interview.py +12 -0
  301. cuda_engine-1.0.0/tests/unit/test_targets.py +16 -0
  302. cuda_engine-1.0.0/tests/unit/test_web_demo.py +75 -0
@@ -0,0 +1,123 @@
+ name: Pre-release Eval
+
+ # Manual trigger only. Runs the full internal eval suite on a self-hosted
+ # A100 runner and posts results as both an artifact and a PR comment.
+ # Used as the v1.0 release gate per M4 Task 5.9.
+
+ on:
+   workflow_dispatch:
+     inputs:
+       ref:
+         description: "Branch or commit to evaluate (default: current)"
+         required: false
+         default: ""
+       suite:
+         description: "Eval suite to run"
+         required: false
+         default: "internal"
+         type: choice
+         options:
+           - internal
+           - kernelbench
+
+ permissions:
+   contents: read
+   pull-requests: write
+
+ jobs:
+   pre-release-eval:
+     name: Pre-release eval on A100
+     runs-on: [self-hosted, linux, x64, a100]
+     timeout-minutes: 480
+     env:
+       ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+       CUDA_ENGINE_EVAL_OUT: evals/results/prerelease-${{ github.run_id }}
+     steps:
+       - uses: actions/checkout@v4
+         with:
+           ref: ${{ inputs.ref || github.ref }}
+
+       - uses: actions/setup-python@v5
+         with:
+           python-version: "3.11"
+
+       - name: Verify GPU tooling
+         run: |
+           nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv,noheader
+           which nvcc && nvcc --version
+           which ncu && ncu --version
+
+       - name: Install package
+         run: |
+           python -m pip install --upgrade pip
+           python -m pip install -e ".[dev]"
+
+       - name: Run pre-release eval
+         run: |
+           mkdir -p "$CUDA_ENGINE_EVAL_OUT"
+           cuda-engine eval \
+             --suite ${{ inputs.suite }} \
+             --out "$CUDA_ENGINE_EVAL_OUT" \
+             --resume
+
+       - name: Summarize eval gate
+         id: gate
+         if: always()
+         run: |
+           SUMMARY="$CUDA_ENGINE_EVAL_OUT/summary.md"
+           if [ ! -f "$SUMMARY" ]; then
+             echo "summary_exists=false" >> "$GITHUB_OUTPUT"
+             exit 0
+           fi
+           echo "summary_exists=true" >> "$GITHUB_OUTPUT"
+           {
+             echo "## Pre-release Eval Results"
+             echo
+             echo "**Ref:** \`${{ inputs.ref || github.ref }}\`"
+             echo "**Suite:** \`${{ inputs.suite }}\`"
+             echo "**Run ID:** \`${{ github.run_id }}\`"
+             echo
+             cat "$SUMMARY"
+           } > "$CUDA_ENGINE_EVAL_OUT/pr-comment.md"
+
+       - name: Upload eval artifacts
+         if: always()
+         uses: actions/upload-artifact@v4
+         with:
+           name: cuda-engine-prerelease-eval-${{ github.run_id }}
+           path: |
+             ${{ env.CUDA_ENGINE_EVAL_OUT }}/results.csv
+             ${{ env.CUDA_ENGINE_EVAL_OUT }}/summary.md
+             ${{ env.CUDA_ENGINE_EVAL_OUT }}/pr-comment.md
+             ${{ env.CUDA_ENGINE_EVAL_OUT }}/kernels/**/*.json
+             ${{ env.CUDA_ENGINE_EVAL_OUT }}/artifacts/**
+           if-no-files-found: warn
+
+       - name: Comment on associated PR
+         if: |
+           always()
+           && steps.gate.outputs.summary_exists == 'true'
+           && github.event_name == 'workflow_dispatch'
+         uses: actions/github-script@v7
+         with:
+           script: |
+             const fs = require('fs');
+             const body = fs.readFileSync(process.env.CUDA_ENGINE_EVAL_OUT + '/pr-comment.md', 'utf8');
+             // Find a PR associated with the ref
+             const prs = await github.rest.pulls.list({
+               owner: context.repo.owner,
+               repo: context.repo.repo,
+               state: 'open',
+             });
+             const inputRef = '${{ inputs.ref || github.ref }}'.replace('refs/heads/', '');
+             const pr = prs.data.find(p => p.head.ref === inputRef);
+             if (pr) {
+               await github.rest.issues.createComment({
+                 owner: context.repo.owner,
+                 repo: context.repo.repo,
+                 issue_number: pr.number,
+                 body,
+               });
+             } else {
+               core.info('No open PR found for ref ' + inputRef + '; skipping comment.');
+             }
@@ -0,0 +1,58 @@
+ name: Nightly A100 Eval
+
+ on:
+   schedule:
+     - cron: "17 8 * * *"
+   workflow_dispatch:
+
+ jobs:
+   a100-eval:
+     name: Integration and internal eval on A100
+     runs-on: [self-hosted, linux, x64, a100]
+     timeout-minutes: 360
+     env:
+       ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+       CUDA_ENGINE_EVAL_OUT: evals/results/nightly-${{ github.run_id }}
+     steps:
+       - uses: actions/checkout@v4
+
+       - uses: actions/setup-python@v5
+         with:
+           python-version: "3.11"
+
+       - name: Verify GPU tooling
+         run: |
+           nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv,noheader
+           which nvcc
+           nvcc --version
+           which ncu
+           ncu --version
+
+       - name: Install package
+         run: |
+           python -m pip install --upgrade pip
+           python -m pip install -e ".[dev]"
+
+       - name: Run integration tests
+         run: |
+           python -m pytest tests/integration -v -s --tb=short -m integration
+
+       - name: Run internal eval suite
+         run: |
+           mkdir -p "$CUDA_ENGINE_EVAL_OUT"
+           cuda-engine eval \
+             --suite internal \
+             --out "$CUDA_ENGINE_EVAL_OUT" \
+             --resume
+
+       - name: Upload eval results
+         if: always()
+         uses: actions/upload-artifact@v4
+         with:
+           name: cuda-engine-nightly-eval-${{ github.run_id }}
+           path: |
+             evals/results/nightly-${{ github.run_id }}/results.csv
+             evals/results/nightly-${{ github.run_id }}/summary.md
+             evals/results/nightly-${{ github.run_id }}/kernels/**/*.json
+             evals/results/nightly-${{ github.run_id }}/artifacts/**
+           if-no-files-found: warn
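The schedule `17 8 * * *` in this workflow reads as minute 17, hour 8, every day, and GitHub Actions evaluates cron schedules in UTC. A minimal Python sketch of when such a daily schedule next fires is shown below. `next_daily_run` is a hypothetical helper written for this note, it handles only the fixed "minute hour * * *" shape, and it expects a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(now: datetime, hour: int = 8, minute: int = 17) -> datetime:
    """Next firing time (strictly after `now`) of a 'minute hour * * *' cron, in UTC."""
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    print(next_daily_run(datetime.now(timezone.utc)).isoformat())
```

The computed time is the earliest possible start. Scheduled workflows can be delayed when the service is busy, so the real start may be later.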
@@ -0,0 +1,16 @@
+ name: PR
+
+ on: [pull_request, push]
+
+ jobs:
+   test:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+       - uses: actions/setup-python@v5
+         with:
+           python-version: "3.11"
+       - run: pip install -e ".[dev]"
+       - run: ruff check src tests
+       - run: mypy src/
+       - run: pytest tests/unit -v --cov=cuda_engine --cov-report=term-missing -m "not integration"
@@ -0,0 +1,28 @@
+ __pycache__/
+ *.py[cod]
+ .venv/
+ .pytest_cache/
+ pytest-cache-files-*/
+ .mypy_cache/
+ .ruff_cache/
+ *.egg-info/
+ dist/
+ build/
+ .coverage
+ htmlcov/
+ .cache/
+ .tmp/
+ .test_artifacts/
+ runs/
+ evals/results/
+ .evaldiag/
+ *.zip
+ examples/kernels/*/run_dir/
+ *.so
+ *.cubin
+ *.ptx
+ .DS_Store
+ .idea/
+ .vscode/
+ .agentbridge/
+ .claude/settings.local.json
@@ -0,0 +1,39 @@
+ # Changelog
+
+ All notable changes to **cuda-engine** are documented here.
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+ ## [1.0.0] - 2026-06-28
+
+ First public release. Turns a plain-English prompt plus a PyTorch reference
+ function into a verified, benchmarked, annotated CUDA kernel through a
+ five-stage, Claude-driven agent loop.
+
+ ### Added
+ - **Five-stage synthesis pipeline**: Interview → Codegen → Correctness
+   (hard gate) → Performance (soft gate, Nsight-guided) → Polish.
+ - **`cuda-engine` CLI**: kernel synthesis plus a resumable `eval` runner for the
+   internal and KernelBench suites.
+ - **LLM backend**: Claude Sonnet 4.6 default with Opus escalation, prompt
+   caching, and tool use.
+ - **Service interfaces**: `LLMClient`, `GPURunner`, and `ArtifactStore`, each
+   with a single v1 implementation.
+ - **Streamlit demo** (`examples/web_demo.py`) and a Colab quickstart notebook.
+ - **Three worked examples**: `rmsnorm_silu_fp16`, `softmax_lastdim_fp16`,
+   `topk_fp32`.
+ - **Evaluation suites**: 30 hand-curated internal kernels and a 12-kernel
+   hand-translated KernelBench external subset (no overlap with internal).
+ - **Docs**: README quickstart with honest eval numbers, privacy and cost guides.
+
+ ### Verified (A100, sm_80)
+ - **Internal suite**: 30/30 functional, median 1.04× and p25 1.00× vs the
+   fastest torch.compile mode (N=16M), fast_1 24/30 (80%).
+ - **KernelBench external subset**: 12/12 functional, median 1.05×, p25 1.03×,
+   fast_1 11/12 (92%).
+
+ ### Scope
+ - v1 targets elementwise ops, simple fused ops, and reductions/scans.
+   GEMM and attention are out of scope.
+
+ [1.0.0]: https://github.com/shivnarainms22/Cuda-Engine/releases/tag/v1.0.0
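The "Verified" figures in this changelog (functional count, median, p25, fast_1) are summary statistics over per-kernel speedups against the baseline. The sketch below shows one way such a summary could be computed. It is not the package's eval runner: `summarize` is a hypothetical function, the sample numbers are made up, and it assumes fast_1 follows the KernelBench fast_p convention (the fraction of all kernels that are functionally correct and strictly faster than the baseline).

```python
from statistics import median, quantiles


def summarize(speedups: list[float], correct: list[bool]) -> dict[str, float]:
    """Summarize per-kernel speedups (baseline_time / kernel_time)."""
    total = len(speedups)
    # Only functionally correct kernels contribute to the speed statistics.
    ok = [s for s, c in zip(speedups, correct) if c]
    return {
        "functional": len(ok) / total,
        "median": median(ok),
        # First quartile, with linear interpolation between data points.
        "p25": quantiles(ok, n=4, method="inclusive")[0],
        # Assumed definition: correct AND speedup > 1.0, over all kernels.
        "fast_1": sum(1 for s in ok if s > 1.0) / total,
    }


if __name__ == "__main__":
    # Made-up speedups for four kernels, all functionally correct.
    stats = summarize([0.9, 1.0, 1.1, 1.3], [True, True, True, True])
    for name, value in stats.items():
        print(f"{name}: {value:.3f}")
```

Under this definition a kernel that exactly ties the baseline does not count toward fast_1, which is one way a suite could report a p25 of 1.00× alongside a fast_1 below 100%.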
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Shivnarain
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,266 @@
+ Metadata-Version: 2.4
+ Name: cuda-engine
+ Version: 1.0.0
+ Summary: Plain-English -> verified CUDA kernels via Claude-driven agent loop
+ Project-URL: Homepage, https://github.com/shivnarainms22/Cuda-Engine
+ Project-URL: Repository, https://github.com/shivnarainms22/Cuda-Engine
+ Project-URL: Issues, https://github.com/shivnarainms22/Cuda-Engine/issues
+ License: MIT
+ License-File: LICENSE
+ Keywords: claude,code-generation,cuda,gpu,kernel-generation,llm,pytorch
+ Classifier: Development Status :: 5 - Production/Stable
+ Classifier: Environment :: GPU :: NVIDIA CUDA
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Topic :: Scientific/Engineering
+ Classifier: Topic :: Software Development :: Code Generators
+ Requires-Python: >=3.11
+ Requires-Dist: anthropic>=0.40
+ Requires-Dist: pydantic>=2.7
+ Requires-Dist: pyyaml>=6
+ Requires-Dist: rich>=13
+ Requires-Dist: torch>=2.4
+ Requires-Dist: typer>=0.12
+ Provides-Extra: demo
+ Requires-Dist: streamlit>=1.36; extra == 'demo'
+ Provides-Extra: dev
+ Requires-Dist: mypy>=1.10; extra == 'dev'
+ Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
+ Requires-Dist: pytest-cov>=5; extra == 'dev'
+ Requires-Dist: pytest>=8; extra == 'dev'
+ Requires-Dist: ruff>=0.6; extra == 'dev'
+ Description-Content-Type: text/markdown
+
+ # cuda-engine
+
+ > Plain English + a slow PyTorch reference → a verified, benchmarked, annotated CUDA kernel.
+
+ `cuda-engine` is a Python library and CLI that turns a natural-language description and a reference PyTorch function into a CUDA kernel that compiles, matches the reference within tolerance on a real GPU, and benchmarks against `torch.compile`. It uses Claude (Anthropic) for a 5-stage agent loop (interview → codegen → correctness → performance → polish) with Nsight-driven perf repair and Sonnet→Opus escalation when budgets bust.
+
43
+ **Status:** pre-1.0. Implementation is feature-complete through M3; v1.0 release gate (M4) is pending. Internal regression suite, eval runner, nightly CI, and `torch.compile` baseline measurement are all in place. See [docs/milestones/M3-evidence.md](docs/milestones/M3-evidence.md) for the most recent eval results.
44
+
45
+ ---
46
+
47
+ ## What it does
48
+
49
+ ```python
+ import torch
+ from cuda_engine import synthesize
+
+ def rms_norm(x):
+     return x * (x.float().pow(2).mean(dim=-1, keepdim=True) + 1e-5).rsqrt().to(x.dtype)
+
+ result = synthesize(
+     prompt="Generate a fp16 RMSNorm kernel without gamma over the last dimension.",
+     reference=rms_norm,
+     target="sm_80",
+ )
+
+ assert result.passed
+ assert result.correctness.passed  # verified vs the reference
+ assert result.performance.below_target is False  # ≥1.0× torch.compile
+ print(f"Speedup: {result.performance.speedup_vs_torch_compile:.2f}x")
+ print(f"Kernel: {result.artifacts_dir}/stage5_polish/final/kernel.cu")
+ ```
68
+
69
+ Each `synthesize()` call produces a run directory under `~/.cache/cuda_engine/runs/<run_id>/` containing every prompt sent, every LLM response, every kernel attempt, the final kernel source, the compiled shared object, and the full synthesis trace.
70
+
71
+ ---
72
+
73
+ ## Quickstart
74
+
75
+ ### Install
76
+
77
+ Requires Python 3.11+, a CUDA 12.x toolchain (`nvcc`), PyTorch 2.4+, and an A100-class GPU for end-to-end runs.
78
+
79
+ ```bash
+ pip install cuda-engine  # post-v1.0 release
+ # or, from source:
+ git clone https://github.com/shivnarainms22/Cuda-Engine.git
+ cd Cuda-Engine
+ pip install -e ".[dev]"
+ ```
86
+
87
+ Set your Anthropic key:
88
+
89
+ ```bash
+ export ANTHROPIC_API_KEY=sk-ant-...
+ ```
92
+
93
+ ### CLI
94
+
95
+ ```bash
+ # Synthesize a single kernel
+ cuda-engine synthesize \
+   --prompt "Generate a fp16 RMSNorm kernel without gamma over the last dimension." \
+   --reference path/to/rms_norm.py \
+   --target sm_80
+
+ # Inspect a previous run
+ cuda-engine inspect <run_id>
+
+ # Run the internal eval suite (30 kernels)
+ cuda-engine eval --suite internal --out evals/results/2026-05-12 --resume
+ ```
108
+
109
+ `path/to/rms_norm.py` should define either a top-level `REFERENCE` variable or a top-level `reference()` function.
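The convention above (a top-level `REFERENCE` variable or a top-level `reference()` function) can be illustrated with a small loader. This is a minimal sketch under stated assumptions, not cuda-engine's actual loader: the helper name `load_reference`, the module alias `user_reference`, and the choice to check `REFERENCE` before `reference` are all hypothetical. The demo file is torch-free so the sketch runs without a GPU.

```python
import importlib.util
import tempfile
from pathlib import Path
from typing import Callable


def load_reference(path: str) -> Callable:
    """Hypothetical loader: return REFERENCE if it is callable, else reference()."""
    spec = importlib.util.spec_from_file_location("user_reference", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot import {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    # Assumed precedence: REFERENCE first, then reference().
    for name in ("REFERENCE", "reference"):
        candidate = getattr(module, name, None)
        if callable(candidate):
            return candidate
    raise AttributeError(f"{path} defines neither REFERENCE nor reference()")


# Demo with a stand-in reference file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    ref_path = Path(tmp) / "double.py"
    ref_path.write_text("def reference(x):\n    return 2 * x\n")
    fn = load_reference(str(ref_path))
    print(fn(21))
```

In a real reference file the function body would be the PyTorch implementation, for example the `rms_norm` shown earlier bound to the name `reference`.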
110
+
111
+ ### Library
112
+
113
+ ```python
+ from cuda_engine import SynthesisConfig, synthesize
+ from cuda_engine.config import RetryBudgets
+
+ result = synthesize(
+     prompt="...",
+     reference=my_pytorch_fn,
+     target="sm_80",
+     config=SynthesisConfig(
+         retry_budgets=RetryBudgets(codegen=3, performance=2),
+         escalate_to_opus_on_bust=True,
+         perf_target_speedup_vs_torch_compile=1.0,
+     ),
+ )
+ ```
128
+
129
+ See [`docs/cost.md`](docs/cost.md) for tuning retry budgets to bound API spend.
130
+
131
+ ---
132
+
133
+ ## How it works
134
+
135
+ ```
+   prompt + reference.py
+            │
+            ▼
+   ┌─────────────────────┐
+   │ Stage 1: Interview  │ → KernelSpec (frozen contract)
+   └─────────────────────┘
+            │
+            ▼
+   ┌─────────────────────┐
+   │ Stage 2: Codegen    │ → kernel.cu + compile.log (hard retry budget)
+   └─────────────────────┘
+            │
+            ▼
+   ┌─────────────────────┐   fail → repair via Stage 2
+   │ Stage 3: Correct.   │ ──────────┐
+   │   HARD GATE         │           │
+   └─────────────────────┘           │
+            │ pass                   ▼
+            ▼              (loop until pass or budget exhausted)
+   ┌─────────────────────┐
+   │ Stage 4: Perf       │ → benchmark vs torch.compile
+   │   SOFT GATE         │   Nsight-driven repair loop
+   │                     │   Sonnet → Opus escalation
+   └─────────────────────┘
+            │
+            ▼
+   ┌─────────────────────┐
+   │ Stage 5: Polish     │ → annotated kernel.cu (re-verified)
+   └─────────────────────┘
+            │
+            ▼
+   SynthesisResult + run_dir
+ ```
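The gating in the diagram above can be sketched as plain control flow. This is an illustrative sketch, not the engine's orchestrator: `Outcome`, `run_gates`, and the three callables are hypothetical stand-ins, and the Stage 4 repair loop and Opus escalation are left out. It only shows that correctness blocks the result while performance merely sets a flag.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    passed: bool                  # hard gate (correctness) result
    below_target: Optional[bool]  # soft gate flag; None if never benchmarked
    attempts: int                 # codegen attempts consumed


def run_gates(
    generate: Callable[[int], str],
    is_correct: Callable[[str], bool],
    speedup: Callable[[str], float],
    codegen_budget: int = 3,
    perf_target: float = 1.0,
) -> Outcome:
    """Retry codegen until the kernel is correct or the budget is exhausted."""
    for attempt in range(1, codegen_budget + 1):
        kernel = generate(attempt)
        if is_correct(kernel):
            # Soft gate: a slow kernel still ships, flagged as below target.
            return Outcome(True, speedup(kernel) < perf_target, attempt)
    # Hard gate: no correct kernel within budget, so nothing is benchmarked.
    return Outcome(False, None, codegen_budget)


# Stub run: the second attempt is correct but reaches only 0.9x the baseline.
outcome = run_gates(
    generate=lambda attempt: f"kernel_v{attempt}",
    is_correct=lambda kernel: kernel == "kernel_v2",
    speedup=lambda kernel: 0.9,
)
print(outcome)
```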
169
+
170
+ - **Hard gate (Stage 3):** kernels that don't match the reference within tolerance fail outright. No exceptions.
171
+ - **Soft gate (Stage 4):** kernels below the perf target still ship, but with `below_target=True` and a warning. Stage 4 burns its retry budget on Nsight-driven optimizations, then optionally escalates to Opus.
172
+ - **Subprocess isolation:** all GPU work happens in a child subprocess. Crashes in user kernels (segfaults, illegal memory accesses, OOM) don't take down the orchestrator.
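The subprocess-isolation point can be demonstrated with the standard library alone. The sketch below is an assumption about the general technique, not the engine's actual runner: `run_isolated` is a hypothetical helper, and `os.abort()` stands in for a kernel crash so the example needs no GPU.

```python
import subprocess
import sys

# Stand-in for a crashing GPU job: the child process aborts abnormally.
CRASHING_CHILD = "import os; os.abort()"
HEALTHY_CHILD = "print('ok')"


def run_isolated(code: str, timeout_s: float = 30.0) -> tuple[bool, str]:
    """Run `code` in a fresh interpreter; report success and captured stdout."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired on a hang
    )
    # A crash shows up as a nonzero (or negative, signal-based) return code.
    return proc.returncode == 0, proc.stdout.strip()


crashed_ok, _ = run_isolated(CRASHING_CHILD)
healthy_ok, healthy_out = run_isolated(HEALTHY_CHILD)
print(crashed_ok, healthy_ok, healthy_out)  # the parent survives both runs
```

The parent only ever sees a return code and captured output, so a fault in the child cannot corrupt the parent's own process state.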
173
+
174
+ Design document: [`docs/superpowers/specs/2026-04-26-cuda-synthesis-engine-design.md`](docs/superpowers/specs/2026-04-26-cuda-synthesis-engine-design.md).
175
+
176
+ ---
177
+
178
+ ## Eval results
179
+
180
+ The internal regression suite has 30 hand-curated kernels covering elementwise ops, reductions, and simple fused kernels. All speedups are measured on an A100 (sm_80) against the **fastest** `torch.compile` mode (best of `default` / `max-autotune` / `reduce-overhead`) at N≈16M, so a win means beating torch.compile at its best.
181
+
182
+ **Internal suite — 30/30, A100, 2026-06-01** ([M3-evidence.md](docs/milestones/M3-evidence.md)):
183
+ - Functional pass rate: **30/30 (100%)**.
184
+ - Median speedup vs torch.compile: **1.04×**; p25: **1.00×**.
185
+ - **fast_1: 24/30 (80%)** kernels strictly faster than torch.compile.
186
+ - Biggest wins: `topk_fp32` 12.5× (inductor falls back to a slow sort), `masked_mean` 2.6×, `cumulative_max` 1.45×, `softmax_lastdim` 1.33×. As expected, bandwidth-bound elementwise ops sit at parity (torch.compile is already at the HBM roofline); the wins come from reductions/scans.
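For readers unfamiliar with the summary statistics in the list above, here is a minimal sketch of how median, p25, and fast_1 could be computed from per-kernel speedups. The `summarize` helper and the sample values are made up for illustration (they are not eval data), and the inclusive quantile method is an assumption, since the percentile convention used by the eval runner is not stated here.

```python
import statistics


def summarize(speedups: list[float]) -> dict[str, float]:
    """Median, 25th percentile, and share of kernels strictly above 1.0x."""
    quartiles = statistics.quantiles(speedups, n=4, method="inclusive")
    return {
        "median": statistics.median(speedups),
        "p25": quartiles[0],
        # fast_1 counts only strict wins; exact parity (1.00x) does not count.
        "fast_1": sum(s > 1.0 for s in speedups) / len(speedups),
    }


sample = [0.98, 1.00, 1.02, 1.06, 1.33]  # hypothetical speedups
stats = summarize(sample)
print(stats)
```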
187
+
188
+ **KernelBench external subset** (12 unseen, in-scope level1 ops): 9/9 functional on the kernels run so far (remaining 3 pending a credit top-up).
189
+
190
+ > An earlier bug in the baseline measurement compared against `torch.compile`'s *slowest* mode (reduce-overhead) at too small an N, which inflated speedups (one kernel read 9.7× when the honest number is roughly parity). Fixed in commit `21f3b2b`; all numbers above use the corrected best-mode baseline.
191
+
192
+ ---
193
+
194
+ ## Scope
195
+
196
+ ### In scope for v1
197
+ - **Kernel categories:** elementwise + simple fused (RMSNorm, layernorm, GELU/SiLU/sigmoid variants, GLU/SwiGLU/GEGLU fusions, dropout-fused) and reductions/scans (sum, mean, argmax, top-k, prefix-sum, masked-mean).
198
+ - **Targets:** codegen for `sm_80` / `sm_90` / `sm_100`; runtime verification on `sm_80` only.
199
+ - **LLM:** Anthropic Claude Sonnet 4.6 default, Opus 4.7 escalation. Prompt caching enabled.
200
+ - **Eval suites:** 30-kernel internal regression + filtered KernelBench subset.
201
+
202
+ ### Out of scope for v1
203
+ - GEMM, matmul, attention kernels (CUTLASS and FlashAttention dominate; deferred to v2/v3).
204
+ - Multi-GPU, multi-node, rack-scale orchestration.
205
+ - Formal verification (SMT race-freedom proofs).
206
+ - Cross-LLM-provider support (Anthropic-only behind a single seam).
207
+ - Backward-pass kernel synthesis, autograd custom ops.
208
+ - VS Code / IDE integrations.
209
+
210
+ ---
211
+
212
+ ## Cost
213
+
214
+ Per-kernel envelope under default config:
215
+
216
+ | Scenario | USD |
217
+ |---|---|
218
+ | Happy path | ~$0.10–0.20 |
219
+ | Typical with retries | ~$0.15–0.40 |
220
+ | Hard kernel | ~$0.30–0.80 |
221
+ | With Opus escalation | ~$0.80–2.00 |
222
+
223
+ Full eval suite (30 kernels): ~$5–20 depending on retries. See [`docs/cost.md`](docs/cost.md) for the per-stage breakdown and the four config knobs to bound spend.
224
+
225
+ ---
226
+
227
+ ## Privacy
228
+
229
+ `cuda-engine` writes full LLM transcripts and reference source code to `~/.cache/cuda_engine/runs/<run_id>/`. No telemetry, no third-party logging. All network traffic is to `api.anthropic.com` over TLS. See [`docs/privacy.md`](docs/privacy.md) for how to keep proprietary references out of artifact directories.
230
+
231
+ ---
232
+
233
+ ## Examples
234
+
235
+ - [`examples/notebook.ipynb`](examples/notebook.ipynb) — Colab quickstart (5 cells).
236
+ - [`examples/web_demo.py`](examples/web_demo.py) — Streamlit live demo.
237
+ - [`examples/kernels/`](examples/kernels/) — worked examples with prompt, reference, generated kernel, and synthesis report.
238
+
239
+ ---
240
+
241
+ ## Development
242
+
243
+ ```bash
+ pip install -e ".[dev]"
+ ruff check src tests evals
+ mypy src
+ pytest tests/unit -v
+ pytest tests/integration -v -m integration  # requires CUDA + ANTHROPIC_API_KEY
+ ```
250
+
251
+ CI:
252
+ - **PR workflow** ([.github/workflows/pr.yml](.github/workflows/pr.yml)) — unit tests, ruff, mypy on every push/PR.
253
+ - **Nightly workflow** ([.github/workflows/nightly.yml](.github/workflows/nightly.yml)) — full integration suite + eval on self-hosted A100, daily cron.
254
+ - **Pre-release workflow** ([.github/workflows/eval.yml](.github/workflows/eval.yml)) — manual trigger, gates v1.0 release.
255
+
256
+ ---
257
+
258
+ ## License
259
+
260
+ MIT. See [`LICENSE`](LICENSE).
261
+
262
+ ---
263
+
264
+ ## Acknowledgements
265
+
266
+ Built on top of Anthropic's Claude API, PyTorch's `torch.utils.cpp_extension`, NVIDIA's CUDA toolkit and Nsight Compute. Internal regression kernels draw inspiration from [KernelBench](https://github.com/ScalingIntelligence/KernelBench).