skypilot-nightly 1.0.0.dev20251203__py3-none-any.whl → 1.0.0.dev20260112__py3-none-any.whl

Files changed (245)
  1. sky/__init__.py +6 -2
  2. sky/adaptors/aws.py +1 -61
  3. sky/adaptors/slurm.py +565 -0
  4. sky/backends/backend_utils.py +95 -12
  5. sky/backends/cloud_vm_ray_backend.py +224 -65
  6. sky/backends/task_codegen.py +380 -4
  7. sky/catalog/__init__.py +0 -3
  8. sky/catalog/data_fetchers/fetch_gcp.py +9 -1
  9. sky/catalog/data_fetchers/fetch_nebius.py +1 -1
  10. sky/catalog/data_fetchers/fetch_vast.py +4 -2
  11. sky/catalog/kubernetes_catalog.py +12 -4
  12. sky/catalog/seeweb_catalog.py +30 -15
  13. sky/catalog/shadeform_catalog.py +5 -2
  14. sky/catalog/slurm_catalog.py +236 -0
  15. sky/catalog/vast_catalog.py +30 -6
  16. sky/check.py +25 -11
  17. sky/client/cli/command.py +391 -32
  18. sky/client/interactive_utils.py +190 -0
  19. sky/client/sdk.py +64 -2
  20. sky/client/sdk_async.py +9 -0
  21. sky/clouds/__init__.py +2 -0
  22. sky/clouds/aws.py +60 -2
  23. sky/clouds/azure.py +2 -0
  24. sky/clouds/cloud.py +7 -0
  25. sky/clouds/kubernetes.py +2 -0
  26. sky/clouds/runpod.py +38 -7
  27. sky/clouds/slurm.py +610 -0
  28. sky/clouds/ssh.py +3 -2
  29. sky/clouds/vast.py +39 -16
  30. sky/core.py +197 -37
  31. sky/dashboard/out/404.html +1 -1
  32. sky/dashboard/out/_next/static/3nu-b8raeKRNABZ2d4GAG/_buildManifest.js +1 -0
  33. sky/dashboard/out/_next/static/chunks/1871-0565f8975a7dcd10.js +6 -0
  34. sky/dashboard/out/_next/static/chunks/2109-55a1546d793574a7.js +11 -0
  35. sky/dashboard/out/_next/static/chunks/2521-099b07cd9e4745bf.js +26 -0
  36. sky/dashboard/out/_next/static/chunks/2755.a636e04a928a700e.js +31 -0
  37. sky/dashboard/out/_next/static/chunks/3495.05eab4862217c1a5.js +6 -0
  38. sky/dashboard/out/_next/static/chunks/3785.cfc5dcc9434fd98c.js +1 -0
  39. sky/dashboard/out/_next/static/chunks/3850-fd5696f3bbbaddae.js +1 -0
  40. sky/dashboard/out/_next/static/chunks/3981.645d01bf9c8cad0c.js +21 -0
  41. sky/dashboard/out/_next/static/chunks/4083-0115d67c1fb57d6c.js +21 -0
  42. sky/dashboard/out/_next/static/chunks/{8640.5b9475a2d18c5416.js → 429.a58e9ba9742309ed.js} +2 -2
  43. sky/dashboard/out/_next/static/chunks/4555.8e221537181b5dc1.js +6 -0
  44. sky/dashboard/out/_next/static/chunks/4725.937865b81fdaaebb.js +6 -0
  45. sky/dashboard/out/_next/static/chunks/6082-edabd8f6092300ce.js +25 -0
  46. sky/dashboard/out/_next/static/chunks/6989-49cb7dca83a7a62d.js +1 -0
  47. sky/dashboard/out/_next/static/chunks/6990-630bd2a2257275f8.js +1 -0
  48. sky/dashboard/out/_next/static/chunks/7248-a99800d4db8edabd.js +1 -0
  49. sky/dashboard/out/_next/static/chunks/754-cfc5d4ad1b843d29.js +18 -0
  50. sky/dashboard/out/_next/static/chunks/8050-dd8aa107b17dce00.js +16 -0
  51. sky/dashboard/out/_next/static/chunks/8056-d4ae1e0cb81e7368.js +1 -0
  52. sky/dashboard/out/_next/static/chunks/8555.011023e296c127b3.js +6 -0
  53. sky/dashboard/out/_next/static/chunks/8821-93c25df904a8362b.js +1 -0
  54. sky/dashboard/out/_next/static/chunks/8969-0662594b69432ade.js +1 -0
  55. sky/dashboard/out/_next/static/chunks/9025.f15c91c97d124a5f.js +6 -0
  56. sky/dashboard/out/_next/static/chunks/9353-7ad6bd01858556f1.js +1 -0
  57. sky/dashboard/out/_next/static/chunks/pages/_app-5a86569acad99764.js +34 -0
  58. sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]/[job]-8297476714acb4ac.js +6 -0
  59. sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]-337c3ba1085f1210.js +1 -0
  60. sky/dashboard/out/_next/static/chunks/pages/{clusters-ee39056f9851a3ff.js → clusters-57632ff3684a8b5c.js} +1 -1
  61. sky/dashboard/out/_next/static/chunks/pages/{config-dfb9bf07b13045f4.js → config-718cdc365de82689.js} +1 -1
  62. sky/dashboard/out/_next/static/chunks/pages/infra/[context]-5fd3a453c079c2ea.js +1 -0
  63. sky/dashboard/out/_next/static/chunks/pages/infra-9f85c02c9c6cae9e.js +1 -0
  64. sky/dashboard/out/_next/static/chunks/pages/jobs/[job]-90f16972cbecf354.js +1 -0
  65. sky/dashboard/out/_next/static/chunks/pages/jobs/pools/[pool]-2dd42fc37aad427a.js +16 -0
  66. sky/dashboard/out/_next/static/chunks/pages/jobs-ed806aeace26b972.js +1 -0
  67. sky/dashboard/out/_next/static/chunks/pages/plugins/[...slug]-449a9f5a3bb20fb3.js +1 -0
  68. sky/dashboard/out/_next/static/chunks/pages/users-bec34706b36f3524.js +1 -0
  69. sky/dashboard/out/_next/static/chunks/pages/{volumes-b84b948ff357c43e.js → volumes-a83ba9b38dff7ea9.js} +1 -1
  70. sky/dashboard/out/_next/static/chunks/pages/workspaces/{[name]-84a40f8c7c627fe4.js → [name]-c781e9c3e52ef9fc.js} +1 -1
  71. sky/dashboard/out/_next/static/chunks/pages/workspaces-91e0942f47310aae.js +1 -0
  72. sky/dashboard/out/_next/static/chunks/webpack-cfe59cf684ee13b9.js +1 -0
  73. sky/dashboard/out/_next/static/css/b0dbca28f027cc19.css +3 -0
  74. sky/dashboard/out/clusters/[cluster]/[job].html +1 -1
  75. sky/dashboard/out/clusters/[cluster].html +1 -1
  76. sky/dashboard/out/clusters.html +1 -1
  77. sky/dashboard/out/config.html +1 -1
  78. sky/dashboard/out/index.html +1 -1
  79. sky/dashboard/out/infra/[context].html +1 -1
  80. sky/dashboard/out/infra.html +1 -1
  81. sky/dashboard/out/jobs/[job].html +1 -1
  82. sky/dashboard/out/jobs/pools/[pool].html +1 -1
  83. sky/dashboard/out/jobs.html +1 -1
  84. sky/dashboard/out/plugins/[...slug].html +1 -0
  85. sky/dashboard/out/users.html +1 -1
  86. sky/dashboard/out/volumes.html +1 -1
  87. sky/dashboard/out/workspace/new.html +1 -1
  88. sky/dashboard/out/workspaces/[name].html +1 -1
  89. sky/dashboard/out/workspaces.html +1 -1
  90. sky/data/data_utils.py +26 -12
  91. sky/data/mounting_utils.py +44 -5
  92. sky/global_user_state.py +111 -19
  93. sky/jobs/client/sdk.py +8 -3
  94. sky/jobs/controller.py +191 -31
  95. sky/jobs/recovery_strategy.py +109 -11
  96. sky/jobs/server/core.py +81 -4
  97. sky/jobs/server/server.py +14 -0
  98. sky/jobs/state.py +417 -19
  99. sky/jobs/utils.py +73 -80
  100. sky/models.py +11 -0
  101. sky/optimizer.py +8 -6
  102. sky/provision/__init__.py +12 -9
  103. sky/provision/common.py +20 -0
  104. sky/provision/docker_utils.py +15 -2
  105. sky/provision/kubernetes/utils.py +163 -20
  106. sky/provision/kubernetes/volume.py +52 -17
  107. sky/provision/provisioner.py +17 -7
  108. sky/provision/runpod/instance.py +3 -1
  109. sky/provision/runpod/utils.py +13 -1
  110. sky/provision/runpod/volume.py +25 -9
  111. sky/provision/slurm/__init__.py +12 -0
  112. sky/provision/slurm/config.py +13 -0
  113. sky/provision/slurm/instance.py +618 -0
  114. sky/provision/slurm/utils.py +689 -0
  115. sky/provision/vast/instance.py +4 -1
  116. sky/provision/vast/utils.py +11 -6
  117. sky/resources.py +135 -13
  118. sky/schemas/api/responses.py +4 -0
  119. sky/schemas/db/global_user_state/010_save_ssh_key.py +1 -1
  120. sky/schemas/db/spot_jobs/008_add_full_resources.py +34 -0
  121. sky/schemas/db/spot_jobs/009_job_events.py +32 -0
  122. sky/schemas/db/spot_jobs/010_job_events_timestamp_with_timezone.py +43 -0
  123. sky/schemas/db/spot_jobs/011_add_links.py +34 -0
  124. sky/schemas/generated/jobsv1_pb2.py +9 -5
  125. sky/schemas/generated/jobsv1_pb2.pyi +12 -0
  126. sky/schemas/generated/jobsv1_pb2_grpc.py +44 -0
  127. sky/schemas/generated/managed_jobsv1_pb2.py +32 -28
  128. sky/schemas/generated/managed_jobsv1_pb2.pyi +11 -2
  129. sky/serve/serve_utils.py +232 -40
  130. sky/serve/server/impl.py +1 -1
  131. sky/server/common.py +17 -0
  132. sky/server/constants.py +1 -1
  133. sky/server/metrics.py +6 -3
  134. sky/server/plugins.py +238 -0
  135. sky/server/requests/executor.py +5 -2
  136. sky/server/requests/payloads.py +30 -1
  137. sky/server/requests/request_names.py +4 -0
  138. sky/server/requests/requests.py +33 -11
  139. sky/server/requests/serializers/encoders.py +22 -0
  140. sky/server/requests/serializers/return_value_serializers.py +70 -0
  141. sky/server/server.py +506 -109
  142. sky/server/server_utils.py +30 -0
  143. sky/server/uvicorn.py +5 -0
  144. sky/setup_files/MANIFEST.in +1 -0
  145. sky/setup_files/dependencies.py +22 -9
  146. sky/sky_logging.py +2 -1
  147. sky/skylet/attempt_skylet.py +13 -3
  148. sky/skylet/constants.py +55 -13
  149. sky/skylet/events.py +10 -4
  150. sky/skylet/executor/__init__.py +1 -0
  151. sky/skylet/executor/slurm.py +187 -0
  152. sky/skylet/job_lib.py +91 -5
  153. sky/skylet/log_lib.py +22 -6
  154. sky/skylet/log_lib.pyi +8 -6
  155. sky/skylet/services.py +18 -3
  156. sky/skylet/skylet.py +5 -1
  157. sky/skylet/subprocess_daemon.py +2 -1
  158. sky/ssh_node_pools/constants.py +12 -0
  159. sky/ssh_node_pools/core.py +40 -3
  160. sky/ssh_node_pools/deploy/__init__.py +4 -0
  161. sky/{utils/kubernetes/deploy_ssh_node_pools.py → ssh_node_pools/deploy/deploy.py} +279 -504
  162. sky/ssh_node_pools/deploy/tunnel/ssh-tunnel.sh +379 -0
  163. sky/ssh_node_pools/deploy/tunnel_utils.py +199 -0
  164. sky/ssh_node_pools/deploy/utils.py +173 -0
  165. sky/ssh_node_pools/server.py +11 -13
  166. sky/{utils/kubernetes/ssh_utils.py → ssh_node_pools/utils.py} +9 -6
  167. sky/templates/kubernetes-ray.yml.j2 +12 -6
  168. sky/templates/slurm-ray.yml.j2 +115 -0
  169. sky/templates/vast-ray.yml.j2 +1 -0
  170. sky/templates/websocket_proxy.py +18 -41
  171. sky/users/model.conf +1 -1
  172. sky/users/permission.py +85 -52
  173. sky/users/rbac.py +31 -3
  174. sky/utils/annotations.py +108 -8
  175. sky/utils/auth_utils.py +42 -0
  176. sky/utils/cli_utils/status_utils.py +19 -5
  177. sky/utils/cluster_utils.py +10 -3
  178. sky/utils/command_runner.py +389 -35
  179. sky/utils/command_runner.pyi +43 -4
  180. sky/utils/common_utils.py +47 -31
  181. sky/utils/context.py +32 -0
  182. sky/utils/db/db_utils.py +36 -6
  183. sky/utils/db/migration_utils.py +41 -21
  184. sky/utils/infra_utils.py +5 -1
  185. sky/utils/instance_links.py +139 -0
  186. sky/utils/interactive_utils.py +49 -0
  187. sky/utils/kubernetes/generate_kubeconfig.sh +42 -33
  188. sky/utils/kubernetes/kubernetes_deploy_utils.py +2 -94
  189. sky/utils/kubernetes/rsync_helper.sh +5 -1
  190. sky/utils/kubernetes/ssh-tunnel.sh +7 -376
  191. sky/utils/plugin_extensions/__init__.py +14 -0
  192. sky/utils/plugin_extensions/external_failure_source.py +176 -0
  193. sky/utils/resources_utils.py +10 -8
  194. sky/utils/rich_utils.py +9 -11
  195. sky/utils/schemas.py +93 -19
  196. sky/utils/status_lib.py +7 -0
  197. sky/utils/subprocess_utils.py +17 -0
  198. sky/volumes/client/sdk.py +6 -3
  199. sky/volumes/server/core.py +65 -27
  200. sky_templates/ray/start_cluster +8 -4
  201. {skypilot_nightly-1.0.0.dev20251203.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/METADATA +67 -59
  202. {skypilot_nightly-1.0.0.dev20251203.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/RECORD +208 -180
  203. sky/dashboard/out/_next/static/96_E2yl3QAiIJGOYCkSpB/_buildManifest.js +0 -1
  204. sky/dashboard/out/_next/static/chunks/1141-e6aa9ab418717c59.js +0 -11
  205. sky/dashboard/out/_next/static/chunks/1871-7e202677c42f43fe.js +0 -6
  206. sky/dashboard/out/_next/static/chunks/2260-7703229c33c5ebd5.js +0 -1
  207. sky/dashboard/out/_next/static/chunks/2350.fab69e61bac57b23.js +0 -1
  208. sky/dashboard/out/_next/static/chunks/2369.fc20f0c2c8ed9fe7.js +0 -15
  209. sky/dashboard/out/_next/static/chunks/2755.edd818326d489a1d.js +0 -26
  210. sky/dashboard/out/_next/static/chunks/3294.20a8540fe697d5ee.js +0 -1
  211. sky/dashboard/out/_next/static/chunks/3785.7e245f318f9d1121.js +0 -1
  212. sky/dashboard/out/_next/static/chunks/3800-7b45f9fbb6308557.js +0 -1
  213. sky/dashboard/out/_next/static/chunks/3850-ff4a9a69d978632b.js +0 -1
  214. sky/dashboard/out/_next/static/chunks/4725.172ede95d1b21022.js +0 -1
  215. sky/dashboard/out/_next/static/chunks/4937.a2baa2df5572a276.js +0 -15
  216. sky/dashboard/out/_next/static/chunks/6212-7bd06f60ba693125.js +0 -13
  217. sky/dashboard/out/_next/static/chunks/6856-8f27d1c10c98def8.js +0 -1
  218. sky/dashboard/out/_next/static/chunks/6989-01359c57e018caa4.js +0 -1
  219. sky/dashboard/out/_next/static/chunks/6990-9146207c4567fdfd.js +0 -1
  220. sky/dashboard/out/_next/static/chunks/7359-c8d04e06886000b3.js +0 -30
  221. sky/dashboard/out/_next/static/chunks/7411-b15471acd2cba716.js +0 -41
  222. sky/dashboard/out/_next/static/chunks/7615-019513abc55b3b47.js +0 -1
  223. sky/dashboard/out/_next/static/chunks/8969-452f9d5cbdd2dc73.js +0 -1
  224. sky/dashboard/out/_next/static/chunks/9025.fa408f3242e9028d.js +0 -6
  225. sky/dashboard/out/_next/static/chunks/9353-cff34f7e773b2e2b.js +0 -1
  226. sky/dashboard/out/_next/static/chunks/9360.a536cf6b1fa42355.js +0 -31
  227. sky/dashboard/out/_next/static/chunks/9847.3aaca6bb33455140.js +0 -30
  228. sky/dashboard/out/_next/static/chunks/pages/_app-bde01e4a2beec258.js +0 -34
  229. sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]/[job]-792db96d918c98c9.js +0 -16
  230. sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]-abfcac9c137aa543.js +0 -1
  231. sky/dashboard/out/_next/static/chunks/pages/infra/[context]-c0b5935149902e6f.js +0 -1
  232. sky/dashboard/out/_next/static/chunks/pages/infra-aed0ea19df7cf961.js +0 -1
  233. sky/dashboard/out/_next/static/chunks/pages/jobs/[job]-d66997e2bfc837cf.js +0 -16
  234. sky/dashboard/out/_next/static/chunks/pages/jobs/pools/[pool]-9faf940b253e3e06.js +0 -21
  235. sky/dashboard/out/_next/static/chunks/pages/jobs-2072b48b617989c9.js +0 -1
  236. sky/dashboard/out/_next/static/chunks/pages/users-f42674164aa73423.js +0 -1
  237. sky/dashboard/out/_next/static/chunks/pages/workspaces-531b2f8c4bf89f82.js +0 -1
  238. sky/dashboard/out/_next/static/chunks/webpack-64e05f17bf2cf8ce.js +0 -1
  239. sky/dashboard/out/_next/static/css/0748ce22df867032.css +0 -3
  240. sky/dashboard/out/_next/static/{96_E2yl3QAiIJGOYCkSpB → 3nu-b8raeKRNABZ2d4GAG}/_ssgManifest.js +0 -0
  241. sky/{utils/kubernetes → ssh_node_pools/deploy/tunnel}/cleanup-tunnel.sh +0 -0
  242. {skypilot_nightly-1.0.0.dev20251203.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/WHEEL +0 -0
  243. {skypilot_nightly-1.0.0.dev20251203.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/entry_points.txt +0 -0
  244. {skypilot_nightly-1.0.0.dev20251203.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/licenses/LICENSE +0 -0
  245. {skypilot_nightly-1.0.0.dev20251203.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/top_level.txt +0 -0
sky/utils/schemas.py CHANGED
@@ -208,26 +208,49 @@ def _get_single_resources_schema():
  },
  'job_recovery': {
  # Either a string or a dict.
- 'anyOf': [{
- 'type': 'string',
- }, {
- 'type': 'object',
- 'required': [],
- 'additionalProperties': False,
- 'properties': {
- 'strategy': {
- 'anyOf': [{
- 'type': 'string',
- }, {
- 'type': 'null',
- }],
- },
- 'max_restarts_on_errors': {
- 'type': 'integer',
- 'minimum': 0,
- },
+ 'anyOf': [
+ {
+ 'type': 'string',
+ },
+ {
+ 'type': 'object',
+ 'required': [],
+ 'additionalProperties': False,
+ 'properties': {
+ 'strategy': {
+ 'anyOf': [{
+ 'type': 'string',
+ }, {
+ 'type': 'null',
+ }],
+ },
+ 'max_restarts_on_errors': {
+ 'type': 'integer',
+ 'minimum': 0,
+ },
+ 'recover_on_exit_codes': {
+ 'anyOf': [
+ {
+ # Single exit code
+ 'type': 'integer',
+ 'minimum': 0,
+ 'maximum': 255,
+ },
+ {
+ # List of exit codes
+ 'type': 'array',
+ 'items': {
+ 'type': 'integer',
+ 'minimum': 0,
+ 'maximum': 255,
+ },
+ 'uniqueItems': True,
+ },
+ ],
+ },
+ }
  }
- }],
+ ],
  },
  'volumes': {
  'type': 'array',
@@ -1401,6 +1424,27 @@ def get_config_schema():
  **_CONTEXT_CONFIG_SCHEMA_MINIMAL,
  }
  },
+ 'slurm': {
+ 'type': 'object',
+ 'required': [],
+ 'additionalProperties': False,
+ 'properties': {
+ 'allowed_clusters': {
+ 'oneOf': [{
+ 'type': 'array',
+ 'items': {
+ 'type': 'string',
+ },
+ }, {
+ 'type': 'string',
+ 'pattern': '^all$'
+ }]
+ },
+ 'provision_timeout': {
+ 'type': 'integer',
+ },
+ }
+ },
  'oci': {
  'type': 'object',
  'required': [],
@@ -1435,6 +1479,16 @@ def get_config_schema():
  }
  },
  },
+ 'vast': {
+ 'type': 'object',
+ 'required': [],
+ 'additionalProperties': False,
+ 'properties': {
+ 'datacenter_only': {
+ 'type': 'boolean',
+ },
+ }
+ },
  'nebius': {
  'type': 'object',
  'required': [],
@@ -1814,6 +1868,25 @@ def get_config_schema():
  config['properties'].update(_REMOTE_IDENTITY_SCHEMA_KUBERNETES)
  else:
  config['properties'].update(_REMOTE_IDENTITY_SCHEMA)
+
+ data_schema = {
+ 'type': 'object',
+ 'required': [],
+ 'additionalProperties': False,
+ 'properties': {
+ 'mount_cached': {
+ 'type': 'object',
+ 'required': [],
+ 'additionalProperties': False,
+ 'properties': {
+ 'sequential_upload': {
+ 'type': 'boolean',
+ },
+ },
+ },
+ },
+ }
+
  return {
  '$schema': 'https://json-schema.org/draft/2020-12/schema',
  'type': 'object',
@@ -1840,6 +1913,7 @@ def get_config_schema():
  'rbac': rbac_schema,
  'logs': logs_schema,
  'daemons': daemon_schema,
+ 'data': data_schema,
  **cloud_configs,
  },
  }
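The `recover_on_exit_codes` entry added to the `job_recovery` schema above accepts either a single integer from 0 to 255 or a list of unique integers in that range. The following is a minimal sketch of that rule in plain Python, for illustration only: `normalize_exit_codes` is a hypothetical helper, not a SkyPilot function, and rejecting booleans is an assumption about how the integer type is enforced.

```python
from typing import List, Union


def normalize_exit_codes(value: Union[int, List[int]]) -> List[int]:
    """Mirror the schema: one exit code, or a list of unique codes, each 0-255.

    Hypothetical helper for illustration; not part of SkyPilot.
    """
    codes = value if isinstance(value, list) else [value]
    for code in codes:
        # bool is a subclass of int in Python; reject it explicitly (assumption).
        if isinstance(code, bool) or not isinstance(code, int):
            raise ValueError(f'exit code must be an integer: {code!r}')
        if not 0 <= code <= 255:
            raise ValueError(f'exit code out of range 0-255: {code}')
    # Corresponds to 'uniqueItems': True in the list branch of the schema.
    if len(set(codes)) != len(codes):
        raise ValueError('exit codes must be unique')
    return list(codes)
```

Under these assumptions a single value such as `137` and a list such as `[1, 137]` both pass, while `256`, `-1`, and `[1, 1]` are rejected.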
sky/utils/status_lib.py CHANGED
@@ -27,6 +27,12 @@ class ClusterStatus(enum.Enum):
 
  STOPPED = 'STOPPED'
  """The cluster is stopped."""
+ PENDING = 'PENDING'
+ """The cluster is pending scheduling.
+
+ NOTE: This state is for display only and should not be used in state
+ machine logic without necessary considerations.
+ """
 
  def colored_str(self):
  color = _STATUS_TO_COLOR[self]
@@ -37,6 +43,7 @@ _STATUS_TO_COLOR = {
  ClusterStatus.INIT: colorama.Fore.BLUE,
  ClusterStatus.UP: colorama.Fore.GREEN,
  ClusterStatus.STOPPED: colorama.Fore.YELLOW,
+ ClusterStatus.PENDING: colorama.Fore.CYAN,
  }
 
 
sky/utils/subprocess_utils.py CHANGED
@@ -7,6 +7,7 @@ import resource
  import shlex
  import subprocess
  import sys
+ import termios
  import threading
  import time
  import typing
@@ -450,3 +451,19 @@ def slow_start_processes(processes: List[Startable],
  break
  batch_size = min(batch_size * 2, max_batch_size)
  time.sleep(delay)
+
+
+ def is_echo_disabled(fd: int) -> bool:
+ """Check if terminal ECHO is disabled on the given fd.
+
+ When a subprocess wants password/sensitive input, it disables ECHO.
+ This is how pexpect's waitnoecho() works. See:
+ https://pexpect.readthedocs.io/en/stable/api/pexpect.html#pexpect.spawn.waitnoecho
+ """
+ assert os.isatty(fd), 'fd is not connected to a terminal'
+ try:
+ attr = termios.tcgetattr(fd)
+ echo_on = bool(attr[3] & termios.ECHO)
+ return not echo_on
+ except (termios.error, OSError):
+ return False
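The `is_echo_disabled` check added above can be tried against a pseudo-terminal. The sketch below copies the function body from the diff and toggles the ECHO flag the way a password prompt would. `set_echo` and `demo` are hypothetical helpers introduced only for this illustration, and the example assumes a POSIX system where `pty.openpty()` is available.

```python
import os
import pty
import termios


def is_echo_disabled(fd: int) -> bool:
    """Same check as in the diff: True when the ECHO local flag is cleared."""
    assert os.isatty(fd), 'fd is not connected to a terminal'
    try:
        attr = termios.tcgetattr(fd)
        return not bool(attr[3] & termios.ECHO)  # attr[3] holds the local flags
    except (termios.error, OSError):
        return False


def set_echo(fd: int, enabled: bool) -> None:
    """Hypothetical helper: switch ECHO on or off for the terminal on fd."""
    attr = termios.tcgetattr(fd)
    if enabled:
        attr[3] |= termios.ECHO
    else:
        attr[3] &= ~termios.ECHO
    termios.tcsetattr(fd, termios.TCSANOW, attr)


def demo() -> tuple:
    """Return the check result with ECHO on, then with ECHO off."""
    master_fd, slave_fd = pty.openpty()
    try:
        set_echo(slave_fd, True)
        with_echo = is_echo_disabled(slave_fd)
        set_echo(slave_fd, False)
        without_echo = is_echo_disabled(slave_fd)
        return with_echo, without_echo
    finally:
        os.close(master_fd)
        os.close(slave_fd)
```

Calling `demo()` reports the state before and after ECHO is cleared, which is the transition a caller would poll for before forwarding sensitive input.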
sky/volumes/client/sdk.py CHANGED
@@ -1,4 +1,4 @@
- """SDK functions for managed jobs."""
+ """SDK functions for volumes."""
  import json
  import typing
  from typing import List
@@ -135,16 +135,19 @@ def ls() -> server_common.RequestId[List[responses.VolumeRecord]]:
  @usage_lib.entrypoint
  @server_common.check_server_healthy_or_start
  @annotations.client_api
- def delete(names: List[str]) -> server_common.RequestId[None]:
+ def delete(names: List[str],
+ purge: bool = False) -> server_common.RequestId[None]:
  """Deletes volumes.
 
  Args:
  names: List of volume names to delete.
+ purge: If True, delete the volume from the database even if the
+ deletion API fails.
 
  Returns:
  The request ID of the delete request.
  """
- body = payloads.VolumeDeleteBody(names=names)
+ body = payloads.VolumeDeleteBody(names=names, purge=purge)
  response = server_common.make_authenticated_request(
  'POST', '/volumes/delete', json=json.loads(body.model_dump_json()))
  return server_common.get_request_id(response)
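The `purge` flag documented above means "remove the database record even if the deletion API fails". A rough, self-contained sketch of that behaviour follows. It is not SkyPilot's implementation: `delete_volumes`, the `db` dictionary, and the `cloud_delete` callable are hypothetical stand-ins for the server-side state and provisioner call, and the in-use checks that the server performs when `purge` is false are left out.

```python
from typing import Callable, Dict, List


def delete_volumes(names: List[str],
                   db: Dict[str, dict],
                   cloud_delete: Callable[[str], None],
                   purge: bool = False) -> List[str]:
    """Delete each volume on the cloud, then drop its database record.

    With purge=True a failing cloud call is collected as a warning and the
    record is still removed; with purge=False the error propagates and the
    record stays in place.
    """
    warnings: List[str] = []
    for name in names:
        try:
            cloud_delete(name)
        except Exception as e:  # broad on purpose, mirroring the flag's intent
            if not purge:
                raise
            warnings.append(f'Failed to delete volume {name}: {e}. '
                            'Purging from database.')
        db.pop(name, None)
    return warnings
```

The design choice this illustrates is that `purge` trades consistency with the cloud for the ability to clean up records whose backing resource can no longer be reached.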
sky/volumes/server/core.py CHANGED
@@ -30,6 +30,10 @@ def volume_refresh():
  volumes = volume_list(is_ephemeral=False)
  for volume in volumes:
  volume_name = volume.name
+ if volume.usedby_fetch_failed:
+ logger.info(f'Skipping status update for volume {volume_name} '
+ f'due to failed usedby fetch')
+ continue
  usedby_pods = volume.usedby_pods
  with _volume_lock(volume_name):
  latest_volume = global_user_state.get_volume_by_name(volume_name)
@@ -55,6 +59,9 @@ def volume_list(
  is_ephemeral: Optional[bool] = None) -> List[responses.VolumeRecord]:
  """Gets the volumes.
 
+ Args:
+ is_ephemeral: Whether to include ephemeral volumes.
+
  Returns:
  [
  {
@@ -74,6 +81,7 @@ def volume_list(
  'status': sky.VolumeStatus,
  'usedby_pods': List[str],
  'usedby_clusters': List[str],
+ 'usedby_fetch_failed': bool,
  'is_ephemeral': bool,
  }
  ]
@@ -93,11 +101,23 @@ def volume_list(
  cloud_to_configs[cloud].append(config)
 
  cloud_to_used_by_pods, cloud_to_used_by_clusters = {}, {}
+ cloud_to_failed_volume_names = {}
  for cloud, configs in cloud_to_configs.items():
- used_by_pods, used_by_clusters = provision.get_all_volumes_usedby(
- cloud, configs)
- cloud_to_used_by_pods[cloud] = used_by_pods
- cloud_to_used_by_clusters[cloud] = used_by_clusters
+ try:
+ used_by_pods, used_by_clusters, failed_volume_names = (
+ provision.get_all_volumes_usedby(cloud, configs))
+ cloud_to_used_by_pods[cloud] = used_by_pods
+ cloud_to_used_by_clusters[cloud] = used_by_clusters
+ cloud_to_failed_volume_names[cloud] = failed_volume_names
+ except Exception as e: # pylint: disable=broad-except
+ logger.warning(
+ f'Failed to get usedby info for volumes on {cloud}: {e}')
+ cloud_to_used_by_pods[cloud] = {}
+ cloud_to_used_by_clusters[cloud] = {}
+ cloud_to_failed_volume_names[cloud] = {
+ config.name for config in configs
+ }
+ continue
 
  all_users = global_user_state.get_all_users()
  user_map = {user.id: user.name for user in all_users}
@@ -114,6 +134,7 @@ def volume_list(
  'last_use': volume.get('last_use'),
  'usedby_pods': [],
  'usedby_clusters': [],
+ 'usedby_fetch_failed': False,
  'is_ephemeral': volume.get('is_ephemeral', False),
  }
  status = volume.get('status')
@@ -126,12 +147,17 @@ def volume_list(
  logger.warning(f'Volume {volume_name} has no handle.')
  continue
  cloud = config.cloud
- usedby_pods, usedby_clusters = provision.map_all_volumes_usedby(
- cloud,
- cloud_to_used_by_pods[cloud],
- cloud_to_used_by_clusters[cloud],
- config,
- )
+ if volume_name in cloud_to_failed_volume_names[cloud]:
+ record['usedby_fetch_failed'] = True
+ else:
+ usedby_pods, usedby_clusters = provision.map_all_volumes_usedby(
+ cloud,
+ cloud_to_used_by_pods[cloud],
+ cloud_to_used_by_clusters[cloud],
+ config,
+ )
+ record['usedby_pods'] = usedby_pods
+ record['usedby_clusters'] = usedby_clusters
  record['type'] = config.type
  record['cloud'] = config.cloud
  record['region'] = config.region
@@ -139,18 +165,20 @@ def volume_list(
  record['size'] = config.size
  record['config'] = config.config
  record['name_on_cloud'] = config.name_on_cloud
- record['usedby_pods'] = usedby_pods
- record['usedby_clusters'] = usedby_clusters
  records.append(responses.VolumeRecord(**record))
  return records
 
 
- def volume_delete(names: List[str], ignore_not_found: bool = False) -> None:
+ def volume_delete(names: List[str],
+ ignore_not_found: bool = False,
+ purge: bool = False) -> None:
  """Deletes volumes.
 
  Args:
  names: List of volume names to delete.
  ignore_not_found: If True, ignore volumes that are not found.
+ purge: If True, delete the volume from the database even if the
+ deletion API fails.
 
  Raises:
  ValueError: If the volume does not exist
@@ -167,22 +195,32 @@ def volume_delete(names: List[str], ignore_not_found: bool = False) -> None:
  if config is None:
  raise ValueError(f'Volume {name} has no handle.')
  cloud = config.cloud
- usedby_pods, usedby_clusters = provision.get_volume_usedby(
- cloud, config)
- if usedby_clusters:
- usedby_clusters_str = ', '.join(usedby_clusters)
- cluster_str = 'clusters' if len(
- usedby_clusters) > 1 else 'cluster'
- raise ValueError(f'Volume {name} is used by {cluster_str}'
- f' {usedby_clusters_str}.')
- if usedby_pods:
- usedby_pods_str = ', '.join(usedby_pods)
- pod_str = 'pods' if len(usedby_pods) > 1 else 'pod'
- raise ValueError(
- f'Volume {name} is used by {pod_str} {usedby_pods_str}.')
+ if not purge:
+ usedby_pods, usedby_clusters = provision.get_volume_usedby(
+ cloud, config)
+ if usedby_clusters:
+ usedby_clusters_str = ', '.join(usedby_clusters)
+ cluster_str = 'clusters' if len(
+ usedby_clusters) > 1 else 'cluster'
+ raise ValueError(f'Volume {name} is used by {cluster_str}'
+ f' {usedby_clusters_str}.')
+ if usedby_pods:
+ usedby_pods_str = ', '.join(usedby_pods)
+ pod_str = 'pods' if len(usedby_pods) > 1 else 'pod'
+ raise ValueError(
+ f'Volume {name} is used by {pod_str} {usedby_pods_str}.'
+ )
  logger.debug(f'Deleting volume {name} with config {config}')
  with _volume_lock(name):
- provision.delete_volume(cloud, config)
+ try:
+ provision.delete_volume(cloud, config)
+ except Exception as e: # pylint: disable=broad-except
+ if purge:
+ logger.warning(f'Failed to delete volume {name} '
+ f'on {cloud}: {e}. Purging from '
+ 'database.')
+ else:
+ raise
  global_user_state.delete_volume(name)
  logger.info(f'Deleted volumes: {names}')
 
sky_templates/ray/start_cluster CHANGED
@@ -77,14 +77,18 @@ if ! run_ray --version > /dev/null; then
  fi
  echo -e "${GREEN}Ray $(run_ray --version | cut -d' ' -f3) is installed.${NC}"
 
- RAY_ADDRESS="127.0.0.1:${RAY_HEAD_PORT}"
+ LOCAL_RAY_ADDRESS="127.0.0.1:${RAY_HEAD_PORT}"
+ RAY_ADDRESS=${LOCAL_RAY_ADDRESS}
  if [ "${SKYPILOT_NODE_RANK}" -ne 0 ]; then
  HEAD_IP=$(echo "${SKYPILOT_NODE_IPS}" | head -n1)
  RAY_ADDRESS="${HEAD_IP}:${RAY_HEAD_PORT}"
  fi
 
- # Check if user-space Ray is already running
- if run_ray status --address="${RAY_ADDRESS}" &> /dev/null; then
+ # Check if user-space Ray is already running. Use local address to check, as
+ # if we use the head node address, the check will succeed even if the Ray
+ # cluster is started on the head node but not started on the current worker
+ # node.
+ if run_ray status --address="${LOCAL_RAY_ADDRESS}" &> /dev/null; then
  echo -e "${YELLOW}Ray cluster is already running.${NC}"
  run_ray status --address="${RAY_ADDRESS}"
  exit 0
@@ -140,7 +144,7 @@ if [ "${SKYPILOT_NODE_RANK}" -eq 0 ]; then
  echo -e "${RED}Error: Timeout waiting for nodes.${NC}" >&2
  exit 1
  fi
- ready_nodes=$(run_ray list nodes --format=json | python3 -c "import sys, json; print(len(json.load(sys.stdin)))")
+ ready_nodes=$(run_ray list nodes --address="${RAY_ADDRESS}" --format=json | python3 -c "import sys, json; print(len(json.load(sys.stdin)))")
  if [ "${ready_nodes}" -ge "${SKYPILOT_NUM_NODES}" ]; then
  break
  fi
{skypilot_nightly-1.0.0.dev20251203.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/METADATA CHANGED
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: skypilot-nightly
- Version: 1.0.0.dev20251203
+ Version: 1.0.0.dev20260112
  Summary: SkyPilot: Run AI on Any Infra — Unified, Faster, Cheaper.
  Author: SkyPilot Team
  License: Apache 2.0
@@ -64,6 +64,7 @@ Requires-Dist: passlib
  Requires-Dist: bcrypt==4.0.1
  Requires-Dist: pyjwt
  Requires-Dist: gitpython
+ Requires-Dist: paramiko
  Requires-Dist: types-paramiko
  Requires-Dist: alembic
  Requires-Dist: aiohttp
@@ -72,7 +73,7 @@ Provides-Extra: aws
  Requires-Dist: awscli>=1.27.10; extra == "aws"
  Requires-Dist: botocore>=1.29.10; extra == "aws"
  Requires-Dist: boto3>=1.26.1; extra == "aws"
- Requires-Dist: colorama<0.4.5; extra == "aws"
+ Requires-Dist: colorama<0.4.7; extra == "aws"
  Requires-Dist: casbin; extra == "aws"
  Requires-Dist: sqlalchemy_adapter; extra == "aws"
  Requires-Dist: passlib; extra == "aws"
@@ -160,7 +161,7 @@ Provides-Extra: cloudflare
  Requires-Dist: awscli>=1.27.10; extra == "cloudflare"
  Requires-Dist: botocore>=1.29.10; extra == "cloudflare"
  Requires-Dist: boto3>=1.26.1; extra == "cloudflare"
- Requires-Dist: colorama<0.4.5; extra == "cloudflare"
+ Requires-Dist: colorama<0.4.7; extra == "cloudflare"
  Requires-Dist: casbin; extra == "cloudflare"
  Requires-Dist: sqlalchemy_adapter; extra == "cloudflare"
  Requires-Dist: passlib; extra == "cloudflare"
@@ -175,7 +176,7 @@ Provides-Extra: coreweave
  Requires-Dist: awscli>=1.27.10; extra == "coreweave"
  Requires-Dist: botocore>=1.29.10; extra == "coreweave"
  Requires-Dist: boto3>=1.26.1; extra == "coreweave"
- Requires-Dist: colorama<0.4.5; extra == "coreweave"
+ Requires-Dist: colorama<0.4.7; extra == "coreweave"
  Requires-Dist: kubernetes!=32.0.0,>=20.0.0; extra == "coreweave"
  Requires-Dist: websockets; extra == "coreweave"
  Requires-Dist: python-dateutil; extra == "coreweave"
@@ -244,6 +245,7 @@ Requires-Dist: greenlet; extra == "ssh"
  Provides-Extra: runpod
  Requires-Dist: runpod>=1.6.1; extra == "runpod"
  Requires-Dist: tomli; extra == "runpod"
+ Requires-Dist: pycares<5; extra == "runpod"
  Requires-Dist: casbin; extra == "runpod"
  Requires-Dist: sqlalchemy_adapter; extra == "runpod"
  Requires-Dist: passlib; extra == "runpod"
@@ -344,7 +346,7 @@ Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "nebius"
  Requires-Dist: awscli>=1.27.10; extra == "nebius"
  Requires-Dist: botocore>=1.29.10; extra == "nebius"
  Requires-Dist: boto3>=1.26.1; extra == "nebius"
- Requires-Dist: colorama<0.4.5; extra == "nebius"
+ Requires-Dist: colorama<0.4.7; extra == "nebius"
  Requires-Dist: casbin; extra == "nebius"
  Requires-Dist: sqlalchemy_adapter; extra == "nebius"
  Requires-Dist: passlib; extra == "nebius"
@@ -389,52 +391,66 @@ Requires-Dist: grpcio>=1.63.0; extra == "shadeform"
  Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "shadeform"
  Requires-Dist: aiosqlite; extra == "shadeform"
  Requires-Dist: greenlet; extra == "shadeform"
+ Provides-Extra: slurm
+ Requires-Dist: python-hostlist; extra == "slurm"
+ Requires-Dist: casbin; extra == "slurm"
+ Requires-Dist: sqlalchemy_adapter; extra == "slurm"
+ Requires-Dist: passlib; extra == "slurm"
+ Requires-Dist: pyjwt; extra == "slurm"
+ Requires-Dist: aiohttp; extra == "slurm"
+ Requires-Dist: anyio; extra == "slurm"
+ Requires-Dist: grpcio>=1.63.0; extra == "slurm"
+ Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "slurm"
+ Requires-Dist: aiosqlite; extra == "slurm"
+ Requires-Dist: greenlet; extra == "slurm"
  Provides-Extra: all
- Requires-Dist: greenlet; extra == "all"
- Requires-Dist: azure-identity>=1.19.0; extra == "all"
- Requires-Dist: msrestazure; extra == "all"
- Requires-Dist: azure-mgmt-network>=27.0.0; extra == "all"
- Requires-Dist: aiosqlite; extra == "all"
- Requires-Dist: azure-mgmt-compute>=33.0.0; extra == "all"
- Requires-Dist: anyio; extra == "all"
- Requires-Dist: ibm-platform-services>=0.48.0; extra == "all"
- Requires-Dist: vastai-sdk>=0.1.12; extra == "all"
- Requires-Dist: ibm-cloud-sdk-core; extra == "all"
- Requires-Dist: sqlalchemy_adapter; extra == "all"
- Requires-Dist: botocore>=1.29.10; extra == "all"
- Requires-Dist: msgraph-sdk; extra == "all"
  Requires-Dist: aiohttp; extra == "all"
- Requires-Dist: nebius>=0.3.12; extra == "all"
- Requires-Dist: passlib; extra == "all"
+ Requires-Dist: tomli; extra == "all"
+ Requires-Dist: ecsapi==0.4.0; extra == "all"
+ Requires-Dist: msgraph-sdk; extra == "all"
+ Requires-Dist: azure-cli>=2.65.0; extra == "all"
+ Requires-Dist: python-dateutil; extra == "all"
+ Requires-Dist: ray[default]>=2.6.1; extra == "all"
+ Requires-Dist: azure-storage-blob>=12.23.1; extra == "all"
+ Requires-Dist: pydo>=0.3.0; extra == "all"
+ Requires-Dist: google-cloud-storage; extra == "all"
+ Requires-Dist: azure-identity>=1.19.0; extra == "all"
  Requires-Dist: grpcio>=1.63.0; extra == "all"
- Requires-Dist: websockets; extra == "all"
+ Requires-Dist: colorama<0.4.7; extra == "all"
+ Requires-Dist: boto3>=1.26.1; extra == "all"
+ Requires-Dist: docker; extra == "all"
+ Requires-Dist: sqlalchemy_adapter; extra == "all"
+ Requires-Dist: anyio; extra == "all"
+ Requires-Dist: pyjwt; extra == "all"
  Requires-Dist: google-api-python-client>=2.69.0; extra == "all"
- Requires-Dist: google-cloud-storage; extra == "all"
- Requires-Dist: azure-cli>=2.65.0; extra == "all"
  Requires-Dist: oci; extra == "all"
- Requires-Dist: ecsapi==0.4.0; extra == "all"
+ Requires-Dist: pyvmomi==8.0.1.0.2; extra == "all"
+ Requires-Dist: websockets; extra == "all"
+ Requires-Dist: kubernetes!=32.0.0,>=20.0.0; extra == "all"
+ Requires-Dist: ibm-cloud-sdk-core; extra == "all"
+ Requires-Dist: runpod>=1.6.1; extra == "all"
+ Requires-Dist: azure-core>=1.24.0; extra == "all"
+ Requires-Dist: passlib; extra == "all"
+ Requires-Dist: ibm-vpc; extra == "all"
+ Requires-Dist: nebius>=0.3.12; extra == "all"
  Requires-Dist: cudo-compute>=0.1.10; extra == "all"
+ Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "all"
+ Requires-Dist: awscli>=1.27.10; extra == "all"
+ Requires-Dist: pycares<5; extra == "all"
+ Requires-Dist: ibm-platform-services>=0.48.0; extra == "all"
+ Requires-Dist: greenlet; extra == "all"
  Requires-Dist: azure-core>=1.31.0; extra == "all"
- Requires-Dist: colorama<0.4.5; extra == "all"
+ Requires-Dist: msrestazure; extra == "all"
+ Requires-Dist: vastai-sdk>=0.1.12; extra == "all"
+ Requires-Dist: pyopenssl<24.3.0,>=23.2.0; extra == "all"
  Requires-Dist: ibm-cos-sdk; extra == "all"
- Requires-Dist: python-dateutil; extra == "all"
- Requires-Dist: docker; extra == "all"
- Requires-Dist: awscli>=1.27.10; extra == "all"
- Requires-Dist: azure-storage-blob>=12.23.1; extra == "all"
- Requires-Dist: tomli; extra == "all"
- Requires-Dist: azure-core>=1.24.0; extra == "all"
+ Requires-Dist: python-hostlist; extra == "all"
+ Requires-Dist: azure-mgmt-compute>=33.0.0; extra == "all"
+ Requires-Dist: botocore>=1.29.10; extra == "all"
+ Requires-Dist: azure-mgmt-network>=27.0.0; extra == "all"
  Requires-Dist: casbin; extra == "all"
- Requires-Dist: kubernetes!=32.0.0,>=20.0.0; extra == "all"
- Requires-Dist: pyvmomi==8.0.1.0.2; extra == "all"
- Requires-Dist: pyjwt; extra == "all"
- Requires-Dist: runpod>=1.6.1; extra == "all"
- Requires-Dist: boto3>=1.26.1; extra == "all"
- Requires-Dist: ray[default]>=2.6.1; extra == "all"
- Requires-Dist: pydo>=0.3.0; extra == "all"
+ Requires-Dist: aiosqlite; extra == "all"
  Requires-Dist: azure-common; extra == "all"
- Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "all"
- Requires-Dist: pyopenssl<24.3.0,>=23.2.0; extra == "all"
- Requires-Dist: ibm-vpc; extra == "all"
  Provides-Extra: remote
  Requires-Dist: grpcio>=1.63.0; extra == "remote"
  Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "remote"
@@ -484,7 +500,7 @@ Dynamic: summary
  </p>

  <h3 align="center">
- Simplify & scale any AI infrastructure
+ Run AI on Any Infrastructure
  </h3>

  <div align="center">
@@ -494,10 +510,18 @@ Dynamic: summary
  </div>


+ SkyPilot is a system to run, manage, and scale AI workloads on any AI infrastructure.

- ----
+ SkyPilot gives **AI teams** a simple interface to run jobs on any infra.
+ **Infra teams** get a unified control plane to manage any AI compute — with advanced scheduling, scaling, and orchestration.
+
+ <img src="./docs/source/images/skypilot-abstractions-long-2.png" alt="SkyPilot Abstractions">
+
+ -----

  :fire: *News* :fire:
+ - [Dec 2025] **SkyPilot v0.11** released: Multi-Cloud Pools, Fast Managed Jobs, Enterprise-Readiness at Large Scale, Programmability. [**Release notes**](https://github.com/skypilot-org/skypilot/releases/tag/v0.11.0)
+ - [Dec 2025] **SkyPilot Pools** released: Run batch inference and other jobs on a managed pool of warm workers (across clouds or clusters). [**blog**](https://blog.skypilot.co/skypilot-pools-deepseek-ocr/), [**docs**](https://docs.skypilot.co/en/latest/examples/pools.html)
  - [Nov 2025] Serve **Kimi K2 Thinking** with reasoning capabilities on your Kubernetes or clouds: [**example**](./llm/kimi-k2-thinking/)
  - [Oct 2025] Run **RL training for LLMs** with SkyRL on your Kubernetes or clouds: [**example**](./llm/skyrl/)
  - [Oct 2025] Train and serve [Andrej Karpathy's](https://x.com/karpathy/status/1977755427569111362) **nanochat** - the best ChatGPT that $100 can buy: [**example**](./llm/nanochat)
@@ -506,22 +530,6 @@ Dynamic: summary
  - [Sep 2025] Network and Storage Benchmarks for LLM training on the cloud: [**blog**](https://maknee.github.io/blog/2025/Network-And-Storage-Training-Skypilot/)
  - [Aug 2025] Serve and finetune **OpenAI GPT-OSS models** (gpt-oss-120b, gpt-oss-20b) with one command on any infra: [**serve**](./llm/gpt-oss/) + [**LoRA and full finetuning**](./llm/gpt-oss-finetuning/)
  - [Jul 2025] Run distributed **RL training for LLMs** with Verl (PPO, GRPO) on any cloud: [**example**](./llm/verl/)
- - [Jul 2025] Finetune **Llama4** on any distributed cluster/cloud: [**example**](./llm/llama-4-finetuning/)
- - [Jul 2025] Two-part blog series, `The Evolution of AI Job Orchestration`: (1) [Running AI jobs on GPU Neoclouds](https://blog.skypilot.co/ai-job-orchestration-pt1-gpu-neoclouds/), (2) [The AI-Native Control Plane & Orchestration that Finally Works for ML](https://blog.skypilot.co/ai-job-orchestration-pt2-ai-control-plane/)
- - [Apr 2025] Spin up **Qwen3** on your cluster/cloud: [**example**](./llm/qwen/)
-
-
-
- **LLM Finetuning Cookbooks**: Finetuning Llama 2 / Llama 3.1 in your own cloud environment, privately: Llama 2 [**example**](./llm/vicuna-llama-2/) and [**blog**](https://blog.skypilot.co/finetuning-llama2-operational-guide/); Llama 3.1 [**example**](./llm/llama-3_1-finetuning/) and [**blog**](https://blog.skypilot.co/finetune-llama-3_1-on-your-infra/)
-
- ----
-
- SkyPilot is a system to run, manage, and scale AI workloads on any AI infrastructure.
-
- SkyPilot gives **AI teams** a simple interface to run jobs on any infra.
- **Infra teams** get a unified control plane to manage any AI compute — with advanced scheduling, scaling, and orchestration.
-
- <img src="./docs/source/images/skypilot-abstractions-long-2.png" alt="SkyPilot Abstractions">

  ## Overview