skypilot-nightly 1.0.0.dev20251210__py3-none-any.whl → 1.0.0.dev20260112__py3-none-any.whl

Files changed (207)
  1. sky/__init__.py +4 -2
  2. sky/adaptors/slurm.py +159 -72
  3. sky/backends/backend_utils.py +52 -10
  4. sky/backends/cloud_vm_ray_backend.py +192 -32
  5. sky/backends/task_codegen.py +40 -2
  6. sky/catalog/data_fetchers/fetch_gcp.py +9 -1
  7. sky/catalog/data_fetchers/fetch_nebius.py +1 -1
  8. sky/catalog/data_fetchers/fetch_vast.py +4 -2
  9. sky/catalog/seeweb_catalog.py +30 -15
  10. sky/catalog/shadeform_catalog.py +5 -2
  11. sky/catalog/slurm_catalog.py +0 -7
  12. sky/catalog/vast_catalog.py +30 -6
  13. sky/check.py +11 -8
  14. sky/client/cli/command.py +106 -54
  15. sky/client/interactive_utils.py +190 -0
  16. sky/client/sdk.py +8 -0
  17. sky/client/sdk_async.py +9 -0
  18. sky/clouds/aws.py +60 -2
  19. sky/clouds/azure.py +2 -0
  20. sky/clouds/kubernetes.py +2 -0
  21. sky/clouds/runpod.py +38 -7
  22. sky/clouds/slurm.py +44 -12
  23. sky/clouds/ssh.py +1 -1
  24. sky/clouds/vast.py +30 -17
  25. sky/core.py +69 -1
  26. sky/dashboard/out/404.html +1 -1
  27. sky/dashboard/out/_next/static/3nu-b8raeKRNABZ2d4GAG/_buildManifest.js +1 -0
  28. sky/dashboard/out/_next/static/chunks/1871-0565f8975a7dcd10.js +6 -0
  29. sky/dashboard/out/_next/static/chunks/2109-55a1546d793574a7.js +11 -0
  30. sky/dashboard/out/_next/static/chunks/2521-099b07cd9e4745bf.js +26 -0
  31. sky/dashboard/out/_next/static/chunks/2755.a636e04a928a700e.js +31 -0
  32. sky/dashboard/out/_next/static/chunks/3495.05eab4862217c1a5.js +6 -0
  33. sky/dashboard/out/_next/static/chunks/3785.cfc5dcc9434fd98c.js +1 -0
  34. sky/dashboard/out/_next/static/chunks/3981.645d01bf9c8cad0c.js +21 -0
  35. sky/dashboard/out/_next/static/chunks/4083-0115d67c1fb57d6c.js +21 -0
  36. sky/dashboard/out/_next/static/chunks/{8640.5b9475a2d18c5416.js → 429.a58e9ba9742309ed.js} +2 -2
  37. sky/dashboard/out/_next/static/chunks/4555.8e221537181b5dc1.js +6 -0
  38. sky/dashboard/out/_next/static/chunks/4725.937865b81fdaaebb.js +6 -0
  39. sky/dashboard/out/_next/static/chunks/6082-edabd8f6092300ce.js +25 -0
  40. sky/dashboard/out/_next/static/chunks/6989-49cb7dca83a7a62d.js +1 -0
  41. sky/dashboard/out/_next/static/chunks/6990-630bd2a2257275f8.js +1 -0
  42. sky/dashboard/out/_next/static/chunks/7248-a99800d4db8edabd.js +1 -0
  43. sky/dashboard/out/_next/static/chunks/754-cfc5d4ad1b843d29.js +18 -0
  44. sky/dashboard/out/_next/static/chunks/8050-dd8aa107b17dce00.js +16 -0
  45. sky/dashboard/out/_next/static/chunks/8056-d4ae1e0cb81e7368.js +1 -0
  46. sky/dashboard/out/_next/static/chunks/8555.011023e296c127b3.js +6 -0
  47. sky/dashboard/out/_next/static/chunks/8821-93c25df904a8362b.js +1 -0
  48. sky/dashboard/out/_next/static/chunks/8969-0662594b69432ade.js +1 -0
  49. sky/dashboard/out/_next/static/chunks/9025.f15c91c97d124a5f.js +6 -0
  50. sky/dashboard/out/_next/static/chunks/{9353-8369df1cf105221c.js → 9353-7ad6bd01858556f1.js} +1 -1
  51. sky/dashboard/out/_next/static/chunks/pages/_app-5a86569acad99764.js +34 -0
  52. sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]/[job]-8297476714acb4ac.js +6 -0
  53. sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]-337c3ba1085f1210.js +1 -0
  54. sky/dashboard/out/_next/static/chunks/pages/{clusters-9e5d47818b9bdadd.js → clusters-57632ff3684a8b5c.js} +1 -1
  55. sky/dashboard/out/_next/static/chunks/pages/infra/[context]-5fd3a453c079c2ea.js +1 -0
  56. sky/dashboard/out/_next/static/chunks/pages/infra-9f85c02c9c6cae9e.js +1 -0
  57. sky/dashboard/out/_next/static/chunks/pages/jobs/[job]-90f16972cbecf354.js +1 -0
  58. sky/dashboard/out/_next/static/chunks/pages/jobs/pools/[pool]-2dd42fc37aad427a.js +16 -0
  59. sky/dashboard/out/_next/static/chunks/pages/jobs-ed806aeace26b972.js +1 -0
  60. sky/dashboard/out/_next/static/chunks/pages/users-bec34706b36f3524.js +1 -0
  61. sky/dashboard/out/_next/static/chunks/pages/{volumes-ef19d49c6d0e8500.js → volumes-a83ba9b38dff7ea9.js} +1 -1
  62. sky/dashboard/out/_next/static/chunks/pages/workspaces/{[name]-96e0f298308da7e2.js → [name]-c781e9c3e52ef9fc.js} +1 -1
  63. sky/dashboard/out/_next/static/chunks/pages/workspaces-91e0942f47310aae.js +1 -0
  64. sky/dashboard/out/_next/static/chunks/webpack-cfe59cf684ee13b9.js +1 -0
  65. sky/dashboard/out/_next/static/css/b0dbca28f027cc19.css +3 -0
  66. sky/dashboard/out/clusters/[cluster]/[job].html +1 -1
  67. sky/dashboard/out/clusters/[cluster].html +1 -1
  68. sky/dashboard/out/clusters.html +1 -1
  69. sky/dashboard/out/config.html +1 -1
  70. sky/dashboard/out/index.html +1 -1
  71. sky/dashboard/out/infra/[context].html +1 -1
  72. sky/dashboard/out/infra.html +1 -1
  73. sky/dashboard/out/jobs/[job].html +1 -1
  74. sky/dashboard/out/jobs/pools/[pool].html +1 -1
  75. sky/dashboard/out/jobs.html +1 -1
  76. sky/dashboard/out/plugins/[...slug].html +1 -1
  77. sky/dashboard/out/users.html +1 -1
  78. sky/dashboard/out/volumes.html +1 -1
  79. sky/dashboard/out/workspace/new.html +1 -1
  80. sky/dashboard/out/workspaces/[name].html +1 -1
  81. sky/dashboard/out/workspaces.html +1 -1
  82. sky/data/data_utils.py +26 -12
  83. sky/data/mounting_utils.py +29 -4
  84. sky/global_user_state.py +108 -16
  85. sky/jobs/client/sdk.py +8 -3
  86. sky/jobs/controller.py +191 -31
  87. sky/jobs/recovery_strategy.py +109 -11
  88. sky/jobs/server/core.py +81 -4
  89. sky/jobs/server/server.py +14 -0
  90. sky/jobs/state.py +417 -19
  91. sky/jobs/utils.py +73 -80
  92. sky/models.py +9 -0
  93. sky/optimizer.py +2 -1
  94. sky/provision/__init__.py +11 -9
  95. sky/provision/kubernetes/utils.py +122 -15
  96. sky/provision/kubernetes/volume.py +52 -17
  97. sky/provision/provisioner.py +2 -1
  98. sky/provision/runpod/instance.py +3 -1
  99. sky/provision/runpod/utils.py +13 -1
  100. sky/provision/runpod/volume.py +25 -9
  101. sky/provision/slurm/instance.py +75 -29
  102. sky/provision/slurm/utils.py +213 -107
  103. sky/provision/vast/utils.py +1 -0
  104. sky/resources.py +135 -13
  105. sky/schemas/api/responses.py +4 -0
  106. sky/schemas/db/global_user_state/010_save_ssh_key.py +1 -1
  107. sky/schemas/db/spot_jobs/008_add_full_resources.py +34 -0
  108. sky/schemas/db/spot_jobs/009_job_events.py +32 -0
  109. sky/schemas/db/spot_jobs/010_job_events_timestamp_with_timezone.py +43 -0
  110. sky/schemas/db/spot_jobs/011_add_links.py +34 -0
  111. sky/schemas/generated/jobsv1_pb2.py +9 -5
  112. sky/schemas/generated/jobsv1_pb2.pyi +12 -0
  113. sky/schemas/generated/jobsv1_pb2_grpc.py +44 -0
  114. sky/schemas/generated/managed_jobsv1_pb2.py +32 -28
  115. sky/schemas/generated/managed_jobsv1_pb2.pyi +11 -2
  116. sky/serve/serve_utils.py +232 -40
  117. sky/server/common.py +17 -0
  118. sky/server/constants.py +1 -1
  119. sky/server/metrics.py +6 -3
  120. sky/server/plugins.py +16 -0
  121. sky/server/requests/payloads.py +18 -0
  122. sky/server/requests/request_names.py +2 -0
  123. sky/server/requests/requests.py +28 -10
  124. sky/server/requests/serializers/encoders.py +5 -0
  125. sky/server/requests/serializers/return_value_serializers.py +14 -4
  126. sky/server/server.py +434 -107
  127. sky/server/uvicorn.py +5 -0
  128. sky/setup_files/MANIFEST.in +1 -0
  129. sky/setup_files/dependencies.py +21 -10
  130. sky/sky_logging.py +2 -1
  131. sky/skylet/constants.py +22 -5
  132. sky/skylet/executor/slurm.py +4 -6
  133. sky/skylet/job_lib.py +89 -4
  134. sky/skylet/services.py +18 -3
  135. sky/ssh_node_pools/deploy/tunnel/cleanup-tunnel.sh +62 -0
  136. sky/ssh_node_pools/deploy/tunnel/ssh-tunnel.sh +379 -0
  137. sky/templates/kubernetes-ray.yml.j2 +4 -6
  138. sky/templates/slurm-ray.yml.j2 +32 -2
  139. sky/templates/websocket_proxy.py +18 -41
  140. sky/users/permission.py +61 -51
  141. sky/utils/auth_utils.py +42 -0
  142. sky/utils/cli_utils/status_utils.py +19 -5
  143. sky/utils/cluster_utils.py +10 -3
  144. sky/utils/command_runner.py +256 -94
  145. sky/utils/command_runner.pyi +16 -0
  146. sky/utils/common_utils.py +30 -29
  147. sky/utils/context.py +32 -0
  148. sky/utils/db/db_utils.py +36 -6
  149. sky/utils/db/migration_utils.py +41 -21
  150. sky/utils/infra_utils.py +5 -1
  151. sky/utils/instance_links.py +139 -0
  152. sky/utils/interactive_utils.py +49 -0
  153. sky/utils/kubernetes/generate_kubeconfig.sh +42 -33
  154. sky/utils/kubernetes/rsync_helper.sh +5 -1
  155. sky/utils/plugin_extensions/__init__.py +14 -0
  156. sky/utils/plugin_extensions/external_failure_source.py +176 -0
  157. sky/utils/resources_utils.py +10 -8
  158. sky/utils/rich_utils.py +9 -11
  159. sky/utils/schemas.py +63 -20
  160. sky/utils/status_lib.py +7 -0
  161. sky/utils/subprocess_utils.py +17 -0
  162. sky/volumes/client/sdk.py +6 -3
  163. sky/volumes/server/core.py +65 -27
  164. sky_templates/ray/start_cluster +8 -4
  165. {skypilot_nightly-1.0.0.dev20251210.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/METADATA +53 -57
  166. {skypilot_nightly-1.0.0.dev20251210.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/RECORD +172 -162
  167. sky/dashboard/out/_next/static/KYAhEFa3FTfq4JyKVgo-s/_buildManifest.js +0 -1
  168. sky/dashboard/out/_next/static/chunks/1141-9c810f01ff4f398a.js +0 -11
  169. sky/dashboard/out/_next/static/chunks/1871-7e202677c42f43fe.js +0 -6
  170. sky/dashboard/out/_next/static/chunks/2260-7703229c33c5ebd5.js +0 -1
  171. sky/dashboard/out/_next/static/chunks/2350.fab69e61bac57b23.js +0 -1
  172. sky/dashboard/out/_next/static/chunks/2369.fc20f0c2c8ed9fe7.js +0 -15
  173. sky/dashboard/out/_next/static/chunks/2755.edd818326d489a1d.js +0 -26
  174. sky/dashboard/out/_next/static/chunks/3294.ddda8c6c6f9f24dc.js +0 -1
  175. sky/dashboard/out/_next/static/chunks/3785.7e245f318f9d1121.js +0 -1
  176. sky/dashboard/out/_next/static/chunks/3800-b589397dc09c5b4e.js +0 -1
  177. sky/dashboard/out/_next/static/chunks/4725.172ede95d1b21022.js +0 -1
  178. sky/dashboard/out/_next/static/chunks/4937.a2baa2df5572a276.js +0 -15
  179. sky/dashboard/out/_next/static/chunks/6212-7bd06f60ba693125.js +0 -13
  180. sky/dashboard/out/_next/static/chunks/6856-da20c5fd999f319c.js +0 -1
  181. sky/dashboard/out/_next/static/chunks/6989-01359c57e018caa4.js +0 -1
  182. sky/dashboard/out/_next/static/chunks/6990-09cbf02d3cd518c3.js +0 -1
  183. sky/dashboard/out/_next/static/chunks/7359-c8d04e06886000b3.js +0 -30
  184. sky/dashboard/out/_next/static/chunks/7411-b15471acd2cba716.js +0 -41
  185. sky/dashboard/out/_next/static/chunks/7615-019513abc55b3b47.js +0 -1
  186. sky/dashboard/out/_next/static/chunks/8969-452f9d5cbdd2dc73.js +0 -1
  187. sky/dashboard/out/_next/static/chunks/9025.fa408f3242e9028d.js +0 -6
  188. sky/dashboard/out/_next/static/chunks/9360.a536cf6b1fa42355.js +0 -31
  189. sky/dashboard/out/_next/static/chunks/9847.3aaca6bb33455140.js +0 -30
  190. sky/dashboard/out/_next/static/chunks/pages/_app-68b647e26f9d2793.js +0 -34
  191. sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]/[job]-33f525539665fdfd.js +0 -16
  192. sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]-a7565f586ef86467.js +0 -1
  193. sky/dashboard/out/_next/static/chunks/pages/infra/[context]-12c559ec4d81fdbd.js +0 -1
  194. sky/dashboard/out/_next/static/chunks/pages/infra-d187cd0413d72475.js +0 -1
  195. sky/dashboard/out/_next/static/chunks/pages/jobs/[job]-895847b6cf200b04.js +0 -16
  196. sky/dashboard/out/_next/static/chunks/pages/jobs/pools/[pool]-8d0f4655400b4eb9.js +0 -21
  197. sky/dashboard/out/_next/static/chunks/pages/jobs-e5a98f17f8513a96.js +0 -1
  198. sky/dashboard/out/_next/static/chunks/pages/users-2f7646eb77785a2c.js +0 -1
  199. sky/dashboard/out/_next/static/chunks/pages/workspaces-cb4da3abe08ebf19.js +0 -1
  200. sky/dashboard/out/_next/static/chunks/webpack-fba3de387ff6bb08.js +0 -1
  201. sky/dashboard/out/_next/static/css/c5a4cfd2600fc715.css +0 -3
  202. /sky/dashboard/out/_next/static/{KYAhEFa3FTfq4JyKVgo-s → 3nu-b8raeKRNABZ2d4GAG}/_ssgManifest.js +0 -0
  203. /sky/dashboard/out/_next/static/chunks/pages/plugins/{[...slug]-4f46050ca065d8f8.js → [...slug]-449a9f5a3bb20fb3.js} +0 -0
  204. {skypilot_nightly-1.0.0.dev20251210.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/WHEEL +0 -0
  205. {skypilot_nightly-1.0.0.dev20251210.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/entry_points.txt +0 -0
  206. {skypilot_nightly-1.0.0.dev20251210.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/licenses/LICENSE +0 -0
  207. {skypilot_nightly-1.0.0.dev20251210.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/top_level.txt +0 -0
@@ -30,6 +30,10 @@ def volume_refresh():
      volumes = volume_list(is_ephemeral=False)
      for volume in volumes:
          volume_name = volume.name
+         if volume.usedby_fetch_failed:
+             logger.info(f'Skipping status update for volume {volume_name} '
+                         f'due to failed usedby fetch')
+             continue
          usedby_pods = volume.usedby_pods
          with _volume_lock(volume_name):
              latest_volume = global_user_state.get_volume_by_name(volume_name)
@@ -55,6 +59,9 @@ def volume_list(
          is_ephemeral: Optional[bool] = None) -> List[responses.VolumeRecord]:
      """Gets the volumes.

+     Args:
+         is_ephemeral: Whether to include ephemeral volumes.
+
      Returns:
          [
              {
@@ -74,6 +81,7 @@ def volume_list(
                  'status': sky.VolumeStatus,
                  'usedby_pods': List[str],
                  'usedby_clusters': List[str],
+                 'usedby_fetch_failed': bool,
                  'is_ephemeral': bool,
              }
          ]
@@ -93,11 +101,23 @@ def volume_list(
              cloud_to_configs[cloud].append(config)

          cloud_to_used_by_pods, cloud_to_used_by_clusters = {}, {}
+         cloud_to_failed_volume_names = {}
          for cloud, configs in cloud_to_configs.items():
-             used_by_pods, used_by_clusters = provision.get_all_volumes_usedby(
-                 cloud, configs)
-             cloud_to_used_by_pods[cloud] = used_by_pods
-             cloud_to_used_by_clusters[cloud] = used_by_clusters
+             try:
+                 used_by_pods, used_by_clusters, failed_volume_names = (
+                     provision.get_all_volumes_usedby(cloud, configs))
+                 cloud_to_used_by_pods[cloud] = used_by_pods
+                 cloud_to_used_by_clusters[cloud] = used_by_clusters
+                 cloud_to_failed_volume_names[cloud] = failed_volume_names
+             except Exception as e:  # pylint: disable=broad-except
+                 logger.warning(
+                     f'Failed to get usedby info for volumes on {cloud}: {e}')
+                 cloud_to_used_by_pods[cloud] = {}
+                 cloud_to_used_by_clusters[cloud] = {}
+                 cloud_to_failed_volume_names[cloud] = {
+                     config.name for config in configs
+                 }
+                 continue

          all_users = global_user_state.get_all_users()
          user_map = {user.id: user.name for user in all_users}
@@ -114,6 +134,7 @@ def volume_list(
                  'last_use': volume.get('last_use'),
                  'usedby_pods': [],
                  'usedby_clusters': [],
+                 'usedby_fetch_failed': False,
                  'is_ephemeral': volume.get('is_ephemeral', False),
              }
              status = volume.get('status')
@@ -126,12 +147,17 @@ def volume_list(
                  logger.warning(f'Volume {volume_name} has no handle.')
                  continue
              cloud = config.cloud
-             usedby_pods, usedby_clusters = provision.map_all_volumes_usedby(
-                 cloud,
-                 cloud_to_used_by_pods[cloud],
-                 cloud_to_used_by_clusters[cloud],
-                 config,
-             )
+             if volume_name in cloud_to_failed_volume_names[cloud]:
+                 record['usedby_fetch_failed'] = True
+             else:
+                 usedby_pods, usedby_clusters = provision.map_all_volumes_usedby(
+                     cloud,
+                     cloud_to_used_by_pods[cloud],
+                     cloud_to_used_by_clusters[cloud],
+                     config,
+                 )
+                 record['usedby_pods'] = usedby_pods
+                 record['usedby_clusters'] = usedby_clusters
              record['type'] = config.type
              record['cloud'] = config.cloud
              record['region'] = config.region
@@ -139,18 +165,20 @@ def volume_list(
              record['size'] = config.size
              record['config'] = config.config
              record['name_on_cloud'] = config.name_on_cloud
-             record['usedby_pods'] = usedby_pods
-             record['usedby_clusters'] = usedby_clusters
              records.append(responses.VolumeRecord(**record))
          return records


- def volume_delete(names: List[str], ignore_not_found: bool = False) -> None:
+ def volume_delete(names: List[str],
+                   ignore_not_found: bool = False,
+                   purge: bool = False) -> None:
      """Deletes volumes.

      Args:
          names: List of volume names to delete.
          ignore_not_found: If True, ignore volumes that are not found.
+         purge: If True, delete the volume from the database even if the
+             deletion API fails.

      Raises:
          ValueError: If the volume does not exist
@@ -167,22 +195,32 @@ def volume_delete(names: List[str], ignore_not_found: bool = False) -> None:
              if config is None:
                  raise ValueError(f'Volume {name} has no handle.')
              cloud = config.cloud
-             usedby_pods, usedby_clusters = provision.get_volume_usedby(
-                 cloud, config)
-             if usedby_clusters:
-                 usedby_clusters_str = ', '.join(usedby_clusters)
-                 cluster_str = 'clusters' if len(
-                     usedby_clusters) > 1 else 'cluster'
-                 raise ValueError(f'Volume {name} is used by {cluster_str}'
-                                  f' {usedby_clusters_str}.')
-             if usedby_pods:
-                 usedby_pods_str = ', '.join(usedby_pods)
-                 pod_str = 'pods' if len(usedby_pods) > 1 else 'pod'
-                 raise ValueError(
-                     f'Volume {name} is used by {pod_str} {usedby_pods_str}.')
+             if not purge:
+                 usedby_pods, usedby_clusters = provision.get_volume_usedby(
+                     cloud, config)
+                 if usedby_clusters:
+                     usedby_clusters_str = ', '.join(usedby_clusters)
+                     cluster_str = 'clusters' if len(
+                         usedby_clusters) > 1 else 'cluster'
+                     raise ValueError(f'Volume {name} is used by {cluster_str}'
+                                      f' {usedby_clusters_str}.')
+                 if usedby_pods:
+                     usedby_pods_str = ', '.join(usedby_pods)
+                     pod_str = 'pods' if len(usedby_pods) > 1 else 'pod'
+                     raise ValueError(
+                         f'Volume {name} is used by {pod_str} {usedby_pods_str}.'
+                     )
              logger.debug(f'Deleting volume {name} with config {config}')
              with _volume_lock(name):
-                 provision.delete_volume(cloud, config)
+                 try:
+                     provision.delete_volume(cloud, config)
+                 except Exception as e:  # pylint: disable=broad-except
+                     if purge:
+                         logger.warning(f'Failed to delete volume {name} '
+                                        f'on {cloud}: {e}. Purging from '
+                                        'database.')
+                     else:
+                         raise
                  global_user_state.delete_volume(name)
      logger.info(f'Deleted volumes: {names}')
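Note on the `purge` change above: with `purge=True` the in-use check is skipped and a failed cloud-side deletion no longer aborts removal of the database record. The following is only a minimal, self-contained sketch of that control flow, not the SkyPilot implementation. The names `delete_volumes`, `records`, `get_usedby`, and `delete_on_cloud` are hypothetical stand-ins for the state database and the provisioner calls shown in the diff.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)


def delete_volumes(names: List[str],
                   records: Dict[str, dict],
                   get_usedby: Callable[[str], List[str]],
                   delete_on_cloud: Callable[[str], None],
                   purge: bool = False) -> None:
    """Deletes volumes; purge=True drops the record even if the cloud call fails.

    All parameters are illustrative stand-ins (assumptions), not SkyPilot APIs.
    """
    for name in names:
        if name not in records:
            raise ValueError(f'Volume {name} does not exist.')
        if not purge:
            # Without purge, refuse to delete a volume that is still in use.
            users = get_usedby(name)
            if users:
                users_str = ', '.join(users)
                raise ValueError(f'Volume {name} is used by {users_str}.')
        try:
            delete_on_cloud(name)
        except Exception as e:  # pylint: disable=broad-except
            if not purge:
                raise
            # With purge, log the failure and still remove the local record.
            logger.warning(f'Failed to delete volume {name}: {e}. '
                           'Purging from database.')
        del records[name]
```

One consequence of this flow is that a purged volume may still exist on the cloud side, so the warning log is the only trace that cleanup there failed.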
@@ -77,14 +77,18 @@ if ! run_ray --version > /dev/null; then
  fi
  echo -e "${GREEN}Ray $(run_ray --version | cut -d' ' -f3) is installed.${NC}"

- RAY_ADDRESS="127.0.0.1:${RAY_HEAD_PORT}"
+ LOCAL_RAY_ADDRESS="127.0.0.1:${RAY_HEAD_PORT}"
+ RAY_ADDRESS=${LOCAL_RAY_ADDRESS}
  if [ "${SKYPILOT_NODE_RANK}" -ne 0 ]; then
      HEAD_IP=$(echo "${SKYPILOT_NODE_IPS}" | head -n1)
      RAY_ADDRESS="${HEAD_IP}:${RAY_HEAD_PORT}"
  fi

- # Check if user-space Ray is already running
- if run_ray status --address="${RAY_ADDRESS}" &> /dev/null; then
+ # Check if user-space Ray is already running. Use local address to check, as
+ # if we use the head node address, the check will succeed even if the Ray
+ # cluster is started on the head node but not started on the current worker
+ # node.
+ if run_ray status --address="${LOCAL_RAY_ADDRESS}" &> /dev/null; then
      echo -e "${YELLOW}Ray cluster is already running.${NC}"
      run_ray status --address="${RAY_ADDRESS}"
      exit 0
@@ -140,7 +144,7 @@ if [ "${SKYPILOT_NODE_RANK}" -eq 0 ]; then
              echo -e "${RED}Error: Timeout waiting for nodes.${NC}" >&2
              exit 1
          fi
-         ready_nodes=$(run_ray list nodes --format=json | python3 -c "import sys, json; print(len(json.load(sys.stdin)))")
+         ready_nodes=$(run_ray list nodes --address="${RAY_ADDRESS}" --format=json | python3 -c "import sys, json; print(len(json.load(sys.stdin)))")
          if [ "${ready_nodes}" -ge "${SKYPILOT_NUM_NODES}" ]; then
              break
          fi
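Note on the address change above: the script now keeps two addresses, a local one used to test whether Ray is already running on the current node, and a cluster one (the head node's) used for status and node listing. The sketch below restates that selection logic in Python for illustration only. `resolve_ray_addresses` is a hypothetical helper, and the port value in the usage lines is an assumption rather than something the diff specifies.

```python
from typing import List, Tuple


def resolve_ray_addresses(node_rank: int, node_ips: List[str],
                          head_port: int) -> Tuple[str, str]:
    """Returns (local_address, cluster_address) for one node.

    Mirrors the shell logic: the local address is always the loopback
    address; the cluster address points at the first IP (the head node)
    for every node whose rank is not 0.
    """
    local_address = f'127.0.0.1:{head_port}'
    cluster_address = local_address
    if node_rank != 0:
        head_ip = node_ips[0]
        cluster_address = f'{head_ip}:{head_port}'
    return local_address, cluster_address


# Usage with assumed values: a worker (rank 1) in a two-node cluster.
local, cluster = resolve_ray_addresses(1, ['10.0.0.1', '10.0.0.2'], 6379)
print(local, cluster)
```

Checking liveness against the local address avoids the case described in the script's comment, where a worker sees the head node's Ray instance and wrongly concludes its own is running.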
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: skypilot-nightly
- Version: 1.0.0.dev20251210
+ Version: 1.0.0.dev20260112
  Summary: SkyPilot: Run AI on Any Infra — Unified, Faster, Cheaper.
  Author: SkyPilot Team
  License: Apache 2.0
@@ -73,7 +73,7 @@ Provides-Extra: aws
  Requires-Dist: awscli>=1.27.10; extra == "aws"
  Requires-Dist: botocore>=1.29.10; extra == "aws"
  Requires-Dist: boto3>=1.26.1; extra == "aws"
- Requires-Dist: colorama<0.4.5; extra == "aws"
+ Requires-Dist: colorama<0.4.7; extra == "aws"
  Requires-Dist: casbin; extra == "aws"
  Requires-Dist: sqlalchemy_adapter; extra == "aws"
  Requires-Dist: passlib; extra == "aws"
@@ -161,7 +161,7 @@ Provides-Extra: cloudflare
  Requires-Dist: awscli>=1.27.10; extra == "cloudflare"
  Requires-Dist: botocore>=1.29.10; extra == "cloudflare"
  Requires-Dist: boto3>=1.26.1; extra == "cloudflare"
- Requires-Dist: colorama<0.4.5; extra == "cloudflare"
+ Requires-Dist: colorama<0.4.7; extra == "cloudflare"
  Requires-Dist: casbin; extra == "cloudflare"
  Requires-Dist: sqlalchemy_adapter; extra == "cloudflare"
  Requires-Dist: passlib; extra == "cloudflare"
@@ -176,7 +176,7 @@ Provides-Extra: coreweave
  Requires-Dist: awscli>=1.27.10; extra == "coreweave"
  Requires-Dist: botocore>=1.29.10; extra == "coreweave"
  Requires-Dist: boto3>=1.26.1; extra == "coreweave"
- Requires-Dist: colorama<0.4.5; extra == "coreweave"
+ Requires-Dist: colorama<0.4.7; extra == "coreweave"
  Requires-Dist: kubernetes!=32.0.0,>=20.0.0; extra == "coreweave"
  Requires-Dist: websockets; extra == "coreweave"
  Requires-Dist: python-dateutil; extra == "coreweave"
@@ -245,6 +245,7 @@ Requires-Dist: greenlet; extra == "ssh"
  Provides-Extra: runpod
  Requires-Dist: runpod>=1.6.1; extra == "runpod"
  Requires-Dist: tomli; extra == "runpod"
+ Requires-Dist: pycares<5; extra == "runpod"
  Requires-Dist: casbin; extra == "runpod"
  Requires-Dist: sqlalchemy_adapter; extra == "runpod"
  Requires-Dist: passlib; extra == "runpod"
@@ -345,7 +346,7 @@ Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "nebius"
  Requires-Dist: awscli>=1.27.10; extra == "nebius"
  Requires-Dist: botocore>=1.29.10; extra == "nebius"
  Requires-Dist: boto3>=1.26.1; extra == "nebius"
- Requires-Dist: colorama<0.4.5; extra == "nebius"
+ Requires-Dist: colorama<0.4.7; extra == "nebius"
  Requires-Dist: casbin; extra == "nebius"
  Requires-Dist: sqlalchemy_adapter; extra == "nebius"
  Requires-Dist: passlib; extra == "nebius"
@@ -391,6 +392,7 @@ Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "shadeform"
  Requires-Dist: aiosqlite; extra == "shadeform"
  Requires-Dist: greenlet; extra == "shadeform"
  Provides-Extra: slurm
+ Requires-Dist: python-hostlist; extra == "slurm"
  Requires-Dist: casbin; extra == "slurm"
  Requires-Dist: sqlalchemy_adapter; extra == "slurm"
  Requires-Dist: passlib; extra == "slurm"
@@ -402,51 +404,53 @@ Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "slurm"
  Requires-Dist: aiosqlite; extra == "slurm"
  Requires-Dist: greenlet; extra == "slurm"
  Provides-Extra: all
- Requires-Dist: ibm-cloud-sdk-core; extra == "all"
- Requires-Dist: azure-core>=1.24.0; extra == "all"
- Requires-Dist: pyopenssl<24.3.0,>=23.2.0; extra == "all"
- Requires-Dist: colorama<0.4.5; extra == "all"
- Requires-Dist: sqlalchemy_adapter; extra == "all"
+ Requires-Dist: aiohttp; extra == "all"
+ Requires-Dist: tomli; extra == "all"
+ Requires-Dist: ecsapi==0.4.0; extra == "all"
  Requires-Dist: msgraph-sdk; extra == "all"
- Requires-Dist: greenlet; extra == "all"
- Requires-Dist: oci; extra == "all"
- Requires-Dist: azure-storage-blob>=12.23.1; extra == "all"
- Requires-Dist: vastai-sdk>=0.1.12; extra == "all"
- Requires-Dist: pyjwt; extra == "all"
- Requires-Dist: azure-mgmt-compute>=33.0.0; extra == "all"
+ Requires-Dist: azure-cli>=2.65.0; extra == "all"
+ Requires-Dist: python-dateutil; extra == "all"
  Requires-Dist: ray[default]>=2.6.1; extra == "all"
- Requires-Dist: runpod>=1.6.1; extra == "all"
- Requires-Dist: docker; extra == "all"
+ Requires-Dist: azure-storage-blob>=12.23.1; extra == "all"
+ Requires-Dist: pydo>=0.3.0; extra == "all"
+ Requires-Dist: google-cloud-storage; extra == "all"
  Requires-Dist: azure-identity>=1.19.0; extra == "all"
- Requires-Dist: python-dateutil; extra == "all"
+ Requires-Dist: grpcio>=1.63.0; extra == "all"
+ Requires-Dist: colorama<0.4.7; extra == "all"
+ Requires-Dist: boto3>=1.26.1; extra == "all"
+ Requires-Dist: docker; extra == "all"
+ Requires-Dist: sqlalchemy_adapter; extra == "all"
+ Requires-Dist: anyio; extra == "all"
+ Requires-Dist: pyjwt; extra == "all"
+ Requires-Dist: google-api-python-client>=2.69.0; extra == "all"
+ Requires-Dist: oci; extra == "all"
+ Requires-Dist: pyvmomi==8.0.1.0.2; extra == "all"
+ Requires-Dist: websockets; extra == "all"
  Requires-Dist: kubernetes!=32.0.0,>=20.0.0; extra == "all"
- Requires-Dist: azure-core>=1.31.0; extra == "all"
+ Requires-Dist: ibm-cloud-sdk-core; extra == "all"
+ Requires-Dist: runpod>=1.6.1; extra == "all"
+ Requires-Dist: azure-core>=1.24.0; extra == "all"
  Requires-Dist: passlib; extra == "all"
- Requires-Dist: awscli>=1.27.10; extra == "all"
+ Requires-Dist: ibm-vpc; extra == "all"
+ Requires-Dist: nebius>=0.3.12; extra == "all"
  Requires-Dist: cudo-compute>=0.1.10; extra == "all"
- Requires-Dist: boto3>=1.26.1; extra == "all"
- Requires-Dist: botocore>=1.29.10; extra == "all"
- Requires-Dist: websockets; extra == "all"
- Requires-Dist: azure-mgmt-network>=27.0.0; extra == "all"
- Requires-Dist: azure-cli>=2.65.0; extra == "all"
  Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "all"
+ Requires-Dist: awscli>=1.27.10; extra == "all"
+ Requires-Dist: pycares<5; extra == "all"
+ Requires-Dist: ibm-platform-services>=0.48.0; extra == "all"
+ Requires-Dist: greenlet; extra == "all"
+ Requires-Dist: azure-core>=1.31.0; extra == "all"
  Requires-Dist: msrestazure; extra == "all"
+ Requires-Dist: vastai-sdk>=0.1.12; extra == "all"
+ Requires-Dist: pyopenssl<24.3.0,>=23.2.0; extra == "all"
  Requires-Dist: ibm-cos-sdk; extra == "all"
- Requires-Dist: grpcio>=1.63.0; extra == "all"
- Requires-Dist: google-api-python-client>=2.69.0; extra == "all"
- Requires-Dist: azure-common; extra == "all"
- Requires-Dist: aiohttp; extra == "all"
- Requires-Dist: nebius>=0.3.12; extra == "all"
- Requires-Dist: ibm-vpc; extra == "all"
+ Requires-Dist: python-hostlist; extra == "all"
+ Requires-Dist: azure-mgmt-compute>=33.0.0; extra == "all"
+ Requires-Dist: botocore>=1.29.10; extra == "all"
+ Requires-Dist: azure-mgmt-network>=27.0.0; extra == "all"
  Requires-Dist: casbin; extra == "all"
- Requires-Dist: pyvmomi==8.0.1.0.2; extra == "all"
- Requires-Dist: ibm-platform-services>=0.48.0; extra == "all"
- Requires-Dist: tomli; extra == "all"
- Requires-Dist: ecsapi==0.4.0; extra == "all"
- Requires-Dist: pydo>=0.3.0; extra == "all"
- Requires-Dist: google-cloud-storage; extra == "all"
- Requires-Dist: anyio; extra == "all"
  Requires-Dist: aiosqlite; extra == "all"
+ Requires-Dist: azure-common; extra == "all"
  Provides-Extra: remote
  Requires-Dist: grpcio>=1.63.0; extra == "remote"
  Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "remote"
@@ -496,7 +500,7 @@ Dynamic: summary
  </p>

  <h3 align="center">
- Simplify & scale any AI infrastructure
+ Run AI on Any Infrastructure
  </h3>

  <div align="center">
@@ -506,10 +510,18 @@ Dynamic: summary
  </div>


+ SkyPilot is a system to run, manage, and scale AI workloads on any AI infrastructure.
+
+ SkyPilot gives **AI teams** a simple interface to run jobs on any infra.
+ **Infra teams** get a unified control plane to manage any AI compute — with advanced scheduling, scaling, and orchestration.
+
+ <img src="./docs/source/images/skypilot-abstractions-long-2.png" alt="SkyPilot Abstractions">

- ----
+ -----

  :fire: *News* :fire:
+ - [Dec 2025] **SkyPilot v0.11** released: Multi-Cloud Pools, Fast Managed Jobs, Enterprise-Readiness at Large Scale, Programmability. [**Release notes**](https://github.com/skypilot-org/skypilot/releases/tag/v0.11.0)
+ - [Dec 2025] **SkyPilot Pools** released: Run batch inference and other jobs on a managed pool of warm workers (across clouds or clusters). [**blog**](https://blog.skypilot.co/skypilot-pools-deepseek-ocr/), [**docs**](https://docs.skypilot.co/en/latest/examples/pools.html)
  - [Nov 2025] Serve **Kimi K2 Thinking** with reasoning capabilities on your Kubernetes or clouds: [**example**](./llm/kimi-k2-thinking/)
  - [Oct 2025] Run **RL training for LLMs** with SkyRL on your Kubernetes or clouds: [**example**](./llm/skyrl/)
  - [Oct 2025] Train and serve [Andrej Karpathy's](https://x.com/karpathy/status/1977755427569111362) **nanochat** - the best ChatGPT that $100 can buy: [**example**](./llm/nanochat)
@@ -518,22 +530,6 @@ Dynamic: summary
  - [Sep 2025] Network and Storage Benchmarks for LLM training on the cloud: [**blog**](https://maknee.github.io/blog/2025/Network-And-Storage-Training-Skypilot/)
  - [Aug 2025] Serve and finetune **OpenAI GPT-OSS models** (gpt-oss-120b, gpt-oss-20b) with one command on any infra: [**serve**](./llm/gpt-oss/) + [**LoRA and full finetuning**](./llm/gpt-oss-finetuning/)
  - [Jul 2025] Run distributed **RL training for LLMs** with Verl (PPO, GRPO) on any cloud: [**example**](./llm/verl/)
- - [Jul 2025] Finetune **Llama4** on any distributed cluster/cloud: [**example**](./llm/llama-4-finetuning/)
- - [Jul 2025] Two-part blog series, `The Evolution of AI Job Orchestration`: (1) [Running AI jobs on GPU Neoclouds](https://blog.skypilot.co/ai-job-orchestration-pt1-gpu-neoclouds/), (2) [The AI-Native Control Plane & Orchestration that Finally Works for ML](https://blog.skypilot.co/ai-job-orchestration-pt2-ai-control-plane/)
- - [Apr 2025] Spin up **Qwen3** on your cluster/cloud: [**example**](./llm/qwen/)
-
-
-
- **LLM Finetuning Cookbooks**: Finetuning Llama 2 / Llama 3.1 in your own cloud environment, privately: Llama 2 [**example**](./llm/vicuna-llama-2/) and [**blog**](https://blog.skypilot.co/finetuning-llama2-operational-guide/); Llama 3.1 [**example**](./llm/llama-3_1-finetuning/) and [**blog**](https://blog.skypilot.co/finetune-llama-3_1-on-your-infra/)
-
- ----
-
- SkyPilot is a system to run, manage, and scale AI workloads on any AI infrastructure.
-
- SkyPilot gives **AI teams** a simple interface to run jobs on any infra.
- **Infra teams** get a unified control plane to manage any AI compute — with advanced scheduling, scaling, and orchestration.
-
- <img src="./docs/source/images/skypilot-abstractions-long-2.png" alt="SkyPilot Abstractions">

  ## Overview