skypilot-nightly 1.0.0.dev20251210__py3-none-any.whl → 1.0.0.dev20260112__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- sky/__init__.py +4 -2
- sky/adaptors/slurm.py +159 -72
- sky/backends/backend_utils.py +52 -10
- sky/backends/cloud_vm_ray_backend.py +192 -32
- sky/backends/task_codegen.py +40 -2
- sky/catalog/data_fetchers/fetch_gcp.py +9 -1
- sky/catalog/data_fetchers/fetch_nebius.py +1 -1
- sky/catalog/data_fetchers/fetch_vast.py +4 -2
- sky/catalog/seeweb_catalog.py +30 -15
- sky/catalog/shadeform_catalog.py +5 -2
- sky/catalog/slurm_catalog.py +0 -7
- sky/catalog/vast_catalog.py +30 -6
- sky/check.py +11 -8
- sky/client/cli/command.py +106 -54
- sky/client/interactive_utils.py +190 -0
- sky/client/sdk.py +8 -0
- sky/client/sdk_async.py +9 -0
- sky/clouds/aws.py +60 -2
- sky/clouds/azure.py +2 -0
- sky/clouds/kubernetes.py +2 -0
- sky/clouds/runpod.py +38 -7
- sky/clouds/slurm.py +44 -12
- sky/clouds/ssh.py +1 -1
- sky/clouds/vast.py +30 -17
- sky/core.py +69 -1
- sky/dashboard/out/404.html +1 -1
- sky/dashboard/out/_next/static/3nu-b8raeKRNABZ2d4GAG/_buildManifest.js +1 -0
- sky/dashboard/out/_next/static/chunks/1871-0565f8975a7dcd10.js +6 -0
- sky/dashboard/out/_next/static/chunks/2109-55a1546d793574a7.js +11 -0
- sky/dashboard/out/_next/static/chunks/2521-099b07cd9e4745bf.js +26 -0
- sky/dashboard/out/_next/static/chunks/2755.a636e04a928a700e.js +31 -0
- sky/dashboard/out/_next/static/chunks/3495.05eab4862217c1a5.js +6 -0
- sky/dashboard/out/_next/static/chunks/3785.cfc5dcc9434fd98c.js +1 -0
- sky/dashboard/out/_next/static/chunks/3981.645d01bf9c8cad0c.js +21 -0
- sky/dashboard/out/_next/static/chunks/4083-0115d67c1fb57d6c.js +21 -0
- sky/dashboard/out/_next/static/chunks/{8640.5b9475a2d18c5416.js → 429.a58e9ba9742309ed.js} +2 -2
- sky/dashboard/out/_next/static/chunks/4555.8e221537181b5dc1.js +6 -0
- sky/dashboard/out/_next/static/chunks/4725.937865b81fdaaebb.js +6 -0
- sky/dashboard/out/_next/static/chunks/6082-edabd8f6092300ce.js +25 -0
- sky/dashboard/out/_next/static/chunks/6989-49cb7dca83a7a62d.js +1 -0
- sky/dashboard/out/_next/static/chunks/6990-630bd2a2257275f8.js +1 -0
- sky/dashboard/out/_next/static/chunks/7248-a99800d4db8edabd.js +1 -0
- sky/dashboard/out/_next/static/chunks/754-cfc5d4ad1b843d29.js +18 -0
- sky/dashboard/out/_next/static/chunks/8050-dd8aa107b17dce00.js +16 -0
- sky/dashboard/out/_next/static/chunks/8056-d4ae1e0cb81e7368.js +1 -0
- sky/dashboard/out/_next/static/chunks/8555.011023e296c127b3.js +6 -0
- sky/dashboard/out/_next/static/chunks/8821-93c25df904a8362b.js +1 -0
- sky/dashboard/out/_next/static/chunks/8969-0662594b69432ade.js +1 -0
- sky/dashboard/out/_next/static/chunks/9025.f15c91c97d124a5f.js +6 -0
- sky/dashboard/out/_next/static/chunks/{9353-8369df1cf105221c.js → 9353-7ad6bd01858556f1.js} +1 -1
- sky/dashboard/out/_next/static/chunks/pages/_app-5a86569acad99764.js +34 -0
- sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]/[job]-8297476714acb4ac.js +6 -0
- sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]-337c3ba1085f1210.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/{clusters-9e5d47818b9bdadd.js → clusters-57632ff3684a8b5c.js} +1 -1
- sky/dashboard/out/_next/static/chunks/pages/infra/[context]-5fd3a453c079c2ea.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/infra-9f85c02c9c6cae9e.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/jobs/[job]-90f16972cbecf354.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/jobs/pools/[pool]-2dd42fc37aad427a.js +16 -0
- sky/dashboard/out/_next/static/chunks/pages/jobs-ed806aeace26b972.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/users-bec34706b36f3524.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/{volumes-ef19d49c6d0e8500.js → volumes-a83ba9b38dff7ea9.js} +1 -1
- sky/dashboard/out/_next/static/chunks/pages/workspaces/{[name]-96e0f298308da7e2.js → [name]-c781e9c3e52ef9fc.js} +1 -1
- sky/dashboard/out/_next/static/chunks/pages/workspaces-91e0942f47310aae.js +1 -0
- sky/dashboard/out/_next/static/chunks/webpack-cfe59cf684ee13b9.js +1 -0
- sky/dashboard/out/_next/static/css/b0dbca28f027cc19.css +3 -0
- sky/dashboard/out/clusters/[cluster]/[job].html +1 -1
- sky/dashboard/out/clusters/[cluster].html +1 -1
- sky/dashboard/out/clusters.html +1 -1
- sky/dashboard/out/config.html +1 -1
- sky/dashboard/out/index.html +1 -1
- sky/dashboard/out/infra/[context].html +1 -1
- sky/dashboard/out/infra.html +1 -1
- sky/dashboard/out/jobs/[job].html +1 -1
- sky/dashboard/out/jobs/pools/[pool].html +1 -1
- sky/dashboard/out/jobs.html +1 -1
- sky/dashboard/out/plugins/[...slug].html +1 -1
- sky/dashboard/out/users.html +1 -1
- sky/dashboard/out/volumes.html +1 -1
- sky/dashboard/out/workspace/new.html +1 -1
- sky/dashboard/out/workspaces/[name].html +1 -1
- sky/dashboard/out/workspaces.html +1 -1
- sky/data/data_utils.py +26 -12
- sky/data/mounting_utils.py +29 -4
- sky/global_user_state.py +108 -16
- sky/jobs/client/sdk.py +8 -3
- sky/jobs/controller.py +191 -31
- sky/jobs/recovery_strategy.py +109 -11
- sky/jobs/server/core.py +81 -4
- sky/jobs/server/server.py +14 -0
- sky/jobs/state.py +417 -19
- sky/jobs/utils.py +73 -80
- sky/models.py +9 -0
- sky/optimizer.py +2 -1
- sky/provision/__init__.py +11 -9
- sky/provision/kubernetes/utils.py +122 -15
- sky/provision/kubernetes/volume.py +52 -17
- sky/provision/provisioner.py +2 -1
- sky/provision/runpod/instance.py +3 -1
- sky/provision/runpod/utils.py +13 -1
- sky/provision/runpod/volume.py +25 -9
- sky/provision/slurm/instance.py +75 -29
- sky/provision/slurm/utils.py +213 -107
- sky/provision/vast/utils.py +1 -0
- sky/resources.py +135 -13
- sky/schemas/api/responses.py +4 -0
- sky/schemas/db/global_user_state/010_save_ssh_key.py +1 -1
- sky/schemas/db/spot_jobs/008_add_full_resources.py +34 -0
- sky/schemas/db/spot_jobs/009_job_events.py +32 -0
- sky/schemas/db/spot_jobs/010_job_events_timestamp_with_timezone.py +43 -0
- sky/schemas/db/spot_jobs/011_add_links.py +34 -0
- sky/schemas/generated/jobsv1_pb2.py +9 -5
- sky/schemas/generated/jobsv1_pb2.pyi +12 -0
- sky/schemas/generated/jobsv1_pb2_grpc.py +44 -0
- sky/schemas/generated/managed_jobsv1_pb2.py +32 -28
- sky/schemas/generated/managed_jobsv1_pb2.pyi +11 -2
- sky/serve/serve_utils.py +232 -40
- sky/server/common.py +17 -0
- sky/server/constants.py +1 -1
- sky/server/metrics.py +6 -3
- sky/server/plugins.py +16 -0
- sky/server/requests/payloads.py +18 -0
- sky/server/requests/request_names.py +2 -0
- sky/server/requests/requests.py +28 -10
- sky/server/requests/serializers/encoders.py +5 -0
- sky/server/requests/serializers/return_value_serializers.py +14 -4
- sky/server/server.py +434 -107
- sky/server/uvicorn.py +5 -0
- sky/setup_files/MANIFEST.in +1 -0
- sky/setup_files/dependencies.py +21 -10
- sky/sky_logging.py +2 -1
- sky/skylet/constants.py +22 -5
- sky/skylet/executor/slurm.py +4 -6
- sky/skylet/job_lib.py +89 -4
- sky/skylet/services.py +18 -3
- sky/ssh_node_pools/deploy/tunnel/cleanup-tunnel.sh +62 -0
- sky/ssh_node_pools/deploy/tunnel/ssh-tunnel.sh +379 -0
- sky/templates/kubernetes-ray.yml.j2 +4 -6
- sky/templates/slurm-ray.yml.j2 +32 -2
- sky/templates/websocket_proxy.py +18 -41
- sky/users/permission.py +61 -51
- sky/utils/auth_utils.py +42 -0
- sky/utils/cli_utils/status_utils.py +19 -5
- sky/utils/cluster_utils.py +10 -3
- sky/utils/command_runner.py +256 -94
- sky/utils/command_runner.pyi +16 -0
- sky/utils/common_utils.py +30 -29
- sky/utils/context.py +32 -0
- sky/utils/db/db_utils.py +36 -6
- sky/utils/db/migration_utils.py +41 -21
- sky/utils/infra_utils.py +5 -1
- sky/utils/instance_links.py +139 -0
- sky/utils/interactive_utils.py +49 -0
- sky/utils/kubernetes/generate_kubeconfig.sh +42 -33
- sky/utils/kubernetes/rsync_helper.sh +5 -1
- sky/utils/plugin_extensions/__init__.py +14 -0
- sky/utils/plugin_extensions/external_failure_source.py +176 -0
- sky/utils/resources_utils.py +10 -8
- sky/utils/rich_utils.py +9 -11
- sky/utils/schemas.py +63 -20
- sky/utils/status_lib.py +7 -0
- sky/utils/subprocess_utils.py +17 -0
- sky/volumes/client/sdk.py +6 -3
- sky/volumes/server/core.py +65 -27
- sky_templates/ray/start_cluster +8 -4
- {skypilot_nightly-1.0.0.dev20251210.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/METADATA +53 -57
- {skypilot_nightly-1.0.0.dev20251210.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/RECORD +172 -162
- sky/dashboard/out/_next/static/KYAhEFa3FTfq4JyKVgo-s/_buildManifest.js +0 -1
- sky/dashboard/out/_next/static/chunks/1141-9c810f01ff4f398a.js +0 -11
- sky/dashboard/out/_next/static/chunks/1871-7e202677c42f43fe.js +0 -6
- sky/dashboard/out/_next/static/chunks/2260-7703229c33c5ebd5.js +0 -1
- sky/dashboard/out/_next/static/chunks/2350.fab69e61bac57b23.js +0 -1
- sky/dashboard/out/_next/static/chunks/2369.fc20f0c2c8ed9fe7.js +0 -15
- sky/dashboard/out/_next/static/chunks/2755.edd818326d489a1d.js +0 -26
- sky/dashboard/out/_next/static/chunks/3294.ddda8c6c6f9f24dc.js +0 -1
- sky/dashboard/out/_next/static/chunks/3785.7e245f318f9d1121.js +0 -1
- sky/dashboard/out/_next/static/chunks/3800-b589397dc09c5b4e.js +0 -1
- sky/dashboard/out/_next/static/chunks/4725.172ede95d1b21022.js +0 -1
- sky/dashboard/out/_next/static/chunks/4937.a2baa2df5572a276.js +0 -15
- sky/dashboard/out/_next/static/chunks/6212-7bd06f60ba693125.js +0 -13
- sky/dashboard/out/_next/static/chunks/6856-da20c5fd999f319c.js +0 -1
- sky/dashboard/out/_next/static/chunks/6989-01359c57e018caa4.js +0 -1
- sky/dashboard/out/_next/static/chunks/6990-09cbf02d3cd518c3.js +0 -1
- sky/dashboard/out/_next/static/chunks/7359-c8d04e06886000b3.js +0 -30
- sky/dashboard/out/_next/static/chunks/7411-b15471acd2cba716.js +0 -41
- sky/dashboard/out/_next/static/chunks/7615-019513abc55b3b47.js +0 -1
- sky/dashboard/out/_next/static/chunks/8969-452f9d5cbdd2dc73.js +0 -1
- sky/dashboard/out/_next/static/chunks/9025.fa408f3242e9028d.js +0 -6
- sky/dashboard/out/_next/static/chunks/9360.a536cf6b1fa42355.js +0 -31
- sky/dashboard/out/_next/static/chunks/9847.3aaca6bb33455140.js +0 -30
- sky/dashboard/out/_next/static/chunks/pages/_app-68b647e26f9d2793.js +0 -34
- sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]/[job]-33f525539665fdfd.js +0 -16
- sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]-a7565f586ef86467.js +0 -1
- sky/dashboard/out/_next/static/chunks/pages/infra/[context]-12c559ec4d81fdbd.js +0 -1
- sky/dashboard/out/_next/static/chunks/pages/infra-d187cd0413d72475.js +0 -1
- sky/dashboard/out/_next/static/chunks/pages/jobs/[job]-895847b6cf200b04.js +0 -16
- sky/dashboard/out/_next/static/chunks/pages/jobs/pools/[pool]-8d0f4655400b4eb9.js +0 -21
- sky/dashboard/out/_next/static/chunks/pages/jobs-e5a98f17f8513a96.js +0 -1
- sky/dashboard/out/_next/static/chunks/pages/users-2f7646eb77785a2c.js +0 -1
- sky/dashboard/out/_next/static/chunks/pages/workspaces-cb4da3abe08ebf19.js +0 -1
- sky/dashboard/out/_next/static/chunks/webpack-fba3de387ff6bb08.js +0 -1
- sky/dashboard/out/_next/static/css/c5a4cfd2600fc715.css +0 -3
- /sky/dashboard/out/_next/static/{KYAhEFa3FTfq4JyKVgo-s → 3nu-b8raeKRNABZ2d4GAG}/_ssgManifest.js +0 -0
- /sky/dashboard/out/_next/static/chunks/pages/plugins/{[...slug]-4f46050ca065d8f8.js → [...slug]-449a9f5a3bb20fb3.js} +0 -0
- {skypilot_nightly-1.0.0.dev20251210.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/WHEEL +0 -0
- {skypilot_nightly-1.0.0.dev20251210.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/entry_points.txt +0 -0
- {skypilot_nightly-1.0.0.dev20251210.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/licenses/LICENSE +0 -0
- {skypilot_nightly-1.0.0.dev20251210.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/top_level.txt +0 -0
sky/volumes/server/core.py
CHANGED
@@ -30,6 +30,10 @@ def volume_refresh():
     volumes = volume_list(is_ephemeral=False)
     for volume in volumes:
         volume_name = volume.name
+        if volume.usedby_fetch_failed:
+            logger.info(f'Skipping status update for volume {volume_name} '
+                        f'due to failed usedby fetch')
+            continue
         usedby_pods = volume.usedby_pods
         with _volume_lock(volume_name):
             latest_volume = global_user_state.get_volume_by_name(volume_name)
@@ -55,6 +59,9 @@ def volume_list(
         is_ephemeral: Optional[bool] = None) -> List[responses.VolumeRecord]:
     """Gets the volumes.
 
+    Args:
+        is_ephemeral: Whether to include ephemeral volumes.
+
     Returns:
         [
             {
@@ -74,6 +81,7 @@ def volume_list(
                 'status': sky.VolumeStatus,
                 'usedby_pods': List[str],
                 'usedby_clusters': List[str],
+                'usedby_fetch_failed': bool,
                 'is_ephemeral': bool,
             }
         ]
@@ -93,11 +101,23 @@ def volume_list(
         cloud_to_configs[cloud].append(config)
 
     cloud_to_used_by_pods, cloud_to_used_by_clusters = {}, {}
+    cloud_to_failed_volume_names = {}
     for cloud, configs in cloud_to_configs.items():
-
-
-
-
+        try:
+            used_by_pods, used_by_clusters, failed_volume_names = (
+                provision.get_all_volumes_usedby(cloud, configs))
+            cloud_to_used_by_pods[cloud] = used_by_pods
+            cloud_to_used_by_clusters[cloud] = used_by_clusters
+            cloud_to_failed_volume_names[cloud] = failed_volume_names
+        except Exception as e:  # pylint: disable=broad-except
+            logger.warning(
+                f'Failed to get usedby info for volumes on {cloud}: {e}')
+            cloud_to_used_by_pods[cloud] = {}
+            cloud_to_used_by_clusters[cloud] = {}
+            cloud_to_failed_volume_names[cloud] = {
+                config.name for config in configs
+            }
+            continue
 
     all_users = global_user_state.get_all_users()
     user_map = {user.id: user.name for user in all_users}
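The per-cloud error handling in the hunk above can be sketched in isolation. This is a minimal illustration under stated assumptions, not SkyPilot's code: `collect_usedby` and `fetch_usedby` are hypothetical names, the fetcher stands in for `provision.get_all_volumes_usedby`, and volume configs are reduced to plain volume names.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical fetcher type: returns (used_by_pods, used_by_clusters,
# failed_volume_names) for one cloud. It stands in for
# provision.get_all_volumes_usedby, whose real signature is not shown here.
Fetcher = Callable[[str, List[str]],
                   Tuple[Dict[str, List[str]], Dict[str, List[str]], Set[str]]]


def collect_usedby(cloud_to_names: Dict[str, List[str]],
                   fetch_usedby: Fetcher):
    """Fetches usage info per cloud; one failing cloud does not abort the rest."""
    pods, clusters, failed = {}, {}, {}
    for cloud, names in cloud_to_names.items():
        try:
            pods[cloud], clusters[cloud], failed[cloud] = fetch_usedby(
                cloud, names)
        except Exception:  # broad on purpose, mirroring the diff
            # Mark every volume on this cloud as "fetch failed" so callers
            # can tell "unknown usage" apart from "not in use".
            pods[cloud], clusters[cloud] = {}, {}
            failed[cloud] = set(names)
    return pods, clusters, failed
```

The design point is that an empty usage map alone would be ambiguous, so the failed names are tracked separately and surfaced as `usedby_fetch_failed` on each record.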
@@ -114,6 +134,7 @@ def volume_list(
             'last_use': volume.get('last_use'),
             'usedby_pods': [],
             'usedby_clusters': [],
+            'usedby_fetch_failed': False,
             'is_ephemeral': volume.get('is_ephemeral', False),
         }
         status = volume.get('status')
@@ -126,12 +147,17 @@ def volume_list(
             logger.warning(f'Volume {volume_name} has no handle.')
             continue
         cloud = config.cloud
-
-
-
-
-
-
+        if volume_name in cloud_to_failed_volume_names[cloud]:
+            record['usedby_fetch_failed'] = True
+        else:
+            usedby_pods, usedby_clusters = provision.map_all_volumes_usedby(
+                cloud,
+                cloud_to_used_by_pods[cloud],
+                cloud_to_used_by_clusters[cloud],
+                config,
+            )
+            record['usedby_pods'] = usedby_pods
+            record['usedby_clusters'] = usedby_clusters
         record['type'] = config.type
         record['cloud'] = config.cloud
         record['region'] = config.region
@@ -139,18 +165,20 @@ def volume_list(
         record['size'] = config.size
         record['config'] = config.config
         record['name_on_cloud'] = config.name_on_cloud
-        record['usedby_pods'] = usedby_pods
-        record['usedby_clusters'] = usedby_clusters
         records.append(responses.VolumeRecord(**record))
     return records
 
 
-def volume_delete(names: List[str],
+def volume_delete(names: List[str],
+                  ignore_not_found: bool = False,
+                  purge: bool = False) -> None:
     """Deletes volumes.
 
     Args:
         names: List of volume names to delete.
         ignore_not_found: If True, ignore volumes that are not found.
+        purge: If True, delete the volume from the database even if the
+            deletion API fails.
 
     Raises:
         ValueError: If the volume does not exist
@@ -167,22 +195,32 @@ def volume_delete(names: List[str], ignore_not_found: bool = False) -> None:
         if config is None:
             raise ValueError(f'Volume {name} has no handle.')
         cloud = config.cloud
-
-
-
-
-
-
-
-
-
-
-
-
-
+        if not purge:
+            usedby_pods, usedby_clusters = provision.get_volume_usedby(
+                cloud, config)
+            if usedby_clusters:
+                usedby_clusters_str = ', '.join(usedby_clusters)
+                cluster_str = 'clusters' if len(
+                    usedby_clusters) > 1 else 'cluster'
+                raise ValueError(f'Volume {name} is used by {cluster_str}'
+                                 f' {usedby_clusters_str}.')
+            if usedby_pods:
+                usedby_pods_str = ', '.join(usedby_pods)
+                pod_str = 'pods' if len(usedby_pods) > 1 else 'pod'
+                raise ValueError(
+                    f'Volume {name} is used by {pod_str} {usedby_pods_str}.'
+                )
         logger.debug(f'Deleting volume {name} with config {config}')
         with _volume_lock(name):
-
+            try:
+                provision.delete_volume(cloud, config)
+            except Exception as e:  # pylint: disable=broad-except
+                if purge:
+                    logger.warning(f'Failed to delete volume {name} '
+                                   f'on {cloud}: {e}. Purging from '
+                                   'database.')
+                else:
+                    raise
             global_user_state.delete_volume(name)
     logger.info(f'Deleted volumes: {names}')
 
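The `purge` behavior added to `volume_delete` follows a common "best-effort remote delete, then drop the local record" pattern. The sketch below illustrates it with hypothetical names (`delete_with_purge`, `delete_on_cloud`, `delete_record`); these callables are assumptions standing in for `provision.delete_volume` and `global_user_state.delete_volume`, and the in-use checks and locking from the diff are omitted.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def delete_with_purge(name: str,
                      delete_on_cloud: Callable[[str], None],
                      delete_record: Callable[[str], None],
                      purge: bool = False) -> None:
    """Deletes a volume remotely, then drops its local record.

    With purge=True a remote failure is logged and the record is still
    removed; with purge=False the error propagates and the record is kept.
    """
    try:
        delete_on_cloud(name)
    except Exception as e:  # broad on purpose, mirroring the diff
        if not purge:
            raise
        logger.warning('Failed to delete volume %s: %s. Purging record.',
                       name, e)
    delete_record(name)
```

Note that purging can leave the remote resource orphaned, which is presumably why the diff keeps it opt-in and defaults it to `False`.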
sky_templates/ray/start_cluster
CHANGED
@@ -77,14 +77,18 @@ if ! run_ray --version > /dev/null; then
 fi
 echo -e "${GREEN}Ray $(run_ray --version | cut -d' ' -f3) is installed.${NC}"
 
-
+LOCAL_RAY_ADDRESS="127.0.0.1:${RAY_HEAD_PORT}"
+RAY_ADDRESS=${LOCAL_RAY_ADDRESS}
 if [ "${SKYPILOT_NODE_RANK}" -ne 0 ]; then
     HEAD_IP=$(echo "${SKYPILOT_NODE_IPS}" | head -n1)
     RAY_ADDRESS="${HEAD_IP}:${RAY_HEAD_PORT}"
 fi
 
-# Check if user-space Ray is already running
-if
+# Check if user-space Ray is already running. Use local address to check, as
+# if we use the head node address, the check will succeed even if the Ray
+# cluster is started on the head node but not started on the current worker
+# node.
+if run_ray status --address="${LOCAL_RAY_ADDRESS}" &> /dev/null; then
     echo -e "${YELLOW}Ray cluster is already running.${NC}"
     run_ray status --address="${RAY_ADDRESS}"
     exit 0
@@ -140,7 +144,7 @@ if [ "${SKYPILOT_NODE_RANK}" -eq 0 ]; then
             echo -e "${RED}Error: Timeout waiting for nodes.${NC}" >&2
             exit 1
         fi
-        ready_nodes=$(run_ray list nodes --format=json | python3 -c "import sys, json; print(len(json.load(sys.stdin)))")
+        ready_nodes=$(run_ray list nodes --address="${RAY_ADDRESS}" --format=json | python3 -c "import sys, json; print(len(json.load(sys.stdin)))")
         if [ "${ready_nodes}" -ge "${SKYPILOT_NUM_NODES}" ]; then
             break
         fi
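The address selection in `start_cluster` separates "is Ray running on this node?" from "where is the cluster head?". A minimal Python sketch of the same rule follows; `resolve_ray_addresses` is a hypothetical helper, and it assumes the node IPs arrive as a list with the head node first (the script reads them from the newline-separated `SKYPILOT_NODE_IPS`).

```python
from typing import List, Tuple


def resolve_ray_addresses(node_rank: int, node_ips: List[str],
                          head_port: int) -> Tuple[str, str]:
    """Returns (local_address, cluster_address) for one node.

    local_address always points at this node, so probing it tells whether
    Ray runs here. cluster_address points at the head node for workers.
    """
    local_address = f'127.0.0.1:{head_port}'
    cluster_address = local_address
    if node_rank != 0:
        # Workers talk to the head node, which is listed first.
        cluster_address = f'{node_ips[0]}:{head_port}'
    return local_address, cluster_address
```

Probing `cluster_address` from a worker would succeed as soon as the head is up, which is the false positive the new comment in the script describes.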
{skypilot_nightly-1.0.0.dev20251210.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/METADATA
CHANGED
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: skypilot-nightly
-Version: 1.0.0.
+Version: 1.0.0.dev20260112
 Summary: SkyPilot: Run AI on Any Infra — Unified, Faster, Cheaper.
 Author: SkyPilot Team
 License: Apache 2.0
@@ -73,7 +73,7 @@ Provides-Extra: aws
 Requires-Dist: awscli>=1.27.10; extra == "aws"
 Requires-Dist: botocore>=1.29.10; extra == "aws"
 Requires-Dist: boto3>=1.26.1; extra == "aws"
-Requires-Dist: colorama<0.4.
+Requires-Dist: colorama<0.4.7; extra == "aws"
 Requires-Dist: casbin; extra == "aws"
 Requires-Dist: sqlalchemy_adapter; extra == "aws"
 Requires-Dist: passlib; extra == "aws"
@@ -161,7 +161,7 @@ Provides-Extra: cloudflare
 Requires-Dist: awscli>=1.27.10; extra == "cloudflare"
 Requires-Dist: botocore>=1.29.10; extra == "cloudflare"
 Requires-Dist: boto3>=1.26.1; extra == "cloudflare"
-Requires-Dist: colorama<0.4.
+Requires-Dist: colorama<0.4.7; extra == "cloudflare"
 Requires-Dist: casbin; extra == "cloudflare"
 Requires-Dist: sqlalchemy_adapter; extra == "cloudflare"
 Requires-Dist: passlib; extra == "cloudflare"
@@ -176,7 +176,7 @@ Provides-Extra: coreweave
 Requires-Dist: awscli>=1.27.10; extra == "coreweave"
 Requires-Dist: botocore>=1.29.10; extra == "coreweave"
 Requires-Dist: boto3>=1.26.1; extra == "coreweave"
-Requires-Dist: colorama<0.4.
+Requires-Dist: colorama<0.4.7; extra == "coreweave"
 Requires-Dist: kubernetes!=32.0.0,>=20.0.0; extra == "coreweave"
 Requires-Dist: websockets; extra == "coreweave"
 Requires-Dist: python-dateutil; extra == "coreweave"
@@ -245,6 +245,7 @@ Requires-Dist: greenlet; extra == "ssh"
 Provides-Extra: runpod
 Requires-Dist: runpod>=1.6.1; extra == "runpod"
 Requires-Dist: tomli; extra == "runpod"
+Requires-Dist: pycares<5; extra == "runpod"
 Requires-Dist: casbin; extra == "runpod"
 Requires-Dist: sqlalchemy_adapter; extra == "runpod"
 Requires-Dist: passlib; extra == "runpod"
@@ -345,7 +346,7 @@ Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "nebius"
 Requires-Dist: awscli>=1.27.10; extra == "nebius"
 Requires-Dist: botocore>=1.29.10; extra == "nebius"
 Requires-Dist: boto3>=1.26.1; extra == "nebius"
-Requires-Dist: colorama<0.4.
+Requires-Dist: colorama<0.4.7; extra == "nebius"
 Requires-Dist: casbin; extra == "nebius"
 Requires-Dist: sqlalchemy_adapter; extra == "nebius"
 Requires-Dist: passlib; extra == "nebius"
@@ -391,6 +392,7 @@ Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "shadeform"
 Requires-Dist: aiosqlite; extra == "shadeform"
 Requires-Dist: greenlet; extra == "shadeform"
 Provides-Extra: slurm
+Requires-Dist: python-hostlist; extra == "slurm"
 Requires-Dist: casbin; extra == "slurm"
 Requires-Dist: sqlalchemy_adapter; extra == "slurm"
 Requires-Dist: passlib; extra == "slurm"
@@ -402,51 +404,53 @@ Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "slurm"
 Requires-Dist: aiosqlite; extra == "slurm"
 Requires-Dist: greenlet; extra == "slurm"
 Provides-Extra: all
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist: colorama<0.4.5; extra == "all"
-Requires-Dist: sqlalchemy_adapter; extra == "all"
+Requires-Dist: aiohttp; extra == "all"
+Requires-Dist: tomli; extra == "all"
+Requires-Dist: ecsapi==0.4.0; extra == "all"
 Requires-Dist: msgraph-sdk; extra == "all"
-Requires-Dist:
-Requires-Dist:
-Requires-Dist: azure-storage-blob>=12.23.1; extra == "all"
-Requires-Dist: vastai-sdk>=0.1.12; extra == "all"
-Requires-Dist: pyjwt; extra == "all"
-Requires-Dist: azure-mgmt-compute>=33.0.0; extra == "all"
+Requires-Dist: azure-cli>=2.65.0; extra == "all"
+Requires-Dist: python-dateutil; extra == "all"
 Requires-Dist: ray[default]>=2.6.1; extra == "all"
-Requires-Dist:
-Requires-Dist:
+Requires-Dist: azure-storage-blob>=12.23.1; extra == "all"
+Requires-Dist: pydo>=0.3.0; extra == "all"
+Requires-Dist: google-cloud-storage; extra == "all"
 Requires-Dist: azure-identity>=1.19.0; extra == "all"
-Requires-Dist:
+Requires-Dist: grpcio>=1.63.0; extra == "all"
+Requires-Dist: colorama<0.4.7; extra == "all"
+Requires-Dist: boto3>=1.26.1; extra == "all"
+Requires-Dist: docker; extra == "all"
+Requires-Dist: sqlalchemy_adapter; extra == "all"
+Requires-Dist: anyio; extra == "all"
+Requires-Dist: pyjwt; extra == "all"
+Requires-Dist: google-api-python-client>=2.69.0; extra == "all"
+Requires-Dist: oci; extra == "all"
+Requires-Dist: pyvmomi==8.0.1.0.2; extra == "all"
+Requires-Dist: websockets; extra == "all"
 Requires-Dist: kubernetes!=32.0.0,>=20.0.0; extra == "all"
-Requires-Dist:
+Requires-Dist: ibm-cloud-sdk-core; extra == "all"
+Requires-Dist: runpod>=1.6.1; extra == "all"
+Requires-Dist: azure-core>=1.24.0; extra == "all"
 Requires-Dist: passlib; extra == "all"
-Requires-Dist:
+Requires-Dist: ibm-vpc; extra == "all"
+Requires-Dist: nebius>=0.3.12; extra == "all"
 Requires-Dist: cudo-compute>=0.1.10; extra == "all"
-Requires-Dist: boto3>=1.26.1; extra == "all"
-Requires-Dist: botocore>=1.29.10; extra == "all"
-Requires-Dist: websockets; extra == "all"
-Requires-Dist: azure-mgmt-network>=27.0.0; extra == "all"
-Requires-Dist: azure-cli>=2.65.0; extra == "all"
 Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "all"
+Requires-Dist: awscli>=1.27.10; extra == "all"
+Requires-Dist: pycares<5; extra == "all"
+Requires-Dist: ibm-platform-services>=0.48.0; extra == "all"
+Requires-Dist: greenlet; extra == "all"
+Requires-Dist: azure-core>=1.31.0; extra == "all"
 Requires-Dist: msrestazure; extra == "all"
+Requires-Dist: vastai-sdk>=0.1.12; extra == "all"
+Requires-Dist: pyopenssl<24.3.0,>=23.2.0; extra == "all"
 Requires-Dist: ibm-cos-sdk; extra == "all"
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist: nebius>=0.3.12; extra == "all"
-Requires-Dist: ibm-vpc; extra == "all"
+Requires-Dist: python-hostlist; extra == "all"
+Requires-Dist: azure-mgmt-compute>=33.0.0; extra == "all"
+Requires-Dist: botocore>=1.29.10; extra == "all"
+Requires-Dist: azure-mgmt-network>=27.0.0; extra == "all"
 Requires-Dist: casbin; extra == "all"
-Requires-Dist: pyvmomi==8.0.1.0.2; extra == "all"
-Requires-Dist: ibm-platform-services>=0.48.0; extra == "all"
-Requires-Dist: tomli; extra == "all"
-Requires-Dist: ecsapi==0.4.0; extra == "all"
-Requires-Dist: pydo>=0.3.0; extra == "all"
-Requires-Dist: google-cloud-storage; extra == "all"
-Requires-Dist: anyio; extra == "all"
 Requires-Dist: aiosqlite; extra == "all"
+Requires-Dist: azure-common; extra == "all"
 Provides-Extra: remote
 Requires-Dist: grpcio>=1.63.0; extra == "remote"
 Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "remote"
@@ -496,7 +500,7 @@ Dynamic: summary
 </p>
 
 <h3 align="center">
-
+    Run AI on Any Infrastructure
 </h3>
 
 <div align="center">
@@ -506,10 +510,18 @@ Dynamic: summary
 </div>
 
 
+SkyPilot is a system to run, manage, and scale AI workloads on any AI infrastructure.
+
+SkyPilot gives **AI teams** a simple interface to run jobs on any infra.
+**Infra teams** get a unified control plane to manage any AI compute — with advanced scheduling, scaling, and orchestration.
+
+<img src="./docs/source/images/skypilot-abstractions-long-2.png" alt="SkyPilot Abstractions">
 
-
+-----
 
 :fire: *News* :fire:
+- [Dec 2025] **SkyPilot v0.11** released: Multi-Cloud Pools, Fast Managed Jobs, Enterprise-Readiness at Large Scale, Programmability. [**Release notes**](https://github.com/skypilot-org/skypilot/releases/tag/v0.11.0)
+- [Dec 2025] **SkyPilot Pools** released: Run batch inference and other jobs on a managed pool of warm workers (across clouds or clusters). [**blog**](https://blog.skypilot.co/skypilot-pools-deepseek-ocr/), [**docs**](https://docs.skypilot.co/en/latest/examples/pools.html)
 - [Nov 2025] Serve **Kimi K2 Thinking** with reasoning capabilities on your Kubernetes or clouds: [**example**](./llm/kimi-k2-thinking/)
 - [Oct 2025] Run **RL training for LLMs** with SkyRL on your Kubernetes or clouds: [**example**](./llm/skyrl/)
 - [Oct 2025] Train and serve [Andrej Karpathy's](https://x.com/karpathy/status/1977755427569111362) **nanochat** - the best ChatGPT that $100 can buy: [**example**](./llm/nanochat)
@@ -518,22 +530,6 @@ Dynamic: summary
 - [Sep 2025] Network and Storage Benchmarks for LLM training on the cloud: [**blog**](https://maknee.github.io/blog/2025/Network-And-Storage-Training-Skypilot/)
 - [Aug 2025] Serve and finetune **OpenAI GPT-OSS models** (gpt-oss-120b, gpt-oss-20b) with one command on any infra: [**serve**](./llm/gpt-oss/) + [**LoRA and full finetuning**](./llm/gpt-oss-finetuning/)
 - [Jul 2025] Run distributed **RL training for LLMs** with Verl (PPO, GRPO) on any cloud: [**example**](./llm/verl/)
-- [Jul 2025] Finetune **Llama4** on any distributed cluster/cloud: [**example**](./llm/llama-4-finetuning/)
-- [Jul 2025] Two-part blog series, `The Evolution of AI Job Orchestration`: (1) [Running AI jobs on GPU Neoclouds](https://blog.skypilot.co/ai-job-orchestration-pt1-gpu-neoclouds/), (2) [The AI-Native Control Plane & Orchestration that Finally Works for ML](https://blog.skypilot.co/ai-job-orchestration-pt2-ai-control-plane/)
-- [Apr 2025] Spin up **Qwen3** on your cluster/cloud: [**example**](./llm/qwen/)
-
-
-
-**LLM Finetuning Cookbooks**: Finetuning Llama 2 / Llama 3.1 in your own cloud environment, privately: Llama 2 [**example**](./llm/vicuna-llama-2/) and [**blog**](https://blog.skypilot.co/finetuning-llama2-operational-guide/); Llama 3.1 [**example**](./llm/llama-3_1-finetuning/) and [**blog**](https://blog.skypilot.co/finetune-llama-3_1-on-your-infra/)
-
-----
-
-SkyPilot is a system to run, manage, and scale AI workloads on any AI infrastructure.
-
-SkyPilot gives **AI teams** a simple interface to run jobs on any infra.
-**Infra teams** get a unified control plane to manage any AI compute — with advanced scheduling, scaling, and orchestration.
-
-<img src="./docs/source/images/skypilot-abstractions-long-2.png" alt="SkyPilot Abstractions">
 
 ## Overview
 