skypilot-nightly 1.0.0.dev20251203__py3-none-any.whl → 1.0.0.dev20260112__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- sky/__init__.py +6 -2
- sky/adaptors/aws.py +1 -61
- sky/adaptors/slurm.py +565 -0
- sky/backends/backend_utils.py +95 -12
- sky/backends/cloud_vm_ray_backend.py +224 -65
- sky/backends/task_codegen.py +380 -4
- sky/catalog/__init__.py +0 -3
- sky/catalog/data_fetchers/fetch_gcp.py +9 -1
- sky/catalog/data_fetchers/fetch_nebius.py +1 -1
- sky/catalog/data_fetchers/fetch_vast.py +4 -2
- sky/catalog/kubernetes_catalog.py +12 -4
- sky/catalog/seeweb_catalog.py +30 -15
- sky/catalog/shadeform_catalog.py +5 -2
- sky/catalog/slurm_catalog.py +236 -0
- sky/catalog/vast_catalog.py +30 -6
- sky/check.py +25 -11
- sky/client/cli/command.py +391 -32
- sky/client/interactive_utils.py +190 -0
- sky/client/sdk.py +64 -2
- sky/client/sdk_async.py +9 -0
- sky/clouds/__init__.py +2 -0
- sky/clouds/aws.py +60 -2
- sky/clouds/azure.py +2 -0
- sky/clouds/cloud.py +7 -0
- sky/clouds/kubernetes.py +2 -0
- sky/clouds/runpod.py +38 -7
- sky/clouds/slurm.py +610 -0
- sky/clouds/ssh.py +3 -2
- sky/clouds/vast.py +39 -16
- sky/core.py +197 -37
- sky/dashboard/out/404.html +1 -1
- sky/dashboard/out/_next/static/3nu-b8raeKRNABZ2d4GAG/_buildManifest.js +1 -0
- sky/dashboard/out/_next/static/chunks/1871-0565f8975a7dcd10.js +6 -0
- sky/dashboard/out/_next/static/chunks/2109-55a1546d793574a7.js +11 -0
- sky/dashboard/out/_next/static/chunks/2521-099b07cd9e4745bf.js +26 -0
- sky/dashboard/out/_next/static/chunks/2755.a636e04a928a700e.js +31 -0
- sky/dashboard/out/_next/static/chunks/3495.05eab4862217c1a5.js +6 -0
- sky/dashboard/out/_next/static/chunks/3785.cfc5dcc9434fd98c.js +1 -0
- sky/dashboard/out/_next/static/chunks/3850-fd5696f3bbbaddae.js +1 -0
- sky/dashboard/out/_next/static/chunks/3981.645d01bf9c8cad0c.js +21 -0
- sky/dashboard/out/_next/static/chunks/4083-0115d67c1fb57d6c.js +21 -0
- sky/dashboard/out/_next/static/chunks/{8640.5b9475a2d18c5416.js → 429.a58e9ba9742309ed.js} +2 -2
- sky/dashboard/out/_next/static/chunks/4555.8e221537181b5dc1.js +6 -0
- sky/dashboard/out/_next/static/chunks/4725.937865b81fdaaebb.js +6 -0
- sky/dashboard/out/_next/static/chunks/6082-edabd8f6092300ce.js +25 -0
- sky/dashboard/out/_next/static/chunks/6989-49cb7dca83a7a62d.js +1 -0
- sky/dashboard/out/_next/static/chunks/6990-630bd2a2257275f8.js +1 -0
- sky/dashboard/out/_next/static/chunks/7248-a99800d4db8edabd.js +1 -0
- sky/dashboard/out/_next/static/chunks/754-cfc5d4ad1b843d29.js +18 -0
- sky/dashboard/out/_next/static/chunks/8050-dd8aa107b17dce00.js +16 -0
- sky/dashboard/out/_next/static/chunks/8056-d4ae1e0cb81e7368.js +1 -0
- sky/dashboard/out/_next/static/chunks/8555.011023e296c127b3.js +6 -0
- sky/dashboard/out/_next/static/chunks/8821-93c25df904a8362b.js +1 -0
- sky/dashboard/out/_next/static/chunks/8969-0662594b69432ade.js +1 -0
- sky/dashboard/out/_next/static/chunks/9025.f15c91c97d124a5f.js +6 -0
- sky/dashboard/out/_next/static/chunks/9353-7ad6bd01858556f1.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/_app-5a86569acad99764.js +34 -0
- sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]/[job]-8297476714acb4ac.js +6 -0
- sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]-337c3ba1085f1210.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/{clusters-ee39056f9851a3ff.js → clusters-57632ff3684a8b5c.js} +1 -1
- sky/dashboard/out/_next/static/chunks/pages/{config-dfb9bf07b13045f4.js → config-718cdc365de82689.js} +1 -1
- sky/dashboard/out/_next/static/chunks/pages/infra/[context]-5fd3a453c079c2ea.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/infra-9f85c02c9c6cae9e.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/jobs/[job]-90f16972cbecf354.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/jobs/pools/[pool]-2dd42fc37aad427a.js +16 -0
- sky/dashboard/out/_next/static/chunks/pages/jobs-ed806aeace26b972.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/plugins/[...slug]-449a9f5a3bb20fb3.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/users-bec34706b36f3524.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/{volumes-b84b948ff357c43e.js → volumes-a83ba9b38dff7ea9.js} +1 -1
- sky/dashboard/out/_next/static/chunks/pages/workspaces/{[name]-84a40f8c7c627fe4.js → [name]-c781e9c3e52ef9fc.js} +1 -1
- sky/dashboard/out/_next/static/chunks/pages/workspaces-91e0942f47310aae.js +1 -0
- sky/dashboard/out/_next/static/chunks/webpack-cfe59cf684ee13b9.js +1 -0
- sky/dashboard/out/_next/static/css/b0dbca28f027cc19.css +3 -0
- sky/dashboard/out/clusters/[cluster]/[job].html +1 -1
- sky/dashboard/out/clusters/[cluster].html +1 -1
- sky/dashboard/out/clusters.html +1 -1
- sky/dashboard/out/config.html +1 -1
- sky/dashboard/out/index.html +1 -1
- sky/dashboard/out/infra/[context].html +1 -1
- sky/dashboard/out/infra.html +1 -1
- sky/dashboard/out/jobs/[job].html +1 -1
- sky/dashboard/out/jobs/pools/[pool].html +1 -1
- sky/dashboard/out/jobs.html +1 -1
- sky/dashboard/out/plugins/[...slug].html +1 -0
- sky/dashboard/out/users.html +1 -1
- sky/dashboard/out/volumes.html +1 -1
- sky/dashboard/out/workspace/new.html +1 -1
- sky/dashboard/out/workspaces/[name].html +1 -1
- sky/dashboard/out/workspaces.html +1 -1
- sky/data/data_utils.py +26 -12
- sky/data/mounting_utils.py +44 -5
- sky/global_user_state.py +111 -19
- sky/jobs/client/sdk.py +8 -3
- sky/jobs/controller.py +191 -31
- sky/jobs/recovery_strategy.py +109 -11
- sky/jobs/server/core.py +81 -4
- sky/jobs/server/server.py +14 -0
- sky/jobs/state.py +417 -19
- sky/jobs/utils.py +73 -80
- sky/models.py +11 -0
- sky/optimizer.py +8 -6
- sky/provision/__init__.py +12 -9
- sky/provision/common.py +20 -0
- sky/provision/docker_utils.py +15 -2
- sky/provision/kubernetes/utils.py +163 -20
- sky/provision/kubernetes/volume.py +52 -17
- sky/provision/provisioner.py +17 -7
- sky/provision/runpod/instance.py +3 -1
- sky/provision/runpod/utils.py +13 -1
- sky/provision/runpod/volume.py +25 -9
- sky/provision/slurm/__init__.py +12 -0
- sky/provision/slurm/config.py +13 -0
- sky/provision/slurm/instance.py +618 -0
- sky/provision/slurm/utils.py +689 -0
- sky/provision/vast/instance.py +4 -1
- sky/provision/vast/utils.py +11 -6
- sky/resources.py +135 -13
- sky/schemas/api/responses.py +4 -0
- sky/schemas/db/global_user_state/010_save_ssh_key.py +1 -1
- sky/schemas/db/spot_jobs/008_add_full_resources.py +34 -0
- sky/schemas/db/spot_jobs/009_job_events.py +32 -0
- sky/schemas/db/spot_jobs/010_job_events_timestamp_with_timezone.py +43 -0
- sky/schemas/db/spot_jobs/011_add_links.py +34 -0
- sky/schemas/generated/jobsv1_pb2.py +9 -5
- sky/schemas/generated/jobsv1_pb2.pyi +12 -0
- sky/schemas/generated/jobsv1_pb2_grpc.py +44 -0
- sky/schemas/generated/managed_jobsv1_pb2.py +32 -28
- sky/schemas/generated/managed_jobsv1_pb2.pyi +11 -2
- sky/serve/serve_utils.py +232 -40
- sky/serve/server/impl.py +1 -1
- sky/server/common.py +17 -0
- sky/server/constants.py +1 -1
- sky/server/metrics.py +6 -3
- sky/server/plugins.py +238 -0
- sky/server/requests/executor.py +5 -2
- sky/server/requests/payloads.py +30 -1
- sky/server/requests/request_names.py +4 -0
- sky/server/requests/requests.py +33 -11
- sky/server/requests/serializers/encoders.py +22 -0
- sky/server/requests/serializers/return_value_serializers.py +70 -0
- sky/server/server.py +506 -109
- sky/server/server_utils.py +30 -0
- sky/server/uvicorn.py +5 -0
- sky/setup_files/MANIFEST.in +1 -0
- sky/setup_files/dependencies.py +22 -9
- sky/sky_logging.py +2 -1
- sky/skylet/attempt_skylet.py +13 -3
- sky/skylet/constants.py +55 -13
- sky/skylet/events.py +10 -4
- sky/skylet/executor/__init__.py +1 -0
- sky/skylet/executor/slurm.py +187 -0
- sky/skylet/job_lib.py +91 -5
- sky/skylet/log_lib.py +22 -6
- sky/skylet/log_lib.pyi +8 -6
- sky/skylet/services.py +18 -3
- sky/skylet/skylet.py +5 -1
- sky/skylet/subprocess_daemon.py +2 -1
- sky/ssh_node_pools/constants.py +12 -0
- sky/ssh_node_pools/core.py +40 -3
- sky/ssh_node_pools/deploy/__init__.py +4 -0
- sky/{utils/kubernetes/deploy_ssh_node_pools.py → ssh_node_pools/deploy/deploy.py} +279 -504
- sky/ssh_node_pools/deploy/tunnel/ssh-tunnel.sh +379 -0
- sky/ssh_node_pools/deploy/tunnel_utils.py +199 -0
- sky/ssh_node_pools/deploy/utils.py +173 -0
- sky/ssh_node_pools/server.py +11 -13
- sky/{utils/kubernetes/ssh_utils.py → ssh_node_pools/utils.py} +9 -6
- sky/templates/kubernetes-ray.yml.j2 +12 -6
- sky/templates/slurm-ray.yml.j2 +115 -0
- sky/templates/vast-ray.yml.j2 +1 -0
- sky/templates/websocket_proxy.py +18 -41
- sky/users/model.conf +1 -1
- sky/users/permission.py +85 -52
- sky/users/rbac.py +31 -3
- sky/utils/annotations.py +108 -8
- sky/utils/auth_utils.py +42 -0
- sky/utils/cli_utils/status_utils.py +19 -5
- sky/utils/cluster_utils.py +10 -3
- sky/utils/command_runner.py +389 -35
- sky/utils/command_runner.pyi +43 -4
- sky/utils/common_utils.py +47 -31
- sky/utils/context.py +32 -0
- sky/utils/db/db_utils.py +36 -6
- sky/utils/db/migration_utils.py +41 -21
- sky/utils/infra_utils.py +5 -1
- sky/utils/instance_links.py +139 -0
- sky/utils/interactive_utils.py +49 -0
- sky/utils/kubernetes/generate_kubeconfig.sh +42 -33
- sky/utils/kubernetes/kubernetes_deploy_utils.py +2 -94
- sky/utils/kubernetes/rsync_helper.sh +5 -1
- sky/utils/kubernetes/ssh-tunnel.sh +7 -376
- sky/utils/plugin_extensions/__init__.py +14 -0
- sky/utils/plugin_extensions/external_failure_source.py +176 -0
- sky/utils/resources_utils.py +10 -8
- sky/utils/rich_utils.py +9 -11
- sky/utils/schemas.py +93 -19
- sky/utils/status_lib.py +7 -0
- sky/utils/subprocess_utils.py +17 -0
- sky/volumes/client/sdk.py +6 -3
- sky/volumes/server/core.py +65 -27
- sky_templates/ray/start_cluster +8 -4
- {skypilot_nightly-1.0.0.dev20251203.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/METADATA +67 -59
- {skypilot_nightly-1.0.0.dev20251203.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/RECORD +208 -180
- sky/dashboard/out/_next/static/96_E2yl3QAiIJGOYCkSpB/_buildManifest.js +0 -1
- sky/dashboard/out/_next/static/chunks/1141-e6aa9ab418717c59.js +0 -11
- sky/dashboard/out/_next/static/chunks/1871-7e202677c42f43fe.js +0 -6
- sky/dashboard/out/_next/static/chunks/2260-7703229c33c5ebd5.js +0 -1
- sky/dashboard/out/_next/static/chunks/2350.fab69e61bac57b23.js +0 -1
- sky/dashboard/out/_next/static/chunks/2369.fc20f0c2c8ed9fe7.js +0 -15
- sky/dashboard/out/_next/static/chunks/2755.edd818326d489a1d.js +0 -26
- sky/dashboard/out/_next/static/chunks/3294.20a8540fe697d5ee.js +0 -1
- sky/dashboard/out/_next/static/chunks/3785.7e245f318f9d1121.js +0 -1
- sky/dashboard/out/_next/static/chunks/3800-7b45f9fbb6308557.js +0 -1
- sky/dashboard/out/_next/static/chunks/3850-ff4a9a69d978632b.js +0 -1
- sky/dashboard/out/_next/static/chunks/4725.172ede95d1b21022.js +0 -1
- sky/dashboard/out/_next/static/chunks/4937.a2baa2df5572a276.js +0 -15
- sky/dashboard/out/_next/static/chunks/6212-7bd06f60ba693125.js +0 -13
- sky/dashboard/out/_next/static/chunks/6856-8f27d1c10c98def8.js +0 -1
- sky/dashboard/out/_next/static/chunks/6989-01359c57e018caa4.js +0 -1
- sky/dashboard/out/_next/static/chunks/6990-9146207c4567fdfd.js +0 -1
- sky/dashboard/out/_next/static/chunks/7359-c8d04e06886000b3.js +0 -30
- sky/dashboard/out/_next/static/chunks/7411-b15471acd2cba716.js +0 -41
- sky/dashboard/out/_next/static/chunks/7615-019513abc55b3b47.js +0 -1
- sky/dashboard/out/_next/static/chunks/8969-452f9d5cbdd2dc73.js +0 -1
- sky/dashboard/out/_next/static/chunks/9025.fa408f3242e9028d.js +0 -6
- sky/dashboard/out/_next/static/chunks/9353-cff34f7e773b2e2b.js +0 -1
- sky/dashboard/out/_next/static/chunks/9360.a536cf6b1fa42355.js +0 -31
- sky/dashboard/out/_next/static/chunks/9847.3aaca6bb33455140.js +0 -30
- sky/dashboard/out/_next/static/chunks/pages/_app-bde01e4a2beec258.js +0 -34
- sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]/[job]-792db96d918c98c9.js +0 -16
- sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]-abfcac9c137aa543.js +0 -1
- sky/dashboard/out/_next/static/chunks/pages/infra/[context]-c0b5935149902e6f.js +0 -1
- sky/dashboard/out/_next/static/chunks/pages/infra-aed0ea19df7cf961.js +0 -1
- sky/dashboard/out/_next/static/chunks/pages/jobs/[job]-d66997e2bfc837cf.js +0 -16
- sky/dashboard/out/_next/static/chunks/pages/jobs/pools/[pool]-9faf940b253e3e06.js +0 -21
- sky/dashboard/out/_next/static/chunks/pages/jobs-2072b48b617989c9.js +0 -1
- sky/dashboard/out/_next/static/chunks/pages/users-f42674164aa73423.js +0 -1
- sky/dashboard/out/_next/static/chunks/pages/workspaces-531b2f8c4bf89f82.js +0 -1
- sky/dashboard/out/_next/static/chunks/webpack-64e05f17bf2cf8ce.js +0 -1
- sky/dashboard/out/_next/static/css/0748ce22df867032.css +0 -3
- /sky/dashboard/out/_next/static/{96_E2yl3QAiIJGOYCkSpB → 3nu-b8raeKRNABZ2d4GAG}/_ssgManifest.js +0 -0
- /sky/{utils/kubernetes → ssh_node_pools/deploy/tunnel}/cleanup-tunnel.sh +0 -0
- {skypilot_nightly-1.0.0.dev20251203.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/WHEEL +0 -0
- {skypilot_nightly-1.0.0.dev20251203.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/entry_points.txt +0 -0
- {skypilot_nightly-1.0.0.dev20251203.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/licenses/LICENSE +0 -0
- {skypilot_nightly-1.0.0.dev20251203.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/top_level.txt +0 -0
sky/utils/schemas.py
CHANGED
@@ -208,26 +208,49 @@ def _get_single_resources_schema():
             },
             'job_recovery': {
                 # Either a string or a dict.
-'anyOf': [
-
-
-
-
-
-
-'
-
-
-
-
-
-
-
-
-'
-
+                'anyOf': [
+                    {
+                        'type': 'string',
+                    },
+                    {
+                        'type': 'object',
+                        'required': [],
+                        'additionalProperties': False,
+                        'properties': {
+                            'strategy': {
+                                'anyOf': [{
+                                    'type': 'string',
+                                }, {
+                                    'type': 'null',
+                                }],
+                            },
+                            'max_restarts_on_errors': {
+                                'type': 'integer',
+                                'minimum': 0,
+                            },
+                            'recover_on_exit_codes': {
+                                'anyOf': [
+                                    {
+                                        # Single exit code
+                                        'type': 'integer',
+                                        'minimum': 0,
+                                        'maximum': 255,
+                                    },
+                                    {
+                                        # List of exit codes
+                                        'type': 'array',
+                                        'items': {
+                                            'type': 'integer',
+                                            'minimum': 0,
+                                            'maximum': 255,
+                                        },
+                                        'uniqueItems': True,
+                                    },
+                                ],
+                            },
+                        }
                     }
-
+                ],
             },
             'volumes': {
                 'type': 'array',
@@ -1401,6 +1424,27 @@ def get_config_schema():
                 **_CONTEXT_CONFIG_SCHEMA_MINIMAL,
             }
         },
+        'slurm': {
+            'type': 'object',
+            'required': [],
+            'additionalProperties': False,
+            'properties': {
+                'allowed_clusters': {
+                    'oneOf': [{
+                        'type': 'array',
+                        'items': {
+                            'type': 'string',
+                        },
+                    }, {
+                        'type': 'string',
+                        'pattern': '^all$'
+                    }]
+                },
+                'provision_timeout': {
+                    'type': 'integer',
+                },
+            }
+        },
         'oci': {
             'type': 'object',
             'required': [],
@@ -1435,6 +1479,16 @@ def get_config_schema():
                 }
             },
         },
+        'vast': {
+            'type': 'object',
+            'required': [],
+            'additionalProperties': False,
+            'properties': {
+                'datacenter_only': {
+                    'type': 'boolean',
+                },
+            }
+        },
         'nebius': {
             'type': 'object',
             'required': [],
@@ -1814,6 +1868,25 @@ def get_config_schema():
             config['properties'].update(_REMOTE_IDENTITY_SCHEMA_KUBERNETES)
         else:
             config['properties'].update(_REMOTE_IDENTITY_SCHEMA)
+
+    data_schema = {
+        'type': 'object',
+        'required': [],
+        'additionalProperties': False,
+        'properties': {
+            'mount_cached': {
+                'type': 'object',
+                'required': [],
+                'additionalProperties': False,
+                'properties': {
+                    'sequential_upload': {
+                        'type': 'boolean',
+                    },
+                },
+            },
+        },
+    }
+
     return {
         '$schema': 'https://json-schema.org/draft/2020-12/schema',
         'type': 'object',
@@ -1840,6 +1913,7 @@ def get_config_schema():
             'rbac': rbac_schema,
             'logs': logs_schema,
             'daemons': daemon_schema,
+            'data': data_schema,
             **cloud_configs,
         },
     }
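The `recover_on_exit_codes` entry added to the `job_recovery` schema accepts either a single integer or a list of unique integers, each between 0 and 255. The sketch below restates those rules in plain Python so they can be tried without a JSON Schema library. `normalize_exit_codes` is a hypothetical helper written for illustration; it is not a function from the package.

```python
from typing import List, Union


def normalize_exit_codes(value: Union[int, List[int]]) -> List[int]:
    """Hypothetical helper mirroring the `recover_on_exit_codes` schema.

    Accepts one exit code or a list of unique exit codes (0..255) and
    always returns a list.
    """
    # bool is a subclass of int in Python, but JSON Schema does not treat
    # true/false as integers, so reject it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        codes = [value]
    elif isinstance(value, list):
        codes = value
    else:
        raise ValueError(f'Expected an integer or a list, got {value!r}')

    for code in codes:
        if isinstance(code, bool) or not isinstance(code, int):
            raise ValueError(f'Exit code must be an integer, got {code!r}')
        if not 0 <= code <= 255:
            raise ValueError(f'Exit code {code} is outside 0..255')

    # Mirrors 'uniqueItems': True in the list branch of the schema.
    if len(set(codes)) != len(codes):
        raise ValueError(f'Exit codes must be unique, got {codes!r}')
    return codes
```

For example, `normalize_exit_codes(137)` and `normalize_exit_codes([1, 137])` both pass, while `256` or `[1, 1]` raise `ValueError`.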
sky/utils/status_lib.py
CHANGED
@@ -27,6 +27,12 @@ class ClusterStatus(enum.Enum):
 
     STOPPED = 'STOPPED'
     """The cluster is stopped."""
+    PENDING = 'PENDING'
+    """The cluster is pending scheduling.
+
+    NOTE: This state is for display only and should not be used in state
+    machine logic without necessary considerations.
+    """
 
     def colored_str(self):
         color = _STATUS_TO_COLOR[self]
@@ -37,6 +43,7 @@ _STATUS_TO_COLOR = {
     ClusterStatus.INIT: colorama.Fore.BLUE,
     ClusterStatus.UP: colorama.Fore.GREEN,
     ClusterStatus.STOPPED: colorama.Fore.YELLOW,
+    ClusterStatus.PENDING: colorama.Fore.CYAN,
 }
 
 
sky/utils/subprocess_utils.py
CHANGED
@@ -7,6 +7,7 @@ import resource
 import shlex
 import subprocess
 import sys
+import termios
 import threading
 import time
 import typing
@@ -450,3 +451,19 @@ def slow_start_processes(processes: List[Startable],
             break
         batch_size = min(batch_size * 2, max_batch_size)
         time.sleep(delay)
+
+
+def is_echo_disabled(fd: int) -> bool:
+    """Check if terminal ECHO is disabled on the given fd.
+
+    When a subprocess wants password/sensitive input, it disables ECHO.
+    This is how pexpect's waitnoecho() works. See:
+    https://pexpect.readthedocs.io/en/stable/api/pexpect.html#pexpect.spawn.waitnoecho
+    """
+    assert os.isatty(fd), 'fd is not connected to a terminal'
+    try:
+        attr = termios.tcgetattr(fd)
+        echo_on = bool(attr[3] & termios.ECHO)
+        return not echo_on
+    except (termios.error, OSError):
+        return False
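One way to see `is_echo_disabled` in action is to open a pseudo-terminal and toggle its ECHO flag by hand. The sketch below copies the helper from the diff; `set_echo` and `demo_echo_detection` are hypothetical names added for the demonstration, and the code assumes a POSIX system where `pty` and `termios` are available.

```python
import os
import pty
import termios
from typing import Tuple


def is_echo_disabled(fd: int) -> bool:
    """Same logic as the helper added in the diff."""
    assert os.isatty(fd), 'fd is not connected to a terminal'
    try:
        attr = termios.tcgetattr(fd)
        # attr[3] holds the local-mode flags (lflag).
        echo_on = bool(attr[3] & termios.ECHO)
        return not echo_on
    except (termios.error, OSError):
        return False


def set_echo(fd: int, enabled: bool) -> None:
    """Hypothetical helper: switch the ECHO flag on or off for fd."""
    attr = termios.tcgetattr(fd)
    if enabled:
        attr[3] |= termios.ECHO
    else:
        attr[3] &= ~termios.ECHO
    termios.tcsetattr(fd, termios.TCSANOW, attr)


def demo_echo_detection() -> Tuple[bool, bool]:
    """Return is_echo_disabled() with ECHO on, then with ECHO off."""
    master_fd, slave_fd = pty.openpty()
    try:
        set_echo(slave_fd, True)
        with_echo = is_echo_disabled(slave_fd)
        # This is what a password prompt does before reading input.
        set_echo(slave_fd, False)
        without_echo = is_echo_disabled(slave_fd)
        return with_echo, without_echo
    finally:
        os.close(master_fd)
        os.close(slave_fd)
```

A caller relaying a subprocess's terminal could poll this check to decide when typed input should be hidden rather than echoed.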
sky/volumes/client/sdk.py
CHANGED
@@ -1,4 +1,4 @@
-"""SDK functions for
+"""SDK functions for volumes."""
 import json
 import typing
 from typing import List
@@ -135,16 +135,19 @@ def ls() -> server_common.RequestId[List[responses.VolumeRecord]]:
 @usage_lib.entrypoint
 @server_common.check_server_healthy_or_start
 @annotations.client_api
-def delete(names: List[str]
+def delete(names: List[str],
+           purge: bool = False) -> server_common.RequestId[None]:
     """Deletes volumes.
 
     Args:
         names: List of volume names to delete.
+        purge: If True, delete the volume from the database even if the
+            deletion API fails.
 
     Returns:
         The request ID of the delete request.
     """
-    body = payloads.VolumeDeleteBody(names=names)
+    body = payloads.VolumeDeleteBody(names=names, purge=purge)
     response = server_common.make_authenticated_request(
         'POST', '/volumes/delete', json=json.loads(body.model_dump_json()))
     return server_common.get_request_id(response)
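The `purge` docstring above describes a "remove the record even if the backend delete fails" behavior. A minimal sketch of that pattern, assuming an in-memory dict as the database and a caller-supplied delete callback: `delete_volume_record`, `cloud_delete`, and `database` are hypothetical names, not part of the SkyPilot API.

```python
from typing import Callable, Dict


def delete_volume_record(name: str,
                         cloud_delete: Callable[[str], None],
                         database: Dict[str, dict],
                         purge: bool = False) -> None:
    """Hypothetical illustration of purge semantics.

    Without purge, a backend failure propagates and the record is kept.
    With purge, the failure is swallowed and the record is still removed.
    """
    try:
        cloud_delete(name)
    except Exception:  # broad on purpose: any backend error counts
        if not purge:
            raise
    database.pop(name, None)


def failing_delete(name: str) -> None:
    """Stand-in for a backend whose delete call always fails."""
    raise RuntimeError(f'backend unreachable while deleting {name}')
```

With `database = {'vol-a': {}}`, calling `delete_volume_record('vol-a', failing_delete, database)` raises and leaves the record in place, while the same call with `purge=True` removes it.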
sky/volumes/server/core.py
CHANGED
@@ -30,6 +30,10 @@ def volume_refresh():
     volumes = volume_list(is_ephemeral=False)
     for volume in volumes:
         volume_name = volume.name
+        if volume.usedby_fetch_failed:
+            logger.info(f'Skipping status update for volume {volume_name} '
+                        f'due to failed usedby fetch')
+            continue
         usedby_pods = volume.usedby_pods
         with _volume_lock(volume_name):
             latest_volume = global_user_state.get_volume_by_name(volume_name)
@@ -55,6 +59,9 @@ def volume_list(
         is_ephemeral: Optional[bool] = None) -> List[responses.VolumeRecord]:
     """Gets the volumes.
 
+    Args:
+        is_ephemeral: Whether to include ephemeral volumes.
+
     Returns:
         [
             {
@@ -74,6 +81,7 @@ def volume_list(
                 'status': sky.VolumeStatus,
                 'usedby_pods': List[str],
                 'usedby_clusters': List[str],
+                'usedby_fetch_failed': bool,
                 'is_ephemeral': bool,
             }
         ]
@@ -93,11 +101,23 @@ def volume_list(
             cloud_to_configs[cloud].append(config)
 
         cloud_to_used_by_pods, cloud_to_used_by_clusters = {}, {}
+        cloud_to_failed_volume_names = {}
         for cloud, configs in cloud_to_configs.items():
-
-
-
-
+            try:
+                used_by_pods, used_by_clusters, failed_volume_names = (
+                    provision.get_all_volumes_usedby(cloud, configs))
+                cloud_to_used_by_pods[cloud] = used_by_pods
+                cloud_to_used_by_clusters[cloud] = used_by_clusters
+                cloud_to_failed_volume_names[cloud] = failed_volume_names
+            except Exception as e:  # pylint: disable=broad-except
+                logger.warning(
+                    f'Failed to get usedby info for volumes on {cloud}: {e}')
+                cloud_to_used_by_pods[cloud] = {}
+                cloud_to_used_by_clusters[cloud] = {}
+                cloud_to_failed_volume_names[cloud] = {
+                    config.name for config in configs
+                }
+                continue
 
         all_users = global_user_state.get_all_users()
         user_map = {user.id: user.name for user in all_users}
@@ -114,6 +134,7 @@ def volume_list(
                 'last_use': volume.get('last_use'),
                 'usedby_pods': [],
                 'usedby_clusters': [],
+                'usedby_fetch_failed': False,
                 'is_ephemeral': volume.get('is_ephemeral', False),
             }
             status = volume.get('status')
@@ -126,12 +147,17 @@ def volume_list(
                 logger.warning(f'Volume {volume_name} has no handle.')
                 continue
             cloud = config.cloud
-
-
-
-
-
-
+            if volume_name in cloud_to_failed_volume_names[cloud]:
+                record['usedby_fetch_failed'] = True
+            else:
+                usedby_pods, usedby_clusters = provision.map_all_volumes_usedby(
+                    cloud,
+                    cloud_to_used_by_pods[cloud],
+                    cloud_to_used_by_clusters[cloud],
+                    config,
+                )
+                record['usedby_pods'] = usedby_pods
+                record['usedby_clusters'] = usedby_clusters
             record['type'] = config.type
             record['cloud'] = config.cloud
             record['region'] = config.region
@@ -139,18 +165,20 @@ def volume_list(
             record['size'] = config.size
             record['config'] = config.config
             record['name_on_cloud'] = config.name_on_cloud
-            record['usedby_pods'] = usedby_pods
-            record['usedby_clusters'] = usedby_clusters
             records.append(responses.VolumeRecord(**record))
         return records
 
 
-def volume_delete(names: List[str],
+def volume_delete(names: List[str],
+                  ignore_not_found: bool = False,
+                  purge: bool = False) -> None:
     """Deletes volumes.
 
     Args:
         names: List of volume names to delete.
         ignore_not_found: If True, ignore volumes that are not found.
+        purge: If True, delete the volume from the database even if the
+            deletion API fails.
 
     Raises:
         ValueError: If the volume does not exist
@@ -167,22 +195,32 @@ def volume_delete(names: List[str], ignore_not_found: bool = False) -> None:
             if config is None:
                 raise ValueError(f'Volume {name} has no handle.')
             cloud = config.cloud
-
-
-
-
-
-
-
-
-
-
-
-
-
+            if not purge:
+                usedby_pods, usedby_clusters = provision.get_volume_usedby(
+                    cloud, config)
+                if usedby_clusters:
+                    usedby_clusters_str = ', '.join(usedby_clusters)
+                    cluster_str = 'clusters' if len(
+                        usedby_clusters) > 1 else 'cluster'
+                    raise ValueError(f'Volume {name} is used by {cluster_str}'
+                                     f' {usedby_clusters_str}.')
+                if usedby_pods:
+                    usedby_pods_str = ', '.join(usedby_pods)
+                    pod_str = 'pods' if len(usedby_pods) > 1 else 'pod'
+                    raise ValueError(
+                        f'Volume {name} is used by {pod_str} {usedby_pods_str}.'
+                    )
             logger.debug(f'Deleting volume {name} with config {config}')
             with _volume_lock(name):
-
+                try:
+                    provision.delete_volume(cloud, config)
+                except Exception as e:  # pylint: disable=broad-except
+                    if purge:
+                        logger.warning(f'Failed to delete volume {name} '
+                                       f'on {cloud}: {e}. Purging from '
+                                       'database.')
+                    else:
+                        raise
                 global_user_state.delete_volume(name)
     logger.info(f'Deleted volumes: {names}')
 
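The `volume_list` changes above isolate lookup failures per cloud: if fetching the "used by" information for one cloud raises, every volume on that cloud is flagged with `usedby_fetch_failed` instead of the whole listing failing. A self-contained sketch of that pattern follows, assuming plain dicts in place of the real config and record types; `collect_usedby`, `build_records`, and the `fetch` callback are hypothetical names.

```python
from typing import Callable, Dict, List, Set, Tuple

# fetch(cloud, names) -> (volume name -> pods using it, names that failed)
FetchResult = Tuple[Dict[str, List[str]], Set[str]]


def collect_usedby(
    cloud_to_names: Dict[str, List[str]],
    fetch: Callable[[str, List[str]], FetchResult],
) -> Tuple[Dict[str, Dict[str, List[str]]], Dict[str, Set[str]]]:
    """Query each cloud once; a failing cloud marks all its volumes failed."""
    used_by: Dict[str, Dict[str, List[str]]] = {}
    failed: Dict[str, Set[str]] = {}
    for cloud, names in cloud_to_names.items():
        try:
            used_by[cloud], failed[cloud] = fetch(cloud, names)
        except Exception:  # broad on purpose, mirroring the diff
            used_by[cloud] = {}
            failed[cloud] = set(names)
    return used_by, failed


def build_records(cloud_to_names: Dict[str, List[str]],
                  used_by: Dict[str, Dict[str, List[str]]],
                  failed: Dict[str, Set[str]]) -> List[dict]:
    """Build one record per volume, flagging those with no usable data."""
    records = []
    for cloud, names in cloud_to_names.items():
        for name in names:
            record = {
                'name': name,
                'usedby_pods': [],
                'usedby_fetch_failed': False,
            }
            if name in failed[cloud]:
                record['usedby_fetch_failed'] = True
            else:
                record['usedby_pods'] = used_by[cloud].get(name, [])
            records.append(record)
    return records
```

Flagged records are what `volume_refresh` skips in the first hunk, which avoids treating "could not check" as "not in use".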
sky_templates/ray/start_cluster
CHANGED
@@ -77,14 +77,18 @@ if ! run_ray --version > /dev/null; then
 fi
 echo -e "${GREEN}Ray $(run_ray --version | cut -d' ' -f3) is installed.${NC}"
 
-
+LOCAL_RAY_ADDRESS="127.0.0.1:${RAY_HEAD_PORT}"
+RAY_ADDRESS=${LOCAL_RAY_ADDRESS}
 if [ "${SKYPILOT_NODE_RANK}" -ne 0 ]; then
     HEAD_IP=$(echo "${SKYPILOT_NODE_IPS}" | head -n1)
     RAY_ADDRESS="${HEAD_IP}:${RAY_HEAD_PORT}"
 fi
 
-# Check if user-space Ray is already running
-if
+# Check if user-space Ray is already running. Use local address to check, as
+# if we use the head node address, the check will succeed even if the Ray
+# cluster is started on the head node but not started on the current worker
+# node.
+if run_ray status --address="${LOCAL_RAY_ADDRESS}" &> /dev/null; then
     echo -e "${YELLOW}Ray cluster is already running.${NC}"
     run_ray status --address="${RAY_ADDRESS}"
     exit 0
@@ -140,7 +144,7 @@ if [ "${SKYPILOT_NODE_RANK}" -eq 0 ]; then
             echo -e "${RED}Error: Timeout waiting for nodes.${NC}" >&2
             exit 1
         fi
-        ready_nodes=$(run_ray list nodes --format=json | python3 -c "import sys, json; print(len(json.load(sys.stdin)))")
+        ready_nodes=$(run_ray list nodes --address="${RAY_ADDRESS}" --format=json | python3 -c "import sys, json; print(len(json.load(sys.stdin)))")
         if [ "${ready_nodes}" -ge "${SKYPILOT_NUM_NODES}" ]; then
             break
         fi
{skypilot_nightly-1.0.0.dev20251203.dist-info → skypilot_nightly-1.0.0.dev20260112.dist-info}/METADATA
CHANGED
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: skypilot-nightly
|
|
3
|
-
Version: 1.0.0.
|
|
3
|
+
Version: 1.0.0.dev20260112
|
|
4
4
|
Summary: SkyPilot: Run AI on Any Infra — Unified, Faster, Cheaper.
|
|
5
5
|
Author: SkyPilot Team
|
|
6
6
|
License: Apache 2.0
|
|
@@ -64,6 +64,7 @@ Requires-Dist: passlib
 Requires-Dist: bcrypt==4.0.1
 Requires-Dist: pyjwt
 Requires-Dist: gitpython
+Requires-Dist: paramiko
 Requires-Dist: types-paramiko
 Requires-Dist: alembic
 Requires-Dist: aiohttp
@@ -72,7 +73,7 @@ Provides-Extra: aws
 Requires-Dist: awscli>=1.27.10; extra == "aws"
 Requires-Dist: botocore>=1.29.10; extra == "aws"
 Requires-Dist: boto3>=1.26.1; extra == "aws"
-Requires-Dist: colorama<0.4.
+Requires-Dist: colorama<0.4.7; extra == "aws"
 Requires-Dist: casbin; extra == "aws"
 Requires-Dist: sqlalchemy_adapter; extra == "aws"
 Requires-Dist: passlib; extra == "aws"
@@ -160,7 +161,7 @@ Provides-Extra: cloudflare
 Requires-Dist: awscli>=1.27.10; extra == "cloudflare"
 Requires-Dist: botocore>=1.29.10; extra == "cloudflare"
 Requires-Dist: boto3>=1.26.1; extra == "cloudflare"
-Requires-Dist: colorama<0.4.
+Requires-Dist: colorama<0.4.7; extra == "cloudflare"
 Requires-Dist: casbin; extra == "cloudflare"
 Requires-Dist: sqlalchemy_adapter; extra == "cloudflare"
 Requires-Dist: passlib; extra == "cloudflare"
@@ -175,7 +176,7 @@ Provides-Extra: coreweave
 Requires-Dist: awscli>=1.27.10; extra == "coreweave"
 Requires-Dist: botocore>=1.29.10; extra == "coreweave"
 Requires-Dist: boto3>=1.26.1; extra == "coreweave"
-Requires-Dist: colorama<0.4.
+Requires-Dist: colorama<0.4.7; extra == "coreweave"
 Requires-Dist: kubernetes!=32.0.0,>=20.0.0; extra == "coreweave"
 Requires-Dist: websockets; extra == "coreweave"
 Requires-Dist: python-dateutil; extra == "coreweave"
@@ -244,6 +245,7 @@ Requires-Dist: greenlet; extra == "ssh"
 Provides-Extra: runpod
 Requires-Dist: runpod>=1.6.1; extra == "runpod"
 Requires-Dist: tomli; extra == "runpod"
+Requires-Dist: pycares<5; extra == "runpod"
 Requires-Dist: casbin; extra == "runpod"
 Requires-Dist: sqlalchemy_adapter; extra == "runpod"
 Requires-Dist: passlib; extra == "runpod"
@@ -344,7 +346,7 @@ Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "nebius"
 Requires-Dist: awscli>=1.27.10; extra == "nebius"
 Requires-Dist: botocore>=1.29.10; extra == "nebius"
 Requires-Dist: boto3>=1.26.1; extra == "nebius"
-Requires-Dist: colorama<0.4.
+Requires-Dist: colorama<0.4.7; extra == "nebius"
 Requires-Dist: casbin; extra == "nebius"
 Requires-Dist: sqlalchemy_adapter; extra == "nebius"
 Requires-Dist: passlib; extra == "nebius"
@@ -389,52 +391,66 @@ Requires-Dist: grpcio>=1.63.0; extra == "shadeform"
 Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "shadeform"
 Requires-Dist: aiosqlite; extra == "shadeform"
 Requires-Dist: greenlet; extra == "shadeform"
+Provides-Extra: slurm
+Requires-Dist: python-hostlist; extra == "slurm"
+Requires-Dist: casbin; extra == "slurm"
+Requires-Dist: sqlalchemy_adapter; extra == "slurm"
+Requires-Dist: passlib; extra == "slurm"
+Requires-Dist: pyjwt; extra == "slurm"
+Requires-Dist: aiohttp; extra == "slurm"
+Requires-Dist: anyio; extra == "slurm"
+Requires-Dist: grpcio>=1.63.0; extra == "slurm"
+Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "slurm"
+Requires-Dist: aiosqlite; extra == "slurm"
+Requires-Dist: greenlet; extra == "slurm"
 Provides-Extra: all
-Requires-Dist: greenlet; extra == "all"
-Requires-Dist: azure-identity>=1.19.0; extra == "all"
-Requires-Dist: msrestazure; extra == "all"
-Requires-Dist: azure-mgmt-network>=27.0.0; extra == "all"
-Requires-Dist: aiosqlite; extra == "all"
-Requires-Dist: azure-mgmt-compute>=33.0.0; extra == "all"
-Requires-Dist: anyio; extra == "all"
-Requires-Dist: ibm-platform-services>=0.48.0; extra == "all"
-Requires-Dist: vastai-sdk>=0.1.12; extra == "all"
-Requires-Dist: ibm-cloud-sdk-core; extra == "all"
-Requires-Dist: sqlalchemy_adapter; extra == "all"
-Requires-Dist: botocore>=1.29.10; extra == "all"
-Requires-Dist: msgraph-sdk; extra == "all"
 Requires-Dist: aiohttp; extra == "all"
-Requires-Dist:
-Requires-Dist:
+Requires-Dist: tomli; extra == "all"
+Requires-Dist: ecsapi==0.4.0; extra == "all"
+Requires-Dist: msgraph-sdk; extra == "all"
+Requires-Dist: azure-cli>=2.65.0; extra == "all"
+Requires-Dist: python-dateutil; extra == "all"
+Requires-Dist: ray[default]>=2.6.1; extra == "all"
+Requires-Dist: azure-storage-blob>=12.23.1; extra == "all"
|
|
415
|
+
Requires-Dist: pydo>=0.3.0; extra == "all"
|
|
416
|
+
Requires-Dist: google-cloud-storage; extra == "all"
|
|
417
|
+
Requires-Dist: azure-identity>=1.19.0; extra == "all"
|
|
409
418
|
Requires-Dist: grpcio>=1.63.0; extra == "all"
|
|
410
|
-
Requires-Dist:
|
|
419
|
+
Requires-Dist: colorama<0.4.7; extra == "all"
|
|
420
|
+
Requires-Dist: boto3>=1.26.1; extra == "all"
|
|
421
|
+
Requires-Dist: docker; extra == "all"
|
|
422
|
+
Requires-Dist: sqlalchemy_adapter; extra == "all"
|
|
423
|
+
Requires-Dist: anyio; extra == "all"
|
|
424
|
+
Requires-Dist: pyjwt; extra == "all"
|
|
411
425
|
Requires-Dist: google-api-python-client>=2.69.0; extra == "all"
|
|
412
|
-
Requires-Dist: google-cloud-storage; extra == "all"
|
|
413
|
-
Requires-Dist: azure-cli>=2.65.0; extra == "all"
|
|
414
426
|
Requires-Dist: oci; extra == "all"
|
|
415
|
-
Requires-Dist:
|
|
427
|
+
Requires-Dist: pyvmomi==8.0.1.0.2; extra == "all"
|
|
428
|
+
Requires-Dist: websockets; extra == "all"
|
|
429
|
+
Requires-Dist: kubernetes!=32.0.0,>=20.0.0; extra == "all"
|
|
430
|
+
Requires-Dist: ibm-cloud-sdk-core; extra == "all"
|
|
431
|
+
Requires-Dist: runpod>=1.6.1; extra == "all"
|
|
432
|
+
Requires-Dist: azure-core>=1.24.0; extra == "all"
|
|
433
|
+
Requires-Dist: passlib; extra == "all"
|
|
434
|
+
Requires-Dist: ibm-vpc; extra == "all"
|
|
435
|
+
Requires-Dist: nebius>=0.3.12; extra == "all"
|
|
416
436
|
Requires-Dist: cudo-compute>=0.1.10; extra == "all"
|
|
437
|
+
Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "all"
|
|
438
|
+
Requires-Dist: awscli>=1.27.10; extra == "all"
|
|
439
|
+
Requires-Dist: pycares<5; extra == "all"
|
|
440
|
+
Requires-Dist: ibm-platform-services>=0.48.0; extra == "all"
|
|
441
|
+
Requires-Dist: greenlet; extra == "all"
|
|
417
442
|
Requires-Dist: azure-core>=1.31.0; extra == "all"
|
|
418
|
-
Requires-Dist:
|
|
443
|
+
Requires-Dist: msrestazure; extra == "all"
|
|
444
|
+
Requires-Dist: vastai-sdk>=0.1.12; extra == "all"
|
|
445
|
+
Requires-Dist: pyopenssl<24.3.0,>=23.2.0; extra == "all"
|
|
419
446
|
Requires-Dist: ibm-cos-sdk; extra == "all"
|
|
420
|
-
Requires-Dist: python-
|
|
421
|
-
Requires-Dist:
|
|
422
|
-
Requires-Dist:
|
|
423
|
-
Requires-Dist: azure-
|
|
424
|
-
Requires-Dist: tomli; extra == "all"
|
|
425
|
-
Requires-Dist: azure-core>=1.24.0; extra == "all"
|
|
447
|
+
Requires-Dist: python-hostlist; extra == "all"
|
|
448
|
+
Requires-Dist: azure-mgmt-compute>=33.0.0; extra == "all"
|
|
449
|
+
Requires-Dist: botocore>=1.29.10; extra == "all"
|
|
450
|
+
Requires-Dist: azure-mgmt-network>=27.0.0; extra == "all"
|
|
426
451
|
Requires-Dist: casbin; extra == "all"
|
|
427
|
-
Requires-Dist:
|
|
428
|
-
Requires-Dist: pyvmomi==8.0.1.0.2; extra == "all"
|
|
429
|
-
Requires-Dist: pyjwt; extra == "all"
|
|
430
|
-
Requires-Dist: runpod>=1.6.1; extra == "all"
|
|
431
|
-
Requires-Dist: boto3>=1.26.1; extra == "all"
|
|
432
|
-
Requires-Dist: ray[default]>=2.6.1; extra == "all"
|
|
433
|
-
Requires-Dist: pydo>=0.3.0; extra == "all"
|
|
452
|
+
Requires-Dist: aiosqlite; extra == "all"
|
|
434
453
|
Requires-Dist: azure-common; extra == "all"
|
|
435
|
-
Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "all"
|
|
436
|
-
Requires-Dist: pyopenssl<24.3.0,>=23.2.0; extra == "all"
|
|
437
|
-
Requires-Dist: ibm-vpc; extra == "all"
|
|
438
454
|
Provides-Extra: remote
|
|
439
455
|
Requires-Dist: grpcio>=1.63.0; extra == "remote"
|
|
440
456
|
Requires-Dist: protobuf<7.0.0,>=5.26.1; extra == "remote"
|
|
@@ -484,7 +500,7 @@ Dynamic: summary
|
|
|
484
500
|
</p>
|
|
485
501
|
|
|
486
502
|
<h3 align="center">
|
|
487
|
-
|
|
503
|
+
Run AI on Any Infrastructure
|
|
488
504
|
</h3>
|
|
489
505
|
|
|
490
506
|
<div align="center">
|
|
@@ -494,10 +510,18 @@ Dynamic: summary
|
|
|
494
510
|
</div>
|
|
495
511
|
|
|
496
512
|
|
|
513
|
+
SkyPilot is a system to run, manage, and scale AI workloads on any AI infrastructure.
|
|
497
514
|
|
|
498
|
-
|
|
515
|
+
SkyPilot gives **AI teams** a simple interface to run jobs on any infra.
|
|
516
|
+
**Infra teams** get a unified control plane to manage any AI compute — with advanced scheduling, scaling, and orchestration.
|
|
517
|
+
|
|
518
|
+
<img src="./docs/source/images/skypilot-abstractions-long-2.png" alt="SkyPilot Abstractions">
|
|
519
|
+
|
|
520
|
+
-----
|
|
499
521
|
|
|
500
522
|
:fire: *News* :fire:
|
|
523
|
+
- [Dec 2025] **SkyPilot v0.11** released: Multi-Cloud Pools, Fast Managed Jobs, Enterprise-Readiness at Large Scale, Programmability. [**Release notes**](https://github.com/skypilot-org/skypilot/releases/tag/v0.11.0)
|
|
524
|
+
- [Dec 2025] **SkyPilot Pools** released: Run batch inference and other jobs on a managed pool of warm workers (across clouds or clusters). [**blog**](https://blog.skypilot.co/skypilot-pools-deepseek-ocr/), [**docs**](https://docs.skypilot.co/en/latest/examples/pools.html)
|
|
501
525
|
- [Nov 2025] Serve **Kimi K2 Thinking** with reasoning capabilities on your Kubernetes or clouds: [**example**](./llm/kimi-k2-thinking/)
|
|
502
526
|
- [Oct 2025] Run **RL training for LLMs** with SkyRL on your Kubernetes or clouds: [**example**](./llm/skyrl/)
|
|
503
527
|
- [Oct 2025] Train and serve [Andrej Karpathy's](https://x.com/karpathy/status/1977755427569111362) **nanochat** - the best ChatGPT that $100 can buy: [**example**](./llm/nanochat)
|
|
@@ -506,22 +530,6 @@ Dynamic: summary
|
|
|
506
530
|
- [Sep 2025] Network and Storage Benchmarks for LLM training on the cloud: [**blog**](https://maknee.github.io/blog/2025/Network-And-Storage-Training-Skypilot/)
|
|
507
531
|
- [Aug 2025] Serve and finetune **OpenAI GPT-OSS models** (gpt-oss-120b, gpt-oss-20b) with one command on any infra: [**serve**](./llm/gpt-oss/) + [**LoRA and full finetuning**](./llm/gpt-oss-finetuning/)
|
|
508
532
|
- [Jul 2025] Run distributed **RL training for LLMs** with Verl (PPO, GRPO) on any cloud: [**example**](./llm/verl/)
|
|
509
|
-
- [Jul 2025] Finetune **Llama4** on any distributed cluster/cloud: [**example**](./llm/llama-4-finetuning/)
|
|
510
|
-
- [Jul 2025] Two-part blog series, `The Evolution of AI Job Orchestration`: (1) [Running AI jobs on GPU Neoclouds](https://blog.skypilot.co/ai-job-orchestration-pt1-gpu-neoclouds/), (2) [The AI-Native Control Plane & Orchestration that Finally Works for ML](https://blog.skypilot.co/ai-job-orchestration-pt2-ai-control-plane/)
|
|
511
|
-
- [Apr 2025] Spin up **Qwen3** on your cluster/cloud: [**example**](./llm/qwen/)
|
|
512
|
-
|
|
513
|
-
|
|
514
|
-
|
|
515
|
-
**LLM Finetuning Cookbooks**: Finetuning Llama 2 / Llama 3.1 in your own cloud environment, privately: Llama 2 [**example**](./llm/vicuna-llama-2/) and [**blog**](https://blog.skypilot.co/finetuning-llama2-operational-guide/); Llama 3.1 [**example**](./llm/llama-3_1-finetuning/) and [**blog**](https://blog.skypilot.co/finetune-llama-3_1-on-your-infra/)
|
|
516
|
-
|
|
517
|
-
----
|
|
518
|
-
|
|
519
|
-
SkyPilot is a system to run, manage, and scale AI workloads on any AI infrastructure.
|
|
520
|
-
|
|
521
|
-
SkyPilot gives **AI teams** a simple interface to run jobs on any infra.
|
|
522
|
-
**Infra teams** get a unified control plane to manage any AI compute — with advanced scheduling, scaling, and orchestration.
|
|
523
|
-
|
|
524
|
-
<img src="./docs/source/images/skypilot-abstractions-long-2.png" alt="SkyPilot Abstractions">
|
|
525
533
|
|
|
526
534
|
## Overview
|
|
527
535
|
|