skypilot_nightly-1.0.0.dev20251005-py3-none-any.whl → skypilot_nightly-1.0.0.dev20251009-py3-none-any.whl
This diff compares the contents of two publicly released versions of the package, as they appear in their public registry. It is provided for informational purposes only.
Potentially problematic release: the registry flags this version of skypilot-nightly as possibly problematic.
- sky/__init__.py +2 -2
- sky/authentication.py +17 -21
- sky/backends/backend.py +1 -3
- sky/backends/cloud_vm_ray_backend.py +76 -54
- sky/backends/local_docker_backend.py +0 -5
- sky/client/cli/command.py +6 -6
- sky/client/sdk.py +24 -23
- sky/dashboard/out/404.html +1 -1
- sky/dashboard/out/_next/static/chunks/1141-3b40c39626f99c89.js +11 -0
- sky/dashboard/out/_next/static/chunks/{9037-d0c00018a5ba198c.js → 1871-49141c317f3a9020.js} +2 -2
- sky/dashboard/out/_next/static/chunks/2369.fc20f0c2c8ed9fe7.js +15 -0
- sky/dashboard/out/_next/static/chunks/2755.97300e1362fe7c98.js +26 -0
- sky/dashboard/out/_next/static/chunks/3294.1fafbf42b3bcebff.js +1 -0
- sky/dashboard/out/_next/static/chunks/3785.a19328ba41517b8b.js +1 -0
- sky/dashboard/out/_next/static/chunks/4937.a2baa2df5572a276.js +15 -0
- sky/dashboard/out/_next/static/chunks/6212-7bd06f60ba693125.js +13 -0
- sky/dashboard/out/_next/static/chunks/7359-c8d04e06886000b3.js +30 -0
- sky/dashboard/out/_next/static/chunks/8640.5b9475a2d18c5416.js +16 -0
- sky/dashboard/out/_next/static/chunks/{5339.4a881570243431a5.js → 9360.71e83b2ddc844ec2.js} +4 -24
- sky/dashboard/out/_next/static/chunks/9847.3aaca6bb33455140.js +30 -0
- sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]/{[job]-72794fc3fcdd517a.js → [job]-8f058b0346db2aff.js} +1 -1
- sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]-477555ab7c0b13d8.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/clusters-2f61f65487f6d8ff.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/infra/{[context]-6563820e094f68ca.js → [context]-553b8b5cb65e100b.js} +1 -1
- sky/dashboard/out/_next/static/chunks/pages/{infra-aabba60d57826e0f.js → infra-910a22500c50596f.js} +1 -1
- sky/dashboard/out/_next/static/chunks/pages/jobs/{[job]-dd64309c3fe67ed2.js → [job]-4f7079dcab6ed653.js} +7 -2
- sky/dashboard/out/_next/static/chunks/pages/jobs/pools/{[pool]-509b2977a6373bf6.js → [pool]-bc979970c247d8f3.js} +7 -2
- sky/dashboard/out/_next/static/chunks/pages/jobs-a35a9dc3c5ccd657.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/users-98d2ed979084162a.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/volumes-835d14ba94808f79.js +1 -0
- sky/dashboard/out/_next/static/chunks/pages/workspaces/{[name]-af76bb06dbb3954f.js → [name]-e8688c35c06f0ac5.js} +1 -1
- sky/dashboard/out/_next/static/chunks/pages/{workspaces-7528cc0ef8c522c5.js → workspaces-69c80d677d3c2949.js} +1 -1
- sky/dashboard/out/_next/static/chunks/webpack-6a5ddd0184bfa22c.js +1 -0
- sky/dashboard/out/_next/static/hIViZcQBkn0HE8SpaSsUU/_buildManifest.js +1 -0
- sky/dashboard/out/clusters/[cluster]/[job].html +1 -1
- sky/dashboard/out/clusters/[cluster].html +1 -1
- sky/dashboard/out/clusters.html +1 -1
- sky/dashboard/out/config.html +1 -1
- sky/dashboard/out/index.html +1 -1
- sky/dashboard/out/infra/[context].html +1 -1
- sky/dashboard/out/infra.html +1 -1
- sky/dashboard/out/jobs/[job].html +1 -1
- sky/dashboard/out/jobs/pools/[pool].html +1 -1
- sky/dashboard/out/jobs.html +1 -1
- sky/dashboard/out/users.html +1 -1
- sky/dashboard/out/volumes.html +1 -1
- sky/dashboard/out/workspace/new.html +1 -1
- sky/dashboard/out/workspaces/[name].html +1 -1
- sky/dashboard/out/workspaces.html +1 -1
- sky/execution.py +1 -11
- sky/global_user_state.py +16 -5
- sky/jobs/constants.py +1 -7
- sky/jobs/controller.py +19 -3
- sky/jobs/recovery_strategy.py +3 -1
- sky/jobs/scheduler.py +30 -15
- sky/jobs/server/core.py +8 -3
- sky/jobs/utils.py +30 -2
- sky/metrics/utils.py +65 -37
- sky/provision/instance_setup.py +32 -10
- sky/provision/kubernetes/instance.py +18 -3
- sky/provision/kubernetes/utils.py +4 -1
- sky/provision/provisioner.py +10 -7
- sky/schemas/db/global_user_state/010_save_ssh_key.py +66 -0
- sky/server/common.py +1 -0
- sky/server/config.py +2 -0
- sky/server/metrics.py +3 -1
- sky/server/requests/executor.py +103 -77
- sky/server/requests/requests.py +26 -11
- sky/server/server.py +16 -0
- sky/skylet/constants.py +9 -1
- sky/skylet/events.py +17 -0
- sky/skylet/skylet.py +3 -0
- sky/templates/kubernetes-ray.yml.j2 +6 -1
- sky/utils/context_utils.py +5 -1
- sky/utils/controller_utils.py +14 -0
- sky/utils/db/db_utils.py +2 -0
- sky/utils/db/migration_utils.py +11 -2
- sky/volumes/server/server.py +2 -2
- {skypilot_nightly-1.0.0.dev20251005.dist-info → skypilot_nightly-1.0.0.dev20251009.dist-info}/METADATA +36 -36
- {skypilot_nightly-1.0.0.dev20251005.dist-info → skypilot_nightly-1.0.0.dev20251009.dist-info}/RECORD +85 -84
- sky/dashboard/out/_next/static/Vg53Kzbf7u4o6fYPeOHMe/_buildManifest.js +0 -1
- sky/dashboard/out/_next/static/chunks/1141-159df2d4c441a9d1.js +0 -1
- sky/dashboard/out/_next/static/chunks/1836-37fede578e2da5f8.js +0 -40
- sky/dashboard/out/_next/static/chunks/3294.93d9336bdc032b3a.js +0 -6
- sky/dashboard/out/_next/static/chunks/3785.0fa442e16dd3f00e.js +0 -1
- sky/dashboard/out/_next/static/chunks/4045.b30465273dc5e468.js +0 -21
- sky/dashboard/out/_next/static/chunks/4676-9da7fdbde90b5549.js +0 -10
- sky/dashboard/out/_next/static/chunks/649.b9d7f7d10c1b8c53.js +0 -45
- sky/dashboard/out/_next/static/chunks/7325.b4bc99ce0892dcd5.js +0 -6
- sky/dashboard/out/_next/static/chunks/754-d0da8ab45f9509e9.js +0 -18
- sky/dashboard/out/_next/static/chunks/7669.1f5d9a402bf5cc42.js +0 -36
- sky/dashboard/out/_next/static/chunks/pages/clusters/[cluster]-e052384df65ef200.js +0 -16
- sky/dashboard/out/_next/static/chunks/pages/clusters-469814d711d63b1b.js +0 -1
- sky/dashboard/out/_next/static/chunks/pages/jobs-1f70d9faa564804f.js +0 -1
- sky/dashboard/out/_next/static/chunks/pages/users-018bf31cda52e11b.js +0 -1
- sky/dashboard/out/_next/static/chunks/pages/volumes-739726d6b823f532.js +0 -1
- sky/dashboard/out/_next/static/chunks/webpack-3286453d56f3c0a0.js +0 -1
- /sky/dashboard/out/_next/static/{Vg53Kzbf7u4o6fYPeOHMe → hIViZcQBkn0HE8SpaSsUU}/_ssgManifest.js +0 -0
- {skypilot_nightly-1.0.0.dev20251005.dist-info → skypilot_nightly-1.0.0.dev20251009.dist-info}/WHEEL +0 -0
- {skypilot_nightly-1.0.0.dev20251005.dist-info → skypilot_nightly-1.0.0.dev20251009.dist-info}/entry_points.txt +0 -0
- {skypilot_nightly-1.0.0.dev20251005.dist-info → skypilot_nightly-1.0.0.dev20251009.dist-info}/licenses/LICENSE +0 -0
- {skypilot_nightly-1.0.0.dev20251005.dist-info → skypilot_nightly-1.0.0.dev20251009.dist-info}/top_level.txt +0 -0
@@ -1 +1 @@
-<!DOCTYPE html><html><head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width"/><meta name="next-head-count" content="2"/><link rel="preload" href="/dashboard/_next/static/css/4614e06482d7309e.css" as="style"/><link rel="stylesheet" href="/dashboard/_next/static/css/4614e06482d7309e.css" data-n-g=""/><noscript data-n-css=""></noscript><script defer="" nomodule="" src="/dashboard/_next/static/chunks/polyfills-78c92fac7aa8fdd8.js"></script><script src="/dashboard/_next/static/chunks/webpack-
+<!DOCTYPE html><html><head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width"/><meta name="next-head-count" content="2"/><link rel="preload" href="/dashboard/_next/static/css/4614e06482d7309e.css" as="style"/><link rel="stylesheet" href="/dashboard/_next/static/css/4614e06482d7309e.css" data-n-g=""/><noscript data-n-css=""></noscript><script defer="" nomodule="" src="/dashboard/_next/static/chunks/polyfills-78c92fac7aa8fdd8.js"></script><script src="/dashboard/_next/static/chunks/webpack-6a5ddd0184bfa22c.js" defer=""></script><script src="/dashboard/_next/static/chunks/framework-cf60a09ccd051a10.js" defer=""></script><script src="/dashboard/_next/static/chunks/main-f15ccb73239a3bf1.js" defer=""></script><script src="/dashboard/_next/static/chunks/pages/_app-ce361c6959bc2001.js" defer=""></script><script src="/dashboard/_next/static/chunks/616-3d59f75e2ccf9321.js" defer=""></script><script src="/dashboard/_next/static/chunks/6130-2be46d70a38f1e82.js" defer=""></script><script src="/dashboard/_next/static/chunks/5739-d67458fcb1386c92.js" defer=""></script><script src="/dashboard/_next/static/chunks/7411-b15471acd2cba716.js" defer=""></script><script src="/dashboard/_next/static/chunks/1272-1ef0bf0237faccdb.js" defer=""></script><script src="/dashboard/_next/static/chunks/7359-c8d04e06886000b3.js" defer=""></script><script src="/dashboard/_next/static/chunks/6989-01359c57e018caa4.js" defer=""></script><script src="/dashboard/_next/static/chunks/3850-ff4a9a69d978632b.js" defer=""></script><script src="/dashboard/_next/static/chunks/8969-66237729cdf9749e.js" defer=""></script><script src="/dashboard/_next/static/chunks/6990-f6818c84ed8f1c86.js" defer=""></script><script src="/dashboard/_next/static/chunks/6135-4b4d5e824b7f9d3c.js" defer=""></script><script src="/dashboard/_next/static/chunks/1121-d0782b9251f0fcd3.js" defer=""></script><script src="/dashboard/_next/static/chunks/6601-06114c982db410b6.js" defer=""></script><script src="/dashboard/_next/static/chunks/3015-8d748834fcc60b46.js" defer=""></script><script src="/dashboard/_next/static/chunks/1141-3b40c39626f99c89.js" defer=""></script><script src="/dashboard/_next/static/chunks/pages/workspaces/%5Bname%5D-e8688c35c06f0ac5.js" defer=""></script><script src="/dashboard/_next/static/hIViZcQBkn0HE8SpaSsUU/_buildManifest.js" defer=""></script><script src="/dashboard/_next/static/hIViZcQBkn0HE8SpaSsUU/_ssgManifest.js" defer=""></script></head><body><div id="__next"></div><script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{}},"page":"/workspaces/[name]","query":{},"buildId":"hIViZcQBkn0HE8SpaSsUU","assetPrefix":"/dashboard","nextExport":true,"autoExport":true,"isFallback":false,"scriptLoader":[]}</script></body></html>
sky/dashboard/out/workspaces.html
CHANGED
@@ -1 +1 @@
-<!DOCTYPE html><html><head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width"/><meta name="next-head-count" content="2"/><link rel="preload" href="/dashboard/_next/static/css/4614e06482d7309e.css" as="style"/><link rel="stylesheet" href="/dashboard/_next/static/css/4614e06482d7309e.css" data-n-g=""/><noscript data-n-css=""></noscript><script defer="" nomodule="" src="/dashboard/_next/static/chunks/polyfills-78c92fac7aa8fdd8.js"></script><script src="/dashboard/_next/static/chunks/webpack-
+<!DOCTYPE html><html><head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width"/><meta name="next-head-count" content="2"/><link rel="preload" href="/dashboard/_next/static/css/4614e06482d7309e.css" as="style"/><link rel="stylesheet" href="/dashboard/_next/static/css/4614e06482d7309e.css" data-n-g=""/><noscript data-n-css=""></noscript><script defer="" nomodule="" src="/dashboard/_next/static/chunks/polyfills-78c92fac7aa8fdd8.js"></script><script src="/dashboard/_next/static/chunks/webpack-6a5ddd0184bfa22c.js" defer=""></script><script src="/dashboard/_next/static/chunks/framework-cf60a09ccd051a10.js" defer=""></script><script src="/dashboard/_next/static/chunks/main-f15ccb73239a3bf1.js" defer=""></script><script src="/dashboard/_next/static/chunks/pages/_app-ce361c6959bc2001.js" defer=""></script><script src="/dashboard/_next/static/chunks/pages/workspaces-69c80d677d3c2949.js" defer=""></script><script src="/dashboard/_next/static/hIViZcQBkn0HE8SpaSsUU/_buildManifest.js" defer=""></script><script src="/dashboard/_next/static/hIViZcQBkn0HE8SpaSsUU/_ssgManifest.js" defer=""></script></head><body><div id="__next"></div><script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{}},"page":"/workspaces","query":{},"buildId":"hIViZcQBkn0HE8SpaSsUU","assetPrefix":"/dashboard","nextExport":true,"autoExport":true,"isFallback":false,"scriptLoader":[]}</script></body></html>
sky/execution.py
CHANGED
@@ -112,7 +112,6 @@ def _execute(
     stages: Optional[List[Stage]] = None,
     cluster_name: Optional[str] = None,
     detach_setup: bool = False,
-    detach_run: bool = False,
     idle_minutes_to_autostop: Optional[int] = None,
     no_setup: bool = False,
     clone_disk_from: Optional[str] = None,
@@ -157,8 +156,6 @@ def _execute(
        job itself. You can safely ctrl-c to detach from logging, and it will
        not interrupt the setup process. To see the logs again after detaching,
        use `sky logs`. To cancel setup, cancel the job via `sky cancel`.
-      detach_run: If True, as soon as a job is submitted, return from this
-        function and do not stream execution logs.
      idle_minutes_to_autostop: int; if provided, the cluster will be set to
        autostop after this many minutes of idleness.
      no_setup: bool; whether to skip setup commands or not when (re-)launching.
@@ -217,7 +214,6 @@ def _execute(
         stages=stages,
         cluster_name=cluster_name,
         detach_setup=detach_setup,
-        detach_run=detach_run,
         no_setup=no_setup,
         clone_disk_from=clone_disk_from,
         skip_unnecessary_provisioning=skip_unnecessary_provisioning,
@@ -239,7 +235,6 @@ def _execute_dag(
     stages: Optional[List[Stage]],
     cluster_name: Optional[str],
     detach_setup: bool,
-    detach_run: bool,
     no_setup: bool,
     clone_disk_from: Optional[str],
     skip_unnecessary_provisioning: bool,
@@ -507,10 +502,7 @@ def _execute_dag(
     if Stage.EXEC in stages:
         try:
             global_user_state.update_last_use(handle.get_cluster_name())
-            job_id = backend.execute(handle,
-                                     task,
-                                     detach_run,
-                                     dryrun=dryrun)
+            job_id = backend.execute(handle, task, dryrun=dryrun)
         finally:
             # Enables post_execute() to be run after KeyboardInterrupt.
             backend.post_execute(handle, down)
@@ -707,7 +699,6 @@ def launch(
         stages=stages,
         cluster_name=cluster_name,
         detach_setup=detach_setup,
-        detach_run=True,
         idle_minutes_to_autostop=idle_minutes_to_autostop,
         no_setup=no_setup,
         clone_disk_from=clone_disk_from,
@@ -802,6 +793,5 @@ def exec( # pylint: disable=redefined-builtin
             Stage.EXEC,
         ],
         cluster_name=cluster_name,
-        detach_run=True,
         job_logger=job_logger,
     )
sky/global_user_state.py
CHANGED
@@ -2495,11 +2495,22 @@ def _set_cluster_yaml_from_file(cluster_yaml_path: str,
     # on the local file system and migrate it to the database.
     # TODO(syang): remove this check once we have a way to migrate the
     # cluster from file to database. Remove on v0.12.0.
-    if cluster_yaml_path is not None
-    [4 removed lines truncated in this diff rendering]
+    if cluster_yaml_path is not None:
+        # First try the exact path
+        path_to_read = None
+        if os.path.exists(cluster_yaml_path):
+            path_to_read = cluster_yaml_path
+        # Fallback: try with .debug suffix (when debug logging was enabled)
+        # Debug logging causes YAML files to be saved with .debug suffix
+        # but the path stored in the handle doesn't include it
+        debug_path = cluster_yaml_path + '.debug'
+        if os.path.exists(debug_path):
+            path_to_read = debug_path
+        if path_to_read is not None:
+            with open(path_to_read, 'r', encoding='utf-8') as f:
+                yaml_str = f.read()
+            set_cluster_yaml(cluster_name, yaml_str)
+            return yaml_str
     return None
 
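The fallback added above first checks the recorded path and then a '.debug'-suffixed copy written when debug logging is enabled. A minimal standalone sketch of that resolution order, using a hypothetical helper name that is not part of SkyPilot's API:

import os
from typing import Optional


def resolve_cluster_yaml_path(cluster_yaml_path: str) -> Optional[str]:
    # Prefer the exact path recorded in the cluster handle.
    if os.path.exists(cluster_yaml_path):
        return cluster_yaml_path
    # Fall back to the '.debug' copy that debug logging may have produced.
    debug_path = cluster_yaml_path + '.debug'
    if os.path.exists(debug_path):
        return debug_path
    return None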
sky/jobs/constants.py
CHANGED
@@ -15,16 +15,10 @@ JOB_CONTROLLER_INDICATOR_FILE = '~/.sky/is_jobs_controller'
 CONSOLIDATED_SIGNAL_PATH = os.path.expanduser('~/.sky/signals/')
 SIGNAL_FILE_PREFIX = '/tmp/sky_jobs_controller_signal_{}'
 # Resources as a dict for the jobs controller.
-# Use smaller CPU instance type for jobs controller, but with more memory, i.e.
-# r6i.xlarge (4vCPUs, 32 GB) for AWS, Standard_E4s_v5 (4vCPUs, 32 GB) for Azure,
-# and n2-highmem-4 (4 vCPUs, 32 GB) for GCP, etc.
-# Concurrently limits are set based on profiling. 4x num vCPUs is the launch
-# parallelism limit, and memory / 350MB is the limit to concurrently running
-# jobs. See _get_launch_parallelism and _get_job_parallelism in scheduler.py.
 # We use 50 GB disk size to reduce the cost.
 CONTROLLER_RESOURCES: Dict[str, Union[str, int]] = {
     'cpus': '4+',
-    'memory': '
+    'memory': '4x',
     'disk_size': 50
 }
 
sky/jobs/controller.py
CHANGED
@@ -870,8 +870,16 @@ class Controller:
         # because when SkyPilot API server machine sends the yaml config to
         # the controller machine, only storage metadata is sent, not the
         # storage object itself.
-        [removed line truncated in this diff rendering]
-        storage.
+        try:
+            for storage in task.storage_mounts.values():
+                storage.construct()
+        except (exceptions.StorageSpecError, exceptions.StorageError) as e:
+            job_logger.warning(
+                f'Failed to construct storage object for teardown: {e}\n'
+                'This may happen because storage construction already '
+                'failed during launch, storage was deleted externally, '
+                'credentials expired/changed, or network connectivity '
+                'issues.')
         try:
             backend.teardown_ephemeral_storage(task)
         except Exception as e:  # pylint: disable=broad-except
@@ -1144,7 +1152,15 @@ class Controller:
                 await asyncio.sleep(30)
                 continue
 
-            if
+            # Normally, 200 jobs can run on each controller. But if we have a
+            # ton of controllers, we need to limit the number of jobs that can
+            # run on each controller, to achieve a total of 2000 jobs across all
+            # controllers.
+            max_jobs = min(scheduler.MAX_JOBS_PER_WORKER,
+                           (scheduler.MAX_TOTAL_RUNNING_JOBS //
+                            scheduler.get_number_of_controllers()))
+
+            if len(running_tasks) >= max_jobs:
                 await asyncio.sleep(60)
                 continue
 
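The max_jobs computation above caps each controller at the smaller of the per-worker limit and an even share of the cluster-wide limit. A short worked example using the constants from sky/jobs/scheduler.py (controller counts here are illustrative, not measured):

MAX_JOBS_PER_WORKER = 200      # per-controller ceiling (sky/jobs/scheduler.py)
MAX_TOTAL_RUNNING_JOBS = 2000  # cluster-wide ceiling (sky/jobs/scheduler.py)


def max_jobs_per_controller(num_controllers: int) -> int:
    # Mirrors: min(MAX_JOBS_PER_WORKER, MAX_TOTAL_RUNNING_JOBS // num_controllers)
    return min(MAX_JOBS_PER_WORKER, MAX_TOTAL_RUNNING_JOBS // num_controllers)


assert max_jobs_per_controller(4) == 200   # 2000 // 4 = 500, capped at 200
assert max_jobs_per_controller(16) == 125  # 2000 // 16 = 125, below the cap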
sky/jobs/recovery_strategy.py
CHANGED
@@ -495,7 +495,9 @@ class StrategyExecutor:
                 self._logger.info('Managed job cluster launched.')
             except (exceptions.InvalidClusterNameError,
                     exceptions.NoCloudAccessError,
-                    exceptions.ResourcesMismatchError
+                    exceptions.ResourcesMismatchError,
+                    exceptions.StorageSpecError,
+                    exceptions.StorageError) as e:
                 self._logger.error(
                     'Failure happened before provisioning. '
                     f'{common_utils.format_exception(e)}')
sky/jobs/scheduler.py
CHANGED
@@ -63,7 +63,9 @@ from sky.jobs import state
 from sky.jobs import utils as managed_job_utils
 from sky.server import config as server_config
 from sky.skylet import constants
+from sky.utils import annotations
 from sky.utils import common_utils
+from sky.utils import controller_utils
 from sky.utils import subprocess_utils
 
 if typing.TYPE_CHECKING:
@@ -91,20 +93,29 @@ JOB_MEMORY_MB = 400
 LAUNCHES_PER_WORKER = 8
 # this can probably be increased to around 300-400 but keeping it lower to just
 # to be safe
-[7 removed lines truncated in this diff rendering]
+MAX_JOBS_PER_WORKER = 200
+# Maximum number of controllers that can be running. Hard to handle more than
+# 512 launches at once.
+MAX_CONTROLLERS = 512 // LAUNCHES_PER_WORKER
+# Limit the number of jobs that can be running at once on the entire jobs
+# controller cluster. It's hard to handle cancellation of more than 2000 jobs at
+# once.
+# TODO(cooperc): Once we eliminate static bottlenecks (e.g. sqlite), remove this
+# hardcoded max limit.
+MAX_TOTAL_RUNNING_JOBS = 2000
 # Maximum values for above constants. There will start to be lagging issues
 # at these numbers already.
 # JOB_MEMORY_MB = 200
 # LAUNCHES_PER_WORKER = 16
 # JOBS_PER_WORKER = 400
 
+# keep 2GB reserved after the controllers
+MAXIMUM_CONTROLLER_RESERVED_MEMORY_MB = 2048
+
+CURRENT_HASH = os.path.expanduser('~/.sky/wheels/current_sky_wheel_hash')
+
 
+@annotations.lru_cache(scope='global')
 def get_number_of_controllers() -> int:
     """Returns the number of controllers that should be running.
 
@@ -123,7 +134,7 @@ def get_number_of_controllers() -> int:
     consolidation_mode = skypilot_config.get_nested(
         ('jobs', 'controller', 'consolidation_mode'), default_value=False)
 
-    total_memory_mb =
+    total_memory_mb = controller_utils.get_controller_mem_size_gb() * 1024
     if consolidation_mode:
         config = server_config.compute_server_config(deploy=True, quiet=True)
 
@@ -136,13 +147,16 @@ def get_number_of_controllers() -> int:
                 config.short_worker_config.burstable_parallelism) * \
             server_config.SHORT_WORKER_MEM_GB * 1024
 
-        return
+        return min(MAX_CONTROLLERS,
+                   max(1, int((total_memory_mb - used) // JOB_MEMORY_MB)))
     else:
-        return
-        [4 removed lines truncated in this diff rendering]
+        return min(
+            MAX_CONTROLLERS,
+            max(
+                1,
+                int((total_memory_mb - MAXIMUM_CONTROLLER_RESERVED_MEMORY_MB) /
+                    ((LAUNCHES_PER_WORKER * server_config.LONG_WORKER_MEM_GB) *
+                     1024 + JOB_MEMORY_MB))))
 
 
 def start_controller() -> None:
@@ -280,7 +294,8 @@ def submit_job(job_id: int, dag_yaml_path: str, original_user_yaml_path: str,
                            common_utils.get_user_hash(), priority)
     if state.get_ha_recovery_script(job_id) is None:
         # the run command is just the command that called scheduler
-        run = (f'{
+        run = (f'source {env_file_path} && '
+               f'{sys.executable} -m sky.jobs.scheduler {dag_yaml_path} '
               f'--job-id {job_id} --env-file {env_file_path} '
              f'--user-yaml-path {original_user_yaml_path} '
              f'--priority {priority}')
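get_number_of_controllers() above sizes the controller pool from available memory and then clamps it to MAX_CONTROLLERS. A sketch of the non-consolidation branch, with the long-worker memory size taken as a parameter because its actual value lives in sky.server.config and is not shown in this diff:

JOB_MEMORY_MB = 400
LAUNCHES_PER_WORKER = 8
MAX_CONTROLLERS = 512 // LAUNCHES_PER_WORKER  # 64
MAXIMUM_CONTROLLER_RESERVED_MEMORY_MB = 2048


def controllers_for(total_memory_mb: float, long_worker_mem_gb: float) -> int:
    # Each controller needs memory for its launch workers plus one job's
    # bookkeeping; 2 GB stays reserved for everything else.
    per_controller_mb = ((LAUNCHES_PER_WORKER * long_worker_mem_gb) * 1024 +
                         JOB_MEMORY_MB)
    usable_mb = total_memory_mb - MAXIMUM_CONTROLLER_RESERVED_MEMORY_MB
    return min(MAX_CONTROLLERS, max(1, int(usable_mb / per_controller_mb)))


# Assumed example: a 32 GB controller with 0.5 GB long workers -> 6 controllers.
print(controllers_for(32 * 1024, long_worker_mem_gb=0.5))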
sky/jobs/server/core.py
CHANGED
@@ -407,9 +407,12 @@ def launch(
             job_identity = ''
             if job_rank is not None:
                 job_identity = f' (rank: {job_rank})'
-            [3 removed lines truncated in this diff rendering]
+            job_controller_postfix = (' from jobs controller' if
+                                      consolidation_mode_job_id is None else '')
+            logger.info(
+                f'{colorama.Fore.YELLOW}'
+                f'Launching managed job {dag.name!r}{job_identity}'
+                f'{job_controller_postfix}...{colorama.Style.RESET_ALL}')
 
             # Launch with the api server's user hash, so that sky status does
             # not show the owner of the controller as whatever user launched
@@ -456,6 +459,8 @@
             managed_job_state.set_ha_recovery_script(
                 consolidation_mode_job_id, run_script)
             backend.run_on_head(local_handle, run_script)
+            ux_utils.starting_message(
+                f'Job submitted, ID: {consolidation_mode_job_id}')
             return consolidation_mode_job_id, local_handle
 
         if pool is None:
sky/jobs/utils.py
CHANGED
@@ -11,6 +11,7 @@ import enum
 import logging
 import os
 import pathlib
+import re
 import shlex
 import textwrap
 import time
@@ -299,8 +300,10 @@ async def get_job_status(
             job_logger.info(f'Job status: {status}')
             job_logger.info('=' * 34)
             return status
-        except (exceptions.CommandError, grpc.RpcError,
-        [removed line truncated in this diff rendering]
+        except (exceptions.CommandError, grpc.RpcError, grpc.FutureTimeoutError,
+                ValueError, TypeError) as e:
+            # Note: Each of these exceptions has some additional conditions to
+            # limit how we handle it and whether or not we catch it.
             # Retry on k8s transient network errors. This is useful when using
             # coreweave which may have transient network issue sometimes.
             is_transient_error = False
@@ -319,6 +322,31 @@ async def get_job_status(
                     is_transient_error = True
             elif isinstance(e, grpc.FutureTimeoutError):
                 detailed_reason = 'Timeout'
+            # TODO(cooperc): Gracefully handle these exceptions in the backend.
+            elif isinstance(e, ValueError):
+                # If the cluster yaml is deleted in the middle of getting the
+                # SSH credentials, we could see this. See
+                # sky/global_user_state.py get_cluster_yaml_dict.
+                if re.search(r'Cluster yaml .* not found', str(e)):
+                    detailed_reason = 'Cluster yaml was deleted'
+                else:
+                    raise
+            elif isinstance(e, TypeError):
+                # We will grab the SSH credentials from the cluster yaml, but if
+                # handle.cluster_yaml is None, we will just return an empty dict
+                # for the credentials. See
+                # backend_utils.ssh_credential_from_yaml. Then, the credentials
+                # are passed as kwargs to SSHCommandRunner.__init__ - see
+                # cloud_vm_ray_backend.get_command_runners. So we can hit this
+                # TypeError if the cluster yaml is removed from the handle right
+                # when we pull it before the cluster is fully deleted.
+                error_msg_to_check = (
+                    'SSHCommandRunner.__init__() missing 2 required positional '
+                    'arguments: \'ssh_user\' and \'ssh_private_key\'')
+                if str(e) == error_msg_to_check:
+                    detailed_reason = 'SSH credentials were already cleaned up'
+                else:
+                    raise
             if is_transient_error:
                 logger.info('Failed to connect to the cluster. Retrying '
                             f'({i + 1}/{_JOB_STATUS_FETCH_MAX_RETRIES})...')
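The ValueError/TypeError handling above only swallows two specific teardown races and re-raises everything else. A compact sketch of that classification, factored into a hypothetical helper for readability:

import re
from typing import Optional


def classify_teardown_race(e: Exception) -> Optional[str]:
    # Returns a detailed_reason when the error matches a known race with
    # cluster teardown; returns None when the caller should re-raise.
    if isinstance(e, ValueError):
        if re.search(r'Cluster yaml .* not found', str(e)):
            return 'Cluster yaml was deleted'
    elif isinstance(e, TypeError):
        expected = ('SSHCommandRunner.__init__() missing 2 required positional '
                    "arguments: 'ssh_user' and 'ssh_private_key'")
        if str(e) == expected:
            return 'SSH credentials were already cleaned up'
    return None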
sky/metrics/utils.py
CHANGED
@@ -11,7 +11,9 @@ from typing import List, Optional, Tuple
 import httpx
 import prometheus_client as prom
 
+from sky import sky_logging
 from sky.skylet import constants
+from sky.utils import common_utils
 from sky.utils import context_utils
 
 _SELECT_TIMEOUT = 1
@@ -35,6 +37,8 @@ _MEM_BUCKETS = [
     float('inf'),
 ]
 
+logger = sky_logging.init_logger(__name__)
+
 # Whether the metrics are enabled, cannot be changed at runtime.
 METRICS_ENABLED = os.environ.get(constants.ENV_VAR_SERVER_METRICS_ENABLED,
                                  'false').lower() == 'true'
@@ -188,35 +192,42 @@ def start_svc_port_forward(context: str, namespace: str, service: str,
     if 'KUBECONFIG' not in env:
         env['KUBECONFIG'] = os.path.expanduser('~/.kube/config')
 
-    [2 removed lines truncated in this diff rendering]
-                                            stdout=subprocess.PIPE,
-                                            stderr=subprocess.STDOUT,
-                                            text=True,
-                                            env=env)
-    [removed line truncated in this diff rendering]
+    port_forward_process = None
+    port_forward_exit = False
     local_port = None
-    [19 removed lines truncated in this diff rendering]
+    poller = None
+    fd = None
+
+    try:
+        # start the port forward process
+        port_forward_process = subprocess.Popen(cmd,
+                                                stdout=subprocess.PIPE,
+                                                stderr=subprocess.STDOUT,
+                                                text=True,
+                                                env=env)
+
+        # Use poll() instead of select() to avoid FD_SETSIZE limit
+        poller = select.poll()
+        assert port_forward_process.stdout is not None
+        fd = port_forward_process.stdout.fileno()
+        poller.register(fd, select.POLLIN)
+
+        start_time = time.time()
+        buffer = ''
+        # wait for the port forward to start and extract the local port
+        while time.time() - start_time < start_port_forward_timeout:
+            if port_forward_process.poll() is not None:
+                # port forward process has terminated
+                if port_forward_process.returncode != 0:
+                    port_forward_exit = True
+                break
+
+            # Wait up to 1000ms for data to be available without blocking
+            # poll() takes timeout in milliseconds
+            events = poller.poll(_SELECT_TIMEOUT * 1000)
+
+            if events:
                 # Read available bytes from the FD without blocking
-                fd = port_forward_process.stdout.fileno()
                 raw = os.read(fd, _SELECT_BUFFER_SIZE)
                 chunk = raw.decode(errors='ignore')
                 buffer += chunk
@@ -225,16 +236,28 @@ def start_svc_port_forward(context: str, namespace: str, service: str,
                     local_port = int(match.group(1))
                     break
 
-            [3 removed lines truncated in this diff rendering]
+            # sleep for 100ms to avoid busy-waiting
+            time.sleep(0.1)
+    except BaseException:  # pylint: disable=broad-exception-caught
+        if port_forward_process:
+            stop_svc_port_forward(port_forward_process,
+                                  timeout=terminate_port_forward_timeout)
+        raise
+    finally:
+        if poller is not None and fd is not None:
+            try:
+                poller.unregister(fd)
+            except (OSError, ValueError):
+                # FD may already be unregistered or invalid
+                pass
+    if port_forward_exit:
+        raise RuntimeError(f'Port forward failed for service {service} in '
+                           f'namespace {namespace} on context {context}')
     if local_port is None:
         try:
-            port_forward_process
-            [2 removed lines truncated in this diff rendering]
-            port_forward_process.kill()
-            port_forward_process.wait()
+            if port_forward_process:
+                stop_svc_port_forward(port_forward_process,
+                                      timeout=terminate_port_forward_timeout)
         finally:
             raise RuntimeError(
                 f'Failed to extract local port for service {service} in '
@@ -243,14 +266,15 @@ def start_svc_port_forward(context: str, namespace: str, service: str,
     return port_forward_process, local_port
 
 
-def stop_svc_port_forward(port_forward_process: subprocess.Popen
+def stop_svc_port_forward(port_forward_process: subprocess.Popen,
+                          timeout: int = 5) -> None:
     """Stops a port forward to a service in a Kubernetes cluster.
     Args:
         port_forward_process: The subprocess.Popen process to terminate
     """
     try:
         port_forward_process.terminate()
-        port_forward_process.wait(timeout=
+        port_forward_process.wait(timeout=timeout)
     except subprocess.TimeoutExpired:
         port_forward_process.kill()
         port_forward_process.wait()
@@ -301,6 +325,10 @@ async def send_metrics_request_with_port_forward(
         response.raise_for_status()
         return response.text
 
+    except Exception as e:  # pylint: disable=broad-exception-caught
+        logger.error(f'Failed to send metrics request with port forward: '
+                     f'{common_utils.format_exception(e)}')
+        raise
     finally:
         # Always clean up port forward
         if port_forward_process:
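The port-forward changes above switch from select.select() to select.poll() because select() cannot watch file descriptors at or above FD_SETSIZE (commonly 1024), which a busy API server can exceed. A minimal, generic sketch of the poll-based read loop; the service name and port below are placeholders, and the parsing relies on kubectl's usual 'Forwarding from ...' output line:

import os
import select
import subprocess

proc = subprocess.Popen(
    ['kubectl', 'port-forward', 'svc/example-svc', ':8080'],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
poller = select.poll()  # no FD_SETSIZE ceiling, unlike select.select()
fd = proc.stdout.fileno()
poller.register(fd, select.POLLIN)

buffer = ''
while proc.poll() is None:
    if poller.poll(1000):  # timeout is in milliseconds
        buffer += os.read(fd, 4096).decode(errors='ignore')
        if 'Forwarding from' in buffer:
            break  # the chosen local port has been printed
poller.unregister(fd)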
sky/provision/instance_setup.py
CHANGED
@@ -10,6 +10,7 @@ from typing import Any, Callable, Dict, List, Optional, Tuple
 from sky import exceptions
 from sky import logs
 from sky import provision
+from sky import resources as resources_lib
 from sky import sky_logging
 from sky.provision import common
 from sky.provision import docker_utils
@@ -92,12 +93,6 @@ def _set_usage_run_id_cmd() -> str:
             f'{usage_constants.USAGE_RUN_ID_FILE}')
 
 
-def _set_skypilot_env_var_cmd() -> str:
-    """Sets the skypilot environment variables on the remote machine."""
-    env_vars = env_options.Options.all_options()
-    return '; '.join([f'export {k}={v}' for k, v in env_vars.items()])
-
-
 def _auto_retry(should_retry: Callable[[Exception], bool] = lambda _: True):
     """Decorator that retries the function if it fails.
 
@@ -482,11 +477,38 @@ def start_ray_on_worker_nodes(cluster_name: str, no_restart: bool,
 @common.log_function_start_end
 @_auto_retry()
 @timeline.event
-def start_skylet_on_head_node(
-[2 removed lines truncated in this diff rendering]
+def start_skylet_on_head_node(
+        cluster_name: resources_utils.ClusterName,
+        cluster_info: common.ClusterInfo, ssh_credentials: Dict[str, Any],
+        launched_resources: resources_lib.Resources) -> None:
     """Start skylet on the head node."""
-    [removed line truncated in this diff rendering]
+    # Avoid circular import.
+    # pylint: disable=import-outside-toplevel
+    from sky.utils import controller_utils
+
+    def _set_skypilot_env_var_cmd() -> str:
+        """Sets the skypilot environment variables on the remote machine."""
+        env_vars = {
+            k: str(v) for (k, v) in env_options.Options.all_options().items()
+        }
+        is_controller = controller_utils.Controllers.from_name(
+            cluster_name.display_name) is not None
+        is_kubernetes = cluster_info.provider_name == 'kubernetes'
+        if is_controller and is_kubernetes:
+            # For jobs/serve controller, we pass in the CPU and memory limits
+            # when starting the skylet to handle cases where these env vars
+            # are not set on the cluster's pod spec. The skylet will read
+            # these env vars when starting (ManagedJobEvent.start()) and write
+            # it to disk.
+            resources = launched_resources.assert_launchable()
+            vcpus, mem = resources.cloud.get_vcpus_mem_from_instance_type(
+                resources.instance_type)
+            if vcpus is not None:
+                env_vars['SKYPILOT_POD_CPU_CORE_LIMIT'] = str(vcpus)
+            if mem is not None:
+                env_vars['SKYPILOT_POD_MEMORY_GB_LIMIT'] = str(mem)
+        return '; '.join([f'export {k}={v}' for k, v in env_vars.items()])
+
     runners = provision.get_command_runners(cluster_info.provider_name,
                                             cluster_info, **ssh_credentials)
     head_runner = runners[0]
sky/provision/kubernetes/instance.py
CHANGED
@@ -934,8 +934,11 @@ def _create_pods(region: str, cluster_name: str, cluster_name_on_cloud: str,
     running_pods = kubernetes_utils.filter_pods(namespace, context, tags,
                                                 ['Pending', 'Running'])
     head_pod_name = _get_head_pod_name(running_pods)
+    running_pod_statuses = [{
+        pod.metadata.name: pod.status.phase
+    } for pod in running_pods.values()]
     logger.debug(f'Found {len(running_pods)} existing pods: '
-                 f'{
+                 f'{running_pod_statuses}')
 
     to_start_count = config.count - len(running_pods)
     if to_start_count < 0:
@@ -1142,10 +1145,21 @@ def _create_pods(region: str, cluster_name: str, cluster_name_on_cloud: str,
     pods = created_resources
 
     created_pods = {}
+    valid_pods = []
     for pod in pods:
+        # In case Pod is not created
+        if pod is None:
+            continue
+        valid_pods.append(pod)
         created_pods[pod.metadata.name] = pod
         if head_pod_name is None and _is_head(pod):
             head_pod_name = pod.metadata.name
+    pods = valid_pods
+
+    # The running_pods may include Pending Pods, so we add them to the pods
+    # list to wait for scheduling and running
+    if running_pods:
+        pods = pods + list(running_pods.values())
 
     provision_timeout = provider_config['timeout']
 
@@ -1369,8 +1383,9 @@ def get_cluster_info(
     assert head_spec is not None, pod
     cpu_request = head_spec.containers[0].resources.requests['cpu']
 
-    [2 removed lines truncated in this diff rendering]
+    if cpu_request is None:
+        raise RuntimeError(f'Pod {cluster_name_on_cloud}-head not found'
+                           ' or not Running, check the Pod status')
 
     ssh_user = 'sky'
     # Use pattern matching to extract SSH user, handling MOTD contamination.
sky/provision/kubernetes/utils.py
CHANGED
@@ -1688,7 +1688,10 @@ def check_credentials(context: Optional[str],
     try:
         namespace = get_kube_config_context_namespace(context)
         kubernetes.core_api(context).list_namespaced_pod(
-            namespace, _request_timeout=timeout)
+            namespace, limit=1, _request_timeout=timeout)
+        # This call is "free" because this function is a cached call,
+        # and it will not be called again in this function.
+        get_kubernetes_nodes(context=context)
     except ImportError:
         # TODO(romilb): Update these error strs to also include link to docs
         # when docs are ready.