aissemble_inference_deploy-1.5.0rc3-py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- aissemble_inference_deploy/__init__.py +38 -0
- aissemble_inference_deploy/cli.py +278 -0
- aissemble_inference_deploy/config.py +182 -0
- aissemble_inference_deploy/generators/__init__.py +36 -0
- aissemble_inference_deploy/generators/base.py +239 -0
- aissemble_inference_deploy/generators/docker.py +307 -0
- aissemble_inference_deploy/generators/kserve.py +89 -0
- aissemble_inference_deploy/generators/kubernetes.py +119 -0
- aissemble_inference_deploy/generators/local.py +162 -0
- aissemble_inference_deploy/registry.py +158 -0
- aissemble_inference_deploy/templates/docker/.dockerignore.j2 +47 -0
- aissemble_inference_deploy/templates/docker/Dockerfile.j2 +59 -0
- aissemble_inference_deploy/templates/docker/README.md.j2 +163 -0
- aissemble_inference_deploy/templates/docker/docker-compose.yml.j2 +22 -0
- aissemble_inference_deploy/templates/kserve/README.md.j2 +278 -0
- aissemble_inference_deploy/templates/kserve/inference-service.yaml.j2 +14 -0
- aissemble_inference_deploy/templates/kserve/serving-runtime.yaml.j2 +35 -0
- aissemble_inference_deploy/templates/kubernetes/README.md.j2 +164 -0
- aissemble_inference_deploy/templates/kubernetes/deployment.yaml.j2 +50 -0
- aissemble_inference_deploy/templates/kubernetes/kustomization.yaml.j2 +11 -0
- aissemble_inference_deploy/templates/kubernetes/overlays/dev/kustomization.yaml.j2 +52 -0
- aissemble_inference_deploy/templates/kubernetes/overlays/prod/kustomization.yaml.j2 +36 -0
- aissemble_inference_deploy/templates/kubernetes/service.yaml.j2 +19 -0
- aissemble_inference_deploy/templates/local/run-mlserver.sh.j2 +47 -0
- aissemble_inference_deploy-1.5.0rc3.dist-info/METADATA +248 -0
- aissemble_inference_deploy-1.5.0rc3.dist-info/RECORD +29 -0
- aissemble_inference_deploy-1.5.0rc3.dist-info/WHEEL +4 -0
- aissemble_inference_deploy-1.5.0rc3.dist-info/entry_points.txt +8 -0
- aissemble_inference_deploy-1.5.0rc3.dist-info/licenses/LICENSE.txt +201 -0
aissemble_inference_deploy/templates/docker/docker-compose.yml.j2
@@ -0,0 +1,22 @@
# Generated by aissemble-inference-deploy
# Docker Compose configuration for local MLServer deployment

services:
  mlserver:
    image: {{ image_name }}:latest
    build:
      context: ../..  # Project root (where models/ and pyproject.toml are)
      dockerfile: deploy/docker/Dockerfile
    ports:
      - "{{ http_port }}:8080"  # HTTP
      - "{{ grpc_port }}:8081"  # gRPC
    environment:
      - MLSERVER_HOST=0.0.0.0  # Bind to all interfaces (required for Docker)
      - MLSERVER_HTTP_PORT=8080
      - MLSERVER_GRPC_PORT=8081
    {% if models %}
    # Models served:
    {% for model in models %}
    # - {{ model.name }}
    {% endfor %}
    {% endif %}
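Once rendered (the `context`/`dockerfile` paths above imply the file lands at `deploy/docker/docker-compose.yml`), a local run is a short sketch along these lines; the health path comes from MLServer's V2 endpoints referenced in the READMEs below:

```bash
cd deploy/docker
docker-compose up --build -d
curl http://localhost:{{ http_port }}/v2/health/ready   # substitute the rendered http_port value
```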
aissemble_inference_deploy/templates/kserve/README.md.j2
@@ -0,0 +1,278 @@
# KServe Deployment

This directory contains KServe manifests for deploying MLServer models using the
ServingRuntime + InferenceService pattern for DRY configuration.

## Prerequisites

### 1. Kubernetes Cluster

A Kubernetes cluster (Rancher Desktop, kind, minikube, EKS, GKE, etc.)

### 2. cert-manager

KServe requires cert-manager for webhook certificates:

```bash
# Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.3/cert-manager.yaml

# Wait for cert-manager to be ready
kubectl wait --for=condition=Available deployment --all -n cert-manager --timeout=300s
```

### 3. Knative Serving (for serverless/scale-to-zero)

KServe's serverless mode requires Knative Serving:

```bash
# Install Knative Serving CRDs and core
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.12.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.12.0/serving-core.yaml

# Install networking layer (Kourier - lighter than Istio)
kubectl apply -f https://github.com/knative/net-kourier/releases/download/knative-v1.12.0/kourier.yaml

# Configure Knative to use Kourier
kubectl patch configmap/config-network -n knative-serving \
  --type merge -p '{"data":{"ingress-class":"kourier.ingress.networking.knative.dev"}}'

# Wait for Knative to be ready
kubectl wait --for=condition=Available deployment --all -n knative-serving --timeout=300s
```

**Note:** If your cluster already has Istio, you can skip Kourier and configure Knative to use Istio instead.

### 4. KServe

Install KServe (includes the ServingRuntime CRD):

```bash
# Install KServe
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.11.2/kserve.yaml

# Wait for KServe to be ready
kubectl wait --for=condition=Available deployment --all -n kserve --timeout=300s
```

### 5. Docker Image

Build and push the Docker image (see `../docker/`).

## Structure

```
kserve/
  serving-runtime.yaml     # ServingRuntime (shared runtime configuration)
  inference-service.yaml   # InferenceService (model deployment)
```

**Why two files?** This follows the DRY principle:
- **ServingRuntime** defines the runtime configuration once (image, ports, resources)
- **InferenceService** references the runtime and adds scaling configuration
- If you later add more services, they can share the same ServingRuntime (see the sketch below)
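As a sketch of that reuse, a second InferenceService (the `another-model` name is a hypothetical placeholder) can point at the same runtime:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: another-model                    # hypothetical second service
spec:
  predictor:
    model:
      modelFormat:
        name: mlserver
      runtime: {{ app_name }}-runtime   # reuses the ServingRuntime above
      protocolVersion: v2
```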
## Models Included

{% for model in models %}
- **{{ model.name }}**{% if model.runtime %} ({{ model.runtime }}){% endif %}

{% endfor %}

## Namespace Configuration

By default, these manifests deploy to the `default` namespace. For production deployments,
create a dedicated namespace:

```bash
# Create namespace
kubectl create namespace ml-serving

# Deploy to the namespace
kubectl apply -f serving-runtime.yaml -n ml-serving
kubectl apply -f inference-service.yaml -n ml-serving
```

Alternatively, uncomment and set the `namespace` field in both YAML files.
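For example, in `inference-service.yaml` the commented-out field becomes:

```yaml
metadata:
  name: {{ app_name }}
  namespace: ml-serving
```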
## Deployment

### 1. Build and Push Docker Image

```bash
cd ../docker
docker-compose build

# Tag and push to your registry
docker tag {{ image_name }}:latest your-registry/{{ image_name }}:v1.0.0
docker push your-registry/{{ image_name }}:v1.0.0
```

### 2. Update Image Reference

Edit `serving-runtime.yaml` to use your registry:

```yaml
containers:
  - name: kserve-container
    image: your-registry/{{ image_name }}:v1.0.0  # Update this line
```

### 3. Deploy to KServe

```bash
# Apply the ServingRuntime first (defines the runtime)
kubectl apply -f serving-runtime.yaml

# Verify ServingRuntime is created
kubectl get servingruntime {{ app_name }}-runtime

# Apply the InferenceService (deploys the model)
kubectl apply -f inference-service.yaml

# Watch for the InferenceService to become ready
kubectl get inferenceservice {{ app_name }} -w

# Wait for ready status
kubectl wait --for=condition=Ready inferenceservice/{{ app_name }} --timeout=300s
```

## Testing the Deployment

### Get the Ingress URL

```bash
# Get the InferenceService URL
SERVICE_URL=$(kubectl get inferenceservice {{ app_name }} -o jsonpath='{.status.url}')
echo $SERVICE_URL

# For local clusters without ingress, use port-forward:
kubectl port-forward svc/{{ app_name }}-predictor 8080:80
SERVICE_URL="http://localhost:8080"
```

### Test Health Endpoint

```bash
curl ${SERVICE_URL}/v2/health/ready
```

### Test Inference

```bash
# Example inference request (adjust for your model)
curl -X POST ${SERVICE_URL}/v2/models/{{ models[0].name if models else 'your-model' }}/infer \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "name": "input",
        "shape": [1],
        "datatype": "BYTES",
        "data": ["your input data"]
      }
    ]
  }'
```
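A successful call returns an Open Inference Protocol (V2) response along these lines (the values shown are illustrative):

```json
{
  "model_name": "your-model",
  "id": "generated-request-id",
  "outputs": [
    {
      "name": "output",
      "shape": [1],
      "datatype": "BYTES",
      "data": ["..."]
    }
  ]
}
```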
## Scale-to-Zero Verification

KServe supports scale-to-zero when there's no traffic:

```bash
# Check current pods
kubectl get pods -l serving.kserve.io/inferenceservice={{ app_name }}

# Wait 5+ minutes with no traffic, then check again
kubectl get pods -l serving.kserve.io/inferenceservice={{ app_name }}
# Should show 0 pods

# Send a request - pods should spin up
curl ${SERVICE_URL}/v2/health/ready
kubectl get pods -l serving.kserve.io/inferenceservice={{ app_name }}
# Should show 1+ pods starting
```

## Configuration Options

### Scaling Configuration

Edit `inference-service.yaml`:

```yaml
spec:
  predictor:
    minReplicas: 1   # Set to 1 to disable scale-to-zero
    maxReplicas: 10  # Maximum replicas for high traffic
```

### Resource Limits

Edit `serving-runtime.yaml`:

```yaml
resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "4Gi"
    cpu: "2000m"
```

### GPU Support

Add to `serving-runtime.yaml`:

```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```
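GPU scheduling also assumes the NVIDIA device plugin is running on the cluster. Pinning the runtime to GPU nodes is a common companion change; a sketch (the label name is an assumption - use whatever label your cluster applies to GPU nodes):

```yaml
# serving-runtime.yaml (at the spec level)
nodeSelector:
  nvidia.com/gpu.present: "true"   # assumed label; adjust for your cluster
```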
## Monitoring

```bash
# View logs
kubectl logs -l serving.kserve.io/inferenceservice={{ app_name }} -c kserve-container

# View events
kubectl describe inferenceservice {{ app_name }}
```

## Cleanup

```bash
# Remove InferenceService
kubectl delete inferenceservice {{ app_name }}

# Remove ServingRuntime
kubectl delete servingruntime {{ app_name }}-runtime
```

## Troubleshooting

### "no matches for kind ServingRuntime"

KServe is not installed or is missing the ServingRuntime CRD. Follow the Prerequisites section above.

### InferenceService stuck in "Unknown"

```bash
kubectl get servingruntime {{ app_name }}-runtime
kubectl describe inferenceservice {{ app_name }}
```

### Pod fails to start

```bash
kubectl describe pod -l serving.kserve.io/inferenceservice={{ app_name }}
kubectl logs -l serving.kserve.io/inferenceservice={{ app_name }} -c kserve-container
```

### Image pull errors

For local images on Rancher Desktop with containerd:

```bash
docker save {{ image_name }}:latest | nerdctl --namespace k8s.io load
```
aissemble_inference_deploy/templates/kserve/inference-service.yaml.j2
@@ -0,0 +1,14 @@
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: {{ app_name }}
  # namespace: your-namespace  # Uncomment and set for non-default namespace
spec:
  predictor:
    minReplicas: 0
    maxReplicas: 5
    model:
      modelFormat:
        name: mlserver
      runtime: {{ app_name }}-runtime
      protocolVersion: v2
aissemble_inference_deploy/templates/kserve/serving-runtime.yaml.j2
@@ -0,0 +1,35 @@
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {{ app_name }}-runtime
  # namespace: your-namespace  # Uncomment and set for non-default namespace
spec:
  supportedModelFormats:
    - name: mlserver
      version: "1"
      autoSelect: true
  protocolVersions:
    - v2
  containers:
    - name: kserve-container
      image: {{ image_name }}:latest
      ports:
        - name: h2c
          containerPort: {{ http_port }}
          protocol: TCP
        - name: grpc
          containerPort: {{ grpc_port }}
          protocol: TCP
      env:
        - name: MLSERVER_HTTP_PORT
          value: "{{ http_port }}"
        - name: MLSERVER_GRPC_PORT
          value: "{{ grpc_port }}"
      # Adjust resources based on model size - larger models may need 4Gi+ memory
      resources:
        requests:
          memory: "512Mi"
          cpu: "250m"
        limits:
          memory: "2Gi"
          cpu: "1000m"
aissemble_inference_deploy/templates/kubernetes/README.md.j2
@@ -0,0 +1,164 @@
# Kubernetes Deployment

This directory contains Kubernetes manifests for deploying MLServer using Kustomize.

## Prerequisites

- kubectl configured with cluster access
- Docker image built and pushed to a registry (see `../docker/`)
- Kustomize (built into kubectl 1.14+)

## Structure

```
kubernetes/
  base/                  # Base manifests
    deployment.yaml      # MLServer Deployment
    service.yaml         # ClusterIP Service
    kustomization.yaml   # Kustomize base config
  overlays/
    dev/                 # Development overlay (1 replica, lower resources)
    prod/                # Production overlay (3 replicas, higher resources)
```

## Models Included

{% for model in models %}
- **{{ model.name }}**{% if model.runtime %} ({{ model.runtime }}){% endif %}

{% endfor %}

## Deployment

### Development Environment

```bash
# Preview the manifests
kubectl kustomize overlays/dev

# Apply to cluster
kubectl apply -k overlays/dev

# Check rollout status
kubectl rollout status deployment/dev-{{ app_name }}

# Check pods
kubectl get pods -l app={{ app_name }},environment=dev
```

### Production Environment

```bash
# Preview the manifests
kubectl kustomize overlays/prod

# Apply to cluster
kubectl apply -k overlays/prod

# Check rollout status
kubectl rollout status deployment/prod-{{ app_name }}

# Check pods
kubectl get pods -l app={{ app_name }},environment=prod
```

## Image Configuration

The deployment uses the image `{{ image_name }}:latest`, which should be built with the Docker
generator (`inference deploy init --target docker`) followed by `docker-compose build`.

### Local Development (Rancher Desktop)

The dev overlay uses `imagePullPolicy: Never`, so the image must be available locally.
With Rancher Desktop, load the image into the cluster:

```bash
# Build the image
cd ../docker && docker-compose build

# For Rancher Desktop with dockerd (moby):
# Images are automatically available - just deploy

# For Rancher Desktop with containerd:
# Export and import the image
docker save {{ image_name }}:latest | nerdctl --namespace k8s.io load
```

### Remote Clusters

For remote clusters, push the image to a registry and update the image reference:

```bash
# Tag and push to your registry
docker tag {{ image_name }}:latest your-registry/{{ image_name }}:v1.0.0
docker push your-registry/{{ image_name }}:v1.0.0

# Update image using kustomize
cd overlays/prod
kustomize edit set image {{ image_name }}=your-registry/{{ image_name }}:v1.0.0
```

## Accessing the Service

### Development (NodePort)

The dev overlay uses NodePort for direct access without port-forwarding:

```bash
# Test health endpoint (HTTP on port {{ node_port_http }}, gRPC on port {{ node_port_grpc }})
curl http://localhost:{{ node_port_http }}/v2/health/ready
```

If ports {{ node_port_http }}/{{ node_port_grpc }} conflict with other services, edit `overlays/dev/kustomization.yaml`
to change the `nodePort` values (valid range: 30000-32767), as in the sketch below.
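For instance (the `31080`/`31081` values are arbitrary free ports chosen for illustration):

```yaml
# overlays/dev/kustomization.yaml (excerpt)
- op: add
  path: /spec/ports/0/nodePort
  value: 31080   # HTTP; replaces {{ node_port_http }}
- op: add
  path: /spec/ports/1/nodePort
  value: 31081   # gRPC; replaces {{ node_port_grpc }}
```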
### Production (Port Forward)

```bash
kubectl port-forward svc/prod-{{ app_name }} {{ http_port }}:{{ http_port }}

# Test health endpoint
curl http://localhost:{{ http_port }}/v2/health/ready
```

### LoadBalancer (Production)

To expose the service externally, modify the service type in your overlay:

```yaml
# overlays/prod/service-patch.yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ app_name }}
spec:
  type: LoadBalancer
```
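For the patch to take effect it also has to be referenced from the overlay's kustomization; a minimal sketch, assuming it is added alongside the existing `patches` entries:

```yaml
# overlays/prod/kustomization.yaml (excerpt)
patches:
  - path: service-patch.yaml
    target:
      kind: Service
      name: {{ app_name }}
```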
## Health Checks

The deployment includes:
- **Startup probe**: `/v2/health/ready` - allows up to 5 minutes for model loading (30 retries × 10s)
- **Readiness probe**: `/v2/health/ready` - pod is ready to receive traffic
- **Liveness probe**: `/v2/health/live` - pod is healthy

For larger models that take longer to load, increase `failureThreshold` in the startup probe.
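As a sketch, doubling the window to roughly 10 minutes would look like this in the deployment manifest (the `60` is illustrative):

```yaml
startupProbe:
  httpGet:
    path: /v2/health/ready
    port: http
  failureThreshold: 60   # 60 × 10s = 10 minutes for model loading
  periodSeconds: 10
```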
## Resource Limits

| Environment | CPU Request | CPU Limit | Memory Request | Memory Limit | Replicas |
|-------------|-------------|-----------|----------------|--------------|----------|
| dev         | 100m        | 500m      | 256Mi          | 1Gi          | 1        |
| prod        | 500m        | 2000m     | 1Gi            | 4Gi          | 3        |

Adjust these values in the overlay kustomization files based on your model requirements.

## Cleanup

```bash
# Remove dev deployment
kubectl delete -k overlays/dev

# Remove prod deployment
kubectl delete -k overlays/prod
```
aissemble_inference_deploy/templates/kubernetes/deployment.yaml.j2
@@ -0,0 +1,50 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ app_name }}
  labels:
    app: {{ app_name }}
spec:
  replicas: 1
  selector:
    matchLabels:
      app: {{ app_name }}
  template:
    metadata:
      labels:
        app: {{ app_name }}
    spec:
      containers:
        - name: {{ app_name }}
          image: {{ image_name }}:latest
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: {{ http_port }}
              protocol: TCP
            - name: grpc
              containerPort: {{ grpc_port }}
              protocol: TCP
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          startupProbe:
            httpGet:
              path: /v2/health/ready
              port: http
            failureThreshold: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: http
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: http
            periodSeconds: 10
aissemble_inference_deploy/templates/kubernetes/overlays/dev/kustomization.yaml.j2
@@ -0,0 +1,52 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

namePrefix: dev-

labels:
  - pairs:
      environment: dev

patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 1
      - op: replace
        path: /spec/template/spec/containers/0/imagePullPolicy
        value: Never
    target:
      kind: Deployment
      name: {{ app_name }}
  - patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/resources/requests/memory
        value: "256Mi"
      - op: replace
        path: /spec/template/spec/containers/0/resources/requests/cpu
        value: "100m"
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: "1Gi"
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/cpu
        value: "500m"
    target:
      kind: Deployment
      name: {{ app_name }}
  - patch: |-
      - op: replace
        path: /spec/type
        value: NodePort
      - op: add
        path: /spec/ports/0/nodePort
        value: {{ node_port_http }}
      - op: add
        path: /spec/ports/1/nodePort
        value: {{ node_port_grpc }}
    target:
      kind: Service
      name: {{ app_name }}
aissemble_inference_deploy/templates/kubernetes/overlays/prod/kustomization.yaml.j2
@@ -0,0 +1,36 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

namePrefix: prod-

labels:
  - pairs:
      environment: prod

patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 3
    target:
      kind: Deployment
      name: {{ app_name }}
  - patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/resources/requests/memory
        value: "1Gi"
      - op: replace
        path: /spec/template/spec/containers/0/resources/requests/cpu
        value: "500m"
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: "4Gi"
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/cpu
        value: "2000m"
    target:
      kind: Deployment
      name: {{ app_name }}
aissemble_inference_deploy/templates/kubernetes/service.yaml.j2
@@ -0,0 +1,19 @@
apiVersion: v1
kind: Service
metadata:
  name: {{ app_name }}
  labels:
    app: {{ app_name }}
spec:
  type: ClusterIP
  ports:
    - name: http
      port: {{ http_port }}
      targetPort: http
      protocol: TCP
    - name: grpc
      port: {{ grpc_port }}
      targetPort: grpc
      protocol: TCP
  selector:
    app: {{ app_name }}