rcrewai 0.1.0 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,1836 @@
1
+ ---
2
+ layout: tutorial
3
+ title: Production Deployment
4
+ description: Complete guide to deploying RCrewAI applications in production with Docker, Kubernetes, monitoring, and enterprise features
5
+ ---
6
+
7
+ # Production Deployment
8
+
9
+ This comprehensive tutorial covers deploying RCrewAI applications to production environments with enterprise-grade reliability, scaling, and security. You'll learn containerization, orchestration, monitoring, and operational best practices.
10
+
11
+ ## Table of Contents
12
+ 1. [Production Readiness Checklist](#production-readiness-checklist)
13
+ 2. [Containerization with Docker](#containerization-with-docker)
14
+ 3. [Kubernetes Deployment](#kubernetes-deployment)
15
+ 4. [Configuration Management](#configuration-management)
16
+ 5. [Monitoring and Observability](#monitoring-and-observability)
17
+ 6. [Scaling and Load Balancing](#scaling-and-load-balancing)
18
+ 7. [Security and Access Control](#security-and-access-control)
19
+ 8. [CI/CD Pipeline](#cicd-pipeline)
20
+ 9. [Operational Procedures](#operational-procedures)
21
+ 10. [Troubleshooting and Recovery](#troubleshooting-and-recovery)
22
+
23
+ ## Production Readiness Checklist
24
+
25
+ Before deploying to production, ensure your RCrewAI application meets these requirements:
26
+
27
+ ### ✅ Code Quality
28
+ - [ ] Comprehensive test coverage (>90%)
29
+ - [ ] Code review process in place
30
+ - [ ] Static analysis and linting
31
+ - [ ] Performance benchmarks established
32
+ - [ ] Security vulnerability scanning
33
+
34
+ ### ✅ Configuration
35
+ - [ ] Environment-based configuration
36
+ - [ ] Secrets management implemented
37
+ - [ ] Resource limits defined
38
+ - [ ] Timeout and retry logic configured
39
+ - [ ] Logging levels appropriate for production
40
+
41
+ ### ✅ Monitoring
42
+ - [ ] Health check endpoints implemented
43
+ - [ ] Metrics collection configured
44
+ - [ ] Alerting rules defined
45
+ - [ ] Log aggregation setup
46
+ - [ ] Performance monitoring enabled
47
+
48
+ ### ✅ Infrastructure
49
+ - [ ] Load balancing configured
50
+ - [ ] Auto-scaling policies defined
51
+ - [ ] Backup and disaster recovery plan
52
+ - [ ] Network security implemented
53
+ - [ ] Resource quotas established
54
+
55
+ ## Containerization with Docker
56
+
57
+ ### Basic Dockerfile
58
+
59
+ ```dockerfile
60
+ # Use official Ruby runtime as base image
61
+ FROM ruby:3.1-slim
62
+
63
+ # Install system dependencies
64
+ RUN apt-get update && apt-get install -y \
65
+ build-essential \
66
+ curl \
67
+ git \
68
+ && rm -rf /var/lib/apt/lists/*
69
+
70
+ # Set working directory
71
+ WORKDIR /app
72
+
73
+ # Copy Gemfile and Gemfile.lock
74
+ COPY Gemfile Gemfile.lock ./
75
+
76
+ # Install Ruby dependencies
77
+ RUN bundle config set --local deployment 'true' && \
78
+ bundle config set --local without 'development test' && \
79
+ bundle install
80
+
81
+ # Copy application code
82
+ COPY . .
83
+
84
+ # Create non-root user for security
85
+ RUN groupadd -r rcrewai && useradd -r -g rcrewai rcrewai
86
+ RUN chown -R rcrewai:rcrewai /app
87
+ USER rcrewai
88
+
89
+ # Expose port
90
+ EXPOSE 8080
91
+
92
+ # Health check
93
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
94
+ CMD curl -f http://localhost:8080/health || exit 1
95
+
96
+ # Default command
97
+ CMD ["ruby", "production_app.rb"]
98
+ ```
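+
+ Because the Dockerfile copies the whole build context with `COPY . .`, it helps to pair it with a `.dockerignore` so local artifacts and secrets never reach the image. A minimal sketch, to be adjusted to your repository layout:
+
+ ```
+ # .dockerignore: excluded from the Docker build context
+ .git
+ .env
+ log/
+ tmp/
+ spec/
+ coverage/
+ vendor/bundle
+ ```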
99
+
100
+ ### Multi-stage Production Dockerfile
101
+
102
+ ```dockerfile
103
+ # Build stage
104
+ FROM ruby:3.1-slim AS builder
105
+
106
+ RUN apt-get update && apt-get install -y \
107
+ build-essential \
108
+ git \
109
+ && rm -rf /var/lib/apt/lists/*
110
+
111
+ WORKDIR /app
112
+
113
+ COPY Gemfile Gemfile.lock ./
114
+ RUN bundle config set --local deployment 'true' && \
115
+ bundle config set --local without 'development test' && \
116
+ bundle install
117
+
118
+ # Production stage
119
+ FROM ruby:3.1-slim AS production
120
+
121
+ # Install only runtime dependencies
122
+ RUN apt-get update && apt-get install -y \
123
+ curl \
124
+ && rm -rf /var/lib/apt/lists/* \
125
+ && apt-get autoremove -y
126
+
127
+ WORKDIR /app
128
+
129
+ # Copy gems from builder stage
130
+ COPY --from=builder /usr/local/bundle /usr/local/bundle
131
+
132
+ # Copy application code
133
+ COPY . .
134
+
135
+ # Create non-root user
136
+ RUN groupadd -r rcrewai && useradd -r -g rcrewai -d /app rcrewai
137
+ RUN chown -R rcrewai:rcrewai /app
138
+
139
+ # Switch to non-root user
140
+ USER rcrewai
141
+
142
+ # Environment variables
143
+ ENV RAILS_ENV=production
144
+ ENV RACK_ENV=production
145
+ ENV BUNDLE_DEPLOYMENT=true
146
+ ENV BUNDLE_WITHOUT="development:test"
147
+
148
+ # Health check
149
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
150
+ CMD ruby health_check.rb || exit 1
151
+
152
+ EXPOSE 8080
153
+
154
+ CMD ["ruby", "production_app.rb"]
155
+ ```
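+
+ The production stage can be built and exercised locally with standard Docker commands; the image name `rcrewai` and the tag are placeholders:
+
+ ```bash
+ # Build only the production stage and tag it
+ docker build --target production -t rcrewai:latest .
+
+ # Run it locally, injecting the required API key from the shell environment
+ docker run --rm -p 8080:8080 -e OPENAI_API_KEY="$OPENAI_API_KEY" rcrewai:latest
+ ```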
156
+
157
+ ### Production Application Structure
158
+
159
+ ```ruby
160
+ # production_app.rb
161
+ require 'rcrewai'
162
+ require 'sinatra'
163
+ require 'json'
+ require 'logger'
+ require 'securerandom'   # execution IDs
+ require 'time'           # Time#iso8601
165
+ require 'prometheus/middleware/collector'
166
+ require 'prometheus/middleware/exporter'
167
+
168
+ class ProductionRCrewAI < Sinatra::Base
169
+ configure :production do
170
+ enable :logging
171
+ set :logger, Logger.new($stdout)
172
+
173
+ # Metrics collection
174
+ use Prometheus::Middleware::Collector
175
+ use Prometheus::Middleware::Exporter
176
+
177
+ # Configure RCrewAI for production
178
+ RCrewAI.configure do |config|
179
+ config.llm_provider = ENV.fetch('LLM_PROVIDER', 'openai').to_sym
180
+ config.openai_api_key = ENV.fetch('OPENAI_API_KEY')
181
+ config.temperature = ENV.fetch('LLM_TEMPERATURE', '0.1').to_f
182
+ config.max_tokens = ENV.fetch('LLM_MAX_TOKENS', '4000').to_i
183
+ config.timeout = ENV.fetch('LLM_TIMEOUT', '60').to_i
184
+ end
185
+
186
+ # Initialize crew registry
187
+ @@crew_registry = CrewRegistry.new
188
+ @@crew_registry.register_default_crews
189
+ end
190
+
191
+ # Health check endpoint
192
+ get '/health' do
193
+ content_type :json
194
+
195
+ begin
196
+ health_status = perform_health_check
197
+ status health_status[:status] == 'healthy' ? 200 : 503
198
+ health_status.to_json
199
+ rescue => e
200
+ status 503
201
+ { status: 'unhealthy', error: e.message }.to_json
202
+ end
203
+ end
204
+
205
+ # Readiness check endpoint
206
+ get '/ready' do
207
+ content_type :json
208
+
209
+ begin
210
+ readiness_status = perform_readiness_check
211
+ status readiness_status[:ready] ? 200 : 503
212
+ readiness_status.to_json
213
+ rescue => e
214
+ status 503
215
+ { ready: false, error: e.message }.to_json
216
+ end
217
+ end
218
+
219
+ # Metrics endpoint
220
+ get '/metrics' do
221
+ # Prometheus metrics are handled by middleware
222
+ end
223
+
224
+ # Main execution endpoint
225
+ post '/execute' do
226
+ content_type :json
227
+
228
+ begin
229
+ request_data = JSON.parse(request.body.read)
230
+
231
+ # Validate request
232
+ validate_execution_request(request_data)
233
+
234
+ # Get crew
235
+ crew_name = request_data['crew_name']
236
+ crew = @@crew_registry.get_crew(crew_name)
237
+
238
+ # Execute with monitoring
239
+ result = execute_with_monitoring(crew, request_data)
240
+
241
+ status 200
242
+ result.to_json
243
+
244
+ rescue JSON::ParserError
245
+ status 400
246
+ { error: 'Invalid JSON in request body' }.to_json
247
+ rescue ValidationError => e
248
+ status 400
249
+ { error: e.message }.to_json
250
+ rescue => e
251
+ logger.error "Execution failed: #{e.message}"
252
+ logger.error e.backtrace.join("\n")
253
+
254
+ status 500
255
+ { error: 'Internal server error' }.to_json
256
+ end
257
+ end
258
+
259
+ private
260
+
261
+ def perform_health_check
262
+ checks = {
263
+ timestamp: Time.now.iso8601,
264
+ status: 'healthy',
265
+ checks: {}
266
+ }
267
+
268
+ # Check LLM provider connectivity
269
+ begin
270
+ # Quick LLM test
271
+ RCrewAI.client.chat(
272
+ messages: [{ role: 'user', content: 'test' }],
273
+ max_tokens: 1,
274
+ temperature: 0
275
+ )
276
+ checks[:checks][:llm] = { status: 'healthy' }
277
+ rescue => e
278
+ checks[:checks][:llm] = { status: 'unhealthy', error: e.message }
279
+ checks[:status] = 'unhealthy'
280
+ end
281
+
282
+ # Check memory usage
283
+ memory_usage = get_memory_usage
284
+ if memory_usage > 0.9
285
+ checks[:checks][:memory] = { status: 'warning', usage: memory_usage }
286
+ checks[:status] = 'degraded'
287
+ else
288
+ checks[:checks][:memory] = { status: 'healthy', usage: memory_usage }
289
+ end
290
+
291
+ checks
292
+ end
293
+
294
+ def perform_readiness_check
295
+ {
296
+ ready: true,
297
+ timestamp: Time.now.iso8601,
298
+ crews: @@crew_registry.crew_count,
299
+ uptime: Process.clock_gettime(Process::CLOCK_MONOTONIC).to_i
300
+ }
301
+ end
302
+
303
+ def validate_execution_request(data)
304
+ required_fields = ['crew_name']
305
+ missing_fields = required_fields - data.keys
306
+
307
+ if missing_fields.any?
308
+ raise ValidationError, "Missing required fields: #{missing_fields.join(', ')}"
309
+ end
310
+
311
+ unless @@crew_registry.crew_exists?(data['crew_name'])
312
+ raise ValidationError, "Unknown crew: #{data['crew_name']}"
313
+ end
314
+ end
315
+
316
+ def execute_with_monitoring(crew, request_data)
317
+ start_time = Time.now
318
+ execution_id = SecureRandom.uuid
319
+
320
+ logger.info({
+ message: 'Starting execution',
+ execution_id: execution_id,
+ crew_name: crew.name,
+ request_id: request_data['request_id']
+ }.to_json)
325
+
326
+ begin
327
+ # Execute crew
328
+ result = crew.execute(
329
+ timeout: ENV.fetch('EXECUTION_TIMEOUT', '300').to_i,
330
+ max_retries: ENV.fetch('MAX_RETRIES', '3').to_i
331
+ )
332
+
333
+ duration = Time.now - start_time
334
+
335
+ logger.info({
+ message: 'Execution completed',
+ execution_id: execution_id,
+ duration: duration,
+ success_rate: result[:success_rate]
+ }.to_json)
340
+
341
+ {
342
+ execution_id: execution_id,
343
+ success: true,
344
+ duration: duration,
345
+ result: result
346
+ }
347
+
348
+ rescue => e
349
+ duration = Time.now - start_time
350
+
351
+ logger.error({
+ message: 'Execution failed',
+ execution_id: execution_id,
+ duration: duration,
+ error: e.message
+ }.to_json)
356
+
357
+ raise
358
+ end
359
+ end
360
+
361
+ def get_memory_usage
362
+ # Simple memory usage check
363
+ memory_info = `cat /proc/meminfo`.split("\n")
364
+ total = memory_info.find { |line| line.start_with?('MemTotal:') }.split[1].to_i
365
+ available = memory_info.find { |line| line.start_with?('MemAvailable:') }.split[1].to_i
366
+
367
+ (total - available).to_f / total
368
+ rescue
369
+ 0.0
370
+ end
371
+ end
372
+
373
+ class ValidationError < StandardError; end
374
+
375
+ class CrewRegistry
376
+ def initialize
377
+ @crews = {}
378
+ end
379
+
380
+ def register_crew(name, crew)
381
+ @crews[name] = crew
382
+ end
383
+
384
+ def get_crew(name)
385
+ crew = @crews[name]
386
+ raise ValidationError, "Crew not found: #{name}" unless crew
387
+ crew
388
+ end
389
+
390
+ def crew_exists?(name)
391
+ @crews.key?(name)
392
+ end
393
+
394
+ def crew_count
395
+ @crews.length
396
+ end
397
+
398
+ def register_default_crews
399
+ # Register your production crews here
400
+ support_crew = create_support_crew
401
+ register_crew('customer_support', support_crew)
402
+
403
+ analysis_crew = create_analysis_crew
404
+ register_crew('data_analysis', analysis_crew)
405
+ end
406
+
407
+ private
408
+
409
+ def create_support_crew
410
+ crew = RCrewAI::Crew.new("customer_support")
411
+
412
+ support_agent = RCrewAI::Agent.new(
413
+ name: "support_specialist",
414
+ role: "Customer Support Specialist",
415
+ goal: "Provide excellent customer support and resolve issues efficiently",
416
+ tools: [
417
+ RCrewAI::Tools::WebSearch.new(max_results: 5),
418
+ RCrewAI::Tools::FileReader.new
419
+ ]
420
+ )
421
+
422
+ crew.add_agent(support_agent)
423
+
424
+ support_task = RCrewAI::Task.new(
425
+ name: "handle_support_request",
426
+ description: "Handle customer support request with empathy and expertise",
427
+ expected_output: "Professional support response with clear next steps"
428
+ )
429
+
430
+ crew.add_task(support_task)
431
+ crew
432
+ end
433
+
434
+ def create_analysis_crew
435
+ crew = RCrewAI::Crew.new("data_analysis")
436
+
437
+ analyst = RCrewAI::Agent.new(
438
+ name: "data_analyst",
439
+ role: "Senior Data Analyst",
440
+ goal: "Analyze data and provide actionable insights",
441
+ tools: [
442
+ RCrewAI::Tools::FileReader.new,
443
+ RCrewAI::Tools::WebSearch.new
444
+ ]
445
+ )
446
+
447
+ crew.add_agent(analyst)
448
+
449
+ analysis_task = RCrewAI::Task.new(
450
+ name: "data_analysis",
451
+ description: "Perform comprehensive data analysis and generate insights",
452
+ expected_output: "Detailed analysis report with charts and recommendations"
453
+ )
454
+
455
+ crew.add_task(analysis_task)
456
+ crew
457
+ end
458
+ end
459
+
+ # Start the application
+ if __FILE__ == $0
+ ProductionRCrewAI.run!(
+ host: '0.0.0.0',
+ port: ENV.fetch('PORT', 8080).to_i
+ )
+ end
+ ```
+
+ The Docker `HEALTHCHECK` instruction calls a small standalone script. Keep it in its own file (`health_check.rb`) rather than inside `production_app.rb`, otherwise it would execute and exit the process every time the application file is loaded:
+
+ ```ruby
+ # health_check.rb: exits 0 when /health responds with HTTP 200
+ require 'net/http'
+
+ begin
+ uri = URI('http://localhost:8080/health')
+ response = Net::HTTP.get_response(uri)
+ exit(response.code == '200' ? 0 : 1)
+ rescue StandardError
+ exit 1
+ end
480
+ ```
481
+
482
+ ### Docker Compose for Development
483
+
484
+ ```yaml
485
+ # docker-compose.yml
486
+ version: '3.8'
487
+
488
+ services:
489
+ rcrewai:
490
+ build:
491
+ context: .
492
+ dockerfile: Dockerfile
493
+ target: production
494
+ ports:
495
+ - "8080:8080"
496
+ environment:
497
+ - RAILS_ENV=production
498
+ - OPENAI_API_KEY=${OPENAI_API_KEY}
499
+ - LLM_PROVIDER=openai
500
+ - LLM_TEMPERATURE=0.1
501
+ - EXECUTION_TIMEOUT=300
502
+ - MAX_RETRIES=3
503
+ depends_on:
504
+ - redis
505
+ - prometheus
506
+ restart: unless-stopped
507
+ healthcheck:
508
+ test: ["CMD", "ruby", "health_check.rb"]
509
+ interval: 30s
510
+ timeout: 10s
511
+ retries: 3
512
+
513
+ redis:
514
+ image: redis:7-alpine
515
+ command: redis-server --appendonly yes
516
+ volumes:
517
+ - redis_data:/data
518
+ restart: unless-stopped
519
+
520
+ prometheus:
521
+ image: prom/prometheus:latest
522
+ ports:
523
+ - "9090:9090"
524
+ volumes:
525
+ - ./prometheus.yml:/etc/prometheus/prometheus.yml
526
+ - prometheus_data:/prometheus
527
+ command:
528
+ - '--config.file=/etc/prometheus/prometheus.yml'
529
+ - '--storage.tsdb.path=/prometheus'
530
+ - '--web.console.libraries=/etc/prometheus/console_libraries'
531
+ - '--web.console.templates=/etc/prometheus/consoles'
532
+ restart: unless-stopped
533
+
534
+ grafana:
535
+ image: grafana/grafana:latest
536
+ ports:
537
+ - "3000:3000"
538
+ environment:
539
+ - GF_SECURITY_ADMIN_PASSWORD=admin
540
+ volumes:
541
+ - grafana_data:/var/lib/grafana
542
+ - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
543
+ - ./grafana/datasources:/etc/grafana/provisioning/datasources
544
+ restart: unless-stopped
545
+
546
+ volumes:
547
+ redis_data:
548
+ prometheus_data:
549
+ grafana_data:
550
+ ```
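+
+ The Compose file mounts a `prometheus.yml` that is not shown above. A minimal scrape configuration might look like the following; the `rcrewai:8080` target matches the Compose service name and the `/metrics` path exposed by the Prometheus middleware:
+
+ ```yaml
+ # prometheus.yml: minimal scrape configuration for the Compose stack above
+ global:
+   scrape_interval: 15s
+
+ scrape_configs:
+   - job_name: 'rcrewai'
+     metrics_path: /metrics
+     static_configs:
+       - targets: ['rcrewai:8080']
+   - job_name: 'prometheus'
+     static_configs:
+       - targets: ['localhost:9090']
+ ```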
551
+
552
+ ## Kubernetes Deployment
553
+
554
+ ### Deployment Configuration
555
+
556
+ ```yaml
557
+ # k8s/deployment.yaml
558
+ apiVersion: apps/v1
559
+ kind: Deployment
560
+ metadata:
561
+ name: rcrewai-app
562
+ labels:
563
+ app: rcrewai
564
+ version: v1.0.0
565
+ spec:
566
+ replicas: 3
567
+ strategy:
568
+ type: RollingUpdate
569
+ rollingUpdate:
570
+ maxSurge: 1
571
+ maxUnavailable: 1
572
+ selector:
573
+ matchLabels:
574
+ app: rcrewai
575
+ template:
576
+ metadata:
577
+ labels:
578
+ app: rcrewai
579
+ version: v1.0.0
580
+ spec:
581
+ serviceAccountName: rcrewai-service-account
582
+ containers:
583
+ - name: rcrewai
584
+ image: your-registry/rcrewai:v1.0.0
585
+ ports:
586
+ - containerPort: 8080
587
+ env:
588
+ - name: RAILS_ENV
589
+ value: "production"
590
+ - name: OPENAI_API_KEY
591
+ valueFrom:
592
+ secretKeyRef:
593
+ name: rcrewai-secrets
594
+ key: openai-api-key
595
+ - name: REDIS_URL
596
+ value: "redis://redis-service:6379"
597
+ - name: LLM_PROVIDER
598
+ value: "openai"
599
+ - name: LLM_TEMPERATURE
600
+ value: "0.1"
601
+ - name: EXECUTION_TIMEOUT
602
+ value: "300"
603
+ - name: MAX_RETRIES
604
+ value: "3"
605
+ resources:
606
+ requests:
607
+ memory: "256Mi"
608
+ cpu: "250m"
609
+ limits:
610
+ memory: "512Mi"
611
+ cpu: "500m"
612
+ livenessProbe:
613
+ httpGet:
614
+ path: /health
615
+ port: 8080
616
+ initialDelaySeconds: 60
617
+ periodSeconds: 30
618
+ timeoutSeconds: 5
619
+ failureThreshold: 3
620
+ readinessProbe:
621
+ httpGet:
622
+ path: /ready
623
+ port: 8080
624
+ initialDelaySeconds: 10
625
+ periodSeconds: 5
626
+ timeoutSeconds: 3
627
+ failureThreshold: 3
628
+ securityContext:
629
+ runAsNonRoot: true
630
+ runAsUser: 1000
631
+ readOnlyRootFilesystem: true
632
+ allowPrivilegeEscalation: false
633
+ volumeMounts:
634
+ - name: tmp-volume
635
+ mountPath: /tmp
636
+ volumes:
637
+ - name: tmp-volume
638
+ emptyDir: {}
639
+ imagePullSecrets:
640
+ - name: registry-secret
641
+ ---
642
+ apiVersion: v1
643
+ kind: Service
644
+ metadata:
645
+ name: rcrewai-service
646
+ labels:
647
+ app: rcrewai
648
+ spec:
649
+ type: ClusterIP
650
+ ports:
651
+ - port: 80
652
+ targetPort: 8080
653
+ protocol: TCP
654
+ name: http
655
+ selector:
656
+ app: rcrewai
657
+ ---
658
+ apiVersion: v1
659
+ kind: ServiceAccount
660
+ metadata:
661
+ name: rcrewai-service-account
662
+ ---
663
+ apiVersion: networking.k8s.io/v1
664
+ kind: Ingress
665
+ metadata:
666
+ name: rcrewai-ingress
667
+ annotations:
668
+ kubernetes.io/ingress.class: nginx
669
+ cert-manager.io/cluster-issuer: letsencrypt-prod
670
+ nginx.ingress.kubernetes.io/rate-limit: "100"
671
+ nginx.ingress.kubernetes.io/ssl-redirect: "true"
672
+ spec:
673
+ tls:
674
+ - hosts:
675
+ - api.yourcompany.com
676
+ secretName: rcrewai-tls
677
+ rules:
678
+ - host: api.yourcompany.com
679
+ http:
680
+ paths:
681
+ - path: /
682
+ pathType: Prefix
683
+ backend:
684
+ service:
685
+ name: rcrewai-service
686
+ port:
687
+ number: 80
688
+ ```
689
+
690
+ ### ConfigMap and Secrets
691
+
692
+ ```yaml
693
+ # k8s/configmap.yaml
694
+ apiVersion: v1
695
+ kind: ConfigMap
696
+ metadata:
697
+ name: rcrewai-config
698
+ data:
699
+ LLM_PROVIDER: "openai"
700
+ LLM_TEMPERATURE: "0.1"
701
+ LLM_MAX_TOKENS: "4000"
702
+ EXECUTION_TIMEOUT: "300"
703
+ MAX_RETRIES: "3"
704
+ LOG_LEVEL: "INFO"
705
+ METRICS_ENABLED: "true"
706
+ ---
707
+ apiVersion: v1
708
+ kind: Secret
709
+ metadata:
710
+ name: rcrewai-secrets
711
+ type: Opaque
712
+ data:
713
+ openai-api-key: <base64-encoded-api-key>
714
+ anthropic-api-key: <base64-encoded-api-key>
715
+ database-url: <base64-encoded-database-url>
716
+ ```
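+
+ The base64 values above are placeholders. In practice it is safer to create the Secret from the shell (or a secrets operator) than to commit encoded keys, and to load the ConfigMap into the container with `envFrom` instead of repeating each variable inline, for example:
+
+ ```bash
+ # Create the Secret without writing key material into a manifest
+ kubectl create secret generic rcrewai-secrets \
+   --from-literal=openai-api-key="$OPENAI_API_KEY" \
+   --from-literal=anthropic-api-key="$ANTHROPIC_API_KEY" \
+   --from-literal=database-url="$DATABASE_URL"
+ ```
+
+ ```yaml
+ # Deployment container fragment: pull every ConfigMap key in as an environment variable
+ envFrom:
+   - configMapRef:
+       name: rcrewai-config
+ ```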
717
+
718
+ ### Horizontal Pod Autoscaler
719
+
720
+ ```yaml
721
+ # k8s/hpa.yaml
722
+ apiVersion: autoscaling/v2
723
+ kind: HorizontalPodAutoscaler
724
+ metadata:
725
+ name: rcrewai-hpa
726
+ spec:
727
+ scaleTargetRef:
728
+ apiVersion: apps/v1
729
+ kind: Deployment
730
+ name: rcrewai-app
731
+ minReplicas: 3
732
+ maxReplicas: 20
733
+ metrics:
734
+ - type: Resource
735
+ resource:
736
+ name: cpu
737
+ target:
738
+ type: Utilization
739
+ averageUtilization: 70
740
+ - type: Resource
741
+ resource:
742
+ name: memory
743
+ target:
744
+ type: Utilization
745
+ averageUtilization: 80
746
+ behavior:
747
+ scaleDown:
748
+ stabilizationWindowSeconds: 300
749
+ policies:
750
+ - type: Percent
751
+ value: 10
752
+ periodSeconds: 60
753
+ scaleUp:
754
+ stabilizationWindowSeconds: 60
755
+ policies:
756
+ - type: Percent
757
+ value: 50
758
+ periodSeconds: 60
759
+ ```
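+
+ A PodDisruptionBudget complements the autoscaler by keeping a minimum number of replicas available during voluntary disruptions such as node drains. A sketch using the same `app: rcrewai` label:
+
+ ```yaml
+ # k8s/pdb.yaml: keep at least two replicas up during voluntary disruptions
+ apiVersion: policy/v1
+ kind: PodDisruptionBudget
+ metadata:
+   name: rcrewai-pdb
+ spec:
+   minAvailable: 2
+   selector:
+     matchLabels:
+       app: rcrewai
+ ```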
760
+
761
+ ## Configuration Management
762
+
763
+ ### Environment-based Configuration
764
+
765
+ ```ruby
766
+ # config/production.rb
767
+ class ProductionConfig
768
+ def self.configure
769
+ RCrewAI.configure do |config|
770
+ # LLM Provider Configuration
771
+ config.llm_provider = ENV.fetch('LLM_PROVIDER', 'openai').to_sym
772
+
773
+ case config.llm_provider
774
+ when :openai
775
+ config.openai_api_key = ENV.fetch('OPENAI_API_KEY')
776
+ config.base_url = ENV['OPENAI_BASE_URL'] # Optional custom endpoint
777
+ when :anthropic
778
+ config.anthropic_api_key = ENV.fetch('ANTHROPIC_API_KEY')
779
+ when :azure
780
+ config.azure_api_key = ENV.fetch('AZURE_OPENAI_API_KEY')
781
+ config.base_url = ENV.fetch('AZURE_OPENAI_ENDPOINT')
782
+ config.api_version = ENV.fetch('AZURE_API_VERSION', '2023-05-15')
783
+ when :google
784
+ config.google_api_key = ENV.fetch('GOOGLE_API_KEY')
785
+ end
786
+
787
+ # Model Parameters
788
+ config.temperature = ENV.fetch('LLM_TEMPERATURE', '0.1').to_f
789
+ config.max_tokens = ENV.fetch('LLM_MAX_TOKENS', '4000').to_i
790
+ config.timeout = ENV.fetch('LLM_TIMEOUT', '60').to_i
791
+
792
+ # Production Settings
793
+ config.retry_limit = ENV.fetch('LLM_RETRY_LIMIT', '3').to_i
794
+ config.retry_delay = ENV.fetch('LLM_RETRY_DELAY', '2').to_i
795
+ config.max_concurrent_requests = ENV.fetch('MAX_CONCURRENT_REQUESTS', '10').to_i
796
+
797
+ # Logging
798
+ config.log_level = ENV.fetch('LOG_LEVEL', 'INFO').upcase
799
+ config.structured_logging = ENV.fetch('STRUCTURED_LOGGING', 'true') == 'true'
800
+
801
+ # Security
802
+ config.validate_ssl = ENV.fetch('VALIDATE_SSL', 'true') == 'true'
803
+ config.user_agent = "RCrewAI/#{RCrewAI::VERSION} (Production)"
804
+ end
805
+ end
806
+
807
+ def self.database_config
808
+ {
809
+ url: ENV.fetch('DATABASE_URL'),
810
+ pool_size: ENV.fetch('DB_POOL_SIZE', '5').to_i,
811
+ checkout_timeout: ENV.fetch('DB_CHECKOUT_TIMEOUT', '5').to_i,
812
+ reaping_frequency: ENV.fetch('DB_REAPING_FREQUENCY', '10').to_i
813
+ }
814
+ end
815
+
816
+ def self.redis_config
817
+ {
818
+ url: ENV.fetch('REDIS_URL', 'redis://localhost:6379'),
819
+ timeout: ENV.fetch('REDIS_TIMEOUT', '5').to_i,
820
+ reconnect_attempts: ENV.fetch('REDIS_RECONNECT_ATTEMPTS', '3').to_i
821
+ }
822
+ end
823
+
824
+ def self.monitoring_config
825
+ {
826
+ metrics_enabled: ENV.fetch('METRICS_ENABLED', 'true') == 'true',
827
+ traces_enabled: ENV.fetch('TRACES_ENABLED', 'true') == 'true',
828
+ health_check_interval: ENV.fetch('HEALTH_CHECK_INTERVAL', '30').to_i,
829
+ performance_monitoring: ENV.fetch('PERFORMANCE_MONITORING', 'true') == 'true'
830
+ }
831
+ end
832
+ end
833
+ ```
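+
+ Wiring this into the application is a single call at boot. A sketch, assuming `config/production.rb` sits next to `production_app.rb`; the constant names are illustrative:
+
+ ```ruby
+ # Boot-time wiring for the configuration above
+ require_relative 'config/production'
+
+ ProductionConfig.configure                          # LLM provider, retries, logging
+ REDIS_SETTINGS      = ProductionConfig.redis_config
+ MONITORING_SETTINGS = ProductionConfig.monitoring_config
+ ```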
834
+
835
+ ### Secrets Management with Vault
836
+
837
+ ```ruby
838
+ # config/vault_client.rb
839
+ require 'vault'
840
+
841
+ class VaultClient
842
+ def initialize
843
+ Vault.configure do |config|
844
+ config.address = ENV.fetch('VAULT_ADDR')
845
+ config.token = ENV['VAULT_TOKEN']
846
+ config.ssl_verify = ENV.fetch('VAULT_SSL_VERIFY', 'true') == 'true'
847
+ end
848
+ end
849
+
850
+ def get_secret(path)
+ secret = Vault.logical.read(path)
+ return nil unless secret
+ # KV v2 responses nest the payload under :data
+ secret.data[:data] || secret.data
+ rescue Vault::VaultError => e
+ $logger.error("Vault error: #{e.message}")
+ raise
856
+ end
857
+
858
+ def get_database_credentials
859
+ get_secret('secret/data/database')
860
+ end
861
+
862
+ def get_llm_api_keys
863
+ get_secret('secret/data/llm_providers')
864
+ end
865
+
866
+ def refresh_secrets
867
+ # Implement secret rotation logic
868
+ new_secrets = get_llm_api_keys
869
+
870
+ if new_secrets
871
+ ENV['OPENAI_API_KEY'] = new_secrets[:openai_api_key]
872
+ ENV['ANTHROPIC_API_KEY'] = new_secrets[:anthropic_api_key]
873
+
874
+ # Reconfigure RCrewAI with new secrets
875
+ ProductionConfig.configure
876
+ end
877
+ end
878
+ end
879
+
880
+ # Periodic secret refresh
881
+ Thread.new do
882
+ vault_client = VaultClient.new
883
+
884
+ loop do
885
+ sleep(3600) # Refresh every hour
886
+
887
+ begin
888
+ vault_client.refresh_secrets
889
+ rescue => e
890
+ $logger.error("Secret refresh failed: #{e.message}")
891
+ end
892
+ end
893
+ end
894
+ ```
895
+
896
+ ## Monitoring and Observability
897
+
898
+ ### Prometheus Metrics
899
+
900
+ ```ruby
901
+ # lib/metrics.rb
902
+ require 'prometheus/client'
903
+
904
+ class RCrewAIMetrics
905
+ def initialize
906
+ @registry = Prometheus::Client.registry
907
+ setup_metrics
908
+ end
909
+
910
+ def setup_metrics
911
+ # Request counters
912
+ @request_total = @registry.counter(
913
+ :rcrewai_requests_total,
914
+ docstring: 'Total number of requests',
915
+ labels: [:method, :path, :status]
916
+ )
917
+
918
+ @execution_total = @registry.counter(
919
+ :rcrewai_executions_total,
920
+ docstring: 'Total number of crew executions',
921
+ labels: [:crew_name, :status]
922
+ )
923
+
924
+ # Duration histograms
925
+ @request_duration = @registry.histogram(
926
+ :rcrewai_request_duration_seconds,
927
+ docstring: 'Request duration in seconds',
928
+ labels: [:method, :path],
929
+ buckets: [0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0]
930
+ )
931
+
932
+ @execution_duration = @registry.histogram(
933
+ :rcrewai_execution_duration_seconds,
934
+ docstring: 'Crew execution duration in seconds',
935
+ labels: [:crew_name],
936
+ buckets: [1.0, 5.0, 10.0, 30.0, 60.0, 300.0, 600.0]
937
+ )
938
+
939
+ # Gauges
940
+ @active_executions = @registry.gauge(
941
+ :rcrewai_active_executions,
942
+ docstring: 'Number of active executions',
943
+ labels: [:crew_name]
944
+ )
945
+
946
+ @memory_usage = @registry.gauge(
947
+ :rcrewai_memory_usage_bytes,
948
+ docstring: 'Memory usage in bytes'
949
+ )
950
+
951
+ @llm_api_calls = @registry.counter(
952
+ :rcrewai_llm_api_calls_total,
953
+ docstring: 'Total LLM API calls',
954
+ labels: [:provider, :model, :status]
955
+ )
956
+ end
957
+
958
+ def record_request(method, path, status, duration)
959
+ @request_total.increment(labels: { method: method, path: path, status: status })
960
+ @request_duration.observe(duration, labels: { method: method, path: path })
961
+ end
962
+
963
+ def record_execution_start(crew_name)
964
+ @active_executions.increment(labels: { crew_name: crew_name })
965
+ end
966
+
967
+ def record_execution_complete(crew_name, status, duration)
968
+ @active_executions.decrement(labels: { crew_name: crew_name })
969
+ @execution_total.increment(labels: { crew_name: crew_name, status: status })
970
+ @execution_duration.observe(duration, labels: { crew_name: crew_name })
971
+ end
972
+
973
+ def record_llm_call(provider, model, status)
974
+ @llm_api_calls.increment(labels: { provider: provider, model: model, status: status })
975
+ end
976
+
977
+ def update_memory_usage
978
+ memory = get_memory_usage_bytes
979
+ @memory_usage.set(memory)
980
+ end
981
+
982
+ private
983
+
984
+ def get_memory_usage_bytes
985
+ `ps -o rss= -p #{Process.pid}`.to_i * 1024
986
+ rescue
987
+ 0
988
+ end
989
+ end
990
+
991
+ # Initialize global metrics instance
992
+ $metrics = RCrewAIMetrics.new
993
+
994
+ # Middleware for automatic metrics collection
995
+ class MetricsMiddleware
996
+ def initialize(app)
997
+ @app = app
998
+ end
999
+
1000
+ def call(env)
1001
+ start_time = Time.now
1002
+ method = env['REQUEST_METHOD']
1003
+ path = env['PATH_INFO']
1004
+
1005
+ status, headers, body = @app.call(env)
1006
+
1007
+ duration = Time.now - start_time
1008
+ $metrics.record_request(method, path, status.to_s, duration)
1009
+
1010
+ [status, headers, body]
1011
+ end
1012
+ end
1013
+ ```
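+
+ Tying the collectors into the execution path is then a matter of wrapping the crew call. A sketch that reports to the `$metrics` instance defined above; the helper name and its placement in the app are illustrative:
+
+ ```ruby
+ # Sketch: record execution metrics around a crew run
+ def execute_with_metrics(crew)
+   started = Time.now
+   $metrics.record_execution_start(crew.name)
+
+   result = crew.execute
+   $metrics.record_execution_complete(crew.name, 'success', Time.now - started)
+   result
+ rescue StandardError
+   $metrics.record_execution_complete(crew.name, 'failure', Time.now - started)
+   raise
+ end
+ ```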
1014
+
1015
+ ### Structured Logging
1016
+
1017
+ ```ruby
1018
+ # lib/structured_logger.rb
1019
+ require 'json'
+ require 'logger'
+ require 'time'   # Time#iso8601
1021
+
1022
+ class StructuredLogger
1023
+ def initialize(output = $stdout)
1024
+ @logger = Logger.new(output)
1025
+ @logger.level = Logger.const_get(ENV.fetch('LOG_LEVEL', 'INFO').upcase)
1026
+ @logger.formatter = method(:json_formatter)
1027
+ end
1028
+
1029
+ def info(message, context = {})
1030
+ @logger.info(log_entry(message, context))
1031
+ end
1032
+
1033
+ def warn(message, context = {})
1034
+ @logger.warn(log_entry(message, context))
1035
+ end
1036
+
1037
+ def error(message, context = {})
1038
+ @logger.error(log_entry(message, context))
1039
+ end
1040
+
1041
+ def debug(message, context = {})
1042
+ @logger.debug(log_entry(message, context))
1043
+ end
1044
+
1045
+ private
1046
+
1047
+ def log_entry(message, context)
+ {
+ timestamp: Time.now.utc.iso8601,
+ message: message,
+ service: 'rcrewai',
+ version: RCrewAI::VERSION,
+ environment: ENV.fetch('RAILS_ENV', 'development'),
+ process_id: Process.pid,
+ thread_id: Thread.current.object_id
+ }.merge(context)
+ end
+
+ def json_formatter(severity, timestamp, progname, msg)
+ # The Logger supplies the severity, so entries never have to guess their own level
+ base = { timestamp: timestamp.utc.iso8601, level: severity, service: 'rcrewai' }
+ entry = msg.is_a?(Hash) ? base.merge(msg) : base.merge(message: msg.to_s)
+ entry.to_json + "\n"
+ end
1072
+ end
1073
+
1074
+ # Global logger instance
1075
+ $logger = StructuredLogger.new
1076
+ ```
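+
+ Each call then emits a single JSON line, which is what log aggregators such as Loki or Elasticsearch expect. An illustrative call and abridged output:
+
+ ```ruby
+ $logger.info('Execution completed',
+              execution_id: 'abc123',          # illustrative values
+              crew_name: 'customer_support',
+              duration: 12.4)
+ # {"timestamp":"...","level":"INFO","service":"rcrewai","message":"Execution completed",
+ #  "environment":"production","crew_name":"customer_support","duration":12.4, ...}
+ ```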
1077
+
1078
+ ### Distributed Tracing
1079
+
1080
+ ```ruby
1081
+ # lib/tracing.rb
1082
+ require 'opentelemetry/sdk'
1083
+ require 'opentelemetry/exporter/jaeger'
1084
+ require 'opentelemetry/instrumentation/all'
1085
+
1086
+ class TracingSetup
1087
+ def self.configure
1088
+ OpenTelemetry::SDK.configure do |c|
1089
+ c.service_name = 'rcrewai'
1090
+ c.service_version = RCrewAI::VERSION
1091
+
1092
+ c.add_span_processor(
1093
+ OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
1094
+ OpenTelemetry::Exporter::Jaeger::AgentExporter.new(
+ host: ENV.fetch('JAEGER_AGENT_HOST', 'localhost'),
+ port: ENV.fetch('JAEGER_AGENT_PORT', '6831').to_i   # Jaeger agent listens on UDP 6831
1096
+ )
1097
+ )
1098
+ )
1099
+
1100
+ c.use_all() # Enable all instrumentations
1101
+ end
1102
+ end
1103
+
1104
+ def self.tracer
1105
+ OpenTelemetry.tracer_provider.tracer('rcrewai', RCrewAI::VERSION)
1106
+ end
1107
+ end
1108
+
1109
+ # Initialize tracing
1110
+ TracingSetup.configure if ENV.fetch('TRACING_ENABLED', 'true') == 'true'
1111
+
1112
+ # Tracing middleware
1113
+ class TracingMiddleware
1114
+ def initialize(app)
1115
+ @app = app
1116
+ @tracer = TracingSetup.tracer
1117
+ end
1118
+
1119
+ def call(env)
1120
+ @tracer.in_span("http_request") do |span|
1121
+ span.set_attribute('http.method', env['REQUEST_METHOD'])
1122
+ span.set_attribute('http.url', env['PATH_INFO'])
1123
+
1124
+ status, headers, body = @app.call(env)
1125
+
1126
+ span.set_attribute('http.status_code', status)
1127
+ span.status = OpenTelemetry::Trace::Status.error if status >= 400
1128
+
1129
+ [status, headers, body]
1130
+ end
1131
+ end
1132
+ end
1133
+ ```
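+
+ Beyond the automatic HTTP spans, a crew run can be wrapped in a custom span so slow executions are visible end to end. A sketch using the tracer configured above; `crew` and the `result[:success_rate]` key come from the application code earlier in this guide:
+
+ ```ruby
+ # Sketch: a custom span around a crew execution
+ tracer = TracingSetup.tracer
+
+ tracer.in_span('crew.execute', attributes: { 'crew.name' => crew.name }) do |span|
+   result = crew.execute
+   span.set_attribute('crew.success_rate', result[:success_rate].to_f)
+   result
+ rescue StandardError => e
+   span.record_exception(e)
+   span.status = OpenTelemetry::Trace::Status.error(e.message)
+   raise
+ end
+ ```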
1134
+
1135
+ ## Scaling and Load Balancing
1136
+
1137
+ ### Auto-scaling Configuration
1138
+
1139
+ ```yaml
1140
+ # k8s/vertical-pod-autoscaler.yaml
1141
+ apiVersion: autoscaling.k8s.io/v1
1142
+ kind: VerticalPodAutoscaler
1143
+ metadata:
1144
+ name: rcrewai-vpa
1145
+ spec:
1146
+ targetRef:
1147
+ apiVersion: apps/v1
1148
+ kind: Deployment
1149
+ name: rcrewai-app
1150
+ updatePolicy:
1151
+ updateMode: "Auto"
1152
+ resourcePolicy:
1153
+ containerPolicies:
1154
+ - containerName: rcrewai
1155
+ minAllowed:
1156
+ cpu: 100m
1157
+ memory: 128Mi
1158
+ maxAllowed:
1159
+ cpu: 1
1160
+ memory: 1Gi
1161
+ ```
1162
+
1163
+ ### Load Balancer Configuration
1164
+
1165
+ ```nginx
1166
+ # nginx.conf (included at the http level)
+
+ # Shared rate-limit state; limit_req_zone is only valid in the http context
+ limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
+
+ upstream rcrewai_backend {
1168
+ least_conn;
1169
+ server rcrewai-1:8080 max_fails=3 fail_timeout=30s;
1170
+ server rcrewai-2:8080 max_fails=3 fail_timeout=30s;
1171
+ server rcrewai-3:8080 max_fails=3 fail_timeout=30s;
1172
+ }
1173
+
1174
+ server {
1175
+ listen 80;
1176
+ server_name api.yourcompany.com;
1177
+
1178
+ # Rate limiting (zone "api" defined at the http level above)
1180
+ limit_req zone=api burst=20 nodelay;
1181
+
1182
+ # Request timeout
1183
+ proxy_read_timeout 300s;
1184
+ proxy_connect_timeout 60s;
1185
+ proxy_send_timeout 60s;
1186
+
1187
+ location / {
1188
+ proxy_pass http://rcrewai_backend;
1189
+ proxy_set_header Host $host;
1190
+ proxy_set_header X-Real-IP $remote_addr;
1191
+ proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
1192
+ proxy_set_header X-Forwarded-Proto $scheme;
1193
+
1194
+ # Health check
1195
+ proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
1196
+ proxy_next_upstream_tries 3;
1197
+ proxy_next_upstream_timeout 60s;
1198
+ }
1199
+
1200
+ location /health {
1201
+ access_log off;
1202
+ proxy_pass http://rcrewai_backend;
1203
+ }
1204
+
1205
+ location /metrics {
1206
+ access_log off;
1207
+ allow 10.0.0.0/8;
1208
+ allow 192.168.0.0/16;
1209
+ deny all;
1210
+ proxy_pass http://rcrewai_backend;
1211
+ }
1212
+ }
1213
+ ```
1214
+
1215
+ ## Security and Access Control
1216
+
1217
+ ### Network Policies
1218
+
1219
+ ```yaml
1220
+ # k8s/network-policy.yaml
1221
+ apiVersion: networking.k8s.io/v1
1222
+ kind: NetworkPolicy
1223
+ metadata:
1224
+ name: rcrewai-network-policy
1225
+ spec:
1226
+ podSelector:
1227
+ matchLabels:
1228
+ app: rcrewai
1229
+ policyTypes:
1230
+ - Ingress
1231
+ - Egress
1232
+ ingress:
1233
+ - from:
1234
+ - namespaceSelector:
1235
+ matchLabels:
1236
+ name: ingress-system
1237
+ - podSelector:
1238
+ matchLabels:
1239
+ app: load-balancer
1240
+ ports:
1241
+ - protocol: TCP
1242
+ port: 8080
1243
+ egress:
1244
+ - to: []
1245
+ ports:
1246
+ - protocol: TCP
1247
+ port: 443 # HTTPS
1248
+ - protocol: TCP
1249
+ port: 53 # DNS
1250
+ - protocol: UDP
1251
+ port: 53 # DNS
1252
+ - to:
1253
+ - podSelector:
1254
+ matchLabels:
1255
+ app: redis
1256
+ ports:
1257
+ - protocol: TCP
1258
+ port: 6379
1259
+ ```
1260
+
1261
+ ### Pod Security Standards
1262
+
1263
+ ```yaml
1264
+ # k8s/pod-security-policy.yaml
1265
+ apiVersion: policy/v1beta1
1266
+ kind: PodSecurityPolicy
1267
+ metadata:
1268
+ name: rcrewai-psp
1269
+ spec:
1270
+ privileged: false
1271
+ allowPrivilegeEscalation: false
1272
+ requiredDropCapabilities:
1273
+ - ALL
1274
+ volumes:
1275
+ - 'configMap'
1276
+ - 'emptyDir'
1277
+ - 'projected'
1278
+ - 'secret'
1279
+ - 'downwardAPI'
1280
+ - 'persistentVolumeClaim'
1281
+ runAsUser:
1282
+ rule: 'MustRunAsNonRoot'
1283
+ seLinux:
1284
+ rule: 'RunAsAny'
1285
+ fsGroup:
1286
+ rule: 'RunAsAny'
1287
+ ```
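+
+ Note that `PodSecurityPolicy` (`policy/v1beta1`) was removed in Kubernetes 1.25. On current clusters the equivalent guardrail is the built-in Pod Security Admission controller, enabled with namespace labels; the `rcrewai` namespace name here is an assumption:
+
+ ```yaml
+ # Pod Security Admission labels (Kubernetes 1.23+), replacing the PSP above on newer clusters
+ apiVersion: v1
+ kind: Namespace
+ metadata:
+   name: rcrewai
+   labels:
+     pod-security.kubernetes.io/enforce: restricted
+     pod-security.kubernetes.io/warn: restricted
+ ```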
1288
+
1289
+ ### Authentication and Authorization
1290
+
1291
+ ```ruby
1292
+ # lib/auth.rb
1293
+ require 'jwt'
1294
+
1295
+ class AuthenticationMiddleware
1296
+ def initialize(app)
1297
+ @app = app
1298
+ @secret = ENV.fetch('JWT_SECRET')
1299
+ end
1300
+
1301
+ def call(env)
1302
+ # Skip auth for health checks
1303
+ if env['PATH_INFO'] == '/health' || env['PATH_INFO'] == '/ready'
1304
+ return @app.call(env)
1305
+ end
1306
+
1307
+ auth_header = env['HTTP_AUTHORIZATION']
1308
+
1309
+ unless auth_header&.start_with?('Bearer ')
1310
+ return unauthorized_response
1311
+ end
1312
+
1313
+ token = auth_header.sub('Bearer ', '')
1314
+
1315
+ begin
1316
+ payload = JWT.decode(token, @secret, true, algorithm: 'HS256')[0]
1317
+ env['user_id'] = payload['user_id']
1318
+ env['permissions'] = payload['permissions'] || []
1319
+
1320
+ @app.call(env)
1321
+ rescue JWT::DecodeError
1322
+ unauthorized_response
1323
+ end
1324
+ end
1325
+
1326
+ private
1327
+
1328
+ def unauthorized_response
1329
+ [401, {'Content-Type' => 'application/json'}, [
1330
+ { error: 'Unauthorized' }.to_json
1331
+ ]]
1332
+ end
1333
+ end
1334
+
1335
+ class AuthorizationMiddleware
1336
+ def initialize(app)
1337
+ @app = app
1338
+ end
1339
+
1340
+ def call(env)
1341
+ permissions = env['permissions'] || []
1342
+ path = env['PATH_INFO']
1343
+ method = env['REQUEST_METHOD']
1344
+
1345
+ required_permission = determine_required_permission(method, path)
1346
+
1347
+ if required_permission && !permissions.include?(required_permission)
1348
+ return forbidden_response
1349
+ end
1350
+
1351
+ @app.call(env)
1352
+ end
1353
+
1354
+ private
1355
+
1356
+ def determine_required_permission(method, path)
1357
+ case [method, path]
1358
+ when ['POST', '/execute']
1359
+ 'execute_crew'
1360
+ when ['GET', '/metrics']
1361
+ 'view_metrics'
1362
+ else
1363
+ nil # No special permission required
1364
+ end
1365
+ end
1366
+
1367
+ def forbidden_response
1368
+ [403, {'Content-Type' => 'application/json'}, [
1369
+ { error: 'Forbidden' }.to_json
1370
+ ]]
1371
+ end
1372
+ end
1373
+ ```
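+
+ For testing the protected endpoints, a token with the claims the middleware expects can be minted with the same `jwt` gem and shared secret; the user id and permissions below are illustrative:
+
+ ```ruby
+ # Sketch: mint a token that AuthenticationMiddleware will accept
+ require 'jwt'
+
+ payload = {
+   user_id: 'ops-user',
+   permissions: %w[execute_crew view_metrics],
+   exp: Time.now.to_i + 3600   # standard expiry claim, verified by JWT.decode
+ }
+
+ puts JWT.encode(payload, ENV.fetch('JWT_SECRET'), 'HS256')
+ ```
+
+ ```bash
+ # Call the protected endpoint with the minted token
+ curl -X POST https://api.yourcompany.com/execute \
+   -H "Authorization: Bearer $TOKEN" \
+   -H "Content-Type: application/json" \
+   -d '{"crew_name": "customer_support", "request_id": "smoke-1"}'
+ ```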
1374
+
1375
+ ## CI/CD Pipeline
1376
+
1377
+ ### GitHub Actions Workflow
1378
+
1379
+ ```yaml
1380
+ # .github/workflows/deploy.yml
1381
+ name: Deploy to Production
1382
+
1383
+ on:
1384
+ push:
1385
+ branches: [main]
1386
+ tags: ['v*']
1387
+
1388
+ env:
1389
+ REGISTRY: ghcr.io
1390
+ IMAGE_NAME: ${{ github.repository }}
1391
+
1392
+ jobs:
1393
+ test:
1394
+ runs-on: ubuntu-latest
1395
+ steps:
1396
+ - uses: actions/checkout@v3
1397
+
1398
+ - name: Set up Ruby
1399
+ uses: ruby/setup-ruby@v1
1400
+ with:
1401
+ ruby-version: 3.1
1402
+ bundler-cache: true
1403
+
1404
+ - name: Run tests
1405
+ run: bundle exec rspec
1406
+
1407
+ - name: Run security scan
1408
+ run: |
1409
+ bundle exec bundle-audit check --update
1410
+ bundle exec brakeman -q -w2
1411
+
1412
+ - name: Check code style
1413
+ run: bundle exec rubocop
1414
+
1415
+ build:
1416
+ needs: test
1417
+ runs-on: ubuntu-latest
1418
+ permissions:
1419
+ contents: read
1420
+ packages: write
1421
+ steps:
1422
+ - uses: actions/checkout@v3
1423
+
1424
+ - name: Log in to Container Registry
1425
+ uses: docker/login-action@v2
1426
+ with:
1427
+ registry: ${{ env.REGISTRY }}
1428
+ username: ${{ github.actor }}
1429
+ password: ${{ secrets.GITHUB_TOKEN }}
1430
+
1431
+ - name: Extract metadata
1432
+ id: meta
1433
+ uses: docker/metadata-action@v4
1434
+ with:
1435
+ images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
1436
+ tags: |
1437
+ type=ref,event=branch
1438
+ type=ref,event=pr
1439
+ type=semver,pattern={{version}}
1440
+ type=semver,pattern={{major}}.{{minor}}
+ type=sha,format=long
1441
+
1442
+ - name: Build and push Docker image
1443
+ uses: docker/build-push-action@v4
1444
+ with:
1445
+ context: .
1446
+ push: true
1447
+ tags: ${{ steps.meta.outputs.tags }}
1448
+ labels: ${{ steps.meta.outputs.labels }}
1449
+
1450
+ deploy:
1451
+ needs: build
1452
+ runs-on: ubuntu-latest
1453
+ environment: production
1454
+ if: github.ref == 'refs/heads/main'
1455
+ steps:
1456
+ - uses: actions/checkout@v3
1457
+
1458
+ - name: Configure kubectl
1459
+ uses: azure/k8s-set-context@v1
1460
+ with:
1461
+ method: kubeconfig
1462
+ kubeconfig: ${{ secrets.KUBE_CONFIG }}
1463
+
1464
+ - name: Deploy to Kubernetes
1465
+ run: |
1466
+ kubectl set image deployment/rcrewai-app \
1467
+ rcrewai=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:sha-${{ github.sha }}
1468
+ kubectl rollout status deployment/rcrewai-app --timeout=600s
1469
+
1470
+ - name: Run smoke tests
1471
+ run: |
1472
+ kubectl wait --for=condition=ready pod -l app=rcrewai --timeout=300s
1473
+ ./scripts/smoke-tests.sh
1474
+ ```
1475
+
1476
+ ### Deployment Scripts
1477
+
1478
+ ```bash
1479
+ #!/bin/bash
1480
+ # scripts/deploy.sh
1481
+
1482
+ set -euo pipefail
1483
+
1484
+ ENVIRONMENT=${1:-production}
1485
+ IMAGE_TAG=${2:-latest}
1486
+
1487
+ echo "Deploying RCrewAI to $ENVIRONMENT with tag $IMAGE_TAG"
1488
+
1489
+ # Update deployment with new image
1490
+ kubectl set image deployment/rcrewai-app \
1491
+ rcrewai="ghcr.io/yourorg/rcrewai:$IMAGE_TAG" \
1492
+ --namespace="$ENVIRONMENT"
1493
+
1494
+ # Wait for rollout to complete
1495
+ kubectl rollout status deployment/rcrewai-app \
1496
+ --namespace="$ENVIRONMENT" \
1497
+ --timeout=600s
1498
+
1499
+ # Verify deployment
1500
+ echo "Verifying deployment..."
1501
+ kubectl get pods -l app=rcrewai --namespace="$ENVIRONMENT"
1502
+
1503
+ # Run health check
1504
+ echo "Running health check..."
1505
+ HEALTH_URL=$(kubectl get service rcrewai-service --namespace="$ENVIRONMENT" \
1506
+ -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
1507
+
1508
+ curl -f "http://$HEALTH_URL/health" || {
1509
+ echo "Health check failed!"
1510
+ exit 1
1511
+ }
1512
+
1513
+ echo "Deployment successful!"
1514
+ ```
1515
+
1516
+ ```bash
1517
+ #!/bin/bash
1518
+ # scripts/smoke-tests.sh
1519
+
1520
+ set -euo pipefail
1521
+
1522
+ SERVICE_URL=${SERVICE_URL:-http://localhost:8080}
1523
+
1524
+ echo "Running smoke tests against $SERVICE_URL"
1525
+
1526
+ # Test 1: Health check
1527
+ echo "Testing health endpoint..."
1528
+ response=$(curl -s -o /dev/null -w "%{http_code}" "$SERVICE_URL/health")
1529
+ if [[ $response != "200" ]]; then
1530
+ echo "Health check failed: $response"
1531
+ exit 1
1532
+ fi
1533
+
1534
+ # Test 2: Ready check
1535
+ echo "Testing readiness endpoint..."
1536
+ response=$(curl -s -o /dev/null -w "%{http_code}" "$SERVICE_URL/ready")
1537
+ if [[ $response != "200" ]]; then
1538
+ echo "Readiness check failed: $response"
1539
+ exit 1
1540
+ fi
1541
+
1542
+ # Test 3: Metrics endpoint
1543
+ echo "Testing metrics endpoint..."
1544
+ response=$(curl -s -o /dev/null -w "%{http_code}" "$SERVICE_URL/metrics")
1545
+ if [[ $response != "200" ]] && [[ $response != "403" ]]; then
1546
+ echo "Metrics check failed: $response"
1547
+ exit 1
1548
+ fi
1549
+
1550
+ # Test 4: Basic execution (if auth allows)
1551
+ echo "Testing basic execution..."
1552
+ response=$(curl -s -X POST "$SERVICE_URL/execute" \
1553
+ -H "Content-Type: application/json" \
1554
+ -d '{"crew_name": "customer_support", "request_id": "test-123"}' \
1555
+ -w "%{http_code}" -o /dev/null)
1556
+
1557
+ # Accept 401/403 for auth-protected endpoints
1558
+ if [[ $response != "200" ]] && [[ $response != "401" ]] && [[ $response != "403" ]]; then
1559
+ echo "Execution test failed: $response"
1560
+ exit 1
1561
+ fi
1562
+
1563
+ echo "All smoke tests passed!"
1564
+ ```
1565
+
1566
+ ## Operational Procedures
1567
+
1568
+ ### Monitoring Dashboard
1569
+
1570
+ `grafana/dashboards/rcrewai-dashboard.json`:
+
+ ```json
1572
+ {
1573
+ "dashboard": {
1574
+ "title": "RCrewAI Production Dashboard",
1575
+ "panels": [
1576
+ {
1577
+ "title": "Request Rate",
1578
+ "type": "graph",
1579
+ "targets": [
1580
+ {
1581
+ "expr": "rate(rcrewai_requests_total[5m])",
1582
+ "legendFormat": "{{method}} {{path}}"
1583
+ }
1584
+ ]
1585
+ },
1586
+ {
1587
+ "title": "Response Times",
1588
+ "type": "graph",
1589
+ "targets": [
1590
+ {
1591
+ "expr": "histogram_quantile(0.95, rate(rcrewai_request_duration_seconds_bucket[5m]))",
1592
+ "legendFormat": "95th percentile"
1593
+ },
1594
+ {
1595
+ "expr": "histogram_quantile(0.50, rate(rcrewai_request_duration_seconds_bucket[5m]))",
1596
+ "legendFormat": "50th percentile"
1597
+ }
1598
+ ]
1599
+ },
1600
+ {
1601
+ "title": "Error Rate",
1602
+ "type": "stat",
1603
+ "targets": [
1604
+ {
1605
+ "expr": "rate(rcrewai_requests_total{status=~\"5..\"}[5m]) / rate(rcrewai_requests_total[5m]) * 100",
1606
+ "legendFormat": "Error Rate %"
1607
+ }
1608
+ ]
1609
+ },
1610
+ {
1611
+ "title": "Crew Executions",
1612
+ "type": "graph",
1613
+ "targets": [
1614
+ {
1615
+ "expr": "rate(rcrewai_executions_total[5m])",
1616
+ "legendFormat": "{{crew_name}} {{status}}"
1617
+ }
1618
+ ]
1619
+ },
1620
+ {
1621
+ "title": "Memory Usage",
1622
+ "type": "graph",
1623
+ "targets": [
1624
+ {
1625
+ "expr": "rcrewai_memory_usage_bytes / 1024 / 1024",
1626
+ "legendFormat": "Memory MB"
1627
+ }
1628
+ ]
1629
+ },
1630
+ {
1631
+ "title": "Active Executions",
1632
+ "type": "stat",
1633
+ "targets": [
1634
+ {
1635
+ "expr": "sum(rcrewai_active_executions)",
1636
+ "legendFormat": "Active"
1637
+ }
1638
+ ]
1639
+ }
1640
+ ]
1641
+ }
1642
+ }
1643
+ ```
1644
+
1645
+ ### Alerting Rules
1646
+
1647
+ ```yaml
1648
+ # prometheus/alerts.yml
1649
+ groups:
1650
+ - name: rcrewai
1651
+ rules:
1652
+ - alert: HighErrorRate
1653
+ expr: rate(rcrewai_requests_total{status=~"5.."}[5m]) / rate(rcrewai_requests_total[5m]) > 0.05
1654
+ for: 2m
1655
+ labels:
1656
+ severity: critical
1657
+ annotations:
1658
+ summary: "High error rate detected"
1659
+ description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
1660
+
1661
+ - alert: HighResponseTime
1662
+ expr: histogram_quantile(0.95, rate(rcrewai_request_duration_seconds_bucket[5m])) > 10
1663
+ for: 5m
1664
+ labels:
1665
+ severity: warning
1666
+ annotations:
1667
+ summary: "High response time detected"
1668
+ description: "95th percentile response time is {{ $value }}s"
1669
+
1670
+ - alert: ServiceDown
1671
+ expr: up{job="rcrewai"} == 0
1672
+ for: 1m
1673
+ labels:
1674
+ severity: critical
1675
+ annotations:
1676
+ summary: "RCrewAI service is down"
1677
+ description: "RCrewAI service has been down for more than 1 minute"
1678
+
1679
+ - alert: HighMemoryUsage
1680
+ expr: rcrewai_memory_usage_bytes / 1024 / 1024 / 1024 > 1
1681
+ for: 5m
1682
+ labels:
1683
+ severity: warning
1684
+ annotations:
1685
+ summary: "High memory usage"
1686
+ description: "Memory usage is {{ $value }}GB"
1687
+
1688
+ - alert: TooManyActiveExecutions
1689
+ expr: sum(rcrewai_active_executions) > 50
1690
+ for: 2m
1691
+ labels:
1692
+ severity: warning
1693
+ annotations:
1694
+ summary: "Too many active executions"
1695
+ description: "{{ $value }} executions are currently active"
1696
+ ```
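+
+ These rules only fire inside Prometheus; delivering notifications requires an Alertmanager configuration. A minimal routing sketch, with the Slack webhook URL and channels as placeholders:
+
+ ```yaml
+ # alertmanager.yml: minimal routing sketch
+ route:
+   receiver: default
+   group_by: ['alertname']
+   routes:
+     - matchers:
+         - severity="critical"
+       receiver: oncall
+
+ receivers:
+   - name: default
+     slack_configs:
+       - api_url: https://hooks.slack.com/services/REPLACE/ME
+         channel: '#rcrewai-alerts'
+   - name: oncall
+     slack_configs:
+       - api_url: https://hooks.slack.com/services/REPLACE/ME
+         channel: '#rcrewai-oncall'
+ ```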
1697
+
1698
+ ### Backup and Recovery
1699
+
1700
+ ```bash
1701
+ #!/bin/bash
1702
+ # scripts/backup.sh
1703
+
1704
+ set -euo pipefail
1705
+
1706
+ BACKUP_DIR=${BACKUP_DIR:-/backups}
1707
+ TIMESTAMP=$(date +%Y%m%d_%H%M%S)
1708
+
1709
+ echo "Starting backup at $TIMESTAMP"
1710
+
1711
+ # Backup configuration
1712
+ kubectl get configmap rcrewai-config -o yaml > "$BACKUP_DIR/config_$TIMESTAMP.yaml"
1713
+ kubectl get secret rcrewai-secrets -o yaml > "$BACKUP_DIR/secrets_$TIMESTAMP.yaml"
1714
+
1715
+ # Backup persistent data (if any)
1716
+ if kubectl get pvc rcrewai-data 2>/dev/null; then
1717
+ kubectl exec -it deployment/rcrewai-app -- tar czf - /data > "$BACKUP_DIR/data_$TIMESTAMP.tar.gz"
1718
+ fi
1719
+
1720
+ # Cleanup old backups (keep last 30 days)
1721
+ find "$BACKUP_DIR" -name "*.yaml" -mtime +30 -delete
1722
+ find "$BACKUP_DIR" -name "*.tar.gz" -mtime +30 -delete
1723
+
1724
+ echo "Backup completed successfully"
1725
+ ```
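+
+ To run the backup on a schedule inside the cluster, the script can be baked into an image and triggered by a CronJob. A sketch; the `yourorg/rcrewai-backup` image (containing `backup.sh` and `kubectl`) and the `rcrewai-backups` PVC are assumptions:
+
+ ```yaml
+ # k8s/backup-cronjob.yaml: nightly run of backup.sh
+ apiVersion: batch/v1
+ kind: CronJob
+ metadata:
+   name: rcrewai-backup
+ spec:
+   schedule: "0 2 * * *"              # every night at 02:00
+   jobTemplate:
+     spec:
+       template:
+         spec:
+           serviceAccountName: rcrewai-service-account
+           restartPolicy: OnFailure
+           containers:
+             - name: backup
+               image: yourorg/rcrewai-backup:latest
+               command: ["/bin/bash", "/scripts/backup.sh"]
+               env:
+                 - name: BACKUP_DIR
+                   value: /backups
+               volumeMounts:
+                 - name: backups
+                   mountPath: /backups
+           volumes:
+             - name: backups
+               persistentVolumeClaim:
+                 claimName: rcrewai-backups
+ ```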
1726
+
1727
+ ## Troubleshooting and Recovery
1728
+
1729
+ ### Common Issues and Solutions
1730
+
1731
+ #### High Memory Usage
1732
+ ```bash
1733
+ # Check memory usage
1734
+ kubectl top pods -l app=rcrewai
1735
+
1736
+ # Check for memory leaks
1737
+ kubectl exec -it deployment/rcrewai-app -- ps aux
1738
+
1739
+ # Restart pods if needed
1740
+ kubectl rollout restart deployment/rcrewai-app
1741
+ ```
1742
+
1743
+ #### Slow Response Times
1744
+ ```bash
1745
+ # Check CPU usage
1746
+ kubectl top pods -l app=rcrewai
1747
+
1748
+ # Scale up if needed
1749
+ kubectl scale deployment rcrewai-app --replicas=5
1750
+
1751
+ # Check database connections
1752
+ kubectl logs deployment/rcrewai-app | grep -i "database\|connection"
1753
+ ```
1754
+
1755
+ #### Failed Deployments
1756
+ ```bash
1757
+ # Check rollout status
1758
+ kubectl rollout status deployment/rcrewai-app
1759
+
1760
+ # Check pod logs
1761
+ kubectl logs deployment/rcrewai-app --previous
1762
+
1763
+ # Rollback if needed
1764
+ kubectl rollout undo deployment/rcrewai-app
1765
+ ```
1766
+
1767
+ ### Recovery Procedures
1768
+
1769
+ #### Complete Service Recovery
1770
+ ```bash
1771
+ #!/bin/bash
1772
+ # scripts/disaster-recovery.sh
1773
+
1774
+ set -euo pipefail
1775
+
1776
+ echo "Starting disaster recovery procedure"
1777
+
1778
+ # 1. Restore configuration
1779
+ kubectl apply -f backups/config_latest.yaml
1780
+ kubectl apply -f backups/secrets_latest.yaml
1781
+
1782
+ # 2. Deploy application
1783
+ kubectl apply -f k8s/
1784
+
1785
+ # 3. Wait for deployment
1786
+ kubectl wait --for=condition=available deployment/rcrewai-app --timeout=600s
1787
+
1788
+ # 4. Restore data if needed
1789
+ if [[ -f "backups/data_latest.tar.gz" ]]; then
1790
+ kubectl exec -it deployment/rcrewai-app -- tar xzf - -C / < backups/data_latest.tar.gz
1791
+ fi
1792
+
1793
+ # 5. Verify service
1794
+ ./scripts/smoke-tests.sh
1795
+
1796
+ echo "Disaster recovery completed"
1797
+ ```
1798
+
1799
+ ## Best Practices Summary
1800
+
1801
+ ### 1. **Security**
1802
+ - Use non-root containers
1803
+ - Implement network policies
1804
+ - Manage secrets properly
1805
+ - Enable authentication and authorization
1806
+ - Regular security scans
1807
+
1808
+ ### 2. **Reliability**
1809
+ - Health and readiness checks
1810
+ - Resource limits and requests
1811
+ - Graceful shutdown handling
1812
+ - Circuit breakers for external calls
1813
+ - Comprehensive error handling
1814
+
1815
+ ### 3. **Scalability**
1816
+ - Horizontal pod autoscaling
1817
+ - Load balancing
1818
+ - Stateless application design
1819
+ - Resource optimization
1820
+ - Performance monitoring
1821
+
1822
+ ### 4. **Observability**
1823
+ - Structured logging
1824
+ - Comprehensive metrics
1825
+ - Distributed tracing
1826
+ - Real-time alerting
1827
+ - Dashboard visualization
1828
+
1829
+ ### 5. **Operations**
1830
+ - Automated deployments
1831
+ - Blue-green deployments
1832
+ - Backup and recovery procedures
1833
+ - Incident response playbooks
1834
+ - Regular performance reviews
1835
+
1836
+ This production deployment guide provides a comprehensive foundation for running RCrewAI applications at scale with enterprise-grade reliability, security, and observability.