rcrewai 0.1.0 → 0.2.0
- checksums.yaml +4 -4
- data/docs/api/agent.md +429 -0
- data/docs/api/task.md +494 -0
- data/docs/examples/api-integration.md +829 -0
- data/docs/examples/async-execution.md +893 -0
- data/docs/examples/code-review-crew.md +660 -0
- data/docs/examples/content-marketing-pipeline.md +681 -0
- data/docs/examples/custom-tools.md +1224 -0
- data/docs/examples/customer-support.md +717 -0
- data/docs/examples/data-analysis-team.md +677 -0
- data/docs/examples/database-operations.md +1298 -0
- data/docs/examples/ecommerce-operations.md +990 -0
- data/docs/examples/financial-analysis.md +857 -0
- data/docs/examples/hierarchical-crew.md +479 -0
- data/docs/examples/product-development.md +688 -0
- data/docs/examples/production-ready-crew.md +384 -408
- data/docs/examples/research-development.md +1225 -0
- data/docs/examples/social-media.md +1073 -0
- data/docs/examples/task-automation.md +527 -0
- data/docs/examples/tool-composition.md +1075 -0
- data/docs/examples/web-scraping.md +1201 -0
- data/docs/tutorials/advanced-agents.md +1014 -0
- data/docs/tutorials/custom-tools.md +1242 -0
- data/docs/tutorials/deployment.md +1836 -0
- data/docs/tutorials/index.md +184 -0
- data/docs/tutorials/multiple-crews.md +1692 -0
- data/lib/rcrewai/llm_clients/anthropic.rb +1 -1
- data/lib/rcrewai/version.rb +1 -1
- metadata +26 -2
@@ -0,0 +1,1836 @@
---
layout: tutorial
title: Production Deployment
description: Complete guide to deploying RCrewAI applications in production with Docker, Kubernetes, monitoring, and enterprise features
---

# Production Deployment

This comprehensive tutorial covers deploying RCrewAI applications to production environments with enterprise-grade reliability, monitoring, scaling, and security. You'll learn containerization, orchestration, monitoring, and operational best practices.

## Table of Contents
1. [Production Readiness Checklist](#production-readiness-checklist)
2. [Containerization with Docker](#containerization-with-docker)
3. [Kubernetes Deployment](#kubernetes-deployment)
4. [Configuration Management](#configuration-management)
5. [Monitoring and Observability](#monitoring-and-observability)
6. [Scaling and Load Balancing](#scaling-and-load-balancing)
7. [Security and Access Control](#security-and-access-control)
8. [CI/CD Pipeline](#cicd-pipeline)
9. [Operational Procedures](#operational-procedures)
10. [Troubleshooting and Recovery](#troubleshooting-and-recovery)

## Production Readiness Checklist

Before deploying to production, ensure your RCrewAI application meets these requirements:

### ✅ Code Quality
- [ ] Comprehensive test coverage (>90%)
- [ ] Code review process in place
- [ ] Static analysis and linting
- [ ] Performance benchmarks established
- [ ] Security vulnerability scanning

### ✅ Configuration
- [ ] Environment-based configuration
- [ ] Secrets management implemented
- [ ] Resource limits defined
- [ ] Timeout and retry logic configured
- [ ] Logging levels appropriate for production

### ✅ Monitoring
- [ ] Health check endpoints implemented
- [ ] Metrics collection configured
- [ ] Alerting rules defined
- [ ] Log aggregation setup
- [ ] Performance monitoring enabled

### ✅ Infrastructure
- [ ] Load balancing configured
- [ ] Auto-scaling policies defined
- [ ] Backup and disaster recovery plan
- [ ] Network security implemented
- [ ] Resource quotas established

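The "Timeout and retry logic configured" item can be made concrete with a small helper. This is a minimal sketch and not part of RCrewAI: `with_retries`, its keyword arguments, and the injectable `sleeper` are hypothetical names chosen for illustration. It wraps a block in a per-attempt timeout and retries with exponential backoff.

```ruby
require 'timeout'

# Hypothetical helper (not an RCrewAI API): runs the block with a per-attempt
# timeout and retries failed attempts with exponential backoff.
# `sleeper` is injectable so tests can skip the real wait.
def with_retries(max_attempts: 3, base_delay: 0.5, timeout: 5, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    Timeout.timeout(timeout) { yield(attempt) }
  rescue StandardError
    # Give up once the attempt budget is spent; re-raise the last error
    raise if attempt >= max_attempts

    # Wait base_delay, then 2x, 4x, ... before the next attempt
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end
```

Calling `with_retries(max_attempts: 3) { call_llm }` (where `call_llm` stands in for your own code) makes at most three attempts and re-raises the final error if all of them fail.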
## Containerization with Docker

### Basic Dockerfile

```dockerfile
# Use official Ruby runtime as base image
FROM ruby:3.1-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy Gemfile and Gemfile.lock
COPY Gemfile Gemfile.lock ./

# Install Ruby dependencies
RUN bundle config set --local deployment 'true' && \
    bundle config set --local without 'development test' && \
    bundle install

# Copy application code
COPY . .

# Create non-root user for security
RUN groupadd -r rcrewai && useradd -r -g rcrewai rcrewai
RUN chown -R rcrewai:rcrewai /app
USER rcrewai

# Run Sinatra in production mode so the app's `configure :production` block executes
ENV RACK_ENV=production

# Expose port
EXPOSE 8080

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

# Default command (bundle exec so gems installed in deployment mode are on the load path)
CMD ["bundle", "exec", "ruby", "production_app.rb"]
```

### Multi-stage Production Dockerfile

```dockerfile
# Build stage
FROM ruby:3.1-slim AS builder

RUN apt-get update && apt-get install -y \
    build-essential \
    git \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY Gemfile Gemfile.lock ./
RUN bundle config set --local deployment 'true' && \
    bundle config set --local without 'development test' && \
    bundle install

# Production stage
FROM ruby:3.1-slim AS production

# Install only runtime dependencies
RUN apt-get update && apt-get install -y \
    curl \
    && apt-get autoremove -y \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy Bundler configuration from builder stage
COPY --from=builder /usr/local/bundle /usr/local/bundle
# Deployment mode installs gems under vendor/bundle, so copy that directory too
COPY --from=builder /app/vendor/bundle /app/vendor/bundle

# Copy application code
COPY . .

# Create non-root user
RUN groupadd -r rcrewai && useradd -r -g rcrewai -d /app rcrewai
RUN chown -R rcrewai:rcrewai /app

# Switch to non-root user
USER rcrewai

# Environment variables
ENV RAILS_ENV=production
ENV RACK_ENV=production
ENV BUNDLE_DEPLOYMENT=true
ENV BUNDLE_WITHOUT="development:test"

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
  CMD ruby health_check.rb || exit 1

EXPOSE 8080

# bundle exec puts the gems from vendor/bundle on the load path
CMD ["bundle", "exec", "ruby", "production_app.rb"]
```

### Production Application Structure

```ruby
# production_app.rb
require 'rcrewai'
require 'sinatra/base' # modular style; avoids also booting the classic Sinatra::Application
require 'json'
require 'logger'
require 'securerandom'
require 'time'
require 'prometheus/middleware/collector'
require 'prometheus/middleware/exporter'

class ValidationError < StandardError; end

# Defined before the app class: the `configure` block below runs while
# ProductionRCrewAI is being defined and needs CrewRegistry to exist.
class CrewRegistry
  def initialize
    @crews = {}
  end

  def register_crew(name, crew)
    @crews[name] = crew
  end

  def get_crew(name)
    crew = @crews[name]
    raise ValidationError, "Crew not found: #{name}" unless crew
    crew
  end

  def crew_exists?(name)
    @crews.key?(name)
  end

  def crew_count
    @crews.length
  end

  def register_default_crews
    # Register your production crews here
    support_crew = create_support_crew
    register_crew('customer_support', support_crew)

    analysis_crew = create_analysis_crew
    register_crew('data_analysis', analysis_crew)
  end

  private

  def create_support_crew
    crew = RCrewAI::Crew.new("customer_support")

    support_agent = RCrewAI::Agent.new(
      name: "support_specialist",
      role: "Customer Support Specialist",
      goal: "Provide excellent customer support and resolve issues efficiently",
      tools: [
        RCrewAI::Tools::WebSearch.new(max_results: 5),
        RCrewAI::Tools::FileReader.new
      ]
    )

    crew.add_agent(support_agent)

    support_task = RCrewAI::Task.new(
      name: "handle_support_request",
      description: "Handle customer support request with empathy and expertise",
      expected_output: "Professional support response with clear next steps"
    )

    crew.add_task(support_task)
    crew
  end

  def create_analysis_crew
    crew = RCrewAI::Crew.new("data_analysis")

    analyst = RCrewAI::Agent.new(
      name: "data_analyst",
      role: "Senior Data Analyst",
      goal: "Analyze data and provide actionable insights",
      tools: [
        RCrewAI::Tools::FileReader.new,
        RCrewAI::Tools::WebSearch.new
      ]
    )

    crew.add_agent(analyst)

    analysis_task = RCrewAI::Task.new(
      name: "data_analysis",
      description: "Perform comprehensive data analysis and generate insights",
      expected_output: "Detailed analysis report with charts and recommendations"
    )

    crew.add_task(analysis_task)
    crew
  end
end

class ProductionRCrewAI < Sinatra::Base
  # Monotonic clock reading at boot, used to report process uptime
  STARTED_AT = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  configure :production do
    enable :logging
    set :logger, Logger.new($stdout)

    # Metrics collection
    use Prometheus::Middleware::Collector
    use Prometheus::Middleware::Exporter

    # Configure RCrewAI for production
    RCrewAI.configure do |config|
      config.llm_provider = ENV.fetch('LLM_PROVIDER', 'openai').to_sym
      config.openai_api_key = ENV.fetch('OPENAI_API_KEY')
      config.temperature = ENV.fetch('LLM_TEMPERATURE', '0.1').to_f
      config.max_tokens = ENV.fetch('LLM_MAX_TOKENS', '4000').to_i
      config.timeout = ENV.fetch('LLM_TIMEOUT', '60').to_i
    end

    # Initialize crew registry
    @@crew_registry = CrewRegistry.new
    @@crew_registry.register_default_crews
  end

  # Health check endpoint
  get '/health' do
    content_type :json

    begin
      health_status = perform_health_check
      status health_status[:status] == 'healthy' ? 200 : 503
      health_status.to_json
    rescue => e
      status 503
      { status: 'unhealthy', error: e.message }.to_json
    end
  end

  # Readiness check endpoint
  get '/ready' do
    content_type :json

    begin
      readiness_status = perform_readiness_check
      status readiness_status[:ready] ? 200 : 503
      readiness_status.to_json
    rescue => e
      status 503
      { ready: false, error: e.message }.to_json
    end
  end

  # Metrics endpoint
  get '/metrics' do
    # Prometheus metrics are handled by the Exporter middleware,
    # which answers /metrics before the request reaches this route
  end

  # Main execution endpoint
  post '/execute' do
    content_type :json

    begin
      request_data = JSON.parse(request.body.read)

      # Validate request
      validate_execution_request(request_data)

      # Get crew
      crew_name = request_data['crew_name']
      crew = @@crew_registry.get_crew(crew_name)

      # Execute with monitoring
      result = execute_with_monitoring(crew, request_data)

      status 200
      result.to_json

    rescue JSON::ParserError
      status 400
      { error: 'Invalid JSON in request body' }.to_json
    rescue ValidationError => e
      status 400
      { error: e.message }.to_json
    rescue => e
      logger.error "Execution failed: #{e.message}"
      logger.error e.backtrace.join("\n")

      status 500
      { error: 'Internal server error' }.to_json
    end
  end

  private

  # Logger#info and friends accept a single message, so the context hash
  # is serialized into the message as JSON.
  def log_event(level, message, **context)
    logger.public_send(level, "#{message} #{context.to_json}")
  end

  def perform_health_check
    checks = {
      timestamp: Time.now.iso8601,
      status: 'healthy',
      checks: {}
    }

    # Check LLM provider connectivity
    begin
      # Quick LLM test
      RCrewAI.client.chat(
        messages: [{ role: 'user', content: 'test' }],
        max_tokens: 1,
        temperature: 0
      )
      checks[:checks][:llm] = { status: 'healthy' }
    rescue => e
      checks[:checks][:llm] = { status: 'unhealthy', error: e.message }
      checks[:status] = 'unhealthy'
    end

    # Check memory usage
    memory_usage = get_memory_usage
    if memory_usage > 0.9
      checks[:checks][:memory] = { status: 'warning', usage: memory_usage }
      # Do not downgrade an 'unhealthy' status set by an earlier check
      checks[:status] = 'degraded' if checks[:status] == 'healthy'
    else
      checks[:checks][:memory] = { status: 'healthy', usage: memory_usage }
    end

    checks
  end

  def perform_readiness_check
    {
      ready: true,
      timestamp: Time.now.iso8601,
      crews: @@crew_registry.crew_count,
      # Seconds since this process booted
      uptime: (Process.clock_gettime(Process::CLOCK_MONOTONIC) - STARTED_AT).to_i
    }
  end

  def validate_execution_request(data)
    required_fields = ['crew_name']
    missing_fields = required_fields - data.keys

    if missing_fields.any?
      raise ValidationError, "Missing required fields: #{missing_fields.join(', ')}"
    end

    unless @@crew_registry.crew_exists?(data['crew_name'])
      raise ValidationError, "Unknown crew: #{data['crew_name']}"
    end
  end

  def execute_with_monitoring(crew, request_data)
    start_time = Time.now
    execution_id = SecureRandom.uuid

    log_event(:info, 'Starting execution',
              execution_id: execution_id,
              crew_name: crew.name,
              request_id: request_data['request_id'])

    begin
      # Execute crew
      result = crew.execute(
        timeout: ENV.fetch('EXECUTION_TIMEOUT', '300').to_i,
        max_retries: ENV.fetch('MAX_RETRIES', '3').to_i
      )

      duration = Time.now - start_time

      log_event(:info, 'Execution completed',
                execution_id: execution_id,
                duration: duration,
                success_rate: result[:success_rate])

      {
        execution_id: execution_id,
        success: true,
        duration: duration,
        result: result
      }

    rescue => e
      duration = Time.now - start_time

      log_event(:error, 'Execution failed',
                execution_id: execution_id,
                duration: duration,
                error: e.message)

      raise
    end
  end

  def get_memory_usage
    # Simple memory usage check (Linux only; falls back to 0.0 elsewhere)
    memory_info = File.readlines('/proc/meminfo')
    total = memory_info.find { |line| line.start_with?('MemTotal:') }.split[1].to_i
    available = memory_info.find { |line| line.start_with?('MemAvailable:') }.split[1].to_i

    (total - available).to_f / total
  rescue
    0.0
  end
end

# Start the application
if __FILE__ == $0
  ProductionRCrewAI.run!(
    host: '0.0.0.0',
    port: ENV.fetch('PORT', 8080).to_i
  )
end

# Health check script for Docker
# health_check.rb
# Save this part as its own file: inside production_app.rb its `exit`
# call would terminate the process.
begin
  require 'net/http'

  uri = URI('http://localhost:8080/health')
  response = Net::HTTP.get_response(uri)

  exit(response.code == '200' ? 0 : 1)
rescue
  exit 1
end
```

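The memory check in the health endpoint boils down to a ratio of two `/proc/meminfo` fields. As a minimal sketch, the same calculation can be written as a pure function over the file's text, which makes it easy to try on sample input. `memory_usage_fraction` is a hypothetical name, and the sample figures are made up.

```ruby
# Hypothetical helper: computes the fraction of memory in use from
# /proc/meminfo-style text. Returns nil when a required field is missing.
def memory_usage_fraction(meminfo_text)
  # Build { "MemTotal" => 8000000, ... } from "Key:   value kB" lines
  fields = meminfo_text.each_line.to_h do |line|
    key, value = line.split(':', 2)
    [key, value.to_s.split.first.to_i]
  end

  total = fields['MemTotal']
  available = fields['MemAvailable']
  return nil if total.nil? || available.nil? || total.zero?

  (total - available).to_f / total
end

# Made-up sample: 8 GB total, 2 GB available
sample = <<~MEMINFO
  MemTotal:       8000000 kB
  MemFree:        1000000 kB
  MemAvailable:   2000000 kB
MEMINFO

puts memory_usage_fraction(sample) # prints 0.75
```

With the 0.9 threshold used by the health check, this sample would be reported as healthy.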
### Docker Compose for Development

```yaml
# docker-compose.yml
version: '3.8'

services:
  rcrewai:
    build:
      context: .
      dockerfile: Dockerfile
      target: production
    ports:
      - "8080:8080"
    environment:
      - RAILS_ENV=production
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - LLM_PROVIDER=openai
      - LLM_TEMPERATURE=0.1
      - EXECUTION_TIMEOUT=300
      - MAX_RETRIES=3
    depends_on:
      - redis
      - prometheus
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "ruby", "health_check.rb"]
      interval: 30s
      timeout: 10s
      retries: 3

  redis:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes:
      - redis_data:/data
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
    restart: unless-stopped

volumes:
  redis_data:
  prometheus_data:
  grafana_data:
```

## Kubernetes Deployment

### Deployment Configuration

```yaml
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rcrewai-app
  labels:
    app: rcrewai
    version: v1.0.0
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  selector:
    matchLabels:
      app: rcrewai
  template:
    metadata:
      labels:
        app: rcrewai
        version: v1.0.0
    spec:
      serviceAccountName: rcrewai-service-account
      containers:
      - name: rcrewai
        image: your-registry/rcrewai:v1.0.0
        ports:
        - containerPort: 8080
        env:
        - name: RAILS_ENV
          value: "production"
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: rcrewai-secrets
              key: openai-api-key
        - name: REDIS_URL
          value: "redis://redis-service:6379"
        - name: LLM_PROVIDER
          value: "openai"
        - name: LLM_TEMPERATURE
          value: "0.1"
        - name: EXECUTION_TIMEOUT
          value: "300"
        - name: MAX_RETRIES
          value: "3"
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 30
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        securityContext:
          runAsNonRoot: true
          runAsUser: 1000
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
        volumeMounts:
        - name: tmp-volume
          mountPath: /tmp
      volumes:
      - name: tmp-volume
        emptyDir: {}
      imagePullSecrets:
      - name: registry-secret
---
apiVersion: v1
kind: Service
metadata:
  name: rcrewai-service
  labels:
    app: rcrewai
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
    name: http
  selector:
    app: rcrewai
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rcrewai-service-account
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rcrewai-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # ingress-nginx rate limiting: requests per second per client IP
    nginx.ingress.kubernetes.io/limit-rps: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  tls:
  - hosts:
    - api.yourcompany.com
    secretName: rcrewai-tls
  rules:
  - host: api.yourcompany.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: rcrewai-service
            port:
              number: 80
```

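The `rollingUpdate` settings in the Deployment bound how many pods exist during a rollout. A minimal sketch of that arithmetic, assuming absolute values for `maxSurge` and `maxUnavailable` (Kubernetes also accepts percentages, which this sketch does not model); `rolling_update_bounds` is a hypothetical helper name.

```ruby
# Hypothetical helper: pod-count bounds during a rolling update when
# maxSurge and maxUnavailable are given as absolute numbers.
def rolling_update_bounds(replicas:, max_surge:, max_unavailable:)
  {
    min_available: replicas - max_unavailable, # pods guaranteed to keep serving
    max_total: replicas + max_surge            # upper bound on old + new pods
  }
end

# Values from the Deployment above: 3 replicas, maxSurge 1, maxUnavailable 1
bounds = rolling_update_bounds(replicas: 3, max_surge: 1, max_unavailable: 1)
puts "at least #{bounds[:min_available]} available, at most #{bounds[:max_total]} total"
```

For the manifest above this means at least 2 pods stay available and at most 4 exist at once, which is worth checking against the memory and CPU quota of the namespace.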
### ConfigMap and Secrets

```yaml
# k8s/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rcrewai-config
data:
  LLM_PROVIDER: "openai"
  LLM_TEMPERATURE: "0.1"
  LLM_MAX_TOKENS: "4000"
  EXECUTION_TIMEOUT: "300"
  MAX_RETRIES: "3"
  LOG_LEVEL: "INFO"
  METRICS_ENABLED: "true"
---
apiVersion: v1
kind: Secret
metadata:
  name: rcrewai-secrets
type: Opaque
data:
  openai-api-key: <base64-encoded-api-key>
  anthropic-api-key: <base64-encoded-api-key>
  database-url: <base64-encoded-database-url>
```

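Values under `data:` in a Secret must be Base64-encoded. A minimal sketch of producing and checking such a value in Ruby; the helper names and the `sk-example-key` placeholder are illustrative only. Note that Base64 is an encoding, not encryption, so the manifest still needs to be kept out of version control or sealed by other means.

```ruby
# Hypothetical helpers for the `data:` field of a Kubernetes Secret.
# pack('m0') produces Base64 without line breaks.
def k8s_secret_value(plain)
  [plain].pack('m0')
end

def decode_k8s_secret_value(encoded)
  encoded.unpack1('m0')
end

# Placeholder value, not a real key
encoded = k8s_secret_value('sk-example-key')
puts encoded
puts decode_k8s_secret_value(encoded)
```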
### Horizontal Pod Autoscaler

```yaml
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rcrewai-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rcrewai-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
```

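To get a feel for how the utilization targets above translate into replica counts, the core scaling rule documented for the Horizontal Pod Autoscaler is `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured bounds. The sketch below implements only that rule; it ignores the controller's tolerance band and the `behavior` policies, and `desired_replicas` is a hypothetical helper name.

```ruby
# Hypothetical helper: the basic HPA scaling rule, clamped to min/max replicas.
# Does not model the tolerance band or scaleUp/scaleDown policies.
def desired_replicas(current_replicas:, current_utilization:, target_utilization:,
                     min_replicas:, max_replicas:)
  raw = (current_replicas * current_utilization.to_f / target_utilization).ceil
  raw.clamp(min_replicas, max_replicas)
end

# 3 pods averaging 140% CPU against the 70% target
puts desired_replicas(current_replicas: 3, current_utilization: 140,
                      target_utilization: 70, min_replicas: 3, max_replicas: 20)
```

In practice the `scaleUp` policy above (50% per 60 seconds) would spread such a jump over several periods rather than applying it at once.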
## Configuration Management

### Environment-based Configuration

```ruby
# config/production.rb
class ProductionConfig
  def self.configure
    RCrewAI.configure do |config|
      # LLM Provider Configuration
      config.llm_provider = ENV.fetch('LLM_PROVIDER', 'openai').to_sym

      case config.llm_provider
      when :openai
        config.openai_api_key = ENV.fetch('OPENAI_API_KEY')
        config.base_url = ENV['OPENAI_BASE_URL'] # Optional custom endpoint
      when :anthropic
        config.anthropic_api_key = ENV.fetch('ANTHROPIC_API_KEY')
      when :azure
        config.azure_api_key = ENV.fetch('AZURE_OPENAI_API_KEY')
        config.base_url = ENV.fetch('AZURE_OPENAI_ENDPOINT')
        config.api_version = ENV.fetch('AZURE_API_VERSION', '2023-05-15')
      when :google
        config.google_api_key = ENV.fetch('GOOGLE_API_KEY')
      end

      # Model Parameters
      config.temperature = ENV.fetch('LLM_TEMPERATURE', '0.1').to_f
      config.max_tokens = ENV.fetch('LLM_MAX_TOKENS', '4000').to_i
      config.timeout = ENV.fetch('LLM_TIMEOUT', '60').to_i

      # Production Settings
      config.retry_limit = ENV.fetch('LLM_RETRY_LIMIT', '3').to_i
      config.retry_delay = ENV.fetch('LLM_RETRY_DELAY', '2').to_i
      config.max_concurrent_requests = ENV.fetch('MAX_CONCURRENT_REQUESTS', '10').to_i

      # Logging
      config.log_level = ENV.fetch('LOG_LEVEL', 'INFO').upcase
      config.structured_logging = ENV.fetch('STRUCTURED_LOGGING', 'true') == 'true'

      # Security
      config.validate_ssl = ENV.fetch('VALIDATE_SSL', 'true') == 'true'
      config.user_agent = "RCrewAI/#{RCrewAI::VERSION} (Production)"
    end
  end

  def self.database_config
    {
      url: ENV.fetch('DATABASE_URL'),
      pool_size: ENV.fetch('DB_POOL_SIZE', '5').to_i,
      checkout_timeout: ENV.fetch('DB_CHECKOUT_TIMEOUT', '5').to_i,
      reaping_frequency: ENV.fetch('DB_REAPING_FREQUENCY', '10').to_i
    }
  end

  def self.redis_config
    {
      url: ENV.fetch('REDIS_URL', 'redis://localhost:6379'),
      timeout: ENV.fetch('REDIS_TIMEOUT', '5').to_i,
      reconnect_attempts: ENV.fetch('REDIS_RECONNECT_ATTEMPTS', '3').to_i
    }
  end

  def self.monitoring_config
    {
      metrics_enabled: ENV.fetch('METRICS_ENABLED', 'true') == 'true',
      traces_enabled: ENV.fetch('TRACES_ENABLED', 'true') == 'true',
      health_check_interval: ENV.fetch('HEALTH_CHECK_INTERVAL', '30').to_i,
      performance_monitoring: ENV.fetch('PERFORMANCE_MONITORING', 'true') == 'true'
    }
  end
end
```

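The configuration class relies on three `ENV.fetch` patterns: a required value with no default, an integer with a default, and a boolean compared against `'true'`. The sketch below isolates those patterns in hypothetical helpers (`env_required`, `env_int`, `env_bool`) that take any Hash-like object, so they can be exercised without touching the real environment. One deliberate difference from the code above: `Integer(...)` raises on malformed input such as `"abc"`, whereas `to_i` would silently return 0.

```ruby
# Hypothetical helpers mirroring the ENV.fetch patterns used in ProductionConfig.
# `env` can be ENV itself or a plain Hash in tests.

# Required setting: raises KeyError at boot when the key is absent
def env_required(env, key)
  env.fetch(key)
end

# Integer setting: falls back to the default, rejects non-numeric input
def env_int(env, key, default)
  Integer(env.fetch(key, default.to_s), 10)
end

# Boolean setting: only the exact string 'true' enables it
def env_bool(env, key, default)
  env.fetch(key, default.to_s) == 'true'
end

settings = { 'LLM_TIMEOUT' => '60', 'METRICS_ENABLED' => 'false' }
puts env_int(settings, 'LLM_TIMEOUT', 30)
puts env_bool(settings, 'METRICS_ENABLED', true)
```

Failing at startup on a missing or malformed value is usually easier to diagnose than a misconfigured service that starts and then misbehaves.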
### Secrets Management with Vault

```ruby
# config/vault_client.rb
require 'logger'
require 'vault'

class VaultClient
  # The tutorial app is Sinatra-based, so a plain Logger is used instead of Rails.logger
  def initialize(logger: Logger.new($stdout))
    @logger = logger

    Vault.configure do |config|
      config.address = ENV.fetch('VAULT_ADDR')
      config.token = ENV['VAULT_TOKEN']
      config.ssl_verify = ENV.fetch('VAULT_SSL_VERIFY', 'true') == 'true'
    end
  end

  def get_secret(path)
    secret = Vault.logical.read(path)
    # KV v2 paths ("secret/data/...") nest the payload under a :data key
    secret&.data&.dig(:data)
  rescue Vault::VaultError => e
    @logger.error "Vault error: #{e.message}"
    raise
  end

  def get_database_credentials
    get_secret('secret/data/database')
  end

  def get_llm_api_keys
    get_secret('secret/data/llm_providers')
  end

  def refresh_secrets
    # Implement secret rotation logic
    new_secrets = get_llm_api_keys

    if new_secrets
      ENV['OPENAI_API_KEY'] = new_secrets[:openai_api_key]
      ENV['ANTHROPIC_API_KEY'] = new_secrets[:anthropic_api_key]

      # Reconfigure RCrewAI with new secrets
      ProductionConfig.configure
    end
  end
end

# Periodic secret refresh
Thread.new do
  logger = Logger.new($stdout)
  vault_client = VaultClient.new(logger: logger)

  loop do
    sleep(3600) # Refresh every hour

    begin
      vault_client.refresh_secrets
    rescue => e
      logger.error "Secret refresh failed: #{e.message}"
    end
  end
end
```

|
896
|
+
## Monitoring and Observability
|
897
|
+
|
898
|
+
### Prometheus Metrics
|
899
|
+
|
900
|
+
```ruby
|
901
|
+
# lib/metrics.rb
|
902
|
+
require 'prometheus/client'
|
903
|
+
|
904
|
+
class RCrewAIMetrics
|
905
|
+
def initialize
|
906
|
+
@registry = Prometheus::Client.registry
|
907
|
+
setup_metrics
|
908
|
+
end
|
909
|
+
|
910
|
+
def setup_metrics
|
911
|
+
# Request counters
|
912
|
+
@request_total = @registry.counter(
|
913
|
+
:rcrewai_requests_total,
|
914
|
+
docstring: 'Total number of requests',
|
915
|
+
labels: [:method, :path, :status]
|
916
|
+
)
|
917
|
+
|
918
|
+
@execution_total = @registry.counter(
|
919
|
+
:rcrewai_executions_total,
|
920
|
+
docstring: 'Total number of crew executions',
|
921
|
+
labels: [:crew_name, :status]
|
922
|
+
)
|
923
|
+
|
924
|
+
# Duration histograms
|
925
|
+
    @request_duration = @registry.histogram(
      :rcrewai_request_duration_seconds,
      docstring: 'Request duration in seconds',
      labels: [:method, :path],
      buckets: [0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0]
    )

    @execution_duration = @registry.histogram(
      :rcrewai_execution_duration_seconds,
      docstring: 'Crew execution duration in seconds',
      labels: [:crew_name],
      buckets: [1.0, 5.0, 10.0, 30.0, 60.0, 300.0, 600.0]
    )

    # Gauges
    @active_executions = @registry.gauge(
      :rcrewai_active_executions,
      docstring: 'Number of active executions',
      labels: [:crew_name]
    )

    @memory_usage = @registry.gauge(
      :rcrewai_memory_usage_bytes,
      docstring: 'Memory usage in bytes'
    )

    # LLM API call counter
    @llm_api_calls = @registry.counter(
      :rcrewai_llm_api_calls_total,
      docstring: 'Total LLM API calls',
      labels: [:provider, :model, :status]
    )
  end

  def record_request(method, path, status, duration)
    @request_total.increment(labels: { method: method, path: path, status: status })
    @request_duration.observe(duration, labels: { method: method, path: path })
  end

  def record_execution_start(crew_name)
    @active_executions.increment(labels: { crew_name: crew_name })
  end

  def record_execution_complete(crew_name, status, duration)
    @active_executions.decrement(labels: { crew_name: crew_name })
    @execution_total.increment(labels: { crew_name: crew_name, status: status })
    @execution_duration.observe(duration, labels: { crew_name: crew_name })
  end

  def record_llm_call(provider, model, status)
    @llm_api_calls.increment(labels: { provider: provider, model: model, status: status })
  end

  def update_memory_usage
    memory = get_memory_usage_bytes
    @memory_usage.set(memory)
  end

  private

  # Resident set size of this process; `ps` reports RSS in kilobytes.
  def get_memory_usage_bytes
    `ps -o rss= -p #{Process.pid}`.to_i * 1024
  rescue StandardError
    0
  end
end

# Initialize global metrics instance
$metrics = RCrewAIMetrics.new

# Middleware for automatic metrics collection
class MetricsMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    # Use the monotonic clock so wall-clock adjustments cannot skew durations
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    method = env['REQUEST_METHOD']
    path = env['PATH_INFO']

    status, headers, body = @app.call(env)

    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
    $metrics.record_request(method, path, status.to_s, duration)

    [status, headers, body]
  end
end
```

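The `record_execution_start` and `record_execution_complete` methods above only stay balanced if every start is followed by a complete, including when a crew run raises. The following is a minimal sketch of one way to guarantee that. The helper name `with_execution_metrics` and the `RecordingMetrics` stand-in are assumptions made for illustration and are not part of RCrewAI; any object that responds to the two recorder methods can be passed in.

```ruby
# Hypothetical helper (not part of RCrewAI): wraps a block of work so the
# start and complete recorders are always called as a pair.
def with_execution_metrics(metrics, crew_name)
  metrics.record_execution_start(crew_name)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  status = 'success'
  begin
    yield
  rescue StandardError
    status = 'error'
    raise # re-raise so callers still see the failure
  ensure
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    metrics.record_execution_complete(crew_name, status, duration)
  end
end

# Stand-in for the metrics object, used here only to show the call order.
class RecordingMetrics
  attr_reader :events

  def initialize
    @events = []
  end

  def record_execution_start(crew_name)
    @events << [:start, crew_name]
  end

  def record_execution_complete(crew_name, status, _duration)
    @events << [:complete, crew_name, status]
  end
end

metrics = RecordingMetrics.new
with_execution_metrics(metrics, 'customer_support') { :done }
p metrics.events
```

In an application, the global `$metrics` instance would take the place of `RecordingMetrics`, and the block would contain the actual crew run.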
### Structured Logging

```ruby
# lib/structured_logger.rb
require 'json'
require 'logger'
require 'time' # provides Time#iso8601 on Ruby versions before 3.4

class StructuredLogger
  def initialize(output = $stdout)
    @logger = Logger.new(output)
    @logger.level = Logger.const_get(ENV.fetch('LOG_LEVEL', 'INFO').upcase)
    @logger.formatter = method(:json_formatter)
  end

  def info(message, context = {})
    @logger.info(log_entry('INFO', message, context))
  end

  def warn(message, context = {})
    @logger.warn(log_entry('WARN', message, context))
  end

  def error(message, context = {})
    @logger.error(log_entry('ERROR', message, context))
  end

  def debug(message, context = {})
    @logger.debug(log_entry('DEBUG', message, context))
  end

  private

  # The level is passed in explicitly rather than derived from the call stack.
  def log_entry(level, message, context)
    {
      timestamp: Time.now.utc.iso8601,
      level: level,
      message: message,
      service: 'rcrewai',
      version: RCrewAI::VERSION,
      environment: ENV.fetch('RAILS_ENV', 'development'),
      process_id: Process.pid,
      thread_id: Thread.current.object_id
    }.merge(context)
  end

  def json_formatter(severity, timestamp, progname, msg)
    if msg.is_a?(Hash)
      msg.to_json + "\n"
    else
      {
        timestamp: timestamp.utc.iso8601,
        level: severity,
        message: msg.to_s,
        service: 'rcrewai'
      }.to_json + "\n"
    end
  end
end

# Global logger instance
$logger = StructuredLogger.new
```

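Structured logs are most useful when every entry for a request carries the same identifiers. The sketch below shows one possible way to attach fixed fields, such as a request ID, to every call without repeating them at each call site. `ContextLogger` and `MemorySink` are hypothetical names introduced for this example; the wrapper only assumes the wrapped logger exposes `debug`/`info`/`warn`/`error` methods taking `(message, context)`, as `StructuredLogger` does.

```ruby
require 'json'

# Hypothetical wrapper: merges fixed fields into the context of every entry
# before delegating to the wrapped logger.
class ContextLogger
  def initialize(logger, base_context = {})
    @logger = logger
    @base_context = base_context
  end

  # Returns a new wrapper with additional fixed fields.
  def with(extra)
    self.class.new(@logger, @base_context.merge(extra))
  end

  %i[debug info warn error].each do |level|
    define_method(level) do |message, context = {}|
      @logger.public_send(level, message, @base_context.merge(context))
    end
  end
end

# Stand-in sink for the demo: stores JSON lines in memory instead of
# writing them to $stdout.
class MemorySink
  attr_reader :lines

  def initialize
    @lines = []
  end

  %i[debug info warn error].each do |level|
    define_method(level) do |message, context = {}|
      @lines << { level: level.to_s.upcase, message: message }.merge(context).to_json
    end
  end
end

sink = MemorySink.new
log = ContextLogger.new(sink, request_id: 'req-1').with(crew_name: 'customer_support')
log.info('Execution started', step: 1)
puts sink.lines.last
```

A per-request wrapper like this could be created in middleware and passed down, so that log lines from one execution can be filtered by `request_id`.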
### Distributed Tracing

```ruby
# lib/tracing.rb
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/jaeger'
require 'opentelemetry/instrumentation/all'

class TracingSetup
  def self.configure
    OpenTelemetry::SDK.configure do |c|
      c.service_name = 'rcrewai'
      c.service_version = RCrewAI::VERSION

      c.add_span_processor(
        OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
          # AgentExporter sends spans over UDP and takes host: and port:
          # (6831 is the Jaeger agent's default port). JAEGER_AGENT_PORT is
          # an assumed variable name. Port 14268 belongs to the collector's
          # HTTP endpoint, which is served by CollectorExporter instead.
          OpenTelemetry::Exporter::Jaeger::AgentExporter.new(
            host: ENV.fetch('JAEGER_AGENT_HOST', 'localhost'),
            port: ENV.fetch('JAEGER_AGENT_PORT', '6831').to_i
          )
        )
      )

      c.use_all # Enable all instrumentations
    end
  end

  def self.tracer
    OpenTelemetry.tracer_provider.tracer('rcrewai', RCrewAI::VERSION)
  end
end

# Initialize tracing
TracingSetup.configure if ENV.fetch('TRACING_ENABLED', 'true') == 'true'

# Tracing middleware
class TracingMiddleware
  def initialize(app)
    @app = app
    @tracer = TracingSetup.tracer
  end

  def call(env)
    @tracer.in_span("http_request") do |span|
      span.set_attribute('http.method', env['REQUEST_METHOD'])
      # PATH_INFO is the request path, not the full URL
      span.set_attribute('http.target', env['PATH_INFO'])

      status, headers, body = @app.call(env)

      span.set_attribute('http.status_code', status.to_i)
      span.status = OpenTelemetry::Trace::Status.error if status.to_i >= 400

      [status, headers, body]
    end
  end
end
```

## Scaling and Load Balancing

### Auto-scaling Configuration

```yaml
# k8s/vertical-pod-autoscaler.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: rcrewai-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rcrewai-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: rcrewai
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 1
        memory: 1Gi
```

### Load Balancer Configuration

```nginx
# nginx.conf
# Rate limiting zone. limit_req_zone is only valid in the http context,
# so it is declared outside the server block.
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

upstream rcrewai_backend {
    least_conn;
    server rcrewai-1:8080 max_fails=3 fail_timeout=30s;
    server rcrewai-2:8080 max_fails=3 fail_timeout=30s;
    server rcrewai-3:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name api.yourcompany.com;

    # Rate limiting
    limit_req zone=api burst=20 nodelay;

    # Request timeout
    proxy_read_timeout 300s;
    proxy_connect_timeout 60s;
    proxy_send_timeout 60s;

    location / {
        proxy_pass http://rcrewai_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Passive failover: retry on the next upstream server
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 3;
        proxy_next_upstream_timeout 60s;
    }

    location /health {
        access_log off;
        proxy_pass http://rcrewai_backend;
    }

    location /metrics {
        access_log off;
        allow 10.0.0.0/8;
        allow 192.168.0.0/16;
        deny all;
        proxy_pass http://rcrewai_backend;
    }
}
```

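The interaction between `rate=10r/s` and `burst=20 nodelay` is easy to misread, so here is a rough Ruby model of the accounting that `limit_req` performs per client. It is an approximation written for illustration, not nginx source code, and the class name `LimitReqModel` is an assumption. Each accepted request adds one unit of "excess", the excess drains at the configured rate, and a request is rejected when the excess would go above `burst`.

```ruby
# Rough model of nginx limit_req accounting for a single client key.
class LimitReqModel
  def initialize(rate:, burst:)
    @rate = rate.to_f # sustained requests per second
    @burst = burst    # extra requests tolerated above the rate
    @excess = 0.0
    @last = nil
  end

  # Returns true if a request arriving at time `now` (seconds) is accepted.
  def allow?(now)
    if @last.nil?
      @last = now
      return true
    end

    # Drain the excess for the elapsed time, then count this request.
    excess = [@excess - @rate * (now - @last) + 1, 0.0].max
    return false if excess > @burst # rejected requests do not change state

    @excess = excess
    @last = now
    true
  end
end

limiter = LimitReqModel.new(rate: 10, burst: 20)
burst_accepted = 30.times.count { limiter.allow?(0.0) }
later_accepted = 30.times.count { limiter.allow?(1.0) }
puts "accepted at t=0s: #{burst_accepted}, accepted at t=1s: #{later_accepted}"
```

Under this model, 21 of 30 simultaneous requests are accepted (one plus the burst of 20), and one second later 10 more are accepted, matching the sustained rate. With `nodelay`, the accepted burst is forwarded immediately; without it, nginx would queue those requests and release them at the configured rate.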
## Security and Access Control

### Network Policies

```yaml
# k8s/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: rcrewai-network-policy
spec:
  podSelector:
    matchLabels:
      app: rcrewai
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-system
    - podSelector:
        matchLabels:
          app: load-balancer
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to: []
    ports:
    - protocol: TCP
      port: 443  # HTTPS
    - protocol: TCP
      port: 53   # DNS
    - protocol: UDP
      port: 53   # DNS
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - protocol: TCP
      port: 6379
```

### Pod Security Standards

```yaml
# k8s/pod-security-policy.yaml
# Note: PodSecurityPolicy was deprecated in Kubernetes v1.21 and removed in
# v1.25. This manifest only applies to older clusters; newer clusters use
# Pod Security Admission instead.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: rcrewai-psp
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
  - ALL
  volumes:
  - 'configMap'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  - 'downwardAPI'
  - 'persistentVolumeClaim'
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
```

### Authentication and Authorization

```ruby
# lib/auth.rb
require 'json'
require 'jwt'

class AuthenticationMiddleware
  PUBLIC_PATHS = ['/health', '/ready'].freeze

  def initialize(app)
    @app = app
    @secret = ENV.fetch('JWT_SECRET')
  end

  def call(env)
    # Skip auth for health checks
    return @app.call(env) if PUBLIC_PATHS.include?(env['PATH_INFO'])

    auth_header = env['HTTP_AUTHORIZATION']
    return unauthorized_response unless auth_header&.start_with?('Bearer ')

    token = auth_header.delete_prefix('Bearer ')

    # Only the decode step is rescued, so errors raised by the downstream
    # app are never mistaken for an invalid token.
    begin
      payload = JWT.decode(token, @secret, true, algorithm: 'HS256')[0]
    rescue JWT::DecodeError
      return unauthorized_response
    end

    env['user_id'] = payload['user_id']
    env['permissions'] = payload['permissions'] || []

    @app.call(env)
  end

  private

  def unauthorized_response
    [401, {'Content-Type' => 'application/json'}, [
      { error: 'Unauthorized' }.to_json
    ]]
  end
end

class AuthorizationMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    permissions = env['permissions'] || []
    path = env['PATH_INFO']
    method = env['REQUEST_METHOD']

    required_permission = determine_required_permission(method, path)

    if required_permission && !permissions.include?(required_permission)
      return forbidden_response
    end

    @app.call(env)
  end

  private

  def determine_required_permission(method, path)
    case [method, path]
    when ['POST', '/execute']
      'execute_crew'
    when ['GET', '/metrics']
      'view_metrics'
    else
      nil # No special permission required
    end
  end

  def forbidden_response
    [403, {'Content-Type' => 'application/json'}, [
      { error: 'Forbidden' }.to_json
    ]]
  end
end
```

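To exercise these middlewares, a client needs a token whose payload carries the `user_id` and `permissions` claims they read. The sketch below builds and checks an HS256 token using only the Ruby standard library, mainly to show what the signature and expiry checks amount to. The function names `mint_token` and `verify_token` are assumptions made for this example. In practice the `jwt` gem's `JWT.encode(payload, secret, 'HS256')` would normally be used to issue tokens, and `OpenSSL.secure_compare` requires Ruby 3.0 or newer.

```ruby
require 'json'
require 'openssl'

# Base64url without padding, the encoding JWT uses for each segment.
def b64url(data)
  [data].pack('m0').tr('+/', '-_').delete('=')
end

def b64url_decode(text)
  padded = text + '=' * ((4 - text.length % 4) % 4)
  padded.tr('-_', '+/').unpack1('m0')
end

# Builds an HS256-signed token with the claims the middleware reads.
def mint_token(secret, user_id:, permissions:, ttl: 3600, now: Time.now)
  header = b64url({ alg: 'HS256', typ: 'JWT' }.to_json)
  claims = b64url({ user_id: user_id, permissions: permissions, exp: now.to_i + ttl }.to_json)
  signature = OpenSSL::HMAC.digest('SHA256', secret, "#{header}.#{claims}")
  "#{header}.#{claims}.#{b64url(signature)}"
end

# Returns the claims hash if the signature matches and the token has not
# expired; returns nil otherwise.
def verify_token(token, secret, now: Time.now)
  parts = token.split('.')
  return nil unless parts.length == 3

  header, claims, signature = parts
  expected = b64url(OpenSSL::HMAC.digest('SHA256', secret, "#{header}.#{claims}"))
  return nil unless OpenSSL.secure_compare(expected, signature)

  payload = JSON.parse(b64url_decode(claims))
  return nil if payload['exp'] && payload['exp'] <= now.to_i

  payload
end

token = mint_token('dev-only-secret', user_id: 42, permissions: ['execute_crew'])
puts "Authorization: Bearer #{token}"
p verify_token(token, 'dev-only-secret')
```

A token minted this way with the same secret as `JWT_SECRET` should be accepted by `AuthenticationMiddleware`, and the `execute_crew` permission is what `AuthorizationMiddleware` looks for on `POST /execute`.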
## CI/CD Pipeline

### GitHub Actions Workflow

```yaml
# .github/workflows/deploy.yml
name: Deploy to Production

on:
  push:
    branches: [main]
    tags: ['v*']

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3

    - name: Set up Ruby
      uses: ruby/setup-ruby@v1
      with:
        ruby-version: '3.1'
        bundler-cache: true

    - name: Run tests
      run: bundle exec rspec

    - name: Run security scan
      run: |
        bundle exec bundle-audit check --update
        bundle exec brakeman -q -w2

    - name: Check code style
      run: bundle exec rubocop

  build:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
    - uses: actions/checkout@v3

    - name: Log in to Container Registry
      uses: docker/login-action@v2
      with:
        registry: ${{ env.REGISTRY }}
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}

    - name: Extract metadata
      id: meta
      uses: docker/metadata-action@v4
      with:
        images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
        # The raw commit SHA tag is what the deploy job references below
        tags: |
          type=ref,event=branch
          type=ref,event=pr
          type=semver,pattern={{version}}
          type=semver,pattern={{major}}.{{minor}}
          type=raw,value=${{ github.sha }}

    - name: Build and push Docker image
      uses: docker/build-push-action@v4
      with:
        context: .
        push: true
        tags: ${{ steps.meta.outputs.tags }}
        labels: ${{ steps.meta.outputs.labels }}

  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment: production
    if: github.ref == 'refs/heads/main'
    steps:
    - uses: actions/checkout@v3

    - name: Configure kubectl
      uses: azure/k8s-set-context@v1
      with:
        method: kubeconfig
        kubeconfig: ${{ secrets.KUBE_CONFIG }}

    - name: Deploy to Kubernetes
      run: |
        kubectl set image deployment/rcrewai-app \
          rcrewai=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
        kubectl rollout status deployment/rcrewai-app --timeout=600s

    - name: Run smoke tests
      run: |
        kubectl wait --for=condition=ready pod -l app=rcrewai --timeout=300s
        ./scripts/smoke-tests.sh
```

### Deployment Scripts

```bash
#!/bin/bash
# scripts/deploy.sh

set -euo pipefail

ENVIRONMENT=${1:-production}
IMAGE_TAG=${2:-latest}

echo "Deploying RCrewAI to $ENVIRONMENT with tag $IMAGE_TAG"

# Update deployment with new image
kubectl set image deployment/rcrewai-app \
  rcrewai="ghcr.io/yourorg/rcrewai:$IMAGE_TAG" \
  --namespace="$ENVIRONMENT"

# Wait for rollout to complete
kubectl rollout status deployment/rcrewai-app \
  --namespace="$ENVIRONMENT" \
  --timeout=600s

# Verify deployment
echo "Verifying deployment..."
kubectl get pods -l app=rcrewai --namespace="$ENVIRONMENT"

# Run health check
echo "Running health check..."
HEALTH_URL=$(kubectl get service rcrewai-service --namespace="$ENVIRONMENT" \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

curl -f "http://$HEALTH_URL/health" || {
  echo "Health check failed!"
  exit 1
}

echo "Deployment successful!"
```

```bash
#!/bin/bash
# scripts/smoke-tests.sh

set -euo pipefail

SERVICE_URL=${SERVICE_URL:-http://localhost:8080}

echo "Running smoke tests against $SERVICE_URL"

# Test 1: Health check
echo "Testing health endpoint..."
response=$(curl -s -o /dev/null -w "%{http_code}" "$SERVICE_URL/health")
if [[ $response != "200" ]]; then
  echo "Health check failed: $response"
  exit 1
fi

# Test 2: Ready check
echo "Testing readiness endpoint..."
response=$(curl -s -o /dev/null -w "%{http_code}" "$SERVICE_URL/ready")
if [[ $response != "200" ]]; then
  echo "Readiness check failed: $response"
  exit 1
fi

# Test 3: Metrics endpoint
# Accept 401/403: the endpoint may sit behind JWT auth or an IP allow-list
echo "Testing metrics endpoint..."
response=$(curl -s -o /dev/null -w "%{http_code}" "$SERVICE_URL/metrics")
if [[ $response != "200" ]] && [[ $response != "401" ]] && [[ $response != "403" ]]; then
  echo "Metrics check failed: $response"
  exit 1
fi

# Test 4: Basic execution (if auth allows)
echo "Testing basic execution..."
response=$(curl -s -X POST "$SERVICE_URL/execute" \
  -H "Content-Type: application/json" \
  -d '{"crew_name": "customer_support", "request_id": "test-123"}' \
  -w "%{http_code}" -o /dev/null)

# Accept 401/403 for auth-protected endpoints
if [[ $response != "200" ]] && [[ $response != "401" ]] && [[ $response != "403" ]]; then
  echo "Execution test failed: $response"
  exit 1
fi

echo "All smoke tests passed!"
```

## Operational Procedures

### Monitoring Dashboard

Save the following as `grafana/dashboards/rcrewai-dashboard.json`:

```json
{
  "dashboard": {
    "title": "RCrewAI Production Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(rcrewai_requests_total[5m])",
            "legendFormat": "{{method}} {{path}}"
          }
        ]
      },
      {
        "title": "Response Times",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(rcrewai_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.50, rate(rcrewai_request_duration_seconds_bucket[5m]))",
            "legendFormat": "50th percentile"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(rcrewai_requests_total{status=~\"5..\"}[5m])) / sum(rate(rcrewai_requests_total[5m])) * 100",
            "legendFormat": "Error Rate %"
          }
        ]
      },
      {
        "title": "Crew Executions",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(rcrewai_executions_total[5m])",
            "legendFormat": "{{crew_name}} {{status}}"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "rcrewai_memory_usage_bytes / 1024 / 1024",
            "legendFormat": "Memory MB"
          }
        ]
      },
      {
        "title": "Active Executions",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rcrewai_active_executions)",
            "legendFormat": "Active"
          }
        ]
      }
    ]
  }
}
```

The error-rate expression wraps both sides in `sum(...)`. Without the aggregation, PromQL divides series with identical labels, so each 5xx series would be divided by itself and the panel would always show 100%.

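The response-time panels rely on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw samples. The following Ruby sketch illustrates the general idea with linear interpolation inside the matching bucket. It is a simplified model for intuition, not Prometheus's implementation, and the function name `estimate_quantile` and the sample counts are assumptions.

```ruby
# Estimates quantile q (0 < q <= 1) from cumulative histogram buckets.
# `buckets` maps each upper bound (le) to its cumulative count and must
# include Float::INFINITY as the last bucket.
def estimate_quantile(q, buckets)
  sorted = buckets.sort_by { |le, _count| le }
  total = sorted.last[1]
  return Float::NAN if total.zero?

  rank = q * total
  prev_le = 0.0
  prev_count = 0
  sorted.each do |le, count|
    if count >= rank
      # The quantile falls in the +Inf bucket: report the highest finite bound.
      return prev_le if le.infinite?

      # Assume observations are spread evenly within the bucket.
      return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count).to_f
    end
    prev_le = le
    prev_count = count
  end
end

# Hypothetical cumulative counts for the request-duration buckets used above.
request_buckets = {
  0.1 => 10, 0.5 => 50, 1.0 => 80, 5.0 => 95,
  10.0 => 100, 30.0 => 100, 60.0 => 100, Float::INFINITY => 100
}
puts "p50 estimate: #{estimate_quantile(0.5, request_buckets)}s"
puts "p90 estimate: #{estimate_quantile(0.9, request_buckets)}s"
```

Because the estimate can never be finer than the bucket boundaries, the choice of `buckets:` in the histogram definitions directly limits how precise the dashboard percentiles and the `HighResponseTime`-style alerts can be.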
### Alerting Rules

```yaml
# prometheus/alerts.yml
groups:
- name: rcrewai
  rules:
  - alert: HighErrorRate
    # sum() on both sides: dividing unaggregated series matches each 5xx
    # series against itself and always yields 1
    expr: sum(rate(rcrewai_requests_total{status=~"5.."}[5m])) / sum(rate(rcrewai_requests_total[5m])) > 0.05
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

  - alert: HighResponseTime
    expr: histogram_quantile(0.95, rate(rcrewai_request_duration_seconds_bucket[5m])) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High response time detected"
      description: "95th percentile response time is {{ $value }}s"

  - alert: ServiceDown
    expr: up{job="rcrewai"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "RCrewAI service is down"
      description: "RCrewAI service has been down for more than 1 minute"

  - alert: HighMemoryUsage
    expr: rcrewai_memory_usage_bytes / 1024 / 1024 / 1024 > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage"
      description: "Memory usage is {{ $value }}GB"

  - alert: TooManyActiveExecutions
    expr: sum(rcrewai_active_executions) > 50
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Too many active executions"
      description: "{{ $value }} executions are currently active"
```

### Backup and Recovery

```bash
#!/bin/bash
# scripts/backup.sh

set -euo pipefail

BACKUP_DIR=${BACKUP_DIR:-/backups}
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

echo "Starting backup at $TIMESTAMP"

# Backup configuration
# Note: the secrets file contains base64-encoded (not encrypted) values,
# so store it in a protected location.
kubectl get configmap rcrewai-config -o yaml > "$BACKUP_DIR/config_$TIMESTAMP.yaml"
kubectl get secret rcrewai-secrets -o yaml > "$BACKUP_DIR/secrets_$TIMESTAMP.yaml"

# Backup persistent data (if any)
# No -it here: a TTY would corrupt the binary tar stream and fails in
# non-interactive runs such as cron.
if kubectl get pvc rcrewai-data >/dev/null 2>&1; then
  kubectl exec deployment/rcrewai-app -- tar czf - /data > "$BACKUP_DIR/data_$TIMESTAMP.tar.gz"
fi

# Cleanup old backups (keep last 30 days)
find "$BACKUP_DIR" -name "*.yaml" -mtime +30 -delete
find "$BACKUP_DIR" -name "*.tar.gz" -mtime +30 -delete

echo "Backup completed successfully"
```

## Troubleshooting and Recovery

### Common Issues and Solutions

#### High Memory Usage
```bash
# Check memory usage
kubectl top pods -l app=rcrewai

# Check for memory leaks
kubectl exec -it deployment/rcrewai-app -- ps aux

# Restart pods if needed
kubectl rollout restart deployment/rcrewai-app
```

#### Slow Response Times
```bash
# Check CPU usage
kubectl top pods -l app=rcrewai

# Scale up if needed
kubectl scale deployment rcrewai-app --replicas=5

# Check database connections
kubectl logs deployment/rcrewai-app | grep -i "database\|connection"
```

#### Failed Deployments
```bash
# Check rollout status
kubectl rollout status deployment/rcrewai-app

# Check pod logs
kubectl logs deployment/rcrewai-app --previous

# Rollback if needed
kubectl rollout undo deployment/rcrewai-app
```

### Recovery Procedures

#### Complete Service Recovery
```bash
#!/bin/bash
# scripts/disaster-recovery.sh

set -euo pipefail

echo "Starting disaster recovery procedure"

# 1. Restore configuration
# The *_latest files are assumed to be copies of (or links to) the most
# recent timestamped backups.
kubectl apply -f backups/config_latest.yaml
kubectl apply -f backups/secrets_latest.yaml

# 2. Deploy application
kubectl apply -f k8s/

# 3. Wait for deployment
kubectl wait --for=condition=available deployment/rcrewai-app --timeout=600s

# 4. Restore data if needed
# Use -i without -t: stdin must be attached to stream the archive, and a
# TTY would corrupt the binary data.
if [[ -f "backups/data_latest.tar.gz" ]]; then
  kubectl exec -i deployment/rcrewai-app -- tar xzf - -C / < backups/data_latest.tar.gz
fi

# 5. Verify service
./scripts/smoke-tests.sh

echo "Disaster recovery completed"
```

## Best Practices Summary

### 1. **Security**
- Use non-root containers
- Implement network policies
- Manage secrets properly
- Enable authentication and authorization
- Regular security scans

### 2. **Reliability**
- Health and readiness checks
- Resource limits and requests
- Graceful shutdown handling
- Circuit breakers for external calls
- Comprehensive error handling

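Of the reliability items above, circuit breakers are the least self-explanatory, so here is a minimal sketch of the pattern as it might wrap an external call such as an LLM API request. The class `CircuitBreaker`, its parameters, and the injected clock are illustrative assumptions; RCrewAI does not ship this class, and the sketch is not thread-safe.

```ruby
# Minimal circuit breaker: after `failure_threshold` consecutive failures
# the circuit opens and calls fail fast until `reset_timeout` seconds pass.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(failure_threshold: 3, reset_timeout: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @reset_timeout = reset_timeout
    @clock = clock # injectable so the timeout can be tested without sleeping
    @failures = 0
    @opened_at = nil
  end

  def call
    raise OpenError, 'circuit open' if open?

    result = yield
    # A success closes the circuit and clears the failure count.
    @failures = 0
    @opened_at = nil
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = @clock.call if @failures >= @failure_threshold
    raise
  end

  private

  # Open until the timeout elapses; afterwards one trial call is let through.
  def open?
    !@opened_at.nil? && (@clock.call - @opened_at) < @reset_timeout
  end
end

breaker = CircuitBreaker.new(failure_threshold: 2, reset_timeout: 30)
3.times do |attempt|
  begin
    breaker.call { raise IOError, 'provider unavailable' }
  rescue CircuitBreaker::OpenError
    puts "attempt #{attempt + 1}: rejected without calling the provider"
  rescue IOError => e
    puts "attempt #{attempt + 1}: #{e.message}"
  end
end
```

Failing fast while the circuit is open keeps worker threads from piling up behind a provider that is already timing out, which is the main reason the pattern appears alongside resource limits and graceful shutdown.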
### 3. **Scalability**
- Horizontal pod autoscaling
- Load balancing
- Stateless application design
- Resource optimization
- Performance monitoring

### 4. **Observability**
- Structured logging
- Comprehensive metrics
- Distributed tracing
- Real-time alerting
- Dashboard visualization

### 5. **Operations**
- Automated deployments
- Blue-green deployments
- Backup and recovery procedures
- Incident response playbooks
- Regular performance reviews

This guide provides a foundation for running RCrewAI applications at scale, covering security, reliability, scalability, observability, and day-to-day operations.