@mevdragon/vidfarm-devcli 0.1.0

@@ -0,0 +1,898 @@
1
+ # Vidfarm Platform Spec
2
+
3
+ This document defines the initial platform architecture for Vidfarm running on a single Dockerized EC2 host.
4
+
5
+ The goal is to let in-house developers build new video production templates quickly while the platform centrally owns auth, billing, job orchestration, customer state, and deployment.
6
+
7
+ ## Goals
8
+
9
+ 1. Support multiple video production patterns behind one consistent API.
10
+ 2. Treat every operation as an async job that immediately returns `job_id`.
11
+ 3. Let template developers write normal TypeScript/Node code with normal npm dependencies.
12
+ 4. Keep the first production deployment simple enough to run on one EC2 Docker host.
13
+ 5. Preserve a clean path to later split heavy workloads into isolated workers or separate services.
14
+
15
+ ## Non-Goals
16
+
17
+ 1. Public marketplace for third-party user-submitted templates.
18
+ 2. Multi-region or scale-to-zero serverless deployment in v1.
19
+ 3. Full microservice isolation for every template.
20
+ 4. Perfect cost attribution in v1. The first target is safe, conservative billing that protects gross margin.
21
+
22
+ ## Core Decision
23
+
24
+ Vidfarm should run as one shared platform container on EC2.
25
+
26
+ Templates should be packaged as internal code modules loaded by that platform, not as separately deployed HTTP services by default.
27
+ Platform runtime code should live under `src/*`, while template implementation code should live outside the platform tree under `templates/<template-folder>/*`.
28
+
29
+ This gives us:
30
+
31
+ 1. One auth and billing boundary.
32
+ 2. One job table and one queueing system.
33
+ 3. One deployment artifact for the normal case.
34
+ 4. Full npm freedom for template developers.
35
+ 5. A cleaner developer experience than forcing every template to become its own service.
36
+
37
+ Templates may still opt into isolated execution later when they have special requirements such as native binaries, unusually high memory use, or independent scaling needs.
38
+
39
+ ## Initial Runtime Choice
40
+
41
+ ### Production Runtime
42
+
43
+ - Node.js 22
44
+ - TypeScript
45
+ - Hono for the HTTP API
46
+ - SQLite for v1 job state and queue state
47
+ - S3 for customer files and generated artifacts
48
+ - Remotion Lambda for final render workloads where appropriate
49
+
50
+ ### Why Hono
51
+
52
+ Hono is a good fit for the control plane:
53
+
54
+ 1. Small and fast.
55
+ 2. Strong middleware model.
56
+ 3. Good TypeScript ergonomics.
57
+ 4. Easy to keep the HTTP layer thin while most complexity lives in jobs and template execution.
58
+
59
+ ### Why Not Bun for v1 Runtime
60
+
61
+ Bun is not forbidden, but it should not be the production baseline for v1.
62
+
63
+ Reasons:
64
+
65
+ 1. The platform already assumes Node-oriented Docker execution.
66
+ 2. AI SDK compatibility and native package behavior are more predictable on Node.
67
+ 3. Remotion and adjacent tooling are safer on the Node compatibility path.
68
+ 4. The hard part of this system is orchestration correctness, not JavaScript runtime speed.
69
+
70
+ If desired, Bun can be evaluated later as a local development runner or for specific internal tools.
71
+
72
+ ## Supported Production Patterns
73
+
74
+ The platform must support all of the following under one model:
75
+
76
+ 1. Pure AI multi-stage production.
77
+ 2. Remotion render pipelines.
78
+ 3. Hybrid research plus render pipelines.
79
+ 4. Animated storytelling pipelines.
80
+
81
+ The common abstraction is:
82
+
83
+ 1. Customer hits a template endpoint.
84
+ 2. Platform validates auth and input.
85
+ 3. Platform creates an async job.
86
+ 4. Worker executes the requested operation.
87
+ 5. Customer polls job state or receives a webhook.
88
+
89
+ ## Architectural Overview
90
+
91
+ ```txt
92
+ Client
93
+ -> Hono API
94
+ -> Auth / Billing / Template Registry / Job Creation
95
+ -> SQLite (jobs, logs, queue, rate-limit state, customer metadata pointers)
96
+ -> Worker Loop in same container
97
+ -> External providers (OpenAI, Gemini, OpenRouter, Perplexity, etc.)
98
+ -> S3 (workspace files, stage artifacts, final outputs)
99
+ -> Remotion Lambda (when render pipeline requires it)
100
+ ```
101
+
102
+ Initial deployment is one Docker image with two logical responsibilities:
103
+
104
+ 1. API server
105
+ 2. Background worker / dispatcher
106
+
107
+ These can run in the same container in v1. If needed later, they can be split into separate process types using the same codebase.
108
+
109
+ ## API Principles
110
+
111
+ All template operations are async-first.
112
+
113
+ Even if a task could finish quickly, the platform should still prefer job creation so the customer sees one consistent model.
114
+
115
+ ### Base Path
116
+
117
+ ```txt
118
+ /templates/:templateId/*
119
+ ```
120
+
121
+ ### Core Endpoints
122
+
123
+ ```txt
124
+ GET /templates/:templateId
125
+ POST /templates/:templateId/config
126
+ POST /templates/:templateId/operations/:operationName
127
+ GET /templates/:templateId/jobs/:jobId
128
+ GET /templates/:templateId/jobs/:jobId/logs
129
+ POST /templates/:templateId/jobs/:jobId/cancel
130
+ ```
131
+
132
+ ### Request Headers
133
+
134
+ ```txt
135
+ vidfarm-user-id: string
136
+ vidfarm-api-key: string
137
+ ```
138
+
139
+ ### Job Creation Request Shape
140
+
141
+ ```json
142
+ {
143
+   "tracer": "client-generated-string",
144
+   "payload": {}
145
+ }
146
+ ```
147
+
148
+ ### Job Creation Response Shape
149
+
150
+ ```json
151
+ {
152
+   "job_id": "job_xxx",
153
+   "tracer": "client-generated-string",
154
+   "status": "queued"
155
+ }
156
+ ```
157
+
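The request and response shapes above can be sketched as a small job-creation function. This is a minimal illustration under stated assumptions: the in-memory `jobs` map stands in for the SQLite job table, and the names `parseJobCreationRequest`, `createJob`, and the zero-padded `job_` id format are hypothetical, not part of the spec.

```typescript
interface JobCreationRequest {
  tracer: string;
  payload: Record<string, unknown>;
}

interface JobCreationResponse {
  job_id: string;
  tracer: string;
  status: "queued";
}

// Hypothetical in-memory job store; the spec keeps this state in SQLite.
const jobs = new Map<string, JobCreationResponse & { payload: Record<string, unknown> }>();
let nextJobNumber = 1;

// Narrow an untyped request body to the documented request shape.
function parseJobCreationRequest(body: unknown): JobCreationRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("body must be a JSON object");
  }
  const { tracer, payload } = body as Record<string, unknown>;
  if (typeof tracer !== "string" || tracer.length === 0) {
    throw new Error("tracer must be a non-empty string");
  }
  if (typeof payload !== "object" || payload === null || Array.isArray(payload)) {
    throw new Error("payload must be a JSON object");
  }
  return { tracer, payload: payload as Record<string, unknown> };
}

// Persist the job and return immediately; no template work happens here.
function createJob(body: unknown): JobCreationResponse {
  const request = parseJobCreationRequest(body);
  const job_id = `job_${String(nextJobNumber++).padStart(6, "0")}`;
  const response: JobCreationResponse = { job_id, tracer: request.tracer, status: "queued" };
  jobs.set(job_id, { ...response, payload: request.payload });
  return response;
}
```

An HTTP handler for `POST /templates/:templateId/operations/:operationName` would call something like `createJob` after auth and template lookup, then return the response body unchanged.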
158
+ ## Template Model
159
+
160
+ Templates are normal TypeScript packages with unrestricted internal structure and should live under a repo-level `templates/` directory, for example `templates/template_0000/*`.
161
+
162
+ They may:
163
+
164
+ 1. Import npm libraries.
165
+ 2. Define helper modules.
166
+ 3. Bundle prompt files.
167
+ 4. Include Remotion compositions.
168
+ 5. Call provider SDKs.
169
+ 6. Run arbitrary internal orchestration logic.
170
+
171
+ The framework should not force templates into a single-file callback model.
172
+
173
+ ### Template Contract
174
+
175
+ The external API surface should be defined as operations, not just raw stage names.
176
+
177
+ Suggested shape:
178
+
179
+ ```ts
180
+ export const myTemplate = defineTemplate({
181
+   id: "ugc-voiceover-v1",
182
+   version: "1.0.0",
183
+   description: "Short-form UGC voiceover pipeline",
184
+   configSchema: z.object({
185
+     defaultProvider: z.enum(["openai", "gemini", "openrouter", "perplexity"]).default("openai")
186
+   }),
187
+
188
+   operations: {
189
+     scaffold: {
190
+       description: "Generate a script scaffold.",
191
+       inputSchema: z.object({
192
+         topic: z.string()
193
+       }),
194
+       workflow: "scaffoldWorkflow",
195
+       providerHint: "openai",
196
+       webhookSupport: true
197
+     },
198
+     render: {
199
+       description: "Submit final render work.",
200
+       inputSchema: z.object({
201
+         storyboardId: z.string()
202
+       }),
203
+       workflow: "renderWorkflow",
204
+       webhookSupport: true
205
+     }
206
+   },
207
+
208
+   jobs: {
209
+     async scaffoldWorkflow(ctx, input) {
210
+       return {
211
+         progress: 1,
212
+         output: {}
213
+       };
214
+     },
215
+     async renderWorkflow(ctx, input) {
216
+       return {
217
+         progress: 1,
218
+         output: {}
219
+       };
220
+     }
221
+   }
222
+ });
223
+ ```
224
+
225
+ ### Why This Contract
226
+
227
+ This separates:
228
+
229
+ 1. Public API entrypoints.
230
+ 2. Internal workflow implementation.
231
+ 3. Template metadata and validation.
232
+
233
+ It gives template developers full control over their workflow logic while keeping the platform contract stable.
234
+
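One way a `defineTemplate` helper could keep the operation-to-workflow link stable is to check it at registration time. The sketch below is an assumption about how such a helper might behave, not the platform's implementation. It omits the zod schemas, and the `OperationDefinition`, `WorkflowFn`, and `TemplateDefinition` types are hypothetical simplifications.

```typescript
// Hypothetical minimal shapes; the contract above also carries zod schemas.
interface OperationDefinition {
  description: string;
  workflow: string;
  providerHint?: string;
  webhookSupport?: boolean;
}

type WorkflowFn = (ctx: unknown, input: unknown) => Promise<{ progress: number; output: unknown }>;

interface TemplateDefinition {
  id: string;
  version: string;
  description: string;
  operations: Record<string, OperationDefinition>;
  jobs: Record<string, WorkflowFn>;
}

// Fail at registration time if an operation points at a workflow that does not exist.
function defineTemplate(definition: TemplateDefinition): TemplateDefinition {
  for (const [operationName, operation] of Object.entries(definition.operations)) {
    if (typeof definition.jobs[operation.workflow] !== "function") {
      throw new Error(
        `Template "${definition.id}": operation "${operationName}" ` +
          `references unknown workflow "${operation.workflow}"`
      );
    }
  }
  return definition;
}

const exampleTemplate = defineTemplate({
  id: "ugc-voiceover-v1",
  version: "1.0.0",
  description: "Short-form UGC voiceover pipeline",
  operations: {
    scaffold: { description: "Generate a script scaffold.", workflow: "scaffoldWorkflow" }
  },
  jobs: {
    async scaffoldWorkflow() {
      return { progress: 1, output: {} };
    }
  }
});
```

Validating the link eagerly means a typo in a `workflow` name surfaces when the template is loaded rather than when a customer first calls the operation.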
235
+ ## Template Execution Context
236
+
237
+ Each operation or job should receive a framework-owned context object.
238
+
239
+ Suggested capabilities:
240
+
241
+ ```ts
242
+ interface TemplateJobContext {
243
+   env: "development" | "production";
244
+   customer: CustomerContext;
245
+   templateConfig: Record<string, unknown>;
246
+   logger: {
247
+     debug(message: string, metadata?: Record<string, unknown>): void;
248
+     info(message: string, metadata?: Record<string, unknown>): void;
249
+     warn(message: string, metadata?: Record<string, unknown>): void;
250
+     error(message: string, metadata?: Record<string, unknown>): void;
251
+     progress(progress: number, message: string, metadata?: Record<string, unknown>): void;
252
+   };
253
+   jobs: {
254
+     enqueueChild(input: {
255
+       operationName: string;
256
+       workflowName: string;
257
+       payload: Record<string, unknown>;
258
+       providerHint?: ProviderType;
259
+     }): Promise<{ jobId: string }>;
260
+   };
261
+   storage: {
262
+     putJson(key: string, value: unknown): Promise<{ key: string; url: string | null }>;
263
+     putText(key: string, value: string, contentType?: string): Promise<{ key: string; url: string | null }>;
264
+     putBuffer(
265
+       key: string,
266
+       value: Uint8Array,
267
+       options?: { contentType?: string; kind?: string; metadata?: Record<string, unknown> }
268
+     ): Promise<{ key: string; url: string | null }>;
269
+     getPublicUrl(key: string): string | null;
270
+   };
271
+   billing: {
272
+     record(input: {
273
+       type: "ai_generation" | "render" | "storage_write" | "cpu_estimate";
274
+       costUsd: number;
275
+       chargeUsd?: number;
276
+       metadata?: Record<string, unknown>;
277
+     }): Promise<void>;
278
+   };
279
+   providers: {
280
+     generateText(input: {
281
+       provider: ProviderType;
282
+       model: string;
283
+       prompt: string;
284
+       temperature?: number;
285
+     }): Promise<{ text: string }>;
286
+     generateImage(input: {
287
+       provider: ProviderType;
288
+       model: string;
289
+       prompt: string;
290
+       size?: string;
291
+     }): Promise<{ bytes: Uint8Array; contentType: string; revisedPrompt: string | null }>;
292
+     analyzeImageLayout(input: {
293
+       provider: ProviderType;
294
+       model: string;
295
+       imageUrl: string;
296
+       overlayText: string;
297
+     }): Promise<{
298
+       zone: "top" | "center" | "bottom";
299
+       align: "left" | "center" | "right";
300
+       maxWidthPercent: number;
301
+       justification: string;
302
+     }>;
303
+   };
304
+   remotion: {
305
+     render(input: {
306
+       compositionId: string;
307
+       serveUrl?: string;
308
+       entryPoint?: string;
309
+       outputKey?: string;
310
+       inputProps: Record<string, unknown>;
311
+     }): Promise<{ renderId: string; outputUrl: string | null; metadata: Record<string, unknown> }>;
312
+   };
313
+ }
314
+ ```
315
+
316
+ Framework-owned context capabilities should include:
317
+
318
+ 1. Resolving customer AI keys safely.
319
+ 2. Writing artifacts through a stable storage prefix.
320
+ 3. Enqueuing child jobs.
321
+ 4. Emitting logs and progress.
322
+ 5. Recording billable events.
323
+ 6. Submitting downstream renders through a Remotion adapter.
324
+ 7. Calling provider adapters through centralized rate-limit enforcement.
325
+
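To illustrate how a workflow might use these capabilities together, here is a sketch of one workflow running against an in-memory fake context. `MiniContext` is a hypothetical subset of the suggested `TemplateJobContext`, and `createFakeContext`, the `script.json` key, and the cost figure are illustrative assumptions rather than platform behavior.

```typescript
// Subset of the suggested TemplateJobContext shape, enough for one workflow.
interface MiniContext {
  logger: { progress(progress: number, message: string): void };
  providers: {
    generateText(input: { provider: string; model: string; prompt: string }): Promise<{ text: string }>;
  };
  storage: { putJson(key: string, value: unknown): Promise<{ key: string; url: string | null }> };
  billing: { record(input: { type: "ai_generation"; costUsd: number }): Promise<void> };
}

// Hypothetical workflow: generate a script, store it, and record the cost.
async function scaffoldWorkflow(ctx: MiniContext, input: { topic: string }) {
  ctx.logger.progress(0.1, "generating script");
  const { text } = await ctx.providers.generateText({
    provider: "openai",
    model: "gpt-4.1",
    prompt: `Write a short voiceover script about ${input.topic}.`
  });
  const stored = await ctx.storage.putJson("script.json", { topic: input.topic, text });
  await ctx.billing.record({ type: "ai_generation", costUsd: 0.024 });
  ctx.logger.progress(1, "script stored");
  return { progress: 1, output: { scriptKey: stored.key } };
}

// In-memory fake context for local dry runs; it records calls instead of reaching providers.
function createFakeContext() {
  const calls: string[] = [];
  const ctx: MiniContext = {
    logger: {
      progress: (progress, message) => {
        calls.push(`progress:${progress}:${message}`);
      }
    },
    providers: {
      generateText: async ({ prompt }) => {
        calls.push("generateText");
        return { text: `stub for: ${prompt}` };
      }
    },
    storage: {
      putJson: async (key) => {
        calls.push(`putJson:${key}`);
        return { key, url: null };
      }
    },
    billing: {
      record: async ({ costUsd }) => {
        calls.push(`billing:${costUsd}`);
      }
    }
  };
  return { ctx, calls };
}
```

Because the workflow only touches the context, the same function can run against real adapters in production and against a fake like this one in tests.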
326
+ ## Environment Behavior
327
+
328
+ The platform must clearly distinguish development from production.
329
+
330
+ ### Development
331
+
332
+ Developer-owned API keys from local `.env` are allowed.
333
+
334
+ This is for:
335
+
336
+ 1. Local testing.
337
+ 2. Template development.
338
+ 3. Dry-running internal workflows before deployment.
339
+
340
+ ### Production
341
+
342
+ The platform must use customer-owned provider keys stored in the customer profile when the template requests external AI inference on behalf of that customer.
343
+
344
+ Platform-controlled keys may still exist for:
345
+
346
+ 1. Platform-level fallback behavior.
347
+ 2. Internal moderation or diagnostics.
348
+ 3. Emergency operations.
349
+
350
+ But customer-billed workloads should default to customer-owned keys when available.
351
+
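The environment rules above could be expressed as a single key-resolution function. This is a sketch under assumptions: the `KeySources` shape, the `source` labels, and the `allowPlatformFallback` switch are hypothetical names introduced for illustration, and key decryption is assumed to have happened already.

```typescript
type Environment = "development" | "production";

interface ResolvedKey {
  secret: string;
  source: "developer_env" | "customer" | "platform_fallback";
}

interface KeySources {
  developerEnvKey?: string;     // from local .env, honored in development only
  customerKeys: string[];       // decrypted customer-owned keys, in preference order
  platformFallbackKey?: string; // platform-controlled key, used only when explicitly allowed
}

// Hypothetical resolution policy; allowPlatformFallback is an assumed per-call switch.
function resolveProviderKey(
  env: Environment,
  sources: KeySources,
  allowPlatformFallback = false
): ResolvedKey {
  if (env === "development" && sources.developerEnvKey) {
    return { secret: sources.developerEnvKey, source: "developer_env" };
  }
  const customerKey = sources.customerKeys[0];
  if (customerKey !== undefined) {
    return { secret: customerKey, source: "customer" };
  }
  if (allowPlatformFallback && sources.platformFallbackKey) {
    return { secret: sources.platformFallbackKey, source: "platform_fallback" };
  }
  throw new Error("no provider key available for this request");
}
```

Returning the `source` alongside the secret lets billing and logging distinguish customer-keyed calls from platform-keyed ones.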
352
+ ## Customer Profile Model
353
+
354
+ Each customer profile should support:
355
+
356
+ 1. Multiple provider API keys.
357
+ 2. Multiple keys per provider.
358
+ 3. Workspace file storage references.
359
+ 4. Webhook destinations.
360
+ 5. Billing preferences and limits.
361
+
362
+ Suggested provider key record:
363
+
364
+ ```ts
365
+ interface CustomerProviderKey {
366
+   id: string;
367
+   provider: "openai" | "gemini" | "openrouter" | "perplexity";
368
+   encryptedSecret: string;
369
+   label?: string;
370
+   status: "active" | "paused" | "rate_limited" | "invalid";
371
+   lastUsedAt?: string;
372
+   cooldownUntil?: string;
373
+ }
374
+ ```
375
+
376
+ Customers may store multiple keys for the same provider. The platform should treat those keys as a small pooled resource that jobs must acquire before making outbound AI requests.
377
+
378
+ ## Queueing and Async Jobs
379
+
380
+ The platform is async-native.
381
+
382
+ Every operation should create a job record and return immediately.
383
+
384
+ ### Job State
385
+
386
+ Suggested states:
387
+
388
+ ```txt
389
+ queued
390
+ running
391
+ waiting_for_child
392
+ waiting_for_human
393
+ succeeded
394
+ failed
395
+ cancelled
396
+ ```
397
+
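The spec lists the states but not which moves between them are legal. The transition table below is therefore an assumption, offered as one plausible reading (for example, `running` back to `queued` models a job rescheduled after a failed key lease). The names `allowedTransitions`, `canTransition`, and `isTerminal` are hypothetical.

```typescript
type JobState =
  | "queued"
  | "running"
  | "waiting_for_child"
  | "waiting_for_human"
  | "succeeded"
  | "failed"
  | "cancelled";

// Assumed transition table; the spec names the states but not the allowed moves.
const allowedTransitions: Record<JobState, JobState[]> = {
  queued: ["running", "cancelled"],
  running: ["queued", "waiting_for_child", "waiting_for_human", "succeeded", "failed", "cancelled"],
  waiting_for_child: ["running", "failed", "cancelled"],
  waiting_for_human: ["running", "failed", "cancelled"],
  succeeded: [],
  failed: [],
  cancelled: []
};

// True when a job may move directly from one state to another.
function canTransition(from: JobState, to: JobState): boolean {
  return allowedTransitions[from].includes(to);
}

// Terminal states have no outgoing transitions.
function isTerminal(state: JobState): boolean {
  return allowedTransitions[state].length === 0;
}
```

Guarding every status update with a check like `canTransition` keeps a late worker write from reviving a job that was already cancelled.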
398
+ ### Job Data
399
+
400
+ Each job record should track:
401
+
402
+ 1. `job_id`
403
+ 2. `template_id`
404
+ 3. `operation_name`
405
+ 4. `tracer`
406
+ 5. `status`
407
+ 6. `payload`
408
+ 7. `result`
409
+ 8. `error`
410
+ 9. `progress`
411
+ 10. `webhook_url`
412
+ 11. `parent_job_id`
413
+ 12. `customer_id`
414
+ 13. `reservation_id` or billing reference
415
+ 14. timestamps, including the `run_after` time used by the scheduler
416
+
417
+ ### Logs
418
+
419
+ Logs must be stored as structured job events, not just raw text.
420
+
421
+ Each event should support:
422
+
423
+ 1. timestamp
424
+ 2. level
425
+ 3. message
426
+ 4. machine-readable metadata
427
+ 5. progress update
428
+ 6. artifact references
429
+
430
+ This lets the client render a live job timeline later.
431
+
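A structured event covering the six fields above might look like the following sketch. The `JobEvent` field names, the 0 to 1 progress range, and the in-memory `eventsByJob` map (standing in for SQLite) are assumptions made for illustration.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Assumed event shape covering the six fields listed above.
interface JobEvent {
  timestamp: string; // ISO 8601
  level: LogLevel;
  message: string;
  metadata: Record<string, unknown>;
  progress?: number; // 0..1, present only when the event carries a progress update
  artifactKeys: string[]; // storage keys of artifacts referenced by this event
}

// Hypothetical in-memory event log keyed by job id; the spec stores events in SQLite.
const eventsByJob = new Map<string, JobEvent[]>();

// Append one event, stamping it with the given time.
function appendJobEvent(
  jobId: string,
  event: Omit<JobEvent, "timestamp">,
  now: Date = new Date()
): JobEvent {
  if (event.progress !== undefined && (event.progress < 0 || event.progress > 1)) {
    throw new Error("progress must be between 0 and 1");
  }
  const stored: JobEvent = { ...event, timestamp: now.toISOString() };
  const existing = eventsByJob.get(jobId) ?? [];
  existing.push(stored);
  eventsByJob.set(jobId, existing);
  return stored;
}

// Latest progress value for a job, which is what a timeline UI would display.
function latestProgress(jobId: string): number | undefined {
  const events = eventsByJob.get(jobId) ?? [];
  for (let i = events.length - 1; i >= 0; i--) {
    const progress = events[i]?.progress;
    if (progress !== undefined) return progress;
  }
  return undefined;
}
```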
432
+ ## SQLite-Backed AI Key Queue
433
+
434
+ SQLite is not only the v1 job store. It is also the coordination layer for customer AI API key usage.
435
+
436
+ The intended model is:
437
+
438
+ 1. A job becomes runnable.
439
+ 2. The worker identifies which provider and model the next step requires.
440
+ 3. The worker attempts to lease one eligible customer key from SQLite.
441
+ 4. If a lease is granted, the worker performs the outbound API call.
442
+ 5. The worker records usage, updates cooldown state if needed, and releases the lease.
443
+
444
+ This gives the platform a lightweight queue for AI key access without needing Redis, SQS, or a separate lock service.
445
+
446
+ ### Why SQLite Is Acceptable in v1
447
+
448
+ This is a reasonable design if all of the following remain true:
449
+
450
+ 1. One EC2 host is the active source of truth.
451
+ 2. The platform runs a moderate number of worker loops.
452
+ 3. SQLite is configured in WAL mode.
453
+ 4. Lease acquisition is done transactionally.
454
+ 5. Jobs are retry-safe and can be rescheduled when no key is available.
455
+
456
+ ### Core Idea
457
+
458
+ The AI key queue is represented by a combination of:
459
+
460
+ 1. Customer provider key records.
461
+ 2. Active key lease records.
462
+ 3. Key usage and error events.
463
+ 4. Cooldown timestamps after rate-limit responses.
464
+
465
+ There is no separate message broker for API key access. Eligibility is derived from database state at lease time.
466
+
467
+ ### Suggested Tables
468
+
469
+ ```sql
470
+ create table customer_provider_keys (
471
+   id text primary key,
472
+   customer_id text not null,
473
+   provider text not null,
474
+   label text,
475
+   encrypted_secret text not null,
476
+   status text not null,
477
+   weight integer not null default 1,
478
+   last_used_at text,
479
+   cooldown_until text,
480
+   disabled_reason text,
481
+   created_at text not null,
482
+   updated_at text not null
483
+ );
484
+
485
+ create table provider_key_leases (
486
+   key_id text primary key,
487
+   lease_token text not null,
488
+   worker_id text not null,
489
+   job_id text not null,
490
+   leased_at text not null,
491
+   expires_at text not null
492
+ );
493
+
494
+ create table provider_key_usage_events (
495
+   id text primary key,
496
+   key_id text not null,
497
+   job_id text not null,
498
+   provider text not null,
499
+   model text,
500
+   event_type text not null,
501
+   input_tokens integer,
502
+   output_tokens integer,
503
+   cost_usd real,
504
+   created_at text not null
505
+ );
506
+ ```
507
+
508
+ Optional model capability table:
509
+
510
+ ```sql
511
+ create table provider_key_capabilities (
512
+   key_id text not null,
513
+   model text not null,
514
+   primary key (key_id, model)
515
+ );
516
+ ```
517
+
518
+ ### Lease Acquisition
519
+
520
+ Workers must acquire a key lease before making any outbound provider request on behalf of a customer.
521
+
522
+ Lease acquisition should happen inside a transaction using `BEGIN IMMEDIATE`.
523
+
524
+ The query should exclude keys that are:
525
+
526
+ 1. Not active.
527
+ 2. In cooldown.
528
+ 3. Already leased and whose lease has not expired.
529
+ 4. Incompatible with the requested provider or model.
530
+
531
+ Preferred selection order in v1:
532
+
533
+ 1. Least recently used eligible key.
534
+ 2. Higher weight first when last-used times tie.
535
+
536
+ Illustrative flow:
537
+
538
+ ```txt
539
+ BEGIN IMMEDIATE
540
+ 1. Select one eligible key
541
+ 2. Insert active lease row
542
+ 3. Commit
543
+ ```
544
+
545
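+ Illustrative query shape. It assumes timestamps are stored as UTC text in the `YYYY-MM-DD HH:MM:SS` format that `datetime('now')` returns, so that string comparison orders them correctly, and it omits the model filter against `provider_key_capabilities` for brevity: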
546
+
547
+ ```sql
548
+ select k.id
549
+ from customer_provider_keys k
550
+ left join provider_key_leases l
551
+   on l.key_id = k.id
552
+   and l.expires_at > datetime('now')
553
+ where k.customer_id = ?
554
+   and k.provider = ?
555
+   and k.status = 'active'
556
+   and (k.cooldown_until is null or k.cooldown_until <= datetime('now'))
557
+   and l.key_id is null
558
+ order by k.last_used_at asc nulls first, k.weight desc
559
+ limit 1;
560
+ ```
561
+
562
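+ If a key is found, the worker inserts a lease row with a short expiry such as 30 to 90 seconds. Because `key_id` is the primary key of `provider_key_leases`, any expired lease row for that key must be deleted or overwritten in the same transaction.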
563
+
564
+ If no key is found, the worker must not busy-loop. It should reschedule the job for a future run.
565
+
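The selection rules above can be modeled in memory to make the eligibility filter and ordering concrete. This sketch is not the SQLite implementation: the `keys` array and `leases` map are hypothetical stand-ins for `customer_provider_keys` and `provider_key_leases`, timestamps are epoch milliseconds rather than text, and the counter-based lease token is a placeholder for a random value. In SQLite the body of `acquireLease` would run inside one `BEGIN IMMEDIATE` transaction.

```typescript
interface ProviderKey {
  id: string;
  customerId: string;
  provider: string;
  status: "active" | "paused" | "rate_limited" | "invalid";
  weight: number;
  lastUsedAt: number | null; // epoch ms; null means never used
  cooldownUntil: number | null; // epoch ms
}

interface Lease {
  keyId: string;
  leaseToken: string;
  workerId: string;
  jobId: string;
  leasedAt: number;
  expiresAt: number;
}

// In-memory stand-ins for customer_provider_keys and provider_key_leases.
const keys: ProviderKey[] = [];
const leases = new Map<string, Lease>();
let leaseCounter = 0;

// Filter eligible keys, order by least recently used then weight, and lease the first.
function acquireLease(
  customerId: string,
  provider: string,
  workerId: string,
  jobId: string,
  now: number,
  ttlMs = 60_000
): Lease | null {
  const eligible = keys
    .filter((k) => k.customerId === customerId && k.provider === provider)
    .filter((k) => k.status === "active")
    .filter((k) => k.cooldownUntil === null || k.cooldownUntil <= now)
    .filter((k) => {
      const lease = leases.get(k.id);
      return lease === undefined || lease.expiresAt <= now;
    })
    .sort((a, b) => {
      // Never-used keys first (like NULLS FIRST), then oldest use, then higher weight.
      const aUsed = a.lastUsedAt ?? -Infinity;
      const bUsed = b.lastUsedAt ?? -Infinity;
      if (aUsed !== bUsed) return aUsed < bUsed ? -1 : 1;
      return b.weight - a.weight;
    });

  const chosen = eligible[0];
  if (chosen === undefined) return null; // caller should reschedule the job, not spin

  leaseCounter += 1;
  const lease: Lease = {
    keyId: chosen.id,
    leaseToken: `lease_${leaseCounter}`,
    workerId,
    jobId,
    leasedAt: now,
    expiresAt: now + ttlMs
  };
  leases.set(chosen.id, lease); // overwrites any expired lease row for this key
  return lease;
}
```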
566
+ ### Lease Semantics
567
+
568
+ Lease rows should contain:
569
+
570
+ 1. `key_id`
571
+ 2. `lease_token`
572
+ 3. `worker_id`
573
+ 4. `job_id`
574
+ 5. `leased_at`
575
+ 6. `expires_at`
576
+
577
+ The lease token should be required for release or extension so one worker cannot accidentally release another worker's lease.
578
+
579
+ ### Lease Expiry and Recovery
580
+
581
+ If a worker crashes, its lease should naturally expire and the key should become eligible again.
582
+
583
+ For long-running requests, the platform may optionally support lease extension heartbeats. This is useful when the provider call or downstream processing can exceed the default lease duration.
584
+
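Token-guarded release and heartbeat extension can be sketched as follows. The `leaseRows` map is a hypothetical in-memory stand-in for `provider_key_leases`, and the rule that an already expired lease cannot be extended is an assumption, chosen so a slow worker cannot reclaim a key that another worker may have taken.

```typescript
interface LeaseRow {
  keyId: string;
  leaseToken: string;
  expiresAt: number; // epoch ms
}

// In-memory stand-in for provider_key_leases, keyed by key_id.
const leaseRows = new Map<string, LeaseRow>();

// Release only if the caller holds the current token; otherwise leave the row untouched.
function releaseLease(keyId: string, leaseToken: string): boolean {
  const row = leaseRows.get(keyId);
  if (row === undefined || row.leaseToken !== leaseToken) return false;
  leaseRows.delete(keyId);
  return true;
}

// Heartbeat: push the expiry forward, but only for a live lease owned by the caller.
function extendLease(keyId: string, leaseToken: string, now: number, ttlMs: number): boolean {
  const row = leaseRows.get(keyId);
  if (row === undefined || row.leaseToken !== leaseToken) return false;
  if (row.expiresAt <= now) return false; // already expired; another worker may take it
  row.expiresAt = now + ttlMs;
  return true;
}
```

In SQL terms both operations reduce to a statement filtered by `key_id` and `lease_token`, with success determined by whether a row was affected.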
585
+ ### Success Path
586
+
587
+ After a successful provider call, the worker should:
588
+
589
+ 1. Record a usage event.
590
+ 2. Update `last_used_at`.
591
+ 3. Clear any temporary rate-limit status when appropriate.
592
+ 4. Release the lease.
593
+
594
+ ### Rate-Limit Path
595
+
596
+ If the provider returns a rate-limit response, the worker should:
597
+
598
+ 1. Record a `rate_limit` usage event.
599
+ 2. Put the key into cooldown by setting `cooldown_until`.
600
+ 3. Release the lease.
601
+ 4. Reschedule the job.
602
+
603
+ Cooldown duration may initially be determined by:
604
+
605
+ 1. Provider response headers if available.
606
+ 2. Provider-specific backoff policy.
607
+ 3. Conservative defaults when headers are absent.
608
+
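The three cooldown sources above can be combined into one calculation. This sketch assumes the provider sends the standard HTTP `Retry-After` header (either delay seconds or an HTTP date). The default of 30 seconds, the 10 minute cap, and the doubling backoff are illustrative values, not figures from the spec.

```typescript
// Assumed defaults; tune per provider.
const DEFAULT_COOLDOWN_MS = 30_000;
const MAX_COOLDOWN_MS = 10 * 60_000;

// Parse a Retry-After header value: either delay seconds or an HTTP date.
function parseRetryAfterMs(value: string | undefined, now: number): number | null {
  if (value === undefined || value.trim() === "") return null;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - now);
}

// Prefer the header, otherwise back off exponentially on consecutive rate limits, capped.
function computeCooldownUntil(
  retryAfterHeader: string | undefined,
  consecutiveRateLimits: number,
  now: number
): number {
  const fromHeader = parseRetryAfterMs(retryAfterHeader, now);
  const backoff = DEFAULT_COOLDOWN_MS * 2 ** Math.max(0, consecutiveRateLimits - 1);
  const delay = Math.min(fromHeader ?? backoff, MAX_COOLDOWN_MS);
  return now + delay;
}
```

The returned epoch value is what a worker would write to `cooldown_until` (after converting to the stored timestamp format) before releasing the lease.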
609
+ ### Auth Failure Path
610
+
611
+ If the provider reports invalid credentials, the worker should:
612
+
613
+ 1. Record an `auth_error` usage event.
614
+ 2. Mark the key `invalid`.
615
+ 3. Release the lease.
616
+ 4. Retry with another key if one exists.
617
+ 5. Fail the job clearly if no valid key remains.
618
+
619
+ ### Scheduler Behavior
620
+
621
+ The scheduler should treat jobs as runnable only when both are true:
622
+
623
+ 1. The job itself is ready to run.
624
+ 2. A compatible provider key can likely be leased now or soon.
625
+
626
+ Recommended loop:
627
+
628
+ 1. Fetch queued jobs ordered by `run_after`.
629
+ 2. Attempt key lease acquisition for the next provider-dependent step.
630
+ 3. If lease succeeds, run the step and mark job `running`.
631
+ 4. If lease fails, move `run_after` forward instead of spinning.
632
+ 5. Retry later.
633
+
634
+ This is what makes the AI key queue effectively a SQLite-backed coordination system rather than a separate infrastructure dependency.
635
+
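The recommended loop can be sketched as a single scheduler pass. The `QueuedJob` shape, the `TryLease` callback signature, and the 5 second deferral step are assumptions for illustration. A real pass would read and update job rows in SQLite rather than mutate an array.

```typescript
interface QueuedJob {
  jobId: string;
  status: "queued" | "running";
  runAfter: number; // epoch ms
  provider: string;
}

// Hypothetical lease callback; returns a lease token, or null when no key is free.
type TryLease = (job: QueuedJob, now: number) => string | null;

const DEFER_MS = 5_000; // assumed deferral step

// One scheduler pass: returns the jobs started in this tick.
function schedulerTick(jobs: QueuedJob[], tryLease: TryLease, now: number): QueuedJob[] {
  const due = jobs
    .filter((job) => job.status === "queued" && job.runAfter <= now)
    .sort((a, b) => a.runAfter - b.runAfter);

  const started: QueuedJob[] = [];
  for (const job of due) {
    const leaseToken = tryLease(job, now);
    if (leaseToken === null) {
      job.runAfter = now + DEFER_MS; // defer instead of spinning
      continue;
    }
    job.status = "running";
    started.push(job);
  }
  return started;
}
```

Calling `schedulerTick` on a timer gives the worker loop a natural idle state: when no keys are free, every due job moves forward in time and the next pass finds nothing to do.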
636
+ ### Observability
637
+
638
+ The platform should emit logs and metrics for:
639
+
640
+ 1. Lease acquisition success rate.
641
+ 2. Lease wait time.
642
+ 3. Key cooldown frequency by provider.
643
+ 4. Key invalidation frequency.
644
+ 5. Job deferrals caused by unavailable keys.
645
+
646
+ These signals will tell us whether SQLite remains sufficient or key coordination needs a stronger backend.
647
+
648
+ ## Rate Limiting and Provider Routing
649
+
650
+ Customer keys are neither interchangeable nor unlimited.
651
+
652
+ The platform must route AI calls through a provider layer that understands:
653
+
654
+ 1. Provider type.
655
+ 2. Model name.
656
+ 3. Key-level rate limits.
657
+ 4. Backoff behavior.
658
+ 5. Retry policy.
659
+ 6. Temporary key disablement after provider errors.
660
+
661
+ ### Initial Approach
662
+
663
+ Use SQLite-backed leasing for queue and key selection in v1.
664
+
665
+ This is acceptable if:
666
+
667
+ 1. One EC2 host is the source of truth.
668
+ 2. SQLite is configured in WAL mode.
669
+ 3. Concurrency expectations remain moderate.
670
+ 4. Jobs are idempotent enough to survive retries.
671
+
672
+ ### Future Upgrade Path
673
+
674
+ If platform concurrency or reliability needs outgrow SQLite, move the job and rate-limit state to Postgres before adopting many-worker horizontal scale.
675
+
676
+ ## Billing Model
677
+
678
+ The platform owns billing enforcement, but templates should emit billing events through framework APIs.
679
+
680
+ Templates should not implement their own pricing logic.
681
+
682
+ ### Billing Principle
683
+
684
+ Bill conservatively enough to avoid cloud-cost loss.
685
+
686
+ Current target:
687
+
688
+ ```txt
689
+ customer_charge_usd ~= platform_cost_usd * 2
690
+ ```
691
+
692
+ This is a margin buffer, not a final finance system.
693
+
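As a worked example of the 2x target, the sketch below converts a platform cost into a customer charge. Rounding up to the next whole cent is an assumption that follows the conservative-billing principle. The function name and the micro-dollar intermediate are illustrative choices.

```typescript
const MARGIN_MULTIPLIER = 2;

// Round up to the next whole cent so rounding never favors under-billing.
// Works in integer micro-dollars to avoid floating point drift.
function customerChargeUsd(platformCostUsd: number): number {
  if (!Number.isFinite(platformCostUsd) || platformCostUsd < 0) {
    throw new Error("platformCostUsd must be a non-negative finite number");
  }
  const costMicros = Math.round(platformCostUsd * 1_000_000);
  const chargeMicros = costMicros * MARGIN_MULTIPLIER;
  const cents = Math.ceil(chargeMicros / 10_000);
  return cents / 100;
}
```

For a platform cost of 0.024 USD the doubled amount is 0.048 USD, which rounds up to a charge of 0.05 USD.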
694
+ ### Billing Event Types
695
+
696
+ The framework should support at least:
697
+
698
+ 1. External AI token usage.
699
+ 2. Remotion render usage.
700
+ 3. EC2 / CPU / memory approximation.
701
+ 4. Storage writes.
702
+ 5. Data egress or expensive file processing when relevant.
703
+
704
+ ### Billing API for Templates
705
+
706
+ Templates should call framework helpers like:
707
+
708
+ ```ts
709
+ await ctx.billing.record({
710
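+   type: "ai_generation",
711
+   // provider and model are not fields of the suggested billing.record input,
712
+   // so this example carries them in metadata.
713
+   costUsd: 0.024,
714
+   metadata: { provider: "openai", model: "gpt-4.1" },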
715
+ });
716
+ ```
717
+
718
+ The framework should translate these events into customer-facing charges.
719
+
720
+ ## Webhooks
721
+
722
+ Every job may include an optional webhook destination.
723
+
724
+ The platform should emit webhook events for:
725
+
726
+ 1. `job.queued`
727
+ 2. `job.running`
728
+ 3. `job.progress`
729
+ 4. `job.succeeded`
730
+ 5. `job.failed`
731
+ 6. `job.cancelled`
732
+
733
+ Webhook delivery should be:
734
+
735
+ 1. Signed
736
+ 2. Retried with backoff
737
+ 3. Persisted as delivery attempts
738
+
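Signing and retry timing could be sketched as follows using Node's built-in `crypto` module. The signing scheme (HMAC-SHA256 over `<timestamp>.<body>`, hex encoded) and the retry schedule (doubling from 5 seconds, capped at 1 hour) are assumptions. The spec only requires that deliveries be signed and retried with backoff.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed scheme: HMAC-SHA256 over "<timestamp>.<body>", hex encoded.
function signWebhook(secret: string, timestamp: number, body: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${body}`).digest("hex");
}

// Constant-time comparison so signature checks do not leak timing information.
function verifyWebhook(secret: string, timestamp: number, body: string, signature: string): boolean {
  const expected = Buffer.from(signWebhook(secret, timestamp, body), "hex");
  const received = Buffer.from(signature, "hex");
  return expected.length === received.length && timingSafeEqual(expected, received);
}

// Assumed retry schedule: exponential backoff from 5 seconds, capped at 1 hour.
// attempt is 1-based: the delay before the first retry uses attempt = 1.
function webhookRetryDelayMs(attempt: number): number {
  return Math.min(5_000 * 2 ** (attempt - 1), 3_600_000);
}
```

Including the timestamp in the signed string lets receivers reject replayed deliveries that fall outside a tolerance window.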
739
+ ## File and Artifact Storage
740
+
741
+ S3 is the system of record for customer-uploaded files and large generated outputs.
742
+
743
+ ### Customer Workspace Convention
744
+
745
+ Suggested logical prefix:
746
+
747
+ ```txt
748
+ s3://bucket/customers/:customerId/workspace/...
749
+ ```
750
+
751
+ ### Job Artifact Convention
752
+
753
+ Suggested logical prefix:
754
+
755
+ ```txt
756
+ s3://bucket/templates/:templateId/users/:userId/jobs/:jobId/...
757
+ ```
758
+
759
+ Artifacts may include:
760
+
761
+ 1. Prompt snapshots
762
+ 2. Storyboards
763
+ 3. Preview images
764
+ 4. Audio assets
765
+ 5. Subtitle files
766
+ 6. Render manifests
767
+ 7. Final video outputs
768
+
769
+ This prefix is for template-generated outputs and intermediate artifacts. Keep it stable across storage backends so local development mirrors production object layout.
770
+
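A key builder for the job artifact prefix might look like this sketch. The `ArtifactScope` type, the allowed character set, and the function names are assumptions. The only part taken from the spec is the `templates/:templateId/users/:userId/jobs/:jobId/...` layout.

```typescript
// Reject segments that could escape the intended prefix.
function safeSegment(value: string, label: string): string {
  if (!/^[A-Za-z0-9._-]+$/.test(value) || value === "." || value === "..") {
    throw new Error(`invalid ${label}: ${JSON.stringify(value)}`);
  }
  return value;
}

interface ArtifactScope {
  templateId: string;
  userId: string;
  jobId: string;
}

// Builds "templates/:templateId/users/:userId/jobs/:jobId/<relativePath>".
function jobArtifactKey(scope: ArtifactScope, relativePath: string): string {
  const parts = relativePath.split("/").map((part) => safeSegment(part, "path segment"));
  return [
    "templates",
    safeSegment(scope.templateId, "templateId"),
    "users",
    safeSegment(scope.userId, "userId"),
    "jobs",
    safeSegment(scope.jobId, "jobId"),
    ...parts
  ].join("/");
}
```

Building keys in one place keeps the layout identical across storage backends and makes it harder for template code to write outside its own job prefix.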
771
+ ## Remotion Integration
772
+
773
+ Remotion should be treated as a specialized downstream execution path, not the center of the platform.
774
+
775
+ Templates can invoke Remotion through a framework adapter.
776
+
777
+ Suggested flow:
778
+
779
+ 1. Template job prepares structured render input.
780
+ 2. Template writes required assets through framework storage.
781
+ 3. Template calls `ctx.remotion.render(...)`.
782
+ 4. The adapter renders locally or via Lambda depending on environment config.
783
+ 5. Final artifact is attached back to the parent job result.
784
+
785
+ This keeps Remotion as one implementation detail among several, rather than forcing the platform to be Remotion-first.
786
+
787
+ ## Isolation Policy
788
+
789
+ Default mode is shared in-process execution inside the main platform runtime.
790
+
791
+ ### Shared Execution Is Correct By Default
792
+
793
+ Use shared execution when the template:
794
+
795
+ 1. Uses standard Node dependencies.
796
+ 2. Fits within normal memory and CPU budgets.
797
+ 3. Does not need a custom OS image.
798
+ 4. Can safely coexist with other templates.
799
+
800
+ ### Isolated Execution Is an Escape Hatch
801
+
802
+ Allow a template to declare isolated execution later if it needs:
803
+
804
+ 1. Heavy FFmpeg or native binary workloads.
805
+ 2. Custom Chromium or system library requirements.
806
+ 3. A stricter security or resource boundary.
807
+ 4. Independent scaling or scheduling.
808
+
809
+ Even then, the main platform should still own:
810
+
811
+ 1. Auth
812
+ 2. Billing
813
+ 3. Job creation
814
+ 4. Customer state
815
+ 5. Webhook delivery
816
+
817
+ ## Suggested Repository Shape
818
+
819
+ ```txt
820
+ /src
821
+ /templates/template_0000
822
+ /templates/template_0001
823
+ /AWS_REMOTION_HANDOFF.md
824
+ /PLATFORM_SPEC.md
825
+ /SKILL.developer.md
826
+ ```
827
+
828
+ ## Suggested Internal Components
829
+
830
+ ### Platform API
831
+
832
+ Responsibilities:
833
+
834
+ 1. Request validation
835
+ 2. Auth
836
+ 3. Template lookup
837
+ 4. Config updates
838
+ 5. Job creation
839
+ 6. Job status reads
840
+ 7. Webhook registration
841
+
842
+ ### Worker / Dispatcher
843
+
844
+ Responsibilities:
845
+
846
+ 1. Pull queued jobs
847
+ 2. Acquire provider key leases
848
+ 3. Execute template jobs
849
+ 4. Persist logs and artifacts
850
+ 5. Update billing
851
+ 6. Deliver completion webhooks
852
+
853
+ ### Template Registry
854
+
855
+ Responsibilities:
856
+
857
+ 1. Register approved in-house templates
858
+ 2. Expose metadata
859
+ 3. Resolve operations and jobs
860
+ 4. Enforce version compatibility
861
+
862
+ ## Security Notes
863
+
864
+ 1. Customer provider keys must be encrypted at rest.
865
+ 2. Vidfarm API keys must be stored only as hashes and verified by hash comparison, never stored in plaintext.
866
+ 3. Template code is trusted internal code in v1, not untrusted tenant code.
867
+ 4. Webhook signatures must be mandatory.
868
+ 5. Customer file access must always be scoped by customer identity.
869
+
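Hash verification of the `vidfarm-api-key` header could be sketched as below using Node's built-in `crypto` module. This assumes API keys are long random strings, which is what makes a fast SHA-256 digest acceptable. Low-entropy secrets would need a slow password hash instead. The function names are hypothetical.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Store only this digest, never the key itself.
// Assumes API keys are long random strings, so a fast hash is acceptable.
function hashApiKey(apiKey: string): string {
  return createHash("sha256").update(apiKey, "utf8").digest("hex");
}

// Hash the presented key and compare digests in constant time.
function verifyApiKey(presentedKey: string, storedHashHex: string): boolean {
  const presented = Buffer.from(hashApiKey(presentedKey), "hex");
  const stored = Buffer.from(storedHashHex, "hex");
  return presented.length === stored.length && timingSafeEqual(presented, stored);
}
```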
870
+ ## Operational Notes for v1
871
+
872
+ 1. Run one Docker image on one EC2 host.
873
+ 2. Keep API and worker in the same deployable unit initially.
874
+ 3. Use SQLite in WAL mode.
875
+ 4. Back up SQLite and treat it as transitional infrastructure.
876
+ 5. Store all large artifacts in S3, not on container disk.
877
+ 6. Assume template code is centrally reviewed before deployment.
878
+
879
+ ## Known Limits of v1
880
+
881
+ 1. SQLite is a reasonable starting point but not the final queueing backend for large-scale concurrency.
882
+ 2. Shared in-process template execution is simpler operationally but weaker as a hard isolation boundary.
883
+ 3. EC2 cost attribution will begin as approximation plus provider-cost tracking, not perfect real-time infrastructure metering.
884
+
885
+ ## Recommendation Summary
886
+
887
+ The recommended initial Vidfarm platform is:
888
+
889
+ 1. One Node.js 22 Docker container on EC2.
890
+ 2. Hono as the HTTP API layer.
891
+ 3. SQLite as the initial jobs and rate-limit store.
892
+ 4. S3 for workspace and artifact storage.
893
+ 5. Remotion Lambda as a downstream rendering path.
894
+ 6. Templates implemented as normal internal TypeScript packages with full npm access.
895
+ 7. Public API defined as async operations that enqueue jobs.
896
+ 8. Optional isolated template execution added later only when justified by concrete workload needs.
897
+
898
+ This is the simplest architecture that matches the product goals without prematurely turning each template into a separate service.