@variantlab/core 0.1.4 → 0.1.6

@@ -0,0 +1,399 @@
1
+ # Crash-triggered automatic rollback
2
+
3
+ A variant that crashes should not keep crashing your users. variantlab's crash-rollback feature detects errors in a variant, clears the assignment, and forces the default.
4
+
5
+ ## Table of contents
6
+
7
+ - [Why this exists](#why-this-exists)
8
+ - [How it works](#how-it-works)
9
+ - [Configuration](#configuration)
10
+ - [The VariantErrorBoundary](#the-varianterrorboundary)
11
+ - [Programmatic reporting](#programmatic-reporting)
12
+ - [Persistence](#persistence)
13
+ - [Events](#events)
14
+ - [Debugging a rollback](#debugging-a-rollback)
15
+ - [Limitations](#limitations)
+ - [Best practices](#best-practices)
+ - [See also](#see-also)
16
+
17
+ ---
18
+
19
+ ## Why this exists
20
+
21
+ Real-world story from Drishtikon: we shipped a new card layout that crashed when a particular article had no hero image. The crash affected 100% of users in that variant until we shipped a new config.
22
+
23
+ With crash-rollback configured with `threshold: 2`, the second crash would have automatically cleared the assignment and forced the default. No user would have seen more than two crashes.
24
+
25
+ This is the safety net that lets teams ship risky experiments with confidence.
26
+
27
+ ---
28
+
29
+ ## How it works
30
+
31
+ ```
32
+ ┌─────────────┐     ┌──────────────┐     ┌──────────────┐
+ │  Variant A  │────▶│ ErrorBoundary│────▶│ reportCrash  │
+ └─────────────┘     └──────────────┘     └──────────────┘
+                                                 │
+                                                 ▼
+                                        ┌──────────────────┐
+                                        │ Count crashes    │
+                                        │ in time window   │
+                                        └──────────────────┘
+                                                 │
+                                                 ▼
+                                    ┌──────────────────────────┐
+                                    │ Threshold reached?       │
+                                    │ Yes → rollback           │
+                                    │ No  → keep trying        │
+                                    └──────────────────────────┘
+                                                 │
+                                                 ▼
+                                    ┌──────────────────────────┐
+                                    │ Clear assignment         │
+                                    │ Force default            │
+                                    │ Emit onRollback event    │
+                                    │ Persist (optional)       │
+                                    └──────────────────────────┘
56
+ ```
57
+
58
+ 1. You wrap the variant in `<VariantErrorBoundary>`
59
+ 2. A crash is caught and reported to the engine
60
+ 3. The engine maintains a per-(experiment, variant, user) crash counter within a sliding window
61
+ 4. When `threshold` crashes occur within `window` ms, the engine:
62
+ - Clears the user's assignment for that experiment
63
+ - Forces the `default` variant on the next resolve
64
+ - Emits an `onRollback` event
65
+ - Optionally persists the rollback so future sessions stay on default
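
The counting step above can be sketched as a small sliding-window counter. This is a minimal illustration, not variantlab's actual implementation: the `CrashCounter` class, its `report` method, and the key format are hypothetical names chosen for the example, and timestamps are passed in explicitly so the behavior is easy to follow.

```typescript
// Hypothetical sketch of per-(experiment, variant, user) crash counting.
// Not the engine's real internals; names are illustrative only.
type WindowConfig = { threshold: number; window: number };

class CrashCounter {
  // Crash timestamps (ms) keyed by "experiment:variant:user".
  private readonly crashes = new Map<string, number[]>();
  private readonly config: WindowConfig;

  constructor(config: WindowConfig) {
    this.config = config;
  }

  /** Records a crash and returns true once the threshold is reached. */
  report(
    experimentId: string,
    variantId: string,
    userId: string,
    now: number = Date.now(),
  ): boolean {
    const key = `${experimentId}:${variantId}:${userId}`;
    const cutoff = now - this.config.window;
    // Drop crashes that have aged out of the window, then add this one.
    const recent = (this.crashes.get(key) ?? []).filter((t) => t > cutoff);
    recent.push(now);
    this.crashes.set(key, recent);
    return recent.length >= this.config.threshold;
  }
}

const counter = new CrashCounter({ threshold: 3, window: 60_000 });
const first = counter.report("news-card-layout", "new-risky-layout", "user-123", 0);
const second = counter.report("news-card-layout", "new-risky-layout", "user-123", 10_000);
const third = counter.report("news-card-layout", "new-risky-layout", "user-123", 20_000);
console.log(first, second, third); // false false true
```

Storing timestamps rather than a bare count is what makes the window "sliding": a crash that is older than `window` ms simply stops counting on the next report.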
66
+
67
+ ---
68
+
69
+ ## Configuration
70
+
71
+ Per-experiment, in `experiments.json`:
72
+
73
+ ```json
74
+ {
75
+ "id": "news-card-layout",
76
+ "default": "responsive",
77
+ "variants": [
78
+ { "id": "responsive" },
79
+ { "id": "new-risky-layout" }
80
+ ],
81
+ "rollback": {
82
+ "threshold": 3,
83
+ "window": 60000,
84
+ "persistent": false
85
+ }
86
+ }
87
+ ```
88
+
89
+ ### `threshold`
90
+
91
+ Number of crashes that trigger rollback. Integer, 1-100. Default: 3.
92
+
93
+ - **Lower** = more aggressive (roll back after fewer crashes)
94
+ - **Higher** = more forgiving (tolerates flaky crashes)
95
+
96
+ For high-risk experiments, start at 2. For low-risk, 5-10 is fine.
97
+
98
+ ### `window`
99
+
100
+ Time window in milliseconds. Integer, 1000-3600000 (1s to 1h). Default: 60000 (1 min).
101
+
102
+ Crashes outside this window don't count. A variant that crashes once an hour is probably fine; a variant that crashes 3 times in 60 seconds is not.
103
+
104
+ ### `persistent`
105
+
106
+ Boolean, or the string `"forever"` (see [Persistence](#persistence)). Default: `false`.
107
+
108
+ - **`false`**: rollback is in-memory only. Restart the app and the user may get the variant again.
109
+ - **`true`**: rollback is written to Storage. Even after app restart, the user stays on default.
110
+
111
+ Persistent rollbacks are cleared when the engine updates to a new config (presumably with the fix).
112
+
113
+ ### Global defaults
114
+
115
+ You can set engine-level defaults:
116
+
117
+ ```ts
118
+ createEngine(config, {
119
+ rollbackDefaults: {
120
+ threshold: 3,
121
+ window: 60000,
122
+ persistent: false,
123
+ },
124
+ });
125
+ ```
126
+
127
+ Per-experiment config overrides the defaults.
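
The override order and the documented ranges can be sketched as follows. This is an assumption-laden illustration: `resolveRollbackConfig` is a hypothetical helper, and whether the real engine throws, clamps, or ignores out-of-range values is not stated in this document.

```typescript
// Hypothetical sketch: merge built-in defaults, engine-level rollbackDefaults,
// and the per-experiment "rollback" block, then check the documented ranges.
type Persistence = boolean | "forever";

interface RollbackConfig {
  threshold: number;
  window: number;
  persistent: Persistence;
}

const BUILT_IN: RollbackConfig = { threshold: 3, window: 60_000, persistent: false };

function resolveRollbackConfig(
  engineDefaults: Partial<RollbackConfig> = {},
  experiment: Partial<RollbackConfig> = {},
): RollbackConfig {
  // Later spreads win: experiment > engine defaults > built-in defaults.
  const merged = { ...BUILT_IN, ...engineDefaults, ...experiment };
  if (!Number.isInteger(merged.threshold) || merged.threshold < 1 || merged.threshold > 100) {
    throw new RangeError("threshold must be an integer from 1 to 100");
  }
  if (!Number.isInteger(merged.window) || merged.window < 1_000 || merged.window > 3_600_000) {
    throw new RangeError("window must be an integer from 1000 to 3600000 ms");
  }
  return merged;
}

const resolved = resolveRollbackConfig({ threshold: 5 }, { threshold: 2, persistent: true });
console.log(resolved); // { threshold: 2, window: 60000, persistent: true }
```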
128
+
129
+ ---
130
+
131
+ ## The VariantErrorBoundary
132
+
133
+ Wrap your variant subtree in an error boundary:
134
+
135
+ ```tsx
136
+ import { VariantErrorBoundary, Variant } from "@variantlab/react";
137
+
138
+ <VariantErrorBoundary experimentId="news-card-layout">
139
+ <Variant experimentId="news-card-layout">
140
+ {{
141
+ responsive: <ResponsiveCard />,
142
+ "new-risky-layout": <NewRiskyLayout />,
143
+ }}
144
+ </Variant>
145
+ </VariantErrorBoundary>
146
+ ```
147
+
148
+ When the child throws:
149
+
150
+ 1. The boundary catches the error
151
+ 2. It calls `engine.reportCrash(experimentId, variantId, error)`
152
+ 3. It renders a fallback (either the default variant or your custom fallback)
153
+ 4. If the rollback threshold is hit, the engine forces the default
154
+
155
+ ### Custom fallback
156
+
157
+ ```tsx
158
+ <VariantErrorBoundary
159
+ experimentId="news-card-layout"
160
+ fallback={(error, variantId) => (
161
+ <ErrorCard message="Something went wrong" />
162
+ )}
163
+ >
164
+ <Variant experimentId="news-card-layout">...</Variant>
165
+ </VariantErrorBoundary>
166
+ ```
167
+
168
+ ### Rendering the default on error
169
+
170
+ With `renderDefaultOnCrash`, the boundary re-renders with the default variant after a crash:
171
+
172
+ ```tsx
173
+ <VariantErrorBoundary experimentId="news-card-layout" renderDefaultOnCrash>
174
+ ...
175
+ </VariantErrorBoundary>
176
+ ```
177
+
178
+ This is the recommended pattern — the user sees the working default, not a generic error.
179
+
180
+ ### Nested boundaries
181
+
182
+ You can nest multiple boundaries for different experiments. Each one is scoped to its `experimentId`:
183
+
184
+ ```tsx
185
+ <VariantErrorBoundary experimentId="top-nav">
186
+ <TopNavExperiment />
187
+ <VariantErrorBoundary experimentId="card-layout">
188
+ <CardLayoutExperiment />
189
+ </VariantErrorBoundary>
190
+ </VariantErrorBoundary>
191
+ ```
192
+
193
+ A crash in `card-layout` won't trigger a rollback on `top-nav`.
194
+
195
+ ---
196
+
197
+ ## Programmatic reporting
198
+
199
+ If you can't use an error boundary (e.g., for async errors, event-handler errors, or worker errors), report directly:
200
+
201
+ ```ts
202
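+ import { useVariantLabEngine } from "@variantlab/react";
+
+ // Hooks must be called inside a component or a custom hook.
+ function useRiskyOperation(currentVariant: string) {
+   const engine = useVariantLabEngine();
+
+   return async () => {
+     try {
+       await riskyOperation();
+     } catch (error) {
+       // Attribute the failure to the active variant, then rethrow.
+       engine.reportCrash("news-card-layout", currentVariant, error);
+       throw error;
+     }
+   };
+ }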
212
+ ```
213
+
214
+ ### Global error handler
215
+
216
+ ```ts
217
+ window.addEventListener("error", (event) => {
218
+ if (event.error?.variantlabExperimentId) {
219
+ engine.reportCrash(
220
+ event.error.variantlabExperimentId,
221
+ event.error.variantlabVariantId,
222
+ event.error,
223
+ );
224
+ }
225
+ });
226
+ ```
227
+
228
+ Or on React Native:
229
+
230
+ ```ts
231
+ ErrorUtils.setGlobalHandler((error, isFatal) => {
232
+ // inspect the error and call engine.reportCrash if applicable
233
+ });
234
+ ```
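
The global handlers above only help if the thrown error carries the experiment and variant IDs. One way to attach them is sketched below. `tagWithVariant` and `loadHeroImage` are hypothetical helpers written for this example; the property names match the browser handler shown earlier, but this document does not say whether variantlab attaches them for you.

```typescript
// Hypothetical helper: tag an error with the experiment and variant it came
// from, so a global handler can attribute it later.
interface VariantTaggedError extends Error {
  variantlabExperimentId?: string;
  variantlabVariantId?: string;
}

function tagWithVariant(
  error: unknown,
  experimentId: string,
  variantId: string,
): VariantTaggedError {
  // Normalize non-Error throwables so handlers always receive an Error.
  const tagged: VariantTaggedError =
    error instanceof Error ? error : new Error(String(error));
  tagged.variantlabExperimentId = experimentId;
  tagged.variantlabVariantId = variantId;
  return tagged;
}

// Usage: rethrow the tagged error from variant-specific code.
function loadHeroImage(url: string | undefined): string {
  try {
    if (!url) throw new TypeError("article has no hero image");
    return url;
  } catch (error) {
    throw tagWithVariant(error, "news-card-layout", "new-risky-layout");
  }
}
```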
235
+
236
+ ---
237
+
238
+ ## Persistence
239
+
240
+ When `persistent: true`, rollback state is stored in the engine's Storage adapter under a key like:
241
+
242
+ ```
243
+ @variantlab/rollback:news-card-layout:user-123
244
+ ```
245
+
246
+ The value includes:
247
+
248
+ - The rolled-back variant ID
249
+ - The timestamp of the rollback
250
+ - The crash count that triggered it
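
As a rough sketch, the stored entry might look like the following. The field names, the JSON encoding, and the synchronous `SimpleStorage` interface are assumptions made for illustration; only the key format and the three pieces of information come from the text above.

```typescript
// Hypothetical shape of a persisted rollback entry.
interface RollbackRecord {
  variantId: string; // the rolled-back variant
  timestamp: number; // when the rollback happened (ms since epoch)
  crashCount: number; // the crash count that triggered it
}

// Minimal synchronous storage shape, assumed for this example only.
interface SimpleStorage {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

function rollbackKey(experimentId: string, userId: string): string {
  return `@variantlab/rollback:${experimentId}:${userId}`;
}

function saveRollback(
  storage: SimpleStorage,
  experimentId: string,
  userId: string,
  record: RollbackRecord,
): void {
  storage.setItem(rollbackKey(experimentId, userId), JSON.stringify(record));
}

function loadRollback(
  storage: SimpleStorage,
  experimentId: string,
  userId: string,
): RollbackRecord | null {
  const raw = storage.getItem(rollbackKey(experimentId, userId));
  return raw === null ? null : (JSON.parse(raw) as RollbackRecord);
}

// In-memory stand-in for a real adapter, handy in tests.
function createMemoryStorage(): SimpleStorage {
  const data = new Map<string, string>();
  return {
    getItem: (key) => data.get(key) ?? null,
    setItem: (key, value) => {
      data.set(key, value);
    },
  };
}
```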
251
+
252
+ ### When is it cleared?
253
+
254
+ - **On config update**: if the config version changes (i.e., a new config is loaded), the rollback is cleared. The assumption is that the new config includes the fix.
255
+ - **Manually**: `engine.clearRollback(experimentId)` or via the debug overlay
256
+ - **On `resetAll()`**
257
+
258
+ ### Why clear on config update?
259
+
260
+ Because the whole point of a rollback is to protect users until a fix ships. Once the fix is in a new config, the user should be re-enrolled.
261
+
262
+ If you don't want this behavior, set `persistent: "forever"`:
263
+
264
+ ```json
265
+ "rollback": { "threshold": 3, "window": 60000, "persistent": "forever" }
266
+ ```
267
+
268
+ This keeps the rollback until explicit clearing.
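
The three persistence modes can be compared with a small decision function evaluated at session start. This sketch assumes the stored entry remembers the config version that was active when the rollback fired; the `configVersion` field and the `isRollbackActive` name are hypothetical and do not appear elsewhere in this document.

```typescript
// Hypothetical sketch: decide at session start whether a stored rollback
// should still force the default variant.
type Persistence = boolean | "forever";

interface StoredRollback {
  variantId: string;
  configVersion: string; // assumed: config version active when the rollback fired
}

function isRollbackActive(
  stored: StoredRollback | null,
  persistent: Persistence,
  currentConfigVersion: string,
): boolean {
  // persistent: false keeps rollbacks in memory only, so nothing survives a restart.
  if (stored === null || persistent === false) return false;
  // persistent: "forever" survives config updates until cleared explicitly.
  if (persistent === "forever") return true;
  // persistent: true holds only while the crashing config version is still loaded.
  return stored.configVersion === currentConfigVersion;
}

const stored: StoredRollback = { variantId: "new-risky-layout", configVersion: "v1" };
console.log(isRollbackActive(stored, true, "v1")); // true
console.log(isRollbackActive(stored, true, "v2")); // false
console.log(isRollbackActive(stored, "forever", "v2")); // true
```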
269
+
270
+ ---
271
+
272
+ ## Events
273
+
274
+ Rollbacks fire an event on the engine's event bus:
275
+
276
+ ```ts
277
+ engine.on("rollback", (event) => {
278
+ console.log("Rollback triggered:", event);
279
+ // {
280
+ // experimentId: "news-card-layout",
281
+ // variantId: "new-risky-layout",
282
+ // reason: "threshold-exceeded",
283
+ // crashCount: 3,
284
+ // window: 60000,
285
+ // userId: "user-123",
286
+ // timestamp: 1739500000000,
287
+ // }
288
+ });
289
+ ```
290
+
291
+ Use this to:
292
+
293
+ - Forward to your telemetry (e.g., Sentry, Datadog)
294
+ - Alert on-call via PagerDuty
295
+ - Increment a rollback counter in your metrics pipeline
296
+
297
+ ### Telemetry integration
298
+
299
+ If you configure a telemetry provider, rollback events are forwarded automatically:
300
+
301
+ ```ts
302
+ createEngine(config, {
303
+ telemetry: {
304
+ track(event, properties) {
305
+ posthog.capture(event, properties);
306
+ },
307
+ },
308
+ });
309
+ ```
310
+
311
+ ---
312
+
313
+ ## Debugging a rollback
314
+
315
+ ### Debug overlay
316
+
317
+ The overlay's **Events** tab shows rollback events in real time. Each entry has:
318
+
319
+ - Experiment ID
320
+ - Variant that crashed
321
+ - Crash count
322
+ - Timestamp
323
+ - "View stack trace" button
324
+
325
+ ### Crash history
326
+
327
+ The overlay's **History** tab shows past rollbacks with timestamps. See [`time-travel.md`](./time-travel.md).
328
+
329
+ ### Programmatic inspection
330
+
331
+ ```ts
332
+ const rollbacks = engine.getRollbacks();
333
+ // [
334
+ // { experimentId: "news-card-layout", variantId: "new-risky-layout", ... }
335
+ // ]
336
+ ```
337
+
338
+ ### Manually clearing
339
+
340
+ From the overlay: tap the experiment card → overflow menu → "Clear rollback".
341
+
342
+ Programmatically:
343
+
344
+ ```ts
345
+ engine.clearRollback("news-card-layout");
346
+ // or
347
+ engine.resetAll();
348
+ ```
349
+
350
+ ---
351
+
352
+ ## Limitations
353
+
354
+ ### What we can detect
355
+
356
+ - **React render errors** — caught by ErrorBoundary
357
+ - **Event handler errors**: not caught by React error boundaries, so catch them in the handler and report manually
358
+ - **Promise rejections inside effects** — caught if you wrap in try/catch and report manually
359
+ - **Global JS errors** — caught if you register a global handler and report manually
360
+
361
+ ### What we cannot detect
362
+
363
+ - **Native crashes** (iOS/Android native layer) — these bypass JavaScript entirely
364
+ - **Memory leaks** that don't throw — not an error, just slow
365
+ - **Infinite loops** — the app hangs, no error is thrown
366
+ - **Crashes in the engine itself** — if variantlab's code crashes, the rollback system can't run
367
+
368
+ For the first and last cases, pair variantlab with Sentry or Crashlytics to get native crash reports and manually update your config when you see a surge.
369
+
370
+ ### False positives
371
+
372
+ A variant that legitimately errors once in a while (e.g., transient network failures) may trigger a rollback even though the variant is fine. Mitigation:
373
+
374
+ - Use a higher `threshold`
375
+ - Use a shorter `window`, so occasional errors age out before they reach the threshold
376
+ - Don't report network errors as variant crashes
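
The last mitigation can be sketched as a filter placed in front of the report call. The message patterns below are heuristics based on common `fetch` failure messages in browsers and React Native, not an exhaustive or official list, and `isLikelyNetworkError` and `reportIfVariantCrash` are hypothetical helpers. In real code the `report` callback would wrap `engine.reportCrash`.

```typescript
// Hypothetical classifier for transient network failures.
function isLikelyNetworkError(error: unknown): boolean {
  if (!(error instanceof Error)) return false;
  // Cancelled requests usually surface with the name "AbortError".
  if (error.name === "AbortError") return true;
  // Heuristic match on common fetch failure messages.
  return /network request failed|failed to fetch|networkerror/i.test(error.message);
}

/** Calls `report` only for errors that look like real variant crashes. */
function reportIfVariantCrash(
  error: unknown,
  report: (error: unknown) => void,
): boolean {
  if (isLikelyNetworkError(error)) return false;
  report(error);
  return true;
}

const reported: unknown[] = [];
reportIfVariantCrash(new TypeError("Failed to fetch"), (e) => reported.push(e));
reportIfVariantCrash(new TypeError("hero is undefined"), (e) => reported.push(e));
console.log(reported.length); // 1
```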
377
+
378
+ ### Race conditions
379
+
380
+ If the same user has the same experiment rendered in 3 places simultaneously and all 3 crash, that's 3 crash reports in milliseconds. The threshold counts real reports, so 3 concurrent reports count as 3 crashes. This is usually desired — 3 crashes in a frame is worse than 3 crashes over a minute.
381
+
382
+ ---
383
+
384
+ ## Best practices
385
+
386
+ 1. **Wrap risky experiments** — always use `VariantErrorBoundary` around experimental components
387
+ 2. **Tune thresholds** — higher threshold for low-risk experiments, lower for high-risk
388
+ 3. **Always set a working default** — the whole point of rollback is to fall back to the default
389
+ 4. **Forward to telemetry** — don't just rollback silently; alert the team
390
+ 5. **Test the rollback** — in dev, inject a crash and verify the rollback triggers
391
+ 6. **Clear rollbacks on fix** — ship a new config version and the rollback auto-clears
392
+
393
+ ---
394
+
395
+ ## See also
396
+
397
+ - [`API.md`](../../API.md) — `VariantErrorBoundary` props
398
+ - [`origin-story.md`](../research/origin-story.md) — the crashing card that inspired this feature
399
+ - [`hmac-signing.md`](./hmac-signing.md) — making sure rollback configs aren't tampered with