bulltrackers-module 1.0.293 → 1.0.295
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/functions/computation-system/executors/PriceBatchExecutor.js +0 -1
- package/functions/computation-system/executors/StandardExecutor.js +47 -7
- package/functions/computation-system/features.md +395 -0
- package/functions/computation-system/helpers/computation_dispatcher.js +35 -17
- package/functions/computation-system/layers/extractors.js +9 -9
- package/functions/computation-system/paper.md +93 -0
- package/functions/computation-system/persistence/RunRecorder.js +16 -16
- package/functions/generic-api/admin-api/index.js +233 -0
- package/functions/generic-api/helpers/api_helpers.js +30 -4
- package/functions/generic-api/index.js +8 -1
- package/package.json +1 -1
- package/functions/computation-system/onboarding.md +0 -210
@@ -1,210 +0,0 @@
# BullTrackers Computation System: Architecture & Operational Manual

This document provides a comprehensive overview of the BullTrackers Computation System, a distributed, deterministic, and self-optimizing data pipeline. Unlike traditional task schedulers, this system operates on "Build System" principles, treating data calculations as compiled artifacts with strict versioning and dependency guarantees.

---

## 1. System Philosophy & Core Concepts

### The "Build System" Paradigm
We treat the computation pipeline like a large-scale software build system (e.g., Bazel or Make). Every data point is an "artifact" produced by a specific version of code (Code Hash) acting on specific versions of dependencies (Dependency Hashes).

* **Determinism**: If the input data and code haven't changed, the output *must* be identical. We verify this to skip unnecessary work.
* **Merkle Tree Structure**: The state of the system is a DAG (Directed Acyclic Graph) of hashes. A change in a root node propagates potential invalidation down the tree, but invalidation stops as soon as a node produces the same output as before (short-circuiting).

### Source-of-Truth Architecture
The **Root Data Index** is the absolute source of truth. No computation can start until the underlying raw data (prices, signals) is indexed and verified "Available" for the target date. This prevents partial runs and garbage-in, garbage-out.

### The Three-Layer Hash Model
To optimize execution, we track three distinct hashes for every calculation:

1. **Code Hash (Static)**: A SHA-256 hash of the cleaned source code (comments and whitespace stripped). This tells us whether the logic *might* have changed.
2. **SimHash (Behavioral)**: Generated by running the code against a deterministic "Fabricated" context. This tells us whether the logic *actually* changed behavior (e.g., a refactor that renames variables but preserves logic yields a different Code Hash but the same SimHash).
3. **ResultHash (Output)**: A hash of the actual production output from a run. This tells us whether the data changed; it is used for downstream short-circuiting.

---

## 2. Core Components Overview

### Root Data Indexer
A scheduled crawler that verifies the availability of raw external data (e.g., asset prices, global signals) for a given date. It produces an "Availability Manifest" that the Dispatcher consults before scheduling anything.

### Manifest Builder
* **Role**: Topology discovery.
* **Mechanism**: Scans the `calculations/` directory, loads every module, and builds the global dependency graph (DAG) in memory.
* **Output**: A topological sort of all calculations, assigned to "Passes" (Pass 0, Pass 1, etc.).

### The Dispatcher (`WorkflowOrchestrator.js`)
The "brain" of the system. It runs largely stateless, analyzing the `StatusRepository` against the `Manifest`.

* **Responsibility**: For a given grid cell (Date × Calculation), it determines whether the state is `RUNNABLE`, `BLOCKED`, `SKIPPED`, or `IMPOSSIBLE`.
* **Key Logic**: It implements the "Short-Circuiting" and "Historical Continuity" checks.
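In sketch form, that per-cell decision can be written as a pure function. The field names below (`dataAvailable`, `deps`, `lastRunDeps`, `codeChanged`) are illustrative assumptions, not the actual `StatusRepository` schema:

```javascript
// Hypothetical sketch of the Dispatcher's per-cell state decision.
function calculateExecutionStatus(cell) {
  // Root data must be indexed "Available" before anything runs.
  if (!cell.dataAvailable) return 'IMPOSSIBLE';
  // An IMPOSSIBLE upstream can never succeed, so neither can we.
  if (cell.deps.some(d => d.status === 'IMPOSSIBLE')) return 'IMPOSSIBLE';
  // Unfinished upstreams mean we must wait.
  if (cell.deps.some(d => d.status !== 'DONE')) return 'BLOCKED';
  // Short-circuit: same upstream ResultHashes + same code => nothing to do.
  const depsUnchanged = cell.deps.every(d => cell.lastRunDeps[d.name] === d.resultHash);
  if (depsUnchanged && !cell.codeChanged) return 'SKIPPED';
  return 'RUNNABLE';
}
```

The ordering matters: impossibility and blocking are decided before the skip check, so a cell is only ever skipped once all of its inputs are final.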

### The Build Optimizer
A pre-flight tool that attempts to avoid running tasks by proving they are identical to previous versions.

* **Mechanism**: If a calculation's Code Hash changes, the Optimizer runs a **Simulation** (using `SimRunner`) to generate a SimHash. If the SimHash matches the registry, the system acts as if the code never changed and skips the production re-run.

### The Worker (`StandardExecutor` / `MetaExecutor`)
The execution unit. It is unaware of the broader topology.

* **Input**: A target Calculation and Date.
* **Action**: Fetches inputs, runs `process()`, validates results, and writes to Firestore.
* **Output**: The computed data plus the **ResultHash**.

---

## 3. The Daily Lifecycle (Chronological Process)

### Phase 1: Indexing
The system waits for the `SystemEpoch` to advance. The Root Data Indexer checks for "Canary Blocks" (indicators that external data providers have finished for the day). Once confirmed, the date is marked `OPEN`.

### Phase 2: Pre-Flight Optimization
Before dispatching workers:

1. The system identifies all calculations with new **Code Hashes**.
2. It runs `SimRunner` for these calculations to generate fresh **SimHashes**.
3. If `SimHash(New) == SimHash(Old)`, the system updates the Status Ledger to enable the new Code Hash without flagging it as "Changed".
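The three steps above reduce to a simple comparison; a minimal sketch, where the registry shape is an assumption and `simRun` stands in for `SimRunner` executing the calculation against a fabricated context:

```javascript
// Hypothetical sketch of the Phase 2 pre-flight check.
function preflight(calc, registry, simRun) {
  if (calc.codeHash === registry.codeHash) return 'UNCHANGED'; // step 1: no new Code Hash
  const newSimHash = simRun(calc);                             // step 2: fresh SimHash
  if (newSimHash === registry.simHash) return 'COSMETIC';      // step 3: enable without "Changed"
  return 'CHANGED';                                            // behavioral change: re-run
}
```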

### Phase 3: Dispatch Analysis
The Dispatcher iterates through the topological passes (0 → N). For each calculation, it queries `calculateExecutionStatus`:

* Are dependencies done?
* Did dependencies change their output (`ResultHash`)?
* Is historical context available?

### Phase 4: Execution Waves
Workers are triggered via Pub/Sub or direct method invocation.

* **Pass 1**: Primitive conversions (e.g., Price Extractor).
* **Pass 2**: Technical indicators that depend on Pass 1.
* **Pass 3**: Aggregations and complex metrics.

### Phase 5: Reconciliation
After all queues drain, the system performs a final sweep. Any tasks marked `FAILED` are retried (up to a limit). Impossible tasks are finalized as `IMPOSSIBLE`.

---

## 4. Deep Dive: Hashing & Dependency Logic

### Intrinsic Code Hashing
Located in `topology/HashManager.js`.
We generate a unique fingerprint for every calculation file:

```javascript
// Strip comments and whitespace, then hash the normalized source.
const crypto = require('crypto');
const clean = codeString.replace(/\/\/.*|\/\*[\s\S]*?\*\//g, '').replace(/\s+/g, '');
const hash = crypto.createHash('sha256').update(clean).digest('hex');
```

This ensures that changes to comments or formatting do *not* trigger re-runs.

### Behavioral Hashing (SimHash)
Located in `simulation/SimRunner.js`.
When code changes, we can't be 100% sure it's safe just by reading the source.

1. **The Fabricator**: Generates a deterministic mock `Context` (prices, previous results) based on the input schema.
2. **Simulation Run**: The calculation's `process()` method is executed against this mock data.
3. **The Registry**: The hash of the simulation's *output* is stored.

If a refactor produces the exact same mock output, the system considers the change "cosmetic".

### Dependency Short-Circuiting
Implemented in `WorkflowOrchestrator.js` (`analyzeDateExecution`).
Even if an upstream calculation re-runs, downstream dependents might not need to.

* **Logic**:
  * Calc A (upstream) re-runs. Old output hash: `HashX`. New output hash: `HashX`.
  * Calc B (downstream) sees that Calc A "changed" (new timestamp), but the content hash `HashX` is identical to what Calc B used last time.
* **Result**: Calc B is `SKIPPED`.
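The core of that check is a comparison of the upstream `ResultHash` values recorded at the last successful run against the current ones; a minimal sketch, where the two-map shape is an assumption rather than the real status document schema:

```javascript
// Hypothetical short-circuit check: re-run only if some upstream's current
// ResultHash differs from the one recorded at our last successful run.
function needsRerun(lastRunDeps, currentDeps) {
  return Object.keys(currentDeps).some(name => lastRunDeps[name] !== currentDeps[name]);
}
```

Note that timestamps never enter the comparison: a "changed" upstream with an identical `ResultHash` is indistinguishable from an unchanged one.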

---

## 5. Decision Logic & Edge Case Scenarios

### Scenario A: Standard Code Change (Logic)
* **Trigger**: You change the formula for `RSI`. Code Hash changes. SimHash changes.
* **Dispatcher**: Sees `storedHash !== currentHash`.
* **Result**: Marks the task `RUNNABLE`. Worker runs.

### Scenario B: Cosmetic Code Change (Refactor)
* **Trigger**: You rename a variable in `RSI`. Code Hash changes. SimHash remains identical.
* **Optimizer**: Updates the centralized Status Ledger: "Version `Desc_v2` is equivalent to `Desc_v1`".
* **Dispatcher**: Sees the new hash in the ledger as "Verified".
* **Result**: Task is `SKIPPED`.

### Scenario C: Upstream Invalidation (The Cascade)
* **Condition**: `PriceExtractor` fixes a bug. `ResultHash` changes from `HashA` to `HashB`.
* **Downstream**: `RSI` checks the detailed dependency report.
* **Check**: `LastRunDeps['PriceExtractor'] (HashA) !== CurrentDeps['PriceExtractor'] (HashB)`.
* **Result**: `RSI` is forced to re-run.

### Scenario D: Upstream Stability (The Firewall)
* **Condition**: `PriceExtractor` runs an optimization. The output is the exact same data. `ResultHash` remains `HashA`.
* **Downstream**: `RSI` checks the dependency report.
* **Check**: `LastRunDeps['PriceExtractor'] (HashA) === CurrentDeps['PriceExtractor'] (HashA)`.
* **Result**: `RSI` is `SKIPPED`. This firewall prevents massive recalculation storms for non-functional upstream changes.

### Scenario E: The "Impossible" State
* **Condition**: Core market data is missing for `1990-01-01`.
* **Root Indexer**: Marks the date as providing `[]` (empty) for critical inputs.
* **Dispatcher**: Marks `PriceExtractor` as `IMPOSSIBLE: NO_DATA`.
* **Propagation**: Any calculation depending on `PriceExtractor` sees the `IMPOSSIBLE` status and marks *itself* `IMPOSSIBLE: UPSTREAM`.
* **Benefit**: The system doesn't waste cycles retrying calculations that can never succeed.

### Scenario F: Category Migration
* **Condition**: You change `getMetadata()` for a calculation, moving it from `signals` to `risk`.
* **Dispatcher**: Detects `storedCategory !== newCategory`.
* **Worker**:
  1. Runs `process()` and writes to the *new* path (`risk/CalculateX`).
  2. Detects the `previousCategory` flag.
  3. Deletes the data at the *old* path (`signals/CalculateX`) to prevent orphaned data.

---

## 6. Data Management & Storage

### Input Streaming
To handle large datasets without OOM (out-of-memory) errors:

* `StandardExecutor` does not load all users/tickers at once.
* It uses wait-and-stream logic (e.g., batches of 50 ids) to process the `Context`.
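A minimal sketch of that batching idea (the batch size of 50 follows the text; the helper itself is illustrative, not the executor's actual API):

```javascript
// Process ids in fixed-size batches rather than materialising everything at once.
async function processInBatches(ids, batchSize, handleBatch) {
  const results = [];
  for (let i = 0; i < ids.length; i += batchSize) {
    // Only one batch's worth of data is in flight at a time.
    results.push(await handleBatch(ids.slice(i, i + batchSize)));
  }
  return results;
}
```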

### Transparent Auto-Sharding
Firestore has a 1MB document limit.

* **Write Path**: If a calculation result exceeds ~900KB, it is split into `DocID`, `DocID_shard1`, `DocID_shard2`, etc.
* **Read Path**: The `DependencyFetcher` automatically detects sharding pointers and re-assembles (hydrates) the full object before passing it to `process()`.
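A sketch of the shard/hydrate round trip. The ~900KB limit is parameterised here for testability; the document naming follows the text, but the `shardCount` pointer field is an assumption:

```javascript
// Hypothetical write path: split an oversized serialized payload into chunks.
function shard(docId, json, limit = 900 * 1024) {
  const chunks = [];
  for (let i = 0; i < json.length; i += limit) chunks.push(json.slice(i, i + limit));
  return chunks.map((data, i) => ({
    id: i === 0 ? docId : `${docId}_shard${i}`, // DocID, DocID_shard1, ...
    data,
    shardCount: chunks.length,                  // pointer for the read path
  }));
}

// Hypothetical read path: re-assemble the full payload in shard order.
function hydrate(docs) {
  return docs.map(d => d.data).join('');
}
```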

### Compression Strategy
* Payloads are inspected before write.
* If compression pays off (repetitive text/JSON that shrinks well), Zlib compression is applied.
* Metadata is tagged `encoding: 'zlib'` so readers know to inflate.

---

## 7. Quality Assurance & Self-Healing

### The Heuristic Validator
Before saving *any* result, the Executor runs heuristics:

* **NaN Check**: Are there `NaN` or `Infinity` values in key fields?
* **Flatline Check**: Is the data variance 0.00 across a large timespan?
* **Null Density**: Is more than 50% of the dataset null?
* **Circuit Breaker**: If the heuristics fail, the task throws an error. It is better to fail and alert than to persist corrupted data that pollutes the cache.
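The checks above can be sketched as a single validator; the thresholds mirror the text, while the flat numeric-series input shape is an assumption:

```javascript
// Hypothetical pre-save validator. Throwing is the circuit breaker:
// corrupted data never reaches Firestore.
function validateSeries(values) {
  if (values.some(v => typeof v === 'number' && !Number.isFinite(v))) {
    throw new Error('Heuristic failure: NaN/Infinity in result');
  }
  if (values.filter(v => v == null).length / values.length > 0.5) {
    throw new Error('Heuristic failure: >50% null density');
  }
  const nums = values.filter(v => typeof v === 'number');
  if (nums.length > 1 && nums.every(v => v === nums[0])) {
    throw new Error('Heuristic failure: flatline (zero variance)');
  }
  return true;
}
```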

### Zombie Task Recovery
* **Lease Mechanism**: When a task starts, it sets a `startedAt` timestamp.
* **Detection**: The Dispatcher checks for tasks marked `RUNNING` whose `startedAt` is more than 15 minutes old.
* **Resolution**: These are assumed crashed (OOM/timeout) and are reset to `PENDING` (or `FAILED` if the retry count is exceeded).
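A sketch of that lease sweep; the 15-minute window follows the text, while the task shape and retry limit are assumptions:

```javascript
// Hypothetical zombie sweep over the status ledger.
const LEASE_MS = 15 * 60 * 1000;

function sweepZombies(tasks, now = Date.now(), maxRetries = 3) {
  for (const task of tasks) {
    if (task.status === 'RUNNING' && now - task.startedAt > LEASE_MS) {
      // Assumed crashed (OOM/timeout): retry, or fail permanently.
      task.status = task.retries >= maxRetries ? 'FAILED' : 'PENDING';
      task.retries += 1;
    }
  }
  return tasks;
}
```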

### Dead Letter Queue (DLQ)
Tasks that deterministically fail (crash every time) after N retries are moved to a special DLQ status. This prevents the system from getting stuck in an infinite retry loop.

---

## 8. Developer Workflows

### How to Add a New Calculation
1. Create `calculations/category/MyNewCalc.js`.
2. Implement `getMetadata()` to define dependencies.
3. Implement `process(context)`.
4. Run `npm run build-manifest` to register it in the topology.
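A hypothetical skeleton for step 1–3. The `getMetadata()`/`process()` contract follows the steps above; the metadata field names and context shape are assumptions:

```javascript
// calculations/category/MyNewCalc.js (illustrative skeleton)
class MyNewCalc {
  getMetadata() {
    return {
      name: 'MyNewCalc',
      category: 'signals',              // determines the Firestore write path
      dependencies: ['PriceExtractor'], // upstream calcs the DAG resolves first
    };
  }

  process(context) {
    // Upstream outputs arrive already hydrated (de-sharded, inflated).
    const prices = context.PriceExtractor;
    return { latest: prices[prices.length - 1] };
  }
}

module.exports = MyNewCalc;
```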

### How to Force a Global Re-Run
* Change the `SYSTEM_EPOCH` constant in `system_epoch.js`.
* This changes the "Global Salt" for all hashes, causing every calculation to be treated as "New".

### How to Backfill History
* **Standard Dispatcher**: Good for recent history (the last 30 days).
* **PriceBatchExecutor**: Specialized for massive historical backfills (e.g., 20 years of price data). It bypasses some topology checks for raw speed.

### Local Debugging
Run the orchestrator in "Dry Run" mode:

```bash
node scripts/run_orchestrator.js --date=2024-01-01 --dry-run
```

This prints the `Analysis Report` (Runnable/Blocked lists) without actually triggering workers.