bulltrackers-module 1.0.293 → 1.0.295

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
# BullTrackers Computation System: Architecture & Operational Manual

This document provides a comprehensive overview of the BullTrackers Computation System, a distributed, deterministic, and self-optimizing data pipeline. Unlike traditional task schedulers, this system operates on "Build System" principles, treating data calculations as compiled artifacts with strict versioning and dependency guarantees.

---

## 1. System Philosophy & Core Concepts

### The "Build System" Paradigm
We treat the computation pipeline like a large-scale software build system (e.g., Bazel or Make). Every data point is an "artifact" produced by a specific version of code (Code Hash) acting on specific versions of dependencies (Dependency Hashes).
* **Determinism**: If the input data and code haven't changed, the output *must* be identical. We verify this to skip unnecessary work.
* **Merkle Tree Structure**: The state of the system is a DAG (Directed Acyclic Graph) of hashes. A change in a root node propagates potential invalidation down the tree, but invalidation stops as soon as a node produces the same output as before (Short-Circuiting).

### Source-of-Truth Architecture
The **Root Data Index** is the absolute source of truth. No computation can start until the underlying raw data (prices, signals) is indexed and verified "Available" for the target date. This prevents partial runs and "garbage-in-garbage-out".

### The Three-Layer Hash Model
To optimize execution, we track three distinct hashes for every calculation (see the sketch after this list):
1. **Code Hash (Static)**: A SHA-256 hash of the cleaned source code (comments and whitespace stripped). This tells us if the logic *might* have changed.
2. **SimHash (Behavioral)**: Generated by running the code against a deterministic "Fabricated" context. This tells us if the logic *actually* changed behavior (e.g., a refactor that changes variable names but not logic will have a different Code Hash but the same SimHash).
3. **ResultHash (Output)**: A hash of the actual production output from a run. This tells us if the data changed. Used for downstream short-circuiting.
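
The relationship between the three layers can be condensed into a short sketch. This is illustrative only: `stripCommentsAndWhitespace`, `calculation`, `fabricatedContext`, and `productionContext` are hypothetical names, not the module's actual API.

```javascript
const crypto = require('crypto');
const sha256 = (value) => crypto.createHash('sha256')
  .update(typeof value === 'string' ? value : JSON.stringify(value))
  .digest('hex');

// 1. Code Hash: fingerprint of the cleaned source text ("might have changed").
const codeHash = sha256(stripCommentsAndWhitespace(sourceCode));

// 2. SimHash: fingerprint of the output on a deterministic mock context
//    ("did behavior actually change?").
const simHash = sha256(calculation.process(fabricatedContext));

// 3. ResultHash: fingerprint of the real production output ("did data change?").
const resultHash = sha256(calculation.process(productionContext));
```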

---

## 2. Core Components Overview

### Root Data Indexer
A scheduled crawler that verifies the availability of raw external data (e.g., asset prices, global signals) for a given date. It produces an "Availability Manifest" that the Dispatcher consults before scheduling anything.

### Manifest Builder
* **Role**: Topology Discovery.
* **Mechanism**: It scans the `calculations/` directory, loads every module, and builds the global Dependency Graph (DAG) in memory.
* **Output**: A topological sort of all calculations assigned to "Passes" (Pass 0, Pass 1, etc.); pass assignment is sketched below.
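
A minimal sketch of pass assignment, assuming each manifest entry lists its dependencies by name (the `manifest` shape here is an assumption for illustration):

```javascript
// Layer the DAG: a calculation's pass is one greater than the deepest pass
// among its dependencies, so each pass only reads from earlier passes.
// Assumes the graph is acyclic (the Manifest Builder guarantees a DAG).
function assignPasses(manifest) {
  const passes = new Map();
  const passOf = (name) => {
    if (passes.has(name)) return passes.get(name);
    const deps = manifest[name].dependencies || [];
    const pass = deps.length === 0 ? 0 : 1 + Math.max(...deps.map(passOf));
    passes.set(name, pass);
    return pass;
  };
  Object.keys(manifest).forEach(passOf);
  return passes; // e.g., PriceExtractor -> Pass 0, RSI -> Pass 1
}
```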

### The Dispatcher (`WorkflowOrchestrator.js`)
The "Brain" of the system. It runs largely stateless, analyzing the `StatusRepository` against the `Manifest`.
* **Responsibility**: For a given Grid (Date x Calculation), it determines if the state is `RUNNABLE`, `BLOCKED`, `SKIPPED`, or `IMPOSSIBLE`.
* **Key Logic**: It implements the "Short-Circuiting" and "Historical Continuity" checks.

### The Build Optimizer
A pre-flight tool that attempts to avoid running tasks by proving they are identical to previous versions.
* **Mechanism**: If a calculation's Code Hash changes, the Optimizer runs a **Simulation** (using `SimRunner`) to generate a SimHash. If the SimHash matches the registry, the system acts as if the code never changed, skipping the production re-run.

### The Worker (`StandardExecutor` / `MetaExecutor`)
The execution unit. It is unaware of the broader topology.
* **Input**: A target Calculation and Date.
* **Action**: Fetches inputs, runs `process()`, validates results, and writes to Firestore.
* **Output**: The computed data + the **ResultHash**.

---

## 3. The Daily Lifecycle (Chronological Process)

### Phase 1: Indexing
The system waits for the `SystemEpoch` to advance. The Root Data Indexer checks for "Canary Blocks" (indicators that external data providers have finished for the day). Once confirmed, the date is marked `OPEN`.

### Phase 2: Pre-Flight Optimization
Before dispatching workers (see the sketch after this list):
1. The system identifies all calculations with new **Code Hashes**.
2. It runs `SimRunner` for these calculations to generate fresh **SimHashes**.
3. If `SimHash(New) == SimHash(Old)`, the system updates the Status Ledger to enable the new Code Hash without flagging it as "Changed".
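
A sketch of that decision, assuming async `registry`, `simRunner`, and `statusLedger` helpers (these names are illustrative, not the actual interfaces):

```javascript
// Pre-flight: only treat a Code Hash change as "real" if behavior changed.
async function preFlightCheck(calcName, newCodeHash) {
  const knownSimHash = await registry.getSimHash(calcName); // last verified
  const freshSimHash = await simRunner.run(calcName);       // new code's SimHash
  if (freshSimHash === knownSimHash) {
    // Cosmetic change: bless the new Code Hash, skip the production re-run.
    await statusLedger.markEquivalent(calcName, newCodeHash);
    return 'SKIPPED';
  }
  return 'RUNNABLE';
}
```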

### Phase 3: Dispatch Analysis
The Dispatcher iterates through the Topological Passes (0 -> N). For each calculation, it queries `calculateExecutionStatus` (a simplified sketch follows this list):
* Are dependencies done?
* Did dependencies change their output (`ResultHash`)?
* Is historical context available?
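
Condensed into code, the decision might look like this (status names come from this manual; the `ledger` helper methods are hypothetical):

```javascript
// Simplified shape of the per-cell (Date x Calculation) dispatch decision.
function calculateExecutionStatus(calc, date, ledger) {
  if (ledger.anyDepImpossible(calc.dependencies, date)) return 'IMPOSSIBLE';
  if (!ledger.allDepsDone(calc.dependencies, date))     return 'BLOCKED';
  if (!ledger.historyAvailable(calc, date))             return 'BLOCKED';

  const depsUnchanged = ledger.depResultHashesMatchLastRun(calc, date);
  const codeUnchanged = ledger.codeHashVerified(calc);
  if (depsUnchanged && codeUnchanged)                   return 'SKIPPED';
  return 'RUNNABLE';
}
```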

### Phase 4: Execution Waves
Workers are triggered via Pub/Sub or direct method invocation.
* **Pass 1**: Primitive conversions (e.g., Price Extractor).
* **Pass 2**: Technical Indicators that depend on Pass 1.
* **Pass 3**: Aggregations and Complex Metrics.

### Phase 5: Reconciliation
After all queues drain, the system performs a final sweep. Any tasks marked `FAILED` are retried (up to a limit). Impossible tasks are finalized as `IMPOSSIBLE`.

---

## 4. Deep Dive: Hashing & Dependency Logic

### Intrinsic Code Hashing
Located in `topology/HashManager.js`.
We generate a unique fingerprint for every calculation file:
```javascript
const crypto = require('crypto');
// Strip comments and whitespace so cosmetic edits do not change the hash
const clean = codeString.replace(/\/\/.*|\/\*[\s\S]*?\*\//g, '').replace(/\s+/g, '');
const hash = crypto.createHash('sha256').update(clean).digest('hex');
```
This ensures that changes to comments or formatting do *not* trigger re-runs.

### Behavioral Hashing (SimHash)
Located in `simulation/SimRunner.js`.
When code changes, the source alone cannot tell us whether the change is actually safe.
1. **The Fabricator**: Generates a deterministic mock `Context` (prices, previous results) based on the input schema.
2. **Simulation Run**: The calculation `process()` method is executed against this mock data.
3. **The Registry**: The hash of the *output* of this simulation is stored.
If a refactor produces the exact same Mock Output, the system considers the change "Cosmetic".
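
Determinism is the crucial property of the Fabricator: the mock context must be byte-identical on every run. A minimal sketch of how that could be achieved (the schema shape and seed are assumptions for illustration):

```javascript
const crypto = require('crypto');

// Derive every mock value from a hash of the field name, never Math.random(),
// so the fabricated context is identical across machines and runs.
function fabricateContext(schema, seed = 'bulltrackers-sim') {
  const mock = {};
  for (const [field, type] of Object.entries(schema)) {
    const digest = crypto.createHash('sha256').update(seed + field).digest();
    mock[field] = type === 'number'
      ? (digest.readUInt32BE(0) / 0xffffffff) * 100 // stable pseudo-price
      : `sim_${digest.toString('hex').slice(0, 8)}`;
  }
  return mock;
}

// fabricateContext({ close: 'number' }) returns the same object every time.
```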

### Dependency Short-Circuiting
Implemented in `WorkflowOrchestrator.js` (`analyzeDateExecution`).
Even if an upstream calculation re-runs, downstream dependents might not need to (see the sketch below).
* **Logic**:
  * Calc A (Upstream) re-runs. Old Output Hash: `HashX`. New Output Hash: `HashX`.
  * Calc B (Downstream) sees that Calc A "changed" (new timestamp), BUT the content hash `HashX` is identical to what Calc B used last time.
* **Result**: Calc B is `SKIPPED`.
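
The comparison itself reduces to a hash-map diff (a sketch; the stored shape of the dependency report is an assumption):

```javascript
// Firewall check: re-run only if an input's *content* actually changed.
// Both arguments map dependency name -> ResultHash.
function needsRerun(lastRunDeps, currentDeps) {
  return Object.keys(currentDeps).some(
    (dep) => lastRunDeps[dep] !== currentDeps[dep],
  );
}

needsRerun({ PriceExtractor: 'HashX' }, { PriceExtractor: 'HashX' }); // false -> SKIPPED
needsRerun({ PriceExtractor: 'HashA' }, { PriceExtractor: 'HashB' }); // true  -> re-run
```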

---

## 5. Decision Logic & Edge Case Scenarios

### Scenario A: Standard Code Change (Logic)
* **Trigger**: You change the formula for `RSI`. Code Hash changes. SimHash changes.
* **Dispatcher**: Sees `storedHash !== currentHash`.
* **Result**: Marks as `RUNNABLE`. Worker runs.

### Scenario B: Cosmetic Code Change (Refactor)
* **Trigger**: You rename a variable in `RSI`. Code Hash changes. SimHash remains identical.
* **Optimizer**: Updates the centralized Status Ledger: "Version `Desc_v2` is equivalent to `Desc_v1`".
* **Dispatcher**: Sees the new hash in the ledger as "Verified".
* **Result**: Task is `SKIPPED`.

### Scenario C: Upstream Invalidation (The Cascade)
* **Condition**: `PriceExtractor` fixes a bug. `ResultHash` changes from `HashA` to `HashB`.
* **Downstream**: `RSI` checks the detailed dependency report.
* **Check**: `LastRunDeps['PriceExtractor'] (HashA) !== CurrentDeps['PriceExtractor'] (HashB)`.
* **Result**: `RSI` is forced to re-run.

### Scenario D: Upstream Stability (The Firewall)
* **Condition**: `PriceExtractor` runs an optimization. The output is the exact same data. `ResultHash` remains `HashA`.
* **Downstream**: `RSI` checks the dependency report.
* **Check**: `LastRunDeps['PriceExtractor'] (HashA) === CurrentDeps['PriceExtractor'] (HashA)`.
* **Result**: `RSI` is `SKIPPED`. This firewall prevents massive re-calculation storms for non-functional upstream changes.

### Scenario E: The "Impossible" State
* **Condition**: Core market data is missing for `1990-01-01`.
* **Root Indexer**: Marks the date as providing `[]` (empty) for critical inputs.
* **Dispatcher**: Marks `PriceExtractor` as `IMPOSSIBLE: NO_DATA`.
* **Propagation**: Any calculation depending on `PriceExtractor` sees the `IMPOSSIBLE` status and marks *itself* as `IMPOSSIBLE: UPSTREAM`.
* **Benefit**: The system doesn't waste cycles retrying calculations that can never succeed.

### Scenario F: Category Migration
* **Condition**: You change `getMetadata()` for a calculation, moving it from `signals` to `risk`.
* **Dispatcher**: Detects `storedCategory !== newCategory`.
* **Worker** (sketched below):
  1. Runs `process()` and writes to the *new* path (`risk/CalculateX`).
  2. Detects the `previousCategory` flag.
  3. Deletes the data at the *old* path (`signals/CalculateX`) to prevent orphan data.
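
A sketch of the migration step inside the worker, using a Firestore-style API (the `results/{category}/items/{name}` path layout is an assumption for illustration):

```javascript
// Write to the new category path first, then clean up the orphan copy.
async function writeWithMigration(db, result, meta) {
  const newRef = db.doc(`results/${meta.category}/items/${meta.name}`);
  await newRef.set(result); // the new path now holds the fresh data

  if (meta.previousCategory && meta.previousCategory !== meta.category) {
    const oldRef = db.doc(`results/${meta.previousCategory}/items/${meta.name}`);
    await oldRef.delete(); // remove the stale document at the old path
  }
}
```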

---

## 6. Data Management & Storage

### Input Streaming
To handle large datasets without OOM (Out Of Memory) errors (see the sketch after this list):
* `StandardExecutor` does not load all users/tickers at once.
* It uses wait-and-stream logic (e.g., batches of 50 IDs) to process the `Context`.
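
In outline, the batching loop could look like this (`fetchContextFor` and `runCalculation` are hypothetical stand-ins for the executor's fetch and process steps):

```javascript
// Process IDs in fixed-size slices instead of materializing everything at once.
async function processInBatches(ids, batchSize = 50) {
  for (let i = 0; i < ids.length; i += batchSize) {
    const batch = ids.slice(i, i + batchSize);
    const context = await fetchContextFor(batch); // hydrate only this slice
    await runCalculation(context);
    // the slice's objects become garbage-collectable before the next iteration
  }
}
```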

### Transparent Auto-Sharding
Firestore has a 1MB document limit.
* **Write Path**: If a calculation result > 900KB, it is split into `DocID`, `DocID_shard1`, `DocID_shard2` (see the sketch below).
* **Read Path**: The `DependencyFetcher` automatically detects sharding pointers and re-assembles (hydrates) the full object before passing it to `process()`.
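
A sketch of both paths, serializing to JSON and splitting by character count (an approximation of bytes, for brevity; the chunk naming follows the convention above):

```javascript
// Write path: split a serialized payload into sub-limit chunks.
function shard(docId, json, limit = 900 * 1024) {
  if (json.length <= limit) return [{ id: docId, data: json }];
  const chunks = [];
  for (let i = 0; i < json.length; i += limit) {
    chunks.push(json.slice(i, i + limit));
  }
  return chunks.map((data, i) => ({
    id: i === 0 ? docId : `${docId}_shard${i}`,
    data,
    shardCount: chunks.length, // the pointer the read path looks for
  }));
}

// Read path: re-assemble DocID, DocID_shard1, ... into the full object.
const hydrate = (docs) => JSON.parse(docs.map((d) => d.data).join(''));
```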

### Compression Strategy
* Payloads are inspected before write (see the sketch below).
* If compression pays off (e.g., repetitive, low-entropy text/JSON), Zlib compression is applied.
* Metadata is tagged `encoding: 'zlib'` so readers know to inflate.
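
A sketch with Node's built-in `zlib`, keeping the compressed form only when it is actually smaller (the document shape is illustrative; the `encoding: 'zlib'` tag matches the convention above):

```javascript
const zlib = require('zlib');

// Write side: compress, but only keep the result if it beats the raw size.
function maybeCompress(payload) {
  const raw = Buffer.from(JSON.stringify(payload));
  const deflated = zlib.deflateSync(raw);
  return deflated.length < raw.length
    ? { data: deflated, encoding: 'zlib' }
    : { data: raw, encoding: 'none' };
}

// Read side: honor the tag before parsing.
function readPayload(doc) {
  const raw = doc.encoding === 'zlib' ? zlib.inflateSync(doc.data) : doc.data;
  return JSON.parse(raw.toString());
}
```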

---

## 7. Quality Assurance & Self-Healing

### The Heuristic Validator
Before saving *any* result, the Executor runs heuristics (sketched below):
* **NaN Check**: Are there `NaN` or `Infinity` values in key fields?
* **Flatline Check**: Is the data variance 0.00 across a large timespan?
* **Null Density**: Is >50% of the dataset null?
* **Circuit Breaker**: If heuristics fail, the task throws an error. It is better to fail and alert than to persist corrupted data that pollutes the cache.
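
The checks reduce to a few lines over the numeric series (a sketch over a flat array; field selection and any thresholds beyond those stated above are assumptions):

```javascript
// Throw rather than persist a suspicious result (the "circuit breaker").
function validateResult(values) {
  const nums = values.filter((v) => v !== null);

  if (nums.some((v) => !Number.isFinite(v))) {
    throw new Error('HEURISTIC_FAIL: NaN/Infinity detected');
  }
  if (values.length > 0 && nums.length / values.length < 0.5) {
    throw new Error('HEURISTIC_FAIL: more than 50% of values are null');
  }
  const mean = nums.reduce((a, b) => a + b, 0) / nums.length;
  const variance = nums.reduce((a, v) => a + (v - mean) ** 2, 0) / nums.length;
  if (nums.length > 1 && variance === 0) {
    throw new Error('HEURISTIC_FAIL: flatline (zero variance)');
  }
}
```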

### Zombie Task Recovery
* **Lease Mechanism**: When a task starts, it sets a `startedAt` timestamp.
* **Detection**: The Dispatcher checks for tasks marked `RUNNING` whose `startedAt` is more than 15 minutes old.
* **Resolution**: These are assumed crashed (OOM/Timeout). They are reset to `PENDING` (or `FAILED` if the retry count is exceeded), as sketched below.
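
A sketch of the sweep (the task field names are assumptions; the 15-minute lease comes from the text above):

```javascript
const LEASE_MS = 15 * 60 * 1000; // 15-minute lease before a task is a zombie

function sweepZombies(tasks, now = Date.now()) {
  for (const task of tasks) {
    if (task.status !== 'RUNNING') continue;
    if (now - task.startedAt <= LEASE_MS) continue; // lease still valid
    // Assumed crashed (OOM/timeout): requeue, or fail out after max retries.
    task.status = task.retries >= task.maxRetries ? 'FAILED' : 'PENDING';
  }
}
```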

### Dead Letter Queue (DLQ)
Tasks that deterministically fail (crash every time) after N retries are moved to a special DLQ status. This prevents the system from getting stuck in an infinite retry loop.

---

## 8. Developer Workflows

### How to Add a New Calculation
1. Create `calculations/category/MyNewCalc.js` (a skeleton is sketched after this list).
2. Implement `getMetadata()` to define dependencies.
3. Implement `process(context)`.
4. Run `npm run build-manifest` to register it in the topology.
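
A skeletal module following the two-method contract above (the metadata fields beyond dependencies, and the context shape, are illustrative assumptions):

```javascript
// calculations/signals/MyNewCalc.js
module.exports = {
  // Declares identity and upstream inputs; the Manifest Builder reads this.
  getMetadata() {
    return {
      name: 'MyNewCalc',
      category: 'signals',
      dependencies: ['PriceExtractor'],
    };
  },

  // Receives hydrated dependency outputs for the target date.
  process(context) {
    const prices = context.dependencies.PriceExtractor;
    return { signal: prices.close > prices.open ? 'UP' : 'DOWN' };
  },
};
```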

### How to Force a Global Re-Run
* Change the `SYSTEM_EPOCH` constant in `system_epoch.js`.
* This changes the "Global Salt" for all hashes, so every calculation is treated as "New" (see below).
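
Conceptually, the epoch is folded into every fingerprint, so bumping it invalidates all stored hashes at once (a one-line sketch; `sha256` and `cleanedSource` as in the hashing examples above):

```javascript
// Any change to SYSTEM_EPOCH changes every Code Hash, forcing a global re-run.
const codeHash = sha256(`${SYSTEM_EPOCH}:${cleanedSource}`);
```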

### How to Backfill History
* **Standard Dispatcher**: Good for recent history (last 30 days).
* **BatchPriceExecutor**: Specialized for massive historical backfills (e.g., 20 years of price data). It bypasses some topology checks for raw speed.

### Local Debugging
Run the orchestrator in "Dry Run" mode:
```bash
node scripts/run_orchestrator.js --date=2024-01-01 --dry-run
```
This prints the `Analysis Report` (Runnable/Blocked lists) without actually triggering workers.