cnhkmcp 2.1.9__py3-none-any.whl → 2.2.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- cnhkmcp/__init__.py +1 -1
- cnhkmcp/untracked/AI/321/206/320/231/320/243/321/205/342/225/226/320/265/321/204/342/225/221/342/225/221/BRAIN_AI/321/206/320/231/320/243/321/205/342/225/226/320/265/321/204/342/225/221/342/225/221Mac_Linux/321/207/320/231/320/230/321/206/320/254/320/274.zip +0 -0
- cnhkmcp/untracked/AI/321/206/320/231/320/243/321/205/342/225/226/320/265/321/204/342/225/221/342/225/221//321/205/320/237/320/234/321/205/320/227/342/225/227/321/205/320/276/320/231/321/210/320/263/320/225AI/321/206/320/231/320/243/321/205/342/225/226/320/265/321/204/342/225/221/342/225/221_Windows/321/207/320/231/320/230/321/206/320/254/320/274.exe +0 -0
- cnhkmcp/untracked/AI/321/206/320/261/320/234/321/211/320/255/320/262/321/206/320/237/320/242/321/204/342/225/227/342/225/242/vector_db/chroma.sqlite3 +0 -0
- cnhkmcp/untracked/skills/brain-data-feature-engineering/OUTPUT_TEMPLATE.md +325 -0
- cnhkmcp/untracked/skills/brain-data-feature-engineering/SKILL.md +263 -0
- cnhkmcp/untracked/skills/brain-data-feature-engineering/examples.md +244 -0
- cnhkmcp/untracked/skills/brain-data-feature-engineering/reference.md +493 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/SKILL.md +87 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/config.json +6 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/analyst15_GLB_delay1.csv +289 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/final_expressions.json +410 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588244.json +4 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588251.json +20 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588273.json +23 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588293.json +23 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588319.json +23 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588322.json +14 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588325.json +20 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588328.json +23 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588354.json +23 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588357.json +23 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588361.json +23 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588364.json +23 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588368.json +23 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588391.json +14 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588394.json +23 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588397.json +59 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588400.json +35 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588403.json +20 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588428.json +23 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588431.json +32 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588434.json +20 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588438.json +20 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588441.json +14 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/data/analyst15_GLB_delay1/idea_1768588468.json +20 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/scripts/ace_lib.py +1514 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/scripts/fetch_dataset.py +107 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/scripts/helpful_functions.py +180 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/scripts/implement_idea.py +164 -0
- cnhkmcp/untracked/skills/brain-feature-implementation/scripts/merge_expression_list.py +88 -0
- cnhkmcp/untracked/skills/planning-with-files/SKILL.md +211 -0
- cnhkmcp/untracked/skills/planning-with-files/examples.md +202 -0
- cnhkmcp/untracked/skills/planning-with-files/reference.md +218 -0
- cnhkmcp/untracked/skills/planning-with-files/scripts/check-complete.sh +44 -0
- cnhkmcp/untracked/skills/planning-with-files/scripts/init-session.sh +120 -0
- cnhkmcp/untracked/skills/planning-with-files/templates/findings.md +95 -0
- cnhkmcp/untracked/skills/planning-with-files/templates/progress.md +114 -0
- cnhkmcp/untracked/skills/planning-with-files/templates/task_plan.md +132 -0
- {cnhkmcp-2.1.9.dist-info → cnhkmcp-2.2.0.dist-info}/METADATA +1 -1
- {cnhkmcp-2.1.9.dist-info → cnhkmcp-2.2.0.dist-info}/RECORD +55 -10
- {cnhkmcp-2.1.9.dist-info → cnhkmcp-2.2.0.dist-info}/WHEEL +0 -0
- {cnhkmcp-2.1.9.dist-info → cnhkmcp-2.2.0.dist-info}/entry_points.txt +0 -0
- {cnhkmcp-2.1.9.dist-info → cnhkmcp-2.2.0.dist-info}/licenses/LICENSE +0 -0
- {cnhkmcp-2.1.9.dist-info → cnhkmcp-2.2.0.dist-info}/top_level.txt +0 -0
cnhkmcp/untracked/skills/brain-data-feature-engineering/reference.md
@@ -0,0 +1,493 @@
# Feature Engineering Mindset Patterns

This document provides a comprehensive framework for **thinking** about feature engineering, not a list of patterns to apply blindly.

## The Core Philosophy

**Feature engineering is not about finding predictive patterns—it's about understanding what data truly means and expressing that meaning in quantifiable ways.**

## 1. Data Semantic Understanding Framework

### Field Deconstruction Methodology

**For each field, ask these fundamental questions:**

#### What is being measured?
- Not just the surface description—what is the actual entity or concept?
- Example: Don't think "P/E ratio", think "price divided by earnings per share"
- What is the "thing" behind the numbers?

#### How is it measured?
- Data collection method (survey, sensor, calculation)
- Assumptions embedded in measurement
- Frequency and timing considerations
- Example: Book values are quarterly, audited, historical cost; market cap is continuous, forward-looking

#### What is the time dimension?
- Instantaneous snapshot (price at moment T)
- Cumulative value (total sales to date)
- Rate of change (velocity, acceleration)
- Memory/persistence (how long effects last)

#### Why does this field exist?
- What problem was it designed to solve?
- Who uses it and for what purpose?
- What business process generates it?

### Field Relationship Mapping

**Find the story the data tells:**

#### Identify connections:
- **Causal**: X causes Y (revenue → profit)
- **Complementary**: X and Y measure related aspects (price & volume)
- **Conflicting**: X and Y can diverge (book value vs. market cap)
- **Independent**: X and Y are unrelated (company location vs. stock price)

#### Build the narrative:
- What is the complete picture these fields paint?
- What are the key turning points?
- What is missing that would complete the story?

### Data Quality Assessment

**Evaluate from the source:**

#### Generation mechanisms:
- Manual entry (human error, bias, gaming)
- Automated collection (sensor precision, calibration)
- Calculated values (formula assumptions, input quality)

#### Reliability indicators:
- Audit trails and verification processes
- Consistency checks across sources
- Update frequency vs. true change rate

## 2. First-Principles Thinking

**Strip away all labels and assumptions.**

### The Process:
1. **Forget what you "know"**: Ignore domain-specific labels
2. **Identify raw components**: What are the fundamental elements?
3. **Question everything**: Why is it measured this way?
4. **Rebuild from basics**: Construct features from fundamental truths

### Example:
**Don't say**: "P/E ratio measures valuation"
**Do say**: "Price per share divided by earnings per share compares market price to accounting profit"

**First principles analysis**:
- Price: What market participants collectively believe value is
- Earnings: Accounting measure of profit generation
- Ratio: Comparison of two different perspectives on value
- **Insight**: The spread between perspectives is what matters, not the ratio itself

### Exercise:
For any field, write down:
- What is literally being measured (no jargon)
- What assumptions are built in
- What could cause it to be wrong
- What it would mean if it were very high or very low

## 3. Question-Driven Feature Generation

**Start with questions, not formulas.**

### The Question Bank:

#### Q1: "What is stable?" (Invariance)
**Purpose**: Find what doesn't change—it's often more meaningful than what does

**Leads to features about:**
- Stability measures (coefficient of variation)
- Invariant relationships (ratios that stay constant)
- Structural constants (parameters that define the system)

**Examples**:
- "Customer acquisition cost stability" = std_dev(CAC) / mean(CAC)
- *Meaning*: Is our cost structure predictable?
- *High value*: Costs are volatile, business model is unstable
- *Low value*: Costs are predictable, scalable model
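The stability measure above fits in a few lines. This is a minimal sketch; the CAC figures are invented purely for illustration:

```python
from statistics import mean, stdev

def stability(values):
    # Coefficient of variation: std_dev / mean (lower = more stable)
    m = mean(values)
    return stdev(values) / abs(m)

stable_cac = [100, 102, 98, 101, 99]    # predictable acquisition costs
volatile_cac = [100, 160, 55, 140, 45]  # erratic acquisition costs

print(stability(stable_cac) < stability(volatile_cac))  # True
```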
#### Q2: "What is changing?" (Dynamics)
**Purpose**: Understand motion, rate, and direction

**Leads to features about:**
- Velocity and acceleration
- Trend vs. noise
- Change significance

**Examples**:
- "Growth acceleration" = (revenue_t - revenue_{t-1}) - (revenue_{t-1} - revenue_{t-2})
- *Meaning*: Is growth speeding up or slowing down?
- *High value*: Accelerating growth
- *Low value*: Decelerating growth
- *Why it matters*: Acceleration is an early signal of inflection points
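The growth-acceleration formula above is a second difference. A minimal sketch, with made-up revenue figures:

```python
def acceleration(series):
    # First differences: period-over-period change
    deltas = [b - a for a, b in zip(series, series[1:])]
    # Second differences: change of the change
    return [d2 - d1 for d1, d2 in zip(deltas, deltas[1:])]

revenue = [100, 110, 125, 145]  # hypothetical quarterly revenue: +10, +15, +20
print(acceleration(revenue))    # [5, 5] -> growth is speeding up
```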
#### Q3: "What is anomalous?" (Deviation)
**Purpose**: Identify what breaks patterns—the exceptions reveal rules

**Leads to features about:**
- Outliers and extremes
- Deviation from normal
- Pattern breaks

**Examples**:
- "Earnings surprise magnitude" = (actual - expected) / |expected|
- *Meaning*: How much did results deviate from expectations?
- *High value*: Significant surprise (positive or negative)
- *Why it matters*: Surprises often trigger re-evaluation

#### Q4: "What is combined?" (Interaction)
**Purpose**: Understand how elements affect each other

**Leads to features about:**
- Synergies and conflicts
- Joint effects
- Conditional relationships

**Examples**:
- "Marketing-sales synergy" = (marketing_spend × sales_efficiency)
- *Meaning*: Do marketing and sales amplify each other?
- *High value*: Strong synergy (1+1=3)
- *Low value*: Weak synergy (1+1=1.5)
- *Why it matters*: Synergy indicates scalability

#### Q5: "What is structural?" (Composition)
**Purpose**: Decompose wholes into meaningful parts

**Leads to features about:**
- Component breakdowns
- Proportional relationships
- Structure changes

**Examples**:
- "Recurring revenue quality" = subscription_revenue / total_revenue
- *Meaning*: What portion of revenue is predictable?
- *High value*: High-quality recurring revenue
- *Low value*: Low-quality one-time revenue
- *Why it matters*: Predictability affects valuation

#### Q6: "What is cumulative?" (Accumulation)
**Purpose**: Capture time-based build-up and decay

**Leads to features about:**
- Running totals and diminishing returns
- Memory effects
- Time-weighted values

**Examples**:
- "Customer relationship depth" = Σ(purchase_value × e^{-days_ago / half_life})
- *Meaning*: Time-decayed cumulative purchase value
- *High value*: Deep, recent relationship
- *Low value*: Shallow or old relationship
- *Why it matters*: Recency and frequency predict loyalty
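The Q6 formula translates directly into code. A sketch with invented purchase data; note that with this scaling the weight at `half_life` days is e⁻¹ ≈ 0.37 rather than 0.5, so a true half-life would divide `days_ago` by `half_life / ln 2`:

```python
import math

def relationship_depth(purchases, half_life=90.0):
    # purchases: list of (value, days_ago) pairs; recent purchases count more
    return sum(value * math.exp(-days_ago / half_life)
               for value, days_ago in purchases)

today = relationship_depth([(100.0, 0)])      # full weight: 100.0
aged = relationship_depth([(100.0, 90)], 90)  # ~36.8 after one half_life period
print(today, round(aged, 1))
```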
#### Q7: "What is relative?" (Comparison)
**Purpose**: Understand position in context

**Leads to features about:**
- Rankings and percentiles
- Normalizations
- Context-aware measures

**Examples**:
- "Relative efficiency" = company_efficiency / industry_median_efficiency
- *Meaning*: How efficient vs. peers?
- *High value*: More efficient than typical
- *Low value*: Less efficient than typical
- *Why it matters*: Competitiveness indicator
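Peer-relative normalization is a one-liner. A sketch with invented efficiency scores:

```python
from statistics import median

def relative_efficiency(company, peers):
    # Position in context: company efficiency vs. the typical (median) peer
    return company / median(peers)

peers = [0.8, 0.9, 1.0, 1.1, 1.2]
print(relative_efficiency(1.5, peers))  # 1.5 -> 50% more efficient than typical
```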
#### Q8: "What is essential?" (Essence)
**Purpose**: Distill to core truths

**Leads to features about:**
- First-principles measures
- Fundamental relationships
- Stripped-down indicators

**Examples**:
- "Core profitability" = (revenue - variable_costs) / revenue
- *Meaning*: Profitability without fixed cost distortions
- *Why it matters*: Shows true unit economics

### How to Use the Question Bank:

**For any dataset**:
1. Go through each question
2. Ask: "Which fields or combinations can answer this?"
3. Formulate specific feature concepts
4. Validate each concept has clear meaning
5. Document the reasoning

**Example Workflow:**
```
Dataset: Sales data with fields [customer_id, order_value, order_date, product_category]

Q: "What is stable?"
→ Average order value per customer over time
→ Favorite category per customer (most frequent)
→ Purchase frequency pattern

Q: "What is changing?"
→ Order value trend (increasing/decreasing)
→ Category preference evolution
→ Purchase interval changes

Q: "What is anomalous?"
→ Orders far from customer's typical behavior
→ Sudden category switches
→ Unusually large/small orders

Q: "What is combined?"
→ Order value × frequency = total value
→ Category diversity × consistency = loyalty measure
→ Recency × frequency = engagement score

... (continue through all questions)
```

## 4. Field Combination Logic Patterns

### When you combine fields, what are you really doing?

#### Addition: "X + Y" → What does this sum represent?
**Good when**: Combining parts of a whole
- Total revenue = product_A_revenue + product_B_revenue
**Bad when**: Adding unrelated concepts
- Price + volume (What does this mean?)

#### Subtraction: "X - Y" → What is the difference telling you?
**Good when**: Measuring gap or surplus
- Profit = revenue - costs
- Shortfall = target - actual
**Bad when**: Ignoring that difference scales with magnitude
- Revenue_2023 - revenue_2022 (better: percentage change)

#### Multiplication: "X × Y" → What is the joint effect?
**Good when**: Capturing interaction or scaling
- Total_value = price × quantity
- Weighted_importance = score × weight
**Bad when**: Mixing units without meaning
- Revenue × employee_count (What is "dollar-employees"?)

#### Division: "X / Y" → What ratio or rate are you computing?
**Good when**: Creating relative measures
- Efficiency = output / input
- Concentration = part / whole
**Bad when**: Denominator can be zero or meaningless
- Revenue / days_since_founded (early days distort heavily)
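The division caveat is worth encoding explicitly. A guarded-ratio sketch (threshold and fallback value are arbitrary choices):

```python
def safe_ratio(numerator, denominator, min_denominator=1e-9, default=None):
    # Refuse to divide when the denominator is too small to be meaningful
    if abs(denominator) < min_denominator:
        return default
    return numerator / denominator

print(safe_ratio(10, 2))  # 5.0
print(safe_ratio(10, 0))  # None -> caller decides how to handle missing ratios
```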
#### Conditional: "If X then Y" → What condition matters?
**Good when**: Threshold effects exist
- If temperature > 100°C then phase = "gas"
- If churn_risk > 0.8 then intervene = true
**Bad when**: Arbitrary thresholds without justification
- If customer_age > 30 then category = "old" (why 30?)

### The Deeper Question:
**"What new information does this combination create?"**

A good combination:
- Reveals something the individual fields hide
- Creates a new concept with clear meaning
- Has intuitive interpretation

A bad combination:
- Just applies math to numbers
- Creates meaningless units (dollar-days per employee)
- Is hard to explain

## 5. Escaping Conventional Thinking Traps

### Trap 1: "This is a [field type], so I should..."
**Wrong**: "This is price data, so I should calculate moving averages"
**Right**: "This is a time series of transaction values—what patterns exist?"

**Escaping method**: Pretend you don't know the field name or domain. Just look at:
- Data type (number, category, date)
- Update frequency
- Distribution
- Missingness pattern

**Ask**: What would a data scientist from a different field see?

### Trap 2: "Everyone uses [conventional feature], so I will too"
**Wrong**: Building P/E, moving averages, RSI because "that's what you do"
**Right**: Asking "What does this ratio truly mean? Is there a better way to express that concept?"

**Example with P/E**:
- Conventional: P/E = price / earnings ("valuation metric")
- First principles: Compares market's forward-looking assessment to accounting record
- Deeper question: Why do these diverge? What does divergence mean?
- Better feature: Track divergence trend, not just level

### Trap 3: "Complexity = better"
**Wrong**: Adding more variables, interactions, conditions to improve "sophistication"
**Right**: Simpler is often more robust and interpretable

**Test**: Can you explain the feature in one sentence to a non-expert?
- If no → It's too complex
- If yes → It might be valuable

### Trap 4: "Feature engineering is separate from domain knowledge"
**Wrong**: Applying math without understanding what fields mean
**Right**: Deep domain understanding → Better features

**Process**:
1. Understand the business process that generates each field
2. Identify pain points and edge cases in that process
3. Build features that capture those nuances
4. Validate with domain experts

## 6. Feature Validation Checklist

### Before finalizing any feature, verify:

#### □ Clear Definition
- [ ] Can be explained in one sentence
- [ ] Uses precise language
- [ ] Avoids jargon and buzzwords

#### □ Logical Meaning
- [ ] Represents a real phenomenon or concept
- [ ] Not just a mathematical operation
- [ ] Has intuitive interpretation

#### □ Business Relevance
- [ ] Connects to real-world decision-making
- [ ] Answers a meaningful question
- [ ] Reveals actionable insight

#### □ Directional Understanding
- [ ] What does high value mean?
- [ ] What does low value mean?
- [ ] Is there an optimal range?

#### □ Boundary Conditions
- [ ] What do extreme values indicate?
- [ ] What happens at zero/infinity?
- [ ] Are there theoretical limits?

#### □ Data Quality Awareness
- [ ] What are sources of noise?
- [ ] When might this be unreliable?
- [ ] What biases could affect it?

#### □ Novelty Check
- [ ] Does this reveal something new?
- [ ] Or just repackage existing information?
- [ ] Would an expert learn something?

### Example Validation:

**Feature**: Customer purchase velocity = total_purchases / account_age_days

- **Clear definition**: "Average number of purchases per day since account creation"
- **Logical meaning**: Measures purchase frequency over customer lifetime
- **Business relevance**: Indicates customer engagement and habit formation
- **Directional**: High = frequent buyer, Low = infrequent buyer
- **Boundaries**: Zero = no purchases, Very high = possible data error or bulk buyer
- **Data quality**: Affected by returns, multi-item orders, gift purchases
- **Novelty**: Reveals engagement pattern beyond simple total purchases
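The boundary conditions flagged in the validation above translate naturally into guards. A sketch; the 7-day minimum age is an arbitrary illustrative threshold:

```python
def purchase_velocity(total_purchases, account_age_days, min_age_days=7):
    # Purchases per day since account creation; None for accounts too new to judge
    if account_age_days < min_age_days:
        return None
    return total_purchases / account_age_days

print(purchase_velocity(30, 300))  # 0.1 purchases/day
print(purchase_velocity(2, 1))     # None -> a day-old account would distort the rate
```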
## 7. Creative Thinking Techniques

### A. Lateral Thinking (Borrow from other domains)

**Ask**: How would a physicist/biologist/sociologist approach this?

**Example - Physics**:
- Field: Customer usage frequency
- Physics concept: Resonance frequency
- Feature idea: "Natural usage cadence" = frequency with highest amplitude
- **Meaning**: Inherent rhythm of customer behavior

**Example - Biology**:
- Field: Product adoption rates
- Biology concept: Population growth
- Feature idea: "Adoption growth model" = fit logistic growth curve
- **Meaning**: Identify inflection point where growth slows

**Exercise**: For each field, brainstorm 3 analogies from other disciplines

### B. Vertical Thinking (Keep asking "why?")

**The 5 Whys exercise**:
1. Why do customers churn? → Because they stop using the product
2. Why do they stop using it? → Because they don't find value
3. Why don't they find value? → Because their needs changed
4. Why did needs change? → Because their business grew
5. Why did business growth matter? → Because the product didn't scale with them

**Resulting feature**: "Scalability mismatch" = customer_growth_rate / product_capability

**Process**: Don't stop at surface-level questions. Dig until you hit fundamental truths.

### C. Perspective Shifting (Change your viewpoint)

**Time ↔ Space**:
- If you have time series data, think about spatial patterns (clustering, distribution)
- If you have spatial/cross-sectional data, think about evolution over time

**Individual ↔ Collective**:
- Zoom in: What does this mean for one entity?
- Zoom out: What does this pattern mean for the group?

**Quantitative ↔ Qualitative**:
- What would the qualitative description be?
- How do you quantify that description?

### D. Constraint-Based Creativity (Add restrictions)

**Artificial constraints force creative solutions**:

- "You can only use one field" → Forces focus on that field's nuances
- "You can only use addition/subtraction" → Simplifies relationships
- "You must include time" → Adds temporal dimension
- "You must be able to explain to a 5-year-old" → Forces simplicity

**Example**: "Explain customer value using only purchase timestamps"
- Feature: Time-based engagement depth (weighted recency/frequency)
- **Meaning**: Recent, frequent purchases = high engagement

## 8. From Concepts to Implementations

### Bridging the Gap:

**Concept**: "Customer engagement momentum" (from "What is changing?")
- **Meaning**: Is engagement increasing or decreasing in intensity?
- **Implementation**: Δ(engagement_score) over time, with acceleration

**Steps**:
1. Define engagement_score (purchase frequency × recency_weight)
2. Calculate change: engagement_today - engagement_last_week
3. Calculate acceleration: change_today - change_last_week
4. **Result**: Positive = increasing momentum, Negative = losing momentum
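The steps above can be sketched with made-up weekly engagement scores; step 1's score definition is assumed to exist already:

```python
def momentum(weekly_scores):
    # Steps 2-3: week-over-week change, then change of the change
    change = [b - a for a, b in zip(weekly_scores, weekly_scores[1:])]
    accel = [d2 - d1 for d1, d2 in zip(change, change[1:])]
    return change, accel

scores = [5.0, 6.0, 8.0, 11.0]  # hypothetical engagement_score per week
change, accel = momentum(scores)
print(change)  # [1.0, 2.0, 3.0]
print(accel)   # [1.0, 1.0] -> positive: gaining momentum
```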
### Common Implementation Patterns:

**For stability**: Rolling coefficient of variation, autocorrelation, entropy
**For change**: Differences, log differences, second differences
**For anomalies**: Z-scores, isolation forest scores, deviation from predicted
**For interactions**: Products, ratios, conditional means
**For structure**: Component ratios, hierarchical decompositions
**For accumulation**: Running sums, exponentially weighted sums, integration
**For relativity**: Percentiles, z-scores, min-max scaling
**For essence**: Factor analysis, PCA, simple base components
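Two of these patterns (deviation and accumulation), sketched in plain Python:

```python
from statistics import mean, stdev

def z_scores(xs):
    # Deviation pattern: distance from the mean in standard-deviation units
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

def ew_sum(xs, decay=0.5):
    # Accumulation pattern: exponentially weighted running sum (memory effect)
    total, out = 0.0, []
    for x in xs:
        total = decay * total + x
        out.append(total)
    return out

print(z_scores([1, 2, 3]))  # [-1.0, 0.0, 1.0]
print(ew_sum([1, 1, 1]))    # [1.0, 1.5, 1.75]
```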
### Quality Metrics for Implementation:

**Coverage**: What percentage of entities have data?
**Stability**: Does the feature behave consistently across time periods?
**Interpretability**: Can you explain the value meaningfully?
**Actionability**: Does it suggest a clear action?

## Summary: The Mindset in Six Words

**"Understand deeply, question assumptions, express meaningfully"**

---

*This document provides thinking tools, not formulas. True feature engineering happens when you combine deep data understanding with creative questions about what that data means.*
cnhkmcp/untracked/skills/brain-feature-implementation/SKILL.md
@@ -0,0 +1,87 @@
---
name: brain-feature-implementation
description: Implements WorldQuant Brain features from an idea markdown file. Downloads the dataset and generates the alpha expressions defined in the idea.
allowed-tools:
- Read
- RunTerminal
- ManageTodoList
---

# Brain Feature Implementation

## Description
This skill automates the process of converting a WorldQuant Brain idea document (Markdown) into actionable Alpha expressions. It handles dataset downloading and code generation for each distinct idea pattern.

## Scope of Work
* This skill operates exclusively by manipulating local CSV files using the provided Python scripts.
* **Do NOT use any WorldQuant Brain MCP tools** (e.g., `brain-api`).
* **Do NOT write custom Python scripts** (e.g. `python -c ...` or new `.py` files) to check data or generate expressions. You MUST use the `scripts/implement_idea.py` tool.
* Do not attempt to submit alphas or run simulations on the platform. Focus only on generating the expression files locally.

## Instructions

1. **Analyze the Idea Document**
   * Read the provided markdown file.
   * Extract the following metadata:
     * **Dataset ID** (e.g., `analyst15`)
     * **Region** (e.g., `GLB`)
     * **Delay** (e.g., `1` or `0`)
   * *If any metadata is missing, ask the user to clarify.*

2. **Download Dataset**
   * Execute the fetch script using the extracted parameters.
   * **Locate Scripts**:
     * Check your current working directory (`ls -R` or `Get-ChildItem -Recurse`).
     * Find the path to `fetch_dataset.py`. It is likely in `brain-feature-implementation/scripts` or `scripts`.
   * **Run Command**:
     * Change directory to the folder containing the script before running it.
     * Command:
       ```bash
       cd <PATH_TO_SCRIPTS_FOLDER> && python fetch_dataset.py --datasetid <ID> --region <REGION> --delay <DELAY>
       ```
     * Wait for the download to complete. The script will create a folder in `../data/`.

3. **Plan Implementation**
   * Scan the markdown file for **Feature Definitions** or **Formulas**.
   * Look for patterns like `Definition: <formula>` or code blocks describing math.
   * Use the `manage_todo_list` tool to create a plan with one entry for each unique idea/formula found.
     * *Title*: The Idea Name or ID (e.g., "3.1.1 Estimate Stability Score").
     * *Description*: The specific template formula (e.g., `template: "{st_dev} / abs({mean})"`).

4. **Execute Implementation**
   * For each item in the Todo List:
     * **Construct the Template**:
       * Use Python format string syntax `{variable}`.
       * The `{variable}` must match the **suffix** of the fields in the dataset (e.g., `mean`, `st_dev`, `gro`).
       * **CRITICAL**: Do NOT include the full prefix or horizon in the template. The script auto-detects these.
       * *Correct Example*: For `anl15_gr_12_m_gro / anl15_gr_12_m_pe`, use template: `{gro} / {pe}`.
       * *Incorrect Example*: `{anl15_gr_12_m_gro} / {pe}` (includes the prefix).
       * *Incorrect Example*: `${gro} / ${pe}` (shell syntax).
     * **Determine Dataset Folder**: `{ID}_{REGION}_delay{DELAY}` (e.g., `analyst10_GLB_delay1`).
     * **Run Script**:
       * Navigate to the folder containing `implement_idea.py` (as identified in step 2).
       * Command:
         ```bash
         cd <PATH_TO_SCRIPTS_FOLDER> && python implement_idea.py --template "<TEMPLATE_STRING>" --dataset "<DATASET_FOLDER_NAME>"
         ```
       * *Note*: The script ONLY accepts `--template` and `--dataset`. Do not pass any other arguments like `--filters` or `--groupby`.
       * **Strict Rule**: Do NOT use `python -c` or create temporary scripts to verify or process results. Trust the output of `implement_idea.py`.
     * Verify the output (number of expressions generated).
     * Mark the Todo item as completed.

5. **Finalize Output**
   * After all Todo items are completed, merge all generated expressions into a single file.
   * **Run Merge Script**:
     * Navigate to the folder containing scripts.
     * Command:
       ```bash
       cd <PATH_TO_SCRIPTS_FOLDER> && python merge_expression_list.py --dataset "<DATASET_FOLDER_NAME>"
       ```
     * This will create `final_expressions.json` in the dataset directory.
   * Report the total number of unique expressions and the path to the final file to the user.
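The suffix-template convention in step 4 can be pictured as follows. This is only an illustrative sketch: the real expansion lives in `implement_idea.py`, which also auto-detects prefixes and horizons across the dataset, so its behavior may differ in details.

```python
import re

# Hypothetical sketch of suffix-template expansion; not the actual
# implement_idea.py logic, which handles prefix/horizon detection itself.
def expand_template(template, fields):
    # Fill each {suffix} slot with a dataset field name ending in that suffix
    mapping = {}
    for slot in re.findall(r"\{(\w+)\}", template):
        matches = [f for f in fields if f.endswith("_" + slot)]
        if not matches:
            raise KeyError(f"no field ends with _{slot}")
        mapping[slot] = matches[0]
    return template.format(**mapping)

fields = ["anl15_gr_12_m_gro", "anl15_gr_12_m_pe"]
print(expand_template("{gro} / {pe}", fields))  # anl15_gr_12_m_gro / anl15_gr_12_m_pe
```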
## Script Dependencies
This skill relies on the following scripts in its `scripts/` directory:
- `fetch_dataset.py`: Downloads data from Brain API.
- `implement_idea.py`: Generates alpha expressions from templates.
- `ace_lib.py` & `helpful_functions.py`: Support libraries.