@yeyuan98/opencode-bioresearcher-plugin 1.5.2 → 1.5.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +1 -0
- package/dist/agents/bioresearcher/prompt.js +235 -235
- package/dist/skills/bioresearcher-core/patterns/bioresearcher/analysis-methods.md +551 -551
- package/dist/skills/bioresearcher-core/patterns/bioresearcher/best-practices.md +647 -647
- package/dist/skills/bioresearcher-core/patterns/bioresearcher/python-standards.md +944 -944
- package/dist/skills/bioresearcher-core/patterns/bioresearcher/report-template.md +613 -613
- package/dist/skills/bioresearcher-core/patterns/bioresearcher/tool-selection.md +481 -481
- package/dist/skills/bioresearcher-core/patterns/citations.md +234 -234
- package/dist/skills/bioresearcher-core/patterns/rate-limiting.md +167 -167
- package/dist/skills/gromacs-guides/SKILL.md +48 -0
- package/dist/skills/gromacs-guides/guides/create_index.md +96 -0
- package/dist/skills/gromacs-guides/guides/inspect_tpr.md +93 -0
- package/package.json +1 -1
@@ -1,647 +1,647 @@
# Best Practices for Data Analysis

Critical best practices for data analysis including upfront filtering, validation, error handling, and performance optimization.

## Overview

This pattern defines mandatory best practices that MUST be followed for all data analysis tasks:
1. Upfront Filtering (CRITICAL)
2. Data Validation
3. Error Handling & Retry Logic
4. Rate Limiting
5. Performance Optimization
6. Context Window Management

---

## 1. Upfront Filtering (CRITICAL)

### Core Rule

**RULE: Always filter data at the SOURCE, not after retrieval**

**Why:**
- Reduces data transfer
- Improves performance
- Conserves context window
- Follows database best practices
- Minimizes memory usage

### Database Queries

```python
# ✅ GOOD: Filter at database level, selecting only needed columns
dbQuery(
    "SELECT nct_id, phase, status, condition FROM clinical_trials WHERE phase = :phase AND status = :status LIMIT 100",
    {"phase": "Phase 3", "status": "Recruiting"}
)

# ❌ BAD: Retrieve all data then filter
dbQuery("SELECT * FROM clinical_trials")
# Then filter in Python - inefficient!
```

**Best Practices:**
- Use WHERE clauses to filter rows
- Use LIMIT to cap result size
- Use indexed columns in WHERE clause
- Use named parameters (not string concatenation)
- Select only needed columns (avoid SELECT *)

### Table Operations

```python
# ✅ GOOD: Filter with table tools
tableFilterRows(
    file_path="data/trials.xlsx",
    column="Phase",
    operator="=",
    value="Phase 3",
    max_results=100
)

# ❌ BAD: Load entire table then filter
tableGetRange(file_path="data/trials.xlsx", range="A1:Z10000")
# Then filter in Python - wastes memory!
```

**Best Practices:**
- Use tableFilterRows for single-condition filters
- Use tableSearch to find specific values
- Use max_results to limit output
- Preview with tableGetSheetPreview before processing
- Check row count before deciding approach

### BioMCP Queries

```python
# ✅ GOOD: Targeted query with filters
biomcp_article_searcher(
    genes=["BRAF", "NRAS"],
    diseases=["melanoma"],
    keywords=["treatment resistance"],
    variants=["V600E"],
    page_size=50
)

# ❌ BAD: Broad query then manual filtering
biomcp_search(query="BRAF")
# Then manually filter through thousands of results!
```

**Best Practices:**
- Use specific domain filters (genes, diseases, variants)
- Combine multiple filters with AND logic
- Use page_size to limit results per page
- Use biomcp_search only for cross-domain queries
- Specify exact criteria upfront

### Web Searches

```python
# ✅ GOOD: Specific search query
web-search-prime_web_search_prime(
    search_query="BRAF V600E melanoma FDA approval 2024",
    search_recency_filter="oneYear"
)

# ❌ BAD: Broad search then filter results
web-search-prime_web_search_prime(search_query="BRAF")
# Then manually evaluate hundreds of results!
```

**Best Practices:**
- Use specific search terms
- Use recency filters when time-sensitive
- Use domain filters for trusted sources
- Limit results to a manageable number
- Verify source quality before using

---

## 2. Data Validation

### Validation Pattern

```
VALIDATION WORKFLOW:
1. Check data existence (not empty/null)
2. Validate structure (required fields present)
3. Validate types (correct data types)
4. Validate values (within expected ranges)
5. Validate quality (no duplicates, no corruption)
```

### Example: Comprehensive Validation

```python
from typing import Dict, List

def validate_clinical_trials(trials: List[Dict]) -> List[Dict]:
    """Validate clinical trial data with comprehensive checks.

    Args:
        trials: List of trial dictionaries

    Returns:
        Validated trials

    Raises:
        ValueError: If validation fails
    """
    # 1. Check existence
    if not trials:
        raise ValueError("No trial data provided")

    # 2. Define required structure
    required_fields = {
        "nct_id": str,
        "phase": str,
        "status": str,
        "condition": str,
        "response_rate": (int, float),
        "patient_count": int
    }

    valid_trials = []
    errors = []

    for i, trial in enumerate(trials):
        try:
            # 3. Validate structure
            for field, expected_type in required_fields.items():
                if field not in trial:
                    raise ValueError(f"Missing field: {field}")

                # 4. Validate types
                if not isinstance(trial[field], expected_type):
                    raise ValueError(
                        f"Field {field} has wrong type: "
                        f"expected {expected_type}, got {type(trial[field])}"
                    )

            # 5. Validate values
            if not trial["nct_id"].startswith("NCT"):
                raise ValueError(f"Invalid NCT ID format: {trial['nct_id']}")

            if trial["phase"] not in ["Phase 1", "Phase 2", "Phase 3", "Phase 4"]:
                raise ValueError(f"Invalid phase: {trial['phase']}")

            if not 0 <= trial["response_rate"] <= 100:
                raise ValueError(f"Response rate out of range: {trial['response_rate']}")

            if trial["patient_count"] < 0:
                raise ValueError(f"Negative patient count: {trial['patient_count']}")

            valid_trials.append(trial)

        except ValueError as e:
            errors.append(f"Trial {i}: {e}")

    # 6. Report validation results
    if errors:
        print(f"Validation warnings: {len(errors)} trials had issues")
        for error in errors[:5]:  # Show first 5
            print(f"  - {error}")
        if len(errors) > 5:
            print(f"  ... and {len(errors) - 5} more")

    print(f"Validated {len(valid_trials)}/{len(trials)} trials")

    return valid_trials
```
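
Step 5 of the workflow (no duplicates) is not exercised by the example above. A minimal de-duplication pass keyed on `nct_id` could look like this (a sketch — `dedupe_trials` is a hypothetical helper, not part of the plugin):

```python
def dedupe_trials(trials):
    """Keep the first occurrence of each nct_id, preserving input order."""
    seen = set()
    unique = []
    for trial in trials:
        if trial["nct_id"] not in seen:
            seen.add(trial["nct_id"])
            unique.append(trial)
    return unique
```

Running it before `validate_clinical_trials` keeps the quality check cheap, since duplicates never reach the per-field validation loop.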
### JSON Validation

```python
# Using jsonValidate for structured data
from json_tools import jsonExtract, jsonValidate

# Define expected schema
schema = {
    "type": "object",
    "required": ["nct_id", "phase", "status"],
    "properties": {
        "nct_id": {"type": "string", "pattern": "^NCT[0-9]{8}$"},
        "phase": {"type": "string"},
        "status": {"type": "string"},
        "response_rate": {"type": "number", "minimum": 0, "maximum": 100}
    }
}

# Extract and validate
result = jsonExtract(file_path="output.json")
if result.success:
    validation = jsonValidate(data=result.data, schema=schema)
    if not validation.valid:
        print(f"Validation failed: {validation.errors}")
    else:
        print("Data validated successfully")
```

---

## 3. Error Handling & Retry Logic

### Retry Pattern (from retry.md)

```python
def fetch_with_retry(
    operation,
    max_attempts: int = 3,
    initial_delay: float = 2.0,
    backoff_factor: float = 2.0
):
    """Execute operation with exponential backoff retry.

    Args:
        operation: Function to execute
        max_attempts: Maximum retry attempts
        initial_delay: Initial delay in seconds
        backoff_factor: Multiplier for delay after each failure

    Returns:
        Operation result

    Raises:
        Exception: If all attempts fail
    """
    delay = initial_delay

    for attempt in range(max_attempts):
        try:
            result = operation()
            return result

        except Exception as e:
            if attempt < max_attempts - 1:
                print(f"Attempt {attempt + 1} failed: {e}")
                print(f"Retrying in {delay} seconds...")
                blockingTimer(delay)
                delay = delay * backoff_factor  # Exponential backoff
            else:
                print(f"All {max_attempts} attempts failed")
                raise
```

### Apply Retry to External Operations

```python
# BioMCP queries
def fetch_articles_with_retry(genes, diseases):
    """Fetch articles with retry logic."""
    return fetch_with_retry(
        lambda: biomcp_article_searcher(
            genes=genes,
            diseases=diseases,
            page_size=50
        )
    )

# Database queries
def query_database_with_retry(sql, params):
    """Query database with retry logic."""
    return fetch_with_retry(
        lambda: dbQuery(sql, params)
    )

# Web requests
def fetch_web_with_retry(url):
    """Fetch web content with retry logic."""
    return fetch_with_retry(
        lambda: webfetch(url)
    )
```

### Error Handling Best Practices

```python
# ✅ GOOD: Comprehensive error handling
def process_trials(file_path: str) -> Dict:
    """Process trials with comprehensive error handling."""
    try:
        # Load data
        df = load_data(file_path)

        # Validate
        if df.empty:
            raise ValueError("File contains no data")

        # Process
        results = analyze_trials(df)

        return results

    except FileNotFoundError:
        raise FileNotFoundError(f"File not found: {file_path}")

    except pd.errors.EmptyDataError:
        raise ValueError(f"File is empty or corrupt: {file_path}")

    except Exception as e:
        raise RuntimeError(f"Processing failed: {e}")

# ❌ BAD: No error handling
def process_trials(file_path: str) -> Dict:
    """Process trials without error handling."""
    df = load_data(file_path)  # May fail silently
    results = analyze_trials(df)  # May fail silently
    return results
```

---

## 4. Rate Limiting

### BioMCP Rate Limiting (MANDATORY)

```python
# ✅ GOOD: Sequential calls with rate limiting
articles = biomcp_article_searcher(genes=["BRAF"], page_size=50)
blockingTimer(0.3)  # 300ms delay

for article in articles[:10]:
    details = biomcp_article_getter(pmid=article["pmid"])
    blockingTimer(0.3)  # 300ms between each call
    # Process details

# ❌ BAD: Back-to-back calls without rate limiting
# This will cause API throttling!
results = []
for pmid in pmids:
    results.append(biomcp_article_getter(pmid=pmid))  # No delay!
```

### Web Rate Limiting

```python
# ✅ GOOD: Conservative rate limiting for web
for url in urls:
    content = webfetch(url)
    blockingTimer(0.5)  # 500ms delay for web requests
    # Process content

# ❌ BAD: No rate limiting
for url in urls:
    content = webfetch(url)  # May get blocked!
```

### Rate Limiting Guidelines

| Tool Category | Delay | Rationale |
|--------------|-------|-----------|
| BioMCP tools | 0.3s | API rate limits |
| Web tools | 0.5s | Server courtesy |
| Database | None | Local/network, no limit |
| File operations | None | Local filesystem, no limit |
| Parser tools | None | Local processing, no limit |
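
The table above can be applied mechanically with a small lookup helper. This is a sketch, not part of the plugin; `time.sleep` stands in for the `blockingTimer` tool used elsewhere in this document:

```python
import time

# Per-category delays from the table above, in seconds (0 means no pause).
RATE_LIMITS = {
    "biomcp": 0.3,
    "web": 0.5,
    "database": 0.0,
    "file": 0.0,
    "parser": 0.0,
}

def rate_limited(category: str, operation):
    """Run operation, then pause by the category's delay before the next call."""
    result = operation()
    delay = RATE_LIMITS.get(category, 0.5)  # unknown tools get the conservative web delay
    if delay > 0:
        time.sleep(delay)
    return result
```

Centralizing the delays this way means a tightened API limit is a one-line change rather than a hunt through every call site.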

---

## 5. Performance Optimization

### Use Appropriate Tools for Data Size

```python
# Small datasets (< 30 rows): Use table tools
if row_count < 30:
    tableFilterRows(file_path, column="Status", operator="=", value="Active")
    tableGroupBy(file_path, group_column="Phase", agg_column="Count", agg_type="count")

# Medium datasets (30-1000 rows): Use long-table-summary skill
elif row_count < 1000:
    pass  # invoke skill: long-table-summary (follow its 16-step workflow)

# Large datasets (> 1000 rows): Use Python
else:
    pass  # run: uv run python .scripts/py/large_analysis.py --input data.xlsx
```

### Batch Processing

```python
# ✅ GOOD: Process in batches
batch_size = 100
total_items = len(data)

for i in range(0, total_items, batch_size):
    batch = data[i:i+batch_size]
    results = process_batch(batch)

    # Report progress
    completed = min(i + batch_size, total_items)
    percent = (completed / total_items) * 100
    print(f"Progress: {completed}/{total_items} ({percent:.1f}%)")

# ❌ BAD: Process all at once (may overload memory)
results = process_all(data)  # Memory issue with large datasets!
```

### Caching Results

```python
# Cache expensive operations
import hashlib
import json
import os
from typing import Any, Dict

def get_cache_key(params: Dict) -> str:
    """Generate cache key from parameters."""
    return hashlib.md5(json.dumps(params, sort_keys=True).encode()).hexdigest()

def cached_query(sql: str, params: Dict, cache_dir: str = ".cache") -> Any:
    """Execute query with caching."""
    cache_key = get_cache_key({"sql": sql, "params": params})
    cache_file = f"{cache_dir}/{cache_key}.json"

    # Check cache
    if os.path.exists(cache_file):
        with open(cache_file, 'r') as f:
            return json.load(f)

    # Execute query
    result = dbQuery(sql, params)

    # Save to cache
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_file, 'w') as f:
        json.dump(result, f)

    return result
```
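
To see the cache-hit behavior, the same pattern can be exercised with any expensive function standing in for `dbQuery` (hypothetical stand-in, temporary cache directory — a sketch, not plugin code):

```python
import hashlib
import json
import os
import tempfile

calls = {"count": 0}  # track how often the expensive path runs

def expensive(params):
    """Stand-in for dbQuery: pretend this is slow."""
    calls["count"] += 1
    return {"rows": sorted(params["ids"])}

def cached(params, cache_dir):
    """Same key-hash / file-cache pattern as cached_query above."""
    key = hashlib.md5(json.dumps(params, sort_keys=True).encode()).hexdigest()
    path = os.path.join(cache_dir, f"{key}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)  # cache hit: expensive() never runs
    result = expensive(params)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w") as f:
        json.dump(result, f)
    return result

cache_dir = tempfile.mkdtemp()
first = cached({"ids": [3, 1, 2]}, cache_dir)
second = cached({"ids": [3, 1, 2]}, cache_dir)  # served from cache
```

Note that `sort_keys=True` in the hash is what makes `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` share a cache entry.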

---

## 6. Context Window Management

### Minimize Data in Context

```python
# ✅ GOOD: Summarize large datasets
if len(results) > 100:
    summary = {
        "total": len(results),
        "sample": results[:5],  # First 5 items
        "statistics": calculate_stats(results)
    }
    # Return summary, not full dataset

# ❌ BAD: Load entire dataset into context
return results  # 1000+ items in context!
```

### Use File-Based Data Exchange

```python
# ✅ GOOD: Write to file, pass file path
output_file = ".work/results.json"
with open(output_file, 'w') as f:
    json.dump(results, f)

return f"Results saved to {output_file}"

# ❌ BAD: Return large data structure
return results  # Consumes context window!
```

### Pagination for Large Result Sets

```python
# ✅ GOOD: Paginated retrieval
page_size = 50
max_pages = 10  # Upper bound on pages to fetch
all_results = []

for page in range(1, max_pages + 1):
    results = biomcp_article_searcher(
        genes=["BRAF"],
        page=page,
        page_size=page_size
    )
    blockingTimer(0.3)

    all_results.extend(results)

    if len(results) < page_size:
        break  # No more results

# ❌ BAD: Try to get all at once
all_results = biomcp_article_searcher(genes=["BRAF"], page_size=10000)
# May hit API limits!
```

---

## 7. Common Anti-Patterns to Avoid

### Anti-Pattern 1: Premature Optimization

```python
# ❌ BAD: Optimize before measuring
def process_data(data):
    # Complex optimization for small dataset
    return optimized_result

# ✅ GOOD: Measure first, optimize if needed
def process_data(data):
    if len(data) < 100:
        # Simple approach is fine
        return simple_process(data)
    else:
        # Optimize for large dataset
        return optimized_process(data)
```

### Anti-Pattern 2: Over-Engineering

```python
# ❌ BAD: Complex solution for simple problem
class TrialProcessor:
    def __init__(self, config):
        self.config = config
        self.validator = Validator()
        self.analyzer = Analyzer()
        # ... many more components

processor = TrialProcessor(config)
result = processor.process(trial)

# ✅ GOOD: Simple solution for simple problem
result = analyze_trial(trial)
```

### Anti-Pattern 3: Not Using Existing Tools

```python
# ❌ BAD: Write custom Python when tools exist
df = pd.read_excel("data.xlsx")
filtered = df[df["Phase"] == "Phase 3"]
stats = filtered.groupby("Condition").size()

# ✅ GOOD: Use existing tools
tableFilterRows(file_path, column="Phase", operator="=", value="Phase 3")
tableGroupBy(file_path, group_column="Condition", agg_column="Count", agg_type="count")
```

---

## 8. Best Practices Checklist

Before completing any data analysis task, verify:

### Upfront Filtering
- [ ] Data filtered at source (database WHERE, table filters, API parameters)
- [ ] No bulk retrieval then filtering in Python
- [ ] Result size limited appropriately
- [ ] Only necessary columns/fields retrieved

### Validation
- [ ] Data existence checked (not empty)
- [ ] Structure validated (required fields present)
- [ ] Types validated (correct data types)
- [ ] Values validated (within expected ranges)
- [ ] Quality validated (no duplicates, no corruption)

### Error Handling
- [ ] Try-except for external operations
- [ ] Retry logic for network/API calls
- [ ] Informative error messages
- [ ] Graceful degradation when possible
- [ ] Error logging for debugging

### Rate Limiting
- [ ] BioMCP: 0.3s delay between calls
- [ ] Web: 0.5s delay between requests
- [ ] Sequential (not concurrent) API calls
- [ ] Respect API rate limits

### Performance
- [ ] Appropriate tool selected for data size
- [ ] Batch processing for large datasets
- [ ] Caching for expensive operations
- [ ] Progress reporting for long operations

### Context Management
- [ ] Large datasets summarized, not fully loaded
- [ ] File-based data exchange for subagents
- [ ] Pagination for large result sets
- [ ] Only necessary data in context

---

## Summary

**Critical Rules:**

1. **ALWAYS filter upfront** - Never retrieve then filter
2. **ALWAYS validate data** - Check structure, types, values
3. **ALWAYS handle errors** - Retry with backoff
4. **ALWAYS rate limit** - Respect API limits
5. **ALWAYS optimize for size** - Use appropriate tools
6. **ALWAYS manage context** - Minimize data in context

**Following these best practices ensures:**
- ✅ Efficient data processing
- ✅ Reliable operations
- ✅ Professional-quality results
- ✅ Optimal resource usage
- ✅ Reproducible research
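
The six rules compose naturally. A minimal sketch chaining a source-filtered fetch, rate limiting, retry with backoff, and validation (hypothetical names; `time.sleep` stands in for the `blockingTimer` tool):

```python
import time

def fetch_validated(query_fn, validate_fn, max_attempts=3, rate_delay=0.3):
    """Filtered fetch with retry, rate limiting, and validation combined.

    query_fn should already filter at the source (rule 1); validate_fn
    checks structure/types/values (rule 2); failures retry with
    exponential backoff (rule 3); rate_delay spaces out API calls (rule 4).
    """
    backoff = 2.0
    for attempt in range(max_attempts):
        try:
            results = query_fn()
            time.sleep(rate_delay)        # pause before any follow-up call
            return validate_fn(results)   # validate before downstream use
        except Exception:
            if attempt == max_attempts - 1:
                raise                     # all attempts failed
            time.sleep(backoff)
            backoff *= 2                  # exponential backoff
```

Rules 5 and 6 then apply to what you do with the return value: process it in batches and write it to a file rather than carrying it in context.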
|
|
1
|
+
# Best Practices for Data Analysis
|
|
2
|
+
|
|
3
|
+
Critical best practices for data analysis including upfront filtering, validation, error handling, and performance optimization.
|
|
4
|
+
|
|
5
|
+
## Overview
|
|
6
|
+
|
|
7
|
+
This pattern defines mandatory best practices that MUST be followed for all data analysis tasks:
|
|
8
|
+
1. Upfront Filtering (CRITICAL)
|
|
9
|
+
2. Data Validation
|
|
10
|
+
3. Error Handling & Retry Logic
|
|
11
|
+
4. Rate Limiting
|
|
12
|
+
5. Performance Optimization
|
|
13
|
+
6. Context Window Management
|
|
14
|
+
|
|
15
|
+
---
|
|
16
|
+
|
|
17
|
+
## 1. Upfront Filtering (CRITICAL)
|
|
18
|
+
|
|
19
|
+
### Core Rule
|
|
20
|
+
|
|
21
|
+
**RULE: Always filter data at the SOURCE, not after retrieval**
|
|
22
|
+
|
|
23
|
+
**Why:**
|
|
24
|
+
- Reduces data transfer
|
|
25
|
+
- Improves performance
|
|
26
|
+
- Conserves context window
|
|
27
|
+
- Follows database best practices
|
|
28
|
+
- Minimizes memory usage
|
|
29
|
+
|
|
30
|
+
### Database Queries
|
|
31
|
+
|
|
32
|
+
```python
|
|
33
|
+
# ✅ GOOD: Filter at database level
|
|
34
|
+
dbQuery(
|
|
35
|
+
"SELECT * FROM clinical_trials WHERE phase = :phase AND status = :status LIMIT 100",
|
|
36
|
+
{phase: "Phase 3", status: "Recruiting"}
|
|
37
|
+
)
|
|
38
|
+
|
|
39
|
+
# ❌ BAD: Retrieve all data then filter
|
|
40
|
+
dbQuery("SELECT * FROM clinical_trials")
|
|
41
|
+
# Then filter in Python - inefficient!
|
|
42
|
+
```
|
|
43
|
+
|
|
44
|
+
**Best Practices:**
|
|
45
|
+
- Use WHERE clauses to filter rows
|
|
46
|
+
- Use LIMIT to cap result size
|
|
47
|
+
- Use indexed columns in WHERE clause
|
|
48
|
+
- Use named parameters (not string concatenation)
|
|
49
|
+
- Select only needed columns (avoid SELECT *)
|
|
50
|
+
|
|
51
|
+
### Table Operations
|
|
52
|
+
|
|
53
|
+
```python
|
|
54
|
+
# ✅ GOOD: Filter with table tools
|
|
55
|
+
tableFilterRows(
|
|
56
|
+
file_path="data/trials.xlsx",
|
|
57
|
+
column="Phase",
|
|
58
|
+
operator="=",
|
|
59
|
+
value="Phase 3",
|
|
60
|
+
max_results=100
|
|
61
|
+
)
|
|
62
|
+
|
|
63
|
+
# ❌ BAD: Load entire table then filter
|
|
64
|
+
tableGetRange(file_path="data/trials.xlsx", range="A1:Z10000")
|
|
65
|
+
# Then filter in Python - wastes memory!
|
|
66
|
+
```
|
|
67
|
+
|
|
68
|
+
**Best Practices:**
|
|
69
|
+
- Use tableFilterRows for single-condition filters
|
|
70
|
+
- Use tableSearch to find specific values
|
|
71
|
+
- Use max_results to limit output
|
|
72
|
+
- Preview with tableGetSheetPreview before processing
|
|
73
|
+
- Check row count before deciding approach
|
|
74
|
+
|
|
75
|
+
### BioMCP Queries
|
|
76
|
+
|
|
77
|
+
```python
|
|
78
|
+
# ✅ GOOD: Targeted query with filters
|
|
79
|
+
biomcp_article_searcher(
|
|
80
|
+
genes=["BRAF", "NRAS"],
|
|
81
|
+
diseases=["melanoma"],
|
|
82
|
+
keywords=["treatment resistance"],
|
|
83
|
+
variants=["V600E"],
|
|
84
|
+
page_size=50
|
|
85
|
+
)
|
|
86
|
+
|
|
87
|
+
# ❌ BAD: Broad query then manual filtering
|
|
88
|
+
biomcp_search(query="BRAF")
|
|
89
|
+
# Then manually filter through thousands of results!
|
|
90
|
+
```
|
|
91
|
+
|
|
92
|
+
**Best Practices:**
|
|
93
|
+
- Use specific domain filters (genes, diseases, variants)
|
|
94
|
+
- Combine multiple filters with AND logic
|
|
95
|
+
- Use page_size to limit results per page
|
|
96
|
+
- Use biomcp_search only for cross-domain queries
|
|
97
|
+
- Specify exact criteria upfront
|
|
98
|
+
|
|
99
|
+
### Web Searches
|
|
100
|
+
|
|
101
|
+
```python
|
|
102
|
+
# ✅ GOOD: Specific search query
|
|
103
|
+
web-search-prime_web_search_prime(
|
|
104
|
+
search_query="BRAF V600E melanoma FDA approval 2024",
|
|
105
|
+
search_recency_filter="oneYear"
|
|
106
|
+
)
|
|
107
|
+
|
|
108
|
+
# ❌ BAD: Broad search then filter results
|
|
109
|
+
web-search-prime_web_search_prime(search_query="BRAF")
|
|
110
|
+
# Then manually evaluate hundreds of results!
|
|
111
|
+
```
|
|
112
|
+
|
|
113
|
+
**Best Practices:**
|
|
114
|
+
- Use specific search terms
|
|
115
|
+
- Use recency filters when time-sensitive
|
|
116
|
+
- Use domain filters for trusted sources
|
|
117
|
+
- Limit results to manageable number
|
|
118
|
+
- Verify source quality before using
|
|
119
|
+
|
|
120
|
+
---
|
|
121
|
+
|
|
122
|
+
## 2. Data Validation
|
|
123
|
+
|
|
124
|
+
### Validation Pattern
|
|
125
|
+
|
|
126
|
+
```
|
|
127
|
+
VALIDATION WORKFLOW:
|
|
128
|
+
1. Check data existence (not empty/null)
|
|
129
|
+
2. Validate structure (required fields present)
|
|
130
|
+
3. Validate types (correct data types)
|
|
131
|
+
4. Validate values (within expected ranges)
|
|
132
|
+
5. Validate quality (no duplicates, no corruption)
|
|
133
|
+
```
|
|
134
|
+
|
|
135
|
+
### Example: Comprehensive Validation
|
|
136
|
+
|
|
137
|
+
```python
|
|
138
|
+
def validate_clinical_trials(trials: List[Dict]) -> List[Dict]:
|
|
139
|
+
"""Validate clinical trial data with comprehensive checks.
|
|
140
|
+
|
|
141
|
+
Args:
|
|
142
|
+
trials: List of trial dictionaries
|
|
143
|
+
|
|
144
|
+
Returns:
|
|
145
|
+
Validated trials
|
|
146
|
+
|
|
147
|
+
Raises:
|
|
148
|
+
ValueError: If validation fails
|
|
149
|
+
"""
|
|
150
|
+
# 1. Check existence
|
|
151
|
+
if not trials:
|
|
152
|
+
raise ValueError("No trial data provided")
|
|
153
|
+
|
|
154
|
+
# 2. Define required structure
|
|
155
|
+
required_fields = {
|
|
156
|
+
"nct_id": str,
|
|
157
|
+
"phase": str,
|
|
158
|
+
"status": str,
|
|
159
|
+
"condition": str,
|
|
160
|
+
"response_rate": (int, float),
|
|
161
|
+
"patient_count": int
|
|
162
|
+
}
|
|
163
|
+
|
|
164
|
+
valid_trials = []
|
|
165
|
+
errors = []
|
|
166
|
+
|
|
167
|
+
for i, trial in enumerate(trials):
|
|
168
|
+
try:
|
|
169
|
+
# 3. Validate structure
|
|
170
|
+
for field, expected_type in required_fields.items():
|
|
171
|
+
if field not in trial:
|
|
172
|
+
raise ValueError(f"Missing field: {field}")
|
|
173
|
+
|
|
174
|
+
            # 4. Validate types
            if not isinstance(trial[field], expected_type):
                raise ValueError(
                    f"Field {field} has wrong type: "
                    f"expected {expected_type}, got {type(trial[field])}"
                )

            # 5. Validate values
            if not trial["nct_id"].startswith("NCT"):
                raise ValueError(f"Invalid NCT ID format: {trial['nct_id']}")

            if trial["phase"] not in ["Phase 1", "Phase 2", "Phase 3", "Phase 4"]:
                raise ValueError(f"Invalid phase: {trial['phase']}")

            if not 0 <= trial["response_rate"] <= 100:
                raise ValueError(f"Response rate out of range: {trial['response_rate']}")

            if trial["patient_count"] < 0:
                raise ValueError(f"Negative patient count: {trial['patient_count']}")

            valid_trials.append(trial)

        except ValueError as e:
            errors.append(f"Trial {i}: {e}")

    # 6. Report validation results
    if errors:
        print(f"Validation warnings: {len(errors)} trials had issues")
        for error in errors[:5]:  # Show first 5
            print(f"  - {error}")
        if len(errors) > 5:
            print(f"  ... and {len(errors) - 5} more")

    print(f"Validated {len(valid_trials)}/{len(trials)} trials")

    return valid_trials
```

### JSON Validation

```python
# Using jsonValidate for structured data
from json_tools import jsonExtract, jsonValidate

# Define expected schema
schema = {
    "type": "object",
    "required": ["nct_id", "phase", "status"],
    "properties": {
        "nct_id": {"type": "string", "pattern": "^NCT[0-9]{8}$"},
        "phase": {"type": "string"},
        "status": {"type": "string"},
        "response_rate": {"type": "number", "minimum": 0, "maximum": 100}
    }
}

# Extract and validate
result = jsonExtract(file_path="output.json")
if result.success:
    validation = jsonValidate(data=result.data, schema=schema)
    if not validation.valid:
        print(f"Validation failed: {validation.errors}")
    else:
        print("Data validated successfully")
```
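
The same checks can also be expressed as a self-contained helper using only the Python standard library. This is an illustrative sketch, not the plugin's `jsonValidate` tool; the `validate_trial` name and its choice of checks are assumptions drawn from the schema above:

```python
import re

def validate_trial(trial: dict) -> list:
    """Return a list of validation errors for one trial record (empty = valid)."""
    errors = []
    # Required fields (mirrors the schema's "required" list)
    for field in ("nct_id", "phase", "status"):
        if field not in trial:
            errors.append(f"missing required field: {field}")
    # NCT ID format: "NCT" followed by 8 digits
    if not re.fullmatch(r"NCT[0-9]{8}", str(trial.get("nct_id", ""))):
        errors.append(f"invalid nct_id: {trial.get('nct_id')!r}")
    # Response rate range check, only if the field is present
    rate = trial.get("response_rate")
    if rate is not None and not (0 <= rate <= 100):
        errors.append(f"response_rate out of range: {rate}")
    return errors
```

For example, `validate_trial({"nct_id": "NCT01234567", "phase": "Phase 2", "status": "Active"})` returns an empty list.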

---

## 3. Error Handling & Retry Logic

### Retry Pattern (from retry.md)

```python
def fetch_with_retry(
    operation,
    max_attempts: int = 3,
    initial_delay: float = 2.0,
    backoff_factor: float = 2.0
):
    """Execute operation with exponential backoff retry.

    Args:
        operation: Function to execute
        max_attempts: Maximum retry attempts
        initial_delay: Initial delay in seconds
        backoff_factor: Multiplier for delay after each failure

    Returns:
        Operation result

    Raises:
        Exception: If all attempts fail
    """
    delay = initial_delay

    for attempt in range(max_attempts):
        try:
            result = operation()
            return result

        except Exception as e:
            if attempt < max_attempts - 1:
                print(f"Attempt {attempt + 1} failed: {e}")
                print(f"Retrying in {delay} seconds...")
                blockingTimer(delay)
                delay = delay * backoff_factor  # Exponential backoff
            else:
                print(f"All {max_attempts} attempts failed")
                raise
```
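
As a sanity check of the backoff arithmetic above: with the defaults (3 attempts, 2.0s initial delay, factor 2.0), the waits between attempts are 2s then 4s, and there is no wait after the final attempt. The helper below is illustrative only (`backoff_delays` is not part of the plugin):

```python
def backoff_delays(max_attempts: int = 3, initial_delay: float = 2.0,
                   backoff_factor: float = 2.0) -> list:
    """Delays slept between attempts; no wait follows the final attempt."""
    delays, delay = [], initial_delay
    for _ in range(max_attempts - 1):
        delays.append(delay)
        delay *= backoff_factor
    return delays

print(backoff_delays())  # [2.0, 4.0]
```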

### Apply Retry to External Operations

```python
# BioMCP queries
def fetch_articles_with_retry(genes, diseases):
    """Fetch articles with retry logic."""
    return fetch_with_retry(
        lambda: biomcp_article_searcher(
            genes=genes,
            diseases=diseases,
            page_size=50
        )
    )

# Database queries
def query_database_with_retry(sql, params):
    """Query database with retry logic."""
    return fetch_with_retry(
        lambda: dbQuery(sql, params)
    )

# Web requests
def fetch_web_with_retry(url):
    """Fetch web content with retry logic."""
    return fetch_with_retry(
        lambda: webfetch(url)
    )
```

### Error Handling Best Practices

```python
# ✅ GOOD: Comprehensive error handling
def process_trials(file_path: str) -> Dict:
    """Process trials with comprehensive error handling."""
    try:
        # Load data
        df = load_data(file_path)

        # Validate
        if df.empty:
            raise ValueError("File contains no data")

        # Process
        results = analyze_trials(df)

        return results

    except FileNotFoundError:
        raise FileNotFoundError(f"File not found: {file_path}")

    except pd.errors.EmptyDataError:
        raise ValueError(f"File is empty or corrupt: {file_path}")

    except Exception as e:
        raise RuntimeError(f"Processing failed: {e}")

# ❌ BAD: No error handling
def process_trials(file_path: str) -> Dict:
    """Process trials without error handling."""
    df = load_data(file_path)  # May fail silently
    results = analyze_trials(df)  # May fail silently
    return results
```

---

## 4. Rate Limiting

### BioMCP Rate Limiting (MANDATORY)

```python
# ✅ GOOD: Sequential calls with rate limiting
articles = biomcp_article_searcher(genes=["BRAF"], page_size=50)
blockingTimer(0.3)  # 300ms delay

for article in articles[:10]:
    details = biomcp_article_getter(pmid=article["pmid"])
    blockingTimer(0.3)  # 300ms between each call
    # Process details

# ❌ BAD: Back-to-back calls without rate limiting
# This will cause API throttling!
results = []
for pmid in pmids:
    results.append(biomcp_article_getter(pmid))  # No delay!
```

### Web Rate Limiting

```python
# ✅ GOOD: Conservative rate limiting for web
for url in urls:
    content = webfetch(url)
    blockingTimer(0.5)  # 500ms delay for web requests
    # Process content

# ❌ BAD: No rate limiting
for url in urls:
    content = webfetch(url)  # May get blocked!
```

### Rate Limiting Guidelines

| Tool Category | Delay | Rationale |
|--------------|-------|-----------|
| BioMCP tools | 0.3s | API rate limits |
| Web tools | 0.5s | Server courtesy |
| Database | None | Local/network, no limit |
| File operations | None | Local filesystem, no limit |
| Parser tools | None | Local processing, no limit |
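
The table can be encoded as a small lookup so scripts do not hard-code delays. A sketch only: the `DELAYS` mapping and function names are assumptions, and `time.sleep` stands in for the `blockingTimer` tool:

```python
import time

# Seconds of delay per tool category, per the table above.
DELAYS = {
    "biomcp": 0.3,
    "web": 0.5,
    "database": 0.0,
    "file": 0.0,
    "parser": 0.0,
}

def delay_for(category: str) -> float:
    """Look up the delay; unknown categories get the conservative web delay."""
    return DELAYS.get(category, 0.5)

def rate_limited_call(category: str, operation):
    """Run operation, then pause for the category's delay."""
    result = operation()
    time.sleep(delay_for(category))
    return result
```

Local categories (database, file, parser) pause 0 seconds, so wrapping them costs nothing.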

---

## 5. Performance Optimization

### Use Appropriate Tools for Data Size

```python
# Small datasets (< 30 rows): Use table tools
if row_count < 30:
    tableFilterRows(file_path, column="Status", operator="=", value="Active")
    tableGroupBy(file_path, group_column="Phase", agg_column="Count", agg_type="count")

# Medium datasets (30-1000 rows): Use long-table-summary skill
elif row_count < 1000:
    # skill long-table-summary
    # Follow 16-step workflow
    pass

# Large datasets (> 1000 rows): Use Python
else:
    # uv run python .scripts/py/large_analysis.py --input data.xlsx
    pass
```
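
The size thresholds above reduce to a simple dispatch. A hedged sketch (the `pick_approach` name is an assumption, not a plugin function):

```python
def pick_approach(row_count: int) -> str:
    """Map dataset size to the recommended analysis approach."""
    if row_count < 30:
        return "table tools"
    elif row_count < 1000:
        return "long-table-summary skill"
    return "python script"
```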

### Batch Processing

```python
# ✅ GOOD: Process in batches
batch_size = 100
total_items = len(data)

for i in range(0, total_items, batch_size):
    batch = data[i:i+batch_size]
    results = process_batch(batch)

    # Report progress
    completed = min(i + batch_size, total_items)
    percent = (completed / total_items) * 100
    print(f"Progress: {completed}/{total_items} ({percent:.1f}%)")

# ❌ BAD: Process all at once (may overload memory)
results = process_all(data)  # Memory issue with large datasets!
```

### Caching Results

```python
# Cache expensive operations
import hashlib
import json
from typing import Any, Dict

def get_cache_key(params: Dict) -> str:
    """Generate cache key from parameters."""
    return hashlib.md5(json.dumps(params, sort_keys=True).encode()).hexdigest()

def cached_query(sql: str, params: Dict, cache_dir: str = ".cache") -> Any:
    """Execute query with caching."""
    import os

    cache_key = get_cache_key({"sql": sql, "params": params})
    cache_file = f"{cache_dir}/{cache_key}.json"

    # Check cache
    if os.path.exists(cache_file):
        with open(cache_file, 'r') as f:
            return json.load(f)

    # Execute query
    result = dbQuery(sql, params)

    # Save to cache
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_file, 'w') as f:
        json.dump(result, f)

    return result
```

---

## 6. Context Window Management

### Minimize Data in Context

```python
# ✅ GOOD: Summarize large datasets
if len(results) > 100:
    summary = {
        "total": len(results),
        "sample": results[:5],  # First 5 items
        "statistics": calculate_stats(results)
    }
    # Return summary, not full dataset

# ❌ BAD: Load entire dataset into context
return results  # 1000+ items in context!
```

### Use File-Based Data Exchange

```python
# ✅ GOOD: Write to file, pass file path
output_file = ".work/results.json"
with open(output_file, 'w') as f:
    json.dump(results, f)

return f"Results saved to {output_file}"

# ❌ BAD: Return large data structure
return results  # Consumes context window!
```

### Pagination for Large Result Sets

```python
# ✅ GOOD: Paginated retrieval
page_size = 50
all_results = []

for page in range(1, max_pages + 1):
    results = biomcp_article_searcher(
        genes=["BRAF"],
        page=page,
        page_size=page_size
    )
    blockingTimer(0.3)

    all_results.extend(results)

    if len(results) < page_size:
        break  # No more results

# ❌ BAD: Try to get all at once
all_results = biomcp_article_searcher(genes=["BRAF"], page_size=10000)
# May hit API limits!
```

---

## 7. Common Anti-Patterns to Avoid

### Anti-Pattern 1: Premature Optimization

```python
# ❌ BAD: Optimize before measuring
def process_data(data):
    # Complex optimization for small dataset
    return optimized_result

# ✅ GOOD: Measure first, optimize if needed
def process_data(data):
    if len(data) < 100:
        # Simple approach is fine
        return simple_process(data)
    else:
        # Optimize for large dataset
        return optimized_process(data)
```

### Anti-Pattern 2: Over-Engineering

```python
# ❌ BAD: Complex solution for simple problem
class TrialProcessor:
    def __init__(self, config):
        self.config = config
        self.validator = Validator()
        self.analyzer = Analyzer()
        # ... many more components

processor = TrialProcessor(config)
result = processor.process(trial)

# ✅ GOOD: Simple solution for simple problem
result = analyze_trial(trial)
```

### Anti-Pattern 3: Not Using Existing Tools

```python
# ❌ BAD: Write custom Python when tools exist
df = pd.read_excel("data.xlsx")
filtered = df[df["Phase"] == "Phase 3"]
stats = filtered.groupby("Condition").size()

# ✅ GOOD: Use existing tools
tableFilterRows(file_path, column="Phase", operator="=", value="Phase 3")
tableGroupBy(file_path, group_column="Condition", agg_column="Count", agg_type="count")
```

---

## 8. Best Practices Checklist

Before completing any data analysis task, verify:

### Upfront Filtering
- [ ] Data filtered at source (database WHERE, table filters, API parameters)
- [ ] No bulk retrieval then filtering in Python
- [ ] Result size limited appropriately
- [ ] Only necessary columns/fields retrieved

### Validation
- [ ] Data existence checked (not empty)
- [ ] Structure validated (required fields present)
- [ ] Types validated (correct data types)
- [ ] Values validated (within expected ranges)
- [ ] Quality validated (no duplicates, no corruption)

### Error Handling
- [ ] Try-except for external operations
- [ ] Retry logic for network/API calls
- [ ] Informative error messages
- [ ] Graceful degradation when possible
- [ ] Error logging for debugging

### Rate Limiting
- [ ] BioMCP: 0.3s delay between calls
- [ ] Web: 0.5s delay between requests
- [ ] Sequential (not concurrent) API calls
- [ ] Respect API rate limits

### Performance
- [ ] Appropriate tool selected for data size
- [ ] Batch processing for large datasets
- [ ] Caching for expensive operations
- [ ] Progress reporting for long operations

### Context Management
- [ ] Large datasets summarized, not fully loaded
- [ ] File-based data exchange for subagents
- [ ] Pagination for large result sets
- [ ] Only necessary data in context

---

## Summary

**Critical Rules:**

1. **ALWAYS filter upfront** - Never retrieve then filter
2. **ALWAYS validate data** - Check structure, types, values
3. **ALWAYS handle errors** - Retry with backoff
4. **ALWAYS rate limit** - Respect API limits
5. **ALWAYS optimize for size** - Use appropriate tools
6. **ALWAYS manage context** - Minimize data in context

**Following these best practices ensures:**
- ✅ Efficient data processing
- ✅ Reliable operations
- ✅ Professional quality results
- ✅ Optimal resource usage
- ✅ Reproducible research