@yeyuan98/opencode-bioresearcher-plugin 1.5.2 → 1.5.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,647 +1,647 @@
1
- # Best Practices for Data Analysis
2
-
3
- Critical best practices for data analysis including upfront filtering, validation, error handling, and performance optimization.
4
-
5
- ## Overview
6
-
7
- This pattern defines mandatory best practices that MUST be followed for all data analysis tasks:
8
- 1. Upfront Filtering (CRITICAL)
9
- 2. Data Validation
10
- 3. Error Handling & Retry Logic
11
- 4. Rate Limiting
12
- 5. Performance Optimization
13
- 6. Context Window Management
14
-
15
- ---
16
-
17
- ## 1. Upfront Filtering (CRITICAL)
18
-
19
- ### Core Rule
20
-
21
- **RULE: Always filter data at the SOURCE, not after retrieval**
22
-
23
- **Why:**
24
- - Reduces data transfer
25
- - Improves performance
26
- - Conserves context window
27
- - Follows database best practices
28
- - Minimizes memory usage
29
-
30
- ### Database Queries
31
-
32
- ```python
33
- # ✅ GOOD: Filter at database level
34
- dbQuery(
35
-     "SELECT * FROM clinical_trials WHERE phase = :phase AND status = :status LIMIT 100",
36
-     {"phase": "Phase 3", "status": "Recruiting"}
37
- )
38
-
39
- # ❌ BAD: Retrieve all data then filter
40
- dbQuery("SELECT * FROM clinical_trials")
41
- # Then filter in Python - inefficient!
42
- ```
43
-
44
- **Best Practices:**
45
- - Use WHERE clauses to filter rows
46
- - Use LIMIT to cap result size
47
- - Use indexed columns in WHERE clause
48
- - Use named parameters (not string concatenation)
49
- - Select only needed columns (avoid SELECT *)
50
-
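Outside the dbQuery tool, the same named-parameter principle applies in plain Python. A minimal, self-contained sketch using the standard-library sqlite3 module (the table and values here are illustrative, not from a real registry):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clinical_trials (nct_id TEXT, phase TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO clinical_trials VALUES (?, ?, ?)",
    [("NCT00000001", "Phase 3", "Recruiting"),
     ("NCT00000002", "Phase 1", "Completed")],
)

# Named parameters keep values out of the SQL string entirely
rows = conn.execute(
    "SELECT nct_id FROM clinical_trials "
    "WHERE phase = :phase AND status = :status LIMIT 100",
    {"phase": "Phase 3", "status": "Recruiting"},
).fetchall()
print(rows)  # [('NCT00000001',)]
```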
51
- ### Table Operations
52
-
53
- ```python
54
- # ✅ GOOD: Filter with table tools
55
- tableFilterRows(
56
-     file_path="data/trials.xlsx",
57
-     column="Phase",
58
-     operator="=",
59
-     value="Phase 3",
60
-     max_results=100
61
- )
62
-
63
- # ❌ BAD: Load entire table then filter
64
- tableGetRange(file_path="data/trials.xlsx", range="A1:Z10000")
65
- # Then filter in Python - wastes memory!
66
- ```
67
-
68
- **Best Practices:**
69
- - Use tableFilterRows for single-condition filters
70
- - Use tableSearch to find specific values
71
- - Use max_results to limit output
72
- - Preview with tableGetSheetPreview before processing
73
- - Check row count before deciding approach
74
-
75
- ### BioMCP Queries
76
-
77
- ```python
78
- # ✅ GOOD: Targeted query with filters
79
- biomcp_article_searcher(
80
-     genes=["BRAF", "NRAS"],
81
-     diseases=["melanoma"],
82
-     keywords=["treatment resistance"],
83
-     variants=["V600E"],
84
-     page_size=50
85
- )
86
-
87
- # ❌ BAD: Broad query then manual filtering
88
- biomcp_search(query="BRAF")
89
- # Then manually filter through thousands of results!
90
- ```
91
-
92
- **Best Practices:**
93
- - Use specific domain filters (genes, diseases, variants)
94
- - Combine multiple filters with AND logic
95
- - Use page_size to limit results per page
96
- - Use biomcp_search only for cross-domain queries
97
- - Specify exact criteria upfront
98
-
99
- ### Web Searches
100
-
101
- ```python
102
- # ✅ GOOD: Specific search query
103
- web-search-prime_web_search_prime(
104
-     search_query="BRAF V600E melanoma FDA approval 2024",
105
-     search_recency_filter="oneYear"
106
- )
107
-
108
- # ❌ BAD: Broad search then filter results
109
- web-search-prime_web_search_prime(search_query="BRAF")
110
- # Then manually evaluate hundreds of results!
111
- ```
112
-
113
- **Best Practices:**
114
- - Use specific search terms
115
- - Use recency filters when time-sensitive
116
- - Use domain filters for trusted sources
117
- - Limit results to manageable number
118
- - Verify source quality before using
119
-
120
- ---
121
-
122
- ## 2. Data Validation
123
-
124
- ### Validation Pattern
125
-
126
- ```
127
- VALIDATION WORKFLOW:
128
- 1. Check data existence (not empty/null)
129
- 2. Validate structure (required fields present)
130
- 3. Validate types (correct data types)
131
- 4. Validate values (within expected ranges)
132
- 5. Validate quality (no duplicates, no corruption)
133
- ```
134
-
135
- ### Example: Comprehensive Validation
136
-
137
- ```python
138
- def validate_clinical_trials(trials: List[Dict]) -> List[Dict]:
139
-     """Validate clinical trial data with comprehensive checks.
140
-
141
-     Args:
142
-         trials: List of trial dictionaries
143
-
144
-     Returns:
145
-         Validated trials
146
-
147
-     Raises:
148
-         ValueError: If no trial data is provided
149
-     """
150
-     # 1. Check existence
151
-     if not trials:
152
-         raise ValueError("No trial data provided")
153
-
154
-     # 2. Define required structure
155
-     required_fields = {
156
-         "nct_id": str,
157
-         "phase": str,
158
-         "status": str,
159
-         "condition": str,
160
-         "response_rate": (int, float),
161
-         "patient_count": int
162
-     }
163
-
164
-     valid_trials = []
165
-     errors = []
166
-
167
-     for i, trial in enumerate(trials):
168
-         try:
169
-             # 3. Validate structure
170
-             for field, expected_type in required_fields.items():
171
-                 if field not in trial:
172
-                     raise ValueError(f"Missing field: {field}")
173
-
174
-                 # 4. Validate types
175
-                 if not isinstance(trial[field], expected_type):
176
-                     raise ValueError(
177
-                         f"Field {field} has wrong type: "
178
-                         f"expected {expected_type}, got {type(trial[field])}"
179
-                     )
180
-
181
-             # 5. Validate values
182
-             if not trial["nct_id"].startswith("NCT"):
183
-                 raise ValueError(f"Invalid NCT ID format: {trial['nct_id']}")
184
-
185
-             if trial["phase"] not in ["Phase 1", "Phase 2", "Phase 3", "Phase 4"]:
186
-                 raise ValueError(f"Invalid phase: {trial['phase']}")
187
-
188
-             if not 0 <= trial["response_rate"] <= 100:
189
-                 raise ValueError(f"Response rate out of range: {trial['response_rate']}")
190
-
191
-             if trial["patient_count"] < 0:
192
-                 raise ValueError(f"Negative patient count: {trial['patient_count']}")
193
-
194
-             valid_trials.append(trial)
195
-
196
-         except ValueError as e:
197
-             errors.append(f"Trial {i}: {e}")
198
-
199
-     # 6. Report validation results
200
-     if errors:
201
-         print(f"Validation warnings: {len(errors)} trials had issues")
202
-         for error in errors[:5]:  # Show first 5
203
-             print(f" - {error}")
204
-         if len(errors) > 5:
205
-             print(f" ... and {len(errors) - 5} more")
206
-
207
-     print(f"Validated {len(valid_trials)}/{len(trials)} trials")
208
-
209
-     return valid_trials
210
- ```
211
-
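Step 5 of the validation workflow (no duplicates) is not shown in the example above; a minimal sketch that deduplicates on nct_id, assuming the same trial dictionaries:

```python
def drop_duplicate_trials(trials):
    """Keep the first occurrence of each nct_id, preserving order."""
    seen = set()
    unique = []
    for trial in trials:
        if trial["nct_id"] not in seen:
            seen.add(trial["nct_id"])
            unique.append(trial)
    return unique

# Third entry repeats the first nct_id, so it is dropped
trials = [{"nct_id": "NCT00000001"}, {"nct_id": "NCT00000002"}, {"nct_id": "NCT00000001"}]
print(len(drop_duplicate_trials(trials)))  # 2
```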
212
- ### JSON Validation
213
-
214
- ```python
215
- # Using jsonValidate for structured data
216
- from json_tools import jsonExtract, jsonValidate
217
-
218
- # Define expected schema
219
- schema = {
220
-     "type": "object",
221
-     "required": ["nct_id", "phase", "status"],
222
-     "properties": {
223
-         "nct_id": {"type": "string", "pattern": "^NCT[0-9]{8}$"},
224
-         "phase": {"type": "string"},
225
-         "status": {"type": "string"},
226
-         "response_rate": {"type": "number", "minimum": 0, "maximum": 100}
227
-     }
228
- }
229
-
230
- # Extract and validate
231
- result = jsonExtract(file_path="output.json")
232
- if result.success:
233
-     validation = jsonValidate(data=result.data, schema=schema)
234
-     if not validation.valid:
235
-         print(f"Validation failed: {validation.errors}")
236
-     else:
237
-         print("Data validated successfully")
238
- ```
239
-
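The `pattern` entry in the schema above is an ordinary regular expression. A standalone sketch with Python's re module shows exactly what it accepts, independent of the json_tools helpers:

```python
import re

# Same pattern as the schema: exactly 'NCT' followed by eight digits
NCT_PATTERN = re.compile(r"^NCT[0-9]{8}$")

print(bool(NCT_PATTERN.match("NCT01234567")))  # True
print(bool(NCT_PATTERN.match("NCT1234567")))   # False (seven digits)
print(bool(NCT_PATTERN.match("nct01234567")))  # False (lowercase prefix)
```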
240
- ---
241
-
242
- ## 3. Error Handling & Retry Logic
243
-
244
- ### Retry Pattern (from retry.md)
245
-
246
- ```python
247
- def fetch_with_retry(
248
-     operation,
249
-     max_attempts: int = 3,
250
-     initial_delay: float = 2.0,
251
-     backoff_factor: float = 2.0
252
- ):
253
-     """Execute operation with exponential backoff retry.
254
-
255
-     Args:
256
-         operation: Function to execute
257
-         max_attempts: Maximum retry attempts
258
-         initial_delay: Initial delay in seconds
259
-         backoff_factor: Multiplier for delay after each failure
260
-
261
-     Returns:
262
-         Operation result
263
-
264
-     Raises:
265
-         Exception: If all attempts fail
266
-     """
267
-     delay = initial_delay
268
-
269
-     for attempt in range(max_attempts):
270
-         try:
271
-             result = operation()
272
-             return result
273
-
274
-         except Exception as e:
275
-             if attempt < max_attempts - 1:
276
-                 print(f"Attempt {attempt + 1} failed: {e}")
277
-                 print(f"Retrying in {delay} seconds...")
278
-                 blockingTimer(delay)
279
-                 delay = delay * backoff_factor  # Exponential backoff
280
-             else:
281
-                 print(f"All {max_attempts} attempts failed")
282
-                 raise
283
- ```
284
-
285
- ### Apply Retry to External Operations
286
-
287
- ```python
288
- # BioMCP queries
289
- def fetch_articles_with_retry(genes, diseases):
290
-     """Fetch articles with retry logic."""
291
-     return fetch_with_retry(
292
-         lambda: biomcp_article_searcher(
293
-             genes=genes,
294
-             diseases=diseases,
295
-             page_size=50
296
-         )
297
-     )
298
-
299
- # Database queries
300
- def query_database_with_retry(sql, params):
301
-     """Query database with retry logic."""
302
-     return fetch_with_retry(
303
-         lambda: dbQuery(sql, params)
304
-     )
305
-
306
- # Web requests
307
- def fetch_web_with_retry(url):
308
-     """Fetch web content with retry logic."""
309
-     return fetch_with_retry(
310
-         lambda: webfetch(url)
311
-     )
312
- ```
313
-
314
- ### Error Handling Best Practices
315
-
316
- ```python
317
- # ✅ GOOD: Comprehensive error handling
318
- def process_trials(file_path: str) -> Dict:
319
-     """Process trials with comprehensive error handling."""
320
-     try:
321
-         # Load data
322
-         df = load_data(file_path)
323
-
324
-         # Validate
325
-         if df.empty:
326
-             raise ValueError("File contains no data")
327
-
328
-         # Process
329
-         results = analyze_trials(df)
330
-
331
-         return results
332
-
333
-     except FileNotFoundError:
334
-         raise FileNotFoundError(f"File not found: {file_path}")
335
-
336
-     except pd.errors.EmptyDataError:
337
-         raise ValueError(f"File is empty or corrupt: {file_path}")
338
-
339
-     except Exception as e:
340
-         raise RuntimeError(f"Processing failed: {e}")
341
-
342
- # ❌ BAD: No error handling
343
- def process_trials(file_path: str) -> Dict:
344
-     """Process trials without error handling."""
345
-     df = load_data(file_path)  # May fail silently
346
-     results = analyze_trials(df)  # May fail silently
347
-     return results
348
- ```
349
-
350
- ---
351
-
352
- ## 4. Rate Limiting
353
-
354
- ### BioMCP Rate Limiting (MANDATORY)
355
-
356
- ```python
357
- # ✅ GOOD: Sequential calls with rate limiting
358
- articles = biomcp_article_searcher(genes=["BRAF"], page_size=50)
359
- blockingTimer(0.3)  # 300ms delay
360
-
361
- for article in articles[:10]:
362
-     details = biomcp_article_getter(pmid=article["pmid"])
363
-     blockingTimer(0.3)  # 300ms between each call
364
-     # Process details
365
-
366
- # ❌ BAD: Back-to-back calls without rate limiting
367
- # This will cause API throttling!
368
- results = []
369
- for pmid in pmids:
370
-     results.append(biomcp_article_getter(pmid))  # No delay!
371
- ```
372
-
373
- ### Web Rate Limiting
374
-
375
- ```python
376
- # ✅ GOOD: Conservative rate limiting for web
377
- for url in urls:
378
-     content = webfetch(url)
379
-     blockingTimer(0.5)  # 500ms delay for web requests
380
-     # Process content
381
-
382
- # ❌ BAD: No rate limiting
383
- for url in urls:
384
-     content = webfetch(url)  # May get blocked!
385
- ```
386
-
387
- ### Rate Limiting Guidelines
388
-
389
- | Tool Category | Delay | Rationale |
390
- |--------------|-------|-----------|
391
- | BioMCP tools | 0.3s | API rate limits |
392
- | Web tools | 0.5s | Server courtesy |
393
- | Database | None | Local/network, no limit |
394
- | File operations | None | Local filesystem, no limit |
395
- | Parser tools | None | Local processing, no limit |
396
-
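The delays above can be centralized in one helper so each call site doesn't repeat them. A minimal sketch in plain Python, with time.sleep standing in for blockingTimer and a stand-in fetch function (the category names mirror the table; they are not a real API):

```python
import time

# Delays from the guidelines table, in seconds
RATE_LIMITS = {"biomcp": 0.3, "web": 0.5, "db": 0.0, "file": 0.0}

def rate_limited(category, operation, *args, **kwargs):
    """Run operation, then pause for the category's mandated delay."""
    result = operation(*args, **kwargs)
    time.sleep(RATE_LIMITS.get(category, 0.5))  # unknown category: be conservative
    return result

# Usage with a stand-in operation (zero-delay "db" category keeps this fast)
fetch = lambda pmid: {"pmid": pmid}
details = [rate_limited("db", fetch, p) for p in ["123", "456"]]
print([d["pmid"] for d in details])  # ['123', '456']
```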
397
- ---
398
-
399
- ## 5. Performance Optimization
400
-
401
- ### Use Appropriate Tools for Data Size
402
-
403
- ```python
404
- # Small datasets (< 30 rows): Use table tools
405
- if row_count < 30:
406
-     tableFilterRows(file_path, column="Status", operator="=", value="Active")
407
-     tableGroupBy(file_path, group_column="Phase", agg_column="Count", agg_type="count")
408
-
409
- # Medium datasets (30-1000 rows): Use long-table-summary skill
410
- elif row_count < 1000:
411
-     pass  # Invoke the skill: long-table-summary (not a Python call)
412
-     # Follow its 16-step workflow
413
-
414
- # Large datasets (> 1000 rows): Use Python
415
- else:
416
-     pass  # Shell command: uv run python .scripts/py/large_analysis.py --input data.xlsx
417
- ```
418
-
419
- ### Batch Processing
420
-
421
- ```python
422
- # ✅ GOOD: Process in batches
423
- batch_size = 100
424
- total_items = len(data)
425
-
426
- for i in range(0, total_items, batch_size):
427
-     batch = data[i:i+batch_size]
428
-     results = process_batch(batch)
429
-
430
-     # Report progress
431
-     completed = min(i + batch_size, total_items)
432
-     percent = (completed / total_items) * 100
433
-     print(f"Progress: {completed}/{total_items} ({percent:.1f}%)")
434
-
435
- # ❌ BAD: Process all at once (may overload memory)
436
- results = process_all(data)  # Memory issue with large datasets!
437
- ```
438
-
439
- ### Caching Results
440
-
441
- ```python
442
- # Cache expensive operations
443
- import hashlib
444
- import json
445
-
446
- def get_cache_key(params: Dict) -> str:
447
-     """Generate cache key from parameters."""
448
-     return hashlib.md5(json.dumps(params, sort_keys=True).encode()).hexdigest()
449
-
450
- def cached_query(sql: str, params: Dict, cache_dir: str = ".cache") -> Any:
451
-     """Execute query with caching."""
452
-     import os
453
-
454
-     cache_key = get_cache_key({"sql": sql, "params": params})
455
-     cache_file = f"{cache_dir}/{cache_key}.json"
456
-
457
-     # Check cache
458
-     if os.path.exists(cache_file):
459
-         with open(cache_file, 'r') as f:
460
-             return json.load(f)
461
-
462
-     # Execute query
463
-     result = dbQuery(sql, params)
464
-
465
-     # Save to cache
466
-     os.makedirs(cache_dir, exist_ok=True)
467
-     with open(cache_file, 'w') as f:
468
-         json.dump(result, f)
469
-
470
-     return result
471
- ```
472
-
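When the same query repeats within a single run, an in-memory cache avoids even the file I/O. A sketch with functools.lru_cache; since lru_cache requires hashable arguments, parameters are passed as a tuple of pairs, and a counter stands in for the real dbQuery call:

```python
from functools import lru_cache

call_count = 0

@lru_cache(maxsize=128)
def cached_lookup(sql, param_items):
    """param_items is a tuple of (key, value) pairs so the call is hashable."""
    global call_count
    call_count += 1  # stands in for the expensive dbQuery call
    return f"rows for {sql} {dict(param_items)}"

params = (("phase", "Phase 3"),)
cached_lookup("SELECT ...", params)
cached_lookup("SELECT ...", params)  # served from cache, no second call
print(call_count)  # 1
```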
473
- ---
474
-
475
- ## 6. Context Window Management
476
-
477
- ### Minimize Data in Context
478
-
479
- ```python
480
- # ✅ GOOD: Summarize large datasets
481
- if len(results) > 100:
482
-     summary = {
483
-         "total": len(results),
484
-         "sample": results[:5],  # First 5 items
485
-         "statistics": calculate_stats(results)
486
-     }
487
-     # Return summary, not full dataset
488
-
489
- # ❌ BAD: Load entire dataset into context
490
- return results  # 1000+ items in context!
491
- ```
492
-
493
- ### Use File-Based Data Exchange
494
-
495
- ```python
496
- # ✅ GOOD: Write to file, pass file path
497
- output_file = ".work/results.json"
498
- with open(output_file, 'w') as f:
499
-     json.dump(results, f)
500
-
501
- return f"Results saved to {output_file}"
502
-
503
- # ❌ BAD: Return large data structure
504
- return results  # Consumes context window!
505
- ```
506
-
507
- ### Pagination for Large Result Sets
508
-
509
- ```python
510
- # ✅ GOOD: Paginated retrieval
511
- page_size = 50
512
- all_results = []
513
-
514
- for page in range(1, max_pages + 1):
515
-     results = biomcp_article_searcher(
516
-         genes=["BRAF"],
517
-         page=page,
518
-         page_size=page_size
519
-     )
520
-     blockingTimer(0.3)
521
-
522
-     all_results.extend(results)
523
-
524
-     if len(results) < page_size:
525
-         break  # No more results
526
-
527
- # ❌ BAD: Try to get all at once
528
- all_results = biomcp_article_searcher(genes=["BRAF"], page_size=10000)
529
- # May hit API limits!
530
- ```
531
-
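The pagination loop above generalizes to any paged fetcher. A minimal sketch as a generator, with a stand-in fetch function in place of biomcp_article_searcher (rate limiting omitted for brevity):

```python
def paginate(fetch_page, page_size=50, max_pages=10):
    """Yield items page by page until a short (final) page is seen."""
    for page in range(1, max_pages + 1):
        results = fetch_page(page=page, page_size=page_size)
        yield from results
        if len(results) < page_size:
            break  # final page reached

# Stand-in fetcher: 120 items total, so pages of 50, 50, then 20
data = list(range(120))
fetch = lambda page, page_size: data[(page - 1) * page_size: page * page_size]
items = list(paginate(fetch, page_size=50))
print(len(items))  # 120
```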
532
- ---
533
-
534
- ## 7. Common Anti-Patterns to Avoid
535
-
536
- ### Anti-Pattern 1: Premature Optimization
537
-
538
- ```python
539
- # ❌ BAD: Optimize before measuring
540
- def process_data(data):
541
-     # Complex optimization for small dataset
542
-     return optimized_result
543
-
544
- # ✅ GOOD: Measure first, optimize if needed
545
- def process_data(data):
546
-     if len(data) < 100:
547
-         # Simple approach is fine
548
-         return simple_process(data)
549
-     else:
550
-         # Optimize for large dataset
551
-         return optimized_process(data)
552
- ```
553
-
554
- ### Anti-Pattern 2: Over-Engineering
555
-
556
- ```python
557
- # ❌ BAD: Complex solution for simple problem
558
- class TrialProcessor:
559
-     def __init__(self, config):
560
-         self.config = config
561
-         self.validator = Validator()
562
-         self.analyzer = Analyzer()
563
-         # ... many more components
564
-
565
- processor = TrialProcessor(config)
566
- result = processor.process(trial)
567
-
568
- # ✅ GOOD: Simple solution for simple problem
569
- result = analyze_trial(trial)
570
- ```
571
-
572
- ### Anti-Pattern 3: Not Using Existing Tools
573
-
574
- ```python
575
- # ❌ BAD: Write custom Python when tools exist
576
- df = pd.read_excel("data.xlsx")
577
- filtered = df[df["Phase"] == "Phase 3"]
578
- stats = filtered.groupby("Condition").size()
579
-
580
- # ✅ GOOD: Use existing tools
581
- tableFilterRows(file_path, column="Phase", operator="=", value="Phase 3")
582
- tableGroupBy(file_path, group_column="Condition", agg_column="Count", agg_type="count")
583
- ```
584
-
585
- ---
586
-
587
- ## 8. Best Practices Checklist
588
-
589
- Before completing any data analysis task, verify:
590
-
591
- ### Upfront Filtering
592
- - [ ] Data filtered at source (database WHERE, table filters, API parameters)
593
- - [ ] No bulk retrieval then filtering in Python
594
- - [ ] Result size limited appropriately
595
- - [ ] Only necessary columns/fields retrieved
596
-
597
- ### Validation
598
- - [ ] Data existence checked (not empty)
599
- - [ ] Structure validated (required fields present)
600
- - [ ] Types validated (correct data types)
601
- - [ ] Values validated (within expected ranges)
602
- - [ ] Quality validated (no duplicates, no corruption)
603
-
604
- ### Error Handling
605
- - [ ] Try-except for external operations
606
- - [ ] Retry logic for network/API calls
607
- - [ ] Informative error messages
608
- - [ ] Graceful degradation when possible
609
- - [ ] Error logging for debugging
610
-
611
- ### Rate Limiting
612
- - [ ] BioMCP: 0.3s delay between calls
613
- - [ ] Web: 0.5s delay between requests
614
- - [ ] Sequential (not concurrent) API calls
615
- - [ ] Respect API rate limits
616
-
617
- ### Performance
618
- - [ ] Appropriate tool selected for data size
619
- - [ ] Batch processing for large datasets
620
- - [ ] Caching for expensive operations
621
- - [ ] Progress reporting for long operations
622
-
623
- ### Context Management
624
- - [ ] Large datasets summarized, not fully loaded
625
- - [ ] File-based data exchange for subagents
626
- - [ ] Pagination for large result sets
627
- - [ ] Only necessary data in context
628
-
629
- ---
630
-
631
- ## Summary
632
-
633
- **Critical Rules:**
634
-
635
- 1. **ALWAYS filter upfront** - Never retrieve then filter
636
- 2. **ALWAYS validate data** - Check structure, types, values
637
- 3. **ALWAYS handle errors** - Retry with backoff
638
- 4. **ALWAYS rate limit** - Respect API limits
639
- 5. **ALWAYS optimize for size** - Use appropriate tools
640
- 6. **ALWAYS manage context** - Minimize data in context
641
-
642
- **Following these best practices ensures:**
643
- - ✅ Efficient data processing
644
- - ✅ Reliable operations
645
- - ✅ Professional quality results
646
- - ✅ Optimal resource usage
647
- - ✅ Reproducible research
1
+ # Best Practices for Data Analysis
2
+
3
+ Critical best practices for data analysis including upfront filtering, validation, error handling, and performance optimization.
4
+
5
+ ## Overview
6
+
7
+ This pattern defines mandatory best practices that MUST be followed for all data analysis tasks:
8
+ 1. Upfront Filtering (CRITICAL)
9
+ 2. Data Validation
10
+ 3. Error Handling & Retry Logic
11
+ 4. Rate Limiting
12
+ 5. Performance Optimization
13
+ 6. Context Window Management
14
+
15
+ ---
16
+
17
+ ## 1. Upfront Filtering (CRITICAL)
18
+
19
+ ### Core Rule
20
+
21
+ **RULE: Always filter data at the SOURCE, not after retrieval**
22
+
23
+ **Why:**
24
+ - Reduces data transfer
25
+ - Improves performance
26
+ - Conserves context window
27
+ - Follows database best practices
28
+ - Minimizes memory usage
29
+
30
+ ### Database Queries
31
+
32
+ ```python
33
+ # ✅ GOOD: Filter at database level
34
+ dbQuery(
35
+ "SELECT * FROM clinical_trials WHERE phase = :phase AND status = :status LIMIT 100",
36
+ {phase: "Phase 3", status: "Recruiting"}
37
+ )
38
+
39
+ # ❌ BAD: Retrieve all data then filter
40
+ dbQuery("SELECT * FROM clinical_trials")
41
+ # Then filter in Python - inefficient!
42
+ ```
43
+
44
+ **Best Practices:**
45
+ - Use WHERE clauses to filter rows
46
+ - Use LIMIT to cap result size
47
+ - Use indexed columns in WHERE clause
48
+ - Use named parameters (not string concatenation)
49
+ - Select only needed columns (avoid SELECT *)
50
+
51
+ ### Table Operations
52
+
53
+ ```python
54
+ # ✅ GOOD: Filter with table tools
55
+ tableFilterRows(
56
+ file_path="data/trials.xlsx",
57
+ column="Phase",
58
+ operator="=",
59
+ value="Phase 3",
60
+ max_results=100
61
+ )
62
+
63
+ # ❌ BAD: Load entire table then filter
64
+ tableGetRange(file_path="data/trials.xlsx", range="A1:Z10000")
65
+ # Then filter in Python - wastes memory!
66
+ ```
67
+
68
+ **Best Practices:**
69
+ - Use tableFilterRows for single-condition filters
70
+ - Use tableSearch to find specific values
71
+ - Use max_results to limit output
72
+ - Preview with tableGetSheetPreview before processing
73
+ - Check row count before deciding approach
74
+
75
+ ### BioMCP Queries
76
+
77
+ ```python
78
+ # ✅ GOOD: Targeted query with filters
79
+ biomcp_article_searcher(
80
+ genes=["BRAF", "NRAS"],
81
+ diseases=["melanoma"],
82
+ keywords=["treatment resistance"],
83
+ variants=["V600E"],
84
+ page_size=50
85
+ )
86
+
87
+ # ❌ BAD: Broad query then manual filtering
88
+ biomcp_search(query="BRAF")
89
+ # Then manually filter through thousands of results!
90
+ ```
91
+
92
+ **Best Practices:**
93
+ - Use specific domain filters (genes, diseases, variants)
94
+ - Combine multiple filters with AND logic
95
+ - Use page_size to limit results per page
96
+ - Use biomcp_search only for cross-domain queries
97
+ - Specify exact criteria upfront
98
+
99
+ ### Web Searches
100
+
101
+ ```python
102
+ # ✅ GOOD: Specific search query
103
+ web-search-prime_web_search_prime(
104
+ search_query="BRAF V600E melanoma FDA approval 2024",
105
+ search_recency_filter="oneYear"
106
+ )
107
+
108
+ # ❌ BAD: Broad search then filter results
109
+ web-search-prime_web_search_prime(search_query="BRAF")
110
+ # Then manually evaluate hundreds of results!
111
+ ```
112
+
113
+ **Best Practices:**
114
+ - Use specific search terms
115
+ - Use recency filters when time-sensitive
116
+ - Use domain filters for trusted sources
117
+ - Limit results to manageable number
118
+ - Verify source quality before using
119
+
120
+ ---
121
+
122
+ ## 2. Data Validation
123
+
124
+ ### Validation Pattern
125
+
126
+ ```
127
+ VALIDATION WORKFLOW:
128
+ 1. Check data existence (not empty/null)
129
+ 2. Validate structure (required fields present)
130
+ 3. Validate types (correct data types)
131
+ 4. Validate values (within expected ranges)
132
+ 5. Validate quality (no duplicates, no corruption)
133
+ ```
134
+
135
+ ### Example: Comprehensive Validation
136
+
137
+ ```python
138
+ def validate_clinical_trials(trials: List[Dict]) -> List[Dict]:
139
+ """Validate clinical trial data with comprehensive checks.
140
+
141
+ Args:
142
+ trials: List of trial dictionaries
143
+
144
+ Returns:
145
+ Validated trials
146
+
147
+ Raises:
148
+ ValueError: If validation fails
149
+ """
150
+ # 1. Check existence
151
+ if not trials:
152
+ raise ValueError("No trial data provided")
153
+
154
+ # 2. Define required structure
155
+ required_fields = {
156
+ "nct_id": str,
157
+ "phase": str,
158
+ "status": str,
159
+ "condition": str,
160
+ "response_rate": (int, float),
161
+ "patient_count": int
162
+ }
163
+
164
+ valid_trials = []
165
+ errors = []
166
+
167
+ for i, trial in enumerate(trials):
168
+ try:
169
+ # 3. Validate structure
170
+ for field, expected_type in required_fields.items():
171
+ if field not in trial:
172
+ raise ValueError(f"Missing field: {field}")
173
+
174
+ # 4. Validate types
175
+ if not isinstance(trial[field], expected_type):
176
+ raise ValueError(
177
+ f"Field {field} has wrong type: "
178
+ f"expected {expected_type}, got {type(trial[field])}"
179
+ )
180
+
181
+ # 5. Validate values
182
+ if not trial["nct_id"].startswith("NCT"):
183
+ raise ValueError(f"Invalid NCT ID format: {trial['nct_id']}")
184
+
185
+ if trial["phase"] not in ["Phase 1", "Phase 2", "Phase 3", "Phase 4"]:
186
+ raise ValueError(f"Invalid phase: {trial['phase']}")
187
+
188
+ if not 0 <= trial["response_rate"] <= 100:
189
+ raise ValueError(f"Response rate out of range: {trial['response_rate']}")
190
+
191
+ if trial["patient_count"] < 0:
192
+ raise ValueError(f"Negative patient count: {trial['patient_count']}")
193
+
194
+ valid_trials.append(trial)
195
+
196
+ except ValueError as e:
197
+ errors.append(f"Trial {i}: {e}")
198
+
199
+ # 6. Report validation results
200
+ if errors:
201
+ print(f"Validation warnings: {len(errors)} trials had issues")
202
+ for error in errors[:5]: # Show first 5
203
+ print(f" - {error}")
204
+ if len(errors) > 5:
205
+ print(f" ... and {len(errors) - 5} more")
206
+
207
+ print(f"Validated {len(valid_trials)}/{len(trials)} trials")
208
+
209
+ return valid_trials
210
+ ```
211
+
212
+ ### JSON Validation
213
+
214
+ ```python
215
+ # Using jsonValidate for structured data
216
+ from json_tools import jsonExtract, jsonValidate
217
+
218
+ # Define expected schema
219
+ schema = {
220
+ "type": "object",
221
+ "required": ["nct_id", "phase", "status"],
222
+ "properties": {
223
+ "nct_id": {"type": "string", "pattern": "^NCT[0-9]{8}$"},
224
+ "phase": {"type": "string"},
225
+ "status": {"type": "string"},
226
+ "response_rate": {"type": "number", "minimum": 0, "maximum": 100}
227
+ }
228
+ }
229
+
230
+ # Extract and validate
231
+ result = jsonExtract(file_path="output.json")
232
+ if result.success:
233
+ validation = jsonValidate(data=result.data, schema=schema)
234
+ if not validation.valid:
235
+ print(f"Validation failed: {validation.errors}")
236
+ else:
237
+ print("Data validated successfully")
238
+ ```
239
+
240
+ ---
241
+
242
+ ## 3. Error Handling & Retry Logic
243
+
244
+ ### Retry Pattern (from retry.md)
245
+
246
+ ```python
247
+ def fetch_with_retry(
248
+ operation,
249
+ max_attempts: int = 3,
250
+ initial_delay: float = 2.0,
251
+ backoff_factor: float = 2.0
252
+ ):
253
+ """Execute operation with exponential backoff retry.
254
+
255
+ Args:
256
+ operation: Function to execute
257
+ max_attempts: Maximum retry attempts
258
+ initial_delay: Initial delay in seconds
259
+ backoff_factor: Multiplier for delay after each failure
260
+
261
+ Returns:
262
+ Operation result
263
+
264
+ Raises:
265
+ Exception: If all attempts fail
266
+ """
267
+ delay = initial_delay
268
+
269
+ for attempt in range(max_attempts):
270
+ try:
271
+ result = operation()
272
+ return result
273
+
274
+ except Exception as e:
275
+ if attempt < max_attempts - 1:
276
+ print(f"Attempt {attempt + 1} failed: {e}")
277
+ print(f"Retrying in {delay} seconds...")
278
+ blockingTimer(delay)
279
+ delay = delay * backoff_factor # Exponential backoff
280
+ else:
281
+ print(f"All {max_attempts} attempts failed")
282
+ raise
283
+ ```
284
+
285
+ ### Apply Retry to External Operations
286
+
287
+ ```python
288
+ # BioMCP queries
289
+ def fetch_articles_with_retry(genes, diseases):
290
+ """Fetch articles with retry logic."""
291
+ return fetch_with_retry(
292
+ lambda: biomcp_article_searcher(
293
+ genes=genes,
294
+ diseases=diseases,
295
+ page_size=50
296
+ )
297
+ )
298
+
299
+ # Database queries
300
+ def query_database_with_retry(sql, params):
301
+ """Query database with retry logic."""
302
+ return fetch_with_retry(
303
+ lambda: dbQuery(sql, params)
304
+ )
305
+
306
+ # Web requests
307
+ def fetch_web_with_retry(url):
308
+ """Fetch web content with retry logic."""
309
+ return fetch_with_retry(
310
+ lambda: webfetch(url)
311
+ )
312
+ ```
313
+
314
+ ### Error Handling Best Practices
315
+
316
+ ```python
317
+ # ✅ GOOD: Comprehensive error handling
318
+ def process_trials(file_path: str) -> Dict:
319
+ """Process trials with comprehensive error handling."""
320
+ try:
321
+ # Load data
322
+ df = load_data(file_path)
323
+
324
+ # Validate
325
+ if df.empty:
326
+ raise ValueError("File contains no data")
327
+
328
+ # Process
329
+ results = analyze_trials(df)
330
+
331
+ return results
332
+
333
+ except FileNotFoundError:
334
+ raise FileNotFoundError(f"File not found: {file_path}")
335
+
336
+ except pd.errors.EmptyDataError:
337
+ raise ValueError(f"File is empty or corrupt: {file_path}")
338
+
339
+ except Exception as e:
340
+ raise RuntimeError(f"Processing failed: {e}")
341
+
342
+ # ❌ BAD: No error handling
343
+ def process_trials(file_path: str) -> Dict:
344
+ """Process trials without error handling."""
345
+ df = load_data(file_path) # May fail silently
346
+ results = analyze_trials(df) # May fail silently
347
+ return results
348
+ ```
349
+
350
+ ---
351
+
352
+ ## 4. Rate Limiting
353
+
354
+ ### BioMCP Rate Limiting (MANDATORY)
355
+
356
+ ```python
357
+ # ✅ GOOD: Sequential calls with rate limiting
358
+ articles = biomcp_article_searcher(genes=["BRAF"], page_size=50)
359
+ blockingTimer(0.3) # 300ms delay
360
+
361
+ for article in articles[:10]:
362
+ details = biomcp_article_getter(pmid=article["pmid"])
363
+ blockingTimer(0.3) # 300ms between each call
364
+ # Process details
365
+
366
+ # ❌ BAD: Concurrent calls without rate limiting
367
+ # This will cause API throttling!
368
+ results = []
369
+ for pmid in pmids:
370
+ results.append(biomcp_article_getter(pmid)) # No delay!
371
+ ```
372
+
373
+ ### Web Rate Limiting
374
+
375
+ ```python
376
+ # ✅ GOOD: Conservative rate limiting for web
377
+ for url in urls:
378
+ content = webfetch(url)
379
+ blockingTimer(0.5) # 500ms delay for web requests
380
+ # Process content
381
+
382
+ # ❌ BAD: No rate limiting
383
+ for url in urls:
384
+ content = webfetch(url) # May get blocked!
385
+ ```
386
+
387
+ ### Rate Limiting Guidelines
388
+
389
+ | Tool Category | Delay | Rationale |
390
+ |--------------|-------|-----------|
391
+ | BioMCP tools | 0.3s | API rate limits |
392
+ | Web tools | 0.5s | Server courtesy |
393
+ | Database | None | Local/network, no limit |
394
+ | File operations | None | Local filesystem, no limit |
395
+ | Parser tools | None | Local processing, no limit |
396
+
397
+ ---
398
+
399
+ ## 5. Performance Optimization
400
+
401
+ ### Use Appropriate Tools for Data Size
402
+
403
+ ```python
404
+ # Small datasets (< 30 rows): Use table tools
405
+ if row_count < 30:
406
+ tableFilterRows(file_path, column="Status", operator="=", value="Active")
407
+ tableGroupBy(file_path, group_column="Phase", agg_column="Count", agg_type="count")
408
+
409
+ # Medium datasets (30-1000 rows): Use long-table-summary skill
410
+ elif row_count < 1000:
411
+ skill long-table-summary
412
+ # Follow 16-step workflow
413
+
414
+ # Large datasets (> 1000 rows): Use Python
415
+ else:
416
+ uv run python .scripts/py/large_analysis.py --input data.xlsx
417
+ ```
418
+
419
+ ### Batch Processing
420
+
421
+ ```python
422
+ # ✅ GOOD: Process in batches
423
+ batch_size = 100
424
+ total_items = len(data)
425
+
426
+ for i in range(0, total_items, batch_size):
427
+ batch = data[i:i+batch_size]
428
+ results = process_batch(batch)
429
+
430
+ # Report progress
431
+ completed = min(i + batch_size, total_items)
432
+ percent = (completed / total_items) * 100
433
+ print(f"Progress: {completed}/{total_items} ({percent:.1f}%)")
434
+
435
+ # ❌ BAD: Process all at once (may overload memory)
436
+ results = process_all(data) # Memory issue with large datasets!
437
+ ```

### Caching Results

```python
# Cache expensive operations
import hashlib
import json
import os
from typing import Any, Dict

def get_cache_key(params: Dict) -> str:
    """Generate a deterministic cache key from parameters."""
    return hashlib.md5(json.dumps(params, sort_keys=True).encode()).hexdigest()

def cached_query(sql: str, params: Dict, cache_dir: str = ".cache") -> Any:
    """Execute a query with file-based caching."""
    cache_key = get_cache_key({"sql": sql, "params": params})
    cache_file = f"{cache_dir}/{cache_key}.json"

    # Check cache
    if os.path.exists(cache_file):
        with open(cache_file, 'r') as f:
            return json.load(f)

    # Execute query
    result = dbQuery(sql, params)

    # Save to cache
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_file, 'w') as f:
        json.dump(result, f)

    return result
```

---

## 6. Context Window Management

### Minimize Data in Context

```python
# ✅ GOOD: Summarize large datasets
if len(results) > 100:
    summary = {
        "total": len(results),
        "sample": results[:5],  # First 5 items
        "statistics": calculate_stats(results)
    }
    # Return summary, not full dataset

# ❌ BAD: Load entire dataset into context
return results  # 1000+ items in context!
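That pattern can be wrapped in a helper that falls back to the raw list for small inputs. A sketch with illustrative names and defaults (the `calculate_stats` call from the example above is omitted so the helper stays self-contained):

```python
def summarize(results, sample_size=5, threshold=100):
    """Return small result sets as-is; otherwise a compact summary dict."""
    if len(results) <= threshold:
        return results
    return {
        "total": len(results),
        "sample": results[:sample_size],  # first few items only
    }
```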

### Use File-Based Data Exchange

```python
# ✅ GOOD: Write to file, pass file path
output_file = ".work/results.json"
with open(output_file, 'w') as f:
    json.dump(results, f)

return f"Results saved to {output_file}"

# ❌ BAD: Return large data structure
return results  # Consumes context window!
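A runnable version of the same idea, writing to a caller-supplied directory (the function name and message format are illustrative, not a fixed convention):

```python
import json
import os

def save_results(results, out_dir):
    """Serialize results to disk and return only the file path message."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "results.json")
    with open(path, "w") as f:
        json.dump(results, f)
    return f"Results saved to {path}"
```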

### Pagination for Large Result Sets

```python
# ✅ GOOD: Paginated retrieval
page_size = 50
all_results = []

for page in range(1, max_pages + 1):
    results = biomcp_article_searcher(
        genes=["BRAF"],
        page=page,
        page_size=page_size
    )
    blockingTimer(0.3)

    all_results.extend(results)

    if len(results) < page_size:
        break  # No more results

# ❌ BAD: Try to get all at once
all_results = biomcp_article_searcher(genes=["BRAF"], page_size=10000)
# May hit API limits!
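The loop above can be factored into a generic accumulator that stops on a short page. `fetch_all` is an illustrative sketch that takes any page-fetching callable (for example, a wrapper around `biomcp_article_searcher` with the rate-limit delay applied):

```python
def fetch_all(fetch_page, page_size=50, max_pages=100):
    """Accumulate paginated results until a short page signals the end."""
    all_results = []
    for page in range(1, max_pages + 1):
        results = fetch_page(page=page, page_size=page_size)
        all_results.extend(results)
        if len(results) < page_size:
            break  # no more results
    return all_results
```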

---

## 7. Common Anti-Patterns to Avoid

### Anti-Pattern 1: Premature Optimization

```python
# ❌ BAD: Optimize before measuring
def process_data(data):
    # Complex optimization for small dataset
    return optimized_result

# ✅ GOOD: Measure first, optimize if needed
def process_data(data):
    if len(data) < 100:
        # Simple approach is fine
        return simple_process(data)
    else:
        # Optimize for large dataset
        return optimized_process(data)
```

### Anti-Pattern 2: Over-Engineering

```python
# ❌ BAD: Complex solution for simple problem
class TrialProcessor:
    def __init__(self, config):
        self.config = config
        self.validator = Validator()
        self.analyzer = Analyzer()
        # ... many more components

processor = TrialProcessor(config)
result = processor.process(trial)

# ✅ GOOD: Simple solution for simple problem
result = analyze_trial(trial)
```

### Anti-Pattern 3: Not Using Existing Tools

```python
# ❌ BAD: Write custom Python when tools exist
df = pd.read_excel("data.xlsx")
filtered = df[df["Phase"] == "Phase 3"]
stats = filtered.groupby("Condition").size()

# ✅ GOOD: Use existing tools
tableFilterRows(file_path, column="Phase", operator="=", value="Phase 3")
tableGroupBy(file_path, group_column="Condition", agg_column="Count", agg_type="count")
```

---

## 8. Best Practices Checklist

Before completing any data analysis task, verify:

### Upfront Filtering
- [ ] Data filtered at source (database WHERE, table filters, API parameters)
- [ ] No bulk retrieval then filtering in Python
- [ ] Result size limited appropriately
- [ ] Only necessary columns/fields retrieved

### Validation
- [ ] Data existence checked (not empty)
- [ ] Structure validated (required fields present)
- [ ] Types validated (correct data types)
- [ ] Values validated (within expected ranges)
- [ ] Quality validated (no duplicates, no corruption)

### Error Handling
- [ ] Try-except for external operations
- [ ] Retry logic for network/API calls
- [ ] Informative error messages
- [ ] Graceful degradation when possible
- [ ] Error logging for debugging

### Rate Limiting
- [ ] BioMCP: 0.3s delay between calls
- [ ] Web: 0.5s delay between requests
- [ ] Sequential (not concurrent) API calls
- [ ] Respect API rate limits

### Performance
- [ ] Appropriate tool selected for data size
- [ ] Batch processing for large datasets
- [ ] Caching for expensive operations
- [ ] Progress reporting for long operations

### Context Management
- [ ] Large datasets summarized, not fully loaded
- [ ] File-based data exchange for subagents
- [ ] Pagination for large result sets
- [ ] Only necessary data in context

---

## Summary

**Critical Rules:**

1. **ALWAYS filter upfront** - Never retrieve then filter
2. **ALWAYS validate data** - Check structure, types, values
3. **ALWAYS handle errors** - Retry with backoff
4. **ALWAYS rate limit** - Respect API limits
5. **ALWAYS optimize for size** - Use appropriate tools
6. **ALWAYS manage context** - Minimize data in context

**Following these best practices ensures:**
- ✅ Efficient data processing
- ✅ Reliable operations
- ✅ Professional quality results
- ✅ Optimal resource usage
- ✅ Reproducible research